Search Engine Studio FAQ

Question ID:
Q2024
Question:
How can I exclude documents from being indexed (or: how do I prevent the indexer from indexing the website infinitely)?
 
In navigation mode, the indexer offers several filters that are applied during indexing to exclude documents (some of these filters are also available in file structure and FTP modes):
  • 'File types to be indexed': the file types you select determine which documents are indexed. To narrow this set further, clear all the boxes and enter the exact filters in the 'custom' field. For example, 'a*.html;b*.html' indexes only files that have the html extension and whose names start with the letter 'a' or 'b'.
  • 'Advanced options / Excluded file/URL filters': URLs entered in this list are excluded even if they match the filters to be indexed. For example, 'http://www.domain.com/dir/*' excludes every URL starting with 'http://www.domain.com/dir/', such as 'http://www.domain.com/dir/subdir/file.htm'.
  • 'Advanced options / Filter out documents found deeper in the hierarchy': this option tells the indexer not to follow links deeper than N levels, so that only URLs from the first N levels are indexed.
  • 'Advanced options / Ignored links': you can filter out URLs by specifying which links should be ignored. For example, the pattern *weather* ignores links containing the word 'weather', such as <a href='/weather/europe/'> or 'http://www.domain.com/dir/weather.htm'. The difference between this filter and 'Excluded file/URL filters' is that an ignored link is skipped completely: the indexer does not even open the page to look for further links. With excluded URLs, the indexer still visits matching pages in order to find links to other URLs, but does not index the pages themselves.
  • 'Advanced options / Excluded title filters': this option excludes documents whose title equals or contains a specific text. It is very useful for servers that do not return an error code for invalid URLs but instead display a custom 'not found' page.
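
The way these filters interact can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the indexer's actual code. The function names (matches_any, classify, title_excluded) are hypothetical, and the sketch assumes shell-style wildcards where '*' matches any run of characters, case-insensitive matching, and that a page rejected by the file type filter is still scanned for links.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse
import posixpath


def matches_any(text, patterns):
    """True if text matches at least one wildcard pattern (case-insensitive, assumed)."""
    return any(fnmatchcase(text.lower(), p.lower()) for p in patterns)


def classify(url, depth, file_types, excluded, ignored, max_depth):
    """Return (index_page, follow_links) for one URL found at the given link depth."""
    # Ignored links are skipped completely: not indexed, not scanned for links.
    if matches_any(url, ignored):
        return (False, False)
    # Pages deeper than the configured level are filtered out.
    if depth > max_depth:
        return (False, False)
    # Excluded URLs are still visited to collect links, but are not indexed.
    if matches_any(url, excluded):
        return (False, True)
    # The file type filter is applied to the file name part of the URL.
    filename = posixpath.basename(urlparse(url).path)
    return (matches_any(filename, file_types), True)


def title_excluded(title, title_filters):
    """True if the title equals or contains one of the filter texts."""
    return any(f.lower() in title.lower() for f in title_filters)


# Example settings taken from the filters described above.
settings = dict(
    file_types="a*.html;b*.html".split(";"),
    excluded=["http://www.domain.com/dir/*"],
    ignored=["*weather*"],
    max_depth=3,
)

if __name__ == "__main__":
    for url in (
        "http://www.domain.com/about.html",
        "http://www.domain.com/dir/subdir/file.htm",
        "http://www.domain.com/weather/europe/",
    ):
        print(url, classify(url, 1, **settings))
```

In this sketch the excluded URL returns (False, True), meaning "do not index, but follow links", while the ignored link returns (False, False). That is the distinction between 'Excluded file/URL filters' and 'Ignored links'.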