Coveo Knowledge Base – Information Article – 040407-1
CES040407-1: Including and Excluding Content
The information in this technical note applies to:
· Coveo Enterprise Search 3.5+
· Coveo Enterprise Search 4
This article describes the mechanisms used by Coveo Enterprise Search to include and exclude documents from the index.
Filters are the first mechanism used to determine whether a document should be retrieved for indexing. There are two kinds of filters: inclusion and exclusion filters. The address of each document is validated against the filters, and the document is indexed only if both of the following rules are respected:
1. The document address must not match any of the excluding filters.
2. The document address must match at least one including filter.
Coveo Enterprise Search supports two types of filters:
· Wildcards: Wildcards use asterisks (*) and question marks (?) as matching characters.
* (asterisk) matches zero or more alphanumeric characters. For example, inter* finds all words that begin with inter, such as internet or interstate.
? (question mark) matches one and only one alphanumeric character. For example, subject? finds all words that consist of subject followed by exactly one character, such as subjects or subject1.
· Regular expressions: A regular expression is a powerful but complex string matching mechanism. For a complete description of the syntax, go to http://www.boost.org/libs/regex/doc/syntax_perl.html
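The two rules and the wildcard characters described above can be sketched as follows. This is a minimal illustration, not the Coveo implementation: the names wildcard_to_regex and should_index are hypothetical, only wildcard filters are covered, and the sketch assumes that * and ? match any character (including the dots and slashes found in addresses) and that matching is case sensitive.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard filter into a compiled regular expression.

    Assumption: '*' matches zero or more characters of any kind and '?'
    matches exactly one character; every other character is literal.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def should_index(address, including, excluding):
    """Rule 1: no excluding filter matches. Rule 2: an including filter matches."""
    if any(wildcard_to_regex(f).fullmatch(address) for f in excluding):
        return False
    return any(wildcard_to_regex(f).fullmatch(address) for f in including)
```

Exclusion is checked first, so an address that matches both an including and an excluding filter is rejected.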
Web sources are crawled by: 1) extracting the links contained in each document, 2) validating these links and 3) feeding them back to the crawler. The extracted links may point to a different domain or site, in other words, an external site. For example, the pages of a corporate Web site may contain links to the following sites:
· http://www.coveo.com/
· http://download.coveo.com/
· http://support.coveo.com/
Changing the including filter to http://*.coveo.com/* allows Coveo Enterprise Search to index the documents contained on the three Web sites listed above.
External documents are documents whose addresses are not supported by the source filters. In the previous example, all documents outside http://*.coveo.com/* are external documents. When external crawling is enabled, the Web crawler is allowed to retrieve such documents for indexing (but only for one level of redirection).
Adding an exclusion filter affects the way that external documents are handled. Referring back to the previous example, suppose that documents contained in http://www.coveo.com/fr/* should not be indexed, so an exclusion filter is defined for that address. In such a case, a document matching this filter is not retrieved, even when external crawling is enabled.
For example, suppose that external crawling is enabled and that the document http://www.coveo.com/index.html contains the following links:
· http://www.coveo.com/fr/index.html
· http://www.download.com/
Although the document http://www.download.com/ is not supported by the current filters, it will be indexed as it is not specifically excluded and external crawling is enabled.
It is different for the document http://www.coveo.com/fr/index.html. This document is not supported and it will not be indexed because it is specifically excluded by the filter http://www.coveo.com/fr/*.
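The interaction between filters and external crawling described above can be summarized in a short sketch. The function name crawl_decision and its return labels are hypothetical, Python's fnmatch module stands in for the CES wildcard matching, and the one-level limit on external documents is not modeled.

```python
from fnmatch import fnmatchcase


def matches_any(address, filters):
    """Return True when the address matches at least one wildcard filter."""
    return any(fnmatchcase(address, f) for f in filters)


def crawl_decision(address, including, excluding, external_crawling):
    """Classify an extracted link as 'index', 'index-external' or 'skip'."""
    if matches_any(address, excluding):
        return "skip"  # an exclusion filter always wins
    if matches_any(address, including):
        return "index"  # supported by the source filters
    # Not excluded and not included: this is an external document.
    return "index-external" if external_crawling else "skip"
```

With the filters from the example, http://www.download.com/ is classified as an external document and retrieved only while external crawling is enabled, whereas http://www.coveo.com/fr/index.html is skipped in every case.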
Many Web sites contain addresses with query parts (dynamic attributes, for example an id). These attributes can be problematic since they can lead to several addresses that point to the same document, which results in an index containing duplicate documents. To resolve this problem, define an exclusion filter, such as http://*id=*, to eliminate all documents whose address contains an id part. For example, with this filter, the addresses marked (excluded) below are not indexed:
· http://www.coveo.com/product/id/index.htm
· http://www.coveo.com/forum.php?id=56 (excluded)
· http://www.coveo.com/forum.php?id=67&page=3 (excluded)
· http://www.coveo.com/help.php?productid=557 (excluded)
· http://www.coveo.com/help.php?productname=CES
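The effect of the http://*id=* filter on the addresses above can be checked with a few lines of code. This is only an approximation: Python's fnmatch module is used here as a stand-in for the CES wildcard matching, and the variable names are illustrative.

```python
from fnmatch import fnmatchcase

EXCLUDING_FILTER = "http://*id=*"

addresses = [
    "http://www.coveo.com/product/id/index.htm",
    "http://www.coveo.com/forum.php?id=56",
    "http://www.coveo.com/forum.php?id=67&page=3",
    "http://www.coveo.com/help.php?productid=557",
    "http://www.coveo.com/help.php?productname=CES",
]

# Keep only the addresses that the excluding filter matches.
excluded = [a for a in addresses if fnmatchcase(a, EXCLUDING_FILTER)]

for address in addresses:
    status = "excluded" if address in excluded else "kept"
    print(status, address)
```

Note that the filter matches any attribute whose name ends with id, such as productid, and not only an attribute named exactly id.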
The robots META tag is used only by the Web Crawler to determine which action will be taken for a given document and its links. The robots META tag contains directives that indicate whether a document is indexed and its links followed. Remember that filters are the first parameters that are validated in order to determine which action will be taken on a document. Consequently, if a filter prevents the crawling of a given page, the Robots META tag will not be used as the page will not be retrieved.
The directives contained in the Robots META tag are separated by commas. Currently, two types of directives are supported:
· INDEX/NOINDEX;
· FOLLOW/NOFOLLOW.
The INDEX directive specifies whether the page will be indexed and the FOLLOW directive specifies whether the page links will be followed. The default directives are INDEX and FOLLOW. The values ALL and NONE enable or disable all directives (for instance ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW).
Here are a few examples:
Index the page and follow the links:
<meta name="robots" content="all">
<meta name="robots" content="index,follow">
Only follow the page links:
<meta name="robots" content="noindex,follow">
Index the page but do not follow the links:
<meta name="robots" content="index,nofollow">
Do nothing (do not index the page and do not follow the links):
<meta name="robots" content="none">
<meta name="robots" content="noindex,nofollow">
Note: Both the "robots" name tag and the content are case insensitive.
You must not specify conflicting or repeating directives such as:
<meta name="robots" content="INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW">
A formal syntax for the Robots META tag content is:
content = all | none | directives
all = "ALL"
none = "NONE"
directives = directive ["," directives]
directive = index | follow
index = "INDEX" | "NOINDEX"
follow = "FOLLOW" | "NOFOLLOW"
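As an illustration of the syntax above, the following sketch parses the content attribute of a Robots META tag into two flags. The function name parse_robots_content is hypothetical. Raising an error on conflicting or repeated directives and ignoring unknown values are assumptions made for this example; this article does not state how Coveo Enterprise Search reacts in those cases.

```python
# Each supported value and the flags it sets.
DIRECTIVES = {
    "ALL": {"index": True, "follow": True},
    "NONE": {"index": False, "follow": False},
    "INDEX": {"index": True},
    "NOINDEX": {"index": False},
    "FOLLOW": {"follow": True},
    "NOFOLLOW": {"follow": False},
}


def parse_robots_content(content):
    """Return (index, follow) for a Robots META tag content string."""
    result = {}
    for raw in content.split(","):
        token = raw.strip().upper()  # the content is case insensitive
        if token not in DIRECTIVES:
            continue  # assumption: unknown or empty values are ignored
        for key, value in DIRECTIVES[token].items():
            if key in result:
                raise ValueError("conflicting or repeated directive: " + token)
            result[key] = value
    # The default directives are INDEX and FOLLOW.
    return result.get("index", True), result.get("follow", True)
```

For instance, parse_robots_content("noindex,follow") yields an index flag of False and a follow flag of True, which corresponds to the "only follow the page links" case.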
Finally, document types can determine whether a document is indexed. You can simply delete a specific document type, and the documents associated with it will not be indexed at all. The document types also introduce the concept of file information. When the Index file information only option is set for a document type, the content of a matching document is not indexed, but its address, and its associated caption in the case of a Web source, is indexed. This enables, for instance, unsupported file types to be indexed.
There are three default document types that handle special cases. They are always present and cannot be turned off. They are the Exchange Items, No Extension and Other document types.
Exchange Items
The Exchange Items document type is used when a document from Microsoft Exchange, such as an e-mail message, is encountered. Message attachments do not fall into this category; they are indexed as normal files. The Exchange Items document type applies to Exchange sources only. Consequently, it is not possible to skip Exchange documents, as skipping them would lead to an empty Exchange Server source.
No Extension
The No Extension document type is used when a document doesn’t have an extension. For example, http://www.coveo.com/ has no extension.
Other
The Other document type is used when a document of an unknown type is encountered, that is, a document that does not match any of the existing document types. The default action is to skip indexing for unknown types of documents.
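The fallback between regular document types, No Extension and Other can be pictured with the sketch below. It assumes that the document type is selected from the extension found in the address, which this article implies but does not state outright; the function name document_type and the known_types mapping are hypothetical, and the Exchange Items case is left out.

```python
import posixpath
from urllib.parse import urlparse


def document_type(address, known_types):
    """Pick a document type name from the extension of an address.

    known_types maps a lowercase extension (with its dot) to a type name.
    """
    path = urlparse(address).path
    extension = posixpath.splitext(path)[1].lower()
    if not extension:
        return "No Extension"
    # Unknown extensions fall back to the Other document type.
    return known_types.get(extension, "Other")
```

With a mapping such as {".html": "HTML"}, the address http://www.coveo.com/ resolves to No Extension, while an address ending in an unlisted extension resolves to Other.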
For File and Exchange sources, the content manager may index subfolders. This option tells the crawler to index the content of the subfolders of the specified Folder Path. For example, the "file" source will index documents contained in file:///c:/ and all its subfolders, including the subfolders of its subfolders, and so on (see image above). To index the documents found directly under file:///c:/ only, clear the Index subfolders option.
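The difference between the two settings of the Index subfolders option can be sketched for a file source as follows. The function name files_to_index is hypothetical and the code only lists files; it makes no claim about how the File Crawler itself walks a folder.

```python
import os


def files_to_index(folder, index_subfolders=True):
    """List the files under a folder, optionally without its subfolders."""
    found = []
    for current, subfolders, files in os.walk(folder):
        found.extend(os.path.join(current, name) for name in files)
        if not index_subfolders:
            # Emptying the list in place stops os.walk from descending.
            subfolders.clear()
    return sorted(found)
```

When index_subfolders is False, only the files located directly under the given folder are returned.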
The Web source option Restrict crawling to xxx levels enables the content manager to limit the number of documents that will be indexed for a given source. The depth of a document is determined by calculating the number of links that must be followed to go from the starting address to the document. The following schema presents an example of crawling depth for a site:
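The depth calculation described above can also be expressed as a breadth-first traversal of the link graph. This is a minimal sketch under stated assumptions: the function name crawl_depths and the links mapping are hypothetical, and the starting address is counted as level 0, which may differ from the way the product numbers its levels.

```python
from collections import deque


def crawl_depths(start, links, max_depth):
    """Return the depth of every page reachable within max_depth links.

    links maps a page address to the list of addresses it links to.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] == max_depth:
            continue  # do not follow links beyond the allowed depth
        for target in links.get(page, []):
            if target not in depths:  # the first visit gives the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

Because the traversal is breadth-first, a page reachable through several paths is assigned the smallest number of links needed to reach it.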

To index a site, the Web Crawler parses each downloaded HTML document and extracts the links that it contains. If the site navigation relies on JavaScript, the crawler might not be able to extract the links included in this site's documents, making the indexing of such a site incomplete. It is possible to bypass this limitation by adding other starting addresses that contain non-JavaScript links. Best practices include adding the site map address, if such a map exists, since it generally contains links to all parts of the site.
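The limitation described above can be reproduced with a simple link extractor. The sketch below uses Python's standard html.parser module and is not the Web Crawler's own parser; the class name LinkExtractor and the function extract_links are illustrative. It collects the href attribute of anchor tags only, so navigation driven by a script event handler goes unnoticed.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


page = (
    '<a href="/products.html">Products</a>'
    '<span onclick="window.location=\'/hidden.html\'">Hidden</span>'
)
print(extract_links(page))
```

Only /products.html is found in this page; /hidden.html would be reached only if it were listed on another starting address, such as a site map.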
The File Crawler is responsible for retrieving the documents from local and network file sources and adding them into the index. Hidden files are, by nature, hidden from the user. As a result the Crawler does not retrieve them, even when the file type is supported.
However, you can enable the indexing of hidden documents for a specific source through an advanced option, which is not available in the Administration Tool. To enable indexing of hidden files:
1. Stop the Coveo Enterprise Search service.
2. From your Index location, open the config.txt file.
3. Add the following XML tag to the source section for which the hidden files must be indexed:
<IndexHiddenFiles value="true"/>
4. Refresh the source to ensure proper synchronization with the index.
Last Reviewed: 2006/03/30
Keywords: Source filters, Robots, Crawling, Limitations