Coveo Knowledge Base – Information Article - 060330-3

CES4-060330-3: Understanding Stemming

The information in this article applies to:

Coveo Enterprise Search 4

Summary

This article explains stemming, or similar words matching, and indicates which settings should be chosen according to the index's document properties. The article comprises the following sections:

·          What is stemming?

·          What is multilingual stemming?

·          Benefits of stemming

·          Known limitations

·          Settings

·          Frequently asked questions about stemming

What Is Stemming?

Linguistically, words follow morphological rules that allow a speaker to derive variants of the same idea to evoke an action (verb), an object or concept (noun), or a property of something (adjective). For instance, the following words are derived from the same stem and share an abstract meaning of action and movement:

                        activate            activating         active               activenesses

                        activated          activation         actively            actives

                        activates          activations       activeness        etc.

Stemming performs the reverse process: it deduces the stem from a fully suffixed word according to the morphological rules of the language. These rules concern morphological and inflectional suffixes. The former usually change the lexical category of a word, whereas the latter indicate plural and gender (in gendered languages such as French, Spanish and German):

            Morphological suffix  :        activate      (verb)   →    activation   (noun)

            Inflectional suffix   :        activation    (noun)   →    activations   (plural noun)

Since words derived from a single root usually share a common meaning, stemming allows Coveo Enterprise Search to group words that share the same stem into semantically related sets.
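The grouping described above can be illustrated with a minimal Python sketch. The suffix list and the names toy_stem and group_by_stem are assumptions made for this illustration; they are not the rules or functions used by Coveo Enterprise Search, and a real stemmer applies far more elaborate rules.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; these are not the rules
# used by Coveo Enterprise Search.
SUFFIXES = sorted(
    ["enesses", "eness", "ations", "ation", "ating", "ated", "ates",
     "ate", "ely", "es", "e"],
    key=len,
    reverse=True,  # try the longest suffix first
)

def toy_stem(word):
    """Strip the longest matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_stem(words):
    """Map each stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)

words = ["activate", "activated", "activates", "activating", "activation",
         "activations", "active", "actively", "activeness", "activenesses",
         "actives"]
groups = group_by_stem(words)
print(groups)  # a single group keyed by the stem "activ"
```

With this toy rule set, all eleven variants listed earlier fall into one group, which is the semantically related set that a query for any one of them would reach.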

What Is Multilingual Stemming?

Multilingual Stemming is similar to regular stemming but uses morphological rules of several languages at the same time instead of rules relative to only one language. Coveo Enterprise Search uses English, French, Spanish and German morphological rules simultaneously.

When rules of more than one language apply to a word, extra procedures are used to determine the right rule. Although these procedures cover the majority of possible conflicts, there are cases that cannot be solved. As a result, multilingual stemming is slightly less precise than monolingual stemming for a given language.
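A cross-language conflict of this kind can be sketched as follows. The rule tables and the names candidate_stems and has_conflict are hypothetical; the sketch only detects a conflict and does not attempt to reproduce the procedures that Coveo Enterprise Search uses to resolve one, which this article does not describe.

```python
# Hypothetical suffix rules per language, longest suffix first.
# They are illustrative only and much smaller than any real rule set.
RULES = {
    "en": ("ing", "ed", "s"),
    "fr": ("ions", "ait", "es", "s"),
}

def candidate_stems(word):
    """Return {language: stem} for every language with a matching rule."""
    candidates = {}
    for language, suffixes in RULES.items():
        for suffix in suffixes:
            # Keep at least three letters in the stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                candidates[language] = word[: -len(suffix)]
                break
    return candidates

def has_conflict(word):
    """True when the languages propose different stems for the same word."""
    return len(set(candidate_stems(word).values())) > 1

print(candidate_stems("portions"))  # {'en': 'portion', 'fr': 'port'}
print(has_conflict("portions"))     # True
print(has_conflict("reports"))      # False
```

Here the English plural rule and the French verb ending "-ions" both apply to "portions" and yield different stems, so a multilingual stemmer must pick one; a monolingual stemmer never faces that choice.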

Benefits of Stemming

Takes care of morphological variants. When an index is stemmed, users do not need to include several forms of the same word in a query (for example, "sale OR sales"). Stemming handles these variations so that the queries "sale" and "sales" return the same results.

Reduces index size. Indexed terms are stored in the lexicon. A stemmed index means that linguistic roots are stored in the lexicon rather than whole terms. Each entry in the lexicon contains references to the documents in which it appears. When stemming is used, several terms are conflated into a single entry and their document references are merged, which significantly reduces the size of the index.
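The effect on the lexicon can be sketched with a small inverted index. This is an illustration under stated assumptions: toy_stem is a hypothetical plural-only rule, and build_lexicon is a made-up helper, not the storage format of Coveo Enterprise Search.

```python
from collections import defaultdict

def toy_stem(term):
    """Hypothetical rule: drop a final "s" from words longer than four letters."""
    if len(term) > 4 and term.endswith("s"):
        return term[:-1]
    return term

def build_lexicon(docs, normalize=lambda term: term):
    """Map each (normalized) term to the set of documents containing it."""
    lexicon = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            lexicon[normalize(term)].add(doc_id)
    return dict(lexicon)

docs = {1: "sale report", 2: "sales reports", 3: "quarterly sales"}
plain = build_lexicon(docs)              # whole terms
stemmed = build_lexicon(docs, toy_stem)  # stems

print(len(plain), len(stemmed))  # 5 3
print(stemmed["sale"])           # {1, 2, 3}
```

The unstemmed lexicon needs five entries, while the stemmed one needs three, and the entry for "sale" now carries the merged references of "sale" and "sales".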

Known Limitations

It is important to keep in mind that stemming limitations tend to disappear with queries including more than one term.

Occasional increased recall. Since more documents are likely to be retrieved by a single query, the desired documents may be returned with a large number of other related documents. If a query returns too many documents, narrow it down by adding more words to the query.

Exceptions to stemming rules. Although stemming has been used since the early days of computer science, stemming algorithms still suffer from the same limitation: the conflation of words that are not semantically related, such as "university" and "universal". This means that a query containing the term "universal" may retrieve a document that contains "university". Such exceptions are rare, and handling them would be expensive in terms of memory and speed. It is therefore necessary to understand the stemming mechanisms in order to recognize and work around these situations when they occur, because both the "Optimized for English" and "Multilingual Stemming" options of Coveo Enterprise Search suffer from this limitation.

Impact on advanced syntax and exact match. Advanced syntax and exact match queries work the same way, but the terms in such queries are stemmed. With advanced syntax, this can lead to unexpected results if different terms in the query share the same stem. For instance, "consumers NOT consuming" is stemmed to "consum- NOT consum-" and returns no results. Exact match queries can also occasionally retrieve unexpected results because Coveo Enterprise Search looks for exact matches of stems. For instance, the exact match string "sales reports" is matched as "sal- report-".
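The NOT case can be reproduced with a few lines of Python. The stemming rule and the helper names postings and not_query are assumptions chosen for this sketch; they do not describe the query engine of Coveo Enterprise Search.

```python
def toy_stem(term):
    """Hypothetical rule: strip one suffix from words longer than four letters."""
    if len(term) > 4:
        for suffix in ("ers", "ing", "er", "e", "s"):
            if term.endswith(suffix):
                return term[: -len(suffix)]
    return term

def postings(docs, term, normalize):
    """Return the ids of documents containing the normalized term."""
    key = normalize(term)
    return {doc_id for doc_id, text in docs.items()
            if key in {normalize(t) for t in text.lower().split()}}

def not_query(docs, wanted, unwanted, normalize=lambda term: term):
    """Evaluate 'wanted NOT unwanted' as a set difference of postings."""
    return postings(docs, wanted, normalize) - postings(docs, unwanted, normalize)

docs = {
    1: "consumers like discounts",
    2: "consuming energy",
    3: "consumers are consuming more",
}

print(not_query(docs, "consumers", "consuming"))            # {1}
print(not_query(docs, "consumers", "consuming", toy_stem))  # set()
```

Without stemming, document 1 is returned. With stemming, both terms reduce to "consum", so the same posting list is subtracted from itself and nothing remains.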

Accented characters are not supported. Purely for performance reasons, Coveo Enterprise Search removes accents from characters before stemming. Consequently, some stems lose the distinction carried by accented characters in languages such as French, Spanish and German, although this very rarely leads to confusion between stems.
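Accent removal of this kind is commonly done with Unicode decomposition. The following sketch uses Python's standard unicodedata module; it shows one usual technique and is not a description of how Coveo Enterprise Search implements the step. The name strip_accents is an assumption.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("marché"))  # marche
print(strip_accents("schön"))   # schon
print(strip_accents("año"))     # ano
```

The French "marché" (market) becomes indistinguishable from "marche" (walk), and the German "schön" (beautiful) from "schon" (already), which is the loss of distinction mentioned above.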

Short words are not stemmed. Words of four or fewer letters are not stemmed. Short words tend to be more prone to stemming errors, especially in English, where recognizing short and long syllables requires special handling. Since this concerns only a small number of terms, short words are left unstemmed to favor performance. Therefore, morphological variations of short words are not implicitly included in a query.
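The length rule amounts to a guard placed in front of the stemming rules. In this sketch, MIN_STEM_LENGTH, stem_with_length_guard and drop_plural_s are hypothetical names, and the plural rule is a stand-in for whatever rules the stemmer actually applies.

```python
MIN_STEM_LENGTH = 5  # words of four or fewer letters are left untouched

def drop_plural_s(word):
    """Hypothetical stand-in for a real stemming rule."""
    return word[:-1] if word.endswith("s") else word

def stem_with_length_guard(word, stem_rule):
    """Apply stem_rule only to words of at least MIN_STEM_LENGTH letters."""
    if len(word) < MIN_STEM_LENGTH:
        return word
    return stem_rule(word)

print(stem_with_length_guard("cars", drop_plural_s))     # cars
print(stem_with_length_guard("reports", drop_plural_s))  # report
```

Because "cars" is kept whole, a query for "car" does not implicitly match it, whereas "report" and "reports" share a stem.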


Settings

Coveo Enterprise Search offers three different settings for stemming. Stemming options are presented to the administrator at index creation. To change stemming settings, the index must be completely recreated and information previously stored in the index will be lost (see Administration Tool help article "Modify Core Settings"). When the interface language is English, stemming is set to "Optimized for English" by default, whereas "Multilingual Stemming" is the default setting for a French interface.

When should stemming be considered?

Stemming is intended to increase the number of relevant documents retrieved. Therefore, it is ideal for small to medium size indexes.

When should stemming be avoided?

If precision is a major concern, it is better not to stem. In doing so, retrieval will proceed by exact string matching. Queries will only return documents that contain the exact words of the query. No morphological variations of the query terms will be taken into account. Therefore, it relies on users to formulate queries that include morphological variants (e.g. "sale OR sales").

If the index is so large (in number of documents) that the recall is too large and precision too low, stemming is probably not required. Indeed, when the index is large, too many documents may be retrieved. The more documents in the index, the more morphological variants will be retrieved.

If most of the documents contained in the index are written in languages other than English, French, Spanish and German, it is better to avoid stemming. When stemming is disabled, Coveo Enterprise Search indexes whole terms and is therefore fully functional with unrecognized languages. In addition, even though an excerpt will appear underneath each result, document summaries and concepts will not be available since summarization technologies are intended for the four languages previously mentioned.

"Multilingual Stemming" or "Stemming Optimized for English"?

When most documents are in English, we recommend that you choose "Stemming Optimized for English". In this case, stemming rules only correspond to the morphology of English and they do not have to handle possible conflicts with other languages' morphological rules. "Stemming Optimized for English" is more precise for English than "Multilingual Stemming".

On the other hand, when documents are written in more than one language (English, French, Spanish or German), "Multilingual Stemming" is more appropriate. In that case, stemming rules for the four languages are used. Special heuristics minimize the errors that are due to cross-language conflicts between rules. "Multilingual Stemming" is slightly less precise for English than "Stemming Optimized for English". However, it is far better for French, Spanish and German.

What about "Light Stemming"?

Light Stemming creates stems by reducing only the plural and gender forms of words. Light Stemming applies to English or French documents; it can have unexpected results if used with other languages.
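The difference from full stemming can be sketched as follows. The rules in light_stem are hypothetical simplifications of plural and gender reduction for English and French, and the four-letter guard is borrowed from the short-word limitation described earlier; neither is taken from the actual Light Stemming implementation.

```python
def light_stem(word, language="en"):
    """Reduce only plural (and, for French, gender) endings. Illustrative only."""
    word = word.lower()
    if len(word) <= 4:  # assumption: short words are left untouched
        return word
    if language == "fr":
        # Hypothetical French rules: feminine plural, plural, then feminine.
        for suffix in ("es", "s", "e"):
            if word.endswith(suffix):
                return word[: -len(suffix)]
        return word
    # Hypothetical English rule: plural "s" only.
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

print(light_stem("activations"))    # activation
print(light_stem("activate"))       # activate
print(light_stem("grandes", "fr"))  # grand
```

Under these rules "activation" and "activations" are grouped, but "activate" stays separate, so light stemming conflates fewer words than full stemming does.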

Frequently Asked Questions About Stemming

What happens with documents written in languages other than English, French, Spanish or German?

In a stemmed index, documents are all stemmed according to the rules of the selected stemming option, namely English rules for the "Optimized for English" option and rules for English, French, Spanish and German for the "Multilingual Stemming" option. Hence, words of documents written in languages other than the four mentioned are still stemmed according to the stemming rules of the selected option. This means that an unapplied morphological rule could prevent certain words from being stemmed, whereas a rule that is inappropriate for the language in which they are written might erroneously stem other words. In other words, if stemming is applied to languages that the rules do not recognize, words might not be semantically grouped in the proper way. Hence, and far from implying that foreign language documents cannot be retrieved, it means that the morphological variations of their words will not be taken into account.

Why are documents retrieved even though they contain words that are totally unrelated to the query?

Although stemming is generally precise, sometimes words that do not share the same semantic stem (a basic related sense) are conflated.

The most common errors are caused by proper names. For instance, Coveo Enterprise Search will stem visit, visits and visited into vis-, conflating them into the same semantic family. However, Visio, the name of a Microsoft product, will also be stemmed to vis- because its ending triggers stemming rules. Hence, documents containing the word Visio will be retrieved for a query that contains the word visit. Conversely, a query for the word Visio will retrieve the documents that contain one or more of the following words: visit, visits and visited.

Proper names are more likely to be erroneously stemmed because they do not follow regular morphological rules.

Some regular words are known limitations of stemmers. For instance, although university and universe are not members of the same semantic family, they are conflated because of the limitations of stemming procedures. A query for either of these words will retrieve documents containing either one of the two words. Fortunately, these special cases are limited.

Documents that contain similar words were retrieved. However, none of them contains the exact words of the query. Why?

Sometimes, words of the query are not found in any of the indexed documents, but words that have the same stem are. Example: A query for the word "visit" does not return any documents containing this exact word. However, Coveo Enterprise Search returns documents that contain the words "visits" and "visited" because they share the same stem.

Why don't advanced queries behave normally?

Stemming does not change how advanced queries work. However, it must be taken into account because Coveo Enterprise Search looks for stems rather than terms. Here are two examples:

In the index, some documents contain both "consumers" and "consuming", while others contain "consumers" but not "consuming". Yet a query for "consumers NOT consuming" returns no results. Why? Both query terms are stemmed to "consum-", so Coveo Enterprise Search interprets the query as "consum- NOT consum-". This necessarily returns no results.

The exact match string "shipped products" can return documents containing "[…] this ship products steam […]" because Coveo Enterprise Search looks for exact matching stems: "ship- product-".
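The exact match behavior can be sketched by comparing sequences of stems instead of sequences of words. The rules in toy_stem and the helper phrase_match are assumptions made for this illustration and do not reflect the actual matching code of Coveo Enterprise Search.

```python
def toy_stem(term):
    """Hypothetical rules: strip "ed" (undoubling a final letter) or a plural "s"."""
    if len(term) <= 4:  # short words are left untouched
        return term
    if term.endswith("ed"):
        term = term[:-2]
        # "shipp" -> "ship": undouble the final letter left after stripping "ed".
        if len(term) > 2 and term[-1] == term[-2]:
            term = term[:-1]
    elif term.endswith("s"):
        term = term[:-1]
    return term

def phrase_match(text, phrase, normalize=lambda term: term):
    """True when the normalized phrase occurs as a contiguous run in the text."""
    words = [normalize(t) for t in text.lower().split()]
    wanted = [normalize(t) for t in phrase.lower().split()]
    return any(words[i:i + len(wanted)] == wanted
               for i in range(len(words) - len(wanted) + 1))

text = "this ship products steam"
print(phrase_match(text, "shipped products"))            # False
print(phrase_match(text, "shipped products", toy_stem))  # True
```

On whole words the phrase is absent from the text, but once both sides are reduced to "ship product" the sequences line up and the document is returned.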

 

 

Last Reviewed

2006/03/30

Keywords

Stemming, Multilingual stemming, English stemming, Similar words matching