Query understanding

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Jussi Karlgren (talk | contribs) at 03:04, 9 February 2017 (Stemming and lemmatization). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Query understanding is the process of inferring the intent of a search engine user by extracting semantic meaning from the searcher’s keywords. Query understanding methods generally take place before the search engine retrieves and ranks results. It is related to natural language processing but specifically focused on the understanding of search queries.

Methods

Tokenization

Tokenization is the process of breaking up a text string into words or other meaningful elements called tokens. Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, such as splitting the string on punctuation and whitespace characters. Tokenization is more challenging in languages without spaces between words, such as Chinese and Japanese. Tokenizing text in these languages requires the use of word segmentation algorithms.[1]
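A heuristic tokenizer of the kind described above can be sketched in a few lines of Python; the regular expression here, which treats runs of word characters and apostrophes as tokens and everything else as separators, is one illustrative choice rather than a standard:

```python
import re

def tokenize(text):
    """Split a query into tokens with a simple heuristic: runs of word
    characters (letters, digits, underscore) and apostrophes are tokens;
    whitespace and other punctuation act as separators."""
    return re.findall(r"[\w']+", text.lower())

print(tokenize("What's the best pizza in New York?"))
# → ["what's", 'the', 'best', 'pizza', 'in', 'new', 'york']
```

A heuristic like this works reasonably for space-delimited languages but, as noted above, fails for Chinese or Japanese, where segmentation requires a dictionary or a statistical model rather than a character-class pattern.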

Spelling correction

Spelling correction is the process of automatically detecting and correcting spelling errors in search queries. Most spelling correction algorithms are based on a language model, which determines the a priori probability of an intended query, and an error model (typically a noisy channel model), which determines the probability of a particular misspelling, given an intended query.[2]
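The noisy-channel idea can be sketched as follows (a toy version in the spirit of the corrector cited in [2]): the word counts stand in for a language model built from a query log, and the error model is flattened to "every single-character edit is equally likely", so the language model alone ranks the candidates. The vocabulary and counts are invented for illustration:

```python
from collections import Counter

# Toy unigram counts standing in for a language model built from a query log.
COUNTS = Counter({"query": 120, "understanding": 40, "search": 300, "engine": 150})
TOTAL = sum(COUNTS.values())

def p_word(word):
    """Language model: a priori probability of the intended word."""
    return COUNTS[word] / TOTAL

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Choose the in-vocabulary candidate with the highest language-model
    probability; with a uniform error model over single edits, only the
    language model decides. Unknown words with no candidates pass through."""
    candidates = [w for w in edits1(word) | {word} if w in COUNTS] or [word]
    return max(candidates, key=p_word)

print(correct("serch"))  # → search
```

A production corrector would use a weighted error model (some edits, such as adjacent-key substitutions, are more likely than others) and candidates at edit distance greater than one.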

Stemming and lemmatization

Many, but not all, languages inflect words to reflect their role in the utterance they appear in: a word such as *care* may appear, besides in its base form, as *cares*, *cared*, *caring*, and others. The variation between the various forms of a word is likely to be of little import for the relatively coarse-grained model of meaning involved in a retrieval system, and for this reason the task of conflating the various forms of a word is a potentially useful technique for increasing the recall of a retrieval system.[3]

The languages of the world vary in how much morphological variation they exhibit, and for some languages there are simple methods to reduce a word in a query to its lemma, root form, or stem. For other languages, this operation involves non-trivial string processing. A noun in English typically appears in four variants: *cat*, *cat's*, *cats*, *cats'*, or *child*, *child's*, *children*, *children's*. Other languages have more variation: Finnish, for example, potentially exhibits about 5,000 forms for a noun,[4] and in many languages the inflectional forms are not limited to affixes but change the core of the word itself.

Stemming algorithms, also known as stemmers, typically use a collection of simple rules to remove suffixes, intended to model the language’s inflection rules.[5]
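A rule-based stemmer can be sketched as an ordered list of suffix-rewrite rules; the handful of rules below is purely illustrative, whereas real stemmers such as Lovins's [5] use far larger, carefully ordered rule sets with conditions on the remaining stem:

```python
# Illustrative suffix rules (suffix, replacement), tried in order.
SUFFIX_RULES = [("ational", "ate"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Apply the first matching suffix rule, but only if the remaining
    stem keeps at least two characters; otherwise leave the word alone."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + replacement
    return word

print(stem("ponies"))  # → pony
print(stem("cares"))   # → care
print(stem("cared"))   # → car
```

The last example shows the characteristic weakness of crude rule sets: *cares* stems to *care* but *cared* over-stems to *car*, so the forms are not fully conflated, and *car* collides with an unrelated word.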

More advanced methods, lemmatisation methods, group together the inflected forms of a word through more complex rule sets based on a word’s part of speech or its record in a lexical database, transforming an inflected word through lookup or a series of transformations into its lemma. For a long time it was taken to be proven that morphological normalisation by and large did not help retrieval performance,[6] but once the attention of the information retrieval field moved to languages other than English, it was found that for some languages there were obvious gains to be found.

Query rewriting

Query rewriting transforms complex queries into more primitive queries; Lucene, for example, rewrites an expression with wildcards (e.g. quer*) into a boolean query over the matching terms from the index (such as query OR queries).[7]
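The wildcard rewrite described above can be sketched as a scan of the index's term dictionary; this toy version handles only a single trailing `*` and uses an invented in-memory term set, whereas Lucene walks a sorted term dictionary and builds a query object rather than a string:

```python
def rewrite_wildcard(pattern, index_terms):
    """Rewrite a trailing-wildcard query into a boolean OR over the
    index terms that match the prefix. If nothing matches, return the
    pattern unchanged."""
    prefix = pattern.rstrip("*")
    matches = sorted(t for t in index_terms if t.startswith(prefix))
    return " OR ".join(matches) if matches else pattern

index_terms = {"queries", "query", "quest", "table"}
print(rewrite_wildcard("quer*", index_terms))  # → queries OR query
```

Rewriting to primitive term queries lets the engine score and retrieve wildcard queries with the same machinery used for ordinary boolean queries.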

References

  1. ^ "Tokenization".
  2. ^ "How to Write a Spelling Corrector".
  3. ^ Lowe, Thomas; Roberts, David; Kurtz, Peter (1973). Additional Text Processing for On-Line Retrieval (The RADCOL System). Volume 1. DTIC Document. Lennon, Martin; Peirce, David; Tarry, Brian D; Willett, Peter (1981). "An evaluation of some conflation algorithms for information retrieval". Information Scientist. 3 (4). SAGE.
  4. ^ Karlsson, Fred (2008). Finnish: an essential grammar. Routledge.
  5. ^ Lovins, Julie (1968). Development of a stemming algorithm. MIT Information Processing Group.
  6. ^ Harman, Donna. "The effectiveness of stemming for natural-language access to Slovene textual data".
  7. ^ "Query in Lucene 6.4.1 API documentation".