Query understanding

Query understanding is the process of inferring the intent of a search engine user by extracting semantic meaning from the searcher’s keywords.^[1] Query understanding methods are generally applied before the search engine retrieves and ranks results. The field is related to natural language processing but focuses specifically on the understanding of search queries.

Methods

Stemming and lemmatization

Many, but not all, languages inflect words to reflect their role in the utterance they appear in: a word such as *care* may appear, besides the base form, as *cares*, *cared*, *caring*, and others. The variation among the forms of a word is likely to be of little importance for the relatively coarse-grained model of meaning involved in a retrieval system. For this reason, conflating the various forms of a word is a potentially useful technique for increasing the recall of a retrieval system.^[2]

The languages of the world vary in how much morphological variation they exhibit. For some languages there are simple methods to reduce a word in a query to its lemma, root form, or stem; for others, this operation involves non-trivial string processing. A noun in English typically appears in four variants: *cat*, *cat's*, *cats*, *cats'* or *child*, *child's*, *children*, *children's*. Other languages have more variation. Finnish, for example, potentially exhibits about 5000 forms for a noun,^[3] and for many languages the inflectional forms are not limited to affixes but change the core of the word itself.

Stemming algorithms, also known as stemmers, typically apply a collection of simple suffix-removal rules intended to model the language’s inflection rules.^[4]
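As an illustration of rule-based suffix removal, the following is a minimal sketch rather than any published stemmer. The rule list `SUFFIX_RULES`, the threshold `MIN_STEM_LENGTH`, and the function name `simple_stem` are all assumptions made for this example.

```python
# Hypothetical suffix rules, tried in order; not taken from any published stemmer.
SUFFIX_RULES = ["ing", "ed", "es", "s", "e"]

# Assumed safeguard: never strip a suffix if fewer than this many letters remain.
MIN_STEM_LENGTH = 3


def simple_stem(word: str) -> str:
    """Strip the first matching suffix, leaving the word unchanged if none applies."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)]
    return word


if __name__ == "__main__":
    for form in ["care", "cares", "cared", "caring", "cats"]:
        print(form, "->", simple_stem(form))
```

With these rules, all four forms of *care* are conflated to the same stem, which is the recall gain described above. The stem is also identical to the unrelated word *car*, an example of the over-conflation that such simple rules can cause.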

More advanced methods, known as lemmatization methods, group together the inflected forms of a word through more complex rule sets based on a word’s part of speech or its record in a lexical database, transforming an inflected word to its lemma through lookup or a series of transformations. For a long time, it was widely accepted that morphological normalization, by and large, did not help retrieval performance.^[5]

Once the attention of the information retrieval field moved to languages other than English, it was found that for some languages morphological normalization yielded clear gains.^[6]
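The lookup-based form of lemmatization described above can be sketched as follows. The table `LEMMA_TABLE`, its part-of-speech labels, and the function `lemmatize` are hypothetical stand-ins for a real lexical database, and the sketch assumes the part of speech is already known.

```python
# Hypothetical lexical table keyed by (inflected form, part of speech).
# A real system would consult a full lexical database instead.
LEMMA_TABLE = {
    ("children", "NOUN"): "child",
    ("cats", "NOUN"): "cat",
    ("cares", "NOUN"): "care",
    ("cares", "VERB"): "care",
    ("cared", "VERB"): "care",
    ("caring", "VERB"): "care",
    ("better", "ADJ"): "good",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the lemma for a (word, part-of-speech) pair, or the word itself."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


if __name__ == "__main__":
    print(lemmatize("Children", "NOUN"))
    print(lemmatize("better", "ADJ"))
    print(lemmatize("better", "VERB"))
```

Unlike suffix stripping, a lookup can handle forms that change the core of the word, such as *children*, and can return different lemmas for the same string depending on its part of speech.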

Query segmentation

Query segmentation is a key component of query understanding, aiming to divide a query into meaningful segments. Traditional approaches, such as the bag-of-words model, treat individual words as independent units, which can limit interpretative accuracy. For languages like Chinese, where words are not separated by spaces, segmentation is essential, as individual characters often lack standalone meaning. Even in English, the bag-of-words model may not capture the full meaning, as certain phrases, such as "New York", carry significance as a whole rather than as isolated terms. By identifying phrases or entities within queries, query segmentation enables search engines to apply proximity and ordering constraints, improving search accuracy and user satisfaction.^[7]
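One simple way to identify phrases in a query is greedy longest-match lookup against a phrase dictionary. The sketch below assumes whitespace-separated input, so it does not address Chinese segmentation, and the phrase set `KNOWN_PHRASES` and the function `segment_query` are hypothetical names chosen for this example.

```python
# Hypothetical phrase dictionary; each phrase is stored as a tuple of tokens.
KNOWN_PHRASES = {
    ("new", "york"),
    ("new", "york", "times"),
    ("hot", "dog"),
}
MAX_PHRASE_LENGTH = max(len(phrase) for phrase in KNOWN_PHRASES)


def segment_query(query: str) -> list[str]:
    """Split a query into segments, preferring the longest known phrase."""
    tokens = query.lower().split()
    segments = []
    i = 0
    while i < len(tokens):
        longest = min(MAX_PHRASE_LENGTH, len(tokens) - i)
        # Try multi-word candidates from longest to shortest (length >= 2).
        for n in range(longest, 1, -1):
            if tuple(tokens[i:i + n]) in KNOWN_PHRASES:
                segments.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No known phrase starts here: emit the single token.
            segments.append(tokens[i])
            i += 1
    return segments


if __name__ == "__main__":
    print(segment_query("cheap hotels New York"))
    print(segment_query("new york times subscription"))
```

A search engine could then treat each multi-word segment as a phrase constraint instead of matching its words independently.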

Entity recognition

Entity recognition is the process of locating and classifying entities within a text string. Named-entity recognition specifically focuses on named entities, such as names of people, places, and organizations. In addition, entity recognition includes identifying concepts in queries that may be represented by multi-word phrases. Entity recognition systems typically use grammar-based linguistic techniques or statistical machine learning models.^[8]
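The two steps in the definition above, locating and classifying, can be illustrated with a dictionary (gazetteer) lookup. This is far simpler than the grammar-based or statistical systems mentioned above and is shown only as a sketch; `GAZETTEER`, its type labels, and `recognize_entities` are assumptions made for this example.

```python
# Hypothetical gazetteer mapping lowercase entity names to entity types.
GAZETTEER = {
    "new york": "PLACE",
    "paris": "PLACE",
    "marie curie": "PERSON",
    "acme corp": "ORGANIZATION",
}
MAX_NAME_LENGTH = max(len(name.split()) for name in GAZETTEER)


def recognize_entities(query: str) -> list[tuple[str, str]]:
    """Return (entity text, entity type) pairs found in the query, left to right."""
    tokens = query.lower().split()
    entities = []
    i = 0
    while i < len(tokens):
        longest = min(MAX_NAME_LENGTH, len(tokens) - i)
        # Prefer the longest name starting at position i.
        for n in range(longest, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GAZETTEER:
                entities.append((candidate, GAZETTEER[candidate]))
                i += n
                break
        else:
            # No entity starts at this token; move on.
            i += 1
    return entities


if __name__ == "__main__":
    print(recognize_entities("flights from Paris to New York"))
```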

Query rewriting

Query rewriting is the process of automatically reformulating a search query to more accurately capture its intent. Query expansion adds query terms, such as synonyms, in order to retrieve more documents and thereby increase recall. Query relaxation removes query terms to reduce the requirements for a document to match the query, thereby also increasing recall. Other forms of query rewriting, such as automatically converting consecutive query terms into phrases and restricting query terms to specific fields, aim to increase precision.
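Expansion and relaxation can be sketched on a tokenized query as follows. The synonym table `SYNONYMS`, the set `OPTIONAL_TERMS`, and both function names are hypothetical; a production system would typically derive such resources from data rather than hard-code them.

```python
# Hypothetical synonym table used for expansion.
SYNONYMS = {
    "laptop": ["notebook"],
    "cheap": ["inexpensive", "budget"],
}

# Hypothetical set of terms that may be dropped when relaxing a query.
OPTIONAL_TERMS = {"best", "cheap", "new"}


def expand_query(terms: list[str]) -> list[list[str]]:
    """Replace each term with a group of alternatives (the term plus its synonyms).

    A document matches a group if it contains any alternative in it.
    """
    return [[term] + SYNONYMS.get(term, []) for term in terms]


def relax_query(terms: list[str]) -> list[str]:
    """Drop optional terms so that fewer terms are required to match."""
    kept = [term for term in terms if term not in OPTIONAL_TERMS]
    # Never relax a query down to nothing.
    return kept or list(terms)


if __name__ == "__main__":
    print(expand_query(["cheap", "laptop"]))
    print(relax_query(["best", "cheap", "laptop"]))
```

Both rewrites loosen the match condition, which is why both tend to raise recall, possibly at some cost in precision.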

Spelling correction

Automatic spelling correction is a critical feature of modern search engines, designed to address common spelling errors in user queries. Such errors are especially frequent because users often search for unfamiliar topics. By correcting misspelled queries, search engines improve their understanding of user intent and, in turn, the relevance and quality of search results and the overall user experience.^[9]
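One common basis for spelling correction is edit distance: a misspelled term is replaced by the closest word in a known vocabulary. The sketch below uses Levenshtein distance; the word list `VOCABULARY`, the threshold `MAX_DISTANCE`, and the function names are assumptions for this example, and real systems typically also weigh word frequency and query context.

```python
# Hypothetical vocabulary of correctly spelled terms.
VOCABULARY = ["retrieval", "query", "search", "engine", "understanding"]

# Assumed cutoff: leave the term alone if no word is at least this close.
MAX_DISTANCE = 2


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def correct_term(term: str) -> str:
    """Return the nearest vocabulary word, or the term itself if none is close."""
    term = term.lower()
    if term in VOCABULARY:
        return term
    best = min(VOCABULARY, key=lambda word: edit_distance(term, word))
    return best if edit_distance(term, best) <= MAX_DISTANCE else term


if __name__ == "__main__":
    print(correct_term("serach"))
    print(correct_term("retreival"))
```

The distance cutoff keeps rare but valid terms from being rewritten into unrelated vocabulary words.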

References

[1] "Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) 2010 Workshop on Query Representation and Understanding" (PDF).
[2] Lowe, Thomas; Roberts, David; Kurtz, Peter (1973). Additional Text Processing for On-Line Retrieval (The RADCOL System). Volume 1. DTIC Document. Lennon, Martin; Peirce, David; Tarry, Brian D; Willett, Peter (1981). "An evaluation of some conflation algorithms for information retrieval". Information Scientist. 3 (4). SAGE.
[3] Karlsson, Fred (2008). Finnish: an essential grammar. Routledge.
[4] Lovins, Julie (1968). Development of a stemming algorithm. MIT Information Processing Group.
[5] Harman, Donna (1991). "How Effective is Suffixing?". Journal of the American Society for Information Science. 42 (1): 7–15. doi:10.1002/(SICI)1097-4571(199101)42:1<7::AID-ASI2>3.0.CO;2-P.
[6] Popovič, Mirko; Willett, Peter (1981). "The effectiveness of stemming for natural-language access to Slovene textual data". Information Scientist. 3 (4). SAGE.
[7] Li, Hang; Xu, Jun; Zhang, Min (2021). Query Understanding for Search Engines. Springer.
[8] "A Survey of Named Entity Recognition and Classification" (PDF).
[9] Li, Hang; Xu, Jun; Zhang, Min (2021). Query Understanding for Search Engines. Springer.
