
Web query classification

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by EvanXW (talk | contribs) at 13:20, 17 March 2008 (Created page with ''''Web query topic classification/categorization''' is a problem in information science. The task is to assign a Web search query to one or mor...'). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Web query topic classification (or categorization) is a problem in information science. The task is to assign a Web search query to one or more predefined categories based on its topics. Unlike traditional document classification, queries submitted by Web search users are usually short and ambiguous, and their meanings change over time. As a result, some standard document classification techniques cannot be applied directly to query topic classification.
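As a rough illustration of the task's shape (one query in, a set of zero or more categories out), the following minimal Python sketch uses a keyword lookup. The category names, the trigger terms, and the function `classify_query` are hypothetical and chosen only for illustration; a lookup this simple is not a practical classification method.

```python
from typing import Dict, Set

# Hypothetical category lexicon: each category maps to a few trigger terms.
# These categories and terms are illustrative assumptions, not a real taxonomy.
CATEGORY_TERMS: Dict[str, Set[str]] = {
    "Computers/Software": {"java", "python", "download"},
    "Travel/Destinations": {"java", "bali", "flights"},
    "Food/Produce": {"apple", "banana"},
    "Computers/Hardware": {"apple", "laptop"},
}


def classify_query(query: str) -> Set[str]:
    """Return every category whose trigger terms overlap the query's tokens."""
    tokens = set(query.lower().split())
    return {cat for cat, terms in CATEGORY_TERMS.items() if tokens & terms}


if __name__ == "__main__":
    # A single-word query can legitimately receive more than one category.
    print(sorted(classify_query("Java")))
    # A query with no known terms receives no category at all.
    print(sorted(classify_query("xyzzy")))
```

Because the function returns a set, a single query can carry several category labels at once, which is the "one or more predefined categories" part of the task definition.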


Problem

Web query topic classification automatically assigns a query to one or more predefined categories. Compared with traditional document classification, several major difficulties hinder progress in Web query understanding:

Short and Noisy - Many queries are short, and query terms are noisy. For example, in the KDDCUP 2005 dataset[1], 3-word queries are the most frequent (22%), and 79% of queries contain no more than 4 words. A query may combine ordinary words, names of people or places, URLs, special acronyms, program fragments, and malicious code. Some queries consist of clean words, while others contain typos or meaningless strings.

Ambiguous - A user query often has multiple meanings. For example, "Apple" can mean a kind of fruit or a computer company, and "Java" can mean a programming language or an island in Indonesia. In the KDDCUP 2005 dataset, most queries have more than one meaning.

Concept Drift - The meanings of queries may also evolve over time. For example, before 2007 "Barcelona" referred to a city or a football club; it has since also come to mean a new AMD microprocessor. The distribution of the meanings of this term on the Web is therefore a function of time.
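Length statistics such as those quoted for the KDDCUP 2005 dataset can be computed from any query log. The sketch below is a minimal Python example that assumes whitespace tokenization and uses a small made-up query list; the function names `length_distribution` and `share_at_most` are hypothetical, and the sample queries are not drawn from the KDDCUP data.

```python
from collections import Counter
from typing import Dict, List


def length_distribution(queries: List[str]) -> Dict[int, float]:
    """Map each query length (in whitespace-separated words) to its share of the log."""
    counts = Counter(len(q.split()) for q in queries)
    total = len(queries)
    return {n: counts[n] / total for n in sorted(counts)}


def share_at_most(dist: Dict[int, float], max_words: int) -> float:
    """Fraction of queries containing no more than max_words words."""
    return sum(share for n, share in dist.items() if n <= max_words)


if __name__ == "__main__":
    # Made-up sample log, for illustration only.
    sample = [
        "apple",
        "java island map",
        "cheap flights to barcelona",
        "amd barcelona processor benchmark results",
    ]
    dist = length_distribution(sample)
    print(dist)
    print(share_at_most(dist, 4))
```

Run over a real log, `share_at_most(dist, 4)` would give the counterpart of the 79% figure above, although the exact value depends on how queries are tokenized.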