Explicit semantic analysis
In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses Wikipedia as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of Wikipedia's article text, and a document (a string of words) is represented as the centroid of the vectors representing its words.
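Concretely, the representation can be illustrated with a short Python sketch, assuming a toy in-memory article collection in place of a full Wikipedia dump (the article texts and helper names below are illustrative, not part of the original method):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for Wikipedia: each entry is one article ("concept").
articles = {
    "Cat":      "the cat is a small domesticated carnivorous mammal",
    "Dog":      "the dog is a domesticated descendant of the wolf",
    "Computer": "a computer is a machine that executes programs",
}

vectorizer = TfidfVectorizer()
# Rows correspond to articles (concepts), columns to vocabulary terms.
tfidf = vectorizer.fit_transform(articles.values()).toarray()
vocab = vectorizer.vocabulary_  # term -> column index

def word_vector(word):
    """ESA vector of a word: its column of the tf-idf matrix, i.e.,
    the word's weight in each article (concept)."""
    return tfidf[:, vocab[word]]

def document_vector(text):
    """ESA vector of a document: the centroid of its words' vectors."""
    words = [w for w in text.lower().split() if w in vocab]
    return np.mean([word_vector(w) for w in words], axis=0)
```

A word is thus described by how strongly it is associated with each article, and a document by the average of those association profiles.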
ESA was designed by Evgeniy Gabrilovich and Shaul Markovitch as a means of improving text categorization.[1] The same researchers have since used it to compute what they refer to as "semantic relatedness", measured as the cosine similarity between the vectors described above. These vectors are collectively interpreted as a space of "concepts explicitly defined and described by humans", in which Wikipedia articles are equated with concepts; the name "explicit semantic analysis" contrasts with latent semantic analysis (LSA), whose concept dimensions are latent rather than explicitly defined.[2]
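Building directly on the sketch above, relatedness between two texts is then the cosine similarity of their concept vectors (again an illustrative sketch, not the authors' implementation):

```python
def relatedness(text_a, text_b):
    """Semantic relatedness as the cosine similarity of ESA vectors."""
    a, b = document_vector(text_a), document_vector(text_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(relatedness("small cat", "domesticated dog"))   # nonzero: shared concepts
print(relatedness("small cat", "computer programs"))  # zero: disjoint concepts
```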
ESA, as originally posited by Gabrilovich and Markovitch, operates under the assumption that Wikipedia articles are topically orthogonal. However, it was later shown that ESA also improves the performance of information retrieval systems when it is based not on Wikipedia, but on the Reuters corpus of newswire articles, which does not satisfy the orthogonality property.[3] To explain this observation, connections have since been drawn between ESA and the generalized vector space model.[4]
Cross-language explicit semantic analysis (CL-ESA) is a multilingual generalization of ESA.[5] CL-ESA exploits a document-aligned multilingual reference collection (e.g., Wikipedia) to represent a document as a language-independent concept vector. The relatedness of two documents in different languages is then assessed by the cosine similarity between their concept vectors.
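As a minimal sketch of this mapping, assume a small document-aligned bilingual collection in place of aligned Wikipedia editions; each document is projected onto the shared concept index via tf-idf similarity to the aligned articles in its own language (all names and texts below are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One entry per concept; the i-th English and German texts describe the
# same concept, so the concept index i is shared across languages.
aligned_articles = {
    "en": ["the cat is a small domesticated mammal",
           "a computer executes programs"],
    "de": ["die katze ist ein kleines domestiziertes säugetier",
           "ein computer führt programme aus"],
}

# One tf-idf index per language, fit on that language's article texts.
indexes = {lang: TfidfVectorizer().fit(texts)
           for lang, texts in aligned_articles.items()}
concept_matrices = {lang: indexes[lang].transform(aligned_articles[lang])
                    for lang in aligned_articles}

def concept_vector(text, lang):
    """Language-independent concept vector: the document's similarity
    to each aligned article in its own language."""
    doc = indexes[lang].transform([text])
    return cosine_similarity(doc, concept_matrices[lang])[0]

# Cross-language relatedness: cosine of the two concept vectors.
v_en = concept_vector("a small cat", "en")
v_de = concept_vector("eine kleine katze", "de")
print(cosine_similarity([v_en], [v_de])[0, 0])  # high: same concept
```

Because the concept index is shared across languages, the two vectors are directly comparable even though the surface vocabularies are disjoint.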
External links
- Explicit semantic analysis on Evgeniy Gabrilovich's homepage, with links to implementations
References
- ^ Evgeniy Gabrilovich and Shaul Markovitch. Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), pp. 1301–1306, 2006.
- ^ Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1606–1611, 2007.
- ^ Maik Anderka and Benno Stein. The ESA retrieval model revisited. Proceedings of the 32nd International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 670–671, 2009.
- ^ Thomas Gottron, Maik Anderka and Benno Stein. Insights into explicit semantic analysis. Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), pp. 1961–1964, 2011.
- ^ Martin Potthast, Benno Stein, and Maik Anderka. A Wikipedia-based multilingual retrieval model. Proceedings of the 30th European Conference on IR Research (ECIR), pp. 522–530, 2008.