Explicit semantic analysis
In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses Wikipedia as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of Wikipedia's article text and a document (string of words) is represented as the centroid of the vectors representing its words.
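The representation described above can be illustrated with a minimal sketch. It assumes three toy texts standing in for Wikipedia articles and a plain tf-idf weighting of raw count times log(N/df); the names `ARTICLES`, `build_word_vectors` and `document_vector` are hypothetical and do not come from any ESA implementation.

```python
import math
from collections import Counter

# Toy stand-ins for Wikipedia articles (hypothetical data).
# Each article plays the role of one "concept".
ARTICLES = {
    "Cat": "cat feline pet animal purr",
    "Dog": "dog canine pet animal bark",
    "Car": "car engine wheel road drive",
}


def build_word_vectors(articles):
    """Return (concept names, {word: tf-idf weights over the concepts})."""
    names = list(articles)
    counts = [Counter(articles[name].split()) for name in names]
    vocab = sorted(set().union(*counts))
    n = len(names)
    vectors = {}
    for word in vocab:
        # Document frequency: number of articles containing the word.
        df = sum(1 for c in counts if word in c)
        idf = math.log(n / df)
        # One entry per article, i.e. the word's column of the tf-idf matrix.
        vectors[word] = [c[word] * idf for c in counts]
    return names, vectors


def document_vector(text, vectors, n_concepts):
    """Centroid of the vectors of the known words in `text`."""
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return [0.0] * n_concepts
    return [sum(column) / len(known) for column in zip(*known)]


names, vectors = build_word_vectors(ARTICLES)
doc = document_vector("cat dog", vectors, len(names))
```

In this sketch, `vectors["cat"]` is nonzero only in the "Cat" position, while the centroid for the text "cat dog" carries equal weight on "Cat" and "Dog" and none on "Car". Real implementations typically differ in tokenization, weighting and pruning.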
ESA was designed by Evgeniy Gabrilovich and Shaul Markovitch as a means of improving text categorization.[1] The two researchers have also used it to compute what they refer to as "semantic relatedness", measured as the cosine similarity between the aforementioned vectors. These vectors are collectively interpreted as a space of "concepts explicitly defined and described by humans", in which Wikipedia articles are equated with concepts. The name "explicit semantic analysis" contrasts with latent semantic analysis (LSA).[2]
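The relatedness computation can be sketched as a cosine similarity between two concept vectors. The vectors below are hand-written, hypothetical weights over three concepts ("Cat", "Dog", "Car"), not values taken from Wikipedia, and `cosine_similarity` is an illustrative helper rather than part of any published ESA code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical concept vectors: weights over ("Cat", "Dog", "Car").
cat = [1.1, 0.0, 0.0]
pet = [0.4, 0.4, 0.0]
car = [0.0, 0.0, 1.1]

related = cosine_similarity(cat, pet)    # shares the "Cat" concept
unrelated = cosine_similarity(cat, car)  # no shared concept
```

Here `related` is positive because the two vectors overlap on one concept, while `unrelated` is zero because they share none.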
ESA, as originally posited by Gabrilovich and Markovitch, operates under the assumption that Wikipedia articles are topically orthogonal. However, it was later shown that ESA also improves the performance of information retrieval systems when it is based not on Wikipedia, but on the Reuters corpus of newswire articles, which does not satisfy the orthogonality property.[3] To explain this observation, links have been shown between ESA and the generalized vector space model.[4]
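One way to see the connection to the generalized vector space model is the following sketch. It assumes an unnormalized inner product instead of the cosine, and the notation (A, d, q, G) is introduced here for illustration; the formulation in the cited paper may differ.

```latex
% A \in \mathbb{R}^{n \times m}: tf-idf matrix, one row per concept (article), one column per word.
% d, q \in \mathbb{R}^{m}: term-count vectors of two texts; |d|, |q|: their total word counts.
\begin{align}
  \mathbf{e}(d) &= \frac{1}{|d|}\, A\, d
    && \text{(centroid of the word vectors of } d\text{)} \\
  \langle \mathbf{e}(d), \mathbf{e}(q) \rangle
    &= \frac{1}{|d|\,|q|}\, d^{\top} A^{\top} A\, q
     = \frac{1}{|d|\,|q|}\, d^{\top} G\, q,
    && G = A^{\top} A
\end{align}
```

Under these assumptions the comparison takes the form of a term vector product weighted by a term-term matrix G, which does not require the rows of A (the concepts) to be orthogonal.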
External links
- Explicit semantic analysis on Evgeniy Gabrilovich's homepage; has links to implementations
References
1. Evgeniy Gabrilovich and Shaul Markovitch (2006). Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. Proc. 21st National Conference on AI (AAAI), pp. 1301–1306.
2. Evgeniy Gabrilovich and Shaul Markovitch (2007). Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis. Proc. 20th Int'l Joint Conf. on AI (IJCAI).
3. Maik Anderka and Benno Stein (2009). The ESA retrieval model revisited. Proc. SIGIR.
4. Thomas Gottron, Maik Anderka and Benno Stein (2011). Insights into explicit semantic analysis. Proc. CIKM.