User:Grigori sidorov/sandbox

From Wikipedia, the free encyclopedia

The soft cosine measure[1] is a measure of “soft” similarity between two vectors, i.e., a measure that takes the similarity between pairs of features into account. Traditional cosine similarity treats the features of the Vector Space Model (VSM) as independent, or completely different, whereas the soft cosine measure takes the similarity between VSM features into account. This generalizes both the cosine measure and the notion of similarity itself (soft similarity).

In the field of Natural Language Processing (NLP), for example, similarity between features is quite intuitive. Features such as words, n-grams, or syntactic n-grams[2] can be quite similar, even though formally they are treated as different features in the VSM. For instance, “play” and “game” are different words and are therefore mapped to different dimensions in the VSM, yet they are clearly related semantically. For n-grams or syntactic n-grams, Levenshtein distance can be used to measure similarity (it can be applied to words as well).

To calculate the soft cosine measure, a matrix of similarities between features is introduced. It can be computed using Levenshtein distance or other similarity measures, e.g., various WordNet similarity measures. The feature vectors are then multiplied by this matrix when the measure is computed.
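As an illustration of how such a feature similarity matrix might be built, the following is a minimal sketch in Python. It assumes that features are plain strings and that similarity is derived from Levenshtein distance by normalizing with the length of the longer string; this normalization is an assumption made for the example, not a formula given in this text. The function names `levenshtein` and `similarity_matrix` are hypothetical.

```python
def levenshtein(s, t):
    """Edit distance between two strings (insert, delete, substitute; cost 1 each)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def similarity_matrix(features):
    """Pairwise feature similarity in [0, 1].

    Assumption for this sketch: similarity = 1 - distance / max(length).
    """
    n = len(features)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            longest = max(len(features[i]), len(features[j]))
            if longest == 0:
                s[i][j] = 1.0  # two empty strings are identical
            else:
                s[i][j] = 1.0 - levenshtein(features[i], features[j]) / longest
    return s
```

With this choice the matrix is symmetric and has ones on the diagonal. A WordNet-based measure could be substituted for `levenshtein` without changing the rest of the sketch.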

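Given two N-dimensional vectors a and b, the soft cosine similarity is calculated as follows:

$$\mathrm{soft\_cosine}(a, b) = \frac{\sum_{i,j}^{N} s_{ij} a_i b_j}{\sqrt{\sum_{i,j}^{N} s_{ij} a_i a_j}\,\sqrt{\sum_{i,j}^{N} s_{ij} b_i b_j}}$$

where $s_{ij} = \mathrm{similarity}(\mathrm{feature}_i, \mathrm{feature}_j)$.

If there is no similarity between features ($s_{ii} = 1$, $s_{ij} = 0$ for $i \neq j$), the given equation is equivalent to the conventional cosine similarity formula.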
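The computation can be sketched directly as a double sum over feature pairs, each term weighted by the similarity between the two features. The following minimal Python sketch assumes the vectors are lists of numbers and the similarity matrix `s` is a square list of lists of matching size; the function name `soft_cosine` is hypothetical, and the sketch also assumes `s` yields non-negative values under the square roots (as a symmetric positive semidefinite matrix does).

```python
import math


def soft_cosine(a, b, s):
    """Soft cosine of vectors a and b under feature similarity matrix s."""
    n = len(a)
    if len(b) != n or len(s) != n:
        raise ValueError("a, b and s must have matching dimensions")

    def weighted_sum(x, y):
        # Sum over all feature pairs (i, j) of s[i][j] * x[i] * y[j].
        return sum(s[i][j] * x[i] * y[j] for i in range(n) for j in range(n))

    denom = math.sqrt(weighted_sum(a, a)) * math.sqrt(weighted_sum(b, b))
    if denom == 0:
        raise ValueError("soft cosine is undefined for a zero vector")
    return weighted_sum(a, b) / denom


identity = [[1.0, 0.0], [0.0, 1.0]]  # no similarity between distinct features
related = [[1.0, 0.5], [0.5, 1.0]]   # the two features are half similar

# With the identity matrix, vectors on different axes are orthogonal.
orthogonal = soft_cosine([1, 0], [0, 1], identity)
# With related features, the same two vectors become partly similar.
softened = soft_cosine([1, 0], [0, 1], related)
```

Here `orthogonal` evaluates to 0 and `softened` to 0.5, showing how the off-diagonal entries of the matrix let vectors with no shared features still register as similar. The double loop makes the quadratic cost in the number of features explicit.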

The time complexity of this measure is quadratic in the number of features, which keeps it applicable to real-world tasks. The complexity can even be reduced to linear.

  1. Sidorov, Grigori; Gelbukh, Alexander; Gómez-Adorno, Helena; Pinto, David. "Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model". Computación y Sistemas. 18 (3): 491–504. doi:10.13053/CyS-18-3-2043. Retrieved 7 October 2014.
  2. Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana. "Syntactic Dependency-based N-grams as Classification Features". LNAI 7630. pp. 1–11. ISBN 978-3-642-37798-3. Retrieved 7 October 2014.