User:Grigori sidorov/sandbox

Soft Cosine Measure ^[1] is a measure of “soft” similarity between two vectors, i.e., a measure that takes into account the similarity between pairs of features. The traditional cosine similarity treats the features of the Vector Space Model (VSM) as independent, or completely different, whereas the soft cosine measure takes the similarity of features in the VSM into account. This generalizes both the cosine measure and the notion of similarity itself (soft similarity).

For example, in Natural Language Processing (NLP) the similarity between features is quite intuitive. Features such as words, n-grams, or syntactic n-grams^[2] can be quite similar, even though formally they are treated as different features in the VSM. For instance, the words “play” and “game” are different words and are therefore mapped to different dimensions in the VSM, yet they are clearly related semantically. For n-grams or syntactic n-grams, Levenshtein distance can be used to measure similarity (in fact, it can be applied to words as well).

To calculate the soft cosine measure, a matrix $s$ of similarities between features is introduced. It can be computed using Levenshtein distance or other similarity measures, e.g., various WordNet similarity measures. The matrix then weights every pair of vector components in the sums that make up the cosine formula.
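As an illustration of how such a matrix might be built from Levenshtein distance, here is a minimal sketch. The function names and the normalization $s_{ij}=1-d_{ij}/\max(|f_{i}|,|f_{j}|)$ are assumptions made for this example; the text above does not prescribe a particular way of turning a distance into a similarity.

```python
def levenshtein(x, y):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_matrix(features):
    """Symmetric matrix with s[i][i] = 1.

    Assumed normalization (not taken from the source):
    s[i][j] = 1 - distance / length of the longer string.
    """
    n = len(features)
    s = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(len(features[i]), len(features[j])) or 1
            d = levenshtein(features[i], features[j])
            s[i][j] = s[j][i] = 1.0 - d / longest
    return s


features = ["play", "player", "game"]
s = similarity_matrix(features)
```

A string-based measure of this kind only captures surface similarity: “play” and “player” come out as similar, while “play” and “game” do not, so semantic relatedness would need a resource such as WordNet.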

Given two $N$-dimensional vectors $a$ and $b$, the soft cosine similarity is calculated as follows:

${\begin{aligned}\operatorname {soft\_cosine} _{1}(a,b)={\frac {\sum \nolimits _{i,j=1}^{N}s_{ij}a_{i}b_{j}}{{\sqrt {\sum \nolimits _{i,j=1}^{N}s_{ij}a_{i}a_{j}}}{\sqrt {\sum \nolimits _{i,j=1}^{N}s_{ij}b_{i}b_{j}}}}},\end{aligned}}$

where $s_{ij}=\operatorname {similarity} ({\text{feature}}_{i},{\text{feature}}_{j})$.
If there is no similarity between features ($s_{ii}=1$, $s_{ij}=0$ for $i\neq j$), the given equation is equivalent to the conventional cosine similarity formula.
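The formula can be sketched directly in code. The following is a minimal illustration, not a reference implementation; the function names `soft_dot` and `soft_cosine` and the toy matrices are chosen for this example only.

```python
import math


def soft_dot(a, b, s):
    """Sum over all i, j of s[i][j] * a[i] * b[j]."""
    return sum(s[i][j] * a[i] * b[j]
               for i in range(len(a))
               for j in range(len(b)))


def soft_cosine(a, b, s):
    """Soft cosine of a and b under the feature-similarity matrix s."""
    denom = math.sqrt(soft_dot(a, a, s)) * math.sqrt(soft_dot(b, b, s))
    if denom == 0:
        raise ValueError("soft cosine is undefined for a zero-norm vector")
    return soft_dot(a, b, s) / denom


# Two documents that share no feature at all.
a = [1.0, 0.0]
b = [0.0, 1.0]

identity = [[1.0, 0.0], [0.0, 1.0]]  # features treated as unrelated
related = [[1.0, 0.5], [0.5, 1.0]]   # the two features are half similar

print(soft_cosine(a, b, identity))  # 0.0, same as the ordinary cosine
print(soft_cosine(a, b, related))   # 0.5
```

With the identity matrix the result coincides with the conventional cosine, as stated above; once the two features are declared similar, vectors with no feature in common still receive a non-zero score.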

The time complexity of this measure is quadratic in the number of features $N$, which keeps it practical for real-world tasks. The complexity can even be reduced to linear.
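One way to see how a linear cost per comparison can arise is the following sketch. It assumes that $s$ is symmetric and positive definite, and it is not necessarily the transformation intended in the source. Under that assumption $s$ has a Cholesky factorization $s=LL^{T}$, and

$\sum \nolimits _{i,j=1}^{N}s_{ij}a_{i}b_{j}=a^{T}sb=(L^{T}a)^{T}(L^{T}b).$

The soft cosine of $a$ and $b$ is then the conventional cosine of the transformed vectors $a'=L^{T}a$ and $b'=L^{T}b$. The factorization is computed once and each vector is transformed once; after that, every comparison between two transformed vectors costs $O(N)$.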

[1] Sidorov, Grigori; Gelbukh, Alexander; Gómez-Adorno, Helena; Pinto, David. "Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model". Computación y Sistemas. 18 (3): 491–504. doi:10.13053/CyS-18-3-2043. Retrieved 7 October 2014.

[2] Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana. Syntactic Dependency-based N-grams as Classification Features. LNAI 7630. pp. 1–11. ISBN 978-3-642-37798-3. Retrieved 7 October 2014.
