
String Metrics

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Mehmet Karatay (talk | contribs) at 13:28, 17 May 2007 (Removed two links. These just gave alternative names for articles already mentioned. Please see discussion page.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

String Metrics (also known as Similarity Metrics) are a class of text-based metrics that produce a similarity or dissimilarity score between two strings, for use in approximate matching or comparison. For example, the strings "Sam" and "Samuel" are not the same, but can be considered similar to a degree. A String Metric provides a floating-point number whose meaning as a measure of similarity (or, in some cases, dissimilarity) is specific to the algorithm.
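As a rough illustration of such a score, the sketch below uses `difflib.SequenceMatcher` from Python's standard library. This is only one of many possible similarity measures, chosen here because it needs no third-party code; the helper name `similarity` is hypothetical and not taken from any particular String Metric library.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0.0, 1.0]; 1.0 means the strings are identical."""
    # ratio() is 2 * M / T, where M is the number of matched characters
    # and T is the combined length of both strings.
    # autojunk=False switches off a heuristic meant for very long inputs.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


print(similarity("Sam", "Samuel"))  # partially similar
print(similarity("Sam", "Sam"))     # identical
print(similarity("Sam", "Xyz"))     # nothing in common
```

Here a higher number means greater similarity. A distance-style metric works the other way round, with 0 meaning identical.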

The most widely known (although rudimentary) String Metric is Levenshtein Distance (also known as Edit Distance), which operates on two input strings and returns a score equal to the number of insertions, deletions and substitutions needed to transform one input string into the other. From simple String Metrics such as Levenshtein Distance, the field has expanded to include phonetic, token-based, grammatical and character-based methods of statistical comparison.
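A minimal sketch of the classic dynamic-programming computation of Levenshtein Distance follows, counting single-character insertions, deletions and substitutions. The function name `levenshtein` is an illustrative choice, and the row-by-row formulation is one common way to write it, not the only one.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character edits turning source into target."""
    # previous[j] is the distance between the already processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Sam", "Samuel"))      # 3 (insert "u", "e", "l")
print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of `target` rather than with the product of both lengths.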

A widespread application of String Metrics is DNA and RNA analysis, in which optimised String Metrics are used to identify matching sequences.

String Metrics are used heavily in Information Integration, Data Mining, Fraud Detection, Ontology Merging and Database Deduplication, as well as in many other tasks ranging from Fingerprint Analysis to tracing genealogy.

Examples of String Metrics

SimMetrics is an open source, extensible library of Similarity or Distance Metrics (also known as String Metrics), e.g. Levenshtein Distance, Block Distance (also called City Block Distance or L1 Distance), Cosine Similarity, Needleman-Wunsch distance (or Sellers Algorithm), Smith-Waterman distance, Gotoh Distance (or Smith-Waterman-Gotoh distance), Monge-Elkan distance, Jaro distance, Jaro-Winkler, Soundex distance, Matching Coefficient, Dice's Coefficient, Jaccard Similarity (also called Jaccard Coefficient or Tanimoto coefficient), Overlap Coefficient, Euclidean distance and q-gram distance. SimMetrics provides floating-point similarity measures, normalised to the range 0 to 1, between pairs of strings, as well as the unnormalised metric output. SimMetrics is intended as an extensible platform for Information Integration.
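To make a few of the set-based measures named above concrete, the sketch below computes the Jaccard, Dice and Overlap coefficients over character q-grams, each yielding a score between 0 and 1. It is an independent illustration and does not reproduce the SimMetrics API. The function names, the choice of q = 2 and the handling of empty sets are all assumptions made for this example.

```python
from typing import Set


def qgrams(text: str, q: int = 2) -> Set[str]:
    """Return the set of overlapping substrings of length q."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared items divided by all distinct items."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def dice(a: Set[str], b: Set[str]) -> float:
    """Twice the shared items divided by the sum of both set sizes."""
    if not a and not b:
        return 1.0  # assumption, as above
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a: Set[str], b: Set[str]) -> float:
    """Shared items divided by the size of the smaller set."""
    if not a or not b:
        return 1.0 if a == b else 0.0  # assumption for empty input
    return len(a & b) / min(len(a), len(b))


sam, samuel = qgrams("Sam"), qgrams("Samuel")
print(jaccard(sam, samuel))  # 2 shared bigrams out of 5 distinct
print(dice(sam, samuel))     # 2 * 2 / (2 + 5)
print(overlap(sam, samuel))  # every bigram of "Sam" occurs in "Samuel"
```

The three coefficients rank the same pair differently: the Overlap Coefficient treats "Sam" as fully contained in "Samuel", while Jaccard and Dice penalise the extra characters.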