String Metrics

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by SmackBot (talk | contribs) at 21:15, 18 May 2007 (Date/fix the maintenance tags or gen fixes). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

String metrics (also known as similarity metrics) are a class of text-based metrics that produce a similarity or dissimilarity score between two text strings for approximate matching or comparison. For example, the strings "Sam" and "Samuel" are not identical but can be considered similar to a degree. A string metric returns a floating-point number giving an algorithm-specific indication of similarity (or, in some cases, dissimilarity).
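As a rough illustration of such a score, the sketch below compares "Sam" and "Samuel" using `difflib.SequenceMatcher` from the Python standard library. This is only one possible similarity measure, chosen here because it needs no third-party code; the helper name `similarity` is an assumption made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the strings are identical."""
    # ratio() is 2 * M / T, where M is the number of matched characters
    # and T is the combined length of both strings.
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    print(similarity("Sam", "Samuel"))  # 2 * 3 / 9, roughly 0.67
    print(similarity("Sam", "Sam"))     # 1.0
```

Other metrics would assign a different score to the same pair, which is why the result is described as algorithm-specific.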

The most widely known (although rudimentary) string metric is Levenshtein distance (also known as edit distance), which operates on two input strings and returns a score equal to the number of insertions, substitutions and deletions needed to transform one input string into the other. From simple metrics such as Levenshtein distance, the field has expanded to include phonetic, token, grammatical and character-based methods of statistical comparison.
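A minimal sketch of Levenshtein distance follows, assuming the standard definition in which the permitted edits are single-character insertions, deletions and substitutions, each with a cost of 1. It uses the usual dynamic-programming approach and keeps only two rows of the table; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b. Initially that prefix is empty.
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current

    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Sam", "Samuel"))      # 3
    print(levenshtein("kitten", "sitting"))  # 3
```

The result is an unnormalised count. Dividing it by the length of the longer string and subtracting from 1 is one common way to turn it into a similarity score between 0 and 1.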

A widespread application of string metrics is DNA and RNA analysis, in which optimised string metrics are used to identify matching sequences.

String metrics are used heavily in information integration, data mining, fraud detection, ontology merging and database deduplication, as well as in many other tasks ranging from fingerprint analysis to tracing genealogy.

Examples of String Metrics

SimMetrics is an open-source, extensible library of similarity or distance metrics (also known as string metrics), e.g. Levenshtein distance, block distance (also called city block distance or L1 distance), cosine similarity, Needleman-Wunsch distance or Sellers algorithm, Smith-Waterman distance, Gotoh distance or Smith-Waterman-Gotoh distance, Monge-Elkan distance, Jaro distance, Jaro-Winkler distance, Soundex distance, matching coefficient, Dice's coefficient, Jaccard similarity (also called Jaccard coefficient or Tanimoto coefficient), overlap coefficient, Euclidean distance (L2 distance) and q-gram distance. SimMetrics provides float-based (0-1) similarity measures between pairs of strings, as well as the unnormalised metric output. SimMetrics is intended as an extensible platform for information integration.
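To show what a 0-1 similarity measure from this list can look like, here is a small sketch of the Jaccard and Dice coefficients computed over sets of q-grams (substrings of length q). It is an independent illustration and not the SimMetrics API; the function names and the choice of q = 2 are assumptions made for this example.

```python
def qgrams(text: str, q: int = 2) -> set[str]:
    """Return the set of all substrings of length q in text."""
    # Strings shorter than q produce an empty set.
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard_similarity(a: str, b: str, q: int = 2) -> float:
    """Size of the q-gram intersection divided by the size of the union."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0  # assumption: two strings with no q-grams count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def dice_coefficient(a: str, b: str, q: int = 2) -> float:
    """Twice the q-gram intersection size divided by the sum of both set sizes."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0  # same assumption as above
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    # "night" -> {ni, ig, gh, ht}; "nacht" -> {na, ac, ch, ht}; one shared bigram.
    print(jaccard_similarity("night", "nacht"))  # 1 / 7
    print(dice_coefficient("night", "nacht"))    # 2 / 8 = 0.25
```

Both functions return 1.0 for identical inputs and 0.0 when no q-grams are shared, so their outputs can be compared directly with other normalised measures.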