Gestalt pattern matching
Gestalt Pattern Matching[1], also Ratcliff/Obershelp Pattern Recognition[2], is a string-matching algorithms for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in the Dr. Dobb's Journal in July 1988.[2]
Algorithm
The similarity of two string and is determined by the formula, calculating twice the number of matching [[Character (symbol)|characters] divided by the total number of characters of both strings. The matching characters are defined as the longest common substring (LCS) plus recursively the number of matching characters in the non-matching regions on both sides of the LCS:[2]
where the similarity metric can take a value between zero and one:
The value of 1 stands for the complete match of the two strings, whereas the value of 0 means there is no match and not even one common letter.
Sample
S1 | W | I | K | I | M | E | D | I | A |
---|---|---|---|---|---|---|---|---|---|
S2 | W | I | K | I | M | A | N | I | A |
The longest common substring is WIKIM
(dark grey) with 5 characters. There is no further substring on left. The non-matching substrings on right side are EDIA
and ANIA
. They again have a longest common substring IA
(light gray) with length 2.
The similarity metric is determined by:
Properties
Complexity
The execution time of the algorithm is O in worst case and in average case. By changing the computing method the execution time can be improved significantly.[1]
Commutative property
It can be shown, that Gestalt Pattern Matching Algorithm is not commutative: [4]
- Sample
For the two strings
and
the metric result for
- is with the substrings
GESTALT
,T
,E
,R
,I
and for - the metric is with the substrings
GESTALT
,T
,H
,I
.
Applications
The algorithm became a basis of the Python difflib
library, which was introduced in version 2.1.[1] Due to the unfavourable runtime behaviour of this similarity metric, three methods have been implemented. Two of them returns an upper bound in a faster execution time.[1] The fastest variant only compares the length of the two substrings:[5]
- ,
# Drqr Implementation in Python
def real_quick_ratio(s1, s2):
"""Return an upper bound on ratio() very quickly."""
l1, l2 = len(s1), len(s2)
length = l1 + l2
if not length:
return 1.0
return 2.0 * min(l1, l2) / length
The second upper bound calculates twice the sum of all used characters which occur in divided by the lenght od both strings but the sequence is ignored.
- ,
# Dqr Implementation in Python
def quick_ratio(s1, s2):
"""Return an upper bound on ratio() relatively quickly."""
length = len(s1), len(s2)
if not length:
return 1.0
intersect = collections.Counter(s1) & collections.Counter(s2)
matches = sum(intersect.values())
return 2.0 * matches / length
Trivially the following applies:
- .
Complexity
The execution timr of this particular Python implementation was statet as in worst case and in best case.[1]
References
- ^ a b c d e difflib — Helpers for computing deltas inside the Python documentation
- ^ a b c National Institute of Standards and Technology Ratcliff/Obershelp pattern recognition
- ^ Ilya Ilyankou: Comparison of Jaro-Winkler and Ratcliff/Obershelp algorithms in spell check, May 2014 (PDF)
- ^ How does Pythons SequenceMatcher work? at stackoverflow.com
- ^ Borrowed from Python 3.7.0, difflib.py Lines 38–41 and 676–686
Further Reading
- John W. Ratcliff und David Metzener: Pattern Matching: The Gestalt Approach, Dr. Dobb's Journal, Seile 46, Juli 1988