Rabin–Karp algorithm

The Rabin–Karp or Karp–Rabin algorithm in computer science is a form of rolling hash used in string searching that interprets the input as a polynomial over $\mathbb {Z} _{n}$ . $n$ is usually taken to be 2, and the resulting hash function is called a Rabin fingerprint. ^[1] For text of length n pattern of length m, its average and best case running time is O(n+km) for k occurrences of the pattern in space O(p), but its worst-case time is O(nm). It provides a marginal advantage on average over a naive string search algorithm; other algorithms have much better worst-case performance.

A practical application of the algorithm is detecting plagiarism. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.

String search

A naive string search scans the text string for the first character of a pattern until there's a match, then forward from that point until there's a mismatch against the pattern, and etc. For a random pattern and random text, the running time depends on the length n of the text string, the length m of the pattern, and the number k of characters in the character string. The search may be conceptualized as, at each location in the text, a pattern-matching sequence is done, where the probability of each successive corresponding character matching is 1/k, so the cumulative probability of matching is ${\frac {1}{k}}*{\frac {1}{k}}*...$ over the $m$ characters of the pattern thus:

running time $\tau \propto n\times \sum _{i=1}^{m}{\frac {1}{k^{i}}}$ where n is the length of the text, m is the length of the pattern and k is the cardinality of the character set.

As the pattern gets longer, $\sum _{\lim i\rightarrow \infty }{\frac {1}{k^{i}}}=1+{\frac {1}{k-1}}$ , so the running time is proportional to $n$ plus a small fraction, or nearly linear in $n$ for most character sets

For example, the set of printable ascii characters is 95; Searching for a pattern of random characters in that set of length of 3 in a random text string of length 1,000,000 would be expected to take 1,010,638 comparisons. But frequency distribution of characters in the character set is usually not random, so searching may be a little less linear or a little more linear depending on the distribution of letters in the pattern as well as the text. Searching for the in an English language text string might be as non-linear as string search can get in that context, because the is the most common word in English text.

With random text, a nearly linear result is the best that can be expected. Since even naive implementations achieve that result, effort has been focused on reducing worst case running time (see See also section). Better results depend on having knowledge of the text beforehand, either statistical knowledge, or compiled knowledge from preprocessing the text, like a dictionary.

Rolling hashing

The effort to do a character string comparison at each text position can be reduced by computing a hash code over the pattern string, and searching the text for matching hash codes. Thus string comparison is reduced to a single integer compare. Pattern-length portions of the text must still iterated over to produce the hash codes, but a rolling hash, adding and deleting characters from a window, enables all hash codes to be produced in time proportional to n. This is similar to the way a moving average is maintained.

Formally,

A hash function is a function $h:S^{\times }\to F$ where $S,F$ are finite sets. You may think of them as sets of bits, bytes, alphanumeric characters, etc.

A rolling hash function has an associated function $u:F\times S^{2}\times \mathbb {N} \to F$ satisfying $u(n,a,a',h(...a...))=h(...a'...)$

To form a Rabin-Karp hash function, pick a prime $p$ and do the operation (over $\mathbb {Z} _{n}$ ): $h(\{c_{n}\})=c_{1}p^{k-1}+c_{2}p^{k-2}+...+c_{k}$

The actual hash function must have a special kind of invertability.

Polynomial hashing

The key to the Rabin–Karp algorithm's performance is the efficient computation of hash values of the successive substrings of the text. The hash function described here is not a Rabin fingerprint, but it works equally well. It treats every substring as a number in some base, the base being usually the size of the character set.

For example, if the substring is "hi", the base is 256, and prime modulus is 101, then the hash value would be

 [(104 × 256 ) % 101  + 105] % 101  =  65
 (ASCII of 'h' is 104 and of 'i' is 105)

_{'%' is 'mod' or modulo, or remainder after integer division, operator}

Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See hash function for a much more detailed discussion. The essential benefit achieved by using a rolling hash such as the Rabin fingerprint is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.

For example, if we have text "abracadabra" and we are searching for a pattern of length 3, the hash of the first substring, "abr", using 256 as the base, and 101 as the prime modulus is:

// ASCII a = 97, b = 98, r = 114. 
hash("abr") =  [ ( [ ( [  (97 × 256) % 101 + 98 ] % 101 ) × 256 ] %  101 ) + 114 ]   % 101   =  4

We can then compute the hash of the next substring, "bra", from the hash of "abr" by subtracting the number added for the first 'a' of "abr", i.e. 97 × 256², multiplying by the base and adding for the last a of "bra", i.e. 97 × 256⁰. Like so:

//           old hash   -ve avoider   old 'a'   left base offset      base shift    new 'a'    prime modulus
hash("bra") =     [ ( 4   + 101         -  97 * [(256%101)*256] % 101 ) * 256         +    97 ] % 101              =  30

_{the last '* 256' is the shift of the subtracted hash to the left}

_{although ((256%101)*256)%101 is the same as 256² % 101, to avoid overflowing integer maximums when the pattern string is longer (e.g. 'Rabin-Karp' is 10 characters, 256⁹ is the offset without modulation ), the pattern length base offset is pre-calculated in a loop, modulating the result each iteration}

If we are matching the search string "bra", using similar calculation of hash("abr"),

hash'("bra") =  [ ( [ ( [ ( 98 × 256) %101  + 114] % 101 ) × 256 ] % 101) + 97 ] % 101 = 30

If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.

Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing the previous hash by the first character value, then multiplying by the new last character's value. The limitation, however, is the limited size of the integer data type and the necessity of using modular arithmetic to scale down the hash results, (see hash function article). Meanwhile, naive hash functions do not produce large numbers quickly, but, just like adding ASCII values, are likely to cause many hash collisions and hence slow down the algorithm. Hence the described hash function is typically the preferred one in the Rabin–Karp algorithm.

Rabin fingerprint

The Rabin fingerprint is a popular and effective rolling hash function.

Comparision with other string search algorithms

The Rabin–Karp algorithm is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior.

References

^ first1=Richard M.|last1=Karp|author1-link=Richard M. Karp|first2=Michael O.|last2=Rabin|author2-link=Michael O. Rabin|year=1987

Karp, Richard M.; Rabin, Michael O. (March 1987). "Efficient randomized pattern-matching algorithms". IBM Journal of Research and Development. 31 (2): 249–260. CiteSeerX 10.1.1.86.9502. doi:10.1147/rd.312.0249. {{cite journal}}: Invalid |ref=harv (help)
Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001-09-01) [1990]. "The Rabin–Karp algorithm". Introduction to Algorithms (2nd ed.). Cambridge, Massachusetts: MIT Press. pp. 911–916. ISBN 978-0-262-03293-3.
Candan, K. Selçuk; Sapino, Maria Luisa (2010). Data Management for Multimedia Retrieval. Cambridge University Press. pp. 205–206. ISBN 978-0-521-88739-7. (for the Bloom filter extension)

External links

"Rabin–Karp Algorithm/Rolling Hash" (PDF). MIT 6.006: Introduction to Algorithms 2011- Lecture Notes. MIT.

[1] rst1=Richard M.|last1=Karp|author1-link=Richard M. Karp|first2=Michael O.|last2=Rabin|author2-link=Michael O. Rabin|year=1987

[1]

v t e Strings
String metric	Approximate string matching Bitap algorithm Damerau–Levenshtein distance Edit distance Gestalt pattern matching Hamming distance Jaro–Winkler distance Lee distance Levenshtein automaton Levenshtein distance Wagner–Fischer algorithm
String-searching algorithm	Apostolico–Giancarlo algorithm Boyer–Moore string-search algorithm Boyer–Moore–Horspool algorithm Knuth–Morris–Pratt algorithm Rabin–Karp algorithm Raita algorithm Trigram search Two-way string-matching algorithm Zhu–Takaoka string matching algorithm
Multiple string searching	Aho–Corasick Commentz-Walter algorithm
Regular expression	Comparison of regular-expression engines Regular grammar Thompson's construction Nondeterministic finite automaton
Sequence alignment	BLAST Hirschberg's algorithm Needleman–Wunsch algorithm Smith–Waterman algorithm
Data structure	DAFSA Substring index Suffix array Suffix automaton Suffix tree Compressed suffix array LCP array FM-index Generalized suffix tree Rope Ternary search tree Trie
Other	Parsing Pattern matching Compressed pattern matching Longest common subsequence Longest common substring Sequential pattern mining Sorting String rewriting systems String operations