Rabin–Karp algorithm

The Rabin–Karp or Karp–Rabin algorithm in computer science is a form of rolling hash used in string searching that interprets the input as a polynomial over $\mathbb {Z} _{n}$ . $n$ is usually taken to be 2, and the resulting hash function is called a Rabin fingerprint. ^[1] For text of length n and pattern of length m, its average and best case running time is O(n+km) for k occurrences of the pattern in space O(p), but its worst-case time is O(nm). It provides a marginal advantage on average over a naive string search algorithm; other algorithms have much better worst-case performance.

String search

A naive string search scans the text string for the first character of a pattern until there's a match, then forward from that point until there's a mismatch against the pattern, and etc. For a random pattern and random text, the running time depends on the length n of the text string, the length m of the pattern, and the number k of characters in the character string. The search may be conceptualized as, at each location in the text, a pattern-matching sequence is done, where the probability of each successive corresponding character matching is 1/k, so the cumulative probability of matching is ${\frac {1}{k}}*{\frac {1}{k}}*...$ over the $m$ characters of the pattern thus:

running time $\tau \propto n\times \sum _{i=1}^{m}{\frac {1}{k^{i}}}$ where n is the length of the text, m is the length of the pattern and k is the cardinality of the character set.

As the pattern gets longer, $\sum _{\lim i\rightarrow \infty }{\frac {1}{k^{i}}}=1+{\frac {1}{k-1}}$ , so the running time is proportional to $n$ plus a small fraction, or nearly linear in $n$ for most character sets

For example, the set of printable ascii characters is 95; Searching for a pattern of random characters in that set of length of 3 in a random text string of length 1,000,000 would be expected to take 1,010,638 comparisons. But frequency distribution of characters in the character set is usually not random, so searching may be a little less linear or a little more linear depending on the distribution of letters in the pattern as well as the text. Searching for the in an English language text string might be as non-linear as string search can get in that context, because the is the most common word in English text.

With random text, a nearly linear result is the best that can be expected. Since even naive implementations achieve that result, effort has been focused on reducing worst case running time (see See also section). Better results depend on having knowledge of the text beforehand, either statistical knowledge, or compiled knowledge from preprocessing the text, like a dictionary.

Rolling hashing

The effort to do a character string comparison at each text position can be reduced by computing a hash code over the pattern string, and searching the text for matching hash codes. Thus string comparison is reduced to a single integer compare. Pattern-length portions of the text must still iterated over to produce the hash codes, but a rolling hash, adding and deleting characters from a window, enables all hash codes to be produced in time proportional to n. This is similar to the way a moving average is maintained.

Formally,

A hash function is a function $h:S^{\times }\to F$ where $S,F$ are finite sets. You may think of them as sets of bits, bytes, alphanumeric characters, etc.

A rolling hash function has an associated function $u:F\times S^{2}\times \mathbb {N} \to F$ satisfying $u(n,a,a',h(...a...))=h(...a'...)$

To form a Rabin-Karp hash function, pick a prime $p$ and do the operation (over $\mathbb {Z} _{n}$ ): $h(\{c_{n}\})=c_{1}p^{k-1}+c_{2}p^{k-2}+...+c_{k}$

The actual hash function must have a special kind of invertability.

Polynomial hashing

Rabin fingerprint

The Rabin fingerprint is a popular and effective rolling hash function.

Comparision with other string search algorithms

The Rabin–Karp algorithm is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior.

References

^ first1=Richard M.|last1=Karp|author1-link=Richard M. Karp|first2=Michael O.|last2=Rabin|author2-link=Michael O. Rabin|year=1987

Karp, Richard M.; Rabin, Michael O. (March 1987). "Efficient randomized pattern-matching algorithms". IBM Journal of Research and Development. 31 (2): 249–260. CiteSeerX 10.1.1.86.9502. doi:10.1147/rd.312.0249. {{cite journal}}: Invalid |ref=harv (help)
Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001-09-01) [1990]. "The Rabin–Karp algorithm". Introduction to Algorithms (2nd ed.). Cambridge, Massachusetts: MIT Press. pp. 911–916. ISBN 978-0-262-03293-3.
Candan, K. Selçuk; Sapino, Maria Luisa (2010). Data Management for Multimedia Retrieval. Cambridge University Press. pp. 205–206. ISBN 978-0-521-88739-7. (for the Bloom filter extension)

External links

"Rabin–Karp Algorithm/Rolling Hash" (PDF). MIT 6.006: Introduction to Algorithms 2011- Lecture Notes. MIT.

[1] rst1=Richard M.|last1=Karp|author1-link=Richard M. Karp|first2=Michael O.|last2=Rabin|author2-link=Michael O. Rabin|year=1987

[1]

v t e Strings
String metric	Approximate string matching Bitap algorithm Damerau–Levenshtein distance Edit distance Gestalt pattern matching Hamming distance Jaro–Winkler distance Lee distance Levenshtein automaton Levenshtein distance Wagner–Fischer algorithm
String-searching algorithm	Apostolico–Giancarlo algorithm Boyer–Moore string-search algorithm Boyer–Moore–Horspool algorithm Knuth–Morris–Pratt algorithm Rabin–Karp algorithm Raita algorithm Trigram search Two-way string-matching algorithm Zhu–Takaoka string matching algorithm
Multiple string searching	Aho–Corasick Commentz-Walter algorithm
Regular expression	Comparison of regular-expression engines Regular grammar Thompson's construction Nondeterministic finite automaton
Sequence alignment	BLAST Hirschberg's algorithm Needleman–Wunsch algorithm Smith–Waterman algorithm
Data structure	DAFSA Substring index Suffix array Suffix automaton Suffix tree Compressed suffix array LCP array FM-index Generalized suffix tree Rope Ternary search tree Trie
Other	Parsing Pattern matching Compressed pattern matching Longest common subsequence Longest common substring Sequential pattern mining Sorting String rewriting systems String operations