Jump to content

Rabin–Karp algorithm

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Sbalfour (talk | contribs) at 14:56, 28 September 2019 (missing word). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The Rabin–Karp or Karp–Rabin algorithm in computer science is a form of rolling hash used in string searching that interprets the input as a polynomial over . is usually taken to be 2, and the resulting hash function is called a Rabin fingerprint. [1] For text of length n and pattern of length m, its average and best case running time is O(n+km) for k occurrences of the pattern in space O(p), but its worst-case time is O(nm). It provides a marginal advantage on average over a naive string search algorithm; other algorithms have much better worst-case performance.

A naive string search scans the text string for the first character of a pattern until there's a match, then forward from that point until there's a mismatch against the pattern, and etc. For a random pattern and random text, the running time depends on the length n of the text string, the length m of the pattern, and the number k of characters in the character string. The search may be conceptualized as, at each location in the text, a pattern-matching sequence is done, where the probability of each successive corresponding character matching is 1/k, so the cumulative probability of matching is over the characters of the pattern thus:

running time where n is the length of the text, m is the length of the pattern and k is the cardinality of the character set.

As the pattern gets longer, , so the running time is proportional to plus a small fraction, or nearly linear in for most character sets

For example, the set of printable ascii characters is 95; Searching for a pattern of random characters in that set of length of 3 in a random text string of length 1,000,000 would be expected to take 1,010,638 comparisons. But frequency distribution of characters in the character set is usually not random, so searching may be a little less linear or a little more linear depending on the distribution of letters in the pattern as well as the text. Searching for the in an English language text string might be as non-linear as string search can get in that context, because the is the most common word in English text.

With random text, a nearly linear result is the best that can be expected. Since even naive implementations achieve that result, effort has been focused on reducing worst case running time (see See also section). Better results depend on having knowledge of the text beforehand, either statistical knowledge, or compiled knowledge from preprocessing the text, like a dictionary.

Rolling hashing

The effort to do a character string comparison at each text position can be reduced by computing a hash code over the pattern string, and searching the text for matching hash codes. Thus string comparison is reduced to a single integer compare. Pattern-length portions of the text must still iterated over to produce the hash codes, but a rolling hash, adding and deleting characters from a window, enables all hash codes to be produced in time proportional to n. This is similar to the way a moving average is maintained.

Formally,

A hash function is a function where are finite sets. You may think of them as sets of bits, bytes, alphanumeric characters, etc.

A rolling hash function has an associated function satisfying

To form a Rabin-Karp hash function, pick a prime and do the operation (over ):

The actual hash function must have a special kind of invertability.

Polynomial hashing

Rabin fingerprint

The Rabin fingerprint is a popular and effective rolling hash function.

Comparision with other string search algorithms

The Rabin–Karp algorithm is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior.

See also

References

  1. ^ first1=Richard M.|last1=Karp|author1-link=Richard M. Karp|first2=Michael O.|last2=Rabin|author2-link=Michael O. Rabin|year=1987
  • Karp, Richard M.; Rabin, Michael O. (March 1987). "Efficient randomized pattern-matching algorithms". IBM Journal of Research and Development. 31 (2): 249–260. CiteSeerX 10.1.1.86.9502. doi:10.1147/rd.312.0249. {{cite journal}}: Invalid |ref=harv (help)
  • Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001-09-01) [1990]. "The Rabin–Karp algorithm". Introduction to Algorithms (2nd ed.). Cambridge, Massachusetts: MIT Press. pp. 911–916. ISBN 978-0-262-03293-3.
  • Candan, K. Selçuk; Sapino, Maria Luisa (2010). Data Management for Multimedia Retrieval. Cambridge University Press. pp. 205–206. ISBN 978-0-521-88739-7. (for the Bloom filter extension)