Jump to content

Rabin–Karp algorithm

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Watcher (talk | contribs) at 10:58, 28 May 2004 (big article about Rabin-Karp string search algorithm). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
(diff) ← Previous revision | Latest revision (diff) | Newer revision → (diff)

Rabin-Karp algorithm is a string searching algorithm that seeks a patter, i.e. a substring, within a text by using hashing. It is not widely used for single pattern matching, but is of considerable theoretical importance and is very effective for multiple pattern matching. Historically, it predates the very popular Knuth-Morris-Pratt algorithm. For text of length n and pattern of length m, its average and best case running time is proportional to n, but the worst case scenario is proportional to n*m, which is one of the reasons why it is not widely used.

Shifting substrings search and competing algorithms

The basic problem the algorithm addresses is finding a pattern (fixed substring) of length m within a text of length n, for example "sun" in "Hello sunshine in this vale of tears." By way of comparison of the algorithms, a naive brute force string searching algorithm involves aligning the first character of the pattern with the first character of the text and checking if the pattern matches the initial substring of equivalent length. If it does not match, the pattern is shifted right by 1 and the procedure is repeated, for a worst case running time of m*n. Knuth-Morris-Pratt-algorithm is an elegant elaboration on this naive algorithm that uses precomputed data to skip forward not by 1 character, but by as many as possible for the search to succeed.

Rabin-Karp algorithm takes a very different approach. Rather than pursuing more sophisticated skipping, it seeks to speed up the testing of equality of the pattern to the substrings in the text by using hashing. Hashing a string means computing a numerical value from the value of its characters using some standard hash function (a very primitive one could involve multiplying the ASCII values of all of its characters together). Rabin-Karp exploits the fact that if 2 strings are equal, their hash values are also equal. Note that the converse of this statement is not valid, although good hash functions seek to reduce the number of such hash collisions. Rabin-Karp computes hash value of the pattern, and then goes through the text computing hash values of all of its substrings and checking whether or not the pattern's hash is equal to the substring hash, and advancing by 1 character every time. If the two numbers are indeed discovered to be equal, then the algorithm has to verify that the two string really are equal, rather than this being a fluke of the hashing scheme. It does so using regular string comparison. It is this verification that in the worst case can drag down algorithm performance down to the level of the naive string search. This general approach is sometimes referred to as the use of a virtual hash table.

Hash function used

The key to Rabin-Karp performance is the efficient computation of hash values of the successive substrings of the text. Rabin-Karp achieves this by treating every substring as a number in some base, the base being usually a big prime. For example, if the substring is "hi" and the base is 101, the hash value would be 104 * 101^1 + 105 * 101^0 = 10609 (ASCII of 'h' is 104 and of 'i' is 105). Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See hash function for much more detailed discussion. The essential benefit achieved by such representation is that is is possible to compute the hash value of the next substring from the previous one by doing only a finite number of operations, independent of the substrings' length. For example, if we have text "abracadabra" and we are searching for a pattern of length 3, we can compute the hash of "bra" from the hash for "abr" (the previous substring) by subtracting the number added for the first 'a' of "abr", i.e. 97 * 101^2 (97 is ASCII for 'a' and 101 is the base we are using) and adding that for the last a of "bra", i.e. 97 * 101^0 = 97. If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes. Theoretically, there exist other algorithms that could provide convenient recompution, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing by the first character and multiplying by the last. The limitation, however, is the limitations of the size of integer data type and the necessity of using modular arithmetic to scale down the hash results, for which see hash function article. Hence the number in arbitrary base hash function is the preferred one in Rabin-Karp.

Rabin-Karp is inferior to Knuth-Morris-Pratt algorithm, Boyer-Moore algorithm and other faster single pattern string search algorithms because of its slow worst case behavior. However, Rabin-Karp is an algorithm of choice for multiple pattern search. That is, if we want to find any of a large number, say k, fixed length patterns in a text, a variant Rabin-Karp that uses a hash table to check whether the hash of a given string belongs to a set of hash values of patterns we are looking for. Other algorithms can search for a single pattern in time order n, hence they will search for k patterns in time order n*k. The variant Rabin-Karp will still work in time order n in the best and average case because a hash table allows to check whether or not substring hash equals any of the pattern hashes in order of 1 time.