Talk:Rabin–Karp algorithm

Computer science C‑class Mid‑importance

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science

C

This article has been rated as C-class on Wikipedia's content assessment scale.

Mid

This article has been rated as Mid-importance on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

Extending the algorithm

I think it would be nice if the article discussed extending the algorithm for 2 dimensional pattern matching, as well as giving some optimizations in the case of having varying string lengths for multi-pattern matching case.

Concerns

In the pseudocode of the RK string search algorithm, if hs is given as hash(s[1..n]), the algorithm can not detect a target pattern which occurs on the first m bytes of a given text. This is why I reverted it to be hash(s[1...m]).

One arduous but impetuous, so-called, vandalism detector claims that it should be hash(s[1...n]), because, when m is larger than n, the code crashes. That is true, but I reckon such a range check matter is not a major issue in describing the fundamentals of the algorithm.

Considering the following sentence which is excerpted from the article and is supposed to be written by one of the original contributors,

"Lines 2,3, and 6 each require omega(m) time."

if one still thinks hs := hash(s[1...n]) is correct, why doesn't he/she also modify the above sentence too?

footnote) The line 3 of the pseudocode has been in the form of hash(s[1..m]) for the last six months or so since its original edition by the original writer.

--129.254.65.18 08:35, 23 May 2005 (UTC)[reply]

I agree with you. We are assuming m ≤ n, and if [1..n] is used, the algorithm won't work, because it'd be trying to compare hashes of strings of length n to hashes of strings of length m. Deco 21:41, 23 May 2005 (UTC)[reply]

Line 8 of the multi-pattern-search pseudo-code seems wrong to me

Quote: if s[i..i+m-1] = a substring with hash hs This is exactly the same condition as in the line above (element of hash). As pointed out corretly in the single patters search - to rule out a hash collision we need to compare with the actual search string(s). (Plural only if by fat chance or denial of service attack, two search strings have the same hash.)

Or am I missing something obvious here?--iSee 10:16, 28 July 2005 (UTC)[reply]

I need to correct myself. The code is ok, although I don't quite get how the rolling hash would compare varying key-lengths. I improved the code indentation so the code is more readable iSee

Are you sure this is OK? I just came to the talk page with exactly the same concern: line 8 doesn't make sense. We *know* that "s[i..i+m-1] = a substring with hash hs" because we set hs to satisfy this on the last line of the previous iteration. I think what line 8 is *meant* to do is check that s[i..i+m-1] is a member of subs -- i.e. do a string match to ensure this is not a hash collision. *Something* is wrong because when returning at line 9, nothing has checked against hash collisions. (As a side-issue, the nested if statements seem unnecessary and would be cleaner with an "and", and the braces around the function are inconsistent with indentation-nesting and with the rest of the article.)

I think the pseudocode could be changed to the following, which I'm doing now as I'm confident it's wrong currently.

 function RabinKarpSet(string s[1..n], set of string subs, m):
     set hsubs := emptySet
     for each sub in subs
         insert hash(sub[1..m]) into hsubs
     hs := hash(s[1..m])
     for i from 1 to n-m+1
         if hs ∈ hsubs and s[i..i+m-1] ∈ subs
             return i
         hs := hash(s[i+1..i+m])
     return not found

Jameshfisher (talk) 09:54, 22 May 2010 (UTC)[reply]

No worst case

I removed the following text in the "multiple pattern search" section: "We can also ensure O(m n log k) time in the worst case, where m is the length of the longest of the k strings, by storing the hashes in a self-balancing binary search tree instead of a hash table." Rabin-Karp simply doesn't make sense in the worst case, because the whole idea of hashing requires randomization and can only work in expectation (or with high probability).

That's true. My mistake. Deco 02:51, 31 July 2005 (UTC)[reply]

Suggestions

Firstly, it seems like

 1 function RabinKarp(string s[1..n], string sub[1..m])
 2     hsub := hash(sub[1..m])
 3     hs := hash(s[1..m])
 4     for i from 1 to n
 5         if hs = hsub
 6             if s[i..i+m-1] = sub
 7                 return i
 8         hs := hash(s[i+1..i+m])
 9     return not found

could be shortened to

 1 function RabinKarp(string s[1..n], string sub[1..m])
 2     hsub := hash(sub[1..m])
 3     for i from 1 to n
 4         hs := hash(s[i..i+m-1])
 5         if hs = hsub
 6             if s[i..i+m-1] = sub
 7                 return i
 8     return not found

Aside from being a line shorter, I find this more clearly displays what hs is for any given iteration. The same comment applies to the other example of multi-pattern matching. (Edit: Ignore this suggestion--of course you want to keep the structure of the code the way it is to make it easy to introduce rolling hash functions.)

On the topic of the multi-pattern matching code, I think it is worth noting in both the code and the surrounding discussion the possibility for hash table collisions. As it stands, one might wrongly infer from the code that the only possibility is that a hash table either has no entry or a single entry for any given hash code.

Thanks for looking over the article. The collisions are internal to the implementation of the hash table. Here we are treating it as an abstract set data structure, and don't even really care whether it's a hash table, binary search tree, or association list. If you think you can make this clearer please feel free. Deco 05:14, 15 November 2005 (UTC)[reply]

StTwister 15:39, 25 January 2007 (UTC): In order to be possible to change compute the hash of the next subsequence of S in constant time (to keep a low complexity), the hash must first be calculated outisde the loop entirely (with a 1 to M iteration) and then only updated inside the loop (remove the ith element from the hash and add the (i+m)-th element to the hash). So I think it's clearer if left this way.[reply]

May 23 2007: CRCs are another possible rolling hash function. If you google "Rabin Hash", you'll find a SourceForge project implementing a CRC in Java (though they call it a Rabin fingerprint instead of a CRC).

Are CRCs another possible rolling hash function? If so, please mention CRCs in the "rolling hash" article. --DavidCary (talk) 12:25, 8 June 2015 (UTC)[reply]

Some pseudocode fixes (both single and multiple pattern search)

Firstly, changed this in single pattern search:

 hs := hash(s[1..n])

to

 hs := hash(s[1..m])

Secondly, changed the loop from (1 to n) to (1 to n-m+1), because from that points onwards, the hash function would be impossible to calculate (it would get out of sub's range), in both single and multiple pattern search.

Changes by StTwister 15:39, 25 January 2007 (UTC)[reply]

Invalid character in last code segmet for searching for server strings

Perhaps it is a browser problem, but on line 7 of the pseudo-code, it appears to be

        if hs ∈ hsubs

which looks like "if hs SQUARE-SYMBOL hsubs". If that is the intended symbol, there should be some explanation of its meaning. —Preceding unsigned comment added by 134.129.106.244 (talk) 00:17, 30 March 2008 (UTC)[reply]

That's a browser issue, it's the element-of mathematical symbol. What browser are you on? I generally recommend not using characters that IE6 does not render. Dcoetzee 21:56, 21 April 2008 (UTC)[reply]

Reference for multiple pattern search

Probably there is something I have done wrong. But I took a look at the original paper (where the wiki points to) and also the section in the "Introduction to Algorithms" book, but none of them talk about the section in the wiki that explains mulitple pattern search using Bloom filter. Could the reference for that part be posted? is that another paper newer than the original? I haven't been able to find the source.Michael.Do.Wiki (talk) 20:25, 11 March 2009 (UTC)[reply]

Correction for multiple pattern search

Referring to

Here we assume all the substrings have a fixed length m, but this assumption can be eliminated. We simply compare the current hash value against the hash values of all the substrings simultaneously using a quick lookup in our set data structure, and then verify any match we find against all substrings with that hash value.

It is not enough to compare the current hash value against the hash values of all the substrings because the current hash value is the hash of the 'm' chars from current location and we would like to check all the lengths from the current location.

For example we have three substrings: "apple" "chair" "cube" and our search text is "the cubes are blue". when we reach current location=4 in the search text ('c') (assuming index starts from 0). So our substring lengths ('m' values) are 5, 5 and 4. So we want to compare the hashes from the search text of:

1. "cubes" (m=5) with hashes of substrings with length of 5 ("apple" & "chair")

and

2. "cube" (m=4) with the hashes of substrings with length of 4 ("cube") —Preceding unsigned comment added by Thedrs (talk • contribs) 15:51, 16 June 2009 (UTC)[reply]

I agree that the paragraph is highly questionable. --Lukax (talk) 22:36, 3 December 2010 (UTC)[reply]

Disagreement regarding multiple pattern search

Aho and Corasick's algorithm from 1975 finds k patterns in a text of length n in O(n) time after preprocessing. This fact contradicts the statement currently made in the article, namely that this property is unique to Rabin-Karp. 12:20, 19 August 2009 (UTC) —Preceding unsigned comment added by AmirOnWiki (talk • contribs)

To address above concern I have added a link where string search algorithms using finite set of patterns are listed. --Mahmutuludag (talk) 17:18, 19 November 2009 (UTC)[reply]

Restore of simple algorithm

I restored the simple algorithm presentation to this page on the grounds that other parts of that section refer to it, making the overall page hard to understand if it isn't there. I hate calling other WP users incompetent, but this felt a lot like it; the removalist didn't even try to read the page afterwards to see if it made basic sense! (Whether or not the example should be here is not my concern; the page should at least read as a consistent entity...) --Donal Fellows (talk) 22:12, 3 January 2010 (UTC)[reply]

Time complexity of multiple pattern search

The time complexity initialization stage of the algorithm (lines 3,4) is O(mk). The reason is that each hash calculation takes O(m) Time and we have k of them. Thus, the overall time complexity should be O(max(n,mk)). Elhanan mishraki (talk) 20:19, 17 May 2011 (UTC)[reply]

Suggested move

The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review. No further edits should be made to this section.

The result of the move request was: Not moved. EdJohnston (talk) 04:23, 14 May 2014 (UTC)[reply]

Rabin–Karp algorithm → Karp–Rabin algorithmKarp–Rabin algorithm – There is no obvious reason for Rabin to be listed first. The author order on the original paper was Karp–Rabin (matching the usual alphabetical ordering convention for authors in this subject) and Google scholar has a somewhat larger number of hits for Karp–Rabin than for Rabin–Karp. Relisted and note left at WT:COMPSCI. Favonian (talk) 20:02, 5 May 2014 (UTC) —David Eppstein (talk) 22:31, 27 April 2014 (UTC)[reply]

Support With both forms in use, I'm fine defaulting to alphabetical order. --BDD (talk) 17:50, 29 April 2014 (UTC)[reply]
Neutral. Either is good for me. Cormen uses RK, and RK sounds better KR (which may be why author order was switched -- leading with a double stop is awkward). Glrx (talk) 23:37, 29 April 2014 (UTC)[reply]
Strong, speedy support - dang, somewhere someone messed up big time to botch this order. Follow the order they wrote it in. Red Slash 00:39, 30 April 2014 (UTC)[reply]
Oppose – in both scholar search and book search, Rabin–Karp is somewhat more common than Karp–Rabin (at least, when I look). I don't know why, but I think it's OK to stick with what it's more commonly called. Unless someone can show that these are due to wiki mirroring. Dicklyon (talk) 04:27, 2 May 2014 (UTC)[reply]
- FWIW, Google scholar gives me about 820 hits for "Karp-Rabin" (as a quoted string) and about 733 for "Rabin-Karp". I don't put much stock in general web searches for this sort of topic. —David Eppstein (talk) 05:05, 2 May 2014 (UTC)[reply]
  - I included "algorithm" in my search, and the results were the other way around, unless I messed up. I agree web search would be much less useful for any kind of counting. Dicklyon (talk) 05:30, 2 May 2014 (UTC)[reply]
    - A lot of the hits are for things like "Karp–Rabin matching" or "Karp–Rabin string matching" rather than "algorithm", though. They do all seem to be valid hits for this topic rather than for something else those two might have done together. —David Eppstein (talk) 05:43, 2 May 2014 (UTC)[reply]
Oppose. Historically it was called Karp-Rabin after the original paper, but current and recent usage is Rabin-Karp. We should create a redirect from the other name, of course, agree that somewhere someone messed up big time not to create one. Easily fixed. Andrewa (talk) 00:53, 5 May 2014 (UTC)[reply]
- [Citation needed]. CLRS uses Rabin-Karp, but why? And what is the evidence that others follow them? —David Eppstein (talk) 01:07, 5 May 2014 (UTC)[reply]
  - My observation was based on a quick look at Google books. [1] [2] It's not an overwhelming result, but the trend looked clear to me. Certainly enough to require more evidence before moving. Andrewa (talk) 08:02, 7 May 2014 (UTC)[reply]

The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page or in a move review. No further edits should be made to this section.

confusing example data in section 'Hash function used'

The example text, "abracadabra" with a pattern length of 3 results in the same character, "a", being used to remove from the old hash and added to the next hash. While, perfectly valid, but it distracts from following the flow of the description of the algorithm. The description tries to clarify this by referring to "old 'a'" and "new 'a'", but that only adds more noise. It's not that this algorithm is all that hard to understand. It's just that it's irritating and unnecessarily distracting. This could have been avoided by using a different example text.

Also, adding a comma as the thousands separator to decimal values seems strange in the context of a computer science example. My first thought when seeing these numbers was, "Where did these tuple values come from? Did I miss something?" You don't expect this kind of formatting when reading about algorithms... Not to mention that it's not localized. What if my locale uses '.' for the thousands divider? Why make it messy and confusing? It's not as if keeping track of the thousands position is important to understanding the algorithm. This isn't a financial spreadsheet. --50.0.241.18 (talk) 19:34, 22 February 2016 (UTC)[reply]

The hash function used is wrongly referred to as a Rabin Fingerprint

Although the hash function described is indeed a rolling hash, this is NOT the original Rabin fingerprint. A Rabin fingerprint is the result of a modulo operation (rest of a division) of the hashed string, taken as a sequence of bits (not bytes), and interpreted as a polynomial over GF(2), by another polynomial irreducible over GF(2). This is NOT equivalent to a modulo operation with an actual prime number. Multiplications and Additions of polynomials over GF(2) do NOT equate to multiplications and additions over n-bits integers, mainly because there is no carry. See https://en.wikipedia.org/wiki/Rabin_fingerprint for a more detailed explanation. It contains a link to the original paper. The coefficients of the polynomial are the bits, not the bytes (this is not a typo).

Note that the pdf linked as a reference does NOT actually refer to the fingerprint function being used as a Rabin fingerprint, so it is correct. The Rabin Karp algorithm does not require using the Rabin Fingerprint, but any rolling hash will work. — Preceding unsigned comment added by Chmduquesne (talk • contribs) 11:32, 30 November 2017 (UTC)[reply]