Talk:Rabin–Karp algorithm
WikiProject Computer science (Unassessed, Mid-importance)
Extending the algorithm
I think it would be nice if the article discussed extending the algorithm to 2-dimensional pattern matching, as well as giving some optimizations for the case of varying string lengths in multi-pattern matching.
Concerns
In the pseudocode of the RK string search algorithm, if hs is given as hash(s[1..n]), the algorithm cannot detect a target pattern that occurs in the first m bytes of a given text. This is why I reverted it to hash(s[1..m]).
One arduous but impetuous so-called vandalism detector claims that it should be hash(s[1..n]), because the code crashes when m is larger than n. That is true, but I reckon such a range check is not a major issue in describing the fundamentals of the algorithm.
Considering the following sentence, which is excerpted from the article and was presumably written by one of the original contributors,
"Lines 2,3, and 6 each require omega(m) time."
if one still thinks hs := hash(s[1..n]) is correct, why doesn't he/she also modify the above sentence?
Footnote: line 3 of the pseudocode has been in the form hash(s[1..m]) for the last six months or so, since its original edition by the original writer.
--129.254.65.18 08:35, 23 May 2005 (UTC)
- I agree with you. We are assuming m ≤ n, and if [1..n] is used, the algorithm won't work, because it'd be trying to compare hashes of strings of length n to hashes of strings of length m. Deco 21:41, 23 May 2005 (UTC)
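For concreteness, here is a minimal Python sketch of the single-pattern search as discussed above, with hs initialized over the first m characters (not all n) and the window verified on a hash match. The polynomial rolling hash with base 256 and a prime modulus is an illustrative assumption, not the hash the article prescribes.

```python
def rabin_karp(s, sub, base=256, mod=1_000_003):
    """Return the 0-based index of the first occurrence of sub in s, or -1."""
    n, m = len(s), len(sub)
    if m == 0 or m > n:
        return -1
    hsub = hs = 0
    for i in range(m):                        # hash(sub[1..m]) and hash(s[1..m]) in 1-based terms
        hsub = (hsub * base + ord(sub[i])) % mod
        hs = (hs * base + ord(s[i])) % mod
    high = pow(base, m - 1, mod)              # weight of the window's leading character
    for i in range(n - m + 1):                # i from 1 to n-m+1 in the pseudocode
        if hs == hsub and s[i:i + m] == sub:  # verify to rule out a hash collision
            return i
        if i < n - m:                         # roll: drop s[i], append s[i+m]
            hs = ((hs - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return -1
```

For instance, rabin_karp("hello world", "world") returns 6, and a pattern occurring in the first m characters is found immediately, which is the point of initializing hs over s[1..m].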
Line 8 of the multi-pattern-search pseudo-code seems wrong to me
Quote: "if s[i..i+m-1] = a substring with hash hs". This is exactly the same condition as in the line above (the element-of check). As pointed out correctly in the single pattern search, to rule out a hash collision we need to compare with the actual search string(s). (Plural only if, by some slim chance or a denial-of-service attack, two search strings have the same hash.)
Or am I missing something obvious here?--iSee 10:16, 28 July 2005 (UTC)
- I need to correct myself. The code is OK, although I don't quite get how the rolling hash would handle varying key lengths. I improved the code indentation so the code is more readable. iSee
- Are you sure this is OK? I just came to the talk page with exactly the same concern: line 8 doesn't make sense. We *know* that "s[i..i+m-1] = a substring with hash hs" because we set hs to satisfy this on the last line of the previous iteration. I think what line 8 is *meant* to do is check that s[i..i+m-1] is a member of subs -- i.e. do a string match to ensure this is not a hash collision. *Something* is wrong because when returning at line 9, nothing has checked against hash collisions. (As a side-issue, the nested if statements seem unnecessary and would be cleaner with an "and", and the braces around the function are inconsistent with indentation-nesting and with the rest of the article.)
I think the pseudocode could be changed to the following, which I'm doing now as I'm confident it's wrong currently.
    function RabinKarpSet(string s[1..n], set of string subs, m):
        set hsubs := emptySet
        for each sub in subs
            insert hash(sub[1..m]) into hsubs
        hs := hash(s[1..m])
        for i from 1 to n-m+1
            if hs ∈ hsubs and s[i..i+m-1] ∈ subs
                return i
            hs := hash(s[i+1..i+m])
        return not found
Jameshfisher (talk) 09:54, 22 May 2010 (UTC)
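A rough Python rendering of the corrected set version above, assuming every pattern has length m as in the pseudocode; the specific rolling hash (base 256, prime modulus) is again only an illustrative choice.

```python
def rabin_karp_set(s, subs, m, base=256, mod=1_000_003):
    """Return the first 0-based index where some pattern in subs (all of length m) occurs in s, or -1."""
    n = len(s)
    if m == 0 or m > n:
        return -1

    def h(window):                          # hash of a length-m string
        v = 0
        for ch in window:
            v = (v * base + ord(ch)) % mod
        return v

    hsubs = {h(sub) for sub in subs}        # hashes of all patterns
    patterns = set(subs)                    # for the exact-match verification
    hs = h(s[:m])                           # hash of the first window
    high = pow(base, m - 1, mod)
    for i in range(n - m + 1):
        # hash check first, then the string check that rules out collisions (line 8's intent)
        if hs in hsubs and s[i:i + m] in patterns:
            return i
        if i < n - m:
            hs = ((hs - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return -1
```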
No worst case
I removed the following text in the "multiple pattern search" section: "We can also ensure O(m n log k) time in the worst case, where m is the length of the longest of the k strings, by storing the hashes in a self-balancing binary search tree instead of a hash table." Rabin-Karp simply doesn't make sense in the worst case, because the whole idea of hashing requires randomization and can only work in expectation (or with high probability).
- That's true. My mistake. Deco 02:51, 31 July 2005 (UTC)
Suggestions
Firstly, it seems like
    1 function RabinKarp(string s[1..n], string sub[1..m])
    2     hsub := hash(sub[1..m])
    3     hs := hash(s[1..m])
    4     for i from 1 to n
    5         if hs = hsub
    6             if s[i..i+m-1] = sub
    7                 return i
    8         hs := hash(s[i+1..i+m])
    9     return not found
could be shortened to
    1 function RabinKarp(string s[1..n], string sub[1..m])
    2     hsub := hash(sub[1..m])
    3     for i from 1 to n
    4         hs := hash(s[i..i+m-1])
    5         if hs = hsub
    6             if s[i..i+m-1] = sub
    7                 return i
    8     return not found
Aside from being a line shorter, I find this more clearly displays what hs is for any given iteration. The same comment applies to the other example of multi-pattern matching. (Edit: Ignore this suggestion; of course you want to keep the structure of the code the way it is to make it easy to introduce rolling hash functions.)
On the topic of the multi-pattern matching code, I think it is worth noting in both the code and the surrounding discussion the possibility for hash table collisions. As it stands, one might wrongly infer from the code that the only possibility is that a hash table either has no entry or a single entry for any given hash code.
- Thanks for looking over the article. The collisions are internal to the implementation of the hash table. Here we are treating it as an abstract set data structure, and don't even really care whether it's a hash table, binary search tree, or association list. If you think you can make this clearer please feel free. Deco 05:14, 15 November 2005 (UTC)
StTwister 15:39, 25 January 2007 (UTC): In order to compute the hash of the next subsequence of S in constant time (to keep the complexity low), the hash must first be calculated entirely outside the loop (with a 1 to m iteration) and then only updated inside the loop (remove the i-th element from the hash and add the (i+m)-th element to the hash). So I think it's clearer if left this way.
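A sketch of the constant-time update described above, assuming the same illustrative polynomial hash: the full O(m) hash is computed once before the loop, and each step only drops the outgoing character and appends the incoming one.

```python
def roll(hs, out_char, in_char, high, base=256, mod=1_000_003):
    """O(1) update: remove out_char (the i-th element) and add in_char (the (i+m)-th)."""
    return ((hs - ord(out_char) * high) * base + ord(in_char)) % mod

# high = pow(base, m - 1, mod) is precomputed once; recomputing hash(s[i..i+m-1])
# from scratch inside the loop would cost O(m) per position instead.
```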
May 23 2007: CRCs are another possible rolling hash function. If you google "Rabin Hash", you'll find a SourceForge project implementing a CRC in Java (though they call it a Rabin fingerprint instead of a CRC).
Some pseudocode fixes (both single and multiple pattern search)
Firstly, changed this in single pattern search:
hs := hash(s[1..n])
to
hs := hash(s[1..m])
Secondly, changed the loop from (1 to n) to (1 to n-m+1), because from that point onwards the hash function would be impossible to calculate (the window would run past the end of s), in both single and multiple pattern search.
Changes by StTwister 15:39, 25 January 2007 (UTC)
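A quick sanity check of that loop bound (a hypothetical example, using the pseudocode's 1-based indexing): with n = 10 and m = 3, the last window that still fits is s[8..10], so the loop must stop at i = n - m + 1 = 8.

```python
n, m = 10, 3
last_start = n - m + 1          # 8: window s[8..10] ends exactly at position n
assert last_start + m - 1 == n  # one step further, s[9..11], would run past the end of s
```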
Invalid character in last code segment for searching for several strings
Perhaps it is a browser problem, but on line 7 of the pseudo-code, it appears to be
if hs ∈ hsubs
which looks like "if hs SQUARE-SYMBOL hsubs". If that is the intended symbol, there should be some explanation of its meaning. —Preceding unsigned comment added by 134.129.106.244 (talk) 00:17, 30 March 2008 (UTC)
- That's a browser issue; it's the element-of mathematical symbol (∈). What browser are you on? I generally recommend not using characters that IE6 does not render. Dcoetzee 21:56, 21 April 2008 (UTC)
Reference for multiple pattern search
Probably there is something I have done wrong, but I took a look at the original paper (which the wiki points to) and also the relevant section of the "Introduction to Algorithms" book, and neither of them talks about the section in the wiki that explains multiple pattern search using a Bloom filter. Could the reference for that part be posted? Is that another paper, newer than the original? I haven't been able to find the source. Michael.Do.Wiki (talk) 20:25, 11 March 2009 (UTC)
Correction for multiple pattern search
Referring to
Here we assume all the substrings have a fixed length m, but this assumption can be eliminated. We simply compare the current hash value against the hash values of all the substrings simultaneously using a quick lookup in our set data structure, and then verify any match we find against all substrings with that hash value.
It is not enough to compare the current hash value against the hash values of all the substrings, because the current hash value is the hash of the m characters starting at the current location, and we would like to check all the pattern lengths at the current location.
For example, suppose we have three substrings, "apple", "chair" and "cube", and our search text is "the cubes are blue". When we reach location 4 in the search text ('c', assuming indices start from 0), our substring lengths (m values) are 5, 5 and 4. So we want to compare the hashes from the search text of:
1. "cubes" (m=5) with hashes of substrings with length of 5 ("apple" & "chair")
and
2. "cube" (m=4) with the hashes of substrings with length of 4 ("cube") —Preceding unsigned comment added by Thedrs (talk • contribs) 15:51, 16 June 2009 (UTC)
- I agree that the paragraph is highly questionable. --Lukax (talk) 22:36, 3 December 2010 (UTC)
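One way to handle varying pattern lengths, sketched in Python: group the patterns by length and keep a separate rolling hash per distinct length, verifying every hash hit against the real patterns. The grouping scheme and hash parameters are my illustration, not something the article prescribes.

```python
from collections import defaultdict

def rabin_karp_multi(s, subs, base=256, mod=1_000_003):
    """Return (index, pattern) for the leftmost match of any pattern in subs, or None.
    Patterns may have different lengths; one hash table is kept per distinct length."""
    n = len(s)
    by_len = defaultdict(dict)                      # length m -> {pattern hash: set of patterns}
    for sub in subs:
        m = len(sub)
        if 0 < m <= n:
            hv = 0
            for ch in sub:
                hv = (hv * base + ord(ch)) % mod
            by_len[m].setdefault(hv, set()).add(sub)
    best = None
    for m, table in by_len.items():
        hs = 0
        for ch in s[:m]:
            hs = (hs * base + ord(ch)) % mod
        high = pow(base, m - 1, mod)
        for i in range(n - m + 1):
            if hs in table and s[i:i + m] in table[hs]:   # verify against the real patterns
                if best is None or i < best[0]:
                    best = (i, s[i:i + m])
                break                                     # leftmost match for this length found
            if i < n - m:
                hs = ((hs - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return best
```

On the example above, rabin_karp_multi("the cubes are blue", {"apple", "chair", "cube"}) returns (4, 'cube'): the length-4 group matches at position 4, while the length-5 group (for "apple" and "chair") finds no match in the text.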
Disagreement regarding multiple pattern search
Aho and Corasick's algorithm from 1975 finds k patterns in a text of length n in O(n) time after preprocessing. This fact contradicts the statement currently made in the article, namely that this property is unique to Rabin-Karp. 12:20, 19 August 2009 (UTC) —Preceding unsigned comment added by AmirOnWiki (talk • contribs)
To address the above concern I have added a link where string search algorithms using a finite set of patterns are listed. --Mahmutuludag (talk) 17:18, 19 November 2009 (UTC)
Restore of simple algorithm
I restored the simple algorithm presentation to this page on the grounds that other parts of that section refer to it, making the overall page hard to understand if it isn't there. I hate calling other WP users incompetent, but this felt a lot like it; the removalist didn't even try to read the page afterwards to see if it made basic sense! (Whether or not the example should be here is not my concern; the page should at least read as a consistent entity...) --Donal Fellows (talk) 22:12, 3 January 2010 (UTC)
Time complexity of multiple pattern search
The time complexity of the initialization stage of the algorithm (lines 3, 4) is O(mk). The reason is that each hash calculation takes O(m) time and we have k of them. Thus, the overall time complexity should be O(max(n, mk)). Elhanan mishraki (talk) 20:19, 17 May 2011 (UTC)
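In symbols (expected time, ignoring the cost of verifying hash hits, and assuming O(1) expected set lookups and constant-time rolling updates), the point above is:

$$
T(n, m, k) \;=\; O(km) \;+\; O(n) \;=\; O(n + km) \;=\; O(\max(n, km)),
$$

where the O(km) term comes from hashing the k patterns (lines 3, 4) and the O(n) term from rolling the hash across the text.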
Suggested move
It has been proposed in this section that Rabin–Karp algorithm be renamed and moved to Karp–Rabin algorithm. The discussion may be closed 7 days after being opened, if consensus has been reached (see the closing instructions). Please base arguments on article title policy, and keep discussion succinct and civil.
Rabin–Karp algorithm → Karp–Rabin algorithm – There is no obvious reason for Rabin to be listed first. The author order on the original paper was Karp–Rabin (matching the usual alphabetical ordering convention for authors in this subject) and Google scholar has a somewhat larger number of hits for Karp–Rabin than for Rabin–Karp. Relisted and note left at WT:COMPSCI. Favonian (talk) 20:02, 5 May 2014 (UTC) —David Eppstein (talk) 22:31, 27 April 2014 (UTC)
- Support With both forms in use, I'm fine defaulting to alphabetical order. --BDD (talk) 17:50, 29 April 2014 (UTC)
- Neutral. Either is good for me. Cormen uses RK, and RK sounds better than KR (which may be why the author order was switched; leading with a double stop is awkward). Glrx (talk) 23:37, 29 April 2014 (UTC)
- Strong, speedy support - dang, somewhere someone messed up big time to botch this order. Follow the order they wrote it in. Red Slash 00:39, 30 April 2014 (UTC)
- Oppose – in both scholar search and book search, Rabin–Karp is somewhat more common than Karp–Rabin (at least, when I look). I don't know why, but I think it's OK to stick with what it's more commonly called. Unless someone can show that these are due to wiki mirroring. Dicklyon (talk) 04:27, 2 May 2014 (UTC)
- FWIW, Google scholar gives me about 820 hits for "Karp-Rabin" (as a quoted string) and about 733 for "Rabin-Karp". I don't put much stock in general web searches for this sort of topic. —David Eppstein (talk) 05:05, 2 May 2014 (UTC)
- I included "algorithm" in my search, and the results were the other way around, unless I messed up. I agree web search would be much less useful for any kind of counting. Dicklyon (talk) 05:30, 2 May 2014 (UTC)
- A lot of the hits are for things like "Karp–Rabin matching" or "Karp–Rabin string matching" rather than "algorithm", though. They do all seem to be valid hits for this topic rather than for something else those two might have done together. —David Eppstein (talk) 05:43, 2 May 2014 (UTC)
- I included "algorithm" in my search, and the results were the other way around, unless I messed up. I agree web search would be much less useful for any kind of counting. Dicklyon (talk) 05:30, 2 May 2014 (UTC)
- FWIW, Google scholar gives me about 820 hits for "Karp-Rabin" (as a quoted string) and about 733 for "Rabin-Karp". I don't put much stock in general web searches for this sort of topic. —David Eppstein (talk) 05:05, 2 May 2014 (UTC)
- Oppose. Historically it was called Karp-Rabin after the original paper, but current and recent usage is Rabin-Karp. We should create a redirect from the other name, of course, agree that somewhere someone messed up big time not to create one. Easily fixed. Andrewa (talk) 00:53, 5 May 2014 (UTC)
- [Citation needed]. CLRS uses Rabin-Karp, but why? And what is the evidence that others follow them? —David Eppstein (talk) 01:07, 5 May 2014 (UTC)
- My observation was based on a quick look at Google books. [1] [2] It's not an overwhelming result, but the trend looked clear to me. Certainly enough to require more evidence before moving. Andrewa (talk) 08:02, 7 May 2014 (UTC)