String Matching
Pattern Matching
Given a text string T[0..n-1] and a pattern
P[0..m-1], find all occurrences of the pattern
within the text.
Example: for T = 000010001010001 and P = 0001,
the pattern occurs three times:
the first occurrence starts at T[1]
the second occurrence starts at T[5]
the third occurrence starts at T[11]
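Naïve algorithm
For each shift s = 0, 1, …, n-m, compare P[0..m-1] with
T[s..s+m-1] character by character.
Worst-case running time = O((n-m+1)m) = O(nm).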
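The naïve algorithm tries every possible shift and compares the pattern against the text one character at a time. A minimal sketch follows; the function name `naive_match` is chosen here for illustration and is not from the slides.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every shift s where pattern equals text[s:s+m]."""
    n, m = len(text), len(pattern)
    matches = []
    for s in range(n - m + 1):          # candidate shifts 0..n-m
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1                      # compare character by character
        if j == m:                      # all m characters agreed
            matches.append(s)
    return matches


print(naive_match("000010001010001", "0001"))  # [1, 5, 11]
```

The two nested loops are where the O(nm) bound comes from: up to n-m+1 shifts, each costing up to m comparisons.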
Rabin-Karp Algorithm
Key idea:
Think of the pattern P[0..m-1] as a key and transform
(hash) it into an equivalent integer p.
Similarly, transform each length-m substring of the text
T[] into an integer:
for s = 0, 1, …, n-m, transform T[s..s+m-1] into an
equivalent integer t_s.
The pattern occurs at position s if and only if p = t_s.
If we can compute p and every t_s quickly, the
pattern matching problem reduces to
comparing p with n-m+1 integers.
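To make the key idea concrete, the sketch below applies it to the example text and pattern given earlier. It reads each binary string as an integer with Python's built-in `int(x, 2)`; the variable names are illustrative, and the substrings are converted one by one with no attempt at efficiency.

```python
T = "000010001010001"
P = "0001"
n, m = len(T), len(P)

p = int(P, 2)                                        # pattern as an integer
t = [int(T[s:s + m], 2) for s in range(n - m + 1)]   # t_s for each shift s

# The pattern occurs at shift s exactly when the two integers agree.
matches = [s for s in range(n - m + 1) if t[s] == p]

print(p)        # 1
print(t)        # [0, 1, 2, 4, 8, 1, 2, 5, 10, 4, 8, 1]
print(matches)  # [1, 5, 11]
```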
How to compute p?
$p = 2^{m-1} P[0] + 2^{m-2} P[1] + \dots + 2 P[m-2] + P[m-1]$
Using Horner's rule:
$p = P[m-1] + 2\,(P[m-2] + 2\,(P[m-3] + \dots + 2\,(P[1] + 2 P[0]) \dots))$
This takes O(m) time, assuming each arithmetic operation
can be done in O(1) time.
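Horner's rule evaluates the sum from the leading digit inward, using one multiplication and one addition per character. A minimal sketch for a pattern over the alphabet {0, 1} follows; the name `horner` is hypothetical.

```python
def horner(bits: str) -> int:
    """Value of a 0/1 string read as a base-2 number, via Horner's rule."""
    value = 0
    for ch in bits:                    # m iterations, one multiply-add each
        value = 2 * value + int(ch)
    return value


print(horner("0001"))  # 1
print(horner("1010"))  # 10
```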
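Similarly, we can compute each of the (n-m+1) integers t_s
from the text string:
$t_s = 2^{m-1} T[s] + 2^{m-2} T[s+1] + \dots + 2 T[s+m-2] + T[s+m-1]$
This takes O((n-m+1)m) time in total, assuming that each
arithmetic operation can be done in O(1) time.
This is no faster than the naïve algorithm.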
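A better method is to compute t_0 with Horner's rule and
then obtain each t_{s+1} from t_s in O(1) time:
$t_{s+1} = 2\,(t_s - 2^{m-1} T[s]) + T[s+m]$
This takes O(n+m) time, assuming that each arithmetic
operation can be done in O(1) time.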
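The faster method relies on the fact that consecutive windows overlap in m-1 characters. Remove the leading digit's contribution, double what remains, and add the new trailing digit: t_{s+1} = 2 (t_s - 2^{m-1} T[s]) + T[s+m]. This is the standard rolling update, and the sketch below assumes it is the method intended here. The name `all_values` is hypothetical.

```python
def all_values(text: str, m: int) -> list[int]:
    """Return [t_0, ..., t_{n-m}] for a 0/1 text using the O(1) rolling update."""
    n = len(text)
    if m < 1 or m > n:
        return []
    bits = [int(ch) for ch in text]
    high = 2 ** (m - 1)                # weight of the leading digit in a window

    t0 = 0
    for b in bits[:m]:                 # Horner's rule for the first window, O(m)
        t0 = 2 * t0 + b

    values = [t0]
    for s in range(n - m):             # n-m updates, O(1) arithmetic operations each
        # drop T[s], shift left by one digit, bring in T[s+m]
        values.append(2 * (values[-1] - high * bits[s]) + bits[s + m])
    return values


print(all_values("000010001010001", 4))
# [0, 1, 2, 4, 8, 1, 2, 5, 10, 4, 8, 1]
```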
Problem
The problem with the previous strategy is that when m
is large, it is unreasonable to assume that each
arithmetic operation can be done in O(1) time.
In fact, given a very long integer, we may not even be able to
use the default integer type to represent it.
Therefore, we use modular arithmetic. Let q be a
prime number such that 2q can be stored in one
computer word, and compute p and every t_s modulo q.
This ensures that all computations can be done using
single-precision arithmetic.
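The effect of reducing modulo q can be sketched with Horner's rule. The modulus 2^31 - 1 below is an assumption made for illustration: it is prime, and 2q stays below 2^32. Python integers never overflow, so the sketch only mimics the single-word constraint. The name `hash_mod` is hypothetical.

```python
Q = 2**31 - 1   # assumed prime modulus; 2*Q < 2**32, so 2*Q fits in a 32-bit word


def hash_mod(bits: str, q: int = Q) -> int:
    """Horner's rule with a reduction mod q after every step."""
    value = 0
    for ch in bits:
        # value < q before this step, so 2*value + 1 < 2*q: it fits in one word
        value = (2 * value + int(ch)) % q
    return value


long_pattern = "1" * 1000            # its exact value needs 1000 bits
print(hash_mod(long_pattern) < Q)    # True
print(hash_mod("1010", q=7))         # 3
```

The bound in the comment is why 2q, rather than q alone, must fit in a word: the intermediate value before the reduction can be as large as 2q - 1.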
Once we use modular arithmetic, p = t_s for
some s no longer guarantees that P[0..m-1] is
equal to T[s..s+m-1]: two different strings can have
the same value modulo q.
Therefore, after the equality test p = t_s succeeds, we should
compare P[0..m-1] with T[s..s+m-1] character by
character to ensure that we really have a match.
The worst-case running time is therefore O(nm), but in
practice the equality test rules out most shifts, so few
character-by-character comparisons are needed.
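Putting the pieces together, the following is one possible sketch of the complete algorithm for 0/1 strings: hash values modulo q, a rolling update, and a character-by-character check after every hash hit. The function name `rabin_karp` and the default modulus 2^31 - 1 are assumptions for illustration, not taken from the slides.

```python
def rabin_karp(text: str, pattern: str, q: int = 2**31 - 1) -> list[int]:
    """Find all shifts where a 0/1 pattern occurs in a 0/1 text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(2, m - 1, q)            # 2^(m-1) mod q, weight of the leading digit
    p = t = 0
    for j in range(m):                 # Horner's rule mod q for p and t_0
        p = (2 * p + int(pattern[j])) % q
        t = (2 * t + int(text[j])) % q
    matches = []
    for s in range(n - m + 1):
        # hash test first; compare the characters only on a hit
        if p == t and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:                  # roll t_s forward to t_{s+1}
            t = (2 * (t - high * int(text[s])) + int(text[s + m])) % q
    return matches


print(rabin_karp("000010001010001", "0001"))        # [1, 5, 11]
print(rabin_karp("000010001010001", "0001", q=3))   # [1, 5, 11]
```

The second call uses a deliberately tiny modulus, so several shifts produce equal hash values without being real matches; the character check discards them. In a language with fixed-width integers, the subtraction inside the update can go negative before the reduction and would need an extra `+ q`; Python's `%` already returns a non-negative result.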