String Matching - RYS - Lect - 1 - 2 - 3 - Update
Table of Contents
• String Matching
– Naïve Method
– Finite Automata Approach
– Rabin Karp
– KMP
Pattern Matching
• Given a text string T[0..n-1] and a pattern
P[0..m-1], find all occurrences of the pattern
within the text.
• Given:
T: a b c a b d a a b c d e
P: a b d
Example (Step – 1)
T: a b c a b d a a b c d e
P: a b d
T: a b c a b d a a b c d e
P:   a b d
T: a b c a b d a a b c d e
P:     a b d
T: a b c a b d a a b c d e
P:       a b d
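The sliding comparison in this example can be sketched in Python as follows. This is a minimal sketch of the naïve method, not code from the lecture; the function name `naive_search` and the 0-based shifts are assumptions made for illustration.

```python
def naive_search(text, pattern):
    """Return every shift s (0-based) such that text[s:s+m] equals pattern."""
    n, m = len(text), len(pattern)
    matches = []
    for s in range(n - m + 1):          # try every possible alignment
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1                      # extend the match one character
        if j == m:                      # all m characters matched
            matches.append(s)
    return matches

T = "abcabdaabcde"   # the text from the example, without spaces
P = "abd"
print(naive_search(T, P))  # [3]
```

The pattern is moved right by exactly one position after every attempt, whether or not a partial match was found.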
P: a a a f (size 4)
Example (Step – 1)
T: a a a a . . . . . a a f
P: a a a f
T: a a a a . . . . . a a f
P:   a a a f
T: a a a a . . . . . a a f
P:                 a a a f
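To see why this input is bad for the naïve method, the sketch below counts character comparisons. The concrete text (eleven a's followed by f) and the name `count_comparisons` are assumptions chosen for illustration; the slide only shows the text with dots.

```python
def count_comparisons(text, pattern):
    """Naive matching that also counts character comparisons."""
    n, m = len(text), len(pattern)
    comparisons = 0
    matches = []
    for s in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1            # one text/pattern character comparison
            if text[s + j] != pattern[j]:
                break
            j += 1
        if j == m:
            matches.append(s)
    return matches, comparisons

T = "a" * 11 + "f"   # assumed concrete text: a a a ... a f, length n = 12
P = "aaaf"           # m = 4
matches, comparisons = count_comparisons(T, P)
print(matches, comparisons)  # [8] 36
```

Every one of the n - m + 1 = 9 alignments needs all m = 4 comparisons, so the count is (n - m + 1) * m, which grows as O(mn).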
[Figure: finite automaton for P = a a a f, with states s0, s1, s2, s3, f and edge labels a, f, #a and ∑]
Worst Case Running Time
KMP : Knuth Morris Pratt Algorithm
T : … t_j … t_{j+r-1} … t_{j+k-r-1} … t_{j+k-2} t_{j+k-1} …
P : p_1 … p_r … p_{k-1} p_k …
(shifted) p_1 … p_r p_{r+1} …
If t_{j+k-1} ≠ p_k
Shifting of the pattern is required. But instead of shifting right by 1
character, we look for the longest proper prefix of p_1 … p_{k-1} that matches a
suffix of the matched text t_j … t_{j+k-2}.
KMP Contd..
• Let r be the length of the longest proper prefix of P that
is also a suffix of the matched part p_1 … p_{k-1}. Then the
pattern can be shifted by k-1-r positions instead of 1, and
t_{j+k-1} should be compared with p_{r+1}.
• Claim 1: We have not missed any match, i.e. the
pattern does not occur starting at any position from j to j+k-r-2.
• Proof: If it did, a prefix of P longer than r would match
a suffix of the matched part, contradicting the choice of r.
Why LONGEST?
T:abcabcabcabcaf
mismatch found
P:abcabcabcaf
T:abcabcabcabcaf
P:   abcabcabcaf
Pattern found.
T:abcabcabcabcaf
mismatch
P: abcabcabcaf
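The choice of the longest prefix can be checked on this example with a small sketch. It lists every proper prefix of the matched part that is also its suffix, and tests whether the corresponding shift finds the pattern. The helper name `borders` and the 0-based indexing are assumptions made for illustration.

```python
def borders(s):
    """Lengths r of all proper prefixes of s that are also suffixes of s, longest first."""
    return [r for r in range(len(s) - 1, 0, -1) if s[:r] == s[-r:]]

T = "abcabcabcabcaf"
P = "abcabcabcaf"
matched = P[:10]     # "abcabcabca": the part of P matched before the mismatch at 'f'

for r in borders(matched):
    shift = len(matched) - r                  # how far P moves if prefix length r is used
    found = T[shift:shift + len(P)] == P      # does P occur at that shift?
    print(r, shift, found)
# 7 3 True
# 4 6 False
# 1 9 False
```

Only the longest prefix (r = 7, shift 3) lands on the occurrence. The shorter candidates move the pattern past it, which is the match that would be missed.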
P : p_1 … p_k …
T : … t_j … t_{j+r-1} … t_{j+k-r-1} … t_{j+k-2} t_{j+k-1} …
P : p_1 … p_r … p_{k-1} p_k …
(shifted) p_1 … p_r p_{r+1} …
If t_{j+k-1} ≠ p_k
Let Fail[k] be a pointer that says which character of P should
take the place of p_k, after shifting P accordingly, when a mismatch
occurs at p_k. With r the length of the longest proper prefix of
p_1 … p_{k-1} that is also its suffix, Fail[k] = r + 1.
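A minimal sketch of how such a failure table can drive the search is given below. It is not the lecture's code: the names `failure_table` and `kmp_search` are hypothetical, and the table is 0-indexed, so `fail[k]` stores the length r for the first k pattern characters (the 1-indexed pointer described above would then correspond to `fail[k-1] + 1`, which is an assumption about the slide's indexing).

```python
def failure_table(pattern):
    """fail[k] = length of the longest proper prefix of pattern[:k] that is also its suffix."""
    m = len(pattern)
    fail = [0] * (m + 1)
    r = 0
    for k in range(2, m + 1):
        # Fall back through shorter borders until one can be extended.
        while r > 0 and pattern[k - 1] != pattern[r]:
            r = fail[r]
        if pattern[k - 1] == pattern[r]:
            r += 1
        fail[k] = r
    return fail

def kmp_search(text, pattern):
    """Return all 0-based shifts at which pattern occurs in text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")  # assumption of this sketch
    m = len(pattern)
    fail = failure_table(pattern)
    matches = []
    q = 0                                # number of pattern characters matched so far
    for i, c in enumerate(text):
        while q > 0 and c != pattern[q]:
            q = fail[q]                  # shift the pattern, keep the longest border
        if c == pattern[q]:
            q += 1
        if q == m:                       # full match ending at text index i
            matches.append(i - m + 1)
            q = fail[q]
    return matches

print(failure_table("abcabcabcaf"))                   # [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 0]
print(kmp_search("abcabcabcabcaf", "abcabcabcaf"))    # [3]
```

The text index never moves backwards, so the search loop performs O(n) work after an O(m) table construction.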
Rabin-Karp Example
• Hash value of “AAAAA” is 37
• Hash value of “AAAAH” is 100
Rabin-Karp Algorithm
Example
• To find the pattern 26535 in the text 3 1 4 1 5
9 2 6 5 3 5 8 9 7 9 3, we choose a table size Q
(997 in the example), compute the hash value
26535 % 997 = 613, and then look for a match
by computing hash values for each five-digit
substring in the text.
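The search described in this example can be sketched as follows. This is an illustrative sketch, assuming decimal digits (base 10) and the modulus 997 from the example; the function name `rabin_karp_digits` is hypothetical.

```python
def rabin_karp_digits(text, pattern, q=997, base=10):
    """Return all 0-based shifts where the digit string pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, q)           # base^(m-1) mod q: weight of the leftmost digit
    p_hash = t_hash = 0
    for i in range(m):                   # hash the pattern and the first window
        p_hash = (p_hash * base + int(pattern[i])) % q
        t_hash = (t_hash * base + int(text[i])) % q
    matches = []
    for s in range(n - m + 1):
        # Compare characters only when the hash values agree.
        if t_hash == p_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:                    # roll the window one digit to the right
            t_hash = ((t_hash - int(text[s]) * high) * base + int(text[s + m])) % q
    return matches

print(26535 % 997)                                        # 613
print(rabin_karp_digits("3141592653589793", "26535"))     # [6]
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction needs no extra correction here; in languages where `%` can be negative, adding q before taking the remainder is a common fix.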
Example-Method 1
• Furthermore, given x(i) we can compute x(i+1) for the next subsequence t[i+1 .. i+M] in
constant time, as follows:
x(i+1) = x(i)·b − t[i]·b^M + t[i+M]
Rabin-Karp Mods
• If M is large, then the resulting value (~b^M) will be enormous. For this
reason, we hash the value by taking it mod a prime number q.
• The mod function (% in Java) is particularly useful in this case due to several
of its inherent properties:
[(x mod q) + (y mod q)] mod q = (x+y) mod q
(x mod q) mod q = x mod q
• For these reasons:
h(i) = ((t[i]·b^(M-1) mod q) + (t[i+1]·b^(M-2) mod q) + ...
        + (t[i+M-1] mod q)) mod q
h(i+1) = ( h(i)·b mod q           (shift left one digit)
         − t[i]·b^M mod q         (subtract leftmost digit)
         + t[i+M] mod q ) mod q   (add new rightmost digit)
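The rolling update can be checked against the term-by-term definition with a short sketch. The helper names `direct_hash` and `roll` are hypothetical, and the digit sequence, M = 5, b = 10 and q = 997 are borrowed from the earlier example as illustrative values.

```python
def direct_hash(t, i, M, b, q):
    """h(i) computed term by term from the definition."""
    return sum(t[i + k] * pow(b, M - 1 - k, q) for k in range(M)) % q

def roll(h_i, t, i, M, b, q):
    """h(i+1) from h(i): shift left, subtract leftmost digit, add new rightmost digit."""
    return (h_i * b - t[i] * pow(b, M, q) + t[i + M]) % q

t = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
M, b, q = 5, 10, 997

h = direct_hash(t, 0, M, b, q)
for i in range(len(t) - M):
    h = roll(h, t, i, M, b, q)
    # The constant-time update must agree with the O(M) definition.
    assert h == direct_hash(t, i + 1, M, b, q)

print(direct_hash(t, 6, M, b, q))  # 613, the hash of the window 2 6 5 3 5
```

Each call to `roll` uses a constant number of arithmetic operations, while `direct_hash` needs M of them per window.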
Rabin-Karp Complexity
• If a sufficiently large prime number is used for the hash function,
the hashed values of two different patterns will usually be distinct.
• If this is the case, searching takes O(N) time, where N is the
number of characters in the larger body of text.
• It is always possible to construct a scenario with a worst case
complexity of O(MN). This, however, is likely to happen only if the
prime number used for hashing is small.
Rabin Karp Algorithm
Example: P: 31415, T: 2359023141526739921
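The effect of a small modulus can be illustrated on this example. The slide does not state the modulus, so q = 13 below is an assumption chosen to make a false hash match visible; the function name `rabin_karp_hits` is also hypothetical.

```python
def rabin_karp_hits(text, pattern, q, base=10):
    """Return (valid, spurious): shifts where the hashes agree, split by real equality."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    high = pow(base, m - 1, q)           # weight of the leftmost digit, mod q
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + int(pattern[i])) % q
        t_hash = (t_hash * base + int(text[i])) % q
    valid, spurious = [], []
    for s in range(n - m + 1):
        if t_hash == p_hash:             # hash hit: verify character by character
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            t_hash = ((t_hash - int(text[s]) * high) * base + int(text[s + m])) % q
    return valid, spurious

valid, spurious = rabin_karp_hits("2359023141526739921", "31415", q=13)
print(valid, spurious)  # [6] [12]
```

With q = 13 both 31415 and 67399 hash to 7, so shift 12 costs an extra verification even though it is not a match. Every spurious hit adds O(M) work, which is how the O(MN) worst case arises when the modulus is small.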
Finite Automata
Example
pattern P : ababaca, text T : abababacaba
Shift 9-7=2
Algorithm
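One way to sketch an automaton-based matcher is shown below. It is not necessarily the algorithm from the lecture: the transition table is built by brute force for clarity, and the names `build_transitions` and `automaton_search` and the explicit alphabet argument are assumptions.

```python
def build_transitions(pattern, alphabet):
    """delta[q][c] = length of the longest prefix of pattern that is a suffix of pattern[:q] + c."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):               # state q means q pattern characters are matched
        row = {}
        for c in alphabet:
            s = pattern[:q] + c
            k = min(m, q + 1)
            while k > 0 and not s.endswith(pattern[:k]):
                k -= 1                   # try the next shorter prefix
            row[c] = k
        delta.append(row)
    return delta

def automaton_search(text, pattern, alphabet):
    """Return all 0-based shifts at which pattern occurs in text."""
    m = len(pattern)
    delta = build_transitions(pattern, alphabet)
    q = 0
    matches = []
    for i, c in enumerate(text):
        q = delta[q].get(c, 0)           # characters outside the alphabet reset to state 0
        if q == m:                       # accepting state reached at text index i
            matches.append(i - m + 1)
    return matches

print(automaton_search("abababacaba", "ababaca", "abc"))  # [2]
```

The accepting state is reached after reading the 9th text character, and 9 - 7 = 2 gives the shift. Once the table exists, the scan reads each text character exactly once.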