Exact String Matching Algorithms: Presented by Dr. Shazzad Hosain Assoc. Prof. EECS, NSU
Classical Comparison Based Methods
• Boyer-Moore Algorithm
• Knuth-Morris-Pratt Algorithm (KMP Algorithm)
Boyer-Moore Algorithm
• Basic ideas:
– Ideas carried over from naïve matching:
1. Successively align P and T to check for a match.
2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm:
1. Scan from right to left, i.e., compare P with T starting from the right end of P.
2. Special bad character rule
3. Suffix shift rule
Concept: Right-to-left Scan
Here the bad character is c. Perhaps we should shift to align this character with
its rightmost occurrence in P?
Concept: Bad Character Rule
Here the bad character is x. The minimum that we should shift should align
this character with its occurrence in P.
Concept: Bad Character Rule
• We will define a bad character rule that uses the concept of
the rightmost occurrence of each letter.
• Let R(x) be the rightmost position of the letter x in P for
each letter x in our alphabet.
• If x doesn’t occur in P, define R(x) to be 0.
P = adacara (positions 1234567)
x:    a b c d ... z
R(x): 7 0 4 2 ... 0
Concept: Bad Character Rule
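   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab            (before the shift)
P:     tpabsab          (after shifting by i - R(t) = 3 - 1 = 2)
R(t) = 1, R(s) = 5.
i: the position of the mismatch in P; here i = 3.
k: the corresponding position in T; here k = 5 and T[k] = t.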
• The bad character rule says P should be shifted right by max{1, i - R(T[k])}. That is,
if the rightmost occurrence of character T[k] in P is at position j (j < i), then P[j]
should be below T[k] after the shift.
• Otherwise, when R(T[k]) >= i, we have i - R(T[k]) <= 0 and P is shifted by one position.
• This rule is not very useful when R(T[k]) >= i, which is usually the case for DNA
sequences, whose alphabet is small.
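The rule above can be sketched in a few lines of Python. This is only an illustration of the simple (not the extended) bad character rule; the function names `rightmost_occurrences` and `bad_character_shift` are made up for this sketch, and positions are 1-based as on the slides.

```python
def rightmost_occurrences(pattern):
    """Map each character x of the pattern to R(x), its rightmost 1-based position."""
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos                      # later positions overwrite earlier ones
    return R


def bad_character_shift(R, i, text_char):
    """Shift for a mismatch at pattern position i (1-based) against text_char."""
    # R(x) is taken to be 0 when x does not occur in the pattern.
    return max(1, i - R.get(text_char, 0))


R = rightmost_occurrences("adacara")
print(R["a"], R.get("b", 0), R["c"], R["d"])      # 7 0 4 2

R2 = rightmost_occurrences("tpabsab")
print(bad_character_shift(R2, 3, "t"))            # 2
```

In the second call the mismatch is at i = 3 against the text character t, so the shift is 3 - R(t) = 2, as in the example above.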
Concept: Extended Bad Character Rule
Concept: Suffix Shift Rule
P .....................................adbadbaddog
Concept: Suffix Shift Rule
T ............................................axbadbaddog.....
P .........gbadbaddoghorseadbadbaddog
Concept: Suffix Shift Rule
T .......................................xbadbaddog.....
P .........gbadbaddogcatdbadbaddog
P after shifting .........gbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule
T .......................................xbadbaddog.....
P dogcatratdbadbaddog
T .......................................xbadbaddog.....
P batcatratdbadbaddog
• Let L(i) denote the largest position less than n s.t. P[i..n]
matches a suffix of P[1..L(i)].
• If there is no such position, then L(i) = 0
• Example 1: If i = 17 then L(i) = 9
P = batcatdogdbadbaddog (n = 19): P[17..19] = dog, and the rightmost earlier copy of dog ends at position 9 = L(17).
Concept: Suffix Shift Rule
• Let L´(i) denote the largest position less than n s.t. P[i..n]
matches a suffix of P[1..L´(i)] and s.t. the character
preceding the suffix is not equal to P(i-1).
• If there is no such position, then L´(i) = 0
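• Example 1: If i = 20 then L(i) = 12 and L´(i) = 6
P = slydogsaddogdbadbaddog (n = 22): P[20..22] = dog. The copy ending at position 12 is preceded by d = P(19), so L´(20) is the copy ending at position 6, which is preceded by y.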
Concept: Suffix Shift Rule
P = slydogsaddogdbadbaddog: for i = 19, P[19..22] = ddog and L(19) = 12.
Concept: Suffix Shift Rule
• Notice that L(i) indicates the right-most copy of P[i..n] that is not a
suffix of P.
• In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a
suffix of P and whose preceding character doesn’t match P(i-1).
• The relation between L´(i) and L(i) is analogous to the relation
between a´ and a.
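To make the two definitions concrete, here is a brute-force sketch that evaluates L(i) and L´(i) directly from their definitions. It is quadratic or worse and is meant only for checking small examples; the function name `suffix_positions` is hypothetical. One assumption is labeled in the code: a copy that starts at position 1 of P has no preceding character and is treated as satisfying the L´ condition.

```python
def suffix_positions(P):
    """Return (L, Lp): 1-based lists (index 0 unused) holding L(i) and L'(i)."""
    n = len(P)
    L = [0] * (n + 1)
    Lp = [0] * (n + 1)
    for i in range(2, n + 1):
        t = P[i - 1:]                          # t = P[i..n]
        k = len(t)
        for end in range(n - 1, k - 1, -1):    # candidate end positions, rightmost first
            if P[end - k:end] != t:
                continue
            if L[i] == 0:
                L[i] = end                     # rightmost copy of t that is not a suffix of P
            # Assumption: a copy starting at position 1 has no preceding
            # character, and that is treated as "differs from P(i-1)".
            before = P[end - k - 1] if end > k else None
            if before != P[i - 2]:             # P[i - 2] is P(i-1) in 1-based terms
                Lp[i] = end
                break
    return L, Lp


L, Lp = suffix_positions("slydogsaddogdbadbaddog")
print(L[20], Lp[20])    # 12 6
print(L[19], Lp[19])    # 12 0
```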
Concept: Suffix Shift Rule
P slybaddogbadbaddogcatdbadbaddog
L´(i) L(i)
P after shifting slybaddogbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule
P hogslydogsaddogdbadbaddog
3 9 15 19
Concept: Suffix Shift Rule
N is the reverse of Z!
P: the pattern
Pr: the string obtained by reversing P
Nj(P) = Zi(Pr) with i = n - j + 1

j:   1  2  3  4  5  6  7  8  9  10
P:   q  c  a  b  d  a  b  d  a  b
Nj:  0  0  0  2  0  0  5  0  0  0

i:   1  2  3  4  5  6  7  8  9  10
Pr:  b  a  d  b  a  d  b  a  c  q
Zi:  0  0  0  5  0  0  2  0  0  0
Concept: Suffix Shift Rule
For pattern P,
Nj (for j=1,…,n) can be calculated in O(n) using the Z algorithm.
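A minimal sketch of that computation: run a standard linear-time Z algorithm on the reversed pattern and reverse the result. The names `z_array` and `n_array` are illustrative, and the lists are 0-based, so `n_array(P)[j - 1]` holds Nj(P). In this sketch the last entry is n (the whole pattern is a suffix of itself), whereas the table above shows 0 in that column.

```python
def z_array(s):
    """z[i] = length of the longest substring starting at index i that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0                       # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:                      # reuse information from the enclosing Z-box
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                      # extend by explicit comparisons
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_array(P):
    """n_array(P)[j - 1] = Nj(P): longest suffix of P[1..j] that is also a suffix of P."""
    return z_array(P[::-1])[::-1]


print(n_array("qcabdabdab")[:-1])    # [0, 0, 0, 2, 0, 0, 5, 0, 0]
```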
[Figure: t = P[i..n] and its copy t´ ending at position L´(i); z and y are the characters preceding t´ and t, with z ≠ y.]
Concept: Suffix Shift Rule
• Example: P = asdbasasas, n = 10
• Values of Nj(P) for j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0
• Computed values i = n - Nj(P) + 1: 11, 9, 11, 11, 11, 9, 11, 7, 11 (i = 11 means no copy exists)
• Values of L´(i) for i = 1..9: 0, 0, 0, 0, 0, 0, 8, 0, 6
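The example values can be reproduced with a short sketch of the accumulation step: for j = 1..n-1, set i = n - Nj(P) + 1 and record L´(i) = j, so that larger j overwrite smaller ones. The function name `l_prime` is hypothetical, and the input is assumed to be the list of N values for j = 1..n-1.

```python
def l_prime(N_values):
    """N_values[j - 1] = Nj(P) for j = 1..n-1. Returns L' as a 1-based list (index 0 unused)."""
    n = len(N_values) + 1
    Lp = [0] * (n + 1)
    for j, Nj in enumerate(N_values, start=1):
        if Nj > 0:                   # Nj = 0 would give i = n + 1, i.e. no copy
            Lp[n - Nj + 1] = j       # later (larger) j overwrite earlier ones
    return Lp


N_values = [0, 2, 0, 0, 0, 2, 0, 4, 0]     # P = asdbasasas, n = 10
print(l_prime(N_values)[1:10])             # [0, 0, 0, 0, 0, 0, 8, 0, 6]
```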
Concept: Suffix Shift Rule
[Figure: l´(i) is the length of t´, the longest suffix of P[i..n] that is also a prefix of P.]
Boyer-Moore Algorithm
Preprocessing:
Compute L´(i) and l´(i) for each position i in P,
Compute R(x), the right-most occurrence of x in P, for each character x in the alphabet Σ.
Search:
k = n;
While k <= m {
  i = n; h = k;
  While i > 0 and P(i) = T(h) {
    i = i - 1; h = h - 1;}
  if i = 0 {
    report occurrence of P in T ending at position k.
    k = k + n - l´(2);}
  else Shift P (increase k) by the max amount indicated by the
    extended bad character rule and the good suffix rule.
}
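The search loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not a tuned implementation: it uses the simple bad character rule rather than the extended one, computes the N values naively instead of with the Z algorithm, and shifts by one position when the mismatch is at the last pattern character. The function name `boyer_moore` is hypothetical; it returns 1-based start positions.

```python
def boyer_moore(P, T):
    """Return the 1-based start positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    R = {ch: pos for pos, ch in enumerate(P, start=1)}    # rightmost occurrences

    # N[j]: longest suffix of P[1..j] that is also a suffix of P (naive, for brevity).
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        while N[j] < j and P[j - 1 - N[j]] == P[n - 1 - N[j]]:
            N[j] += 1

    Lp = [0] * (n + 2)                  # L'(i)
    for j in range(1, n):
        if N[j] > 0:
            Lp[n - N[j] + 1] = j
    lp = [0] * (n + 2)                  # l'(i): longest suffix of P[i..n] that is a prefix of P
    longest = 0
    for j in range(1, n + 1):
        if N[j] == j:                   # P[1..j] is also a suffix of P
            longest = j
        lp[n - j + 1] = longest

    occurrences = []
    k = n                               # k: text position under the right end of P
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            occurrences.append(k - n + 1)
            k += n - lp[2]
        else:
            bad = max(1, i - R.get(T[h - 1], 0))
            if i == n:                  # nothing matched: good suffix rule gives 1
                good = 1
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            k += max(bad, good)
    return occurrences


print(boyer_moore("golgol", "lolgolgol"))    # [4]
```

For P = golgol and T = lolgolgol the first alignment fails at i = 1, the good suffix rule shifts P by n - l´(2) = 3, and the occurrence ending at position 9 (starting at position 4) is reported.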
Boyer-Moore Algorithm
Example: P = golgol
Preprocessing:
Compute L´(i) and l´(i) for each position i in P
Notice that first we need Nj(P) values in order to compute L´(i) and l´(i) for each
position i in P.
Boyer-Moore Algorithm
Example: P = golgol
Recall that Nj(P) is the length of the longest suffix of P[1..j]
that is also a suffix of P.
Search (T = lolgolgol, m = 9; P = golgol, n = 6)
k = 6;
While k <= 9 {
  i = 6; h = k;
  While i > 0 and P(i) = T(h) {          // matches for i = h = 6, 5, 4, 3, 2
    i = i - 1; h = h - 1;}               // stops at i = 1, h = 1: P(1) != T(1)
  if i = 0 {                             // But i = 1!
    report occurrence of P in T ending at position k.
    k = k + 6 - l´(2);}
  else Shift P (increase k) by the max amount indicated by the
    extended bad character rule and the good suffix rule.
}
T: lolgolgol
P: golgol
2. Match case:
• If no mismatch is found, an occurrence of P has been found.
• Shift P by n - sp´_n places to continue searching for other
occurrences.
KMP Algorithm
• Observations:
– The prefix P[1..sp´_i] of the shifted P is aligned with the
substring of T it is already known to match.
– Subsequent character matching proceeds from
position sp´_i + 1 of P.
– Unlike Boyer-Moore, the matched substring is not
compared again.
– The shift rule based on sp´_i guarantees that the exact
same mismatch won't occur at position sp´_i + 1, but it doesn't
guarantee that P(sp´_i + 1) = T(k).
KMP Algorithm
• Example: P = abcxabcde
– If a mismatch occurs at position 8, P will be shifted 4
positions to the right.
– Q: Where did the 4-position shift come from?
– A: The number of positions is given by i - sp´_i, where i + 1 is the
mismatch position. In this example i = 7, sp´_7 = 3, and 7 - 3 = 4.
– Notice that we know the amount of shift without
knowing anything about T other than that there was a
mismatch at position 8.
KMP Algorithm
T: xyabcxabcxadcdqfeg
P:   abcxabcde            positions 1-7 match; position 8: d != x
P:       abcxabcde        shift 4 places; start again from position 4
Preprocessing for KMP
Approach: show how to derive sp´ values from Z values.
Definition: Position j > 1 maps to i if i = j + Zj(P) – 1
– Recall that Zj(P) denotes the length of the Z-box starting at position j.
sp_n(P) = sp´_n(P);
For i = n - 1 downto 2 {
  sp_i(P) = max[sp_(i+1)(P) - 1, sp´_i(P)];}
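The preprocessing can be sketched end to end as follows. The step that fills sp´ from the Z values (for j from n down to 2, set i = j + Zj(P) - 1 and sp´_i = Zj(P)) is not spelled out above; it is the standard Z-based construction and is included here as an assumption about what the slides intend. Z values are computed naively to keep the sketch short, and the function name `sp_tables` is hypothetical. Lists are 1-based with index 0 unused.

```python
def sp_tables(P):
    """Return (sp_prime, sp) as 1-based lists for pattern P."""
    n = len(P)
    # Z[j]: length of the longest substring starting at position j that matches
    # a prefix of P (naive computation; a linear-time Z algorithm would also do).
    Z = [0] * (n + 1)
    for j in range(2, n + 1):
        while j + Z[j] <= n and P[Z[j]] == P[j - 1 + Z[j]]:
            Z[j] += 1

    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):          # downto 2, so the smallest j mapping to i wins
        i = j + Z[j] - 1               # position j maps to i
        sp_prime[i] = Z[j]

    sp = [0] * (n + 1)
    if n >= 1:
        sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp


sp_prime, sp = sp_tables("abcxabcde")
print(sp_prime[7], sp[5], sp[6], sp[7])    # 3 1 2 3
```

For P = abcxabcde the Z-box at position 5 has length 3, so position 5 maps to i = 7 and sp´_7 = 3, matching the shift of 7 - 3 = 4 in the earlier example.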
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_(i-1) + 1, 1 ≤ i ≤ n + 1, sp´_0 = 0
(similarly F(i) = sp_(i-1) + 1, 1 ≤ i ≤ n + 1, sp_0 = 0)
[Figure: c is the pointer into T, p the pointer into P.]
T: xyabcxabcxadcdqfeg
P: abcxabcde                position 1: a != x
P:   abcxabcde              positions 1-7 match; position 8: d != x, shift 4 places
P:       abcxabcde
Full KMP Algorithm
c = 1; p = 1;
While c + (n - p) ≤ m {
  While p ≤ n and P(p) = T(c) {
    p = p + 1;
    c = c + 1;}
  If (p = n + 1) then                    // here p != n + 1
    report an occurrence of P at position c - n of T.
  if (p = 1) then c = c + 1;             // p = 1, so c becomes 3
  p = F´(p);                             // p = F´(1) = 1
}
T: xyabcxabcxabcdefeg
P:  abcxabcde               position 1: a != y
Full KMP Algorithm
c = 1; p = 1;
While c + (n - p) ≤ m {
  While p ≤ n and P(p) = T(c) {
    p = p + 1;
    c = c + 1;}
  If (p = n + 1) then                    // here p != n + 1
    report an occurrence of P at position c - n of T.
  if (p = 1) then c = c + 1;             // p = 8, so c is not changed
  p = F´(p);                             // p = F´(8) = 4
}
T: xyabcxabcxabcdefeg
P:   abcxabcde              positions 1-7 match; position 8: d != x
Full KMP Algorithm
c = 1; p = 1;
While c + (n - p) ≤ m {                  // p = 4, c = 10
  While p ≤ n and P(p) = T(c) {
    p = p + 1;
    c = c + 1;}
  If (p = n + 1) then                    // now p = n + 1
    report an occurrence of P at position c - n of T.
  if (p = 1) then c = c + 1;
  p = F´(p);
}
T: xyabcxabcxabcdefeg
P:       abcxabcde          positions 4-9 match; occurrence at c - n = 16 - 9 = 7
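The pseudocode traced above translates almost line for line into Python. In this sketch the failure function F´ is built from the classical prefix-function recurrence rather than from Z values; that is a swapped-in technique chosen for brevity, not the derivation used on the slides. The names `failure_function` and `kmp_search` are hypothetical, and positions are 1-based.

```python
def failure_function(P):
    """Return F' as a 1-based list: F'(i) = sp'_(i-1) + 1 for 1 <= i <= n + 1."""
    n = len(P)
    sp = [0] * (n + 1)          # sp[i]: longest proper suffix of P[1..i] that is a prefix of P
    k = 0
    for i in range(2, n + 1):
        while k > 0 and P[k] != P[i - 1]:
            k = sp[k]
        if P[k] == P[i - 1]:
            k += 1
        sp[i] = k
    sp_prime = [0] * (n + 1)    # strong version: the characters after the two copies differ
    for i in range(1, n + 1):
        k = sp[i]
        if i == n or k == 0 or P[k] != P[i]:
            sp_prime[i] = k
        else:
            sp_prime[i] = sp_prime[k]
    return [0] + [sp_prime[i - 1] + 1 for i in range(1, n + 2)]


def kmp_search(P, T):
    """Return the 1-based start positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0:
        return []
    F = failure_function(P)
    occurrences = []
    c = p = 1                   # c: pointer into T, p: pointer into P
    while c + (n - p) <= m:
        while p <= n and P[p - 1] == T[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            occurrences.append(c - n)
        if p == 1:
            c += 1
        p = F[p]
    return occurrences


print(failure_function("abcxabcde")[8])                   # 4
print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))      # [7]
```

The bound check `p <= n` is placed before the character comparison so that P is never indexed past its end.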
Real-Time KMP