Exact String Matching Algorithms: Presented by Dr. Shazzad Hosain Assoc. Prof. EECS, NSU

Exact String Matching Algorithms

Presented By
Dr. Shazzad Hosain
Assoc. Prof. EECS, NSU
Classical Comparison Based Methods

• Boyer-Moore Algorithm
• Knuth-Morris-Pratt Algorithm (KMP Algorithm)
Boyer-Moore Algorithm
• Basic ideas:
– Previously discussed ideas for naïve matching
1. successively align P and T to check for a match.
2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm
1. Scan from right to left
2. Special bad character rule
3. Suffix shift rule
Concept: Right-to-left Scan

• How can we check for a match of pattern P at location i in target T?
• Naïve algorithm scanned left-to-right, i.e., compare T[i+k] with P[1+k] for k = 0 to length(P)-1
Example: P = adab, T = abaracadabara
abaracadabara
adab
^^
12
Comparison 1: a == a. Comparison 2: d != b.
Concept: Right-to-left Scan

• Alternative, scan right-to-left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0
Example: P = adab, T = abaracadabara
abaracadabara
adab
   ^
   1
Comparison 1: b != r.
Concept: Right-to-left Scan
• Why is scanning right-to-left a good idea?
• Answer: by itself, it isn’t any better than left-to-right.
– A naïve approach with right-to-left scanning is also Θ(nm).
– Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
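A minimal sketch of the naïve matcher with a right-to-left scan, to make the Θ(nm) point concrete. The function name and the 1-based result convention are illustrative assumptions, not part of the slides.

```python
def naive_right_to_left(pattern, text):
    """Return 1-based start positions of pattern in text.

    Every alignment is scanned right to left, but P is always shifted
    by exactly one position, so the worst case is still Theta(nm).
    """
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):
        k = n - 1
        # Compare T[start+k] with P[k], from the right end of P to the left.
        while k >= 0 and text[start + k] == pattern[k]:
            k -= 1
        if k < 0:  # scanned past the left end: every character matched
            hits.append(start + 1)
    return hits
```

For the running example, `naive_right_to_left("adab", "abaracadabara")` finds the single occurrence starting at position 7.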
Concept: Bad Character Rule

• Idea: the mismatched character indicates a safe minimum shift.
Example: P = adacara, T = abaracadabara
abaracadabara
adacara
     ^^
     21
Comparison 1: a == a. Comparison 2: r != c.

Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?
Concept: Bad Character Rule

Shift two positions to align the rightmost occurrence of the mismatched character c in P.
T:               abaracadabara
P before shift:  adacara
P after shift:     adacara

Now, start matching again from right to left.

Concept: Bad Character Rule

Second Example: P = adacara, T=abaxaradabara

abaxaradabara
adacara
   ^^^^
   4321
Comparisons 1 to 3: a == a, r == r, a == a. Comparison 4: c != x.

Here the bad character is x. The minimum shift should align this character with its occurrence in P.

But x doesn’t occur in P!

Concept: Bad Character Rule

Since x doesn’t occur in P, we can shift past it.

Second Example: P = adacara, T = abaxaradabara

T:               abaxaradabara
P before shift:  adacara
P after shift:       adacara

Now, start matching again from right to left.

Concept: Bad Character Rule

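• The idea of the bad character rule is to shift P by more than one character when possible.
• But if the rightmost position of the mismatched character in P is greater than the mismatch position, the rule gives no useful shift and P moves only one position.
• Unfortunately, this is often the case. Below, T(5) = t mismatches P(3) = a, but the rightmost t in P is at position 7.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat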
Concept: Bad Character Rule
• We will define a bad character rule that uses the concept of
the rightmost occurrence of each letter.
• Let R(x) be the rightmost position of the letter x in P for
each letter x in our alphabet.
• If x doesn’t occur in P, define R(x) to be 0.

     1234567
P =  adacara

x     a  b  c  d  ...  z
R(x)  7  0  4  2  ...  0
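A short sketch of how the R table could be built in one left-to-right pass over P. The function name and the dictionary representation are assumptions made for illustration.

```python
def rightmost_positions(pattern, alphabet):
    """Return a dict R with R[x] = rightmost 1-based position of x in
    pattern, or 0 if x does not occur in pattern."""
    r = {x: 0 for x in alphabet}
    for pos, ch in enumerate(pattern, start=1):
        r[ch] = pos  # later occurrences overwrite earlier ones
    return r
```

With P = adacara this gives R(a) = 7, R(c) = 4, R(d) = 2, and 0 for letters such as b and z that do not occur in P.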
Concept: Bad Character Rule
   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
P:     tpabsab

R(t)=1, R(s)=5.
i: the position of mismatch in P. i=3
k: the counterpart in T. k=5. T[k]=t
• The bad character rule says P should be shifted right by max{1, i-R(T[k])}. i.e.,
if the right-most occurrence of character T[k] in P is in position j (j<i), then P[j]
should be below T[k] after the shifting.
• Otherwise, we will shift P one position, i.e., when R(T[k]) >= i, 1 >= i - R(T[k])
• Obviously this rule is not very useful when R(T[k]) >= i, which is usually the
case for DNA sequences
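The shift max{1, i - R(T[k])} can be sketched directly. The helper below is hypothetical and recomputes R(x) with `str.rfind` on every call instead of using a precomputed table, purely to keep the example short.

```python
def bad_character_shift(pattern, i, bad_char):
    """Shift given by the simple bad character rule.

    i is the 1-based mismatch position in P and bad_char is T[k],
    the text character aligned with P[i].
    """
    r = pattern.rfind(bad_char) + 1  # R(x): rightmost 1-based position, 0 if absent
    return max(1, i - r)
```

For the slide's example, `bad_character_shift("tpabsab", 3, "t")` is 2, while for P = tpabsat the rightmost t is at position 7 and the rule falls back to a shift of 1.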

Concept: Extended Bad Character Rule

Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].
Example: P = aracara, T = abararadabara
abararadabara
aracara
   ^^^^
   4321
Comparisons 1 to 3: a == a, r == r, a == a. Comparison 4: c != r.
Position 6 is the rightmost occurrence of r in P. Notice that i - R(T(k)) < 0, i.e., 4 – 6 < 0.
Position 2 is the rightmost occurrence of r to the left of i in P. Notice that 4 – 2 > 0, i.e., this gives us a positive shift.
Concept: Extended Bad Character Rule

The amount of shift is i – j, where:
– i is the index of the mismatch in P.
– j is the rightmost occurrence of T[k] to the left of i in P.
Example: P = aracara, T = abataradabara
abataradabara
aracara
   ^^^^
   4321
Comparisons 1 to 3: a == a, r == r, a == a. Comparison 4: c != t.
There is no occurrence of t in P, thus j = 0. Notice that i – j = 4, i.e., this gives us a positive shift past the point of mismatch.
Concept: Extended Bad Character Rule
• How do we implement this rule?
• We preprocess P (from right to left), recording the
position of each occurrence of the letters.
• For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn’t occur in P, then it has an empty list.
Concept: Extended Bad Character Rule
Example: Σ = {a, b, c, d, r, t}, P = abataradabara
• a_list = <13, 11,9,7,5,3,1> since ‘a’ occurs at these positions
in P, i.e., abataradabara
• b_list = <10,2> (abataradabara)
• c_list = Ø
• d_list = <8> (abataradabara)
• r_list = <12,6> (abataradabara)
• t_list = <4>(abataradabara)
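A sketch of the preprocessing and lookup for the extended bad character rule. The names `occurrence_lists` and `extended_bad_character_shift` are illustrative assumptions; characters that never occur in P simply have no entry, which plays the role of the empty list.

```python
def occurrence_lists(pattern):
    """Map each character of P to its 1-based positions, rightmost first."""
    lists = {}
    for pos in range(len(pattern), 0, -1):  # scan P from right to left
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists, i, bad_char):
    """Return i - j, where j is the closest occurrence of bad_char to the
    left of mismatch position i in P (j = 0 if there is none)."""
    for pos in lists.get(bad_char, []):
        if pos < i:  # first hit in a descending list is the closest one
            return i - pos
    return i
```

For P = aracara with a mismatch at i = 4, the shift is 2 when T[k] = r and 4 when T[k] = t, matching the two examples above.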
Concept: Suffix Shift Rule
• Recall that we investigated finding prefixes before.
• Since we are matching P to T from right-to-left, we will
instead need to use suffixes.
Suffix Shift Rule

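t is a suffix of P that matches a substring t of T.
x is the character of T to the left of t and y is the character of P to the left of the suffix t, with x ≠ y.
t’ is the right-most copy of t in P such that t’ is not a suffix of P and z ≠ y, where z is the character to the left of t’.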
Concept: Suffix Shift Rule

• Consider the partial right-to-left matching of P to T below.
• This partial match involves α, a suffix of P. Here α = badbaddog.

T ....................xbadbaddog.....
P ....................dbadbaddog
Concept: Suffix Shift Rule

• This partial match ends where the first mismatch occurs, where x is aligned with d.

T ....................xbadbaddog.....
P ....................dbadbaddog
Concept: Suffix Shift Rule

We want to find a right-most copy α´ of this substring α in P such that:
1. α´ is not a suffix of P and
2. The character to the left of α´ is not the same as the character to the left of α

T ........................xbadbaddog.....
P .........gbadbaddoghorsedbadbaddog
            α´             α
Concept: Suffix Shift Rule

1. If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.

T                 .............xbadbaddog..............
P                 gbadbaddogcatdbadbaddog
P after shifting               gbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule

2. If α´ doesn’t exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

T                 .........xbadbaddog................
P                 dogcatratdbadbaddog
P after shifting                  dogcatratdbadbaddog

Concept: Suffix Shift Rule

3. If α´ doesn’t exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

T                 .........xbadbaddog...................
P                 batcatratdbadbaddog
P after shifting                     batcatratdbadbaddog

Preprocessing for the good suffix rule

• Let L(i) denote the largest position less than n s.t. P[i..n]
matches a suffix of P[1..L(i)].
• If there is no such position, then L(i) = 0
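• Example 1: If i = 17 then L(i) = 9
P batcatdogdbadbaddog
P[17..19] = dog, and its right-most copy that is not the suffix itself ends at position 9.

• Example 2: If i = 16 then L(i) = 0
P batcatdogdbadbaddog
P[16..19] = ddog has no other copy in P.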
Concept: Suffix Shift Rule

• Let L´(i) denote the largest position less than n s.t. P[i..n]
matches a suffix of P[1..L´(i)] and s.t. the character
preceding the suffix is not equal to P(i-1).
• If there is no such position, then L´(i) = 0
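• Example 1: If i = 20 then L(i) = 12 and L´(i) = 6
P slydogsaddogdbadbaddog
P[20..22] = dog. The copy ending at position 12 is preceded by d, the same as P(19). The copy ending at position 6 is preceded by y.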
Concept: Suffix Shift Rule

• Example 2: If i = 19 then L(i) = 12 and L´(i) = 0
P slydogsaddogdbadbaddog
P[19..22] = ddog. Its only other copy ends at position 12 and is preceded by a, the same as P(18).
Concept: Suffix Shift Rule

• Notice that L(i) indicates the right-most copy of P[i..n] that is not a
suffix of P.
• In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a
suffix of P and whose preceding character doesn’t match P(i-1).
• The relation between L´(i) and L(i) is analogous to the relation between α´ and α.

P slydogsaddogdbadbaddog
For i = 20: L(20) = 12 and L´(20) = 6.
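The two definitions can be checked against the slide examples with a brute-force sketch. This is a direct translation of the definitions for illustration only (quadratic time, hypothetical function name), not the linear-time preprocessing; it treats a copy that starts at position 1 as having no preceding character.

```python
def rightmost_copies(pattern, i):
    """Return (L(i), L'(i)) for a 1-based position i with 2 <= i <= n."""
    n = len(pattern)
    suffix = pattern[i - 1:]  # P[i..n]
    length = len(suffix)
    big_l = l_prime = 0
    # Candidate end positions of a copy, largest first, always less than n.
    for end in range(n - 1, length - 1, -1):
        if pattern[end - length:end] != suffix:
            continue
        if big_l == 0:
            big_l = end  # right-most copy of P[i..n]
        start = end - length + 1
        before_copy = pattern[start - 2] if start >= 2 else None
        if before_copy != pattern[i - 2]:  # preceding character differs from P(i-1)
            l_prime = end
            break
    return big_l, l_prime
```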
Concept: Suffix Shift Rule

• Q: What is the point?

• A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions. Example (n = 23, i = 15, L´(15) = 10, shift of 13):

T                 .............xbadbaddog..............
P                 gbadbaddogcatdbadbaddog
P after shifting               gbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule

• If L(i) and L´(i) are different, then obviously shifting by n - L´(i) positions is a greater shift than n - L(i).
• Example (n = 31, i = 26, L(i) = 18, L´(i) = 9, so the shift is 22 rather than 13):

T                 .....................xbaxbaddog......................
P                 slybaddogbadbaddogcatdbadbaddog
P after shifting                        slybaddogbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule

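• Let Nj(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P.
• Example 1: N6(P) = 3 and N12(P) = 5.
P slydogsaddogdbadbaddog
P[1..6] ends in dog and P[1..12] ends in addog.
• Example 2: N3(P) = 2, N9(P) = 3, N15(P) = 5, N19(P) = 0.
P hogslydogsaddogdbadbaddog
P[1..3] ends in og, P[1..9] in dog, P[1..15] in addog, and P(19) = d differs from the last character g.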
Concept: Suffix Shift Rule

• Q: How are the concepts of Ni and Zi related?
• Recall that Zi = length of a maximal substring starting at position i, which is a prefix of P.
• In contrast, Ni = length of a maximal substring ending at position i, which is a suffix of P.
• In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right-to-left.
Concept: Suffix Shift Rule

• Let P^r denote the mirror image of P, then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r).
• In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P.
• Q: Why must this be true?
• A: Because they are the same substring, except that one is the reverse of the other.
Concept: Suffix Shift Rule

• Since N_j(P) = Z_{n-j+1}(P^r), we can use the Z algorithm to compute N in O(n).
• Q: How do we do this?
• A: We create P^r, the reverse of P, and process it with the Z algorithm.
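A sketch of this step, assuming the standard linear-time Z algorithm (the slides refer to it but do not list it). The function names are illustrative, lists are 0-based internally, and N is returned with index 0 unused so that N[j] reads like N_j. The entry N[n] is left at 0.

```python
def z_values(s):
    """z[k] = length of the longest substring of s starting at 0-based
    index k that is also a prefix of s (z[0] is left at 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0  # current Z-box is s[left:right]
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def n_values(pattern):
    """N[j] = N_j(P) for 1 <= j <= n-1, computed as Z_{n-j+1}(P^r)."""
    n = len(pattern)
    z = z_values(pattern[::-1])
    big_n = [0] * (n + 1)
    for j in range(1, n):
        big_n[j] = z[n - j]  # 1-based index n-j+1 is 0-based index n-j
    return big_n
```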
Concept: Suffix Shift Rule

N is the reverse of Z!
P: the pattern
P^r: the string obtained by reversing P

Then N_j(P) = Z_{n-j+1}(P^r)

j:    1 2 3 4 5 6 7 8 9 10
P:    q c a b d a b d a b
N_j:  0 0 0 2 0 0 5 0 0 0

i:    1 2 3 4 5 6 7 8 9 10
P^r:  b a d b a d b a c q
Z_i:  0 0 0 5 0 0 2 0 0 0
Concept: Suffix Shift Rule

For pattern P, Nj (for j=1,…,n) can be calculated in O(n) using the Z algorithm.

Why do we need to define Nj?

To use the strong good suffix rule, we need to find out L’(i) for every i=1,…,n.
T: ... x t ...
P: ... z t’ ... y t, where t’ ends at position L’(i), t = P[i..n], and z ≠ y.

We can get L’(i) from Nj!
Concept: Suffix Shift Rule

• We can then find L´(i) and L(i) values from N values in linear time with the following:

For i = 1 to n {L´(i) = 0;}

For j = 1 to n – 1 {
i = n - Nj(P) + 1;   // i = n + 1 when Nj(P) = 0; that entry is never used
L´(i) = j;
}

// L values (if desired) can be obtained
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i));}
Concept: Suffix Shift Rule
For i = 1 to n {L´(i) = 0;}
For j = 1 to n – 1 {
i = n - Nj(P) + 1;
L´(i) = j;
}
L(2) = L´(2) ;
For i = 3 to n { L(i) = max(L(i - 1), L´(i));}

• Example: P = asdbasasas, n = 10
• Values of Nj(P) for j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0
• Computed values of i for j = 1..9: 11, 9, 11, 11, 11, 9, 11, 7, 11
• Values of L´(i) for i = 1..9: 0, 0, 0, 0, 0, 0, 8, 0, 6
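A runnable sketch of the two loops. To keep it self-contained, the N values are computed here by direct comparison (quadratic), which is an assumption made for brevity; the Z-based computation would be used to stay linear. The tables have one extra slot so that the write to index n + 1 is harmless.

```python
def n_values(pattern):
    """N[j] for 1 <= j <= n-1 by direct comparison (index 0 unused)."""
    n = len(pattern)
    big_n = [0] * (n + 1)
    for j in range(1, n):
        length = 0
        while length < j and pattern[j - 1 - length] == pattern[n - 1 - length]:
            length += 1
        big_n[j] = length
    return big_n


def good_suffix_tables(pattern):
    """Return (L', L) as lists indexed by 1-based position i."""
    n = len(pattern)
    big_n = n_values(pattern)
    l_prime = [0] * (n + 2)  # slot n+1 absorbs the N_j = 0 case
    for j in range(1, n):
        l_prime[n - big_n[j] + 1] = j  # later (larger) j overwrite earlier ones
    big_l = [0] * (n + 2)
    if n >= 2:
        big_l[2] = l_prime[2]
    for i in range(3, n + 1):
        big_l[i] = max(big_l[i - 1], l_prime[i])
    return l_prime, big_l
```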
Concept: Suffix Shift Rule

• Let l´(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists.
Example: P = asasbsasas
l´(1) = 10 (P itself)
l´(2) = l´(3) = l´(4) = l´(5) = l´(6) = l´(7) = 4
l´(8) = l´(9) = 2
l´(10) = 0
Concept: Suffix Shift Rule

• Thm: l´(i) = largest j <= n – i + 1 s.t. Nj(P) = j.
• Q: How can we compute l´(i) values in linear time?
• A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
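One way to read the theorem as code: sweep j upward, remember the largest j seen so far with N_j(P) = j, and store it at i = n - j + 1. The sketch below tests N_j(P) = j by comparing the prefix P[1..j] with the suffix of the same length, which is quadratic and only an illustrative stand-in; with precomputed N values the same sweep is linear. It counts P itself for i = 1, as the golgol example later in the slides does.

```python
def small_l_prime(pattern):
    """Return l' as a list indexed by 1-based position i (index 0 unused)."""
    n = len(pattern)
    lp = [0] * (n + 2)
    best = 0
    for j in range(1, n + 1):
        # N_j(P) = j exactly when the prefix P[1..j] is also a suffix of P.
        if pattern[:j] == pattern[n - j:]:
            best = j
        lp[n - j + 1] = best  # largest qualifying j' with j' <= n - i + 1
    return lp
```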
Boyer-Moore Algorithm
Preprocessing:
Compute L´(i) and l´(i) for each position i in P,
Compute R(x), the right-most occurrence of x in P, for each character x in Σ.
Search:
k = n;
While k <= m {
i = n; h = k;
While i > 0 and P(i) = T(h) {
i = i – 1; h = h – 1;}
if i = 0 {
report occurrence of P in T ending at position k.
k = k + n - l´(2);}
else Shift P (increase k) by the max amount indicated by the
extended bad character rule and the good suffix rule.
}
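The search loop above can be sketched end to end as follows. This is one possible reading, not the lecturer's reference solution: the preprocessing uses direct comparisons (quadratic) to stay short, a mismatch at P(n) is given a good suffix shift of 1, and occurrences are reported by their 1-based start position k - n + 1. All names are illustrative.

```python
def boyer_moore(pattern, text):
    """Return 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    # N_j by direct comparison, then L'(i) as in the preprocessing loops.
    big_n = [0] * (n + 1)
    for j in range(1, n):
        length = 0
        while length < j and pattern[j - 1 - length] == pattern[n - 1 - length]:
            length += 1
        big_n[j] = length
    l_prime = [0] * (n + 2)
    for j in range(1, n):
        l_prime[n - big_n[j] + 1] = j
    # l'(i): longest suffix of P[i..n] that is also a prefix of P.
    small_l = [0] * (n + 2)
    best = 0
    for j in range(1, n + 1):
        if pattern[:j] == pattern[n - j:]:
            best = j
        small_l[n - j + 1] = best
    # Occurrence lists for the extended bad character rule, rightmost first.
    occ = {}
    for pos in range(n, 0, -1):
        occ.setdefault(pattern[pos - 1], []).append(pos)

    hits = []
    k = n  # text position aligned with the right end of P
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - small_l[2]
        else:
            j = next((p for p in occ.get(text[h - 1], []) if p < i), 0)
            bad = i - j
            if i == n:
                good = 1  # nothing matched, so there is no good suffix
            elif l_prime[i + 1] > 0:
                good = n - l_prime[i + 1]
            else:
                good = n - small_l[i + 1]
            k += max(bad, good)
    return hits
```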
Boyer-Moore Algorithm
Example: P = golgol
Preprocessing:
Compute L´(i) and l´(i) for each position i in P

For i = 1 to n {L´(i) = 0;}

For j = 1 to n – 1 {
i = n - Nj(P) + 1;
L´(i) = j;
}

Notice that first we need Nj(P) values in order to compute L´(i) and l´(i) for each
position i in P.
Boyer-Moore Algorithm
Example: P = golgol
Recall that Nj(P) is the length of the longest suffix of P[1..j]
that is also a suffix of P.

N1(P) = 0, there is no suffix of P that ends with g

N2(P) = 0, there is no suffix of P that ends with o
N3(P) = 3, there is a suffix of P that ends with l

N4(P) = 0, there is no suffix of P that ends with g

N5(P) = 0, there is no suffix of P that ends with o
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Compute L´(i) and l´(i) for each position i in P
For i = 1 to n {L´(i) = 0;}
For j = 1 to n – 1 {
i = n - Nj(P) + 1;
L´(i) = j;
}
j=1i=7 Therefore L´(7) = 1
j=2i=7 Therefore L´(7) = 2
j=3i=4 Therefore L´(4) = 3
j=4i=7 Therefore L´(7) = 4
j=5i=7 Therefore L´(7) = 5
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
Compute l´(i) for each position i in P.
Recall that l´(i) is the length of the longest suffix of P[i..n] that
is also a prefix of P.
l´(1) = 6 since golgol (P itself) is the longest suffix of P[1..n] that is a prefix of P.
l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Compute the list R(x) of occurrences of x in P, right-most first,
for each character x in Σ = {g, o, l}

R(g) = <4, 1>

R(o) = <5, 2>
R(l) = <6, 3>
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
l´(1) =6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>

Search:
k = n;
While k <= m {
i = n; h = k;
While i > 0 and P(i) = T(h) {
i = i – 1; h = h – 1;}
if i = 0 {
report occurrence of P in T ending at position k.
k = k + n - l´(2);}
else Shift P (increase k) by the max amount indicated by the
extended bad character rule and the good suffix rule.
}
Search
k = 6;
While k <= 9 {
i = 6; h = k;
While i > 0 and P(i) = T(h) {
i = i – 1; h = h – 1;}
if i = 0 {                                  But i = 1!
report occurrence of P in T ending at position k.
k = k + 6 - l´(2);}
else Shift P (increase k) by the max amount indicated by the
extended bad character rule and the good suffix rule.
}

lolgolgol
golgol
^^^^^^
123456
The comparisons at (i, h) = (6, 6), (5, 5), (4, 4), (3, 3) and (2, 2) all match. At i = 1, h = 1: P(1) = g != T(1) = l.

Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place.
Good Suffix Rule: Since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6-3=3 places. Thus k = k + 3 = 9
Search
k = 6;
While k <= 9 {
i = 6; h = k;
While i > 0 and P(i) = T(h) {
i = i – 1; h = h – 1;}
if i = 0 {
report occurrence of P in T ending at position k.
k = k + 6 - l´(2);}
else Shift P (increase k) by the max amount indicated by the
extended bad character rule and the good suffix rule.
}

lolgolgol
   golgol
With k = 9, the comparisons from (i, h) = (6, 9) down to (1, 4) all match, leaving i = 0 and h = 3.

i = 0, report occurrence of P in T at position 4 (ending at k = 9),
k = k + 6 - l´(2) = 9 + 6 - 3 = 12
k = 12 > 9, we are done!
Homework 1: Due Next Week
• Implement the Boyer-Moore Algorithm
Break
KMP Algorithm
• Preliminaries:
– KMP can be easily explained in terms of finite
state machines.
– KMP has an easily proved linear bound
– KMP is usually not the method of choice
KMP Algorithm
• Recall that the naïve approach to string matching is Θ(mn).
• How can we reduce this complexity?
– Avoid redundant comparisons
– Use larger shifts
• Boyer-Moore good suffix rule
• Boyer-Moore extended bad character rule
KMP Algorithm
• KMP finds larger shifts by recognizing patterns
in P.
– Let spi(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
– By definition sp1 = 0 for any string.
– Q: Why does this make sense?
– A: The proper suffix must be the empty string.
KMP Algorithm
• Example: P = abcaeabcabd
– P[1..2] = ab hence sp2 = ?
– sp2 = 0
– P[1..3] = abc hence sp3 = ?
– sp3 = 0
– P[1..4] = abca hence sp4 = ?
– sp4 = 1
– P[1..5] = abcae hence sp5 = ?
– sp5 = 0
– P[1..6] = abcaea hence sp6 = ?
– sp6 = 1
KMP Algorithm
• Example Continued
– P[1..7] = abcaeab hence sp7 = ?
– sp7 = 2
– P[1..8] = abcaeabc hence sp8 = ?
– sp8 = 3
– P[1..9] = abcaeabca hence sp9 = ?
– sp9 = 4
– P[1..10] = abcaeabcab hence sp10 = ?
– sp10 = 2
– P[1..11] = abcaeabcabd hence sp11 = ?
– sp11 = 0
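The sp values in this example can be reproduced with a short sketch. It uses the classical prefix-function recurrence (fall back through earlier sp values on a mismatch), which is a swapped-in technique chosen for brevity and is not taken from these slides. The function name is illustrative.

```python
def sp_values(pattern):
    """Return sp as a list with sp[i] = sp_i(P) for 1 <= i <= n
    (index 0 unused and left at 0)."""
    n = len(pattern)
    sp = [0] * (n + 1)
    k = 0  # length of the current matched prefix
    for i in range(2, n + 1):
        # pattern[k] is P(k+1); shrink k until P(k+1) matches P(i) or k is 0.
        while k > 0 and pattern[k] != pattern[i - 1]:
            k = sp[k]
        if pattern[k] == pattern[i - 1]:
            k += 1
        sp[i] = k
    return sp
```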
KMP Algorithm
• Like the a/a concept for Boyer-Moore, there is
an analogous spi/spí concept.
• Let spí(P) denote the length of the longest proper
suffix of P[1..i] that matches a prefix of P, with the
added condition that characters P(i + 1) and P(spí
+ 1) are unequal.
• Example: P = abcdabce sp´7 = 3
α x α y
i
Obviously spí(P) <= spi(P), since the later is less
restrictive.
KMP Algorithm
• KMP Shift Rule:
1. Mismatch case:
• Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan.
• Shift P to the right, aligning P[1..sp´i] with T[k-sp´i..k-1]
2. Match case:
• If no mismatch is found, an occurrence of P has been found.
• Shift P by n – sp´n spaces to continue searching for other occurrences.
KMP Algorithm

• Observations:
– The prefix P[1..sp´i] of the shifted P is shifted to match the corresponding substring in T.
– Subsequent character matching proceeds from position sp´i + 1
– Unlike Boyer-Moore, the matched substring is not compared again.
– The shift rule based on sp´i guarantees that the exact same mismatch won’t occur at sp´i + 1 but doesn’t guarantee that P(sp´i + 1) = T(k)
KMP Algorithm

• Example: P = abcxabcde
– If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
– Q: Where did the 4 position shift come from?
– A: The number of positions is given by i - sp´i; in this example i = 7, sp´7 = 3, 7 – 3 = 4
– Notice that we know the amount of shift without knowing anything about T other than there was a mismatch at position 8.
KMP Algorithm

• Example Continued: P = abcxabcde

– After the shift, P[1..3] lines up with T[k-3..k-1]
– Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
– The scan continues from P(4) & T(k)
• Advantages of KMP Shift Rule
1. P is often shifted by more than 1 character, (i - sp´i)
2. The left-most sp´i characters in the shifted P are known to match the corresponding characters in T.
KMP Algorithm

Full Example: T = xyabcxabcxadcdqfeg P = abcxabcde

Assume that we have already shifted past the first two
positions in T.

xyabcxabcxadcdqfeg
  abcxabcde
  ^^^^^^^^
  12345678
Comparisons 1 to 7 match; comparison 8: d != x, shift 4 places.

xyabcxabcxadcdqfeg
      abcxabcde
Start again from position 4 of P.
Preprocessing for KMP
Approach: show how to derive sp´ values from Z values.
Definition: Position j > 1 maps to i if i = j + Zj(P) – 1
– Recall that Zj(P) denotes the length of the Z-box starting at position j.
– This says that j maps to i if i is the right end of a Z-box starting at j.

Preprocessing for KMP
Definition: Position j > 1 maps to i if i = j + Zj(P) – 1

Theorem. For any i > 1, sp´i(P) = Zj = i – j + 1
where j > 1 is the smallest position that maps to i.
If there is no such j then sp´i(P) = 0.
Similarly for sp:
For any i > 1, spi(P) = i – j + 1
where j, i ≥ j > 1, is the smallest position that maps to i or beyond.
If there is no such j then spi(P) = 0.
Preprocessing for KMP
Given the theorem from the preceding slide, the sp´i and spi values can be computed in linear time using Zi values:
For i = 1 to n { sp´i = 0;}
For j = n downto 2 {
i = j + Zj(P) – 1;
sp´i = Zj;
}

spn(P) = sp´n(P);
For i = n - 1 downto 2 {
spi(P) = max[spi+1(P) - 1, sp´i(P)];}
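A runnable sketch of these two loops, assuming the standard linear-time Z algorithm for the Zj values. Names are illustrative; both tables are indexed by 1-based position with index 0 unused.

```python
def z_values(s):
    """z[k] = length of the longest substring of s starting at 0-based
    index k that is also a prefix of s (z[0] is left at 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def sp_tables(pattern):
    """Return (sp', sp) computed from the Z values of P."""
    n = len(pattern)
    z = z_values(pattern)  # z[j-1] holds Z_j
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):
        i = j + z[j - 1] - 1  # right end of the Z-box that starts at j
        sp_prime[i] = z[j - 1]  # the smallest such j is processed last and wins
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp
```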
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 ≤ i ≤ n + 1, sp´_0 = 0
(similarly F(i) = sp_{i-1} + 1, 1 ≤ i ≤ n + 1, sp_0 = 0)

xyabcxabcxadcdqfeg
  abcxabcde
      abcxabcde
Comparisons 1 to 7 match; comparison 8: d != x, shift 4 places. The pointer c in T stays on the mismatched character.

Shifting is only conceptual and P is never explicitly shifted.

Two special cases:
1. Mismatch at position 1, then F´(1) = 1
2. Match found, then P shifts by n - sp´n places, which is F´(n+1) = sp´n + 1
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 ≤ i ≤ n + 1, sp´_0 = 0
(similarly F(i) = sp_{i-1} + 1, 1 ≤ i ≤ n + 1, sp_0 = 0)
• Idea:
– We maintain a pointer i in P and c in T.
– After a mismatch at P(i+1) with T(c), shift P to align P(sp´i + 1) with T(c), i.e., i = sp´i + 1.
– Special case 1: i = 1 ⇒ set i = F´(1) = 1 & c = c + 1
– Special case 2: we find P in T ⇒ shift n - sp´n spaces, i.e., i = F´(n + 1) = sp´n + 1.
Full KMP Algorithm
Preprocess P to find F´(k) = sp´_{k-1} + 1 for k from 1 to n + 1
c = 1; p = 1;
While c + (n – p) ≤ m {
While p ≤ n and P(p) = T(c) {
p = p + 1;
c = c + 1;}
If (p = n + 1) then
report an occurrence of P at position c – n of T.
if (p = 1) then c = c + 1;
p = F´(p);}

T = xyabcxabcxadcdqfeg   |T| = m   (pointer c starts at T(1))
P = abcxabcde            |P| = n   (pointer p starts at P(1))
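The loop above translates almost line for line into Python. In this sketch the failure function is derived from Z values using the standard Z algorithm, which is an implementation choice for the example. Positions are 1-based as in the slides and the helper names are illustrative.

```python
def z_values(s):
    """z[k] = length of the longest substring of s starting at 0-based
    index k that is also a prefix of s (z[0] is left at 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def failure_function(pattern):
    """Return F' with F'[p] = sp'_{p-1} + 1 for 1 <= p <= n + 1."""
    n = len(pattern)
    z = z_values(pattern)
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):
        sp_prime[j + z[j - 1] - 1] = z[j - 1]
    return [0] + [sp_prime[p - 1] + 1 for p in range(1, n + 2)]


def kmp_search(pattern, text):
    """Return 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    f = failure_function(pattern)
    hits = []
    c = p = 1  # pointers into T and P
    while c + (n - p) <= m:
        # Test p <= n first so that P(p) is never read past the end of P.
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)
        if p == 1:
            c += 1
        p = f[p]
    return hits
```

On T = xyabcxabcxabcdefeg the search reports the occurrence starting at position 7, while T = xyabcxabcxadcdqfeg contains no occurrence of abcxabcde.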
Full KMP Algorithm
c = 1; p = 1;
While c + (n – p) ≤ m {
While p ≤ n and P(p) = T(c) {
p = p + 1;
c = c + 1;}
If (p = n + 1) then                               p != n+1
report an occurrence of P at position c – n of T.
if (p = 1) then c = c + 1;                        p = 1! ⇒ c = 2
p = F´(p);                                        p = F´(1) = 1
}
xyabcxabcxabcdefeg
abcxabcde
^
1 a!=x
Full KMP Algorithm
c = 1; p = 1;
While c + (n – p) ≤ m {
While p ≤ n and P(p) = T(c) {
p = p + 1;
c = c + 1;}
If (p = n + 1) then                               p != n+1
report an occurrence of P at position c – n of T.
if (p = 1) then c = c + 1;                        p = 1! ⇒ c = 3
p = F´(p);                                        p = F´(1) = 1
}
xyabcxabcxabcdefeg
 abcxabcde
 ^
 1 a!=y
Full KMP Algorithm
c = 1; p = 1;
While c + (n – p) ≤ m {
While p ≤ n and P(p) = T(c) {
p = p + 1;
c = c + 1;}
If (p = n + 1) then                               p != n+1
report an occurrence of P at position c – n of T.
if (p = 1) then c = c + 1;                        p = 8! ⇒ don’t change c
p = F´(p);                                        p = F´(8) = 4
}
xyabcxabcxabcdefeg
  abcxabcde
  ^^^^^^^^
  12345678 d!=x
Full KMP Algorithm
c = 1; p = 1;
While c + (n – p) ≤ m {                           p = 4, c = 10
While p ≤ n and P(p) = T(c) {
p = p + 1;
c = c + 1;}
If (p = n + 1) then                               p = n+1 !
report an occurrence of P at position c – n of T.
if (p = 1) then c = c + 1;
p = F´(p);
}
xyabcxabcxabcdefeg
      abcxabcde
         ^^^^^^
         456789
Comparisons 4 to 9 all match, so c = 16 and the occurrence is reported at position c – n = 7.
Real-Time KMP

• Q: What is meant by real-time algorithms?

• A: Typically these are algorithms that are meant
to interact synchronously in the real world.
– This implies a known fixed turn-around time for
processing a task
– Many embedded scheduling systems are examples
involving real-time algorithms.
– For KMP this means that we require a constant time
for processing all strings of length n.
Real-Time KMP

• Q: Why is KMP not real-time?

• A: For any mismatched character in T, we may try matching it several times.
– Recall that sp´i only guarantees that P(i + 1) and P(sp´i + 1) differ
– There is NO guarantee that P(sp´i + 1) and T(k) match
• We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k).
• This means that we have to compute sp´i values with respect to all characters in Σ since any could appear in T.
Real-Time KMP

• Define: sp´(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´(i,x) + 1) is x.
• This will tell us exactly what shift to use for each possible mismatch.
• A mismatched character T(k) will never be involved in subsequent comparisons.
Real-Time KMP

• Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?
• A: Because the shift will shift P so that either the matching character aligns with T(k) or P will be shifted past T(k).
• This results in a real-time version of KMP.
• Let’s consider how we can find the sp´(i,x)(P) values in linear time.
Real-Time KMP

Thm. For x ≠ P[i + 1], sp´(i,x)(P) = i - j + 1
– Here j is the smallest position such that j maps to i and P(Zj + 1) = x.
– If there is no such j then sp´(i,x)(P) = 0
For i = 1 to n { sp´(i,x) = 0 for every character x;}
For j = n downto 2 {
i = j + Zj(P) – 1;
x = P(Zj + 1);
sp´(i,x) = Zj;
}
Real-Time KMP
For i = 1 to n { sp´(i,x) = 0 for every character x;}
For j = n downto 2 {
i = j + Zj(P) – 1;
x = P(Zj + 1);
sp´(i,x) = Zj;}

• Notice how this works:
– Starting from the right
• Find i, the right end of the Z box associated with j
• Find x, the character immediately following the prefix corresponding to this Z box.
• Set sp´(i,x) = Zj, the length of this Z box.
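A sketch of this loop, again assuming the standard Z algorithm. Instead of a full n-by-|Σ| table it stores only the non-zero entries in a dictionary keyed by (i, x), so a missing key stands for the value 0. That sparse representation and the function names are assumptions made for illustration.

```python
def z_values(s):
    """z[k] = length of the longest substring of s starting at 0-based
    index k that is also a prefix of s (z[0] is left at 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def sp_prime_by_char(pattern):
    """Return {(i, x): sp'(i,x)} holding the non-zero entries only."""
    n = len(pattern)
    z = z_values(pattern)
    table = {}
    for j in range(n, 1, -1):  # from the right, so the smallest j wins
        zj = z[j - 1]
        if zj == 0:
            continue  # an empty Z-box contributes only the default value 0
        i = j + zj - 1  # right end of the Z-box that starts at j
        x = pattern[zj]  # P(Zj + 1): the character following the prefix
        table[(i, x)] = zj
    return table
```

For P = abcxabcde the only non-zero entry is sp´(7,x) = 3: after a mismatch at position 8 against a text character x, P can be shifted so that P(4) = x lands on that character.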
Reference
• Chapter 1, 2: Exact Matching: Fundamental
Preprocessing and First Algorithms

Factors Considered in Deciding Compensation
21 pages
Boyer Moore Algorithm: Idan Szpektor
100% (1)
Boyer Moore Algorithm: Idan Szpektor
48 pages
String Matching
No ratings yet
String Matching
35 pages
Text Pattern Search Using Naïve Algorithm: Justine Estoesta, Patricia Mae Omana, Winci John Singh
No ratings yet
Text Pattern Search Using Naïve Algorithm: Justine Estoesta, Patricia Mae Omana, Winci John Singh
5 pages
String Match - Horspool Sad Life
No ratings yet
String Match - Horspool Sad Life
4 pages
Long Term Production Planning of Open Pit Mines by Ant Colony PDF
100% (1)
Long Term Production Planning of Open Pit Mines by Ant Colony PDF
12 pages
String Matching
100% (1)
String Matching
12 pages
Mathematical Model For String Pattern Matching Algorithm (Boyer-Moore's Algorithm)
No ratings yet
Mathematical Model For String Pattern Matching Algorithm (Boyer-Moore's Algorithm)
5 pages
String Matching Algorithms
No ratings yet
String Matching Algorithms
25 pages
A Two Way Pattern Matching Algorithm Using Sliding Patterns
No ratings yet
A Two Way Pattern Matching Algorithm Using Sliding Patterns
5 pages
Optimization of Airport Ground Operations
No ratings yet
Optimization of Airport Ground Operations
23 pages
Finding Errors (Past Paper Questions With Answer) - Unlocked
100% (5)
Finding Errors (Past Paper Questions With Answer) - Unlocked
30 pages
Data Structures Using C: Example 4.13
No ratings yet
Data Structures Using C: Example 4.13
5 pages
Knuth-Morris-Pratt Algorithm KENT
No ratings yet
Knuth-Morris-Pratt Algorithm KENT
4 pages
Strings and Pattern Matching
No ratings yet
Strings and Pattern Matching
17 pages
Application of A Modified Convolution Method To Exact String Matching
No ratings yet
Application of A Modified Convolution Method To Exact String Matching
6 pages


