Exact String Matching Algorithms: Presented by Dr. Shazzad Hosain Assoc. Prof. EECS, NSU

Exact String Matching Algorithms

Presented By
Dr. Shazzad Hosain
Assoc. Prof. EECS, NSU
Classical Comparison Based Methods

• Boyer-Moore Algorithm
• Knuth-Morris-Pratt Algorithm (KMP Algorithm)
Boyer-Moore Algorithm
• Basic ideas:
– Previously discussed ideas for naïve matching
1. successively align P and T to check for a match.
2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm
1. Scan from right to left
2. Special bad character rule
3. Suffix shift rule
Concept: Right-to-left Scan

• How can we check for a match of pattern P at location i in target T?
• Naïve algorithm scanned left-to-right, i.e., compare T[i+k] with P[1+k] for k = 0 to length(P)-1
Example: P = adab, T = abaracadabara
abaracadabara
adab
^^
12
Comparison 1: a == a. Comparison 2: d != b.
Concept: Right-to-left Scan

• Alternative: scan right-to-left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0
Example: P = adab, T = abaracadabara
abaracadabara
adab
   ^
   1
Comparison 1: b != r.
Concept: Right-to-left Scan
• Why is scanning right-to-left a good idea?
• Answer: by itself, it isn’t any better than left-to-right.
– A naïve approach with right-to-left scanning is also Θ(nm).
– Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
Concept: Bad Character Rule

• Idea: the mismatched character indicates a safe minimum shift.
Example: P = adacara, T = abaracadabara
abaracadabara
adacara
     ^^
     21
Comparison 1: a == a. Comparison 2: r != c.

Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?
Concept: Bad Character Rule

Shift two positions to align the rightmost occurrence of the mismatched character c in P.
abaracadabara
  adacara

Now, start matching again from right to left.


Concept: Bad Character Rule

Second Example: P = adacara, T = abaxaradabara
abaxaradabara
adacara
   ^^^^
   4321
Comparisons 1 to 3: a == a, r == r, a == a. Comparison 4: c != x.

Here the bad character is x. The minimum shift should align this character with its occurrence in P.

But x doesn’t occur in P!


Concept: Bad Character Rule

Since x doesn’t occur in P, we can shift past it.

Second Example: P = adacara, T = abaxaradabara
abaxaradabara
    adacara

Now, start matching again from right to left.


Concept: Bad Character Rule

• The idea of the bad character rule is to shift P by more than one character when possible.
• But if the rightmost occurrence of the mismatched character in P lies to the right of the mismatch position, the rule cannot give a shift larger than one.
• Unfortunately, this is often the case.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat
Concept: Bad Character Rule
• We will define a bad character rule that uses the concept of the rightmost occurrence of each letter.
• Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet.
• If x doesn’t occur in P, define R(x) to be 0.

Example: P = adacara (positions 1234567)
x:    a  b  c  d  ...  z
R(x): 7  0  4  2  ...  0
Among the letters not shown, only r occurs in P, with R(r) = 6.
Concept: Bad Character Rule
   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
P:     tpabsab

R(t)=1, R(s)=5.
i: the position of mismatch in P. i=3
k: the counterpart in T. k=5. T[k]=t
• The bad character rule says P should be shifted right by max{1, i-R(T[k])}, i.e., if the rightmost occurrence of character T[k] in P is in position j (j<i), then P[j] should be below T[k] after the shift. Here the shift is max{1, 3-1} = 2.
• Otherwise, i.e., when R(T[k]) >= i, we have 1 >= i - R(T[k]) and we shift P by one position.
• This rule is not very useful when R(T[k]) >= i, which is usually the case for small alphabets such as DNA sequences.
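The rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names `rightmost_positions` and `bad_character_shift` are hypothetical, and positions are 1-based to match the slide notation.

```python
def rightmost_positions(pattern):
    """Build R: character -> rightmost 1-based position in pattern."""
    table = {}
    for pos, ch in enumerate(pattern, start=1):
        table[ch] = pos  # later occurrences overwrite earlier ones
    return table


def bad_character_shift(table, i, text_char):
    """Shift for a mismatch at pattern position i against text_char.

    Characters absent from the pattern have R(x) = 0.
    """
    return max(1, i - table.get(text_char, 0))


R = rightmost_positions("tpabsab")
shift = bad_character_shift(R, 3, "t")  # mismatch at i = 3 against T[k] = t
```

With P = tpabsat instead, R(t) is 7, so the same mismatch yields only the minimum shift of 1.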
Concept: Extended Bad Character Rule

Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].
Example: P = aracara, T = abararadabara
abararadabara
aracara
   ^^^^
   4321
Comparisons 1 to 3: a == a, r == r, a == a. Comparison 4: c != r.
Position 6 holds the rightmost occurrence of r in P. Position 2 holds the rightmost occurrence of r to the left of i in P.
Notice that i - R(T[k]) < 0, i.e., 4 – 6 < 0, whereas 4 – 2 > 0, i.e., this gives us a positive shift.
Concept: Extended Bad Character Rule

The amount of shift is i – j, where:
– i is the index of the mismatch in P.
– j is the rightmost occurrence of T[k] to the left of i in P.
Example: P = aracara, T = abataradabara
abataradabara
aracara
   ^^^^
   4321
Comparisons 1 to 3: a == a, r == r, a == a. Comparison 4: c != t.
There is no occurrence of t in P, thus j = 0. Notice that i – j = 4, i.e., this gives us a positive shift past the point of mismatch.
Concept: Extended Bad Character Rule
• How do we implement this rule?
• We preprocess P (from right to left), recording the
position of each occurrence of the letters.
• For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn’t occur in P, then it has an empty list.
Concept: Extended Bad Character Rule
Example: Σ = {a, b, c, d, r, t}, P = abataradabara
• a_list = <13, 11, 9, 7, 5, 3, 1> since ‘a’ occurs at these positions in P
• b_list = <10, 2>
• c_list = Ø
• d_list = <8>
• r_list = <12, 6>
• t_list = <4>
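The occurrence lists and the lookup they support might be sketched as follows. The names `occurrence_lists` and `extended_bad_character_shift` are hypothetical, and positions are 1-based as on the slides.

```python
from collections import defaultdict


def occurrence_lists(pattern):
    """Map each character to its 1-based positions in pattern, rightmost first."""
    lists = defaultdict(list)
    for pos in range(len(pattern), 0, -1):  # scan P from right to left
        lists[pattern[pos - 1]].append(pos)
    return lists


def extended_bad_character_shift(lists, i, text_char):
    """Return i - j, where j is the closest occurrence of text_char left of i."""
    for j in lists.get(text_char, ()):
        if j < i:
            return i - j
    return i  # no occurrence to the left of i, so j = 0


lists = occurrence_lists("aracara")
shift_r = extended_bad_character_shift(lists, 4, "r")
shift_t = extended_bad_character_shift(lists, 4, "t")
```

Walking a list costs at most as many steps as the shift it produces skips comparisons, which is why the lists are stored rightmost first.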
Concept: Suffix Shift Rule
• Recall that we investigated finding prefixes before.
• Since we are matching P to T from right-to-left, we will
instead need to use suffixes.
Suffix Shift Rule

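t is a suffix of P that matches a substring t of T; the mismatch occurs just to its left, where x in T meets y in P (x≠y).
t’ is the right-most copy of t in P such that t’ is not a suffix of P and the character z to the left of t’ differs from y (z≠y).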
Concept: Suffix Shift Rule

• Consider the partial right-to-left matching of P to T below.
• This partial match involves α, a suffix of P (here α = badbaddog).

T ....................axbadbaddog.....
P ....................adbadbaddog
Concept: Suffix Shift Rule

• This partial match ends where the first mismatch occurs, where x is aligned with d.

T ....................axbadbaddog.....
P ....................adbadbaddog
Concept: Suffix Shift Rule

We want to find a right-most copy α´ of this substring α in P such that:
1. α´ is not a suffix of P and
2. The character to the left of α´ is not the same as the character to the left of α

T ........................axbadbaddog.....
P .........gbadbaddoghorseadbadbaddog
Here α´ is the copy of badbaddog preceded by g.
Concept: Suffix Shift Rule

1. If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.

T                ......................xbadbaddog.............
P                .........gbadbaddogcatdbadbaddog
P after shifting              .........gbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule

2. If α´ doesn’t exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

T                .........xbadbaddog................
P                dogcatratdbadbaddog
P after shifting                 dogcatratdbadbaddog


Concept: Suffix Shift Rule

3. If α´ doesn’t exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

T                .........xbadbaddog...................
P                batcatratdbadbaddog
P after shifting                    batcatratdbadbaddog


Preprocessing for the good suffix rule

• Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)].
• If there is no such position, then L(i) = 0
• Example 1: If i = 17 then L(i) = 9
P = batcatdogdbadbaddog (n = 19); P[17..19] = dog also occurs at P[7..9].

• Example 2: If i = 16 then L(i) = 0
P = batcatdogdbadbaddog; P[16..19] = ddog occurs nowhere else in P.
Concept: Suffix Shift Rule

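• Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding that suffix is not equal to P(i-1).
• If there is no such position, then L´(i) = 0
• Example 1: If i = 20 then L(i) = 12 and L´(i) = 6

P = slydogsaddogdbadbaddog (n = 22)
P[20..22] = dog also ends at positions 12 and 6. The copy ending at 12 is preceded by d = P(19), so it is ruled out for L´; the copy ending at 6 is preceded by y.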
Concept: Suffix Shift Rule

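• Example 2: If i = 19 then L(i) = 12 and L´(i) = 0

P = slydogsaddogdbadbaddog (n = 22)
P[19..22] = ddog also ends at position 12, but that copy is preceded by a = P(18), so L´(19) = 0.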
Concept: Suffix Shift Rule

• Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P.
• In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P(i-1).
• The relation between L´(i) and L(i) is analogous to the relation between α´ and α.

P = slydogsaddogdbadbaddog, with L´(20) = 6 and L(20) = 12.
Concept: Suffix Shift Rule

• Q: What is the point?
• A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions. Example:

T                ......................xbadbaddog.............
P                .........gbadbaddogcatdbadbaddog
P after shifting              .........gbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule

• If L(i) and L´(i) are different, then L´(i) < L(i), so shifting by n - L´(i) positions is a greater shift than n - L(i).
• Example (n = 31, i = 26, α = baddog, L(i) = 18, L´(i) = 9):

T                .....................xbaxbaddog......................
P                slybaddogbadbaddogcatdbadbaddog
P after shifting                       slybaddogbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule

• Let N_j(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P.
• Example 1: N_6(P) = 3 and N_12(P) = 5.
P = slydogsaddogdbadbaddog (P[1..6] ends in dog, P[1..12] ends in addog)
• Example 2: N_3(P) = 2, N_9(P) = 3, N_15(P) = 5, N_19(P) = 0.
P = hogslydogsaddogdbadbaddog
Concept: Suffix Shift Rule

• Q: How are the concepts of N_i and Z_i related?
• Recall that Z_i = length of a maximal substring starting at position i, which is a prefix of P.
• In contrast, N_i = length of a maximal substring ending at position i, which is a suffix of P.
• In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right-to-left.
Concept: Suffix Shift Rule

• Let P^r denote the mirror image of P, then the relationship can be expressed as N_j(P) = Z_(n-j+1)(P^r).
• In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P.
• Q: Why must this be true?
• A: Because they are the same substring, except that one is the reverse of the other.
Concept: Suffix Shift Rule

• Since N_j(P) = Z_(n-j+1)(P^r), we can use the Z algorithm to compute N in O(n).
• Q: How do we do this?
• A: We create P^r, the reverse of P, and process it with the Z algorithm.
Concept: Suffix Shift Rule

N is the reverse of Z!
P: the pattern
P^r: the string obtained by reversing P

Then N_j(P) = Z_(n-j+1)(P^r)

position: 1 2 3 4 5 6 7 8 9 10
P:        q c a b d a b d a b
N_j:      0 0 0 2 0 0 5 0 0 0
P^r:      b a d b a d b a c q
Z_i:      0 0 0 5 0 0 2 0 0 0
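The reversal trick can be sketched as below. This assumes a standard linear-time Z algorithm; the function names `z_values` and `n_values` are hypothetical, and the entry for the whole string (Z_1, hence N_n) is left at 0 to match the table above.

```python
def z_values(s):
    """Z algorithm: z[k] is the length of the longest substring of s
    starting at index k (0-based) that is also a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0  # current Z-box is s[left:right]
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[k + z[k]] == s[z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def n_values(pattern):
    """N_j(P) = Z_(n-j+1)(P^r) for j = 1..n, returned as a 0-based list."""
    n = len(pattern)
    z = z_values(pattern[::-1])
    # 1-based position n-j+1 in P^r is 0-based index n-j in z
    return [z[n - j] for j in range(1, n + 1)]


N = n_values("qcabdabdab")
```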
Concept: Suffix Shift Rule

For pattern P,
N_j (for j = 1,…,n) can be calculated in O(n) using the Z algorithm.

Why do we need to define N_j?

To use the strong good suffix rule, we need to find L’(i) for every i = 1,…,n.
T:      x t
P: z t’ y t      (t’ ends at position L’(i); t = P[i..n])

We can get L’(i) from N_j!
Concept: Suffix Shift Rule

• We can then find L´(i) and L(i) values from N values in linear time with the following:

For i = 1 to n {L´(i) = 0;}
For j = 1 to n – 1 {
  i = n - N_j(P) + 1;
  L´(i) = j;
}
(When N_j(P) = 0 the computed i is n + 1, which lies outside P; that entry is simply discarded.)

// L values (if desired) can be obtained
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i));}
Concept: Suffix Shift Rule
For i = 1 to n {L´(i) = 0;}
For j = 1 to n – 1 {
  i = n - N_j(P) + 1;
  L´(i) = j;
}
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i));}

• Example: P = asdbasasas, n = 10
• Values of N_j(P) for j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0
• Computed values i = n - N_j(P) + 1: 11, 9, 11, 11, 11, 9, 11, 7, 11
• Values of L´(i) for i = 1..9: 0, 0, 0, 0, 0, 0, 8, 0, 6
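The loop above translates almost line for line into Python. In this sketch the N values are computed by a simple quadratic scan purely to keep the example short (the slides obtain them in linear time from the Z algorithm); `n_values_naive` and `strong_l_values` are hypothetical names.

```python
def n_values_naive(p):
    """N_j(P) for j = 1..n by direct comparison (quadratic, illustration only)."""
    n = len(p)
    values = []
    for j in range(1, n + 1):
        length = 0
        while length < j and p[j - 1 - length] == p[n - 1 - length]:
            length += 1
        values.append(length)
    return values


def strong_l_values(p):
    """L'(i) for i = 1..n, returned as a 0-based list."""
    n = len(p)
    N = n_values_naive(p)
    l_prime = [0] * (n + 2)  # 1-based; slot n+1 absorbs the N_j = 0 case
    for j in range(1, n):    # j = 1 .. n-1, so larger j overwrite smaller
        i = n - N[j - 1] + 1
        l_prime[i] = j
    return l_prime[1:n + 1]


L_prime = strong_l_values("asdbasasas")
```

Because j increases, a later write to the same slot always holds the larger position, which is what the definition of L´(i) asks for.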
Concept: Suffix Shift Rule

• Let l´(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists.
Example: P = asasbsasas
l’(1) = 10 (the whole pattern)
l’(2) = l’(3) = l’(4) = l’(5) = l’(6) = l’(7) = 4 (asas)
l’(8) = l’(9) = 2 (as)
l’(10) = 0
Concept: Suffix Shift Rule

• Thm: l´(i) = largest j <= n – i + 1 s.t. N_j(P) = j.
• Q: How can we compute l´(i) values in linear time?
• A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
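One possible way to apply the theorem is a single right-to-left sweep that remembers the largest j seen so far with N_j(P) = j. This is a sketch of one approach, not necessarily the intended solution to the exercise; the N values are again computed naively for brevity, and both function names are hypothetical.

```python
def n_values_naive(p):
    """N_j(P) for j = 1..n by direct comparison (quadratic, illustration only)."""
    n = len(p)
    values = []
    for j in range(1, n + 1):
        length = 0
        while length < j and p[j - 1 - length] == p[n - 1 - length]:
            length += 1
        values.append(length)
    return values


def small_l_values(p):
    """l'(i) for i = 1..n, returned as a 0-based list."""
    n = len(p)
    N = n_values_naive(p)
    result = [0] * n
    longest = 0
    for i in range(n, 0, -1):   # as i decreases, the bound n - i + 1 grows
        j = n - i + 1
        if N[j - 1] == j:       # the prefix of length j is also a suffix of P
            longest = j
        result[i - 1] = longest
    return result


l_small = small_l_values("asasbsasas")
```

Given the N values, the sweep itself does constant work per position.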
Boyer-Moore Algorithm
Preprocessing:
Compute L´(i) and l´(i) for each position i in P,
Compute R(x), the right-most occurrence of x in P, for each character x in Σ.
Search:
k = n;
While k <= m {
  i = n; h = k;
  While i > 0 and P(i) = T(h) {
    i = i – 1; h = h – 1;}
  if i = 0 {
    report occurrence of P in T ending at position k (i.e., starting at position k – n + 1).
    k = k + n - l´(2);}
  else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}
Boyer-Moore Algorithm
Example: P = golgol
Preprocessing:
Compute L´(i) and l´(i) for each position i in P

For i = 1 to n {L´(i) = 0;}


For j = 1 to n – 1 {
i = n - Nj(P) + 1;
L´(i) = j;
}

Notice that first we need Nj(P) values in order to compute L´(i) and l´(i) for each
position i in P.
Boyer-Moore Algorithm
Example: P = golgol
Recall that N_j(P) is the length of the longest suffix of P[1..j] that is also a suffix of P.

N_1(P) = 0, since P[1..1] ends with g but P ends with l
N_2(P) = 0, since P[1..2] ends with o
N_3(P) = 3, since P[1..3] = gol is a suffix of P
N_4(P) = 0, since P[1..4] ends with g
N_5(P) = 0, since P[1..5] ends with o
N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Compute L´(i) and l´(i) for each position i in P
For i = 1 to n {L´(i) = 0;}
For j = 1 to n – 1 {
i = n - Nj(P) + 1;
L´(i) = j;
}
j=1i=7 Therefore L´(7) = 1
j=2i=7 Therefore L´(7) = 2
j=3i=4 Therefore L´(4) = 3
j=4i=7 Therefore L´(7) = 4
j=5i=7 Therefore L´(7) = 5
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
Compute l´(i) for each position i in P.
Recall that l´(i) is the length of the longest suffix of P[i..n] that
is also a prefix of P.
l´(1) = 6 since golgol itself is the longest suffix of P[1..n] that is a prefix of P.
l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Compute the list R(x), the right-most occurrences of x in P,
for each character x in Σ = {g, o, l}

R(g) = <4, 1>


R(o) = <5, 2>
R(l) = <6, 3>
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>

Search:
k = n;
While k <= m {
  i = n; h = k;
  While i > 0 and P(i) = T(h) {
    i = i – 1; h = h – 1;}
  if i = 0 {
    report occurrence of P in T ending at position k (i.e., starting at position k – n + 1).
    k = k + n - l´(2);}
  else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}
Search
k = 6;
While k <= 9 {
  i = 6; h = k;
  While i > 0 and P(i) = T(h) {
    i = i – 1; h = h – 1;}
  if i = 0 {                                   // But i = 1!
    report occurrence of P in T.
    k = k + 6 - l´(2);}
  else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

lolgolgol
golgol
Comparing right to left, (i, h) = (6, 6), (5, 5), (4, 4), (3, 3), (2, 2) all match; at i = 1, h = 1, P(1) = g != T(1) = l.

Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place.
Good Suffix Rule: Since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6 - 3 = 3 places. Thus k = k + 3 = 9.
Search
k = 6;                                         // now k = 9 after the first shift
While k <= 9 {
  i = 6; h = k;
  While i > 0 and P(i) = T(h) {
    i = i – 1; h = h – 1;}
  if i = 0 {
    report occurrence of P in T.
    k = k + 6 - l´(2);}
  else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

lolgolgol
   golgol
Comparing right to left, (i, h) = (6, 9), (5, 8), (4, 7), (3, 6), (2, 5), (1, 4) all match, leaving i = 0 and h = 3.

i = 0: report occurrence of P in T at position 4,
k = k + 6 - l´(2) = 9 + 6 - 3 = 12
Since k = 12 > 9, we are done!
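Putting the pieces together, the search traced above might look like the following sketch. It assumes the strong good suffix rule and the extended bad character rule as defined on the earlier slides; the function name `boyer_moore` is hypothetical, and the N values are computed by a quadratic scan for brevity, so the preprocessing here is not linear.

```python
def boyer_moore(pattern, text):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []

    # N[j] for j = 1..n (naive computation, illustration only)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        while N[j] < j and pattern[j - 1 - N[j]] == pattern[n - 1 - N[j]]:
            N[j] += 1

    # L'(i): slot n+1 absorbs the N[j] = 0 case and is never read
    l_big = [0] * (n + 2)
    for j in range(1, n):
        l_big[n - N[j] + 1] = j

    # l'(i): largest j <= n - i + 1 with N[j] = j
    l_small = [0] * (n + 2)
    longest = 0
    for i in range(n, 0, -1):
        j = n - i + 1
        if N[j] == j:
            longest = j
        l_small[i] = longest

    # occurrence lists for the extended bad character rule, rightmost first
    occ = {}
    for pos in range(n, 0, -1):
        occ.setdefault(pattern[pos - 1], []).append(pos)

    matches = []
    k = n  # text position aligned with the right end of P
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            matches.append(k - n + 1)
            k += n - l_small[2]
        else:
            bad = next((i - j for j in occ.get(text[h - 1], ()) if j < i), i)
            if i == n:
                good = 1                   # nothing matched yet
            elif l_big[i + 1] > 0:
                good = n - l_big[i + 1]
            else:
                good = n - l_small[i + 1]
            k += max(bad, good)
    return matches


hits = boyer_moore("golgol", "lolgolgol")
```

On the example above the first alignment mismatches at i = 1, the good suffix rule wins with a shift of 3, and the second alignment reports position 4.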
Homework 1: Due Next Week
• Implement the Boyer-Moore Algorithm
Break
KMP Algorithm
• Preliminaries:
– KMP can be easily explained in terms of finite
state machines.
– KMP has an easily proved linear bound
– KMP is usually not the method of choice
KMP Algorithm
• Recall that the naïve approach to string matching is Θ(mn).
• How can we reduce this complexity?
– Avoid redundant comparisons
– Use larger shifts
• Boyer-Moore good suffix rule
• Boyer-Moore extended bad character rule
KMP Algorithm
• KMP finds larger shifts by recognizing patterns in P.
– Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
– By definition sp_1 = 0 for any string.
– Q: Why does this make sense?
– A: The only proper suffix of a one-character string is the empty string.
KMP Algorithm
• Example: P = abcaeabcabd
– P[1..2] = ab, hence sp_2 = 0
– P[1..3] = abc, hence sp_3 = 0
– P[1..4] = abca, hence sp_4 = 1
– P[1..5] = abcae, hence sp_5 = 0
– P[1..6] = abcaea, hence sp_6 = 1
KMP Algorithm
• Example Continued
– P[1..7] = abcaeab, hence sp_7 = 2
– P[1..8] = abcaeabc, hence sp_8 = 3
– P[1..9] = abcaeabca, hence sp_9 = 4
– P[1..10] = abcaeabcab, hence sp_10 = 2
– P[1..11] = abcaeabcabd, hence sp_11 = 0
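The values in this example can be checked with a short sketch. Note that it uses the classic incremental prefix-function computation rather than the Z-based derivation presented later in the slides; the function name `sp_values` is hypothetical.

```python
def sp_values(p):
    """Return [sp_1, ..., sp_n] for pattern p (0-based list)."""
    n = len(p)
    sp = [0] * n
    k = 0  # length of the current matched prefix
    for i in range(1, n):
        # fall back to shorter borders until p[i] extends one of them
        while k > 0 and p[i] != p[k]:
            k = sp[k - 1]
        if p[i] == p[k]:
            k += 1
        sp[i] = k
    return sp


sp = sp_values("abcaeabcabd")
```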
KMP Algorithm
• Like the a/a concept for Boyer-Moore, there is
an analogous spi/sp´i concept.
• Let sp´i(P) denote the length of the longest proper
suffix of P[1..i] that matches a prefix of P, with the
added condition that characters P(i + 1) and P(sp´i
+ 1) are unequal.
• Example: P = abcdabce sp´7 = 3
α x α y
i
Obviously sp´i(P) <= spi(P), since the later is less
restrictive.
KMP Algorithm
• KMP Shift Rule:
1. Mismatch case:
• Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan.
• Shift P to the right, aligning P[1..sp´_i] with T[k-sp´_i..k-1]
2. Match case:
• If no mismatch is found, an occurrence of P has been found.
• Shift P by n – sp´_n spaces to continue searching for other occurrences.
KMP Algorithm

• Observations:
– The prefix P[1..sp´i] of the shifted P is shifted to match
the corresponding substring in T.
– Subsequent character matching proceeds from
position sp´i + 1
– Unlike Boyer-Moore, the matched substring is not
compared again.
– The shift rule based on sp´i guarantees that the exact
same mismatch won’t occur at sp´i + 1 but doesn’t
guarantee that P(sp´i+1) = T(k)
KMP Algorithm

• Example: P = abcxabcde
– If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
– Q: Where did the 4 position shift come from?
– A: The number of positions is given by i - sp´_i; in this example i = 7, sp´_7 = 3, 7 – 3 = 4
– Notice that we know the amount of shift without knowing anything about T other than there was a mismatch at position 8.
KMP Algorithm

• Example Continued: P = abcxabcde


– After the shift, P[1..3] lines up with T[k-3..k-1]
– Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
– The scan continues from P(4) & T(k)
• Advantages of KMP Shift Rule
1. P is often shifted by more than 1 character (i - sp´_i)
2. The left-most sp´_i characters in the shifted P are known to match the corresponding characters in T.
KMP Algorithm

Full Example: T = xyabcxabcxadcdqfeg P = abcxabcde

Assume that we have already shifted past the first two positions in T.

xyabcxabcxadcdqfeg
  abcxabcde
      abcxabcde
  ^^^^^^^^
  12345678
Comparisons 1 to 7 match; at comparison 8, d != x. Shift 4 places and start again from position 4 of P.
Preprocessing for KMP
Approach: show how to derive sp´ values from Z values.
Definition: Position j > 1 maps to i if i = j + Z_j(P) – 1
– Recall that Z_j(P) denotes the length of the Z-box starting at position j.
– This says that j maps to i if i is the right end of a Z-box starting at j.
Preprocessing for KMP
Definition: Position j > 1 maps to i if i = j + Z_j(P) – 1

Theorem. For any i > 1, sp´_i(P) = Z_j = i – j + 1,
where j > 1 is the smallest position that maps to i.
If no such j exists then sp´_i(P) = 0
Similarly for sp:
For any i > 1, sp_i(P) = i – j + 1,
where j, i ≥ j > 1, is the smallest position that maps to i or beyond.
If no such j exists then sp_i(P) = 0
Preprocessing for KMP
Given the theorem from the preceding slide, the sp´_i and sp_i values can be computed in linear time using Z_i values:
For i = 1 to n { sp´_i = 0;}
For j = n downto 2 {
  i = j + Z_j(P) – 1;
  sp´_i = Z_j;
}

sp_n(P) = sp´_n(P);
For i = n - 1 downto 2 {
  sp_i(P) = max[sp_(i+1)(P) - 1, sp´_i(P)];}
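A direct Python rendering of these two loops is sketched below, assuming a standard Z algorithm. The function names `z_values` and `sp_from_z` are hypothetical; arrays are kept 1-based internally to mirror the pseudocode.

```python
def z_values(s):
    """Z algorithm: z[k] is the length of the longest substring of s
    starting at index k (0-based) that is also a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[k + z[k]] == s[z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def sp_from_z(p):
    """Return ([sp'_1..sp'_n], [sp_1..sp_n]) derived from Z values."""
    n = len(p)
    z = z_values(p)                # z[j-1] holds Z_j
    sp_prime = [0] * (n + 1)       # index 0 unused
    for j in range(n, 1, -1):      # smaller j overwrite, giving the smallest j
        i = j + z[j - 1] - 1
        sp_prime[i] = z[j - 1]
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime[1:], sp[1:]


sp_prime, sp = sp_from_z("abcxabcde")
```

For P = abcxabcde only position 5 has a non-empty Z-box (length 3), which maps to i = 7 and gives sp´_7 = 3.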
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_(i-1) + 1, 1 ≤ i ≤ n + 1, sp´_0 = 0
(similarly F(i) = sp_(i-1) + 1, 1 ≤ i ≤ n + 1, sp_0 = 0)

xyabcxabcxadcdqfeg
  abcxabcde
      abcxabcde
  ^^^^^^^^
  12345678   d != x, shift 4 places

Shifting is only conceptual and P is never explicitly shifted.

Two special cases:
1. Mismatch at position 1, then F’(1) = 1
2. Match found, then P shifts by n - sp’_n places, which is F’(n+1) = sp’_n + 1
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_(i-1) + 1, 1 ≤ i ≤ n + 1, sp´_0 = 0
(similarly F(i) = sp_(i-1) + 1, 1 ≤ i ≤ n + 1, sp_0 = 0)
• Idea:
– We maintain a pointer i in P and c in T.
– After a mismatch at P(i+1) with T(c), shift P to align P(sp´_i + 1) with T(c), i.e., i = sp´_i + 1.
– Special case 1: i = 1 ⇒ set i = F´(1) = 1 & c = c + 1
– Special case 2: we find P in T ⇒ shift n - sp´_n spaces, i.e., i = F´(n + 1) = sp´_n + 1.
Full KMP Algorithm
Preprocess P to find F´(k) = sp´_(k-1) + 1 for k from 1 to n + 1
c = 1; p = 1;
While c + (n – p) ≤ m {
  While p ≤ n and P(p) = T(c) {
    p = p + 1;
    c = c + 1;}
  If (p = n + 1) then
    report an occurrence of P at position c – n of T.
  if (p = 1) then c = c + 1;
  p = F´(p);}

T = xyabcxabcxadcdqfeg   |T| = m   (pointer c scans T)
P = abcxabcde            |P| = n   (pointer p scans P)
Full KMP Algorithm
c = 1; p = 1;
While c + (n – p) ≤ m {
  While p ≤ n and P(p) = T(c) {
    p = p + 1;
    c = c + 1;}
  If (p = n + 1) then                          // p != n + 1
    report an occurrence of P at position c – n of T.
  if (p = 1) then c = c + 1;                   // p = 1 ⇒ c = 2
  p = F´(p);                                   // p = F’(1) = 1
}
xyabcxabcxabcdefeg
abcxabcde
^
1   a != x
Full KMP Algorithm
c = 1; p = 1;
While c + (n – p) ≤ m {
  While p ≤ n and P(p) = T(c) {
    p = p + 1;
    c = c + 1;}
  If (p = n + 1) then                          // p != n + 1
    report an occurrence of P at position c – n of T.
  if (p = 1) then c = c + 1;                   // p = 1 ⇒ c = 3
  p = F´(p);                                   // p = F’(1) = 1
}
xyabcxabcxabcdefeg
 abcxabcde
 ^
 1   a != y
Full KMP Algorithm
c = 1; p = 1;
While c + (n – p) ≤ m {
  While p ≤ n and P(p) = T(c) {
    p = p + 1;
    c = c + 1;}
  If (p = n + 1) then                          // p != n + 1
    report an occurrence of P at position c – n of T.
  if (p = 1) then c = c + 1;                   // p = 8 ⇒ don’t change c
  p = F´(p);                                   // p = F´(8) = 4
}
xyabcxabcxabcdefeg
  abcxabcde
  ^^^^^^^^
  12345678   d != x
Full KMP Algorithm
c = 1; p = 1;
While c + (n – p) ≤ m {                        // p = 4, c = 10
  While p ≤ n and P(p) = T(c) {
    p = p + 1;
    c = c + 1;}
  If (p = n + 1) then                          // p = n + 1!
    report an occurrence of P at position c – n of T.   // c – n = 16 – 9 = 7
  if (p = 1) then c = c + 1;
  p = F´(p);
}
xyabcxabcxabcdefeg
      abcxabcde
         ^^^^^^
         456789
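The full algorithm and the trace above can be reproduced with the sketch below. It derives sp´ from Z values as described in the preprocessing slides and keeps the 1-based pointers c and p from the pseudocode; the function names `z_values` and `kmp_search` are hypothetical.

```python
def z_values(s):
    """Z algorithm: z[k] is the length of the longest substring of s
    starting at index k (0-based) that is also a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[k + z[k]] == s[z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def kmp_search(pattern, text):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    z = z_values(pattern)
    sp_prime = [0] * (n + 1)            # sp'_0 = 0 at index 0
    for j in range(n, 1, -1):
        sp_prime[j + z[j - 1] - 1] = z[j - 1]
    # failure[i] = F'(i) = sp'_(i-1) + 1 for i = 1..n+1
    failure = [0] + [sp_prime[i - 1] + 1 for i in range(1, n + 2)]

    matches = []
    c = p = 1
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            matches.append(c - n)
        if p == 1:
            c += 1                      # mismatch at P(1): advance in T
        p = failure[p]
    return matches


found = kmp_search("abcxabcde", "xyabcxabcxabcdefeg")
```

The pointer c never moves backwards, which is the source of the linear bound mentioned in the preliminaries.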
Real-Time KMP

• Q: What is meant by real-time algorithms?


• A: Typically these are algorithms that are meant
to interact synchronously in the real world.
– This implies a known fixed turn-around time for
processing a task
– Many embedded scheduling systems are examples
involving real-time algorithms.
– For KMP this means that we require a constant time
for processing all strings of length n.
Real-Time KMP

• Q: Why is KMP not real-time?


• A: For any mismatched character in T, we may try matching it several times.
– Recall that sp´_i only guarantees that P(i + 1) and P(sp´_i + 1) differ
– There is NO guarantee that P(sp´_i + 1) and T(k) match
• We need to ensure that a mismatch at T(k) does NOT entail additional comparisons at T(k).
• This means that we have to compute sp´_i values with respect to all characters in Σ since any could appear in T.
Real-Time KMP

• Define: sp´(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´(i,x) + 1) is x.
• This will tell us exactly what shift to use for each possible mismatch.
• A mismatched character T(k) will never be involved in subsequent comparisons.
Real-Time KMP

• Q: How do we know that the mismatched


character T(k) will never be involved in
subsequent comparisons?
• A: Because the shift will shift P so that either the
matching character aligns with T(k) or P will be
shifted past T(k).
• This results in a real-time version of KMP.
• Let’s consider how we can find the sp´(i,x)(P)
values in linear time.
Real-Time KMP

Thm. For P[i + 1] ≠ x, sp´(i,x)(P) = i - j + 1
– Here j is the smallest position such that j maps to i and P(Z_j + 1) = x.
– If there is no such j then sp´(i,x)(P) = 0
For i = 1 to n { sp´(i,x) = 0 for every character x;}
For j = n downto 2 {
  i = j + Z_j(P) – 1;
  x = P(Z_j + 1);
  sp´(i,x) = Z_j;
}
Real-Time KMP
For i = 1 to n { sp´(i,x) = 0 for every character x;}
For j = n downto 2 {
  i = j + Z_j(P) – 1;
  x = P(Z_j + 1);
  sp´(i,x) = Z_j;}

• Notice how this works:
– Starting from the right
• Find i, the right end of the Z box associated with j
• Find x, the character immediately following the prefix corresponding to this Z box.
• Set sp´(i,x) = Z_j, the length of this Z box.
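The table construction above might be sketched as follows, assuming a standard Z algorithm. Storing the table as one dictionary per position is a choice made here for brevity (missing entries stand for 0); the function names `z_values` and `sp_prime_by_char` are hypothetical.

```python
def z_values(s):
    """Z algorithm: z[k] is the length of the longest substring of s
    starting at index k (0-based) that is also a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[k + z[k]] == s[z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def sp_prime_by_char(p):
    """Return table where table[i].get(x, 0) is sp'(i,x) for i = 1..n."""
    n = len(p)
    z = z_values(p)                        # z[j-1] holds Z_j
    table = [dict() for _ in range(n + 1)] # index 0 unused
    for j in range(n, 1, -1):              # smaller j overwrite larger ones
        zj = z[j - 1]
        if zj == 0:
            continue                       # would only store the default 0
        i = j + zj - 1                     # right end of the Z-box at j
        x = p[zj]                          # P(Z_j + 1) in 1-based terms
        table[i][x] = zj
    return table


table = sp_prime_by_char("abcxabcde")
```

For P = abcxabcde, a mismatch after matching P[1..7] against a text character x gives sp´(7,x) = 3, so the scan resumes with that x already known to match P(4).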
Reference
• Chapter 1, 2: Exact Matching: Fundamental
Preprocessing and First Algorithms
