
Exact String Matching Algorithms: Presented by Dr. Shazzad Hosain Assoc. Prof. EECS, NSU


Uploaded by

Alimushwan Adnan

Exact String Matching Algorithms

Presented By
Dr. Shazzad Hosain
Assoc. Prof. EECS, NSU
Classical Comparison Based Methods

• Boyer-Moore Algorithm
• Knuth-Morris-Pratt Algorithm (KMP Algorithm)
Boyer-Moore Algorithm
• Basic ideas:
– Previously discussed ideas for naïve matching
1. Successively align P and T to check for a match.
2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm
1. Scan from right to left
2. Special bad character rule
3. Suffix shift rule
Concept: Right-to-left Scan

• How can we check for a match of pattern P at location i in target T?
• The naïve algorithm scans left to right, i.e., it compares
T[i+k] with P[1+k], k = 0 to length(P)-1
Example: P = adab, T = abaracadabara
abaracadabara
adab
^^
12
1: a == a
2: d != b
Concept: Right-to-left Scan

• Alternatively, scan right to left, i.e., compare
T[i+k] with P[1+k], k = length(P)-1 down to 0
Example: P = adab, T = abaracadabara
abaracadabara
adab
   ^
   1
1: b != r
Concept: Right-to-left Scan
• Why is scanning right-to-left a good idea?
• Answer: by itself, it isn’t any better than left-to-right.
– A naïve approach with right-to-left scanning is also Θ(nm).
– Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
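As a point of comparison, the naïve right-to-left scan can be sketched as follows. This is only an illustration: the function name naive_right_to_left is an assumption, not something from the slides, and it reports 1-based start positions to match the slide numbering.

```python
def naive_right_to_left(pattern: str, text: str) -> list[int]:
    """Return 1-based start positions of pattern in text.

    Each alignment is checked right to left, but P always moves by exactly
    one position, so the worst case is still Theta(nm).
    """
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):
        k = n - 1
        # Compare from the last character of P back to the first.
        while k >= 0 and text[start + k] == pattern[k]:
            k -= 1
        if k < 0:
            hits.append(start + 1)
    return hits


print(naive_right_to_left("adab", "abaracadabara"))  # [7]
```

The scan direction alone changes nothing about the running time; the gain comes only from the larger shifts introduced by the two rules.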
Concept: Bad Character Rule

• Idea: the mismatched character indicates a safe minimum shift.
Example: P = adacara, T = abaracadabara
abaracadabara
adacara
     ^^
     21
1: a == a
2: r != c

Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?
Concept: Bad Character Rule

Shift two positions to align the rightmost occurrence of the mismatched character c in P.
abaracadabara
adacara          (before the shift)
  adacara        (after shifting two positions)

Now, start matching again from right to left.


Concept: Bad Character Rule

Second Example: P = adacara, T = abaxaradabara
abaxaradabara
adacara
   ^^^^
   4321
1: a == a
2: r == r
3: a == a
4: c != x

Here the bad character is x. The minimum shift should align this character with its occurrence in P.

But x doesn’t occur in P!


Concept: Bad Character Rule

Since x doesn’t occur in P, we can shift past it.

Second Example: P = adacara, T = abaxaradabara
abaxaradabara
adacara          (before the shift)
    adacara      (after shifting four positions)

Now, start matching again from right to left.


Concept: Bad Character Rule

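• The idea of the bad character rule is to shift P by more than one character when possible.
• But if the rightmost occurrence of the mismatched text character in P lies to the right of the mismatch position, the rule gives no useful shift and P moves by only one position.
• Unfortunately, this is often the case. Below, the mismatched text character t occurs in P only at position 7, to the right of the mismatch at position 3:
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat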

Concept: Bad Character Rule
• We will define a bad character rule that uses the concept of the rightmost occurrence of each letter.
• Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet.
• If x doesn’t occur in P, define R(x) to be 0.

    1234567
P = adacara

x     a  b  c  d  ...  z
R(x)  7  0  4  2  ...  0
Concept: Bad Character Rule
   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
P:     tpabsab

R(t) = 1, R(s) = 5.
i: the position of the mismatch in P. i = 3
k: the corresponding position in T. k = 5, T[k] = t
• The bad character rule says P should be shifted right by max{1, i - R(T[k])}, i.e., if the rightmost occurrence of character T[k] in P is at position j (j < i), then P[j] should be below T[k] after the shift. Here the shift is max{1, 3 - 1} = 2.
• Otherwise, when R(T[k]) > i, we have i - R(T[k]) < 0 and P is shifted by only one position.
• Obviously this rule is not very useful when R(T[k]) > i, which is usually the case for DNA sequences.
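The table R and the shift max{1, i - R(T[k])} can be sketched in a few lines. The function names below are illustrative assumptions; positions are 1-based as on the slides.

```python
def rightmost_positions(pattern: str) -> dict[str, int]:
    """R(x): 1-based rightmost position of each character of the pattern.

    Characters that do not occur are simply absent from the dict and are
    treated as R(x) = 0 by the lookup below.
    """
    # Later positions overwrite earlier ones, so the last write is the rightmost.
    return {ch: pos for pos, ch in enumerate(pattern, start=1)}


def bad_character_shift(R: dict[str, int], i: int, text_char: str) -> int:
    """Shift for a mismatch at pattern position i against text_char."""
    return max(1, i - R.get(text_char, 0))


R = rightmost_positions("tpabsab")
print(R["t"], R["s"])                  # 1 5
print(bad_character_shift(R, 3, "t"))  # 2
```

With P = tpabsat instead, R(t) = 7 and the same mismatch yields max{1, 3 - 7} = 1, which is the weak case described above.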

Concept: Extended Bad Character Rule

Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].
Example: P = aracara, T = abararadabara
abararadabara
aracara
   ^^^^
   4321
1: a == a
2: r == r
3: a == a
4: c != r
• Position 6 holds the rightmost occurrence of r in P. Notice that i - R(T[k]) < 0, i.e., 4 - 6 < 0.
• Position 2 holds the rightmost occurrence of r to the left of i in P. Notice that 4 - 2 > 0, i.e., this gives us a positive shift.
Concept: Extended Bad Character Rule

The amount of shift is i - j, where:
– i is the index of the mismatch in P.
– j is the rightmost occurrence of T[k] to the left of i in P.
Example: P = aracara, T = abataradabara
abataradabara
aracara
   ^^^^
   4321
1: a == a
2: r == r
3: a == a
4: c != t
There is no occurrence of t in P, thus j = 0. Notice that i - j = 4, i.e., this gives us a positive shift past the point of mismatch.
Concept: Extended Bad Character Rule
• How do we implement this rule?
• We preprocess P (from right to left), recording the position of each occurrence of the letters.
• For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn’t occur in P, then it has an empty list.
Concept: Extended Bad Character Rule
Example: Σ = {a, b, c, d, r, t}, P = abataradabara
• a_list = <13, 11, 9, 7, 5, 3, 1>, since ‘a’ occurs at these positions in P
• b_list = <10, 2>
• c_list = Ø
• d_list = <8>
• r_list = <12, 6>
• t_list = <4>
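The occurrence lists and the resulting shift i - j can be sketched as below. The names occurrence_lists and extended_bad_character_shift are assumptions made for this illustration; a character with an empty list is simply missing from the dict.

```python
from collections import defaultdict


def occurrence_lists(pattern: str) -> dict[str, list[int]]:
    """For each character, its 1-based positions in the pattern, largest first."""
    lists = defaultdict(list)
    for pos in range(len(pattern), 0, -1):  # preprocess P from right to left
        lists[pattern[pos - 1]].append(pos)
    return dict(lists)


def extended_bad_character_shift(lists: dict[str, list[int]], i: int, text_char: str) -> int:
    """Return i - j, where j is the closest occurrence of text_char left of i (0 if none)."""
    for j in lists.get(text_char, []):
        if j < i:  # lists are decreasing, so the first hit is the closest one
            return i - j
    return i


print(occurrence_lists("abataradabara")["a"])  # [13, 11, 9, 7, 5, 3, 1]
lists = occurrence_lists("aracara")
print(extended_bad_character_shift(lists, 4, "r"))  # 2
print(extended_bad_character_shift(lists, 4, "t"))  # 4
```

Scanning a list costs at most as many steps as there are positions to the right of i, so the lookup never costs more than the comparisons already spent on that alignment.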
Concept: Suffix Shift Rule
• Recall that we investigated finding prefixes before.
• Since we are matching P to T from right-to-left, we will
instead need to use suffixes.
Suffix Shift Rule

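t is a suffix of P that matches a substring t of T; the mismatch occurs just to the left of t, where T has the character x and P has the character y (x ≠ y).
t’ is the rightmost copy of t in P such that t’ is not a suffix of P and the character z to the left of t’ differs from y (z ≠ y).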

Concept: Suffix Shift Rule

• Consider the partial right-to-left matching of P to T below.
• This partial match involves α = badbaddog, a suffix of P.

T ..........axbadbaddog.....
P ..........adbadbaddog
Concept: Suffix Shift Rule

• This partial match ends where the first mismatch occurs, where x is aligned with d.

T ..........axbadbaddog.....
P ..........adbadbaddog
Concept: Suffix Shift Rule

We want to find a rightmost copy α´ of this substring α in P such that:
1. α´ is not a suffix of P and
2. The character to the left of α´ is not the same as the character to the left of α

T ....................axbadbaddog.....
P .....gbadbaddoghorseadbadbaddog
(α´ is the copy of badbaddog preceded by g; α is the suffix preceded by d)
Concept: Suffix Shift Rule

1. If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.

T                .............xbadbaddog.....
P                gbadbaddogcatdbadbaddog
P after shifting              gbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule

2. If α´ doesn’t exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

T                .........xbadbaddog.....
P                dogcatratdbadbaddog
P after shifting                 dogcatratdbadbaddog


Concept: Suffix Shift Rule

3. If α´ doesn’t exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

T                .........xbadbaddog.....
P                batcatratdbadbaddog
P after shifting                    batcatratdbadbaddog
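The three cases can be checked with a direct, brute-force sketch. It is quadratic and meant only to make the rule concrete; the preprocessing described later obtains the same shifts in linear time. The function name and the convention that a copy of α at the very start of P (no character to its left) counts as α´ are assumptions of this sketch.

```python
def good_suffix_shift_bruteforce(pattern: str, mismatch_pos: int) -> int:
    """Shift after a mismatch at 1-based position mismatch_pos of the pattern.

    Everything to the right of mismatch_pos (the suffix alpha) is assumed
    to have matched the text.
    """
    n = len(pattern)
    alpha = pattern[mismatch_pos:]   # matched suffix
    bad = pattern[mismatch_pos - 1]  # pattern character that mismatched
    if not alpha:                    # nothing matched: fall back to a shift of 1
        return 1
    # Case 1: rightmost copy of alpha that is not a suffix of P and whose
    # left neighbour differs from the mismatched pattern character.
    for start in range(n - len(alpha) - 1, -1, -1):
        if pattern.startswith(alpha, start) and (start == 0 or pattern[start - 1] != bad):
            return (n - len(alpha)) - start
    # Case 2: longest prefix of P that matches a suffix of alpha.
    for length in range(len(alpha) - 1, 0, -1):
        if pattern[:length] == alpha[-length:]:
            return n - length
    # Case 3: no usable copy and no prefix, so shift past the whole window.
    return n


print(good_suffix_shift_bruteforce("gbadbaddogcatdbadbaddog", 14))  # 13
print(good_suffix_shift_bruteforce("dogcatratdbadbaddog", 10))      # 16
print(good_suffix_shift_bruteforce("batcatratdbadbaddog", 10))      # 19
```

The three calls correspond to the three slide examples: a copy preceded by g, only the prefix dog, and nothing reusable at all.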


Preprocessing for the good suffix rule

• Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)].
• If there is no such position, then L(i) = 0
• Example 1: If i = 17 then L(i) = 9
P = batcatdogdbadbaddog
P[17..19] = dog also occurs as P[7..9], so L(17) = 9.
• Example 2: If i = 16 then L(i) = 0
P = batcatdogdbadbaddog
P[16..19] = ddog occurs nowhere else in P.
Concept: Suffix Shift Rule

• Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding the suffix is not equal to P(i-1).
• If there is no such position, then L´(i) = 0
• Example 1: If i = 20 then L(i) = 12 and L´(i) = 6

P = slydogsaddogdbadbaddog
P[20..22] = dog is preceded by P(19) = d. The copy ending at position 12 is also preceded by d; the copy ending at position 6 is preceded by y.
Concept: Suffix Shift Rule

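• Example 2: If i = 19 then L(i) = 12 and L´(i) = 0

P = slydogsaddogdbadbaddog
P[19..22] = ddog occurs once more, ending at position 12, but that copy is preceded by a, the same character as P(18).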
Concept: Suffix Shift Rule

• Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P.
• In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P(i-1).
• The relation between L´(i) and L(i) is analogous to the relation between α´ and α.

P = slydogsaddogdbadbaddog, with L´(20) = 6 and L(20) = 12
Concept: Suffix Shift Rule

• Q: What is the point?
• A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions. Example (n = 23, i = 15, L´(15) = 10, shift 13):

T                .............xbadbaddog.....
P                gbadbaddogcatdbadbaddog
P after shifting              gbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule

• If L(i) and L´(i) are different, then obviously shifting by n - L´(i) positions is a greater shift than n - L(i).
• Example (n = 31, i = 26, L(i) = 18, L´(i) = 9):

T                .....................xbaxbaddog.....
P                slybaddogbadbaddogcatdbadbaddog
P after shifting                       slybaddogbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule

• Let N_j(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P.
• Example 1: N_6(P) = 3 and N_12(P) = 5.
P = slydogsaddogdbadbaddog (P[1..6] ends in dog, P[1..12] ends in addog)
• Example 2: N_3(P) = 2, N_9(P) = 3, N_15(P) = 5, N_19(P) = 0.
P = hogslydogsaddogdbadbaddog
Concept: Suffix Shift Rule

• Q: How are the concepts of N_i and Z_i related?
• Recall that Z_i = length of a maximal substring starting at position i, which is a prefix of P.
• In contrast, N_i = length of a maximal substring ending at position i, which is a suffix of P.
• In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right-to-left.
Concept: Suffix Shift Rule

• Let P^r denote the mirror image of P, then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r).
• In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P.
• Q: Why must this be true?
• A: Because they are the same substring, except that one is the reverse of the other.
Concept: Suffix Shift Rule

• Since N_j(P) = Z_{n-j+1}(P^r), we can use the Z algorithm to compute N in O(n).
• Q: How do we do this?
• A: We create P^r, the reverse of P, and process it with the Z algorithm.
Concept: Suffix Shift Rule

N is the reverse of Z!
P: the pattern
P^r: the string obtained by reversing P

Then N_j(P) = Z_{n-j+1}(P^r)

j:    1 2 3 4 5 6 7 8 9 10
P:    q c a b d a b d a b
N_j:  0 0 0 2 0 0 5 0 0 0

i:    1 2 3 4 5 6 7 8 9 10
P^r:  b a d b a d b a c q
Z_i:  0 0 0 5 0 0 2 0 0 0
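The identity N_j(P) = Z_{n-j+1}(P^r) translates almost literally into code. The sketch below uses a standard linear-time Z computation; the function names are illustrative assumptions, and Z at the first position is left at 0, which is the convention the table above appears to use (so the last N value also comes out as 0).

```python
def z_values(s: str) -> list[int]:
    """z[i] (0-based) = length of the longest substring starting at i that is a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0  # the current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])  # reuse work done inside the Z-box
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_values(pattern: str) -> list[int]:
    """Return [N_1, ..., N_n] by running the Z algorithm on the reversed pattern."""
    return z_values(pattern[::-1])[::-1]


print(z_values("badbadbacq"))  # [0, 0, 0, 5, 0, 0, 2, 0, 0, 0]
print(n_values("qcabdabdab"))  # [0, 0, 0, 2, 0, 0, 5, 0, 0, 0]
```

Reversing the Z list is exactly the index change j -> n - j + 1.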
Concept: Suffix Shift Rule

For pattern P, N_j (for j = 1,…,n) can be calculated in O(n) using the Z algorithm.

Why do we need to define N_j?

To use the strong good suffix rule, we need to find L´(i) for every i = 1,…,n.

We can get L´(i) from N_j!

Concept: Suffix Shift Rule

• We can then find L´(i) and L(i) values from N values in linear time with the following:

For i = 1 to n {L´(i) = 0;}
For j = 1 to n - 1 {
    i = n - N_j(P) + 1;    // N_j(P) = 0 gives i = n + 1, which lies outside 1..n and is ignored
    L´(i) = j;
}

// L values (if desired) can be obtained
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i));}
Concept: Suffix Shift Rule
For i = 1 to n {L´(i) = 0;}
For j = 1 to n – 1 {
i = n - Nj(P) + 1;
L´(i) = j;
}
L(2) = L´(2) ;
For i = 3 to n { L(i) = max(L(i - 1), L´(i));}

• Example: P = asdbasasas, n = 10
• Values of N_j(P), j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0
• Computed values of i, j = 1..9: 11, 9, 11, 11, 11, 9, 11, 7, 11
• Values of L´(i), i = 1..9: 0, 0, 0, 0, 0, 0, 8, 0, 6
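A runnable sketch of the two loops is given below. To stay short it computes the N values by direct comparison (quadratic) rather than with the Z algorithm; that helper and all function names are assumptions of this illustration. The arrays carry one extra slot so that the i = n + 1 case needs no special handling.

```python
def n_values_naive(p: str) -> list[int]:
    """[N_1, ..., N_n] by direct comparison; a Z-based version would be linear."""
    n = len(p)
    out = []
    for j in range(1, n + 1):
        k = 0
        while k < j and p[j - 1 - k] == p[n - 1 - k]:
            k += 1
        out.append(k)
    return out


def l_prime_and_l(p: str) -> tuple[list[int], list[int]]:
    """Return ([L'(1..n)], [L(1..n)]) following the slide's pseudocode."""
    n = len(p)
    N = n_values_naive(p)
    Lp = [0] * (n + 2)      # 1-based; slot n + 1 absorbs the N_j = 0 case
    for j in range(1, n):   # j = 1 .. n - 1
        i = n - N[j - 1] + 1
        Lp[i] = j           # later (larger) j overwrite earlier ones
    L = [0] * (n + 2)
    if n >= 2:
        L[2] = Lp[2]
    for i in range(3, n + 1):
        L[i] = max(L[i - 1], Lp[i])
    return Lp[1:n + 1], L[1:n + 1]


Lp, L = l_prime_and_l("asdbasasas")
print(Lp)  # [0, 0, 0, 0, 0, 0, 8, 0, 6, 0]
print(L)   # [0, 0, 0, 0, 0, 0, 8, 8, 8, 8]
```

For P = slydogsaddogdbadbaddog the same function gives L´(20) = 6 and L(20) = 12, matching the earlier example.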
Concept: Suffix Shift Rule

• Let l´(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists.
Example: P = asasbsasas
l´(1) = 10
l´(2) = l´(3) = l´(4) = l´(5) = l´(6) = l´(7) = 4
l´(8) = l´(9) = 2
l´(10) = 0
Concept: Suffix Shift Rule

• Thm: l´(i) = largest j <= n - i + 1 s.t. N_j(P) = j.
• Q: How can we compute l´(i) values in linear time?
• A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
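One possible linear-time answer, offered as a sketch rather than as the intended solution to the exercise, walks i from n down to 1: the bound n - i + 1 grows by one at each step, so only the new candidate j = n - i + 1 has to be tested. The function names are assumptions, and the sketch sets N_n = n (P is a suffix of itself) so that l´(1) = n, as in the golgol example.

```python
def z_values(s: str) -> list[int]:
    """Standard linear-time Z algorithm (0-based, z[0] left at 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def l_prime_lengths(p: str) -> list[int]:
    """Return [l'(1), ..., l'(n)] using l'(i) = largest j <= n - i + 1 with N_j = j."""
    n = len(p)
    N = z_values(p[::-1])[::-1]  # N[j - 1] = N_j(P)
    if n:
        N[-1] = n                # assumption: N_n = n, the whole pattern
    out = [0] * (n + 2)          # 1-based, out[n + 1] = 0 acts as a sentinel
    for i in range(n, 0, -1):
        j = n - i + 1            # the only new candidate at this step
        out[i] = j if N[j - 1] == j else out[i + 1]
    return out[1:n + 1]


print(l_prime_lengths("golgol"))      # [6, 3, 3, 3, 0, 0]
print(l_prime_lengths("asasbsasas"))  # [10, 4, 4, 4, 4, 4, 4, 2, 2, 0]
```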
Boyer-Moore Algorithm
Preprocessing:
    Compute L´(i) and l´(i) for each position i in P,
    Compute R(x), the right-most occurrence of x in P, for each character x in Σ.
Search:
    k = n;
    While k <= m {
        i = n; h = k;
        While i > 0 and P(i) = T(h) {
            i = i - 1; h = h - 1;}
        if i = 0 {
            report occurrence of P in T ending at position k.
            k = k + n - l´(2);}
        else Shift P (increase k) by the max amount indicated by the
            extended bad character rule and the good suffix rule.
    }
Boyer-Moore Algorithm
Example: P = golgol
Preprocessing:
Compute L´(i) and l´(i) for each position i in P

For i = 1 to n {L´(i) = 0;}


For j = 1 to n – 1 {
i = n - Nj(P) + 1;
L´(i) = j;
}

Notice that first we need Nj(P) values in order to compute L´(i) and l´(i) for each
position i in P.
Boyer-Moore Algorithm
Example: P = golgol
Recall that Nj(P) is the length of the longest suffix of P[1..j]
that is also a suffix of P.

N1(P) = 0, there is no suffix of P that ends with g


N2(P) = 0, there is no suffix of P that ends with o
N3(P) = 3, there is a suffix of P that ends with l

N4(P) = 0, there is no suffix of P that ends with g


N5(P) = 0, there is no suffix of P that ends with o
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Compute L´(i) and l´(i) for each position i in P
For i = 1 to n {L´(i) = 0;}
For j = 1 to n – 1 {
i = n - Nj(P) + 1;
L´(i) = j;
}
j = 1: i = 7, therefore L´(7) = 1
j = 2: i = 7, therefore L´(7) = 2
j = 3: i = 4, therefore L´(4) = 3
j = 4: i = 7, therefore L´(7) = 4
j = 5: i = 7, therefore L´(7) = 5
(i = 7 lies outside 1..n, so L´(7) is not used.)
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
Compute l´(i) for each position i in P.
Recall that l´(i) is the length of the longest suffix of P[i..n] that
is also a prefix of P.
l´(1) = 6 since golgol is the longest suffix of P[1..n] that is a prefix of P.
l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Compute the list R(x), the right-most occurrences of x in P, for each character x in Σ = {g, o, l}

R(g) = <4, 1>
R(o) = <5, 2>
R(l) = <6, 3>
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
l´(1) =6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>

Search:
k = n;
While k <= m {
i = n; h = k;
While i > 0 and P(i) = T(h) {
i = i – 1; h = h – 1;}
if i = 0 {
report occurrence of P in T at position k.
k = k + n - l´(2);}
else Shift P (increase k) by the max amount indicated by the
extended bad character rule and the good suffix rule.
}
Search
k = 6;
While k <= 9 {
    i = 6; h = k;
    While i > 0 and P(i) = T(h) {
        i = i - 1; h = h - 1;}
    if i = 0 {                                          But i = 1!
        report occurrence of P in T at position k.
        k = k + 6 - l´(2);}
    else Shift P (increase k) by the max amount indicated by the
        extended bad character rule and the good suffix rule.
}

lolgolgol
golgol
^^^^^^
i = 6, h = 6 down to i = 2, h = 2: characters match
i = 1, h = 1: P(1) != T(1)

Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place.
Good Suffix Rule: Since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6 - 3 = 3 places. Thus k = k + 3 = 9.
Search
k = 6;
While k <= 9 {                                          now k = 9
    i = 6; h = k;
    While i > 0 and P(i) = T(h) {
        i = i - 1; h = h - 1;}
    if i = 0 {
        report occurrence of P in T at position k.
        k = k + 6 - l´(2);}
    else Shift P (increase k) by the max amount indicated by the
        extended bad character rule and the good suffix rule.
}

lolgolgol
   golgol
   ^^^^^^
i = 6, h = 9 down to i = 1, h = 4: all characters match, leaving i = 0, h = 3

i = 0, report occurrence of P in T at position 4 (the match ends at k = 9),
k = k + 6 - l´(2) = 9 + 6 - 3 = 12
k = 12 > 9, we are done!
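Putting the pieces together, the search loop can be sketched as follows. This is one possible reading of the pseudocode above, not a reference implementation: the function names are assumptions, occurrences are reported by their 1-based start position, and a mismatch on the very first comparison (no matched suffix) is given a good-suffix shift of 1.

```python
def z_values(s: str) -> list[int]:
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def preprocess(p: str):
    """Return 1-based tables L', l' and the occurrence lists for the bad character rule."""
    n = len(p)
    N = z_values(p[::-1])[::-1]  # N[j - 1] = N_j(P)
    N[-1] = n                    # assumption: N_n = n
    Lp = [0] * (n + 2)
    for j in range(1, n):
        Lp[n - N[j - 1] + 1] = j
    lp = [0] * (n + 2)
    for i in range(n, 0, -1):
        j = n - i + 1
        lp[i] = j if N[j - 1] == j else lp[i + 1]
    occ = {}
    for pos in range(n, 0, -1):
        occ.setdefault(p[pos - 1], []).append(pos)
    return Lp, lp, occ


def boyer_moore(p: str, t: str) -> list[int]:
    """Return the 1-based start positions of p in t."""
    n, m = len(p), len(t)
    if n == 0 or n > m:
        return []
    Lp, lp, occ = preprocess(p)
    hits = []
    k = n  # text position aligned with P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and p[i - 1] == t[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - lp[2]
        else:
            # Extended bad character rule: closest occurrence of T(h) left of i.
            j = next((q for q in occ.get(t[h - 1], []) if q < i), 0)
            bad = i - j
            # Good suffix rule: the matched suffix is P[i+1..n].
            if i == n:
                good = 1
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("golgol", "lolgolgol"))  # [4]
```

On the slide example the loop performs the same two iterations as the trace: a shift of 3 after the mismatch at P(1), then the occurrence at position 4.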
Homework 1: Due Next Week
• Implement the Boyer-Moore Algorithm
Break
KMP Algorithm
• Preliminaries:
– KMP can be easily explained in terms of finite
state machines.
– KMP has an easily proved linear bound
– KMP is usually not the method of choice
KMP Algorithm
• Recall that the naïve approach to string matching is Θ(mn).
• How can we reduce this complexity?
– Avoid redundant comparisons
– Use larger shifts
• Boyer-Moore good suffix rule
• Boyer-Moore extended bad character rule
KMP Algorithm
• KMP finds larger shifts by recognizing patterns
in P.
– Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
– By definition sp_1 = 0 for any string.
– Q: Why does this make sense?
– A: The only proper suffix of a single character is the empty string.
KMP Algorithm
• Example: P = abcaeabcabd
– P[1..2] = ab hence sp2 = ?
– sp2 = 0
– P[1..3] = abc hence sp3 = ?
– sp3 = 0
– P[1..4] = abca hence sp4 = ?
– sp4 = 1
– P[1..5] = abcae hence sp5 = ?
– sp5 = 0
– P[1..6] = abcaea hence sp6 = ?
– sp6 = 1
KMP Algorithm
• Example Continued
– P[1..7] = abcaeab hence sp7 = ?
– sp7 = 2
– P[1..8] = abcaeabc hence sp8 = ?
– sp8 = 3
– P[1..9] = abcaeabca hence sp9 = ?
– sp9 = 4
– P[1..10] = abcaeabcab hence sp10 = ?
– sp10 = 2
– P[1..11] = abcaeabcabd hence sp11 = ?
– sp11 = 0
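The values in this example can be reproduced with the classic incremental computation of the prefix function. Note that this is a different route from the Z-based preprocessing used later in these slides; the function name sp_values is an assumption of this sketch.

```python
def sp_values(p: str) -> list[int]:
    """Return [sp_1, ..., sp_n]: sp_i is the length of the longest proper
    suffix of P[1..i] that is also a prefix of P."""
    sp = [0] * len(p)
    k = 0  # length of the border of the prefix processed so far
    for i in range(1, len(p)):
        # Fall back to shorter borders until one can be extended by p[i].
        while k > 0 and p[i] != p[k]:
            k = sp[k - 1]
        if p[i] == p[k]:
            k += 1
        sp[i] = k
    return sp


print(sp_values("abcaeabcabd"))  # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
```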
KMP Algorithm
• Like the a/a concept for Boyer-Moore, there is
an analogous spi/sp´i concept.
• Let sp´i(P) denote the length of the longest proper
suffix of P[1..i] that matches a prefix of P, with the
added condition that characters P(i + 1) and P(sp´i
+ 1) are unequal.
• Example: P = abcdabce sp´7 = 3
α x α y
i
Obviously sp´i(P) <= spi(P), since the later is less
restrictive.
KMP Algorithm
• KMP Shift Rule:
1. Mismatch case:
• Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan.
• Shift P to the right, aligning P[1..sp´_i] with T[k-sp´_i..k-1]
2. Match case:
• If no mismatch is found, an occurrence of P has been found.
• Shift P by n - sp´_n spaces to continue searching for other occurrences.
KMP Algorithm

• Observations:
– The prefix P[1..sp´i] of the shifted P is shifted to match
the corresponding substring in T.
– Subsequent character matching proceeds from
position sp´i + 1
– Unlike Boyer-Moore, the matched substring is not
compared again.
– The shift rule based on sp´i guarantees that the exact
same mismatch won’t occur at sp´i + 1 but doesn’t
guarantee that P(sp´i+1) = T(k)
KMP Algorithm

• Example: P = abcxabcde
– If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
– Q: Where did the 4 position shift come from?
– A: The number of positions is given by i - sp´_i; in this example i = 7, sp´_7 = 3, 7 - 3 = 4
– Notice that we know the amount of shift without knowing anything about T other than there was a mismatch at position 8.
KMP Algorithm

• Example Continued: P = abcxabcde
– After the shift, P[1..3] lines up with T[k-3..k-1]
– Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
– The scan continues from P(4) & T(k)
• Advantages of KMP Shift Rule
1. P is often shifted by more than 1 character (i - sp´_i)
2. The left-most sp´_i characters in the shifted P are known to match the corresponding characters in T.
KMP Algorithm

Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde
Assume that we have already shifted past the first two positions in T.

xyabcxabcxadcdqfeg
  abcxabcde
      abcxabcde
  ^^^^^^^^
  12345678
8: d != x, shift 4 places and start again from position 4
Preprocessing for KMP
Approach: show how to derive sp´ values from Z values.
Definition: Position j > 1 maps to i if i = j + Z_j(P) - 1
– Recall that Z_j(P) denotes the length of the Z-box starting at position j.
– This says that j maps to i if i is the right end of a Z-box starting at j.
Preprocessing for KMP
Definition: Position j > 1 maps to i if i = j + Z_j(P) - 1

Theorem. For any i > 1, sp´_i(P) = Z_j = i - j + 1,
where j > 1 is the smallest position that maps to i.
If there is no such j then sp´_i(P) = 0
Similarly for sp:
For any i > 1, sp_i(P) = i - j + 1,
where j, i >= j > 1, is the smallest position that maps to i or beyond.
If there is no such j then sp_i(P) = 0
Preprocessing for KMP
Given the theorem from the preceding slide, the sp´_i and sp_i values can be computed in linear time using Z_i values:
For i = 1 to n { sp´_i = 0;}
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    sp´_i = Z_j;
}

sp_n(P) = sp´_n(P);
For i = n - 1 downto 2 {
    sp_i(P) = max[sp_{i+1}(P) - 1, sp´_i(P)];}
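The two loops can be turned into a short runnable sketch. The function names are assumptions; the lists returned are indexed so that element i - 1 holds the value for position i.

```python
def z_values(s: str) -> list[int]:
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def sp_prime_and_sp(p: str) -> tuple[list[int], list[int]]:
    """Return ([sp'_1..sp'_n], [sp_1..sp_n]) derived from the Z values of p."""
    n = len(p)
    z = z_values(p)               # z[j - 1] = Z_j(P)
    spp = [0] * (n + 1)           # 1-based sp'
    for j in range(n, 1, -1):     # j = n downto 2, so the smallest j wins
        i = j + z[j - 1] - 1
        spp[i] = z[j - 1]
    sp = [0] * (n + 2)            # 1-based sp
    sp[n] = spp[n]
    for i in range(n - 1, 1, -1):
        sp[i] = max(sp[i + 1] - 1, spp[i])
    return spp[1:], sp[1:n + 1]


spp, sp = sp_prime_and_sp("abcaeabcabd")
print(spp)  # [0, 0, 0, 1, 0, 0, 0, 0, 4, 2, 0]
print(sp)   # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
```

The sp list agrees with the values worked out by hand for P = abcaeabcabd, while sp´ is zero wherever the next character would repeat the same mismatch.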
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 <= i <= n + 1, sp´_0 = 0
(similarly F(i) = sp_{i-1} + 1, 1 <= i <= n + 1, sp_0 = 0)

         c
         |
xyabcxabcxadcdqfeg
  abcxabcde
  ^^^^^^^^
  12345678   d != x, shift 4 places

Shifting is only conceptual and P is never explicitly shifted.

Two special cases:
1. Mismatch at position 1, then F´(1) = 1
2. Match found, then P shifts by n - sp´_n places, which is F´(n+1) = sp´_n + 1
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 <= i <= n + 1, sp´_0 = 0
(similarly F(i) = sp_{i-1} + 1, 1 <= i <= n + 1, sp_0 = 0)
• Idea:
– We maintain a pointer i in P and c in T.
– After a mismatch at P(i+1) with T(c), shift P to align P(sp´_i + 1) with T(c), i.e., i = sp´_i + 1.
– Special case 1: i = 1 ⇒ set i = F´(1) = 1 & c = c + 1
– Special case 2: we find P in T ⇒ shift n - sp´_n spaces, i.e., i = F´(n + 1) = sp´_n + 1.
Full KMP Algorithm
Preprocess P to find F´(k) = sp´_{k-1} + 1 for k from 1 to n + 1
c = 1; p = 1;
While c + (n - p) <= m {
    While p <= n and P(p) = T(c) {
        p = p + 1;
        c = c + 1;}
    If (p = n + 1) then
        report an occurrence of P at position c - n of T.
    if (p = 1) then c = c + 1;
    p = F´(p);}

    c
    |
T = xyabcxabcxadcdqfeg   |T| = m
P = abcxabcde            |P| = n
    ^
    p
Full KMP Algorithm
c = 1; p = 1;
While c + (n - p) <= m {
    While p <= n and P(p) = T(c) {
        p = p + 1;
        c = c + 1;}
    If (p = n + 1) then                                  p != n+1
        report an occurrence of P at position c - n of T.
    if (p = 1) then c = c + 1;                           p = 1! ⇒ c = 2
    p = F´(p);                                           p = F´(1) = 1
}
xyabcxabcxabcdefeg
abcxabcde
^
1: a != x
Full KMP Algorithm
c = 1; p = 1;
While c + (n - p) <= m {
    While p <= n and P(p) = T(c) {
        p = p + 1;
        c = c + 1;}
    If (p = n + 1) then                                  p != n+1
        report an occurrence of P at position c - n of T.
    if (p = 1) then c = c + 1;                           p = 1! ⇒ c = 3
    p = F´(p);                                           p = F´(1) = 1
}
xyabcxabcxabcdefeg
abcxabcde           (previous alignment)
 abcxabcde
 ^
1: a != y
Full KMP Algorithm
c = 1; p = 1;
While c + (n - p) <= m {
    While p <= n and P(p) = T(c) {
        p = p + 1;
        c = c + 1;}
    If (p = n + 1) then                                  p != n+1
        report an occurrence of P at position c - n of T.
    if (p = 1) then c = c + 1;                           p = 8! ⇒ don’t change c
    p = F´(p);                                           p = F´(8) = 4
}
xyabcxabcxabcdefeg
 abcxabcde          (previous alignment)
  abcxabcde
  ^^^^^^^^
  12345678
8: d != x
Full KMP Algorithm
c = 1; p = 1;
While c + (n - p) <= m {                                 p = 4, c = 10
    While p <= n and P(p) = T(c) {
        p = p + 1;
        c = c + 1;}
    If (p = n + 1) then                                  p = n+1 !
        report an occurrence of P at position c - n of T.
    if (p = 1) then c = c + 1;
    p = F´(p);
}
xyabcxabcxabcdefeg
  abcxabcde         (previous alignment)
      abcxabcde
         ^^^^^^
         456789
After the inner loop c = 16, so the occurrence is reported at position c - n = 16 - 9 = 7.
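The full loop, including the Z-based preprocessing of F´, can be sketched as below. Names such as kmp_search are assumptions of this illustration; pointers c and p are kept 1-based to mirror the pseudocode, and the length test is placed before the character comparison so that P(n + 1) is never read.

```python
def z_values(s: str) -> list[int]:
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def kmp_search(pat: str, txt: str) -> list[int]:
    """Return the 1-based start positions of pat in txt."""
    n, m = len(pat), len(txt)
    if n == 0:
        return []
    z = z_values(pat)
    spp = [0] * (n + 1)                    # spp[i] = sp'_i, with sp'_0 = 0
    for j in range(n, 1, -1):
        spp[j + z[j - 1] - 1] = z[j - 1]
    # F[k] = F'(k) = sp'_{k-1} + 1 for k = 1 .. n + 1 (F[0] is unused).
    F = [0] + [spp[k - 1] + 1 for k in range(1, n + 2)]
    hits = []
    c = p = 1
    while c + (n - p) <= m:
        while p <= n and pat[p - 1] == txt[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)
        if p == 1:
            c += 1                         # mismatch at P(1): move on in T
        p = F[p]                           # conceptual shift of P
    return hits


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))  # [7]
```

Because c never decreases, each text character is passed over once, which is where the linear bound comes from.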
Real-Time KMP

• Q: What is meant by real-time algorithms?


• A: Typically these are algorithms that are meant
to interact synchronously in the real world.
– This implies a known fixed turn-around time for
processing a task
– Many embedded scheduling systems are examples
involving real-time algorithms.
– For KMP this means that we require a constant time
for processing all strings of length n.
Real-Time KMP

• Q: Why is KMP not real-time?
• A: For any mismatched character in T, we may try matching it several times.
– Recall that sp´_i only guarantees that P(i + 1) and P(sp´_i + 1) differ
– There is NO guarantee that P(sp´_i + 1) and T(k) match
• We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k).
• This means that we have to compute sp´_i values with respect to all characters in Σ since any could appear in T.
Real-Time KMP

• Define: sp´(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´(i,x) + 1) is x.
• This will tell us exactly what shift to use for each possible mismatch.
• A mismatched character T(k) will never be involved in subsequent comparisons.
Real-Time KMP

• Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?
• A: Because the shift will shift P so that either the matching character aligns with T(k) or P will be shifted past T(k).
• This results in a real-time version of KMP.
• Let’s consider how we can find the sp´(i,x)(P) values in linear time.
Real-Time KMP

Thm. For P[i + 1] ≠ x, sp´(i,x)(P) = i - j + 1
– Here j is the smallest position such that j maps to i and P(Z_j + 1) = x.
– If there is no such j then sp´(i,x)(P) = 0
For i = 1 to n { sp´(i,x) = 0 for every character x;}
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    x = P(Z_j + 1);
    sp´(i,x) = Z_j;
}
Real-Time KMP
For i = 1 to n { sp´(i,x) = 0 for every character x;}
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    x = P(Z_j + 1);
    sp´(i,x) = Z_j;}

• Notice how this works:
– Starting from the right
• Find i, the right end of the Z box associated with j
• Find x, the character immediately following the prefix corresponding to this Z box.
• Set sp´(i,x) = Z_j, the length of this Z box.
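A sparse version of this table can be sketched with a dictionary keyed by (i, x), so that only non-zero entries are stored and every other pair defaults to 0. The names real_time_table and sp_prime_ix are assumptions made for this illustration.

```python
def z_values(s: str) -> list[int]:
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def real_time_table(p: str) -> dict[tuple[int, str], int]:
    """Map (i, x) to sp'(i,x) for the non-zero entries only (i is 1-based)."""
    n = len(p)
    z = z_values(p)
    table = {}
    for j in range(n, 1, -1):  # right to left, so the smallest j is written last
        zj = z[j - 1]
        if zj == 0:
            continue           # would only record the default value 0
        i = j + zj - 1         # right end of the Z-box starting at j
        x = p[zj]              # P(Z_j + 1): the character following the prefix
        table[(i, x)] = zj
    return table


def sp_prime_ix(table: dict[tuple[int, str], int], i: int, x: str) -> int:
    """Look up sp'(i,x), defaulting to 0."""
    return table.get((i, x), 0)


table = real_time_table("abcxabcde")
print(table)                       # {(7, 'x'): 3}
print(sp_prime_ix(table, 7, "x"))  # 3
print(sp_prime_ix(table, 7, "q"))  # 0
```

For P = abcxabcde the single entry says: after matching 7 characters, if the text character is x, resume with 3 characters already matched and x consumed; any other text character falls back to 0.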
Reference
• Chapter 1, 2: Exact Matching: Fundamental
Preprocessing and First Algorithms
