32.4 The Knuth-Morris-Pratt Algorithm: Either

★ 32.3-5
Given two patterns $P$ and $P'$, describe how to construct a finite automaton that determines all occurrences of either pattern. Try to minimize the number of states in your automaton.

32.3-6
Given a pattern $P$ containing gap characters (see Exercise 32.1-4), show how to build a finite automaton that can find an occurrence of $P$ in a text $T$ in $O(n)$ matching time, where $n = |T|$.

★ 32.4 The Knuth-Morris-Pratt algorithm

Knuth, Morris, and Pratt developed a linear-time string-matching algorithm that avoids computing the transition function $\delta$ altogether. Instead, the KMP algorithm uses an auxiliary function $\pi$, which it precomputes from the pattern in $\Theta(m)$ time and stores in an array $\pi[1:m]$. The array $\pi$ allows the algorithm to compute the transition function $\delta$ efficiently (in an amortized sense) “on the fly” as needed. Loosely speaking, for any state $q = 0, 1, \ldots, m$ and any character $a \in \Sigma$, the value $\pi[q]$ contains the information needed to compute $\delta(q, a)$ but that does not depend on $a$. Since the array $\pi$ has only $m$ entries, whereas $\delta$ has $\Theta(m\,|\Sigma|)$ entries, the KMP algorithm saves a factor of $|\Sigma|$ in the preprocessing time by computing $\pi$ rather than $\delta$. Like the procedure FINITE-AUTOMATON-MATCHER, once preprocessing has completed, the KMP algorithm uses $\Theta(n)$ matching time.

The prefix function for a pattern

The prefix function $\pi$ for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. The KMP algorithm takes advantage of this information to avoid testing useless shifts in the naive pattern-matching algorithm and to avoid precomputing the full transition function $\delta$ for a string-matching automaton.
Consider the operation of the naive string matcher. Figure 32.9(a) shows a particular shift $s$ of a template containing the pattern $P = \texttt{ababaca}$ against a text $T$. For this example, $q = 5$ of the characters have matched successfully, but the 6th pattern character fails to match the corresponding text character. The information that $q$ characters have matched successfully determines the corresponding text characters. Because these $q$ text characters match, certain shifts must be invalid. In the example of the figure, the shift $s + 1$ is necessarily invalid, since the first pattern character (a) would be aligned with a text character that does not match the first pattern character, but does match the second pattern character (b). The shift $s' = s + 2$ shown in part (b) of the figure, however, aligns the first three pattern characters with three text characters that necessarily match.
More generally, suppose that you know that $P[:q] \sqsupset T[:s+q]$ or, equivalently, that $P[1:q] = T[s+1:s+q]$. You want to shift $P$ so that some shorter prefix $P[:k]$ of $P$ matches a suffix of $T[:s+q]$, if possible. You might have more than one choice for how much to shift, however. In Figure 32.9(b), shifting $P$ by 2 positions works, so that $P[:3] \sqsupset T[:s+q]$, but so does shifting $P$ by 4 positions, so that $P[:1] \sqsupset T[:s+q]$ in Figure 32.9(c). If more than one shift amount works, you should choose the smallest shift amount so that you do not miss any potential matches. Put more precisely, you want to answer this question:

Given that pattern characters $P[1:q]$ match text characters $T[s+1:s+q]$ (that is, $P[:q] \sqsupset T[:s+q]$), what is the least shift $s' > s$ such that for some $k < q$,

$$P[1:k] = T[s'+1 : s'+k] \tag{32.6}$$

(that is, $P[:k] \sqsupset T[:s'+k]$), where $s' + k = s + q$?


Here’s another way to look at this question. If you know $P[:q] \sqsupset T[:s+q]$, then how do you find the longest proper prefix $P[:k]$ of $P[:q]$ that is also a suffix of $T[:s+q]$? These questions are equivalent because given $s$ and $q$, requiring $s' + k = s + q$ means that finding the smallest shift $s'$ (2 in Figure 32.9(b)) is tantamount to finding the longest prefix length $k$ (3 in Figure 32.9(b)). If you add the difference $q - k$ in the lengths of these prefixes of $P$ to the shift $s$, you get the new shift $s'$, so that $s' = s + (q - k)$. In the best case, $k = 0$, so that $s' = s + q$, immediately ruling out shifts $s+1, s+2, \ldots, s+q-1$. In any case, at the new shift $s'$, it is redundant to compare the first $k$ characters of $P$ with the corresponding characters of $T$, since equation (32.6) guarantees that they match.
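The relationship between candidate shifts $s'$ and prefix lengths $k$ can be explored by brute force. The sketch below is only an illustration, not part of the KMP algorithm itself: the function name `potentially_valid_shifts` is hypothetical, strings are 0-indexed (so the book's $P[1:k]$ is `P[:k]`), and the values $s = 4$ and the text string are read off Figure 32.9 as an assumption.

```python
def potentially_valid_shifts(P, T, s, q):
    """Return (s', k) pairs with s < s' <= s + q for which the first
    k = s + q - s' pattern characters agree with the text already seen.
    With 0-indexed strings, the book's T[s'+1 : s'+k] is T[s2 : s2 + k]."""
    assert P[:q] == T[s:s + q], "precondition: q characters matched at shift s"
    result = []
    for s2 in range(s + 1, s + q + 1):
        k = s + q - s2               # from the constraint s' + k = s + q
        if P[:k] == T[s2:s2 + k]:    # equation (32.6)
            result.append((s2, k))
    return result


P = "ababaca"
T = "bacbababaabcbab"    # text as drawn in Figure 32.9 (assumed transcription)
s, q = 4, 5              # assumed: P[:5] lines up with T at shift 4
print(potentially_valid_shifts(P, T, s, q))   # [(6, 3), (8, 1), (9, 0)]
```

The first pair is the least shift $s' = s + 2$ with the longest prefix length $k = 3$; the later pairs correspond to $s + 4$ with $k = 1$ and $s + q$ with $k = 0$.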
As Figure 32.9(d) demonstrates, you can precompute the necessary information by comparing the pattern against itself. Since $T[s'+1 : s'+k]$ is part of the matched portion of the text, it is a suffix of the string $P[:q]$. Therefore, think of equation (32.6) as asking for the greatest $k < q$ such that $P[:k] \sqsupset P[:q]$. Then, the new shift $s' = s + (q - k)$ is the next potentially valid shift. It will be convenient to store, for each value of $q$, the number $k$ of matching characters at the new shift $s'$, rather than storing, say, the amount $s' - s$ to shift by.
[Figure 32.9: panels (a)-(c) show the pattern $P = \texttt{ababaca}$ aligned against the text $T = \texttt{bacbababaabcbab}$ at shifts $s$, $s' = s + 2$, and $s + 4$; panel (d) aligns $P[:k]$ under $P[:q]$.]

Figure 32.9 The prefix function $\pi$. (a) The pattern $P = \texttt{ababaca}$ aligns with a text $T$ so that the first $q = 5$ characters match. Matching characters, in blue, are connected by blue lines. (b) Knowing these particular 5 matched characters ($P[:5]$) suffices to deduce that a shift of $s + 1$ is invalid, but that a shift of $s' = s + 2$ is consistent with everything known about the text and therefore is potentially valid. The prefix $P[:k]$, where $k = 3$, aligns with the text seen so far. (c) A shift of $s + 4$ is also potentially valid, but it leaves only the prefix $P[:1]$ aligned with the text seen so far. (d) To precompute useful information for such deductions, compare the pattern with itself. Here, the longest prefix of $P$ that is also a proper suffix of $P[:5]$ is $P[:3]$. The array $\pi$ represents this precomputed information, so that $\pi[5] = 3$. Given that $q$ characters have matched successfully at shift $s$, the next potentially valid shift is at $s' = s + (q - \pi[q])$ as shown in part (b).

Let’s look at the precomputed information a little more formally. For a given pattern $P[1:m]$, the prefix function for $P$ is the function $\pi : \{1, 2, \ldots, m\} \to \{0, 1, \ldots, m-1\}$ such that

$$\pi[q] = \max\,\{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}.$$

That is, $\pi[q]$ is the length of the longest prefix of $P$ that is a proper suffix of $P[:q]$. Here is the complete prefix function $\pi$ for the pattern ababaca:

    i       1  2  3  4  5  6  7
    P[i]    a  b  a  b  a  c  a
    π[i]    0  0  1  2  3  0  1
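The table can be reproduced directly from the definition of $\pi$. The following is a deliberately naive sketch (cubic time in the worst case), meant only to make the definition concrete; the function name `prefix_function_bruteforce` and the dummy entry at index 0, used to keep the book's 1-based numbering, are choices made for this illustration.

```python
def prefix_function_bruteforce(P):
    """pi[q] for q = 1..m straight from the definition: the largest
    k < q such that P[:k] is a suffix of P[:q].  Index 0 is a dummy
    entry so that pi[q] follows the 1-based numbering of the text."""
    m = len(P)
    pi = [None] * (m + 1)
    for q in range(1, m + 1):
        # k = 0 always qualifies (the empty prefix), so max() is well defined.
        pi[q] = max(k for k in range(q) if P[:q].endswith(P[:k]))
    return pi


print(prefix_function_bruteforce("ababaca")[1:])   # [0, 0, 1, 2, 3, 0, 1]
```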

The procedure KMP-MATCHER below gives the Knuth-Morris-Pratt matching algorithm. The procedure follows from FINITE-AUTOMATON-MATCHER for the most part. To compute $\pi$, KMP-MATCHER calls the auxiliary procedure COMPUTE-PREFIX-FUNCTION. These two procedures have much in common, because both match a string against the pattern $P$: KMP-MATCHER matches the text $T$ against $P$, and COMPUTE-PREFIX-FUNCTION matches $P$ against itself.

KMP-MATCHER(T, P, n, m)
 1  π = COMPUTE-PREFIX-FUNCTION(P, m)
 2  q = 0                                   // number of characters matched
 3  for i = 1 to n                          // scan the text from left to right
 4      while q > 0 and P[q + 1] ≠ T[i]
 5          q = π[q]                        // next character does not match
 6      if P[q + 1] == T[i]
 7          q = q + 1                       // next character matches
 8      if q == m                           // is all of P matched?
 9          print “Pattern occurs with shift” i - m
10          q = π[q]                        // look for the next match

COMPUTE-PREFIX-FUNCTION(P, m)
 1  let π[1 : m] be a new array
 2  π[1] = 0
 3  k = 0
 4  for q = 2 to m
 5      while k > 0 and P[k + 1] ≠ P[q]
 6          k = π[k]
 7      if P[k + 1] == P[q]
 8          k = k + 1
 9      π[q] = k
10  return π

Next, let’s analyze the running times of these procedures. Then we’ll prove them correct, which will be more complicated.

Running-time analysis

The running time of COMPUTE-PREFIX-FUNCTION is $\Theta(m)$, which we show by using the aggregate method of amortized analysis (see Section 16.1). The only tricky part is showing that the while loop of lines 5-6 executes $O(m)$ times altogether. Starting with some observations about $k$, we’ll show that it makes at most $m - 1$ iterations. First, line 3 starts $k$ at 0, and the only way that $k$ increases is by the increment operation in line 8, which executes at most once per iteration of the for loop of lines 4-9. Thus, the total increase in $k$ is at most $m - 1$. Second, since $k < q$ upon entering the for loop and each iteration of the loop increments $q$, we always have $k < q$. Therefore, the assignments in lines 2 and 9 ensure that $\pi[q] < q$ for all $q = 1, 2, \ldots, m$, which means that each iteration of the while loop decreases $k$. Third, $k$ never becomes negative. Putting these facts together, we see that the total decrease in $k$ from the while loop is bounded from above by the total increase in $k$ over all iterations of the for loop, which is $m - 1$. Thus, the while loop iterates at most $m - 1$ times in all, and COMPUTE-PREFIX-FUNCTION runs in $\Theta(m)$ time.

Exercise 32.4-4 asks you to show, by a similar aggregate analysis, that the matching time of KMP-MATCHER is $\Theta(n)$.
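The two procedures translate fairly directly into a runnable language. The following Python port is one possible sketch, not the book's own code: it uses 0-based indexing (so `pi[q - 1]` stores the book's $\pi[q]$ and `P[q]` is the book's $P[q+1]$), it returns the list of shifts instead of printing them, and its handling of an empty pattern is an assumption added for this illustration.

```python
def compute_prefix_function(P):
    """0-indexed port of COMPUTE-PREFIX-FUNCTION.
    pi[q - 1] holds the value the text calls pi[q]."""
    m = len(P)
    pi = [0] * m
    k = 0                                  # characters currently matched
    for q in range(1, m):                  # book: q = 2 .. m
        while k > 0 and P[k] != P[q]:      # book: P[k + 1] != P[q]
            k = pi[k - 1]                  # book: k = pi[k]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(T, P):
    """Return every shift s such that T[s : s + m] == P."""
    n, m = len(T), len(P)
    if m == 0:
        return list(range(n + 1))          # assumption: empty pattern occurs at every shift
    pi = compute_prefix_function(P)
    shifts = []
    q = 0                                  # number of characters matched
    for i in range(n):                     # scan the text from left to right
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]                  # next character does not match
        if P[q] == T[i]:
            q += 1                         # next character matches
        if q == m:                         # is all of P matched?
            shifts.append(i + 1 - m)       # book's shift i - m with 1-based i
            q = pi[q - 1]                  # look for the next match
    return shifts


print(compute_prefix_function("ababaca"))     # [0, 0, 1, 2, 3, 0, 1]
print(kmp_matcher("abababacaba", "ababaca"))  # [2]
print(kmp_matcher("aaaa", "aa"))              # [0, 1, 2]
```

The last call shows that overlapping occurrences are reported, which is the effect of resetting `q` through the prefix function after a match rather than resetting it to 0.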
[Figure 32.10: panel (a) repeats the table of $\pi$ for $P = \texttt{ababaca}$; panel (b) shows $P$ slid to the right beneath itself, with rows labeled $P_5$, $P_3$, $P_1$, and $P_0 = \varepsilon$ and annotated $\pi[5] = 3$, $\pi[3] = 1$, $\pi[1] = 0$.]

Figure 32.10 An illustration of Lemma 32.5 for the pattern $P = \texttt{ababaca}$ and $q = 5$. (a) The $\pi$ function for the given pattern. Since $\pi[5] = 3$, $\pi[3] = 1$, and $\pi[1] = 0$, iterating $\pi$ gives $\pi^*[5] = \{3, 1, 0\}$. (b) Sliding the template containing the pattern $P$ to the right and noting when some prefix $P[:k]$ of $P$ matches up with some proper suffix of $P[:5]$. Matches occur when $k = 3$, 1, and 0. In the figure, the first row gives $P$, and the vertical red line is drawn just after $P[:5]$. Successive rows show all the shifts of $P$ that cause some prefix $P[:k]$ of $P$ to match some suffix of $P[:5]$. Successfully matched characters are shown in blue. Blue lines connect aligned matching characters. Thus, $\{k : k < 5 \text{ and } P[:k] \sqsupset P[:5]\} = \{3, 1, 0\}$. Lemma 32.5 claims that $\pi^*[q] = \{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$ for all $q$.

Compared with FINITE-AUTOMATON-MATCHER, by using $\pi$ rather than $\delta$, the KMP algorithm reduces the time for preprocessing the pattern from $O(m\,|\Sigma|)$ to $\Theta(m)$, while keeping the actual matching time bounded by $\Theta(n)$.
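The $\Theta(m)$ preprocessing bound rests on the aggregate argument that the while loop of COMPUTE-PREFIX-FUNCTION runs at most $m - 1$ times in total. An empirical check is no substitute for that argument, but it can make the bound tangible. The sketch below instruments a 0-indexed port of the procedure; the function name `prefix_function_with_count`, the two-letter alphabet, and the random seed are arbitrary choices for this illustration.

```python
import random


def prefix_function_with_count(P):
    """0-indexed COMPUTE-PREFIX-FUNCTION that also counts how many
    times the body of the while loop executes over the whole run."""
    m = len(P)
    pi = [0] * m
    k = 0
    while_iterations = 0
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]
            while_iterations += 1
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi, while_iterations


random.seed(0)   # arbitrary seed, only for repeatability
for _ in range(1000):
    m = random.randint(1, 30)
    P = "".join(random.choice("ab") for _ in range(m))
    _, count = prefix_function_with_count(P)
    assert count <= m - 1   # the aggregate bound from the analysis

print(prefix_function_with_count("ababaca"))   # ([0, 0, 1, 2, 3, 0, 1], 2)
```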

Correctness of the prefix-function computation


We’ll see a little later that the prefix function $\pi$ helps to simulate the transition function $\delta$ in a string-matching automaton. But first, we need to prove that the procedure COMPUTE-PREFIX-FUNCTION does indeed compute the prefix function correctly. Doing so requires finding all prefixes $P[:k]$ that are proper suffixes of a given prefix $P[:q]$. The value of $\pi[q]$ gives us the length of the longest such prefix, but the following lemma, illustrated in Figure 32.10, shows that iterating the prefix function $\pi$ generates all the prefixes $P[:k]$ that are proper suffixes of $P[:q]$. Let

$$\pi^*[q] = \{\pi[q], \pi^{(2)}[q], \pi^{(3)}[q], \ldots, \pi^{(t)}[q]\},$$

where $\pi^{(i)}[q]$ is defined in terms of functional iteration, so that $\pi^{(0)}[q] = q$ and $\pi^{(i)}[q] = \pi[\pi^{(i-1)}[q]]$ for $i \ge 1$ (so that $\pi[q] = \pi^{(1)}[q]$), and where the sequence in $\pi^*[q]$ stops upon reaching $\pi^{(t)}[q] = 0$ for some $t \ge 1$.
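Functional iteration of $\pi$ is a short loop in code. The sketch below assumes the prefix function is already available as a 1-based list with an unused entry at index 0 (a representation chosen here for readability, not one the text prescribes), and the helper name `pi_star` is hypothetical.

```python
def pi_star(pi, q):
    """Iterate the prefix function starting from q until it reaches 0.
    `pi` is 1-based (pi[0] is an unused dummy entry).  The values are
    returned in the order visited, which is strictly decreasing."""
    values = []
    while q > 0:
        q = pi[q]
        values.append(q)
    return values


pi = [None, 0, 0, 1, 2, 3, 0, 1]   # prefix function of ababaca, index 0 unused
print(pi_star(pi, 5))   # [3, 1, 0]
print(pi_star(pi, 7))   # [1, 0]
```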

Lemma 32.5 (Prefix-function iteration lemma)

Let $P$ be a pattern of length $m$ with prefix function $\pi$. Then, for $q = 1, 2, \ldots, m$, we have $\pi^*[q] = \{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$.

Proof We first prove that $\pi^*[q] \subseteq \{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$ or, equivalently,

$$i \in \pi^*[q] \text{ implies } P[:i] \sqsupset P[:q]. \tag{32.7}$$

If $i \in \pi^*[q]$, then $i = \pi^{(u)}[q]$ for some $u > 0$. We prove equation (32.7) by induction on $u$. For $u = 1$, we have $i = \pi[q]$, and the claim follows since $i < q$ and $P[:\pi[q]] \sqsupset P[:q]$ by the definition of $\pi$. Now consider some $u \ge 1$ such that both $\pi^{(u)}[q]$ and $\pi^{(u+1)}[q]$ belong to $\pi^*[q]$. Let $i = \pi^{(u)}[q]$, so that $\pi[i] = \pi^{(u+1)}[q]$. The inductive hypothesis is that $P[:i] \sqsupset P[:q]$. Because the relations $<$ and $\sqsupset$ are transitive, we have $\pi[i] < i < q$ and $P[:\pi[i]] \sqsupset P[:i] \sqsupset P[:q]$, which establishes equation (32.7) for all $i$ in $\pi^*[q]$. Therefore, $\pi^*[q] \subseteq \{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$.

We now prove that $\{k : k < q \text{ and } P[:k] \sqsupset P[:q]\} \subseteq \pi^*[q]$ by contradiction. Suppose to the contrary that the set $\{k : k < q \text{ and } P[:k] \sqsupset P[:q]\} - \pi^*[q]$ is nonempty, and let $j$ be the largest number in the set. Because $\pi[q]$ is the largest value in $\{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$ and $\pi[q] \in \pi^*[q]$, it must be the case that $j < \pi[q]$. Having established that $\pi^*[q]$ contains at least one integer greater than $j$, let $j'$ denote the smallest such integer. (We can choose $j' = \pi[q]$ if no other number in $\pi^*[q]$ is greater than $j$.) We have $P[:j] \sqsupset P[:q]$ because $j \in \{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$, and from $j' \in \pi^*[q]$ and equation (32.7), we have $P[:j'] \sqsupset P[:q]$. Thus, $P[:j] \sqsupset P[:j']$ by Lemma 32.1, and $j$ is the largest value less than $j'$ with this property. Therefore, we must have $\pi[j'] = j$ and, since $j' \in \pi^*[q]$, we must have $j \in \pi^*[q]$ as well. This contradiction proves the lemma.
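Lemma 32.5 can also be sanity-checked exhaustively on short patterns. The sketch below is a test harness, not a proof: it computes $\pi$ by brute force from the definition (so it does not presuppose the correctness of COMPUTE-PREFIX-FUNCTION), and the function names and the choice of all binary strings up to length 8 are assumptions made for this illustration.

```python
import itertools


def prefix_function_bruteforce(P):
    """1-based pi (index 0 unused), computed from the definition."""
    m = len(P)
    pi = [None] * (m + 1)
    for q in range(1, m + 1):
        pi[q] = max(k for k in range(q) if P[:q].endswith(P[:k]))
    return pi


def lemma_32_5_holds(P):
    """Check pi*[q] == {k : k < q and P[:k] is a suffix of P[:q]} for all q."""
    pi = prefix_function_bruteforce(P)
    for q in range(1, len(P) + 1):
        iterated = set()
        j = q
        while j > 0:              # functional iteration of pi from q
            j = pi[j]
            iterated.add(j)
        proper_borders = {k for k in range(q) if P[:q].endswith(P[:k])}
        if iterated != proper_borders:
            return False
    return True


# Exhaustive check over every string of length 1..8 on the alphabet {a, b}.
assert all(
    lemma_32_5_holds("".join(w))
    for n in range(1, 9)
    for w in itertools.product("ab", repeat=n)
)
print(lemma_32_5_holds("ababaca"))   # True
```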

The algorithm COMPUTE-PREFIX-FUNCTION computes $\pi[q]$, in order, for $q = 1, 2, \ldots, m$. Setting $\pi[1]$ to 0 in line 2 of COMPUTE-PREFIX-FUNCTION is certainly correct, since $\pi[q] < q$ for all $q$. We’ll use the following lemma and its corollary to prove that COMPUTE-PREFIX-FUNCTION computes $\pi[q]$ correctly for $q > 1$.

Lemma 32.6
Let $P$ be a pattern of length $m$, and let $\pi$ be the prefix function for $P$. For $q = 1, 2, \ldots, m$, if $\pi[q] > 0$, then $\pi[q] - 1 \in \pi^*[q-1]$.

Proof Let $r = \pi[q] > 0$, so that $r < q$ and $P[:r] \sqsupset P[:q]$, and thus, $r - 1 < q - 1$ and $P[:r-1] \sqsupset P[:q-1]$ (by dropping the last character from $P[:r]$ and $P[:q]$, which we can do because $r > 0$). By Lemma 32.5, therefore, $r - 1 \in \pi^*[q-1]$. Thus, we have $\pi[q] - 1 = r - 1 \in \pi^*[q-1]$.

For $q = 2, 3, \ldots, m$, define the subset $E_{q-1} \subseteq \pi^*[q-1]$ by

$$
\begin{aligned}
E_{q-1} &= \{k \in \pi^*[q-1] : P[k+1] = P[q]\} \\
        &= \{k : k < q-1 \text{ and } P[:k] \sqsupset P[:q-1] \text{ and } P[k+1] = P[q]\} \quad \text{(by Lemma 32.5)} \\
        &= \{k : k < q-1 \text{ and } P[:k+1] \sqsupset P[:q]\}.
\end{aligned}
$$

The set $E_{q-1}$ consists of the values $k < q - 1$ for which $P[:k] \sqsupset P[:q-1]$ and for which, because $P[k+1] = P[q]$, we have $P[:k+1] \sqsupset P[:q]$. Thus, $E_{q-1}$ consists of those values $k \in \pi^*[q-1]$ such that extending $P[:k]$ to $P[:k+1]$ produces a proper suffix of $P[:q]$.

Corollary 32.7
Let $P$ be a pattern of length $m$, and let $\pi$ be the prefix function for $P$. Then, for $q = 2, 3, \ldots, m$,

$$
\pi[q] =
\begin{cases}
0 & \text{if } E_{q-1} = \emptyset, \\
1 + \max E_{q-1} & \text{if } E_{q-1} \neq \emptyset.
\end{cases}
$$

Proof If $E_{q-1}$ is empty, there is no $k \in \pi^*[q-1]$ (including $k = 0$) such that extending $P[:k]$ to $P[:k+1]$ produces a proper suffix of $P[:q]$. Therefore, $\pi[q] = 0$.

If, instead, $E_{q-1}$ is nonempty, then for each $k \in E_{q-1}$, we have $k + 1 < q$ and $P[:k+1] \sqsupset P[:q]$. Therefore, the definition of $\pi[q]$ gives

$$\pi[q] \ge 1 + \max E_{q-1}. \tag{32.8}$$

Note that $\pi[q] > 0$. Let $r = \pi[q] - 1$, so that $r + 1 = \pi[q] > 0$, and therefore $P[:r+1] \sqsupset P[:q]$. If a nonempty string is a suffix of another, then the two strings must have the same last character. Since $r + 1 > 0$, the prefix $P[:r+1]$ is nonempty, and so $P[r+1] = P[q]$. Furthermore, $r \in \pi^*[q-1]$ by Lemma 32.6. Therefore, $r \in E_{q-1}$, and so $\pi[q] - 1 = r \le \max E_{q-1}$ or, equivalently,

$$\pi[q] \le 1 + \max E_{q-1}. \tag{32.9}$$

Combining equations (32.8) and (32.9) completes the proof.
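To see Corollary 32.7 at work on the running example, the sketch below builds $E_{q-1}$ explicitly and derives $\pi[q]$ from it. It is an illustration under stated assumptions: the helper names `borders` and `pi_via_corollary` are hypothetical, $\pi^*[q-1]$ is obtained through the characterization in Lemma 32.5 rather than by iteration, and Python strings are 0-based, so the book's $P[k+1]$ and $P[q]$ appear as `P[k]` and `P[q - 1]`.

```python
def borders(P, q):
    """All k < q such that P[:k] is a suffix of P[:q]; by Lemma 32.5
    this is the set pi*[q]."""
    return [k for k in range(q) if P[:q].endswith(P[:k])]


def pi_via_corollary(P, q):
    """The book's pi[q] for 2 <= q <= m, computed from E_{q-1}."""
    # E_{q-1}: members k of pi*[q-1] whose following character
    # (book's P[k+1]) equals the book's P[q].
    E = [k for k in borders(P, q - 1) if P[k] == P[q - 1]]
    return 1 + max(E) if E else 0


P = "ababaca"
print([pi_via_corollary(P, q) for q in range(2, len(P) + 1)])
# [0, 1, 2, 3, 0, 1], matching pi[2..7] in the table for ababaca
```

For $q = 5$, for instance, $\pi^*[4] = \{2, 0\}$ and both members extend with the character a, so $E_4 = \{2, 0\}$ and $\pi[5] = 1 + 2 = 3$; for $q = 6$ no member extends with c, so $E_5 = \emptyset$ and $\pi[6] = 0$.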

We now finish the proof that COMPUTE-PREFIX-FUNCTION computes $\pi$ correctly. The key is to combine the definition of $E_{q-1}$ with the statement of Corollary 32.7, so that $\pi[q]$ equals 1 plus the greatest value of $k$ in $\pi^*[q-1]$ such that $P[k+1] = P[q]$. First, in COMPUTE-PREFIX-FUNCTION, $k = \pi[q-1]$ at the start of each iteration of the for loop of lines 4-9. This condition is enforced by lines 2 and 3 when the loop is first entered, and it remains true in each successive iteration because of line 9. Lines 5-8 adjust $k$ so that it becomes the correct value of $\pi[q]$. The while loop of lines 5-6 searches through all values $k \in \pi^*[q-1]$ in decreasing order to find the value of $\pi[q]$. The loop terminates either because $k$ reaches 0 or $P[k+1] = P[q]$. Because the “and” operator short-circuits, if the loop terminates because $P[k+1] = P[q]$, then $k$ must have also been positive, and so $k$ is the greatest value in $E_{q-1}$. In this case, lines 7-9 set $\pi[q]$ to $k + 1$, according to Corollary 32.7. If, instead, the while loop terminates because $k = 0$, then there are two possibilities. If $P[1] = P[q]$, then $E_{q-1} = \{0\}$, and lines 7-9 set both $k$ and $\pi[q]$ to 1. If $k = 0$ and $P[1] \ne P[q]$, however, then $E_{q-1} = \emptyset$. In this case, line 9 sets $\pi[q]$ to 0, again according to Corollary 32.7, which completes the proof of the correctness of COMPUTE-PREFIX-FUNCTION.

Correctness of the Knuth-Morris-Pratt algorithm


You can think of the procedure KMP-MATCHER as a reimplemented version of the procedure FINITE-AUTOMATON-MATCHER, but using the prefix function $\pi$ to compute state transitions. Specifically, we’ll prove that in the $i$th iteration of the for loops of both KMP-MATCHER and FINITE-AUTOMATON-MATCHER, the state $q$ has the same value upon testing for equality with $m$ (at line 8 in KMP-MATCHER and at line 4 in FINITE-AUTOMATON-MATCHER). Once we have argued that KMP-MATCHER simulates the behavior of FINITE-AUTOMATON-MATCHER, the correctness of KMP-MATCHER follows from the correctness of FINITE-AUTOMATON-MATCHER (though we’ll see a little later why line 10 in KMP-MATCHER is necessary).
Before formally proving that KMP-MATCHER correctly simulates FINITE-AUTOMATON-MATCHER, let’s take a moment to understand how the prefix function $\pi$ replaces the $\delta$ transition function. Recall that when a string-matching automaton is in state $q$ and it scans a character $a = T[i]$, it moves to a new state $\delta(q, a)$. If $a = P[q+1]$, so that $a$ continues to match the pattern, then the state number is incremented: $\delta(q, a) = q + 1$. Otherwise, $a \ne P[q+1]$, so that $a$ does not continue to match the pattern, and the state number does not increase: $0 \le \delta(q, a) \le q$. In the first case, when $a$ continues to match, KMP-MATCHER moves to state $q + 1$ without referring to the $\pi$ function: the while loop test in line 4 immediately comes up false, the test in line 6 comes up true, and line 7 increments $q$.
The $\pi$ function comes into play when the character $a$ does not continue to match the pattern, so that the new state $\delta(q, a)$ is either $q$ or to the left of $q$ along the spine of the automaton. The while loop of lines 4-5 in KMP-MATCHER iterates through the states in $\pi^*[q]$, stopping either when it arrives in a state, say $q'$, such that $a$ matches $P[q'+1]$ or $q'$ has gone all the way down to 0. If $a$ matches $P[q'+1]$, then line 7 sets the new state to $q' + 1$, which should equal $\delta(q, a)$ for the simulation to work correctly. In other words, the new state $\delta(q, a)$ should be either state 0 or a state numbered 1 more than some state in $\pi^*[q]$.
Let’s look at the example in Figures 32.6 and 32.10, which are for the pattern $P = \texttt{ababaca}$. Suppose that the automaton is in state $q = 5$, having matched ababa. The states in $\pi^*[5]$ are, in descending order, 3, 1, and 0. If the next character scanned is c, then you can see that the automaton moves to state $\delta(5, \texttt{c}) = 6$ in both FINITE-AUTOMATON-MATCHER (line 3) and KMP-MATCHER (line 7). Now suppose that the next character scanned is instead b, so that the automaton should move to state $\delta(5, \texttt{b}) = 4$. The while loop in KMP-MATCHER exits after executing line 5 once, and the automaton arrives in state $q' = \pi[5] = 3$. Since $P[q'+1] = P[4] = \texttt{b}$, the test in line 6 comes up true, and the automaton moves to the new state $q' + 1 = 4 = \delta(5, \texttt{b})$. Finally, suppose that the next character scanned is instead a, so that the automaton should move to state $\delta(5, \texttt{a}) = 1$. The first three times that the test in line 4 executes, the test comes up true. The first time finds that $P[6] = \texttt{c} \ne \texttt{a}$, and the automaton moves to state $\pi[5] = 3$ (the first state in $\pi^*[5]$). The second time finds that $P[4] = \texttt{b} \ne \texttt{a}$, and the automaton moves to state $\pi[3] = 1$ (the second state in $\pi^*[5]$). The third time finds that $P[2] = \texttt{b} \ne \texttt{a}$, and the automaton moves to state $\pi[1] = 0$ (the last state in $\pi^*[5]$). The while loop exits once it arrives in state $q' = 0$. Now line 6 finds that $P[q'+1] = P[1] = \texttt{a}$, and line 7 moves the automaton to the new state $q' + 1 = 1 = \delta(5, \texttt{a})$.
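The three transitions traced above can be replayed in code. The sketch below isolates lines 4-7 of KMP-MATCHER as a single function; the name `kmp_transition` is hypothetical, the prefix function is supplied as a 1-based list with an unused entry at index 0, and the function assumes $0 \le q < m$ (state $m$ is handled by line 10 of KMP-MATCHER before the next character is scanned).

```python
def kmp_transition(P, pi, q, a):
    """Simulate delta(q, a) for 0 <= q < len(P) the way lines 4-7 of
    KMP-MATCHER do.  `pi` is 1-based (pi[0] unused); P is an ordinary
    0-based string, so the book's P[q + 1] is P[q] here."""
    while q > 0 and P[q] != a:
        q = pi[q]            # fall back to the next state in pi*[q]
    if P[q] == a:
        q += 1               # the character extends the match
    return q


P = "ababaca"
pi = [None, 0, 0, 1, 2, 3, 0, 1]
print(kmp_transition(P, pi, 5, "c"))   # 6
print(kmp_transition(P, pi, 5, "b"))   # 4
print(kmp_transition(P, pi, 5, "a"))   # 1
```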
Thus, the intuition is that KMP-MATCHER iterates through the states in $\pi^*[q]$ in decreasing order, stopping at some state $q'$ and then possibly moving to state $q' + 1$. Although that might seem like a lot of work just to simulate computing $\delta(q, a)$, bear in mind that asymptotically, KMP-MATCHER is no slower than FINITE-AUTOMATON-MATCHER.
We are now ready to formally prove the correctness of the Knuth-Morris-Pratt algorithm. By Theorem 32.4, we have that $q = \sigma(T[:i])$ after each time line 3 of FINITE-AUTOMATON-MATCHER executes. Therefore, it suffices to show that the same property holds with regard to the for loop in KMP-MATCHER. The proof proceeds by induction on the number of loop iterations. Initially, both procedures set $q$ to 0 as they enter their respective for loops for the first time. Consider iteration $i$ of the for loop in KMP-MATCHER. By the inductive hypothesis, the state number $q$ equals $\sigma(T[:i-1])$ at the start of the loop iteration. We need to show that when line 8 is reached, the new value of $q$ is $\sigma(T[:i])$. (Again, we’ll handle line 10 separately.)
Considering $q$ to be the state number at the start of the for loop iteration, when KMP-MATCHER considers the character $T[i]$, the longest prefix of $P$ that is a suffix of $T[:i]$ is either $P[:q+1]$ (if $P[q+1] = T[i]$) or some prefix (not necessarily proper, and possibly empty) of $P[:q]$. We consider separately the three cases in which $\sigma(T[:i]) = 0$, $\sigma(T[:i]) = q + 1$, and $0 < \sigma(T[:i]) \le q$.

• If $\sigma(T[:i]) = 0$, then $P[:0] = \varepsilon$ is the only prefix of $P$ that is a suffix of $T[:i]$. The while loop of lines 4-5 iterates through each value $q'$ in $\pi^*[q]$, but although $P[:q'] \sqsupset P[:q] \sqsupset T[:i-1]$ for every $q' \in \pi^*[q]$ (because $<$ and $\sqsupset$ are transitive relations), the loop never finds a $q'$ such that $P[q'+1] = T[i]$. The loop terminates when $q$ reaches 0, and of course line 7 does not execute. Therefore, $q = 0$ at line 8, so that now $q = \sigma(T[:i])$.

• If $\sigma(T[:i]) = q + 1$, then $P[q+1] = T[i]$, and the while loop test in line 4 fails the first time through. Line 7 executes, incrementing the state number to $q + 1$, which equals $\sigma(T[:i])$.

• If $0 < \sigma(T[:i]) \le q$, then the while loop of lines 4-5 iterates at least once, checking in decreasing order each value in $\pi^*[q]$ until it stops at some $q' < q$. Thus, $P[:q']$ is the longest prefix of $P[:q]$ for which $P[q'+1] = T[i]$, so that when the while loop terminates, $q' + 1 = \sigma(P[:q]\,T[i])$. Since $q = \sigma(T[:i-1])$, Lemma 32.3 implies that $\sigma(T[:i-1]\,T[i]) = \sigma(P[:q]\,T[i])$. Thus we have

$$
\begin{aligned}
q' + 1 &= \sigma(P[:q]\,T[i]) \\
       &= \sigma(T[:i-1]\,T[i]) \\
       &= \sigma(T[:i])
\end{aligned}
$$

when the while loop terminates. After line 7 increments $q$, the new state number $q$ equals $\sigma(T[:i])$.
Line 10 is necessary in KMP-MATCHER, because otherwise, line 4 might try to reference $P[m+1]$ after finding an occurrence of $P$. (The argument that $q = \sigma(T[:i-1])$ upon the next execution of line 4 remains valid by the hint given in Exercise 32.4-8: that $\delta(m, a) = \delta(\pi[m], a)$ or, equivalently, $\sigma(Pa) = \sigma(P[:\pi[m]]\,a)$ for any $a \in \Sigma$.) The remaining argument for the correctness of the Knuth-Morris-Pratt algorithm follows from the correctness of FINITE-AUTOMATON-MATCHER, since we have shown that KMP-MATCHER simulates the behavior of FINITE-AUTOMATON-MATCHER.
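The invariant at the heart of this proof, that $q = \sigma(T[:i])$ whenever line 8 is reached, can be checked mechanically on small inputs. The sketch below is a test harness rather than a proof: `sigma` evaluates the suffix function by brute force, `states_at_line_8` is a 0-indexed port of KMP-MATCHER that records $q$ just before the comparison with $m$, and both names, along with the requirement that the pattern be nonempty, are assumptions of this illustration.

```python
def sigma(P, x):
    """Suffix function: length of the longest prefix of P that is a suffix of x."""
    return max(k for k in range(min(len(P), len(x)) + 1) if x.endswith(P[:k]))


def compute_prefix_function(P):
    """0-indexed prefix function: pi[q - 1] is the book's pi[q]."""
    pi = [0] * len(P)
    k = 0
    for q in range(1, len(P)):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi


def states_at_line_8(T, P):
    """Run KMP-MATCHER (nonempty P) and record q each time line 8 is reached."""
    m = len(P)
    pi = compute_prefix_function(P)
    q = 0
    states = []
    for c in T:
        while q > 0 and P[q] != c:
            q = pi[q - 1]
        if P[q] == c:
            q += 1
        states.append(q)          # value compared with m on line 8
        if q == m:
            q = pi[q - 1]         # line 10: keeps P[q] in range next time
    return states


T, P = "abababacaba", "ababaca"
expected = [sigma(P, T[:i]) for i in range(1, len(T) + 1)]
assert states_at_line_8(T, P) == expected
print(expected)   # [1, 2, 3, 4, 5, 4, 5, 6, 7, 2, 3]
```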

Exercises

32.4-1
Compute the prefix function $\pi$ for the pattern ababbabbabbababbabb.

32.4-2
Give an upper bound on the size of $\pi^*[q]$ as a function of $q$. Give an example to show that your bound is tight.
