32.4 The Knuth-Morris-Pratt Algorithm: Either
★ 32.3-5
Given two patterns $P$ and $P'$, describe how to construct a finite automaton that
determines all occurrences of either pattern. Try to minimize the number of states
in your automaton.
32.3-6
Given a pattern $P$ containing gap characters (see Exercise 32.1-4), show how to
build a finite automaton that can find an occurrence of $P$ in a text $T$ in $O(n)$
matching time, where $n = |T|$.
Knuth, Morris, and Pratt developed a linear-time string matching algorithm that
avoids computing the transition function $\delta$ altogether. Instead, the KMP algorithm
uses an auxiliary function $\pi$, which it precomputes from the pattern in $\Theta(m)$ time
and stores in an array $\pi[1:m]$. The array $\pi$ allows the algorithm to compute the
transition function $\delta$ efficiently (in an amortized sense) "on the fly" as needed.
Loosely speaking, for any state $q = 0, 1, \ldots, m$ and any character $a \in \Sigma$, the
value $\pi[q]$ contains the information needed to compute $\delta(q, a)$ but that does not
depend on $a$. Since the array $\pi$ has only $m$ entries, whereas $\delta$ has $\Theta(m|\Sigma|)$
entries, the KMP algorithm saves a factor of $|\Sigma|$ in the preprocessing time by
computing $\pi$ rather than $\delta$. Like the procedure FINITE-AUTOMATON-MATCHER, once
preprocessing has completed, the KMP algorithm uses $\Theta(n)$ matching time.
$s' = s + 2$ shown in part (b) of the figure, however, aligns the first three pattern
characters with three text characters that necessarily match.
More generally, suppose that you know that $P[:q] \sqsupset T[:s+q]$ or, equivalently,
that $P[1:q] = T[s+1:s+q]$. You want to shift $P$ so that some shorter
prefix $P[:k]$ of $P$ matches a suffix of $T[:s+q]$, if possible. You might have more
than one choice for how much to shift, however. In Figure 32.9(b), shifting $P$ by 2
positions works, so that $P[:3] \sqsupset T[:s+q]$, but so does shifting $P$ by 4 positions,
so that $P[:1] \sqsupset T[:s+q]$ in Figure 32.9(c). If more than one shift amount works,
you should choose the smallest shift amount so that you do not miss any potential
matches. Put more precisely, you want to answer this question:
Given that pattern characters $P[1:q]$ match text characters $T[s+1:s+q]$
(that is, $P[:q] \sqsupset T[:s+q]$), what is the least shift $s' > s$ such that for
some $k < q$,
$$P[1:k] = T[s'+1 : s'+k] ,$$ (32.6)
[Figure 32.9, panels (a)-(d): the text T = bacbababaabcbab with the pattern
P = ababaca aligned (a) at shift s, with q matched characters, (b) at shift
s' = s + 2, with k matched characters, and (c) at shift s + 4; panel (d) aligns
P[:k] under P[:q].]
Figure 32.9 The prefix function $\pi$. (a) The pattern $P$ = ababaca aligns with a text $T$ so that the
first $q = 5$ characters match. Matching characters, in blue, are connected by blue lines. (b) Knowing
these particular 5 matched characters ($P[:5]$) suffices to deduce that a shift of $s + 1$ is invalid,
but that a shift of $s' = s + 2$ is consistent with everything known about the text and therefore is
potentially valid. The prefix $P[:k]$, where $k = 3$, aligns with the text seen so far. (c) A shift of $s + 4$
is also potentially valid, but it leaves only the prefix $P[:1]$ aligned with the text seen so far. (d) To
precompute useful information for such deductions, compare the pattern with itself. Here, the longest
prefix of $P$ that is also a proper suffix of $P[:5]$ is $P[:3]$. The array $\pi$ represents this precomputed
information, so that $\pi[5] = 3$. Given that $q$ characters have matched successfully at shift $s$, the next
potentially valid shift is at $s' = s + (q - \pi[q])$ as shown in part (b).
$i$         1  2  3  4  5  6  7
$P[i]$      a  b  a  b  a  c  a
$\pi[i]$    0  0  1  2  3  0  1
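The table can be reproduced straight from the description in the caption: $\pi[q]$ is the length of the longest prefix of $P$ that is also a proper suffix of $P[:q]$. The following Python sketch does this by brute force. It is illustrative only and not the linear-time procedure; the function names, the padding entry `pi[0]`, and the sample shift `s = 4` are assumptions made for the example.

```python
def prefix_function_bruteforce(pattern: str) -> list[int]:
    """Return pi with pi[q] valid for q = 1..m; pi[0] is unused padding
    so that indices match the 1-based convention of the text."""
    m = len(pattern)
    pi = [0] * (m + 1)
    for q in range(1, m + 1):
        # Try the longest candidate first; k must be strictly less than q.
        for k in range(q - 1, -1, -1):
            if pattern[:q].endswith(pattern[:k]):
                pi[q] = k
                break
    return pi


def next_shift(s: int, q: int, pi: list[int]) -> int:
    """Next potentially valid shift after q characters matched at shift s."""
    return s + (q - pi[q])


pi = prefix_function_bruteforce("ababaca")
print(pi[1:])                 # [0, 0, 1, 2, 3, 0, 1]
print(next_shift(4, 5, pi))   # 6, that is, s' = s + 2 when s = 4 and q = 5
```

The brute-force version takes cubic time in the worst case, so it is useful only as a reference against which a faster implementation can be checked.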
The procedure KMP-MATCHER on the following page gives the Knuth-Morris-Pratt
matching algorithm. The procedure follows from FINITE-AUTOMATON-MATCHER
for the most part. To compute $\pi$, KMP-MATCHER calls the auxiliary
procedure COMPUTE-PREFIX-FUNCTION. These two procedures have much in
common, because both match a string against the pattern $P$: KMP-MATCHER
matches the text $T$ against $P$, and COMPUTE-PREFIX-FUNCTION matches $P$
against itself.
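As a rough companion to the two procedures, here is a minimal Python sketch of both. It is an adaptation under stated assumptions, not the book's pseudocode: Python strings are 0-indexed, so `pattern[k]` plays the role of $P[k+1]$; `pi[0]` is unused padding; an empty pattern is assumed to report no matches; and the line numbers cited in the surrounding text refer to the book's pseudocode rather than to this sketch.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi with pi[q] valid for q = 1..m (pi[0] is unused padding)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0                                   # number of characters matched
    for q in range(2, m + 1):
        # Fall back through shorter borders while P[k+1] != P[q].
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1                          # next character matches
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []                           # assumption: no matches reported
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                                   # number of characters matched
    for i in range(1, n + 1):               # scan T[1], ..., T[n]
        while q > 0 and pattern[q] != text[i - 1]:
            q = pi[q]                       # next character does not match
        if pattern[q] == text[i - 1]:
            q += 1                          # next character matches
        if q == m:                          # all of P matched
            shifts.append(i - m)
            q = pi[q]                       # look for the next match
    return shifts


print(kmp_matcher("abababab", "abab"))      # [0, 2, 4]
```

The two loops have the same shape, which mirrors the remark above: one matches the text against the pattern, the other matches the pattern against itself.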
Next, let’s analyze the running times of these procedures. Then we’ll prove them
correct, which will be more complicated.
Running-time analysis
The running time of COMPUTE-PREFIX-FUNCTION is $\Theta(m)$, which we show by
using the aggregate method of amortized analysis (see Section 16.1). The only
tricky part is showing that the while loop of lines 5-6 executes $O(m)$ times
altogether. Starting with some observations about $k$, we'll show that it makes at most
$m - 1$ iterations. First, line 3 starts $k$ at 0, and the only way that $k$ increases is by the
increment operation in line 8, which executes at most once per iteration of the for
loop of lines 4-9. Thus, the total increase in $k$ is at most $m - 1$. Second, since $k < q$
upon entering the for loop and each iteration of the loop increments $q$, we always
have $k < q$. Therefore, the assignments in lines 2 and 9 ensure that $\pi[q] < q$ for
all $q = 1, 2, \ldots, m$, which means that each iteration of the while loop decreases $k$.
Third, $k$ never becomes negative. Putting these facts together, we see that the total
decrease in $k$ from the while loop is bounded from above by the total increase in $k$
over all iterations of the for loop, which is $m - 1$. Thus, the while loop iterates at
most $m - 1$ times in all, and COMPUTE-PREFIX-FUNCTION runs in $\Theta(m)$ time.
Exercise 32.4-4 asks you to show, by a similar aggregate analysis, that the
matching time of KMP-MATCHER is $\Theta(n)$.
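The aggregate bound can be observed empirically. The sketch below instruments a Python adaptation of the prefix-function computation with a counter for the inner while loop; the function name and the sample patterns are assumptions chosen for illustration, and a handful of test patterns is of course not a proof.

```python
def count_while_iterations(pattern: str) -> int:
    """Run the prefix-function computation and return how many times
    the body of the inner while loop executes in total."""
    m = len(pattern)
    pi = [0] * (m + 1)          # pi[0] is unused padding
    k = 0
    iterations = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]           # each execution strictly decreases k
            iterations += 1
        if pattern[k] == pattern[q - 1]:
            k += 1              # the only place where k increases
        pi[q] = k
    return iterations


for p in ["ababaca", "aaaaab", "abcabcabd"]:
    # The total decrease in k cannot exceed its total increase, m - 1.
    assert count_while_iterations(p) <= len(p) - 1
```

A pattern such as `aaaaab` shows the amortization at work: the while loop never runs for the first five characters, and then runs four times in a row for the final `b`.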
[Figure 32.10, panels (a) and (b): (a) the table below; (b) the pattern
P = ababaca with copies of P slid to the right beneath it, in rows labeled
$P_5$, $P_3$, $P_1$, and $P_0 = \varepsilon$ and annotated $\pi[5] = 3$, $\pi[3] = 1$, $\pi[1] = 0$.]
$i$         1  2  3  4  5  6  7
$P[i]$      a  b  a  b  a  c  a
$\pi[i]$    0  0  1  2  3  0  1
Figure 32.10 An illustration of Lemma 32.5 for the pattern $P$ = ababaca and $q = 5$. (a) The
$\pi$ function for the given pattern. Since $\pi[5] = 3$, $\pi[3] = 1$, and $\pi[1] = 0$, iterating $\pi$ gives
$\pi^*[5] = \{3, 1, 0\}$. (b) Sliding the template containing the pattern $P$ to the right and noting when
some prefix $P[:k]$ of $P$ matches up with some proper suffix of $P[:5]$. Matches occur when $k = 3$,
1, and 0. In the figure, the first row gives $P$, and the vertical red line is drawn just after $P[:5]$.
Successive rows show all the shifts of $P$ that cause some prefix $P[:k]$ of $P$ to match some suffix
of $P[:5]$. Successfully matched characters are shown in blue. Blue lines connect aligned matching
characters. Thus, $\{k : k < 5 \text{ and } P[:k] \sqsupset P[:5]\} = \{3, 1, 0\}$. Lemma 32.5 claims that
$\pi^*[q] = \{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$ for all $q$.
where $\pi^{(i)}[q]$ is defined in terms of functional iteration, so that $\pi^{(0)}[q] = q$ and
$\pi^{(i)}[q] = \pi[\pi^{(i-1)}[q]]$ for $i \ge 1$ (so that $\pi[q] = \pi^{(1)}[q]$), and where the sequence
in $\pi^*[q]$ stops upon reaching $\pi^{(t)}[q] = 0$ for some $t \ge 1$.
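The iteration that defines $\pi^*[q]$ is short enough to write out. The following Python sketch computes $\pi^*[q]$ by repeatedly applying $\pi$ and compares it with a brute-force list of every $k < q$ for which $P[:k]$ is a suffix of $P[:q]$. All function names are hypothetical, `pi[0]` is unused padding, and the agreement of the two lists on one pattern only illustrates the lemma.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi with pi[q] valid for q = 1..m (pi[0] is unused padding)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def pi_star(pi: list[int], q: int) -> list[int]:
    """Return [pi[q], pi[pi[q]], ..., 0] for q >= 1, in decreasing order."""
    result = []
    while q > 0:
        q = pi[q]
        result.append(q)
    return result


def border_lengths(pattern: str, q: int) -> list[int]:
    """Brute force: all k < q such that P[:k] is a suffix of P[:q]."""
    return [k for k in range(q - 1, -1, -1)
            if pattern[:q].endswith(pattern[:k])]


pattern = "ababaca"
pi = compute_prefix_function(pattern)
print(pi_star(pi, 5))                  # [3, 1, 0]
print(border_lengths(pattern, 5))      # [3, 1, 0]
```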
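Proof We first prove that $\pi^*[q] \subseteq \{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$ or, equivalently,
$$i \in \pi^*[q] \text{ implies } P[:i] \sqsupset P[:q] .$$ (32.7)
If $i \in \pi^*[q]$, then $i = \pi^{(u)}[q]$ for some $u > 0$. We prove equation (32.7)
by induction on $u$. For $u = 1$, we have $i = \pi[q]$, and the claim holds because
$\pi[q] < q$ and $P[:\pi[q]] \sqsupset P[:q]$ by the definition of $\pi$. For the inductive step,
suppose that $i = \pi^{(u)}[q]$ satisfies $i < q$ and $P[:i] \sqsupset P[:q]$. Because
the relations $<$ and $\sqsupset$ are transitive, we have $\pi[i] < i < q$ and $P[:\pi[i]] \sqsupset
P[:i] \sqsupset P[:q]$, which establishes equation (32.7) for all $i$ in $\pi^*[q]$. Therefore,
$\pi^*[q] \subseteq \{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$.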
We now prove that $\{k : k < q \text{ and } P[:k] \sqsupset P[:q]\} \subseteq \pi^*[q]$ by contradiction.
Suppose to the contrary that the set $\{k : k < q \text{ and } P[:k] \sqsupset P[:q]\} - \pi^*[q]$ is
nonempty, and let $j$ be the largest number in the set. Because $\pi[q]$ is the largest
value in $\{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$ and $\pi[q] \in \pi^*[q]$, it must be the case
that $j < \pi[q]$. Having established that $\pi^*[q]$ contains at least one integer greater
than $j$, let $j'$ denote the smallest such integer. (We can choose $j' = \pi[q]$ if
no other number in $\pi^*[q]$ is greater than $j$.) We have $P[:j] \sqsupset P[:q]$ because
$j \in \{k : k < q \text{ and } P[:k] \sqsupset P[:q]\}$, and from $j' \in \pi^*[q]$ and equation (32.7),
we have $P[:j'] \sqsupset P[:q]$. Thus, $P[:j] \sqsupset P[:j']$ by Lemma 32.1, and $j$ is the
largest value less than $j'$ with this property. Therefore, we must have $\pi[j'] = j$
and, since $j' \in \pi^*[q]$, we must have $j \in \pi^*[q]$ as well. This contradiction proves
the lemma.
Lemma 32.6
Let $P$ be a pattern of length $m$, and let $\pi$ be the prefix function for $P$. For $q =
1, 2, \ldots, m$, if $\pi[q] > 0$, then $\pi[q] - 1 \in \pi^*[q-1]$.
$P[:r]$ and $P[:q]$, which we can do because $r > 0$). By Lemma 32.5, therefore,
$r - 1 \in \pi^*[q-1]$. Thus, we have $\pi[q] - 1 = r - 1 \in \pi^*[q-1]$.
For $q = 2, 3, \ldots, m$, define the subset $E_{q-1} \subseteq \pi^*[q-1]$ by
$$\begin{aligned}
E_{q-1} &= \{k \in \pi^*[q-1] : P[k+1] = P[q]\} \\
        &= \{k : k < q-1 \text{ and } P[:k] \sqsupset P[:q-1] \text{ and } P[k+1] = P[q]\} \quad \text{(by Lemma 32.5)} \\
        &= \{k : k < q-1 \text{ and } P[:k+1] \sqsupset P[:q]\} .
\end{aligned}$$
The set $E_{q-1}$ consists of the values $k < q - 1$ for which $P[:k] \sqsupset P[:q-1]$ and
for which, because $P[k+1] = P[q]$, we have $P[:k+1] \sqsupset P[:q]$. Thus, $E_{q-1}$
consists of those values $k \in \pi^*[q-1]$ such that extending $P[:k]$ to $P[:k+1]$
produces a proper suffix of $P[:q]$.
Corollary 32.7
Let $P$ be a pattern of length $m$, and let $\pi$ be the prefix function for $P$. Then, for
$q = 2, 3, \ldots, m$,
$$\pi[q] = \begin{cases}
0 & \text{if } E_{q-1} = \emptyset , \\
1 + \max E_{q-1} & \text{if } E_{q-1} \ne \emptyset .
\end{cases}$$
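Corollary 32.7 can be turned into code almost literally. The Python sketch below builds $E_{q-1}$ explicitly by walking through $\pi^*[q-1]$ and keeping each $k$ with $P[k+1] = P[q]$, then applies the two cases of the corollary. It is a deliberately unoptimized illustration with a hypothetical function name; it collects the whole set instead of stopping at the first (largest) qualifying $k$, and `pi[0]` is unused padding.

```python
def prefix_function_via_corollary(pattern: str) -> list[int]:
    """Compute pi[1..m] using pi[q] = 0 if E_{q-1} is empty,
    and 1 + max E_{q-1} otherwise (pi[0] is unused padding)."""
    m = len(pattern)
    pi = [0] * (m + 1)                      # pi[1] = 0 by definition
    for q in range(2, m + 1):
        e = []                              # the set E_{q-1}
        k = q - 1
        while k > 0:                        # iterate over pi*[q-1]
            k = pi[k]
            # 0-based pattern[k] is P[k+1]; pattern[q-1] is P[q].
            if pattern[k] == pattern[q - 1]:
                e.append(k)
        pi[q] = 1 + max(e) if e else 0
    return pi


print(prefix_function_via_corollary("ababaca")[1:])   # [0, 0, 1, 2, 3, 0, 1]
```

Because $\pi^*[q-1]$ is visited in decreasing order, the first $k$ that qualifies is already $\max E_{q-1}$; stopping there is what lets an efficient procedure avoid building the set.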
We now finish the proof that COMPUTE-PREFIX-FUNCTION computes $\pi$
correctly. The key is to combine the definition of $E_{q-1}$ with the statement of
Corollary 32.7, so that $\pi[q]$ equals 1 plus the greatest value of $k$ in $\pi^*[q-1]$ such that
$P[k+1] = P[q]$.
the states in $\pi^*[q]$, stopping either when it arrives in a state, say $q'$, such that $a$
matches $P[q'+1]$ or $q'$ has gone all the way down to 0. If $a$ matches $P[q'+1]$,
then line 7 sets the new state to $q' + 1$, which should equal $\delta(q, a)$ for the simulation
to work correctly. In other words, the new state $\delta(q, a)$ should be either state 0 or
a state numbered 1 more than some state in $\pi^*[q]$.
Let's look at the example in Figures 32.6 and 32.10, which are for the pattern
$P$ = ababaca. Suppose that the automaton is in state $q = 5$, having matched
ababa. The states in $\pi^*[5]$ are, in descending order, 3, 1, and 0. If the next
character scanned is c, then you can see that the automaton moves to state $\delta(5, \text{c}) = 6$
in both FINITE-AUTOMATON-MATCHER (line 3) and KMP-MATCHER (line 7).
Now suppose that the next character scanned is instead b, so that the automaton
should move to state $\delta(5, \text{b}) = 4$. The while loop in KMP-MATCHER exits after
executing line 5 once, and the automaton arrives in state $q' = \pi[5] = 3$. Since
$P[q'+1] = P[4] = \text{b}$, the test in line 6 comes up true, and the automaton moves
to the new state $q' + 1 = 4 = \delta(5, \text{b})$. Finally, suppose that the next character
scanned is instead a, so that the automaton should move to state $\delta(5, \text{a}) = 1$. The
first three times that the test in line 4 executes, the test comes up true. The first time
finds that $P[6] = \text{c} \ne \text{a}$, and the automaton moves to state $\pi[5] = 3$ (the first state
in $\pi^*[5]$). The second time finds that $P[4] = \text{b} \ne \text{a}$, and the automaton moves to
state $\pi[3] = 1$ (the second state in $\pi^*[5]$). The third time finds that $P[2] = \text{b} \ne \text{a}$,
and the automaton moves to state $\pi[1] = 0$ (the last state in $\pi^*[5]$). The while loop
exits once it arrives in state $q' = 0$. Now line 6 finds that $P[q'+1] = P[1] = \text{a}$,
and line 7 moves the automaton to the new state $q' + 1 = 1 = \delta(5, \text{a})$.
Thus, the intuition is that KMP-MATCHER iterates through the states in $\pi^*[q]$ in
decreasing order, stopping at some state $q'$ and then possibly moving to state $q' + 1$.
Although that might seem like a lot of work just to simulate computing $\delta(q, a)$,
bear in mind that asymptotically, KMP-MATCHER is no slower than
FINITE-AUTOMATON-MATCHER.
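The worked example can be replayed in code. The Python sketch below simulates one transition by falling back through $\pi$, and compares the result with $\delta(q, a)$ computed directly as the length of the longest prefix of $P$ that is a suffix of $P[:q]\,a$ (the definition of the transition function from Section 32.3). The function names are hypothetical, and the treatment of the accepting state $q = m$ (apply $\pi$ once before scanning) is an assumption that mirrors the reset KMP-MATCHER performs after reporting a match.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi with pi[q] valid for q = 1..m (pi[0] is unused padding)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def delta_via_pi(pattern: str, pi: list[int], q: int, a: str) -> int:
    """Simulate one automaton transition using only the array pi."""
    if q == len(pattern):
        q = pi[q]                       # assumed reset after a full match
    while q > 0 and pattern[q] != a:    # walk down the states in pi*[q]
        q = pi[q]
    if pattern[q] == a:
        q += 1
    return q


def delta_by_definition(pattern: str, q: int, a: str) -> int:
    """Longest prefix of the pattern that is a suffix of P[:q] + a."""
    s = pattern[:q] + a
    for k in range(min(len(pattern), len(s)), -1, -1):
        if s.endswith(pattern[:k]):
            return k
    return 0                            # unreachable: k = 0 always matches


pattern = "ababaca"
pi = compute_prefix_function(pattern)
for a in "cba":
    print(a, delta_via_pi(pattern, pi, 5, a))   # c 6, then b 4, then a 1
```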
We are now ready to formally prove the correctness of the Knuth-Morris-Pratt
algorithm. By Theorem 32.4, we have that $q = \sigma(T[:i])$ after each time line 3 of
FINITE-AUTOMATON-MATCHER executes. Therefore, it suffices to show that the
same property holds with regard to the for loop in KMP-MATCHER. The proof
proceeds by induction on the number of loop iterations. Initially, both procedures
set $q$ to 0 as they enter their respective for loops for the first time. Consider
iteration $i$ of the for loop in KMP-MATCHER. By the inductive hypothesis, the state
number $q$ equals $\sigma(T[:i-1])$ at the start of the loop iteration. We need to show
that when line 8 is reached, the new value of $q$ is $\sigma(T[:i])$. (Again, we'll handle
line 10 separately.)
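The invariant in this induction can be checked on concrete inputs. The Python sketch below records the state after each text character in a KMP-style scan and compares it with $\sigma(T[:i])$ computed by brute force, where $\sigma(x)$ is the length of the longest prefix of $P$ that is a suffix of $x$. The function names and the sample strings are assumptions for illustration; the reset after a full match is folded into the start of the next iteration, which is a restructuring of the book's procedure and not a transcription of it.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi with pi[q] valid for q = 1..m (pi[0] is unused padding)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def sigma(pattern: str, x: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0                            # unreachable: k = 0 always matches


def states_after_each_character(text: str, pattern: str) -> list[int]:
    """State q after scanning T[1], T[2], ..., T[n] (nonempty pattern)."""
    pi = compute_prefix_function(pattern)
    m = len(pattern)
    q = 0
    states = []
    for ch in text:
        if q == m:
            q = pi[q]                   # deferred reset after a full match
        while q > 0 and pattern[q] != ch:
            q = pi[q]
        if pattern[q] == ch:
            q += 1
        states.append(q)                # claimed to equal sigma(T[:i])
    return states


text, pattern = "bacbababaabcbab", "ababaca"
expected = [sigma(pattern, text[:i]) for i in range(1, len(text) + 1)]
assert states_after_each_character(text, pattern) == expected
```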
Considering $q$ to be the state number at the start of the for loop iteration, when
KMP-MATCHER considers the character $T[i]$, the longest prefix of $P$ that is a
suffix of $T[:i]$ is either $P[:q+1]$ (if $P[q+1] = T[i]$) or some prefix (not
Exercises
32.4-1
Compute the prefix function $\pi$ for the pattern ababbabbabbababbabb.
32.4-2
Give an upper bound on the size of $\pi^*[q]$ as a function of $q$. Give an example to
show that your bound is tight.