The Knuth-Morris-Pratt Algorithm
String matching algorithm
KMP Algorithm
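• Due to Knuth, Morris, and Pratt.
• Running time is Θ(n+m) for a text of length n and a pattern of length m.
• Uses an auxiliary function π[1…m], precomputed from the pattern in O(m) time.
• Avoids computing the transition function δ of the string-matching automaton.
• For any state q = 0,1,…,m and any character a ∈ Σ, π[q] contains the information needed to compute δ(q,a), yet it does not depend on a.
• π has only m entries, whereas δ has O(m|Σ|) entries.
• So the preprocessing saves a factor of |Σ|.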
PREFIX Function for a Pattern
• Encapsulates knowledge about how the pattern matches against shifts of itself.
• This information lets us avoid testing useless shifts in the naïve algorithm, and avoid precomputing δ for the string-matching automaton.
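• Example: T = b a c b a b a b a b b c b a b and P = a b a b a c a.
• At shift s = 4, q = 5 characters have matched: P[1…5] = a b a b a = T[5…9], and s+q = 9.
• Shift s+1 is therefore invalid: it would align P[1] = a with T[6], which is already known to be b.
• Shift s' = s+2 = 6 may still be valid: it aligns P[1…3] = a b a with T[7…9] = a b a, so k = 3 characters are already known to match.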
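The alignment in this example can be checked mechanically. The following is a minimal Python sketch rather than part of the original slides: the helper name `matched_prefix_len` is hypothetical, and it uses 0-based string offsets, so shift s compares P against T starting at offset s.

```python
def matched_prefix_len(text: str, pattern: str, s: int) -> int:
    """Count how many leading characters of pattern match text at shift s."""
    q = 0
    while (q < len(pattern) and s + q < len(text)
           and pattern[q] == text[s + q]):
        q += 1
    return q


T = "bacbabababbcbab"  # text from the example
P = "ababaca"          # pattern from the example

print(matched_prefix_len(T, P, 4))       # q = 5 characters match at shift s = 4
print(matched_prefix_len(T, P, 5))       # shift s+1 fails on the first character
print(matched_prefix_len(T, P, 6) >= 3)  # at s' = 6 the first k = 3 characters match
```

The point of KMP is that the last two facts are deducible from the pattern alone, without looking at T again.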
Question
• Let P[1…m] be the pattern and T[1…n] the text. Write P_k for the prefix P[1…k], and X ⊐ Y for "X is a suffix of Y".
• Given that P[1…q] matches the text characters T[s+1…s+q], what is the least shift s' > s such that P[1…k] = T[s'+1…s'+k], where s'+k = s+q?
• Such a shift s' is the first shift of interest after s: every shift between s and s' is necessarily invalid.
• Best case: s' = s+q (that is, k = 0), so all of the shifts s+1, s+2, …, s+q-1 are rejected.
• In any case, at the new shift s' we do not need to compare the first k characters of P with the corresponding characters of T; they are already known to match.
• The answer can be precomputed by comparing the pattern against itself.
• Note: since T[s'+1…s'+k] is part of the known portion of the text, it is a suffix of P_q. So the question can be restated as: find the largest k < q such that P_k ⊐ P_q. Then s' = s + (q-k) is the next potentially valid shift.
• It is convenient to store k, the number of characters matching at the new shift s', rather than the shift amount s'-s.
• Prefix function: π : {1,2,…,m} → {0,1,…,m-1} such that
π[q] = max{k : k < q and P_k ⊐ P_q}
• So π[q] is the length of the longest prefix of P that is a proper suffix of P_q.
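To make the definition concrete, here is a minimal Python sketch that evaluates π directly from the definition (deliberately slow, for illustration only) and derives the next shift s' = s + (q - π[q]). The function names `prefix_function_by_definition` and `next_shift` are hypothetical. Because Python lists are 0-based, the sketch stores π[q] at index q-1.

```python
def prefix_function_by_definition(p: str) -> list[int]:
    """pi[q-1] = max{k : k < q and P_k is a suffix of P_q}, by brute force."""
    pi = []
    for q in range(1, len(p) + 1):
        best = 0
        for k in range(q):                # only k < q, i.e. proper suffixes
            if p[:q].endswith(p[:k]):     # is P_k a suffix of P_q?
                best = k                  # k is increasing, so this ends at the max
        pi.append(best)
    return pi


def next_shift(s: int, q: int, pi: list[int]) -> int:
    """Next potentially valid shift after s when q >= 1 characters matched."""
    return s + (q - pi[q - 1])


pi = prefix_function_by_definition("ababaca")
print(pi)                    # [0, 0, 1, 2, 3, 0, 1]
print(next_shift(4, 5, pi))  # 6, with k = pi[4] = 3 characters already matched
```

For the earlier example (s = 4, q = 5) this reproduces s' = 6 and k = 3.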
COMPUTE-PREFIX-FUNCTION(P)
m ← length[P]
π[1] ← 0
k ← 0
for q ← 2 to m
    do while k > 0 and P[k+1] ≠ P[q]
           do k ← π[k]
       if P[k+1] = P[q]
           then k ← k+1
       π[q] ← k
return π
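The pseudocode above translates almost line for line into Python. This is a sketch under one stated assumption: indices are shifted to 0-based, so `pi[q]` here holds the value the slides call π[q+1], and `p[k]` plays the role of P[k+1]. The function name `compute_prefix_function` simply mirrors the pseudocode.

```python
def compute_prefix_function(p: str) -> list[int]:
    """Return pi where pi[q] is the length of the longest proper prefix of
    p[:q+1] that is also a suffix of p[:q+1] (0-based version of the slides' pi)."""
    m = len(p)
    pi = [0] * m
    k = 0                               # length of the currently matched prefix
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]               # fall back to the next shorter candidate
        if p[k] == p[q]:
            k += 1                      # the prefix extends by one character
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```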
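KMP-MATCHER(T, P)
n ← length[T]
m ← length[P]
π ← COMPUTE-PREFIX-FUNCTION(P)
q ← 0    ▹ number of characters matched
for i ← 1 to n    ▹ scan the text from left to right
    do while q > 0 and P[q+1] ≠ T[i]
           do q ← π[q]    ▹ next character does not match
       if P[q+1] = T[i]
           then q ← q+1    ▹ next character matches
       if q = m    ▹ is all of P matched?
           then print "Pattern occurs with shift" i-m
                q ← π[q]    ▹ look for the next match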
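A runnable Python sketch of the matcher follows. It is an illustration, not a definitive implementation, and makes three assumptions: indices are 0-based (so a match ending at index i is reported with shift i - m + 1), shifts are collected in a list instead of printed, and an empty pattern returns no shifts (the pseudocode does not cover that case). The sample text is made up for the demonstration.

```python
def compute_prefix_function(p: str) -> list[int]:
    """0-based prefix function: pi[q] corresponds to the slides' pi[q+1]."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(t: str, p: str) -> list[int]:
    """Return every shift s such that t[s:s+len(p)] == p."""
    n, m = len(t), len(p)
    if m == 0:
        return []                       # assumption: empty pattern is not searched
    pi = compute_prefix_function(p)
    shifts = []
    q = 0                               # number of characters matched so far
    for i in range(n):                  # scan the text from left to right
        while q > 0 and p[q] != t[i]:
            q = pi[q - 1]               # next character does not match
        if p[q] == t[i]:
            q += 1                      # next character matches
        if q == m:                      # all of p matched
            shifts.append(i - m + 1)
            q = pi[q - 1]               # look for the next (possibly overlapping) match
    return shifts


print(kmp_matcher("abababacababaca", "ababaca"))  # [2, 8]
print(kmp_matcher("aaaa", "aa"))                  # [0, 1, 2]
```

The second call shows why q is reset to π[q] rather than 0 after a match: overlapping occurrences would otherwise be missed.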
Time complexity = O(m+n): O(m) to compute π, plus O(n) for the matching loop.
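The O(n) bound on the matching loop comes from an amortized argument: q increases by at most 1 per text character, and each execution of q ← π[q] inside the while loop decreases q by at least 1, so the while-loop body runs at most n times in total. The sketch below instruments the loop to check this on a sample input. The name `count_fallbacks` is hypothetical, indices are 0-based, and the pattern is assumed to be non-empty.

```python
def count_fallbacks(t: str, p: str) -> int:
    """Run the KMP scan over t and count executions of the while-loop body.
    Assumes p is non-empty."""
    # Prefix function, 0-based: pi[q] corresponds to the slides' pi[q+1].
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k

    fallbacks = 0
    q = 0
    for ch in t:
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
            fallbacks += 1              # each fallback strictly decreases q
        if p[q] == ch:
            q += 1                      # at most one increase per text character
        if q == len(p):
            q = pi[q - 1]
    return fallbacks


text = "aaab" * 50
print(count_fallbacks(text, "aaaa") <= len(text))  # True
```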