String Matching Algorithms: Antonio Carzaniga
Antonio Carzaniga
Faculty of Informatics
University of Lugano
Outline
Problem definition
Naïve algorithm
Knuth-Morris-Pratt algorithm
Boyer-Moore algorithm
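Given a text T
◮ T ∈ Σ∗ : finite alphabet Σ
◮ |T| = n: the length of T is n
Given a pattern P
◮ P ∈ Σ∗ : same finite alphabet Σ
◮ |P| = m: the length of P is m
Find every shift s, with 0 ≤ s ≤ n − m, at which P occurs in T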
n = 14
T  a b c a a b a a b a b a c a
m = 3
P  a b a
Result: P occurs in T at shifts s = 4, s = 7, and s = 9
Naïve Algorithm
For each position s in 0 . . . n − m, see if T[s + i] = P[i] for all 1 ≤ i ≤ m
Naive-String-Matching(T , P)
n = length(T)
m = length(P)
for s = 0 to n − m
    if Substring-At(T , P, s)
        output(s)
Substring-At(T , P, s)
for i = 1 to length(P)
    if T[s + i] ≠ P[i]
        return false
return true
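The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: it assumes 0-based string indexing (so `pattern[i]` corresponds to P[i + 1]), and the function names are chosen here for the example.

```python
def substring_at(text, pattern, s):
    """Return True if pattern occurs in text at 0-based shift s."""
    for i in range(len(pattern)):
        if text[s + i] != pattern[i]:
            return False
    return True


def naive_string_matching(text, pattern):
    """Return the list of all shifts s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    # Try every shift s in 0 .. n - m, exactly as in the pseudocode.
    return [s for s in range(n - m + 1) if substring_at(text, pattern, s)]


# The example from the problem definition: n = 14, m = 3.
print(naive_string_matching("abcaabaababaca", "aba"))  # [4, 7, 9]
```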
Worst case: T = a^n, P = a^m
i.e., T = aa · · · a (n times), P = aa · · · a (m times)
Every position s requires m comparisons: Θ((n − m + 1)m)
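The bound can be checked empirically by counting character comparisons. The sketch below is an illustration under the assumption that the naïve algorithm stops comparing at the first mismatch; `count_comparisons` is a hypothetical helper name introduced here.

```python
def count_comparisons(text, pattern):
    """Count the character comparisons made by the naive algorithm."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for i in range(m):
            comparisons += 1
            if text[s + i] != pattern[i]:
                break  # stop at the first mismatch for this shift
    return comparisons


n, m = 14, 3
# With T = a^n and P = a^m no shift ever mismatches,
# so the count is (n - m + 1) * m = 12 * 3.
print(count_comparisons("a" * n, "a" * m))  # 36
```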
Improvement Strategy
Observation
T  a b c a a b a a b a b a c a
   = = ≠
P  a b a
What now?
◮ the naïve algorithm tells us to go back to the second position in
T and to start from the beginning of P
◮ can’t we simply move along through T ?
◮ why?
Wrong-String-Matching(T , P)
n = length(T)
m = length(P)
q = 0    // number of characters matched in P
s = 0    // number of characters of T read so far
while s < n
    s = s + 1
    if T[s] == P[q + 1]
        q = q + 1
        if q == m
            output(s − m)
            q = 0
    else q = 0
T  p a g l i a i o ␣ b a g o r d o
P  a g o
Output: 10
Done. Perfect!
Complexity: Θ(n)
T  a a b a a a b a b a b a c a
P  a a b
output(0), but the occurrence at s = 4 is missed!
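The failure on the "a a b" example can be reproduced with a short Python sketch of Wrong-String-Matching. It assumes that s counts the characters of T read so far (so the character just read is `text[s - 1]` in 0-based indexing); the function name is chosen here for illustration.

```python
def wrong_string_matching(text, pattern):
    """Single left-to-right pass that resets q to 0 on every mismatch.

    Deliberately incorrect: it never re-examines text characters,
    so it can miss occurrences.
    """
    m = len(pattern)
    q = 0  # number of pattern characters matched so far
    shifts = []
    for s in range(1, len(text) + 1):  # s = characters of text read so far
        if text[s - 1] == pattern[q]:
            q += 1
            if q == m:
                shifts.append(s - m)
                q = 0
        else:
            q = 0
    return shifts


print(wrong_string_matching("pagliaio bagordo", "ago"))  # [10]
print(wrong_string_matching("aabaaabababaca", "aab"))    # [0]
```

In the second call the occurrence at shift 4 is lost: after reading "a a" at positions 4 and 5, the third "a" causes a mismatch and q is reset to 0, although the last two characters read still form a valid partial match.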
T  a b a b a b a c b a c b c a
P      a b a b a c
output(2)
New Strategy
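P[1 . . . q] is the prefix of P matched so far
Let π be the length of the longest proper prefix of P[1 . . . q] that is also a suffix of P[1 . . . q]
P  a b a b a c
With q = 5 (a b a b a matched), π = 3 (a b a)
Restart from q = π
Iterate as usual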
Example
P a b a b a c
π 0 0 1 2 3 0
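The table can be reproduced directly from the standard definition of the prefix function: π[q] is the length of the longest proper prefix of P[1 . . . q] that is also a suffix of P[1 . . . q]. The sketch below is a deliberately slow brute-force check of that definition, not an efficient algorithm; the function name is introduced here for illustration.

```python
def prefix_function_by_definition(pattern):
    """Compute pi by brute force; the result list holds pi[1], ..., pi[m]."""
    pi = []
    for q in range(1, len(pattern) + 1):
        matched = pattern[:q]  # P[1 .. q]
        best = 0
        for k in range(1, q):  # proper prefixes only: k < q
            if matched[:k] == matched[q - k:]:
                best = k
        pi.append(best)
    return pi


print(prefix_function_by_definition("ababac"))  # [0, 0, 1, 2, 3, 0]
```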
KMP-String-Matching(T , P)
n = length(T)
m = length(P)
π = Prefix-Function(P)
q = 0    // number of characters matched
for i = 1 to n    // scan the text left-to-right
    while q > 0 and P[q + 1] ≠ T[i]
        q = π[q]    // no match: go back using π
    if P[q + 1] == T[i]
        q = q + 1
    if q == m
        output(i − m)
        q = π[q]    // go back for the next match
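A possible Python rendering of KMP-String-Matching is sketched below. It assumes 0-based indexing, so `pi[q - 1]` plays the role of π[q] and a match ending at index i is reported at shift i − m + 1. A small prefix-function helper is included only to keep the example self-contained; the empty-pattern guard is an addition made here and is not part of the pseudocode.

```python
def prefix_function(pattern):
    """pi[q - 1] = length of the longest proper prefix of pattern[:q]
    that is also a suffix of pattern[:q]."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_string_matching(text, pattern):
    """Return all 0-based shifts at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern reports no matches
    pi = prefix_function(pattern)
    q = 0  # number of characters matched
    shifts = []
    for i, c in enumerate(text):  # scan the text left-to-right
        while q > 0 and pattern[q] != c:
            q = pi[q - 1]  # no match: go back using pi
        if pattern[q] == c:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = pi[q - 1]  # go back for the next match
    return shifts


print(kmp_string_matching("aabaaabababaca", "aab"))     # [0, 4]
print(kmp_string_matching("abababacbacbca", "ababac"))  # [2]
```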
Prefix-Function(P)
m = length(P)
π[1] = 0
k = 0
for q = 2 to m
    while k > 0 and P[k + 1] ≠ P[q]
        k = π[k]
    if P[k + 1] == P[q]
        k = k + 1
    π[q] = k
return π
Prefix-Function at Work
Prefix-Function(P)
P  a b a b a c
π  0 0 1 2 3 0
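To follow Prefix-Function step by step, the sketch below records, for each q, the values k falls back to in the while loop and the final value stored in π[q]. It keeps the 1-based indexing of the pseudocode by padding the pattern with a dummy first character; the trace format and the function name are choices made here for illustration.

```python
def prefix_function_trace(pattern):
    """Run Prefix-Function with 1-based indices and record each step.

    Returns (pi, trace), where trace holds one tuple
    (q, fallback values of k, final k) per loop iteration.
    """
    m = len(pattern)
    p = " " + pattern   # p[1..m] mirrors P[1..m]; p[0] is unused padding
    pi = [0] * (m + 1)  # pi[0] is unused
    k = 0
    trace = []
    for q in range(2, m + 1):
        fallbacks = []
        while k > 0 and p[k + 1] != p[q]:
            k = pi[k]
            fallbacks.append(k)
        if p[k + 1] == p[q]:
            k += 1
        pi[q] = k
        trace.append((q, fallbacks, k))
    return pi[1:], trace


pi, trace = prefix_function_trace("ababac")
print(pi)  # [0, 0, 1, 2, 3, 0]
for step in trace:
    print(step)
# The last step is (6, [1, 0], 0): at q = 6, k falls back 3 -> 1 -> 0.
```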
Can we do better?
Comments on KMP
Knuth-Morris-Pratt is Ω(n): it reads every character of T at least once
T  h e r e   i s   a   s i m p l e   e x a m p l e
P  e x a m p l e
[Figure: successive alignments of P against T]
Comments on Boyer-Moore
Like KMP, Boyer-Moore includes a pre-processing phase
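As a rough illustration of what such a pre-processing phase can look like, the sketch below uses only the bad-character heuristic commonly associated with Boyer-Moore. This is a simplification assumed here, not the full algorithm and not necessarily the presentation used in these slides; all names are hypothetical.

```python
def last_occurrence(pattern):
    """Pre-processing: rightmost 0-based index of each character in pattern."""
    return {c: i for i, c in enumerate(pattern)}


def bad_character_matching(text, pattern):
    """Right-to-left comparison with bad-character shifts only (a sketch)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = last_occurrence(pattern)
    shifts = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1  # compare from the end of the pattern backwards
        if j < 0:
            shifts.append(s)
            s += 1  # conservative shift after a full match
        else:
            # Align the rightmost occurrence of the mismatched text
            # character with position s + j; always advance by at least 1.
            s += max(1, j - last.get(text[s + j], -1))
    return shifts


print(bad_character_matching("here is a simple example", "example"))  # [17]
```

Unlike KMP, this scan can skip text characters entirely: when the mismatched character does not occur in the pattern at all, the pattern jumps past it in one step.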