CS 240 Tutorial 11 Notes: C A A B A
Pattern Matching/String Search: Find the location(s) of a short string (the pattern) in a longer string
(the text).
Example: Find the pattern P = needle in the text T = haystackneedlehaystack.
Answer: The pattern occurs at position 8.
Examples: For simplicity, return only the first position. (If necessary, repeat on remaining portion.)
search(a, abc) = 0
search(c, abcabc) = 2
search(d, abc) = FAIL
An elegant, real-world problem; solved by e.g., word processors, strstr in C, strpos in PHP, etc.
Brute force: Check every possible location in T to see if P is there.
Example: Search for P = abc in T = abdacabcd.
Try each possible location 0, 1, . . . in T, and see if abc starts there. Once a mismatch occurs, move on to
the next location.
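(✓ marks a matching comparison; the last unmarked character in a row is the mismatch.)
pos   a  b  d  a  c  a  b  c  d
0     a✓ b✓ c
1        a
2           a
3              a✓ b
4                 a
5                    a✓ b✓ c✓   (match found at position 5)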
Loop over two indices: the position in the text (T-index) and the position within the pattern where the current comparison is made (P-index).
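The two-index loop can be sketched as follows. This is a minimal illustration in Python, not code from the tutorial; the function name `brute_force_search` and the use of `None` for FAIL are assumptions made here.

```python
def brute_force_search(pattern, text):
    """Return the first position where pattern occurs in text, or None (FAIL)."""
    n, m = len(text), len(pattern)
    for pos in range(n - m + 1):      # T-index: candidate start position
        j = 0                         # P-index: next pattern character to compare
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:                    # every pattern character matched
            return pos
    return None                       # no position worked


print(brute_force_search("abc", "abdacabcd"))   # 5
print(brute_force_search("d", "abc"))           # None
```

The outer loop stops at position n - m because the pattern cannot fit any later in the text.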
Analysis: Say |T| = n and |P| = m.
There are n - m + 1 positions to check, and each position can have up to m character comparisons.
Total # of comparisons: O((n - m + 1)m) = O(nm) = O(n^2).
Behaviour this bad really can occur, e.g., with T = a^n and P = a^(m-1)b every possible comparison is made.
When m = n/2 this is Θ(n^2) comparisons.
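To see the worst case concretely, the sketch below counts character comparisons made by a brute-force search on T = a^n, P = a^(m-1)b. The helper name `count_comparisons` and the choice n = 20 are illustrative assumptions.

```python
def count_comparisons(pattern, text):
    """Brute-force search that also reports how many character comparisons it made."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for pos in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break                     # mismatch: move on to the next position
        else:
            return pos, comparisons       # inner loop finished without a mismatch
    return None, comparisons


n = 20
m = n // 2
text = "a" * n
pattern = "a" * (m - 1) + "b"
print(count_comparisons(pattern, text))   # (None, 110), i.e. (n - m + 1) * m comparisons
```

Every one of the n - m + 1 positions matches m - 1 characters before failing on the final b, so the count is exactly (n - m + 1)m.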
Knuth-Morris-Pratt: More efficient string search. Don't check positions where it's impossible to find P.
When a mismatch occurs, we can often rule out possible locations for P by looking at the last characters
seen.
Example: P = abcd, T = abcabcd
(✓ marks a matching comparison; an unmarked last character is a mismatch.)
pos   a  b  c  a  b  c  d
0     a✓ b✓ c✓ d
3              a✓ b✓ c✓ d✓
We can rule out positions 1 and 2 because P[0] = a does not appear in P[1..2], so it also doesn't appear in
the text T[1..2] which was just matched to P[1..2].
Example: P = abcabc . . . , T = abcabd . . .
On the first mismatch, consider the possible shifts of P before the mismatch:
(✓ marks agreement with the text; an unmarked last character disagrees with the text.)
shift  a  b  c  a  b  d ...
0      a✓ b✓ c✓ a✓ b✓ c
1         a
2            a
3               a✓ b✓        <- valid shift
4                  a
Key observation: The valid shifts are when a prefix of P is equal to a suffix of P truncated at the
mismatch.
Before we even see T we can calculate the possible shifts for each truncation of P ! This pre-processing
of P allows efficient string search.
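The key observation can be checked directly from the pattern alone. The sketch below lists, for a mismatch at P-index j, every shift that is consistent with the j characters already matched. The function name `valid_shifts` is hypothetical, and it also reports the trivial shift s = j (moving P entirely past the matched characters), which is always valid.

```python
def valid_shifts(pattern, j):
    """Shifts s (1 <= s <= j) consistent with the matched characters P[0..j-1].

    A shift s is valid when the prefix of P of length j - s equals the
    suffix of P[0..j-1] of the same length.
    """
    matched = pattern[:j]
    return [s for s in range(1, j + 1) if matched[s:] == pattern[:j - s]]


print(valid_shifts("abcabc", 5))   # [3, 5]
print(valid_shifts("abcd", 3))     # [3]
```

The smallest valid shift is the one KMP uses, since skipping it could miss an occurrence of P.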
Actually, we only need to store the minimal valid shift for each truncation of P. (Equivalently, the maximal
valid prefix length.) This is known as the KMP failure array F:
F[i] := length of the longest prefix of P that is a proper suffix of P[0..i]
      = length of the longest prefix of P that is a suffix of P[1..i]
Note "proper" suffix: otherwise the longest prefix would always be P[0..i] itself, leading to a shift of 0.
Example: Give the KMP failure array F for P = ababac.
Answer:
i    P[0..i]   F[i]
0    a         0
1    ab        0
2    aba       1
3    abab      2
4    ababa     3
5    ababac    0
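The failure array in the example above can be reproduced in code. The sketch below shows one common linear-time way to fill F, plus a slow version written straight from the definition as a cross-check. Both function names are assumptions, and the tutorial itself does not say which construction it has in mind.

```python
def failure_array(pattern):
    """F[i] = length of the longest prefix of pattern that is a proper suffix of pattern[0..i]."""
    m = len(pattern)
    F = [0] * m
    k = 0                                  # length of the prefix currently matched
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = F[k - 1]                   # fall back to the next-shorter candidate prefix
        if pattern[i] == pattern[k]:
            k += 1
        F[i] = k
    return F


def failure_array_naive(pattern):
    """Same array computed directly from the definition (slow; for checking only)."""
    F = []
    for i in range(len(pattern)):
        sub = pattern[:i + 1]
        # k < len(sub) enforces "proper"; k = 0 always qualifies (empty prefix).
        F.append(max(k for k in range(len(sub)) if sub[:k] == sub[len(sub) - k:]))
    return F


print(failure_array("ababac"))         # [0, 0, 1, 2, 3, 0]
print(failure_array_naive("ababac"))   # [0, 0, 1, 2, 3, 0]
```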
Once F is computed (can be done in O(m) time), KMP is really just brute-force except that on a
mismatch at P-index i you shift the T-index by i - F[i - 1] (or 1 if i = 0) and set the P-index to
F[i - 1] (or 0 if i = 0).
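The shift rule above translates almost line for line into code. The following is a sketch under the assumption that `pos` denotes the start of the current alignment (the T-index) and `i` the P-index. The names `kmp_search` and `failure_array` are illustrative, and `None` stands in for FAIL.

```python
def failure_array(pattern):
    """F[i] = length of the longest prefix of pattern that is a proper suffix of pattern[0..i]."""
    F = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = F[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        F[i] = k
    return F


def kmp_search(pattern, text):
    """Return the first position of pattern in text, or None (FAIL)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0                          # assumption: the empty pattern matches at 0
    F = failure_array(pattern)
    pos, i = 0, 0                         # T-index (alignment start) and P-index
    while pos + m <= n:
        if text[pos + i] == pattern[i]:
            i += 1
            if i == m:
                return pos                # all m characters matched
        elif i == 0:
            pos += 1                      # mismatch on the first character: shift by 1
        else:
            pos += i - F[i - 1]           # shift by i - F[i - 1]
            i = F[i - 1]                  # keep the F[i - 1] characters known to match
    return None


print(kmp_search("ababac", "abcaabaabababac"))   # 9
print(kmp_search("abcd", "abcabcd"))             # 3
```

Note that the text character at the mismatch is compared again after a shift with i > 0, which matches the brute-force-with-shifts description rather than an optimized variant.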
A naive analysis says there are O(n) changes of T -index and for each there are O(m) changes of P -index,
the same as brute-force. But it actually performs better, as we shall see.
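Example: Run KMP with P = ababac and T = abcaabaabababac.
(✓ marks a matching comparison; a row's last unmarked character is the mismatch; characters in
parentheses are already known to match and are not compared again. i is the P-index at the mismatch
and i_new is the P-index after the shift.)
i   i_new        a  b  c  a  a  b  a  a  b  a  b  a  b  a  c
2   F[1] = 0     a✓ b✓ a
0   0                  a
1   F[0] = 0              a✓ b
3   F[2] = 1                 a✓ b✓ a✓ b
1   F[0] = 0                      (a) b
5   F[4] = 3                          a✓ b✓ a✓ b✓ a✓ c
5   (match)                                (a)(b)(a) b✓ a✓ c✓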
Notice the staircase pattern that KMP generates. In general, the number of comparisons made will be
O(staircase width + staircase height).
But the width and the height of the staircase are both O(n), so the KMP search costs O(n) (plus O(m) to compute F).
Note on the last row of the trace: if you wanted to continue the search after the match, i would be 6 and i_new would be F[5] = 0.
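The staircase bound can be checked empirically by counting comparisons. The sketch below instruments a KMP search (same shift rule as described earlier) and confirms that the count stays within 2n on the example and on the brute-force worst case. The name `kmp_count` and the specific bound 2n checked here are assumptions for illustration.

```python
def failure_array(pattern):
    """F[i] = length of the longest prefix of pattern that is a proper suffix of pattern[0..i]."""
    F = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = F[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        F[i] = k
    return F


def kmp_count(pattern, text):
    """Return (first match position or None, number of character comparisons)."""
    n, m = len(text), len(pattern)
    F = failure_array(pattern)
    pos = i = comparisons = 0
    while m > 0 and pos + m <= n:
        comparisons += 1
        if text[pos + i] == pattern[i]:
            i += 1                        # a match moves one step right along the staircase
            if i == m:
                return pos, comparisons
        elif i == 0:
            pos += 1                      # a mismatch moves one step down (new row)
        else:
            pos += i - F[i - 1]
            i = F[i - 1]
    return None, comparisons


print(kmp_count("ababac", "abcaabaabababac"))   # (9, 20)
print(kmp_count("a" * 9 + "b", "a" * 20))       # (None, 30)
```

Each comparison either advances the text position being examined (a match) or advances the alignment start (a mismatch), and both quantities are bounded by n, which is where the 2n limit comes from.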