
CS 240 Tutorial 11 Notes: C A A B A

This document discusses string pattern matching algorithms. It begins by describing the brute force algorithm, which has O(nm) runtime where n and m are the lengths of the text and pattern strings. It then introduces the Knuth-Morris-Pratt (KMP) algorithm, which improves efficiency by not checking positions where a match is impossible based on previous characters. KMP preprocesses the pattern to calculate a failure array F showing the largest valid shift if a mismatch occurs. During matching, the text index shifts by the amount in F, avoiding invalid positions. This results in an O(n) runtime where n is the text length.

Uploaded by

DavidKnight
Copyright: © All Rights Reserved
CS 240 Tutorial 11 Notes

Pattern Matching/String Search: Find the location(s) of a short string (the pattern) in a longer string
(the text).
Example: Find the pattern P = needle in the text T = haystackneedlehaystack.
Answer: The pattern occurs at position 8.
Examples: For simplicity, return only the first position. (If necessary, repeat on remaining portion.)
search(a, abc) = 0
search(c, abcabc) = 2
search(d, abc) = FAIL
An elegant, real-world problem; solved by e.g., word processors, strstr in C, strpos in PHP, etc.
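As a point of reference, the search(P, T) examples above can be reproduced with Python's built-in str.find. This is only a sketch: the wrapper name `search` and the `FAIL` sentinel are assumptions chosen to mirror the notation in these notes.

```python
FAIL = None  # assumed sentinel standing in for the notes' FAIL


def search(pattern, text):
    """Return the first index where pattern occurs in text, or FAIL."""
    index = text.find(pattern)  # str.find returns -1 when there is no match
    return FAIL if index == -1 else index


print(search("a", "abc"))                          # 0
print(search("c", "abcabc"))                       # 2
print(search("d", "abc"))                          # None
print(search("needle", "haystackneedlehaystack"))  # 8
```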
Brute force: Check every possible location in T to see if P is there.
Example: Search for P = abc in T = abdacabcd.
Try each possible location 0, 1, ... in T, and see if abc starts there. Once a mismatch occurs, move on to
the next location.
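Position   a  b  d  a  c  a  b  c  d
0          a✓ b✓ c
1             a
2                a
3                   a✓ b
4                      a
5                         a✓ b✓ c✓   (match found)
(✓ marks a successful character comparison; an unmarked character is where the mismatch occurred.)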
Loop over two indices: the position (or T-index) and the location of the comparison within the pattern (or P-index).
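The two-index loop can be sketched as follows. The function name `brute_force_search` and the use of None for FAIL are assumptions made for illustration.

```python
def brute_force_search(pattern, text):
    """Return the first position of pattern in text, or None (FAIL)."""
    n, m = len(text), len(pattern)
    for pos in range(n - m + 1):   # T-index: candidate start position
        i = 0                      # P-index: next character of P to compare
        while i < m and text[pos + i] == pattern[i]:
            i += 1
        if i == m:                 # every character of P matched
            return pos
    return None


print(brute_force_search("abc", "abdacabcd"))  # 5
```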
Analysis: Say |T| = n and |P| = m.
There are n − m + 1 positions to check, and each position can have up to m character comparisons.
Total # of comparisons: O((n − m + 1)m) = O(nm) = O(n^2).
Behaviour this bad really can occur, e.g., with T = a^n and P = a^(m−1) b every possible comparison is made.
When m = n/2 this is Θ(n^2) comparisons.
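To check the worst case numerically, here is a small sketch that counts character comparisons made by brute force on T = a^n and P = a^(m−1) b. The function name `count_comparisons` and the choice n = 20 are assumptions for illustration only.

```python
def count_comparisons(pattern, text):
    """Count character comparisons made by brute force up to the first match."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for pos in range(n - m + 1):
        i = 0
        while i < m:
            comparisons += 1
            if text[pos + i] != pattern[i]:
                break              # mismatch: move on to the next position
            i += 1
        if i == m:
            break                  # first match found; brute force stops here
    return comparisons


n = 20
m = n // 2
text = "a" * n
pattern = "a" * (m - 1) + "b"
# No match exists, so every position costs exactly m comparisons.
print(count_comparisons(pattern, text), (n - m + 1) * m)  # 110 110
```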
Knuth-Morris-Pratt: More efficient string search. Don't check positions where it's impossible to find P.
When a mismatch occurs, we can often rule out possible locations for P by looking at the last characters
seen.
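Example: P = abcd, T = abcabcd
Position   a  b  c  a  b  c  d
0          a✓ b✓ c✓ d
3                   a✓ b✓ c✓ d✓
(✓ marks a successful character comparison; an unmarked character is where the mismatch occurred.)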

We can rule out positions 1 and 2 because P[0] = a does not appear in P[1..2], so it also doesn't appear in
the text T[1..2], which was just matched to P[1..2].
Example: P = abcabc..., T = abcabd...
On the first mismatch, consider the possible shifts of P before the mismatch:
Shift   a  b  c  a  b  d  ...
0       a✓ b✓ c✓ a✓ b✓ c
1          a  b  c  a
2             a  b  c
3                a✓ b✓        <- valid shift
4                   a
Here abcab is P truncated at the mismatch. (✓ marks characters that agree with the text above them.)

Key observation: The valid shifts are when a prefix of P is equal to a suffix of P truncated at the
mismatch.
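The key observation can be checked directly. The sketch below lists every shift s for which the prefix of the truncated pattern lines up with its suffix; the function name `valid_shifts` is an assumption. A shift equal to the full truncated length always qualifies (the empty prefix), and corresponds to restarting P at the mismatch position.

```python
def valid_shifts(truncated):
    """Shifts s >= 1 where the length-(k - s) prefix of truncated equals its suffix."""
    k = len(truncated)
    return [s for s in range(1, k + 1) if truncated[s:] == truncated[:k - s]]


# P = abcabc... truncated at the mismatch is abcab.
print(valid_shifts("abcab"))  # [3, 5]
```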
Before we even see T we can calculate the possible shifts for each truncation of P ! This pre-processing
of P allows efficient string search.
Actually, we only need to store the minimal valid shift for each truncation of P. (Equivalently, the maximal
valid prefix length.) This is known as the KMP failure array F:
F[i] := length of the largest prefix of P that is a proper suffix of P[0..i]
      = length of the largest prefix of P that is a suffix of P[1..i]
Note "proper" suffix: otherwise the largest prefix would always be P[0..i] itself, leading to a shift of 0.
Example: Give the KMP failure array F for P = ababac.
Answer:
i    P[0..i]   F[i]
0    a         0
1    ab        0
2    aba       1
3    abab      2
4    ababa     3
5    ababac    0
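The failure array for ababac can be double-checked by applying the definition literally. This is a deliberately naive sketch (far slower than O(m)), and the function name `failure_array_naive` is an assumption.

```python
def failure_array_naive(pattern):
    """F[i] = length of the longest prefix of pattern that is a proper suffix of pattern[0..i]."""
    F = []
    for i in range(len(pattern)):
        truncated = pattern[: i + 1]
        # Try candidate lengths from the longest proper one (i) down to 0.
        for length in range(i, -1, -1):
            if truncated.endswith(pattern[:length]):
                F.append(length)
                break
    return F


print(failure_array_naive("ababac"))  # [0, 0, 1, 2, 3, 0]
```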

Once F is computed (can be done in O(m) time), KMP is really just brute-force except that on a
mismatch at P-index i you shift the T-index by i − F[i − 1] (or 1 if i = 0) and set the P-index to
F[i − 1] (or 0 if i = 0).
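A minimal sketch of this procedure follows. The failure array is built with the usual linear-time method, which these notes mention but do not spell out, so treat that part as an assumption; the names `failure_array` and `kmp_search` are also hypothetical.

```python
def failure_array(pattern):
    """Compute F in O(m) time using previously computed entries."""
    F = [0] * len(pattern)
    k = 0                              # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = F[k - 1]               # fall back to the next shorter candidate
        if pattern[i] == pattern[k]:
            k += 1
        F[i] = k
    return F


def kmp_search(pattern, text):
    """Return the first position of pattern in text, or None (FAIL)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    F = failure_array(pattern)
    pos, i = 0, 0                      # T-index (start of alignment) and P-index
    while pos + m <= n:
        if text[pos + i] == pattern[i]:
            i += 1
            if i == m:
                return pos
        elif i == 0:
            pos += 1                   # mismatch at P-index 0: shift by 1
        else:
            pos += i - F[i - 1]        # shift the T-index by i - F[i-1]
            i = F[i - 1]               # resume at P-index F[i-1]
    return None


print(kmp_search("ababac", "abcaabaabababac"))  # 9
```

Keeping the alignment start `pos` and the P-index `i` as separate variables mirrors the wording of the shift rule; many textbook versions instead track the text index pos + i directly.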
A naive analysis says there are O(n) changes of T-index and for each there are O(m) changes of P-index,
the same as brute-force. But it actually performs better, as we shall see.
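Example: Run KMP with P = ababac and T = abcaabaabababac.
Each row is one alignment of P; i is the P-index where the row stops and i_new is the P-index after the shift.
i   i_new       a  b  c  a  a  b  a  a  b  a  b  a  b  a  c
2   F[1] = 0    a✓ b✓ a
0   0                 a
1   F[0] = 0             a✓ b
3   F[2] = 1                a✓ b✓ a✓ b
1   F[0] = 0                         b
5   F[4] = 3                         a✓ b✓ a✓ b✓ a✓ c
5   (match)                                         b✓ a✓ c✓
(✓ marks a successful character comparison; an unmarked character is where the mismatch occurred. Characters
of P already known to match after a shift are not compared again and are not shown.)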

Notice the staircase pattern that KMP generates. In general, the number of comparisons made will be
O(staircase height + staircase width).
But the height and width of the staircase are both O(n), so KMP costs O(n).

If you wanted to continue the search after the match, i would be 6 and i_new would be F[5] = 0.
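The staircase bound can be checked by counting comparisons. In the sketch below, every comparison either extends the match (the text index pos + i grows) or shifts the pattern (pos grows), so the count is at most 2n. The function name `kmp_comparisons` is an assumption, and the failure array is again built with the usual linear-time method.

```python
def kmp_comparisons(pattern, text):
    """Run KMP up to the first match; return (position or None, comparison count)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0
    # Failure array via the usual linear-time construction.
    F = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = F[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        F[j] = k

    pos, i, comparisons = 0, 0, 0
    while pos + m <= n:
        comparisons += 1
        if text[pos + i] == pattern[i]:
            i += 1                     # step right along the staircase
            if i == m:
                return pos, comparisons
        elif i == 0:
            pos += 1                   # shift by 1
        else:
            pos += i - F[i - 1]        # shift by i - F[i-1]
            i = F[i - 1]
    return None, comparisons


text = "abcaabaabababac"
position, count = kmp_comparisons("ababac", text)
print(position, count, 2 * len(text))  # 9 20 30
```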
