
12 StringMatching

This document discusses string matching algorithms. It introduces the brute force algorithm and its O(MN) complexity. It then describes the Knuth-Morris-Pratt (KMP) algorithm which improves efficiency by preprocessing the pattern string to determine how to match more efficiently. The KMP algorithm runs in O(N+M) time where N and M are the lengths of the text and pattern strings, respectively. It provides examples of building the KMP table and tracing the algorithm.

Uploaded by

huy

Fast String Comparison

String Matching

Reference: lecture 15-211, Fundamental Data Structures and Algorithms, CMU
In this lecture
• String Matching Problem
– Concept
– brute force algorithm
– complexity
• Knuth-Morris-Pratt (KMP) Algorithm
– Pre-processing
– complexity

Pattern Matching
Algorithms
String Matching
• Text string T[0..N-1]
T = “abacaabaccabacabaabb”
• Pattern string P[0..M-1]
P = “abacab”
• Where is the first instance of P in T?
T[10..15] = P[0..5]
• Typically N >> M
Why String Matching?
• Applications in Computational Biology
– DNA sequence is a long word (or text) over a 4-letter alphabet
– GTTTGAGTGGTCAGTCTTTTCGTTTCGACGGAGCCCCCAATTAATAAACTCATAAGCAGACCTCAGTTCGCTTAGAGCAGCCGAAA…
– Find a specific pattern W
• Finding patterns in documents formed using a large alphabet
– Word processing
– Web searching
– Desktop search (Google, MSN)
• Matching strings of bytes containing
– Graphical data
– Machine code
• grep in Unix/Linux
– grep searches for lines matching a pattern.
Naïve Algorithm
(or Brute Force)
• Assume |T| = n and |P| = m
• Slide the pattern P along the text T, one position at a time.
[Figure: Text T with Pattern P aligned at successive shifts]
• Compare until a match is found. If so, return the index where the match occurs; else return -1.
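The naive procedure described above can be sketched in Python as follows. This is a minimal illustration, not code from the lecture; the function name `brute_force_search` is an assumption made for this example.

```python
def brute_force_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):              # try every alignment of P against T
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                          # extend the match one character
        if j == m:                          # all m characters matched
            return i
    return -1                               # no alignment matched


print(brute_force_search("abacaabaccabacabaabb", "abacab"))  # prints 10
```

The printed index agrees with the slide's example, where T[10..15] = P[0..5].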
String Matching
• The brute force algorithm
• 22+6=28 comparisons.
abacaabaccabacabaabb
abacab
 abacab
  abacab
   abacab
    abacab
     abacab
      abacab
       abacab
        abacab
         abacab
          abacab
A bad case
• T = 00000000000000001, P = 00001
• 60+5 = 65 comparisons are needed
• How many of them could be avoided?
00000000000000001
0000-
 0000-
  0000-
   0000-
    0000-
     0000-
      0000-
       0000-
        0000-
         0000-
          0000-
           0000-
            00001
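The comparison counts quoted on the two slides above (28 and 65) can be checked with an instrumented brute-force search. This is a sketch under the assumption that every test of a text character against a pattern character counts as one comparison; the name `count_brute_force_comparisons` is hypothetical.

```python
def count_brute_force_comparisons(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1                # one T[i+j] vs P[j] comparison
            if text[i + j] != pattern[j]:
                break                       # mismatch: give up on this alignment
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


print(count_brute_force_comparisons("abacaabaccabacabaabb", "abacab"))  # (10, 28)
print(count_brute_force_comparisons("0" * 16 + "1", "00001"))           # (12, 65)
```

In the bad case, each of the 12 failed alignments costs 5 comparisons (4 matches and 1 mismatch), which is where the 60 comes from.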
String Matching
• Brute force worst case
– O(MN)
– Expensive for long patterns in repetitive text
• How to improve on this?
• Intuition:
– Remember what is learned from previous matches
Knuth Morris Pratt
(KMP)
Algorithm
KMP – The Big Idea
• Retain information from prior attempts.
• Compute in advance how far to jump in P when a match fails.
– Suppose the match fails at P[j] ≠ T[i+j].
– Then we know P[0 .. j-1] = T[i .. i+j-1].
• We must next try P[0] =? T[i+1].
– But we know T[i+1] = P[1]
– What if we compare P[1] =? P[0] instead?
• If they are equal, increment j by 1. No need to look at T.
– What if P[1] = P[0] and P[2] = P[1]?
• Then increment j by 2. Again, no need to look at T.
• In general, we can determine how far to jump without any knowledge of T!
Implementing KMP
• Never decrement i, ever.
– Comparing T[i] with P[j].
• Compute a table f of how far to jump j forward when a match fails.
– The next match will compare T[i] with P[ f[j-1] ]
• Do this by matching P against itself in all positions.
Building the Table for f
• P = 1010011
• Find self-overlaps
Prefix     Overlap   j   f
1          .         1   0
10         .         2   0
101        1         3   1
1010       10        4   2
10100      .         5   0
101001     1         6   1
1010011    1         7   1
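The table above can be reproduced directly from the definition of a self-overlap: for each prefix of P, find the longest proper prefix that is also a suffix. The sketch below does this by brute force (it is deliberately not the linear-time method), and the name `overlap_table` is an assumption for illustration. Entry `j-1` of the returned list corresponds to row j of the table.

```python
def overlap_table(pattern: str) -> list:
    """For j = 1..len(pattern), compute the length of the longest proper
    prefix of pattern[:j] that is also a suffix of pattern[:j]."""
    table = []
    for j in range(1, len(pattern) + 1):
        prefix = pattern[:j]
        best = 0
        for k in range(1, j):               # proper overlaps only: k < j
            if prefix[:k] == prefix[j - k:]:
                best = k                    # keep the longest overlap seen
        table.append(best)
    return table


print(overlap_table("1010011"))  # [0, 0, 1, 2, 0, 1, 1]
```

Note that only the pattern is inspected; the text T plays no part in building the table.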
What f means
Prefix     Overlap   j   f
1          .         1   0
10         .         2   0
101        1         3   1
1010       10        4   2
10100      .         5   0
101001     1         6   1
1010011    1         7   1
• f non-zero implies there is a self-match.
– E.g., f=2 means P[0..1] = P[j-2..j-1]
• Hence must start new comparison at j-2, since we know T[i-2..i-1] = P[0..1]
• In general:
– Set j=f[j-1]
– Do not change i.
– The next match is T[i] ? P[f[j-1]]
• If f is zero, there is no self-match.
– Set j=0
– Do not change i.
– The next match is T[i] ? P[0]
Favorable conditions
• P = 1234567
• Find self-overlaps
Prefix     Overlap   j   f
1          .         1   0
12         .         2   0
123        .         3   0
1234       .         4   0
12345      .         5   0
123456     .         6   0
1234567    .         7   0
Mixed conditions
• P = 1231234
• Find self-overlaps
Prefix     Overlap   j   f
1          .         1   0
12         .         2   0
123        .         3   0
1231       1         4   1
12312      12        5   2
123123     123       6   3
1231234    .         7   0
Poor conditions
• P = 1111110
• Find self-overlaps
Prefix     Overlap   j   f
1          .         1   0
11         1         2   1
111        11        3   2
1111       111       4   3
11111      1111      5   4
111111     11111     6   5
1111110    .         7   0
KMP pre-process
Algorithm
m = |P|;
Define a table F of size m
F[0] = 0;
i = 1; j = 0;
while (i < m) {
    // compare P[i] and P[j]
    if (P[j] == P[i]) {
        F[i] = j+1;
        i++; j++;
    }
    else if (j > 0) j = F[j-1];    // use previous values of F
    else { F[i] = 0; i++; }
}
KMP Algorithm
input: Text T and Pattern P
|T| = n
|P| = m
Compute Table F for Pattern P
i = j = 0
while (i < n) {
    if (P[j] == T[i]) {
        if (j == m-1) return i-m+1;    // full match ending at T[i]
        i++; j++;
    }
    else if (j > 0) j = F[j-1];        // use F to determine next value for j
    else i++;
}
return -1;                             // no occurrence of P in T

output: index of the first occurrence of P in T, or -1 if there is none
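The two pseudocode listings above translate almost line for line into Python. The following is a minimal sketch rather than a reference implementation. The names `build_failure_table` and `kmp_search`, and the handling of an empty pattern, are assumptions added here.

```python
def build_failure_table(pattern: str) -> list:
    """F[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the pre-processing step)."""
    m = len(pattern)
    table = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            table[i] = j + 1                # extend the current overlap
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]                # fall back using earlier entries
        else:
            table[i] = 0                    # no overlap ends at position i
            i += 1
    return table


def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0                            # assumption: empty pattern matches at 0
    table = build_failure_table(pattern)
    i = j = 0
    while i < n:
        if pattern[j] == text[i]:
            if j == m - 1:
                return i - m + 1            # full match ends at text[i]
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]                # shift the pattern; i never decreases
        else:
            i += 1
    return -1


print(build_failure_table("1010011"))                  # [0, 0, 1, 2, 0, 1, 1]
print(kmp_search("abacaabaccabacabaabb", "abacab"))    # prints 10
```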


Brute Force vs. KMP
• A worst-case example: T = 0000000000000000000000000001 (27 zeros, then 1), P = 00000000000001 (13 zeros, then 1)
[Figure: the pattern aligned at 15 successive shifts; brute force compares 14 characters at each shift, KMP only 2 after the first]
• Brute force: 14 failed alignments of 14 comparisons each, plus 14 for the match: 196 + 14 = 210 comparisons
• KMP: 14 comparisons for the first alignment, then 2 for each of the 14 later ones: 28 + 14 = 42 comparisons
KMP Performance
• Pre-processing needs O(M) operations.
• At each iteration, one of three cases:
– T[i] = P[j]
• i increases
– T[i] ≠ P[j] and j > 0
• i-j increases
– T[i] ≠ P[j] and j = 0
• i increases and i-j increases
• Neither i nor i-j ever decreases, and both are at most N. Hence, maximum of 2N iterations.
• Thus worst case performance is O(N+M).
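The iteration bound can be checked empirically by counting loop iterations in the search phase. Each iteration performs exactly one comparison of T[i] with P[j]. The sketch below assumes a non-empty pattern and uses the hypothetical name `kmp_iterations`. It is run on the worst-case example from the brute force versus KMP comparison (27 zeros followed by 1, searched for 13 zeros followed by 1).

```python
def kmp_iterations(text: str, pattern: str):
    """Return (match index or -1, number of search-loop iterations).
    Assumes pattern is non-empty."""
    n, m = len(text), len(pattern)

    # Pre-processing: build the failure table F in O(m).
    table = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            table[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]
        else:
            i += 1

    # Search: every iteration increases i or i - j (or both).
    i = j = 0
    iterations = 0
    while i < n:
        iterations += 1                     # one T[i] vs P[j] comparison
        if pattern[j] == text[i]:
            if j == m - 1:
                return i - m + 1, iterations
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]
        else:
            i += 1
    return -1, iterations


text, pattern = "0" * 27 + "1", "0" * 13 + "1"
index, iterations = kmp_iterations(text, pattern)
print(index, iterations)                    # prints 14 42
assert iterations <= 2 * len(text)          # the 2N bound from the slide
```

The 42 iterations match the comparison count given for KMP on this example, well under the bound of 2N = 56.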
Exercises
• E1
– Construct the KMP table for P = 10010001
– Trace the KMP algorithm with T = 000100100100010111
• E2
– Construct the KMP table for pattern P = ababaca
– Trace the KMP algorithm with T = bacbababaabcbab
