12 StringMatching

This document discusses string matching algorithms. It introduces the brute force algorithm and its O(MN) complexity. It then describes the Knuth-Morris-Pratt (KMP) algorithm which improves efficiency by preprocessing the pattern string to determine how to match more efficiently. The KMP algorithm runs in O(N+M) time where N and M are the lengths of the text and pattern strings, respectively. It provides examples of building the KMP table and tracing the algorithm.

Uploaded by

huy
Copyright: © All Rights Reserved

Fast String Comparison

String Matching

Based on the lecture notes for 15-211 Fundamental Data Structures and Algorithms, CMU
In this lecture
• String Matching Problem
– Concept
– brute force algorithm
– complexity
• Knuth-Morris-Pratt (KMP) Algorithm
– Pre-processing
– complexity
Pattern Matching
Algorithms
String Matching
• Text string T[0..N-1]
T = “abacaabaccabacabaabb”
• Pattern string P[0..M-1]
P = “abacab”
• Where is the first instance of P in T?
T[10..15] = P[0..5]
• Typically N >> M
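The claim T[10..15] = P[0..5] can be checked with a short sketch. It uses Python's built-in `str.find` purely as a reference answer; the slides do not say which language or library the course uses, so that choice is an assumption.

```python
# Text and pattern taken from the slide.
T = "abacaabaccabacabaabb"
P = "abacab"

# str.find returns the index of the first occurrence, or -1 if there is none.
first = T.find(P)
print(first)                          # 10
print(T[first:first + len(P)] == P)   # True: T[10..15] equals P[0..5]
```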
Why String Matching?
• Applications in Computational Biology
– DNA sequence is a long word (or text) over a 4-letter alphabet
– GTTTGAGTGGTCAGTCTTTTCGTTTCGACGGAGCCCCCAATTAATAAACTCATAAGCAGACCTCAGTTCGCTTAGAGCAGCCGAAA…
– Find a specific pattern W
• Finding patterns in documents formed using a large alphabet
– Word processing
– Web searching
– Desktop search (Google, MSN)
• Matching strings of bytes containing
– Graphical data
– Machine code
• grep in Unix/Linux
– grep searches for lines matching a pattern.
Naïve Algorithm (or Brute Force)
• Assume |T| = n and |P| = m
• Slide the pattern P along the text T, one position at a time.
• Compare until a match is found. If so, return the index where the match occurs; else return -1.
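A minimal Python sketch of the brute-force procedure described above. The function name `brute_force_search` is an illustrative choice, not something taken from the lecture.

```python
def brute_force_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):          # try every alignment of P against T
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:                          # all m characters matched
            return start
    return -1                               # no alignment matched


print(brute_force_search("abacaabaccabacabaabb", "abacab"))  # 10
```

Each alignment restarts the comparison from P[0], which is the behaviour the following slides count comparisons for.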
String Matching
• The brute force algorithm
• 22+6=28 comparisons.
abacaabaccabacabaabb
abacab
 abacab
  abacab
   abacab
    abacab
     abacab
      abacab
       abacab
        abacab
         abacab
          abacab
A bad case
• 60+5 = 65 comparisons are needed
• How many of them could be avoided?
00000000000000001
0000-
 0000-
  0000-
   0000-
    0000-
     0000-
      0000-
       0000-
        0000-
         0000-
          0000-
           0000-
            00001
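The two comparison counts on these slides (28 and 65) can be reproduced with an instrumented brute-force search. This is a sketch; the helper name `brute_force_comparisons` is hypothetical, and the bad-case text is assumed to be 16 zeros followed by a 1, as printed on the slide.

```python
def brute_force_comparisons(text: str, pattern: str) -> tuple[int, int]:
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1                 # one character comparison
            if text[start + j] != pattern[j]:
                break                        # mismatch: move to next alignment
            j += 1
        if j == m:
            return start, comparisons
    return -1, comparisons


print(brute_force_comparisons("abacaabaccabacabaabb", "abacab"))  # (10, 28)
print(brute_force_comparisons("0" * 16 + "1", "00001"))           # (12, 65)
```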
String Matching
• Brute force worst case
– O(MN)
– Expensive for long patterns in repetitive text
• How to improve on this?
• Intuition:
– Remember what is learned from previous matches
Knuth-Morris-Pratt (KMP) Algorithm
KMP – The Big Idea
• Retain information from prior attempts.
• Compute in advance how far to jump in P when a match fails.
– Suppose the match fails at P[j] ≠ T[i+j].
– Then we know P[0 .. j-1] = T[i .. i+j-1].
• Brute force would next compare P[0] with T[i+1].
– But we know T[i+1] = P[1].
– So compare P[1] with P[0] instead.
• If they are equal, P[0] = T[i+1] is already known: increment j by 1. No need to look at T.
• If they differ, the alignment at i+1 cannot match and is skipped.
– What if P[1] = P[0] and P[2] = P[1]?
• Then increment j by 2. Again, no need to look at T.
• In general, we can determine how far to jump without any knowledge of T!
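To illustrate that the jump depends on P alone, the sketch below computes, for a mismatch at P[j], the smallest shift of the pattern that still agrees with the j characters already matched. The function name `smallest_safe_shift` is hypothetical, and this direct definition is deliberately naive; it is not the efficient table construction given later in the lecture.

```python
def smallest_safe_shift(pattern: str, j: int) -> int:
    """Assume pattern[0..j-1] matched the text and pattern[j] failed (j >= 1).

    Return the smallest shift s >= 1 such that the shifted pattern still
    agrees with the matched characters, i.e. pattern[0..j-s-1] equals
    pattern[s..j-1]. Only the pattern is inspected, never the text.
    """
    if j < 1:
        raise ValueError("j must be at least 1")
    s = 1
    # The loop always stops by s == j, where both slices are empty.
    while pattern[:j - s] != pattern[s:j]:
        s += 1
    return s


# After matching "abaca" of P = "abacab", the pattern can jump 4 positions,
# leaving 1 character ("a") already known to match.
print(smallest_safe_shift("abacab", 5))   # 4
```

The number of characters that remain matched after the shift is j - s, which is the quantity the table f stores.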
Implementing KMP
• Never decrement i, ever.
– Compare T[i] with P[j].
• Compute a table f that gives the new value of j when a match fails.
– The next match will compare T[i] with P[f[j-1]]
• Do this by matching P against itself in all positions.
Building the Table for f
• P = 1010011
• Find self-overlaps
Prefix   Overlap  j  f
1        .        1  0
10       .        2  0
101      1        3  1
1010     10       4  2
10100    .        5  0
101001   1        6  1
1010011  1        7  1
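The f column can be reproduced directly from its definition: for each prefix of length j, f is the length of the longest proper prefix that is also a suffix of that prefix. The sketch below does this naively (roughly cubic time); `overlap_table` is an illustrative name, and this is not the linear-time construction the lecture presents later.

```python
def overlap_table(pattern: str) -> list[int]:
    """f value for every prefix length j = 1..len(pattern), by definition."""
    table = []
    for j in range(1, len(pattern) + 1):
        prefix = pattern[:j]
        k = j - 1                            # longest candidate overlap (proper)
        # Shrink k until the first k characters equal the last k characters.
        while k > 0 and prefix[:k] != prefix[j - k:]:
            k -= 1
        table.append(k)
    return table


print(overlap_table("1010011"))   # [0, 0, 1, 2, 0, 1, 1]
```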
What f means
• f non-zero implies there is a self-match.
  E.g., f=2 means P[0..1] = P[j-2..j-1]
• Hence must start new comparison at j-2, since we know T[i-2..i-1] = P[0..1]
• If f is zero, there is no self-match.
– Set j=0
– Do not change i.
• The next match compares T[i] with P[0]
• In general:
– Set j=f[j-1]
– Do not change i.
• The next match compares T[i] with P[f[j-1]]
Favorable conditions
• P = 1234567
• Find self-overlaps
Prefix   Overlap  j  f
1        .        1  0
12       .        2  0
123      .        3  0
1234     .        4  0
12345    .        5  0
123456   .        6  0
1234567  .        7  0
Mixed conditions
• P = 1231234
• Find self-overlaps
Prefix   Overlap  j  f
1        .        1  0
12       .        2  0
123      .        3  0
1231     1        4  1
12312    12       5  2
123123   123      6  3
1231234  .        7  0
Poor conditions
• P = 1111110
• Find self-overlaps
Prefix   Overlap  j  f
1        .        1  0
11       1        2  1
111      11       3  2
1111     111      4  3
11111    1111     5  4
111111   11111    6  5
1111110  .        7  0
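The three tables above (favorable, mixed, poor) can be regenerated row by row in the slide's own layout. This is a sketch using the naive definition of overlap; `overlap_rows` is a hypothetical helper name, and "." stands for "no overlap" as on the slides.

```python
def overlap_rows(pattern: str) -> list[tuple[str, str, int, int]]:
    """Return rows (prefix, overlap, j, f) for every prefix of the pattern."""
    rows = []
    for j in range(1, len(pattern) + 1):
        prefix = pattern[:j]
        k = j - 1
        # Longest proper prefix of `prefix` that is also its suffix.
        while k > 0 and prefix[:k] != prefix[j - k:]:
            k -= 1
        rows.append((prefix, prefix[:k] or ".", j, k))
    return rows


for label, p in [("favorable", "1234567"), ("mixed", "1231234"), ("poor", "1111110")]:
    print(label, [f for (_, _, _, f) in overlap_rows(p)])
# favorable [0, 0, 0, 0, 0, 0, 0]
# mixed [0, 0, 0, 1, 2, 3, 0]
# poor [0, 1, 2, 3, 4, 5, 0]
```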
KMP pre-process Algorithm
m = |P|;
Define a table F of size m
F[0] = 0;
i = 1; j = 0;
while (i < m) {
    // compare P[i] and P[j]
    if (P[j] == P[i]) {
        F[i] = j+1;
        i++; j++;
    }
    else if (j > 0) j = F[j-1];    // use previous values of F
    else { F[i] = 0; i++; }
}
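The pre-processing pseudocode translates almost line for line into Python. The sketch below is one such port, with `build_table` as an assumed name; it returns F as a 0-indexed list, so F[i] is the f value for the prefix of length i+1.

```python
def build_table(pattern: str) -> list[int]:
    """KMP pre-processing: F[i] = overlap length for the prefix pattern[0..i]."""
    m = len(pattern)
    table = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            table[i] = j + 1        # the current overlap grows by one character
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]        # fall back using previously computed values
        else:
            table[i] = 0            # no overlap ends at position i
            i += 1
    return table


print(build_table("1010011"))   # [0, 0, 1, 2, 0, 1, 1]
```

The fallback `j = table[j - 1]` is the same move the search itself makes on a mismatch, applied to the pattern matched against itself.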
KMP Algorithm
input: Text T and Pattern P
|T| = n
|P| = m
Compute Table F for Pattern P
i = j = 0
while (i < n) {
    if (P[j] == T[i]) {
        if (j == m-1) return i-m+1;    // full match found
        i++; j++;
    }
    else if (j > 0) j = F[j-1];        // use F to determine next value for j
    else i++;
}
return -1;                             // P does not occur in T

output: first occurrence of P in T, or -1 if there is none
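A runnable Python sketch of the search loop, bundled with the table construction so the block is self-contained. The names `build_table` and `kmp_search` are illustrative, and returning 0 for an empty pattern is an added assumption that the slides do not address.

```python
def build_table(pattern: str) -> list[int]:
    """KMP pre-processing table, as in the pre-process pseudocode."""
    table = [0] * len(pattern)
    i, j = 1, 0
    while i < len(pattern):
        if pattern[i] == pattern[j]:
            table[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]
        else:
            i += 1                  # table[i] is already 0
    return table


def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0                    # assumption: empty pattern matches at 0
    table = build_table(pattern)
    i = j = 0
    while i < n:
        if pattern[j] == text[i]:
            if j == m - 1:
                return i - m + 1    # matched the whole pattern
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]        # keep i, reuse the known overlap
        else:
            i += 1                  # nothing matched, advance in the text
    return -1


print(kmp_search("abacaabaccabacabaabb", "abacab"))   # 10
```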


Brute Force vs. KMP
• T = 0000000000000000000000000001 (27 zeros, then 1)
• P = 00000000000001 (13 zeros, then 1)
• Brute force: each of the first 14 alignments matches 13 zeros and then fails (0000000000000-); the 15th alignment matches.
– 196 + 14 = 210 comparisons
• KMP: the first alignment matches 13 zeros and fails (0000000000000-); each later alignment needs only 2 comparisons (0-) until the last one succeeds (01).
– 28 + 14 = 42 comparisons
• This is a worst-case example for brute force.
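The two totals can be reproduced by counting character comparisons in both algorithms. This sketch assumes the text is 27 zeros followed by a 1 and the pattern is 13 zeros followed by a 1, which is what the stated counts imply; the function names are hypothetical, and the KMP count covers only the search loop, not the table construction.

```python
def build_table(pattern: str) -> list[int]:
    table = [0] * len(pattern)
    i, j = 1, 0
    while i < len(pattern):
        if pattern[i] == pattern[j]:
            table[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]
        else:
            i += 1
    return table


def brute_force_comparisons(text: str, pattern: str) -> int:
    """Character comparisons made by brute force up to the first match."""
    n, m = len(text), len(pattern)
    count = 0
    for start in range(n - m + 1):
        j = 0
        while j < m:
            count += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == m:
            break                   # stop at the first full match
    return count


def kmp_comparisons(text: str, pattern: str) -> int:
    """Character comparisons made by the KMP search loop (table excluded)."""
    n, m = len(text), len(pattern)
    table = build_table(pattern)
    count = 0
    i = j = 0
    while i < n:
        count += 1                  # exactly one comparison per iteration
        if pattern[j] == text[i]:
            if j == m - 1:
                break
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]
        else:
            i += 1
    return count


text = "0" * 27 + "1"
pattern = "0" * 13 + "1"
print(brute_force_comparisons(text, pattern))   # 210
print(kmp_comparisons(text, pattern))           # 42
```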
KMP Performance
• Pre-processing needs O(M) operations.
• At each iteration, one of three cases:
– T[i] = P[j]
• i increases
– T[i] ≠ P[j] and j>0
• i-j increases
– T[i] ≠ P[j] and j=0
• i increases and i-j increases
• Neither i nor i-j ever decreases, and both are at most N.
• Hence, maximum of 2N iterations.
• Thus worst case performance is O(N+M).
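The counting argument can be checked empirically. The sketch below runs the search loop while asserting that neither i nor i-j decreases and that at least one of them grows in every iteration, then reports the iteration count for comparison with 2N. The name `kmp_iterations` is hypothetical, and a non-empty pattern is assumed.

```python
def build_table(pattern: str) -> list[int]:
    table = [0] * len(pattern)
    i, j = 1, 0
    while i < len(pattern):
        if pattern[i] == pattern[j]:
            table[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]
        else:
            i += 1
    return table


def kmp_iterations(text: str, pattern: str) -> tuple[int, int]:
    """Return (match index or -1, loop iterations); pattern must be non-empty."""
    n, m = len(text), len(pattern)
    table = build_table(pattern)
    iterations = 0
    i = j = 0
    while i < n:
        iterations += 1
        old_i, old_shift = i, i - j
        if pattern[j] == text[i]:
            if j == m - 1:
                return i - m + 1, iterations
            i += 1                  # case 1: i increases
            j += 1
        elif j > 0:
            j = table[j - 1]        # case 2: i - j increases
        else:
            i += 1                  # case 3: both increase
        # Invariant from the slide: no decrease, and some progress each time.
        assert i >= old_i and i - j >= old_shift
        assert (i, i - j) != (old_i, old_shift)
    return -1, iterations


text = "0" * 27 + "1"
index, iterations = kmp_iterations(text, "0" * 13 + "1")
print(index, iterations, iterations <= 2 * len(text))   # 14 42 True
```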
Exercises
• E1
– Construct the KMP table for P = 10010001
– Trace the KMP algorithm with T = 000100100100010111
• E2
– Construct the KMP table for pattern P = ababaca
– Trace the KMP algorithm with T = bacbababaabcbab