Lecture 18 - String Matching-KMP


15-211

Fundamental Data Structures and Algorithms

String Matching

March 28, 2006


Ananda Gunawardena
In this lecture
• String Matching Problem
– Concept
– Regular expressions
– brute force algorithm
– complexity
• Finite State Machines
• Knuth-Morris-Pratt(KMP) Algorithm
– Pre-processing
– complexity
Pattern Matching
Algorithms
The Problem
• Given a text T and a pattern P, check whether P occurs in T
– e.g., T = aabbcbbcabbbcbccccabbabbccc
– Find all occurrences of pattern P = bbc
• There are variations of pattern matching
– Finding “approximate” matchings
– Finding multiple patterns, etc.
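As a quick way to see what "find all occurrences" means for the example above, here is a minimal sketch that leans on the standard `String.indexOf(String, int)` method instead of a hand-written matcher. The class and method names (`FindAllOccurrences`, `findAll`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FindAllOccurrences {
    // Returns the start index of every (possibly overlapping) occurrence of p in t.
    static List<Integer> findAll(String t, String p) {
        List<Integer> hits = new ArrayList<>();
        // indexOf(p, from) returns -1 when there is no occurrence at or after 'from'.
        for (int at = t.indexOf(p); at >= 0; at = t.indexOf(p, at + 1)) {
            hits.add(at);
        }
        return hits;
    }

    public static void main(String[] args) {
        String t = "aabbcbbcabbbcbccccabbabbccc"; // text T from the slide
        String p = "bbc";                         // pattern P from the slide
        System.out.println(findAll(t, p));        // prints [2, 5, 10, 22]
    }
}
```

Restarting the search at `at + 1` rather than `at + p.length()` is a choice: it also reports overlapping occurrences.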
Why String Matching?
• Applications in Computational Biology
– DNA sequence is a long word (or text) over a 4-letter alphabet
– GTTTGAGTGGTCAGTCTTTTCGTTTCGACGGAGCCCCCAATT
AATAAACTCATAAGCAGACCTCAGTTCGCTTAGAGCAGCCG
AAA…..
– Find a Specific pattern W
• Finding patterns in documents formed using a large alphabet
– Word processing
– Web searching
– Desktop search (Google, MSN)
• Matching strings of bytes containing
– Graphical data
– Machine code
• grep in unix
– grep searches for lines matching a pattern.
String Matching
• Text string T[0..N-1]
T = “abacaabaccabacabaabb”
• Pattern string P[0..M-1]
P = “abacab”
• Where is the first instance of P in T?
T[10..15] = P[0..5]
• Typically N >>> M
Java Pattern Matching Utilities
• Java provides an API for pattern matching
with regular expressions
– java.util.regex
• Regular expressions describe a set of strings based on some common characteristics shared by each string in the set.
– e.g., a* = {ε, a, aa, aaa, …}, where ε is the empty string
• Regular expressions can be used as a tool
to search, edit or manipulate text or data
– perl, java, C#
Java Pattern Matching Utilities
• java.util.regex
– Pattern
• Is a compiled representation of a regular expression.
• Eg: Pattern p = Pattern.compile("a*b");
– Matcher
• A machine that performs match operations on a character sequence by
interpreting a pattern.
• Eg: Matcher m = p.matcher("aabbb");
• Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    Pattern p = Pattern.compile("(aa|bb)*");
    Matcher m = p.matcher("aabbb");
    boolean b = m.matches(); // true only if the entire input sequence matches the pattern
    // or: boolean b = m.find(); // true if some subsequence of the input matches the pattern
    System.out.println("The value is " + b);
}
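The difference between `matches()` and `find()` is easy to miss, so here is a small sketch that runs both on the input used above. `Pattern`, `Matcher`, `matches()`, `find()` and `group()` are the real `java.util.regex` API; the wrapper class `RegexDemo` and its two helper methods are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDemo {
    // True only when the whole input belongs to the language of the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // First substring matched by find(), or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("(aa|bb)*", "aabbb")); // false: the last b is left over
        System.out.println(firstMatch("(aa|bb)*", "aabbb"));   // aabb
        System.out.println(matchesWhole("a*b", "aab"));        // true
    }
}
```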
String Matching
• The brute force algorithm
• 22+6=28 comparisons.
abacaabaccabacabaabb
abacab
 abacab
  abacab
   abacab
    abacab
     abacab
      abacab
       abacab
        abacab
         abacab
          abacab
Naïve Algorithm
(or Brute Force)
• Assume |T| = n and |P| = m
Text T
Pattern P
Pattern P
Pattern P

Compare until a match is found. If so, return the index where the match occurs; else return -1.
Brute Force Version 1
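static int match(char[] T, char[] P){
    // try every alignment i at which P still fits inside T
    for (int i=0; i+P.length<=T.length; i++){
        boolean flag = true;
        for (int j=0; j<P.length; j++)
            if (T[i+j]!=P[j])
                {flag=false; break;}
        if (flag) return i;   // P matches T starting at index i
    }
    return -1;                // no match anywhere in T
}
• What is the complexity of the code?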
Brute Force, Version 2
static int match(char[] T, char[] P){
int n = T.length;
int m = P.length;
int i = 0;
int j = 0;
// rewrite the brute-force code with only one loop

do {

// Homework

} while (j<m && i<n);

}
• What is the complexity of your code?
A bad case
• 60+5 = 65 comparisons are needed
• How many of them could be avoided?
00000000000000001
0000-
 0000-
  0000-
   0000-
    0000-
     0000-
      0000-
       0000-
        0000-
         0000-
          0000-
           0000-
            00001
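The comparison counts quoted on these slides can be checked mechanically. The sketch below is an instrumented brute-force matcher that counts character comparisons up to and including the first match; the class and method names are invented for this illustration and the counting convention (every character test counts once) is an assumption that happens to reproduce the slide totals.

```java
class BruteForceCount {
    // Number of character comparisons brute force makes before it
    // finds the first occurrence of p in t (or runs out of alignments).
    static int comparisons(String t, String p) {
        int count = 0;
        for (int i = 0; i + p.length() <= t.length(); i++) {
            int j = 0;
            while (j < p.length()) {
                count++;                                    // one comparison of T[i+j] with P[j]
                if (t.charAt(i + j) != p.charAt(j)) break;  // mismatch: slide P one step right
                j++;
            }
            if (j == p.length()) return count;              // full match found
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(comparisons("abacaabaccabacabaabb", "abacab"));     // 28
        System.out.println(comparisons("00000000000000001", "00001"));         // 65
        System.out.println(comparisons("This is a sample sentence", "sente")); // 25
    }
}
```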
Typical text matching
• 20+5=25 comparisons are needed
(The match is near the same point in the target string as the previous example.)
• In practice, 0 ≤ j ≤ 2
This is a sample sentence
-
 -
  -
   s-
    -
     -
      s-
       -
        -
         -
          s-
           -
            -
             -
              -
               -
                -
                 sente
String Matching
• Brute force worst case
– O(MN)
– Expensive for long patterns in repetitive text
• How to improve on this?
• Intuition:
– Remember what is learned from previous
matches
Finite State Machines
Finite State Machines (FSM)
• FSM is a computing machine that takes
– A string as an input
– Outputs YES/NO answer
• That is, the machine “accepts” or “rejects” the
string

Input String → FSM → Yes / No
FSM Model

• Input to a FSM
– Strings built from a fixed alphabet {a,b,c}
– Possible inputs: aa, aabbcc, a etc..
• The Machine
– A directed graph
• Nodes = States of the machine
• Edges = Transition from one state to another
FSM Model
• Special States
– Start (q0) and Final (or Accepting) (q2)
• Assume the alphabet is {a,b}
– Which strings are accepted by this FSM?
FSM Model
• Exercise: draw a finite automaton that
accepts any string with “even” number of
1’s
• Exercise: draw a finite automaton that
accepts any string with “even” number of
consecutive 1’s followed by “odd” number
of consecutive zeros
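For the first exercise, one possible machine has just two states that track the parity of the 1's read so far. The sketch below simulates that machine in code rather than drawing it; it assumes the alphabet is {0,1} and rejects any other symbol, and the class name `EvenOnes` is illustrative.

```java
class EvenOnes {
    // State 0 = even number of 1's seen so far (start state, accepting).
    // State 1 = odd number of 1's seen so far.
    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c == '1') {
                state = 1 - state;   // a 1 flips the parity
            } else if (c != '0') {
                return false;        // symbol outside the assumed alphabet {0,1}
            }                        // a 0 leaves the state unchanged
        }
        return state == 0;
    }

    public static void main(String[] args) {
        System.out.println(accepts("0110"));  // true  (two 1's)
        System.out.println(accepts("10101")); // false (three 1's)
        System.out.println(accepts(""));      // true  (zero 1's)
    }
}
```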
Why Study FSM’s
• Useful Algorithm Design Technique
– Lexical Analysis (“tokenization”)
– Control Systems
• Elevators, Soda Machines….
• Modeling a problem with FSM is
– Simple
– Elegant
State Transitions
• Let Q be the set of states and ∑ be the alphabet. Then
the transition function T is given by
– T: Q × ∑ → Q
• ∑ could be
– {0,1} – binary
– {C,G,T,A} – nucleotide base
– {0,1,2,..,9,a,b,c,d,e,f} – hexadecimal
– etc..
• Eg: Consider ∑ ={a,b,c} and P=aabc
– set of states are all prefixes of P
– Q = {ε, a, aa, aab, aabc} (ε is the empty prefix) or
– Q = {0 1 2 3 4 }
• State transitions T(0,’a’) = 1; T(1, ‘a’) = 2, etc…
• What about failure transitions?
Failure Transitions
• Where do we go when a failure occurs?
• P=“aabc”
• Q – current state
• Q’ – next state
• initial state = 0
• end state = 4
• How to store state transition table?
– as a matrix

Q   ∑       Q’
0   a       1
0   {b,c}   0
1   a       2
1   {b,c}   0
2   b       3
2   a       2
2   c       0
3   c       4
3   a       1
3   b       0
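To make "store the transition table as a matrix" concrete, here is a sketch that builds the matrix for any pattern directly from the rule behind the table: the next state is the length of the longest prefix of P that is a suffix of what has been read. This is a simple, deliberately unoptimized construction, not necessarily the one intended in the lecture; the class `PatternAutomaton` and its methods are hypothetical names, and the search assumes every text character is in the given alphabet.

```java
import java.util.Arrays;

class PatternAutomaton {
    // delta[q][k] is the next state from state q on symbol alphabet.charAt(k).
    // State q means: the last q characters read equal P[0..q-1].
    static int[][] build(String p, String alphabet) {
        int m = p.length();
        int[][] delta = new int[m + 1][alphabet.length()];
        for (int q = 0; q <= m; q++) {
            for (int k = 0; k < alphabet.length(); k++) {
                String seen = p.substring(0, q) + alphabet.charAt(k);
                // Longest prefix of p that is a suffix of 'seen' (length 0 always qualifies).
                int next = Math.min(m, seen.length());
                while (!seen.endsWith(p.substring(0, next))) next--;
                delta[q][k] = next;
            }
        }
        return delta;
    }

    // Index of the first occurrence of p in t, or -1.
    // Assumption: every character of t appears in 'alphabet'.
    static int search(String t, String p, String alphabet) {
        if (p.isEmpty()) return 0;
        int[][] delta = build(p, alphabet);
        int state = 0;
        for (int i = 0; i < t.length(); i++) {
            state = delta[state][alphabet.indexOf(t.charAt(i))];
            if (state == p.length()) return i - p.length() + 1; // reached the end state
        }
        return -1;
    }

    public static void main(String[] args) {
        // Rows are states 0..4, columns are a, b, c.
        System.out.println(Arrays.deepToString(build("aabc", "abc")));
        System.out.println(search("abaabcaabc", "aabc", "abc")); // 2
    }
}
```

Rows 0 to 3 of the printed matrix agree with the table above; row 4 (transitions out of the end state) is an addition needed only if matching continues after the first hit.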
Using FSM concept in
Pattern Matching
• Consider the alphabet {a,b,c}
• Suppose we are looking for pattern “aabc”
• Construct a finite automaton for “aabc” as follows
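[Figure: state diagram for “aabc”. Forward edges: Start → 0 –a→ 1 –a→ 2 –b→ 3 –c→ 4. Back edges labeled a, b, c and b|c give the failure transitions.]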
Knuth Morris Pratt
(KMP)
Algorithm
KMP – The Big Idea
• Retain information from prior attempts.
• Compute in advance how far to jump in P when a match
fails.
– Suppose the match fails at P[j] ≠ T[i+j].
– Then we know P[0 .. j-1] = T[i .. i+j-1].
• We must next try P[0] ? T[i+1].
– But we know T[i+1]=P[1]
– What if we compare: P[1] ? P[0]
• If they are equal, then P[0] = T[i+1] is already known: start the new attempt at j = 1. No need to look at T.
– What if P[1]=P[0] and P[2]=P[1]?
• Then start the new attempt at j = 2. Again, no need to look at T.
• In general, we can determine how far to jump without any
knowledge of T!
Implementing KMP
• Never decrement i, ever.
– Comparing T[i] with P[j].
• Compute a table f of how far to jump j forward when a match fails.
– The next match will compare T[i] with P[f[j-1]]
• Do this by matching P against itself in all positions.
Building the Table for f
• P = 1010011
• Find self-overlaps
Prefix Overlap j f
1 . 1 0
10 . 2 0
101 1 3 1
1010 10 4 2
10100 . 5 0
101001 1 6 1
1010011 1 7 1
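The "Overlap" column can be reproduced straight from its definition: for each prefix, the longest proper prefix that is also a suffix. The sketch below does exactly that, one prefix at a time. It is a definitional (quadratic or worse) version meant for checking tables by hand, not the linear-time preprocessing; `SelfOverlap`, `overlap` and `table` are illustrative names.

```java
import java.util.Arrays;

class SelfOverlap {
    // Longest proper prefix of s that is also a suffix of s (the "Overlap" column).
    static String overlap(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            if (s.endsWith(s.substring(0, len))) return s.substring(0, len);
        }
        return ""; // no self-overlap
    }

    // f values for prefix lengths j = 1..|P|, stored at index j-1.
    static int[] table(String p) {
        int[] f = new int[p.length()];
        for (int j = 1; j <= p.length(); j++) {
            f[j - 1] = overlap(p.substring(0, j)).length();
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(table("1010011"))); // [0, 0, 1, 2, 0, 1, 1]
    }
}
```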
What f means
Prefix Overlap j f
1 . 1 0
10 . 2 0
101 1 3 1
1010 10 4 2
10100 . 5 0
101001 1 6 1
1010011 1 7 1
• f non-zero implies there is a self-match.
E.g., f=2 means P[0..1] = P[j-2..j-1]
• Hence must start new comparison at j-2, since we know T[i-2..i-1] = P[0..1]
In general:
– Set j=f[j-1]
– Do not change i.
• The next match is T[i] ? P[f[j-1]]
• If f is zero, there is no self-match.
– Set j=0
– Do not change i.
• The next match is T[i] ? P[0]
Favorable conditions
• P = 1234567
• Find self-overlaps
Prefix Overlap j f
1 . 1 0
12 . 2 0
123 . 3 0
1234 . 4 0
12345 . 5 0
123456 . 6 0
1234567 . 7 0
Mixed conditions
• P = 1231234
• Find self-overlaps
Prefix Overlap j f
1 . 1 0
12 . 2 0
123 . 3 0
1231 1 4 1
12312 12 5 2
123123 123 6 3
1231234 . 7 0
Poor conditions
• P = 1111110
• Find self-overlaps
Prefix Overlap j f
1 . 1 0
11 1 2 1
111 11 3 2
1111 111 4 3
11111 1111 5 4
111111 11111 6 5
1111110 . 7 0
KMP pre-process
Algorithm
m = |P|;
Define a table F of size m
F[0] = 0;
i = 1; j = 0;
while(i<m) {
    compare P[i] and P[j];
    if(P[j]==P[i])
        { F[i] = j+1;
          i++; j++; }
    else if (j>0) j=F[j-1];   // use previous values of F
    else {F[i] = 0; i++;}
}
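The pseudocode above translates almost line for line into Java. The sketch below is one such translation; the class name `KmpTable` and method name `failure` are illustrative. Note the indexing: `F[i]` here is the f value for the prefix of length i+1, so it lines up with the earlier tables as F[j-1] = f for row j.

```java
import java.util.Arrays;

class KmpTable {
    // F[i] = length of the longest proper prefix of P[0..i] that is also a suffix of it.
    static int[] failure(char[] P) {
        int m = P.length;
        int[] F = new int[m];          // F[0] = 0 by default
        int i = 1, j = 0;
        while (i < m) {
            if (P[j] == P[i]) {        // overlap of length j extends to length j+1
                F[i] = j + 1;
                i++; j++;
            } else if (j > 0) {
                j = F[j - 1];          // fall back to the next shorter overlap
            } else {
                F[i] = 0;              // no overlap ends at position i
                i++;
            }
        }
        return F;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(failure("1010011".toCharArray()))); // [0, 0, 1, 2, 0, 1, 1]
        System.out.println(Arrays.toString(failure("1111110".toCharArray()))); // [0, 1, 2, 3, 4, 5, 0]
    }
}
```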
KMP Algorithm
input: Text T and Pattern P
|T| = n
|P| = m
Compute Table F for Pattern P
i=j=0
while(i<n) {
    if(P[j]==T[i])
        { if (j==m-1) return i-m+1;
          i++; j++; }
    else if (j>0) j=F[j-1];   // use F to determine next value for j
    else i++;
}
return -1;   // P does not occur in T
output: first occurrence of P in T, or -1 if there is none
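Putting the table and the matching loop together gives a complete matcher. The following is a sketch in Java that mirrors the pseudocode; the class name `Kmp`, the `String` based signatures, and the choice to return 0 for an empty pattern are assumptions made for this example.

```java
class Kmp {
    // Failure table: F[i] = longest proper prefix of p[0..i] that is also its suffix.
    static int[] failure(String p) {
        int[] F = new int[p.length()];
        int i = 1, j = 0;
        while (i < p.length()) {
            if (p.charAt(j) == p.charAt(i)) { F[i] = j + 1; i++; j++; }
            else if (j > 0) j = F[j - 1];
            else { F[i] = 0; i++; }
        }
        return F;
    }

    // Index of the first occurrence of p in t, or -1 if there is none.
    static int search(String t, String p) {
        int n = t.length(), m = p.length();
        if (m == 0) return 0;                      // assumption: empty pattern matches at 0
        int[] F = failure(p);
        int i = 0, j = 0;                          // i never decreases
        while (i < n) {
            if (p.charAt(j) == t.charAt(i)) {
                if (j == m - 1) return i - m + 1;  // matched all of p
                i++; j++;
            } else if (j > 0) {
                j = F[j - 1];                      // reuse what is known about t[i-j..i-1]
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("abacaabaccabacabaabb", "abacab")); // 10
    }
}
```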
Specializing the matcher
Prefix Overlap j f
1 . 1 0
10 . 2 0
101 1 3 1
1010 10 4 2
10100 . 5 0
101001 1 6 1
1010011 1 7 1
[Figure: the matcher for P = 1010011 drawn as a state diagram, with edges labeled 0 and 1.]
Brute Force vs. KMP
• A worst-case example
Brute Force: 196 + 14 = 210 comparisons
0000000000000000000000000001
0000000000000-
 0000000000000-
  0000000000000-
   0000000000000-
    0000000000000-
     0000000000000-
      0000000000000-
       0000000000000-
        0000000000000-
         0000000000000-
          0000000000000-
           0000000000000-
            0000000000000-
             0000000000000-
              00000000000001
KMP: 28+14 = 42 comparisons
0000000000000000000000000001
0000000000000-
             0-
              0-
               0-
                0-
                 0-
                  0-
                   0-
                    0-
                     0-
                      0-
                       0-
                        0-
                         0-
                          01
Brute Force vs. KMP
Brute Force: 21 comparisons
abcdeabcdeabcedfghijkl
-
 bc-
  -
   -
    -
     -
      bc-
       -
        -
         -
          -
           bcedfg
KMP: 19 comparisons, plus 5 preparation comparisons
abcdeabcdeabcedfghijkl
-
 bc-
   -
    -
     -
      bc-
        -
         -
          -
           bcedfg
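The totals on the two comparison slides can be reproduced by instrumenting both matchers with a counter. The sketch below counts only text-versus-pattern comparisons made during the search (the preparation comparisons for the table are not counted), and stops at the first match. `CompareCounts` and its method names are made up for this illustration.

```java
class CompareCounts {
    static int[] failure(String p) {
        int[] F = new int[p.length()];
        int i = 1, j = 0;
        while (i < p.length()) {
            if (p.charAt(j) == p.charAt(i)) { F[i] = j + 1; i++; j++; }
            else if (j > 0) j = F[j - 1];
            else { F[i] = 0; i++; }
        }
        return F;
    }

    // Comparisons made by brute force up to and including the first match.
    static int bruteForce(String t, String p) {
        int count = 0;
        for (int i = 0; i + p.length() <= t.length(); i++) {
            int j = 0;
            while (j < p.length()) {
                count++;
                if (t.charAt(i + j) != p.charAt(j)) break;
                j++;
            }
            if (j == p.length()) return count;
        }
        return count;
    }

    // Comparisons made by the KMP matching loop up to and including the first match.
    static int kmp(String t, String p) {
        int n = t.length(), m = p.length();
        if (m == 0) return 0;
        int[] F = failure(p);
        int i = 0, j = 0, count = 0;
        while (i < n) {
            count++;                               // one comparison of P[j] with T[i]
            if (p.charAt(j) == t.charAt(i)) {
                if (j == m - 1) return count;
                i++; j++;
            } else if (j > 0) {
                j = F[j - 1];
            } else {
                i++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String zerosText = "000000000" + "000000000" + "000000000" + "1"; // 27 zeros, then 1
        String zerosPattern = "000000000" + "0000" + "1";                 // 13 zeros, then 1
        System.out.println(bruteForce(zerosText, zerosPattern) + " vs " + kmp(zerosText, zerosPattern)); // 210 vs 42
        String t = "abcdeabcdeabcedfghijkl";
        System.out.println(bruteForce(t, "bcedfg") + " vs " + kmp(t, "bcedfg"));                         // 21 vs 19
    }
}
```

On the second text the gap is small because the pattern has no self-overlap, so KMP saves only the comparisons brute force repeats after each partial match.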
KMP Performance
• Pre-processing needs O(M) operations.
• At each iteration, one of three cases:
– T[i] = P[j]
• i increases
– T[i] ≠ P[j] and j>0
• i-j increases
– T[i] ≠ P[j] and j=0
• i increases and i-j increases
• Neither i nor i-j ever decreases, and both are at most N.
• Hence, maximum of 2N iterations.
• Thus worst case performance is O(N+M).
Exercises
• Suppose we are given the pattern P = 10010001 and text T = 000100100100010111
• Do the following
– Draw a FSM for pattern P
– Construct the KMP table for P
– Trace the KMP algorithm with T
