Lecture 18 - String Matching-KMP

15-211

Fundamental Data Structures and Algorithms

String Matching

March 28, 2006

Ananda Gunawardena
In this lecture
• String Matching Problem
– Concept
– Regular expressions
– brute force algorithm
– complexity
• Finite State Machines
• Knuth-Morris-Pratt (KMP) Algorithm
– Pre-processing
– complexity
Pattern Matching Algorithms
The Problem
• Given a text T and a pattern P, check whether P occurs in T
– e.g., T = aabbcbbcabbbcbccccabbabbccc
– Find all occurrences of pattern P = bbc
• There are variations of pattern matching
– Finding “approximate” matches
– Finding multiple patterns, etc.
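As a minimal sketch of the problem statement, the example above can be checked with Java's built-in `String.indexOf` rather than any algorithm from this lecture. The class and method names (`OccurrenceDemo`, `findAll`) are illustrative only, and skipping the empty pattern is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class OccurrenceDemo {
    // Returns the start index of every (possibly overlapping) occurrence of p in t.
    static List<Integer> findAll(String t, String p) {
        List<Integer> hits = new ArrayList<>();
        if (p.isEmpty()) return hits;          // assumption: ignore the empty pattern
        int from = t.indexOf(p);
        while (from >= 0) {
            hits.add(from);
            from = t.indexOf(p, from + 1);     // step by one so overlaps are found
        }
        return hits;
    }

    public static void main(String[] args) {
        String t = "aabbcbbcabbbcbccccabbabbccc";
        System.out.println(findAll(t, "bbc")); // prints [2, 5, 10, 22]
    }
}
```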
Why String Matching?
• Applications in Computational Biology
– DNA sequence is a long word (or text) over a 4-letter alphabet
– GTTTGAGTGGTCAGTCTTTTCGTTTCGACGGAGCCCCCAATT
AATAAACTCATAAGCAGACCTCAGTTCGCTTAGAGCAGCCG
AAA…..
– Find a Specific pattern W
• Finding patterns in documents formed using a large alphabet
– Word processing
– Web searching
– Desktop search (Google, MSN)
• Matching strings of bytes containing
– Graphical data
– Machine code
• grep in unix
– grep searches for lines matching a pattern.
String Matching
• Text string T[0..N-1]
T = “abacaabaccabacabaabb”
• Pattern string P[0..M-1]
P = “abacab”
• Where is the first instance of P in T?
T[10..15] = P[0..5]
• Typically N >>> M
Java Pattern Matching Utilities
• Java provides an API for pattern matching
with regular expressions
– java.util.regex
• Regular expressions describe a set of strings based on some common characteristics shared by each string in the set, e.g., a* = {ε, a, aa, aaa, …}
• Regular expressions can be used as a tool
to search, edit or manipulate text or data
– perl, java, C#
Java Pattern Matching Utilities
• java.util.regex
– Pattern
• Is a compiled representation of a regular expression.
• Eg: Pattern p = Pattern.compile("a*b");
– Matcher
• A machine that performs match operations on a character sequence by
interpreting a pattern.
• Eg: Matcher m = p.matcher("aabbb");
• Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExample {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(aa|bb)*");
        Matcher m = p.matcher("aabbb");
        // matches() succeeds only if the entire input sequence matches the pattern
        boolean b = m.matches();
        // or: boolean b = m.find(); // finds the next subsequence of the input that matches the pattern
        System.out.println("The value is " + b);
    }
}
String Matching
• The brute force algorithm
• 22+6=28 comparisons.
abacaabaccabacabaabb
abacab
 abacab
  abacab
   abacab
    abacab
     abacab
      abacab
       abacab
        abacab
         abacab
          abacab
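The comparison count on this slide can be reproduced with a small instrumented search. This is a sketch under the assumption that each alignment stops at its first mismatch and the search stops at the first full match; `BruteForceCount` and `countComparisons` are hypothetical names, not part of the lecture code.

```java
class BruteForceCount {
    // Counts character comparisons made by a brute-force search that abandons
    // each alignment at the first mismatch and stops at the first full match.
    static int countComparisons(String t, String p) {
        int count = 0;
        for (int i = 0; i + p.length() <= t.length(); i++) {
            int j = 0;
            while (j < p.length()) {
                count++;                                   // one comparison of t[i+j] with p[j]
                if (t.charAt(i + j) != p.charAt(j)) break;
                j++;
            }
            if (j == p.length()) return count;             // match found at offset i
        }
        return count;                                      // no match anywhere
    }

    public static void main(String[] args) {
        // 22 comparisons in the ten failed alignments, 6 in the successful one
        System.out.println(countComparisons("abacaabaccabacabaabb", "abacab")); // prints 28
    }
}
```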
Naïve Algorithm (or Brute Force)
• Assume |T| = n and |P| = m
• Slide pattern P along text T, one position at a time
• Compare until a match is found. If so, return the index where the match occurs; else return -1
Brute Force Version 1
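static int match(char[] T, char[] P) {
    // Try every alignment i at which P still fits inside T
    for (int i = 0; i + P.length <= T.length; i++) {
        boolean flag = true;
        for (int j = 0; j < P.length; j++) {
            if (T[i + j] != P[j]) { flag = false; break; }
        }
        if (flag) return i;   // all of P matched at offset i
    }
    return -1;                // P does not occur in T
}
• What is the complexity of the code?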
Brute Force, Version 2
static int match(char[] T, char[] P) {
    int n = T.length;
    int m = P.length;
    int i = 0;
    int j = 0;
    // rewrite the brute-force code with only one loop
    do {
        // Homework
    } while (j < m && i < n);
    // Homework: return the index of the match, or -1
}
• What is the complexity of your code?
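One possible way to fill in the single-loop skeleton is sketched below; it is not necessarily the intended homework solution. It uses a `while` loop instead of `do ... while` so that empty inputs need no special handling, and the class name `SingleLoopMatch` is hypothetical.

```java
class SingleLoopMatch {
    // Brute-force search written with one loop: on a mismatch, i backs up to
    // the position just after the start of the current alignment and j resets.
    static int match(char[] T, char[] P) {
        int n = T.length;
        int m = P.length;
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (T[i] == P[j]) {
                i++;
                j++;
            } else {
                i = i - j + 1;   // start of the next alignment
                j = 0;
            }
        }
        return (j == m) ? i - m : -1;   // i - m is where the match began
    }

    public static void main(String[] args) {
        char[] t = "abacaabaccabacabaabb".toCharArray();
        System.out.println(match(t, "abacab".toCharArray()));  // prints 10
    }
}
```

Because i can move backwards by up to m-1 positions after each mismatch, this version still has the same worst case as the two-loop version.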
A bad case
• 60+5 = 65 comparisons are needed
• How many of them could be avoided?
00000000000000001
0000-
 0000-
  0000-
   0000-
    0000-
     0000-
      0000-
       0000-
        0000-
         0000-
          0000-
           0000-
            00001
Typical text matching
• 20+5=25 comparisons are needed
(The match is near the same point in the target string as the previous example.)
• In practice, 0 ≤ j ≤ 2
This is a sample sentence
-
 -
  -
   s-
    -
     -
      s-
       -
        -
         -
          s-
           -
            -
             -
              -
               -
                -
                 sente
String Matching
• Brute force worst case
– O(MN)
– Expensive for long patterns in repetitive text
• How to improve on this?
• Intuition:
– Remember what is learned from previous matches
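As a rough worked check of the O(MN) bound, assume the brute-force matcher tries every alignment and abandons each one at its first mismatch. There are N - M + 1 alignments, each costing at most M comparisons, and the "bad case" text of 17 characters with the 5-character pattern 00001 hits that maximum:

```latex
\[
C_{\text{worst}} \le (N - M + 1)\,M = O(MN),
\qquad
N = 17,\; M = 5:\quad (17 - 5 + 1)\cdot 5 = 13 \cdot 5 = 65 = 60 + 5 .
\]
```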
Finite State Machines
Finite State Machines (FSM)
• FSM is a computing machine that takes
– A string as an input
– Outputs YES/NO answer
• That is, the machine “accepts” or “rejects” the string
[Figure: Input String → FSM → Yes / No]
FSM Model

• Input to a FSM
– Strings built from a fixed alphabet {a,b,c}
– Possible inputs: aa, aabbcc, a, etc.
• The Machine
– A directed graph
• Nodes = States of the machine
• Edges = Transition from one state to another
FSM Model
• Special States
– Start (q0) and Final (or Accepting) (q2)
• Assume the alphabet is {a,b}
– Which strings are accepted by this FSM?
FSM Model
• Exercise: draw a finite automaton that accepts any string with an even number of 1’s
• Exercise: draw a finite automaton that accepts any string with an even number of consecutive 1’s followed by an odd number of consecutive zeros
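To make the accept/reject idea concrete, here is one possible machine for the first exercise, simulated in code. It is a sketch, not the official solution: it assumes the alphabet is {0,1}, that the empty string counts as having an even number of 1's, and that any other symbol is rejected. `EvenOnes`, `NEXT`, and `accepts` are hypothetical names.

```java
class EvenOnes {
    // Two states: 0 = even number of 1's seen so far (start state, accepting),
    //             1 = odd number of 1's seen so far.
    // NEXT[state][symbol] is the next state after reading symbol 0 or 1.
    static final int[][] NEXT = { {0, 1}, {1, 0} };

    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c != '0' && c != '1') return false;  // assumption: reject symbols outside {0,1}
            state = NEXT[state][c - '0'];
        }
        return state == 0;                           // accept only in the "even" state
    }

    public static void main(String[] args) {
        System.out.println(accepts("1001"));  // prints true
        System.out.println(accepts("10"));    // prints false
    }
}
```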
Why Study FSMs?
• Useful Algorithm Design Technique
– Lexical Analysis (“tokenization”)
– Control Systems
• Elevators, Soda Machines….
• Modeling a problem with FSM is
– Simple
– Elegant
State Transitions
• Let Q be the set of states and ∑ be the alphabet. Then the transition function T is given by
– T: Q × ∑ → Q
• ∑ could be
– {0,1} – binary
– {C,G,T,A} – nucleotide base
– {0,1,2,..,9,a,b,c,d,e,f} – hexadecimal
– etc.
• E.g., consider ∑ = {a,b,c} and P = aabc
– the set of states is the set of all prefixes of P
– Q = {ε, a, aa, aab, aabc} or
– Q = {0, 1, 2, 3, 4}
• State transitions: T(0, ‘a’) = 1; T(1, ‘a’) = 2, etc.
• What about failure transitions?
Failure Transitions
• Where do we go when a failure occurs?
• P = “aabc”
• Q – current state
• Q’ – next state
• initial state = 0
• end state = 4
• How to store state transition table?
– as a matrix

Q   ∑       Q’
0   a       1
0   {b,c}   0
1   a       2
1   {b,c}   0
2   b       3
2   a       2
2   c       0
3   c       4
3   a       1
3   b       0
Using FSM concept in
Pattern Matching
• Consider the alphabet {a,b,c}
• Suppose we are looking for pattern “aabc”
• Construct a finite automaton for “aabc” as follows
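[Figure: automaton for “aabc” with states Start = 0, 1, 2, 3, 4; forward edges 0 -a-> 1 -a-> 2 -b-> 3 -c-> 4; the remaining edges, labeled a, b, c and b|c, lead back to earlier states]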
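A minimal sketch of how such an automaton can drive a search: the matrix below is copied from the failure-transition table for P = “aabc”, and the matcher makes one table lookup per text character. Resetting to state 0 on symbols outside {a,b,c} is an assumption, and `AabcAutomaton`, `NEXT`, and `find` are hypothetical names.

```java
class AabcAutomaton {
    // Transition matrix for P = "aabc" over {a, b, c}:
    // rows are states 0..3, columns are the symbols a, b, c.
    static final int[][] NEXT = {
        {1, 0, 0},  // state 0: a -> 1, b|c -> 0
        {2, 0, 0},  // state 1: a -> 2, b|c -> 0
        {2, 3, 0},  // state 2: a -> 2, b -> 3, c -> 0
        {1, 0, 4},  // state 3: a -> 1, b -> 0, c -> 4 (end state)
    };

    // Returns the index where the first occurrence of "aabc" starts, or -1.
    static int find(String text) {
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            int col = text.charAt(i) - 'a';
            if (col < 0 || col > 2) {   // assumption: other symbols reset the machine
                state = 0;
                continue;
            }
            state = NEXT[state][col];
            if (state == 4) return i - 3;  // end state reached; the match began 3 characters back
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(find("abaaabcc"));  // prints 3
    }
}
```

The text index never moves backwards, which is the property KMP keeps while replacing the full matrix with a smaller table.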
Knuth-Morris-Pratt (KMP) Algorithm
KMP – The Big Idea
• Retain information from prior attempts.
• Compute in advance how far to jump in P when a match
fails.
– Suppose the match fails at P[j] ≠ T[i+j].
– Then we know P[0 .. j-1] = T[i .. i+j-1].
• We must next try P[0] ? T[i+1].
– But we know T[i+1]=P[1]
– What if we compare: P[1]?P[0]
• If so, increment j by 1. No need to look at T.
– What if P[1]=P[0] and P[2]=P[1]?
• Then increment j by 2. Again, no need to look at T.
• In general, we can determine how far to jump without any
knowledge of T!
Implementing KMP
• Never decrement i, ever.
– Comparing T[i] with P[j].
• Compute a table f of how far to jump j forward when a match fails.
– The next match will compare T[i] with P[f[j-1]]
• Do this by matching P against itself in all positions.
Building the Table for f
• P = 1010011
• Find self-overlaps
Prefix Overlap j f
1 . 1 0
10 . 2 0
101 1 3 1
1010 10 4 2
10100 . 5 0
101001 1 6 1
1010011 1 7 1
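The table can be checked directly from the definition of a self-overlap: for each prefix, find the longest proper prefix that is also a suffix. The sketch below does this by plain comparison, which is slow but mirrors the table; it is not the linear-time KMP pre-processing. `OverlapTable` and `overlaps` are hypothetical names, and the array is 0-indexed, so entry k corresponds to the row with j = k+1.

```java
import java.util.Arrays;

class OverlapTable {
    // f[k] = length of the longest proper prefix of P[0..k] that is also a suffix of it.
    static int[] overlaps(String p) {
        int[] f = new int[p.length()];
        for (int k = 0; k < p.length(); k++) {
            String prefix = p.substring(0, k + 1);
            for (int len = k; len > 0; len--) {   // try the longest candidate first
                String suffix = prefix.substring(prefix.length() - len);
                if (prefix.startsWith(suffix)) {
                    f[k] = len;
                    break;
                }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(overlaps("1010011")));  // prints [0, 0, 1, 2, 0, 1, 1]
    }
}
```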
What f means
Prefix Overlap j f
1 . 1 0
10 . 2 0
101 1 3 1
1010 10 4 2
10100 . 5 0
101001 1 6 1
1010011 1 7 1
• f non-zero implies there is a self-match.
E.g., f=2 means P[0..1] = P[j-2..j-1]
• Hence must start new comparison at j-2, since we know T[i-2..i-1] = P[0..1]
• In general:
– Set j=f[j-1]
– Do not change i.
• The next match is T[i] ? P[f[j-1]]
• If f is zero, there is no self-match.
– Set j=0
– Do not change i.
• The next match is T[i] ? P[0]
Favorable conditions
• P = 1234567
• Find self-overlaps
Prefix Overlap j f
1 . 1 0
12 . 2 0
123 . 3 0
1234 . 4 0
12345 . 5 0
123456 . 6 0
1234567 . 7 0
Mixed conditions
• P = 1231234
• Find self-overlaps
Prefix Overlap j f
1 . 1 0
12 . 2 0
123 . 3 0
1231 1 4 1
12312 12 5 2
123123 123 6 3
1231234 . 7 0
Poor conditions
• P = 1111110
• Find self-overlaps
Prefix Overlap j f
1 . 1 0
11 1 2 1
111 11 3 2
1111 111 4 3
11111 1111 5 4
111111 11111 6 5
1111110 . 7 0
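KMP pre-process Algorithm
// F[i] = length of the longest proper prefix of P[0..i] that is also a suffix of it
static int[] buildTable(char[] P) {
    int m = P.length;
    int[] F = new int[m];        // table F of size m; F[0] = 0 by default
    int i = 1, j = 0;
    while (i < m) {
        if (P[j] == P[i]) {      // the overlap of length j extends by one character
            F[i] = j + 1;
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];        // use previous values of F to fall back
        } else {
            F[i] = 0;            // no overlap ends at position i
            i++;
        }
    }
    return F;
}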
KMP Algorithm
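// input: text T (|T| = n), pattern P (|P| = m), and the table F computed for P
// output: index of the first occurrence of P in T, or -1 if there is none
static int kmpMatch(char[] T, char[] P, int[] F) {
    int n = T.length, m = P.length;
    if (m == 0) return 0;        // assumption: the empty pattern matches at index 0
    int i = 0, j = 0;
    while (i < n) {
        if (P[j] == T[i]) {
            if (j == m - 1) return i - m + 1;   // whole pattern matched
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];        // use F to determine the next value for j
        } else {
            i++;
        }
    }
    return -1;                   // P does not occur in T
}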
Specializing the matcher
Prefix Overlap j f
1 . 1 0
10 . 2 0
101 1 3 1
1010 10 4 2
10100 . 5 0
101001 1 6 1
1010011 1 7 1
[Figure: diagram of the matcher specialized to P = 1010011, with edges labeled 0 and 1]
Brute Force vs. KMP
• A worse case example
– Text T = 0000000000000000000000000001 (27 zeros followed by a 1)
– Pattern P = 00000000000001 (13 zeros followed by a 1)
– Brute force: 196 + 14 = 210 comparisons
– KMP: 28 + 14 = 42 comparisons
Brute Force vs. KMP
• Text T = abcdeabcdeabcedfghijkl, pattern P = bcedfg
– Brute force: 21 comparisons
– KMP: 19 comparisons, plus 5 preparation comparisons
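The counts on the two comparison slides above can be reproduced by instrumenting both matchers. This is a sketch under stated assumptions: the pattern is non-empty, both searches stop at the first match, and the first example uses a text of 27 zeros followed by a 1 with a pattern of 13 zeros followed by a 1. All names (`ComparisonCounts`, `prepComparisons`, `zeros`) are hypothetical.

```java
class ComparisonCounts {
    static int prepComparisons;   // comparisons made while building the table

    // Brute force: abandon each alignment at the first mismatch.
    static int bruteForce(String t, String p) {
        int count = 0;
        for (int i = 0; i + p.length() <= t.length(); i++) {
            int j = 0;
            while (j < p.length()) {
                count++;
                if (t.charAt(i + j) != p.charAt(j)) break;
                j++;
            }
            if (j == p.length()) break;   // first match found
        }
        return count;
    }

    // KMP pre-processing, counting pattern-against-pattern comparisons.
    static int[] buildTable(String p) {
        int[] f = new int[p.length()];
        prepComparisons = 0;
        int i = 1, j = 0;
        while (i < p.length()) {
            prepComparisons++;
            if (p.charAt(j) == p.charAt(i)) { f[i] = j + 1; i++; j++; }
            else if (j > 0) j = f[j - 1];
            else { f[i] = 0; i++; }
        }
        return f;
    }

    // KMP search, counting text-against-pattern comparisons (p must be non-empty).
    static int kmp(String t, String p) {
        int[] f = buildTable(p);
        int count = 0, i = 0, j = 0;
        while (i < t.length()) {
            count++;
            if (p.charAt(j) == t.charAt(i)) {
                if (j == p.length() - 1) break;   // first match found
                i++; j++;
            } else if (j > 0) j = f[j - 1];
            else i++;
        }
        return count;
    }

    static String zeros(int k) {
        StringBuilder sb = new StringBuilder();
        for (int x = 0; x < k; x++) sb.append('0');
        return sb.toString();
    }

    public static void main(String[] args) {
        String t1 = zeros(27) + "1", p1 = zeros(13) + "1";
        System.out.println(bruteForce(t1, p1));   // prints 210
        System.out.println(kmp(t1, p1));          // prints 42
        String t2 = "abcdeabcdeabcedfghijkl", p2 = "bcedfg";
        System.out.println(bruteForce(t2, p2));   // prints 21
        System.out.println(kmp(t2, p2));          // prints 19
        System.out.println(prepComparisons);      // prints 5
    }
}
```

The second example shows why KMP gains little on typical text: with no self-overlaps in the pattern, it saves only the few comparisons that brute force repeats after a partial match.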
KMP Performance
• Pre-processing needs O(M) operations.
• At each iteration, one of three cases:
– T[i] = P[j]
• i increases
– T[i] ≠ P[j] and j > 0
• i-j increases
– T[i] ≠ P[j] and j = 0
• i increases and i-j increases
• Neither i nor i-j ever decreases, and both are at most N.
• Hence, maximum of 2N iterations.
• Thus worst case performance is O(N+M).
Exercises
• Suppose we are given the pattern P = 10010001 and the text T = 000100100100010111
• Do the following:
– Draw a FSM for pattern P
– Construct the KMP table for P
– Trace the KMP algorithm with T
