String Matching

The document provides an outline and overview of different string matching algorithms: Naive, Rabin-Karp, and Knuth-Morris-Pratt (KMP). It defines the string matching problem, describes the naive algorithm and its weaknesses, then introduces the Rabin-Karp and KMP algorithms as improvements over the naive approach by utilizing hashing and preprocessing respectively to reduce runtime complexity.

Uploaded by

Tanmay Thaware


Outline

String Matching
Introduction
Naive Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm
Introduction
What is string matching?
Finding all occurrences of a pattern in a given text (or
body of text)
Many applications
While using editor/word processor/browser
Login name & password checking
Virus detection
Header analysis in data communications
DNA sequence analysis, Web search engines (e.g.
Google), image analysis
String-Matching Problem
The text is in an array T [1..n] of length n
The pattern is in an array P [1..m] of
length m
Elements of T and P are characters from a
finite alphabet Σ
E.g., Σ = {0, 1} or Σ = {a, b, ..., z}
Usually T and P are called strings of
characters
String-Matching Problem (contd.)

We say that pattern P occurs with shift s in
text T if:
a) 0 ≤ s ≤ n-m and
b) T[(s+1)..(s+m)] = P[1..m]
If P occurs with shift s in T, then s is a
valid shift, otherwise s is an invalid shift
String-matching problem: finding all valid
shifts for a given T and P
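The validity test for a single shift can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function name `is_valid_shift` is hypothetical, and because Python strings are indexed from 0, the slide's T[(s+1)..(s+m)] is assumed to correspond to `text[s:s + m]`.

```python
def is_valid_shift(text: str, pattern: str, s: int) -> bool:
    """Return True if pattern occurs in text with shift s.

    The slides index T and P from 1; Python indexes from 0, so the
    substring T[(s+1)..(s+m)] is text[s : s + m] here.
    """
    n, m = len(text), len(pattern)
    # Condition (a): 0 <= s <= n-m.  Condition (b): the window equals P.
    return 0 <= s <= n - m and text[s:s + m] == pattern


print(is_valid_shift("abcabaabcabac", "abaa", 3))  # True
print(is_valid_shift("abcabaabcabac", "abaa", 4))  # False
```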
Example 1
index:      1 2 3 4 5 6 7 8 9 10 11 12 13
text T:     a b c a b a a b c a  b  a  c
pattern P:        a b a a                 (s = 3)
index:            1 2 3 4

shift s = 3 is a valid shift
(n = 13, m = 4, and 0 ≤ s ≤ n-m holds)
Example 2
index:      1 2 3 4
pattern P:  a b a a

index:      1 2 3 4 5 6 7 8 9 10 11 12 13
text T:     a b c a b a a b c a  b  a  a
s = 3:            a b a a
s = 9:                        a  b  a  a
Naive String-Matching Algorithm
Input: Text strings T[1..n] and P[1..m]
Result: All valid shifts displayed

NAIVE-STRING-MATCHER (T, P)
n ← length[T]
m ← length[P]
for s ← 0 to n-m
    if P[1..m] = T[(s+1)..(s+m)]
        print "pattern occurs with shift" s
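A runnable counterpart of the pseudocode above might look like this in Python. It is a sketch under two assumptions that are not in the slides: indexing is 0-based, and the function returns the list of valid shifts rather than printing them.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text (0-based)."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n-m+1 candidate shifts in turn.
    for s in range(n - m + 1):
        # Compare the whole window against the pattern.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```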
Naive Algorithm

The naive algorithm checks, at every position in the text from 0 to n-m,
whether an occurrence of the pattern starts there.
After each attempt, it shifts the pattern by exactly one position to
the right.
Example (from left to right):
a b c a b c a
a b c a          (shift = 0)
  a b c a        (shift = 1)
    a b c a      (shift = 2)
      a b c a    (shift = 3)
Analysis: Worst-case Example
index:      1 2 3 4
pattern P:  a a a b

index:      1 2 3 4 5 6 7 8 9 10 11 12 13
text T:     a a a a a a a a a a  a  a  a
            a a a b
              a a a b
Worst-case Analysis
There are m comparisons for each shift in the
worst case
There are n-m+1 shifts
So, the worst-case running time is Θ((n-m+1)m)
In the example on the previous slide, we have (13-4+1)·4 = 40
comparisons in total
The naive method is inefficient because information
from a shift is not used again
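The comparison count for the worst-case example can be checked with a small Python sketch. The function name `count_comparisons` is hypothetical, and the sketch assumes the naive matcher compares characters left to right and stops at the first mismatch within each shift.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # mismatch: give up on this shift
    return comparisons


# Text of 13 a's, pattern "aaab": every shift fails only at the last character.
print(count_comparisons("a" * 13, "aaab"))  # 40
```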
Naive Algorithm

Example (from right to left):

a b c a b c a
      a b c a    (shift = 3)
    a b c a      (shift = 2)
  a b c a        (shift = 1)
a b c a          (shift = 0)
The pattern occurs with shifts 0 and 3
Rabin-Karp Algorithm
Has a worst-case running time of O((n-m+1)m),
but the average case is O(n+m)
Also works well in practice
Based on the number-theoretic notion of
modular equivalence
We assume that Σ = {0, 1, 2, ..., 9}, i.e.,
each character is a decimal digit
In general, use radix d, where d = |Σ|
Rabin-Karp Approach
We can view a string of k characters (digits)
as a length-k decimal number
E.g., the string 31425 corresponds to the
decimal number 31,425
Given a pattern P[1..m], let p denote the
corresponding decimal value
Given a text T[1..n], let t_s denote the decimal
value of the length-m substring T[(s+1)..(s+m)]
for s = 0, 1, ..., n-m
Rabin-Karp Approach (contd.)

t_s = p iff T[(s+1)..(s+m)] = P[1..m]
s is a valid shift iff t_s = p
p can be computed in O(m) time using Horner's rule:
p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10(P[2] + 10 P[1]) ...))
t_0 can similarly be computed in O(m) time
The other values t_1, t_2, ..., t_{n-m} can be computed in O(n-m)
time, since t_{s+1} can be computed from t_s in
constant time
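The O(m) evaluation of p can be sketched in Python as a single left-to-right pass, which is the nested form above unrolled from the innermost term outward. The function name `horner_value` is hypothetical, and the sketch assumes the pattern is given as a string of decimal digits.

```python
def horner_value(digits: str, radix: int = 10) -> int:
    """Evaluate a digit string as a number, one multiply-add per digit."""
    value = 0
    for ch in digits:
        # Shift the accumulated value one place left, then add the new digit.
        value = radix * value + int(ch)
    return value


print(horner_value("31415"))  # 31415
```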
Rabin-Karp Approach (contd.)

t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]

E.g., if T = {..., 3, 1, 4, 1, 5, 2, ...}, m = 5 and t_s = 31,415,
then t_{s+1} = 10(31415 - 10000·3) + 2 = 14152
Thus we can compute p in Θ(m) time and can
compute t_0, t_1, t_2, ..., t_{n-m} in Θ(n-m+1) time
And we can find all occurrences of the pattern
P[1..m] in text T[1..n] with Θ(m) preprocessing
time and Θ(n-m+1) matching time.
But there is a problem: this assumes p and t_s are small numbers
They may be too large to work with easily
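The constant-time update from one window value to the next can be sketched in Python as follows. This is an illustration only: the name `window_values` is hypothetical, indexing is 0-based (so the outgoing digit T[s+1] is `digits[s]` and the incoming digit T[s+m+1] is `digits[s + m]`), and no modulus is applied yet, so the values grow with m.

```python
def window_values(digits: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for a text of decimal digits."""
    n = len(digits)
    if m <= 0 or m > n:
        return []
    high = 10 ** (m - 1)      # weight of the outgoing high-order digit
    t = int(digits[:m])       # t_0, computed once in O(m)
    values = [t]
    for s in range(n - m):
        # Drop the high-order digit, shift left, append the new low-order digit.
        t = 10 * (t - high * int(digits[s])) + int(digits[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```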
Rabin-Karp Approach (contd.)

Solution: we can use modular arithmetic with
a suitable modulus, q
E.g.,
t_{s+1} ≡ (10(t_s - T[s+1]h) + T[s+m+1]) (mod q)
where h ≡ 10^{m-1} (mod q)
q is chosen as a small prime number; e.g.,
13 for radix 10
Generally, if the radix is d, then dq should fit
within one computer word
How values modulo 13 are computed
digits: 3 1 4 1 5 2
old high-order digit: 3; new low-order digit: 2
31415 ≡ 7 (mod 13) and 14152 ≡ 8 (mod 13)

14152 ≡ ((31415 - 3·10000)·10 + 2) (mod 13)
      ≡ ((7 - 3·3)·10 + 2) (mod 13)    (since 10000 ≡ 3 (mod 13))
      ≡ 8 (mod 13)
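The same modular step can be reproduced in a few lines of Python. This is a sketch with a hypothetical helper name `next_residue`; it relies on the facts that Python's `%` returns a non-negative result for a positive modulus and that the built-in three-argument `pow` computes a modular power.

```python
def next_residue(t_s: int, old_digit: int, new_digit: int,
                 h: int, q: int, d: int = 10) -> int:
    """Return t_{s+1} mod q from t_s mod q, using h = d^(m-1) mod q."""
    return (d * (t_s - old_digit * h) + new_digit) % q


q = 13
h = pow(10, 4, q)   # 10^(m-1) mod q with m = 5
print(h)            # 3
print(31415 % q)    # 7
print(next_residue(31415 % q, old_digit=3, new_digit=2, h=h, q=q))  # 8
```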
Problem of Spurious Hits
t_s ≡ p (mod q) does not imply that t_s = p
Modular equivalence does not necessarily mean
that two integers are equal
A case in which t_s ≡ p (mod q) but t_s ≠ p is
called a spurious hit
On the other hand, if two integers are not
modular equivalent, then they cannot be
equal
Example
pattern P:  3 1 4 1 5        (p mod 13 = 7)

index:      1 2 3 4 5 6 7 8 9 10 11 12 13 14
text T:     2 3 1 4 1 5 2 6 7 3  9  9  2  1

shift s:      0  1  2  3  4  5  6  7  8  9
t_s mod 13:   1  7  8  4  5 10 11  7  9 11

s = 1: valid match (window 31415)
s = 7: spurious hit (window 67399)
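The residues in this example can be recomputed and sorted into valid matches and spurious hits with a short Python sketch. The name `classify_hits` is hypothetical, and for clarity the sketch recomputes each window value directly instead of using the rolling update.

```python
def classify_hits(text: str, pattern: str, q: int = 13):
    """Return (residues, valid_shifts, spurious_shifts) for digit strings."""
    n, m = len(text), len(pattern)
    p = int(pattern) % q
    # Residue of every length-m window, recomputed directly for clarity.
    residues = [int(text[s:s + m]) % q for s in range(n - m + 1)]
    valid, spurious = [], []
    for s, r in enumerate(residues):
        if r == p:
            # A hit: decide by comparing the characters themselves.
            (valid if text[s:s + m] == pattern else spurious).append(s)
    return residues, valid, spurious


residues, valid, spurious = classify_hits("23141526739921", "31415")
print(residues)  # [1, 7, 8, 4, 5, 10, 11, 7, 9, 11]
print(valid)     # [1]
print(spurious)  # [7]
```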
Rabin-Karp Algorithm
Basic structure like the naive algorithm,
but uses modular arithmetic as described
For each hit, i.e., for each s where t_s ≡ p
(mod q), verify character by character
whether s is a valid shift or a spurious hit
In the worst case, every shift is verified
Running time can be shown to be O((n-m+1)m)
Average-case running time is O(n+m)
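Putting the pieces together, a complete matcher might be sketched in Python as below. Several choices here are assumptions made for illustration and do not come from the slides: the function name `rabin_karp`, the radix d = 256 with `ord()` supplying character values, the prime modulus q = 1_000_003, and 0-based shifts.

```python
def rabin_karp(text: str, pattern: str,
               d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return all valid shifts, verifying every hit character by character."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)          # weight of the outgoing character, mod q
    p = t = 0
    for i in range(m):            # preprocessing: p and t_0 by Horner's rule
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal residues are only a hit; the comparison rules out spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:             # roll the window one position to the right
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("abcabaabcabaa", "abaa"))  # [3, 9]
```

A small modulus such as q = 13 produces more spurious hits and therefore more verification work, but it does not change the returned shifts.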
3. The KMP Algorithm
The Knuth-Morris-Pratt (KMP) algorithm
looks for the pattern in the text in a left-to-right
order (like the brute force algorithm).
But it shifts the pattern more intelligently
than the brute force algorithm.
If a mismatch occurs between the text and
pattern P at P[j], what is the most we can shift the
pattern to avoid wasteful comparisons?
Example
Why (with the mismatch at j == 5):
Find the largest prefix (start) of
"a b a a b" ( P[0..j-1] )
which is a suffix (end) of
"b a a b" ( P[1..j-1] )
Answer: "a b"
Set j = 2 // the new j value
KMP Failure Function
KMP preprocesses the pattern to find
matches of prefixes of the pattern with the
pattern itself.
j = mismatch position in P[]
k = position before the mismatch (k = j-1).
The failure function F(k) is defined as the
size of the largest prefix of P[0..k] that is
also a suffix of P[1..k].
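The definition can be turned directly into a brute-force Python sketch. The name `failure_by_definition` is hypothetical; the code simply tries candidate sizes from largest to smallest and is meant to mirror the definition, not to be efficient.

```python
def failure_by_definition(pattern: str, k: int) -> int:
    """F(k): size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    whole = pattern[:k + 1]    # P[0..k]
    tail = pattern[1:k + 1]    # P[1..k]
    # Try the longest candidate first so the first success is the largest.
    for size in range(len(tail), 0, -1):
        if tail.endswith(whole[:size]):
            return size
    return 0


print([failure_by_definition("abaaba", k) for k in range(5)])  # [0, 0, 1, 1, 2]
```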
Failure Function Example
(k == j-1)
P: "abaaba"
j:  012345

k     0 1 2 3 4
F(k)  0 0 1 1 2

F(k) is the size of the largest prefix.
In code, F() is represented by an array, like
the table.
Why is F(4) == 2?
P: "abaaba"
F(4) means
find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that
is also a suffix of "baab"
= find the size of "ab"
= 2
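The whole failure array can also be filled in a single pass over the pattern. The Python sketch below uses the standard linear-time construction, which is not spelled out in the slides; the name `compute_failure` is hypothetical, and the array it returns also includes the entry for the last index of the pattern.

```python
def compute_failure(pattern: str) -> list[int]:
    """Return fail[k] = F(k) for every index k of the pattern, in O(m) time."""
    m = len(pattern)
    fail = [0] * m
    j = 0  # length of the prefix currently matched against the suffix
    for i in range(1, m):
        # Fall back to shorter borders until pattern[i] can extend one.
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail


print(compute_failure("abaaba"))  # [0, 0, 1, 1, 2, 3]
```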
Using the Failure Function

The Knuth-Morris-Pratt algorithm modifies the
brute-force algorithm.
if a mismatch occurs at P[j]
(i.e. P[j] != T[i]), then
    k = j-1;
    j = F(k); // obtain the new j
(if j == 0 there is no k; advance i in the text instead)
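A complete matcher built around this mismatch rule might be sketched in Python as follows. The names `compute_failure` and `kmp_search` are hypothetical, and the sketch goes slightly beyond the slide by reporting every match: after a full match it resumes with j = F(m-1).

```python
def compute_failure(pattern: str) -> list[int]:
    """fail[k] = size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    fail = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start position of every occurrence of pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    fail = compute_failure(pattern)
    matches = []
    i = j = 0                      # i indexes the text, j the pattern
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:             # full match ending at text[i-1]
                matches.append(i - m)
                j = fail[m - 1]
        elif j > 0:
            j = fail[j - 1]        # the rule from the slide: k = j-1; j = F(k)
        else:
            i += 1                 # mismatch at P[0]: move on in the text
    return matches


print(kmp_search("abacaabaccabacabaabb", "abacab"))  # [10]
```

Note that i never decreases, which is why the text is never re-read.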
Example
(the numbers 1..19 mark the order of the character comparisons)
T:  a  b  a  c  a  a  b  a  c  c  a  b  a  c  a  b  a  a  b  b
    1  2  3  4  5  6
P:  a  b  a  c  a  b
                   7
                a  b  a  c  a  b
                   8  9  10 11 12
                   a  b  a  c  a  b
                               13
                               a  b  a  c  a  b
                                  14 15 16 17 18 19
                                  a  b  a  c  a  b

k     0 1 2 3 4
F(k)  0 0 1 0 1
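The number of comparisons in this trace can be checked with a small Python sketch that counts every text-versus-pattern character comparison and stops at the first match. The function name `kmp_first_match` is hypothetical, and the failure array is built with the standard linear-time construction rather than read from the table.

```python
def kmp_first_match(text: str, pattern: str) -> tuple[int, int]:
    """Return (shift of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1, 0
    # Failure array: fail[k] = F(k).
    fail = [0] * m
    b = 0
    for k in range(1, m):
        while b > 0 and pattern[k] != pattern[b]:
            b = fail[b - 1]
        if pattern[k] == pattern[b]:
            b += 1
        fail[k] = b
    i = j = comparisons = 0
    while i < n:
        comparisons += 1           # one comparison of text[i] with pattern[j]
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - m + 1, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = fail[j - 1]
        else:
            i += 1
    return -1, comparisons


print(kmp_first_match("abacaabaccabacabaabb", "abacab"))  # (10, 19)
```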
Why is F(4) == 1?
P: "abacab"
F(4) means
find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that
is also a suffix of "baca"
= find the size of "a"
= 1
KMP Advantages
KMP runs in optimal time: O(m+n)
very fast

The algorithm never needs to move
backwards in the input text, T
this makes the algorithm good for processing
very large files that are read in from external
devices or through a network stream
KMP Disadvantages
KMP doesn't work as well as the size of the
alphabet increases
more chance of a mismatch (more possible
mismatches)
mismatches tend to occur early in the pattern,
but KMP is faster when the mismatches occur
later
