
String Matching Algorithms

Antonio Carzaniga

Faculty of Informatics
University of Lugano

December 23, 2009

© 2007 Antonio Carzaniga 1

Outline
Problem definition

Naïve algorithm

Knuth-Morris-Pratt algorithm

Boyer-Moore algorithm



Problem
Given the text
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
che la dritta via era smarrita. . .

Find the string “trova”

A more challenging example: how many times does the string “110011” appear in the following text?
0011110101011010011000110101111011010111
0110111001001010101011111011110110000101
1011000010111111011110011000011111000100
1001010010111011101011011110101001100101
0010111001000011111110010011011101011010
0110011011101001010010101000010100111110
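A brute-force scan answers this directly. A minimal Python sketch, counting overlapping occurrences; it assumes the six lines form one continuous text:

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    for s in range(len(text) - len(pattern) + 1):
        if text[s:s + len(pattern)] == pattern:
            count += 1
    return count

bits = (
    "0011110101011010011000110101111011010111"
    "0110111001001010101011111011110110000101"
    "1011000010111111011110011000011111000100"
    "1001010010111011101011011110101001100101"
    "0010111001000011111110010011011101011010"
    "0110011011101001010010101000010100111110"
)
print(count_occurrences(bits, "110011"))
```

A quick sanity check: `count_occurrences("1100110011", "110011")` is 2, since the two occurrences overlap.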


String Matching: Definitions


Given a text T
◮ T ∈ Σ∗ : finite alphabet Σ
◮ |T | = n: the length of T is n

Given a pattern P
◮ P ∈ Σ∗ : same finite alphabet Σ
◮ |P| = m: the length of P is m

Both T and P can be modeled as arrays


◮ T [1 . . . n] and P[1 . . . m]

Pattern P occurs with shift s in T iff


◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for all positions 1 ≤ i ≤ m



Example
Problem: find all s such that
◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for 1 ≤ i ≤ m

n = 14
T a b c a a b a a b a b a c a

m=3
P a b a

Result
s=4
s=7
s=9


Naïve Algorithm
For each position s in 0 . . . n − m, see if T [s + i] = P[i] for all
1≤i≤m

Naive-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 for s = 0 to n − m
4 if Substring-At(T , P, s)
5 output(s)

Substring-At(T , P, s)
1 for i = 1 to length(P)
2 if T [s + i] ≠ P[i]
3 return false
4 return true
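The pseudocode above translates almost line for line into Python. A sketch using 0-indexed strings, so the reported values agree with the 0-based shifts s defined earlier:

```python
def naive_string_matching(T, P):
    """Report every shift s (0-indexed) at which P occurs in T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):
        # Substring-At: compare P with T[s .. s+m-1]
        if all(T[s + i] == P[i] for i in range(m)):
            shifts.append(s)
    return shifts
```

On the earlier example, `naive_string_matching("abcaabaababaca", "aba")` returns `[4, 7, 9]`.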



Complexity of the Naïve Algorithm
Complexity of Naive-String-Matching is O((n − m + 1)m)

Worst-case example

T = a^n, P = a^m

i.e., T = aa · · · a (n characters), P = aa · · · a (m characters)

So (n − m + 1)m is a tight bound, and the (worst-case) complexity of Naive-String-Matching is

Θ((n − m + 1)m)
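The bound can be checked empirically by instrumenting the naive algorithm; a sketch in which the comparison counter is added only for illustration. On T = a^20, P = a^10 every one of the n − m + 1 = 11 shifts performs all m = 10 comparisons:

```python
def naive_comparisons(T, P):
    """Count the character comparisons performed by the naive algorithm."""
    n, m = len(T), len(P)
    comparisons = 0
    for s in range(n - m + 1):
        for i in range(m):
            comparisons += 1
            if T[s + i] != P[i]:
                break            # mismatch: abandon this shift
    return comparisons

print(naive_comparisons("a" * 20, "a" * 10))  # → 110, i.e. (n - m + 1) * m
```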


Improvement Strategy
Observation

T a b c a a b a a b a b a c a
P a b a

(T [1] = P[1] and T [2] = P[2] match; T [3] = c ≠ P[3] = a does not)

What now?
◮ the naïve algorithm tells us to go back to the second position in
T and to start from the beginning of P
◮ can’t we simply move along through T ?

◮ why?



Improvement Strategy (2)
Here’s a wrong but insightful strategy

Wrong-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 q = 0 // number of characters matched in P
4 s = 0
5 while s < n
6 s = s+1
7 if T [s] == P[q + 1]
8 q = q+1
9 if q == m
10 output(s − m)
11 q = 0
12 else q = 0
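The same (deliberately flawed) strategy as a Python sketch, using 0-indexed strings. On T = aabaaabababaca with P = aab it reports only shift 0 and misses the occurrence at shift 4:

```python
def wrong_string_matching(T, P):
    """The 'wrong' scan: never re-reads a text character, resets q on mismatch."""
    n, m = len(T), len(P)
    q = 0                  # number of characters of P matched so far
    shifts = []
    for s in range(n):     # scan T left to right, one character at a time
        if T[s] == P[q]:
            q += 1
            if q == m:
                shifts.append(s - m + 1)  # 0-indexed shift of the match
                q = 0
        else:
            q = 0          # mismatch: throw away everything matched so far
    return shifts
```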


Improvement Strategy (3)


Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o
P a g o

The scan moves through T one character at a time; “ago” is matched at positions 10–12, so the algorithm outputs the shift 9.

Done. Perfect!

Complexity: Θ(n)



Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a
P a a b

The occurrence at shift 0 is reported, but the occurrence at shift 4 is missed: after a mismatch, q is reset to 0 and the characters that begin the next occurrence are consumed without being re-examined.

So Wrong-String-Matching doesn’t work, but it tells us something useful


Improvement Strategy (5)


Where did Wrong-String-Matching go wrong?

T a a b a a a b a b a b a c a
P a a b

Wrong: by going all the way back to q = 0 we throw away a good prefix of P that we already matched



Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a
P a b a b a c

The match at shift 2 is reported: output(2). Just before that, starting at shift 0, we have matched “ababa”

◮ suffix “aba” can be reused as a prefix


New Strategy
P[1 . . . q] is the prefix of P matched so far

Find the longest prefix of P that is also a suffix of P[2 . . . q]


◮ i.e., find the largest 0 ≤ π < q such that P[q − π + 1 . . . q] = P[1 . . . π ]
◮ π = 0 means that such a prefix does not exist

P a b a b a c

π = 3 (for q = 5, i.e., after matching “ababa”)

Restart from q = π

Iterate as usual

In essence, this is the Knuth-Morris-Pratt algorithm



The Prefix Function
Given a pattern prefix P[1 . . . q], the longest prefix of P that is
also a suffix of P[2 . . . q] depends only on P and q

This prefix is identified by its length π (q)

Because π (q) depends only on P (and q), π can be computed at the beginning by Prefix-Function
◮ we represent π as an array of length m

Example

P a b a b a c

π 0 0 1 2 3 0


The Knuth-Morris-Pratt Algorithm

KMP-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 π = Prefix-Function(P)
4 q = 0 // number of characters matched
5 for i = 1 to n // scan the text left-to-right
6 while q > 0 and P[q + 1] ≠ T [i]
7 q = π [q] // no match: go back using π
8 if P[q + 1] == T [i]
9 q = q+1
10 if q == m
11 output(i − m)
12 q = π [q] // go back for the next match



Prefix Function Algorithm
Computing the prefix function amounts to finding all the
occurrences of a pattern P in itself
In fact, Prefix-Function is remarkably similar to
KMP-String-Matching

Prefix-Function(P)
1 m = length(P)
2 π [1] = 0
3 k = 0
4 for q = 2 to m
5 while k > 0 and P[k + 1] ≠ P[q]
6 k = π [k]
7 if P[k + 1] == P[q]
8 k = k+1
9 π [q] = k
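Both procedures carry over directly to Python. A sketch using 0-indexed strings, so the slides’ π [q] becomes pi[q − 1]:

```python
def prefix_function(P):
    """pi[q] = length of the longest proper prefix of P[0..q] that is also its suffix."""
    m = len(P)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]        # fall back to the next shorter candidate prefix
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_string_matching(T, P):
    """Knuth-Morris-Pratt: report all 0-indexed shifts of P in T in O(n + m) time."""
    pi = prefix_function(P)
    q = 0                        # number of characters of P matched so far
    shifts = []
    for i, c in enumerate(T):
        while q > 0 and P[q] != c:
            q = pi[q - 1]        # no match: go back using pi
        if P[q] == c:
            q += 1
        if q == len(P):
            shifts.append(i - len(P) + 1)
            q = pi[q - 1]        # go back for the next (possibly overlapping) match
    return shifts
```

On P = ababac this computes π = 0 0 1 2 3 0, matching the table above, and on the earlier examples it finds every occurrence, including the overlapping ones that Wrong-String-Matching missed.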


Prefix-Function at Work

Running Prefix-Function on P = ababac (q scans positions 2 . . . m while k tracks the length of the current matched prefix):

P a b a b a c
π 0 0 1 2 3 0


Complexity of KMP
O(n) for the search phase

O(m) for the pre-processing of the pattern

The complexity analysis is non-trivial

Can we do better?


Comments on KMP
Knuth-Morris-Pratt is Ω(n)
◮ KMP will always go through at least n character comparisons
◮ it fixes our “wrong” algorithm in the case of periodic patterns and texts

Perhaps there’s another algorithm that works better in the average case
◮ e.g., in the absence of periodic patterns



A New Strategy

T h e r e i s a s i m p l e e x a m p l e
P e x a m p l e

(the pattern is repeatedly aligned under the text and compared right-to-left; a mismatch lets it shift forward by several positions at once)

We match the pattern right-to-left

If we find a bad character α in the text, we can shift


◮ so that the pattern skips α, if α is not in the pattern
◮ so that the pattern lines up with the rightmost occurrence of α
in the pattern, if the pattern contains α
◮ so that a pattern prefix lines up with a suffix of the current
partial (or complete) match

In essence, this is the Boyer-Moore algorithm
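The bad-character rule alone already gives long skips. The sketch below is the simplified Horspool variant of Boyer-Moore, which uses only a rightmost-occurrence table and omits the prefix/suffix rule from the last bullet:

```python
def horspool(T, P):
    """Boyer-Moore-Horspool: compare right-to-left, then shift by the distance
    of the rightmost occurrence (in P[0..m-2]) of the text character aligned
    with the pattern's last position."""
    n, m = len(T), len(P)
    if m == 0 or n < m:
        return []
    # For each character of P except the last: distance from its rightmost
    # occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(P[:m - 1])}
    shifts = []
    s = 0
    while s <= n - m:
        i = m - 1
        while i >= 0 and T[s + i] == P[i]:    # right-to-left comparison
            i -= 1
        if i < 0:
            shifts.append(s)
            s += 1                            # step by 1 to keep overlapping matches
        else:
            s += shift.get(T[s + m - 1], m)   # skip the full pattern if absent
    return shifts
```

On T = "here is a simple example" with P = "example" it reports the single occurrence at shift 17 after inspecting only a fraction of the text characters.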



Comments on Boyer-Moore
Like KMP, Boyer-Moore includes a pre-processing phase

The pre-processing is O(m)

The search phase is O(nm)

The search phase can be as low as O(n/m) in common cases

In practice, Boyer-Moore is the fastest string-matching algorithm for most applications

