
String Matching Algorithms

Antonio Carzaniga

Faculty of Informatics
University of Lugano

December 23, 2009

© 2007 Antonio Carzaniga 1

Outline
Problem definition

Naïve algorithm

Knuth-Morris-Pratt algorithm

Boyer-Moore algorithm



Problem
Given the text
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
che la dritta via era smarrita. . .

Find the string “trova”

A more challenging example: how many times does the string “110011” appear in the following text?
0011110101011010011000110101111011010111
0110111001001010101011111011110110000101
1011000010111111011110011000011111000100
1001010010111011101011011110101001100101
0010111001000011111110010011011101011010
0110011011101001010010101000010100111110
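A brute-force scan answers this directly. A minimal Python sketch, counting overlapping occurrences; it assumes the six lines form one continuous text:

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    for s in range(len(text) - len(pattern) + 1):
        if text[s:s + len(pattern)] == pattern:
            count += 1
    return count

bits = (
    "0011110101011010011000110101111011010111"
    "0110111001001010101011111011110110000101"
    "1011000010111111011110011000011111000100"
    "1001010010111011101011011110101001100101"
    "0010111001000011111110010011011101011010"
    "0110011011101001010010101000010100111110"
)
print(count_occurrences(bits, "110011"))
```

A quick sanity check: `count_occurrences("1100110011", "110011")` is 2, since the two occurrences overlap.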


String Matching: Definitions


Given a text T
◮ T ∈ Σ∗ : finite alphabet Σ
◮ |T | = n: the length of T is n

Given a pattern P
◮ P ∈ Σ∗ : same finite alphabet Σ
◮ |P| = m: the length of P is m

Both T and P can be modeled as arrays


◮ T [1 . . . n] and P[1 . . . m]

Pattern P occurs with shift s in T iff


◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for all positions 1 ≤ i ≤ m



Example
Problem: find all s such that
◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for 1 ≤ i ≤ m

n = 14
T a b c a a b a a b a b a c a

m=3
P a b a

Result
s=4
s=7
s=9


Naïve Algorithm
For each position s in 0 . . . n − m, see if T [s + i] = P[i] for all
1≤i≤m

Naive-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 for s = 0 to n − m
4 if Substring-At(T , P, s)
5 output(s)

Substring-At(T , P, s)
1 for i = 1 to length(P)
2 if T [s + i] ≠ P[i]
3 return false
4 return true
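The pseudocode above translates almost line for line into Python. A sketch using 0-indexed strings, so the reported values agree with the 0-based shifts s defined earlier:

```python
def naive_string_matching(T, P):
    """Report every shift s (0-indexed) at which P occurs in T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):
        # Substring-At: compare P with T[s .. s+m-1]
        if all(T[s + i] == P[i] for i in range(m)):
            shifts.append(s)
    return shifts
```

On the earlier example, `naive_string_matching("abcaabaababaca", "aba")` returns `[4, 7, 9]`.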



Complexity of the Naïve Algorithm
Complexity of Naive-String-Matching is O((n − m + 1)m)

Worst-case example

T = a^n, P = a^m

i.e., T = aa · · · a (n characters), P = aa · · · a (m characters)

So (n − m + 1)m is a tight bound, and the (worst-case) complexity of Naive-String-Matching is

Θ((n − m + 1)m)
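The bound can be checked empirically by instrumenting the naive algorithm; a sketch in which the comparison counter is added only for illustration. On T = a^20, P = a^10 every one of the n − m + 1 = 11 shifts performs all m = 10 comparisons:

```python
def naive_comparisons(T, P):
    """Count the character comparisons performed by the naive algorithm."""
    n, m = len(T), len(P)
    comparisons = 0
    for s in range(n - m + 1):
        for i in range(m):
            comparisons += 1
            if T[s + i] != P[i]:
                break            # mismatch: abandon this shift
    return comparisons

print(naive_comparisons("a" * 20, "a" * 10))  # → 110, i.e. (n - m + 1) * m
```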


Improvement Strategy
Observation

T a b c a a b a a b a b a c a
P a b a

(T [1] = P[1] and T [2] = P[2] match; T [3] = c ≠ P[3] = a does not)

What now?
◮ the naïve algorithm tells us to go back to the second position in
T and to start from the beginning of P
◮ can’t we simply move along through T ?

◮ why?



Improvement Strategy (2)
Here’s a wrong but insightful strategy

Wrong-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 q = 0 // number of characters matched in P
4 s = 0
5 while s < n
6 s = s+1
7 if T [s] == P[q + 1]
8 q = q+1
9 if q == m
10 output(s − m)
11 q = 0
12 else q = 0
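The same (deliberately flawed) strategy as a Python sketch, using 0-indexed strings. On T = aabaaabababaca with P = aab it reports only shift 0 and misses the occurrence at shift 4:

```python
def wrong_string_matching(T, P):
    """The 'wrong' scan: never re-reads a text character, resets q on mismatch."""
    n, m = len(T), len(P)
    q = 0                  # number of characters of P matched so far
    shifts = []
    for s in range(n):     # scan T left to right, one character at a time
        if T[s] == P[q]:
            q += 1
            if q == m:
                shifts.append(s - m + 1)  # 0-indexed shift of the match
                q = 0
        else:
            q = 0          # mismatch: throw away everything matched so far
    return shifts
```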


Improvement Strategy (3)


Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o
P a g o

The scan moves through T one character at a time; “ago” is matched at positions 10–12, so the algorithm outputs the shift 9.

Done. Perfect!

Complexity: Θ(n)



Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a
P a a b

The occurrence at shift 0 is reported, but the occurrence at shift 4 is missed: after a mismatch, q is reset to 0 and the characters that begin the next occurrence are consumed without being re-examined.

So Wrong-String-Matching doesn’t work, but it tells us something useful


Improvement Strategy (5)


Where did Wrong-String-Matching go wrong?

T a a b a a a b a b a b a c a
P a a b

Wrong: by going all the way back to q = 0 we throw away a good prefix of P that we already matched



Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a
P a b a b a c

The match at shift 2 is reported: output(2). Just before that, starting at shift 0, we have matched “ababa”

◮ suffix “aba” can be reused as a prefix


New Strategy
P[1 . . . q] is the prefix of P matched so far

Find the longest prefix of P that is also a suffix of P[2 . . . q]


◮ i.e., find the largest 0 ≤ π < q such that P[q − π + 1 . . . q] = P[1 . . . π ]
◮ π = 0 means that such a prefix does not exist

P a b a b a c

π = 3 (for q = 5, i.e., after matching “ababa”)

Restart from q = π

Iterate as usual

In essence, this is the Knuth-Morris-Pratt algorithm



The Prefix Function
Given a pattern prefix P[1 . . . q], the longest prefix of P that is
also a suffix of P[2 . . . q] depends only on P and q

This prefix is identified by its length π (q)

Because π (q) depends only on P (and q), π can be computed at the beginning by Prefix-Function
◮ we represent π as an array of length m

Example

P a b a b a c

π 0 0 1 2 3 0


The Knuth-Morris-Pratt Algorithm

KMP-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 π = Prefix-Function(P)
4 q = 0 // number of characters matched
5 for i = 1 to n // scan the text left-to-right
6 while q > 0 and P[q + 1] ≠ T [i]
7 q = π [q] // no match: go back using π
8 if P[q + 1] == T [i]
9 q = q+1
10 if q == m
11 output(i − m)
12 q = π [q] // go back for the next match



Prefix Function Algorithm
Computing the prefix function amounts to finding all the
occurrences of a pattern P in itself
In fact, Prefix-Function is remarkably similar to
KMP-String-Matching

Prefix-Function(P)
1 m = length(P)
2 π [1] = 0
3 k = 0
4 for q = 2 to m
5 while k > 0 and P[k + 1] ≠ P[q]
6 k = π [k]
7 if P[k + 1] == P[q]
8 k = k+1
9 π [q] = k
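Both procedures carry over directly to Python. A sketch using 0-indexed strings, so the slides’ π [q] becomes pi[q − 1]:

```python
def prefix_function(P):
    """pi[q] = length of the longest proper prefix of P[0..q] that is also its suffix."""
    m = len(P)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]        # fall back to the next shorter candidate prefix
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_string_matching(T, P):
    """Knuth-Morris-Pratt: report all 0-indexed shifts of P in T in O(n + m) time."""
    pi = prefix_function(P)
    q = 0                        # number of characters of P matched so far
    shifts = []
    for i, c in enumerate(T):
        while q > 0 and P[q] != c:
            q = pi[q - 1]        # no match: go back using pi
        if P[q] == c:
            q += 1
        if q == len(P):
            shifts.append(i - len(P) + 1)
            q = pi[q - 1]        # go back for the next (possibly overlapping) match
    return shifts
```

On P = ababac this computes π = 0 0 1 2 3 0, matching the table above, and on the earlier examples it finds every occurrence, including the overlapping ones that Wrong-String-Matching missed.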


Prefix-Function at Work

Running Prefix-Function on P = ababac (q scans positions 2 . . . m while k tracks the length of the current matched prefix):

P a b a b a c
π 0 0 1 2 3 0


Complexity of KMP
O(n) for the search phase

O(m) for the pre-processing of the pattern

The complexity analysis is non-trivial

Can we do better?


Comments on KMP
Knuth-Morris-Pratt is Ω(n)
◮ KMP will always go through at least n character comparisons
◮ it fixes our “wrong” algorithm in the case of periodic patterns and texts

Perhaps there’s another algorithm that works better in the average case
◮ e.g., in the absence of periodic patterns



A New Strategy

T h e r e i s a s i m p l e e x a m p l e
P e x a m p l e

(the pattern is repeatedly aligned under the text and compared right-to-left; a mismatch lets it shift forward by several positions at once)

We match the pattern right-to-left

If we find a bad character α in the text, we can shift


◮ so that the pattern skips α, if α is not in the pattern
◮ so that the pattern lines up with the rightmost occurrence of α
in the pattern, if the pattern contains α
◮ so that a pattern prefix lines up with a suffix of the current
partial (or complete) match

In essence, this is the Boyer-Moore algorithm
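The bad-character rule alone already gives long skips. The sketch below is the simplified Horspool variant of Boyer-Moore, which uses only a rightmost-occurrence table and omits the prefix/suffix rule from the last bullet:

```python
def horspool(T, P):
    """Boyer-Moore-Horspool: compare right-to-left, then shift by the distance
    of the rightmost occurrence (in P[0..m-2]) of the text character aligned
    with the pattern's last position."""
    n, m = len(T), len(P)
    if m == 0 or n < m:
        return []
    # For each character of P except the last: distance from its rightmost
    # occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(P[:m - 1])}
    shifts = []
    s = 0
    while s <= n - m:
        i = m - 1
        while i >= 0 and T[s + i] == P[i]:    # right-to-left comparison
            i -= 1
        if i < 0:
            shifts.append(s)
            s += 1                            # step by 1 to keep overlapping matches
        else:
            s += shift.get(T[s + m - 1], m)   # skip the full pattern if absent
    return shifts
```

On T = "here is a simple example" with P = "example" it reports the single occurrence at shift 17 after inspecting only a fraction of the text characters.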



Comments on Boyer-Moore
Like KMP, Boyer-Moore includes a pre-processing phase

The pre-processing is O(m)

The search phase is O(nm)

The search phase can be as low as O(n/m) in common cases

In practice, Boyer-Moore is the fastest string-matching algorithm for most applications

