String Matching

The document provides an outline and overview of different string matching algorithms: Naive, Rabin-Karp, and Knuth-Morris-Pratt (KMP). It defines the string matching problem, describes the naive algorithm and its weaknesses, then introduces the Rabin-Karp and KMP algorithms as improvements over the naive approach by utilizing hashing and preprocessing respectively to reduce runtime complexity.

Uploaded by

Tanmay Thaware

Outline

String Matching
Introduction
Naive Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm
Introduction
What is string matching?
Finding all occurrences of a pattern in a given text (or
body of text)
Many applications
While using editor/word processor/browser
Login name & password checking
Virus detection
Header analysis in data communications
DNA sequence analysis, Web search engines (e.g.
Google), image analysis
String-Matching Problem
The text is in an array T [1..n] of length n
The pattern is in an array P [1..m] of
length m
Elements of T and P are characters from a
finite alphabet Σ
E.g., Σ = {0, 1} or Σ = {a, b, ..., z}
Usually T and P are called strings of
characters
String-Matching Problem (contd.)
We say that pattern P occurs with shift s in text T if:
a) 0 ≤ s ≤ n-m and
b) T[(s+1)..(s+m)] = P[1..m]
If P occurs with shift s in T, then s is a
valid shift; otherwise s is an invalid shift
String-matching problem: finding all valid
shifts for a given T and P
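The two conditions above can be sketched as a small predicate. This is only an illustration: the function name `is_valid_shift` is an assumption, not something from the slides, and Python strings are 0-based, so the slide's T[(s+1)..(s+m)] becomes the slice `text[s:s+m]`.

```python
def is_valid_shift(text: str, pattern: str, s: int) -> bool:
    """Return True if pattern occurs in text with shift s (hypothetical helper)."""
    n, m = len(text), len(pattern)
    # Condition (a): 0 <= s <= n - m
    if not 0 <= s <= n - m:
        return False
    # Condition (b): T[(s+1)..(s+m)] = P[1..m], written with 0-based slicing
    return text[s:s + m] == pattern


print(is_valid_shift("abcabaabcabac", "abaa", 3))  # True
print(is_valid_shift("abcabaabcabac", "abaa", 0))  # False
```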
Example 1
index:      1 2 3 4 5 6 7 8 9 10 11 12 13
text T:     a b c a b a a b c a  b  a  c
pattern P:        a b a a                  (s = 3)

Shift s = 3 is a valid shift
(n = 13, m = 4, and 0 ≤ s ≤ n-m holds)
Example 2
pattern P:  a b a a  (m = 4)
index:      1 2 3 4 5 6 7 8 9 10 11 12 13
text T:     a b c a b a a b c a  b  a  a
s = 3:            a b a a
s = 9:                        a  b  a  a
Naive String-Matching Algorithm
Input: Text strings T[1..n] and P[1..m]
Result: All valid shifts displayed

NAIVE-STRING-MATCHER (T, P)
n ← length[T]
m ← length[P]
for s ← 0 to n-m
    if P[1..m] = T[(s+1)..(s+m)]
        print "pattern occurs with shift" s
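The pseudocode above could be written in Python roughly as follows. This is a minimal sketch: the function name is an assumption, it returns the valid shifts instead of printing them, and it uses 0-based slicing in place of the slide's 1-based arrays.

```python
def naive_string_matcher(text: str, pattern: str) -> list:
    """Return every valid shift s of pattern in text (0 <= s <= n - m)."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # Compare the whole window T[(s+1)..(s+m)] against P[1..m]
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


# Text and pattern from Example 2
print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```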
Naive Algorithm
The naive algorithm checks, at every position in the text from 0 to n-m,
whether an occurrence of the pattern starts there.
After each attempt, it shifts the pattern by exactly one position to
the right.
Example (from left to right):
a b c a b c a
a b c a           (shift = 0)
  a b c a         (shift = 1)
    a b c a       (shift = 2)
      a b c a     (shift = 3)
Analysis: Worst-case Example
index:      1 2 3 4 5 6 7 8 9 10 11 12 13
text T:     a a a a a a a a a a  a  a  a
pattern P:  a a a b
              a a a b
Worst-case Analysis
There are m comparisons for each shift in the
worst case
There are n-m+1 shifts
So, the worst-case running time is Θ((n-m+1)m)
In the example on the previous slide, we have (13-4+1)·4 = 40
comparisons in total
The naive method is inefficient because information
from a shift is not used again
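To check the count for the worst-case example, the character comparisons can be tallied directly. This is a sketch under the assumption that the naive matcher compares left to right and stops at the first mismatch within each shift; `count_comparisons` is a hypothetical helper name.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by a left-to-right naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # mismatch: give up on this shift
    return comparisons


# 13 copies of 'a' against "aaab": 10 shifts, 4 comparisons each
print(count_comparisons("a" * 13, "aaab"))  # 40
```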
Naive Algorithm
Example (from right to left):
a b c a b c a
      a b c a     (shift = 3)
    a b c a       (shift = 2)
  a b c a         (shift = 1)
a b c a           (shift = 0)
The pattern occurs with shifts 0 and 3
Rabin-Karp Algorithm
Has a worst-case running time of O((n-m+1)m),
but the average case is O(n+m)
Also works well in practice
Based on the number-theoretic notion of
modular equivalence
We assume that Σ = {0, 1, 2, ..., 9}, i.e.,
each character is a decimal digit
In general, use radix d, where d = |Σ|
Rabin-Karp Approach
We can view a string of k characters (digits)
as a length-k decimal number
E.g., the string 31425 corresponds to the
decimal number 31,425
Given a pattern P [1..m], let p denote the
corresponding decimal value
Given a text T[1..n], let t_s denote the decimal
value of the length-m substring T[(s+1)..(s+m)]
for s = 0, 1, ..., n-m
Rabin-Karp Approach (contd.)
t_s = p iff T[(s+1)..(s+m)] = P[1..m]
s is a valid shift iff t_s = p
p can be computed in O(m) time using Horner's rule:
p = P[m] + 10 (P[m-1] + 10 (P[m-2] + ... + 10 (P[2] + 10 P[1]) ...))
t_0 can similarly be computed in O(m) time
The other values t_1, t_2, ..., t_{n-m} can be computed in O(n-m)
time, since t_{s+1} can be computed from t_s in
constant time
Rabin-Karp Approach (contd.)
t_{s+1} = 10 (t_s - 10^{m-1} T[s+1]) + T[s+m+1]
E.g., if T = {..., 3, 1, 4, 1, 5, 2, ...}, m = 5 and t_s = 31,415,
then t_{s+1} = 10 (31415 - 10000·3) + 2 = 14152
Thus we can compute p in Θ(m) time and can
compute t_0, t_1, t_2, ..., t_{n-m} in Θ(n-m+1) time
And we can find all occurrences of the pattern
P[1..m] in text T[1..n] with Θ(m) preprocessing
time and Θ(n-m+1) matching time.
But there is a problem: this assumes p and t_s are small numbers
They may be too large to work with easily
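The Horner-style computation of t_0 and the constant-time update from t_s to t_{s+1} might look like this in Python. It is a sketch only: `window_values` is a hypothetical name, the text is assumed to be a string of decimal digits, and no modulus is applied yet, so the values grow with m.

```python
def window_values(digits: str, m: int) -> list:
    """Return [t_0, t_1, ..., t_{n-m}] for a string of decimal digits."""
    n = len(digits)
    if m <= 0 or m > n:
        return []
    # Horner's rule for t_0: O(m) time
    t = 0
    for ch in digits[:m]:
        t = 10 * t + int(ch)
    values = [t]
    high = 10 ** (m - 1)  # weight of the high-order digit
    for s in range(n - m):
        # t_{s+1} = 10 (t_s - 10^{m-1} T[s+1]) + T[s+m+1]
        # 0-based: digits[s] leaves the window, digits[s + m] enters it
        t = 10 * (t - high * int(digits[s])) + int(digits[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```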
Rabin-Karp Approach (contd.)
Solution: we can use modular arithmetic with a suitable modulus, q
E.g.,
t_{s+1} ≡ (10 (t_s - T[s+1]·h) + T[s+m+1]) (mod q)
where h ≡ 10^{m-1} (mod q)
q is chosen as a small prime number; e.g.,
13 for radix 10
Generally, if the radix is d, then dq should fit
within one computer word
How values modulo 13 are computed
text window: 3 1 4 1 5 2
old high-order digit: 3, new low-order digit: 2
31415 ≡ 7 (mod 13), 14152 ≡ 8 (mod 13)

14152 ≡ ((31415 - 3·10000)·10 + 2) (mod 13)
      ≡ ((7 - 3·3)·10 + 2) (mod 13)
      ≡ 8 (mod 13)
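The same update can be carried out entirely modulo q, so no intermediate value exceeds roughly d·q. A minimal sketch, assuming decimal digits, radix d = 10 and q = 13 as in the example; the function name `window_values_mod` is illustrative.

```python
def window_values_mod(digits: str, m: int, q: int = 13, d: int = 10) -> list:
    """Return [t_0 mod q, ..., t_{n-m} mod q] using the rolling update."""
    n = len(digits)
    if m <= 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # h = d^(m-1) mod q
    t = 0
    for ch in digits[:m]:
        t = (d * t + int(ch)) % q
    values = [t]
    for s in range(n - m):
        # t_{s+1} = (d (t_s - T[s+1] h) + T[s+m+1]) mod q
        # Python's % always returns a value in 0..q-1, even for negatives
        t = (d * (t - int(digits[s]) * h) + int(digits[s + m])) % q
        values.append(t)
    return values


print(window_values_mod("314152", 5))  # [7, 8]
```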
Problem of Spurious Hits
t_s ≡ p (mod q) does not imply that t_s = p
Modular equivalence does not necessarily mean
that two integers are equal
A case in which t_s ≡ p (mod q) but t_s ≠ p is
called a spurious hit
On the other hand, if two integers are not
modular equivalent, then they cannot be
equal
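Example
pattern P:  3 1 4 1 5,  and 31415 mod 13 = 7

index:      1 2 3 4 5 6 7 8 9 10 11 12 13 14
text T:     2 3 1 4 1 5 2 6 7 3  9  9  2  1

shift s:      0  1  2  3  4  5   6   7  8  9
t_s mod 13:   1  7  8  4  5  10  11  7  9  11

s = 1: valid match (window 3 1 4 1 5)
s = 7: spurious hit (window 6 7 3 9 9, and 67399 mod 13 = 7)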
Rabin-Karp Algorithm
Basic structure like the naive algorithm,
but uses modular arithmetic as described
For each hit, i.e., for each s where t_s ≡ p
(mod q), verify character by character
whether s is a valid shift or a spurious hit
In the worst case, every shift is verified
Running time can be shown as O((n-m+1)m)
Average-case running time is O(n+m)
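Putting the pieces together, a Rabin-Karp matcher could be sketched as below. The assumptions are the same as in the slides (decimal-digit strings, radix d = 10, q = 13); the function name and the choice to also return the spurious hits are illustrative, not part of the original algorithm statement.

```python
def rabin_karp(text: str, pattern: str, q: int = 13, d: int = 10):
    """Return (valid_shifts, spurious_hits) for digit strings text and pattern."""
    n, m = len(text), len(pattern)
    matches, spurious = [], []
    if m == 0 or m > n:
        return matches, spurious
    h = pow(d, m - 1, q)  # d^(m-1) mod q
    p = t = 0
    for i in range(m):  # preprocessing: p and t_0, both mod q
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    for s in range(n - m + 1):
        if p == t:
            # A hit: verify character by character
            if text[s:s + m] == pattern:
                matches.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            # Roll the window one position to the right
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return matches, spurious


print(rabin_karp("23141526739921", "31415"))  # ([1], [7])
```

With a modulus as small as 13, spurious hits are frequent; a larger prime q makes them rarer at the same cost per shift.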
3. The KMP Algorithm
The Knuth-Morris-Pratt (KMP) algorithm
looks for the pattern in the text in a left-to-right
order (like the brute force algorithm).
But it shifts the pattern more intelligently
than the brute force algorithm.
If a mismatch occurs between the text and
pattern P at P[j], what is the most we can shift the
pattern to avoid wasteful comparisons?
Example
Suppose the mismatch occurs at j == 5.
Find the largest prefix (start) of:
"a b a a b" ( P[0..j-1] )
which is a suffix (end) of:
"b a a b" ( P[1..j-1] )
Answer: "a b"
Set j = 2 // the new j value
KMP Failure Function
KMP preprocesses the pattern to find
matches of prefixes of the pattern with the
pattern itself.
j = mismatch position in P[]
k = position before the mismatch (k = j-1).
The failure function F(k) is defined as the
size of the largest prefix of P[0..k] that is
also a suffix of P[1..k].
Failure Function Example
(k == j-1)
P: "abaaba"
j: 012345

k     0 1 2 3 4
F(k)  0 0 1 1 2

F(k) is the size of the largest prefix.
In code, F() is represented by an array, like
the table.
Why is F(4) == 2?
P: "abaaba"
F(4) means
find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that
is also a suffix of "baab"
= find the size of "ab"
= 2
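The definition of F(k) can be checked by brute force. The sketch below follows the definition literally (largest prefix of P[0..k] that is also a suffix of P[1..k]); it is deliberately not the linear-time construction that KMP normally uses, and the name `failure_by_definition` is an assumption.

```python
def failure_by_definition(pattern: str) -> list:
    """Compute F(k) for every k straight from the definition (slow)."""
    F = []
    for k in range(len(pattern)):
        prefix_src = pattern[:k + 1]   # P[0..k]
        suffix_src = pattern[1:k + 1]  # P[1..k]
        best = 0
        # Try the longest candidate first and stop at the first match
        for size in range(len(suffix_src), 0, -1):
            if prefix_src[:size] == suffix_src[len(suffix_src) - size:]:
                best = size
                break
        F.append(best)
    return F


print(failure_by_definition("abaaba"))  # [0, 0, 1, 1, 2, 3]
```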
Using the Failure Function
The Knuth-Morris-Pratt algorithm modifies the
brute-force algorithm.
if a mismatch occurs at P[j]
(i.e. P[j] != T[i]), then
    k = j-1;
    j = F(k); // obtain the new j
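Example
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

k     0 1 2 3 4
F(k)  0 0 1 0 1

Comparisons 1-6:   P aligned with T[0..5]; mismatch at j = 5, so j = F(4) = 1
Comparison 7:      P aligned with T[4..9]; mismatch at j = 1, so j = F(0) = 0
Comparisons 8-12:  P aligned with T[5..10]; mismatch at j = 4, so j = F(3) = 0
Comparison 13:     P aligned with T[9..14]; mismatch at j = 0, so move on to the next text character
Comparisons 14-19: P aligned with T[10..15]; all six characters match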
Why is F(4) == 1?
P: "abacab"
F(4) means
find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that
is also a suffix of "baca"
= find the size of "a"
= 1
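A compact KMP matcher built on the failure function might look like the following sketch. The function names are assumptions, indices are 0-based as in the KMP slides, and the case j == 0 (where there is no k = j-1 to look up) is handled by simply advancing the text index i.

```python
def compute_failure(pattern: str) -> list:
    """Linear-time failure function: F[k] as defined for P[0..k]."""
    m = len(pattern)
    F = [0] * m
    j = 0  # length of the prefix matched so far
    for i in range(1, m):
        while j > 0 and pattern[i] != pattern[j]:
            j = F[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        F[i] = j
    return F


def kmp_search(text: str, pattern: str) -> list:
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    F = compute_failure(pattern)
    matches = []
    i = j = 0
    while i < n:  # i never moves backwards in the text
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:  # full match ending at text[i - 1]
                matches.append(i - m)
                j = F[m - 1]
        elif j > 0:
            j = F[j - 1]  # k = j - 1; j = F(k)
        else:
            i += 1  # mismatch at j == 0: try the next text character
    return matches


print(compute_failure("abacab"))                         # [0, 0, 1, 0, 1, 2]
print(kmp_search("abacaabaccabacabaabb", "abacab"))      # [10]
```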
KMP Advantages
KMP runs in optimal time: O(m+n)
very fast

The algorithm never needs to move
backwards in the input text, T
this makes the algorithm good for processing
very large files that are read in from external
devices or through a network stream
KMP Disadvantages
KMP doesn't work as well as the size of the
alphabet increases
more chance of a mismatch (more possible
mismatches)
mismatches tend to occur early in the pattern,
but KMP is faster when the mismatches occur
later
