String Matching

Uploaded by

Nidhi Singh
Copyright
© © All Rights Reserved
Algorithm Analysis and Design

String Matching
Overview

• String-searching algorithms are an important class of string algorithms that try to find a place where one or several strings (also called patterns) occur within a larger string or text.

• A basic example of string searching is when the pattern and the searched text are arrays of elements of an alphabet (a finite set) Σ. Σ may be a human-language alphabet, for example, the letters A through Z.
STRING MATCHING ALGORITHMS

There are many types of String Matching Algorithms


• The Naive string-matching algorithm
• The Rabin-Karp algorithm
• String matching with finite automata
• The Knuth-Morris-Pratt algorithm
THE NAIVE ALGORITHM
• The naive algorithm finds all valid shifts using a loop that checks the condition P[1…m] = T[s+1…s+m] for each of the n−m+1 possible values of s (P = pattern, T = text/string, s = shift).
NAIVE-STRING-MATCHER(T, P)
  n = T.length
  m = P.length
  for s = 0 to n − m
    if P[1…m] == T[s+1…s+m]
      print "Pattern occurs with shift" s
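The procedure above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `naive_string_matcher` is chosen here for the example, strings are 0-indexed (so shift s compares `text[s:s+m]`), and the shifts are returned in a list instead of being printed.

```python
def naive_string_matcher(text, pattern):
    """Return every shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):            # the n - m + 1 candidate shifts
        if text[s:s + m] == pattern:      # compares up to m characters
            shifts.append(s)
    return shifts


print(naive_string_matcher("1011101110", "111"))   # [2, 6]
print(naive_string_matcher("ABCABAACAB", "ABAA"))  # [3]
```

Both calls use texts and patterns that appear in the examples of this deck.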
STRING MATCHING PROBLEM

TEXT       A B C A B A A C A B
PATTERN          A B A A
SHIFT=3
EXAMPLE
SUPPOSE T=1011101110 AND P=111. FIND ALL VALID SHIFTS.

T=Text   1 0 1 1 1 0 1 1 1 0
S=0      1 1 1                   mismatch
S=1        1 1 1                 mismatch
S=2          1 1 1               match
S=3            1 1 1             mismatch
S=4              1 1 1           mismatch
S=5                1 1 1         mismatch
S=6                  1 1 1       match
S=7                    1 1 1     mismatch

So, S=2 and S=6 are the valid shifts.
Algorithm Analysis
• It takes O((n−m+1)m) time in the worst case.
• For each of the n−m+1 possible shifts s, the comparison on line 4 (the if test) may examine up to m characters. Hence the worst-case running time is Θ((n−m+1)m), which is Θ(n²) when m = ⌊n/2⌋.
THE RABIN-KARP ALGORITHM
• Rabin and Karp proposed a string matching
algorithm that performs well in practice and that also
generalizes to other algorithms for related
problems, such as two dimensional pattern
matching.
ALGORITHM
• RABIN-KARP-MATCHER(T, P, d, q)
  n = T.length
  m = P.length
  h = d^(m−1) mod q
  p = 0
  t = 0
  for i = 1 to m
    p = (d·p + P[i]) mod q
    t = (d·t + T[i]) mod q
  for s = 0 to n − m
    if p == t
      if P[1…m] == T[s+1…s+m]
        print "Pattern occurs with shift" s
    if s < n − m
      t = (d·(t − T[s+1]·h) + T[s+m+1]) mod q
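A possible Python rendering of RABIN-KARP-MATCHER is sketched below. Several details are assumptions made for the illustration: the function name, the defaults d = 10 and q = 11, the use of `int()` to turn characters into digits (so it only suits strings of decimal digits), and the extra counter of spurious hits, which the pseudocode does not keep.

```python
def rabin_karp_matcher(text, pattern, d=10, q=11):
    """Return (valid_shifts, spurious_hits) for strings of decimal digits.

    d is the radix and q the modulus, as in RABIN-KARP-MATCHER(T, P, d, q).
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0
    h = pow(d, m - 1, q)                  # weight of the leading digit, mod q
    p = t = 0
    for i in range(m):                    # hash the pattern and the first window
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts, spurious = [], 0
    for s in range(n - m + 1):
        if p == t:                        # hash hit: verify character by character
            if text[s:s + m] == pattern:
                shifts.append(s)
            else:
                spurious += 1
        if s < n - m:                     # roll the window one position right
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts, spurious


print(rabin_karp_matcher("31415926535", "26"))  # ([6], 3)
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction inside the rolling update needs no extra correction.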
How many spurious hits does the Rabin-Karp matcher encounter in the text T = 31415926535 when looking for the pattern P = 26?

Here T.length = 11, so take q = 11, which gives P mod q = 26 mod 11 = 4.
Now slide a window of length 2 over T and compare each window value mod q with 4.
T = 3 1 4 1 5 9 2 6 5 3 5

S    WINDOW   WINDOW mod 11   RESULT
0    31       9               not equal to 4
1    14       3               not equal to 4
2    41       8               not equal to 4
3    15       4               equal to 4, SPURIOUS HIT
4    59       4               equal to 4, SPURIOUS HIT
5    92       4               equal to 4, SPURIOUS HIT
6    26       4               equal to 4, EXACT MATCH
7    65       10              not equal to 4
8    53       9               not equal to 4
9    35       2               not equal to 4

Pattern occurs with shift 6, after 3 spurious hits (shifts 3, 4 and 5).

For example, the value for S=1 follows from the value for S=0 by the update rule: 14 ≡ (31 − 3·10)·10 + 4 (mod 11), and 14 mod 11 = 3.
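The same update can be replayed for every shift of this example. The short sketch below assumes d = 10 and q = 11 as above; the helper name `rolling_hashes` is invented for the illustration.

```python
def rolling_hashes(text, m, d=10, q=11):
    """Return the hash of every length-m window of a digit string, mod q."""
    h = pow(d, m - 1, q)                  # d^(m-1) mod q
    t = int(text[:m]) % q                 # hash of the first window
    hashes = [t]
    for s in range(len(text) - m):
        # drop the leading digit text[s], shift left, append text[s + m]
        t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
        hashes.append(t)
    return hashes


print(rolling_hashes("31415926535", 2))  # [9, 3, 8, 4, 4, 4, 4, 10, 9, 2]
```

The four windows with hash 4 are 15, 59, 92 and 26; only the last one equals the pattern.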
Analysis
The running time in the worst case is O((n−m+1)m), since each of the n−m+1 shifts may need an m-character verification, but it has a good average-case running time, i.e. O(n+m).
If the expected number of valid shifts is small and we choose the prime q to be larger than the length of the pattern, then we can expect the Rabin-Karp procedure to use only O(n+m) matching time. Since m ≤ n, this expected matching time is O(n).
String matching with finite automata

• The string-matching automaton is a very effective tool for string matching. It examines each character in the text exactly once and reports all the valid shifts in O(n) time.
The basic idea is to build an automaton
• Each character in the pattern has a state.
• Each match sends the automaton into a new state.
• If all the characters in the pattern have been matched, the
automaton enters the accepting state.
• Otherwise, the automaton will return to a suitable state
according to the current state and the input character.
• The matching takes O(n) time since each character is
examined once.
• The finite automaton begins in state q0 and reads the characters of its input string one at a time. If the automaton is in state q and reads input character a, it moves from state q to state δ(q,a).
[Figure: transition table δ of a two-state automaton over the inputs a and b]
Given pattern: a^(2k+1)
Input string: abaaa
Start state: 0
Accepting state: 1
Finite automata
• A finite automaton M is a 5-tuple (Q, q0, A, Σ, δ), where
• Q is a finite set of states.
• q0 ∈ Q is the start state.
• A ⊆ Q is a distinguished set of accepting states.
• Σ is a finite input alphabet.
• δ is a function from Q × Σ into Q, called the transition function of M.
The automaton accepts the following inputs:
• Strings containing an odd number of a’s and any number of b’s.
• -“aaa”
• -“abb”
• -“bababab”
Rejected (strings containing an even number of a’s):
• -“aabb”
• -“aaaa”
• -“babababa” (four a’s)
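The 5-tuple definition can be made concrete with a small simulator. The transition table used here is an assumption: it is one table that realizes the stated rule (an odd number of a's, any number of b's), with 'a' toggling between states 0 and 1 and 'b' leaving the state unchanged. The names `accepts` and `DELTA` are illustrative.

```python
def accepts(string, delta, start, accepting):
    """Run a deterministic finite automaton and report whether it accepts."""
    q = start
    for ch in string:
        q = delta[(q, ch)]                # one transition per input character
    return q in accepting


# Assumed transition function: 'a' flips the state, 'b' keeps it.
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 0, (1, "b"): 1,
}

for s in ["aaa", "abb", "bababab", "aabb", "aaaa"]:
    print(s, accepts(s, DELTA, start=0, accepting={1}))
```

The first three strings contain an odd number of a's and are accepted; the last two contain an even number and are rejected.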
FINITE-AUTOMATON-MATCHER(T, δ, m)
  n = T.length
  q = 0
  for i = 1 to n
    q = δ(q, T[i])
    if q == m
      print "Pattern occurs with shift" i − m
Example:
• Pattern = aabaaa
• String = aaabaabaaab
Solution
• Build the DFA from the pattern
• Run the DFA on the text
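A sketch of both steps for this example follows. The deck does not give a procedure for building δ, so `compute_transition_function` is an assumption: a direct, unoptimized construction in which δ(q, a) is the length of the longest prefix of the pattern that is a suffix of the first q pattern characters followed by a. It also assumes every text character belongs to the supplied alphabet.

```python
def compute_transition_function(pattern, alphabet):
    """Build delta[(q, a)] for states q = 0..m (unoptimized construction)."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            seen = pattern[:q] + a        # characters matched so far, plus a
            k = min(m, q + 1)
            # shrink k until pattern[:k] is a suffix of what has been seen
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


def finite_automaton_matcher(text, delta, m):
    """Return all shifts at which the automaton reaches the accepting state m."""
    q = 0
    shifts = []
    for i, ch in enumerate(text, start=1):    # i is 1-based, as in the pseudocode
        q = delta[(q, ch)]
        if q == m:
            shifts.append(i - m)
    return shifts


delta = compute_transition_function("aabaaa", "ab")
print(finite_automaton_matcher("aaabaabaaab", delta, 6))  # [4]
```

For the pattern aabaaa and the string aaabaabaaab, the only occurrence starts at shift 4.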
Analysis

• These string-matching automata are very efficient: they examine each text character exactly once, taking constant time per text character. After preprocessing the pattern to build the automaton, the matching time is therefore O(n).
KMP Algorithm
• The algorithm was conceived in 1974 by
Donald Knuth and Vaughan Pratt and
independently by James H. Morris. The
three published it jointly in 1977.
Problem Definition
Given a string ‘S’, the problem of string matching
deals with finding whether a pattern ‘p’ occurs
in ‘S’ and, if ‘p’ does occur, returning the
position in ‘S’ where ‘p’ occurs.
Drawbacks of the O(mn) Approach
• If ‘m’ is the length of pattern ‘p’ and ‘n’ the length of string ‘S’, the matching time is of the order O(mn). This is certainly a very slow-running algorithm. What makes this approach so slow is the fact that elements of ‘S’ that have already been compared are compared again and again in later iterations.
• For example: when a mismatch is detected for the first time in the comparison of p[3] with S[3], pattern ‘p’ is moved one position to the right and the matching procedure resumes from there. The first comparison that then takes place is between p[0]=‘a’ and S[1]=‘b’. Note that S[1]=‘b’ was already involved in a comparison in step 2. This is a repeated use of S[1] in another comparison.
• It is these repeated comparisons that lead to the runtime of O(mn).
• The Knuth-Morris-Pratt algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.

• When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Components of KMP algorithm
The Prefix function, π:
• The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern ‘p’. In other words, it avoids backtracking on the string ‘S’.
The KMP Matcher:
• With string ‘S’, pattern ‘p’ and prefix function ‘π’ as inputs, it finds the occurrences of ‘p’ in ‘S’ and returns the number of shifts of ‘p’ after which each occurrence is found.
The prefix function, π
Compute_Prefix_Function(p)
  m = length[p]
  π[1] = 0
  k = 0
  for q = 2 to m
    while k > 0 and p[k+1] != p[q]
      k = π[k]
    if p[k+1] == p[q]
      k = k + 1
    π[q] = k
  return π
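The prefix function translates to Python almost line by line once the indices are made 0-based (π[q] in the pseudocode becomes `pi[q - 1]`). The matcher shown with it is a commonly used formulation supplied here as an assumption, since the deck describes the KMP matcher only in words; both function names are illustrative.

```python
def compute_prefix_function(p):
    """pi[q] = length of the longest proper prefix of p[:q+1] that is also its suffix."""
    m = len(p)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]                 # fall back to a shorter border
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(s, p):
    """Return all 0-based shifts at which p occurs in s, never moving back in s."""
    n, m = len(s), len(p)
    if m == 0:
        return []
    pi = compute_prefix_function(p)
    shifts = []
    q = 0                                 # number of pattern characters matched
    for i in range(n):
        while q > 0 and p[q] != s[i]:
            q = pi[q - 1]                 # shift the pattern using pi
        if p[q] == s[i]:
            q += 1
        if q == m:                        # full match ending at position i
            shifts.append(i - m + 1)
            q = pi[q - 1]                 # keep going to find later matches
    return shifts


print(compute_prefix_function("ababaca"))      # [0, 0, 1, 2, 3, 0, 1]
print(kmp_matcher("aaabaabaaab", "aabaaa"))    # [4]
```

Each text character is read once and the inner while loops only undo earlier increments of the match length, which is why the matching phase runs in O(n) after O(m) preprocessing.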
