
Module9_08

String Matching
String Matching Algorithms

Topics
- Basics of Strings
- Brute-force String Matcher
- Rabin-Karp String Matching Algorithm
- KMP Algorithm
A string-matching problem asks for the occurrences of a pattern in a text.

These problems find applications in text processing, text editing, computer security, and DNA sequence analysis.
Examples:
- Find and Change in word processing
- Sequence of the human cyclophilin 40 gene:
  CCCAGTCTGG AATACAGTGG CGCGATCTCG GTTCACTGCA
  ACCGCCGCCT CCCGGGTTCA AACGATTCTC CTGCCTCAGC
  CGCGATCTCG: DNA binding protein GATA-1
  CCCGGG: DNA binding protein Sma 1

C: Cytosine, G: Guanine, A: Adenine, T: Thymine


CSE5311 Kumar
Text: T[1..n] of length n. Pattern: P[1..m] of length m.
The elements of P and T are characters drawn from a finite alphabet Σ.
For example, Σ = {0, 1}, Σ = {a, b, ..., z}, or Σ = {c, g, a, t}.
The character arrays P and T are also referred to as strings of characters.
Pattern P is said to occur with shift s in text T
if 0 ≤ s ≤ n-m and
T[s+1..s+m] = P[1..m], that is,
T[s+j] = P[j] for 1 ≤ j ≤ m.
Such a shift is called a valid shift.
The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.
Brute force string-matching algorithm

The goal is to find all valid shifts, i.e. all values of s such that
P[1..m] = T[s+1..s+m].
There are n-m+1 possible values of s.

Procedure BF_String_Matcher(T, P)

n ← length[T]
m ← length[P]
for s ← 0 to n-m
    do if P[1..m] = T[s+1..s+m]
        then shift s is valid

This algorithm takes Θ((n-m+1)m) time in the worst case.
Example: T = acaabc, P = aab

s = 0:  a c a a b c
        a a b            mismatch

s = 1:  a c a a b c
          a a b          mismatch

s = 2:  a c a a b c
            a a b        matches
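
The brute-force procedure can be sketched in Python as follows. This is a minimal illustration rather than the slides' own code: the function name `brute_force_matcher` is an assumption, and shifts are reported 0-based, so shift s here corresponds to T[s+1..s+m] in the slide notation.

```python
def brute_force_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n-m+1 candidate shifts in turn.
    for s in range(n - m + 1):
        # The slice comparison costs up to m character comparisons,
        # which is where the (n-m+1)m worst case comes from.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(brute_force_matcher("acaabc", "aab"))  # [2]
```

The call reproduces the example trace: shifts 0 and 1 fail, and shift 2 is the only valid shift.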

Rabin-Karp Algorithm

Let Σ = {0, 1, 2, ..., 9}.

We can view a string of k consecutive characters as representing a length-k decimal number.
Let p denote the decimal number for P[1..m].
Let t_s denote the decimal value of the length-m substring T[s+1..s+m] of T[1..n], for s = 0, 1, ..., n-m.

t_s = p if and only if T[s+1..s+m] = P[1..m], i.e. if and only if s is a valid shift.

p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10(P[2] + 10 P[1]) ... ))

We can compute p in O(m) time.
Similarly, we can compute t_0 from T[1..m] in O(m) time.
Example with m = 4:
6378 = 8 + 7 × 10 + 3 × 10^2 + 6 × 10^3
     = 8 + 10(7 + 10(3 + 10(6)))
     = 8 + 70 + 300 + 6000

p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10(P[2] + 10 P[1]) ... ))
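
The nested form above (Horner's rule) translates directly into a loop that uses one multiplication and one addition per digit, which is why p costs O(m). The sketch below is illustrative only; the name `horner_value` and the optional radix parameter `d` are assumptions, and the input is assumed to consist of digit characters.

```python
def horner_value(digits: str, d: int = 10) -> int:
    """Evaluate a digit string as a radix-d number, most significant digit first."""
    value = 0
    for ch in digits:
        # Shift the digits accumulated so far one place left, then add the next digit.
        value = d * value + int(ch)
    return value


print(horner_value("6378"))  # 6378
```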

t_{s+1} can be computed from t_s in constant time:

t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]

Example: T = 314152
t_s = 31415, s = 0, m = 5 and T[s+m+1] = 2

t_{s+1} = 10(31415 - 10000 × 3) + 2 = 14152

Thus p and t_0, t_1, ..., t_{n-m} can all be computed in O(n+m) time, assuming each arithmetic operation takes constant time.
All occurrences of the pattern P[1..m] in the text T[1..n] can then be found in time O(n+m).
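
The constant-time update can be sketched as below. This is an illustration under stated assumptions: the helper name `window_values` is hypothetical, the text is assumed to contain only decimal digits, and indices are 0-based, so `text[s]` plays the role of T[s+1] and `text[s + m]` the role of T[s+m+1].

```python
def window_values(text: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m}: the decimal value of every length-m window."""
    n = len(text)
    if m <= 0 or m > n:
        return []
    high = 10 ** (m - 1)   # place value of the leading digit of a window
    t = int(text[:m])      # t_0, computed once in O(m)
    values = [t]
    for s in range(n - m):
        # Drop the leading digit, shift left one place, append the next digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```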

However, p and t_s may be too large to work with conveniently.
The computation of p, t_0, and the recurrence is therefore done modulo q.

In general, with a d-ary alphabet {0, 1, ..., d-1}, q is chosen such that dq fits within a computer word.

The recurrence equation can be rewritten as

t_{s+1} = (d(t_s - T[s+1] h) + T[s+m+1]) mod q,

where h = d^{m-1} (mod q) is the value of the digit "1" in the high-order position of an m-digit text window.
Note that t_s ≡ p (mod q) does not imply that t_s = p.
However, if t_s is not equivalent to p mod q, then t_s ≠ p, and the shift s is invalid.
We use t_s ≡ p (mod q) as a fast heuristic test to rule out invalid shifts.
Further testing is done to eliminate spurious hits:
- an explicit test to check whether P[1..m] = T[s+1..s+m]
t_{s+1} = (d(t_s - T[s+1] h) + T[s+m+1]) mod q
h = d^{m-1} (mod q)

Example: T = 31415, P = 26, n = 5, m = 2, d = 10, q = 11

h = 10^1 mod 11 = 10
p = 26 mod 11 = 4
t_0 = 31 mod 11 = 9
t_1 = (10(9 - (3 × 10 mod 11)) + 4) mod 11
    = (10(9 - 8) + 4) mod 11 = 14 mod 11 = 3
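
The modular recurrence can be checked with a short sketch. The helper name `window_hashes` is an assumption, the text is assumed to consist of decimal digits, and indices are 0-based. Note that Python's `%` returns a non-negative result for a positive modulus, so the subtraction needs no extra correction; in languages where `%` can return a negative value, adding q before reducing would be needed.

```python
def window_hashes(text: str, m: int, d: int = 10, q: int = 11) -> list[int]:
    """Return t_s mod q for every length-m window of a digit string."""
    n = len(text)
    if m <= 0 or m > n:
        return []
    h = pow(d, m - 1, q)   # d^(m-1) mod q
    t = 0
    for i in range(m):     # t_0 by Horner's rule, reduced mod q at every step
        t = (d * t + int(text[i])) % q
    hashes = [t]
    for s in range(n - m):
        # Remove the leading digit's contribution, shift, append the next digit.
        t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
        hashes.append(t)
    return hashes


print(window_hashes("31415", 2))  # [9, 3, 8, 4]
```

The first two values match t_0 = 9 and t_1 = 3 from the example. The last window, 15, hashes to 4, the same value as p = 26 mod 11, although 15 ≠ 26: this is a spurious hit that the explicit comparison has to reject.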

Procedure RABIN-KARP-MATCHER(T, P, d, q)
Input: text T, pattern P, radix d (which is typically = |Σ|), and the prime q.
Output: valid shifts s where P matches
1. n ← length[T]
2. m ← length[P]
3. h ← d^{m-1} mod q
4. p ← 0
5. t_0 ← 0
6. for i ← 1 to m
7.     do p ← (d p + P[i]) mod q
8.        t_0 ← (d t_0 + T[i]) mod q
9. for s ← 0 to n-m
10.     do if p = t_s
11.         then if P[1..m] = T[s+1..s+m]
12.             then "pattern occurs with shift s"
13.        if s < n-m
14.         then t_{s+1} ← (d(t_s - T[s+1] h) + T[s+m+1]) mod q
Comments on Rabin-Karp Algorithm

All characters are interpreted as radix-d digits.
h is initialized to the value of the high-order digit position of an m-digit window.
p and t_0 are computed in O(m) time.
The loop of lines 6-8 takes O(m) time.
The loop of line 9 takes Θ((n-m+1)m) time in the worst case, when every shift passes the test in line 10 and must be verified explicitly in line 11.
The overall worst-case running time is Θ((n-m+1)m).
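
A runnable sketch of the procedure follows. It is an illustration under stated assumptions rather than a reference implementation: the function name `rabin_karp_matcher` is hypothetical, characters are assumed to be decimal digits so that `int(ch)` yields the radix-d digit, and shifts are 0-based.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int, q: int) -> list[int]:
    """Return all valid shifts of pattern in text using a rolling hash mod q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                 # line 3: d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # lines 6-8: preprocessing in O(m)
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    for s in range(n - m + 1):           # line 9: every candidate shift
        # Lines 10-11: cheap hash test first, explicit comparison only on a hit.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                    # lines 13-14: roll the window by one
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("31415", "26", 10, 11))  # []
print(rabin_karp_matcher("31415", "15", 10, 11))  # [3]
```

In the first call the window 15 produces a hash hit against p = 4, but the explicit comparison rejects it, so no shift is reported. A general-purpose version would map characters to digits with something like `ord(ch)` instead of `int(ch)`.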

Exercises (Homework)
- Study the Knuth-Morris-Pratt (KMP) algorithm for string matching.
- Study the Boyer-Moore algorithm for string matching.
- Extend the Rabin-Karp method to the problem of searching a text string for an occurrence of any one of a given set of k patterns. Start by assuming that all k patterns have the same length. Then generalize your solution to allow the patterns to have different lengths.
- Let P be a set of n points in the plane. We define the depth of a point p in P as the number of convex hulls that need to be peeled (removed) for p to become a vertex of the convex hull. Design an O(n^2) algorithm to find the depths of all points in P.
- The input is two strings of characters A = a_1, a_2, ..., a_n and B = b_1, b_2, ..., b_n. Design an O(n)-time algorithm to determine whether B is a cyclic shift of A. In other words, the algorithm should determine whether there exists an index k, 1 ≤ k ≤ n, such that a_i = b_((k+i) mod n) for all i, 1 ≤ i ≤ n.
