Module9_08

String Matching Algorithms

Topics
Basics of Strings
Brute-force String Matcher

Rabin-Karp String Matching Algorithm

KMP Algorithm

In string matching problems, it is required to find the
occurrences of a pattern in a text.

These problems find applications in text processing, text editing,
computer security, and DNA sequence analysis.
Find and Change in word processing
Sequence of the human cyclophilin 40 gene
CCCAGTCTGG AATACAGTGG CGCGATCTCG GTTCACTGCA
ACCGCCGCCT CCCGGGTTCA AACGATTCTC CTGCCTCAGC
CGCGATCTCG : DNA binding protein GATA-1
CCCGGG : DNA binding protein Sma 1

C: Cytosine, G: Guanine, A: Adenine, T: Thymine


CSE5311 Kumar
Text: T[1..n] of length n, and pattern: P[1..m] of length m.
The elements of P and T are characters drawn from a finite
alphabet Σ.
For example, Σ = {0, 1}, or Σ = {a, b, ..., z}, or Σ = {c, g, a, t}.
The character arrays P and T are also referred to as
strings of characters.
Pattern P is said to occur with shift s in text T
if 0 ≤ s ≤ n-m and
T[s+1..s+m] = P[1..m], that is,
T[s+j] = P[j] for 1 ≤ j ≤ m.
Such a shift is called a valid shift.
The string-matching problem is the problem of finding all
valid shifts with which a given pattern P occurs in a given
text T.
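
The definition of a valid shift can be checked directly. The sketch below is only an illustration: the function name is_valid_shift is hypothetical, and it assumes 0-based Python strings, so the slide's T[s+1..s+m] becomes the slice text[s:s+m]. The text used is the second line of the cyclophilin 40 sequence shown earlier.

```python
def is_valid_shift(text: str, pattern: str, s: int) -> bool:
    """Return True if pattern occurs in text with shift s.

    The slides index arrays from 1, so T[s+1..s+m] corresponds to the
    0-based Python slice text[s:s+m].
    """
    n, m = len(text), len(pattern)
    if not 0 <= s <= n - m:
        return False  # shift lies outside the range 0 <= s <= n-m
    return text[s:s + m] == pattern


# Second line of the gene sequence from the slide, with spaces removed.
dna = "ACCGCCGCCTCCCGGGTTCAAACGATTCTCCTGCCTCAGC"
print(is_valid_shift(dna, "CCCGGG", 10))
print(is_valid_shift(dna, "CCCGGG", 0))
```

The first call checks the shift at which the CCCGGG site begins; the second checks a shift where the first character already differs.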
Brute force string-matching algorithm

The goal is to find all valid shifts, i.e. all values of s such that
P[1..m] = T[s+1..s+m].
There are n-m+1 possible values of s.

Procedure BF_String_Matcher(T, P)

n ← length[T]
m ← length[P]
for s ← 0 to n-m
    do if P[1..m] = T[s+1..s+m]
        then shift s is valid

This algorithm takes Θ((n-m+1)m) time in the worst case.
Example: T = acaabc, P = aab (n = 6, m = 3)

s = 0:  a c a a b c
        a a b

s = 1:  a c a a b c
          a a b

s = 2:  a c a a b c    matches
            a a b

s = 3:  a c a a b c
              a a b
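
The brute-force procedure can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the name bf_string_matcher is an assumption, shifts are reported as 0-based offsets (which coincide with the slide's values of s), and the inner loop makes explicit the up to m character comparisons hidden inside the test P[1..m] = T[s+1..s+m].

```python
def bf_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s of pattern in text, trying all n-m+1 shifts."""
    n, m = len(text), len(pattern)
    valid_shifts = []
    for s in range(n - m + 1):  # s = 0, 1, ..., n-m
        j = 0
        # Compare character by character; stop at the first mismatch.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched
            valid_shifts.append(s)
    return valid_shifts


print(bf_string_matcher("acaabc", "aab"))  # [2]
```

On the text acaabc and pattern aab, only shift s = 2 is reported. A text such as aaaa...a with pattern aa...a forces all m comparisons at every shift, which is where the Θ((n-m+1)m) worst case comes from.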

Rabin-Karp Algorithm

Let Σ = {0, 1, 2, ..., 9}.

We can view a string of k consecutive characters as
representing a length-k decimal number.
Let p denote the decimal number for P[1..m].
Let t_s denote the decimal value of the length-m
substring T[s+1..s+m] of T[1..n], for s = 0, 1, ..., n-m.

t_s = p if and only if
T[s+1..s+m] = P[1..m], that is, if and only if s is a valid shift.

p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10(P[2] + 10 P[1]) ... ))

We can compute p in O(m) time.
Similarly, we can compute t_0 from T[1..m] in O(m) time.
m = 4
6378 = 8 + 7×10 + 3×10^2 + 6×10^3
     = 8 + 10(7 + 10(3 + 10(6)))
     = 8 + 70 + 300 + 6000

p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10(P[2] + 10 P[1]) ... ))
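
The nested form above is Horner's rule, which needs one multiplication and one addition per digit. A minimal sketch, assuming the pattern is given as a string of decimal digits (the function name horner_value and the radix parameter d are illustrative, not from the slides):

```python
def horner_value(digits: str, d: int = 10) -> int:
    """Evaluate a digit string as a radix-d number using Horner's rule."""
    value = 0
    for ch in digits:  # P[1], P[2], ..., P[m], most significant digit first
        value = d * value + int(ch)
    return value


print(horner_value("6378"))  # 6378
```

The loop runs m times, which is the O(m) bound stated for computing p and t_0.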

t_{s+1} can be computed from t_s in constant time:

t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]

Example: T = 314152
t_s = 31415, s = 0, m = 5 and T[s+m+1] = 2

t_{s+1} = 10(31415 - 10000×3) + 2 = 14152
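
The recurrence can be exercised on the same text. The following sketch is an illustration under stated assumptions: the helper name window_values is hypothetical, the text is a string of decimal digits, and indices are 0-based, so the slide's T[s+1] is text[s] and T[s+m+1] is text[s+m].

```python
def window_values(text: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for a string of decimal digits."""
    n = len(text)
    if m <= 0 or m > n:
        return []
    high = 10 ** (m - 1)  # weight of the leading digit of an m-digit window
    t = 0
    for ch in text[:m]:  # t_0 by Horner's rule, O(m)
        t = 10 * t + int(ch)
    values = [t]
    for s in range(n - m):
        # Drop the leading digit, shift left one place, append the new digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```

Each update uses a fixed number of arithmetic operations, independent of m.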

Thus p and t_0, t_1, ..., t_{n-m} can all be computed in O(n+m)
time, provided each arithmetic operation on these numbers takes constant time,
and all occurrences of the pattern P[1..m] in the text
T[1..n] can be found in O(n+m) time.

However, p and t_s may be too large to work with
conveniently.
The computation of p and t_0 and the recurrence are therefore done modulo q.

In general, with a d-ary alphabet {0, 1, ..., d-1}, q is chosen such that
d·q fits within a computer word.

The recurrence equation can be rewritten as

t_{s+1} = (d(t_s - T[s+1] h) + T[s+m+1]) mod q,

where h = d^{m-1} (mod q) is the value of the digit "1" in the high-order
position of an m-digit text window.
Note that t_s ≡ p (mod q) does not imply that t_s = p.
However, if t_s ≢ p (mod q),
then t_s ≠ p, and the shift s is invalid.
We use t_s ≡ p (mod q) as a fast heuristic test to rule out
invalid shifts.
Further testing is done to eliminate spurious hits:
an explicit test to check whether
P[1..m] = T[s+1..s+m].
t_{s+1} = (d(t_s - T[s+1] h) + T[s+m+1]) mod q
h = d^{m-1} (mod q)
Example:

T = 31415, P = 26, n = 5, m = 2, d = 10, q = 11

h = 10^1 mod 11 = 10
p = 26 mod 11 = 4
t_0 = 31 mod 11 = 9
t_1 = (10(9 - (3×10 mod 11)) + 4) mod 11
    = (10(9 - 8) + 4) mod 11 = 14 mod 11 = 3
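
The modular recurrence can be checked against this example for every shift. A minimal sketch, assuming decimal-digit text and 0-based indexing; the name rolling_hashes is hypothetical. It relies on two standard Python behaviours: the three-argument pow computes d^{m-1} mod q, and % returns a non-negative result for a positive modulus even when the left operand is negative.

```python
def rolling_hashes(text: str, m: int, d: int = 10, q: int = 11) -> list[int]:
    """Return t_s mod q for s = 0, ..., n-m using the modular recurrence."""
    n = len(text)
    if m <= 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # value of digit "1" in the high-order position
    t = 0
    for ch in text[:m]:  # t_0 mod q by Horner's rule
        t = (d * t + int(ch)) % q
    hashes = [t]
    for s in range(n - m):
        t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
        hashes.append(t)
    return hashes


print(rolling_hashes("31415", 2))  # [9, 3, 8, 4]
```

The first two values agree with t_0 = 9 and t_1 = 3 above. The last window, 15, hashes to 4, the same value as p = 26 mod 11, although 15 ≠ 26: shift s = 3 is a spurious hit that only the explicit comparison can rule out.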

Procedure RABIN-KARP-MATCHER(T, P, d, q)
Input: text T, pattern P, radix d (which is typically |Σ|),
and the prime q.
Output: valid shifts s where P matches
1. n ← length[T]
2. m ← length[P]
3. h ← d^{m-1} mod q
4. p ← 0
5. t_0 ← 0
6. for i ← 1 to m
7.     do p ← (d·p + P[i]) mod q
8.        t_0 ← (d·t_0 + T[i]) mod q
9. for s ← 0 to n-m
10.     do if p = t_s
11.         then if P[1..m] = T[s+1..s+m]
12.             then "pattern occurs with shift s"
13.        if s < n-m
14.         then t_{s+1} ← (d(t_s - T[s+1] h) + T[s+m+1]) mod q
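
A Python sketch of the procedure follows. It is an illustration under assumptions rather than the course's reference code: the name rabin_karp_matcher is hypothetical, characters are assumed to be decimal digits (for a general alphabet, int(ch) would be replaced by some mapping of characters to the digits 0..d-1), and indices are 0-based. Comments refer to the numbered lines of the pseudocode.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 10, q: int = 11) -> list[int]:
    """Return all valid shifts of pattern in text (both strings of digits)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # line 3
    p = 0  # line 4
    t = 0  # line 5
    for i in range(m):  # lines 6-8: preprocessing
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    for s in range(n - m + 1):  # line 9
        # Lines 10-12: hash test first, then the explicit comparison.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:  # lines 13-14: roll the window one position
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("31415", "26"))    # []
print(rabin_karp_matcher("314152", "1415"))  # [1]
```

In the first call the hash values agree at one shift, but the explicit comparison rejects it, so no shift is reported.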
Comments on Rabin-Karp Algorithm

All characters are interpreted as radix-d digits.
h is initialized to the value of the high-order digit position of an
m-digit window.
p and t_0 are computed in O(m+m) = O(m) time.
The loop of line 9 takes Θ((n-m+1)m) time in the worst case, when the
explicit test of line 11 runs at every shift.
The loop of lines 6-8 takes O(m) time.
The overall worst-case running time is O((n-m+1)m).
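
To see where the worst case comes from, one can count how often the explicit test is triggered. The sketch below is only an illustration: the name rk_hit_counts and its return convention are assumptions, and the input is again assumed to be decimal digits.

```python
def rk_hit_counts(text: str, pattern: str, d: int = 10, q: int = 11) -> tuple[int, int]:
    """Return (hash_hits, valid_shifts); their difference is the spurious hits."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return (0, 0)
    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    hash_hits = valid = 0
    for s in range(n - m + 1):
        if p == t:
            hash_hits += 1  # the explicit O(m) comparison is needed here
            if text[s:s + m] == pattern:
                valid += 1
        if s < n - m:
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return hash_hits, valid


print(rk_hit_counts("31415", "26"))     # (1, 0)
print(rk_hit_counts("1" * 20, "111"))   # (18, 18)
```

In the second call every one of the n-m+1 = 18 shifts is valid, so the m-character comparison runs at every shift, matching the Θ((n-m+1)m) bound. When hash hits are rare, the comparison is rarely needed and the running time is dominated by the O(n+m) hashing work.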

Exercises

Homework:

- Study the KMP (Knuth-Morris-Pratt) algorithm for string matching.

- Study the Boyer-Moore algorithm for string matching.

- Extend the Rabin-Karp method to the problem of searching a text string for an
occurrence of any one of a given set of k patterns. Start by assuming that all
k patterns have the same length. Then generalize your solution to allow the
patterns to have different lengths.

- Let P be a set of n points in the plane. We define the depth of a point p in P as
the number of convex hulls that need to be peeled (removed) for p to become
a vertex of the convex hull. Design an O(n^2) algorithm to find the depths of all
points in P.

- The input is two strings of characters A = a_1, a_2, ..., a_n and B = b_1, b_2, ...,
b_n. Design an O(n)-time algorithm to determine whether B is a cyclic shift of
A. In other words, the algorithm should determine whether there exists an
index k, 1 ≤ k ≤ n, such that a_i = b_{(k+i) mod n} for all i, 1 ≤ i ≤ n.
