String Matching Problem
Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
Example: T = AGCTTGA, P = GCT
Applications:
  Searching for keywords in a file
  Search engines (like Google and Openfind)
  Database searching (GenBank)
Brute-force
algorithm brute-force:
  input: an array of characters T (the string to be analyzed), length n
         an array of characters P (the pattern to be searched for), length m
  for i := 0 to n-m do
    for j := 0 to m-1 do
      compare T[i+j] with P[j]
      if not equal, exit the inner loop
    if the inner loop finished without a mismatch, report an occurrence of P at shift i
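The brute-force procedure can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `brute_force_search` and the choice to return 0-based shifts as a list are assumptions made for the example.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift i at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for i in range(n - m + 1):          # candidate shifts 0 .. n-m
        j = 0
        # Compare text[i+j] with pattern[j] until a mismatch or the pattern ends.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # no mismatch: pattern occurs at shift i
            shifts.append(i)
    return shifts


print(brute_force_search("AGCTTGA", "GCT"))  # [1]
```

Each of the n-m+1 shifts may need up to m comparisons, so the worst case is O(nm) character comparisons.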
Example
Compare each character of p with S; if they match, continue, else shift p one position to the right.
String S:  a b c a b a a b c a b a c
Pattern p: a b a a
  S: a b c a b a a b c a b a c
  p: a b a a
  (the first two characters match; the third, c against a, is a mismatch)
Since a mismatch is detected, shift p one position to the right and repeat the matching procedure (steps 1 to 3) from the new position. Do the same at every position where a mismatch is detected.
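To make the shift-by-one behaviour concrete, the sketch below records, for every shift, how many characters matched before the comparison stopped. The helper name `trace_brute_force` is hypothetical and exists only for this illustration; the strings are the S and p from the example.

```python
def trace_brute_force(text: str, pattern: str) -> list[tuple[int, int]]:
    """For each 0-based shift, record how many leading pattern characters matched."""
    n, m = len(text), len(pattern)
    trace = []
    for shift in range(n - m + 1):
        matched = 0
        while matched < m and text[shift + matched] == pattern[matched]:
            matched += 1
        trace.append((shift, matched))   # matched == m means a full occurrence
    return trace


S, p = "abcabaabcabac", "abaa"
for shift, matched in trace_brute_force(S, p):
    status = "match" if matched == len(p) else f"mismatch after {matched} matching character(s)"
    print(f"shift {shift}: {status}")
```

The first alignment fails on its third character, the next two fail immediately, and the pattern is found at 0-based shift 3.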
Knuth-Morris-Pratt algorithm
Algorithm Compute-Prefix-Function(P)
  m ← length[P]
  π[1] ← 0
  k ← 0
  for q ← 2 to m
      do while k > 0 and P[k + 1] ≠ P[q]
             do k ← π[k]
         /* the while-loop exits when k = 0 or P[k + 1] = P[q] */
         if P[k + 1] = P[q]
            then k ← k + 1
         π[q] ← k
  return π
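A possible Python rendering of Compute-Prefix-Function is sketched below. It assumes 0-based indexing, so `pi[q]` here corresponds to π[q + 1] in the pseudocode; the function name `compute_prefix_function` is chosen for the example.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of pattern[:q+1] (0-based indexing)."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                # length of the current matched prefix
    for q in range(1, m):
        # Fall back to shorter borders until the next character can extend one.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababababca"))  # [0, 0, 1, 2, 3, 4, 5, 6, 0, 1]
```

The variable k only grows by one per iteration of the outer loop, so the total work of the inner while-loop is bounded by m and the function runs in O(m) time.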
Algorithm KMP-Matcher(T, P)
  n ← length[T]
  m ← length[P]
  π ← Compute-Prefix-Function(P)
  q ← 0
  for i ← 1 to n
      do while q > 0 and P[q + 1] ≠ T[i]
             do q ← π[q]
         if P[q + 1] = T[i]
            then q ← q + 1
         if q = m
            then print "pattern occurs with shift" i - m
                 q ← π[q]
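The matcher can be sketched in Python along the same lines. This is an illustrative version under a few assumptions: indexing is 0-based, shifts are collected in a list rather than printed, an empty pattern is treated as having no occurrences, and the names `kmp_search` and `compute_prefix_function` are chosen for the example.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper border of pattern[:q+1]."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []                        # assumption: empty pattern yields no matches
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                                # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]                # next character does not match: fall back
        if pattern[q] == ch:
            q += 1
        if q == m:                       # whole pattern matched, ending at text[i]
            shifts.append(i - m + 1)
            q = pi[q - 1]                # keep going to find overlapping occurrences
    return shifts


print(kmp_search("AGCTTGA", "GCT"))      # [1]
```

The text index i never moves backwards, which is what gives the O(n + m) running time compared with O(nm) for the brute-force method.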
Example: computing π for P = ababababca (steps q = 5 to q = 10)
k = 2, q = 5:  P[k + 1] = P[3] = a, P[q] = P[5] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[5] ← 3)
k = 3, q = 6:  P[k + 1] = P[4] = b, P[q] = P[6] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[6] ← 4)
k = 4, q = 7:  P[k + 1] = P[5] = a, P[q] = P[7] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[7] ← 5)
k = 5, q = 8:  P[k + 1] = P[6] = b, P[q] = P[8] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[8] ← 6)
k = 6, q = 9:  P[k + 1] = P[7] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[6] = 4)
               P[k + 1] = P[5] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[4] = 2)
               P[k + 1] = P[3] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[2] = 0)
k = 0, q = 9:  P[k + 1] = P[1] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; π[q] ← k (π[9] ← 0)
k = 0, q = 10: P[k + 1] = P[1] = a, P[q] = P[10] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[10] ← 1)
i      1  2  3  4  5  6  7  8  9  10
P[i]   a  b  a  b  a  b  a  b  c  a
π[i]   0  0  1  2  3  4  5  6  0  1
π[8] = 6, π[6] = 4, π[4] = 2, π[2] = 0
Iterating π from q = 8 lists every prefix of P that is also a suffix of P8 (Pk denotes the first k characters of P):
  P8:  a b a b a b a b
  P6:      a b a b a b
  P4:          a b a b
  P2:              a b
  P0:  (empty string)
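The chain π[8] = 6, π[6] = 4, π[4] = 2, π[2] = 0 can be reproduced with a short sketch. The helper `prefix_chain` is a hypothetical name introduced for this illustration; it takes the 1-based prefix length q used in the text and repeatedly applies π until it reaches 0.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper border of pattern[:q+1]."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def prefix_chain(pattern: str, q: int) -> list[int]:
    """Apply the prefix function repeatedly, starting from prefix length q."""
    pi = compute_prefix_function(pattern)
    chain = []
    while q > 0:
        q = pi[q - 1]                    # pi is 0-based, q is a prefix length
        chain.append(q)
    return chain


P = "ababababca"
for k in prefix_chain(P, 8):
    # Each P[:k] is both a prefix of P and a suffix of P[:8].
    print(k, repr(P[:k]), P[:8].endswith(P[:k]))
```

These are exactly the fallback values that the while-loop visits when a mismatch occurs after 8 matched characters.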
Phase 2
First finish the prefix computation
f(4-1)+1 = f(3)+1 = 0+1 = 1
Phase 1 matched
f(13-1)+1= 4+1=5