MODULE 3: STRING MATCHING
Naïve String-matching Algorithm, KMP algorithm, Rabin-Karp Algorithm, Suffix Trees
String Matching - Strings
Pattern matching
Exact Matching Algorithms
A string is a sequence of characters (0-indexed)
Examples of strings
C++ Code
HTML document
DNA sequence
Digitized Image
Alphabet (Σ) – the set of possible characters for a family of strings
ASCII or Unicode
{0,1}
{A, C, G, T}
String Matching - Strings
Given a string P of size m
Substring – P[i : j]: the contiguous part of P made up of the characters with indexes i through j (inclusive)
Prefix – a substring of the form P[0 : i]
Suffix – a substring of the form P[i : m-1]
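The three definitions can be illustrated with a minimal C sketch. The helper names substring, prefix and suffix are hypothetical (they are not from the slides), and both index bounds are assumed to be inclusive, as in the notation above.

```c
#include <assert.h>
#include <string.h>

/* Copy P[i : j] (0-indexed, both ends inclusive) into out.
   out must have room for j - i + 2 bytes (characters plus terminator). */
void substring(const char *p, int i, int j, char *out)
{
    int m = (int)strlen(p);
    assert(0 <= i && i <= j && j < m);
    memcpy(out, p + i, (size_t)(j - i + 1));
    out[j - i + 1] = '\0';
}

/* Prefix P[0 : i] */
void prefix(const char *p, int i, char *out)
{
    substring(p, 0, i, out);
}

/* Suffix P[i : m-1] */
void suffix(const char *p, int i, char *out)
{
    substring(p, i, (int)strlen(p) - 1, out);
}
```

For P = "AABAAC", substring(P, 1, 3, buf) gives "ABA", prefix(P, 2, buf) gives "AAB" and suffix(P, 4, buf) gives "AC".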
Given strings T (text) and P (pattern)
The pattern matching problem consists of finding a substring of T equal to P
Applications
Text Editors, compilers
Search Engines
Biological Research
String Matching : Example
Index        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Text (T)     A  A  B  A  A  C  A  A  D  A  A  B  A  A  B  A  A  D  A  A
Pattern (P)  A  A  B  A
Pattern found at indexes 0, 9 and 12
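As a cross-check of this example, the occurrences can be listed with the C standard library function strstr. This is only a sketch: the helper name find_all and its output-array interface are assumptions made for illustration, and strstr is used here as a black box rather than as any of the algorithms in this module.

```c
#include <assert.h>
#include <string.h>

/* Store the start index of every occurrence of pat in txt (overlaps allowed)
   in out[], up to max entries. Returns the number of indexes stored. */
int find_all(const char *txt, const char *pat, int *out, int max)
{
    assert(pat[0] != '\0');              /* an empty pattern is not meaningful here */
    int count = 0;
    const char *hit = strstr(txt, pat);
    while (hit != NULL && count < max) {
        out[count++] = (int)(hit - txt);
        hit = strstr(hit + 1, pat);      /* resume one past the last hit so overlaps are found */
    }
    return count;
}
```

Called with the text and pattern of the example, it stores 0, 9 and 12. The occurrences at 9 and 12 overlap, which is why the search resumes at hit + 1 rather than after the whole match.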
Naïve String Matching Algorithm
#include <stdio.h>
#include <string.h>

void search(char* pat, char* txt)
{
    int M = strlen(pat);
    int N = strlen(txt);

    /* A loop to slide pat[] one by one */
    for (int i = 0; i <= N - M; i++) {
        int j;

        /* For current index i, check for pattern match */
        for (j = 0; j < M; j++) {
            if (txt[i + j] != pat[j])
                break;
        }

        if (j == M) // if pat[0...M-1] = txt[i, i+1, ...i+M-1]
            printf("Pattern found at index %d \n", i);
    }
}

// Driver's code
int main()
{
    char txt[] = "AABAACAADAABAAABAADAA";
    char pat[] = "AABA";

    // Function call
    search(pat, txt);
    return 0;
}
Naïve String Searching Algorithm
Slide the Pattern (P) over Text (T) one by one and check for match
Index        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Text (T)     A  A  B  A  A  C  A  A  D  A  A  B  A  A  B  A  A  D  A  A
Shift 0      A  A  B  A   (match at 0)
Shift 1         A  A  B  A
Shift 2            A  A  B  A
Shift 3               A  A  B  A
Shift 4                  A  A  B  A
Shift 5                     A  A  B  A
Shift 6                        A  A  B  A
Shift 7                           A  A  B  A
Shift 8                              A  A  B  A
Shift 9                                 A  A  B  A   (match at 9) …
Text size – n
Pattern size – m
Each shift takes up to m character comparisons
There are n - m + 1 shifts (about n)
Total: O(mn)
Naïve String Searching Algorithm
Slide the Pattern (P) over Text (T) one by one and check for match
Index        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Text (T)     A  A  B  A  A  C  A  A  D  A  A  B  A  A  B  A  A  D  A  A
Shift   i (text window)   j (pattern)   Result
0       0-3               0-3           match at 0
1       1-4               0-3
2       2-5               0-3
3       3-6               0-3
4       4-7               0-3
5       5-8               0-3
6       6-9               0-3
7       7-10              0-3
8       8-11              0-3
9       9-12              0-3           match at 9 …
Worst case scenario of Naïve approach
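Index        0  1  2  3  4  5  6  7
Text (T)     a  a  a  a  a  a  a  d
Pattern (P)  a  a  a  d                i = 0-3, j = 0-3, X
Pattern (P)     a  a  a  d             i = 1-4, j = 0-3, X
Pattern (P)        a  a  a  d          i = 2-5, j = 0-3, X
Pattern (P)           a  a  a  d       i = 3-6, j = 0-3, X
Pattern (P)              a  a  a  d    i = 4-7, j = 0-3, match
Found a match at index 4
The naïve method fares badly when the text and pattern contain many repeated characters: here every failed shift matches m - 1 characters before the mismatch on the last one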
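The cost of this worst case can be made concrete by counting character comparisons. The sketch below is a variant of the naïve search that stops at the first match; the function name naive_count and the comparisons output parameter are assumptions added for illustration.

```c
#include <assert.h>
#include <string.h>

/* Naive search that also counts character comparisons.
   Returns the index of the first match, or -1 if there is none;
   *comparisons receives the number of txt/pat character comparisons made. */
int naive_count(const char *txt, const char *pat, long *comparisons)
{
    int n = (int)strlen(txt);
    int m = (int)strlen(pat);
    assert(m > 0);
    *comparisons = 0;
    for (int i = 0; i <= n - m; i++) {
        int j = 0;
        while (j < m) {
            (*comparisons)++;            /* one comparison of txt[i + j] with pat[j] */
            if (txt[i + j] != pat[j])
                break;
            j++;
        }
        if (j == m)
            return i;                    /* first match found */
    }
    return -1;
}
```

For the text "aaaaaaad" and the pattern "aaad" it returns 4 after 20 comparisons, which is exactly (n - m + 1) * m = 5 * 4.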
How about this?
Index  0  1  2  3  4  5  6  7  8  9
(T)    T  R  A  I  L  T  R  A  I  N
(P)    T  R  A  I  N                   i = 0-4, j = 0-4, X (mismatch at i = 4)
(P)                   T  R  A  I  N    i = 5-9, j = 0-4, match
• Here index i is moved to the position just after the mismatch, so i only ever moves forward
• Found a match at index 5
• Does this always work?
How about this?
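Index  0  1  2  3  4  5  6  7  8  9 10
(T)    O  N  I  O  N  I  O  N  S  P  L
(P)    O  N  I  O  N  S                   i = 0-5, j = 0-5, X (mismatch at i = 5)
(P)                      O  N  I          i = 6-8, j = 0-2, X (mismatch at i = 8)
(P)                               O       i = 9, j = 0, X
(P)                                  O    i = 10, j = 0, X
• Here index i is moved to the position just after the mismatch
• No match is found, although the pattern actually occurs at index 3
• Why does this NOT work? Overlapping sub-patterns: the matched text "ONION" ends with "ON", which is also the start of the pattern, and jumping past the mismatch throws that partial match away
(T)    O  N  I  O  N  I  O  N  S  P  L
(P)             O  N  I  O  N  S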
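The flawed heuristic from the last two examples can be sketched in C to reproduce both outcomes. The function name skip_search is hypothetical, and unlike the slide's trace this version stops as soon as the pattern no longer fits in the remaining text, which does not change the result.

```c
#include <assert.h>
#include <string.h>

/* Flawed heuristic: after a mismatch, restart the pattern just after the
   mismatched text character. Returns the first match index found, or -1. */
int skip_search(const char *txt, const char *pat)
{
    int n = (int)strlen(txt);
    int m = (int)strlen(pat);
    assert(m > 0);
    int i = 0;
    while (i + m <= n) {
        int j = 0;
        while (j < m && txt[i + j] == pat[j])
            j++;
        if (j == m)
            return i;
        i += j + 1;      /* jump past the mismatch: this can skip a real match */
    }
    return -1;
}
```

It finds "TRAIN" in "TRAILTRAIN" at index 5, but returns -1 for "ONIONS" in "ONIONIONSPL" even though the pattern occurs at index 3.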
How about this?
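Index  0  1  2  3  4  5  6  7  8  9 10
(T)    O  N  I  O  N  I  O  N  S  P  L
(P)    O  N  I  O  N  S                   i = 0-5, j = 0-5, X (mismatch at i = 5)
(P)             O  N  I  O  N  S          i = 5-8, j = 2-5, match at index 3
• If we can somehow continue matching the text from index i = 5 against the pattern from index j = 2, the match is found without moving i backwards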
Knuth – Morris – Pratt (KMP) Pattern Searching
Linear-time algorithm for string matching: O(n + m)
The text index i never backtracks
The basic idea behind the KMP algorithm is:
1. Whenever we detect a mismatch (after some matches),
2. we already know some of the characters in the text of the next window.
3. We take advantage of this information to avoid re-matching the characters that we know will match anyway.
How do we know how many characters can be skipped? The KMP algorithm preprocesses the pattern and prepares an integer array lps[] that tells how many characters can be skipped
KMP – Pre-processing
Longest proper prefix which is also a suffix – LPS
A proper prefix is any prefix other than the whole string
The non-empty proper prefixes of “ABC” are “A” and “AB”
“ABC” is a prefix, but it is NOT a proper prefix
Suffixes of “ABC” are “C”, “BC”, “ABC”
The search for the LPS happens in sub-patterns
For each sub-pattern pat[0..i], where i = 0 to m-1, lps[i] stores the length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i]
In other words: does the beginning part of the pattern occur anywhere else in the pattern?
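The definition can be turned into code directly, without any of KMP's cleverness. The sketch below is a brute-force check and not the linear-time preprocessing KMP uses; the function name lps_by_definition is an assumption for illustration.

```c
#include <assert.h>
#include <string.h>

/* lps[i] = length of the longest proper prefix of pat[0..i] that is also a
   suffix of pat[0..i]. Brute force: try each candidate length, longest first. */
void lps_by_definition(const char *pat, int *lps)
{
    int m = (int)strlen(pat);
    assert(m > 0);
    for (int i = 0; i < m; i++) {
        lps[i] = 0;
        for (int len = i; len >= 1; len--) {     /* len <= i keeps the prefix proper */
            /* compare prefix pat[0..len-1] with suffix pat[i-len+1..i] */
            if (memcmp(pat, pat + (i - len + 1), (size_t)len) == 0) {
                lps[i] = len;
                break;
            }
        }
    }
}
```

For the pattern "ABCDABEABF" this yields 0 0 0 0 1 2 0 1 2 0. It is cubic in m in the worst case, so it is only useful for checking small tables by hand.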
KMP – Pre-processing
The KMP algorithm maintains a table of size m (the same as the size of the pattern)
It is called the π table or LPS table

Pattern  A  B  C  D  A  B  E  A  B  F
LPS      0  0  0  0  1  2  0  1  2  0

Pattern  A  B  C  D  E  A  B  F  A  B  C
LPS      0  0  0  0  0  1  2  0  1  2  3

Pattern  A  A  B  C  A  D  A  A  B  E
LPS      0  1  0  0  1  0  1  2  3  0

Pattern  A  A  A  A  B  A  A  C  D
LPS      0  1  2  3  0  1  2  0  0
KMP – Pre-processing
Pattern  A  A  A  A
LPS      0  1  2  3

Pattern  A  A  A  C  A  A  A  A  A  C
LPS      0  1  2  0  1  2  3  3  3  4

Pattern  A  B  C  D  E
LPS      0  0  0  0  0

Pattern  A  A  A  B  A  A  A
LPS      0  1  2  0  1  2  3

Pattern  A  A  B  A  A  C  A  A  B  A  A
LPS      0  1  0  1  2  0  1  2  3  4  5
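Tables like these can be produced in O(m) time by reusing the LPS values already computed. The following is a sketch of the commonly used linear-time construction, with a 0-indexed table; the function name compute_lps is an assumption, and the slides themselves do not show this code.

```c
#include <assert.h>
#include <string.h>

/* Fill lps[0..m-1] for pat in O(m) time (0-indexed table). */
void compute_lps(const char *pat, int *lps)
{
    int m = (int)strlen(pat);
    assert(m > 0);
    int len = 0;              /* length of the current longest prefix-suffix */
    int i = 1;
    lps[0] = 0;               /* a single character has no non-empty proper prefix */
    while (i < m) {
        if (pat[i] == pat[len]) {
            lps[i++] = ++len;         /* extend the current prefix-suffix by one */
        } else if (len > 0) {
            len = lps[len - 1];       /* fall back to the next shorter candidate */
        } else {
            lps[i++] = 0;             /* no prefix-suffix ends at position i */
        }
    }
}
```

The fallback step is what produces the run 3 3 3 in the table for "AAACAAAAAC": when the candidate of length 3 cannot be extended, the next shorter candidate (length 2) is tried against the same character.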
KMP Way
Index  0  1  2  3  4  5  6  7  8  9 10
(T)    O  N  I  O  N  I  O  N  S  P  L
(P)    O  N  I  O  N  S
Pre-processing the pattern to prepare the LPS table:
(P)    O  N  I  O  N  S
LPS    0  0  0  1  2  0
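KMP Way…
j      0  1  2  3  4  5  6
(P)       O  N  I  O  N  S
LPS       0  0  0  1  2  0

Index  0  1  2  3  4  5  6  7  8  9 10
(T)    O  N  I  O  N  I  O  N  S  P  L
(P)    O  N  I  O  N  S                   i = 0-5, j = 0-5, X (mismatch at i = 5)
(P)             O  N  I  O  N  S          i = 5-8, j = 2-5, match at index 3
In the steps below the pattern is numbered from 1 and j is the number of pattern characters matched so far (j starts at 0), so P[j+1] is the next pattern character to compare and LPS[j] is the LPS entry of the j-th pattern character.
1. Compare T[i] and P[j+1]
2. If they match, advance (++) both i and j; continue as long as the characters match
• Here T[0-4] and P[1-5] match
• If the end of the pattern is reached, the match is successful
3. When a mismatch happens:
• if (j == 0) advance i; else set j = LPS[j]; go to step 1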
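The steps above can be sketched in C with ordinary 0-indexed arrays. In that convention j is both the number of characters matched and the index of the next pattern character, so the comparison becomes txt[i] against pat[j] and the fallback becomes j = lps[j - 1]. The names build_lps and kmp_search, and the -2 return value for an allocation failure, are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fill lps[0..m-1] for pat (0-indexed LPS table). */
static void build_lps(const char *pat, int m, int *lps)
{
    int len = 0;
    lps[0] = 0;
    for (int i = 1; i < m; ) {
        if (pat[i] == pat[len])
            lps[i++] = ++len;
        else if (len > 0)
            len = lps[len - 1];
        else
            lps[i++] = 0;
    }
}

/* Return the index of the first occurrence of pat in txt, -1 if there is
   none, or -2 if the LPS table could not be allocated. */
int kmp_search(const char *txt, const char *pat)
{
    int n = (int)strlen(txt);
    int m = (int)strlen(pat);
    assert(m > 0);
    int *lps = malloc((size_t)m * sizeof *lps);
    if (lps == NULL)
        return -2;
    build_lps(pat, m, lps);

    int i = 0, j = 0, found = -1;
    while (i < n) {
        if (txt[i] == pat[j]) {          /* step 2: match, advance both */
            i++;
            j++;
            if (j == m) {                /* end of pattern reached */
                found = i - m;
                break;
            }
        } else if (j == 0) {             /* step 3: nothing matched, move i */
            i++;
        } else {                         /* step 3: reuse the matched prefix */
            j = lps[j - 1];
        }
    }
    free(lps);
    return found;
}
```

On the text "ONIONIONSPL" with the pattern "ONIONS" it returns 3. The text index i is never decreased, which is where the O(n + m) bound comes from.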
KMP Way …
(P)    A  B  A  B  D
LPS    0  0  1  2  0

Index  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
(T)    A  B  A  B  C  A  B  C  A  B  A  B  A  B  D

Here j is the number of pattern characters matched so far, and LPS[j] is the LPS entry of the j-th pattern character.
i       j     Event
0-4     0-4   Mismatch at i = 4, j = 4; j = LPS[j] = 2
4       2     Mismatch at i = 4, j = 2; j = LPS[j] = 0
4       0     Mismatch at i = 4, j = 0; move i to 5
5-7     0-2   Mismatch at i = 7, j = 2; j = LPS[j] = 0
7       0     Mismatch at i = 7, j = 0; move i to 8
8-12    0-4   Mismatch at i = 12, j = 4; j = LPS[j] = 2
12-14   2-4   Match at index 10
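1. Compare T[i] and P[j+1] (the pattern is numbered from 1 and j is the number of pattern characters matched so far)
2. If they match, advance (++) both i and j; continue as long as the characters match
• In this example T[0-3] and P[1-4] match first
• If the end of the pattern is reached, the match is successful
3. When a mismatch happens:
• if (j == 0) advance i; else set j = LPS[j]; go to step 1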
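To follow a run like this one step by step, the search loop can print every mismatch and the fallback it takes. This is a minimal tracing sketch, assuming 0-indexed arrays (so the fallback reads lps[j - 1]); the name kmp_trace, the MAX_PAT limit and the mismatches output parameter are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PAT 64   /* assumed upper bound on the pattern length for this sketch */

/* Run KMP on txt and pat, printing each mismatch and the action taken.
   Returns the index of the first match or -1; *mismatches counts mismatches. */
int kmp_trace(const char *txt, const char *pat, int *mismatches)
{
    int n = (int)strlen(txt);
    int m = (int)strlen(pat);
    assert(m > 0 && m <= MAX_PAT);

    int lps[MAX_PAT];
    lps[0] = 0;
    for (int i = 1, len = 0; i < m; ) {      /* linear-time LPS table */
        if (pat[i] == pat[len])
            lps[i++] = ++len;
        else if (len > 0)
            len = lps[len - 1];
        else
            lps[i++] = 0;
    }

    *mismatches = 0;
    int i = 0, j = 0;
    while (i < n) {
        if (txt[i] == pat[j]) {
            i++;
            j++;
            if (j == m) {
                printf("match at index %d\n", i - m);
                return i - m;
            }
        } else {
            (*mismatches)++;
            if (j == 0) {
                printf("mismatch i=%d j=0: move i\n", i);
                i++;
            } else {
                printf("mismatch i=%d j=%d: j = %d\n", i, j, lps[j - 1]);
                j = lps[j - 1];
            }
        }
    }
    return -1;
}
```

For the text "ABABCABCABABABD" and the pattern "ABABD" it reports six mismatches (three at i = 4, two at i = 7, one at i = 12) and then a match at index 10.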