Lecture#8 - String Matching Algorithm
• Remember :
• *Text is the string that we are searching in
• *Pattern is the string that we are searching for
• *Shift is an offset into a string
• Rabin-Karp Algorithm : compares the strings’ hash values, rather than the strings
themselves.
• Performs well in practice and generalizes to other algorithms for related problems, such as 2D pattern matching.
• Naïve (Brute Force) approach : The naive algorithm finds all valid
shifts using a loop that checks the condition P[1..m] = T[s+1..s+m]
for each of the n - m + 1 possible values of s.
• Whenever a character mismatch occurs after several characters have matched, the comparison
restarts in T at the character that follows the start of the current attempt: the shift s is
increased by one and P is compared again from P[1].
• Worst-case complexity : O(m(n-m+1)); average-case performance is surprisingly good
provided the strings are neither long nor contain many repeated letters.
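The naive approach above can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function name naive_match is an assumption, and indices are 0-based, so the slide's condition P[1..m] = T[s+1..s+m] becomes pattern[0..m-1] = text[s..s+m-1].

```python
def naive_match(text, pattern):
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 possible shifts in turn.
    for s in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or a full match.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:
            shifts.append(s)
        # On a mismatch nothing is remembered: the next attempt starts at s + 1.
    return shifts
```

The inner loop can run up to m times for each shift, which is where the O(m(n-m+1)) worst case comes from.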
• If the hash values of two strings match, it is called a ‘hit’.
• It may be possible that two or more different strings produce the same
hash value.
(Each letter is valued by its position in the alphabet: A=1, B=2, C=3, J=10, T=20.)
String 1: “CBJ” hash code = 3*100 + 2*10 + 10 = 330
String 2: “CAT” hash code = 3*100 + 1*10 + 20 = 330
• Hence it is necessary to check whether a hit is a genuine match or not.
• Any hit has to be tested to verify that it is not spurious, i.e. that
P[1..m] = T[s+1..s+m]
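The collision between the two example strings above can be reproduced with a short sketch. The function name letter_hash is an assumption made for illustration; it values each uppercase letter by its alphabet position (A=1, ..., Z=26) and combines the values in base 10, which is what the arithmetic in the example implies.

```python
def letter_hash(word):
    """Base-10 weighted sum of alphabet positions (A=1, ..., Z=26)."""
    h = 0
    for ch in word:
        # Shift the running value one decimal place, then add the letter value.
        h = h * 10 + (ord(ch) - ord("A") + 1)
    return h


# Both strings hash to the same value although they differ: a spurious hit.
same_hash = letter_hash("CBJ") == letter_hash("CAT")
really_equal = "CBJ" == "CAT"
```

Here same_hash is True while really_equal is False, which is why a hit must always be confirmed by comparing the characters themselves.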
• If m is very large then the hash value will be very large in size, so
we reduce the hash value modulo a prime number, say q.
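A minimal sketch of Rabin-Karp with hashing modulo a prime q is given below. The function name rabin_karp, the radix d = 256 and the default q = 101 are assumptions chosen for illustration, not values from the lecture; indices are 0-based. Every hit is verified by a direct comparison so that spurious hits are rejected.

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Return all 0-based shifts where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # weight of the leading character, reduced mod q
    p = t = 0
    # Hash the pattern and the first window of the text.
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # A hit (equal hashes) must still be checked character by character.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Rolling update: drop text[s], append text[s + m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts
```

A larger prime q makes spurious hits rarer, at the cost of working with larger intermediate values.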
• Worked example : Initialization, then Step#1 to Step#6 [figures]
• Let us execute the KMP algorithm to find whether ‘p’ occurs in ‘T’.
• Step#2 : i = 2, q = 0;
comparing p[1] with T[2]
• Step#3 : i = 3, q = 1;
comparing p[2] with T[3]
• Step#4 : i = 4, q = 0;
comparing p[1] with T[4]
• Step#5 : i = 5, q = 0;
comparing p[1] with T[5]
• Step#6 : i = 6, q = 1;
comparing p[2] with T[6]
• Step#7 : i = 7, q = 2;
comparing p[3] with T[7]
• Step#8 : i = 8, q = 3;
comparing p[4] with T[8]
• Step#9 : i = 9, q = 4;
comparing p[5] with T[9]
• Step#10 : i = 10, q = 5;
comparing p[6] with T[10]
• Step#11 : i = 11, q = 4;
comparing p[5] with T[11]
• Step#12 : i = 12, q = 5;
comparing p[6] with T[12]
• Step#13 : i = 13, q = 6;
comparing p[7] with T[13]
• Pattern ‘p’ has been found to occur completely in text ‘T’. The total
number of shifts needed to find the match is i – m = 13 – 7 = 6.
• The running time of the KMP-Matcher function is O(n).
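The matching loop traced above can be sketched in Python as follows. The names compute_prefix and kmp_match are assumptions made for illustration, and indices are 0-based, so the state q counts how many pattern characters are currently matched rather than naming a 1-based position.

```python
def compute_prefix(pattern):
    """pi[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of it."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]  # fall back to the next shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_match(text, pattern):
    """Return all 0-based shifts where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix(pattern)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i in range(n):  # i only moves forward through the text
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]  # reuse the part of the match that is still valid
        if pattern[q] == text[i]:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = pi[q - 1]  # continue searching for later occurrences
    return shifts
```

As a usage example with strings chosen purely for illustration (the slides do not state the text and pattern), kmp_match("bacbabababacaca", "ababaca") reports a single match at shift 6. The text index i is never decreased, which is why the matcher runs in O(n) once the prefix table is built.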
• Advantages :
• The running time of the KMP algorithm is optimal, O(m + n): O(m) to compute the
prefix function plus O(n) for matching.
• The algorithm never needs to move backwards in the input text T. This makes
the algorithm good for processing very large files.
• Disadvantages :
• Doesn’t work as well as the size of the alphabet increases, because
mismatches become more likely.