String Matching Chapter 12 Goodrich Nep
12.3 Pattern Matching Algorithms
• In the pattern matching problem on strings, we are given:
• a text string T of length n, and
• a pattern string P of length m,
and we want to determine whether P is a substring of T.
• Give examples….
• Give applications…..
• The notion of a “match” is that there is a substring of T starting at some index i
that matches P, character by character, such that
T[i] = P[0],
T[i+1] = P[1], ...,
T[i+m−1] = P[m−1].
• That is, P = T[i..i+m−1].
• Thus, the output of a pattern matching algorithm is either some
indication that the pattern P does not occur in T (failure case) or an integer
giving the starting index in T of a substring matching P (success case).
12.3.1 Brute Force
In the brute-force approach, we simply test all possible placements of P relative to T.
• Example 12.4: Suppose we are given the text string
T="abacaabaccabacabaabb" and the pattern string P="abacab".
• Exercise: work through one more example and compute the number of character comparisons performed.
Correctness:
• The brute-force pattern matching algorithm consists of two nested loops, with the
outer loop indexing through all possible starting indices of the pattern in the text,
and the inner loop indexing through each character of the pattern, comparing it to
its potentially corresponding character in the text.
• Thus, the correctness of the brute-force pattern matching algorithm follows
immediately from this exhaustive search approach.
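The two nested loops described above can be sketched in Python as follows. This is a minimal illustration, not the textbook's own listing: the function name brute_force_match and the returned comparison counter are assumptions added here so that hand-computed comparison counts can be checked.

```python
def brute_force_match(text, pattern):
    """Return (index, comparisons).

    index is the first position in text where pattern occurs, or -1.
    comparisons counts the character comparisons performed (illustrative extra).
    """
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):        # outer loop: every candidate start in text
        j = 0
        while j < m:                  # inner loop: walk through the pattern
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                 # mismatch: give up on this placement
            j += 1
        if j == m:                    # all m characters matched
            return i, comparisons
    return -1, comparisons            # no placement matched


if __name__ == "__main__":
    index, count = brute_force_match("abacaabaccabacabaabb", "abacab")
    print("match at index", index, "after", count, "comparisons")
```

Running it on the strings of Example 12.4 reports the starting index of the match together with the number of comparisons, which can be compared against a count done by hand.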
Performance:
• In the worst case, for each candidate index in T, we can perform up to m character
comparisons to discover that P does not match T at the current index.
• The outer for loop is executed at most n−m+1 times, and the inner loop is executed
at most m times.
• Thus, the running time of the brute-force method is O((n−m+1)m), which
simplifies to O(nm). Note that when m=n/2, this algorithm has quadratic running
time O(n^2).
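As a quick check of the quadratic claim, substituting m = n/2 into the bound (n−m+1)m gives the following (a worked substitution only, using the same symbols n and m as above):

```latex
(n - m + 1)\,m
  \;=\; \left(n - \frac{n}{2} + 1\right)\frac{n}{2}
  \;=\; \left(\frac{n}{2} + 1\right)\frac{n}{2}
  \;=\; \frac{n^{2}}{4} + \frac{n}{2}
  \;=\; O(n^{2})
```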
12.3.3 The Knuth-Morris-Pratt Algorithm
• The major inefficiency of the brute-force algorithm: we may perform many comparisons while testing a
potential placement of the pattern against the text, yet if we discover a pattern character that does
not match in the text, we throw away all the information gained by these comparisons and start over
again from scratch with the next incremental placement of the pattern.
• The Knuth-Morris-Pratt (or “KMP”) algorithm avoids this waste of information and, in doing so,
achieves a running time of O(n+m), which is optimal in the worst case.
• That is, in the worst case any pattern matching algorithm will have to examine all the characters of
the text and all the characters of the pattern at least once.
The Failure Function
• Main idea of the KMP algorithm: preprocess the pattern string P so
as to compute a failure function, f, that indicates the proper shift of P
so that, to the largest extent possible, we can reuse previously
performed comparisons.
• The failure function f(j) is defined as the length of the longest prefix of P
that is a suffix of P[1..j] (note that we use P[1..j] here, not P[0..j]).
• We use the convention f(0) = 0. The importance of the failure function is that it
“encodes” repeated substrings inside the pattern itself.
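To make the definition concrete, the sketch below computes f(j) directly from the wording above: for each j it looks for the longest prefix of P that is also a suffix of P[1..j]. It is deliberately naive and is not the efficient KMP construction; the function name failure_by_definition is an assumption used only for illustration.

```python
def failure_by_definition(pattern):
    """Compute f(j) straight from the definition (slow, for illustration only)."""
    m = len(pattern)
    f = [0] * m                                   # f(0) = 0 by convention
    for j in range(1, m):
        window = pattern[1:j + 1]                 # the substring P[1..j]
        for length in range(len(window), 0, -1):  # try the longest prefix first
            if pattern[:length] == window[-length:]:
                f[j] = length                     # prefix of P equals suffix of P[1..j]
                break
    return f


if __name__ == "__main__":
    print(failure_by_definition("abacab"))
```

Because P[1..j] excludes P[0], the prefix can never be all of P[0..j], which is why f(j) is always at most j.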
Technique of the KMP Algorithm (3 possibilities)
• 1st: there is a match. We increment both current indices, i.e., move on
to the next characters in T and P.
• 2nd: there is a mismatch and we have previously made progress in P. We
consult the failure function to determine the new index in P where we
need to continue checking P against T.
• 3rd: there is a mismatch and we are at the beginning of P. We simply
increment the index for T (and keep the index variable for P at its
beginning), i.e., start over with the next index in T.
Repeat the above process until we find a match of P in T or the index for
T reaches n, the length of T (indicating that we did not find the pattern
P in T).
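The three possibilities above map onto the three branches of a single loop. The following is a sketch under stated assumptions rather than the textbook's listing: the names kmp_failure and kmp_match are hypothetical, the failure function is built with the usual linear-time construction, and returning 0 for an empty pattern is a convention chosen here.

```python
def kmp_failure(pattern):
    """Failure function f, where f[j] follows the definition with f[0] = 0."""
    m = len(pattern)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            f[i] = j + 1              # j + 1 characters of the prefix match
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]              # fall back to a shorter prefix
        else:
            f[i] = 0                  # no prefix matches here
            i += 1
    return f


def kmp_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0                      # assumption: empty pattern matches at 0
    f = kmp_failure(pattern)
    i = j = 0                         # i indexes text, j indexes pattern
    while i < n:
        if text[i] == pattern[j]:     # 1st case: match, advance both indices
            if j == m - 1:
                return i - m + 1      # whole pattern matched
            i += 1
            j += 1
        elif j > 0:                   # 2nd case: mismatch after some progress
            j = f[j - 1]              # reuse earlier comparisons; i stays put
        else:                         # 3rd case: mismatch at start of pattern
            i += 1
    return -1                         # i reached n without a match


if __name__ == "__main__":
    print(kmp_match("abacaabaccabacabaabb", "abacab"))
```

Note that the text index i never moves backwards, which is the property the running-time analysis relies on.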
• Correctness of the KMP algorithm follows from the definition of the
failure function. Any comparisons that are skipped are actually
unnecessary, for the failure function guarantees that all the ignored
comparisons are redundant: they would involve comparing the same
matching characters over again.
• Also note that the algorithm performs no more comparisons than
the brute-force algorithm run on the same strings, and typically fewer.
Performance of KMP Algorithm
• Excluding the computation of the failure function, the running time is proportional to
the number of iterations of the while loop.
• For the sake of analysis, let us define k=i−j. Intuitively, k is the total amount by which
the pattern P has been shifted with respect to the text T.
• Note that throughout the execution of the algorithm, we have k≤n. One of the
following three cases occurs at each iteration of the loop:
• If T[i]=P[j], then i increases by 1, and k does not change, since j also increases by 1.
• If T[i]!=P[j] and j>0, then i does not change and k increases by at least 1, since, in this
case, k changes from i−j to i−f(j−1), which is an addition of j−f(j−1), which is positive
because f(j−1)<j.
• If T[i]!=P[j] and j=0, then i increases by 1 and k increases by 1, since j does not
change.
• Thus, at each iteration of the loop, either i or k increases by at least 1 (possibly
both); hence, the total number of iterations of the while loop in the KMP pattern
matching algorithm is at most 2n. Of course, achieving this bound assumes that we
have already computed the failure function for P.
Constructing the KMP Failure Function
• Dry run of KMPFailureFunction for P=“abacab” (do on board in class, along with the
given algorithm and P):
j      0  1  2  3  4  5
P[j]   a  b  a  c  a  b
f(j)   0  0  1  0  1  2
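The dry run can also be reproduced mechanically. The algorithm referred to above is not reproduced in these notes, so the sketch below assumes the standard linear-time construction, which "bootstraps" by matching the pattern against itself; the function name kmp_failure_trace and the recorded step tuples are hypothetical additions for illustration.

```python
def kmp_failure_trace(pattern):
    """Build the failure function and record each loop step for a dry run.

    Returns (f, steps), where each step is (i, j, action) as seen at the
    start of that iteration.
    """
    m = len(pattern)
    f = [0] * m                       # f[0] = 0 by convention
    steps = []
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            steps.append((i, j, "match: f[%d] = %d" % (i, j + 1)))
            f[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            steps.append((i, j, "mismatch: j falls back to f[%d] = %d" % (j - 1, f[j - 1])))
            j = f[j - 1]
        else:
            steps.append((i, j, "mismatch at j = 0: f[%d] = 0" % i))
            f[i] = 0
            i += 1
    return f, steps


if __name__ == "__main__":
    f, steps = kmp_failure_trace("abacab")
    for i, j, action in steps:
        print("i=%d j=%d  %s" % (i, j, action))
    print("f =", f)
```

For P=“abacab” the final list agrees with the f(j) row of the table, and the printed steps can be followed line by line on the board.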