5CS4-AOA-Unit-3 @zammers
PATTERN MATCHING
String matching
We formalize the string-matching problem as follows. We assume that the text is an array
T[1..n] of length n and that the pattern is an array P[1..m] of length m ≤ n. We further
assume that the elements of P and T are characters drawn from a finite alphabet Σ. For example,
we may have Σ = {0, 1} or Σ = {a, b, . . . , z}. The character arrays P and T are often called
strings of characters.
We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs
beginning at position s + 1 in text T) if 0 ≤ s ≤ n - m and T[s + 1..s + m] = P[1..m] (that is, if
T[s + j] = P[j], for 1 ≤ j ≤ m). If P occurs with shift s in T, then we call s a valid shift;
otherwise, we call s an invalid shift. The string-matching problem is the problem of finding all
valid shifts with which a given pattern P occurs in a given text T. The figure below illustrates
these definitions.
Figure: The string-matching problem. The goal is to find all occurrences of the pattern P = abaa
in the text T = abcabaabcabac. The pattern occurs only once in the text, at shift s = 3. The shift s
= 3 is said to be a valid shift. Each character of the pattern is connected by a vertical line to the
matching character in the text, and all matched characters are shown shaded.
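The definition of a valid shift translates almost directly into code. The following is a minimal Python sketch; the function name is_valid_shift is made up here for illustration, and because Python strings are 0-based, the substring T[s + 1..s + m] is written as the slice T[s:s + m].

```python
def is_valid_shift(T: str, P: str, s: int) -> bool:
    """Return True if P occurs with shift s in T, per the definition above."""
    n, m = len(T), len(P)
    # 0 <= s <= n - m and T[s+1..s+m] = P[1..m], written with 0-based slices.
    return 0 <= s <= n - m and T[s:s + m] == P


T, P = "abcabaabcabac", "abaa"
valid = [s for s in range(len(T) - len(P) + 1) if is_valid_shift(T, P, s)]
print(valid)  # [3]
```

For the text and pattern of the figure, the only valid shift found is s = 3.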
An analysis of all the string-matching algorithms is given in the following table.
The naive string-matching algorithm
NAIVE-STRING-MATCHER(T, P)
n ← length[T]
m ← length[P]
for s ← 0 to n - m
    do if P[1..m] = T[s + 1..s + m]
        then print "Pattern occurs with shift" s
Figure: The operation of the naive string matcher for the pattern P = aab and the text T = acaabc.
We can imagine the pattern P as a "template" that we slide next to the text. (a)-(d) The four
successive alignments tried by the naive string matcher. In each part, vertical lines connect
corresponding regions found to match (shown shaded), and a jagged line connects the first
mismatched character found, if any. One occurrence of the pattern is found, at shift s = 2, shown
in part (c).
Complexity
Procedure NAIVE-STRING-MATCHER takes time O((n - m + 1)m), and this bound is tight in
the worst case.
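To make the bound concrete, here is a small Python sketch of the naive matcher that also counts character comparisons. The function name and the comparison counter are additions for illustration only. The worst case arises, for example, when the text is a run of n copies of a and the pattern is m copies of a: every one of the n - m + 1 shifts then needs all m comparisons.

```python
def naive_string_matcher(T: str, P: str):
    """Return (valid shifts, number of character comparisons performed)."""
    n, m = len(T), len(P)
    shifts, comparisons = [], 0
    for s in range(n - m + 1):
        j = 0
        # Compare left to right, stopping at the first mismatch.
        while j < m:
            comparisons += 1
            if T[s + j] != P[j]:
                break
            j += 1
        if j == m:
            shifts.append(s)
    return shifts, comparisons


print(naive_string_matcher("acaabc", "aab"))    # ([2], 8)
print(naive_string_matcher("a" * 10, "a" * 3))  # ([0, 1, 2, 3, 4, 5, 6, 7], 24)
```

In the second call, n = 10 and m = 3, so the count is (n - m + 1)m = 8 · 3 = 24.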
The Rabin-Karp algorithm
For exposition, assume that Σ = {0, 1, ..., 9}, so that each character is a decimal digit and a
string of k consecutive characters can be read as a length-k decimal number.
Given a pattern P[1..m], we let p denote its corresponding decimal value. In a similar
manner, given a text T[1..n], we let t_s denote the decimal value of the length-m substring
T[s + 1..s + m], for s = 0, 1, ..., n - m. Certainly, t_s = p if and only if T[s + 1..s + m] =
P[1..m]; thus, s is a valid shift if and only if t_s = p. If we could compute p in time Θ(m) and
all the t_s values in a total of Θ(n - m + 1) time, then we could determine all valid shifts s in
time Θ(m) + Θ(n - m + 1) = Θ(n) by comparing p with each of the t_s values.
We can compute p in time Θ(m) using Horner's rule:
p = P[m] + 10 (P[m - 1] + 10(P[m - 2] + · · · + 10(P[2] + 10P[1]) )).
The value t_0 can be similarly computed from T[1..m] in time Θ(m).
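Horner's rule amounts to a single left-to-right pass that multiplies the running value by the radix and adds the next digit. A minimal Python sketch follows; the name horner_value and the base parameter are illustrative assumptions, and the input is assumed to be a string of digits valid in that base.

```python
def horner_value(digits: str, base: int = 10) -> int:
    """Evaluate a digit string with Horner's rule using len(digits) steps."""
    value = 0
    for ch in digits:  # processes P[1], P[2], ..., P[m] in order
        value = base * value + int(ch)
    return value


print(horner_value("31415"))   # 31415
print(horner_value("101", 2))  # 5
```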
To compute the remaining values t_1, t_2, ..., t_{n-m} in time Θ(n - m), it suffices to observe that
t_{s+1} can be computed from t_s in constant time, since
t_{s+1} = 10(t_s - 10^{m-1} T[s + 1]) + T[s + m + 1].
For example, if m = 5 and t_s = 31415, then we wish to remove the high-order digit T[s + 1] = 3
and bring in the new low-order digit (suppose it is T[s + 5 + 1] = 2) to obtain
t_{s+1} = 10(31415 - 10000 · 3) + 2 = 14152.
Subtracting 10^{m-1} T[s + 1] removes the high-order digit from t_s, multiplying the result by 10
shifts the number left one position, and adding T[s + m + 1] brings in the appropriate low-order
digit. The only difficulty with this procedure is that p and t_s may be too large to work with
conveniently.
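The constant-time update can be checked numerically. The sketch below is illustrative only: the function name next_window_value and its parameters are assumptions, and the power 10^(m-1) is passed in precomputed so that each update really is a constant number of arithmetic operations.

```python
def next_window_value(t_s: int, high_digit: int, low_digit: int, high_power: int) -> int:
    """Compute t_{s+1} from t_s for decimal digits, given high_power = 10^(m-1)."""
    return 10 * (t_s - high_power * high_digit) + low_digit


m = 5
high_power = 10 ** (m - 1)  # computed once, before any window is rolled
print(next_window_value(31415, 3, 2, high_power))  # 14152

# Roll across a short digit string and compare with direct conversion.
T = "3141526"
t = int(T[:m])
for s in range(len(T) - m):
    t = next_window_value(t, int(T[s]), int(T[s + m]), high_power)
    assert t == int(T[s + 1:s + 1 + m])
```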
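With a d-ary alphabet {0, 1, ..., d - 1}, we choose a prime q so that d·q fits within a computer
word and adjust the recurrence to work modulo q, so that it becomes
t_{s+1} = (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q,
where h ≡ d^{m-1} (mod q) is the value of the digit "1" in the high-order position of an m-digit
text window. Working modulo q has a cost: t_s ≡ p (mod q) does not imply t_s = p, so every hit
must be verified by comparing P[1..m] with T[s + 1..s + m]. A hit that fails this check is called a
spurious hit.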
The following procedure implements this idea. Its inputs are the text T, the pattern P, the radix d,
and the prime q.
RABIN-KARP-MATCHER(T, P, d, q)
n ← length[T]
m ← length[P]
h ← d^{m-1} mod q
p ← 0
t_0 ← 0
for i ← 1 to m                    ▹ Preprocessing.
    do p ← (dp + P[i]) mod q
       t_0 ← (dt_0 + T[i]) mod q
for s ← 0 to n - m                ▹ Matching.
    do if p = t_s
           then if P[1..m] = T[s + 1..s + m]
                    then print "Pattern occurs with shift" s
       if s < n - m
           then t_{s+1} ← (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q
Figure: The Rabin-Karp algorithm. Each character is a decimal digit, and we compute values
modulo 13. (a) A text string. A window of length 5 is shown shaded. The numerical value of the
shaded number is computed modulo 13, yielding the value 7. (b) The same text string with
values computed modulo 13 for each possible position of a length-5 window. Assuming the
pattern P = 31415, we look for windows whose value modulo 13 is 7, since 31415 ≡ 7 (mod 13).
Two such windows are found, shown shaded in the figure. The first, beginning at text position 7,
is indeed an occurrence of the pattern, while the second, beginning at text position 13, is a
spurious hit. (c) Computing the value for a window in constant time, given the value for the
previous window. The first window has value 31415. Dropping the high-order digit 3, shifting
left (multiplying by 10), and then adding in the low order digit 2 gives us the new value 14152.
All computations are performed modulo 13, however, so the value for the first window is 7, and
the value computed for the new window is 8.
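The procedure can be sketched in Python as follows. This is an illustrative version, not a reference implementation: it assumes T and P are strings of decimal digits, returns the spurious hits separately so they can be inspected, and uses a sample text chosen here to be consistent with the caption (the figure's own text is not reproduced in these notes). Shifts are 0-based, so shift s corresponds to text position s + 1.

```python
def rabin_karp_matcher(T: str, P: str, d: int = 10, q: int = 13):
    """Return (valid shifts, spurious hits) for digit strings T and P."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return [], []
    h = pow(d, m - 1, q)  # value of digit "1" in the high-order position, mod q
    p = t = 0
    for i in range(m):  # preprocessing with Horner's rule, mod q
        p = (d * p + int(P[i])) % q
        t = (d * t + int(T[i])) % q
    valid, spurious = [], []
    for s in range(n - m + 1):  # matching
        if p == t:
            # Equal residues only suggest a match, so verify explicitly.
            if T[s:s + m] == P:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            # Drop T[s], shift left by one digit, bring in T[s + m].
            t = (d * (t - int(T[s]) * h) + int(T[s + m])) % q
    return valid, spurious


text = "2359023141526739921"  # sample text, an assumption for illustration
print(rabin_karp_matcher(text, "31415"))  # ([6], [12])
```

With q = 13, the window at shift 6 (text position 7) is a true occurrence, while the window 67399 at shift 12 (text position 13) also has residue 7 and is rejected by the explicit comparison.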
5.2.3 The Knuth-Morris-Pratt algorithm
We now present a linear-time string-matching algorithm due to Knuth, Morris, and Pratt. It uses
just an auxiliary function π[1..m], precomputed from the pattern in time Θ(m). The array π
allows the transition function δ of the string-matching automaton to be computed efficiently. For
any state q = 0, 1, ..., m and any character a ∈ Σ, the value π[q] contains the information that is
independent of a and is needed to compute δ(q, a). Since the array π has only m entries, whereas
δ has Θ(m |Σ|) entries, we save a factor of |Σ| in the preprocessing time by computing π rather
than δ.
Figure: The prefix function π. (a) The pattern P = ababaca is aligned with a text T so that the
first q = 5 characters match. Matching characters, shown shaded, are connected by vertical lines.
(b) Using only our knowledge of the 5 matched characters, we can deduce that a shift of s + 1 is
invalid, but that a shift of s′ = s + 2 is consistent with everything we know about the text and
therefore is potentially valid. (c) The useful information for such deductions can be precomputed
by comparing the pattern with itself. Here, we see that the longest prefix of P that is also a proper
suffix of P_5 is P_3. This information is precomputed and represented in the array π, so that π[5] =
3. Given that q characters have matched successfully at shift s, the next potentially valid shift is
at s′ = s + (q - π[q]).
We formalize the precomputation required as follows. Given a pattern P[1..m], the prefix
function for the pattern P is the function π : {1, 2, ..., m} → {0, 1, ..., m - 1} such that
π[q] = max {k : k < q and P_k ⊐ P_q},
where P_k denotes the k-character prefix P[1..k] and P_k ⊐ P_q means that P_k is a suffix of P_q.
That is, π[q] is the length of the longest prefix of P that is a proper suffix of P_q.
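The definition can be evaluated directly, which is useful for checking small cases by hand. The following Python sketch is a deliberately slow, brute-force reading of the definition, not the Θ(m) procedure; the function name is an assumption, and the returned list is 0-based, so pi[q - 1] holds π[q].

```python
def prefix_function_by_definition(P: str) -> list:
    """Compute π straight from its definition (brute force, for checking only)."""
    pi = []
    for q in range(1, len(P) + 1):
        Pq = P[:q]
        # Largest k < q such that the prefix P_k is a suffix of P_q.
        # k = 0 always qualifies because the empty string is a suffix of anything.
        k = max(k for k in range(q) if Pq.endswith(P[:k]))
        pi.append(k)
    return pi


print(prefix_function_by_definition("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

For P = ababaca, the fifth entry is 3, which agrees with π[5] = 3 in the figure.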
The Knuth-Morris-Pratt matching algorithm is given in pseudocode below as the procedure
KMP-MATCHER which calls auxiliary procedure COMPUTE-PREFIX-FUNCTION to
compute π.
KMP-MATCHER(T, P)
n ← length[T]
m ← length[P]
π ← COMPUTE-PREFIX-FUNCTION(P)
q ← 0                                      // Number of characters matched.
for i ← 1 to n                             // Scan the text from left to right.
    do while q > 0 and P[q + 1] ≠ T[i]
           do q ← π[q]                     // Next character does not match.
       if P[q + 1] = T[i]
           then q ← q + 1                  // Next character matches.
       if q = m                            // Is all of P matched?
           then print "Pattern occurs with shift" i - m
                q ← π[q]                   // Look for the next match.
COMPUTE-PREFIX-FUNCTION(P)
m ← length[P]
π[1] ← 0
k ← 0
for q ← 2 to m
    do while k > 0 and P[k + 1] ≠ P[q]
           do k ← π[k]
       if P[k + 1] = P[q]
           then k ← k + 1
       π[q] ← k
return π
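The two procedures carry over to Python with only index adjustments. The sketch below is one possible translation under stated assumptions: strings are 0-based, so pi[q - 1] stores π[q] and P[q] plays the role of P[q + 1]; the function names mirror the pseudocode but are otherwise arbitrary; and an empty pattern is treated as having no occurrences, a choice the pseudocode does not address.

```python
def compute_prefix_function(P: str) -> list:
    """Return pi with pi[q - 1] equal to π[q] from the pseudocode."""
    m = len(P)
    pi = [0] * m
    k = 0  # length of the current longest prefix that is also a suffix
    for q in range(1, m):  # q is a 0-based index into P
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(T: str, P: str) -> list:
    """Return all valid shifts of P in T."""
    n, m = len(T), len(P)
    if m == 0:
        return []
    pi = compute_prefix_function(P)
    shifts = []
    q = 0  # number of characters matched
    for i in range(n):  # scan the text from left to right
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]  # next character does not match
        if P[q] == T[i]:
            q += 1  # next character matches
        if q == m:  # all of P matched
            shifts.append(i - m + 1)
            q = pi[q - 1]  # look for the next match
    return shifts


print(compute_prefix_function("ababaca"))    # [0, 0, 1, 2, 3, 0, 1]
print(kmp_matcher("abcabaabcabac", "abaa"))  # [3]
print(kmp_matcher("aaaa", "aa"))             # [0, 1, 2]
```

The last call shows that overlapping occurrences are reported, because q is reset to π[m] rather than to 0 after a match.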
Running-time analysis
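The running time of COMPUTE-PREFIX-FUNCTION is Θ(m): k increases by at most 1 in each
iteration of the for loop, every iteration of the while loop decreases k, and k never becomes
negative, so the while loop executes at most m - 1 times in total. The same argument applied to q
shows that the matching time of KMP-MATCHER is Θ(n). Compared to
FINITE-AUTOMATON-MATCHER, by using π rather than δ, we have reduced the time for
preprocessing the pattern from O(m |Σ|) to Θ(m), while keeping the actual matching time
bounded by Θ(n).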