Slides 03
Boyer-Moore Algorithm
The Boyer-Moore algorithm (BM) is the practical method of choice for exact matching. It is especially suitable if
- the alphabet is large (as in natural language)
- the pattern is long (as often in bio-applications)
The speed of BM comes from shifting the pattern P[1 . . . n] to the right in longer steps. Typically fewer than m characters (often only about m/n) of T[1 . . . m] are examined.
BM is based on three main ideas:
1. Longer shifts are based on examining P right-to-left, in the order P[n], P[n − 1], . . .
2. The bad character shift rule avoids repeating unsuccessful comparisons against a target character.
3. The good suffix shift rule aligns only matching pattern characters against target characters already successfully matched.
Either rule alone works, but they're more effective together.
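Bad character rule: Shift P so that the next occurrence of 'i' to the left of the mismatch position in P lands below the mismatched 'i' of T. (Here P[12] = 'o' mismatches T[12] = 'i'; the next 'i' to the left is P[3], so P shifts by 9.)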
            1         2         3
   123456789012345678901234567890
T: maistuko kaima maisemaomaloma?
P: maisemaomaloma
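The shift in this example can be reproduced with a short Python sketch. It assumes that R(x) denotes the rightmost 1-based position of character x in P (0 if x does not occur in P), which is the usual textbook convention; the definition of R is not shown in this part of the slides. In this example that position coincides with the nearest 'i' to the left of the mismatch. The function names are illustrative only.

```python
def rightmost_occurrences(pattern):
    """Map each character x of the pattern to R(x), its rightmost 1-based position."""
    # Later positions overwrite earlier ones, so the rightmost one survives.
    return {ch: pos for pos, ch in enumerate(pattern, start=1)}


def bad_character_shift(pattern, i, text_char):
    """Shift suggested by the bad character rule for a mismatch at P[i] (1-based)."""
    r = rightmost_occurrences(pattern).get(text_char, 0)  # assumed: R(x) = 0 if x not in P
    return max(1, i - r)  # never shift by less than one position


P = "maisemaomaloma"
T = "maistuko kaima maisemaomaloma?"

# Compare right-to-left with P aligned below T[1..14].
i = len(P)
while i > 0 and P[i - 1] == T[i - 1]:
    i -= 1

print(i, T[i - 1], bad_character_shift(P, i, T[i - 1]))  # prints: 12 i 9
```

The mismatch is found at P[12] after only three comparisons, and the pattern moves nine positions at once.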
The bad character rule is effective, e.g., in searching natural language text (because mismatches are probable).
If the alphabet is small, occurrences of any character close to the end of P are probable. Especially in this case, additional benefit can be obtained from considering the successfully matched suffix of P.
We concentrate on the so-called strong good suffix rule, which is more powerful than the (weak) suffix rule of the original Boyer-Moore method.
In an occurrence, T[12 . . . 14] = "ima" must align with a substring "xma" of P, where x differs from P[n − 2] = 'o'.
            1         2         3
   123456789012345678901234567890
T: maistuko kaima maisemaomaloma?
P: maisemaomaloma
BSA Lecture 3: BM Algorithm p.7/21
Case 2 Formally
Assume that P[i . . . n] has been successfully matched against target substring t.
Case 2: If Case 1 does not apply, shift P by the least amount possible such that a suffix of t matches a prefix of P.
NB 1: Case 2 also applies when an occurrence of P has been found.
NB 2: As a special case, the longest suffix of t that matches a prefix of P can be empty, in which case P is shifted by |P| positions.
Example of L'(i)
Consider
            1
   12345678901234
P: maisemaomaloma
Now
  L'(15) = 13
  L'(14) = 0
  L'(13) = 7 (P[13..14] = P[6..7] = ma, P[5] ≠ P[12])
  L'(12) = 10, and
  L'(11) = L'(10) = · · · = L'(2) = 0
The L' values can be computed in time O(n); see next.
The L'(i) values can be computed in O(n) time by locating the largest j such that N_j(P) = n − i + 1 (⇒ such j is L'(i) for i = n − N_j(P) + 1):
  for i := 2 to n + 1 do L'(i) := 0;
  for j := 1 to n − 1 do L'(n − N_j(P) + 1) := j;
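The two loops above can be checked against the example values with a small Python sketch. It assumes the usual definition of N_j(P) as the length of the longest suffix of P[1..j] that is also a suffix of P; that definition is not restated in this part of the slides. For clarity the N_j values are computed naively in quadratic time, not with the linear-time preprocessing that the O(n) bound relies on. The function names are illustrative.

```python
def suffix_match_lengths(pattern):
    """N[j] = length of the longest suffix of P[1..j] that is also a suffix of P.

    Naive O(n^2) computation, for illustration only. Index 0 is unused.
    """
    n = len(pattern)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        # Extend the match leftwards from P[j] and from P[n] simultaneously.
        while k < j and pattern[j - 1 - k] == pattern[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def strong_good_suffix_table(pattern):
    """Return a list L with L[i] = L'(i) for i = 2..n+1 (indices 0 and 1 unused)."""
    n = len(pattern)
    N = suffix_match_lengths(pattern)
    L = [0] * (n + 2)
    # Increasing j means a later (larger) j overwrites an earlier one,
    # so each entry ends up holding the largest qualifying j.
    for j in range(1, n):
        L[n - N[j] + 1] = j
    return L


L = strong_good_suffix_table("maisemaomaloma")
print(L[15], L[14], L[13], L[12])  # prints: 13 0 7 10
```

The printed values agree with the example: L'(15) = 13, L'(14) = 0, L'(13) = 7 and L'(12) = 10.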
How to compute the smallest shift that aligns a matching prefix of P with a suffix of the successfully matched substring of T, which equals P[i . . . n]?
For i ≥ 2, let l(i) be the length of the longest prefix of P (that is, P[1 . . . l(i)]) that is equal to a suffix of P[i . . . n].
Example: For P = P[1..5] = ababa, l(6) = 0 (P[6 . . . 5] = ε), l(5) = l(4) = 1 (a), and l(3) = l(2) = 3 (aba).
Now the following theorem holds.
Theorem 2.2.4: l(i) = max{ j | 0 ≤ j ≤ |P[i . . . n]| and N_j(P) = j }
Proof. (Left as an exercise.)
This allows us to compute the l(i) values in time O(|P|) (exercise).
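As a hedged illustration of the theorem, the Python sketch below computes l(i) by scanning i from n down to 2. It uses the observation that N_j(P) = j holds exactly when the prefix P[1..j] is also a suffix of P, and tests that condition by direct string comparison. This makes the sketch quadratic in the worst case, so it is not the O(|P|) solution asked for in the exercise. The function name is illustrative.

```python
def prefix_suffix_lengths(pattern):
    """Return a list l with l[i] = l(i) for i = 2..n+1 (indices 0 and 1 unused)."""
    n = len(pattern)
    l = [0] * (n + 2)  # l(n+1) = 0: the empty suffix matches only the empty prefix
    for i in range(n, 1, -1):
        j = n - i + 1  # j = |P[i..n]|, the largest candidate for l(i)
        # N_j(P) = j  <=>  the prefix P[1..j] is also a suffix of P.
        if pattern[:j] == pattern[n - j:]:
            l[i] = j
        else:
            l[i] = l[i + 1]  # otherwise the best shorter candidate carries over
    return l


l = prefix_suffix_lengths("ababa")
print(l[2:7])  # prints: [3, 3, 1, 1, 0]
```

The output lists l(2), . . . , l(6) and matches the example for P = ababa.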
Since neither the bad character rule nor the good suffix rule misses any occurrence, we can use the maximum of the alternative shift values.
Complete Boyer-Moore Algorithm:
// Preprocessing:
Compute R(x) for each x ∈ Σ;
Compute L'(i) and l(i) for each i = 2, . . . , n + 1;
BM Search Loop
// Search:
k := n;
while k ≤ m do
    i := n; h := k;
    while i > 0 and P[i] = T[h] do
        i := i − 1; h := h − 1;
    endwhile;
    if i = 0 then
        Report an occurrence at T[h + 1 . . . k];
        k := k + n − l(2);
    else // mismatch at P[i]
        Increase k by the maximum shift given by the
        bad character rule and the good suffix rule;
    endif;
endwhile;
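The search loop can be made concrete with the following self-contained Python sketch. Several details are assumptions not spelled out on the slides: R(x) is taken as the rightmost position of x in P (0 if absent); on a mismatch at P[i] the good suffix shift is taken as n − L'(i + 1) when L'(i + 1) > 0 and n − l(i + 1) otherwise; and the preprocessing is done naively in quadratic time to keep the code short. The function name `boyer_moore` is illustrative.

```python
def boyer_moore(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []

    # Preprocessing (naive, for illustration): R(x), N_j(P), L'(i), l(i).
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and pattern[j - 1 - k] == pattern[n - 1 - k]:
            k += 1
        N[j] = k
    L = [0] * (n + 2)
    for j in range(1, n):
        L[n - N[j] + 1] = j
    l = [0] * (n + 2)
    for i in range(n, 1, -1):
        j = n - i + 1
        l[i] = j if N[j] == j else l[i + 1]

    # Search: k is the text position aligned with P[n].
    occurrences = []
    k = n
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            occurrences.append(h + 1)
            k += n - l[2]
        else:
            # Mismatch at P[i]; the matched suffix is P[i+1..n].
            bad = max(1, i - R.get(text[h - 1], 0))
            good = n - L[i + 1] if L[i + 1] > 0 else n - l[i + 1]
            k += max(bad, good)
    return occurrences


print(boyer_moore("maisemaomaloma", "maistuko kaima maisemaomaloma?"))  # prints: [16]
```

On the running example the sketch visits only four alignments (k = 14, 23, 25, 29) before reporting the occurrence at T[16 . . . 29].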
Final Remarks
The presented rules carefully avoid performing unnecessary comparisons that would fail.
They can be shown to lead to linear-time behavior, but only if P does not occur in T. Otherwise the worst-case complexity is still Θ(nm).
A simple modification (Galil rule; Gusfield, Sect. 3.2.2) corrects this and leads to a provable worst-case linear time.
On natural language texts the running time is almost always sub-linear.