
Biosequence Algorithms, Spring 2005

Lecture 3: Boyer-Moore Matching


Pekka Kilpeläinen, University of Kuopio, Department of Computer Science


Boyer-Moore Algorithm

The Boyer-Moore algorithm (BM) is the practical method of choice for exact matching. It is especially suitable if
- the alphabet is large (as in natural language)
- the pattern is long (as often in bio-applications)

The speed of BM comes from shifting the pattern P[1..n] to the right in longer steps. Typically fewer than m characters (often only about m/n) of T[1..m] are examined.

BM is based on three main ideas:


Boyer-Moore: Main ideas

Longer shifts are based on
- examining P right-to-left, in order P[n], P[n-1], ...
- the bad character shift rule, which avoids repeating unsuccessful comparisons against a target character
- the good suffix shift rule, which aligns only matching pattern characters against target characters already successfully matched

Either rule alone works, but they're more effective together.


Right-to-left Scan and Bad Character Rule


The pattern is examined right-to-left:

            1         2         3
   123456789012345678901234567890
T: maistuko kaima maisemaomaloma?    (T[13..14] match, T[12] mismatches)
P: maisemaomaloma

Bad character rule: shift P so that its next-to-left occurrence of 'i' comes below the mismatched 'i' of T:

            1         2         3
   123456789012345678901234567890
T: maistuko kaima maisemaomaloma?
P:          maisemaomaloma


Bad Character Rule Formally


For any x ∈ Σ, let R(x) = max({0} ∪ {i < n | P[i] = x}). Easy to compute in time Θ(|Σ| + |P|) (Σ is the alphabet).

Bad character shift: When P[i] ≠ T[h] = x, shift P to the right by max{1, i - R(x)}. This means:
- if the right-most occurrence of x in P[1..n-1] is at j < i, chars P[j] and T[h] get aligned
- if the right-most occurrence of x in P[1..n-1] is at j > i, the pattern is shifted to the right by one
- if x doesn't occur in P[1..n-1], shift = i, and the pattern is next aligned with T[h+1..h+n]
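As an illustration, a minimal Python sketch of this rule (my own code, not from the slides; bad_char_table and bad_char_shift are made-up helper names, and positions are 1-based as above):

def bad_char_table(p):
    # r[x] = 1-based position of the rightmost occurrence of x in P[1..n-1],
    # i.e. the slides' R(x); characters absent from P[1..n-1] default to 0.
    r = {}
    for j in range(len(p) - 1):          # 0-based 0..n-2  <->  1-based 1..n-1
        r[p[j]] = j + 1
    return r

def bad_char_shift(r, i, x):
    # Mismatch of P[i] against text character x: shift by max(1, i - R(x)).
    return max(1, i - r.get(x, 0))

# Example from the previous slide: P[12] = 'o' mismatches T[12] = 'i',
# and R('i') = 3, so the pattern is shifted by 12 - 3 = 9.
R = bad_char_table("maisemaomaloma")
print(bad_char_shift(R, 12, "i"))        # -> 9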

(Strong) Good Suffix Rule

The bad character rule is effective, e.g., in searching natural-language text (because mismatches are probable). If the alphabet is small, occurrences of any char close to the end of P are probable. Especially in this case, additional benefit can be obtained from considering the successfully matched suffix of P.

We concentrate on the so-called strong good suffix rule, which is more powerful than the (weak) suffix rule of the original Boyer-Moore method.


Good Suffix Rule: Illustration


Consider a mismatch at P[n-2]:

            1         2         3
   123456789012345678901234567890
T: maistuko kaima maisemaomaloma?
P: maisemaomaloma

In an occurrence, T[12..14] = ima must align with xma, where x differs from P[n-2] = o:

            1         2         3
   123456789012345678901234567890
T: maistuko kaima maisemaomaloma?
P:        maisemaomaloma

Good Suffix Rule Formally


Suppose that P[i..n] has been successfully matched against T.

Case 1: If P[i-1] is a mismatch and P contains another copy of P[i..n] which is not preceded by char P[i-1], shift P s.t. the closest-to-left such copy is aligned with the substring already matched by P[i..n]. (See the previous slide for an example.)

What if no preceding copy of P[i..n] exists? → Case 2


Good Suffix Rule: Case 2


Consider a mismatch at P[n-5]:

            1         2         3
   12345678901234567890123456789012
T: mahtava talomaisema omalomailuun
P: maisemaomaloma

No preceding occurrence of aloma in P, but a potential occurrence of P begins at T[13..14] = ma:

            1         2         3
   12345678901234567890123456789012
T: mahtava talomaisema omalomailuun
P:             maisemaomaloma

Case 2 Formally
Assume that P[i..n] has been successfully matched against target substring t.

Case 2: If Case 1 does not apply, shift P by the least amount possible s.t. a suffix of t matches a prefix of P.

NB 1: Case 2 applies when an occurrence of P has been found.
NB 2: As a special case, the longest suffix of t that matches a prefix of P can be empty, in which case P is shifted by |P| positions.


Preprocessing for the Good Suffix Rule (Case 1)


For i = 2, ..., n+1, define L'(i) as the largest position of P that satisfies the following: L'(i) is the end position of an occurrence of P[i..n] that is not preceded by char P[i-1]; if no such copy of P[i..n] exists in P, let L'(i) = 0.

NB 1: 0 ≤ L'(i) < n; if L'(i) > 0, it is the right endpoint of the closest-to-left copy of the good suffix P[i..n], which gives the shift n - L'(i).
NB 2: Since P[n+1..n] = ε, L'(n+1) is the right-most position j s.t. P[j] ≠ P[n] (or 0 if all chars are equal).
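To make the definition concrete, a brute-force Python sketch that follows it literally (my own illustration with the made-up name L_prime_naive; quadratic at best, not the O(n) method of the following slides):

def L_prime_naive(p):
    # L'(i) for i = 2..n+1, straight from the definition (1-based positions).
    n = len(p)
    Lp = {i: 0 for i in range(2, n + 2)}
    for i in range(2, n + 2):
        suf = p[i - 1:]                  # P[i..n]; empty when i = n+1
        m = len(suf)                     # = n - i + 1
        for j in range(1, n):            # candidate end positions j = 1..n-1
            if j >= m and p[j - m:j] == suf:
                # an occurrence of P[i..n] ends at j; accept it if it starts
                # at position 1 or is not preceded by P[i-1]
                if j == m or p[j - m - 1] != p[i - 2]:
                    Lp[i] = j            # keep the largest such j
    return Lp

print(L_prime_naive("maisemaomaloma"))
# L'(15) = 13, L'(13) = 7, L'(12) = 10, all other L'(i) = 0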

Example of L'(i)

Consider

            1
   12345678901234
P: maisemaomaloma

Now
L'(15) = 13
L'(14) = 0
L'(13) = 7    (P[13..14] = P[6..7] = ma, and P[5] ≠ P[12])
L'(12) = 10, and
L'(11) = L'(10) = ... = L'(2) = 0

The L' values can be computed in time O(n); see next.


Computing the L' Values (1)


Define N_j(P) to be the length of the longest common suffix of P[1..j] and P (0 ≤ N_j(P) ≤ j).

Example: For

            1
   12345678901234
P: maisemaomaloma

N_0(P) = N_1(P) = 0, N_2(P) = 2, N_3(P) = ... = N_6(P) = 0, N_7(P) = 2,
N_8(P) = N_9(P) = 0, N_10(P) = 3, N_11(P) = ... = N_13(P) = 0, N_14(P) = 14
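A direct quadratic Python sketch of this definition (my own illustration; N_naive is a made-up name and j is 1-based):

def N_naive(p):
    # N[j] = length of the longest common suffix of P[1..j] and P.
    n = len(p)
    N = {}
    for j in range(1, n + 1):
        k = 0
        while k < j and p[j - 1 - k] == p[n - 1 - k]:
            k += 1
        N[j] = k
    return N

print(N_naive("maisemaomaloma"))
# {1: 0, 2: 2, 3: 0, 4: 0, 5: 0, 6: 0, 7: 2, 8: 0, 9: 0, 10: 3, 11: 0, 12: 0, 13: 0, 14: 14}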



Computing the L' Values (2)


Now the N_j (longest common suffix) values and the Z_i (longest common prefix) values are reverses of each other, i.e.,

    N_j(P) = Z_{n-j+1}(P^r),  where P^r is the reverse of P.

Example:
    j:       1 2 3 4 5 6 7 8
    n-j+1:   8 7 6 5 4 3 2 1
    P^r:     u m a n u m a a
    P:       a a m u n a m u

⇒ the N_j values can be computed in time O(|P|) by applying Algorithm Z to the reversal of P.
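A Python sketch of this reduction (my own code; z_array is a standard Z-algorithm implementation, and N_values is a made-up wrapper that applies it to the reversed pattern as described above):

def z_array(s):
    # Z[i] (0-based) = length of the longest common prefix of s and s[i:].
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def N_values(p):
    # N_j(P) = Z_{n-j+1}(P^r); returned as a dict keyed by 1-based j.
    n = len(p)
    z = z_array(p[::-1])
    return {j: z[n - j] for j in range(1, n + 1)}

print(N_values("aamunamu"))          # e.g. N_4 = 3, N_8 = 8
print(N_values("maisemaomaloma"))    # matches the values listed on the previous slide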


Computing the L' Values (3)


How do the N_j values help?

Theorem 2.2.2: If L'(i) > 0, it is the largest j < n for which N_j(P) = |P[i..n]| (= n - i + 1).

Proof. Such a j is exactly the right endpoint of the closest-to-left copy of P[i..n] which is not preceded by P[i-1].

The L'(i) values can be computed in O(n) time by locating the largest j s.t. N_j(P) = n - i + 1 (such a j is L'(i) for i = n - N_j(P) + 1):

    for i := 2 to n + 1 do L'(i) := 0;
    for j := 1 to n - 1 do L'(n - N_j(P) + 1) := j;
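In Python, the two loops above might look as follows (my own sketch; N_values is the Z-based helper sketched on the previous slide):

def L_prime(p, N):
    # Initialise L'(i) = 0, then for j = 1..n-1 set L'(n - N_j(P) + 1) := j;
    # a later (larger) j overwrites an earlier one, so the largest j wins.
    n = len(p)
    Lp = {i: 0 for i in range(2, n + 2)}
    for j in range(1, n):
        Lp[n - N[j] + 1] = j
    return Lp

P = "maisemaomaloma"
print(L_prime(P, N_values(P)))
# L'(15) = 13, L'(13) = 7, L'(12) = 10, all other L'(i) = 0, as in the example above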

Preprocessing for Case 2 (1)

How to compute the smallest shift that aligns a matching prefix of P with a suffix of the successfully matched substring of T (= P[i..n])?

For i ≥ 2, let l(i) be the length of the longest prefix of P (that is, P[1..l(i)]) that is equal to a suffix of P[i..n].

Example: For P = P[1..5] = ababa: l(6) = 0 (P[6..5] = ε), l(5) = l(4) = 1 ("a"), and l(3) = l(2) = 3 ("aba").
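A brute-force Python sketch of this definition, checked against the ababa example (my own illustration; l_naive is a made-up name):

def l_naive(p):
    # l(i) = length of the longest prefix of P equal to a suffix of P[i..n],
    # for i = 2..n+1 (1-based), straight from the definition.
    n = len(p)
    l = {}
    for i in range(2, n + 2):
        suf = p[i - 1:]                       # P[i..n]; empty when i = n+1
        best = 0
        for k in range(1, len(suf) + 1):
            if p[:k] == suf[-k:]:
                best = k
        l[i] = best
    return l

print(l_naive("ababa"))   # {2: 3, 3: 3, 4: 1, 5: 1, 6: 0}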


Preprocessing for Case 2 (2)

Now the following theorem holds.

Theorem 2.2.4: l(i) = max{0 ≤ j ≤ |P[i..n]| | N_j(P) = j}

Proof. (Left as an exercise.)

This allows us to compute the l(i) values in time O(|P|) (→ exercise).
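Using the theorem, the l(i) values can be filled in from i = n+1 down to 2 in one pass, each value reusing the previous one; a Python sketch (my own, with the N_values helper from earlier):

def l_values(p, N):
    # Theorem 2.2.4: l(i) is the largest j <= |P[i..n]| with N_j(P) = j.
    n = len(p)
    l = {n + 1: 0}
    for i in range(n, 1, -1):
        j = n - i + 1                  # length of the suffix P[i..n]
        l[i] = j if N[j] == j else l[i + 1]
    return l

print(l_values("ababa", N_values("ababa")))   # {6: 0, 5: 1, 4: 1, 3: 3, 2: 3}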


Shifts by the Good Suffix Rule


When P[i-1] is a mismatch (after matching P[i..n] successfully):
- (Case 1) if L'(i) > 0, shift the pattern to the right by n - L'(i) positions
- (Case 2) if L'(i) = 0, shift the pattern to the right by n - l(i) positions

NB: If already P[n] fails to match, i = n + 1, which also gives correct shifts.

When an occurrence of P has been found, shift P to the right by n - l(2) positions. Why? To align a prefix of P with the longest matching proper suffix of the occurrence.
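Putting the two cases together, a small Python sketch of the good suffix shift (my own code, reusing the L_prime and l_values sketches from earlier; i is the 1-based start of the matched suffix, so i = n+1 when already P[n] mismatches):

def good_suffix_shift(i, n, Lp, l):
    # Case 1: L'(i) > 0, shift by n - L'(i); Case 2: shift by n - l(i).
    return n - Lp[i] if Lp[i] > 0 else n - l[i]

P = "maisemaomaloma"
N = N_values(P)
Lp, l = L_prime(P, N), l_values(P, N)
print(good_suffix_shift(13, len(P), Lp, l))   # matched "ma", mismatch at P[12]: 14 - 7 = 7
print(good_suffix_shift(10, len(P), Lp, l))   # matched "aloma", mismatch at P[9]: 14 - 2 = 12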

Which Shift to Use?

Since neither the bad character rule nor the good suffix rule misses any occurrence, we can use the maximum of the alternative shift values.

Complete Boyer-Moore Algorithm:

    // Preprocessing:
    Compute R(x) for each x ∈ Σ;
    Compute L'(i) and l(i) for each i = 2, ..., n + 1;


BM Search Loop

    // Search:
    k := n;
    while k ≤ m do
        i := n; h := k;
        while i > 0 and P[i] = T[h] do
            i := i - 1; h := h - 1;
        endwhile;
        if i = 0 then
            Report an occurrence at T[h+1 .. k];
            k := k + n - l(2);
        else  // mismatch at P[i]
            Increase k by the maximum shift given by the bad character rule
            and the good suffix rule;
        endif;
    endwhile;
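For reference, a compact Python version of this loop (my own sketch, assuming the bad_char_table, N_values, L_prime and l_values helpers sketched on the earlier slides; occurrences are reported as 1-based start positions):

def boyer_moore(P, T):
    n, m = len(P), len(T)
    R = bad_char_table(P)
    N = N_values(P)
    Lp, l = L_prime(P, N), l_values(P, N)
    occ = []
    k = n                                       # T[k] is aligned with P[n]
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:   # compare right-to-left (1-based logic)
            i -= 1
            h -= 1
        if i == 0:
            occ.append(h + 1)                   # occurrence at T[h+1 .. k]
            k += n - l[2]
        else:                                   # mismatch at P[i] against T[h]
            bad = max(1, i - R.get(T[h - 1], 0))
            good = n - Lp[i + 1] if Lp[i + 1] > 0 else n - l[i + 1]
            k += max(bad, good)
    return occ

print(boyer_moore("maisemaomaloma", "maistuko kaima maisemaomaloma?"))   # -> [16]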

Final Remarks

The presented rules carefully avoid performing unnecessary comparisons that would fail.

They can be shown to lead to linear-time behavior, but only if P does not occur in T; otherwise the worst-case complexity is still Θ(nm). A simple modification (the Galil rule; Gusfield, Sect. 3.2.2) corrects this and leads to provable worst-case linear time.

On natural-language texts the running time is almost always sublinear.

