Lecture 05
The algorithms we have seen so far access every character of the text. If we
start the comparison between the pattern and the current text position from
the end, we can often skip some text characters completely.
There are many algorithms that start from the end. The simplest are the
Horspool-type algorithms.
The Horspool algorithm first checks the text character aligned with the last
pattern character. If it doesn't match, the pattern is moved (shifted) forward
until a pattern character aligned with that text character matches.
More precisely, suppose we are currently comparing P against T[j..j + m).
Start by comparing P[m − 1] to T[k], where k = j + m − 1.
• If P[m − 1] ≠ T[k], shift the pattern until the pattern character aligned
with T[k] matches, or until the full pattern is past T[k].
• If P[m − 1] = T[k], compare the rest in brute force manner. Then shift
to the next position, where T[k] matches.
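The two cases above can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own pseudocode: the function name `horspool_search`, the dictionary-based shift table, and the convention of returning n when there is no occurrence are assumptions made for this example.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the position of the first occurrence of pattern in text,
    or len(text) if there is none (an assumed convention)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0

    # shift[c] = distance from the last occurrence of c in P[0..m-1) to the
    # last pattern position. Characters that do not occur there are absent
    # from the table and get the default shift m.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i

    j = 0
    while j + m <= n:
        last = text[j + m - 1]          # T[k] with k = j + m - 1
        if last == pattern[m - 1]:
            # Last characters match: compare the rest by brute force.
            i = m - 2
            while i >= 0 and pattern[i] == text[j + i]:
                i -= 1
            if i < 0:
                return j
        # Shift until a pattern character matches T[k],
        # or until the whole pattern is past T[k].
        j += shift.get(last, m)
    return n


print(horspool_search("apassi", "assi"))  # 2
```

Using a dictionary with a default of m avoids initializing a table over the whole alphabet. With an array indexed by an integer alphabet, the preprocessing would take O(σ + m) time.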
On an integer alphabet:
• Preprocessing time is O(σ + m).
• In the worst case, the search time is O(mn).
For example, P = ba^(m−1) and T = a^n.
• In the best case, the search time is O(n/m).
For example, P = b^m and T = a^n.
• In the average case, the search time is O(n/ min(m, σ)).
This assumes that each pattern and text character is picked
independently from the uniform distribution.
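To make the worst-case and best-case bounds concrete, the following sketch counts character comparisons for the two example inputs. The helper `horspool_comparisons` and the choice n = 20, m = 5 are assumptions made for this illustration. The helper stops at the first occurrence, which never happens for these inputs.

```python
def horspool_comparisons(text: str, pattern: str) -> int:
    """Count the character comparisons Horspool makes up to the first
    occurrence of a non-empty pattern (hypothetical helper)."""
    n, m = len(text), len(pattern)
    # Later positions overwrite earlier ones, so each character keeps the
    # shift of its last occurrence in P[0..m-1).
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    comparisons = 0
    j = 0
    while j + m <= n:
        last = text[j + m - 1]
        comparisons += 1                 # P[m-1] against T[j+m-1]
        if last == pattern[m - 1]:
            i = m - 2
            while i >= 0:
                comparisons += 1         # P[i] against T[j+i]
                if pattern[i] != text[j + i]:
                    break
                i -= 1
            if i < 0:
                break                    # occurrence found
        j += shift.get(last, m)
    return comparisons


n, m = 20, 5
# Worst case: every alignment costs m comparisons and the shift is 1.
print(horspool_comparisons("a" * n, "b" + "a" * (m - 1)))  # 80 = m * (n - m + 1)
# Best case: every alignment costs 1 comparison and the shift is m.
print(horspool_comparisons("a" * n, "b" * m))              # 4 = n / m
```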
BNDM
Starting matching from the end enables long shifts.
• The Horspool algorithm bases the shift on a single character.
• The Boyer–Moore algorithm uses the matching suffix and the
mismatching character.
• Factor based algorithms continue matching until no pattern factor
matches. This may require more comparisons but it enables longer
shifts.
Factor based algorithms use an automaton that accepts suffixes of the
reverse pattern P^R (or equivalently reverse prefixes of the pattern P).
• BDM (Backward DAWG Matching) uses a deterministic automaton
that accepts exactly the suffixes of P^R.
DAWG (Directed Acyclic Word Graph) is also known as suffix automaton.
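[Figure: automaton for P = assi, with states 3, 2, 1, 0, −1 drawn from left to right and transitions labeled a, s, s, i between consecutive states.]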
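Suppose we are currently comparing P against T[j..j + m). We use the
automaton to scan the text backwards from T[j + m − 1]. When the
automaton has scanned T[j + i..j + m):
• If the automaton is in an accept state, then T[j + i..j + m) is a prefix
of P. If i = 0, we have found an occurrence. Otherwise we record shift = i.
• If the automaton can still reach an accept state, then T[j + i..j + m) is
a factor of P and we continue scanning.
• If the automaton can no longer reach an accept state, we stop scanning
and shift the pattern: j ← j + shift.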
BNDM does a bit-parallel simulation of the nondeterministic automaton,
which is quite similar to Shift-And.
The state of the automaton is stored in a bitvector D. When the
automaton has scanned T[j + i..j + m):
• D.k = 1 if and only if there is a path from the initial state to state k
with the string (T[j + i..j + m))^R.
• If D.(m − 1) = 1, then T[j + i..j + m) is a prefix of the pattern.
• If D = 0, then the automaton can no longer reach an accept state.
Algorithm 2.15: BNDM
Input: text T = T[0..n), pattern P = P[0..m)
Output: position of the first occurrence of P in T
Preprocess:
  (1)  for c ∈ Σ do B[c] ← 0
  (2)  for i ← 0 to m − 1 do B[P[m − 1 − i]] ← B[P[m − 1 − i]] + 2^i
Search:
  (3)  j ← 0
  (4)  while j + m ≤ n do
  (5)      i ← m; shift ← m
  (6)      D ← 2^m − 1                // D ← 1^m
  (7)      while D ≠ 0 do
               // Now T[j + i..j + m) is a pattern factor
  (8)          i ← i − 1
  (9)          D ← D & B[T[j + i]]
  (10)         if D & 2^(m−1) ≠ 0 then
                   // Now T[j + i..j + m) is a pattern prefix
  (11)             if i = 0 then return j
  (12)             else shift ← i
  (13)         D ← D << 1
  (14)     j ← j + shift
  (15) return n
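Algorithm 2.15 translates almost line by line into Python. The sketch below is one possible rendering, not a reference implementation. The name `bndm_search` is hypothetical, the table B is a dictionary rather than an array over Σ, and D is explicitly masked to m bits because Python integers are unbounded (the pseudocode assumes D behaves like an m-bit word).

```python
def bndm_search(text: str, pattern: str) -> int:
    """Return the position of the first occurrence of pattern in text,
    or len(text) if there is none, following Algorithm 2.15."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0

    # Preprocess: bit i of B[c] is set iff P[m - 1 - i] == c.
    # OR is equivalent to the "+ 2^i" of the pseudocode because each bit
    # is added at most once.
    B = {}
    for i in range(m):
        c = pattern[m - 1 - i]
        B[c] = B.get(c, 0) | (1 << i)

    full = (1 << m) - 1      # 1^m
    top = 1 << (m - 1)       # bit m - 1: a pattern prefix has been read

    j = 0
    while j + m <= n:
        i, shift = m, m
        D = full
        while D != 0:
            # Here T[j+i..j+m) is a pattern factor.
            i -= 1
            D &= B.get(text[j + i], 0)
            if D & top:
                # Here T[j+i..j+m) is a pattern prefix.
                if i == 0:
                    return j
                shift = i
            D = (D << 1) & full   # assumption: keep D within m bits
        j += shift
    return n


print(bndm_search("mississippi", "ssi"))  # 2
```

In a language with fixed-width words, the mask after the left shift is unnecessary when m equals the word size w. For m < w the AND with B in the next iteration clears any bits above position m − 1, at the cost of possibly one extra loop iteration.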
Example 2.16: P = assi, T = apassi.
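The run of BNDM on this example can be reproduced with a small tracing sketch. The function `bndm_trace` is a hypothetical helper written for this illustration. It records D after the AND step of each iteration, printed with bit m − 1 leftmost, and assumes D is kept within m bits.

```python
def bndm_trace(text: str, pattern: str):
    """Run BNDM on a non-empty pattern and record (j, i, T[j+i], D) after
    each AND step. Returns (position, steps); position is len(text) if
    there is no occurrence."""
    n, m = len(text), len(pattern)
    B = {}
    for i in range(m):
        c = pattern[m - 1 - i]
        B[c] = B.get(c, 0) | (1 << i)
    full, top = (1 << m) - 1, 1 << (m - 1)

    steps = []
    j = 0
    while j + m <= n:
        i, shift, D = m, m, full
        while D != 0:
            i -= 1
            D &= B.get(text[j + i], 0)
            steps.append((j, i, text[j + i], format(D, f"0{m}b")))
            if D & top:
                if i == 0:
                    return j, steps
                shift = i
            D = (D << 1) & full
        j += shift
    return n, steps


pos, steps = bndm_trace("apassi", "assi")
for j, i, c, d in steps:
    print(f"j={j} i={i} T[j+i]={c} D={d}")
print("occurrence at", pos)
# j=0 i=3 T[j+i]=s D=0110
# j=0 i=2 T[j+i]=a D=1000
# j=2 i=3 T[j+i]=i D=0001
# j=2 i=2 T[j+i]=s D=0010
# j=2 i=1 T[j+i]=s D=0100
# j=2 i=0 T[j+i]=a D=1000
# occurrence at 2
```

In the first window the prefix a is recognized at i = 2, so shift = 2, and D becomes 0 after the left shift. The text characters T[0] = a and T[1] = p are never inspected.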
On an integer alphabet when m ≤ w:
• Preprocessing time is O(σ + m).
• In the worst case, the search time is O(mn).
For example, P = a^(m−1)b and T = a^n.
• In the best case, the search time is O(n/m).
For example, P = b^m and T = a^n.
• In the average case, the search time is O(n(log_σ m)/m).
This is optimal! It has been proven that any algorithm needs to inspect
Ω(n(log_σ m)/m) text characters on average.
• The average-case search time of BDM and BOM is O(n(log_σ m)/m), which is
optimal. (BNDM is optimal only when m ≤ w.)
• There are also algorithms that are optimal in both the worst case and
the average case. They are based on similar techniques, but we will not
describe them here.