Lecture 05

The document discusses string matching algorithms like Horspool, Boyer–Moore, and Backward Nondeterministic DAWG Matching (BNDM). It explains how these algorithms work and analyzes their time complexities in the worst, best, and average cases.


Horspool

The algorithms we have seen so far access every character of the text. If we
start the comparison between the pattern and the current text position from
the end, we can often skip some text characters completely.

There are many algorithms that start from the end. The simplest are the
Horspool-type algorithms.

The Horspool algorithm first checks the text character aligned with the last
pattern character. If it does not match, the pattern is moved (shifted)
forward until there is a match at that position.

Example 2.10: Horspool

T = ainaisesti-ainainen, P = ainainen.

ainaisesti-ainainen
ainainen                 mismatch at T[7] = s, shift 8
        ainainen         mismatch at T[15] = i, shift 3
           ainainen      occurrence at position 11
More precisely, suppose we are currently comparing P against T[j..j + m).
Start by comparing P[m − 1] to T[k], where k = j + m − 1.
• If P[m − 1] ≠ T[k], shift the pattern forward until the pattern character
aligned with T[k] matches, or until the whole pattern is past T[k].
• If P[m − 1] = T[k], compare the rest in a brute-force manner. Then shift
the pattern forward to the next position where the pattern character
aligned with T[k] matches.

Algorithm 2.11: Horspool

Input: text T = T[0..n), pattern P = P[0..m)
Output: position of the first occurrence of P in T
Preprocess:
(1) for c ∈ Σ do shift[c] ← m
(2) for i ← 0 to m − 2 do shift[P[i]] ← m − 1 − i
Search:
(3) j ← 0
(4) while j + m ≤ n do
(5)     if P[m − 1] = T[j + m − 1] then
(6)         i ← m − 2
(7)         while i ≥ 0 and P[i] = T[j + i] do i ← i − 1
(8)         if i = −1 then return j
(9)     j ← j + shift[T[j + m − 1]]
(10) return n
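
A direct transcription of Algorithm 2.11 into Python might look as follows. This is a sketch, not part of the lecture; the function name horspool and the use of a dictionary with a default value in place of the table over Σ are implementation choices.

def horspool(T, P):
    """Return the position of the first occurrence of P in T, or len(T) if there is none."""
    n, m = len(T), len(P)
    # Preprocess: shift[c] = m - 1 - i for the last occurrence P[i] = c with i < m - 1;
    # characters not occurring in P[0..m-2] get the default shift m (via dict.get below).
    shift = {}
    for i in range(m - 1):
        shift[P[i]] = m - 1 - i
    # Search
    j = 0
    while j + m <= n:
        if P[m - 1] == T[j + m - 1]:
            i = m - 2
            while i >= 0 and P[i] == T[j + i]:
                i -= 1
            if i == -1:
                return j
        # Shift based on the text character aligned with the last pattern character.
        j += shift.get(T[j + m - 1], m)
    return n

For example, horspool("ainaisesti-ainainen", "ainainen") returns 11, as in Example 2.10.
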
The length of the shift is determined by the shift table. shift[c] is defined
for all c ∈ Σ:
• If c does not occur in P[0..m − 2], shift[c] = m.
• Otherwise, shift[c] = m − 1 − i, where P[i] = c is the last occurrence of
c in P[0..m − 2].

Example 2.12: P = ainainen.

c                last occurrence in P[0..6]   shift[c]
a                position 3                   4
e                position 6                   1
i                position 4                   3
n                position 5                   2
Σ \ {a,e,i,n}    —                            8
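
The shift table above can be reproduced with the preprocessing step alone. A minimal sketch (only the characters occurring in P[0..m − 2] are stored; every other character implicitly gets the default shift m):

def horspool_shift_table(P):
    """shift[c] for the characters occurring in P[0..m-2]; all others default to m."""
    m = len(P)
    shift = {}
    for i in range(m - 1):        # the last pattern character P[m-1] is excluded
        shift[P[i]] = m - 1 - i   # a later occurrence overwrites an earlier one
    return shift

print(horspool_shift_table("ainainen"))   # {'a': 4, 'i': 3, 'n': 2, 'e': 1}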

On an integer alphabet:
• Preprocessing time is O(σ + m).
• In the worst case, the search time is O(mn).
  For example, P = ba^(m−1) and T = a^n.
• In the best case, the search time is O(n/m).
  For example, P = b^m and T = a^n.
• In the average case, the search time is O(n/min(m, σ)).
  This assumes that each pattern and text character is picked
  independently and uniformly at random.

In practice, a tuned implementation of Horspool is very fast when the
alphabet is not too small.

BNDM
Starting matching from the end enables long shifts.
• The Horspool algorithm bases the shift on a single character.
• The Boyer–Moore algorithm uses the matching suffix and the
mismatching character.
• Factor-based algorithms continue matching until no pattern factor
matches. This may require more comparisons, but it enables longer
shifts.

Example 2.13: T = varmasti-aikaisen-ainainen, P = ainaisen-ainainen.
In the first window T[0..17), the comparison from the end matches the suffix "en"
and then fails at T[14] = s; the factor-based backward scan reads "kaisen" and
fails at T[11] = k.

Horspool shift: 2 (determined by the last window character T[16] = n)
varmasti-aikaisen-ainainen
ainaisen-ainainen
  ainaisen-ainainen

Boyer–Moore shift: 9 (determined by the matched suffix "en" and the mismatching character s)
varmasti-aikaisen-ainainen
ainaisen-ainainen
         ainaisen-ainainen

Factor shift: 17 (no prefix of P is found before the scan fails, so the pattern is shifted past the whole window)
varmasti-aikaisen-ainainen
ainaisen-ainainen
                 ainaisen-ainainen
Factor-based algorithms use an automaton that accepts the suffixes of the
reverse pattern P^R (or equivalently the reversed prefixes of the pattern P).
• BDM (Backward DAWG Matching) uses a deterministic automaton
  that accepts exactly the suffixes of P^R.
  DAWG (Directed Acyclic Word Graph) is also known as a suffix automaton.

• BNDM (Backward Nondeterministic DAWG Matching) simulates a
  nondeterministic automaton.

Example 2.14: P = assi.

[Automaton diagram: states 3, 2, 1, 0, −1 in a row, with consecutive states
connected by transitions labeled a, s, s, i, and ε-transitions from the
initial state to the other states.]

• BOM (Backward Oracle Matching) uses a much simpler deterministic
  automaton that accepts all suffixes of P^R but may also accept some
  other strings. This can cause shorter shifts but not incorrect behaviour.

Suppose we are currently comparing P against T[j..j + m). We use the
automaton to scan the text backwards from T[j + m − 1]. When the
automaton has scanned T[j + i..j + m), there are three cases (a code
sketch of the procedure follows the list):

• If the automaton is in an accept state, then T[j + i..j + m) is a prefix of P.
  ⇒ If i = 0, we have found an occurrence.
  ⇒ Otherwise, mark the prefix match by setting shift = i. This is the
    length of the shift that would achieve a matching alignment.

• If the automaton can still reach an accept state, then T[j + i..j + m) is
  a factor of P.
  ⇒ Continue scanning.

• When the automaton can no longer reach an accept state:
  ⇒ Stop scanning and shift: j ← j + shift.
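
The sketch below spells out this scan-and-shift scheme in Python. It is not from the lecture: the automaton is replaced by a naive substring containment test, which is far slower than BDM/BNDM/BOM but makes the shift logic explicit, and the name backward_factor_search is an illustrative choice.

def backward_factor_search(T, P):
    """Find the first occurrence of P in T by backward scanning,
    using a naive 'is this a factor of P' test instead of an automaton."""
    n, m = len(T), len(P)
    j = 0
    while j + m <= n:
        i = m          # nothing scanned yet
        shift = m      # default shift if no prefix of P is found
        while i > 0 and T[j + i - 1 : j + m] in P:    # extending the scan still gives a factor of P
            i -= 1
            if P.startswith(T[j + i : j + m]):        # the scanned part is a prefix of P
                if i == 0:
                    return j                          # the whole window matches
                shift = i                             # remember the prefix match
        j += shift
    return n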

BNDM does a bit-parallel simulation of the nondeterministic automaton,
which is quite similar to Shift-And.

The state of the automaton is stored in a bitvector D. When the
automaton has scanned T[j + i..j + m):
• D.i = 1 if and only if there is a path from the initial state to state i
  with the string (T[j + i..j + m))^R.
• If D.(m − 1) = 1, then T[j + i..j + m) is a prefix of the pattern.
• If D = 0, then the automaton can no longer reach an accept state.

Updating D uses precomputed bitvectors B[c], for all c ∈ Σ:
• B[c].i = 1 if and only if P[m − 1 − i] = P^R[i] = c.

The update when reading T[j + i] is familiar: D ← (D << 1) & B[T[j + i]]
• Note that there is no “+1”. This is because D.(−1) = 0 always, so the
  shift brings the right bit to D.0. With Shift-And, D.(−1) = 1 always.
• The exception is the beginning, before anything has been read, when
  D.(−1) = 1. This is handled by starting the computation with the first
  shift already performed. Because of this, the shift is done at the end of
  the loop.
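
As a small illustration of the precomputation and the update, the following sketch computes the B table and performs two update steps; bndm_masks is an illustrative helper, and the comments show the bit patterns for P = assi (cf. Example 2.16 below).

def bndm_masks(P):
    """B[c] has bit i set if and only if P[m - 1 - i] == c."""
    m = len(P)
    B = {}
    for i in range(m):
        c = P[m - 1 - i]
        B[c] = B.get(c, 0) | (1 << i)
    return B

B = bndm_masks("assi")            # B['a'] = 0b1000, B['s'] = 0b0110, B['i'] = 0b0001
D = (1 << 4) - 1                  # D = 1111: nothing scanned, the first shift is already "done"
D = D & B.get('s', 0)             # read text character 's': D = 0110
D = (D << 1) & B.get('a', 0)      # read 'a': D = 1000, the prefix bit D.(m-1) is now set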

Algorithm 2.15: BNDM

Input: text T = T[0..n), pattern P = P[0..m)
Output: position of the first occurrence of P in T
Preprocess:
(1) for c ∈ Σ do B[c] ← 0
(2) for i ← 0 to m − 1 do B[P[m − 1 − i]] ← B[P[m − 1 − i]] + 2^i
Search:
(3) j ← 0
(4) while j + m ≤ n do
(5)     i ← m; shift ← m
(6)     D ← 2^m − 1   // D ← 1^m
(7)     while D ≠ 0 do
            // Now T[j + i..j + m) is a pattern factor
(8)         i ← i − 1
(9)         D ← D & B[T[j + i]]
(10)        if D & 2^(m−1) ≠ 0 then
                // Now T[j + i..j + m) is a pattern prefix
(11)            if i = 0 then return j
(12)            else shift ← i
(13)        D ← D << 1
(14)    j ← j + shift
(15) return n
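
A Python transcription of Algorithm 2.15 might look like this. It is a sketch; the algorithm assumes m ≤ w, although in Python the restriction is immaterial because integers are unbounded.

def bndm(T, P):
    """Return the position of the first occurrence of P in T, or len(T) if there is none."""
    n, m = len(T), len(P)
    # Preprocess: B[c] has bit i set iff P[m - 1 - i] == c.
    B = {}
    for i in range(m):
        B[P[m - 1 - i]] = B.get(P[m - 1 - i], 0) | (1 << i)
    accept = 1 << (m - 1)          # mask for the prefix bit D.(m-1)
    # Search
    j = 0
    while j + m <= n:
        i, shift = m, m
        D = (1 << m) - 1           # D = 1^m
        while D != 0:
            # now T[j+i..j+m) is a pattern factor
            i -= 1
            D &= B.get(T[j + i], 0)
            if D & accept != 0:
                # now T[j+i..j+m) is a pattern prefix
                if i == 0:
                    return j
                shift = i
            D <<= 1
        j += shift
    return n

With the strings of Example 2.16, bndm("apassi", "assi") returns 2.
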
Example 2.16: P = assi, T = apassi.

The bitvectors are B[a] = 1000, B[i] = 0001, B[p] = 0000 and B[s] = 0110
(bit i corresponds to P[m − 1 − i]; bit m − 1 is written leftmost).

D when scanning the first window T[0..4) = apas backwards
(D is shown after the & with B[c]):
  initially   D = 1111
  read s:     D = 0110
  read a:     D = 1000   prefix bit set, i = 2  ⇒ shift = 2
  read p:     D = 0000   ⇒ stop and shift by 2

D when scanning the second window T[2..6) = assi backwards:
  initially   D = 1111
  read i:     D = 0001
  read s:     D = 0010
  read s:     D = 0100
  read a:     D = 1000   prefix bit set, i = 0  ⇒ occurrence at position 2
On an integer alphabet, when m ≤ w:
• Preprocessing time is O(σ + m).
• In the worst case, the search time is O(mn).
  For example, P = a^(m−1)b and T = a^n.
• In the best case, the search time is O(n/m).
  For example, P = b^m and T = a^n.
• In the average case, the search time is O(n(log_σ m)/m).
  This is optimal! It has been proven that any algorithm needs to inspect
  Ω(n(log_σ m)/m) text characters on average.

When m > w, there are several options:
• Use multi-word bitvectors.
• Search for a pattern prefix of length w and check the rest when the
  prefix is found (see the sketch after this list).
• Use BDM or BOM.
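
For the second option, here is a minimal sketch of the combination, assuming the bndm function from the previous sketch; the name bndm_long_pattern and the parameter w are illustrative, and in Python the word-length restriction is artificial because integers are unbounded.

def bndm_long_pattern(T, P, w=64):
    """Search for a long pattern: run BNDM on the length-w prefix of P
    and verify the remaining characters at every candidate position."""
    n, m = len(T), len(P)
    if m <= w:
        return bndm(T, P)
    prefix = P[:w]
    j = 0
    while j + m <= n:
        k = bndm(T[j:], prefix)      # next occurrence of the prefix at or after j
        if k == n - j:               # the prefix does not occur again
            return n
        j += k
        if j + m <= n and T[j:j + m] == P:
            return j                 # the rest of the pattern matches too
        j += 1                       # otherwise continue after this candidate
    return n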

• The search time of BDM and BOM is O(n(log_σ m)/m), which is
  optimal on average. (BNDM is optimal only when m ≤ w.)

• MP and KMP are optimal in the worst case.

• There are also algorithms that are optimal in both cases. They are
based on similar techniques, but we will not describe them here.

