13 Filter Algorithms and Approximate Index Search
1. Kärkkäinen, J., and Na, J. C. (2007). Faster filters for approximate string matching, 7, 84–90. Springer
13.2 Definitions
We consider a string T of length n. For i, j ∈ N we define:
• [i..j] := {i, i + 1, . . . , j}
• [i..j) := [i..j − 1]
• T[i] is the i-th character of T (counting from 0)
• |T| denotes the string length, i. e. |T| = n
• In this lecture, T[i..j] does not denote the substring from position i to position j; it denotes a concatenation of factors (explained later)
13.3 Introduction
We have seen effective filters for the approximate string matching problem that are based on q-gram counting, e.g.
QUASAR and SWIFT. They rely on the q-gram lemma, which states that a certain number of overlapping
q-grams are shared between the query and each approximate match.
Another simple but effective family of filters are the so-called factor filters, which are based on a factorization
of the pattern. A factorization of a string S is a sequence of strings (factors) whose concatenation is S.
Example 1 (factorization). A possible factorization of the string S = GATTACA would be the sequence of factors
G, AT, TAC, A.
Theorem 2 (pigeonhole lemma). Let A = A0 A1 · · · Ak be a string that is the concatenation of k + 1 non-empty factors
Ai . If a string B is within edit distance k from A, then at least one of the factors Ai is a factor of B.
As a consequence we can divide our pattern P into k + 1 non-empty factors and an exact text occurrence of one
factor signals a potential approximate match.
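The filtration step described above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function names `split_into_factors` and `candidate_diagonals` are hypothetical, the factors are simply made as equal in length as possible, and `str.find` stands in for an index lookup. Each reported value is only a candidate and still has to be verified, allowing for a shift of up to k positions.

```python
def split_into_factors(pattern: str, k: int) -> list[tuple[int, str]]:
    """Split pattern into k + 1 non-empty factors of near-equal length.

    Returns (offset within pattern, factor) pairs.
    """
    parts = k + 1
    if len(pattern) < parts:
        raise ValueError("pattern is too short for k + 1 non-empty factors")
    base, extra = divmod(len(pattern), parts)
    factors = []
    pos = 0
    for i in range(parts):
        # The first `extra` factors get one additional character.
        length = base + (1 if i < extra else 0)
        factors.append((pos, pattern[pos:pos + length]))
        pos += length
    return factors


def candidate_diagonals(text: str, pattern: str, k: int) -> set[int]:
    """Collect text positions (hit - offset) where an exact factor hit
    signals a potential approximate match of the whole pattern."""
    candidates = set()
    for offset, factor in split_into_factors(pattern, k):
        hit = text.find(factor)
        while hit != -1:
            candidates.add(hit - offset)
            hit = text.find(factor, hit + 1)
    return candidates


print(split_into_factors("GATTACA", 1))
print(candidate_diagonals("xxGATTTCAxx", "GATTACA", 1))
```

In the usage above, the text contains GATTTCA, which differs from the pattern by one substitution; the first factor still occurs exactly, so the filter reports the corresponding position.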
The specificity of a factor filter is influenced by the expected number of random hits of each factor. The more
random text occurrences the factors have, the more false positive matches the filter will output resulting in
more unsuccessful verifications in a subsequent step. A stronger filter criterion can be reached with longer but
approximate factors, i.e. factors that are searched with errors.
However, before introducing stronger factor filters, we define an optimal factorization.
Theorem 3 (optimal factorization). For two strings A and B, let A0 A1 · · · As−1 be a factorization of A. Then there exists
a factorization B0 B1 · · · Bs−1 of B with $\mathrm{dist}(A, B) = \sum_{i \in [0..s)} \mathrm{dist}(A_i, B_i)$. We call such a factorization of B optimal.
Factor and Suffix Filters, by David Weese, June 11, 2013, 01:00 13001
Proof: Exercise.
Example 4. For the strings
A = GATTACA
B = ATCACTA
an optimal factorization would be:
A = GAT · TA · CA
B = AT · CA · CTA ,
as it holds:
dist(A, B) = dist(GAT, AT) + dist(TA, CA) + dist(CA, CTA)
3 = 1+1+1 .
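The numbers in Example 4 can be checked with a short script. The sketch below assumes that dist is the unit-cost edit (Levenshtein) distance; the helper name `edit_distance` is chosen for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


factors_a = ["GAT", "TA", "CA"]
factors_b = ["AT", "CA", "CTA"]

per_factor = [edit_distance(x, y) for x, y in zip(factors_a, factors_b)]
total = edit_distance("".join(factors_a), "".join(factors_b))
print(per_factor, total)
```

The per-factor distances add up to the distance of the whole strings, which is exactly the property that makes this factorization of B optimal.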
If we now assign a weight ti ∈ N ∪ {0} to each factor Ai, we can search approximate matches with fewer than
$t = \sum_{i \in [0..s)} t_i$ errors using the following corollary of Theorem 3:
Corollary 5 (weighted pigeonhole lemma). If dist(A, B) < t, then there exists an i ∈ [0..s) such that dist(Ai , Bi ) < ti .
For the k-difference problem the optimal choice would be t = k + 1. As a consequence, if dist(A, B) ≤ k < t, B
must have a factor whose edit distance to some Ai is less than ti .
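The corollary follows from Theorem 3 by contraposition. The short derivation below is a sketch that assumes B0 B1 · · · Bs−1 is an optimal factorization of B with respect to the chosen factorization of A.

```latex
\begin{align*}
&\text{Assume } \mathrm{dist}(A_i, B_i) \ge t_i \text{ for all } i \in [0..s). \text{ Then} \\
&\mathrm{dist}(A, B)
  = \sum_{i \in [0..s)} \mathrm{dist}(A_i, B_i)
  \ge \sum_{i \in [0..s)} t_i
  = t ,
\end{align*}
```

which contradicts dist(A, B) < t. Hence at least one i satisfies dist(Ai, Bi) < ti.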
For 0 ≤ i ≤ j < s, let A[i.. j] denote the concatenation of factors Ai Ai+1 · · · A j . In particular, A[i..s − 1] is a suffix
of A. (We use the same notation for B).
Definition 6 (strong match). We say that A and B match on interval [i..j] if
$$\mathrm{dist}(A[i..j], B[i..j]) < \sum_{h \in [i..j]} t_h ,$$
and strongly match on [i..j] if they match on every interval [i..j′], j′ = i, . . . , j, that is, if they match on every
nonempty prefix of [i..j].
While the weighted pigeonhole lemma only states that there is a match [i..i], the following more general theorem
guarantees that there is a strong match [i..s) for a certain i ∈ [0..s).
Theorem 7. If dist(A, B) < t, there exists i ∈ [0..s) such that A and B strongly match on [i..s).
Proof: Let [0..i) be the longest prefix interval on which A and B do not match, i.e. $\mathrm{dist}(A[0..i), B[0..i)) \ge \sum_{h \in [0..i)} t_h$.
It is well defined as i = 0 always satisfies the condition. It holds i < s as [0..s) is always a match.
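Then, A and B strongly match on [i..s). To show this, assume the opposite, i.e. that there exists j ∈ [i..s) such
that A and B do not match on [i..j]. Since the factorization of B is optimal, the distances on [0..i) and [i..j] add
up to the distance on [0..j], and the weights add up in the same way. But then A and B do not match on
[0..j] = [0..i) ∪ [i..j], which is a longer prefix interval than [0..i) and contradicts [0..i) being maximal.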
Example 8. In the situation of the following table, A and B strongly match on [1..5) but not on any other suffix
[i..5):

i               0  1  2  3  4
ti              1  1  2  1  1
dist(Ai , Bi )  1  0  1  2  1
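Example 8 can be replayed with a few lines of code. The sketch below assumes an optimal factorization of B, so that the distance on an interval equals the sum of the per-factor distances; the names `matches`, `strongly_matches` and `strong_suffix_start` are illustrative. `strong_suffix_start` follows the construction in the proof of Theorem 7: it takes the longest prefix interval [0..i) that does not match and returns i.

```python
def matches(t, d, i, j):
    """A and B match on [i..j] if the summed distances stay below the summed weights."""
    return sum(d[i:j + 1]) < sum(t[i:j + 1])


def strongly_matches(t, d, i, j):
    """Strong match on [i..j]: a match on every non-empty prefix [i..j'] of [i..j]."""
    return all(matches(t, d, i, jp) for jp in range(i, j + 1))


def strong_suffix_start(t, d):
    """Return i such that [i..s) is a strong match, or None if dist(A, B) >= t."""
    s = len(t)
    # p = 0 always qualifies because both sums are empty.
    i = max(p for p in range(s + 1) if sum(d[:p]) >= sum(t[:p]))
    return i if i < s else None


t = [1, 1, 2, 1, 1]  # weights t_i from Example 8
d = [1, 0, 1, 2, 1]  # distances dist(A_i, B_i) from Example 8

start = strong_suffix_start(t, d)
strong_suffixes = [i for i in range(len(t)) if strongly_matches(t, d, i, len(t) - 1)]
print(start, strong_suffixes)
```

For the table above, both the constructive search and the exhaustive check single out the suffix starting at factor 1.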
The filter algorithm is identical to that of factor filters, except that instead of searching for separate factors, the
filtration phase searches for suffixes of the pattern satisfying the strong match condition. That is, it searches for each
suffix A[i..s) with fewer than $\sum_{j \in [i..s)} t_j$ errors.
To get an intuition on suffix filters, we want to compare the following (sub)problems of using the index to find
the occurrences of:
2. In the 2nd and 3rd case only t0 in the first factor are sufficient.
3. While the factor filter lets a candidate pass after matching the first factor, a suffix filter continues the search
and has more chances for elimination.
Hence a suffix filter is more specific than a factor filter and saves time due to fewer unsuccessful verifications.
[Figure: NFA over the pattern "pattern" with one row for no errors and one row for 1 error; transitions are labelled Σ and ε.]

13.11 Suffix Filter Parameters
The practically best factorization divides a pattern of length m into k + 1 factors of weight 1.
The length ℓ of the last factor should be greater than that of the others to avoid random matches of the shortest suffix. All
other factor sizes should be distributed evenly, i.e. either ⌊(m−ℓ)/k⌋ or ⌈(m−ℓ)/k⌉.
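The parameter rule above can be turned into a small helper. This is a sketch under assumptions: the function name `suffix_filter_factor_lengths` is hypothetical, and giving the longer (ceiling-sized) factors to the front of the pattern is an arbitrary choice that the text does not prescribe. The last factor length ℓ is passed in as `last`, since it has to be determined experimentally.

```python
def suffix_filter_factor_lengths(m: int, k: int, last: int) -> list[int]:
    """Lengths of k + 1 factors for a pattern of length m.

    The first k factors have length floor((m - last) / k) or
    ceil((m - last) / k); the final factor has length `last`.
    Every factor is meant to carry weight 1.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    base, extra = divmod(m - last, k)
    if last < 1 or base < 1:
        raise ValueError("all factors must be non-empty")
    # `extra` factors receive the ceiling, the remaining ones the floor.
    return [base + 1] * extra + [base] * (k - extra) + [last]


lengths = suffix_filter_factor_lengths(m=32, k=4, last=10)
weights = [1] * len(lengths)
print(lengths, weights)
```

With m = 32, k = 4 and a last factor of length 10, the remaining 22 characters are spread over four factors of length 5 or 6, and the lengths sum to m.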
[Figure 4, panel (a): varying the last factor size ℓ (s = 13); panel (b): varying the number of factors s (ℓ = 6). x-axes: the last factor size and the number of factors; y-axis: time (sec); curves: total time and time for filtering phase.]
Figure 4: Performance of suffix filter on DNA data for m = 40, k = 12, and 200 queries.

13.12 Performance Results
[Figure 5, panels (a) DNA and (b) English. x-axis: error level (%); y-axis: time (sec); curves: on-line, factor filter, suffix filter.]
Figure 5: The effect of error level for m = 30 and 100 queries.
[Figure 6, panels (a) DNA and (b) English. x-axis: length of patterns; y-axis: error level (%); curves: factor filter, suffix filter.]
Figure 6: The error levels up to which each indexing method wins upon online search.

In the worst case, a diagonal-wise simulation step requires O(⌈k²/w⌉) time while a row-wise simulation takes O(k⌈k/w⌉) time, where w is the word size. However, the row-wise simulation can easily take advantage of the heuristic of omitting inactive rows but the diagonal-wise simulation can not. We also tried the row-wise simulation for the standard NFA but it made the factor filters slightly slower. We implemented two simulation instances according to how many words are needed for one row of the staircase NFA: one word and more than one word. The NFA in [13] was implemented by five instances according to the lengths of row and column.

3.3 Verification. We have described how approximate occurrences of factors or suffixes are found using the index. Each occurrence marks an area around it as a part of the potential match area, and the union of these areas is formed by sorting. The potential match area is then searched sequentially using Myers' bit-parallel simulation of the dynamic programming matrix [11, 7], which is one of the fastest algorithms. We implemented four instances according to how many words are needed for one column of the matrix: one, two, ...

4 Experimental Results. In this section we present results from experiments testing the performance of the suffix filter algorithm described in the previous section and comparing it against the factor filter algorithm.
We used three texts: English, DNA, and random data. English and DNA were obtained from Pizza&Chili Corpus web-site (http://pizzachili.dcc.uchile.cl/) and truncated to length 16Mbytes. Random texts are of length 64M with various alphabet sizes. In each test, 100∼10000 patterns of the same length m were selected randomly from text and searched with k errors. We often report the error level α = k/m instead of k. Our machine is a 2.6Ghz Pentium IV with 2GB of RAM, running Linux. We used the gcc compiler version 4.0.2 with option “-O3”.

4.1 Suffix filter parameters. Section 2.2 describes a method for choosing the suffix filter parameters, which leaves one parameter, the last factor size ℓ, to be determined experimentally. Figure 4 (a) illustrates the effect of ℓ in one case. In this example, the best ℓ is 6 and other factor sizes are 3 and 4. As ℓ gets smaller than optimal, the verification time increases rapidly. With larger than optimal ℓ, filtration time increases but at a more modest rate. The behaviour in other cases is similar. The results indicate that using the absolutely optimal ℓ is not crucial but it is better to err in the too large direction. This is in contrast to factor filters, where having the free parameter s off by one in either direction can have a dramatic effect in the running time [13].
The method of Section 2.2 sets the number of factors s to k + 1 but we experimented also with smaller values of s (larger values would mean that ti = 0 for some i). Figure 4 (b) shows the results in one case. Here the last factor size ℓ is fixed to 6, its distance limit ts−1 to 1, and the other factor sizes and distance limits are as even as possible. The results in this and in other cases, too, show that s = k + 1 is the best choice.

4.2 Suffix filter vs. factor filter. In this section, we compare suffix filters to factor filters. The parameters are chosen as described in Sections 2.1 and 2.2. For suffix filters the last factor size ℓ, and for factor filters the number of factors s were optimized experimentally.
First, we show the effect of error level in Figure 5. The curves “on-line” represents the time for verifying the whole text without a filtering step. The figure illustrates that indexing becomes useless when the error level grows too large. Figure 6 shows up to which error level each filtering method wins upon on-line search, as a function of m. For suffix filters this limit of usefulness is substantially higher than for factor filters.
Figure 6 also indicates that the advantage of suffix filters over factor filters increases for longer patterns. More clearly this effect can be seen in Figure 7. In both figures, there is a notable jump when the pattern length goes from 90 to 100. This is caused by the verification stage, which switches from a fast implementation that can handle patterns up to length 96 to a slower implementation that has no limit on the pattern length. As can be seen in Figure 7, suffix filters are less sensitive to the speed of the verification stage than factor filters.
With small alphabet and large error level, filtering is no more competitive with on-line searching. With large alphabet and low error level, the filters are highly effective and having a more effective filter does not help anymore.

5 Acknowledgements. We would like to thank Gonzalo Navarro for kindly providing the source code used in [13].

References
[1] R. A. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127–158, 1999.
[2] S. Burkhardt and J. Kärkkäinen. Better filtering with gapped q-grams. Fundam. Informaticae, 56(1-2):51–70, 2003.