13 Filter Algorithms and Approximate Index Search

This document discusses approximate string matching using factor filters. Factor filters divide a pattern string into factors and search for occurrences of those factors in a text. Exact and approximate factor searching is explained using a suffix trie data structure. Weighted factors allow searching with a certain number of errors. The document provides definitions, examples, and pseudocode for implementing an approximate factor search algorithm.

13.1 Approximate Search in Indices


This exposition has been developed by David Weese. It is based on the following source, which is recommended
reading:

1. Kärkkäinen, J., and Na, J. C. (2007). Faster filters for approximate string matching, 7, 84–90. Springer

13.2 Definitions
We consider a string T of length n. For i, j ∈ N we define:

• [i..j] := {i, i + 1, . . . , j}

• [i..j) := [i..j − 1]
• T[i] is the i-th character of T (counting from 0)
• |T| denotes the string length, i. e. |T| = n
• In this lecture, T[i..j] denotes a substring of T, but not the one from position i to j; the notation refers to a range of factors and is defined in Section 13.8

• The concatenation of strings X, Y is denoted as X · Y

13.3 Introduction
We have seen effective filters for the approximate string matching problem that are based on q-gram counting, e.g.
QUASAR and SWIFT. They rely on the q-gram lemma, which states that the query and each approximate match
share at least a certain number of overlapping q-grams.
Another simple but effective family of filters are the so-called factor filters, which are based on a factorization
of the pattern. A factorization of a string S is a sequence of strings (factors) whose concatenation is S.

Example 1 (factorization). A possible factorization of the string S = GATTACA would be the sequence of factors
G, AT, TAC, A.

13.4 Factor Filters


In the following, we will develop filters for the approximate string matching problem which is to find all substrings
of a text T that are within a distance k of a pattern P. The distance we consider is edit distance (a.k.a. Levenshtein
distance) under which the approximate string matching problem is also called the k-difference problem (Gusfield,
1997).
The simplest factor filter for the k-difference problem is based on the pigeonhole principle:

Theorem 2 (pigeonhole lemma). Let A = A0 A1 · · · Ak be a string that is the concatenation of k + 1 non-empty factors
Ai . If a string B is within edit distance k from A, then at least one of the factors Ai is a factor of B.

As a consequence we can divide our pattern P into k + 1 non-empty factors and an exact text occurrence of one
factor signals a potential approximate match.
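
The filtration step implied by the pigeonhole lemma can be sketched as follows. This is a minimal illustration, not part of the original notes: the function names `split_into_factors` and `candidate_windows` are hypothetical, the factors are simply chosen to be of near-equal length, and the text is scanned with `str.find` instead of an index. Every reported window still has to be verified.

```python
def split_into_factors(pattern, k):
    """Split pattern into k + 1 non-empty factors of near-equal length.

    Returns a list of (offset_in_pattern, factor) pairs.
    """
    s = k + 1
    if len(pattern) < s:
        raise ValueError("pattern too short for k + 1 non-empty factors")
    base, extra = divmod(len(pattern), s)
    factors, start = [], 0
    for i in range(s):
        length = base + (1 if i < extra else 0)
        factors.append((start, pattern[start:start + length]))
        start += length
    return factors


def candidate_windows(text, pattern, k):
    """Return text windows [begin, end) that may contain a match with <= k errors."""
    m = len(pattern)
    windows = set()
    for offset, factor in split_into_factors(pattern, k):
        pos = text.find(factor)
        while pos != -1:
            # With at most k indels, a match starts no earlier than
            # pos - offset - k and ends no later than pos - offset + m + k.
            begin = max(0, pos - offset - k)
            end = min(len(text), pos - offset + m + k)
            windows.add((begin, end))
            pos = text.find(factor, pos + 1)
    return sorted(windows)
```

For example, `candidate_windows("CCGATTACACC", "GATTACA", 1)` splits the pattern into GATT and ACA and reports a single window around both factor hits.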
The specificity of a factor filter is influenced by the expected number of random hits of each factor. The more
random text occurrences the factors have, the more false positive matches the filter will output resulting in
more unsuccessful verifications in a subsequent step. A stronger filter criterion can be reached with longer but
approximate factors, i.e. factors that are searched with errors.
However, before introducing stronger factor filters, we define an optimal factorization.

Theorem 3 (optimal factorization). For two strings A and B, let A0 A1 · · · As−1 be a factorization of A. Then there exists
a factorization B0 B1 · · · Bs−1 with $\mathrm{dist}(A, B) = \sum_{i \in [0..s)} \mathrm{dist}(A_i, B_i)$. We call such a factorization of B optimal.
Factor and Suffix Filters, by David Weese, June 11, 2013, 01:00 13001

Proof: Exercise.
Example 4. For the strings
A = GATTACA
B = ATCACTA
an optimal factorization would be:
A = GAT · TA · CA
B = AT · CA · CTA ,
as it holds:
dist(A, B) = dist(GAT, AT) + dist(TA, CA) + dist(CA, CTA)
3 = 1+1+1 .
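
The numbers of Example 4 can be checked with a few lines of code. The sketch below uses the textbook dynamic programming recurrence for edit distance; the function name `edit_distance` is an illustrative choice and not taken from the notes.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic programming recurrence."""
    prev = list(range(len(b) + 1))           # distances of "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance of a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = curr
    return prev[-1]


factors_a = ["GAT", "TA", "CA"]
factors_b = ["AT", "CA", "CTA"]

# Distances of corresponding factors and of the concatenated strings.
per_factor = [edit_distance(x, y) for x, y in zip(factors_a, factors_b)]
total = edit_distance("".join(factors_a), "".join(factors_b))
print(per_factor, total)
```

The per-factor distances add up to the distance of the whole strings, which is exactly what makes this factorization of B optimal.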

If we now assign a weight ti ∈ N ∪ {0} to each factor Ai , we can search approximate matches with less than
$t = \sum_{i \in [0..s)} t_i$ errors using the following corollary of theorem 3:
Corollary 5 (weighted pigeonhole lemma). If dist(A, B) < t, then there exists an i ∈ [0..s) such that dist(Ai , Bi ) < ti .

The corollary can be proven trivially by contraposition:


∀i∈[0..s) dist(Ai , Bi ) ≥ ti ⇒ dist(A, B) ≥ t .

For the k-difference problem the optimal choice would be t = k + 1. As a consequence, if dist(A, B) ≤ k < t, B
must have a factor whose edit distance to some Ai is less than ti .
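
Written out, the step behind the contraposition is a single chain of (in)equalities. It assumes that B0 B1 · · · Bs−1 is an optimal factorization in the sense of Theorem 3 and that dist(Ai , Bi ) ≥ ti holds for every i ∈ [0..s):

```latex
\mathrm{dist}(A, B)
  \;=\; \sum_{i \in [0..s)} \mathrm{dist}(A_i, B_i)
  \;\ge\; \sum_{i \in [0..s)} t_i
  \;=\; t .
```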

13.5 How to Implement a Factor Filter


Given a pattern P, a factorization P = P0 P1 · · · Ps−1 and weights t0 + t1 + . . . + ts−1 = k + 1, we want to implement
a factor filter using the suffix trie1 of the text T.
[Figure: suffix tree of T = ANANAS (left) and suffix trie of T (right).]

13.6 Exact Factor Search


For the special case t0 = . . . = ts−1 = 1, each factor Pi can be searched exactly with a top-down traversal of the
suffix trie along the path of characters Pi [0], Pi [1], Pi [2] · · · .
If the factor occurs in the text, the search ends in a leaf or an inner suffix tree node. The leaf or the leaves
beneath the node represent all occurrences of the factor in the text.
To verify whether an occurrence is part of a true match, we search P with up to k errors in the surroundings
using a (more expensive) approximate search algorithm.
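
The exact factor search can be sketched on a naive suffix trie. The class `Node` and the functions `build_suffix_trie` and `find_exact` are hypothetical names for illustration. As a simplification, every node stores the start positions of all suffixes passing through it, so the occurrences do not have to be collected from the leaves; the construction takes quadratic time and space and is only meant for small examples.

```python
class Node:
    """Suffix trie node; occurrences holds the start positions below it."""

    def __init__(self):
        self.children = {}
        self.occurrences = []


def build_suffix_trie(text):
    """Insert every suffix of text + '$' character by character."""
    text += "$"
    root = Node()
    for start in range(len(text)):
        node = root
        node.occurrences.append(start)
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.occurrences.append(start)
    return root


def find_exact(root, factor):
    """Top-down traversal along factor[0], factor[1], ...; returns start positions."""
    node = root
    for ch in factor:
        node = node.children.get(ch)
        if node is None:
            return []            # the factor does not occur in the text
    return sorted(node.occurrences)


trie = build_suffix_trie("ANANAS")
print(find_exact(trie, "ANA"))
```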
1 The suffix trie results from the suffix tree after breaking all edges into paths of single character edges.

13.7 Approximate Factor Search


In the more general case, factors have to be searched with ti − 1 errors in the suffix tree. This can be done with
a backtracking approach.
Instead of traversing only matching edges, we traverse all outgoing edges of a suffix tree node and tolerate
errors. Whenever we descend over a mismatching edge, we decrease a counter e that records the number of
remaining tolerable errors.
Indels are simulated by skipping a character either in the pattern or in the suffix tree and decreasing e.

ApproxRecur(F, i, α, e)
    // F: factor, i: compared prefix length
    // α: suffix trie node with path label α
    // e: remaining errors to tolerate
    if e < 0
    then
        return                                  // error budget exceeded
    fi
    if i = |F|
    then
        report occurrences at getOccurrences(α)
        return                                  // factor fully consumed, F[i] is undefined
    fi
    ApproxRecur(F, i + 1, α, e − 1)             // insertion in factor
    for αc ∈ children(α) do
        ApproxRecur(F, i, αc, e − 1)            // deletion in factor
        if F[i] = c
        then
            ApproxRecur(F, i + 1, αc, e)        // match
        else
            ApproxRecur(F, i + 1, αc, e − 1)    // mismatch
        fi
    od
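
A runnable counterpart of ApproxRecur might look as follows. This is a sketch under assumptions: it reuses a naive suffix trie in which each node stores the start positions of the suffixes below it (playing the role of getOccurrences), the names `Node`, `build_suffix_trie` and `approx_search` are illustrative, and the recursion stops once the factor is fully consumed.

```python
class Node:
    """Suffix trie node; occurrences holds the start positions below it."""

    def __init__(self):
        self.children = {}
        self.occurrences = []


def build_suffix_trie(text):
    """Naive quadratic construction of the suffix trie of text + '$'."""
    text += "$"
    root = Node()
    for start in range(len(text)):
        node = root
        node.occurrences.append(start)
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.occurrences.append(start)
    return root


def approx_search(root, factor, max_errors):
    """Start positions of text substrings within max_errors edits of factor."""
    hits = set()

    def recur(i, node, e):
        if e < 0:
            return                                  # error budget exceeded
        if i == len(factor):
            hits.update(node.occurrences)           # factor fully consumed
            return
        recur(i + 1, node, e - 1)                   # skip factor[i]
        for c, child in node.children.items():
            if c == "$":
                continue                            # never match the sentinel
            recur(i, child, e - 1)                  # skip text character c
            recur(i + 1, child, e if factor[i] == c else e - 1)  # match or mismatch

    recur(0, root, max_errors)
    return sorted(hits)


trie = build_suffix_trie("ANANAS")
print(approx_search(trie, "ANS", 1))
```

With one tolerated error, the factor ANS is found at the positions of ANA, AN, ANAS and AS in ANANAS. The same node can be reached along several recursion paths, which is the redundancy that the NFA-based approach of Section 13.9 avoids.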

13.8 Suffix Filters


Again we consider strings A and B and optimal factorizations A0 A1 · · · As−1 and B0 B1 · · · Bs−1 and weights ti
such that $\mathrm{dist}(A, B) < \sum_{i \in [0..s)} t_i$.

For 0 ≤ i ≤ j < s, let A[i.. j] denote the concatenation of factors Ai Ai+1 · · · A j . In particular, A[i..s − 1] is a suffix
of A. (We use the same notation for B).
Definition 6 (strong match). We say that A and B match on interval [i..j] if

$\mathrm{dist}(A[i..j], B[i..j]) < \sum_{h \in [i..j]} t_h$ ,

and strongly match on [i..j] if they match on every interval [i..j′], j′ = i, . . . , j, that means if they match on every
nonempty prefix of [i..j].

While the weighted pigeonhole lemma only states that there is a match [i..i], the following more general theorem
guarantees that there is a strong match [i..s) for a certain i ∈ [0..s).
Theorem 7. If dist(A, B) < t, there exists i ∈ [0..s) such that A and B strongly match on [i..s).
Proof: Let [0..i) be the longest prefix interval, on which A and B do not match, i.e. $\mathrm{dist}(A[0..i), B[0..i)) \ge \sum_{h \in [0..i)} t_h$.
It is well defined as i = 0 always satisfies the condition. It holds i < s as [0..s) is always a match.
Then, A and B strongly match on [i..s). To show this, assume the opposite, i.e. that there exists j ∈ [i..s) such
that A and B do not match on [i.. j]. But then A and B do not match on [0.. j] = [0..i) ∪ [i.. j], which contradicts
[0..i) being maximal.
Example 8. In the situation of the following table, A and B strongly match on [1..5) but not on any other suffix
[i..5):

    i                0   1   2   3   4
    ti               1   1   2   1   1
    dist(Ai , Bi )   1   0   1   2   1

as A and B match on [1..j] for each j ∈ [1..5):

    I                                 [1..1]   [1..2]   [1..3]   [1..4]
    $\sum_{i \in I} t_i$                 1        3        4        5
                                         >        >        >        >
    $\sum_{i \in I} \mathrm{dist}(A_i, B_i)$   0        1        3        4
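
The check carried out in Example 8 can be automated. The sketch below assumes, as the example does, that the distance on an interval of factors is the sum of the per-factor distances of an optimal factorization; the function name `strongly_matching_suffixes` is hypothetical.

```python
from itertools import accumulate


def strongly_matching_suffixes(weights, dists):
    """Return every i such that A and B strongly match on [i..s).

    weights[h] is t_h and dists[h] is dist(A_h, B_h).
    """
    s = len(weights)
    result = []
    for i in range(s):
        weight_sums = accumulate(weights[i:])   # sums of t_h over [i..j]
        dist_sums = accumulate(dists[i:])       # sums of distances over [i..j]
        # Strong match: every nonempty prefix interval [i..j] must match.
        if all(d < t for t, d in zip(weight_sums, dist_sums)):
            result.append(i)
    return result


print(strongly_matching_suffixes([1, 1, 2, 1, 1], [1, 0, 1, 2, 1]))
```

For the table above, only the suffix starting at factor 1 passes the test, in agreement with the example.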

The filter algorithm is identical to factor filters except instead of searching for separate factors, the filtration
phase will search for suffixes of the pattern satisfying the strong match condition. That is, it searches for each
suffix A[i..s) with less than $\sum_{j \in [i..s)} t_j$ errors.
To get an intuition on suffix filters, we want to compare the following (sub)problems of using the index to find
the occurrences of:

1. pattern A with less than t errors


2. factor A0 with less than t0 errors
3. suffix A[0..s) under the strong match condition

1. In the first case, t errors need to be found to eliminate a candidate.

2. In the second and third case, t0 errors in the first factor are already sufficient.
3. While the factor filter lets a candidate pass after matching the first factor, a suffix filter continues the search
and has more chances for elimination.

Hence a suffix filter is more specific than a factor filter and saves time due to fewer unsuccessful verifications.

13.9 Approximate Suffix Search


To search each suffix under the strong match condition, we could extend the ApproxRecur algorithm to search
a sequence of factors and allow $t_{i+1}$ more errors (increase e) after successfully matching the factor Ai .
However, the backtracking of algorithm ApproxRecur visits the same suffix tree node multiple times. Kärkkäinen
and Na use an approach that visits each node at most once and hence requires fewer backtracking steps. They
construct an NFA of the suffix that accepts a suffix within a certain edit distance.
The NFA nodes are arranged in a grid, where the i-th row represents i errors and the j-th column a consumed
prefix of length j (counting from 0). Matches and mismatches are transitions from column j to j + 1, where
matches connect nodes of the same row and mismatches lead from row i to i + 1. Indels are empty (ε) diagonal
transitions, which skip a pattern character, and any-character (Σ) transitions within a column, which skip a text
character; both lead from row i to i + 1.
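
The grid NFA can be simulated directly by keeping the set of active states (i, j), with i the row (errors) and j the column (consumed pattern prefix). The following sketch is a plain set-based simulation for illustration, not the bit-parallel simulation a practical implementation would use; the names `epsilon_closure` and `nfa_accepts` are hypothetical.

```python
def epsilon_closure(states, m, k):
    """Add all states reachable via ε diagonals (i, j) -> (i + 1, j + 1)."""
    closure = set(states)
    stack = list(states)
    while stack:
        i, j = stack.pop()
        nxt = (i + 1, j + 1)
        if i < k and j < m and nxt not in closure:
            closure.add(nxt)
            stack.append(nxt)
    return closure


def nfa_accepts(pattern, text, k):
    """True if text is within edit distance k of pattern."""
    m = len(pattern)
    active = epsilon_closure({(0, 0)}, m, k)
    for c in text:
        nxt = set()
        for i, j in active:
            if j < m and pattern[j] == c:
                nxt.add((i, j + 1))          # match: same row, next column
            if i < k:
                nxt.add((i + 1, j))          # extra text character c: same column
                if j < m:
                    nxt.add((i + 1, j + 1))  # mismatch: next row, next column
        active = epsilon_closure(nxt, m, k)
        if not active:
            return False                     # no active states left
    # Accept if any state in the last column is active.
    return any(j == m for _, j in active)


print(nfa_accepts("pattern", "patern", 1))
```

When the automaton is driven along the edges of a suffix trie, an empty set of active states is the signal that the subtree below the current node can be skipped.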

[Figure 2 (a): An NFA for the pattern “pattern” with 3 errors. The rows are labelled “no errors”, “1 error”, “2 errors” and “3 errors”.]

Suffix filters require recognition of strong matches, which do not allow all errors to occur in the beginning. The
corresponding NFA (also called staircase NFA) is obtained from the standard NFA by eliminating states that
violate the strong match conditions. A suffix has been found if any of the (accepting) states in the last column
become active.
Example 9. For example, consider the following factorization of the string pattern:

    factor   pa   tte   rn
    ti       1    2     1

Because the prefix pa allows only exact matches, we eliminate all states with i > 0 and j ≤ 2. Because the prefix
patte allows only 2 errors, we eliminate all states with i > 2 and j ≤ 5.

[Figure 2 (b): A staircase NFA for suffix filters.]

Figure 2: NFAs for recognizing approximate patterns.

The suffix tree is searched with the staircase NFA via backtracking. A subtree can be skipped if the NFA has
no more active states.
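
The elimination of states in the staircase NFA can be expressed as a per-column limit on the error row. The sketch below derives these limits from the factor lengths and weights and uses them in a set-based NFA simulation. The function names `staircase_limits` and `staircase_accepts` are hypothetical, positive weights are assumed, and the code tests a whole string against the staircase NFA rather than traversing a suffix tree.

```python
def staircase_limits(factor_lengths, weights):
    """limits[j] = largest error row allowed in column j (consumed prefix length j)."""
    limits = [weights[0] - 1]                  # column 0 belongs to the first step
    allowed = 0
    for length, t in zip(factor_lengths, weights):
        allowed += t                           # errors must stay below this sum
        limits.extend([allowed - 1] * length)
    return limits


def staircase_accepts(pattern, text, limits):
    """Simulate the staircase NFA: the standard NFA minus states above the limits."""
    m = len(pattern)

    def ok(i, j):
        return i <= limits[j]

    def closure(states):
        result, stack = set(states), list(states)
        while stack:
            i, j = stack.pop()
            nxt = (i + 1, j + 1)               # ε diagonal: skip a pattern character
            if j < m and ok(*nxt) and nxt not in result:
                result.add(nxt)
                stack.append(nxt)
        return result

    if not ok(0, 0):
        return False
    active = closure({(0, 0)})
    for c in text:
        nxt = set()
        for i, j in active:
            if j < m and pattern[j] == c and ok(i, j + 1):
                nxt.add((i, j + 1))            # match
            if ok(i + 1, j):
                nxt.add((i + 1, j))            # extra text character
            if j < m and ok(i + 1, j + 1):
                nxt.add((i + 1, j + 1))        # mismatch
        active = closure(nxt)
    return any(j == m for _, j in active)


limits = staircase_limits([2, 3, 2], [1, 2, 1])   # pa | tte | rn
print(limits)
```

For the factorization of Example 9 the limits are 0 for columns up to 2, 2 for columns up to 5 and 3 afterwards. Consequently `xattern` is rejected although it has only one error, because that error falls into the exact prefix pa, while `pattxxx` with three late errors is accepted.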
The simulation of the NFA can be speeded up with the following heuristics:

• After reading i characters, only states within a triangle whose peak is the (i + 1)-st state of the first row
can be active.
• If the top h rows have no active states, they will stay inactive also in the future, and can be omitted.

[Figure 3: Active area after reading i characters. Only states within the shaded triangle can be active.]
13.10 Verification

After a suffix was found with e ≤ k errors, we search the remaining pattern prefix left of the text occurrences
with up to k − e errors to verify whether the suffix is part of a true k-difference match of the pattern.
This can be done with any approximate pattern matching algorithm, e.g. Myers’ bitvector algorithm (Myers,
1999) or the above mentioned NFA approach. Alternatively we could use backtracking in an index that allows
a bidirectional search, e.g. the bidirectional BWT (Schnattinger et al., 2012).

13.11 Suffix Filter Parameters

The practically best factorization divides a pattern of length m into k + 1 factors of weight 1.
The length ℓ of the last factor should be greater than the others to avoid random matches of the shortest suffix. All
other factor sizes should be distributed evenly, i.e. either $\lfloor (m-\ell)/k \rfloor$ or $\lceil (m-\ell)/k \rceil$.
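
The factor lengths described above can be computed as follows. This is a small sketch; the function name `suffix_filter_factor_lengths` is hypothetical, and placing the longer of the evenly sized factors first is an arbitrary choice not prescribed by the notes.

```python
def suffix_filter_factor_lengths(m, k, last):
    """Lengths of k + 1 factors: k evenly sized ones, then a last factor of length `last`."""
    if k < 1 or last < 1 or m - last < k:
        raise ValueError("need k >= 1, last >= 1 and at least k characters before the last factor")
    # floor((m - last) / k) for every factor, plus one extra character for `extra` of them
    base, extra = divmod(m - last, k)
    return [base + 1] * extra + [base] * (k - extra) + [last]


lengths = suffix_filter_factor_lengths(40, 12, 6)
print(lengths)
```

For m = 40, k = 12 and a last factor of length 6, the remaining 34 characters are split into ten factors of length 3 and two of length 2.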
[Figure 4: Performance of suffix filter on DNA data for m = 40, k = 12, and 200 queries. Panels: (a) varying the last factor size ℓ (s = 13); (b) varying the number of factors s (ℓ = 6). Axes: last factor size or number of factors against time (sec). Curves: total time, time for filtering phase.]

13.12 Performance Results

[Figure 5: The effect of error level for m = 30 and 100 queries. Panels: (a) DNA; (b) English. Axes: error level (%) against time (sec). Curves: suffix filter, factor filter, on-line.]

[Figure 6: The error levels up to which each indexing method wins upon online search. Panels: (a) DNA; (b) English. Axes: length of patterns against error level (%). Curves: suffix filter, factor filter.]
