13 Filter Algorithms and Approximate Index Search

This document discusses approximate string matching using factor filters. Factor filters divide a pattern string into factors and search for occurrences of those factors in a text. Exact and approximate factor searching is explained using a suffix trie data structure. Weighted factors allow searching with a certain number of errors. The document provides definitions, examples, and pseudocode for implementing an approximate factor search algorithm.


13.1 Approximate Search in Indices

This exposition has been developed by David Weese. It is based on the following source, which is recommended
reading:

1. Kärkkäinen, J., and Na, J. C. (2007). Faster filters for approximate string matching, 7, 84–90. Springer

13.2 Definitions
We consider a string T of length n. For i, j ∈ N we define:

• [i..j] := {i, i + 1, . . . , j}

• [i..j) := [i..j − 1]
• T[i] is the i-th character of T (counting from 0)
• |T| denotes the string length, i. e. |T| = n
• In this lecture, T[i..j] denotes a substring of T, but not the one from position i to j; it is a concatenation of factors (see Section 13.8)

• The concatenation of strings X, Y is denoted as X · Y

13.3 Introduction
We have seen effective filters for the approximate string matching problem based on q-gram counting, e.g.
QUASAR and SWIFT. These are based on the q-gram lemma, which states that the query and each approximate
match share a certain number of overlapping q-grams.
Another simple but effective family of filters are the so-called factor filters, which are based on a factorization
of the pattern. A factorization of a string S is a sequence of strings (factors) whose concatenation is S.

Example 1 (factorization). A possible factorization of the string S = GATTACA would be the sequence of factors
G, AT, TAC, A.

13.4 Factor Filters

In the following, we will develop filters for the approximate string matching problem which is to find all substrings
of a text T that are within a distance k of a pattern P. The distance we consider is edit distance (a.k.a. Levenshtein
distance) under which the approximate string matching problem is also called the k-difference problem (Gusfield,
1997).
The simplest factor filter for the k-difference problem is based on the pigeonhole principle:

Theorem 2 (pigeonhole lemma). Let A = A0 A1 · · · Ak be a string that is the concatenation of k + 1 non-empty factors
Ai . If a string B is within edit distance k from A, then at least one of the factors Ai is a factor of B.

As a consequence we can divide our pattern P into k + 1 non-empty factors and an exact text occurrence of one
factor signals a potential approximate match.
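
The pigeonhole filter can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the factors are searched with a plain text scan (`str.find`) instead of an index, the factors are chosen to be of nearly equal length, and the names `split_into_factors` and `candidate_windows` are hypothetical.

```python
def split_into_factors(pattern, k):
    """Split pattern into k + 1 non-empty factors of nearly equal length.

    Returns a list of (offset in pattern, factor) pairs.
    """
    s = k + 1
    if len(pattern) < s:
        raise ValueError("pattern too short for k + 1 non-empty factors")
    base, extra = divmod(len(pattern), s)
    factors, start = [], 0
    for i in range(s):
        length = base + (1 if i < extra else 0)
        factors.append((start, pattern[start:start + length]))
        start += length
    return factors


def candidate_windows(text, pattern, k):
    """Return sorted text windows (begin, end) that may contain a k-difference match."""
    windows = set()
    for offset, factor in split_into_factors(pattern, k):
        pos = text.find(factor)
        while pos != -1:
            # With at most k edits, a match containing this factor occurrence
            # starts no earlier than pos - offset - k and ends no later than
            # pos - offset + len(pattern) + k.
            begin = max(0, pos - offset - k)
            end = min(len(text), pos - offset + len(pattern) + k)
            windows.add((begin, end))
            pos = text.find(factor, pos + 1)
    return sorted(windows)


print(candidate_windows("CCGATTAGACC", "GATTACA", 1))
```

Each reported window still has to be verified with an approximate matching algorithm; the filter only guarantees that no true match lies outside the windows.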
The specificity of a factor filter is influenced by the expected number of random hits of each factor. The more
random text occurrences the factors have, the more false positive matches the filter will output resulting in
more unsuccessful verifications in a subsequent step. A stronger filter criterion can be reached with longer but
approximate factors, i.e. factors that are searched with errors.
However, before introducing stronger factor filters, we define an optimal factorization.

Theorem 3 (optimal factorization). For two strings A and B, let A0 A1 · · · As−1 be a factorization of A. Then there exists
a factorization B0 B1 · · · Bs−1 of B with dist(A, B) = Σ_{i∈[0..s)} dist(Ai , Bi ). We call such a factorization of B optimal.
Factor and Suffix Filters, by David Weese, June 11, 2013, 01:00 13001

Proof: Exercise.
Example 4. For the strings
A = GATTACA
B = ATCACTA
an optimal factorization would be:
A = GAT · TA · CA
B = AT · CA · CTA ,
as it holds:
dist(A, B) = dist(GAT, AT) + dist(TA, CA) + dist(CA, CTA)
3 = 1+1+1 .
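
The numbers in Example 4 can be checked with the standard dynamic program for edit distance. The sketch below is a textbook implementation, not code from the lecture; the function name `edit_distance` is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with the standard recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # match or mismatch
        previous = current
    return previous[-1]


factors_a = ["GAT", "TA", "CA"]
factors_b = ["AT", "CA", "CTA"]
per_factor = [edit_distance(x, y) for x, y in zip(factors_a, factors_b)]
total = edit_distance("".join(factors_a), "".join(factors_b))
print(per_factor, total)
```

The per-factor distances sum to the distance of the full strings, which is exactly the property that makes this factorization of B optimal.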

If we now assign a weight ti ∈ N ∪ {0} to each factor Ai , we can search approximate matches with less than
t = Σ_{i∈[0..s)} ti errors using the following corollary of Theorem 3:
Corollary 5 (weighted pigeonhole lemma). If dist(A, B) < t, then there exists an i ∈ [0..s) such that dist(Ai , Bi ) < ti .

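The corollary can trivially be proven by contraposition: if dist(Ai , Bi ) ≥ ti holds for every factor, then summing
over the optimal factorization gives dist(A, B) = Σ_{i∈[0..s)} dist(Ai , Bi ) ≥ Σ_{i∈[0..s)} ti = t, i.e.

∀i∈[0..s) dist(Ai , Bi ) ≥ ti ⇒ dist(A, B) ≥ t .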

For the k-difference problem the optimal choice would be t = k + 1. As a consequence, if dist(A, B) ≤ k < t, B
must have a factor whose edit distance to some Ai is less than ti .

13.5 How to Implement a Factor Filter

Given a pattern P, a factorization P = P0 P1 · · · Ps−1 and weights t0 , t1 , . . . , ts−1 with t0 + t1 + . . . + ts−1 = k + 1,
we want to implement a factor filter using the suffix trie¹ of the text T.
[Figure: suffix tree of T = ANANAS (left) and suffix trie of T (right). Edges are labeled with characters of T and the terminal symbol $, leaves with suffix start positions.]

13.6 Exact Factor Search

For the special case t0 = . . . = ts−1 = 1, each factor Pi can be searched exactly with a top-down traversal of the
suffix trie along the path of characters Pi [0], Pi [1], Pi [2] · · · .
If the factor occurs in the text, the search ends in a leaf or an inner suffix tree node. The leaf or the leaves
beneath the node represent all occurrences of the factor in the text.
To verify whether an occurrence is part of a true match, we search P with up to k errors in the surroundings
using a (more expensive) approximate search algorithm.
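
A minimal Python sketch of the exact factor search follows. It builds a naive suffix trie in quadratic time and space, which is only practical for short example texts. As a simplification (an assumption of this sketch, not part of the lecture), every node stores the start positions of the suffixes passing through it, so the leaves beneath a node do not have to be collected. The names `TrieNode`, `build_suffix_trie` and `exact_search` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.occurrences = []  # start positions of all suffixes passing through


def build_suffix_trie(text):
    """Insert every suffix of text character by character (naive construction)."""
    root = TrieNode()
    for start in range(len(text)):
        node = root
        node.occurrences.append(start)
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
            node.occurrences.append(start)
    return root


def exact_search(root, factor):
    """Top-down traversal along factor[0], factor[1], ...; returns text positions."""
    node = root
    for ch in factor:
        node = node.children.get(ch)
        if node is None:
            return []          # the factor does not occur in the text
    return sorted(node.occurrences)


trie = build_suffix_trie("ANANAS")
print(exact_search(trie, "ANA"))
print(exact_search(trie, "NAS"))
```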
1 The suffix trie results from the suffix tree after breaking all edges into paths of single character edges.

13.7 Approximate Factor Search

In the more general case, factors have to be searched with ti − 1 errors in the suffix tree. This can be done with
a backtracking approach.
Instead of traversing only matching edges, we traverse all outgoing edges of a suffix tree node and tolerate
errors. Whenever we descend over a mismatching edge, we decrease a counter e that records the number of
remaining tolerable errors.
Indels are simulated by skipping a character either in the pattern or in the suffix tree and decreasing e.

ApproxRecur(F, i, α, e)
    // F: factor, i: length of the compared prefix of F
    // α: suffix trie node with path label α
    // e: number of remaining errors to tolerate
    if e < 0 then return fi
    if i = |F| then
        // all suffixes beneath α start with an approximate occurrence of F
        report occurrences at getOccurrences(α)
        return
    fi
    ApproxRecur(F, i + 1, α, e − 1)           // skip F[i] in the factor
    for αc ∈ children(α) do
        ApproxRecur(F, i, αc, e − 1)          // skip c in the suffix trie
        if F[i] = c then
            ApproxRecur(F, i + 1, αc, e)      // match
        else
            ApproxRecur(F, i + 1, αc, e − 1)  // mismatch
        fi
    od
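
The backtracking search can be tried out with the following Python sketch. It follows the same recursion as the pseudocode but stops descending once the whole factor is consumed, and it uses a naive suffix trie whose nodes store the start positions of the suffixes passing through them. Both choices, as well as the names `Node`, `build_suffix_trie` and `approx_search`, are assumptions made for illustration.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.starts = []    # start positions of suffixes passing through


def build_suffix_trie(text):
    root = Node()
    for start in range(len(text)):
        node = root
        node.starts.append(start)
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root


def approx_search(root, factor, max_errors):
    """Start positions of text substrings within max_errors edits of factor."""
    hits = set()

    def recur(i, node, e):
        if e < 0:
            return
        if i == len(factor):
            hits.update(node.starts)  # every suffix below this node is a hit
            return
        recur(i + 1, node, e - 1)     # skip factor[i]
        for c, child in node.children.items():
            recur(i, child, e - 1)    # skip text character c
            # match keeps e, mismatch costs one error
            recur(i + 1, child, e if factor[i] == c else e - 1)

    recur(0, root, max_errors)
    return sorted(hits)


trie = build_suffix_trie("ANANAS")
print(approx_search(trie, "ANS", 1))
```

Note that the same trie node can be reached along several recursion paths; this repeated work is what the NFA-based approach of Section 13.9 avoids.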

13.8 Suffix Filters

Again we consider strings A and B and optimal factorizations A0 A1 · · · As−1 and B0 B1 · · · Bs−1 and weights ti
such that dist(A, B) < Σ_{i∈[0..s)} ti .

For 0 ≤ i ≤ j < s, let A[i.. j] denote the concatenation of factors Ai Ai+1 · · · A j . In particular, A[i..s − 1] is a suffix
of A. (We use the same notation for B).
Definition 6 (strong match). We say that A and B match on interval [i..j] if

dist(A[i..j], B[i..j]) < Σ_{h∈[i..j]} th ,

and strongly match on [i..j] if they match on every interval [i..j′], j′ = i, . . . , j, that means if they match on every
nonempty prefix of [i..j].

While the weighted pigeonhole lemma only states that there is a match [i..i], the following more general theorem
guarantees that there is a strong match [i..s) for a certain i ∈ [0..s).
Theorem 7. If dist(A, B) < t, there exists i ∈ [0..s) such that A and B strongly match on [i..s).

Proof: Let [0..i) be the longest prefix interval on which A and B do not match, i.e. dist(A[0..i), B[0..i)) ≥ Σ_{h∈[0..i)} th .
It is well defined as i = 0 always satisfies the condition (both sides are 0). It holds i < s as [0..s) is always a match.
Then, A and B strongly match on [i..s). To show this, assume the opposite, i.e. that there exists j ∈ [i..s) such
that A and B do not match on [i..j]. Since the distances of an optimal factorization add up over adjacent intervals,
A and B then do not match on [0..j] = [0..i) ∪ [i..j] either, which contradicts [0..i) being maximal.
Example 8. In the situation of the following table, A and B strongly match on [1..5) but not on any other suffix
[i..5):

i               0   1   2   3   4
ti              1   1   2   1   1
dist(Ai , Bi )  1   0   1   2   1

as A and B match on [1..j] for each j ∈ [1..5):

I                        [1..1]  [1..2]  [1..3]  [1..4]
Σ_{i∈I} ti                 1       3       4       5
                           >       >       >       >
Σ_{i∈I} dist(Ai , Bi )     0       1       3       4
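
Example 8 can be reproduced with a short Python sketch. It assumes that the factorizations are optimal, so that the distance on an interval is the sum of the per-factor distances; the function names `matches` and `strongly_matches` are hypothetical.

```python
def matches(weights, dists, i, j):
    """A and B match on [i..j] if the summed distance is below the summed weight."""
    return sum(dists[i:j + 1]) < sum(weights[i:j + 1])


def strongly_matches(weights, dists, i, j):
    """Strong match on [i..j]: a match on every nonempty prefix [i..j'] of [i..j]."""
    return all(matches(weights, dists, i, jp) for jp in range(i, j + 1))


t = [1, 1, 2, 1, 1]  # weights ti from the table
d = [1, 0, 1, 2, 1]  # distances dist(Ai, Bi) from the table

strong_suffixes = [i for i in range(len(t))
                   if strongly_matches(t, d, i, len(t) - 1)]
print(strong_suffixes)
```

The only suffix interval with a strong match starts at i = 1, in agreement with the example.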

The filter algorithm is identical to factor filters except that, instead of searching for separate factors, the filtration
phase searches for suffixes of the pattern satisfying the strong match condition. That is, it searches for each
suffix A[i..s) with less than Σ_{j∈[i..s)} tj errors.
To get an intuition on suffix filters, we want to compare the following (sub)problems of using the index to find
the occurrences of:

1. pattern A with less than t errors

2. factor A0 with less than t0 errors
3. suffix A[0..s) under the strong match condition

1. In the first case, t errors need to be found to eliminate a candidate.

2. In the second and third case, t0 errors in the first factor are already sufficient.
3. While the factor filter lets a candidate pass after matching the first factor, a suffix filter continues the search
and has more chances for elimination.

Hence a suffix filter is more specific than a factor filter and saves time due to fewer unsuccessful verifications.

13.9 Approximate Suffix Search

To search each suffix under the strong match condition we could extend the ApproxRecur algorithm to search
a sequence of factors and allow t_{i+1} more errors (increase e) after successfully matching the factor Ai .
However, the backtracking of algorithm ApproxRecur visits the same suffix tree node multiple times. Kärkkäinen
and Na use an approach that visits each node at most once and hence requires fewer backtracking steps. They
construct an NFA of the suffix that accepts a suffix within a certain edit distance.
The NFA nodes are arranged in a grid, where the i-th row represents i errors and the j-th column a consumed
prefix of length j (counting from 0). Matches and mismatches are transitions from column j to j + 1, where
matches connect nodes of the same row and mismatches lead from row i to i + 1. Indels also lead from row i to
i + 1: a skipped pattern character is an empty (ε) diagonal transition, a skipped text character is an any-character
(Σ) transition within column j.
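
The grid NFA can be simulated naively with a set of active states (i, j). The following Python sketch is meant to make the transitions concrete; it is not the bit-parallel simulation used in practice, and the names `epsilon_closure`, `step` and `accepts` are assumptions.

```python
def epsilon_closure(states, m, k):
    """Add states reachable by diagonal ε-transitions (skip a pattern character)."""
    closure = set(states)
    stack = list(states)
    while stack:
        i, j = stack.pop()
        if i < k and j < m and (i + 1, j + 1) not in closure:
            closure.add((i + 1, j + 1))
            stack.append((i + 1, j + 1))
    return closure


def step(states, pattern, k, c):
    """Active states after reading the character c."""
    m = len(pattern)
    nxt = set()
    for i, j in states:
        if j < m and pattern[j] == c:
            nxt.add((i, j + 1))          # match: same row, next column
        if i < k:
            nxt.add((i + 1, j))          # Σ within the column: skip text character
            if j < m:
                nxt.add((i + 1, j + 1))  # diagonal Σ: mismatch
    return epsilon_closure(nxt, m, k)


def accepts(pattern, k, text):
    """True if the whole text is within edit distance k of pattern."""
    m = len(pattern)
    states = epsilon_closure({(0, 0)}, m, k)
    for c in text:
        states = step(states, pattern, k, c)
        if not states:
            return False                 # no active state left: give up early
    return any(j == m for _, j in states)


print(accepts("pattern", 1, "patern"), accepts("pattern", 2, "patt"))
```

When the automaton is fed with the characters along a suffix trie path, the early exit on an empty state set corresponds to skipping the subtree below the current node.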

[Figure 2 (a): An NFA for the pattern “pattern” with 3 errors. The rows correspond to no errors, 1 error, 2 errors and 3 errors; horizontal transitions are labeled with the pattern characters p, a, t, t, e, r, n, the remaining transitions with Σ or ε.]

Suffix filters require recognition of strong matches, which do not allow all errors to occur in the beginning. The
corresponding NFA (also called staircase NFA) is obtained from the standard NFA by eliminating states that
violate the strong match conditions. A suffix has been found if any of the (accepting) states in the last column
become active.

Example 9. Consider the following factorization of the string pattern:

factor   pa   tte   rn
ti       1    2     1

Because the prefix pa allows only exact matches, we eliminate all states with i > 0 and j ≤ 2. Because the prefix
patte allows only 2 errors, we eliminate all states with i > 2 and j ≤ 5.
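
The states that survive in the staircase NFA can be described by one number per column: the largest row that is still allowed. A small Python sketch, with the hypothetical function name `staircase_caps`, computes these caps from a factorization and its weights.

```python
def staircase_caps(factors, weights):
    """caps[j] = largest row (number of errors) allowed in column j.

    A prefix ending inside the h-th factor must be matched with less than
    t0 + ... + th errors, so the cap is that sum minus one.
    """
    m = sum(len(f) for f in factors)
    caps = [0] * (m + 1)
    end, budget, j = 0, 0, 0
    for factor, t in zip(factors, weights):
        end += len(factor)   # last column belonging to this factor
        budget += t          # accumulated weight of the prefix
        while j <= end:
            caps[j] = budget - 1
            j += 1
    return caps


print(staircase_caps(["pa", "tte", "rn"], [1, 2, 1]))
```

For the factorization pa · tte · rn with weights 1, 2, 1 the columns 0 to 2 allow only row 0, the columns 3 to 5 allow rows up to 2, and the last two columns allow rows up to 3.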
[Figure 2 (b): A staircase NFA for suffix filters.]
The suffix tree is searched with the staircase NFA via backtracking. A subtree can be skipped if the NFA has
no more active states.

[Figure 2: NFAs for recognizing approximate patterns.]

The simulation of the NFA can be sped up with the following heuristics:

• After reading i characters, only states within a triangle whose peak is the (i + 1)-st state of the first row
can be active.
• If the top h rows have no active states, they will stay inactive also in the future, and can be omitted.

[Figure 3: Active area after reading i characters. Only states within the shaded triangle can be active. The peak of the triangle is the (i + 1)-st state of the no-error row; the row for k errors contains 2k + 1 states of the triangle.]
13.10 Verification

After a suffix was found with e ≤ k errors, we search the remaining pattern prefix left of the text occurrences
with up to k − e errors to verify whether the suffix is part of a true k-difference match of the pattern.
This can be done with any approximate pattern matching algorithm, e.g. Myers' bitvector algorithm (Myers,
1999) or the above-mentioned NFA approach. Alternatively, we could use backtracking in an index that allows
a bidirectional search, e.g. the bidirectional BWT (Schnattinger et al., 2012).
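
As an illustration of a verification step, the following Python sketch checks whether a text window contains a substring within edit distance k of the pattern. It uses a plain column-wise dynamic program (in the style of Sellers' algorithm) instead of Myers' bit-vector algorithm, and it verifies the whole pattern in the window rather than only the remaining prefix with k − e errors; both are simplifications. The function names are hypothetical.

```python
def best_match_in_window(pattern, window):
    """Smallest edit distance between pattern and any substring of window."""
    # column[i]: best distance between pattern[:i] and a substring of the
    # window that ends at the current window position
    column = list(range(len(pattern) + 1))
    best = column[-1]
    for c in window:
        new = [0]  # a match may start at any window position
        for i, p in enumerate(pattern, start=1):
            new.append(min(column[i] + 1,             # skip window character c
                           new[i - 1] + 1,            # skip pattern character p
                           column[i - 1] + (p != c))) # match or mismatch
        column = new
        best = min(best, column[-1])
    return best


def verify(pattern, window, k):
    """True if the window contains a k-difference match of the pattern."""
    return best_match_in_window(pattern, window) <= k


print(verify("GATTACA", "CCGATTAGACC", 1))
```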

13.11 Suffix Filter Parameters

The practically best factorization divides a pattern of length m into k + 1 factors of weight 1.
The length ℓ of the last factor should be greater than the others to avoid random matches of the shortest suffix. All
other factor sizes should be distributed evenly, i.e. each is either ⌊(m − ℓ)/k⌋ or ⌈(m − ℓ)/k⌉.
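
The factor lengths for given m, k and last factor length ℓ can be computed as in the following Python sketch. The function name `factor_lengths`, the example values and the decision to place the longer of the evenly sized factors first are assumptions; the text only requires that the first k factors have floor or ceiling size.

```python
def factor_lengths(m, k, last):
    """Lengths of k + 1 factors: k evenly sized ones followed by one of length last."""
    rest = m - last
    if k < 1 or rest < k:
        raise ValueError("each of the first k factors needs at least one character")
    base, extra = divmod(rest, k)
    # `extra` factors get the ceiling, the others the floor of (m - last) / k
    return [base + 1] * extra + [base] * (k - extra) + [last]


print(factor_lengths(21, 3, 7))
```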
13.12 Performance Results

The following measurements are reported by Kärkkäinen and Na (2007); section, figure and reference numbers in this section refer to that paper.

[Figure 4: Performance of suffix filter on DNA data for m = 40, k = 12, and 200 queries. Total time and time for the filtering phase (sec), (a) varying the last factor size ℓ (s = 13), (b) varying the number of factors s (ℓ = 6).]

[Figure 5: The effect of error level for m = 30 and 100 queries. Time (sec) over error level (%) for suffix filter, factor filter and on-line search, (a) DNA, (b) English.]

[Figure 6: The error levels up to which each indexing method wins upon on-line search. Error level (%) over length of patterns for suffix filter and factor filter, (a) DNA, (b) English.]

Experimental setup. Three texts were used: English, DNA, and random data. English and DNA were obtained from the Pizza&Chili Corpus web-site (http://pizzachili.dcc.uchile.cl/) and truncated to length 16 Mbytes. Random texts are of length 64M with various alphabet sizes. In each test, 100 to 10000 patterns of the same length m were selected randomly from the text and searched with k errors. The error level α = k/m is often reported instead of k. The machine was a 2.6 GHz Pentium IV with 2 GB of RAM, running Linux; the code was compiled with gcc version 4.0.2 and option “-O3”.

Suffix filter parameters. Figure 4 (a) illustrates the effect of the last factor size ℓ in one case. In this example, the best ℓ is 6 and other factor sizes are 3 and 4. As ℓ gets smaller than optimal, the verification time increases rapidly. With larger than optimal ℓ, filtration time increases but at a more modest rate. The behaviour in other cases is similar. The results indicate that using the absolutely optimal ℓ is not crucial, but it is better to err in the too large direction. This is in contrast to factor filters, where having the free parameter s off by one in either direction can have a dramatic effect on the running time [13].
The method of Section 2.2 sets the number of factors s to k + 1, but smaller values of s were tried as well (larger values would mean that ti = 0 for some i). Figure 4 (b) shows the results in one case. Here the last factor size ℓ is fixed to 6, its distance limit ts−1 to 1, and the other factor sizes and distance limits are as even as possible. The results in this and in other cases show that s = k + 1 is the best choice.

Suffix filter vs. factor filter. The parameters are chosen as described in Sections 2.1 and 2.2. For suffix filters the last factor size ℓ, and for factor filters the number of factors s were optimized experimentally. Figure 5 shows the effect of the error level. The curve “on-line” represents the time for verifying the whole text without a filtering step. The figure illustrates that indexing becomes useless when the error level grows too large. With small alphabet and large error level, filtering is no more competitive with on-line searching. With large alphabet and low error level, the filters are highly effective and having a more effective filter does not help anymore.
Figure 6 shows up to which error level each filtering method wins upon on-line search, as a function of m. For suffix filters this limit of usefulness is substantially higher than for factor filters. Figure 6 also indicates that the advantage of suffix filters over factor filters increases for longer patterns. There is a notable jump when the pattern length goes from 90 to 100. This is caused by the verification stage, which switches from a fast implementation that can handle patterns up to length 96 to a slower implementation that has no limit on the pattern length. Suffix filters are less sensitive to the speed of the verification stage than factor filters.
