12 Filter Algorithms
• Kim R. Rasmussen, Jens Stoye, Eugene W. Myers: Efficient q-Gram Filters for Finding All ε-Matches over a
Given Length, Journal of Computational Biology, Volume 13, Number 2, 2006, pages 296–308. (Originally
presented at GCB 2004 and RECOMB 2005.) [RSM06]
12.1 Motivation
Comparison of large genomic sequences can be sped up considerably if filtering techniques are applied. The key
observation is that a local alignment of high sequence similarity must contain at least a few short exact matches.
The idea of using q-grams for fast filtering is not new. A q-gram is a substring of length q. Programs like
BLAST use q-grams which occur in both sequences as seeds for a local alignment search.
It has also been observed that combining the idea of seeds with a combinatorial argument based on
some form of the pigeonhole principle can be used to discard large parts of the input sequences from further
consideration, because they cannot contain a good local alignment.
We can distinguish three kinds of algorithms.
When applied to finding highly similar regions, the classical exact algorithms (e.g. Smith-Waterman) will
spend most of the time verifying that there is no match between a given pair of regions. The running times
(typically the product of the sequence lengths) are infeasible for genome-sized sequences.
Heuristics like BLAST typically employ a q-gram index to locate seeds and perform a verification for the
candidate regions located in this way. However, BLAST might fail to recognize an existing match, unless the
filtering parameters are set very stringently. Thus one has to trade off sensitivity against speed.
A filter is an algorithm that allows us to discard large parts of the input, but is guaranteed not to lose any
significant match. The trade-off to be considered for filtering algorithms is thus only whether the additional
effort is paid off by the time saved on verifications.
In this lecture, we will consider the problem of finding matches of low error rate ε and a given minimum
length n0.
The cost measure will be the edit distance (Levenshtein distance). That is, the distance between two strings
is the number of insertions, deletions, and substitutions needed to transform one into the other.
The SWIFT algorithm is an improvement of the QUASAR algorithm by Burkhardt et al. Note, however,
that QUASAR uses an absolute error threshold rather than an error rate. Using an error rate is more appropriate
since the length of a local alignment is not known in advance.
The filter has been successfully applied for the fragment overlap computation in sequence assembly and for
BLAST-like searching in EST sequences.
12.2 Definitions
As usual, let A and B denote strings over a finite alphabet Σ, let |A| be the length of A, let A[i] be the i-th letter
of A, and let A[p..q] be the substring starting at position p and ending with position q of A, thus A[i..i] consists
of the letter A[i]. A substring of length q > 0 of A is a q-gram of A.
The (unit cost) edit distance between strings A and B is the minimum number of edit operations (insertion,
deletion, substitution) in an alignment of A and B. It is denoted by dist(A, B).
The edit distance can be computed by the well-known Needleman-Wunsch algorithm. It computes in
O(|A||B|) time an edit matrix E(i, j) := dist(A[1..i], B[1.. j]). The letter A[i] corresponds to the step from row i − 1
to i, so it is natural to visualize the letters between the rows and columns of the edit matrix.
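The recurrence behind the edit matrix can be sketched as follows. This is a minimal illustration under unit costs; the function names `edit_matrix` and `dist` are chosen here for illustration, and Python strings are 0-based, so `E[i][j]` corresponds to dist(A[1..i], B[1..j]) in the notation above.

```python
def edit_matrix(a: str, b: str) -> list[list[int]]:
    """Return E with E[i][j] = unit-cost edit distance of a[:i] and b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    E = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        E[i][0] = i  # delete all i letters of a[:i]
    for j in range(cols):
        E[0][j] = j  # insert all j letters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = E[i - 1][j - 1] + (a[i - 1] != b[j - 1])  # match or substitution
            E[i][j] = min(diagonal, E[i - 1][j] + 1, E[i][j - 1] + 1)
    return E


def dist(a: str, b: str) -> int:
    """Edit distance, read from the bottom-right entry of the matrix."""
    return edit_matrix(a, b)[len(a)][len(b)]
```

The two nested loops make the O(|A||B|) running time visible, which is the cost the filter is meant to avoid on most of the input.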
An ε-match is a local alignment for substrings (α, β) with an error rate of at most ε. That is, dist(α, β) ≤ ε|β|.
(Note the ‘asymmetry’ in the definition of error rate.)
The problem can now be formally stated as follows:
Q-gram Filters for ε-Matches over a Given Length, by Clemens Gröpl, June 7, 2013, 08:06 12001
Given a target string A and a query string B, a minimum match length n0 and a maximum error rate
ε > 0;
Find all ε-matches (α, β) where α and β are substrings of A and B, respectively, such that
1. |β| ≥ n0 and
2. dist(α, β) ≤ ⌊ε|β|⌋.
1. Find (enumerate) all q-hits between the query and the target strings.
2. Identify regions (in the Cartesian product of the strings) that have “enough” hits.
3. Such candidate regions are then subjected to a closer examination.
The concrete methods differ in the shape and the size of the regions.
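Step 1 can be illustrated with a small sketch. It assumes that a q-hit is a pair (i, j) with A[i..i + q − 1] = B[j..j + q − 1], uses 0-based positions, and uses a plain dictionary in place of a real q-gram index; the name `q_hits` is hypothetical.

```python
from collections import defaultdict


def q_hits(a: str, b: str, q: int) -> list[tuple[int, int]]:
    """Enumerate all pairs (i, j) with a[i:i+q] == b[j:j+q] (0-based)."""
    positions = defaultdict(list)
    for i in range(len(a) - q + 1):
        positions[a[i:i + q]].append(i)  # where each q-gram occurs in the target
    hits = []
    for j in range(len(b) - q + 1):
        for i in positions.get(b[j:j + q], ()):
            hits.append((i, j))
    return hits
```

For example, `q_hits("ACGTAC", "GTAC", 2)` pairs the query q-gram `AC` with both of its occurrences in the target, so repeated q-grams produce hits on several diagonals.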
The following lemma relates ε-matches (α, β) to parallelograms of the edit matrix. For a moment, we assume
that the length of β is known, so that we can work with an absolute bound on the distance.
An n × e parallelogram of the edit matrix consists of entries from n + 1 consecutive rows and e + 1 consecutive
diagonals.
Lemma 1. Let α and β be substrings of A and B, respectively, and assume that |β| = n and dist(α, β) ≤ e. Then
there exists an n × e parallelogram P such that
The A-projection pA (P) of a parallelogram P is defined as the substring of A between the last column of the
first row of P and the first column of the last row of P.
The B-projection pB (P) of a parallelogram P is defined as the substring of B between the first and the last
row of P.
(Note: these figures are taken from the RECOMB and GCB version, which uses the transposed matrix of
the JCB article.)
Clearly, a q-hit (i, j) corresponds to q + 1 consecutive entries of the edit matrix along the diagonal j − i. A
q-hit is contained in a parallelogram if its corresponding matrix entries are.
The proof of Lemma 1 is straightforward: Consider the path of an optimal alignment of α and β. At each
row except for the last q ones, we have a q-gram unless there is an edit operation among the next q edges. Each
edit operation can ‘destroy’ at most q q-hits.
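The counting argument can be checked on a tiny case. A string β of length n has n − q + 1 q-grams, and e edit operations destroy at most q·e of them, which leaves at least (n + 1) − q(e + 1). The sketch below writes this bound as T(n, q, e), the notation used in Lemma 2, and counts surviving q-grams for a single substitution; the helper `surviving_qgrams` and the example strings are made up for illustration and only cover the substitution-only case.

```python
def T(n: int, q: int, e: int) -> int:
    """Lower bound on the number of surviving q-hits: (n + 1) - q * (e + 1)."""
    return (n + 1) - q * (e + 1)


def surviving_qgrams(alpha: str, beta: str, q: int) -> int:
    """Count q-grams of beta that match alpha at the same offset.

    Only meaningful when alpha and beta differ by substitutions alone,
    so that the whole alignment stays on one diagonal.
    """
    return sum(alpha[j:j + q] == beta[j:j + q] for j in range(len(beta) - q + 1))


beta = "ACGTACGTAC"   # n = 10
alpha = "ACGTTCGTAC"  # one substitution at 0-based position 4
```

With q = 3 the substitution destroys exactly the three q-grams that cover position 4, so `surviving_qgrams(alpha, beta, 3)` and `T(10, 3, 1)` agree and the bound is tight here.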
So the case where |β| is fixed was easy. Next we consider ε-matches for |β| ≥ n0. The following lemma is
the combinatorial foundation of the SWIFT algorithm.
Lemma 2. Let α and β be substrings of A and B, respectively, and assume that |β| ≥ n0 and dist(α, β) ≤ ε|β|. Let
U(n, q, ε) := T(n, q, ⌊εn⌋) = (n + 1) − q(⌊εn⌋ + 1) and assume that the q-gram size q and the threshold τ have been
chosen such that
q < ⌈1/ε⌉ and τ ≤ min{ U(n0, q, ε), U(n1, q, ε) },
where n1 := ⌈(⌊εn0⌋ + 1)/ε⌉.
Then there exists a w × e parallelogram P such that:
The purpose of Lemma 2 is as follows. Given parameters ε and n0 , we can choose suitable values for
q, τ, w, and e using Lemma 2. Then we enumerate all parallelograms P with enough hits according to these
parameters. All relevant ε-matches can be found in these regions.
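As a worked illustration of the threshold condition in Lemma 2, the following sketch computes the largest admissible τ for given n0, q, and ε. The function name `max_threshold` is hypothetical, ε is passed as a `Fraction` to keep the floor and ceiling exact, and the choice of w and e is not covered because their formulas are not stated here.

```python
from fractions import Fraction
from math import ceil, floor


def U(n: int, q: int, eps: Fraction) -> int:
    """U(n, q, eps) = (n + 1) - q * (floor(eps * n) + 1)."""
    return (n + 1) - q * (floor(eps * n) + 1)


def max_threshold(n0: int, q: int, eps: Fraction) -> int:
    """Largest tau with tau <= min(U(n0, q, eps), U(n1, q, eps))."""
    if not q < ceil(1 / eps):
        raise ValueError("Lemma 2 requires q < ceil(1 / eps)")
    n1 = ceil((floor(eps * n0) + 1) / eps)
    return min(U(n0, q, eps), U(n1, q, eps))
```

For ε = 1/20, n0 = 50, and q = 11 this gives n1 = 60, U(50, 11, ε) = 18, and U(60, 11, ε) = 17, so any τ ≤ 17 satisfies the condition.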
Proof of Lemma 2. The lemma is proven in three steps:
1. Assuming there is an ε-match (α, β) of length |β| = n ≥ n0, show that there are at least τ q-hits in the
surrounding n × ⌊εn⌋ parallelogram.
2. Argue that there is a w × e parallelogram that contains at least τ q-hits, where w and e do not depend on
n ≥ n0 .
. . . details omitted . . .
12.4 Algorithm
The SWIFT algorithm relies on the q-gram filter for ε-matches of length n0 or greater. Using the parameters
obtained from Lemma 2, it searches for all w × e parallelograms which contain a sufficient number of q-hits.
In the preprocessing step, we construct a q-gram index for the target sequence A. The index consists of
two tables:
1. The occurrence table is a concatenation of the lists L(G) := { i | A[i..i + q − 1] = G } for all q-grams G ∈ Σ^q
in A.
2. The lookup table is an array indexed by the natural encoding of G to base |Σ|, giving the start of each list
in the occurrence table.
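The two tables can be sketched as follows. This is an illustrative construction by counting sort, assuming the DNA alphabet ACGT and 0-based positions; the names `build_qgram_index` and `occurrences` are hypothetical and not taken from the SWIFT implementation.

```python
def build_qgram_index(a: str, q: int, alphabet: str = "ACGT"):
    """Return (lookup, occurrence, encode) for the q-grams of a."""
    rank = {c: k for k, c in enumerate(alphabet)}
    sigma = len(alphabet)

    def encode(gram: str) -> int:
        """Read the q-gram as a number in base |alphabet|."""
        code = 0
        for c in gram:
            code = code * sigma + rank[c]
        return code

    codes = [encode(a[i:i + q]) for i in range(len(a) - q + 1)]
    # lookup[c] is the start of the list for code c; lookup[c + 1] is its end.
    lookup = [0] * (sigma ** q + 1)
    for code in codes:
        lookup[code + 1] += 1
    for k in range(1, len(lookup)):
        lookup[k] += lookup[k - 1]
    # Fill the concatenated lists in increasing position order.
    occurrence = [0] * len(codes)
    next_free = lookup[:-1]  # slicing copies, so lookup itself is unchanged
    for i, code in enumerate(codes):
        occurrence[next_free[code]] = i
        next_free[code] += 1
    return lookup, occurrence, encode


def occurrences(index, gram: str) -> list[int]:
    """All start positions of gram in the indexed string."""
    lookup, occurrence, encode = index
    code = encode(gram)
    return occurrence[lookup[code]:lookup[code + 1]]
```

The lookup table has |Σ|^q + 1 entries regardless of the target length, which is one practical reason to keep q moderate.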
Once the q-gram index is built, the w × e parallelograms containing τ or more q-hits can be found using a
simple sliding window algorithm.
The idea is to split the (fictitious) edit matrix into overlapping bins of e + 1 diagonals. For each bin we count
the number of q-hits in the w × e parallelogram that is the intersection of the diagonals of the corresponding
bin and the rows of the sliding window W_j := B[j..j + w].
As the sliding window proceeds to W_{j+1}, the bin counters are updated to reflect the changes due to the
q-grams leaving and entering the window.
Whenever a bin counter reaches τ, the corresponding parallelogram is reported. Overlapping
parallelograms can be merged on the fly.
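The counting scheme can be sketched as follows. It is a simplified illustration, not the SWIFT implementation, and it makes several assumptions: a dictionary stands in for the q-gram index, positions are 0-based, a q-hit (i, j) lies on diagonal j − i, the window covers w letters of B, and there is one bin per diagonal d covering diagonals d..d + e. Merging of overlapping parallelograms is omitted.

```python
from collections import defaultdict


def candidate_bins(a: str, b: str, q: int, w: int, e: int, tau: int):
    """Return sorted (window_start, bin_start) pairs whose counter reached tau."""
    index = defaultdict(list)
    for i in range(len(a) - q + 1):
        index[a[i:i + q]].append(i)

    counts = defaultdict(int)  # bin_start -> number of q-hits in current window
    reported = set()

    def update(j: int, delta: int, window_start) -> None:
        """Add (delta=+1) or remove (delta=-1) the hits of the q-gram b[j:j+q]."""
        for i in index.get(b[j:j + q], ()):
            d = j - i
            for bin_start in range(d - e, d + 1):  # every bin containing d
                counts[bin_start] += delta
                if delta > 0 and counts[bin_start] >= tau:
                    reported.add((window_start, bin_start))

    grams_per_window = w - q + 1
    for j in range(len(b) - q + 1):
        leaving = j - grams_per_window
        if leaving >= 0:
            update(leaving, -1, None)      # q-gram drops out of the window
        update(j, +1, max(0, leaving + 1))  # q-gram enters the window
    return sorted(reported)
```

Each step touches only the hits of one entering and one leaving q-gram, so the work is proportional to the number of q-hits rather than to the size of the edit matrix.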
The space requirement for the bins is reduced by searching for somewhat larger parallelograms of size
w × (e + ∆). Then each bin counts for e + ∆ + 1 diagonals, and successive bins overlap by e diagonals. While
this will lead to more verifications, it reduces the number of bins which have to be maintained. In practice, ∆
is set to a power of 2, and bin indices are computed with fast bit-operations.
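How bit operations yield the bin indices can be sketched under one possible layout. The assumptions here are not taken from the source: Δ = 2^z with Δ > e, bin b starts at diagonal b·Δ and spans e + Δ + 1 diagonals, and diagonals have been shifted to be non-negative. The exact overlap convention in SWIFT may differ by one diagonal, and the name `bins_for_diagonal` is hypothetical.

```python
def bins_for_diagonal(d: int, e: int, z: int) -> list[int]:
    """Indices of the bins that contain diagonal d, for delta = 2**z > e.

    Assumed layout: bin b covers diagonals b*delta .. b*delta + delta + e.
    """
    delta = 1 << z
    if delta <= e:
        raise ValueError("this sketch assumes delta > e")
    first = d >> z                  # d // delta, computed with a shift
    if (d & (delta - 1)) <= e:      # d % delta, computed with a mask
        return [first - 1, first]   # d also lies in the tail of the previous bin
    return [first]
```

Under these assumptions each q-hit updates at most two counters, and the number of bins shrinks by roughly a factor of Δ compared with one bin per diagonal.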
Each ‘candidate’ parallelogram must be checked for the presence of an ε-match. This can be done trivially
by dynamic programming. Alternatively, one can use the knowledge about the q-grams in the ε-match to
construct an alignment by sparse dynamic programming.
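A simplified version of the dynamic programming check can be sketched as follows. It assumes that the query substring β is already fixed and asks only whether some substring α of the candidate region satisfies dist(α, β) ≤ ⌊ε|β|⌋; the full verification also has to vary β. The function names are hypothetical, and ε is passed as a `Fraction` so that the floor is exact.

```python
from fractions import Fraction
from math import floor


def best_substring_distance(region: str, beta: str) -> int:
    """Minimum of dist(alpha, beta) over all substrings alpha of region."""
    # Row for the empty prefix of beta: a match may start anywhere for free.
    prev = [0] * (len(region) + 1)
    for j in range(1, len(beta) + 1):
        cur = [j] + [0] * len(region)  # beta[:j] against the empty string
        for i in range(1, len(region) + 1):
            cur[i] = min(
                prev[i - 1] + (region[i - 1] != beta[j - 1]),  # match or substitution
                prev[i] + 1,      # letter of beta left unmatched
                cur[i - 1] + 1,   # letter of region left unmatched
            )
        prev = cur
    return min(prev)  # a match may also end anywhere in the region


def contains_eps_match(region: str, beta: str, eps: Fraction) -> bool:
    """True if some substring of region is within floor(eps * |beta|) edits of beta."""
    return best_substring_distance(region, beta) <= floor(eps * len(beta))
```

Because the region is only a narrow parallelogram around the reported bin, this quadratic check is applied to a small fraction of the edit matrix.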
12.5 Results