12 Q-gram filters for ε-matches

This exposition was developed by Clemens Gröpl. It is based on:

• Kim R. Rasmussen, Jens Stoye, Eugene W. Myers: Efficient q-Gram Filters for Finding All ε-Matches over a
Given Length, Journal of Computational Biology, Volume 13, Number 2, 2006, pages 296–308. (Originally
presented at GCB 2004 and RECOMB 2005.) [RSM06]

12.1 Motivation
Comparison of large genomic sequences can be sped up considerably if filtering techniques are applied. The key
observation is that a local alignment of high sequence similarity must contain at least a few short exact matches.
The idea of using q-grams for fast filtering is not new. A q-gram is a substring of length q. Programs like
BLAST use q-grams which occur in both sequences as seeds for a local alignment search.
It has also been observed that combining the idea of seeds with a combinatorial argument based on
some form of the pigeonhole principle can be used to discard large parts of the input sequences from further
consideration, because they cannot contain a good local alignment.
We can distinguish three kinds of algorithms.
When applied to finding highly similar regions, the classical exact algorithms (e.g. Smith-Waterman) will
spend most of the time verifying that there is no match between a given pair of regions. The running times
(typically the product of the sequence lengths) are infeasible for genome-size sequences.
Heuristics like BLAST typically employ a q-gram index to locate seeds and perform a verification for the
candidate regions located in this way. However, BLAST might fail to recognize an existing match, unless the
filtering parameters are set very stringently. Thus one has to trade off sensitivity against speed.
A filter is an algorithm that allows us to discard large parts of the input, but is guaranteed not to lose any
significant match. The trade-off to be considered for filtering algorithms is thus only whether the additional
effort is paid off by the saving of time spent for verifications.
In this lecture, we will consider the problem of finding matches of low error rate ε and a given minimum
length n0.
The cost measure will be the edit distance (Levenshtein distance). That is, the distance between two strings
is the number of insertions, deletions, and substitutions needed to transform one into the other.
The SWIFT algorithm is an improvement of the QUASAR algorithm by Burkhardt et al. Note, however,
that QUASAR uses an absolute error threshold rather than an error rate. Using an error rate is more appropriate
since the length of a local alignment is not known in advance.
The filter has been successfully applied for the fragment overlap computation in sequence assembly and for
BLAST-like searching in EST sequences.

12.2 Definitions
As usual, let A and B denote strings over a finite alphabet Σ, let |A| be the length of A, let A[i] be the i-th letter
of A, and let A[p..q] be the substring starting at position p and ending with position q of A, thus A[i..i] consists
of the letter A[i]. A substring of length q > 0 of A is a q-gram of A.
The (unit cost) edit distance between strings A and B is the minimum number of edit operations (insertion,
deletion, substitution) in an alignment of A and B. It is denoted by dist(A, B).
The edit distance can be computed by the well-known Needleman-Wunsch algorithm. It computes in
O(|A||B|) time an edit matrix E(i, j) := dist(A[1..i], B[1..j]). The letter A[i] corresponds to the step from row i − 1
to i, so it is natural to visualize the letters between the rows and columns of the edit matrix.
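As an illustration, the edit matrix can be filled in row by row. The following is a minimal Python sketch under unit costs; the function names `edit_matrix` and `dist` are chosen here for illustration and do not come from [RSM06]. Python strings are 0-based, so the letter A[i] of the text is `a[i - 1]` in the code.

```python
def edit_matrix(a: str, b: str) -> list[list[int]]:
    """Return E with E[i][j] = dist(a[:i], b[:j]) under unit costs."""
    rows, cols = len(a) + 1, len(b) + 1
    E = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        E[i][0] = i                      # delete the first i letters of a
    for j in range(cols):
        E[0][j] = j                      # insert the first j letters of b
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = E[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            E[i][j] = min(substitute, E[i - 1][j] + 1, E[i][j - 1] + 1)
    return E


def dist(a: str, b: str) -> int:
    """Unit-cost edit distance, read off the last cell of the edit matrix."""
    return edit_matrix(a, b)[len(a)][len(b)]
```

For example, `dist("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion).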
An ε-match is a local alignment for substrings (α, β) with an error rate of at most ε. That is, dist(α, β) ≤ ε|β|.
(Note the ‘asymmetry’ in the definition of error rate.)
The problem can now be formally stated as follows:
Q-gram Filters for ε-Matches over a Given Length, by Clemens Gröpl, June 7, 2013, 08:06 12001

Given a target string A and a query string B, a minimum match length n0 and a maximum error rate
ε > 0;
Find all ε-matches (α, β) where α and β are substrings of A and B, respectively, such that

1. |β| ≥ n0 and
2. dist(α, β) ≤ ⌊ε|β|⌋.

12.3 q-gram filters for ε-matches

A q-hit is a pair (i, j) of indices such that A[i..i + q − 1] = B[j..j + q − 1].
The basic idea of the q-gram method is as follows:

1. Find (enumerate) all q-hits between the query and the target strings.

2. Identify regions (in the Cartesian product of the strings) that have “enough” hits.
3. Such candidate regions are then subjected to a closer examination.

The concrete methods differ in the shape and the size of the regions.
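Step 1 of the list above can be sketched in a few lines of Python. This is only an illustration: it uses a dictionary instead of the index tables described later, and the function name `q_hits` is an assumption made here. The returned pairs use the 1-based positions of the text.

```python
from collections import defaultdict


def q_hits(a: str, b: str, q: int) -> list[tuple[int, int]]:
    """Enumerate all q-hits (i, j), 1-based, with a[i..i+q-1] == b[j..j+q-1]."""
    positions = defaultdict(list)        # q-gram -> 0-based start positions in a
    for i in range(len(a) - q + 1):
        positions[a[i:i + q]].append(i)
    hits = []
    for j in range(len(b) - q + 1):
        for i in positions.get(b[j:j + q], ()):
            hits.append((i + 1, j + 1))  # convert to the 1-based indices of the text
    return hits
```

For A = "ACGTACGT", B = "CGTA" and q = 3, this yields the hits (2, 1), (6, 1) and (3, 2).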
The following lemma relates ε-matches (α, β) to parallelograms of the edit matrix. For a moment, we assume
that the length of β is known, so that we can work with an absolute bound on the distance.
An n × e parallelogram of the edit matrix consists of entries from n + 1 consecutive rows and e + 1 consecutive
diagonals.
Lemma 1. Let α and β be substrings of A and B, respectively, and assume that |β| = n and dist(α, β) ≤ e. Then
there exists an n × e parallelogram P such that

1. P contains at least T(n, q, e) := (n + 1) − q(e + 1) q-hits,

2. the B-projection of the parallelogram is pB(P) = β,

3. the A-projection pA(P) of the parallelogram is contained in α.

The A- and B-projections are defined as illustrated below.

The A-projection pA(P) of a parallelogram P is defined as the substring of A between the last column of the
first row of P and the first column of the last row of P.
The B-projection pB(P) of a parallelogram P is defined as the substring of B between the first and the last
row of P.
(Note: these figures are taken from the RECOMB and GCB version, which uses the transposed matrix of
the JCB article.)

Clearly, a q-hit (i, j) corresponds to q + 1 consecutive entries of the edit matrix along the diagonal j − i. A
q-hit is contained in a parallelogram if its corresponding matrix entries are.
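The proof of Lemma 1 is straightforward: Consider the path of an optimal alignment of α and β. Since it contains
at most e edit operations, it stays within e + 1 consecutive diagonals. The string β contains n − q + 1 q-grams, and
each of them yields a q-hit on the path unless there is an edit operation among its q edges. Each edit operation
can ‘destroy’ at most q q-hits, so at least (n − q + 1) − qe = (n + 1) − q(e + 1) q-hits remain.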
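The bound of Lemma 1 can be checked on a small example. The sketch below is restricted to substitutions, so that the alignment path stays on a single diagonal and the q-hits can be counted by comparing aligned q-grams directly. The strings and the helper names `hit_threshold` and `diagonal_hits` are made up for this illustration.

```python
def hit_threshold(n: int, q: int, e: int) -> int:
    """T(n, q, e) = (n + 1) - q(e + 1) from Lemma 1."""
    return (n + 1) - q * (e + 1)


def diagonal_hits(alpha: str, beta: str, q: int) -> int:
    """Count q-hits between aligned positions of two strings of equal length."""
    return sum(alpha[i:i + q] == beta[i:i + q] for i in range(len(beta) - q + 1))


beta = "ACGTTGCAAGCTTAGGCTAA"    # n = 20
alpha = "ACGTAGCAAGCTTAGCCTAA"   # beta with two substitutions, so e = 2

print(hit_threshold(20, 3, 2), diagonal_hits(alpha, beta, 3))  # prints "12 12"
```

In this example the two substitutions are far enough apart that each one destroys exactly q = 3 hits, so the bound is attained with equality.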
So the case where |β| is fixed was easy. Next we consider ε-matches with |β| ≥ n0. The following lemma is
the combinatorial foundation of the SWIFT algorithm.
Lemma 2. Let α and β be substrings of A and B, respectively, and assume that |β| ≥ n0 and dist(α, β) ≤ ε|β|. Let
U(n, q, ε) := T(n, q, ⌊εn⌋) = (n + 1) − q(⌊εn⌋ + 1) and assume that the q-gram size q and the threshold τ have been
chosen such that
q < ⌈1/ε⌉ and τ ≤ min{ U(n0, q, ε), U(n1, q, ε) },
where n1 := ⌈(⌊εn0⌋ + 1)/ε⌉.
Then there exists a w × e parallelogram P such that:

1. P contains at least τ q-hits whose projections intersect α and β,

2. w = (τ − 1) + q(e + 1),

3. e = ⌊(2τ + q − 3) / (1/ε − q)⌋,

4. if |β| ≤ w, then pB(P) contains β, otherwise β contains pB(P).

The purpose of Lemma 2 is as follows. Given parameters ε and n0 , we can choose suitable values for
q, τ, w, and e using Lemma 2. Then we enumerate all parallelograms P with enough hits according to these
parameters. All relevant ε-matches can be found in these regions.
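The parameter choice can be written down directly from the formulas of Lemma 2. The sketch below makes two assumptions that are not part of the lemma: it takes q as given, and it picks the largest threshold τ that the lemma allows (any smaller positive τ would also be admissible). The error rate is passed as a `Fraction` so that the floor and ceiling operations are exact. The function name `swift_parameters` is hypothetical.

```python
import math
from fractions import Fraction


def U(n: int, q: int, eps: Fraction) -> int:
    """U(n, q, eps) = (n + 1) - q(floor(eps * n) + 1)."""
    return (n + 1) - q * (math.floor(eps * n) + 1)


def swift_parameters(n0: int, eps: Fraction, q: int) -> tuple[int, int, int]:
    """Return (tau, e, w) according to Lemma 2, using the largest admissible tau."""
    if not q < math.ceil(1 / eps):
        raise ValueError("Lemma 2 requires q < ceil(1/eps)")
    n1 = math.ceil((math.floor(eps * n0) + 1) / eps)
    tau = min(U(n0, q, eps), U(n1, q, eps))   # assumption: take the maximum allowed
    if tau < 1:
        raise ValueError("no positive threshold for these parameters")
    e = math.floor((2 * tau + q - 3) / (1 / eps - q))
    w = (tau - 1) + q * (e + 1)
    return tau, e, w
```

For n0 = 50, ε = 1/20 and q = 4 we get n1 = 60, U(n0, q, ε) = 39 and U(n1, q, ε) = 45, so `swift_parameters(50, Fraction(1, 20), 4)` returns τ = 39, e = 4 and w = 58.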
Proof of Lemma 2. The lemma is proven in three steps:

1. Assuming there is an ε-match (α, β) of length |β| = n ≥ n0, show that there are at least τ q-hits in the
surrounding n × ⌊εn⌋ parallelogram.
2. Argue that there is a w × e parallelogram that contains at least τ q-hits, where w and e do not depend on
n ≥ n0.

3. Determine the dimensions w and e of such a parallelogram.

. . . details omitted . . .

12.4 Algorithm
The SWIFT algorithm relies on the q-gram filter for ε-matches of length n0 or greater. Using the parameters
obtained from Lemma 2, it searches for all w × e parallelograms which contain a sufficient number of q-hits.
In the preprocessing step, we construct a q-gram index for the target sequence A. The index consists of
two tables:

1. The occurrence table is a concatenation of the lists L(G) := { i | A[i..i + q − 1] = G } for all q-grams G ∈ Σ^q
in A.

2. The lookup table is an array indexed by the natural encoding of G to base |Σ|, giving the start of each list
in the occurrence table.
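The two tables can be built with a counting sort over the q-gram codes. The following Python sketch assumes the DNA alphabet "ACGT" and stores one extra sentinel entry at the end of the lookup table, so that the list for code c is `occurrence[lookup[c]:lookup[c + 1]]`; both choices, as well as the function names, are assumptions of this illustration. Letters outside the alphabet raise a `ValueError`.

```python
def encode(gram: str, alphabet: str = "ACGT") -> int:
    """Natural encoding of a q-gram as a number to base |alphabet|."""
    code = 0
    for ch in gram:
        code = code * len(alphabet) + alphabet.index(ch)
    return code


def build_qgram_index(a: str, q: int, alphabet: str = "ACGT"):
    """Return (lookup, occurrence) for all q-grams of a (positions are 1-based)."""
    codes = [encode(a[i:i + q], alphabet) for i in range(len(a) - q + 1)]
    size = len(alphabet) ** q
    lookup = [0] * (size + 1)
    for c in codes:                      # count the occurrences of each code
        lookup[c + 1] += 1
    for c in range(size):                # prefix sums give the start of each list
        lookup[c + 1] += lookup[c]
    occurrence = [0] * len(codes)
    next_free = lookup[:-1]              # slicing copies: next free slot per list
    for i, c in enumerate(codes):
        occurrence[next_free[c]] = i + 1
        next_free[c] += 1
    return lookup, occurrence


def occurrences(gram: str, lookup, occurrence, alphabet: str = "ACGT"):
    """Return the list L(G) of 1-based positions of the q-gram G in a."""
    c = encode(gram, alphabet)
    return occurrence[lookup[c]:lookup[c + 1]]
```

With `lookup, occ = build_qgram_index("ACGTACGT", 2)`, the call `occurrences("AC", lookup, occ)` returns `[1, 5]`. The lookup table has |Σ|^q + 1 entries, which is why this layout is only practical for small q.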

Once the q-gram index is built, the w × e parallelograms containing τ or more q-hits can be found using a
simple sliding window algorithm.
The idea is to split the (fictitious) edit matrix into overlapping bins of e + 1 diagonals. For each bin we count
the number of q-hits in the w × e parallelogram that is the intersection of the diagonals of the corresponding
bin and the rows of the sliding window W_j := B[j..j + w].
As the sliding window proceeds to W_{j+1}, the bin counters are updated to reflect the changes due to the
q-grams leaving and entering the window.
Whenever a bin counter reaches τ, the corresponding parallelogram is reported. Overlapping parallelograms
can be merged on the fly.
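The sliding window can be sketched as follows. This is a simplified illustration and not the SWIFT implementation: it keeps one counter per diagonal j − i and obtains each bin count by summing e + 1 neighbouring counters, it treats the window as w consecutive letters of B, it uses a dictionary in place of the q-gram index, and it does not merge overlapping parallelograms. Positions are 0-based, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def candidate_parallelograms(a: str, b: str, q: int, tau: int, e: int, w: int):
    """Return sorted (window_start, lowest_diagonal) pairs whose bin reached tau."""
    index = defaultdict(list)            # q-gram -> 0-based start positions in a
    for i in range(len(a) - q + 1):
        index[a[i:i + q]].append(i)

    def diagonals_at(j):
        """Diagonals j - i of all q-hits whose q-gram starts at position j of b."""
        return [j - i for i in index.get(b[j:j + q], ())]

    per_diagonal = Counter()             # q-hits per diagonal inside the window
    reported = set()
    for j in range(len(b) - q + 1):
        entering = diagonals_at(j)       # the q-gram starting at j enters
        for d in entering:
            per_diagonal[d] += 1
        leaving = j - (w - q + 1)        # this q-gram no longer fits in the window
        if leaving >= 0:
            for d in diagonals_at(leaving):
                per_diagonal[d] -= 1
        window_start = max(0, leaving + 1)
        # Only bins that contain a newly entered hit can newly reach tau.
        for d in entering:
            for lowest in range(d - e, d + 1):
                count = sum(per_diagonal[x] for x in range(lowest, lowest + e + 1))
                if count >= tau:
                    reported.add((window_start, lowest))
    return sorted(reported)
```

For A = "TTTTACGTACGGATTTT" and B = "ACGTACGGA" with q = 3, τ = 7, e = 0 and w = 9, the seven q-grams of B all hit A on the diagonal j − i = −4, so the function returns `[(0, -4)]`.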
The space requirement for the bins is reduced by searching for somewhat larger parallelograms of size
w × (e + ∆). Then each bin counts for e + ∆ + 1 diagonals, and successive bins overlap by e diagonals. While
this will lead to more verifications, it reduces the number of bins which have to be maintained. In practice, ∆
is set to a power of 2, and bin indices are computed with fast bit-operations.

Each ‘candidate’ parallelogram must be checked for the presence of an ε-match. This can be done trivially
by dynamic programming. Alternatively, one can use the knowledge about the q-grams in the ε-match to
construct an alignment by sparse dynamic programming.
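A minimal sketch of the dynamic programming check is given below. It makes a simplifying assumption: the substring β is fixed, and only the best matching substring α within the A-range of the candidate region is sought, by letting the alignment start and end anywhere in that range. A full verification would also have to consider the substrings of B inside the region. The function name `best_match_distance` is hypothetical.

```python
def best_match_distance(region: str, beta: str) -> int:
    """Minimum edit distance between beta and any substring of region."""
    prev = [0] * (len(region) + 1)       # empty prefix of beta: free start in region
    for j in range(1, len(beta) + 1):
        cur = [j] + [0] * len(region)    # beta[:j] against the empty string
        for i in range(1, len(region) + 1):
            cur[i] = min(
                prev[i - 1] + (region[i - 1] != beta[j - 1]),  # match or substitution
                prev[i] + 1,                                   # letter of beta unmatched
                cur[i - 1] + 1,                                # letter of region unmatched
            )
        prev = cur
    return min(prev)                     # free end in region
```

A candidate would then be accepted if the returned distance is at most ⌊ε|β|⌋. For instance, `best_match_distance("TTTTACGTACGGATTTT", "ACGTTCGGA")` is 1, which is too large for ε = 0.05 and |β| = 9 but acceptable for ε = 0.2.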

12.5 Results
