Lecture 4
BLAST & Genome assembly
Alexey Kozlov
Staff Scientist
[email protected]
Exelixis Lab
Sequence databases: proteins
●
AlphaFold DB
– 200M protein structures → high-quality prediction
– https://alphafold.ebi.ac.uk/
Sequence databases: GenBank
●
NCBI GenBank/INSDC →”all” DNA+proteins
– > 246M “regular” sequences as of Aug. 2023
– ~ 2.6B whole-genome-shotgun (WGS)
– “primary” database → sequences submitted and
annotated by the respective authors (biologists)
Sequence annotation in GenBank
entry name:
usually includes organism
and gene name
accession number
(~ sequence ID)
organism name and
taxonomy (=biological
classification)
sequence
Sequence databases: barcodes
●
“Secondary” databases for barcoding genes
– quality filtering, curated taxonomy, sample geodata...
– useful for species identification/classification
●
SILVA →https://www.arb-silva.de/
– 16S, >10M sequences, focus on Bacteria+Archaea
●
BOLD →http://boldsystems.org/
– COI, >7M sequences, focus on Animals
●
UNITE → https://unite.ut.ee/
– ITS, 1.7M sequences, focus on Fungi
Searching for similar sequences
●
Assumption:
– sequence similarity ⇔ evolutionary and/or
functional relatedness
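[Figure: the query is compared against annotated database sequences S1 to S5, which score 90, 70, 125, 140 and 65; the best-scoring sequences, S4 (140) and S3 (125), are reported as query hits]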
Naïve approach: complexity
●
Smith-Waterman has a complexity of O(m × n) for a query of length m and a database sequence of length n
●
A DP matrix has to be computed for the query Q and every sequence S1, S2, S3, ..., Sd in a database of size d
●
NCBI GenBank: d = 200 million
➔
Do we really need to compute all matrices?
BLAST
●
Stands for Basic Local Alignment Search Tool
●
Fast heuristic to find similar sequences
●
One of the most widely used algorithms in bioinformatics
– The two main papers from the 1990s (Altschul et al. 1990; Altschul et al. 1997) have ~180 000 citations
– cf. RAxML: ~40 000, Human Genome Sequence: ~45 000
●
Available as both a stand-alone application and a web
service
– BLAST on NCBI GenBank database:
http://blast.st-va.ncbi.nlm.nih.gov/Blast.cgi
BLAST: algorithm outline
●
Idea: reduce search space by ignoring dissimilar
regions
●
Three-step heuristic:
– Seeding: find common subwords between query and
database sequences → seeds
– Extension: starting from seeds, extend alignment in
both directions → high-scoring segment pairs (HSP)
– Evaluation: assess the statistical significance of
each HSP
BLAST: seeding
●
Pre-process query sequence
●
Build a list L1 of all subwords (factors) of
length w in the query sequence
●
Example with w := 5
Query: ACTTGCTAAC
L1: ACTTG, CTTGC, TTGCT, TGCTA, GCTAA, CTAAC
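The word-list step can be sketched in a few lines of Python. This is only an illustration of the idea, not BLAST's actual implementation; the function name `subwords` is hypothetical.

```python
def subwords(query, w):
    """Return all overlapping subwords (w-mers) of the query, left to right."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


if __name__ == "__main__":
    # Query and w taken from the example; a query of length 10 has 10 - 5 + 1 = 6 subwords.
    for word in subwords("ACTTGCTAAC", 5):
        print(word)
```

A query of length m yields m - w + 1 subwords, so the list grows linearly with the query length.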
BLAST: seeding (2)
●
For each subword Wi ∈ L1 build a neighborhood Ni which
includes “similar” subwords
●
Similarity is defined via a substitution matrix
– For proteins, BLOSUM matrices can be used
– For DNA, +2 for a match and -3 for mismatch (or +5/-4)
●
Only subwords with similarity score above a threshold T are
added into the neighborhood
●
Example with T := 4 and Wi = TTGCT
– TTGAT: +2 +2 +2 -3 +2 = 5 > 4 → added to Ni
– TGCCT: +2 -3 -3 +2 +2 = 0 < 4 → not added to Ni
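A brute-force sketch of the neighborhood construction for DNA, assuming the +2/-3 scoring quoted above. It enumerates all 4^w candidate words, which is fine for w = 5 but is not how a production tool would do it; the names `score` and `neighborhood` are hypothetical.

```python
from itertools import product

MATCH, MISMATCH = 2, -3  # DNA scores quoted in the text


def score(a, b):
    """Ungapped similarity score of two equal-length words."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def neighborhood(word, threshold, alphabet="ACGT"):
    """All words of the same length (other than the word itself) scoring above the threshold."""
    candidates = ("".join(p) for p in product(alphabet, repeat=len(word)))
    return {c for c in candidates if c != word and score(word, c) > threshold}


if __name__ == "__main__":
    n = neighborhood("TTGCT", 4)
    print(len(n), "TTGAT" in n, "TGCCT" in n)
```

With these scores and T = 4, a word of length 5 tolerates exactly one mismatch (score 5), while two mismatches already drop the score to 0.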
BLAST: seeding (3)
●
Combine all subwords' neighborhoods of a single query
sequence
L2 = ∪ Ni
●
Build a final list by adding the subwords themselves
L = L1 ∪ L2
●
Scan the database for exact matches of subwords in L
●
Example: L contains ACTTG, CTTGC, TTGCT, TGCTA, CTAAC, CTAAT, TTGAT, ...;
the database sequence AGCTATTGATGACTG contains TTGAT → seed
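The scan for seeds can be sketched as a dictionary lookup. This is a simplified illustration under the DNA scoring above (+2/-3); the helper names `word_list` and `find_seeds` are hypothetical, and real implementations use far more compact lookup structures.

```python
from itertools import product


def score(a, b, match=2, mismatch=-3):
    """Ungapped similarity score of two equal-length words."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def word_list(query, w, threshold):
    """Build L = L1 ∪ L2 as a map: word -> query offsets of the subwords it belongs to."""
    words = {}
    for i in range(len(query) - w + 1):
        sub = query[i:i + w]
        for cand in map("".join, product("ACGT", repeat=w)):
            if cand == sub or score(sub, cand) > threshold:
                words.setdefault(cand, []).append(i)
    return words


def find_seeds(query, db_seq, w, threshold):
    """Return (query offset, database offset) pairs where a word of L matches exactly."""
    words = word_list(query, w, threshold)
    seeds = []
    for j in range(len(db_seq) - w + 1):
        for i in words.get(db_seq[j:j + w], []):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACTTGCTAAC", "AGCTATTGATGACTG", 5, 4))
```

For the example query and database sequence, the seed TTGCT/TTGAT shows up as the pair (2, 5): query offset 2, database offset 5.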
BLAST: extension
●
Try to extend the alignment to the left and to the
right from the seed
●
Stop if the current total score drops by more
than X compared to the maximum seen so far
●
Trim alignment back to the maximum score
BLAST: extension (2)
●
Example with X := 3
DB sequence: A G C T A T T G A T G A C T G
Query: C A C T T G C T A A C
(the query is shifted two positions to the right, so that the seed T T G C T in the query lies under T T G A T in the DB sequence)
– Seed: +2 +2 +2 -3 +2 → current score 5, max score 5, diff 0
– Extend left by one column (A/C): -3 → current score 2, max score 5, diff 3 (not more than X → continue)
– Extend left by another column (T/A): -3 → current score -1, max score 5, diff 6 > X → stop and trim back to the seed
– Extend right by one column (G/A): -3 → current score 2, max score 5, diff 3
– Extend right by another column (A/A): +2 → current score 4, max score 5, diff 1
– Extend right by another column (C/C): +2 → current score 6, max score 6, diff 0; end of query reached
●
Final HSP: DB T T G A T G A C aligned with query T T G C T A A C
Total score: +2 +2 +2 -3 +2 -3 +2 +2 = 6
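The extension rule (stop once the score has dropped by more than X below the best score seen, then trim back to that best score) can be sketched as follows. This is an ungapped, simplified illustration with the +2/-3 DNA scores; the function names are hypothetical and it is not BLAST's actual code.

```python
MATCH, MISMATCH = 2, -3  # DNA scores used in the example


def col_score(a, b):
    return MATCH if a == b else MISMATCH


def extend_one_side(query, db, q_pos, d_pos, step, x_drop):
    """Extend in one direction; return (best score gain, number of columns kept)."""
    current = best = 0
    length = best_len = 0
    while 0 <= q_pos < len(query) and 0 <= d_pos < len(db):
        current += col_score(query[q_pos], db[d_pos])
        length += 1
        if current > best:
            best, best_len = current, length
        elif best - current > x_drop:
            break  # dropped by more than X below the maximum: stop
        q_pos += step
        d_pos += step
    return best, best_len  # implicitly trimmed back to the maximum


def extend_seed(query, db, q_start, d_start, w, x_drop):
    """Extend a seed of length w to the left and right; return (score, query part, db part)."""
    seed = sum(col_score(query[q_start + i], db[d_start + i]) for i in range(w))
    left, left_len = extend_one_side(query, db, q_start - 1, d_start - 1, -1, x_drop)
    right, right_len = extend_one_side(query, db, q_start + w, d_start + w, 1, x_drop)
    q_hsp = query[q_start - left_len:q_start + w + right_len]
    d_hsp = db[d_start - left_len:d_start + w + right_len]
    return seed + left + right, q_hsp, d_hsp


if __name__ == "__main__":
    print(extend_seed("CACTTGCTAAC", "AGCTATTGATGACTG", 3, 5, 5, 3))
```

Because the drop is always measured relative to the running maximum, extending each side independently from the seed gives the same result as tracking the total score.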
BLAST: evaluation
●
Given S=6 → are the sequences biologically related?
●
It was shown that Smith-Waterman local alignment scores of
unrelated sequences follow an extreme value distribution (EVD)
[Figure: EVD probability density function f(x), plotted for x from -5 to 20]
●
Parameters λ and μ depend on the substitution matrix, gap penalties,
sequence length and nucleotide frequencies
BLAST: evaluation (2)
●
Instead of a probability, BLAST reports a so-called
expectation value (E-value), which takes
into account the database size d:
$E \approx 1 - e^{-p(S > x)\,d}$
●
The E-value is the expected number of times that an
unrelated database sequence would obtain a score S
higher than x by chance
●
Hence, for biologically related sequences, E-value
has to be very small
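A small numerical sketch of the evaluation step. It assumes the standard Gumbel form of the EVD tail, P(S > x) = 1 - exp(-e^(-λ(x - μ))), and plugs the result into the formula E ≈ 1 - e^(-p(S > x) d) given above. The values of λ and μ below are made up for illustration; real values depend on the scoring system and sequence composition.

```python
import math


def evd_tail(x, lam, mu):
    """P(S > x) for a Gumbel-type EVD with scale parameter lam and location mu."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


def e_value(p, d):
    """E-value as given in the text: E ≈ 1 - exp(-p * d)."""
    return -math.expm1(-p * d)


if __name__ == "__main__":
    lam, mu = 0.5, 5.0  # illustrative values only
    for x in (6, 20, 60):
        p = evd_tail(x, lam, mu)
        print(x, p, e_value(p, 200_000_000))
```

For small p·d the expression 1 - e^(-p·d) is approximately p·d, so in that regime E grows linearly with the database size.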
BLAST web service
https://blast.ncbi.nlm.nih.gov/
BLAST web service: query
BLAST web service: results
Beyond BLAST
●
BLAST implicitly assumes that:
– The number of queries is relatively small (compared to
DB size)
– Pair-wise comparison is sufficient to detect relatedness
– We are (also) interested in remotely-related sequences
●
What if it's not the case?
– Metagenomics: Querying millions of reads against a
fixed-size reference database
– Protein families: Find sequences similar to multiple
queries
●
Use database pre-processing
– Seed index: BLAT (Kent 2002), MEGABLAST (Morgulis
2008), UBLAST
– Unique words index: USEARCH (Edgar 2010)
– k-mer frequencies: RDP Classifier (Wang et al. 2007)
– Profile HMMs: HMMER (Eddy 2009)
●
Trade query time for pre-processing time
Today's agenda
●
Search methods for biological sequences
– Public sequence databases
– BLAST algorithm
– Alternative search algorithms
●
Genome assembly
– De novo assembly
●
Overlap graphs
●
De Bruijn graphs
– By-reference assembly
De novo vs. by-reference assembly
●
Genome assembly is a process of reconstructing the original DNA
sequence from its factors (i.e., short reads)
●
De novo: exploit read overlaps to build a (novel) genome sequence
from scratch
– e.g., assembling the first human genome back in 2000
●
By-reference: map reads to a known reference sequence – usually
the genome of the same species or of a close relative
– e.g., getting your personal genome nowadays
●
In terms of time and space complexity, de novo assembly is orders
of magnitude slower and more memory intensive than mapping
assembly
Human genome: are we there yet?
[Figure: 2001 vs. 2022]
Genome assembly
●
A very simplified workflow:
Genome (multiple copies of a long DNA molecule: ACTTGCTA...) → Shotgun sequencing → Short reads → Assembler
●
The lengths of contigs and scaffolds are important metrics of assembly quality
Human Genome assembly
[Figure: 2014 vs. 2022]
Overlap graphs
●
Represent each read as a graph node
●
Compute all pair-wise alignments between reads
and represent overlaps as graph edges
●
Walking along a Hamiltonian path (visiting every
node exactly once), we can reconstruct the
original genomic sequence
– For a circular genome, we can start from any node
– For a linear genome, we should ideally start from a
node with no incoming edges (in practice, there
could be none or several such nodes → apply heuristics)
Overlap graphs: example
Genome: ACTTGCTCAACTGCTGGATCTA
Reads: ACTTGCT, AACTGCT, GCTCAAC, GCTGGAT, GGATCTA
[Figure: overlap graph with one node per read; edges are labelled with the overlaps GCT, AAC and GGAT. Walking ACTTGCT → GCTCAAC → AACTGCT → GCTGGAT → GGATCTA reconstructs the genome]
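A toy sketch of overlap-graph assembly for the reads of this example. It uses exact suffix/prefix overlaps of at least 3 characters (an assumption for illustration) and finds Hamiltonian paths by brute force over all read orders, which only works for a handful of reads. All function names are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Directed edges (a, b) -> overlap length for all overlapping read pairs."""
    return {(a, b): overlap(a, b, min_len)
            for a in reads for b in reads
            if a != b and overlap(a, b, min_len) > 0}


def assemble(reads, min_len=3):
    """Return the sequences spelled by all Hamiltonian paths (brute force)."""
    edges = overlap_graph(reads, min_len)
    results = []
    for order in permutations(reads):
        pairs = list(zip(order, order[1:]))
        if all(pair in edges for pair in pairs):
            seq = order[0]
            for a, b in pairs:
                seq += b[edges[(a, b)]:]  # append the non-overlapping part of b
            results.append(seq)
    return results


if __name__ == "__main__":
    reads = ["ACTTGCT", "AACTGCT", "GCTCAAC", "GCTGGAT", "GGATCTA"]
    print(assemble(reads))
```

Trying all n! read orders is exactly what makes this approach impractical beyond toy inputs.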
Overlap graphs: problems
●
There is no known efficient algorithm for finding a
Hamiltonian path
●
Complexity of doing the pair-wise alignments is
quadratic in terms of number of reads
●
Overlap graphs work well if there is a small number
of reads with significant overlap
– e.g., Sanger sequencing (or 3rd gen. → later)
●
With millions of short NGS reads, this method
becomes computationally infeasible
De Bruijn graphs
●
To build a de Bruijn graph, each read is decomposed
into a series of k-mers
– In real applications, k = 20–50 is common
– Optimal value of k depends on read length and error rate;
k < length of the shortest read
– Some methods use multiple k → SPAdes (Bankevich 2012)
De Bruijn graphs
●
Each unique k-mer is represented by an edge of the
graph
●
Nodes are (k-1)-mers: prefix and suffix of the k-mer
associated with the edge connecting them
●
Nodes with identical (k-1)-mers are “glued together”
k-mer (edge)   prefix node → suffix node
AGA            AG → GA
GAT            GA → AT
ATG            AT → TG
TGA            TG → GA
GAT            GA → AT
ATT            AT → TT
TTC            TT → TC
TCG            TC → CG
De Bruijn graphs
●
The original sequence can be reconstructed by
finding an Eulerian path (visit each edge exactly
once)
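[Figure: de Bruijn graph with an Eulerian path traced along its edges (visible labels: edges TGA, ATT, TTC, TCG; nodes TT, TC, CG)]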
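A minimal sketch of de Bruijn graph construction and Eulerian path reconstruction. Two assumptions are made for illustration: the input AGATGATTCG is one sequence whose 3-mers match the k-mers listed in the example, and the graph keeps one edge per k-mer occurrence (a multigraph) so that an Eulerian path exists. The path is found with Hierholzer's algorithm; all names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix node to the list of suffix nodes, one edge per k-mer occurrence."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = sorted(set(graph) | set(in_deg))
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next((n for n in nodes if len(graph.get(n, [])) - in_deg.get(n, 0) == 1), nodes[0])
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last character of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    g = de_bruijn(["AGATGATTCG"], 3)
    print(spell(eulerian_path(g)))
```

Both graph construction and the path search touch every edge a constant number of times, which is where the linear running time comes from.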
De Bruijn graphs: advantages
●
Compact representation of repeats
– Duplicate (k-1)-mers are represented with a single node
– Longer repeats form a single series of adjacent nodes
●
Building a de Bruijn graph has linear complexity
– Time: O(N), N = total length of all reads
– Space: O(min(G,N)), G = genome size
●
There exists an efficient algorithm to find an Eulerian
path
➔
For these reasons, most modern NGS assemblers use
de Bruijn graphs (either explicitly or implicitly)
De Bruijn graphs: problems
●
Information loss due to k-mer extraction
– Repeats are (even) harder to resolve
– Some paths are not consistent with source reads
●
An Eulerian path always exists if
– reads are error-free
– all genome positions are evenly covered
●
There can be multiple alternative paths due to repeats!
●
Graph can be disjoint due to lack of (sufficient) coverage of
some genome regions
●
In practice: manual post-processing
(see: https://www.science.org/doi/suppl/10.1126/science.abj6987/suppl_file/science.abj6987_sm.pdf )
De Bruijn graphs: error correction
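[Figure: de Bruijn graph with a main path through the nodes AG, GA, AT, TT, TC, CG (edge coverage 4x to 7x) and low-coverage (1x to 2x) side branches: the nodes GG and TG are marked as tips, the nodes TA and AC form a bubble]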
●
In reality, reads contain sequencing errors
– Errors in the middle of the read → “bubbles”
– Errors at the ends of the read → “tips”
– Coverage information (=node/edge multiplicity) can be
used to detect and fix errors
Outlook: 3rd generation sequencing
●
Further advancement after NGS
– Pacific Biosciences (PacBio SMRT)
– Oxford Nanopore (MinION)
●
3GS/PacBio vs NGS/Illumina: Image: wired.com
Today's agenda
●
Search methods for biological sequences
– Public sequence databases
– BLAST algorithm
– Alternative search algorithms
●
Genome assembly
– De novo assembly
●
Overlap graphs
●
De Bruijn graphs
– By-reference assembly
●
Hash indexes
●
Burrows-Wheeler transform
By-reference assembly
Genome (multiple copies of a long DNA molecule: ACTTGCTA...) → Shotgun sequencing → Short reads → Mapping (alignment to the reference genome)
Sliding window approach
●
Slide each read along the reference genome and mark all
positions where there is a match
– if gaps are allowed, one has to resort to classical DP algorithms like
Smith-Waterman
●
Problem: huge complexity!
– Recall the BLAST discussion
●
Build a reference genome index using:
– Hashing
– Burrows-Wheeler transform
Hashing: build genome index
●
Use a hash table to store the positions of all k-mers
in a genome:
– k << read length
– k := 9 → 4^9 = 262144 table entries (at most)
[Figure: hash table keyed by reference k-mers (e.g., ACGTCCAAC, ACGTCCAAG, GCGTCCAAC); each entry points to a linked list of positions in the reference]
Hashing: search strategy
●
For each read, select one “proxy” k-mer
– Leftmost (or middle) part is a good choice → better quality
[Figure: base quality vs. position in read (axis marks: 40bp, 250bp)]
●
Use hash table lookup to find all positions of this k-mer
in the genome → seeds
●
For each seed, try to extend the alignment (i.e., map
the rest of the read) allowing for mismatches/gaps as
in the Smith-Waterman algorithm
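A minimal sketch of hash-based read mapping. A Python dictionary stands in for the hash table, the leftmost k-mer of the read is used as the proxy, and the extension step is simplified to counting mismatches without gaps (a real mapper would use a gapped alignment here). The function names and the `max_mismatches` parameter are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Hash table: k-mer -> list of start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Look up the leftmost k-mer of the read, then verify the rest at each seed position."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    reference = "ACTTGCTCAACTGCTGGATCTA"
    index = build_index(reference, 4)
    print(map_read("GCTGGTT", reference, index, 4))
```

The index is built once per reference and reused for every read, which is the point of trading pre-processing time for query time.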
Hashing: optimizations & variants
●
Inexact matches with spaced seeds
– A binary mask defines the positions that must match (1) and the
positions where mismatches are allowed (0):
111001 ATCGGT ↔ ATCACT (positions 4 and 5 may differ)
●
Multiple k-mers per read
– Require at least n seed matches for a mapping location
to be considered
●
Inverted approach: Build hash table from the reads
and search for k-mers present in the reference
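The spaced-seed idea can be sketched by hashing only the characters at the masked-in positions. The function name `spaced_key` is hypothetical.

```python
def spaced_key(kmer, mask):
    """Keep only the characters at positions where the mask is '1'."""
    return "".join(c for c, m in zip(kmer, mask) if m == "1")


if __name__ == "__main__":
    mask = "111001"
    # Both words reduce to the same key, so they land in the same hash bucket.
    print(spaced_key("ATCGGT", mask), spaced_key("ATCACT", mask))
```

Using the reduced key for both index construction and lookup lets a single exact hash lookup find k-mers that differ at the unmasked positions.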
By-reference assembly: BWT
●
Most modern mapping tools rely on Burrows-
Wheeler Transform or BWT (Burrows and
Wheeler, 1994) → Bowtie, BWA, SOAP2
●
BWT indexes offer significant improvements
over hash-based methods in terms of both time
and memory usage
●
Also used in data compression programs such
as bzip2
Building a BWT
●
Building a BWT consists of three steps:
– Write down all cyclic rotations of the source string S
– Sort the rows lexicographically
– Store the last column → BWT(S)
S: m i s s i s s i p p i $ ($ = end-of-line marker)
Rotations       Sorted rotations   Last column
mississippi$    $mississippi       i
$mississippi    i$mississipp       p
i$mississipp    ippi$mississ       s
pi$mississip    issippi$miss       s
ppi$mississi    ississippi$m       m
ippi$mississ    mississippi$       $
sippi$missis    pi$mississip       p
ssippi$missi    ppi$mississi       i
issippi$miss    sippi$missis       s
sissippi$mis    sissippi$mis       s
ssissippi$mi    ssippi$missi       i
ississippi$m    ssissippi$mi       i
BWT(S): i p s s m $ p i s s i i
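The three steps translate almost literally into Python. This naive version materializes all rotations and is only meant for small strings; the function name `bwt` is hypothetical.

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted cyclic rotations; s must end with '$'."""
    rotations = [s[i:] + s[:i] for i in range(len(s))]  # step 1: all cyclic rotations
    rotations.sort()                                    # step 2: lexicographic sort
    return "".join(row[-1] for row in rotations)        # step 3: last column


if __name__ == "__main__":
    print(bwt("mississippi$"))
```

In ASCII the marker `$` sorts before all letters, so the rotation starting with `$` always ends up in the first row.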
BWT properties
●
BWT has two important features:
– Rows of the matrix form a sorted list of suffixes →
allows efficient search for substring occurrences
– BWT is reversible → no need to store the whole matrix,
since it can be obtained from the last column only
Suffix array (32b/char)   Suffix          Sorted rotation (last column = BWT, 2b/char)
11                        $               $mississippi
10                        i$              i$mississipp
7                         ippi$           ippi$mississ
4                         issippi$        issippi$miss
1                         ississippi$     ississippi$m
0                         mississippi$    mississippi$
9                         pi$             pi$mississip
8                         ppi$            ppi$mississi
6                         sippi$          sippi$missis
3                         sissippi$       sissippi$mis
5                         ssippi$         ssippi$missi
2                         ssissippi$      ssissippi$mi
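The link between the suffix array and the BWT can be sketched as follows: the BWT character of a row is simply the character preceding the corresponding suffix in S. The naive sort below is for illustration only; the function names are hypothetical.

```python
def suffix_array(s):
    """Start positions of all suffixes of s, in lexicographic order of the suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_sa(s, sa):
    """BWT character of each row = character preceding the suffix (wraps around for position 0)."""
    return "".join(s[i - 1] for i in sa)


if __name__ == "__main__":
    s = "mississippi$"
    sa = suffix_array(s)
    print(sa)
    print(bwt_from_sa(s, sa))
```

Sorting suffixes and sorting rotations give the same row order here because the unique marker `$` decides every comparison before the rotation wraps around.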
Why BWT?
●
Space complexity for H. sapiens (3Gb):
Index          Size                H. sapiens
Suffix tree    ~ 20 * |Genome|     60 GB
Suffix array   ~ 4 * |Genome|      12 GB
BWT-FM         ~ 0.5 * |Genome|    1.5 GB
●
Additional data structure needed for search
●
FM-index (Ferragina and Manzini, 2000)
Learn from the experts!
●
Pavel Pevzner (UC San Diego)
– co-author of SPAdes → de-novo assembler
– https://www.youtube.com/c/bioinfalgorithms/videos
●
Ben Langmead (Johns Hopkins)
– the author of Bowtie → BWT-based read mapper
– https://www.youtube.com/user/BenLangmead/playlists
Questions?
References
Altschul, Stephen; Gish, Warren; Miller, Webb; Myers, Eugene; Lipman, David (1990). Basic local alignment search tool. Journal of Molecular Biology 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712
Bankevich, Anton et al. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol. 19(5): 455–477. https://dx.doi.org/10.1089%2Fcmb.2012.0021
Burrows, Michael; Wheeler, David J. (1994). A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation
Darriba D, Flouri T, Stamatakis A (2018). The State of Software for Evolutionary Biology. Molecular Biology and Evolution 35 (5): 1037–1046. https://doi.org/10.1093/molbev/msy014
Eddy, S. R. (2009). A New Generation of Homology Search Tools Based on Probabilistic Inference. Genome Inform. 23: 205–211
Edgar, Robert C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26 (19): 2460–2461. doi:10.1093/bioinformatics/btq461
Ferragina P, Manzini G (2000). Opportunistic data structures with applications. Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on
Kent, WJ (2002). BLAT--the BLAST-like alignment tool. Genome Research 12 (4): 656–664. doi:10.1101/gr.229202. PMC 187518. PMID 11932250
Morgulis A, Coulouris G, Raytselis Y, Madden T, Agarwala R, and Schäffer AA (2008). Database indexing for production MegaBLAST searches. Bioinformatics 24 (16): 1757–1764. doi:10.1093/bioinformatics/btn322
Nidhi S, Nute GM, Warnow T, Pop M (2018). Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics, bty833. https://doi.org/10.1093/bioinformatics/bty833
Wang, Q, G. M. Garrity, J. M. Tiedje, and J. R. Cole (2007). Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 73 (16): 5261–5267. doi:10.1128/AEM.00062-07. PMID 17586664
Backup
FM-Index: LF mapping
●
Character ranks are consistent between first (F)
and last (L) columns: the i-th occurrence of a character in L
is the same text character as its i-th occurrence in F
Sorted rotation   F    L
$mississippi      $0   i0
i$mississipp      i0   p0
ippi$mississ      i1   s0
issippi$miss      i2   s1
ississippi$m      i3   m0
mississippi$      m0   $0
pi$mississip      p0   p1
ppi$mississi      p1   i1
sippi$missis      s0   s2
sissippi$mis      s1   s3
ssippi$missi      s2   i2
ssissippi$mi      s3   i3
●
Following the mapping from L to F, starting in the first row, recovers S from right to left:
... s0 i1 p1 p0 i0 $0
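The LF mapping is enough to invert the BWT. The sketch below computes character ranks in L and the first row of each character in F, then walks from the row starting with `$` backwards through the text. It assumes a single `$` terminator; the function name `inverse_bwt` is hypothetical.

```python
def inverse_bwt(bwt):
    """Recover S (ending in '$') from BWT(S) by repeated LF mapping."""
    counts = {}
    ranks = []  # ranks[i] = number of earlier occurrences of bwt[i] in L
    for c in bwt:
        ranks.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1
    first = {}  # first[c] = first row of F that starts with c
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    row, chars = 0, []  # row 0 is the rotation starting with '$'
    while bwt[row] != "$":
        c = bwt[row]
        chars.append(c)               # character preceding the current row's first character
        row = first[c] + ranks[row]   # LF mapping: same occurrence of c in column F
    return "".join(reversed(chars)) + "$"


if __name__ == "__main__":
    print(inverse_bwt("ipssm$pissii"))
```

For `ipssm$pissii` the walk visits rows 0, 1, 6, 7, 2, 8, ... and collects i, p, p, i, s, ..., which is the text read from right to left.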
FM Index
●
In addition to BWT itself, the FM-index stores:
– F offsets → FO, L rank checkpoints → LRC, SA sample → SAS
(a dash marks a value that is not stored; the rank columns count the occurrences of i, m, p, s in L up to and including the row)
Row   SAS   FO   Sorted rotation (F ... L)   i m p s    LRC
0     -     0    $mississippi                1 0 0 0    stored
1     10    1    i$mississipp                1 0 1 0    -
2     -     -    ippi$mississ                1 0 1 1    -
3     4     -    issippi$miss                1 0 1 2    -
4     -     -    ississippi$m                1 1 1 2    -
5     0     5    mississippi$                1 1 1 2    stored
6     -     6    pi$mississip                1 1 2 2    -
7     8     -    ppi$mississi                2 1 2 2    -
8     6     8    sippi$missis                2 1 2 3    -
9     -     -    sissippi$mis                2 1 2 4    -
10    -     -    ssippi$missi                3 1 2 4    stored
11    2     -    ssissippi$mi                4 1 2 4    -
FM-Index: Pattern search
Sorted rotation   F    L
$mississippi      $0   i0
i$mississipp      i0   p0
ippi$mississ      i1   s0
issippi$miss      i2   s1
ississippi$m      i3   m0
mississippi$      m0   $0
pi$mississip      p0   p1
ppi$mississi      p1   i1    ← Start here
sippi$missis      s0   s2
sissippi$mis      s1   s3    ← Found!
ssippi$missi      s2   i2
ssissippi$mi      s3   i3
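Exact pattern search in an FM-index is usually done by backward search: the pattern is processed from its last character to its first, and a range of rows of the sorted matrix is narrowed at each step. The sketch below is a simplified illustration that counts ranks by scanning the BWT directly instead of using rank checkpoints; the function names are hypothetical.

```python
def bwt(s):
    """BWT via sorted rotations (naive, for small examples)."""
    return "".join(row[-1] for row in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_occurrences(bwt_str, pattern):
    """Number of occurrences of pattern in S, using only BWT(S)."""
    first = {}  # first row of F starting with each character
    total = 0
    for c in sorted(set(bwt_str)):
        first[c] = total
        total += bwt_str.count(c)
    lo, hi = 0, len(bwt_str)  # current row range [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        # LF-map both ends of the range; a real FM-index answers these rank queries from checkpoints.
        lo = first[c] + bwt_str[:lo].count(c)
        hi = first[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt("mississippi$")
    for p in ("ssi", "iss", "i", "pip"):
        print(p, count_occurrences(b, p))
```

With rank checkpoints each step takes constant time, so the search cost depends on the pattern length and not on the genome size.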