Introduction to Bioinformatics for Computer Scientists

Lecture 4
BLAST & Genome assembly

Alexey Kozlov
Staff Scientist
[email protected]

Exelixis Lab

October 30, 2023


Today's agenda

Search methods for biological sequences
– Public sequence databases
– BLAST algorithm
– Alternative search algorithms

Genome assembly
– De novo assembly
– By-reference assembly
Disclaimer

Today: Basic algorithms
– Solving an idealized problem

Real programs are much more complicated
– Deal with biological “fuzziness”
– Heuristics, optimizations...

I’m not an expert
– Ask your advanced questions anyways :)
– Only basic stuff is relevant for the exam
Searching for similar sequences

Assumption:
– sequence similarity ⇔ evolutionary and/or functional relatedness
Annotated!

[Figure: a query sequence Q (ACTTGCTA..., length m) is searched against an annotated sequence database S1...S5 (d sequences of length n); the query hits (results) are S4 and S3]
Sequence databases: proteins

UniProt
– extensive annotations (3D structure, functions etc.)
– 560K manually curated entries + 180M automatic

Sequence databases: proteins

AlphaFold DB
– 200M protein structures → high-quality prediction
– https://alphafold.ebi.ac.uk/

Sequence databases: GenBank

NCBI GenBank/INSDC → “all” DNA + proteins
– > 246M “regular” sequences as of Aug. 2023
– ~ 2.6B whole-genome-shotgun (WGS) sequences
– “primary” database → sequences submitted and annotated by the respective authors (biologists)

Sequence annotation in GenBank
– entry name: usually includes organism and gene name
– accession number (~ sequence ID)
– organism name and taxonomy (= biological classification)
– sequence

Sequence databases: barcodes

“Secondary” databases for barcoding genes
– quality filtering, curated taxonomy, sample geodata...
– useful for species identification/classification

SILVA → https://www.arb-silva.de/
– 16S, >10M sequences, focus on Bacteria+Archaea

BOLD → http://boldsystems.org/
– COI, >7M sequences, focus on Animals

UNITE → https://unite.ut.ee/
– ITS, 1.7M sequences, focus on Fungi

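Searching for similar sequences

Assumption:
– sequence similarity ⇔ evolutionary and/or functional relatedness

[Figure: the annotation of the query sequence Q (ACTTGCTA..., length m) is unknown (???); Q is searched against an annotated sequence database S1...S5 (d sequences of length n). The query hits (results) are S4 (annotated as p53 Human) and S3 (annotated as p53 Gorilla)]
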
Naïve approach

Use the Smith-Waterman algorithm to build a pairwise alignment of the query sequence Q with every sequence in the database S1...Sd

Sort alignments by similarity score

Report best match(es)

Sequence  Score
S1        90
S2        70
S3        125
S4        140
S5        65

Query hits: S4 (140), S3 (125)

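The naïve approach can be sketched in a few lines of Python. This is a minimal illustration, not how any real tool is implemented: the function names, the toy database and the linear gap penalty of -5 are assumptions made for the example; the +2/-3 match/mismatch scores follow the DNA scoring mentioned in this lecture.

```python
def smith_waterman_score(q, s, match=2, mismatch=-3, gap=-5):
    """Best local alignment score of q vs. s; O(len(q) * len(s)) time."""
    prev = [0] * (len(s) + 1)
    best = 0
    for i in range(1, len(q) + 1):
        curr = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            sub = match if q[i - 1] == s[j - 1] else mismatch
            # Local alignment: a cell never drops below 0.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def naive_search(query, database, top=2):
    """Align the query against every database sequence and rank by score."""
    scores = {name: smith_waterman_score(query, seq) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical toy database (names and sequences are made up).
database = {
    "S1": "GGGGGGGG",
    "S2": "ACTTGCTA",
    "S3": "TTTACTTGTTT",
}
hits = naive_search("ACTTGCTA", database)
print(hits)
```

Every query costs one full DP matrix per database sequence, which is exactly the problem discussed on the next slide.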
Naïve approach: complexity

Smith-Waterman has a complexity of O(m × n) per pairwise alignment

A DP matrix has to be computed for every sequence in the database (Q vs. S1, Q vs. S2, ..., Q vs. Sd), where d is the database size → O(d × m × n) in total

NCBI GenBank: d = 200 million

Do we really need to compute all matrices?
BLAST

Stands for Basic Local Alignment Search Tool

Fast heuristic to find similar sequences

One of the most widely used algorithms in bioinformatics
– Two main papers from the 1990s have ~180 000 citations (Altschul et al. 1990, Altschul et al. 1997)
– cf. RAxML: ~40 000, Human Genome Sequence: ~45 000

Available as both a stand-alone application and a web service
– BLAST on NCBI GenBank database: http://blast.st-va.ncbi.nlm.nih.gov/Blast.cgi

BLAST: algorithm outline

Idea: reduce search space by ignoring dissimilar
regions

Three-step heuristic:
– Seeding: find common subwords between query and
database sequences → seeds
– Extension: starting from seeds, extend alignment in
both directions → high-scoring segment pairs (HSP)
– Evaluation: assess the statistical significance of
each HSP

BLAST: seeding

Pre-process query sequence
● Build a list L1 of all subwords (factors) of the
length w in the query sequence

Example with w := 5
Query: ACTTGCTAAC
L1 = {ACTTG, CTTGC, TTGCT, TGCTA, GCTAA, CTAAC}
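Building the list L1 is a simple sliding window over the query. A minimal sketch (the function name `subwords` is made up for illustration):

```python
def subwords(query, w):
    """Return all subwords (factors) of length w, in order of occurrence."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


L1 = subwords("ACTTGCTAAC", 5)
print(L1)
```

A query of length m yields m - w + 1 subwords, so the 10-character example query gives 6 subwords for w = 5.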
BLAST: seeding (2)
● For each subword Wi ∈ L1 build a neighborhood Ni which includes “similar” subwords

Similarity is defined via a substitution matrix
– For proteins, BLOSUM matrices can be used
– For DNA, +2 for a match and -3 for a mismatch (or +5/-4)

Only subwords with a similarity score above a threshold T are added to the neighborhood

Example with T := 4 and Wi = TTGCT:
– TTGAT: +2 +2 +2 -3 +2 = 5 > 4 → added to Ni
– TGCCT: +2 -3 -3 +2 +2 = 0 < 4 → not added to Ni
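For short DNA words the neighborhood can be enumerated by brute force. The sketch below assumes the +2/-3 DNA scoring from this slide; the function names are hypothetical, and real implementations avoid enumerating all 4^w candidates.

```python
from itertools import product


def word_score(a, b, match=2, mismatch=-3):
    """Ungapped similarity score of two words of equal length."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(word, threshold, alphabet="ACGT"):
    """All other words of the same length that score above the threshold."""
    result = []
    for letters in product(alphabet, repeat=len(word)):
        candidate = "".join(letters)
        if candidate != word and word_score(word, candidate) > threshold:
            result.append(candidate)
    return result


N = neighborhood("TTGCT", threshold=4)
print(len(N), N[:3])
```

With T = 4 and w = 5, a single mismatch scores 5 and two mismatches score 0, so the neighborhood consists of exactly the words that differ from Wi in one position.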
BLAST: seeding (3)

Combine the neighborhoods of all subwords of a single query sequence
L2 = ∪ Ni

Build the final list by adding the subwords themselves
L = L1 ∪ L2

Scan the database for exact matches of subwords in L

Example:
L (excerpt): ACTTG, CTTGC, TTGCT, TGCTA, CTAAC, CTAAT, TTGAT
Database sequence: AGCTATTGATGACTG
Seed: TTGAT (exact match in the database sequence)
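The database scan only needs exact string comparisons, so a set lookup per window is enough for a sketch. The function name `find_seeds` is made up, and the word list is the excerpt of L from the example above (all words are assumed to have the same length w).

```python
def find_seeds(words, db_seq):
    """Return (position, word) for every exact occurrence of a word in db_seq."""
    word_set = set(words)
    w = len(next(iter(word_set)))  # assumption: all words have length w
    seeds = []
    for i in range(len(db_seq) - w + 1):
        window = db_seq[i:i + w]
        if window in word_set:
            seeds.append((i, window))
    return seeds


L = ["ACTTG", "CTTGC", "TTGCT", "TGCTA", "CTAAC", "CTAAT", "TTGAT"]
seeds = find_seeds(L, "AGCTATTGATGACTG")
print(seeds)
```

Positions are 0-based here; the single seed TTGAT starts at index 5 of the database sequence.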
BLAST: extension

Try to extend the alignment to the left and to the
right from the seed

Stop if the current total score drops by more
than X compared to the maximum seen so far

Trim alignment back to the maximum score

[Figure: the seed is extended to the left and to the right along the reference]

BLAST: extension (2)

Example with X := 3

DB sequence: A G C T A T T G A T G A C T G
Query:           C A C T T G C T A A C
Seed:        T T G A T (DB) / T T G C T (query)

Step                Column scores               Current  Max  Diff
Seed                +2 +2 +2 -3 +2              5        5    0
Extend left by 1    -3 +2 +2 +2 -3 +2           2        5    3
Extend left by 2    -3 -3 +2 +2 +2 -3 +2        -1       5    6   → Diff > X: stop, trim back to the seed
Extend right by 1   +2 +2 +2 -3 +2 -3           2        5    3
Extend right by 2   +2 +2 +2 -3 +2 -3 +2        4        5    1
Extend right by 3   +2 +2 +2 -3 +2 -3 +2 +2     6        6    0

High-scoring segment pair (HSP):

DB:    T T G A T G A C
Query: T T G C T A A C
Total score: +2 +2 +2 -3 +2 -3 +2 +2 = 6
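The extension step can be sketched as an ungapped "X-drop" extension. The function below is an illustrative assumption (name, signature and return value are made up); it extends to the left and to the right independently, stops as soon as the running score falls more than X below the best score seen, and keeps only the best-scoring extension in each direction.

```python
def extend_seed(query, db, q_start, d_start, w, x_drop, match=2, mismatch=-3):
    """Ungapped X-drop extension of a seed of length w.

    Returns (score, q_begin, q_end, d_begin); q_end is exclusive.
    """
    def column(i, j):
        return match if query[i] == db[j] else mismatch

    def one_direction(i, j, step):
        best = current = 0
        best_len = length = 0
        while 0 <= i < len(query) and 0 <= j < len(db):
            current += column(i, j)
            length += 1
            if current > best:
                best, best_len = current, length
            elif best - current > x_drop:
                break  # score dropped by more than X: stop extending
            i += step
            j += step
        return best, best_len  # trimmed back to the maximum

    seed_score = sum(column(q_start + k, d_start + k) for k in range(w))
    left_gain, left_len = one_direction(q_start - 1, d_start - 1, -1)
    right_gain, right_len = one_direction(q_start + w, d_start + w, +1)
    score = seed_score + left_gain + right_gain
    return score, q_start - left_len, q_start + w + right_len, d_start - left_len


query = "CACTTGCTAAC"
db = "AGCTATTGATGACTG"
# Seed: query[3:8] = TTGCT aligned with db[5:10] = TTGAT (0-based positions).
hsp = extend_seed(query, db, q_start=3, d_start=5, w=5, x_drop=3)
print(hsp)
```

For the example above this reproduces the HSP TTGATGAC / TTGCTAAC with a total score of 6: the left extension gains nothing, the right extension adds three columns.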
BLAST: evaluation

Given S=6 → are sequences biologically related or are they similar just by chance?

It was shown that Smith-Waterman local alignment scores between two random sequences follow the Gumbel extreme value distribution (EVD)

[Figure: EVD probability density function f(x) over the alignment score x, plotted for (µ=0.5, β=2.0), (µ=1.0, β=2.0), (µ=1.5, β=3.0) and (µ=3.0, β=4.0)]

The probability p of observing a score S equal to or greater than x by chance is given by the equation

$$p(S \geq x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$

Parameters λ and μ depend on the substitution matrix, gap penalties, sequence length and nucleotide frequencies

BLAST: evaluation (2)

Instead of a probability, BLAST reports a so-
called expectation value (E-value), which takes
into account the database size d:
$$E \approx 1 - e^{-p(S > x)\, d}$$


The E-value is the expected number of times that an
unrelated database sequence would obtain a score S
higher than x by chance

Hence, for biologically related sequences, E-value
has to be very small
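
Both quantities are easy to evaluate numerically. The sketch below uses the standard Gumbel tail probability 1 - exp(-e^(-λ(x-μ))) and the E-value approximation from this slide; the parameter values (λ = 0.5, μ = 1.0, d = 1000) are made-up numbers for illustration, not fitted BLAST parameters.

```python
import math


def gumbel_p_value(x, lam, mu):
    """P(S >= x) for a Gumbel-distributed alignment score."""
    return 1.0 - math.exp(-math.exp(-lam * (x - mu)))


def e_value(p, d):
    """E-value approximation E = 1 - exp(-p * d) for a database of size d."""
    return 1.0 - math.exp(-p * d)


lam, mu, d = 0.5, 1.0, 1000  # hypothetical parameters
for score in (6, 20, 40):
    p = gumbel_p_value(score, lam, mu)
    print(score, p, e_value(p, d))
```

Higher scores give smaller p and therefore smaller E; for very small p × d the approximation behaves like E ≈ p × d.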

BLAST web service

https://blast.ncbi.nlm.nih.gov/

BLAST web service: query

https://blast.ncbi.nlm.nih.gov/

BLAST web service: results

Beyond BLAST

BLAST implicitly assumes that:
– The number of queries is relatively small (compared to
DB size)
– Pair-wise comparison is sufficient to detect relatedness
– We are (also) interested in remotely-related sequences

What if it's not the case?
– Metagenomics: Querying millions of reads against a
fixed-size reference database
– Protein families: Find sequences similar to multiple
queries

Use database pre-processing
– Seed index: BLAT (Kent 2002), MEGABLAST (Morgulis
2008), UBLAST
– Unique words index: USEARCH (Edgar 2010)
– k-mer frequencies: RDP Classifier (Wang et al. 2007)
– Profile HMMs: HMMER (Eddy 2009)

Trade query time for pre-processing time
Today's agenda

Search methods for biological sequences
– Public sequence databases
– BLAST algorithm
– Alternative search algorithms

Genome assembly
– De novo assembly
  ● Overlap graphs
  ● De Bruijn graphs
– By-reference assembly
De novo vs. by-reference assembly

Genome assembly is a process of reconstructing the original DNA
sequence from its factors (i.e., short reads)

De novo: exploit read overlaps to build a (novel) genome sequence
from scratch
– e.g., assembling the first human genome back in 2000

By-reference: map reads to the known reference sequence – usually
genome of the same species or a close relative/close relatives
– e.g., getting your personal genome nowadays

In terms of time and space complexity, de novo assembly is orders
of magnitude slower and more memory intensive than mapping
assembly

Human genome: are we there yet?
2001 2022

Genome assembly

A very simplified workflow:
Genome (multiple copies): a long DNA molecule, ACTTGCTA...
→ Shotgun sequencing → Short reads
→ Assembler → Reconstructed (consensus) sequence

De novo assembly: overview

Assemble reads into contigs using
overlaps between them
– overlap graphs or de Bruijn graphs

Use mate pairs to combine contigs into scaffolds
– Information from paired-end reads allows us to determine the order and orientation of contigs

Joining scaffolds is a manual step
– Using optical gene maps or other coarse-grained structural information
– Often impossible due to large repeats

The lengths of contigs and scaffolds are important metrics of assembly quality

Human Genome assembly
2014 2022

Overlap graphs

Represent each read as a graph node

Compute all pair-wise alignments between reads
and represent overlaps as graph edges

Walking along a Hamiltonian path (visiting every node exactly once), we can reconstruct the original genomic sequence
– For a circular genome, we can start from any node
– For a linear genome, we should ideally start from a node with no incoming edges (in practice, there could be none or multiple such nodes → apply heuristics)

Overlap graphs: example

Genome: ACTTGCTCAACTGCTGGATCTA

Reads: ACTTGCT, AACTGCT, GCTCAAC, GCTGGAT, GGATCTA

Overlap graph (edge label = overlap):
ACTTGCT → GCTCAAC (GCT)
ACTTGCT → GCTGGAT (GCT)
GCTCAAC → AACTGCT (AAC)
AACTGCT → GCTCAAC (GCT)
AACTGCT → GCTGGAT (GCT)
GCTGGAT → GGATCTA (GGAT)

Reconstructed sequence (walk with backtracking):
ACTTGCT
ACTTGCTGGAT
ACTTGCTGGATCTA (dead end: two reads not visited yet → backtrack)
ACTTGCTGGAT
ACTTGCT
ACTTGCTCAAC
ACTTGCTCAACTGCT
ACTTGCTCAACTGCTGGAT
ACTTGCTCAACTGCTGGATCTA
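The example can be reproduced with a small sketch. All function names are made up; the code assumes error-free, unique reads, exact suffix/prefix overlaps of at least 3 characters, and uses plain backtracking for the Hamiltonian path, which is exponential in the worst case and only feasible for tiny inputs.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Map each read to {successor read: overlap length}."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k:
                    graph[a][b] = k
    return graph


def hamiltonian_path(graph, path):
    """Depth-first search with backtracking over all unvisited successors."""
    if len(path) == len(graph):
        return path
    for nxt in graph[path[-1]]:
        if nxt not in path:
            found = hamiltonian_path(graph, path + [nxt])
            if found:
                return found
    return None


def spell(path, graph):
    """Merge the reads along the path, skipping the overlapping prefixes."""
    seq = path[0]
    for a, b in zip(path, path[1:]):
        seq += b[graph[a][b]:]
    return seq


reads = ["ACTTGCT", "AACTGCT", "GCTCAAC", "GCTGGAT", "GGATCTA"]
graph = build_overlap_graph(reads)
# Linear genome: start from a read without incoming edges.
targets = {b for successors in graph.values() for b in successors}
start = next(r for r in reads if r not in targets)
genome = spell(hamiltonian_path(graph, [start]), graph)
print(genome)
```

The order in which successors are tried may differ from the walk shown on the slide, but the resulting Hamiltonian path and the reconstructed sequence are the same.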
Overlap graphs: problems

There is no known efficient algorithm for finding a
Hamiltonian path

Complexity of doing the pair-wise alignments is
quadratic in terms of number of reads

Overlap graphs work well if there is a small number
of reads with significant overlap
– e.g., Sanger sequencing (or 3rd gen. → later)

With millions of short NGS reads, this method becomes computationally infeasible

De Bruijn graphs

To build a de Bruijn graph, each read is decomposed into a series of k-mers
– In real applications, k = 20–50 is common
– Optimal value of k depends on read length and error rate; k < length of the shortest read
– Some methods use multiple k → SPAdes (Bankevich 2012)

De Bruijn graphs

Each unique k-mer is represented by an edge of the
graph

Nodes are (k-1)-mers: prefix and suffix of the k-mer
associated with the edge connecting them

Nodes with identical (k-1)-mers are “glued together”


k-mer (edge): prefix → suffix (nodes)
AGA: AG → GA
GAT: GA → AT
ATG: AT → TG
TGA: TG → GA
GAT: GA → AT
ATT: AT → TT
TTC: TT → TC
TCG: TC → CG
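Graph construction is a single pass over all k-mers of all reads. In the sketch below, the function name is made up and the input is a single hypothetical read, AGATGATTCG, chosen because it spells exactly the k-mers listed above; each unique k-mer becomes one edge, and its multiplicity (coverage) is kept as a count.

```python
from collections import Counter


def de_bruijn_graph(reads, k):
    """Return (nodes, edges); edges maps (prefix, suffix) -> k-mer multiplicity."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The edge connects the (k-1)-mer prefix to the (k-1)-mer suffix.
            edges[(kmer[:-1], kmer[1:])] += 1
    nodes = {node for edge in edges for node in edge}
    return nodes, edges


nodes, edges = de_bruijn_graph(["AGATGATTCG"], k=3)
print(sorted(nodes))
print(edges)
```

Identical (k-1)-mers are "glued together" automatically because nodes are identified by their string; the repeated k-mer GAT shows up as one edge with multiplicity 2.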
De Bruijn graphs

The original sequence can be reconstructed by
finding an Eulerian path (visit each edge exactly
once)

[Figure: de Bruijn graph with nodes AG, GA, AT, TG, TT, TC, CG and edges AGA, GAT, ATG, TGA, ATT, TTC, TCG]
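
An Eulerian path can be found in time linear in the number of edges, for example with Hierholzer's algorithm. The sketch below is one possible implementation with made-up function names. It assumes that a k-mer occurring twice in the sequence (here GAT) is represented by two parallel edges; with only one edge per unique k-mer this particular graph would have no Eulerian path.

```python
from collections import Counter, defaultdict


def eulerian_path(edges):
    """Hierholzer's algorithm; edges is a list of (u, v) pairs, duplicates allowed."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Validate: every edge used exactly once along a connected walk.
    remaining = Counter(edges)
    for step in zip(path, path[1:]):
        if remaining[step] == 0:
            return None
        remaining[step] -= 1
    return path if len(path) == len(edges) + 1 else None


kmers = ["AGA", "GAT", "ATG", "TGA", "GAT", "ATT", "TTC", "TCG"]
path = eulerian_path([(k[:-1], k[1:]) for k in kmers])
sequence = path[0] + "".join(node[-1] for node in path[1:])
print(path, sequence)
```

The sequence is spelled by taking the first node in full and then appending the last character of every following node.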
De Bruijn graphs: advantages

Compact representation of repeats
– Duplicate (k-1)-mers are represented with a single node
– Longer repeats form a single series of adjacent nodes

Building a de Bruijn graph has linear complexity
– Time: O(N), N = total length of all reads
– Space: O(min(G,N)), G = genome size

There exists an efficient algorithm to find an Eulerian
path

For these reasons, most modern NGS assemblers use
de Bruijn graphs (either explicitly or implicitly)

De Bruijn graphs: problems

Information loss due to k-mer extraction
– Repeats are (even) harder to resolve
– Some paths are not consistent with source reads

An Eulerian path always exists if
– reads are error-free
– all genome positions are evenly covered

There can be multiple alternative paths due to repeats!

The graph can be disconnected due to lack of (sufficient) coverage of some genome regions

In practice: manual post-processing
(see: https://www.science.org/doi/suppl/10.1126/science.abj6987/suppl_file/science.abj6987_sm.pdf )

De Bruijn graphs: error correction
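[Figure: de Bruijn graph with edge multiplicities. The main path AG, GA, AT, TT, TC, CG has high coverage (4x–7x); the low-coverage (1x–2x) nodes GG and TG form tips, and the low-coverage nodes TA and AC form a bubble]
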


In reality, reads contain sequencing errors
– Errors in the middle of the read → “bubbles”
– Errors at the ends of the read → “tips”
– Coverage information (=node/edge multiplicity) can be
used to detect and fix errors
Outlook: 3rd generation sequencing

Further advancement after NGS
– Pacific Biosciences (PacBio SMRT)
– Oxford Nanopore (MinION)

3GS/PacBio vs. NGS/Illumina:
– Longer reads (20-100 Kb vs. 0.5 Kb)
– Higher raw error rates (5-20% vs. 0.1%), but: down to <0.1% with consensus
– Lower throughput (~100 Gb vs. 8 Tb)

Change in assembly methods
– Back to overlap graphs, hybrid approaches
– Resolve long repeats / bridge contigs

Today's agenda

Search methods for biological sequences
– Public sequence databases
– BLAST algorithm
– Alternative search algorithms

Genome assembly
– De novo assembly
  ● Overlap graphs
  ● De Bruijn graphs
– By-reference assembly
  ● Hash indexes
  ● Burrows-Wheeler transform
By-reference assembly
Genome (multiple copies): a long DNA molecule, ACTTGCTA...
→ Shotgun sequencing → Short reads
→ Mapping (alignment to the reference genome)
→ Reconstructed (consensus) sequence

Sliding window approach

Slide each read along the reference genome and mark all
positions where there is a match
– if gaps are allowed, one has to resort to the classical DP algorithms like
Smith-Waterman

[Figure: a read is slid along the reference until a match is found]

Problem: huge complexity!
– Recall the BLAST discussion

Build a reference genome index using:
– Hashing
– Burrows-Wheeler transform
Hashing: build genome index

Use hash table to store positions of all k-mers
in a genome:
– k << read length
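
k := 9 → 4^9 = 262144 table entries (at most)

[Figure: hash table whose keys are k-mers of the reference (ACGTCCAAC, ACGTATAAC, ACGTCCAAG, GCGTCCAAC, ACTACCAAC, ACGTTTAAT, TCGTCGAAC); each entry holds a linked list of positions of that k-mer in the reference, e.g. ACGTCCAAC → 145, 532]
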
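In Python, a dictionary of lists plays the role of the hash table with linked lists of positions. A minimal sketch; the function name and the short toy reference are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


reference = "ACGTACGTTACGT"  # hypothetical toy reference
index = build_kmer_index(reference, k=4)
print(index["ACGT"])
print(len(index), "distinct k-mers, at most", 4 ** 4)
```

The index is built once per reference genome and then reused for all reads.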
Hashing: search strategy

For each read, select one “proxy” k-mer
– Leftmost (or middle) part is a good choice → better quality
[Figure: base quality vs. position in read, with marks at 40bp and 250bp]


Use hash table lookup to find all positions of this k-mer
in the genome → seeds

For each seed, try to extend the alignment (i.e., map
the rest of the read) allowing for mismatches/gaps like
in Smith-Waterman algorithm
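
The search strategy above can be sketched as follows. This is a simplified illustration, not a real read mapper: the names are made up, the proxy k-mer is always the leftmost one, and the extension step only counts mismatches in an ungapped comparison instead of running a Smith-Waterman-like alignment with gaps.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (position, mismatches) for every acceptable mapping location."""
    hits = []
    for pos in index.get(read[:k], []):  # seeds: hits of the leftmost k-mer
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "GGACGTCCAAGGTTACGTTTAAGG"  # hypothetical toy reference
index = build_kmer_index(reference, k=4)
hits = map_read("ACGTCCTA", reference, index, k=4)
print(hits)
```

The proxy k-mer ACGT has two seeds (positions 2 and 14), but only the first one extends to a mapping with at most two mismatches.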
Hashing: optimizations & variants

Inexact matches with spaced seeds
– Binary mask defines positions where mismatches are
allowed:
111001 ATCGGT ↔ ATCACT

Multiple k-mers per read
– Require at least n seed matches for a mapping location
to be considered

Inverted approach: Build hash table from the reads
and search for k-mers present in the reference
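
A spaced seed can be implemented by hashing only the positions where the mask has a 1. A minimal sketch with made-up function names:

```python
def spaced_key(mask, kmer):
    """Keep only the characters at positions where the mask is '1'."""
    return "".join(c for m, c in zip(mask, kmer) if m == "1")


def spaced_match(mask, a, b):
    """True if a and b agree at every position where the mask is '1'."""
    return spaced_key(mask, a) == spaced_key(mask, b)


mask = "111001"
print(spaced_key(mask, "ATCGGT"), spaced_key(mask, "ATCACT"))
print(spaced_match(mask, "ATCGGT", "ATCACT"))
```

Using `spaced_key` instead of the full k-mer as the hash table key makes both words from the example land in the same table entry.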

By-reference assembly: BWT

Most modern mapping tools rely on Burrows-
Wheeler Transform or BWT (Burrows and
Wheeler, 1994) → Bowtie, BWA, SOAP2

BWT indexes offer significant improvements
over hash-based methods in terms of both time
and memory usage

Also used in data compression programs such
as bzip2

Building a BWT

Building a BWT consists of three steps:
– Write down all cyclic rotations of the source string S
– Sort the rows lexicographically
– Store the last column → BWT(S)

S: mississippi$   ($ = end-of-line marker)

Rotations       Sorted rotations   BWT(S) (last column)
mississippi$    $mississippi       i
$mississippi    i$mississipp       p
i$mississipp    ippi$mississ       s
pi$mississip    issippi$miss       s
ppi$mississi    ississippi$m       m
ippi$mississ    mississippi$       $
sippi$missis    pi$mississip       p
ssippi$missi    ppi$mississi       i
issippi$miss    sippi$missis       s
sissippi$mis    sissippi$mis       s
ssissippi$mi    ssippi$missi       i
ississippi$m    ssissippi$mi       i
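
The three steps translate directly into Python. This sketch (function name made up) materializes all rotations, so it needs quadratic memory and is only meant for short strings; it relies on '$' sorting before all letters, which holds for ASCII.

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted cyclic rotations."""
    assert s.endswith("$"), "expects an end marker that sorts before all letters"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt("mississippi$"))
```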
BWT properties

BWT has two important features:
– Rows of the matrix form a sorted list of suffixes → allows efficient search for substring occurrences
– BWT is reversible → no need to store the whole matrix, since it can be obtained from the last column only

Suffix array (32b/char)   Suffix          Sorted rotation   BWT (2b/char)
11                        $               $mississippi      i
10                        i$              i$mississipp      p
7                         ippi$           ippi$mississ      s
4                         issippi$        issippi$miss      s
1                         ississippi$     ississippi$m      m
0                         mississippi$    mississippi$      $
9                         pi$             pi$mississip      p
8                         ppi$            ppi$mississi      i
6                         sippi$          sippi$missis      s
3                         sissippi$       sissippi$mis      s
5                         ssippi$         ssippi$missi      i
2                         ssissippi$      ssissippi$mi      i
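
The link between the suffix array and the BWT, and the substring search on sorted suffixes, can be sketched as follows. Function names are made up; the suffix array is built by naive sorting, which is fine for a toy string but not for a genome.

```python
def suffix_array(s):
    """Start positions of all suffixes of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_suffix_array(s, sa):
    """BWT character = character preceding each suffix (cyclically)."""
    return "".join(s[i - 1] for i in sa)


def find_occurrences(pattern, s, sa):
    """Binary search for the block of suffixes that start with pattern."""
    def prefix(rank):
        return s[sa[rank]:sa[rank] + len(pattern)]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


s = "mississippi$"
sa = suffix_array(s)
print(sa)
print(bwt_from_suffix_array(s, sa))
print(find_occurrences("ssi", s, sa))
```

All occurrences of a pattern form one contiguous block of rows, which is why two binary searches are enough.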
Why BWT?

Space complexity                  H. sapiens (3Gb)
– Suffix tree:  ~ 20 * |Genome|   60 GB
– Suffix array: ~ 4 * |Genome|    12 GB
– BWT-FM:       ~ 0.5 * |Genome|  1.5 GB

An additional data structure is needed for search → FM-index (Ferragina and Manzini, 2000)

Learn from the experts!

Pavel Pevzner (UC San Diego)
– co-author of SPAdes → de novo assembler
– https://www.youtube.com/c/bioinfalgorithms/videos

Ben Langmead (Johns Hopkins)
– the author of Bowtie → BWT-based read mapper
– https://www.youtube.com/user/BenLangmead/playlists

Questions?
References
Altschul, Stephen; Gish, Warren; Miller, Webb; Myers, Eugene; Lipman, David (1990). Basic local alignment search tool. Journal of Molecular Biology 215
(3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712

Bankevich, Anton et al. (2012) SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol. 2012 May; 19(5): 455–477. https://dx.doi.org/10.1089%2Fcmb.2012.0021

Burrows, Michael; Wheeler, David J. (1994). A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation

Darriba D, Flouri T, Stamatakis A (2018) The State of Software for Evolutionary Biology, Molecular Biology and Evolution, Volume 35, Issue 5, 1 May 2018, Pages 1037–1046, https://doi.org/10.1093/molbev/msy014

Eddy, S. R. (2009). A New Generation of Homology Search Tools Based on Probabilistic Inference. Genome Inform., 23:205-211.

Edgar, Robert C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics (2010) 26 (19): 2460-2461
doi:10.1093/bioinformatics/btq461

Ferragina P, Manzini G (2000). Opportunistic data structures with applications. Foundations of Computer Science, 2000. Proceedings. 41st Annual
Symposium on

Kent, WJ (2002). BLAT--the BLAST-like alignment tool. Genome Research 12 (4): 656–664. doi:10.1101/gr.229202. PMC 187518. PMID 11932250

Morgulis A, Coulouris G, Raytselis Y, Madden T, Agarwala R, and Schäffer AA (2008). Database indexing for production MegaBLAST searches.
Bioinformatics 24 (16): 1757-1764 first published online June 21, 2008 doi:10.1093/bioinformatics/btn322

Nidhi S, Nute GM, Warnow T, Pop M (2018) Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, Bioinformatics, bty833, https://doi.org/10.1093/bioinformatics/bty833

Wang, Q, G. M. Garrity, J. M. Tiedje, and J. R. Cole (2007). Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial
Taxonomy. Appl Environ Microbiol. 73(16):5261-5267; doi: 10.1128/AEM.00062-07 [PMID: 17586664]

+ additional references in the old slide set: http://sco.h-its.org/exelixis/web/teaching/lectures2014/lecture4.pdf

Backup
FM-Index: LF mapping

Character ranks are consistent between first (F)
and last (L) columns

F L
$ m i s s i s s i p p i0
i0 $ m i s s i s s i p p
i1 p p i $ m i s s i s s
i2 s s i p p i $ m i s s
i3 s s i s s i p p i $ m
m i s s i s s i p p i $1
p i $ m i s s i s s i p1
p p i $ m i s s i s s i1
s i p p i $ m i s s i s1
s i s s i p p i $ m i s1
s s i p p i $ m i s s i2
s s i s s i p p i $ m i3
63 / 66
FM-Index: LF mapping

Character ranks are consistent between first (F)
and last (L) columns

F L F L
$ m i s s i s s i p p i0 $0 i0
i0 $ m i s s i s s i p p i0 p0
i1 p p i $ m i s s i s s i1 s0
i2 s s i p p i $ m i s s i2 s1
i3 s s i s s i p p i $ m i3 m0
m i s s i s s i p p i $1 m0 $0
p i $ m i s s i s s i p1 p0 p1
p p i $ m i s s i s s i1 p1 i1
s i p p i $ m i s s i s1 s0 s2
s i s s i p p i $ m i s1 s1 s3
s s i p p i $ m i s s i2 s2 i2
s s i s s i p p i $ m i3 s3 i3
63 / 66
FM-Index: LF mapping

Character ranks are consistent between first (F)
and last (L) columns
$0

F L F L
$ m i s s i s s i p p i0 $0 i0
i0 $ m i s s i s s i p p i0 p0
i1 p p i $ m i s s i s s i1 s0
i2 s s i p p i $ m i s s i2 s1
i3 s s i s s i p p i $ m i3 m0
m i s s i s s i p p i $1 m0 $0
p i $ m i s s i s s i p1 p0 p1
p p i $ m i s s i s s i1 p1 i1
s i p p i $ m i s s i s1 s0 s2
s i s s i p p i $ m i s1 s1 s3
s s i p p i $ m i s s i2 s2 i2
s s i s s i p p i $ m i3 s3 i3
63 / 66
FM-Index: LF mapping

Character ranks are consistent between first (F)
and last (L) columns
$0

F L F L
$ m i s s i s s i p p i0 $0 i0
i0 $ m i s s i s s i p p i0 p0
i1 p p i $ m i s s i s s i1 s0
i2 s s i p p i $ m i s s i2 s1
i3 s s i s s i p p i $ m i3 m0
m i s s i s s i p p i $1 m0 $0
p i $ m i s s i s s i p1 p0 p1
p p i $ m i s s i s s i1 p1 i1
s i p p i $ m i s s i s1 s0 s2
s i s s i p p i $ m i s1 s1 s3
s s i p p i $ m i s s i2 s2 i2
s s i s s i p p i $ m i3 s3 i3
63 / 66
FM-Index: LF mapping

Character ranks are consistent between first (F)
and last (L) columns
i0 $0

F L F L
$ m i s s i s s i p p i0 $0 i0
i0 $ m i s s i s s i p p i0 p0
i1 p p i $ m i s s i s s i1 s0
i2 s s i p p i $ m i s s i2 s1
i3 s s i s s i p p i $ m i3 m0
m i s s i s s i p p i $1 m0 $0
p i $ m i s s i s s i p1 p0 p1
p p i $ m i s s i s s i1 p1 i1
s i p p i $ m i s s i s1 s0 s2
s i s s i p p i $ m i s1 s1 s3
s s i p p i $ m i s s i2 s2 i2
s s i s s i p p i $ m i3 s3 i3
63 / 66
FM-Index: LF mapping

Character ranks are consistent between first (F)
and last (L) columns
– the character at L in a row is the same text character as the
  equally ranked character in F
– following L → F repeatedly, starting at $0, spells the text backwards

F    rotation       L
$0   $mississippi   i0
i0   i$mississipp   p0
i1   ippi$mississ   s0
i2   issippi$miss   s1
i3   ississippi$m   m0
m0   mississippi$   $0
p0   pi$mississip   p1
p1   ppi$mississi   i1
s0   sippi$missis   s2
s1   sissippi$mis   s3
s2   ssippi$missi   i2
s3   ssissippi$mi   i3

Text recovered so far, step by step:
– $0
– i0 $0
– p0 i0 $0
– p1 p0 i0 $0
– i1 p1 p0 i0 $0
– s0 i1 p1 p0 i0 $0
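The LF walk above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the rotation matrix is built naively by sorting all rotations, and the helper names bwt_with_ranks and invert_bwt are made up for this example.

```python
def bwt_with_ranks(text):
    """Return the F and L columns of the sorted rotation matrix
    as lists of (character, rank) pairs. The text must end with '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))

    def ranked(column):
        # Rank = how many times this character was seen earlier in the column.
        seen = {}
        out = []
        for ch in column:
            out.append((ch, seen.get(ch, 0)))
            seen[ch] = seen.get(ch, 0) + 1
        return out

    first = ranked(r[0] for r in rotations)
    last = ranked(r[-1] for r in rotations)
    return first, last


def invert_bwt(first, last):
    """Rebuild the text by following L -> F, starting in the '$' row."""
    row_of = {pair: row for row, pair in enumerate(first)}
    row = 0                      # row 0 starts with '$'
    chars = [first[0][0]]        # the text ends with '$'
    for _ in range(len(last) - 1):
        pair = last[row]         # character preceding F[row] in the text
        chars.append(pair[0])
        row = row_of[pair]       # same (character, rank) in column F
    return "".join(reversed(chars))


first, last = bwt_with_ranks("mississippi$")
print("".join(ch for ch, _ in last))
print(invert_bwt(first, last))
```

Sorting all rotations costs far too much for real genomes; practical tools derive the BWT from a suffix array instead.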
FM Index

In addition to BWT itself, the FM-index stores:
– F offsets → FO, L rank checkpoints → LRC, SA sample → SAS

SAS  FO   rotation (F … L)   i  m  p  s   LRC
      0   $mississippi       1  0  0  0   stored
 10   1   i$mississipp       1  0  1  0
          ippi$mississ       1  0  1  1
  4       issippi$miss       1  0  1  2
          ississippi$m       1  1  1  2
  0   5   mississippi$       1  1  1  2   stored
      6   pi$mississip       1  1  2  2
  8       ppi$mississi       2  1  2  2
  6   8   sippi$missis       2  1  2  3
          sissippi$mis       2  1  2  4
          ssippi$missi       3  1  2  4   stored
  2       ssissippi$mi       4  1  2  4

– FO: row where each character first appears in F
– i m p s: occurrences of each character in L up to and including this row
– LRC keeps only the rows marked "stored"
– SAS keeps only the listed suffix array entries
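A minimal Python sketch of how the three tables could be built for the example text. The sampling rules are assumptions read off the slide (a rank checkpoint every 5th row, and suffix array samples at even text offsets), and build_fm_index with its parameters is a hypothetical helper, not the layout of any particular tool.

```python
def build_fm_index(text, checkpoint_step=5, sa_step=2):
    """Build BWT, FO, LRC and SAS for a text that ends with '$'."""
    n = len(text)
    # Naive suffix array: sort suffix start offsets by the suffix itself.
    sa = sorted(range(n), key=lambda i: text[i:])
    # L column: the character preceding each suffix (text[-1] is '$').
    bwt = "".join(text[i - 1] for i in sa)

    # FO: first row in column F for each character.
    fo = {}
    for row, start in enumerate(sa):
        fo.setdefault(text[start], row)

    # LRC: counts of each character in bwt[0..row], kept every few rows.
    counts = {ch: 0 for ch in sorted(set(text))}
    lrc = {}
    for row, ch in enumerate(bwt):
        counts[ch] += 1
        if row % checkpoint_step == 0:
            lrc[row] = dict(counts)

    # SAS: keep only suffix array entries at sampled text offsets.
    sas = {row: start for row, start in enumerate(sa) if start % sa_step == 0}
    return bwt, fo, lrc, sas


bwt, fo, lrc, sas = build_fm_index("mississippi$")
print(bwt)
print(fo)
print(lrc)
print(sas)
```

A rank between two checkpoints is obtained by counting from the nearest stored row, and an unsampled suffix array entry is obtained by LF steps until a sampled row is reached. Both trade a little time for much less memory.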
FM-Index: Pattern search

Find pattern: sip
– match the pattern backwards: p, then i, then s

F    rotation       L
$0   $mississippi   i0
i0   i$mississipp   p0
i1   ippi$mississ   s0
i2   issippi$miss   s1
i3   ississippi$m   m0
m0   mississippi$   $0
p0   pi$mississip   p1   ← Start here: rows with F = p
p1   ppi$mississi   i1   ← Start here
s0   sippi$missis   s2   ← Found!
s1   sissippi$mis   s3
s2   ssippi$missi   i2
s3   ssissippi$mi   i3

Matched so far, step by step:
– p1 (of the two p rows, only p1 has i in L)
– i1 p1 (L = i1 → go to row F = i1, which has L = s0)
– s0 i1 p1 (L = s0 → go to row F = s0)
– Found! The row with F = s0 starts with sip
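The search above can be sketched as a backward search over a range of rows. This is a simplified illustration with made-up names: fm_search counts ranks naively over the BWT and keeps the full suffix array, where a real FM-index would use the LRC checkpoints and the SAS samples.

```python
def fm_search(text, pattern):
    """Return the sorted text offsets where pattern occurs.
    The text must end with '$'."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)          # L column

    # FO: first row in column F for each character.
    fo = {}
    for row, ch in enumerate(sorted(text)):
        fo.setdefault(ch, row)

    def rank(ch, row):
        # Occurrences of ch in bwt[0:row]; naive stand-in for LRC lookups.
        return bwt.count(ch, 0, row)

    lo, hi = 0, n                # current row range [lo, hi)
    for ch in reversed(pattern): # match the pattern right to left
        if ch not in fo:
            return []
        lo = fo[ch] + rank(ch, lo)
        hi = fo[ch] + rank(ch, hi)
        if lo >= hi:             # range became empty: no match
            return []
    return sorted(sa[row] for row in range(lo, hi))


print(fm_search("mississippi$", "sip"))
print(fm_search("mississippi$", "ssi"))
```

For sip the row range shrinks from the two p rows to the single row starting with sippi, whose suffix array entry is text offset 6. The number of steps depends on the pattern length, not on the text length.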
