Short Read Alignment Algorithms: Raluca Gordân
Raluca Gordân
Aug 2, 2017
Sequencing applications
RNA-seq – this course
ChIP-seq: identify and measure significant peaks
[Figure: overlapping short reads piled up over one region of the reference]
Genotyping: identify variations
[Figure: short reads aligned to the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
Sequencing technologies
TECHNICAL BIASES!
Sequence alignment
• INPUT:
– A few million short reads, with certain error characteristics
(specific to the sequencing platform)
– Illumina: few errors, mostly substitutions
– A reference genome
• OUTPUT:
– Alignments of the reads to the reference genome
No match at offset 1
Match at offset 2
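The offset-by-offset check can be sketched as a brute-force scan. This is only an illustration of the naive idea, not how any of the aligners discussed here work. The function name `naive_match_offsets` and the example reference string are assumptions made for this sketch; the string was chosen so that GATTACA fails at offset 1 and matches at offset 2.

```python
def naive_match_offsets(read, reference):
    """Return every 1-based offset at which `read` occurs exactly in `reference`."""
    hits = []
    # Try each possible start position and compare the whole read.
    for start in range(len(reference) - len(read) + 1):
        if reference[start:start + len(read)] == read:
            hits.append(start + 1)  # report 1-based offsets, as on the slide
    return hits


# Assumed example text (hypothetical, not taken from the slides).
reference = "TGATTACAGATTACC"
print(naive_match_offsets("GATTACA", reference))  # [2]
```

The scan costs on the order of (reference length × read length) character comparisons per read, which is why index-based approaches are used for millions of reads.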
Suffix arrays: search for GATTACA
• Step 1: Lo = 1; Hi = 15; Mid = (1+15)/2 = 8; Middle = Suffix[8] = CC; GATTACA sorts after CC, so Lo = Mid + 1
• Step 2: Lo = 9; Hi = 15; Mid = (9+15)/2 = 12; GATTACA sorts before Suffix[12], so Hi = Mid - 1
• Step 3: Lo = 9; Hi = 11; Mid = (9+11)/2 = 10; GATTACA sorts before Suffix[10], so Hi = Mid - 1
• Step 4: Lo = 9; Hi = 9; Mid = (9+9)/2 = 9
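The binary search over the suffix array can be sketched as follows. The text `TGATTACAGATTACC` is an assumption: it does not appear in the slides, but it has 15 suffixes and its 8th sorted suffix is CC, so it is consistent with the Lo/Hi/Mid values shown. The function names are hypothetical, and the suffix array is built by plain sorting, which is fine for a toy text but not for a genome.

```python
def build_suffix_array(text):
    """Return the 0-based start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_array_search(text, sa, query):
    """Binary search for a suffix that begins with `query`.

    Uses 1-based Lo/Hi/Mid to mirror the slides.
    Returns (1-based text offset or None, list of (lo, hi, mid) steps).
    """
    lo, hi = 1, len(sa)
    trace = []
    while lo <= hi:
        mid = (lo + hi) // 2
        trace.append((lo, hi, mid))
        start = sa[mid - 1]
        prefix = text[start:start + len(query)]
        if prefix == query:
            return start + 1, trace
        if query > prefix:
            lo = mid + 1   # query sorts after the middle suffix
        else:
            hi = mid - 1   # query sorts before the middle suffix
    return None, trace


text = "TGATTACAGATTACC"  # assumed example text
sa = build_suffix_array(text)
offset, trace = suffix_array_search(text, sa, "GATTACA")
print(offset)  # 2
print(trace)   # [(1, 15, 8), (9, 15, 12), (9, 11, 10), (9, 9, 9)]
```

Each step halves the candidate range, so a query needs roughly log2(n) string comparisons against a text of length n.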
Suffix arrays: analysis
Same size as T
Burrows-Wheeler
[Figure: T, the Burrows-Wheeler matrix of T, and the Burrows-Wheeler transform of T, BWT(T)]
Last first (LF) mapping
[Figure: a character with rank 2 in the last column maps to the character with rank 2 in the first column]
We can repeatedly apply LF mapping to reconstruct T from BWT(T): the UNPERMUTE algorithm (Burrows and Wheeler, 1994)
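A minimal sketch of the transform and of reconstruction by repeated LF mapping is given below. It assumes the text ends with a unique `$` that sorts before every other character, and it builds the Burrows-Wheeler matrix explicitly by sorting rotations, which is practical only for toy inputs. The function names are hypothetical; only the UNPERMUTE idea comes from the slide.

```python
def bwt(text):
    """Last column of the sorted-rotation matrix; `text` must end with a unique '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def lf_mapping(last):
    """For each row i, the row whose first character is the same text character as last[i]."""
    counts = {}
    ranks = []
    for ch in last:
        ranks.append(counts.get(ch, 0))      # rank of this occurrence in the last column
        counts[ch] = counts.get(ch, 0) + 1
    first_row = {}
    total = 0
    for ch in sorted(counts):                # first column is the sorted characters
        first_row[ch] = total
        total += counts[ch]
    # Same rank in the last column and in the first column: that is the LF mapping.
    return [first_row[ch] + rank for ch, rank in zip(last, ranks)]


def unpermute(last):
    """Rebuild the text from its BWT by walking LF from the row that starts with '$'."""
    lf = lf_mapping(last)
    row = 0                                  # row 0 begins with '$'
    chars = []
    while last[row] != "$":
        chars.append(last[row])              # characters come out right to left
        row = lf[row]
    return "".join(reversed(chars)) + "$"


print(bwt("acaacg$"))        # gc$aaac
print(unpermute("gc$aaac"))  # acaacg$
```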
LF mapping and exact matching
EXACTMATCH algorithm (Ferragina and Manzini, 2000): calculates the range of
matrix rows beginning with successively longer suffixes of the query
Reference: acaacg. Query: aac
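The row-range calculation can be sketched for this reference and query as follows. This is an illustrative version under simplifying assumptions: occurrence counts are computed by rescanning the BWT string rather than with the precomputed tables a real index would use, the reference is terminated with `$`, and the function names are hypothetical.

```python
def bwt(text):
    """Last column of the sorted-rotation matrix; `text` must end with a unique '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def exact_match(last, query):
    """Return the half-open range (top, bot) of matrix rows that begin with `query`,
    or None if the query does not occur."""
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    first_row = {}                           # first row starting with each character
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    top, bot = 0, len(last)                  # start with every row
    for ch in reversed(query):               # extend the matched suffix one character leftwards
        if ch not in first_row:
            return None
        top = first_row[ch] + last[:top].count(ch)
        bot = first_row[ch] + last[:bot].count(ch)
        if top >= bot:
            return None
    return top, bot


last = bwt("acaacg$")
print(exact_match(last, "c"))    # (4, 6)
print(exact_match(last, "ac"))   # (2, 4)
print(exact_match(last, "aac"))  # (1, 2)
```

The three printed ranges are the rows beginning with c, then ac, then aac: the successively longer suffixes of the query.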
• The Bowtie algorithm consists of three phases that alternate between using the forward and mirror indices
• Partial alignments with mismatches only in the hi-half are found first; a later phase extends the partial alignments into full alignments
1. no mismatches in seed
2. no mismatches in hi-half, one or two mismatches in lo-half
3. no mismatches in lo-half, one or two mismatches in hi-half
4. one mismatch in hi-half, one mismatch in lo-half
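To make the four cases concrete, the sketch below classifies a seed by where its mismatches fall. It does not implement Bowtie's phased search; it only labels an already-aligned seed. It assumes the hi-half is the first half of the seed and that an odd-length seed gives the extra base to the lo-half; both conventions, and the function name, are assumptions made for illustration.

```python
def seed_mismatch_case(read_seed, ref_seed):
    """Return the case number (1-4) for a seed alignment, or None if no case applies."""
    if len(read_seed) != len(ref_seed):
        raise ValueError("seed and reference window must have the same length")
    half = len(read_seed) // 2               # assumed boundary between hi-half and lo-half
    mismatches = [i for i, (a, b) in enumerate(zip(read_seed, ref_seed)) if a != b]
    hi = sum(1 for i in mismatches if i < half)
    lo = len(mismatches) - hi
    if hi == 0 and lo == 0:
        return 1                             # no mismatches in seed
    if hi == 0 and lo in (1, 2):
        return 2                             # mismatches only in lo-half
    if lo == 0 and hi in (1, 2):
        return 3                             # mismatches only in hi-half
    if hi == 1 and lo == 1:
        return 4                             # one mismatch in each half
    return None                              # more than two mismatches in the seed


print(seed_mismatch_case("ACGTACGT", "ACGTACGT"))  # 1
print(seed_mismatch_case("ACGTACGT", "ACGTACGA"))  # 2
print(seed_mismatch_case("TCGTACGT", "ACGTACGT"))  # 3
print(seed_mismatch_case("TCGTACGA", "ACGTACGT"))  # 4
```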
Aligning 2 million reads to the human genome
Maq = Mapping and Assembly with Qualities
SOAP = Short Oligonucleotide Analysis Package
Constructing the index
• How do we construct a BWT index?
• Calculating the BWT is closely related to building a suffix array
• Each element of the BWT can be derived from the corresponding element
of the suffix array: BWT[i] = T[SA[i] - 1], or $ when SA[i] = 0 (0-based offsets)
• One could generate all suffixes, sort them to obtain the SA, then calculate
the BWT in a single pass over the suffix array
• However, constructing the entire suffix array in memory requires at least
~12 gigabytes for the human genome
• Instead, Bowtie uses a block-wise strategy: it builds the suffix array and the
BWT block-by-block, discarding suffix array blocks once the corresponding BWT
block has been built
• Bowtie can build the full index for the human genome in about 24 hours in
less than 1.5 gigabytes of RAM
• If 16 gigabytes of RAM or more is available, Bowtie can exploit the
additional RAM to produce the same index in about 4.5 hours
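The relation between the suffix array and the BWT, and the idea of building the BWT one block at a time, can be sketched as follows. The first function applies BWT[i] = T[SA[i] - 1] (with `$` when SA[i] = 0). The block-wise function is a simplified stand-in and not Bowtie's implementation: it assumes a caller-supplied list of splitter strings, collects the suffixes that fall between consecutive splitters, sorts only that block, emits its piece of the BWT, and then lets the block go. All names are hypothetical, and the slicing-based comparisons are far too slow for a real genome.

```python
def bwt_from_suffix_array(text, sa):
    """Single pass over the suffix array: the character preceding each suffix."""
    return "".join(text[i - 1] if i > 0 else "$" for i in sa)


def bwt_full(text):
    """Build the whole suffix array in memory, then derive the BWT."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return bwt_from_suffix_array(text, sa)


def bwt_blockwise(text, splitters):
    """Build the BWT block by block, holding only one suffix-array block at a time."""
    bounds = [None] + sorted(splitters) + [None]
    pieces = []
    for low, high in zip(bounds, bounds[1:]):
        # Suffixes in the lexicographic range [low, high) form one block.
        block = [i for i in range(len(text))
                 if (low is None or text[i:] >= low)
                 and (high is None or text[i:] < high)]
        block.sort(key=lambda i: text[i:])
        pieces.append(bwt_from_suffix_array(text, block))
        # `block` is replaced on the next iteration, so earlier blocks are not kept.
    return "".join(pieces)


text = "acaacg$"  # must end with a unique '$'
print(bwt_full(text))                    # gc$aaac
print(bwt_blockwise(text, ["ac", "c"]))  # gc$aaac
```

Because the blocks cover disjoint, ordered ranges of suffixes, concatenating their BWT pieces gives the same string as the full in-memory construction, while peak memory depends on the largest block rather than on the whole suffix array.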
Storing the index