
Short Read Alignment Algorithms: Raluca Gordân

Bowtie is an alignment program that uses the Burrows-Wheeler transform and Ferragina-Manzini index to align short DNA reads to large genomes very quickly using little memory. It can align over 25 million reads per CPU hour using only 1.3 gigabytes of memory on the human genome. Bowtie extends previous techniques to allow for mismatches and can use multiple processor cores for even faster alignment.


Short Read Alignment Algorithms

Raluca Gordân

Department of Biostatistics and Bioinformatics


Department of Computer Science
Department of Molecular Genetics and Microbiology
Center for Genomic and Computational Biology
Duke University

Aug 2, 2017
Sequencing applications
RNA-seq – this course
ChIP-seq: identify and measure significant peaks
[Figure: short reads stacked over a reference sequence, with read coverage forming a peak]

Genotyping: identify variations
[Figure: short reads tiled across the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…; the reads show GGAAATT where the reference shows GGCAATT]
Sequencing technologies

TECHNICAL BIASES!
Sequence alignment

Heuristic local alignment (BLAST)
• INPUT:
– Database
– Query: PSKMQRGIKWLLP
• OUTPUT:
– sequences similar to query

Global/local alignment (Needleman-Wunsch, Smith-Waterman)
• INPUT:
– Two sequences
X = x1x2…xm
Y = y1y2…yn
• OUTPUT:
– Optimal alignment between X and Y (or substrings of X and Y)
Short read alignment

• INPUT:
– A few million short reads, with certain error characteristics
(specific to the sequencing platform)
– Illumina: few errors, mostly substitutions
– A reference genome
• OUTPUT:
– Alignments of the reads to the reference genome

• Can we use BLAST?


• Assuming BLAST returns the result for a read in 1 sec
• For 10 million reads: 10 million seconds = 116 days

• Algorithms for exact string matching are more appropriate


Algorithms for exact string matching

• Search for the substring ANA in the string BANANA

• Brute Force: naive; slow & easy
• Suffix Array: binary search; PacBio Aligner (BLASR), Bowtie
• Hash (Index) Table: seed-and-extend; BLAST, BLAT

Time complexity versus space complexity


Brute force search for GATTACA
• Where is GATTACA in the human genome?

No match at offset 1

Match at offset 2

No match at offset 3...


Brute force search for GATTACA

• Simple, easy to understand


• Analysis
– Genome length = n = 3,000,000,000
– Query length = m = 7
– Comparisons: (n - m + 1) * m ≈ 21,000,000,000
• Assuming each comparison takes 1/1,000,000 of a second…
• … the total running time is 21,000 seconds = 0.24 days
• … for one 7-bp read
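The brute-force scan described above can be sketched in a few lines of Python. This is a minimal illustration, not any aligner's implementation; the function name `brute_force_search` and the 15-character toy genome are assumptions made for the example. The sketch stops comparing at the first mismatch, so the (n - m + 1) * m figure from the slide is its worst case.

```python
def brute_force_search(genome, query):
    """Return (1-based match offsets, number of character comparisons)."""
    n, m = len(genome), len(query)
    matches = []
    comparisons = 0
    for offset in range(n - m + 1):          # every possible alignment start
        for j in range(m):
            comparisons += 1
            if genome[offset + j] != query[j]:
                break                         # mismatch: give up on this offset
        else:
            matches.append(offset + 1)        # all m characters matched
    return matches, comparisons


# Toy genome (an assumption for illustration only).
matches, comparisons = brute_force_search("TGATTACAGATTACC", "GATTACA")
print(matches)       # [2]
print(comparisons)   # at most (n - m + 1) * m = 63
```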
Suffix arrays
Suffix array
• Preprocess the genome
– Sort all the suffixes of the genome

Split into suffixes Sort suffixes alphabetically

• Use binary search


Suffix arrays - search for GATTACA

Lo = 1; Hi = 15
Mid = (1+15)/2 = 8
Middle = Suffix[8] = CC
Compare GATTACA to CC => Higher
Lo = Mid + 1

Lo = 9; Hi = 15
Mid = (9+15)/2 = 12
Middle = Suffix[12] = TACC
Compare GATTACA to TACC => Lower
Hi = Mid - 1

Lo = 9; Hi = 11
Mid = (9+11)/2 = 10
Middle = Suffix[10] = GATTACC
Compare GATTACA to GATTACC => Lower
Hi = Mid - 1

Lo = 9; Hi = 9
Mid = (9+9)/2 = 9
Middle = Suffix[9] = GATTACAG…
Compare GATTACA to GATTACAG… => Match
Return: match at position 2
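The binary search traced above can be reproduced with a small sketch. The text `TGATTACAGATTACC` is an assumption: it is a 15-character string whose sorted suffixes agree with the ones quoted in the trace (Suffix[8] = CC, Suffix[12] = TACC, and so on). The helper names are hypothetical, and the construction sorts whole suffixes, which is only practical for toy inputs.

```python
def build_suffix_array(text):
    """Start positions (0-based) of all suffixes, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_array_search(text, suffix_array, query):
    """Binary search for query; return a 1-based match position or None."""
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = text[start:start + len(query)]   # compare at most m characters
        if prefix == query:
            return start + 1
        if query > prefix:
            lo = mid + 1                           # "Higher"
        else:
            hi = mid - 1                           # "Lower"
    return None


text = "TGATTACAGATTACC"          # assumed toy genome
sa = build_suffix_array(text)
print(suffix_array_search(text, sa, "GATTACA"))   # 2
```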
Suffix arrays - analysis

• Word (query) of size m = 7
• Genome of size n = 3,000,000,000
• Brute force:
– approx. m x n = 21,000,000,000 comparisons
• Suffix arrays:
– approx. m x log2(n) = 7 x 32 = 224 comparisons

• Assuming each comparison takes 1/1,000,000 of a second…
• … the total running time is 0.000224 seconds for one 7-bp read
• Compared to 0.24 days for one 7-bp read in the case of brute force search

• For 10 million reads, the suffix array search would take 2240 seconds = 37 minutes
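The estimates above can be checked with a short calculation. The variable names are illustrative, and rounding log2(n) up to 32 follows the slide.

```python
import math

n = 3_000_000_000                  # genome length
m = 7                              # query length
comparisons_per_second = 1_000_000
reads = 10_000_000

brute_force = (n - m + 1) * m                 # about 21 billion comparisons
suffix_array = m * math.ceil(math.log2(n))    # 7 * 32 comparisons

brute_force_days = brute_force / comparisons_per_second / 86_400
suffix_array_seconds = suffix_array * reads / comparisons_per_second

print(brute_force, round(brute_force_days, 2))          # per read
print(suffix_array, suffix_array_seconds / 60)          # for all reads, in minutes
```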
Suffix arrays - analysis

• Problem? Time complexity versus space complexity
• Total characters in all suffixes combined: 1 + 2 + 3 + … + n = n(n+1)/2
• For the human genome: 4.5 billion billion characters!!!
Hashing

• Where is GATTACA in the human genome?
– Build an inverted index of every k-mer in the genome

• How do we access the table?
– We can only use numbers to index
– Encode sequences as numbers
Simple: A=0, C=1, G=2, T=3 => GATTACA = 2033010
Smart: A = 00, C = 01, G = 10, T = 11 (binary) => GATTACA = 10001111000100 (binary) = 9156 (decimal)
• Lookup: very fast
• But constructing an optimal hash is tricky
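The "smart" 2-bit encoding can be sketched as follows. The function names `encode_kmer` and `decode_kmer` are hypothetical; the bit assignment is the one given on the slide.

```python
# 2-bit code from the slide: A = 00, C = 01, G = 10, T = 11
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def encode_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def decode_kmer(value, k):
    """Inverse of encode_kmer for a k-mer of known length k."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 0b11] for i in range(k))


print(encode_kmer("GATTACA"))          # 9156
print(bin(encode_kmer("GATTACA")))     # 0b10001111000100
```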
Hashing
• Number of possible sequences of length k is 4^k
• k = 7 => 4^7 = 16,384 (easy to store)
• k = 20 => 4^20 = 1,099,511,627,776 (impossible to store directly in RAM)
– There are only 3B 20-mers in the genome
– Even if we could build this table, 99.7% will be empty
– But we don't know which cells are empty until we try
• Use a hash function to shrink the possible range
– Maps a number n in [0,R] to h in [0,H]
• Use 128 buckets instead of 16,384
– Division: hash(n) = H*n/R
• hash(GATTACA) = 128 * 9156/16384 = 71
– Modulo: hash(n) = n % H
• hash(GATTACA) = 9156 % 128 = 68
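Both hash functions are one-liners; the sketch below reproduces the two numbers on the slide. The function names are illustrative, and integer (floor) division is assumed for the division method, which is what yields 71.

```python
def division_hash(n, R, H):
    """Scale n from the range [0, R) down to [0, H) using integer division."""
    return H * n // R


def modulo_hash(n, H):
    """Keep only the remainder of n after division by the number of buckets."""
    return n % H


R = 4 ** 7        # 16,384 possible 7-mers
H = 128           # number of buckets
gattaca = 9156    # 2-bit encoding of GATTACA

print(division_hash(gattaca, R, H))   # 71
print(modulo_hash(gattaca, H))        # 68
```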
Hashing

• By construction, multiple keys have the same hash value
– Store elements with the same hash value in a bucket, chained together
• A good hash evenly distributes the values: about R/H keys share each hash value
– Looking up a value scans the entire bucket
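A minimal sketch of a chained k-mer index follows, assuming the modulo hash and a toy genome. `build_index`, `lookup`, and the choice of 4 buckets are illustrative and not taken from any particular aligner.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """2-bit-per-base integer encoding of a k-mer."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


def build_index(genome, k, num_buckets):
    """Inverted index: each bucket chains (k-mer, 1-based position) pairs."""
    buckets = [[] for _ in range(num_buckets)]
    for start in range(len(genome) - k + 1):
        kmer = genome[start:start + k]
        buckets[encode_kmer(kmer) % num_buckets].append((kmer, start + 1))
    return buckets


def lookup(buckets, kmer):
    """Scan the whole bucket, keeping entries whose key really is kmer."""
    bucket = buckets[encode_kmer(kmer) % len(buckets)]
    return [pos for key, pos in bucket if key == kmer]


index = build_index("TGATTACAGATTACC", k=7, num_buckets=4)  # toy genome (assumed)
print(lookup(index, "GATTACA"))   # [2]
```

Because different k-mers can land in the same bucket, the lookup has to compare keys rather than trust the hash value alone.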
Algorithms for exact string matching

• Search for the substring GATTACA in the genome

• Brute Force: easy; slow
• Suffix Array: fast (binary search); high space complexity
• Hash (Index) Table: fast; tricky to develop hash function
Bowtie is an ultrafast, memory-efficient alignment
program for aligning short DNA sequence reads to
large genomes. For the human genome, Burrows-
Wheeler indexing allows Bowtie to align more than
25 million reads per CPU hour with a memory
footprint of approximately 1.3 gigabytes. Bowtie
extends previous Burrows-Wheeler techniques with a
novel quality-aware backtracking algorithm that
permits mismatches. Multiple processor cores can
be used simultaneously to achieve even greater
alignment speeds. Bowtie is open source
(http://bowtie.cbcb.umd.edu).

• Bowtie indexes the genome using a scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index
Burrows-Wheeler transform

• The BWT is a reversible permutation of the characters in a text
• BWT-based indexing allows large texts to be searched efficiently in a small memory footprint

[Figure: T -> Burrows-Wheeler matrix of T (all cyclic rotations of T$, sorted lexicographically) -> Burrows-Wheeler transform of T, BWT(T) (same size as T)]
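The construction in the figure can be sketched directly: build every cyclic rotation of T$, sort them, and read off the last column. This naive version is for small examples only (it materializes the whole matrix), and the function name `bwt` is an assumption. It relies on "$" sorting before letters, which holds for Python's default string ordering.

```python
def bwt(text):
    """Burrows-Wheeler transform of text via sorted cyclic rotations of text + '$'."""
    if "$" in text:
        raise ValueError("text must not already contain the terminator '$'")
    t = text + "$"
    matrix = sorted(t[i:] + t[:i] for i in range(len(t)))   # Burrows-Wheeler matrix
    return "".join(row[-1] for row in matrix)               # last column


print(bwt("acaacg"))   # gc$aaac
```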
Last first (LF) mapping

• The BW matrix has a property called last first (LF) mapping:
The i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column
• This property is at the core of algorithms that use the BWT index to search the text
• The LF property implicitly encodes the suffix array

[Figure: a character with rank 2 in the last column linked to the character with rank 2 in the first column]
Last first (LF) mapping

• We can repeatedly apply LF mapping to reconstruct T from BWT(T)
• UNPERMUTE algorithm (Burrows and Wheeler, 1994)

[Figure: UNPERMUTE, steps 1 to 6]
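A minimal sketch of the idea behind UNPERMUTE, assuming "$" is the terminator and sorts before every other character. The helper names are hypothetical. Starting from the row that begins with "$", each LF step moves one character to the left in T, so the text is recovered from right to left.

```python
def lf_mapping(bwt_text):
    """For each row i, the row whose first character is the same text character as bwt_text[i]."""
    seen = {}
    ranks = []
    for ch in bwt_text:                 # rank of each character within the last column
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    first = {}
    total = 0
    for ch in sorted(seen):             # row where each character starts in the first column
        first[ch] = total
        total += seen[ch]
    return [first[ch] + rank for ch, rank in zip(bwt_text, ranks)]


def unpermute(bwt_text):
    """Reconstruct T from BWT(T) by repeatedly applying the LF mapping."""
    lf = lf_mapping(bwt_text)
    row = 0                             # row 0 is the rotation starting with '$'
    recovered = []
    while bwt_text[row] != "$":
        recovered.append(bwt_text[row])  # characters come out last to first
        row = lf[row]
    return "".join(reversed(recovered))


print(unpermute("gc$aaac"))   # acaacg
```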
LF mapping and exact matching
• The EXACTMATCH algorithm (Ferragina and Manzini, 2000) calculates the range of matrix rows beginning with successively longer suffixes of the query
• Reference: acaacg. Query: aac

[Figure: EXACTMATCH on the BW matrix of acaacg for the query aac]

• The matrix is sorted lexicographically, so rows beginning with a given sequence appear consecutively
• At each step, the size of the range either shrinks or remains the same
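The range calculation can be sketched as a backward search over BWT(T). This is a simplified illustration: the occurrence counts are recomputed by scanning the BWT string, whereas a real FM index precomputes them at checkpoints. The function name `exact_match` and the half-open row range it returns are choices made for this sketch.

```python
def exact_match(bwt_text, query):
    """Return the half-open range (top, bot) of BW-matrix rows that begin with query."""
    # first[c]: number of characters in the text that sort before c
    first = {}
    total = 0
    for ch in sorted(set(bwt_text)):
        first[ch] = total
        total += bwt_text.count(ch)

    def occ(ch, row):
        """Occurrences of ch in the last column above the given row."""
        return bwt_text[:row].count(ch)

    top, bot = 0, len(bwt_text)          # empty suffix: every row matches
    for ch in reversed(query):           # extend the matched suffix one character leftwards
        if ch not in first:
            return (0, 0)
        top = first[ch] + occ(ch, top)
        bot = first[ch] + occ(ch, bot)
        if top >= bot:
            return (0, 0)                # range became empty: no occurrence
    return (top, bot)


# BWT of the reference acaacg is gc$aaac
print(exact_match("gc$aaac", "aac"))   # (1, 2): one matching row
print(exact_match("gc$aaac", "ac"))    # (2, 4): two matching rows
```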
What about mismatches?
Mismatches?

• EXACTMATCH is insufficient for short read alignment because alignments may contain mismatches
• What are the main causes for mismatches?
– sequencing errors
– differences between reference and query organisms
Bowtie – mismatches and backtracking search

• EXACTMATCH is insufficient for short read alignment because alignments may contain mismatches
• Bowtie conducts a backtracking search to quickly find alignments that satisfy a specified alignment policy
• Each character in a read has a numeric quality value, with lower values indicating a higher likelihood of a sequencing error
• Example: Illumina uses Phred quality scoring
– The Phred score of a base is Q_phred = -10 * log10(e), where e is the estimated probability of the base being wrong
• Bowtie alignment policy allows a limited number of mismatches and prefers alignments where the sum of the quality values at all mismatched positions is low
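The Phred formula and the "low quality sum" preference can be illustrated with a short sketch. The function names and the example quality values are assumptions; the comparison at the end only shows the preference, not Bowtie's actual search order.

```python
import math


def phred_quality(error_probability):
    """Q = -10 * log10(e)."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Inverse of the Phred formula: e = 10^(-Q / 10)."""
    return 10 ** (-quality / 10)


def mismatch_quality_sum(qualities, mismatch_positions):
    """Sum of the quality values at the mismatched read positions."""
    return sum(qualities[i] for i in mismatch_positions)


qualities = [40, 40, 30, 10]      # hypothetical per-base qualities of a 4-base read
candidate_a = [2]                 # alignment with a mismatch at a quality-30 base
candidate_b = [3]                 # alignment with a mismatch at a quality-10 base
preferred = min([candidate_a, candidate_b],
                key=lambda positions: mismatch_quality_sum(qualities, positions))
print(preferred)                  # [3]: the mismatch falls on the less reliable base
```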
Bowtie - backtracking search

• The search is similar to EXACTMATCH
• It calculates matrix ranges for successively longer query suffixes
• If the range becomes empty (a suffix does not occur in the text), then the algorithm may select an already-matched query position and substitute a different base there, introducing a mismatch into the alignment
• The EXACTMATCH search resumes from just after the substituted position
• The algorithm selects only those substitutions that are consistent with the alignment policy and which yield a modified suffix that occurs at least once in the text
• If there are multiple candidate substitution positions, then the algorithm greedily selects a position with a minimal quality value
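The sketch below illustrates backtracking over BWT ranges with a mismatch budget. It is deliberately simpler than the procedure described above: it enumerates every alignment with at most `max_mismatches` substitutions instead of greedily choosing the lowest-quality position, and it ignores quality values entirely. All names, and the naive index construction, are assumptions for illustration.

```python
def build_index(text):
    """Return (BWT of text, first-column start row per character), built naively."""
    t = text + "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    bwt_text = "".join(row[-1] for row in rotations)
    first, total = {}, 0
    for ch in sorted(set(t)):
        first[ch] = total
        total += t.count(ch)
    return bwt_text, first


def step(bwt_text, first, ch, top, bot):
    """One backward-search step: rows beginning with ch + (currently matched suffix)."""
    return (first[ch] + bwt_text[:top].count(ch),
            first[ch] + bwt_text[:bot].count(ch))


def search(bwt_text, first, query, max_mismatches, alphabet="acgt"):
    """All (top, bot, substitutions) with at most max_mismatches substitutions."""
    results = []

    def extend(i, top, bot, subs):
        if i < 0:                                   # whole query consumed
            results.append((top, bot, subs))
            return
        # Try the query's own base first, then substitutions (backtracking).
        for ch in [query[i]] + [c for c in alphabet if c != query[i]]:
            if ch not in first:
                continue
            is_sub = ch != query[i]
            if is_sub and len(subs) >= max_mismatches:
                continue                            # mismatch budget exhausted
            new_top, new_bot = step(bwt_text, first, ch, top, bot)
            if new_top < new_bot:                   # modified suffix occurs in the text
                extend(i - 1, new_top, new_bot,
                       subs + ((i, ch),) if is_sub else subs)

    extend(len(query) - 1, 0, len(bwt_text), ())
    return results


bwt_text, first = build_index("acaacg")
print(search(bwt_text, first, "aac", 0))   # [(1, 2, ())]
print(search(bwt_text, first, "aag", 1))   # two one-mismatch alignments
```

Each result carries the row range plus the (query index, substituted base) pairs, so the caller can see which read positions were changed.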
[Figure: exact search for ggta versus inexact search for ggta]

• Search is greedy: the first valid alignment encountered by Bowtie will not necessarily be the 'best' in terms of number of mismatches or in terms of quality
Bowtie - backtracking search
• This standard aligner can, in some cases, encounter sequences that cause excessive backtracking
• Bowtie mitigates excessive backtracking with the novel technique of double indexing
– Idea: create 2 indices of the genome: one containing the BWT of the genome, called the forward index, and a second containing the BWT of the genome with its sequence reversed (not reverse complemented), called the mirror index
• Let's consider a matching policy that allows one mismatch in the alignment (either in the first half or in the second half)
• Bowtie proceeds in two phases:
1. load the forward index into memory and invoke the aligner with the constraint that it may not substitute at positions in the query's right half
2. load the mirror index into memory and invoke the aligner on the reversed query, with the constraint that the aligner may not substitute at positions in the reversed query's right half (the original query's left half)
• The constraints on backtracking into the right half prevent excessive backtracking, whereas the use of two phases and two indices maintains full sensitivity
Bowtie - backtracking search
• Base quality varies across the read
• Bowtie allows the user to select
– the number of mismatches permitted in the high-quality end of a read (default: 2 mismatches in the first 28 bases)
– the maximum acceptable sum of quality values at mismatched positions over the alignment (default: 70, on the Phred scale)
• The first 28 bases on the high-quality end of the read are termed the seed
• The seed consists of two halves:
– the 14 bp on the high-quality end (usually the 5' end) = the hi-half
– the 14 bp on the low-quality end = the lo-half
• Assuming 2 mismatches permitted in the seed, a reportable alignment will fall into one of four cases:
1. no mismatches in seed
2. no mismatches in hi-half, one or two mismatches in lo-half
3. no mismatches in lo-half, one or two mismatches in hi-half
4. one mismatch in hi-half, one mismatch in lo-half
Bowtie

• The Bowtie algorithm consists of three phases that alternate between using the forward and mirror indices

[Figure: the three phases of the Bowtie algorithm; labels: "partial alignments with mismatches only in the hi-half" and "extend the partial alignments into full alignments"]

1. no mismatches in seed
2. no mismatches in hi-half, one or two mismatches in lo-half
3. no mismatches in lo-half, one or two mismatches in hi-half
4. one mismatch in hi-half, one mismatch in lo-half
Aligning 2 million reads to the human genome

Maq = Mapping and Assembly with Qualities; SOAP = Short Oligonucleotide Analysis Package
Constructing the index
• How do we construct a BWT index?
• Calculating the BWT is closely related to building a suffix array
• Each element of the BWT can be derived from the corresponding element of the suffix array:
BWT[i] = T[SA[i] - 1] if SA[i] > 0, and BWT[i] = $ if SA[i] = 0
• One could generate all suffixes, sort them to obtain the SA, then calculate the BWT in a single pass over the suffix array
• However, constructing the entire suffix array in memory requires at least ~12 gigabytes for the human genome
• Instead, Bowtie uses a block-wise strategy: it builds the suffix array and the BWT block by block, discarding suffix array blocks once the corresponding BWT block has been built
• Bowtie can build the full index for the human genome in about 24 hours in less than 1.5 gigabytes of RAM
• If 16 gigabytes of RAM or more is available, Bowtie can exploit the additional RAM to produce the same index in about 4.5 hours
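The single-pass derivation of the BWT from a suffix array can be sketched as below, using the standard relation that each BWT character is the one preceding the corresponding suffix in the text. This toy version builds the whole suffix array at once; it does not attempt the block-wise strategy, and the function names are hypothetical.

```python
def suffix_array(text):
    """Suffix start positions (0-based) in sorted order; text must end with '$'."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """BWT[i] is the character just before suffix SA[i]; '$' when SA[i] is 0."""
    return "".join(text[start - 1] if start > 0 else "$" for start in sa)


text = "acaacg$"
sa = suffix_array(text)
print(sa)                                # [6, 2, 0, 3, 1, 4, 5]
print(bwt_from_suffix_array(text, sa))   # gc$aaac
```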
Storing the index

• The largest single component of the Bowtie index is the BWT sequence. Bowtie stores the BWT in a 2-bit-per-base format
• A Bowtie index for the assembled human genome sequence is about 1.3 gigabytes
• A full Bowtie index actually consists of a pair of equal-size indexes, the forward and mirror indexes, for any given genome, but it can be run such that only one of the two indexes is ever resident in memory at once (using the -z option)

• What about gaps?


Bowtie 2
• Bowtie: very efficient ungapped alignment of short reads based on BWT index
• Index-based alignment algorithms can be quite inefficient when gaps are allowed
• Gaps can result from
– sequencing errors
– true insertions and deletions
• Bowtie 2 extends the index-based approach of Bowtie to permit gapped alignment
• It divides the algorithm into two stages
1. an initial, ungapped seed-finding stage that benefits from the speed and memory efficiency of the full-text index
2. a gapped extension stage that uses dynamic programming and benefits from the efficiency of single-instruction multiple-data (SIMD) parallel processing available on modern processors
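To make the dynamic-programming stage concrete, here is a plain Smith-Waterman style local alignment score with a linear gap penalty. It is a generic textbook recurrence, not Bowtie 2's SIMD implementation, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions chosen for the example.

```python
def local_alignment_score(read, ref, match=2, mismatch=-3, gap=-5):
    """Best local alignment score between read and ref (gaps allowed)."""
    prev = [0] * (len(ref) + 1)          # DP row for the previous read position
    best = 0
    for i in range(1, len(read) + 1):
        cur = [0]
        for j in range(1, len(ref) + 1):
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            up = prev[j] + gap           # gap in the reference
            left = cur[j - 1] + gap      # gap in the read
            cur.append(max(0, diag, up, left))
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("GATTACA", "TGATTACAGATTACC"))   # 14: seven matches
```

Only two rows of the table are kept, since each cell depends on its left, upper, and upper-left neighbours; recovering the alignment itself would require storing the full table or traceback pointers.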
BWA
TopHat
Exercises:

(a) Perform, by hand, the Burrows-Wheeler Transform on the sequence GCACGCTGC$.

(b) Suppose you are given a sequence TGTTGCCC$A, which has been permuted by way of the Burrows-Wheeler Transform. Perform, by hand, the reverse of the BWT in order to obtain the original sequence from which it was generated. Hint: If you wish to check your result, you may employ the BWT in the forward direction (i.e., the algorithm you employed in part (a)) on the answer you obtained here. Assuming you have performed the reversal correctly, you should find that your "original sequence" generates the permuted sequence given in this problem: TGTTGCCC$A.

(c) Consider the sequence you had to transform in part (a): GCACGCTGC$. Using the exact matching protocol (EXACTMATCH), determine whether the following subsequences are contained within the sequence from part (a): CACG, AATG, CTGC.
