Short Read Alignment Algorithms: Raluca Gordân

Bowtie is an alignment program that uses the Burrows-Wheeler transform and Ferragina-Manzini index to align short DNA reads to large genomes very quickly using little memory. It can align over 25 million reads per CPU hour using only 1.3 gigabytes of memory on the human genome. Bowtie extends previous techniques to allow for mismatches and can use multiple processor cores for even faster alignment.


Short Read Alignment Algorithms

Raluca Gordân

Department of Biostatistics and Bioinformatics

Department of Computer Science
Department of Molecular Genetics and Microbiology
Center for Genomic and Computational Biology
Duke University

Aug 2, 2017
Sequencing applications
RNA-seq – this course
ChIP-seq: identify and measure significant peaks
[Figure: short reads stacked over a genomic region to form a coverage peak]

Genotyping: identify variations
[Figure: short reads tiled across the sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
Sequencing technologies

TECHNICAL BIASES!
Sequence alignment

Heuristic local alignment (BLAST)
• INPUT:
– Database
– Query: PSKMQRGIKWLLP
• OUTPUT:
– Sequences similar to query

Global/local alignment (Needleman-Wunsch, Smith-Waterman)
• INPUT:
– Two sequences
X = x1x2…xm
Y = y1y2…yn
• OUTPUT:
– Optimal alignment between X and Y (or substrings of X and Y)
Short read alignment

• INPUT:
– A few million short reads, with certain error characteristics (specific to the sequencing platform)
– Illumina: few errors, mostly substitutions
– A reference genome
• OUTPUT:
– Alignments of the reads to the reference genome

• Can we use BLAST?

• Assuming BLAST returns the result for a read in 1 sec
• For 10 million reads: 10 million seconds = 116 days

• Algorithms for exact string matching are more appropriate

Algorithms for exact string matching

• Search for the substring ANA in the string BANANA

Brute Force: naive; slow & easy
Suffix Array: binary search; PacBio Aligner (BLASR); Bowtie
Hash (Index) Table: seed-and-extend; BLAST, BLAT

Time complexity versus space complexity

Brute force search for GATTACA
• Where is GATTACA in the human genome?

No match at offset 1

Match at offset 2

No match at offset 3...

Brute force search for GATTACA

• Simple, easy to understand

• Analysis
– Genome length = n = 3,000,000,000
– Query length = m = 7
– Comparisons: (n - m + 1) * m ≈ 21,000,000,000
• Assuming each comparison takes 1/1,000,000 of a second…
• … the total running time is 21,000 seconds = 0.24 days
• … for one 7-bp read
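
The offset-by-offset scan described above can be sketched in Python. This is only a minimal illustration: the function name `brute_force_search` and the 15-character toy genome are assumptions made for the example (the toy string is chosen so that GATTACA starts at offset 2, as on the slide). Offsets are reported 1-based.

```python
def brute_force_search(genome, query):
    """Return the 1-based offsets at which query occurs in genome."""
    n, m = len(genome), len(query)
    hits = []
    # Try each of the n - m + 1 possible offsets in turn.
    for offset in range(n - m + 1):
        # Each offset costs up to m character comparisons.
        if genome[offset:offset + m] == query:
            hits.append(offset + 1)  # convert to 1-based offsets
    return hits


# Hypothetical toy genome; GATTACA starts at offset 2.
toy_genome = "TGATTACAGATTACC"
print(brute_force_search(toy_genome, "GATTACA"))  # [2]
```

The loop makes the (n - m + 1) * m cost visible: one outer iteration per offset, and up to m comparisons inside each.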
Suffix arrays
Suffix array
• Preprocess the genome
– Sort all the suffixes of the genome

[Figure: the genome is split into suffixes, and the suffixes are then sorted alphabetically]

• Use binary search

Suffix arrays - search for GATTACA

Lo = 1; Hi = 15
Mid = (1+15)/2 = 8
Middle = Suffix[8] = CC
Compare GATTACA to CC => Higher
Lo = Mid + 1

Suffix arrays - search for GATTACA

Lo = 9; Hi = 15
Mid = (9+15)/2 = 12
Middle = Suffix[12] = TACC
Compare GATTACA to TACC => Lower
Hi = Mid - 1

Suffix arrays - search for GATTACA

Lo = 9; Hi = 11
Mid = (9+11)/2 = 10
Middle = Suffix[10] = GATTACC
Compare GATTACA to GATTACC => Lower
Hi = Mid - 1

Suffix arrays - search for GATTACA

Lo = 9; Hi = 9
Mid = (9+9)/2 = 9
Middle = Suffix[9] = GATTACAG…
Compare GATTACA to GATTACAG… => Match
Return: match at position 2
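
The binary search traced above can be reproduced with a short Python sketch. The helper names and the toy text `TGATTACAGATTACC` are assumptions for illustration; the toy text was picked because its sorted suffixes agree with the values shown in the trace (Suffix[8] = CC, Suffix[12] = TACC, and so on). Sorting whole suffix strings is fine for a toy example but would not scale to a genome.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of all suffixes, sorted alphabetically."""
    return sorted(range(1, len(text) + 1), key=lambda pos: text[pos - 1:])


def suffix_array_search(text, sa, query):
    """Binary search over the suffix array; return a 1-based match position or None."""
    lo, hi = 1, len(sa)
    while lo <= hi:
        mid = (lo + hi) // 2
        # Compare the query against the first len(query) characters of the middle suffix.
        prefix = text[sa[mid - 1] - 1:][:len(query)]
        if prefix == query:
            return sa[mid - 1]
        if query > prefix:
            lo = mid + 1   # query sorts higher: search the upper half
        else:
            hi = mid - 1   # query sorts lower: search the lower half
    return None


toy_text = "TGATTACAGATTACC"  # hypothetical toy genome
sa = build_suffix_array(toy_text)
print(suffix_array_search(toy_text, sa, "GATTACA"))  # 2
```

Each loop iteration halves the candidate range, so the search inspects about log2(n) suffixes, each compared over at most m characters.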
Suffix arrays - analysis

• Word (query) of size m = 7

• Genome of size n = 3,000,000,000
• Brute force:
– approx. m x n = 21,000,000,000 comparisons
• Suffix arrays:
– approx. m x log2(n) = 7 x 32 = 224 comparisons

• Assuming each comparison takes 1/1,000,000 of a second…

• … the total running time is 0.000224 seconds for one 7-bp read
• Compared to 0.24 days for one 7-bp read in the case of brute force search

• For 10 million reads, the suffix array search would take

2240 seconds = 37 minutes
Suffix arrays - analysis

• Word (query) of size m = 7

• Genome of size n = 3,000,000,000

• For 10 million reads, the suffix array search would take

2240 seconds = 37 minutes

• Problem? Time complexity versus space complexity

• Total characters in all suffixes combined:

1+2+3+…+n = n(n+1)/2

• For the human genome:

4.5 billion billion characters!!!
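
The time and space figures quoted in the analysis can be checked with a few lines of Python. The variable names are illustrative, and the one-microsecond cost per comparison is the assumption stated on the slides.

```python
import math

n = 3_000_000_000              # genome length
m = 7                          # query length
reads = 10_000_000             # number of reads
seconds_per_comparison = 1e-6  # assumption stated on the slides

# Time: brute force versus binary search over a suffix array.
brute_force_comparisons = (n - m + 1) * m
suffix_array_comparisons = m * math.ceil(math.log2(n))
suffix_array_seconds = suffix_array_comparisons * seconds_per_comparison * reads

# Space: total characters if every suffix is written out in full.
suffix_chars = n * (n + 1) // 2

print(brute_force_comparisons)    # about 21 billion
print(suffix_array_comparisons)   # 224
print(suffix_array_seconds / 60)  # about 37 minutes
print(suffix_chars)               # about 4.5e18 characters
```

The space figure is the cost of storing every suffix explicitly; it is what makes the naive version of this index impractical despite the fast lookup.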
Algorithms for exact string matching

• Search for the substring ANA in the string BANANA

Brute Force: naive; slow & easy
Suffix Array: binary search; PacBio Aligner (BLASR); Bowtie
Hash (Index) Table: seed-and-extend; BLAST, BLAT

Time complexity versus space complexity

Hashing

• Where is GATTACA in the human genome?

– Build an inverted index of every k-mer in the genome

• How do we access the table?

– We can only use numbers to index
– Encode sequences as numbers
Simple: A=0, C=1, G=2, T=3 => GATTACA = 2,033,010
Smart: A = 00, C = 01, G = 10, T = 11 (binary)
=> GATTACA = 10001111000100 (binary) = 9156 (decimal)
• Lookup: very fast
• But constructing an optimal hash is tricky
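
The 2-bit encoding can be illustrated with a small Python sketch. The names `CODE`, `encode_kmer` and `decode_kmer` are made up for this example; only the A/C/G/T bit assignments come from the slide.

```python
# 2-bit code per base, as on the slide: A = 00, C = 01, G = 10, T = 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def decode_kmer(value, k):
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    bases = "ACGT"
    out = []
    for _ in range(k):
        out.append(bases[value & 0b11])  # read the lowest two bits
        value >>= 2
    return "".join(reversed(out))


print(encode_kmer("GATTACA"))                 # 9156
print(format(encode_kmer("GATTACA"), "014b")) # 10001111000100
print(decode_kmer(9156, 7))                   # GATTACA
```

Because every base takes exactly two bits, a k-mer maps to a unique integer in the range 0 to 4^k - 1, which can be used directly as a table index.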
Hashing
• Number of possible sequences of length k is 4^k
• k = 7 => 4^7 = 16,384 (easy to store)
• k = 20 => 4^20 = 1,099,511,627,776 (impossible to store directly in RAM)
– There are only 3B 20-mers in the genome
– Even if we could build this table, 99.7% will be empty
– But we don't know which cells are empty until we try
• Use a hash function to shrink the possible range
– Maps a number n in [0,R] to h in [0,H]
• Use 128 buckets instead of 16,384
– Division: hash(n) = H*n/R
• hash(GATTACA) = 128 * 9156/16384 = 71
– Modulo: hash(n) = n % H
• hash(GATTACA) = 9156 % 128 = 68
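
The two hash functions can be written out directly. The function names below are illustrative; integer (floor) division is assumed for the division method, which is what gives 71 in the worked example.

```python
def division_hash(n, R, H):
    """Scale n from a range of size R down to H buckets (floor division)."""
    return H * n // R


def modulo_hash(n, H):
    """Keep only the remainder of n after division by the number of buckets."""
    return n % H


R = 4 ** 7       # 16,384 possible 7-mers
H = 128          # number of buckets
gattaca = 9156   # 2-bit encoding of GATTACA

print(division_hash(gattaca, R, H))  # 71
print(modulo_hash(gattaca, H))       # 68
```

With the 2-bit encoding, the division hash keeps the leading bases of the k-mer and the modulo hash keeps the trailing ones, so the two methods group k-mers into buckets quite differently.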
Hashing

• By construction, multiple keys have the same hash value

– Store elements with the same hash value in a bucket, chained together
• A good hash distributes the keys evenly: about R/H keys share each hash value
– Looking up a value scans the entire bucket
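
A chained hash table over k-mers might look like the following sketch. All names here are hypothetical, the modulo hash over the 2-bit encoding is just one possible choice, and the toy genome is an assumption for illustration. Each bucket stores (k-mer, position) pairs so that a lookup can skip colliding keys while it scans the bucket.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def bucket_of(kmer, num_buckets):
    """Hash a k-mer: 2-bit encode it, then take the value modulo the bucket count."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value % num_buckets


def build_kmer_index(genome, k, num_buckets):
    """Place every k-mer of the genome, with its 1-based position, into a bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for pos in range(len(genome) - k + 1):
        kmer = genome[pos:pos + k]
        buckets[bucket_of(kmer, num_buckets)].append((kmer, pos + 1))
    return buckets


def lookup(buckets, kmer):
    """Scan the whole bucket for the k-mer and return its positions."""
    bucket = buckets[bucket_of(kmer, len(buckets))]
    return [pos for key, pos in bucket if key == kmer]


toy_genome = "TGATTACAGATTACC"  # hypothetical toy genome
index = build_kmer_index(toy_genome, k=7, num_buckets=4)
print(lookup(index, "GATTACA"))  # [2]
```

Using only 4 buckets for 9 k-mers forces collisions, which is why the stored key has to be compared during the scan.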
Algorithms for exact string matching

• Search for the substring GATTACA in the genome

Brute Force: easy; slow
Suffix Array: fast (binary search); high space complexity
Hash (Index) Table: fast; tricky to develop hash function
Bowtie is an ultrafast, memory-efficient alignment
program for aligning short DNA sequence reads to
large genomes. For the human genome, Burrows-
Wheeler indexing allows Bowtie to align more than
25 million reads per CPU hour with a memory
footprint of approximately 1.3 gigabytes. Bowtie
extends previous Burrows-Wheeler techniques with a
novel quality-aware backtracking algorithm that
permits mismatches. Multiple processor cores can
be used simultaneously to achieve even greater
alignment speeds. Bowtie is open source: http://bowtie.cbcb.umd.edu.

• Bowtie indexes the genome using a scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index
Burrows-Wheeler transform

• The BWT is a reversible permutation of the characters in a text

• BWT-based indexing allows large texts to be searched efficiently in a small memory footprint

[Figure: T -> Burrows-Wheeler matrix of T (all cyclic rotations of T$, sorted lexicographically) -> Burrows-Wheeler transform of T, BWT(T) (same size as T)]
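
The construction in the figure (sort all cyclic rotations of T$, then read off the last column) can be sketched in a few lines of Python. The function name `bwt` is illustrative, and the sketch assumes the terminator `$` sorts before every other character, which holds for ASCII letters. Building all rotations explicitly is only practical for toy inputs.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via the sorted matrix of cyclic rotations."""
    t = text + terminator
    # All cyclic rotations of T$, sorted lexicographically (the BW matrix).
    matrix = sorted(t[i:] + t[:i] for i in range(len(t)))
    # BWT(T) is the last column of the matrix.
    return "".join(row[-1] for row in matrix)


print(bwt("acaacg"))  # gc$aaac
```

The output has the same length as T$ and contains exactly the same characters, only permuted.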
Last first (LF) mapping

• The BW matrix has a property called last first (LF) mapping:

The ith occurrence of character X in the last column corresponds to the same text character as the ith occurrence of X in the first column
• This property is at the core of algorithms that use the BWT index to search the text

[Figure: BW matrix with one character highlighted in the last and first columns; Rank: 2]

LF property implicitly encodes the Suffix Array
Last first (LF) mapping

We can repeatedly apply LF mapping to reconstruct T from BWT(T): the UNPERMUTE algorithm (Burrows and Wheeler, 1994)

[Figure: worked UNPERMUTE example, numbered 1 to 6]
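
A minimal Python sketch of reconstructing T by repeated LF mapping follows. The function names are illustrative, and the sketch assumes the terminator `$` sorts before every other character, so that row 0 of the matrix is the rotation starting with `$`.

```python
def lf_mapping(bwt):
    """For each row, return the row whose first character is that row's last character."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    # first[ch]: index of the first matrix row that starts with ch.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # The i-th occurrence of ch in the last column maps to the
    # i-th occurrence of ch in the first column.
    seen, lf = {}, []
    for ch in bwt:
        rank = seen.get(ch, 0)
        lf.append(first[ch] + rank)
        seen[ch] = rank + 1
    return lf


def unpermute(bwt, terminator="$"):
    """Recover T from BWT(T) by walking LF from the row that starts with the terminator."""
    lf = lf_mapping(bwt)
    row, chars = 0, []
    while bwt[row] != terminator:
        chars.append(bwt[row])  # characters come out from the end of T backwards
        row = lf[row]
    return "".join(reversed(chars))


print(unpermute("gc$aaac"))  # acaacg
```

Each step prepends one character of T, so the walk takes exactly len(T) applications of the mapping.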
LF mapping and exact matching
EXACTMATCH algorithm (Ferragina and Manzini, 2000): calculates the range of matrix rows beginning with successively longer suffixes of the query
Reference: acaacg. Query: aac

[Figure: matrix row ranges at each step of EXACTMATCH]

• The matrix is sorted lexicographically, so rows beginning with a given sequence appear consecutively
• At each step, the size of the range either shrinks or remains the same
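
The range-narrowing idea can be sketched as a backward search in Python. This is an illustrative toy, not Bowtie's implementation: the helper names are made up, the index is built from explicitly sorted rotations, and the occurrence table is stored uncompressed. Ranges are half-open row intervals, 0-based.

```python
def build_fm_index(text):
    """Build a toy FM index: first-column offsets and cumulative occurrence counts."""
    t = text + "$"
    matrix = sorted(t[i:] + t[:i] for i in range(len(t)))
    bwt = "".join(row[-1] for row in matrix)
    # first[c]: index of the first row beginning with c.
    first = {}
    for i, row in enumerate(matrix):
        first.setdefault(row[0], i)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] for c in first}
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ, len(bwt)


def exact_match(query, first, occ, n_rows):
    """Return the half-open row range (top, bot) of rows starting with query, or None."""
    top, bot = 0, n_rows
    # Process the query right to left: successively longer suffixes.
    for ch in reversed(query):
        if ch not in first:
            return None
        top = first[ch] + occ[ch][top]
        bot = first[ch] + occ[ch][bot]
        if top >= bot:
            return None  # the current suffix does not occur in the text
    return top, bot


first, occ, n_rows = build_fm_index("acaacg")
print(exact_match("aac", first, occ, n_rows))  # (1, 2)
```

For the query aac the range goes from rows starting with c, to rows starting with ac, to the single row starting with aac, so it never grows from one step to the next.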
What about mismatches?

Mismatches?

• EXACTMATCH is insufficient for short read alignment because alignments may contain mismatches
• What are the main causes for mismatches?
– sequencing errors
– differences between reference and query organisms
Bowtie – mismatches and backtracking search

• EXACTMATCH is insufficient for short read alignment because alignments may contain mismatches
• Bowtie conducts a backtracking search to quickly find alignments that satisfy a specified alignment policy
• Each character in a read has a numeric quality value, with lower values indicating a higher likelihood of a sequencing error
• Example: Illumina uses Phred quality scoring
Phred score of a base is: Q_phred = -10 * log10(e), where e is the estimated probability of a base being wrong
• Bowtie alignment policy allows a limited number of mismatches and prefers alignments where the sum of the quality values at all mismatched positions is low
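
The Phred formula and the "sum of qualities at mismatched positions" criterion can be made concrete with a short Python sketch. The function names and the example read, reference and quality values are invented for illustration.

```python
import math


def phred_from_error(e):
    """Phred quality for an estimated error probability e."""
    return -10 * math.log10(e)


def error_from_phred(q):
    """Inverse of the Phred formula: error probability for quality q."""
    return 10 ** (-q / 10)


def mismatch_quality_sum(read, reference, qualities):
    """Sum the quality values at positions where read and reference differ."""
    return sum(q for r, g, q in zip(read, reference, qualities) if r != g)


print(round(phred_from_error(0.001)))  # 30
print(error_from_phred(20))            # 0.01
# One mismatch at the third base, which has a low quality of 10.
print(mismatch_quality_sum("ggta", "ggca", [30, 30, 10, 30]))  # 10
```

A mismatch at a low-quality base adds little to the sum, which is why such alignments are preferred: the disagreement is more plausibly a sequencing error.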
Bowtie - backtracking search

• The search is similar to EXACTMATCH

• It calculates matrix ranges for successively longer query suffixes
• If the range becomes empty (a suffix does not occur in the text), then the algorithm may select an already-matched query position and substitute a different base there, introducing a mismatch into the alignment
• The EXACTMATCH search resumes from just after the substituted position
• The algorithm selects only those substitutions that are consistent with the alignment policy and which yield a modified suffix that occurs at least once in the text
• If there are multiple candidate substitution positions, then the algorithm greedily selects a position with a minimal quality value
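
The following Python sketch shows the general shape of a backtracking search over BWT ranges. It is a deliberately simplified variant, not Bowtie's algorithm: it is a plain depth-first search that tries the read's own base first and then substitutions, it ignores quality values entirely, and its only policy is a cap on the number of mismatches. All function names are hypothetical.

```python
def build_fm_index(text):
    """Toy FM index built from explicitly sorted rotations of text + '$'."""
    t = text + "$"
    matrix = sorted(t[i:] + t[:i] for i in range(len(t)))
    bwt = "".join(row[-1] for row in matrix)
    first = {}
    for i, row in enumerate(matrix):
        first.setdefault(row[0], i)
    occ = {c: [0] for c in first}
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ, len(bwt)


def backtrack_search(query, first, occ, n_rows, max_mismatches=1):
    """Return (matched_sequence, (top, bot)) for the first alignment found, or None."""
    alphabet = [c for c in sorted(first) if c != "$"]

    def extend(i, top, bot, chosen, mm_left):
        if i < 0:  # the whole query has been consumed
            return "".join(reversed(chosen)), (top, bot)
        # Try the read's own base first, then each possible substitution.
        candidates = [query[i]] + [c for c in alphabet if c != query[i]]
        for ch in candidates:
            cost = 0 if ch == query[i] else 1
            if cost > mm_left or ch not in first:
                continue
            new_top = first[ch] + occ[ch][top]
            new_bot = first[ch] + occ[ch][bot]
            if new_top < new_bot:  # the modified suffix occurs in the text
                hit = extend(i - 1, new_top, new_bot, chosen + [ch], mm_left - cost)
                if hit is not None:
                    return hit
        return None  # dead end: the caller backtracks

    return extend(len(query) - 1, 0, n_rows, [], max_mismatches)


first, occ, n_rows = build_fm_index("acaacg")
print(backtrack_search("agg", first, occ, n_rows, max_mismatches=1))
# ('acg', (3, 4)): the middle g was replaced by c
```

Like the search described above, this sketch returns the first valid alignment it reaches rather than the best one; choosing substitution positions by quality value would change the order in which candidates are explored.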
[Figure: exact search for ggta compared with inexact (backtracking) search for ggta]

Search is greedy: the first valid alignment encountered by Bowtie will not necessarily be the 'best' in terms of number of mismatches or in terms of quality
Bowtie - backtracking search
• This standard aligner can, in some cases, encounter sequences that cause excessive backtracking
• Bowtie mitigates excessive backtracking with the novel technique of double indexing
– Idea: create 2 indices of the genome: one containing the BWT of the genome, called the forward index, and a second containing the BWT of the genome with its sequence reversed (not reverse complemented), called the mirror index.
• Let's consider a matching policy that allows one mismatch in the alignment (either in the first half or in the second half)
• Bowtie proceeds in two phases:
1. load the forward index into memory and invoke the aligner with the constraint that it may not substitute at positions in the query's right half
2. load the mirror index into memory and invoke the aligner on the reversed query, with the constraint that the aligner may not substitute at positions in the reversed query's right half (the original query's left half).
• The constraints on backtracking into the right half prevent excessive backtracking, whereas the use of two phases and two indices maintains full sensitivity
Bowtie - backtracking search
• Base quality varies across the read
• Bowtie allows the user to select
– the number of mismatches permitted in the high-quality end of a read (default: 2 mismatches in the first 28 bases)
– the maximum acceptable total of Phred quality values at mismatched positions over the alignment (default: 70)
• The first 28 bases on the high-quality end of the read are termed the seed
• The seed consists of two halves:
– the 14 bp on the high-quality end (usually the 5' end) = the hi-half
– the 14 bp on the low-quality end = the lo-half
• Assuming 2 mismatches permitted in the seed, a reportable alignment will fall into one of four cases:
1. no mismatches in seed
2. no mismatches in hi-half, one or two mismatches in lo-half
3. no mismatches in lo-half, one or two mismatches in hi-half
4. one mismatch in hi-half, one mismatch in lo-half
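
The four seed cases can be expressed as a small classifier. This is an illustrative sketch with hypothetical names; it assumes mismatch positions are 0-based and counted from the high-quality end of the read, so positions 0 to 13 form the hi-half and 14 to 27 the lo-half.

```python
def seed_case(mismatch_positions, seed_len=28, max_mismatches=2):
    """Return the case number (1-4) for a seed, or None if it is not reportable."""
    half = seed_len // 2
    # Only mismatches inside the seed count towards the seed policy.
    in_seed = [p for p in mismatch_positions if p < seed_len]
    hi = sum(1 for p in in_seed if p < half)  # mismatches in the hi-half
    lo = len(in_seed) - hi                    # mismatches in the lo-half
    if hi + lo > max_mismatches:
        return None
    if hi == 0 and lo == 0:
        return 1
    if hi == 0:
        return 2
    if lo == 0:
        return 3
    return 4


print(seed_case([]))       # 1
print(seed_case([20]))     # 2
print(seed_case([3, 5]))   # 3
print(seed_case([3, 20]))  # 4
```

With at most two mismatches allowed, case 4 can only be one mismatch in each half, which matches the list above.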
Bowtie

• The Bowtie algorithm consists of three phases that alternate between using the forward and mirror indices

[Figure: the three phases applied to the four seed cases; figure labels: "partial alignments with mismatches only in the hi-half" and "extend the partial alignments into full alignments"]

1. no mismatches in seed
2. no mismatches in hi-half, one or two mismatches in lo-half
3. no mismatches in lo-half, one or two mismatches in hi-half
4. one mismatch in hi-half, one mismatch in lo-half
Aligning 2 million reads to the human genome

Maq = Mapping and Assembly with Qualities
SOAP = Short Oligonucleotide Analysis Package
Constructing the index
• How do we construct a BWT index?
• Calculating the BWT is closely related to building a suffix array
• Each element of the BWT can be derived from the corresponding element of the suffix array:

BWT[i] = T[SA[i] - 1] if SA[i] > 0, and BWT[i] = $ if SA[i] = 0

• One could generate all suffixes, sort them to obtain the SA, then calculate the BWT in a single pass over the suffix array
• However, constructing the entire suffix array in memory requires at least ~12 gigabytes for the human genome
• Instead, Bowtie uses a block-wise strategy: it builds the suffix array and the BWT block by block, discarding suffix array blocks once the corresponding BWT block has been built
• Bowtie can build the full index for the human genome in about 24 hours in less than 1.5 gigabytes of RAM
• If 16 gigabytes of RAM or more is available, Bowtie can exploit the additional RAM to produce the same index in about 4.5 hours
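
The relationship between the suffix array and the BWT can be sketched in Python: each BWT character is the text character just before the corresponding suffix, with `$` used when the suffix starts at position 0. The function names are illustrative, the input is assumed to already end in `$`, and the whole suffix array is built in memory at once, so this does not show the block-wise strategy.

```python
def suffix_array(t):
    """0-based start positions of the suffixes of t, sorted lexicographically."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt_from_suffix_array(t, sa):
    """Single pass over the suffix array: take the character preceding each suffix."""
    return "".join(t[i - 1] if i > 0 else "$" for i in sa)


t = "acaacg$"  # text with terminator already appended
sa = suffix_array(t)
print(sa)                            # [6, 2, 0, 3, 1, 4, 5]
print(bwt_from_suffix_array(t, sa))  # gc$aaac
```

Because each BWT character depends only on one suffix array entry, a block of the suffix array can be discarded as soon as its BWT characters have been emitted.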
Constructing the index

• The largest single component of the Bowtie index is the BWT sequence. Bowtie stores the BWT in a 2-bit-per-base format
• A Bowtie index for the assembled human genome sequence is about 1.3 gigabytes
• A full Bowtie index actually consists of a pair of equal-size indexes, the forward and mirror indexes, for any given genome, but it can be run such that only one of the two indexes is ever resident in memory at once (using the -z option)

• What about gaps?

Bowtie 2
• Bowtie: very efficient ungapped alignment of short reads based on BWT index
• Index-based alignment algorithms can be quite inefficient when gaps are allowed
• Gaps can result from
– sequencing errors
– true insertions and deletions
• Bowtie 2 extends the index-based approach of Bowtie to permit gapped alignment
• It divides the algorithm into two stages
1. an initial, ungapped seed-finding stage that benefits from the speed and memory efficiency of the full-text index
2. a gapped extension stage that uses dynamic programming and benefits from the efficiency of single-instruction multiple-data (SIMD) parallel processing available on modern processors
BWA
TopHat
Exercises:

(a) Perform, by hand, the Burrows-Wheeler Transform on the sequence GCACGCTGC$.

(b) Suppose you are given a sequence TGTTGCCC$A, which has been permuted by way of the Burrows-Wheeler Transform. Perform, by hand, the reverse of the BWT in order to obtain the original sequence from which it was generated. Hint: If you wish to check your result, you may employ the BWT in the forward direction (i.e., the algorithm you employed in part (a)) on the answer you obtained here. Assuming you have performed the reversal correctly, you should find that your "original sequence" generates the permuted sequence given in this problem: TGTTGCCC$A.

(c) Consider the sequence you had to transform in part (a): GCACGCTGC$. Using the exact matching protocol, determine whether the following subsequences are contained within the sequence from part (a): CACG, AATG, CTGC.
