
primer

How to map billions of short reads onto genomes
Cole Trapnell & Steven L Salzberg
Mapping the vast quantities of short sequence fragments produced by next-generation sequencing platforms is a
challenge. What programs are available and how do they work?
© 2009 Nature America, Inc. All rights reserved.

A new generation of DNA sequencers that can rapidly and inexpensively sequence billions of bases is transforming genomic science. These new machines are quickly becoming the technology of choice for whole-genome sequencing and for a variety of sequencing-based assays, including gene expression, DNA-protein interaction, human resequencing and RNA splicing studies [1–3]. For example, the RNA-Seq protocol, in which processed mRNA is converted to cDNA and sequenced, is enabling the identification of previously unknown genes and alternative splice variants; the ChIP-Seq approach, which sequences immunoprecipitated DNA fragments bound to proteins, is revealing networks of interactions between transcription factors and DNA regulatory elements [4]; and the whole-genome sequencing of tumor cells is uncovering previously unidentified cancer-initiating mutations [5].

One of the challenges presented by the new sequencing technology is the so-called ‘read mapping’ problem. Sequencing machines made by Illumina of San Diego, Applied Biosystems (ABI) of Carlsbad, California, and Helicos of Cambridge, Massachusetts, produce short sequences of 25–100 base pairs (bp), called ‘reads’, which are sequence fragments read from a longer DNA molecule present in the sample that is fed into the machine. In contrast to whole-genome assembly, in which these reads are assembled together to reconstruct a previously unknown genome, many of the next-generation sequencing projects begin with a known, or so-called ‘reference’, genome. In this case, to make sense of the reads, their positions within the reference sequence must be determined. This process is known as aligning or ‘mapping’ the read to the reference. In one version of the mapping problem, reads must be aligned without allowing large gaps in the alignment (we describe this in more detail in the “Short-read mappers” section below). A more difficult version of the problem arises primarily in RNA-Seq, in which alignments are allowed to have large gaps corresponding to introns (discussed below in the “Spliced-read mappers” section).

These read mapping problems are certainly not new, and there are many programs that perform both spliced and unspliced alignment for the older Sanger-style capillary reads. Even so, these programs neither scale up to the much greater volumes of data produced by short-read sequencers nor scale down to the short read lengths. Aligning the reads from ChIP-Seq or RNA-Seq experiments can take hundreds or thousands of central processing unit (CPU) hours using conventional software tools such as BLAST or BLAT. Fortunately, new software packages designed to meet the computational challenges of short-read sequencing are quickly appearing. Before choosing one, it is essential to understand why the mapping problems are computationally difficult, which difficulties have been overcome and what challenges and opportunities remain.

Table 1 A selection of short-read analysis software

| Program | Website | Open source? | Handles ABI color space? | Maximum read length (bp) |
|---|---|---|---|---|
| Bowtie | http://bowtie.cbcb.umd.edu | Yes | No | None |
| BWA | http://maq.sourceforge.net/bwa-man.shtml | Yes | Yes | None |
| Maq | http://maq.sourceforge.net | Yes | Yes | 127 |
| Mosaik | http://bioinformatics.bc.edu/marthlab/Mosaik | No | Yes | None |
| Novoalign | http://www.novocraft.com | No | No | None |
| SOAP2 | http://soap.genomics.org.cn | No | No | 60 |
| ZOOM | http://www.bioinfor.com | No | Yes | 240 |

Cole Trapnell and Steven L. Salzberg are at the Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.
e-mail: [email protected] or [email protected]

Nature Biotechnology, volume 27, number 5, May 2009, 455–457

Challenges of mapping short reads

The first challenge is a practical one: if the reference genome is very large, and if we have billions of reads, how quickly can we align the reads to the genome? Sequence alignment is a classic problem in bioinformatics, supported by a large body of literature describing different variants for both exact and inexact alignment. As a practical matter, the task of mapping billions of sequences to a mammalian-sized genome calls for extraordinarily efficient algorithms, in which every bit of memory is used optimally or near optimally.

The second challenge is strategic: if a read comes from a repetitive element in the reference, a program must pick which copy of the repeat the read belongs to. Because this may be impossible to decide with confidence, the program may choose to report multiple possible locations or to pick a location heuristically. Sequencing errors or variations between the sequenced chromosomes and the reference genome exacerbate this problem, because the alignment between the read and its true source in the genome may actually have more differences than the alignment between the read and some other copy of the repeat. The spliced mapping problem faces this same challenge but is further complicated by the possible presence of intron-sized gaps.

DNA sequencers from Illumina, ABI, Roche (of Basel, Switzerland), Helicos and other companies produce millions of reads per run. Complete assays may involve many runs, so an investigator may need to map millions or billions of reads to a genome. For example, the recent cancer genome sequencing project by Ley et al. [5] generated nearly 8 billion reads from 132 sequencing runs. A large, expensive computer grid might map the reads from this experiment in a few days using traditional alignment algorithms such as BLAST or BLAT, but such grids are not accessible to everyone. To reduce the computing cost of analysis for sequencing-based assays and to make them available to all investigators, we and others have created a new generation of alignment programs capable of mapping hundreds of millions of short reads on a single desktop computer. Vendors of sequencing machines provide specialized mapping software, such as the ELAND program from Illumina, but in this article we focus on third-party packages, some of which are free and open source. These programs are built on algorithms that exploit features of short DNA sequencing reads to map millions of reads per hour while minimizing both processing time and memory requirements.

Short-read mappers

Such programs as Maq and Bowtie (Table 1) use a computational strategy known as ‘indexing’ to speed up their mapping algorithms. Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded within it. Maq is based on a straightforward but effective strategy called spaced seed indexing [6] (Fig. 1a). In this strategy, a read is divided into four segments of equal length, called the ‘seeds’. If the entire read aligns perfectly to the reference genome, then clearly all of the seeds will also align perfectly. If there is one mismatch, however, perhaps due to a single-nucleotide polymorphism (SNP), then it must fall within one of the four seeds, but the other three will still match perfectly.
Using similar reasoning, two mismatches will fall in at most two seeds, leaving the other two to match perfectly. Thus, by aligning all possible pairs of seeds (six possible pairs) against the reference, it is possible to winnow the list of candidate locations within the reference where the full read may map, allowing at most two mismatches. Maq’s spaced seed index enables it to perform this winnowing operation very efficiently. The resulting set of candidate reads is typically small enough that the rest of the read (that is, the other two seeds that might contain the mismatches) may be individually checked against the reference.
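The seed-pair reasoning above can be illustrated with a small Python sketch. It is not Maq's implementation: the function names (`seed_bounds`, `build_pair_index`, `map_read`) are hypothetical, the index is an ordinary in-memory dictionary built over the reference, and every read is assumed to have the same length as the windows that were indexed. The toy read is the 16-bp example from Figure 1; the flanking reference bases are made up.

```python
from collections import defaultdict
from itertools import combinations


def seed_bounds(read_len, n_seeds=4):
    """Return (start, end) offsets of n_seeds equal-length segments."""
    seed_len = read_len // n_seeds
    return [(i * seed_len, (i + 1) * seed_len) for i in range(n_seeds)]


def build_pair_index(reference, read_len, n_seeds=4):
    """Index every read-length window of the reference by each pair of its seeds."""
    bounds = seed_bounds(read_len, n_seeds)
    index = defaultdict(list)
    for pos in range(len(reference) - read_len + 1):
        window = reference[pos:pos + read_len]
        for i, j in combinations(range(n_seeds), 2):  # six pairs when n_seeds == 4
            key = (i, j, window[slice(*bounds[i])], window[slice(*bounds[j])])
            index[key].append(pos)
    return index


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, max_mismatches=2, n_seeds=4):
    """Look up all seed pairs, then verify each candidate position in full."""
    bounds = seed_bounds(len(read), n_seeds)
    candidates = set()
    for i, j in combinations(range(n_seeds), 2):
        key = (i, j, read[slice(*bounds[i])], read[slice(*bounds[j])])
        candidates.update(index.get(key, ()))
    hits = []
    for pos in sorted(candidates):
        mismatches = hamming(read, reference[pos:pos + len(read)])
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy usage: the read from Figure 1 embedded in an invented reference.
reference = "GATTACAGGCTTACTCCCGTACTCTAATGGCATCG"
index = build_pair_index(reference, read_len=16)
print(map_read("ACTCCCGTACTCTAAT", reference, index))  # exact copy of the read
print(map_read("ACTCGCGTACTCTTAT", reference, index))  # two substitutions
```

Because at most two of the four seeds can contain a mismatch, at least one of the six seed pairs still matches exactly, so the true location is always among the candidates that are verified. Storing six keys per reference position is also why an index of this kind grows so large.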
Bowtie takes an entirely different approach, borrowing a technique originally developed for compressing large files called the Burrows-Wheeler transform. Using this transform, the index for the entire human genome fits into less than two gigabytes of memory (an amount that is commonly available on today’s desktop and even laptop computers), in contrast to a spaced seed index, which may require over 50 gigabytes, and yet reads can still be aligned efficiently. Bowtie aligns a read one character at a time to the Burrows-Wheeler–transformed genome (Fig. 1b). Each successively aligned new character allows Bowtie to winnow the list of positions to which the read might map. If Bowtie cannot find a location where a read aligns perfectly, the algorithm backtracks to a previous character of the read, makes a substitution and resumes the search. In effect, the Burrows-Wheeler transform enables Bowtie to conquer the mapping problem by first solving a simple subproblem (align one character) and then building on that solution to solve a slightly harder problem (align two characters) and then continuing on to three characters, and so on, until the entire read has been aligned. Bowtie’s alignment algorithm is substantially more complicated than Maq’s, but Bowtie’s alignment speed is more than 30-fold faster [7].

[Figure 1 image: panel a, Spaced seeds; panel b, Burrows-Wheeler. In both panels a reference genome (> 3 gigabases, Chr1 to Chr4) is indexed and the short read ACTCCCGTACTCTAAT is looked up, using a seed index (tens of gigabytes) in panel a and a Bowtie index (~2 gigabytes) in panel b.]

Figure 1 Two recent algorithmic approaches for aligning short (20–200-bp) sequencing reads. (a) Algorithms based on spaced-seed indexing, such as Maq, index the reads as follows: each position in the reference is cut into equal-sized pieces, called ‘seeds’ and these seeds are paired and stored in a lookup table. Each read is also cut up according to this scheme, and pairs of seeds are used as keys to look up matching positions in the reference. Because seed indices can be very large, some algorithms (including Maq) index the reads in batches and treat substrings of the reference as queries. (b) Algorithms based on the Burrows-Wheeler transform, such as Bowtie, store a memory-efficient representation of the reference genome. Reads are aligned character by character from right to left against the transformed string. With each new character, the algorithm updates an interval (indicated by blue ‘beams’) in the transformed string. When all characters in the read have been processed, alignments are represented by any positions within the interval. Burrows-Wheeler–based algorithms can run substantially faster than spaced seed approaches, primarily owing to the memory efficiency of the Burrows-Wheeler search. Chr., chromosome.
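The right-to-left, character-by-character search described for Burrows-Wheeler–based aligners can be sketched in a few lines of Python. This is only an illustration of exact matching, under simplifying assumptions: the suffix array is built by naive sorting, the occurrence table is stored uncompressed (so none of the memory savings are reproduced), and the mismatch backtracking that Bowtie performs is omitted. The names `build_fm_index` and `exact_match` are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_fm_index(reference):
    """Build the Burrows-Wheeler transform plus the tables used for searching."""
    text = reference + "$"  # "$" is a sentinel that sorts before A, C, G, T
    sa = suffix_array(text)
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    alphabet = sorted(set(text))
    # counts[c]: number of characters in the text strictly smaller than c
    counts, total = {}, 0
    for c in alphabet:
        counts[c] = total
        total += text.count(c)
    # occ[c][k]: number of occurrences of c in bwt[:k]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return sa, counts, occ


def exact_match(read, fm_index):
    """Backward search: narrow an interval of the suffix array one character at a time."""
    sa, counts, occ = fm_index
    lo, hi = 0, len(sa)  # half-open interval covering every suffix
    for ch in reversed(read):  # process the read from right to left
        if ch not in counts:
            return []
        lo = counts[ch] + occ[ch][lo]
        hi = counts[ch] + occ[ch][hi]
        if lo >= hi:  # interval became empty: no exact occurrence
            return []
    return sorted(sa[k] for k in range(lo, hi))


# Toy usage on the 16-bp sequence shown in Figure 1.
fm = build_fm_index("ACTCCCGTACTCTAAT")
print(exact_match("ACTC", fm))  # [0, 8]
print(exact_match("TAAT", fm))  # [12]
```

Each loop iteration corresponds to one update of the interval in the transformed string. An aligner that tolerates mismatches would, when the interval becomes empty, go back to an earlier character, try a different base there and resume the search.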


Maq and Bowtie both report alignments with up to two mismatches when run in their default modes. In some alignment scenarios, a user may need to allow more mismatches. These two programs were originally designed for reads between 20 and 40 bp long, and both were optimized for human resequencing projects. Even so, Illumina sequencers can now produce reads longer than 100 bp. Additionally, some sequencing projects (such as bacterial or fungal genome sequencing) produce sequences that have many nucleotide-level differences with respect to the closest fully sequenced genome. Finally, the overall quality of reads produced by the new technologies is sensitive to factors such as library preparation, sequencing protocol and even the temperature of the room housing the sequencing machine. Thus, it is essential to know how to change the various default options for any short-read mapper and to be able to identify when those defaults are no longer appropriate.

Several of the new short-read mappers (Table 1) are open source, are simple to install and have good documentation and active user communities. The installation package for Bowtie includes a prebuilt index for Escherichia coli and a set of sample E. coli reads. To run the program on the sample data, just enter the following on the command line:

    bowtie e_coli reads/e_coli_1000.fq

This command will produce a tabular report showing each matching read’s identifier, the position(s) where it aligns to the reference sequence, and the number and location of mismatches. Maq reports this same information when you run it with the command:

    maq.pl easyrun -d outdir reference.fasta reads.fastq

For a given experiment, the fraction of reads that align to the genome depends on many factors. Assuming the sequenced DNA does not contain many mismatched nucleotides compared to the reference, and assuming the reads have passed rudimentary quality filters, most mapping software will find an alignment for 70–75% of the reads. This might seem surprisingly low, but the sequencing technology is still immature, and it’s worth noting that Sanger sequencing had success rates of less than 80% until the late 1990s. Note that many reads will align to multiple positions in the genome. Most read mappers can be directed to report alignments only for reads that map to a unique location in the genome.

After aligning the reads, next one might want to call SNPs or view the alignments against the reference sequence. One package for this task is the SAM tools (http://samtools.sourceforge.net). SAM includes a consensus base caller and viewer that can be used either with Maq or with Bowtie.

Most read mapping software is designed with whole-genome resequencing in mind, but the programs can be configured for other assays. The manuals for Bowtie and Maq are quite detailed, and the array of choices a user can make can be daunting. Moreover, the list of programs capable of short-read mapping is rapidly growing (Table 1), and not every program is ideal or appropriate for every experiment. Fortunately, there are ways to get help. The SeqAnswers message board (http://www.seqanswers.com) is an excellent resource for novice and expert users, frequented by the developers of many short-read mapping programs. One of the most popular SeqAnswers threads contains a catalog of current software for primary analysis and visualization of short-read data.

Spliced-read mappers

The spliced alignment problem, in which cDNA (from processed mRNA) sequences are aligned back to genomic DNA, requires more specialized algorithms. Reads sampled from exon-exon junctions need to be mapped differently from reads that are contained entirely within exons (Fig. 2).

To align cDNA reads from RNA-Seq [1–3] experiments, packages such as ERANGE (http://woldlab.caltech.edu/rnaseq) use the positions of exons and introns within known genes as a guide. This allows ERANGE to construct the sequences spanning exon-exon junctions and use them as reference sequences, and then to invoke a standard read mapper such as Maq or Bowtie to align the spliced reads [2]. Because this approach will not discover entirely new splice junctions, some studies have used machine learning methods to predict possible junctions by training statistical models using available reference annotations [8]. In contrast, the TopHat spliced-read mapper (http://tophat.cbcb.umd.edu) does not rely on annotations. Instead, it uses Bowtie (in an initial alignment pass) to identify exons that fully contain some of the reads, and then aligns the remaining reads to junctions between those exons [9]. Another program, G-Mo.R-Se (http://www.genoscope.cns.fr/externe/gmorse), performs a similar spliced alignment while constructing gene models from RNA-Seq data [10].

[Figure 2 image: a processed mRNA composed of Exon A, Exon B and Exon C, with short reads mapped back to the genome.]

Figure 2 RNA-Seq assays produce short reads sequenced from processed mRNAs. Aligning these reads to the genome with Bowtie or Maq will produce the alignments shown in black but will fail to align the blue reads. A spliced-read mapper such as TopHat or ERANGE will also report the (blue) alignments spanning intron boundaries.

Limitations and open problems

The current solutions for short-read mapping all have limitations. Mapping programs such as Maq and Bowtie offer very limited support for aligning reads with insertions or deletions (indels). Some read mappers, such as SHRiMP (http://compbio.cs.toronto.edu/shrimp), support ABI’s ‘color space’ sequence representation, but most do not. The spliced alignment programs suffer from these same problems and add a few of their own. Annotation-based methods are of course only as good as the annotations, and many organisms have annotations supported only by homology or computational predictions. Machine learning methods will perform poorly if they are trained on incorrect annotations, and they are prone to overtraining.

Many challenges and questions remain for developers of read mapping software. As all the sequencing machine vendors are trying to produce longer reads, will the short-read mapping programs scale well as the reads get longer? Maq, Bowtie and several other short-read packages support reads longer than 100 bp, but at some point, software designed for longer reads, such as BLAT, may be a better fit for downstream analysis. Furthermore, when mapping reads from an organism that has diverged significantly from its reference genome, how should a program’s parameters be adjusted, and can that adjustment happen automatically? How useful is mapping quality in downstream analysis, and should it be computed while aligning reads, as Maq does, or later? The answers to each of these questions will depend on the type of assay and the scale of the analysis, and as long as the technology continues to change, the programs will have to change rapidly to keep up.

1. Nagalakshmi, U. et al. Science 320, 1344–1349 (2008).
2. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
3. Wang, E.T. et al. Nature 456, 470–476 (2008).
4. Johnson, D.S., Mortazavi, A., Myers, R.M. & Wold, B. Science 316, 1497–1502 (2007).
5. Ley, T.J. et al. Nature 456, 66–72 (2008).
6. Li, H., Ruan, J. & Durbin, R. Genome Res. 18, 1851–1858 (2008).
7. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Genome Biol. 10, R25 (2009).
8. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Nat. Genet. 40, 1413–1415 (2008).
9. Trapnell, C., Pachter, L. & Salzberg, S.L. Bioinformatics published online, doi:10.1093/bioinformatics/btp120 (March 16, 2009).
10. Denoeud, F. et al. Genome Biol. 9, R175 (2008).
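As a supplementary illustration of the annotation-guided strategy described under “Spliced-read mappers”, the Python sketch below builds exon-exon junction sequences that an unspliced mapper could then use as extra reference sequences. It is not ERANGE's code: the function name `junction_sequences`, the half-open coordinate convention and the choice to join only consecutive exons of a single transcript are assumptions made for the example.

```python
def junction_sequences(genome, exons, read_len):
    """Build one spliced reference sequence per pair of consecutive exons.

    genome   : genomic sequence as a string
    exons    : sorted list of (start, end) half-open intervals for one transcript
    read_len : read length; read_len - 1 bases are kept on each side so that
               every read placed on a junction sequence crosses the splice site
    """
    flank = read_len - 1
    junctions = {}
    for (start1, end1), (start2, end2) in zip(exons, exons[1:]):
        left = genome[max(start1, end1 - flank):end1]     # tail of upstream exon
        right = genome[start2:min(end2, start2 + flank)]  # head of downstream exon
        junctions[(end1, start2)] = left + right
    return junctions


# Toy usage: two 8-bp exons separated by an invented 8-bp intron.
genome = "AAAACCCC" + "GTNNNNAG" + "GGGGTTTT"
exons = [(0, 8), (16, 24)]
junctions = junction_sequences(genome, exons, read_len=4)
print(junctions)

# A read that spans the splice site occurs in the junction sequence
# but nowhere in the unspliced genome.
spliced_read = "CCGG"
print(spliced_read in junctions[(8, 16)], spliced_read in genome)
```

Keying each junction by the genomic coordinates of its two splice sites lets alignments to the junction sequence be converted back to genome positions. A sketch like this only ever finds junctions that the annotation already lists, which is the limitation noted in the text.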
