How To Map Billions of Short Reads Onto Genomes
alignment between the read and its true source in the genome may actually have more differences than the alignment between the read and some other copy of the repeat. The spliced mapping problem faces this same challenge but is further complicated by the possible presence of intron-sized gaps.

DNA sequencers from Illumina, ABI, Roche (of Basel, Switzerland), Helicos and other companies produce millions of reads per run. Complete assays may involve many runs, so an investigator may need to map millions or billions of reads to a genome. For example, the recent cancer genome sequencing project by Ley et al.⁵ generated nearly 8 billion reads from 132 sequencing runs. A large, expensive computer grid might map the reads from this experiment in a few days using traditional alignment algorithms such as BLAST or BLAT, but such grids are not accessible to everyone. To reduce the computing cost of analysis for sequencing-based assays and to make them available to all investigators, we and others have created a new generation of alignment programs capable of mapping hundreds of millions of short reads on a single desktop computer. Vendors of sequencing machines provide specialized mapping software, such as the ELAND program from Illumina, but in this article we focus on third-party packages, some of which are free and open source. These programs are built on algorithms that exploit features of short DNA sequencing reads to map millions of reads per hour while minimizing both processing time and memory requirements.

Short-read mappers

Such programs as Maq and Bowtie (Table 1) use a computational strategy known as ‘indexing’ to speed up their mapping algorithms. Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded within it. Maq is based on a straightforward but effective strategy called spaced seed indexing⁶ (Fig. 1a). In this strategy, a read is divided into four segments of equal length, called the ‘seeds’. If the entire read aligns perfectly to the reference genome, then clearly all of the seeds will also align perfectly. If there is one mismatch, however, perhaps due to a single-nucleotide polymorphism (SNP), then it must fall within one of the four seeds, but the other three will still match perfectly. Using similar reasoning, two mismatches will fall in at most two seeds, leaving the other two

[Figure 1 panel titles: a, Spaced seeds; b, Burrows-Wheeler]
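The pigeonhole reasoning behind the four-seed strategy can be checked with a short sketch. This is only an illustration of the counting argument described in the text, not Maq's actual implementation; the function names (`split_into_seeds`, `matching_seeds`, `mutate`) and the 32-bp example sequence are assumptions made up for the example.

```python
from itertools import combinations

def split_into_seeds(read: str, n_seeds: int = 4) -> list[tuple[int, str]]:
    """Split a read into n_seeds contiguous, equal-length segments.

    Returns (offset, segment) pairs. Hypothetical helper for illustration.
    """
    if len(read) % n_seeds != 0:
        raise ValueError("read length must be divisible by the number of seeds")
    size = len(read) // n_seeds
    return [(i * size, read[i * size:(i + 1) * size]) for i in range(n_seeds)]

def matching_seeds(read: str, ref_window: str, n_seeds: int = 4) -> list[int]:
    """Return the indices of seeds that match the reference window exactly."""
    size = len(read) // n_seeds
    return [
        i
        for i, (offset, seed) in enumerate(split_into_seeds(read, n_seeds))
        if ref_window[offset:offset + size] == seed
    ]

def mutate(seq: str, positions: list[int]) -> str:
    """Introduce a guaranteed mismatch at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)

# A made-up 32-bp stretch of reference; each seed is 8 bp long.
reference_window = "ACGTACGTTGCATGCAGGCTAAGCTTAGCCAT"

read_one = mutate(reference_window, [5])       # one mismatch, in seed 0
read_two = mutate(reference_window, [5, 20])   # mismatches in seeds 0 and 2

print(matching_seeds(read_one, reference_window))  # [1, 2, 3]
print(matching_seeds(read_two, reference_window))  # [1, 3]

# Exhaustive check: any two mismatches leave at least two seeds intact.
assert all(
    len(matching_seeds(mutate(reference_window, list(pair)), reference_window)) >= 2
    for pair in combinations(range(len(reference_window)), 2)
)
```

Because at least two of the four seeds always survive two mismatches, an index built on pairs of seeds can locate candidate positions by exact lookup and leave the mismatches to be checked afterwards.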
© 2009 Nature America, Inc. All rights reserved.
Maq and Bowtie both report alignments with up to two mismatches when run in their default modes. In some alignment scenarios, a user may need to allow more mismatches. These two programs were originally designed for reads between 20 and 40 bp long, and both were optimized for human resequencing projects. Even so, Illumina sequencers can now produce reads longer than 100 bp. Additionally, some sequencing projects (such as bacterial or fungal genome sequencing) produce sequences that have many nucleotide-level differences with respect to the closest fully sequenced genome. Finally, the overall quality of reads produced by the new technologies is sensitive to factors such as library preparation, sequencing protocol and even the temperature of the room housing the sequencing machine. Thus, it is essential to know how to change the various default options for any short-read mapper and to be able to identify when those defaults are no longer appropriate.

Several of the new short-read mappers (Table 1) are open source, are simple to install and have good documentation and active user communities. The installation package for Bowtie includes a prebuilt index for Escherichia coli and a set of sample E. coli reads. To run the program on the sample data, just enter the following on the command line:

    bowtie e_coli reads/e_coli_1000.fq

This command will produce a tabular report showing each matching read’s identifier, the position(s) where it aligns to the reference sequence, and the number and location of mismatches. Maq reports this same information when you run it with the command:

    maq.pl easyrun -d outdir reference.fasta reads.fastq

For a given experiment, the fraction of reads that align to the genome depends on many factors. Assuming the sequenced DNA does not contain many mismatched nucleotides compared to the reference, and assuming the reads have passed rudimentary quality filters, most mapping software will find an alignment for 70–75% of the reads. This might seem surprisingly low, but the sequencing technology is still immature, and it’s worth noting that Sanger sequencing had success rates of less than 80% until the late 1990s. Note that many reads will align to multiple positions in the genome. Most read mappers can be directed to report alignments only for reads that map to a unique location in the genome.

After aligning the reads, next one might want to call SNPs or view the alignments against the reference sequence. One package for this task is the SAM tools (http://samtools.sourceforge.net). SAM includes a consensus base caller and viewer that can be used either with Maq or with Bowtie.

Most read mapping software is designed with whole-genome resequencing in mind, but the programs can be configured for other assays. The manuals for Bowtie and Maq are quite detailed, and the array of choices a user can make can be daunting. Moreover, the list of programs capable of short-read mapping is rapidly growing (Table 1), and not every program is ideal or appropriate for every experiment. Fortunately, there are ways to get help. The SeqAnswers message board (http://www.seqanswers.com) is an excellent resource for novice and expert users, frequented by the developers of many short-read mapping programs. One of the most popular SeqAnswers threads contains a catalog of current software for primary analysis and visualization of short-read data.

Spliced-read mappers

The spliced alignment problem, in which cDNA (from processed mRNA) sequences are aligned back to genomic DNA, requires more specialized algorithms. Reads sampled from exon-exon junctions need to be mapped differently from reads that are contained entirely within exons (Fig. 2).

To align cDNA reads from RNA-Seq¹⁻³ experiments, packages such as ERANGE (http://woldlab.caltech.edu/rnaseq) use the positions of exons and introns within known genes as a guide. This allows ERANGE to construct the sequences spanning exon-exon junctions and use them as reference sequences, and then to invoke a standard read mapper such as Maq or Bowtie to align the spliced reads². Because this approach will not discover entirely new splice junctions, some studies have used machine learning methods to predict possible junctions by training statistical models using available reference annotations⁸. In contrast, the TopHat spliced-read mapper (http://tophat.cbcb.umd.edu) does not rely on annotations. Instead, it uses Bowtie (in an initial alignment pass) to identify exons that fully contain some of the reads, and then aligns the remaining reads to junctions between those exons⁹. Another program, G-Mo.R-Se (http://www.genoscope.cns.fr/externe/gmorse), performs a similar spliced alignment while constructing gene models from RNA-Seq data¹⁰.

Figure 2 RNA-Seq assays produce short reads sequenced from processed mRNAs. Aligning these reads to the genome with Bowtie or Maq will produce the alignments shown in black but will fail to align the blue reads. A spliced-read mapper such as TopHat or ERANGE will also report the (blue) alignments spanning intron boundaries. (Panel labels: Exon A, Exon B, Exon C; Processed mRNA; Mapping to genome.)

Limitations and open problems

The current solutions for short-read mapping all have limitations. Mapping programs such as Maq and Bowtie offer very limited support for aligning reads with insertions or deletions (indels). Some read mappers, such as SHRiMP (http://compbio.cs.toronto.edu/shrimp), support ABI’s ‘color space’ sequence representation, but most do not. The spliced alignment programs suffer from these same problems and add a few of their own. Annotation-based methods are of course only as good as the annotations, and many organisms have annotations supported only by homology or computational predictions. Machine learning methods will perform poorly if they are trained on incorrect annotations, and they are prone to overtraining.

Many challenges and questions remain for developers of read mapping software. As all the sequencing machine vendors are trying to produce longer reads, will the short-read mapping programs scale well as the reads get longer? Maq, Bowtie and several other short-read packages support reads longer than 100 bp, but at some point, software designed for longer reads, such as BLAT, may be a better fit for downstream analysis. Furthermore, when mapping reads from an organism that has diverged significantly from its reference genome, how should a program’s parameters be adjusted, and can that adjustment happen automatically? How useful is mapping quality in downstream analysis, and should it be computed while aligning reads, as Maq does, or later? The answers to each of these questions will depend on the type of assay and the scale of the analysis, and as long as the technology continues to change, the programs will have to change rapidly to keep up.

1. Nagalakshmi, U. et al. Science 320, 1344–1349 (2008).
2. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
3. Wang, E.T. et al. Nature 456, 470–476 (2008).
4. Johnson, D.S., Mortazavi, A., Myers, R.M. & Wold, B. Science 316, 1497–1502 (2007).
5. Ley, T.J. et al. Nature 456, 66–72 (2008).
6. Li, H., Ruan, J. & Durbin, R. Genome Res. 18, 1851–1858 (2008).
7. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Genome Biol. 10, R25 (2009).
8. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Nat. Genet. 40, 1413–1415 (2008).
9. Trapnell, C., Pachter, L. & Salzberg, S.L. Bioinformatics published online, doi:10.1093/bioinformatics/btp120 (March 16, 2009).
10. Denoeud, F. et al. Genome Biol. 9, R175 (2008).