Transcriptome Software Paper
eISSN 2234-0742
Genomics & Informatics
Genomics Inform Vol. 13, No. 4, 2015
2015;13(4):119-125
Genomics & Informatics http://dx.doi.org/10.5808/GI.2015.13.4.119
REVIEW ARTICLE
Severance Biomedical Science Institute, Yonsei University College of Medicine, Seoul 03722, Korea
RNA is a polymeric molecule implicated in various biological processes, such as the coding, decoding, regulation, and
expression of genes. Numerous studies have examined RNA features using whole transcriptome sequencing (RNA-seq)
approaches. RNA-seq is a powerful technique for characterizing and quantifying the transcriptome and accelerates the
development of bioinformatics software. In this review, we introduce the routine RNA-seq workflow together with related
software, focusing particularly on transcriptome reconstruction and expression quantification.
Received October 13, 2015; Revised December 10, 2015; Accepted December 12, 2015
*Corresponding author: Tel: +82-2-2228-0913, Fax: +82-2-2227-8129, E-mail: [email protected]
Copyright © 2015 by the Korea Genome Organization
It is identical to the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/).
www.genominfo.org 119
IS Yang and S Kim. RNA-Seq Analysis Workflow and Software
Fig. 1. Typical workflow for RNA sequencing (RNA-seq) data analysis. This workflow shows an example for expression quantification and differential expression analysis at the gene and/or transcript level using RNA-seq, which typically consists of five steps: preprocessing, read alignment, transcriptome reconstruction, expression quantification, and differential expression analysis. For each step, currently available programs are listed in Table 1. QC, quality control.

Preprocessing of Raw Data
As with whole genome or exome sequencing, RNA-seq data are formatted in FASTQ (sequence and base quality). Numerous erroneous sequence variants can be introduced during the library preparation, sequencing, and imaging steps [7], and these should be identified and filtered out in the data analysis step. Thus, QC of raw data should be performed as the initial step of the routine RNA-seq workflow. Tools such as FastQC [8] and HTQC [9] can be applied in this step to assess the quality of raw data, enabling assessment of the overall and per-base quality for each read (i.e., reads 1 and 2 in the case of paired-end sequencing) in each sample. Depending on the RNA-seq library construction strategy, some form of read trimming may be advisable prior to aligning the RNA-seq data. Two common trimming strategies are “adapter trimming” and “quality trimming.” Adapter trimming involves removal of the adapter sequence by masking specific sequences used during library construction. Quality trimming generally removes the ends of reads where base quality scores have decreased to a level such that sequence errors and the resulting mismatches prevent reads from aligning.

Read Alignment
There are two strategies for the read alignment step, in which either a genome or a transcriptome is used as the reference [12]. The transcriptome comprises all transcripts in a given specimen, in which splicing has already taken place, so that exons are included and introns are excluded. If a transcriptome is used as the reference, unspliced aligners that do not allow large gaps may be the proper choice for accurate read mapping. Stampy, Mapping and Assembly with Quality (MAQ) [13], Burrows-Wheeler Aligner (BWA) [14], and Bowtie [15] can be used in this case. This alignment is limited to the identification of known exons and junctions because it does not identify splicing events involving novel exons. However, if the genome is used as the reference, spliced aligners that allow a wide range of gaps should be employed because reads aligned at exon-exon junctions will be split into two fragments. This approach may increase the probability of identifying novel transcripts generated by alternative splicing. Various spliced aligners have been developed, including TopHat [16], MapSplice [17], STAR [18], and GSNAP [19].

RNA-Seq Specific QC
Several intrinsic biases and limitations, including nucleotide composition bias, GC bias, and polymerase chain reaction bias, can be introduced into RNA-seq data from clinical samples of low quality or quantity. To evaluate these biases in RNA-seq data, several metrics may be examined: the percentage of exonic or rRNA reads, accuracy and biases in gene expression measurements, GC bias, evenness of coverage, 5′-to-3′ coverage bias, and coverage of the 5′ and 3′ ends [6]. Several programs, including RNA-SeQC [20], RSeQC [21], and Qualimap 2 [22], are currently available for this purpose; they typically take a BAM file as input.
RNA-SeQC [20] provides three types of QC metrics based on read count (total, unique, and duplicate reads, rRNA content, strand specificity, etc.), coverage (mean coverage, 5′/3′ coverage, GC bias, etc.), and expression correlation (reads per kilobase per million mapped reads [RPKM]-based estimation of expression levels and a correlation matrix from all pairwise comparisons). The software also provides multi-sample evaluation regarding library construction protocols,
input materials and other experimental parameters.
RSeQC [21] is a Python-based package that provides several metrics covering sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity, and read distribution over the genome structure. Of these metrics, sequencing depth is important because, by checking whether the sequencing depth is saturated, users can determine whether the current RNA-seq data are suitable for applications such as expression profiling, alternative splicing analysis, novel isoform identification, and transcriptome reconstruction.
Qualimap 2 [22] consists of four analysis modes: BAM QC, Counts QC, RNA-seq QC, and Multi-sample BAM QC. Compared to the previous release, this version focuses on multi-sample QC for high-throughput sequencing data. The Multi-sample BAM QC mode allows combined QC for multiple alignment files and takes the metrics from the single-sample BAM QC mode as input. The RNA-seq QC mode was added to compute metrics specific to RNA-seq data, including per-transcript coverage, junction sequence distribution, genomic localization of reads, 5′-3′ bias, and consistency of the library protocol. The Counts QC mode makes it possible to estimate the saturation of sequencing depth, read count densities, correlation of samples, and distribution of counts among classes of selected features, along with gene expression estimation based on NOIseq [23].

Transcriptome Reconstruction
Transcriptome reconstruction is the identification of all transcripts expressed in a specimen. Two strategies are used for transcriptome reconstruction: the reference-guided approach and the reference-independent approach. First, the reference-guided approach consists of two sequential steps: (1) alignment of raw reads to the reference, as described in the Read Alignment section, and (2) assembly of overlapping reads to reconstruct transcripts. This approach, which is employed in Cufflinks [24], Scripture [25], and StringTie [26], is advantageous when the reference annotation is well established, as in human and mouse. Second, the reference-independent approach uses a de novo assembly algorithm to build consensus transcripts directly from short reads without a reference, which is useful when there is no known reference genome or transcriptome. Trinity [27], Oases [28], and Trans-ABySS [29] may be used for this purpose.
Two publications have described RNA-seq protocols: one covers de novo transcriptome reconstruction without a reference using the Trinity platform [30], and the other covers differential expression analysis of genes and transcripts using a combination of TopHat and Cufflinks [31]. The latter protocol also includes a transcriptome reconstruction procedure (using Cufflinks) from read mapping data to a reference genome (using TopHat). These protocols are good examples of the different strategies that can be used for transcriptome reconstruction according to the presence or absence of a reference sequence.

Expression Quantification
Numerous methods have been developed for expression quantification using RNA-seq data. The methods fall into two groups according to the target level: gene-level and isoform-level quantification. Alternative expression analysis by sequencing (ALEXA-seq) [32], enhanced read analysis of gene expression (ERANGE) [33], and normalization by expected uniquely mappable area (NEUMA) [34] support gene-level quantification. Isoform-level quantification methods are divided into three groups according to the reference type and whether alignment results are required. The first group (e.g., RSEM [35]) requires the alignment of reads using the transcriptome as the reference. The second group (e.g., Cufflinks [24] and StringTie [26]) also requires alignment results, but with whole genome sequences as the reference rather than the transcriptome. The last group (e.g., Sailfish [36]) uses an alignment-free method. We discuss each isoform-level quantification method in detail in the following sections.

RSEM
RSEM is software that quantifies transcript-level abundance from RNA-seq data. RSEM operates in two steps: (1) generation and preprocessing of a set of reference transcript sequences and (2) alignment of reads to the reference transcripts followed by estimation of transcript abundances and their credibility intervals. A FASTA-formatted file of transcript sequences is used to generate the reference transcripts; it can be obtained from a reference genome database, a de novo transcriptome assembler, or an Expressed Sequence Tags (EST) database. Alternatively, a gene annotation file in GTF format and the full genome sequence in FASTA format may be supplied. RSEM uses the Bowtie alignment program [15] by default, although a user-provided aligner can also be used for mapping RNA-seq reads to the reference transcripts. RSEM provides gene-level and isoform-level estimates as the primary output by computing maximum likelihood abundance estimates based on the Expectation-Maximization (EM) algorithm after read mapping. Abundance estimates are given in terms of two measures: an estimate of the number of fragments and the estimated fraction of
transcripts comprising a given isoform or gene. The latter estimates can be multiplied by 10^6 to obtain a measure of transcripts per million (TPM). RSEM also supports the visualization of alignments and read depth using a genome browser such as the University of California Santa Cruz (UCSC) Genome Browser.

Cufflinks
The Tuxedo package is the most widely used software for transcript assembly and quantification using RNA-seq and consists of a number of different programs, including TopHat, Cufflinks, and Cuffdiff [31]. In the initial step, TopHat is employed for mapping raw RNA-seq reads to a reference genome, where some reads are split when they align across the exon-exon junctions of transcripts. These mapped reads are provided as input to Cufflinks for transcript assembly and abundance estimation. Transcript assembly is achieved by building an overlap graph from the mapped reads and then computing a minimum path cover of the overlap graph, which generates the minimum number of transcripts that explain all reads in the graph. Abundance estimation is performed by estimating the maximum likelihood abundance based on transcript coverage and compatibility together with the fragment length distribution. Abundances are reported in fragments per kilobase per million mapped fragments (FPKM) for paired-end data and RPKM for single-end data. Cuffdiff, a part of the
Cufflinks package, also uses the mapped reads to report genes and transcripts that are differentially expressed. CummeRbund can produce figures and plots from the Cuffdiff outputs.

StringTie
StringTie is software used for transcriptome reconstruction and abundance estimation. As with other tools, including Cufflinks, its input is produced by spliced aligners such as TopHat2 [37] or GSNAP [19], which either align the RNA-seq reads directly or align contigs pre-assembled from the reads by a de novo assembler such as MaSuRCA [38]. StringTie can perform transcriptome reconstruction and abundance estimation simultaneously by building a flow network for the path of the heaviest coverage and computing the maximum flow to estimate abundance. StringTie reports estimated abundance in FPKM for paired-end data and RPKM for single-end data.

Sailfish
Sailfish is unique software that adopts an alignment-free approach for isoform quantification. An index is built from a set of reference transcripts and a specific choice of k-mer length; it consists of data structures that map each k-mer in the reference transcripts to a unique integer identifier, making it possible to count k-mers in a set of reads and to resolve their origin in the set of transcripts. The index does not need to be rebuilt unless the set of reference transcripts or the k-mer length is changed. Sailfish computes an estimate of the relative abundance of each transcript in the reference by employing an EM algorithm similar to that used in RSEM. Because Sailfish avoids read alignment entirely, the running time for quantification is much lower than that of other existing methods. Sailfish reports three abundance measures: (1) RPKM, (2) k-mers per kilobase per million mapped k-mers (KPKM), and (3) TPM.
We described four programs, RSEM, Cufflinks, StringTie, and Sailfish, in detail. In addition to the specific algorithm used, a major difference between these programs may be the reference type. A set of transcript sequences is used as the reference in RSEM and Sailfish, indicating that these programs may be suitable for estimating the abundance of known transcripts. In contrast, a reference genome is employed in Cufflinks and StringTie, making it possible to report the estimated abundance of novel transcripts as well as known transcripts, as spliced read mapping data can reveal known and novel splice junctions simultaneously.

Differential Expression using RNA-seq
For differential expression analysis, a number of software packages and pipelines have been developed, including edgeR [39], DESeq [40], NOIseq [23], SAMseq [41], Cuffdiff [24], and EBSeq [42]. Unlike edgeR and DESeq, which adopt negative binomial models, and NOIseq and SAMseq, which are non-parametric, Cuffdiff and EBSeq can be used to compare differentially expressed genes by employing transcript-based detection methods. Many of the programs accept read count data as input, which can be produced by using HTSeq [43] or BEDTools [44]. Similarly to Cuffdiff, the Ballgown program [45] is employed for differential expression analysis using read mapping data from StringTie [26] (https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual). The above programs adopt one or more of the several available normalization methods (total count, upper quartile, median, DESeq normalization, trimmed mean of M values, quantile, and RPKM normalization) to correct biases that may appear between samples (sequencing depth [33]) or within a sample (gene length [46] and GC content [45]).
Although many programs have been developed, one research group reported that there may be large differences between these programs and that no single method may be optimal under all experimental conditions [48]. Thus, it may be difficult for most users with little or no statistical background to select a proper method. However, because RNA-seq data sets are rapidly accumulating, we expect that new bioinformatics tools for differential expression that function robustly under a wide range of conditions will be developed.

Conclusion
Numerous bioinformatics programs have been developed for RNA-seq data analysis. Even tools developed for the same purpose are based on distinct approaches using different algorithms and models. The diversity of the methodology makes it possible to customize analysis protocols by choosing a program that provides the best fit for each specific goal. In this review, we described the routine RNA-seq analysis workflow, focusing on transcriptome reconstruction and expression quantification, and also introduced related bioinformatics programs. We therefore expect that this review will be helpful for preparing a specific pipeline for RNA-seq data analysis and for designing new biological experiments.
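
The abundance measures named in the RSEM, Cufflinks, and StringTie sections (TPM and FPKM) can be illustrated with a small worked example. The sketch below is not code from any of those programs; the function name `tpm_and_fpkm` and the toy transcript identifiers are hypothetical, and it assumes the full transcript length is used in place of the effective length that real quantifiers typically estimate.

```python
from typing import Dict, Tuple


def tpm_and_fpkm(
    counts: Dict[str, float], lengths_bp: Dict[str, float]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Convert per-transcript fragment counts into TPM and FPKM values.

    Simplification (assumption): the full transcript length is used instead
    of an effective length that accounts for fragment size.
    """
    total_fragments = sum(counts.values())
    # Fragments per base: corrects the within-sample transcript length bias.
    rates = {t: counts[t] / lengths_bp[t] for t in counts}
    rate_sum = sum(rates.values())
    # TPM: estimated fraction of transcripts for each isoform, times 10^6.
    tpm = {t: rates[t] / rate_sum * 1e6 for t in counts}
    # FPKM: fragments per kilobase of transcript per million mapped fragments.
    fpkm = {
        t: counts[t] / (lengths_bp[t] / 1e3) / (total_fragments / 1e6)
        for t in counts
    }
    return tpm, fpkm
```

For instance, calling `tpm_and_fpkm({"tA": 100, "tB": 300}, {"tA": 1000, "tB": 3000})` gives both toy transcripts the same TPM, because the second transcript is three times as long and receives three times as many fragments. TPM values always sum to one million within a sample, whereas the sum of FPKM values depends on the length distribution of the expressed transcripts.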
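
The EM-based abundance estimation mentioned for RSEM and Sailfish can be sketched on toy data to show how reads compatible with several isoforms are shared among them. This is a minimal illustration under stated assumptions, not the model implemented in either program: the function `em_read_fractions` is hypothetical, each read is represented only by the list of transcripts it is compatible with, and a read is assumed to arise uniformly along a transcript (probability proportional to 1/length).

```python
from typing import Dict, List


def em_read_fractions(
    read_compat: List[List[str]],
    lengths_bp: Dict[str, float],
    n_iter: int = 200,
) -> Dict[str, float]:
    """Estimate the fraction of reads from each transcript with a toy EM."""
    transcripts = list(lengths_bp)
    # theta[t]: current estimate of the proportion of reads coming from t.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each read among its compatible transcripts in
        # proportion to theta[t] / length (uniform read start assumption).
        for compat in read_compat:
            weights = {t: theta[t] / lengths_bp[t] for t in compat}
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # read explained by no transcript with support
            for t, w in weights.items():
                expected[t] += w / norm
        # M-step: renormalize the expected read counts into proportions.
        total = sum(expected.values())
        theta = {t: expected[t] / total for t in transcripts}
    return theta
```

With 30 reads unique to one isoform, 10 unique to another of equal length, and 20 reads compatible with both, the estimate converges to read fractions of 0.75 and 0.25: the ambiguous reads are divided in the ratio suggested by the uniquely assigned ones rather than split evenly.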
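
The k-mer index described in the Sailfish section, which maps each k-mer in the reference transcripts to a unique integer identifier and is then used to count k-mers in reads, can be sketched as follows. This is only an illustration of the idea using a plain dictionary; the function names `build_kmer_index` and `count_read_kmers` are hypothetical, and the data structures actually used by Sailfish are not reproduced here.

```python
from collections import Counter
from typing import Dict, Iterable


def build_kmer_index(transcripts: Dict[str, str], k: int) -> Dict[str, int]:
    """Map every distinct k-mer in the reference transcripts to an integer."""
    index: Dict[str, int] = {}
    for seq in transcripts.values():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer not in index:
                index[kmer] = len(index)  # next unused identifier
    return index


def count_read_kmers(
    reads: Iterable[str], index: Dict[str, int], k: int
) -> Counter:
    """Count indexed k-mers observed in the reads, keyed by identifier."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_id = index.get(read[i:i + k])
            if kmer_id is not None:  # k-mers absent from the reference are skipped
                counts[kmer_id] += 1
    return counts
```

The index depends only on the transcript set and on k, which is why it can be reused across samples until either of them changes. Counting then requires only dictionary lookups on the reads, with no alignment step.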
Acknowledgments
This work was supported by the Bio-Synergy Research Project (NRF-2014M3A9C4066449) of the Ministry of Science, ICT and Future Planning through the National Research Foundation.

References
1. Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 2011;12:87-98.
2. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008;18:1509-1517.
3. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, et al. Alternative isoform regulation in human tissue transcriptomes. Nature 2008;456:470-476.
4. Denoeud F, Aury JM, Da Silva C, Noel B, Rogier O, Delledonne M, et al. Annotating genomes with massive-scale RNA sequencing. Genome Biol 2008;9:R175.
5. Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature 2009;458:97-101.
6. Adiconis X, Borges-Rivera D, Satija R, DeLuca DS, Busby MA, Berlin AM, et al. Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nat Methods 2013;10:623-629.
7. Robasky K, Lewis NE, Church GM. The role of replicates for error mitigation in next-generation sequencing. Nat Rev Genet 2014;15:56-62.
8. Babraham Bioinformatics. FastQC. Cambridgeshire: Babraham Institute, 2015. Accessed 2015 Nov 2. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
9. Yang X, Liu D, Liu F, Wu J, Zou J, Xiao X, et al. HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics 2013;14:33.
10. FASTX-Toolkit. Cold Spring Harbor: Cold Spring Harbor Laboratory, 2015. Accessed 2015 Nov 2. Available from: http://hannonlab.cshl.edu/fastx_toolkit/.
11. Dodt M, Roehr JT, Ahmed R, Dieterich C. FLEXBAR-flexible barcode and adapter processing for next-generation sequencing platforms. Biology (Basel) 2012;1:895-905.
12. Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 2011;8:469-477.
13. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18:1851-1858.
14. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754-1760.
15. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:R25.
16. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009;25:1105-1111.
17. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 2010;38:e178.
18. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013;29:15-21.
19. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26:873-881.
20. DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire MD, Williams C, et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics 2012;28:1530-1532.
21. Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics 2012;28:2184-2185.
22. Okonechnikov K, Conesa A, Garcia-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 2015 Oct 1 [Epub]. http://dx.doi.org/10.1093/bioinformatics/btv566.
23. Tarazona S, Furio-Tari P, Turra D, Pietro AD, Nueda MJ, Ferrer A, et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res 2015;43:e140.
24. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010;28:511-515.
25. Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol 2010;28:503-510.
26. Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 2015;33:290-295.
27. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 2011;29:644-652.
28. Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 2012;28:1086-1092.
29. Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, et al. De novo assembly and analysis of RNA-seq data. Nat Methods 2010;7:909-912.
30. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc 2013;8:1494-1512.
31. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 2012;7:562-578.
32. Griffith M, Griffith OL, Mwenifumbo J, Goya R, Morrissy AS, Morin RD, et al. Alternative expression analysis by RNA sequencing. Nat Methods 2010;7:843-847.
33. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008;5:621-628.
34. Lee S, Seo CH, Lim B, Yang JO, Oh J, Kim M, et al. Accurate quantification of transcriptome from RNA-Seq data by effective length normalization. Nucleic Acids Res 2011;39:e9.
35. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 2011;12:323.
36. Patro R, Mount SM, Kingsford C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol 2014;32:462-464.
37. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 2013;14:R36.
38. Zimin AV, Marcais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics 2013;29:2669-2677.
39. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26:139-140.
40. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol 2010;11:R106.
41. Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat Methods Med Res 2013;22:519-536.
42. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 2013;29:1035-1043.
43. Anders S, Pyl PT, Huber W. HTSeq: a Python framework to work with high-throughput sequencing data. Bioinformatics 2015;31:166-169.
44. Quinlan AR. BEDTools: The Swiss-Army tool for genome feature analysis. Curr Protoc Bioinformatics 2014;47:11.12.1-11.12.34.
45. Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT. Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat Biotechnol 2015;33:243-246.
46. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct 2009;4:14.
47. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 2010;464:768-772.
48. Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform 2015;16:59-70.