Survey RNA-Seq Data Analysis (2016)

The review by Conesa et al. outlines best practices for RNA-seq data analysis, emphasizing the importance of experimental design, quality control, and the selection of appropriate analysis strategies based on research goals and organism characteristics. It discusses various steps in RNA-seq analysis, including read alignment, quantification, and differential expression, while highlighting the challenges and considerations at each stage. The authors aim to provide guidelines rather than a single optimal pipeline, addressing the diverse applications and evolving technologies in transcriptomics.


Conesa et al.

Genome Biology (2016) 17:13


DOI 10.1186/s13059-016-0881-8

REVIEW Open Access

A survey of best practices for RNA-seq data analysis
Ana Conesa1,2*, Pedro Madrigal3,4*, Sonia Tarazona2,5, David Gomez-Cabrero6,7,8,9, Alejandra Cervera10,
Andrew McPherson11, Michał Wojciech Szcześniak12, Daniel J. Gaffney3, Laura L. Elo13, Xuegong Zhang14,15
and Ali Mortazavi16,17*

1 Institute for Food and Agricultural Sciences, Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32603, USA
3 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
16 Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697-2300, USA
Full list of author information is available at the end of the article
* Correspondence: [email protected]; [email protected]; [email protected]

© 2016 Conesa et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract
RNA-sequencing (RNA-seq) has a wide variety of applications, but no single analysis pipeline can be used in all cases. We review all of the major steps in RNA-seq data analysis, including experimental design, quality control, read alignment, quantification of gene and transcript levels, visualization, differential gene expression, alternative splicing, functional analysis, gene fusion detection and eQTL mapping. We highlight the challenges associated with each step. We discuss the analysis of small RNAs and the integration of RNA-seq with other functional genomics techniques. Finally, we discuss the outlook for novel technologies that are changing the state of the art in transcriptomics.

Background
Transcript identification and the quantification of gene expression have been distinct core activities in molecular biology ever since the discovery of RNA’s role as the key intermediate between the genome and the proteome. The power of sequencing RNA lies in the fact that the twin aspects of discovery and quantification can be combined in a single high-throughput sequencing assay called RNA-sequencing (RNA-seq). The pervasive adoption of RNA-seq has spread well beyond the genomics community and has become a standard part of the toolkit used by the life sciences research community. Many variations of RNA-seq protocols and analyses have been published, making it challenging for new users to appreciate all of the steps necessary to conduct an RNA-seq study properly.

There is no optimal pipeline for the variety of different applications and analysis scenarios in which RNA-seq can be used. Scientists plan experiments and adopt different analysis strategies depending on the organism being studied and their research goals. For example, if a genome sequence is available for the studied organism, it should be possible to identify transcripts by mapping RNA-seq reads onto the genome. By contrast, for organisms without sequenced genomes, quantification would be achieved by first assembling reads de novo into contigs and then mapping these contigs onto the transcriptome. For well-annotated genomes such as the human genome, researchers may choose to base their RNA-seq analysis on the existing annotated reference transcriptome alone, or might try to identify new transcripts and their differential regulation. Furthermore, investigators might be interested only in messenger RNA isoform expression or microRNA (miRNA) levels or allele variant identification. Both the experimental design and the analysis procedures will vary greatly in each of these cases. RNA-seq can be used solo for transcriptome profiling or in combination with other functional genomics methods to enhance the analysis of gene expression. Finally, RNA-seq can be coupled with different types of biochemical assay to analyze many other aspects of RNA biology, such as RNA–protein binding, RNA structure, or RNA–RNA interactions. These applications are, however, beyond the scope of this review as we focus on ‘typical’ RNA-seq.

Every RNA-seq experimental scenario could potentially have different optimal methods for transcript quantification, normalization, and ultimately differential expression analysis. Moreover, quality control checks should be applied pertinently at different stages of the analysis to ensure both reproducibility and reliability of the results. Our focus is to outline current standards
and resources for the bioinformatics analysis of RNA-seq data. We do not aim to provide an exhaustive compilation of resources or software tools nor to indicate one best analysis pipeline. Rather, we aim to provide a commented guideline for RNA-seq data analysis. Figure 1 depicts a generic roadmap for experimental design and analysis using standard Illumina sequencing. We also briefly list several data integration paradigms that have been proposed and comment on their potential and limitations. We finally discuss the opportunities as well as challenges provided by single-cell RNA-seq and long-read technologies when compared to traditional short-read RNA-seq.

Fig. 1 A generic roadmap for RNA-seq computational analyses. The major analysis steps are listed above the lines for pre-analysis, core analysis and advanced analysis. The key analysis issues for each step that are listed below the lines are discussed in the text. a Preprocessing includes experimental design, sequencing design, and quality control steps. b Core analyses include transcriptome profiling, differential gene expression, and functional profiling. c Advanced analysis includes visualization, other RNA-seq technologies, and data integration. Abbreviations: ChIP-seq Chromatin immunoprecipitation sequencing, eQTL Expression quantitative trait loci, FPKM Fragments per kilobase of exon model per million mapped reads, GSEA Gene set enrichment analysis, PCA Principal component analysis, RPKM Reads per kilobase of exon model per million reads, sQTL Splicing quantitative trait loci, TF Transcription factor, TPM Transcripts per million

Experimental design
A crucial prerequisite for a successful RNA-seq study is that the data generated have the potential to answer the biological questions of interest. This is achieved by first defining a good experimental design, that is, by choosing the library type, sequencing depth and number of replicates appropriate for the biological system under study, and second by planning an adequate execution of the sequencing experiment itself, ensuring that data acquisition does not become contaminated with unnecessary biases. In this section, we discuss both considerations.

One important aspect of the experimental design is the RNA-extraction protocol used to remove the highly abundant ribosomal RNA (rRNA), which typically constitutes over 90 % of total RNA in the cell, leaving the 1–2 % comprising messenger RNA (mRNA) that we are normally interested in. For eukaryotes, this involves choosing whether to enrich for mRNA using poly(A) selection or to deplete rRNA. Poly(A) selection typically requires a relatively high proportion of mRNA with minimal degradation as measured by RNA integrity number (RIN), which normally yields a higher overall fraction of reads falling onto known exons. Many biologically relevant samples (such as tissue biopsies) cannot, however, be obtained in great enough quantity or good enough mRNA integrity to produce good poly(A) RNA-seq libraries and therefore require ribosomal depletion. For bacterial samples, in which mRNA is not polyadenylated,
the only viable alternative is ribosomal depletion. Another consideration is whether to generate strand-preserving libraries. The first generation of Illumina-based RNA-seq used random hexamer priming to reverse-transcribe poly(A)-selected mRNA. This methodology did not retain information contained on the DNA strand that is actually expressed [1] and therefore complicates the analysis and quantification of antisense or overlapping transcripts. Several strand-specific protocols [2], such as the widely used dUTP method, extend the original protocol by incorporating dUTP nucleotides during the second cDNA synthesis step, prior to adapter ligation followed by digestion of the strand containing dUTP [3]. In all cases, the size of the final fragments (usually less than 500 bp for Illumina) will be crucial for proper sequencing and subsequent analysis. Furthermore, sequencing can involve single-end (SE) or paired-end (PE) reads, although the latter is preferable for de novo transcript discovery or isoform expression analysis [4, 5]. Similarly, longer reads improve mappability and transcript identification [5, 6]. The best sequencing option depends on the analysis goals. The cheaper, short SE reads are normally sufficient for studies of gene expression levels in well-annotated organisms, whereas longer and PE reads are preferable to characterize poorly annotated transcriptomes.

Another important factor is sequencing depth or library size, which is the number of sequenced reads for a given sample. More transcripts will be detected and their quantification will be more precise as the sample is sequenced to a deeper level [1]. Nevertheless, optimal sequencing depth again depends on the aims of the experiment. While some authors will argue that as few as five million mapped reads are sufficient to quantify accurately medium to highly expressed genes in most eukaryotic transcriptomes, others will sequence up to 100 million reads to quantify precisely genes and transcripts that have low expression levels [7]. When studying single cells, which have limited sample complexity, quantification is often carried out with just one million reads but may be done reliably for highly expressed genes with as few as 50,000 reads [8]; even 20,000 reads have been used to differentiate cell types in splenic tissue [9]. Moreover, optimal library size depends on the complexity of the targeted transcriptome. Experimental results suggest that deep sequencing improves quantification and identification but might also result in the detection of transcriptional noise and off-target transcripts [10]. Saturation curves can be used to assess the improvement in transcriptome coverage to be expected at a given sequencing depth [10].

Finally, a crucial design factor is the number of replicates. The number of replicates that should be included in an RNA-seq experiment depends on both the amount of technical variability in the RNA-seq procedures and the biological variability of the system under study, as well as on the desired statistical power (that is, the capacity for detecting statistically significant differences in gene expression between experimental groups). These two aspects are part of power analysis calculations (Fig. 1a; Box 1).

Box 1. Number of replicates
Three factors determine the number of replicates required in an RNA-seq experiment. The first factor is the variability in the measurements, which is influenced by the technical noise and the biological variation. While reproducibility in RNA-seq is usually high at the level of sequencing [1, 45], other steps such as RNA extraction and library preparation are noisier and may introduce biases in the data that can be minimized by adopting good experimental procedures (Box 2). Biological variability is particular to each experimental system and is harder to control [189]. Nevertheless, biological replication is required if inference on the population is to be made, with three replicates being the minimum for any inferential analysis. For a proper statistical power analysis, estimates of the within-group variance and gene expression levels are required. This information is typically not available beforehand but can be obtained from similar experiments. The exact power will depend on the method used for differential expression analysis, and software packages exist that provide a theoretical estimate of power over a range of variables, given the within-group variance of the samples, which is intrinsic to the experiment [190, 191]. Table 1 shows an example of statistical power calculations over a range of fold-changes (or effect sizes) and number of replicates in a human blood RNA-seq sample sequenced at 30 million mapped reads. It should be noted that these estimates apply to the average gene expression level, but as dynamic ranges in RNA-seq data are large, the probability that highly expressed genes will be detected as differentially expressed is greater than that for low-count genes [192]. For methods that return a false discovery rate (FDR), the proportion of genes that are highly expressed out of the total set of genes being tested will also influence the power of detection after multiple testing correction [193]. Filtering out genes that are expressed at low levels prior to differential expression analysis reduces the severity of the correction and may improve the power of detection [20]. Increasing sequencing depth also can improve statistical power for lowly expressed genes [10, 194], and for any given sample there exists a level of sequencing at which power improvement is best achieved by increasing the number of replicates [195]. Tools such as Scotty are available to calculate the best trade-off between sequencing depth and replicate number given some budgetary constraints [191].

The adequate planning of sequencing experiments so as to avoid technical biases is as important as good
experimental design, especially when the experiment involves a large number of samples that need to be processed in several batches. In this case, including controls, randomizing sample processing and smart management of sequencing runs are crucial to obtain error-free data (Fig. 1a; Box 2).

Table 1 Statistical power to detect differential expression varies with effect size, sequencing depth and number of replicates

                                        Replicates per group
                                        3        5        10
Effect size (fold change)
  1.25                                  17 %     25 %     44 %
  1.5                                   43 %     64 %     91 %
  2                                     87 %     98 %     100 %
Sequencing depth (millions of reads)
  3                                     19 %     29 %     52 %
  10                                    33 %     51 %     80 %
  15                                    38 %     57 %     85 %

Example of calculations for the probability of detecting differential expression in a single test at a significance level of 5 %, for a two-group comparison using a Negative Binomial model, as computed by the RNASeqPower package of Hart et al. [190]. For a fixed within-group variance (package default value), the statistical power increases with the difference between the two groups (effect size), the sequencing depth, and the number of replicates per group. This table shows the statistical power for a gene with 70 aligned reads, which was the median coverage for a protein-coding gene for one whole-blood RNA-seq sample with 30 million aligned reads from the GTEx Project [214]

Box 2. Experiment execution choices
RNA-seq library preparation and sequencing procedures include a number of steps (RNA fragmentation, cDNA synthesis, adapter ligation, PCR amplification, bar-coding, and lane loading) that might introduce biases into the resulting data [196]. Including exogenous reference transcripts (‘spike-ins’) is useful both for quality control [1, 197] and for library-size normalization [198]. For bias minimization, we recommend following the suggestions made by Van Dijk et al. [199], such as the use of adapters with random nucleotides at the extremities or the use of chemical-based fragmentation instead of RNase III-based fragmentation. If the RNA-seq experiment is large and samples have to be processed in different batches and/or Illumina runs, caution should be taken to randomize samples across library preparation batches and lanes so as to avoid technical factors becoming confounded with experimental factors. Another option, when samples are individually barcoded and multiple Illumina lanes are needed to achieve the desired sequencing depth, is to include all samples in each lane, which would minimize any possible lane effect.

Analysis of the RNA-seq data
The actual analysis of RNA-seq data has as many variations as there are applications of the technology. In this section, we address all of the major analysis steps for a typical RNA-seq experiment, which involve quality control, read alignment with and without a reference genome, obtaining metrics for gene and transcript expression, and approaches for detecting differential gene expression. We also discuss analysis options for applications of RNA-seq involving alternative splicing, fusion transcripts and small RNA expression. Finally, we review useful packages for data visualization.

Quality-control checkpoints
The acquisition of RNA-seq data consists of several steps: obtaining raw reads, read alignment and quantification. At each of these steps, specific checks should be applied to monitor the quality of the data (Fig. 1a).

Raw reads
Quality control for the raw reads involves the analysis of sequence quality, GC content, the presence of adaptors, overrepresented k-mers and duplicated reads in order to detect sequencing errors, PCR artifacts or contaminations. Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiments. We recommend that outliers with over 30 % disagreement be discarded. FastQC [11] is a popular tool to perform these analyses on Illumina reads, whereas NGSQC [12] can be applied to any platform. As a general rule, read quality decreases towards the 3’ end of reads, and if it becomes too low, bases should be removed to improve mappability. Software tools such as the FASTX-Toolkit [13] and Trimmomatic [14] can be used to discard low-quality reads, trim adaptor sequences, and eliminate poor-quality bases.

Read alignment
Reads are typically mapped to either a genome or a transcriptome, as will be discussed later. An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. For example, we expect between 70 and 90 % of regular RNA-seq reads to map onto the human genome (depending on the read mapper used) [15], with a significant fraction of reads mapping to a limited number of identical regions equally well (‘multi-mapping reads’). When reads are mapped against the transcriptome, we expect slightly lower total mapping percentages because reads coming from unannotated transcripts will be lost, and significantly more multi-mapping reads because of reads falling onto exons that are shared by different transcript isoforms of the same gene.

Other important parameters are the uniformity of read coverage on exons and the mapped strand. If reads
primarily accumulate at the 3’ end of transcripts in poly(A)-selected samples, this might indicate low RNA quality in the starting material. The GC content of mapped reads may reveal PCR biases. Tools for quality control in mapping include Picard [16], RSeQC [17] and Qualimap [18].

Quantification
Once actual transcript quantification values have been calculated, they should be checked for GC content and gene length biases so that correcting normalization methods can be applied if necessary. If the reference transcriptome is well annotated, researchers could analyze the biotype composition of the sample, which is indicative of the quality of the RNA purification step. For example, rRNA and small RNAs should not be present in regular poly(A) long RNA preparations [10, 19]. A number of R packages (such as NOISeq [19] or EDASeq [20]) provide useful plots for quality control of count data.

Reproducibility
The quality-control steps described above involve individual samples. In addition, it is also crucial to assess the global quality of the RNA-seq dataset by checking on the reproducibility among replicates and for possible batch effects. Reproducibility among technical replicates should be generally high (Spearman R² > 0.9) [1], but no clear standard exists for biological replicates, as this depends on the heterogeneity of the experimental system. If gene expression differences exist among experimental conditions, it should be expected that biological replicates of the same condition will cluster together in a principal component analysis (PCA).

Transcript identification
When a reference genome is available, RNA-seq analysis will normally involve the mapping of the reads onto the reference genome or transcriptome to infer which transcripts are expressed. Mapping solely to the reference transcriptome of a known species precludes the discovery of new, unannotated transcripts and focuses the analysis on quantification alone. By contrast, if the organism does not have a sequenced genome, then the analysis path is first to assemble reads into longer contigs and then to treat these contigs as the expressed transcriptome to which reads are mapped back again for quantification. In either case, read coverage can be used to quantify transcript expression level (Fig. 1b). A basic choice is whether transcript identification and quantification are done sequentially or simultaneously.

Fig. 2 Read mapping and transcript identification strategies. Three basic strategies for regular RNA-seq analysis. a An annotated genome is available and reads are mapped to the genome with a gapped mapper. Next (novel) transcript discovery and quantification can proceed with or without an annotation file. Novel transcripts are then functionally annotated. b If no novel transcript discovery is needed, reads can be mapped to the reference transcriptome using an ungapped aligner. Transcript identification and quantification can occur simultaneously. c When no genome is available, reads need to be assembled first into contigs or transcripts. For quantification, reads are mapped back to the novel reference transcriptome and further analysis proceeds as in (b) followed by the functional annotation of the novel transcripts as in (a). Representative software that can be used at each analysis step are indicated in bold text. Abbreviations: GFF General Feature Format, GTF gene transfer format, RSEM RNA-Seq by Expectation Maximization

Alignment
Two alternatives are possible when a reference sequence is available: mapping to the genome or mapping to the annotated transcriptome (Fig. 2a, b; Box 3). Regardless
of whether a genome or transcriptome reference is used, reads may map uniquely (they can be assigned to only one position in the reference) or could be multi-mapped reads (multireads). Genomic multireads are primarily due to repetitive sequences or shared domains of paralogous genes. They normally account for a significant fraction of the mapping output when mapped onto the genome and should not be discarded. When the reference is the transcriptome, multi-mapping arises even more often because a read that would have been uniquely mapped on the genome would map equally well to all gene isoforms in the transcriptome that share the exon. In either case (genome or transcriptome mapping), transcript identification and quantification become important challenges for alternatively expressed genes.

Box 3. Mapping to a reference
Mapping to a reference genome allows for the identification of novel genes or transcripts, and requires the use of a gapped or spliced mapper as reads may span splice junctions. The challenge is to identify splice junctions correctly, especially when sequencing errors or differences with the reference exist or when non-canonical junctions and fusion transcripts are sought. One of the most popular RNA-seq mappers, TopHat, follows a two-step strategy in which unspliced reads are first mapped to locate exons, then unmapped reads are split and aligned independently to identify exon junctions [200, 201]. Several other mappers exist that are optimized to identify SNPs or indels (GSNAP [202], PALMapper [203], MapSplice [204]), detect non-canonical splice junctions (STAR [15], MapSplice [204]), achieve ultra-fast mapping (GEM [205]) or map long-reads (STAR [15]). Important parameters to consider during mapping are the strandedness of the RNA-seq library, the number of mismatches to accept, the length and type of reads (SE or PE), and the length of sequenced fragments. In addition, existing gene models can be leveraged by supplying an annotation file to some read mappers in order to map exon coordinates accurately and to help in identifying splicing events. The choice of gene model can also have a strong impact on the quantification and differential expression analysis [206]. We refer the reader to [30] for a comprehensive comparison of RNA-seq mappers. If the transcriptome annotation is comprehensive (for example, in mouse or human), researchers may choose to map directly to a Fasta-format file of all transcript sequences for all genes of interest. In this case, no gapped alignment is needed and unspliced mappers such as Bowtie [207] can be used (Fig. 2b). Mapping to the transcriptome is generally faster but does not allow de novo transcript discovery.

Transcript discovery
Identifying novel transcripts using the short reads provided by Illumina technology is one of the most challenging tasks in RNA-seq. Short reads rarely span across several splice junctions and thus make it difficult to directly infer all full-length transcripts. In addition, it is difficult to identify transcription start and end sites [21], and tools such as GRIT [22] that incorporate other data such as 5’ ends from CAGE or RAMPAGE typically have a better chance of annotating the major expressed isoforms correctly. In any case, PE reads and higher coverage help to reconstruct lowly expressed transcripts, and replicates are essential to resolve false-positive calls (that is, mapping artifacts or contaminations) at the low end of signal detection. Several methods, such as Cufflinks [23], iReckon [24], SLIDE [25] and StringTie [26], incorporate existing annotations by adding them to the possible list of isoforms. Montebello [27] couples isoform discovery and quantification using a likelihood-based Monte Carlo algorithm to boost performance. Gene-finding tools such as Augustus [28] can incorporate RNA-seq data to better annotate protein-coding transcripts, but perform worse on non-coding transcripts [29]. In general, accurate transcript reconstruction from short reads is difficult, and methods typically show substantial disagreement [29].

De novo transcript reconstruction
When a reference genome is not available or is incomplete, RNA-seq reads can be assembled de novo (Fig. 2c) into a transcriptome using packages such as SOAPdenovo-Trans [30], Oases [31], Trans-ABySS [32] or Trinity [33]. In general, PE strand-specific sequencing and long reads are preferred because they are more informative [33]. Although it is impossible to assemble lowly expressed transcripts that lack enough coverage for a reliable assembly, too many reads are also problematic because they lead to potential misassembly and increased runtimes. Therefore, in silico reduction of the number of reads is recommended for deeply sequenced samples [33]. For comparative analyses across samples, it is advisable to combine all reads from multiple samples into a single input in order to obtain a consolidated set of contigs (transcripts), followed by mapping back of the short reads for expression estimation [33].

Either with a reference or de novo, the complete reconstruction of transcriptomes using short-read Illumina technology remains a challenging problem, and in many cases de novo assembly results in tens or hundreds of contigs accounting for fragmented transcripts. Emerging long-read technologies, such as SMRT from Pacific Biosciences, provide reads that are long enough to sequence complete transcripts for most genes and are a promising alternative that is discussed further in the “Outlook” section below.
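
The in silico reduction of reads recommended above for deeply sequenced samples can be illustrated with a toy coverage-based filter. The following is a minimal sketch of one possible approach (digital normalization by median k-mer count), not the procedure implemented by any of the assemblers cited above. The function names, the tiny k-mer size and the coverage cutoff are all hypothetical choices made for illustration; real tools use much longer k-mers and memory-efficient counting.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer of a read sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize_reads(reads, k=5, max_coverage=3):
    """Keep a read only if the median count of its k-mers, among the
    reads kept so far, is still below max_coverage.

    Illustrative assumption: reads are plain strings held in memory.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: skipped in this sketch
        if median(counts[km] for km in read_kmers) < max_coverage:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add coverage
    return kept


if __name__ == "__main__":
    abundant = "ACGTTGCAAG"  # stands in for a highly expressed transcript
    rare = "TTTTGGGGCC"      # stands in for a lowly expressed transcript
    reduced = normalize_reads([abundant] * 10 + [rare])
    print(len(reduced), "reads kept out of 11")
```

The point of the design is that redundant reads from highly expressed transcripts are dropped once their k-mers are already well covered, while reads from rare transcripts are retained, which is the behavior one would want before assembly.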
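
The distinction between uniquely mapped reads and multireads drawn in the Alignment section, together with the percentage of mapped reads used as a quality-control checkpoint, can be tallied directly from alignment records. The sketch below is only an illustration under stated assumptions: it reads SAM-formatted text lines, uses the standard FLAG bits for unmapped (0x4), secondary (0x100) and supplementary (0x800) records, and relies on the optional NH tag (number of reported alignments), which not every mapper emits. The function names are hypothetical, and a missing NH tag is treated here as a single alignment.

```python
def tally_mapping(sam_lines):
    """Count unmapped, uniquely mapped and multi-mapped primary records.

    Assumption: each input item is one SAM text line; for paired-end
    data every mate is counted as its own record.
    """
    tally = {"unmapped": 0, "unique": 0, "multi": 0}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue  # secondary/supplementary: count each read once
        if flag & 0x4:
            tally["unmapped"] += 1
            continue
        n_hits = 1  # assumption: no NH tag means one reported alignment
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                n_hits = int(tag[5:])
        tally["multi" if n_hits > 1 else "unique"] += 1
    return tally


def mapped_fraction(tally):
    """Fraction of primary records that mapped at least once."""
    total = sum(tally.values())
    return (tally["unique"] + tally["multi"]) / total if total else 0.0
```

A mapped fraction far below the range quoted earlier for human RNA-seq, or an unexpectedly large multi-mapped share, would be a prompt to look more closely at the sample before quantification.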

Transcript quantification

The most common application of RNA-seq is to estimate gene and transcript expression. This application is primarily based on the number of reads that map to each transcript sequence, although there are algorithms such as Sailfish that rely on k-mer counting in reads without the need for mapping [34]. The simplest approach to quantification is to aggregate raw counts of mapped reads using programs such as HTSeq-count [35] or featureCounts [36]. This gene-level (rather than transcript-level) quantification approach utilizes a gene transfer format (GTF) file [37] containing the genome coordinates of exons and genes, and often discards multireads. Raw read counts alone are not sufficient to compare expression levels among samples, as these values are affected by factors such as transcript length, total number of reads, and sequencing biases. The measure RPKM (reads per kilobase of exon model per million reads) [1] is a within-sample normalization method that will remove the feature-length and library-size effects. This measure and its subsequent derivatives FPKM (fragments per kilobase of exon model per million mapped reads), a within-sample normalized transcript expression measure analogous to RPKM, and TPM (transcripts per million) are the most frequently reported RNA-seq gene expression values. It should be noted that RPKM and FPKM are equivalent for SE reads and that FPKM can be converted into TPM using a simple formula [38]. The dichotomy of within-sample and between-sample comparisons has led to a lot of confusion in the literature. Correcting for gene length is not necessary when comparing changes in gene expression within the same gene across samples, but it is necessary for correctly ranking gene expression levels within the sample to account for the fact that longer genes accumulate more reads. Furthermore, programs such as Cufflinks that estimate gene length from the data can find significant differences in gene length between samples that cannot be ignored. TPMs, which effectively normalize for the differences in composition of the transcripts in the denominator rather than simply dividing by the number of reads in the library, are considered more comparable between samples of different origins and composition but can still suffer some biases. These must be addressed with normalization techniques such as TMM.

Several sophisticated algorithms have been developed to estimate transcript-level expression by tackling the problem of related transcripts’ sharing most of their reads. Cufflinks [39] estimates transcript expression from a mapping to the genome obtained from mappers such as TopHat using an expectation-maximization approach that estimates transcript abundances. This approach takes into account biases such as the non-uniform read distribution along the gene length. Cufflinks was designed to take advantage of PE reads, and may use GTF information to identify expressed transcripts, or can infer transcripts de novo from the mapping data alone. Algorithms that quantify expression from transcriptome mappings include RSEM (RNA-Seq by Expectation Maximization) [40], eXpress [41], Sailfish [35] and kallisto [42] among others. These methods allocate multi-mapping reads among transcripts and output within-sample normalized values corrected for sequencing biases [35, 41, 43]. Additionally, the RSEM algorithm uses an expectation maximization approach that returns TPM values [40]. NURD [44] provides an efficient way of estimating transcript expression from SE reads with a low memory and computing cost.

Differential gene expression analysis

Differential expression analysis (Fig. 1b) requires that gene expression values be compared among samples. RPKM, FPKM, and TPM normalize away the most important factor for comparing samples, which is sequencing depth, whether directly or by accounting for the number of transcripts, which can differ significantly between samples. These approaches rely on normalizing methods that are based on total or effective counts, and tend to perform poorly when samples have heterogeneous transcript distributions, that is, when highly and differentially expressed features can skew the count distribution [45, 46]. Normalization methods that take this into account are TMM [47], DESeq [48], PoissonSeq [49] and UpperQuartile [45], which ignore highly variable and/or highly expressed features. Additional factors that interfere with intra-sample comparisons include changes in transcript length across samples or conditions [50], positional biases in coverage along the transcript (which are accounted for in Cufflinks), average fragment size [43], and the GC content of genes (corrected in the EDAseq package [21]). The NOISeq R package [20] contains a wide variety of diagnostic plots to identify sources of biases in RNA-seq data and to apply appropriate normalization procedures in each case. Finally, despite these sample-specific normalization methods, batch effects may still be present in the data. These effects can be minimized by appropriate experimental design [51] or, alternatively, removed by batch-correction methods such as COMBAT [52] or ARSyN [20, 53]. These approaches, although initially developed for microarray data, have been shown to work well with normalized RNA-seq data (STATegra project, unpublished).

As RNA-seq quantification is based on read counts that are absolutely or probabilistically assigned to transcripts, the first approaches to compute differential expression used discrete probability distributions, such as
the Poisson or negative binomial [48, 54]. The negative binomial distribution (also known as the gamma-Poisson distribution) is a generalization of the Poisson distribution, allowing for additional variance (called overdispersion) beyond the variance expected from randomly sampling from a pool of molecules that are characteristic of RNA-seq data. However, the use of discrete distributions is not required for accurate analysis of differential expression as long as the sampling variance of small read counts is taken into account (most important for experiments with small numbers of replicates). Methods for transforming normalized counts of RNA-seq reads while learning the variance structure of the data have been shown to perform well in comparison to the discrete distribution approaches described above [55, 56]. Moreover, after extensive normalization (including TMM and batch removal), the data might have lost their discrete nature and be more akin to a continuous distribution.

Some methods, such as the popular edgeR [57], take as input raw read counts and introduce possible bias sources into the statistical model to perform an integrated normalization as well as a differential expression analysis. In other methods, the differential expression requires the data to be previously normalized to remove all possible biases. DESeq2, like edgeR, uses the negative binomial as the reference distribution and provides its own normalization approach [48, 58]. baySeq [59] and EBSeq [60] are Bayesian approaches, also based on the negative binomial model, that define a collection of models to describe the differences among experimental groups and to compute the posterior probability of each one of them for each gene. Other approaches include data transformation methods that take into account the sampling variance of small read counts and create continuous gene expression distributions that can be analyzed by regular linear models [55]. Finally, non-parametric approaches such as NOISeq [10] or SAMseq [61] make minimal assumptions about the data and estimate the null distribution for inferential analysis from the actual data alone. For small-scale studies that compare two samples with no or few replicates, the estimation of the negative binomial distribution can be noisy. In such cases, simpler methods based on the Poisson distribution, such as DEGseq [62], or on empirical distributions (NOISeq [10]) can be an alternative, although it should be strongly stressed that, in the absence of biological replication, no population inference can be made and hence any p value calculation is invalid. Methods that analyze RNA-seq data without replicates therefore only have exploratory value. Considering the drop in price of sequencing, we recommend that RNA-seq experiments have a minimum of three biological replicates when sample availability is not limiting to allow all of the differential expression methods to leverage reproducibility between replicates.

Recent independent comparison studies have demonstrated that the choice of the method (or even the version of a software package) can markedly affect the outcome of the analysis and that no single method is likely to perform favorably for all datasets [56, 63, 64] (Box 4). We therefore recommend thoroughly documenting the settings and version numbers of programs used and considering the repetition of important analyses using more than one package.

Alternative splicing analysis

Transcript-level differential expression analysis can potentially detect changes in the expression of transcript isoforms from the same gene, and specific algorithms for alternative splicing-focused analysis using RNA-seq have been proposed. These methods fall into two major categories. The first approach integrates isoform expression estimation with the detection of differential expression to reveal changes in the proportion of each isoform within the total gene expression. One such early method, BASIS, used a hierarchical Bayesian model to directly infer differentially expressed transcript isoforms [65]. CuffDiff2 estimates isoform expression first and then compares their differences. By integrating the two steps, the uncertainty in the first step is taken into consideration when performing the statistical analysis to look for differential isoform expression [66]. The flow difference metric (FDM) uses aligned cumulative transcript graphs from mapped exon reads and junction reads to infer isoforms and the Jensen-Shannon divergence to measure the difference [67]. Recently, Shi and Jiang [68] proposed a new method, rSeqDiff, that uses a hierarchical likelihood ratio test to detect differential gene expression without splicing change and differential isoform expression simultaneously. All these approaches are generally hampered by the intrinsic limitations of short-read sequencing for accurate identification at the isoform level, as discussed in the RNA-seq Genome Annotation Assessment Project paper [30].

The so-called ‘exon-based’ approach skips the estimation of isoform expression and detects signals of alternative splicing by comparing the distributions of reads on exons and junctions of the genes between the compared samples. This approach is based on the premise that differences in isoform expression can be tracked in the signals of exons and their junctions. DEXseq [69] and DSGSeq [70] adopt a similar idea to detect differentially spliced genes by testing for significant differences in read counts on exons (and junctions) of the genes. rMATS detects differential usage of exons by comparing exon-inclusion levels defined with junction reads [71]. rDiff detects differential isoform expression by comparing
read counts on alternative regions of the gene, either with or without annotated alternative isoforms [72]. DiffSplice uses alignment graphs to identify alternative splicing modules (ASMs) and identifies differential splicing using signals of the ASMs [73]. The advantage of exon or junction methods is their greater accuracy in identifying individual alternative splicing events. Exon-based methods are appropriate if the focus of the study is not on whole isoforms but on the inclusion and exclusion of specific exons and the functional protein domains (or regulatory features, in case of untranslated region exons) that they contain.

Box 4. Comparison of software tools for detecting differential gene and transcript expression

Many statistical methods are available for detecting differential gene or transcript expression from RNA-seq data, and a major practical challenge is how to choose the most suitable tool for a particular data analysis job. Most comparison studies have focused on simulated datasets [56, 208, 209] or on samples to which exogenous RNA (‘spike-in’) has been added in known quantities [63, 196]. This enables a direct assessment of the sensitivity and specificity of the methods as well as their FDR control. As simulations typically rely on specific statistical distributions or on limited experimental datasets and as spike-in datasets represent only technical replicates with minimal variation, comparisons using simulated datasets have been complemented with more practical comparisons in real datasets with true biological replicates [64, 210, 211].

As yet, no clear consensus has been reached regarding the best practices and the field is continuing to evolve rapidly. However, some common findings have been made in multiple comparison studies and in different study settings. First, specific caution is needed with all the methods when the number of replicate samples is very small or for genes that are expressed at very low levels [55, 64, 209]. Among the tools, limma has been shown to perform well under many circumstances and it is also the fastest to run [56, 63, 64]. DESeq and edgeR perform similarly in ranking genes but are often relatively conservative or too liberal, respectively, in controlling FDR [63, 209, 210]. SAMseq performs well in terms of FDR but presents an acceptable sensitivity only when the number of replicates is relatively high, at least 10 [20, 55, 209]. NOISeq and NOISeqBIO (the adaptation of NOISeq for biological replication) are more efficient in avoiding false positive calls at the cost of some sensitivity but perform well with different numbers of replicates [10, 20, 212]. Cuffdiff and Cuffdiff2 have performed surprisingly poorly in the comparisons [56, 63]. This probably reflects the fact that detecting differential expression at the transcript level remains challenging and involves uncertainties in assigning the reads to alternative isoforms. In a recent comparison, BitSeq compared favorably to other transcript-level packages such as Cuffdiff2 [196]. Besides the actual performance, other issues affecting the choice of the tool include ease of installation and use, computational requirements, and quality of documentation and instructions. Finally, an important consideration when choosing an analysis method is the experimental design. While some of the differential expression tools can only perform a pair-wise comparison, others such as edgeR [57], limma-voom [55], DESeq [48], DESeq2 [58], and maSigPro [213] can perform multiple comparisons, include different covariates or analyze time-series data.

Visualization

Visualization of RNA-seq data (Fig. 1c) is, in general terms, similar to that of any other type of genomic sequencing data, and it can be done at the level of reads (using ReadXplorer [74], for example) or at the level of processed coverage (read pileup), unnormalized (for example, total count) or normalized, using genome browsers such as the UCSC browser [75], Integrative Genomics Viewer (IGV) [76] (Figure S1a in Additional file 1), Genome Maps [77], or Savant [78]. Some visualization tools are specifically designed for visualizing multiple RNA-seq samples, such as RNAseqViewer [79], which provides flexible ways to display the read abundances on exons, transcripts and junctions. Introns can be hidden to better display signals on the exons, and the heatmaps can help the visual comparison of signals on multiple samples (Figure S1b, c in Additional file 1). However, RNAseqViewer is slower than IGV.

Some of the software packages for differential gene expression analysis (such as DESeq2 or DEXseq in Bioconductor) have functions to enable the visualization of results, whereas others have been developed for visualization-exclusive purposes, such as CummeRbund (for CuffDiff [66]) or Sashimi plots, which can be used to visualize differentially spliced exons [80]. The advantage of Sashimi plots is that their display of junction reads is more intuitive and aesthetically pleasing when the number of samples is small (Figure S1d in Additional file 1). Sashimi, structure, and hive plots for splicing quantitative trait loci (sQTL) can be obtained using SplicePlot [81]. Splice graphs can be produced using SpliceSeq [82], and SplicingViewer [83] plots splice junctions and alternative splicing events. TraV [84] is a visualization tool that integrates data analysis, but its analytical methods are not applicable to large genomes.

Owing to the complexity of transcriptomes, efficient display of multiple layers of information is still a challenge. All of the tools are evolving rapidly and we can expect more comprehensive tools with desirable features to be available soon. Nevertheless, the existing tools are of great value for exploring results for individual genes
of biological interest to assess whether particular analyses’ results can withstand detailed scrutiny or to reveal potential complications caused by artifacts, such as 3’ biases or complicated transcript structures. Users should visualize changes in read coverage for genes that are deemed important or interesting on the basis of their analysis results to evaluate the robustness of their conclusions.

Gene fusion discovery

The discovery of fused genes that can arise from chromosomal rearrangements is analogous to novel isoform discovery, with the added challenge of a much larger search space as we can no longer assume that the transcript segments are co-linear on a single chromosome. Artifacts are common even using state-of-the-art tools, which necessitates post-processing using heuristic filters [85]. Artifacts primarily result from misalignment of read sequences due to polymorphisms, homology, and sequencing errors. Families of homologous genes, and highly polymorphic genes such as the HLA genes, produce reads that cannot be easily mapped uniquely to their location of origin in the reference genome. For genes with very high expression, the small but non-negligible sequencing error rate of RNA-seq will produce reads that map incorrectly to homologous loci. Filtering highly polymorphic genes and pairs of homologous genes is recommended [86, 87]. Also recommended is the filtering of highly expressed genes that are unlikely to be involved in gene fusions, such as ribosomal RNA [86]. Finally, a low ratio of chimeric to wild-type reads in the vicinity of the fusion boundary may indicate spurious mis-mapping of reads from a highly expressed gene (the transcript allele fraction described by Yoshihara et al. [87]).

Given successful prediction of chimeric sequences, the next step is the prioritization of gene fusions that have biological impact over more expected forms of genomic variation. Examples of expected variation include immunoglobulin (IG) rearrangements in tumor samples infiltrated by immune cells, transiently expressed transposons and nuclear mitochondrial DNA, and read-through chimeras produced by co-transcription of adjacent genes [88]. Care must be taken with filtering in order not to lose events of interest. For example, removing all fusions involving an IG gene may remove real IG fusions in lymphomas and other blood disorders; filtering fusions for which both genes are from the IG locus is preferred [88]. Transiently expressed genomic breakpoint sequences that are associated with real gene fusions often overlap transposons; these should be filtered unless they are associated with additional fusion isoforms from the same gene pair [89]. Read-through chimeras are easily identified as predictions involving alternative splicing between adjacent genes. Where possible, fusions should be filtered by their presence in a set of control datasets [87]. When control datasets are not available, artifacts can be identified by their presence in a large number of unrelated datasets, after excluding the possibility that they represent true recurrent fusions [90, 91].

Strong fusion-sequence predictions are characterized by distinct subsequences that each align with high specificity to one of the fused genes. As alignment specificity is highly correlated with sequence length, a strong prediction sequence is longer, with longer subsequences from each gene. Longer reads and larger insert sizes produce longer predicted sequences; thus, we recommend PE RNA-seq data with larger insert size over SE datasets or datasets with short insert size. Another indicator of prediction strength is splicing. For most known fusions, the genomic breakpoint is located in an intron of each gene [92] and the fusion boundary coincides with a splice site within each gene. Furthermore, fusion isoforms generally follow the splicing patterns of wild-type genes. Thus, high confidence predictions have fusion boundaries coincident with exon boundaries and exons matching wild-type exons [91]. Fusion discovery tools often incorporate some of the aforementioned ideas to rank fusion predictions [93, 94], though most studies apply additional custom heuristic filters to produce a list of high-quality fusion candidates [90, 91, 95].

Small RNAs

Next-generation sequencing represents an increasingly popular method to address questions concerning the biological roles of small RNAs (sRNAs). sRNAs are usually 18–34 nucleotides in length, and they include miRNAs, short-interfering RNAs (siRNAs), PIWI-interacting RNAs (piRNAs), and other classes of regulatory molecules. sRNA-seq libraries are rarely sequenced as deeply as regular RNA-seq libraries because of a lack of complexity, with a typical range of 2–10 million reads. Bioinformatics analysis of sRNA-seq data differs from standard RNA-seq protocols (Fig. 1c). Ligated adaptor sequences are first trimmed and the resulting read-length distribution is computed. In animals, there are usually peaks for 22 and 23 nucleotides, whereas in plants there are peaks for 21- and 24-nucleotide redundant reads. For instance, miRTools 2.0 [96], a tool for prediction and profiling of sRNA species, uses by default reads that are 18–30 bases long. The threshold value depends on the application, and in case of miRNAs is usually in the range of 19–25 nucleotides.

As in standard RNA-seq, sRNA reads must then be aligned to a reference genome or transcriptome sequences using standard tools, such as Bowtie2 [97], STAR [15], or Burrows-Wheeler Aligner (BWA) [98].
There are, however, some aligners (such as PatMaN [99] and MicroRazerS [100]) that have been designed to map short sequences with preset parameter value ranges suited for optimal alignment of short reads. The mapping itself may be performed with or without mismatches, the latter being used more commonly. In addition, reads that map beyond a predetermined set number of locations may be removed as putatively originating from repetitive elements. In the case of miRNAs, usually 5–20 distinct mappings per genome are allowed. sRNA reads are then simply counted to obtain expression values. However, users should also verify that their sRNA reads are not significantly contaminated by degraded mRNA, for example, by checking whether a miRNA library shows unexpected read coverage over the body of highly expressed genes such as GAPDH or ACTB.

Further analysis steps include comparison with known sRNAs and de novo identification of sRNAs. There are class-specific tools for this purpose, such as miRDeep [101] and miRDeep-P [102] for animal and plant miRNAs, respectively, or the trans-acting siRNA prediction tool at the UEA sRNA Workbench [103]. Tools such as miRTools 2.0 [96], ShortStack [104], and iMir [105] also exist for comprehensive annotation of sRNA libraries and for identification of diverse classes of sRNAs.

Functional profiling with RNA-seq

The last step in a standard transcriptomics study (Fig. 1b) is often the characterization of the molecular functions or pathways in which differentially expressed genes (DEGs) are involved. The two main approaches to functional characterization that were developed first for microarray technology are (a) comparing a list of DEGs against the rest of the genome for overrepresented functions, and (b) gene set enrichment analysis (GSEA), which is based on ranking the transcriptome according to a measurement of differential expression. RNA-seq biases such as gene length complicate the direct applications of these methods for count data and hence RNA-seq-specific tools have been proposed. For example, GOseq [106] estimates a bias effect (such as gene length) on differential expression results and adapts the traditional hypergeometric statistic used in the functional enrichment test to account for this bias. Similarly, the Gene Set Variation Analysis (GSVA) [107] or SeqGSEA [108] packages also combine splicing and implement enrichment analyses similar to GSEA.

Functional analysis requires the availability of sufficient functional annotation data for the transcriptome under study. Resources such as Gene Ontology [109], Bioconductor [110], DAVID [111, 112] or Babelomics [113] contain annotation data for most model species. However, novel transcripts discovered during de novo transcriptome assembly or reconstruction would lack at least some functional information and therefore annotation is necessary for functional profiling of those results. Protein-coding transcripts can be functionally annotated using orthology by searching for similar sequences in protein databases such as SwissProt [114] and in databases that contain conserved protein domains such as Pfam [115] and InterPro [116]. The use of standard vocabularies such as the Gene Ontology (GO) allows for some exchangeability of functional information across orthologs. Popular tools such as Blast2GO [117] allow massive annotation of complete transcriptome datasets against a variety of databases and controlled vocabularies. Typically, between 50 and 80 % of the transcripts reconstructed from RNA-seq data can be annotated with functional terms in this way. However, RNA-seq data also reveal that an important fraction of the transcriptome is lacking protein-coding potential. The functional annotation of these long non-coding RNAs is more challenging as their conservation is often less pronounced than that of protein-coding genes. The Rfam database [118] contains most well-characterized RNA families, such as ribosomal or transfer RNAs, while miRBase [119] or Miranda [120] are specialized in miRNAs. These resources can be used for similarity-based annotation of short non-coding RNAs, but no standard functional annotation procedures are available yet for other RNA types such as the long non-coding RNAs.

Integration with other data types

The integration of RNA-seq data with other types of genome-wide data (Fig. 1c) allows us to connect the regulation of gene expression with specific aspects of molecular physiology and functional genomics. Integrative analyses that incorporate RNA-seq data as the primary gene expression readout that is compared with other genomic experiments are becoming increasingly prevalent. Below, we discuss some of the additional challenges posed by such analyses.

DNA sequencing

The combination of RNA and DNA sequencing can be used for several purposes, such as single nucleotide polymorphism (SNP) discovery, RNA-editing analyses, or expression quantitative trait loci (eQTL) mapping. In a typical eQTL experiment, genotype and transcriptome profiles are obtained from the same tissue type across a relatively large number of individuals (>50) and correlations between genotype and expression levels are then detected. These associations can unravel the genetic basis of complex traits such as height [121], disease susceptibility [122] or even features of genome architecture [123, 124]. Large eQTL studies have shown that genetic variation affects the expression of most genes [125–128].

RNA-seq has two major advantages over array-based technologies for detecting eQTLs. First, it can identify variants that affect transcript processing. Second, reads that overlap heterozygous SNPs can be mapped to maternal and paternal chromosomes, enabling quantification of allele-specific expression within an individual [129]. Allele-specific signals provide additional information about a genetic effect on transcription, and a number of computational methods have recently become available that leverage these signals to boost power for association mapping [130–132]. One challenge of this approach is the computational burden, as billions of gene–SNP associations need to be tested; bootstrapping or permutation-based approaches [133] are frequently used [134, 135]. Many studies have focused on testing only SNPs in the cis region surrounding the gene in question, and computationally efficient approaches have been developed recently to allow extremely swift mapping of eQTLs genome-wide [136]. Moreover, the combination of RNA-seq and re-sequencing can be used both to remove false positives when inferring fusion genes [88] and to analyze copy number alterations [137].

DNA methylation

Pairwise DNA-methylation and RNA-seq integration, for the most part, has consisted of the analysis of the correlation between DEGs and methylation patterns [138–140]. General linear models [141–143], logistic regression models [143] and empirical Bayes models [144] have been attempted among other modeling approaches. The statistically significant correlations that were observed, however, accounted for relatively small effects. An interesting shift away from focusing on individual gene–CpG methylation correlations is to use a network-interaction-based approach to analyze RNA-seq in relation to DNA methylation. This approach identifies one or more sets of genes (also called modules) that have coordinated differential expression and differential methylation [145].

Chromatin features

The combination of RNA-seq and transcription factor (TF) chromatin immunoprecipitation sequencing (ChIP-seq) data can be used to remove false positives in ChIP-seq analysis and to suggest the activating or repressive effect of a TF on its target genes. For example, BETA [146] uses differential gene expression in combination with peaks from ChIP-seq experiments to call TF targets. In addition, ChIP-seq experiments involving histone modifications have been used to understand the general role of these epigenomic changes on gene expression [147, 148]. Other RNA-ChIP-sequencing integrative approaches are reviewed in [149]. Integration of open chromatin data such as that from FAIRE-seq and DNase-seq with RNA-seq has mostly been limited to verifying the expression status of genes that overlap a region of interest [150]. DNase-seq can be used for genome-wide footprinting of DNA-binding factors, and this in combination with the actual expression of genes can be used to infer active transcriptional networks [150].

MicroRNAs

Integration of RNA-seq and miRNA-seq data has the potential to unravel the regulatory effects of miRNAs on transcript steady-state levels. This analysis is challenging, however, because of the very noisy nature of miRNA target predictions, which hampers analyses based on correlations between miRNAs and their target genes. Associations might be found in databases such as miRWalk [151] and miRBase [152] that offer target prediction according to various algorithms. Tools such as CORNA [153], MMIA [154, 155], MAGIA [156], and SePIA [157] refine predictions by testing for significant associations between genes, miRNAs, pathways and GO terms, or by testing the relatedness or anticorrelation of the expression profiles of both the target genes and the associated miRNAs. In general, we recommend using miRNA–mRNA associations that are predicted by several algorithms. For example, in mouse, we found that requiring miRNA–mRNA association in five databases resulted in about 50 target mRNA predictions per miRNA (STATegra observations).

Proteomics and metabolomics

Integration of RNA-seq with proteomics is controversial because the two measurements show generally low correlation (~0.40 [158, 159]). Nevertheless, pairwise integration of proteomics and RNA-seq can be used to identify novel isoforms. Unreported peptides can be predicted from RNA-seq data and then used to complement databases normally queried in mass spectrometry as done by Low et al. [160]. Furthermore, post-translational editing events may be identified if peptides that are present in the mass spectrometry analysis are absent from the expressed genes of the RNA-seq dataset. Integration of transcriptomics with metabolomics data has been used to identify pathways that are regulated at both the gene expression and the metabolite level, and tools are available that visualize results within the pathway context (MassTRIX [161], Paintomics [162], VANTED v2 [163], and SteinerNet [164]).

Integration and visualization of multiple data types

Integration of more than two genomic data types is still in its infancy and not yet extensively applied to functional sequencing techniques, but there are already some tools that combine several data types. SNMNMF [165] and PIMiM [166] combine mRNA and miRNA expression
data with protein–protein, DNA–protein, and miRNA–mRNA interaction networks to identify miRNA–gene regulatory modules. MONA [167] combines different levels of functional genomics data, including mRNA, miRNA, DNA methylation, and proteomics data to discover altered biological functions in the samples being studied. Paintomics can integrate any type of functional genomics data into pathway analysis, provided that the features can be mapped onto genes or metabolites [162]. 3Omics [168] integrates transcriptomics, metabolomics and proteomics data into regulatory networks.
In all cases, integration of different datasets is rarely straightforward because each data type is analyzed separately with its own tailored algorithms that yield results in different formats. Tools that facilitate format conversions and the extraction of relevant results can help; examples of such workflow construction software packages include Anduril [169], Galaxy [170] and Chipster [171]. Anduril was developed for building complex pipelines with large datasets that require automated parallelization. The strength of Galaxy and Chipster is their usability; visualization is a key component of their design. Simultaneous or integrative visualization of the data in a genome browser is extremely useful for both data exploration and interpretation of results. Browsers can display in tandem mappings from most next-generation sequencing technologies, while adding custom tracks such as gene annotation, nucleotide variation or ENCODE datasets. For proteomics integration, the PG Nexus pipeline [172] converts mass spectrometry data to mappings that are co-visualized with RNA-seq alignments.

Outlook
RNA-seq has become the standard method for transcriptome analysis, but the technology and tools are continuing to evolve. It should be noted that the agreement between results obtained from different tools is still unsatisfactory and that results are affected by parameter settings, especially for genes that are expressed at low levels. The two major highlights in the current application of RNA-seq are the construction of transcriptomes from small amounts of starting materials and better transcript identification from longer reads. The state of the art in both of these areas is changing rapidly, but we will briefly outline what can be done now and what can be expected in the near future.

Single-cell RNA-seq
Single-cell RNA-seq (scRNA-seq) is one of the newest and most active fields of RNA-seq with its unique set of opportunities and challenges. Newer protocols such as Smart-seq [173] and Smart-seq2 [174] have enabled us to work from very small amounts of starting mRNA that, with proper amplification, can be obtained from just a single cell. The resulting single-cell libraries enable the identification of new, uncharacterized cell types in tissues. They also make it possible to measure a fascinating phenomenon in molecular biology, the stochasticity of gene expression in otherwise identical cells within a defined population. In this context, single-cell studies are meaningful only when a set of individual cell libraries are compared with the cell population, with the aim of identifying subgroups of multiple cells with distinct combinations of expressed genes. Differences may be due to naturally occurring factors such as stage of the cell cycle, or may reflect rare cell types such as cancer stem cells. Recent rapid progress in methodologies for single-cell preparation, including the availability of single-cell platforms such as the Fluidigm C1 [8], has increased the number of individual cells analyzed from a handful to 50–90 per condition up to 800 cells at a time. Other methods, such as DROP-seq [175], can profile more than 10,000 cells at a time. This increased number of single-cell libraries in each experiment directly allows for the identification of smaller subgroups within the population.
The small amount of starting material and the PCR amplification limit the depth to which single-cell libraries can be sequenced productively, often to less than a million reads. Deeper sequencing for scRNA-seq will do little to improve quantification as the number of individual mRNA molecules in a cell is small (on the order of 100–300,000 transcripts) and only a fraction of them are successfully reverse-transcribed to cDNA [8, 176]; but deeper sequencing is potentially useful for discovering and measuring allele-specific expression, as additional reads could provide useful evidence.
Single-cell transcriptomes typically include about 3000–8000 expressed genes, which is far fewer than are counted in the transcriptomes of the corresponding pooled populations. The challenge is to distinguish the technical noise that results from a lack of sensitivity at the single-molecule level [173] (where capture rates of around 10–50 % result in the frequent loss of the most lowly expressed transcripts) from true biological noise, where a transcript might not be transcribed and present in the cell for a certain amount of time while the protein is still present. The inclusion of added reference transcripts and the use of unique molecular identifiers (UMIs) have been applied to overcome amplification bias and to improve gene quantification [177, 178]. Methods that can quantify gene-level technical variation allow us to focus on biological variation that is likely to be of interest [179]. Typical quality-control steps involve setting aside libraries that contain few reads, libraries that have a low mapping rate, and libraries that have zero expression levels for housekeeping genes, such as GAPDH and ACTB, that are expected to be expressed at a detectable level.
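The quality-control steps listed above (few reads, low mapping rate, undetected housekeeping genes) can be sketched as a simple per-library filter. This is a minimal illustration only, not a method taken from the review: the `CellLibrary` structure, the function name, and the default thresholds (100,000 reads, 50 % mapping rate) are hypothetical and would need to be chosen for each protocol, and the sketch assumes that a library is set aside when any one of the listed housekeeping genes has a zero count.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class CellLibrary:
    """Summary statistics for one single-cell library (hypothetical structure)."""
    name: str
    total_reads: int
    mapped_reads: int
    counts: Dict[str, int]  # gene symbol -> read (or UMI) count

    @property
    def mapping_rate(self) -> float:
        # Fraction of reads that mapped; 0.0 for an empty library.
        return self.mapped_reads / self.total_reads if self.total_reads else 0.0


def filter_libraries(
    libraries: Sequence[CellLibrary],
    min_reads: int = 100_000,           # assumed threshold, not from the review
    min_mapping_rate: float = 0.5,      # assumed threshold, not from the review
    housekeeping: Sequence[str] = ("GAPDH", "ACTB"),
) -> Tuple[List[CellLibrary], List[Tuple[str, List[str]]]]:
    """Split libraries into kept ones and (name, reasons) pairs for set-aside ones."""
    kept: List[CellLibrary] = []
    set_aside: List[Tuple[str, List[str]]] = []
    for lib in libraries:
        reasons: List[str] = []
        if lib.total_reads < min_reads:
            reasons.append("few reads")
        if lib.mapping_rate < min_mapping_rate:
            reasons.append("low mapping rate")
        # Assumption: a zero count for any housekeeping gene flags the library.
        missing = [g for g in housekeeping if lib.counts.get(g, 0) == 0]
        if missing:
            reasons.append("no expression: " + ", ".join(missing))
        if reasons:
            set_aside.append((lib.name, reasons))
        else:
            kept.append(lib)
    return kept, set_aside
```

Returning the reasons alongside the names of the set-aside libraries makes it easier to see whether one criterion dominates, which may indicate that a threshold is poorly matched to the protocol.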

Depending on the chosen single-cell protocol and the aims of the experiment, different bulk RNA-seq pipelines and tools can be used for different stages of the analysis as reviewed by Stegle et al. [180]. Single-cell libraries are typically analyzed by mapping to a reference transcriptome (using a program such as RSEM) without any attempt at new transcript discovery, although at least one package maps to the genome (Monocle [181]). While mapping onto the genome does result in a higher overall read-mapping rate, studies that are focused on gene expression alone with fewer reads per cell tend to use mapping to the reference transcriptome for the sake of simplicity. Other single-cell methods have been developed to measure single-cell DNA methylation [182] and single-cell open chromatin using ATAC-seq [183, 184]. At present, we can measure only one functional genomic data-type at a time in the same single cell, but we can expect that in the near future we will be able to recover the transcriptome of a single cell simultaneously with additional functional data.

Long-read sequencing
The major limitation of short-read RNA-seq is the difficulty in accurately reconstructing expressed full-length transcripts from the assembly of reads. This is particularly complicated in complex transcriptomes, where different but highly similar isoforms of the same gene are expressed, and for genes that have many exons and possible alternative promoters or 3’ ends. Long-read technologies, such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore, that were initially applied to genome sequencing are now being used for transcriptomics and have the potential to overcome this assembly problem. Long-read sequencing provides amplification-free, single-molecule sequencing of cDNAs that enables recovery of full-length transcripts without the need for an assembly step. PacBio adds adapters to the cDNA molecule and creates a circularized structure that can be sequenced with multiple passes within one single long read. The Nanopore GridION system can directly sequence RNA strands by using RNA processive enzymes and RNA-specific bases. Another interesting technology was previously known as Moleculo (now Illumina’s TruSeq synthetic long-read technology), where Illumina library preparation is multiplexed and restricted to a limited number of long DNA molecules that are separately bar-coded and pooled back for sequencing. As one barcode corresponds to a limited number of molecules, assembly is greatly simplified and unambiguous reconstruction to long contigs is possible. This approach has recently been published for RNA-seq analysis [185].
PacBio RNA-seq is the long-read approach with the most publications to date. The technology has proven useful for unraveling isoform diversity at complex loci [186], and for determining allele-specific expression from single reads [187]. Nevertheless, long-read sequencing has its own set of limitations, such as a still high error rate that limits de novo transcript identifications and forces the technology to leverage the reference genome [188]. Moreover, the relatively low throughput of SMRT cells hampers the quantification of transcript expression. These two limitations can be addressed by matching PacBio experiments with regular, short-read RNA-seq. The accurate and abundant Illumina reads can be used both to correct long-read sequencing errors and to quantify transcript levels [189]. Updates in PacBio chemistry are increasing sequencing lengths to produce reads with a sufficient number of passes over the cDNA molecule to autocorrect sequencing errors. This will eventually improve sequencing accuracy and allow for genome-free determination of isoform-resolved transcriptomes.

Additional file
Additional file 1: Figure S1. Screenshots of RNA-seq data visualization. a Integrative Genomics Viewer (IGV) [77] display of a gene detected as differentially expressed between the two groups of samples by DEGseq [62]. The bottom track in the right panel is the gene annotation. The tracks are five samples from each group. b RNAseqViewer [80] display of the same data as in (a). c RNAseqViewer heatmap display of a gene detected as differentially spliced between two groups by both DSGSeq [70] and DEXSeq [69]. Introns are hidden in the display to emphasize the signals on the exons. d MISO [81] display of another gene detected as differentially spliced, with junction reads illustrated. (PDF 1152 kb)

Abbreviations
ASM: Alternative splicing module; ChIP-seq: Chromatin immunoprecipitation sequencing; DEG: Differentially expressed genes; eQTL: Expression quantitative trait loci; FDR: False discovery rate; FPKM: Fragments per kilobase of exon model per million mapped reads; GO: Gene Ontology; GSEA: Gene set enrichment analysis; GTF: Gene transfer format; IG: Immunoglobulin; IGV: Integrative Genomics Viewer; miRNA: MicroRNA; mRNA: Messenger RNA; PCA: Principal component analysis; PE read: Paired-end read; RNA-seq: RNA-sequencing; RPKM: Reads per kilobase of exon model per million reads; rRNA: Ribosomal RNA; RSEM: RNA-Seq by Expectation Maximization; scRNA-seq: Single-cell RNA-seq; SE read: Single-end read; siRNA: Short-interfering RNA; SNP: Single nucleotide polymorphism; sQTL: Splicing quantitative trait loci; sRNA: Small RNA; TF: Transcription factor; TPM: Transcripts per million.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
ACo, PM and AM conceived the idea and shaped the structure of the manuscript. ACo drafted the experimental design, alignment and functional profiling sections and integrated contributions from all authors. PM drafted the visualization and de novo transcript reconstruction sections, and coordinated author contributions. ST drafted the quality-control and differential expression sections. DGC drafted the experimental design and integration sections. ACe contributed to drafting the integration section. AMP drafted the transcript fusion section. MWS drafted the small RNA section. DG drafted the eQTL section. LLE drafted the software comparison for differential expression section. LLE and XZ drafted the transcript isoform analysis sections. XZ contributed to drafting the visualization section. AM drafted the introduction and outlook sections and globally edited the manuscript. All authors read and approved the final manuscript.

Acknowledgements
The authors would like to thank Michael Love and Harold Pimentel for helpful suggestions on the initial draft of the manuscript. AC, ST, AM, DGC were supported by the FP7 STATegra project (grant 36000). Research in AC’s laboratory was supported by MINECO grant BIO2012-40244 and co-funded with European Regional Development Funds (ERDF). Research in PM’s laboratory is supported by ERC starting grant Relieve-IMDs and by a core support grant from the Wellcome Trust and MRC to the Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute. XZ was supported by the National Basic Research Program of China (2012CB316504). LLE was supported by JDRF (grant number 2-2013-32) and by the Sigrid Juselius Foundation. ACe was supported by the Academy of Finland (Center of Excellence in Cancer Genetics Research).

Author details
1 Institute for Food and Agricultural Sciences, Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32603, USA.
2 Centro de Investigación Príncipe Felipe, Genomics of Gene Expression Laboratory, 46012 Valencia, Spain.
3 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
4 Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, Anne McLaren Laboratory for Regenerative Medicine, Department of Surgery, University of Cambridge, Cambridge CB2 0SZ, UK.
5 Department of Applied Statistics, Operations Research and Quality, Universidad Politécnica de Valencia, 46020 Valencia, Spain.
6 Unit of Computational Medicine, Department of Medicine, Karolinska Institutet, Karolinska University Hospital, 171 77 Stockholm, Sweden.
7 Center for Molecular Medicine, Karolinska Institutet, 17177 Stockholm, Sweden.
8 Unit of Clinical Epidemiology, Department of Medicine, Karolinska University Hospital, L8, 17176 Stockholm, Sweden.
9 Science for Life Laboratory, 17121 Solna, Sweden.
10 Systems Biology Laboratory, Institute of Biomedicine and Genome-Scale Biology Research Program, University of Helsinki, 00014 Helsinki, Finland.
11 School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada.
12 Department of Bioinformatics, Institute of Molecular Biology and Biotechnology, Adam Mickiewicz University in Poznań, 61-614 Poznań, Poland.
13 Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, FI-20520 Turku, Finland.
14 Key Lab of Bioinformatics/Bioinformatics Division, TNLIST and Department of Automation, Tsinghua University, Beijing 100084, China.
15 School of Life Sciences, Tsinghua University, Beijing 100084, China.
16 Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697-2300, USA.
17 Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA.

References
1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:1–8.
2. Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods. 2010;7:709–15.
3. Parkhomchuk D, Borodina T, Amstislavskiy V, Banaru M, Hallen L, Krobitsch S, et al. Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 2009;37:e123.
4. Katz Y, Wang ET, Airoldi EM, Burge CB. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods. 2010;7:1009–15.
5. Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011;8:469–77.
6. Łabaj PP, Leparc GG, Linggi BE, Markillie LM, Wiley HS, Kreil DP. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics. 2011;27:i383–91.
7. Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014;15:121–32.
8. Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH, et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol. 2014;32:1053–8.
9. Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science. 2014;343:776–9.
10. Tarazona S, Garcia-Alcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNA-seq: a matter of depth. Genome Res. 2011;21:2213–23.
11. Andrews S. FASTQC. A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 29 September 2014.
12. Dai M, Thompson RC, Maher C, Contreras-Galindo R, Kaplan MH, Markovitz DM, et al. NGSQC: cross-platform quality analysis pipeline for deep sequencing data. BMC Genomics. 2010;11 Suppl 4:S7.
13. FASTX-Toolkit. http://hannonlab.cshl.edu/fastx_toolkit/. Accessed 12 January 2016.
14. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20.
15. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.
16. Picard. http://picard.sourceforge.net/. Accessed 12 January 2016.
17. Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012;28:2184–5.
18. García-Alcalde F, Okonechnikov K, Carbonell J, Cruz LM, Götz S, Tarazona S, et al. Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics. 2012;28:2678–9.
19. Tarazona S, Furió-Tarí P, Turrà D, Pietro AD, Nueda MJ, Ferrer A, et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res. 2015;43:e140.
20. Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-seq data. BMC Bioinformatics. 2011;12:480.
21. Steijger T, Abril JF, Engström PG, Kokocinski F, Hubbard TJ, Guigó R, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013;10:1177–84.
22. Boley N, Stoiber MH, Booth BW, Wan KH, Hoskins RA, Bickel PJ, et al. Genome-guided transcript assembly by integrative analysis of RNA sequence data. Nat Biotechnol. 2014;32:341–6.
23. Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011;27:2325–9.
24. Mezlini AM, Smith EJ, Fiume M, Buske O, Savich GL, Shah S, et al. iReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res. 2013;23:519–29.
25. Li JJ, Jiang CR, Brown JB, Huang H, Bickel PJ. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc Natl Acad Sci U S A. 2011;108:19867–72.
26. Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–5.
27. Hiller D, Wong WH. Simultaneous isoform discovery and quantification from RNA-Seq. Stat Biosci. 2013;5:100–18.
28. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34:W435–9.
29. Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Rätsch G, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013;10:1185–91.
30. Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, et al. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics. 2014;30:1660–6.
31. Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012;28:1086–92.
32. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol. 2011;29:644–52.
33. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8:1494–512.
34. Patro R, Mount SM, Kingsford C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol. 2014;32:462–4.
35. Anders S, Pyl PT, Huber W. HTSeq - a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166–9.
36. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–30.

37. UCSC Genome Bioinformatics: Frequently Asked Questions: Data File Formats. https://genome.ucsc.edu/FAQ/FAQformat.html#format4. Accessed on 12 January 2016.
38. Pachter L. Models for transcript quantification from RNA-seq. arXiv.org. 2011. http://arxiv.org/abs/1104.3889. Accessed 6 January 2016.
39. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511–5.
40. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
41. Roberts A, Pachter L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods. 2013;10:71–3.
42. Bray N, Pimentel H, Melsted P, Pachter L. Near-optimal RNA-Seq quantification with kallisto. https://liorpachter.wordpress.com/2015/05/10/near-optimal-rna-seq-quantification-with-kallisto/. Accessed 6 January 2016.
43. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12:R22.
44. Ma X, Zhang X. NURD: an implementation of a new method to estimate isoform expression from non-uniform RNA-seq data. BMC Bioinformatics. 2013;14:220.
45. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94.
46. Hansen K, Brenner S, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38:e131.
47. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:R25.
48. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.
49. Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012;13:523–38.
50. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2012;31:46–53.
51. Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data. Genetics. 2010;185:405–16.
52. Johnson WE, Rabinovic A, Li C. Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics. 2007;8:118–27.
53. Nueda MJ, Ferrer A, Conesa A. ARSyN: a method for the identification and removal of systematic noise in multifactorial time course microarray experiments. Biostatistics. 2012;13:553–66.
54. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23:2881–7.
55. Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15:R29.
56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91.
57. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.
58. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.
59. Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11:422.
60. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013;29:1035–43.
61. Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat Methods Med Res. 2013;22:519–36.
62. Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010;26:136–8.
63. Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, et al. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013;14:R95.
64. Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform. 2015;16:59–70.
65. Zheng S, Chen L. A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level. Nucleic Acids Res. 2009;37:e75.
66. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562–78.
67. Singh D, Orellana CF, Hu Y, Jones CD, Liu Y, Chiang DY, et al. FDM: a graph-based statistical method to detect differential transcription using RNA-seq data. Bioinformatics. 2011;27:2633–40.
68. Shi Y, Jiang H. rSeqDiff: detecting differential isoform expression from RNA-Seq data using hierarchical likelihood ratio test. PLoS One. 2013;8:e79448.
69. Anders S, Reyes A, Huber W. Detecting differential usage of exons from RNA-seq data. Genome Res. 2012;22:2008–17.
70. Wang W, Qin Z, Feng Z, Wang X, Zhang X. Identifying differentially spliced genes from two groups of RNA-seq samples. Gene. 2013;518:164–70.
71. Shen S, Park JW, Lu ZX, Lin L, Henry MD, Wu YN, et al. rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc Natl Acad Sci U S A. 2014;111:E5593–601.
72. Drewe P, Stegle O, Hartmann L, Kahles A, Bohnert R, Wachter A, et al. Accurate detection of differential RNA processing. Nucleic Acids Res. 2013;41:5189–98.
73. Hu Y, Huang Y, Du Y, Orellana CF, Singh D, Johnson AR, et al. DiffSplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res. 2013;41:e39.
74. Hilker R, Stadermann KB, Doppmeier D, Kalinowski J, Stoye J, Straube J, et al. ReadXplorer - visualization and analysis of mapped sequences. Bioinformatics. 2014;30:2247–54.
75. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The Human Genome Browser at UCSC. Genome Res. 2002;12:996–1006.
76. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2012;14:178–92.
77. Medina I, Salavert F, Sanchez R, de Maria A, Alonso R, Escobar P, et al. Genome Maps, a new generation genome browser. Nucleic Acids Res. 2013;41(Web Server issue):W41–6.
78. Fiume M, Williams V, Brook A, Brudno M. Savant: genome browser for high-throughput sequencing data. Bioinformatics. 2010;26:1938–44.
79. Rogé X, Zhang X. RNAseqViewer: visualization tool for RNA-Seq data. Bioinformatics. 2013;30:891–2.
80. Katz Y, Wang ET, Silterra J, Schwartz S, Wong B, Thorvaldsdóttir H, et al. Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics. 2015;31:2400–2.
81. Wu E, Nance T, Montgomery SB. SplicePlot: a utility for visualizing splicing quantitative trait loci. Bioinformatics. 2014;30:1025–6.
82. Ryan MC, Cleland J, Kim R, Wong WC, Weinstein JN. SpliceSeq: a resource for analysis and visualization of RNA-Seq data on alternative splicing and its functional impacts. Bioinformatics. 2012;28:2385–7.
83. Liu Q, Chen C, Shen E, Zhao F, Sun Z, Wu J. Detection, annotation and visualization of alternative splicing from RNA-Seq data with SplicingViewer. Genomics. 2012;99:178–82.
84. Dietrich S, Wiegand S, Liesegang H. TraV: a genome context sensitive transcriptome browser. PLoS One. 2014;9:e93677.
85. Carrara M, Beccuti M, Lazzarato F, Cavallo F, Cordero F, Donatelli S, et al. State-of-the-art fusion-finder algorithms sensitivity and specificity. BioMed Res Int. 2013;15:340620.
86. Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, Luo S, et al. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci U S A. 2009;106:12353–8.
87. Yoshihara K, Wang Q, Torres-Garcia W, Zheng S, Vegesna R, Kim H, et al. The landscape and therapeutic relevance of cancer-associated transcript fusions. Oncogene. 2015;34:4845–54.
88. McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G, Sun MG, et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-seq data. PLoS Comput Biol. 2011;7:e1001138.
89. Wu C, Wyatt AW, McPherson A, Lin D, McConeghy BJ, Mo F, et al. Poly-gene fusion transcripts and chromothripsis in prostate cancer. Genes Chromosomes Cancer. 2012;51:1144–53.

90. Wyatt AW, Mo F, Wang K, McConeghy B, Brahmbhatt S, Jong L, et al. 116. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, et al.
Heterogeneity in the inter-tumor transcriptome of high risk prostate cancer. InterPro in 2011: new developments in the family and domain prediction
Genome Biol. 2014;15:426. database. Nucleic Acids Res. 2011;40(Database issue):D306–12.
91. Stransky N, Cerami E, Schalm S, Kim JL, Lengauer C. The landscape of kinase 117. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. Blast2GO: a
fusions in cancer. Nat Commun. 2014;5:4846. universal tool for annotation, visualization and analysis in functional
92. Rabbitts TH. Commonality but diversity in cancer gene fusions. Cell. genomics research. Bioinformatics. 2005;21:3674–6.
2009;137:391–5. 118. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S,
93. McPherson A, Wu C, Hajirasouliha I, Hormozdiari F, Hach F, Lapuk A, et al. Rfam: updates to the RNA families database. Nucleic Acids Res.
et al. Comrad: detection of expressed rearrangements by integrated 2009;37 suppl 1:D136–40.
analysis of RNA-Seq and low coverage genome sequence data. Bioinformatics. 119. Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence
2011;27:1481–8. microRNAs using deep sequencing data. Nucleic Acids Res. 2014;
94. Iyer MK, Chinnaiyan AM, Maher CA. ChimeraScan: a tool for 42(Database issue):D68–73.
identifying chimeric transcription in sequencing data. Bioinformatics. 120. Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. MicroRNA targets
2011;27:2903–4. in Drosophila. Genome Biol. 2003;5:R1.
95. Pflueger D, Terry S, Sboner A, Habegger L, Esgueva R, Lin PC, et al. 121. Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD,
Discovery of non-ETS gene fusions in human prostate cancer using Wallace C, et al. Bayesian test for colocalisation between pairs of
next-generation RNA sequencing. Genome Res. 2011;21:56–67. genetic association studies using summary statistics. PLoS Genet.
96. Wu J, Liu Q, Wang X, Zheng J, Wang T, You M, et al. mirTools 2.0 for 2014;10:e1004383.
non-coding RNA discovery, profiling, and functional annotation based on 122. Moffatt MF, Kabesch M, Liang L, Dixon AL, Strachan D, Heath S, et al.
high-throughput sequencing. RNA Biol. 2013;10:1087–92. Genetic variants regulating ORMDL3 expression contribute to the risk of
97. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat childhood asthma. Nature. 2007;448:470–3.
Methods. 2012;9:357–9. 123. Gilad Y, Rifkin S, Pritchard J. Revealing the architecture of gene regulation:
98. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler the promise of eQTL studies. Trends Genet. 2008;24:408–15.
transform. Bioinformatics. 2009;25:1754–60. 124. Gaffney D. Global properties and functional complexity of human gene
99. Prüfer K, Stenzel U, Dannemann M, Green RE, Lachmann M, Kelso J. PatMaN: regulatory variation. PLoS Genet. 2013;9:e1003501.
rapid alignment of short sequences to large databases. Bioinformatics. 125. Montgomery S, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J,
2008;24:1530–1. et al. Transcriptome genetics using second generation sequencing in a
100. Emde AK, Grunert M, Weese D, Reinert K, Sperling SR. MicroRazerS: rapid Caucasian population. Nature. 2010;464:773–7.
alignment of small RNA reads. Bioinformatics. 2010;26:123–4. 126. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, et al.
101. An J, Lai J, Lehman ML, Nelson CC. miRDeep*: an integrated application Understanding mechanisms underlying human gene expression variation
tool for miRNA identification from RNA sequencing data. Nucleic Acids Res. with RNA sequencing. Nature. 2010;464:768–72.
2013;41:727–37. 127. Lappalainen T, Sammeth M, Friedlander M, ‘t Hoen PA, Monlong J, Rivas
102. Yang X, Li L. miRDeep-P: a computational tool for analyzing the microRNA MA, et al. Transcriptome and genome sequencing uncovers functional
transcriptome in plants. Bioinformatics. 2011;27:2614–5. variation in humans. Nature. 2013;501:506–11.
103. Stocks MB, Moxon S, Mapleson D, Woolfenden HC, Mohorianu I, Folkes L, 128. Battle A, Mostafavi S, Zhu X, Potash JB, Weissman MM, Shi J, et al.
et al. The UEA sRNA workbench: a suite of tools for analysing and Characterizing the genetic basis of transcriptome diversity through
visualizing next generation sequencing microRNA and small RNA datasets. RNA-sequencing of 922 individuals. Genome Res. 2014;24:14–24.
Bioinformatics. 2012;28:2059–61. 129. Pastinen T. Genome-wide allele-specific analysis: insights into regulatory
104. Axtell MJ. ShortStack: comprehensive annotation and quantification of small variation. Nat Rev Genet. 2010;11:533–8.
RNA genes. RNA. 2013;19:740–51. 130. Sun W. A statistical framework for eQTL mapping using RNA-seq data.
105. Giurato G, De Filippo MR, Rinaldi A, Hashim A, Nassa G, Ravo M, et al. Biometrics. 2012;68:1–11.
iMir: an integrated pipeline for high-throughput analysis of small 131. van de Geijn B, McVicker G, Gilad Y, Pritchard JK. WASP: allele-specific for
non-coding RNA data obtained by smallRNA-Seq. BMC Bioinformatics. robust molecular quantitative trait locus discovery. Nat Methods.
2013;14:362. 2015;12:1061–3.
106. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for 132. Kumasaka N, Knights AJ, Gaffney DJ. Fine-mapping cellular QTLs with
RNA-seq: accounting for selection bias. Genome Biol. 2010;11:1–12. RASQUAL and ATAC-seq. Nat Genet. 2015. doi: 10.1038/ng.3467.
107. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for 133. Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc
microarray and RNA-Seq data. BMC Bioinformatics. 2013;14:7. Natl Acad Sci U S A. 2003;100:9440–5.
108. Wang X, Cairns MJ. Gene set enrichment analysis of RNA-Seq data: 134. Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, et al.
integrating differential expression and splicing. BMC Bioinformatics. Genome-wide associations of gene expression variation in humans. PLoS
2013;14 Suppl 5:S16. Genet. 2005;1:e78.
109. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene 135. Raj T, Rothamel K, Mostafavi S, Ye C, Lee MN, Replogle JM, et al. Polarization
Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9. of the effects of autoimmune and neurodegenerative risk alleles in
110. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, et al. leukocytes. Science. 2014;344:519–23.
Orchestrating high-throughput genomic analysis with Bioconductor. Nat 136. Shabalin A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations.
Methods. 2015;12:115–21. Bioinformatics. 2012;28:1353–8.
111. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of 137. Louhimo R, Lepikhova T, Monni O, Hautaniemi S. Comparative analysis of
large gene lists using DAVID Bioinformatics Resources. Nat Protocols. algorithms for integration of copy number and expression data. Nat
2009;4:44–57. Methods. 2012;9:351–5.
112. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: 138. Kim JH, Dhanasekaran SM, Prensner JR, Cao X, Robinson D, Kalyana-Sundaram
paths toward the comprehensive functional analysis of large gene lists. S, et al. Deep sequencing reveals distinct patterns of DNA methylation in
Nucleic Acids Res. 2009;37:1–13. prostate cancer. Genome Res. 2011;21:1028–41.
113. Medina I, Carbonell J, Pulido L, Madeira SC, Goetz S, Conesa A, et al. 139. Li JL, Mazar J, Zhong C, Faulkner GJ, Govindarajan SS, Zhang Z, et al.
Babelomics: an integrative platform for the analysis of transcriptomics, Genome-wide methylated CpG island profiles of melanoma cells reveal a
proteomics and genomic data with advanced functional profiling. Nucleic melanoma coregulation network. Sci Rep. 2013;3:2962.
Acids Res. 2010;38 suppl 2:W210–3. 140. Xie L, Weichel B, Ohm JE, Zhang K. An integrative analysis of DNA
114. Bairoch A, Boeckmann B, Ferro S, Gasteiger E. Swiss-Prot: juggling between methylation and RNA-Seq data for human heart, kidney and liver. BMC Syst
evolution and stability. Brief Bioinformatics. 2004;5:39–55. Biol. 2011;5 Suppl 3:S4.
115. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. 141. Van Eijk KR, de Jong S, Boks MP, Langeveld T, Colas F, Veldink JH, et al.
The Pfam protein families database. Nucleic Acids Res. Genetic analysis of DNA methylation and gene expression levels in whole
2014;42(Database issue):D222–30. blood of healthy human subjects. BMC Genomics. 2012;13:636.
142. Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, et al. 167. Sass S, Buettner F, Mueller NS, Theis FJ. A modular framework for gene set
Epigenome-wide association data implicate DNA methylation as an analysis integrating multilevel omics data. Nucleic Acids Res. 2013;41:9622–33.
intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol. 168. Kuo TC, Tian TF, Tseng YJ. 3Omics: a web-based systems biology tool for
2013;31:142–7. analysis, integration and visualization of human transcriptomic, proteomic
143. Yeang C-H. An integrated analysis of molecular aberrations in NCI-60 cell and metabolomic data. BMC Syst Biol. 2013;7:64.
lines. BMC Bioinformatics. 2010;11:495. 169. Ovaska K, Laakso M, Haapa-Paananen S, Louhimo R, Chen P, Aittomäki V,
144. Jeong J, Li L, Liu Y, Nephew KP, Huang YHM, Shen C. An empirical Bayes et al. Large-scale data integration framework provides a comprehensive
model for gene expression and methylation profiles in antiestrogen view on glioblastoma multiforme. Genome Med. 2010;2:65.
resistant breast cancer. BMC Med Genomics. 2010;3:55. 170. Goecks J, Nekrutenko A, Taylor J. Galaxy: a comprehensive approach for
145. Jiao Y, Widschwendter M, Teschendorff AE. A systems-level integrative supporting accessible, reproducible, and transparent computational
framework for genome-wide DNA methylation and gene expression data research in the life sciences. Genome Biol. 2010;11:R86.
identifies differential gene expression modules under epigenetic control. 171. Kallio MA, Tuimala JT, Hupponen T, Klemelä P, Gentile M, Scheinin I, et al.
Bioinformatics. 2014;30:2360–6. Chipster: user-friendly analysis software for microarray and other high-
146. Wang S, Sun H, Ma J, Zang C, Wang C, Wang J, et al. Target analysis throughput data. BMC Genomics. 2011;12:507.
by integration of transcriptome and ChIP-seq data with BETA. Nat 172. Pang CNI, Tay AP, Aya C. Tools to covisualize and coanalyze proteomic data
Protoc. 2013;8:2502–15. with genomes and transcriptomes: validation of genes and alternative
147. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, mRNA splicing. J Proteome Res. 2014;13:84–98.
Bilenky M, Yen A, et al. Integrative analysis of 111 reference human 173. Ramsköld D, Luo S, Wang YC, Li R, Deng Q, Faridani OR, et al. Full-length
epigenomes. Nature. 2015;518:317–30. mRNA-Seq from single-cell levels of RNA and individual circulating tumor
148. Madrigal P, Krajewski P. Uncovering correlated variability in epigenomic cells. Nat Biotechnol. 2012;30:777–82.
datasets using the Karhunen-Loeve transform. BioData Min. 2015;8:20. 174. Picelli S, Björklund ÅK, Faridani OR, Sagasser S, Winberg G, Sandberg R.
149. Angelini C, Costa V. Understanding gene regulatory mechanisms by Smart-seq2 for sensitive full-length transcriptome profiling in single cells.
integrating ChIP-seq and RNA-seq data: statistical solutions to biological Nat Methods. 2013;10:1096–8.
problems. Front Cell Dev Biol. 2014;2:51. 175. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly
150. Neph S, Stergachis AB, Reynolds A, Sandstrom R, Borenstein E, parallel genome-wide expression profiling of individual cells using nanoliter
Stamatoyannopoulos JA. Circuitry and dynamics of human transcription droplets. Cell. 2015;161:1202–14.
factor regulatory networks. Cell. 2012;150:1274–86. 176. Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, et al.
151. Dweep H, Sticht C, Pandey P, Gretz N. miRWalk - database: prediction of From single-cell to cell-pool transcriptomes: stochasticity in gene expression
possible miRNA binding sites by ‘walking’ the genes of 3 genomes. J and RNA splicing. Genome Res. 2014;24:496–510.
Biomed Inform. 2011;44:839–47. 177. Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, et al.
152. Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ. miRBase: tools for Quantitative single-cell RNA-seq with unique molecular identifiers. Nat
microRNA genomics. Nucleic Acids Res. 2008;36:D154–8. Methods. 2014;11:163–6.
153. Wu X, Watson M. CORNA: testing gene lists for regulation by microRNAs. 178. Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, et al.
Bioinformatics. 2009;25:832–3. Counting absolute numbers of molecules using unique molecular
154. Lee H, Yang Y, Chae H, Nam S, Choi D, Tangchaisin P, et al. BioVLAB-MMIA: identifiers. Nat Methods. 2011;9:72–4.
a cloud environment for microRNA and mRNA integrated analysis (MMIA) 179. Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, et al.
on Amazon EC2. IEEE Trans Nanobiosci. 2012;11:266–72. Accounting for technical noise in single-cell RNA-seq experiments. Nat
155. Nam S, Li M, Choi K, Balch C, Kim S, Nephew KP. MicroRNA and mRNA Methods. 2013;10:1093–5.
integrated analysis (MMIA): a web tool for examining biological functions of 180. Stegle O, Teichmann SA, Marioni JC. Computational and analytical
microRNA expression. Nucleic Acids Res. 2009;37:W356–62. challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16:133–45.
156. Sales G, Coppe A, Bisognin A, Bortoluzzi S, Romualdi C. MAGIA, a 181. Trapnell C, Cacchiarelli D. The dynamics and regulators of cell fate decisions
web-based tool for miRNA and Genes Integrated Analysis. Nucleic Acids are revealed by pseudotemporal ordering of single cells. Nat Biotechnol.
Res. 2010;38:W352–9. 2014;32:381–6.
157. Icay K, Chen P, Cervera C, Lehtonen R, Hautaniemi S. SePIA: RNA and 182. Lorthongpanich C, Cheow LF, Balu S, Quake SR, Knowles BB, Burkholder WF,
smallRNA-sequence processing, integration, and analysis. 2015. et al. Single-cell DNA-methylation analysis reveals epigenetic chimerism in
http://anduril.org/sepia. Accessed 6 Jan 2016. preimplantation embryos. Science. 2013;341:1110–2.
158. de Sousa AR, Penalva LO, Marcotte EM, Vogel C. Global signatures of 183. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP,
protein and mRNA expression levels. Mol Biosyst. 2009;5:1512–26. et al. Single-cell chromatin accessibility reveals principles of regulatory
159. Vogel C, Marcotte EM. Insights into the regulation of protein variation. Nature. 2015;523:486–90.
abundance from proteomic and transcriptomic analyses. Nat Rev Genet. 184. Cusanovich DA, Daza R, Adey A, Pliner HA, Christiansen L, Gunderson KL,
2012;13:227–32. et al. Multiplex single-cell profiling of chromatin accessibility by
160. Low TY, van Heesch S, van den Toorn H, Giansanti P, Cristobal A, Toonen P. combinatorial cellular indexing. Science. 2015;348:910–4.
Quantitative and qualitative proteome characteristics extracted from 185. Tilgner H, Jahanbani F, Blauwkamp T, Moshrefi A, Jaeger E, Chen F, et al.
in-depth integrated genomics and proteomics analysis. Cell Rep. Comprehensive transcriptome analysis using synthetic long-read
2013;5:1469–78. sequencing reveals molecular co-association of distant splicing events. Nat
161. Suhre K, Schmitt-Kopplin P. MassTRIX: mass translator into pathways. Nucleic Biotechnol. 2015;33:736–42.
Acids Res. 2008;36(Web Server issue):W481–4. 186. Au KF, Sebastiano V, Afshar PT, Durruthy JD, Lee L, Williams BA, et al.
162. García-Alcalde F, García-López F, Dopazo J, Conesa A. Paintomics: a web Characterization of the human ESC transcriptome by hybrid sequencing.
based tool for the joint visualization of transcriptomics and metabolomics Proc Natl Acad Sci U S A. 2013;110:E4821–30.
data. Bioinformatics. 2011;27:137–9. 187. Tilgner H, Grubert F, Sharon D, Snyder MP. Defining a personal, allele-specific,
163. Rohn H, Junker A, Hartmann A, Grafahrend-Belau E, Treutler H, Klapperstück and single-molecule long-read transcriptome. Proc Natl Acad Sci U S A.
M, et al. VANTED v2: a framework for systems biology applications. BMC 2014;111:9869–74.
Syst Biol. 2012;6:139. 188. Au KF, Underwood JG, Lee L, Wong WH. Improving PacBio long read
164. Tuncbag N, McCallum S, Huang SS, Fraenkel E. SteinerNet: a web server for accuracy by short read alignment. PLoS One. 2012;7:e46679.
integrating ‘omic’ data to discover hidden components of response 189. Hansen KD, Wu Z, Irizarry RA, Leek JT. Sequencing technology does not
pathways. Nucleic Acids Res. 2012;40:W505–9. eliminate biological variability. Nat Biotechnol. 2011;29:572–3.
165. Zhang S, Li Q, Liu J, Zhou XJ. A novel computational framework for 190. Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher JP. Calculating sample
simultaneous integration of multiple types of genomic data to identify size estimates for RNA sequencing data. J Comput Biol. 2013;20:970–8.
microRNA-gene regulatory modules. Bioinformatics. 2011;27:i401–9. 191. Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. Scotty: a web tool for
166. Le H-S, Bar-Joseph Z. Integrating sequence, expression and interaction data designing RNA-Seq experiments to measure differential gene expression.
to determine condition-specific miRNA regulation. Bioinformatics. 2013;29:i89–97. Bioinformatics. 2013;29:656–7.
192. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds
systems biology. Biol Direct. 2009;4:14.
193. Noble WS. How does multiple testing correction work? Nat Biotechnol.
2009;27:1135–7.
194. Robinson DG, Storey JD. subSeq: determining appropriate sequencing
depth through efficient read subsampling. Bioinformatics. 2014;30:3424–6.
195. Liu Y, Zhou J, White KP. RNA-seq differential expression studies: more
sequence or more replication? Bioinformatics. 2013;30:301–4.
196. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq
accuracy, reproducibility and information content by the Sequencing
Quality Control Consortium. Nat Biotechnol. 2014;32:903–14.
197. Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, et al. Synthetic
spike-in standards for RNA-seq experiments. Genome Res. 2011;21:1543–51.
198. Kouzine F, Wojtowicz D, Yamane A, Resch W, Kieffer-Kwon KR, Bandle R,
et al. Global regulation of promoter melting in naive lymphocytes. Cell.
2013;153:988–99.
199. Van Dijk EL, Jaszczyszyn Y, Thermes C. Library preparation methods for
next-generation sequencing: tone down the bias. Exp Cell Res.
2014;322:12–20.
200. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics. 2009;25:1105–11.
201. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2:
accurate alignment of transcriptomes in the presence of insertions,
deletions and gene fusions. Genome Biol. 2013;14:R36.
202. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and
splicing in short reads. Bioinformatics. 2010;26:873–81.
203. Jean G, Kahles A, Sreedharan VT, De Bona F, Rätsch G. RNA-Seq read
alignments with PALMapper. Curr Protoc Bioinformatics. 2010;11(6).
204. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice:
accurate mapping of RNA-seq reads for splice junction discovery. Nucleic
Acids Res. 2010;38:e178.
205. Marco-Sola S, Sammeth M, Guigó R, Ribeca P. The GEM mapper: fast,
accurate and versatile alignment by filtration. Nat Methods. 2012;9:1185–8.
206. Zhao S, Zhang B. A comprehensive evaluation of ensembl, RefSeq, and
UCSC annotations in the context of RNA-seq read mapping and gene
quantification. BMC Genomics. 2015;16:97.
207. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol.
2009;10:R25.
208. Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting
differentially expressed genes from RNA-seq data. Am J Bot. 2012;99:248–56.
209. Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM. Efficient
experimental design and analysis strategies for the detection of differential
expression using RNA-Sequencing. BMC Genomics. 2012;13:484.
210. Nookaew I, Papini M, Pornputtapong N, Scalcinati G, Fagerberg L, Uhlén M,
et al. A comprehensive comparison of RNA-Seq-based transcriptome
analysis from reads to differential gene expression and cross-comparison
with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids
Res. 2012;40:10084–97.
211. Seyednasrollah F, Rantanen K, Jaakkola P, Elo LL. ROTS: reproducible
RNA-seq biomarker detector-prognostic markers for clear cell renal cell
cancer. Nucleic Acids Res. 2016;44(1):e1. doi:10.1093/nar/gkv806.
212. Bi Y, Davuluri RV. NPEBseq: nonparametric empirical bayesian-based
procedure for differential expression analysis of RNA-seq data. BMC
Bioinformatics. 2013;14:262.
213. Nueda MJ, Tarazona S, Conesa A. Next maSigPro: updating maSigPro
bioconductor package for RNA-seq time series. Bioinformatics.
2014;30:2598–602.
214. GTEx Consortium. The Genotype-Tissue expression (GTEx) project. Nat
Genet. 2013;45:580–5.
