Survey RNA-Seq Data Analysis (2016)
© 2016 Conesa et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Conesa et al. Genome Biology (2016) 17:13 Page 2 of 19
and resources for the bioinformatics analysis of RNA-seq data. We do not aim to provide an exhaustive compilation of resources or software tools nor to indicate one best analysis pipeline. Rather, we aim to provide a commented guideline for RNA-seq data analysis. Figure 1 depicts a generic roadmap for experimental design and analysis using standard Illumina sequencing. We also briefly list several data integration paradigms that have been proposed and comment on their potential and limitations. We finally discuss the opportunities as well as challenges provided by single-cell RNA-seq and long-read technologies when compared to traditional short-read RNA-seq.

Experimental design
A crucial prerequisite for a successful RNA-seq study is that the data generated have the potential to answer the biological questions of interest. This is achieved by first defining a good experimental design, that is, by choosing the library type, sequencing depth and number of replicates appropriate for the biological system under study, and second by planning an adequate execution of the sequencing experiment itself, ensuring that data acquisition does not become contaminated with unnecessary biases. In this section, we discuss both considerations.

One important aspect of the experimental design is the RNA-extraction protocol used to remove the highly abundant ribosomal RNA (rRNA), which typically constitutes over 90 % of total RNA in the cell, leaving the 1–2 % comprising messenger RNA (mRNA) that we are normally interested in. For eukaryotes, this involves choosing whether to enrich for mRNA using poly(A) selection or to deplete rRNA. Poly(A) selection typically requires a relatively high proportion of mRNA with minimal degradation as measured by RNA integrity number (RIN), which normally yields a higher overall fraction of reads falling onto known exons. Many biologically relevant samples (such as tissue biopsies) cannot, however, be obtained in great enough quantity or good enough mRNA integrity to produce good poly(A) RNA-seq libraries and therefore require ribosomal depletion. For bacterial samples, in which mRNA is not polyadenylated,
Fig. 1 A generic roadmap for RNA-seq computational analyses. The major analysis steps are listed above the lines for pre-analysis, core analysis
and advanced analysis. The key analysis issues for each step that are listed below the lines are discussed in the text. (a) Preprocessing includes
experimental design, sequencing design, and quality control steps. (b) Core analyses include transcriptome profiling, differential gene expression,
and functional profiling. (c) Advanced analysis includes visualization, other RNA-seq technologies, and data integration. Abbreviations: ChIP-seq
Chromatin immunoprecipitation sequencing, eQTL Expression quantitative trait loci, FPKM Fragments per kilobase of exon model per million mapped
reads, GSEA Gene set enrichment analysis, PCA Principal component analysis, RPKM Reads per kilobase of exon model per million reads, sQTL
Splicing quantitative trait loci, TF Transcription factor, TPM Transcripts per million
the only viable alternative is ribosomal depletion. Another consideration is whether to generate strand-preserving libraries. The first generation of Illumina-based RNA-seq used random hexamer priming to reverse-transcribe poly(A)-selected mRNA. This methodology did not retain information contained on the DNA strand that is actually expressed [1] and therefore complicates the analysis and quantification of antisense or overlapping transcripts. Several strand-specific protocols [2], such as the widely used dUTP method, extend the original protocol by incorporating dUTP nucleotides during the second cDNA synthesis step, prior to adapter ligation followed by digestion of the strand containing dUTP [3]. In all cases, the size of the final fragments (usually less than 500 bp for Illumina) will be crucial for proper sequencing and subsequent analysis. Furthermore, sequencing can involve single-end (SE) or paired-end (PE) reads, although the latter is preferable for de novo transcript discovery or isoform expression analysis [4, 5]. Similarly, longer reads improve mappability and transcript identification [5, 6]. The best sequencing option depends on the analysis goals. The cheaper, short SE reads are normally sufficient for studies of gene expression levels in well-annotated organisms, whereas longer and PE reads are preferable to characterize poorly annotated transcriptomes.

Another important factor is sequencing depth or library size, which is the number of sequenced reads for a given sample. More transcripts will be detected and their quantification will be more precise as the sample is sequenced to a deeper level [1]. Nevertheless, optimal sequencing depth again depends on the aims of the experiment. While some authors will argue that as few as five million mapped reads are sufficient to quantify accurately medium to highly expressed genes in most eukaryotic transcriptomes, others will sequence up to 100 million reads to quantify precisely genes and transcripts that have low expression levels [7]. When studying single cells, which have limited sample complexity, quantification is often carried out with just one million reads but may be done reliably for highly expressed genes with as few as 50,000 reads [8]; even 20,000 reads have been used to differentiate cell types in splenic tissue [9]. Moreover, optimal library size depends on the complexity of the targeted transcriptome. Experimental results suggest that deep sequencing improves quantification and identification but might also result in the detection of transcriptional noise and off-target transcripts [10]. Saturation curves can be used to assess the improvement in transcriptome coverage to be expected at a given sequencing depth [10].

Finally, a crucial design factor is the number of replicates. The number of replicates that should be included in an RNA-seq experiment depends on both the amount of technical variability in the RNA-seq procedures and the biological variability of the system under study, as well as on the desired statistical power (that is, the capacity for detecting statistically significant differences in gene expression between experimental groups). These two aspects are part of power analysis calculations (Fig. 1a; Box 1).

The adequate planning of sequencing experiments so as to avoid technical biases is as important as good

Box 1. Number of replicates
Three factors determine the number of replicates required in an RNA-seq experiment. The first factor is the variability in the measurements, which is influenced by the technical noise and the biological variation. While reproducibility in RNA-seq is usually high at the level of sequencing [1, 45], other steps such as RNA extraction and library preparation are noisier and may introduce biases in the data that can be minimized by adopting good experimental procedures (Box 2). Biological variability is particular to each experimental system and is harder to control [189]. Nevertheless, biological replication is required if inference on the population is to be made, with three replicates being the minimum for any inferential analysis. For a proper statistical power analysis, estimates of the within-group variance and gene expression levels are required. This information is typically not available beforehand but can be obtained from similar experiments. The exact power will depend on the method used for differential expression analysis, and software packages exist that provide a theoretical estimate of power over a range of variables, given the within-group variance of the samples, which is intrinsic to the experiment [190, 191]. Table 1 shows an example of statistical power calculations over a range of fold-changes (or effect sizes) and number of replicates in a human blood RNA-seq sample sequenced at 30 million mapped reads. It should be noted that these estimates apply to the average gene expression level, but as dynamic ranges in RNA-seq data are large, the probability that highly expressed genes will be detected as differentially expressed is greater than that for low-count genes [192]. For methods that return a false discovery rate (FDR), the proportion of genes that are highly expressed out of the total set of genes being tested will also influence the power of detection after multiple testing correction [193]. Filtering out genes that are expressed at low levels prior to differential expression analysis reduces the severity of the correction and may improve the power of detection [20]. Increasing sequencing depth can also improve statistical power for lowly expressed genes [10, 194], and for any given sample there exists a level of sequencing at which power improvement is best achieved by increasing the number of replicates [195]. Tools such as Scotty are available to calculate the best trade-off between sequencing depth and replicate number given some budgetary constraints [191].
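To make the power reasoning of Box 1 concrete, the following minimal Python sketch uses a normal approximation for a two-group comparison of log-scale expression, in the spirit of the calculation that Hart et al. [190] describe as we understand it. The function name `approx_power`, its parameter names, and the coefficient of variation of 0.25 in the usage lines are illustrative assumptions; they are not taken from the text or from the RNASeqPower package.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power(n_per_group, fold_change, reads_per_gene, cv, alpha=0.05):
    """Approximate power to detect a fold change for a single gene.

    reads_per_gene: expected aligned reads for the gene in one sample.
    cv: assumed biological coefficient of variation within a group.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    # Per-sample variance on the log scale: counting noise plus biological variance.
    per_sample_var = 1.0 / reads_per_gene + cv ** 2
    z_beta = sqrt(n_per_group * log(fold_change) ** 2 / (2 * per_sample_var)) - z_alpha
    return std_normal.cdf(z_beta)


if __name__ == "__main__":
    # Hypothetical scenario: a gene with 70 aligned reads and an assumed CV of 0.25.
    for n in (3, 5, 10):
        print(n, round(approx_power(n, 1.5, reads_per_gene=70, cv=0.25), 2))
```

Under these assumptions the estimated power rises with both the number of replicates and the fold change, which is the qualitative pattern summarized in Table 1. A different within-group variance would shift every value, so the variance estimate should come from comparable pilot or published data.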
Table 1 Statistical power to detect differential expression varies with effect size, sequencing depth and number of replicates

                                        Replicates per group
                                        3        5        10
Effect size (fold change)
  1.25                                  17 %     25 %     44 %
  1.5                                   43 %     64 %     91 %
  2                                     87 %     98 %     100 %
Sequencing depth (millions of reads)
  3                                     19 %     29 %     52 %
  10                                    33 %     51 %     80 %
  15                                    38 %     57 %     85 %

Example of calculations for the probability of detecting differential expression in a single test at a significance level of 5 %, for a two-group comparison using a Negative Binomial model, as computed by the RNASeqPower package of Hart et al. [190]. For a fixed within-group variance (package default value), the statistical power increases with the difference between the two groups (effect size), the sequencing depth, and the number of replicates per group. This table shows the statistical power for a gene with 70 aligned reads, which was the median coverage for a protein-coding gene for one whole-blood RNA-seq sample with 30 million aligned reads from the GTEx Project [214]

experimental design, especially when the experiment involves a large number of samples that need to be processed in several batches. In this case, including controls, randomizing sample processing and smart management of sequencing runs are crucial to obtain error-free data (Fig. 1a; Box 2).

Analysis of the RNA-seq data
The actual analysis of RNA-seq data has as many variations as there are applications of the technology. In this section, we address all of the major analysis steps for a typical RNA-seq experiment, which involve quality control, read alignment with and without a reference genome, obtaining metrics for gene and transcript expression, and approaches for detecting differential gene expression. We also discuss analysis options for applications of RNA-seq involving alternative splicing, fusion transcripts and small RNA expression. Finally, we review useful packages for data visualization.

Quality-control checkpoints
The acquisition of RNA-seq data consists of several steps: obtaining raw reads, read alignment and quantification. At each of these steps, specific checks should be applied to monitor the quality of the data (Fig. 1a).

Raw reads
Quality control for the raw reads involves the analysis of sequence quality, GC content, the presence of adaptors, overrepresented k-mers and duplicated reads in order to detect sequencing errors, PCR artifacts or contaminations. Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiment. We recommend that outliers with over 30 % disagreement be discarded. FastQC [11] is a popular tool to perform these analyses on Illumina reads, whereas NGSQC [12] can be applied to any platform. As a general rule, read quality decreases towards the 3’ end of reads, and if it becomes too low, bases should be removed to improve mappability. Software tools such as the FASTX-Toolkit [13] and Trimmomatic [14] can be used to discard low-quality reads, trim adaptor sequences, and eliminate poor-quality bases.

Read alignment
Reads are typically mapped to either a genome or a transcriptome, as will be discussed later. An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. For example, we expect between 70 and 90 % of regular RNA-seq reads to map onto the human genome (depending on the read mapper used) [15], with a significant fraction of reads mapping to a limited number of identical regions equally well (‘multi-mapping reads’). When reads are mapped against the transcriptome, we expect slightly lower total mapping percentages because reads coming from unannotated transcripts will be lost, and significantly more multi-mapping reads because of reads falling onto exons that are shared by different transcript isoforms of the same gene.

Other important parameters are the uniformity of read coverage on exons and the mapped strand. If reads

Box 2. Experiment execution choices
RNA-seq library preparation and sequencing procedures include a number of steps (RNA fragmentation, cDNA synthesis, adapter ligation, PCR amplification, bar-coding, and lane loading) that might introduce biases into the resulting data [196]. Including exogenous reference transcripts (‘spike-ins’) is useful both for quality control [1, 197] and for library-size normalization [198]. For bias minimization, we recommend following the suggestions made by Van Dijk et al. [199], such as the use of adapters with random nucleotides at the extremities or the use of chemical-based fragmentation instead of RNase III-based fragmentation. If the RNA-seq experiment is large and samples have to be processed in different batches and/or Illumina runs, caution should be taken to randomize samples across library preparation batches and lanes so as to avoid technical factors becoming confounded with experimental factors. Another option, when samples are individually barcoded and multiple Illumina lanes are needed to achieve the desired sequencing depth, is to include all samples in each lane, which would minimize any possible lane effect.
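The randomization advice in Box 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a prescribed procedure: the function name `assign_batches`, the sample labels and the fixed seed are hypothetical, and the layout shown is a simple shuffled round-robin that spreads each experimental condition across library-preparation batches.

```python
import random


def assign_batches(samples_by_condition, n_batches, seed=0):
    """Randomly spread the samples of every condition across processing batches."""
    rng = random.Random(seed)  # fixed seed so the layout can be recorded and reproduced
    batches = [[] for _ in range(n_batches)]
    offset = 0
    for condition, samples in samples_by_condition.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        for i, sample in enumerate(shuffled):
            # Round-robin after shuffling keeps each condition balanced across batches.
            batches[(offset + i) % n_batches].append((condition, sample))
        offset += len(shuffled)
    return batches


if __name__ == "__main__":
    design = {
        "control": ["C1", "C2", "C3", "C4"],
        "treated": ["T1", "T2", "T3", "T4"],
    }
    for number, batch in enumerate(assign_batches(design, n_batches=2), start=1):
        print(f"batch {number}: {batch}")
```

Because every batch receives samples from every condition, a batch effect cannot be mistaken for a treatment effect. Processing all controls in one batch and all treated samples in another would confound the two.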
primarily accumulate at the 3’ end of transcripts in poly(A)-selected samples, this might indicate low RNA quality in the starting material. The GC content of mapped reads may reveal PCR biases. Tools for quality control in mapping include Picard [16], RSeQC [17] and Qualimap [18].

clear standard exists for biological replicates, as this depends on the heterogeneity of the experimental system. If gene expression differences exist among experimental conditions, it should be expected that biological replicates of the same condition will cluster together in a principal component analysis (PCA).
Fig. 2 Read mapping and transcript identification strategies. Three basic strategies for regular RNA-seq analysis. (a) An annotated genome is
available and reads are mapped to the genome with a gapped mapper. Next (novel) transcript discovery and quantification can proceed with or
without an annotation file. Novel transcripts are then functionally annotated. (b) If no novel transcript discovery is needed, reads can be mapped
to the reference transcriptome using an ungapped aligner. Transcript identification and quantification can occur simultaneously. (c) When no
genome is available, reads need to be assembled first into contigs or transcripts. For quantification, reads are mapped back to the novel reference
transcriptome and further analysis proceeds as in (b) followed by the functional annotation of the novel transcripts as in (a). Representative
software that can be used at each analysis step is indicated in bold text. Abbreviations: GFF General Feature Format, GTF Gene Transfer Format,
RSEM RNA-Seq by Expectation Maximization
of whether a genome or transcriptome reference is used, reads may map uniquely (they can be assigned to only one position in the reference) or could be multi-mapped reads (multireads). Genomic multireads are primarily due to repetitive sequences or shared domains of paralogous genes. They normally account for a significant fraction of the mapping output when mapped onto the genome and should not be discarded. When the reference is the transcriptome, multi-mapping arises even more often because a read that would have been uniquely mapped on the genome would map equally well to all gene isoforms in the transcriptome that share the exon. In either case (genome or transcriptome mapping), transcript identification and quantification become important challenges for alternatively expressed genes.

Transcript discovery
Identifying novel transcripts using the short reads provided by Illumina technology is one of the most challenging tasks in RNA-seq. Short reads rarely span across several splice junctions and thus make it difficult to directly infer all full-length transcripts. In addition, it is difficult to identify transcription start and end sites [21], and tools such as GRIT [22] that incorporate other data such as 5’ ends from CAGE or RAMPAGE typically have a better chance of annotating the major expressed isoforms correctly. In any case, PE reads and higher coverage help to reconstruct lowly expressed transcripts, and replicates are essential to resolve false-positive calls (that is, mapping artifacts or contaminations) at the low end of signal detection. Several methods, such as Cufflinks [23], iReckon [24], SLIDE [25] and StringTie [26], incorporate existing annotations by adding them to the possible list of isoforms. Montebello [27] couples isoform discovery and quantification using a likelihood-based Monte Carlo algorithm to boost performance. Gene-finding tools such as Augustus [28] can incorporate RNA-seq data to better annotate protein-coding transcripts, but perform worse on non-coding transcripts [29]. In general, accurate transcript reconstruction from short reads is difficult, and methods typically show substantial disagreement [29].

De novo transcript reconstruction
When a reference genome is not available or is incomplete, RNA-seq reads can be assembled de novo (Fig. 2c) into a transcriptome using packages such as SOAPdenovo-Trans [30], Oases [31], Trans-ABySS [32] or Trinity [33]. In general, PE strand-specific sequencing and long reads are preferred because they are more informative [33]. Although it is impossible to assemble lowly expressed transcripts that lack enough coverage for a reliable assembly, too many reads are also problematic because they lead to potential misassembly and increased runtimes. Therefore, in silico reduction of the number of reads is recommended for deeply sequenced samples [33]. For comparative analyses across samples, it is advisable to combine all reads from multiple samples into a single input in order to obtain a consolidated set of contigs (transcripts), followed by mapping back of the short reads for expression estimation [33].

Either with a reference or de novo, the complete reconstruction of transcriptomes using short-read Illumina technology remains a challenging problem, and in many cases de novo assembly results in tens or hundreds of contigs accounting for fragmented transcripts. Emerging long-read technologies, such as SMRT from Pacific Biosciences, provide reads that are long enough to sequence complete transcripts for most genes and are a promising alternative that is discussed further in the “Outlook” section below.

Box 3. Mapping to a reference
Mapping to a reference genome allows for the identification of novel genes or transcripts, and requires the use of a gapped or spliced mapper as reads may span splice junctions. The challenge is to identify splice junctions correctly, especially when sequencing errors or differences with the reference exist or when non-canonical junctions and fusion transcripts are sought. One of the most popular RNA-seq mappers, TopHat, follows a two-step strategy in which unspliced reads are first mapped to locate exons, then unmapped reads are split and aligned independently to identify exon junctions [200, 201]. Several other mappers exist that are optimized to identify SNPs or indels (GSNAP [202], PALMapper [203], MapSplice [204]), detect non-canonical splice junctions (STAR [15], MapSplice [204]), achieve ultra-fast mapping (GEM [205]) or map long reads (STAR [15]). Important parameters to consider during mapping are the strandedness of the RNA-seq library, the number of mismatches to accept, the length and type of reads (SE or PE), and the length of sequenced fragments. In addition, existing gene models can be leveraged by supplying an annotation file to some read mappers in order to map exon coordinates accurately and to help in identifying splicing events. The choice of gene model can also have a strong impact on the quantification and differential expression analysis [206]. We refer the reader to [30] for a comprehensive comparison of RNA-seq mappers. If the transcriptome annotation is comprehensive (for example, in mouse or human), researchers may choose to map directly to a Fasta-format file of all transcript sequences for all genes of interest. In this case, no gapped alignment is needed and unspliced mappers such as Bowtie [207] can be used (Fig. 2b). Mapping to the transcriptome is generally faster but does not allow de novo transcript discovery.
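The two-step strategy summarized in Box 3 (map unspliced reads first, then split the remaining reads and align the pieces independently) can be illustrated with a toy Python sketch. It works on exact string matches only and is not the algorithm of TopHat or of any other mapper; the function names and the `min_anchor` and `min_intron` thresholds are hypothetical choices made for the example.

```python
def find_all(genome, seq):
    """Return every start position at which seq occurs exactly in genome."""
    hits, start = [], genome.find(seq)
    while start != -1:
        hits.append(start)
        start = genome.find(seq, start + 1)
    return hits


def map_read(genome, read, min_anchor=4, min_intron=5):
    """Toy two-step mapping: unspliced exact match first, then a split match."""
    # Step 1: try to place the whole read without any gap.
    hits = find_all(genome, read)
    if hits:
        return {"type": "unspliced", "pos": hits[0]}
    # Step 2: split the read and align both segments independently.
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        for left_pos in find_all(genome, left):
            donor = left_pos + cut  # first genomic base after the left segment
            for right_pos in find_all(genome, right):
                if right_pos - donor >= min_intron:
                    return {"type": "spliced", "pos": left_pos,
                            "junction": (donor, right_pos)}
    return {"type": "unmapped"}


if __name__ == "__main__":
    # Hypothetical mini-genome: two exons separated by one intron.
    exon1, intron, exon2 = "ACGTACGGTCA", "GTAAGTTTTTTTTCAG", "CCATGGATTCA"
    genome = "TTT" + exon1 + intron + exon2 + "GGG"
    print(map_read(genome, exon2[:8]))               # read inside a single exon
    print(map_read(genome, exon1[-6:] + exon2[:6]))  # read spanning the junction
```

Real spliced mappers additionally tolerate mismatches, score candidate junctions (for example by splice-site motifs or supplied annotation) and handle paired-end information, which is why the mapping parameters listed in Box 3 matter in practice.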
the Poisson or negative binomial [48, 54]. The negative binomial distribution (also known as the gamma-Poisson distribution) is a generalization of the Poisson distribution, allowing for additional variance (called overdispersion) beyond the variance expected from randomly sampling from a pool of molecules that are characteristic of RNA-seq data. However, the use of discrete distributions is not required for accurate analysis of differential expression as long as the sampling variance of small read counts is taken into account (most important for experiments with small numbers of replicates). Methods for transforming normalized counts of RNA-seq reads while learning the variance structure of the data have been shown to perform well in comparison to the discrete distribution approaches described above [55, 56]. Moreover, after extensive normalization (including TMM and batch removal), the data might have lost their discrete nature and be more akin to a continuous distribution.

Some methods, such as the popular edgeR [57], take as input raw read counts and introduce possible bias sources into the statistical model to perform an integrated normalization as well as a differential expression analysis. In other methods, the differential expression requires the data to be previously normalized to remove all possible biases. DESeq2, like edgeR, uses the negative binomial as the reference distribution and provides its own normalization approach [48, 58]. baySeq [59] and EBSeq [60] are Bayesian approaches, also based on the negative binomial model, that define a collection of models to describe the differences among experimental groups and to compute the posterior probability of each one of them for each gene. Other approaches include data transformation methods that take into account the sampling variance of small read counts and create discrete gene expression distributions that can be analyzed by regular linear models [55]. Finally, non-parametric approaches such as NOISeq [10] or SAMseq [61] make minimal assumptions about the data and estimate the null distribution for inferential analysis from the actual data alone. For small-scale studies that compare two samples with no or few replicates, the estimation of the negative binomial distribution can be noisy. In such cases, simpler methods based on the Poisson distribution, such as DEGseq [62], or on empirical distributions (NOISeq [10]) can be an alternative, although it should be strongly stressed that, in the absence of biological replication, no population inference can be made and hence any p value calculation is invalid. Methods that analyze RNA-seq data without replicates therefore only have exploratory value. Considering the drop in price of sequencing, we recommend that RNA-seq experiments have a minimum of three biological replicates when sample availability is not limiting to allow all of the differential expression methods to leverage reproducibility between replicates.

Recent independent comparison studies have demonstrated that the choice of the method (or even the version of a software package) can markedly affect the outcome of the analysis and that no single method is likely to perform favorably for all datasets [56, 63, 64] (Box 4). We therefore recommend thoroughly documenting the settings and version numbers of programs used and considering the repetition of important analyses using more than one package.

Alternative splicing analysis
Transcript-level differential expression analysis can potentially detect changes in the expression of transcript isoforms from the same gene, and specific algorithms for alternative splicing-focused analysis using RNA-seq have been proposed. These methods fall into two major categories. The first approach integrates isoform expression estimation with the detection of differential expression to reveal changes in the proportion of each isoform within the total gene expression. One such early method, BASIS, used a hierarchical Bayesian model to directly infer differentially expressed transcript isoforms [65]. CuffDiff2 estimates isoform expression first and then compares their differences. By integrating the two steps, the uncertainty in the first step is taken into consideration when performing the statistical analysis to look for differential isoform expression [66]. The flow difference metric (FDM) uses aligned cumulative transcript graphs from mapped exon reads and junction reads to infer isoforms and the Jensen-Shannon divergence to measure the difference [67]. Recently, Shi and Jiang [68] proposed a new method, rSeqDiff, that uses a hierarchical likelihood ratio test to detect differential gene expression without splicing change and differential isoform expression simultaneously. All these approaches are generally hampered by the intrinsic limitations of short-read sequencing for accurate identification at the isoform level, as discussed in the RNA-seq Genome Annotation Assessment Project paper [30].

The so-called ‘exon-based’ approach skips the estimation of isoform expression and detects signals of alternative splicing by comparing the distributions of reads on exons and junctions of the genes between the compared samples. This approach is based on the premise that differences in isoform expression can be tracked in the signals of exons and their junctions. DEXSeq [69] and DSGSeq [70] adopt a similar idea to detect differentially spliced genes by testing for significant differences in read counts on exons (and junctions) of the genes. rMATS detects differential usage of exons by comparing exon-inclusion levels defined with junction reads [71]. rDiff detects differential isoform expression by comparing
of biological interest to assess whether particular analyses’ results can withstand detailed scrutiny or to reveal potential complications caused by artifacts, such as 3’ biases or complicated transcript structures. Users should visualize changes in read coverage for genes that are deemed important or interesting on the basis of their analysis results to evaluate the robustness of their conclusions.

Gene fusion discovery
The discovery of fused genes that can arise from chromosomal rearrangements is analogous to novel isoform discovery, with the added challenge of a much larger search space as we can no longer assume that the transcript segments are co-linear on a single chromosome. Artifacts are common even using state-of-the-art tools, which necessitates post-processing using heuristic filters [85]. Artifacts primarily result from misalignment of read sequences due to polymorphisms, homology, and sequencing errors. Families of homologous genes, and highly polymorphic genes such as the HLA genes, produce reads that cannot be easily mapped uniquely to their location of origin in the reference genome. For genes with very high expression, the small but non-negligible sequencing error rate of RNA-seq will produce reads that map incorrectly to homologous loci. Filtering highly polymorphic genes and pairs of homologous genes is recommended [86, 87]. Also recommended is the filtering of highly expressed genes that are unlikely to be involved in gene fusions, such as ribosomal RNA [86]. Finally, a low ratio of chimeric to wild-type reads in the vicinity of the fusion boundary may indicate spurious mis-mapping of reads from a highly expressed gene (the transcript allele fraction described by Yoshihara et al. [87]).

Given successful prediction of chimeric sequences, the next step is the prioritization of gene fusions that have biological impact over more expected forms of genomic variation. Examples of expected variation include immunoglobulin (IG) rearrangements in tumor samples infiltrated by immune cells, transiently expressed transposons and nuclear mitochondrial DNA, and read-through chimeras produced by co-transcription of adjacent genes [88]. Care must be taken with filtering in order not to lose events of interest. For example, removing all fusions involving an IG gene may remove real IG fusions in lymphomas and other blood disorders; filtering fusions for which both genes are from the IG locus is preferred [88]. Transiently expressed genomic breakpoint sequences that are associated with real gene fusions often overlap transposons; these should be filtered unless they are associated with additional fusion isoforms from the same gene pair [89]. Read-through chimeras are easily identified as predictions involving alternative splicing between adjacent genes. Where possible, fusions should be filtered by their presence in a set of control datasets [87]. When control datasets are not available, artifacts can be identified by their presence in a large number of unrelated datasets, after excluding the possibility that they represent true recurrent fusions [90, 91].

Strong fusion-sequence predictions are characterized by distinct subsequences that each align with high specificity to one of the fused genes. As alignment specificity is highly correlated with sequence length, a strong prediction sequence is longer, with longer subsequences from each gene. Longer reads and larger insert sizes produce longer predicted sequences; thus, we recommend PE RNA-seq data with larger insert size over SE datasets or datasets with short insert size. Another indicator of prediction strength is splicing. For most known fusions, the genomic breakpoint is located in an intron of each gene [92] and the fusion boundary coincides with a splice site within each gene. Furthermore, fusion isoforms generally follow the splicing patterns of wild-type genes. Thus, high confidence predictions have fusion boundaries coincident with exon boundaries and exons matching wild-type exons [91]. Fusion discovery tools often incorporate some of the aforementioned ideas to rank fusion predictions [93, 94], though most studies apply additional custom heuristic filters to produce a list of high-quality fusion candidates [90, 91, 95].

Small RNAs
Next-generation sequencing represents an increasingly popular method to address questions concerning the biological roles of small RNAs (sRNAs). sRNAs are usually 18–34 nucleotides in length, and they include miRNAs, short-interfering RNAs (siRNAs), PIWI-interacting RNAs (piRNAs), and other classes of regulatory molecules. sRNA-seq libraries are rarely sequenced as deeply as regular RNA-seq libraries because of a lack of complexity, with a typical range of 2–10 million reads. Bioinformatics analysis of sRNA-seq data differs from standard RNA-seq protocols (Fig. 1c). Ligated adaptor sequences are first trimmed and the resulting read-length distribution is computed. In animals, there are usually peaks for 22 and 23 nucleotides, whereas in plants there are peaks for 21- and 24-nucleotide redundant reads. For instance, miRTools 2.0 [96], a tool for prediction and profiling of sRNA species, uses by default reads that are 18–30 bases long. The threshold value depends on the application, and in the case of miRNAs is usually in the range of 19–25 nucleotides.

As in standard RNA-seq, sRNA reads must then be aligned to a reference genome or transcriptome sequences using standard tools, such as Bowtie2 [97], STAR [15], or Burrows-Wheeler Aligner (BWA) [98].
There are, however, some aligners (such as PatMaN [99] and MicroRazerS [100]) that have been designed to map short sequences with preset parameter value ranges suited for optimal alignment of short reads. The mapping itself may be performed with or without mismatches, the latter being used more commonly. In addition, reads that map beyond a predetermined set number of locations may be removed as putatively originating from repetitive elements. In the case of miRNAs, usually 5–20 distinct mappings per genome are allowed. sRNA reads are then simply counted to obtain expression values. However, users should also verify that their sRNA reads are not significantly contaminated by degraded mRNA, for example, by checking whether a miRNA library shows unexpected read coverage over the body of highly expressed genes such as GAPDH or ACTB.

Further analysis steps include comparison with known sRNAs and de novo identification of sRNAs. There are class-specific tools for this purpose, such as miRDeep [101] and miRDeep-P [102] for animal and plant miRNAs, respectively, or the trans-acting siRNA prediction tool at the UEA sRNA Workbench [103]. Tools such as miRTools 2.0 [96], ShortStack [104], and iMir [105] also exist for comprehensive annotation of sRNA libraries and for identification of diverse classes of sRNAs.

Functional profiling with RNA-seq

The last step in a standard transcriptomics study (Fig. 1b) is often the characterization of the molecular functions or pathways in which differentially expressed genes (DEGs) are involved. The two main approaches to functional characterization that were developed first for microarray technology are (a) comparing a list of DEGs against the rest of the genome for overrepresented functions, and (b) gene set enrichment analysis (GSEA), which is based on ranking the transcriptome according to a measurement of differential expression. RNA-seq biases such as gene length complicate the direct applications of these methods for count data and hence RNA-seq-specific tools have been proposed. For example, GOseq [106] estimates a bias effect (such as gene length) on differential expression results and adapts the traditional hypergeometric statistic used in the functional enrichment test to account for this bias. Similarly, the Gene Set Variation Analysis (GSVA) [107] or SeqGSEA [108] packages also combine splicing and implement enrichment analyses similar to GSEA.

Functional analysis requires the availability of sufficient functional annotation data for the transcriptome under study. Resources such as Gene Ontology [109], Bioconductor [110], DAVID [111, 112] or Babelomics [113] contain annotation data for most model species. However, novel transcripts discovered during de novo transcriptome assembly or reconstruction would lack at least some functional information and therefore annotation is necessary for functional profiling of those results. Protein-coding transcripts can be functionally annotated using orthology by searching for similar sequences in protein databases such as SwissProt [114] and in databases that contain conserved protein domains such as Pfam [115] and InterPro [116]. The use of standard vocabularies such as the Gene Ontology (GO) allows for some exchangeability of functional information across orthologs. Popular tools such as Blast2GO [117] allow massive annotation of complete transcriptome datasets against a variety of databases and controlled vocabularies. Typically, between 50 and 80 % of the transcripts reconstructed from RNA-seq data can be annotated with functional terms in this way. However, RNA-seq data also reveal that an important fraction of the transcriptome is lacking protein-coding potential. The functional annotation of these long non-coding RNAs is more challenging as their conservation is often less pronounced than that of protein-coding genes. The Rfam database [118] contains most well-characterized RNA families, such as ribosomal or transfer RNAs, while miRBase [119] or miRanda [120] are specialized in miRNAs. These resources can be used for similarity-based annotation of short non-coding RNAs, but no standard functional annotation procedures are available yet for other RNA types such as the long non-coding RNAs.

Integration with other data types

The integration of RNA-seq data with other types of genome-wide data (Fig. 1c) allows us to connect the regulation of gene expression with specific aspects of molecular physiology and functional genomics. Integrative analyses that incorporate RNA-seq data as the primary gene expression readout that is compared with other genomic experiments are becoming increasingly prevalent. Below, we discuss some of the additional challenges posed by such analyses.

DNA sequencing

The combination of RNA and DNA sequencing can be used for several purposes, such as single nucleotide polymorphism (SNP) discovery, RNA-editing analyses, or expression quantitative trait loci (eQTL) mapping. In a typical eQTL experiment, genotype and transcriptome profiles are obtained from the same tissue type across a relatively large number of individuals (>50) and correlations between genotype and expression levels are then detected. These associations can unravel the genetic basis of complex traits such as height [121], disease susceptibility [122] or even features of genome architecture [123, 124]. Large eQTL studies have shown that genetic variation affects the expression of most genes [125–128].
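As a rough illustration of the genotype-expression correlation at the heart of an eQTL test, the sketch below scores a single gene–SNP pair with a Pearson correlation and a permutation p-value. It is a toy example under stated assumptions, not the method of any cited tool: the function names (`pearson_r`, `eqtl_permutation_test`), the 0/1/2 allele-dosage coding, and the twelve made-up individuals are all hypothetical, and a real study would use far more individuals and adjust for covariates and multiple testing.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation in genotype or expression: nothing to correlate
    return sxy / (sxx * syy) ** 0.5


def eqtl_permutation_test(genotypes, expression, n_perm=1000, seed=0):
    """Correlate allele dosage with expression and estimate a two-sided
    permutation p-value by shuffling expression values across individuals."""
    rng = random.Random(seed)
    observed = pearson_r(genotypes, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(genotypes, shuffled)) >= abs(observed):
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical data: allele dosage (0, 1, 2) and normalized expression
# of one gene in twelve individuals.
genotypes = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
expression = [5.1, 4.8, 5.3, 5.0, 6.9, 7.2, 7.0, 6.8, 9.1, 8.9, 9.3, 9.0]

r, p = eqtl_permutation_test(genotypes, expression, n_perm=500, seed=1)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

Repeating such a test for every gene–SNP pair is what makes genome-wide mapping computationally expensive, which is one reason many analyses restrict testing to SNPs near each gene.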
RNA-seq has two major advantages over array-based technologies for detecting eQTLs. First, it can identify variants that affect transcript processing. Second, reads that overlap heterozygous SNPs can be mapped to maternal and paternal chromosomes, enabling quantification of allele-specific expression within an individual [129]. Allele-specific signals provide additional information about a genetic effect on transcription, and a number of computational methods have recently become available that leverage these signals to boost power for association mapping [130–132]. One challenge of this approach is the computational burden, as billions of gene–SNP associations need to be tested; bootstrapping or permutation-based approaches [133] are frequently used [134, 135]. Many studies have focused on testing only SNPs in the cis region surrounding the gene in question, and computationally efficient approaches have been developed recently to allow extremely swift mapping of eQTLs genome-wide [136]. Moreover, the combination of RNA-seq and re-sequencing can be used both to remove false positives when inferring fusion genes [88] and to analyze copy number alterations [137].

DNA methylation

Pairwise DNA-methylation and RNA-seq integration, for the most part, has consisted of the analysis of the correlation between DEGs and methylation patterns [138–140]. General linear models [141–143], logistic regression models [143] and empirical Bayes models [144] have been attempted among other modeling approaches. The statistically significant correlations that were observed, however, accounted for relatively small effects. An interesting shift away from focusing on individual gene–CpG methylation correlations is to use a network-interaction-based approach to analyze RNA-seq in relation to DNA methylation. This approach identifies one or more sets of genes (also called modules) that have coordinated differential expression and differential methylation [145].

Chromatin features

The combination of RNA-seq and transcription factor (TF) chromatin immunoprecipitation sequencing (ChIP-seq) data can be used to remove false positives in ChIP-seq analysis and to suggest the activating or repressive effect of a TF on its target genes. For example, BETA [146] uses differential gene expression in combination with peaks from ChIP-seq experiments to call TF targets. In addition, ChIP-seq experiments involving histone modifications have been used to understand the general role of these epigenomic changes on gene expression [147, 148]. Other RNA-ChIP-sequencing integrative approaches are reviewed in [149]. Integration of open chromatin data such as that from FAIRE-seq and DNase-seq with RNA-seq has mostly been limited to verifying the expression status of genes that overlap a region of interest [150]. DNase-seq can be used for genome-wide footprinting of DNA-binding factors, and this in combination with the actual expression of genes can be used to infer active transcriptional networks [150].

MicroRNAs

Integration of RNA-seq and miRNA-seq data has the potential to unravel the regulatory effects of miRNAs on transcript steady-state levels. This analysis is challenging, however, because of the very noisy nature of miRNA target predictions, which hampers analyses based on correlations between miRNAs and their target genes. Associations might be found in databases such as miRWalk [151] and miRBase [152] that offer target prediction according to various algorithms. Tools such as CORNA [153], MMIA [154, 155], MAGIA [156], and SePIA [157] refine predictions by testing for significant associations between genes, miRNAs, pathways and GO terms, or by testing the relatedness or anticorrelation of the expression profiles of both the target genes and the associated miRNAs. In general, we recommend using miRNA–mRNA associations that are predicted by several algorithms. For example, in mouse, we found that requiring miRNA–mRNA association in five databases resulted in about 50 target mRNA predictions per miRNA (STATegra observations).

Proteomics and metabolomics

Integration of RNA-seq with proteomics is controversial because the two measurements show generally low correlation (~0.40 [158, 159]). Nevertheless, pairwise integration of proteomics and RNA-seq can be used to identify novel isoforms. Unreported peptides can be predicted from RNA-seq data and then used to complement databases normally queried in mass spectrometry as done by Low et al. [160]. Furthermore, post-translational editing events may be identified if peptides that are present in the mass spectrometry analysis are absent from the expressed genes of the RNA-seq dataset. Integration of transcriptomics with metabolomics data has been used to identify pathways that are regulated at both the gene expression and the metabolite level, and tools are available that visualize results within the pathway context (MassTRIX [161], Paintomics [162], VANTED v2 [163], and SteinerNet [164]).

Integration and visualization of multiple data types

Integration of more than two genomic data types is still in its infancy and not yet extensively applied to functional sequencing techniques, but there are already some tools that combine several data types. SNMNMF [165] and PIMiM [166] combine mRNA and miRNA expression
data with protein–protein, DNA–protein, and miRNA–mRNA interaction networks to identify miRNA–gene regulatory modules. MONA [167] combines different levels of functional genomics data, including mRNA, miRNA, DNA methylation, and proteomics data to discover altered biological functions in the samples being studied. Paintomics can integrate any type of functional genomics data into pathway analysis, provided that the features can be mapped onto genes or metabolites [162]. 3Omics [168] integrates transcriptomics, metabolomics and proteomics data into regulatory networks.

In all cases, integration of different datasets is rarely straightforward because each data type is analyzed separately with its own tailored algorithms that yield results in different formats. Tools that facilitate format conversions and the extraction of relevant results can help; examples of such workflow construction software packages include Anduril [169], Galaxy [170] and Chipster [171]. Anduril was developed for building complex pipelines with large datasets that require automated parallelization. The strength of Galaxy and Chipster is their usability; visualization is a key component of their design. Simultaneous or integrative visualization of the data in a genome browser is extremely useful for both data exploration and interpretation of results. Browsers can display in tandem mappings from most next-generation sequencing technologies, while adding custom tracks such as gene annotation, nucleotide variation or ENCODE datasets. For proteomics integration, the PG Nexus pipeline [172] converts mass spectrometry data to mappings that are co-visualized with RNA-seq alignments.

Outlook

RNA-seq has become the standard method for transcriptome analysis, but the technology and tools are continuing to evolve. It should be noted that the agreement between results obtained from different tools is still unsatisfactory and that results are affected by parameter settings, especially for genes that are expressed at low levels. The two major highlights in the current application of RNA-seq are the construction of transcriptomes from small amounts of starting materials and better transcript identification from longer reads. The state of the art in both of these areas is changing rapidly, but we will briefly outline what can be done now and what can be expected in the near future.

Single-cell RNA-seq

Single-cell RNA-seq (scRNA-seq) is one of the newest and most active fields of RNA-seq with its unique set of opportunities and challenges. Newer protocols such as Smart-seq [173] and Smart-seq2 [174] have enabled us to work from very small amounts of starting mRNA that, with proper amplification, can be obtained from just a single cell. The resulting single-cell libraries enable the identification of new, uncharacterized cell types in tissues. They also make it possible to measure a fascinating phenomenon in molecular biology, the stochasticity of gene expression in otherwise identical cells within a defined population. In this context, single-cell studies are meaningful only when a set of individual cell libraries are compared with the cell population, with the aim of identifying subgroups of multiple cells with distinct combinations of expressed genes. Differences may be due to naturally occurring factors such as stage of the cell cycle, or may reflect rare cell types such as cancer stem cells. Recent rapid progress in methodologies for single-cell preparation, including the availability of single-cell platforms such as the Fluidigm C1 [8], has increased the number of individual cells analyzed from a handful to 50–90 per condition, and up to 800 cells at a time. Other methods, such as DROP-seq [175], can profile more than 10,000 cells at a time. This increased number of single-cell libraries in each experiment directly allows for the identification of smaller subgroups within the population.

The small amount of starting material and the PCR amplification limit the depth to which single-cell libraries can be sequenced productively, often to less than a million reads. Deeper sequencing for scRNA-seq will do little to improve quantification as the number of individual mRNA molecules in a cell is small (in the order of 100–300,000 transcripts) and only a fraction of them are successfully reverse-transcribed to cDNA [8, 176]; but deeper sequencing is potentially useful for discovering and measuring allele-specific expression, as additional reads could provide useful evidence.

Single-cell transcriptomes typically include about 3000–8000 expressed genes, which is far fewer than are counted in the transcriptomes of the corresponding pooled populations. The challenge is to distinguish the technical noise that results from a lack of sensitivity at the single-molecule level [173] (where capture rates of around 10–50 % result in the frequent loss of the most lowly expressed transcripts) from true biological noise where a transcript might not be transcribed and present in the cell for a certain amount of time while the protein is still present. The inclusion of added reference transcripts and the use of unique molecule identifiers (UMIs) have been applied to overcome amplification bias and to improve gene quantification [177, 178]. Methods that can quantify gene-level technical variation allow us to focus on biological variation that is likely to be of interest [179]. Typical quality-control steps involve setting aside libraries that contain few reads, libraries that have a low mapping rate, and libraries that have zero expression levels for housekeeping genes, such as GAPDH and ACTB, that are expected to be expressed at a detectable level.
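The three quality-control criteria listed above (few reads, low mapping rate, undetected housekeeping genes) can be sketched as a simple per-library filter. This is a minimal illustration, not a recommended pipeline: the function name `qc_failures`, the dictionary layout of a library, and the thresholds of 100,000 reads and a 50 % mapping rate are assumptions made for this example, and treating a library as failed when any one housekeeping gene has zero counts is one possible reading of the criterion.

```python
def qc_failures(library, min_reads=100_000, min_mapping_rate=0.5,
                housekeeping=("GAPDH", "ACTB")):
    """Return the reasons a single-cell library fails QC (empty list if it passes).

    `library` is a dict with hypothetical keys:
      total_reads  - number of sequenced reads
      mapped_reads - number of reads that aligned
      counts       - dict mapping gene symbol to read count
    """
    reasons = []
    total = library["total_reads"]
    if total < min_reads:
        reasons.append("few reads")
    mapping_rate = library["mapped_reads"] / total if total else 0.0
    if mapping_rate < min_mapping_rate:
        reasons.append("low mapping rate")
    counts = library["counts"]
    # Assumption: any housekeeping gene with zero counts flags the library.
    if any(counts.get(gene, 0) == 0 for gene in housekeeping):
        reasons.append("housekeeping gene not detected")
    return reasons


# Made-up libraries for illustration only.
libraries = {
    "cell_A": {"total_reads": 800_000, "mapped_reads": 640_000,
               "counts": {"GAPDH": 120, "ACTB": 95}},
    "cell_B": {"total_reads": 20_000, "mapped_reads": 15_000,
               "counts": {"GAPDH": 3, "ACTB": 1}},
    "cell_C": {"total_reads": 500_000, "mapped_reads": 150_000,
               "counts": {"GAPDH": 0, "ACTB": 40}},
}

kept = [name for name, lib in libraries.items() if not qc_failures(lib)]
for name, lib in libraries.items():
    print(name, qc_failures(lib) or "pass")
```

Returning the list of reasons rather than a bare pass/fail flag makes it easier to see which criterion removes most libraries, which helps when choosing thresholds for a given protocol.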
Depending on the chosen single-cell protocol and the aims of the experiment, different bulk RNA-seq pipelines and tools can be used for different stages of the analysis as reviewed by Stegle et al. [180]. Single-cell libraries are typically analyzed by mapping to a reference transcriptome (using a program such as RSEM) without any attempt at new transcript discovery, although at least one package maps to the genome (Monocle [181]). While mapping onto the genome does result in a higher overall read-mapping rate, studies that are focused on gene expression alone with fewer reads per cell tend to use mapping to the reference transcriptome for the sake of simplicity. Other single-cell methods have been developed to measure single-cell DNA methylation [182] and single-cell open chromatin using ATAC-seq [183, 184]. At present, we can measure only one functional genomic data-type at a time in the same single cell, but we can expect that in the near future we will be able to recover the transcriptome of a single cell simultaneously with additional functional data.

Long-read sequencing

The major limitation of short-read RNA-seq is the difficulty in accurately reconstructing expressed full-length transcripts from the assembly of reads. This is particularly complicated in complex transcriptomes, where different but highly similar isoforms of the same gene are expressed, and for genes that have many exons and possible alternative promoters or 3’ ends. Long-read technologies, such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore, that were initially applied to genome sequencing are now being used for transcriptomics and have the potential to overcome this assembly problem. Long-read sequencing provides amplification-free, single-molecule sequencing of cDNAs that enables recovery of full-length transcripts without the need for an assembly step. PacBio adds adapters to the cDNA molecule and creates a circularized structure that can be sequenced with multiple passes within one single long read. The Nanopore GridION system can directly sequence RNA strands by using RNA processive enzymes and RNA-specific bases. Another interesting technology was previously known as Moleculo (now Illumina’s TruSeq synthetic

[186], and for determining allele-specific expression from single reads [187]. Nevertheless, long-read sequencing has its own set of limitations, such as a still high error rate that limits de novo transcript identifications and forces the technology to leverage the reference genome [188]. Moreover, the relatively low throughput of SMRT cells hampers the quantification of transcript expression. These two limitations can be addressed by matching PacBio experiments with regular, short-read RNA-seq. The accurate and abundant Illumina reads can be used both to correct long-read sequencing errors and to quantify transcript levels [189]. Updates in PacBio chemistry are increasing sequencing lengths to produce reads with a sufficient number of passes over the cDNA molecule to autocorrect sequencing errors. This will eventually improve sequencing accuracy and allow for genome-free determination of isoform-resolved transcriptomes.

Additional file

Additional file 1: Figure S1. Screenshots of RNA-seq data visualization. a Integrative Genomics Viewer (IGV) [77] display of a gene detected as differentially expressed between the two groups of samples by DEGseq [62]. The bottom track in the right panel is the gene annotation. The tracks are five samples from each group. b RNAseqViewer [80] display of the same data as in (a). c RNAseqViewer heatmap display of a gene detected as differentially spliced between two groups by both DSGSeq [70] and DEXSeq [69]. Introns are hidden in the display to emphasize the signals on the exons. d MISO [81] display of another gene detected as differentially spliced, with junction reads illustrated. (PDF 1152 kb)

Abbreviations

ASM: Alternative splicing module; ChIP-seq: Chromatin immunoprecipitation sequencing; DEG: Differentially expressed genes; eQTL: Expression quantitative trait loci; FDR: False discovery rate; FPKM: Fragments per kilobase of exon model per million mapped reads; GO: Gene Ontology; GSEA: Gene set enrichment analysis; GTF: Gene transfer format; IG: Immunoglobulin; IGV: Integrative Genomics Viewer; miRNA: MicroRNA; mRNA: Messenger RNA; PCA: Principal component analysis; PE read: Paired-end read; RNA-seq: RNA-sequencing; RPKM: Reads per kilobase of exon model per million reads; rRNA: Ribosomal RNA; RSEM: RNA-Seq by Expectation Maximization; scRNA-seq: Single-cell RNA-seq; SE read: Single-end read; siRNA: Short-interfering RNA; SNP: Single nucleotide polymorphism; sQTL: Splicing quantitative trait loci; sRNA: Small RNA; TF: Transcription factor; TPM: Transcripts per million.

Competing interests

The authors declare that they have no competing interests.
37. UCSC Genome Bioinformatics: Frequently Asked Questions: Data File Formats. https://genome.ucsc.edu/FAQ/FAQformat.html#format4. Accessed on 12 January 2016.
38. Pachter L. Models for transcript quantification from RNA-seq. arXiv.org. 2011. http://arxiv.org/abs/1104.3889. Accessed 6 January 2016.
39. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511–5.
40. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
41. Roberts A, Pachter L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods. 2013;10:71–3.
42. Bray N, Pimentel H, Melsted P, Pachter L. Near-optimal RNA-Seq quantification with kallisto. https://liorpachter.wordpress.com/2015/05/10/near-optimal-rna-seq-quantification-with-kallisto/. Accessed 6 January 2016.
43. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12:R22.
44. Ma X, Zhang X. NURD: an implementation of a new method to estimate isoform expression from non-uniform RNA-seq data. BMC Bioinformatics. 2013;14:220.
45. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94.
46. Hansen K, Brenner S, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38:e131.
47. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:R25.
48. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.
49. Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012;13:523–38.
50. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2012;31:46–53.
51. Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data. Genetics. 2010;185:405–16.
52. Johnson WE, Rabinovic A, Li C. Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics. 2007;8:118–27.
53. Nueda MJ, Ferrer A, Conesa A. ARSyN: a method for the identification and removal of systematic noise in multifactorial time course microarray experiments. Biostatistics. 2012;13:553–66.
54. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23:2881–7.
55. Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15:R29.
56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91.
57. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.
58. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.
59. Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11:422.
60. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013;29:1035–43.
61. Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat Methods Med Res. 2013;22:519–36.
62. Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010;26:136–8.
63. Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, et al. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013;14:R95.
64. Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform. 2015;16:59–70.
65. Zheng S, Chen L. A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level. Nucleic Acids Res. 2009;37:e75.
66. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562–78.
67. Singh D, Orellana CF, Hu Y, Jones CD, Liu Y, Chiang DY, et al. FDM: a graph-based statistical method to detect differential transcription using RNA-seq data. Bioinformatics. 2011;27:2633–40.
68. Shi Y, Jiang H. rSeqDiff: detecting differential isoform expression from RNA-Seq data using hierarchical likelihood ratio test. PLoS One. 2013;8:e79448.
69. Anders S, Reyes A, Huber W. Detecting differential usage of exons from RNA-seq data. Genome Res. 2012;22:2008–17.
70. Wang W, Qin Z, Feng Z, Wang X, Zhang X. Identifying differentially spliced genes from two groups of RNA-seq samples. Gene. 2013;518:164–70.
71. Shen S, Park JW, Lu ZX, Lin L, Henry MD, Wu YN, et al. rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc Natl Acad Sci U S A. 2014;111:E5593–601.
72. Drewe P, Stegle O, Hartmann L, Kahles A, Bohnert R, Wachter A, et al. Accurate detection of differential RNA processing. Nucleic Acids Res. 2013;41:5189–98.
73. Hu Y, Huang Y, Du Y, Orellana CF, Singh D, Johnson AR, et al. DiffSplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res. 2013;41:e39.
74. Hilker R, Stadermann KB, Doppmeier D, Kalinowski J, Stoye J, Straube J, et al. ReadXplorer - visualization and analysis of mapped sequences. Bioinformatics. 2014;30:2247–54.
75. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The Human Genome Browser at UCSC. Genome Res. 2002;12:996–1006.
76. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2012;14:178–92.
77. Medina I, Salavert F, Sanchez R, de Maria A, Alonso R, Escobar P, et al. Genome Maps, a new generation genome browser. Nucleic Acids Res. 2013;41(Web Server issue):W41–6.
78. Fiume M, Williams V, Brook A, Brudno M. Savant: genome browser for high-throughput sequencing data. Bioinformatics. 2010;26:1938–44.
79. Rogé X, Zhang X. RNAseqViewer: visualization tool for RNA-Seq data. Bioinformatics. 2013;30:891–2.
80. Katz Y, Wang ET, Silterra J, Schwartz S, Wong B, Thorvaldsdóttir H, et al. Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics. 2015;31:2400–2.
81. Wu E, Nance T, Montgomery SB. SplicePlot: a utility for visualizing splicing quantitative trait loci. Bioinformatics. 2014;30:1025–6.
82. Ryan MC, Cleland J, Kim R, Wong WC, Weinstein JN. SpliceSeq: a resource for analysis and visualization of RNA-Seq data on alternative splicing and its functional impacts. Bioinformatics. 2012;28:2385–7.
83. Liu Q, Chen C, Shen E, Zhao F, Sun Z, Wu J. Detection, annotation and visualization of alternative splicing from RNA-Seq data with SplicingViewer. Genomics. 2012;99:178–82.
84. Dietrich S, Wiegand S, Liesegang H. TraV: a genome context sensitive transcriptome browser. PLoS One. 2014;9:e93677.
85. Carrara M, Beccuti M, Lazzarato F, Cavallo F, Cordero F, Donatelli S, et al. State-of-the-art fusion-finder algorithms sensitivity and specificity. BioMed Res Int. 2013;15:340620.
86. Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, Luo S, et al. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci U S A. 2009;106:12353–8.
87. Yoshihara K, Wang Q, Torres-Garcia W, Zheng S, Vegesna R, Kim H, et al. The landscape and therapeutic relevance of cancer-associated transcript fusions. Oncogene. 2015;34:4845–54.
88. McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G, Sun MG, et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-seq data. PLoS Comput Biol. 2011;7:e1001138.
89. Wu C, Wyatt AW, McPherson A, Lin D, McConeghy BJ, Mo F, et al. Poly-gene fusion transcripts and chromothripsis in prostate cancer. Genes Chromosomes Cancer. 2012;51:1144–53.
90. Wyatt AW, Mo F, Wang K, McConeghy B, Brahmbhatt S, Jong L, et al.
Heterogeneity in the inter-tumor transcriptome of high risk prostate cancer.
Genome Biol. 2014;15:426.
91. Stransky N, Cerami E, Schalm S, Kim JL, Lengauer C. The landscape of kinase
fusions in cancer. Nat Commun. 2014;5:4846.
92. Rabbitts TH. Commonality but diversity in cancer gene fusions. Cell.
2009;137:391–5.
93. McPherson A, Wu C, Hajirasouliha I, Hormozdiari F, Hach F, Lapuk A,
et al. Comrad: detection of expressed rearrangements by integrated
analysis of RNA-Seq and low coverage genome sequence data. Bioinformatics.
2011;27:1481–8.
94. Iyer MK, Chinnaiyan AM, Maher CA. ChimeraScan: a tool for
identifying chimeric transcription in sequencing data. Bioinformatics.
2011;27:2903–4.
95. Pflueger D, Terry S, Sboner A, Habegger L, Esgueva R, Lin PC, et al.
Discovery of non-ETS gene fusions in human prostate cancer using
next-generation RNA sequencing. Genome Res. 2011;21:56–67.
96. Wu J, Liu Q, Wang X, Zheng J, Wang T, You M, et al. mirTools 2.0 for
non-coding RNA discovery, profiling, and functional annotation based on
high-throughput sequencing. RNA Biol. 2013;10:1087–92.
97. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat
Methods. 2012;9:357–9.
98. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics. 2009;25:1754–60.
99. Prüfer K, Stenzel U, Dannemann M, Green RE, Lachmann M, Kelso J. PatMaN:
rapid alignment of short sequences to large databases. Bioinformatics.
2008;24:1530–1.
100. Emde AK, Grunert M, Weese D, Reinert K, Sperling SR. MicroRazerS: rapid
alignment of small RNA reads. Bioinformatics. 2010;26:123–4.
101. An J, Lai J, Lehman ML, Nelson CC. miRDeep*: an integrated application
tool for miRNA identification from RNA sequencing data. Nucleic Acids Res.
2013;41:727–37.
102. Yang X, Li L. miRDeep-P: a computational tool for analyzing the microRNA
transcriptome in plants. Bioinformatics. 2011;27:2614–5.
103. Stocks MB, Moxon S, Mapleson D, Woolfenden HC, Mohorianu I, Folkes L,
et al. The UEA sRNA workbench: a suite of tools for analysing and
visualizing next generation sequencing microRNA and small RNA datasets.
Bioinformatics. 2012;28:2059–61.
104. Axtell MJ. ShortStack: comprehensive annotation and quantification of small
RNA genes. RNA. 2013;19:740–51.
105. Giurato G, De Filippo MR, Rinaldi A, Hashim A, Nassa G, Ravo M, et al.
iMir: an integrated pipeline for high-throughput analysis of small
non-coding RNA data obtained by smallRNA-Seq. BMC Bioinformatics.
2013;14:362.
106. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for
RNA-seq: accounting for selection bias. Genome Biol. 2010;11:1–12.
107. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for
microarray and RNA-Seq data. BMC Bioinformatics. 2013;14:7.
108. Wang X, Cairns MJ. Gene set enrichment analysis of RNA-Seq data:
integrating differential expression and splicing. BMC Bioinformatics.
2013;14 Suppl 5:S16.
109. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene
Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
110. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, et al.
Orchestrating high-throughput genomic analysis with Bioconductor. Nat
Methods. 2015;12:115–21.
111. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of
large gene lists using DAVID Bioinformatics Resources. Nat Protocols.
2009;4:44–57.
112. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools:
paths toward the comprehensive functional analysis of large gene lists.
Nucleic Acids Res. 2009;37:1–13.
113. Medina I, Carbonell J, Pulido L, Madeira SC, Goetz S, Conesa A, et al.
Babelomics: an integrative platform for the analysis of transcriptomics,
proteomics and genomic data with advanced functional profiling. Nucleic
Acids Res. 2010;38 suppl 2:W210–3.
114. Bairoch A, Boeckmann B, Ferro S, Gasteiger E. Swiss-Prot: juggling between
evolution and stability. Brief Bioinformatics. 2004;5:39–55.
115. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al.
The Pfam protein families database. Nucleic Acids Res.
2014;42(Database issue):D222–30.
116. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, et al.
InterPro in 2011: new developments in the family and domain prediction
database. Nucleic Acids Res. 2011;40(Database issue):D306–12.
117. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. Blast2GO: a
universal tool for annotation, visualization and analysis in functional
genomics research. Bioinformatics. 2005;21:3674–6.
118. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S,
et al. Rfam: updates to the RNA families database. Nucleic Acids Res.
2009;37 suppl 1:D136–40.
119. Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence
microRNAs using deep sequencing data. Nucleic Acids Res. 2014;
42(Database issue):D68–73.
120. Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. MicroRNA targets
in Drosophila. Genome Biol. 2003;5:R1.
121. Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD,
Wallace C, et al. Bayesian test for colocalisation between pairs of
genetic association studies using summary statistics. PLoS Genet.
2014;10:e1004383.
122. Moffatt MF, Kabesch M, Liang L, Dixon AL, Strachan D, Heath S, et al.
Genetic variants regulating ORMDL3 expression contribute to the risk of
childhood asthma. Nature. 2007;448:470–3.
123. Gilad Y, Rifkin S, Pritchard J. Revealing the architecture of gene regulation:
the promise of eQTL studies. Trends Genet. 2008;24:408–15.
124. Gaffney D. Global properties and functional complexity of human gene
regulatory variation. PLoS Genet. 2013;9:e1003501.
125. Montgomery S, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J,
et al. Transcriptome genetics using second generation sequencing in a
Caucasian population. Nature. 2010;464:773–7.
126. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, et al.
Understanding mechanisms underlying human gene expression variation
with RNA sequencing. Nature. 2010;464:768–72.
127. Lappalainen T, Sammeth M, Friedlander M, ‘t Hoen PA, Monlong J, Rivas
MA, et al. Transcriptome and genome sequencing uncovers functional
variation in humans. Nature. 2013;501:506–11.
128. Battle A, Mostafavi S, Zhu X, Potash JB, Weissman MM, Shi J, et al.
Characterizing the genetic basis of transcriptome diversity through
RNA-sequencing of 922 individuals. Genome Res. 2014;24:14–24.
129. Pastinen T. Genome-wide allele-specific analysis: insights into regulatory
variation. Nat Rev Genet. 2010;11:533–8.
130. Sun W. A statistical framework for eQTL mapping using RNA-seq data.
Biometrics. 2012;68:1–11.
131. van de Geijn B, McVicker G, Gilad Y, Pritchard JK. WASP: allele-specific for
robust molecular quantitative trait locus discovery. Nat Methods.
2015;12:1061–3.
132. Kumasaka N, Knights AJ, Gaffney DJ. Fine-mapping cellular QTLs with
RASQUAL and ATAC-seq. Nat Genet. 2015. doi: 10.1038/ng.3467.
133. Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc
Natl Acad Sci U S A. 2003;100:9440–5.
134. Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, et al.
Genome-wide associations of gene expression variation in humans. PLoS
Genet. 2005;1:e78.
135. Raj T, Rothamel K, Mostafavi S, Ye C, Lee MN, Replogle JM, et al. Polarization
of the effects of autoimmune and neurodegenerative risk alleles in
leukocytes. Science. 2014;344:519–23.
136. Shabalin A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations.
Bioinformatics. 2012;28:1353–8.
137. Louhimo R, Lepikhova T, Monni O, Hautaniemi S. Comparative analysis of
algorithms for integration of copy number and expression data. Nat
Methods. 2012;9:351–5.
138. Kim JH, Dhanasekaran SM, Prensner JR, Cao X, Robinson D, Kalyana-Sundaram
S, et al. Deep sequencing reveals distinct patterns of DNA methylation in
prostate cancer. Genome Res. 2011;21:1028–41.
139. Li JL, Mazar J, Zhong C, Faulkner GJ, Govindarajan SS, Zhang Z, et al.
Genome-wide methylated CpG island profiles of melanoma cells reveal a
melanoma coregulation network. Sci Rep. 2013;3:2962.
140. Xie L, Weichel B, Ohm JE, Zhang K. An integrative analysis of DNA
methylation and RNA-Seq data for human heart, kidney and liver. BMC Syst
Biol. 2011;5 Suppl 3:S4.
141. Van Eijk KR, de Jong S, Boks MP, Langeveld T, Colas F, Veldink JH, et al.
Genetic analysis of DNA methylation and gene expression levels in whole
blood of healthy human subjects. BMC Genomics. 2012;13:636.
142. Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, et al.
Epigenome-wide association data implicate DNA methylation as an
intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol.
2013;31:142–7.
143. Yeang C-H. An integrated analysis of molecular aberrations in NCI-60 cell
lines. BMC Bioinformatics. 2010;11:495.
144. Jeong J, Li L, Liu Y, Nephew KP, Huang YHM, Shen C. An empirical Bayes
model for gene expression and methylation profiles in antiestrogen
resistant breast cancer. BMC Med Genomics. 2010;3:55.
145. Jiao Y, Widschwendter M, Teschendorff AE. A systems-level integrative
framework for genome-wide DNA methylation and gene expression data
identifies differential gene expression modules under epigenetic control.
Bioinformatics. 2014;30:2360–6.
146. Wang S, Sun H, Ma J, Zang C, Wang C, Wang J, et al. Target analysis
by integration of transcriptome and ChIP-seq data with BETA. Nat
Protoc. 2013;8:2502–15.
147. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J,
Bilenky M, Yen A, et al. Integrative analysis of 111 reference human
epigenomes. Nature. 2015;518:317–30.
148. Madrigal P, Krajewski P. Uncovering correlated variability in epigenomic
datasets using the Karhunen-Loeve transform. BioData Min. 2015;8:20.
149. Angelini C, Costa V. Understanding gene regulatory mechanisms by
integrating ChIP-seq and RNA-seq data: statistical solutions to biological
problems. Front Cell Dev Biol. 2014;2:51.
150. Neph S, Stergachis AB, Reynolds A, Sandstrom R, Borenstein E,
Stamatoyannopoulos JA. Circuitry and dynamics of human transcription
factor regulatory networks. Cell. 2012;150:1274–86.
151. Dweep H, Sticht C, Pandey P, Gretz N. miRWalk - database: prediction of
possible miRNA binding sites by ‘walking’ the genes of 3 genomes. J
Biomed Inform. 2011;44:839–47.
152. Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ. miRBase: tools for
microRNA genomics. Nucleic Acids Res. 2008;36:D154–8.
153. Wu X, Watson M. CORNA: testing gene lists for regulation by microRNAs.
Bioinformatics. 2009;25:832–3.
154. Lee H, Yang Y, Chae H, Nam S, Choi D, Tangchaisin P, et al. BioVLAB-MMIA:
a cloud environment for microRNA and mRNA integrated analysis (MMIA)
on Amazon EC2. IEEE Trans Nanobiosci. 2012;11:266–72.
155. Nam S, Li M, Choi K, Balch C, Kim S, Nephew KP. MicroRNA and mRNA
integrated analysis (MMIA): a web tool for examining biological functions of
microRNA expression. Nucleic Acids Res. 2009;37:W356–62.
156. Sales G, Coppe A, Bisognin A, Bortoluzzi S, Romualdi C. MAGIA, a
web-based tool for miRNA and Genes Integrated Analysis. Nucleic Acids
Res. 2010;38:W352–9.
157. Icay K, Chen P, Cervera C, Lehtonen R, Hautaniemi S. SePIA: RNA and
smallRNA-sequence processing, integration, and analysis. 2015.
http://anduril.org/sepia. Accessed 6 Jan 2016.
158. de Sousa AR, Penalva LO, Marcotte EM, Vogel C. Global signatures of
protein and mRNA expression levels. Mol Biosyst. 2009;5:1512–26.
159. Vogel C, Marcotte EM. Insights into the regulation of protein
abundance from proteomic and transcriptomic analyses. Nat Rev Genet.
2012;13:227–32.
160. Low TY, van Heesch S, van den Toorn H, Giansanti P, Cristobal A, Toonen P.
Quantitative and qualitative proteome characteristics extracted from
in-depth integrated genomics and proteomics analysis. Cell Rep.
2013;5:1469–78.
161. Suhre K, Schmitt-Kopplin P. MassTRIX: mass translator into pathways. Nucleic
Acids Res. 2008;36(Web Server issue):W481–4.
162. García-Alcalde F, García-López F, Dopazo J, Conesa A. Paintomics: a web
based tool for the joint visualization of transcriptomics and metabolomics
data. Bioinformatics. 2011;27:137–9.
163. Rohn H, Junker A, Hartmann A, Grafahrend-Belau E, Treutler H, Klapperstück
M, et al. VANTED v2: a framework for systems biology applications. BMC
Syst Biol. 2012;6:139.
164. Tuncbag N, McCallum S, Huang SS, Fraenkel E. SteinerNet: a web server for
integrating ‘omic’ data to discover hidden components of response
pathways. Nucleic Acids Res. 2012;40:W505–9.
165. Zhang S, Li Q, Liu J, Zhou XJ. A novel computational framework for
simultaneous integration of multiple types of genomic data to identify
microRNA-gene regulatory modules. Bioinformatics. 2011;27:i401–9.
166. Le H-S, Bar-Joseph Z. Integrating sequence, expression and interaction data
to determine condition-specific miRNA regulation. Bioinformatics. 2013;29:i89–97.
167. Sass S, Buettner F, Mueller NS, Theis FJ. A modular framework for gene set
analysis integrating multilevel omics data. Nucleic Acids Res. 2013;41:9622–33.
168. Kuo TC, Tian TF, Tseng YJ. 3Omics: a web-based systems biology tool for
analysis, integration and visualization of human transcriptomic, proteomic
and metabolomic data. BMC Syst Biol. 2013;7:64.
169. Ovaska K, Laakso M, Haapa-Paananen S, Louhimo R, Chen P, Aittomäki V,
et al. Large-scale data integration framework provides a comprehensive
view on glioblastoma multiforme. Genome Med. 2010;2:65.
170. Goecks J, Nekrutenko A, Taylor J. Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent computational
research in the life sciences. Genome Biol. 2010;11:R86.
171. Kallio MA, Tuimala JT, Hupponen T, Klemelä P, Gentile M, Scheinin I, et al.
Chipster: user-friendly analysis software for microarray and other
high-throughput data. BMC Genomics. 2011;12:507.
172. Pang CNI, Tay AP, Aya C. Tools to covisualize and coanalyze proteomic data
with genomes and transcriptomes: validation of genes and alternative
mRNA splicing. J Proteome Res. 2014;13:84–98.
173. Ramsköld D, Luo S, Wang YC, Li R, Deng Q, Faridani OR, et al. Full-length
mRNA-Seq from single-cell levels of RNA and individual circulating tumor
cells. Nat Biotechnol. 2012;30:777–82.
174. Picelli S, Björklund ÅK, Faridani OR, Sagasser S, Winberg G, Sandberg R.
Smart-seq2 for sensitive full-length transcriptome profiling in single cells.
Nat Methods. 2013;10:1096–8.
175. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly
parallel genome-wide expression profiling of individual cells using nanoliter
droplets. Cell. 2015;161:1202–14.
176. Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, et al.
From single-cell to cell-pool transcriptomes: stochasticity in gene expression
and RNA splicing. Genome Res. 2014;24:496–510.
177. Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, et al.
Quantitative single-cell RNA-seq with unique molecular identifiers. Nat
Methods. 2014;11:163–6.
178. Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, et al.
Counting absolute numbers of molecules using unique molecular
identifiers. Nat Methods. 2011;9:72–4.
179. Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, et al.
Accounting for technical noise in single-cell RNA-seq experiments. Nat
Methods. 2013;10:1093–5.
180. Stegle O, Teichmann SA, Marioni JC. Computational and analytical
challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16:133–45.
181. Trapnell C, Cacchiarelli D. The dynamics and regulators of cell fate decisions
are revealed by pseudotemporal ordering of single cells. Nat Biotechnol.
2014;32:381–6.
182. Lorthongpanich C, Cheow LF, Balu S, Quake SR, Knowles BB, Burkholder WF,
et al. Single-cell DNA-methylation analysis reveals epigenetic chimerism in
preimplantation embryos. Science. 2013;341:1110–2.
183. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP,
et al. Single-cell chromatin accessibility reveals principles of regulatory
variation. Nature. 2015;523:486–90.
184. Cusanovich DA, Daza R, Adey A, Pliner HA, Christiansen L, Gunderson KL,
et al. Multiplex single-cell profiling of chromatin accessibility by
combinatorial cellular indexing. Science. 2015;348:910–4.
185. Tilgner H, Jahanbani F, Blauwkamp T, Moshrefi A, Jaeger E, Chen F, et al.
Comprehensive transcriptome analysis using synthetic long-read
sequencing reveals molecular co-association of distant splicing events. Nat
Biotechnol. 2015;33:736–42.
186. Au KF, Sebastiano V, Afshar PT, Durruthy JD, Lee L, Williams BA, et al.
Characterization of the human ESC transcriptome by hybrid sequencing.
Proc Natl Acad Sci U S A. 2013;110:E4821–30.
187. Tilgner H, Grubert F, Sharon D, Snyder MP. Defining a personal, allele-specific,
and single-molecule long-read transcriptome. Proc Natl Acad Sci U S A.
2014;111:9869–74.
188. Au KF, Underwood JG, Lee L, Wong WH. Improving PacBio long read
accuracy by short read alignment. PLoS One. 2012;7:e46679.
189. Hansen KD, Wu Z, Irizarry RA, Leek JT. Sequencing technology does not
eliminate biological variability. Nat Biotechnol. 2011;29:572–3.
190. Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher JP. Calculating sample
size estimates for RNA sequencing data. J Comput Biol. 2013;20:970–8.
191. Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. Scotty: a web tool for
designing RNA-Seq experiments to measure differential gene expression.
Bioinformatics. 2013;29:656–7.
192. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds
systems biology. Biol Direct. 2009;4:14.
193. Noble WS. How does multiple testing correction work? Nat Biotechnol.
2009;27:1135–7.
194. Robinson DG, Storey JD. subSeq: determining appropriate sequencing
depth through efficient read subsampling. Bioinformatics. 2014;30:3424–6.
195. Liu Y, Zhou J, White KP. RNA-seq differential expression studies: more
sequence or more replication? Bioinformatics. 2013;30:301–4.
196. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq
accuracy, reproducibility and information content by the Sequencing
Quality Control Consortium. Nat Biotechnol. 2014;32:903–14.
197. Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, et al. Synthetic
spike-in standards for RNA-seq experiments. Genome Res. 2011;21:1543–51.
198. Kouzine F, Wojtowicz D, Yamane A, Resch W, Kieffer-Kwon KR, Bandle R,
et al. Global regulation of promoter melting in naive lymphocytes. Cell.
2013;153:988–99.
199. Van Dijk EL, Jaszczyszyn Y, Thermes C. Library preparation methods for
next-generation sequencing: tone down the bias. Exp Cell Res.
2014;322:12–20.
200. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics. 2009;25:1105–11.
201. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2:
accurate alignment of transcriptomes in the presence of insertions,
deletions and gene fusions. Genome Biol. 2013;14:R36.
202. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and
splicing in short reads. Bioinformatics. 2010;26:873–81.
203. Jean G, Kahles A, Sreedharan VT, De Bona F, Rätsch G. RNA-Seq read
alignments with PALMapper. Curr Protoc Bioinformatics. 2010;11(6).
204. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice:
accurate mapping of RNA-seq reads for splice junction discovery. Nucleic
Acids Res. 2010;38:e178.
205. Marco-Sola S, Sammeth M, Guigó R, Ribeca P. The GEM mapper: fast,
accurate and versatile alignment by filtration. Nat Methods. 2012;9:1185–8.
206. Zhao S, Zhang B. A comprehensive evaluation of ensembl, RefSeq, and
UCSC annotations in the context of RNA-seq read mapping and gene
quantification. BMC Genomics. 2015;16:97.
207. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol.
2009;10:R25.
208. Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting
differentially expressed genes from RNA-seq data. Am J Bot. 2012;99:248–56.
209. Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM. Efficient
experimental design and analysis strategies for the detection of differential
expression using RNA-Sequencing. BMC Genomics. 2012;13:484.
210. Nookaew I, Papini M, Pornputtapong N, Scalcinati G, Fagerberg L, Uhlén M,
et al. A comprehensive comparison of RNA-Seq-based transcriptome
analysis from reads to differential gene expression and cross-comparison
with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids
Res. 2012;40:10084–97.
211. Seyednasrollah F, Rantanen K, Jaakkola P, Elo LL. ROTS: reproducible
RNA-seq biomarker detector-prognostic markers for clear cell renal cell
cancer. Nucleic Acids Res. 2016;44(1):e1. doi:10.1093/nar/gkv806.
212. Bi Y, Davuluri RV. NPEBseq: nonparametric empirical bayesian-based
procedure for differential expression analysis of RNA-seq data. BMC
Bioinformatics. 2013;14:262.
213. Nueda MJ, Tarazona S, Conesa A. Next maSigPro: updating maSigPro
bioconductor package for RNA-seq time series. Bioinformatics.
2014;30:2598–602.
214. GTEx Consortium. The Genotype-Tissue expression (GTEx) project. Nat
Genet. 2013;45:580–5.