Introduction To Differential Gene Expression Analysis Using RNA-seq
September 2015
updated November 14, 2019
1 Introduction to RNA-seq
  1.1 RNA extraction
    1.1.1 Quality control of RNA preparation (RIN)
  1.2 Library preparation methods
  1.3 Sequencing (Illumina)
  1.4 Experimental Design
    1.4.1 Capturing enough variability
    1.4.2 Avoiding bias
4 Read Quantification
  4.1 Gene-based read counting
  4.2 Isoform counting methods (transcript-level quantification)
7 Appendix
  7.1 Improved alignment
  7.2 Additional tables
  7.3 Installing bioinformatics tools on a UNIX server
List of Tables
1 Sequencing depth recommendations.
2 Examples for technical and biological replicates.
3 Illumina's different base call quality score schemes.
4 The fields of the alignment section of SAM files.
5 The FLAG field of SAM files.
6 Comparison of classification and clustering techniques.
7 Programs for DGE.
8 Biases and artifacts of Illumina sequencing data.
9 FASTQC test modules.
10 Optional entries in the header section of SAM files.
11 Overview of RSeQC scripts.
12 Overview of QoRTs QC functions.
13 Normalizing read counts between different conditions.
14 Normalizing read counts within the same sample.
List of Figures
1 RNA integrity assessment (RIN).
2 Overview of general RNA library preparation steps.
3 Size selection steps during common RNA library preparations.
4 The different steps of sequencing by synthesis.
5 Sequence data repositories.
6 Schema of an Illumina flowcell.
7 Phred score ranges.
8 Typical bioinformatics workflow of differential gene expression analysis.
9 Scheme of raw read QC.
10 Classic sequence alignment example.
11 RNA-seq read alignment.
12 Alignment-free sequence comparison.
13 Kallisto's strategy.
14 Schematic representation of a SAM file.
15 CIGAR strings of aligned reads.
16 Graphical summary of STAR's log files for 96 samples.
17 Different modes of counting read-transcript overlaps.
18 Schema of RSEM-based transcript quantification.
19 Effects of different read count normalization methods.
20 Comparison of the read distribution plots for untransformed and log2-transformed values.
21 Comparison of log2- and rlog-transformed read counts.
22 Dendrogram of rlog-transformed read counts.
23 PCA on raw counts and rlog-transformed read counts.
24 Regression models and DGE.
25 Histogram and MA plot after DGE analysis.
26 Heatmaps of log-transformed read counts.
27 Read counts for two genes in two conditions.
28 Schematic overview of different DE analysis approaches.
29 Example plots to judge DGE analysis results.
30 Example results for ORA and GSEA.
Command-line interface The first steps of the analyses – which are the most computationally demanding
– will be performed directly on our servers. The interaction with our servers is completely text-based, i.e.,
there will be no graphical user interface. We will instead be communicating entirely via the command line
using the UNIX shell scripting language bash. You can find a good introduction to shell basics at
http://linuxcommand.org/lc3_learning_the_shell.php (for our course, chapters 2, 3, 5, 7, and 8 are
probably most relevant).
To start using the command line, Mac users should use the app called Terminal. Windows users need to install PuTTY, a terminal emulator (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). You probably want the bits under the "A Windows installer for everything except PuTTYtel" heading. PuTTY will allow you to establish a connection with a UNIX server and interact with it.
Programs that we will be using via the command line:
FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
featureCounts http://bioinf.wehi.edu.au/subread-package/
MultiQC http://multiqc.info/docs/
QoRTs https://hartleys.github.io/QoRTs/
RSeQC http://rseqc.sourceforge.net/
samtools http://www.htslib.org/
STAR https://github.com/alexdobin/STAR
UCSC tools https://hgdownload.soe.ucsc.edu/admin/exe/
Details on how to install these programs via the command line can be found in the Appendix.
The only program with a graphical user interface will be IGV. Go to https://www.broadinstitute.org/
igv/ → “Downloads”, register with your academic email address and launch the Java web start (for Windows
machines, you should go for the 1.2 GB version).
R The second part of the analyses – where we will need support for statistics and visualization more
than pure computation power – will mostly be done on the individual computers using the programming
language R. You can download R for both MacOS and Windows from http://cran.rstudio.com/. After you
have installed R, we highly recommend installing RStudio (http://www.rstudio.com/products/rstudio/
download/), which will provide you with an interface to write commands at a prompt, construct a script
and view plots all in a single integrated environment.
R packages that will be used throughout the course:
from CRAN: ggplot2, magrittr, pheatmap, UpSetR
from Bioconductor: DESeq2, edgeR, limma, pcaExplorer, org.Sc.sgd.db, vsn
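If these packages are not yet installed on your machine, a minimal installation sketch in R could look like the following (assuming a recent R version in which Bioconductor packages are installed via the BiocManager package):

# CRAN packages
install.packages(c("ggplot2", "magrittr", "pheatmap", "UpSetR"))

# Bioconductor packages (installed via BiocManager)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("DESeq2", "edgeR", "limma", "pcaExplorer", "org.Sc.sgd.db", "vsn"))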
1 Introduction to RNA-seq
The original goal of RNA sequencing was to identify which genomic loci are expressed in a cell (population)
at a given time over the entire expression range and without the absolute need to pre-define the sequences
of interest, as is the case with cDNA microarrays. Indeed, RNA-seq was shown to detect lowly expressed transcripts while showing strongly reduced false positive rates in comparison to microarray-based
expression quantification (Illumina, 2011; Nookaew et al., 2012; Zhao et al., 2014)∗ . In addition, RNA-
seq can, in principle, be used not only for the quantification of expression differences between distinct conditions, but also for the detection and quantification of non-protein-coding transcripts, splice isoforms,
novel transcripts and sites of protein-RNA interactions. However, the lack of a pre-specified selection of
cDNA probes for RNA-seq puts the burden of identifying which transcripts we actually found on the post-
processing workflow, and the main goal (e.g., novel transcript discovery versus differential gene expression
analysis) should be clear before the experiment is designed, as the devil is in the details. The detection of gene
expression changes (i.e., mRNA levels) between different cell populations and/or experimental conditions
remains the most common application of RNA-seq; yet even for that highly popular application, widely
accepted and adopted standards are lacking and the RNA-seq field is only slowly converging on
best practices for differential gene expression analysis amid a myriad of available software (Byron et al.,
2016).
The general workflow of a differential gene expression analysis is:
1. Sequencing (biochemistry)
(a) RNA extraction
(b) Library preparation (including mRNA enrichment)
(c) Sequencing
2. Bioinformatics
(a) Processing of sequencing reads (including alignment)
(b) Estimation of individual gene expression levels
(c) Normalization
(d) Identification of differentially expressed (DE) genes
Before RNA can be sequenced, it must first be extracted and separated from its cellular environment,
which consists primarily of proteins and DNA. The most prevalent methods for RNA isolation are silica-
gel based membranes or liquid-liquid extractions with acidic phenol-chloroform. In the former case, RNA
exclusively binds to a silica-gel membrane, while the remaining cellular components are washed away. Silica-
gel membranes require ethanol for binding. The volume of ethanol influences which transcripts are bound to
the membrane: more ethanol results in the retention of RNAs <200 bp, whereas a smaller volume results in
their loss. When using phenol-chloroform extraction, the cellular components are dissolved into three phases:
the organic phase; the interphase; and the aqueous phase, in which the RNA is retained. Phenol-chloroform
extraction is typically followed by an alcohol precipitation to de-salt and concentrate the RNA. An alcohol
precipitation can be performed with either ethanol or isopropanol, both of which require the use of a salt.
Different salts lead to different precipitation efficiencies and result in different RNA populations; e.g., lithium
chloride, a commonly used salt, has been reported to result in the loss of tRNAs, 5S rRNAs, snRNAs,
and other RNAs <250–300 bp (Cathala et al., 1983). Given the multitude of factors that can influence
the outcome of RNA extraction, it is therefore important to process the RNA in a highly controlled and
standardized manner, so that the knowledge of how the RNA was isolated can be appropriately leveraged for
one’s understanding of the data later on. Additional information on how RNA extraction methods influence
RNA-seq data can be found in Sultan et al. (2014).
Although both extraction methods previously mentioned are designed to eliminate DNA contamination,
they are often imperfect. But even small amounts of DNA contamination (as little as 0.01% genomic DNA by weight) can negatively impact results (NuGEN, 2013). Accordingly, it is advisable to take additional measures to ensure DNA-free RNA, e.g., by treating the RNA with DNase.
∗ For a detailed comparison of different methods for transcriptome quantification, see Lowe et al. (2017)
Figure 1: RNA integrity assessment is based on the ratio of 28S to 18S rRNA, estimated from the band intensity (a) or a densitometry plot (b). RNA used for RNA-seq experiments should be as intact as possible. Figure taken from Griffith et al. (2015).
In high-throughput sequencing terms, a library is a (preferably random) collection of DNA fragments that
are ready for sequencing with a specific protocol (Figure 2). We are focussing on the most popular HTS
platform, provided by Illumina † . Their short read sequencing is based on the principle “Sequencing by
Synthesis”, which originated with the Sanger sequencing protocols starting in the 1970s and was further
adapted for massively parallelized applications by Solexa. These methods rely on naturally occurring enzymes that constantly “read” DNA, e.g. DNA polymerases. These molecules have the capacity
to identify the precise order in which specific nucleotides are needed to create a perfect complementary copy
of a given single strand of DNA. Illumina’s method relies on exactly that property of DNA polymerase and
simply ensures that there’s a base-specific record every time DNA polymerase is adding a new nucleotide to
an evolving DNA copy. The record of choice for Illumina platforms is a fluorescent signal, i.e. each nucleotide
is labelled with a specific fluorophore so that the incorporation of A, C, T, G will yield distinct emission
† For a comprehensive overview of recent high-throughput sequencing methods beyond Illumina protocols, see Goodwin et al.
(2016).
signals. These signals are recorded via automated cameras using sophisticated microscopes. This means that
in order for cDNA fragments to be sequenced on an Illumina sequencer, they:
• have to be single-stranded,
• cannot exceed a certain size,
• have to be immobilized on a glass slide, and
• the inherent sequence properties of a given fragment (length, GC content, homopolymers) will influence
the fidelity of the DNA polymerase and thus the ultimate signal (Section 5).
Therefore, library preparation for Illumina protocols aims to generate cDNA fragments between 150 to 300 bp, which will be hybridized to a glass slide with microfluidic chambers (flowcell). Of those fragments, only the ends (50 to 150 bp) will actually be sequenced.
Figure 2: General RNA library preparation workflow. After RNA extraction and measuring its integrity, rRNA is depleted (either using poly(A)-selection or rRNA depletion) and the remaining RNA molecules are fragmented, ideally achieving a uniform size distribution. Double-stranded cDNA is synthesized and the adapters for sequencing are added to construct the final library whose fragment size distribution should be unimodal and well-defined. Image taken from Zeng and Mortazavi (2012).
Due to the numerous types of RNA families, there is a great variety of library preparation protocols. Since the quantification of mRNA is by far the most commonly used application of RNA-seq experiments, we will focus on protocols that are typically applied in this context. Keep in mind that the library preparation can seriously affect the outcome of the sequencing in terms of quality as well as coverage. More importantly, small transcripts (smaller than about 150 bp) and strand information will be lost during standard RNA-seq library preparations (Figure 3), i.e., if those details are of interest to you, make sure to select an alternative protocol. For more details on library preparation protocols including single-cell RNA-seq, CLiP-seq and more, see Head et al. (2014), Shanker et al. (2015), Yeri et al. (2018), Boone et al. (2018).
mRNA enrichment The total RNA pool extracted from typical cells contains about 80–85% rRNA and 10–15% tRNA (Farrell, 2010). This is problematic if one is mostly interested in protein-coding mRNA. To increase the quantification of mRNA, there are two basic options: (i) enrichment of poly-(A)-containing
mRNAs with the help of oligo-(dT) beads, or (ii) removal of ribosomal RNA via complementary sequences.
The two strategies will yield different populations of RNA, therefore they should never be mixed within
the same experiment! For example, various non-polyadenylated mRNAs such as histone transcripts and
immature mRNAs will not be captured with the poly(A)-enrichment protocol, while the alternative “ribo-
minus” approach does not exclude unspliced RNAs. Do ask the sequencing facility you are going to collaborate
with about the type of protocol they are using as this will inform you about the types of RNA noise that
you will encounter. For more details about the different enrichment strategies and their impact, see Table S4
of Griffith et al. (2015)‡ .
strand-specific sequencing If you need to distinguish overlapping transcripts, e.g., when sequencing
prokaryotic transcriptomes or because the aims of the RNA-seq experiment include the identification of
anti-sense transcripts, the information about which strand a fragment originated from needs to be preserved
during library preparation. The most commonly used method incorporates deoxy-UTP during the synthesis
of the second cDNA strand (for details see Levin et al. (2010)).
!
The goal of your RNA-seq experiment will determine the most appropriate library preparation
protocol. Make sure to check the most important parameters including the expected size distri-
bution of your transcripts.
‡ http://journals.plos.org/ploscompbiol/article/file?type=supplementary&id=info:doi/10.1371/journal.pcbi.1004393.s006
Figure 3: Size selection steps during common RNA library preparations. Typically, RNA is fragmented before or after
cDNA synthesis, either via chemical (e.g. metal ion exposure), enzymatic (e.g., RNAses) or physical processes (e.g.,
shearing). Prior to sequencing, cDNA fragments are enriched for the size range that Illumina sequencing machines
can handle best, i.e., between 150 to 1,000 bp (dashed boxes in the gel electropherogram). This means that for the
vast majority of RNA-seq experiments, RNAs smaller than about 150 bp will be strongly under-represented. If you
are interested in smaller RNA species, make sure that a protocol for small RNA library preparation is used. Figure
taken from Griffith et al. (2015).
After hybridization of the DNA fragments to the flowcell through means of adapters, each fragment is
massively and clonally amplified, forming clusters of double-stranded DNA. This step is necessary to ensure
that the sequencing signal will be strong enough to be detected unambiguously for each base of each fragment.
The most commonly used Illumina sequencing protocols will only cover 50 to 100 bp of each fragment
(depending on the read length that was chosen). The sequencing of the fragment ends is based on fluorophore-
labelled dNTPs with reversible terminator elements that will become incorporated and excited by a laser
one at a time and thereby enable the optical identification of single bases (Figure 4, Table 4).
Figure 4: The different steps of sequencing with Illumina’s sequencing by synthesis method. Library preparation:
Adapters are ligated to each cDNA fragment to eventually attach them to the flowcell on which they are going
to be sequenced. To increase the signal of the sequencing step, every fragment is first clonally amplified after the
hybridization onto the flowcell (cluster generation). Finally, the nucleotide order of each fragment is revealed through
PCR with fluorophore-labelled nucleotides: Images are taken after each round of nucleotide incorporation and bases
are identified based on the recorded excitation spectra. Figure from Illumina.
Sequencing depth and coverage Technically, coverage refers to the number of reads being sequenced in
relation to the genome size, i.e., it is an estimate of how many times each base of the genome is sequenced.
For experiments based on the sequencing of the genome, the Lander-Waterman equation is commonly cited
as it describes the relationship between coverage, the number of sequenced reads and the genome size:
coverage = (read length ∗ number of reads) / haploid genome length
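As a quick back-of-the-envelope illustration (the numbers below are invented for this example and do not refer to a particular experiment), the calculation can be done directly in R:

# coverage = (read length * number of reads) / haploid genome length
read_length     <- 100    # bp per read
number_of_reads <- 30e6   # 30 million single-end reads
genome_length   <- 3.1e9  # approximate haploid human genome size in bp

coverage <- read_length * number_of_reads / genome_length
coverage  # ~0.97, i.e., each base would be covered roughly once on average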
To identify sequencing errors (and possibly distinguish them from genomic variants), every base should be
covered more than once. The coverage value will always be an estimate as the genome is usually not covered
uniformly since, for example, euchromatic fragments tend to be overrepresented and standard PCR protocols
will favor GC-rich regions and impede AT-rich regions (see Table 8 for more examples of biases that occur
with Illumina sequencing platforms).
For RNA-seq, the coverage estimation has rather little practical value as the size of the transcriptome is
not known as accurately as the size of the genome, and, more importantly, the per-base coverage will vary
drastically between different transcripts depending, most importantly, on their expression. Thus, the number
of required reads is determined by the least abundant RNA species of interest. However, it is impossible to
know before sequencing how many reads are going to be needed to capture enough fragments of the most
lowly expressed genes. In order to estimate the sequencing depth (= read numbers) needed for a specific
RNA-seq experiment, consider the following parameters:
• guidelines from the literature/references (e.g., ENCODE (2011), Sims et al. (2014))
• type of experiment and the type of biological question
• transcriptome size and complexity (many repetitive regions present?)
• error rate of the sequencing platform
See Table 1 for recommended numbers of reads for typical RNA-seq applications. Be aware that, depending
on your application, you may want to sequence deeper – consider increasing the number of reads if your goal
is to:
• identify lowly expressed genes
• identify very small fold changes between different conditions
• quantify alternative splicing/different isoforms
• detect chimeric transcripts
• detect novel transcripts, transcription start and end sites
• perform de novo transcript assembly
Keep in mind that strongly expressed genes and residual rRNA will always account for a large fraction of
all reads.
If you are interested in performing power analyses for differential gene expression detection using RNA-seq
data, you can have a look at the publication and R code provided by Ching et al. (2014).
!
In most cases of differential gene expression analysis, it is more important to increase the number
of biological replicates than the sequencing depth of single samples (Rapaport et al., 2013; Ching
et al., 2014; Liu et al., 2014; Gierliński et al., 2015).
Single read vs. paired-end Single read (SR) sequencing determines the DNA sequence of just one end
of each DNA fragment.
Paired-end (PE) sequencing yields both ends of each DNA fragment. PE runs are more expensive (you are
generating twice as many DNA reads as with SR), but they increase the mappability for repetitive regions
and allow for easier identification of structural variations and indels. They may also increase the precision
of studies investigating splicing variants or chimeric transcripts.
Most RNA-seq experiments aim to identify genes whose expression varies between two or more experimen-
tal settings. This means that during our downstream analyses, we will test, for every single gene, whether its
expression seems to change when comparing two (or more) conditions. It seems immediately obvious that
Table 1: Recommended sequencing depths for typical RNA-seq experiments for different genome sizes (Genohub,
2015). DGE = differential gene expression, SR = single read, PE = paired-end.
comparing just one measurement per condition is not going to yield a very robust answer since gene ex-
pression may vary because of many factors (e.g., temperature, sex, time of the day), not just because of the
condition of interest (e.g., genotype or drug treatment). To distinguish transcription changes caused by the
condition being studied from transcription variation caused by differences between individual organisms, cell
populations, or experimenters, it is important to perform RNA-seq experiments with sufficient numbers of
different types of replicates (Table 2) and with a well thought-out experimental design.
Our goal is to observe a reproducible effect that can be due only to the treatment (avoiding
confounding and bias) while simultaneously measuring the variability required to estimate how
much we expect the effect to differ if the measurements are repeated with similar but not identical
samples (replicates). (Altman and Krzywinski, 2014)
Table 2: Replicate categories and types in a hypothetical mouse single-cell gene expression RNA sequencing experi-
ment. Taken from Blainey et al. (2014).
Replicate type | Category
Subjects
  Colonies | Biological
  Strains | Biological
  Cohoused groups | Biological
  Gender | Biological
  Individuals | Biological
  Organs from sacrificed animals | Biological
Sample preparation
  Methods for dissociating cells from tissue | Technical
  Dissociation runs from given tissue sample | Technical
  Individual cells | Biological
  RNA-seq library construction | Technical
Sequencing
  Runs from the library of a given cell | Technical
  Reads from different transcript molecules | Variable
  Reads with unique molecular identifier from a given transcript molecule | Technical
Technical replicates Every experiment will have some random noise associated with protocols or equip-
ment. Generally speaking, technical replicates are therefore repeated measurements of the same sample
(Blainey et al., 2014). For RNA-seq specifically, the ENCODE consortium has defined technical replicates
as different library preparations from the same RNA sample. They should account for batch effects from
the library preparation such as reverse transcription and PCR amplification. To avoid possible lane effects
(e.g., differences in the sample loading, cluster amplification, and efficiency of the sequencing reaction), it is
good practice to multiplex the same sample over different lanes of the same flowcell. In most cases, technical
variability introduced by the sequencing protocol is quite low and well controlled, so that technical replicates
accounting for library preparation alone are rarely done – as long as you use the same protocol and the same
sequencing center for all your samples.
Biological replicates There is an on-going debate over what kinds of samples represent true biological
replicates, but a generally accepted definition is that biological replicates should be “parallel measurements
of biologically distinct samples that capture random biological variation” (Blainey et al., 2014). Biological
replicates will allow you to have a better handle on the true mean and variance of expression (of all genes
in question) for the biological population of interest. The ENCODE consortium specifies that biological
replicates should represent RNA from an independent growth of cells/tissue (ENCODE (2011)). Nevertheless,
for complex experimental designs, this may mean that the distinction between biological and technical
replicates depends on which sources of variation are of interest and which ones are being viewed as noise
sources.
Numbers of replicates Currently, most published RNA-seq experiments contain three biological repli-
cates. Based on one of the most exhaustive RNA-seq experiments reported to date (48 replicates per condi-
tion), Schurch et al. (2016) recommend the use of at least six replicates per condition if the focus is on a
reliable description of one condition’s transcriptome or strongly changing genes between two conditions. If
the goal of the experiment is to identify as many differentially expressed genes as possible (including slightly
changing ones and those that are lowly expressed), as many as twelve replicates are recommended.
Always keep in mind that you are ultimately trying to draw conclusions about entire populations of cells
or even organisms just by looking at very selective subsets of these. The degree of generalizability of your
findings to, say, all mice of a specific strain, will strongly depend on how well you were able to capture good
representatives in your experiment.
! As a general rule, the more genes with low fold changes that are to be detected, the more replicates
are needed to increase the precision of the estimation of the biological variability.
Artificial RNA spike-in If it is important to you to accurately quantify absolute transcript concentrations, you may want to consider using spike-ins of artificial RNA, such as the ERCC spike-in standard. This set consists of 92 polyadenylated transcripts of varying lengths (250–2,000 nucleotides) and GC contents (5–51%) (Jiang et al., 2011). These RNAs of known quantities can be used for the calibration of
the RNA concentrations in each sample and to assess the sensitivity, coverage and linearity of your experi-
ment, i.e., the overall technical performance of your experiment. The ERCC has released its own R package
for analyzing spike-ins: erccdashboard, which is available at Bioconductor (Munro et al., 2014). Note that
different spike-in controls are needed for each type of RNA, but standards are not yet available for all RNA
types (ENCODE, 2011).
Spike-ins should not be used for normalizing between different samples since they cannot account for
differences in the amount of starting material, which will almost always be the case (unless you are sure you
extracted RNA from the same number of cells with the same efficiency for all samples). In addition, Risso
et al. (2014) (and others) demonstrated that the application of the spike-ins is not as reliable as one would
hope for.
1.4.2 Avoiding bias
The main goal of a well planned experiment is to improve the precision of the answers you will eventually
get. This means that you should:
1. Identify the question of interest (What is the effect you are truly after?);
2. Attempt to identify possible sources of variability (nuisance factors);
3. Plan the experiment in a way that reduces the effect of the expected nuisance factors;
4. Protect yourself against unknown sources of variation.
If you feel overwhelmed with the lists of nuisance factors, go back to the first step and try to prioritize. It
may also make sense to start with a pilot experiment first.
The next paragraphs will give you a brief summary of typical means to come up with a suitable experimental
design.
Randomization In addition to sufficient numbers of replicates, true randomization when selecting repli-
cates and performing sample preparations can help to avoid unconscious selection bias that might be caused
by subtle differences in the activity of the animals, their appearance, the growth pattern of cell lines etc.
True randomization means: make the decision about any of the factors of interest by tossing a coin (Honaas
et al., 2016)! This is fairly straight-forward when the factors are easily controllable, such as deciding which
batches of cells to treat with a drug and which ones to keep as a control.
Blocking Randomization is meant to protect you against falling prey to spurious signals due to unintended
batch effects. Usually, you will know about some factors that are very likely to be responsible for gene
expression variation, such as sex, weight, or the cell cycle status of your cells of interest. If it is feasible
to group your samples into distinct classes (or “blocks”) for these known sources of variation, a blocking
experimental design may make sense and will help increase statistical sensitivity. For a blocking design, you
will create complete sub-experiments for each class, i.e. all conditions of interest must be present in every
block. By creating these blocks in which the nuisance factor is kept constant, you will be able to detect
the changes that are due to the factor of interest without having to worry about the nuisance factor. If
the blocking factor accounts for a sufficient amount of sample-to-sample variation, this will increase the
sensitivity of the statistical tests to detect changes of interest – there is no guarantee though! Also, keep in mind that within each block the assignment of treatments etc. should still be randomized.
For more general insights into good experimental design, we highly recommend Nature Methods’ “Points of
Significance” series by Naomi Altman and Martin Krzywinski.
1. What is the main advantage of stranded RNA-seq libraries?
Most journals require that any sequencing data related to a manuscript is deposited in a publicly accessible
database. The Sequence Read Archive (SRA) is the main repository for nucleic acid sequences (Leinonen
et al., 2011) currently storing more than 25,000 tera(!)bases. There are three copies of the SRA which
are maintained by the NCBI, the European Bioinformatics Institute, and the DNA Databank of Japan
(Figure 5), respectively. The different mirrors of the SRA offer different ways to browse and download the
data. NCBI/GenBank, for example, typically allows download of the SRA’s compressed file format only,
which means that you will have to re-format those files yourself into the more commonly used FASTQ
files (Section 2.2). ENA, on the other hand, offers a text file that contains the direct links to different
types of raw data formats, which can then be downloaded with standard command line tools via FTP or
Aspera. To learn more about the SRA and its data storage and retrieval strategies, please read the account by
O’Sullivan et al. (2018). A step-by-step tutorial for downloading data with either ENA or NCBI can be found
here: https://www.biostars.org/p/325010/. Tools that may make the searching for specific sequencing
data more amenable include SRA Explorer (https://www.biostars.org/p/366721/) and MetaSRA (http:
//metasra.biostat.wisc.edu/) (Bernstein et al., 2017).
ENA
Here, we show you how to download raw sequence data from the European instance of the SRA, which can
be accessed via https://www.ebi.ac.uk/ena. At ENA, the sequencing reads are directly available in FASTQ
format, which will be explained below.
To download a set of FASTQ files:
1. Go to https://www.ebi.ac.uk/ena.
2. Search for the accession number of the project, e.g., ERP004763 (should be indicated in the published
paper).
3. There are several ways to start the download:
(a) Click on the link within the column “Fastq files (ftp)” and save the file of interest. Done.
(b) If you prefer the command line, copy the link’s address of the “Fastq files” column (right mouse
click), go to the command line, move to the target directory, type:
$ wget < link copied from the ENA website >
(c) If there are many samples within one project, you can download the summary of the sample
information from ENA by right-clicking on “TEXT” and copying the link location.
$ wget -O samples_at_ENA.txt "<LINK>"   # the quotation marks are important
Once you have done this, go to the folder where you will store the data and use the 11th column
of the TEXT file (”Fastq files (ftp)”) to feed the individual FTP URLs of the different samples to
the wget command:
$ cut -f11 samples_at_ENA.txt | xargs wget   # this would download ALL 672 samples
As mentioned above, all sequencing data submitted to the SRA (i.e., with an SRA accession number) can
also be retrieved through NCBI (https://www.ncbi.nlm.nih.gov/sra).
Example data Throughout the course, we will be working with sequencing reads from the most compre-
hensive RNA-seq dataset to date that contains mRNA from 48 replicates of two S. cerevisiae populations:
wildtype and snf2 knock-out mutants (Gierliński et al., 2015; Schurch et al., 2016). All 96 samples were
sequenced on one flowcell (Illumina HiSeq 2000); each sample was distributed over seven lanes, which means
that there are seven technical replicates per sample. The accession number for the entire data set (consisting
of 7 x 2 x 48 (= 672) raw read files) is ERP004763.
?
Use the information from the file ERP004763_sample_mapping.tsv (from https://ndownloader.
figshare.com/files/2194841) to download all FASTQ files related to the biological replicates
no. 1 of sample type “SNF2” as well as of sample type “WT”. Try to do it via the command line
and make sure to create two folders (e.g., SNF2_rep1 and WT_rep1), each of which should contain
seven FASTQ files in the end.
A simple for-loop could look like this:
$ for ACC_NR in ERR458493 ERR458494 ERR458495 ERR458496 ERR458497 ERR458498
do
egrep ${ACC_NR} ERP004763_sample_mapping.tsv | cut -f11 | xargs wget
done
? Can you come up with a more generic one, e.g. without manually typing out the actual accession
numbers of interest? Can you spot the vulnerabilities of the code shown above?
Currently, raw reads are most commonly stored as FASTQ files. However, details of the file formats may vary
widely depending on the sequencing platform, the lab that released the data, or the data repository. For
a more comprehensive overview of possible file formats of raw sequencing data, see the NCBI’s file format
guide: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/.
The FASTQ file format was derived from the simple text format for nucleic acid or protein sequences, FASTA.
FASTQ bundles the sequence of every single read produced during a sequencing run together with the quality
scores. FASTQ files are uncompressed and quite large because they contain the following information for every
single sequencing read:
1. @ followed by the read ID and possibly information about the sequencing run
2. sequenced bases
3. + (perhaps followed by the read ID again, or some other description)
4. quality scores for each base of the sequence (ASCII-encoded, see below)
Again: be aware that this is not a strictly defined file format – variations do exist and may cause havoc!
Here’s a real-life example snippet of a FASTQ file downloaded from ENA:
$ zcat ERR459145.fastq.gz | head
@ERR459145.1 DHKW5DQ1:219:D0PT7ACXX:2:1101:1590:2149/1
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
+
@7<DBADDDBH?DHHI@DH>HHHEGHIIIGGIFFGIBFAAGAFHA'5?B@D
@ERR459145.2 DHKW5DQ1:219:D0PT7ACXX:2:1101:2652:2237/1
GCAGCATCGGCCTTTTGCTTCTCTTTGAAGGCAATGTCTTCAGGATCTAAG
+
@@;BDDEFGHHHHIIIGBHHEHCCHGCGIGGHIGHGIGIIGHIIAHIIIGI
@ERR459145.3 DHKW5DQ1:219:D0PT7ACXX:2:1101:3245:2163/1
TGCATCTGCATGATCTCAACCATGTCTAAATCCAAATTGTCAGCCTGCGCG
!
For paired-end (PE) sequencing runs, there will always be two FASTQ files – one for the forward reads, one for the reverse reads.
Once you have downloaded the files for a PE run, make sure you understand how the origin of
each read (forward or reverse read) is encoded in the read name information as some downstream
analysis tools may require you to combine the two files into one.
Sequence identifier The first line of each FASTQ read snippet contains the read ID. Earlier Illumina
sequencing platforms (< version 1.8) generated read IDs with the following format:
@<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read_#>
Starting from version 1.8 the sequence identifier line has the following format:
@<machine_id>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>
<read>:<is filtered>:<control number>:<index sequence>
As you can see, every single read can thus be pin-pointed to the precise physical location on the flowcell
where its template DNA was attached. The location is defined by the x- and y-coordinates within a given
tile of a specific lane within a single flowcell (Figure 6). However, it should be pointed out that once data
has been deposited in the SRA, the read ID might have been significantly altered to contain different types
of meta data!
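As a small illustration, the Illumina part of the read name from the FASTQ snippet above can be split into its components with base R; the field labels in the comments follow the version ≥1.8 convention described above (note the trailing /1, a leftover of the older naming scheme indicating the first read of a pair):

# Illumina read name taken from the ERR459145 FASTQ example above
id     <- "DHKW5DQ1:219:D0PT7ACXX:2:1101:1590:2149/1"
fields <- strsplit(id, ":")[[1]]
fields
# [1] machine ID, [2] run number, [3] flowcell ID, [4] lane,
# [5] tile, [6] x coordinate, [7] y coordinate plus read number ("/1")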
Base call quality scores Illumina sequencing is based on identifying the individual nucleotides by the
fluorescence signal emitted upon their incorporation into the growing sequencing read (Figure 4). Once the
sequencing run is complete, images taken during each DNA synthesis step are analyzed and the read clusters’
fluorescence intensities are extracted and translated into the four letter code. The deduction of nucleotide
sequences from the images acquired during sequencing is commonly referred to as base calling.
Figure 6: Illumina flowcell details. While some details have been changed over the years, flowcells are basically
microscopy glass slides covered with primers.
Due to the imperfect nature of the sequencing process and limitations of the optical instruments (see Table 8),
base calling will always have inherent uncertainty. This is the reason why FASTQ files store the DNA sequence
of each read together with a position-specific quality score that represents the error probability, i.e., how likely
it is that an individual base call may be incorrect. The score is called the Phred score, Q, which is logarithmically related to the probability p that a base call is incorrect: Q = −10 ∗ log10(p). For example, a Phred score of
10 corresponds to one error in every ten base calls (Q = −10 ∗ log10(0.1)), or 90% accuracy; a Phred score
of 20 corresponds to one error in every 100 base calls, or 99% accuracy. A higher Phred score thus reflects
higher confidence in the reported base.
To assign each base a unique score identifier (instead of numbers of varying character length), Phred scores
are typically represented as ASCII characters. At http://ascii-code.com/ you can see which characters
are assigned to what number.
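A small base-R sketch may make the encoding more tangible; the quality string below is an arbitrary set of characters chosen for illustration, not a real read:

qual    <- "II?;5#"                 # made-up Phred+33-encoded quality string
ascii   <- utf8ToInt(qual)          # ASCII codes of the individual characters
phred   <- ascii - 33               # subtract the Sanger/Illumina 1.8+ offset
p_error <- 10^(-phred / 10)         # error probability: p = 10^(-Q/10)
data.frame(char = strsplit(qual, "")[[1]], ascii, phred, p_error)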
For raw reads, the range of scores will depend on the sequencing technology and the base caller used (Illumina,
for example, used a tool called Bustard, or, more recently, RTA). Unfortunately, Illumina has been anything
but consistent in how they a) calculated and b) ASCII-encoded the Phred score (see Table 3 and Figure 7
for the different conventions)! In addition, Illumina now allows base call Phred scores as high as 45,
while 41 used to be the maximum score until the HiSeq X sequencer. This may cause issues with downstream
applications that expect an upper limit of 41.
! Note that different base quality assignments exist (Table 3). Try to always make sure you know
which version of the Phred score you are dealing with.
To convert an Illumina FASTQ file version 1.3 (Phred+64) to version 1.8 (Phred+33), you could, for example,
use the following sed command:
$ sed -e '4~4y/@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi/!"#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJ/' originalFile.fastq
Table 3: Base call quality scores are represented with the Phred range. Different Illumina (formerly Solexa) versions
used different scores and ASCII offsets. Starting with Illumina format 1.8, the score now represents the standard
Sanger/Phred format that is also used by other sequencing platforms and the sequencing archives. Also see Figure 7.
Figure 7: The ASCII interpretation and ranges of the different Phred score notations used by Illumina and the original
Sanger interpretation (Table 3). Although the Sanger format allows a theoretical score of 93, raw sequencing reads
typically do not exceed a Phred score of 60. In fact, most Illumina-based sequencing will result in maximum scores
of 41 to 45.
1. Which base call is more likely to be incorrect – one with a Phred score of “#” or one with
?
a Phred score of “;”?
2. Can you guess why the Phred scores are always transformed to ASCII with an offset of at
least 33?
3. What is the baseline uncertainty that Illumina grants to its base calling algorithm?
4. How can you find out which Phred score encoding was used in a given FASTQ file?
Quality controls should be done at every analysis step. Ideally, quality control should be proactive and
comprehensive — see it as a chance to get to know your data, which will enable you to perform downstream
analyses with (more) appropriate assumptions and parameters. Even if flaws and biases are identified, you
may be able to correct those problems in silico.
Figure 8: Typical bioinformatics workflow of differential gene expression analysis with commonly used tools (shown in
blue). Tools for quality control are marked in orange (with MultiQC allowing the convenient combination of numerous
QC results). The most commonly used file formats to store the results of each processing step are indicated in gray.
Since an analysis typically starts with the raw reads (stored in FASTQ files), your first step should be to check
the overall quality of the sequenced reads. A poor RNA-seq run will be characterized by the presence of one
or more of the following types of uninformative sequences:
• PCR duplicates∗
• adapter contamination
• rRNA and tRNA reads
• unmappable reads, e.g. from contaminating nucleic acids
All but the last category of possible problems can be detected using a program called FASTQC. FASTQC
is released by the Babraham Institute and can be freely downloaded at http://www.bioinformatics.
babraham.ac.uk/projects/fastqc/. It essentially aggregates the quality scores and other properties of the
individual reads (e.g. GC content) stored in a given FASTQ file. For each type of feature that it checks, FastQC
flags the results with either “pass”, “warning”, or “fail”, depending on how far the sample deviates from a
hypothetical dataset without significant bias. Keep in mind though that some sample types are expected to have certain biases, so not all “fail” verdicts mean that the sequencing should be repeated! RNA-seq data, for example, tends to have clear signs of non-uniform base content distributions for the first 10-15 bp of most reads, which will usually result in a “warning” or “fail” by FastQC. However, this is a property that has been observed for the vast majority of RNA library preparation protocols and it is attributed to the random hexamer priming step, which appears to not quite be as random as one would hope for (Hansen et al., 2010)†.
∗ It is impossible to distinguish whether identical reads represent PCR duplicates or independent occurrences of the exact
Table 9 contains the details of each FastQC assessment and the expected results; a great FASTQC tutorial
was written by the Michigan State University’s Core Facility: https://rtsf.natsci.msu.edu/genomics/
tech-notes/fastqc-tutorial-and-faq/.
Zhou and Rokas (2014) provide a good summary of typical issues of Illumina sequencing in general and
subsequent steps that can be taken to alleviate some of them (Figure 9).
Figure 9: Comprehensive workflow of quality controls for high-throughput sequencing data. Rounded rectangles
represent QC procedures, hexagons represent decision-making steps, and dotted lines represent QC procedures that
are optional or applicable to only certain types of studies. Figure from (Zhou and Rokas, 2014).
# run FastQC (for the course it's available in the software folder)
$ ~/mat/software/FastQC/fastqc ERR458493.fastq.gz --extract -o fastqc_results

$ while read ID
do
grep -v PASS ${ID}_fastqc/summary.txt | \
montage txt:- ${ID}_fastqc/Images/*png \
-tile x3 -geometry +0.1+0.1 -title ${ID} ${ID}.png
done < files.txt

# run multiqc within the fastqc_results folder and use the folder names (WT_1 etc.) as prefixes to the sample names in the final output
$ cd fastqc_results/
$ ~/mat/software/anaconda2/bin/multiqc . --dirs --interactive -o QC/
In order to identify the transcripts that are present in a specific sample, the genomic origin of the sequenced
cDNA fragments must be determined. The assignment of sequenced reads to the most likely locus of origin
is called mapping and it is a crucial step in almost all high-throughput sequencing experiments. One way to
solve this string-matching problem is to figure out where a short sequence of nucleotides has its best match
along the very long sequence that is the (transcribed) genome. Since this involves the lining up of individual
letters of two (or more) given strings, these approaches are known as read alignment.
Sequence alignment has been a long-standing research question of computational biology as bioinformaticians
have tried to compare stretches of RNA, DNA and protein sequences since the 1970s; most often to determine
sequence homologies and infer phylogenetic relationships (Figure 10). There are many non-trivial details to
the seemingly simple problem, for example, one needs to decide how to handle mis-matches and gaps, whether
we truly need a global alignment (for the entire lengths of both sequences), and at what point one decides
that two strings cannot be reasonably aligned to each other. Equally important are practical aspects, i.e.
we need clever ways to find the best possible alignment without actually computing all possible solutions
because that would very quickly become a computationally intractable problem.
The general challenge of short read alignment following high-throughput sequencing is to map millions of
reads accurately and in a reasonable time, despite the presence of sequencing errors, genomic variation
and repetitive elements. The different alignment programs employ various strategies that are meant to
speed up the process (e.g., by indexing the reference genome) and find a balance between mapping fidelity
and error tolerance. Most tools for short read alignment use algorithms that follow the “seed-and-extend”
approach∗ :
1. Use a subset of the read as a “seed” for which the tool is going to find the best possible match in an
index made up from the reference genome.
2. Every matched seed is extended on both sides under certain constraints (e.g. max. number of permissible mismatches) until as much of the read is aligned as possible. This is the actual alignment step,
which is searching for the optimal local alignment of a given read at a position within the reference
genome that’s anchored around the seed’s match. That local alignment step is usually performed using
the Smith-Waterman algorithm (Smith and Waterman, 1981), one of the cornerstones of computational
biology.
This means that the parameters for how many mismatches one allows may need to be tuned depending on
the experiment at hand. For example, sequencing results of human tumors can probably be expected to show
greater inherent discrepancies with the reference genome than healthy mouse cells obtained from one of the
inbred reference strains. The same holds true for different sequencing platforms – many third generation
sequencing techniques including Nanopore and PacBio have higher error rates than Illumina short read
sequencing, which needs to be taken into consideration during the alignment step. For an excellent overview
of short read alignment strategies, see Reinert et al. (2015).
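To get a feel for what such a local (Smith-Waterman-style) alignment looks like, here is a small sketch in R using the Bioconductor package Biostrings; the two sequences are invented toy examples and the gap penalties are arbitrary choices, not recommended settings:

library(Biostrings)  # Bioconductor package for biological string handling

read_seq <- DNAString("ACGTACGTTAGC")        # made-up "read"
ref_seq  <- DNAString("TTTACGTACGATAGCGGG")  # made-up reference snippet

# local alignment: only the best-scoring sub-regions of the two sequences are aligned
aln <- pairwiseAlignment(pattern = read_seq, subject = ref_seq,
                         type = "local", gapOpening = 5, gapExtension = 2)
aln          # aligned regions, including mismatches and gaps
score(aln)   # alignment score under the chosen scoring scheme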
The main challenge of RNA-seq data in particular (in contrast to genome sequencing) is the spliced alignment
of exon-exon-spanning reads (Figure 11) and the presence of multiple different isoforms of the same gene.
Some alignment programs tried to mitigate this problem by aligning to the transcriptome, but this approach
is limited to known transcripts and thus heavily dependent on the annotation. Moreover, many reads will
∗ For a great explanation of the differences between the most popular alignment tools, see Ye et al. (2015).
overlap with more than one isoform, introducing mapping ambiguity. The most popular RNA-seq alignment
programs (e.g., STAR, TopHat, GSNAP; see Engström et al. (2013) for a review of RNA-seq aligners) use
the entire genome as the reference and existing gene annotation as a guide for where to expect large gaps
(i.e., intron sequences that are not part of sequence reads of mRNA-derived fragments). If reads cannot be
placed despite accounting for the known introns, most tools will also attempt to identify novel splice events.
This is based on certain assumptions about transcript structures that may or may not be correct (e.g., most
algorithms search for the most parsimonious combination of exons which might not reflect the true biological
driving force of isoform generation). Additionally, lowly expressed isoforms may have very few reads that
span their specific splice junctions while, conversely, splice junctions that are supported by few reads are
more likely to be false positives. Therefore, novel splice junctions will show a bias towards strongly expressed
genes. Until reads are routinely sequenced at greater lengths, the alignment of spliced reads will therefore remain one of the most prevalent problems of RNA-seq data†.
Figure 11: RNA-seq of mRNAs produces 2 kinds of reads: single exon reads (Read 1) and exon-exon-spanning reads
(Read 2). While single exon reads can be aligned equally easily to the genome and to the transcriptome, exon-exon-
spanning reads have to be split to be aligned properly if the genome sequence is used as a reference (b).
Despite tremendous progress in terms of speed, read alignment is often by far the most computationally
intensive step of the entire bioinformatics RNA-seq pipeline. Furthermore, once we have the information
about the locus of origin for every single read (= alignment result), we subsequently need to count the
number of reads that overlap with known genes or transcripts (Section 4) because the ultimate goal of RNA-
seq typically is the quantification of transcripts; it is usually not sufficient to present a catalogue of genes that
were found to be expressed in the samples at hand. The most recent approaches to achieve reliable and fast
quantification of transcript abundances have therefore dispensed with the idea of the tedious alignment step;
they focus instead on judging sequence similarities without assigning residue-residue correspondences.
These alignment-free methods – as implemented by Salmon and Kallisto (Patro et al., 2017; Bray et al.,
2016) – follow the rationale that similar sequences share similar subsequences (k-mers or words). Counting
the shared k-mer occurrences should therefore give a good relative measure of sequence similarity, irrespective
of the precise genome location (Zielezinski et al., 2017). This also means that the quantitative information
about expression levels of individual genes will be the immediate result of the read mapping step without
the need for additional tools.
To assess the sequence similarity, the following steps are typically taken (Figure 12):
1. The sequences for comparison (reads, reference) are sliced up into collections of unique (!) k-mers of a
given length k.
2. For each pairwise comparison, we count the number of times a specific k-mer appears in both sequence
strings that are being compared.
3. To assess the similarity between the two strings, some sort of distance function is employed, for example,
Euclidean distance; two identical sequences should have a distance of zero.
In practice, Salmon and Kallisto will first generate an index of k-mers from all known transcript sequences.
These transcript k-mers will then be compared with the k-mers of the sequenced reads, yielding a pseudoalignment that describes how many k-mers a read shares with a set of compatible transcripts (based
on the distance scores; see Figure 13). By grouping all pseudoalignments that belong to the same set of
transcripts, they can then estimate the expression level of each transcript model.
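To make this more concrete, here is a minimal sketch of the two kallisto steps (index building and pseudoalignment-based quantification). The file names and the fragment-length parameters (-l, -s) are placeholders and assumptions; the appropriate values depend on your library and are described in the kallisto documentation.
# 1. build the k-mer index from a FASTA file of transcript (cDNA) sequences
$ kallisto index -i transcripts.idx transcripts.fa.gz

# 2. pseudoalign single-end reads and estimate transcript abundances
#    (for single-end data, the mean fragment length -l and its standard deviation -s must be supplied)
$ kallisto quant -i transcripts.idx -o kallisto_out --single -l 200 -s 20 reads.fastq.gz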
The pseudoalignment tools offer a dramatic increase in speed, but there are some downsides that you may
need to consider. The most obvious caveat of the pseudoaligners is their absolute reliance on a precise and
comprehensive transcript (cDNA!) annotation. When a sequenced fragment originates from a genomic locus
that is not part of the pre-defined cDNA annotation (e.g. an intron or an unannotated transcript), it can be
falsely mapped to a transcript since the relevant genomic sequence is not available. While alignment-based
tools will discard reads if their edit distance becomes too large, pseudoalignment currently does not entail a
comparable scoring system to validate the compatibility; there is therefore no safeguard against spurious
alignments. This means, for example, that a 100-bp read can pseudoalign with a transcript with which it
shares only a single k-mer – if no better match can be found within the universe of the pre-generated cDNA
index.
!
While lightweight mapping tools such as Kallisto and Salmon have been shown to perform
almost as well as classic alignment tools on simulated data, these programs are
still under active development (Srivastava et al., 2019) and are prone to spurious alignments,
particularly for lowly expressed genes.
Irrespective of the type of read mapping (alignment- or pseudoalignment), the presence of a reference sequence
as well as gene annotation (i.e., which parts of the reference sequence correspond to genes) are fundamental
to the majority of RNA-seq projects.
Genome sequences and annotation are often generated by consortia such as (mod)ENCODE, The Mouse
Genome Project, The Berkeley Drosophila Genome Project, and many more. The results of these efforts can
either be downloaded from individual websites set up by the respective consortia or from pan-species
databases such as those hosted by the University of California, Santa Cruz (UCSC; https://genome.ucsc.edu/)
and the European genome resource, Ensembl (http://www.ensembl.org).
UCSC and Ensembl try to organize, unify, and provide access to a wide range of genomes and annotation
data. The advantage of downloading data from UCSC (or Ensembl) is that even if you were to work with
different species, the file formats and naming conventions will be consistent (and your scripts will be more
likely to work). The documentation at https://genome.ucsc.edu/FAQ/FAQreleases.html gives a good
overview of the genomes and annotation that are available at UCSC. Unfortunately, UCSC and Ensembl
differ in their naming conventions and the frequency of updates. In addition, the gene annotations provided
by either RefSeq or Ensembl are based on different annotation pipelines trying to determine the details about
untranslated regions, introns, exons etc.‡ (Zhao and Zhang, 2015).
‡ Yandell and Ence (2012) and Mudge and Harrow (2016) wrote excellent introductions to gene annotation intricacies.
!
Note that UCSC and Ensembl use slightly different naming conventions that can seriously affect
downstream analyses. Try to stick to one source.
Always ensure you know exactly which version of a genome and annotation you are
working with.
Reference sequences are usually stored in plain text FASTA files that can either be compressed with the
generic gzip command or converted into the binary .2bit format using the tool faToTwoBit.
We used the UCSC Genome Browser website to download the reference genome of yeast (go to
https://genome.ucsc.edu/, click on “Downloads” → “Genome Data” to reach http://hgdownload.soe.ucsc.edu/downloads.html,
where you will find an overview of all available reference genomes and the respective links).
# Download genome sequence of S. cerevisiae from UCSC
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.2bit

$ head sacCer3.fa
>chrI
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
GCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTAC
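The downloaded .2bit file has to be converted back to plain FASTA before it can be used by most downstream tools. A minimal sketch using UCSC's twoBitToFa utility (assuming the binary has been downloaded from the UCSC utilities directory and made executable) could look like this:
# convert the compressed .2bit file into plain FASTA
# (twoBitToFa is available from http://hgdownload.soe.ucsc.edu/admin/exe/)
$ chmod +x twoBitToFa
$ ./twoBitToFa sacCer3.2bit sacCer3.fa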
GFF The General Feature Format has nine required fields; the sequence name, start, and end fields form
the basic tuple that allows for the identification of a feature's location with respect to the reference genome
(e.g., bases 100 to 1,000 of chromosome 1). Fields must be separated by a single TAB, but no white space. All but the final
field in each feature line must contain a value; missing values should be denoted with a ‘.’
There are two versions of the GFF format in use which are similar, but not compatible:
# GFF - version 3
ctg123 . exon 1300 1500 . + . ID=exon00001
ctg123 . exon 1050 1500 . + . ID=exon00002
ctg123 . exon 3000 3902 . + . ID=exon00003
GTF The Gene Transfer Format is based on the GFF, but is defined more strictly. (It is sometimes referred
to as GFF2.5 because the first eight GTF fields are the same as GFF2, but, as for GFF3, the 9th field has been
expanded into a list of attributes.) Contrary to GFF files, the TYPE VALUE pairs of GTF files are separated
by one space and must end with a semi-colon (followed by exactly one space if another attribute is added
afterwards):
# example for the 9th field of a GTF file
gene_id "Em:U62.C22.6"; transcript_id "Em:U62.C22.6.mRNA"; exon_number 1
The gene_id and transcript_id values are globally unique identifiers for the genomic locus of the transcript
and for the transcript itself, respectively, and must be the first two attributes listed. Textual attributes should be
surrounded by double quotes.
# GTF example
chr1  HAVANA  gene  11869  14412  .  +  .  gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1  HAVANA  transcript  11869  14409  .  +  .  gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
More information on GTF format can be found at http://mblab.wustl.edu/GTF2.html (or, for the most
recent version: http://mblab.wustl.edu/GTF22.html).
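Since GTF files are plain TAB-delimited text, standard UNIX tools are often enough to get a quick overview of an annotation file. For example, the following sketch (skipping possible header lines that start with #) counts how many entries of each feature type a GTF file contains:
# count the entries per feature type (gene, transcript, exon, ...)
$ grep -v "^#" sacCer3.gtf | cut -f3 | sort | uniq -c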
The following screenshot illustrates how you can, for example, download a GTF file of yeast transcripts from
the UCSC Genome Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables).
!
GTF files downloaded from the UCSC Table Browser have the same entries for gene_id and
transcript_id. This can lead to problems with downstream analysis tools that expect exons
of different isoforms to have the same gene_id, but different transcript_ids.
Here’s a way to get a properly formatted GTF file (i.e., with different entries for gene_id and transcript_id)
of RefSeq genes using the UCSC tool genePredToGtf:
# first, download a table for "Genes and Gene Predictions" from the UCSC Table Browser,
# indicating as the output format: "all fields from selected table"
# NOTE: this may not work for all GTF files downloaded from UCSC! genePredToGtf is very finicky
# and every organism's annotation may have been generated and deposited by a different person
$ head -n1 allfields_hg19.txt
bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
# remove first column and first line, feed that into genePredToGtf
$ cut -f 2- allfields_hg19.txt | sed '1d' | \
      genePredToGtf file stdin hg19_RefSeq.gtf
$ head -n1 hg19_RefSeq.gtf
chr1 stdin exon 66999639 67000051 . + . gene_id "SGIP1"; transcript_id "NM_032291"; exon_number "1"; exon_id "NM_032291.1"; gene_name "SGIP1";
BED format The BED format is the simplest way to store annotation tracks. It has three required fields
(chromosome, start, end) and up to 9 optional fields (name, score, strand, thickStart, thickEnd, itemRgb,
blockCount, blockSizes, blockStarts). The number of fields per line can thus vary from three to twelve,
but must be consistent within a file and must obey the order, i.e. lower-numbered fields must always be
populated if higher-numbered fields are used. Fields seven to twelve are only necessary if regions should
be drawn in a Genome Browser with the typical appearance known for gene tracks. Note that the BED
format indicates a region with a 0-based start position and a 1-based end position (GTF/GFF are 1-based in both
positions)§.
# 6-column BED file defining transcript loci
chr1 66999824 67210768 NM_032291 0 +
chr1 33546713 33586132 NM_052998 0 +
chr1 25071759 25170815 NM_013943 0 +
chr1 48998526 50489626 NM_032785 0 -
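Because of the off-by-one difference in the start coordinates noted above, converting between GTF and BED requires shifting the start position. A minimal sketch that turns the exon lines of a GTF file into a 6-column BED file (the name and score columns are simply filled with placeholders here):
# GTF is 1-based in start and end; BED starts are 0-based, so subtract 1 from the GTF start
$ awk 'BEGIN{FS=OFS="\t"} $3=="exon" {print $1, $4-1, $5, ".", ".", $7}' sacCer3.gtf > sacCer3_exons.bed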
1. Which annotation database is currently recommended for poly(A)-enriched RNA-seq data?
?
2. Which annotation database would you use for RNA-seq of total RNA?
3. How many non-coding RNA transcripts does the Ensembl annotation for the human refer-
ence hg19 contain? Find out via the command line.
!
Obtaining a correctly formatted GTF file may be one of the most difficult tasks in the entire
analysis! Do take this seriously and invest the time to make sure that the GTF file you are using
is correctly formatted. Do not take the risk of introducing strange results (which you may not
notice) that are due to formatting issues only!
Numerous alignment programs have been published in the past (and will be published in the future), and
depending on your specific project, some aligners may be preferred over others. For example, detection of
structural variants and fusion transcripts will require very specific settings or a dedicated alignment tool for
that particular task.
For straightforward RNA-seq data that will be used for differential gene expression analysis, STAR (Dobin
et al., 2013) has been shown to be very efficient and reasonably sensitive. The main caveat is the large number
of putative novel splice sites that should be regarded with caution (Engström et al., 2013). The very detailed
documentation of STAR can be found in Alex Dobin’s GitHub repository: https://github.com/alexdobin/
STAR/blob/master/doc/STARmanual.pdf, and lots of advice regarding optimal parameter settings can be
found in Dobin and Gingeras (2016).
Another popular aligner is TopHat, which is basically a sophisticated wrapper around the genomic aligner
Bowtie (Kim et al., 2013). Generally, the specific choice of alignment tool has relatively little impact on
the downstream analyses (compared to the significant impact that the choices of annotation, quantification
tools, and differential expression analysis tools have; see, for example Costa-Silva et al. (2017); Everaert
et al. (2017); Williams et al. (2017)). However, Ballouz et al. (2018) argue that some tools might offer a high
degree of optimization for samples with specific characteristics that would not be served as well by
the usual top-of-the-class tool. In any case, we strongly recommend reading the documentation of any tool
you are going to use in order to tune the parameters that might be applicable to your samples.
Shown here are the example commands for the alignment to the S. cerevisiae genome using STAR.
1. Generate genome index This step has to be done only once per genome type (and alignment
program). The index files will contain all the information from the reference genome in a compressed
format that is optimized for efficient access and comparison with the query read sequences. This
includes clever representations of the genome sequence, the chromosome names and lengths, splice
junctions coordinates, and information about the genes (e.g. the strand). The main input files for this
step therefore encompass the reference genome sequence and an annotation file.
# create a directory to store the index in
$ REF_DIR=~/mat/referenceGenomes/S_cerevisiae
$ mkdir ~/STARindex
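The index generation command itself is not reproduced here; a minimal sketch of what such a STAR call could look like is shown below. The parameter values (in particular --sjdbOverhang, which should ideally be the read length minus 1, and the thread count) are assumptions and need to be adapted to your data set.
# sketch of a STAR genome index generation run
$ STAR --runMode genomeGenerate \
      --genomeDir ~/STARindex \                    # where the index files will be stored
      --genomeFastaFiles ${REF_DIR}/sacCer3.fa \   # reference genome sequence
      --sjdbGTFfile ${REF_DIR}/sacCer3.gtf \       # annotation supplying known splice junctions
      --sjdbOverhang 50 \                          # ideally: read length - 1
      --runThreadN 4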
!
The default settings or the settings shown here may not be optimal for your application (or
even for this application)! Please, read the STAR manual and Dobin and Gingeras (2016)
and decide which parameters are suitable for your data set!
3. BAM file indexing Most downstream applications will require a .BAM.BAI file together with every BAM
file to quickly access the BAM files without having to load them into memory. To obtain these index
files, simply run the samtools index command for each BAM file once the mapping is finished.
# export samtools path (for convenience)
$ export PATH=/home/classadmin/software/samtools-1.7:$PATH
?
Which STAR options shown above:
• ... have to be different for every sample that you map?
• ... should remain consistent for all samples of one analysis?
• ... will affect the number of reads in the final output file?
Check Section 7.1 (Appendix) for how the alignment could be done better for the yeast data.
The output option of STAR already indicates that the results of the alignment will be stored in a SAM or
BAM file. The Sequence Alignment/Map (SAM) format is, in fact, a generic nucleotide alignment format that
describes the alignment of sequencing reads (or query sequences) to a reference. The human readable, TAB-
delimited SAM files can be compressed into the Binary Alignment/Map format. These BAM files are bigger
than simply gzipped SAM files, because they have been optimized for fast random access rather than size
reduction. Position-sorted BAM files can be indexed so that all reads aligning to a locus can be efficiently
retrieved without loading the entire file into memory. To convert a BAM file into a SAM file, use samtools
view:
# export our local installation of samtools into your PATH
$ export PATH=~/mat/software/samtools-1.7/:$PATH
$ samtools view -h WT_1_Aligned.sortedByCoord.out.bam > WT_1_Aligned.sortedByCoord.out.sam
As shown in Figure 14, SAM files typically contain a short header section and a very long alignment section
where each row represents a single read alignment. The following sections will explain the SAM format in a bit
more detail. For the most comprehensive and updated information go to https://github.com/samtools/
hts-specs.
Figure 14: Schematic representation of a SAM file. Each line of the optional header section starts with “@”, followed
by the appropriate abbreviation (e.g., SQ for sequence dictionary which lists all chromosomes names (SN) and their
lengths (LN)). See Table 10 for all possible entries and tags. The vast majority of lines within a SAM file typically
correspond to read alignments where each read is described by the 11 mandatory entries (black font) and a variable
number of optional fields (grey font). See Section 3.3.2 for more details.
This tells you that if a tool chooses to include the “@SQ” line in the SAM file header, the entries SN
(chromosome names) and LN (length of the individual chromosomes) are mandatory, while tags such as the
ones spelling out the species (SP) or the path to the reference genome (UR) are optional. For a hypothetical
organism with three chromosomes of length 1,000 bp, 1,500 bp, and 3,000 bp, the SAM header should therefore
contain the following three lines:
@SQ SN:chr1 LN:1000
@SQ SN:chr2 LN:1500
@SQ SN:chr3 LN:3000
samtools view -H (note the capitalized “H”) can be used to retrieve just the header of a SAM or BAM file.
The output from the following example was slightly modified for better readability. See Table 10 for more
information about the entries typically stored within the header section.
# The default behavior of samtools view is to not show the header section.
# samtools view -h will show both header and alignment section;
# samtools view -H will return the header section only.

@CO user command line: STAR --genomeDir STAR-sacCer3 --readFilesIn Lane1.fastq.gz,Lane2.fastq.gz,Lane3.fastq.gz,Lane4.fastq.gz,Lane5.fastq.gz,Lane6.fastq.gz,Lane7.fastq.gz --readFilesCommand zcat --outFileNamePrefix Sample1_ --outFilterMultimapNmax 1 --outSAMunmapped Within --runThreadN 8 --outSAMtype BAM SortedByCoordinate
Table 4: Overview of the fields that are required for each row of a SAM file’s alignment section. The number of
optional fields can vary widely between different SAM files and even between reads within the same file. The field
types marked in blue are explained in more detail in the main text below.
 1   QNAME  Read1       Query template (= read) name (PE: read pair name). [required]
 2   FLAG   83          Information about the read’s mapping properties encoded as bit-wise flags (see next section and Table 5). [required]
 3   RNAME  chrI        Reference sequence name. This should match a @SQ line in the header. [*]
 4   POS    15364       1-based leftmost mapping position of the first matching base. Set as 0 for an unmapped read without coordinates. [0]
 5   MAPQ   30          Mapping quality of the alignment. Should be a Phred-scaled posterior probability that the position of the read is incorrect, but the value is completely dependent on the alignment program. Some tools set this to 0 if multiple alignments are found for one read. [0]
 6   CIGAR  51M         Detailed information about the alignment (see below). [*]
 7   RNEXT  =           PE reads: reference sequence name of the next read. Set to “=” if both mates are mapped to the same chromosome. [*]
 8   PNEXT  15535       PE reads: leftmost mapping position of the next read. [0]
 9   TLEN   232         PE reads: inferred template length (fragment size). [0]
10   SEQ    CCA...GGC   The sequence of the aligned read on the forward strand (not including indels). [*]
11   QUAL   BBH...1+B   Base quality (same as the quality string in the FASTQ format, but always in Sanger format [ASCII+33]). [*]
12ff OPT    NM:i:0      Optional fields (format: <TAG>:<TYPE>:<VALUE>; see below).
FLAG field The FLAG field encodes various pieces of information about the individual read, which is
particularly important for PE reads. It contains an integer that is generated from a sequence of Boolean bits
(0, 1). This way, answers to multiple binary (Yes/No) questions can be compactly stored as a series of bits,
where each of the single bits can be addressed and assigned separately.
Table 5 gives an overview of the different properties that can be encoded in the FLAG field. The developers of
the SAM format and samtools tend to use the hexadecimal encoding as a means to refer to the different bits
in their documentation. The value of the FLAG field in a given SAM file, however, will always be the decimal
representation of the sum of the underlying binary values (as shown in Table 4, row 2).
Table 5: The FLAG field of SAM files stores several pieces of information about the respective read alignment in one single
decimal number. The decimal number is the sum of all the answers to the Yes/No questions associated with each
binary bit. The hexadecimal representation is used to refer to the individual bits (questions).
A bit is set if the corresponding state is true. For example, if a read is paired, 0x1 will be set, returning the
decimal value of 1. Therefore, all FLAG values associated with paired reads must be odd decimal numbers.
Conversely, if the 0x1 bit is unset (= read is not paired), no assumptions can be made about 0x2, 0x8, 0x20,
0x40 and 0x80.
In a run with single reads, the flags you will most commonly see are ¶ :
• 0: This read has been mapped to the forward strand. (None of the bit-wise flags have been set.)
• 4: The read is unmapped (0x4 is set).
• 16: The read is mapped to the reverse strand (0x10 is set).
¶ 0x100, 0x200 and 0x400 are not used by most aligners, but could, in principle, be set for single reads.
Some common FLAG values that you may see in a paired-end experiment include:
69 (= 1 + 4 + 64) The read is paired, is the first read in the pair, and is unmapped.
77 (= 1 + 4 + 8 + 64) The read is paired, is the first read in the pair, both are unmapped.
83 (= 1 + 2 + 16 + 64) The read is paired, mapped in a proper pair, is the first read in the pair,
and it is mapped to the reverse strand.
99 (= 1 + 2 + 32 + 64) The read is paired, mapped in a proper pair, is the first read in the pair,
and its mate is mapped to the reverse strand.
133 (= 1 + 4 + 128) The read is paired, is the second read in the pair, and it is unmapped.
137 (= 1 + 8 + 128) The read is paired, is the second read in the pair, and it is mapped while
its mate is not.
141 (= 1 + 4 + 8 + 128) The read is paired, is the second read in the pair, but both are unmapped.
147 (= 1 + 2 + 16 + 128) The read is paired, mapped in a proper pair, is the second read in the pair,
and mapped to the reverse strand.
163 (= 1 + 2 + 32 + 128) The read is paired, mapped in a proper pair, is the second read in the pair,
and its mate is mapped to the reverse strand.
Note that the strand information of the FLAG field (0x10) does not necessarily indicate the strand of the
original transcript. Unless a strand-specific RNA-seq library protocol was used, this only tells you which
strand of the ds-cDNA fragment was sequenced.
A useful website for quickly translating the FLAG integers into plain English explanations like the ones shown
above is: https://broadinstitute.github.io/picard/explain-flags.html
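If you have a reasonably recent samtools installation, the samtools flags subcommand does the same translation on the command line; a small sketch (the exact output format may differ slightly between samtools versions):
# translate a decimal FLAG value into its individual bits
$ samtools flags 83
0x53    83      PAIRED,PROPER_PAIR,REVERSE,READ1

# the reverse direction works, too
$ samtools flags PAIRED,PROPER_PAIR,REVERSE,READ1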
1. How can you retrieve just the alignment section of a BAM file?
?
3. What does a FLAG value of 2 mean?
4. Would you be happy or sad if your paired-end read alignments all had FLAG values of 77 or
141?
5. Your favorite read pair has FLAG values of 153 and 69. Which read aligned to the forward
strand of the reference?
CIGAR [Concise Idiosyncratic Gapped Alignment Report] String The sixth field of a SAM file
contains a so-called CIGAR string indicating which operations were necessary to map the read to the reference
sequence at that particular locus.
The following operations are defined in CIGAR format (also see Figure 15):
M Alignment (can be a sequence match or mismatch!)
I Insertion in the read compared to the reference
D Deletion in the read compared to the reference
N Skipped region from the reference. For mRNA-to-genome alignments, an N operation represents an
intron. For other types of alignments, the interpretation of N is not defined.
S Soft clipping (clipped sequences are present in read); S may only have H operations between them and
the ends of the string
H Hard clipping (clipped sequences are NOT present in the alignment record); can only be present as
the first and/or last operation
P Padding (silent deletion from padded reference)
= Sequence match (not widely used)
X Sequence mismatch (not widely used)
The sum of lengths of the M, I, S, =, X operations must equal the length of the read.
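As a worked example, a 51-bp read spanning an intron might have the CIGAR string 25M2000N26M: the first 25 bases match the reference, a 2,000-bp reference region (the intron) is skipped, and the remaining 26 bases match again (25 + 26 = 51; N does not consume read bases). Since the CIGAR string sits in field 6 of the alignment section, spliced alignments can be pulled out with a simple filter; a small sketch (the file name is a placeholder):
# show a few spliced alignments, i.e. reads whose CIGAR string contains an N operation
$ samtools view Aligned.sortedByCoord.out.bam | awk '$6 ~ /N/' | head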
OPT field(s) Following the eleven mandatory SAM file fields, the optional fields are presented as key-value
pairs in the format of <TAG>:<TYPE>:<VALUE>, where TYPE is one of:
A Character
i Integer
f Float number
Z String
H Hex string
The information stored in these optional fields will vary widely depending on the mapper and new tags can
be added freely. In addition, reads within the same SAM file may have different numbers of optional fields,
depending on the program that generated the SAM file. Commonly used optional tags include:
AS:i Alignment score
BC:Z Barcode sequence
HI:i Match is the i-th hit to the read
NH:i Number of reported alignments for the query sequence
NM:i Edit distance of the query to the reference
MD:Z String that contains the exact positions of mismatches (should complement the CIGAR string)
RG:Z Read group (should match the entry after ID if @RG is present in the header).
Thus, for example, we can use the NM:i:0 tag to select only those reads which map perfectly to the reference
(i.e., have no mismatches).
While the optional fields listed above are fairly standardized, tags that begin with X, Y, and Z are re-
served for particularly free usage and will never be part of the official SAM file format specifications. XS, for
example, is used by TopHat to encode the strand information (e.g., XS:A:+) while Bowtie2 and BWA use
XS:i: for reads with multiple alignments to store the alignment score for the next-best-scoring alignment
(e.g., XS:i:30).
# turn a BAM file into the human-readable SAM format (including the header)
$ samtools view -h InFile.bam > InFile.sam

# generate an index for a BAM file (needed for many downstream tools)
$ samtools index InFile.bam
To see all the operations that can be done using samtools, type samtools --help.
The wealth of information stored within the alignment files allows you to focus on virtually any subset of read
alignments that you may be interested in. The samtools view tool has many options that directly interpret
some of the mandatory fields of its alignment section (Table 4), such as the mapping quality, the location,
and the FLAG field values.
# get only unmapped reads
$ samtools view -h \   # show header
      -b \             # output a BAM file
      -f 4 \           # include only reads where the 0x4 bit is set
      Aligned.sortedByCoord.out.bam > unmapped_reads.bam
If you would like to filter an alignment file based on any of the optional tags, you will have to resort to means
outside samtools. Looking for exact matches using grep can be particularly helpful here, but you should
make sure that you make the regular expression search as stringent as possible.
!
The number of optional SAM/BAM fields, their value types and the information stored within them
completely depend on the alignment program and can thus vary substantially. Before you do any
filtering on any flag, make sure you know how the aligner generated that value.
Here is an example for retrieving reads with only one alignment (aka uniquely aligned reads), which
might be useful if STAR was not run with --outFilterMultimapNmax 1:
# STAR uses the NH:i tag to record the number of alignments found for a read
# NH:1 => 1 alignment; NH:2 => 2 alignments etc.
$ samtools view -h Aligned.sortedByCoord.out.bam | \   # decompress the BAM file
      egrep "^@|\bNH:i:1\b" | \   # lines with either @ at the beginning of the line (= header) or exact matches of NH:i:1 are returned
      samtools view -S -b - > uniquely_aligned_reads.bam   # turn the SAM file lines from stdin into a BAM file, - indicates standard input for samtools
To filter out reads that span gaps greater than 1,000 bp, one could make use of the CIGAR string.
The following example assumes that the alignment program indicated large gaps (e.g., introns) with the N operation
(see Section 3.3.2) – this may not be true for all aligners!
# for the sake of simplicity, let's work on the SAM file:
$ samtools view -h WT_1_Aligned.sortedByCoord.out.bam > WT_1_Aligned.sortedByCoord.out.sam

# here's an example using grep, excluding lines with at least four digits followed by N
$ egrep -v "[0-9][0-9][0-9][0-9]N" WT_1_Aligned.sortedByCoord.out.sam > smallInsert_reads.sam
?
1. How can you extract all reads that were aligned to the reverse strand?
2. Does it make sense to filter the BAM files generated by STAR using the mapping quality filter
as shown above, i.e., do you find any differences after filtering with -q 40?
Once the reads have been aligned, the basic alignment properties should be assessed before downstream analyses
are started.
There are numerous ways to do basic checks of the alignment success. An alignment of RNA-seq reads is
usually considered to have succeeded if the mapping rate is >70%.
The very first QC of aligned reads should be to generally check the aligner’s output. The STAR and samtools
index commands in Section 3.2 generate the following files:
*Aligned.sortedByCoord.out.bam   information about the genomic loci of each read incl. its sequence
*Log.final.out                   alignment statistics
*Log.out                         commands, parameters, and files used
*Log.progress.out                elapsed time
*SJ.out.tab                      genomic loci where splice junctions were detected and the number of reads overlapping with them
*Unmapped.out.mate1              text file with unmapped reads (similar to original fastq file)
Information about the individual output files is given in the STAR manual, which you can find in the
program’s directory (e.g., STAR-2.5.4b/doc/STARmanual.pdf) or online (https://github.com/alexdobin/
STAR/blob/master/doc/STARmanual.pdf).
• Which STAR output file will you need most for your downstream analyses?
? • How can you decrease the size of the *out.mate1 files? What format do they have?
• Which optional SAM fields does STAR add and what do they represent?
Most aligners will return a summary of the basic stats of the aligned reads, such as the number of mapped
reads. For STAR, the information is stored in *Log.final.out.
$ cat WT_1_Log.final.out
Started job on | Jul 24 17:53:18
Started mapping on | Jul 24 17:53:22
Finished on | Jul 24 17:53:51
Mapping speed, Million of reads per hour | 870.78
The number of uniquely mapped reads is usually the most important number. If you are handling more than
two BAM files, it will certainly be worthwhile to visualize the alignment rate for all files, e.g., using MultiQC
or your own, customized routine in R (Figure 16).
Figure 16: Graphical summary of STAR’s log files for 96 samples. The individual colors represent the distinct samples, the dashed lines indicate the median values across all samples of the same condition (WT or SNF2). For details of the code underlying these figures, see https://github.com/friedue/course_RNA-seq2015 → 01_Alignment/visualizeSTARresults.pdf.
The dashed black line is the median value across all replicates of one condition. There is quite a bit of variation for all four stats between the replicates. Although the median sequencing depth of the SNF2 samples is slightly higher than for the WT samples, the median number of splice sites shows a reverse trend. This is counter-intuitive as the number of splice sites should correlate positively with the number of mapped reads. Let's check whether that relationship is simply confounded by just looking at the median values.
In addition to the log files generated by the mapping program, there are numerous ways to obtain information
about the numbers and kinds of reads stored in a BAM file, e.g., using samtools or RSeQC (see below). The
simplest approach to finding out the number of alignments within a BAM file is to do a line count.
# pseudocode
$ samtools view Aligned.sortedByCoord.out.bam | wc -l
Note that if unmapped reads are present in the BAM file, these will also be counted, as well as multiple
instances of the same read mapped to different locations if multi-mapped reads were kept. It is therefore
more informative to run additional tools that will indicate the counts for specific FLAG values, too.
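If you only want to count reads that were actually aligned, samtools view's counting option can be combined with a FLAG-based filter; a minimal sketch (2308 = 0x4 + 0x100 + 0x800, i.e. unmapped, secondary, and supplementary alignments are excluded):
# count only primary alignments of mapped reads
# -c returns the number of alignments instead of the alignments themselves
# -F 2308 excludes all reads where the 0x4, 0x100 or 0x800 bit is set
$ samtools view -c -F 2308 Aligned.sortedByCoord.out.bam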
samtools flagstat This tool assesses the information from the FLAG field (see Section 3.3.2) and prints a
summary report to the terminal.
$ ~/mat/software/samtools-1.5/samtools flagstat /zenodotus/abc/store/courses/2016_rnaseq/additionalExamples/alignment/human_samples/UHR-RIN0_Aligned.sortedByCoord.out.bam
66889956 + 2593364 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
66889956 + 2593364 mapped (100.00% : 100.00%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
RSeQC’s bam_stat.py RSeQC is a Python- and R-based suite of tools for various quality controls and visual-
izations, some of which are specific for RNA-seq experiments (Wang et al., 2012). See Table 11 for the list of
all currently available scripts. Although RSeQC is one of the most popular tools for RNA-seq quality control,
a recent publication revealed several bugs in the code of RSeQC (Hartley and Mullikin, 2015).
For basic alignment stats, one can use the bam_stat.py script:
# RSeQC is based on Python; add the anaconda installation of Python to your PATH
$ export PATH=/home/classadmin/software/anaconda2/bin/:$PATH

# now, all RSeQC scripts are immediately accessible and you can, for example, run bam_stat.py
$ bam_stat.py -i WT_1_Aligned.sortedByCoord.out.bam

#==================================================
#All numbers are READ count
#==================================================

QC failed: 0
Optical/PCR duplicate: 0
Non primary hits 0
Unmapped reads: 0
mapq < mapq_cut (non-unique): 0
If you want to add the results of samtools flagstat and RSeQC’s bam_stat.py to a MultiQC report, capture
the output that is normally printed to screen in reasonably named files.
$ bam_stat.py -i WT_1_Aligned.sortedByCoord.out.bam > bam_stat_WT_1.txt
$ samtools flagstat WT_1_Aligned.sortedByCoord.out.bam > flagstat_WT_1.txt
Visualization of aligned reads It is always a good idea to visually check the results, i.e., ensure the
reads align to the expected regions, preferably without too many mismatches. Here, Genome Browsers come
in handy. Different research groups have released different Genome Browsers, and the most well-known
browsers are probably those from Ensembl and UCSC. There are some reasons why one may not want to
use these web-based options (e.g., HIPAA-protected data or lack of bandwidth to upload all data), and
rather resort to stand-alone Genome Browsers (see https://en.wikipedia.org/wiki/Genome_browser for
an overview).
We are going to use the Broad Institute’s Integrative Genomics Viewer (IGV) that can be downloaded after
a quick registration with an academic email address from https://www.broadinstitute.org/software/
igv/download. It requires an up-to-date Java installation.
The IGV Genome Browser can display numerous file formats, e.g., indexed (!) BAM files‖ with aligned reads
and BED files with information about genomic loci (such as genes). The following IGV snapshot (in IGV, go
to “File”, then “Save image”) shows the region surrounding an arbitrarily chosen yeast gene (blue box) and
the reads aligned to it (grey arrows).
‖ That means, the .bam file should have a bam.bai file with the same base name in the same folder.
On top of the read alignment display, IGV also produces a coverage line that allows for quick identifica-
tion of highly covered regions. Blue lines within the reads indicate insertions with respect to the reference
genome, red lines indicate deletions. Since yeast genes are often intron-less, the reads can be aligned without
gaps.
Human genes, however, tend to have multiple introns, which means that exon-exon-spanning reads must be
aligned with often lengthy gaps within them (Figure 11). Examples of this can be seen in the following IGV
screenshot, where the horizontal grey lines indicate a gap within a read:
If one is interested in the splice junctions of a particular gene, IGV can generate Sashimi plots (Katz et al.,
2015): right-click on the track that contains the BAM file of interest and select “Sashimi plot”. The result will
look like this:
In the Sashimi plot, bar graphs indicate read coverage, arcs indicate splice junctions, and numbers represent
the number of reads that contain the respective splice junction. Bear in mind that while the IGV-Sashimi
plots are great because they allow you to interactively explore exon usage, relying on the simple read counts
may be treacherous – simple differences in sequencing depth (i.e., the total number of reads per sample)
can lead to perceived differences in read counts. If you want read counts normalized per million reads
and adjusted for transcript length, you will have to resort to the standalone version of Sashimi
(http://miso.readthedocs.io/en/fastmiso/sashimi.html).
Read distribution For mRNA-seq, one would expect the majority of the aligned reads to overlap with
exons. This assumption can be tested using the read_distribution.py script, which counts the numbers
of reads overlapping with various gene- and transcript-associated genomic regions, such as exons and in-
trons.
$ read_distribution.py -r ${REF_DIR}/sacCer3.bed \   # annotation file
      -i WT_1_Aligned.sortedByCoord.out.bam \        # runs only on single files
To compare the read distribution values for different samples, it is helpful to turn the text-based output of
read_distribution.py into a bar graph:
The .R script and read_distribution.py result files for this plot can be found at https://github.com/
friedue/course_RNA-seq2015. You will probably want to change a couple of details, e.g. including intron
counts. You can also let MultiQC do the work if you capture the output of read_distribution.py in a
simple text file.
Gene body coverage To assess possible 3' or 5' biases, you can use RSeQC's geneBody_coverage.py
script. Given an annotation file with the transcript models of your choice, it will divide each transcript into
100 sections, count the reads overlapping with each section, and generate two plots visualizing the general
abundance of reads across all transcript bodies.
$ REF_DIR=~/mat/referenceGenomes/S_cerevisiae

$ geneBody_coverage.py \
      -i WT_1_Aligned.sortedByCoord.out.bam \   # aligned reads
      -r ${REF_DIR}/sacCer3.bed \               # annotation file
      -o geneBodyCoverage_WT_1                  # output name
We ran the geneBody_coverage.py script on four human RNA-seq samples with varying degrees of RNA
quality, ranging from RIN = 0 (degraded) to RIN = 9 (high-quality RNA) (see Section 1.1.1 for details about
RIN). The resulting plots show varying degrees of 3' bias, where samples with degraded RNA (RIN 0) show
a more prominent bias than those with high-quality RNA (RIN 9).
in silico mRIN calculation The RNA integrity number (RIN, see Section 1.1.1) that is calculated during
library preparation to assess the RNA quality is rarely indicated in the public data repositories. It might thus
be informative to determine a measure of mRNA degradation in silico. RSeQC’s tin.py script does exactly
that, using the deviation from an expected uniform read distribution across the gene body as a proxy (Feng
et al., 2015).
$ tin.py -i WT_1_Aligned.sortedByCoord.out.bam -r ${REF_DIR}/sacCer3.bed
tin.py will generate a .xls file where the in silico mRIN is stored for each gene or transcript from the BED
file. The second output file, *summary.txt, gives a quick overview of the mean and median values across all
genes for a given sample. Using human samples with known, experimentally determined RIN values, we
can see that the in silico mRIN does correlate with them:
You can find the .R script and tin.py result files underlying these box plots in the following github repository:
https://github.com/friedue/course_RNA-seq2015.
# to obtain the total number of raw reads you can make use of the calculator capabilities of bc
# (the number is an optional parameter for QoRTs though)
$ for FASTQ in ~/mat/precomputed/rawReads_yeast_Gierlinski/WT_1/ERR45849*gz; do
      zcat $FASTQ | wc -l; done | paste -sd+ | bc | awk '{print $1/4}'
Note that by default, QoRTs assumes the data is paired end unless otherwise specified.
To run or exclude individual functions:
$ REF_DIR=~/mat/referenceGenomes/S_cerevisiae

# to exclude a function
$ java -Xmx4g -jar QoRTs.jar QC \
      --singleEnded \
      --skipFunctions JunctionCalcs \   # run every function except the JunctionCalcs function
      --generatePdfReport WT_1_Aligned.sortedByCoord.out.bam \
      ${REF_DIR}/sacCer3.gtf \
      ./QoRTs_output/
To include or exclude more than one function, use a comma-delimited list (without white spaces) of the
respective functions.
An example QoRTs report can be found at http://chagall.med.cornell.edu/RNASEQcourse/.
– samtools flagstat
– results of RSeQC’s bam_stat.py
• RNA-seq-specific QC
– read distribution (e.g., using RSeQC or QoRTs)
– gene body coverage (e.g., using RSeQC or QoRTs)
– the splice junction information obtained with QoRTs
# collect all QC results of interest in one folder, e.g. QC_collection
# subfolders can be assigned for each sample, which will make the naming conventions used
# by MultiQC easier
# you can either copy or link the files that you need
$ ls QC_collection/WT_1/
geneBodyCoverage_WT_1.geneBodyCoverage.r          QC.NVC.minus.clipping.R1.txt.gz
geneBodyCoverage_WT_1.geneBodyCoverage.txt        QC.NVC.raw.R1.txt.gz
QC.b2nAZCenkhtb.log                               QC.NVC.tail.clip.R1.txt.gz
QC.biotypeCounts.txt.gz                           QC.QORTS_COMPLETED_OK
QC.chromCount.txt.gz                              QC.QORTS_COMPLETED_WARN
QC.cigarOpDistribution.byReadCycle.R1.txt.gz      QC.quals.r1.txt.gz
QC.cigarOpLengths.byOp.R1.txt.gz                  QC.r3z9iUXrtnHr.log
QC.exonCounts.formatted.for.DEXSeq.txt.gz         QC.spliceJunctionAndExonCounts.forJunctionSeq.txt.gz
QC.fxNgbBmcKJnC.log                               QC.spliceJunctionCounts.knownSplices.txt.gz
QC.gc.byRead.txt.gz                               QC.spliceJunctionCounts.novelSplices.txt.gz
QC.gc.byRead.vsBaseCt.txt.gz                      QC.summary.txt
QC.geneBodyCoverage.byExpr.avgPct.txt.gz          read_distribution.txt
QC.geneBodyCoverage.by.expression.level.txt.gz    rseqc_bam_stat.txt
QC.geneBodyCoverage.DEBUG.intervals.txt.gz        samtools_flagstat.txt
QC.geneBodyCoverage.genewise.txt.gz               WT_1Log.final.out
QC.geneCounts.formatted.for.DESeq.txt.gz          WT_1Log.out
QC.geneCounts.txt.gz                              WT_1Log.progress.out
QC.NVC.lead.clip.R1.txt.gz

# QoRTs results
$ ls QC_collection/WT_1/QC*
# STAR Log files
$ ls QC_collection/WT_1/*Log*out

# run MultiQC
$ cd QC_collection/
$ ~/mat/software/anaconda2/bin/multiqc . \
      --dirs \                  # use the names of the subdirectories
      --ignore ERR* \           # ignoring FastQC results in case they are there
      --filename multiQC_align
4 Read Quantification
As discussed in some detail by Van den Berge et al. (2019), there are two different ways of approaching the
determination of individual transcripts’ expression levels – one can either assign all the reads to a given gene
(effectively ignoring the presence of individual isoforms), or one can try to infer the quantity of individual
transcripts.
To obtain gene-level quantifications, one can either directly count reads overlapping with gene loci or use
transcript-level quantification (Section 4.2) followed by some way of aggregating the values per gene. In
principle, the counting of reads overlapping with genomic features is a fairly simple task, but there are some
details that need to be decided on depending on the nature of your experiment and the desired outcome
(Figure 17).
The most popular tools for gene quantification are htseq-count and featureCounts. Both are part of larger
tool packages (Anders et al., 2014; Liao et al., 2014). htseq-count offers three different modes to tune its
behavior to define overlap instances (Figure 17). The recommended mode is union, which counts overlaps
even if a read only shares parts of its sequence with a genomic feature and disregards reads that overlap
more than one feature. This is similar to featureCounts, which calls a hit if any overlap (1 bp or more) is
found between the read and a feature, and which provides the option to either exclude multi-overlap reads or to
count them for each feature that is overlapped.
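For orientation, here is a hedged sketch of what an htseq-count call using the union mode could look like. The strandedness setting (-s) and the sort order of the BAM file (-r) depend on your library preparation and files; the values shown are assumptions.
# count reads per gene with htseq-count (union mode); the count table is written to stdout
$ htseq-count \
      -f bam \       # input is a BAM file
      -r pos \       # the BAM file is sorted by position
      -s no \        # library is not strand-specific (adjust to your protocol!)
      -m union \     # overlap resolution mode
      -t exon \      # feature type whose overlaps are counted
      -i gene_id \   # GTF attribute used to group features into genes
      WT_1_Aligned.sortedByCoord.out.bam ${REF_DIR}/sacCer3.gtf > htseqCounts_WT_1.txt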
When counting reads, make sure you know how the program handles the following:
• overlap size (full read vs. partial overlap);
• multi-mapping reads, i.e. reads with multiple hits in the genome;
• reads overlapping multiple genomic features of the same kind;
• reads overlapping introns.
In addition to the nature and lengths of the reads, gene expression quantification will be strongly affected
by the underlying gene models that are usually supplied to the quantification programs via GTF or BED(-like)
files (see Section 3.1 for details on the file formats and annotations).
The following commands will count the number of reads overlapping with genes using featureCounts.
# count reads per gene
$ ~/mat/software/subread-1.6.0-Linux-x86_64/bin/featureCounts \
      -a ${REF_DIR}/sacCer3.gtf \
      -o featureCounts_results.txt \
      alignment/*bam   # use all BAM files in the folder "alignment"
The output of featureCounts consists of two files:
1. The one defined by the -o parameter (e.g., featureCounts_results.txt) – this one contains the actual
read counts per gene (with gene ID, genomic coordinates of the gene including strand and length); the
first line (starting with #) contains the command that was used to generate the file (see the example
below for a quick way to inspect the counts).
2. A file with the suffix .summary: This file gives a quick overview of how many reads could be assigned
to genes and the reasons why some of them could not be assigned. This is a very useful file to double-
check the settings you’ve chosen for the counting.
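Since the annotation columns are not needed for the downstream analyses, it can be handy to peek at just the gene IDs and the raw counts. A small sketch (in the featureCounts output, the counts start in column 7, after the Geneid, Chr, Start, End, Strand, and Length columns):
# skip the command line stored in the first line, then keep the gene IDs and the count columns
$ sed '1d' featureCounts_results.txt | cut -f1,7- | head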
featureCounts also allows counting reads that overlap with individual exons.
# count reads per exon
$ ~/mat/software/subread-1.6.0-Linux-x86_64/bin/featureCounts \
      -a ${REF_DIR}/sacCer3.gtf \
      -f \                    # count read overlaps on the feature level
      -t exon \               # feature type
      -O \                    # allow reads to overlap more than one exon
      -o featCounts_exons.txt \
      alignment/*bam
However, there are (at least) two caveats here:
• If an exon is part of more than one isoform in the annotation file, featureCounts will return the read
counts for the same exon multiple times (n = number of transcripts with that exon). Make sure you
remove those multiple entries in the result file before the differential expression analysis, e.g., using a
UNIX command∗ or within R.
• If you want to assess differential expression of exons, it is highly recommended to create an annotation
file where overlapping exons of different isoforms are split into artificially disjoint bins before applying
featureCounts. See, for example, Anders et al. (2012). To create such a “flattened” annotation file
from a GTF file (Section 3.1.1), you can use the dexseq_prepare_annotation.py script of the DEXSeq
package (Anders et al., 2012) and the section “Preparing the annotation” of the corresponding vignette
at Bioconductor. Alternatively, you can use QoRTs to prepare the proper annotation, too†.
The previously discussed methods count the number of fragments that can be assigned to a gene as a whole.
The other school of thought suggests that quantifying reads that originated from transcripts should also
be done on the transcript level‡ . So far, most comparisons of methods point towards superior results of
gene-based quantification and there is no standard technique for summarizing expression levels of genes with
several isoforms (see, for example, Soneson et al. (2015), Dapas et al. (2016), Germain et al. (2016), and
Teng et al. (2016) for detailed comparisons of transcript-level quantifications; Van den Berge et al. (2019)
offers a detailed discussion of some of the caveats of all of the approaches).
One trend seems to be clear though: the simple count-based approaches tend to underperform when they are
used to determine transcript-level counts because they generally tend to ignore reads that overlap with more
than one feature (Soneson et al., 2015). While this is reasonable when the features are entire genes, this leads
to an enormous number of discarded reads when quantifying different isoforms since multiple transcripts of
the same gene naturally tend to overlap.
∗ e.g., sort -k2,2n -k3,3n featCounts_exons.txt | uniq
† see https://hpc.nih.gov/apps/QoRTs/example-walkthrough.pdf for details
‡ For arguments in favor of the transcript-focused school of thought, see, e.g., Trapnell et al. (2013) and Pimentel’s talk.
In order to quantify isoforms, you should perhaps look into different programs, e.g., Cufflinks (Trapnell
et al., 2012), RSEM (Li and Dewey, 2011), eXpress (Roberts and Pachter, 2013) – these tools have been
around the longest and are therefore most often cited. RSEM is the one that tends to perform best in most
comparisons, and its statistical interpretations and assumptions for handling transcript structures have been
widely adopted. The main features of these tools are that they need to make assumptions about transcript
structures and models and that the quantification is often done hand-in-hand with the alignment (or k-mer
based mapping) of the reads. The values that are returned are typically not actual read counts because
the majority of the short reads tend to be ambiguous, i.e., they match more than one known isoform.
Therefore, transcript quantification will always have to rely on probabilistic assignments of values using a
specific model with certain assumptions about how likely a given fragment will have originated from a specific
isoform (Figure 18).
As discussed in Section 3, new quantification algorithms for RNA-seq have been proposed§ that are based on
the idea that it may not be important to exactly know where within a transcript a certain read originated
from. Instead, it may be enough to simply know which transcript the read represents. These algorithms
therefore do not generate a BAM file because they do not worry about finding the best possible alignment.
Instead, they yield a (probabilistic) measure of how many reads indicate the presence of each transcript.
While these approaches are extremely fast compared to the usual alignment–counting routines that we have
described at length, they seriously lack sensitivity for lowly expressed genes, small transcripts and transcripts
where the splice variants are fairly similar to each other (Wu et al., 2018). They are also prone to spurious
mapping for immature RNAs or transcript structures that aren’t represented in the cDNA sequence pool
used to generate their mapping index (Srivastava et al., 2019).
Instead of direct isoform quantification, you may be able to glean more accurate answers from alternative
approaches, e.g., quantification of exons (Anders et al., 2012)¶ or estimates of alternative splicing events such
as exon skipping, intron retention etc. (e.g., MISO (Katz et al., 2010), rMATS (Shen et al., 2014)).
The main take home message here is once again: Know your data and your question, and research the
individual strengths and pitfalls of the individual tools before deciding which one to use. For example, one
major issue reported for Cufflinks is its inability to handle single-exon transcripts. Therefore, you should
avoid using it if you are dealing with a fairly simple transcriptome (Kanitz et al., 2015). On the other hand,
transcriptome reconstruction as attempted by Cufflinks generates large amounts of false positives (as well as
false negatives) in very complicated transcriptomes, such as the human one, while it seems to hit a better spot
when applied to moderately complex transcriptomes such as the one of C. elegans (Jänes et al., 2015). In
comparison, the novel lightweight quantification algorithms perform well for isoform quantification of known
transcriptomes, but they are naturally very sensitive to incomplete or changing annotation. In addition, it is
not entirely clear yet whether the resulting values can be used with the established algorithms to determine
differential gene expression (Soneson et al., 2015; Pimentel et al., 2016).
§ Sailfish (Patro et al., 2014), Salmon (Patro et al., 2017), kallisto (Bray et al., 2016)
¶ The featureCounts-based exon counting shown above should not be used with DEXSeq unless exons with varying boundaries have been divided into disjoint bins (Anders et al., 2012; Teng et al., 2016; Soneson et al., 2016).
The main caveats of assigning reads to transcripts are:
!
• inconsistent annotation of transcripts
• multiple isoforms of widely differing lengths
• anti-sense/overlapping transcripts of different genes
There is no really good solution yet! Be careful with your conclusions and if possible, limit your
analyses to gene-based approaches.
The numbers (or estimates) of reads overlapping with a given gene cannot be directly interpreted as absolute
proxies of individual gene expression levels. The value that is obtained for a single gene in a single sample
is based on the number of reads corresponding to that gene (or transcript), but as we discussed in previous
chapters, there are numerous factors that influence the efficiency of amplification and sequencing of DNA
fragments (e.g. GC-bias). For RNA-seq it is important to remember that, even given a uniform sampling of a
diverse transcript pool (i.e. you succeeded in creating a random and comprehensive sampling of the original
transcript universe), the number of sequenced reads mapped to a gene depends on:
• its own expression level (this is probably the only metric you’re actually interested in!),
• its length (the longer a transcript, the more short fragments it will yield),
• the sequencing depth,
• the expression of all other genes within the sample.
In order to compare the gene expression between two conditions, we must therefore calculate the fraction
of reads assigned to each gene relative to the total number of reads and with respect to the entire RNA
repertoire, which may vary drastically from sample to sample. While the number of sequenced reads is known,
the total RNA library and its complexity (i.e., which transcripts were captured) is unknown and variation
between samples may be due to contamination as well as biological reasons. The purpose of normalization
is to eliminate systematic effects that are not associated with the biological differences of interest so as
not to skew exploratory analyses (Section 5.3) as well as the statistical test for differential gene expression
(Chapter 6). See Table 13 for details on the most commonly used normalization methods that deal with
these issues in different ways. However, the functions that will perform the statistical tests absolutely rely
on being supplied the integer read counts (or their equivalents generated by the pseudo-alignment tools) as
they will perform the normalization and transformation steps described in detail in the next sections under
the hood.
As shown in Figure 19, the size factor method implemented by the R package DESeq2 leads to relatively
similar read count distribution between different libraries. We will now use the output of featureCounts
(= raw read counts), read them into R and normalize the read counts for sequencing depth differences with
functions implemented in the package DESeq2.
Figure 19: Figure from Dillies et al. (2013) that shows the effects of different approaches to normalize for read
count differences due to library sizes (TC, total count; UQ, upper quartile; Med, median; DESeq, size factor; TMM,
Trimmed Mean of M-values; Q, quantile) or gene lengths (RPKM). See Tables 13 and 14 for details of the different
normalization methods.
!
While the majority of normalization methods work well, RPKM and total count normalization
should be avoided in the context of DE analysis, no matter how often you see them applied in
published studies. RPKM, FPKM etc. are only needed if expression values need to be compared
between different genes within the same sample for which the different gene lengths must be taken
into consideration.
# get the table of read counts by indicating the path to the file
> readcounts <- read.table("~/Downloads/featureCounts_result.txt", header = TRUE)

# One of the requirements of the assay() slots is that the row.names
# correspond to the gene IDs and the col.names to the sample names
> row.names(readcounts) <- readcounts$Geneid

# in addition, we need to exclude all columns that do not contain read counts
> readcounts <- readcounts[ , -c(1:6)]

# give meaningful sample names - this can be achieved via numerous approaches;
# the one shown here is the least generic and most error-prone one!
> orig_names <- names(readcounts)
> names(readcounts) <- c("SNF2_1", "SNF2_2", "SNF2_3", "SNF2_4", "SNF2_5",
                         "WT_1", "WT_2", "WT_3", "WT_4", "WT_5")
Now that we have the read counts, we also need some information about the samples, which will be stored
in colData. As described above, this should be a data.frame, where the rows match the column names of
the count data we just generated. In addition, each row should contain information about the condition of
each sample (here: WT and SNF2 [knock-out]).
# make a data.frame with meta-data where row.names should match the individual
# sample names
> sample_info <- data.frame(condition = gsub("_[0-9]+", "", names(readcounts)),
                            row.names = names(readcounts))
> sample_info
       condition
SNF2_1      SNF2
SNF2_2      SNF2
SNF2_3      SNF2
SNF2_4      SNF2
SNF2_5      SNF2
WT_1          WT
WT_2          WT
WT_3          WT
WT_4          WT
WT_5          WT
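The step that turns the count matrix and the sample table into the DESeqDataSet object (DESeq.ds) used below is not shown in this excerpt; presumably it relied on DESeq2's constructor for count matrices. A minimal sketch (assuming the readcounts and sample_info objects from above) could look like this:

> library(DESeq2)
# combine count matrix, sample information and design formula into one object
> DESeq.ds <- DESeqDataSetFromMatrix(countData = readcounts,
                                     colData = sample_info,
                                     design = ~ condition)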
# you can check the result using the accessors described above;
# note that the %>% pipe requires the magrittr package
> library(magrittr)
> colData(DESeq.ds) %>% head
> assay(DESeq.ds, "counts") %>% head
> rowData(DESeq.ds) %>% head
DESeq2’s default method to normalize read counts to account for differences in sequencing depths is imple-
mented in estimateSizeFactors() (see Table 13).
# calculate the size factors and add them to the data set
> DESeq.ds <- estimateSizeFactors(DESeq.ds)
> sizeFactors(DESeq.ds)

# if you check colData() again, you see that it now contains the sizeFactors
> colData(DESeq.ds)
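The next chunk recomputes the size factors manually to illustrate what estimateSizeFactors() does. The definition of the helper function gm_mean() is not shown in the excerpt; a minimal sketch of such a function, assuming it simply returns the geometric mean of a numeric vector, is:

# geometric mean of a vector; genes with a zero count end up with a geometric
# mean of 0 because exp(-Inf) = 0
> gm_mean <- function(x) exp(mean(log(x)))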
## calculate the geometric mean for each gene using that function;
## note the use of apply(), which we instruct to apply the gm_mean()
## function per row (this is what the second parameter, 1, indicates)
> pseudo_refs <- counts(DESeq.ds) %>% apply(., 1, gm_mean)
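## the step that creates pseudo_ref_ratios (used below) is not shown above;
## presumably it divides each sample's counts by the gene-wise pseudo-reference,
## e.g. (R recycles the vector down the columns of the matrix):
> pseudo_ref_ratios <- counts(DESeq.ds) / pseudo_refs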
## if you want to see what that means at the single-gene level,
## compare the result of this:
> counts(DESeq.ds)[1, ] / pseudo_refs[1]
## with
> pseudo_ref_ratios[1, ]

## determine the median value per sample to get the size factor
> apply(pseudo_ref_ratios, 2, median)
The result of the last line of code should be equivalent to the values returned by sizeFactors(DESeq.ds)
after running estimateSizeFactors(DESeq.ds).
?
1. Name two technical reasons why the read count for the same gene may vary between two samples although it is not differentially expressed.
2. Name two technical reasons why the read counts of two genes may vary within the same sample although they are expressed at the same level.
Due to the relatively large dynamic range of expression values that RNA-seq data can cover, many down-
stream analyses (including clustering) work much better if the read counts are transformed to the log scale
following normalization. While you will occasionally see log10 transformed read counts, log2 is more com-
monly used because it is easier to think about doubled values rather than powers of 10. The transformation
should be done in addition to sequencing depth normalization.
# transform size-factor normalized read counts to log2 scale using a
# pseudocount of 1
> log.norm.counts <- log2(counts.sf_normalized + 1)
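The object counts.sf_normalized is not created in the chunk above; presumably it holds the sequencing-depth-normalized read counts, which DESeq2 can return directly once the size factors have been estimated:

# size-factor-normalized read counts (assumed definition of counts.sf_normalized)
> counts.sf_normalized <- counts(DESeq.ds, normalized = TRUE)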
You can see how the log2 transformation makes even simple graphs more easily interpretable by generating
boxplots of read counts similar to the ones in Figure 19:
> par(mfrow = c(2, 1)) # to plot the following two images underneath each other
Figure 20: Comparison of the read distribution plots for untransformed and log2 -transformed values.
# mean-sd plot
> library(vsn)
> library(ggplot2)
> msd_plot <- meanSdPlot(log.norm.counts,
                         ranks = FALSE, # show the data on the original scale
                         plot = FALSE)
> msd_plot$gg +
    ggtitle("sequencing depth normalized log2(read counts)")
Figure 21: Comparison of sequencing depth normalized log2(read counts) and rlog-transformed read counts. The upper panel shows simple pairwise comparisons of replicate samples (WT_1 vs. WT_2); the lower panel contains mean-sd plots based on all samples of the experiment.
The rlog() function’s blind parameter should be set to FALSE if the different conditions lead to strong
differences in a large proportion of the genes. If rlog() is applied without incorporating the knowledge of
the experimental design (blind = TRUE, the default setting), the dispersion will be greatly overestimated in
such cases.
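The objects DESeq.rlog and rlog.norm.counts that are used in the following code chunks are not created anywhere in the code shown above; presumably they were obtained along these lines (rlog() with its default blind = TRUE, subject to the caveat just discussed):

> DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
# extract the matrix of rlog-transformed values
> rlog.norm.counts <- assay(DESeq.rlog)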
An important step before diving into the identification of differentially expressed genes is to check whether
expectations about basic global patterns are met. For example, technical and biological replicates should
show similar expression patterns while the expression patterns of, say, two experimental conditions should
be more dissimilar. There are multiple ways to assess the similarity of expression patterns; we will cover the three that are used most often for RNA-seq data.
!
Avoid “single” linkage on gene expression data; “complete” and “average” linkage tend to be much
more appropriate, with “complete” linkage often outperforming “average” (Gibbons and Roth,
2002).
Table 6: Comparison of unsupervised classification and clustering techniques. The following table was adapted from
Karimpour-Fard et al. (2015); see that publication for more details on additional (supervised) classification methods
such as support vector machines. Classifiers try to reduce the number of features that represent the most prevalent
patterns within the data. Clustering techniques aim to group similar features.
PCA: Transforms the variables into a new set of uncorrelated variables and reports the weight of each component's contribution to the separation; reduces dimensions before using supervision.
ICA: Separates features into groups by eliminating correlation and reports the weight of each component's contribution to the separation. Nonlinear, non-orthogonal transformation that standardizes each variable to unit variance and zero mean. Works well when other approaches do not because the data are not normally distributed; however, features are assumed to be independent when they may actually be dependent. Unlimited sample size; data non-normally distributed.
K-means: Separates features into clusters of similar expression patterns by comparing and grouping the magnitudes of changes in the means into K clusters. Easily visualized and intuitive; greatly reduces complexity. Sensitive to initial conditions and the user-specified number of clusters. Best with a limited dataset, i.e., ca. 20 to 300.
Clustering
The result of hierarchical clustering is a dendrogram (Figure 22); clusters are obtained by cutting the den-
drogram at a level where the jump between two consecutive nodes is large: connected components then form
individual clusters. It must be noted that there is no consensus on how to decide the “correct” number of
clusters. The cluster structure recovered by the in silico clustering does not necessarily represent the “true”
structure of the data† . As Yona et al. (2009) point out: “The application of any clustering algorithm will
result in some partitioning of the data into groups, (...) the choice of the clustering algorithm may greatly
affect the outcome (...) and their output may vary a great deal, depending on the starting point.” While
statistical approaches to validating cluster choices exist‡ , for most applications in RNA-seq analyses it will
suffice to judge the clustering results based on your prior knowledge of the experiment. In addition, the
structure of the dendrogram should yield compact, well-defined clusters.
A dendrogram can be generated in R using the functions cor(), as.dist(), and hclust():
1 # cor () calculates the correlation between columns of a matrix
2 > distance . m _ rlog <- as . dist (1 - cor ( rlog . norm . counts , method = " pearson " ) )
3
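The calls that actually compute and plot the dendrogram are not shown in the chunk above; a minimal sketch, using hclust() with its default linkage method, could be:

# cluster the samples based on the correlation-derived distances and plot the tree
> plot(hclust(distance.m_rlog),
       labels = colnames(rlog.norm.counts),
       main = "rlog-transformed read counts: Pearson correlation distance")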
‡ See, for example, the vignette of the clValid R package (clValid.pdf).
?
1. Which linkage method was used for the dendrogram generated with the code shown above?
2. Can you make a dendrogram with Euclidean distance and linkage method “average”?
DESeq2 also offers a convenience function based on ggplot2 to do PCA directly on a DESeqDataSet:
> library(DESeq2)
> library(ggplot2)

# PCA
> P <- plotPCA(DESeq.rlog)
§ By definition, PCs are ordered by decreasing variability, i.e., the first PC will always be the component that captures the most variability.
Figure 23: PCA on raw counts and rlog-transformed read counts with the DESeq2 convenience function plotPCA(). As indicated by the axis labels, the different sample types explain a greater fraction of the variance for the rlog-transformed values than for the raw counts.
# plot cosmetics
> P <- P + theme_bw() + ggtitle("Rlog transformed counts")
> print(P)
! PCA and clustering should be done on normalized and preferably transformed read counts, so
that the high variability of low read counts does not occlude potentially informative data trends.
In addition to performing exploratory analyses based on normalized measures of expression levels (Section 5),
numerous efforts have been dedicated to optimize statistical tests to decide whether a (single!) given gene’s
expression varies between two (or more) conditions based on the information gleaned from as little as two
or three replicates per condition. The two basic tasks of all DGE tools are:
1. Estimate the magnitude of differential expression between two or more conditions based on read counts
from replicated samples, i.e., calculate the fold change of read counts, taking into account the differences
in sequencing depth and variability (Section 5).
2. Estimate the significance of the difference and correct for multiple testing.
The best performing tools tend to be edgeR (Robinson et al., 2010), DESeq/DESeq2 (Anders and Huber,
2010; Love et al., 2014), and limma-voom (Ritchie et al., 2015) (see Rapaport et al. (2013); Soneson and
Delorenzi (2013); Schurch et al. (2015) for reviews of DGE tools). DESeq and limma-voom tend to be more
conservative than edgeR (better control of false positives), but edgeR is recommended for experiments with
fewer than 12 replicates (Schurch et al., 2015). These tools are all based on the R language and make heavy
use of numerous statistical methods that have been developed and implemented over the past two decades to improve the power to detect robust changes based on extremely small numbers of replicates and to help deal with the quirks of integer count data. These tools basically follow the same approach, i.e.,
they estimate the gene expression difference for a given gene using regression-based models (and taking the
factors discussed in Section 5 into account), followed by a statistical test based on the null hypothesis that
the difference is close to zero, which would mean that there is no difference in the gene expression values that
could be explained by the conditions. Table 7 has a summary of the key properties of the most popular DGE
tools; the next two sections will explain some more details of the two key steps of the DGE analyses.
Table 7: Comparison of programs for differential gene expression identification. Information shown here is based on
the user guides of DESeq2, edgeR, limmaVoom and Rapaport et al. (2013), Seyednasrollah et al. (2015), and Schurch
et al. (2015). LRT stands for log-likelihood ratio test.
Seq. depth normalization: DESeq2, sample-wise size factor; edgeR, gene-wise trimmed mean of M-values (TMM); limma-voom, gene-wise trimmed mean of M-values (TMM); Cuffdiff, FPKM-like or DESeq-like.
Dispersion estimate: DESeq2, Cox-Reid approximate conditional inference with focus on the maximum individual dispersion estimate; edgeR, Cox-Reid approximate conditional inference, moderated towards the mean; limma-voom, squeezes gene-wise residual variances towards the global variance.
Assumed distribution: DESeq2, negative binomial; edgeR, negative binomial; limma-voom, log-normal; Cuffdiff, negative binomial.
Test for DE: DESeq2, Wald test (2 factors) or LRT for multiple factors; edgeR, exact test for 2 factors or LRT for multiple factors; limma-voom, t-test; Cuffdiff, t-test.
False positives: DESeq2, low; edgeR, low; limma-voom, low; Cuffdiff, high.
Detection of differential isoforms: DESeq2, no; edgeR, no; limma-voom, no; Cuffdiff, yes.
Support for multi-factored experiments: DESeq2, yes; edgeR, yes; limma-voom, yes; Cuffdiff, no.
Runtime (3-5 replicates): DESeq2, edgeR and limma-voom, seconds to minutes; Cuffdiff, hours.
!
All statistical methods developed for read counts rely on approximations of various kinds, so that
assumptions must be made about the data properties. edgeR and DESeq, for example, assume that
the majority of the transcriptome is unchanged between the two conditions. If this assumption is
not met by the data, both log2 fold change and the significance indicators are most likely incorrect!
6.1 Estimating the difference between read counts for a given gene
To determine whether the read count differences between different conditions for a given gene are greater
than expected by chance, DGE tools must find a way to estimate that difference using the information from
the replicates of each condition. edgeR (Robinson et al., 2010), DESeq/DESeq2 (Anders and Huber, 2010; Love
et al., 2014), and limma-voom (Ritchie et al., 2015) all use regression models that are applied to every single
gene. Linear regression models typically take the form Y = b0 + b1 * x + e (Figure 24).
Here, Y will entail all read counts (from all conditions) for a given gene. b0 is called the intercept; x is the
condition (for RNA-seq, this is very often a discrete factor, e.g., “WT” or “mutant”, or, in mathematical
terms, 0 or 1), e is a term capturing the error or uncertainty, and b1 is the coefficient that captures the
difference between the samples of different conditions∗ .
The very simple model shown above could be fitted in R using the function lm(rlog.norm∼genotype)† ,
which will return estimates for both b0 and b1 , so that the average expression values of the baseline genotype
(e.g., WT = 0) would correspond to Y = b0 + b1 ∗ 0 + e. This is equivalent to Y = b0 (assuming that e is
very small), thereby demonstrating why the intercept (b0 ) can be interpreted as the average of our baseline
group. b1 , on the other hand, will be the coefficient whose closeness to zero will be evaluated during the
statistical testing step.
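As a toy illustration of this interpretation, one could fit such a model for a single gene and inspect the coefficients. The object names and numeric values below are made up for demonstration purposes and are not part of the yeast data set:

# hypothetical rlog-normalized values for one gene in five WT and five SNF2 samples
> genotype <- factor(c(rep("WT", 5), rep("SNF2", 5)), levels = c("WT", "SNF2"))
> rlog.gene <- c(5.1, 5.3, 4.9, 5.2, 5.0,   # WT replicates (toy values)
                 7.8, 8.1, 7.9, 8.0, 7.7)   # SNF2 replicates (toy values)
> fit <- lm(rlog.gene ~ genotype)
> coef(fit)      # (Intercept) = b0 (WT mean); genotypeSNF2 = b1 (difference SNF2 - WT)
> summary(fit)   # includes the t-statistic and p-value for b1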
Instead of using a linear model, DESeq2 and edgeR rely on a negative binomial model to fit the observed
read counts to arrive at the estimate for the difference. Originally, read counts had been modeled using the
Poisson distribution because:
• individual reads can be interpreted as binary data (Bernoulli trials): they either originate from gene i
or not.
• we are trying to model the discrete probability distribution of the number of successes (success = read
is present in the sequenced library).
• the pool of possible reads that could be present is large, while the proportion of reads belonging to
gene i is quite small.
The convenient feature of a Poisson distribution is that variance = mean. Thus, if the RNA-seq experiment
gives us a precise estimate of the mean read counts per condition, we implicitly know what kind of variance
to expect for read counts that are not truly changing between two conditions. This, in turn, then allows us to
identify those genes that show greater differences between the two conditions than expected by chance.
Unfortunately, only read counts of the same library preparation (= technical replicates) can be well approximated by the Poisson distribution; biological replicates have been shown to display greater variance
∗ You may have encountered linear models in different contexts: they are usually used to predict unknown values of Y, i.e., one often wants to find a function that returns Y at any given point along a certain trajectory captured by the model. In the case of RNA-seq, we are actually more interested in the values of the coefficients, more precisely the parameter (here: b1) that captures the difference between the distinct groups of expression values defined by x.
† In plain English: rlog-normalized expression values for gene X modeled based on the genotype (Figure 24).
(noise). This overdispersion can be captured with the negative binomial distribution, which is a more general
form of the Poisson distribution where the variance is allowed to exceed the mean. This means that we now
need to estimate two parameters from the read counts: the mean as well as the dispersion. The precision
of these estimates strongly depends on the number (and variation) of replicates – the more replicates, the
better the grasp on the underlying mean expression values of unchanged genes and the variance that is due
to biological variation rather than the experimental treatment. For most RNA-seq experiments, only two
to three replicates are available, which is not enough for reliable mean and variance estimates. Some tools
therefore compensate for the lack of replication by borrowing information across genes with similar expres-
sion values and shrink a given gene’s variance towards the regressed values. These fitted values of the mean
and dispersion are then used instead of the raw estimates to test for differential gene expression.
The null hypothesis is that there is no systematic difference between the average read count values of the
different conditions for a given gene. Which test is used to assign a p-value again depends on the tool
(Table 7), but generally you can think of them as some variation of the well-known t−test (How dissimilar
are the means of two populations?) or ANOVAs (How well does a reduced model capture the data when
compared to the full model with all coefficients?).
Once you’ve obtained a list of p-values for all the genes of your data set, it is important to realize that you
just performed the same type of test for thousands and thousands of genes. That means that even if you decide to focus on genes with a p-value smaller than 0.05, if you've looked at 10,000 genes your final list may contain 0.05 * 10,000 = 500 false positive hits. To guard yourself against this, all the tools will offer some sort
of correction for the multiple hypotheses you tested, e.g. in the form of the Benjamini-Hochberg formula. You
should definitely rely on the adjusted p-values rather than the original ones to zoom into possible candidates
for downstream analyses and follow-up studies.
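In R, the Benjamini-Hochberg adjustment can be applied to any vector of p-values; a toy example (the p-values below are made up):

> pvals <- c(0.0001, 0.001, 0.01, 0.04, 0.2, 0.6)
# adjusted p-values controlling the false discovery rate
> p.adjust(pvals, method = "BH")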
!
Note that the input for the DGE analysis are the raw read counts (untransformed, not normalized
for sequencing depth); while the tools will perform normalizations and transformations under the
hood, supplying anything but raw read counts to either DESeq2 or edgeR will result in nonsensical
results.
The results() function lets you extract the base means across samples, moderated log2 fold changes,
standard errors, test statistics etc. for every gene.
> DGE.results <- results(DESeq.ds, independentFiltering = TRUE, alpha = 0.05)
> summary(DGE.results)
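Note that results() assumes that the dispersion estimation and the actual testing have already been carried out on DESeq.ds; if only the size factors have been calculated so far, this presumably happened via DESeq2's main wrapper function, e.g.:

# run the full DESeq2 workflow (size factors, dispersion estimates, testing)
> DESeq.ds <- DESeq(DESeq.ds)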
MA plot MA plots were originally developed for visualizing cDNA microarray results, but they are also
useful for RNA-seq analyses. The MA plot provides a global view of the relationship between the expression
change between conditions (log ratios, M), the average expression strength of the genes (average mean,
A) and the ability of the algorithm to detect differential gene expression: genes that pass the significance
threshold (adjusted p-value <0.05) are colored in red (Figure 25).
> plotMA(DGE.results, alpha = 0.05, main = "WT vs. SNF2 mutants",
         ylim = c(-4, 4))
Figure 25: Left: Histogram of p-values for all genes tested for no differential expression between the two conditions,
SNF2 and WT. Right: The MA plot shows the relationship between the expression change (M) and average expression
strength (A); genes with adjusted p-values <0.05 are marked in red.
Heatmaps Heatmaps are a popular means to visualize the expression values across the individual samples.
The following commands can be used to obtain heatmaps for rlog-normalized read counts for genes that show
differential expression with adjusted p-values <0.05. There are numerous functions for generating heatmaps
in R including NMF::aheatmap(), gplots::heatmap.2() and pheatmap::pheatmap(). A great package for
generating customized heatmaps is the ComplexHeatmap package (Gu et al., 2016).
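The commands that build the matrix hm.mat_DGEgenes used in the next chunk are omitted here; a minimal sketch, assuming the DGE.results and rlog.norm.counts objects from the previous sections and that the rows of rlog.norm.counts are named by gene ID, could be:

# genes with adjusted p-value < 0.05, sorted by adjusted p-value
> DGEgenes <- rownames(subset(DGE.results, padj < 0.05))
> DGEgenes <- DGEgenes[order(DGE.results[DGEgenes, "padj"])]
# matrix of rlog-transformed read counts for those genes
> hm.mat_DGEgenes <- rlog.norm.counts[DGEgenes, ]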
# load the library with the aheatmap() function
> library(NMF)

# plot the normalized read counts of DE genes sorted by the adjusted p-value
> aheatmap(hm.mat_DGEgenes, Rowv = NA, Colv = NA)

# scale the read counts per gene to emphasize the sample-type-specific differences
> aheatmap(hm.mat_DGEgenes,
           Rowv = TRUE, Colv = TRUE,
           distfun = "euclidean", hclustfun = "average",
           scale = "row") # values are transformed into distances from the center
                          # of the row-specific average:
                          # (actual value - mean of the group) / standard deviation
Figure 26: Heatmaps of rlog-transformed read counts for genes with adjusted p-values <0.05 in the DGE analysis.
a) Genes sorted according to the adjusted p-values of the DGE analysis. b) Genes sorted according to hierarchical
clustering. c) Same as for (b), but the read count values are scaled per row so that the colors actually represent
z-scores rather than the underlying read counts.
Read counts of single genes An important sanity check of your data and the DGE analysis is to see
whether genes about which you have prior knowledge behave as expected. For example, the samples named
“SNF2” were generated from a mutant yeast strain where the snf2 gene was deleted, so snf2 should be
among the most strongly downregulated genes in this DGE analysis.
To check whether snf2 expression is absent in the mutant strain, we first need to map the ORF identifiers that
we used for generating the read count matrix to the gene name so that we can retrieve the rlog-transformed
read counts and the moderated log2 fold changes. There is more than one way to obtain annotation data; here we will use a database that can be directly accessed from within R. The website https://www.bioconductor.org/packages/release/data/annotation/ lists all annotation packages that are available through Bioconductor. For our yeast samples, we will go with org.Sc.sgd.db. For human data you could, for example, use org.Hs.eg.db.
> BiocManager::install("org.Sc.sgd.db")
> library(org.Sc.sgd.db)

# list the types of keywords that are available to query the annotation database
> keytypes(org.Sc.sgd.db)

# list columns that can be retrieved from the annotation database
> columns(org.Sc.sgd.db)

# DESeq2 offers a wrapper function to plot read counts for single genes
> library(grDevices) # for italicizing the gene name
> plotCounts(dds = DESeq.ds,
             gene = "YOR290C",
             normalized = TRUE, transform = FALSE,
             main = expression(atop("Expression of " * italic("snf2"), "(YOR290C)")))
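The lines that create the anno object used for the export below (mapping the yeast ORF identifiers to gene names) are not shown in this excerpt; a sketch using the select() function of the AnnotationDbi package could look like this (the keytype and column names are assumptions based on what keytypes() and columns() report for org.Sc.sgd.db):

# map ORF identifiers to gene names via the annotation package
> anno <- AnnotationDbi::select(org.Sc.sgd.db,
                                keys = keys(org.Sc.sgd.db, keytype = "ORF"),
                                keytype = "ORF",
                                columns = c("ORF", "GENENAME"))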
While R offers myriad possibilities to perform downstream analyses on the lists of DE genes, you may also
need to export the results into a simple text file that can be opened with other programs.
# merge the information of the DGE analysis with the information about the genes
> out.df <- merge(as.data.frame(DGE.results), anno,
                  by.x = "row.names", by.y = "ORF")

# export all values for all genes into a tab-separated text file
> write.table(out.df, file = "DESeq2results_WT-vs-SNF2.txt",
              sep = "\t", quote = FALSE, row.names = FALSE)
?
1. What is the difference between the moderated log-transformed values reported by either
rlog(DESeqDataSet) or results(DESeqDataSet)?
2. Which analyses should be preferably performed on log-transformed read counts, which ones
on log fold changes?
6.3.4 edgeR
edgeR is very similar in spirit to DESeq2: both packages rely on the negative binomial distribution to model
the raw read counts in a gene-wise manner while adjusting the dispersion estimates based on trends seen
across all samples and genes (Table 7). The methods are, however, not identical, and results may vary.
The following commands should help you perform a basic differential gene expression analysis, analogous
to the one we have shown you for DESeq2, where five replicates from two conditions (“SNF2”, “WT”) were
compared.
edgeR requires a matrix of read counts where the row names = gene IDs and the column names = sample
IDs. Thus, we can use the same object that we used for DESeq2 (readcounts). In addition, we need to
specify the sample types, similarly to what we did for DESeq2.
> BiocManager::install("edgeR") # install IF NEEDED
> library(edgeR)
# the columns of readcounts are SNF2_1..SNF2_5 followed by WT_1..WT_5
> sample_info.edger <- factor(c(rep("SNF2", 5), rep("WT", 5)))
> sample_info.edger <- relevel(sample_info.edger, ref = "WT")
Now, DGEList() is the function that converts the count matrix into an edgeR object.
> edgeR.DGElist <- DGEList(counts = readcounts, group = sample_info.edger)

# remove genes that do not have at least one count per million in at least 5 samples
# (adjust this to your sample!)
> keep <- rowSums(cpm(edgeR.DGElist) >= 1) >= 5
> edgeR.DGElist <- edgeR.DGElist[keep, ]
# estimate the dispersion for all read counts across all samples
> edgeR.DGElist <- estimateDisp(edgeR.DGElist, design)

# perform the testing for every gene using the neg. binomial model
> edger_lrt <- glmLRT(edger_fit)
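Several intermediate objects used above are not shown in this excerpt: the TMM normalization step, the design matrix passed to estimateDisp(), and the fitted model edger_fit. Presumably they were created along these lines:

# TMM normalization factors
> edgeR.DGElist <- calcNormFactors(edgeR.DGElist, method = "TMM")
# design matrix based on the sample information
> design <- model.matrix(~ sample_info.edger)
# fit the gene-wise negative binomial models (after estimateDisp() has been run)
> edger_fit <- glmFit(edgeR.DGElist, design)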
In contrast to DESeq, edgeR does not produce any values similar to the rlog-transformed read count values.
You can, however, get library-size normalized log2 fold changes.
# extract results from edger_lrt$table plus adjusted p-values
> DGE.results_edgeR <- topTags(edger_lrt, n = Inf, # to retrieve all genes
                               sort.by = "PValue", adjust.method = "BH")
6.3.5 limma-voom
Limma was originally developed for the analysis of microarray gene expression data using linear models. The
functions of the limma package have continuously been developed for much more than a decade and have laid
the foundation for many widely used statistical concepts for differential gene expression analysis (Smyth,
2004). In order to use the functionalities that had specifically been developed for microarray-based data,
Law et al. (2014) implemented “precision weights” that are meant to transform the finicky count data (with
all its statistically annoying properties including heteroskedasticity shown in Figure 21) into more tractable
normally distributed data. The two main differences to edgeR and DESeq are:
• count values are transformed to log-cpm;
• instead of negative binomial models, linear models are used (on the log-cpm values normalized with
“precision weights”, Figure 28).
The steps limma takes are:
1. For every sample and gene, calculate the counts per million reads and log-transform these.
2. Fit a linear model to the log-cpm values taking the experimental design into account (e.g., conditions,
batches etc.).
3. Use the resulting residual standard deviations for every gene to fit a global mean-variance trend across
all genes and samples.
4. To obtain a “precision weight” for every single gene observation (i.e., for every gene in every sample), the fitted log-cpm
values from step 2 are used to predict the counts for every gene and every sample. The mean-variance
trend (step 3) is then used to interpolate the corresponding standard deviation for these predicted
counts.
5. The squared inverse of this observation-wise estimated standard deviation is used as a penalty (in-
verse weight) during the test for differential expression. These penalty values are the above mentioned
“precision weights”.
Like DESeq and edgeR, limma starts with a matrix of raw read counts where each gene is represented by a
row and the columns represent samples. limma assumes that rows with zero or very low counts have been
removed. In addition, size factors for sequencing depth can be calculated using edgeR’s calcNormFactors()
function.
> library(edgeR)
# use edgeR to remove lowly expressed genes and normalize reads for
# sequencing depth; see code chunks above
# > sample_info.edger <- factor(c(rep("SNF2", 5), rep("WT", 5)))
# > sample_info.edger <- relevel(sample_info.edger, ref = "WT")
# > edgeR.DGElist <- DGEList(counts = readcounts, group = sample_info.edger)
# > keep <- rowSums(cpm(edgeR.DGElist) >= 1) >= 5
# > edgeR.DGElist <- edgeR.DGElist[keep, ]
# > edgeR.DGElist <- calcNormFactors(edgeR.DGElist, method = "TMM")
> library(limma)

# transform the count data to log2-counts-per-million and estimate
# the mean-variance relationship, which is used to compute weights
# for each count -- this is supposed to make the read counts
# amenable to be used with linear models
> design <- model.matrix(~ sample_info.edger)
> rownames(design) <- colnames(edgeR.DGElist)
> voomTransformed <- voom(edgeR.DGElist, design, plot = FALSE)
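The steps that turn the voom-transformed values into the results table DGE.results_limma used further below are not shown here; presumably they follow the standard limma workflow, roughly:

# fit the linear models and moderate the variances
> voomed.fit <- lmFit(voomTransformed, design)
> voomed.fit <- eBayes(voomed.fit)
# extract the full results table (coef = 2 is the condition coefficient of this design)
> DGE.results_limma <- topTable(voomed.fit, coef = 2,
                                number = Inf, adjust.method = "BH")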
The logFC column gives the value of the contrast. Usually this represents a log2-fold change
between two or more experimental conditions although sometimes it represents a log2-expression
level. The AveExpr column gives the average log2-expression level for that gene across all the
arrays and channels in the experiment. Column t is the moderated t-statistic. Column P.Value
is the associated p-value and adj.P.Value is the p-value adjusted for multiple testing. The most
popular form of adjustment is “BH” which is Benjamini and Hochberg’s method to control the
false discovery rate. (...) The B-statistic (lods or B) is the log-odds that the gene is differentially
expressed. Suppose for example that B = 1.5. The odds of differential expression is exp(1.5) =
4.48, i.e, about four and a half to one. The probability that the gene is differentially expressed is
4.48/(1 + 4.48) = 0.82, i.e., the probability is about 82% that this gene is differentially expressed.
A B-statistic of zero corresponds to a 50-50 chance that the gene is differentially expressed. The
B-statistic is automatically adjusted for multiple testing by assuming that 1% of the genes, or
some other percentage specified by the user in the call to eBayes(), are expected to be differentially
expressed. The p-values and B-statistics will normally rank genes in the same order. In fact, if
the data contains no missing values or quality weights, then the order will be precisely the same.
!
For RNA-seq experiments with more complex designs, e.g. with more batches and conditions, the vignettes of DESeq2 and edgeR contain very good introductions and examples, including for time course experiments and paired samples, as well as for filtering genes with the genefilter package (Bourgon et al., 2010). For a very comprehensive description of the theory of linear models, and in particular for numerous motivations of the design matrix, have a look at Smyth (2004).
Once you have obtained a table of genes that show signs of differential expression, you have reached one
of the most important milestones of RNA-seq analysis! To evaluate how confident you can be in that list
of DE genes, you should look at several aspects of the analyses you did and perform basic checks on your
results:
1. Did the unsupervised clustering and PCA analyses reproduce the major trends of the initial experiment?
For example, did replicates of the same condition cluster together and were they well separated from
the replicates of the other condition(s)?
2. How well do different DGE programs agree on the final list of DE genes? You may want to consider
performing downstream analyses only on the list of genes that were identified as DE by more than one
tool.
3. Do the results of the DGE analysis agree with results from small-scale experiments? Can you reproduce
qPCR results (and vice versa: can you reproduce the results of the DGE analysis with qPCR)?
4. How robust are the observed fold changes? Can they explain the effects you see on a phenotypic level?
! If your RNA-seq results are suggesting expression changes that differ dramatically from everything
you would have expected based on prior knowledge, you should be very cautious!
The following code and images should just give you some examples of typical follow-up visualizations that can be done. To avoid having to load full libraries, we will use R's library::function syntax, i.e., gplots::venn() calls the function venn() of the gplots package. This directly calls that function without loading all other functions of gplots into the working environment.
# make a Venn diagram
> DE_list <- list(edger = rownames(subset(DGE.results_edgeR$table, FDR <= 0.05)),
                  deseq2 = rownames(subset(DGE.results, padj <= 0.05)),
                  limma = rownames(subset(DGE.results_limma, adj.P.Val <= 0.05)))
> gplots::venn(DE_list)
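The construction of DE_fc, the log2 fold changes of the genes called DE by all three tools, is skipped in this excerpt; a sketch of what it could look like, assuming each results table is indexed by gene IDs:

# genes identified as DE by all three tools
> DE_common <- Reduce(intersect, DE_list)
# one column of log2 fold changes per tool
> DE_fc <- data.frame(edger  = DGE.results_edgeR$table[DE_common, "logFC"],
                      limma  = DGE.results_limma[DE_common, "logFC"],
                      deseq2 = DGE.results[DE_common, "log2FoldChange"],
                      row.names = DE_common)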
# heatmap of logFC
> pheatmap::pheatmap(as.matrix(DE_fc))
See Figure 29 for the plots generated by the code above. They illustrate that there is large agreement between the three different tools, since the vast majority of genes is identified as differentially expressed by all three. More importantly, all three tools agree on the direction and the magnitude of the fold changes, although there are some individual genes for which DESeq2's estimates of the log fold changes differ slightly from those of edgeR or limma.
Figure 29: Some basic plots to judge the agreement of the three different DGE tools that we used. (a) Venn diagram
of gene names. (b) Upset plot of gene names that displays the total size of every set in the bottom left corner,
followed by the type of intersection (dots connected by lines) and the size of the intersection using vertical bars. (c)
Pairwise plots of estimated/moderated log-fold-changes as determined by either one of the tools. Shown here are the
genes that were identified as DE in all three tools (2,580 genes). The code used to generate those images is shown at
the beginning of section 6.4.
Most downstream analyses will be very specific to your question of interest and the model system you are
studying. Generally, most downstream analyses are aimed at elucidating the functions of the DE genes and
to identify possible patterns among them. There are myriad tools to achieve this goal, typical analyses
include:
• enrichments of certain gene ontology (GO) terms encompassing the three classes of GO terms: biological processes, cellular components and molecular functions;
• enrichments of certain pathways such as those defined by MSigDB (Liberzon et al., 2015), STRING
(Szklarczyk et al., 2017), or KEGG (Kanehisa et al., 2017);
• identification of specific “master” regulators or transcription factors that may underlie a bulk of the
changes that are seen.
Enrichments are typically assessed by either one of two approaches: (i) over-representation analysis (ORA)
or (ii) gene set enrichment analyses (GSEA). Both approaches are discussed in great detail by Khatri et al.
(2012) and Alhamdoosh et al. (2017).
All types of enrichment analyses are based on comparisons between genes, that means that the gene-specific
biases such as gene length may cause spurious findings. In fact, Young et al. (2010) describe convincingly
how long transcripts and genes are much more likely to (a) be detected as differentially expressed and (b)
tend to be over-represented in most of the commonly used databases because their length makes it more
likely that they are detected across many different experiments. The goseq package therefore tries to correct
for the inherently increased likelihood of long genes to be present on your list of interest.
Over-representation analyses These types of analyses rely on a filtered list of genes of interest, e.g. all
genes that pass the DE filter. This list is compared to the genes that are known to be part of a specific pathway
or a generic gene set of interest, e.g. “Glycolysis”. A statistical test (e.g. hypergeometric test) is then used
to determine whether the overlap between the gene list of interest and the known pathway is greater than
expected by chance. While this approach is relatively straightforward, there are serious limitations, including the fact that both magnitude and direction of the change of individual genes are completely disregarded; the only measure that matters is the presence or absence of a given gene within the lists that are being compared.
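As a toy example of such a hypergeometric test in R (all numbers below are made up): suppose 6,000 genes form the universe, 200 of them belong to the pathway of interest, the DE list contains 500 genes, and 40 of those are pathway members. The probability of seeing at least 40 pathway genes in the DE list by chance is then:

# P(X >= 40) for a hypergeometric distribution with 200 pathway genes,
# 5,800 non-pathway genes, and 500 genes drawn (the DE list)
> phyper(40 - 1, m = 200, n = 6000 - 200, k = 500, lower.tail = FALSE)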
Gene set enrichment analyses To address some of the limitations of the ORA approach, functional
scoring algorithms typically do not require a pre-selected list of genes; instead, they require a fairly exhaustive
list of all genes that could make up your “universe” of genes. These genes should have some measure of change
by which they will be sorted. The basic assumption is that although large changes in individual genes can
have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes
(i.e., pathways) can also have significant effects. Therefore, the gene-level statistics for all genes in a pathway
are aggregated into a single pathway-level statistic (e.g. the sum of all log-fold changes), which will then be
evaluated.
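For example, the fgsea package (listed among the tools below) implements such a pathway-level aggregation; a minimal sketch, assuming gene_stats is a named vector of per-gene statistics (e.g. log2 fold changes, named by gene ID) and pathways is a named list of gene-ID vectors:

> library(fgsea)
# run the gene set enrichment analysis and inspect the most significant pathways
> gsea_res <- fgsea(pathways = pathways, stats = sort(gene_stats))
> head(gsea_res[order(gsea_res$padj), ])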
While GSEA does take the magnitude and direction of change into consideration, pathways are regarded as
independent units despite the fact that many pathways share individual genes. See Khatri et al. (2012) for
an in-depth discussion of its limitations.
! The vast majority of downstream analyses are hypothesis-generating tools! Most gene set tests
are not robust to changes in sample size, gene set size, experimental design and fold-change biases.
The following list contains some of the tools that can be used for enrichment analyses – it is by no means
exhaustive, but it may give you an idea of additional possible directions and tools to explore. For example
analyses with goseq and clusterProfiler using the yeast dataset we’ve been using throughout class, see
https://github.com/friedue/course_RNA-seq2019/tree/master/Day04.
• for GO term enrichments, you can first identify enriched GO terms with limma::goana() or the goseq
package, followed by additional tools such as GOrilla (http://cbl-gorilla.cs.technion.ac.il/)
and REVIGO (http://revigo.irb.hr/) to summarize the lists
• for KEGG pathway enrichment, try GAGE & PATHVIEW (https://pathview.uncc.edu/)‡
• the clusterProfiler package (https://yulab-smu.github.io/clusterProfiler-book/) offers multi-
ple ways for testing enrichments, including for REACTOME and KEGG pathways;
• Gene Set Enrichment Analysis (GUI-based GSEA: https://www.broadinstitute.org/gsea/index.jsp; fast R-based GSEA implementation: http://www.bioconductor.org/packages/release/bioc/html/fgsea.html)
• Enrichr (https://amp.pharm.mssm.edu/Enrichr3/) and RegulatorTrail (https://regulatortrail.
bioinf.uni-sb.de/) offer the ability to identify possible upstream regulators
• Ingenuity Pathway Analysis Studio (commercial software)
• ...
‡ https://www.r-bloggers.com/tutorial-rna-seq-differential-expression-pathway-analysis-with-sailfish-deseq2-gage-and-path
Figure 30: Example plots that may be produced following ORA with goseq and REVIGO (a) or gene set enrichment
analysis using the KEGG pathways and visualizations provided by clusterProfiler (b, c). (a) Treemap generated
by REVIGO to group GO terms that were determined to be enriched in our set of yeast DE genes by goseq. The size of the squares corresponds to the inverse of the log(p-value) of the enrichment, and the colors indicate semantically similar
terms. (b) Dot plot generated with the clusterProfiler package following GSEA for KEGG pathways. Shown here
are the most significantly enriched pathways. (c) Gene concept network generated with the cnetplot function of the
clusterProfiler package on the same results. This plot is useful to visually assess the overlap of genes between
different gene sets: the edges correspond to the membership of a given gene within a gene set (central node). See
https://github.com/friedue/course_RNA-seq2019/tree/master/Day04 for the code.
7 Appendix
The problem with our alignment command in Section 3.2 is that the aligned reads contain massive insert sizes. This is most likely due to the default settings controlling the size of what an acceptable intron looks like. Since yeast has a fairly small genome with relatively few (and small) introns per gene, we should tune that parameter to fit the specific needs.
To determine suitable lower and upper limits for intron sizes, we will need to download an annotation that
will allow us to determine those sizes easily. This could, for example, be done via the UCSC Table Browser,
as described in this Biostars post: https://www.biostars.org/p/13290/.
# get min. intron size
$ awk '{print $3-$2}' introns_yeast.bed | sort -k1n | uniq | head -n 3
1
31
35
FILES=`echo $FILES | sed 's/,$//'` # if you want to remove the last comma
Table 8: All high-throughput sequencing data will suffer from some degree of bias due to the biochemistry of the
sequencing, the detection technique and bioinformatics processing. Biases that are oftentimes sample-specific (e.g.,
GC content, fragment length distributions) are common sources of technical variation that can either mask or (worse!)
mimic biological signals. For descriptions of RNA-seq specific biases, see the main text and (’t Hoen et al., 2013; Li
et al., 2014; Su et al., 2014).
Per base sequence quality: ...sequence and stored by the sequencing platform (Phred scores, see above); for Illumina sequencing, the reagents degrade throughout the sequencing run, which is why the last bases tend to get worse scores. Flagged if any base is <10 (<5), or the median for any base is <25 (<20).
Per tile sequence quality: check whether a part of the flowcell failed (e.g., due to bubbles or dirt on the flowcell); the plot shows the deviation from the average quality. Flagged if any tile shows a mean Phred score >2 (>5) below the average for that base across all tiles. The FASTQ file contains the tile IDs for each read, which can be used to filter out reads from flawed tiles; note, though, that variation in the Phred score across a flowcell is also a sign of overloading, and tiles that are affected for only a few cycles can be ignored.
Per sequence quality scores: identify putative subsets of poor sequences. Flagged if the most frequent mean quality is <27 (<20). Possible remedy: quality trimming.
Per base sequence content: expectation is that all bases are sequenced as often as they are represented in the genome; reasons why this assumption may not hold include random hexamer priming during library preparation and contamination (e.g., adapter dimers). Flagged if the difference between A and T, or G and C, is >10% (>20%) at any position. If overrepresented sequences are the cause of a failure, these can be removed.
Per sequence GC content: the GC content of each read should be roughly normally distributed (the maximum should correspond to the average GC content of the genome); note that the reference distribution is also based on the supplied FASTQ file, therefore a global shift of GC content will not be detected. Flagged if >15% (30%) of the reads deviate from the normal distribution. Sharp peaks on an otherwise smooth distribution usually indicate a specific contaminant that should also be reported as an overrepresented sequence; broader peaks outside the expected distribution may represent contaminating sequences from a species with a different genomic GC content.
Per base N content: percentage of base calls at each position for which an N was called by the sequencer, indicating a lack of confidence to make a base call. Flagged if any position has an N content of >5% (20%). A common reason for large numbers of N is a lack of library complexity, i.e., if all reads start with the same bases, the sequencer may not be able to differentiate them sufficiently.
Continued on next page
Table 10: Types of optional entries that can be stored in the header section of a SAM file following the format
@<Section> <Tag>:<Value>. The asterisk indicates which tags are required if a section is set.
Table 11: Overview of RSeQC scripts. See the online documentation of RSeQC (http://rseqc.sourceforge.net) and
Wang et al. (2012) for more details. Note that tools marked in red have been shown to return erroneous results
(Hartley and Mullikin, 2015).
Script Purpose
Basic read quality
read duplication Determine read duplication rate, based on the sequence only
(output.dup.seq.DupRate.xls) as well as on the alignment positions
(output.dup.pos.DupRate.xls).
read hexamer Calculates hexamer frequencies from FASTA or FASTQ files. Similar to
FASTQC’s k-mer content analysis.
read GC Calculates % GC for all reads. Similar to FASTQC’s GC distribution plot,
the peak of the resulting histogram should coincide with the average GC
content of the genome.
read NVC Calculates the nucleotide composition across the reads; similar to FASTQC’s
per base sequence content analysis.
read quality Calculates distributions for base qualities across reads; similar to FASTQC’s
per base sequence quality analysis.
Alignment QC
bam stat Calculates reads mapping statistics for a BAM (or SAM) file. Note that
uniquely mapped reads are determined on the basis of the mapping quality.
clipping profile Estimates clipping profile of RNA-seq reads from BAM or SAM file. This
will fail if the aligner that was used does not support clipped mapping
(CIGAR strings must have S operation).
mismatch profile Calculates distribution of mismatches along reads based on the MD tag.
insertion profile, Calculates the distribution of insertions or deletions along reads.
deletion profile
read distribution Calculates fractions of reads mapping to transcript features such as exons,
5’UTR, 3’ UTR, introns.
RPKM saturation The full read set is sequentially downsampled and RPKMs are calculated for each subset. The resulting plot helps determine how well the RPKM values can be estimated. The visualized percent relative error is calculated as |RPKM_subsample - RPKM_max| / RPKM_max * 100.
RNA-seq-specific QC
geneBody coverage Scales all transcripts to 100 bp, then calculates the coverage of each base.
The read coverage should be uniform and ideally not show 5’ or 3’ bias
since that would suggest problems with degraded RNA input or with cDNA
synthesis.
infer experiment Speculates how RNA-seq sequencing was configured, i.e., PE or SR and
strand-specificity. This is done by subsampling reads and comparing their
genome coordinates and strands with those of the transcripts from the ref-
erence gene model. For non-strand-specific libraries, the strandedness of
the reads and the transcripts should be independent. See http://rseqc.
sourceforge.net/#infer-experiment-py for details.
junction annotation Compares detected splice junctions to the reference gene model, classifying
them into 3 groups: annotated, complete novel, partial novel (only one of
the splice sites is unannotated). The script differentiates between splice
events (single read level) and splice junctions (multiple reads show the same
splicing event).
junction saturation Similar concept to RPKM saturation: splice junctions are detected for each
sampled subset of reads. The detection of annotated splice sites should be
saturated with the maximum coverage (= all supplied reads), otherwise
alternative splicing analyses are not recommended because low abundance
splice sites will not be detected.
Continued on next page
Script Purpose
tin Calculates the transcript integrity number (TIN) (not to be confused with
the RNA integrity number, RIN, that is calculated before sequencing based
on the 28S/18S ratio). TIN is calculated for each transcript and represents
the fraction of the transcript with uniform read coverage.
General read file handling and processing
bam2fq Converts BAM to FASTQ.
bam2wig Converts read positions stored in a BAM file (all types of RNA-seq data) into read coverage measures in wiggle format.
divide bam Equally divides a given BAM file (m alignments) into n parts. Each part
contains roughly m/n alignments that are randomly sampled from the entire
alignment file.
inner distance Estimates the inner distance (or insert size) between two paired RNA reads
(requires PE reads). The results should be consistent with the gel size se-
lection during library preparation.
overlay bigwig Allows the manipulation of two BIGWIG files, e.g., to calculate the sum of
coverages. See the --action option for all possible operations.
normalize bigwig Normalizes all samples to the same wigsum (= number of bases covered by
read length ∗ total no. of reads).
split bam, Provided with a gene list and a BAM file, this module will split the original
split paired bam BAM file into 3 smaller ones:
1. *.in.bam: reads overlapping with regions specified in the gene list
2. *.ex.bam: reads that do not overlap with the supplied regions
3. *.junk.bam: reads that failed the QC or are unaligned
RPKM count Calculates the raw count and RPKM values for each exon, intron and mRNA
region defined by the provided annotation file.
Table 12: Overview of QoRTs QC functions. See the online documentation of QoRTs (http://hartleys.github.io/
QoRTs/index.html) and Hartley and Mullikin (2015) for more details.
QC Function Purpose
Basic read quality
GCDistribution Calculates GC content distribution for all reads. Similar to FASTQC’s GC
distribution plot, the peak of the resulting histogram should coincide with
the average GC content of the genome.
NVC Calculates the nucleotide composition across the length of the reads; similar
to FASTQC’s per base sequence content analysis.
QualityScoreDistribution Calculates distributions for base qualities across reads; similar to FASTQC’s per base sequence quality analysis.
Alignment QC
chromCounts Calculates number of reads mapping to each category of chromosome (au-
tosomes, allosomes, mtDNA).
CigarOpDistribution Calculates rate of various CIGAR operations as a function of read length,
including insertions, deletions, splicing, hard and soft clipping, padding, and
alignment to reference. See Section 3.3.2 for details about CIGAR opera-
tions.
GeneCalcs Calculates fractions of reads mapping to genomic features such as unique
genes, UTRs, ambiguous genes, introns, and intergenic regions.
RNA-seq-specific QC
writeGeneBody Breaks up all genes into 40 equal-length counting bins and determines the
number of reads that overlap with each counting bin. The read coverage
should be uniform and ideally not show 5’ or 3’ bias since that would suggest
degraded RNA input or problems with cDNA synthesis.
writeGenewiseGeneBody Writes file containing gene-body distributions for each gene.
StrandCheck Checks if the data is strand-specific by calculating the rate at which reads appear to follow the two possible library-type strandedness rules (fr-
firststrand and fr-secondstrand, described by the CuffLinks documentation
at http://cufflinks.cbcb.umd.edu/manual.html#library).
JunctionCalcs Calculates the number of novel and known splice junctions. Splice junctions
are split into 4 groups, first by whether the splice junction appears in the
gene annotation GTF (known vs. novel), and then by whether the splice
junction has fewer or ≥ 4 reads covering it.
General read file handling and processing
InsertSize Estimates the inner distance (or insert size) between two paired RNA reads
(requires PE reads). The results should be consistent with the gel size se-
lection during library preparation.
makeWiggles Converts read positions stored in a BAM file into wiggle files with 100-bp
window size.
makeJunctionBed Creates a BED file for splice-junction counts.
writeGeneCounts Calculates raw number of reads mapping to each gene in the annotation file.
Also creates a cumulative gene diversity plot, which shows the percent of the
total proportion of reads as a function of the number of genes sequenced.
This is useful as an indicator of whether a large proportion of the reads stem from a small number of genes (as a result of ribosomal RNA or
hemoglobin contamination, for example).
calcDetailedGeneCounts Calculates more detailed read counts for each gene, including the number
of reads mapping to coding regions, UTR, and intronic regions.
Continued on next page
Function Purpose
writeBiotypeCounts Writes a table listing read counts for each biotype, which is a classification
of genes into broader categories (e.g., protein coding, pseudogene, processed
pseudogene, miRNA, rRNA, scRNA, snoRNA, snRNA). Note that, in or-
der for this function to succeed, the optional “gene biotype” attribute is
required to be present in the gene annotation file (GTF).
FPKM Calculates FPKM values for each gene in the annotation file.
Table 13: Normalization methods for the comparison of gene read counts between different conditions. See, for
example, Bullard et al. (2010) and Dillies et al. (2013) for comprehensive assessments of the individual methods.
Total Count
  Method: All read counts are divided by the total number of reads (library size) and multiplied by the
  mean total count across all samples.
  Properties:
  • biased by highly expressed genes
  • cannot account for a different RNA repertoire between samples
  • poor detection sensitivity when benchmarked against qRT-PCR (Bullard et al., 2010)

DESeq’s size factor
  Method:
  1. For each gene, the geometric mean of read counts across all samples is calculated.
  2. Every gene count is divided by this geometric mean.
  3. A sample’s size factor is the median of these ratios (skipping the genes with a geometric mean of zero).
  Properties:
  • the size factor is applied to all read counts of a sample
  • more robust than total count normalization
  • implemented by the DESeq R library (estimateSizeFactors() function); also available in edgeR
    (calcNormFactors() function with option method = "RLE")
  • details in Anders and Huber (2010)

Trimmed Mean of M-values (TMM)
  Method: TMM is always calculated as the weighted mean of log ratios between two samples:
  1. Calculate gene-wise log2 fold changes (= M-values):
     $M_g = \log_2\!\left(\frac{Y_{gk}/N_k}{Y_{gk'}/N_{k'}}\right)$
     where $Y_{gk}$ is the observed number of reads for gene $g$ in library $k$ and $N_k$ is the total
     number of reads in library $k$.
  2. Trimming: removal of the upper and lower 30%.
  3. Precision weighting: the inverse of the estimated variance is used to account for the lower variance
     of genes with larger counts.
  Properties:
  • the size factor is applied to every sample’s library size; normalized read counts are obtained by
    dividing raw read counts by the TMM-adjusted library sizes
  • more robust than total count normalization
  • implemented in edgeR via calcNormFactors() with the default method = "TMM"
  • details in Robinson and Oshlack (2010)

Upper quartile
  Method:
  1. Find the upper quartile value (the 75th percentile of read counts after removing genes with 0 reads).
  2. Divide all read counts by this value.
  Properties:
  • similar to total count normalization, so it also suffers from a strong influence of highly expressed
    DE genes
  • can be calculated with edgeR’s calcNormFactors() function (method = "upperquartile")
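To make these options concrete, here is a minimal R sketch (not a complete workflow) that computes the
size factors and normalization factors named above. It assumes a raw count matrix counts (genes × samples)
and a sample grouping factor condition, both of which are placeholders; the median-of-ratios step is shown
via DESeq2, which provides the same estimateSizeFactors() function as the original DESeq package:

# DESeq-style median-of-ratios size factors (shown here with DESeq2)
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = data.frame(condition = condition),
                              design    = ~ condition)
dds <- estimateSizeFactors(dds)
sizeFactors(dds)                                   # one size factor per sample

# edgeR normalization factors
library(edgeR)
y <- DGEList(counts = counts)
y <- calcNormFactors(y, method = "TMM")            # TMM (the default)
# calcNormFactors(y, method = "RLE")               # DESeq-like relative log expression
# calcNormFactors(y, method = "upperquartile")     # upper-quartile normalization
y$samples                                          # library sizes and norm.factors per sample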
Table 14: Normalization methods for the comparison of gene read counts within the same sample.
FPKM (fragments per kilobase ...)
  1. Same as RPKM, but for paired-end reads:
  2. The number of fragments (defined by two reads each) is used.
  • implemented in DESeq2’s fpkm() function

TPM
  $\mathrm{TPM}_i = \frac{X_i}{l_i} \cdot \frac{1}{\sum_j X_j / l_j} \cdot 10^{6}$
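As a worked illustration of the TPM formula, here is a minimal R sketch for a single sample; counts (raw
reads per gene) and gene_lengths (gene length in bp) are placeholder vectors, not part of any package:

# reads per kilobase for each gene (X_i / l_i)
rate <- counts / (gene_lengths / 1000)
# rescale so that the values of one sample sum to 10^6
tpm  <- rate / sum(rate) * 1e6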
Tools are shown in order of usage throughout the script. The exact version numbers and paths may
be subject to change!
FASTQC
1. Download.
$ wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.zip
2. Unzip and make executable.
$ unzip fastqc_v0.11.8.zip
$ cd FastQC
$ chmod 755 fastqc
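3. As a quick check that the install worked, FASTQC can be run on any FASTQ file; the file path below is
only a placeholder:

$ ./fastqc --version
$ ./fastqc /path/to/sample.fastq.gz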
MultiQC
1. Install anaconda, a Python distribution that helps manage Python packages and installations:
$ wget https://repo.continuum.io/archive/Anaconda2-5.3.1-Linux-x86_64.sh
$ bash Anaconda2-5.3.1-Linux-x86_64.sh
You will need to accept the license terms, specify where to install anaconda, and specify whether you
want anaconda’s install location to be prepended to your PATH variable.
2. Make sure anaconda’s install location is prepended to your PATH variable:
$ export PATH=/home/classadmin/software/anaconda2/bin:$PATH
3. Install MultiQC using anaconda’s pip:
$ pip install multiqc
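4. As a quick check, MultiQC can be pointed at a directory containing the output of other QC tools (here
simply the current directory); it aggregates everything it recognizes into a single multiqc_report.html:

$ multiqc .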
samtools
1. Download source code and unzip it.
$ wget -O samtools-1.9.tar.bz2 https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
$ tar jxf samtools-1.9.tar.bz2
$ cd samtools-1.9
2. Compile.
$ make
3. Check whether the tool is running.
$ ./samtools
4. Add the location where samtools was installed to your PATH variable; this way you will not need to
specify the exact location every time you want to run the tool.
$ export PATH=/home/classadmin/software/samtools-1.9:$PATH
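5. As an example of a first command once samtools is on the PATH, the basic mapping statistics of any
BAM file can be printed; the file name below is a placeholder:

$ samtools flagstat aligned_reads.bam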
RSeQC
1. RSeQC is a Python tool like MultiQC and can be installed using pip. Make sure anaconda’s install
location is prepended to your PATH variable:
$ echo $PATH
# if you don't see anaconda2 somewhere in there (or not the correct path), do:
export PATH=/home/classadmin/software/anaconda2/bin:$PATH
2. Install RSeQC using anaconda’s pip:
$ pip install RSeQC
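3. RSeQC installs a collection of individual scripts; as a quick check, one of them can be run directly on
an aligned sample. The BAM and BED file names below are placeholders:

$ read_distribution.py -i aligned_reads.bam -r gene_model.bed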
QoRTs
1. Install the R component (in R):
> install.packages("http://hartleys.github.io/QoRTs/QoRTs_LATEST.tar.gz",
                   repos = NULL,
                   type = "source");
2. Download the Java component (in the Terminal).
$ wget -O qorts.jar "https://github.com/hartleys/QoRTs/archive/v1.3.6.tar.gz"
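3. To verify that the R component was installed, it should load without errors in R (a minimal check):

> library(QoRTs)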
STAR
1. Download.
$ wget -O STAR-2.7.1a.tar.gz https://github.com/alexdobin/STAR/archive/2.7.1a.tar.gz
2. Unzip.
$ tar -zxf STAR-2.7.1a.tar.gz
To run STAR (from inside the extracted STAR-2.7.1a directory, which contains a precompiled static Linux
binary):
$ ./bin/Linux_x86_64_static/STAR
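As a quick check that the binary runs on your system, it can also be asked for its version:

$ ./bin/Linux_x86_64_static/STAR --version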
# the following lines are the body of a download loop over UCSC utility names
# (i.e., they run inside "for PROGRAM in <utility names>; do ... done"):
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/${PROGRAM}
chmod +x ${PROGRAM}
done
R
1. Download.
$ wget --no-check-certificate https://cran.r-project.org/src/base/R-3/R-3.4.3.tar.gz
2. Unzip.
$ tar zxvf R-3.4.3.tar.gz
3. Compile. You can use the --prefix option to specify the folder where you would like to install the
program.
$ cd R-3.4.3
$ ./configure --prefix=<path to folder of choice>
$ make
$ make install
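4. After make install, the new build can be checked by calling the binary under the chosen prefix (the
placeholder stands for whatever path was passed to --prefix):

$ <path to folder of choice>/bin/R --version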
References
Alhamdoosh M, Ng M, Wilson NJ, Sheridan JM, Huynh H, Wilson MJ, and Ritchie ME. Combining multiple
tools outperforms individual methods in gene set enrichment analyses. Bioinformatics (Oxford, England),
2017. doi:10.1093/bioinformatics/btw623.
Altman N and Krzywinski M. Points of significance: Sources of variation. Nature Methods, 12(1):5–6, 2014.
doi:10.1038/nmeth.3224.
Anders S and Huber W. DESeq: Differential expression analysis for sequence count data. Genome Biology,
11:R106, 2010. doi:10.1186/gb-2010-11-10-r106.
Anders S, Pyl PT, and Huber W. HTSeq - A Python framework to work with high-throughput sequencing
data. Bioinformatics, 31(2):166–169, 2014. doi:10.1093/bioinformatics/btu638.
Anders S, Reyes A, and Huber W. Detecting differential usage of exons from RNA-seq data. Genome
Research, 22(10):2008–2017, 2012. doi:10.1101/gr.133744.111.
Auer PL and Doerge RW. Statistical design and analysis of RNA sequencing data. Genetics, 185(2):405–16,
2010. doi:10.1534/genetics.110.114983.
Ballouz S, Dobin A, Gingeras TR, and Gillis J. The fractured landscape of RNA-seq alignment: the default
in our STARs. Nucleic Acids Research, 2018. doi:10.1093/nar/gky325.
Benjamini Y and Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing.
Nucleic Acids Research, 40(10):e72, 2012. doi:10.1093/nar/gks001.
Bernstein MN, Doan A, and Dewey CN. MetaSRA: Normalized human sample-specific metadata for the
Sequence Read Archive. Bioinformatics, 2017. doi:10.1093/bioinformatics/btx334.
Blainey P, Krzywinski M, and Altman N. Points of Significance: Replication. Nature Methods, 11(9):879–880,
2014. doi:10.1038/nmeth.3091.
Boone M, De Koker A, and Callewaert N. Survey and summary capturing the ’ome’: The expanding
molecular toolbox for RNA and DNA library construction. Nucleic Acids Research, 46(6):2701–2721,
2018. doi:10.1093/nar/gky167.
Bourgon R, Gentleman R, and Huber W. Independent filtering increases detection power for high-throughput
experiments. PNAS, 107(21):9546–9551, 2010. doi:10.1073/pnas.0914005107.
Bray NL, Pimentel H, Melsted P, and Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat
Biotech, 34(5):525–527, 2016. doi:10.1038/nbt.3519.
Bullard JH, Purdom E, Hansen KD, and Dudoit S. Evaluation of statistical methods for normalization
and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11:94, 2010. doi:10.1186/
1471-2105-11-94.
Byron SA, Van Keuren-Jensen KR, Engelthaler DM, Carpten JD, and Craig DW. Translating RNA se-
quencing into clinical diagnostics: opportunities and challenges. Nature Reviews Genetics, 17(5):257–271,
2016. doi:10.1038/nrg.2016.10.
Cathala G, Savouret JF, Mendez B, West BL, Karin M, Martial JA, and Baxter JD. A method for isolation
of intact, translationally active ribonucleic acid. DNA, 2(4):329–335, 1983. doi:10.1089/dna.1983.2.329.
Ching T, Huang S, and Garmire LX. Power analysis and sample size estimation for RNA-Seq differential
expression. RNA, pp. rna.046011.114–, 2014. doi:10.1261/rna.046011.114.
Costa-Silva J, Domingues D, and Lopes FM. RNA-Seq differential expression analysis: An extended review
and a software tool. PLoS ONE, 12(12), 2017. doi:10.1371/journal.pone.0190152.
Dapas M, Kandpal M, Bi Y, and Davuluri RV. Comparative evaluation of isoform-level gene expression
estimation algorithms for RNA-seq and exon-array platforms. Briefings in Bioinformatics, (Oct 2015),
2016. doi:10.1093/bib/bbw016.
Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot NS, Castel
D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M,
and Jaffrézic F. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA
sequencing data analysis. Briefings in Bioinformatics, 14(6):671–683, 2013. doi:10.1093/bib/bbs046.
Ding L, Rath E, and Bai Y. Comparison of Alternative Splicing Junction Detection Tools Using RNASeq
Data. Current Genomics, 2017. doi:10.2174/1389202918666170215125048.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, and Gin-
geras TR. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013. doi:
10.1093/bioinformatics/bts635.
Dobin A and Gingeras TR. Optimizing RNA-seq mapping with STAR. In Methods in Molecular Biology,
vol. 1415, pp. 245–262. Humana Press, New York, NY, 2016. doi:10.1007/978-1-4939-3572-7_13.
ENCODE. Standards, Guidelines and Best Practices for RNA-Seq, 2011. URL https://genome.ucsc.edu/
ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf.
Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Rätsch G, Goldman N, Hubbard TJ, Harrow J,
Guigó R, and Bertone P. Systematic evaluation of spliced alignment programs for RNA-seq data. Nature
Methods, 10(12):1185–1191, 2013. doi:10.1038/nmeth.2722.
Everaert C, Luypaert M, Maag JL, Cheng QX, Dinger ME, Hellemans J, and Mestdagh P. Benchmarking
of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data. Scientific
Reports, 7(1), 2017. doi:10.1038/s41598-017-01617-3.
Ewels P, Magnusson M, Lundin S, and Käller M. MultiQC: summarize analysis results for multiple tools and
samples in a single report. Bioinformatics, 32(19):3047, 2016. doi:10.1093/bioinformatics/btw354.
Farrell R. RNA methodologies laboratory guide for isolation and characterization. Elsevier/Academic Press,
Amsterdam Boston, 2010.
Feng H, Zhang X, and Zhang C. mRIN for direct assessment of genome-wide and gene-specific mRNA
integrity from large-scale RNA-sequencing data. Nature Communications, 6(May):7816, 2015. doi:10.
1038/ncomms8816.
Genohub. Depth of Coverage (RNA), 2015. URL https://genohub.com/
next-generation-sequencing-guide/#depth2.
Germain PL, Vitriolo A, Adamo A, Laise P, Das V, and Testa G. RNAontheBENCH: computational and
empirical resources for benchmarking RNAseq quantification and differential expression methods. Nucleic
Acids Research, 44(11), 2016. doi:10.1093/nar/gkw448.
Gibbons FD and Roth FP. Judging the quality of gene expression-based clustering methods using gene
annotation. Genome Research, 12(10):1574–1581, 2002. doi:10.1101/gr.397002.
Gierliński M, Cole C, Schofield P, Schurch NJ, Sherstnev A, Singh V, Wrobel N, Gharbi K, Simpson G, Owen-
Hughes T, Blaxter M, and Barton GJ. Statistical models for RNA-seq data derived from a two-condition
48-replicate experiment. Bioinformatics, 31(22):1–15, 2015. doi:10.1093/bioinformatics/btv425.
Goodwin S, McPherson JD, and McCombie WR. Coming of age: ten years of next-generation sequencing
technologies. Nature Reviews Genetics, 17(6):333–351, 2016. doi:10.1038/nrg.2016.49.
Griffith M, Walker JR, Spies NC, Ainscough BJ, and Griffith OL. Informatics for RNA Sequencing: A Web
Resource for Analysis on the Cloud. PLoS Computational Biology, 11(8), 2015. doi:10.1371/journal.pcbi.
1004393.
Gu Z, Eils R, and Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional
genomic data. Bioinformatics, 2016. doi:10.1093/bioinformatics/btw313.
Hansen KD, Brenner SE, and Dudoit S. Biases in Illumina transcriptome sequencing caused by random
hexamer priming. Nucleic Acids Research, 2010. doi:10.1093/nar/gkq224.
Hartley SW and Mullikin JC. QoRTs: a comprehensive toolset for quality control and data processing of
RNA-Seq experiments. BMC Bioinformatics, 16(1):224, 2015. doi:10.1186/s12859-015-0670-5.
Head SR, Kiyomi Komori H, LaMere SA, Whisenant T, Van Nieuwerburgh F, Salomon DR, and Ordoukha-
nian P. Library construction for next-generation sequencing: Overviews and challenges. BioTechniques,
56(2):61–77, 2014. doi:10.2144/000114133.
Honaas LA, Altman NS, and Krzywinski M. Study design for sequencing studies. In Methods in Molecular
Biology, vol. 1418, pp. 39–66, 2016. doi:10.1007/978-1-4939-3578-9_3.
Hooper JE. A survey of software for genome-wide discovery of differential splicing in RNA-Seq data. Human
Genomics, 2014. doi:10.1186/1479-7364-8-3.
Illumina. RNA-Seq Data Comparison with Gene Expression Microarrays, 2011. URL http://www.
europeanpharmaceuticalreview.com/wp-content/uploads/Illumina_whitepaper.pdf.
Jänes J, Hu F, Lewin A, and Turro E. A comparative study of RNA-seq analysis strategies. Briefings in
Bioinformatics, (January):1–9, 2015. doi:10.1093/bib/bbv007.
Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, Gingeras TR, and Oliver B. Synthetic spike-in
standards for RNA-seq experiments. Genome Research, 21(9):1543–1551, 2011. doi:10.1101/gr.121095.111.
Kanehisa M, Furumichi M, Tanabe M, Sato Y, and Morishima K. KEGG: New perspectives on genomes,
pathways, diseases and drugs. Nucleic Acids Research, 2017. doi:10.1093/nar/gkw1092.
Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, and Zavolan M. Comparative assessment of methods
for the computational inference of transcript isoform abundance from RNA-seq data. Genome Biology,
16(1), 2015. doi:10.1186/s13059-015-0702-5.
Karimpour-Fard A, Epperson LE, and Hunter LE. A survey of computational tools for downstream analysis
of proteomic and other omic datasets. Human Genomics, 9(28), 2015. doi:10.1186/s40246-015-0050-2.
Katz Y, Wang ET, Airoldi EM, and Burge CB. Analysis and design of RNA sequencing experiments for
identifying isoform regulation. Nature Methods, 7(12):1009–1015, 2010. doi:10.1038/nmeth.1528.
Katz Y, Wang ET, Silterra J, Schwartz S, Wong B, Thorvaldsdóttir H, Robinson JT, Mesirov JP, Airoldi
EM, and Burge CB. Quantitative visualization of alternative exon expression from RNA-seq data. Bioin-
formatics (Oxford, England), 2015. doi:10.1093/bioinformatics/btv034.
Khatri P, Sirota M, and Butte AJ. Ten years of pathway analysis: Current approaches and outstanding
challenges. PLoS Computational Biology, 2012. doi:10.1371/journal.pcbi.1002375.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, and Salzberg SL. TopHat2: accurate alignment of
transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4):R36,
2013. doi:10.1186/gb-2013-14-4-r36.
Kundaje A. A comprehensive collection of signal artifact blacklist regions in the human genome. Tech. rep.,
2013. URL https://sites.google.com/site/anshulkundaje/projects/blacklists.
Law CW, Chen Y, Shi W, and Smyth GK. voom: precision weights unlock linear model analysis tools for
RNA-seq read counts. Genome Biology, 15:R29, 2014. doi:10.1186/gb-2014-15-2-r29.
Leinonen R, Sugawara H, and Shumway M. The sequence read archive. Nucleic Acids Research, 2011.
doi:10.1093/nar/gkq1019.
Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, Gnirke A, and Regev A. Compre-
hensive comparative analysis of strand-specific RNA sequencing methods. Nature Methods, 7(9):709–715,
2010. doi:10.1038/nmeth.1491.
Li B and Dewey C. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference
genome. BMC Bioinformatics, 12(1):323, 2011. doi:10.1186/1471-2105-12-323.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, and Durbin
R. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–9, 2009. doi:
10.1093/bioinformatics/btp352.
Li S, Labaj PP, Zumbo P, Sykacek P, Shi W, Shi L, Phan J, Wu PY, Wang M, Wang C, Thierry-Mieg D,
Thierry-Mieg J, Kreil DP, and Mason CE. Detecting and correcting systematic variation in large-scale
RNA sequencing data. Nat Biotech, 32(9):888–895, 2014. doi:10.1038/nbt.3000.
Liao Y, Smyth GK, and Shi W. featureCounts: an efficient general purpose program for assigning sequence
reads to genomic features. Bioinformatics, 30(7):923–30, 2014. doi:10.1093/bioinformatics/btt656.
Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, and Tamayo P. The Molecular Signatures
Database Hallmark Gene Set Collection. Cell Systems, 2015.
Liu Y, Zhou J, and White KP. RNA-seq differential expression studies: more sequence or more replication?
Bioinformatics, 30(3):301–4, 2014. doi:10.1093/bioinformatics/btt688.
Love MI, Huber W, and Anders S. Moderated estimation of fold change and dispersion for RNA-seq data
with DESeq2. Genome Biology, 15(12):550, 2014. doi:10.1186/s13059-014-0550-8.
Lowe R, Shirley N, Bleackley M, Dolan S, and Shafee T. Transcriptomics technologies. PLoS Computational
Biology, 2017. doi:10.1371/journal.pcbi.1005457.
Meng C, Zeleznik OA, Thallinger GG, Kuster B, Gholami AM, and Culhane AC. Dimension reduction
techniques for the integrative analysis of multi-omics data. Briefings in Bioinformatics, 17(4):628–641,
2016. doi:10.1093/bib/bbv108.
Mudge JM and Harrow J. The state of play in higher eukaryote gene annotation. Nature Reviews Genetics,
2016. doi:10.1038/nrg.2016.119.
Munro SA, Lund SP, Pine PS, Binder H, Clevert DA, Conesa A, Dopazo J, Fasold M, Hochreiter S, Hong H,
Jafari N, Kreil DP, Labaj PP, Li S, Liao Y, Lin SM, Meehan J, Mason CE, Santoyo-Lopez J, Setterquist
RA, Shi L, Shi W, Smyth GK, Stralis-Pavese N, Su Z, Tong W, Wang C, Wang J, Xu J, Ye Z, Yang Y,
Yu Y, Salit M, Labaj PP, Li S, Liao Y, Lin SM, Meehan J, Mason CE, Santoyo-Lopez J, Setterquist RA,
URL https://github.com/samtools/hts-specs.
Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, and Pachter L. Differential analysis of gene
regulation at transcript resolution with RNA-seq. Nat Biotech, 31(1):46–53, 2013. doi:10.1038/nbt.2450.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, and Pachter
L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.
Nature Protocols, 7(3):562–78, 2012. doi:10.1038/nprot.2012.016.
Van den Berge K, Hembach KM, Soneson C, Tiberi S, Clement L, Love MI, Patro R, and Robinson MD.
RNA Sequencing Data: Hitchhiker’s Guide to Expression Analysis. Annual Review of Biomedical Data
Science, 2019. doi:10.1146/annurev-biodatasci-072018-021255.
Wagner GP, Kin K, and Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM
measure is inconsistent among samples. Theory in Biosciences, 131(4):281–285, 2012. doi:10.1007/
s12064-012-0162-3.
Wang L, Wang S, and Li W. RSeQC: Quality control of RNA-seq experiments. Bioinformatics, 28(16):2184–
2185, 2012. doi:10.1093/bioinformatics/bts356.
Williams CR, Baccarella A, Parrish JZ, and Kim CC. Empirical assessment of analysis workflows for
differential expression analysis of human samples using RNA-Seq. BMC Bioinformatics, 18(1), 2017.
doi:10.1186/s12859-016-1457-z.
Wu DC, Yao J, Ho KS, Lambowitz AM, and Wilke CO. Limitations of alignment-free tools in total RNA-seq
quantification. BMC genomics, 2018. doi:10.1186/s12864-018-4869-5.
Yandell M and Ence D. A beginner’s guide to eukaryotic genome annotation. Nature Reviews Genetics,
2012. doi:10.1038/nrg3174.
Ye H, Meehan J, Tong W, and Hong H. Alignment of short reads: A crucial step for applica-
tion of next-generation sequencing data in precision medicine. Pharmaceutics, 2015. doi:10.3390/
pharmaceutics7040523.
Yeri A, Courtright A, Danielson K, Hutchins E, Alsop E, Carlson E, Hsieh M, Ziegler O, Das A, Shah
RV, Rozowsky J, Das S, and Van Keuren-Jensen K. Evaluation of commercially available small RNASeq
library preparation kits using low input RNA. BMC Genomics, 2018. doi:10.1186/s12864-018-4726-6.
Yona G, Dirks W, and Rahman S. Comparing algorithms for clustering of expression data: how to assess
gene clusters. Methods Mol Biol, 541:479–509, 2009. doi:10.1007/978-1-59745-243-4_21.
Young MD, Wakefield MJ, Smyth GK, and Oshlack A. Gene ontology analysis for RNA-seq: accounting for
selection bias. Genome Biology, 2010. doi:10.1186/gb-2010-11-2-r14.
Zeng W and Mortazavi A. Technical considerations for functional sequencing assays. Nature Immunology,
13(9):802–807, 2012. doi:10.1038/ni.2407.
Zhao S, Fung-Leung WP, Bittner A, Ngo K, and Liu X. Comparison of RNA-Seq and microarray in tran-
scriptome profiling of activated T cells. PLoS ONE, 9(1), 2014. doi:10.1371/journal.pone.0078644.
Zhao S and Zhang B. A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context
of RNA-seq read mapping and gene quantification. BMC Genomics, 2015. doi:10.1186/s12864-015-1308-8.
Zhou X and Rokas A. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies.
Molecular Ecology, 23(7):1679–1700, 2014. doi:10.1111/mec.12680.
Zielezinski A, Vinga S, Almeida J, and Karlowski WM. Alignment-free sequence comparison: Benefits,
applications, and tools. Genome Biology, 2017. doi:10.1186/s13059-017-1319-7.