RNA-seq Data Analysis
Haibo Liu
Treating Bioinformatics as a Data Science
Seven stages to data science
1. Define the question of interest
2. Get and understand the data
3. Clean the data
4. Explore the data
5. Fit statistical models
6. Communicate the results
7. Make your analysis reproducible
Avoid Garbage In, Garbage Out
(https://thedailyomnivore.net/2015/12/02/garbage-in-garbage-out/)
A bioinformatician should NOT JUST be
a data scientist!!!
Bioinformatician’s role at stages of RNA-seq
experiments
Coherent/robust experimental design:
groundwork of a successful experiment, avoiding
unusable data and wasted time, money, and effort
Consult a data analyst or bioinformatician!!!
Inform experimenter/sequencing facility of
experimental design
1. Library prep batch
2. Layout of library on lane/flow cell
Outline
❖ Overview of RNA-seq
❖ Experimental design
❖ Data generation
• Sample preparation
• Library preparation
• Sequencing
❖ Data analysis
• Sample level analysis
• Gene level analysis
• Advanced analysis
Transcriptome
• Transcriptome:
– Broadly speaking, all RNAs transcribed whether from a single cell or a
population of cells
– Narrowly speaking, the total mRNA, with a focus on gene expression
and what is being specifically coded for proteins
https://www.frontiersin.org/articles/10.3389/fgene.2019.01361/full
RNA-seq and its applications
Allele-specific expression …
Calculate read counts, TPM
RNA-seq is a complex process, where
everything is connected!!
--Aniruddha Chatterjee
Chatterjee et al. (2018) A Guide for Designing and Analyzing RNA-Seq Data. In: Raghavachari N., Garcia-Reyero N. (eds) Gene
Expression Analysis. Methods in Molecular Biology, vol 1783. Humana Press, New York, NY
Three critical principles in
experimental design
• Replication
– Basis of statistical inference
– Replication provides an efficient way of
increasing the precision of an experiment.
– $\mathrm{SE} = \sqrt{\sigma^2 / n}$
• Randomization
– It eliminates the systematic bias.
– It is needed to obtain a representative sample
from the population.
– It helps in distributing the unknown variation
due to confounded variables throughout the
experiment and breaks the confounding
influence.
• Blocking
– Homogeneous experimental units within the
blocks
– Reduce within-block variance, increase
efficiency
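The three principles above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the sample names, the two conditions, and the use of library-prep batch as the blocking unit are hypothetical, and the helper names `standard_error` and `randomize_within_blocks` are not from the slides.

```python
import math
import random


def standard_error(sigma, n):
    """Standard error of a group mean: sqrt(sigma^2 / n)."""
    return math.sqrt(sigma ** 2 / n)


def randomize_within_blocks(samples_by_condition, n_batches, seed=0):
    """Spread each condition evenly over batches (blocks), in random order.

    Assumption: a batch is one library-prep run, and each condition has a
    number of samples divisible by n_batches.
    """
    rng = random.Random(seed)
    batches = [[] for _ in range(n_batches)]
    for samples in samples_by_condition.values():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # randomization: which sample lands in which batch
        for i, sample in enumerate(shuffled):
            batches[i % n_batches].append(sample)  # blocking: balance conditions
    for batch in batches:
        rng.shuffle(batch)  # randomize processing order inside each batch
    return batches


# Replication: quadrupling the replicates halves the standard error.
se_3 = standard_error(1.0, 3)
se_12 = standard_error(1.0, 12)

# Hypothetical design with two conditions and two library-prep batches.
design = {
    "control": ["ctrl_1", "ctrl_2", "ctrl_3", "ctrl_4"],
    "treated": ["trt_1", "trt_2", "trt_3", "trt_4"],
}
batches = randomize_within_blocks(design, n_batches=2)
print(se_3, se_12, batches)
```

Because every batch holds the same number of samples from each condition, batch is not confounded with the factor of interest and can be included as a blocking term in the model.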
Vocabulary: Biological and technical
replicates
• Biological replicates
– Measurements taken from biologically distinct samples, for example:
• different individual organisms
• different samplings of the same tumor
• different populations of cells grown separately from each other but
originating from the same cell line.
– A biological replicate captures both technical and biological variability, because each
replicate also passes independently through all the technical steps.
See Krzywinski, M., Altman, N. Power and sample size. Nat Methods 10, 1139–1140 (2013) doi:10.1038/nmeth.2738
Vocabulary: Confounding factors
• A confounding factor is a
nuisance variable that is
associated with the factor
of interest.
• Possible confounding
factors should be controlled
for so they don't interfere
with analysis.
Biological and technical variance
Observed gene expression variance = Biological variance + Technical variance (very small)
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Strategies to minimize variation between
samples and to control confounding variables
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73) (Monaco et al. 2019, Cell Reports 26, 1627-1640)
Sequencing options to consider
• Quality and quantity of total RNA
– 100–250 ng total RNA per sample
– RIN > 7.0 (adjust for RIN or mRIN in models)
(https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-12-42;
https://www.nature.com/articles/ncomms8816)
• RNA enrichment method
– Poly-A enrichment (minimum of 100 ng)
• recommended for most standard RNA-seq experiments
• provides no information about microRNAs and other non-coding RNA species
– rRNA depletion (minimum of 200 ng)
• noisier
• recommended for poor or variable quality of RNA
• Read type and read length
– Single end: recommended for basic DEG analysis
– Paired end: recommended for transcriptome assembly
– 50-100 bp SE for counting; 100-150 bp PE for transcript reconstruction
– Applications that require more, longer, and possibly paired-end reads:
• quantification of lowly expressed genes
• identification of genes with small changes between conditions
• investigation of alternative splicing/isoform quantification
• identification of novel transcripts, chimeric transcripts
• de novo transcriptome assembly
• Strandedness
– Non-stranded: recommended for basic RNA-seq experiments
– Stranded: novel transcript discovery, more accurate quantitation
• Multiplexing: minimize lane effect
• Spike-in controls:
– Have been used for normalization and quality control
– Recent work has shown that the amount of technical variability in their use dramatically reduces their utility.
(not recommended, Risso et al. Nature Biotechnology, 2014, 1-10)
(https://www.melbournebioinformatics.org.au/tutorials/tutorials/rna_seq_exp_design/rna_seq_experimental_design/)
RNA quantity and quality (RIN)
determine the choice of library preparation protocol
Qubit fluorometer
Schroeder et al. The RIN: an RNA integrity number for assigning integrity values to RNA measurements.
BMC Molecular Biol 7, 3 (2006) doi:10.1186/1471-2199-7-3
Choices of mRNA enrichment methods
Quality of initial RNA
Stranded vs. unstranded RNA-seq
Trends of read length, depth, and sample
size of RNA-seq experiments
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
General biases
Best practices on RNA-seq data analysis
https://link.springer.com/protocol/10.1007%2F978-1-4939-7834-2_11
[Figure: analysis workflow. Recoverable labels: Pre-analysis (FastQC, MultiQC, QoRTs, R); CPM;
Transcriptome assembly (RNA-Bloom, Trinity, StringTie, TransComb); Bowtie; RSEM]
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
C. Alignment-free or pseudo-alignment against the transcriptome: Salmon, Sailfish, Kallisto
EM algorithms
Sailfish: k-mer-based approach
Kallisto: pseudo-alignment
Salmon: quasi-alignment, modeling GC bias
(Cyverse)
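The EM idea named above can be sketched on a toy problem: reads that are compatible with several transcripts are split among them in proportion to the current abundance estimates, and the estimates are then updated from those fractional assignments. This is a simplified illustration, not the actual algorithm of Salmon, Sailfish, or Kallisto. It assumes equal effective transcript lengths, and the function name `em_abundance` and the read compatibility sets are hypothetical.

```python
def em_abundance(read_compat, n_transcripts, n_iter=200):
    """Toy EM for transcript abundance from read compatibility sets.

    read_compat: one tuple per read listing the transcript indices the
    read is compatible with. Assumes equal effective lengths.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # start from uniform abundances
    n_reads = len(read_compat)
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: new abundance = expected share of reads.
        theta = [count / n_reads for count in expected]
    return theta


# 6 reads unique to transcript 0, 2 unique to transcript 1, 4 ambiguous.
reads = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
theta = em_abundance(reads, n_transcripts=2)
print(theta)
```

In this example the ambiguous reads end up divided in the same 3:1 proportion as the uniquely assigned reads, which is the maximum-likelihood solution for this toy model.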
Case III: reference genome well enough annotated
[Figure: quantification workflow. Recoverable labels: HISAT2, Gsnap; GTF/GFF, featureCounts;
pseudo-mapping to transcriptome with a pseudo-aligner (Sailfish, Salmon, Kallisto)]
• https://cgatoxford.wordpress.com/2016/08/17/why-you-should-stop-using-featurecounts-htseq-or-cufflinks2-and-start-using-kallisto-salmon-or-sailfish/
• http://crazyhottommy.blogspot.com/2016/07/comparing-salmon-kalliso-and-star-htseq.html
• https://www.biorxiv.org/content/biorxiv/early/2018/10/16/444620.full.pdf
https://f1000research.com/articles/4-1521/v2
Preprocessing of quantitation data
The number of reads aligned to a given gene reflects the sequencing depth
and that gene’s share of the population of mRNA molecules.
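Because raw counts depend on sequencing depth, they are usually rescaled before samples are compared. A minimal sketch of two common rescalings, CPM and TPM, is given here. The counts and gene lengths are made-up numbers, and the function names `cpm` and `tpm` are illustrative rather than taken from any package.

```python
def cpm(counts):
    """Counts per million for one sample: divide by library size, scale to 1e6."""
    library_size = sum(counts)
    return [c / library_size * 1e6 for c in counts]


def tpm(counts, lengths_bp):
    """Transcripts per million: correct for gene length first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Hypothetical sample with three genes.
counts = [100, 300, 600]
lengths = [1000, 3000, 2000]  # gene lengths in bp (assumed values)

cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths)
print(cpm_values)
print(tpm_values)
```

Genes 1 and 2 differ threefold in CPM but have the same TPM, because the longer gene collects proportionally more reads per molecule.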
Preprocessing: Filtering
• Determining Intra- and Intergroup Sample Variability and Outliers
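• Filtering out noise by removing extremely lowly expressed genes:
keep a gene if rowSums(CPM ≥ 1) ≥ min(n_i), i.e., its CPM is at least 1 in at least as many
samples as the smallest group (n_i = number of samples in group i).
• Filter on expression level (absolute count or CPM), not on variation across samples.
[Figure axis label: log2(count + 1)]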
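The CPM-based filtering rule on this slide can be sketched as follows. The count matrix and group labels are invented for illustration, and `filter_low_expression` is a hypothetical helper, not a function from edgeR or DESeq2.

```python
def cpm_matrix(counts):
    """CPM for a genes x samples matrix (list of rows)."""
    library_sizes = [sum(column) for column in zip(*counts)]
    return [[c / size * 1e6 for c, size in zip(row, library_sizes)] for row in counts]


def filter_low_expression(counts, groups, min_cpm=1.0):
    """Return indices of genes with CPM >= min_cpm in at least min(n_i) samples."""
    group_sizes = {}
    for g in groups:
        group_sizes[g] = group_sizes.get(g, 0) + 1
    min_group_size = min(group_sizes.values())
    keep = []
    for i, row in enumerate(cpm_matrix(counts)):
        n_expressed = sum(value >= min_cpm for value in row)
        if n_expressed >= min_group_size:
            keep.append(i)
    return keep


# Hypothetical counts: 4 genes x 4 samples, two groups of two samples each.
groups = ["ctrl", "ctrl", "trt", "trt"]
counts = [
    [500000, 500000, 500000, 500000],  # highly expressed everywhere
    [499999, 499999, 499990, 499990],  # highly expressed everywhere
    [0, 0, 10, 10],                    # expressed only in the treated group
    [0, 0, 0, 3],                      # detected in a single sample
]
kept = filter_low_expression(counts, groups)
print(kept)
```

The third gene is kept even though it is absent from one group, which is the point of thresholding on the smallest group size: genes switched on in only one condition are not filtered away.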
Preprocessing: Normalization
• Normalization methods:
– Case 1: Assumption that most genes are not DE and balanced expression is valid
• TMM (trimmed mean of M-values)
• DESeq (median of ratios)
– Case 2: RNA composition dramatically different between conditions
• Smooth quantile normalization (R package: YARN): tissue-aware normalization
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5862355/,
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1847-x)
• Spike-in based normalization (use only if no other choice)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6171491/
https://www.frontiersin.org/articles/10.3389/fbioe.2019.00358/full
Normalization by controls
Assumptions:
• Existence of controls (HK genes, spike-ins)
• Controls behave like non-control genes
Methods: RUV-seq, smooth quantile normalization
[Figure panel (e): Global shift]
Normalization by distribution/testing
Assumptions:
• Technical effects are
the same for DE and non-DE genes
• Balanced expression: roughly
symmetric differential expression
across conditions
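Under these assumptions, a per-sample scaling factor can be estimated from the bulk of non-DE genes. The sketch below implements the median-of-ratios idea associated with DESeq in plain Python. It is a simplified illustration with made-up counts, not the package implementation, and `size_factors` is a hypothetical name.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix.

    Each sample is compared with a pseudo-reference (the per-gene geometric
    mean across samples); the median ratio is that sample's size factor.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


# Hypothetical data: sample 2 was sequenced twice as deeply as sample 1,
# and only the last gene is truly differentially expressed.
counts = [
    [10, 20],
    [20, 40],
    [30, 60],
    [40, 80],
    [5, 100],
]
sf = size_factors(counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
print(sf)
```

Taking the median makes the estimate insensitive to the single DE gene, so the twofold depth difference is recovered; this robustness breaks down when most genes change or changes are strongly one-sided, which is the Case 2 situation.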
RNA-seq data analysis at FOUR levels
http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
Approaches for the three types of
differential expression analysis (DGE, DTE
and DTU)
DESeq2, edgeR, limma-voom
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Statistical modeling and estimation
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Statistical inference
• Test hypotheses:
– $H_0$: there is no DE between conditions, i.e., the log fold change (LFC) is zero.
– $H_1$: the LFC differs from zero.
• Test methods:
– Likelihood ratio tests (LRTs): DESeq2 and edgeR. LRTs compare
the likelihood of a full model, in which all parameters are estimated
without constraints, with the likelihood of a reduced model,
in which one or more parameters are constrained according
to $H_0$. LRT statistics are asymptotically $\chi^2$-distributed under $H_0$.
– Wald test: DESeq2 only. $W = \mathrm{LFC} / \mathrm{se}(\mathrm{LFC})$ asymptotically
follows a standard normal distribution under $H_0$.
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
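As a rough illustration of how the two test statistics above turn into p-values, the sketch below uses only the asymptotic null distributions (standard normal for the Wald statistic, chi-square with one degree of freedom for an LRT that constrains a single parameter). The input values are invented, and the functions are illustrative helpers, not DESeq2 or edgeR calls.

```python
import math


def wald_p_value(lfc, se):
    """Two-sided p-value for W = LFC / se(LFC) under a standard normal null."""
    w = lfc / se
    return w, math.erfc(abs(w) / math.sqrt(2.0))


def lrt_p_value(loglik_full, loglik_reduced):
    """P-value for an LRT with one constrained parameter (chi-square, 1 df)."""
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # For 1 df, P(chi2 > x) equals P(|Z| > sqrt(x)) for standard normal Z.
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical gene: estimated LFC of 1.96 with standard error 1.0.
w, p_wald = wald_p_value(1.96, 1.0)

# Hypothetical log-likelihoods chosen so that the LRT statistic equals W^2.
lrt_stat, p_lrt = lrt_p_value(-100.0, -100.0 - w ** 2 / 2.0)
print(w, p_wald, lrt_stat, p_lrt)
```

When the LRT statistic equals the squared Wald statistic, the two p-values coincide, which reflects the asymptotic equivalence of the tests; with real data the two statistics generally differ somewhat.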
Adjusting p-values for multiple testing
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
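The slide does not name a specific adjustment method. As one commonly used option, the Benjamini-Hochberg false discovery rate procedure can be sketched as below; choosing this method is an assumption of the example, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

Genes are then called significant when the adjusted p-value falls below the chosen FDR threshold, so the threshold controls the expected fraction of false positives among the reported genes rather than the per-gene error rate.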
Variants of differential expression analysis
• Differential gene expression (DGE)
• Differential transcript expression (DTE)
• Differential transcript usage (DTU)/differential splicing (DS)
• Differential exon usage
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
For details of statistical models, see Li and Li. Quantitative Biology 2018,
6(3): 195–209. https://doi.org/10.1007/s40484-018-0144-7
http://genomespot.blogspot.com/2017/09/upset-plots-as-replacement-to-venn.html
Heatmap
Further reading:
• Khatri et al. PLoS Computational Biology, 2012, 8(2): e1002375
• García-Campos et al. Frontiers in Physiology, 2015, 6: 383
• Reimand et al. Nat Protoc. 2019, 14(2): 482–517
• minepath.org: Koumakis et al. Nucleic Acids Res. 2017, 45(Web Server issue): W116–W121
Different representations of pathways
Advanced Analysis
IGV
UCSC Genome Browser
• PacBio Iso-Seq
• Oxford Nanopore Technologies:
direct RNA sequencing