RNA seq Data Analysis

The document outlines the essential stages of RNA-seq data analysis, emphasizing the importance of experimental design, data generation, and analysis techniques. It highlights critical principles such as replication, randomization, and blocking to minimize variability and confounding factors. The document also discusses best practices for quality control and alignment in RNA-seq experiments, along with the significance of choosing appropriate sequencing methods and tools for accurate data interpretation.

RNA-seq data analysis

Haibo Liu
Treating Bioinformatics as a Data Science
Seven stages to data science
1. Define the question of interest
2. Get and understand the data
3. Clean the data
4. Explore the data
5. Fit statistical models
6. Communicate the results
7. Make your analysis reproducible
Avoid Garbage In, Garbage Out

(https://thedailyomnivore.net/2015/12/02/garbage-in-garbage-out/)
Bioinformatician should NOT JUST be
a data scientist!!!
Bioinformatician’s role at stages of RNA-seq
experiments
Coherent/robust experimental design: groundwork of a successful experiment, avoiding unusable data and wasted time, money, and effort

Consult a data analyst or bioinformatician!!!

Inform experimenter/
sequencing facility of
experimental design
1. Library prep batch
2. Layout of library on
lane/flow cell

Analyst aware of experimental design:
1. Quality control
2. Statistical model selection

Outline

❖ Overview of RNA-seq
❖ Experimental design
❖ Data generation
• Sample preparation
• Library preparation
• Sequencing
❖ Data analysis
• Sample level analysis
• Gene level analysis
• Advanced analysis
Transcriptome

• Transcriptome:
– Broadly speaking, all RNAs transcribed whether from a single cell or a
population of cells
– Narrowly speaking, the total mRNA, with a focus on gene expression
and what is being specifically coded for proteins

Palazzo AF, Lee ES. Front Genet 6:2 (2015)


RNA content of a typical eukaryotic cell

Palazzo AF, Lee ES. Front Genet 6:2 (2015)


What is RNA-seq?

• RNA-seq (RNA sequencing) is a technique that uses next-generation sequencing (NGS) to examine the quantity and identity of RNA in a sample.
• It analyzes the gene expression patterns encoded within a transcriptome.

https://www.frontiersin.org/articles/10.3389/fgene.2019.01361/full
RNA-seq and its applications

[Figure: RNA-seq workflow and applications; recoverable labels: "Allele-specific expression …", "Calculate read counts, TPM"]

RNA-seq is a complex process, where
everything is connected!!

The Everything's connected slide by Dündar et al. (2015)


Experimental design is critical

“…, the planning and design of RNA-Seq experiments has important implications for addressing the desired biological question and maximizing the value of the data obtained.”

--Aniruddha Chatterjee

Chatterjee et al. (2018) A Guide for Designing and Analyzing RNA-Seq Data. In: Raghavachari N., Garcia-Reyero N. (eds) Gene Expression Analysis. Methods in Molecular Biology, vol 1783. Humana Press, New York, NY
Three critical principles in
experimental design
• Replication
– Basis of statistical inference
– Replication provides an efficient way of
increasing the precision of an experiment.
– $\mathrm{SE} = \sqrt{\sigma^2 / n}$, where $\sigma^2$ is the variance and $n$ is the number of replicates
• Randomization
– It eliminates the systematic bias.
– It is needed to obtain a representative sample
from the population.
– It helps in distributing the unknown variation
due to confounded variables throughout the
experiment and breaks the confounding
influence.

• Blocking
– Homogeneous experimental units within the
blocks
– Reduce within-block variance, increase
efficiency
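
The replication principle can be illustrated numerically. The following is a minimal sketch, assuming the standard error of a group mean is sqrt(σ²/n); the function name and the value σ = 1.0 are hypothetical and chosen only for illustration.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean for n replicates with standard deviation sigma."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.sqrt(sigma ** 2 / n)

# Hypothetical standard deviation of 1.0 (e.g., on a log2 expression scale).
for n in (3, 6, 12):
    print(n, round(standard_error(1.0, n), 3))
```

Under this assumption, quadrupling the number of replicates halves the standard error, which is why added replicates have diminishing but real returns in precision.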
Vocabulary-Biological and technical
replicates
• Biological replicates
– Samples obtained from biologically distinct sources, for example:
• different individual organisms
• different samplings of the same tumor
• different population of cells grown separately from each other but
originating from the same cell-line.
– A biological replicate combines both technical and biological variability as it is also an
independent case of all the technical steps.

Also see Blainey et al. Nat Methods 11, 879–880 (2014) doi:10.1038/nmeth.3091
Vocabulary-Statistical power

• The ability to identify differentially expressed genes when there really is a difference.

• This is partly dependent on variance and therefore is affected by the number of replicates available and sequencing depth.

See Krzywinski, M., Altman, N. Power and sample size. Nat Methods 10, 1139–1140 (2013) doi:10.1038/nmeth.2738
Vocabulary-Confounding factors

• A confounding factor is a
nuisance variable that is
associated with the factor
of interest.

• Possible confounding
factors should be controlled
for so they don't interfere
with analysis.
Biological and technical variance

Observed gene expr. variance = Biological variance + Technical variance

• Biological replicates measure combined biological and technical variability
– Biological variability is the main source of variability
• Natural variation (genetic and stochastic) in the population and within cells.
– The amount of variance between your biological replicates will affect the outcome of your analysis. Ideally, you aim to have minimal variability between samples so you only measure the effect of the condition of interest.
• Technical variation:
– Mainly from RNA processing + library prep
– Flow-cell or lane effects are usually small
– Generally, creating technical replicates is unnecessary.
Sources and size of variability in RNA-seq

[Figure: relative size of sources of variability; column labels: biological variability, technical variability; magnitude labels: biggest, big, big, very small]

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Strategies to minimize variation between
samples and to control confounding variables

• Choosing organisms of similar genetic background (littermates)
• Choosing organisms of the same sex if possible
• Using a constant sample collection time and constant sampling sites on tissues
• Having the same laboratory technician perform RNA prep and library prep, using the same lots of reagents, and limiting the range of processing times
• If variation between samples cannot be removed, using a balanced blocking design and modeling the block effect during analysis
• Randomizing samples to prevent a confounding batch effect if all samples can't be processed at one time

Krzywinski and Altman. Nat Methods 11, 699–700 (2014) doi:10.1038/nmeth.3005
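
The blocking and randomization strategies listed above can be sketched in code. This is a hypothetical illustration, not a prescribed protocol: the function `assign_batches`, the sample names, and the fixed seed are all assumptions. Samples are shuffled within each condition and then dealt round-robin across library-prep batches, so that no batch is confounded with a condition.

```python
import random
from collections import defaultdict

def assign_batches(samples, n_batches, seed=0):
    """Assign samples to batches so that each condition is spread evenly.

    samples: list of (sample_id, condition) tuples.
    Returns a dict mapping sample_id -> batch index (0-based).
    """
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    assignment = {}
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)  # randomize order within each condition
        for i, sample_id in enumerate(ids):
            assignment[sample_id] = i % n_batches  # deal out round-robin
    return assignment

# Hypothetical experiment: 4 control and 4 treated samples, 2 prep batches.
samples = [(f"ctrl_{i}", "control") for i in range(4)] + \
          [(f"trt_{i}", "treated") for i in range(4)]
batches = assign_batches(samples, n_batches=2)
print(batches)
```

Each batch then contains the same number of samples from each condition, so a batch term can be included in the statistical model without being aliased with the condition effect.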
Number of Replicates and Sequencing reads

• As a general rule, the number of biological replicates should never be below 3 (ideally, ≥ 6 replicates).
• For a basic RNA-seq differential expression experiment, 10M to 20M reads per sample is usually enough. Do consider the transcriptome complexity.
• More replicates are often better investments than deeper sequencing
• Biological variability is usually the largest effect limiting the power of RNA-seq analysis
• Scotty – A web-based tool for Power Analysis for RNA Seq Experiments: http://scotty.genetics.utah.edu/
Sequencing depth vs. replicates
• Once fragments > 10 M, further increasing sequencing depth only marginally increases the number of lowly expressed genes detected, and statistical power to detect DE does not improve considerably.
• In most cases, increasing replicates is more beneficial than increasing sequencing depth. CAUTION: weigh library prep cost vs. sequencing cost!!

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73) (Monaco et al. 2019, Cell Reports 26, 1627-1640)
Sequencing options to consider
• Quality and quantity of total RNA
– 100 ~250 ng total RNA per sample
– RIN > 7.0 (adjust for RIN or mRIN in models)
(https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-12-42;
https://www.nature.com/articles/ncomms8816)
• RNA enrichment method
– Poly-A enrichment (minimum of 100 ng)
• recommended for most standard RNA-seq experiments
• provides no information about microRNAs and other non-coding RNA species
– rRNA depletion (minimum of 200 ng)
• noisier
• recommended for poor or variable quality of RNA
• Read type and read length
– Single end: recommended for basic DEG analysis
– Paired end: recommended for transcriptome assembly
– 50-100 bp SE for counting; 100-150 bp PE for transcript reconstruction
– Applications that require more, longer, and possibly paired-end reads:
• quantification of lowly expressed genes
• identification of genes with small changes between conditions
• investigation of alternative splicing/isoform quantification
• identification of novel transcripts, chimeric transcripts
• de novo transcriptome assembly

• Strandedness
– Non-stranded: recommended for basic RNA-seq experiments
– Stranded: novel transcript discovery, more accurate quantitation
• Multiplexing: minimize lane effect
• Spike-in controls:
– Have been used for normalization and quality control
– Recent work has shown that the technical variability in their use dramatically reduces their utility (not recommended, Risso et al. Nature Biotechnology, 2014, 1-10)
(https://www.melbournebioinformatics.org.au/tutorials/tutorials/rna_seq_exp_design/rna_seq_experimental_design/)
RNA quantity and quality (RIN)
determine choice of library preparation protocols

Qubit fluorometer

Schroeder et al. The RIN: an RNA integrity number for assigning integrity values to RNA measurements.
BMC Molecular Biol 7, 3 (2006) doi:10.1186/1471-2199-7-3
Choices of mRNA enrichment methods

Quality of
initial RNA
Stranded vs. unstranded RNA-seq
Trends of read length, depth, and sample
size of RNA-seq experiments

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
General biases

• Issues with using reference genome
– CNV
– Mappability
• Mapping ambiguity
– Multi-mapping reads
Best practices on RNA-seq data analysis

https://link.springer.com/protocol/10.1007%2F978-1-4939-7834-2_11
Pre-analysis

FastQC
MultiQC QoRTs R

Conesa et al. Genome Biology (2016) 17:13


Quality control of RNA-seq data
• Tools for RNA-seq data QC:
– Pre-alignment QC: FASTQ files as input
• FastQC and MultiQC
– Post-alignment QC: BAM files as input
• QoRTs (https://hartleys.github.io/QoRTs/): best
• RSeQC (http://rseqc.sourceforge.net/)
• RNA-SeQC (https://software.broadinstitute.org/cancer/cga/rna-seqc)
• dupRadar (https://www.ncbi.nlm.nih.gov/pubmed/27769170): duplication check specific to RNA-seq data
– Post-quantitation QC: expression matrix as input
• Sample distance, clustering and PCA analysis using R
(Sayols et al. BMC Bioinformatics. 2016; 17: 428.)
QoRTs analysis pipeline

Hartley and Mullikin. BMC Bioinformatics 16, 224 (2015) doi:10.1186/s12859-015-0670-5


Sequencing data quality improvement
(optional)
• Trimming low-quality bases and adapters: Trimmomatic/fastp
– Using relaxed trimming thresholds
– Improve mappability
– Reduce mapping artifacts
• Error correction: Rcorrector
– Improve mappability
– Improve de novo transcriptome assembly

For more tools, see https://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools#Error_correction.


Core Analysis

CPM

Conesa et al. Genome Biology (2016) 17:13


Alignment and Quantification
Case I: No reference genome
No reference transcriptome
• Start with de novo transcriptome assembly

RNA-Bloom/Trinity

Conesa et al. Genome Biology (2016) 17:13


Case II: reference genome
but not well annotated
• de novo transcriptome assembly
• genome guided transcriptome assembly
• Hybrid transcriptome assembly

[Figure: transcriptome assembly workflow; tools labeled: HISAT2, Gsnap, RNA-Bloom, StringTie, TransComb, Bowtie, RSEM]

Martin and Wang. Nature Reviews Genetics 2011(12): 671–682
Conesa et al. Genome Biology (2016) 17:13
Case III: reference genome
well enough annotated
A. Using splice-aware aligner: STAR, HISAT2, Gsnap
B. Using ungapped aligner: Bowtie
C. Alignment-free or pseudo-alignment against transcriptome: Salmon, Sailfish, Kallisto
– EM algorithms
– Sailfish: k-mer-based approaches
– Kallisto: pseudo-alignment
– Salmon: quasi-alignment, modeling GC-bias

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73) (Cyverse)

[Figure: Case III workflow; recoverable labels: HISAT2, Gsnap; pseudo-mapping to transcriptome with a pseudo-aligner (Sailfish, Salmon, Kallisto); GTF/GFF; featureCounts]

Conesa et al. Genome Biology (2016) 17:13


Expression quantification at different levels
• Gene level (most common): featureCounts (best), htseq-count (very slow)
– direct fragment overlap counting of gene features
• no principled way of handling multimapping reads (or using MMR first)
• potentially important compositional changes not reflected directly in gene level read counts (e.g., isoform switching)
– transcript-level quantification followed by aggregation to the gene level (see below).
• Transcript level:
– Full alignment-based: Mix² (best, not free), RSEM, Cufflinks, eXpress, PennSeq, MMSeq (https://www.ncbi.nlm.nih.gov/pubmed/28505151). All based on EM algorithms
– Quasi-mapping: Salmon, and the like
– Advantages
• clear interpretation, since transcripts are the basic unit of gene transcription
• improved biological resolution and decoding of potentially important biological changes, such as isoform switching
• most appropriate level to model and correct for technical biases
• provides a proper model for handling reads that multimap
– Disadvantages
• CAUTION: Most of these approaches (lightweight and otherwise) assume that the annotation of transcripts to be quantified is complete.
• Many more multimapping reads to handle
• necessitating the adoption of a model, which may fail to adequately capture reality
• read ambiguity translates to additional uncertainty in the estimated transcript abundances.
• Exon level: JunctionSeq (https://www.ncbi.nlm.nih.gov/pubmed/27257077) or EQP (https://www.ncbi.nlm.nih.gov/pubmed/27302131)
Salmon versus HTseq counting

• https://cgatoxford.wordpress.com/2016/08/17/why-you-should-stop-using-featurecounts-htseq-or-cufflinks2-and-start-using-kallisto-salmon-or-sailfish/
• http://crazyhottommy.blogspot.com/2016/07/comparing-salmon-kalliso-and-star-htseq.html
• https://www.biorxiv.org/content/biorxiv/early/2018/10/16/444620.full.pdf
• Soneson et al. F1000Research 2016, 4:1521 (last updated: 12 NOV 2019), https://doi.org/10.12688/f1000research.7563.2
Recommendations on quantitation

• Gene abundance estimates are more accurate than transcript abundance estimates
• DTE is more powerful and easier to interpret at the gene level than for individual transcripts
• Incorporating transcript-level estimates leads to more accurate DGE results

https://f1000research.com/articles/4-1521/v2
Preprocessing of quantitation data

The number of reads aligned to a given gene reflects the sequencing depth
and that gene’s share of the population of mRNA molecules.
Preprocessing: Filtering
• Determining intra- and intergroup sample variability and outliers
• Filtering out noise by removing extremely lowly expressed genes: keep a gene if rowSum(cpm ≥ 1) ≥ min(𝑛𝑖), i.e., if its CPM is at least 1 in at least as many samples as the smallest group size 𝑛𝑖. Filter on expression level (absolute count or CPM), not on variation across samples.
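
The filtering rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names, the toy count matrix, and the reading of min(n_i) as the size of the smallest experimental group are mine, not taken from a specific package.

```python
def cpm(counts):
    """Convert a genes x samples count matrix (list of rows) to counts per million."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] * 1e6 / lib_sizes[j] for j in range(n_samples)] for row in counts]

def keep_expressed(counts, min_group_size, cpm_cutoff=1.0):
    """Flag genes whose CPM reaches the cutoff in at least min_group_size samples."""
    return [sum(v >= cpm_cutoff for v in row) >= min_group_size
            for row in cpm(counts)]

# Toy matrix: 3 genes x 4 samples, each library totals exactly 1,000,000 reads.
counts = [
    [500000, 500000, 500000, 500000],
    [499999, 499999, 500000, 500000],
    [1, 1, 0, 0],
]
print(keep_expressed(counts, min_group_size=2))
print(keep_expressed(counts, min_group_size=3))
```

The third gene reaches 1 CPM in only two samples, so it is kept when the smallest group has two samples and dropped when it has three.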

log2(count +1)
Preprocessing: Normalization
• Normalization methods:
– Case 1: Assumption that most genes are not DE and balanced expression is valid
• TMM (Trimmed weighted mean)
• DESeq (median of log expression ratio)
– Case 2: RNA composition dramatically different between conditions
• Smooth quantile normalization ( R package: YARN): Tissue-aware normalization
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5862355/,
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1847-
x)
• Spike-in based normalization (use only if no other choice)

https://www.ncbi.nlm.nih.gov/pmc/articl https://www.frontiersin.org/articles/10.3389
es/PMC6171491/ /fbioe.2019.00358/full
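
To make the Case 1 idea concrete, here is a minimal sketch of DESeq-style median-of-ratios size factors in plain Python. It is an illustration under assumptions (toy data, hypothetical function name, genes with any zero count excluded), not the DESeq2 implementation itself.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    n_samples = len(counts[0])
    # Use only genes with non-zero counts in every sample (log is undefined at 0).
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log geometric mean of each usable gene across samples (the pseudo-reference).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data: sample 2 was sequenced exactly twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [50, 100], [0, 3]]
sf = size_factors(counts)
print(sf)
```

Dividing each sample's counts by its size factor puts the samples on a common scale; because the median is used, a minority of strongly DE genes has little influence on the factor.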
Normalization by library size
• TC, FPKM/RPKM, TPM
• Assumption: same total expression
• Motivation: Normalized counts reflect the proportion of total mRNA/cell taken up by each gene.

Normalization by distribution/testing
• By distribution: UQ, Median, DESeq, TMM, CuffDiff2, MRN
• By testing: PoissonSeq, DEGES
• Assumptions:
– Technical effects are the same for DE and non-DE genes
– Balanced expression: roughly symmetric differential expression across conditions
• Motivation: Non-DE genes should have, on average, the same normalized counts across conditions. The same normalization factor for the non-DE genes can be applied to normalize all genes.

Normalization by controls
• Methods: RUV-seq
• Assumptions:
– Existence of controls (HK genes, spike-ins)
– Controls behave like non-control genes

[Figure panel (e): global shift; method label: smooth quantile normalization]
DESeq and TMM normalization methods are
the best for Case I

Dillies et al. Briefings in Bioinformatics, 2012 (14)6: 671-683


Read count normalization within samples

● DO NOT use RPKM (Reads Per Kilobase per Million mapped reads) or FPKM (Fragments Per Kilobase per Million mapped fragments) to express normalized counts in ChIP-seq (or RNA-seq) ANY MORE!! (Wagner et al. Theory Biosci. (2012) 131: 281. https://doi.org/10.1007/s12064-012-0162-3)

● CPM (Counts Per Million) and TPM (Transcripts Per Million) are less biased ways of normalizing read counts.

● Watch the video for explanation: https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
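
The difference between RPKM and TPM comes down to the order of the two scaling steps. The sketch below uses made-up counts and gene lengths (all values and function names are hypothetical) to show that TPM values sum to one million in every sample, while RPKM totals differ from sample to sample.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to one million."""
    rates = [c / (l / 1000) for c, l in zip(counts, lengths_bp)]  # reads per kilobase
    total = sum(rates)
    return [r * 1e6 / total for r in rates]

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: depth-normalize first, then divide by length."""
    total = sum(counts)
    return [c * 1e9 / (total * l) for c, l in zip(counts, lengths_bp)]

lengths = [1000, 2000, 500]       # hypothetical gene lengths in bp
sample_a = [100, 200, 700]        # hypothetical read counts
sample_b = [700, 200, 100]

print(sum(tpm(sample_a, lengths)), sum(tpm(sample_b, lengths)))
print(sum(rpkm(sample_a, lengths)), sum(rpkm(sample_b, lengths)))
```

Because every TPM vector has the same total, a TPM value can be read as a share of the sample's transcript pool, which is not true of RPKM/FPKM.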
RNA-seq data analysis at FOUR levels

Li and Li. Quantitative Biology 2018, 6(3): 195–209. https://doi.org/10.1007/s40484-018-0144-7


Sample level analysis
• Data transformation
– regularized log (rlog, DESeq2 package)
– variance stabilizing transformation (VST, DESeq2 package)
– logCPM (limma package)
• Transcriptome similarity
– CAUTION: Correlation alone is not an effective measure (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0940-1)
– Transcriptome overlap measure (TROM): robust (Li et al. Stat Biosci. 2017 Jun; 9(1): 105–136.)
– Hierarchical clustering based on sample Euclidean distances (DESeq2)
– Dimension reduction:
• PCA, MDS, tSNE, UMAP
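
A sample-to-sample distance matrix is the input to both hierarchical clustering and the distance heatmap. The sketch below is a simplified stand-in: it uses log2(count + 1) rather than rlog or VST, and the toy matrix and function names are assumptions made for illustration.

```python
import math

def log_transform(counts, pseudocount=1.0):
    """Simple log2 transform; a stand-in for rlog/VST in this illustration."""
    return [[math.log2(c + pseudocount) for c in row] for row in counts]

def sample_distances(matrix):
    """Euclidean distances between samples (columns) of a genes x samples matrix."""
    n = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n)]
    return [[math.dist(cols[i], cols[j]) for j in range(n)] for i in range(n)]

# Toy matrix: samples 0-1 belong to one condition, samples 2-3 to another.
counts = [
    [10, 12, 100, 110],
    [200, 190, 20, 25],
    [50, 55, 52, 48],
]
d = sample_distances(log_transform(counts))
for row in d:
    print([round(v, 2) for v in row])
```

In a well-behaved experiment, distances within a condition are clearly smaller than distances between conditions; replicates that cluster by processing date instead suggest a batch effect.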
Principal component analysis (PCA)

(Koch et al. Am J Respir Cell Mol Biol 2018(59)2:145–157)


Sample distance plot

Very good Abnormal (batch effect)

http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html

(Koch et al. Am J Respir Cell Mol Biol 2018(59)2:145–157)


Hierarchical clustering and k-means
clustering

(Koch et al. Am J Respir Cell Mol Biol 2018(59)2:145–157)


Approaches for the three types of
differential expression analysis (DGE, DTE
and DTU)

Soneson et al. F1000Research 2016, 4:1521 Last updated: 12 NOV 2019


Overview of RNA-seq DE analysis

DESeq2
edgeR limma-voom

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Statistical modeling and estimation

• Using negative binomial (NB) distribution to model count data from RNA-seq

• NB generalized linear model for DEG analysis
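
One reason the NB distribution suits RNA-seq counts is its extra dispersion parameter. A minimal sketch, assuming the mean-dispersion parameterization commonly used in RNA-seq tools (variance = μ + αμ²); the function name and numbers are hypothetical.

```python
def nb_variance(mu: float, dispersion: float) -> float:
    """Variance of a negative binomial count with mean mu and dispersion alpha."""
    return mu + dispersion * mu ** 2

# With zero dispersion the NB variance reduces to the Poisson variance (= mean).
print(nb_variance(100, 0.0))
# With dispersion 0.1 the variance is far larger than the mean (overdispersion).
print(nb_variance(100, 0.1))
```

The Poisson term reflects counting (technical) noise, while the dispersion term captures the additional variability between biological replicates.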

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Statistical inference

• Test hypotheses:
– 𝐻0 : there is no DE between conditions, i.e., that the LFC is zero.
– 𝐻1 :the LFC differs from zero.
• Test methods:
– Likelihood ratio tests (LRTs): DESeq2 and edgeR. LRTs compare
the likelihood of a full model, upon estimating all parameters
without constraints, with the likelihood of a reduced model,
where one or some of the parameters are constrained according
to 𝐻0 . LRT statistics are asymptotically χ2-distributed under 𝐻0 .
– Wald test: DESeq2 only. W = LFC /se(LFC), asymptotically
follows a standard normal distribution under H0
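
The Wald test above reduces to a few lines once an LFC estimate and its standard error are available. A minimal sketch with made-up inputs; the function name is hypothetical, and the two-sided p-value assumes the standard normal approximation stated above.

```python
import math

def wald_test(lfc: float, se: float):
    """Return the Wald statistic and its two-sided p-value under a standard normal null."""
    w = lfc / se
    p = math.erfc(abs(w) / math.sqrt(2))  # equals 2 * (1 - Phi(|w|))
    return w, p

# Hypothetical gene: estimated log2 fold change 1.2 with standard error 0.4.
w, p = wald_test(1.2, 0.4)
print(round(w, 2), p)
```
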

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Adjusting p-values for multiple testing

• False discovery rate (FDR)
– Benjamini–Hochberg correction (BH)
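
The BH adjustment is simple enough to write out directly. The following is a minimal pure-Python sketch of the step-up procedure (function name and example p-values are illustrative): each p-value is scaled by m/rank, and a running minimum from the largest rank downward keeps the adjusted values monotone.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(pvals))
```

Genes whose adjusted p-value falls below the chosen FDR level (for example 0.05) are then reported as differentially expressed.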
Variations to the General DEG Workflow
• Alternative models (inference frameworks)
– QL method (QuasiSeq, edgeR): F-statistics
– limma-voom models: empirical Bayes, moderated t- and F-statistics (Law et al. Genome Biology. 2014, 15:R29)
– Nonparametric methods: NOISeq, SAMseq
• Robust log-fold change estimation: shrinkage of LFC estimators
– Ratios of smaller counts result in more variable LFCs
– The estimation of LFC can be sensitive to outliers.
• Accounting for unobserved effects
– SVAseq and RUVseq
• Statistical inference by testing against a threshold
– H0: |LFC| ≤ a versus H1: |LFC| > a (implemented differently in DESeq2 and edgeR)
• Small-sample inference: when the number of replicates is too small (Di et al. 2013)
• Multiple testing:
– Local FDR, Storey’s q-value

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Variants of differential expression analysis
• Differential gene expression (DGE)
• Differential transcript expression (DTE)
• Differential transcript usage (DTU)/differential splicing (DS)
• Differential exon usage

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)

For details of statistical models, see Li and Li. Quantitative Biology 2018, 6(3): 195–209. https://doi.org/10.1007/s40484-018-0144-7

(Collado-Torres et al. 2017. https://f1000research.com/articles/6-1558)
Visualization of DEGs
P-value histogram

How to interpret a p-value histogram?
http://varianceexplained.org/statistics/interpreting-pvalue-histogram/

Breheny et al. High-Throughput 2018, 7, 23; doi:10.3390/ht7030023


Volcano plot
R package: EnhancedVolcano
MA plot
M (log ratio) and A (mean average) scales
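
The M and A coordinates of an MA plot can be computed per gene as follows. This is a minimal sketch; the pseudocount of 1 and the function name are assumptions made to keep the logarithm defined for zero counts.

```python
import math

def ma_values(mean_a: float, mean_b: float, pseudocount: float = 1.0):
    """Return (M, A) for one gene from its mean normalized counts in two conditions."""
    la = math.log2(mean_a + pseudocount)
    lb = math.log2(mean_b + pseudocount)
    m = lb - la          # M: log ratio (condition B over condition A)
    a = (la + lb) / 2    # A: average log expression
    return m, a

print(ma_values(3, 15))
```

Plotting M against A for all genes shows whether fold changes depend on expression level; lowly expressed genes typically show the widest scatter in M.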
Venn diagram and upset plot

http://genomespot.blogspot.com/2017/09/upset-plots-as-replacement-to-venn.html
Heatmap

(Koch et al. Am J Respir Cell Mol Biol 2018(59)2:145–157)


Sankey diagram
Functional annotation of DE genes

• Gene set analysis
– Overrepresentation analysis (ORA)
– Gene set enrichment analysis (GSEA)
• Pathway analysis
– Overrepresentation analysis
– Gene set enrichment analysis
– Pathway topology analysis
Differences between Pathways and Gene Sets
• A biological pathway is “a series of interactions among molecules in a cell that leads to a certain product or a change in a cell” (https://en.wikipedia.org/wiki/Biological_pathway).
– Types:
• Signaling pathway
• Metabolic pathway
• Genetic pathway (gene regulatory network)
– Common pathway databases: (for details, see García-Campos et al. Frontiers in Physiology, 2015(6):383)
• KEGG (https://www.genome.jp/kegg/pathway.html)
• Reactome (https://reactome.org/)
• WikiPathways (https://www.wikipathways.org/index.php/WikiPathways)
• BioCyc (https://biocyc.org/)
• Gene set: a set of genes, i.e., an unordered and unstructured collection of genes, with some defined relationship.
– GO terms-associated genes
– MSigDB
– Genes on chromosome 1
– …
Over-representation analysis (ORA)
• ORA is a widely used approach to determine whether known biological functions or processes are over-represented in an experimentally-derived gene list. Procedures are as follows:
1. First, an arbitrary cutoff (on p-values, fold changes, or both) is set to define a list of DEGs.
2. Then, a background list of genes (the universe) is defined. Usually, the background is the set of genes expressed under the conditions of interest.
3. Then, the DEG list is compared to a gene set to find the number of common genes, given the background list.
4. The p-value can be calculated by Fisher’s exact test, given the background. (See next slide for statistical explanation).
5. The p-values are adjusted for multiple testing.

• CAUTION: This approach will work when the biological difference is large, but it will not work when the difference is small but evidenced in a coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA) (Subramanian et al. 2005) directly addresses this limitation.
Overrepresentation analysis (ORA)
[Figure: panels A and B; labels: Background, Gene set, DEGs; example genes listed: Hoxa5, Hoxa11, Edn2, …, Enpp2]

Hypergeometric distribution:
$p = 1 - P(k \mid n, K, N)$
$\text{Enrichment fold} = \frac{98}{65} \approx 1.5$
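
The hypergeometric calculation behind ORA can be sketched as below. The slide does not define its symbols, so the mapping used here is an assumption: N background genes, K of them in the gene set, n DEGs, and k DEGs that fall in the gene set. The p-value is the upper tail P(X ≥ k), which is equivalent to a one-sided Fisher's exact test.

```python
from math import comb

def ora_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: background genes, K: background genes in the gene set,
    n: DEGs, k: DEGs that are in the gene set (assumed symbol meanings).
    """
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def enrichment_fold(k: int, n: int, K: int, N: int) -> float:
    """Observed overlap divided by the overlap expected by chance."""
    return k / (n * K / N)

# Hypothetical numbers: 200 background genes, 30 in the set, 20 DEGs, 6 shared.
print(ora_pvalue(6, 20, 30, 200), enrichment_fold(6, 20, 30, 200))
```

The resulting p-values, one per gene set, still need multiple-testing adjustment, as in step 5 of the procedure.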
Gene Set enrichment analysis (GSEA)
• Given an a priori defined set of genes S and a ranked gene list L,
𝐻0 : the members of S are randomly distributed throughout L;
𝐻1 : the members of S are not randomly distributed throughout
L. Procedures are as follows:
1. All expressed genes are ranked based on log2 (fold change) from high to
low.
2. Calculation of an Enrichment Score, the degree to which a set S is over-
represented at the top or bottom of the ranked list L
3. Estimation of Significance Level of ES by using permutation test
4. Adjusting p-values for Multiple Testing
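
Step 2 can be illustrated with a simplified running-sum statistic. Note the assumption: this sketch is the unweighted (Kolmogorov-Smirnov-like) variant, whereas GSEA as published by Subramanian et al. weights each hit by its ranking metric. Gene names and the function name are hypothetical, and the ranked list is assumed to contain no duplicates.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (Kolmogorov-Smirnov-like)."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up at a set member, step down otherwise.
        running += 1 / n_hit if gene in members else -1 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # sorted by log2 fold change, high to low
print(enrichment_score(ranked, {"g1", "g2"}))  # set concentrated at the top
print(enrichment_score(ranked, {"g5", "g6"}))  # set concentrated at the bottom
```

A positive score indicates that the set is concentrated among up-regulated genes, a negative score among down-regulated genes; significance is then assessed by recomputing the score on permuted data (step 3).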
Gene Set enrichment analysis (GSEA)

• GSEA: Subramanian et al. PNAS 2005, 102(43): 15545–15550. http://software.broadinstitute.org/gsea/index.jsp
• GAGE: Luo et al. BMC Bioinformatics 2009, 10:161
• GAGE + Pathview Tutorials: https://figshare.com/articles/Tutorial_RNA_seq_differential_expression_amp_pathway_analysis_with_Sailfish_DESeq2_GAGE_and_Pathview/1619655
• Comparison of different methods: Mathur et al. BioData Mining (2018) 11:8; Hung et al. Brief Bioinform. 2012 May; 13(3): 281–291; Maciejewski. Briefings in Bioinformatics 15(4): 504–518
Pathway analysis

Further reading:
• Khatri et al. PLoS Computational Biology, 2012(8 )2:e1002375
• García-Campos et al. Frontiers in Physiology, 2015(6):383
• Reimand et al. Nat Protoc. 2019 Feb; 14(2): 482–517.
• minepath.org: Koumakis et al. Nucleic Acids Res. 2017 Jul 3; 45(Web Server issue): W116–W121.
Different representations of pathways
A

(García-Campos et al. Frontiers in Physiology, 2015(6):383)


Different pathway analyses

García-Campos et al. Frontiers in Physiology, 2015(6):383


Pathway Analysis vs Gene Set Analysis:
When Should I Use Each?
• Pathway analysis:
– when you care about how genes are known to interact
– when you want to take full advantage of the sizes and directions of
measured expression changes
– when you want to account for the type and direction of interactions
on a pathway
– when you want to predict or explain downstream or pathway-level
effects
– when you want your results to be based on the most recent
knowledge
• Gene set analysis
– when you are looking for “quick and dirty” answers
– when you have arbitrarily defined gene sets
(https://advaitabio.com/ipathwayguide/pathway-analysis-vs-gene-set-analysis/)
Advanced Analysis

IGV
UCSC genome Browser

Further readings about data integration
• https://academic.oup.com/bfg/article/17/2/104/4944665
• https://www.sciencedirect.com/science/article/pii/B9780444626509000166

Conesa et al. Genome Biology (2016) 17:13


Co-expression network analysis
[Figure: panels A and B; panel title: Factors contributing to observed co-expr.]

Useful tools:
• WGCNA
• coseq

Further reading:
• van Dam et al. Briefings in Bioinformatics, 2018 (19)4: 575–592
• Gaiteri et al. Genes, Brain and Behavior. 2014. 13: 13–24
• Usadel et al. Plant, Cell and Environment (2009) 32, 1633–1651
Beyond DEG and co-expression analyses:
Types of gene expression patterns changes between
conditions

Gaiteri et al. Genes, Brain and Behavior. 2014. 13: 13–24


Future direction of RNA-seq: long read
sequencing technologies

• PacBio Iso-seq
• Oxford NanoPore technologies:
direct RNA sequencing

Zhao et al. Front. Genet., 21 March 2019
