RNA seq Data Analysis

The document outlines the essential stages of RNA-seq data analysis, emphasizing the importance of experimental design, data generation, and analysis techniques. It highlights critical principles such as replication, randomization, and blocking to minimize variability and confounding factors. The document also discusses best practices for quality control and alignment in RNA-seq experiments, along with the significance of choosing appropriate sequencing methods and tools for accurate data interpretation.

Uploaded by

maneeshw110

RNA-seq data analysis

Haibo Liu
Treating Bioinformatics as a Data Science
Seven stages to data science
1. Define the question of interest
2. Get and understand the data
3. Clean the data
4. Explore the data
5. Fit statistical models
6. Communicate the results
7. Make your analysis reproducible
Avoid Garbage In, Garbage Out

(https://thedailyomnivore.net/2015/12/02/garbage-in-garbage-out/)
Bioinformatician should NOT JUST be a data scientist!!!
Bioinformatician’s role at stages of RNA-seq experiments
• Coherent/robust experimental design: groundwork of a successful experiment, avoiding unusable data and wasted time, money, and effort
– Consult a data analyst or bioinformatician!!!
• Inform experimenter/sequencing facility of experimental design:
1. Library prep batch
2. Layout of library on lane/flow cell
• Analyst aware of experimental design:
1. Quality control
2. Statistical model selection
…
Outline

❖ Overview of RNA-seq
❖ Experimental design
❖ Data generation
• Sample preparation
• Library preparation
• Sequencing
❖ Data analysis
• Sample level analysis
• Gene level analysis
• Advanced analysis
Transcriptome

• Transcriptome:
– Broadly speaking, all RNAs transcribed whether from a single cell or a
population of cells
– Narrowly speaking, the total mRNA, with a focus on gene expression
and what is being specifically coded for proteins

Palazzo AF, Lee ES. Front Genet 6:2 (2015)

RNA content of a typical eukaryotic cell

Palazzo AF, Lee ES. Front Genet 6:2 (2015)

What is RNA-seq?

• RNA-seq (RNA-sequencing) is a technique that examines the quantity and identities of RNA in a sample using next-generation sequencing (NGS).
• It analyzes the gene expression patterns encoded within a transcriptome.

https://www.frontiersin.org/articles/10.3389/fgene.2019.01361/full
RNA-seq and its applications

[Figure: RNA-seq applications, including calculation of read counts/TPM and allele-specific expression …]
RNA-seq is a complex process, where
everything is connected!!

The Everything's connected slide by Dündar et al. (2015)

Experimental design is critical

“…, the planning and design of RNA-Seq experiments has important implications for addressing the desired biological question and maximizing the value of the data obtained.”

--Aniruddha Chatterjee

Chatterjee et al. (2018) A Guide for Designing and Analyzing RNA-Seq Data. In: Raghavachari N., Garcia-Reyero N. (eds) Gene
Expression Analysis. Methods in Molecular Biology, vol 1783. Humana Press, New York, NY
Three critical principles in
experimental design
• Replication
– Basis of statistical inference
– Replication provides an efficient way of
increasing the precision of an experiment.
– $SE = \sqrt{\sigma^2 / n}$
• Randomization
– It eliminates the systematic bias.
– It is needed to obtain a representative sample
from the population.
– It helps in distributing the unknown variation
due to confounded variables throughout the
experiment and breaks the confounding
influence.

• Blocking
– Homogeneous experimental units within the
blocks
– Reduce within-block variance, increase
efficiency
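
The randomization and blocking principles above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name `assign_batches`, the sample IDs, and the idea of treating library-prep batches as blocks are assumptions made for the example. Samples are shuffled within each condition (randomization) and then dealt round-robin to batches, so every batch receives the same share of each condition (balanced blocking).

```python
import random
from collections import defaultdict

def assign_batches(samples, n_batches, seed=0):
    """Assign samples to batches so that every condition is spread evenly.

    samples: list of (sample_id, condition) tuples (hypothetical layout).
    Returns a dict mapping sample_id -> batch index (0-based).
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    assignment = {}
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)                           # randomization within a condition
        for i, sample_id in enumerate(ids):
            assignment[sample_id] = i % n_batches  # round-robin keeps batches balanced
    return assignment

# Hypothetical experiment: 6 control and 6 treated samples, 3 library-prep batches.
samples = [(f"ctrl_{i}", "control") for i in range(6)] + \
          [(f"trt_{i}", "treated") for i in range(6)]
batches = assign_batches(samples, n_batches=3)
for sample_id in sorted(batches):
    print(sample_id, "-> batch", batches[sample_id])
```

With this layout the batch can later be included as a blocking factor in the statistical model, because condition and batch are not confounded.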
Vocabulary-Biological and technical
replicates
• Biological replicates
– Replicates derived from biologically separate samples:
• different individual organisms
• different samplings of the same tumor
• different population of cells grown separately from each other but
originating from the same cell-line.
– A biological replicate combines both technical and biological variability as it is also an
independent case of all the technical steps.

Also see Blainey et al. Nat Methods 11, 879–880 (2014) doi:10.1038/nmeth.3091
Vocabulary-Statistical power

• The ability to identify differentially expressed genes when there really is a difference.
• This is partly dependent on variance and is therefore affected by the number of replicates available and sequencing depth.

See Krzywinski, M., Altman, N. Power and sample size. Nat Methods 10, 1139–1140 (2013) doi:10.1038/nmeth.2738
Vocabulary-Confounding factors

• A confounding factor is a
nuisance variable that is
associated with the factor
of interest.

• Possible confounding
factors should be controlled
for so they don't interfere
with analysis.
Biological and technical variance

Observed gene expression variance = Biological variance + Technical variance

• Biological replicates measure combined biological and technical variability
– Biological variability is the main source of variability
• Natural variation (Genetic and stochastic) in the population and within cells.
– The amount of variance between your biological replicates will affect the
outcome of your analysis. Ideally, you aim to have minimal variability
between samples so you only measure the effect of the condition of
interest.
• Technical variation:
– Mainly from RNA processing + library prep
– Flow cell or lane effects are usually small
– Generally, creating technical replicates is unnecessary.
Sources and size of variability in RNA-seq

[Figure: sources and relative size of variability. Biological variability: biggest. Technical variability: big, big, or very small, depending on the source.]

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Strategies to minimize variation between
samples and to control confounding variables

• Choosing organisms of similar genetic background (littermates)
• Choosing organisms of the same sex if possible
• Using a constant sample collection time and sampling sites on tissues
• Having the same laboratory technician perform RNA prep and library prep, using the same lots of reagents, and limiting the processing time range
• If variation between samples cannot be removed, using a balanced blocking design and modeling the block effect during analysis
• Randomizing samples to prevent a confounding batch effect if all samples can't be processed at one time

Krzywinski and Altman. Nat Methods 11, 699–700 (2014) doi:10.1038/nmeth.3005
Number of Replicates and Sequencing reads

• As a general rule, the number of biological replicates should never be below 3 (ideally, ≥ 6 replicates).
• For a basic RNA-seq differential expression experiment, 10M to 20M reads per sample is usually enough. Do consider the transcriptome complexity.
• More replicates are often better investments than deeper sequencing
• Biological variability is usually the largest effect limiting the power of RNA-seq analysis
• Scotty – A web-based tool for Power Analysis for RNA Seq Experiments: http://scotty.genetics.utah.edu/
Sequencing depth vs. replicates
• Beyond about 10 M fragments per sample, deeper sequencing only marginally increases the number of lowly expressed genes detected, and statistical power to detect DE does not improve considerably.
• In most cases, increasing replicates is more beneficial than increasing sequencing depth. CAUTION: weigh library prep cost vs. sequencing cost!!

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
(Monaco et al. 2019, Cell Reports 26, 1627-1640)
Sequencing options to consider
• Quality and quantity of total RNA
– 100–250 ng total RNA per sample
– RIN > 7.0 (adjust for RIN or mRIN in models)
(https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-12-42;
https://www.nature.com/articles/ncomms8816)
• RNA enrichment method
– Poly-A enrichment (minimum of 100 ng)
• recommended for most standard RNA-seq experiments
• provides no information about microRNAs and other non-coding RNA species
– rRNA depletion (minimum of 200 ng)
• noisier
• recommended for poor or variable quality of RNA
• Read type and read length
– Single end: recommended for basic DEG analysis
– Paired end: recommended for transcriptome assembly
– 50-100 bp SE for counting; 100-150 bp PE for transcript reconstruction
– Applications that require more, longer, and possibly paired-end reads:
• quantification of lowly expressed genes
• identification of genes with small changes between conditions
• investigation of alternative splicing/isoform quantification
• identification of novel transcripts, chimeric transcripts
• de novo transcriptome assembly

• Strandedness
– Non-stranded: recommended for basic RNA-seq experiments
– Stranded: novel transcript discovery, more accurate quantitation
• Multiplexing: minimize lane effect
• Spike-in controls:
– Have been used for normalization and quality control
– Recent work has shown that the amount of technical variability in their use dramatically reduces their utility.
(not recommended, Risso et al. Nature Biotechnology, 2014, 1-10)
(https://www.melbournebioinformatics.org.au/tutorials/tutorials/rna_seq_exp_design/rna_seq_experimental_design/)
RNA quantity and quality (RIN)
determine the choice of library preparation protocol

Qubit fluorometer

Schroeder et al. The RIN: an RNA integrity number for assigning integrity values to RNA measurements.
BMC Molecular Biol 7, 3 (2006) doi:10.1186/1471-2199-7-3
Choices of mRNA enrichment methods

Quality of
initial RNA
Stranded vs. unstranded RNA-seq
Trends of read length, depth, and sample
size of RNA-seq experiments

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
General biases

• Issues with using reference genome

– CNV
– Mappability
• Mapping ambiguity
– Multi-mapping reads
Outline

https://link.springer.com/protocol/10.1007%2F978-1-4939-7834-2_11
Pre-analysis

[Figure: pre-analysis stage of the workflow; tools shown: FastQC, MultiQC, QoRTs, R]

Conesa et al. Genome Biology (2016) 17:13

Quality control of RNA-seq data
• Tools for RNA-seq data QC:
– Pre-alignment QC: FASTQ files as input
• FastQC and MultiQC
– Post-alignment QC: BAM files as input
• QoRTs (https://hartleys.github.io/QoRTs/): best
• RSeQC (http://rseqc.sourceforge.net/)
• RNA-SeQC (https://software.broadinstitute.org/cancer/cga/rna-seqc)
• dupRadar (https://www.ncbi.nlm.nih.gov/pubmed/27769170): duplication check specific to RNA-seq data
– Post-quantitation QC: expression matrix as input
• Sample distance, clustering and PCA analysis using R
(Sayols et al. BMC Bioinformatics. 2016; 17: 428.)
QoRTs analysis pipeline

Hartley and Mullikin. BMC Bioinformatics 16, 224 (2015) doi:10.1186/s12859-015-0670-5

Sequencing data quality improvement
(optional)
• Trimming low-quality bases and adapters: Trimmomatic/fastp
– Using relaxed trimming thresholds
– Improve mappability
– Reduce mapping artifacts
• Error correction: Rcorrector
– Improve mappability
– Improve de novo transcriptome assembly

For more tools, see https://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools#Error_correction.

Core Analysis

CPM

Conesa et al. Genome Biology (2016) 17:13

Alignment and Quantification
Case I: No reference genome, no reference transcriptome
• Start with de novo transcriptome assembly (RNA-Bloom/Trinity)

Conesa et al. Genome Biology (2016) 17:13

Case II: reference genome but not well annotated
• de novo transcriptome assembly
• genome-guided transcriptome assembly
• Hybrid transcriptome assembly
[Figure: transcriptome assembly workflow; tools shown: HISAT2, Gsnap, RNA-Bloom, StringTie, TransComb, Bowtie, RSEM]

Martin and Wang. Nature Reviews Genetics 2011(12): 671–682

Conesa et al. Genome Biology (2016) 17:13
Case III: reference genome well enough annotated
Using splice-aware aligner: STAR, HISAT2, Gsnap

Using ungapped aligner: Bowtie

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
C. Alignment-free or pseudo-alignment against transcriptome: Salmon, Sailfish, Kallisto
– EM algorithms
– Sailfish: k-mer-based approach
– Kallisto: pseudo-alignment
– Salmon: quasi-alignment, modeling GC bias
(Cyverse)
Case III: reference genome
well enough annotated

[Figure: quantification workflow; aligners (HISAT2, Gsnap) with GTF/GFF annotation and featureCounts, or pseudo-mapping to the transcriptome with pseudo-aligners (Sailfish, Salmon, Kallisto)]

Conesa et al. Genome Biology (2016) 17:13

Expression quantification at different levels
• Gene level (most common): featureCounts (best), htseq-count (very slow)
– direct fragment overlap counting of gene features
• no principled way of handling multimapping reads (or using MMR first)
• potentially important compositional changes not reflected directly in gene level read counts (e.g., isoform switching)
– transcript-level quantification followed by aggregation to the gene level (see below).
• Transcript level:
– Full alignment-based: Mix² (best, not free), RSEM, Cufflinks, eXpress, PennSeq, MMSeq (https://www.ncbi.nlm.nih.gov/pubmed/28505151). All based on EM algorithms
– Quasi-mapping: Salmon, and the like
– Advantages
• clear interpretation, since transcripts are the basic unit of gene transcription
• improved biological resolution and decoding of potentially important biological changes, such as isoform switching
• most appropriate level to model and correct for technical biases
• provides a proper model for handling reads that multimap
– Disadvantages
• CAUTION: Most of these approaches (lightweight and otherwise) assume that the annotation of transcripts to be quantified is complete.
• Many more multimapping reads to handle
• necessitates the adoption of a model, which may fail to adequately capture reality
• read ambiguity translates to additional uncertainty in the estimated transcript abundances.
• Exon level: JunctionSeq (https://www.ncbi.nlm.nih.gov/pubmed/27257077) or EQP (https://www.ncbi.nlm.nih.gov/pubmed/27302131)
Salmon versus HTseq counting

• https://cgatoxford.wordpress.com/2016/08/17/why-you-should-stop-using-featurecounts-htseq-or-cufflinks2-and-start-using-kallisto-salmon-or-sailfish/
• http://crazyhottommy.blogspot.com/2016/07/comparing-salmon-kalliso-and-star-htseq.html
• https://www.biorxiv.org/content/biorxiv/early/2018/10/16/444620.full.pdf
• Soneson et al. F1000Research 2016, 4:1521, https://doi.org/10.12688/f1000research.7563.2
Recommendations on quantitation

• Gene abundance estimates are more accurate than transcript abundance estimates
• DTE is more powerful and easier to interpret on gene level than for individual transcripts
• Incorporating transcript-level estimates leads to more accurate DGE results

https://f1000research.com/articles/4-1521/v2
Preprocessing of quantitation data

The number of reads aligned to a given gene reflects the sequencing depth
and that gene’s share of the population of mRNA molecules.
Preprocessing: Filtering
• Determining Intra- and Intergroup Sample Variability and Outliers
• Filtering out noise by removing extremely lowly expressed genes: keep a gene if rowSum(cpm ≥ 1) ≥ min(n_i), i.e., its CPM is at least 1 in at least as many samples as the smallest group size n_i.
• Filter on expression level (absolute count or CPM), not on variation across samples.
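
The filtering rule above can be sketched as follows. This is an illustrative, standard-library-only example: the function names, the toy count matrix, and the group sizes are invented, and the CPM cutoff of 1 is simply the threshold quoted on the slide.

```python
def cpm(counts):
    """Counts per million for a genes x samples matrix (list of rows)."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)] for row in counts]

def filter_low_expression(counts, group_sizes, min_cpm=1.0):
    """Return indices of genes with CPM >= min_cpm in at least min(n_i) samples."""
    min_group = min(group_sizes)
    keep = []
    for gene_index, row in enumerate(cpm(counts)):
        n_expressed = sum(value >= min_cpm for value in row)
        if n_expressed >= min_group:
            keep.append(gene_index)
    return keep

# Toy matrix: 3 genes x 4 samples (two groups of 2 samples), invented for illustration.
counts = [
    [500_000, 480_000, 510_000, 495_000],   # highly expressed gene
    [0, 1, 0, 0],                           # detected in a single sample only
    [499_999, 519_999, 489_998, 504_998],   # filler bringing each library to about 1M reads
]
kept = filter_low_expression(counts, group_sizes=[2, 2])
print(kept)
```

The filter looks only at expression level, not at the group labels or at between-sample variation, so it does not bias the downstream test toward or against any condition.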

Preprocessing: Normalization
• Normalization methods:
– Case 1: The assumption that most genes are not DE and that expression is balanced is valid
• TMM (trimmed mean of M-values)
• DESeq (median of expression ratios)
– Case 2: RNA composition dramatically different between conditions
• Smooth quantile normalization (R package: YARN): tissue-aware normalization (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5862355/, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1847-x)
• Spike-in based normalization (use only if no other choice)
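
To make the Case 1 idea concrete, here is a minimal sketch of median-of-ratios size factors in the spirit of the DESeq method. It is a simplified re-implementation for illustration, not the DESeq code itself; the function name `size_factors` and the toy counts are assumptions. Each sample's factor is the median, over genes, of the ratio between its count and the gene's geometric mean across samples.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    n_samples = len(counts[0])
    # Use only genes with non-zero counts in every sample, so the log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log geometric mean of each usable gene across samples (the pseudo-reference).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data: sample 2 was sequenced twice as deeply; one gene is strongly DE.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [10, 500],   # strongly DE gene; the median keeps the factor robust to it
    [0, 4],      # skipped because of the zero count
]
sf = size_factors(counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
print(sf)
```

Because the median is taken over genes, a minority of DE genes has little influence on the factor, which is exactly why the method relies on the assumption that most genes are not DE.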

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6171491/
https://www.frontiersin.org/articles/10.3389/fbioe.2019.00358/full
Normalization by library size
• Methods: TC, FPKM/RPKM, TPM
• Assumption: same total expression
• Motivation: normalized counts reflect the proportion of total mRNA/cell taken up by each gene.
Normalization by distribution/testing
• Assumptions:
– Technical effects are the same for DE and non-DE genes
– Balanced expression: roughly symmetric differential expression across conditions
• Methods:
– By distribution: UQ, Median, DESeq, TMM, CuffDiff2, MRN
– By testing: PoissonSeq, DEGES
• Motivation: non-DE genes should have, on average, the same normalized counts across conditions. The same normalization factor for the non-DE genes can be applied to normalize all genes.
Normalization by controls
• Assumptions:
– Existence of controls (HK genes, spike-ins)
– Controls behave like non-control genes
• Methods: RUV-seq, smooth quantile normalization
[Figure panel (e): global shift]
DESeq and TMM normalization methods are the best for Case 1

Dillies et al. Briefings in Bioinformatics, 2012, 14(6): 671-683

Read count normalization within samples

● DO NOT use RPKM (Reads Per Kilobase per Million mapped reads) or FPKM (Fragments Per Kilobase per Million mapped fragments) to express normalized counts in ChIP-seq (or RNA-seq) ANY MORE!! (Wagner et al. Theory Biosci. (2012) 131: 281. https://doi.org/10.1007/s12064-012-0162-3)

● CPM (Counts Per Million) and TPM (Transcripts Per Million) are less biased ways of normalizing read counts.

● Watch the video for explanation: https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
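
The difference between these units can be seen in a small worked example. The sketch below uses invented counts and gene lengths and plain Python; the function names are illustrative. It shows that TPM values always sum to one million within a sample, whereas the RPKM total depends on the sample's composition, which is what makes RPKM values hard to compare between samples.

```python
def cpm(counts):
    """Counts per million: scale by library size only."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: divide by length (kb) and library size (millions)."""
    total = sum(counts)
    return [c / (length / 1e3) / (total / 1e6) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to sum to 1e6."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

# Invented example: three genes, two samples differing only in the third gene.
lengths = [1000, 2000, 4000]
sample_a = [100, 200, 400]
sample_b = [100, 200, 1600]

tpm_total_a = sum(tpm(sample_a, lengths))
tpm_total_b = sum(tpm(sample_b, lengths))
rpkm_total_a = sum(rpkm(sample_a, lengths))
rpkm_total_b = sum(rpkm(sample_b, lengths))
print(tpm_total_a, tpm_total_b)     # both totals equal one million
print(rpkm_total_a, rpkm_total_b)   # totals differ between the two samples
```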
Outline

Li and Li. Quantitative Biology 2018, 6(3): 195–209. https://doi.org/10.1007/s40484-018-0144-7

Sample level analysis
• Data transformation
– regularized log (rlog, DESeq2 package)
– variance stabilizing transformation(VST, DESeq2 package)
– logCPM (limma package)
• Transcriptome similarity
– CAUTION: Correlation alone is not an effective measure (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0940-1)
– Transcriptome overlap measure (TROM): robust (Li et al. Stat Biosci. 2017 Jun; 9(1): 105–136.)
– Hierarchical clustering based on sample Euclidean distances (DESeq2)
– Dimension reduction:
• PCA, MDS, t-SNE, UMAP
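
As a rough illustration of the sample-level workflow (transform, then compare samples), the sketch below computes a simple log2 CPM with a pseudocount and the Euclidean distance between every pair of samples. It is a simplified stand-in, not the rlog, VST, or limma logCPM formula; the function names, the pseudocount value, and the toy counts are assumptions.

```python
import math

def log_cpm(counts, pseudocount=0.5):
    """Simplified log2 CPM for a genes x samples matrix; the pseudocount avoids log(0)."""
    n_samples = len(counts[0])
    lib = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[math.log2((row[j] + pseudocount) / lib[j] * 1e6) for j in range(n_samples)]
            for row in counts]

def sample_distances(matrix):
    """Pairwise Euclidean distances between the columns (samples) of a matrix."""
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    return [[math.dist(a, b) for b in columns] for a in columns]

# Toy counts: samples 1-2 belong to one group, samples 3-4 to another.
counts = [
    [100, 110, 400, 420],
    [200, 190, 50, 55],
    [300, 310, 300, 290],
    [50, 45, 52, 48],
]
distances = sample_distances(log_cpm(counts))
for row in distances:
    print([round(d, 2) for d in row])
```

The resulting distance matrix is the usual input for hierarchical clustering or a sample distance heatmap; replicates of the same group are expected to show the smallest distances.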
Principal component analysis (PCA)

(Koch et al. Am J Respir Cell Mol Biol 2018(59)2:145–157)

Sample distance plot

Very good Abnormal (batch effect)

http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html

(Koch et al. Am J Respir Cell Mol Biol 2018(59)2:145–157)

Hierarchical clustering and k-means
clustering

(Koch et al. Am J Respir Cell Mol Biol 2018(59)2:145–157)

Outline

Soneson et al. F1000Research 2016, 4:1521

Overview of RNA-seq DE analysis

DESeq2
edgeR limma-voom

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Statistical modeling and estimation

• Using negative binomial (NB) distribution to model count data from RNA-seq
• NB generalized linear model for DEG analysis

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Statistical inference

• Test hypotheses:
– $H_0$: there is no DE between conditions, i.e., the LFC is zero.
– $H_1$: the LFC differs from zero.
• Test methods:
– Likelihood ratio tests (LRTs): DESeq2 and edgeR. LRTs compare the likelihood of a full model, in which all parameters are estimated without constraints, with the likelihood of a reduced model, in which one or more parameters are constrained according to $H_0$. LRT statistics are asymptotically $\chi^2$-distributed under $H_0$.
– Wald test: DESeq2 only. $W = \mathrm{LFC} / \mathrm{se}(\mathrm{LFC})$ asymptotically follows a standard normal distribution under $H_0$.

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
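
The Wald statistic described above is simple enough to compute by hand. The sketch below is a generic illustration using only the standard library, not DESeq2's implementation; the LFC and standard error values are invented. The two-sided p-value uses the identity 2(1 − Φ(|W|)) = erfc(|W|/√2) for the standard normal distribution.

```python
import math

def wald_test(lfc, se):
    """Return the Wald statistic and its two-sided p-value under a standard normal null."""
    w = lfc / se
    p_value = math.erfc(abs(w) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|w|))
    return w, p_value

# Hypothetical gene: estimated log2 fold change 1.2 with standard error 0.4.
w, p = wald_test(lfc=1.2, se=0.4)
print(f"W = {w:.2f}, p = {p:.4f}")
```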
Adjusting p-values for multiple testing

• False discovery rate (FDR)

– Benjamini–Hochberg Correction (BH)
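
A minimal sketch of the Benjamini–Hochberg adjustment is given below, assuming a plain list of raw p-values (the example values are invented). Each p-value is multiplied by m/rank, and a running minimum taken from the largest p-value downward keeps the adjusted values monotone.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by increasing p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(padj)
```

Genes with an adjusted p-value below the chosen FDR level (for example 0.05) are then reported as differentially expressed.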
Variations to the General DEG Workflow
• Alternative models (inference frameworks)
– QL method (QuasiSeq, edgeR): F-statistics
– limma-voom models: empirical Bayes, moderated t- and F-statistics (Law et al.
Genome Biology. 2014, 15:R29)
– Nonparametric methods: NOISeq, SAMseq
• Robust log-fold change estimation: shrinkage of LFC estimators
– Ratios of smaller counts result in more variable LFCs
– The estimation of LFC can be sensitive to outliers.
• Accounting for unobserved effects
– SVAseq and RUVseq
• Statistical inference by testing against a threshold
– H0: |LFC| ≤ a vs. H1: |LFC| > a (implemented differently in DESeq2 and edgeR)
• Small-sample inference: too few replicates (Di et al. 2013)
• Multiple testing:
– Local FDR, Storey’s q-value

(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)
Variants of differential expression analysis
• Differential gene expression (DGE)
• Differential transcript expression (DTE)
• Differential transcript usage (DTU)/differential splicing (DS)
• Differential exon usage
(Van den Berge et al. Annu. Rev. Biomed. Data Sci. 2019. 2:139–73)

For details of statistical models, see Li and Li. Quantitative Biology 2018, 6(3): 195–209. https://doi.org/10.1007/s40484-018-0144-7

(Collado-Torres et al. 2017. https://f1000research.com/articles/6-1558)
Visualization of DEGs
P-value histogram

How to interpret a p-value histogram?

http://varianceexplained.org/statistics/interpreting-pvalue-histogram/

Breheny et al. High-Throughput 2018, 7, 23; doi:10.3390/ht7030023

Volcano plot
R package: EnhancedVolcano
MA plot
M (log ratio) and A (mean average) scales
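
For reference, the two MA-plot coordinates can be computed as in the sketch below. The pseudocount of 1 and the example expression values are assumptions made for illustration; M is the log2 ratio of the two conditions and A is the average of their log2 values.

```python
import math

def ma_values(x, y, pseudocount=1.0):
    """Return (M, A) for expression values x (reference) and y (comparison)."""
    log_x = math.log2(x + pseudocount)
    log_y = math.log2(y + pseudocount)
    m = log_y - log_x             # M: log ratio
    a = 0.5 * (log_x + log_y)     # A: mean of the log values
    return m, a

# Invented normalized expression values for one gene in two conditions.
m, a = ma_values(x=15, y=63)
print(m, a)
```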
Venn diagram and upset plot

http://genomespot.blogspot.com/2017/09/upset-plots-as-replacement-to-venn.html
Heatmap

(Koch et al. Am J Respir Cell Mol Biol 2018(59)2:145–157)

Sankey diagram
Functional annotation of DE genes

• Gene set analysis

– Overrepresentation analysis (ORA)
– Gene set enrichment analysis (GSEA)
• Pathway analysis
– Overrepresentation analysis
– Gene set enrichment analysis
– Pathway topology analysis
Differences in Pathways and Gene Set
• A biological pathway is “a series of interactions among molecules in a cell that leads to a certain product or a change in a cell” (https://en.wikipedia.org/wiki/Biological_pathway).
– Types:
• Signaling pathway
• Metabolic pathway
• Genetic pathway (gene regulatory network)
– Common pathway databases: (for details, see García-Campos et al. Frontiers in Physiology, 2015(6):383)
• KEGG (https://www.genome.jp/kegg/pathway.html)
• Reactome (https://reactome.org/)
• WikiPathways (https://www.wikipathways.org/index.php/WikiPathways)
• BioCyc (https://biocyc.org/)
• Gene set: a set of genes, i.e., an unordered and unstructured collection of genes, with some defined relationship.
– GO terms-associated genes
– MSigDB
– Genes on chromosome 1
– …
Over-representation analysis (ORA)
• ORA is a widely used approach to determine whether known biological functions or processes are over-represented in an experimentally derived gene list. Procedures are as follows:
1. First, an arbitrary cutoff (on p-values, fold changes, or both) is set to define a list of DEGs.
2. Then, a background list of genes (the universe) is defined. Usually, the background is the set of genes expressed under the conditions of interest.
3. Then, the DEG list is compared to a gene set to find the number of common genes, given the background list.
4. The p-value can be calculated by Fisher’s exact test, given the background. (See next slide for statistical explanation).
5. The p-values are adjusted for multiple testing.

• CAUTION: This approach works when the biological difference is large, but not when the difference is small yet evident in a coordinated way across a set of related genes. Gene Set Enrichment Analysis (GSEA) (Subramanian et al. 2005) directly addresses this limitation.
Over-representation analysis (ORA)
[Figure, panels A and B: background, gene set, and DEGs; example DEGs listed: Hoxa5, Hoxa11, Edn2, …, Enpp2]
Hypergeometric distribution:
$p = 1 - P(k \mid n, K, N)$
Enrichment fold $= 98/65 \approx 1.5$
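
The hypergeometric calculation can be sketched with the standard library alone. The symbol meanings used here are an assumption, since the slide does not define them: N is the background size, K the number of background genes in the gene set, n the number of DEGs, and k the overlap between DEGs and gene set. The p-value is taken as the upper tail P(X ≥ k), and the fold enrichment as observed overlap divided by expected overlap; the numbers in the example are invented.

```python
from math import comb

def ora_pvalue(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k)."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def fold_enrichment(k, n, K, N):
    """Observed overlap divided by the overlap expected by chance (n * K / N)."""
    return (k / n) / (K / N)

# Invented example: 20 background genes, 5 in the gene set, 4 DEGs, 3 of them in the set.
p = ora_pvalue(k=3, n=4, K=5, N=20)
fold = fold_enrichment(k=3, n=4, K=5, N=20)
print(round(p, 4), fold)
```

This one-sided hypergeometric tail is the same quantity that a one-sided Fisher's exact test reports for the corresponding 2 x 2 table.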
Gene Set enrichment analysis (GSEA)
• Given an a priori defined gene set S and a ranked gene list L,
H0: the members of S are randomly distributed throughout L;
H1: the members of S are not randomly distributed throughout L.
Procedures are as follows:
1. All expressed genes are ranked based on log2 (fold change) from high to
low.
2. Calculation of an Enrichment Score, the degree to which a set S is over-
represented at the top or bottom of the ranked list L
3. Estimation of Significance Level of ES by using permutation test
4. Adjusting p-values for Multiple Testing
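
Step 2 above can be illustrated with a simplified running-sum enrichment score. This is a sketch of the general idea rather than the reference GSEA implementation: the function name, the weighting by |score| (`weight=1.0`), and the toy gene list are assumptions, and the permutation test of step 3 is not included.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score over a gene list ranked from high to low.

    The sum steps up at genes in the set (in proportion to |score| ** weight)
    and steps down at all other genes; the score is the maximum deviation from zero.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    hit_weights = [abs(s) ** weight if hit else 0.0 for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranked list (log2 fold changes from high to low) with invented gene names.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
lfc = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, lfc, gene_set={"g1", "g2"})
es_bottom = enrichment_score(genes, lfc, gene_set={"g5", "g6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom a negative one; significance would then be assessed by recomputing the score on permuted data.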
Gene Set enrichment analysis (GSEA)

• GSEA: Subramanian et al. PNAS 2005, 102(43): 15545–15550. http://software.broadinstitute.org/gsea/index.jsp
• GAGE: Luo et al. BMC Bioinformatics 2009, 10:161
• GAGE + Pathview Tutorials: https://figshare.com/articles/Tutorial_RNA_seq_differential_expression_amp_pathway_analysis_with_Sailfish_DESeq2_GAGE_and_Pathview/1619655
• Comparison of different methods: Mathur et al. BioData Mining (2018) 11:8; Hung et al. Brief Bioinform. 2012 May; 13(3): 281–291; Maciejewski. Briefings in Bioinformatics, 15(4): 504–518
Pathway analysis

Further reading:
• Khatri et al. PLoS Computational Biology, 2012, 8(2): e1002375
• García-Campos et al. Frontiers in Physiology, 2015(6):383
• Reimand et al. Nat Protoc. 2019 Feb; 14(2): 482–517.
• minepath.org: Koumakis et al. Nucleic Acids Res. 2017 Jul 3; 45(Web Server issue): W116–W121.
Different representations of pathways
A

(García-Campos et al. Frontiers in Physiology, 2015(6):383)

Different pathway analyses

García-Campos et al. Frontiers in Physiology, 2015(6):383

Pathway Analysis vs Gene Set Analysis:
When Should I Use Each?
• Pathway analysis:
– when you care about how genes are known to interact
– when you want to take full advantage of the sizes and directions of
measured expression changes
– when you want to account for the type and direction of interactions
on a pathway
– when you want to predict or explain downstream or pathway-level
effects
– when you want your results to be based on the most recent
knowledge
• Gene set analysis
– when you are looking for “quick and dirty” answers
– when you have arbitrarily defined gene sets
(https://advaitabio.com/ipathwayguide/pathway-analysis-vs-gene-set-analysis/)
Outline

IGV
UCSC genome Browser

Conesa et al. Genome Biology (2016) 17:13

Co-expression network analysis
[Figure, panels A and B: factors contributing to observed co-expression]

Useful tools:
• WGCNA
• coseq
Further reading:
• van Dam et al. Briefings in Bioinformatics, 2018 (19)4: 575–592
• Gaiteri et al. Genes, Brain and Behavior. 2014. 13: 13–24
• Usadel et al. Plant, Cell and Environment (2009) 32, 1633–1651
Beyond DEG and co-expression analyses:
Types of gene expression pattern changes between conditions

Gaiteri et al. Genes, Brain and Behavior. 2014. 13: 13–24

Future direction of RNA-seq: long read
sequencing technologies

• PacBio Iso-seq
• Oxford Nanopore Technologies:
direct RNA sequencing

Zhao et al. Front. Genet., 21 March 2019

No ratings yet
Zhang 2019 IOP Conf. Ser. Earth Environ. Sci. 332 042003
7 pages
유전공학 Week13
No ratings yet
유전공학 Week13
43 pages
Tutorial RNA-Seq Analysis Part 1
No ratings yet
Tutorial RNA-Seq Analysis Part 1
8 pages
Next Generation Sequencing
No ratings yet
Next Generation Sequencing
44 pages
Large-Scale Analysis of Gene Expression
No ratings yet
Large-Scale Analysis of Gene Expression
27 pages
Introduction to Bioinformatics, Sequence and Genome Analysis
From Everand
Introduction to Bioinformatics, Sequence and Genome Analysis
Jerry H. Swift
No ratings yet
2100 Expert - High Sensitivity DNA Assay - DE04105470 - 2019-07-11 - 10-20-58
No ratings yet
2100 Expert - High Sensitivity DNA Assay - DE04105470 - 2019-07-11 - 10-20-58
18 pages
CHAPTER 23 Fatty Acid Catabolism
No ratings yet
CHAPTER 23 Fatty Acid Catabolism
9 pages
Chem 301 Lecture Biochemistry: Week 1
No ratings yet
Chem 301 Lecture Biochemistry: Week 1
7 pages
SF - MD.0145.T.22 Ally Omary Mwambela
No ratings yet
SF - MD.0145.T.22 Ally Omary Mwambela
10 pages
HPLC Purification Kariko 2011
No ratings yet
HPLC Purification Kariko 2011
10 pages
GUJCET 2024 Question Paper Mar 31 Biology Af5e58719d80b69c9a77dad7081d8f21
No ratings yet
GUJCET 2024 Question Paper Mar 31 Biology Af5e58719d80b69c9a77dad7081d8f21
4 pages
Nucleus & Nucleolus: by Anup R. Kodape M.SC I
No ratings yet
Nucleus & Nucleolus: by Anup R. Kodape M.SC I
16 pages
Respiration: 1. Overview of Topic
No ratings yet
Respiration: 1. Overview of Topic
21 pages
Chapter 4 Biology Form 4
No ratings yet
Chapter 4 Biology Form 4
11 pages
Plasmid
No ratings yet
Plasmid
19 pages
Ivacaftor
No ratings yet
Ivacaftor
25 pages
Plasmids
No ratings yet
Plasmids
1 page
Science：Phage-triggered Reverse Transcription Assembles a Toxic Repetitive Gene From a Noncoding RNA
No ratings yet
Science：Phage-triggered Reverse Transcription Assembles a Toxic Repetitive Gene From a Noncoding RNA
17 pages
DNA Isolation
No ratings yet
DNA Isolation
39 pages
Protein Isolation From Whole Blood
No ratings yet
Protein Isolation From Whole Blood
8 pages
CV - Pankaj R
No ratings yet
CV - Pankaj R
3 pages
BioCyc Database Collection
No ratings yet
BioCyc Database Collection
3 pages
Biotechnology Books
No ratings yet
Biotechnology Books
4 pages
Lesson 2 TLE
No ratings yet
Lesson 2 TLE
2 pages
SHS-Physical Science (Biological Macromolecules) : I-Introductory Content
No ratings yet
SHS-Physical Science (Biological Macromolecules) : I-Introductory Content
13 pages
Lampiran Berita Acara Pemusnahan Reagen Dan Bahan Medis Habis Pakai Tanggal 28 Januari 2019
No ratings yet
Lampiran Berita Acara Pemusnahan Reagen Dan Bahan Medis Habis Pakai Tanggal 28 Januari 2019
2 pages
Activity Proteins (A)
No ratings yet
Activity Proteins (A)
8 pages
Microbe Mission Notes
No ratings yet
Microbe Mission Notes
13 pages
Biology For The IB Diploma - Answers: A2.2 Cell Structure
No ratings yet
Biology For The IB Diploma - Answers: A2.2 Cell Structure
6 pages
Final Exame of Physical Therapy
No ratings yet
Final Exame of Physical Therapy
6 pages
Biological Molecules Worksheet Bozeman
No ratings yet
Biological Molecules Worksheet Bozeman
2 pages
Biology MSC
No ratings yet
Biology MSC
228 pages
MSC Biotechnology Assignment Questionpapers
No ratings yet
MSC Biotechnology Assignment Questionpapers
1 page