
From Microarray To RNA-Seq: A Review of Transcriptome Analysis With Next-Generation Sequencing Data

This document discusses various aspects of analyzing transcriptome data from RNA sequencing (RNA-Seq) experiments, including data processing workflows, normalization methods, and statistical modeling approaches. It covers the history of microarray and next-generation sequencing technologies, workflows for mapping and counting reads from RNA-Seq experiments, issues around multireads, reproducibility, and statistical models such as Poisson and negative binomial distributions for read counts. It also discusses background correction methods that subtract non-coding reads from total reads.

Uploaded by

Mauro Pala
Copyright
© Attribution Non-Commercial (BY-NC)

Data Model Normalization Two Sample Test More Question

From Microarray to RNA-Seq


A Review of Transcriptome Analysis with Next-Generation
Sequencing Data

Review Author: Yaqing Si

Department of Statistics
Iowa State University

October 30, 2009


Outline

1 Data

2 Model

3 Normalization

4 Two Sample Test

5 More Questions


History

Microarray [Sultan et al.(2008)]


1987: arrayed DNAs for expression profiling and gene identification
were first described;
1995: the first miniaturized microarrays for gene expression
profiling were used;
1997: the first complete genome on a microarray was published
Next-Generation Sequencing
1977: DNA sequencing described by Sanger et al. (Nobel Prize in Chemistry, 1980)
2004: Roche (454) GS FLX sequencer;
2006: Illumina Genome Analyzer;
2007: Applied Biosystems SOLiD sequencer


Work Flow of Sequencing Data (I)


Prepare RNA samples
Ligate fragments to an array
Sequence RNA segments (reads)

[Figure: RNA/DNA fragments on a polony array; cyclic array sequencing (>10^6 reads/array). At cycles 1, 2, and 3 the sequencer asks: what is base 1, what is base 2, what is base 3?]


Work Flow of Sequencing Data (II)


Map reads to the genome
Count the number of reads
[Figure: mRNA segments (reads) aligned to the genome, with the number of reads counted per gene: Gene 1, N = 4; Gene 2, N = 3; Gene 3, N = 4]
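
The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the cited papers: the gene coordinates and read positions are hypothetical, genes are assumed not to overlap, and each read is assumed to be uniquely mapped and represented by its start position.

```python
from bisect import bisect_right

def count_reads_per_gene(genes, read_starts):
    """Count reads whose start position falls inside each gene interval.

    genes: dict mapping gene name -> (start, end), half-open, non-overlapping.
    read_starts: iterable of mapped start positions, one per read.
    """
    # Sort genes by start so each read can be located with a binary search.
    ordered = sorted(genes.items(), key=lambda item: item[1][0])
    starts = [interval[0] for _, interval in ordered]
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        i = bisect_right(starts, pos) - 1  # last gene starting at or before pos
        if i >= 0:
            name, (start, end) = ordered[i]
            if start <= pos < end:
                counts[name] += 1
    return counts

# Hypothetical coordinates and reads, chosen only for illustration.
genes = {"gene1": (100, 200), "gene2": (300, 450), "gene3": (500, 650)}
reads = [105, 120, 150, 199, 310, 320, 449, 250, 500, 510, 600, 649]
counts = count_reads_per_gene(genes, reads)
print(counts)
```

The read at position 250 falls between genes and is not counted, which mirrors reads that map outside annotated regions.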


Costs and Outputs

Read lengths differ from platform to platform

As of April 2009, reads can be as long as 1000 bp
Billions of reads per run
One run takes several days
Improvement is ongoing: higher throughput at lower cost
A "Moore's Law" for sequencing technology: about a 10-fold improvement per year
Recent goal: $1000 to sequence a human genome (in the 1990s, the
Human Genome Project cost over $3 billion).


Microarray: Analog Signals


Analog signals
Easy to convey the signal’s information
Continuous strength
Signal loss and distortion
Example: microarray data


RNA-Seq: Digital Signals

RNA-Seq data are digital signals


Harder to achieve and interpret
Read counts: discrete values
Weak or no background noise

[Figure: each dot marks a read mapped to the region beginning at that base]


Multireads
Multiread: a read aligned to more than one location
Longer reads yield fewer multireads
Most multireads map to repetitive regions [Ji et al.(2008)]
Example: allocation of 25 bp reads by number of mapped sites
[Mortazavi et al.(2008)]:

1 site: 76%
2-10 sites: 11%
11+ sites: 13%


Reproducibility

Technical replicate determinations of transcript abundance are
reproducible [Mortazavi et al.(2008)]

[Figure: scatter plot of technical replicates, Brain technical 1 (RPKM) vs. Brain technical 2 (RPKM), log-scale axes, R^2 = 0.96]

For microarray, the correlation between two replicates ranges
between 0.5 and 0.95 [Draghici et al.(2006)]


Poisson Model

Let $N_g$ be the count of reads mapped to gene $g$.

Assume that the reads are sampled independently and with
replacement [Sultan et al.(2008)]
Then $N_g$ follows a Poisson distribution

\[ N_g \sim \mathrm{Poisson}(\lambda_g). \]

$\lambda_g$ is the mapping rate
Large $\lambda_g$ means high expression


Multinomial Model

Assume that each read is mapped to gene $g$ with probability $p_g$

Only unique reads are considered, so $\sum_g p_g = 1$.
Then let $N = (N_1, N_2, \cdots)$ and $p = (p_1, p_2, \cdots)$

\[ N \sim \mathrm{Multinomial}(n, p), \]

where $n = \sum_g N_g$.
For each gene,

\[ N_g \sim \mathrm{Binomial}(n, p_g). \]

Since $n$ is large and $p_g$ is small, approximately

\[ N_g \sim \mathrm{Poisson}(\lambda_g), \]

with $\lambda_g = n p_g$
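
The quality of the Poisson approximation can be checked numerically. The sketch below compares the two probability mass functions for hypothetical values of n and p_g (one million reads, a gene that attracts five reads per million); these numbers are assumptions for illustration, not taken from any dataset.

```python
import math

def binomial_pmf(k, n, p):
    """P(N = k) for N ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n = 1_000_000      # hypothetical total number of mapped reads
p_g = 5e-6         # hypothetical probability that a read maps to gene g
lam = n * p_g      # Poisson rate lambda_g = n * p_g

# Largest pointwise gap between the two distributions over k = 0..29.
max_diff = max(abs(binomial_pmf(k, n, p_g) - poisson_pmf(k, lam))
               for k in range(30))
print(f"lambda = {lam:g}, largest pmf difference = {max_diff:.2e}")
```

For large n and small p_g the gap is negligible, which is why the per-gene Binomial count is treated as Poisson.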


Is Poisson Model Enough?


More than 95% of counts can be modeled well by a Poisson model
[Marioni et al.(2008)].

[Figure: plot with reference levels marked at 50%, 95%, 99.5%, and 100%]

A simple Poisson model cannot explain all of the variance.

Solution: assign a Gamma prior to $\lambda$
The result is a negative binomial model
Drawback: its two parameters are harder to interpret
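
The link between the Gamma prior and the negative binomial can be written out as a short derivation. The shape/rate parameterization $(\alpha, \beta)$ below is an assumption made for this sketch; the slides do not fix one.

```latex
% Assumed hierarchy: Poisson counts with a Gamma(shape alpha, rate beta) prior.
\[
N_g \mid \lambda_g \sim \mathrm{Poisson}(\lambda_g), \qquad
\lambda_g \sim \mathrm{Gamma}(\alpha, \beta)
\]
% Integrating out lambda gives the negative binomial marginal.
\[
P(N_g = k)
= \int_0^\infty \frac{\lambda^k e^{-\lambda}}{k!}\,
  \frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1} e^{-\beta\lambda}\, d\lambda
= \frac{\Gamma(k+\alpha)}{k!\,\Gamma(\alpha)}
  \left(\frac{\beta}{1+\beta}\right)^{\alpha}
  \left(\frac{1}{1+\beta}\right)^{k}
\]
% Mean and variance: the extra term mu^2 / alpha is the overdispersion.
\[
\mu = E(N_g) = \frac{\alpha}{\beta}, \qquad
\mathrm{Var}(N_g) = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2}
= \mu + \frac{\mu^2}{\alpha}
\]
```

The variance exceeds the Poisson variance $\mu$ by $\mu^2/\alpha$, which is how the second parameter absorbs the variance a simple Poisson model cannot explain.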

Background Correction
Genome = Coding Region + Non-Coding Region
Reads mapped to coding region: signal + background
Reads mapped to non-coding region: pure background
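Density of mapped reads in the non-coding region

\[ \frac{\text{\# Reads}}{\text{Length}} = \frac{631271}{1002002158} = 0.00063. \]

Density of mapped reads in the coding region

\[ \frac{\text{\# Reads}}{\text{Length}} = \frac{12739385}{54916879} = 0.2319758. \]

Signal-to-Noise Ratio

\[ \mathrm{SNR} = \frac{0.2319}{0.00063} = 368 \]

With a signal-to-noise ratio this high, background correction is not critical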
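
The arithmetic on this slide can be reproduced directly from the four numbers it quotes. The helper name `read_density` is introduced here only for illustration.

```python
# Read counts and region lengths as quoted on the slide.
noncoding_reads, noncoding_length = 631_271, 1_002_002_158
coding_reads, coding_length = 12_739_385, 54_916_879

def read_density(reads, length):
    """Average number of mapped reads per base in a region."""
    return reads / length

background = read_density(noncoding_reads, noncoding_length)
coding = read_density(coding_reads, coding_length)
snr = coding / background  # signal-to-noise ratio

print(f"non-coding density: {background:.5f}")
print(f"coding density:     {coding:.7f}")
print(f"SNR:                {snr:.0f}")
```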

Normalization: RPKM

RPKM: reads per kilobase per million mapped reads, introduced by
[Mortazavi et al.(2008)]
Two factors besides expression level influence a gene's read count
Total number of mappable reads in the sample
Gene length $L_g$

\[ \mathrm{RPKM}_g = \frac{N_g}{L_g \sum_g N_g} \times 10^9 \]

Analyze RPKM with microarray methods
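
A minimal sketch of the RPKM formula follows. The gene names, counts, and lengths are hypothetical and far smaller than in a real experiment; the total is taken over the genes supplied, which assumes they account for all mapped reads.

```python
def rpkm(counts, lengths):
    """Reads per kilobase per million mapped reads for each gene.

    counts: dict gene -> number of mapped reads N_g.
    lengths: dict gene -> gene length L_g in bases.
    """
    total_reads = sum(counts.values())  # sum over g of N_g
    return {gene: counts[gene] * 1e9 / (lengths[gene] * total_reads)
            for gene in counts}

# Hypothetical toy data, for illustration only.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}
values = rpkm(counts, lengths)
for gene, value in values.items():
    print(gene, round(value))
```

geneC has six times the reads of geneA but only twice the RPKM, because it is three times as long.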


Normalization: A Multiplier for the Poisson Rate

Combine normalization and analysis in one model

Introduce a normalizing factor $C_g$ in the Poisson model

\[ N_g \sim \mathrm{Poisson}(C_g \lambda^*_g) \]

$\lambda^*_g$: normalized expression level
$C_g$: similar to the RPKM idea

\[ C_g = L_g \sum_g N_g / 10^9 \]

One can define $C_g$ in other ways, depending on the question.
Treat $C_g$ as known, and interpret $\lambda^*_g$
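
To see why this choice of $C_g$ mirrors RPKM, one can work out the maximum likelihood estimate of $\lambda^*_g$ with $C_g$ treated as known. The derivation below is a sketch added for illustration; it is not stated on the slide.

```latex
% Log-likelihood of one Poisson count with known multiplier C_g.
\[
\ell(\lambda^*_g) = N_g \log(C_g \lambda^*_g) - C_g \lambda^*_g - \log(N_g!)
\]
% Setting the derivative to zero gives the estimate.
\[
\frac{\partial \ell}{\partial \lambda^*_g} = \frac{N_g}{\lambda^*_g} - C_g = 0
\quad\Longrightarrow\quad
\hat{\lambda}^*_g = \frac{N_g}{C_g}
= \frac{N_g}{L_g \sum_g N_g} \times 10^9 = \mathrm{RPKM}_g
\]
```

Under this definition of $C_g$, the estimated normalized expression level coincides with the RPKM value, while the model still works with the raw counts.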


Two Sample Analysis

Identify genes that are differentially expressed between two treatments

In microarray experiments, samples from two treatments are often
co-hybridized
RNA-Seq: each sample is sequenced separately

Gene ID      # Mapped Reads
             Sample 1        Sample 2
AC147602      81    95       163   140
AC147603     172   190        59    38
AC147604      12    16         0     0
AC147605       3     0       102    92
···          ···   ···       ···   ···


Tests from Microarray

Normalize the counts into RPKM

Transform RPKM, e.g., log or arcsine [Mortazavi et al.(2008)]
Apply microarray methods, e.g., t-test, moderated t-test.
Disadvantage: lacks the elegance of working directly with the
count data


Tests Based on Discrete Model (I)

Observations from one sample are often merged:

Gene ID      # Mapped Reads
             Sample 1            Sample 2
AC147602     81 + 95 = 176       163 + 140 = 303
AC147603     172 + 190 = 362     59 + 38 = 97
AC147604     12 + 16 = 28        0 + 0 = 0
AC147605     3 + 0 = 3           102 + 92 = 194
···          ···                 ···

The Poisson model is additive, so the merged counts are again Poisson:

\[ N_{gi} \sim \mathrm{Poisson}(C_{gi} \lambda_{gi}), \]

for $i = 1, 2$ and $g = 1, 2, \cdots, G$, independently.


Tests Based on Discrete Model (II)


Test $H_0: \lambda_{g1} = \lambda_{g2}$:
Binomial Test [Ji et al.(2008)]

\[ N_{g1} \mid N_{g1} + N_{g2} \sim \mathrm{Binomial}\left(N_{g1} + N_{g2},\ \frac{C_{g1}}{C_{g1} + C_{g2}}\right) \]

Negative Binomial Test [Audic et al.(1997)]

\[ N_{g2} \mid N_{g1} \sim \mathrm{Negative\ Binomial}\left(N_{g1} + 1,\ \frac{C_{g1}}{C_{g1} + C_{g2}}\right) \]

$\chi^2$ goodness-of-fit test for Poisson counts [Mortazavi et al.(2008)]

\[ \sum_{i=1,2} \frac{(N_{gi} - m_{gi})^2}{m_{gi}} \sim \chi^2_1, \quad \text{asymptotically} \]

where $m_{gi} = C_{gi}(N_{g1} + N_{g2})/(C_{g1} + C_{g2})$ is the expected
count under the null hypothesis
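
The binomial and chi-square tests can be sketched with the Python standard library alone. Two points are assumptions of this sketch rather than statements from the slides: the two-sided binomial p-value sums the probabilities of all outcomes no more likely than the observed one (one common convention), and the example uses equal normalizing factors C_g1 = C_g2 for the merged counts of AC147602.

```python
import math

def binomial_log_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), computed in log space."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def binomial_test(n1, n2, c1, c2):
    """Two-sided p-value for H0: lambda_1 = lambda_2, conditioning on n1 + n2."""
    n = n1 + n2
    p = c1 / (c1 + c2)  # null probability that a read belongs to sample 1
    observed = binomial_log_pmf(n1, n, p)
    # Sum the probabilities of outcomes at most as likely as the observed one.
    tail = sum(math.exp(lp)
               for lp in (binomial_log_pmf(k, n, p) for k in range(n + 1))
               if lp <= observed + 1e-9)
    return min(1.0, tail)

def chi_square_test(n1, n2, c1, c2):
    """Chi-square statistic (1 degree of freedom) and p-value; needs n1 + n2 > 0."""
    total = n1 + n2
    m1 = c1 * total / (c1 + c2)  # expected count in sample 1 under H0
    m2 = c2 * total / (c1 + c2)  # expected count in sample 2 under H0
    stat = (n1 - m1) ** 2 / m1 + (n2 - m2) ** 2 / m2
    # Survival function of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Merged counts for AC147602; equal C values are an illustrative assumption.
p_binomial = binomial_test(176, 303, 1.0, 1.0)
statistic, p_chi_square = chi_square_test(176, 303, 1.0, 1.0)
print(f"binomial p = {p_binomial:.2e}")
print(f"chi-square = {statistic:.2f}, p = {p_chi_square:.2e}")
```

Working in log space keeps the binomial probabilities from underflowing when the merged counts reach the thousands.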

P-Values

For RNA-Seq data, the density of p-values drops steeply near 0.

[Figure: two histograms of p-values (density against P on [0, 1]), one for RNA-Seq data and one for microarray data]


FDR Control
FDR can be controlled by adjusting p-values with the procedures of
[Benjamini & Hochberg (1995)] or
[Storey and Tibshirani (2003)]
FDR is often underestimated [Ji et al.(2008)]
RNA-Seq usually detects more genes than microarray does
[Figure: Venn diagram of genes detected by RNA-Seq and by microarray; regions labelled 2918, 9238, and 110 from left to right. Data from [Sultan et al.(2008)]]
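
The Benjamini-Hochberg adjustment is short enough to sketch in full. This is the standard step-up formulation written from its textbook definition, not code from any of the cited papers, and the p-values in the example are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

Genes whose adjusted p-value is at most the target level (0.05 here) are declared differentially expressed at that FDR.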



Expression Bias for DE genes


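Strongly expressed genes are more likely to be detected
[Oshlack et al.(2009)]
Example: counts of 500 vs. 10 give stronger evidence than 50 vs. 1, although the fold change is the same

[Figure: log fold change of expression plotted against log average expression level]

One explanation: scale the expression levels $\lambda_1$, $\lambda_2$ by a factor of $L$

\[ T = \frac{\lambda_1 L - \lambda_2 L}{\sqrt{\lambda_1 L + \lambda_2 L}} = \frac{\lambda_1 - \lambda_2}{\sqrt{\lambda_1 + \lambda_2}} \sqrt{L} \]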
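
The slide's example can be checked numerically by plugging the counts into the statistic T. The function name is introduced here for illustration only.

```python
import math

def poisson_difference_statistic(n1, n2):
    """T = (n1 - n2) / sqrt(n1 + n2) for two Poisson counts."""
    return (n1 - n2) / math.sqrt(n1 + n2)

t_low = poisson_difference_statistic(50, 1)     # weakly expressed gene
t_high = poisson_difference_statistic(500, 10)  # same fold change, 10x the counts
ratio = t_high / t_low

print(f"T(50, 1) = {t_low:.2f}, T(500, 10) = {t_high:.2f}, ratio = {ratio:.3f}")
```

The ratio equals the square root of 10, so multiplying both counts by L inflates the statistic by the square root of L even though the fold change is unchanged.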

More Questions

In microarray analysis, we deal with:


Interaction among treatments
Contrast of treatments
Experimental design
Clustering analysis
Gene set testing
...
For RNA-Seq, solutions are for now borrowed from microarray analysis


More Data from Next-Generation Sequencing Technologies


Chromatin Immunoprecipitation Sequencing (ChIP-Seq) data:
Besides read counts, the locations where the reads are mapped
are also known

Questions:
Peak detection
Two sample comparisons
Answers:
Borrow ideas from ChIP-chip analysis
Develop new methods based on discrete models [Ji et al.(2008)]

References I

Audic, S., and Claverie, J., (1997). The Significance of Digital Gene Expression
Profiles. Genome Research. 7, 986-995.
Benjamini, Y. and Hochberg, Y., (1995). Controlling the False Discovery Rate: a
Practical and Powerful Approach to Multiple Testing. J. Roy. Statist. Soc., Ser. B
57, 289-300.
Chen, L., Hung, H., and Chen, C., (2007). Maximum Average-Power (MAP) Tests.
Communications in Statistics 36, 2237-2249.
Draghici, S., Khatri, P., Eklund, A. C., and Szallasi, Z., (2006).
Reliability and Reproducibility Issues in DNA Microarray Measurements. Trends in
Genetics 22, 101-109.
Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empirical Bayes
Analysis of a Microarray Experiment. Journal of the American Statistical Association
96, 1151-1160.
Ji, H., Jiang, H., Ma, W., Johnson, D. S., Myers, R. M., and Wong, W. H., (2008).
An Integrated Software System for Analyzing ChIP-chip and ChIP-seq Data. Nature
Biotechnology 26, 1293-1300.
Jiang, H. and Wong, W. H., (2009). Statistical Inferences for Isoform Expression in
RNA-Seq. Bioinformatics 25, 1026-1032.


References II

Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y., (2008).
RNA-seq: An assessment of technical reproducibility and comparison with gene
expression arrays. Genome Research 18, 1509-1517.
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B., (2008).
Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq. Nature Methods
5, 621-628.
Oshlack, A. and Wakefield, M. J., (2009). Transcript Length Bias in RNA-seq Data
Confounds Systems Biology. Biology Direct 4, 14.
Smyth, G. K., (2004). Linear Models and Empirical Bayes Methods for Assessing
Differential Expression in Microarray Experiments. Statistical Applications in
Genetics and Molecular Biology 3, Article 3.
Storey, J. D. and Tibshirani, R., (2003). Statistical Significance for Genome-Wide
Studies. Proc. Natl. Acad. Sci. 100, 9440-9445.
Sultan, M., Schulz, M. H., and Richard, H., (2008). A Global View of Gene Activity
and Alternative Splicing by Deep Sequencing of the Human Transcriptome. Science
321, 956-960.
