From Microarray To RNA-Seq: A Review of Transcriptome Analysis With Next-Generation Sequencing Data
Department of Statistics
Iowa State University
Data Model Normalization Two Sample Test More Question
Outline
1 Data
2 Model
3 Normalization
4 Two Sample Test
5 More Questions
History
[Figure labels: RNA/DNA Fragments; Polony Array]
[Figure: short reads aligned to three genes, with read counts N = 4 for Gene 3, N = 3 for Gene 2, and N = 4 for Gene 1.]
Multireads
Multiread: a read aligned to more than one location
Longer reads yield fewer multireads
Most multireads are mapped to repeating regions [Ji et al.(2008)]
Example: 25 bp multireads and their allocation [Mortazavi et al.(2008)]:
1 site (76%)
2-10 sites (11%)
11+ sites (13%)
Reproducibility
Poisson Model
$N_g \sim \mathrm{Poisson}(\lambda_g)$.
Multinomial Model
$N \sim \mathrm{Multinomial}(n, p)$,
where $n = \sum_g N_g$.
For each gene,
$N_g \sim \mathrm{Binomial}(n, p_g)$.
Since $n$ is large and $p_g$ is small, approximately
$N_g \sim \mathrm{Poisson}(\lambda_g)$,
with $\lambda_g = n p_g$.
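The Poisson approximation to the binomial can be checked numerically. The sketch below is only an illustration: the values of n and p_g are hypothetical, not taken from any data set in these slides, and the helper names are made up for this example. It compares the two probability mass functions over the first 31 counts.

```python
import math

def binomial_pmf(k, n, p):
    """P(N_g = k) when N_g ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(N_g = k) when N_g ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical values: n total mapped reads, p_g the share belonging to gene g.
n = 1_000_000
p_g = 5e-6
lam_g = n * p_g  # lambda_g = n * p_g

# Largest pointwise gap between the two distributions for k = 0, ..., 30.
max_abs_diff = max(
    abs(binomial_pmf(k, n, p_g) - poisson_pmf(k, lam_g)) for k in range(31)
)
print(f"lambda_g = {lam_g}, max |Binomial - Poisson| = {max_abs_diff:.2e}")
```

With n this large and p_g this small, the gap is negligible next to the probabilities themselves, which is the regime the slide relies on.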
Background Correction
Genome = Coding Region + Non-Coding Region
Reads mapped to coding region: signal + background
Reads mapped to non-coding region: pure background
Density of mapped reads in the non-coding region:
$\frac{\text{Reads \#}}{\text{Length}} = \frac{631271}{1002002158} = 0.00063$.
Density of mapped reads in the coding region:
$\frac{\text{Reads \#}}{\text{Length}} = \frac{12739385}{54916879} = 0.2319758$.
Signal-to-Noise Ratio:
$\mathrm{SNR} = \frac{0.2319}{0.00063} \approx 368$.
With signal roughly 368 times the background, background correction is not critical
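The arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch using the read counts and region lengths quoted above; the function name `read_density` is made up for the example.

```python
def read_density(n_reads, region_length_bp):
    """Mapped reads per base pair of the region."""
    return n_reads / region_length_bp

# Counts and lengths as quoted on the slide.
noncoding_density = read_density(631_271, 1_002_002_158)
coding_density = read_density(12_739_385, 54_916_879)

# Signal-to-noise ratio: coding density relative to pure background.
snr = coding_density / noncoding_density

print(f"non-coding density: {noncoding_density:.5f}")
print(f"coding density:     {coding_density:.7f}")
print(f"SNR:                {snr:.0f}")
```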
Normalization: RPKM
$\mathrm{RPKM}_g = \frac{N_g}{L_g \sum_{g'} N_{g'}} \times 10^9$
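A minimal sketch of the RPKM formula follows, assuming L_g is the gene length in base pairs (so the factor 10^9 converts to reads per kilobase per million mapped reads). The gene names, counts, and lengths are hypothetical and serve only to exercise the function.

```python
def rpkm(counts, lengths_bp):
    """RPKM_g = N_g / (L_g * sum of all N_g) * 1e9 for every gene g."""
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] / (lengths_bp[gene] * total_reads) * 1e9
        for gene in counts
    }

# Hypothetical read counts N_g and gene lengths L_g (in bp).
counts = {"gene_a": 100, "gene_b": 300, "gene_c": 600}
lengths_bp = {"gene_a": 1000, "gene_b": 2000, "gene_c": 3000}

rpkm_values = rpkm(counts, lengths_bp)
for gene, value in rpkm_values.items():
    print(f"{gene}: {value:.1f}")
```

In this toy input gene_c has six times the reads of gene_a but only twice the RPKM, because it is three times as long.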
$N_g \sim \mathrm{Poisson}(C_g \lambda_g^*)$
                # Mapped Reads
Gene ID     Sample 1      Sample 2
AC147602     81    95     163   140
AC147603    172   190      59    38
AC147604     12    16       0     0
AC147605      3     0     102    92
···         ···   ···     ···   ···
P-Values
[Figure: two density histograms of p-values; x-axis P from 0.0 to 1.0, y-axis Density.]
FDR Control
FDR can be controlled by adjusting p-values with the procedure of
[Benjamini & Hochberg (1995)] or [Storey and Tibshirani (2003)]
FDR is often underestimated [Ji et al.(2008)]
RNA-Seq usually detects more genes than microarray does
[Figure: genes detected by RNA-Seq versus microarray]
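As an illustration of the first option, here is a minimal sketch of Benjamini-Hochberg adjusted p-values in plain Python. The input p-values are hypothetical and the function name is made up for this example; the Storey-Tibshirani q-value procedure is not shown.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(pvalues)

# Genes declared differentially expressed at a 5% FDR threshold.
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(adjusted, significant)
```

Declaring a gene significant when its adjusted p-value is at most the target FDR is equivalent to the usual step-up rule on the sorted raw p-values.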
More Questions
Questions:
Peak detection
Two sample comparisons
Answers:
Borrow ideas from ChIP-chip analysis
Develop new methods based on discrete models [Ji et al.(2008)]
References I
Audic, S., and Claverie, J., (1997). The Significance of Digital Gene Expression
Profiles. Genome Research 7, 986-995.
Benjamini, Y. and Hochberg, Y., (1995). Controlling the False Discovery Rate: a
Practical and Powerful Approach to Multiple Testing. J. Roy. Statist. Soc., Ser. B
57, 289-300.
Chen, L., Hung, H., and Chen, C., (2007). Maximum Average-Power (MAP) Tests.
Communications in Statistics 36, 2237-2249.
Draghici, S., Khatri, P., Eklund, A. C., and Szallasi, Z., (2005).
Reliability and Reproducibility Issues in DNA Microarray Measurements. Trends in
Genetics Nov, 1-9.
Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V., (2001). Empirical Bayes
Analysis of a Microarray Experiment. Journal of the American Statistical Association
96, 1151-1160.
Ji, H., Jiang, H., Ma, W., Johnson, D. S., Myers, R. M., and Wong, W. H., (2008).
An Integrated Software System for Analyzing ChIP-chip and ChIP-seq Data. Nature
Biotechnology 26, 1293-1300.
Jiang, H. and Wong, W. H., (2009). Statistical Inferences for Isoform Expression in
RNA-Seq. Bioinformatics 25, 1026-1032.
References II
Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y., (2008).
RNA-seq: An assessment of technical reproducibility and comparison with gene
expression arrays. Genome Research 18, 1509-1517.
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B., (2008).
Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq. Nature Methods
5, 621-628.
Oshlack, A. and Wakefield, M. J., (2009). Transcript Length Bias in RNA-seq Data
Confounds Systems Biology. Biology Direct 4, 14.
Smyth, G. K., (2004). Linear Models and Empirical Bayes Methods for Assessing
Differential Expression in Microarray Experiments. Statistical Applications in
Genetics and Molecular Biology 3, Article 3.
Storey, J. D. and Tibshirani, R., (2003). Statistical Significance for Genome-Wide
Studies. Proc. Natl. Acad. Sci. 100, 9440-9445.
Sultan, M., Schulz, M. H., and Richard, H., (2008). A Global View of Gene Activity
and Alternative Splicing by Deep Sequencing of the Human Transcriptome. Science
321, 956-960.