Lecture4 Expression - Analysis 2019
Lecture 4
Functional Genomics
Gavin Sherlock
January 29th 2019
ChIP-Seq
• Sonicate DNA to produce sheared, soluble chromatin
• Immunoprecipitate and purify immunocomplexes
• Reverse cross-links, and purify DNA
• Sequence
ChIP-Seq Data
Peak Calling
ChIP-exo
• ChIP-exo improves on the resolution of ChIP-Seq
• ChIP DNA is treated with an exonuclease, to digest away unprotected sequences
Figure: ChIP and Exo fragments, with 5’ and 3’ strand ends marked.
Figure (flow diagram):
• Total RNA
• DNase treatment
• Oligo-dT beads (mRNA 5’ AAA(A)n 3’)
• cDNA:mRNA hybrid
• Second strand synthesis: dNTPs, RNase H, E. coli polymerase I
• Bacteriophage T4 DNA ligase
• Double stranded cDNA
Figure (flow diagram):
• Fragment RNA
• Double stranded cDNA
van Gurp TP, McIntyre LM, Verhoeven KJ. (2013). Consistent errors in first
strand cDNA due to random hexamer mispriming. PLoS One 8(12):e85583.
Using dUTP to retain strand specificity
Figure labels: mRNA, fragment, UNG treatment, Ad #1 and Ad #2.
Sharon D, Tilgner H, Grubert F, Snyder M. (2013). A single-molecule long-read survey of the human transcriptome.
Nat Biotechnol 31(11):1009-14
TIF-Seq
• Transcript Isoform Sequencing
• Does not capture exonic structure
• Instead captures 5’ and 3’ ends of
transcripts
• From only ~6,000 genes in yeast, almost 2
million unique transcript isoforms identified
• 371,087 major TIFs identified genome-wide
Pelechano V, Wei W, Steinmetz LM. (2013). Extensive transcriptional
heterogeneity revealed by isoform profiling. Nature 497(7447):127-31.
Analysis and visualization of
expression data
Visualizing Data
Figure: expression values for a set of yeast genes (MAK16 YAL025C, YBL015W ACH1, YBL048W, YBL049W, YBL064C, YBL078C, YBR072W HSP26, YBR139W, YBR147W, YCR021C HSP30, YDL085W, YDL204W, YDL208W NHP2) plotted across samples taken at OD 0.26, 0.46, 0.80, 1.80, 3.70, 6.90 and 7.30.
Extracting Data
Figure: RNA-Seq data as a matrix of Genes by Experiments.

Cy3     Cy5     Cy5/Cy3   log2(Cy5/Cy3)
200     10000   50.00      5.64
4800    4800     1.00      0.00
9000    300      0.03     -4.91
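The log ratio column can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides; the function name `log2_ratio` and the list `spots` are illustrative names, and the intensities are the three Cy3/Cy5 pairs shown on this slide.

```python
import math

def log2_ratio(cy5: float, cy3: float) -> float:
    """Return log2(Cy5 / Cy3); both intensities must be positive."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

# (Cy3, Cy5) intensity pairs from the slide.
spots = [(200, 10000), (4800, 4800), (9000, 300)]

log_ratios = [round(log2_ratio(cy5, cy3), 2) for cy3, cy5 in spots]
print(log_ratios)  # [5.64, 0.0, -4.91]
```

Taking the logarithm puts induction and repression on a comparable scale: a 50-fold increase maps to +5.64 and a 30-fold decrease to -4.91, whereas the raw ratios (50.00 and 0.03) are very asymmetric around 1.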
Visualizing Data (cont.)
Figure: Expression During Sporulation. Log Ratio (y-axis, -4 to 5) against Time (hours) (x-axis, 0 to 10); legend Series1 to Series51.
Organizing Data
In expression studies,
we often use clustering
algorithms to help us
identify patterns in
complex data.
• x = log(ratio) for expt1
• y = log(ratio) for expt2
• z = log(ratio) for expt3
• etc.
Figure: 3-D plot with axes x, y and z; label: Similar expression.
Distance metrics
• Distances or correlations are measured
“between” expression vectors
• Euclidean distance
• Pearson correlation coefficient(s)
• Spearman’s Rank Correlation
• Manhattan distance
• Mutual information
• Kendall’s Tau
• etc.
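Four of the listed metrics can be sketched in plain Python as follows. This is an illustrative sketch rather than the lecture's own code: the function names and the example vectors `gene_a`, `gene_b` and `gene_c` are assumptions, and mutual information and Kendall's tau are left out for brevity.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences across experiments."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

def ranks(v):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    result = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical log-ratio vectors over three experiments.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
gene_c = [3.0, 1.0, 2.0]

print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b))
print(pearson(gene_a, gene_b), spearman(gene_a, gene_c))
```

Euclidean and Manhattan are distances (smaller means more similar), while Pearson and Spearman are correlations (larger means more similar), so a clustering program typically converts a correlation r to a distance such as 1 - r before use.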
• … in space. In this … C are closest.
Figure: Gene A and Gene B plotted with EXPERIMENT 1 on the x-axis (0 to 2.5) and EXPERIMENT 2 on the y-axis.
Pearson correlation
• The Pearson correlation disregards the magnitude of the vectors but instead compares their directions. In this …
Figure: Gene A, with EXPERIMENT 2 on the y-axis.
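To illustrate the contrast between closeness in space and similarity of direction, the sketch below computes the Pearson correlation as the cosine of the angle between mean-centered vectors. The three gene vectors are hypothetical values chosen for illustration, not data from the lecture: `gene_b` has the same shape as `gene_a` at twice the magnitude, while `gene_c` lies close to `gene_a` but has a different shape.

```python
import math

def center(v):
    """Subtract the mean so the vector describes shape only."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def pearson(a, b):
    """Pearson r as the cosine between the mean-centered vectors."""
    ca, cb = center(a), center(b)
    dot = sum(x * y for x, y in zip(ca, cb))
    return dot / (math.hypot(*ca) * math.hypot(*cb))

# Hypothetical log ratios over four experiments.
gene_a = [0.2, 0.6, 1.0, 1.4]
gene_b = [0.4, 1.2, 2.0, 2.8]  # same shape as A, twice the magnitude
gene_c = [0.4, 0.5, 1.1, 1.2]  # near A in space, different shape

for name, other in (("B", gene_b), ("C", gene_c)):
    print(name,
          "Euclidean:", round(math.dist(gene_a, other), 3),
          "Pearson:", round(pearson(gene_a, other), 3))
```

With these values, C is the nearest neighbor of A by Euclidean distance, while B is the best match by Pearson correlation, so the choice of metric changes which genes end up grouped together.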
Single Linkage Clustering
• Nearest Neighbor
• This method produces long chains which form straggly clusters.
Complete Linkage Clustering
• Uses the Furthest Neighbor
• This method tends to produce very tight clusters of similar patterns
Average Linkage Clustering
• Average (only shown for two cases)
• The red and blue ‘+’ signs mark the centroids of the two clusters.
Centroid Linkage Clustering
• Centroid
• The red and blue ‘+’ signs mark the centroids of the two clusters.
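The four linkage rules differ only in how the distance between two clusters is derived from the distances between their members. The sketch below uses the usual definitions (nearest pair, furthest pair, mean over all pairs, and distance between centroids) with Euclidean distance; the function names and the two small clusters `red` and `blue` are illustrative assumptions.

```python
import math
from itertools import product

def single_linkage(c1, c2):
    """Distance between the two nearest members (nearest neighbor)."""
    return min(math.dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the two furthest members (furthest neighbor)."""
    return max(math.dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):
    """Mean distance over all pairs with one member from each cluster."""
    distances = [math.dist(p, q) for p, q in product(c1, c2)]
    return sum(distances) / len(distances)

def centroid(cluster):
    """Coordinate-wise mean of the cluster members."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_linkage(c1, c2):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(c1), centroid(c2))

# Two hypothetical clusters of two points each.
red = [(0.0, 0.0), (0.0, 2.0)]
blue = [(3.0, 0.0), (3.0, 2.0)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, round(rule(red, blue), 3))
```

For the same pair of clusters the single-linkage distance is never larger than the average, and the average is never larger than the complete-linkage distance, which is one way to see why single linkage merges readily into chains while complete linkage keeps clusters tight.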
And we get a cluster:
Figure panels: Single, Complete, Average, Centroid
Two-way clustering
• Just as gene vectors are clustered,
experiment vectors can be clustered.
• All the data points for an experiment can be
used to construct a vector and the vectors of
multiple experiments can be compared.
Two-way Clustering
Two-way clustering can help show which
samples are most similar, as well as which
genes.
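In code, switching from gene vectors to experiment vectors amounts to transposing the data matrix. The sketch below assumes a small hypothetical matrix of log ratios (rows are genes, columns are experiments); all names and values are made up for illustration.

```python
import math
from itertools import combinations

genes = ["gene1", "gene2", "gene3"]
experiments = ["expt1", "expt2", "expt3", "expt4"]

# Hypothetical log ratios: one row per gene, one column per experiment.
log_ratios = [
    [1.0, 1.1, -0.5, -0.7],
    [2.0, 1.9, 0.1, 0.0],
    [-1.0, -0.9, 1.5, 1.3],
]

# Gene vectors are the rows; experiment vectors are the columns.
gene_vectors = dict(zip(genes, log_ratios))
experiment_vectors = {
    name: list(column) for name, column in zip(experiments, zip(*log_ratios))
}

def closest_pair(vectors):
    """Return the two names whose vectors have the smallest Euclidean distance."""
    return min(
        combinations(vectors, 2),
        key=lambda pair: math.dist(vectors[pair[0]], vectors[pair[1]]),
    )

print("most similar genes:", closest_pair(gene_vectors))
print("most similar experiments:", closest_pair(experiment_vectors))
```

The same distance metric and clustering routine can then be run once on `gene_vectors` and once on `experiment_vectors` to order both axes of the displayed matrix.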
Agglomerative Hierarchical
Clustering
Advantages:
• Simple
• Easy to implement
• Easy to visualize
Disadvantages:
• Can lead to artifacts
• Discarding of subtleties in 2-way clustering
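The slides do not spell out the procedure, so the following is only a sketch of the standard agglomerative loop: start with every item in its own cluster, repeatedly merge the two closest clusters, and record each merge. Average linkage with Euclidean distance is assumed here; the function names and the five example points are hypothetical.

```python
import math
from itertools import combinations

def average_linkage(a, b, points):
    """Mean Euclidean distance over all pairs, one index from each cluster."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerate(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    Clusters are tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, average_linkage(a, b, points)))
        # Replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical two-experiment expression vectors.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = agglomerate(points)
for a, b, d in merges:
    print(a, "+", b, f"at distance {d:.2f}")
```

The merge history is exactly what a dendrogram draws. It also shows where artifacts can come from: every item is forced into the tree, and a merge is never undone, even when a later item would have fit better elsewhere.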
Partitioning Methods
• Split data up into smaller, more homogeneous
sets
• Should avoid artifacts associated with
incorrectly joining dissimilar vectors
• Can cluster each partition independently of
others, by genes and arrays
• K-means clustering and Self-Organizing
Maps are two possible partitioning methods
K-means Clustering
• Split data into ‘n’ partitions, each with
an associated vector.
• Assign genes to partitions, and
recalculate the vector associated with
each partition as the centroid of its
associated genes.
• Repeat until solution converges, or for
a fixed number of iterations.
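The three steps above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and initial partition vectors drawn at random from the data (the slides do not say how they are chosen); the function name `kmeans`, its parameters, and the example vectors are hypothetical.

```python
import math
import random

def kmeans(vectors, n, iterations=100, seed=0):
    """Partition `vectors` into `n` groups; return (assignment, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from n distinct data vectors picked at random.
    centroids = rng.sample(vectors, n)
    assignment = None
    for _ in range(iterations):
        # Assign each gene to the partition with the nearest vector.
        new_assignment = [
            min(range(n), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no gene changed partition
        assignment = new_assignment
        # Recalculate each partition vector as the centroid of its genes.
        for k in range(n):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

# Hypothetical expression vectors forming two well-separated groups.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = kmeans(vectors, n=2)
print(assignment)
print(centroids)
```

The result can depend on the starting vectors and on the chosen number of partitions, so in practice the procedure is often run several times with different seeds.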
Self Organizing Maps
• Create a ‘Map’ of ‘n’ partitions, that
is modeled on the expression data,
where each partition in the map has
an associated vector.
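The slide gives only the idea of the map, not the training rule. The sketch below fills that in with the commonly used SOM update, which is an assumption here: for each data vector, find the best-matching partition and pull it and its neighbors on the map toward that vector, with a learning rate and neighborhood radius that shrink over time. A one-dimensional map is used for brevity; all names, parameters and data are hypothetical.

```python
import math
import random

def train_som(vectors, n, epochs=50, seed=0):
    """Train a 1-D map of `n` partitions; return their associated vectors."""
    rng = random.Random(seed)
    # Assumption: initialize each partition vector from a random data vector.
    nodes = [list(rng.choice(vectors)) for _ in range(n)]
    for epoch in range(epochs):
        remaining = 1.0 - epoch / epochs          # decays from 1 toward 0
        rate = 0.5 * remaining                    # learning rate
        radius = max(0.5, (n / 2) * remaining)    # neighborhood width on the map
        for v in rng.sample(vectors, len(vectors)):  # shuffled pass over the data
            winner = min(range(n), key=lambda k: math.dist(v, nodes[k]))
            for k in range(n):
                # Partitions near the winner on the map move more.
                influence = math.exp(-((k - winner) ** 2) / (2 * radius ** 2))
                nodes[k] = [w + rate * influence * (x - w)
                            for w, x in zip(nodes[k], v)]
    return nodes

def assign(vectors, nodes):
    """Index of the best-matching partition for each vector."""
    return [min(range(len(nodes)), key=lambda k: math.dist(v, nodes[k]))
            for v in vectors]

# Hypothetical expression vectors.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
nodes = train_som(vectors, n=3)
print(assign(vectors, nodes))
```

Unlike K-means, the partitions are tied together by their positions on the map, so neighboring partitions tend to end up with similar vectors.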
Nucleus
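$$\text{P-value} = 1 - \sum_{i=0}^{x-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$

where N is the population size, M the number of population members in the category of interest, n the sample size, and x the number of sample members in that category.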
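The P-value on this slide has the form of a hypergeometric tail probability and can be evaluated directly with exact binomial coefficients. The sketch below assumes the standard hypergeometric denominator C(N, n) and the usual reading of the symbols for this kind of analysis, which the surviving slide text does not define: N genes in total, M of them annotated to a category, n genes in the cluster, and x cluster genes carrying the annotation. The function name `hypergeom_pvalue` is illustrative.

```python
from math import comb

def hypergeom_pvalue(N: int, M: int, n: int, x: int) -> float:
    """Probability of seeing x or more annotated genes among n drawn.

    N: total genes, M: annotated genes, n: genes drawn, x: annotated genes drawn.
    """
    if not (0 <= M <= N and 0 <= n <= N and 0 <= x <= min(n, M)):
        raise ValueError("arguments are not consistent with each other")
    total = comb(N, n)
    # Probability of observing fewer than x annotated genes.
    below = sum(comb(M, i) * comb(N - M, n - i) for i in range(x))
    return 1 - below / total

# Hypothetical example: 10 genes, 5 annotated, a cluster of 5 that are all annotated.
print(hypergeom_pvalue(N=10, M=5, n=5, x=5))
```

Because the result is computed as one minus a sum that can be very close to one, extremely small P-values lose precision in floating point; summing the upper tail from x to min(n, M) instead avoids that.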