5 Microarray PDF
The Flow of Genetic Information
Gene (Genomics) → mRNA (Transcriptomics) → Protein (Proteomics) → Function (enzyme, hormone etc.)
Protein Sequence / 3-D Structural Database
mRNA 5’ ACUGCACCAUGGGGCUCAGCGACGGGGAAUGGCACUUGGUG
Initiation signal (AUG), followed by the codons
Protein
Met-Gly-Leu-Ser-Asp-Gly-Glu-Trp-His-Leu-Val
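The reading of this mRNA can be checked with a short sketch. It looks for the first AUG and then translates codon by codon using the standard genetic code, in which GAA codes for Glu. The function name `translate` and the compact table encoding are illustrative choices, not something taken from the slides.

```python
from itertools import product

# Standard genetic code, bases ordered U, C, A, G at each codon position.
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)

# One-letter codes: M G L S D G E W H L V
print(translate("ACUGCACCAUGGGGCUCAGCGACGGGGAAUGGCACUUGGUG"))  # MGLSDGEWHLV
```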
DESCRIPTION OF A LIVING CELL / VIRUS
Metabolites
Growth rate
Expression
stem cells
cancer cells
microbes
Some useful signals on Genes
DNA
• Upstream activating sequences (UAS)
• TATA box
• m-RNA expression start & end
mRNA
• Ribosomal binding site
• Protein synthesis starts
• Protein synthesis stops
Protein
A typical gene in higher organisms
• Transcription start site
• Translation start site
• Exon (coding region)
• Donor model
• Intron (non-coding region)
• Acceptor model
• Stop codon
Alternative splicing leads to diversity
Transcription
start site
E1 I1 E2 I2 E3
E1 E2 E3
E1 I1 E2 E3
Human RNA-splice junctions sequence matrix
Genetic Regulation of Processes
(Regulation of Transcriptional Activity)
A Typical Genetic Regulatory Circuit
McAdams and Arkin, Proc. Natl. Acad. Sci., 1997, vol 94, 814-819
Newly identified members of Gal4 Regulatory Circuit
EC SC BS HI
P1 1 0 1
P2 1 1 0
P3 0 1 1
P4 1 0 0
P5 1 1 1
P6 0 1 1
P7 1 1 0
Microarray data
Coregulated sets
TCA cycle
B. subtilis purM purN purH purD
Axeldb www.dkfz-heidelberg.de/abt0135/axeldb.htm
Gene expression in Xenopus
BodyMap bodymap.ims.u-tokyo.ac.jp/
Human & mouse gene expression
FlyView pbio07.uni-muenster.de/
Drosophila
Interferon Stimulated Gene Database www.lerner.ccf.org/labs/williams/xchi-html.cgi
Genes induced by treatment with interferon
Stanford Microarray Database genome-www.stanford.edu/microarray
Raw & normalized data from various sources
RNA quantitation database integration
Microarrays (1): experiment and control samples hybridized to ORF probes (~1000 bp)
• R/G ratios
• R, G values
• quality indicators
Affymetrix (2): PM and MM probes per ORF (25-bp hybridization)
• Averaged PM-MM
• “presence”
• feature statistics
Image of hybridized probe array (streptavidin-phycoerythrin conjugate)
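Two of the quantities named on this slide, the R/G ratio and the averaged PM-MM, can be sketched as follows. The intensity values are invented for illustration, and reporting the ratio on a log2 scale is a common convention assumed here rather than something the slide states.

```python
import math

def log2_ratio(red, green):
    """log2 of the R/G ratio for one spot (assumes both intensities are positive)."""
    return math.log2(red / green)

def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Hypothetical intensities, for illustration only.
print(log2_ratio(800.0, 200.0))  # R/G = 4, i.e. 2.0 on the log2 scale
print(average_difference([120.0, 150.0, 90.0], [100.0, 110.0, 95.0]))  # (20 + 40 - 5) / 3
```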
Error Model for Microarray Data
Fawcett et al, Proc. Natl. Acad. Sci. USA (2000) 97, 8063-68
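Representation of expression data
Normalized expression data from microarrays: a matrix of genes (Gene 1, Gene 2, ..., Gene N) by time-points (T1, T2, T3) with entries dij.
[Figure: each gene drawn as a point in the space with axes Time-point 1 to Time-point 3]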
Cluster analysis of mRNA expression data
Protein/protein complex
Genes
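2, $r = 1$ (Manhattan distance):
$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$
Example: two points $x$ and $y$ that differ by 4 along one axis and by 3 along the other.
1, Euclidean distance: $\sqrt{4^2 + 3^2} = 5$.
2, Manhattan distance: $4 + 3 = 7$.
3, "sup" distance: $\max\{4, 3\} = 4$.
Manhattan distance is called Hamming distance when all features are binary.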
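The three distances in the worked example can be reproduced with a minimal sketch. The coordinates (0, 0) and (4, 3) are an assumption chosen to give the offsets 4 and 3 used above, and the helper names are illustrative.

```python
def minkowski(x, y, r):
    """Minkowski distance of order r (r = 1: Manhattan, r = 2: Euclidean)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def sup_distance(x, y):
    """Largest coordinate-wise difference (the "sup" distance)."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (4, 3)  # assumed coordinates with offsets 4 and 3
print(minkowski(x, y, 2))  # Euclidean: 5.0
print(minkowski(x, y, 1))  # Manhattan: 7.0
print(sup_distance(x, y))  # sup: 4
```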
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
GeneA 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1
GeneB 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1
Hamming distance: #(01) + #(10) = 4 + 1 = 5.
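The count can be verified directly from the two binary profiles. This is a small sketch; the variable names are illustrative.

```python
gene_a = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
gene_b = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]

# Positions where GeneA is 0 and GeneB is 1, and the reverse.
n01 = sum(1 for a, b in zip(gene_a, gene_b) if (a, b) == (0, 1))
n10 = sum(1 for a, b in zip(gene_a, gene_b) if (a, b) == (1, 0))
hamming = n01 + n10

print(n01, n10, hamming)  # 4 1 5
```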
Similarity Measures: Correlation Coefficient
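$$s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \, \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$
averages: $\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i$ and $\bar{y} = \frac{1}{p} \sum_{i=1}^{p} y_i$.
$$|s(x, y)| \le 1$$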
What kind of x and y give
(1) s(x,y)=1,
(2) s(x,y)=-1,
(3) s(x,y)=0 ?
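One way to explore the question is to compute s(x, y) for a few hand-picked profiles. In this sketch an increasing linear relation gives a value of 1, a decreasing linear relation gives -1, and a profile with no linear association with x gives 0. The example vectors are made up for illustration.

```python
import math

def pearson(x, y):
    """Correlation coefficient s(x, y) as defined above (assumes neither profile is constant)."""
    p = len(x)
    mean_x, mean_y = sum(x) / p, sum(y) / p
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
    return num / den

x = [1.0, 2.0, 3.0, 4.0]
s_up = pearson(x, [3.0, 5.0, 7.0, 9.0])       # y = 2x + 1
s_down = pearson(x, [-1.0, -2.0, -3.0, -4.0])  # y = -x
s_none = pearson(x, [1.0, -1.0, -1.0, 1.0])    # no linear trend in x
print(s_up, s_down, s_none)
```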
Similarity Measures: Correlation Coefficient
[Figure: expression level versus time for Gene A and Gene B]
Pattern recognition &
normalization
Clustering methods
Complete-Link Method (Euclidean distance)
Merge sequence: (1) {a,b}, c, d  (2) {a,b}, {c,d}  (3) {a,b,c,d}

Distance Matrix
      b  c  d
a     2  5  6
b        3  5
c           4

After merging a and b:
      c  d
a,b   5  6
c        4

After merging c and d:
      c,d
a,b   6
Compare Dendrograms
Single-Link vs. Complete-Link
[Figure: one dendrogram per method over the leaves a, b, c, d; height axis from 0 to 6]
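The difference between the two dendrograms can be traced with a small agglomerative sketch over the four-point distance matrix (a-b 2, a-c 5, a-d 6, b-c 3, b-d 5, c-d 4). The function `agglomerate` and its interface are illustrative assumptions. Passing `max` gives complete-link and passing `min` gives single-link.

```python
def agglomerate(dist, linkage):
    """Repeatedly merge the two closest clusters.

    dist maps frozenset({u, v}) to the distance between items u and v.
    linkage is max (complete-link) or min (single-link).
    Returns a list of (merged cluster, merge height) pairs.
    """
    items = sorted({i for pair in dist for i in pair})
    clusters = [frozenset([i]) for i in items]

    def between(c1, c2):
        return linkage(dist[frozenset([u, v])] for u in c1 for v in c2)

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda pair: between(*pair),
        )
        merges.append((sorted(c1 | c2), between(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

D = {frozenset("ab"): 2, frozenset("ac"): 5, frozenset("ad"): 6,
     frozenset("bc"): 3, frozenset("bd"): 5, frozenset("cd"): 4}

complete = agglomerate(D, max)
single = agglomerate(D, min)
print(complete)  # merges at heights 2, 4, 6
print(single)    # merges at heights 2, 3, 4
```

Under complete-link, c and d pair up before joining {a,b}; under single-link, c chains onto {a,b} first and d joins last.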
Which clustering methods do you suggest
for the following two-dimensional data?
Problems of Hierarchical Clustering
• It is concerned more with the complete tree structure than with the optimal number of clusters.
• There is no possibility of correcting for a poor initial partition.
• Similarity and distance measures rarely have strict numerical significance.
Non-hierarchical clustering
Normalized Expression Data
Interpreting Patterns of Gene Expression
with Self Organizing Maps
Tamayo et al, Proc. Natl. Acad. Sci. USA, 1999, Vol 96, 2907
SOM algorithm
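• The initial mapping $f_0$ of the nodes is random.
• At each iteration $i$, a data point $P$ is selected and the node $N_P$ that maps closest to $P$ is identified.
• The mapping of the nodes is then adjusted by the formula
$$f_{i+1}(N) = f_i(N) + \tau(d(N, N_P), i)\,(P - f_i(N))$$
where the learning rate $\tau$ decreases with the distance of node $N$ from $N_P$ and with the iteration number $i$.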
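The update rule can be sketched in a few lines. This is a toy illustration under stated assumptions, not the procedure of Tamayo et al.: the learning-rate schedule (a linearly decaying rate with a hard, linearly shrinking neighborhood radius), the grid size, and the synthetic three-time-point profiles are all invented for the example.

```python
import math
import random

def train_som(points, grid_w, grid_h, n_iter=2000, seed=0):
    """Return a dict mapping each grid node to its position in data space."""
    rng = random.Random(seed)
    dim = len(points[0])
    nodes = [(gx, gy) for gx in range(grid_w) for gy in range(grid_h)]
    # f_0: random initial mapping of the nodes
    f = {n: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for n in nodes}
    for i in range(n_iter):
        p = rng.choice(points)                              # data point P
        n_p = min(nodes, key=lambda n: math.dist(f[n], p))  # closest node N_P
        progress = i / n_iter
        alpha = 0.5 * (1.0 - progress)                      # assumed rate schedule
        radius = max(grid_w, grid_h) * (1.0 - progress)     # assumed shrinking radius
        for n in nodes:
            # tau(d(N, N_P), i): nonzero only inside the current radius
            tau = alpha if math.dist(n, n_p) <= radius else 0.0
            f[n] = [fv + tau * (pv - fv) for fv, pv in zip(f[n], p)]
    return f

def quantization_error(f, points):
    """Mean distance from each data point to its closest node."""
    return sum(min(math.dist(v, p) for v in f.values()) for p in points) / len(points)

# Synthetic profiles around two made-up patterns over three time-points.
data_rng = random.Random(1)
patterns = [(1.0, 0.0, -1.0), (-1.0, 0.0, 1.0)]
genes = [[c + data_rng.uniform(-0.05, 0.05) for c in patterns[k % 2]]
         for k in range(40)]

initial = train_som(genes, 2, 1, n_iter=0)  # untrained: the random f_0 only
trained = train_som(genes, 2, 1)
print(quantization_error(initial, genes), quantization_error(trained, genes))
```

With two well-separated patterns and two nodes, each node should settle near one pattern, so the mean distance from a gene to its closest node drops well below its starting value.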
Identifying prevalent expression patterns (gene clusters)
[Figure: normalized expression data from microarrays (Gene 1, Gene 2, ..., Gene N; entries dij) plotted on axes Time-point 1 to Time-point 3, with three panels of normalized expression versus time-point (1, 2, 3), one per gene cluster]
Eisen et al, Proc. Natl. Acad. Sci. USA, 1998, Vol 95, 14863
Hierarchical Clustering of Genes from Expression Data
Red=up-regulated, green=down-regulated
Gene Disruption Studies in Yeast
[Figure: matrix with axes labeled mutants and genes]
Gleaning information from the Cancer databases at NCI
• 60 cell lines
• 100 targets
• 60k compounds
A · T’ = A.T’
"Clustered correlation" map of compounds & molecular targets
[Figure: maps with axes compounds and targets (targets numbered 1 to 113)]
Correlation between structure descriptors and targets in the S’.(AT’) database
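One way to read the product A · T’ is as a table of correlations between every compound and every target across the cell lines. The sketch below assumes that each row of A (compound activity) and of T (target level) is standardized across cell lines before multiplying, so that each entry of the product is a Pearson correlation. That preprocessing, the function names, and the toy numbers are assumptions for illustration.

```python
import math

def standardize(row):
    """Center a row and scale it to unit (population) standard deviation."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / n)
    return [(v - mean) / sd for v in row]

def correlation_map(A, T):
    """Rows of A: compounds; rows of T: targets; columns of both: cell lines.

    Returns the compounds-by-targets matrix (A_z . T_z') / n of correlations.
    """
    n = len(A[0])
    A_z = [standardize(r) for r in A]
    T_z = [standardize(r) for r in T]
    return [[sum(a * t for a, t in zip(ra, rt)) / n for rt in T_z] for ra in A_z]

# Toy data: 2 compounds and 2 targets measured over 4 hypothetical cell lines.
A = [[1.0, 2.0, 3.0, 4.0],
     [4.0, 3.0, 2.0, 1.0]]
T = [[2.0, 4.0, 6.0, 8.0],
     [1.0, -1.0, -1.0, 1.0]]

AT = correlation_map(A, T)
for row in AT:
    print([round(v, 3) for v in row])
```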
Scherf et al, Nature Genetics (2000) 24, 236-44
Hierarchical clustering of human cancer cell lines
Based on gene expression profiles
Based on sensitivity to 1400 compounds tested
Clustered Correlation for A.T’ database (axes: drugs and genes)
Distinct Types of Diffuse Large B-Cell Lymphoma
Identified by Gene Expression Profiling