Clustering Tutorial May
Clustering
Contents
Data sets
Distance and similarity metrics
K-means clustering
Hierarchical clustering
Evaluation of clustering results
Introduction to clustering
Starting from a set of objects, group them into classes, without any prior description of
these classes
hierarchical
k-means
self-organizing maps (SOM)
knn
...
clustering method
similarity or dissimilarity metric
additional parameters specific to each clustering method (e.g. number of centres for
k-means, agglomeration rule for hierarchical clustering, ...)
Data sets
Diauxic shift
DeRisi et al. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science (1997) vol. 278 (5338) pp. 680-6
Spellman et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol
Biol Cell (1998) vol. 9 (12) pp. 3273-97
Gasch et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell (2000) vol. 11 (12) pp. 4241-57.
Gasch et al. (2000), 173 chips (stress response, heat shock, drugs, carbon source, ...)
We selected the 13 chips with the response to different carbon sources.
For the cell cycle experiments, genes had already been filtered in the original
publication. We used the 800 selected genes for the analysis.
For the diauxic shift and carbon source experiments, each chip contains >6000
genes, most of which are unregulated.
Standardization
Filtering
Workflow: Gene expression profiles -> Chip-wise standardization -> Z-scores -> Threshold filtering -> Profiles of regulated genes

Carbon sources (first ORFs shown)
ORF         ethanol  galactose    glucose    mannose  raffinose    sucrose
YAL066W        0.71      -1.87      -1.15      -4.90      -1.85      -3.81
YAR008W       -2.70       0.36       0.03      -4.90      -0.94      -0.53
YAR071W       -5.43      -1.22       2.73      -0.44      -0.24       3.24
YBL005W        1.40       3.05       3.97       4.92       1.18       5.52
YBL015W        4.00       0.28      -3.46      -3.65      -2.38      -4.94
YBL043W        3.91      -1.16      -4.89      -4.90      -1.61      -4.76
YBR018C       -9.68       5.53      -8.66     -11.19     -13.49     -10.23
YBR019C       -9.68       6.16      -7.77     -11.19     -12.09      -9.17
YBR020W       -9.68       6.05      -8.66     -11.19     -13.49     -10.23
...

Carbon sources vs. reference pool (chips named <carbon source>.vs.ref.pool)
ORF         ethanol   fructose  galactose    glucose    mannose  raffinose    sucrose
YAL066W       -3.34      -0.96      -3.41      -1.04       0.36      -1.55      -0.87
YAR008W        0.64      -0.53      -2.57      -0.73       0.38      -1.75      -0.55
YAR071W       -6.69       1.10      -5.21       1.39      -0.70       0.22       2.94
YBL005W       -0.53       0.79      -0.84      -1.00       1.12      -2.26       1.23
YBL015W        3.26      -4.64       0.59      -3.76      -1.62       1.08      -5.37
YBL043W        4.47      -6.97      -0.61      -6.67      -7.12       0.78      -9.73
YBR018C       -9.81     -15.15       6.32     -10.89     -13.01     -12.10     -13.73
YBR019C       -9.42     -12.93       6.07     -10.58     -10.90      -9.08     -11.97
YBR020W      -10.04     -12.70       6.83     -12.82     -13.01      -8.95     -14.94
...
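The standardization and filtering steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tutorial does not say which estimators or which threshold were used, so the mean and standard deviation, the threshold of 1.5, the function names and the toy values below are all hypothetical.

```python
from statistics import mean, stdev

def standardize_chipwise(profiles):
    """Turn each chip (column) into z-scores: (x - column mean) / column sd.

    Assumption: mean and standard deviation are used as estimators.
    """
    genes = list(profiles)
    columns = list(zip(*(profiles[g] for g in genes)))
    centres = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    return {
        g: [(x - m) / s for x, m, s in zip(profiles[g], centres, spreads)]
        for g in genes
    }

def filter_regulated(z_scores, threshold=1.5, min_chips=1):
    """Keep genes whose |z| reaches the threshold on at least min_chips chips."""
    return {
        g: z for g, z in z_scores.items()
        if sum(abs(v) >= threshold for v in z) >= min_chips
    }

# Toy log-ratios (hypothetical): 6 genes x 2 chips
raw = {
    "geneA": [0.1, 0.0],
    "geneB": [-0.1, 0.1],
    "geneC": [0.0, -0.1],
    "geneD": [0.2, -4.0],
    "geneE": [-0.2, 0.2],
    "geneF": [5.0, 0.1],
}

z = standardize_chipwise(raw)
regulated = filter_regulated(z, threshold=1.5)
print(sorted(regulated))  # ['geneD', 'geneF']
```

With only six genes a z-score cannot exceed about 2, which is why the toy threshold is low; on a chip with more than 6000 genes a stricter threshold would be the natural choice.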
Hierarchical clustering
Eisen et al. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A (1998)
vol. 95 (25) pp. 14863-8
Workflow: Profiles of regulated genes -> Pairwise distance calculation -> Distance matrix

Distance matrix (excerpt)
ORF       YBR018C  YBR019C  YBR020W  YBR054W  ...
YAL066W     36.41    32.52    36.07    12.00  ...
YAR008W     37.51    33.46    37.18    12.36  ...
YAR071W     42.48    38.48    42.15    21.09  ...
YBL005W     44.95    41.16    44.62    17.86  ...
YBL015W     34.47    30.79    33.77     6.46  ...
YBL043W     31.74    28.64    30.90    11.13  ...
YBR018C      0.00     5.12     4.66    35.84  ...
YBR019C      5.12     0.00     4.81    32.58  ...
YBR020W      4.66     4.81     0.00    35.63  ...
YBR054W     35.84    32.58    35.63     0.00  ...
...
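As an illustration of the pairwise distance calculation, the sketch below computes Euclidean distances between the 13-chip profiles of YBR018C, YBR019C and YBR020W, using the values listed in the carbon source table. The choice of the Euclidean metric is an assumption; it reproduces the three distances shown for these genes (5.12, 4.66, 4.81) to within the rounding of the tabulated values. The function name is hypothetical.

```python
import math

# Profiles over the 13 carbon-source chips, copied from the table
profiles = {
    "YBR018C": [-9.68, 5.53, -8.66, -11.19, -13.49, -10.23,
                -9.81, -15.15, 6.32, -10.89, -13.01, -12.10, -13.73],
    "YBR019C": [-9.68, 6.16, -7.77, -11.19, -12.09, -9.17,
                -9.42, -12.93, 6.07, -10.58, -10.90, -9.08, -11.97],
    "YBR020W": [-9.68, 6.05, -8.66, -11.19, -13.49, -10.23,
                -10.04, -12.70, 6.83, -12.82, -13.01, -8.95, -14.94],
}

def distance_matrix(profiles):
    """Symmetric matrix of Euclidean distances between all pairs of profiles."""
    return {
        a: {b: math.dist(profiles[a], profiles[b]) for b in profiles}
        for a in profiles
    }

dist = distance_matrix(profiles)
for gene, row in dist.items():
    print(gene, " ".join(f"{d:6.2f}" for d in row.values()))
```

The matrix is symmetric with zeros on the diagonal, so only the n(n-1)/2 values above the diagonal carry information.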
Hierarchical clustering: Pairwise distance calculation -> Distance matrix -> Tree building -> Tree
Distance matrix
          object 1  object 2  object 3  object 4  object 5
object 1      0.00      4.00      6.00      3.50      1.00
object 2      4.00      0.00      6.00      2.00      4.50
object 3      6.00      6.00      0.00      5.50      6.50
object 4      3.50      2.00      5.50      0.00      4.00
object 5      1.00      4.50      6.50      4.00      0.00
Algorithm
Tree representation
[Figure: tree built from the distance matrix. Leaf nodes: object 1 to object 5. Branch nodes: c1, c2, c3, c4. Labelled elements: root, branch, node, leaf nodes.]
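A minimal sketch of agglomerative tree building on the 5-object distance matrix is given below. It is not taken from the tutorial: the function name and the choice of average linkage for the example are assumptions, and the slides do not state which agglomeration rule was used for the figure. The procedure repeatedly merges the two closest clusters and records the height of each merge.

```python
def agglomerate(dist, labels, linkage="average"):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, height)."""
    rule = {
        "single": min,                                  # closest pair of members
        "complete": max,                                # farthest pair of members
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = {label: [i] for i, label in enumerate(labels)}  # name -> member indices
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        for pos, a in enumerate(names):
            for b in names[pos + 1:]:
                d = rule([dist[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # New branch nodes are named c1, c2, ... in order of creation
        clusters[f"c{len(merges) + 1}"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges

labels = ["object 1", "object 2", "object 3", "object 4", "object 5"]
D = [
    [0.00, 4.00, 6.00, 3.50, 1.00],
    [4.00, 0.00, 6.00, 2.00, 4.50],
    [6.00, 6.00, 0.00, 5.50, 6.50],
    [3.50, 2.00, 5.50, 0.00, 4.00],
    [1.00, 4.50, 6.50, 4.00, 0.00],
]

for a, b, height in agglomerate(D, labels, "average"):
    print(f"merge {a} + {b} at height {height:.2f}")
```

On this matrix, single, average and complete linkage all give the same merge order (objects 1 and 5, then objects 2 and 4, then these two groups, then object 3); only the merge heights differ.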
Isomorphism on a tree
[Figure: two drawings of one tree, each with a root, branch nodes c1 to c4 and leaves 1 to 5; the leaves appear in a different order in the two drawings.]
The choice of the agglomeration rule has a strong impact on the structure of a tree
resulting from hierarchical clustering.
Golub 1999 - Impact of the linkage method (Euclidean distance for all the trees)
Linkage methods compared: Single, Average, Complete, Ward
Golub 1999 - Effect of the distance metric (complete linkage for all the trees)
Distance metrics compared: Dot product, Correlation, Euclidean
Biclustering consists in clustering both the rows (genes) and the columns (samples) of the
data set. With the Golub 1999 data set, this reveals some subgroups of samples.
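To see why the choice of metric matters, the sketch below evaluates the three measures named above on two toy profiles that have the same shape but different amplitudes. The function names and the toy values are hypothetical, and turning the correlation into a dissimilarity as 1 - r is one common convention, assumed here.

```python
import math
from statistics import mean

def euclidean(x, y):
    """Euclidean distance: sensitive to differences in amplitude."""
    return math.dist(x, y)

def dot_product(x, y):
    """Dot product: a similarity (larger means more similar), not a distance."""
    return sum(a * b for a, b in zip(x, y))

def correlation_distance(x, y):
    """1 - Pearson correlation: compares profile shapes, ignores amplitude."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1 - sxy / math.sqrt(sxx * syy)

# Same shape, tenfold difference in amplitude (toy values)
x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

print(f"Euclidean distance   : {euclidean(x, y):.2f}")
print(f"Dot product          : {dot_product(x, y)}")
print(f"Correlation distance : {correlation_distance(x, y):.2f}")
```

The two profiles are far apart in Euclidean terms yet have a correlation distance of zero, so a correlation-based tree would group them while a Euclidean tree might not.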
Workflow: Hierarchical clustering -> Tree -> Tree cut -> Clusters
The tree can be cut at level k (starting from the root), which creates k clusters.
A k-group partitioning is obtained by collecting the leaves below each branch of
the pruned tree.
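Cutting a tree into k clusters amounts to undoing the last k-1 merges, counting from the root. The sketch below assumes the tree is stored as an ordered list of merges, with branch nodes named c1, c2, ... in order of creation; this representation, the function name and the example merge list (a 5-object tree with hypothetical heights) are illustrative assumptions.

```python
def cut_tree(merges, leaves, k):
    """Partition the leaves into k groups by keeping only the first n-k merges."""
    members = {leaf: [leaf] for leaf in leaves}
    kept = merges[:len(leaves) - k]          # drop the k-1 merges closest to the root
    for step, (a, b, _height) in enumerate(kept, start=1):
        members[f"c{step}"] = members.pop(a) + members.pop(b)
    return sorted(sorted(group) for group in members.values())

leaves = ["object 1", "object 2", "object 3", "object 4", "object 5"]
# Hypothetical tree: merges listed from the leaves up to the root
merges = [
    ("object 1", "object 5", 1.0),
    ("object 2", "object 4", 2.0),
    ("c1", "c2", 4.0),
    ("object 3", "c3", 6.0),
]

for k in (2, 3):
    print(k, cut_tree(merges, leaves, k))
```

With k=2 the cut separates object 3 from the other four objects; with k=3 it additionally separates the pair (1, 5) from the pair (2, 4).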
K-means clustering
[Figure: k-means iterations shown panel by panel, starting with Step 1; each panel caption begins "At each step, ..."]
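The per-step description is truncated here, so the sketch below falls back on the usual formulation of k-means, taken as an assumption: at each step every object is assigned to its closest centre, then each centre is moved to the mean of its members, until the assignment no longer changes. The function name, the toy points and the initial centres are hypothetical.

```python
import math

def kmeans(points, centres, max_iter=100):
    """Standard k-means iteration from the given initial centres."""
    centres = [list(c) for c in centres]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre
        new_assignment = [
            min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                                # converged
        assignment = new_assignment
        # Update step: each centre moves to the mean of its members
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                          # an empty cluster keeps its centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centres

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Deliberately poor start: both initial centres lie in the same group
assignment, centres = kmeans(points, centres=[(0, 0), (0, 1)])
print(assignment)   # [0, 0, 0, 1, 1, 1]
```

Here the iteration recovers the two groups despite the poor start, but in general the result depends on the initial centres, which is why several starting positions are usually tried.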
K-means is time- and memory-efficient for very large data sets (e.g. thousands of
objects)
Problem of dimensionality
Solution
Strengths: simple to use; fast; can be used with very large data sets
Weaknesses
Solutions
Instead of a single clustering, you obtain hundreds of different clustering results, totalling
thousands of clusters. How do you decide among them?
Evaluation of clustering results
Bootstrap
Jack-knife
Test different initial positions for the k-means
Contingency table: hierarchical clustering (rows) vs k-means clustering (columns)
        k1   k2   k3   k4   k5   k6   k7   Sum
h1       0    0    2   18   14    1    0    35
h2       0    0    0    4    0    0    0     4
h3       0    0    0    0   10    0    0    10
h4      40    0   10    0    0    9    0    59
h5       2   12    0    0    0    5    0    19
h6       0    0    0    0    0    0    4     4
h7       0    2    0    0    0    0    0     2
Sum     42   14   12   22   24   15    4   133
Same table, with the k-means columns reordered to match the hierarchical clusters
        k4   k3   k5   k1   k2   k7   k6   Sum
h1      18    2   14    0    0    0    1    35
h2       4    0    0    0    0    0    0     4
h3       0    0   10    0    0    0    0    10
h4       0   10    0   40    0    0    9    59
h5       0    0    0    2   12    0    5    19
h6       0    0    0    0    0    4    0     4
h7       0    0    0    0    2    0    0     2
Sum     22   12   24   42   14    4   15   133
Correspondence between clusters
hierarchical   h1  h2  h3  h4  h5  h6  h7
k-means        k4  k3  k5  k1  k2  k7  k6
Matches      84    Hit rate    63.2%
Mismatches   49    Error rate  36.8%
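The hit rate and error rate can be recomputed from the contingency table and the cluster correspondence. The sketch below uses the counts and the h-to-k pairing given above; the variable names are hypothetical, and the code only checks the arithmetic, it does not show how the correspondence was chosen.

```python
k_columns = ["k1", "k2", "k3", "k4", "k5", "k6", "k7"]

# Rows: hierarchical clusters; values: counts per k-means cluster (k1..k7)
counts = {
    "h1": [0, 0, 2, 18, 14, 1, 0],
    "h2": [0, 0, 0, 4, 0, 0, 0],
    "h3": [0, 0, 0, 0, 10, 0, 0],
    "h4": [40, 0, 10, 0, 0, 9, 0],
    "h5": [2, 12, 0, 0, 0, 5, 0],
    "h6": [0, 0, 0, 0, 0, 0, 4],
    "h7": [0, 2, 0, 0, 0, 0, 0],
}

# One-to-one correspondence between the two sets of clusters
correspondence = {"h1": "k4", "h2": "k3", "h3": "k5", "h4": "k1",
                  "h5": "k2", "h6": "k7", "h7": "k6"}

total = sum(sum(row) for row in counts.values())
matches = sum(counts[h][k_columns.index(k)] for h, k in correspondence.items())
mismatches = total - matches

print(f"Matches    {matches:4d}   Hit rate   {100 * matches / total:.1f}%")
print(f"Mismatches {mismatches:4d}   Error rate {100 * mismatches / total:.1f}%")
```

The matches are the diagonal of the reordered table (18 + 0 + 10 + 40 + 12 + 4 + 0 = 84 out of 133 genes).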