Clustering MIT 15.097 Course Notes
The K-Means algorithm is a very popular way to do this. It assumes points lie
in Euclidean space.
If we're given points \{z_k\}_k, they can induce a Voronoi partition of \mathbb{R}^n: they
break the space into cells, where each cell corresponds to one of the z_k's. That
is, each cell contains the region of space whose nearest representative is z_k.
Draw a picture
We can look at the examples in each of these regions of space, which are the
clusters. Specifically,
C_k := \{x_i : \text{the closest representative to } x_i \text{ is } z_k\}.
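This nearest-representative rule is easy to express directly. Here is a minimal sketch (the names are hypothetical: X holds the examples as rows, Z the representatives):

```python
import numpy as np

def assign_to_clusters(X, Z):
    """For each row x_i of X, find the index k of the nearest representative z_k.

    X: (m, n) array of examples; Z: (K, n) array of representatives.
    Returns an (m,) array of cluster indices, i.e. the Voronoi cell of each x_i.
    """
    # Squared Euclidean distance from every example to every representative.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)  # shape (m, K)
    return d2.argmin(axis=1)
```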
Let’s compute the cost another way. Before, we summed over examples, and
then picked the right representative zk for each example. This time, we’ll sum
over clusters, and look at all the examples in that cluster:
\text{cost}(z_1, \ldots, z_K) = \sum_k \sum_{\{i : x_i \in C_k\}} \|x_i - z_k\|_2^2.
While we’re analyzing, we’ll need to consider suboptimal partitions of the data,
where an example might not be assigned to the nearest representative. So we
redefine the cost:
\text{cost}(C_1, \ldots, C_K; z_1, \ldots, z_K) = \sum_k \sum_{\{i : x_i \in C_k\}} \|x_i - z_k\|_2^2. \qquad (1)
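As a sanity check, the double sum in (1) can be computed directly from an assignment vector. A small sketch, reusing the hypothetical X and Z above (note that labels need not assign each point to its nearest representative):

```python
import numpy as np

def cost(X, Z, labels):
    """Cost (1): sum over clusters of squared distances to that cluster's representative.

    labels[i] = k means x_i is assigned to cluster C_k (not necessarily the nearest z_k).
    """
    return sum(((X[labels == k] - Z[k]) ** 2).sum() for k in range(len(Z)))
```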
Let’s say we only have one cluster to deal with. Call it C. The representative is
z. The cost is then
\text{cost}(C; z) = \sum_{\{i : x_i \in C\}} \|x_i - z\|_2^2.
Where should we place z?
As you probably guessed, we would put it at the mean of the examples in C. But
also, the additional cost incurred by picking z ≠ mean(C) can be characterized
very simply:

Lemma 1. For any z,
\text{cost}(C; z) = \text{cost}(C; \text{mean}(C)) + |C| \, \|z - \text{mean}(C)\|_2^2.
That is, the mean minimizes the cost, and any other choice of z pays an extra
|C| times its squared distance from the mean.
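A quick numerical check of this identity on random data (a sketch; the point set and the choice of z are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(50, 3))        # 50 points in R^3 playing the role of the cluster
z = rng.normal(size=3)              # an arbitrary representative
mu = C.mean(axis=0)                 # mean(C)

cost_z = ((C - z) ** 2).sum()       # cost(C; z)
cost_mu = ((C - mu) ** 2).sum()     # cost(C; mean(C))
# Lemma 1: the excess cost is |C| * ||z - mean(C)||^2.
assert np.isclose(cost_z, cost_mu + len(C) * ((z - mu) ** 2).sum())
```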
To prove Lemma 1, we use the standard identity
\mathbb{E}_X \|X - z\|_2^2 = \mathbb{E}_X \|X - \mathbb{E}X\|_2^2 + \|\mathbb{E}X - z\|_2^2,
which holds for any random vector X, and pick a specific choice for X: namely, X
is a uniform random draw from the points x_i in set C. So X has a discrete
distribution. What will happen with this choice of X is that the expectation will
reduce to the cost we already defined above:
\mathbb{E}_X \|X - z\|_2^2 = \sum_{\{i : x_i \in C\}} (\text{prob. that point } i \text{ is chosen}) \|x_i - z\|_2^2
= \sum_{\{i : x_i \in C\}} \frac{1}{|C|} \|x_i - z\|_2^2 = \frac{1}{|C|} \text{cost}(C, z). \qquad (2)
Since \mathbb{E}X = \text{mean}(C), applying (2) to both expectations in the identity (once with
our z, once with z = \text{mean}(C)) and multiplying through by |C| gives Lemma 1.
To really minimize the cost (1), you’d need to try all possible assignments of the
m data points to K clusters. Uck! The number of distinct assignments is (Jain
and Dubes 1988):
S(m, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^m.
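For a sense of scale, this count is easy to evaluate exactly. A minimal sketch (the function name S and the example values of m and K are just for illustration):

```python
from math import comb, factorial

def S(m, K):
    """Number of distinct assignments of m points into K nonempty clusters
    (the Stirling number of the second kind)."""
    return sum((-1) ** (K - k) * comb(K, k) * k ** m for k in range(1, K + 1)) // factorial(K)

print(S(10, 4))    # 34105 -- already a lot for 10 points
print(S(100, 4))   # an astronomically large number: brute force is hopeless
```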
The K-Means Algorithm
PPT demo
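The algorithm alternates the two steps analyzed in Lemma 3 below: assign each point to its nearest center, then move each center to the mean of its cluster. Here is a minimal sketch (hypothetical names; initializing with K randomly chosen data points is one common choice, not the only one):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Lloyd's algorithm for K-Means. X: (m, n) data matrix.
    Returns centers Z, labels, and the cost after each iteration."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=K, replace=False)]  # init: K random data points
    costs = []
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest center.
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: move each center to the mean of its cluster
        # (keep the old center if a cluster goes empty).
        Z = np.array([X[labels == k].mean(axis=0) if (labels == k).any() else Z[k]
                      for k in range(K)])
        costs.append(((X - Z[labels]) ** 2).sum())
        if len(costs) > 1 and costs[-1] == costs[-2]:  # converged
            break
    return Z, labels, costs
```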
Lemma 3. During the course of the K-Means algorithm, the cost monotonically
decreases.
Proof. Let z_1^{(t)}, \ldots, z_K^{(t)}, C_1^{(t)}, \ldots, C_K^{(t)} denote the centers and clusters at the start of
the t-th iterate of K-Means. The first step of the iteration assigns each data point
to its closest center, therefore the cluster assignment is better:
\text{cost}(C_1^{(t+1)}, \ldots, C_K^{(t+1)}, z_1^{(t)}, \ldots, z_K^{(t)}) \leq \text{cost}(C_1^{(t)}, \ldots, C_K^{(t)}, z_1^{(t)}, \ldots, z_K^{(t)}).
On the second step, each cluster is re-centered at its mean, so the representatives
are better. By Lemma 1,
\text{cost}(C_1^{(t+1)}, \ldots, C_K^{(t+1)}, z_1^{(t+1)}, \ldots, z_K^{(t+1)}) \leq \text{cost}(C_1^{(t+1)}, \ldots, C_K^{(t+1)}, z_1^{(t)}, \ldots, z_K^{(t)}).
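Using the kmeans sketch above, the monotone decrease is easy to observe empirically (arbitrary random data):

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(200, 2))
_, _, costs = kmeans(X, K=3)
# Lemma 3: the cost never increases from one iteration to the next.
assert all(c2 <= c1 for c1, c2 in zip(costs, costs[1:]))
```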
Example of how K-Means could converge to the wrong thing
How might you make K-Means more likely to converge to the optimal solution?
How might you choose K? (Why can’t you measure test error?)
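One common heuristic, and the idea behind the sum-of-squares figure later in these notes, is to run K-Means for a range of K and look for a kink ("elbow") in the cost curve. A sketch reusing the kmeans function above (the data here is arbitrary):

```python
import numpy as np

X = np.random.default_rng(2).normal(size=(300, 5))
# Sweep K and record the final cost; a kink in this curve suggests a good K.
for K in range(1, 11):
    _, _, costs = kmeans(X, K)
    print(K, costs[-1])
```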
There are loads of cluster validity measures, alternatives to the cost; a sketch of two of them follows the list. Draw a picture
• Davies-Bouldin Index - looks at the average intracluster distance (within-cluster
distance) to the centroid (want it to be small), and the intercluster distances
between centroids (want them to be large).
• Dunn Index - looks pairwise at the minimal intercluster distance (want it to be
large) and the maximal intracluster distance (want it to be small).
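Both indices are straightforward to compute. Here is a sketch under one common set of definitions (several variants exist in the literature; X, labels, and Z are as in the kmeans sketch, and at least two clusters are assumed):

```python
import numpy as np

def davies_bouldin(X, labels, Z):
    """Average over clusters i of the worst-case (s_i + s_j) / ||z_i - z_j||,
    where s_k is the mean distance of cluster k's points to its centroid.
    Smaller is better. Assumes K >= 2."""
    K = len(Z)
    s = np.array([np.linalg.norm(X[labels == k] - Z[k], axis=1).mean() for k in range(K)])
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)  # centroid distances
    return np.mean([max((s[i] + s[j]) / d[i, j] for j in range(K) if j != i)
                    for i in range(K)])

def dunn(X, labels, K):
    """Minimum distance between points in different clusters, divided by the
    maximum cluster diameter. Larger is better. Assumes K >= 2."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    same = labels[:, None] == labels[None, :]
    return d[~same].min() / max(d[np.ix_(labels == k, labels == k)].max() for k in range(K))
```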
Example: Microarray data. We have 6830 genes (rows) and 64 patients (columns).
The color of each box is a measurement of the expression level of a gene. The
expression level of a gene is basically how much of its special protein it is
producing. The physical chip itself doesn't actually measure protein levels, but a
proxy for them (RNA, which sticks to the DNA on the chip). If the color is green,
it means low expression levels; if the color is red, it means higher expression
levels. Each patient is represented by a vector, which is the expression levels of
their genes. It's a column vector with values given in color:
[Microarray heat map figure omitted. © source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/fairuse.]
Each patient (column) has some type of cancer. We want to cluster patients to see
whether patients with the same types of cancer cluster together. So each cluster
center is an "average" patient expression-level vector for some type of cancer.
It's also a column vector.
[Figure: total within-cluster sum of squares (×10^4) versus the number of clusters K, for K = 2, ..., 10; the curve decreases smoothly from roughly 26 to 16.]
Hm, there's no kink in this figure. Compare the K = 3 solution with the "true" clusters:
Cluster   Breast   CNS   Colon   K562   Leukemia   MCF7
   1         3      5      0      0        0        0
   2         2      0      0      2        6        2
   3         2      0      7      0        0        0

Cluster   Melanoma   NSCLC   Ovarian   Prostate   Renal   Unknown
   1          1        7        6         2        9        1
   2          7        2        0         0        0        0
   3          0        0        0         0        0        0
Images by MIT OpenCourseWare, adapted from Hastie et al., The Elements of Statistical Learning,
Springer, 2009.
It’s pretty good at keeping the same cancers in the same cluster. The two breast
cancers in the 2nd cluster were actually melanomas that metastasized.
Generally we cluster genes, not patients. Would really like to get something like
this in practice:
where each row is a gene, and the columns are different immune cell types.
[Figure omitted: the 64 cell-line labels of the microarray data, grouped by cancer type: BREAST, CNS, COLON, LEUKEMIA, MELANOMA, NSCLC, OVARIAN, PROSTATE, RENAL, UNKNOWN, plus the K562A/B-repro and MCF7A/D-repro lines.]
Application Slides
MIT OpenCourseWare
http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.