Machine Learning
Clustering
What is Clustering?
• Organizing data into clusters such that there is
– high intra-cluster similarity
– low inter-cluster similarity
• Clustering is subjective.
• Similarity is hard to define, but… "we know it when we see it."
[Figure: expression profiles of gene1 and gene2, with a numerical value quantifying their similarity.]
Two main families of clustering algorithms: Hierarchical and Partitional.
(How-to) Hierarchical Clustering
The number of dendrograms with n leaves is $(2n-3)!\,/\,[2^{n-2}(n-2)!]$:

Number of Leaves | Number of Possible Dendrograms
2                | 1
3                | 3
4                | 15
5                | 105
…                | …
10               | 34,459,425
We begin with a distance matrix which contains the distances between every pair of objects in our database. For example, for five objects:

    0   8   8   7   7
        0   2   4   4
            0   3   3
                0   1
                    0

(e.g., the distance between the first two objects is 8, and the distance between the last two objects is 1)
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
At each step: consider all possible merges… and choose the best one.
But how do we compute distances between clusters rather than between individual objects?

Computing distance between clusters: Single Link
• cluster distance = distance of the two closest members
– Potentially long and skinny clusters
Example: single link

     1   2   3   4   5
1    0
2    2   0
3    6   3   0
4   10   9   7   0
5    9   8   5   4   0
Example: single link
The closest pair is 1 and 2 (distance 2). Merge them and update the distance matrix using the single-link (minimum) rule:

        (1,2)   3   4   5
(1,2)     0
3         3     0
4         9     7   0
5         8     5   4   0
Example: single link
The closest pair is now (1,2) and 3 (distance 3). Merge them:

d((1,2,3), 4) = min{ d((1,2), 4), d(3, 4) } = min{9, 7} = 7
d((1,2,3), 5) = min{ d((1,2), 5), d(3, 5) } = min{8, 5} = 5

          (1,2,3)   4   5
(1,2,3)      0
4            7      0
5            5      4   0

Next, 4 and 5 merge (distance 4), and the final merge happens at
d((1,2,3), (4,5)) = min{ d((1,2,3), 4), d((1,2,3), 5) } = min{7, 5} = 5
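The merge sequence above can be reproduced with a standard library. Below is a small sketch using SciPy's hierarchical clustering on the same 5×5 distance matrix; the object labels 1–5 and the choice of SciPy are mine, and method="complete" or "average" gives the other linkage rules discussed next.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# The 5x5 distance matrix from the example (symmetric, zero diagonal).
D = np.array([
    [ 0,  2,  6, 10,  9],
    [ 2,  0,  3,  9,  8],
    [ 6,  3,  0,  7,  5],
    [10,  9,  7,  0,  4],
    [ 9,  8,  5,  4,  0],
], dtype=float)

# linkage() expects a condensed distance matrix; 'single' = minimum distance between clusters.
Z = linkage(squareform(D), method="single")
print(Z)  # each row: the two clusters merged and the merge distance (2, 3, 4, 5 as in the slides)
# dendrogram(Z) would draw the tree; original labels can be passed via the `labels` argument.
```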
Computing distance between clusters: Complete Link
• cluster distance = distance of the two farthest members
+ Tight clusters
Computing distance between
clusters: Average Link
• cluster distance = average distance of all
pairs
[Dendrogram of 30 objects: the height at which two clusters are joined represents the distance between them; one object that only joins near the very top of the tree is marked as an outlier.]
Example: clustering genes
• Microarrays measure the activities of all genes under different conditions
[Figure: gene expression data plotted in two dimensions, with three cluster centers k1, k2, k3 randomly initialized.]
K-means Clustering: Iteration 1
Assign all objects to the nearest center.
Move a center to the mean of its members.
K-means Clustering: Iteration 2
After moving centers, re-assign the objects to nearest centers.
Move a center to the mean of its new members.
K-means Clustering: Finished!
Re-assign and move centers until no objects change membership.
[Figure: final clustering, with expression in condition 1 on the x-axis and expression in condition 2 on the y-axis, and converged centers k1, k2, k3.]
Algorithm k-means
1. Decide on a value for K, the number of clusters.
2. Initialize the K cluster centers (randomly, if
necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the K cluster centers, by assuming the
memberships found above are correct.
5. Repeat 3 and 4 until none of the N objects changed
membership in the last iteration.
Notes on the algorithm:
• Step 3 uses one of the distance / similarity functions we discussed earlier.
• Step 4 re-estimates each center as the average / median of its class members.
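The five steps above map directly onto a short implementation. Here is a minimal sketch in Python/NumPy; the function and variable names are my own, and it assumes Euclidean distance with mean-based center updates.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (N, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize the K cluster centers as K randomly chosen objects.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every object to the nearest cluster center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no object changed membership in the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its current members.
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return labels, centers

# Toy usage: three 2-D blobs, e.g. "expression in condition 1 vs. condition 2".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in [(1, 1), (4, 1), (2.5, 4)]])
labels, centers = kmeans(X, k=3)
```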
Why K-means Works
• What is a good partition?
• High intra-cluster similarity
• K-means optimizes
– the average distance to members of the same cluster
  $\sum_{k=1}^{K} \frac{1}{n_k} \sum_{i=1}^{n_k} \sum_{j=1}^{n_k} (x_{ki} - x_{kj})^2$
– which is twice the total squared distance to the centers, also called the squared error
  $se = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ki} - \mu_k)^2$
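A quick numerical check of the factor-of-two identity above (a sketch with made-up data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))   # one cluster of 50 points in 2-D
mu = x.mean(axis=0)            # its center

# Average pairwise squared distance within the cluster: (1/n) * sum_i sum_j ||x_i - x_j||^2
pairwise = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
avg_pairwise = pairwise.sum() / len(x)

# Squared error: sum_i ||x_i - mu||^2
se = ((x - mu) ** 2).sum()

print(avg_pairwise, 2 * se)    # the two values agree up to floating-point error
```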
Summary: K-Means
• Strength
– Simple, easy to implement and debug
– Intuitive objective function: optimizes intra-cluster similarity
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
• Weakness
– Applicable only when a mean is defined; what about categorical data?
– Often terminates at a local optimum. Initialization is important.
– Need to specify K, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
• Summary
– Assign members based on current centers
– Re-estimate centers based on current assignment
Outline
• Motivation
• Distance measure
• Hierarchical clustering
• Partitional clustering
– K-means
– Gaussian Mixture Models
– Number of clusters
Gaussian Mixture Models
• Gaussian: $P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
– e.g., the height of one population
• Gaussian Mixture: a generative modeling framework
$P(C = i) = w_i, \qquad P(x \mid C = i) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$
$P(x \mid \Theta) = \sum_i P(C = i, x \mid \Theta) = \sum_i P(x \mid C = i, \Theta)\, P(C = i \mid \Theta) = \sum_i w_i \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$
– the likelihood of a data point given the model
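As a concrete illustration of the mixture likelihood $P(x \mid \Theta) = \sum_i w_i\, P(x \mid C = i)$, here is a tiny sketch; the weights, means, and variances are arbitrary placeholder values, not from the slides.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D mixture with two components (e.g., heights of two populations).
w     = np.array([0.3, 0.7])      # mixture weights w_i, summing to 1
mu    = np.array([160.0, 175.0])  # component means
sigma = np.array([6.0, 7.0])      # component standard deviations

def mixture_likelihood(x):
    # P(x | Theta) = sum_i w_i * N(x; mu_i, sigma_i^2)
    return np.sum(w * norm.pdf(x, loc=mu, scale=sigma))

print(mixture_likelihood(170.0))
```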
Gaussian Mixture Models
• Mixture of multivariate Gaussians
$P(C = i) = w_i, \qquad P(x \mid C = i) = \mathcal{N}(x; \mu_i, \Sigma_i)$
(the one-dimensional formula above generalizes: in d dimensions the mean becomes a vector $\mu_i$ and the variance a covariance matrix $\Sigma_i$)
GMM+EM = "Soft K-means"
• Decide the number of clusters, K
• Initialize parameters (randomly)
• E-step: assign a probabilistic membership to every input sample j (one value per cluster):
$p_{i,j} = p(C = i \mid x_j) = \frac{p(x_j \mid C = i)\, p(C = i)}{\sum_k p(x_j \mid C = k)\, p(C = k)}, \qquad p_i = \sum_j p_{i,j}$
• M-step: re-estimate the parameters based on the probabilistic memberships:
$\mu_i = \frac{\sum_j p_{i,j}\, x_j}{p_i}, \qquad \Sigma_i = \frac{\sum_j p_{i,j}\,(x_j - \mu_i)(x_j - \mu_i)^T}{p_i}, \qquad w_i = \frac{p_i}{\sum_k p_k}$
[Figure: EM fitting a Gaussian mixture. The cluster means are randomly assigned at the start; the fitted components are shown after iterations 2, 5, and 25.]
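The E-step and M-step above can be written in a few lines of NumPy/SciPy. This is a minimal sketch of the updates (full covariances, random initialization, no numerical safeguards); all names are my own and it is not the exact code behind the figures in the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=25, seed=0):
    """Fit a k-component Gaussian mixture to X (an (N, d) array) with EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: means picked from the data, identity covariances, uniform weights.
    mu = X[rng.choice(N, size=k, replace=False)].copy()
    Sigma = np.stack([np.eye(d)] * k)
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: p_ij = w_i * N(x_j; mu_i, Sigma_i) / sum_k w_k * N(x_j; mu_k, Sigma_k)
        dens = np.stack([w[i] * multivariate_normal.pdf(X, mean=mu[i], cov=Sigma[i])
                         for i in range(k)])           # shape (k, N)
        p = dens / dens.sum(axis=0, keepdims=True)      # probabilistic memberships p_ij
        p_i = p.sum(axis=1)                             # effective cluster sizes
        # M-step: re-estimate means, covariances, and weights from the memberships.
        for i in range(k):
            mu[i] = (p[i, :, None] * X).sum(axis=0) / p_i[i]
            diff = X - mu[i]
            Sigma[i] = (p[i, :, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0) / p_i[i]
        w = p_i / p_i.sum()
    return w, mu, Sigma, p
```

In practice one would typically use sklearn.mixture.GaussianMixture, which implements the same E/M updates with better initialization and numerical safeguards.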
Strength of Gaussian Mixture Models
• Interpretability: learns a generative model of each cluster
– you can generate new data based on the learned model
• Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
• Intuitive (?) objective function: optimizes data likelihood
Weakness of Gaussian Mixture Models
• Often terminates at a local optimum. Initialization
is important.
• Need to specify K, the number of clusters, in
advance
• Not suitable for discovering clusters with non-convex shapes
• Summary
– To learn a Gaussian mixture, assign probabilistic memberships based on the current parameters, and re-estimate the parameters based on the current memberships
Algorithm: K-means and GMM
1. Decide on a value for K, the number of clusters.
2. Initialize the K cluster centers / parameters (randomly).
3. K-means: assign each object to the nearest center.  GMM (E-step): assign probabilistic memberships.
4. K-means: re-estimate each center as the mean of its members.  GMM (M-step): re-estimate the parameters from the memberships.
5. Repeat steps 3 and 4 until convergence.
Choosing the number of clusters: run the algorithm with different values of k and compare the objective function values.
[Figure: a toy two-dimensional data set clustered with k = 1, k = 2, and k = 3.]
When k = 1, the objective function is 873.0
When k = 2, the objective function is 173.1
When k = 3, the objective function is 133.6
We can plot the objective function values for k = 1 to 6…
[Plot: objective function value (y-axis, 0 to 9.00E+02) versus k (x-axis, 1 to 6); the sharp drop up to k = 3 followed by little further improvement is the "elbow" suggesting three clusters.]
Note that the results are not always as clear cut as in this toy example
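A quick way to produce such a plot is sketched below with scikit-learn's KMeans; the data X is a placeholder blob data set, not the toy example from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder data: three 2-D blobs standing in for the toy example.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in [(2, 2), (7, 3), (4, 8)]])

ks = range(1, 7)
# KMeans.inertia_ is the within-cluster sum of squared distances, i.e. the squared-error objective.
objective = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), objective, marker="o")
plt.xlabel("k")
plt.ylabel("objective function (squared error)")
plt.show()
```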
Cross validation
• We can also use cross validation to determine the correct number of clusters.
• Recall that a GMM is a generative model. We can compute the likelihood of the left-out data to determine which model (number of clusters) is more accurate:
$p(x_1 \ldots x_n \mid \Theta) = \prod_{j=1}^{n} \sum_{i=1}^{k} p(x_j \mid C = i)\, w_i$
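A sketch of this idea with scikit-learn follows; GaussianMixture.score returns the average per-sample log-likelihood, and the data X and train/test split are placeholders rather than anything from the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (4, 0), (2, 4)]])

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X_train)
    # Average log-likelihood of the held-out data under the k-component model.
    print(k, gmm.score(X_test))
```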
Cluster validation
• We wish to determine whether the clusters are real, or to compare different clustering methods.
- internal validation (stability, coherence)
- external validation (match to known categories)
Internal validation: Coherence
• A simple method is to compare clustering algorithms based on the coherence of their results
• We compute the average inter-cluster similarity and the
average intra-cluster similarity
• Requires the definition of the similarity / distance metric
Internal validation: Stability
• If the clusters capture real structure in the data they should
be stable to minor perturbation (e.g., subsampling) of the
data.
• To characterize stability we need a measure of similarity
between any two k-clusterings.
• For any set of clusters C we define L(C) as the matrix of 0/1 labels such that L(C)_{ij} = 1 if objects i and j belong to the same cluster and 0 otherwise.
• We can compare any two k clusterings C and C' by
comparing the corresponding label matrices L(C) and
L(C').
Validation by subsampling
• C is the set of k clusters based on all the objects
• C' denotes the set of k clusters resulting from a randomly
chosen subset (80-90%) of objects
• We have high confidence in the original clustering if Sim(L(C), L(C')) approaches 1 with high probability, where the comparison is done over the objects common to both clusterings.
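A sketch of this subsampling check is shown below, using the co-membership matrix L(C) and a simple matching-based similarity; the choice of k-means as the base algorithm, the 80% subsample size, and the agreement measure are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def comembership(labels):
    """L(C): L_ij = 1 if objects i and j are in the same cluster, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def stability(X, k, frac=0.8, n_trials=20, seed=0):
    rng = np.random.default_rng(seed)
    full = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sims = []
    for _ in range(n_trials):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        # Compare the label matrices over the objects common to both clusterings.
        L_full = comembership(full[idx])
        L_sub = comembership(sub)
        off_diag = ~np.eye(len(idx), dtype=bool)
        sims.append((L_full[off_diag] == L_sub[off_diag]).mean())
    return np.mean(sims)   # values close to 1 indicate a stable clustering
```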
External validation
• For this we need an external source that contains related, but
usually not identical information.
• For example, assume we are clustering web pages based on
the car pictures they contain.
• We have independently grouped these pages based on the
text description they contain.
• Can we use the text based grouping to determine how well
our clustering works?
External validation
• Suppose we have generated k clusters C1,…,Ck. How do we
assess the significance of their relation to m known
(potentially overlapping) categories G1,…,Gm?
• Let's start by comparing a single cluster Ci with a single category Gj. The p-value for such a match is based on the hyper-geometric distribution.
• (Derivation on the board.) If there are n objects in total, the probability that a randomly chosen set of |Ci| elements out of n has exactly l elements in common with Gj is
$\mathrm{prob}(\text{exactly } l \text{ matches}) = \binom{|G_j|}{l} \binom{n - |G_j|}{|C_i| - l} \Big/ \binom{n}{|C_i|}$
P-value (cont.)
• If the observed overlap between the sets (cluster and category) is $\hat{l}$ elements (genes), then the p-value is
$p = \mathrm{prob}(l \ge \hat{l}) = \sum_{j=\hat{l}}^{\min(|C_i|,\,|G_j|)} \mathrm{prob}(\text{exactly } j \text{ matches})$
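This tail probability is the survival function of the hypergeometric distribution, so it can be computed directly with SciPy; the specific sizes and overlap below are made-up numbers for illustration.

```python
from scipy.stats import hypergeom

n_total = 6000   # total number of objects (e.g., genes), hypothetical
G_size = 40      # size of the known category Gj, hypothetical
C_size = 100     # size of the cluster Ci, hypothetical
l_obs = 8        # observed overlap between Ci and Gj, hypothetical

# P(overlap >= l_obs) when C_size objects are drawn from n_total containing G_size category members.
p_value = hypergeom.sf(l_obs - 1, n_total, G_size, C_size)
print(p_value)
```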
[Scatter plot comparing external validation results: −log p-value for clusters from expression profiles (y-axis) versus −log p-value for k-means clusters (x-axis), with categories such as "response to stimulus", "transferase activity", and "cell death" labeled.]
What you should know
• Why is clustering useful
• What are the different types of clustering
algorithms
• What are the assumptions we are making
for each, and what can we get from them
• Unsolved issues: number of clusters,
initialization, etc.