Clustering
1 Introduction to clustering
• Clustering is the task of dividing up data points into groups or clusters, so that points in any
one group are more “similar” to each other than to points outside the group
• Why cluster? Two main uses:
– Summary: deriving a reduced representation of the full data set. E.g., vector quantization (we'll see this shortly)
– Discovery: looking for new insights into the structure of the data. E.g., finding groups of
students that commit similar mistakes, groups of 80s songs that sound alike, or groups of
patients that have similar gene expressions
• There are other uses too, e.g., checking up on someone else’s work/decisions, by investigating
the validity of pre-existing group assignments; or helping with prediction, i.e., in regression or
classification. For brevity, we won’t study clustering in these contexts
• A neat example: clustering of baseball pitches
[Figure: scatter plots of Barry Zito's pitches, plotting back spin and side spin against start speed, with points colored by cluster]
Inferred meaning of clusters: black – fastball, red – sinker, green – changeup, blue – slider,
light blue – curveball. (This example is due to Mike Pane, a former CMU statistics undergrad,
from his undergraduate honors thesis)
• A note: don’t confuse clustering and classification! In classification, we have data points for
which the groups are known, and we try to learn what differentiates these groups (i.e., a
classification function) to properly classify future data. In clustering, we look at data points
for which groups are unknown and undefined, and try to learn the groups themselves, as well
as what differentiates them. In short, classification is a supervised task, whereas clustering is
an unsupervised task
• There are many ways to perform clustering (just like there are many ways to fit regression or
classification functions). We'll study two of the most common techniques
2 K-means clustering
2.1 Within-cluster variation
• Suppose that we have $n$ data points in $p$ dimensions, $x_1, \ldots, x_n \in \mathbb{R}^p$. Let $K$ be the number of clusters (consider this fixed). A clustering of the points $x_1, \ldots, x_n$ is a function $C$ that assigns each observation $x_i$ to a group $k \in \{1, \ldots, K\}$. Our notation: $C(i) = k$ means that $x_i$ is assigned to group $k$, and $n_k$ is the number of points in group $k$
• A natural objective is to choose the clustering C to minimize the within-cluster variation:
$$\sum_{k=1}^{K} \sum_{C(i)=k} \|x_i - \bar{x}_k\|_2^2, \qquad (1)$$
where $\bar{x}_k$ is the average of the points in group $k$.
This is because the smaller the within-cluster variation, the tighter the clustering of points
• A problem is that doing this exactly requires trying all possible assignments of the $n$ points into $K$ groups. The number of possible assignments is
$$N(n, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^n.$$
Note that $N(10, 4) = 34{,}105$, and $N(25, 4) \approx 5 \times 10^{13}$ ... this is incredibly huge!
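For the record, the count $N(n, K)$ above is the Stirling number of the second kind, and the two values just quoted can be checked directly with a throwaway snippet (the function name here is arbitrary):

```python
from math import comb, factorial

def n_assignments(n, K):
    # Number of ways to partition n points into K nonempty groups,
    # via the alternating-sum formula above.
    return sum((-1) ** (K - k) * comb(K, k) * k ** n for k in range(1, K + 1)) // factorial(K)

print(n_assignments(10, 4))   # 34105
print(n_assignments(25, 4))   # 46771289738810, roughly 5e13
```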
• Therefore we will have to settle for a clustering C that approximately minimizes the within-
cluster variation. We will start by recalling the following fact: for any points $z_1, \ldots, z_m \in \mathbb{R}^p$, the quantity
$$\sum_{i=1}^{m} \|z_i - c\|_2^2$$
is minimized by choosing $c = \bar{z} = \frac{1}{m} \sum_{i=1}^{m} z_i$, the average of $z_1, \ldots, z_m$ (a one-line verification is given just after this bullet). Hence, minimizing (1)
over a clustering assignment C is the same as minimizing the enlarged criterion
$$\sum_{k=1}^{K} \sum_{C(i)=k} \|x_i - c_k\|_2^2 \qquad (2)$$
over both the clustering assignment $C$ and the centers $c_1, \ldots, c_K$
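To see why the average minimizes the sum of squared distances above, expand the square around $\bar{z}$; the cross term vanishes because $\sum_{i=1}^{m} (z_i - \bar{z}) = 0$, leaving
$$\sum_{i=1}^{m} \|z_i - c\|_2^2 = \sum_{i=1}^{m} \|z_i - \bar{z}\|_2^2 + m\|\bar{z} - c\|_2^2 \;\geq\; \sum_{i=1}^{m} \|z_i - \bar{z}\|_2^2,$$
with equality exactly when $c = \bar{z}$.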
2.2 K-means clustering
• The K-means clustering algorithm approximately minimizes the enlarged criterion (2). It
does so by alternately minimizing over the choice of clustering assignment C and the centers
c1 , . . . cK
• We start with an initial guess for c1 , . . . cK (e.g., pick K points at random over the range of
x1 , . . . xn ), and then repeat:
1. Minimize over C: for each i = 1, . . . n, find the cluster center ck closest to xi , and let
C(i) = k
2. Minimize over c1 , . . . cK : for each k = 1, . . . K, let ck = x̄k , the average of group k points
We stop when the within-cluster variation doesn’t change
• Put in words, this algorithm repeats two steps:
1. Cluster (label) each point based on the closest center
2. Replace each center by the average of the points in its cluster
• Note that the within-cluster variation decreases with each iteration of the algorithm. I.e., if
Wt is the within-cluster variation at iteration t, then Wt+1 ≤ Wt . Also, the algorithm always
terminates, no matter the initial cluster centers. In fact, it takes $\leq K^n$ iterations (why?)
• The final clustering assignment generally depends on the initial cluster centers. Sometimes,
different initial centers lead to very different final outputs. Hence we typically run K-means
multiple times (e.g., 10 times), randomly initializing the cluster centers for each run, and then keep the run that gives the smallest within-cluster variation
• This is not guaranteed to deliver the clustering assignment that globally minimizes the within-
cluster variation (recall: this would require looking through all possible assignments!), but it
does tend to give us a pretty good approximation
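A minimal NumPy sketch of this procedure, with random restarts, is given below; the function name and the choice to initialize with $K$ randomly chosen data points are illustrative, not a reference implementation:

```python
import numpy as np

def kmeans(X, K, n_restarts=10, max_iter=100, seed=0):
    """Approximately minimize criterion (2) by alternating the two steps above."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers
        for _ in range(max_iter):
            # Step 1: assign each point to its closest center
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: move each center to the average of its assigned points
            new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                    else centers[k] for k in range(K)])
            if np.allclose(new_centers, centers):   # centers (hence W) stopped changing
                break
            centers = new_centers
        W = dists[np.arange(len(X)), labels].sum()  # within-cluster variation of this run
        if best is None or W < best[0]:
            best = (W, labels, centers)             # keep the best of the restarts
    return best
```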
2.3 Vector quantization
[Figure: left, the original image; middle, using 23.9% of the storage; right, using 6.25% of the storage]
• How did we do this? The basic idea: we run K-means clustering on the 4 × 4 blocks of pixels in an image, and keep only the cluster centers and the label of each block. A smaller K means more compression. At the end, we simply reconstruct the image from the stored centers and labels
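As a rough illustration of this idea for a grayscale image, assuming scikit-learn is available (the 4 × 4 block size matches the description above; the function and variable names are just for this sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

def compress(img, K, block=4):
    """Cluster the (block x block) patches of a grayscale image with K-means,
    then rebuild the image from the K centers and the per-patch labels."""
    h, w = (img.shape[0] // block) * block, (img.shape[1] // block) * block
    img = img[:h, :w]
    # flatten the image into rows, one per block*block patch
    patches = (img.reshape(h // block, block, w // block, block)
                  .swapaxes(1, 2).reshape(-1, block * block))
    km = KMeans(n_clusters=K, n_init=10).fit(patches)
    # reconstruction: replace each patch by its cluster center
    rebuilt = km.cluster_centers_[km.labels_]
    return (rebuilt.reshape(h // block, w // block, block, block)
                   .swapaxes(1, 2).reshape(h, w))
```

Storing only the $K$ centers plus one label per patch, rather than every pixel, is what produces the savings quoted above.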
2.4 K-medoids clustering
• A cluster center is a representative for all points in a cluster (and is also called a prototype).
In K-means, we simply take a cluster center to be the average of the points in the cluster. This is great for computational purposes, but how does it lend itself to interpretation?
• This would be fine if we were clustering, e.g., houses in Pittsburgh based on features like price,
square footage, number of bedrooms, distance to nearest bus stop, etc. But not so if, e.g., we
were clustering images of faces (why?)
• In some applications we want each cluster center to itself be one of the points. This is where K-medoids clustering comes in: it is similar to the K-means algorithm, except that when fitting the centers $c_1, \ldots, c_K$, we restrict our attention to the points themselves
• As before, we start with an initial guess for the centers c1 , . . . cK (e.g., randomly select K of
the points x1 , . . . xn ), and then repeat:
1. Minimize over C: for each i = 1, . . . n, find the cluster center ck closest to xi , and let
C(i) = k
2. Minimize over $c_1, \ldots, c_K$: for each $k = 1, \ldots, K$, let $c_k = x_k^*$, the medoid of the points in cluster $k$, i.e., the point $x_i$ in cluster $k$ that minimizes $\sum_{C(j)=k} \|x_j - x_i\|_2^2$
We stop when the within-cluster variation doesn’t change
• Put in words, this algorithm repeats two steps:
1. Cluster (label) each point based on the closest center
2. Replace each center by the medoid of points in its cluster
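A small NumPy sketch of the medoid update in step 2, assuming `D` is the n × n matrix of squared dissimilarities and `labels` holds the current assignments (both names are hypothetical):

```python
import numpy as np

def update_medoids(D, labels, K):
    """For each cluster k, pick the member minimizing the total (squared)
    dissimilarity to the other members of k, i.e., the medoid."""
    medoids = np.empty(K, dtype=int)
    for k in range(K):
        members = np.flatnonzero(labels == k)              # indices of points in cluster k
        within = D[np.ix_(members, members)].sum(axis=1)   # each member's total dissimilarity to the rest
        medoids[k] = members[within.argmin()]              # the minimizer is the medoid
    return medoids
```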
• The K-medoids algorithm shares the properties of K-means that we discussed: each iteration
decreases the criterion; the algorithm always terminates; different starts can give different final answers; and it is not guaranteed to achieve the global minimum
• Importantly, K-medoids generally returns a higher value of the criterion
$$\sum_{k=1}^{K} \sum_{C(i)=k} \|x_i - c_k\|_2^2$$
than does K-means (why?). Also, K-medoids is computationally more challenging than K-
means, because of its second repeated step: computing the medoid of points is harder than
computing the average of points
• However, remember, K-medoids has the (potentially important) property that the centers are
located among the data points themselves
3 Hierarchical clustering
3.1 Motivation from K-means
• Two properties of K-means (or K-medoids) clustering: (1) the algorithm fits exactly K clusters
(as specified); (2) the final clustering assignment depends on the chosen initial cluster centers
• An alternative approach, called hierarchical clustering, produces a consistent result, without the need to choose initial starting positions (cluster centers). It also fits a sequence of clustering assignments, one for each possible number of underlying clusters $K = 1, \ldots, n$
• The caveat: we need to choose a way to measure the dissimilarity between groups of points,
called the linkage
• Given the linkage, hierarchical clustering produces a sequence of clustering assignments. At
one end, all points are in their own cluster, at the other end, all points are in one cluster
• There are two types of hierarchical clustering algorithms: agglomerative, and divisive. In
agglomerative, or bottom-up hierarchical clustering, the procedure is as follows:
– Start with all points in their own group
– Until there is only one cluster, repeatedly: merge the two groups that have the smallest
dissimilarity
• Agglomerative strategies are generally simpler than divisive ones, so we’ll focus on them.
Divisive methods are still important, but we won’t be able to cover them due to time constraints
3.2 Dendrograms
• To understand the need for dendrograms, it helps to consider a simple example. Given the data points on the left, an agglomerative hierarchical clustering algorithm might decide on the clustering sequence described on the right
[Figure: left, seven data points labeled 1–7 plotted against Dimension 1 and Dimension 2; right, the clustering sequence, from Step 1: {1}, {2}, {3}, {4}, {5}, {6}, {7} through Step 6: {1, 7}, {2, 3, 4, 5, 6}; below, the corresponding dendrogram, with merge heights on the vertical axis]
Note that cutting the dendrogram horizontally partitions the data points into clusters
• If we fix the leaf nodes at height zero, then each internal node is drawn at a height proportional
to the dissimilarity between its two daughter nodes
3.3 Linkages
• Suppose that we are given points $x_1, \ldots, x_n$, and dissimilarities $d_{ij}$ between each pair $x_i$ and $x_j$. As an example, you may think of $x_i \in \mathbb{R}^p$ and $d_{ij} = \|x_i - x_j\|_2$
• At any stage of the hierarchical clustering procedure, clustering assignments can be expressed by sets $G = \{i_1, i_2, \ldots, i_r\}$, giving the indices of points in this group. Let $n_G$ be the size of $G$ (here $n_G = r$). Bottom level: each group looks like $G = \{i\}$; top level: only one group, $G = \{1, \ldots, n\}$
• A linkage is a function d(G, H) that takes two groups G, H and returns a dissimilarity score
between them. Given the linkage, we can summarize agglomerative hierarchical clustering as
follows:
– Start with all points in their own group
– Until there is only one cluster, repeatedly: merge the two groups G, H such that d(G, H)
is smallest
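As a concrete (if inefficient) illustration, here is a from-scratch sketch of this merging loop, with the linkage passed in as a rule for scoring two groups from a pairwise dissimilarity matrix `D`; all names are hypothetical, and real implementations are far more efficient:

```python
import numpy as np

def agglomerative(D, linkage="single"):
    """Bottom-up clustering on an n x n dissimilarity matrix D.
    Returns the sequence of merges as (group_a, group_b, linkage score)."""
    agg = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    groups = [[i] for i in range(D.shape[0])]   # start with every point in its own group
    merges = []
    while len(groups) > 1:
        # find the pair of groups with the smallest linkage score d(G, H)
        d, a, b = min(((agg(D[np.ix_(g, h)]), i, j)
                       for i, g in enumerate(groups)
                       for j, h in enumerate(groups) if j > i),
                      key=lambda t: t[0])
        merges.append((groups[a], groups[b], d))
        merged = groups[a] + groups[b]
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    return merges
```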
3.3.1 Single linkage
• In single linkage (i.e., nearest-neighbor linkage), the dissimilarity between G, H is the smallest
dissimilarity between two points in opposite groups:
$$d_{\text{single}}(G, H) = \min_{i \in G,\, j \in H} d_{ij}$$
An illustration (dissimilarities $d_{ij}$ are distances, groups are marked by colors): the single linkage score $d_{\text{single}}(G, H)$ is the distance of the closest pair.
[Figure: two groups of points in the plane, with the closest pair of points in opposite groups highlighted]
• A single linkage clustering example: here $n = 60$, $x_i \in \mathbb{R}^2$, and $d_{ij} = \|x_i - x_j\|_2$. Cutting the tree at $h = 0.9$ gives the clustering assignments marked by colors:
[Figure: left, the data points colored by their cluster; right, the single linkage dendrogram, cut at height 0.9]
Cut interpretation: for each point $x_i$, there is another point $x_j$ in its cluster with $d_{ij} \leq 0.9$
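In practice one would typically use a library for this; a sketch with SciPy, using stand-in data in place of the 60 points above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))                       # stand-in data, x_i in R^2
Z = linkage(X, method="single")                    # single linkage merge tree (Euclidean distances)
labels = fcluster(Z, t=0.9, criterion="distance")  # cut the dendrogram at height h = 0.9
# dendrogram(Z) draws the tree; method="complete" or "average" gives the linkages discussed next
```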
3.3.2 Complete linkage
• In complete linkage (i.e., furthest-neighbor linkage), the dissimilarity between $G, H$ is the largest dissimilarity between two points in opposite groups:
$$d_{\text{complete}}(G, H) = \max_{i \in G,\, j \in H} d_{ij}$$
An illustration (dissimilarities $d_{ij}$ are distances, groups are marked by colors): the complete linkage score $d_{\text{complete}}(G, H)$ is the distance of the furthest pair.
[Figure: two groups of points in the plane, with the furthest pair of points in opposite groups highlighted]
• A complete linkage clustering example: same data set as before. Cutting the tree at h = 5
gives the clustering assignments marked by colors
[Figure: left, the data points colored by their cluster; right, the complete linkage dendrogram, cut at height 5]
Cut interpretation: for each point $x_i$, every other point $x_j$ in its cluster satisfies $d_{ij} \leq 5$
3.3.3 Average linkage
• In average linkage, the dissimilarity between $G, H$ is the average dissimilarity over all pairs of points in opposite groups:
$$d_{\text{average}}(G, H) = \frac{1}{n_G \cdot n_H} \sum_{i \in G,\, j \in H} d_{ij}$$
An illustration (dissimilarities $d_{ij}$ are distances, groups are marked by colors): the average linkage score $d_{\text{average}}(G, H)$ is the average distance across all pairs. (The plot here only shows the distances between the blue points and one red point.)
[Figure: two groups of points in the plane, with the distances from one red point to each of the blue points drawn]
• An average linkage clustering example: same data set as before. Cutting the tree at $h = 1.5$ gives the clustering assignments marked by colors:
[Figure: left, the data points colored by their cluster; right, the average linkage dendrogram, cut at height 1.5]
3.3.4 Common properties
• These linkages operate on the dissimilarities $d_{ij}$ alone, and don't need the data points $x_1, \ldots, x_n$ to be in Euclidean space
• Also, running agglomerative clustering with any of these linkages produces a dendrogram with
no inversions
• This last property, in words: the dissimilarity scores between merged clusters only increase as we run the algorithm. It means that we can draw a proper dendrogram, where the height of a parent node is always higher than the heights of its daughters
3.3.5 Shortcomings of single, complete linkage
• Single and complete linkage can have some practical problems
• Single linkage suffers from an issue we call chaining. In order to merge two groups, we only need one pair of points to be close, irrespective of all others. Therefore clusters can end up too spread out, and not compact enough
• Complete linkage avoids chaining, but suffers from an issue called crowding. Because its score
is based on the worst-case dissimilarity between pairs, a point can be closer to points in other
clusters than to points in its own cluster. Clusters are compact, but not far enough apart
• Average linkage tries to strike a balance. It uses average pairwise dissimilarity, so clusters tend
to be relatively compact and relatively far apart
• Recall that we’ve already seen examples of chaining and crowding:
[Figure: the clustering results from the earlier single, complete, and average linkage examples, shown side by side]
4 Choosing the number of clusters
• Sometimes K is implicitly defined by cutting a hierarchical clustering tree at a given height, determined by the application
• But in most exploratory applications, the number of clusters K is unknown. So we are left
asking the question: what is the “right” value of K?
• This is a very hard problem! Why? Determining the number of clusters is a hard task for
humans to perform (unless the data are low-dimensional). Not only that, it’s just as hard to
explain what it is we’re looking for. Usually, statistical learning is successful when at least one
of these is possible
• Why is it important? It can make a big difference for the interpretation of the clustering result
in the application, e.g., it might make a big difference scientifically if we were convinced that
there were K = 2 subtypes of breast cancer vs. K = 3 subtypes. Also, one of the (larger)
goals of data mining/statistical learning is automatic inference; choosing K is certainly part
of this
• For concreteness, we’re going to focus on K-means, but most ideas will carry over to other
settings. Recall that, given the number of clusters K, the K-means algorithm approximately
minimizes the within-cluster variation:
$$W_K = \sum_{k=1}^{K} \sum_{C(i)=k} \|x_i - \bar{x}_k\|_2^2$$
over cluster assignments $C$, where $\bar{x}_k = \frac{1}{n_k} \sum_{C(i)=k} x_i$ is the average of the points in group $k$.
We'll start off discussing a method for choosing $K$ that doesn't work, and then discuss two methods that can work well, in some cases
[Figure: left, a two-dimensional data set; right, the within-cluster variation $W_K$ plotted against the number of clusters $K$; $W_K$ keeps decreasing as $K$ grows]
This behavior shouldn’t be surprising, once we think about K-means clustering and the defi-
nition of within-cluster variation
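A quick way to see this behavior numerically, assuming scikit-learn is available (the data here are stand-ins; KMeans reports the within-cluster variation as `inertia_`):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))     # stand-in data
for K in range(1, 11):
    W_K = KMeans(n_clusters=K, n_init=10).fit(X).inertia_   # within-cluster variation W_K
    print(K, round(W_K, 2))       # W_K only goes down as K grows
```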
• Interestingly, even if we had computed the centers on training data, and evaluated $W_K$ on test data (assuming we had it), then there would still be a problem! Same example as above, but now the within-cluster variation on the right has been computed on a test data set:
[Figure: left, the training data; right, the within-cluster variation evaluated on a test data set, which also keeps decreasing as $K$ grows]
• A related quantity is the between-cluster variation, which measures how spread apart the groups are:
$$B_K = \sum_{k=1}^{K} n_k \|\bar{x}_k - \bar{x}\|_2^2,$$
where as before $\bar{x}_k$ is the average of points in group $k$, and $\bar{x}$ is the overall average, i.e.,
$$\bar{x}_k = \frac{1}{n_k} \sum_{C(i)=k} x_i \quad \text{and} \quad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
• Bigger BK is better, so can we use it to choose K? It’s still not going to work alone, for a
similar reason to the one above: between-cluster variation just increases as K increases
• A helpful realization: ideally we’d like our clustering assignments C to simultaneously have
a small WK and a large BK . This is the idea behind the CH index (named after the two
statisticians who proposed it, Calinski and Harabasz). For clustering assignments coming
from K clusters, we record the CH score:
$$\mathrm{CH}_K = \frac{B_K/(K-1)}{W_K/(n-K)}.$$
To choose $K$, we just pick some maximum number of clusters to be considered, $K_{\max}$ (e.g., $K_{\max} = 20$), and choose the value of $K$ with the largest score $\mathrm{CH}_K$,
$$\hat{K} = \underset{K \in \{2, \ldots, K_{\max}\}}{\mathrm{argmax}} \; \mathrm{CH}_K$$
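A sketch of this selection rule, following the $W_K$ and $B_K$ definitions above and assuming scikit-learn is available (its `sklearn.metrics.calinski_harabasz_score` computes the same ratio from data and labels):

```python
import numpy as np
from sklearn.cluster import KMeans

def ch_index(X, K):
    """CH_K = (B_K / (K - 1)) / (W_K / (n - K)) for a K-means fit of X."""
    n = X.shape[0]
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    xbar = X.mean(axis=0)
    W = km.inertia_                                      # within-cluster variation W_K
    B = sum(np.sum(km.labels_ == k) * np.sum((c - xbar) ** 2)
            for k, c in enumerate(km.cluster_centers_))  # between-cluster variation B_K
    return (B / (K - 1)) / (W / (n - K))

# K_max = 20
# scores = {K: ch_index(X, K) for K in range(2, K_max + 1)}
# K_hat = max(scores, key=scores.get)
```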
[Figure: left, the same data set as above; right, the CH index $\mathrm{CH}_K$ plotted against the number of clusters $K$]
• Another method is the gap statistic¹. Roughly speaking, it compares the within-cluster variation $W_K$ observed on the data to the within-cluster variation we would expect under an appropriate null (reference) distribution with no cluster structure, and favors the $K$ for which this gap is largest
[Figure: left, the same data set as above; right, the gap statistic plotted against the number of clusters $K$]
¹ Tibshirani et al. (2001), “Estimating the number of clusters in a data set via the gap statistic”
Here we would choose K = 3 clusters, which is also reasonable
• The gap statistic does especially well when the data fall into one cluster. Why? (Hint: think
about the null distribution that it uses)
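A rough sketch of the gap statistic, in the spirit of Tibshirani et al. (2001): compare $\log W_K$ on the data to its average over reference data sets drawn uniformly over the feature ranges (the number of reference draws and other details here are simplifications, not the paper's exact recipe):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap(X, K, n_ref=10, seed=0):
    """Average log within-cluster variation under a uniform reference,
    minus the observed log within-cluster variation."""
    rng = np.random.default_rng(seed)
    logW = np.log(KMeans(n_clusters=K, n_init=10).fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [np.log(KMeans(n_clusters=K, n_init=10)
                  .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
           for _ in range(n_ref)]
    return np.mean(ref) - logW
```

The uniform reference used here is the simplest choice of null distribution described in that paper.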