Clustering
1 Introduction to clustering
• Clustering is the task of dividing up data points into groups or clusters, so that points in any
one group are more “similar” to each other than to points outside the group
• Why cluster? Two main uses:
– Summary: deriving a reduced representation of the full data set. E.g., vector quantization (we'll see this shortly)
– Discovery: looking for new insights into the structure of the data. E.g., finding groups of
students that commit similar mistakes, groups of 80s songs that sound alike, or groups of
patients that have similar gene expressions
• There are other uses too, e.g., checking up on someone else’s work/decisions, by investigating
the validity of pre-existing group assignments; or helping with prediction, i.e., in regression or
classification. For brevity, we won’t study clustering in these contexts
• A neat example: clustering of baseball pitches
[Figure: scatter plots of Barry Zito's pitches, plotting back spin and side spin against start speed, with points colored by cluster]
Inferred meaning of clusters: black – fastball, red – sinker, green – changeup, blue – slider,
light blue – curveball. (This example is due to Mike Pane, a former CMU statistics undergrad,
from his undergraduate honors thesis)
• A note: don’t confuse clustering and classification! In classification, we have data points for
which the groups are known, and we try to learn what differentiates these groups (i.e., a
classification function) to properly classify future data. In clustering, we look at data points
for which groups are unknown and undefined, and try to learn the groups themselves, as well
as what differentiates them. In short, classification is a supervised task, whereas clustering is
an unsupervised task
• There are many ways to perform clustering (just like there are many ways to fit regression or
classification functions). We'll study two of the most common techniques
2 K-means clustering
2.1 Within-cluster variation
• Suppose that we have $n$ data points in $p$ dimensions, $x_1, \ldots, x_n \in \mathbb{R}^p$. Let $K$ be the number of clusters (consider this fixed). A clustering of the points $x_1, \ldots, x_n$ is a function $C$ that assigns each observation $x_i$ to a group $k \in \{1, \ldots, K\}$. Our notation: $C(i) = k$ means that $x_i$ is assigned to group $k$, and $n_k$ is the number of points in group $k$
• A natural objective is to choose the clustering C to minimize the within-cluster variation:
$$\sum_{k=1}^{K} \sum_{C(i)=k} \|x_i - \bar{x}_k\|_2^2, \qquad (1)$$
where $\bar{x}_k$ is the average of the points in group $k$.
This is because the smaller the within-cluster variation, the tighter the clustering of points
• A problem is that doing this exactly requires trying all possible assignments of the $n$ points into $K$ groups. The number of possible assignments is
$$N(n, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^n.$$
Note that $N(10, 4) = 34{,}105$, and $N(25, 4) \approx 5 \times 10^{13}$ ... this is incredibly huge!
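For the record, the count $N(n, K)$ above is the Stirling number of the second kind, and the two values just quoted can be checked directly with a throwaway snippet (the function name here is arbitrary):

```python
from math import comb, factorial

def n_assignments(n, K):
    # Number of ways to partition n points into K nonempty groups,
    # via the alternating-sum formula above.
    return sum((-1) ** (K - k) * comb(K, k) * k ** n for k in range(1, K + 1)) // factorial(K)

print(n_assignments(10, 4))   # 34105
print(n_assignments(25, 4))   # 46771289738810, roughly 5e13
```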
• Therefore we will have to settle for a clustering C that approximately minimizes the within-
cluster variation. We will start by recalling the following fact: for any points $z_1, \ldots, z_m \in \mathbb{R}^p$, the quantity
$$\sum_{i=1}^{m} \|z_i - c\|_2^2$$
is minimized by choosing $c = \bar{z} = \frac{1}{m} \sum_{i=1}^{m} z_i$, the average of $z_1, \ldots, z_m$ (a one-line verification is given just after this bullet). Hence, minimizing (1)
over a clustering assignment C is the same as minimizing the enlarged criterion
$$\sum_{k=1}^{K} \sum_{C(i)=k} \|x_i - c_k\|_2^2 \qquad (2)$$
over both the clustering assignment $C$ and the centers $c_1, \ldots, c_K$
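To see why the average minimizes the sum of squared distances above, expand the square around $\bar{z}$; the cross term vanishes because $\sum_{i=1}^{m} (z_i - \bar{z}) = 0$, leaving
$$\sum_{i=1}^{m} \|z_i - c\|_2^2 = \sum_{i=1}^{m} \|z_i - \bar{z}\|_2^2 + m\|\bar{z} - c\|_2^2 \;\geq\; \sum_{i=1}^{m} \|z_i - \bar{z}\|_2^2,$$
with equality exactly when $c = \bar{z}$.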
2.2 K-means clustering
• The K-means clustering algorithm approximately minimizes the enlarged criterion (2). It
does so by alternately minimizing over the choice of clustering assignment C and the centers
c1 , . . . cK
• We start with an initial guess for c1 , . . . cK (e.g., pick K points at random over the range of
x1 , . . . xn ), and then repeat:
1. Minimize over C: for each i = 1, . . . n, find the cluster center ck closest to xi , and let
C(i) = k
2. Minimize over c1 , . . . cK : for each k = 1, . . . K, let ck = x̄k , the average of group k points
We stop when the within-cluster variation doesn’t change
• Put in words, this algorithm repeats two steps:
1. Cluster (label) each point based on the closest center
2. Replace each center by the average of the points in its cluster
• Note that the within-cluster variation decreases with each iteration of the algorithm. I.e., if
Wt is the within-cluster variation at iteration t, then Wt+1 ≤ Wt . Also, the algorithm always
terminates, no matter the initial cluster centers. In fact, it takes $\leq K^n$ iterations (why?)
• The final clustering assignment generally depends on the initial cluster centers. Sometimes,
different initial centers lead to very different final outputs. Hence we typically run K-means
multiple times (e.g., 10 times), randomly initializing the cluster centers for each run, and then keep the run that gives the smallest within-cluster variation
• This is not guaranteed to deliver the clustering assignment that globally minimizes the within-
cluster variation (recall: this would require looking through all possible assignments!), but it
does tend to give us a pretty good approximation
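A minimal NumPy sketch of this procedure, with random restarts, is given below; the function name and the choice to initialize with $K$ randomly chosen data points are illustrative, not a reference implementation:

```python
import numpy as np

def kmeans(X, K, n_restarts=10, max_iter=100, seed=0):
    """Approximately minimize criterion (2) by alternating the two steps above."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers
        for _ in range(max_iter):
            # Step 1: assign each point to its closest center
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: move each center to the average of its assigned points
            new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                    else centers[k] for k in range(K)])
            if np.allclose(new_centers, centers):   # centers (hence W) stopped changing
                break
            centers = new_centers
        W = dists[np.arange(len(X)), labels].sum()  # within-cluster variation of this run
        if best is None or W < best[0]:
            best = (W, labels, centers)             # keep the best of the restarts
    return best
```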
2.3 Vector quantization
[Figure: left, the original image; middle, using 23.9% of the storage; right, using 6.25% of the storage]
• How did we do this? The basic idea: we run K-means clustering on the 4 × 4 blocks of pixels in an image, and keep only the cluster centers and the label of each block. A smaller K means more compression. At the end, we simply reconstruct the image from the stored centers and labels
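As a rough illustration of this idea for a grayscale image, assuming scikit-learn is available (the 4 × 4 block size matches the description above; the function and variable names are just for this sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

def compress(img, K, block=4):
    """Cluster the (block x block) patches of a grayscale image with K-means,
    then rebuild the image from the K centers and the per-patch labels."""
    h, w = (img.shape[0] // block) * block, (img.shape[1] // block) * block
    img = img[:h, :w]
    # flatten the image into rows, one per block*block patch
    patches = (img.reshape(h // block, block, w // block, block)
                  .swapaxes(1, 2).reshape(-1, block * block))
    km = KMeans(n_clusters=K, n_init=10).fit(patches)
    # reconstruction: replace each patch by its cluster center
    rebuilt = km.cluster_centers_[km.labels_]
    return (rebuilt.reshape(h // block, w // block, block, block)
                   .swapaxes(1, 2).reshape(h, w))
```

Storing only the $K$ centers plus one label per patch, rather than every pixel, is what produces the savings quoted above.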
2.4 K-medoids clustering
• A cluster center is a representative for all points in a cluster (and is also called a prototype).
In K-means, we simply take a cluster center to be the average of the points in the cluster. This is great for computational purposes, but how does it lend itself to interpretation?
• This would be fine if we were clustering, e.g., houses in Pittsburgh based on features like price,
square footage, number of bedrooms, distance to nearest bus stop, etc. But not so if, e.g., we
were clustering images of faces (why?)
• In some applications we want each cluster center to itself be one of the points. This is where K-medoids clustering comes in: it is similar to the K-means algorithm, except that when fitting the centers $c_1, \ldots, c_K$, we restrict our attention to the points themselves
• As before, we start with an initial guess for the centers c1 , . . . cK (e.g., randomly select K of
the points x1 , . . . xn ), and then repeat:
1. Minimize over C: for each i = 1, . . . n, find the cluster center ck closest to xi , and let
C(i) = k
2. Minimize over $c_1, \ldots, c_K$: for each $k = 1, \ldots, K$, let $c_k = x_k^*$, the medoid of the points in cluster $k$, i.e., the point $x_i$ in cluster $k$ that minimizes $\sum_{C(j)=k} \|x_j - x_i\|_2^2$
We stop when the within-cluster variation doesn’t change
• Put in words, this algorithm repeats two steps:
1. Cluster (label) each point based on the closest center
2. Replace each center by the medoid of points in its cluster
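A small NumPy sketch of the medoid update in step 2, assuming `D` is the n × n matrix of squared dissimilarities and `labels` holds the current assignments (both names are hypothetical):

```python
import numpy as np

def update_medoids(D, labels, K):
    """For each cluster k, pick the member minimizing the total (squared)
    dissimilarity to the other members of k, i.e., the medoid."""
    medoids = np.empty(K, dtype=int)
    for k in range(K):
        members = np.flatnonzero(labels == k)              # indices of points in cluster k
        within = D[np.ix_(members, members)].sum(axis=1)   # each member's total dissimilarity to the rest
        medoids[k] = members[within.argmin()]              # the minimizer is the medoid
    return medoids
```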
• The K-medoids algorithm shares the properties of K-means that we discussed: each iteration
decreases the criterion; the algorithm always terminates; different starts can give different final answers; and it is not guaranteed to achieve the global minimum
• Importantly, K-medoids generally returns a higher value of the criterion
$$\sum_{k=1}^{K} \sum_{C(i)=k} \|x_i - c_k\|_2^2$$
than does K-means (why?). Also, K-medoids is computationally more challenging than K-
means, because of its second repeated step: computing the medoid of points is harder than
computing the average of points
• However, remember, K-medoids has the (potentially important) property that the centers are
located among the data points themselves
3 Hierarchical clustering
3.1 Motivation from K-means
• Two properties of K-means (or K-medoids) clustering: (1) the algorithm fits exactly K clusters
(as specified); (2) the final clustering assignment depends on the chosen initial cluster centers
• An alternative approach, called hierarchical clustering, produces a consistent result, without the need to choose initial starting positions (cluster centers). It also fits a sequence of clustering assignments, one for each possible number of underlying clusters $K = 1, \ldots, n$
• The caveat: we need to choose a way to measure the dissimilarity between groups of points,
called the linkage
• Given the linkage, hierarchical clustering produces a sequence of clustering assignments. At
one end, all points are in their own cluster, at the other end, all points are in one cluster
• There are two types of hierarchical clustering algorithms: agglomerative, and divisive. In
agglomerative, or bottom-up hierarchical clustering, the procedure is as follows:
– Start with all points in their own group
– Until there is only one cluster, repeatedly: merge the two groups that have the smallest
dissimilarity
• Agglomerative strategies are generally simpler than divisive ones, so we’ll focus on them.
Divisive methods are still important, but we won’t be able to cover them due to time constraints
3.2 Dendrograms
• To understand the need for dendrograms, it helps to consider a simple example. Given the data points on the left, an agglomerative hierarchical clustering algorithm might decide on the clustering sequence described on the right
[Figure: left, seven data points labeled 1–7 plotted against Dimension 1 and Dimension 2; right, the clustering sequence, from Step 1: {1}, {2}, {3}, {4}, {5}, {6}, {7} through Step 6: {1, 7}, {2, 3, 4, 5, 6}; below, the corresponding dendrogram, with merge heights on the vertical axis]
Note that cutting the dendrogram horizontally partitions the data points into clusters
• If we fix the leaf nodes at height zero, then each internal node is drawn at a height proportional
to the dissimilarity between its two daughter nodes
3.3 Linkages
• Suppose that we are given points $x_1, \ldots, x_n$, and dissimilarities $d_{ij}$ between each pair $x_i$ and $x_j$. As an example, you may think of $x_i \in \mathbb{R}^p$ and $d_{ij} = \|x_i - x_j\|_2$
• At any stage of the hierarchical clustering procedure, clustering assignments can be expressed by sets $G = \{i_1, i_2, \ldots, i_r\}$, giving the indices of points in this group. Let $n_G$ be the size of $G$ (here $n_G = r$). Bottom level: each group looks like $G = \{i\}$; top level: only one group, $G = \{1, \ldots, n\}$
• A linkage is a function d(G, H) that takes two groups G, H and returns a dissimilarity score
between them. Given the linkage, we can summarize agglomerative hierarchical clustering as
follows:
– Start with all points in their own group
– Until there is only one cluster, repeatedly: merge the two groups G, H such that d(G, H)
is smallest
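As a concrete (if inefficient) illustration, here is a from-scratch sketch of this merging loop, with the linkage passed in as a rule for scoring two groups from a pairwise dissimilarity matrix `D`; all names are hypothetical, and real implementations are far more efficient:

```python
import numpy as np

def agglomerative(D, linkage="single"):
    """Bottom-up clustering on an n x n dissimilarity matrix D.
    Returns the sequence of merges as (group_a, group_b, linkage score)."""
    agg = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    groups = [[i] for i in range(D.shape[0])]   # start with every point in its own group
    merges = []
    while len(groups) > 1:
        # find the pair of groups with the smallest linkage score d(G, H)
        d, a, b = min(((agg(D[np.ix_(g, h)]), i, j)
                       for i, g in enumerate(groups)
                       for j, h in enumerate(groups) if j > i),
                      key=lambda t: t[0])
        merges.append((groups[a], groups[b], d))
        merged = groups[a] + groups[b]
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    return merges
```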
3.3.1 Single linkage
• In single linkage (i.e., nearest-neighbor linkage), the dissimilarity between G, H is the smallest
dissimilarity between two points in opposite groups:
$$d_{\text{single}}(G, H) = \min_{i \in G,\, j \in H} d_{ij}$$
An illustration (dissimilarities $d_{ij}$ are distances, groups are marked by colors): the single linkage score $d_{\text{single}}(G, H)$ is the distance of the closest pair.
[Figure: two groups of points in the plane, with the closest pair of points in opposite groups highlighted]
• A single linkage clustering example: here $n = 60$, $x_i \in \mathbb{R}^2$, and $d_{ij} = \|x_i - x_j\|_2$. Cutting the tree at $h = 0.9$ gives the clustering assignments marked by colors:
[Figure: left, the data points colored by their cluster; right, the single linkage dendrogram, cut at height 0.9]
Cut interpretation: for each point $x_i$, there is another point $x_j$ in its cluster with $d_{ij} \leq 0.9$
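In practice one would typically use a library for this; a sketch with SciPy, using stand-in data in place of the 60 points above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))                       # stand-in data, x_i in R^2
Z = linkage(X, method="single")                    # single linkage merge tree (Euclidean distances)
labels = fcluster(Z, t=0.9, criterion="distance")  # cut the dendrogram at height h = 0.9
# dendrogram(Z) draws the tree; method="complete" or "average" gives the linkages discussed next
```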
3.3.2 Complete linkage
• In complete linkage (i.e., furthest-neighbor linkage), the dissimilarity between $G, H$ is the largest dissimilarity between two points in opposite groups:
$$d_{\text{complete}}(G, H) = \max_{i \in G,\, j \in H} d_{ij}$$
An illustration (dissimilarities $d_{ij}$ are distances, groups are marked by colors): the complete linkage score $d_{\text{complete}}(G, H)$ is the distance of the furthest pair.
[Figure: two groups of points in the plane, with the furthest pair of points in opposite groups highlighted]
• A complete linkage clustering example: same data set as before. Cutting the tree at h = 5
gives the clustering assignments marked by colors
[Figure: left, the data points colored by their cluster; right, the complete linkage dendrogram, cut at height 5]
Cut interpretation: for each point $x_i$, every other point $x_j$ in its cluster satisfies $d_{ij} \leq 5$
3.3.3 Average linkage
• In average linkage, the dissimilarity between $G, H$ is the average dissimilarity over all pairs of points in opposite groups:
$$d_{\text{average}}(G, H) = \frac{1}{n_G \cdot n_H} \sum_{i \in G,\, j \in H} d_{ij}$$
An illustration (dissimilarities $d_{ij}$ are distances, groups are marked by colors): the average linkage score $d_{\text{average}}(G, H)$ is the average distance across all pairs. (The plot here only shows the distances between the blue points and one red point.)
[Figure: two groups of points in the plane, with the distances from one red point to each of the blue points drawn]
• An average linkage clustering example: same data set as before. Cutting the tree at $h = 1.5$ gives the clustering assignments marked by colors:
[Figure: left, the data points colored by their cluster; right, the average linkage dendrogram, cut at height 1.5]
3.3.4 Common properties
• These linkages operate on the dissimilarities $d_{ij}$ alone, and don't need the data points $x_1, \ldots, x_n$ to be in Euclidean space
• Also, running agglomerative clustering with any of these linkages produces a dendrogram with
no inversions
• This last property, in words: the dissimilarity scores between merged clusters only increase as we run the algorithm. It means that we can draw a proper dendrogram, where the height of a parent node is always higher than the heights of its daughters
3.3.5 Shortcomings of single, complete linkage
• Single and complete linkage can have some practical problems
• Single linkage suffers from an issue we call chaining. In order to merge two groups, we only need one pair of points to be close, irrespective of all others. Therefore clusters can end up too spread out, and not compact enough
• Complete linkage avoids chaining, but suffers from an issue called crowding. Because its score
is based on the worst-case dissimilarity between pairs, a point can be closer to points in other
clusters than to points in its own cluster. Clusters are compact, but not far enough apart
• Average linkage tries to strike a balance. It uses average pairwise dissimilarity, so clusters tend
to be relatively compact and relatively far apart
• Recall that we’ve already seen examples of chaining and crowding:
[Figure: the clustering results from the earlier single, complete, and average linkage examples, shown side by side]
4 Choosing the number of clusters
• Sometimes K is implicitly defined by cutting a hierarchical clustering tree at a given height, determined by the application
• But in most exploratory applications, the number of clusters K is unknown. So we are left
asking the question: what is the “right” value of K?
• This is a very hard problem! Why? Determining the number of clusters is a hard task for
humans to perform (unless the data are low-dimensional). Not only that, it’s just as hard to
explain what it is we’re looking for. Usually, statistical learning is successful when at least one
of these is possible
• Why is it important? It can make a big difference for the interpretation of the clustering result
in the application, e.g., it might make a big difference scientifically if we were convinced that
there were K = 2 subtypes of breast cancer vs. K = 3 subtypes. Also, one of the (larger)
goals of data mining/statistical learning is automatic inference; choosing K is certainly part
of this
• For concreteness, we’re going to focus on K-means, but most ideas will carry over to other
settings. Recall that, given the number of clusters K, the K-means algorithm approximately
minimizes the within-cluster variation:
$$W_K = \sum_{k=1}^{K} \sum_{C(i)=k} \|x_i - \bar{x}_k\|_2^2$$
over cluster assignments $C$, where $\bar{x}_k = \frac{1}{n_k} \sum_{C(i)=k} x_i$ is the average of the points in group $k$.
We'll start off discussing a method for choosing $K$ that doesn't work, and then discuss two methods that can work well, in some cases
[Figure: left, a two-dimensional data set; right, the within-cluster variation $W_K$ plotted against the number of clusters $K$; $W_K$ keeps decreasing as $K$ grows]
This behavior shouldn’t be surprising, once we think about K-means clustering and the defi-
nition of within-cluster variation
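A quick way to see this behavior numerically, assuming scikit-learn is available (the data here are stand-ins; KMeans reports the within-cluster variation as `inertia_`):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))     # stand-in data
for K in range(1, 11):
    W_K = KMeans(n_clusters=K, n_init=10).fit(X).inertia_   # within-cluster variation W_K
    print(K, round(W_K, 2))       # W_K only goes down as K grows
```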
• Interestingly, even if we had computed the centers on training data, and evaluated $W_K$ on test data (assuming we had it), then there would still be a problem! Same example as above, but now the within-cluster variation on the right has been computed on a test data set:
[Figure: left, the training data; right, the within-cluster variation evaluated on a test data set, which also keeps decreasing as $K$ grows]
• A related quantity is the between-cluster variation, which measures how spread apart the groups are:
$$B_K = \sum_{k=1}^{K} n_k \|\bar{x}_k - \bar{x}\|_2^2,$$
where as before $\bar{x}_k$ is the average of points in group $k$, and $\bar{x}$ is the overall average, i.e.,
$$\bar{x}_k = \frac{1}{n_k} \sum_{C(i)=k} x_i \quad \text{and} \quad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
• Bigger BK is better, so can we use it to choose K? It’s still not going to work alone, for a
similar reason to the one above: between-cluster variation just increases as K increases
• A helpful realization: ideally we’d like our clustering assignments C to simultaneously have
a small WK and a large BK . This is the idea behind the CH index (named after the two
statisticians who proposed it, Calinski and Harabasz). For clustering assignments coming
from K clusters, we record the CH score:
$$\mathrm{CH}_K = \frac{B_K/(K-1)}{W_K/(n-K)}.$$
To choose $K$, we just pick some maximum number of clusters to be considered, $K_{\max}$ (e.g., $K_{\max} = 20$), and choose the value of $K$ with the largest score $\mathrm{CH}_K$,
$$\hat{K} = \underset{K \in \{2, \ldots, K_{\max}\}}{\mathrm{argmax}} \; \mathrm{CH}_K$$
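A sketch of this selection rule, following the $W_K$ and $B_K$ definitions above and assuming scikit-learn is available (its `sklearn.metrics.calinski_harabasz_score` computes the same ratio from data and labels):

```python
import numpy as np
from sklearn.cluster import KMeans

def ch_index(X, K):
    """CH_K = (B_K / (K - 1)) / (W_K / (n - K)) for a K-means fit of X."""
    n = X.shape[0]
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    xbar = X.mean(axis=0)
    W = km.inertia_                                      # within-cluster variation W_K
    B = sum(np.sum(km.labels_ == k) * np.sum((c - xbar) ** 2)
            for k, c in enumerate(km.cluster_centers_))  # between-cluster variation B_K
    return (B / (K - 1)) / (W / (n - K))

# K_max = 20
# scores = {K: ch_index(X, K) for K in range(2, K_max + 1)}
# K_hat = max(scores, key=scores.get)
```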
[Figure: left, the same data set as above; right, the CH index $\mathrm{CH}_K$ plotted against the number of clusters $K$]
• Another method is the gap statistic¹. Roughly speaking, it compares the within-cluster variation $W_K$ observed on the data to the within-cluster variation we would expect under an appropriate null (reference) distribution with no cluster structure, and favors the $K$ for which this gap is largest
[Figure: left, the same data set as above; right, the gap statistic plotted against the number of clusters $K$]
¹ Tibshirani et al. (2001), “Estimating the number of clusters in a data set via the gap statistic”
Here we would choose K = 3 clusters, which is also reasonable
• The gap statistic does especially well when the data fall into one cluster. Why? (Hint: think
about the null distribution that it uses)
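A rough sketch of the gap statistic, in the spirit of Tibshirani et al. (2001): compare $\log W_K$ on the data to its average over reference data sets drawn uniformly over the feature ranges (the number of reference draws and other details here are simplifications, not the paper's exact recipe):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap(X, K, n_ref=10, seed=0):
    """Average log within-cluster variation under a uniform reference,
    minus the observed log within-cluster variation."""
    rng = np.random.default_rng(seed)
    logW = np.log(KMeans(n_clusters=K, n_init=10).fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [np.log(KMeans(n_clusters=K, n_init=10)
                  .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
           for _ in range(n_ref)]
    return np.mean(ref) - logW
```

The uniform reference used here is the simplest choice of null distribution described in that paper.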