Module 12.02 Unsupervised Learning
Clustering
A simulated data set with 150 observations in 2-dimensional space. Panels show the
results of applying K-means clustering with different values of K , the number of
clusters. The color of each observation indicates the cluster to which it was assigned
using the K-means clustering algorithm. Note that there is no ordering of the
clusters, so the cluster coloring is arbitrary.
These cluster labels were not used in clustering; instead, they are the outputs of the
clustering procedure.
Details of K-means clustering
The within-cluster variation for cluster C_k, using squared Euclidean distance, is

W(C_k) = (1/|C_k|) Σ_{i,i′ ∈ C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²    (2)

i.e., the sum of all pairwise squared Euclidean distances between the observations in C_k, divided by the number of observations |C_k|. K-means seeks the assignment of observations to clusters C_1, …, C_K that minimizes Σ_{k=1}^{K} W(C_k).
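The standard K-means procedure (random initial assignment, then alternating centroid computation and reassignment) can be sketched in a few lines. This is an illustrative sketch with our own function name, arguments, and data, not a production implementation:

```python
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """Plain K-means: alternate centroid computation and reassignment."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of K clusters.
    labels = rng.integers(0, K, size=len(X))
    centroids = None
    for _ in range(n_iter):
        # Step 2a: compute the centroid (feature-wise mean) of each cluster.
        # If a cluster is empty, fall back to a randomly chosen observation.
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k)
            else X[rng.integers(len(X))]
            for k in range(K)
        ])
        # Step 2b: reassign each observation to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stable: a local optimum has been reached
        labels = new_labels
    return labels, centroids
```

Note that, as the figure caption above says, the resulting cluster labels have no inherent ordering, and different random initializations can yield different local optima.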
Hierarchical Clustering Algorithm
The approach in words:
• Start with each point in its own cluster.
• Identify the closest two clusters and merge them.
• Repeat.
• Stop when all points are in a single cluster.
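The steps above can be sketched directly. This toy implementation uses complete linkage and our own names and data; it is O(n³) per merge scan, unlike optimized library implementations:

```python
import numpy as np

def agglomerative(X):
    """Start with singletons; repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(X))]
    merges = []  # records (cluster_a, cluster_b, height) for each fusion

    def dist(a, b):
        # Complete linkage: largest pairwise Euclidean distance.
        return max(np.linalg.norm(X[i] - X[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Identify the closest pair of clusters...
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        # ...and merge them; d becomes the fusion height in the dendrogram.
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

The recorded fusion heights are exactly what a dendrogram plots: the height at which two clusters merge.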
Dendrogram
[Figure: dendrogram for points A, B, C, D, and E, with fusion heights marked on a vertical axis running from 0 to 4.]
Types of Linkage
• Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
• Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
• Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
• Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
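For concreteness, all four linkages can be computed directly from the pairwise dissimilarities between two clusters. The two clusters below are made-up toy data:

```python
import numpy as np

# Two small toy clusters of 2-dimensional observations (made-up data).
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean dissimilarities between observations in A and B.
pairwise = np.array([[np.linalg.norm(a - b) for b in B] for a in A])

complete = pairwise.max()   # largest pairwise dissimilarity
single = pairwise.min()     # smallest pairwise dissimilarity
average = pairwise.mean()   # mean pairwise dissimilarity

# Centroid linkage: distance between the two cluster mean vectors.
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```

By construction, single ≤ average ≤ complete always holds, while centroid linkage can fall outside expected orderings, which is what makes inversions possible.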
An Example
[Figure: scatter plot of simulated data in two dimensions, with X1 on the horizontal axis and X2 on the vertical axis.]
[Figure: three dendrograms, each with a height axis running from 0 to 10.]
Details of previous figure
• Left: Dendrogram obtained from hierarchically clustering the data from
previous slide, with complete linkage and Euclidean distance.
• Center: The dendrogram from the left-hand panel, cut at a height of 9 (indicated by the dashed line). This cut results in two distinct clusters, shown in different colors.
• Right: The dendrogram from the left-hand panel, now cut at a height of 5. This cut results in three distinct clusters, shown in different colors. Note that the colors were not used in clustering, but are simply used for display purposes in this figure.
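In practice, fitting a hierarchical clustering and cutting the dendrogram at a chosen height can be done with SciPy's hierarchical-clustering routines, assuming SciPy is available. The data and cut height below are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy groups (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# Complete linkage with Euclidean distance, as in the figure above.
Z = linkage(X, method="complete")

# Cut the dendrogram at height 5: every fusion above that height is
# undone, leaving the two well-separated groups as distinct clusters.
labels = fcluster(Z, t=5, criterion="distance")
```

Changing `t` moves the horizontal cut up or down the dendrogram, giving fewer or more clusters, exactly as in the center and right panels described above.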
Choice of Dissimilarity Measure
• So far we have used Euclidean distance.
• An alternative is correlation-based distance which considers
two observations to be similar if their features are highly
correlated.
• Here correlation is computed between the observation
profiles for each pair of observations.
• Correlation-based distance cares more about the shapes of the profiles than about their levels.
[Figure: profiles of Observations 1, 2, and 3 across 20 variables, with Variable Index on the horizontal axis.]
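A quick numerical illustration of the shape-versus-level point (the vectors here are made up): two profiles with the same shape but very different levels are essentially identical under correlation-based distance, while Euclidean distance ranks them far apart.

```python
import numpy as np

# Three made-up observation profiles over four variables.
obs1 = np.array([1.0, 2.0, 3.0, 4.0])
obs2 = obs1 + 10.0                      # same shape, much higher level
obs3 = np.array([4.0, 3.0, 2.0, 1.0])  # different shape, similar level

def corr_distance(a, b):
    # Correlation-based distance: 1 minus the sample correlation.
    return 1.0 - np.corrcoef(a, b)[0, 1]

# Correlation-based distance treats obs1 and obs2 as essentially identical
# (perfectly correlated profiles)...
d_shape = corr_distance(obs1, obs2)

# ...while Euclidean distance puts obs1 much closer to obs3 than to obs2.
euc_12 = np.linalg.norm(obs1 - obs2)  # 20
euc_13 = np.linalg.norm(obs1 - obs3)  # about 4.47
```

Which notion of similarity is appropriate depends on the application; for example, shapes of profiles are often what matter when clustering expression or purchase patterns.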
Practical Issues for Clustering
1. Scaling of the variables matters.
2. In some cases, standardizing each variable (e.g., to mean zero and standard deviation one) may be useful.
3. What dissimilarity measure and linkage should be used (for hierarchical clustering)?
4. Choice of K for K-means clustering.
5. Which features should be used to drive the clustering?
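For item 2, one common form of standardization is to center each feature and scale it to unit standard deviation before clustering. This is a sketch of that one choice; other scalings are possible:

```python
import numpy as np

def standardize(X):
    """Center each feature (column) and scale it to standard deviation one."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Made-up data in which the second feature's much larger scale would
# otherwise dominate any Euclidean-distance-based clustering.
X = np.array([[1.0, 100.0], [2.0, 400.0], [3.0, 250.0]])
X_std = standardize(X)
```

After standardization, each feature contributes comparably to Euclidean distances, so no single feature dominates the clustering simply because of its units.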
Example