Cluster Analysis
April 2010
In terms of building prediction and classification models, cluster analysis can help
the analyst identify groups of observations, defined in terms of the input variables, that in
turn can lead to different models for each group. This is, of course, assuming that the output
relationships vis-à-vis the input variables are not the same across the groups. One can always
test the "poolability" of the models, either by conventional hypothesis tests when considering
econometric models, or by comparing accuracy measures across validation and test data partitions
when considering machine learning models.
Hierarchical Clustering
With respect to hierarchical clustering, the final clusters chosen are built in a
series of steps. If we start with N objects, each being in its own separate cluster, then
combine one of the clusters with another cluster resulting in N - 1 clusters, and continue
to combine clusters into fewer and fewer clusters with more and more objects in each
cluster, we are engaging in Agglomerative clustering. In contrast, if we start with all of
the objects being in a single cluster, then remove one of the objects to form a second
cluster, and continue to build more and more clusters with fewer and fewer objects in
each cluster until each object is in its own cluster, we are engaging in Divisive
clustering. The distinction between these two hierarchical methods is represented in the
figure below, taken from the XLMINER help file.
Figure 1
Hierarchical Clustering:
Agglomerative versus Divisive Methods
The above figure is called a dendrogram and represents the fusions or divisions made at
each successive stage of the analysis. More formally then, a dendrogram is a tree-like
diagram that summarizes the process of clustering.
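For readers who want to experiment outside XLMINER, the short Python sketch below (an illustration only, using the SciPy and Matplotlib libraries and a small made-up data matrix rather than any data set discussed here) runs agglomerative clustering and draws the resulting dendrogram.

```python
# A minimal sketch of agglomerative clustering with SciPy (illustrative data only).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Each row is an object (case); each column is a variable.
X = np.array([[1.0, 2.0],
              [1.2, 1.9],
              [5.0, 6.5],
              [5.2, 6.4],
              [9.0, 1.0]])

# linkage() starts with every object in its own cluster and merges the two
# closest clusters at each step -- i.e., agglomerative clustering.
Z = linkage(X, method="average", metric="euclidean")

# The dendrogram summarizes the sequence of fusions and the distance at which
# each fusion occurred.
dendrogram(Z, labels=[f"obj {i+1}" for i in range(len(X))])
plt.ylabel("Distance")
plt.show()
```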
Distance Measures Used in Clustering
One general measure of the distance between two objects (cases) $i$ and $j$, each measured on $p$ variables, is the weighted Euclidean distance

$$ d_{ij}^{*} = \sqrt{\, w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_p (x_{ip} - x_{jp})^2 \,} \qquad (2) $$

where the weights $w_1, w_2, \ldots, w_p$ satisfy $w_i \ge 0$ and $\sum_{i=1}^{p} w_i = 1$. For the
remaining discussion let us focus on the Euclidean measure of distance between
objects (cases).
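As a quick illustration of equation (2), the following sketch (made-up vectors and weights, purely for illustration) computes the weighted Euclidean distance directly from the formula.

```python
# Weighted Euclidean distance between two objects, mirroring equation (2).
import numpy as np

def weighted_euclidean(x_i, x_j, w):
    """d*_ij = sqrt( sum_k w_k * (x_ik - x_jk)^2 ), with w_k >= 0 and sum(w) = 1."""
    return np.sqrt(np.sum(w * (x_i - x_j) ** 2))

x_i = np.array([3.0, 1.5, 10.0])
x_j = np.array([2.0, 2.5,  7.0])
w   = np.array([1/3, 1/3, 1/3])   # equal weights give a rescaled ordinary Euclidean distance

print(weighted_euclidean(x_i, x_j, w))
```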
The Single Linkage distance between two clusters is defined as the distance
between the nearest pair of objects in the two clusters (one object in each cluster). If
cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is the set of objects
$B_1, B_2, \ldots, B_n$, the Single Linkage distance between clusters A and B is

$$ D(A, B) = \min_{i, j} \, d(A_i, B_j) . $$
At each stage of hierarchical clustering based on the Single Linkage distance measure,
the clusters A and B, for which D(A, B) is minimum, are merged. The Single Linkage
distance is represented in the XLMINER Help File figure below:
Figure 2
The Complete Linkage distance between two clusters is defined as the distance
between the most distant (farthest) pair of objects in the two clusters (one object in each
cluster). If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is the set of
objects $B_1, B_2, \ldots, B_n$, the Complete Linkage distance between clusters A and B is

$$ D(A, B) = \max_{i, j} \, d(A_i, B_j) . $$

At each stage of hierarchical clustering based on the Complete Linkage distance measure,
the clusters A and B, for which D(A, B) is minimum, are merged. The Complete Linkage
distance is represented in the XLMINER Help File figure below:
Figure 3
Average Linkage
Under Average Linkage the distance between two clusters is defined to be the
average of the distances between all pairs of objects, where each pair is made up of one
object from each cluster. If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is
the set of objects $B_1, B_2, \ldots, B_n$, the Average Linkage distance between clusters A and B is
$$ D(A, B) = \frac{T_{AB}}{N_A \cdot N_B} $$

where $T_{AB}$ is the sum of all pairwise distances between cluster A and cluster B, and
$N_A$ and $N_B$ are the sizes of the clusters A and B, respectively.
At each stage of hierarchical clustering based on the Average Linkage distance measure,
the clusters A and B are merged such that, after the merger, the average pairwise distance
within the newly formed cluster is minimum. The Average Linkage distance is
represented in the XLMINER Help File figure below:
Figure 4
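To make the three linkage definitions concrete, the sketch below (illustrative only, with two small made-up clusters) computes D(A, B) under Single, Complete, and Average Linkage directly from the matrix of pairwise object distances.

```python
# Single, Complete, and Average Linkage distances between two clusters,
# computed directly from the definitions (illustrative data only).
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2]])   # cluster A: m = 3 objects
B = np.array([[5.0, 6.0], [5.5, 6.3]])               # cluster B: n = 2 objects

pairwise = cdist(A, B, metric="euclidean")           # m x n matrix of d(A_i, B_j)

single_linkage   = pairwise.min()    # nearest pair of objects, one from each cluster
complete_linkage = pairwise.max()    # farthest pair of objects
average_linkage  = pairwise.mean()   # T_AB / (N_A * N_B)

print(single_linkage, complete_linkage, average_linkage)
```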
Dendrograms are more useful visually when there is a smaller number of cases,
as in the Utilities.xls data set. The agglomerative procedure still works for larger
data sets, but it is computationally intensive in that n × n distance matrices are the basic
building blocks of the Agglomerative procedure.
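The n × n distance matrix referred to above can be formed in one step with SciPy; a minimal sketch (with made-up data standing in for the Utilities.xls cases) follows.

```python
# The n x n matrix of pairwise distances that agglomerative clustering starts from.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(22, 8))   # pretend: 22 cases, 8 variables
D = squareform(pdist(X, metric="euclidean"))        # full n x n symmetric distance matrix

print(D.shape)   # (22, 22); storage and work grow quickly as n grows
```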
Dendrogram (Average Linkage) for the Utilities data: distance on the vertical axis, with the 22 utilities along the horizontal axis.
If we put our horizontal ruler at 4.0 for the maximal distance allowed between
clusters, we "cut across" 4 vertical lines and thus get 4 clusters; among them are
{7,12,21,15,17}, {5}, and {8,16,11}. If we put our horizontal ruler at 3.5 for the maximal
distance allowed between clusters, we "cut across" 7 vertical lines and thus get 7 clusters. The four
cluster group is constructed by combining the first and second clusters, the third and
fourth clusters, and the sixth and seventh clusters in the seven cluster group. You can
now see why this type of clustering is called hierarchical: the 4 cluster group is
constructed by combining the cluster groupings immediately below it. As you move up
slowly from the bottom of the dendrogram to the top, you move from n clusters to n-1
clusters to n-2 clusters, and so on, until all of the observations are contained in one cluster.
To show how sensitive the choice of clusters is to the choice of distance, consider
the Single Linkage dendrogram for the Utilities data:
Dendrogram (Single Linkage) for the Utilities data: distance on the vertical axis, with the 22 utilities along the horizontal axis.
In the case of forming 4 groups, set the maximal allowed distance to be 3.0 in the above
dendrogram. Then we get the following 4 clusters: {5}; {11}; {17}; {rest}. These
four clusters are quite different from the 4 clusters determined by using the Average
Linkage dendrogram. This just goes to show that cluster analysis is an art form; the
clusters should be interpreted with caution and accepted only if they make sense given
the domain-specific knowledge we have concerning the utilities under study.
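The "horizontal ruler" operation above corresponds to cutting the dendrogram at a chosen maximal distance. The sketch below (made-up data standing in for the Utilities.xls cases, and arbitrary cutoffs) shows how this can be done with SciPy's fcluster, and how the Average and Single Linkage trees can yield quite different groupings.

```python
# "Cutting" a dendrogram at a maximal allowed distance to obtain flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(22, 8))                      # pretend: 22 cases, 8 variables

Z_avg    = linkage(X, method="average")
Z_single = linkage(X, method="single")

# criterion="distance" cuts each tree at the given height (the "horizontal ruler").
labels_avg    = fcluster(Z_avg,    t=4.0, criterion="distance")
labels_single = fcluster(Z_single, t=3.0, criterion="distance")

# The two label vectors will generally partition the cases quite differently.
print(labels_avg)
print(labels_single)
```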
Non-hierarchical Clustering (K-means)
Running XLMINER's K-means clustering on the Utilities data with the following settings:

Normalized data
10 Random Starts
10 iterations per start
Fixed random seed = 12345
Number of reported clusters = 4

produces a 4 cluster grouping. One can then use domain-specific knowledge to determine whether this 4
cluster grouping makes more or less sense than the 4 group clusters determined by
either of the choices of cluster distance in the agglomerative approach.
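For readers working outside XLMINER, a roughly comparable run can be set up with scikit-learn. The sketch below is only an approximation of the settings listed above: standardizing the variables stands in for "Normalized data", n_init=10 for the 10 random starts, random_state for the fixed seed, and the data matrix is made up rather than being the actual Utilities data.

```python
# A rough scikit-learn analogue of the K-means settings listed above (approximation only).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(12345)
X = rng.normal(size=(22, 8))                 # stand-in for the Utilities data

X_norm = StandardScaler().fit_transform(X)   # "Normalized data"

km = KMeans(
    n_clusters=4,        # Number of reported clusters = 4
    n_init=10,           # 10 random starts; the best (lowest WCSS) run is kept
    max_iter=10,         # roughly "10 iterations per start"
    random_state=12345,  # fixed random seed
)
labels = km.fit_predict(X_norm)

print(labels)            # cluster membership of each case
print(km.inertia_)       # WCSS of the chosen solution
```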
Given a set of observations $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, where each observation is a
d-dimensional real vector, K-means clustering aims to partition the n observations into
K sets (K < n), $S = \{S_1, S_2, \ldots, S_K\}$, so as to minimize the within-cluster sum of
squares (WCSS):

$$ \arg\min_{S} \sum_{i=1}^{K} \sum_{\mathbf{x}_j \in S_i} \left\| \mathbf{x}_j - \mu_i \right\|^2 \qquad (1) $$
where $\mu_i$ is the mean of the points in $S_i$. Now minimizing (1) can, in theory, be done by
the integer programming method, but this can be extremely time-consuming. Instead,
the Lloyd algorithm is more often used. The steps of the Lloyd algorithm are as follows.
Given the initial set of K means $m_1^{(1)}, \ldots, m_K^{(1)}$, which can be specified randomly or by
some heuristic, the algorithm proceeds by alternating between two steps:
Assignment Step: Assign each observation to the cluster with the closest mean:

$$ S_i^{(t)} = \left\{ \mathbf{x}_j : \left\| \mathbf{x}_j - m_i^{(t)} \right\| \le \left\| \mathbf{x}_j - m_{i^*}^{(t)} \right\| \ \text{for all } i^* = 1, 2, \ldots, K \right\} \qquad (2) $$
Update Step: Calculate the new means to be the centroids of the observations in
the clusters, i.e.,

$$ m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{\mathbf{x}_j \in S_i^{(t)}} \mathbf{x}_j \quad \text{for } i = 1, 2, \ldots, K. \qquad (3) $$
Repeat the Assignment and Update steps until WCSS (equation (1)) no longer
changes. Then the centroids and members of the K clusters are determined.
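A bare-bones implementation of these two alternating steps, written directly from equations (2) and (3), might look like the following sketch (the function name lloyd_kmeans and the random initialization from the observations are illustrative choices, not part of the original discussion).

```python
# A minimal Lloyd's algorithm sketch following the Assignment and Update steps above.
import numpy as np

def lloyd_kmeans(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Initial K means chosen randomly from the observations.
    means = X[rng.choice(len(X), size=K, replace=False)].copy()

    prev_wcss = np.inf
    for _ in range(max_iter):
        # Assignment step (equation (2)): each x_j goes to the nearest current mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # n x K
        labels = dists.argmin(axis=1)

        # Update step (equation (3)): each mean becomes the centroid of its cluster.
        for i in range(K):
            if np.any(labels == i):              # leave an empty cluster's mean unchanged
                means[i] = X[labels == i].mean(axis=0)

        # Stop once the WCSS (equation (1)) no longer changes.
        wcss = np.sum((X - means[labels]) ** 2)
        if np.isclose(wcss, prev_wcss):
            break
        prev_wcss = wcss

    return labels, means, wcss

# Example call on made-up data:
X = np.random.default_rng(3).normal(size=(50, 2))
labels, means, wcss = lloyd_kmeans(X, K=4)
print(wcss)
```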
Note: When using random assignment of the K means to start the algorithm, one might
try several random starting sets of K means and then choose the "best" one, namely the
starting set that produces the smallest WCSS among all of the random starts tried.
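Continuing with the hypothetical lloyd_kmeans sketch above, the multiple-random-starts idea amounts to a small loop that keeps the run with the smallest WCSS:

```python
# Several random starts; keep the run with the smallest WCSS (uses the sketch above).
best = min((lloyd_kmeans(X, K=4, seed=s) for s in range(10)), key=lambda r: r[2])
best_labels, best_means, best_wcss = best
print(best_wcss)
```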
Regardless of the clustering technique used, one should strive to choose clusters
that are interpretable and make sense given the domain-specific knowledge that we have
about the problem at hand.