
Cluster Analysis

Prof. Thomas B. Fomby


Department of Economics
Southern Methodist University
Dallas, TX 75275

April 2008
April 2010

Cluster Analysis, sometimes called data segmentation or customer
segmentation, is an unsupervised learning method. As you will recall, a method is an
unsupervised learning method if it doesn’t involve prediction or classification. The major
purpose of Cluster Analysis is to group together collections of objects (e.g. customers)
into “clusters” so that the objects in the clusters are “similar.” One reason a company
might want to organize its customers into groups is to come to better understand the
nature of its customers. Given the delineation of its customers into distinct groups, the
company could advertise differently to its distinct groups, send different catalogues to its
distinct groups, and the like.

In terms of building prediction and classification models, cluster analysis can help
the analyst identify groups of input variables that in turn can lead to different models for
each group. This is, of course, assuming that the output relationships vis-à-vis the input
variables across the groups are not the same. But then one can always test the
“poolability” of the models by either conventional hypothesis tests, when considering
econometric models, or accuracy measures across validation and test data partitions when
considering machine learning models.

As one will come to understand after working on several clustering projects,
clustering is an "Art Form." It must be practiced with care. The more experience you
have in doing cluster analysis, the better you become as a practitioner. Before beginning
cluster analysis it is often recommended that the data be normalized first. Cluster
analysis based on variables with very different scales of measurement can lead to clusters
that are not very robust to adding or deleting variables or observations. In this
discussion, we will be focusing on clustering only continuous input variables. The
clustering of mixed data, some continuous and some categorical, is not considered here
as it is beyond the scope of this discussion.

Now let us begin. There are two basic approaches to clustering:


a) Hierarchical Clustering (Agglomerative Clustering discussed here)
b) Non-hierarchical clustering (K-means)

Hierarchical Clustering

With respect to hierarchical clustering, the final clusters chosen are built in a
series of steps. If we start with N objects, each being in its own separate cluster, and then
combine one of the clusters with another cluster resulting in N – 1 clusters and continue
to combine clusters into fewer and fewer clusters with more and more objects in each
cluster, we are engaging in Agglomerative clustering. In contrast, if we start with all of
the objects being in a single cluster and then remove one of the objects to form a second
cluster and then continue to build more and more clusters with fewer and fewer objects in
each cluster until each object is in its own cluster, we are engaging in Divisive
clustering. The distinction between these two hierarchical methods is represented in the
figure below, taken from the XLMINER help file.

Figure 1

Hierarchical Clustering:
Agglomerative versus Divisive Methods

The above figure is called a dendrogram and represents the fusions or divisions made at
each successive stage of the analysis. More formally then, a dendrogram is a tree-like
diagram that summarizes the process of clustering.

Distance Measures Used in Clustering

In order to build clusters, either agglomeratively or divisively, we need to define
the distance between two objects (cases), $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and $(x_{j1}, x_{j2}, \ldots, x_{jp})$, and
eventually between clusters. Let us first examine the distance between two objects. If
the units of measure of the p variables are quite different, it is suggested that the
variables first be normalized by forming z-scores: subtract the sample means from the
original variables and divide the deviations by their respective
sample standard deviations. The most often used measure of distance (dissimilarity)
between the two cases is the Euclidean distance defined by

$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}\,. \qquad (1)$$

Alternatively, a weighted Euclidean distance can be used and is defined by

$$d_{ij}^{*} = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_p (x_{ip} - x_{jp})^2} \qquad (2)$$

where the weights $w_1, w_2, \ldots, w_p$ satisfy the properties $w_i \geq 0$ and $\sum_{i=1}^{p} w_i = 1$. For the

remaining discussion let us focus on the Euclidean distance measure of distance between
objects (cases).
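As a small numerical illustration of equations (1) and (2), the Python sketch below computes the ordinary and weighted Euclidean distances between two z-score-normalized cases. The two cases, the sample means and standard deviations, and the weights are made-up numbers chosen only for the demonstration.

```python
import numpy as np

# Two hypothetical cases measured on p = 3 variables, plus the sample
# means and standard deviations of those variables (made-up numbers).
x_i = np.array([120.0, 3.2, 45.0])
x_j = np.array([ 95.0, 4.1, 60.0])
means = np.array([100.0, 3.5, 50.0])
stds  = np.array([ 15.0, 0.5, 10.0])

# Normalize each case to z-scores so the scales are comparable.
z_i = (x_i - means) / stds
z_j = (x_j - means) / stds

# Equation (1): ordinary Euclidean distance between the two cases.
d_ij = np.sqrt(np.sum((z_i - z_j) ** 2))

# Equation (2): weighted Euclidean distance with non-negative weights
# that sum to one.
w = np.array([0.5, 0.3, 0.2])
d_ij_weighted = np.sqrt(np.sum(w * (z_i - z_j) ** 2))

print(d_ij, d_ij_weighted)
```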

Moving to the discussion of the distance between clusters, we need to somehow
define the distance between the objects in one cluster and the objects in another cluster.
Cluster distances are usually defined in one of three basic ways: Single Linkage
(Nearest Neighbor), Complete Linkage (Farthest Neighbor), and Average Group
Linkage. Each of these cluster distance measures is defined in order below:

Single Linkage (Nearest Neighbor)

The Single Linkage distance between two clusters is defined as the distance
between the nearest pair of objects in the two clusters (one object in each cluster). If
cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is the set of objects
$B_1, B_2, \ldots, B_n$, the Single Linkage distance between clusters A and B is

$$D(A, B) = \min\{\, d_{ij} : A_i \text{ is an object in cluster } A,\ B_j \text{ is an object in cluster } B,\ \text{and } d_{ij} \text{ is the Euclidean distance between } A_i \text{ and } B_j \,\}$$

At each stage of hierarchical clustering based on the Single Linkage distance measure,
the clusters A and B, for which D(A, B) is minimum, are merged. The Single Linkage
distance is represented in the XLMINER Help File figure below:
Figure 2

Single Linkage Distance


Between Clusters

Complete Linkage (Farthest Neighbor)

The Complete Linkage distance between two clusters is defined as the distance
between the most distant (farthest) pair of objects in the two clusters (one object in each
cluster). If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is the set of
objects $B_1, B_2, \ldots, B_n$, the Complete Linkage distance between clusters A and B is

$$D(A, B) = \max\{\, d_{ij} : A_i \text{ is an object in cluster } A,\ B_j \text{ is an object in cluster } B,\ \text{and } d_{ij} \text{ is the Euclidean distance between } A_i \text{ and } B_j \,\}$$

At each stage of hierarchical clustering based on the Complete Linkage distance measure,
the clusters A and B, for which D(A, B) is minimum, are merged. The Complete Linkage
distance is represented in the XLMINER Help File figure below:

Figure 3

Complete Linkage Distance


Between Clusters

Average Linkage

Under Average Linkage the distance between two clusters is defined to be the
average of the distances between all pairs of objects, where each pair is made up of one
object from each cluster. If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is
the set of objects $B_1, B_2, \ldots, B_n$, the Average Linkage distance between clusters A and B is

$$D(A, B) = \frac{T_{AB}}{N_A \cdot N_B}$$

where $T_{AB}$ is the sum of all pairwise distances between cluster A and cluster B, and $N_A$ and
$N_B$ are the sizes of clusters A and B, respectively.

At each stage of hierarchical clustering based on the Average Linkage distance measure,
the clusters A and B are merged such that, after merger, the average pairwise distance
within the newly formed cluster is minimum. The Average Linkage distance is
represented in the XLMINER Help File figure below:

Figure 4

Average Linkage Distance


Between Clusters
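To make the three cluster-distance definitions concrete, the sketch below computes the Single, Complete, and Average Linkage distances between two small invented clusters directly from the matrix of pairwise Euclidean distances; the data values are placeholders, not taken from any data set used here.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of normalized observations (rows = objects).
cluster_A = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
cluster_B = np.array([[4.0, 4.0], [5.0, 3.5]])

# d[i, j] = Euclidean distance between object i of A and object j of B.
d = cdist(cluster_A, cluster_B)

single_linkage   = d.min()           # nearest pair of objects
complete_linkage = d.max()           # farthest pair of objects
average_linkage  = d.sum() / d.size  # T_AB / (N_A * N_B)

print(single_linkage, complete_linkage, average_linkage)
```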

Steps in Agglomerative Clustering

The steps in Agglomerative Clustering are as follows:


1. Start with n clusters (each observation = cluster)
2. The two closest observations are merged into one cluster
3. At every step, the two clusters that are “closest” to each other are merged. That
is, either single observations are added to existing clusters or two existing
clusters are merged.
4. This process continues until all observations are merged.

This process of agglomeration leads to the construction of a dendrogram. This is
a tree-like diagram that summarizes the process of clustering. For any given number of
clusters we can determine the records in the clusters by sliding a horizontal line (ruler)
up and down the dendrogram until the number of vertical intersections of the
horizontal line equals the number of clusters desired.
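In practice these agglomeration steps and the accompanying dendrogram are rarely computed by hand. A minimal sketch of the same workflow using SciPy is given below; the data matrix is a random stand-in for a table of normalized continuous variables such as the Utilities.xls data, so the resulting tree is only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Stand-in data: 22 normalized observations on 8 variables
# (random numbers used in place of the Utilities.xls values).
rng = np.random.default_rng(0)
X = rng.standard_normal((22, 8))

# Agglomerative clustering with the average-linkage cluster distance.
Z = linkage(X, method="average", metric="euclidean")

# Draw the dendrogram; labels 1..22 mimic the utility numbers.
dendrogram(Z, labels=np.arange(1, 23))
plt.ylabel("Distance")
plt.show()

# "Sliding the ruler": cut the tree at a chosen height to get clusters.
labels = fcluster(Z, t=4.0, criterion="distance")
print(labels)
```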

Dendrograms are more useful visually when there is a smaller number of cases,
as in the Utilities.xls data set. The agglomerative procedure also works for larger
data sets, but it is computationally intensive in that $n \times n$ distance matrices are the basic
building blocks of the procedure.

To demonstrate the construction and interpretation of a dendrogram, let's cluster
the data contained in the Utilities.xls data set. This data set consists of observations on
22 utilities, each utility being described by 8 variables. As noted above, we have 3
different choices of distance between clusters: Single Linkage (Nearest
Neighbor), Complete Linkage (Farthest Neighbor), and Average Linkage. Three separate
dendrograms can be generated, one for each choice of distance measure. Let's look at the
dendrogram generated by using the Average Linkage measure. It is reproduced below:

[Dendrogram (Average Linkage) for the 22 utilities; vertical axis: Distance]

If we put our horizontal ruler at 4.0 for the maximal distance allowed between
clusters (as measured by average linkage) we "cut across" 4 vertical lines and thus get 4
clusters. They are as follows: {1,18,14,19,6,3,9,2,22,4,20,10,13}; {7,12,21,15,17}; {5};
{8,16,11}. If we put our horizontal ruler at 3.5 for the maximal distance allowed
between clusters we "cut across" 7 vertical lines and thus get 7 clusters. They are as
follows: {1,18,14,19,6,3,9}; {2,22,4,20,10,13}; {7,12,21,15}; {17}; {5}; {8,16}; {11}.

The four cluster group is constructed by combining the first and second clusters, the third and
fourth clusters, and the sixth and seventh clusters in the seven cluster group. You can
now see why this type of clustering is called hierarchical: the 4 cluster group is
constructed by combining cluster groupings immediately below it. As you move up
slowly from the bottom of the dendrogram to the top you move from n clusters to n-1
clusters to n-2 clusters, etc., until all of the observations are contained in one cluster.
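The nesting property just described can also be verified mechanically: cutting the same average-linkage tree at heights 3.5 and 4.0 must yield partitions in which every 3.5-level cluster sits wholly inside a single 4.0-level cluster. A short check, under the same stand-in data assumption as in the previous sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.standard_normal((22, 8))              # stand-in for the Utilities data
Z = linkage(X, method="average", metric="euclidean")

labels_40 = fcluster(Z, t=4.0, criterion="distance")   # coarser grouping
labels_35 = fcluster(Z, t=3.5, criterion="distance")   # finer grouping

# Every finer (3.5-level) cluster must sit wholly inside one coarser
# (4.0-level) cluster -- this is what makes the method hierarchical.
for c in np.unique(labels_35):
    parents = np.unique(labels_40[labels_35 == c])
    assert parents.size == 1
print("each 3.5-level cluster is nested inside a single 4.0-level cluster")
```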

To show how sensitive the choice of clusters is to the choice of distance, consider
the Single Linkage dendrogram for the Utilities data:

[Dendrogram (Single Linkage) for the 22 utilities; vertical axis: Distance]

In the case of forming 4 groups, set the maximal allowed distance to be 3.0 in the above
dendrogram. Then we get the following 4 clusters: {5}; {11}; {17}; {rest}. These

four clusters are quite different from the 4 clusters determined by using the Average
Linkage dendrogram. This just goes to show that cluster analysis is an art form and
the clusters should be interpreted with caution and hopefully only accepted if the
clusters make sense given the domain-specific knowledge we have concerning the
utilities under study.

Also we should note some additional limitations of hierarchical clustering:

• For very large data sets, can be expensive and slow
• Makes only one pass through the data. Therefore, early clustering decisions
affect the rest of the clustering results.
• Often has low stability. That is, adding or subtracting variables or adding or
dropping observations can affect the groupings substantially.
• Sensitive to outliers and their treatment

Non-hierarchical Clustering (K-means)

The following is hopefully a not too technical discussion of K-means clustering.


It is a non-hierarchical method in the sense that if one has 2 clusters, say, generated by
pre-specifying 2 means (centroids) in the K-means algorithm and 3 clusters generated by
pre-specifying 3 means in the K-means algorithm, then it may be the case that no
combination of any two clusters of the 3 cluster group can give rise to the 2 cluster
grouping. In this sense the K-means algorithm is non-hierarchical. Let us turn again to
the Utilities data and use the K-means clustering method to determine 4 clusters based on
the normalized data. We use the following choices:

• Normalized data
• 10 Random Starts
• 10 iterations per start
• Fixed random seed = 12345
• Number of reported clusters = 4

Then the K-means algorithm in XLMiner for four clusters generated the following
clusters: cluster 1 = {2,5,7,12,15,17,21,22}; cluster 2 = {4,10,13,20}; cluster 3 =
{3,6,9}; cluster 4 = {1,8,11,14,16,18,19}. Again we derive another distinct 4
cluster grouping. One can then use domain-specific knowledge to determine if this 4
cluster grouping makes more or less sense than the 4 group clusters determined by
either of the choices of cluster distance in the agglomerative approach.
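For readers who do not use XLMiner, a rough analogue of the run above can be set up in scikit-learn along the lines sketched below; the data matrix is again a random placeholder for the normalized Utilities variables, and because the random-number streams of the two packages differ, the resulting memberships will generally not match XLMiner's exactly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Stand-in for the 22 x 8 Utilities.xls matrix (values are placeholders).
rng = np.random.default_rng(0)
X = rng.standard_normal((22, 8))

X_norm = StandardScaler().fit_transform(X)   # normalized data

km = KMeans(
    n_clusters=4,        # number of reported clusters
    init="random",       # random starting centroids
    n_init=10,           # 10 random starts
    max_iter=10,         # 10 iterations per start
    random_state=12345,  # fixed random seed
)
labels = km.fit_predict(X_norm)
print(labels)            # cluster assignment (0-3) for each utility
print(km.inertia_)       # within-cluster sum of squares (WCSS)
```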

The Steps in the K-means Clustering Approach

Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional
real vector, K-means clustering aims to partition the n observations into
K sets ($K < n$), $S = \{S_1, S_2, \ldots, S_K\}$, so as to minimize the within-cluster sum of
squares (WCSS):

$$\arg\min_{S} \sum_{i=1}^{K} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2 \qquad (1)$$

where $\mu_i$ is the mean of the points in $S_i$. Now minimizing (1) can, in theory, be done by
the integer programming method but this can be extremely time-consuming. Instead
the Lloyd algorithm is more often used. The steps of the Lloyd algorithm are as follows.
Given an initial set of K means $m_1^{(1)}, \ldots, m_K^{(1)}$, which can be specified randomly or by
some heuristic, the algorithm proceeds by alternating between two steps:

Assignment Step: Assign each observation to the cluster with the closest mean:

$$S_i^{(t)} = \{\, x_j : \|x_j - m_i^{(t)}\| \leq \|x_j - m_{i^*}^{(t)}\| \ \text{for all } i^* = 1, 2, \ldots, K \,\}. \qquad (2)$$

Update Step: Calculate the new means to be the centroids of the observations in
the clusters, i.e.

$$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j \quad \text{for } i = 1, 2, \ldots, K. \qquad (3)$$

Repeat the Assignment and Update steps until WCSS (equation (1)) no longer
changes. Then the centroids and members of the K clusters are determined.

Note: When using random assignment of the K means to start the algorithm, one might
try several random starting sets of K means and then choose the "best" one, namely the
starting set that produces the smallest WCSS among all of the random starting
sets tried.
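A bare-bones implementation of the Lloyd iteration described above, including the multiple-random-start device from the note, might look like the sketch below; it is written directly from equations (1)-(3) for illustration rather than taken from any particular package, and it simply keeps the start that ends with the smallest WCSS.

```python
import numpy as np

def lloyd_kmeans(X, K, n_starts=10, max_iter=100, seed=12345):
    """Minimize the WCSS in equation (1) by Lloyd's alternating steps."""
    rng = np.random.default_rng(seed)
    best_labels, best_centers, best_wcss = None, None, np.inf

    for _ in range(n_starts):
        # Start from K randomly chosen observations as the initial means.
        centers = X[rng.choice(len(X), size=K, replace=False)]
        prev_wcss = np.inf

        for _ in range(max_iter):
            # Assignment step (equation (2)): nearest mean for each observation.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)

            # Update step (equation (3)): centroid of each cluster's members.
            for k in range(K):
                if np.any(labels == k):
                    centers[k] = X[labels == k].mean(axis=0)

            # Stop once the within-cluster sum of squares no longer changes.
            wcss = np.sum((X - centers[labels]) ** 2)
            if wcss == prev_wcss:
                break
            prev_wcss = wcss

        if wcss < best_wcss:
            best_labels, best_centers, best_wcss = labels, centers, wcss

    return best_labels, best_centers, best_wcss

# Example on stand-in data (placeholder for the normalized Utilities matrix).
X = np.random.default_rng(0).standard_normal((22, 8))
labels, centers, wcss = lloyd_kmeans(X, K=4)
print(labels, wcss)
```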

Regardless of the clustering technique used, one should strive to choose clusters
that are interpretable and make sense given the domain-specific knowledge that we have
about the problem at hand.

• Review Utilities.xls data
