Chapter 4 - Cluster Analysis
Data Science
Catarina Neves
[email protected]
Basic Concepts and Terminology
Cluster Analysis
Imagine one of the following hypothetical situations:
• The financial analyst of an investment firm is interested in identifying a group of firms that are prime targets for a takeover;
• The campaign manager for a political candidate is interested in identifying groups of voters who have similar views on important issues.
Each of the above scenarios is concerned with identifying groups of entities or subjects that are similar to each other with respect to certain characteristics. Cluster Analysis is a useful technique for this purpose.
The aim is to minimize intra-cluster distances and maximize inter-cluster distances.
Similarity Measures
Squared-Euclidean Distance

d_{ij} = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2

City-Block Distance

d_{ij} = \sum_{k=1}^{p} |x_{ik} - x_{jk}|
Mahalanobis distance
d_{ij} = (x_i - x_j)' S^{-1} (x_i - x_j)
Where d_ij is the distance between subjects i and j, x_ik is the value of the kth variable for the ith subject, x_jk is the value of the kth variable for the jth subject, p is the number of variables, and S is the sample covariance matrix of the variables.
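To make these measures concrete, the minimal Python sketch below (an illustrative stand-in; the course's own syntax is SAS, referenced on later slides) computes the three distances between two subjects. The data are hypothetical: S1-S3 use values consistent with the example discussed in the following slides, while S4 and S5 are purely illustrative.

import numpy as np

# Hypothetical subjects measured on p = 2 variables (education in years, income in $1000s)
X = np.array([[5.0, 5.0],    # S1
              [6.0, 6.0],    # S2
              [15.0, 14.0],  # S3
              [16.0, 15.0],  # S4 (illustrative)
              [25.0, 20.0]]) # S5 (illustrative)

s1, s3 = X[0], X[2]

# Squared Euclidean distance: sum of squared differences over the p variables
d_sq_euclidean = np.sum((s1 - s3) ** 2)        # 181.0

# City-block distance: sum of absolute differences
d_city_block = np.sum(np.abs(s1 - s3))         # 19.0

# Mahalanobis distance: the quadratic form (x_i - x_j)' S^-1 (x_i - x_j),
# where S is the sample covariance matrix of the variables
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = s1 - s3
d_mahalanobis = float(diff @ S_inv @ diff)

print(d_sq_euclidean, d_city_block, d_mahalanobis)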
Similarity Matrix
Similarity Matrix for the hypothetical data using the Squared Euclidean Distance.
The question is then how one can use these similarities to form the groups or clusters. The answer lies in the two main types of clustering techniques: hierarchical and non-hierarchical.
Hierarchical Clustering
In the following slides, the hierarchical cluster analysis of the given hypothetical data set is carried out using the Centroid method.
Centroid Method
As the name implies, a hierarchical clustering algorithm forms clusters in a hierarchical manner. In other words, the number of clusters at each stage is one less than at the previous one. If there are n observations, then at step 1, step 2, step 3, …, step n−1 of the hierarchical process the number of clusters will be, respectively, n−1, n−2, n−3, …, 1. In the Centroid method, each cluster is represented by its centroid when computing the distances between clusters.
In the Centroid method, each group is represented by its "average subject", which is the centroid of that group. For example, the first cluster is represented by the centroid of Subjects S1 and S2. In other words, Cluster 1 has an average education of 5.5 years and an average income of 5.5 thousand dollars.
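A minimal sketch of this first merge, using the hypothetical values implied by the example (S1 = (5, 5), S2 = (6, 6); S3 is assumed to be (15, 14)):

import numpy as np

subjects = {"S1": np.array([5.0, 5.0]),
            "S2": np.array([6.0, 6.0]),
            "S3": np.array([15.0, 14.0])}

def sq_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return float(np.sum((a - b) ** 2))

# Step 1 of the Centroid method: merge the two closest subjects (S1 and S2)
# and represent the new cluster by its centroid, the "average subject".
centroid_12 = (subjects["S1"] + subjects["S2"]) / 2     # [5.5, 5.5]

# The distance from Cluster 1 to S3 is now measured centroid-to-point.
print(sq_euclidean(centroid_12, subjects["S3"]))        # (15-5.5)^2 + (14-5.5)^2 = 162.5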
Given the iterative nature of hierarchical clustering, it is very common to plot the formation path of the observations in what is called a dendrogram or tree. In these graphical representations, the observations are listed on the horizontal axis and the Squared Euclidean Distance between the centroids on the vertical axis.
Note that when the number of observations is high, the dendrogram may not be very useful.
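As an illustration, a dendrogram for the Centroid method can be produced with SciPy and matplotlib as sketched below (hypothetical data as before). Note that SciPy reports plain Euclidean, rather than squared Euclidean, distances between centroids on the vertical axis.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20]], dtype=float)
labels = ["S1", "S2", "S3", "S4", "S5"]

# Centroid method: the distance between clusters is the distance between their centroids
Z = linkage(X, method="centroid")

dendrogram(Z, labels=labels)
plt.xlabel("Observations")
plt.ylabel("Distance between centroids")
plt.show()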
Single-Linkage Method
In the Centroid method, the distance between clusters was obtained by computing the Squared Euclidean Distance between the centroids of the respective clusters. In the Single-Linkage method, the distance between two clusters is instead the minimum of the distances between their members: the distance between Cluster 1 (S1 & S2) and Subject S3 is Min(181; 145) = 145.
Complete-Linkage Method
In the Complete-Linkage method, the distance between two clusters is the maximum of the distances between their members: the distance between Cluster 1 (S1 & S2) and Subject S3 is Max(181; 145) = 181.
Average-Linkage Method
In the Average-Linkage method, the distance between two clusters is obtained by taking the average distance between all pairs of subjects in the two clusters. Once again, consider the initial similarity matrix given for this hypothetical data: the distance between Cluster 1 (S1 & S2) and Subject S3 is Average(181; 145) = 163.
The process then continues iteratively until only one cluster is formed.
Ward’s Method
Ward's method does not compute distances between clusters. Rather, it forms clusters by maximizing within-cluster homogeneity. The within-group sum of squares is used as the measure of homogeneity.
Clusters are formed at each step in such a way that the resulting cluster solution has the smallest within-cluster sum of squares. This measure is also known as the Error Sum of Squares (ESS). The first two iterations of Ward's method are presented in the next slide.
The ESS is computed as:
ESS = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2
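A minimal sketch of the ESS computation for one candidate partition of the hypothetical data (the cluster labels below are illustrative):

import numpy as np

def ess(X, labels):
    """Error Sum of Squares: within-cluster squared deviations from each
    cluster's centroid, summed over all k clusters."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20]], dtype=float)
labels = np.array([0, 0, 1, 1, 2])   # candidate partition {S1,S2}, {S3,S4}, {S5}

# Ward's method chooses, at each step, the merge that yields the smallest ESS.
print(ess(X, labels))                # 2.0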
[Tables: the first two iterations of Ward's method, evaluating the ESS of the n(n−1)/2 possible pairings.]
The process then continues iteratively until only one cluster is formed.
Summary of the hierarchical methods, each differing in the criterion used to decide which clusters to merge (compared in the sketch below):
• Centroid
• Single linkage
• Complete linkage
• Average linkage
• Ward's Method: minimize ESS = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2
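The sketch below applies each of these criteria to the hypothetical data with SciPy and prints the value of the merge criterion at each step, which makes the differences between the methods visible:

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20]], dtype=float)

for method in ["centroid", "single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)        # the n-1 merge steps
    heights = np.round(Z[:, 2], 2)       # criterion value at each merge
    print(f"{method:>9}: {heights}")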
Interpreting Hierarchical Clustering
R-Squared
R² is the ratio of SSb (the between-group sum of squares) to SSt (the total sum of squares). Note that SSb is a measure of the extent to which groups are different from each other. Since SSt = SSb + SSw, the greater the SSb, the smaller the SSw (the within-group sum of squares), and vice versa. Consequently, for any given dataset, the greater the differences between groups, the more homogeneous each group is, and vice versa.
➢ Hence, R² measures the extent to which groups or clusters are different from each other or, alternatively, how homogeneous the groups are. The value of R² ranges from 0 to 1 and can be interpreted as the proportion of the total variance that is retained by each cluster solution.
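A minimal sketch of the R² computation for a given partition (hypothetical data and labels as before):

import numpy as np

def r_squared(X, labels):
    """R² = SSb / SSt: the proportion of the total variance retained by the solution."""
    sst = np.sum((X - X.mean(axis=0)) ** 2)
    ssw = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
              for c in np.unique(labels))
    return (sst - ssw) / sst             # SSb = SSt - SSw

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20]], dtype=float)
print(r_squared(X, np.array([0, 0, 1, 1, 2])))   # close to 1 for well-separated clusters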
Semi-Partial R-Squared
As already discussed, the new cluster formed at any given step is obtained by merging two clusters formed in previous steps. The difference between the pooled SSw of the new cluster and the sum of the pooled SSw's of the clusters joined to obtain it is called the loss of homogeneity. The semi-partial R² (SPR²) expresses this loss of homogeneity as a proportion of SSt, i.e., the decrease in R² caused by the merge. If this loss is zero, the new cluster is obtained by merging two perfectly homogeneous clusters. On the other hand, if the loss of homogeneity is large, it means we are merging two very different clusters.
➢ In a good cluster solution, SPR² should be low.
Cluster Distance
The output reports the distance between the two clusters that are merged at a given step. In the Centroid method it is simply the Euclidean distance between the centroids of the two clusters being joined, referred to as the centroid distance (CD); for Single-Linkage it is the minimum Euclidean distance (MIND) between all possible pairs of points; for Complete-Linkage it is the maximum Euclidean distance (MAXD) between all possible pairs of points; and for Ward's method it is the between-group SS for the two clusters (SSb).
➢ The cluster distance should be small when two clusters are merged. A large value of CD would indicate that two dissimilar groups (as they are far apart) are being merged.
These statistics may also be used to decide on the number of clusters. Essentially, one should look for the "first big jump" in the value of a given statistic. A popular and effective approach is to plot the statistics and look for the first elbow. In this example, it clearly seems that there are three clusters in the data.
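For example, one can compute R² for each solution produced by the hierarchical process and plot it against the number of clusters (a sketch, assuming Ward's method and the hypothetical data used earlier):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster

def r_squared(X, labels):
    sst = np.sum((X - X.mean(axis=0)) ** 2)
    ssw = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
              for c in np.unique(labels))
    return (sst - ssw) / sst

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20]], dtype=float)
Z = linkage(X, method="ward")

# R² of the solution with k clusters, for k = 1 .. n
ks = range(1, len(X) + 1)
r2 = [r_squared(X, fcluster(Z, t=k, criterion="maxclust")) for k in ks]

# Plot and look for the first elbow / "first big jump".
plt.plot(list(ks), r2, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("R²")
plt.show()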
Non-Hierarchical Clustering
The non-hierarchical procedure is illustrated step by step in the accompanying figures (a code sketch of this process follows the list):
• Initial data
• Outliers identification
• Outliers removal
• Iterative process: recalculate centroids, then recalculate clusters
• Final solution: repeat successively until the convergence criterion is achieved
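A minimal sketch of this iterative procedure (hypothetical data and seeds; the outlier screening step is omitted):

import numpy as np

def k_means(X, seeds, n_iter=100):
    """Assign each observation to its nearest centroid, recalculate the centroids,
    and repeat until the assignments no longer change (convergence criterion)."""
    centroids = seeds.astype(float).copy()
    labels = None
    for _ in range(n_iter):
        # Assign observations to the nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recalculate the centroid of each cluster
        for c in range(len(centroids)):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20]], dtype=float)
labels, centroids = k_means(X, seeds=X[[0, 2, 4]])   # k = 3, seeds chosen arbitrarily
print(labels)
print(centroids)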
Interpreting Non-Hierarchical Clustering
Which Clustering Method is Best?
Hierarchical Methods
Hierarchical methods do not need the number of clusters to be decided a priori. This is, definitely, an advantage over the non-hierarchical methods. However, the hierarchical methods have the disadvantage that, once an observation is assigned to a cluster, it can no longer "get back", i.e., be reassigned to another cluster. For these reasons, hierarchical methods are sometimes used in an exploratory analysis, with the final solution obtained by non-hierarchical methods. In other words, hierarchical and non-hierarchical methods can be used together, in collaboration, instead of being considered competitors.
Non-Hierarchical Methods
As already discussed, non-hierarchical methods require the number of clusters to be defined a priori. Consequently, the cluster centers (seeds) have to be identified before the technique can proceed to cluster the observations, which may be a problem as these methods are usually sensitive to the initial seeds. Note that, since a number of starting partitions can be used, the final solution may correspond only to a local optimum of the objective function. In other words, as there is evidence that non-hierarchical methods (including k-means) can perform poorly when seeds are randomly assigned, one can use the solution obtained by the hierarchical methods as the initial seeds for the non-hierarchical ones.
➢ Hierarchical and non-hierarchical methods can be viewed as complementary in most cases*. The SAS® syntax to do so is available in the next slides.
* A possible exception is for datasets where n is very (very) large; in this case, the complementarity of the methods needs to be adapted.
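The combined workflow can be illustrated in Python as follows (scikit-learn and SciPy assumed; this is a sketch of the idea, not the SAS syntax referred to above): the hierarchical step suggests k clusters and provides their centroids, which are then used as non-random seeds for k-means.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20]], dtype=float)
k = 3

# 1. Exploratory hierarchical step (Ward's method) with k clusters...
hier_labels = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")

# ...whose centroids become the initial seeds.
seeds = np.array([X[hier_labels == c].mean(axis=0) for c in np.unique(hier_labels)])

# 2. Final non-hierarchical step (k-means) started from those seeds,
#    which allows observations to be reassigned between clusters.
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
print(km.labels_)
print(km.cluster_centers_)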
Reliability and External Validity
Reliability
The reliability of a cluster solution can be assessed by a cross-validation procedure such as the following (a sketch of the second check is given after this list):
➢ Performing different analyses on the same data, using different distances, and verifying how stable the solutions are;
➢ Comparing the results provided by different methods and verifying whether they produce similar interpretations;
➢ Dividing the data set randomly into two groups (validation and training), performing cluster analysis on both groups, and comparing the results. The degree of agreement can be assessed and the solution that is "more compatible" in both groups used.
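As a minimal illustration of the second check, the sketch below clusters the same hypothetical data with two different methods and measures the agreement between the resulting partitions with the adjusted Rand index (scikit-learn assumed; a value of 1.0 means identical partitions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20]], dtype=float)
k = 3

# Solution 1: hierarchical clustering (Ward's method)
labels_ward = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")

# Solution 2: non-hierarchical clustering (k-means with random initialization)
labels_km = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Degree of agreement between the two cluster solutions
print(adjusted_rand_score(labels_ward, labels_km))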
External Validity
External validity is obtained by comparing the results of the cluster analysis with an external criterion. For example, suppose one clusters firms based on financial indicators and each firm is assigned to one of two clusters; one could then ask an auditor or financial consultant to classify those same firms and assess how similar the two solutions are.
References
Students should read Sharma, S. (1996), Applied Multivariate Techniques, Wiley, p. 185-
236 if they want to extend their knowledge on these subjects.
Thank you!