
Foundational Aspects of

Data Science
Catarina Neves
[email protected]
Chapter 4 – Cluster Analysis

1. Basic Concepts and Terminology


2. Similarity Measures
3. Hierarchical Clustering
4. Nonhierarchical Clustering
5. Which Clustering Method is Best?
6. Reliability and External Validity

2
Basic Concepts and Terminology

Cluster Analysis
Imagine one of the following hypothetical situations:
• The financial analyst of an investment firm is interested in identifying a group
of firms that are prime targets for a takeover;
• The campaign manager for a political candidate is interested in identifying groups of
voters who have similar views on important issues.

Each of the above scenarios is concerned with identifying groups of entities or subjects that
are similar to each other with respect to certain characteristics. Cluster Analysis is a
useful technique for this purpose.

3
Basic Concepts and Terminology

What is Cluster Analysis


Cluster Analysis is a technique used for combining observations into groups (i.e., clusters)
such that:
➢ Each group or cluster is homogeneous or compact with respect to certain
characteristics. That is, observations in each group are similar to each other;
➢ Each group should be different from other groups with respect to the same
characteristics; i.e., observations of one group should be different from the
observations of other groups.
The definition of similarity or homogeneity varies from analysis to analysis and depends
on the objectives of the study.

4
Basic Concepts and Terminology

Objectives of Cluster Analysis


The objective of Cluster Analysis is to group observations into clusters such that each
cluster is as homogeneous as possible with respect to the clustering variables. The first
step in cluster analysis is to select a measure of similarity. Next, a decision is made on the
type of clustering method to be used (i.e., hierarchical or non-hierarchical methods).
Third, the specific algorithm within the chosen type is selected. Next, a
decision regarding the number of clusters is made and, finally, the solution is interpreted.

Goal: minimize intra-cluster distances and maximize inter-cluster distances.

5
Basic Concepts and Terminology

Geometrical view of Cluster Analysis


Geometrically, the concept of cluster analysis is straightforward. Consider the following
hypothetical data:

As is known, each observation can be represented as a point in a p-dimensional space.


In this case, p=2. Let’s suppose that we are interested in forming three homogeneous
groups. An examination of the observations projected in the two-dimensional space is
given in the next slide.

6
Basic Concepts and Terminology

Geometrical view of Cluster Analysis


An examination of the projected subjects suggests that S1 and S2 will form one group, S3
and S4 form another; whereas S5 and S6 form the third one.

7
Basic Concepts and Terminology

Geometrical view of Cluster Analysis


As can be seen, cluster analysis groups observations such that the observations in each
group are similar with respect to the clustering variables.
The graphical procedures for identifying clusters may not be feasible when we have many
observations or when we have more than three variables or dimensions. What is needed
in such cases is an analytical technique for identifying groups or clusters of points in a
given dimensional space.

8
Similarity Measures

Selecting a measure of distance


In the geometrical approach to cluster analysis, we visually combined S1 with S2, S3 with
S4, and S5 with S6. In other words, we implicitly used the Euclidean Distance in the
two-dimensional space as the measure of similarity for grouping the subjects.
A number of similarity measures could have been used as alternatives to the Euclidean
Distance. Hence, one of the first issues one needs to deal with when using cluster analysis
is the selection of a suitable distance/similarity measure.

9
Similarity Measures

Selecting a measure of distance


Euclidean Distance

d_{ij} = \sqrt{ \sum_{k=1}^{p} ( x_{ik} - x_{jk} )^{2} }

Squared-Euclidean Distance

d_{ij} = \sum_{k=1}^{p} ( x_{ik} - x_{jk} )^{2}

City-Block Distance

d_{ij} = \sum_{k=1}^{p} | x_{ik} - x_{jk} |

10
Similarity Measures

Selecting a measure of distance


Minkowski distance

d_{ij} = \left( \sum_{k=1}^{p} | x_{ik} - x_{jk} |^{m} \right)^{1/m}

Mahalanobis distance

d_{ij} = ( x_{i} - x_{j} )' \, S^{-1} \, ( x_{i} - x_{j} )

Where d_{ij} is the distance between subjects i and j, x_{ik} is the value of the kth
variable for the ith subject, x_{jk} is the value of the kth variable for the jth subject,
p is the number of variables, and S is the sample covariance matrix of the variables.
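
To make the formulas concrete, the short Python sketch below (assuming NumPy and SciPy are
available) computes each of these distances for two subjects. The numeric values are
illustrative assumptions only, chosen to be consistent with the education/income figures
quoted later in the deck.

import numpy as np
from scipy.spatial import distance

x_i = np.array([6.0, 6.0])    # hypothetical subject i (education, income)
x_j = np.array([15.0, 14.0])  # hypothetical subject j

euclidean = np.sqrt(np.sum((x_i - x_j) ** 2))           # square root of the sum of squared differences
squared_euclidean = np.sum((x_i - x_j) ** 2)            # sum of squared differences (= 145 here)
city_block = np.sum(np.abs(x_i - x_j))                  # sum of absolute differences
m = 3
minkowski = np.sum(np.abs(x_i - x_j) ** m) ** (1 / m)   # m-th root of the sum of m-th powers

# The Mahalanobis distance needs the covariance matrix S of the variables,
# estimated here from a small illustrative sample of six subjects.
sample = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
S_inv = np.linalg.inv(np.cov(sample, rowvar=False))
# Note: SciPy returns the square root of the quadratic form shown above.
mahalanobis = distance.mahalanobis(x_i, x_j, S_inv)

print(euclidean, squared_euclidean, city_block, minkowski, mahalanobis)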

11
Similarity Measures

Selecting a measure of distance


Variables measured on larger scales may override the effects of variables measured on
smaller scales (and thus with smaller variances); for this reason, the variables are often
standardized before the distances are computed.
For non-metric variables, the previous notions of distance are not available. In these
cases the analyst may use:
➢ Measures of association especially suited for contingency tables (e.g., the Jaccard
Association Coefficient, the Gower Coefficient, etc.).

12
Similarity Measures

Similarity Matrix
Similarity Matrix for the hypothetical data using the Squared Euclidean Distance.

The question is then how one can use these similarities to form the groups or
clusters. The answer lies in the two main types of analytic clustering
techniques: hierarchical and non-hierarchical.
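
As an illustration of such a matrix, the sketch below (SciPy assumed) builds the full
squared Euclidean distance matrix for six subjects. The data values are assumptions,
chosen so that the pairwise distances reproduce the numbers quoted in the following
slides (2, 145, 181, ...).

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Assumed values for the six subjects (education, income).
subjects = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# Condensed pairwise distances, then the full 6 x 6 symmetric matrix.
sq_euclidean = squareform(pdist(subjects, metric="sqeuclidean"))
print(sq_euclidean.astype(int))
# e.g., d(S1, S2) = 2, d(S2, S3) = 145 and d(S1, S3) = 181, as in the worked example.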

13
Hierarchical Clustering

Hierarchical Clustering Techniques


From the table in the previous slide, subjects S1 and S2 are similar to each other, as are
subjects S3 and S4, since the Squared Euclidean distance between these two pairs of
points is the same – two. Either of these two pairs could be selected as the first pair to be
formed; the tie is broken randomly. Let us choose subjects S1 and S2 and
merge these two individuals into one cluster. We now have five clusters: cluster 1
comprised of S1 and S2; cluster 2 comprised of S3; cluster 3 comprised of S4; cluster 4
comprised of S5; and cluster 5 comprised of S6. The next step could be to develop another
similarity matrix for these five clusters.
The question that arises is the following: since cluster 1 is comprised of S1 and S2, we
must define a rule for determining the distance between this cluster and every other
subject in our data set. The answer to this question is precisely what differentiates
the various hierarchical clustering algorithms.

14
Hierarchical Clustering

Hierarchical Clustering Techniques


In this course the following hierarchical methods are discussed:
➢ Centroid method;
➢ Nearest-neighbor or single-linkage method;
➢ Farthest-neighbor or complete-linkage method;
➢ Average-linkage method;
➢ Ward’s method.

In the following slides the hierarchical cluster analysis of the hypothetical data set given
will be made using Centroid method.

15
Hierarchical Clustering

Centroid Method
A hierarchical clustering algorithm forms clusters in a hierarchical manner. In
other words, the number of clusters at each stage is one less than at the previous one. If
there are n observations, then at step 1, step 2, step 3, …, step n-1 of the hierarchical
process the number of clusters will be, respectively, n-1, n-2, n-3, …, 1. In the
Centroid Method, each cluster is represented by its centroid when
computing the distances between clusters.

16
Hierarchical Clustering

Centroid Method
In the Centroid Method each group is represented by the “average subject”, i.e., the
centroid of that group. For example, the first cluster is represented by the centroid of
subjects S1 and S2. In other words, Cluster 1 has an average education of 5.5 years and an
average income of 5.5 thousand dollars.

17
Hierarchical Clustering

Centroid Method
Given the iterative nature of hierarchical clustering, it is common to plot the
formation path of the observations in what is called a dendrogram or tree. In these graphical
representations, the observations are listed on the horizontal axis and the Squared
Euclidean Distance between the centroids on the vertical one.
Note that when we have a large number of observations the dendrogram may
not be very useful.
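
A minimal sketch of this procedure with SciPy is given below: it runs the centroid method
on the same assumed six-subject data and draws the dendrogram. Note that SciPy plots the
(unsquared) Euclidean distance between centroids on the vertical axis, whereas the slides
use the squared distance.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Assumed six-subject data (education, income).
subjects = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# 'centroid' linkage: each cluster is represented by its centroid and clusters
# are merged according to the Euclidean distance between centroids.
Z = linkage(subjects, method="centroid")

dendrogram(Z, labels=["S1", "S2", "S3", "S4", "S5", "S6"])
plt.ylabel("Distance between cluster centroids")
plt.show()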

18
Hierarchical Clustering

Centroid Method

19
Hierarchical Clustering

Single-Linkage or Nearest-Neighbor Method


Consider the similarity matrix without any groups of subjects yet formed.

[Similarity matrix repeated, with S1 and S2 merged into Cluster 1.]

In the Centroid Method, the distance between clusters was obtained by computing the
Squared Euclidean Distance between the centroids of the respective clusters. In the
Single-Linkage Method, the distance between two clusters is the minimum of the distances
between all possible pairs of observations in the two clusters; here, the distance between
cluster 1 (S1 & S2) and Subject S3 is Min(181; 145) = 145.

20
Hierarchical Clustering

Single-Linkage or Nearest-Neighbor Method


The process then continues iteratively until only one cluster is formed.

21
Hierarchical Clustering

Complete-Linkage or Farthest-Neighbor Method


The Complete Method is the exact opposite of the Single Method.
The distance between any two clusters is given by the maximum of the distances between
all possible pairs of observations in the two clusters. Once again, consider the initial
similarity matrix given for this hypothetical data:

[Similarity matrix repeated, with S1 and S2 merged into Cluster 1.]

The distance between the two clusters is represented by the maximum of the distance
between cluster 1 (S1 & S2) and Subject S3.
In this case we have Max(181; 145)=181.

22
Hierarchical Clustering

Complete-Linkage or Farthest-Neighbor Method


The process then continues iteratively until only one cluster is formed.

23
Hierarchical Clustering

Average-Linkage Method
In the Average Method the distance between two clusters is obtained by taking the
average distance between all pairs of subjects in the two clusters. Once again, consider
the initial similarity matrix given for this hypothetical data:

[Similarity matrix repeated, with S1 and S2 merged into Cluster 1.]

The distance between the two clusters is represented by the average of the distance
between cluster 1 (S1 & S2) and Subject S3.
In this case we have Average(181; 145)=163.

24
Hierarchical Clustering

Average-Linkage Method
The process then continues iteratively until only one cluster is formed.

25
Hierarchical Clustering

Ward’s Method
Ward’s method does not compute distances between clusters. Rather, it forms
clusters by maximizing within-cluster homogeneity, using the within-group sum of squares
as the measure of homogeneity.
Clusters are formed at each step in such a way that the resulting cluster solution has the
smallest within-cluster sum of squares. This measure is also known as the Error Sum of
Squares (ESS). The first two iterations of Ward’s method are presented in the next slide.
The ESS is computed as:

ESS = \sum_{j=1}^{k} \sum_{i=1}^{n_j} ( X_{ij} - \bar{X}_{j} )^{2}
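
A minimal NumPy sketch of this computation, for an assumed three-cluster assignment of the
hypothetical data, is given below.

import numpy as np

# Assumed data and an assumed three-cluster assignment: {S1,S2}, {S3,S4}, {S5,S6}.
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])

def ess(X, labels):
    # Error Sum of Squares: squared deviations of each observation from its cluster centroid.
    total = 0.0
    for j in np.unique(labels):
        cluster = X[labels == j]
        centroid = cluster.mean(axis=0)
        total += np.sum((cluster - centroid) ** 2)
    return total

# Ward's method merges, at each step, the pair of clusters whose union yields
# the smallest increase in this quantity.
print(ess(X, labels))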

26
Hierarchical Clustering

Ward’s Method

[Table omitted: first two iterations of Ward’s method, in which the ESS is evaluated for each of the n(n − 1)/2 candidate merges of the initial clusters.]
27
Hierarchical Clustering

Ward’s Method
The process then continues iteratively until only one cluster is formed.

28
Hierarchical Clustering

Hierarchical Methods Overview

Centroid

Single linkage

Complete linkage

Average linkage

Ward’s Method
ESS = \sum_{j=1}^{k} \sum_{i=1}^{n_j} ( X_{ij} - \bar{X}_{j} )^{2}

29
Hierarchical Clustering

Hierarchical Methods Overview


➢ Centroid: less sensitive to outliers than other methods; may present some limitations
when the clusters have very different sizes, since individuals tend to be agglomerated
into the larger clusters;
➢ Single-Linkage: tends to produce elongated clusters, which may place very different
individuals in the same cluster;
➢ Complete-Linkage: particularly sensitive to outliers, as it tends to produce clusters with
similar diameters;
➢ Average-Linkage: tends to combine clusters with small and similar variances;
➢ Ward’s Method: tends to combine clusters with a small and similar number of
observations, and it is also very sensitive to outliers.
A short sketch comparing these linkage rules follows.
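
The sketch below (SciPy assumed) runs all five rules on the same assumed data and cuts
each tree into three clusters, so their behaviour can be compared directly.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed six-subject data (education, income).
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

for method in ["centroid", "single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                       # hierarchical tree for this rule
    labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 clusters
    print(f"{method:>9}: {labels}")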

30
Interpreting Hierarchical Clustering

Evaluating the Cluster Solution and Determining the Number of Clusters


Although SAS Miner chooses the number of clusters using the CCC criterion, we
should make our own evaluation and determine the number of clusters present in
the data. A number of statistics are provided by SAS for this purpose. The
most widely used are:
1. Root-mean-squared standard deviation (RMSSTD)
2. R-Squared (R2)
3. Semi-Partial R2 (SPR2)
4. Cluster Distance (CD)
All of these statistics provide information about the cluster solution at any given step
of the hierarchical procedure and, hence, the consequences of forming the new
cluster.

31
Interpreting Hierarchical Clustering

R-Squared
R2 is the ratio of SSb to SSt. Note that SSb is a measure of the extent to which
groups are different from each other. Since SSt = SSb + SSw, the greater the SSb the
smaller the SSw, and vice versa. Consequently, for any given dataset, the greater the
differences between groups, the more homogeneous each group is, and vice versa.
➢ Hence, R2 measures the extent to which groups or clusters are different from each
other or, equivalently, how homogeneous the groups are. The value of R2
ranges from 0 to 1 and can be interpreted as the proportion of the
total variance that is retained (explained) by a given cluster solution. A small sketch
of this computation follows.
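
A minimal NumPy sketch of the R2 statistic (SSb / SSt) for an assumed three-cluster
assignment of the hypothetical data:

import numpy as np

# Assumed data and an assumed three-cluster assignment.
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])

grand_mean = X.mean(axis=0)
sst = np.sum((X - grand_mean) ** 2)                                   # total sum of squares
ssw = sum(np.sum((X[labels == j] - X[labels == j].mean(axis=0)) ** 2)
          for j in np.unique(labels))                                 # pooled within-cluster SS
ssb = sst - ssw                                                       # between-cluster SS
print("R2 =", ssb / sst)                                              # close to 1 => well-separated clusters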

32
Interpreting Hierarchical Clustering

Semi-Partial R-Squared
As already discussed, the new cluster formed at any given step is obtained by
merging two clusters formed in previous steps. The difference between the pooled
SSw of the new cluster and the sum of pooled SSw’s of clusters joined to obtain the
new clusters is called loss of homogeneity. If this loss is zero, then the new cluster is
obtained by merging two perfectly homogeneous clusters. On the other hand, if loss
of homogeneity is large, then that means we are merging two very different clusters.
➢ In a good Cluster solution SPR 2 should be low.

33
Interpreting Hierarchical Clustering

Cluster Distance
The output reports the distance between the two clusters that are joined at a given
step. In the Centroid Method it is simply the Euclidean distance between the
centroids of the two clusters to be merged, and it is named the
centroid distance (CD); for Single-Linkage it is the minimum Euclidean distance
(MIND) between all possible pairs of points; for Complete-Linkage it is the maximum
Euclidean distance (MAXD) between all possible pairs of points; and for Ward’s
method it is the between-group SS of the two clusters (SSb).
➢ The Cluster Distance should be small when two clusters are merged. A large value of CD
would indicate that two dissimilar groups (far apart from each other) are being merged.

34
Interpreting Hierarchical Clustering

Final Hierarchical Clustering Solution


The following table gives a summary of the statistics previously discussed.

These statistics may also be used to define the number of clusters. Essentially, one
should look for the “first big jump” in the value of a given statistic. A popular and
effective way is to plot the statistics and look for the first elbow. In this case it clearly
seems that there are three clusters in the data.

35
Interpreting Hierarchical Clustering

Comparing the R2 of the Hierarchical Methods (and choosing the number of clusters)

36
Non-Hierarchical Clustering

Non-Hierarchical Clustering Algorithms


In Non-Hierarchical Clustering, the data are partitioned into k groups, with
each partition representing a cluster. Hence, as opposed to Hierarchical Clustering,
the number of clusters must be known a priori. Non-Hierarchical Clustering
techniques basically proceed in the following steps (a minimal sketch is given after the list):
1. Select k initial cluster centroids or seeds, where k is the number of clusters
desired;
2. Assign each observation to the cluster whose centroid is closest;
3. Re-assign or reallocate each observation to one of the k clusters according to a
predetermined stopping criterion;
4. Stop if there is no reallocation of data points or if the reassignment satisfies the
criterion defined before. Otherwise, return to step 2.
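
The sketch below implements these four steps directly in NumPy. It is a simplified
k-means-style loop under stated assumptions (random seeds from the data, no handling of
empty clusters or elaborate stopping criteria), with assumed data.

import numpy as np

def simple_kmeans(X, k, seeds=None, max_iter=100):
    rng = np.random.default_rng(0)
    # Step 1: initial seeds (here: k observations chosen at random, unless supplied).
    centroids = X[rng.choice(len(X), k, replace=False)] if seeds is None else np.asarray(seeds, float)
    for _ in range(max_iter):
        # Step 2: assign each observation to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the centroids of the resulting clusters.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when no centroid moves (no further reallocation); otherwise iterate.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)   # assumed data
labels, centroids = simple_kmeans(X, k=3)
print(labels)
print(centroids)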

37
Non-Hierarchical Clustering

Non-Hierarchical Clustering Algorithms


Most of the Non-Hierarchical algorithms differ in terms of (i) the method used for
obtaining the initial cluster centroids or seeds; and (ii) the rule used for reassigning
observations. Some of the methods used to obtain the initial seeds are:
1. Select the first k observations with non-missing data as centroids or seeds for the initial
clusters;
2. Select the first non-missing observation as the seed for the 1st cluster. The seed for the
2nd cluster is selected such that its distance from the previous seed is greater than a
certain selected distance. The 3rd seed is selected such that its distance from the previously
selected seeds is greater than the selected distance, and so on;
3. Randomly select k non-missing observations as cluster centers or seeds;
4. Redefine the selected seeds using certain rules such that they are as far apart as possible;
5. Use a heuristic that identifies cluster centers such that they are as far apart as possible;
6. Use seeds supplied by the researcher.

38
Non-Hierarchical Clustering

Initial Data

39
Non-Hierarchical Clustering

Outliers Identification

Identify outliers using a hierarchical technique and choose the approximate number of clusters.

40
Non-Hierarchical Clustering

Outliers Removal

41
Non-Hierarchical Clustering

Iterative Process

Choose the centroids of the initial clusters.

42
Non-Hierarchical Clustering

Iterative Process

Assign each individual to the nearest centroid.

43
Non-Hierarchical Clustering

Iterative Process

Recalculate centroids

44
Non-Hierarchical Clustering

Iterative Process

Reallocate individuals to the cluster with the nearest centroid.

45
Non-Hierarchical Clustering

Final Solution

Recalculate clusters successively until the convergence criterion is achieved.

46
Interpreting Non-Hierarchical Clustering

Evaluation of the Cluster Solution


➢ The cluster solution is evaluated using the same measures discussed for
hierarchical clustering, namely the overall R2 and the RMSSTD. As both the R2 and the
RMSSTD are used in hierarchical methods to define the number of clusters, one
can compute these measures for different values of k to help with this decision;
➢ As with other techniques discussed earlier in the course, it can sometimes be
of interest to analyze how well the variance of each variable is “kept” in a
specific dimension reduction analysis. Hence, the R2 and the RMSSTD can also be
analyzed per variable in this part of the output.

47
Which Clustering Method is Best?

Which Clustering Method is the Best?


Which of the two types of clustering techniques should be used, and in what
circumstances? Moreover, once this decision has been made, which algorithm is the
most suitable?
The answer to these questions is not straightforward, as it depends on the
context of the problem. Although the literature provides comprehensive summaries
of the various methods and algorithms, there is not one that is, overall, better than
the others – data and context have a decisive role to play in the performance of the
different techniques!
The next slides describe some of the properties of the methods and algorithms so
that the questions above may be answered with knowledge about every possibility.

48
Which Clustering Method is Best?

Hierarchical Methods
Hierarchical methods do not need the number of clusters to be decided a priori. This is,
definitely, an advantage over the non-hierarchical methods. However, the
hierarchical methods have the disadvantage that, once an observation is assigned
to a cluster, it can no longer “get back”, i.e., be reassigned to another cluster. For these
reasons, hierarchical methods are sometimes used in an exploratory analysis, with the
final solution being produced by non-hierarchical methods. In other words, hierarchical
and non-hierarchical methods can be used together, in collaboration, instead of
being considered as competitors.

49
Which Clustering Method is Best?

Non-Hierarchical Methods
As already discussed, non-hierarchical methods require that the number of clusters
be defined a priori. Consequently, the cluster centers (seeds) have to be identified
before the technique can proceed to cluster the observations, which may be a problem
since these methods are usually sensitive to the initial seeds. Note that, since a number
of different starting partitions can be used, the final solution may correspond only to a
local optimum of the objective function. In other words, as there is evidence that
non-hierarchical methods (including k-means) can perform poorly when seeds are randomly
assigned, one can use the solution obtained by a hierarchical method as the initial position
of the seeds in the non-hierarchical one.
➢ Hierarchical and non-hierarchical methods can be viewed as complementary in
most cases*. The SAS® syntax to do so is available in the next slides.

* - A possible exception is for datasets where n is very (very) large. In this case, the
complementarity of methods needs to be adapted.
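
The deck provides the SAS® syntax for this combined strategy in later slides; as an
illustrative alternative only, the sketch below chains SciPy's Ward linkage with
scikit-learn's KMeans, using the hierarchical centroids as seeds. Data values are assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)   # assumed data
k = 3

# 1. Exploratory hierarchical run (Ward's method here) cut into k clusters.
hier_labels = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")

# 2. Use the hierarchical cluster centroids as initial seeds for the non-hierarchical run.
seeds = np.array([X[hier_labels == j].mean(axis=0) for j in np.unique(hier_labels)])
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
print(km.labels_)
print(km.cluster_centers_)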
50
Reliability and External Validity

Reliability and External Validity of a Cluster Solution


Cluster analysis is a heuristic technique. Hence, these techniques will always lead to a
solution, even when there are no “natural groups”, i.e., clusters, in a specific dataset.
Thus, assessing the reliability and external validity of a cluster solution is of the utmost
importance.

51
Reliability and External Validity

Reliability
The reliability of a cluster solution can be assessed by a cross-validation procedure as
follows (a sketch of the split-sample option is given after the list):
➢ Performing different analyses on the same data, using different distances, and
verifying how stable the solutions are;
➢ Comparing the results provided by different methods and verifying whether they produce
similar interpretations;
➢ Dividing the data set randomly into two groups (validation and training), performing
cluster analysis on both groups, and comparing the results. The degree of
agreement can be assessed and the solution that is “more compatible” in both
groups used.
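
As an illustration of the split-sample idea, the sketch below (scikit-learn assumed)
clusters two random halves separately and measures the agreement of the two solutions
with the adjusted Rand index; both the agreement measure and the simulated data are
assumptions, not part of the slides.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))            # simulated data: 200 observations, 4 clustering variables
half_a, half_b = X[:100], X[100:]        # split into two groups

k = 3
model_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit(half_a)
model_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit(half_b)

# Cluster the second half with both models and compare the two assignments.
labels_from_a = model_a.predict(half_b)
labels_from_b = model_b.labels_
print("Adjusted Rand index:", adjusted_rand_score(labels_from_a, labels_from_b))
# With no natural clusters in the data (as here), the agreement should be low.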

52
Reliability and External Validity

External Validity
External validity is assessed by comparing the results of the cluster analysis with
those of an external criterion. For example, suppose one clusters firms based on
financial indicators and each firm is placed in one of two clusters; one could then
ask an auditor or financial consultant to classify those same firms and assess how
similar the two solutions are.

53
References

References
Students should read Sharma, S. (1996), Applied Multivariate Techniques, Wiley, p. 185-
236 if they want to extend their knowledge on these subjects.

54
Thank you!

Address: Campus de Campolide, 1070-312 Lisboa, Portugal


Phone: +351 213 828 610 Fax: +351 213 828 611
