Chapter 4: Cluster Analysis (CA)
The main objective of cluster analysis is to classify a set of statistical units (or
variables) into a set of groups that are:
• mutually exclusive
• exhaustive
• homogeneous
4.1: General concepts
The data to be clustered can be of different types:
• a data matrix summarizing the measurements of a group of variables (quantitative or
qualitative) made on a given set of statistical units
• a matrix of similarities (proximities) or dissimilarities (distances)
• a set of preference data (resulting from ordering a set of items according to some criterion or preference).
[Scatter plot: each point represents a statistical unit (s.u.) and its coordinates are the values that the 1st and 2nd variables take for that s.u. Axes: 1st variable (0–100), 2nd variable (0–60).]
4.2: Graphical methods
If the number of variables and the number of statistical units are both moderate, we can consider using another type of graphical representation (stars, Chernoff faces, …) or Factor Analysis:
• using the graphical representation of the scores of the individuals in the plane defined by two factors (in a similar way to what is done in PCA)
• using the graphical representation of the loadings of the variables in the plane defined by two factors → a useful representation for the classification of variables.
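As an illustration of this type of representation, a minimal sketch (the toy data matrix and all names below are only for the example) that plots the scores of the statistical units in the plane of the first two components:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data matrix: n statistical units x p variables (two artificial groups)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(5, 1, (20, 4))])

# Standardize the variables and project the units onto the first two components
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("1st component")
plt.ylabel("2nd component")
plt.title("Scores of the statistical units")
plt.show()
```

Units that lie close together in this plane are candidates to form a group.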
4.3: Measures of similarities and dissimilarities
Grouping or Clustering use measures of similarity or proximity, or, conversely,
measures of dissimilarity or distance.
A measure of similarity (or dissimilarity) is a function, s (or d), that assigns to each pair of objects a value in a one-dimensional Euclidean space (usually $\mathbb{R}$), according to certain properties.
Various types of measures can be used, which are distinguished by the properties
they have.
Dissimilarity
Let $d_{rs}$ = dissimilarity between objects $r$ and $s$; it should satisfy the properties:
(i) $d_{rs} \ge 0$  (ii) $d_{rr} = 0$  (iii) $d_{rs} = d_{sr}$
Furthermore,
• If $d$ satisfies (i), (ii), (iii) and (iv) $d_{rs} = 0$ if and only if $r = s$ → Semi-distance or Semi-metric
• If $d$ satisfies (i), (ii), (iii), (iv) and (v) $d_{rs} \le d_{rt} + d_{ts}$ (triangle inequality) → Distance or Metric
• If $d$ satisfies (i), (ii), (iii), (iv) and (vi) $d_{rs} \le \max(d_{rt}, d_{ts})$ (ultrametric inequality) → Ultrametric
Euclidean distance
Representing by $x_{ij}$ the value that variable $X_j$ takes for object $i$, the Euclidean distance between two objects $r$ and $s$ is given by:
$$d_{rs} = \sqrt{\sum_{j=1}^{p} (x_{rj} - x_{sj})^2}$$
This distance, although it is one of the most used, has some disadvantages (illustrated in the sketch after this list):
• is not invariant to scale changes, and therefore, should not be used when different
variables are measured in different units
• does not behave very well when variables have very different variances
• does not behave very well when variables are highly correlated
• does not behave very well when there is missing data.
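A minimal sketch of the scale problem (hypothetical measurements; scipy's pdist computes the pairwise Euclidean distances):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Two variables on very different scales: height in metres, weight in grams
X = np.array([[1.60, 55000.0],
              [1.90, 55500.0],
              [1.62, 80000.0]])

# The gram-scaled variable completely dominates the Euclidean distances
print(squareform(pdist(X, metric="euclidean")).round(1))

# Re-expressing weight in kilograms changes the distances drastically,
# i.e. the Euclidean distance is not invariant to changes of scale
X_kg = X.copy()
X_kg[:, 1] /= 1000.0
print(squareform(pdist(X_kg, metric="euclidean")).round(2))
```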
Therefore, several measures derived from the Euclidean distance are preferably used:
• Karl Pearson (standardized Euclidean) distance: $d^2_{rs} = \sum_{j=1}^{p} \dfrac{(x_{rj} - x_{sj})^2}{s_j^2}$, where $s_j^2$ is the variance of variable $X_j$ → eliminates the effect of the different measurement scales
• Mean squared Euclidean distance: $d^2_{rs} = \dfrac{1}{p}\sum_{j=1}^{p} (x_{rj} - x_{sj})^2$
• Mahalanobis distance: $d^2_{rs} = (\mathbf{x}_r - \mathbf{x}_s)' \mathbf{S}^{-1} (\mathbf{x}_r - \mathbf{x}_s)$, where $\mathbf{S}$ is the covariance matrix → better than the previous ones when the variables are (very) correlated.
The Mahalanobis distance solves not only the problem of different scales, but also the problem of the effects of correlations between variables. However, it tends to mask the results of the analysis a bit.
Note:
• These last 3 distances are nothing more than variants of a weighted Euclidean distance, with weights contained respectively in the matrices D, I/p and S.
• The attribution of weights aims to eliminate the arbitrary effects of the variables, making them contribute to the construction of the dissimilarities in a homogeneous rather than a differentiated way.
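A minimal sketch of these weighted variants (hypothetical data; pdist supports the standardized Euclidean and Mahalanobis metrics directly):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3)) @ np.array([[1.0, 0.8, 0.0],
                                        [0.0, 1.0, 0.5],
                                        [0.0, 0.0, 1.0]])   # correlated variables

# Karl Pearson (standardized Euclidean) distance: weights 1/s_j^2
d_pearson = squareform(pdist(X, metric="seuclidean", V=X.var(axis=0, ddof=1)))

# Mean squared Euclidean distance: ordinary Euclidean divided by sqrt(p)
d_mean = squareform(pdist(X, metric="euclidean")) / np.sqrt(X.shape[1])

# Mahalanobis distance: weights given by the inverse covariance matrix
d_mahal = squareform(pdist(X, metric="mahalanobis",
                           VI=np.linalg.inv(np.cov(X, rowvar=False))))
```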
Similarities are often used when working with binary data
Binary data derive from the observation of variables with only two categories: 1 or 0
depending on whether or not a characteristic is present in an individual.
One can build, for each pair of individuals $(i, i')$, a table of the type:

            i' = 1    i' = 0
  i = 1       a         b       a + b
  i = 0       c         d       c + d
             a + c     b + d    p = a + b + c + d

Based on this table, in addition to the already known association measures for 2×2 contingency tables, several (many) similarity coefficients were suggested, such as:
• Jaccard index: $s_{ii'} = \dfrac{a}{a + b + c}$
• Sørensen index: $s_{ii'} = \dfrac{2a}{2a + b + c}$
• Russell (and Rao) index: $s_{ii'} = \dfrac{a}{p}$
• Sokal and Sneath index
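A minimal sketch (the helper name binary_similarities and the 0/1 vectors are only for the example) computing the counts a, b, c, d and some of these coefficients for a pair of individuals:

```python
import numpy as np

def binary_similarities(x, y):
    """x, y: 0/1 vectors for two individuals (one entry per binary variable)."""
    x, y = np.asarray(x), np.asarray(y)
    a = int(np.sum((x == 1) & (y == 1)))   # characteristic present in both
    b = int(np.sum((x == 1) & (y == 0)))   # present only in the first
    c = int(np.sum((x == 0) & (y == 1)))   # present only in the second
    d = int(np.sum((x == 0) & (y == 0)))   # absent in both
    return {"jaccard": a / (a + b + c),
            "sorensen": 2 * a / (2 * a + b + c),
            "russell": a / (a + b + c + d)}

print(binary_similarities([1, 0, 0, 1, 1, 0], [1, 0, 0, 1, 1, 1]))
```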
• For nominal data with more than two categories, a strategy can be used that consists
of decomposing each variable into binary variables – as many as the categories of
that variable –, then building the 2x2 table based on all the defined variables and
applying a coefficient among those already mentioned for binary variables.
• For ordinal data where it makes sense, one might think that if an object has a certain
level of a variable then it also has all levels below it. In these cases, as many binary
variables are constructed as there are attributes and the value 1 is assigned to the
variable corresponding to the highest attribute that the object has, as well as to all
variables corresponding to lower levels. The 2x2 table is then built based on all the
defined variables and a coefficient from among those already mentioned for binary
variables is applied.
Example 4.3.1: Ordinal data
Original data (5 objects, ordinal variables X1 and X2):

  Object   X1   X2
    1       2    5
    2       2   10
    3       3    5
    4       4    2
    5       2    2

Decomposition into cumulative binary variables (one per observed category):

  Object   X1(2)  X1(3)  X1(4)  X2(2)  X2(5)  X2(10)
    1        1      0      0      1      1      0
    2        1      0      0      1      1      1
    3        1      1      0      1      1      0
    4        1      1      1      1      0      0
    5        1      0      0      1      0      0

For example, for objects 1 and 2 one obtains a = 3, b = 0, c = 1, d = 2, so the Jaccard index is 3/(3 + 0 + 1) = 0.75.
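A minimal sketch of this decomposition (the function name cumulative_binary is only for the example), applied to the data of Example 4.3.1:

```python
import numpy as np

def cumulative_binary(x):
    """Decompose an ordinal variable into cumulative 0/1 indicators:
    the indicator for category c equals 1 iff the object's value is >= c."""
    cats = np.unique(x)                     # sorted observed categories
    return np.column_stack([(np.asarray(x) >= c).astype(int) for c in cats])

X1 = [2, 2, 3, 4, 2]
X2 = [5, 10, 5, 2, 2]

# Rows = objects 1..5, columns = X1(2), X1(3), X1(4), X2(2), X2(5), X2(10)
print(np.hstack([cumulative_binary(X1), cumulative_binary(X2)]))
```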
• Quantitative variables
The traditional correlation coefficient between variables can be used ($r_{jj'}$ = correlation between variables $X_j$ and $X_{j'}$), which proves to be related to the Euclidean distance between the standardized variables as follows:
$$d^2_{jj'} = 2n\,(1 - r_{jj'})$$
where $n$ is the number of statistical units.
• Qualitative variables
The already known association coefficients can be used for contingency tables.
4.4: Hierarchical Methods
In these methods the groups form a hierarchy: any two groups, whatever they are, are either disjoint or one of them is contained in the other.
The hierarchy is usually represented by a tree (dendrogram); the tree branches can be positioned vertically or horizontally, and the axes must be suitably adapted.
An agglomerative type algorithm proceeds in cycles, which include the following steps:
Step 1: Start with n classes (each object in its own class) and the n×n dissimilarity matrix D.
Step 2: Find the smallest element of D, say $d_{rs}$, and merge the corresponding classes r and s into a new class.
Step 3: Calculate the dissimilarity values between the new class and all the others, replacing the values of the r-th and s-th rows and columns of D by these new values. The dissimilarity matrix thus has one row and one column fewer.
Step 4: Repeat steps 2 and 3, so that at the end of each cycle the number of groups is reduced by one, until a single group is obtained.
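A minimal sketch of this algorithm using scipy (the dissimilarity matrix below is a toy example, not the one from Example 4.4.1):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy symmetric dissimilarity matrix between 5 objects a..e
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  5,  9,  8],
              [ 6,  5,  0,  4,  5],
              [10,  9,  4,  0,  3],
              [ 9,  8,  5,  3,  0]], dtype=float)

# linkage expects the condensed (upper-triangular) form of D;
# method="single" applies the Single Linkage agglomeration rule
Z = linkage(squareform(D), method="single")
print(Z)    # each row: the two classes merged, the merge level, the new class size

dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.show()
```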
4.4.1: Agglomerative Methods
Furthermore, for each of these methods, different coefficients can be used in the construction of the dissimilarity matrix D ⇒ different classifications.

Single Linkage Method
If $C_r$ and $C_s$ are two classes, the dissimilarity between them is given by the value of the smallest dissimilarity between an element of $C_r$ and an element of $C_s$, that is:
$$d(C_r, C_s) = \min_{i \in C_r,\; j \in C_s} d_{ij}$$
Advantages:
• It can be implemented in a divisive-type algorithm (although computationally inefficient).
• The classification (the tree) will be the same for any monotonic transformation of the drs distances.
• A tie at the shortest distance does not change the classification (with other methods this raises some
doubts).
• Optimizes the formation of sets of related points.
Disadvantages:
• It produces chain hierarchies – links once established cannot be undone.
Example 4.4.1 (Single Linkage)
1st Cycle:
Step 1: Consider the following dissimilarity matrix D between 5 elements (a, b, c, d, e). We have 5 classes → $C_1^0 = \{a\}$, $C_2^0 = \{b\}$, $C_3^0 = \{c\}$, $C_4^0 = \{d\}$, $C_5^0 = \{e\}$.
Step 2: The smallest element of the matrix D is $d_{12} = 2$ → we join classes $C_1^0$ and $C_2^0$. The new class is $C_1^1 = C_1^0 \cup C_2^0 = \{a, b\}$, with the remaining classes $\{c\}$, $\{d\}$, $\{e\}$ unchanged.
Step 3: The distances between the new class and the rest are calculated (using the minimum rule), and the following cycles proceed in the same way.
[Dendrogram for the Single Linkage method over a, b, c, d, e: Level 1 → 2, Level 2 → 3, Level 3 → 4.]
Complete Linkage Method
In this method the dissimilarity between two classes is given by the value of the largest dissimilarity between an element of one class and an element of the other: $d(C_r, C_s) = \max_{i \in C_r,\; j \in C_s} d_{ij}$.
[Dendrogram for the Complete Linkage method over a, b, c, d, e: Level 1 → 2, Level 3 → 5.]
Group Average Method
In this method the dissimilarity between two classes is given by the average of the dissimilarities between the elements of one class and the elements of the other: $d(C_r, C_s) = \dfrac{1}{n_r n_s} \sum_{i \in C_r} \sum_{j \in C_s} d_{ij}$.
[Dendrogram for the Group Average method over a, b, c, d, e.]

Centroid Method
In this method the distance between two classes is given by the distance between the centers of
the classes.
Then the distance between two classes $C_i$ and $C_{i'}$ is given by:
$$d(C_i, C_{i'}) = d(\bar{\mathbf{x}}_{C_i}, \bar{\mathbf{x}}_{C_{i'}})$$
where $\bar{\mathbf{x}}_{C_i}$ denotes the centroid (vector of means) of class $C_i$ and $d$ is a measure of distance (which can be, for example, the Euclidean distance).
Instead of a dissimilarity, a similarity can also be used.
The centroid of a new class $C_i^j$, resulting from the merger of two classes $C_k^{j-1}$ and $C_{k'}^{j-1}$ (with $n_k$ and $n_{k'}$ elements) in the $j$-th cycle of the algorithm, is given by:
$$\bar{\mathbf{x}}_{C_i^j} = \frac{n_k\,\bar{\mathbf{x}}_{C_k^{j-1}} + n_{k'}\,\bar{\mathbf{x}}_{C_{k'}^{j-1}}}{n_k + n_{k'}}$$
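A quick numeric check of this update rule (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 2))   # class C_k  with n_k  = 4 elements
B = rng.normal(size=(6, 2))   # class C_k' with n_k' = 6 elements

# The weighted combination of the two centroids ...
merged = (len(A) * A.mean(axis=0) + len(B) * B.mean(axis=0)) / (len(A) + len(B))

# ... equals the centroid computed directly from all the elements together
assert np.allclose(merged, np.vstack([A, B]).mean(axis=0))
```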
Ward Method
In this method, the dissimilarity between two classes $C_k$ and $C_{k'}$ (with $n_k$ and $n_{k'}$ elements and centroids $\bar{\mathbf{x}}_k$ and $\bar{\mathbf{x}}_{k'}$) is measured by:
$$I_{kk'} = \frac{n_k\, n_{k'}}{n_k + n_{k'}}\; \lVert \bar{\mathbf{x}}_k - \bar{\mathbf{x}}_{k'} \rVert^2$$
• It is proved that this is the measure of the increment that the sum of squares of the
distances of the elements of the two classes Ck and Ck' to the respective centroids suffers
when these two classes are joined (results from the difference between the sum of squares of
the distances of each element to the centroid of the new merged class and the sum of
squares of the distances of the elements of the original classes to the respective centroids).
• To decide which classes to join, in each cycle of the algorithm this increment is calculated for all possible pairs of classes, and the two classes with the smallest increment are joined to form the new class.
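A quick numeric check of this result (hypothetical data): the weighted squared distance between the centroids equals the increase in the within-class sum of squares caused by the merge.

```python
import numpy as np

def ss(X):
    """Sum of squared distances of the rows of X to their centroid."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

rng = np.random.default_rng(3)
Ck, Ck2 = rng.normal(size=(5, 3)), rng.normal(size=(8, 3))

increment = ss(np.vstack([Ck, Ck2])) - ss(Ck) - ss(Ck2)
ward = (len(Ck) * len(Ck2) / (len(Ck) + len(Ck2))
        * float(((Ck.mean(axis=0) - Ck2.mean(axis=0)) ** 2).sum()))

assert np.allclose(increment, ward)
```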
4.4.2: Notes on the different methods
• Almost all methods can be applied with similarities or dissimilarities – it will be enough
to adapt the interpretation; change the words minimum and maximum, …
• For a hierarchical method to be considered good, it must meet some conditions, such as:
i. the results obtained must not depend on the designation of the objects.
ii. the method must be well defined → for the same set of dissimilarities, the same tree
must always be obtained (that is, in case of ties, whatever the choice, the tree must
always be the same)
iii. small changes in the data must correspond to small changes in the resulting tree.
iv. the fact of adding or removing an object from the analysis should produce only small
changes in the tree.
The Single Linkage method has these properties and is therefore considered one of the best.
However, there is no method that can be said to be “the best”, so the ideal is to:
• apply various methods
• check whether they all reveal the same type of clusters.
4.4.3: How to choose the best Partition?
There are situations in which it is important to consider the entire tree (as, for example,
in taxonomy), but, often, what is intended is to define a grouping into classes.
It is therefore necessary to know how to choose the number of classes to consider.
This decision can be based on the distances between classes obtained in successive cycles
(levels).
You can decide to stop the process when:
• this distance exceeds a certain value
• the successive differences between distances suddenly increase (causing a very large “jump” in the dendrogram).
The number of classes to consider is the number of groups present when the process is stopped (see the sketch below).
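A minimal sketch of these stopping rules with scipy (same toy dissimilarity matrix as in the earlier sketch; the threshold 3.5 is arbitrary):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  5,  9,  8],
              [ 6,  5,  0,  4,  5],
              [10,  9,  4,  0,  3],
              [ 9,  8,  5,  3,  0]], dtype=float)
Z = linkage(squareform(D), method="single")

# Stop when the merge distance exceeds a chosen value (here 3.5)
print(fcluster(Z, t=3.5, criterion="distance"))   # group label of each object

# Alternatively, ask directly for a given number of classes
print(fcluster(Z, t=2, criterion="maxclust"))
```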
[Dendrogram over the statistical units with two possible cut levels: one ⇒ 3 classes, the other ⇒ 4 classes.]
In addition to these simple methods, there are more complicated ones that are based on tests.
4.4.4: Validation of the Classification
Once the tree is obtained, a new dissimilarity matrix can be constructed in which element $(i, j)$ is the value of the dissimilarity between the classes that contained $i$ and $j$ immediately before their fusion (it is the value at which this fusion took place, read on the axis of the dendrogram).
To validate the classification, the initial matrix (of elements $d_{ij}$) must be compared with this new matrix (of elements $\delta_{ij}$).
This comparison can be made using, for example, the correlation between the $d_{ij}$ and the $\delta_{ij}$ values (the cophenetic correlation coefficient).
[Dendrogram for the Single Linkage method over a, b, c, d, e, with fusion levels 2, 3, 4 and 5, from which the $\delta_{ij}$ values are read.]
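A minimal sketch of this comparison with scipy (same toy dissimilarity matrix as before; cophenet returns the correlation between the original $d_{ij}$ and the $\delta_{ij}$ values read from the tree):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, cophenet

D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  5,  9,  8],
              [ 6,  5,  0,  4,  5],
              [10,  9,  4,  0,  3],
              [ 9,  8,  5,  3,  0]], dtype=float)
d = squareform(D)                 # condensed original dissimilarities d_ij
Z = linkage(d, method="single")

coef, deltas = cophenet(Z, d)     # cophenetic correlation and the delta_ij values
print(coef)
print(squareform(deltas))         # the new (ultrametric) dissimilarity matrix
```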
4.5: Non-Hierarchical Methods
• Unlike hierarchical methods, these methods do not produce hierarchies
• they produce groups (disjoint or not, depending on the method).
There are several methods based on different principles. Among them, the following stand out:
Partition Methods
Bearing in mind that (as mentioned at the beginning) objects located within the same group must be more similar than objects located in different groups, the aim is to select the best partition among all possible ones.
The total number of possible partitions (even for a not very large number of objects) is very high!
The solution that consists of analysing all possible partitions and choosing the best one is not feasible!
Instead, only some partitions are examined in order to find the best one, optimizing a previously established group-formation criterion.
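As an illustration of such an iterative, criterion-optimizing partition method, a minimal sketch with the well-known k-means algorithm (scikit-learn; shown only as an example, not necessarily the exact variant discussed in these notes):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (20, 2)),   # hypothetical data: two clouds of points
               rng.normal(8, 1, (20, 2))])

# Partition into k = 2 groups by minimizing the within-group sum of squares
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # group of each object
print(km.cluster_centers_)   # group centroids
```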
The degree of association revealed by the table determines the proximity of the two classifications.

Squared distances ($d^2$) of each object to the centroid of each group:

  Group     Centroid (x1, x2)      A        B        C        D        E
  {A,B}       (3.5, 4.5)         14.5     14.5     56.5    132.75   156.5
  {C,D,E}     (11.67, 7)         94.51    80.49    83.83    20.09    22.75

Since 56.5 < 83.83, object C is closer to the centroid of {A, B} than to that of {C, D, E}.