
Multivariate Data Analysis

Chapter 4 – Cluster Analysis(CA)

Lecturer: Eunice Carrasquinha


[email protected]

Departamento de Estatística e Investigação Operacional (DEIO)


Faculdade de Ciências da Universidade de Lisboa

2023/2024
Chapter 4: Cluster Analysis (CA)

• 4.1. General concepts;
• 4.2. Graphical methods;
• 4.3. Measures of similarities and dissimilarities;
• 4.4. Hierarchical methods;
• 4.5. Non-hierarchical methods;
• 4.6. Graphic representations.
4.1: General concepts
Cluster analysis is a method of grouping statistical units (individuals, objects, etc.), or
variables, into groups (whose characteristics are derived from the data) in such a way
that objects within the same group are more similar than objects located in different
groups.

• It is a method of exploratory analysis and of dimensionality reduction, in the sense that the statistical units are replaced by the groups or even by their representatives.

The main objective of cluster analysis is to classify a set of statistical units (or variables) into a set of groups that are:
• mutually exclusive
• exhaustive
• homogeneous
4.1: General concepts
The data to be clustered can be of different types:
• a data matrix summarizing the measurements of a group of variables (quantitative or
qualitative) made on a given set of statistical units
• a matrix of similarities (proximities) or dissimilarities (distances)
• a set of preference data (result from ordering a set of items according to some criterion or
preference).

The different methods of CA are basically divided into three types:


• Graphical Methods → Partition
• Hierarchical Methods → Hierarchical Clustering
• Non-Hierarchical Methods → Non-Hierarchical Clustering

Whatever the type of method, CA is very empirical:


• different methods can lead to completely different groups, both in the number of groups and in their
content.
4.2: Graphical methods
When we only have the observations of two variables, the classification can be done in
a simple way, through a graphical representation, such as:

Scatter plot

Each point represents a statistical unit (s.u.) and its coordinates are, respectively, the values that the 1st and 2nd variables take for that s.u.

[Figure: scatter plot with the 1st variable on the horizontal axis and the 2nd variable on the vertical axis; groups of points can be identified visually.]
4.2: Graphical methods
Stars, Chernoff faces, …

If the number of variables and the number of statistical units are both moderate, we
can consider using another type of graphical representation.

[Figures: star plots ("Alimentos - Estrelas") and Chernoff faces ("Alimentos - Faces de Chernoff") for a set of foods: Azeite, Manteiga, Pescada, Vaca, Frango, Leite, Iogurte, Q.Flamengo, Q.Serra, Arroz, Pão, Feijão, Açucar, Massas, Alface, Cebola, Espinafres, Cenoura, Batata, Couve. In the Chernoff faces the features are mapped (clockwise) to the variables: face/w = Energia, ear/lev = Proteínas, halfface/h = Ferro, upface/ecc = Cálcio, loface/ecc = Lípidos.]
4.2: Graphical methods
You can also use Principal Component Analysis:
• if the first two principal components explain a good part of the total variability, we
can represent the scores of the individuals in the plane defined by these two
components and try to visualize clusters of the obtained points.
• if there is a need to use more than two p.c.'s, we can use the scores of the individuals on the most important components, instead of the initial values of the variables (which were in greater number), and build the groups from them, using one of the methods of cluster analysis (a small sketch of this idea is given after the next list).

or Factor Analysis:
• using the graphical representation of the scores of the individuals in the plane defined by two factors (in a similar way to what is done in PCA);
• using the graphical representation of the loadings of the variables in the plane defined by two factors → a representation useful for the classification of variables.
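A minimal sketch of this idea (assuming a generic numeric data matrix X, not the food data from the slides): plot the scores of the individuals on the first two principal components and inspect the plot for clusters.

```python
# Hedged sketch: visualizing potential clusters in the plane of the first two
# principal components; X is a placeholder data matrix, not data from the slides.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))           # placeholder data matrix (n = 50, p = 6)

Z = StandardScaler().fit_transform(X)  # standardize so scales are comparable
pca = PCA(n_components=2).fit(Z)
scores = pca.transform(Z)              # coordinates of the individuals on PC1/PC2

print("variance explained:", pca.explained_variance_ratio_.sum())
plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.title("Scores of the individuals: look for visual clusters")
plt.show()
```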
4.3: Measures of similarities and dissimilarities
Grouping or Clustering use measures of similarity or proximity, or, conversely,
measures of dissimilarity or distance.
A measure of similarity (or dissimilarity) is a function, s (or d), that assigns to each pair of objects a value in a one-dimensional Euclidean space (usually $\mathbb{R}_0^+$), according to certain properties.

Generally, these measures take values in the range [0,1].

Given a measure of dissimilarity, d, we can determine a measure of similarity, s, associated with it, by making s = constant − d (generally, constant = maximum value that d can take).
When 2 cases are similar:
• a similarity measure takes a high value
• a dissimilarity measure takes a low value
4.3: Measures of similarities and dissimilarities
In most situations, the values of similarity (or dissimilarity) measures are not directly observed, but rather calculated from a data matrix $X$ ($n \times p$).

Data matrix → Dissimilarity (or similarity) matrix

The dissimilarity (or similarity) matrix contains the values of the measure of dissimilarity (or similarity) between the elements of each pair of objects, calculated from the data matrix: its entry $(i,j)$ is the dissimilarity (or similarity) $d_{ij}$ between individuals $i$ and $j$.

Various types of measures can be used, which are distinguished by the properties
they have.
4.3: Measures of similarities and dissimilarities
Dissimilarity
Let $d_{rs}$ denote the dissimilarity between objects $r$ and $s$; it should satisfy the properties:
(i) $d_{rs} \geq 0$   (ii) $d_{rr} = 0$   (iii) $d_{rs} = d_{sr}$

Furthermore,
• Semi-distance or semi-metric – if it verifies (i), (ii), (iii) and (iv) $d_{rs} = 0$ if and only if $\mathbf{x}_r = \mathbf{x}_s$;
• Distance or metric – if it verifies (i), (ii), (iii), (iv) and (v) the triangle inequality, $d_{rt} \leq d_{rs} + d_{st}$;
• Ultrametric – if it verifies (i), (ii), (iii), (iv) and (vi) $d_{rt} \leq \max(d_{rs}, d_{st})$.

Since (vi) ⇒ (v), every ultrametric distance is a metric.
4.3: Measures of similarities and dissimilarities
Similarity

If a similarity measure s is obtained from a dissimilarity d:

• that satisfies properties (i), (ii) and (iii) it is said to be a similarity;


• and, if property (iv) is also verified, it is said to be a proximity.
4.3: Measures of similarities and dissimilarities
Euclidean distance

Representing by $x_{ij}$ the value that variable $j$ takes for object $i$, the Euclidean distance between two objects $r$ and $s$ is given by:

$$d_{rs} = \sqrt{\sum_{j=1}^{p} (x_{rj} - x_{sj})^2}$$

which, in vector form, is written:

$$d_{rs} = \sqrt{(\mathbf{x}_r - \mathbf{x}_s)'(\mathbf{x}_r - \mathbf{x}_s)}$$

where $\mathbf{x}_i$ is the vector of observations of row $i$ (individual $i$) of matrix X.


4.3: Measures of similarities and dissimilarities
Euclidean distance

This distance, although it is one of the most used, has some disadvantages:

• it is not invariant to scale changes and, therefore, should not be used when different variables are measured in different units;
• it does not behave very well when variables have very different variances;
• it does not behave very well when variables are highly correlated;
• it does not behave very well when there are missing data.
4.3: Measures of similarities and dissimilarities
Therefore, several measures derived from it are preferably used:

Squared Euclidean distance

$$d_{rs}^2 = \sum_{j=1}^{p} (x_{rj} - x_{sj})^2$$

better than the previous one when the variables are (very) correlated.

Standardized Euclidean distance (or Karl Pearson distance)

$$d_{rs} = \sqrt{\sum_{j=1}^{p} \frac{(x_{rj} - x_{sj})^2}{s_j^2}} = \sqrt{(\mathbf{x}_r - \mathbf{x}_s)'\, D^{-1} (\mathbf{x}_r - \mathbf{x}_s)}$$

where $D$ is the diagonal matrix of the variances of the variables and $(x_{rj} - \bar{x}_j)/s_j$ is the standardized value of $x_{rj}$; it is invariant to scale changes – it is the Euclidean distance applied to the standardized data.

Mean Euclidean distance

$$d_{rs} = \sqrt{\frac{1}{p} \sum_{j=1}^{p} (x_{rj} - x_{sj})^2}$$

it has the same disadvantages as the Euclidean distance, but is advantageous when there are missing data (the average can be taken over the variables observed for both objects).
4.3: Measures of similarities and dissimilarities
Mahalanobis distance

$$d_{rs} = \sqrt{(\mathbf{x}_r - \mathbf{x}_s)'\, S^{-1} (\mathbf{x}_r - \mathbf{x}_s)}$$

where $S$ is the covariance matrix of the variables.

This distance solves not only the problem of different scales, but also the problem of the effects of correlations between variables. However, it tends to mask the results of the analysis a bit. (A computational sketch of these weighted variants is given after the note below.)
Note:
• These last 3 distances are nothing more than variants of a weighted Euclidean
distance, with weights contained respectively in the matrices D, I/p and S.
• The attribution of weights aims to eliminate the arbitrary effects of the variables,
making them contribute, not in a differentiated way, but in a homogeneous way to
the construction of dissimilarities.
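The three weighted variants above can be computed directly with scipy; a minimal sketch, assuming a generic numeric matrix X (placeholder data, not from the slides):

```python
# Hedged sketch: Euclidean, standardized Euclidean (Karl Pearson) and
# Mahalanobis distances between the rows of a placeholder data matrix X.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))                               # placeholder data matrix

d_euc = squareform(pdist(X, metric="euclidean"))          # plain Euclidean
d_std = squareform(pdist(X, metric="seuclidean",
                         V=X.var(axis=0, ddof=1)))        # weights D^-1 (variances)
d_mah = squareform(pdist(X, metric="mahalanobis",
                         VI=np.linalg.inv(np.cov(X.T))))  # weights S^-1 (covariances)

print(np.round(d_euc, 2))
print(np.round(d_std, 2))
print(np.round(d_mah, 2))
```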
4.3: Measures of similarities and dissimilarities
Similarities are often used when working with binary data

Binary data derive from the observation of variables with only two categories: 1 or 0
depending on whether or not a characteristic is present in an individual.
One can build, for each pair of individuals (i, i'), a table of the type:

          i' = 1   i' = 0
 i = 1      a        b       a+b
 i = 0      c        d       c+d
           a+c      b+d      p = a+b+c+d

Based on this table, in addition to the already known association measures for 2x2 contingency tables, several (many) similarity coefficients were suggested, such as the Jaccard index ($a/(a+b+c)$), the Sorenson index ($2a/(2a+b+c)$), the Russel index ($a/p$) and the Sokal and Sneath index.
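A minimal sketch of how the 2x2 table and, for instance, the Jaccard index can be computed for two binary profiles (the function names are illustrative, not from the slides):

```python
# Hedged sketch: the 2x2 table (a, b, c, d) for a pair of binary profiles and
# the Jaccard index a / (a + b + c).
import numpy as np

def binary_table(x, y):
    """Return the counts a, b, c, d for two 0/1 vectors of equal length."""
    x, y = np.asarray(x), np.asarray(y)
    a = int(np.sum((x == 1) & (y == 1)))   # characteristic present in both
    b = int(np.sum((x == 1) & (y == 0)))   # present only in x
    c = int(np.sum((x == 0) & (y == 1)))   # present only in y
    d = int(np.sum((x == 0) & (y == 0)))   # absent in both
    return a, b, c, d

def jaccard(x, y):
    a, b, c, _ = binary_table(x, y)
    return a / (a + b + c)

x = [1, 1, 0, 1, 0, 0]
y = [1, 0, 0, 1, 1, 0]
print(binary_table(x, y), jaccard(x, y))   # (2, 1, 1, 2) and 2/4 = 0.5
```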
4.3: Measures of similarities and dissimilarities

• For nominal data with more than two categories, a strategy can be used that consists
of decomposing each variable into binary variables – as many as the categories of
that variable –, then building the 2x2 table based on all the defined variables and
applying a coefficient among those already mentioned for binary variables.

• For ordinal data where it makes sense, one might think that if an object has a certain
level of a variable then it also has all levels below it. In these cases, as many binary
variables are constructed as there are attributes and the value 1 is assigned to the
variable corresponding to the highest attribute that the object has, as well as to all
variables corresponding to lower levels. The 2x2 table is then built based on all the
defined variables and a coefficient from among those already mentioned for binary
variables is applied.
4.3: Measures of similarities and dissimilarities
Example 4.3.1: Ordinal data

Ind   X1   X2  |  X1(2)  X1(3)  X1(4)  X2(2)  X2(5)  X2(10)
 1     2    5  |    1      0      0      1      1      0
 2     2   10  |    1      0      0      1      1      1
 3     3    5  |    1      1      0      1      1      0
 4     4    2  |    1      1      1      1      0      0
 5     2    2  |    1      0      0      1      0      0

For every two individuals, a table is then constructed. For example:

For the pair (1,2):          and for the pair (1,4):
 1\2    1   0                  1\4    1   0
  1     3   0                   1     2   1
  0     1   2                   0     2   1
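A small sketch reproducing the 2x2 tables of this example from the binary expansion (the array below simply encodes the table above):

```python
# Hedged sketch: rebuilding the 2x2 tables of Example 4.3.1 from the binary
# expansion of the two ordinal variables.
import numpy as np

B = np.array([                    # rows: individuals 1..5
    [1, 0, 0, 1, 1, 0],           # columns: X1(2) X1(3) X1(4) X2(2) X2(5) X2(10)
    [1, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
])

def table(r, s):
    x, y = B[r - 1], B[s - 1]
    a = int(np.sum((x == 1) & (y == 1)))
    b = int(np.sum((x == 1) & (y == 0)))
    c = int(np.sum((x == 0) & (y == 1)))
    d = int(np.sum((x == 0) & (y == 0)))
    return a, b, c, d

print(table(1, 2))   # (3, 0, 1, 2), as in the slide
print(table(1, 4))   # (2, 1, 2, 1)
```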
4.3: Measures of similarities and dissimilarities
• For quantitative data, the most commonly used measure of similarity is a coefficient similar to the traditional correlation coefficient (which measures the correlation between variables), in which the similarity between two objects r and s is given by:

$$s_{rs} = \frac{\sum_{j=1}^{p} (x_{rj} - \bar{x}_r)(x_{sj} - \bar{x}_s)}{\sqrt{\sum_{j=1}^{p} (x_{rj} - \bar{x}_r)^2 \; \sum_{j=1}^{p} (x_{sj} - \bar{x}_s)^2}}, \qquad \bar{x}_i = \frac{1}{p} \sum_{j=1}^{p} x_{ij}$$

However, the use of this measure is somewhat restricted, as it has several shortcomings:
• it can take negative values ($-1 \leq s_{rs} \leq 1$);
• the fact of taking the value 1 does not mean that the two objects are equal;
• the meaning of the averages of the observations of an object over all the variables is not at all clear!
4.3: Measures of similarities and dissimilarities
If there are variables of different types, different strategies can be used:
• Perform separate clustering analyses – one for each group of variables of the same
type
• Reduce all variables to binary variables
• Build a combined coefficient of similarity – such as the one proposed by Gower (a computational sketch follows below):

$$s_{rs} = \frac{\sum_{k=1}^{p} w_{rsk}\, s_{rsk}}{\sum_{k=1}^{p} w_{rsk}}$$

where $s_{rsk}$ is the similarity between objects r and s based on variable k, and $w_{rsk}$ is a weight that is 1 or 0 depending on whether the comparison of objects r and s on variable k is valid or not.
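A minimal sketch of Gower's combined coefficient for mixed data, assuming the usual conventions (numeric variables contribute 1 − |x_rk − x_sk|/range_k, categorical variables contribute 1 if the categories match, and w_rsk = 0 when either value is missing); the data frame below is a made-up illustration:

```python
# Hedged sketch of Gower's combined similarity for mixed-type data.
import numpy as np
import pandas as pd

def gower_similarity(df, r, s):
    num, den = 0.0, 0.0
    for col in df.columns:
        xr, xs = df.loc[r, col], df.loc[s, col]
        if pd.isna(xr) or pd.isna(xs):
            continue                                   # w_rsk = 0: comparison not valid
        if pd.api.types.is_numeric_dtype(df[col]):
            rng = df[col].max() - df[col].min()
            s_k = 1.0 if rng == 0 else 1.0 - abs(xr - xs) / rng
        else:
            s_k = 1.0 if xr == xs else 0.0
        num += s_k                                     # w_rsk = 1
        den += 1.0
    return num / den

df = pd.DataFrame({"height": [1.70, 1.80, 1.65],
                   "smoker": ["yes", "no", "yes"],
                   "income": [1200, np.nan, 900]})
print(gower_similarity(df, 0, 2))
```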
4.3: Measures of similarities and dissimilarities
Some similarities between variables

• Quantitative variables
The traditional correlation coefficient between variables can be used ($r_{jj'}$ = correlation between variables $X_j$ and $X_{j'}$). For standardized variables (zero mean, unit standard deviation computed with divisor $n-1$), it is related to the Euclidean distance between the two variables by:

$$d_{jj'}^2 = 2(n-1)\,(1 - r_{jj'})$$

• Qualitative variables
The already known association coefficients can be used for contingency tables.
4.4: Hierarchical Methods

In these methods the groups form a hierarchy: any two groups, whatever they are, are either disjoint or one of them is contained in the other.

The main objective of hierarchical classification is to build a tree, whose two-


dimensional graphic representation is called a dendrogram, tree diagram, hierarchical
tree or phenogram.
4.4: Hierarchical Methods
The tree can be built:

• from bottom to top – using an agglomerative method that proceeds to a series of


successive merging of the cases into new classes, ending up with a single class that
contains all the elements:

• starts with n classes, each containing an element


• in successive steps, the two closest (similar) classes are combined, thus
successively reducing the number of classes (at each step (or level) this
number decreases by one unit)
• at the end, all observations will be grouped in a single class that will have n
elements.
4.4: Hierarchical Methods
• from top to bottom – using a divisive method that partitions the total set of cases
successively into finer partitions, ending up with as many classes as there are cases:

• starts with a class, containing all the elements


• in successive steps, the classes are divided into groups of more distant
(dissimilar) classes, thus successively increasing the number of classes (at each
step, or level, this number increases by one unit)
• at the end, each observation will be in a class, with as many classes as there are
cases.
4.4: Hierarchical Methods

• on one of the axes, the objects (or variables) are identified, positioned at equal distances from each other and in an order related to the classification obtained (which may vary depending on the method applied);

• on the other axis, the values of the distances between the groups that presided over the formation of each new group are marked, or other additional information (such as the step number or the number of groups at each step).
4.4: Hierarchical Methods

The tree branches can be positioned vertically or horizontally, and the axes must be
suitably adapted.
4.4: Hierarchical Methods

Once the hierarchical classification is done, one can also obtain a partition:
• just draw a line that crosses the dendrogram at a given level and take as partition classes those defined at the level of this line.
4.4: Hierarchical Methods

In addition to the distinction between agglomerative and divisive methods, there


are still other criteria of distinction, of a different nature, which allow the
establishment of other types of methods.

Such criteria arise because it is necessary to decide:

• which objects are closer or further away
  ✓ coefficients of similarity or distance between individuals will have to be used;
• which classes should be joined or separated
  ✓ measures of similarity or distance between classes will have to be used;
• and different coefficients or measures ⇒ different methods.
4.4.1: Agglomerative Methods
Let $d_{rs}$ = dissimilarity between two objects $r$ and $s$, and $d(C_r, C_s)$ = dissimilarity between two classes $C_r$ and $C_s$.

An agglomerative type algorithm proceeds in cycles, which include the following steps:

Step 1: Start with n classes, each one formed by a single object.
Determine the matrix D ($n \times n$) of the $d_{rs}$, that is, the matrix of dissimilarities for the n objects taken two at a time.

Step 2: Determine the smallest element of matrix D – the smallest of the $d(C_r, C_s)$, which in the 1st cycle are the $d_{rs}$.
Assuming that this is $d(C_r, C_s)$, merge classes $C_r$ and $C_s$ into one, leaving one class less.

Step 3: Calculate the dissimilarity values between the new class and all the others, replacing the values of the r-th and s-th rows and columns of D by these new values. The dissimilarity matrix thus has one row and one column less.

Step 4: Repeat steps 2 and 3, so that at the end of each cycle the number of groups is reduced by one, until a single group is obtained. (A computational sketch is given below.)
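A minimal computational sketch of this agglomerative scheme using scipy (placeholder data; the 'method' argument selects how class-to-class dissimilarities are computed):

```python
# Hedged sketch: agglomerative clustering with scipy on a placeholder matrix X.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))                 # placeholder data matrix

d = pdist(X, metric="euclidean")             # dissimilarities between the n objects
Z = linkage(d, method="single")              # agglomerative cycles described above

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 groups
print(Z)        # each row: the two classes merged and the level of the merge
print(labels)
```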
4.4.1: Agglomerative Methods
Furthermore, for each of these methods, different coefficients can be used in the
construction of the dissimilarity matrix D ⇒ different classifications.

Therefore, to make clear which method is being used, it should be mentioned:
• whether the method is divisive or agglomerative;
• which coefficient is used to compare classes;
• which coefficient is used to compare individuals.

Among the main methods of agglomerative type we can mention:
• Single Linkage Method
• Complete Linkage Method
• Group Average Method
• Centroid Method
• Median Method
• Ward Method
4.4.1: Agglomerative Methods
Single Linkage Method

Given two classes $C_r$ and $C_s$, the dissimilarity between them is given by the value of the smallest dissimilarity between an element of $C_r$ and an element of $C_s$, that is:

$$d(C_r, C_s) = \min \{\, d_{ij} : i \in C_r,\; j \in C_s \,\}$$

Advantages:
• It can be implemented in a divisive-type algorithm (although computationally inefficient).
• The classification (the tree) will be the same for any monotonic transformation of the drs distances.
• A tie at the shortest distance does not change the classification (with other methods this raises some
doubts).
• Optimizes the formation of sets of related points.

Disadvantages:
• It produces chain hierarchies – links once established cannot be undone.
4.4.1: Agglomerative Methods
Example 4.4.1
1st Cycle:
Step 1: Consider the following dissimilarity matrix between 5 elements (a, b, c, d, e).
We have 5 classes: $C_1^0=\{a\}$, $C_2^0=\{b\}$, $C_3^0=\{c\}$, $C_4^0=\{d\}$, $C_5^0=\{e\}$.

Step 2: The smallest element of matrix D is $d_{12} = 2$ → we join classes $C_1^0$ and $C_2^0$.
The new classes are: $C_1^1=\{a,b\}$, $C_2^1=\{c\}$, $C_3^1=\{d\}$, $C_4^1=\{e\}$.

Step 3: The distances between the new class and the rest are computed with the minimum rule, and the corresponding rows and columns of D are replaced, giving the new dissimilarity matrix $D_1$.
4.4.1: Agglomerative Methods
2nd Cycle:
Step 2: The smallest element of matrix $D_1$ is 3 → we join classes $C_3^1$ and $C_4^1$.
The new classes are: $C_1^2=\{a,b\}$, $C_2^2=\{c\}$, $C_3^2=\{d,e\}$.

Step 3: The distances between the new class and the rest are recalculated, giving the new dissimilarity matrix $D_2$.
4.4.1: Agglomerative Methods
3rd Cycle:
Step 2: The smallest element of matrix $D_2$ is 4 → we join classes $C_2^2$ and $C_3^2$.
The new classes are: $C_1^3=\{a,b\}$, $C_2^3=\{c,d,e\}$.

Step 3: The distance between the two remaining classes is recalculated, giving the new dissimilarity matrix $D_3$.
4.4.1: Agglomerative Methods
4th Cycle:
Step 2 + Step 3: We join the two remaining classes, obtaining a single class: $C_1^4=\{a,b,c,d,e\}$.

Final Dendrogram (levels at which the merges occurred; leaves a, b, c, d, e):
Level 1 → 2
Level 2 → 3
Level 3 → 4
Level 4 → 5
4.4.1: Agglomerative Methods
Complete Linkage Method

The dissimilarity between two classes $C_r$ and $C_s$ is defined as the greatest dissimilarity between an element of $C_r$ and an element of $C_s$:

$$d(C_r, C_s) = \max \{\, d_{ij} : i \in C_r,\; j \in C_s \,\}$$

Applying this method to the dissimilarity matrix of the previous example, we obtain the dendrogram (leaves a, b, c, d, e):
Level 1 → 2
Level 2 → 3
Level 3 → 5
Level 4 → 10
4.4.1: Agglomerative Methods
Group Average Method

The dissimilarity between two classes $C_r$ and $C_s$ is given by the average of the dissimilarities between the elements of all pairs that can be formed with one element of $C_r$ and another of $C_s$. That is, assuming that $C_r$ has $n_r$ elements and $C_s$ has $n_s$ elements:

$$d(C_r, C_s) = \frac{1}{n_r\, n_s} \sum_{i \in C_r} \sum_{j \in C_s} d_{ij}$$

Applying this method to the dissimilarity matrix of the previous example, we obtain the dendrogram (leaves a, b, c, d, e):
Level 1 → 2
Level 2 → 3
Level 3 → 4.5
Level 4 → 7.8
4.4.1: Agglomerative Methods
Comparison of the three methods on the dissimilarity matrix of the previous example:

Single Linkage Method:    Level 1 → 2, Level 2 → 3, Level 3 → 4,   Level 4 → 5
Group Average Method:     Level 1 → 2, Level 2 → 3, Level 3 → 4.5, Level 4 → 7.8
Complete Linkage Method:  Level 1 → 2, Level 2 → 3, Level 3 → 5,   Level 4 → 10

[The three dendrograms of the example shown side by side, with the merging levels listed above.]
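A minimal sketch comparing the three criteria with scipy on a made-up 5x5 dissimilarity matrix chosen to be consistent with the merging levels shown above (not necessarily the matrix used in the slides):

```python
# Hedged sketch: single, average and complete linkage on the same dissimilarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[ 0,  2,  6, 10,  9],     # made-up matrix for objects a..e
              [ 2,  0,  5,  9,  8],
              [ 6,  5,  0,  4,  5],
              [10,  9,  4,  0,  3],
              [ 9,  8,  5,  3,  0]], dtype=float)

d = squareform(D)                        # condensed form expected by linkage
for method in ("single", "average", "complete"):
    Z = linkage(d, method=method)
    print(method, "merge levels:", np.round(Z[:, 2], 2))
```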


4.4.1: Agglomerative Methods
Centroid Method

In this method the distance between two classes is given by the distance between the centers (centroids) of the classes.

The center of a class $C_i$, with $n_i$ elements, is given by:

$$\bar{\mathbf{x}}_i = \frac{1}{n_i} \sum_{r \in C_i} \mathbf{x}_r$$

Then the distance between two classes $C_i$ and $C_{i'}$ is given by:

$$d(C_i, C_{i'}) = d(\bar{\mathbf{x}}_i, \bar{\mathbf{x}}_{i'})$$

where d is a measure of distance (which can be, for example, the Euclidean distance). Instead of a dissimilarity, a similarity can also be used.

The centroid of a new class $C_i^j$, resulting from the merger of two classes $C_k^{j-1}$ and $C_{k'}^{j-1}$ in the j-th cycle of the algorithm, will be given by:

$$\bar{\mathbf{x}}_i^{\,j} = \frac{n_k\, \bar{\mathbf{x}}_k^{\,j-1} + n_{k'}\, \bar{\mathbf{x}}_{k'}^{\,j-1}}{n_k + n_{k'}}$$
4.4.1: Agglomerative Methods
Ward Method

In this method, the measure used to compare two classes $C_k$ and $C_{k'}$ (with $n_k$ and $n_{k'}$ elements and centroids $\bar{\mathbf{x}}_k$ and $\bar{\mathbf{x}}_{k'}$) is given by:

$$d(C_k, C_{k'}) = \frac{n_k\, n_{k'}}{n_k + n_{k'}}\; d^2(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_{k'})$$
• It is proved that this is the measure of the increment that the sum of squares of the
distances of the elements of the two classes Ck and Ck' to the respective centroids suffers
when these two classes are joined (results from the difference between the sum of squares of
the distances of each element to the centroid of the new merged class and the sum of
squares of the distances of the elements of the original classes to the respective centroids).
• To decide which classes to join, in each cycle of the algorithm, this increment is calculated
for all possible pairs of classes, selecting to form the new class, the two classes to which the
smallest increment corresponds.
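A minimal sketch of Ward's method with scipy (which, for the 'ward' option, expects raw observations or Euclidean distances); the data below are artificial:

```python
# Hedged sketch: Ward's method via scipy on two artificial clouds of points.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(10, 2)),
               rng.normal(5, 1, size=(10, 2))])   # two loose clouds of points

Z = linkage(X, method="ward")                      # merges minimizing the SS increment
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                                      # should roughly separate the two clouds
```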
4.4.2: Notes on the different methods
• Almost all methods can be applied with similarities or dissimilarities – it will be enough
to adapt the interpretation; change the words minimum and maximum, …
• For a hierarchical method to be considered good, it must meet some conditions, such as:
i. the results obtained must not depend on the designation of the objects.
ii. the method must be well defined → for the same set of dissimilarities, the same tree
must always be obtained (that is, in case of ties, whatever the choice, the tree must
always be the same)
iii. small changes in the data must correspond to small changes in the resulting tree.
iv. the fact of adding or removing an object from the analysis should produce only small
changes in the tree.

The Single Linkage method has these properties and is therefore considered one of the best.
However, there is no method that can be said to be "the best", so the ideal is to:
• apply various methods
• check whether they all reveal the same type of clusters.
4.4.3: How to choose the best Partition?
There are situations in which it is important to consider the entire tree (as, for example,
in taxonomy), but, often, what is intended is to define a grouping into classes.
It is therefore necessary to be able to decide the number of classes to consider.

This decision can be based on the distances between classes obtained in successive cycles
(levels).
One can decide to stop the process when:
• this distance exceeds a certain value;
• the successive differences between distances suddenly increase (causing a very large "jump" in the dendrogram).

The number of classes to consider is the one obtained when the process is stopped.
4.4.3: How to choose the best Partition?

[Dendrogram with two horizontal cut lines over the statistical units: the upper cut ⇒ 3 classes; the lower cut ⇐ 4 classes.]

In addition to these simple methods, there are more complicated ones that are based on tests.
4.4.4: Validation of the Classification

Once the tree is obtained, a new dissimilarity matrix can be constructed in which the element (i,j) is the value of the dissimilarity between the classes that contained i and j immediately before their fusion (it is the value that presided over this fusion – the one read on the axis of the dendrogram).

To validate the classification, the initial matrix (of elements $d_{ij}$) must be compared with this new matrix (of elements $\delta_{ij}$).
4.4.4: Validation of the Classification
This comparison can be made using:

Cophenetic correlation – obtained by calculating the value of the usual correlation coefficient between the values $d_{ij}$ of the original matrix and the values $\delta_{ij}$ of the new matrix (over all pairs $i < j$).
Good classification ⇔ r close to 1.

Stress measure – in the case of Euclidean distances, a stress measure based on the differences $d_{ij} - \delta_{ij}$ can be used.
Good classification ⇔ stress near 0.
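A minimal sketch of the cophenetic correlation with scipy on placeholder data:

```python
# Hedged sketch: validating a hierarchical classification with the cophenetic correlation.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 3))          # placeholder data matrix

d = pdist(X)                          # original dissimilarities d_ij
Z = linkage(d, method="single")
r, delta = cophenet(Z, d)             # r = cophenetic correlation, delta = the δ_ij
print("cophenetic correlation:", round(r, 3))
```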


4.4.4: Validation of the Classification
Example 4.4.1 (Single Linkage example):

[Dendrogram obtained with the Single Linkage method (Level 1 → 2, Level 2 → 3, Level 3 → 4, Level 4 → 5; leaves a, b, c, d, e), shown together with the corresponding matrix of $\delta_{ij}$ values to be compared with the original dissimilarity matrix.]
4.5: Non-Hierarchical Methods
• Unlike hierarchical methods, these methods do not produce hierarchies
• produce groups (disjoint or not depending on the method)

There are several methods, based on different principles. Among them, the partition methods stand out:

Partition Methods
• apply only to objects;
• operate on a data matrix;
• require the number of groups to be fixed at the outset.
4.5: Non-Hierarchical Methods
Partition Methods

They consist of building a partition of the set of objects, i.e., a set of disjoint groups whose union is equal to the set of objects.

Bearing in mind that (as mentioned at the beginning) objects located within the same group must be more similar than objects located in different groups:
✓ select the best partition among all possible ones.

The total number of possible partitions (even for a not very large number of objects) is very
high!
The solution that consists of analysing all possible partitions and choosing the best one is
not feasible!
We examine some partitions in order to find the best one, optimizing a previously
established group formation criterion
4.5: Non-Hierarchical Methods
Partition Methods

In general, the following steps are followed:


1. Select an initial partition of n objects into k groups
2. Evaluate all movements of each object from its own group to each of the others, recording the
changes produced in the group formation criterion used
3. Carry out the displacement corresponding to the highest value of the improvement verified in
the value of the criterion
4. Repeat steps 2 and 3 until you see that moving any object does not improve the value of the
criterion

Choice of the initial partition:
• by chance;
• based on knowledge of the problem under study;
• result of the previous application of another method of analysis – hierarchical CA, PCA, . . .
4.5: Non-Hierarchical Methods
K-means Algorithm

It is one of the most frequently used algorithms, and consists of:


1. Select an initial partition of n objects into k groups.
2. Evaluate all displacements of each object from its own group to each of the others,
recording the value of the distance of each object to the centroid of the new group.
3. Move objects so that each object is placed in the partition group that has the closest
centroid. Recalculate the centroids of the new groups thus formed.
4. Repeat steps 2 and 3 until no further moves are possible.

Centroid = center of a group, which can be:
• the average of the group's observations;
• the most central value of the group;
• the group's center of gravity;
• …
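A minimal sketch of the k-means algorithm with scikit-learn, on artificial data with k = 2 fixed in advance:

```python
# Hedged sketch: k-means with scikit-learn on two artificial groups of points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(15, 2)),
               rng.normal(6, 1, size=(15, 2))])      # two artificial groups

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)     # centroids of the final groups
print(km.labels_)              # group assigned to each object
```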
4.5: Non-Hierarchical Methods
Two different classifications of the same data can be compared:
• a contingency table is constructed in which the rows and columns are defined
respectively by the groups determined by each of the classifications.

The degree of association revealed by the table determines the proximity of the two
classifications.

Example 4.5.1

Consider the data:

Objects   X1   X2
A          2    8
B          5    1
C          4   12
D         15    4
E         16    5

Suppose the objective is to build 2 groups.
4.5: Non-Hierarchical Methods
1. We consider an initial partition of the 5 objects into 2 arbitrary groups: {A, B} and {C, D, E}.

2. The centroids of the groups and the distances of the objects to the centroids are calculated (d² = square of the Euclidean distance):

Groups     Centroid (x1, x2)    d² to: A        B        C         D         E
{A,B}      (3.5, 4.5)                 14.5     14.5     56.5     132.5     156.5
{C,D,E}    (11.67, 7)                 94.51    80.49    83.83     20.09     22.75

C is closer to {A, B} than to {C, D, E} → C must leave {C, D, E} and join {A, B}.
New groups are obtained: {A, B, C} and {D, E}.
4.5: Non-Hierarchical Methods
3. The new distances of the objects to the centroids of the new groups are:

Groups     Centroid (x1, x2)    d² to: A        B        C         D         E
{A,B,C}    (3.67, 7)                   3.79    37.77    25.11    137.37    156.03
{D,E}      (15.5, 4.5)               194.5    122.5    188.5       0.5       0.5

It is not possible to improve the partition by moving any object from the group to which it belongs.

Conclusion: the two groups are {A, B, C} and {D, E}.
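A small sketch reproducing the cycles of Example 4.5.1 with numpy, starting from the arbitrary initial partition {A, B} and {C, D, E}:

```python
# Hedged sketch: the k-means cycles of Example 4.5.1 by hand with numpy.
import numpy as np

X = np.array([[2, 8], [5, 1], [4, 12], [15, 4], [16, 5]], dtype=float)
names = np.array(["A", "B", "C", "D", "E"])
groups = np.array([0, 0, 1, 1, 1])           # initial partition: {A,B}, {C,D,E}

for _ in range(10):                           # iterate until assignments stabilize
    centroids = np.array([X[groups == g].mean(axis=0) for g in (0, 1)])
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # squared distances
    new_groups = d2.argmin(axis=1)            # move each object to the nearest centroid
    if np.array_equal(new_groups, groups):
        break
    groups = new_groups

print(centroids)                                          # final centroids
print({g: list(names[groups == g]) for g in (0, 1)})      # {0: ['A','B','C'], 1: ['D','E']}
```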


Hierarchical vs. Non-Hierarchical

Hierarchical Methods:
• produce hierarchies;
• apply to variables or objects;
• use proximity matrices;
• whenever an object is assigned to a group, it no longer leaves it.

Non-Hierarchical Methods:
• produce groups;
• apply only to objects;
• operate on data matrices;
• an object can travel from group to group.
