
Statistical Inference and Multivariate Analysis

(MA324)
Lecture Slides
Lecture 35

Cluster Analysis

Indian Institute of Technology Guwahati

Jan-May 2025
Cluster Analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Clustering can therefore be formulated as a multi-objective optimization problem.

Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error.

The basic objective in cluster analysis is to discover natural groupings of the items (or variables).

Applications of Cluster Analysis...
Image analysis
Pattern recognition
Information Retrieval
Data compression
Bioinformatics
Computer graphics
Anomaly detection
Medical science
Natural language processing (NLP)
Crime analysis
Social science
Robotics
Finance
Petroleum geology
Food Industry
Similarity Measures: Understanding Proximity

In cluster analysis, we must first develop a quantitative scale on which to measure the association (similarity) between objects.

To understand the "closeness" or "similarity" among objects and clusters, two different methods can be used:

Distance Measure: Distances and Similarity Coefficients for Pairs of Items.

Association Measure: Similarities and Association Measures for Pairs of Variables.

Distance Measure
Here, using this method, we estimate the statistical distance between two items, say two p-dimensional observations $\mathbf{x}' = [x_1, \ldots, x_p]$ and $\mathbf{y}' = [y_1, \ldots, y_p]$. For this procedure, we may use various distance metrics, namely:

Mahalanobis distance (Statistical Distance) between two observations, given by

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})' A (\mathbf{x} - \mathbf{y})},$$

where $A = S^{-1}$, $S$ being the matrix of sample variances and covariances.

Minkowski metric, given by

$$d(\mathbf{x}, \mathbf{y}) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}$$

Canberra metric,

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}$$

Czekanowski coefficient,

$$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}$$

Association Measure

When the variables are binary, the data can again be arranged in the form of a
contingency table. In such a situation, it is better to get a measure of
association among the variables.

A contingency table for a pair of binary variables i and k:

                 Variable k
Variable i      1        0        Total
    1           a        b        a + b
    0           c        d        c + d
  Total       a + c    b + d      n = a + b + c + d

Product Moment Correlation

The usual product moment correlation formula applied to the binary variables in the contingency table is

$$r = \frac{ad - bc}{[(a + b)(c + d)(a + c)(b + d)]^{1/2}}$$

This product moment correlation can be taken as a measure of the similarity between the two variables.

The moment correlation coefficient is related to the chi-square statistic, $r^2 = \chi^2 / n$, for testing the independence of two categorical variables. Keeping $n$ fixed, a large similarity (or correlation) is consistent with the absence of independence.
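A small sketch verifying the identity $r^2 = \chi^2/n$ on a 2×2 table with hypothetical cell counts $a, b, c, d$:

```python
import math

def binary_r(a, b, c, d):
    """r = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def chi_square(a, b, c, d):
    """Pearson chi-square for a 2x2 table: n (ad - bc)^2 / (product of margins)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

a, b, c, d = 18, 4, 6, 12                  # hypothetical cell counts
n = a + b + c + d
print(binary_r(a, b, c, d) ** 2)           # equals chi^2 / n ...
print(chi_square(a, b, c, d) / n)          # ... as printed here
```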

Cluster Creation
Now, in cluster analysis, the main aim is to create the clusters using one of two major techniques, namely:

Hierarchical Clustering: proceeds by either a series of successive mergers or a series of successive divisions. The two types of hierarchical clustering are:

Agglomerative hierarchical methods.

Divisive hierarchical methods (Self Study).

Non-Hierarchical Clustering: designed to group items, rather than variables, into a collection of K clusters. The most common methodology is:

K-Means Method.

Agglomerative Hierarchical Methods

This clustering methodology starts with the individual objects. Initially, there are as many clusters as objects.

The most similar objects are first grouped, and these initial groups are merged according to their similarities.

As clustering progresses, similarity decreases, and all subgroups are eventually fused into a single cluster.

One of the most common families of agglomerative hierarchical methods is the linkage methods.

Algorithm for Agglomerative Hierarchical Clustering
When using an agglomerative methodology to cluster N objects, the following steps (algorithm) are usually followed:

Start with N clusters, each containing a single entity, and an $N \times N$ symmetric matrix of distances (or similarities) $D = \{d_{ik}\}$.

Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the most similar clusters $U$ and $V$ be $d_{UV}$.

Merge clusters $U$ and $V$ into a newly formed cluster $(UV)$, and update the entries in the distance matrix by:

deleting the rows and columns corresponding to clusters $U$ and $V$;

adding a row and column for the distances between the newly formed cluster $(UV)$ and the remaining clusters.

The above steps are repeated a total of $N - 1$ times, so that all objects end up in a single cluster. (A sketch of this loop follows below.)
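A plain-Python sketch of the $N - 1$ merge loop above, using single linkage on a small hypothetical distance matrix; the matrix values are illustrative only, not from the lecture.

```python
import numpy as np

# Hypothetical 4x4 symmetric distance matrix D = {d_ik} (illustrative).
D = np.array([[0., 9., 3., 6.],
              [9., 0., 7., 5.],
              [3., 7., 0., 4.],
              [6., 5., 4., 0.]])
clusters = [{0}, {1}, {2}, {3}]            # start: one cluster per object

while len(clusters) > 1:
    # Search for the nearest (most similar) pair of clusters U and V.
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(D[u, v] for u in clusters[i] for v in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d_uv, i, j = best
    # Merge U and V into (UV); recomputing distances from the merged index
    # sets plays the role of deleting/adding rows and columns of D.
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
    clusters.append(merged)
    print(f"merged at distance {d_uv}: {sorted(merged)}")
```

On these trial values the merges occur at distances 3, 4, and 5, which would be the join heights in the corresponding dendrogram.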


Linkage Method

In the above-mentioned algorithm, different definitions of the inter-cluster distance $d_{(UV)W}$ give rise to different types of linkages, and hence to different clustering methodologies. The three commonly used linkages are (a sketch of the update rules follows below):

Single Linkage: $d_{(UV)W} = \min\{d_{UW}, d_{VW}\}$

Complete Linkage: $d_{(UV)W} = \max\{d_{UW}, d_{VW}\}$

Average Linkage: $d_{(UV)W} = \dfrac{\sum_i \sum_k d_{ik}}{N_{(UV)} N_W}$, where $N_{(UV)}$ and $N_W$ are the number of items in clusters $(UV)$ and $W$, and $d_{ik}$ is the distance between the $i$th object of cluster $(UV)$ and the $k$th object of cluster $W$.
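For quick reference, a minimal sketch of the three rules written as update functions of the old distances $d_{UW}$ and $d_{VW}$; the average-linkage version uses the standard size-weighted form, which is equivalent to averaging all item-pair distances between $(UV)$ and $W$.

```python
def single_linkage(d_uw, d_vw):
    return min(d_uw, d_vw)

def complete_linkage(d_uw, d_vw):
    return max(d_uw, d_vw)

def average_linkage(d_uw, d_vw, n_u, n_v):
    # Size-weighted mean of the old distances; equivalent to averaging
    # all item-pair distances between the merged cluster (UV) and W.
    return (n_u * d_uw + n_v * d_vw) / (n_u + n_v)

print(single_linkage(2.0, 5.0),                  # 2.0
      complete_linkage(2.0, 5.0),                # 5.0
      average_linkage(2.0, 5.0, n_u=1, n_v=3))   # 4.25
```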

Clustering using single linkage:

Reference: Applied Multivariate Statistical Analysis by Johnson and Wichern.


Dendrogram
The result of the previous single linkage clustering can be graphically
observed using a dendrogram or a tree diagram. In hierarchical clustering,
the dendrogram illustrates the arrangement of the clusters produced by
the corresponding cluster analyses.
[Figure 12.4 (Johnson and Wichern): Single linkage dendrogram for distances between five objects.]

In typical applications of hierarchical clustering, the intermediate results, where the objects are sorted into a moderate number of clusters, are of chief interest.

Reference: Applied Multivariate Statistical Analysis by Johnson and Wichern.
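A minimal sketch of producing such a dendrogram with SciPy's hierarchy tools; the five two-dimensional points below are hypothetical stand-ins for the five objects of the figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five hypothetical objects with two measured features each.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))

Z = linkage(X, method="single")            # single linkage merge history
dendrogram(Z, labels=[str(i) for i in range(1, 6)])
plt.ylabel("merge distance")
plt.title("Single linkage dendrogram (illustrative)")
plt.show()
```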
[Figure 1 (Gawel, Oberholster & Francis): A dendrogram showing proximity in red wine mouth-feel terminology, as assessed by a combined panel of experienced wine-tasters and wine-makers. The asterisks show that this methodology reveals a number of logically consistent sub-groupings of terms; the leaves of the dendrogram are mouth-feel terms such as fine emery, furry, chamois, suede, velvet, satin, silk, clay, talc, plaster, chalky, powdery, grainy, dusty, sawdust, dry, parching, numbing, puckery, adhesive, grippy, chewy, abrasive, aggressive, hard, soft, supple, fleshy, rich, mouthcoat, resinous, sappy, and green.]

Reference: Gawel, R., Oberholster, A., & Francis, I. L. (2000). A 'Mouth-feel Wheel': terminology for communicating the mouth-feel characteristics of red wine. Australian Journal of Grape and Wine Research, 6(3), 203-207.


Non-Hierarchical Clustering Methods

These techniques are commonly designed to group items, rather than variables, into a collection of K clusters.

The number of clusters, K, may either be specified in advance or determined as part of the clustering procedure.

Nonhierarchical clustering methods can usually start from either of two points:

an initial partition of items into groups;

an initial set of seed points, which will form the main nuclei of clusters.

One unbiased way to start the clustering procedure is to randomly select seed points from among the items, or to randomly partition the items into initial groups.

K-Means Clustering
K-means describes an algorithm that assigns each item to the cluster having the nearest centroid (mean). The process mainly comprises three steps (a from-scratch sketch follows below):

Partition the items into K initial clusters. [Or, specify K initial centroids (seed points).]

Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. (Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.)

Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.

The above two steps are repeated until no further reassignments take place.
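Below is a from-scratch NumPy sketch of these three steps, using Euclidean distance on unstandardized observations; it assumes no cluster goes empty during the iterations, a common simplification in such sketches.

```python
import numpy as np

def k_means(X, K, seed=0, max_iter=100):
    """Basic K-means: random seed points, then nearest-centroid assignment."""
    rng = np.random.default_rng(seed)
    # Step 1: specify K initial centroids (seed points) chosen at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster with the nearest centroid
        # (Euclidean distance, unstandardized observations).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # no reassignments: converged
        labels = new_labels
        # Step 3: recalculate the centroid of every cluster (assumes none
        # of the K clusters has gone empty).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return labels, centroids
```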
Clustering using K-means method:
We measured two variables X1 and X2 for each of four items A, B, C, and D.
The data are given in the following table. The objective is to divide these
items into K = 2 clusters such that the items within a cluster are closer to
one another than they are to the items in different clusters.
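Applying the k_means sketch above to four two-dimensional items: since the data table is not reproduced here, the values below are illustrative placeholders rather than the textbook's numbers.

```python
import numpy as np

# Four items A, B, C, D measured on X1 and X2 (placeholder values).
X = np.array([[ 5.0,  3.0],   # A
              [-1.0,  1.0],   # B
              [ 1.0, -2.0],   # C
              [-3.0, -2.0]])  # D

labels, centroids = k_means(X, K=2, seed=42)   # k_means defined above
for item, lab in zip("ABCD", labels):
    print(item, "-> cluster", lab)
print("centroids:\n", centroids)
```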

Reference: Applied Multivariate Statistical Analysis by Johnson and Wichern.

