Lec 35 (MA324)
Lecture Slides
Lecture 35
Cluster Analysis
Jan-May 2025
Cluster Analysis
Applications of Cluster Analysis...
Image analysis
Pattern recognition
Information Retrieval
Data compression
Bioinformatics
Computer graphics
Anomaly detection
Medical science
Natural language processing (NLP)
Crime analysis
Social science
Robotics
Finance
Petroleum geology
Food Industry
Similarity Measures: Understanding Proximity
Distance Measure
Here, using this method, we try to understand or estimate the statistical distance between two given clusters (say, two $p$-dimensional observations, $\mathbf{x}' = [x_1, \ldots, x_p]$ and $\mathbf{y}' = [y_1, \ldots, y_p]$). For this procedure, we may use various distance metrics, namely:
Canberra metric,
$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}$$
Czekanowski coefficient,
$$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{2\sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}$$
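As a small illustration, here is a minimal Python sketch of these two measures (the function names and sample vectors are my own, and both formulas assume nonnegative entries with $x_i + y_i > 0$):

```python
import numpy as np

def canberra(x, y):
    """Canberra metric: sum over i of |x_i - y_i| / (x_i + y_i)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) / (x + y))

def czekanowski(x, y):
    """Czekanowski coefficient: 1 - 2*sum(min(x_i, y_i)) / sum(x_i + y_i)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 1.0 - 2.0 * np.sum(np.minimum(x, y)) / np.sum(x + y)

# Two hypothetical nonnegative p-dimensional observations
x = [1.0, 2.0, 3.0]
y = [2.0, 2.0, 1.0]
print(canberra(x, y))      # 0.833...
print(czekanowski(x, y))   # 0.272...
```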
Association Measure
When the variables are binary, the data can again be arranged in the form of a
contingency table. In such a situation, it is better to get a measure of
association among the variables.
                      Variable k
    Variable i       1       0       Total
         1           a       b       a + b
         0           c       d       c + d
Product Moment Correlation
$$r = \frac{ad - bc}{[(a + b)(c + d)(a + c)(b + d)]^{1/2}}$$
The product moment correlation coefficient is related to the chi-square statistic ($r^2 = \chi^2/n$) for testing the independence of two categorical variables. Keeping $n$ fixed, a large similarity (or correlation) is consistent with the absence of independence.
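A short sketch of this association measure, assuming the counts a, b, c, d from the table above (the numeric values below are hypothetical):

```python
import math

def binary_correlation(a, b, c, d):
    """Product moment correlation r for a 2x2 contingency table."""
    num = a * d - b * c
    den = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return num / den

# Hypothetical cell counts
a, b, c, d = 20, 5, 7, 18
n = a + b + c + d
r = binary_correlation(a, b, c, d)
chi_square = n * r ** 2      # follows from the relation r^2 = chi^2 / n
print(r, chi_square)
```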
Cluster Creation
Now, in cluster analysis, the main aim is to create the clusters using either of the two major techniques, namely:
Agglomerative Hierarchical Methods.
K-Means Method.
Agglomerative Hierarchical Methods
The most similar objects are first grouped, and these initial groups are
merged according to their similarities.
Algorithm for Agglomerative Hierarchical Clustering
Usually, while carrying out an agglomerative clustering of N objects, the following steps (or algorithm) are followed:
Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the "most similar" clusters U and V be $d_{UV}$.
Merge clusters U and V into one newly formed cluster (UV), and update the entries in the distance matrix by:
deleting the rows and columns corresponding to clusters U and V, and
adding a row and column for the distances between the newly formed cluster (UV) and the remaining clusters.
These steps are repeated until all objects are placed in a single cluster.
In the above-mentioned algorithm, different forms of the metric $D(d_{ik})$ give rise to different types of linkages and hence different clustering methodologies. The three commonly used linkages are single linkage, complete linkage, and average linkage.
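As an illustration of the algorithm, here is a minimal sketch using SciPy's hierarchical clustering routines (SciPy is assumed to be available, and the distance matrix below is only a hypothetical example for five objects):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix for five objects
D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
], dtype=float)

# linkage() expects a condensed distance vector; 'method' selects the linkage type
Z = linkage(squareform(D), method="single")      # or "complete", "average"
print(Z)                                         # each row: the two clusters merged and their distance
print(fcluster(Z, t=2, criterion="maxclust"))    # cut the dendrogram into two clusters
```

Changing `method` swaps the rule used to define the distance between a merged cluster and the remaining clusters, which is exactly the choice of linkage described above.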
Clustering using single linkage:
[Figure 12.4: Single linkage dendrogram for distances between five objects.]
[Figure 1 (Gawel, Oberholster & Francis): A dendrogram showing proximity in terminology as assessed by a combined panel of experienced tasters and winemakers. The asterisks show that this methodology reveals a number of logically consistent sub-groupings of terms.]
Reference: Gawel, R., Oberholster, A., & Francis, I. L. (2000). A 'Mouth-feel Wheel': terminology for communicating the mouth-feel characteristics of red wine.
Nonhierarchical clustering methods can usually start from either of two points:
an initial partition of items into groups.
an initial set of seed points, which will form the main nuclei of clusters.
K-Means Clustering
K-means is used to describe an algorithm that assigns each item to the cluster having the nearest centroid (mean). The process mainly comprises three steps (a minimal sketch follows the list):
First of all, partition the items into K initial clusters. [Or, specify K initial centroids (seed points).]
Now, proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. (Distance is usually computed using Euclidean distance, with either standardized or unstandardized observations.)
Then, recalculate the centroid for the cluster receiving the new item and also for the cluster losing the item.
The above two steps are repeated until no further reassignments take place.
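A minimal NumPy sketch of these steps, assuming the K initial centroids (seed points) are supplied; the function name and convergence check are my own, and plain Euclidean distance on unstandardized observations is used:

```python
import numpy as np

def k_means(X, seeds, max_iter=100):
    """Assign each item to the nearest centroid, then recompute centroids, until stable."""
    centroids = np.asarray(seeds, dtype=float)
    for _ in range(max_iter):
        # Assign each item to the cluster whose centroid (mean) is nearest (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate the centroid of each cluster (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):  # no centroid moved, so no reassignments remain
            break
        centroids = new_centroids
    return labels, centroids
```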
Clustering using K-means method:
We measured two variables $X_1$ and $X_2$ for each of four items A, B, C, and D.
The data are given in the following table. The objective is to divide these
items into K = 2 clusters such that the items within a cluster are closer to
one another than they are to the items in different clusters.
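Since the data table is not reproduced here, the values below are placeholders only; the sketch simply shows how such a four-item, K = 2 example could be run with scikit-learn (assumed to be installed):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (x1, x2) measurements for items A, B, C, D (not the slide's actual table)
X = np.array([[5.0, 3.0], [-1.0, 1.0], [1.0, -2.0], [-3.0, -2.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(dict(zip("ABCD", km.labels_)))   # cluster label for each item
print(km.cluster_centers_)             # centroid (mean) of each cluster
```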