
Statistical Inference and Multivariate Analysis

(MA324)
Lecture Slides
Lecture 35

Cluster Analysis

Indian Institute of Technology Guwahati

Jan-May 2025
Cluster Analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Clustering can therefore be formulated as a multi-objective optimization problem.

Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error.

The basic objective in cluster analysis is to discover natural groupings of the items (or variables).

Applications of Cluster Analysis...
Image analysis
Pattern recognition
Information Retrieval
Data compression
Bioinformatics
Computer graphics
Anomaly detection
Medical science
Natural language processing (NLP)
Crime analysis
Social science
Robotics
Finance
Petroleum geology
Food Industry
Similarity Measures: Understanding Proximity

In cluster analysis, we must first develop a quantitative scale on which to measure the association (similarity) between objects.

To understand the "closeness" or "similarity" among objects and clusters, two different methods can be used:

Distance Measure: Distances and Similarity Coefficients for Pairs of Items.

Association Measure: Similarities and Association Measures for Pairs of Variables.

Distance Measure
Here, using this method, we estimate the statistical distance between two items, say two p-dimensional observations $\mathbf{x}' = [x_1, \ldots, x_p]$ and $\mathbf{y}' = [y_1, \ldots, y_p]$. For this procedure, we may use various distance metrics, namely:

Mahalanobis distance (Statistical Distance) between two observations, given by

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})' A (\mathbf{x} - \mathbf{y})},$$

where $A = S^{-1}$, $S$ being the matrix of sample variances and covariances.

Minkowski metric, given by

$$d(\mathbf{x}, \mathbf{y}) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}$$

Canberra metric,

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}$$

Czekanowski coefficient,

$$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}$$

Association Measure

When the variables are binary, the data can again be arranged in the form of a
contingency table. In such a situation, it is better to get a measure of
association among the variables.

A contingency table for a pair of binary variables i and k:

                 Variable k
Variable i      1        0        Total
    1           a        b        a + b
    0           c        d        c + d
  Total       a + c    b + d      n = a + b + c + d

Product Moment Correlation

The usual product moment correlation formula applied to the binary variables in the contingency table is

$$r = \frac{ad - bc}{[(a + b)(c + d)(a + c)(b + d)]^{1/2}}$$

This product moment correlation can be taken as a measure of the similarity between the two variables.

The moment correlation coefficient is related to the chi-square statistic, $r^2 = \chi^2 / n$, for testing the independence of two categorical variables. Keeping $n$ fixed, a large similarity (or correlation) is consistent with the absence of independence.
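A small sketch verifying the identity $r^2 = \chi^2/n$ on a 2×2 table with hypothetical cell counts $a, b, c, d$:

```python
import math

def binary_r(a, b, c, d):
    """r = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def chi_square(a, b, c, d):
    """Pearson chi-square for a 2x2 table: n (ad - bc)^2 / (product of margins)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

a, b, c, d = 18, 4, 6, 12                  # hypothetical cell counts
n = a + b + c + d
print(binary_r(a, b, c, d) ** 2)           # equals chi^2 / n ...
print(chi_square(a, b, c, d) / n)          # ... as printed here
```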

Cluster Creation
Now, in cluster analysis, the main aim is to create the clusters using one of two major techniques, namely:

Hierarchical Clustering: proceeds by either a series of successive mergers or a series of successive divisions. The two types of hierarchical clustering are:

Agglomerative hierarchical methods.

Divisive hierarchical methods (Self Study).

Non-Hierarchical Clustering: designed to group items, rather than variables, into a collection of K clusters. The most common methodology is:

K-Means Method.

Agglomerative Hierarchical Methods

This clustering methodology starts with the individual objects. Initially, there are as many clusters as objects.

The most similar objects are first grouped, and these initial groups are merged according to their similarities.

As clustering progresses, similarity decreases, and all subgroups are eventually fused into a single cluster.

One of the most common families of agglomerative hierarchical methods is the linkage methods.

Algorithm for Agglomerative Hierarchical Clustering
When using an agglomerative methodology to cluster N objects, the following steps (algorithm) are usually followed:

Start with N clusters, each containing a single entity, and an $N \times N$ symmetric matrix of distances (or similarities) $D = \{d_{ik}\}$.

Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the most similar clusters $U$ and $V$ be $d_{UV}$.

Merge clusters $U$ and $V$ into a newly formed cluster $(UV)$, and update the entries in the distance matrix by:

deleting the rows and columns corresponding to clusters $U$ and $V$;

adding a row and column for the distances between the newly formed cluster $(UV)$ and the remaining clusters.

The above steps are repeated a total of $N - 1$ times, so that all objects end up in a single cluster. (A sketch of this loop follows below.)
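A plain-Python sketch of the $N - 1$ merge loop above, using single linkage on a small hypothetical distance matrix; the matrix values are illustrative only, not from the lecture.

```python
import numpy as np

# Hypothetical 4x4 symmetric distance matrix D = {d_ik} (illustrative).
D = np.array([[0., 9., 3., 6.],
              [9., 0., 7., 5.],
              [3., 7., 0., 4.],
              [6., 5., 4., 0.]])
clusters = [{0}, {1}, {2}, {3}]            # start: one cluster per object

while len(clusters) > 1:
    # Search for the nearest (most similar) pair of clusters U and V.
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(D[u, v] for u in clusters[i] for v in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d_uv, i, j = best
    # Merge U and V into (UV); recomputing distances from the merged index
    # sets plays the role of deleting/adding rows and columns of D.
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
    clusters.append(merged)
    print(f"merged at distance {d_uv}: {sorted(merged)}")
```

On these trial values the merges occur at distances 3, 4, and 5, which would be the join heights in the corresponding dendrogram.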


Linkage Method

In the above-mentioned algorithm, different definitions of the inter-cluster distance $d_{(UV)W}$ give rise to different types of linkages, and hence to different clustering methodologies. The three commonly used linkages are (a sketch of the update rules follows below):

Single Linkage: $d_{(UV)W} = \min\{d_{UW}, d_{VW}\}$

Complete Linkage: $d_{(UV)W} = \max\{d_{UW}, d_{VW}\}$

Average Linkage: $d_{(UV)W} = \dfrac{\sum_i \sum_k d_{ik}}{N_{(UV)} N_W}$, where $N_{(UV)}$ and $N_W$ are the number of items in clusters $(UV)$ and $W$, and $d_{ik}$ is the distance between the $i$th object of cluster $(UV)$ and the $k$th object of cluster $W$.
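For quick reference, a minimal sketch of the three rules written as update functions of the old distances $d_{UW}$ and $d_{VW}$; the average-linkage version uses the standard size-weighted form, which is equivalent to averaging all item-pair distances between $(UV)$ and $W$.

```python
def single_linkage(d_uw, d_vw):
    return min(d_uw, d_vw)

def complete_linkage(d_uw, d_vw):
    return max(d_uw, d_vw)

def average_linkage(d_uw, d_vw, n_u, n_v):
    # Size-weighted mean of the old distances; equivalent to averaging
    # all item-pair distances between the merged cluster (UV) and W.
    return (n_u * d_uw + n_v * d_vw) / (n_u + n_v)

print(single_linkage(2.0, 5.0),                  # 2.0
      complete_linkage(2.0, 5.0),                # 5.0
      average_linkage(2.0, 5.0, n_u=1, n_v=3))   # 4.25
```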

Clustering using single linkage:

Reference: Applied Multivariate Statistical Analysis by Johnson and Wichern.


Dendrogram
The result of the previous single linkage clustering can be graphically
observed using a dendrogram or a tree diagram. In hierarchical clustering,
the dendrogram illustrates the arrangement of the clusters produced by
the corresponding cluster analyses.
[Figure 12.4 (Johnson and Wichern): Single linkage dendrogram for distances between five objects.]

In typical applications of hierarchical clustering, the intermediate results, where the objects are sorted into a moderate number of clusters, are of chief interest.

Reference: Applied Multivariate Statistical Analysis by Johnson and Wichern.
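A minimal sketch of producing such a dendrogram with SciPy's hierarchy tools; the five two-dimensional points below are hypothetical stand-ins for the five objects of the figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five hypothetical objects with two measured features each.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))

Z = linkage(X, method="single")            # single linkage merge history
dendrogram(Z, labels=[str(i) for i in range(1, 6)])
plt.ylabel("merge distance")
plt.title("Single linkage dendrogram (illustrative)")
plt.show()
```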
[Figure 1 (Gawel, Oberholster & Francis): A dendrogram showing proximity in red wine mouth-feel terminology, as assessed by a combined panel of experienced wine-tasters and wine-makers. The asterisks show that this methodology reveals a number of logically consistent sub-groupings of terms; the leaves of the dendrogram are mouth-feel terms such as fine emery, furry, chamois, suede, velvet, satin, silk, clay, talc, plaster, chalky, powdery, grainy, dusty, sawdust, dry, parching, numbing, puckery, adhesive, grippy, chewy, abrasive, aggressive, hard, soft, supple, fleshy, rich, mouthcoat, resinous, sappy, and green.]

Reference: Gawel, R., Oberholster, A., & Francis, I. L. (2000). A 'Mouth-feel Wheel': terminology for communicating the mouth-feel characteristics of red wine. Australian Journal of Grape and Wine Research, 6(3), 203-207.


Non-Hierarchical Clustering Methods

These techniques are commonly designed to group items, rather than variables, into a collection of K clusters.

The number of clusters, K, may either be specified in advance or determined as part of the clustering procedure.

Nonhierarchical clustering methods can usually start from either of two points:

an initial partition of items into groups;

an initial set of seed points, which will form the main nuclei of clusters.

One unbiased way to start the clustering procedure is to randomly select seed points from among the items, or to randomly partition the items into initial groups.

K-Means Clustering
K-means describes an algorithm that assigns each item to the cluster having the nearest centroid (mean). The process mainly comprises three steps (a from-scratch sketch follows below):

Partition the items into K initial clusters. [Or, specify K initial centroids (seed points).]

Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. (Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.)

Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.

The above two steps are repeated until no further reassignments take place.
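Below is a from-scratch NumPy sketch of these three steps, using Euclidean distance on unstandardized observations; it assumes no cluster goes empty during the iterations, a common simplification in such sketches.

```python
import numpy as np

def k_means(X, K, seed=0, max_iter=100):
    """Basic K-means: random seed points, then nearest-centroid assignment."""
    rng = np.random.default_rng(seed)
    # Step 1: specify K initial centroids (seed points) chosen at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster with the nearest centroid
        # (Euclidean distance, unstandardized observations).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # no reassignments: converged
        labels = new_labels
        # Step 3: recalculate the centroid of every cluster (assumes none
        # of the K clusters has gone empty).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return labels, centroids
```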
Clustering using K-means method:
We measured two variables X1 and X2 for each of four items A, B, C, and D.
The data are given in the following table. The objective is to divide these
items into K = 2 clusters such that the items within a cluster are closer to
one another than they are to the items in different clusters.
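Applying the k_means sketch above to four two-dimensional items: since the data table is not reproduced here, the values below are illustrative placeholders rather than the textbook's numbers.

```python
import numpy as np

# Four items A, B, C, D measured on X1 and X2 (placeholder values).
X = np.array([[ 5.0,  3.0],   # A
              [-1.0,  1.0],   # B
              [ 1.0, -2.0],   # C
              [-3.0, -2.0]])  # D

labels, centroids = k_means(X, K=2, seed=42)   # k_means defined above
for item, lab in zip("ABCD", labels):
    print(item, "-> cluster", lab)
print("centroids:\n", centroids)
```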

Reference: Applied Multivariate Statistical Analysis by Johnson and Wichern.

