Clustering - Introduction & Evaluation Metrics

Clustering is an unsupervised machine learning technique that groups similar data points into clusters, aiming for high intra-cluster similarity and low inter-cluster similarity. Various evaluation metrics, such as Silhouette Coefficient, Dunn's Index, and Rand Index, are used to assess the quality of clusters, with applications in market segmentation, biology, and social network analysis. Clustering algorithms can be categorized into five types: partitioning-based, hierarchical-based, density-based, grid-based, and model-based methods.

Uploaded by

aaditmahajan14


Clustering

(Introduction, Evaluation Metrics)

CSED, TIET
Clustering-Introduction
▪ Cluster analysis, or clustering, is an unsupervised machine learning task. It involves
automatically discovering natural groupings in data.

▪ Clustering is the task of dividing a population of data points into groups such that
points in the same group are more similar to one another (high intra-cluster similarity)
than to points in other groups (low inter-cluster similarity).

▪ In simple words, the aim is to segregate data points with similar traits and assign them
to clusters.
Applications of Clustering
▪ Clustering algorithms are widely used in a number of applications such as:
➢ Market Segmentation / Targeted Marketing / Recommender Systems
➢ Document / News / Article Clustering
➢ Biology / Genome Clustering
➢ City Planning
➢ Speech Recognition
➢ Social Network Analysis
➢ Organize Computing Clusters
➢ Astronomical Data Analysis
Evaluation Metrics
To evaluate the quality of the clusters produced by a clustering algorithm, the
following evaluation metrics are used:
1. Silhouette Coefficient
2. Dunn’s Index
3. Rand Index (RI)
4. Adjusted Rand Index (ARI)
5. Purity
Metrics 1 and 2 are used when we don't have any ground truth (unsupervised; only data
points), whereas metrics 3, 4, and 5 are used when we have ground truth (supervised; data
points and labels).
Silhouette Coefficient
▪ The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: The mean distance between the sample and all other points in the same cluster.
b: The mean distance between the sample and all points in the next nearest cluster.

$$ s = \frac{b - a}{\max(a, b)} $$

▪ The Silhouette Coefficient for a set of samples is the mean of the Silhouette
Coefficients of the individual samples.
▪ The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering.
▪ Scores around zero indicate overlapping clusters.
▪ The score is higher when clusters are dense and well separated, which matches the standard
concept of a cluster.
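The definition above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a reference implementation: the function name `silhouette_scores`, the toy points, and the choice of Euclidean distance are assumptions, and the sketch assumes at least two clusters and that no sample coincides with every point it is compared against.

```python
import math

def silhouette_scores(points, labels):
    """Return the per-sample silhouette values s = (b - a) / max(a, b)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumed convention: a sample alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Toy data (assumed): two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
mean_silhouette = sum(scores) / len(scores)  # roughly 0.80 for this toy data
```

Because the two toy clusters are compact and far apart, every sample scores close to +1, in line with the interpretation given above.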
Dunn Index
▪ The Dunn index is another internal clustering validation measure, which can be computed as
follows:
1. For each cluster, compute the distance between each of the objects in the cluster and the objects
in the other clusters.
2. Use the minimum of these pairwise distances as the inter-cluster separation (min. separation).
3. For each cluster, compute the distances between the objects within that cluster.
4. Use the maximum intra-cluster distance (i.e., the maximum diameter) as the intra-cluster compactness.
5. Compute the Dunn index (D) as follows:
$$ D = \frac{\text{min. separation}}{\text{maximum diameter}} $$
If the data set contains compact and well-separated clusters, the diameter of the clusters is expected to
be small and the distance between the clusters is expected to be large. Thus, the Dunn index should be
maximized. The value of the Dunn index lies between 0 and infinity.
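The five steps can be sketched as follows. This is only an illustrative sketch: the function name `dunn_index`, the toy data, and the use of Euclidean distance are assumptions, and it assumes at least two clusters, at least one of which has two or more distinct points.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    """Dunn index: minimum inter-cluster distance / maximum cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())

    # Steps 1-2: smallest distance between two points in different clusters.
    min_separation = min(
        math.dist(p, q)
        for g1, g2 in combinations(groups, 2)
        for p in g1
        for q in g2
    )
    # Steps 3-4: largest distance between two points in the same cluster.
    max_diameter = max(
        math.dist(p, q)
        for g in groups
        for p, q in combinations(g, 2)
    )
    # Step 5: the ratio of the two.
    return min_separation / max_diameter

# Toy data (assumed): two pairs of points, 5 units apart, each pair 1 unit wide.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
dunn = dunn_index(points, labels)  # 5.0 / 1.0
```

Moving the two pairs further apart, or making each pair tighter, increases the value, which is why a larger Dunn index indicates a better clustering.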
Rand Index
▪ The Rand Index is a measure of how similar the clustering results (groupings) are to the ground truth.
▪ Let C denote the ground truth class labeling and K the clustering assignment, and let
• A be the number of element pairs that lie in the same set in both C and K,
• B be the number of element pairs that lie in different sets in both C and K.
Then RI is given by:
$$ RI = \frac{A + B}{\binom{n}{2}} $$

where n is the total number of samples, so that $\binom{n}{2} = n(n-1)/2$ is the number of unordered pairs.

▪ RI can never exceed 1 and its lowest possible value is 0. The closer the score is to 1, the better
the algorithm.
Rand Index- Example
▪ Say we have five examples. The clustering method groups examples A, B, and C into one group
and examples D and E into another group. But according to the ground truth, A and B
are together and C, D, and E are together.
▪ To compute RI for this example, let's first list all possible unordered pairs of the five examples at
hand. We have 10 (n*(n-1)/2) such pairs. These are: {A, B}, {A, C}, {A, D}, {A, E}, {B, C},
{B, D}, {B, E}, {C, D}, {C, E}, and {D, E}.
▪ Examining these pairs, we notice that the pairs {A, B} and {D, E} are always grouped together
(both by the clustering algorithm and by the ground truth). Thus, the value of A is two.
▪ We also notice that four pairs, {A, D}, {A, E}, {B, D}, and {B, E}, never occur together. Thus,
the value of B is four.
$$ RI = \frac{2 + 4}{10} = 0.6 $$
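The pair counting in this example can be checked with a short script. It is a sketch under stated assumptions: the function name `rand_index` and the integer encoding of the groups are illustrative choices, not part of the slides.

```python
from itertools import combinations

def rand_index(truth, predicted):
    """Return (A, B, RI) by counting agreeing pairs over all unordered pairs."""
    same_both = 0   # A: pair together in the ground truth and in the clustering
    diff_both = 0   # B: pair separated in the ground truth and in the clustering
    pairs = list(combinations(range(len(truth)), 2))
    for i, j in pairs:
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_truth and same_pred:
            same_both += 1
        elif not same_truth and not same_pred:
            diff_both += 1
    return same_both, diff_both, (same_both + diff_both) / len(pairs)

# Examples in the order A, B, C, D, E.
truth = [0, 0, 1, 1, 1]       # ground truth: {A, B} and {C, D, E}
predicted = [0, 0, 0, 1, 1]   # clustering:   {A, B, C} and {D, E}
a, b, ri = rand_index(truth, predicted)  # a = 2, b = 4, ri = 0.6
```

Only the grouping matters, not the label values themselves, so renaming the cluster ids leaves the result unchanged.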
Adjusted Rand Index (ARI)
▪ RI suffers from one drawback: it yields a high value even for pairs of random partitions of a given set of
examples.
▪ To counter this drawback, the calculation is adjusted to account for groupings that agree by
chance.
▪ For this, we create a contingency table in which the rows denote the clusters made by the clustering
algorithm and the columns denote the clusters given by the ground truth. For example, if the clustering
method and the ground truth each have 3 clusters, the contingency table has the form shown below.

      c1    c2    c3
C1    n11   n12   n13
C2    n21   n22   n23
C3    n31   n32   n33

▪ The (i, j)th entry, nij, is the number of common objects belonging to clustering algorithm cluster Ci and
ground truth cluster cj.
Adjusted Rand Index (ARI)- Contd….
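The formula from this slide did not survive in the text. For reference, the standard pair-counting form of the ARI is sketched below; the symbols are assumptions introduced here, with n_ij the (i, j)th contingency table entry, a_i the total of row i, b_j the total of column j, and n the number of samples.

```latex
ARI = \frac{\sum_{ij} \binom{n_{ij}}{2}
            - \dfrac{\sum_{i} \binom{a_i}{2} \, \sum_{j} \binom{b_j}{2}}{\binom{n}{2}}}
           {\dfrac{1}{2}\left[\sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2}\right]
            - \dfrac{\sum_{i} \binom{a_i}{2} \, \sum_{j} \binom{b_j}{2}}{\binom{n}{2}}}
```

The subtracted term is the value of the index expected by chance, so a random grouping scores close to 0 and a perfect match scores 1.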
ARI-Example
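Consider the same example as discussed for the Rand Index: the clustering method groups {A, B, C} and
{D, E}, while the ground truth groups {A, B} and {C, D, E}.
The contingency matrix for the example (rows: clusters from the algorithm, columns: ground truth clusters) is given by:

             {A, B}   {C, D, E}
{A, B, C}      2         1
{D, E}         0         2

Counting pairs, the cells give 1 + 0 + 0 + 1 = 2, the row totals (3 and 2) give 3 + 1 = 4, the column
totals (2 and 3) give 1 + 3 = 4, and there are 10 pairs in total, so:

$$ ARI = \frac{2 - \frac{4 \times 4}{10}}{\frac{4 + 4}{2} - \frac{4 \times 4}{10}} = \frac{2 - 1.6}{4 - 1.6} \approx 0.1667 $$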
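The same arithmetic can be reproduced from a contingency table with a few lines of Python. This is a sketch assuming the standard pair-counting form of the ARI; the function name `adjusted_rand_index` is illustrative, and the table is the one implied by the example (rows: algorithm clusters {A, B, C} and {D, E}; columns: ground truth clusters {A, B} and {C, D, E}).

```python
from math import comb

def adjusted_rand_index(table):
    """ARI from a contingency table (rows: algorithm clusters, columns: truth)."""
    n = sum(sum(row) for row in table)
    sum_cells = sum(comb(v, 2) for row in table for v in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2           # largest attainable agreement
    # Note: the denominator is zero in degenerate cases (e.g. a single cluster
    # on both sides); this sketch does not handle them.
    return (sum_cells - expected) / (maximum - expected)

# Contingency table for the five-example case (assumed row/column order above).
table = [
    [2, 1],  # {A, B, C}: 2 objects shared with {A, B}, 1 with {C, D, E}
    [0, 2],  # {D, E}:    0 objects shared with {A, B}, 2 with {C, D, E}
]
ari = adjusted_rand_index(table)  # (2 - 1.6) / (4 - 1.6), about 0.1667
```

The value is much lower than the RI of 0.6 for the same example, because the chance-level agreement of 1.6 pairs has been subtracted out.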
Purity
▪ Purity is also an external evaluation criterion of cluster quality.
▪ It is the fraction of the total number of objects (data points) that were classified correctly, where
each cluster is credited with the ground truth cluster it overlaps most.
▪ It also lies in the range 0 to 1. The higher the purity, the better the model.

$$ \text{Purity} = \frac{1}{N} \sum_{i=1}^{k} \max_{j} |c_i \cap t_j| $$

where N = number of objects (data points), k = number of clusters, c_i is a cluster in the clustering C
produced by the algorithm, and t_j is a cluster in the ground truth.
Purity-Example
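▪ For the example discussed for ARI, the contingency table (rows: clusters from the algorithm,
columns: ground truth clusters) is as below:

             {A, B}   {C, D, E}
{A, B, C}      2         1
{D, E}         0         2

$$ \text{Purity} = \frac{\text{sum of the maximum of each row in the contingency matrix}}{\text{total samples}} = \frac{2 + 2}{5} = 0.8 $$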
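As a quick check, purity can be computed directly from a contingency table. The function name `purity` is an illustrative choice, and the table is the one implied by the running example (rows: algorithm clusters {A, B, C} and {D, E}; columns: ground truth clusters {A, B} and {C, D, E}).

```python
def purity(table):
    """Purity: sum of each row's largest entry divided by the number of samples."""
    total = sum(sum(row) for row in table)
    return sum(max(row) for row in table) / total

# Rows are clusters from the algorithm, columns are ground truth clusters.
table = [
    [2, 1],  # {A, B, C}
    [0, 2],  # {D, E}
]
score = purity(table)  # (2 + 2) / 5 = 0.8
```

Note that purity rewards many small clusters: putting every object in its own cluster gives a purity of 1, so it is best compared across clusterings with the same number of clusters.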
Types of Clustering Algorithms
▪ The clustering algorithms are broadly classified into five categories:
1. Partitioning-based Methods
2. Hierarchical-based Methods
3. Density-based Methods
4. Grid-based Methods
5. Model-based Methods
Partitioning-based Methods
▪ These methods partition the objects into k clusters, with each partition forming one cluster.
▪ The partitioning is chosen to optimize an objective criterion, such as a similarity function.
▪ The quality of the clustering is measured by this objective function, which
is designed to achieve high intra-cluster similarity and low inter-cluster similarity.
▪ Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized
Search), etc.
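The slides name K-means without describing it. As a hedged illustration of a partitioning method, here is a minimal sketch of the usual K-means iteration (Lloyd's algorithm), which alternates between assigning points to the nearest centroid and recomputing centroids. The function name `k_means`, the fixed initial centroids, and the iteration cap are assumptions made for the example.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate nearest-centroid assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Toy data and starting centroids (both assumed for illustration).
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

In practice the starting centroids are usually chosen at random, so different runs can converge to different partitions.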
Hierarchical-based Methods
▪ These methods perform a hierarchical decomposition of a given dataset and can be
classified as agglomerative or divisive.
▪ In agglomerative methods, each object is initially regarded as a cluster on its own, and
the clusters are then successively merged until a termination condition is satisfied.
▪ By contrast, in the divisive approach, the whole set of objects is initially considered a single
large cluster, which is successively split into smaller clusters until a termination
condition is satisfied.
▪ The former is also called the bottom-up approach, whereas the latter is called the top-down
approach.
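The bottom-up procedure can be sketched as follows. This is only one possible instance: measuring the distance between two clusters by their closest pair of points (single linkage), using a target number of clusters as the termination condition, and the function name `agglomerative` are all assumptions made for the example.

```python
import math
from itertools import combinations

def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    while len(clusters) > num_clusters:
        # Single linkage (assumed): cluster distance = closest pair of members.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                math.dist(p, q)
                for p in clusters[pair[0]]
                for q in clusters[pair[1]]
            ),
        )
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]                  # safe because i < j
    return clusters

# Toy data (assumed): two close pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, num_clusters=3)
```

A divisive method would run the other way, starting from one cluster holding all five points and splitting it repeatedly.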
Density-based Methods
▪ Density-based methods discover clusters based on density.
▪ These methods can find clusters of arbitrary shape.
▪ Here, a cluster keeps growing as long as the number of data objects in the
neighborhood exceeds some threshold value.
▪ These methods offer good accuracy and the ability to merge two clusters.
▪ Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS
(Ordering Points To Identify the Clustering Structure), etc.
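The "keep growing while the neighborhood is dense enough" idea can be illustrated with a simplified DBSCAN-style sketch. The parameter names `eps` (neighborhood radius) and `min_pts` (density threshold, counting the point itself), the toy data, and the brute-force neighbor search are assumptions made for readability, not a tuned implementation.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not dense enough to start a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)   # j is dense too: the cluster keeps growing
    return labels

# Toy data (assumed): a square of four points, a triangle, and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Unlike the partitioning sketch, the number of clusters is not specified in advance, and the isolated point is reported as noise rather than forced into a cluster.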
Clustering Algorithms
