UNIT 3-Clustering Metrics

Clustering metrics are quantitative measures to evaluate clustering algorithms, categorized into internal metrics (like Silhouette Score and Davies-Bouldin Index) and external metrics (like Adjusted Rand Index and Normalized Mutual Information). The Silhouette Score assesses cluster quality based on cohesion and separation, while the Davies-Bouldin Index evaluates cluster compactness and separation. The Dunn Index, ARI, and NMI further provide insights into clustering effectiveness, with higher values indicating better clustering performance.

Uploaded by

Priyanshu Gupta

Clustering metrics
• Clustering metrics are quantitative measures used to evaluate the quality of
clustering algorithms and the resulting clusters.
• Internal Evaluation Metrics: These metrics evaluate the quality of clusters without
any external reference. They measure the compactness of data points within the
same cluster and the separation between different clusters.
• Such as: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index (Variance Ratio Criterion),
Dunn Index
• External Evaluation Metrics: These metrics require a ground truth or a reference set
of labels to compare the clustering results against. They measure the agreement
between the generated clusters and the true classes in the reference set.
• Such as: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Fowlkes-Mallows Index
Silhouette Score
• Silhouette Coefficient or silhouette score is a metric used to
calculate the goodness of a clustering technique. This score is
calculated by measuring each data point’s similarity to the cluster
it belongs to and how different it is from other clusters. The
Silhouette score is commonly used to assess the performance of
clustering algorithms like K-Means.
• Its value ranges from -1 to 1.
• 1: Means clusters are well apart from each other and clearly
distinguished (Best Case)
• 0: Means clusters are indifferent, or we can say that the distance
between clusters is not significant (Overlapping Clusters)
• -1: Means clusters are assigned in the wrong way (Worst Case)
It is computed as:
• The silhouette value is a measure of how similar an
object is to its own cluster (cohesion) compared to other
clusters (separation).
• Silhouette Score = (b-a)/max(a,b)
where
• a= average intra-cluster distance, i.e. the average
distance from the point to the other points in its own cluster
• b= average nearest-cluster distance, i.e. the average
distance from the point to the points of the closest other cluster
Example:
Datapoints Cluster Label
A1 C1
A2 C1
A3 C2
A4 C2

Datapoint A1 A2 A3 A4
A1 0 0.10 0.65 0.55
A2 0.10 0 0.70 0.60
A3 0.65 0.70 0 0.30
A4 0.55 0.60 0.30 0
For point A1:
a= 0.1/1=0.1
b= (0.65+0.55)/2=0.6
Silhouette Score = (b-a)/max(a,b)
(0.6-0.1)/ 0.6= 0.833

For point A2:
a= 0.1/1=0.1
b= (0.70+0.60)/2=0.65
Silhouette Score = (b-a)/max(a,b)
(0.65-0.1)/ 0.65= 0.846

For point A3:
a= 0.30/1=0.30
b= (0.65+0.70)/2=0.675
Silhouette Score = (b-a)/max(a,b)
(0.675-0.30)/ 0.675= 0.556

For point A4:
a= 0.30/1=0.30
b= (0.55+0.60)/2=0.575
Silhouette Score = (b-a)/max(a,b)
(0.575-0.30)/ 0.575= 0.478

Points A1 and A2 lie in cluster C1, so the Silhouette Score for cluster C1 is
(0.833+0.846)/2= 1.679/2= 0.8395

Points A3 and A4 lie in cluster C2, so the Silhouette Score for cluster C2 is
(0.556+0.478)/2= 1.034/2= 0.517

The Silhouette Score/Coefficient for the overall clustering problem is:
(0.8395+0.517)/2= 0.678
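The hand calculation above can be cross-checked with a few lines of plain Python. This is only a minimal sketch: `silhouette_samples` is a helper written for this illustration (it is not the scikit-learn function of the same name), it works directly on the distance table from the example, and it assumes there are at least two clusters.

```python
# Pairwise distances from the example (rows/columns: A1, A2, A3, A4).
dist = [
    [0.00, 0.10, 0.65, 0.55],
    [0.10, 0.00, 0.70, 0.60],
    [0.65, 0.70, 0.00, 0.30],
    [0.55, 0.60, 0.30, 0.00],
]
labels = ["C1", "C1", "C2", "C2"]


def silhouette_samples(dist, labels):
    """Return the silhouette value (b - a) / max(a, b) of every point."""
    n = len(labels)
    scores = []
    for i in range(n):
        # a: mean distance to the other members of the point's own cluster.
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)  # assumed convention for a single-point cluster
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) - {labels[i]}
        )
        scores.append((b - a) / max(a, b))
    return scores


per_point = silhouette_samples(dist, labels)
overall = sum(per_point) / len(per_point)
print([round(s, 3) for s in per_point])
print(round(overall, 3))
```

Here the overall score is the mean over all points. It matches the average of the two cluster scores only because both clusters have the same size.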
Davies-Bouldin Index

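• The Davies-Bouldin index (DBI) is a metric for assessing
the separation and compactness of clusters.
• It is based on the idea that good clusters are those that
have low within-cluster variation and high between-cluster separation.
• The minimum score is zero; the lower the DB index value, the better the clustering.
• For k clusters it is computed as:
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{M_{ij}}$$
Where,
• S_i = average distance of the points in cluster i to the centroid of cluster i
• M_ij = distance between the centroids of clusters i and j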
Example
• Data Point 1: (2, 3)
• Data Point 2: (3, 2)
• Data Point 3: (8, 8)
• Data Point 4: (9, 7)
• Data Point 5: (15, 14)
• Data Point 6: (16, 13)
• Let's say we want to perform K-means clustering on this dataset with
K = 2. After clustering, the data points are divided into two clusters:

• Cluster 1: {Data Point 1, Data Point 2} Cluster 2: {Data Point 3, Data


Point 4, Data Point 5, Data Point 6}

• The centroids of these clusters are approximately:

• Centroid of Cluster 1: (2.5, 2.5) Centroid of Cluster 2: (12, 10.5)

• Now, let's calculate the pairwise distances between the centroids of
each cluster:
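The calculation this example sets up can be sketched in plain Python. The sketch assumes the usual Davies-Bouldin definition (scatter of a cluster = mean distance of its points to the centroid, separation = distance between centroids); the function names `centroid` and `davies_bouldin` are chosen for this illustration only.

```python
import math

# Clusters from the example above.
clusters = {
    "Cluster 1": [(2, 3), (3, 2)],
    "Cluster 2": [(8, 8), (9, 7), (15, 14), (16, 13)],
}


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def davies_bouldin(clusters):
    cents = {name: centroid(pts) for name, pts in clusters.items()}
    # Scatter S_i: mean distance from the points of cluster i to its centroid.
    scatter = {
        name: sum(math.dist(p, cents[name]) for p in pts) / len(pts)
        for name, pts in clusters.items()
    }
    worst = []
    for i in clusters:
        # For cluster i keep the largest (S_i + S_j) / M_ij over all j != i.
        worst.append(max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in clusters if j != i
        ))
    return sum(worst) / len(worst)


centroids = {name: centroid(pts) for name, pts in clusters.items()}
dbi = davies_bouldin(clusters)
print(centroids)
print(round(dbi, 4))
```

With only two clusters there is a single centroid distance, so both clusters contribute the same ratio and the index equals that ratio.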
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Six 2-D points and the cluster label (0 or 1) assigned to each of them
data1 = np.array([[6, 8], [9, 5], [5, 4], [2, 6], [5, 6], [3, 4]])
labels1 = np.array([0, 0, 1, 1, 1, 0])

# Lower values indicate more compact, better separated clusters
davies_bouldin_index = davies_bouldin_score(data1, labels1)
print("Davies-Bouldin Index:", davies_bouldin_index)
Dunn index:
• The Dunn index is an internal evaluation metric for clustering algorithms:
the result is based on the clustered data itself.
• Like all other such indices, the aim of the Dunn index is to identify sets of clusters
that are compact, with a small variance between members of the cluster, and well
separated, where the means of different clusters are sufficiently far apart
compared to the within-cluster variance.
• Higher the Dunn index value, better is the clustering.
• The number of clusters that maximizes the Dunn index is taken as the optimal number
of clusters k.
$$D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{m} \operatorname{diam}(C_m)}$$
Where,
• d(C_i, C_j) = inter-cluster distance, the minimum distance between any two points taken from clusters i and j
• diam(C_m) = intra-cluster distance, the maximum distance between any two points within cluster m
• A higher Dunn Index value indicates better clustering, as it implies
that the clusters are compact and well-separated. Let's go through a
numerical example to illustrate the concept.
• Consider a dataset with 8 data points in a 2D space:

• Data Point 1: (2, 3) Data Point 2: (3, 2) Data Point 3: (4, 3) Data Point
4: (8, 8) Data Point 5: (9, 7) Data Point 6: (10, 8) Data Point 7: (15, 14)
Data Point 8: (16, 13)
• Let's say we want to perform K-means clustering on this dataset with
K = 2. After clustering, the data points are divided into two clusters:
• Cluster 1: {Data Point 1, Data Point 2, Data Point 3} Cluster 2: {Data
Point 4, Data Point 5, Data Point 6, Data Point 7, Data Point 8}
• Now, let's calculate the inter-cluster distances (minimum distance
between any two points from different clusters) and intra-cluster
distances (maximum distance between any two points within the
same cluster):
Example
• Let's work through a numerical example to calculate the Dunn Index for a simple set of clusters.
Suppose you have the following data points and their corresponding clusters:
Data Points:
• A(1, 2)
• B(2, 2)
• C(5, 8)
• D(6, 8)
• E(10, 12)
• F(11, 12)
Clusters:
• Cluster 1: {A, B}
• Cluster 2: {C, D}
• Cluster 3: {E, F}
• Distance between Cluster 1 and Cluster 2 (d(1, 2)):
• d(A, C) = sqrt((1-5)^2 + (2-8)^2) = sqrt(52) ≈ 7.21
• d(A, D) = sqrt((1-6)^2 + (2-8)^2) = sqrt(61) ≈ 7.81
• d(B, C) = sqrt((2-5)^2 + (2-8)^2) = sqrt(45) ≈ 6.71
• d(B, D) = sqrt((2-6)^2 + (2-8)^2) = sqrt(52) ≈ 7.21
• So, d(1, 2) = 6.71
• Distance between Cluster 1 and Cluster 3 (d(1, 3)):
• d(A, E) = sqrt((1-10)^2 + (2-12)^2) = sqrt(181) ≈ 13.45
• d(A, F) = sqrt((1-11)^2 + (2-12)^2) = sqrt(200) ≈ 14.14
• d(B, E) = sqrt((2-10)^2 + (2-12)^2) = sqrt(164) ≈ 12.81
• d(B, F) = sqrt((2-11)^2 + (2-12)^2) = sqrt(181) ≈ 13.45
• So, d(1, 3) = 12.81
• Distance between Cluster 2 and Cluster 3 (d(2, 3)):
• d(C, E) = sqrt((5-10)^2 + (8-12)^2) = sqrt(41) ≈ 6.40
• d(C, F) = sqrt((5-11)^2 + (8-12)^2) = sqrt(52) ≈ 7.21
• d(D, E) = sqrt((6-10)^2 + (8-12)^2) = sqrt(32) ≈ 5.66
• d(D, F) = sqrt((6-11)^2 + (8-12)^2) = sqrt(41) ≈ 6.40
• So, d(2, 3) = 5.66
• Now, let's calculate the minimum inter-cluster distance and the
maximum intra-cluster distance:
• Minimum Inter-cluster Distance: min(d(1, 2), d(1, 3), d(2, 3)) =
min(6.71, 12.81, 5.66) = 5.66
• Maximum Intra-cluster Distance: max(d(A, B), d(C, D), d(E, F)) =
max(1, 1, 1) = 1
• Finally, we can calculate the Dunn Index:
• Dunn Index = min(d(1, 2), d(1, 3), d(2, 3)) / max(d(A, B), d(C, D), d(E, F))
= 5.66 / 1 ≈ 5.66
• So, the Dunn Index for these clusters is approximately 5.66.
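Both Dunn index examples above (the 8-point data set with two clusters and the points A to F with three clusters) can be recomputed straight from the coordinates. The following is a minimal sketch, assuming Euclidean distance and the definitions given in the text (closest pair across clusters divided by the farthest pair inside a cluster); `dunn_index` is a name chosen for this illustration.

```python
import math
from itertools import combinations, product


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    # Inter-cluster distance: closest pair of points taken from two different clusters.
    min_inter = min(
        math.dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p, q in product(c1, c2)
    )
    # Intra-cluster distance: farthest pair of points inside a single cluster.
    max_intra = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return min_inter / max_intra


# 8-point example, K = 2.
two_clusters = [
    [(2, 3), (3, 2), (4, 3)],
    [(8, 8), (9, 7), (10, 8), (15, 14), (16, 13)],
]
# Points A..F in three clusters.
three_clusters = [
    [(1, 2), (2, 2)],
    [(5, 8), (6, 8)],
    [(10, 12), (11, 12)],
]

dunn_two = dunn_index(two_clusters)
dunn_three = dunn_index(three_clusters)
print(round(dunn_two, 4), round(dunn_three, 4))
```

The nested loops visit every pair of points, which is where the computational cost of the Dunn index comes from on larger data sets.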
Data Point Coordinates Cluster
1 (2, 3) A
2 (2, 5) A
3 (3, 8) B
4 (6, 5) B
5 (8, 8) C
6 (9, 6) C
7 (10, 2) A
8 (12, 4) A
9 (15, 7) C
10 (17, 5) C
Drawbacks of Dunn index:
• As the number of clusters and dimensionality of the
data increase, the computational cost also increases.
Adjusted Rand Index (ARI)
• It measures the similarity between the true class labels and the
clusters generated by a clustering algorithm while accounting for
chance agreement.
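• The ARI is at most 1 (identical clusterings) and is close to 0 for
random labelings; it can be negative when the agreement is worse than
chance. Higher values indicate better agreement between the predicted
clusters and the true labels.
• Contingency table: m x n
• m= number of clusters produced by the clustering algorithm (clustering c1)
• n= number of clusters in the reference clustering c2 (Ground Truth)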
To use the Adjusted Rand Index for evaluating clustering, follow these steps:

Perform Clustering: Apply a clustering algorithm to your dataset to create clusters of data points. This
could be an algorithm like k-means, hierarchical clustering, DBSCAN, or any other clustering technique.

Obtain True Class Labels: If you have access to ground truth labels or class assignments for your data,
this is ideal. These true labels represent the actual groups or categories that your data points belong to.
Example
• Suppose you have two clustering results for a dataset with 100 data points:
• Create a contingency table that shows the number of data points in common
between the two clusterings. Rows represent clusters in Clustering A, and
columns represent clusters in Clustering B. The table might look like this:
               | B: Cluster 1 | B: Cluster 2 |
------------------------------------------------
A: Cluster 1   |      30      |      20      |
------------------------------------------------
A: Cluster 2   |      40      |      10      |
------------------------------------------------
• The ARI is based on counting pairs of data points. For every cell, row total and
column total of the contingency table, count the pairs it contains with
(n choose 2) = n(n-1)/2.
• Cells: (30 choose 2) + (20 choose 2) + (40 choose 2) + (10 choose 2) = 435 + 190 + 780 + 45 = 1450
• Row totals (50, 50): (50 choose 2) + (50 choose 2) = 1225 + 1225 = 2450
• Column totals (70, 30): (70 choose 2) + (30 choose 2) = 2415 + 435 = 2850
• Total number of pairs: (100 choose 2) = 4950, i.e. there are 4,950 ways to choose 2 items from a set of 100.
• Now calculate ARI = (Index - Expected_Index) / (Max_Index - Expected_Index)
• Index = 1450 (pairs placed together by both clusterings)
• Expected_Index = (2450 * 2850) / 4950 ≈ 1410.61
• Max_Index = (2450 + 2850) / 2 = 2650
• ARI = (1450 - 1410.61) / (2650 - 1410.61) = 39.39 / 1239.39 ≈ 0.032
• So, the Adjusted Rand Index (ARI) for the given clusterings A and B is approximately 0.03. This indicates
an agreement that is only slightly better than chance; a higher ARI value would indicate better
agreement.
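For reference, the pair-counting form of the ARI can be evaluated directly from the contingency table of this example. This is a minimal sketch in plain Python; `adjusted_rand_index` is a name chosen for this illustration, and the function assumes a non-degenerate table (the denominator is zero when, for example, both clusterings put every point in one cluster).

```python
from math import comb

# Contingency table from the example: rows = Clustering A, columns = Clustering B.
table = [[30, 20], [40, 10]]


def adjusted_rand_index(table):
    n = sum(map(sum, table))
    # Pairs of points that share a cluster in both clusterings.
    index = sum(comb(cell, 2) for row in table for cell in row)
    # Pairs that share a cluster in A (row totals) and in B (column totals).
    row_pairs = sum(comb(sum(row), 2) for row in table)
    col_pairs = sum(comb(sum(col), 2) for col in zip(*table))
    # Value of the index expected under random labelings with the same totals.
    expected = row_pairs * col_pairs / comb(n, 2)
    maximum = (row_pairs + col_pairs) / 2
    return (index - expected) / (maximum - expected)


ari = adjusted_rand_index(table)
print(round(ari, 3))
```

When the labels are available as two lists rather than a table, scikit-learn's `adjusted_rand_score(labels_true, labels_pred)` computes the same quantity.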
Compute Adjusted rand Index
Normalized Mutual Information (NMI)
• Normalized Mutual Information (NMI) is a normalization
of the Mutual Information (MI) score to scale the results
between 0 (no mutual information) and 1 (perfect
correlation).
NMI
• NMI is a good measure for determining the
quality of clustering.
• It is an external measure because we need the
class labels of the instances to determine the
NMI.
• Since it’s normalized we can measure and
compare the NMI between different
clusterings having different number of
clusters.
Normalized Mutual Information

• Normalized Mutual Information:

$$\mathrm{NMI}(Y, C) = \frac{2 \times I(Y; C)}{H(Y) + H(C)}$$

• where,
1) Y = class labels
2) C = cluster labels
3) H(.) = Entropy
4) I(Y;C) = Mutual Information b/w Y and C
Note: All logs are base-2.
Calculating NMI for Clustering
• Assume m=3 classes and k=2 clusters

[Figure: 20 data points split into Cluster-1 (C=1) and Cluster-2 (C=2); the marker shape shows the class: Class-1 (Y=1), Class-2 (Y=2), Class-3 (Y=3)]


H(Y) = Entropy of Class Labels
• P(Y=1) = 5/20 = ¼
• P(Y=2) = 10/20 = ½
• P(Y=3) = 5/20 = ¼
• $$H(Y) = -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1.5$$
This is calculated for the entire dataset and can be
calculated prior to clustering, as it will not change
depending on the clustering output.
H(C) = Entropy of Cluster Labels
• P(C=1) = 10/20 = ½
• P(C=2) = 10/20 = ½
• $$H(C) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$$
This will be calculated every time the clustering
changes. You can see from the figure that the
clusters are balanced (have equal number of
instances).
I(Y;C) = Mutual Information
• Mutual information is given as:
– $$I(Y; C) = H(Y) - H(Y \mid C)$$
– We already know H(Y)
– H(Y|C) is the entropy of class labels within each
cluster; how do we calculate it?

Mutual Information tells us the reduction in the
entropy of class labels that we get if we know the
cluster labels. (Similar to information gain in
decision trees.)
H(Y|C): conditional entropy of class
labels for clustering C
• Consider Cluster-1:
– P(Y=1|C=1)=3/10 (three triangles in cluster-1)
– P(Y=2|C=1)=3/10 (three rectangles in cluster-1)
– P(Y=3|C=1)=4/10 (four stars in cluster-1)
– Calculate conditional entropy as:

$$H(Y \mid C=1) = -P(C=1) \sum_{y \in \{1,2,3\}} P(Y=y \mid C=1)\,\log_2 P(Y=y \mid C=1)$$
$$= -\tfrac{1}{2} \times \left[\tfrac{3}{10}\log_2\tfrac{3}{10} + \tfrac{3}{10}\log_2\tfrac{3}{10} + \tfrac{4}{10}\log_2\tfrac{4}{10}\right] = 0.7855$$
H(Y|C): conditional entropy of class
labels for clustering C
• Now, consider Cluster-2:
– P(Y=1|C=2)=2/10 (two triangles in cluster-2)
– P(Y=2|C=2)=7/10 (seven rectangles in cluster-2)
– P(Y=3|C=2)=1/10 (one star in cluster-2)
– Calculate conditional entropy as:

$$H(Y \mid C=2) = -P(C=2) \sum_{y \in \{1,2,3\}} P(Y=y \mid C=2)\,\log_2 P(Y=y \mid C=2)$$
$$= -\tfrac{1}{2} \times \left[\tfrac{2}{10}\log_2\tfrac{2}{10} + \tfrac{7}{10}\log_2\tfrac{7}{10} + \tfrac{1}{10}\log_2\tfrac{1}{10}\right] = 0.5784$$
I(Y;C)
• Finally the mutual information is:

$$I(Y; C) = H(Y) - H(Y \mid C) = 1.5 - [0.7855 + 0.5784] = 0.1361$$


The NMI is therefore,

$$\mathrm{NMI}(Y, C) = \frac{2 \times I(Y; C)}{H(Y) + H(C)} = \frac{2 \times 0.1361}{1.5 + 1} \approx 0.1089$$
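The whole NMI calculation can be reproduced from the class counts per cluster given above (3, 3, 4 in Cluster-1 and 2, 7, 1 in Cluster-2). The sketch below uses plain Python with base-2 logarithms, as in the slides; the variable names are chosen for this illustration.

```python
from math import log2

# counts[c][y]: number of points of class y (Y=1,2,3) inside cluster c (C=1,2).
counts = [[3, 3, 4], [2, 7, 1]]


def entropy(probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


n = sum(map(sum, counts))

# H(Y): entropy of the class labels (column totals of the table).
h_y = entropy([sum(col) / n for col in zip(*counts)])
# H(C): entropy of the cluster labels (row totals of the table).
h_c = entropy([sum(row) / n for row in counts])
# H(Y|C): per-cluster class entropy, weighted by the cluster probability P(C=c).
h_y_given_c = sum(
    sum(row) / n * entropy([v / sum(row) for v in row]) for row in counts
)

mi = h_y - h_y_given_c
nmi = 2 * mi / (h_y + h_c)
print(h_y, h_c, round(h_y_given_c, 4), round(mi, 4), round(nmi, 4))
```

The two weighted terms inside `h_y_given_c` are the values 0.7855 and 0.5784 computed by hand for Cluster-1 and Cluster-2.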
Calculate NMI for the following
Homogeneity
• A clustering result satisfies homogeneity if all of its
clusters contain only data points which are members of
a single class.
• This metric is independent of the absolute values of the
labels: a permutation of the class or cluster label values
won’t change the score value in any way.
• Syntax: sklearn.metrics.homogeneity_score(labels_true, labels_pred)
• The metric is not symmetric:
switching labels_true with labels_pred will return
the completeness_score.
Parameters:
• labels_true: <int array, shape = [n_samples]>: It accepts the
ground truth class labels to be used as a reference.
• labels_pred: <array-like of shape (n_samples,)>: It accepts the
cluster labels to evaluate.
Returns:
homogeneity: <float>: It returns a score between 0.0 and 1.0, where
1.0 stands for perfectly homogeneous labeling.
• To calculate homogeneity numerically, you can use the following formula:
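The formula itself is missing from the slide. A common definition, assumed here, is h = 1 - H(Y|C) / H(Y), with h = 1 when H(Y) = 0, where Y are the class labels and C the cluster labels. A minimal plain-Python sketch under that assumption (helper names are illustrative):

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), estimated from two equally long label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(c / n * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity(labels_true, labels_pred):
    h_true = entropy(labels_true)
    if h_true == 0:
        return 1.0  # a single class is trivially homogeneous
    return 1.0 - conditional_entropy(labels_true, labels_pred) / h_true


# Every cluster contains one class only, so the labeling is perfectly homogeneous.
pure = homogeneity([0, 0, 1, 1], [0, 0, 1, 2])
# One cluster mixes both classes evenly, so knowing the cluster tells nothing.
mixed = homogeneity([0, 0, 1, 1], [0, 0, 0, 0])
print(pure, mixed)
```

The base of the logarithm does not matter here because the score is a ratio of two entropies.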
Completeness score
• This score is complementary to the previous one. Its
purpose is to provide a piece of information about the
assignment of samples belonging to the same class.
• More precisely, a good clustering algorithm should
assign all samples with the same true label to the same
cluster.
Completeness (completeness_score) describes how close the clustering
algorithm comes to this ideal.
This metric is independent of the absolute values of the labels. A permutation of the
cluster label values won't change the score value in any way.

Syntax: sklearn.metrics.completeness_score(labels_true, labels_pred)

• labels_true: <int array, shape = [n_samples]>: It
accepts the ground truth class labels to be used as a
reference.
• labels_pred: <array-like of shape (n_samples,)>: It
accepts the cluster labels to evaluate.
Returns: completeness score between 0.0 and 1.0.
1.0 stands for perfectly complete labeling.
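As an illustration, completeness can be written as c = 1 - H(C|Y) / H(C), with Y the class labels and C the cluster labels. This is the usual definition and is assumed here, since the slides give no formula. The sketch is plain Python and the helper names are illustrative.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), estimated from two equally long label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(c / n * log(c / given_counts[g]) for (g, _), c in joint.items())


def completeness(labels_true, labels_pred):
    h_pred = entropy(labels_pred)
    if h_pred == 0:
        return 1.0  # a single cluster keeps every class together
    return 1.0 - conditional_entropy(labels_pred, labels_true) / h_pred


# Class 1 is split over clusters 1 and 2, so the labeling is not complete.
split = completeness([0, 0, 1, 1], [0, 0, 1, 2])
# Putting everything in one cluster is complete (but not homogeneous).
lumped = completeness([0, 0, 1, 1], [0, 0, 0, 0])
print(round(split, 4), lumped)
```

Swapping the two arguments of `completeness` yields the homogeneity of the same pair of labelings, which is the asymmetry mentioned for `homogeneity_score`.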
V-Measure
One of the primary disadvantages of any clustering technique is that it is difficult to
evaluate its performance. To tackle this problem, the metric of V-Measure was
developed. The calculation of the V-Measure first requires the calculation of two
terms:-

1. Homogeneity: A perfectly homogeneous clustering is one where each cluster has
data-points belonging to the same class label. Homogeneity describes the closeness
of the clustering algorithm to this perfection.
2. Completeness: A perfectly complete clustering is one where all data-points
belonging to the same class are clustered into the same cluster. Completeness
describes the closeness of the clustering algorithm to this perfection.

sklearn.metrics.v_measure_score(labels_true, labels_pred, *, beta=1.0)

The V-measure is the harmonic mean between homogeneity and completeness:

v = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
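The formula can be tried out with a small helper. This is a sketch only: `v_measure` is a name chosen for this illustration, it takes already computed homogeneity and completeness values as inputs, and the example values (1.0 and 2/3) are illustrative rather than taken from the slides.

```python
def v_measure(homogeneity, completeness, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness."""
    if homogeneity + completeness == 0:
        return 0.0  # assumed convention when both scores are zero
    return ((1 + beta) * homogeneity * completeness
            / (beta * homogeneity + completeness))


# beta = 1 gives the plain harmonic mean of the two scores.
v_equal = v_measure(1.0, 2 / 3)
# beta > 1 puts more weight on completeness, beta < 1 on homogeneity.
v_complete = v_measure(1.0, 2 / 3, beta=2.0)
v_homogeneous = v_measure(1.0, 2 / 3, beta=0.5)
print(round(v_equal, 4), round(v_complete, 4), round(v_homogeneous, 4))
```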
Fowlkes-Mallows Score
• The Fowlkes-Mallows Score is an evaluation metric for the
similarity between clusterings obtained by applying different
clustering algorithms.
• Although technically it quantifies the similarity between two
clusterings, it is typically used to evaluate the performance
of a clustering algorithm by taking the second clustering to be the
ground truth, i.e. the observed class labels, and assuming it to be the perfect
clustering.
The Fowlkes-Mallows index (FMI) is defined as the geometric mean of the pairwise precision
and recall:

FMI = TP / sqrt((TP + FP) * (TP + FN))

• Where TP is the number of True Positives (i.e. the number of pairs of points that belong to the
same cluster in both labels_true and labels_pred),
• FP is the number of False Positives (i.e. the number of pairs of points that belong to the same
cluster in labels_pred but not in labels_true)
• and FN is the number of False Negatives (i.e. the number of pairs of points that belong to the
same cluster in labels_true but not in labels_pred).

The score ranges from 0 to 1. A high value indicates a good similarity between the two clusterings.
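The pair counts in this formula can be obtained by looping over all pairs of points. The following is a minimal plain-Python sketch; `fowlkes_mallows` is a name chosen for this illustration, and it treats a pair as "positive" when the predicted clustering puts both points together. Because the formula is symmetric in FP and FN, swapping the two label lists gives the same score.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_pred:
            fp += 1  # together in the prediction only
        elif same_true:
            fn += 1  # together in the ground truth only
    if tp == 0:
        return 0.0  # also avoids a zero denominator
    return tp / sqrt((tp + fp) * (tp + fn))


# One true pair is recovered and one is split: TP = 1, FP = 0, FN = 1.
fmi_partial = fowlkes_mallows([0, 0, 1, 1], [0, 0, 1, 2])
# Identical clusterings up to a renaming of the labels.
fmi_same = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
print(round(fmi_partial, 4), fmi_same)
```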
Adjusted Mutual Information (AMI)
• Adjusted Mutual Information (AMI) is an adjustment of
the Mutual Information (MI) score to account for chance.
• It accounts for the fact that the MI is generally higher
for two clusterings with a larger number of clusters,
regardless of whether there is actually more information
shared.
• For two clusterings U and V, the AMI is given as:

• AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [avg(H(U), H(V)) - E(MI(U, V))]
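The only non-obvious term in this formula is the expected mutual information E(MI(U, V)). The sketch below assumes the usual permutation model, in which the cluster sizes of U and V are kept fixed and each cell count of the contingency table follows a hypergeometric distribution. It is plain Python with natural logarithms, the helper names are illustrative, and it does not handle the degenerate case where the denominator is zero.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_info(u, v):
    n = len(u)
    cu, cv, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(c / n * log(n * c / (cu[a] * cv[b])) for (a, b), c in joint.items())


def expected_mutual_info(u, v):
    n = len(u)
    emi = 0.0
    for a in Counter(u).values():          # cluster sizes in U
        for b in Counter(v).values():      # cluster sizes in V
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # Hypergeometric probability of an overlap of exactly nij points.
                prob = comb(a, nij) * comb(n - a, b - nij) / comb(n, b)
                emi += prob * nij / n * log(n * nij / (a * b))
    return emi


def adjusted_mutual_info(u, v):
    mi, emi = mutual_info(u, v), expected_mutual_info(u, v)
    mean_entropy = (entropy(u) + entropy(v)) / 2   # avg(H(U), H(V))
    return (mi - emi) / (mean_entropy - emi)


ami_same = adjusted_mutual_info([0, 0, 1, 1], [0, 0, 1, 1])
ami_crossed = adjusted_mutual_info([0, 0, 1, 1], [0, 1, 0, 1])
print(round(ami_same, 4), round(ami_crossed, 4))
```

The second call shows the effect of the adjustment: the two labelings share no information (MI = 0), so the score falls below zero instead of staying at zero.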


Thank you!
