UNIT 3-Clustering Metrics
Clustering metrics
• Clustering metrics are quantitative measures used to evaluate the quality of
clustering algorithms and the resulting clusters.
• Internal Evaluation Metrics: These metrics evaluate the quality of clusters without
any external reference. They measure the compactness of data points within the
same cluster and the separation between different clusters.
• Such as: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index (Variance Ratio Criterion),
Dunn Index
• External Evaluation Metrics: These metrics require a ground truth or a reference set
of labels to compare the clustering results against. They measure the agreement
between the generated clusters and the true classes in the reference set.
• Such as: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Fowlkes-Mallows Index
Silhouette Score
• Silhouette Coefficient or silhouette score is a metric used to
calculate the goodness of a clustering technique. This score is
calculated by measuring each data point’s similarity to the cluster
it belongs to and how different it is from other clusters. The
Silhouette score is commonly used to assess the performance of
clustering algorithms like K-Means.
• Its value ranges from -1 to 1.
• 1: Means clusters are well apart from each other and clearly
distinguished (Best Case)
• 0: Means clusters are indifferent, or we can say that the distance
between clusters is not significant (Overlapping Clusters)
• -1: Means clusters are assigned in the wrong way (Worst Case)
It is computed as:
• The silhouette value is a measure of how similar an
object is to its own cluster (cohesion) compared to other
clusters (separation).
• Silhouette Score = (b-a)/max(a,b)
where
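• a = average intra-cluster distance, i.e. the average distance from the
point to the other points in its own cluster
• b = average inter-cluster distance, i.e. the average distance from the
point to the points of the nearest other cluster
• The score of a clustering is the mean of this value over all data points.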
Example:
Datapoints   Cluster Label
A1           C1
A2           C1
A3           C2
A4           C2

Datapoint   A1     A2     A3     A4
A1          0      0.10   0.65   0.55
A2          0.10   0      0.70   0.60
A3          0.65   0.70   0      0.30
A4          0.55   0.60   0.30   0
For point A1:
a= 0.1/1=0.1
b= (0.65+0.55)/2=0.6
Silhouette Score = (b-a)/max(a,b)
(0.6-0.1)/ 0.6= 0.833
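The same calculation can be repeated for every point in the example. The following is a minimal sketch, assuming the distance matrix and cluster labels given above; the function name `silhouette_values` is hypothetical and not taken from any library.

```python
# Distance matrix and labels from the example (order: A1, A2, A3, A4).
dist = [
    [0.00, 0.10, 0.65, 0.55],
    [0.10, 0.00, 0.70, 0.60],
    [0.65, 0.70, 0.00, 0.30],
    [0.55, 0.60, 0.30, 0.00],
]
labels = ["C1", "C1", "C2", "C2"]


def silhouette_values(dist, labels):
    """Return the silhouette value (b - a) / max(a, b) of every point."""
    n = len(labels)
    clusters = sorted(set(labels))
    values = []
    for i in range(n):
        # a: mean distance to the other points of the same cluster
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)  # common convention for single-point clusters
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        values.append((b - a) / max(a, b))
    return values


values = silhouette_values(dist, labels)
for name, s in zip(["A1", "A2", "A3", "A4"], values):
    print(name, round(s, 3))
print("mean silhouette:", round(sum(values) / len(values), 3))
```

The first value printed is the 0.833 obtained by hand for A1; the mean over the four points is the silhouette score of the whole clustering.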
• Data Point 1: (2, 3) Data Point 2: (3, 2) Data Point 3: (4, 3) Data Point
4: (8, 8) Data Point 5: (9, 7) Data Point 6: (10, 8) Data Point 7: (15, 14)
Data Point 8: (16, 13)
• Let's say we want to perform K-means clustering on this dataset with
K = 2. After clustering, the data points are divided into two clusters:
• Cluster 1: {Data Point 1, Data Point 2, Data Point 3} Cluster 2: {Data
Point 4, Data Point 5, Data Point 6, Data Point 7, Data Point 8}
• Now, let's calculate the inter-cluster distances (minimum distance
between any two points from different clusters) and intra-cluster
distances (maximum distance between any two points within the
same cluster):
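Carrying these two definitions through for the eight points above gives the following worked calculation. It is a sketch that assumes Euclidean distance; the closest pair across clusters is Data Point 3 and Data Point 4, and the farthest pair inside a cluster is Data Point 4 and Data Point 8.

```latex
\begin{aligned}
d_{\min}^{\text{inter}} &= d\big((4,3),(8,8)\big) = \sqrt{4^2 + 5^2} = \sqrt{41} \approx 6.40 \\
d_{\max}^{\text{intra}} &= d\big((8,8),(16,13)\big) = \sqrt{8^2 + 5^2} = \sqrt{89} \approx 9.43 \\
\frac{d_{\min}^{\text{inter}}}{d_{\max}^{\text{intra}}} &\approx \frac{6.40}{9.43} \approx 0.68
\end{aligned}
```

The ratio in the last line is the Dunn Index of this clustering.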
Example
• Let's work through a numerical example to calculate the Dunn Index for a simple set of clusters.
Suppose you have the following data points and their corresponding clusters:
Data Points:
• A(1, 2)
• B(2, 2)
• C(5, 8)
• D(6, 8)
• E(10, 12)
• F(11, 12)
Clusters:
• Cluster 1: {A, B}
• Cluster 2: {C, D}
• Cluster 3: {E, F}
• Distance between Cluster 1 and Cluster 2 (d(1, 2)):
• d(A, C) = sqrt((1-5)^2 + (2-8)^2) = sqrt(52) ≈ 7.21
• d(A, D) = sqrt((1-6)^2 + (2-8)^2) = sqrt(61) ≈ 7.81
• d(B, C) = sqrt((2-5)^2 + (2-8)^2) = sqrt(45) ≈ 6.71
• d(B, D) = sqrt((2-6)^2 + (2-8)^2) = sqrt(52) ≈ 7.21
• So, d(1, 2) = 6.71
• Distance between Cluster 1 and Cluster 3 (d(1, 3)):
• d(A, E) = sqrt((1-10)^2 + (2-12)^2) = sqrt(181) ≈ 13.45
• d(A, F) = sqrt((1-11)^2 + (2-12)^2) = sqrt(200) ≈ 14.14
• d(B, E) = sqrt((2-10)^2 + (2-12)^2) = sqrt(164) ≈ 12.81
• d(B, F) = sqrt((2-11)^2 + (2-12)^2) = sqrt(181) ≈ 13.45
• So, d(1, 3) = 12.81
• Distance between Cluster 2 and Cluster 3 (d(2, 3)):
• d(C, E) = sqrt((5-10)^2 + (8-12)^2) = sqrt(41) ≈ 6.40
• d(C, F) = sqrt((5-11)^2 + (8-12)^2) = sqrt(52) ≈ 7.21
• d(D, E) = sqrt((6-10)^2 + (8-12)^2) = sqrt(32) = 4√2 ≈ 5.66
• d(D, F) = sqrt((6-11)^2 + (8-12)^2) = sqrt(41) ≈ 6.40
• So, d(2, 3) = 4√2 ≈ 5.66
• Now, let's calculate the minimum inter-cluster distance and the
maximum intra-cluster distance:
• Minimum Inter-cluster Distance: min(d(1, 2), d(1, 3), d(2, 3)) =
min(6.71, 12.81, 5.66) = 5.66
• Intra-cluster distances: d(A, B) = 1, d(C, D) = 1, d(E, F) = 1
• Maximum Intra-cluster Distance: max(d(A, B), d(C, D), d(E, F)) =
max(1, 1, 1) = 1
• Finally, we can calculate the Dunn Index:
• Dunn Index = min(d(1, 2), d(1, 3), d(2, 3)) / max(d(A, B), d(C, D),
d(E, F)) = 5.66 / 1 ≈ 5.66
• So, the Dunn Index for these clusters is approximately 5.66. A value well
above 1 means the clusters are compact and far apart.
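The definition used in this example (smallest distance between points of different clusters, divided by the largest distance between points of the same cluster) can be checked with a few lines of code. This is a minimal sketch assuming Euclidean distance; the function name `dunn_index` is hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(clusters):
    """clusters: list of clusters, each a list of coordinate tuples."""
    # Smallest distance between two points that lie in different clusters.
    min_inter = min(
        dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )
    # Largest distance between two points that lie in the same cluster.
    max_intra = max(
        dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    return min_inter / max_intra


clusters = [
    [(1, 2), (2, 2)],      # Cluster 1: A, B
    [(5, 8), (6, 8)],      # Cluster 2: C, D
    [(10, 12), (11, 12)],  # Cluster 3: E, F
]
dunn = dunn_index(clusters)
print(dunn)
```

Every pair of points is compared once, which is why the cost grows quickly with the size of the dataset.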
Data Point   Coordinates   Cluster
1            (2, 3)        A
2            (2, 5)        A
3            (3, 8)        B
4            (6, 5)        B
5            (8, 8)        C
6            (9, 6)        C
7            (10, 2)       A
8            (12, 4)       A
9            (15, 7)       C
10           (17, 5)       C
Drawbacks of Dunn index:
• As the number of clusters and dimensionality of the
data increase, the computational cost also increases.
Adjusted Rand Index (ARI)
• It measures the similarity between the true class labels and the
clusters generated by a clustering algorithm while accounting for
chance agreement.
• The ARI produces a score between -1 and 1, where higher values
indicate better agreement between the predicted clusters and the
true labels.
• Contingency table: m x n
• m = number of clusters in the first clustering (produced by the algorithm)
• n = number of clusters in the second clustering (Ground Truth)
To use the Adjusted Rand Index for evaluating clustering, follow these steps:
Perform Clustering: Apply a clustering algorithm to your dataset to create clusters of data points. This
could be an algorithm like k-means, hierarchical clustering, DBSCAN, or any other clustering technique.
Obtain True Class Labels: If you have access to ground truth labels or class assignments for your data,
this is ideal. These true labels represent the actual groups or categories that your data points belong to.
Example
• Suppose you have two clustering results for a dataset with 100 data points:
• Create a contingency table that shows the number of data points in common
between the two clusterings. Rows represent clusters in Clustering A, and
columns represent clusters in Clustering B. The table might look like this:
            | Cluster 1 | Cluster 2 |
-------------------------------------
Cluster 1   |    30     |    20     |
-------------------------------------
Cluster 2   |    40     |    10     |
-------------------------------------
• Count the pairs of data points inside every cell, every row total and
every column total of the contingency table, using (n choose 2) = n(n-1)/2:
• Cell pairs = C(30,2) + C(20,2) + C(40,2) + C(10,2) = 435 + 190 + 780 + 45 = 1450
• Row pairs (row totals 50 and 50) = C(50,2) + C(50,2) = 1225 + 1225 = 2450
• Column pairs (column totals 70 and 30) = C(70,2) + C(30,2) = 2415 + 435 = 2850
• Total pairs = C(100,2) = 4950: there are 4,950 ways that 2 items chosen from a set of 100 can be combined.
• Now calculate ARI = (Index - Expected_Index) / (Max_Index - Expected_Index)
• Index = Cell pairs = 1450
• Expected_Index = (Row pairs * Column pairs) / Total pairs = (2450 * 2850) / 4950 ≈ 1410.61
• Max_Index = (Row pairs + Column pairs) / 2 = (2450 + 2850) / 2 = 2650
• ARI = (1450 - 1410.61) / (2650 - 1410.61) = 39.39 / 1239.39 ≈ 0.032
• An ARI this close to 0 means the two clusterings agree hardly more than would be expected by chance.
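As a cross-check, the ARI of a contingency table can be computed with the standard pair-counting definition (pairs inside cells, row totals and column totals). The sketch below assumes that definition; the function name `adjusted_rand_index` is hypothetical and the table is the one from this example.

```python
from math import comb


def adjusted_rand_index(table):
    """ARI from a contingency table given as a list of rows."""
    n = sum(map(sum, table))
    # Pairs of points that share a cell, a row, or a column.
    index = sum(comb(c, 2) for row in table for c in row)
    row_pairs = sum(comb(sum(row), 2) for row in table)
    col_pairs = sum(comb(sum(col), 2) for col in zip(*table))
    expected = row_pairs * col_pairs / comb(n, 2)  # agreement expected by chance
    max_index = (row_pairs + col_pairs) / 2        # value for identical clusterings
    return (index - expected) / (max_index - expected)


ari = adjusted_rand_index([[30, 20], [40, 10]])
print(ari)
```

Two identical clusterings, for example the table `[[50, 0], [0, 50]]`, give an ARI of exactly 1 with this function.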
$$NMI(Y, C) = \frac{2 \times I(Y; C)}{H(Y) + H(C)}$$
• where,
1) Y = class labels
2) C = cluster labels
3) H(.) = Entropy
4) I(Y;C) = Mutual Information between Y and C
Note: All logs are base-2.
Calculating NMI for Clustering
• Assume m=3 classes and k=2 clusters
$$H(Y \mid C=1) = -P(C=1) \sum_{y \in \{1,2,3\}} P(Y=y \mid C=1) \log P(Y=y \mid C=1)$$
$$= -\frac{1}{2} \times \left[ \frac{3}{10}\log\frac{3}{10} + \frac{3}{10}\log\frac{3}{10} + \frac{4}{10}\log\frac{4}{10} \right] = 0.7855$$
H(Y|C): conditional entropy of class labels for clustering C
• Now, consider Cluster-2:
– P(Y=1|C=2)=2/10 (two triangles in cluster-2)
– P(Y=2|C=2)=7/10 (seven rectangles in cluster-2)
– P(Y=3|C=2)=1/10 (one star in cluster-2)
– Calculate conditional entropy as:
$$H(Y \mid C=2) = -P(C=2) \sum_{y \in \{1,2,3\}} P(Y=y \mid C=2) \log P(Y=y \mid C=2)$$
$$= -\frac{1}{2} \times \left[ \frac{2}{10}\log\frac{2}{10} + \frac{7}{10}\log\frac{7}{10} + \frac{1}{10}\log\frac{1}{10} \right] = 0.5784$$
I(Y; C)
$$I(Y; C) = H(Y) - H(Y \mid C)$$
• Finally the mutual information is:
$$I(Y; C) = 1.5 - (0.7855 + 0.5784) = 0.1361$$
The NMI is therefore,
$$NMI(Y, C) = \frac{2 \times I(Y; C)}{H(Y) + H(C)} = \frac{2 \times 0.1361}{1.5 + 1} \approx 0.1089$$
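The whole NMI calculation can be reproduced from the class counts per cluster. This is a minimal sketch under stated assumptions: the cluster-2 counts (2, 7, 1) come from the text, while the cluster-1 counts (3, 3, 4) are inferred from H(Y) = 1.5 and the conditional entropy 0.7855. All variable names are hypothetical.

```python
from math import log2

# Rows = clusters, columns = classes (triangles, rectangles, stars).
# Row 0 (cluster 1) is inferred, row 1 (cluster 2) is given in the text.
counts = [[3, 3, 4], [2, 7, 1]]


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * log2(p) for p in probs if p > 0)


n = sum(map(sum, counts))
h_y = entropy([sum(col) / n for col in zip(*counts)])  # entropy of class labels
h_c = entropy([sum(row) / n for row in counts])        # entropy of cluster labels
# H(Y|C): per-cluster class entropy, weighted by the cluster's share of points
h_y_given_c = sum(
    (sum(row) / n) * entropy([k / sum(row) for k in row]) for row in counts
)
mi = h_y - h_y_given_c
nmi = 2 * mi / (h_y + h_c)
print(round(h_y, 4), round(h_c, 4), round(h_y_given_c, 4), round(mi, 4), round(nmi, 4))
```

With these counts, H(Y) = 1.5 and H(C) = 1, and the mutual information rounds to the 0.1361 obtained above.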
Calculate NMI for the following
Homogeneity
• A clustering result satisfies homogeneity if all of its
clusters contain only data points which are members of
a single class.
• This metric is independent of the absolute values of the
labels: a permutation of the class or cluster label values
won’t change the score value in any way.
• Syntax: sklearn.metrics.homogeneity_score(labels_true, labels_pred)
• The metric is not symmetric: switching labels_true with labels_pred
will return the completeness_score.
Parameters:
• labels_true: <int array, shape = [n_samples]>: the ground truth class
labels to be used as a reference.
• labels_pred: <array-like of shape (n_samples,)>: the cluster labels to
evaluate.
Returns:
homogeneity: <float>: a score between 0.0 and 1.0, where 1.0 stands for
perfectly homogeneous labeling.
• To calculate homogeneity numerically, you can use the following formula:
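A common way to write this is in terms of the entropies already used for NMI. The block below is a sketch of that standard definition, with Y the class labels and C the cluster labels; the convention h = 1 when H(Y) = 0 is an assumption borrowed from common implementations.

```latex
h =
\begin{cases}
1 & \text{if } H(Y) = 0 \\[4pt]
1 - \dfrac{H(Y \mid C)}{H(Y)} & \text{otherwise}
\end{cases}
```

For the NMI example in this unit, H(Y|C) = 0.7855 + 0.5784 = 1.3639 and H(Y) = 1.5, so h = 1 - 1.3639/1.5 ≈ 0.091: the clusters there are far from homogeneous.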
Completeness score
• This score is complementary to the previous one. Its
purpose is to provide a piece of information about the
assignment of samples belonging to the same class.
• More precisely, a good clustering algorithm should
assign all samples with the same true label to the same
cluster.
Completeness (completeness_score) measures how close the clustering comes
to this ideal.
This metric is independent of the absolute values of the labels. A permutation of the
cluster label values won’t change the score value in any way.
sklearn.metrics.completeness_score()
Syntax: sklearn.metrics.completeness_score(labels_true, labels_pred)
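To see how homogeneity and completeness relate, both can be computed from scratch with conditional entropies. This is a minimal sketch, not the scikit-learn implementation; the helper names `entropy`, `conditional_entropy`, `homogeneity` and `completeness` are hypothetical, and the label lists are made-up illustration data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def conditional_entropy(labels, given):
    """H(labels | given) in bits."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(
        (k / n) * log2(k / given_counts[g]) for (g, _), k in joint.items()
    )


def homogeneity(labels_true, labels_pred):
    # 1 - H(Y|C) / H(Y); defined as 1.0 when there is only one class
    h_y = entropy(labels_true)
    if h_y == 0:
        return 1.0
    return 1.0 - conditional_entropy(labels_true, labels_pred) / h_y


def completeness(labels_true, labels_pred):
    # Completeness is homogeneity with the two labelings swapped.
    return homogeneity(labels_pred, labels_true)


labels_true = [0, 0, 1, 1]
labels_pred = [0, 0, 1, 2]  # class 1 is split over two clusters
h = homogeneity(labels_true, labels_pred)
c = completeness(labels_true, labels_pred)
print(h, c)
```

Every cluster in this illustration holds a single class, so the labeling is perfectly homogeneous, but it is not complete because the second class is spread over two clusters.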
Fowlkes-Mallows Index
• FMI = TP / sqrt((TP + FP) * (TP + FN))
• Where TP is the number of True Positives (i.e. the number of pairs of points that belong to the
same cluster in both labels_true and labels_pred),
• FP is the number of False Positives (i.e. the number of pairs of points that belong to the same
cluster in labels_pred but not in labels_true)
• and FN is the number of False Negatives (i.e. the number of pairs of points that belong to the
same cluster in labels_true but not in labels_pred).
The score ranges from 0 to 1. A high value indicates a good similarity between two clusterings.
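The pair counts TP, FP and FN can be obtained by looping over every pair of points. The sketch below assumes the usual pair-based score TP / sqrt((TP + FP) * (TP + FN)); because FP and FN enter that expression symmetrically, which of the two is called "false positive" does not change the result. The function name `fowlkes_mallows` is hypothetical.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair is together in both labelings
        elif same_pred:
            fp += 1  # together in labels_pred only
        elif same_true:
            fn += 1  # together in labels_true only
    if tp == 0:
        return 0.0  # no pair agrees; also avoids dividing by zero
    return tp / sqrt((tp + fp) * (tp + fn))


print(fowlkes_mallows([0, 0, 1, 1], [0, 0, 1, 1]))
print(fowlkes_mallows([0, 0, 1, 1], [0, 0, 1, 2]))
print(fowlkes_mallows([0, 0, 0, 0], [0, 1, 2, 3]))
```

The three calls cover identical labelings, a partly matching labeling, and labelings that share no pair at all.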
Adjusted Mutual Information (AMI)
• Adjusted Mutual Information (AMI) is an adjustment of
the Mutual Information (MI) score to account for chance.
• It accounts for the fact that the MI is generally higher
for two clusterings with a larger number of clusters,
regardless of whether there is actually more information
shared.
• For two clusterings U and V, the AMI is given as:
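A commonly used form of the AMI is sketched below. It assumes the arithmetic mean of the two entropies as the normalizer; other normalizers, such as the maximum of H(U) and H(V), are also in use. E{MI(U, V)} denotes the mutual information expected between two random clusterings with the same cluster sizes.

```latex
AMI(U, V) = \frac{MI(U, V) - E\{MI(U, V)\}}
                 {\tfrac{1}{2}\big(H(U) + H(V)\big) - E\{MI(U, V)\}}
```

The score is 1 when the two clusterings are identical and close to 0 when they agree only as much as expected by chance.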