Clustering - Introduction & Evaluation Metrics
CSED, TIET
Clustering-Introduction
▪ Cluster analysis, or clustering, is an unsupervised machine learning task. It involves
automatically discovering natural groupings in data.
▪ Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same group are more similar to each other (high intra-cluster
similarity) than to data points in other groups (low inter-cluster similarity).
▪ In simple words, the aim is to segregate data points with similar traits and assign them to
clusters.
Applications of Clustering
▪ Clustering algorithms are widely used in a number of applications such as:
➢ Market Segmentation / Targeted Marketing / Recommender Systems
➢ Document / News / Article Clustering
➢ Biology / Genome Clustering
➢ City Planning
➢ Speech Recognition
➢ Social Network Analysis
➢ Organizing Computing Clusters
➢ Astronomical Data Analysis
Evaluation Metrics
To evaluate the quality of the clusters produced by a clustering algorithm, the
following evaluation metrics are used:
1. Silhouette Coefficient
2. Dunn’s Index
3. Rand Index (RI)
4. Adjusted Rand Index (ARI)
5. Purity
Metrics 1 and 2 are used when we don't have any ground truth (unsupervised; only data
points), whereas metrics 3, 4 and 5 are used when we have ground truth (supervised; data
points and labels).
Silhouette Coefficient
▪ The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: The mean distance between a sample and all other points in the same cluster.
b: The mean distance between a sample and all other points in the next nearest cluster.
$$s = \frac{b - a}{\max(a, b)}$$
▪ The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette
Coefficient for each sample.
▪ The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering.
▪ Scores around zero indicate overlapping clusters.
▪ The score is higher when clusters are dense and well separated, which matches the standard
notion of a cluster.
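The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, and the function names `silhouette_samples` and `silhouette_score` as well as the convention of scoring a single-point cluster as 0 are assumptions made for this sketch.

```python
import math


def silhouette_samples(points, labels):
    """Return s = (b - a) / max(a, b) for every sample (Euclidean distance assumed)."""
    # Group sample indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumption: a cluster with a single point gets a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def silhouette_score(points, labels):
    """Mean of the per-sample silhouette coefficients."""
    s = silhouette_samples(points, labels)
    return sum(s) / len(s)


# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

On this toy data the first labeling should score close to +1 and the second should come out negative, which mirrors the interpretation of the bounds given above.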
Dunn Index
▪ The Dunn index is another internal clustering validation measure, which can be computed as
follows:
1. For each cluster, compute the distance between each of the objects in the cluster and the objects
in the other clusters
2. Use the minimum of this pairwise distance as the inter-cluster separation (min.separation)
3. For each cluster, compute the distance between the objects in the same cluster.
4. Use the maximal intra-cluster distance (i.e., the maximum diameter) as the intra-cluster compactness.
5. Compute the Dunn index (D) as follows:
$$D = \frac{\text{min. separation}}{\text{max. diameter}}$$
If the data set contains compact and well-separated clusters, the diameter of the clusters is expected to
be small and the distance between the clusters is expected to be large. Thus, the Dunn index should be
maximized. The value of the Dunn index lies between 0 and infinity.
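The five steps above translate fairly directly into code. The sketch below assumes Euclidean distance and at least two clusters; the function name `dunn_index` and the choice to return infinity when every cluster has zero diameter are assumptions of this illustration.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index = min. separation / max. diameter (Euclidean distance assumed)."""
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Steps 1-2: smallest distance between two points in different clusters.
    min_separation = min(
        math.dist(p, q)
        for (_, ca), (_, cb) in combinations(clusters.items(), 2)
        for p in ca
        for q in cb
    )
    # Steps 3-4: largest distance between two points in the same cluster.
    max_diameter = max(
        (
            math.dist(p, q)
            for members in clusters.values()
            for p, q in combinations(members, 2)
        ),
        default=0.0,
    )
    # Step 5: the ratio. Assumption: all-singleton clusters give infinity.
    if max_diameter == 0:
        return math.inf
    return min_separation / max_diameter


# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(dunn_index(points, [0, 0, 1, 1]))  # separation 10, diameter 1
```

Here the nearest points of the two groups are 10 apart and the widest cluster has diameter 1, so the index is 10. A labeling that cuts across the groups would shrink the separation and inflate the diameter, lowering the index.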
Rand Index
▪ Rand Index is a measure of how similar clustering results or groupings are to the ground truth.
▪ Let C denote the ground-truth class labeling and K the clustering assignment. Let:
• A be the number of element pairs that lie in the same set in both C and K,
• B be the number of element pairs that lie in different sets in both C and K.
Then RI is given by:
$$RI = \frac{A + B}{\binom{n}{2}}$$
where $\binom{n}{2}$ is the total number of element pairs among the $n$ data points.
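As a rough sketch of the pair counting behind RI, the snippet below walks over every pair of elements and counts the agreements A and B. The function name `rand_index` and the toy labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def rand_index(truth, pred):
    """Rand Index = (A + B) / nC2, counted pair by pair."""
    n = len(truth)
    a = 0  # pairs placed together in both the ground truth and the clustering
    b = 0  # pairs placed apart in both the ground truth and the clustering
    for i, j in combinations(range(n), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            a += 1
        elif not same_truth and not same_pred:
            b += 1
    total_pairs = n * (n - 1) // 2  # nC2
    return (a + b) / total_pairs


# Hypothetical labels: the clustering matches the classes up to renaming.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Only whether two elements share a group matters, not the label values themselves, so relabeling the clusters leaves the score unchanged.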
Adjusted Rand Index (ARI)
▪ ARI is computed from a contingency matrix whose (i, j)th entry is the number of objects common
to cluster i produced by the clustering algorithm and ground-truth cluster j.
ARI-Example
Consider the same example as discussed for RI (in slide 9).
The contingency matrix for the example is given by:
$$ARI = \frac{2 - \frac{4 \times 4}{10}}{\frac{4 + 4}{2} - \frac{4 \times 4}{10}} = \frac{2 - 1.6}{4 - 1.6} \approx 0.1667$$
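The arithmetic in the worked example follows the usual contingency-table form of ARI, which can be sketched as below. The function name `adjusted_rand_index` is an assumption, and the label lists are hypothetical: they were chosen only so that the intermediate sums match the numbers in the example (2, 4, 4 and 10), and they are not necessarily the data from the original slide.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """ARI computed from contingency counts between clusters and classes."""
    n = len(truth)
    # Cell (i, j) counts objects in predicted cluster i and true class j.
    contingency = Counter(zip(pred, truth))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(pred).values())
    sum_cols = sum(comb(c, 2) for c in Counter(truth).values())

    expected = sum_rows * sum_cols / comb(n, 2)  # value expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Assumption: degenerate case (e.g. a single cluster on both sides) scores 1.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labels giving sum_cells = 2, sum_rows = 4, sum_cols = 4, nC2 = 10.
truth = [0, 1, 1, 0, 0]
pred = [0, 0, 0, 1, 1]
print(round(adjusted_rand_index(truth, pred), 4))  # 0.1667
```

Subtracting the chance-level term is what distinguishes ARI from the plain Rand Index: a clustering that agrees with the ground truth no better than chance scores near 0 rather than some arbitrary positive value.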
Purity
▪ Purity is also an external evaluation criterion of cluster quality.
▪ It is the fraction of the total number of objects (data points) that were classified correctly,
where each cluster is assigned its most frequent ground-truth class.
▪ It lies in the range 0 to 1. The higher the purity, the better the clustering.
$$\text{Purity} = \frac{1}{N} \sum_{i=1}^{k} \max_{j} |c_i \cap t_j|$$
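A minimal sketch of the purity formula follows, reading c_i as the i-th cluster, t_j as the j-th ground-truth class and N as the total number of objects. That reading, the function name `purity` and the toy labels are assumptions of this illustration.

```python
from collections import Counter


def purity(truth, pred):
    """Purity = (1/N) * sum over clusters of the size of the majority class."""
    # For every cluster, count how many members belong to each true class.
    per_cluster = {}
    for t, c in zip(truth, pred):
        per_cluster.setdefault(c, Counter())[t] += 1
    # max_j |c_i ∩ t_j| is the count of the most frequent class in cluster i.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(truth)


# Hypothetical labels: cluster 0 holds classes {0, 1, 1}, cluster 1 holds {0, 0}.
print(purity([0, 1, 1, 0, 0], [0, 0, 0, 1, 1]))  # 0.8
```

Cluster 0 contributes 2 (its majority class) and cluster 1 contributes 2, giving 4 / 5 = 0.8. Note that purity grows trivially as the number of clusters increases, reaching 1 when every point is its own cluster, so it is best compared across clusterings with the same number of clusters.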