
Clustering

(Introduction, Evaluation Metrics)

CSED, TIET
Clustering-Introduction
▪ Cluster analysis, or clustering, is an unsupervised machine learning task. It involves
automatically discovering natural grouping in data.

▪ Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another (high intra-cluster similarity) than to data points in other groups (low inter-cluster similarity).

▪ In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Applications of Clustering
▪ Clustering algorithms are widely used in a number of applications such as:
➢ Market Segmentation / Targeted Marketing / Recommender Systems
➢ Document / News / Article Clustering
➢ Biology / Genome Clustering
➢ City Planning
➢ Speech Recognition
➢ Social Network Analysis
➢ Organize Computing Clusters
➢ Astronomical Data Analysis
Evaluation Metrics
In order to evaluate the quality of clusters produced by a clustering algorithm, the following evaluation metrics are used:
1. Silhouette Coefficient
2. Dunn’s Index
3. Rand Index (RI)
4. Adjusted Rand Index (ARI)
5. Purity
Metrics 1 and 2 are used when we don't have any ground truth (unsupervised; only data points), whereas metrics 3, 4, and 5 are used when we have ground truth (supervised; data points and labels).
Silhouette Coefficient
▪ The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: The mean distance between a sample and all other points in the same cluster.
b: The mean distance between a sample and all other points in the next nearest cluster.
s = (b - a) / max(a, b)

▪ The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette
Coefficient for each sample.
▪ The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering.
▪ Scores around zero indicate overlapping clusters.
▪ The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
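As a concrete illustration, the per-sample scores and their mean can be computed directly from the definitions of a and b above. The following is a minimal pure-Python sketch; the point coordinates and cluster labels are made-up illustrative data, and each cluster is assumed to contain at least two points:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette_scores(points, labels):
    """Per-sample silhouette coefficients s = (b - a) / max(a, b)."""
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other points in the same cluster
        same = [q for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        a = sum(dist(p, q) for q in same) / len(same)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
lbl = [0, 0, 0, 1, 1, 1]
avg = sum(silhouette_scores(pts, lbl)) / len(pts)
```

On this toy data the two groups are dense and far apart, so the mean score comes out close to +1.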
Dunn Index
▪ The Dunn index is another internal clustering validation measure, which can be computed as follows:
1. For each pair of clusters, compute the distance between each object in one cluster and each object in the other cluster.
2. Use the minimum of these pairwise distances as the inter-cluster separation (min.separation).
3. For each cluster, compute the distances between the objects within the same cluster.
4. Use the maximal intra-cluster distance (i.e. the maximum diameter) as the intra-cluster compactness.
5. The Dunn index (D) is then computed as:

D = min.separation / maximum diameter
If the data set contains compact and well-separated clusters, the diameter of the clusters is expected to
be small and the distance between the clusters is expected to be large. Thus, Dunn index should be
maximized. The value of Dunn Index lies between 0 and infinity.
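The steps above can be sketched in a few lines of pure Python; the two toy clusters are illustrative:

```python
from itertools import combinations
from math import dist

def dunn_index(clusters):
    """Dunn index D = min inter-cluster separation / max cluster diameter."""
    # min.separation: smallest distance between points in different clusters
    separation = min(
        dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1 for q in c2
    )
    # maximum diameter: largest pairwise distance within any single cluster
    diameter = max(
        dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    return separation / diameter

tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
d = dunn_index(tight)
```

Because the clusters here are compact and far apart, the index is large; note that at least two clusters (each with at least two points) are required for both quantities to be defined.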
Rand Index
▪ Rand Index is a measure of how similar clustering results or groupings are to the ground truth.
▪ Let C denote the ground-truth class labeling and K the clustering assignment. Further, let:
• A be the number of element pairs that lie in the same set in both C and K,
• B be the number of element pairs that lie in different sets in both C and K.
Then RI is given by:

RI = (A + B) / nC2

where n is the total number of samples and nC2 = n(n - 1)/2 is the number of unordered pairs.


▪ RI can never exceed 1, and its lowest possible value is 0. The closer the score is to 1, the better the algorithm.
Rand Index- Example
▪ Say we have five examples. The clustering method groups examples A, B, and C into one group
and examples D and E into another group. But according to ground truth groups A and B
are together and C, D, and E together.
▪ To compute RI for this example, let's first list all possible unordered pairs of the five examples at hand. We have 10 (n(n - 1)/2) such pairs: {A, B}, {A, C}, {A, D}, {A, E}, {B, C}, {B, D}, {B, E}, {C, D}, {C, E}, and {D, E}.
▪ Examining these pairs, we notice that the pairs {A, B} and {D, E} are always grouped together (both by the clustering algorithm and the ground truth). Thus, the value of A is two.
▪ We also notice that four pairs, {A, D}, {A, E}, {B, D}, and {B, E}, never occur together. Thus, the value of B is four.
RI = (2 + 4) / 10 = 0.6
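The same worked example can be checked in code. A small sketch that counts agreeing pairs from two label lists; the 0/1 labels below encode the groupings from the slide:

```python
from itertools import combinations

def rand_index(truth, pred):
    """RI = (A + B) / nC2: agreeing pairs over all unordered pairs."""
    pairs = list(combinations(range(len(truth)), 2))
    a = sum(1 for i, j in pairs
            if truth[i] == truth[j] and pred[i] == pred[j])  # together in both
    b = sum(1 for i, j in pairs
            if truth[i] != truth[j] and pred[i] != pred[j])  # apart in both
    return (a + b) / len(pairs)

# Ground truth {A, B}, {C, D, E}; clustering {A, B, C}, {D, E}
truth = [0, 0, 1, 1, 1]   # A B C D E
pred  = [0, 0, 0, 1, 1]
ri = rand_index(truth, pred)
```

Here `ri` comes out to 0.6, matching the hand computation above.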
Adjusted Rand Index (ARI)
▪ RI suffers from one drawback; it yields a high value for pairs of random partitions of a given set of
examples.
▪ To counter this drawback, an adjustment is made to the calculations by taking into consideration
grouping by chance.
▪ In this, we create a contingency table, as below; the rows denote the clusters made by the clustering algorithm and the columns denote the clusters given by the ground truth (for example, if both the ground truth and the clustering method return 3 clusters, the contingency table is as shown below).
     c1   c2   c3
C1
C2
C3

▪ Any (i, j)-th entry is the number of common objects belonging to clustering-algorithm cluster Ci and ground-truth cluster cj.
Adjusted Rand Index (ARI)- Contd….
▪ With nij the entries of the contingency table, ai the row sums, and bj the column sums, ARI is computed as:

ARI = [Σij nijC2 - (Σi aiC2 × Σj bjC2) / nC2] / [½(Σi aiC2 + Σj bjC2) - (Σi aiC2 × Σj bjC2) / nC2]

ARI-Example
Consider the same example as discussed for RI (in slide 9).
The contingency matrix for the example is given by:

                  c1 = {A, B}   c2 = {C, D, E}
C1 = {A, B, C}         2              1
C2 = {D, E}            0              2

Here Σij nijC2 = 1 + 0 + 0 + 1 = 2, the row sums are 3 and 2 (so Σi aiC2 = 3 + 1 = 4), and the column sums are 2 and 3 (so Σj bjC2 = 1 + 3 = 4), giving:

ARI = (2 - (4 × 4)/10) / ((4 + 4)/2 - (4 × 4)/10) = (2 - 1.6) / (4 - 1.6) = 0.1666
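The chance-corrected computation can be reproduced from the contingency table in pure Python; the labels are the same encoding as in the RI example, and `math.comb(n, 2)` plays the role of nC2:

```python
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI from the contingency table, corrected for grouping by chance."""
    rows, cols = sorted(set(pred)), sorted(set(truth))
    # n_ij: objects shared by algorithm cluster i and ground-truth cluster j
    table = {(r, c): sum(1 for p, t in zip(pred, truth) if p == r and t == c)
             for r in rows for c in cols}
    index = sum(comb(v, 2) for v in table.values())
    row_sum = sum(comb(sum(table[r, c] for c in cols), 2) for r in rows)
    col_sum = sum(comb(sum(table[r, c] for r in rows), 2) for c in cols)
    expected = row_sum * col_sum / comb(len(truth), 2)
    return (index - expected) / ((row_sum + col_sum) / 2 - expected)

truth = [0, 0, 1, 1, 1]   # ground truth {A, B}, {C, D, E}
pred  = [0, 0, 0, 1, 1]   # clustering {A, B, C}, {D, E}
ari = adjusted_rand_index(truth, pred)
```

For this data `ari` equals (2 - 1.6) / (4 - 1.6) ≈ 0.1667, matching the slide.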
Purity
▪ Purity is also an external evaluation criterion of cluster quality.
▪ It is the percent of the total number of objects (data points) that were classified correctly.
▪ It also lies in the range 0 to 1. The higher the purity, the better the model.

Purity = (1/N) × Σi=1..k maxj |ci ∩ tj|

where N = number of objects (data points), k = number of clusters, ci is a cluster in the output C of the clustering algorithm, and tj is a cluster in the ground truth.
Purity-Example
▪ For the example discussed for ARI (in slide 12), the contingency table is as below:

                  c1 = {A, B}   c2 = {C, D, E}
C1 = {A, B, C}         2              1
C2 = {D, E}            0              2

Purity = (sum of the max of each row in the contingency matrix) / (total samples) = (2 + 2) / 5 = 0.8
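A short sketch of the purity computation on the same five-example data (the 0/1 label encodings are illustrative):

```python
def purity(truth, pred):
    """Purity = (1/N) * sum over clusters of largest overlap with any class."""
    total = 0
    for c in set(pred):
        # ground-truth labels of the members of algorithm cluster c
        members = [t for t, p in zip(truth, pred) if p == c]
        total += max(members.count(label) for label in set(members))
    return total / len(truth)

truth = [0, 0, 1, 1, 1]   # ground truth {A, B}, {C, D, E}
pred  = [0, 0, 0, 1, 1]   # clustering {A, B, C}, {D, E}
p = purity(truth, pred)
```

Here `p` is (2 + 2) / 5 = 0.8, as in the slide.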
Types of Clustering Algorithms
▪ The clustering algorithms are broadly classified into five categories:
1. Partitioning-based Methods
2. Hierarchical-based Methods
3. Density-based Methods
4. Grid-based Methods
5. Model-based Methods
Partitioning-based Methods
▪ These methods partition the objects into k clusters, and each partition forms one cluster.
▪ These methods optimize an objective criterion, typically a similarity function.
▪ The quality of clustering is measured by an objective function. This objective function
is designed to achieve high intra-cluster similarity and low inter-cluster similarity.
▪ Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
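As an illustration of a partitioning method, here is a minimal pure-Python K-means sketch (Lloyd's algorithm); the sample points, iteration count, and seed are arbitrary choices, not part of the slides:

```python
from math import dist
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # initialize from random points
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(v) / len(c) for v in zip(*c))
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, clus = kmeans(pts, 2)
```

On this well-separated toy data the algorithm recovers the two groups of three points each; in general K-means only finds a local optimum of its objective.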
Hierarchical-based Methods
▪ These methods perform a hierarchical breakdown of a given dataset which can be
classified as agglomerative and divisive.
▪ In agglomerative methods, initially, each object is regarded as a cluster on its own and
they are then successively merged till they satisfy a termination condition.
▪By contrast, in the divisive approach, initially, the set of objects is considered as a single
large cluster and is successively split up into smaller clusters until a termination
condition is satisfied.
▪The former is also called the bottom-up approach whereas the latter is called the top-
down approach.
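The agglomerative (bottom-up) procedure can be sketched with single linkage, one common choice of merge criterion (the linkage choice and toy points are assumptions, not specified in the slides):

```python
from math import dist

def single_linkage(points, k):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]          # start: every point is a cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest point-to-point distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(p, q)
                               for p in clusters[ij[0]]
                               for q in clusters[ij[1]])
        )
        clusters[i] += clusters.pop(j)        # merge and continue
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clus = single_linkage(pts, 2)
```

Stopping when k clusters remain is the termination condition mentioned above; a divisive method would instead start from one cluster and split.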
Density-based Methods
▪ Density-based methods discover clusters based on density.
▪ These methods can find clusters of arbitrary shapes.
▪ Here, a cluster is kept growing as long as the number of data objects in the
neighborhood exceeds some threshold value.
▪ These methods have good accuracy and the ability to merge two clusters.
▪ Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
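A compact sketch of the DBSCAN idea described above; the eps and min_pts values and sample points are illustrative, and real implementations (e.g. scikit-learn's) add spatial indexing for speed:

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow a cluster from each unvisited core point
    (a point with at least min_pts neighbours within eps, itself included);
    points reachable from no core point are labelled noise (-1)."""
    def neighbours(i):
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise; may become a border point
            continue
        cluster += 1                 # start a new cluster from this core point
        labels[i] = cluster
        queue = list(seeds)
        while queue:                 # keep growing while the density threshold holds
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point claimed by the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:   # j is itself a core point: expand further
                queue.extend(nb)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
lab = dbscan(pts, eps=2.0, min_pts=2)
```

The two dense groups become clusters 0 and 1, while the isolated point (50, 50) stays labelled -1 (noise), showing how the cluster grows only while the neighbourhood count exceeds the threshold.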
Clustering Algorithms
