Using Hierarchical Clustering as an Unsupervised Algorithm for ML
(Lecture 10)
Machine Learning for Real World Applications
Date 16-Aug-2021
Copyright © 2021 Tata Consultancy Services Limited
Hierarchical Clustering Algorithms
• Hierarchical clustering algorithms can overcome some of the disadvantages of
partitional clustering methods.
Partitional Clustering        Hierarchical Clustering
Requires value of K           Flexible
Non-deterministic             Deterministic
Hierarchical Clustering Algorithms
Hierarchical clustering approaches: Agglomerative and Divisive
• Agglomerative approaches start with singleton clusters at the bottom level and
continue merging two clusters at a time
— builds a bottom-up hierarchy
• Divisive approaches start with all the data in a single cluster and repeatedly
split it into smaller groups
— builds a top-down hierarchy
Hierarchical Clustering Algorithms
• A cluster hierarchy is also called a dendrogram.
[Figure: a dendrogram, with Level 0 (the apex level) at the top and the singletons at the bottom]
Hierarchical Clustering Algorithms
• The hierarchy can be cut at any level to obtain a desired number of clusters
[Figure: the same dendrogram cut at different levels, yielding k = 1, k = 4, or k = 16 clusters, down to the singletons]
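As a concrete sketch of this idea (assuming Python with SciPy; the toy data and variable names below are illustrative, not from the lecture), the hierarchy can be built once with `linkage` and then cut at several levels with `fcluster`:

```python
# Sketch: build the hierarchy once, then cut it at different levels.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))              # 16 toy points in 2-D (illustrative)

Z = linkage(X, method="single")           # full dendrogram as a linkage matrix

for k in (1, 4, 16):                      # cut the same tree at different levels
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, "clusters ->", labels)
```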
Agglomerative Clustering
Basic steps for agglomerative clustering
• A dissimilarity matrix is constructed using a particular proximity measure.
— All data points are represented at the bottom of the dendrogram
• Repeat until the final maximal cluster is obtained:
1. The closest sets of clusters are merged at each level
2. The dissimilarity matrix is updated
Hierarchical Clustering Algorithms
• Dissimilarity Matrix
      P1    P2    P3    P4
P1   0.00  0.20  0.15  0.30
P2   0.20  0.00  0.40  0.50
P3   0.15  0.40  0.00  0.10
P4   0.30  0.50  0.10  0.00
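A minimal sketch of how such a dissimilarity matrix can be computed with SciPy (the four 2-D points are made up for illustration; any metric supported by `pdist` could be used):

```python
# Sketch: build a dissimilarity (distance) matrix for a set of points.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0],    # P1  (illustrative coordinates)
                   [0.2, 0.0],    # P2
                   [0.1, 0.1],    # P3
                   [0.3, 0.2]])   # P4

condensed = pdist(points, metric="euclidean")   # pairwise distances, condensed form
D = squareform(condensed)                       # full symmetric matrix, zero diagonal
print(np.round(D, 2))
```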
Agglomerative Clustering
Algorithm for Agglomerative Hierarchical Clustering
• Compute the dissimilarity matrix between all the data points.
• Repeat until the final maximal cluster is obtained:
1. Merge the two closest clusters Ca and Cb as Ca∪b = Ca ∪ Cb
Set the new cluster’s cardinality as Na∪b = Na + Nb
2. Insert a new row and column containing the distances between the new
cluster Ca∪b and the remaining clusters
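A from-scratch sketch of this loop for the single-link case (written for clarity, not efficiency; rather than inserting new rows and columns, it simply recomputes the cluster-to-cluster distance at each step, which is equivalent). The function name and the 0-based indices for P1..P4 are my own choices:

```python
# Sketch: naive single-link agglomerative clustering on a dissimilarity matrix.
import numpy as np

def agglomerate_single_link(D):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [{i} for i in range(len(D))]     # start from singletons
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters (single link: min over cross pairs)
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# dissimilarity matrix from the lecture example (rows/cols = P1..P4, 0-indexed)
D = np.array([[0.00, 0.20, 0.15, 0.30],
              [0.20, 0.00, 0.40, 0.50],
              [0.15, 0.40, 0.00, 0.10],
              [0.30, 0.50, 0.10, 0.00]])
for left, right, dist in agglomerate_single_link(D):
    print(left, "+", right, "merged at", dist)
```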
Proximity Measures in Agglomerative Clustering
• Single Link Agglomerative Clustering
• Complete Link Agglomerative Clustering
• Group Averaged Agglomerative Clustering
• Centroid Agglomerative Clustering
• Ward’s Agglomerative Clustering
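For reference, these five proximity measures map onto `method` options of SciPy's `linkage` routine ("single", "complete", "average", "centroid", "ward"); note that SciPy's "average" averages only over cross-cluster pairs, a slightly different convention from the GAAC formula shown later. A quick sketch on made-up data:

```python
# Sketch: the five proximity measures as linkage "method" options in SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))                     # illustrative data

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                # (n-1) x 4 merge table
    print(method, "-> final merge height:", round(Z[-1, 2], 3))
```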
Single Link Agglomerative Clustering
• Here, the similarity of two clusters is the similarity between their most similar
(nearest neighbour) members.
• This method gives more importance to the regions where clusters are closest.
• Sensitive to noise and outliers in the data
Complete Link Agglomerative Clustering
• Here, the similarity of two clusters is the similarity of their most dissimilar
members.
• The cluster pair whose merger would result in the smallest diameter is the one
chosen for merger.
• Obtains compact clusters but, like single link, is sensitive to outliers
Single Link Agglomerative Clustering
• Example dissimilarity matrix:
      P1    P2    P3    P4
P1   0.00  0.20  0.15  0.30
P2   0.20  0.00  0.40  0.50
P3   0.15  0.40  0.00  0.10
P4   0.30  0.50  0.10  0.00
• The smallest entry is d(3,4) = 0.10, so P3 and P4 are merged first. The subsequent single-link distances are:
dmin((3,4), 1) = min(d(3,1), d(4,1)) = 0.15
dmin((3,4), 2) = min(d(3,2), d(4,2)) = 0.40
dmin((3,4,1), 2) = min(d(3,2), d(4,2), d(1,2)) = 0.20
[Figure: single-link dendrogram over leaves 3, 4, 1, 2 with merge heights 0.10, 0.15, 0.20]
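The same merges can be reproduced with SciPy by feeding it the precomputed matrix (a sketch; `squareform` converts the square matrix into the condensed form that `linkage` expects):

```python
# Sketch: single-link clustering on the lecture's dissimilarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0.00, 0.20, 0.15, 0.30],   # rows/cols correspond to P1..P4
              [0.20, 0.00, 0.40, 0.50],
              [0.15, 0.40, 0.00, 0.10],
              [0.30, 0.50, 0.10, 0.00]])

Z = linkage(squareform(D), method="single")
print(Z[:, 2])                            # merge heights: 0.1, 0.15, 0.2
```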
Complete Link Agglomerative Clustering
• Example (same dissimilarity matrix):
      P1    P2    P3    P4
P1   0.00  0.20  0.15  0.30
P2   0.20  0.00  0.40  0.50
P3   0.15  0.40  0.00  0.10
P4   0.30  0.50  0.10  0.00
• P3 and P4 are again merged first at d(3,4) = 0.10. The complete-link distances are then:
dmax((3,4), 1) = max(d(3,1), d(4,1)) = 0.30
dmax((3,4), 2) = max(d(3,2), d(4,2)) = 0.50
Since d(1,2) = 0.20 is smaller than both, P1 and P2 are merged next, and finally
dmax((3,4), (1,2)) = max(d(3,1), d(3,2), d(4,1), d(4,2)) = 0.50
[Figure: complete-link dendrogram over leaves 3, 4, 1, 2 with merge heights 0.10, 0.20, 0.50]
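Again as a sketch, switching the linkage method reproduces the complete-link merge heights on the same matrix:

```python
# Sketch: complete-link clustering on the same dissimilarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0.00, 0.20, 0.15, 0.30],   # rows/cols correspond to P1..P4
              [0.20, 0.00, 0.40, 0.50],
              [0.15, 0.40, 0.00, 0.10],
              [0.30, 0.50, 0.10, 0.00]])

Z = linkage(squareform(D), method="complete")
print(Z[:, 2])                            # merge heights: 0.1, 0.2, 0.5
```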
Group Averaged Agglomerative Clustering (GAAC)
• This measure considers all pairs of points drawn from the two clusters
• The distance between two clusters is the average of all the pairwise distances
between the data points in the two clusters:
S_GAAC(Ca, Cb) = [1 / ((Na + Nb)(Na + Nb − 1))] · Σ_{i ∈ Ca∪Cb} Σ_{j ∈ Ca∪Cb, j ≠ i} d(i, j)
• This measure is expensive to compute
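A direct (and deliberately naive) sketch of this definition, assuming Euclidean distances; the function name and toy clusters are illustrative:

```python
# Sketch: group-averaged dissimilarity, averaged over all ordered pairs of
# distinct points in the union of the two clusters (as in the formula above).
import numpy as np

def gaac_distance(A, B):
    U = np.vstack([A, B])                     # union of the two clusters
    n = len(U)
    diff = U[:, None, :] - U[None, :, :]      # all pairwise differences
    d = np.sqrt((diff ** 2).sum(axis=-1))     # Euclidean distance matrix
    return d.sum() / (n * (n - 1))            # diagonal is zero, so this averages i != j

A = np.array([[0.0, 0.0], [0.1, 0.0]])        # illustrative cluster Ca
B = np.array([[1.0, 1.0], [1.1, 1.0]])        # illustrative cluster Cb
print(round(gaac_distance(A, B), 3))
```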
Centroid-based Agglomerative Clustering
• This measure calculates the similarity between two clusters by measuring the
similarity between their centroids.
Ward’s Agglomerative Clustering
• Ward’s criterion for agglomeration
It uses the K-means squared error (SSE) criterion to define the distance between clusters.
— For any two clusters Ca and Cb, Ward’s criterion is the increase in the SSE
(sum of squared error) when they are merged into Ca ∪ Cb:
W(Ca, Cb) = [Na Nb / (Na + Nb)] · d(ca, cb)
where ca and cb are the centroids of the two clusters Ca and Cb, and d(ca, cb) is the
squared Euclidean distance between them.
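A small sketch that checks this identity numerically on made-up clusters (the helper names are mine); the formula should match the SSE increase computed directly:

```python
# Sketch: Ward's merging cost equals the increase in within-cluster SSE.
import numpy as np

def ward_cost(A, B):
    ca, cb = A.mean(axis=0), B.mean(axis=0)       # cluster centroids
    na, nb = len(A), len(B)
    return (na * nb) / (na + nb) * np.sum((ca - cb) ** 2)

def sse(C):
    return np.sum((C - C.mean(axis=0)) ** 2)      # within-cluster squared error

A = np.array([[0.0, 0.0], [0.2, 0.0]])            # illustrative clusters
B = np.array([[1.0, 1.0], [1.2, 1.0]])
merged = np.vstack([A, B])
print(round(ward_cost(A, B), 4))                  # via the formula
print(round(sse(merged) - sse(A) - sse(B), 4))    # direct SSE increase (same value)
```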
Divisive Hierarchical clustering
Divisive Hierarchical Clustering
• Divisive hierarchical clustering is a top-down approach
— the procedure starts at the root with all the data points
— the dendrogram is built through a recursive split of clusters
• The divisive approach is more efficient than agglomerative clustering when there is
no need to generate a complete hierarchy
• To make a split decision, all the points have to be examined.
Therefore, divisive clustering is considered a global approach.
Issues in Divisive Clustering
• Splitting Criterion:
Ward’s K-means squared error criterion (SSE) is used here.
— The greater the reduction in SSE, the better the split
However, the SSE criterion can be applied only to numerical data
Issues in Divisive Clustering
• Evaluating Ward’s criterion takes time
• Alternatively, we can use the K-means approach with K = 2
— This is the bisecting K-means
— Obtain a few good splits using K-means and choose the best one
Issues in Divisive Clustering
• Choosing the cluster to split
Check the mean squared errors of the clusters
— Choose the one with the largest mean squared error
This will ensure compact clusters in the dendrogram
Divisive Hierarchical Clustering Algorithm
• Start with the root node consisting of all the data points
• Repeat
— Split the parent node into two parts, C1 and C2, using bisecting K-means
so as to maximise Ward’s distance W(C1, C2)
— Grow the dendrogram. Among the current leaf clusters, choose the one with the
highest squared error as the next node to split
• Until singleton leaves are obtained
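A compact sketch of this procedure, assuming scikit-learn's KMeans is available for the bisecting step; it stops at a chosen number of leaves rather than going all the way down to singletons, and it ranks clusters by their SSE (as on the previous slide) rather than evaluating W(C1, C2) explicitly. All names and data are illustrative:

```python
# Sketch: divisive clustering via bisecting K-means.
import numpy as np
from sklearn.cluster import KMeans

def sse(C):
    return np.sum((C - C.mean(axis=0)) ** 2)       # within-cluster squared error

def bisecting_kmeans(X, n_leaves=4, seed=0):
    clusters = [X]                                  # root node: all the data points
    while len(clusters) < n_leaves:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters.pop(worst)                # split the cluster with largest SSE
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(target)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2))
               for c in ((0, 0), (0, 2), (2, 0), (2, 2))])   # four toy blobs
for c in bisecting_kmeans(X):
    print(len(c), "points, SSE =", round(sse(c), 3))
```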
Minimum Spanning Tree-Based clustering
Minimum Spanning Tree-based clustering
• Given a weighted graph, a minimum spanning tree (MST) is an acyclic subgraph
— that covers all the vertices
— and has the minimum total edge weight
• The minimum spanning tree of a weighted graph can be found using
— Prim’s algorithm
— Kruskal’s algorithm
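SciPy also ships an MST routine; a minimal sketch on a small made-up graph (not the graph drawn on the next slide):

```python
# Sketch: minimum spanning tree of a small weighted graph with SciPy.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# symmetric weight matrix of a graph on 4 vertices; 0 means "no edge"
W = np.array([[0, 2, 6, 0],
              [2, 0, 3, 5],
              [6, 3, 0, 4],
              [0, 5, 4, 0]])

mst = minimum_spanning_tree(W)      # sparse matrix holding only the MST edges
print(mst.toarray())                # kept edges: (0,1)=2, (1,2)=3, (2,3)=4
```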
Weighted Graph
[Figure: a weighted graph on vertices P1–P8 with edge weights between pairs of points]
Minimum Spanning Tree
[Figure: the same graph with the edges of its minimum spanning tree highlighted]
Minimum Spanning Tree-based clustering
• For clustering purposes, the edge weights can be taken as the Euclidean distances
between pairs of data points.
• Given an MST, a divisive clustering algorithm has the following steps:
— Remove the edge with the largest weight to get two clusters
— Remove the next largest to get three clusters
— and so on
Minimum Spanning Tree-based clustering
• If we are looking for K clusters, remove the K − 1 largest-weight edges one by one
• This will give K connected components.
• Each edge removal gives a finer split.
• This can detect clusters with non-spherical shapes
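Putting the last few slides together, a sketch of MST-based clustering with SciPy (Euclidean edge weights, remove the K − 1 heaviest MST edges, read clusters off the connected components); the data and function name are illustrative:

```python
# Sketch: MST-based divisive clustering.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(X, k):
    D = squareform(pdist(X))                    # complete graph of Euclidean distances
    mst = minimum_spanning_tree(D).toarray()    # the n-1 MST edges as a dense matrix
    if k > 1:
        for w in np.sort(mst[mst > 0])[-(k - 1):]:
            mst[mst == w] = 0                   # drop the k-1 largest-weight edges
    _, labels = connected_components(mst, directed=False)
    return labels                               # one label per connected component

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(15, 2))
               for c in ((0, 0), (3, 0), (0, 3))])   # three well-separated toy blobs
print(mst_clusters(X, k=3))                     # three connected components
```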
Minimum Spanning Tree-based clustering
• Instead of removing the largest weighted edge, we can also remove the edge
with the highest inconsistency measure.
— An inconsistent edge is one whose weight is much higher than the average weight
of the edges in its neighbourhood
CURE (Clustering Using Representatives)
CURE (Clustering Using Representatives)
• In this method, a cluster is represented using a set of well-scattered
representative points.
• The distance between two clusters is computed as the average distance between
the representative points.
• Choosing scattered points helps in capturing arbitrary shapes of clusters.
CHAMELEON
CHAMELEON
• The algorithm begins with an initial partitioning obtained by
— constructing a K-nearest-neighbour graph
— applying graph partitioning
• Agglomerative clustering
— The algorithm uses two measures to decide which clusters to merge
♦ relative interconnectivity of the two clusters
♦ relative closeness of the two clusters
— both measures capture local information about the clusters
Thank You