03 Hierarchical Clustering
Introduction - Hierarchical Clustering
How it works
Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the
basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now
have N clusters, each containing just one item. Let the distances (similarities)
between the clusters be the same as the distances (similarities) between the
items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single
cluster, so that you now have one cluster fewer.
3. Compute distances (similarities) between the new cluster and each of the old
clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N
(a minimal code sketch of these steps follows).
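The loop below is a minimal Python sketch of these four steps (an illustration added here, not from the original slides). It assumes a precomputed N x N distance matrix D and a pluggable function cluster_dist that measures the distance between two clusters; that function is exactly where the linkage variants on the next slide differ.

# Minimal sketch of the four steps above.
# D is an N x N distance matrix (list of lists or similar);
# cluster_dist(D, a, b) returns the distance between clusters a and b,
# each given as a list of item indices.
def agglomerative(D, cluster_dist):
    # Step 1: every item starts in its own cluster.
    clusters = [[i] for i in range(len(D))]
    merges = []  # (merge level, members of the new cluster)
    # Step 4: keep going until a single cluster of size N remains.
    while len(clusters) > 1:
        # Step 2: find the closest (most similar) pair of clusters...
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(D, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # ...and merge them into a single cluster.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((d, merged))
        # Step 3 happens implicitly: cluster_dist is re-evaluated against
        # the new cluster on the next pass through the loop.
    return merges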
Contd..
• Step 3 can be done in different ways, which is what distinguishes single-
linkage from complete-linkage and average-linkage clustering.
• In single-linkage clustering (also called the connectedness or minimum method), we consider
the distance between one cluster and another cluster to be equal to the shortest distance from
any member of one cluster to any member of the other cluster. If the data consist of
similarities, we consider the similarity between one cluster and another cluster to be equal to
the greatest similarity from any member of one cluster to any member of the other cluster.
• In complete-linkage clustering (also called the diameter or maximum method), we consider
the distance between one cluster and another cluster to be equal to the greatest distance
from any member of one cluster to any member of the other cluster.
• In average-linkage clustering, we consider the distance between one cluster and another
cluster to be equal to the average distance from any member of one cluster to any member of
the other cluster.
A variation on average-linkage clustering is the UCLUS method of R. D'Andrade (1978),
which uses the median distance; the median is far less sensitive to outliers than the average.
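As a hedged illustration (the function names are mine, not from the slides), the three linkage rules, plus a UCLUS-style median variant, can be written as small cluster-distance functions that plug into the earlier sketch:

# Inter-cluster distance for each linkage rule described above.
# D is an N x N distance matrix; a and b are lists of item indices.
def single_linkage(D, a, b):
    # shortest distance between any member of a and any member of b
    return min(D[i][j] for i in a for j in b)

def complete_linkage(D, a, b):
    # greatest distance between any member of a and any member of b
    return max(D[i][j] for i in a for j in b)

def average_linkage(D, a, b):
    # mean distance over all cross-cluster pairs
    return sum(D[i][j] for i in a for j in b) / (len(a) * len(b))

def median_linkage(D, a, b):
    # UCLUS-style variant: median distance over all cross-cluster pairs
    import statistics
    return statistics.median(D[i][j] for i in a for j in b)

For example, agglomerative(D, single_linkage) runs the earlier sketch with the single-linkage rule.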
Single-Linkage Clustering: The Algorithm
• Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
• Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to
d[(r),(s)] = min d[(i),(j)], where the minimum is taken over all pairs of clusters in the current clustering.
• Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering
m. Set the level of this clustering to
L(m) = d[(r),(s)]
• Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row
and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old
cluster (k) is defined as d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
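A possible NumPy sketch of this matrix-update formulation is shown below (function and variable names are illustrative assumptions, not part of the algorithm's original statement); the key line implements d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).

import numpy as np

def single_linkage_levels(D, labels):
    """Single linkage by repeatedly updating the proximity matrix D.
    D: symmetric NumPy array of pairwise distances; labels: cluster names.
    Returns the merges as (level L(m), name of the new cluster)."""
    D = np.asarray(D, dtype=float).copy()
    labels = list(labels)
    merges = []
    while len(labels) > 1:
        # Least dissimilar pair (r), (s): d[(r),(s)] = min over all pairs
        masked = D + np.diag(np.full(len(labels), np.inf))
        r, s = np.unravel_index(np.argmin(masked), masked.shape)
        level = D[r, s]
        # Proximity of the new cluster (r,s) to each old cluster (k):
        # d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)])
        new_row = np.minimum(D[r], D[s])
        new_label = labels[r] + "/" + labels[s]
        keep = [k for k in range(len(labels)) if k not in (r, s)]
        new_row = new_row[keep]
        # Delete the rows/columns of (r) and (s), add the new cluster.
        D = D[np.ix_(keep, keep)]
        D = np.vstack([np.hstack([D, new_row[:, None]]),
                       np.hstack([new_row, [0.0]])])
        labels = [labels[k] for k in keep] + [new_label]
        merges.append((level, new_label))
    return merges

Run on the city distances of the following example, this should return the merge levels 138, 219, 255, 268 and 295.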
Input distance matrix between the cities:
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
• The nearest pair of cities is MI and TO, at distance 138. These are merged into a single
cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence
number is m = 1.
• Then we compute the distance from this new compound object to all other objects. Under single
linkage, the distance from "MI/TO" to RM is 564, the distance from MI to RM (the shorter of the
two original distances), and so on.
BA FI MI/TO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MI/TO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
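As a quick check of the update rule in this concrete case (using the MI and TO rows of the input matrix shown earlier), the MI/TO row is simply the element-wise minimum of the MI and TO rows:

import numpy as np

# MI and TO rows of the input matrix (distances to BA, FI, NA, RM)
mi = np.array([877, 295, 754, 564])
to = np.array([996, 400, 869, 669])

# Single-linkage update: d[(k),(MI,TO)] = min(d[(k),MI], d[(k),TO])
print(np.minimum(mi, to))  # [877 295 754 564] -> off-diagonal entries of the MI/TO row above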
Contd..
• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m=2
BA FI MI/TO NA/RM
BA 0 662 877 255
FI 662 0 295 268
MI/TO 877 295 0 564
NA/RM 255 268 564 0
Contd..
• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called
BA/NA/RM
L(BA/NA/RM) = 255
m=3
BA/NA/RM FI MI/TO
BA/NA/RM 0 268 564
FI 268 0 295
MI/TO 564 295 0
Contd..
• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster
called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m=4
BA/FI/NA/RM MI/TO
BA/FI/NA/RM 0 295
MI/TO 295 0
• Finally, we merge the last two clusters at level 295.
• The process is summarized by the following hierarchical tree:
[Dendrogram: MI and TO merge at level 138, NA and RM at 219, BA joins NA/RM at 255, FI joins at 268, and the final merge with MI/TO occurs at level 295.]
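For reference, the same merge levels can be reproduced with SciPy's hierarchical clustering routines, assuming the input distance matrix shown earlier; this verification is a sketch added here, not part of the original example.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]], dtype=float)

# linkage() expects the condensed (upper-triangular) form of the matrix
Z = linkage(squareform(D), method="single")
print(Z[:, 2])                 # merge levels: [138. 219. 255. 268. 295.]
# dendrogram(Z, labels=cities)   # draws the tree summarized above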
Difference between K Means and hierarchical clustering
• Hierarchical clustering does not handle big data as well as K Means does: the time
complexity of K Means is linear in the number of points, i.e. O(n), while that of standard
agglomerative hierarchical clustering is at least quadratic, i.e. O(n²).
• K Means starts from a random choice of initial clusters, so running the algorithm multiple
times may produce different results, whereas the results of hierarchical clustering are
reproducible.
• K Means is found to work well when the clusters are roughly hyper-spherical
(like a circle in 2D or a sphere in 3D).
• K Means requires prior knowledge of K, i.e. the number of clusters you want to
divide your data into. With hierarchical clustering, you can instead stop at whatever number
of clusters you find appropriate by interpreting the dendrogram (see the sketch below).
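To illustrate the last point, a hierarchical clustering only needs to be computed once and can then be cut at any number of clusters; the sketch below uses SciPy on hypothetical random data (the data and parameter choices are assumptions for illustration only).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: the hierarchical clustering is computed only once...
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))
Z = linkage(points, method="single")

# ...then the dendrogram can be cut into any number of clusters,
# without rerunning the algorithm (unlike K Means, which needs K up front).
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, labels)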
Applications of Clustering
• Clustering has a large number of applications spread across various domains. Some of
the most popular applications of clustering are:
• Recommendation engines
• Market segmentation
• Social network analysis
• Search result grouping
• Medical imaging
• Image segmentation
• Anomaly detection
Practice question
[Distance matrix over the cities BOS, NY, DC, MIA, CHI, SEA, SF, LA and DEN]
THANK YOU