L08 Hierarchical Agglomerative Clustering
Clustering
Hierarchical Agglomerative Clustering
• Start with every instance as its own point on an Income vs. Age scatter plot.
• Find the closest pair of points and merge them into a cluster.
• Find the next closest pair and merge; keep merging closest pairs.
• If the closest pair is two clusters, merge the two clusters.
• Keep merging closest pairs and clusters: the number of clusters drops step by step (6, 5, 4, 3, 2) until only one cluster remains. A small sketch of this merge process follows the list.
[Figures: Income vs. Age scatter plots showing each merge step]
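A minimal sketch of this merge order using SciPy's hierarchical clustering; the handful of Age/Income values below is invented purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Each row is one person: [age, income] (made-up illustrative values)
X = np.array([
    [23, 28_000],
    [25, 31_000],
    [31, 52_000],
    [33, 55_000],
    [45, 90_000],
    [47, 95_000],
])

# linkage() repeatedly merges the two closest points/clusters.
# Each row of Z records one merge: [index_a, index_b, merge_distance, new_cluster_size]
Z = linkage(X, method="single", metric="euclidean")
print(Z)
```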
Hierarchical Agglomerative Clustering
The merge sequence can also be tracked on the dendrogram: at each step the cluster assignments (Income vs. Age) are shown next to the dendrogram (Cluster on the x-axis, distance on the y-axis) as the current number of clusters goes from 5 to 4, 3, and 2. A sketch of cutting the dendrogram at a chosen number of clusters follows.
[Figures: Income vs. Age scatter plots paired with the corresponding dendrograms at each step]
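A minimal sketch, assuming the same made-up Age/Income points as before, using SciPy's fcluster to cut the tree into k clusters and dendrogram to draw it.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Same illustrative [age, income] points as in the previous sketch
X = np.array([[23, 28_000], [25, 31_000], [31, 52_000],
              [33, 55_000], [45, 90_000], [47, 95_000]])
Z = linkage(X, method="single")

# Cut the tree at successive cluster counts, as in the slides
for k in (5, 4, 3, 2):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(f"k = {k}: {labels}")

# The dendrogram shows every merge; the y-axis is the merge distance
dendrogram(Z)
plt.xlabel("Cluster")
plt.ylabel("distance")
plt.show()
```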
Agglomerative Clustering
• Start with N groups, each containing one instance, and merge the two closest groups at each iteration.
• Distance between two groups $G_i$ and $G_j$:
  • Single-link: $d(G_i, G_j) = \min_{x \in G_i,\, y \in G_j} d(x, y)$
  • Complete-link: $d(G_i, G_j) = \max_{x \in G_i,\, y \in G_j} d(x, y)$
  • Average-link: $d(G_i, G_j) = \frac{1}{|G_i|\,|G_j|} \sum_{x \in G_i} \sum_{y \in G_j} d(x, y)$; centroid: $d(G_i, G_j) = d(\mu_i, \mu_j)$, where $\mu_i$, $\mu_j$ are the group means.
A small numerical sketch of these group distances follows.
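A quick numerical illustration of the group distances above; the two small groups of 2-D points are invented for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative groups Gi and Gj (made-up points)
Gi = np.array([[1.0, 2.0], [2.0, 1.0]])
Gj = np.array([[8.0, 9.0], [9.0, 8.0], [10.0, 10.0]])

D = cdist(Gi, Gj)            # all pairwise distances between the two groups
single   = D.min()           # single-link:   closest pair
complete = D.max()           # complete-link: farthest pair
average  = D.mean()          # average-link:  mean over all pairs
centroid = np.linalg.norm(Gi.mean(axis=0) - Gj.mean(axis=0))  # distance between group means
print(single, complete, average, centroid)
```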
Hierarchical Linkage Types
• Single linkage: minimum pairwise distance between clusters
• Complete linkage: maximum pairwise distance between clusters
• Average linkage: average pairwise distance between clusters
• Ward linkage: merge the pair of clusters that gives the smallest increase in total within-cluster variance (inertia)
[Figures: Income vs. Age scatter plots illustrating each linkage type]

Example: Single-Link Clustering
[Figure: dendrogram produced by single-link clustering]

A scikit-learn sketch of these linkage options follows.
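A short sketch of the same linkage choices using scikit-learn's AgglomerativeClustering, on the same made-up Age/Income points as in the earlier sketches.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative [age, income] points (same invented data as before)
X = np.array([[23, 28_000], [25, 31_000], [31, 52_000],
              [33, 55_000], [45, 90_000], [47, 95_000]])

for linkage in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(f"{linkage:>8}: {labels}")
```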
K-means vs Hierarchical Clustering
K-means Clustering:
• can handle big data well
• time complexity is linear, i.e., O(n)
• starts with a random choice of clusters, so the results of running the algorithm multiple times may differ
• works well when the clusters are hyperspherical (a circle in 2D, a sphere in 3D)
• requires prior knowledge of K, i.e., the number of clusters
Hierarchical Clustering:
• cannot handle big data well
• time complexity is quadratic, i.e., O(n²)
• results are reproducible
• can stop at whatever number of clusters is found appropriate by interpreting the dendrogram
A rough side-by-side sketch of the two algorithms follows.
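A rough sketch comparing the two algorithms on synthetic data; the blob parameters are arbitrary. K-means needs K and a random initialisation, while agglomerative clustering is deterministic and the tree can be cut afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Three synthetic blobs (arbitrary illustrative parameters)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10).fit(X)        # random starts; labels may vary per run
hc = AgglomerativeClustering(n_clusters=3).fit(X)  # same sequence of merges every run
print(km.labels_[:10], hc.labels_[:10])
```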
Other Types of Clustering
Mini-Batch K-Means, Affinity Propagation, Mean Shift, Spectral Clustering, Ward, DBSCAN
DBSCAN
Clustering
Density-Based Clustering Algorithms
• Density-Based Clustering
  • identifies distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  • the base algorithm for density-based clustering
  • can discover clusters of different shapes and sizes from a large amount of data that contains noise and outliers
DBSCAN Algorithm
1. Start with a random unvisited data point. All points within a distance Ɛ (epsilon) of it form its neighborhood.
2. If the neighborhood contains at least a minimum number of points, clustering starts and the current point becomes the first point of a new cluster; otherwise the point is labeled as noise. In either case, the point is marked as visited.
3. All points within distance Ɛ of a point in the cluster become part of the same cluster. Repeat this for every new point added to the cluster.
4. Continue until every point in the Ɛ-neighborhood of the cluster has been visited and labeled.
5. Once the cluster is complete, start again with a new unvisited point, leading to the discovery of further clusters or noise. The process ends when every point is marked as either part of a cluster or noise. (A minimal sketch of these steps appears after the source link below.)
Source: https://www.digitalvidya.com/blog/the-top-5-clustering-algorithms-data-scientists-should-know/
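A minimal from-scratch sketch that mirrors the five steps above; it is only an illustration, not a reference implementation, and the eps/min_pts values in the usage example are arbitrary.

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=5):
    n = len(X)
    labels = np.full(n, -1)           # -1 = noise (may be relabelled later)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region_query(i):
        # Step 1: all points within distance eps of point i
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbours = region_query(i)
        if len(neighbours) < min_pts:
            continue                   # Step 2: too few neighbours -> stays noise
        labels[i] = cluster_id         # Step 2: point i starts a new cluster
        queue = list(neighbours)
        while queue:                   # Steps 3-4: grow through eps-neighbourhoods
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id        # border/core point joins the cluster
            if not visited[j]:
                visited[j] = True
                j_neighbours = region_query(j)
                if len(j_neighbours) >= min_pts:
                    queue.extend(j_neighbours)  # j is a core point; keep expanding
        cluster_id += 1                # Step 5: continue with the next unvisited point
    return labels

# Example usage on made-up data: two synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
print(np.unique(dbscan(X, eps=0.5, min_pts=5)))
```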
Pros vs Cons of DBSCAN Clustering
• Pros
  • does not require a pre-set number of clusters, unlike many other clustering algorithms
  • identifies outliers as noise, unlike the Mean-Shift method, which forces such points into a cluster despite their different characteristics
  • finds arbitrarily shaped and sized clusters quite well
• Cons
  • not very effective when clusters have varying densities: the distance threshold Ɛ and the minimum number of points for identifying a neighborhood have to change as the density level changes
  • for high-dimensional data, determining the distance threshold Ɛ becomes a challenging task (see the sketch below)
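To illustrate the sensitivity to Ɛ mentioned above, a short scikit-learn sketch on synthetic data with two blobs of different density; the eps values tried are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two synthetic blobs of different density (illustrative only)
X = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(4, 1.0, (100, 2))])

for eps in (0.1, 0.3, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
    n_noise = int((labels == -1).sum())
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```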
Reference:
• https://www.guru99.com/unsupervised-machine-learning.html
• https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-unsupervised-learning#dimension-reduction
• https://www.ibm.com/cloud/learn/unsupervised-learning
• https://levelup.gitconnected.com/importance-of-data-preprocessing-and-scaling-in-machine-learning-21db1d4377ec
• https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
• https://www.digitalvidya.com/blog/the-top-5-clustering-algorithms-data-scientists-should-know/