L08 Hierarchical Agglomerative Clustering


Hierarchical Agglomerative Clustering
Hierarchical Agglomerative Clustering
• builds a hierarchy of clusters
• starts with every data point assigned to a cluster of its own
• at each step, the two nearest clusters are merged into a single cluster
• the algorithm terminates when only one cluster is left
• the results of hierarchical clustering can be shown using a dendrogram
Source: https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
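The slides contain no code; as a minimal sketch of the same procedure, the snippet below (assuming SciPy and made-up age/income values) builds the merge hierarchy and plots the dendrogram:

```python
# Minimal sketch (not from the slides): agglomerative clustering with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy 2-D data, e.g. (age, income) pairs -- illustrative values only.
X = np.array([[25, 30], [27, 32], [30, 60], [32, 62],
              [45, 80], [47, 85], [60, 40], [62, 42]])

# linkage() starts with each point in its own cluster and repeatedly
# merges the two closest clusters until a single cluster remains.
Z = linkage(X, method="average")   # each row: (cluster i, cluster j, distance, size)

dendrogram(Z)                      # visualize the merge hierarchy
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()
```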
Hierarchical Agglomerative Clustering: step-by-step example
(each step was illustrated on the slides with a scatter plot of Income vs. Age)
• Find the closest pair of points and merge them into a cluster
• Find the next closest pair and merge
• Keep merging closest pairs
• If the closest pair is two clusters, merge them
• Keep merging closest pairs and clusters
• The current number of clusters shrinks from 6 to 5, 4, 3, 2, and finally 1 as merging continues
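For illustration, the merging loop the slides walk through can be written out naively; the data values and the centroid-based notion of "closest" below are assumptions for the sketch, not taken from the slides:

```python
# Naive sketch (O(n^3), illustrative only) of the merging loop:
# start with singleton clusters and repeatedly merge the closest pair.
import numpy as np
from itertools import combinations

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0],
              [5.2, 4.8], [9.0, 1.0], [8.8, 1.3]])
clusters = [[i] for i in range(len(X))]      # every point starts in its own cluster

def centroid_distance(a, b):
    # distance between cluster centroids (one simple choice of "closeness")
    return np.linalg.norm(X[a].mean(axis=0) - X[b].mean(axis=0))

while len(clusters) > 1:
    # find the closest pair of clusters ...
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda p: centroid_distance(clusters[p[0]], clusters[p[1]]))
    # ... and merge them into one
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    print(f"{len(clusters)} clusters left: {clusters}")
```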
Hierarchical Agglomerative Clustering: stopping conditions
• Condition 1: the desired (correct) number of clusters is reached
• Condition 2: the minimum average cluster distance reaches a set value
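A hedged sketch of how both stopping conditions can be applied with SciPy's fcluster; the sample data and the threshold values (3 clusters, distance 15.0) are illustrative assumptions:

```python
# Cutting the hierarchy two ways: by cluster count or by merge distance.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[25, 30], [27, 32], [30, 60], [32, 62],
              [45, 80], [47, 85], [60, 40], [62, 42]])
Z = linkage(X, method="average")

# Condition 1: stop when the desired number of clusters is reached
labels_by_k = fcluster(Z, t=3, criterion="maxclust")

# Condition 2: stop once merges would exceed a set distance value
labels_by_distance = fcluster(Z, t=15.0, criterion="distance")

print(labels_by_k)
print(labels_by_distance)
```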
Hierarchical Agglomerative Clustering: building the dendrogram
(each stage was illustrated with an Income vs. Age scatter plot alongside a dendrogram whose axes are cluster and merge distance)
• As the current number of clusters drops from 5 to 4, 3, and 2, the dendrogram grows, recording each merge at the distance where it happened
Agglomerative Clustering
• Start with N groups, each containing one instance, and merge the two closest groups at each iteration
• Distance between two groups Gi and Gj:
  • Single-link: $d(G_i, G_j) = \min_{x \in G_i,\, y \in G_j} d(x, y)$
  • Complete-link: $d(G_i, G_j) = \max_{x \in G_i,\, y \in G_j} d(x, y)$
  • Average-link: $d(G_i, G_j) = \frac{1}{|G_i|\,|G_j|} \sum_{x \in G_i} \sum_{y \in G_j} d(x, y)$; centroid linkage uses the distance between the group means
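The three group-distance definitions can be checked with a few lines of NumPy; the helper names and the toy groups Gi and Gj below are assumptions for the sketch:

```python
# Illustrative sketch of the group-distance definitions between two groups.
import numpy as np

def pairwise_distances(Gi, Gj):
    # matrix of Euclidean distances between every point of Gi and every point of Gj
    return np.linalg.norm(Gi[:, None, :] - Gj[None, :, :], axis=-1)

def single_link(Gi, Gj):
    return pairwise_distances(Gi, Gj).min()      # closest pair

def complete_link(Gi, Gj):
    return pairwise_distances(Gi, Gj).max()      # farthest pair

def average_link(Gi, Gj):
    return pairwise_distances(Gi, Gj).mean()     # average over all pairs

Gi = np.array([[1.0, 2.0], [2.0, 2.5]])
Gj = np.array([[8.0, 8.0], [9.0, 7.5]])
print(single_link(Gi, Gj), complete_link(Gi, Gj), average_link(Gi, Gj))
```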
Hierarchical Linkage Types
(each linkage type was illustrated with a scatter plot of Income vs. Age)
• Single linkage: minimum pairwise distance between clusters
• Complete linkage: maximum pairwise distance between clusters
• Average linkage: average pairwise distance between clusters
• Ward linkage: merge the pair of clusters that gives the best inertia, i.e., the smallest increase in within-cluster sum of squares

Example: Single-Link Clustering
(figure: dendrogram of a single-link clustering example)
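These linkage types map onto the linkage parameter of scikit-learn's AgglomerativeClustering; the synthetic blobs below are an assumption used only to show the call:

```python
# Hedged sketch: fitting the same data with each linkage type.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])

for linkage in ("single", "complete", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, np.bincount(labels))          # cluster sizes per linkage type
```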
K-means vs Hierarchical Clustering
• K-means can handle big data well; hierarchical clustering cannot
• K-means time complexity is linear, O(n); hierarchical clustering is quadratic, O(n²)
• K-means starts with a random choice of clusters, so results may differ across runs; hierarchical clustering results are reproducible
• K-means works well when the shape of the clusters is hyper-spherical (a circle in 2D, a sphere in 3D)
• K-means requires prior knowledge of K, the number of clusters; hierarchical clustering can be stopped at whatever number of clusters is found appropriate by interpreting the dendrogram
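As a rough sketch of the practical difference, the snippet below (synthetic data, illustrative parameter values) fits both algorithms: K-means needs K and a random seed, while agglomerative clustering can instead be cut at a distance threshold read off the dendrogram:

```python
# Contrasting the two APIs on the same synthetic data.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.4, size=(100, 2))
               for c in ([0, 0], [4, 4], [0, 4])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # needs K; depends on the seed
agg = AgglomerativeClustering(distance_threshold=3.0,
                              n_clusters=None).fit(X)         # cut the hierarchy by distance

print("k-means clusters:      ", len(np.unique(km.labels_)))
print("agglomerative clusters:", agg.n_clusters_)
```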
Other Types of Clustering
Runtimes shown in the comparison figure (each row is one example dataset):

Mini-Batch K-Means | Affinity Propagation | Mean Shift | Spectral Clustering | Ward | DBSCAN
.01s | 8.17s | .02s | .31s | .21s | .10s
0.1s | 8.17s | .03s | .03s | .26s | .10s
.01s | 8.45s | .03s | .04s | .31s | .11s
.02s | 8.53s | .06s | .08s | .21s | .10s

Reference:
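A rough, illustrative timing sketch in the spirit of the comparison above; the synthetic dataset and parameter choices are assumptions, so the absolute times will not match the table:

```python
# Timing several scikit-learn clusterers on one synthetic dataset.
import time
import numpy as np
from sklearn.cluster import (MiniBatchKMeans, AffinityPropagation, MeanShift,
                             SpectralClustering, AgglomerativeClustering, DBSCAN)
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1500, centers=3, random_state=0)

algorithms = {
    "Mini-Batch K-Means": MiniBatchKMeans(n_clusters=3, n_init=10, random_state=0),
    "Affinity Propagation": AffinityPropagation(random_state=0),
    "Mean Shift": MeanShift(),
    "Spectral Clustering": SpectralClustering(n_clusters=3, random_state=0),
    "Ward": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}
for name, algo in algorithms.items():
    start = time.perf_counter()
    algo.fit(X)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```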
DBSCAN Clustering
Density-Based Clustering Algorithms
• Density-Based Clustering
  • identifies distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  • the base algorithm for density-based clustering
  • can discover clusters of different shapes and sizes in a large amount of data that contains noise and outliers
DBSCAN Algorithm
1. Start with a random unvisited data point. All points within a distance Ɛ (epsilon) of it form its neighborhood.
2. A minimum number of points is required within the neighborhood to start the clustering process. If the neighborhood is dense enough, the current data point becomes the first point of a new cluster; otherwise, it is labeled as noise. In either case, the current point is marked as visited.
3. All points within distance Ɛ become part of the same cluster. Repeat this procedure for every new point added to the cluster.
4. Continue until every point within the Ɛ neighborhood of the cluster has been visited and labeled.
5. On completion, start again with a new unvisited point, which leads to the discovery of further clusters or noise. The process ends when every point has been marked as either belonging to a cluster or as noise.
Source: https://www.digitalvidya.com/blog/the-top-5-clustering-algorithms-data-scientists-should-know/
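A minimal scikit-learn sketch of the procedure described above; the two-moons dataset and the eps/min_samples values are illustrative assumptions:

```python
# DBSCAN on a toy dataset; label -1 marks points classified as noise.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = Ɛ neighborhood radius
labels = db.labels_                          # -1 means the point was labeled noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```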
Pros vs Cons of DBSCAN Clustering
• Pros
  • does not require a pre-set number of clusters, unlike many other clustering algorithms
  • identifies outliers as noise, unlike the Mean-Shift method, which forces such points into a cluster despite their different characteristics
  • finds arbitrarily shaped and sized clusters quite well
• Cons
  • not very effective when clusters have varying densities: the appropriate distance threshold Ɛ and minimum number of neighborhood points change as the density level changes
  • with high-dimensional data, determining the distance threshold Ɛ becomes a challenging task
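One common heuristic (not prescribed by the slides) for picking Ɛ is a k-distance plot: sort each point's distance to its k-th nearest neighbor and look for the elbow. A sketch, assuming scikit-learn and the same two-moons data:

```python
# k-distance plot as a heuristic for choosing DBSCAN's eps.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

k = 5                                            # tie k to DBSCAN's min_samples
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
plt.plot(np.sort(dists[:, -1]))                  # distance to the k-th neighbor
plt.xlabel("points sorted by that distance")
plt.ylabel("distance to k-th nearest neighbor")
plt.show()
```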
Reference:
• https://www.guru99.com/unsupervised-machine-learning.html
• https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-unsupervised-learning#dimension-reduction
• https://www.ibm.com/cloud/learn/unsupervised-learning
• https://levelup.gitconnected.com/importance-of-data-preprocessing-and-scaling-in-machine-learning-21db1d4377ec
• https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
• https://www.digitalvidya.com/blog/the-top-5-clustering-algorithms-data-scientists-should-know/
