MODULE 3:
CLUSTERING 2
DAT405 / DIT407, 2022-2023, READING PERIOD 4
Topics
• DBSCAN clustering
• Hierarchical clustering
• Validating clusterings
Limitations of K-means clustering
K-means: result depends on initialization
[Figure: K-means clusterings of the same data under different initializations; centroids marked +]
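A quick way to see this in code is to run K-means several times with a single random initialization each and compare the final inertia. A minimal sketch (the make_blobs data and all parameter values are our own choices, not from the slides):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Overlapping blobs make the result sensitive to initialization.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=1)
for seed in range(5):
    km = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 1))  # final inertia may differ between seeds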
DBSCAN clustering
Steven Bierwagen
DBSCAN
• Density-Based Spatial Clustering of Applications with Noise
• Introduced in 1996 (Ester, Kriegel, Sander & Xu)
Ingredients for DBSCAN
• A distance measure (or metric or similarity measure)
  • often Euclidean distance
• A number defining the meaning of neighbor: the scanning radius
  • epsilon: the maximum distance between two points considered neighbors
• A number defining the meaning of cluster (vs. outlier or noise): the minimum number of points inside the radius
  • minpts: the minimum number of points in a cluster
Epsilon and minpts are the two hyperparameters.
Labeling step
All points in dataset labeled as one of these:
• Core point
• Border point
• Noise point
Neighbors
A neighbor of a point p is a point that is within distance ε of p.
Core points
A core point is a point that has at least minpts neighbors.
Border points
A border point is a non-core point that has at least one core point as a neighbor.
Noise points
A noise point (or outlier) is a point that is neither a core point nor a border point.
Clustering step
Clusters all core points and border points. Outliers will not be clustered!
Start by picking a new color c and an uncoloured core point p.
Clustering step
Put an edge between core points that are neighbors. Color those connected components with c.
Also color the border points of those nodes with c.
Clustering step
Repeat until all core points and border points have been colored!
Algorithm
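The labeling and clustering steps above translate almost directly into code. Below is a minimal sketch of DBSCAN in plain Python/NumPy, assuming Euclidean distance; it follows the coloring procedure from the slides rather than the optimized algorithm from the 1996 paper:

import numpy as np

def dbscan(X, eps, minpts):
    """Minimal DBSCAN sketch: returns one label per point, -1 for noise."""
    n = len(X)
    # Pairwise Euclidean distances; a point counts as its own neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= minpts for nb in neighbors])

    labels = np.full(n, -1)   # -1 = not yet colored (noise if it stays -1)
    c = 0                     # current "color"
    for p in range(n):
        if not core[p] or labels[p] != -1:
            continue
        # Grow the connected component of core points reachable from p,
        # coloring border points along the way.
        labels[p] = c
        stack = [p]
        while stack:
            q = stack.pop()
            for r in neighbors[q]:
                if labels[r] == -1:
                    labels[r] = c        # core or border point gets color c
                    if core[r]:
                        stack.append(r)  # only core points expand the cluster
        c += 1
    return labels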
Clusterings created by DBSCAN
Ester, Kriegel, Sander, Xu (1996), In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, (KDD), AAAI Press, pp. 226–231
Using DBSCAN

# Plot the border points (cluster members that are not core samples)
xy = X[class_member_mask & ~core_samples_mask]
plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
         markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Large = core point, small = border point, black = outlier
Note: the data is standardised (scaled to the range -2 to 2). This facilitates the parameter search for epsilon.
See Jupyter notebooks for Module 3 for code examples
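As a complement to the notebooks, here is a self-contained sketch of the scikit-learn pipeline the plotting snippet above belongs to. The make_moons data and the eps/min_samples values are our own choices (the full example also loops over clusters to build class_member_mask and col):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two-moon toy data; DBSCAN finds these non-spherical clusters.
X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)
X = StandardScaler().fit_transform(X)    # standardise, as noted above

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_                      # -1 marks outliers (noise)

core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)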
K-means vs. DBSCAN
• K-means assigns all points to a cluster, whereas DBSCAN doesn’t
necessarily do this. DBSCAN treats outliers as outliers.
• K-means works best when clusters are basically spherical. DBSCAN
can find arbitrarily-shaped clusters.
• DBSCAN doesn’t require the number of clusters to be specified by the
user.
Hierarchical clustering
Luis Serrano
A new set of points
Suppose we want to cluster these addresses by proximity. No pizza parlors involved this time!
Join the close ones
First join these two because they
are the closest to each other
Join the close ones
Then join these two because they
are second closest to each other
Join the close ones
Then join these two
Join the close ones
The two closest points that are not
both inside a cluster are these two
Join the close ones
Join them by putting them
into the same cluster
Join the close ones
Continue in the same way…
Join the close ones
Continue in the same way…
Join the close ones
Continue in the same way…
…until we reach the desired
number of clusters (as given
by a parameter)
Dendrogram
A dendrogram shows the entire hierarchical clustering process (without stopping at a desired number of clusters).
Dendrogram
If we want 4 clusters, we can cut here; if we want 8 clusters, we can cut here (at different heights of the dendrogram).
Dendrogram
If we have a space with billions of points in thousands of dimensions, the dendrogram is still a 2D graph!
For example, the tree of life!
Hierarchical clustering gives more than a clustering: a hierarchy (or taxonomy).
The tree of life
https://www.evogeneao.com/
Hierarchical clustering
• Sometimes called agglomerative clustering, when done bottom-up
• From one extreme case (many clusters, each containing one item) to another (one cluster that contains all items)

Algorithm (flowchart):
1. Start: each item is in a cluster of its own
2. Calculate the distance matrix
3. While the number of clusters > 1:
   a. Select a pair of clusters to merge
   b. Merge the clusters and update the distance matrix
4. End
Distance matrix
Edit distances between protein sequences (strings):
a. Human haemoglobin alpha chain
b. Human haemoglobin beta chain
c. Horse haemoglobin alpha chain
d. Horse haemoglobin beta chain
e. Marine bloodworm haemoglobin
f. Yellow lupine leghaemoglobin
Six proteins with a common evolutionary ancestor.
Amino acid sequences of six proteins
> human_alpha
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
> human_beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
> horse_alpha
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGKKVADGLTLAVGHLDDLPGALSDLSNLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR
> horse_beta
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLALVVARHFGKDFTPELQASYQKVVAGVANALAHKYH
> marine_bloodworm
GLSAAQRQVIAATWKDIAGADNGAGVGKKCLIKFLSAHPQMAAVFGFSGASDPGVAALGAKVLAQIGVAVSHLGDEGKMVAQMKAVGVRHKGYGNKHIKAQYFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGALISGLQS
> yellow_lupine
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPELQAHAGKVFKLVYEAAIQLEVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA
Edit distance is the number of single-character operations (insertions, deletions, substitutions) required to change one string into another.
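For concreteness, a minimal dynamic-programming sketch of edit (Levenshtein) distance; the example strings are arbitrary short prefixes of the sequences above:

def edit_distance(s, t):
    m, n = len(s), len(t)
    # dp[i][j] = number of edits to turn s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                 # i deletions
    for j in range(n + 1):
        dp[0][j] = j                 # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match/substitute
    return dp[m][n]

print(edit_distance("VLSPADK", "VHLTPEEK"))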
Merging clusters
• When clusters u and v are merged, how do we calculate the distance between the merged cluster and each of the other clusters?
• Various algorithms to choose from, e.g.
  • complete linkage (furthest inter-cluster distance): max(dist(u[i], v[j]))
  • single linkage (closest inter-cluster distance): min(dist(u[i], v[j]))
  • average linkage
    • Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
    • Weighted Pair Group Method with Arithmetic Mean (WPGMA)
  • … and many more
See e.g. https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage
Example: merging clusters

D      a    b    c    d    e    f
a      0   84   18   86  112  121
b           0   85   26  117  119
c                0   84  112  125
d                     0  113  121
e                          0  119
f                               0

1) Shortest distance: a – c (18)
2) Merge {a, c}
3) Recompute the distance matrix, using the max distance between points (complete linkage)

D     a,c    b    d    e    f
a,c     0   85   86  112  125
b            0   26  117  119
d                 0  113  121
e                      0  119
f                           0

1) Shortest distance: b – d (26)
2) Merge {b, d}
3) Recompute the distance matrix

D     a,c  b,d    e    f
a,c     0   86  112  125
b,d          0  117  121
e                 0  119
f                      0

a. Human haemoglobin alpha chain
b. Human haemoglobin beta chain
c. Horse haemoglobin alpha chain
d. Horse haemoglobin beta chain
e. Marine bloodworm haemoglobin
f. Yellow lupine leghaemoglobin
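The worked example can be reproduced with SciPy's linkage function (see the URL above). A minimal sketch; the distance matrix is copied from the slide:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ['a', 'b', 'c', 'd', 'e', 'f']
# Edit-distance matrix between the six globin sequences (from the slide)
D = np.array([[  0,  84,  18,  86, 112, 121],
              [ 84,   0,  85,  26, 117, 119],
              [ 18,  85,   0,  84, 112, 125],
              [ 86,  26,  84,   0, 113, 121],
              [112, 117, 112, 113,   0, 119],
              [121, 119, 125, 121, 119,   0]])

# Complete linkage on the condensed distance matrix; the first two
# merges are {a,c} at distance 18 and {b,d} at distance 26, as above.
Z = linkage(squareform(D), method='complete')
print(Z)                     # each row: merged clusters, distance, size
dendrogram(Z, labels=labels)
plt.show()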
Validating clustering
Stability on subsets
A clustering is stable if removing a proportion of random points does not change the clustering fundamentally.
Stability on subsets
Note: the colors change because the labeling of the clusters as first, second, third, … changes!
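A minimal sketch of such a stability check. The data, the subsampling fraction, and the use of the adjusted Rand index are our own choices; the ARI is convenient here because it is invariant to the label permutations (color changes) noted above:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))   # stand-in data

base = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for trial in range(5):
    keep = rng.random(len(X)) < 0.8     # drop ~20% of the points
    sub = KMeans(n_clusters=3, n_init=10).fit_predict(X[keep])
    # Compare labelings on the retained points; values near 1 mean
    # the clustering did not change fundamentally.
    print(adjusted_rand_score(base[keep], sub))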
Co-occurrence
For all pairs (i,j) count how frequently i and j
are in the same cluster.
Co-occurrence
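One way to compute a co-occurrence matrix, assuming (our reading of the slide) that we count over repeated clusterings of random subsets as in the stability check; the data and parameters are again our own choices:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
n = len(X)
counts = np.zeros((n, n))   # times i and j were clustered together
seen = np.zeros((n, n))     # times i and j were both in the subset

for trial in range(20):
    keep = np.flatnonzero(rng.random(n) < 0.8)
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X[keep])
    same = labels[:, None] == labels[None, :]   # pairwise same-cluster?
    seen[np.ix_(keep, keep)] += 1
    counts[np.ix_(keep, keep)] += same

cooc = counts / np.maximum(seen, 1)   # co-occurrence frequency per pair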
Silhouette coefficient
a: the mean distance between a sample and all other points in the same cluster.
b: the mean distance between a sample and all other points in the next nearest cluster.

s = (b − a) / max(a, b)

s ranges between -1 and 1. High values indicate good separation between clusters.
https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient
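A minimal sketch using scikit-learn's silhouette_score, e.g. to compare different numbers of clusters (the synthetic make_blobs data is our own choice):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher = better separated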
Clustering clustering algorithms
Fahad et al. (2014) IEEE Trans. Emerging Topics in Computing, volume 2, 267-279
Combining clustering and classification
A useful idea when labeling is expensive:
• Take a dataset with handwritten digits
• Provide only one label per digit (10 labels for the whole dataset)
• Use 10-means with the ten labeled images as starting points for clustering the whole dataset
• Then use 1-NN for classifying new handwritten digits (see the sketch below)
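A minimal sketch of this idea on scikit-learn's digits dataset. The dataset and the choice of the first labeled example per digit are our own; the slides do not prescribe an implementation:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# One labeled example per digit (10 labels in total).
seed_idx = [np.flatnonzero(y == d)[0] for d in range(10)]
seeds = X[seed_idx]

# 10-means initialized at the labeled images; cluster i starts at the
# seed for digit i, so we treat cluster index i as the label "digit i".
km = KMeans(n_clusters=10, init=seeds, n_init=1).fit(X)
pseudo_labels = km.labels_

# 1-NN classifier trained on the pseudo-labeled dataset.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, pseudo_labels)
print(clf.predict(X[:5]))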
Reflections on clustering
Clustering is successful, but difficult
• Inherent vagueness in the definition of a cluster
• Can be difficult to define an appropriate similarity measure
Jain, A.K. (2010) Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31, 651-666
https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
Are the framed cases as desired?
Assignment 3
• Using K-means and density-based clustering to cluster the main chain
conformations of amino acid residues in proteins.
• For more information on the problem domain, see:
• http://bioinformatics.org/molvis/phipsi/
• http://tinyurl.com/RamachandranPrincipleYouTube
Protein main chain
[Figure: protein main-chain atoms N, CA, C, O for residues i-1, i, i+1, with the dihedral angles φ (phi) and ψ (psi) marked]
Ramachandran plot
Around 100,000 data points shown here.
http://bioinformatics.org/molvis/phipsi/