Lecture 6
CLUSTERING 2
DBSCAN clustering
Steven Bierwagen
DBSCAN
• Density-Based Spatial Clustering of Applications with Noise
• Proposed by Ester, Kriegel, Sander and Xu in 1996
Ingredients for DBSCAN
• A distance measure (or metric or similarity measure)
• often Euclidean distance
• A number defining the meaning of neighbor
• epsilon: the maximum distance (scanning radius) between two points considered neighbors
• A number defining the meaning of cluster (vs. outlier or noise)
• minpts: the minimum number of points inside the radius required to form a cluster
Two hyperparameters
Labeling step
• Core point: has at least minpts neighbors within distance epsilon
• Border point: has fewer than minpts neighbors, but is a neighbor of a core point
• Noise point: neither a core point nor a border point
Neighbors
A neighbor of a point p is a point that is within distance 𝜖 from p.
[Diagram: point p and its 𝜖-neighborhood]
Core points
[Diagram: a core point p with at least minpts neighbors within distance 𝜖]
Clustering step
Put an edge between core points that are neighbors. Color each connected component of this graph as one cluster.
Ester, Kriegel, Sander, Xu (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD), AAAI Press, pp. 226–231
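The labeling and clustering steps above can be sketched from scratch as a graph problem. This is a minimal illustration, not the efficient algorithm from the paper: the brute-force pairwise distance matrix, the toy data, and the function name are assumptions made here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def dbscan_sketch(X, eps, minpts):
    d = cdist(X, X)                          # brute-force pairwise distances
    nb = d <= eps                            # neighbor matrix (includes self)
    is_core = nb.sum(axis=1) >= minpts       # core: at least minpts neighbors
    # edges between core points that are neighbors
    core_adj = csr_matrix(nb & is_core[:, None] & is_core[None, :])
    _, comp = connected_components(core_adj, directed=False)
    labels = np.full(len(X), -1)             # -1 = noise
    remap = {c: i for i, c in enumerate(np.unique(comp[is_core]))}
    for i in np.where(is_core)[0]:           # color each core component
        labels[i] = remap[comp[i]]
    for i in np.where(~is_core)[0]:          # border: neighbor of a core point
        core_nb = np.where(nb[i] & is_core)[0]
        if len(core_nb):
            labels[i] = labels[core_nb[0]]
    return labels

toy = np.array([[0, 0], [0, 0.5], [0.5, 0],
                [5, 5], [5, 5.5], [5.5, 5],
                [10, 10.0]])
labels = dbscan_sketch(toy, eps=1.0, minpts=3)  # two clusters plus one noise point
```

On the toy data, the two tight groups become clusters 0 and 1, and the isolated point at (10, 10) stays labeled −1.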
xy = X[class_member_mask & ~core_samples_mask]   # non-core members of this cluster
plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
         markeredgecolor='k', markersize=6)
Note: the data is standardised (zero mean and unit variance, so most values fall roughly in the range −2 to 2). This facilitates the parameter search for epsilon.
See Jupyter notebooks for Module 3 for code examples
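A minimal sketch of the workflow with scikit-learn's DBSCAN and StandardScaler. The toy blob data and the particular eps/min_samples values are illustrative assumptions, not the notebook's actual code.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# two dense blobs plus one far-away outlier (illustrative data)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2)),
               [[20.0, 20.0]]])
X = StandardScaler().fit_transform(X)     # standardise before picking epsilon
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
# the blobs form two clusters; the outlier is labeled -1 (noise)
```

Because the features are standardised, a single eps value on the order of a few tenths is a reasonable starting point for the parameter search.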
K-means vs. DBSCAN
• K-means assigns every point to a cluster, whereas DBSCAN does not necessarily do this: DBSCAN treats outliers as noise.
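A small illustration of this difference, with toy data and parameter values chosen here for the example: k-means assigns every point (including an isolated one) to some cluster, while DBSCAN labels the isolated point −1.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.2, (30, 2)),
               rng.normal([4, 0], 0.2, (30, 2)),
               [[2.0, 6.0]]])                      # an isolated outlier
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
# k-means forces every point into a cluster; DBSCAN calls the outlier noise
```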
If we want 4 clusters, we can cut here.
If we want 8 clusters, we can cut here.
Dendrogram
If we have a space with billions of points in thousands of dimensions, the dendrogram is still a 2D graph!
https://www.evogeneao.com/
Hierarchical clustering
• From one extreme case (many clusters, each containing one item) to another (one cluster that contains all items)
[Flowchart: Start → "Number of clusters > 1?" → yes: select a pair of clusters to merge, then repeat; no: End]
> human_alpha
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
> human_beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
> horse_alpha
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGKKVADGLTLAVGHLDDLPGALSDLSNLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR
> horse_beta
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLALVVARHFGKDFTPELQASYQKVVAGVANALAHKYH
> marine_bloodworm
GLSAAQRQVIAATWKDIAGADNGAGVGKKCLIKFLSAHPQMAAVFGFSGASDPGVAALGAKVLAQIGVAVSHLGDEGKMVAQMKAVGVRHKGYGNKHIKAQYFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGALISGLQS
> yellow_lupine
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPELQAHAGKVFKLVYEAAIQLEVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA
Edit distance is the minimum number of single-character operations (insertions, deletions and substitutions) required to change one string into another.
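Edit distance can be computed with the standard dynamic-programming recurrence. A minimal sketch (this is plain Levenshtein distance; biological sequence comparison, as for the globin sequences above, typically uses alignment scores with substitution matrices instead):

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t (Levenshtein distance)."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # row for the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            curr[j] = min(prev[j] + 1,                          # delete s[i-1]
                          curr[j - 1] + 1,                      # insert t[j-1]
                          prev[j - 1] + (s[i - 1] != t[j - 1])) # substitute
        prev = curr
    return prev[n]
```

For example, edit_distance("kitten", "sitting") is 3: substitute k→s, substitute e→i, insert g.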
Merging clusters
• When clusters u and v are merged, how do we calculate the distance
between the merged cluster and each of the other clusters?
See e.g. https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage
Example: merging clusters

1) Shortest distance: a – c (18)
2) Merge {a, c}
3) Recompute distance matrix, use max distance (complete linkage)

D     a    b    c    d    e    f
a     0   84   18   86  112  121
b          0   85   26  117  119
c               0   84  112  125
d                    0  113  121
e                         0  119
f                              0

D   a,c    b    d    e    f
a,c   0   85   86  112  125
b          0   26  117  119
d               0  113  121
e                    0  119
f                         0

a. Human haemoglobin alpha chain
b. Human haemoglobin beta chain
c. Horse haemoglobin alpha chain
d. Horse haemoglobin beta chain
e. Marine bloodworm haemoglobin
f. Yellow lupine leghaemoglobin
Example: merging clusters

1) Shortest distance: a – c (18)
2) Merge {a, c}
3) Recompute distance matrix, use max distance between points

D     a    b    c    d    e    f
a     0   84   18   86  112  121
b          0   85   26  117  119
c               0   84  112  125
d                    0  113  121
e                         0  119
f                              0

1) Shortest distance: b – d (26)
2) Merge {b, d}
3) Recompute distance matrix

D   a,c    b    d    e    f
a,c   0   85   86  112  125
b          0   26  117  119
d               0  113  121
e                    0  119
f                         0

a. Human haemoglobin alpha chain
b. Human haemoglobin beta chain
c. Horse haemoglobin alpha chain
d. Horse haemoglobin beta chain
e. Marine bloodworm haemoglobin
f. Yellow lupine leghaemoglobin
Example: merging clusters

1) Shortest distance: a – c (18)
2) Merge {a, c}
3) Recompute distance matrix, use max distance between points

D     a    b    c    d    e    f
a     0   84   18   86  112  121
b          0   85   26  117  119
c               0   84  112  125
d                    0  113  121
e                         0  119
f                              0

1) Shortest distance: b – d (26)
2) Merge {b, d}
3) Recompute distance matrix

D   a,c    b    d    e    f
a,c   0   85   86  112  125
b          0   26  117  119
d               0  113  121
e                    0  119
f                         0

D   a,c  b,d    e    f
a,c   0   86  112  125
b,d        0  117  121
e               0  119
f                    0

a. Human haemoglobin alpha chain
b. Human haemoglobin beta chain
c. Horse haemoglobin alpha chain
d. Horse haemoglobin beta chain
e. Marine bloodworm haemoglobin
f. Yellow lupine leghaemoglobin
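The merge steps above can be reproduced with SciPy's linkage function; method="complete" corresponds to the max-distance rule. The distance matrix values are taken directly from the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# distance matrix from the example (a..f = the six globin chains)
D = np.array([
    [  0,  84,  18,  86, 112, 121],
    [ 84,   0,  85,  26, 117, 119],
    [ 18,  85,   0,  84, 112, 125],
    [ 86,  26,  84,   0, 113, 121],
    [112, 117, 112, 113,   0, 119],
    [121, 119, 125, 121, 119,   0],
], dtype=float)

# 'complete' linkage = recompute with the max distance between points
Z = linkage(squareform(D), method="complete")
# first merge: a (0) and c (2) at distance 18; second: b (1) and d (3) at 26
```

Each row of Z records one merge (the two cluster indices, the merge distance, and the size of the new cluster), and scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding dendrogram.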
Validating clustering
Stability on subsets
A clustering is stable if removing a proportion of random points does not change the clustering fundamentally.
Stability on subsets
Note: the colors change, because the labeling of clusters into first, second, third … changes!
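One possible sketch of such a stability check (the toy blob data, the 80% subset fraction, and the use of the adjusted Rand index are choices made here for illustration). The adjusted Rand index is invariant to relabeling, so the permuted "first, second, third" cluster ids noted above do not affect the score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 0], [0, 5]],
                  cluster_std=0.5, random_state=0)
full = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
scores = []
for _ in range(10):
    keep = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    sub = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[keep])
    # ARI compares the two labelings on the shared points only,
    # ignoring how the cluster ids happen to be permuted
    scores.append(adjusted_rand_score(full[keep], sub))
# for a stable clustering the mean ARI stays close to 1
```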
Co-occurrence
For all pairs (i, j), count how frequently i and j are in the same cluster.
Co-occurrence
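One way to build such a co-occurrence matrix, assuming toy blob data and repeated k-means runs with different random initialisations:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=[[0, 0], [5, 0], [0, 5]],
                  cluster_std=0.6, random_state=1)
n, runs = len(X), 20
co = np.zeros((n, n))
for seed in range(runs):
    labels = KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X)
    co += labels[:, None] == labels[None, :]  # 1 where i and j share a cluster
co /= runs   # fraction of runs in which i and j end up together
```

Entries near 1 mark pairs that are consistently clustered together; entries near 0.5 mark pairs whose assignment flips between runs.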
Silhouette coefficient
a: The mean distance between a sample and all other points in the same cluster.
b: The mean distance between a sample and all other points in the next-nearest cluster.
s = (b − a) / max(a, b)
s ranges between −1 and 1. High values indicate good separation between clusters.
https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient
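A sketch of using the silhouette coefficient to choose the number of clusters with scikit-learn (the four-blob toy data and the candidate values of k are assumptions made for this example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# four well-separated blobs standing in for real data
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 0], [0, 5], [5, 5]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean s over all samples
# with four well-separated blobs, k = 4 should score highest
```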
Clustering clustering algorithms
Fahad et al. (2014) IEEE Trans. Emerging Topics in Computing, volume 2, 267-279
Useful idea when few labels are available
• Provide only one label per digit (10 labels for the whole dataset)
• Use 10-means with the ten labeled images as starting points for clustering the whole dataset
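A sketch of this seeding idea, with toy blobs standing in for the digit images (the data, three classes instead of ten, and all parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy stand-in for the digits: three classes, one labeled example each
X, y = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=0.7, random_state=0)
seeds = np.vstack([X[y == c][0] for c in range(3)])  # one labeled point per class
km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
# cluster ids follow the order of the seeds, so they can be compared to y
acc = np.mean(km.labels_ == y)
```

Because each initial centroid is a labeled example, the resulting cluster ids inherit the class labels directly, turning unsupervised clustering into a cheap classifier.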
Jain, A.K. (2010) Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31, 651-666
https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
Are the framed cases as desired?
Assignment 3
• Using K-means and density-based clustering to cluster the main chain conformations of amino acid residues in proteins.
• If you are curious about the problem domain, look at:
• http://bioinformatics.org/molvis/phipsi/
• http://tinyurl.com/RamachandranPrincipleYouTube
Protein main chain
[Diagram: protein main chain across residues i−1, i, i+1, showing the repeating N–CA–C backbone atoms with carbonyl oxygens and the dihedral angles φ (phi) and ψ (psi)]
Ramachandran plot
Around 100,000 data points are shown here.
http://bioinformatics.org/molvis/phipsi/