K-Means and Hierarchical Clustering
What is Clustering?
● Clustering groups a set of objects so that objects in the same cluster are more similar to one another than to objects in other clusters.
K-Means Clustering
● K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
K-Means Clustering Algorithm with an Example
Given dataset: {2, 3, 4, 10, 11, 12, 20, 25, 30}
Number of clusters: k = 2
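The worked example above can be sketched in code. The following is a minimal pure-Python Lloyd's-algorithm implementation for 1-D data; the choice of initial centroids (the two extreme values) is an assumption, since the slide does not specify an initialization.

```python
def kmeans_1d(data, centroids, max_iter=100):
    """Lloyd's algorithm on 1-D data; returns (clusters, centroids)."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return clusters, centroids

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
# Assumed initialization: the two extreme values of the dataset.
clusters, centroids = kmeans_1d(data, [min(data), max(data)])
```

With this start the algorithm converges in two iterations to the clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}, with centroids 7 and 25.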
Weaknesses of K-Means Clustering
● When there are only a few data points, the initial grouping largely determines the final clusters.
● The true clusters are never known: with few data points, presenting the same data in a different order may produce different clusters.
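The sensitivity to the initial grouping can be demonstrated on a toy 2-D dataset (invented here for illustration): two different starting centroids lead Lloyd's algorithm to two different local optima with very different within-cluster sums of squared errors (SSE).

```python
def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm on 2-D points; returns (clusters, within-cluster SSE)."""
    def d2(p, q):  # squared Euclidean distance
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: d2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               for c in clusters]
        if new == centroids:  # converged
            break
        centroids = new
    sse = sum(d2(p, centroids[i]) for i, c in enumerate(clusters) for p in c)
    return clusters, sse

# Two tight pairs of points; the "bad" start places both centroids in one pair.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
_, sse_good = kmeans(pts, [(0.0, 0.5), (10.0, 0.5)])
_, sse_bad = kmeans(pts, [(0.0, 0.0), (0.0, 1.0)])
```

The good start finds the two vertical pairs (SSE = 1), while the bad start converges to a horizontal split (SSE = 100): both are fixed points of the algorithm, but only one is the intended clustering.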
Applications of K-Means Clustering
Hierarchical Clustering
● Agglomerative approach
Initialization: each object is its own cluster.
Iteration: merge the two clusters that are most similar to each other.
Termination: stop when all objects have been merged into a single cluster.
[Figure: dendrogram of the merges: a and b merge into ab; d and e merge into de; c joins de to form cde; finally ab and cde merge into abcde]
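The agglomerative loop above can be sketched as a naive O(n³) single-link implementation. The 1-D coordinates for a through e are invented so that the merges reproduce the pattern on the slide (a with b, d with e, then c with de, then everything).

```python
def agglomerative_single_link(points):
    """Naive single-link agglomerative clustering of labeled 1-D points.
    points: dict mapping label -> coordinate. Returns the merged sets, in order."""
    clusters = [{lab} for lab in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link distance
        # (minimum distance over all cross-cluster point pairs).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append(merged)
    return merges

# Hypothetical coordinates chosen so the merge pattern matches the slide's
# dendrogram; the relative order of the first two merges depends on the gaps.
coords = {"a": 1.0, "b": 2.0, "c": 6.0, "d": 8.0, "e": 8.5}
merges = agglomerative_single_link(coords)
```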
Dendrogram
● A binary tree that shows how clusters are
merged/split hierarchically
● Each node on the tree is a cluster; each leaf node is a
singleton cluster
Hierarchical Agglomerative Clustering: Linkage Methods
● The single linkage method is based on minimum distance,
or the nearest neighbor rule.
● The complete linkage method is based on the maximum
distance or the furthest neighbor approach.
● In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, one from each cluster.
Centroid Method
● In the centroid method, the distance between two clusters is the distance between their centroids.
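The four linkage definitions can be computed directly. A small sketch on two illustrative 1-D clusters (values chosen here, not from the slides):

```python
from itertools import product
from statistics import mean

def inter_cluster_distances(c1, c2):
    """The four linkage distances between two clusters of 1-D points."""
    pair_dists = [abs(a - b) for a, b in product(c1, c2)]
    return {
        "single (min)":   min(pair_dists),           # nearest-neighbor pair
        "complete (max)": max(pair_dists),           # farthest-neighbor pair
        "average":        mean(pair_dists),          # mean over all pairs
        "centroid":       abs(mean(c1) - mean(c2)),  # distance of the means
    }

d = inter_cluster_distances([1, 2, 3], [7, 8, 9])
```

For these clusters the single-link distance is 4 (between 3 and 7), the complete-link distance is 8 (between 1 and 9), and both the average and centroid distances are 6; average and centroid need not coincide in general.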
[Figure: single linkage uses the minimum distance between Cluster 1 and Cluster 2; complete linkage uses the maximum distance; average linkage uses the average distance over all pairs; the centroid method uses the distance between the cluster centroids]
How to Merge Clusters?
Which inter-cluster distance should be used?
● Single-link
● Complete-link
● Average-link
● Centroid distance
How to Define Inter-Cluster Distance
● Single-link: the distance between two clusters is the distance of the closest pair of data objects belonging to different clusters.
● Complete-link: the distance between two clusters is the distance of the farthest pair of data objects belonging to different clusters.
● Average-link: the distance between two clusters is the average distance of all pairs of data objects belonging to different clusters.
● Centroid distance: the distance between two clusters is the distance between the means (centroids) of the clusters.
An Example of the Agglomerative Hierarchical Clustering Algorithm
[Figure: six data points, labeled 1–6, in the plane]
Result of the Single-Link Algorithm
[Figure: the nested single-link clusters of points 1–6 and the corresponding dendrograms]
Hierarchical Clustering: Comparison
[Figure: single-link and complete-link clusterings of the same six points, shown with their dendrograms]
Strength of Single-Link
Limitations of Single-Link
[Figure: original points vs. the two clusters found]
Strength of Complete-Link
Which Distance Measure is Better?
● Each method has advantages and disadvantages; the choice is application-dependent. Single-link and complete-link are the most common methods.
● Single-link
  ● Can find irregular-shaped clusters
  ● Sensitive to outliers; suffers from the so-called chaining effect
● Complete-link, average-link, and centroid distance
  ● Robust to outliers
  ● Tend to break large clusters
  ● Prefer spherical clusters
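The contrast can be seen on a small invented 1-D dataset: an elongated "chain" of points plus one compact pair. Cutting at three clusters, single-link keeps growing the chain, while complete-link breaks the elongated run into pieces.

```python
def agglomerative_1d(points, k, linkage):
    """Naive agglomerative clustering of 1-D points down to k clusters.
    linkage: 'single' (min pair distance) or 'complete' (max pair distance)."""
    agg = min if linkage == "single" else max
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = agg(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for m, c in enumerate(clusters) if m not in (i, j)] + [merged]
    return sorted(clusters)

# An elongated chain (0 .. 8.5, roughly evenly spaced) plus a compact pair.
pts = [0, 1.9, 4, 6.2, 8.5, 20, 20.8]
single = agglomerative_1d(pts, 3, "single")
complete = agglomerative_1d(pts, 3, "complete")
```

Single-link returns {0, 1.9, 4, 6.2}, {8.5}, {20, 20.8}: the chain keeps absorbing its nearest neighbor. Complete-link returns {0, 1.9}, {4, 6.2, 8.5}, {20, 20.8}: the elongated run is split, illustrating why complete-link prefers compact, roughly spherical clusters.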