Clustering
Medicine A 1 1
Medicine B 2 1
Medicine C 4 3
Medicine D 5 4
Hierarchical Clustering
• Use a distance matrix as the clustering criterion. This method does
not require the number of clusters k as an input, but needs a
termination condition
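[Figure: agglomerative (AGNES) clustering runs from Step 0 to Step 4, merging a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde; divisive (DIANA) clustering runs the same steps in reverse, from Step 0 (abcde) to Step 4 (single objects).]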
AGNES (Agglomerative Nesting)
• Introduced in Kaufman and Rousseeuw (1990)
• Implemented in statistical packages, e.g., S-Plus
• Uses the single-link method and the dissimilarity matrix
• Merges the nodes that have the least dissimilarity
• Continues in a non-descending fashion
• Eventually all nodes belong to the same cluster
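The merge loop described above can be sketched in a few lines. This is a minimal illustration, not the Kaufman and Rousseeuw implementation: the function name `agnes_single_link` and the four-object matrix `D` are hypothetical and do not come from the slides.

```python
from itertools import combinations

def agnes_single_link(dist):
    """Agglomerative clustering with single link on a symmetric dissimilarity matrix.

    Returns the merges as (cluster_a, cluster_b, dissimilarity) tuples,
    where each cluster is a frozenset of object indices.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Single link: the dissimilarity of two clusters is that of
        # their closest cross-cluster pair of objects.
        def link(pair):
            a, b = pair
            return min(dist[i][j] for i in a for j in b)

        # Merge the two clusters with the least dissimilarity.
        a, b = min(combinations(clusters, 2), key=link)
        merges.append((a, b, link((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Hypothetical dissimilarity matrix for four objects (not from the slides).
D = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]

merges = agnes_single_link(D)
for a, b, d in merges:
    print(sorted(a), "+", sorted(b), "at dissimilarity", d)
```

The merge dissimilarities come out in non-descending order, and the last merge joins all objects into one cluster. The sketch recomputes every cluster pair in each round, which is fine for small examples but slow for large data sets.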
[Figure: three scatter plots, each with both axes running from 0 to 10]
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
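Cutting can be illustrated for the single-link case, where cutting the dendrogram at height h gives the connected components of the graph that links every pair of objects with dissimilarity at most h. The sketch below relies on that equivalence instead of building the tree; the function name `cut_single_link` and the matrix `D` are hypothetical.

```python
def cut_single_link(dist, height):
    """Clusters obtained by cutting a single-link dendrogram at `height`.

    Objects are linked when their dissimilarity is at most `height`;
    each connected component is returned as a sorted list of indices.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= height:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical dissimilarity matrix for four objects (not from the slides).
D = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]

print(cut_single_link(D, 3))    # low cut: more, smaller clusters
print(cut_single_link(D, 4.5))
print(cut_single_link(D, 5))    # high cut: everything in one cluster
```

Raising the cut level can only merge clusters, never split them, which is the nesting property the dendrogram expresses.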
DIANA (Divisive Analysis)
[Figure: three scatter plots, each with both axes running from 0 to 10]
Example of converting data points into distance matrix
• Clustering analysis with agglomerative algorithm
• The data matrix is converted into a distance matrix using Euclidean distance
Example
X Y
A 0.40 0.53
B 0.22 0.38
C 0.35 0.32
D 0.26 0.19
E 0.08 0.41
F 0.45 0.30
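The conversion from data matrix to distance matrix can be sketched as follows, using the six points above and Euclidean distance. The variable names are illustrative. The values are recomputed from the coordinates and rounded to two decimals, so individual entries may differ in the last digit from a table prepared by hand.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Data matrix: the six points from the example table.
points = {
    "A": (0.40, 0.53),
    "B": (0.22, 0.38),
    "C": (0.35, 0.32),
    "D": (0.26, 0.19),
    "E": (0.08, 0.41),
    "F": (0.45, 0.30),
}
names = list(points)

# Distance matrix: Euclidean distance for every pair, rounded to two decimals.
matrix = {
    p: {q: round(dist(points[p], points[q]), 2) for q in names}
    for p in names
}

# Print the lower triangle, one row per point.
print("   " + "".join(f"{q:>6}" for q in names))
for i, p in enumerate(names):
    row = "".join(f"{matrix[p][q]:>6.2f}" for q in names[: i + 1])
    print(f"{p:<3}{row}")
```

Only the lower triangle is printed because the matrix is symmetric and its diagonal is zero.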
Example
Distance matrix (lower triangle, partial)
      A     B     C     D     E     F
A     0
B     0.23  0
C     0.22  0.15  0
...
Updated matrix rows from later merge steps (partial)
A     0
B,E   0.23  0
...
A     0
D     0.37  0.15  0
...
Practise
1 2 3 4 5
1 0
2 9 0
3 3 7 0
4 6 5 9 0
5 11 10 2 8 0
MIN or Single Link
Inter-cluster distance
• The distance between two clusters is
represented by the distance of the closest pair
of data objects belonging to different clusters.
• Determined by one pair of points, i.e., by one
link in the proximity graph
MAX or Complete Link
Inter-cluster distance
• The distance between two clusters is
represented by the distance of the farthest pair
of data objects belonging to different clusters
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = min dist(tip, tjq) over all tip in Ki and tjq in Kj
• Complete link: largest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = max dist(tip, tjq) over all tip in Ki and tjq in Kj
• Average: average distance between an element in one cluster and an element in
the other, i.e., dist(Ki, Kj) = avg dist(tip, tjq) over all tip in Ki and tjq in Kj
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) =
dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) =
dist(Mi, Mj)
– Medoid: a chosen, centrally located object in the cluster
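The first four measures can be compared on a small example. This is a sketch that assumes Euclidean distance between points; the function names and the clusters `K1` and `K2` are hypothetical. The medoid variant is left out because it depends on how the medoid is chosen.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def single_link(ki, kj):
    # Smallest distance over all cross-cluster pairs.
    return min(dist(p, q) for p in ki for q in kj)

def complete_link(ki, kj):
    # Largest distance over all cross-cluster pairs.
    return max(dist(p, q) for p in ki for q in kj)

def average_link(ki, kj):
    # Mean distance over all cross-cluster pairs.
    return sum(dist(p, q) for p in ki for q in kj) / (len(ki) * len(kj))

def centroid(cluster):
    # Coordinate-wise mean of the points in the cluster.
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def centroid_link(ki, kj):
    # Distance between the two cluster centroids.
    return dist(centroid(ki), centroid(kj))

# Hypothetical clusters: two vertical pairs of points, three units apart.
K1 = [(0.0, 0.0), (0.0, 4.0)]
K2 = [(3.0, 0.0), (3.0, 4.0)]

print("single  :", single_link(K1, K2))
print("complete:", complete_link(K1, K2))
print("average :", average_link(K1, K2))
print("centroid:", centroid_link(K1, K2))
```

On the same pair of clusters the measures generally disagree, so the choice of measure changes which clusters an agglomerative algorithm merges next.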
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
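• Centroid: the “middle” of a cluster
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$
• Radius: square root of the average squared distance from the points of the cluster to its centroid
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}}$$
• Diameter: square root of the average squared distance between all pairs of points in the cluster
$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$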
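A small numeric check may help here. The sketch below assumes the usual definitions: the centroid is the coordinate-wise mean, the radius is the square root of the mean squared distance from the points to the centroid, and the diameter is the square root of the sum of squared distances over all ordered pairs divided by N(N − 1). The function names and the cluster `square` are hypothetical.

```python
from math import dist, sqrt

def centroid(cluster):
    # Coordinate-wise mean of the N points.
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def radius(cluster):
    # Square root of the mean squared distance to the centroid.
    c = centroid(cluster)
    return sqrt(sum(dist(t, c) ** 2 for t in cluster) / len(cluster))

def diameter(cluster):
    # Square root of the summed squared distances over all ordered pairs,
    # divided by N(N - 1). Needs at least two points.
    n = len(cluster)
    total = sum(dist(s, t) ** 2 for s in cluster for t in cluster)
    return sqrt(total / (n * (n - 1)))

# Hypothetical cluster: the four corners of a 2 x 2 square.
square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

print("centroid:", centroid(square))
print("radius  :", radius(square))
print("diameter:", diameter(square))
```

For this square every corner lies at distance sqrt(2) from the centroid, so the radius is sqrt(2), while the diameter works out to sqrt(64 / 12), about 2.31.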
Parametric vs Non-Parametric Estimation
Learning a Function