Unit 4: Cluster Analysis
Cluster Analysis
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
• Standardized measurement (z-score): z_if = (x_if − m_f) / s_f, where m_f is the mean and s_f the mean absolute deviation of variable f
• Using mean absolute deviation is more robust than using standard deviation
Similarity and Dissimilarity Between Objects
• Minkowski distance: d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q), where q is a positive integer
• If q = 1, d is Manhattan distance: d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
Similarity and Dissimilarity Between Objects
(Cont.)
• If q = 2, d is Euclidean distance:
d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_ip − x_jp|^2)
– Properties
• d(i, j) ≥ 0
• d(i, i) = 0
• d(i, j) = d(j, i)
• d(i, j) ≤ d(i, k) + d(k, j)
• Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
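The formulas above can be captured in a few lines; a minimal sketch (the example vectors i and j are made up for illustration):

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points.
    q = 1 gives Manhattan distance, q = 2 gives Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))  # Manhattan: 3 + 4 + 0 = 7
print(minkowski(i, j, q=2))  # Euclidean: sqrt(9 + 16 + 0) = 5
```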
Binary Variables
A contingency table for binary data:

              Object j
              1        0        sum
Object i  1   a        b        a + b
          0   c        d        c + d
        sum   a + c    b + d    p

• Distance measure for symmetric binary variables: d(i, j) = (b + c) / (a + b + c + d)
• Distance measure for asymmetric binary variables: d(i, j) = (b + c) / (a + b + c)
• Jaccard coefficient (similarity measure for asymmetric binary variables): sim_Jaccard(i, j) = a / (a + b + c)
Dissimilarity between Binary Variables
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
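These values can be reproduced with a short script; a minimal sketch that encodes Y/P as 1 and N as 0 for the six asymmetric attributes:

```python
def asymmetric_binary_distance(x, y):
    """d(i, j) = (b + c) / (a + b + c); negative matches (d) are ignored."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```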
Nominal Variables
• Simple matching: d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of variables
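A minimal sketch of simple matching; the attribute values are made up for illustration:

```python
def simple_matching_distance(x, y):
    """d(i, j) = (p - m) / p, where m counts the attributes on which i and j match."""
    p = len(x)
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return (p - m) / p

print(simple_matching_distance(["red", "S", "cotton"], ["red", "M", "wool"]))  # 2/3
```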
Ordinal Variables
• Can be treated like interval-scaled variables: replace each value x_if by its rank r_if in {1, ..., M_f}, map the ranks onto [0, 1] via z_if = (r_if − 1) / (M_f − 1), and then apply the distance measures for interval-scaled variables
– Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
[Figure: a set of nested clusters over points p1–p4 and the corresponding dendrogram]
• Prototype-based clusters
• Contiguity-based clusters
• Density-based clusters
• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.
3 well-separated clusters
• Prototype-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the prototype or “center” of a cluster,
than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster
4 center-based clusters
• Contiguity-based
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
8 contiguous clusters
• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
• Hierarchical clustering
• Density-based clustering
[Figures: the original points and the K-means clustering obtained at successive iterations (Iteration 1, Iteration 2, ...), plotted in the x–y plane]
• Depending on the choice of initial centroids, B and C may get merged or remain separate
[Figure: K-means result after Iteration 4 for the 10-cluster example, plotted in the x–y plane]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figures: K-means iterations 1–4 for the 10-cluster example, plotted in the x–y plane]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figures: K-means iterations 1–4 for the 10-cluster example, plotted in the x–y plane]
Starting with some pairs of clusters having three initial centroids, while others have only one.
Solutions to Initial Centroids Problem
• Multiple runs
– Helps, but probability is not on your side
• Use some strategy to select the k initial
centroids and then select among these
initial centroids
– Select most widely separated
• K-means++ is a robust way of doing this selection (see the sketch after this list)
– Use hierarchical clustering to determine initial
centroids
• Bisecting K-means
– Not as susceptible to initialization issues
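A minimal sketch of the K-means++ selection idea on one-dimensional data; the point values and the seed are made up for illustration:

```python
import random

def kmeans_plus_plus_init(points, k, seed=0):
    """Pick k initial centroids from 1-D points: the first uniformly at random,
    each later one with probability proportional to its squared distance to
    the nearest centroid chosen so far."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min((p - c) ** 2 for c in centroids) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
        else:
            centroids.append(points[-1])  # guard against floating-point edge cases
    return centroids

print(kmeans_plus_plus_init([1.0, 1.2, 5.0, 5.1, 9.0, 9.2], k=3))
```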
Example 2: first divide the objects into 2 clusters.
[Figure: one-dimensional example showing the values 6.5, 9, 10, 15, 16, 18.5 (and 6.8, 13, 18), with the current centroids marked by X]
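A minimal sketch of the K-means iteration such an example walks through, on one-dimensional data; the data values are read off the garbled figure and the initial centroids are an assumption:

```python
def kmeans_1d(points, centroids, iters=10):
    """Plain Lloyd's algorithm on 1-D data: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    for _ in range(iters):
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: (p - centroids[c]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(pts) / len(pts) if pts else centroids[c]
                     for c, pts in clusters.items()]
    return centroids, clusters

points = [6.5, 9, 10, 15, 16, 18.5]
print(kmeans_1d(points, centroids=[6.5, 18.5], iters=5))
# -> centroids [8.5, 16.5]; clusters {0: [6.5, 9, 10], 1: [15, 16, 18.5]}
```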
Empty Clusters
• Several strategies
– Choose the point that contributes most to SSE and make it a new centroid (see the sketch after this list)
– Choose a point from the cluster with the
highest SSE, and make it a new centroid
– If there are several empty clusters, the above
can be repeated several times.
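A minimal sketch of the first strategy on made-up one-dimensional data: the point whose squared distance to its nearest centroid is largest is the one contributing most to the SSE, and it becomes the replacement centroid:

```python
def replacement_centroid(points, centroids):
    """Return the point contributing most to the SSE (farthest from its
    nearest centroid); use it as the centroid of an empty cluster."""
    def sse_contribution(p):
        return min((p - c) ** 2 for c in centroids)
    return max(points, key=sse_contribution)

points = [1.0, 1.2, 5.0, 5.1, 9.0]
centroids = [1.1, 5.05]          # suppose a third cluster ended up empty
print(replacement_centroid(points, centroids))  # 9.0
```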
PAM: A Typical K-Medoids Algorithm
The PAM loop, illustrated with K = 2 (the figure annotates a total cost of 20 for the initial assignment and 26 for the swap being evaluated):
1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the nearest medoid.
3. Randomly select a non-medoid object, O_random.
4. Compute the total cost of swapping a medoid O with O_random.
5. Swap O and O_random if the quality (total cost) is improved.
6. Repeat steps 3–5 until no change.
Initial medoids: c1 = O2 = (3, 4), c2 = O8 = (7, 4)

Object   x   y   dist to c1   dist to c2   cluster
O1       2   6   3            7            c1
O2       3   4   0            4            c1
O3       3   8   4            8            c1
O4       4   7   4            6            c1
O5       6   2   5            3            c2
O6       6   4   3            1            c2
O7       7   3   5            1            c2
O8       7   4   4            0            c2
O9       8   5   6            2            c2
O10      7   6   6            2            c2

Total cost = (3 + 0 + 4 + 4) + (3 + 1 + 1 + 0 + 2 + 2) = 20
Calculate the distances using the Manhattan distance |x2 − x1| + |y2 − y1|; for example, dist(O1, O8) = |7 − 2| + |4 − 6| = 5 + 2 = 7.
After swapping medoid O8 with the non-medoid O7: c1 = O2 = (3, 4), c2 = O7 = (7, 3)

Object   x   y   dist to c1   dist to c2   cluster
O1       2   6   3            8            c1
O2       3   4   0            5            c1
O3       3   8   4            9            c1
O4       4   7   4            7            c1
O5       6   2   5            2            c2
O6       6   4   3            2            c2
O7       7   3   5            0            c2
O8       7   4   4            1            c2
O9       8   5   6            3            c2
O10      7   6   6            3            c2

Total cost = (3 + 0 + 4 + 4) + (2 + 2 + 0 + 1 + 3 + 3) = 22 > 20, so this swap does not improve the quality and is rejected.
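Both cost totals can be reproduced by recomputing the assignment cost for each candidate medoid set; a minimal sketch using the coordinates from the tables above:

```python
def manhattan(p, q):
    """Manhattan distance |x2 - x1| + |y2 - y1|."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

objects = {
    "O1": (2, 6), "O2": (3, 4), "O3": (3, 8), "O4": (4, 7), "O5": (6, 2),
    "O6": (6, 4), "O7": (7, 3), "O8": (7, 4), "O9": (8, 5), "O10": (7, 6),
}

def total_cost(medoids):
    """Sum of each object's Manhattan distance to its nearest medoid."""
    return sum(min(manhattan(p, objects[m]) for m in medoids)
               for p in objects.values())

print(total_cost(["O2", "O8"]))  # 20: cost of the initial medoids
print(total_cost(["O2", "O7"]))  # 22: cost after swapping O8 with O7 (worse, so rejected)
```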
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains an individual
point (or there are k clusters)
[Figures: agglomerative clustering operates on a proximity matrix over the points p1, p2, p3, p4, ..., p12; after the two closest clusters C2 and C5 are merged into C2 U C5, the '?' entries mark the proximities to C1, C3, and C4 that must be recomputed]
How to define inter-cluster similarity?
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
MIN defines cluster proximity as the proximity between the closest two points that are in different clusters; in graph terms, the shortest edge between two nodes in different subsets of nodes.
MAX defines cluster proximity as the proximity between the farthest two points that are in different clusters; in graph terms, the longest edge between two nodes in different subsets of nodes.
The Group Average technique defines cluster proximity to be the average of the pairwise proximities of all pairs of points from different clusters.
Min((3,6), (1)) = min(dist(3,1), dist(6,1)) = min(0.22, 0.23) = 0.22
Min((3,6), (2)) = min(dist(3,2), dist(6,2)) = min(0.15, 0.25) = 0.15
Min((3,6), (4)) = 0.15
Min((3,6), (5)) = 0.28
Min((2,5), (1)) = 0.24
Min((3,6), (2,5)) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5)) = min(0.15, 0.25, 0.28, 0.39) = 0.15
Min((2,5), (4)) = 0.20
Min((2,5), (3,6)) = 0.15
[Figure: the original points, the two clusters produced by single link (MIN), and the corresponding dendrogram with leaf order 3, 6, 2, 5, 4, 1]
Max((3,6), (1)) = max(0.22, 0.23) = 0.23
Max((3,6), (2)) = max(0.15, 0.25) = 0.25
Max((3,6), (4)) = max(0.15, 0.22) = 0.22
Max((3,6), (5)) = max(0.28, 0.39) = 0.39
Max((3,6,4), (1)) = max(0.22, 0.23, 0.37) = 0.37
Max((3,6,4), (2,5)) = max(0.15, 0.25, 0.20, 0.28, 0.39, 0.29) = 0.39
Max((2,5), (1)) = max(0.24, 0.34) = 0.34
Max((2,5,1), (3,6,4)) = max(0.15, 0.25, 0.20, 0.28, 0.39, 0.29, 0.22, 0.23, 0.37) = 0.39
[Figure: the two clusters produced by complete link (MAX) and the corresponding dendrogram with leaf order 3, 6, 4, 1, 2, 5]
proximity(Cluster_i, Cluster_j) = ( Σ over p_i in Cluster_i, p_j in Cluster_j of proximity(p_i, p_j) ) / ( |Cluster_i| × |Cluster_j| )
Dist((3,6), (1)) = (0.22 + 0.23) / (2 × 1) = 0.225
Dist((3,6), (2)) = (0.15 + 0.25) / (2 × 1) = 0.2
Dist((3,6), (4)) = (0.15 + 0.22) / (2 × 1) = 0.185
Dist((3,6), (5)) = (0.28 + 0.39) / (2 × 1) = 0.335
Dist((2,5), (1)) = 0.29
Dist((2,5), (3,6)) = 0.267
Dist((2,5), (4)) = 0.245
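All three linkage computations above (MIN, MAX, Group Average) follow the same pattern: collect the pairwise distances between points in different clusters and aggregate them. A minimal sketch; the pairwise distances are the ones used in the worked examples (dist(1,5) = 0.34 is implied by Dist((2,5), (1)) = 0.29, the rest appear explicitly):

```python
from itertools import product

dist = {
    (1, 2): 0.24, (1, 3): 0.22, (1, 4): 0.37, (1, 5): 0.34, (1, 6): 0.23,
    (2, 3): 0.15, (2, 4): 0.20, (2, 6): 0.25,
    (3, 4): 0.15, (3, 5): 0.28,
    (4, 5): 0.29, (4, 6): 0.22,
    (5, 6): 0.39,
}

def d(i, j):
    """Symmetric lookup into the pairwise distance table."""
    return dist[(i, j)] if (i, j) in dist else dist[(j, i)]

def cluster_proximity(a, b, linkage="min"):
    """Proximity between clusters a and b under the chosen linkage."""
    pairs = [d(i, j) for i, j in product(a, b)]
    if linkage == "min":                 # single link: closest pair
        return min(pairs)
    if linkage == "max":                 # complete link: farthest pair
        return max(pairs)
    return sum(pairs) / len(pairs)       # group average

print(cluster_proximity({3, 6}, {2, 5}, "min"))     # 0.15
print(cluster_proximity({3, 6, 4}, {2, 5}, "max"))  # 0.39
print(cluster_proximity({3, 6}, {1}, "avg"))        # 0.225
```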
[Figure: the two clusters produced by Group Average and the corresponding dendrogram with leaf order 3, 6, 4, 1, 2, 5]
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
[Figure: side-by-side comparison of the clusterings of the same six points produced by MIN, MAX, Group Average, and Ward's Method]
[Figure: SSE plotted as a function of the number of clusters K]
– Cohesion is measured by the within-cluster sum of squares (SSE): SSE = Σ_i Σ_{x in C_i} (x − m_i)^2
– Separation is measured by the between-cluster sum of squares: SSB = Σ_i |C_i| (m − m_i)^2
where |C_i| is the size of cluster i, m_i is its centroid, and m is the overall mean.
– Example: SSE and SSB always satisfy SSB + SSE = constant (the total sum of squares).
[Figure: points 1–5 on a number line with the overall mean m and cluster means m1 and m2, illustrating cohesion (distances of points to their own centroid) and separation (distances between the centroids and m)]
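A minimal sketch that computes both quantities and shows that their sum stays constant; the data set {1, 2, 4, 5} is an assumption modeled on the one-dimensional illustration above:

```python
def sse_and_ssb(clusters):
    """Cohesion (SSE) and separation (SSB) for a 1-D clustering.
    clusters is a list of lists of values; SSE + SSB equals the total
    sum of squares around the overall mean, whatever the clustering."""
    points = [x for c in clusters for x in c]
    m = sum(points) / len(points)                 # overall mean
    sse = ssb = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                      # cluster centroid
        sse += sum((x - mi) ** 2 for x in c)      # within-cluster spread
        ssb += len(c) * (m - mi) ** 2             # between-cluster spread
    return sse, ssb

print(sse_and_ssb([[1, 2, 4, 5]]))    # one cluster:  SSE = 10, SSB = 0
print(sse_and_ssb([[1, 2], [4, 5]]))  # two clusters: SSE = 1,  SSB = 9
```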