Clustering
Definition and K-means
CLUSTERING
In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
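To make the two criteria concrete, here is a small sketch (illustrative code, not from the slides) that measures both quantities for two toy clusters:

```python
import numpy as np
from itertools import combinations, product

a = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])  # cluster A
b = np.array([[5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])  # cluster B

# Mean distance between points inside the same cluster (should be small)...
intra = np.mean([np.linalg.norm(p - q)
                 for pts in (a, b) for p, q in combinations(pts, 2)])
# ...and between points of different clusters (should be large).
inter = np.mean([np.linalg.norm(p - q) for p, q in product(a, b)])

print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```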
Understanding
Example: group stocks with similar price fluctuations. Discovered clusters and their industry groups:
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, … → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, … → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
Summarization
Reduce the size of large data sets.
[Figure: clustering of precipitation in Australia]
[Figure: how many clusters? e.g. six clusters]
TYPES OF CLUSTERINGS
A clustering is a set of clusters.
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree
Density-based clustering
Groups data points that are closely packed together, based on a specified density criterion, while marking points in sparse regions as outliers. (A sketch of all three approaches follows below.)
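All three approaches appear in common libraries; the sketch below (assuming scikit-learn, which the slides do not prescribe) runs each one on the same toy data:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs in 2-D.
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0, 4, 8)])

partitional = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hierarchical = AgglomerativeClustering(n_clusters=3).fit_predict(X)
density = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # label -1 = outlier

print(np.unique(partitional), np.unique(hierarchical), np.unique(density))
```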
[Figure: original points and a partitional clustering of them]
[Figure: hierarchical clustering of points p1-p4, shown as nested clusters and as a dendrogram]
[Figure: DBSCAN clustering]
K-MEANS CLUSTERING
Partitional clustering approach:
• Each cluster is associated with a centroid (center point).
• Each point is assigned to the cluster with the closest centroid.
• The number of clusters, K, must be specified.
• The objective is to minimize the sum of distances of the points to their respective centroids.
Problem: Given a set $X$ of $n$ points in a $d$-dimensional space and an integer $K$, group the points into $K$ clusters $C = \{C_1, C_2, \ldots, C_K\}$ such that the cost

$$\mathrm{Cost}(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(x, c_i)$$

is minimized, where $c_i$ is the centroid of the points in cluster $C_i$.
• The most common definition uses Euclidean distance, minimizing the Sum of Squared Errors (SSE) written out below; K-means is sometimes defined directly by this objective.
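In its standard form, with $c_i$ the centroid (mean) of cluster $C_i$:

```latex
\mathrm{SSE}(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2,
\qquad c_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```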
• The problem is NP-hard if the dimensionality of the data is at least 2 (d ≥ 2), so finding the best solution in polynomial time is infeasible and heuristics are used in practice.
The standard iterative algorithm is also known as Lloyd's algorithm; "K-means" is sometimes used as a synonym for this algorithm.
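A minimal sketch of Lloyd's algorithm (assuming NumPy and Euclidean distance; an illustration, not the definitive implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Step 0: pick k initial centers at random from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster ends up empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Converged once the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The step-by-step demonstration below traces these two alternating steps on a small 2-D example.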
[Figure sequence: K-means demonstration on 2-D points (axes X, Y) with three cluster centers k1, k2, k3]
1. Pick 3 initial cluster centers (randomly).
2. Assign each point to its closest center, then move each cluster center to the mean of its cluster.
3. Reassign the points that are now closer to a different cluster center. (Q: Which points are reassigned? A: three points.)
4. Re-compute the cluster means.
5. Move the cluster centers to the cluster means, and repeat until no point changes clusters.
Worked example: split 14 people into 3 clusters using only one attribute, age. The initial centroids are 1, 20, and 40. The table below shows the result after steps 1 and 2: each person's distance to every centroid, from which each is assigned to the closest cluster.

Person   Age   dist to C1 (1)   dist to C2 (20)   dist to C3 (40)
P1         1         0                19                 39
P2         3         2                17                 37
P3         5         4                15                 35
P4         8         7                12                 32
P5         9         8                11                 31
P6        11        10                 9                 29
P7        12        11                 8                 28
P8        13        12                 7                 27
P9        37        36                17                  3
P10       43        42                23                  3
P11       45        44                25                  5
P12       49        48                29                  9
P13       51        50                31                 11
P14       65        64                45                 25
Re-computing the centroids gives 5, 12, and 48. Re-compute the distance between each instance and the 3 centroids: P5 is now closer to C2, so the centroids of C1 and C2 must be re-computed; C3 needs no update since its membership did not change.

Person   Age   dist to C1 (5)   dist to C2 (12)   dist to C3 (48)
P1         1         4                11                 47
P2         3         2                 9                 45
P3         5         0                 7                 43
P4         8         3                 4                 40
P5         9         4                 3                 39
P6        11         6                 1                 37
P7        12         7                 0                 36
P8        13         8                 1                 35
P9        37        32                25                 11
P10       43        38                31                  5
P11       45        40                33                  3
P12       49        44                37                  1
P13       51        46                39                  3
P14       65        60                53                 17
The centroids for the 3 clusters are now 4, 11, and 48. Calculate the distance between each instance and each centroid: P4 is now closer to C2, so the centroids of C1 and C2 must be updated; C3 needs no update since nothing changed.

Person   Age   dist to C1 (4)   dist to C2 (11)   dist to C3 (48)
P1         1         3                10                 47
P2         3         1                 8                 45
P3         5         1                 6                 43
P4         8         4                 3                 40
P5         9         5                 2                 39
P6        11         7                 0                 37
P7        12         8                 1                 36
P8        13         9                 2                 35
P9        37        33                26                 11
P10       43        39                32                  5
P11       45        41                34                  3
P12       49        45                38                  1
P13       51        47                40                  3
P14       65        61                54                 17
The 3 clusters' centroids are now 3, 10, and 48. Compute the distance between each instance and each centroid: no assignments change, so there is no new update and the algorithm has converged.

Person   Age   dist to C1 (3)   dist to C2 (10)   dist to C3 (48)
P1         1         2                 9                 47
P2         3         0                 7                 45
P3         5         2                 5                 43
P4         8         5                 2                 40
P5         9         6                 1                 39
P6        11         8                 1                 37
P7        12         9                 2                 36
P8        13        10                 3                 35
P9        37        34                27                 11
P10       43        40                33                  5
P11       45        42                35                  3
P12       49        46                39                  1
P13       51        48                41                  3
P14       65        62                55                 17
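The whole trace above can be reproduced with a few lines of Python (a sketch; note the slides round centroids to whole numbers, while the exact means are 3, 10.6, and 48.33):

```python
ages = [1, 3, 5, 8, 9, 11, 12, 13, 37, 43, 45, 49, 51, 65]
centroids = [1.0, 20.0, 40.0]          # initial centroids from the slides

while True:
    # Assign every age to the nearest centroid.
    clusters = [[] for _ in centroids]
    for a in ages:
        j = min(range(len(centroids)), key=lambda j: abs(a - centroids[j]))
        clusters[j].append(a)
    # Re-compute the centroids as cluster means.
    new = [sum(c) / len(c) for c in clusters]
    if new == centroids:               # converged: no centroid moved
        break
    centroids = new

print(centroids)   # [3.0, 10.6, 48.33...]
print(clusters)    # [[1, 3, 5], [8, 9, 11, 12, 13], [37, 43, 45, 49, 51, 65]]
```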
[Figure: original points and two different K-means clusterings of them (axes x, y)]
[Figure: K-means cluster assignments at Iteration 1, 2, 3 and subsequent iterations (axes x, y)]
[Figures: a K-means run with different initial centroids, shown iteration by iteration (panels titled Iteration 1-5; axes x, y)]
Solution: do multiple runs with different random initializations and select the clustering with the smallest error (SSE), as sketched below.
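With scikit-learn (a library assumption; the slides do not name one), the n_init parameter implements exactly this multiple-runs strategy:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))  # stand-in data for illustration

# n_init=10 runs K-means from 10 random initializations and keeps
# the result with the lowest SSE (reported as `inertia_`).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)
```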
K-means will converge for common similarity measures such as those mentioned above.
Most of the convergence happens in the first few iterations, so the stopping condition is often relaxed to "until relatively few points change clusters".
Complexity is O(n · K · I · d), where n = number of points, K = number of clusters, I = number of iterations, and d = dimensionality.
In general, K-means is a fast and efficient algorithm.
K-means has problems when clusters have:
• different sizes
• different densities
• non-globular shapes
The figures below illustrate each case.
[Figure: original points vs. K-means (3 clusters): differing cluster sizes]
[Figure: original points vs. K-means (3 clusters): differing cluster densities]
[Figure: original points vs. K-means (2 clusters): non-globular shapes]
[Figure: original points vs. K-means clusters]
TITANIC DATASET
Exercise: cluster the records into two groups, i.e. the ones who survived and the ones who did not.
https://www.kaggle.com/datasets/yasserh/titanic-dataset
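A possible starting point (the file name and the feature list below are assumptions; adapt them to the CSV downloaded from the link above):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed file name for the Kaggle download; rename as needed.
df = pd.read_csv("Titanic-Dataset.csv")

# A plausible numeric feature subset; other choices are possible.
features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
X = df[features].fillna(df[features].median())

# Scale first: K-means is distance-based, so unscaled Fare would dominate.
X = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Compare the two discovered clusters against the actual outcome.
print(pd.crosstab(labels, df["Survived"]))
```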