Clustering
Definition and K-means
CLUSTERING
 In general, a grouping of objects such that the objects in a
group (cluster) are similar (or related) to one another and
different from (or unrelated to) the objects in other groups

[Figure: two groupings of points, illustrating that intra-cluster
distances are minimized while inter-cluster distances are maximized]

 Understanding
 Group related documents for browsing, group genes and
proteins that have similar functionality, or group stocks
with similar price fluctuations

Discovered Clusters                                      Industry Group
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN,
   Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN,
   DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN,
   Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down,
   Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN,
   Sun-DOWN                                              Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN,
   ADV-Micro-Device-DOWN, Andrew-Corp-DOWN,
   Computer-Assoc-DOWN, Circuit-City-DOWN,
   Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
   Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN    Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN,
   MBNA-Corp-DOWN, Morgan-Stanley-DOWN                   Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP,
   Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP,
   Schlumberger-UP                                       Oil-UP

 Summarization
 Reduce the size of large data sets

[Figure: clustering precipitation in Australia. How many clusters?
The same points can plausibly be grouped into two, four, or six
clusters.]

 A clustering is a set of clusters

 Important distinction between hierarchical and partitional sets
of clusters

 Partitional Clustering
 A division of the data objects into subsets (clusters) such that
each data object is in exactly one subset

 Hierarchical clustering
 A set of nested clusters organized as a hierarchical tree

 Density-based clustering
 Groups data points that are closely packed together based on a
specified density criterion, while marking points in sparse regions
as outliers
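To make the density-based idea concrete, below is a minimal sketch
using scikit-learn's DBSCAN; the generated data and the eps and
min_samples values are illustrative assumptions, not tuned settings.

```python
# Minimal density-based clustering sketch using scikit-learn's DBSCAN.
# The eps and min_samples values below are illustrative, not tuned.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered outliers
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(3, 3), scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=5, size=(5, 2)),
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(labels)  # points labeled -1 are treated as noise/outliers
```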

[Figure: original points (left) and a partitional clustering of them (right)]
[Figure: a traditional hierarchical clustering of points p1–p4 and the
corresponding traditional dendrogram]

[Figure: a non-traditional hierarchical clustering of the same points
p1–p4 and the corresponding non-traditional dendrogram]


Clustering algorithms
 K-means and its variants

 Hierarchical clustering

 DBSCAN

 Partitional clustering approach
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 Number of clusters, K, must be specified
 The objective is to minimize the sum of distances of the points
to their respective centroid

 Problem: Given a set X of n points in a d-dimensional space
and an integer K, group the points into K clusters C = {C1,
C2, …, CK} such that

$$\text{Cost}(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \text{dist}(x, c_i)$$

is minimized, where ci is the centroid of the points in cluster Ci

• The most common definition uses Euclidean distance, minimizing
the Sum of Squared Errors (SSE) function
 Sometimes K-means is defined this way

 Problem: Given a set X of n points in a d-dimensional space
and an integer K, group the points into K clusters C = {C1,
C2, …, CK} such that

$$\text{Cost}(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2$$

is minimized, where ci is the mean of the points in cluster Ci
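To make the objective concrete, here is a short NumPy sketch (not
from the slides) that evaluates the SSE for a given assignment; the
small data set and the function name are illustrative.

```python
# Illustrative NumPy sketch: evaluate the K-means SSE objective
# for a given assignment of points to clusters.
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances of points to their centroids."""
    return sum(
        np.sum((X[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    )

X = np.array([[1.0], [3.0], [5.0], [20.0], [22.0]])
labels = np.array([0, 0, 0, 1, 1])
centroids = np.array([[3.0], [21.0]])
print(sse(X, labels, centroids))  # 8.0 + 2.0 = 10.0
```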

• NP-hard if the dimensionality of the data is at least 2 (d ≥ 2)
 Finding the optimal solution in polynomial time is infeasible
(unless P = NP)

• For d = 1 the problem is solvable in polynomial time (how?)

• A simple iterative algorithm works quite well in practice

 Also known as Lloyd’s algorithm.
 K-means is sometimes synonymous with this algorithm
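Below is a minimal NumPy sketch of Lloyd's algorithm under the SSE
objective. The random initialization and the convergence test are
illustrative choices; real implementations also handle empty clusters,
use smarter seeding, and do multiple restarts.

```python
# Minimal sketch of Lloyd's algorithm (K-means) in NumPy.
# Illustrative only: real implementations handle empty clusters,
# smarter initialization (e.g., K-means++), and multiple restarts.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k initial centers at random from the data
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each center to the mean of its assigned points
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):  # converged: no center moved
            break
        centers = new_centers
    return labels, centers
```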

[Animated figure: K-means on 2-D points, with cluster centers k1, k2, k3]
1. Pick 3 initial cluster centers (randomly)
2. Assign each point to the closest cluster center
3. Move each cluster center to the mean of its cluster
4. Reassign the points that are now closest to a different cluster
   center (Q: which points are reassigned? A: three points)
5. Re-compute the cluster means
6. Move the cluster centers to the cluster means; repeat until no
   point changes cluster
 Split 14 people into 3 groups
 Only one attribute, age
 Initial centroids are 1, 20, and 40
 The table below shows the result after steps 1 and 2

Person  Age   dist to C1 (1)   dist to C2 (20)   dist to C3 (40)
P1        1         0                19                39
P2        3         2                17                37
P3        5         4                15                35
P4        8         7                12                32
P5        9         8                11                31
P6       11        10                 9                29
P7       12        11                 8                28
P8       13        12                 7                27
P9       37        36                17                 3
P10      43        42                23                 3
P11      45        44                25                 5
P12      49        48                29                 9
P13      51        50                31                11
P14      65        64                45                25
 Re-computing the centroids gives 5, 12, and 48
 Re-compute the distance between each instance and the 3 clusters
 P5 is now closer to C2
 Need to re-compute the centroids of C1 and C2
 No need to update C3, as it did not change

Person  Age   dist to C1 (5)   dist to C2 (12)   dist to C3 (48)
P1        1         4                11                47
P2        3         2                 9                45
P3        5         0                 7                43
P4        8         3                 4                40
P5        9         4                 3                39
P6       11         6                 1                37
P7       12         7                 0                36
P8       13         8                 1                35
P9       37        32                25                11
P10      43        38                31                 5
P11      45        40                33                 3
P12      49        44                37                 1
P13      51        46                39                 3
P14      65        60                53                17
 The centroids for the 3 clusters are now 4, 11, and 48
 Calculate the distance between each instance and each cluster
 P4 is now closer to C2
 Need to update the centroids of C1 and C2
 No need to update C3, as nothing changed

Person  Age   dist to C1 (4)   dist to C2 (11)   dist to C3 (48)
P1        1         3                10                47
P2        3         1                 8                45
P3        5         1                 6                43
P4        8         4                 3                40
P5        9         5                 2                39
P6       11         7                 0                37
P7       12         8                 1                36
P8       13         9                 2                35
P9       37        33                26                11
P10      43        39                32                 5
P11      45        41                34                 3
P12      49        45                38                 1
P13      51        47                40                 3
P14      65        61                54                17
 The 3 clusters' centroids are now 3, 10, and 48
 Compute the distance between each instance and each cluster
 No assignment changed, so there is no new update: the algorithm
has converged

Person  Age   dist to C1 (3)   dist to C2 (10)   dist to C3 (48)
P1        1         2                 9                47
P2        3         0                 7                45
P3        5         2                 5                43
P4        8         5                 2                40
P5        9         6                 1                39
P6       11         8                 1                37
P7       12         9                 2                36
P8       13        10                 3                35
P9       37        34                27                11
P10      43        40                33                 5
P11      45        42                35                 3
P12      49        46                39                 1
P13      51        48                41                 3
P14      65        62                55                17
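For concreteness, a short NumPy sketch that reproduces this 1-D run;
centroid means are truncated to integers at each step, matching the
tables above.

```python
# Reproduce the 1-D age example above: 14 ages, initial centroids 1, 20, 40.
# Centroid means are truncated to integers at each step, as in the tables.
import numpy as np

ages = np.array([1, 3, 5, 8, 9, 11, 12, 13, 37, 43, 45, 49, 51, 65])
centroids = np.array([1, 20, 40])

while True:
    # Step 1: assign each person to the nearest centroid
    labels = np.abs(ages[:, None] - centroids[None, :]).argmin(axis=1)
    # Step 2: recompute each centroid as the (truncated) mean of its cluster
    new = np.array([int(ages[labels == i].mean()) for i in range(3)])
    if np.array_equal(new, centroids):  # converged: no centroid changed
        break
    centroids = new

print(centroids)  # [ 3 10 48 ], as in the final table
```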
[Figure: a set of original points in the plane, followed by two
K-means results on the same points: an optimal clustering and a
sub-optimal clustering]


[Animated figure: a K-means run shown over iterations 1–6]

[Figure: the same run, iterations 1 through 6 shown side by side]
[Animated figure: a second K-means run, from different initial
centroids, shown over iterations 1–5]

[Figure: the second run, iterations 1 through 5 shown side by side]

 Do multiple runs and select the clustering with the smallest
error

 Select the initial set of points by methods other than random,
e.g., pick points that are far apart from each other as cluster
centers. This is the idea behind the K-means++ algorithm, which
picks each new center randomly with probability proportional to
its squared distance from the nearest center already chosen; see
the sketch below
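A minimal sketch of the K-means++ seeding rule; the function and
variable names here are my own, not from the slides.

```python
# Sketch of K-means++ initialization (illustrative).
# Each new center is drawn with probability proportional to the
# squared distance from the nearest center chosen so far.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]   # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```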

 K-means will converge for the common similarity measures
mentioned above
 Most of the convergence happens in the first few iterations
 Often the stopping condition is relaxed to "until relatively few
points change clusters"
 Complexity is O(n * K * I * d)
 n = number of points, K = number of clusters,
I = number of iterations, d = dimensionality
 In general, a fast and efficient algorithm

 K-means has problems when clusters have
 Different sizes
 Different densities
 Non-globular shapes

 K-means also has problems when the data contains outliers

[Figure: clusters of different sizes: original points vs. K-means (3 clusters)]

[Figure: clusters of different densities: original points vs. K-means (3 clusters)]

[Figure: non-globular cluster shapes: original points vs. K-means (2 clusters)]

[Figure: original points vs. K-means with many clusters]

One solution is to use many clusters: each cluster then captures a
part of a natural cluster, but the parts need to be put back together.
 K-medoids: Similar problem definition as in K-means, but the
centroid of the cluster is defined to be one of the points in the
cluster (the medoid)

 K-centers: Similar problem definition as in K-means, but the
goal now is to minimize the maximum diameter of the clusters
(the diameter of a cluster is the maximum distance between any
two points in the cluster)
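As a sketch of the K-medoids idea, here is a simplified
alternating-update variant (Voronoi-iteration style, not the full PAM
algorithm); the function name and details are illustrative.

```python
# Simplified K-medoids sketch (Voronoi-iteration style, not full PAM).
# Each cluster center is restricted to be one of its own points.
# Empty clusters and tie-breaking are not handled; illustrative only.
import numpy as np

def kmedoids(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all points
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iters):
        # Assign each point to its nearest medoid
        labels = D[:, medoids].argmin(axis=1)
        new = medoids.copy()
        for i in range(k):
            members = np.where(labels == i)[0]
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster
            new[i] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new, medoids):  # converged: medoids unchanged
            break
        medoids = new
    return labels, medoids
```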

TITANIC DATASET
 Cluster the records into two groups, i.e., the ones who survived
and the ones who did not

https://www.kaggle.com/datasets/yasserh/titanic-dataset
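A possible starting point for this exercise is sketched below, using
pandas and scikit-learn. The file name and the choice of features are
assumptions; the column names follow the standard Kaggle Titanic CSV.

```python
# Hedged sketch for the exercise: cluster Titanic passengers into 2 groups
# and compare the clusters with the Survived column. Column names follow
# the standard Kaggle Titanic CSV; the file name is an assumption.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Titanic-Dataset.csv")          # path is an assumption
features = df[["Pclass", "Age", "Fare"]].copy()  # small illustrative subset
features["Age"] = features["Age"].fillna(features["Age"].median())

X = StandardScaler().fit_transform(features)     # K-means is scale-sensitive
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cross-tabulate cluster membership against actual survival
print(pd.crosstab(labels, df["Survived"]))
```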
