Clustering
Definition and K-means
CLUSTERING
In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
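To make the two criteria concrete, here is a small sketch (illustrative code, not from the slides) that measures both quantities for two toy clusters:

```python
import numpy as np
from itertools import combinations, product

a = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])  # cluster A
b = np.array([[5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])  # cluster B

# Mean distance between points inside the same cluster (should be small)...
intra = np.mean([np.linalg.norm(p - q)
                 for pts in (a, b) for p, q in combinations(pts, 2)])
# ...and between points of different clusters (should be large).
inter = np.mean([np.linalg.norm(p - q) for p, q in product(a, b)])

print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```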
Understanding
Example: group stocks with similar price fluctuations. Discovered clusters and their industry groups:
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, … → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, … → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
Summarization
Reduce the size of large data sets.
[Figure: clustering of precipitation in Australia]
[Figure: how many clusters? e.g. six clusters]
TYPES OF CLUSTERINGS
A clustering is a set of clusters.
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree
Density-based clustering
Groups data points that are closely packed together, based on a specified density criterion, while marking points in sparse regions as outliers. (A sketch of all three approaches follows below.)
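All three approaches appear in common libraries; the sketch below (assuming scikit-learn, which the slides do not prescribe) runs each one on the same toy data:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs in 2-D.
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0, 4, 8)])

partitional = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hierarchical = AgglomerativeClustering(n_clusters=3).fit_predict(X)
density = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # label -1 = outlier

print(np.unique(partitional), np.unique(hierarchical), np.unique(density))
```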
[Figure: original points and a partitional clustering of them]
[Figure: hierarchical clustering of points p1-p4, shown as nested clusters and as a dendrogram]
[Figure: DBSCAN clustering]
K-MEANS CLUSTERING
Partitional clustering approach:
• Each cluster is associated with a centroid (center point).
• Each point is assigned to the cluster with the closest centroid.
• The number of clusters, K, must be specified.
• The objective is to minimize the sum of distances of the points to their respective centroids.
Problem: Given a set $X$ of $n$ points in a $d$-dimensional space and an integer $K$, group the points into $K$ clusters $C = \{C_1, C_2, \ldots, C_K\}$ such that the cost

$$\mathrm{Cost}(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(x, c_i)$$

is minimized, where $c_i$ is the centroid of the points in cluster $C_i$.
• The most common definition uses Euclidean distance, minimizing the Sum of Squared Errors (SSE) written out below; K-means is sometimes defined directly by this objective.
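In its standard form, with $c_i$ the centroid (mean) of cluster $C_i$:

```latex
\mathrm{SSE}(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2,
\qquad c_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```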
• The problem is NP-hard if the dimensionality of the data is at least 2 (d ≥ 2), so finding the best solution in polynomial time is infeasible and heuristics are used in practice.
The standard iterative algorithm is also known as Lloyd's algorithm; "K-means" is sometimes used as a synonym for this algorithm.
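A minimal sketch of Lloyd's algorithm (assuming NumPy and Euclidean distance; an illustration, not the definitive implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Step 0: pick k initial centers at random from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster ends up empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Converged once the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The step-by-step demonstration below traces these two alternating steps on a small 2-D example.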
[Figure sequence: K-means demonstration on 2-D points (axes X, Y) with three cluster centers k1, k2, k3]
1. Pick 3 initial cluster centers (randomly).
2. Assign each point to its closest center, then move each cluster center to the mean of its cluster.
3. Reassign the points that are now closer to a different cluster center. (Q: Which points are reassigned? A: three points.)
4. Re-compute the cluster means.
5. Move the cluster centers to the cluster means, and repeat until no point changes clusters.
Worked example: split 14 people into 3 clusters using only one attribute, age. The initial centroids are 1, 20, and 40. The table below shows the result after steps 1 and 2: each person's distance to every centroid, from which each is assigned to the closest cluster.

Person   Age   dist to C1 (1)   dist to C2 (20)   dist to C3 (40)
P1         1         0                19                 39
P2         3         2                17                 37
P3         5         4                15                 35
P4         8         7                12                 32
P5         9         8                11                 31
P6        11        10                 9                 29
P7        12        11                 8                 28
P8        13        12                 7                 27
P9        37        36                17                  3
P10       43        42                23                  3
P11       45        44                25                  5
P12       49        48                29                  9
P13       51        50                31                 11
P14       65        64                45                 25
Re-computing the centroids gives 5, 12, and 48. Re-compute the distance between each instance and the 3 centroids: P5 is now closer to C2, so the centroids of C1 and C2 must be re-computed; C3 needs no update since its membership did not change.

Person   Age   dist to C1 (5)   dist to C2 (12)   dist to C3 (48)
P1         1         4                11                 47
P2         3         2                 9                 45
P3         5         0                 7                 43
P4         8         3                 4                 40
P5         9         4                 3                 39
P6        11         6                 1                 37
P7        12         7                 0                 36
P8        13         8                 1                 35
P9        37        32                25                 11
P10       43        38                31                  5
P11       45        40                33                  3
P12       49        44                37                  1
P13       51        46                39                  3
P14       65        60                53                 17
The centroids for the 3 clusters are now 4, 11, and 48. Calculate the distance between each instance and each centroid: P4 is now closer to C2, so the centroids of C1 and C2 must be updated; C3 needs no update since nothing changed.

Person   Age   dist to C1 (4)   dist to C2 (11)   dist to C3 (48)
P1         1         3                10                 47
P2         3         1                 8                 45
P3         5         1                 6                 43
P4         8         4                 3                 40
P5         9         5                 2                 39
P6        11         7                 0                 37
P7        12         8                 1                 36
P8        13         9                 2                 35
P9        37        33                26                 11
P10       43        39                32                  5
P11       45        41                34                  3
P12       49        45                38                  1
P13       51        47                40                  3
P14       65        61                54                 17
The 3 clusters' centroids are now 3, 10, and 48. Compute the distance between each instance and each centroid: no assignments change, so there is no new update and the algorithm has converged.

Person   Age   dist to C1 (3)   dist to C2 (10)   dist to C3 (48)
P1         1         2                 9                 47
P2         3         0                 7                 45
P3         5         2                 5                 43
P4         8         5                 2                 40
P5         9         6                 1                 39
P6        11         8                 1                 37
P7        12         9                 2                 36
P8        13        10                 3                 35
P9        37        34                27                 11
P10       43        40                33                  5
P11       45        42                35                  3
P12       49        46                39                  1
P13       51        48                41                  3
P14       65        62                55                 17
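The whole trace above can be reproduced with a few lines of Python (a sketch; note the slides round centroids to whole numbers, while the exact means are 3, 10.6, and 48.33):

```python
ages = [1, 3, 5, 8, 9, 11, 12, 13, 37, 43, 45, 49, 51, 65]
centroids = [1.0, 20.0, 40.0]          # initial centroids from the slides

while True:
    # Assign every age to the nearest centroid.
    clusters = [[] for _ in centroids]
    for a in ages:
        j = min(range(len(centroids)), key=lambda j: abs(a - centroids[j]))
        clusters[j].append(a)
    # Re-compute the centroids as cluster means.
    new = [sum(c) / len(c) for c in clusters]
    if new == centroids:               # converged: no centroid moved
        break
    centroids = new

print(centroids)   # [3.0, 10.6, 48.33...]
print(clusters)    # [[1, 3, 5], [8, 9, 11, 12, 13], [37, 43, 45, 49, 51, 65]]
```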
[Figure: original points and two different K-means clusterings of them (axes x, y)]
[Figure: K-means cluster assignments at Iteration 1, 2, 3 and subsequent iterations (axes x, y)]
[Figures: a K-means run with different initial centroids, shown iteration by iteration (panels titled Iteration 1-5; axes x, y)]
Solution: do multiple runs with different random initializations and select the clustering with the smallest error (SSE), as sketched below.
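With scikit-learn (a library assumption; the slides do not name one), the n_init parameter implements exactly this multiple-runs strategy:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))  # stand-in data for illustration

# n_init=10 runs K-means from 10 random initializations and keeps
# the result with the lowest SSE (reported as `inertia_`).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)
```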
K-means will converge for common similarity measures such as those mentioned above.
Most of the convergence happens in the first few iterations, so the stopping condition is often relaxed to "until relatively few points change clusters".
Complexity is O(n · K · I · d), where n = number of points, K = number of clusters, I = number of iterations, and d = dimensionality.
In general, K-means is a fast and efficient algorithm.
K-means has problems when clusters have:
• different sizes
• different densities
• non-globular shapes
The figures below illustrate each case.
[Figure: original points vs. K-means (3 clusters): differing cluster sizes]
[Figure: original points vs. K-means (3 clusters): differing cluster densities]
[Figure: original points vs. K-means (2 clusters): non-globular shapes]
[Figure: original points vs. K-means clusters]
TITANIC DATASET
Exercise: cluster the records into two groups, i.e. the ones who survived and the ones who did not.
https://www.kaggle.com/datasets/yasserh/titanic-dataset
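A possible starting point (the file name and the feature list below are assumptions; adapt them to the CSV downloaded from the link above):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed file name for the Kaggle download; rename as needed.
df = pd.read_csv("Titanic-Dataset.csv")

# A plausible numeric feature subset; other choices are possible.
features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
X = df[features].fillna(df[features].median())

# Scale first: K-means is distance-based, so unscaled Fare would dominate.
X = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Compare the two discovered clusters against the actual outcome.
print(pd.crosstab(labels, df["Survived"]))
```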