
Artificial Intelligence (CSC9YE)

K-Means Clustering

Gabriela Ochoa
[email protected]
Clustering
The main task in unsupervised learning

- The input is a set of examples, each described by a vector of attribute values, but no class labels.
- The output is a set of two or more clusters of examples.
- The system should automatically identify groups of similar examples.

Two Clustering Methods

- In K-means clustering, we seek to partition the observations into a pre-specified number of clusters.
- In hierarchical clustering:
  - we do not know in advance how many clusters we want;
  - we end up with a tree-like visual representation of the observations, called a dendrogram.
- The dendrogram allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.

Characteristics of the Clusters

- How to describe clusters?
  - The simplest approach relies on centroids.
  - If all attributes are numeric, the centroid is given by the averages of the individual attributes.
  - Example: (2, 5), (1, 4), (3, 6). The centroid is (2, 5) because (2 + 1 + 3)/3 = 2 and (5 + 4 + 6)/3 = 5.
- What should the clusters be like?
  - Clusters should not overlap: each example must belong to one and only one cluster.
  - Within the same cluster, the examples should be relatively close to each other; closer than to the examples from the other clusters.

Measuring Distance

- Clustering algorithms need a mechanism to evaluate the distance between an example and a cluster.
- When clusters are described by their centroids, the Euclidean distance between the example and the centroid is a good way of measuring distance.
- The Euclidean distance can be applied directly when attributes are numerical.
- When attributes are categorical, Euclidean distance can also be used (a short encoding sketch follows below):
  - Boolean variables can be transformed into 0 and 1.
  - Other categorical variables (e.g. Seasons) can be transformed into Boolean attributes (e.g. Summer: yes/no).

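As an illustration of converting categorical attributes, here is a minimal Python sketch (not from the original slides; the attribute name and category values are assumed for the example) that turns a "season" value into Boolean 0/1 attributes so Euclidean distance can be applied:

```python
# Hypothetical categorical attribute with four possible values.
SEASONS = ["spring", "summer", "autumn", "winter"]

def encode_season(season):
    """Map a season name to 0/1 indicator attributes, e.g. "summer" -> (0, 1, 0, 0)."""
    return tuple(1 if season == s else 0 for s in SEASONS)

print(encode_season("summer"))  # (0, 1, 0, 0)
```
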
Distance
Euclidean Distance

- In the 2D plane, the Euclidean distance between p_1 = (x_1, y_1) and p_2 = (x_2, y_2) is given by the Pythagoras theorem:

  d(p_1, p_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}

- In 3D, the Euclidean distance between (x_1, y_1, z_1) and (x_2, y_2, z_2) is given by the Pythagoras theorem:

  d(p_1, p_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}

- In general, the distance between points x and y in R^n (n dimensions):

  d(x, y) = |x - y| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

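A minimal Python sketch of the general formula (illustrative only; the function name is an assumption, not part of the slides):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two points given as equal-length sequences of numbers."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# For example, the distance between (2, 5) and (1, 4):
print(euclidean_distance((2, 5), (1, 4)))  # 1.4142...
```
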
K-means Clustering Algorithm

1. Randomly select k points in the dataset. These serve as initial cluster centroids for the observations.
2. Assign each observation to the cluster whose centroid is closest.
3. Iterate until the cluster assignments stop changing:
   3.1 For each of the k clusters, compute the cluster centroid.
   3.2 Assign each observation to the cluster whose centroid is closest.

Notes:
- Centroid: a point in the “centre” of the cluster.
- The notion of closest is defined using the Euclidean distance.
- Ties should be broken deterministically to avoid looping. Example: assign to the cluster with the lowest index.
- A short Python sketch of these steps follows below.

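The following is a minimal, illustrative Python sketch of the algorithm above; the function name and details such as the empty-cluster handling are assumptions, not part of the original slides:

```python
import math
import random

def k_means(points, k, max_iter=100):
    """Minimal k-means sketch; points is a list of equal-length numeric tuples."""
    # Step 1: randomly select k points from the dataset as the initial centroids.
    centroids = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Steps 2 / 3.2: assign each point to its closest centroid;
        # min() breaks ties by the lowest cluster index, as the notes suggest.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 3: stop once the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3.1: recompute each centroid as the per-dimension mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster happens to be empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assignment, centroids
```

For example, k_means([(1, 2), (1, 4), (8, 8), (9, 7)], 2) should group the two left-hand points together and the two right-hand points together (up to the labelling of the clusters).
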
What is the Centroid of a set of points?

- The most representative point within the group is called the centroid.
- To find the centroid, one computes the (arithmetic) mean of the points’ positions separately for each dimension.
- For example, let us assume we have 3 dimensions and 3 points:
  - (-1, 10, 3)
  - (0, 5, 2)
  - (1, 20, 10)
- The centroid will be ((-1 + 0 + 1)/3, (10 + 5 + 20)/3, (3 + 2 + 10)/3), which simplifies to (0, 11.67, 5).
- The centroid does not have to be (and rarely is) one of the original data points.

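As a small illustrative sketch (the helper name is an assumption), the per-dimension mean from the example above can be computed like this:

```python
def centroid(points):
    """Per-dimension arithmetic mean of a list of equal-length numeric tuples."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

print(centroid([(-1, 10, 3), (0, 5, 2), (1, 20, 10)]))  # (0.0, 11.666..., 5.0)
```
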
K-means Algorithm
Example with D = 2, K = 2

[Figure: scatter plot of ten points, labelled 1-10, in the 2D plane; both axes run from 0 to 8.]

K-means Algorithm
Randomly choose centroids. Calculate distance between all points and centroids.

[Figure: scatter plot of the ten points with the two randomly chosen centroids C1 and C2, which coincide with points 6 and 10 (their distances below are 0.00).]

Distances from each point to the centroids:

Point   C1     C2
1       6.08   5.39
2       5.10   5.10
3       4.24   3.16
4       2.24   5.39
5       1.00   6.40
6       0.00   7.21
7       7.28   6.08
8       6.08   5.00
9       8.06   3.61
10      7.21   0.00

K-means Algorithm
Assign points to clusters. Each point assigned to the closest centroid.

[Figure: scatter plot with c1 at point 6 and c2 at point 10; each point is assigned to its closest centroid. Using the distances below (and the lowest-index tie-breaking rule for point 2), C1 = {2, 4, 5, 6} and C2 = {1, 3, 7, 8, 9, 10}.]

Distances from each point to the centroids:

Point   C1     C2
1       6.08   5.39
2       5.10   5.10
3       4.24   3.16
4       2.24   5.39
5       1.00   6.40
6       0.00   7.21
7       7.28   6.08
8       6.08   5.00
9       8.06   3.61
10      7.21   0.00

K-means Algorithm
Iteration 1

[Figure: the centroids c1 and c2 are recomputed as the means of their current clusters, and the distances are recalculated.]

Distances from each point to the updated centroids:

Point   C1     C2
1       4.26   5.89
2       3.26   5.23
3       2.57   2.46
4       0.35   4.17
5       1.77   4.55
6       1.90   5.48
7       7.29   4.25
8       5.93   2.95
9       7.29   2.95
10      5.71   2.32

K-means Algorithm
Iteration 2

[Figure: the centroids are recomputed again after the new assignments.]

Distances from each point to the updated centroids:

Point   C1     C2
1       3.41   7.07
2       2.41   6.40
3       2.24   3.61
4       0.63   5.10
5       2.61   5.10
6       2.72   6.08
7       7.72   3.16
8       6.32   2.00
9       7.38   2.00
10      5.39   3.00

K-means Algorithm
Iteration 3: no change in centroids

[Figure: the cluster assignments no longer change, so the centroids stay where they are and the algorithm stops.]

Distances from each point to the final centroids:

Point   C1     C2
1       3.34   7.96
2       2.34   7.30
3       1.86   4.51
4       0.69   5.94
5       2.67   5.77
6       2.91   6.77
7       7.47   2.51
8       6.07   1.68
9       7.03   1.35
10      5.01   3.58

Properties of the Algorithm

- K-means is guaranteed to decrease (locally minimise) the total distance from examples to their cluster centroids.
- However, it is not guaranteed to find the best solution.
- K-means is not deterministic:
  - it requires initial centroids (randomly selected);
  - it does matter what the initial centroids are!
- What can go wrong? The algorithm may get stuck in a local optimum.

Local Optimum
[Figure: an example clustering stuck in a local optimum; both axes run from -1 to 1.]

Various schemes exist for preventing this:

- Multiple restarts (a short sketch follows below)
- Variance-based split / merge
- Initialisation heuristics

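As a minimal sketch of the first remedy, multiple restarts (reusing the illustrative k_means function and math import from the earlier sketch; the cost function and names here are assumptions): run the algorithm several times from different random centroids and keep the run with the lowest total within-cluster distance.

```python
def within_cluster_cost(points, assignment, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))

def k_means_with_restarts(points, k, restarts=10):
    """Run k_means several times and keep the lowest-cost solution."""
    best_cost, best_result = float("inf"), None
    for _ in range(restarts):
        assignment, centroids = k_means(points, k)
        cost = within_cluster_cost(points, assignment, centroids)
        if cost < best_cost:
            best_cost, best_result = cost, (assignment, centroids)
    return best_result
```
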
Summary

- In K-means clustering, we seek to partition the observations into a pre-specified number of clusters.
- The performance of the algorithm depends on the initialisation (the initial centroids).
- K-means is still very much used in practice!
- Its main limitation is that the number of clusters K must be specified in advance.
- Next: Hierarchical Clustering

