
Artificial Intelligence (CSC9YE)

K-Means Clustering

Gabriela Ochoa
[email protected]
Clustering
The main task in unsupervised learning

- The input is a set of examples, each described by a vector of attribute values, but no class labels.
- The output is a set of two or more clusters of examples.
- The system should automatically identify groups of similar examples.

Two Clustering Methods

- In K-means clustering, we seek to partition the observations into a pre-specified number of clusters.
- In hierarchical clustering:
  - we do not know in advance how many clusters we want;
  - we end up with a tree-like visual representation of the observations, called a dendrogram.
- The dendrogram allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.

Characteristics of the Clusters

- How to describe clusters?
  - The simplest approach relies on centroids.
  - If all attributes are numeric, the centroid is given by the averages of the individual attributes.
  - Example: (2, 5), (1, 4), (3, 6). The centroid is (2, 5) because (2 + 1 + 3)/3 = 2 and (5 + 4 + 6)/3 = 5.
- What should the clusters be like?
  - Clusters should not overlap: each example must belong to one and only one cluster.
  - Within the same cluster, the examples should be relatively close to each other; closer than to the examples from the other clusters.

Measuring Distance

- Clustering algorithms need a mechanism to evaluate the distance between an example and a cluster.
- When clusters are described by their centroids, the Euclidean distance between the example and the centroid is a good way of measuring distance.
- The Euclidean distance can be applied directly when attributes are numerical.
- When attributes are categorical, Euclidean distance can also be used (a short encoding sketch follows below):
  - Boolean variables can be transformed into 0 and 1.
  - Other categorical variables (e.g. Seasons) can be transformed into Boolean attributes (e.g. Summer: yes/no).

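As an illustration of converting categorical attributes, here is a minimal Python sketch (not from the original slides; the attribute name and category values are assumed for the example) that turns a "season" value into Boolean 0/1 attributes so Euclidean distance can be applied:

```python
# Hypothetical categorical attribute with four possible values.
SEASONS = ["spring", "summer", "autumn", "winter"]

def encode_season(season):
    """Map a season name to 0/1 indicator attributes, e.g. "summer" -> (0, 1, 0, 0)."""
    return tuple(1 if season == s else 0 for s in SEASONS)

print(encode_season("summer"))  # (0, 1, 0, 0)
```
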
Distance
Euclidean Distance

- In the 2D plane, the Euclidean distance between p_1 = (x_1, y_1) and p_2 = (x_2, y_2) is given by the Pythagoras theorem:

  d(p_1, p_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}

- In 3D, the Euclidean distance between (x_1, y_1, z_1) and (x_2, y_2, z_2) is given by the Pythagoras theorem:

  d(p_1, p_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}

- In general, the distance between points x and y in R^n (n dimensions):

  d(x, y) = |x - y| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

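A minimal Python sketch of the general formula (illustrative only; the function name is an assumption, not part of the slides):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two points given as equal-length sequences of numbers."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# For example, the distance between (2, 5) and (1, 4):
print(euclidean_distance((2, 5), (1, 4)))  # 1.4142...
```
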
K-means Clustering Algorithm

1. Randomly select k points in the dataset. These serve as initial cluster centroids for the observations.
2. Assign each observation to the cluster whose centroid is closest.
3. Iterate until the cluster assignments stop changing:
   3.1 For each of the k clusters, compute the cluster centroid.
   3.2 Assign each observation to the cluster whose centroid is closest.

Notes:
- Centroid: a point in the “centre” of the cluster.
- The notion of closest is defined using the Euclidean distance.
- Ties should be broken deterministically to avoid looping. Example: assign to the cluster with the lowest index.
- A short Python sketch of these steps follows below.

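The following is a minimal, illustrative Python sketch of the algorithm above; the function name and details such as the empty-cluster handling are assumptions, not part of the original slides:

```python
import math
import random

def k_means(points, k, max_iter=100):
    """Minimal k-means sketch; points is a list of equal-length numeric tuples."""
    # Step 1: randomly select k points from the dataset as the initial centroids.
    centroids = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Steps 2 / 3.2: assign each point to its closest centroid;
        # min() breaks ties by the lowest cluster index, as the notes suggest.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 3: stop once the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3.1: recompute each centroid as the per-dimension mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster happens to be empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assignment, centroids
```

For example, k_means([(1, 2), (1, 4), (8, 8), (9, 7)], 2) should group the two left-hand points together and the two right-hand points together (up to the labelling of the clusters).
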
What is the Centroid of a set of points?

- The most representative point within the group is called the centroid.
- To find the centroid, one computes the (arithmetic) mean of the points’ positions separately for each dimension.
- For example, let us assume we have 3 dimensions and 3 points:
  - (-1, 10, 3)
  - (0, 5, 2)
  - (1, 20, 10)
- The centroid will be ((-1 + 0 + 1)/3, (10 + 5 + 20)/3, (3 + 2 + 10)/3), which simplifies to (0, 11.67, 5).
- The centroid does not have to be (and rarely is) one of the original data points.

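As a small illustrative sketch (the helper name is an assumption), the per-dimension mean from the example above can be computed like this:

```python
def centroid(points):
    """Per-dimension arithmetic mean of a list of equal-length numeric tuples."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

print(centroid([(-1, 10, 3), (0, 5, 2), (1, 20, 10)]))  # (0.0, 11.666..., 5.0)
```
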
K-means Algorithm
Example with D = 2, K = 2

[Figure: scatter plot of ten points, labelled 1-10, in the 2D plane; both axes run from 0 to 8.]

K-means Algorithm
Randomly choose centroids. Calculate distance between all points and centroids.

[Figure: scatter plot of the ten points with the two randomly chosen centroids C1 and C2, which coincide with points 6 and 10 (their distances below are 0.00).]

Distances from each point to the centroids:

Point   C1     C2
1       6.08   5.39
2       5.10   5.10
3       4.24   3.16
4       2.24   5.39
5       1.00   6.40
6       0.00   7.21
7       7.28   6.08
8       6.08   5.00
9       8.06   3.61
10      7.21   0.00

K-means Algorithm
Assign points to clusters. Each point assigned to the closest centroid.

[Figure: scatter plot with c1 at point 6 and c2 at point 10; each point is assigned to its closest centroid. Using the distances below (and the lowest-index tie-breaking rule for point 2), C1 = {2, 4, 5, 6} and C2 = {1, 3, 7, 8, 9, 10}.]

Distances from each point to the centroids:

Point   C1     C2
1       6.08   5.39
2       5.10   5.10
3       4.24   3.16
4       2.24   5.39
5       1.00   6.40
6       0.00   7.21
7       7.28   6.08
8       6.08   5.00
9       8.06   3.61
10      7.21   0.00

K-means Algorithm
Iteration 1

[Figure: the centroids c1 and c2 are recomputed as the means of their current clusters, and the distances are recalculated.]

Distances from each point to the updated centroids:

Point   C1     C2
1       4.26   5.89
2       3.26   5.23
3       2.57   2.46
4       0.35   4.17
5       1.77   4.55
6       1.90   5.48
7       7.29   4.25
8       5.93   2.95
9       7.29   2.95
10      5.71   2.32

K-means Algorithm
Iteration 2

[Figure: the centroids are recomputed again after the new assignments.]

Distances from each point to the updated centroids:

Point   C1     C2
1       3.41   7.07
2       2.41   6.40
3       2.24   3.61
4       0.63   5.10
5       2.61   5.10
6       2.72   6.08
7       7.72   3.16
8       6.32   2.00
9       7.38   2.00
10      5.39   3.00

K-means Algorithm
Iteration 3: no change in centroids

[Figure: the cluster assignments no longer change, so the centroids stay where they are and the algorithm stops.]

Distances from each point to the final centroids:

Point   C1     C2
1       3.34   7.96
2       2.34   7.30
3       1.86   4.51
4       0.69   5.94
5       2.67   5.77
6       2.91   6.77
7       7.47   2.51
8       6.07   1.68
9       7.03   1.35
10      5.01   3.58

Properties of the Algorithm

- K-means is guaranteed to decrease (locally minimise) the total distance from examples to their cluster centroids.
- However, it is not guaranteed to find the best solution.
- K-means is not deterministic:
  - it requires initial centroids (randomly selected);
  - it does matter what the initial centroids are!
- What can go wrong? The algorithm may get stuck in a local optimum.

Local Optimum
[Figure: an example clustering stuck in a local optimum; both axes run from -1 to 1.]

Various schemes exist for preventing this:

- Multiple restarts (a short sketch follows below)
- Variance-based split / merge
- Initialisation heuristics

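As a minimal sketch of the first remedy, multiple restarts (reusing the illustrative k_means function and math import from the earlier sketch; the cost function and names here are assumptions): run the algorithm several times from different random centroids and keep the run with the lowest total within-cluster distance.

```python
def within_cluster_cost(points, assignment, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))

def k_means_with_restarts(points, k, restarts=10):
    """Run k_means several times and keep the lowest-cost solution."""
    best_cost, best_result = float("inf"), None
    for _ in range(restarts):
        assignment, centroids = k_means(points, k)
        cost = within_cluster_cost(points, assignment, centroids)
        if cost < best_cost:
            best_cost, best_result = cost, (assignment, centroids)
    return best_result
```
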
Summary

- In K-means clustering, we seek to partition the observations into a pre-specified number of clusters.
- The performance of the algorithm depends on the initialisation (the initial centroids).
- K-means is still very much used in practice!
- Its main limitation is that the number of clusters K must be specified in advance.
- Next: Hierarchical Clustering

