
K Means Clustering


Lecture 7

K Means Clustering

Abrar Hasan
Lecturer
Dept. of Software Engineering

Clustering

▪ Clustering: the process of grouping a set of objects into classes of similar objects
▪ Documents within a cluster should be similar.
▪ Documents from different clusters should be dissimilar.

▪ Clustering is the most common form of unsupervised learning.
▪ Unsupervised learning = learning from raw (unlabeled) data, as opposed to supervised learning, where a classification of the examples is given.
▪ It is a common and important task with many applications in information retrieval (IR) and elsewhere.

[Figure: data points grouped into two clusters, Class 1 and Class 2, with a centroid marked]

Example of Clustering:
In real life, when a company opens a store, it works out which location would be best for that store.

Ever wondered how this is done?


Let's give you an idea.

At first, the company collects a dataset of the addresses (latitude and longitude) of its potential customers.

The data set looks like this.

Latitude (x)    Longitude (y)
20.201          49.81513
25.72864        50.11764
23.8191         47.39496
18.69347        44.06723
17.98995        52.33613
22.81407        56.47059
14.37186        48.30252
14.37186        52.84034
16.68341        56.16806
51.65829        20.47059
44.52261        20.97479
48.34171        16.94118
43.51759        16.13446
53.16582        17.04202
54.87437        24.60504
48.34171        25.10924
56.38191        20.7731
50.85427        13.61345
57.18593        15.32773

Looks scary, right?

Let's make it beautiful.

[Figure: scatter plot of the customer locations, with latitude (x) on the horizontal axis (0 to 70) and longitude (y) on the vertical axis (0 to 60)]

Is it beneficial?

[Figure: the same scatter plot with a single Store Location marked]

So the company actually makes clusters.

[Figure: the same scatter plot with two store locations marked, Store Location 1 and Store Location 2]

Now the question arises:

What should the store locations (latitude and longitude) be?

And how should the clustering be done?



K-Means

Remember Euclidean distance?

$$\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

For the points (2, 4) and (4, 5):

$$= \sqrt{(2 - 4)^2 + (4 - 5)^2}$$

[Figure: the points (2, 4) and (4, 5) plotted on a grid]
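
The distance formula can be checked with a few lines of Python. This is a minimal sketch; the function name `euclidean_distance` is an illustrative choice, not something from the slides.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two 2-D points given as (x, y) tuples."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# The two points from the slide
d = euclidean_distance((2, 4), (4, 5))
print(round(d, 3))  # 2.236, i.e. the square root of 5
```
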
Steps of K-means:

STEP 1: First decide the value of K (how many clusters you want to make).


STEP 2: Let's suppose K = 2, and you have the dataset of customer locations (latitude, longitude).

As K is 2, the first two data points are taken as the initial centroids of the two clusters.

Centroid of cluster 1 = (20.201, 49.81513)
Centroid of cluster 2 = (25.72864, 50.11764)

[Figure: the two initial centroids, (20.201, 49.81513) and (25.72864, 50.11764), plotted]

If K = 3, we would select the first three data points instead:

Centroid of cluster 1 = (20.201, 49.81513)
Centroid of cluster 2 = (25.72864, 50.11764)
Centroid of cluster 3 = (23.8191, 47.39496)
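
The "first K points" rule can be sketched as below. The helper name `initial_centroids` is hypothetical, and only the first four rows of the dataset are included to keep the example short. Taking the first K rows is the rule used in these slides; other starting rules (for example, picking K points at random) are also common.

```python
# First few rows of the slide's dataset as (latitude, longitude) pairs
points = [
    (20.201, 49.81513),
    (25.72864, 50.11764),
    (23.8191, 47.39496),
    (18.69347, 44.06723),
]

def initial_centroids(points, k):
    """Use the first k data points as starting centroids (the slide's rule)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    return list(points[:k])

print(initial_centroids(points, 2))  # the two centroids for K = 2
print(initial_centroids(points, 3))  # the three centroids for K = 3
```
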

STEP 3: Assign each data point to its nearest centroid

For the data point (23.8191, 47.39496), calculate the Euclidean distance to each centroid.

For cluster 1:

$$\sqrt{(X_1 - \text{data}_X)^2 + (Y_1 - \text{data}_Y)^2} = \sqrt{(20.201 - 23.8191)^2 + (49.81513 - 47.39496)^2} \approx 4.35$$

For cluster 2:

$$\sqrt{(X_2 - \text{data}_X)^2 + (Y_2 - \text{data}_Y)^2} = \sqrt{(25.72864 - 23.8191)^2 + (50.11764 - 47.39496)^2} \approx 3.33$$

The distance to cluster 2 is smaller, so the data point (23.8191, 47.39496) belongs to cluster 2.
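
The assignment step can be reproduced in code. This is a small sketch, assuming the two initial centroids chosen for K = 2; the names `euclidean_distance` and `nearest_centroid` are illustrative.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two 2-D points given as (x, y) tuples."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [euclidean_distance(point, c) for c in centroids]
    best = min(range(len(centroids)), key=lambda i: distances[i])
    return best, distances[best]

centroids = [(20.201, 49.81513), (25.72864, 50.11764)]
index, dist = nearest_centroid((23.8191, 47.39496), centroids)
print(index + 1, round(dist, 2))  # prints: 2 3.33 (cluster 2 is nearer)
```
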

[Figure: the centroids (20.2, 49.8) and (25.7, 50.1) plotted together with the data point (23.8, 47.3)]

STEP 4: Update the centroid

As the new data point belongs to cluster 2, we have to update the centroid of cluster 2. Cluster 2 now holds two points (its initial centroid and the new data point), so the new centroid is their average:

$$\text{New Centroid}_X = \frac{\text{Old Centroid}_X + \text{New Value}_X}{2} = \frac{25.72864 + 23.8191}{2} \approx 24.7739$$

$$\text{New Centroid}_Y = \frac{\text{Old Centroid}_Y + \text{New Value}_Y}{2} = \frac{50.11764 + 47.39496}{2} \approx 48.7563$$

[Figure: the updated centroid of cluster 2 plotted with the points (20.2, 49.8), (25.7, 50.1) and (23.8, 47.3)]
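
The centroid update can be sketched as taking the mean of the points currently in the cluster. With exactly two points this matches the "(old + new) / 2" formula on the slide. Treating the centroid as the mean of all members for larger clusters is the usual convention and is an assumption here, since the slides only show the two-point case. The function name `centroid` is illustrative.

```python
def centroid(points):
    """Mean position of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Cluster 2 after the assignment step: its initial point plus the new point
cluster_2 = [(25.72864, 50.11764), (23.8191, 47.39496)]
new_x, new_y = centroid(cluster_2)
print(round(new_x, 4), round(new_y, 4))  # prints: 24.7739 48.7563
```
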

FIND OPTIMUM K VALUE


Calculating the total distance of cluster 1 and cluster 2

Total distance = the sum of the Euclidean distances from each cluster's centroid to all of the points in that cluster.
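
The total distance can be computed as below. This is a sketch on hypothetical toy clusters (not the slide's dataset), chosen so that the sum is easy to verify by hand. It sums plain Euclidean distances, as the slide describes; some treatments sum squared distances instead.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two 2-D points given as (x, y) tuples."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def total_distance(clusters, centroids):
    """Sum of distances from every point to the centroid of its own cluster."""
    return sum(
        euclidean_distance(point, centre)
        for members, centre in zip(clusters, centroids)
        for point in members
    )

# Hypothetical toy clusters and their centroids
clusters = [[(0, 0), (0, 2)], [(10, 10), (14, 10)]]
centroids = [(0, 1), (12, 10)]
print(total_distance(clusters, centroids))  # 1 + 1 + 2 + 2 = 6.0
```
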

Now consider K = 3.
You get three clusters, and so three centroids.
Calculate the total error (the total distance) for this clustering.

Do this for K = 1, 2, 3, …
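
Repeating the calculation for several values of K might look like the sketch below. It makes two assumptions that go beyond the slides: it uses the common batch form of K-means (assign every point, then recompute each centroid as the mean of its members, and repeat until nothing changes) rather than the one-point-at-a-time update shown earlier, and it stops after a hypothetical `max_iter` rounds. All function names are illustrative.

```python
import math

# The slide's dataset as (latitude, longitude) pairs
points = [
    (20.201, 49.81513), (25.72864, 50.11764), (23.8191, 47.39496),
    (18.69347, 44.06723), (17.98995, 52.33613), (22.81407, 56.47059),
    (14.37186, 48.30252), (14.37186, 52.84034), (16.68341, 56.16806),
    (51.65829, 20.47059), (44.52261, 20.97479), (48.34171, 16.94118),
    (43.51759, 16.13446), (53.16582, 17.04202), (54.87437, 24.60504),
    (48.34171, 25.10924), (56.38191, 20.7731), (50.85427, 13.61345),
    (57.18593, 15.32773),
]

def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def mean(members):
    """Mean position of a non-empty list of (x, y) points."""
    n = len(members)
    return (sum(p[0] for p in members) / n, sum(p[1] for p in members) / n)

def k_means(points, k, max_iter=100):
    """Batch K-means started from the first k points (an assumed variant)."""
    centroids = list(points[:k])
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def total_error(points, k):
    """Total distance from every point to its own cluster's centroid."""
    centroids, clusters = k_means(points, k)
    return sum(distance(p, c) for c, members in zip(centroids, clusters) for p in members)

errors = {k: total_error(points, k) for k in range(1, 6)}
for k, err in errors.items():
    print(k, round(err, 2))
```

The `errors` dictionary holds one total error per value of K, which is the data needed for a graph of total error against K.
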


Now plot the total error against K on a graph.

[Figure: graph of total error against K]

Thank You
