K Means Clustering
Abrar Hasan
Lecturer
Dept. of Software Engineering
Clustering
▪ Clustering: the process of grouping a set of objects into classes of
similar objects
▪ Documents within a cluster should be similar.
▪ Documents from different clusters should be dissimilar.
[Figure: data points grouped into two clusters, Class 1 and Class 2, with a centroid marked]
Example of Clustering:
In real life, when a company opens a store, it first finds out which
location would be the best one for that store.
Let's walk through the idea.
First, the company collects a dataset of the addresses (latitude and
longitude) of its potential customers.
The data set looks like this.

Latitude (x)   Longitude (y)
20.201         49.81513
25.72864       50.11764
23.8191        47.39496
18.69347       44.06723
17.98995       52.33613
22.81407       56.47059
14.37186       48.30252
14.37186       52.84034
16.68341       56.16806
51.65829       20.47059
44.52261       20.97479
48.34171       16.94118
43.51759       16.13446
53.16582       17.04202
54.87437       24.60504
48.34171       25.10924
56.38191       20.7731
50.85427       13.61345
57.18593       15.32773

Looks scary, right?
Let's make it beautiful.
[Figure: scatter plot of the customer locations, x-axis from 0 to 70, y-axis from 0 to 60]
Is it beneficial?
[Figure: the scatter plot with a store location marked]
So the company actually makes clusters.
[Figure: the scatter plot with the customer locations grouped into clusters]
K-Means
Remember Euclidean distance?
For the two points (2, 4) and (4, 5):
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} = \sqrt{(2 - 4)^2 + (4 - 5)^2} = \sqrt{5} \approx 2.24$$
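The distance formula can be checked with a few lines of Python. This is a minimal sketch; the function name `euclidean_distance` is an illustrative choice, not something taken from the slides.

```python
import math

def euclidean_distance(p, q):
    """Return the straight-line distance between two 2-D points p and q."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# The two points from the slide: (2, 4) and (4, 5).
d = euclidean_distance((2, 4), (4, 5))
print(round(d, 3))  # 2.236, i.e. the square root of 5
```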
Steps of K-Means:
STEP 2: Let's suppose K = 2, and you have the dataset of customer locations (latitude x, longitude y).
As K is 2, the first two values are used as the initial centroids of the two clusters.
So centroid of cluster 1 = (20.201, 49.81513)
So centroid of cluster 2 = (25.72864, 50.11764)
[Figure: the two initial centroids, (20.201, 49.81513) and (25.72864, 50.11764), plotted]
If K = 3, then we would select the first three values.
So centroid of cluster 1 = (20.201, 49.81513)
So centroid of cluster 2 = (25.72864, 50.11764)
So centroid of cluster 3 = (23.8191, 47.39496)
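The initialization rule described here (take the first K rows as starting centroids) can be sketched as follows. The names `points` and `initial_centroids` are illustrative assumptions, and only the first four rows of the dataset are listed to keep the example short. Other initialization schemes exist (for example, picking random points), but this sketch follows the slides.

```python
# First four rows of the customer-location dataset: (latitude x, longitude y).
points = [
    (20.201, 49.81513),
    (25.72864, 50.11764),
    (23.8191, 47.39496),
    (18.69347, 44.06723),
]

def initial_centroids(points, k):
    """Use the first k points as the starting centroids (the rule used in the slides)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    return list(points[:k])

print(initial_centroids(points, 2))  # centroids of cluster 1 and cluster 2
print(initial_centroids(points, 3))  # adds (23.8191, 47.39496) as cluster 3
```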
STEP 3: Handling data points
For data point [23.8191, 47.39496], calculate the Euclidean distance to each centroid.

For Cluster 1,
$$\sqrt{(X_1 - \mathrm{data}X)^2 + (Y_1 - \mathrm{data}Y)^2} = \sqrt{(20.201 - 23.8191)^2 + (49.81513 - 47.39496)^2} \approx 4.35$$

For Cluster 2,
$$\sqrt{(X_2 - \mathrm{data}X)^2 + (Y_2 - \mathrm{data}Y)^2} = \sqrt{(25.72864 - 23.8191)^2 + (50.11764 - 47.39496)^2} \approx 3.33$$

The distance to the centroid of Cluster 2 is smaller, so data point [23.8191, 47.39496] belongs to Cluster 2.
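The assignment step can be reproduced in code. This is a minimal sketch that assumes Python 3.8 or later (for `math.dist`); the helper name `nearest_centroid` is hypothetical.

```python
import math

def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to the given point."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=lambda i: distances[i])
    return best, distances[best]

centroids = [(20.201, 49.81513), (25.72864, 50.11764)]
point = (23.8191, 47.39496)

for number, c in enumerate(centroids, start=1):
    # Prints 4.35 for centroid 1 and 3.33 for centroid 2.
    print(f"distance to centroid {number}: {math.dist(point, c):.2f}")

index, distance = nearest_centroid(point, centroids)
print(f"point belongs to cluster {index + 1}")  # cluster 2
```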
[Figure: the two centroids (20.2, 49.8) and (25.7, 50.1) plotted together with the data point (23.8191, 47.39496)]
STEP 4: Update Centroid
As the new data point belongs to cluster 2, we have to update the centroid of cluster 2.

$$\text{New Centroid}X = \frac{\text{old Centroid}X + \text{New Value}X}{2} = \frac{25.72864 + 23.8191}{2} \approx 24.77387$$

$$\text{New Centroid}Y = \frac{\text{old Centroid}Y + \text{New Value}Y}{2} = \frac{50.11764 + 47.39496}{2} \approx 48.75630$$

So the updated centroid of cluster 2 is approximately (24.77387, 48.75630).
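The update rule can be sketched in Python. `update_centroid` follows the slide's formula (the midpoint of the old centroid and the new point). `mean_of_points` is an addition not in the slides: the usual batch form of k-means recomputes a centroid as the mean of all points assigned to the cluster, and the two agree here because the cluster holds only its initial point and the one new point.

```python
def update_centroid(old_centroid, new_point):
    """Slide rule: new centroid = midpoint of the old centroid and the new point."""
    return tuple((o + n) / 2 for o, n in zip(old_centroid, new_point))

def mean_of_points(points):
    """Batch alternative (assumption, not from the slides): mean of all cluster points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

old = (25.72864, 50.11764)   # current centroid of cluster 2
new = (23.8191, 47.39496)    # data point just assigned to cluster 2

updated = update_centroid(old, new)
print(tuple(round(v, 5) for v in updated))  # (24.77387, 48.7563)
```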
Calculating the total distance of Cluster 1 and Cluster 2
Total distance = the sum of the Euclidean distances
from the centroid to all of the
points in the cluster.
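A minimal sketch of the total-distance calculation follows. The cluster below is a hypothetical one built from three of the dataset's points, with their mean used as the centroid; the function name `total_distance` is illustrative. Many references sum squared distances instead, but this sketch uses plain distances as the slides describe.

```python
import math

def total_distance(centroid, cluster_points):
    """Sum of Euclidean distances from the centroid to every point in the cluster."""
    return sum(math.dist(centroid, p) for p in cluster_points)

# Hypothetical cluster for illustration: three points from the dataset.
cluster = [(20.201, 49.81513), (25.72864, 50.11764), (23.8191, 47.39496)]

# Centroid taken as the mean of the cluster's points (an assumption for this example).
centroid = tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

print(round(total_distance(centroid, cluster), 3))
```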
Now consider K = 3.
You get three clusters, and so three centroids.
Calculate the total error (the total distance) for these clusters as well.
Now plot the total error against K on a graph.
[Figure: graph of total error versus K]
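The total error for several values of K can be tabulated with a short script before plotting. This is a sketch under stated assumptions: it uses the batch form of k-means (each centroid recomputed as the mean of its assigned points), seeds the centroids with the first K rows as in the slides, and measures error as the summed Euclidean distance to the nearest centroid. The names `DATA`, `k_means`, and `total_error` are illustrative.

```python
import math

# Customer locations from the slides: (latitude x, longitude y).
DATA = [
    (20.201, 49.81513), (25.72864, 50.11764), (23.8191, 47.39496),
    (18.69347, 44.06723), (17.98995, 52.33613), (22.81407, 56.47059),
    (14.37186, 48.30252), (14.37186, 52.84034), (16.68341, 56.16806),
    (51.65829, 20.47059), (44.52261, 20.97479), (48.34171, 16.94118),
    (43.51759, 16.13446), (53.16582, 17.04202), (54.87437, 24.60504),
    (48.34171, 25.10924), (56.38191, 20.7731), (50.85427, 13.61345),
    (57.18593, 15.32773),
]

def k_means(points, k, max_iter=100):
    """Batch k-means seeded with the first k points; returns the final centroids."""
    centroids = list(points[:k])
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: mean of each cluster; an empty cluster keeps its old centroid.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids

def total_error(points, centroids):
    """Sum of each point's Euclidean distance to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points)

errors = {k: total_error(DATA, k_means(DATA, k)) for k in range(1, 5)}
for k, err in errors.items():
    print(f"K = {k}: total error = {err:.2f}")
```

Plotting these pairs gives the total error versus K graph. A common way to read such a graph is to look for the K after which the error stops dropping sharply.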
Thank You