Kmeans Algorithm
Arif
2023-10-10
Algorithmic Steps
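At a high level, K-Means (Lloyd's algorithm) works in four steps. First, choose the number of clusters k and pick k initial centers. Second, assign every data point to its nearest center. Third, recompute each center as the mean of the points assigned to it. Finally, repeat the assignment and update steps until the cluster assignments stop changing. Below is a minimal sketch of one assignment-and-update iteration, assuming X is a numeric matrix of observations and centers is a k-row matrix of current centers; the function lloyd_step is illustrative only, not the implementation behind R's kmeans(), which defaults to the Hartigan-Wong algorithm.
#One assignment-and-update iteration of k-means (illustrative sketch)
lloyd_step = function(X, centers) {
  k = nrow(centers)
  #Distances from every observation to every center
  d = as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  #Assignment step: each observation goes to its nearest center
  cluster = apply(d, 1, which.min)
  #Update step: each center moves to the mean of its assigned observations
  #Note: an empty cluster would yield NaN here; real implementations re-seed it
  new_centers = t(sapply(1:k, function(i) colMeans(X[cluster == i, , drop = FALSE])))
  list(cluster = cluster, centers = new_centers)
}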
Assumptions
The K-Means clustering method makes two assumptions about the clusters: first, that the clusters are spherical, and second, that they are of similar size. The spherical assumption is what lets the algorithm separate the clusters as it works through the data; if it is violated, the clusters that form may not be what one expects.
The similar-size assumption, on the other hand, helps in deciding the boundaries between clusters, since it governs how many data points each cluster should receive.
K Means Starter
To understand how K-Means works, we start with an example where all our assumptions hold. R includes a dataset, ‘faithful’, recording the waiting time between eruptions and the duration of each eruption for the Old Faithful geyser in Yellowstone National Park. The dataset consists of 272 observations of 2 features.
#Viewing the Faithful dataset
plot(faithful)
Figure 1: Scatterplot of the faithful dataset (eruptions on the x-axis, waiting on the y-axis).
Looking at the plot, we can see two clusters. I will now use the kmeans() function in R to form the clusters. Let's see how K-Means clustering works on this data.
#Specify 2 centers
k_clust_start=kmeans(faithful, centers=2)
#Plot the data using clusters
plot(faithful, col=k_clust_start$cluster,pch=2)
Figure: faithful data coloured by k-means cluster (eruptions vs. waiting).
Since this is a small dataset, the clusters form almost instantaneously, but how do we see the clusters, their centers, or their sizes? The k_clust_start variable I used contains information on both the centers and the sizes of the clusters.
Let's check them out.
#View the cluster centers
k_clust_start$centers
## eruptions waiting
## 1 4.29793 80.28488
## 2 2.09433 54.75000
#View the cluster sizes
k_clust_start$size
#Load dplyr for data_frame() (in newer dplyr, tibble() replaces the deprecated data_frame())
library(dplyr)
#Generate random data which will be the first cluster
clust1 = data_frame(x = rnorm(200), y = rnorm(200))
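The chunk that builds the outer ring and combines it with clust1 into dataset_cir did not survive in the source. Below is a minimal sketch; the radius, spread, point count, and the name clust2 are assumptions chosen to match the plotted range.
#Generate the second cluster as a ring around the first (assumed parameters)
theta = runif(200, 0, 2 * pi)
r = rnorm(200, mean = 12, sd = 0.5)
clust2 = data_frame(x = r * cos(theta), y = r * sin(theta))
#Combine both clusters into one dataset
dataset_cir = rbind(clust1, clust2)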
#See the plot
plot(dataset_cir)
Figure: dataset_cir, a central cluster with a surrounding ring (x vs. y).
Simple, isn't it? There are two clusters: one in the middle and the other circling the first. However, this violates the assumption that the clusters are spherical. The inner cluster is spherical, while the outer ring is not. Even though the clustering will not be good, let's see how k-means performs on this data.
#Fit the k-means model
k_clust_spher1=kmeans(dataset_cir, centers=2)
#Plot the data and clusters
plot(dataset_cir, col=k_clust_spher1$cluster,pch=2)
Figure: dataset_cir coloured by k-means cluster (x vs. y).
How do we solve this problem? There are clearly 2 clusters, but k-means is not working well. A simple way in this case is to transform our data into polar coordinates. Let's convert it and plot it.
#Using a function for the transformation
cart2pol = function(x, y) {
  #This is r, the radius
  newx = sqrt(x^2 + y^2)
  #This is theta; atan(y/x) folds opposite quadrants together,
  #but the radius is what separates these clusters
  newy = atan(y/x)
  x_y = cbind(newx, newy)
  return(x_y)
}
dataset_cir2=cart2pol(dataset_cir$x,dataset_cir$y)
plot(dataset_cir2)
Figure: dataset_cir2 after the polar transformation (newx = radius vs. newy = angle).
Now we run the k-means model on this transformed data.
k_clust_spher2=kmeans(dataset_cir2, centers=2)
#Plot the data and clusters
plot(dataset_cir2, col=k_clust_spher2$cluster,pch=2)
Figure: k-means clustering of the polar-transformed data (newx vs. newy).
This time the k-means algorithm works well and correctly clusters the transformed data. We can also view the clusters on the original data to double-check this.
plot(dataset_cir, col=k_clust_spher2$cluster,pch=2)
Figure: the original dataset_cir coloured by the clusters found in polar coordinates (x vs. y).
By transforming our data into polar coordinates and fitting the k-means model on the transformed data, we satisfy the spherical-cluster assumption and the data is accurately clustered. Now let's look at data where the clusters are not of similar sizes. Similar size does not mean that the clusters have to be exactly equal; it simply means that no cluster should have ‘too few’ members.
#Make the first cluster with 1000 random values
clust1 = data_frame(x = rnorm(1000), y = rnorm(1000))
#Keep 10 values together to make the second cluster
clust2=data_frame(x=c(5,5.1,5.2,5.3,5.4,5,5.1,5.2,5.3,5.4),y=c(5,5,5,5,5,4.9,4.9,4.9,4.9,4.9))
#Combine the data
dataset_uneven=rbind(clust1,clust2)
plot(dataset_uneven)
Figure: dataset_uneven, a large cluster of 1,000 points around the origin and a small cluster of 10 points near (5, 5) (x vs. y).
Here again, we have two clear clusters, but they do not satisfy the similar-size requirement of the k-means algorithm.
k_clust_spher3=kmeans(dataset_uneven, centers=2)
plot(dataset_uneven, col=k_clust_spher3$cluster,pch=2)
Figure: dataset_uneven coloured by k-means cluster (x vs. y).
Why did this happen? K-means tries to form ‘tight’ clusters by minimizing the within-cluster (intra-cluster) distance. With one huge cluster and one tiny one, it can lower that total by handing some points from the first cluster to the second, so those points are assigned incorrectly and the clustering is inaccurate.
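The code that produced the final plot is not shown in the source. Below is a minimal sketch, assuming sse collects the total within-cluster sum of squares for a range of k on dataset_uneven; the column name withinss and the range of k are assumptions.
#Sweep k and record the total within-cluster sum of squares (assumed range 2..15)
sse = data.frame(k = 2:15)
sse$withinss = sapply(sse$k, function(k)
  kmeans(dataset_uneven, centers = k)$tot.withinss)
#Plot the curve; the 'elbow' suggests a reasonable number of clusters
plot(sse$k, sse$withinss, type = "b")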
Figure: total within-cluster sum of squares against the number of clusters (sse$k).