
Kmeans Assumptions

Arif

2023-10-10

K-Means Clustering: overview


K-Means Clustering is a well-known technique based on unsupervised learning. As the name suggests, it
forms 'K' clusters over the data using the means of the data points. Unsupervised algorithms are a class of
algorithms one should tread carefully with: there are no labels to check the results against, so the quality of
the clustering depends heavily on whether the algorithm's assumptions hold.

Algorithmic Steps
The standard procedure (Lloyd's algorithm) is:
1. Choose 'K' initial cluster centers, often by picking K random data points.
2. Assign every observation to its nearest center.
3. Recompute each center as the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing.
A minimal sketch of these steps in R follows.
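This sketch is illustrative only: naive_kmeans is my own name, not part of base R, and the built-in kmeans() defaults to the more refined Hartigan-Wong algorithm rather than this plain Lloyd's loop.
#A naive Lloyd's-style k-means: alternate between assigning points to the
#nearest center and moving each center to the mean of its members
naive_kmeans = function(data, k, iters = 10){
  data = as.matrix(data)
  #Start from k randomly chosen rows as the initial centers
  centers = data[sample(nrow(data), k), , drop = FALSE]
  for(i in 1:iters){
    #Distances of every point (rows) to every center (columns)
    d = as.matrix(dist(rbind(centers, data)))[-(1:k), 1:k]
    cluster = apply(d, 1, which.min)
    #Move each center to the mean of its members (empty clusters not handled)
    for(j in 1:k){
      centers[j, ] = colMeans(data[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}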
Assumptions
The K-Means clustering method makes two assumptions about the clusters:
first, that the clusters are spherical, and second, that they are of similar size. The spherical assumption
means that points in a cluster are spread around its center in every direction, which is what allows
nearest-center assignment to separate the clusters. If this assumption is violated, the clusters formed may
not be what one expects.
The similar-size assumption, on the other hand, affects where the boundaries between clusters are drawn,
and hence roughly how many data points each cluster ends up with.

K Means Starter
To understand how K-Means works, we start with an example where all our assumptions hold. R ships with a
dataset called 'faithful' that records the waiting time between eruptions and the duration of each eruption
for the Old Faithful geyser in Yellowstone National Park. The dataset consists of 272 observations of 2 features.
#Viewing the Faithful dataset
plot(faithful)

[Figure 1: Kmeans. Scatterplot of the faithful data: eruptions (x-axis) vs. waiting (y-axis)]
Looking at the plot, we can see two clusters. I will now use the kmeans() function in R to form the clusters.
Let's see how K-Means clustering works on this data:
#Specify 2 centers
k_clust_start=kmeans(faithful, centers=2)
#Plot the data using clusters
plot(faithful, col=k_clust_start$cluster,pch=2)
[Figure: faithful data colored by the two fitted clusters, eruptions (x) vs. waiting (y)]
The dataset being small, the clusters form almost instantaneously. But how do we see the clusters, their centers,
or their sizes? The k_clust_start object created above contains information on both the centers and the sizes
of the clusters. Let's check them out:

#Use the centers to find the cluster centers
k_clust_start$centers

## eruptions waiting
## 1 4.29793 80.28488
## 2 2.09433 54.75000
#Use the size to find the cluster sizes
k_clust_start$size

## [1] 172 100


This means the first cluster has 172 members and is centered at an eruptions value of 4.29793 and a waiting
value of 80.28488. Similarly, the second cluster has 100 members, centered at 2.09433 eruptions and 54.75
waiting. Now this information is golden! We know that these centers are the cluster means.
So, eruptions typically last either ~2 minutes or ~4.3 minutes, and for longer eruptions the waiting time is
also longer.
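Since the centers are just the per-cluster means, we can verify this directly with base R's aggregate():
#Average each feature within each fitted cluster; the result should match
#k_clust_start$centers (possibly with the two rows in the opposite order)
aggregate(faithful, by = list(cluster = k_clust_start$cluster), FUN = mean)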

Getting into the depths


Imagine a dataset with clusters that one can clearly identify by eye but k-means cannot: a dataset that does
not satisfy the assumptions. A common example is a dataset representing two concentric circles. Let's
generate it and see what it looks like:
#The following code will generate different plots for you but they will be similar
library(plyr)
library(dplyr)

##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Generate random data which will be the first cluster
#(data_frame() is deprecated in the tibble package; tibble() is its direct replacement)
clust1 = tibble(x = rnorm(200), y = rnorm(200))
#Generate the second cluster, which will 'surround' the first cluster
clust2 = tibble(r = rnorm(200, 15, .5), theta = runif(200, 0, 2 * pi),
                x = r * cos(theta), y = r * sin(theta)) %>%
  dplyr::select(x, y)
#Combine the data
dataset_cir= rbind(clust1, clust2)

#see the plot
plot(dataset_cir)
[Figure: the concentric-circles dataset, x vs. y: a central blob surrounded by a ring]
Simple, isn’t it? There are two clusters – one in the middle and the other circling the first. However, this
violates the assumption that the clusters are spherical. The inner data is spherical while the outer circle is
not. Even though the clustering will not be good, let’s see how does k-means perform on this data
#Fit the k-means model
k_clust_spher1=kmeans(dataset_cir, centers=2)
#Plot the data and clusters
plot(dataset_cir, col=k_clust_spher1$cluster,pch=2)
[Figure: the circular data colored by the two fitted k-means clusters]
How do we solve this problem? There are clearly 2 clusters, but k-means is not working well. A simple fix in
this case is to transform the data into polar coordinates. Let's convert it and plot it:
#Using a function to transform Cartesian coordinates into polar ones
cart2pol = function(x, y){
  #This is r
  newx = sqrt(x^2 + y^2)
  #This is theta; atan2() keeps the correct quadrant, unlike atan(y/x)
  newy = atan2(y, x)
  x_y = cbind(newx, newy)
  return(x_y)
}
dataset_cir2=cart2pol(dataset_cir$x,dataset_cir$y)
plot(dataset_cir2)
[Figure: the transformed data, newx (radius) vs. newy (angle): the two clusters now separate along newx]
Now we run the k-means model on this data
k_clust_spher2=kmeans(dataset_cir2, centers=2)
#Plot the data and clusters
plot(dataset_cir2, col=k_clust_spher2$cluster,pch=2)

[Figure: the polar-coordinate data colored by the two fitted clusters]
This time the k-means algorithm works well and clusters the transformed data correctly. We can also view
the clusters on the original data to double-check this:
plot(dataset_cir, col=k_clust_spher2$cluster,pch=2)
[Figure: the original circular data colored by the clusters found in polar coordinates: ring and blob are now correctly separated]
By transforming the data into polar coordinates and fitting the k-means model on the transformed data, we
satisfy the spherical-cluster assumption and the data is accurately clustered. Now let's look at a dataset where
the clusters are not of similar sizes. Similar size does not mean that the clusters have to be exactly equal; it
simply means that no cluster should have 'too few' members.
#Make the first cluster with 1000 random values
clust1 = tibble(x = rnorm(1000), y = rnorm(1000))

#Keep 10 values close together to make the second cluster
clust2 = tibble(x = c(5, 5.1, 5.2, 5.3, 5.4, 5, 5.1, 5.2, 5.3, 5.4),
                y = c(5, 5, 5, 5, 5, 4.9, 4.9, 4.9, 4.9, 4.9))
#Combine the data
dataset_uneven=rbind(clust1,clust2)
plot(dataset_uneven)
[Figure: the uneven dataset, x vs. y: a large cluster around the origin and a tiny cluster near (5, 5)]
Here again, we have two clear clusters, but they do not satisfy the similar-size requirement of the k-means algorithm.
k_clust_spher3=kmeans(dataset_uneven, centers=2)
plot(dataset_uneven, col=k_clust_spher3$cluster,pch=2)
[Figure: the uneven data colored by the two fitted clusters: part of the large cluster is assigned to the small one]
Why did this happen? K-means minimizes the within-cluster sum of squares to create 'tight' clusters. In this
process, it incorrectly assigns some data points belonging to the first cluster to the second cluster, which
makes the clustering inaccurate.
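One way to see the mis-assignment directly is to cross-tabulate the known origin of each row against the fitted clusters (by construction, the first 1000 rows of dataset_uneven came from clust1 and the last 10 from clust2):
#Off-diagonal counts are points that k-means placed in the wrong group
true_group = rep(c("clust1", "clust2"), c(1000, 10))
table(true_group, k_clust_spher3$cluster)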

How to decide the value of K in data


The datasets I worked with in this article are all simple, and it is easy to identify the clusters by plotting them.
Complicated datasets do not offer this luxury. The Elbow method is a popular way of finding a suitable
value of 'K' for k-means clustering: it computes the within-groups SSE (sum of squared errors) for a range of
values of k and plots them. From this plot we choose the 'k' at which the SSE stops dropping sharply, creating
an 'elbow' effect. I will illustrate this on the iris dataset using petal width and petal length.
#Create a vector for storing the sse
sse=vector('numeric')
for(i in 2:15){
#the kmeans() fit includes a component 'withinss' which stores the SSE for each cluster
sse[i-1]=sum(kmeans(iris[,3:4],centers=i)$withinss)
}
#Converting the sse to a data frame and storing corresponding value of k
sse=as.data.frame(sse)
sse$k=seq.int(2,15)
#Making the plot. This plot is also known as a scree plot
plot(sse$k,sse$sse,type="b")
[Figure: scree plot of within-groups SSE against k: the curve drops steeply and then flattens, forming an elbow]
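One caveat with this loop: each kmeans() call starts from random centers, so a single run per k can land in a poor local optimum and make the curve bumpy. A variant of the same computation that uses the nstart argument of kmeans() to keep the best of several random starts (tot.withinss is simply sum(withinss)):
#Best of 25 random starts per k gives a smoother, more reliable scree plot
sse2 = sapply(2:15, function(k) kmeans(iris[, 3:4], centers = k, nstart = 25)$tot.withinss)
plot(2:15, sse2, type = "b", xlab = "k", ylab = "within-groups SSE")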
