
Kmeans Assumptions

Arif

2023-10-10

K-Means Clustering: overview


K-Means Clustering is a well-known technique based on unsupervised learning. As the name suggests, it
forms 'K' clusters over the data using the means of the data points. Unsupervised algorithms are a class of
algorithms one should tread carefully with: there are no labels to check the results against, so the quality of
the clustering depends heavily on whether the algorithm's assumptions hold.

Algorithmic Steps
The standard procedure (Lloyd's algorithm) is:
1. Choose 'K' initial cluster centers, often by picking K random data points.
2. Assign every observation to its nearest center.
3. Recompute each center as the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing.
A minimal sketch of these steps in R follows.
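This sketch is illustrative only: naive_kmeans is my own name, not part of base R, and the built-in kmeans() defaults to the more refined Hartigan-Wong algorithm rather than this plain Lloyd's loop.
#A naive Lloyd's-style k-means: alternate between assigning points to the
#nearest center and moving each center to the mean of its members
naive_kmeans = function(data, k, iters = 10){
  data = as.matrix(data)
  #Start from k randomly chosen rows as the initial centers
  centers = data[sample(nrow(data), k), , drop = FALSE]
  for(i in 1:iters){
    #Distances of every point (rows) to every center (columns)
    d = as.matrix(dist(rbind(centers, data)))[-(1:k), 1:k]
    cluster = apply(d, 1, which.min)
    #Move each center to the mean of its members (empty clusters not handled)
    for(j in 1:k){
      centers[j, ] = colMeans(data[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}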
Assumptions
The K-Means clustering method makes two assumptions about the clusters:
first, that the clusters are spherical, and second, that they are of similar size. The spherical assumption
means that points in a cluster are spread around its center in every direction, which is what allows
nearest-center assignment to separate the clusters. If this assumption is violated, the clusters formed may
not be what one expects.
The similar-size assumption, on the other hand, affects where the boundaries between clusters are drawn,
and hence roughly how many data points each cluster ends up with.

K Means Starter
To understand how K-Means works, we start with an example where all our assumptions hold. R ships with a
dataset called 'faithful' that records the waiting time between eruptions and the duration of each eruption
for the Old Faithful geyser in Yellowstone National Park. The dataset consists of 272 observations of 2 features.
#Viewing the Faithful dataset
plot(faithful)

[Figure 1: Kmeans. Scatterplot of the faithful data: eruptions (x-axis) vs. waiting (y-axis)]
Looking at the plot, we can see two clusters. I will now use the kmeans() function in R to form the clusters.
Let's see how K-Means clustering works on this data:
#Specify 2 centers
k_clust_start=kmeans(faithful, centers=2)
#Plot the data using clusters
plot(faithful, col=k_clust_start$cluster,pch=2)
[Figure: faithful data colored by the two fitted clusters, eruptions (x) vs. waiting (y)]
The dataset being small, the clusters form almost instantaneously. But how do we see the clusters, their centers,
or their sizes? The k_clust_start object created above contains information on both the centers and the sizes
of the clusters. Let's check them out:

#Use the centers to find the cluster centers
k_clust_start$centers

## eruptions waiting
## 1 4.29793 80.28488
## 2 2.09433 54.75000
#Use the size to find the cluster sizes
k_clust_start$size

## [1] 172 100


This means the first cluster has 172 members and is centered at an eruptions value of 4.29793 and a waiting
value of 80.28488. Similarly, the second cluster has 100 members, centered at 2.09433 eruptions and 54.75
waiting. Now this information is golden! We know that these centers are the cluster means.
So, eruptions typically last either ~2 minutes or ~4.3 minutes, and for longer eruptions the waiting time is
also longer.
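Since the centers are just the per-cluster means, we can verify this directly with base R's aggregate():
#Average each feature within each fitted cluster; the result should match
#k_clust_start$centers (possibly with the two rows in the opposite order)
aggregate(faithful, by = list(cluster = k_clust_start$cluster), FUN = mean)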

Getting into the depths


Imagine a dataset with clusters that one can clearly identify by eye but k-means cannot: a dataset that does
not satisfy the assumptions. A common example is a dataset representing two concentric circles. Let's
generate it and see what it looks like:
#The following code will generate different plots for you but they will be similar
library(plyr)
library(dplyr)

##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Generate random data which will be the first cluster
#(data_frame() is deprecated in the tibble package; tibble() is its direct replacement)
clust1 = tibble(x = rnorm(200), y = rnorm(200))
#Generate the second cluster, which will 'surround' the first cluster
clust2 = tibble(r = rnorm(200, 15, .5), theta = runif(200, 0, 2 * pi),
                x = r * cos(theta), y = r * sin(theta)) %>%
  dplyr::select(x, y)
#Combine the data
dataset_cir= rbind(clust1, clust2)

#see the plot
plot(dataset_cir)
[Figure: the concentric-circles dataset, x vs. y: a central blob surrounded by a ring]
Simple, isn’t it? There are two clusters – one in the middle and the other circling the first. However, this
violates the assumption that the clusters are spherical. The inner data is spherical while the outer circle is
not. Even though the clustering will not be good, let’s see how does k-means perform on this data
#Fit the k-means model
k_clust_spher1=kmeans(dataset_cir, centers=2)
#Plot the data and clusters
plot(dataset_cir, col=k_clust_spher1$cluster,pch=2)
[Figure: the circular data colored by the two fitted k-means clusters]
How do we solve this problem? There are clearly 2 clusters, but k-means is not working well. A simple fix in
this case is to transform the data into polar coordinates. Let's convert it and plot it:
#Using a function to transform Cartesian coordinates into polar ones
cart2pol = function(x, y){
  #This is r
  newx = sqrt(x^2 + y^2)
  #This is theta; atan2() keeps the correct quadrant, unlike atan(y/x)
  newy = atan2(y, x)
  x_y = cbind(newx, newy)
  return(x_y)
}
dataset_cir2=cart2pol(dataset_cir$x,dataset_cir$y)
plot(dataset_cir2)
[Figure: the transformed data, newx (radius) vs. newy (angle): the two clusters now separate along newx]
Now we run the k-means model on this data
k_clust_spher2=kmeans(dataset_cir2, centers=2)
#Plot the data and clusters
plot(dataset_cir2, col=k_clust_spher2$cluster,pch=2)

[Figure: the polar-coordinate data colored by the two fitted clusters]
This time the k-means algorithm works well and clusters the transformed data correctly. We can also view
the clusters on the original data to double-check this:
plot(dataset_cir, col=k_clust_spher2$cluster,pch=2)
[Figure: the original circular data colored by the clusters found in polar coordinates: ring and blob are now correctly separated]
By transforming the data into polar coordinates and fitting the k-means model on the transformed data, we
satisfy the spherical-cluster assumption and the data is accurately clustered. Now let's look at a dataset where
the clusters are not of similar sizes. Similar size does not mean that the clusters have to be exactly equal; it
simply means that no cluster should have 'too few' members.
#Make the first cluster with 1000 random values
clust1 = tibble(x = rnorm(1000), y = rnorm(1000))

#Keep 10 values close together to make the second cluster
clust2 = tibble(x = c(5, 5.1, 5.2, 5.3, 5.4, 5, 5.1, 5.2, 5.3, 5.4),
                y = c(5, 5, 5, 5, 5, 4.9, 4.9, 4.9, 4.9, 4.9))
#Combine the data
dataset_uneven=rbind(clust1,clust2)
plot(dataset_uneven)
[Figure: the uneven dataset, x vs. y: a large cluster around the origin and a tiny cluster near (5, 5)]
Here again, we have two clear clusters, but they do not satisfy the similar-size requirement of the k-means algorithm.
k_clust_spher3=kmeans(dataset_uneven, centers=2)
plot(dataset_uneven, col=k_clust_spher3$cluster,pch=2)
[Figure: the uneven data colored by the two fitted clusters: part of the large cluster is assigned to the small one]
Why did this happen? K-means minimizes the within-cluster sum of squares to create 'tight' clusters. In this
process, it incorrectly assigns some data points belonging to the first cluster to the second cluster, which
makes the clustering inaccurate.
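One way to see the mis-assignment directly is to cross-tabulate the known origin of each row against the fitted clusters (by construction, the first 1000 rows of dataset_uneven came from clust1 and the last 10 from clust2):
#Off-diagonal counts are points that k-means placed in the wrong group
true_group = rep(c("clust1", "clust2"), c(1000, 10))
table(true_group, k_clust_spher3$cluster)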

How to decide the value of K in data


The datasets I worked with in this article are all simple, and it is easy to identify the clusters by plotting them.
Complicated datasets do not offer this luxury. The Elbow method is a popular way of finding a suitable
value of 'K' for k-means clustering: it computes the within-groups SSE (sum of squared errors) for a range of
values of k and plots them. From this plot we choose the 'k' at which the SSE stops dropping sharply, creating
an 'elbow' effect. I will illustrate this on the iris dataset using petal width and petal length.
#Create a vector for storing the sse
sse=vector('numeric')
for(i in 2:15){
#the kmeans() fit includes a component 'withinss' which stores the SSE for each cluster
sse[i-1]=sum(kmeans(iris[,3:4],centers=i)$withinss)
}
#Converting the sse to a data frame and storing corresponding value of k
sse=as.data.frame(sse)
sse$k=seq.int(2,15)
#Making the plot. This plot is also known as a scree plot
plot(sse$k,sse$sse,type="b")
[Figure: scree plot of within-groups SSE against k: the curve drops steeply and then flattens, forming an elbow]
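One caveat with this loop: each kmeans() call starts from random centers, so a single run per k can land in a poor local optimum and make the curve bumpy. A variant of the same computation that uses the nstart argument of kmeans() to keep the best of several random starts (tot.withinss is simply sum(withinss)):
#Best of 25 random starts per k gives a smoother, more reliable scree plot
sse2 = sapply(2:15, function(k) kmeans(iris[, 3:4], centers = k, nstart = 25)$tot.withinss)
plot(2:15, sse2, type = "b", xlab = "k", ylab = "within-groups SSE")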
