
WORKING OF K-MEANS ALGORITHM
AN UNSUPERVISED LEARNING ALGORITHM

BY YASH BHURE

11 JULY 2023
In the field of data science, k-means clustering is a widely used unsupervised learning algorithm that aims to discover underlying patterns and group similar data points together. It is particularly useful for exploratory data analysis and extracting meaningful insights from unlabelled data. In this article, we will walk through the working of the k-means algorithm, step by step, and explore its applications in various industries.

Research Sources:

The K-Means Algorithm Evolution: LINK

Machine Learning For Absolute Beginners: A Plain English Introduction: LINK

Towards Data Science - "A Comprehensive Guide to K-Means Clustering Algorithm": LINK

To understand the working concept of the k-means algorithm, it's essential to get familiar with clustering. In this article, we will look at the following topics to understand the working of k-means clustering:

The concepts of clustering
Understanding k-means clustering
Setting the value of 'k'
Applications of k-means clustering
CLUSTERING
One helpful approach to analyzing information is to identify clusters of data that share similar attributes. For example, your company may wish to examine a segment of customers that purchase at the same time of the year and recognize what factors influence their purchasing behavior.

By understanding a particular cluster of customers, you can make informed decisions about which products to recommend to customer groups through promotions and personalized offers. Outside of market research, clustering can be applied to various other scenarios, including pattern recognition, fraud detection, and image processing.

Clustering analysis falls under the banner of both supervised learning and unsupervised learning. As a supervised learning technique, clustering is used to classify new data points into existing clusters through k-nearest neighbors (k-NN); as an unsupervised learning technique, clustering is applied to identify discrete groups of data points through k-means clustering. Although there are other forms of clustering techniques, these two algorithms are generally the most popular in both machine learning and data mining.
Understanding k-means Clustering
As a popular unsupervised learning algorithm, k-means clustering
attempts to divide data into k discrete groups and is effective at
uncovering basic data patterns. Here's how the algorithm works:

Step 1: Initialization
The k-means clustering algorithm works by first splitting data into k clusters, with k representing the number of clusters you wish to create. If you choose to split your dataset into three clusters, then k is set to 3. In Figure 2, we can see that the original (unclustered) data has been transformed into three clusters (k is 3). If we were to set k to 4, an additional cluster would be derived from the dataset to produce four clusters.
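As a concrete illustration of this step, here is a minimal NumPy sketch of initialization. The synthetic dataset and the choice of k = 3 are assumptions for demonstration, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative 2-D dataset: 150 points, one (x, y) pair per row.
X = rng.normal(loc=0.0, scale=5.0, size=(150, 2))

# Step 1: choose k, the number of clusters to create.
k = 3

# A common initialization: pick k distinct data points at random
# to serve as the initial centroids.
initial_indices = rng.choice(len(X), size=k, replace=False)
centroids = X[initial_indices]
print(centroids)
```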
How does k-means clustering separate the data points?

Step 2: Assigning Data Points to Clusters
To begin, examine the unclustered data on the scatterplot and manually select a centroid for each of the k clusters. That centroid then forms the epicenter of an individual cluster. Centroids can be chosen at random, which means you can nominate any data point on the scatterplot to act as a centroid. However, you can save time by choosing centroids dispersed across the scatterplot rather than directly adjacent to each other. In other words, start by guessing where you think the centroids for each cluster might be located. The remaining data points on the scatterplot are then assigned to the closest centroid by measuring the Euclidean distance.

Each data point can be assigned to only one cluster, and each cluster is discrete. This means that there is no overlap between clusters and no case of nesting a cluster inside another cluster. Also, all data points, including anomalies, are assigned to a centroid irrespective of how they impact the final shape of the cluster. However, due to the statistical force that pulls all nearby data points to a central point, your clusters will generally form an elliptical or spherical shape. After all data points have been allocated to a centroid, the next step is to calculate the mean value of all data points for each cluster, which can be found by averaging the x and y values of all data points in that cluster.
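Continuing the sketch above, the assignment step can be expressed as a small function that measures the Euclidean distance from every point to every centroid and picks the nearest one:

```python
import numpy as np

def assign_to_clusters(X, centroids):
    """Assign each row of X to the index of its nearest centroid.

    Distances are plain Euclidean distances, computed between every
    point and every centroid via broadcasting.
    """
    # Shape (n_points, k): distance from each point to each centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the closest centroid for each point.
    return np.argmin(distances, axis=1)

# labels = assign_to_clusters(X, centroids)  # labels[i] is in {0, ..., k-1}
```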
Step 3: Updating Cluster Centroids
Next, take the mean value of the data points in each cluster and plug in
those x and y values to update your centroid coordinates. This will most
likely result in a change to your centroids’ location. Your total number of
clusters, however, will remain the same. You are not creating new
clusters, but rather updating their position on the scatterplot. Like
musical chairs, the remaining data points will then rush to the closest
centroid to form k number of clusters.
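Continuing the same sketch, the centroid update is simply a per-cluster mean of the assigned points:

```python
import numpy as np

def update_centroids(X, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it.

    An empty cluster keeps its previous centroid here; this is one
    simple convention, and implementations differ on this edge case.
    """
    k = len(centroids)
    new_centroids = centroids.copy()
    for cluster in range(k):
        members = X[labels == cluster]
        if len(members) > 0:
            # The new centroid is the average x and y of the cluster's points.
            new_centroids[cluster] = members.mean(axis=0)
    return new_centroids
```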

Step 4: Iteration and Convergence
Should any data point switch clusters as the centroids move, the previous step is repeated. This means, again, calculating the mean value of each cluster and updating the x and y values of each centroid to reflect the average coordinates of the data points in that cluster. Once you reach a stage where the data points no longer switch clusters after an update in centroid coordinates, the algorithm is complete, and you have your final set of clusters. The following diagrams break down the full algorithmic process.
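Putting the four steps together, here is a minimal sketch of the full loop, reusing the assign_to_clusters and update_centroids helpers from the sketches above and stopping once no data point switches clusters:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """A bare-bones k-means loop for illustration."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids from k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    # Step 2: initial assignment of points to centroids.
    labels = assign_to_clusters(X, centroids)

    for _ in range(max_iters):
        # Step 3: move each centroid to the mean of its cluster.
        centroids = update_centroids(X, labels, centroids)
        # Step 4: reassign points; stop when nothing changes.
        new_labels = assign_to_clusters(X, centroids)
        if np.array_equal(new_labels, labels):
            break  # no point switched clusters: converged
        labels = new_labels
    return centroids, labels
```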
Setting the value of 'k'
In setting k, it is important to find the right number of clusters. In general, as k increases, clusters become smaller and variance falls. However, the downside is that neighboring clusters become less distinct from one another as k increases. If you set k equal to the number of data points in your dataset, each data point automatically becomes a standalone cluster. Conversely, if you set k to 1, then all data points will be deemed homogeneous and produce only one cluster. Needless to say, setting k to either extreme will not provide any worthwhile insight to analyze.
In order to optimize k, you may wish to turn to a scree plot for guidance. A scree plot charts the degree of scattering (variance) inside a cluster as the total number of clusters increases. Scree plots are known for their iconic "elbow," a pronounced kink in the plot's curve. A scree plot compares the Sum of Squared Error (SSE) for each candidate number of clusters, where SSE is measured as the sum of the squared distances between each data point and its cluster centroid. In a nutshell, SSE drops as more clusters are formed. This then raises the question of what the optimal number of clusters is. In general, you should opt for a cluster solution where SSE subsides dramatically to the left on the scree plot, but before it reaches a point of negligible change with cluster variations to its right. For instance, in Figure 10, there is little impact on SSE for six or more clusters; choosing that many clusters would produce clusters that are small and difficult to distinguish.
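A scree/elbow plot of this kind can be sketched with scikit-learn, whose inertia_ attribute is the within-cluster SSE described above. The synthetic dataset here is an illustrative stand-in:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset with a few natural groupings.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

sse = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(model.inertia_)  # inertia_ is the within-cluster SSE

plt.plot(k_values, sse, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("SSE (inertia)")
plt.title("Scree/elbow plot for choosing k")
plt.show()
```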
Applications of k-Means Clustering

1. Customer Segmentation:
K-means clustering is commonly used in customer
segmentation, where customers are grouped based on their
behavior, preferences, or demographics.
By clustering customers, businesses can tailor marketing
strategies, personalize recommendations, and improve
customer satisfaction.
2. Image Compression:
K-means clustering can be employed for image compression by reducing the number of colors in an image.
Each pixel in the image is treated as a data point, and k-means clustering is applied to cluster similar colors together.
The cluster centroids represent the reduced color palette, resulting in a compressed image with minimal loss of visual quality (see the color-quantization sketch after this list).
3. Anomaly Detection:
K-means clustering can be used to detect anomalies or outliers in datasets.
Anomalies are data points that do not belong to any cluster or are significantly different from the other data points within a cluster.
By examining the data points that are farthest from their cluster centroids, potential anomalies can be identified (see the distance-based sketch after this list).
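Two of these applications are straightforward to sketch in code. First, image compression via color quantization (item 2 above): each pixel's RGB value is a data point, and the 16 cluster centroids become the reduced palette. This is a minimal sketch using scikit-learn and Pillow; the file names are placeholder assumptions.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load an image and flatten it to a (num_pixels, 3) array of RGB values.
image = np.asarray(Image.open("photo.jpg").convert("RGB"))  # placeholder path
pixels = image.reshape(-1, 3)

# Cluster the pixel colors into a 16-color palette.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_.astype(np.uint8)

# Replace every pixel with its cluster's centroid color.
compressed = palette[kmeans.labels_].reshape(image.shape)
Image.fromarray(compressed).save("photo_16_colors.png")
```

Second, a minimal sketch of distance-based anomaly detection (item 3 above): after fitting, the points farthest from their own centroid are flagged. The synthetic dataset and the 98th-percentile cutoff are illustrative choices, not values from the article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Distance from each point to its own cluster centroid.
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the points in the top 2% of distances as potential anomalies.
threshold = np.percentile(distances, 98)
anomalies = X[distances > threshold]
print(f"Flagged {len(anomalies)} potential anomalies")
```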
Imagine a retail company that wants to gain insights into its customer
base and target them more effectively. The company has a large
customer database with information such as purchase history,
demographic data, browsing behavior, and more. By applying k-means
clustering to this dataset, the company can discover distinct customer
segments and tailor their marketing strategies accordingly.

1. Data Preprocessing: Before applying k-means clustering, the company needs to preprocess the data. This involves cleaning the data, handling missing values, and scaling the features if necessary. It is crucial to select relevant features for clustering, such as customer age, purchase frequency, total spending, and any other relevant variables.
2. Choosing the Number of Clusters (k): The company needs to
determine the appropriate number of clusters for customer
segmentation. This can be done using techniques like the elbow
method or silhouette analysis. These methods help identify the
optimal value of k by evaluating the within-cluster sum of squares
(WCSS) or the average silhouette coefficient for different values of
k.
3. Applying K-means Clustering: Once the number of clusters is determined, the company can apply the k-means algorithm to the preprocessed data. The algorithm will assign each customer to one of the k clusters based on their similarity in terms of the selected features, as sketched below.
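The three steps above can be sketched end to end with pandas and scikit-learn. This is a minimal illustration rather than the company's actual pipeline; the column names (age, purchase_frequency, total_spending), the toy data, and the choice of k = 4 are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative customer data; a real dataset would come from the
# company's database (purchase history, demographics, and so on).
customers = pd.DataFrame({
    "age": [23, 45, 31, 52, 36, 28, 60, 41],
    "purchase_frequency": [12, 3, 7, 2, 9, 15, 1, 5],
    "total_spending": [340, 1200, 560, 2100, 480, 300, 2600, 900],
})

# 1. Preprocessing: scale features so no single feature dominates.
scaled = StandardScaler().fit_transform(customers)

# 2. Choosing k: assume the elbow method (as sketched earlier) suggested 4.
# 3. Applying k-means: each customer receives a segment label.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)
customers["segment"] = kmeans.labels_

# Inspect the average profile of each segment.
print(customers.groupby("segment").mean())
```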

In summary, k-means clustering allows the retail company to gain valuable insights into its customer base, identify distinct segments, and tailor marketing strategies to enhance customer satisfaction. By understanding customer preferences and behaviors, the company can optimize its offerings, improve customer engagement, and ultimately drive business growth.
Assessment Questions:

What is the main goal of the k-means clustering algorithm?
a) Maximizing the inter-cluster distance
b) Minimizing the intra-cluster distance
c) Maximizing the intra-cluster distance
d) Minimizing the inter-cluster distance

How are cluster centroids updated in the k-means algorithm?
a) By calculating the median value of data points in each cluster
b) By calculating the mean value of data points in each cluster
c) By selecting the farthest data point from each cluster centroid
d) By selecting the closest data point to each cluster centroid

What is the role of Euclidean distance in the k-means algorithm?
a) It measures the dissimilarity between clusters.
b) It assigns data points to clusters based on similarity.
c) It updates the number of clusters in the algorithm.
d) It calculates the variance within each cluster.
Conclusion:
K-means clustering is a powerful algorithm for grouping similar data
points together and uncovering underlying patterns. By understanding
its working and the steps involved, you now have a solid foundation to
apply k-means clustering in your data analysis tasks. Remember that k-
means clustering requires an appropriate choice of k, and the
algorithm's performance can be affected by outliers and the initialization
of cluster centroids. With its wide range of applications, k-means
clustering remains a valuable tool in the data scientist's toolbox.
