
WORKING OF K-MEANS ALGORITHM
AN UNSUPERVISED LEARNING ALGORITHM

BY YASH BHURE

11 JULY 2023
In the field of data science, k-means clustering is a widely used unsupervised learning algorithm that aims to discover underlying patterns and group similar data points together. It is particularly useful for exploratory data analysis and extracting meaningful insights from unlabelled data. In this article, we will walk through the working of the k-means algorithm, step by step, and explore its applications in various industries.

Research Sources:

The K-Means Algorithm Evolution: LINK

Machine Learning For Absolute Beginners: A Plain English Introduction: LINK

Towards Data Science - "A Comprehensive Guide to K-Means Clustering Algorithm": LINK

To understand the working concept of the k-means algorithm, it's essential to get familiar with clustering. In this article, we will look at the following topics to understand the working of k-means clustering:

The concepts of clustering
Understanding k-means clustering
Setting the value of 'k'
Applications of k-means clustering
CLUSTERING
One helpful approach to analyzing information is to identify clusters of data that share similar attributes. For example, your company may wish to examine a segment of customers that purchase at the same time of the year and recognize what factors influence their purchasing behavior.

By understanding a particular cluster of customers, you can make informed decisions about which products to recommend to customer groups through promotions and personalized offers. Outside of market research, clustering can be applied to various other scenarios, including pattern recognition, fraud detection, and image processing.

Clustering analysis falls under the banner of both supervised learning and unsupervised learning. As a supervised learning technique, clustering is used to classify new data points into existing clusters through k-nearest neighbors (k-NN); as an unsupervised learning technique, clustering is applied to identify discrete groups of data points through k-means clustering. Although there are other forms of clustering techniques, these two algorithms are generally the most popular in both machine learning and data mining.
Understanding k-means Clustering
As a popular unsupervised learning algorithm, k-means clustering
attempts to divide data into k discrete groups and is effective at
uncovering basic data patterns. Here's how the algorithm works:

Step 1: Initialization
The k-means clustering algorithm works by first splitting data into k clusters, with k representing the number of clusters you wish to create. If you choose to split your dataset into three clusters, then k is set to 3. In Figure 2, we can see that the original (unclustered) data has been transformed into three clusters (k is 3). If we were to set k to 4, an additional cluster would be derived from the dataset to produce four clusters.
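As a concrete illustration of this step, here is a minimal NumPy sketch of initialization. The synthetic dataset and the choice of k = 3 are assumptions for demonstration, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative 2-D dataset: 150 points, one (x, y) pair per row.
X = rng.normal(loc=0.0, scale=5.0, size=(150, 2))

# Step 1: choose k, the number of clusters to create.
k = 3

# A common initialization: pick k distinct data points at random
# to serve as the initial centroids.
initial_indices = rng.choice(len(X), size=k, replace=False)
centroids = X[initial_indices]
print(centroids)
```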
How does k-means clustering separate the data points?

Step 2: Assigning Data Points to Clusters
To begin, examine the unclustered data on the scatterplot and manually select a centroid for each of the k clusters. That centroid then forms the epicenter of an individual cluster. Centroids can be chosen at random, which means you can nominate any data point on the scatterplot to act as a centroid. However, you can save time by choosing centroids dispersed across the scatterplot rather than directly adjacent to each other. In other words, start by guessing where you think the centroids for each cluster might be located. The remaining data points on the scatterplot are then assigned to the closest centroid by measuring the Euclidean distance.

Each data point can be assigned to only one cluster, and each cluster is discrete. This means that there is no overlap between clusters and no case of nesting a cluster inside another cluster. Also, all data points, including anomalies, are assigned to a centroid irrespective of how they impact the final shape of the cluster. However, due to the statistical force that pulls all nearby data points to a central point, your clusters will generally form an elliptical or spherical shape. After all data points have been allocated to a centroid, the next step is to calculate the mean value of all data points for each cluster, which can be found by averaging the x and y values of all data points in that cluster.
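Continuing the sketch above, the assignment step can be expressed as a small function that measures the Euclidean distance from every point to every centroid and picks the nearest one:

```python
import numpy as np

def assign_to_clusters(X, centroids):
    """Assign each row of X to the index of its nearest centroid.

    Distances are plain Euclidean distances, computed between every
    point and every centroid via broadcasting.
    """
    # Shape (n_points, k): distance from each point to each centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the closest centroid for each point.
    return np.argmin(distances, axis=1)

# labels = assign_to_clusters(X, centroids)  # labels[i] is in {0, ..., k-1}
```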
Step 3: Updating Cluster Centroids
Next, take the mean value of the data points in each cluster and plug in
those x and y values to update your centroid coordinates. This will most
likely result in a change to your centroids’ location. Your total number of
clusters, however, will remain the same. You are not creating new
clusters, but rather updating their position on the scatterplot. Like
musical chairs, the remaining data points will then rush to the closest
centroid to form k number of clusters.
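Continuing the same sketch, the centroid update is simply a per-cluster mean of the assigned points:

```python
import numpy as np

def update_centroids(X, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it.

    An empty cluster keeps its previous centroid here; this is one
    simple convention, and implementations differ on this edge case.
    """
    k = len(centroids)
    new_centroids = centroids.copy()
    for cluster in range(k):
        members = X[labels == cluster]
        if len(members) > 0:
            # The new centroid is the average x and y of the cluster's points.
            new_centroids[cluster] = members.mean(axis=0)
    return new_centroids
```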

Step 4: Iteration and Convergence
Should any data point switch clusters as the centroids move, the previous step is repeated. This means, again, calculating the mean value of each cluster and updating the x and y values of each centroid to reflect the average coordinates of the data points in that cluster. Once you reach a stage where the data points no longer switch clusters after an update in centroid coordinates, the algorithm is complete, and you have your final set of clusters. The following diagrams break down the full algorithmic process.
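Putting the four steps together, here is a minimal sketch of the full loop, reusing the assign_to_clusters and update_centroids helpers from the sketches above and stopping once no data point switches clusters:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """A bare-bones k-means loop for illustration."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids from k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    # Step 2: initial assignment of points to centroids.
    labels = assign_to_clusters(X, centroids)

    for _ in range(max_iters):
        # Step 3: move each centroid to the mean of its cluster.
        centroids = update_centroids(X, labels, centroids)
        # Step 4: reassign points; stop when nothing changes.
        new_labels = assign_to_clusters(X, centroids)
        if np.array_equal(new_labels, labels):
            break  # no point switched clusters: converged
        labels = new_labels
    return centroids, labels
```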
Setting the value of 'k'
In setting k, it is important to find the right number of clusters. In general, as k increases, clusters become smaller and variance falls. However, the downside is that neighboring clusters become less distinct from one another as k increases. If you set k equal to the number of data points in your dataset, each data point automatically becomes a standalone cluster. Conversely, if you set k to 1, then all data points will be deemed homogeneous and produce only one cluster. Needless to say, setting k to either extreme will not provide any worthwhile insight to analyze.
In order to optimize k, you may wish to turn to a scree plot for guidance. A scree plot charts the degree of scattering (variance) inside a cluster as the total number of clusters increases. Scree plots are known for their iconic "elbow," a pronounced kink in the plot's curve. A scree plot compares the Sum of Squared Error (SSE) for each candidate number of clusters, where SSE is measured as the sum of the squared distances between each data point and its cluster centroid. In a nutshell, SSE drops as more clusters are formed. This then raises the question of what the optimal number of clusters is. In general, you should opt for a cluster solution where SSE subsides dramatically to the left on the scree plot, but before it reaches a point of negligible change with cluster variations to its right. For instance, in Figure 10, there is little impact on SSE for six or more clusters; choosing that many clusters would produce clusters that are small and difficult to distinguish.
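A scree/elbow plot of this kind can be sketched with scikit-learn, whose inertia_ attribute is the within-cluster SSE described above. The synthetic dataset here is an illustrative stand-in:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset with a few natural groupings.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

sse = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(model.inertia_)  # inertia_ is the within-cluster SSE

plt.plot(k_values, sse, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("SSE (inertia)")
plt.title("Scree/elbow plot for choosing k")
plt.show()
```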
Applications of k-Means Clustering

1. Customer Segmentation:
K-means clustering is commonly used in customer
segmentation, where customers are grouped based on their
behavior, preferences, or demographics.
By clustering customers, businesses can tailor marketing
strategies, personalize recommendations, and improve
customer satisfaction.
2. Image Compression:
K-means clustering can be employed for image compression by reducing the number of colors in an image.
Each pixel in the image is treated as a data point, and k-means clustering is applied to cluster similar colors together.
The cluster centroids represent the reduced color palette, resulting in a compressed image with minimal loss of visual quality (see the color-quantization sketch after this list).
3. Anomaly Detection:
K-means clustering can be used to detect anomalies or outliers in datasets.
Anomalies are data points that do not belong to any cluster or are significantly different from the other data points within a cluster.
By examining the data points that are farthest from their cluster centroids, potential anomalies can be identified (see the distance-based sketch after this list).
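Two of these applications are straightforward to sketch in code. First, image compression via color quantization (item 2 above): each pixel's RGB value is a data point, and the 16 cluster centroids become the reduced palette. This is a minimal sketch using scikit-learn and Pillow; the file names are placeholder assumptions.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load an image and flatten it to a (num_pixels, 3) array of RGB values.
image = np.asarray(Image.open("photo.jpg").convert("RGB"))  # placeholder path
pixels = image.reshape(-1, 3)

# Cluster the pixel colors into a 16-color palette.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_.astype(np.uint8)

# Replace every pixel with its cluster's centroid color.
compressed = palette[kmeans.labels_].reshape(image.shape)
Image.fromarray(compressed).save("photo_16_colors.png")
```

Second, a minimal sketch of distance-based anomaly detection (item 3 above): after fitting, the points farthest from their own centroid are flagged. The synthetic dataset and the 98th-percentile cutoff are illustrative choices, not values from the article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Distance from each point to its own cluster centroid.
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the points in the top 2% of distances as potential anomalies.
threshold = np.percentile(distances, 98)
anomalies = X[distances > threshold]
print(f"Flagged {len(anomalies)} potential anomalies")
```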
Imagine a retail company that wants to gain insights into its customer
base and target them more effectively. The company has a large
customer database with information such as purchase history,
demographic data, browsing behavior, and more. By applying k-means
clustering to this dataset, the company can discover distinct customer
segments and tailor their marketing strategies accordingly.

1. Data Preprocessing: Before applying k-means clustering, the company needs to preprocess the data. This involves cleaning the data, handling missing values, and scaling the features if necessary. It is crucial to select relevant features for clustering, such as customer age, purchase frequency, total spending, and any other relevant variables.
2. Choosing the Number of Clusters (k): The company needs to
determine the appropriate number of clusters for customer
segmentation. This can be done using techniques like the elbow
method or silhouette analysis. These methods help identify the
optimal value of k by evaluating the within-cluster sum of squares
(WCSS) or the average silhouette coefficient for different values of
k.
3. Applying K-means Clustering: Once the number of clusters is determined, the company can apply the k-means algorithm to the preprocessed data. The algorithm will assign each customer to one of the k clusters based on their similarity in terms of the selected features, as sketched below.
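The three steps above can be sketched end to end with pandas and scikit-learn. This is a minimal illustration rather than the company's actual pipeline; the column names (age, purchase_frequency, total_spending), the toy data, and the choice of k = 4 are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative customer data; a real dataset would come from the
# company's database (purchase history, demographics, and so on).
customers = pd.DataFrame({
    "age": [23, 45, 31, 52, 36, 28, 60, 41],
    "purchase_frequency": [12, 3, 7, 2, 9, 15, 1, 5],
    "total_spending": [340, 1200, 560, 2100, 480, 300, 2600, 900],
})

# 1. Preprocessing: scale features so no single feature dominates.
scaled = StandardScaler().fit_transform(customers)

# 2. Choosing k: assume the elbow method (as sketched earlier) suggested 4.
# 3. Applying k-means: each customer receives a segment label.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)
customers["segment"] = kmeans.labels_

# Inspect the average profile of each segment.
print(customers.groupby("segment").mean())
```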

In summary, k-means clustering allows the retail company to gain valuable insights into its customer base, identify distinct segments, and tailor marketing strategies to enhance customer satisfaction. By understanding customer preferences and behaviors, the company can optimize its offerings, improve customer engagement, and ultimately drive business growth.
Assessment Questions:

What is the main goal of the k-means clustering algorithm?
a) Maximizing the inter-cluster distance
b) Minimizing the intra-cluster distance
c) Maximizing the intra-cluster distance
d) Minimizing the inter-cluster distance

How are cluster centroids updated in the k-means algorithm?
a) By calculating the median value of data points in each cluster
b) By calculating the mean value of data points in each cluster
c) By selecting the farthest data point from each cluster centroid
d) By selecting the closest data point to each cluster centroid

What is the role of Euclidean distance in the k-means algorithm?
a) It measures the dissimilarity between clusters.
b) It assigns data points to clusters based on similarity.
c) It updates the number of clusters in the algorithm.
d) It calculates the variance within each cluster.
Conclusion:
K-means clustering is a powerful algorithm for grouping similar data
points together and uncovering underlying patterns. By understanding
its working and the steps involved, you now have a solid foundation to
apply k-means clustering in your data analysis tasks. Remember that k-
means clustering requires an appropriate choice of k, and the
algorithm's performance can be affected by outliers and the initialization
of cluster centroids. With its wide range of applications, k-means
clustering remains a valuable tool in the data scientist's toolbox.
