
K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science. In this topic, we will learn what the K-means clustering algorithm is, how it works, and how to implement it in Python.

What is the K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process; for example, if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, and the points within a group share similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories present in an unlabeled dataset on its own, without the need for any labeled training data.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
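
The quantity being minimized is often called the within-cluster sum of squares (or inertia). Here is a minimal sketch of it in Python with NumPy, assuming the points, centroids, and integer cluster labels are held in arrays; the function name is illustrative only:

import numpy as np

def within_cluster_sum_of_squares(X, centroids, labels):
    """K-means objective: sum of squared distances from each point
    to the centroid of the cluster it is assigned to."""
    total = 0.0
    for k in range(len(centroids)):
        members = X[labels == k]          # points assigned to cluster k
        diffs = members - centroids[k]    # offsets from that centroid
        total += np.sum(diffs ** 2)       # squared Euclidean distances
    return total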

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the cluster assignments no longer change. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best values for the K center points or centroids through an iterative process.
o Assigns each data point to its closest k-center. The data points that are nearest to a particular k-center form a cluster.

Hence, each cluster contains data points with some commonalities and is well separated from the other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Recompute the centroid of each cluster as the mean (center of gravity) of the data points assigned to it, and move the centroid there.

Step-5: Repeat the third step, i.e., reassign each data point to the closest of the new centroids.

Step-6: If any reassignment occurs, go back to Step-4; otherwise, go to FINISH.

Step-7: The model is ready.
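
Before moving to the visual plots, the steps above can be translated directly into code. Here is a minimal from-scratch sketch in Python with NumPy; the function name and the simple random initialization are illustrative choices, not the only way to do it:

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means following Steps 1-7."""
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the dataset as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step-3/5: assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step-4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step-6: stop once the centroids no longer move (no reassignments)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

Calling kmeans(X, 2) on a small two-column array returns a cluster label for each row of X together with the two final centroids.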

Let's understand the above steps by considering the visual plots:


Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
o Let's take the number of clusters as K=2, so we will try to group the dataset into two different clusters.
o We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the two points below as the K points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics we have studied for calculating the distance between two points. So, we will draw a median line (the perpendicular bisector) between the two centroids. Consider the below image:


From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clearer visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of each cluster and place the new centroids there, as shown below:
o Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The median will look like the below image:

From the above image, we can see that one yellow point is on the left side of the line and two blue points are on the right side of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:
o As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model has converged. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
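
The same two-cluster walkthrough can be reproduced in a few lines with scikit-learn, assuming it is installed; the M1/M2 values below are made up purely for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical values of the two variables M1 and M2
X = np.array([[1.0, 1.2], [1.5, 1.8], [1.2, 0.9],   # one loose group of points
              [5.0, 5.3], [5.5, 4.9], [6.0, 5.8]])  # a second loose group

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster index (0 or 1) for each point
print(model.cluster_centers_)  # the two final centroids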

Why K-Means?
K-means as a clustering algorithm is deployed to discover groups that haven’t been
explicitly labeled within the data. It’s being actively used today in a wide variety of business
applications including:

• Customer segmentation: customers can be grouped in order to better tailor products and offerings.
• Text, document, or search-results clustering: grouping to find topics in text.
• Image grouping or image compression: groups similar images or colors (a brief sketch follows this list).
• Anomaly detection: finds points that are not similar to the rest, i.e., the outliers relative to the clusters.
• Semi-supervised learning: clusters are combined with a smaller set of labeled data and supervised machine learning in order to get more valuable results.
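
As a concrete sketch of one item from the list above, K-means can compress an image by clustering its pixel colors and replacing each pixel with its cluster's centroid color. This assumes NumPy, Pillow, and scikit-learn are available; the function name and the default of 16 colors are illustrative:

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def quantize_colors(path, k=16):
    """Reduce an image's palette to k colors using K-means."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=float)
    pixels = img.reshape(-1, 3)                     # one row per pixel (R, G, B)
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]     # each pixel -> its centroid color
    return Image.fromarray(quantized.reshape(img.shape).astype(np.uint8))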

How K-Means Works


The K-means algorithm identifies a certain number of centroids within a data set, a centroid being the arithmetic mean of all the data points belonging to a particular cluster. The algorithm then allocates every data point to the nearest cluster as it attempts to keep the clusters as small as possible (the "means" in K-means refers to the task of averaging the data, i.e., finding the centroid). At the same time, K-means attempts to keep the clusters as different from one another as possible.
In practice it works as follows:

• The K-means algorithm begins by initializing the coordinates of "K" cluster centers. (The number K is an input variable, and the initial locations can also be given as input.)
• With every pass of the algorithm, each point is assigned to its nearest cluster center.
• The cluster centers are then updated to be the "centers" of all the points assigned to them in that pass. This is done by re-calculating the cluster centers as the average of the points in each respective cluster.
• The algorithm repeats until the change in the cluster centers from the last iteration falls below a minimum threshold (see the parameter sketch after this list).
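
This loop maps directly onto the parameters of scikit-learn's KMeans, assuming that library is used; the values shown are placeholder choices rather than recommendations:

from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=3,      # "K": the number of cluster centers
    init="k-means++",  # how the initial center locations are chosen
    max_iter=300,      # upper bound on assignment/update passes
    tol=1e-4,          # stop when the centers move less than this between passes
    random_state=0,
)
# After calling km.fit(X), km.n_iter_ reports how many passes were actually run.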

K-means is very effective at capturing structure and supporting inferences when the clusters have a uniform, spherical shape. But if the clusters have more complex geometric shapes, the algorithm does a poor job of clustering the data. Another shortcoming of K-means is that the algorithm does not allow data points that are distant from one another to share the same cluster, regardless of whether they actually belong together. K-means also does not learn the number of clusters from the data; that information must be defined in advance. And finally, when clusters overlap, K-means cannot determine how to assign the data points where the overlap occurs.
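
The spherical-shape limitation is easy to demonstrate on a non-convex dataset. A minimal sketch, assuming scikit-learn is installed (make_moons produces two interleaving crescents that K-means typically fails to separate cleanly):

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true crescent labels is usually well below 1.0, because
# K-means can only draw a straight boundary between two roughly spherical clusters.
print(adjusted_rand_score(y_true, labels))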

K-Means for Data Scientists


Owing to its intrinsic simplicity and popularity in unsupervised machine learning operations,
K-means has gained favor among data scientists. Its applicability in data mining operations
allows data scientists to leverage the algorithm to derive various inferences from business
data and enable more accurate data-driven decision-making, the limitations of the algorithm
notwithstanding. It's widely considered among the most business-critical algorithms for data scientists.
