KMeans Clustering
K-means is an iterative algorithm that divides an unlabeled dataset into k different clusters
in such a way that each data point belongs to only one group, whose members have similar
properties.
It lets us cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabeled dataset on its own, without the need for any labeled
training data.
The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats
the process until it finds the best clusters it can, that is, until the clusters stop
changing. The value of k must be chosen in advance.
The k-means algorithm mainly performs two tasks:
o Determines the best positions for the K center points (centroids) by an iterative process.
o Assigns each data point to its closest centroid. The data points nearest to a particular
centroid form a cluster.
Hence each cluster contains data points with some commonalities and is separated from the
other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not be points from the
input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K
clusters.
Step-4: Compute the mean of the points in each cluster and place that cluster's new centroid
there.
Step-5: Repeat the third step, which means reassigning each data point to the new closest
centroid. If any point changes cluster, go back to Step-4; otherwise the clusters are final.
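The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans` and the parameters `max_iter` and `seed` are hypothetical, the starting centroids are assumed to be drawn from the data points themselves, and the loop is assumed to stop as soon as no point changes cluster.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Group 2-D points into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step-2: pick k starting centroids (assumption: k distinct data points).
    centroids = rng.sample(points, k)
    labels = [None] * len(points)

    for _ in range(max_iter):
        # Step-3: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step-5: stop once no point has changed cluster.
        if new_labels == labels:
            break
        labels = new_labels

        # Step-4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return centroids, labels


# Made-up data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

With data this well separated, the first three points should end up sharing one label and the last three the other, although which group gets label 0 depends on the random start.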
Suppose we have two variables, M1 and M2. The x-y scatter plot of these two variables is
given below:
o Let's take the number of clusters k = 2, which means we will try to group the dataset into
two different clusters.
o We need to choose k random points as centroids to form the clusters. These points can be
either points from the dataset or any other points. Here we select the two points below as
centroids, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest centroid. We compute
this with the usual formula for the distance between two points. So, we will draw a median
line between the two centroids, i.e., the perpendicular bisector of the segment joining
them. Consider the below image:
From the above image, it is clear that the points to the left of the line are nearer to K1,
the blue centroid, and the points to the right of the line are nearer to the yellow centroid.
Let's color them blue and yellow for clear visualization.
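The assignment just described can be checked numerically. The sketch below uses made-up coordinates (the original plot's values are not available) and hypothetical helper names `closest` and `side_of_bisector`. It shows that, for two centroids, comparing distances gives the same answer as asking which side of the perpendicular bisector a point falls on.

```python
import math

# Hypothetical starting centroids; the colors follow the walkthrough.
k1 = (2.0, 6.0)  # "blue" centroid
k2 = (8.0, 3.0)  # "yellow" centroid

# Made-up data points standing in for the scatter plot.
points = [(1.0, 5.0), (3.0, 7.0), (4.0, 4.0), (7.0, 2.0), (9.0, 4.0)]


def closest(p):
    """Return 'blue' if p is nearer to k1, else 'yellow'."""
    return "blue" if math.dist(p, k1) <= math.dist(p, k2) else "yellow"


def side_of_bisector(p):
    """Same decision via the perpendicular bisector of the segment k1-k2."""
    mid = ((k1[0] + k2[0]) / 2, (k1[1] + k2[1]) / 2)
    direction = (k2[0] - k1[0], k2[1] - k1[1])
    # Signed projection of (p - mid) onto the k1 -> k2 direction:
    # negative means p lies on k1's side of the bisector.
    s = (p[0] - mid[0]) * direction[0] + (p[1] - mid[1]) * direction[1]
    return "blue" if s <= 0 else "yellow"


colors = [closest(p) for p in points]
print(colors)
print(all(closest(p) == side_of_bisector(p) for p in points))
```

The line is only a visual aid for the two-cluster case; with more than two centroids the distance comparison still applies directly.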
o As we need to find the closest cluster, we will repeat the process by choosing new
centroids. To choose them, we compute the center of gravity of the data points in each
cluster and take these as the new centroids, as below:
o Next, we will reassign each data point to the new centroids. For this, we repeat the same
process of finding a median line. The median line will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line and
two blue points are to the right of the line. So, these three points will be assigned to new
centroids.
As reassignment has taken place, we go back to Step-4, which is finding new centroids.
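Finding the new centroids amounts to averaging the coordinates of the points currently in each cluster. The snippet below is a small sketch of that update; the helper name `center_of_gravity` and the cluster contents are made up for illustration.

```python
def center_of_gravity(cluster_points):
    """Mean of a non-empty list of (x, y) points."""
    n = len(cluster_points)
    return (
        sum(x for x, _ in cluster_points) / n,
        sum(y for _, y in cluster_points) / n,
    )


# Hypothetical clusters after one assignment pass.
blue = [(1.0, 5.0), (3.0, 7.0), (4.0, 4.0)]
yellow = [(7.0, 2.0), (9.0, 4.0)]

new_k1 = center_of_gravity(blue)
new_k2 = center_of_gravity(yellow)
print(new_k1, new_k2)
```

Note that the average is taken over the data points in a cluster, not over the previous centroids.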
o We repeat the process by finding the center of gravity of each cluster's data points, so
the new centroids will be as shown in the below image:
o As we have new centroids, we again draw the median line and reassign the data points. So,
the image will be:
o We can see in the above image that there are no dissimilar data points on either side of
the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters
will be as shown in the below image:
Why K-Means?
K-means as a clustering algorithm is deployed to discover groups that haven’t been
explicitly labeled within the data. It’s being actively used today in a wide variety of
business applications.
The K-means algorithm begins by initializing the coordinates of the K cluster centers. (The
number K is an input variable, and the starting locations can also be given as input.)
With every pass of the algorithm, each point is assigned to its nearest cluster center.
The cluster centers are then updated to be the “centers” of all the points assigned to
them in that pass. This is done by recalculating each cluster center as the average of the
points in its cluster.
The algorithm repeats until the cluster centers change only minimally from the last
iteration.
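One way to express the stopping rule "minimal change of the cluster centers" is to compare how far each center moved against a small tolerance. The following is a sketch under that assumption; the function name `has_converged` and the default `tol` value are hypothetical choices, not something fixed by the algorithm.

```python
import math


def has_converged(old_centers, new_centers, tol=1e-4):
    """True when no center moved farther than tol since the last pass."""
    return all(
        math.dist(old, new) <= tol
        for old, new in zip(old_centers, new_centers)
    )


# Made-up centers from three consecutive passes.
previous = [(2.0, 6.0), (8.0, 3.0)]
moved = [(2.5, 5.5), (8.0, 3.0)]
settled = [(2.50001, 5.5), (8.0, 3.0)]

print(has_converged(previous, moved))  # first center moved about 0.7
print(has_converged(moved, settled))   # first center moved about 0.00001
```

A smaller tolerance gives a more settled result at the cost of extra passes; in practice it is usually combined with a cap on the number of iterations.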
K-means is very effective in capturing structure and making data inferences if the clusters
have a uniform, spherical shape. But if the clusters have more complex geometric shapes,
the algorithm does a poor job of clustering the data. Another shortcoming of K-means is that
the algorithm does not allow data points distant from one another to share the same cluster,
regardless of whether they belong in the cluster. K-means does not itself learn the number
of clusters from the data; rather, that information must be pre-defined. And finally, when
clusters overlap, K-means cannot determine how to assign data points where the overlap
occurs.