K-Means Clustering
Introduction
K-Means belongs to unsupervised learning, the branch of machine learning in which no response variable is available to guide the learning process; instead, algorithms analyze the data on their own to identify trends.
In supervised learning, by contrast, the existing data is already labelled and you know which behaviour you want to recognize in new datasets. In unsupervised learning there is no labelled dataset, and the algorithms are left to explore the relationships and patterns in the data on their own.
Real-world data is usually obscured by noise and redundancy, so grouping it into sets of observations with similar features is often the key step toward useful insights.
What is Clustering?
Primarily, clustering is an exploratory technique used to analyze the hidden structure of data. It decomposes a dataset into subsets, with each subset representing a group of observations with similar characteristics; these groups are known as clusters. Let's understand this with an example.
A bank wants to make loan offers to its customers. Currently, it looks at the details of each customer and decides which offer to make based on those details.
But the bank can potentially have millions of customers. Does it make sense to look at each customer's details separately and then make a decision? Hell no! Such a manual process would take a huge amount of time.
So, what can the bank do? One option is to segment its customers into different clusters. For instance, the bank can group the customers based on their income, say into low-, medium- and high-income segments.
The bank can then use three different strategies, one for each group of customers. Thus, clustering is an unsupervised learning technique in which we divide the dataset into groups, and each group/cluster shares similar characteristics.
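For illustration, here is a minimal sketch of this kind of income-based segmentation using scikit-learn's KMeans. The income figures are made up purely for the example, not real bank data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical annual incomes, one row per customer.
incomes = np.array([[22_000], [25_000], [31_000],
                    [58_000], [61_000], [65_000],
                    [120_000], [135_000], [150_000]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(incomes)

print(kmeans.labels_)           # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # average income of each segment
```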
K-Means is one of the most widely used unsupervised machine learning methods for grouping data. It suits exploratory data analysis well, helping you understand the data and draw inferences from it, and it works flexibly across data types, whether the data comes in the form of images, text content, or numeric values.
However, in unsupervised learning we do not have any target (dependent) variable; all we have are independent variables.
So, in clustering, there is no target to predict. We simply divide the data into groups/clusters having similar characteristics. Hence, it is an unsupervised learning problem.
What is K-Means Clustering?
As an unsupervised learning method, clustering attempts to identify relationships among n observations (data points) without being trained on a response variable.
The intent is to make the data points within the same class as similar as possible, and the data points in separate classes as dissimilar as possible.
Basically, in the process of clustering, one identifies which observations are alike and groups them accordingly. With this perspective in mind, k-means clustering is the most straightforward and most frequently used clustering method for categorizing a dataset into k classes (groups).
In K-Means, each cluster is associated with a specific centroid, which is why it is also known as a centroid-based or distance-based algorithm. The main objective of the K-Means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids.
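Formally, the quantity K-Means minimizes, often called the within-cluster sum of squares (WCSS) or inertia, can be written as

J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of points assigned to cluster i and \mu_i is the centroid (mean) of that cluster.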
The value of K can be chosen using the Elbow Method, which we cover later in this article.
Let's go through the steps involved in K-Means clustering for a better understanding.
Step 1- Select the number of clusters for the dataset (K).
Step 2- Select K initial centroids.
Step 3- Assign each point to its nearest centroid by calculating the Euclidean distance (or Manhattan distance), thus creating K groups.
Step 4- Recompute the centroid of each group as the mean of its points.
Step 5- Reassign every data point based on these new centroids, then repeat Steps 3 and 4 until the positions of the centroids no longer change. (A minimal code sketch of these steps follows this list.)
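The five steps above map almost directly onto code. Below is a minimal NumPy sketch of the procedure; the function name kmeans and its arguments are illustrative rather than taken from any library, X is assumed to be an (n_samples, n_features) array, Euclidean distance is used, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means on an (n_samples, n_features) array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K initial centroids by sampling K distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on a small 2-D toy dataset (Step 1: choose K = 2).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```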
As a worked example, consider a small dataset of flower petal measurements clustered with K = 3:
Step 3- Calculate the Euclidean distance of each data point from the initial cluster centroids.
Step 4- Assign each data point to the nearest cluster. The plot shows that the 1st cluster contains 5 points, the 2nd cluster contains 2 points, and the 3rd cluster contains 7 points.
Step 5- In the previous step, we obtained 3 clusters based on the initial centroid assignment. Now calculate the means of these clusters and use them as the new centroids.
Repeat Steps 3, 4 and 5 until the cluster centroids remain the same.
Inference- The first cluster contains 5 flowers with the smallest average petal length and width. This represents the group of small-sized flowers.
The second cluster represents 5 medium-sized flowers. The third cluster consists of 4 flowers with the largest average petal length and width.
Thus, K-means has clustered the data into 3 clusters based on the length and width of each flower
petal.
Summary- The algorithm keeps updating the centroids until their positions no longer change, at which point the K = 3 clusters are final.
How to choose the Right Number of Clusters in K-Means?
Here the ELBOW METHOD comes in, a commonly used method for finding the optimal value of K.
In the previous example it was obvious that the observations fell into 3 groups, so we used K = 3. In general, however, it is not easy to decide the optimal value of K.
The idea is to plot the within-cluster sum of squares (WCSS) against K. The plot shows that the WCSS decreases rapidly for K values below the optimal K. After the elbow point, the WCSS decreases only gradually, which implies that additional clusters are formed merely by splitting large clusters into subgroups.
Selecting a K greater than the optimal K therefore leads to overfitting.
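One common way to produce the elbow plot is to run K-Means for a range of K values and record the WCSS, which scikit-learn exposes as the inertia_ attribute. A minimal sketch, using synthetic data from make_blobs as a stand-in for a real dataset:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real feature matrix; replace with your own data.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)          # WCSS for this value of K

plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()                            # look for the 'elbow' in the curve
```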
K-Means can also be viewed as an expectation-maximization (EM) style procedure. The E-step (Expectation step) updates the assignment of each data point to its respective cluster.
The M-step (Maximization step) updates the quantities that define the location of the cluster centres; in K-Means, this update is simply the mean of the data points in each cluster.
Under mild conditions, each iteration of the E-step and M-step yields an improved estimate of the clusters' characteristics.
K-Means uses an iterative procedure to produce its final clustering based on a predefined number of clusters, chosen according to the dataset and represented by the variable K.
For instance, if K is set to 3, the dataset is categorized into 3 clusters; if K is equal to 4, the number of clusters will be 4; and so on.
The fundamental aim is to define K centres, one for each cluster. These centres must be placed carefully, because different initial placements lead to different outcomes. So, it is best to put them as far away from each other as possible.
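One widely used initialization that follows this idea is k-means++, which seeds the centroids so that they start spread far apart. It is already the default in scikit-learn; the brief sketch below (on synthetic data) names it explicitly only to highlight the idea:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

# 'k-means++' spreads the initial centroids far apart instead of picking them
# purely at random, which usually makes the result less sensitive to the
# initial placement.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=1).fit(X)
print(km.cluster_centers_)
```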
Also, the maximum number of plausible clusters is the total number of observations present in the dataset, i.e. one cluster per observation.