K-means Clustering

Introduction
We begin with unsupervised learning, the branch of machine learning in which no response variable is present to provide guidance during the learning process; the algorithm analyzes the data on its own to identify trends.
In contrast to supervised learning, where the existing data is already labelled and you know which behaviour you want to recognize in new datasets, unsupervised learning works without a labelled dataset, and the algorithms are there to explore the relationships and patterns in the data.

It is well known that data and information are usually obscured by noise and redundancy, so grouping the data by similar features is a decisive step towards extracting insights.

What is Clustering?
Primarily, clustering is an exploratory technique used to analyze the hidden structure of the data. It decomposes the dataset into subsets, each subset representing a group of observations with similar characteristics; these groups are also known as clusters. Let's understand this with an example.
A bank wants to give loan offers to its customers. Currently, it looks at the details of each customer and, based on those details, decides which offer to make.
But a bank can have millions of customers. Does it make sense to look at the details of each customer separately and then make a decision? Certainly not! That manual process would take a huge amount of time.
So what can the bank do? One option is to segment its customers into clusters. For instance, the bank can group the customers based on their income, say into low-, medium- and high-income groups.
The bank can then use a different strategy for each of the three groups of customers. Thus, clustering is an unsupervised learning technique in which we divide the dataset into groups, where each group/cluster possesses similar characteristics.

K-means is one of the most widely used unsupervised machine learning methods for grouping data. It suits exploratory data analysis well, helping you understand the data and draw inferences from it, and it works flexibly across data types, whether the data comes in the form of images, text content or numbers.

Why Clustering is an Unsupervised Learning Problem?


Now you know that clustering is the process of dividing a dataset into clusters, where each cluster represents unique characteristics.
It is important to understand the difference between supervised and unsupervised learning. In supervised learning we always have a target variable (y) as the dependent variable and (x) as the independent variables, and we split the data using the independent variables under the supervision of the target variable.
In unsupervised learning, however, we do not have any target/dependent variable; all we have are independent variables.
So in clustering we have no target to predict. We simply divide the data into groups/clusters with similar characteristics. Hence, it is an unsupervised learning problem.
What is K-Means Clustering?
As an unsupervised learning method, clustering attempts to identify relationships among n observations (data points) without being trained by a response variable, with the intent of making data points within the same class as similar as possible and data points in separate classes as dissimilar as possible.
In the process of clustering, one identifies which observations are alike and groups them accordingly. With this perspective in mind, k-means clustering is the most straightforward and frequently practiced clustering method for partitioning a dataset into k classes (groups).

The k-means algorithm searches for a predetermined number of clusters in an unlabeled multidimensional dataset, using a simple notion of what an optimal cluster looks like. The name reflects the two ingredients of this commonly used method: K specifies the number of clusters, and "means" refers to the average (centroid) of each cluster.

Primarily, the concept rests on two ideas:

• First, the cluster centre is the arithmetic mean (AM) of all the data points associated with the cluster.
• Second, each point is closer to its own cluster centre than to any other cluster centre.
These two interpretations are the foundation of the k-means clustering model.
You can think of the centre as a data point that summarizes the mean of the cluster; it need not itself be a member of the dataset.
In simple terms, k-means clustering enables us to split the data into several groups by detecting the distinct categories in an unlabeled dataset by itself, without any need for labelled training data.
It is a centroid-based algorithm: each cluster is associated with a centroid, and the objective is to minimize the sum of distances between the data points and their corresponding cluster centroids.
As input, the algorithm takes an unlabeled dataset, splits it into k clusters, and iterates until it arrives at the right clusters; the value of k must be predetermined.

Specifically, the k-means algorithm performs two tasks:

• It calculates the K centre points, or centroids, by an iterative method.
• It assigns every data point to its nearest centre; the data points closest to a particular centre together make up a cluster. As a result, the data points within each cluster share similarities and are far apart from those in other clusters.

In K-means, each cluster is associated with a specific centroid, which is why it is also known as a centroid-based or distance-based algorithm. The main objective of the K-means algorithm is to minimize the sum of distances between the points and their respective cluster centroids.
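Written out, this objective is the within-cluster sum of squares (WCSS); the notation below is the conventional one rather than anything defined in this article:

```latex
J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
```

where C_i is the set of points assigned to cluster i and mu_i is that cluster's centroid.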
The right value of K can be determined using the Elbow Method, which we cover further on in this article.
Let's go through the steps involved in K-means clustering for a better understanding.
Step 1 - Select the number of clusters for the dataset (K).
Step 2 - Select K points as the initial centroids.
Step 3 - Assign each point to its nearest centroid by calculating the Euclidean distance (or Manhattan distance), thus creating K groups.
Step 4 - Compute the new centroid of each group.
Step 5 - Reassign all the data points based on the new centroids, then repeat Steps 4 and 5 until the positions of the centroids no longer change.
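To make these steps concrete, here is a minimal from-scratch sketch in Python with NumPy. The function and variable names are my own; it assumes X is a NumPy array of shape (n_samples, n_features) and that no cluster ever ends up empty:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centroids (Forgy method)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # assign each point to its nearest centroid
        # Steps 4-5: recompute each centroid as the mean of its assigned points
        # (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the clustering has converged
        centroids = new_centroids
    return labels, centroids
```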

Let’s understand the above steps with an example.


Consider a dataset of flower petal lengths and petal widths, in millimeters.

Can we cluster the flower petals into distinct groups?


Rather than jumping straight to the Python code, I would like you to understand the concept first.
Step 1 - Plot the given data in a scatter plot. From the plot, we can see that the dataset falls into 3 groups/clusters.
Now, use K-means clustering to check whether the algorithm can form those three clusters (K = 3) from the given data. Consider the Euclidean distance as the distance measure.
Again, don't worry about the code; we will get to it later. Just understand the steps that need to be carried out.
Step 2 - Initial assignment (Forgy method): randomly choose 3 points as the cluster centroids. Consider the observations below as the initial centroids.

Step 3 - Calculate the Euclidean distance of each data point from each cluster centroid.
Step 4 - Assign each data point to the nearest cluster. The plot shows that the 1st cluster contains 5 points, the 2nd cluster contains 2 points, and the 3rd cluster contains 7 points.

Step 5 - In Steps 3 and 4 we obtained 3 clusters based on the initial centroid assignment. Now calculate the means of these clusters and move the centroids there.
Repeat Steps 3, 4 and 5 until the cluster centroids remain the same.

Inference - The first cluster contains the 5 flowers with the smallest average petal length and width; this represents the group of small-sized flowers.
The second cluster represents 5 medium-sized flowers. The third cluster consists of 4 flowers with the highest average petal length and width.
Thus, K-means has clustered the data into 3 clusters based on the length and width of each flower's petals.
Summary - K-means iterates over the centroids until their positions no longer change, here with K = 3.
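The article defers the Python code, but as a quick sketch, the same experiment can be run with scikit-learn. The petal measurements below are illustrative stand-ins, since the original table of 14 flowers is not reproduced here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical petal (length, width) pairs in millimeters -- not the article's table
X = np.array([[14, 2], [15, 3], [16, 2], [13, 2], [15, 2],      # small flowers
              [40, 12], [42, 13], [39, 12], [41, 14], [40, 13],  # medium flowers
              [60, 22], [62, 24], [61, 23], [63, 22]])           # large flowers

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each flower
print(km.cluster_centers_)  # mean petal length/width of each cluster
```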
How to choose the Right Number of Clusters in K-Means?
Here enters the ELBOW METHOD, a commonly used method for finding the optimal value of K.
In the previous example it was obvious that the observations fell into 3 groups, so we considered K = 3. In general, however, it is not easy to decide the optimal value of K.

In an elbow plot of the within-cluster sum of squares (WCSS) against K, the WCSS decreases rapidly for values of K below the optimal K. After the elbow point, the WCSS decreases only slowly, which means that additional clusters are formed merely by splitting large clusters into subgroups.
Selecting a K greater than the optimal K amounts to overfitting.
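A minimal sketch of the elbow method with scikit-learn, whose inertia_ attribute holds the WCSS of a fitted model; the synthetic data is only for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true groups, purely for illustration
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.show()  # pick the K at the elbow, here around K = 3
```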

Key Features of K-means Clustering


Find below some key features of k-means clustering:
1. It is very easy to interpret and implement.
2. When a large number of variables is present in the dataset, K-means operates more quickly than hierarchical clustering.
3. An instance can change clusters when the cluster centres are redetermined.
4. K-means forms compact clusters.
5. It can work on unlabeled numerical data.
6. Moreover, it is fast, robust and uncomplicated to understand, and yields the best outcomes when the datasets are distinct (thoroughly separated) from each other.
Limitations of K-means Clustering
The following are a few limitations of K-means clustering:
1. It is often quite tough to forecast the number of clusters, i.e., the value of K.
2. The output is highly influenced by the initial inputs, such as the number of clusters and the starting centroids.
3. The order in which the data is presented can substantially affect the final outcome.
4. When the clusters have complex spatial shapes, k-means is not a good choice.
5. It is sensitive to rescaling: if the data points are normalized or standardized differently, the output changes entirely.
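Because of limitation 5, standardizing the features before clustering is a common precaution. A minimal sketch using scikit-learn (a standard practice rather than something prescribed by this article):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Scale every feature to zero mean and unit variance so that no single
# large-scale feature dominates the Euclidean distances, then cluster
model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
labels = model.fit_predict(X)
```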

Disadvantages of K-means Clustering


1. The algorithm demands the up-front specification of the number of clusters/centres.
2. The algorithm breaks down on non-linearly separable data and is unable to deal with noisy data and outliers.
3. It is not directly applicable to categorical data, since a mean is only defined for numeric features.
4. Euclidean distance can weight the underlying factors unequally.
5. The algorithm is not invariant to non-linear transformations, i.e., it gives different results for different representations of the data.

Expectation-Maximization: K-means Algorithm


K-means is an instance of the Expectation-Maximization (EM) algorithm, a powerful approach that appears in a variety of contexts in data science. The E-M procedure has two parts:
1. Guess some initial cluster centres;
2. Repeat until converged:
• E-step: assign each data point to the closest cluster centre;
• M-step: set each cluster centre to the mean of the points assigned to it.

The E-step is the Expectation step: it updates our expectation of which cluster each data point belongs to.
The M-step is the Maximization step: it maximizes a fitness function that determines the locations of the cluster centres; here, that maximization is achieved by taking the mean of the data points in each cluster.
Under some mild conditions, each iteration of the E-step and M-step will always yield an improved estimate of the clusters' characteristics.
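The two steps map directly onto two small functions; here is a sketch in NumPy (the names e_step and m_step are mine), which the iterative loop shown earlier simply alternates:

```python
import numpy as np

def e_step(X, centroids):
    # Expectation: assign every data point to its closest cluster centre
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def m_step(X, labels, k):
    # Maximization: move each centre to the mean of its assigned points
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```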
K-means uses this iterative procedure to produce its final clustering based on the predefined number of clusters, chosen according to the dataset and represented by the variable K.
For instance, if K is set to 3, the dataset is categorized into 3 clusters; if K is equal to 4, the number of clusters will be 4, and so on.
The fundamental aim is to define K centres, one for each cluster. These centres must be placed carefully, because different placements cause different outcomes, so it is best to put them as far away from each other as possible.
Also, the maximum number of plausible clusters equals the total number of observations present in the dataset.

Working of K-means Algorithm


By specifying the value of K, you inform the algorithm of how many means, or centres, you are looking for. To repeat: if K is equal to 3, the algorithm will look for 3 clusters.

Following are the steps for working of the k-means algorithm:


• K centres are placed randomly, in accordance with the chosen value of K.
• K-means assigns each data point in the dataset to the nearest centre, attempting to minimize the Euclidean distance between points and centres. A data point is taken to belong to a particular cluster if it is nearer to that cluster's centre than to any other cluster centre.
• After that, k-means recomputes each centre as the mean of all the data points assigned to it. This reduces the total intra-cluster variance relative to the previous step. Here, the "means" refers to this averaging of the data points, which identifies the new centre in the k-means method.
The algorithm repeats steps 2 and 3 until some criterion is met, such as: the sum of distances between data points and their respective centres is minimized, an appropriate number of iterations has been reached, the cluster centres no longer change, or no data points change cluster.

Stopping Criteria for K-Means Clustering


On a core note, three criteria are considered for stopping the k-means clustering algorithm:
1. The centroids of the newly built clusters stop changing: the algorithm can be brought to an end if the centroids of the newly constructed clusters are no longer altering. If, even after multiple iterations, the obtained centroids are the same for all clusters, we can conclude that the algorithm is not learning any new pattern, which is a sign to stop its execution/training on the dataset.
2. Data points remain in the same cluster: the training process can also be halted if the data points stay in the same clusters even after training the algorithm for further iterations.
3. The maximum number of iterations has been reached: finally, the training can be stopped once a preset maximum number of iterations is reached. For example, if the number of iterations is set to 200, the process will repeat at most 200 times before coming to an end.
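For reference, scikit-learn's KMeans exposes criteria 1 and 3 as its tol and max_iter parameters; a sketch matching the 200-iteration example above:

```python
from sklearn.cluster import KMeans

# Stop when the centroids move less than tol between iterations,
# or after at most max_iter iterations, whichever comes first
km = KMeans(n_clusters=3, n_init=10, max_iter=200, tol=1e-4, random_state=0)
```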
Applications of K-means Clustering
The fact of the matter is that real-world data is always complicated, messy and noisy. Let's look at some of the many places where k-means clustering can be applied:
1. K-means clustering is applied in Call Detail Record (CDR) analysis. It gives in-depth insight into customer requirements and satisfaction on the basis of call traffic during the time of day and the demographics of a particular location.
2. It is used in document clustering to group compatible documents together.
3. It is deployed to classify sounds on the basis of their similar patterns and to detect abnormalities in them.
4. It serves as a model for lossy image compression: when compressing images, K-means clusters the pixels of an image in order to decrease its total size.
5. It is helpful in the business sector for recognizing segments of purchases made by customers, and for clustering activity on apps and websites.
6. In insurance and fraud detection, it is plausible, on the basis of prior data, to flag fraudulent claims through their proximity to clusters that past patterns indicate to be fraudulent.

K-means vs Hierarchical Clustering


1. K-means clustering produces a specific number of clusters, a single flat partition of the dataset, whereas hierarchical clustering builds a hierarchy of clusters rather than just one partition of the objects.
2. K-means can be used for categorical data only after it is first converted to numeric form, for example by assigning rank values; hierarchical clustering was traditionally chosen for categorical data, but due to its complexity, rank-value encodings of categorical features are used there as well.
3. K-means is highly sensitive to noise in the dataset, whereas hierarchical clustering is less sensitive to noise.
4. The performance of the K-means algorithm improves as the RMSE decreases, and the RMSE decreases as the number of clusters increases, at the cost of longer execution time; by this measure, hierarchical clustering performs worse.
5. K-means is good for large datasets, while hierarchical clustering is good for small datasets.
