Presentation 1
Now we will assign each data point of the scatter plot to its
closest K-point, or centroid. We do this using the familiar formula
for the distance between two points. With two centroids, this is
equivalent to drawing a median line between them (the
perpendicular bisector of the segment joining the two centroids).
Consider the below image:
From the above image, it is clear that the points to the left of the
line are nearer to K1, the blue centroid, and the points to the right
of the line are nearer to the yellow centroid. Let's color them blue
and yellow for clear visualization.
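The assignment step can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the slides: the function name `assign_to_nearest` and the sample coordinates are assumptions chosen for the example, and the distance is the ordinary Euclidean distance.

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    labels = []
    for p in points:
        # Euclidean distance from this point to every centroid
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical data: two points near each of two centroids
points = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign_to_nearest(points, centroids)
print(labels)  # [0, 0, 1, 1]
```

Points labeled 0 would be colored like the first centroid and points labeled 1 like the second.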
As we need to find the closest cluster, we will repeat the
process by choosing new centroids.
• To choose the new centroids, we will compute the center of
gravity (the mean position) of the data points in each cluster,
and will find the new centroids as below:
From the above image, we can see that one yellow point is on the
left side of the line and two blue points are to the right of the line.
So, these three points will be reassigned to the new centroids.
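Computing a center of gravity amounts to averaging the coordinates of the points currently assigned to each cluster. The sketch below shows one way to do that; the function name `update_centroids` and the sample data are illustrative assumptions, and raising an error for an empty cluster is just one possible policy.

```python
def update_centroids(points, labels, k):
    """Return the mean position of the points assigned to each of k clusters."""
    new_centroids = []
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        if not members:
            # Assumption: treat an empty cluster as an error in this sketch
            raise ValueError(f"cluster {cluster} has no points")
        # Average each coordinate (x, then y) over the cluster's members
        new_centroids.append(
            tuple(sum(axis) / len(members) for axis in zip(*members))
        )
    return new_centroids

# Hypothetical data: the labels say which cluster each point belongs to
points = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
labels = [0, 0, 1, 1]

new_centroids = update_centroids(points, labels, k=2)
print(new_centroids)  # [(1.5, 1.5), (8.5, 8.5)]
```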
• As reassignment has taken place, we will again go to
step-4, which is finding new centroids or K-points.
– We will repeat the process by finding the center of gravity
of the points in each cluster, so the new centroids will be as
shown in the below image:
• As we have new centroids, we will again draw the median
line and reassign the data points. So, the image will be:
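The repeated cycle of reassigning points and recomputing centroids can be written as a loop that stops once no point changes cluster. The following is a minimal sketch under stated assumptions: the name `k_means`, the `max_iter` safety limit, the choice to keep an old centroid when its cluster is empty, and the sample points are all illustrative, not taken from the slides.

```python
import math

def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid updates until no label changes."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no reassignment took place, so the clusters are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its points
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical data with a deliberately poor starting guess
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, final_labels = k_means(points, [(1.0, 1.0), (2.0, 1.0)])
print(final_labels)  # [0, 0, 0, 1, 1, 1]
```

With this starting guess the first pass puts (2, 1) in the wrong group; the second pass corrects it, and the third pass finds nothing left to reassign.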
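• The Elbow method is one of the most popular ways to find the
optimal number of clusters. This method uses the WCSS value.
WCSS stands for Within Cluster Sum of Squares, which measures
the total variation within the clusters. The formula to calculate
the value of WCSS (for 3 clusters) is given below:
$$\mathrm{WCSS} = \sum_{P_i \in \text{Cluster}_1} d(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} d(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} d(P_i, C_3)^2$$
Here $d(P_i, C_k)$ is the distance between data point $P_i$ and the
centroid $C_k$ of its cluster.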
Note: We can choose any number of clusters up to the number of data
points. If we choose the number of clusters equal to the number of
data points, then the value of WCSS becomes zero, and that will be
the endpoint of the plot.
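A small worked example makes the behaviour of WCSS concrete, including the zero value described in the note. The helper name `wcss`, the four sample points, and the hand-picked cluster assignments are assumptions for illustration; in practice the labels and centroids would come from running K-means for each value of K.

```python
def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for p, label in zip(points, labels):
        c = centroids[label]
        total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total

# Hypothetical data: two tight pairs of points
points = [(1.0, 1.0), (3.0, 1.0), (8.0, 8.0), (8.0, 10.0)]

# K = 1: one cluster whose centroid is the mean of all points
wcss_k1 = wcss(points, [0, 0, 0, 0], [(5.0, 5.0)])

# K = 2: one centroid per pair
wcss_k2 = wcss(points, [0, 0, 1, 1], [(2.0, 1.0), (8.0, 9.0)])

# K = number of points: every point is its own centroid
wcss_k4 = wcss(points, [0, 1, 2, 3], points)

print(wcss_k1, wcss_k2, wcss_k4)  # 104.0 4.0 0.0
```

Plotting these values against K gives the elbow curve: the drop from K = 1 to K = 2 is large, and further increases in K bring little additional improvement.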
Python Implementation of K-means Clustering Algorithm
• The first step will be data pre-processing, as we did in our earlier
topics of Regression and Classification. But for the clustering problem,
it will be different from other models. Let's discuss it:
– Importing Libraries
As we did in previous topics, we will first import the libraries for our model,
which is part of data pre-processing. The code is given below:
# importing libraries
import numpy as nm               # numerical calculations
import matplotlib.pyplot as mtp  # plotting graphs
import pandas as pd              # managing the dataset
• In the above code, numpy is imported for performing
mathematical calculations, matplotlib is for plotting the
graph, and pandas is for managing the dataset.
– Importing the Dataset:
From the above plot, we can see the elbow point is at 5. So the
number of clusters here will be 5.
Step-3: Training the K-means algorithm on the training dataset
mtp.title('Clusters of customers')
mtp.legend()
mtp.show()
In the above lines of code, we have written code for each of the
clusters, ranging from 1 to 5. The first argument of mtp.scatter, i.e.,
x[y_predict == 0, 0], selects the x values (the first column of the
matrix of features) of the points assigned to the first cluster;
y_predict holds the cluster index of each point and ranges from 0 to 4.
Output:
• The output image clearly shows the five different clusters
in different colors. The clusters are formed between two
parameters of the dataset: annual income of the customer and
spending. We can change the colors and labels as per
requirement or choice. We can also observe some points from
the above patterns, which are given below:
Cluster1 shows the customers with average salary and
average spending so we can categorize these customers as
Cluster2 shows customers with high income but low
spending, so we can categorize them as careful.
Cluster3 shows customers with low income and low spending, so
they can be categorized as sensible.
Cluster4 shows customers with low income and very high
spending, so they can be categorized as careless.
Cluster5 shows customers with high income and high
spending, so they can be categorized as target; these
customers can be the most profitable for the mall
owner.