
K-Means Clustering Algorithm

• K-Means is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science.
• In this topic, we will learn what the K-means clustering algorithm is, how it works, and how to implement K-means clustering in Python.
What is K-Means Algorithm?

• K-Means Clustering is an Unsupervised Learning algorithm that groups an unlabeled dataset into different clusters.
• Here K defines the number of pre-defined clusters to be created in the process: if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into K clusters in such a way that each data point belongs to only one group of points with similar properties.
• It allows us to cluster the data into different groups and provides a convenient way to discover the categories of an unlabeled dataset on its own, without any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
• The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it can no longer improve the clusters. The value of k must be predetermined in this algorithm.
• The k-means clustering algorithm mainly performs two tasks:
– Determines the best positions for the K center points (centroids) by an iterative process.
– Assigns each data point to its closest center. The data points that are near a particular center form a cluster.
• Hence each cluster contains data points with some commonalities and is away from the other clusters.
The below diagram explains the working of the K-means
Clustering Algorithm:
How does the K-Means Algorithm Work?

 The working of the K-Means algorithm is explained in the steps below:
• Step-1: Select the number K to decide the number of clusters.
• Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
• Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
• Step-4: Calculate the variance and place a new centroid for each cluster.
• Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
• Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.
• Step-7: The model is ready.
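These steps can be sketched directly in code. Below is a minimal, illustrative implementation of the loop described above using NumPy; the function name kmeans_from_scratch, the random initialization, and the convergence check are our own choices for this sketch and are separate from the sklearn-based code used later in this topic.

# minimal sketch of the K-means loop in Steps 1-7 (illustrative only; empty clusters are not handled)
import numpy as np

def kmeans_from_scratch(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the dataset as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-3/5: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: place the new centroid of each cluster at the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: if no centroid moves, no reassignment will occur, so we stop
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

For example, labels, centers = kmeans_from_scratch(data, k=2) would return a cluster label for every row of a NumPy array data, along with the two final centroids.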
 Let's understand the above steps by considering the visual
plots:
 Suppose we have two variables M1 and M2. The x-y axis
scatter plot of these two variables is given below:
• Let's take the number of clusters as K=2, i.e., we will try to group the dataset into two different clusters.
• We need to choose K random points or centroids to form the clusters. These points can be either points from the dataset or any other points. Here we select the two points shown below as the K points; they are not part of our dataset. Consider the image below:

Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this using the familiar mathematics for the distance between two points, which for two centroids amounts to drawing a median line between them.
Consider the image below:

From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
Since we need to find the closest cluster for every point, we will repeat the process by choosing new centroids.
• To choose the new centroids, we will compute the center of gravity of the points in each cluster and obtain the new centroids as shown below:

• Next, we will reassign each data point to its new closest centroid. For this, we repeat the same process of finding a median line. The new median line will look like the image below:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
• As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
– We repeat the process of finding the center of gravity of each cluster, so the new centroids will be as shown in the image below:
• As we have the new centroids, we will again draw the median line and reassign the data points, giving the image below.
• We can see in the above image that there are no dissimilar data points on either side of the line, which means our model has converged.
Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the image below:
How to choose the value of "K number of clusters" in
K-means Clustering?

The performance of the K-means clustering algorithm depends on how efficient the clusters it forms are, but choosing the optimal number of clusters is a big task. There are several ways to find the optimal number of clusters; here we discuss the most widely used method for finding the number of clusters, i.e., the value of K. The method is given below:
Elbow Method

• The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the concept of the WCSS value. WCSS stands for Within Cluster Sum of Squares, which measures the total variation within the clusters. The formula to calculate the value of WCSS (for 3 clusters) is given below:

• WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

• In the above formula of WCSS,
• ∑Pi in Cluster1 distance(Pi, C1)²: this is the sum of the squared distances between each data point in Cluster1 and its centroid C1; the other two terms are defined in the same way for Cluster2 and Cluster3.
• To measure the distance between a data point and a centroid, we can use any distance metric, such as Euclidean distance or Manhattan distance.
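As an illustration, the short sketch below computes WCSS by hand using squared Euclidean distances; the names X, labels, and centroids are placeholders for the arrays produced by any clustering run, not identifiers from this topic's later code.

# illustrative sketch: WCSS as the sum of squared Euclidean distances
# between each data point and the centroid of the cluster it belongs to
import numpy as np

def wcss(X, labels, centroids):
    diffs = X - centroids[labels]   # vector from each point to its own cluster's centroid
    return np.sum(diffs ** 2)       # sum of squared distances over all points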
 To find the optimal number of clusters, the elbow method follows the steps below:
– It executes K-means clustering on a given dataset for different K values (usually ranging from 1 to 10).
– For each value of K, it calculates the WCSS value.
– It plots a curve between the calculated WCSS values and the number of clusters K.
– The sharp point of the bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, it is known as the elbow method. The graph for the elbow method looks like the image below:

Note: We can choose the number of clusters to be equal to the number of data points. In that case every point is its own centroid, so the WCSS value becomes zero, and that is the endpoint of the plot.
Python Implementation of K-means Clustering Algorithm

 In the above section, we discussed the K-means algorithm; now let's see how it can be implemented using Python.
Before the implementation, let's understand what type of problem we will solve here. We have a Mall_Customers dataset, which contains data about customers who visit the mall and spend there.
 In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value of how much a customer has spent in the mall; the higher the value, the more they have spent). From this dataset, we need to find some patterns; as this is an unsupervised method, we don't know exactly what to look for.
• The steps to be followed for the implementation are given
below:
 Data Pre-processing
 Finding the optimal number of clusters using the elbow
method
 Training the K-means algorithm on the training dataset
 Visualizing the clusters
Step-1: Data pre-processing Step

• The first step is data pre-processing, as we did in our earlier topics on Regression and Classification. But for the clustering problem, it will be different from the other models. Let's discuss it:
– Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our model,
which is part of data pre-processing. The code is given below:

# importing libraries

import numpy as nm

import matplotlib.pyplot as mtp

import pandas as pd
• In the above code, numpy is imported for performing mathematical calculations, matplotlib for plotting the graph, and pandas for managing the dataset.
– Importing the Dataset:

Next, we will import the dataset that we need to use. Here we are using the Mall_Customers_data.csv dataset. It can be imported using the code below:
# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
 By executing the above lines of code, we will get our dataset
in the Spyder IDE. The dataset looks like the below image:

From the above dataset, we need to find some patterns.


Extracting Independent Variables

 Here we don't need any dependent variable for the data pre-processing step, as this is a clustering problem and we have no idea what to determine. So we will just add a line of code for the matrix of features.
x = dataset.iloc[:, [3, 4]].values
 As we can see, we are extracting only the 3rd and 4th features (Annual Income and Spending Score). This is because we need a 2-D plot to visualize the model, and some features, such as customer_id, are not required.
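Equivalently, assuming the CSV's column headers match the axis labels used later in the plots ('Annual Income (k$)' and 'Spending Score (1-100)'), the same matrix of features could be selected by name:

# equivalent selection by column name (assumes these exact header names exist in the CSV)
x = dataset[['Annual Income (k$)', 'Spending Score (1-100)']].values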
Step-2: Finding the optimal number of clusters using
the elbow method

In the second step, we will try to find the optimal number of clusters for our clustering problem. As discussed above, we are going to use the elbow method for this purpose.
As we know, the elbow method uses the WCSS concept to draw a plot with WCSS values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the WCSS value for different k values ranging from 1 to 10. Below is the code for it:
# finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  # initializing the list for the values of WCSS
# using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()
 As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
 Next, we have created the wcss_list variable, initialized as an empty list, which is used to hold the WCSS values computed for different values of k ranging from 1 to 10.
 After that, we have initialized the for loop to iterate over the different values of k from 1 to 10; since a for loop in Python excludes the upper bound, the range goes up to 11 so that the 10th value is included.
 The rest of the code is similar to what we did in earlier topics: we fit the model on the matrix of features and then plot the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below
output:

From the above plot, we can see the elbow point is at 5. So the
number of clusters here will be 5.
Step- 3: Training the K-means algorithm on the training dataset

 As we have got the number of clusters, we can now train the model on the dataset.
 To train the model, we will use the same two lines of code as in the section above, but instead of i we will use 5, as we know there are 5 clusters to be formed. The code is given below:
# training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
The first line is the same as above, creating the object of the KMeans class.
In the second line of code, we create the variable y_predict, which holds the predicted cluster label for each data point.
• By executing the above lines of code, we will get the y_predict
variable. We can check it under the variable explorer option
in the Spyder IDE. We can now compare the values of
y_predict with our original dataset. Consider the below
image:
From the above image, we can now see that CustomerID 1 belongs to cluster 3 (as the index starts from 0, a label of 2 corresponds to the 3rd cluster), CustomerID 2 belongs to cluster 4, and so on.
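If you are not using the Spyder variable explorer, a quick way to make the same comparison (a sketch, using the dataset and y_predict variables from the code above) is to attach the labels to the dataframe:

# attach the predicted cluster label to each row for side-by-side comparison
dataset['Cluster'] = y_predict
print(dataset.head())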
Step-4: Visualizing the Clusters

• The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one.
• To visualize the clusters, we will draw a scatter plot using the mtp.scatter() function of matplotlib.
# visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')  # for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')  # for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')  # for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')  # for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')  # for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
In the above lines of code, we have written one scatter call for each of the 5 clusters. In the first argument of mtp.scatter, x[y_predict == 0, 0] selects the rows of the feature matrix belonging to cluster 0 and takes their first column (Annual Income); the second argument does the same for the second column (Spending Score). The cluster labels in y_predict range from 0 to 4.
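The indexing pattern used above is plain NumPy boolean masking. The small standalone example below (with made-up values, not taken from the Mall_Customers data) shows how it picks out the rows of one cluster:

# toy example of the x[y_predict == k, column] indexing used above
import numpy as np

x_demo = np.array([[15, 39], [16, 81], [17, 6], [18, 77]])   # 4 points, 2 features
labels_demo = np.array([0, 1, 0, 1])                         # pretend cluster labels
print(x_demo[labels_demo == 0, 0])   # first feature of the points in cluster 0 -> [15 17]
print(x_demo[labels_demo == 0, 1])   # second feature of those same points -> [39  6]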
Output:
• The output image clearly shows the five different clusters with different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:
 Cluster 1 shows the customers with average salary and average spending, so we can categorize these customers as standard.
 Cluster 2 shows the customers with high income but low spending, so we can categorize them as careful.
 Cluster 3 shows low income and also low spending, so they can be categorized as sensible.
 Cluster 4 shows the customers with low income but very high spending, so they can be categorized as careless.
 Cluster 5 shows the customers with high income and high spending, so they can be categorized as target customers, and these can be the most profitable customers for the mall owner.
