Assignment No: 4A
Objectives:
• To find the optimum number of clusters using the elbow method
• To apply the K-Means algorithm on the Iris dataset
Theory:
Clustering is a technique in which data points are arranged into similar groups dynamically,
without any pre-assignment of groups. For example, in a simple scatter plot of data points,
some sets of points lie closer to each other than to the rest; each such set could form a
group (or cluster) for further data analytics.
K-Means Clustering
K-Means clustering is an unsupervised learning algorithm that is used to solve clustering
problems in machine learning and data science. It groups an unlabeled dataset into different
clusters. Here K defines the number of pre-defined clusters to be created in the process: if
K=2 there will be two clusters, for K=3 there will be three clusters, and so on. It is an
iterative algorithm that divides the unlabeled dataset into K clusters in such a way that each
data point belongs to only one group of points with similar properties. It allows us to cluster
the data into different groups and is a convenient way to discover the categories in an
unlabeled dataset on its own, without the need for any training. It is a centroid-based
algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is
to minimize the sum of distances between the data points and their corresponding cluster
centroids. The algorithm takes the unlabeled dataset as input, divides it into K clusters, and
repeats the process until it finds the best clusters. The value of K should be predetermined.
The K-means clustering algorithm mainly performs two tasks:
• Determines the best value for K center points or centroids by an iterative process.
• Assigns each data point to its closest center; the data points near a given center form a
cluster.
The working of the K-means clustering algorithm can be summarised in the following steps (a
short code sketch of these steps is given after the list):
1. Select the number K to decide the number of clusters.
2. Select K random points from the data as the initial centroids.
3. Assign each data point to its nearest centroid, forming K clusters.
4. Recompute the centroids of the newly formed clusters. The centroid $(X_c, Y_c)$ of the $m$
data points in a cluster is calculated as

$$X_c = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad Y_c = \frac{1}{m}\sum_{i=1}^{m} Y_i$$
It is a simple arithmetic mean of all X coordinates and Y coordinates of the m data points in
the cluster.
5. Repeat steps 3 and 4 until any of the following criteria is met:
a. The centroids of the newly formed clusters do not change.
b. Points remain in the same cluster.
c. The maximum number of iterations is reached.
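The following is a minimal sketch of these steps in NumPy; the function name kmeans and the
random initialisation are illustrative assumptions, not a fixed API:

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch: assign points to the nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):                       # step 5c: iteration cap
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the arithmetic mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 5a: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids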
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares,
which defines the total variation within a cluster. The formula to calculate the WCSS value
(for 3 clusters) is given below:

$$WCSS = \sum_{P_i \in Cluster_1} distance(P_i, C_1)^2 + \sum_{P_i \in Cluster_2} distance(P_i, C_2)^2 + \sum_{P_i \in Cluster_3} distance(P_i, C_3)^2$$

Here $\sum_{P_i \in Cluster_1} distance(P_i, C_1)^2$ is the sum of the squared distances
between each data point in Cluster 1 and its centroid $C_1$, and the same holds for the other
two terms. To measure the distance between data points and a centroid, we can use any method
such as Euclidean distance or Manhattan distance. To find the optimal number of clusters,
the elbow method follows the steps below:
● It executes K-means clustering on the given dataset for different K values (typically
ranging from 1 to 10).
● For each value of K, it calculates the WCSS value.
● It plots a curve of the calculated WCSS values against the number of clusters K.
● The sharp point of bend, where the plot looks like an arm, is taken as the best value
of K. Since the graph shows a sharp bend that looks like an elbow, this technique is known
as the elbow method.
Sample Code:
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')
Now we will implement the elbow method on the Iris dataset. The elbow method allows us to
pick the optimum number of clusters. Although we already know the answer is 3, it is still
interesting to run; a sketch of this step is shown below.
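A minimal sketch of this step, assuming the Iris features are already loaded into a DataFrame
named data (as in the walkthrough below), that columns 1 to 4 hold the four measurements, and
that random_state=42 is just an illustrative choice:

# Compute WCSS (inertia) for k = 1..10 on the four feature columns
X = data.iloc[:, [1, 2, 3, 4]]   # SepalLengthCm .. PetalWidthCm
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)     # inertia_ is the WCSS for this value of k
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('No. of Clusters (k)')
plt.ylabel('WCSS')
plt.title('The Elbow Method')
plt.show()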
You can clearly see from the above graph why it is called the elbow method: the optimum
number of clusters is where the elbow occurs. This is the point after which the within-cluster
sum of squares (WCSS) no longer decreases significantly with every additional cluster. Now
that we have the optimum number of clusters, we can move on to applying K-means clustering
to the Iris dataset.
Conclusion:
Students will be able to apply the K-Means algorithm and the elbow method on a dataset to
find the optimum number of clusters.
Implement K-Means clustering on the Iris.csv dataset. Determine the number of clusters using
the elbow method.
First we read the data from the dataset using read_csv from the pandas library.
In [ ]: data = pd.read_csv('Iris.csv')
data
Now we view the Head and Tail of the data using head() and tail() respectively.
In [ ]: data.head()
In [ ]: data.tail()
Checking the sample size of data - how many samples are there in the dataset using len() .
In [ ]: len(data)
Out[27]: 150
In [ ]: data.shape
Out[28]: (150, 6)
In [ ]: data.columns
Out[29]: Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm',
'PetalWidthCm', 'Species'],
dtype='object')
The dataset has six columns: Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm,
and Species.
In [ ]: data.dtypes
Out[31]: Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object
In [ ]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [ ]: data.describe()
Checking the data for inconsistencies and further cleaning the data if
needed.
In [ ]: data.isnull()
In [ ]: data.isnull().sum()
Out[35]: Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
Modelling
K-Means Clustering
K-means clustering is a clustering algorithm that aims to partition n observations into k
clusters:
• Initialisation – k initial “means” (centroids) are generated at random.
• Assignment – k clusters are created by associating each observation with the nearest
centroid.
• Update – the centroid of each cluster becomes the new mean.
Assignment and Update are repeated iteratively until convergence. The end result is that the
sum of squared errors between points and their respective centroids is minimised. We will use
KMeans clustering: first we will find the optimal number of clusters based on inertia, using
the elbow method. The distance between the centroids and the data points should be as small
as possible.
First we need to check the data for any missing values as it can ruin our model.
In [ ]: data.isna().sum()
Out[37]: Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
We conclude that we don't have any missing values therefore we can go forward and start the clustering
procedure.
We will now view and select the data that we need for clustering.
In [ ]: data.head()
Checking the value count of the target column i.e. 'Species' using value_counts()
In [ ]: data['Species'].value_counts()
Out[39]: Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Species, dtype: int64
Target Data
In [ ]: target_data = data.iloc[:, 5]   # column index 5 is the 'Species' column
target_data.head()
Out[40]: 0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
Name: Species, dtype: object
Training data
In [ ]: clustering_data = data.iloc[:, [1, 2, 3, 4]]   # the four feature columns; Id is excluded
clustering_data.head()
Now, we need to visualize the data which we are going to use for the clustering. This will give us a fair idea
about the data we're working on.
In [ ]: fig, ax = plt.subplots(figsize=(15,7))
sns.set(font_scale=1.5)
ax = sns.scatterplot(x=data['SepalLengthCm'], y=data['SepalWidthCm'], s=70, color='#f73434')
ax.set_ylabel('Sepal Width (in cm)')
ax.set_xlabel('Sepal Length (in cm)')
plt.title('Sepal Length vs Width', fontsize = 20)
plt.show()
This gives us a fair idea of the patterns in the data.
The Elbow method runs k-means clustering on the dataset for a range of values for k (say from 1-10) and
then for each value of k computes an average score for all clusters. By default, the distortion score is
computed, the sum of square distances from each point to its assigned center.
When these overall metrics for each model are plotted, it is possible to visually determine the best value for
k. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is the best value of
k. The “arm” can be either up or down, but if there is a strong inflection point, it is a good indication that the
underlying model fits best at that point.
We use the elbow method, which plots the Within Cluster Sum of Squares (WCSS) against the
number of clusters (the K value), to figure out the optimal number of clusters. WCSS measures
the sum of squared distances of observations from their cluster centroids, which is given by
the formula below:

$$WCSS = \sum_{i=1}^{n} (X_i - Y_i)^2$$

where $Y_i$ is the centroid for observation $X_i$.
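A sketch of the computation, assuming clustering_data from above and collecting the inertia
values in a list named wcss (the name the plotting cell below expects); random_state=42 is an
illustrative choice:

In [ ]: wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(clustering_data)
    wcss.append(km.inertia_)   # inertia_ is the within-cluster sum of squares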
With this simple loop we get all the inertia values, i.e. the within-cluster sums of squares.
Now, we visualize the Elbow Method so that we can determine the number of optimal clusters for our
dataset.
In [ ]: fig, ax = plt.subplots(figsize=(15,7))
ax = plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.axvline(x=3, ls='--')
plt.ylabel('WCSS')
plt.xlabel('No. of Clusters (k)')
plt.title('The Elbow Method', fontsize = 20)
plt.show()
It is clear that the optimal number of clusters for our data is 3, as the slope of the curve
is not steep enough after that point. When we observe this curve, we see that the last elbow
comes at k = 3; it would be difficult to visualise the elbow if we chose a higher range.
Clustering
Now we will build the model for creating clusters from the dataset. We will use n_clusters = 3,
i.e. the 3 clusters determined by the elbow method, which is optimal for our dataset.
Our dataset is for unsupervised learning, therefore we will use fit_predict(). If we were
working with a supervised learning dataset we would use fit_transform().
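A minimal fitting cell along these lines (assuming the estimator is stored as kms, the name
used in the cells below) would produce the output shown:

In [ ]: kms = KMeans(n_clusters=3)
kms.fit(clustering_data)   # the repr of the fitted estimator is echoed as the cell output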
Out[45]: KMeans(n_clusters=3)
Now that we have the clusters created, we will store them in a new column.
In [ ]: clusters = clustering_data.copy()
clusters['Cluster_Prediction'] = kms.fit_predict(clustering_data)
clusters.head()
We can also get the centroids of the clusters from the cluster_centers_ attribute of the fitted KMeans estimator.
In [ ]: kms.cluster_centers_
Now that we have all the data we need, we just need to plot it. We will use a scatterplot,
which allows us to observe the different clusters in different colours.
In [ ]: fig, ax = plt.subplots(figsize=(15,7))
# Plot each predicted cluster with the species name it corresponds to
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 0]['SepalLengthCm'],
            y=clusters[clusters['Cluster_Prediction'] == 0]['SepalWidthCm'],
            s=70, edgecolor='teal', linewidth=0.3, c='teal', label='Iris-versicolor')
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 1]['SepalLengthCm'],
            y=clusters[clusters['Cluster_Prediction'] == 1]['SepalWidthCm'],
            s=70, edgecolor='lime', linewidth=0.3, c='lime', label='Iris-setosa')
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 2]['SepalLengthCm'],
            y=clusters[clusters['Cluster_Prediction'] == 2]['SepalWidthCm'],
            s=70, edgecolor='magenta', linewidth=0.3, c='magenta', label='Iris-virginica')
# Mark the centroids (first two features: sepal length and sepal width)
plt.scatter(x=kms.cluster_centers_[:, 0], y=kms.cluster_centers_[:, 1],
            s=170, c='yellow', label='Centroids')
plt.legend(loc='upper right')
plt.xlim(4,8)
plt.ylim(1.8,4.5)
ax.set_ylabel('Sepal Width (in cm)')
ax.set_xlabel('Sepal Length (in cm)')
plt.title('Clusters', fontsize = 20)
plt.show()