Assignment 4 A


Department of AI & DS MCOERC, NASHIK

Assignment No: 4A

Title: Clustering Analysis


Implement K-Means clustering on the Iris.csv dataset. Determine the number of clusters using
the elbow method.

Dataset Link: https://www.kaggle.com/datasets/uciml/iris

Objectives:
• To find the optimum number of clusters using the elbow method
• To apply the K-Means algorithm to the Iris dataset

Theory:
Clustering is a technique in which data points are grouped by similarity dynamically, without any
pre-assignment of groups. For example, in a simple plot of data points, some sets of points lie closer
to each other than to the rest. These sets can form groups (or clusters) for further data analysis.

K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in
machine learning and data science. It groups an unlabeled dataset into different clusters. Here K
defines the number of pre-defined clusters to be created in the process: if K=2 there will be two
clusters, for K=3 there will be three clusters, and so on. It is an iterative algorithm that divides the
unlabeled dataset into K different clusters in such a way that each data point belongs to only one
group of similar points. It lets us cluster the data into different groups and is a convenient way to
discover the categories present in an unlabeled dataset on its own, without any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
the algorithm is to minimize the sum of distances between the data points and their corresponding
cluster centroids. The algorithm takes the unlabeled dataset as input, divides it into K clusters, and
repeats the process until it finds the best clusters. The value of K must be predetermined. The
K-means clustering algorithm mainly performs two tasks:
• Determines the best values for the K centre points (centroids) by an iterative process.
• Assigns each data point to its closest centroid; the data points nearest to a particular centroid
form a cluster.


[Diagram: working of the K-Means Clustering Algorithm]

The working of the K-Means algorithm is explained in the steps below:


Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step-3, i.e. reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go to Step-4; otherwise FINISH.
Step-7: The model is ready.
Algorithm:
At a high level, the following steps are taken to cluster the data using the K-means clustering
algorithm.
1. Decide the number of clusters, k, that you want to group your data points into.
2. Select k random data points as centroids.
3. Compute the distance from each data point to each centroid and assign every data point to the
closest cluster centroid. The distance d between any two points (x1, y1) and (x2, y2) is calculated as

d = √((x2 - x1)² + (y2 - y1)²)


4. Recompute the centroids of the newly formed clusters. The centroid (Xc, Yc) of the m data
points in a cluster is calculated as

Xc = (x1 + x2 + ... + xm) / m,   Yc = (y1 + y2 + ... + ym) / m

i.e. the simple arithmetic mean of all X coordinates and Y coordinates of the m data points in
the cluster.
5. Repeat steps 3 and 4 until any of the following criteria is met:
a. The centroids of the newly formed clusters do not change.
b. Points remain in the same cluster.
c. The maximum desired number of iterations is reached.
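
As an illustration of the steps above, here is a minimal from-scratch sketch in Python/NumPy. It is
not the scikit-learn routine used later in this assignment; the function name kmeans_from_scratch and
its parameters are chosen only for illustration, and empty clusters are not handled, for brevity.

import numpy as np

def kmeans_from_scratch(X, k, max_iter=100, seed=0):
    # Step 2: pick k random data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

Called as kmeans_from_scratch(x, 3) on the Iris feature matrix x built in the sample code below, it
would return three centroids and a cluster label for each flower.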
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which
measures the total variation within a cluster. The formula to calculate WCSS (for 3 clusters) is
given below:

WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²

In the above formula, Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between
each data point in Cluster1 and its centroid C1, and the same applies to the other two terms. To
measure the distance between data points and a centroid, we can use any metric such as Euclidean
distance or Manhattan distance. To find the optimal number of clusters, the elbow method follows
the steps below:
● It executes K-means clustering on the given dataset for different K values (typically ranging
from 1 to 10).
● For each value of K, it calculates the WCSS value.
● It plots a curve of the calculated WCSS values against the number of clusters K.
● The sharp point of bend in the plot (where the curve looks like an arm) is considered the best
value of K. Since the graph shows a sharp bend that looks like an elbow, the technique is known
as the elbow method. The graph for the elbow method looks like the image below:

[Figure: WCSS plotted against the number of clusters K, with the elbow at the optimal K]
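
To make the WCSS formula concrete, the quantity can be computed by hand with a few lines of Python.
This is only an illustrative sketch in which X, labels and centroids stand for a feature matrix, the
cluster assignment of each point, and the cluster centroids respectively.

import numpy as np

def wcss(X, labels, centroids):
    # sum of squared Euclidean distances from each point to its own cluster centroid
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

For a model fitted with scikit-learn's KMeans, this is the value reported by the inertia_ attribute
used in the sample code below.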

Sample Code:
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

#importing the Iris dataset with pandas (adjust the path to where Iris.csv is stored)
dataset = pd.read_csv("C:\\Users\\CHETAN\\OneDrive\\Desktop\\Iris.csv")
#select the four measurement columns (columns 1-4); column 0 is the Id column
x = dataset.iloc[:, [1, 2, 3, 4]].values

Now we will implement the elbow method on the Iris dataset. The elbow method allows us to pick
the optimum number of clusters. Although we already know the answer is 3, it is still interesting
to run.


#Finding the optimum number of clusters for k-means classification


from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)   # inertia_ is the WCSS for this value of K
#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()

You can clearly see from the above graph why it is called 'the elbow method': the optimum number
of clusters is where the elbow occurs. This is the point beyond which the within-cluster sum of
squares (WCSS) no longer decreases significantly with each additional cluster. Now that we have
the optimum number of clusters, we can move on to applying K-means clustering to the Iris dataset.
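
As an optional numerical check (building on the wcss list computed above, not part of the original
write-up), the size of each successive drop in WCSS can be printed; the drop becomes small after K = 3.

#inspect how much WCSS decreases at each step from K to K+1
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
for k, d in zip(range(2, 11), drops):
    print(f"K={k}: WCSS decreased by {d:.1f}")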


#Applying kmeans to the dataset / Creating the kmeans classifier
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(x)

#Visualising the clusters (cluster numbers are arbitrary; the species names below match this particular run)
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')

#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'yellow', label = 'Centroids')
plt.legend()
plt.show()

Conclusion:
Students will be able to apply the K-Means algorithm and the elbow method on a dataset to find
the optimum number of clusters.

4. Clustering Analysis - KMeans (Jupyter Notebook)

Implement K-Means clustering on Iris.csv dataset. Determine the number of clusters using the elbow
method.

Dataset Link: https://www.kaggle.com/datasets/uciml/iris

Importing the libraries and the data

In [1]: import pandas as pd # Pandas (version : 1.1.5)


import numpy as np # Numpy (version : 1.19.2)
import matplotlib.pyplot as plt # Matplotlib (version : 3.3.2)
from sklearn.cluster import KMeans # Scikit Learn (version : 0.23.2)
import seaborn as sns # Seaborn (version : 0.11.1)

Importing the data from .csv file

First we read the data from the dataset using read_csv from the pandas library.

In [2]: data = pd.read_csv('iris.csv')

Viewing the data that we imported to pandas dataframe object

In [ ]: data

Out[24]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 6 columns

Viewing and Describing the data

Now we view the Head and Tail of the data using head() and tail() respectively.


In [ ]: data.head()

Out[25]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

In [ ]: data.tail()

Out[26]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

Checking the sample size of the data, i.e. how many samples there are in the dataset, using len() .

In [ ]: len(data)

Out[27]: 150

Checking the dimensions/shape of the dataset using shape .

In [ ]: data.shape

Out[28]: (150, 6)

Viewing Column names of the dataset using columns

In [ ]: data.columns

Out[29]: Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',


'Species'],
dtype='object')

In [ ]: for i,col in enumerate(data.columns):
            print(f'Column number {1+i} is {col}')

Column number 1 is Id
Column number 2 is SepalLengthCm
Column number 3 is SepalWidthCm
Column number 4 is PetalLengthCm
Column number 5 is PetalWidthCm
Column number 6 is Species

So, our dataset has 6 columns named:

Id
SepalLengthCm
SepalWidthCm
PetalLengthCm
PetalWidthCm
Species

View the datatypes of each column in the dataset using dtypes .

In [ ]: data.dtypes

Out[31]: Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object

Gathering Further information about the dataset using info()

In [ ]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB

Describing the data as basic statistics using describe()

In [ ]: data.describe()

Out[33]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

count 150.000000 150.000000 150.000000 150.000000 150.000000

mean 75.500000 5.843333 3.054000 3.758667 1.198667

std 43.445368 0.828066 0.433594 1.764420 0.763161

min 1.000000 4.300000 2.000000 1.000000 0.100000

25% 38.250000 5.100000 2.800000 1.600000 0.300000

50% 75.500000 5.800000 3.000000 4.350000 1.300000

75% 112.750000 6.400000 3.300000 5.100000 1.800000

max 150.000000 7.900000 4.400000 6.900000 2.500000

Checking the data for inconsistencies and further cleaning the data if
needed.

Checking data for missing values using isnull() .


In [ ]: data.isnull()

Out[34]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 False False False False False False

1 False False False False False False

2 False False False False False False

3 False False False False False False

4 False False False False False False

... ... ... ... ... ... ...

145 False False False False False False

146 False False False False False False

147 False False False False False False

148 False False False False False False

149 False False False False False False

150 rows × 6 columns

Checking summary of missing values

In [ ]: data.isnull().sum()

Out[35]: Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64

In [ ]: data.drop('Id', axis=1, inplace=True)


data.head()

Out[36]: SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

Modelling

K - Means Clustering

K-means clustering is a clustering algorithm that aims to partition n observations into k clusters:

Initialisation – K initial “means” (centroids) are generated at random.
Assignment – K clusters are created by associating each observation with the nearest centroid.
Update – The centroid of each cluster becomes the new mean.

Assignment and Update are repeated iteratively until convergence. The end result is that the sum of
squared errors between the points and their respective centroids is minimised. We will use KMeans
clustering. First we will find the optimal number of clusters based on inertia, using the elbow
method. The distance between the centroids and the data points should be small.


First we need to check the data for any missing values, as they can ruin our model.

In [ ]: data.isna().sum()

Out[37]: SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64

We conclude that we don't have any missing values; therefore we can go forward and start the
clustering procedure.

We will now view and select the data that we need for clustering.

In [ ]: data.head()

Out[38]: SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

Checking the value count of the target column i.e. 'Species' using value_counts()

In [ ]: data['Species'].value_counts()

Out[39]: Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Species, dtype: int64

Splitting into Training and Target data

Target Data

In [ ]: target_data = data.iloc[:,4]
target_data.head()

Out[40]: 0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
Name: Species, dtype: object

Training data


In [ ]: clustering_data = data.iloc[:,[0,1,2,3]]
clustering_data.head()

Out[41]: SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

Now, we need to visualize the data which we are going to use for the clustering. This will give us a fair idea
about the data we're working on.

In [ ]: fig, ax = plt.subplots(figsize=(15,7))
sns.set(font_scale=1.5)
# the colour value was truncated in the source; any valid hex colour works here
ax = sns.scatterplot(x=data['SepalLengthCm'], y=data['SepalWidthCm'], s=70, color='#f73434')
ax.set_ylabel('Sepal Width (in cm)')
ax.set_xlabel('Sepal Length (in cm)')
plt.title('Sepal Length vs Width', fontsize = 20)
plt.show()

This gives us a fair idea of some of the patterns in the data.
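
As an optional extra step (not in the original notebook), seaborn's pairplot gives the same kind of
overview for every pair of features at once, using the data frame loaded above:

In [ ]: sns.pairplot(data, hue='Species')   # pairwise scatter plots of all four features, coloured by species
plt.show()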

Determining No. of Clusters Required

The Elbow Method

The Elbow method runs k-means clustering on the dataset for a range of values for k (say from 1-10) and
then for each value of k computes an average score for all clusters. By default, the distortion score is
computed, which is the sum of squared distances from each point to its assigned center.

When these overall metrics for each model are plotted, it is possible to visually determine the best value for
k. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is the best value of
k. The “arm” can be either up or down, but if there is a strong inflection point, it is a good indication that the
underlying model fits best at that point.


We use the Elbow Method, which plots the Within-Cluster Sum of Squares (WCSS) against the number of
clusters (the K value) to figure out the optimal number of clusters. WCSS measures the sum of squared
distances of observations from their cluster centroids, which is given by the formula below:

WCSS = Σ (Xi − Yi)², where Yi is the centroid for observation Xi.

With this simple piece of code we get all the inertia values, i.e. the within-cluster sum of squares.

In [ ]: from sklearn.cluster import KMeans

wcss=[]
for i in range(1,11):
    km = KMeans(i)
    km.fit(clustering_data)
    wcss.append(km.inertia_)
np.array(wcss)

Out[43]: array([680.8244    , 152.36870648,  78.94084143,  57.31787321,
                 46.53558205,  38.93096305,  34.29998554,  30.21678683,
                 28.23999745,  25.95204113])

Inertia can be recognized as a measure of how internally coherent clusters are.

Now, we visualize the Elbow Method so that we can determine the number of optimal clusters for our
dataset.

In [ ]: fig, ax = plt.subplots(figsize=(15,7))
ax = plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.axvline(x=3, ls='--')
plt.ylabel('WCSS')
plt.xlabel('No. of Clusters (k)')
plt.title('The Elbow Method', fontsize = 20)
plt.show()

It is clear that the optimal number of clusters for our data is 3, as the slope of the curve is not
steep enough after that point. Observing this curve, we see that the last elbow comes at k = 3; it
would be difficult to see the elbow if we chose a higher range.


Clustering

Now we will build the model for creating clusters from the dataset. We will use n_clusters = 3,
i.e. 3 clusters, as determined by the elbow method to be optimal for our dataset.

Because this is an unsupervised problem we will use fit_predict(), which fits the model and returns
the cluster label of each sample (fit_transform() would instead return each sample's distances to
the cluster centres).
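
As a quick illustrative aside (not in the original notebook), the difference between the two calls
can be seen from the shapes they return when applied to the clustering_data prepared above:

In [ ]: labels = KMeans(n_clusters=3, random_state=0).fit_predict(clustering_data)    # one cluster label per sample
dists = KMeans(n_clusters=3, random_state=0).fit_transform(clustering_data)   # distance to each of the 3 centres
print(labels.shape, dists.shape)   # (150,) (150, 3)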

In [ ]: from sklearn.cluster import KMeans



kms = KMeans(n_clusters=3, init='k-means++')
kms.fit(clustering_data)

Out[45]: KMeans(n_clusters=3)

Now that we have the clusters created, we will store them in a new column

In [ ]: clusters = clustering_data.copy()
clusters['Cluster_Prediction'] = kms.fit_predict(clustering_data)
clusters.head()

Out[46]: SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Cluster_Prediction

0 5.1 3.5 1.4 0.2 1

1 4.9 3.0 1.4 0.2 1

2 4.7 3.2 1.3 0.2 1

3 4.6 3.1 1.5 0.2 1

4 5.0 3.6 1.4 0.2 1

We can also get the centroids of the clusters via the cluster_centers_ attribute of the KMeans estimator.

In [ ]: kms.cluster_centers_

Out[47]: array([[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
                 [5.006     , 3.418     , 1.464     , 0.244     ],
                 [6.85      , 3.07368421, 5.74210526, 2.07105263]])
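
Because K-means cluster numbers are arbitrary, it is useful to check how they correspond to the
actual species. The following optional sketch (not in the original notebook) cross-tabulates the
predicted clusters against the Species column still present in data:

In [ ]: pd.crosstab(clusters['Cluster_Prediction'], data['Species'])   # rows: predicted cluster, columns: true species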

Now that we have all the data we need, we just have to plot it. We will use a scatter plot, which
allows us to observe the different clusters in different colours.


In [ ]: fig, ax = plt.subplots(figsize=(15,7))
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 0]['SepalLengthCm'],
            y=clusters[clusters['Cluster_Prediction'] == 0]['SepalWidthCm'],
            s=70, edgecolor='teal', linewidth=0.3, c='teal', label='Iris-versicolor')

plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 1]['SepalLengthCm'],
            y=clusters[clusters['Cluster_Prediction'] == 1]['SepalWidthCm'],
            s=70, edgecolor='lime', linewidth=0.3, c='lime', label='Iris-setosa')

plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 2]['SepalLengthCm'],
            y=clusters[clusters['Cluster_Prediction'] == 2]['SepalWidthCm'],
            s=70, edgecolor='magenta', linewidth=0.3, c='magenta', label='Iris-virginica')

# the centroid colour and label were truncated in the source; 'yellow' and 'Centroids' follow the earlier sample code
plt.scatter(x=kms.cluster_centers_[:, 0], y=kms.cluster_centers_[:, 1], s = 170, c = 'yellow', label = 'Centroids')
plt.legend(loc='upper right')
plt.xlim(4,8)
plt.ylim(1.8,4.5)
ax.set_ylabel('Sepal Width (in cm)')
ax.set_xlabel('Sepal Length (in cm)')
plt.title('Clusters', fontsize = 20)
plt.show()
