
Name - Dikshant Gupta
Roll No. - B21CI014
Prof. - Yashashwi Verma
Subject Name - Introduction to Machine Learning

LAB REPORT 6

K-means introduction -
K-means clustering is one of the simplest and most popular unsupervised
machine learning algorithms. The target number k indicates the number of
centroids required for the dataset. A centroid represents the center of a
cluster and can be either an actual data point or an imaginary location.
The word "means" in K-means refers to averaging the data, i.e., finding
the centroid. The K-means technique begins with an initial set of randomly
chosen centroids, which serve as the starting points for each cluster, and
then uses iterative (repeated) calculations to optimize the positions of
the centroids.
"It is an iterative algorithm that divides the unlabeled dataset into k
different clusters in such a way that each data point belongs to only one
group of points with similar properties."

The k-means clustering algorithm mainly performs two tasks (a minimal
sketch of both steps is given after this list):

1. Determines the best values for the K center points (centroids) by an
iterative process.

2. Assigns each data point to its closest k-center; the data points near a
particular k-center form a cluster.
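Below is a minimal sketch of one iteration of these two steps in plain
NumPy, for illustration only; the lab itself uses scikit-learn's KMeans,
and the array names x (n points, d features) and centroids (k centroids)
are assumptions for this sketch.

import numpy as np

def kmeans_step(x, centroids):
    # Task 2: assign each point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Task 1: move each centroid to the mean of the points assigned to it
    # (a cluster left empty would give NaN here; scikit-learn handles such
    # edge cases internally)
    new_centroids = np.array([x[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

Repeating kmeans_step until the centroids stop moving is the whole
algorithm.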

Q.1) (a) To begin answering this question, we must perform some
preprocessing on our dataset, such as checking for null values, scaling or
normalizing the data, and visualizing the data points as the question asks
by plotting all pairs of the dataset's features. After this, we display
the scatter plots using the Seaborn library and estimate the value of k to
be 5: as we can see in all the plots, the data points fall into roughly 5
clusters, so we can assume k = 5.

The code used to plot this pair plot is given below; the height parameter
(called size in older Seaborn versions) is an int giving the height, in
inches, of each facet:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")
sns.pairplot(df, height=3)
plt.show()
The next code plots the same grid with the additional parameter hue, which
selects the column in the data frame to use for color encoding:

sns.set_style("whitegrid")
sns.pairplot(df, hue=1, height=3)  # hue=1: color by the column labeled 1
plt.show()
(b) Using the scikit-learn library and the value k = 5, we depict the data
points in this part of the question by plotting a scatter plot with the
centroid points using the KMeans class, with a different color for each of
the five labels.

The code used to create the scatter plot is given below.

Explaining the code step by step: the point at which the elbow shape is
created is 5, that is, our K value or optimal number of clusters is 5.
Now let's train the model on the dataset with 5 clusters. The init
argument is the method for initializing the centroids; 'k-means++' spreads
the initial centroids far apart, which usually leads to faster and more
reliable convergence.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300,
                n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(x)

Now y_kmeans gives us the cluster corresponding to each row of x. Let's
plot all the clusters using matplotlib. The scatter() function plots one
dot for each observation, and legend() places a legend (an area describing
the elements of the graph) on the axes.

plt.figure(figsize=(10, 5))
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s=100, c='orange', label=11.2)
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s=100, c='red', label=12.0)
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s=100, c='green', label=12.8)
plt.scatter(x[y_kmeans == 3, 0], x[y_kmeans == 3, 1], s=100, c='black', label=13.6)
plt.scatter(x[y_kmeans == 4, 0], x[y_kmeans == 4, 1], s=100, c='yellow', label=14.4)

# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=100, c='grey', label='Centroids')
plt.legend()

(c) In this part, we use the elbow method to find the optimal K value. In
the elbow method, we vary the number of clusters K from 1 to 10. For each
value of K, we calculate the WCSS (Within-Cluster Sum of Squares), which
is the sum of the squared distances between each point and the centroid of
its cluster. When we plot the WCSS against the K value, the plot looks
like an elbow: as the number of clusters increases, the WCSS decreases,
and it is largest when K = 1. Analyzing the graph, we can see that it
changes rapidly up to a point, creating the elbow shape; from this point
on, the graph moves almost parallel to the X-axis. The K value
corresponding to this point is the optimal K value, i.e., the optimal
number of clusters.

Having imported the dataset above, we now slice out the important features:

df                                  # display the dataframe
x = df.iloc[:, [1, 5, 13]].values   # select columns 1, 5 and 13 as a NumPy array
x                                   # display the sliced array

Next, we have to find the optimal K value for clustering the data, using
the elbow method:

from sklearn.cluster import KMeans

wcss_list = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=45)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)  # inertia_ is the WCSS for this K

The init argument is the method for initializing the centroids. Having
calculated the WCSS value for each K value, we now plot the WCSS against K:

plt.plot(range(1, 11), wcss_list)
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.show()
The point at which the elbow shape is created is 2; that is, our K value
or optimal number of clusters is 2. Now let's train the model on the
dataset with 2 clusters.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, max_iter=300, n_init=10, random_state=40)
y_predict = kmeans.fit_predict(x)

We use the fit_predict method, which returns, for each observation, the
cluster it belongs to. These cluster assignments come back as a single
vector, here called y_predict, giving us the cluster corresponding to each
row of x. We can quickly sanity-check these assignments (see the snippet
below) and then plot all the clusters using matplotlib.
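As a quick check (a small illustrative addition, assuming NumPy is
available as np), np.unique can count how many points fell into each
cluster:

import numpy as np

labels, counts = np.unique(y_predict, return_counts=True)  # cluster ids and sizes
print(dict(zip(labels, counts)))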

plt.figure(figsize=(10, 5))
plt.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='green', label='cluster1')
plt.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='red', label='cluster2')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=100, c='black', label='Centroids')

plt.legend()
As you can see, there are 2 clusters in total, visualized in different
colors, with the centroid of each cluster shown in black.

Google Colab file link -

https://colab.research.google.com/drive/1TZRxGVUfWGiYiirMPIBFByDcT7j-Rxgw#scrollTo=Di0u81sUHJpB
