LAB REPORT 6 - B21CI014
Prof. Yashashwi Verma
K-means Introduction
K-means clustering is one of the simplest and most widely used unsupervised
machine learning algorithms. The target number k specifies how many centroids
are required for the dataset. A centroid represents the center of a cluster
and can be either an actual data point or an imaginary location. The word
"means" in K-means refers to averaging the data, that is, finding the
centroid. In data mining, the K-means technique begins with an initial set of
randomly chosen centroids, which are used as the starting points of each
cluster, and then performs iterative (repeated) calculations to optimize the
positions of the centroids.
“It is an iterative algorithm that divides the unlabeled dataset
into k different clusters in such a way that each data point
belongs to only one group of points with similar properties.”
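To make the assignment/update loop concrete, here is a toy sketch of the iterations in plain NumPy (the array X, the value k = 3, and the random seed are assumptions for illustration, not part of the report):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # hypothetical (n_samples, 2) data
k = 3
# start from k randomly chosen data points as the initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):  # iterative refinement
    # assignment step: label each point with its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: move each centroid to the mean of its cluster
    # (empty clusters are ignored here for brevity)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])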
And the code used to plot the pair-plot grid, in which the size
parameter is the int giving the height (in inches) of each facet,
is given below-
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")
# 'size' was renamed to 'height' in seaborn 0.9+
sns.pairplot(df, height=3)
plt.show()
And the next code, in which the new parameter hue names the column
in the data frame to use for color encoding, is given below-
sns.set_style("whitegrid")
# hue must name a column in df; a column labelled 1 is assumed here
sns.pairplot(df, hue=1, height=3)
plt.show()
(b) Using the scikit-learn library with k = 5, depict the data
points in this section by plotting a scatter plot with the
centroid points obtained from K-means, using a different color for
each of the five labels.
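The fitting step itself is not shown in the report; a minimal sketch, assuming x holds the prepared feature array:

from sklearn.cluster import KMeans

# hypothetical fit for part (b): k = 5 clusters on the feature array x
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(x)  # cluster index (0-4) for each sample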
And the code used to create the scatter plot is given below-
plt.figure(figsize=(10, 5))
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s=100, c='orange', label='11.2')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s=100, c='red', label='12.0')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s=100, c='green', label='12.8')
plt.scatter(x[y_kmeans == 3, 0], x[y_kmeans == 3, 1], s=100, c='black', label='13.6')
plt.scatter(x[y_kmeans == 4, 0], x[y_kmeans == 4, 1], s=100, c='yellow', label='14.4')
# centroid points, as required by part (b) (assumes kmeans is the fitted model)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='blue', label='centroids')
plt.legend()
plt.show()
(c) In this part, we use the elbow method to find the optimal k
value. In the elbow method, we vary the number of clusters (K)
from 1 to 10 and, for each value of K, calculate the WCSS
(Within-Cluster Sum of Squares): the sum of the squared distances
between each point and the centroid of its cluster. When WCSS is
plotted against K, the plot looks like an elbow. As the number of
clusters increases, the WCSS value decreases; it is largest when
K = 1. Analyzing the graph, we can see that the curve changes
rapidly at one point, creating the elbow shape, and from that
point onward it runs almost parallel to the X-axis. The K value
corresponding to this point is the optimal K value, i.e., the
optimal number of clusters.
df  # inspect the dataframe
x = df.iloc[:, [1, 5, 13]].values  # select three feature columns as a NumPy array
x  # inspect the selected feature array
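The elbow computation itself is not shown above; a minimal sketch, assuming x is the feature array selected here (inertia_ is scikit-learn's name for WCSS):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# compute WCSS for K = 1..10
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(x)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()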
We use the fit_predict method, which returns, for each
observation, the cluster it belongs to. It collects these cluster
numbers into a single vector, here called y_predict, which gives
the cluster assignment corresponding to each row of x. The fit
step is not shown in the report; a minimal sketch, assuming k = 2
to match the two clusters plotted below:
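from sklearn.cluster import KMeans

# hypothetical fit: k = 2 clusters, matching the plot that follows
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
y_predict = kmeans.fit_predict(x)  # cluster index (0 or 1) for each row of x

Now let's plot all the clusters using matplotlib.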
plt.figure(figsize=(10, 5))
plt.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='green', label='cluster1')
plt.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='red', label='cluster2')
# centroids in black, as described below (assumes kmeans is the fitted model)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='black', label='centroids')
plt.legend()
plt.show()
As you can see, there are two clusters in total, visualized in
different colors, and the centroid of each cluster is shown in
black.