Lecture 18: K-Means Clustering
Clustering is the task of dividing a population of data points into groups such that data points in the same group are more similar to each other than to data points in other groups.
(Figure: a taxonomy of clustering approaches, including hierarchical clustering and k-means clustering.)
K-MEANS CLUSTERING
◾ Let us assume that we have a dataset.
◾ The scatter plot is shown in the figure.
◾ We want to find the clusters in the data.
◾ At first glance, we can see that there are three clusters in the data (a toy dataset like this is sketched in code below).
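◾ As an illustration only: the lecture's actual dataset is not given, so in the sketch below scikit-learn's make_blobs stands in for it; the sample count, spread, and seed are assumptions.

```python
# Sketch only: the lecture's dataset is not provided, so a synthetic
# stand-in with three well-separated groups is generated here.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=20)
plt.title("Toy dataset with three visible clusters")
plt.show()
```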
STEPS FOR K-MEANS CLUSTERING
◾ Can you visually identify the number of clusters in this dataset? (Not very easy!)
◾ Let us assume that we have identified the optimal number of clusters to be 2.
◾ Let us assume that we select the red and blue points as the centroids.
◾ We know from geometry that the points on the green line are equidistant from the red and blue centroids.
◾ Now it becomes clear which points will belong to cluster 1 and which to cluster 2.
◾ "Closest centroid" is a relative term.
◾ We are using Euclidean distance here, but in other scenarios other distance measures may be more appropriate (see the sketch of the assignment step below).
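◾ A minimal sketch of this assignment step, assuming Euclidean distance and NumPy arrays; the sample points and centroids below are hypothetical.

```python
import numpy as np

def assign_to_nearest_centroid(X, centroids):
    """Return, for each point, the index of its closest centroid.

    Euclidean distance is used here, as in the lecture; another
    distance measure could be substituted where more appropriate."""
    # distances[i, j] = Euclidean distance from point i to centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Hypothetical points and two centroids (the "red" and "blue" points)
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.5], [8.5, 9.0]])
print(assign_to_nearest_centroid(X, centroids))  # [0 0 1 1]
```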
◾ Compute the new centroid for each cluster by computing the average (center of gravity) of all the points in that cluster, excluding the centroid itself.
◾ New centroids have been assigned (a sketch of this update step follows).
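◾ A matching sketch of the update step: each new centroid is the mean (center of gravity) of the points currently assigned to its cluster. The data below is hypothetical, and empty-cluster handling is omitted for brevity.

```python
import numpy as np

def update_centroids(X, labels, k):
    """Recompute each centroid as the mean (center of gravity) of the
    points currently assigned to that cluster.

    Note: a cluster that ends up empty would yield NaN here; real
    implementations handle that case, this sketch does not."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
print(update_centroids(X, labels, k=2))  # [[1.25 1.5 ] [8.5  8.75]]
```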
◾ So, if we plot a line through the scatter plot, we can see that three data points are in the wrong cluster.
◾ Now, we will recolor those three points to assign them to the correct cluster.
◾ Since some reassignment has taken place, we go back to step 4.
◾ Compute the center of gravity for the new clusters.
◾ The new centroids have been assigned.
◾ We again draw the line to check whether any data points are in the wrong cluster.
◾ We see that there is only one point in the wrong cluster.
◾ That point has been reassigned to the blue cluster.
◾ Next, we need to recompute the centroids.
◾ The centroids have been relocated.
◾ Now only one point needs to be reassigned.
◾ The data point gets reassigned.
◾ Compute the new centroids for the clusters.
◾ This time, we do not need to reassign any data points.
◾ The algorithm has converged (a complete sketch of the loop follows below).
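◾ Putting the steps together, here is one possible minimal k-means loop. This is a sketch, not a production implementation: k-means++ seeding and empty-cluster handling are omitted.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch following the lecture's steps:
    initialize centroids, assign each point to its closest centroid,
    recompute centroids, and repeat until no assignment changes."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no reassignment took place: the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```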
◾ From the initial data points to the clustered output.
◾ Right away we can tell which three clusters will be formed.
◾ Even if we move the centroids around a little, nothing is going to change.
◾ These are the clusters we are going to end up with.
◾ Again, we will go through the steps of k-means clustering.
◾ This time, the initial random selection of centroids is not as good as before.
◾ The three clusters will be formed as follows.
◾ Recompute the centroids.
◾ Now, no data point will be reassigned.
◾ The algorithm has converged (see the note on initialization below).
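◾ Because the final clusters can depend on the initial centroids, libraries typically rerun k-means from several random starts and keep the best run. As a sketch on hypothetical data, scikit-learn's KMeans does this via its n_init parameter (with k-means++ seeding by default):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)  # toy data

# n_init runs the algorithm from several initializations and keeps the
# run with the lowest total within-cluster squared distance (inertia_).
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.labels_[:10], km.inertia_)
```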
EVALUATING THE CLUSTERING ALGORITHM
◾ Suppose k-means gives us two candidate solutions (solution 1 and solution 2) for the same data. How do we judge which is better?
◾ We use the within-cluster sum of squares (WCSS). For a solution with three clusters:

$$\mathrm{WCSS} = \sum_{P_i \in \mathrm{Cluster\,1}} \mathrm{dist}(P_i, C_1)^2 + \sum_{P_i \in \mathrm{Cluster\,2}} \mathrm{dist}(P_i, C_2)^2 + \sum_{P_i \in \mathrm{Cluster\,3}} \mathrm{dist}(P_i, C_3)^2$$
◾ where C1, C2, and C3 are the centroids of cluster 1, cluster 2, and cluster 3 respectively, and Pi is the ith data point in the respective cluster.
◾ The WCSS is a good metric for comparing the solutions obtained using different values of k for the k-means clustering algorithm (a code sketch follows).
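◾ A direct translation of the formula above into code might look as follows. This sketch generalizes to any number of clusters; the points, labels, and centroids below are hypothetical.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: for each cluster j, sum the squared
    Euclidean distances from its points Pi to its centroid Cj."""
    return sum(np.sum((X[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])
print(wcss(X, labels, centroids))  # 2.25: every point sits near its centroid
```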
◾ Let us see how the WCSS metric changes with different values of k.
◾ Here is the solution with k=1.
◾ When we compute the WCSS we get quite a large value, since the single centroid is far from many of the data points Pi, so the distances between the centroid and the data points are large.
◾ Let us now increase the number of clusters to 2 and see how the WCSS changes.
◾ Now that we have two centroids, distances are computed within each cluster and no longer need to reach all the way to the middle of the whole dataset.
◾ We can see that the value of WCSS decreases compared to when we had only one centroid, i.e., one cluster.
◾ Now we increase the number of clusters to 3.
◾ There is no change in cluster 1, so no change in the distances for cluster 1.
◾ The distances in clusters 2 and 3 will decrease compared to when there were only two clusters.
◾ What is the upper limit on the number of clusters in the k-means algorithm?
◾ The maximum number of clusters is equal to the number of data points.
◾ If we reach that maximum number of clusters, the WCSS will reach a value of zero (verified in the sketch below).
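◾ This limiting case is easy to check in code. With as many clusters as data points, every point is its own centroid, so the WCSS (exposed by scikit-learn as inertia_) comes out as zero; the data below is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(10, 2))  # ten arbitrary points

# One cluster per data point: each point is its own centroid, so every
# within-cluster distance, and hence the WCSS, is zero.
km = KMeans(n_clusters=len(X), n_init=1).fit(X)
print(km.inertia_)  # 0.0
```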
◾ The chart above shows the WCSS value as k, i.e., the number of clusters, increases.
◾ We can see that the WCSS starts off with a high value and then decreases substantially as k increases.
◾ For example, when k increases from 1 to 2, the decrease on the y-axis is 8000 - 3000 = 5000 units.
◾ When k increases from 2 to 3, the decrease is 3000 - 1000 = 2000 units.
◾ When k increases from 3 to 4, the decrease is only 1000 - 700 = 300 units.
◾ We can see that the change in WCSS is large in the beginning and small at the end.
◾ So, we can use the elbow method to determine the optimal number of clusters: look for the point where the curve bends like an elbow, i.e., where adding another cluster no longer reduces the WCSS substantially; the value of k at that point is a good choice (a code sketch follows).
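◾ A sketch of the elbow method, assuming scikit-learn and a toy dataset: fit k-means for a range of k values, record each WCSS (inertia_), and plot the curve to spot the elbow.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data

# Fit k-means for k = 1..10 and record the WCSS (inertia_) of each fit.
ks = range(1, 11)
wcss_values = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in ks]

plt.plot(list(ks), wcss_values, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.title("Elbow method: pick k where the curve bends")
plt.show()
```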