Lecture 11: K-Means Clustering
K-means clustering is an unsupervised machine learning technique in which we cluster data points based on
their similarity or closeness to one another. How exactly do we cluster them?
Definition: It groups the data points based on their similarity or closeness to each other. In simple terms, the
algorithm finds data points whose values are similar to each other; these points then belong to the same cluster.
OR
The k-means clustering algorithm tries to group similar items into clusters. The number of groups is
represented by k.
‘Distance Measure’ - ‘Euclidean Distance’
Observations that are closer or more similar to each other have a low Euclidean distance and are
clustered together.
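For reference, for two points p = (p1, …, pn) and q = (q1, …, qn), the Euclidean distance is
d(p, q) = √((p1 − q1)² + … + (pn − qn)²).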
The k-means algorithm uses the concept of a centroid to create k clusters.
Steps in K-Means:
Step 1: Choose a value for k (for example, k = 2).
Step 2: Initialize the centroids randomly.
Step 3: Calculate the Euclidean distance from each centroid to every data point and form clusters by assigning each point to its closest centroid.
Step 4: Find the centroid of each cluster and update the centroids.
Step 5: Repeat steps 3 and 4.
Each time clusters are formed, the centroids are updated; the updated centroid is the center (mean) of all the points
that fall in the cluster. This process continues until the centroids no longer change, i.e. the solution converges.
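As an illustration, here is a minimal from-scratch sketch of these steps in Python (NumPy assumed; the function and variable names are our own, not from the lecture):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: update each centroid to the mean of its assigned points
        # (this sketch assumes no cluster ever ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer change (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids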
Example: Suppose you go to a vegetable shop to buy some vegetables. There you will see different kinds of
vegetables, and one thing you will notice is that they are arranged in groups by type: all the carrots are kept
in one place, the potatoes are kept with their kind, and so on. Each vegetable is kept with others of its kind,
forming a group, or cluster.
Elbow Method
When the value of k is 1, the within-cluster sum of squares will be high. As the value of k increases, the
within-cluster sum of squares decreases.
Finally, we plot a graph of the k values against the within-cluster sum of squares to choose k. Examining the
graph carefully, at some point the curve bends sharply (the "elbow") and the decrease slows; that point is
taken as the value of k.
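A minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available (KMeans exposes the within-cluster sum of squares as inertia_):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_max=10):
    wcss = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss.append(km.inertia_)  # within-cluster sum of squares for this k
    # plot WCSS against k and look for the bend ("elbow") in the curve
    plt.plot(range(1, k_max + 1), wcss, marker="o")
    plt.xlabel("k")
    plt.ylabel("Within-cluster sum of squares")
    plt.show()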
Silhouette Method
Like the elbow method, it also picks a range of k values and draws the silhouette graph.
It calculates the silhouette coefficient of every point.
For each point i, it calculates a(i), the average distance from the point to the other points within its own cluster,
and b(i), the average distance from the point to the points in its next closest cluster.
Note: For a good clustering, a(i) must be less than b(i), that is a(i) << b(i).
Now that we have the values of a(i) and b(i), we calculate the silhouette coefficient with the formula
s(i) = (b(i) − a(i)) / max(a(i), b(i)).
The silhouette coefficient ranges from -1 to 1; a coefficient equal to -1 is the worst-case scenario, and values
near 1 indicate well-separated clusters.
Observe the plot and check which of the k values gives a coefficient closest to 1.
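A minimal sketch of the silhouette method using scikit-learn (assumed installed; silhouette_score returns the mean s(i) over all points, and requires k ≥ 2):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values=range(2, 11)):
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # average silhouette coefficient
    # pick the k whose average silhouette coefficient is closest to 1
    return max(scores, key=scores.get), scores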
Advantages of K-means
1. It is very simple to implement.
2. It scales to huge datasets and is fast on large datasets.
3. It adapts easily to new examples.
4. It generalizes to clusters of different shapes and sizes.
Disadvantages of K-means
1. It is sensitive to outliers.
2. Choosing the value of k manually is a tough job.
3. As the number of dimensions increases, its scalability decreases.
Example
Numerical – Using the k-means clustering algorithm, form two clusters for the given data.
Height Weight
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
Note: As per the question, we need to form 2 clusters, so we take the first two data points of our data and assign
them as the centroid of each cluster.
Now we need to assign each remaining data point of our data to one of these clusters based on the Euclidean distance
d = √((X0 − Xc)² + (Y0 − Yc)²),
where (X0, Y0) is our data point and (Xc, Yc) is the centroid of a particular cluster. Let us consider the next data
point, i.e. the 3rd data point (168, 60), and check its distance to the centroid of each cluster:
Distance to k1 (185, 72): √((185 − 168)² + (72 − 60)²) = √(289 + 144) = √433 ≈ 20.8
Distance to k2 (170, 56): √((170 − 168)² + (56 − 60)²) = √(4 + 16) = √20 ≈ 4.5
From these calculations we can see that the 3rd data point (168, 60) is closer to k2 (cluster 2), so we assign it to k2.
After that we need to update the centroid of k2 using the old centroid value and the new data point we just
assigned to it:
new k2 centroid = ((170 + 168)/2, (56 + 60)/2) = (169, 58).
The k1 centroid value remains the same, as no new data point has been added to that cluster. We repeat the above
procedure until all data points have been assigned.
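A minimal sketch of this sequential procedure in Python (names are our own): the first two points seed the centroids, every later point joins the nearer cluster, and that cluster's centroid is recomputed as the mean of its points.

import math

data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

clusters = [[data[0]], [data[1]]]      # k1 and k2 seeded with the first two points
centroids = [data[0], data[1]]

for point in data[2:]:
    # assign the point to the cluster with the nearer centroid
    j = min(range(2), key=lambda i: euclidean(point, centroids[i]))
    clusters[j].append(point)
    # update that centroid to the mean of all points now in the cluster
    n = len(clusters[j])
    centroids[j] = (sum(p[0] for p in clusters[j]) / n,
                    sum(p[1] for p in clusters[j]) / n)

print(centroids)  # final centroids after a single pass over the data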
CW
The locations (x, y) of the points are given below. Cluster the given data points into three clusters.
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).