Lecture 11 K Means Clustering


K-Means Clustering

 K-means clustering is an unsupervised machine learning technique in which we cluster data points based on their similarity or closeness. How exactly do we cluster them?

Definition: The algorithm groups data points based on their similarity or closeness to each other. In simple terms, it must find the data points whose values are similar to each other; those points then belong to the same cluster.
OR
The K-means clustering algorithm tries to group similar items in the form of clusters. The number of groups is represented by K.
‘Distance Measure’ - ‘Euclidean Distance’

Observations that are closer or more similar to each other have a low Euclidean distance and are therefore clustered together.
The k-means algorithm uses the concept of a centroid to create ‘k clusters.’
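The Euclidean distance used as the measure can be sketched in a few lines of Python (the function name is my own, not from the lecture):

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two similar observations, e.g. (170, 56) and (168, 60), have a low distance:
print(euclidean_distance((170, 56), (168, 60)))  # sqrt(2^2 + 4^2), about 4.47
```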
Steps in K-Means:
Step 1: Choose a value of k, e.g., k = 2.
Step 2: Initialize the centroids randomly.
Step 3: Calculate the Euclidean distance from each centroid to every data point, and form clusters of the points closest to each centroid.
Step 4: Compute the centroid of each cluster and update the centroids.
Step 5: Repeat from Step 3.
Each time clusters are formed, the centroids are updated; the updated centroid is the center (mean) of all points that fall in the cluster. This process continues until the centroids no longer change, i.e., the solution converges.
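The steps above can be sketched in plain Python (a simplified illustration, not an optimized implementation; the function and variable names are my own):

```python
import random

def k_means(points, k, max_iters=100):
    """Step 2: random centroids; Steps 3-5: assign, update, repeat until converged."""
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step 3: form clusters by assigning each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Step 4: the updated centroid is the mean of all points in the cluster.
        updated = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[j]
                   for j, cl in enumerate(clusters)]
        if updated == centroids:   # centroids no longer change: converged
            break
        centroids = updated
    return centroids, clusters

random.seed(1)
cents, cls = k_means([(1, 1), (2, 1), (1, 2), (9, 9), (10, 9), (9, 10)], k=2)
```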
Example: Suppose you go to a vegetable shop to buy some vegetables. There you will see different kinds of vegetables, and one thing you will notice is that they are arranged in groups by type: all the carrots are kept in one place, the potatoes are kept with their kind, and so on. You will find that they form groups, or clusters, where each vegetable is kept within its own kind of group.

How Does the K-means clustering algorithm work?


K-means clustering tries to group similar kinds of items in the form of clusters. It finds the similarity between the items and groups them into clusters. The K-means clustering algorithm works in three steps. Let’s see what these three steps are.
1. Select the k values.
2. Initialize the centroids.
3. Select the group and find the average.
Let us understand the above steps with the help of a figure, because a good picture is better than a thousand words.
How to choose the value of K?
 If we choose the value of k randomly, it may or may not be right.
 The wrong k value will directly affect your model performance.
 So there are two methods by which you can select the right value of k.
1. Elbow Method.
2. Silhouette Method.
Elbow Method
 It is an empirical method to find the best value of k. It picks a range of candidate values and takes the best among them.
 For each candidate k, it calculates the within-cluster sum of squares (WCSS): the sum of squared distances from each point to its cluster centroid.

 When the value of k is 1, the within-cluster sum of squares will be high. As the value of k increases, the within-cluster sum of squares will decrease.
Finally, we plot a graph of k values against the within-cluster sum of squares to choose k. Examining the graph carefully, at some point the curve stops falling steeply and flattens out; that bend (the "elbow") is taken as the value of k.
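The elbow procedure can be sketched as follows (a rough illustration on made-up toy data; the helper names are my own):

```python
import random

def wcss(points, centroids):
    """Within-cluster sum of squares: each point scored against its nearest centroid."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

def fit_centroids(points, k, iters=50):
    """A tiny k-means loop, just enough to score each candidate k."""
    cents = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, cents[j])))
            clusters[j].append(p)
        cents = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else cents[j]
                 for j, cl in enumerate(clusters)]
    return cents

random.seed(0)
data = [(x, y) for x in (0, 1, 10, 11, 20, 21) for y in (0, 1)]  # three tight groups
scores = {k: wcss(data, fit_centroids(data, k)) for k in range(1, 6)}
# Plotting k against scores[k], WCSS is largest at k = 1 and drops as k grows;
# the bend where the curve flattens suggests the value of k.
```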

Silhouette Method
 Like the elbow method, it picks a range of k values and draws the silhouette graph.
 It calculates the silhouette coefficient of every point.
 It calculates the average distance from a point to the other points within its own cluster, a(i), and the average distance from the point to the points of its next closest cluster, b(i).

Note: For a well-clustered point, a(i) must be less than b(i); ideally a(i) << b(i).

Now that we have the values of a(i) and b(i), we calculate the silhouette coefficient using the formula s(i) = (b(i) - a(i)) / max(a(i), b(i)).
 The silhouette coefficient ranges from -1 to 1. A coefficient of -1 is the worst-case scenario. Observe the plot and check which value of k gives coefficients closest to 1.
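The per-point computation can be sketched as follows (the function name is my own; the formula is the standard silhouette coefficient):

```python
def silhouette(point, own_cluster, other_clusters):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)), using mean Euclidean distances."""
    dist = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    others = [q for q in own_cluster if q != point]
    a = sum(dist(point, q) for q in others) / len(others)   # a(i): mean intra-cluster distance
    b = min(sum(dist(point, q) for q in cl) / len(cl)       # b(i): mean distance to the next
            for cl in other_clusters)                       # closest cluster
    return (b - a) / max(a, b)

s = silhouette((0, 0), [(0, 0), (0, 1), (1, 0)], [[(10, 10), (10, 11)]])
# Here a(i) = 1.0 and b(i) is about 14.5, so s is close to 1: the point is well placed.
```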

Advantages of K-means
1. It is very simple to implement.
2. It is scalable to huge data sets and fast on large datasets.
3. It adapts to new examples very easily.
4. It generalizes to clusters of different shapes and sizes.
Disadvantages of K-means
1. It is sensitive to outliers.
2. Choosing the value of k manually is a tough job.
3. As the number of dimensions increases, its scalability decreases.
Example
Numerical – Using the K-means clustering algorithm, form two clusters for the given data.

Height  Weight
185     72
170     56
168     60
179     68
182     72
188     77
180     71
180     70
183     84
180     88
180     67
177     76

Note: As per the question we need to form 2 clusters, so we take the first two data points of our data and assign them as the centroids of the two clusters.
 Now we need to assign every remaining data point to one of these clusters based on a Euclidean distance calculation:

d = sqrt((X0 - Xc)^2 + (Y0 - Yc)^2)

 Here (X0, Y0) is our data point and (Xc, Yc) is the centroid of a particular cluster. Let us consider the next data point, i.e. the 3rd data point (168, 60), and check its distance to the centroid of each cluster.

 From the calculations we can see that the 3rd data point (168, 60) is closer to k2 (cluster 2), so we assign it to k2. After that we modify the centroid of k2 using the old centroid values and the new data point we just assigned to it.

 After this centroid calculation we get the new centroid value for k2 as (169, 58), while the k1 centroid value remains the same, since NO new data point was added to that cluster (k1). We repeat the above procedure until all data points have been processed.
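The sequential assign-and-update procedure of this worked example can be sketched as follows (the function and variable names are my own):

```python
def sequential_kmeans(points, seeds):
    """Assign each point in turn to its nearest centroid, then immediately
    recompute that centroid as the mean of its cluster, as in the example."""
    clusters = [[s] for s in seeds]   # the first two data points seed the clusters
    cents = list(seeds)
    for p in points:
        j = min(range(len(cents)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, cents[j])))
        clusters[j].append(p)
        cents[j] = tuple(sum(v) / len(clusters[j]) for v in zip(*clusters[j]))
    return cents, clusters

data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]
cents, clusters = sequential_kmeans(data[2:], seeds=[data[0], data[1]])
# After the 3rd point (168, 60) is assigned, k2's centroid becomes (169, 58),
# matching the calculation above; every later point falls into k1.
```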
CW

The locations (x, y) of the points are given below. Cluster the given data points into three clusters.

A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
