13 Unsupervised Learning
Partitioning methods
• Two of the most important algorithms for partitioning-based clustering are k-means
and k-medoids.
• In the k-means algorithm, each cluster is represented by a prototype called the
centroid, which is normally the mean of the points in the cluster.
• Similarly, the k-medoids algorithm identifies the medoid, which is the most
representative point for a group of points. We can also infer that in most cases the
centroid does not correspond to an actual data point, whereas the medoid is always an
actual data point, as the sketch below illustrates. Let us discuss both these algorithms in detail.
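To make the distinction concrete, here is a minimal sketch (the sample values are made up for illustration, not taken from the source) that computes both a centroid and a medoid for a tiny 1-D data set:

```python
# Minimal centroid-vs-medoid illustration; assumes NumPy is available.
import numpy as np

points = np.array([1.0, 2.0, 3.0, 5.0])

# Centroid: the mean of the points -- usually not an actual data point.
centroid = points.mean()  # 2.75, which is not a member of the data set

# Medoid: the existing point with the smallest total distance to all
# other points -- always an actual member of the data set.
pairwise = np.abs(points[:, None] - points[None, :])
medoid = points[np.argmin(pairwise.sum(axis=1))]  # 2.0 (tied with 3.0)

print(f"centroid = {centroid}, medoid = {medoid}")
```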
Strengths and weaknesses of k-means
Elbow method
• This method measures the homogeneity or heterogeneity within the clusters for
various values of ‘K’ and helps in arriving at the optimal ‘K’. These iterations
take significant computational effort, and after a certain point, the gain in
homogeneity no longer justifies the effort required to achieve it, as is evident
from the figure. This point is known as the elbow point, and the ‘K’ value at this
point produces the optimal clustering performance; a short sketch of the method follows.
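A minimal sketch of the elbow method, assuming scikit-learn and NumPy are available (the synthetic three-blob data set and the range of ‘K’ values are illustrative choices, not from the source):

```python
# Compute the within-cluster SSE for several values of K; scikit-learn
# exposes this quantity as the fitted model's inertia_ attribute.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated 2-D blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# The SSE drops sharply up to K = 3 and only marginally after that, so a
# plot of these values bends at K = 3: the elbow point for this data.
```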
• The k-means algorithm is sensitive to outliers in the data set and inadvertently
produces skewed clusters when the means of the data points are used as centroids.
Let us take an example of eight data points; for simplicity, we can consider them
to be 1-D data with values 1, 2, 3, 5, 9, 10, 12, and 25. Point 25 is the outlier, and it
affects the cluster formation negatively when the means of the points are taken as
centroids.
• Because the SSE of the second clustering is lower, k-means tends to put point 9 in the
same cluster as 1, 2, 3, and 5, even though the point is logically nearer to points 10 and
12; the worked comparison below makes this concrete. This skew is introduced by the
outlier point 25, which shifts the mean away from the centre of the cluster.
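The claim can be checked directly. This sketch recomputes the SSE of the two candidate 2-cluster partitions of the example data (the partitions themselves follow the discussion above):

```python
def sse(cluster):
    """Sum of squared distances of the points to their mean (centroid)."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

# Partition 1: point 9 grouped with 10 and 12 (the "logical" grouping).
p1 = [[1, 2, 3, 5], [9, 10, 12, 25]]
# Partition 2: point 9 pulled toward 1, 2, 3, 5.
p2 = [[1, 2, 3, 5, 9], [10, 12, 25]]

for name, partition in [("partition 1", p1), ("partition 2", p2)]:
    print(name, round(sum(sse(c) for c in partition), 2))

# partition 1 -> 174.75, partition 2 -> 172.67: the second SSE is lower,
# so k-means prefers it even though 9 is nearer to 10 and 12. The outlier
# 25 inflates the mean of the right-hand cluster and causes the skew.
```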
k-medoids
• k-medoids provides a solution to this problem. Instead of considering the mean of
the data points in the cluster, k-medoids considers k representative data points from
the existing points in the data set as the centres of the clusters. It then assigns the
data points according to their distance from these centres to form k clusters. Note
that the medoids in this case are actual data points or objects from the data set,
not imaginary points as when the mean of the data points within a cluster
is used as the centroid in the k-means technique. The SSE is then calculated as

SSE = Σ_{i=1..k} Σ_{x ∈ C_i} dist(x, c_i)²

where c_i is the medoid of cluster C_i and dist(x, c_i) is the distance between
point x and the medoid of its cluster. A minimal sketch of this procedure follows.
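Here is a minimal k-medoids sketch on the same example data. It is illustrative only: it exhaustively tries every pair of actual data points as medoids, which is feasible here purely because the data set is tiny (practical implementations such as PAM use iterative swap heuristics instead):

```python
# Naive k-medoids (k = 2) by exhaustive search over medoid pairs.
import itertools

data = [1, 2, 3, 5, 9, 10, 12, 25]

def total_sse(medoids):
    # Each point contributes its squared distance to the nearest medoid.
    return sum(min((x - m) ** 2 for m in medoids) for x in data)

best = min(itertools.combinations(data, 2), key=total_sse)

clusters = {m: [] for m in best}
for x in data:
    nearest = min(best, key=lambda m: (x - m) ** 2)
    clusters[nearest].append(x)

print("medoids:", best)       # (5, 25) -- both are actual data points
print("clusters:", clusters)  # {5: [1, 2, 3, 5, 9, 10, 12], 25: [25]}
```

Because the medoids must be real points, the outlier 25 ends up as a cluster of its own instead of dragging a centroid, and point 9 is no longer split away from 10 and 12.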