CLUSTERING
Clustering: Introduction
Clustering is the task of dividing the population or data points into a number of groups such
that data points in the same group are more similar to each other than to data points in other
groups. In simple words, the aim is to segregate groups with similar traits and assign them into
clusters.
Let’s understand this with an example. Suppose you are the head of a rental store and wish to
understand the preferences of your customers to scale up your business. Is it possible for you to
look at the details of each customer and devise a unique business strategy for each one of them?
Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based
on their purchasing habits and use a separate strategy for the customers in each of these 10
groups. This is what we call clustering.
Now that we understand what clustering is, let’s take a look at the types of clustering.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups :
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or it
does not. For example, in the scenario above each customer is put into exactly one of the 10
groups.
Soft Clustering: In soft clustering, instead of putting each data point into exactly one cluster, a
probability or likelihood of that data point belonging to each cluster is assigned. For example,
in the scenario above each customer is assigned a probability of belonging to each of the 10
clusters of the retail store.
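To make the distinction concrete, here is a minimal sketch in Python (assuming scikit-learn and NumPy are installed; the two-dimensional "customer" data are made up for illustration). KMeans returns a single cluster label per point (hard clustering), while a Gaussian mixture model returns a probability for every cluster (soft clustering).

# Hard vs. soft clustering on toy customer data (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# 60 synthetic customers described by two purchasing features
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in ([0, 0], [3, 3], [0, 4])])

# Hard clustering: every customer receives exactly one cluster label
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(hard_labels[:5])                     # one integer label per customer

# Soft clustering: every customer receives a probability for each cluster
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict_proba(X)[0].round(3))    # probabilities over the 3 clusters, summing to 1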
K-Means Clustering
K-means is an iterative algorithm that partitions the data into a chosen number of clusters, K. It works in the following steps:
1. Specify the desired number of clusters K: Here we choose K = 2 for the 5 data points in the example.
2. Randomly assign each data point to a cluster: Let’s assign three points to cluster 1, shown
using red color, and two points to cluster 2, shown using grey color.
3. Compute cluster centroids: The centroid of the data points in the red cluster is shown using a
red cross, and that of the grey cluster using a grey cross.
4. Re-assign each point to the closest cluster centroid: Note that only the data point at the
bottom changes: it was assigned to the red cluster even though it is closer to the centroid of the
grey cluster. Thus, we re-assign that data point to the grey cluster.
5. Re-compute cluster centroids: Now, re-compute the centroids for both clusters.
6. Repeat steps 4 and 5 until no improvements are possible: We repeat the 4th and 5th steps
until the cluster assignments converge (K-means reaches a local optimum, not necessarily the
global one). When no data point switches between the two clusters for two successive
iterations, the algorithm terminates, unless a different stopping criterion is explicitly specified.
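To tie the steps together, here is a minimal from-scratch sketch in Python (assuming NumPy; the function name, the toy data and the choice of k are illustrative, not part of the original notes):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means: random assignment, centroid update, re-assignment,
    repeated until the labels stop changing (steps 2-6 above)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))               # step 2: random assignment
    centroids = None
    for _ in range(max_iters):
        # steps 3 / 5: centroid of each cluster (re-seed an empty cluster with a random point)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))]
                              for j in range(k)])
        # step 4: re-assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):              # step 6: no point switched clusters
            break
        labels = new_labels
    return labels, centroids

# Example: two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9]])
labels, centroids = kmeans(X, k=2)
print(labels)        # e.g. [0 0 0 1 1] (label numbering may differ)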
Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters.
The algorithm starts with every data point assigned to a cluster of its own. The two nearest
clusters are then merged into a single cluster. The algorithm terminates when only one cluster is
left.
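As a minimal sketch of this bottom-up merging in Python (assuming SciPy and NumPy are installed; the 2-D data points are made up for illustration), scipy.cluster.hierarchy.linkage starts with every point as its own cluster and repeatedly merges the two closest clusters until only one remains:

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
# two well-separated groups of five 2-D points each
X = np.vstack([rng.normal([0, 0], 0.3, size=(5, 2)),
               rng.normal([4, 4], 0.3, size=(5, 2))])

# Each row of Z records one merge: (cluster i, cluster j, merge distance, new cluster size)
Z = linkage(X, method="single", metric="euclidean")
print(Z)    # 9 merge steps for 10 points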
The results of hierarchical clustering can be shown using a dendrogram. The dendrogram can be
interpreted as follows:
At the bottom, we start with 25 data points, each assigned to its own cluster. The two closest
clusters are then merged repeatedly until just one cluster remains at the top. The height in the
dendrogram at which two clusters are merged represents the distance between those two clusters
in the data space.
The number of clusters that best depicts the different groups can be chosen by observing the
dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a
horizontal line that can traverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice for the number of clusters is 4, since the red horizontal
line in the dendrogram below covers the maximum vertical distance AB.
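Continuing the SciPy sketch above (the cut height of 2.0 is an illustrative choice, not taken from the original notes), the dendrogram can be drawn and then cut at a chosen height with fcluster, which returns one flat cluster label per point; the number of vertical lines crossed at that height is the number of clusters obtained:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster

dendrogram(Z)                         # Z is the linkage matrix from the earlier sketch
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()

# Cut the tree at height 2.0: every vertical line crossed at that height
# becomes one flat cluster.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)                         # e.g. [1 1 1 1 1 2 2 2 2 2]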
Two important things that you should know about hierarchical clustering are:
This algorithm has been described above using the bottom-up (agglomerative) approach. It is
also possible to follow a top-down (divisive) approach, starting with all data points assigned to
the same cluster and recursively performing splits until each data point forms a separate cluster.
The decision to merge two clusters is taken on the basis of the closeness of these clusters. There
are multiple metrics for measuring the closeness of two clusters:
o Euclidean distance: ||a − b||₂ = √(Σᵢ (aᵢ − bᵢ)²)
o Squared Euclidean distance: ||a − b||₂² = Σᵢ (aᵢ − bᵢ)²
o Manhattan distance: ||a − b||₁ = Σᵢ |aᵢ − bᵢ|
o Maximum distance: ||a − b||∞ = maxᵢ |aᵢ − bᵢ|
o Mahalanobis distance: √((a − b)ᵀ S⁻¹ (a − b)), where S is the covariance matrix
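As a short sketch (assuming NumPy and SciPy; the vectors a and b and the covariance matrix are illustrative), each of these metrics can be computed as follows:

import numpy as np
from scipy.spatial.distance import mahalanobis

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean    = np.sqrt(np.sum((a - b) ** 2))   # ||a-b||2
sq_euclidean = np.sum((a - b) ** 2)            # ||a-b||2 squared
manhattan    = np.sum(np.abs(a - b))           # ||a-b||1
maximum      = np.max(np.abs(a - b))           # ||a-b||infinity (Chebyshev)

# Mahalanobis distance needs the inverse covariance matrix S^-1 of the data;
# the identity matrix is used here purely for illustration.
S_inv = np.linalg.inv(np.eye(3))
mahal = mahalanobis(a, b, S_inv)

print(euclidean, sq_euclidean, manhattan, maximum, mahal)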
*****END******