Module 5: Cluster Analysis - Part 1

Hierarchical clustering is an unsupervised machine learning algorithm that groups similar data points into clusters. It creates a hierarchy of clusters organized as a tree structure or dendrogram. Agglomerative hierarchical clustering is a bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive hierarchical clustering takes a top-down approach, initially assigning all observations to a single cluster which is then partitioned recursively into smaller clusters. The optimal number of clusters is determined by cutting the dendrogram tree at a level where the horizontal line can traverse maximum distance without intersecting cluster merges.

What is Hierarchical Clustering?

(Help: https://www.javatpoint.com/hierarchical-clustering-in-machine-learning)

What is Clustering?
Clustering is a technique that groups similar objects such that the objects in the same
group are more similar to each other than to the objects in other groups. The group of
similar objects is called a cluster.

Clustered data points

There are 5 popular clustering algorithms that data scientists need to know:
1. K-Means Clustering
2. Hierarchical Clustering
3. Mean-Shift Clustering
4. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
5. Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
Hierarchical Clustering Algorithm
Also called hierarchical cluster analysis or HCA, it is an unsupervised clustering
algorithm which builds clusters that have a predominant ordering from top to bottom.
For example: all files and folders on our hard disk are organized in a hierarchy.
The algorithm groups similar objects into groups called clusters. The endpoint is a set
of clusters or groups, where each cluster is distinct from every other cluster, and the
objects within each cluster are broadly similar to each other.
This clustering technique is divided into two types:
1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering is the most common type of hierarchical
clustering, used to group objects into clusters based on their similarity. It is also known
as AGNES (Agglomerative Nesting). It is a "bottom-up" approach: each observation
starts in its own cluster, and pairs of clusters are merged as one moves up the
hierarchy.
How does it work? (see the code sketch after the figure below)
1. Make each data point a single-point cluster → this forms N clusters.
2. Take the two closest data points and make them one cluster → this forms N-1 clusters.
3. Take the two closest clusters and make them one cluster → this forms N-2 clusters.
4. Repeat step 3 until you are left with only one cluster.
Have a look at the visual representation of Agglomerative Hierarchical Clustering for
better understanding:
Agglomerative Hierarchical Clustering
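As a minimal sketch of the four steps above, assuming scikit-learn and NumPy are installed (the toy 2-D points and the choice of two clusters are made up for illustration):

```python
# Minimal sketch of bottom-up (agglomerative) clustering with scikit-learn.
# The toy 2-D points and n_clusters=2 are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Each point starts as its own cluster; the closest pair of clusters is
# merged repeatedly until only n_clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # two groups: the points near x=1 and the points near x=10
```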

There are several ways to measure the distance between clusters in order to decide the
rules for clustering, and they are often called Linkage Methods. Some of the common
linkage methods are:
 Complete-linkage: the distance between two clusters is defined as the longest distance
between a point in one cluster and a point in the other.
 Single-linkage: the distance between two clusters is defined as the shortest distance
between a point in one cluster and a point in the other. This linkage may be used to detect
outliers in your data set, since outlying points tend to be merged only at the end.
 Average-linkage: the distance between two clusters is defined as the average distance
between each point in one cluster and every point in the other cluster.
 Centroid-linkage: finds the centroid of cluster 1 and the centroid of cluster 2, and then
calculates the distance between the two before merging.
The choice of linkage method is up to you; there is no hard and fast rule that will always
give good results, and different linkage methods lead to different clusters.
The point of doing all this is to show how hierarchical clustering works: it maintains a
memory of how we went through this process, and that memory is stored in the
dendrogram.
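As a sketch, assuming SciPy and Matplotlib are available, that merge history can be computed as a linkage matrix and drawn as a dendrogram; the method argument selects one of the linkage rules listed above:

```python
# Compute the merge history (linkage matrix) and draw it as a dendrogram.
# The random toy data is illustrative; try "single", "complete", "average"
# or "centroid" as the method to see how the tree changes.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.random((8, 2))                     # 8 toy points in 2-D
Z = linkage(X, method="complete")          # each row of Z records one merge
dendrogram(Z, labels=[f"P{i}" for i in range(len(X))])
plt.show()
```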
What is a Dendrogram?
A Dendrogram is a type of tree diagram showing hierarchical relationships between
different sets of data.
As already said, a dendrogram contains the memory of the hierarchical clustering algorithm,
so just by looking at the dendrogram you can tell how each cluster was formed.

Dendrogram

Note:-
1. Distance between data points represents dissimilarities.
2. Height of the blocks represents the distance between clusters.
So you can observe from the above figure that initially P5 and P6, which are closer to
each other than to any other point, are combined into one cluster, followed by P4 being
merged into the same cluster (C2). Then P1 and P2 are combined into one cluster,
followed by P0 being merged into the same cluster (C4). Next P3 is merged into
cluster C2 and finally both clusters are merged into one.
Parts of a Dendrogram

A dendrogram can be a column graph (as in the image below) or a row graph. Some
dendrograms are circular or have a fluid-shape, but the software will usually produce a
row or column graph. No matter what the shape, the basic graph comprises the same
parts:
 The clades are the branches, and they are arranged according to how similar (or dissimilar)
they are. Clades that are close to the same height are similar to each other; clades with different
heights are dissimilar, and the greater the difference in height, the greater the dissimilarity.
 Each clade has one or more leaves.
 Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F.
 Leaves D and E are more similar to each other than they are to leaves A, B, C, or F.
 Leaf F is substantially different from all of the other leaves.
A clade can theoretically have an infinite number of leaves. However, the more leaves
you have, the harder the graph is to read with the naked eye.
One question that might have intrigued you by now is: how do you decide when to
stop merging the clusters?
You cut the dendrogram with a horizontal line at a height where the line can
traverse the maximum distance up and down without intersecting a merging point.
For example, in the figure below, line L3 can traverse the maximum distance up and down
without intersecting the merging points. So we draw a horizontal line there, and the number of
vertical lines it intersects is the optimal number of clusters.

Choosing the optimal number of clusters.

Number of Clusters in this case = 3.
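As a minimal sketch of this cut with SciPy's fcluster (the 1-D toy data and the cut height of 1.0 are made-up assumptions):

```python
# "Cut" the dendrogram at a chosen height to obtain flat cluster labels.
# The 1-D toy data and the cut height of 1.0 are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [1.2], [5.0], [5.3], [9.0], [9.4]])
Z = linkage(X, method="single")
labels = fcluster(Z, t=1.0, criterion="distance")  # horizontal cut at height 1.0
print(labels)  # three clusters for this toy data
```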

Divisive Hierarchical Clustering

Divisive clustering, or DIANA (DIvisive ANAlysis), is a top-down clustering method
where we assign all of the observations to a single cluster and then partition that cluster
into the two least similar clusters. We then proceed recursively on each cluster until there is
one cluster for each observation. So this clustering approach is exactly the opposite of
agglomerative clustering.

There is evidence that divisive algorithms produce more accurate hierarchies than
agglomerative algorithms in some circumstances, but they are conceptually more complex.
In both agglomerative and divisive hierarchical clustering, users need to specify the
desired number of clusters as a termination condition (when to stop merging or splitting).
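As a rough illustration only (not the actual DIANA procedure), here is a top-down sketch that starts with one cluster holding all points and recursively splits each cluster in two using 2-means; the toy 1-D data is made up:

```python
# Simplified top-down sketch: start with one cluster holding all points and
# recursively split each cluster in two until every point is on its own.
# This uses 2-means for the split and is NOT the exact DIANA algorithm.
import numpy as np
from sklearn.cluster import KMeans

def divisive(points, depth=0):
    print("  " * depth, points.ravel().tolist())
    if len(points) <= 1:
        return
    split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    divisive(points[split == 0], depth + 1)
    divisive(points[split == 1], depth + 1)

divisive(np.array([[2.0], [3.0], [4.0], [10.0], [11.0], [25.0]]))
```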

Measuring the goodness of Clusters

Well, there are many measures to do this; perhaps the most popular one is Dunn's
index. Dunn's index is the ratio of the minimum inter-cluster distance to the
maximum intra-cluster diameter. The diameter of a cluster is the distance between its
two furthermost points. In order to have well separated and compact clusters, you should
aim for a higher Dunn's index.
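A minimal sketch of Dunn's index under these definitions, assuming the data points sit in a NumPy array X and the cluster assignments in an integer array labels, and taking the inter-cluster distance as the minimum point-to-point distance between two clusters:

```python
# Dunn's index = (minimum inter-cluster distance) / (maximum cluster diameter).
# Higher is better: clusters are far apart and individually compact.
import numpy as np

def pairwise(a, b):
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Diameter: distance between the two furthermost points of a cluster.
    max_diam = max(pairwise(c, c).max() for c in clusters)
    # Minimum distance between a point of one cluster and a point of another.
    min_inter = min(pairwise(clusters[i], clusters[j]).min()
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters)))
    return min_inter / max_diam
```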
K-Means Clustering

(Help: https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning)

K-Means Clustering Statement

K-means tries to partition n data points into a set of k clusters
where each data point is assigned to its closest cluster. The
method is defined by an objective function which tries to
minimize the sum of all squared distances within a cluster, for all
clusters.

The objective function is defined as:

J = Σ_{i=1}^{k} Σ_{xj ∈ Si} ||xj - μi||²

where xj is a data point in the data set, Si is a cluster (a set of data
points), and μi is the cluster mean (the center of cluster Si).
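As a small sketch, assuming X holds the data points, labels the cluster assignments, and means the cluster centers (all three names are illustrative), the objective J can be computed as:

```python
# Sum of squared distances of every point to the mean of its own cluster.
import numpy as np

def kmeans_objective(X, labels, means):
    return sum(np.sum((X[labels == i] - mu) ** 2) for i, mu in enumerate(means))
```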

K-Means Clustering Algorithm:

1. Choose a value of k, the number of clusters to be formed.

2. Randomly select k data points from the data set as the initial
cluster centroids/centers.

3. For each data point:

a. Compute the distance between the data point and each cluster
centroid.

b. Assign the data point to the closest centroid.

4. For each cluster, calculate the new mean based on the data points
in the cluster.

5. Repeat steps 3 and 4 until the means of the clusters stop changing or
the maximum number of iterations is reached.
Flowchart of K-Means Clustering
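Below is a minimal NumPy sketch of these five steps; the random initialization and k = 2 are illustrative, and the data set is the one used in the worked example that follows:

```python
# Minimal NumPy sketch of k-means: random init, assign, update, repeat.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # steps 1-2
    for _ in range(max_iters):
        # Step 3: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 5: stop once the means no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[2.0], [4.0], [10.0], [12.0], [3.0], [20.0], [30.0], [11.0], [25.0]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids.ravel())
```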

Understanding with a simple example


We will apply k-means to the following 1-dimensional data set for
K = 2.

Data set {2, 4, 10, 12, 3, 20, 30, 11, 25}

Iteration 1

M1, M2 are the two randomly selected centroids/means where

M1= 4, M2=11

and the initial clusters are

C1= {4}, C2= {11}

Calculate the Euclidean distance as

D(x, a) = √((x - a)²) = |x - a|

D1 is the distance from M1

D2 is the distance from M2


Iteration 1

x     D1 = |x - 4|    D2 = |x - 11|    Cluster
2           2               9             C1
4           0               7             C1
10          6               1             C2
12          8               1             C2
3           1               8             C1
20         16               9             C2
30         26              19             C2
11          7               0             C2
25         21              14             C2

As we can see in the table above, two data points (2 and 3) are added to cluster
C1 and the other data points are added to cluster C2.

Therefore

C1= {2, 4, 3}

C2= {10, 12, 20, 30, 11, 25}

Iteration 2

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4)/3= 3

M2= (10+12+20+30+11+25)/6= 18
Calculate the distances and update the clusters as shown in the table below.

Iteration 2

New Clusters

C1= {2, 3, 4, 10}

C2= {12, 20, 30, 11, 25}

Iteration 3

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4+10)/4= 4.75

M2= (12+20+30+11+25)/5= 19.6


Calculate the distances and update the clusters as shown in the table below.

Iteration 3

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}

Iteration 4

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4+10+12+11)/6=7

M2= (20+30+25)/3= 25
Calculate the distances and update the clusters as shown in the table below.

Iteration 4

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}

As we can see, the data points in clusters C1 and C2 in iteration 4 are the same
as the data points in clusters C1 and C2 of iteration 3.

This means that none of the data points has moved to another cluster.
Also, the means/centroids of these clusters are unchanged. So this
becomes the stopping condition for our algorithm.
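The whole trace can be reproduced with a short plain-Python script that fixes the same initial means M1 = 4 and M2 = 11 and stops when the assignments no longer change:

```python
# Reproduce the worked example: 1-D data, fixed initial means 4 and 11,
# absolute difference as the distance, stop when the clusters stop changing.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 4.0, 11.0
prev = None
for it in range(1, 10):
    c1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    c2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    print(f"Iteration {it}: M1={m1}, M2={m2}, C1={c1}, C2={c2}")
    if (c1, c2) == prev:          # stopping condition: nothing moved
        break
    prev = (c1, c2)
    m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)
```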

How many clusters?


Selecting a proper value of K is very difficult unless we have good
knowledge of our data set.

Therefore we need some method to determine and validate
whether we are using the right number of clusters. The
fundamental aim of partitioning a data set is to minimize the intra-
cluster variation, or SSE. SSE is calculated by first taking the
difference between each data point and its centroid, and then adding
up the squares of all of these differences.

So, to find the optimal number of clusters:

Run k-means for different values of K, for example K varying
from 1 to 10, and for each value of K compute the SSE.

Plot a line chart with the K values on the x-axis and the corresponding values of
SSE on the y-axis, as shown below.

Elbow Method
SSE = 0 when K equals the number of data points, which means that each data point
forms its own cluster.

As we can see in the graph, there is a rapid drop in SSE as we move
from K = 2 to K = 3, and the curve becomes almost constant as the value of K is
increased further.

Because of this sudden drop we see an elbow in the graph, so the
value to be chosen for K is 3. This method is known as the elbow
method.
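A sketch of the elbow method with scikit-learn and Matplotlib, using the 1-D example data from above (so K only runs up to 9 here); scikit-learn's KMeans exposes the SSE as its inertia_ attribute:

```python
# Elbow method: run k-means for several K, record the SSE (inertia_),
# and plot SSE against K; look for the "elbow" in the curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[2], [4], [10], [12], [3], [20], [30], [11], [25]], dtype=float)

ks = list(range(1, 10))           # only 9 points, so K goes up to 9 here
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE")
plt.show()
```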

There are also many other techniques that are used to determine
the value of K.

K-means is the 'go-to' clustering algorithm because it is fast and
easy to understand.

Listing some drawbacks of K-Means

1. The result might not be globally optimal: we cannot guarantee that
this algorithm will lead to the best global solution; selecting
different random seeds at the beginning affects the final results.

2. The value of K needs to be specified beforehand: we can only guess this
value if we have a good idea about our data set, and when working
with a new data set the elbow method can be used to determine the
value of K.

3. Works only for linear boundaries: k-means assumes that cluster
boundaries are linear, so it fails when it comes to complicated
(non-linear) boundaries.

4. Slow for a large number of samples: as the algorithm accesses every
point of the data set in each iteration, it becomes slow when the sample size grows.

So this was all about k-means. I hope you now have a better
understanding of how k-means actually works. There are many
other algorithms that are used for clustering in industry.
K-Means Clustering Example:
Hierarchical Clustering: Agglomerative Clustering

Example:

Dist   A      B      C      D,F    E
A      0      0.71   5.66   3.20   4.24
B      0.71   0      4.95   2.50   3.54
C      5.66   4.95   0      2.24   1.41
D,F    3.20   2.50   2.24   0      1.00
E      4.24   3.54   1.41   1.00   0
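As a sketch, the table above can be fed to SciPy as a condensed (upper-triangle) distance vector; single linkage is chosen here only for illustration, since the example does not state which linkage rule is used:

```python
# Agglomerative clustering from a precomputed distance matrix.
# The condensed vector lists the upper triangle of the table row by row:
# A-B, A-C, A-"D,F", A-E, B-C, B-"D,F", B-E, C-"D,F", C-E, "D,F"-E.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

condensed = [0.71, 5.66, 3.20, 4.24,
             4.95, 2.50, 3.54,
             2.24, 1.41,
             1.00]
Z = linkage(condensed, method="single")   # linkage choice is an assumption
dendrogram(Z, labels=["A", "B", "C", "D,F", "E"])
plt.show()
```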
