
Hierarchical Clustering

Hierarchical clustering is a technique to group similar objects together step by step, based on how close or similar they are. It builds a tree-like structure (called a dendrogram) that shows how points are merged or split.

It does not require you to specify the number of clusters in advance.

The need:

Other methods like k-means require you to:

 Specify the number of clusters beforehand (e.g., k = 3).
 Randomly initialize centroids (which can affect the result).

But Hierarchical Clustering:

 Needs no number of clusters specified at the beginning.
 Has no randomness, so the result is deterministic.
 Gives a full hierarchy: you can choose 2, 3, or 5 clusters later by just cutting the tree.
 Offers flexibility: you can explore small groups or big groups just by adjusting the cut.
 Aids visual understanding: dendrograms make it easy to visualize relationships among points.

Real-Life Uses of Hierarchical Clustering:

 Biology: Group species based on genetic similarity (phylogenetic trees).
 Customer segmentation: Group customers based on behaviour.
 Document grouping: Group similar articles or books.
 Image analysis: Group similar images based on features.

Types of Hierarchical Clustering:

Hierarchical clustering has two main types:

a) Agglomerative: Bottom-up approach (start small and merge)
b) Divisive: Top-down approach (start big and split)

Agglomerative Hierarchical Clustering: Agglomerative clustering starts with each data point as its own cluster and then iteratively merges the closest clusters into larger clusters until only one cluster remains.

The steps involved in agglomerative clustering are as follows:

1. Begin with each data point as a separate cluster.
2. Find the two closest clusters and merge them into a single cluster.
3. Repeat step 2 until all data points are in one cluster, forming a hierarchy of clusters.

The result of agglomerative clustering is often visualized as a dendrogram, a tree-like diagram showing the hierarchical relationships between clusters. You can cut the dendrogram at a certain level to obtain a specific number of clusters.
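
The whole merge-and-cut workflow can be run in a few lines of Python. The sketch below is a minimal illustration using SciPy (assumed to be installed); the coordinates are made up for the example:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small 2-D dataset: two visually separate groups
X = np.array([[1, 2], [2, 2], [1, 1],
              [5, 5], [6, 5], [5, 6]], dtype=float)

# Build the full merge hierarchy (single linkage, Euclidean distance)
Z = linkage(X, method="single", metric="euclidean")

# Cut the tree afterwards to get any number of clusters you like
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # 2 clusters
labels_3 = fcluster(Z, t=3, criterion="maxclust")  # 3 clusters
print(labels_2)  # e.g. [1 1 1 2 2 2]

The same linkage matrix Z can be passed to scipy.cluster.hierarchy.dendrogram to draw the tree described above.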

How Does Agglomerative Hierarchical Clustering Work?

It is a technique that organizes data points or objects into a hierarchy of clusters. It starts with each data point as its own cluster and then iteratively merges the closest clusters until a specific criterion is met.
Here’s a step-by-step explanation of how hierarchical clustering works:

1. Initialization: Begin with each data point as a separate cluster. In other words, if you have ‘n’ data points, you start with ‘n’ clusters, each containing one data point.

2. Distance Calculation: Calculate the pairwise distances or similarities between all clusters. This step requires a distance metric or similarity measure, such as Euclidean distance, Manhattan distance, or cosine similarity, depending on the nature of your data and problem.

3. Merging Clusters (Agglomerative Clustering): Find the two closest clusters based on the distance/similarity metric. These clusters are merged into a single cluster.

 The choice of which clusters to merge can be determined by different linkage criteria (see the sketch after this list), such as:
 Single Linkage: Merge clusters based on the closest pair of data points between the clusters.
 Complete Linkage: Merge clusters based on the farthest pair of data points between the clusters.
 Average Linkage: Merge clusters based on the average distance between all pairs of data points between the clusters.
 Centroid Linkage: Merge clusters based on the distance between the clusters’ centroids (mean points).
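
The following sketch (plain NumPy, with made-up helper names) shows how these four linkage criteria score the distance between two clusters:

import numpy as np

def pairwise(c1, c2):
    # All Euclidean distances between the points of cluster c1 and cluster c2
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    return np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1)

def single_linkage(c1, c2):    # closest pair of points
    return pairwise(c1, c2).min()

def complete_linkage(c1, c2):  # farthest pair of points
    return pairwise(c1, c2).max()

def average_linkage(c1, c2):   # mean over all pairs of points
    return pairwise(c1, c2).mean()

def centroid_linkage(c1, c2):  # distance between the cluster means
    return np.linalg.norm(np.mean(c1, axis=0) - np.mean(c2, axis=0))

# Example with the clusters {A, B} and {C, D} from the practice example below
AB = [(1, 2), (2, 2)]
CD = [(5, 5), (6, 5)]
print(single_linkage(AB, CD))    # 4.24... (B to C)
print(complete_linkage(AB, CD))  # 5.83... (A to D)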

4. Update the Distance Matrix:

 Recalculate the distances or similarities between the newly formed cluster and the remaining clusters.
 This step reduces the number of clusters by one.

5. Repeat Steps 3 and 4:

 Continue merging the closest clusters and updating the distance matrix
until only one cluster remains. This process forms a hierarchy of clusters.

6. Creating a Dendrogram:

 As clusters are merged, you can represent the hierarchical structure using
a dendrogram. A dendrogram is a tree-like diagram visually showing
clusters’ merging processes and relationships.

7. Cutting the Dendrogram:

 You can cut the dendrogram at a certain level to obtain a specific number of clusters. The height at which you cut the dendrogram determines the number of clusters you get.

8. Cluster Assignment:

 Once you’ve determined the desired number of clusters, you can assign
each data point to its corresponding cluster based on the hierarchical
structure you’ve created.

Practice Example:

Point Coordinates (x, y)
A (1, 2)
B (2, 2)
C (5, 5)
D (6, 5)

Step 1: Initialization

 Start with each point as its own cluster:
 Clusters = {A}, {B}, {C}, {D}
 Total: 4 clusters.

Step 2: Distance Calculation

Calculate all pairwise distances:

Pair Distance
A-B √((2−1)² + (2−2)²) = √1 = 1
A-C √((5−1)² + (5−2)²) = √(16+9) = √25 = 5
A-D √((6−1)² + (5−2)²) = √(25+9) = √34 ≈ 5.83
B-C √((5−2)² + (5−2)²) = √(9+9) = √18 ≈ 4.24
B-D √((6−2)² + (5−2)²) = √(16+9) = √25 = 5
C-D √((6−5)² + (5−5)²) = √1 = 1
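
As a quick check, the table above can be reproduced in Python (a small sketch assuming NumPy is installed):

import numpy as np
from itertools import combinations

points = {"A": (1, 2), "B": (2, 2), "C": (5, 5), "D": (6, 5)}

# Print every pairwise Euclidean distance, matching the table above
for (n1, p1), (n2, p2) in combinations(points.items(), 2):
    d = np.linalg.norm(np.subtract(p1, p2))
    print(f"{n1}-{n2}: {d:.2f}")
# A-B: 1.00, A-C: 5.00, A-D: 5.83, B-C: 4.24, B-D: 5.00, C-D: 1.00
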
Step 3: First Merge

 Find the smallest distance:
 A-B distance = 1
 C-D distance = 1
 Both are 1, so you can pick either merge first.
 Let's merge A and B first.
 New cluster: {A, B}
 Clusters now are: {AB}, {C}, {D}

Step 4: Update Distance Matrix

Now, calculate distances between the new cluster {AB} and the others using single linkage:

 Distance between {AB} and C = min(distance A-C, distance B-C) = min(5, 4.24) = 4.24
 Distance between {AB} and D = min(distance A-D, distance B-D) = min(5.83, 5) = 5
 Distance between C and D = already known = 1

Step 5: Next Merge

Find the next smallest distance:

 C-D distance = 1
 Merge C and D.
 New cluster: {C, D}
 Clusters now: {AB}, {CD}

Step 6: Final Merge

Now, only two clusters left: {AB} and {CD}

Distance between {AB} and {CD}:

Minimum distance between points from {AB} and {CD}:

A-C = 5
A-D = 5.83
B-C = 4.24
B-D = 5

Minimum = 4.24 (between B and C)

Merge {AB} and {CD} into final cluster {ABCD}.

Step 7: Create the Dendrogram

In the dendrogram, A and B join at a height of 1, C and D join at a height of 1, and the two clusters {AB} and {CD} join at a height of 4.24.

Step 8: Cut the Dendrogram (Optional)

If you cut the dendrogram at a distance of around 2, you get two clusters:

 Cluster 1: {A, B}
 Cluster 2: {C, D}

If you cut at a higher distance like 5, all points are merged into one cluster.
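
This cut can be reproduced with SciPy (a minimal sketch; SciPy is assumed to be installed):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [2, 2], [5, 5], [6, 5]], dtype=float)  # points A, B, C, D
Z = linkage(X, method="single")  # single linkage, as in the worked example

print(fcluster(Z, t=2, criterion="distance"))  # cut at height 2 -> two clusters, e.g. [1 1 2 2]
print(fcluster(Z, t=5, criterion="distance"))  # cut at height 5 -> one cluster, [1 1 1 1]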

Practice Question:

Apply agglomerative hierarchical clustering to the following points:

Point Coordinates (x, y)
A (1, 1)
B (2, 1)
C (5, 4)
D (6, 5)
E (8, 8)

Divisive Clustering:

 Divisive Hierarchical Clustering is a top-down clustering approach.
 It starts with all data points grouped together into a single large cluster.
 Then, it repeatedly splits the clusters into smaller and smaller groups until each data point is in its own cluster or a desired number of clusters is obtained.
 Divisive clustering is considered the reverse of agglomerative clustering, which starts from individual points and merges them.

How Divisive Clustering Works — Step-by-Step:

1. Start: Consider all data points as one big cluster.
2. Splitting: Find the best split of the cluster into two smaller clusters. The splitting is often based on maximizing the distance between groups or minimizing the similarity within groups.
3. Selection: Choose the cluster that can be best split further and apply the splitting process again.
4. Repeat: Continue splitting clusters until:
o Each data point becomes its own cluster, or
o A stopping condition (such as a specific number of clusters or a distance threshold) is reached.

Methods for Splitting:

 Distance-Based Splitting: Divide points based on which points are farthest apart.
 K-means Based Splitting: Apply k-means clustering with k = 2 inside a cluster to split it (see the sketch after this list).
 Graph-Based Splitting: Use methods such as minimum spanning tree cutting to separate points.
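
A minimal sketch of the k-means based splitting idea, using scikit-learn (assumed to be installed). The stopping rule used here, splitting the largest cluster until a target number of clusters is reached, is one simple choice among several:

import numpy as np
from sklearn.cluster import KMeans

def divisive_kmeans(X, n_clusters):
    # Top-down clustering: repeatedly bisect the largest cluster with k = 2 k-means
    clusters = [np.arange(len(X))]  # start with one cluster holding every point index
    while len(clusters) < n_clusters:
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(biggest)
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[idx])
        clusters.append(idx[km.labels_ == 0])  # one half of the split
        clusters.append(idx[km.labels_ == 1])  # the other half
    return clusters

# Illustrative data: two well-separated groups of points
X = np.array([[1, 2], [2, 2], [1, 1], [8, 8], [9, 8], [8, 9]], dtype=float)
print(divisive_kmeans(X, 2))  # e.g. [array([0, 1, 2]), array([3, 4, 5])]
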
Why Divisive Clustering?

 It directly identifies large natural groups before refining smaller clusters.
 It can sometimes lead to more meaningful cluster structures than agglomerative methods.
 It is computationally more expensive but may produce better-quality clusters, especially when large groups are clearly separated.

Important Comparison: Agglomerative vs Divisive

Aspect Agglomerative Divisive
Approach Bottom-up Top-down
Start Each point is a separate cluster All points are in one cluster
Process Merge closest clusters Split largest cluster
Complexity Faster Slower and more complex

Practice Question:

Given five points:

Point Coordinates (x, y)
P (1, 2)
Q (2, 2)
R (8, 9)
S (9, 8)
T (8, 8)

Use Divisive Hierarchical Clustering with these conditions:

 Use Euclidean distance.
 At each split, separate the two farthest points into different clusters.
 Continue splitting until the distance between points in a cluster is less than or equal to 4.

Solution:

Step 1: Start with all points together: {P, Q, R, S, T}.

Step 2: Find the farthest points:

Calculate Euclidean distances:

 P–S = 10 (largest distance)
 P–R ≈ 9.9
 Other distances are smaller.

Thus, P and S are the farthest apart.

Step 3: First Split:

Separate {P} from {Q, R, S, T}.

Clusters:

 Cluster 1: {P}
 Cluster 2: {Q, R, S, T}

Step 4: Split {Q, R, S, T}:

Among {Q, R, S, T}:

 Q–R ≈ 9.22
 Q–S ≈ 9.22
 Q–T ≈ 8.49
 R–S ≈ 1.41
 R–T = 1
 S–T = 1

The farthest points again involve Q.

Thus, split {Q} from {R, S, T}.

New Clusters:

 Cluster 1: {P}
 Cluster 2: {Q}
 Cluster 3: {R, S, T}

Step 5: Examine {R, S, T}:

Distances:

 R–S ≈ 1.41
 R–T = 1
 S–T = 1

All distances are less than 4, so no further splitting is needed.

Final Clusters:

Cluster Number Points
Cluster 1 {P}
Cluster 2 {Q}
Cluster 3 {R, S, T}

Dendrogram Structure:

The tree first splits {P, Q, R, S, T} into {P} and {Q, R, S, T}, then splits off {Q}, leaving {R, S, T} together as the final cluster.
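
One way to code the splitting rule followed in this solution, splitting off the point involved in the largest within-cluster distance until every cluster's diameter is at most 4, is sketched below (NumPy is assumed to be installed; the helper names are made up):

import numpy as np

points = {"P": (1, 2), "Q": (2, 2), "R": (8, 9), "S": (9, 8), "T": (8, 8)}

def dist_matrix(names):
    # Pairwise Euclidean distances between the named points
    coords = np.array([points[n] for n in names], dtype=float)
    return np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

def diameter(names):
    # Largest pairwise distance inside a cluster
    return dist_matrix(names).max()

def farthest_point(names):
    # The point taking part in the largest pairwise distance of the cluster
    d = dist_matrix(names)
    return names[int(np.unravel_index(d.argmax(), d.shape)[0])]

clusters = [list(points)]  # start with one big cluster {P, Q, R, S, T}
while any(len(c) > 1 and diameter(c) > 4 for c in clusters):
    c = max((c for c in clusters if len(c) > 1 and diameter(c) > 4), key=diameter)
    clusters.remove(c)
    far = farthest_point(c)                      # split off the most distant point
    clusters.append([far])
    clusters.append([n for n in c if n != far])

print(clusters)  # [['P'], ['Q'], ['R', 'S', 'T']] under this rule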

Practice Question:

You are given the following five points:

Point Coordinates (x, y)
A (2, 3)
B (3, 4)
C (10, 10)
D (11, 10)
E (10, 9)

Tasks:

1. Perform the clustering step-by-step.
2. Show the final clusters.
3. Draw a rough dendrogram structure showing the splits.
