
TE Comp-V    Lab Experiment : 7    Date of Submission :

Name: Saville D'silva    Roll number : 9536


Course outcomes: On successful completion of the course, the learner will be able to:

-----------------

Rubrics for assessment of Lab Experiment :

Indicator | Average | Good | Excellent
Timeline - On time completion & submission (02) | Late submission (0) | 01 (On time) | 02 (Before deadline)
Completeness and neatness - Complete all parts of schema diagram / OLAP / Algorithm (02) | <60% complete (0) | <80% complete (1) | 100% complete (2)
Implementation - Extent of coding (04) | <60% complete (2) | <80% complete (3) | 100% complete (4)
Knowledge - In-depth knowledge of the post-assignment questions (02) | Unable to answer 2 questions (0) | Unable to answer 1 question (1) | Able to answer 2 questions (2)

Marks: Timeline (2) | Completeness and neatness (2) | Implementation (4) | Knowledge (2)

Teacher's Sign :    Total (10):


Saville D'silva - 9536
Ishita Yadav - 9649

TE Comps A

Expt 7

Aim : Implementation of any one Hierarchical Clustering method


Theory :
What is Clustering?
Clustering is the task of dividing data into groups such that items in one group are similar to each other, while items in different groups are dissimilar from each other. Clustering is a form of Unsupervised Learning.
Hierarchical clustering
Hierarchical Clustering groups similar objects into clusters step by step; the final step combines all the clusters into one single cluster. The result of Hierarchical Clustering is usually visualized as a Dendrogram.
Types of Hierarchical Clustering: Hierarchical Clustering is of 2 types -
1. Agglomerative Hierarchical Clustering.
2. Divisive Hierarchical Clustering.
1. Agglomerative Hierarchical Clustering.
Agglomerative Hierarchical Clustering uses a bottom-up approach to form clusters. That means it starts with each data point as its own cluster. It then merges the closest data points or clusters into one cluster. The same process repeats until only one single cluster is left.
2. Divisive Hierarchical Clustering.
Divisive Hierarchical Clustering is the opposite of Agglomerative Hierarchical Clustering. It is a Top-Down approach.
That means it starts from one single cluster that contains all the data points. At each step it splits the farthest (most dissimilar) cluster into separate clusters, until every data point stands alone.
The Agglomerative approach proceeds as follows:

Step 1 - Make each data point a single cluster. Suppose that forms n clusters.

Step 2 - Take the 2 closest data points and merge them into one cluster. Now the total number of clusters becomes n-1.

Step 3 - Take the 2 closest clusters and merge them into one cluster. Now the total number of clusters becomes n-2.

Step 4 - Repeat Step 3 until only one cluster is left.

When only one huge cluster is left, the algorithm stops.
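The merge history described in these steps can be inspected directly with SciPy. Below is a minimal sketch, assuming a small hypothetical array of six 2D points chosen only for illustration; each row of the matrix returned by linkage() records one merge step.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical sample of 6 points, used only for illustration
points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0],
                   [5.1, 5.2], [9.0, 1.0], [9.2, 1.1]])

# linkage() performs the bottom-up merging; each of the n-1 rows records
# one merge: the two cluster indices joined, the merge distance,
# and the number of points in the new cluster
merge_history = linkage(points, method='single')
print(merge_history)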


How to calculate Distance between Two Clusters?
For calculating the distance between two data points, we use the Euclidean Distance formula.
But to calculate the distance between two clusters, we can use one of four methods:
1. Closest Points - We take the distance between the two closest points of the two clusters. This is also known as Single Linkage.

2. Furthest Points - Another option is to take the two furthest points of the two clusters and use their distance as the distance between the clusters. This is also known as Complete Linkage.
3. Average Distance - In this method, we take the average distance between all pairs of data points across the two clusters and use this average as the distance between the clusters. This is known as Average Linkage.
4. Distance between Centroids - Another option is to find the centroid of each cluster and then calculate the distance between the two centroids. This is known as Centroid Linkage.

Choosing the method for distance calculation is an important part of Hierarchical Clustering, because it affects both the clusters you get and the performance of the algorithm. Keep in mind while working on Hierarchical Clustering that the way distance between clusters is measured is crucial; depending on your problem, you can choose the appropriate option. A small code comparison of these options is sketched below.
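The four distance measures above correspond to the method argument of SciPy's linkage function ('single', 'complete', 'average', 'centroid'). A short sketch, re-using the same hypothetical points as the earlier example, comparing the distance at which the final two clusters are joined under each method:

import numpy as np
from scipy.cluster.hierarchy import linkage

# Same hypothetical 6 points as in the earlier sketch
points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0],
                   [5.1, 5.2], [9.0, 1.0], [9.2, 1.1]])

for method in ['single', 'complete', 'average', 'centroid']:
    merges = linkage(points, method=method)
    # The last row is the final merge; its third column is the distance
    # at which the last two clusters were joined
    print(method, '-> final merge distance:', merges[-1, 2])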
What is a Dendrogram?
A Dendrogram is a tree-like structure that stores each record of splitting and merging.
Let's understand how a dendrogram is created and how it works.
How is a Dendrogram Created?
Suppose we have 6 data points.
A Dendrogram stores each record of splitting and merging in a chart.
In our dendrogram chart, all 6 data points P1, P2, P3, P4, P5, and P6 are marked along the horizontal axis.

So, whenever any merging happens between data points or clusters, the dendrogram records it on the chart.
So, let’s start with the 1st step.
Step 1 -
Combine the two closest data points into one cluster. Suppose P5 and P6 are the two closest data points, so we combine them into one cluster. The dendrogram records this merge on the chart.

The dendrogram stores the record by drawing a horizontal line in the chart. The height of this horizontal line is based on the Euclidean distance: the smaller the Euclidean distance, the lower the height of the horizontal line.
Step 2 -
At step 2, find the next two closest data points and merge them into one cluster.
Suppose P2 and P3 are the next closest data points.

The dendrogram records this merge on the chart.

Again, the height of the horizontal line depends upon the Euclidean distance.
Step 3 -
At step 3, we again look for the closest clusters. Suppose P4 is closest to the red cluster (P5 and P6).
So P4, P5, and P6 form one cluster, and the dendrogram records it on the chart.

Step 4 -
Again, we look for the closest clusters. Suppose P1 is closest to the green cluster (P2 and P3), so we merge them into one cluster.

The dendrogram again records it on the chart.


Step 5 -
Now, no smaller clusters are left. So, the last step is to merge all the clusters into one huge cluster.

The dendrogram draws the final horizontal line. The height of this line is large because the distance between the two remaining clusters is large.

So, that's how a dendrogram is created. The dendrogram is the memory of Hierarchical Clustering.
Now that we have created a dendrogram, it's time to find the optimal number of clusters with its help.
We find the optimal number of clusters by cutting the dendrogram with a horizontal line. The cut is made across the longest vertical line, that is, the line that can be extended the farthest up and down without intersecting any merging point.
Let's understand with the help of this example.
Suppose that in this dendrogram, L1 is the longest such vertical line: it can be extended the maximum distance up and down without intersecting the merging points.
So we make the cut by drawing a horizontal line through it.

This cutting line intersects two vertical lines, and that count gives the optimal number of clusters.

That's why, in this case, the optimal number of clusters is 2. The same cut can also be made programmatically, as sketched below.
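A sketch of cutting the dendrogram with SciPy's fcluster, either at an assumed distance threshold (the height of the horizontal cut line) or by asking directly for a fixed number of clusters. It assumes X is the 2D feature matrix used later in this experiment, and the threshold of 200 is purely illustrative:

from scipy.cluster.hierarchy import linkage, fcluster

merges = linkage(X, method='ward')

# Option 1: cut the dendrogram at an assumed distance threshold
labels_by_distance = fcluster(merges, t=200, criterion='distance')

# Option 2: ask directly for the number of clusters read off the dendrogram
labels_by_count = fcluster(merges, t=2, criterion='maxclust')

print('Clusters at the assumed threshold:', len(set(labels_by_distance)))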

Link to download dataset : Mall_Customers | Kaggle


Steps:
1) Load dataset
2) Select any two dimensions or attributes representing 2D data
3) Load libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
4) Create Dendrogram to find the Optimal Number of Clusters
import scipy.cluster.hierarchy as sch
dendro = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

5) Use Agglomerative Hierarchical Clustering to fit clusters to the dataset


from sklearn.cluster import AgglomerativeClustering
# Note: recent scikit-learn versions use metric= in place of the older affinity= argument
hc = AgglomerativeClustering(n_clusters = 5, metric = 'euclidean', linkage = 'ward')
y_hc = hc.fit_predict(X)
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the Mall Customers dataset and inspect the first rows
data = pd.read_csv('Mall_Customers.csv')
data.head(10)

# Select two attributes to form the 2D data
X = data[['Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize the features before K-Means
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# K-Means clustering (for comparison with hierarchical clustering)
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters)
data['Cluster'] = kmeans.fit_predict(X_scaled)

plt.scatter(data['Annual Income (k$)'], data['Spending Score (1-100)'],
            c=data['Cluster'], cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title(f'K-Means Clustering with {num_clusters} Clusters')
plt.show()

# Transform the cluster centers back to the original scale
cluster_centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                               columns=X.columns)
print(cluster_centers)

# Dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
dendro = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

# Agglomerative Hierarchical Clustering with 5 clusters
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')
y_hc = hc.fit_predict(X)
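The agglomerative labels can be visualized the same way as the K-Means result above. A short sketch, assuming X and y_hc from the preceding code:

plt.scatter(X['Annual Income (k$)'], X['Spending Score (1-100)'],
            c=y_hc, cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Agglomerative Hierarchical Clustering with 5 Clusters')
plt.show()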
Program with code – Use different dataset

Links:
Hierarchical Clustering in Python, Step by Step Complete Guide (mltut.com)
scipy.cluster.hierarchy.linkage — SciPy v1.7.1 Manual
