AIML Unit 3 & 4
CET3030B
Lab Assignment: 6
• Write a program to implement hierarchical agglomerative clustering for a given dataset, e.g. a customer dataset from Kaggle, and evaluate its performance.
• 1. Import the required Python libraries
• 2. Load and explore the data
• 3. Pre-process the data and train the hierarchical agglomerative clustering model on the dataset
• 4. Analyse the results and visualize them using dendrograms.
Introduction
• Hierarchical clustering is an unsupervised machine learning algorithm used to group unlabeled data into clusters. It is also known as Hierarchical Cluster Analysis (HCA).
• In this algorithm, we build a hierarchy of clusters in the form of a tree; this tree-shaped structure is known as a dendrogram.
• A dendrogram is a diagram that shows the hierarchical relationship between objects.
• It is most commonly produced as the output of hierarchical clustering.
• The main use of a dendrogram is to work out the best way to allocate objects to clusters.
• Hierarchical clustering algorithms group similar objects into groups called clusters.
Introduction
• There are two types of hierarchical clustering algorithms:
• Agglomerative — a bottom-up approach. Start with many small clusters and merge them together to create bigger clusters.
• Divisive — a top-down approach. Start with a single cluster, then break it up into smaller clusters.
Agglomerative Clustering
• Also known as the bottom-up approach or Hierarchical Agglomerative Clustering (HAC).
• This clustering algorithm does not require us to pre-specify the number of clusters.
• Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively merge pairs of clusters until all of them have been combined into a single cluster that contains all the data.
• In other words, the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pairs of clusters.
• It does this until all the clusters are merged into one cluster that contains the entire dataset.
Agglomerative Clustering
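• The merging process described above can also be inspected numerically through SciPy's linkage matrix. Below is a minimal sketch on a made-up five-point dataset (the points and their values are illustrative assumptions, not part of the lab data).
# Minimal sketch of agglomerative merging on a tiny made-up dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five 1-D points; each one starts out as its own singleton cluster.
points = np.array([[1.0], [1.5], [5.0], [5.5], [12.0]])

# Each row of the linkage matrix records one merge:
# [index of cluster A, index of cluster B, distance between them, size of the merged cluster]
Z = linkage(points, method="ward")
print(Z)   # four rows: the five singletons are merged step by step into one cluster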
Customer Dataset
CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  Cluster
1           1       19   15                  39                      3
2           1       21   15                  81                      4
3           0       20   16                  6                       3
4           0       23   16                  77                      4
5           0       31   17                  40                      3
6           0       22   17                  76                      4
7           0       35   18                  6                       3
8           0       23   18                  94                      4
9           1       64   19                  3                       3
10          0       30   19                  72                      4
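• The slides jump to Step-2 here, so the code that follows assumes the libraries have already been imported and the matrix of features x has been built in Step-1. A minimal sketch of that setup is given below; the file name "Mall_Customers.csv" and the column positions are assumptions and may need to be adjusted to match the actual Kaggle file.
# Step-1 setup (a sketch; adjust the file name and column positions to your dataset).
import pandas as pd
import matplotlib.pyplot as mtp   # the later slides use the alias "mtp" for matplotlib.pyplot

# Load and explore the customer dataset
dataset = pd.read_csv("Mall_Customers.csv")
print(dataset.head())       # first few rows
print(dataset.describe())   # basic per-column statistics

# Matrix of features x: Annual Income (k$) and Spending Score (1-100),
# the two columns plotted in the visualization step
x = dataset.iloc[:, [3, 4]].values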
Implementation
• Step-2: Finding the optimal number of clusters using the dendrogram
• We will now find the optimal number of clusters for our model using the dendrogram. For this, we use the SciPy library, as it provides a function that directly plots the dendrogram for our code.
#Finding the optimal number of clusters using the dendrogram
import scipy.cluster.hierarchy as shc
dendro = shc.dendrogram(shc.linkage(x, method="ward"))
mtp.title("Dendrogram Plot")
mtp.ylabel("Euclidean Distances")
mtp.xlabel("Customers")
mtp.show()
10
Implementation
• In the above lines of code, we have imported the hierarchy module of the SciPy library.
• This module provides the function shc.dendrogram(), which takes the output of linkage() as a parameter. The linkage function defines the distance between two clusters, so here we have passed x (the matrix of features) and the method "ward", a popular linkage method in hierarchical clustering.
• Ward linkage minimizes the sum of squared differences within all clusters.
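• As a rough illustration of the Ward criterion (a sketch with made-up numbers, not taken from the slides): merging two clusters always increases the total within-cluster sum of squares, and Ward linkage chooses the merge with the smallest increase.
# Illustration of the Ward criterion on two made-up 1-D clusters.
import numpy as np

def within_ss(cluster):
    # Sum of squared differences of the points from their cluster mean.
    c = np.asarray(cluster, dtype=float)
    return float(np.sum((c - c.mean()) ** 2))

a, b = [1.0, 2.0], [10.0, 11.0]
increase = within_ss(a + b) - (within_ss(a) + within_ss(b))
# Among all candidate pairs of clusters, Ward linkage picks the merge
# with the smallest such increase.
print(increase)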
Implementation
• Output:
• By executing the above lines of code, we get the dendrogram plot shown below:
Implementation
• Using this dendrogram, we will now determine the optimal number of clusters for our model. For this, we look for the longest vertical distance that can be drawn without cutting any horizontal bar. Consider the diagram below:
Implementation
• In the above diagram, we have marked the vertical distances that do not cut any horizontal bar. The 4th distance appears to be the longest, so according to this, the number of clusters will be 5 (the number of vertical lines in this range).
• So, the optimal number of clusters is 5, and we will use this value to train the model in the next step.
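• The same reading of the dendrogram can be approximated programmatically. The following is a heuristic sketch (not part of the slides): it looks for the largest jump between consecutive merge distances in the linkage matrix and cuts the tree inside that jump.
# Heuristic sketch: cut the tree inside the largest jump in merge distances.
import numpy as np
import scipy.cluster.hierarchy as shc

Z = shc.linkage(x, method="ward")   # same linkage used for the dendrogram
dists = Z[:, 2]                     # merge distances, in increasing order
gaps = np.diff(dists)               # jumps between consecutive merges
cut = dists[gaps.argmax()] + gaps.max() / 2.0   # threshold inside the largest jump

labels = shc.fcluster(Z, t=cut, criterion="distance")
print("Suggested number of clusters:", labels.max())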
Implementation
• Step-3: Training the hierarchical clustering model
• Now that we know the optimal number of clusters, we can train our model.
#training the hierarchical model on the dataset
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
y_pred = hc.fit_predict(x)
Implementation
• In the code, we have imported the AgglomerativeClustering class from the cluster module of the scikit-learn library.
• Then we have created an object of this class named hc.
• The AgglomerativeClustering class takes the following parameters:
• n_clusters=5: the number of clusters; we use 5 here because it is the optimal number of clusters found above.
• affinity='euclidean': the metric used to compute the linkage.
• linkage='ward': the linkage criterion; here we use "ward" linkage, the same popular linkage method we already used for creating the dendrogram.
• In the last line, fit_predict both trains the model on x and returns, in y_pred, the cluster to which each data point belongs.
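• Note that the affinity parameter shown above works in older scikit-learn releases; in scikit-learn 1.2 it was renamed to metric, and affinity was later removed. A sketch of the equivalent call for newer versions (with Ward linkage the metric must be Euclidean):
# Equivalent call for scikit-learn 1.2 and later, where "affinity" is called "metric".
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')
y_pred = hc.fit_predict(x)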
Implementation
• Step-4: Visualizing the clusters
• Having trained the model successfully, we can now visualize the clusters corresponding to the dataset.
#visualizing the clusters
mtp.scatter(x[y_pred == 0, 0], x[y_pred == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
mtp.scatter(x[y_pred == 1, 0], x[y_pred == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
mtp.scatter(x[y_pred == 2, 0], x[y_pred == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
mtp.scatter(x[y_pred == 3, 0], x[y_pred == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
mtp.scatter(x[y_pred == 4, 0], x[y_pred == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Implementation
• Output: By executing the above lines of code, we get the scatter plot of the five customer clusters shown below:
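• The assignment also asks us to evaluate the clustering performance, which the slides do not show. Below is a minimal sketch using the silhouette score; the choice of metric is an assumption, and values closer to 1 indicate better-separated clusters.
# Evaluating the clustering with the silhouette score (one common choice for unlabeled data).
from sklearn.metrics import silhouette_score

score = silhouette_score(x, y_pred, metric='euclidean')
print("Silhouette score for 5 clusters:", score)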
Thank you