
Hierarchical clustering

Prepared By
Archana
AP/PE
IIT(ISM), Dhanbad
Monsoon 24-25
• Another powerful unsupervised ML algorithm is hierarchical clustering, an algorithm that groups similar instances into clusters.
• Like k-means clustering, hierarchical clustering uses a distance-based algorithm to measure the distance between clusters. There are two main types of hierarchical clustering:
1) Agglomerative hierarchical clustering (additive hierarchical clustering):
• In this type, each point starts as its own cluster. For instance, if there are 10 points in a data set, there will be 10 clusters at the beginning of applying hierarchical clustering.
• Afterward, based on a distance function such as Euclidean distance, the closest pair of clusters is merged. This process is repeated until a single cluster is left (see the short sketch after these two types).
2) Divisive hierarchical clustering:
• This type of hierarchical clustering works the opposite way of agglomerative hierarchical clustering.
• Hence, if there are 10 data points, all data points initially belong to one single cluster.
• Afterward, the point farthest from the rest of the cluster is split off, and this process continues until each cluster contains a single point.
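To make the agglomerative description concrete, here is a minimal sketch using scipy (scipy and scikit-learn provide only the agglomerative variant); the ten one-dimensional points are purely illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage

# Ten illustrative 1-D points: each starts in its own cluster
X = np.arange(10, dtype=float).reshape(-1, 1)

# Each row of Z records one merge: (cluster i, cluster j, merge distance, new cluster size)
Z = linkage(X, method='single', metric='euclidean')
print(Z.shape)  # (9, 4): ten singleton clusters are merged nine times into one cluster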
• To further explain the concept of hierarchical clustering, let's go through a step-by-step example of applying agglomerative hierarchical clustering to a small data set of 4 wells with their respective EURs per 1,000 ft.
• Note that since this is one-dimensional data, the data was not standardized prior to calculating the distances.
• In other words, it is OK not to standardize the data for this particular example.
• Step 1) The first step in solving this problem is creating a proximity matrix. A proximity matrix simply stores the distances between each pair of points.
• To create a proximity matrix for this example, a square matrix of n by n is created, where n represents the number of observations. Therefore, a proximity matrix of 4 × 4 can be created as shown in Table 4.2.
• The diagonal elements of the matrix will be 0 because the distance of
each element from itself is 0.
• To calculate the distance between points 1 and 2, let's use the Euclidean distance function, which for one-dimensional data reduces to the absolute difference in EUR per 1,000 ft:

distance(1, 2) = sqrt((EUR_1 - EUR_2)^2) = |EUR_1 - EUR_2| = 0.2
• Similarly, that's how the rest of the distances in Table 4.2 were calculated (see the short sketch below).
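A minimal sketch of building this proximity matrix with scipy is shown below. Table 4.2 is not reproduced here, so the EUR/1000 ft values [1.2, 1.4, 2.0, 2.5] are inferred from the distances quoted in the text and should be treated as illustrative:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# EUR/1000 ft values inferred from the distances quoted in the text (illustrative)
eur = np.array([[1.2], [1.4], [2.0], [2.5]])

# 4 x 4 proximity matrix of pairwise Euclidean distances; the diagonal is 0
proximity = squareform(pdist(eur, metric='euclidean'))
print(np.round(proximity, 2))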
• Next, the smallest distance in the proximity matrix is identified, and
the points with the smallest distance are merged.
• As can be seen from this table, the smallest distance is 0.2 between
points 1 and 2. Therefore, these two points can be merged.
• Let's update the clusters, followed by updating the proximity matrix.
• To merge points 1 and 2 together, the average, maximum, or minimum can be chosen as the representative value of the merged cluster.
• For this example, the maximum was chosen. Therefore, the maximum EUR/1000 ft between well numbers 1 and 2, which is 1.4, represents the merged cluster.

• Let’s recreate the proximity matrix with the new merged clusters as
illustrated in Table 4.3.
• Clusters 3 and 4 (shown in bold in Table 4.3) can now be merged into one cluster, with the smallest distance of 0.5. The maximum EUR/1000 ft between well numbers 3 and 4, which is 2.5, represents the merged cluster. Let's update the table as follows:
• Finally, let's recreate the proximity matrix as shown in Table 4.4. Now, clusters 1, 2, 3, and 4 can all be combined into one cluster. This is essentially how agglomerative hierarchical clustering functions: the example problem started with four clusters and ended with one cluster (a short code sketch of this procedure follows below).
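The full procedure described above (repeatedly merge the closest pair of clusters and keep the maximum EUR as the representative value of the merged cluster) can be sketched as follows; again, the EUR values are the inferred, illustrative ones:

# Representative EUR/1000 ft per cluster (inferred, illustrative values)
clusters = {('1',): 1.2, ('2',): 1.4, ('3',): 2.0, ('4',): 2.5}

while len(clusters) > 1:
    labels = list(clusters)
    # Closest pair of clusters; in 1-D the Euclidean distance is the absolute difference
    pairs = [(abs(clusters[a] - clusters[b]), a, b)
             for i, a in enumerate(labels) for b in labels[i + 1:]]
    dist, a, b = min(pairs)
    # Merge the pair and keep the maximum EUR as the new representative value,
    # exactly as done in the worked example
    rep = max(clusters.pop(a), clusters.pop(b))
    clusters[a + b] = rep
    print(f"merged {a} and {b} at distance {dist:.1f}; representative EUR = {rep}")

# Printed merge distances are 0.2, 0.5, and 1.1, matching the worked example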
Dendrogram
• A dendrogram is used to show the hierarchical relationship between
objects and is the output of the hierarchical clustering.
• A dendrogram could potentially help with identifying the number of
clusters to choose when applying hierarchical clustering.
• A dendrogram is also helpful in understanding the overall structure of the data.
• To illustrate the concept of using a dendrogram, let’s create a
dendrogram for the hierarchical clustering example above.
• As illustrated in Fig. 4.15, the distance between well numbers 1 and 2 is
0.2 as shown on the y-axis (distance) and the distance between well
numbers 3 and 4 is 0.5.
• Finally, merged clusters 1,2 and 3,4 are connected and have a distance
of 1.1.
• Longer vertical lines in the dendrogram indicate a larger distance between clusters.
• As a general rule of thumb, identify clusters with the longest distances or branches (vertical lines); shorter branches are more similar to one another.
• For instance, in Fig. 4.15, one cluster combines two smaller branches
(clusters 1 and 2) and another cluster combines the other two smaller
branches (clusters 3 and 4). Therefore, two clusters can be chosen in
this example.
• Please note that the optimum number of clusters is subjective and
could be influenced by the problem, domain knowledge of the
problem, and application.
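A dendrogram for the 4-well example can be sketched with scipy as shown below, again using the inferred EUR values. Note that scipy's standard linkage rules differ slightly from the maximum-representative rule used in the hand calculation: the first two merge heights (0.2 and 0.5) match Fig. 4.15, while the height of the final merge depends on the linkage chosen.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# EUR/1000 ft values inferred from the worked example (illustrative)
eur = np.array([[1.2], [1.4], [2.0], [2.5]])

# 'complete' (maximum) linkage; wells 1-2 merge at a height of 0.2 and wells 3-4 at 0.5
Z = linkage(eur, method='complete')
dendrogram(Z, labels=['well 1', 'well 2', 'well 3', 'well 4'])
plt.ylabel('Distance')
plt.show()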
Implementing the dendrogram and hierarchical clustering in the scikit-learn library
• Let's use the scikit-learn library to apply the dendrogram and hierarchical clustering.
• Please create a new Jupyter Notebook, start by importing the main libraries, and use the link below to access the hierarchical clustering data set, which includes 200 wells with their respective Gas in Place (GIP) and EUR/1000 ft.
• Next, let’s standardize the data prior to applying hierarchical
clustering as follows:
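A minimal sketch of this step is shown below; the CSV file name and the use of both columns (GIP and EUR/1000 ft) are assumptions, since the original link and code are not reproduced here:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical file name for the 200-well data set with GIP and EUR/1000 ft columns
df = pd.read_csv('Hierarchical_Clustering_DataSet.csv')

# Standardize both features to zero mean and unit variance
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)   # numpy array of z-scores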

• Next, let’s create the dendrogram.


• First, "import scipy.cluster.hierarchy as shc" library.
• Please make sure to pass along "df_scaled" which is the standardized
version of the data set in
"dend = shc.dendrogram( shc.linkage(df_scaled, method = 'ward'))."
• As illustrated in Fig. 4.17, the dashed black line intersects 5 vertical lines, which suggests choosing 5 clusters.
• This dashed black horizontal line is drawn based on the longest distances or branches observed, and its placement is subjective.
• Therefore, feel free to alter "n_clusters" and visualize the clustering outcome.
• Let's now import agglomerative clustering and apply it to the "df_scaled" data frame.
• Under "AgglomerativeClustering," the number of desired clusters is set with the "n_clusters" parameter, and "affinity" specifies the metric used to compute the linkage.
• In this example, Euclidean distance was selected. The linkage criterion determines which distance to use between sets of observations.
• The "linkage" parameter can be set to (i) ward, (ii) average, (iii) complete or maximum, and (iv) single.
• According to scikit-learn library, "ward" minimizes the variance of the
clusters being merged.
• "average" uses the average of the distances of each observation of
the two sets.
• "complete" or "maximum" linkage uses the maximum distances
between all observations of the two sets.
• "single" uses the minimum of the distances between all observations
of the two sets.
• The default linkage of "ward" was used in this example.
• After defining the hierarchical clustering criteria under "HC," apply
"fit_predict()" to the standardized data set (df_scaled).
• Let’s convert the "df_scaled" to a data frame using panda’s "pd.Data-
Frame."
• Afterward, apply the silhouette coefficient to this data set as
illustrated below.
• As illustrated in Fig. 4.20, the silhouette coefficient on this data set is
high (close to 0.7).
• It is recommended to use the silhouette coefficient to gain insight into the clustering effectiveness on the data set, as sketched below.
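A sketch of the conversion and the silhouette calculation, assuming the "labels" array returned by "fit_predict()" above and the original column names:

import pandas as pd
from sklearn.metrics import silhouette_score

# Convert the scaled array back to a data frame with the original column names
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)

# Silhouette coefficient ranges from -1 to 1; values near 0.7 indicate well-separated clusters
print(silhouette_score(df_scaled, labels))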
• Next, let’s unstandardize the data and show the mean of each cluster:
• As illustrated above, the first cluster (cluster 0) represents low EUR/high GIP, the second cluster (cluster 1) represents high EUR/high GIP, the third cluster (cluster 2) represents medium EUR/medium GIP, the fourth cluster (cluster 3) represents high EUR/low GIP, and finally the last cluster (cluster 4) represents low EUR and low GIP.
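A sketch of how the unstandardized cluster means might be obtained, assuming the "scaler", "df_scaled", and "labels" objects defined above:

import pandas as pd

# Undo the standardization and attach the cluster labels
df_unscaled = pd.DataFrame(scaler.inverse_transform(df_scaled), columns=df.columns)
df_unscaled['cluster'] = labels

# Mean GIP and EUR/1000 ft per cluster, used to characterize clusters 0-4 as above
print(df_unscaled.groupby('cluster').mean())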
• Please note that the dendrogram in hierarchical clustering can be used to get a sense of the number of clusters to choose prior to applying k-means clustering.
• In other words, if unsure of the number of clusters prior to applying k-means clustering, a dendrogram can be used to estimate it (see the sketch below).
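A sketch of that workflow, assuming the number of clusters read off the dendrogram (five in this example) is passed to k-means:

from sklearn.cluster import KMeans

# k estimated from the number of branches the dendrogram cut crosses (five here)
km = KMeans(n_clusters=5, random_state=42)
km_labels = km.fit_predict(df_scaled)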
THANK YOU
