
Assignment 2: Hierarchical Clustering, Method of Cluster Analysis, Project Assignment


Overseer: Dr. Maged Mohammed Saeed Nasser

Anpabelt Trah Javala


Tabassum Kajlima Syafrin
Aria Radzika Pradayana
Muhammad Egan Renjiro

CONFIDENTIAL AND PROPRIETARY


Any use of this material without the specific permission of the group is strictly prohibited.
Assignment 2

Hierarchical clustering is a method of cluster analysis aimed at building a hierarchy of clusters.


This technique is widely utilized in diverse applications, such as information retrieval and
customer segmentation, owing to its simplicity and the intuitive nature of the dendrogram
visualization it offers. Generally, hierarchical clustering strategies are bifurcated into two main
types: Agglomerative (Bottom-Up) and Divisive (Top-Down).

Answer the following questions:

1. Detail the Agglomerative (Bottom-Up) Hierarchical Clustering Process: This


involves explaining each step in the Agglomerative Hierarchical clustering
approach, showcasing how it can be effectively utilized for data clustering.
Agglomerative clustering is a type of hierarchical clustering used to group objects into clusters based on their similarity. It is also known as AGNES (Agglomerative Nesting). The algorithm starts by treating each object as a singleton cluster; pairs of clusters are then merged repeatedly until all objects belong to one large cluster. The output is a tree-based representation of the objects known as a dendrogram.

Algorithm
In Agglomerative (Bottom-Up) Hierarchical Clustering, each object initially forms a single-element cluster (a leaf). At each step of the algorithm, the two most similar clusters are combined into a new, larger cluster (a node). This process is repeated until all points are members of one single cluster (the root).

Figure 1: The algorithm process of the Agglomerative (Bottom-Up) Hierarchical
Clustering.

The steps of the Agglomerative (Bottom-Up) Hierarchical Clustering:


1. Prepare the data.
In step 1, we need to make sure the data is a numeric matrix with rows representing observations (individuals) and columns representing variables. The R package used for clustering is loaded with library("cluster"). Then, we need to standardize the variables in the dataset before performing the subsequent analysis, as in the sketch below.
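A minimal R sketch of this preparation step (the built-in USArrests dataset is assumed here purely for illustration):

# Load the clustering package
library(cluster)

# Rows = observations, columns = numeric variables
df <- USArrests
df <- na.omit(df)   # remove rows with missing values
df <- scale(df)     # standardize each variable (mean 0, standard deviation 1)
head(df)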
2. Calculate the dissimilarity information between every pair of objects from
the dataset.
In step 2, we need to measure the dissimilarity between objects in order to decide which objects or clusters should be combined or divided. In R, we can use the ‘dist()’ function to calculate the distance between every pair of objects in the data matrix. The output is a distance (dissimilarity) matrix. By default, ‘dist()’ calculates the Euclidean distance between objects.

After calculating the distance between every pair of objects, we can inspect the distance information by reformatting the result of ‘dist()’ into a square matrix with the ‘as.matrix()’ function. In this matrix, the value in the cell at row i and column j is the distance between object i and object j in the original dataset, for example:
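A minimal sketch, continuing from the standardized matrix df above:

# Pairwise Euclidean distances between all observations
res.dist <- dist(df, method = "euclidean")

# Reformat the result as a square matrix to inspect individual distances
dist.mat <- as.matrix(res.dist)
dist.mat[1:4, 1:4]   # cell [i, j] holds the distance between object i and object j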
3. Use linkage function to categorize/group objects into hierarchical cluster
tree.
In step 3, the linkage function takes the distance information returned by the ‘dist()’ function and groups pairs of objects into clusters according to their similarity. The newly formed clusters are then linked to each other to form bigger clusters, and this process is repeated until all objects from the original dataset are linked together in a hierarchical tree. In R, we can use the ‘hclust()’ function to create the hierarchical tree; its method argument specifies the agglomeration (linkage) method used to calculate the distance between clusters and accepts one of “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid” (see the sketch after the list of linkage methods below).

Further explanation of the most common linkages methods:


- Maximum (complete linkage)
The maximum value of all pairwise distances between the elements in
cluster 1 and cluster 2.
- Minimum (single linkage)
The minimum value of all pairwise distances between the elements in
cluster 1 and cluster 2.
- Mean (average linkage)
The average distance between the elements in cluster 1 and cluster 2.
- Centroid linkage
The distance between the centroids of cluster 1 and cluster 2. A centroid is the average position of all the points in a cluster.
- Ward’s minimum variance method
Minimizes the total within-cluster variance. The pair of clusters with the
smallest distance between them are merged.
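A minimal sketch of this step, reusing the distance object res.dist from the sketch above and assuming Ward's method as the choice of linkage:

# Build the hierarchical tree; method can be any of the linkage names listed above
res.hc <- hclust(d = res.dist, method = "ward.D2")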

4. Dendrogram Visualization.
In step 4, after creating the hierarchical tree with the ‘hclust()’ function, we visualize it as a dendrogram. A dendrogram is the graphical, tree-based representation of the hierarchical tree created by ‘hclust()’. In R, a dendrogram can be drawn with the base ‘plot()’ function, passing the output of ‘hclust()’ as its argument. Additionally, we can use the ‘fviz_dend()’ function from the factoextra package (‘library(factoextra)’) to create a more visually appealing dendrogram, as in the sketch below.
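A minimal sketch of this step, assuming the tree res.hc from the previous sketch and that the factoextra package is installed:

# Base R dendrogram
plot(res.hc, cex = 0.6, hang = -1)

# A more visually appealing dendrogram with factoextra
library(factoextra)
fviz_dend(res.hc, cex = 0.6)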

Figure 2: Example of Cluster Dendrogram Visualization.

5. Decide where to cut the hierarchical tree into clusters.


In step 5, to identify sub-groups in the dendrogram, we can cut it at a specific height. Before doing so, we want to verify that the distances in the tree accurately represent the original distances.

Firstly, we calculate the correlation between the cophenetic distances and the original distances produced by the ‘dist()’ function in order to measure how well the cluster tree produced by ‘hclust()’ represents our data. A high correlation means that the linking of objects in the cluster tree closely reflects the distances between objects in the original distance matrix. The closer the correlation coefficient is to 1, the more accurately the clustering solution represents the data; values above 0.75 are generally considered good. The “average linkage” method tends to produce high values of this statistic.

Secondly, in R we use the ‘cophenetic()’ function to calculate the cophenetic distances of the hierarchical clustering. Alternatively, we can rebuild the tree with the average linkage method using ‘hclust()’ and then call ‘cophenetic()’ on the result to evaluate that clustering solution, as in the sketch below.
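A minimal sketch of this step, reusing res.dist and res.hc from the sketches above:

# Correlation between cophenetic distances and the original distances
res.coph <- cophenetic(res.hc)
cor(res.dist, res.coph)   # values above roughly 0.75 are usually considered good

# The same check for a tree built with average linkage
res.hc.avg <- hclust(res.dist, method = "average")
cor(res.dist, cophenetic(res.hc.avg))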

6. Cut the Dendrogram into different groups.


In step 6, hierarchical clustering by itself does not tell us how many clusters there are or where to cut the dendrogram to form clusters; however, we can cut the hierarchical tree at a specified height to divide our data into clusters. In R, the ‘cutree()’ function cuts a tree produced by ‘hclust()’ into multiple groups, either by specifying the desired number of groups or by specifying a cut height. It returns a vector indicating the cluster number of each observation. Lastly, the ‘fviz_dend()’ and ‘fviz_cluster()’ functions from the factoextra package can be used to visualize the result of the cut. The difference between them is that ‘fviz_dend()’ is used for hierarchical clustering methods and produces dendrograms, while ‘fviz_cluster()’ is also used for partitioning clustering methods (for example, k-means) and produces scatter plots. A sketch follows below.
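A minimal sketch of this step, with k = 4 as an assumed number of groups:

# Cut the tree into 4 groups and count the members of each cluster
grp <- cutree(res.hc, k = 4)
table(grp)

# Visualize the cut dendrogram and the resulting clusters (factoextra)
fviz_dend(res.hc, k = 4, rect = TRUE)
fviz_cluster(list(data = df, cluster = grp))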

Figure 3: Example of Plot Visualization using ‘fviz_dend()’ function.

Figure 4: Example of Plot Visualization produced by ‘hclust()’ and using ‘fviz_cluster()’ functions.

2. Utilize Suitable Data for a Case Study: Identify and apply the explained steps of the
Agglomerative Hierarchical clustering process (question 1) on a chosen dataset,
using it as a case study to implement the methodology.

In our case, we utilize the Student Performance Dataset (student-mat.csv). The objective
is to categorize students according to their performance in Mathematics through
Agglomerative Hierarchical Clustering. The dataset on Student Performance includes
details on students' academic progress and different characteristics. In clustering analysis,
our main focus will be on the students' Mathematics final grades (G3).

Process broken down into steps:

1. Get the data ready:


- Import the Dataset:
● Import the dataset into an appropriate data container, like a data frame in R, to facilitate simple data handling and examination.
● Examine the dataset to verify that it has been loaded accurately.
Inspect the initial rows to comprehend its layout and information,
and confirm the presence and proper formatting of the columns
regarding student performance in Mathematics.
● Pseudocode:
Check for NA values using is.na() and handle appropriately.
Impute missing values with mean or median, or remove rows using
dplyr library.
- Choose appropriate characteristics:
● Identify the important attribute for grouping data, specifically the
ultimate score in Math (G3).
● Generate a fresh dataset with just this specific feature to streamline
the analysis and concentrate on the most crucial data for clustering.

- Cleaning up data:
● Verify for any absent values in the dataset and manage them
accordingly. This may include replacing missing values with the
average or middle value of the column, or eliminating rows with
missing values.
● Identify any outliers that could impact the clustering outcomes.
Utilize graphical tools such as box plots to emphasize outliers.
After identifying the outliers, determine whether to eliminate or
modify them in order to minimize their influence on the clustering
procedure.
● Pseudocode:
Detects outliers using boxplot(). Handle outliers by removal or
transformation.

- Make the features consistent across the board.


● Normalize the characteristics so that they have an average of 0 and
a variance of 1. Standardization is important as it ensures that all
features have an equal impact on distance calculations, avoiding
any single feature from having too much influence because of scale
variations.
● Standardization consists of subtracting the average of each
characteristic and then dividing it by its standard deviation:

Figure 5: Standardization formula

● Pseudocode:
Standardize 'G3' using scale() function.
Calculate z-scores: z = (x - mean) / sd
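A small illustrative R sketch of the standardization, using made-up scores:

# Toy example: manual z-scores versus scale()
x <- c(10, 12, 14, 16, 18)
z <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), z)   # TRUE: scale() applies the same formula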

2. Determine the Dissimilarity Data:


- Calculation of Distance:
● Determine the Euclidean distances between every pair of student observations. The distance between two points x_i and x_j in p-dimensional space can be calculated using the Euclidean formula:

Figure 6: Euclidean Distance formula

● The outcome is a distance matrix, a symmetric matrix with each


element denoting the distance between two students. This matrix
measures the differences among every pair of observations and
serves as the foundation for the clustering procedure.
● Pseudocode:
Library stats()
Calculate Euclidean distance matrix using dist() function.
Formula: d(x_i, x_j) = sqrt(sum((x_ik - x_jk)^2))
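A small illustrative R sketch of the Euclidean distance, using made-up points:

# Distance between the points (1, 2) and (4, 6)
pts <- rbind(c(1, 2), c(4, 6))
dist(pts, method = "euclidean")   # sqrt((4 - 1)^2 + (6 - 2)^2) = 5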

3. Utilize Linkage Function for Constructing a Hierarchical Cluster Tree.


- Starting groups:
● Begin by treating every student as an individual cluster, so that at
first, each data point is seen as a separate cluster.
- Method of linking:

● Utilize Ward's minimum variance method on the distance matrix.


Ward's method aims to reduce the total within-cluster variance,
resulting in clusters that are more condensed and round in shape.

This approach is successful in forming clear groups and is
especially appropriate for categorizing student performance.
● Pseudocode:
Library stats()
Apply Ward's method using hclust() with method = "ward.D2".
Formula for Ward's method: d(A, B) = sqrt((2 * |A| * |B| / (|A| +
|B|)) * sum(d(x_i, x_j)^2))

- Process of grouping data into clusters.


● Merge clusters with the smallest distance pairwise in each step
using iterative merging. This includes:
1. Determining the distance between every combination of
clusters at every stage. In Ward's method, the distance
between two clusters is determined by the change in total
within-cluster variance when the two clusters are
combined.
2. Combining the two clusters with the lowest distance into a
single new cluster through merging.
3. Modifying the distance matrix to represent the merged
cluster's formation. The distance matrix is updated to
account for the distances between the cluster that was just
formed and all other clusters.
4. Continuing the process until all students are combined into
one cluster. This process of hierarchical merging leads to
the creation of a dendrogram, a tree-shaped structure that
visually displays the clustering process.
● Pseudocode:
Merge clusters iteratively:
- Calculate distance between all pairs of clusters.
- Merge pair with smallest distance.
- Update distance matrix.

- Repeat until all clusters are merged into one.

4. Visualization of a dendrogram:
- Creating the dendrogram:
● Observe the hierarchical clustering procedure by examining a
dendrogram. A dendrogram visually illustrates how clusters are
merged and the distances at which they are joined.
● The vertical axis on the dendrogram shows the distance or
dissimilarity where clusters merge, and the horizontal axis shows
individual students or clusters.
● Pseudocode:
Library stats()
Plot dendrogram using plot() function on hclust object.

- Explaining the Dendrogram:


● Clusters that merge closer together are more alike than those
merging further apart. The vertical lines in the dendrogram show
the distance between clusters.
● Identification of Optimal Clusters: Recognize notable changes in
vertical height within the dendrogram, showing distinct cluster
groupings. The best cluster number is found by identifying the
biggest vertical gap and dividing the dendrogram at that spot. This
stage requires examining the dendrogram visually to identify
where the biggest changes happen, showing a clear division
between clusters.
● Pseudocode:
Analyze dendrogram to identify significant jumps.
Identify the optimal number of clusters by finding the largest
vertical jump.

● Pseudocode:
Based on dendrogram and cophenetic correlation, determine the
optimal number of clusters.
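A small R sketch of how the largest vertical jump could be located programmatically, assuming a tree object named hc as produced by hclust() in the implementation for question 3:

merge_heights <- hc$height            # heights at which successive merges happen
jumps <- diff(merge_heights)          # height increase between consecutive merges
biggest <- which.max(jumps)           # position of the largest jump
# Cutting just below that merge yields (number of observations - biggest) clusters
suggested_k <- length(merge_heights) + 1 - biggest
suggested_k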

5. Make a choice on where to make divisions in the hierarchical tree.

- Examine the dendrogram:


● Determine a specific distance threshold to divide the dendrogram
and create relevant clusters. This includes identifying a horizontal
line that crosses the vertical lines of the dendrogram where there
are substantial increases in the merging distance. By cutting the
dendrogram at this particular height, you will obtain the best
number of clusters.
● Usually, when there is a significant gap in the merging distance, it signifies that the clusters formed before the gap are well separated and distinct.
● Pseudocode:
Determine threshold distance to cut dendrogram for meaningful
clusters.

- Correlation between the heights in a dendrogram and the original


dissimilarities.
● Determine the cophenetic correlation coefficient to evaluate how
accurately the clustering maintains the distances between
observations. This includes:
1. Determining the cophenetic distance matrix, which records the height in the dendrogram at which two observations are first combined.
2. Examining the distance matrix against the cophenetic
distance matrix. The cophenetic correlation coefficient
measures how accurately the clustering represents the

original distances by calculating the correlation between
the two matrices.
3. A high cophenetic correlation value (near 1) signifies a
successful clustering solution. This implies that the
hierarchical clustering effectively represents the pairwise
distances among observations while maintaining the initial
data organization.
● Pseudocode:
Library stats
Calculate cophenetic correlation using cophenetic() function.
Formula: c = sum((d_ij - mean_d)(c_ij - mean_c)) / sqrt(sum((d_ij - mean_d)^2) * sum((c_ij - mean_c)^2))


6. Divide the Dendrogram into Various Groups:


- Developing groups:
● Slice the dendrogram at the specified height to create the required
clusters. This stage allocates every student to a particular cluster
depending on the cut height chosen. The clusters show groups of
students who are more alike to each other than to students in
different clusters.
● The clusters that are formed consist of students who have similar
characteristics regarding their Mathematics performance.
● Pseudocode:
Library stats
Cut dendrogram at chosen height using cutree() function.
Assign each student to a specific cluster.

- Group Identifiers:
● Include the cluster labels in the original dataset by generating a
new column that specifies the cluster assignment for every student.
This allows for more in-depth analysis and understanding by
linking individual students to particular clusters.
● Pseudocode:
Library dplyr
Add cluster labels to the original dataset as a new column.

3. Implement Using R Code: Create R code to execute the clustering process as
outlined in the case study (question 2).

# Install and load necessary library


install.packages("dplyr")
library(dplyr)

# Load the dataset


student_data <- read.csv("C:/Users/Working Directory/student-mat.csv")

# Inspect the initial rows of the DataFrame


print(head(student_data))

# Check for NA values and handle them


print(summary(student_data))

# Impute missing values with mean or median, or remove rows
student_data <- student_data %>%
  mutate(across(everything(), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Verify if NA values are handled


print(summary(student_data))

# Choose the attribute for grouping data, specifically G3


# Create a new dataset with just the G3 feature
g3_data <- student_data %>% select(G3)
print(head(g3_data))

# Identify and handle outliers using a boxplot


boxplot(g3_data$G3, main="Boxplot for G3", sub="Detecting
outliers")

Q1 <- quantile(g3_data$G3, 0.25)


Q3 <- quantile(g3_data$G3, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR


upper_bound <- Q3 + 1.5 * IQR

g3_data <- g3_data %>% filter(G3 >= lower_bound & G3 <=


upper_bound)

# Verify the cleaned data
boxplot(g3_data$G3, main="Boxplot for G3 after removing
outliers")
print(head(g3_data))

# Standardize 'G3' using the scale() function


g3_data <- g3_data %>% mutate(G3_standardized =
as.vector(scale(G3)))

# Verify the standardized data


print(head(g3_data))
print(summary(g3_data$G3_standardized))

# Calculate the Euclidean distance matrix using the standardized G3
distance_matrix <- dist(g3_data$G3_standardized, method = "euclidean")

# Apply Ward's method using hclust() with method = "ward.D2"
hc <- hclust(distance_matrix, method = "ward.D2")

# Plot the hierarchical cluster tree


plot(hc, main="Hierarchical Clustering Dendrogram (Ward's Method)", xlab="Students", ylab="Height")


# Add rectangles to the dendrogram to highlight clusters


# Specify the number of clusters, for example, 4 clusters
rect.hclust(hc, k=4, border="red")

# Analyze dendrogram to identify significant jumps


print("Analyzing dendrogram for significant jumps...")

# Calculate cophenetic correlation (optional but useful for cluster validity)
cophenetic_corr <- cor(cophenetic(hc), distance_matrix)

# Based on the dendrogram, visually identify the optimal number of clusters
# The optimal number of clusters is often found at the largest vertical jump in the dendrogram

# For example, we print the heights at which the clusters merge
merge_heights <- hc$height
print(merge_heights)

# Plot the merge heights to help identify significant jumps
plot(merge_heights, type="h", main="Heights of Cluster Merges", xlab="Merges", ylab="Height")

# Determine the threshold distance to cut the dendrogram for meaningful clusters
# Here we choose a threshold manually based on visual inspection
threshold <- 10

# Cut the dendrogram at the chosen threshold


clusters <- cutree(hc, h = threshold)

# Add cluster membership to the original data


student_data <- student_data %>% mutate(Cluster =
clusters)

# View the data with cluster membership


print(head(student_data))

# Save the updated dataset with cluster labels


write.csv(student_data, "C:/Users/Working Directory/student-mat.csv", row.names = FALSE)

# Analyze the distribution of G3 scores within each cluster
boxplot(G3 ~ Cluster, data=student_data, main="G3 Scores by Cluster", xlab="Cluster", ylab="G3 Score")

4. Analyze and Discuss the Results: Conduct a thorough analysis and discussion on the
outcomes derived from the R code implementation (question 3).

From here, we separate the code into sections to understand them better. The sections are as follows:

1. Load library and dataset

install.packages("dplyr")
library(dplyr)

student_data <- read.csv("C:\\Users\\regul\\Downloads\\student-mat.csv")
print(head(student_data))

The script begins by loading the necessary dplyr package, which is used for data manipulation in R. Two commands deal with packages: install.packages("package name") installs the package so that it is available, and library(package name) loads its functions into the current R session.

read.csv("C:\\Users\\regul\\Downloads\\student-mat.csv") is used to load the dataset CSV file containing data on the students' performance. print(head(student_data)) displays the first rows of the dataset, showing the structure of the student data.

2. Handle NA values

print(summary(student_data))
student_data <- student_data %>% mutate(across(everything(),
~ifelse(is.na(.), mean(., na.rm = TRUE), .)))
print(summary(student_data))

The summary(student_data) function is used to generate summary statistics for each column,
including the presence of any missing (NA) values. To handle missing values, the script imputes
them with the mean of the respective column using mutate(across(everything(), ~ifelse(is.na(.),

mean(., na.rm = TRUE), .))). The updated summary is printed to confirm that all NA values have
been addressed.

3. Picking G3 as attribute to group data

g3_data <- student_data %>% select(G3)


print(head(g3_data))

The script isolates the G3 column, which represents the students' final Mathematics grade, by selecting it from the dataset using select(G3). The first few rows of this new dataset are printed with print(head(g3_data)), providing an initial look at the G3 data for further analysis.

4. Identify and handle outliers

boxplot(g3_data$G3, main="Boxplot for G3", sub="Detecting


outliers")

Q1 <- quantile(g3_data$G3, 0.25)


Q3 <- quantile(g3_data$G3, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

g3_data <- g3_data %>% filter(G3 >= lower_bound & G3 <=


upper_bound)
boxplot(g3_data$G3, main="Boxplot for G3 after removing
outliers")
print(head(g3_data))

A boxplot of the G3 scores is created using boxplot(g3_data$G3, main="Boxplot for G3",


sub="Detecting outliers") to visually identify outliers. The script calculates the interquartile
range (IQR) and determines the lower and upper bounds for acceptable values. It then filters out
any values outside these bounds, effectively removing outliers. The cleaned data is visualized
with another boxplot, and the first few rows of the cleaned data are printed for verification.

5. Standardize G3

g3_data <- g3_data %>% mutate(G3_standardized =


as.vector(scale(G3)))
print(head(g3_data))
print(summary(g3_data$G3_standardized))

The script standardizes the G3 scores to have a mean of 0 and a standard deviation of 1 using the
scale() function. This is done by adding a new column G3_standardized to the dataset. The first
few rows of the standardized data and its summary statistics are printed, showing the
transformation of the G3 scores.

6. Calculate Euclidean Distance Matrix and apply Ward’s Method

distance_matrix <- dist(g3_data$G3_standardized, method = "euclidean")
hc <- hclust(distance_matrix, method = "ward.D2")
plot(hc, main="Hierarchical Clustering Dendrogram (Ward's Method)", xlab="Students", ylab="Height")

The first task is the calculation of a distance matrix from the standardized G3 scores using the Euclidean distance. Hierarchical clustering is then performed with Ward's method via hclust(distance_matrix, method = "ward.D2"). The dendrogram is plotted, showing the heights at which the different clusters are merged during the agglomeration process.

7. Highlight clusters on the Dendrogram

rect.hclust(hc, k=4, border="red")

print("Analyzing dendrogram for significant jumps...")

In the dendrogram, rectangles are drawn around the clusters, with the number of clusters set to four (rect.hclust(hc, k=4, border="red")). This helps identify the groups that are formed when the output is displayed as a dendrogram.

From the results of the hierarchical clustering dendrogram, the four distinguished groups can be evaluated in relation to the students' results (G3). Cluster 1 consists of the low achievers, whose G3 scores hover below average. Some of them may have problems doing well in their classes or other factors that affect their performance. These students could benefit from programs such as supplementary lessons or individual education plans.

The second cluster of students obtains slightly higher G3 scores than the first cluster, but in general their scores are still below the median. This group consists of learners who are average performers in their class. They perform better than the previous cluster but are still relatively low achievers. Tutoring or group work, for example, might help these students achieve better grades.

The third cluster obtains higher G3 scores, meaning that the students' performance is strong and consistent with the curriculum. Such students could be offered more advanced classes or encouraged to engage in extra learning activities to challenge them fully and keep them on the right academic track.

The last cluster includes the students with the best G3 outcomes, which gives evidence of the highest academic ability. These students are likely to be at or near the top of their classes, depending on the grading system. They could be selected for leadership positions, peer mentoring, or special award programs.

8. Calculate Cophenetic Correlation and analyze cluster merges

cophenetic_corr <- cor(cophenetic(hc), distance_matrix)


merge_heights <- hc$height
print(merge_heights)
plot(merge_heights, type="h", main="Heights of Cluster Merges",
xlab="Merges", ylab="Height")

The cophenetic correlation coefficient, which quantifies how well the dendrogram has captured the inter-point distances of the original data, is computed with cor(cophenetic(hc), distance_matrix). The heights at which clusters merge are printed and also plotted so that the significant jumps in the dendrogram can be identified, which acts as a guide for choosing the number of clusters. A brief interpretation sketch follows below.
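As a small follow-up sketch that is not part of the original script, the coefficient could be printed and compared against the 0.75 rule of thumb mentioned in question 1:

print(cophenetic_corr)
if (cophenetic_corr > 0.75) {
  message("Cophenetic correlation above 0.75: the tree represents the data well")
} else {
  message("Cophenetic correlation below 0.75: consider another linkage method")
}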

9. Cut the Dendrogram and analyze cluster

threshold <- 10
clusters <- cutree(hc, h = threshold)
student_data <- student_data %>% mutate(Cluster = clusters)
print(head(student_data))
write.csv(student_data, "C:/Users/Working Directory/student-mat.csv", row.names = FALSE)

Based on the appearance of the dendrogram, a cut height (threshold) is chosen and the dendrogram is cut at that distance to form the clusters. The script adds the cluster memberships to the original data and then prints the first few rows of the resulting table, which now contains a new Cluster column. The updated dataset is saved to a CSV file for further use.

The printed output shows the first six students with all their attribute columns plus the new Cluster column. The grade-related columns and the assigned cluster for these rows are:

   G1 G2 G3 Cluster
1   5  6  6       1
2   5  5  6       1
3   7  8 10       1
4  15 14 15       2
5   6 10 10       1
6  15 15 15       2

10. Analyze distribution of G3 scores within each cluster

boxplot(G3 ~ Cluster, data=student_data, main="G3 Scores by Cluster", xlab="Cluster", ylab="G3 Score")

Finally, the distribution of G3 scores within each cluster is examined with a boxplot (boxplot(G3 ~ Cluster, data=student_data, main="G3 Scores by Cluster", xlab="Cluster", ylab="G3 Score")). This visualization shows how the final grades differ from cluster to cluster, giving additional insight into each cluster group. A short numerical summary per cluster is sketched below.
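As an optional follow-up that is not part of the original script, a short numerical summary per cluster can complement the boxplot, using only base R:

# Number of students in each cluster
table(student_data$Cluster)

# Average final grade (G3) per cluster
aggregate(G3 ~ Cluster, data = student_data, FUN = mean)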

