Hierarchical Clustering and Data Science Group Project - Assignment 2
Algorithm
In Agglomerative (Bottom-Up) Hierarchical Clustering, each object initially forms its own single-element cluster (a leaf). At each step of the algorithm, the two most similar clusters are merged into a new, larger cluster (a node). This process is repeated until all points belong to one single cluster (the root).
Figure 1: The algorithm process of the Agglomerative (Bottom-Up) Hierarchical
Clustering.
After calculating the distance between every pair of objects in the dataset, we can inspect the distance information by reformatting the results of the ‘dist()’ function into a matrix using the ‘as.matrix()’ function.
In this matrix, the value in the cell formed by row i and column j indicates the distance between object i and object j in the original dataset.
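As a minimal sketch of these two steps (using a small, made-up data frame purely for illustration):

# Example data: a small numeric data frame with made-up values
df <- data.frame(x = c(1, 4, 6, 10), y = c(2, 3, 8, 9))

# Compute pairwise Euclidean distances between the rows
res_dist <- dist(df, method = "euclidean")

# Reformat the 'dist' object into a square matrix: the cell at
# row i, column j holds the distance between objects i and j
dist_matrix <- as.matrix(res_dist)
print(round(dist_matrix, 2))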
3. Use the linkage function to group objects into a hierarchical cluster tree.
In step 3, the linkage function takes the distance information returned by the ‘dist()’ function and merges pairs of objects into clusters according to their similarity. The newly formed clusters are then linked to each other to form bigger clusters, and this process is repeated until all the objects from the original dataset are linked together in a hierarchical tree. In RStudio, we can use the ‘hclust()’ function to create the hierarchical tree; its agglomeration (linkage) method, which determines how the distance between clusters is calculated, accepts one of “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”.
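Continuing the sketch above (‘res_dist’ is the distance object from the previous snippet, and “average” is just one of the allowed linkage methods):

# Build the hierarchical tree from the distance information
hc <- hclust(res_dist, method = "average")

# 'merge' records which clusters were combined at each step,
# and 'height' the distance at which each merge occurred
print(hc$merge)
print(hc$height)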
4. Dendrogram Visualization.
In step 4, after creating the hierarchical tree with the ‘hclust()’ function, we need to visualize it as a dendrogram. A dendrogram is the graphical, tree-based representation of the hierarchy created by the ‘hclust()’ function. In RStudio, a dendrogram can be visualized using the base function ‘plot()’, passing the output of the ‘hclust()’ function as its argument. Additionally, we can use the ‘fviz_dend()’ function from the factoextra package (‘library(factoextra)’) to create a more visually appealing dendrogram.
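For example, assuming the ‘hc’ object from the sketch above:

# Base R dendrogram: pass the hclust output directly to plot()
plot(hc, cex = 0.8, main = "Cluster Dendrogram")

# A more visually appealing dendrogram with factoextra
# (install.packages("factoextra") once if it is not yet installed)
library(factoextra)
fviz_dend(hc, k = 2,      # cut into 2 groups for colouring
          cex = 0.8,      # label size
          rect = TRUE)    # draw rectangles around the groups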
Firstly, we need to calculate the correlation between the cophenetic distances and the original distances produced by the ‘dist()’ function in order to measure how faithfully the cluster tree produced by the ‘hclust()’ function represents our data. If this correlation is high, the linking of objects in the cluster tree reflects the distances between objects in the original distance matrix well. To judge how accurately the clustering solution represents the data, observe how close the correlation coefficient is to 1; values above 0.75 are generally considered good. The “average linkage” method tends to produce high values of this statistic.
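A short sketch of this check, again using the ‘hc’ and ‘res_dist’ objects from the earlier snippets:

# Cophenetic distances: the heights at which pairs of objects
# are first joined in the tree produced by hclust()
res_coph <- cophenetic(hc)

# Correlate the cophenetic distances with the original distances;
# values above roughly 0.75 are usually taken as a good fit
cor(res_dist, res_coph)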
Figure 3: Example of Plot Visualization using ‘fviz_dend()’ function.
2. Utilize Suitable Data for a Case Study: Identify and apply the explained steps of the
Agglomerative Hierarchical clustering process (question 1) on a chosen dataset,
using it as a case study to implement the methodology.
In our case, we utilize the Student Performance Dataset (student-mat.csv). The objective
is to categorize students according to their performance in Mathematics through
Agglomerative Hierarchical Clustering. The dataset on Student Performance includes
details on students' academic progress and different characteristics. In clustering analysis,
our main focus will be on the students' Mathematics final grades (G3).
- Cleaning up data:
● Check for any missing values in the dataset and handle them
accordingly. This may involve replacing missing values with the
mean or median of the column, or removing rows with missing
values.
● Identify any outliers that could distort the clustering
outcomes. Use graphical tools such as box plots to highlight
them. After identifying the outliers, decide whether to remove
or transform them in order to minimize their influence on the
clustering procedure.
● Pseudocode:
Detect outliers using boxplot(). Handle outliers by removal or
transformation.
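A possible R sketch of this pseudocode is shown below; it assumes ‘student_data’ has already been loaded, and the 1.5 * IQR rule is just one common convention for flagging outliers:

# Visualize the spread of the final grade to flag potential outliers
boxplot(student_data$G3, main = "Boxplot of G3")

# Treat points outside 1.5 * IQR as outliers and drop them
q <- quantile(student_data$G3, c(0.25, 0.75), na.rm = TRUE)
lower <- q[1] - 1.5 * (q[2] - q[1])
upper <- q[2] + 1.5 * (q[2] - q[1])
student_data <- student_data[student_data$G3 >= lower &
                             student_data$G3 <= upper, ]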
Figure 5: Standardization formula
● Pseudocode:
Standardize 'G3' using scale() function.
Calculate z-scores: z = (x - mean) / sd
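A one-line R sketch of this pseudocode, assuming the G3 column has been isolated into ‘g3_data’ as in the later script:

# scale() applies z = (x - mean) / sd, giving mean 0 and sd 1
g3_data$G3_standardized <- as.numeric(scale(g3_data$G3))
summary(g3_data$G3_standardized)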
Figure 6: Euclidean Distance formula
This approach is successful in forming clear groups and is
especially appropriate for categorizing student performance.
● Pseudocode:
Use base package stats (attached by default in R).
Apply Ward's method using hclust() with method = "ward.D2".
Ward's merging cost: d(A, B) = (|A| * |B| / (|A| + |B|)) * ||c_A - c_B||^2, where c_A and c_B are the centroids of clusters A and B.
- Repeat until all clusters are merged into one.
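A compact sketch of this clustering step, using the object names that appear in the later script (‘g3_data’ with its standardized column):

# Distance matrix on the standardized grades, then Ward's method;
# hclust() is in the base 'stats' package, loaded by default
distance_matrix <- dist(g3_data$G3_standardized, method = "euclidean")
hc <- hclust(distance_matrix, method = "ward.D2")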
4. Visualization of a dendrogram:
- Creating the dendrogram:
● Observe the hierarchical clustering procedure by examining a
dendrogram. A dendrogram visually illustrates how clusters are
merged and the distances at which they are joined.
● The vertical axis on the dendrogram shows the distance or
dissimilarity where clusters merge, and the horizontal axis shows
individual students or clusters.
● Pseudocode:
Use base package stats (attached by default in R).
Plot the dendrogram using plot() on the hclust object.
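For example, given the ‘hc’ object built above:

# Plot the dendrogram; leaf labels are suppressed because one
# label per student would be unreadable
plot(hc, labels = FALSE,
     main = "Hierarchical Clustering Dendrogram",
     xlab = "Students", ylab = "Height")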
● Pseudocode:
Based on dendrogram and cophenetic correlation, determine the
optimal number of clusters.
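One way to sketch this in R (the jump-in-height heuristic and k = 4 are illustrative choices, not fixed rules):

# Inspect the largest jumps in merge height; a big jump suggests
# a natural place to stop merging
tail(diff(hc$height), 5)

# Highlight a candidate solution, e.g. k = 4, on the dendrogram
plot(hc, labels = FALSE)
rect.hclust(hc, k = 4, border = "red")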
2. Compare the cophenetic distances with the original distances
by calculating the correlation between the two matrices.
3. A high cophenetic correlation value (near 1) signifies a
successful clustering solution. This implies that the
hierarchical clustering effectively represents the pairwise
distances among observations while maintaining the initial
data organization.
● Pseudocode:
Use base package stats (attached by default in R).
Calculate the cophenetic correlation using the cophenetic() function.
Formula: c = sum((d_ij - mean_d) * (c_ij - mean_c)) / sqrt(sum((d_ij - mean_d)^2) * sum((c_ij - mean_c)^2))
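In R this reduces to two calls, assuming the ‘hc’ and ‘distance_matrix’ objects from the clustering step:

# cophenetic() returns the tree-implied distances; correlating them
# with the original distances gives the cophenetic correlation
coph <- cophenetic(hc)
print(cor(distance_matrix, coph))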
- Group Identifiers:
● Include the cluster labels in the original dataset by generating a
new column that specifies the cluster assignment for every student.
This allows for more in-depth analysis and understanding by
linking individual students to particular clusters.
● Pseudocode:
Load package dplyr.
Add cluster labels to the original dataset as a new column.
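A brief sketch of this step (k = 4 follows the dendrogram analysis above; the vector returned by cutree() must have the same length as the rows of the data being labeled):

library(dplyr)

# Cut the tree into four groups and attach the labels to the data
clusters <- cutree(hc, k = 4)
student_data <- student_data %>% mutate(Cluster = clusters)
head(student_data$Cluster)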
3. Implement Using R Code: Create R code to execute the clustering process as
outlined in the case study (question 2).
# Verify the cleaned data
boxplot(g3_data$G3, main="Boxplot for G3 after removing outliers")
print(head(g3_data))
# For example, we print the heights at which the clusters merge
merge_heights <- hc$height
print(merge_heights)
4. Analyze and Discuss the Results: Conduct a thorough analysis and discussion on the
outcomes derived from the R code implementation (question 3).
From here, we separate the code into sections to understand each part better. The sections are as follows:
1. Load packages and data
install.packages("dplyr")
library(dplyr)
student_data <- read.csv("C:\\Users\\regul\\Downloads\\student-mat.csv")
print(head(student_data))
The script begins by loading the necessary dplyr package, which is used for data manipulation in R. Two commands deal with packages: install.packages("package name") installs the package so that it is available, and library(package name) loads its functions into the current R session.
read.csv("C:\\Users\\regul\\Downloads\\student-mat.csv") is used to load the dataset CSV file containing data on the performance of the students. print(head(student_data)) displays the first rows of the dataset, giving a first look at its structure.
2. Handle NA values
print(summary(student_data))
student_data <- student_data %>% mutate(across(everything(),
~ifelse(is.na(.), mean(., na.rm = TRUE), .)))
print(summary(student_data))
The summary(student_data) function is used to generate summary statistics for each column,
including the presence of any missing (NA) values. To handle missing values, the script imputes
them with the mean of the respective column using mutate(across(everything(), ~ifelse(is.na(.),
mean(., na.rm = TRUE), .))). The updated summary is printed to confirm that all NA values have
been addressed.
The script isolates the G3 column, which represents the final grade of students, by selecting it from the dataset using select(G3). The first few rows of this new dataset are printed with print(head(g3_data)), providing an initial look at the G3 data for further analysis.
5. Standardize G3
The script standardizes the G3 scores to have a mean of 0 and a standard deviation of 1 using the
scale() function. This is done by adding a new column G3_standardized to the dataset. The first
few rows of the standardized data and its summary statistics are printed, showing the
transformation of the G3 scores.
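The code for this step is not reproduced above; a sketch consistent with the surrounding script might look like this:

# Add a standardized version of G3 (mean 0, sd 1) next to the raw scores
g3_data <- g3_data %>% mutate(G3_standardized = as.numeric(scale(G3)))
print(head(g3_data))
print(summary(g3_data$G3_standardized))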
"euclidean")
hc <- hclust(distance_matrix, method = "ward.D2")
plot(hc, main="Hierarchical Clustering Dendrogram (Ward's
Method)", xlab="Students", ylab="Height")
The first task is the calculation of a distance matrix from the standardized G3 scores using the Euclidean distance. The hierarchical clustering is then carried out with Ward's method via hclust(distance_matrix, method = "ward.D2"). The dendrogram is plotted, representing the heights at which the different clusters are merged during the agglomeration process.
print("Analyzing dendrogram for significant jumps...")
In the dendrogram, rectangles are placed around the clusters and it is decided that there should be
four clusters (rect. hclust(hc, k=4, border=”red”)). It assists in finding out the groups that are
formed when the output is portrayed in the dendrogram format.
From the results of the hierarchical clustering dendrogram, it is therefore possible to evaluate the four groups distinguished by the students' results (G3). Cluster 1 consists of the low achievers, whose G3 scores are low and hover below the average. Some of them may have problems performing well in their classes, or other factors may dictate their performance. These students could benefit from programs such as supplementary lessons or individual education plans.
The second cluster of students obtains slightly higher G3 scores than the first cluster, but their scores are generally still below the median. This group consists of learners who are average performers in their class: they perform better than the previous cluster but are still low performers. Tutoring or group work, for example, might help these students achieve better grades.
The third cluster obtains higher G3 scores, meaning that the students' performance is strong and consistent with the curriculum. Such students could be offered more advanced classes or encouraged to engage in extra learning activities to challenge them fully and keep them on the right academic track.
The last cluster includes the students with the best outcomes on G3, which gives evidence of the highest academic ability. These students are likely earning top results, probably among the first in their classes or close to it depending on the grading system. They could be selected for leadership positions, peer mentoring, or special award programs.
The cophenetic correlation coefficient, which quantifies how well the dendrogram captures the inter-point distances of the original data, is computed with cor(cophenetic(hc), distance_matrix). The heights at which clusters merge are printed and also plotted so that significant jumps in the dendrogram can be identified, which serves as a guide for choosing the number of clusters.
9. Cut the Dendrogram and analyze cluster
threshold <- 10
clusters <- cutree(hc, h = threshold)
student_data <- student_data %>% mutate(Cluster = clusters)
print(head(student_data))
write.csv(student_data, "C:/Users/Working Directory/student-mat.csv", row.names = FALSE)
Based on the appearance of the dendrogram, a cutting height is selected at which the dendrogram is cut to form the clusters. The script assigns the cluster memberships to the original data and then prints the first rows of the new table, in which a new Cluster column has been added. The updated dataset is stored in a CSV file for further use.
  school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason guardian traveltime studytime
1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course   mother          2         2
2     GP   F  17       U     GT3       T    1    1  at_home    other     course   father          1         2
3     GP   F  15       U     LE3       T    1    1  at_home    other      other   mother          1         2
4     GP   F  15       U     GT3       T    4    2   health services       home   mother          1         3
5     GP   F  16       U     GT3       T    3    3    other    other       home   father          1         2
6     GP   M  16       U     LE3       T    4    3 services    other reputation   mother          1         2
  failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc
1        0       yes     no   no         no     yes    yes       no       no      4        3     4    1    1
2        0        no    yes   no         no      no    yes      yes       no      5        3     3    1    1
3        3       yes     no  yes         no     yes    yes      yes       no      4        3     2    2    3
4        0        no    yes  yes        yes     yes    yes      yes      yes      3        2     2    1    1
5        0        no    yes  yes         no     yes    yes       no       no      4        3     2    1    2
6        0        no    yes  yes        yes     yes    yes      yes       no      5        4     2    1    2
  health absences G1 G2 G3 Cluster
1      3        6  5  6  6       1
2      3        4  5  5  6       1
3      3       10  7  8 10       1
4      5        2 15 14 15       2
5      5        4  6 10 10       1
6      5       10 15 15 15       2
Finally, the distribution of G3 scores within each cluster is shown with a boxplot: boxplot(G3 ~ Cluster, data=student_data, main="G3 Scores by Cluster", xlab="Cluster", ylab="G3 Score"). This visualization is useful for identifying how the final grades differ from cluster to cluster, which gives some insight into each cluster group.