
Hierarchical Clustering in R Programming
Introduction
In the vast area of data analysis and machine learning, hierarchical clustering stands out as a powerful technique for grouping individuals or objects based on their similarities. Combined with the versatility and efficiency of the R programming language, it becomes an even more invaluable tool for uncovering hidden patterns and structures within large datasets. In this article, we will explore what hierarchical clustering entails, dive into its types, illustrate it with a practical example, and provide a code implementation in R.
Hierarchical Clustering
Hierarchical clustering is an unsupervised learning algorithm that aims to create clusters by iteratively merging or dividing similar entities based on predetermined distance metrics. Unlike other methods like k-means clustering, where we need to define the number of desired clusters beforehand, hierarchical clustering constructs a tree-like structure called a dendrogram that can be cut at a certain height to obtain multiple cluster solutions.
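To make the dendrogram-cutting idea concrete, here is a minimal sketch (using simulated data of our own, not the article's dataset) showing how `cutree()` can extract clusters either at a chosen height or for a chosen number of clusters:
set.seed(1)
x <- matrix(rnorm(50), ncol = 2)   # 25 illustrative points
hc <- hclust(dist(x))              # complete linkage by default
plot(hc)                           # draw the dendrogram
cutree(hc, h = 2)                  # cut the tree at height 2
cutree(hc, k = 4)                  # or request exactly 4 clusters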
Types of Hierarchical Clustering
There are two main approaches when it comes to hierarchical clustering:
Agglomerative (bottom-up): This method starts by treating each data point as its own cluster and successively merges the most similar clusters until one big cluster contains all data points. The choice of linkage criterion plays a crucial role here.
Divisive (top-down): The reverse of the agglomerative approach; divisive hierarchical clustering begins with one massive cluster containing all data points and recursively splits it into smaller subclusters until each observation forms its own cluster. The sketch below contrasts the two approaches.
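As a brief illustration of the two approaches, the following sketch (with simulated data of our own) pairs base R's `hclust()` for agglomerative clustering with `diana()` from the cluster package for divisive clustering:
library(cluster)   # provides diana() for divisive clustering

set.seed(2)
x <- matrix(rnorm(40), ncol = 2)   # 20 illustrative points

# Agglomerative (bottom-up): base R hclust()
hc_agg <- hclust(dist(x), method = "complete")
plot(hc_agg, main = "Agglomerative")

# Divisive (top-down): diana() from the cluster package
hc_div <- diana(x, metric = "euclidean")
pltree(hc_div, main = "Divisive")  # dendrogram of the diana fit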
Implementing Hierarchical Clustering in R
In the example below, hierarchical clustering is implemented using the Manhattan distance as the dissimilarity metric.
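As a quick worked illustration of the metric itself (our own snippet, not part of the article's example), the Manhattan distance between two points is the sum of the absolute coordinate differences:
a <- c(1, 2)
b <- c(4, 6)
sum(abs(a - b))                           # |1-4| + |2-6| = 3 + 4 = 7
dist(rbind(a, b), method = "manhattan")   # same result via dist()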
Algorithm
Step 1: To start with, load the sample dataset.
Step 2: Preprocess the data. Before clustering, we may need to standardize variables or handle missing values if present (a short sketch of this step follows the list).
Step 3: Compute the dissimilarity, i.e., the distance between observations, using a chosen metric such as Euclidean or Manhattan distance.
Step 4: Create the hierarchical clusters. Once the distance matrix is ready, perform hierarchical clustering using the `hclust()` function in R.
Step 5: The resulting object (named "Hierar_cl" in the example below) stores all the information required for subsequent steps.
Step 6: Plot the dendrogram to visualize the clusters.
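The following minimal sketch illustrates Steps 2 through 4 on a small made-up data frame (the data and variable names here are our own, not part of the article's example):
df <- data.frame(height = c(150, 160, NA, 172),
                 weight = c(55, 62, 70, 80))    # made-up data
df <- na.omit(df)                    # Step 2: handle missing values
df_scaled <- scale(df)               # Step 2: standardize variables
d <- dist(df_scaled, method = "manhattan")   # Step 3: distance matrix
hc <- hclust(d, method = "average")          # Step 4: build the hierarchy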
Example
library(cluster)

# Create a sample dataset
set.seed(123)
x <- matrix(rnorm(100), ncol = 5)

# Finding the distance matrix
distance_mat <- dist(x, method = 'manhattan')
distance_mat

# Fitting the hierarchical clustering model
# to the dataset
set.seed(240)  # Setting seed
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl

# Plotting the dendrogram
plot(Hierar_cl)

# Choosing the number of clusters
# Cutting the tree by height
abline(h = 5.5, col = "green")

# Cutting the tree by number of clusters
fit <- cutree(Hierar_cl, k = 3)
fit
table(fit)

# Drawing borders around the k = 3 clusters on the dendrogram
rect.hclust(Hierar_cl, k = 3, border = "red")
Output
           1        2        3        4        5        6        7        8
2   2.928416
3   3.820964 4.579451
4   5.870407 3.963824 6.920070
5   4.712898 3.357644 5.192501 2.041704
6   3.724940 5.188503 2.298511 7.529122 6.906090
7   4.378470 3.603915 6.073011 6.448175 6.242628 5.408591
8   2.909887 2.270199 5.993941 6.134220 5.627842 5.830570 3.531025
9   2.545686 4.500523 5.703258 5.466749 3.856739 6.130000 6.203371 4.746715
10  5.861279 5.127758 9.368977 8.324293 8.236305 8.704556 3.295965 5.013834
11  6.085281 3.179450 4.827798 5.101238 4.895692 5.436850 2.873268 4.584566
12  5.590643 2.816420 6.061696 4.307201 3.803264 6.670747 4.914120 4.986817
13  4.435100 3.563707 6.249931 6.596065 5.579229 5.771611 1.893134 3.980554
14  6.402935 4.233552 6.619402 4.029210 2.452428 8.633337 5.044223 5.761055
15  4.785512 2.544131 7.087874 5.618391 5.530404 7.696926 2.610109 3.284796
16  7.566986 7.161528 6.313095 7.075204 5.756949 6.740845 5.668959 7.578376
17  6.380095 4.860359 5.530564 7.704619 7.499072 6.327322 3.290148 5.335482
18  6.818578 4.550758 9.130209 5.378824 5.184404 9.739261 6.419430 4.339261
19  3.282054 2.655900 3.616115 5.178834 3.243561 4.331110 3.510374 3.815444
20  3.236401 2.604102 5.008448 5.395216 3.577502 6.633752 5.481153 3.662209
           9       10       11       12       13       14       15       16
2
3
4
5
6
7
8
9
10  6.099033
11  6.734916 5.530202
12  6.804009 7.368849 4.130914
13  4.591857 3.200043 3.839703 4.870051
14  5.335886 6.919997 4.519932 5.595756 4.300131
15  6.393326 3.634753 3.218255 4.662948 4.028726 4.327200
16  5.414681 7.171170 5.198525 8.439225 4.010792 3.776011 7.892429
17  8.925781 6.482177 4.055843 6.170564 5.183282 6.868406 3.874253 8.586946
18  6.552868 7.159270 5.279163 5.245416 7.472245 6.473367 3.809321 9.815098
19  3.707689 6.167308 3.286156 3.779404 2.967265 4.302227 5.200031 4.681604
20  3.297399 6.324931 4.913250 4.907588 4.817754 3.963564 4.663005 6.219969
          17       18       19
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18  6.804331
19  5.625103 6.543116
20  7.029360 5.821915 2.451658

Call:
hclust(d = distance_mat, method = "average")

Cluster method   : average
Distance         : manhattan
Number of objects: 20

 [1] 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 3 2 1 1 1
fit
 1  2  3
16  3  1

Conclusion
Hierarchical clustering constitutes a versatile and powerful technique for discovering underlying structures within datasets. By exploring its practical applications and implementing it in the R programming language, we have acquired both an understanding of and hands-on experience with hierarchical clustering. This approach holds immense potential for domains such as customer segmentation, image analysis, and bioinformatics, assisting experts in unveiling hidden patterns that drive informed decision-making.