Hierarchical Cluster Analysis
Abstract:
Hierarchical clustering analysis is a widely used technique. The success of a hierarchical clustering algorithm depends on the linkage method used to determine the similarity or dissimilarity between clusters. This paper reviews the theory and applications of hierarchical clustering analysis, focuses on the different linkage methods, especially the centroid linkage approach, and compares their results, limitations, and time complexities. Of the four methods discussed in the paper, three have limitations; the exception is average linkage clustering, which is the most suitable choice for most applications.
Introduction:
Sorting similar items together, comparing them, and trying to make sense of their properties is very common in everyday life. Doing such monotonous work manually is time consuming, so researchers needed automated methods to group similar items and bring some order to the chaos. Finding similarities between data points in order to arrive at statistically meaningful groups is essential, and this need gave birth to the topic of cluster analysis in several disciplines. The primary purpose of cluster analysis is to build a classification system that groups cases which are relatively homogeneous within themselves; a group of such homogeneous cases is referred to as a cluster. Several clustering algorithms are in popular use, yet choosing the algorithm and measurement technique that governs the successive merging of similar cases is a complex process. A clustering algorithm should be chosen only after rigorous study of the particular data set: a technique that successfully classifies one type of data set can fail on another.
Interest in clustering analysis began in the 1960s and resulted in the development of new algorithms that demonstrated its possibilities; during this period researchers began exploring various innovative statistical tools to uncover structure in data sets. Cluster analysis was initially used in the disciplines of biology and ecology in 1963 [1]. The technique has also been applied in the social sciences, although without widespread popularity at first. In the 1970s several further clustering algorithms were developed, and over the last decades clustering analysis has become most popular in the health and social sciences.
In this paper, some applications and the general technique of hierarchical cluster analysis, along with its computational complexity, are described in brief. Three typical distance measure techniques and a fourth, centroid-based technique are compared, with the pros and cons of each.
Applications:
Biologists have spent many years creating a taxonomy of all living things by hierarchical classification, assigning organisms to kingdom, family, genus, and species. Much of the early work in cluster analysis aimed to create a discipline of mathematical taxonomy that could discover such classification structures automatically. More recently, biologists have applied clustering to the vast amounts of genetic information that are now available, for example to analyze groups of genes with similar functions.
In this new era, billions of web pages are a source of enormous information, and search engines use cluster analysis to find specific information for a query. Clustering algorithms are widely used to return the most relevant results from those billions of pages by grouping the search results into a small number of clusters. Weather forecast applications, commonly used to predict atmospheric conditions and to learn whether tomorrow will bring snow or rain, are built on cluster analysis: weather prediction requires finding patterns in the atmosphere and the ocean, and cluster analysis is the main tool for finding the patterns of atmospheric pressure of particular regions and the areas of land and ocean that control weather and influence climate. In medical science, clustering techniques have brought great benefit and speed: every illness or disease has a number of variations, and cluster analysis can be used to identify these different subcategories. Nowadays, clustering analysis is also being used in higher education.
Modern businesses, including banks, stock markets, online shops, and social media companies, now collect large amounts of information on current and potential customers from their own platforms or from social media; these data are used to segment customers into a small number of groups for further analysis and marketing activities.
In hierarchical cluster analysis, a unique set of nested categories or clusters is produced by sequentially pairing variables or clusters. Starting with the correlation matrix, the clusters and unclustered variables at each step are tried in all possible pairs, and the pair producing the highest average intercorrelation within the trial cluster is chosen as the new cluster. Like the taxonomic classification of the biological systematist, the dendrogram graph shows the relations between clusters and the value of the clustering criterion associated with each [2]. There is, however, no agreed complete definition of clustering; a classic one is described as follows [3]:
1. Instances in the same cluster must be as similar as possible;
2. Instances in different clusters must be as different as possible;
3. Measurement for similarity and dissimilarity must be clear and have practical meaning.
There is no ideal clustering algorithm that solves every problem; as has been noted, "clustering is in the eye of the beholder" [4]. No formula exists that can identify the most appropriate clustering algorithm for a particular problem, which often needs to be chosen experimentally, and an algorithm that is designed for one kind of model will generally fail on a data set that contains a radically different kind of model.
Given a set of N items to be clustered and an N x N distance (or similarity) matrix, the basic process of hierarchical clustering is as follows:
1. Start by assigning each item to its own cluster, so that if you have N items, you now
have N clusters, each containing just one item. Let the distances (similarities) between
the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
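As an illustration, the following is a minimal Python sketch of this agglomerative procedure; the function name, cluster representation, and example data are invented for this sketch and are not part of any cited method.

def agglomerative(dist):
    """Merge the closest pair of clusters until one cluster remains.

    `dist` is a precomputed N x N distance matrix (list of lists).
    """
    # Step 1: each item starts as its own singleton cluster.
    clusters = [[i] for i in range(len(dist))]
    merges = []  # record of merged (cluster, cluster) pairs, in order

    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters. Cluster-to-cluster
        # distance here is single linkage (minimum over member pairs);
        # replacing min with max or a mean gives complete or average linkage.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        # Steps 3 and 4: merge and continue; distances to the new cluster
        # are recomputed from its members on the next pass.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Example: four items on a line at positions 0, 1, 5, and 6.
points = [0, 1, 5, 6]
dist = [[abs(p - q) for q in points] for p in points]
print(agglomerative(dist))  # ([0], [1]) and ([2], [3]) merge first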
The core idea of hierarchical clustering is that objects are more related to nearby objects than to objects farther away; it connects "objects" into "clusters" based on distance. In hierarchical clustering the number of clusters changes in every iteration. In divisive clustering, a top-down approach, the number of clusters increases: all data instances start in one cluster, and splits are performed recursively. In agglomerative clustering, a bottom-up approach, the number of clusters decreases in each iteration: each instance starts as its own cluster, and the closest pairs of clusters are merged step by step.
Cluster analysis is inherently linked to the concept of similarity. Determining the right statistic to measure similarity or distance among the cases is the first step a researcher should take. The two calculations may seem interchangeable, since it is obvious that as distance decreases similarity increases, and vice versa [5]. The first important step in hierarchical clustering analysis is therefore to choose the right distance measurement technique between the cases or variables. Distance and similarity measurements are the basis of the clustering process: distance is used when the data points are quantitative, whereas similarity is used when they are qualitative.
The most widely used method for calculating the distance between continuous data points or cases is the squared Euclidean distance. In this method, the distances between all the variables of two cases are combined into a single value, which is then taken as the distance between the two whole cases. This calculation is used at each step of the clustering procedure and is recorded in a proximity matrix. The two cases with the smallest calculated distance between them are merged and become a single cluster [8]; the proximity matrix is then updated, which is why hierarchical clustering proceeds iteratively.
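For illustration, here is a small Python sketch of the squared Euclidean distance and the proximity matrix it produces; the case values are invented for the example.

def squared_euclidean(u, v):
    """Sum of squared differences across all variables of two cases."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Three cases, each described by two variables (made-up values).
cases = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0)]

# Proximity matrix: entry [i][j] is the squared distance between cases i and j.
proximity = [[squared_euclidean(u, v) for v in cases] for u in cases]
for row in proximity:
    print(row)
# The smallest off-diagonal entry (between cases 0 and 1) marks the first merge.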
Finding the Euclidean distance between clusters is straightforward when each cluster contains no more than one case. However, when there are many cases in each cluster, calculating the Euclidean distance between only two cases, one from each, is not sufficient. Since the final goal is to find the clusters that are most similar, or nearest, among all the clusters, it is important to find the right method of calculating the linkage between two clusters that each contain more than one case or data point. Researchers have devised many techniques for measuring this linkage between clusters, and it is important to study them before starting any hierarchical clustering analysis on a data set. Each linkage measure works in its own way: one linkage measure may work well for a particular type of data set but not for another. The most commonly known linkage methods are the single, complete, and average linkage methods. All are based on the Euclidean distance; the main difference between them is the selection of the data points that serve as the final criterion on which the similarity or distance depends. In general, the single linkage method depends on the smallest distance between two points, one from each cluster of the pair [9]. The complete linkage and average linkage methods take the farthest distance and the average of the distances, respectively, as shown in the following figures [10].
Figure 2: Linkage criterion of clustering methods
If, of a pair of clusters, the first has points p, q, r and the second has points w, x, y, then the distance between them in the single link method is the minimum distance found among the following point pairs: (p,w), (p,x), (p,y), (q,w), (q,x), (q,y), (r,w), (r,x), (r,y). The main problem with single link clustering is that two clusters may be merged just because one of their data points is close to a point in the other cluster, even though most of the other points lie at significantly larger distances. This is called the chaining effect, and it has a negative impact on the overall result of the cluster analysis if noise exists in the data set. The single link technique applied to some example data points produces the dendrogram shown in Figure 3(a). In the complete link method, the distance between the clusters is the maximum distance found among the same point pairs: (p,w), (p,x), (p,y), (q,w), (q,x), (q,y), (r,w), (r,x), (r,y). The complete link method can also give a disastrous result if noise is present in the data set; the dendrogram of complete linkage clustering applied to the same example data set is presented in Figure 3(b). In [11] a new method, called the average linkage method, was proposed to overcome the limitations of single link and complete link clustering. The strategy is to take the average of the distances between the point pairs (p,w), (p,x), (p,y), (q,w), (q,x), (q,y), (r,w), (r,x), (r,y) as the distance between the clusters. This method is widely used in hierarchical clustering analysis. The dendrogram obtained after applying the average link method to the same example data set is shown in Figure 3(c). The height of the dendrogram is the distance between the clusters.
Figure 3: Dendrograms resulting from (a) single, (b) complete, and (c) average linkage clustering
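These three criteria can be sketched in a few lines of Python; the coordinates below are invented stand-ins for the points p, q, r and w, x, y above.

from itertools import product

def euclidean(u, v):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

cluster1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]   # stands in for p, q, r
cluster2 = [(4.0, 4.0), (4.0, 5.0), (5.0, 4.0)]   # stands in for w, x, y

# All nine member-pair distances: (p,w), (p,x), ..., (r,y).
pair_dists = [euclidean(u, v) for u, v in product(cluster1, cluster2)]

print("single  :", min(pair_dists))                    # smallest pair distance
print("complete:", max(pair_dists))                    # farthest pair distance
print("average :", sum(pair_dists) / len(pair_dists))  # mean over all pairs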
The basic idea of the centroid linkage method is to take the distance between the centroids of the data points in the clusters. If, of a pair of clusters, the first has points p, q, r, s, t and the second has points w, x, y, z, then the distance between the clusters is the distance between the centroid of the data points (p,q,r,s,t) and the centroid of the data points (w,x,y,z). Unlike the single, complete, and average linkage methods above, the distance is calculated once rather than between each and every pair of points of the clusters, as shown in Figure 4(a).
The formula to calculate the centroid C of a finite set of k points x1, x2, x3, ..., xk is straightforward [12]:

C = (x1 + x2 + x3 + ... + xk) / k
In most cases the points will be n-dimensional, and the centroid is calculated in the same way, treating the points as the vertices of a simplex: if the n vertices are v1, v2, v3, ..., vn, then

C = (v1 + v2 + v3 + ... + vn) / n
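A small Python sketch of the centroid criterion, using the same invented coordinates as above, makes the contrast clear: a single centroid-to-centroid distance replaces the nine member-pair distances.

def centroid(points):
    """Coordinate-wise mean of a finite set of points: C = (x1 + ... + xk) / k."""
    k = len(points)
    return tuple(sum(coords) / k for coords in zip(*points))

def centroid_distance(cluster1, cluster2):
    """Euclidean distance between the centroids of two clusters."""
    c1, c2 = centroid(cluster1), centroid(cluster2)
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

cluster1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster2 = [(4.0, 4.0), (4.0, 5.0), (5.0, 4.0)]
print(centroid_distance(cluster1, cluster2))  # one centroid-to-centroid value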
Centroid linkage clustering produces results somewhat similar to average linkage clustering, but the centroid linkage method has an undesirable characteristic: the possibility of inversions [13]. This makes it risky to use in hierarchical clustering and calls for further research. The dendrogram obtained after applying the centroid linkage method is shown in Figure 4(b).
Figure 4: (a) Centroid linkage criterion; (b) dendrogram resulting from centroid linkage clustering
If there are N data points in the data set, the time complexities of the above methods are summarized in the following table.

Method        | Distance criterion                                  | Time complexity | Remarks
Single link   | Smallest distance between a pair of points across the clusters | O(n^2)          | Prone to the chaining effect when noise is present
Complete link | Farthest distance between a pair of points across the clusters | O(n^2 log n)    | Can be badly affected by noise
Average link  | Average distance of all the data points             | O(n^2 log n)    | Best choice for most applications
Centroid link | Distance between the cluster centroids              | O(n^2 log n)    | Possibility of inversions
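In practice these linkage methods are rarely hand-rolled; the sketch below uses SciPy's scipy.cluster.hierarchy.linkage function (assuming NumPy and SciPy are installed) to run all four criteria on a small synthetic data set.

# `linkage` returns the merge history as an (n-1) x 4 array whose third
# column holds the merge distances (the heights in the dendrogram).
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
# Two well-separated blobs of 10 two-dimensional points each (synthetic data).
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(8, 1, (10, 2))])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)
    print(method, "final merge distance:", Z[-1, 2])

# scipy.cluster.hierarchy.dendrogram(Z) would plot the corresponding tree.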
Conclusion:
Hierarchical clustering analysis is a technique that allows researchers to find meaning and understand the pattern of data in a large unstructured data set. This paper discussed the theoretical background of hierarchical clustering, its applications, and the commonly used linkage methods. Among the single, complete, and average linkage methods, only the average linkage method is recommended for most applications, because the results of both single and complete linkage can be greatly affected by noise. The centroid linkage method works like the other linkage methods but has a serious limitation, the possibility of inversions, which is why it is not widely used in applications.
References:
[1] Sokal, R.R. and Sneath, P.H.A. (1963). Principles of Numerical Taxonomy. W.H. Freeman & Co.
[2] DOI: 10.2466/pr0.1966.18.3.851.
[3] Xu, D. and Tian, Y. (2015). "A Comprehensive Survey of Clustering Algorithms". Annals of Data Science, 2(2), 165-193.
[4] Estivill-Castro, V. (2002). "Why so many clustering algorithms: a position paper". ACM SIGKDD Explorations Newsletter, 4(1), 65-75.
[6] Clatworthy, J., Buick, D., Hankins, M., Weinman, J., and Horne, R. (2005). "The use and reporting of cluster analysis in health psychology: A review". British Journal of Health Psychology, 10(3), 329-358.
[7] Day, W.H.E. and Edelsbrunner, H. (1984). "Efficient algorithms for agglomerative hierarchical clustering methods". Journal of Classification, 1(1), 7-24.
[8] Yim, O. and Ramdeen, K.T. (2015). "Hierarchical Cluster Analysis: Comparison of Three Linkage Measures and Application to Psychological Data". The Quantitative Methods for Psychology, 11(1), 8-21.
[9] Florek, K., Łukaszewicz, J., Perkal, J., Steinhaus, H., and Zubrzycki, S. (1951). "Sur la liaison et la division des points d'un ensemble fini". Colloquium Mathematicum, 2, 282-285.
[10] Mazzocchi, M. (2008). Statistics for Marketing and Consumer Research, Chapter 12. DOI: 10.4135/9780857024657.
[11] Sokal, R.R. and Michener, C.D. (1958). "A Statistical Method for Evaluating Systematic Relationships". University of Kansas Science Bulletin, 38, 1409-1438.
[12] Protter, M.H. and Morrey, C.B. Jr. (1970). College Calculus with Analytic Geometry. Addison-Wesley.
[13] Manning, C.D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval, "Centroid clustering". https://nlp.stanford.edu/IR-book/html/htmledition/centroid-clustering-1.html