Hierarchical Clustering Algorithm
Abstract
1. Introduction
Data mining allows us to extract knowledge from historical data and predict outcomes for future situations. Clustering is an important data mining task. It can be
described as the process of organizing objects into groups whose members are similar
in some way. Clustering can also be defined as the process of grouping data into
classes or clusters, so that objects within a cluster have high similarity to one another
but are very dissimilar to objects in other clusters. Clustering is mainly done by two
kinds of methods: hierarchical and partitioning [1].
1226 Yogita Rani & Dr. Harish Rohil
In data mining, hierarchical clustering works by grouping data objects into a tree of
clusters. Hierarchical clustering methods can be further classified into agglomerative
and divisive hierarchical clustering, depending on whether the hierarchical
decomposition is formed in a bottom-up or top-down fashion. Hierarchical
techniques produce a nested sequence of partitions, with a single, all-inclusive cluster
at the top and singleton clusters of individual objects at the bottom. Each intermediate
level can be viewed as combining two clusters from the next lower level or splitting a
cluster from the next higher level. The result of a hierarchical clustering algorithm can
be displayed graphically as a tree, called a dendrogram. This tree shows the merging
process and the intermediate clusters, and how points are progressively merged into a
single cluster.
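As an illustration of the agglomerative (bottom-up) process just described, the following is a minimal Python sketch. The data, the helper names, and the choice of single linkage are illustrative assumptions, not details from the paper; the recorded merge sequence is what a dendrogram displays.

```python
# Minimal agglomerative clustering sketch on 2-D points, using single
# linkage (distance between the closest pair of points across clusters).
from math import dist

def single_linkage(a, b):
    """Smallest pairwise distance between clusters a and b."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points):
    """Repeatedly merge the two closest clusters until one remains;
    return the sequence of merges (the dendrogram, as a flat list)."""
    clusters = [[p] for p in points]       # start from singleton clusters
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return merges

merges = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)])
```

Reading the merge list from first to last reproduces the bottom-to-top levels of the dendrogram: the two nearby point pairs merge first, and the final merge produces the single all-inclusive cluster.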
Hierarchical methods suffer from the fact that once a merge or split step has been
performed, it can never be undone. This rigidity is useful in that it leads to smaller
computation costs, since a combinatorial number of different choices need not be
considered. However, such techniques cannot correct erroneous decisions once they
have been made. Two approaches can help improve the quality of hierarchical
clustering: (1) perform careful analysis of object linkages at each hierarchical
partitioning, or (2) integrate hierarchical agglomeration with other approaches, by
first using a hierarchical agglomerative algorithm to group objects into
micro-clusters, and then performing macro-clustering on the micro-clusters using
another clustering method such as iterative relocation [2].
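The second approach can be sketched as follows. This is only an illustration of the two-phase idea, assuming a cheap greedy pass for the micro-clusters and a k-means-style iterative relocation for the macro phase; the thresholds, function names, and data are illustrative.

```python
# Two-phase sketch: (1) cheap greedy micro-clustering,
# (2) iterative relocation (k-means style) on micro-cluster centroids.
from math import dist

def centroid(pts):
    return tuple(sum(x) / len(pts) for x in zip(*pts))

def micro_clusters(points, threshold):
    """Phase 1: a point joins the first micro-cluster whose centroid
    lies within `threshold`, otherwise it starts a new micro-cluster."""
    groups = []
    for p in points:
        for g in groups:
            if dist(p, centroid(g)) <= threshold:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

def macro_clusters(centroids, k, rounds=10):
    """Phase 2: iterative relocation of k means over the centroids."""
    means = list(centroids[:k])                 # naive initialisation
    for _ in range(rounds):
        buckets = [[] for _ in range(k)]
        for c in centroids:
            i = min(range(k), key=lambda i: dist(c, means[i]))
            buckets[i].append(c)                # assign to nearest mean
        means = [centroid(b) if b else means[i] for i, b in enumerate(buckets)]
    return means

pts = [(0, 0), (0.5, 0), (10, 10), (10, 10.5), (0.2, 0.1), (9.8, 10.1)]
micro = micro_clusters(pts, threshold=2.0)
macro = macro_clusters([centroid(g) for g in micro], k=2)
```

The design point is that the expensive relocation phase runs only on the (few) micro-cluster summaries rather than on every data object.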
2. Related Work
Chris Ding and Xiaofeng He introduced the merging and splitting process in the
hierarchical clustering method. They provide a comprehensive analysis of selection
methods and propose several new methods that determine how best to select the next
cluster for a split or merge operation. The authors performed extensive clustering
experiments to test eight selection methods, and found that average similarity is the
best method in divisive clustering and Min-Max linkage is the best in agglomerative
clustering. Cluster balance was a key factor in achieving good performance. They
also introduced the concepts of objective function saturation and clustering target
distance to effectively assess the quality of clustering [3].
Marjan Kuchaki Rafsanjani et al. give an overview of some specific hierarchical
clustering algorithms. The authors first classified clustering algorithms and then
focused mainly on hierarchical clustering algorithms. One of the main purposes of
describing these algorithms was to minimize disk I/O operations, consequently
reducing time complexity. They also stated the attributes, advantages and
disadvantages of all the considered algorithms. Finally, all of them were compared
according to their similarities and differences [4].
Tian Zhang et al. proposed an agglomerative hierarchical clustering
method named BIRCH (Balanced Iterative Reducing and Clustering using
Hierarchies), and verified that it is especially suitable for large databases [5]. CURE,
by contrast, represents each cluster by a certain fixed number of points that are
generated by selecting well-scattered points from the cluster and then shrinking them
toward the centre of the cluster by a specified fraction [8].
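The representative-point scheme just described can be illustrated with a short sketch. This is a hedged approximation, not the actual CURE implementation: the farthest-point heuristic for picking well-scattered points, the parameter names, and the data are illustrative assumptions.

```python
# Sketch of the representative-point scheme: pick well-scattered points
# from a cluster, then shrink them toward the cluster centre by alpha.
from math import dist

def representatives(cluster, num_points=4, alpha=0.5):
    centre = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    # well-scattered points: start from the point farthest from the
    # centre, then repeatedly add the point farthest from those chosen
    scattered = [max(cluster, key=lambda p: dist(p, centre))]
    while len(scattered) < min(num_points, len(cluster)):
        scattered.append(
            max(cluster, key=lambda p: min(dist(p, s) for s in scattered))
        )
    # shrink each scattered point toward the centre by fraction alpha
    return [
        tuple(c + alpha * (m - c) for c, m in zip(p, centre))
        for p in scattered
    ]

reps = representatives([(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)],
                       num_points=2, alpha=0.5)
```

Shrinking the scattered points dampens the effect of outliers, while using several points per cluster (rather than a single centroid) lets non-spherical cluster shapes be captured.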
The steps involved in clustering using ROCK are described in figure 2. In this
process, after drawing a random sample from the database, a hierarchical clustering
algorithm that employs links is applied to the sampled data points. Finally, the
clusters involving only the sample points are used to assign the remaining data points
on disk to the appropriate clusters.
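The "links" that this hierarchical step employs can be sketched briefly: two points are neighbours if their similarity exceeds a threshold, and the link count of a pair is the number of neighbours they share. Jaccard similarity over sets is a common choice for categorical data; the threshold and the sample data below are illustrative assumptions, not values from the paper.

```python
# Sketch of ROCK-style links: link(p, q) = number of common neighbours.
def jaccard(a, b):
    """Similarity of two sets of categorical attributes."""
    return len(a & b) / len(a | b)

def neighbours(points, theta):
    """For each point, the set of other points with similarity >= theta."""
    return {
        i: {j for j in range(len(points))
            if j != i and jaccard(points[i], points[j]) >= theta}
        for i in range(len(points))
    }

def links(points, theta):
    """Common-neighbour counts for every pair of points."""
    nbr = neighbours(points, theta)
    return {
        (i, j): len(nbr[i] & nbr[j])
        for i in range(len(points)) for j in range(i + 1, len(points))
    }

baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"x", "y"}]
link_counts = links(baskets, theta=0.4)
```

Note that points 0 and 2 are not direct neighbours here, yet they share a common neighbour (point 1), so they still obtain a positive link count; this global view is what makes links more robust than pairwise similarity alone.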
3.6 Leaders–Subleaders
Leaders–Subleaders is an efficient hierarchical clustering algorithm that is suitable for
large data sets. In order to generate a hierarchical structure for finding the subgroups
or sub-clusters, incremental clustering principles are used within each cluster.
Leaders–Subleaders is an extension of the leader algorithm, an incremental algorithm
in which L leaders, each representing a cluster, are generated using a suitable
threshold value. There are two major features of
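The leader step described above can be sketched in a few lines: a single pass over the data in which each point joins the first leader within the distance threshold, or otherwise becomes a new leader. The threshold value and the sample points are illustrative assumptions.

```python
# Minimal sketch of the incremental leader algorithm.
from math import dist

def leader_clustering(points, threshold):
    leaders = {}                     # leader point -> its cluster members
    for p in points:
        for l in leaders:
            if dist(p, l) <= threshold:
                leaders[l].append(p)     # p follows an existing leader
                break
        else:
            leaders[p] = [p]             # p becomes a new leader
    return leaders

clusters = leader_clustering([(0, 0), (0.5, 0.5), (9, 9), (8.5, 9)],
                             threshold=2.0)
```

Leaders–Subleaders then applies the same one-pass idea again inside each leader's cluster, with a smaller threshold, to obtain the subleaders that form the second level of the hierarchy.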
4. Conclusion
This paper presents an overview of improved hierarchical clustering algorithms.
Hierarchical clustering is a method of cluster analysis which seeks to build a
hierarchy of clusters. The quality of a pure hierarchical clustering method suffers
from its inability to make adjustments once a merge or split decision has been
executed. Such a decision, if not well chosen at some step, may lead to somewhat
low-quality clusters. One promising direction for improving the clustering quality of
hierarchical methods is to integrate hierarchical clustering with other techniques for
multiple-phase clustering. These types of modified algorithms have been discussed in
detail in this paper.
References
[1] Pavel Berkhin (2000), Survey of Clustering Data Mining Techniques, Accrue
Software, Inc.
[2] Jiawei Han and Micheline Kamber (2006), Data Mining: Concepts and
Techniques, Morgan Kaufmann/Elsevier India.
[3] Chris Ding and Xiaofeng He (2002), Cluster Merging and Splitting in
Hierarchical Clustering Algorithms.
[4] Marjan Kuchaki Rafsanjani, Zahra Asghari Varzaneh, Nasibeh Emami
Chukanlo (2012), A Survey of Hierarchical Clustering Algorithms, The Journal
of Mathematics and Computer Science, 5(3), pp. 229-240.
[5] Tian Zhang, Raghu Ramakrishnan, Miron Livny (1996), BIRCH: An Efficient
Data Clustering Method for Large Databases, In Proc. of 1996 ACM-SIGMOD
International Conference on Management of Data, Montreal, Quebec.
[6] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim (1998), CURE: An Efficient
Clustering Algorithm for Large Databases, In Proc. of 1998 ACM-SIGMOD Int.
Conference on Management of Data.
[8] G. Karypis, E.-H. Han and V. Kumar (1999), CHAMELEON: Hierarchical
Clustering Using Dynamic Modeling, IEEE Computer, 32, pp. 68-75.
[9] J.A.S. Almeida, L.M.S. Barbosa, A.A.C.C. Pais and S.J. Formosinho (2007),
Improving Hierarchical Cluster Analysis: A New Method with Outlier Detection
and Automatic Clustering, Chemometrics and Intelligent Laboratory Systems,
87, pp. 208-217.
[10] L. Feng, M.-H. Qiu, Y.-X. Wang, Q.-L. Xiang, Y.-F. Yang and K. Liu (2010), A
Fast Divisive Clustering Algorithm Using an Improved Discrete Particle Swarm
Optimizer, Pattern Recognition Letters, 31, pp. 1216-1225.