Abstract— Hierarchical clustering has been extensively applied for data analysis and knowledge discovery. However, the scalability of hierarchical clustering methods is generally limited due to their time complexity of O(n^2), where n is the size of the input data. To address this issue, we present a fast and accurate hierarchical clustering algorithm based on topology training. Specifically, a trained multilayer topological structure that fits the spatial distribution of the data is utilized to accelerate the similarity measurement, which dominates the computational cost in hierarchical clustering. Moreover, the topological structure also guides the merging steps in hierarchical clustering to form a meaningful and accurate clustering result. In addition, an incremental version of the proposed algorithm is further designed so that the proposed approach is applicable to streaming data as well. Promising experimental results on various data sets demonstrate the efficiency and effectiveness of the proposed algorithms.

Index Terms— Data analysis, hierarchical clustering, incremental algorithm, time complexity, topology.

Manuscript received April 18, 2017; revised November 2, 2017 and April 15, 2018; accepted June 27, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 61672444 and Grant 61272366, in part by the SZSTI under Grant JCYJ20160531194006833, and in part by the Faculty Research Grant of Hong Kong Baptist University under Project FRG2/16-17/051 and FRG2/17-18/082. (Corresponding author: Yiu-ming Cheung.)
Y.-m. Cheung is with the Department of Computer Science, Hong Kong Baptist University (HKBU), Hong Kong, and also with the HKBU Institute of Research and Continuing Education, Shenzhen 518057, China (e-mail: [email protected]).
Y. Zhang is with the Department of Computer Science, Hong Kong Baptist University, Hong Kong (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2018.2853407
2162-237X © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

I. INTRODUCTION

Clustering methods can be classified into two types: partitional clustering [5]–[9], [26], [40] and hierarchical clustering [10], [18], [19], [31]. Partitional clustering separates a set of data points into a certain number of clusters to minimize the intracluster distance and maximize the intercluster distance, while hierarchical clustering views each data point as an individual cluster and builds a nested hierarchy by gradually merging the current most similar pair of them. Compared with partitional clustering, hierarchical clustering offers more information regarding the distribution of the data set. Often, the hierarchy is visualized using dendrograms, which can be "cut" at any level to produce the desired number of clusters. Due to the rich information it offers, hierarchical clustering has been extensively applied to different fields, e.g., data analysis, knowledge discovery, pattern recognition, image processing, bioinformatics, and so on [4], [11], [21].

In general, a traditional hierarchical clustering framework can be summarized as follows.
Step 1: Each single data point is assigned to an individual cluster.
Step 2: The most similar pair of clusters is found according to a certain linkage strategy.
Step 3: The most similar pair of clusters is merged to form a new cluster.
Step 4: Steps 2 and 3 are repeated until only one cluster exists or a particular stop condition is satisfied.

In the above-mentioned steps, the commonly used linkage strategies are single linkage (SL), average linkage (AL), and complete linkage (CL), which compute the maximum, average, and minimum similarity between the data points of two clusters, respectively [27]. The traditional hierarchical clustering frameworks with SL, AL, and CL linkages are abbreviated as T-SL, T-AL, and T-CL hereinafter. Although these three traditional approaches are parameterless and simple to use, they have three major problems.
1) Their performance is sensitive to different data distribution types. T-SL "has a tendency to produce clusters that are straggly or elongated" [17]; T-CL and T-AL tend to produce compact and spherical-shaped clusters, respectively.
2) All three only consider the local distance between the pairs of data points during clustering. When overlapped clusters exist, their performances will be influenced [37].
3) Their time complexity is O(n^2), which limits their applications, particularly for large-scale data and streaming data.

To tackle the above-mentioned three problems, various types of hierarchical clustering approaches have been proposed in the literature. To solve the first two problems, potential-based hierarchical clustering approaches based on potential theory [33] have been proposed (see [22] and [23]), where the potential field is utilized to measure the similarity between data points. Because this type of approach merges the data points by considering both the global distribution, i.e., the potential fields of data points, and the local relationship, i.e., the exact distance between neighbors, they show robustness when processing data sets with different data distribution types and overlapped clusters. Nevertheless, their time complexity is still O(n^2).

To cope with the third problem, locality-sensitive hashing-based hierarchical clustering [20] has been proposed with a time complexity of O(nm) to speed up the closest pair search procedure of T-SL, where m is the bucket size. However, the setting of parameters for this approach is nontrivial, and
its clustering accuracy is generally lower than that of T-SL. Furthermore, hierarchical clustering based on random projection (RP) [30], with a time complexity of O(n(log n)^2), has also been proposed. It accelerates T-SL and T-AL by iteratively projecting data points onto different lines for splitting. In this manner, the data set is partitioned into small subsets, and the similarity can be measured locally to reduce computation cost. However, RP-based approaches inherit the drawbacks of T-SL and T-AL, i.e., they have a bias for certain data distribution types, and they cannot distinguish overlapped clusters well, due to approximation. To simultaneously tackle the three problems, summarization-based hierarchical clustering frameworks have been proposed in the literature. Specifically, data bubble-based hierarchical clustering and its variants [2], [3], [25], [29], [41] have been proposed to summarize the data points by randomly initializing a set of seed points to incorporate nearby data points into groups (data bubbles). Subsequently, the hierarchical clustering is performed on the bubbles only to avoid the similarity measurement for a large number of original data points. In general, the performance of the data bubble and its variants is sensitive to the compression rate and the initialization of seed points. Our preliminary work in [39] has addressed the sensitivity problem by training the seed points to better summarize the data points. Nevertheless, a common shortcoming of the summarization-based approaches is that the hierarchical relationship between data points is lost due to summarization. In addition, none of the above-mentioned approaches are fundamentally designed for streaming data. Specifically, the entire clustering process should be executed to update the hierarchy structure for each new input, which may sharply increase the computational cost. To solve this problem, the incremental hierarchical clustering (IHC) approach [34] has been proposed. It saves a large amount of computational cost by dynamically and locally restructuring the inhomogeneous regions of the present hierarchy structure. Therefore, this approach performs hierarchical clustering with a time complexity as low as O(n log n) when the hierarchy structure is completely balanced. However, the balance of the constructed hierarchy is not guaranteed, which makes its worst-case time complexity still O(n^2). Furthermore, because IHC is an approximation of T-SL, it will also have a bias for certain data distribution types.

In this paper, we concentrate on: 1) addressing the three above-mentioned problems of traditional hierarchical clustering frameworks and linkage strategies and 2) proposing a new hierarchical clustering framework for streaming data. We first propose a growing multilayer topology training (GMTT) algorithm to dynamically learn the spatial distribution of data and construct the corresponding topological structure. In the literature, topology training has been widely utilized for partitional clustering [1], [14], [15], [28], [32], [36], [38]. However, to the best of our knowledge, it has yet to be utilized for hierarchical clustering. We make the topology grow by creating new layers with new seeds based on existing seeds if the existing seeds cannot represent the data set well. The growth is continued until each node can appropriately represent the local data distribution. As a result, the GMTT algorithm assigns more layers and seeds to finely describe the high-density region of data sets. Accordingly, a hierarchical clustering framework based on GMTT is formed. Differing from our preliminary work in [39], this framework can dynamically create and train seeds to form a multilayer topology. With the topology, the merging steps of hierarchical clustering are performed under its guidance. Moreover, the similarity between data points is only measured within each seed's corresponding subset, which can significantly reduce the computational cost. In general, most of the traditional linkage strategies, i.e., SL, AL, and CL, can be applied to the GMTT-based framework. To achieve better clustering performance, a new density-based linkage strategy is also presented. Because it simultaneously considers the global and local data distribution information, its clustering performance is promising. In addition, an incremental version of the GMTT framework, denoted as the IGMTT framework, is also presented to cope with streaming data. In the IGMTT framework, each new input can easily find its nearest neighbor by searching the topology from top to bottom. Then, both the existing topology and hierarchy are locally updated to recover the influence caused by the input. Both the GMTT and the IGMTT frameworks have competent performance in terms of clustering quality and time complexity, i.e., O(n^1.5). Their effectiveness and efficiency have been empirically investigated. The main contributions of our work are summarized as follows.
1) The GMTT algorithm is proposed for seed point training. The topology of the seed points can appropriately represent the structural data distribution. The training is automatic without prior knowledge of the data set, e.g., number of clusters, proper number of seeds, and so on.
2) A fast hierarchical clustering framework has been proposed based on GMTT. According to the topology trained through GMTT, distance measurement is locally performed to reduce computational cost. Merging is also guided by the topology to make the constructed hierarchy able to distinguish the borders of real clusters.
3) A new linkage strategy called density linkage (DL) is presented, which simultaneously considers the local and global data distribution information to make the clustering results robust to different data distribution types and overlapping phenomena.
4) An incremental version of the GMTT framework, i.e., the IGMTT framework, is provided for streaming data hierarchical clustering. Similar to the GMTT framework, it is also fast and accurate.

The rest of this paper is organized as follows. Section II gives an overview of the existing relevant hierarchical clustering approaches. In Section III, the details of the proposed GMTT framework, IGMTT framework, and DL linkage are described. Then, Section IV presents the experimental results for various benchmark and synthetic data sets. Finally, we draw a conclusion in Section V.

II. OVERVIEW OF EXISTING RELEVANT HIERARCHICAL CLUSTERING METHODS

A. Potential-Based Hierarchical Clustering

The approach proposed in [23] converts the distance between data points into potential values to measure the
CHEUNG AND ZHANG: FAST AND ACCURATE HIERARCHICAL CLUSTERING BASED ON GMTT 3
By contrast, overlapping is very common in real data sets.

2) Dimensionality: The GMTT algorithm extracts the data distribution structure by gradually creating necessary nodes. New nodes gradually split the data space to detect and represent the data distribution. Due to the curse of dimensionality, the distribution of data points will be sparser for high-dimensional data. As a result, nodes trained through GMTT will be less representative, and the structural distribution information offered by the topology may contribute less, or even negatively, to improving the clustering quality. However, the curse of dimensionality will also influence the other hierarchical clustering approaches since Euclidean distance is commonly utilized by the existing approaches.

3) Coarse Topology: The IGMTT algorithm trains a coarse topology using the former part of streaming data. Because it extracts the structural distribution of data and allows fine training for the coarse topology according to the following inputs, the size of the former part of streaming inputs for coarse topology training will not influence the clustering quality significantly if the distribution of streaming data does not change with time. The case in which the data distribution changes over time is another challenging problem for hierarchical clustering, which is not considered in this paper.

The above-mentioned discussion is further justified by the experimental results in Section IV.

We also prove that the time complexity of the GMTT and IGMTT frameworks can be optimized to O(n^1.5), which is lower than the O(n^2) of traditional approaches.

Theorem 1: The GMTT framework has time complexity O(n^1.5) if the upper limitation U_L is set at √n.

Proof: When the topology T trained through GMTT is a totally imbalanced tree, we have the worst-case time complexity. In this case, the number of nonleaf nodes is u_nl = (n − U_L)/((B − 1)U_L). From the top to the bottom of T, the numbers of data points for training the nonleaf nodes can be viewed as an arithmetic sequence {n, n − (B − 1)U_L, n − 2(B − 1)U_L, ..., n − (u_nl − 1)(B − 1)U_L}. Therefore, the total number of data points for training all the nonleaf nodes is s_n = n u_nl − (B − 1)U_L(u_nl^2 + u_nl)/2. For each of the data points, B nodes should be considered to find the winner node using (4). For each nonleaf node, the training will be repeated I times for convergence. Therefore, the time complexity for the topology training (Algorithm 6, line 1) is O(s_n B I).

According to (6)–(8), U_L − 1 data points and u_l − 1 leaf nodes should be considered to measure the density for a data point, where u_l = n/U_L stands for the number of leaf nodes in T. For n data points, the time complexity is O(n U_L + n u_l). According to (9), at most u_l − 1 leaf nodes should be considered to measure the density for a node. For u_nl nonleaf nodes and u_l leaf nodes, the time complexity is O(u_l(u_nl + u_l)). Therefore, the time complexity for measuring the density for all the data points and nodes (Algorithm 6, line 2) is O(n U_L + n u_l + u_l u_nl + u_l^2).

For each nonleaf node, a sub-MST should be constructed for its B child nodes. For u_nl nonleaf nodes in total, the time complexity is O(u_nl B^2). For each of the leaf nodes, a sub-MST should be constructed for its corresponding U_L data points. For u_l leaf nodes in total, the time complexity is O(u_l U_L^2). Therefore, the time complexity for constructing the MST (Algorithm 6, line 3) is O(u_nl B^2 + u_l U_L^2).

The overall time complexity of the proposed GMTT framework is O(s_n B I + n U_L + n u_l + u_l u_nl + u_l^2 + u_nl B^2 + u_l U_L^2). Here, I is a very small constant ranging from 2 to 10 according to the experiment. B is always set to a small positive integer, e.g., 2–4 in the experiments. When U_L is set at √n, the overall time complexity can be optimized to O(n^1.5).

With the same parameter setting, the complexity of applying SL, AL, and CL to the GMTT framework is also O(n^1.5).

Theorem 2: The IGMTT framework has time complexity O(n^1.5) if the upper limitation U_L is set at √n.

Proof: According to the proof of Theorem 1, the time complexity of the coarse topology training (Algorithm 8, line 1) is O(r^1.5), where r is the size of the training set and r ≪ n.

For n inputs, the time complexity for searching the closest leaf node (Algorithm 8, line 3) according to u_nl nonleaf nodes is O(B u_nl n).

According to Definition 1, lines 6–10 of Algorithm 8 will be performed once for every U_L(B − 1) new inputs. In other words, they will be triggered n/(U_L(B − 1)) times in total. For each trigger, B new nodes should be trained by U_L(B − 1) data points, and the training will be repeated I times for convergence (Algorithm 8, line 6). Therefore, the time complexity for n/(U_L(B − 1)) triggers is O(n I).

For each trigger, U_L − 1 data points and u_l − 1 leaf nodes should be considered to measure the density for each of the U_L(B − 1) data points; u_l − 1 leaf nodes should be considered to measure the density for each of the B new nodes (Algorithm 8, line 7). Therefore, the time complexity for n/(U_L(B − 1)) triggers is O((U_L + u_l)n + u_l n/U_L).

For each trigger, a sub-MST for the corresponding U_L data points of each of the new nodes should be formed. Therefore, the time complexity for B new nodes should be O(B U_L^2); a sub-MST should also be formed for the B new nodes, which has time complexity O(B^2). For n/(U_L(B − 1)) triggers, the time complexity for the MST construction part (Algorithm 8, line 9) is O(U_L n + B n/U_L).

The overall time complexity of the IGMTT framework is O(r^1.5 + B u_nl n + n I + (U_L + u_l)n + u_l n/U_L + U_L n + B n/U_L). Similar to the GMTT framework, the time complexity can be optimized to O(n^1.5) with U_L = √n.

The time complexity of the proposed MST-dendrogram transformation algorithm is analyzed as follows. For a leaf node, the time complexity for forming the LMQ for the corresponding U_L data points is O(U_L^2). For u_l leaf nodes, the time complexity is O(U_L^2 u_l). In each merging step, the distance between the first pairs in the u_l LMQs should be compared to find the smallest one. For n − 1 merges, the time complexity is O(u_l n). Therefore, the overall time complexity of the transformation algorithm is O(U_L^2 u_l + u_l n). When we set U_L at √n, the time complexity can also be optimized to O(n^1.5).
Fig. 10. Performance of GMTT-DL with different B–U_L value combinations on four data sets.
Fig. 11. Performance of GMTT-DL with different B–η value combinations on four data sets.
iterations to converge and will become trapped in local optima.

To experimentally investigate the impact of any pair of the parameters, the proposed GMTT-DL has been performed 10 times for different value combinations of each pair of the parameters on four typical data sets: Seed, which is a real and small data set; Urban, which is a real and high-dimensional data set; Magic, which is a real and large-size data set; and Syn B, which is a synthetic data set with overlapped clusters. For each pair of parameters, the remaining one is fixed. As a rule of thumb, the value of η was set at 0.1 when investigating the impact of the B–U_L relationship. B was set at 4 when evaluating the U_L–η relationship. According to the analysis in Section III-D, U_L was set at √n when studying the B–η relationship. For all of the experiments, B = 1 and U_L = 1 are not evaluated because they make the GMTT algorithm meaningless. Because the size of the Magic data set is large, the parameters U_L and B are evaluated with large and small spacing steps to better indicate the relationships between parameters. The experimental results of B–U_L, B–η, and U_L–η are presented in Figs. 10–12, respectively. It can be observed that our discussion regarding the
Fig. 12. Performance of GMTT-DL with different U_L–η value combinations on four data sets.
TABLE II
H-ACC OF 8 COUNTERPARTS ON 10 DATA SETS

TABLE III
FM-INDEX OF 8 COUNTERPARTS ON 10 DATA SETS
three parameters is confirmed. Moreover, the clustering quality in terms of the H-Acc and FM-index of the GMTT framework is very robust to different parameter value combinations except for some extreme values, e.g., B = 2, U_L = 2, η = 1, and η = 0.001. From the run time results, it can be observed that the run time is the lowest when the value of U_L is approximately √n, which also confirms the time complexity analysis in Section III-D. According to the experimental results and the above-mentioned discussion, we set B = 4, η = 0.1, and U_L = √n for all of the data sets in the following experiments.

C. Performance Evaluation of the GMTT Framework

To investigate the effectiveness of the GMTT framework, we have compared its performance with that of the traditional hierarchical clustering framework combined with traditional linkage strategies, i.e., SL, AL, and CL. For each data set, the H-Acc and FM-index were calculated to measure the performance of all the counterparts. Because there are randomization procedures in the GMTT framework, we perform it 10 times and take the average performance as the final result. The experimental results are given in Tables II and III. For each data set, the best result is highlighted via boldface. "+" and "−" beside the GMTT frameworks stand for the Wilcoxon test results. It can be observed that the GMTT framework obviously boosts the performance of SL, AL, and CL on most of the 10 data sets. T-SL outperforms its GMTT version on the Syn C data set because the data distribution type of Syn C is chain shaped, which is preferred by SL. The performance of T-AL is also obviously better than that
TABLE IV
H-ACC PERFORMANCE COMPARED WITH STATE-OF-THE-ART COUNTERPARTS ON 10 DATA SETS

TABLE V
FM-INDEX PERFORMANCE COMPARED WITH STATE-OF-THE-ART COUNTERPARTS ON 10 DATA SETS
of its GMTT version on the Syn B data set because the data distribution type of Syn B is spherical shaped, which is preferred by AL. We can also observe from the experimental results that the performances of different linkages with the GMTT framework are close to each other, with competitive performance on most of the data sets. This indicates that the GMTT framework dominates the clustering performance and that different linkage strategies will not obviously influence the performance. In general, the GMTT framework is robust to different linkage strategies and outperforms the traditional one in terms of hierarchy quality.

To verify the effectiveness of the proposed GMTT-DL approach, we have compared its performance with that of all the other linkage strategies with the GMTT framework, i.e., GMTT-SL, GMTT-AL, and GMTT-CL. Moreover, the state-of-the-art hierarchical clustering approaches, i.e., the potential-based framework with edge-weighted tree linkage (P-EL) [23] and the RP-based framework with SL (RP-SL) and AL (RP-AL), have also been compared. To make the comparison fair, we use the autoparameter-selection version of the RP-based approaches. The hierarchy quality of all the counterparts in terms of the FM-index and H-Acc is compared in Tables IV and V, respectively. Because there are randomization procedures in the GMTT framework, the standard deviations of all the linkages with the GMTT framework are also presented. The best and the second best results are highlighted by boldface and underlining, respectively. It can be observed from the experimental results that GMTT-DL outperforms the other counterparts on most of the data sets. Although its performance is not always the best one for all of the data sets, its performance is still competitive. Moreover, almost all of the winners on each data set are GMTT-based approaches, which indicates the effectiveness of the GMTT framework. It can also be observed from the results that GMTT-DL can effectively cope with the overlapping problem since most of the real data sets and the Syn A and Syn B data sets have overlapped clusters. This is because the topology trained through GMTT extracts the structural distribution information of data sets and the DL considers the global and local distribution information together. In addition, the standard deviations indicate that all of the GMTT-based approaches have stable performance on different data sets.

In Table VI, the experimental results of the GMTT-based approaches and the other three state-of-the-art counterparts, i.e., P-EL, RP-SL, and RP-AL, given in Tables IV and V are compared by the Wilcoxon test. From the test results, we can see that GMTT-DL and GMTT-SL are significantly better than all of the other counterparts in terms of H-Acc and FM-index. We also test the significance between GMTT-DL and all of the other GMTT-based approaches in Tables VII and VIII.

TABLE VI
WILCOXON SIGNED-RANK TEST BETWEEN GMTT-BASED APPROACHES AND THE OTHER THREE COUNTERPARTS

TABLE VII
WILCOXON SIGNED-RANK TEST BETWEEN GMTT-DL AND THE OTHER GMTT-BASED APPROACHES IN TERMS OF H-ACC

TABLE VIII
WILCOXON SIGNED-RANK TEST BETWEEN GMTT-DL AND THE OTHER GMTT-BASED APPROACHES IN TERMS OF FM-INDEX
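The Wilcoxon signed-rank statistic underlying these significance tests can be computed as follows. This is a minimal illustrative re-implementation (in practice a library routine such as scipy.stats.wilcoxon would be used); the paper's exact test configuration, e.g., one- versus two-sided testing and significance level, is not given in this excerpt:

```python
def wilcoxon_W(diffs):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for a list of paired
    performance differences. Zero differences are discarded; tied absolute
    values receive average ranks. Small W indicates the two paired samples
    differ consistently in one direction."""
    d = [x for x in diffs if x != 0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        # extend the tie group of equal absolute differences
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    return min(w_plus, w_minus)

# Hypothetical per-data-set H-Acc differences between two methods.
W = wilcoxon_W([0.02, -0.01, 0.03, 0.04, -0.02, 0.05])
```

Comparing W against the critical value for the number of nonzero pairs (or using a normal approximation for larger samples) yields the "+"/"−" significance marks reported in the tables.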
TABLE IX
H-ACC PERFORMANCE OF IGMTT-DL AND IHC ON 10 DATA SETS
Fig. 13. Run time on the Magic, Occupy, and synthetic data sets. For the Magic and Occupy data sets, the bars from left to right stand for GMTT-DL, P-EL, RP-SL, RP-AL, T-SL, T-AL, and T-CL, respectively.

Fig. 14. Run time on the Magic, Occupy, and synthetic data sets. For the Magic and Occupy data sets, the left and right bars stand for IGMTT-DL and IHC, respectively.
The results indicate that GMTT-DL significantly outperforms GMTT-SL, GMTT-AL, and GMTT-CL.

To verify the efficiency of GMTT-DL, the run times of all of the counterparts on the two large-scale data sets, Magic and Occupy, are compared in Fig. 13. The run times on each data set are recorded and visualized by histograms for comparison. To better observe the changing orientation of the run time for all of the approaches, we also ran all of the counterparts on a synthetic data set with its size increased from 1000 to 200 000 by a step size of 20 000. From Fig. 13, we can observe from the run times on the Magic and Occupy data sets that the proposed approach takes much less time in comparison with all of the other counterparts. According to the run time on the synthetic data set with changing size, we can find that the run times of T-SL, T-AL, and T-CL increase dramatically with the size of the data set. Compared with them, the run times of the four fast hierarchical clustering approaches, i.e., GMTT-DL, P-EL, RP-SL, and RP-AL, increase obviously more slowly. Although the run time of GMTT-DL remains the smallest on the synthetic data set with sizes from 1000 to 200 000, RP-SL and RP-AL have lower growth rates than GMTT because their time complexity is lower. If the size of the synthetic data set continues increasing, the run times of the RP-based approaches will become smaller than that of the proposed GMTT-DL. However, the hierarchy quality of RP-SL and RP-AL is limited by T-SL and T-AL, as discussed in Section I. Therefore, the performance of GMTT-DL is still competitive. Generally speaking, GMTT-DL is very competitive compared to the state-of-the-art counterparts when both the hierarchy quality and processing speed are considered in practical applications.

D. Performance Evaluation of the IGMTT Framework

Furthermore, to verify the effectiveness and efficiency of the IGMTT framework, we compared it with another popular method, IHC. The two online approaches were also performed on all 10 data sets. Because IHC does not form a binary hierarchy, its FM-index performance cannot be measured. Therefore, the two online approaches are only compared in terms of H-Acc. It can be observed from the experimental results shown in Table IX that IGMTT-DL evidently outperforms IHC on most of the data sets. Because IHC is an approximation of T-SL, it has higher accuracy on the Syn C data set, which is composed of chain-shaped clusters. As discussed in Section III-D, high-dimensional data will influence the performance of the GMTT framework. Therefore, the performance of IGMTT-DL is not better than that of IHC on the Protein data set, which has 77 attributes. According to the standard deviation recorded in Table IX, the performance of IGMTT-DL is obviously more stable than that of IHC on all of the data sets since the clustering procedure of IGMTT-DL is supervised by the topology, which reasonably represents the structural distribution of data sets. For IGMTT-DL and IHC, the significance level of the difference between their performances in terms of H-Acc is also tested through the Wilcoxon signed-rank test. "+" beside IGMTT-DL indicates that the H-Acc performance of IGMTT-DL is significantly better than that of IHC.

To verify the efficiency of IGMTT-DL, its run time is also compared with that of IHC. The experimental settings are the same as for the efficiency verification experiment of GMTT-DL in Section IV-C. From Fig. 14, we can see that both the run time and growth rate of IGMTT-DL are remarkably lower than those of IHC.

In general, IGMTT-DL can incorporate new streaming inputs effectively and efficiently in hierarchical clustering tasks.

V. CONCLUSION

This paper has presented a topology training algorithm, GMTT, which can train a multilayer topological structure for a data set to fit its density distribution. Based on the GMTT algorithm, a hierarchical clustering framework has been designed, featuring lower time complexity and higher clustering quality compared to the existing approaches. The proposed framework can remarkably boost the performance of the existing traditional linkage strategies and has competitive performance when combined with the proposed DL linkage. We have analyzed that the GMTT framework improves the time complexity of hierarchical clustering to O(n^1.5) without sacrificing the hierarchy quality. Although three parameters should be set, its performance is robust to the parameter settings, which makes it easily utilized in different application domains. Furthermore, its incremental version,
IGMTT, has also been proposed to expand its application domain. The IGMTT-based framework has the same time complexity as the GMTT framework but can dynamically update the topology and successively incorporate new inputs to update the corresponding hierarchy structure. Experiments have shown the promising results of the GMTT-DL and IGMTT-DL approaches in comparison with the existing counterparts.

REFERENCES

[1] H. F. Bassani and A. F. Araujo, "Dimension selective self-organizing maps with time-varying structure for subspace and projected clustering," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 3, pp. 458–471, Mar. 2015.
[2] M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander, "Data bubbles: Quality preserving performance boosting for hierarchical clustering," in Proc. ACM SIGMOD Conf., 2001, pp. 79–90.
[3] M. M. Breunig, H.-P. Kriegel, and J. Sander, "Fast hierarchical clustering based on compressed data and optics," in Proc. 4th Eur. Conf. Princ. Data Mining Knowl. Discovery, 2000, pp. 232–242.
[4] I. Cattinelli, G. Valentini, E. Paulesu, and N. A. Borghese, "A novel approach to the problem of non-uniqueness of the solution in hierarchical clustering," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 7, pp. 1166–1173, Jul. 2013.
[5] Y.-M. Cheung, "k*-means: A new generalized k-means clustering algorithm," Pattern Recognit. Lett., vol. 24, no. 15, pp. 2883–2893, 2003.
[6] Y.-M. Cheung, "A competitive and cooperative learning approach to robust data clustering," in Proc. IASTED Int. Conf. Neural Netw. Comput. Intell., 2004, pp. 131–136.
[7] Y.-M. Cheung, "A rival penalized EM algorithm towards maximizing weighted likelihood for density mixture clustering with automatic model selection," in Proc. 17th Int. Conf. Pattern Recognit., vol. 4, 2004, pp. 633–636.
[8] Y.-M. Cheung, "Maximum weighted likelihood via rival penalized EM for density mixture clustering with automatic model selection," IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 750–761, Jun. 2005.
[9] Y.-M. Cheung, "On rival penalization controlled competitive learning for clustering with automatic cluster number selection," IEEE Trans. Knowl. Data Eng., vol. 17, no. 11, pp. 1583–1588, Nov. 2005.
[10] F. Corpet, "Multiple sequence alignment with hierarchical clustering," Nucl. Acids Res., vol. 16, no. 22, pp. 10881–10890, 1988.
[11] F. Ferstl, M. Kanzler, M. Rautenhaus, and R. Westermann, "Time-hierarchical clustering and visualization of weather forecast ensembles," IEEE Trans. Vis. Comput. Graphics, vol. 23, no. 1, pp. 831–840, Jan. 2017.
[12] E. B. Fowlkes and C. L. Mallows, "A method for comparing two hierarchical clusterings," J. Amer. Statist. Assoc., vol. 78, no. 383, pp. 553–569, 1983.
[13] A. Frank and A. Asuncion, "UCI machine learning repository," School Inform. Comput. Sci., Univ. California, Irvine, CA, USA, Tech. Rep., 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[14] S. Furao, T. Ogura, and O. Hasegawa, "An enhanced self-organizing incremental neural network for online unsupervised learning," Neural Netw., vol. 20, no. 8, pp. 893–903, Oct. 2007.
[15] S. Furao, A. Sudo, and O. Hasegawa, "An online incremental learning pattern-based reasoning system," Neural Netw., vol. 23, no. 1, pp. 135–143, Jan. 2010.
[16] J. C. Gower and G. J. S. Ross, "Minimum spanning trees and single linkage cluster analysis," Appl. Statist., vol. 18, no. 1, pp. 54–64, 1969.
[17] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, Sep. 1999.
[18] S. C. Johnson, "Hierarchical clustering schemes," Psychometrika, vol. 32, no. 3, pp. 241–254, 1967.
[19] G. Karypis, E.-H. Han, and V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling," Computer, vol. 32, no. 8, pp. 68–75, Aug. 1999.
[20] H. Koga, T. Ishibashi, and T. Watanabe, "Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing," Knowl. Inf. Syst., vol. 12, no. 1, pp. 25–53, 2007.
[21] A.-A. Liu, Y.-T. Su, W.-Z. Nie, and M. Kankanhalli, "Hierarchical clustering multi-task learning for joint human action grouping and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 102–114, Jan. 2017.
[22] Y. Lu, X. Hou, and X. Chen, "A novel travel-time based similarity measure for hierarchical clustering," Neurocomputing, vol. 173, pp. 3–8, Jan. 2016.
[23] Y. Lu and Y. Wan, "PHA: A fast potential-based hierarchical agglomerative clustering method," Pattern Recognit., vol. 46, no. 5, pp. 1227–1239, 2013.
[24] F. Murtagh, "A survey of recent advances in hierarchical clustering algorithms," Comput. J., vol. 26, no. 4, pp. 354–359, 1983.
[25] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2004, pp. 467–478.
[26] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proc. Conf. Adv. Neural Inf. Process. Syst., Dec. 2001, pp. 849–856.
[27] M. G. Omran, A. P. Engelbrecht, and A. Salman, "An overview of clustering methods," Intell. Data Anal., vol. 11, no. 6, pp. 583–605, 2007.
[28] S. S. Ray, A. Ganivada, and S. K. Pal, "A granular self-organizing map for clustering and gene selection in microarray data," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 9, pp. 1890–1906, Sep. 2016.
[29] J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky, "Automatic extraction of clusters from hierarchical clustering representations," in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining, 2003, pp. 75–87.
[30] J. Schneider and M. Vlachos, "On randomly projected hierarchical clustering with guarantees," in Proc. SIAM Int. Conf. Data Mining, 2014, pp. 407–415.
[31] H. K. Seifoddini, "Single linkage versus average linkage clustering in machine cells formation applications," Comput. Ind. Eng., vol. 16, no. 3, pp. 419–426, 1989.
[32] F. Shen and O. Hasegawa, "A fast nearest neighbor classifier based on self-organizing incremental neural network," Neural Netw., vol. 21, no. 10, pp. 1537–1547, Dec. 2008.
[33] S. Shuming, Y. Guangwen, W. Dingxing, and Z. Weimin, "Potential-based hierarchical clustering," in Proc. 16th Int. Conf. Pattern Recognit., 2002, pp. 272–275.
[34] D. H. Widyantoro, T. R. Ioerger, and J. Yen, "An incremental approach to building a cluster hierarchy," in Proc. IEEE Int. Conf. Data Mining, 2002, pp. 705–708.
[35] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bull., vol. 1, no. 6, pp. 80–83, 1945.
[36] L. Xu, T. W. S. Chow, and E. W. M. Ma, "Topology-based clustering using polar self-organizing map," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 4, pp. 798–808, Apr. 2015.
[37] R. Xu and D. Wunsch, II, "Survey of clustering algorithms," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
[38] H. Zhang, X. Xiao, and O. Hasegawa, "A load-balancing self-organizing incremental neural network," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 6, pp. 1096–1105, Jun. 2014.
[39] Y. Zhang, Y.-M. Cheung, and Y. Liu, "Quality preserved data summarization for fast hierarchical clustering," in Proc. Int. Joint Conf. Neural Netw., 2016, pp. 4139–4146.
[40] Z. Zhang and Y.-M. Cheung, "On weight design of maximum weighted likelihood and an extended EM algorithm," IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1429–1434, Oct. 2006.
[41] J. Zhou and J. Sander, "Data bubbles for non-vector data: Speeding-up hierarchical clustering in arbitrary metric spaces," in Proc. 29th Int. Conf. Very Large Data Bases, 2003, pp. 452–463.

Yiu-ming Cheung (F'18) received the Ph.D. degree from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. He is currently a Full Professor with the Department of Computer Science, Hong Kong Baptist University, Hong Kong. His current research interests include machine learning, pattern recognition, visual computing, and optimization. He is a Fellow of IET, BCS, and RSA, and a Distinguished Fellow of IETI. He serves as an Associate Editor for the IEEE Transactions on Neural Networks and Learning Systems, Pattern Recognition, and so on.

Yiqun Zhang received the B.Eng. degree from the School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China, in 2013, and the M.Sc. degree from the Department of Computer Science, Hong Kong Baptist University, Hong Kong, in 2014, where he is currently pursuing the Ph.D. degree with the Department of Computer Science. His current research interests include machine learning, data mining, and pattern recognition.