10 1109@TNNLS 2018 2853407

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1

Fast and Accurate Hierarchical Clustering Based


on Growing Multilayer Topology Training
Yiu-ming Cheung , Fellow, IEEE, and Yiqun Zhang

Abstract— Hierarchical clustering has been extensively applied In general, a traditional hierarchical clustering framework
for data analysis and knowledge discovery. However, the scala- can be summarized as follows.
bility of hierarchical clustering methods is generally limited due Step 1: Each single data point is assigned to an individual
to their time complexity of O(n2 ), where n is the size of the
input data. To address this issue, we present a fast and accurate cluster.
hierarchical clustering algorithm based on topology training. Step 2: The most similar pair of clusters is found according
Specifically, a trained multilayer topological structure that fits to a certain linkage strategy.
the spatial distribution of the data is utilized to accelerate the Step 3: The most similar pair of clusters is merged to form
similarity measurement, which dominates the computational cost a new cluster.
in hierarchical clustering. Moreover, the topological structure
also guides the merging steps in hierarchical clustering to Step 4: Steps 2 and 3 are repeated until only one cluster
form a meaningful and accurate clustering result. In addition, exists or a particular stop condition is satisfied.
an incremental version of the proposed algorithm is further In the above-mentioned steps, the commonly used linkage
designed so that the proposed approach is applicable to the strategies are single linkage (SL), average linkage (AL), and
streaming data as well. Promising experimental results on various complete linkage (CL), which compute the maximum, average,
data sets demonstrate the efficiency and effectiveness of the
proposed algorithms. and minimum similarity between the data points of two clus-
ters, respectively [27]. The traditional hierarchical clustering
Index Terms— Data analysis, hierarchical clustering, incremen- frameworks with SL, AL, and CL linkages are abbreviated
tal algorithm, time complexity, topology.
as T-SL, T-AL, and T-CL hereinafter. Although these three
I. I NTRODUCTION traditional approaches are parameterless and simple to use,
they have three major problems.
C LUSTERING methods can be classified into two types:
partitional clustering [5]–[9], [26], [40] and hierarchical
clustering [10], [18], [19], [31]. Partitional clustering separates
1) Their performance is sensitive to different data distrib-
ution types. T-SL “has a tendency to produce clusters
a set of data points into a certain number of clusters to min- that are straggly or elongated” [17]; T-CL and T-AL
imize the intracluster distance and maximize the intercluster tend to produce compact and spherical-shaped clusters,
distance, while hierarchical clustering views each data point as respectively.
an individual cluster and builds a nested hierarchy by gradually 2) All three only consider the local distance between the
merging the current most similar pair of them. Compared pairs of data points during clustering. When overlapped
with partitional clustering, hierarchical clustering offers more clusters exist, their performances will be influenced [37].
information regarding the distribution of the data set. Often, 3) Their time complexity is O(n 2 ), which limits their appli-
the hierarchy is visualized using dendrograms, which can be cations, particularly for large-scale data and streaming
“cut” at any level to produce the desired number of clusters. data.
Due to the rich information it offers, hierarchical clustering has To tackle the above-mentioned three problems, various types
been extensively applied to different fields, e.g., data analysis, of hierarchical clustering approaches have been proposed in
knowledge discovery, pattern recognition, image processing, the literature. To solve the first two problems, potential-
bioinformatics, and so on [4], [11], [21]. based hierarchical clustering approaches based on potential
theory [33] have been proposed (see [22] and [23]) where the
Manuscript received April 18, 2017; revised November 2, 2017 and potential field is utilized to measure the similarity between data
April 15, 2018; accepted June 27, 2018. This work was supported in part by
the National Natural Science Foundation of China under Grant 61672444 and points. Because this type of approach merges the data points
Grant 61272366, in part by the SZSTI under Grant JCYJ20160531194006833, by considering both the global distribution, i.e., potential fields
and in part by the Faculty Research Grant of Hong Kong Baptist University of data points and local relationship, i.e., the exact distance
under Project FRG2/16-17/051 and FRG2/17-18/082. (Corresponding author:
Yiu-ming Cheung.) between neighbors, they show robustness when processing
Y.-m. Cheung is with the Department of Computer Science, Hong Kong data sets with different data distribution types and overlapped
Baptist University (HKBU), Hong Kong, and also with the HKBU Institute clusters. Nevertheless, their time complexity is still O(n 2 ).
of Research and Continuing Education, Shenzhen 518057, China (e-mail:
[email protected]). To cope with the third problem, locality-sensitive hashing-
Y. Zhang is with the Department of Computer Science, Hong Kong Baptist based hierarchical clustering [20] has been proposed with a
University, Hong Kong (e-mail: [email protected]). time complexity of O(nm) to speed up the closest pair search
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. procedure of T-SL, where m is the bucket size. However,
Digital Object Identifier 10.1109/TNNLS.2018.2853407 the setting of parameters for this approach is nontrivial, and
2162-237X © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

2 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

its clustering accuracy is generally lower than that of T-SL. region of data sets. Accordingly, a hierarchical clustering
Furthermore, hierarchical clustering based on random projec- framework based on GMTT is formed. Differing from our
tion (RP) [30] with time complexity of O(n(log n)2 ) has also preliminary work in [39], this framework can dynamically
been proposed. It accelerates T-SL and T-AL by iteratively create and train seeds to form a multilayer topology. With the
projecting data points into different lines for splitting. In this topology, the merging steps of hierarchical clustering are per-
manner, the data set is partitioned into small subsets, and formed under its guidance. Moreover, the similarity between
the similarity can be measured locally to reduce computa- data points is only measured within each seed’s corresponding
tion cost. However, RP-based approaches inherit the draw- subset, which can significantly reduce the computational cost.
backs of T-SL and T-AL, i.e., they have a bias for certain In general, most of the traditional linkage strategies, i.e., SL,
data distribution types, and they cannot distinguish over- AL, and CL, can be applied to the GMTT-based framework.
lapped clusters well, due to approximation. To simultaneously To achieve better clustering performance, a new density-based
tackle the three problems, summarization-based hierarchical linkage strategy is also presented. Because it simultaneously
clustering frameworks have been proposed in the literature. considers the global and local data distribution information,
Specifically, data bubble-based hierarchical clustering and its its clustering performance is promising. In addition,
variants [2], [3], [25], [29], [41] have been proposed to summa- an incremental version of the GMTT framework, denoted
rize the data points by randomly initializing a set of seed points as the IGMTT framework, is also presented to cope with
to incorporate nearby data points into groups (data bubbles). streaming data. In the IGMTT framework, each new input
Subsequently, the hierarchical clustering is performed on the can easily find its nearest neighbor by searching the topology
bubbles only to avoid the similarity measurement for a large from top to bottom. Then, both the existing topology and
number of original data points. In general, the performance of hierarchy are locally updated to recover the influence caused
the data bubble and its variants is sensitive to the compression by the input. Both the GMTT and the IGMTT frameworks
rate and the initialization of seed points. Our preliminary have competent performance in terms of clustering quality
work in [39] has addressed the sensitivity problem by training and time complexity, i.e., O(n 1.5 ). Their effectiveness and
the seed points to better summarize the data points. Never- efficiency have been empirically investigated. The main
theless, a common shortcoming of the summarization-based contributions of our work are summarized as follows.
approaches is that the hierarchical relationship between data 1) The GMTT algorithm is proposed for seed point train-
points is lost due to summarization. In addition, none of ing. The topology of the seed points can appropriately
the above-mentioned approaches are fundamentally designed represent the structural data distribution. The training is
for streaming data. Specifically, the entire clustering process automatic without prior knowledge of the data set, e.g.,
should be executed to update the hierarchy structure for each number of clusters, proper number of seeds, and so on.
new input, which may sharply increase the computational cost. 2) A fast hierarchical clustering framework has been pro-
To solve this problem, the incremental hierarchical clustering posed based on GMTT. According to the topology
(IHC) approach [34] has been proposed. It saves a large trained through GMTT, distance measurement is locally
amount of computational cost by dynamically and locally performed to reduce computational cost. Merging is
restructuring the inhomogeneous regions of the present hierar- also guided by the topology to make the constructed
chy structure. Therefore, this approach performs hierarchical hierarchy able to distinguish the borders of real clusters.
clustering with a time complexity as low as O(n log n) when 3) A new linkage strategy called density linkage (DL)
the hierarchy structure is completely balanced. However, the is presented, which simultaneously considers the local
balance of the constructed hierarchy is not guaranteed, which and global data distribution information to make the
makes its worst-case time complexity still O(n 2 ). Further- clustering results robust to different data distribution
more, because IHC is an approximation of T-SL, it will also types and overlapping phenomena.
have a bias for certain data distribution types. 4) An incremental version of the GMTT framework,
In this paper, we concentrate on: 1) addressing with the i.e., the IGTMM framework, is provided for streaming
three above-mentioned problems of traditional hierarchical data hierarchical clustering. Similar to the GMTT frame-
clustering frameworks and linkage strategies, and 2) proposing work, it is also fast and accurate.
a new hierarchical clustering framework for streaming data. The rest of this paper is organized as follows. Section II
We first propose a growing multilayer topology training gives an overview of the existing relevant hierarchical clus-
(GMTT) algorithm to dynamically learn the spatial distribution tering approaches. In Section III, the details of the pro-
of data and construct the corresponding topological structure. posed GMTT framework, IGMTT framework, and DL linkage
In the literature, topology training has been widely utilized are described. Then, Section IV presents the experimental
for partitional clustering [1], [14], [15], [28], [32], [36], [38]. results for various benchmark and synthetic data sets. Finally,
However, to the best of our knowledge, it has yet to be utilized we draw a conclusion in Section V.
for hierarchical clustering. We make the topology grow by cre-
II. OVERVIEW OF E XISTING R ELEVANT H IERARCHICAL
ating new layers with new seeds based on existing seeds if the
C LUSTERING M ETHODS
existing seeds cannot represent the data set well. The growth is
continued until each node can appropriately represent the local A. Potential-Based Hierarchical Clustering
data distribution. As a result, the GMTT algorithm assigns The approach proposed in [23] converts the distance
more layers and seeds to finely describe the high-density between data points into potential values to measure the
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEUNG AND ZHANG: FAST AND ACCURATE HIERARCHICAL CLUSTERING BASED ON GMTT 3

Algorithm 1 Potential-Based Hierarchical Clustering Algorithm 2 RP-Based Hierarchical Clustering


Input: Data set X Input: Data set X
Output: Dendrogram D Output: Dendrogram D
1: for i = 1 to n do 1: perturb the data points;
2: compute the potential ϕ xi by Eq.(2); 2: while subsets with size larger than mi n Pts exist do
3: end for 3: partition these subsets using random projection;
4: for i = 1 to n do 4: end while
5: find x i ’s parent x p ; 5: compute distance between data points within each subset;
6: link the pair as x p − → x i to form a part of E W T ; 6: sort all the computed distances together;
7: end for 7: for i = 1 to n − 1 do
8: for i = 1 to n − 1 do 8: merge the closest pair;
9: find and merge the pair with the shortest edge in E W T ; 9: end for
10: eliminate the edge between the merged pair;
11: end for
the ranking. Because the procedures of the RP-based
framework with SL linkage and RP-based framework with
density levels of data points. Having the potential value of AL linkage are similar, both of them are summarized
each data point, an Edge Weighted Tree (EWT) is constructed, in Algorithm 2.
and the hierarchy can easily be read off from it. Suppose In the algorithm, improper selection of the parame-
that we have a data set with n data points, denoted as X = ter mi n Pts may lead to the failure of building a den-
{x 1 , x 2 , . . . , x n }. The distance between two data points x a and drogram. Therefore, parameter-free versions of RP-based
x b is denoted as dist(x a , x b ). The potential value of point x a approaches have also been proposed in [30], which solves
received from point x b is calculated by the parameter selection problem by repeatedly performing the
⎧ RP-based approaches with the different values of mi n Pts until
⎪ 1
⎨− if dist(x a , x b ) ≥ λ the dendrogram can be correctly constructed.
xa ,xb = dist(x a , xb ) (1)

⎩− 1
if dist(x a , x b ) < λ
λ C. Incremental Hierarchical Clustering
where the parameter λ is used to avoid the singularity problem
The IHC approach was proposed for streaming data, and
when dist(x a , x b ) is too small. The total potential value of a
it processes each input and maintains the hierarchy in three
data point x a is defined as the sum of the potential values it
steps: 1) search the existing data points to find the one with
has received from all of the other data points
the shortest distance to the new input; 2) detect the hierarchy

n
in a bottom-up manner and insert a new input under a proper
ϕ xa = xa ,xi . (2) node; and 3) detect and restructure the hierarchy in a top-
i=1,i =a
down manner. In the IHC approach, we can judge if a node
According to the potential values and the distances between is homogeneous or not according to its upper and lower
data points, an EWT is constructed by linking data points to limitation. For a new data point x a , its nearest neighbor x b is
its closest point with a higher potential value than it. The first located over the leaf nodes. Then, the upward detection is
hierarchy of the data set can be read off from the EWT by performed to x b ’s parent node v p . If the distance dist(x a , x b )
sequentially merging the linked pair with the closest distance. between x a and x b is smaller than the upper limitation and
The algorithm of the potential-based approach is summarized larger than the lower limitation of v p , v p is judged to be
in Algorithm 1. homogeneous after accepting x a as its child. In this case,
x a and x b are said to form a normal density region under v p ,
B. RP-Based Hierarchical Clustering and x a is simply inserted into the hierarchy as v p ’s children.
RP-based hierarchical clustering approaches aim to partition Similarly, if dist(x a , x b ) is smaller than the lower limitation,
the entire data set into small enough subsets in which the x a and x b will form a higher density region under v p .
data points are very close to each other. In this manner, Therefore, the hierarchy should be restructured by inserting
the similarity can be measured within each subset to reduce a new node with the child nodes x a and x b and parent
computational cost. In this approach, data points are randomly node v p to maintain the homogeneity of the hierarchy.
projected onto different lines for splitting. After each projec- If x a and x b form a lower density region under v p , detection
tion, the original subset is split into two smaller subsets. After should be performed upward to v p ’s parent node, grandparent
a certain amount of splitting, each subset will contain a small node, and so on until x a is properly inserted into the hierarchy.
number of data points that are highly likely to be very close Due to the incorporation of x a , the homogeneity of the nodes
to each other, and each pair of the closest data points will stay in layers lower than x a may also be influenced. Therefore,
in at least one of the subsets. The splitting is stopped when a downward detection and recovery are also necessary to detect
the size of each subset is smaller than a parameter mi n Pts. and recover the inhomogeneous regions of the hierarchy until
Finally, all the similarity values of each subset are ranked no inhomogeneous region is detected. The IHC approach is
together, and pairs of data points are merged according to summarized in Algorithm 3.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

4 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Algorithm 3 IHC high-density region via the GMTT algorithm. Consequently,


Input: streaming data set X the structure of the trained topology is similar to the desired
Output: Dendrogram D hierarchy and can offer guidance to accelerate the hierarchical
1: initialize the hierarchy; clustering procedures.
2: for each new input x a do Specifically, given a data set X = {x 1 , x 2 , . . . , x n } with
3: find x a ’s parent v p ; n data points, the topology T is trained by randomly inputting
4: if x a cause a higher-density region then data points from X to adjust the nodes. Each node in T is
5: create new node as the parent node of x a and x b ; expressed in the form of vl, p,h , where l indicates the layer of
6: end if the node in T , p is the sequence number of its parent node,
7: if x a cause a normal-density region then and h is its own sequential number. For simplicity, vl, p,h can
8: insert x a as v p ’s child node; be denoted as v h if the information of its layer and parent node
9: end if is not considered in some cases. The corresponding subset of a
10: end for node vl, p,h is expressed as X h , which contains sh data points
11: detect and recover inhomogeneous regions; belonging to X. During the training, we need to decide if
the topology should grow or not. In other words, we should
decide if a node vl, p,h can represent its corresponding
III. T HE P ROPOSED FAST H IERARCHICAL subset X h well and when to make the topology grow by
C LUSTERING A PPROACH creating B child nodes for vl, p,h in layer l + 1. Here, B is
a constant referred to as the branching factor, and it controls
This section will propose a topology training algorithm
the number of child nodes created for each node. The nodes
that can gradually and automatically make a topology grow
that cannot represent their subset well are defined as coarse
to better represent the distribution of data. Subsequently,
nodes. The definitions of a full coarse node and semicoarse
a framework based on it is presented to achieve fast and
node are as follows.
accurate hierarchical clustering. Furthermore, an incremental
Definition 1: Let vl, p,h be a leaf node with a sh -point
version of the framework will also be presented for streaming
corresponding subset. Given the branching factor B and the
data hierarchical clustering.
upper limitation U L , the node vl, p,h is a full coarse node if
and only if sh > U L · (B − 1).
A. Growing Multilayer Topology Training Definition 2: Let vl, p,h be a leaf node with a sh -point
The GMTT algorithm is presented, which trains a set of corresponding subset. Given the branching factor B and the
seed points to represent the data distribution. In the beginning, upper limitation U L , the node vl, p,h is a semicoarse node if
only one seed point is initialized and trained to be the physical and only if U L < sh  U L · (B − 1).
center of the entire data set. Obviously, one seed point alone In the above-mentioned two definitions, U L controls the upper
cannot represent the spatial distribution of the entire data set bound of the size sh of vl, p,h ’s subset. For a full coarse node,
well, especially for complex real-world data sets. To better B new child nodes should be trained by randomly selecting
represent the data distribution, a number of new seed points data points from X h . For a semicoarse node, Bs new child
are initialized and trained to be the child seed points of the nodes should be created and trained in the same manner, where
original one. The new seed points are the centers of their Bs is the branching factor of a semicoarse node. During the
corresponding subsets, which are produced by splitting the training, the value of Bs will dynamically change according
entire data set according to them. All of the newly created to the size of the semicoarse node’s corresponding subset
seed points are linked to their parent with edges, which  
sh
indicate their affiliation. Because the seed points and their Bs = . (3)
UL
nested affiliation structure are very similar to the neuron nodes
and the topology of multilayer neural networks, respectively, Supposing that vl, p,h is a full coarse node, B child nodes
we utilize the words “nodes” and “topology” to indicate the {vl+1,h,t +1 , vl+1,h,t +2 , . . . , vl+1,h,t +B } should be initialized
seed points and their affiliation structure hereinafter. For each from vl, p,h ’s subset X h = {x h,1 , x h,2 , . . . , x h,sh }, where t is
new node, growing training should be performed repeatedly the total number of nodes before the initialization of B new
until all of the existing seed points represent their subsets child nodes. After the initialization, the value of t is updated
well. It is expected that more nodes should be assigned by t (new) = t (old) + B. For an input x h,i , the winner child node
to the regions that are hard to represent well in the data vl+1,h,w is determined among B child nodes by
set. There are many criteria for defining a region that is w= argmin γ j x h,i − vl+1,h, j 2 (4)
hard to represent well, e.g., inhomogeneous data distribution, t −B+1 j t
high-density data distribution, border region of clusters, and where γ j is the winning frequency of node vl+1,h, j among
overlapped region of several clusters. From the perspective B new child nodes. After the winner child node vl+1,h,w is
of hierarchical clustering, the merging of data points happens selected out, it is adjusted with a small step toward x h,i by
in high-density regions at the beginning and gradually moves (new) (old) (old)
vl+1,h,w = vl+1,h,w + η · x h,i − vl+1,h,w (5)
to low-density regions. Moreover, the merging of the high-
density region data points dominates the processing time. where η is the learning rate. The child nodes are iteratively
Based on this scenario, we choose to better represent the trained through (4) and (5) until convergence. The training
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEUNG AND ZHANG: FAST AND ACCURATE HIERARCHICAL CLUSTERING BASED ON GMTT 5

Algorithm 4 Nodes Training


Input: Data set X h , learning rate η, upper limitation U L and
branching factor B (Bs )
Output: B (Bs ) new child nodes
1: initialize B (Bs ) new nodes from subset X h ;
2: while Convergence = f alse do
3: randomly select a data point x h,i from X h ;
4: find the winner node according to Eq.(4);
5: adjust the winner node according to Eq.(5);
Fig. 1. Topology trained for a 20-points data set.
6: end while

Algorithm 5 GMTT Algorithm


Input: Data set X, learning rate η, upper limitation U L and
branching factor B
Output: Topology T
1: initialize a node v 1,0,1 from X to be the top node of T ;
2: while existing full-coarse or semi-coarse node do
3: find a coarse node vl, p,h in T ;
4: generate B (Bs ) new nodes through Algorithm 4; Fig. 2. Results of (a) GMTT and (b) its one-layer version.
5: end while

The results of GMTT and its one-layer version are


compared in Fig. 2. It can be observed that the nodes
procedure of child nodes can be summarized in Algorithm 4.
trained through GMTT fit the density distribution better.
After B new child nodes are created and trained for vl, p,h ,
2) The structure of the topology trained through GMTT is
X h is split into B subsets. As each new child node only
consistent with the expected hierarchy whose nodes in
represents a part of X h , the current representation becomes
deeper layers indicate a high-density distribution of data
more precise. Here, we also define the concept of a fine node
and vice versa. Moreover, links in the topology indicate
to judge when to stop the growth of the topology.
the affiliation between subsets of nodes, which are
Definition 3: Let vl, p,h be a child node with a sh -point
similar to the links in the dendrogram. These properties
corresponding subset. Given the upper limitation U L , the
make the topology suitable for hierarchical clustering.
node vl, p,h is a fine node if and only if sh  U L .
However, if all the seed points are trained in one layer,
When all the leaf nodes in the topology are judged as
they cannot offer the desired information for hierarchical
fine nodes, the growth is stopped. The entire GMTT algo-
clustering.
rithm is summarized in Algorithm 5. An example of the
Although the trained topology is similar to the desired den-
GMTT algorithm is illustrated in Fig. 1, where a three-layer
drogram, they still have significant differences. First, each leaf
topology is trained for a 20-point data set with B = 3 and
node in the topology is the physical center of its subset but not
U L = 4. In the topology shown in Fig. 1, layer 1 contains
an exact data point. Second, a link in the topology connecting
only one top node v 1,0,1 with the corresponding subset X 1 ,
two nodes only indicates their affiliation during the growth
which is also the entire data set X. Because s1 > U L ,
of the topology but not their detailed hierarchy relationship.
B child nodes are initialized and trained in the next layer
Therefore, how to efficiently and effectively obtain the desired
using data points from X 1 . A branch stops its growth with fine
dendrogram through further processing of the topology will be
node v 2,1,3 in layer 2 because s3  U L . Finally, the topology
discussed in Section III-B.
stops its growth in layer 3 because all of the leaf nodes are fine
nodes, which means that the entire data set can be represented
well. It can be seen from the figure that the union of all the B. Fast Hierarchical Clustering Based on GMTT
leaf nodes’ subsets X 3 , X 5 , X 6 , X 7 , X 8 , and X 9 is the entire From the perspective of hierarchical clustering, the con-
data set X. structed hierarchy should satisfy two properties: homogeneity
Here, we also discuss why we design the GMTT algorithm and monotonicity [24]. Suppose we cut a dendrogram hori-
but do not directly train sufficient seed points in one layer. zontally to produce a certain number of clusters; homogeneity
1) In GMTT, the number of corresponding data points of is the property that the similarity between intracluster points is
each leaf node will be smaller than U L due to the higher than that of the intercluster points. Monotonicity is the
GMTT. This guarantees that high-density regions have property that the clusters produced by cutting the dendrogram
more nodes, and low-density regions have less. However, in a layer close to the bottom are more homogeneous than
if we initialize a sufficiently large number of nodes once, the clusters produced by cutting the dendrogram in a layer
some nodes will be trapped locally and will not represent close to the top. In the topology obtained through GMTT,
the density distribution of the data well. Therefore, because the subset of each node is a local part of their parent
GMTT is more proper for hierarchical clustering. node’s subset, it roughly satisfies the property of homogeneity.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

The monotonicity is also satisfied among the nodes that are


lineal consanguinity of each other, where the concept of lineal
consanguinity is defined in the following definition.
Definition 4: Let v h be a node in T . If another node v m
in T can be found by searching T in a constant direction
(bottom-up or top-down) from v h , then v h and v m are said to Fig. 3. Data points in the subsets are linked to form sub-MSTs.
be lineal consanguinity of each other.
For instance, v 3,2,5 and v 1,0,1 shown in Fig. 1 are lineal Algorithm 6 GMTT Hierarchical Clustering Framework
consanguinity of each other, but v 3,2,5 and v 2,1,3 are not. Even Input: Data set X, learning rate η, upper limitation U L and
node v 2,1,3 is in layer 2, its homogeneity is not guaranteed to branching factor B
be lower than that of node v 3,2,5 in layer 3 because the subset Output: MST M
of v 3,2,5 is not a local part of v 2,1,3 . 1: train a topology T through Algorithm 5;
To merge all of the data points according to the topol- 2: measure the density for the new child nodes according to
ogy, data points inside leaf nodes’ subsets and the subsets Eq.(9) and their corresponding data points according to
themselves should be merged according to a certain linkage Eq.(6)-(8);
strategy. The merging procedures should also comply with 3: form sub-MSTs for the new child nodes and their corre-
the lineal consanguinity relationship between topology nodes sponding data points according to Eq.(10);
to exploit the homogeneity and monotonicity of the topol-
ogy. As discussed in Section I, potential-based methods have
competitive performance because they consider both the local Based on the above-mentioned density estimation, we also
and global data distributions. The potential value can also present our DL linkage to comply with the topology as follows.
be understood as an index indicating the density level of For a data point x h,i inside a leaf node v h ’s subset X h , a set
a data point. In other words, a very small potential value of data points with higher density values than x h,i is selected
indicates that the data point is located in a very high-density θ
out from X h as X h h,i = {x h, j |θh, j ≥ θh,i , j = 1, 2, . . . , sh ,
region. Because we focus on the density distribution of data θ
j = i }. In X h h,i , the winner x h,w with the shortest distance
points in this paper, we define density as the negative of
to x h,i is selected out by
potential as defined in (1) and (2). However, computing the
density value for each data point by considering all of the w = argmin θh,i ||x h,i − x h, j ||2 (10)
{ j |x h, j ∈X h }
other data points is very time-consuming, especially for large-
scale data sets. To accelerate the computation, we present our and linked with x h,i through an edge with length
density measurement to compute the density value of data ||x h,i − x h,w ||2 . When all of the data points in X h are linked
points and nodes. The density θh,i of a point x h,i inside a with their winner points, a sub-MST has been formed for X h .
leaf node v h ’s subset X h is estimated according to both its In Fig. 3, we take the same data set and topology shown
neighbors inside X h and the other leaf nodes of the topology, in Fig. 1 as an example to show the sub-MSTs formed through
which can be written as DL. According to the sub-MSTs, subsets should be linked to
⎛ ⎞
form a complete MST. Therefore, nodes in the same layer
1 ⎝  
sh u
θh,i = ωh,i, j + h,i,m ⎠ (6) sharing the same parent node are also linked to form sub-MSTs
n−1 according to their density values. It is commonly recognized
j =1, j  =i m=1,m =h
that the hierarchical clustering result can be expressed in the
where n is the size of X and u is the total number of leaf nodes
form of an MST instead of a dendrogram because they contain
inside the topology. ωh,i, j is the density value of x h, j received
the same information and can be converted to each other
from another data point x h, j in X h . h,i,m is the density value
easily [16], [17], [24]. Therefore, the hierarchical cluster-
of x h, j received from another leaf node v m . Here, ωh,i, j and
ing task can also be converted to form an MST for our
h,i,m are defined as
GMTT framework with DL linkage. When sub-MSTs are
1 formed for all of the leaf nodes’ subsets and all of the nodes
ωh,i, j = (7)
||x h,i − x h, j ||2 sharing the same parent, a complete MST linking all of the
and data points has been formed. The entire GMTT hierarchical
1 clustering framework is summarized in Algorithm 6.
h,i,m = sm · (8) Here, we also introduce how to transform the MST into
||x h,i − v m ||2
a dendrogram according to the corresponding topology and
respectively. MST in three stages.
The density h of a leaf node v h can be written as Stage 1: Only data points belonging to leaf nodes’ subsets
1 
u
1 are considered for merging. Specifically, for a leaf
h = sm · . (9) node v h , all the pairs of linked data points in
n − sh ||v h − v m ||2
m=1,m =h its subset X h are stacked together according to
Accordingly, the density of a nonleaf node can be estimated the ascending order of their edge lengths. The
in the same manner according to all of the other leaf nodes stacked pairs form a local merging queue (LMQ)
that are not lineal consanguinity of itself. qh . After the LMQs: {q1 , q2 , . . . , qul }, are formed
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEUNG AND ZHANG: FAST AND ACCURATE HIERARCHICAL CLUSTERING BASED ON GMTT 7

Fig. 5. Dendrogram at the end of Stage 1.

Fig. 4. Six LMQs: {q1 , q2 , . . . , q6 }, are formed according to the corre-


sponding sub-MSTs in Fig. 3. At the beginning, q3 (1) = p(x8 , x9 ) is the
most similar pair among the candidates C (dashed frame). Therefore, x8 and
x9 are merged first and p(x8 , x9 ) is removed from q3 .

Fig. 6. Topology at the end of Stage 2.


for all the u l leaf nodes, a candidate set
C = {q1 (1), q2 (1), . . . , qul (1)} containing the
pairs with the shortest edges in each LMQ is
formed. A set of lengths of the edges D =
{d1 (1), d2 (1), . . . , dul (1)} of the candidates is also
formed. Then, the most similar pair qg (1) that
should be merged is found by

g = argmin D(i ). (11)


1≤i≤u l

After the merging, qg (1) is removed from both


set C and qg . Subsequently, the current most Fig. 7. Dendrogram at the end of Stage 2.
similar pair in qg is popped up into C. The above-
mentioned operations are iteratively performed
until an all-leaf-parent (ALP) node in T becomes
an all-candidate-parent (ACP) node. For a nonleaf
node, if all its child nodes are leaf nodes, it is
an ALP. When all the data points belonging to the
subsets of ALP’s child nodes are merged together
within their subsets, the ALP becomes an ACP.
Fig. 4 illustrates the merging procedure of Stage 1
using the 20-point data set from Figs. 3, and 5
shows the corresponding dendrogram. After merg-
ing the data points according to Fig. 5, an ALP
v 2,1,2 becomes an ACP. Therefore, both the child Fig. 8. Dendrogram at the end of Stage 3.
nodes of ACP and data points belong to the subsets
of all the other leaf nodes should be considered for of ALP’s child nodes continue to be merged,
merging in Stage 2. more ALPs will become ACPs in Stage 2. Because
Stage 2: Because the topology only guarantees the ACPs’ child nodes also continue to be merged in
monotonicity of nodes that are lineal consanguinity Stage 2, when all the child nodes of an ACP are
of each other, lengths of edges between ACP’s merged together, the ACP becomes a leaf node.
leaf nodes are not guaranteed to be larger than In Stage 2, the merging of leaf nodes and data
the edges linking unmerged data points belonging points is performed repeatedly until all of the
to the subsets of all the leaf nodes. Therefore, ALPs become ACPs. Fig. 6 demonstrates the
pairs of ACPs’ leaf nodes are viewed as merging topology at the end of Stage 2. The corresponding
candidates and should be considered together with dendrogram is presented in Fig. 7.
the data point candidates for merging. Suppose Stage 3. After Stage 2, the candidate set C only contains
that v h is the only ACP at the beginning of pairs of nodes in this stage. These nodes are finally
Stage 2, pairs of its leaf nodes should also be merged according to their edge lengths until all
stacked to form an LMQ Q h for node v h according of the candidates are merged together. The final
to their edge lengths. Afterward, the closest pair dendrogram of the 20-point data example formed
of nodes Q h (1) is put into the candidate set C. after Stage 3 is shown in Fig. 8.
Because data points belonging to the subsets The transformation algorithm is summarized in Algorithm 7.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

8 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Algorithm 7 MST-Dendrogram Transformation Algorithm Algorithm 8 IGMTT Hierarchical Clustering Framework


Input: MST M, topology T trained through Algorithm 5 Input: Streaming data set X
Output: Dendrogram D Output: MST M
1: generate LMQs for the subset of each leaf node; 1: train a coarse topology T using the former r inputs;
2: generate merging candidates C; 2: for i = r + 1 to n do
3: while C is not empty do 3: search to find the closest leaf node v h for x i ;
(new) (old)
4: if new ACP occurs then 4: X h = X h ∪ x i and sh = sh + 1;
5: generate LMQ for the ACP; 5: if v h is a full-coarse node then
6: move the closest pair from the LMQ to C; 6: generate B new nodes through Algorithm 4;
7: end if 7: measure the density for the new child nodes and
8: merge the most closest pair in C; 8: their corresponding data points;
9: remove the merged pair from C; 9: form sub-MSTs for the new child nodes and their
10: if the merged pair’s LMQ is not empty then 10: corresponding data points;
11: move the present closest pair from the LMQ to C; 11: end if
12: end if 12: end for
13: end while

triggered to update T by initializing and training B or Bs


Traditional linkages, i.e., SL, AL, and CL, can also be new child nodes for v h . To make the IGMTT algorithm more
applied to the GMTT framework, and their results can be efficient, we choose a reasonable and efficient updating trigger
transformed into dendrograms easily. Here, we offer guidance condition. That is, the updating is only triggered when a full
regarding how to apply them to the GMTT framework. coarse node occurs. Otherwise, the algorithm will directly
SL: Similar to the proposed DL, SL can also form an process the next input.
MST for hierarchical clustering tasks. Therefore, In our IGMTT framework, the MST connecting all the data
SL can be applied by taking the place of DL in the points and nodes should also be updated dynamically accord-
GMTT framework. The MST produced by it can be ing to each input. Specifically, for a new input x i , the density
transformed into a dendrogram using Algorithm 7. values of existing data points and nodes are updated using the
AL: Differing from DL and SL, AL merges data points same trigger condition of the IGMTT algorithm for topology
according to the average distance between the training. That is, if the subset X h of node v h is judged as a
present members of clusters and does not produce an full coarse node after accepting x i , B new child nodes are
MST. Therefore, we apply it to directly produce the initialized and trained for v h . The density values of the data
candidate set C without forming LMQs. Whenever a points belonging to the subsets of the new child nodes and
pair of objects (data points or nodes) is selected from the child nodes themselves are calculated using (6) and (9).
C for merging, AL will produce a new candidate Then, sub-MSTs of each of the new child nodes’ subsets and
among the objects with the same parent node as the the new child nodes themselves are formed according to DL.
merged one. The result of the IGMTT framework can also be transformed
CL: CL can be applied in the same manner as AL. into a dendrogram according to Algorithm 7. To better explain
When applying the three traditional linkages, nodes are viewed the details of our IGMTT framework, we summarize it
as data objects and processed according to their real values. in Algorithm 8.

D. Discussion and Complexity Analysis


C. Incremental Hierarchical Clustering Based on GMTT
In this section, we further discuss and analyze the capabil-
Streaming data processing is a significant challenge for ities and potential limitations of the proposed GMTT frame-
hierarchical clustering approaches. To make the GMTT frame- work in terms of distribution type and dimensionality of data
work feasible for the processing of streaming data, we present sets. For the IGMTT framework, the relationship between
its incremental version. We first train a coarse topology clustering quality and the number of data points for training
through GMTT using the former part of inputs. Then, for the coarse topology is also discussed.
each new input, the coarse topology is dynamically updated 1) Distribution Type: The proposed GMTT-DL approach
through the incremental version of GMTT, which is abbre- is robust to different data distribution types, especially
viated as IGMTT. Specifically, for a streaming data set X the overlapping type since the GMTT algorithm extracts
with n objects, the coarse topology is trained through the the structural distribution of data and the DL linkage
GMTT algorithm using the former r streaming inputs of X considers both the global and local distributions of data.
with the upper limitation U L and branching factor B. Then, For some special distribution types, i.e., chain-shaped,
for each new input x i , the closest leaf node v h of x i is found spherical-shaped, and ring-shaped distributions, its per-
by searching T from top to bottom according to the lineal formance will not be very competitive compared to some
consanguinity relationship. Subset X h of v h incorporates the traditional approaches that have an obvious bias for these
new input, and the size sh of X h is updated by sh(new) = distributions. However, these special distribution types
(old)
sh + 1. If v h is judged as a coarse node, the updating is will not occur individually in most of the real data sets.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEUNG AND ZHANG: FAST AND ACCURATE HIERARCHICAL CLUSTERING BASED ON GMTT 9

By contrast, overlapping is very common in real data complexity is O(u nl B 2 ). For each of the leaf nodes, a sub-
sets. MST should be constructed for its corresponding U L data
2) Dimensionality: The GMTT algorithm extracts the data points. For u l leaf nodes in total, the time complexity
distribution structure by gradually creating necessary is O(u l U L2 ). Therefore, the time complexity for constructing
nodes. New nodes gradually split the data space to detect the MST (Algorithm 6, line 3) is O(u nl B 2 + u l U L2 ).
and represent the data distribution. Due to the curse of The overall time complexity of the proposed GMTT frame-
dimensionality, the distribution of data points will be work is O(sn B I + nU L + nu l + u l u nl + u l2 + u nl B 2 + u l U L2 ).
sparser for high-dimensional data. As a result, nodes Here, I is a very small constant ranging from 2 to 10 according
trained through GMTT will be less representative, and to the experiment. B is always set to a small positive
√ integer,
the structural distribution information offered by the e.g., 2–4 in the experiments. When U L is set at n, the overall
topology may have less contribution or even negative time complexity can be optimized to O(n 1.5 ). 
contribution to improve the clustering quality. However, With the same parameter setting, the complexity of
the curse of dimensionality will also influence the other applying SL, AL, and CL to the GMTT framework is
hierarchical clustering approaches since Euclidean dis- also O(n 1.5 ).
tance is commonly utilized by the existing approaches. Theorem 2: The IGMTT framework has √time complexity
3) Coarse Topology: The IGMTT algorithm trains a coarse O(n 1.5 ) if the upper limitation U L is set at n.
topology using the former part of streaming data. Proof: According to the proof of Theorem 1, the time
Because it extracts the structural distribution of data and complexity of the coarse topology training (Algorithm 8,
allows fine training for the coarse topology according line 1) is O(r 1.5 ), where r is the size of training set
to the following inputs, the size of the former part and r n.
of streaming inputs for coarse topology training will For n inputs, the time complexity for searching the closest
not influence the clustering quality significantly if the leaf node (Algorithm 8, line 3) according to u nl nonleaf nodes
distribution of streaming data does not change with is O(Bu nl n).
time. The case in which the data distribution changes According to Definition 1, lines 6–10 of Algorithm 8 will
over time is another challenging problem for hierarchical be performed once for every U L (B − 1) new inputs. In other
clustering, which is not considered in this paper. words, they will be triggered n/U L (B − 1) times in total.
The above-mentioned discussion is further justified by the For each trigger, B new nodes should be trained by
experimental results in Section IV. U L (B −1) data points and the training will be repeated I times
We also prove that the time complexity of the GMTT and for convergence (Algorithm 8, line 6). Therefore, the time
IGMTT frameworks can be optimized to O(n 1.5 ), which is complexity for n/U L (B − 1) triggers is O(n I ).
lower than O(n 2 ) of traditional approaches. For each trigger, U L − 1 data points and u l − 1 leaf nodes
Theorem 1: The GMTT framework has √ time complexity should be considered to measure the density for each of
O(n 1.5 ) if the upper limitation U L is set at n. the U L (B − 1) data points; u l − 1 leaf nodes should be
Proof: When the topology T trained through GMTT is considered to measure the density for each of the B new
a total imbalanced tree, we will have the worst case time nodes (Algorithm 8, line 7). Therefore, the time complexity
complexity. In this case, the number of nonleaf nodes is for n/U L (B − 1) triggers is O((U L + u l )n + (u l n/U L )).
u nl = (n − U L /(B − 1)U L ). From the top to the bottom of T , For each trigger, a sub-MST for the corresponding U L
the numbers of data points for training the nonleaf nodes can data points of each of the new nodes should be formed.
be viewed as an arithmetic sequence {n, n − (B − 1)U L , n − Therefore, the time complexity for B new nodes should
2(B − 1)U L , . . . , n − (u nl − 1)(B − 1)U L }. Therefore, total be O(BU L2 ); A sub-MST should also be formed for the B new
number of data points for training all the nonleaf nodes is nodes, which has time complexity O(B 2 ). For n/U L (B − 1)
sn = nu nl − ((B − 1)U L (u 2nl + u nl )/2). For each of the data triggers, the time complexity for the MST construction part
points, B nodes should be considered to find the winner node (Algorithm 8, line 9) is O(U L n + (Bn/U L )).
using (4). For each nonleaf node, the training will be repeated The overall time complexity of the IGMTT framework is
I times for convergence. Therefore, the time complexity for O(r 1.5 + Bu nl n + n I + (U L + u l )n + (u l n/U L ) + U L n +
the topology training (Algorithm 6, line 1) is O(sn B I ). (Bn/U L )). Similar to the GMTT framework, the √ time com-
According to (6)–(8), U L − 1 data points and u l − 1 leaf plexity can be optimized to O(n 1.5 ) with U L = n. 
nodes should be considered to measure the density for a The time complexity of the proposed MST-dendrogram
data point, where u l = (n/U L ) stands for the number of transformation algorithm is analyzed as follows. For a leaf
leaf nodes in T . For n data points, the time complexity is node, the time complexity for forming LMQ for the corre-
O(nU L + nu l ). According to (9), at most u l − 1 leaf nodes sponding U L data points is O(U L2 ). For u l leaf nodes, the time
should be considered to measure the density for a node. For complexity is O(U L2 u l ). In each merging step, the distance
u nl nonleaf nodes and u l leaf nodes, the time complexity is between the first pairs in u l LMQs should be compared to
O(u l (u nl +u l )). Therefore, the time complexity for measuring find the smallest one. For n − 1 merges, the time com-
the density for all the data points and nodes (Algorithm 6, plexity is O(u l n). Therefore, the overall time complexity of
line 2) is O(nU L + nu l + u l u nl + u l2 ). the transformation
√ algorithm is O(U L2 u l + u l n). When we
For each nonleaf node, a sub-MST should be constructed set U L at n, the time complexity can also be optimized
for its B child nodes. For u nl nonleaf nodes in total, the time to O(n 1.5 ).
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

10 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Suppose that R1 is the classification result produced by the


benchmark data set, and R2 is the clustering result produced
by horizontally cutting the hierarchy, the FM-index can be
computed by

TP TP
FM = · (14)
TP + FP TP + FN
where TP is the total number of true positives, FP is the total
Fig. 9. Three synthetic data sets. number of false positives, and FN stands for a false negative.
If two clustering results R1 and R2 match completely, the
TABLE I FM-index will take the maximum value 1, and vice versa.
S TATISTICS OF THE 10 D ATA S ETS To determine whether the performances of the proposed
approaches are significantly better than those of their coun-
terparts, we also use the Wilcoxon signed-rank test [35] to
indicate the significance of improvements. In the experiments,
we use “−” and “+” to express the acceptance and rejection of
the null hypothesis, respectively. The acceptance and rejection
of the null hypothesis indicate that the performance of our
method is not significantly better and significantly better than
that of the counterparts, respectively. For all the comparisons
in the experiments, we use the commonly used significance
IV. E XPERIMENTS
level of α = 0.05.
Experiments were conducted in three parts: 1) study of The experiments were conducted on a desktop computer
the parameters; 2) performance evaluation of the GMTT with an Intel(R) Xeon(R) CPU with the main frequency
framework; and 3) performance evaluation of the IGMTT of 3.30 GHz and 8 GB of DDR2-667 RAM.
framework. All of the experiments were performed on both
benchmark and synthetic data sets with different sizes, dimen-
sions, and distribution types. All of the real data sets were B. Study of the GMTT Parameters
collected from the UCI Machine Learning Repository1 [13], In the proposed GMTT framework, there are three para-
and the synthetic data sets, Syn A–Syn C, are shown in Fig. 9. meters, i.e., the branching factor B, upper limitation U L , and
Statistics of the data sets is given in Table I. All of the feature learning rate η. Each of them may influence the clustering
values of the 10 data sets are normalized to the interval [0, 1] performance in different ways. Here, we discuss them indi-
using the min-max normalization scheme for the experiments. vidually and also investigate the combination of any two of
them by fixing the remaining one.
A. Evaluation Measures 1) Branching Factor B: A too-large value of B may cause
a flat topology, which makes the topology unable to
The quality of the hierarchy produced by the proposed
distinguish between high-density and low-density dis-
GMTT framework has been measured by two indices: hier-
tributions of data. Moreover, a too-flat topology cannot
archy accuracy (H-Acc) [34] and the Fowlkes Mallows index
offer rich structural information for hierarchical cluster-
(FM-index) [12].
ing. Therefore, a too-large value of B may influence
The H-Acc is calculated by
 the quality of the hierarchy. By contrast, a too-small B
ci ∈C ε(ci ) ∩ ε(h i ) may make the topology too deep. Too many layers will
Acc H = (12) lead to a high computational cost for topology training.
n
where n is the number of data points in the data set X and In addition, a too-small B will split data points into large
ci stands for the i th class of the class set C. ε(ci ) and ε(h i ) subsets in the topology, which may incorrectly split real
denote the set of data points within class ci in X and the set of clusters and yield poor clustering accuracy. Therefore,
data points under sub-hierarchy h i , respectively. h i stands for a too-small B may influence the run time and quality of
the subhierarchy in the hierarchy H , under which the points the GMTT framework.
correspond to the points in the data set X with the class ci . 2) Upper Limitation U L : A too-large U L may make the
h i can be defined as subsets too large to distinguish different clusters. How-
ever, it will accelerate the run time of GMTT framework.
ε(ci ) ∩ ε(h j ) A too-small U L may lead to large amounts of computa-
h i = argmaxh j ∈H . (13)
ε(ci ) ∪ ε(h j ) tion because it yields a deep topology.
To measure the FM-index, the constructed hierarchy struc- 3) Learning Rate η: Both a too-large and a too-small η
ture should be cut horizontally to produce the same number will lead to a high computation cost and will influence
of clusters as the number of classes of the original data. the clustering quality. When η is too large, the training
is very unstable after each adjustment and hard to
1 http://archive.ics.uci.edu/ml/ converge. When η is set too small, training needs many
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEUNG AND ZHANG: FAST AND ACCURATE HIERARCHICAL CLUSTERING BASED ON GMTT 11

Fig. 10. Performance of GMTT-DL with different B-U L value combinations on four data sets.

Fig. 11. Performance of GMTT-DL with different B-η value combinations on four data sets.

To experimentally investigate the impact of any pair of the parameters, the proposed GMTT-DL has been performed 10 times for different value combinations of each pair of the parameters on four typical data sets: Seed, which is a real and small data set; Urban, which is a real and high-dimensional data set; Magic, which is a real and large-size data set; and Syn B, which is a synthetic data set with overlapped clusters. For each pair of parameters, the remaining one is fixed. As a rule of thumb, the value of η was set at 0.1 when investigating the impact of the B–U_L relationship. B was set at 4 when evaluating the U_L–η relationship. According to the analysis in Section III-D, U_L was set at √n when studying the B–η relationship. For all of the experiments, B = 1 and U_L = 1 are not evaluated because they make the GMTT algorithm meaningless. Because the size of the Magic data set is large, the parameters U_L and B are evaluated with large and small spacing steps to better indicate the relationships between parameters. The experimental results of B–U_L, B–η, and U_L–η are presented in Figs. 10–12, respectively. It can be observed that our discussion regarding the three parameters is confirmed.
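The pairwise sweep can be organized as in the following sketch; run_gmtt_dl is a hypothetical callable standing in for one training-plus-evaluation run of GMTT-DL, and the candidate grids shown in the comment are examples rather than the exact values evaluated in Figs. 10-12.

```python
import itertools
import numpy as np

def sweep_pair(X, y, run_gmtt_dl, b_values, ul_values, eta_fixed=0.1, repeats=10):
    """Evaluate every (B, U_L) combination with the learning rate held fixed,
    averaging the scores over several runs because GMTT is randomized."""
    results = {}
    for b, ul in itertools.product(b_values, ul_values):
        scores = [run_gmtt_dl(X, y, B=b, UL=ul, eta=eta_fixed) for _ in range(repeats)]
        results[(b, ul)] = np.mean(scores, axis=0)  # e.g., mean (H-Acc, FM-index) pair
    return results

# Example grids (B = 1 and U_L = 1 are excluded, as in the paper):
# results = sweep_pair(X, y, run_gmtt_dl, b_values=[2, 4, 8, 16], ul_values=[2, 8, 32, 128])
```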
Fig. 12. Performance of GMTT-DL with different U_L–η value combinations on four data sets.
TABLE II
H-ACC OF 8 COUNTERPARTS ON 10 DATA SETS

TABLE III
FM-INDEX OF 8 COUNTERPARTS ON 10 DATA SETS
Moreover, the clustering quality in terms of the H-Acc and FM-index of the GMTT framework is very robust to different parameter value combinations, except for some extreme values, e.g., B = 2, U_L = 2, η = 1, and η = 0.001. From the run time results, it can be observed that the run time is the lowest when the value of U_L is approximately √n, which also confirms the time complexity analysis in Section III-D. According to the experimental results and the above-mentioned discussion, we set B = 4, η = 0.1, and U_L = √n for all of the data sets in the following experiments.

C. Performance Evaluation of the GMTT Framework

To investigate the effectiveness of the GMTT framework, we have compared its performance with that of the traditional hierarchical clustering framework combined with the traditional linkage strategies, i.e., SL, AL, and CL. For each data set, the H-Acc and FM-index were calculated to measure the performance of all the counterparts. Because there are randomization procedures in the GMTT framework, we perform it 10 times and take the average performance as the final result. The experimental results are given in Tables II and III. For each data set, the best result is highlighted via boldface. "+" and "−" beside the GMTT frameworks stand for the Wilcoxon test results. It can be observed that the GMTT framework obviously boosts the performance of SL, AL, and CL on most of the 10 data sets. T-SL outperforms its GMTT version on the Syn C data set because the data distribution type of Syn C is chain shaped, which is preferred by SL. The performance of T-AL is also obviously better than that of its GMTT version on the Syn B data set because the data distribution type of Syn B is spherical shaped, which is preferred by AL. We can also observe from the experimental results that the performances of the different linkages with the GMTT framework are close to each other, with competitive performance on most of the data sets. This indicates that the GMTT framework dominates the clustering performance and that different linkage strategies will not obviously influence the performance. In general, the GMTT framework is robust to different linkage strategies and outperforms the traditional one in terms of hierarchy quality.
TABLE IV
H-ACC PERFORMANCE COMPARED WITH STATE-OF-THE-ART COUNTERPARTS ON 10 DATA SETS

TABLE V
FM-INDEX PERFORMANCE COMPARED WITH STATE-OF-THE-ART COUNTERPARTS ON 10 DATA SETS
To verify the effectiveness of the proposed GMTT-DL approach, we have compared its performance with that of all the other linkage strategies with the GMTT framework, i.e., GMTT-SL, GMTT-AL, and GMTT-CL. Moreover, the state-of-the-art hierarchical clustering approaches, i.e., the potential-based framework with edge-weighted tree linkage (P-EL) [23] and the RP-based framework with SL (RP-SL) and AL (RP-AL), have also been compared. To make the comparison fair, we use the autoparameter-selection version of the RP-based approaches. The hierarchy quality of all the counterparts in terms of the H-Acc and FM-index is compared in Tables IV and V, respectively. Because there are randomization procedures in the GMTT framework, the standard deviations of all the linkages with the GMTT framework are also presented. The best and the second best results are highlighted by boldface and underlining, respectively. It can be observed from the experimental results that GMTT-DL outperforms the other counterparts on most of the data sets. Although its performance is not always the best for all of the data sets, it is still competitive. Moreover, almost all of the winners on each data set are GMTT-based approaches, which indicates the effectiveness of the GMTT framework. It can also be observed from the results that GMTT-DL can effectively cope with the overlapping problem, since most of the real data sets and the Syn A and Syn B data sets have overlapped clusters. This is because the topology trained through GMTT extracts the structural distribution information of the data sets and the DL considers the global and local distribution information together. In addition, the standard deviations indicate that all of the GMTT-based approaches have stable performance on different data sets.

TABLE VI
WILCOXON SIGNED-RANK TEST BETWEEN GMTT-BASED APPROACHES AND THE OTHER THREE COUNTERPARTS

TABLE VII
WILCOXON SIGNED-RANK TEST BETWEEN GMTT-DL AND THE OTHER GMTT-BASED APPROACHES IN TERMS OF H-ACC

TABLE VIII
WILCOXON SIGNED-RANK TEST BETWEEN GMTT-DL AND THE OTHER GMTT-BASED APPROACHES IN TERMS OF FM-INDEX

In Table VI, the experimental results of the GMTT-based approaches and the other three state-of-the-art counterparts, i.e., P-EL, RP-SL, and RP-AL, given in Tables IV and V are compared by the Wilcoxon test. From the test results, we can see that GMTT-DL and GMTT-SL are significantly better than all of the other counterparts in terms of H-Acc and FM-index. We also test the significance between GMTT-DL and all of the other GMTT-based approaches in Tables VII and VIII.
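Such a per-method significance check can be sketched as follows; the paired score lists are placeholders, and scipy.stats.wilcoxon is merely one standard implementation of the Wilcoxon signed-rank test, not a statement about the authors' test script.

```python
from scipy.stats import wilcoxon

def compare_methods(scores_a, scores_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test on per-data-set scores of two methods.
    Returns '+' if method A is significantly better, '-' if worse, '~' otherwise."""
    stat, p = wilcoxon(scores_a, scores_b)
    if p >= alpha:
        return "~"
    return "+" if sum(a - b for a, b in zip(scores_a, scores_b)) > 0 else "-"

# Hypothetical H-Acc values of two methods over 10 data sets:
# verdict = compare_methods([0.91, 0.88, ...], [0.85, 0.80, ...])
```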
TABLE IX
H-ACC PERFORMANCE OF IGMTT-DL AND IHC ON 10 DATA SETS
Fig. 13. Run time on the Magic, Occupy, and synthetic data sets. For the Magic and Occupy data sets, the bars from left to right stand for GMTT-DL, P-EL, RP-SL, RP-AL, T-SL, T-AL, and T-CL, respectively.

Fig. 14. Run time on the Magic, Occupy, and synthetic data sets. For the Magic and Occupy data sets, the left and right bars stand for IGMTT-DL and IHC, respectively.
The results indicate that GMTT-DL significantly outperforms GMTT-SL, GMTT-AL, and GMTT-CL.

To verify the efficiency of GMTT-DL, the run times of all of the counterparts on the two large-scale data sets, Magic and Occupy, are compared in Fig. 13. The run times on each data set are recorded and visualized by histograms for comparison. To better observe the growth trend of the run time for all of the approaches, we also run all of the counterparts on a synthetic data set with its size increased from 1000 to 200 000 by a step size of 20 000. From Fig. 13, we can observe from the run times on the Magic and Occupy data sets that the proposed approach takes much less time than all of the other counterparts. According to the run time on the synthetic data set with changing size, we can find that the run times of T-SL, T-AL, and T-CL increase dramatically with the size of the data set. Compared with them, the run times of the four fast hierarchical clustering approaches, i.e., GMTT-DL, P-EL, RP-SL, and RP-AL, increase much more slowly. Although the run time of GMTT-DL remains the smallest on the synthetic data set with sizes from 1000 to 200 000, RP-SL and RP-AL have lower growth rates than GMTT because their time complexity is lower. If the size of the synthetic data set continues increasing, the run times of the RP-based approaches will be smaller than that of the proposed GMTT-DL. However, the hierarchy quality of RP-SL and RP-AL is limited by T-SL and T-AL, as discussed in Section I. Therefore, the performance of GMTT-DL is still competitive. Generally speaking, GMTT-DL is very competitive compared to the state-of-the-art counterparts when both the hierarchy quality and the processing speed are considered in practical applications.
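The size-scaling experiment can be outlined as below; cluster_fn stands for any of the compared methods, and the uniform random generator and the exact size schedule are assumed stand-ins for the synthetic data described above rather than the paper's benchmark code.

```python
import time
import numpy as np

def scaling_benchmark(cluster_fn, dims=10, sizes=None, seed=0):
    """Time one clustering method on synthetic data whose size grows
    from 1000 to 200 000, as in the run-time experiment."""
    if sizes is None:
        # One plausible reading of "from 1000 to 200 000 by a step size of 20 000".
        sizes = [1000] + list(range(20000, 200001, 20000))
    rng = np.random.default_rng(seed)
    timings = []
    for n in sizes:
        X = rng.random((n, dims))          # placeholder synthetic data
        start = time.perf_counter()
        cluster_fn(X)
        timings.append((n, time.perf_counter() - start))
    return timings
```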
D. Performance Evaluation of the IGMTT Framework

Furthermore, to verify the effectiveness and efficiency of the IGMTT framework, we compared it with another popular IHC method. The two online approaches were also performed on all 10 data sets. Because IHC does not form a binary hierarchy, its FM-index performance cannot be measured. Therefore, the two online approaches are only compared in terms of H-Acc. It can be observed from the experimental results shown in Table IX that IGMTT-DL evidently outperforms IHC on most of the data sets. Because IHC is an approximation of T-SL, it has higher accuracy on the Syn C data set, which is composed of chain-shaped clusters. As discussed in Section III-D, high-dimensional data will influence the performance of the GMTT framework. Therefore, the performance of IGMTT-DL is not better than that of IHC on the Protein data set, which has 77 attributes. According to the standard deviations recorded in Table IX, the performance of IGMTT-DL is obviously more stable than that of IHC on all of the data sets, since the clustering procedure of IGMTT-DL is supervised by the topology, which reasonably represents the structural distribution of the data sets. For IGMTT-DL and IHC, the significance level of the difference between their performances in terms of H-Acc is also tested through the Wilcoxon signed-rank test. "+" beside IGMTT-DL indicates that the H-Acc performance of IGMTT-DL is significantly better than that of IHC.

To verify the efficiency of IGMTT-DL, its run time is also compared with that of IHC. The experimental settings are the same as for the efficiency verification experiment of GMTT-DL in Section IV-C. From Fig. 14, we can see that both the run time and the growth rate of IGMTT-DL are remarkably lower than those of IHC.

In general, IGMTT-DL can incorporate new streaming inputs effectively and efficiently in hierarchical clustering tasks.

V. CONCLUSION

This paper has presented a topology training algorithm, GMTT, which can train a multilayer topological structure for a data set to fit its density distribution. Based on the GMTT algorithm, a hierarchical clustering framework has been designed, featuring lower time complexity and higher clustering quality compared to the existing approaches. The proposed framework can remarkably boost the performance of the existing traditional linkage strategies and has competitive performance when combined with the proposed DL linkage. We have analyzed that the GMTT framework improves the time complexity of hierarchical clustering to O(n^1.5) without sacrificing the hierarchy quality. Although three parameters should be set, its performance is robust to the parameter settings, which makes it easy to utilize in different application domains. Furthermore, its incremental version,
IGMTT, has also been proposed to expand its application domain. The IGMTT-based framework has the same time complexity as the GMTT framework but can dynamically update the topology and successively incorporate new inputs to update the corresponding hierarchy structure. Experiments have shown the promising results of the GMTT-DL and IGMTT-DL approaches in comparison with the existing counterparts.
REFERENCES

[1] H. F. Bassani and A. F. Araujo, "Dimension selective self-organizing maps with time-varying structure for subspace and projected clustering," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 3, pp. 458–471, Mar. 2015.
[2] M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander, "Data bubbles: Quality preserving performance boosting for hierarchical clustering," in Proc. ACM SIGMOD Conf., 2001, pp. 79–90.
[3] M. M. Breunig, H.-P. Kriegel, and J. Sander, "Fast hierarchical clustering based on compressed data and optics," in Proc. 4th Eur. Conf. Princ. Data Mining Knowl. Discovery, 2000, pp. 232–242.
[4] I. Cattinelli, G. Valentini, E. Paulesu, and N. A. Borghese, "A novel approach to the problem of non-uniqueness of the solution in hierarchical clustering," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 7, pp. 1166–1173, Jul. 2013.
[5] Y.-M. Cheung, "k*-means: A new generalized k-means clustering algorithm," Pattern Recognit. Lett., vol. 24, no. 15, pp. 2883–2893, 2003.
[6] Y.-M. Cheung, "A competitive and cooperative learning approach to robust data clustering," in Proc. IASTED Int. Conf. Neural Netw. Comput. Intell., 2004, pp. 131–136.
[7] Y.-M. Cheung, "A rival penalized EM algorithm towards maximizing weighted likelihood for density mixture clustering with automatic model selection," in Proc. 17th Int. Conf. Pattern Recognit., vol. 4, 2004, pp. 633–636.
[8] Y.-M. Cheung, "Maximum weighted likelihood via rival penalized EM for density mixture clustering with automatic model selection," IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 750–761, Jun. 2005.
[9] Y.-M. Cheung, "On rival penalization controlled competitive learning for clustering with automatic cluster number selection," IEEE Trans. Knowl. Data Eng., vol. 17, no. 11, pp. 1583–1588, Nov. 2005.
[10] F. Corpet, "Multiple sequence alignment with hierarchical clustering," Nucl. Acids Res., vol. 16, no. 22, pp. 10881–10890, 1988.
[11] F. Ferstl, M. Kanzler, M. Rautenhaus, and R. Westermann, "Time-hierarchical clustering and visualization of weather forecast ensembles," IEEE Trans. Vis. Comput. Graphics, vol. 23, no. 1, pp. 831–840, Jan. 2017.
[12] E. B. Fowlkes and C. L. Mallows, "A method for comparing two hierarchical clusterings," J. Amer. Statist. Assoc., vol. 78, no. 383, pp. 553–569, 1983.
[13] A. Frank and A. Asuncion, "UCI machine learning repository," School Inform. Comput. Sci., Univ. California, Irvine, CA, USA, Tech. Rep., 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[14] S. Furao, T. Ogura, and O. Hasegawa, "An enhanced self-organizing incremental neural network for online unsupervised learning," Neural Netw., vol. 20, no. 8, pp. 893–903, Oct. 2007.
[15] S. Furao, A. Sudo, and O. Hasegawa, "An online incremental learning pattern-based reasoning system," Neural Netw., vol. 23, no. 1, pp. 135–143, Jan. 2010.
[16] J. C. Gower and G. J. S. Ross, "Minimum spanning trees and single linkage cluster analysis," Appl. Statist., vol. 18, no. 1, pp. 54–64, 1969.
[17] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, Sep. 1999.
[18] S. C. Johnson, "Hierarchical clustering schemes," Psychometrika, vol. 32, no. 3, pp. 241–254, 1967.
[19] G. Karypis, E.-H. Han, and V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling," Computer, vol. 32, no. 8, pp. 68–75, Aug. 1999.
[20] H. Koga, T. Ishibashi, and T. Watanabe, "Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing," Knowl. Inf. Syst., vol. 12, no. 1, pp. 25–53, 2007.
[21] A.-A. Liu, Y.-T. Su, W.-Z. Nie, and M. Kankanhalli, "Hierarchical clustering multi-task learning for joint human action grouping and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 102–114, Jan. 2017.
[22] Y. Lu, X. Hou, and X. Chen, "A novel travel-time based similarity measure for hierarchical clustering," Neurocomputing, vol. 173, pp. 3–8, Jan. 2016.
[23] Y. Lu and Y. Wan, "PHA: A fast potential-based hierarchical agglomerative clustering method," Pattern Recognit., vol. 46, no. 5, pp. 1227–1239, 2013.
[24] F. Murtagh, "A survey of recent advances in hierarchical clustering algorithms," Comput. J., vol. 26, no. 4, pp. 354–359, 1983.
[25] S. Nassar, J. Sander, and C. Cheng, "Incremental and effective data summarization for dynamic hierarchical clustering," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2004, pp. 467–478.
[26] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proc. Conf. Adv. Neural Inf. Process. Syst., Dec. 2001, pp. 849–856.
[27] M. G. Omran, A. P. Engelbrecht, and A. Salman, "An overview of clustering methods," Intell. Data Anal., vol. 11, no. 6, pp. 583–605, 2007.
[28] S. S. Ray, A. Ganivada, and S. K. Pal, "A granular self-organizing map for clustering and gene selection in microarray data," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 9, pp. 1890–1906, Sep. 2015.
[29] J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky, "Automatic extraction of clusters from hierarchical clustering representations," in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining, 2003, pp. 75–87.
[30] J. Schneider and M. Vlachos, "On randomly projected hierarchical clustering with guarantees," in Proc. SIAM Int. Conf. Data Mining, 2014, pp. 407–415.
[31] H. K. Seifoddini, "Single linkage versus average linkage clustering in machine cells formation applications," Comput. Ind. Eng., vol. 16, no. 3, pp. 419–426, 1989.
[32] F. Shen and O. Hasegawa, "A fast nearest neighbor classifier based on self-organizing incremental neural network," Neural Netw., vol. 21, no. 10, pp. 1537–1547, Dec. 2008.
[33] S. Shuming, Y. Guangwen, W. Dingxing, and Z. Weimin, "Potential-based hierarchical clustering," in Proc. 16th Int. Conf. Pattern Recognit., 2002, pp. 272–275.
[34] D. H. Widyantoro, T. R. Ioerger, and J. Yen, "An incremental approach to building a cluster hierarchy," in Proc. IEEE Int. Conf. Data Mining, 2002, pp. 705–708.
[35] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bull., vol. 1, no. 6, pp. 80–83, 1945.
[36] L. Xu, T. W. S. Chow, and E. W. M. Ma, "Topology-based clustering using polar self-organizing map," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 4, pp. 798–808, Apr. 2015.
[37] R. Xu and D. Wunsch, II, "Survey of clustering algorithms," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
[38] H. Zhang, X. Xiao, and O. Hasegawa, "A load-balancing self-organizing incremental neural network," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 6, pp. 1096–1105, Jun. 2014.
[39] Y. Zhang, Y.-M. Cheung, and Y. Liu, "Quality preserved data summarization for fast hierarchical clustering," in Proc. Int. Joint Conf. Neural Netw., 2016, pp. 4139–4146.
[40] Z. Zhang and Y.-M. Cheung, "On weight design of maximum weighted likelihood and an extended EM algorithm," IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1429–1434, Oct. 2006.
[41] J. Zhou and J. Sander, "Data bubbles for non-vector data: Speeding-up hierarchical clustering in arbitrary metric spaces," in Proc. 29th Int. Conf. Very Large Data Bases, 2003, pp. 452–463.

Yiu-ming Cheung (F'18) received the Ph.D. degree from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. He is currently a Full Professor with the Department of Computer Science, Hong Kong Baptist University, Hong Kong. His current research interests include machine learning, pattern recognition, visual computing, and optimization. He is a Fellow of IET, BCS, and RSA, and a Distinguished Fellow of IETI. He serves as an Associate Editor for the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, Pattern Recognition, and so on.

Yiqun Zhang received the B.Eng. degree from the School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China, in 2013, and the M.Sc. degree from the Department of Computer Science, Hong Kong Baptist University, Hong Kong, in 2014, where he is currently pursuing the Ph.D. degree with the Department of Computer Science. His current research interests include machine learning, data mining, and pattern recognition.