
Knowledge-Based Systems xxx (2015) xxx–xxx

A tree-based incremental overlapping clustering method using the three-way decision theory

Hong Yu *, Cong Zhang, Guoyin Wang

Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

* Corresponding author. Tel.: +86 13617676007. E-mail addresses: [email protected] (H. Yu), [email protected] (C. Zhang), [email protected] (G. Wang).

Article info

Article history:
Received 18 January 2015
Received in revised form 24 April 2015
Accepted 31 May 2015
Available online xxxx

Keywords:
Incremental clustering
Overlapping clustering
Search tree
Three-way decision theory

Abstract

Existing clustering approaches are usually restricted to crisp clustering, where each object belongs to exactly one cluster; however, there are applications in which objects may belong to more than one cluster. In addition, existing clustering approaches usually analyze static datasets, in which objects are kept unchanged after being processed; yet many practical datasets are dynamically modified, which means that some previously learned patterns have to be updated accordingly. In this paper, we propose a new tree-based incremental overlapping clustering method using the three-way decision theory. The tree is constructed from representative points introduced in this paper, which can enhance the relevance of the search result. An overlapping cluster is represented by a three-way decision with interval sets, and three-way decision strategies are designed to update the clustering as the data increases. Furthermore, the proposed method can determine the number of clusters during processing. The experimental results show that it can identify clusters of arbitrary shapes without sacrificing computing time, and further comparison experiments show that the proposed method performs better than the compared algorithms in most cases.

© 2015 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.knosys.2015.05.028

1. Introduction

Most existing clustering algorithms analyze static datasets in which objects are kept unchanged after being processed [1,2]. However, in many practical applications, datasets are dynamically modified, which means that some previously learned patterns have to be updated accordingly [3,4]. Although these approaches have been successfully applied, there are situations in which a richer model is needed to represent a cluster [5,6]. For example, a researcher may collaborate with other researchers in different fields; therefore, if we cluster researchers according to their areas of interest, it can be expected that some researchers belong to more than one cluster. In such settings, overlapping clustering is as useful and important as incremental clustering.

For this reason, the problem of incremental overlapping clustering is addressed in this paper. The main contribution of this work is an incremental overlapping clustering detection method, called TIOC-TWD (Tree-based Incremental Overlapping Clustering method using the Three-Way Decision theory). The proposed method introduces a new incremental clustering framework with three-way decisions using interval sets and a new searching tree based on representative points, which together make it possible to obtain overlapping clusters as data increases. Besides, TIOC-TWD introduces new three-way strategies to efficiently update the clustering after multiple objects are added. Furthermore, the proposed method can dynamically determine the number of clusters, so the number of clusters does not need to be defined in advance. The above characteristics make TIOC-TWD appropriate for handling overlapping clustering in applications where the data is increasing.

The experimental results show that the proposed method not only can identify clusters of arbitrary shapes, but also can merge small clusters into a big one when the data changes; the proposed method can also detect new clusters, which might be the result of splitting or of new patterns. Besides, further experimental results show that the performance of the proposed method is better than that of the compared algorithms in most cases. We note that a short version of this work appeared in the RSCTC-2014 Workshop on Three-Way Decisions and Probabilistic Rough Sets [7].

2. Related work



Nowadays, there are some achievements on incremental clustering approaches. Ester et al. [8] put forward the IncDBSCAN clustering algorithm based on DBSCAN. After that, Goyal et al. [9] proposed a derived method which is more efficient than IncDBSCAN because it is capable of adding points in bulk to an existing set of clusters. Patra et al. [10] proposed an incremental clustering algorithm based on distance and leaders, but the algorithm needs to search the whole data space to find the surrounding leaders. Ibrahim et al. [11] proposed an incremental clustering algorithm which can maximize the relatedness of distances between patterns of the same cluster. Ning et al. [12] proposed an incremental spectral clustering approach that efficiently updates the eigen-system, but it cannot find overlapping clusters. Pensa et al. [13] proposed an incremental hierarchical co-clustering approach, which computes a partition of objects and a partition of features simultaneously, but it cannot find overlapping clusters either.

Meanwhile, some approaches addressing the problem of incremental overlapping clustering have been reported. Hammouda and Kamel [14] proposed a similarity histogram-based clustering method built on the concept of the Histogram Ratio of a cluster. Gil-García and Pons-Porrata [15] proposed the dynamic hierarchical compact method and the dynamic hierarchical star method; these methods are time consuming due to the framework of hierarchical clustering. Pérez-Suárez et al. [16] proposed an algorithm based on density and compactness for dynamic overlapping clustering, but it builds a large number of small clusters. Lughofer [17] proposed dynamic split-and-merge operations for evolving cluster models, which are learned incrementally but can only deal with crisp clustering. Labroche [18] proposed online fuzzy medoid based clustering algorithms, which are adapted to overlapping clusters, but the number of clusters needs to be defined in advance.

Therefore, the main objective of this paper is to propose an approach that combines the processing of incremental data with the discovery of overlapping clusters. For this kind of problem, some scholars have pointed out that clustering approaches combined with rough sets are effective [19]. Thus, Parmar et al. [20] proposed an algorithm for clustering categorical data using rough set theory; Chen et al. [21,22] researched incremental data mining with rough set theory; Peters et al. [23] proposed dynamic rough clustering; and Lingras et al. [24] reviewed fuzzy and rough approaches for soft clustering.

Further, the three-way decision with interval sets provides an ideal mechanism to represent overlapping clustering. The concept of three-way decisions was developed in research on rough set theory [25]. A theory of three-way decisions is constructed based on the notions of acceptance, rejection and noncommitment, and it is an extension of the commonly used binary-decision model with an added third option [26]. Three-way decision approaches have been successfully applied in decision systems [27–29], email spam filtering [30], three-way investment decisions [31,32], clustering analysis [33], and a number of other applications [25,34]. In our previous work [33], we proposed a three-way decision strategy for overlapping clustering, where a cluster is described by an interval set. In fact, Lingras and Yan [35] had introduced the concept of interval sets to represent clusters, and Lingras and West [36] proposed an interval set clustering method with rough k-means for mining clusters of web visitors. Yao et al. [37] represented each cluster by an interval set instead of a single set. Chen and Miao [38] described a clustering method that incorporates interval sets in the rough k-means.

In this paper, we propose three-way decision clustering, which is applicable to crisp clustering as well as overlapping clustering. There are three relationships between an object and a cluster: (1) the object certainly belongs to the cluster, (2) the object certainly does not belong to the cluster, and (3) the object might or might not belong to the cluster. It is a typical three-way decision process to decide the relationship between an object and a cluster. Objects in the lower bound are definitely part of the cluster, and belong only to that cluster; objects between the two bounds are possibly part of that cluster and potentially belong to some other clusters.

Furthermore, in the field of incremental learning, it is common to learn from new incremental samples based on the existing results. Tree structures are particularly well suited for this task because they enable a simple and effective way to search and update. At the same time, trees easily store the learned patterns (results), which can save a lot of duplicate learning time. Tree structures have been successfully used in some typical incremental learning approaches [39,40]. Therefore, this paper uses a tree to store the searching space, where a node of the tree holds the information corresponding to some representative points.

3. Description of the problem

3.1. Three-way decision clustering

To define our framework, let a universe be U = {x_1, ..., x_n, ..., x_N}, and let the resulting clustering scheme C = {C_1, ..., C_k, ..., C_K} be a family of clusters of the universe. Each x_n is an object with D attributes, namely x_n = (x_n^1, ..., x_n^d, ..., x_n^D), where x_n^d denotes the value of the d-th attribute of the object x_n, n ∈ {1, ..., N}, and d ∈ {1, ..., D}.

We can look at the cluster analysis problem from a decision-making perspective. Crisp clustering is a typical two-way decision, while overlapping clustering, or soft clustering, is a type of three-way decision. Let us review some basic concepts of clustering using interval sets from our previous work [33]. In contrast to the general crisp representation of a cluster, where a cluster is a set of objects, we represent a cluster as an interval set. That is,

$$C_k = [\underline{C_k}, \overline{C_k}], \qquad (1)$$

where $\underline{C_k}$ is the lower bound of the cluster C_k, $\overline{C_k}$ is the upper bound of the cluster C_k, and $\underline{C_k} \subseteq \overline{C_k}$.

Therefore, we can define a cluster by the following properties:

$$(\mathrm{i})\ \underline{C_k} \neq \emptyset,\ 0 < k \leq K; \qquad (\mathrm{ii})\ \bigcup_{k} \overline{C_k} = U. \qquad (2)$$

Property (i) implies that a cluster cannot be empty, which makes sure that a cluster is physically meaningful. Property (ii) states that any object of U must belong to the upper bound of some cluster, which ensures that every object is properly clustered.

With respect to the family of clusters C, we have the following family of clusters formulated by interval sets:

$$\mathbf{C} = \{[\underline{C_1}, \overline{C_1}], \ldots, [\underline{C_k}, \overline{C_k}], \ldots, [\underline{C_K}, \overline{C_K}]\}. \qquad (3)$$

Therefore, the sets $\underline{C_k}$, $\overline{C_k} - \underline{C_k}$ and $U - \overline{C_k}$, formed by certain decision rules, constitute the three regions of the cluster C_k: the positive region, the boundary region and the negative region, respectively. The three-way decisions are given as:

$$\mathrm{POS}(C_k) = \underline{C_k}, \qquad \mathrm{BND}(C_k) = \overline{C_k} - \underline{C_k}, \qquad \mathrm{NEG}(C_k) = U - \overline{C_k}. \qquad (4)$$

Objects in POS(C_k) definitely belong to the cluster C_k, objects in NEG(C_k) definitely do not belong to the cluster C_k, and objects in the region BND(C_k) might or might not belong to the cluster.
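To make the interval-set representation concrete, the following is a minimal sketch of Eqs. (1)-(4) in Python. The class and attribute names (Cluster, lower, upper) are illustrative assumptions of ours, not the paper's code (the authors' implementation, per Section 5.1, is in C++).

```python
# A minimal sketch of a cluster as an interval set (Eqs. (1)-(4)).
from dataclasses import dataclass

@dataclass
class Cluster:
    lower: set          # lower bound: objects that certainly belong
    upper: set          # upper bound: lower plus possibly-belonging objects

    def pos(self):                      # POS(C_k) = lower bound
        return self.lower

    def bnd(self):                      # BND(C_k) = upper - lower
        return self.upper - self.lower

    def neg(self, universe):            # NEG(C_k) = U - upper
        return universe - self.upper

U = {1, 2, 3, 4, 5, 6}
c1 = Cluster(lower={1, 2}, upper={1, 2, 3})
c2 = Cluster(lower={4, 5}, upper={3, 4, 5, 6})
# Object 3 lies in BND(c1) and BND(c2): at least one object belongs to
# more than one cluster, so this clustering is overlapping, not crisp.
print(c1.bnd() & c2.bnd())   # {3}
print(c2.neg(U))             # {1, 2}
```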


Any data mining technique needs a clear and precise evaluation measure. In clustering, evaluations such as the similarity between objects and the compactness of clusters are appropriate indicators of the quality of clusters. In the following work, we obtain the three regions by comparing these evaluation values with a pair of thresholds.

Under this representation, we can formulate overlapping clustering and crisp clustering as follows. For a clustering, if there exists k ≠ t such that

$$\begin{aligned} &(1)\ \mathrm{POS}(C_k) \cap \mathrm{POS}(C_t) \neq \emptyset;\ \text{or} \\ &(2)\ \mathrm{BND}(C_k) \cap \mathrm{BND}(C_t) \neq \emptyset;\ \text{or} \\ &(3)\ \mathrm{POS}(C_k) \cap \mathrm{BND}(C_t) \neq \emptyset;\ \text{or} \\ &(4)\ \mathrm{BND}(C_k) \cap \mathrm{POS}(C_t) \neq \emptyset, \end{aligned} \qquad (5)$$

we call it an overlapping (soft) clustering; otherwise, it is a crisp (hard) clustering. As long as one condition of Eq. (5) is satisfied, there must exist at least one object belonging to more than one cluster.

3.2. Incremental overlapping clustering

In this paper, we suppose the following scenario: the data we observe and collect is incremental; some data has been collected initially, and the collected data increases as time passes. In order to save time and computational resources, we hope to adjust the clustering results obtained in the previous step according to the new incremental data, rather than re-run the clustering algorithm on the whole dataset. Such a scenario often happens in the real world, so it is an important problem in data mining.

Assume there is a given information system IS^t = (U^t, A^t), where U^t is the universe and A^t is the set of attributes. The clustering result C^t = {C_1^t, ..., C_i^t, ..., C_{|C^t|}^t} is known, and the structure information of each cluster is known. The problem of incremental clustering is: for a given dataset U^t and the previous clustering result C^t, how to compute C^{t+1} = {C_1^{t+1}, ..., C_i^{t+1}, ..., C_{|C^{t+1}|}^{t+1}} efficiently and effectively according to the newly arriving objects ΔU. Sometimes, we also use U^{t+1} to denote U^t ∪ ΔU.

In order to represent overlapping clustering as well as incremental clustering, each cluster is represented by the three regions introduced in the last subsection.

4. The TIOC-TWD clustering method

The processing of the proposed TIOC-TWD method is illustrated in Fig. 1. In fact, we also devise an overlapping clustering algorithm using the three-way decision strategy for the initial static data, which is based on a graph of representative points built by calculating the similarity between representative regions. It is called Algorithm 1 and is described in Section 4.2.

4.1. Related definitions

In a D-dimensional space, when considering a small enough region, the objects are usually well-distributed; thus we can use a fictional point, called a representative point, to represent these objects.

Let Distance(x, y) be the distance between two objects; the shorter the distance between y and x, the more similar they are. For a point x and a distance threshold δ, we use Neighbor(x) to denote the objects which are near enough to x, that is, Neighbor(x) = {y | Distance(x, y) ≤ δ}. Neighbor(x) corresponds to the area whose center is x and whose radius is δ, and we say that objects in this area are within distance δ of x. The cardinality of Neighbor(x), |Neighbor(x)|, reflects the density of the area. The bigger |Neighbor(x)| is, the more compact the area is. If the density is big enough, it is reasonable to view the area in its entirety.

Definition 1 (Representative point). If |Neighbor(r)| ≥ ζ, we call r a representative point, which represents the objects in the area whose center is r and whose radius is δ.

Representative points can be fictional points rather than points/objects in the system. ζ is the density threshold. In this paper, we propose a sort method, described in Example 1, to obtain representative points instead of directly using this threshold, in order to reduce the number of thresholds.

Definition 2 (Representing region). Every representative point r is the representative of a circular area where the point is the fictitious center of the area and the radius is δ; we call this area the representing region of the corresponding representative point r.

All objects in the representing region of a representative point are seen as one entirety; an object which is not represented by any representative point is deemed to be noise.

Assuming r_k is the k-th representative point, we use Cover(r_k) = {x_1, ..., x_k} to denote the objects in its representative region. Since the representative point r_k represents the region, it is reasonable to suppose that the fictional point has D-dimensional attributes. That is, r_k = (r_k^1, ..., r_k^d, ..., r_k^D), and a range is used to represent each r_k^d, namely r_k^d = [r_k^d.left, r_k^d.right]. The following formulas are used to compute them: r_k^d.left = min{x_1^d, ..., x_k^d}, and r_k^d.right = max{x_1^d, ..., x_k^d}. Generally speaking, it is possible that an overlapping region exists between two representative regions.

Definition 3 (Similarity between representative regions). Let r_i and r_j be two arbitrary representative points; the similarity between their representative regions is defined as follows:

$$\mathrm{SimilarityRR}(r_i, r_j) = \frac{|\mathrm{Cover}(r_i) \cap \mathrm{Cover}(r_j)|}{\min(|\mathrm{Cover}(r_i)|, |\mathrm{Cover}(r_j)|)}. \qquad (6)$$

Here, |·| means the cardinality of a set.

In order to speed up the search of the similar space with incremental data, we build the searching tree based on representative points. The root represents the original space composed of all representative points, and we sort the attributes by significance. According to the most significant attribute, we construct the nodes in the 1st layer; that is, all representative points are split according to their values in the corresponding attribute; then the second most significant attribute is used, and so on.

Definition 4. Let Node_j^i be the j-th node of the i-th layer in the searching tree, and let R = {r_1, ..., r_{|Node_j^i|}} be the set of representative points belonging to the node Node_j^i.

A node is represented by a value range: Node_j^i = [Node_j^i.left, Node_j^i.right], where Node_j^i.left = min{r_1^i.left, ..., r_{|Node_j^i|}^i.left} and Node_j^i.right = max{r_1^i.right, ..., r_{|Node_j^i|}^i.right}. A node represents not a single representative point, but a set of representative points whose values fall within the value range.
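As a reading aid, here is a minimal sketch of the structures introduced by Definitions 1-4, assuming small in-memory data. RepPoint, cover and ranges are our names, not the paper's, and the later sketches in this section reuse them.

```python
# A minimal sketch of Definitions 1-4; names are ours, not the paper's.
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class RepPoint:
    cover: Set[str]                    # Cover(r): the covered objects
    ranges: List[Tuple[float, float]]  # per attribute d: [r^d.left, r^d.right]

def make_rep_point(names, X):
    """Definition 2: the fictional point spans, on every attribute d,
    the min..max of that attribute over the covered objects."""
    cols = list(zip(*(X[x] for x in names)))
    return RepPoint(set(names), [(min(c), max(c)) for c in cols])

def similarity_rr(ri, rj):
    """Eq. (6): |Cover(ri) ∩ Cover(rj)| / min(|Cover(ri)|, |Cover(rj)|)."""
    return len(ri.cover & rj.cover) / min(len(ri.cover), len(rj.cover))

def node_range(points, d):
    """Definition 4: a tree node's range on attribute d spans the ranges
    of all representative points stored in the node."""
    return (min(p.ranges[d][0] for p in points),
            max(p.ranges[d][1] for p in points))
```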


In addition, we need to measure the similarity between the representative points of the incremental data and the nodes of the tree. That is, we need to measure the similarity between two mathematical value ranges.

Definition 5 (Similarity of value ranges). For arbitrary value ranges Range1 and Range2, wherein Range1 = [Range1.left, Range1.right], and likewise for Range2: if Range1.left ≤ Range2.left and Range1.right ≥ Range2.left, then we call Range1 similar to Range2.

In other words, two ranges are similar to each other if and only if Range1 ∩ Range2 ≠ ∅.

4.2. Clustering the initial data

The basic idea of the initial clustering algorithm is a static overlapping clustering algorithm using three-way decisions based on representative points, abbreviated SOC-TWD; Algorithm 1 outlines the top-level overview.

Algorithm 1. SOC-TWD: static overlapping clustering on the initial data set.

The algorithm starts by computing the distance between objects, Distance(x, y). This paper uses the Euclidean distance to measure the distance between two objects:

$$\mathrm{Distance}(x, y) = \sqrt{\sum_{d=1}^{D} (x^d - y^d)^2}. \qquad (7)$$

Obviously, the shorter the distance between two objects, the more similar they are. Of course, we can also define the similarity between two objects as:

$$\mathrm{Similarity}(x, y) = 1 - \mathrm{Distance}(x, y). \qquad (8)$$

The larger the similarity between two objects, the more similar they are.

From Line 9 to Line 18, the algorithm determines the representative points, where a removing strategy is used in order to reduce the number of thresholds. That is, the algorithm sets the object which has the maximum number of neighbors to be the first representative point. Then, it removes the corresponding rows from the distance matrix. After that, the algorithm finds the second representative point in the remaining matrix, and so on.

From Line 19 to Line 25, Algorithm 1 constructs an undirected graph G based on the set of representative points R, using the idea of three-way decisions. Here, α and β are thresholds. For all r_i, r_j ∈ R, SimilarityRR(r_i, r_j) is computed according to Eq. (6). If SimilarityRR(r_i, r_j) ≥ α, there is a strong linked edge between them; if β ≤ SimilarityRR(r_i, r_j) < α, there is a weak linked edge between them. Line 21 computes the neighbor representative points of every representative point, which will be used in Algorithm 4.

From Line 27 to Line 35, the algorithm searches for the subgraphs that are connected by strong linked edges in the graph G. For every such subgraph, the objects corresponding to it form the positive region of a cluster, POS(C); the objects in the union of the representative regions that have weak edges connected to this subgraph form the boundary of the cluster, BND(C).
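The graph step just described can be condensed into a short sketch, reusing similarity_rr from the sketch above. This is a simplified reading of Algorithm 1 (whose full pseudocode appears in the original article), not the authors' implementation; soc_twd_graph is our name.

```python
# Simplified reading of the three-way graph step of Algorithm 1:
# alpha/beta edges from Eq. (6); strong components give POS, weakly
# attached representative regions give BND.
from itertools import combinations

def soc_twd_graph(reps, alpha, beta):
    n = len(reps)
    strong = {i: set() for i in range(n)}
    weak = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        s = similarity_rr(reps[i], reps[j])
        if s >= alpha:
            strong[i].add(j); strong[j].add(i)   # strong linked edge
        elif s >= beta:
            weak[i].add(j); weak[j].add(i)       # weak linked edge
    clusters, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        comp, stack = set(), [i]                  # DFS over strong edges
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(strong[u] - comp)
        seen |= comp
        pos = set().union(*(reps[u].cover for u in comp))
        weak_regions = [reps[v].cover for u in comp for v in weak[u]]
        bnd = set().union(*weak_regions) - pos
        clusters.append((pos, bnd))               # POS(C) and BND(C)
    return strong, weak, clusters
```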

Example 1. An example of obtaining the representative points.

Table 1 describes a dataset with ten objects, and Table 2 gives the distance matrix between these objects. Set the threshold δ = 1.5. Then, we select the object which has the maximum number of similar objects from Table 2. Thus, we choose x9 to be the geometrical center of the first representative point r1; the corresponding representing region is Cover(r1) = {x2, x3, x5, x6, x7, x8, x9}. After that, the rows of the objects in this representing region are removed from the matrix, and we obtain Table 3.

From Table 3, we choose the object x1, which has the maximum number of similar objects, to be the geometrical center of the second representative point r2. Likewise, the other two representing regions are Cover(r2) = {x1, x2, x3, x5, x10} and Cover(r3) = {x4}.
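As a usage example, the following snippet replays Example 1's greedy selection on the Table 1 data with δ = 1.5; dist implements Eq. (7), and removing the covered rows mirrors the removing strategy described above. The variable names are ours.

```python
# Reproducing Example 1: greedy selection of representative points.
import math

X = {'x1': (0.5, 0, 1, 2, 0), 'x2': (1, 0, 0, 1, 0), 'x3': (1, 0, 1, 1, 0),
     'x4': (1.5, 1, 2, 0, 1), 'x5': (1, 0, 0, 2, 0), 'x6': (0, 1, 0, 1, 0),
     'x7': (1, 2, 0, 1, 0), 'x8': (0, 1.5, 0, 1, 0), 'x9': (1, 1, 0, 1, 0),
     'x10': (0, 0, 1, 2, 0)}

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))  # Eq. (7)

delta, rows, covers = 1.5, set(X), []
while rows:
    # Neighbor(x) within delta, computed only for the remaining rows
    neigh = {x: {y for y in X if dist(X[x], X[y]) <= delta} for x in rows}
    center = max(rows, key=lambda x: len(neigh[x]))  # densest remaining row
    covers.append(neigh[center])                     # Cover(r_k)
    rows -= neigh[center]                            # remove covered rows
print(covers)   # Cover(r1), Cover(r2), Cover(r3) as in Example 1
                # (set element order may vary)
```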

4.3. Creating the searching tree

The basic idea of the proposed method is to first represent the incremental data as representative points, which brings the benefit of saving computing time compared with methods based on single objects. When we know the relationships between the new representative points and the existing representative points, we can make decisions accordingly. Therefore, we store the existing representative points in a tree, so that we can take advantage of the searching and updating operations on a tree structure.


Fig. 1. Illustration of the processing of the TIOC-TWD method. Offline computing (SOC-TWD, overlapping clustering of the static data): Algorithm 1 clusters the initial data set U, producing the clustering C and the graph G, and Algorithm 2 creates the searching tree rooted at Root. Online computing (CIA-TWD, overlapping clustering of the incremental data): for each incremental data block ΔU, representative points are obtained according to the method in Section 4.2, their neighbors are searched, and C, G and the tree are updated according to Algorithms 3 and 4.

Algorithm 2. Creating the searching tree.

Table 1
A dataset U.

U      a1    a2    a3   a4   a5
x1     0.5   0     1    2    0
x2     1     0     0    1    0
x3     1     0     1    1    0
x4     1.5   1     2    0    1
x5     1     0     0    2    0
x6     0     1     0    1    0
x7     1     2     0    1    0
x8     0     1.5   0    1    0
x9     1     1     0    1    0
x10    0     0     1    2    0

Table 2
The distance matrix of U.

Distance(xi, xj)   x1    x2    x3    x4    x5    x6    x7    x8    x9    x10
x1                 0.0   1.5   1.1   2.8   1.1   1.8   2.5   2.1   1.8   0.5
x2                 1.5   0.0   1.0   2.6   1.0   1.4   2.0   1.8   1.0   1.7
x3                 1.1   1.0   0.0   2.0   1.4   1.7   2.2   2.0   1.4   1.4
x4                 2.8   2.6   2.0   0.0   3.2   2.8   2.6   2.9   2.5   3.0
x5                 1.1   1.0   1.4   3.2   0.0   1.7   2.2   2.0   1.4   1.4
x6                 1.8   1.4   1.7   2.8   1.7   0.0   1.4   0.5   1.0   1.7
x7                 2.5   2.0   2.2   2.6   2.2   1.4   0.0   1.1   1.0   2.6
x8                 2.1   1.8   2.0   2.9   2.0   0.5   1.1   0.0   1.1   2.0
x9                 1.8   1.0   1.4   2.5   1.4   1.0   1.0   1.1   0.0   2.0
x10                0.5   1.7   1.4   3.0   1.4   1.7   2.6   2.0   2.0   0.0

Table 3
The distance matrix after removing the objects in Cover(r1).

Distance(xi, xj)   x1    x2    x3    x4    x5    x6    x7    x8    x9    x10
x1                 0.0   1.5   1.1   2.8   1.1   1.8   2.5   2.1   1.8   0.5
x4                 2.8   2.6   2.0   0.0   3.2   2.8   2.6   2.9   2.5   3.0
x10                0.5   1.7   1.4   3.0   1.4   1.7   2.6   2.0   2.0   0.0

The method of creating the searching tree is similar to that of creating a decision tree, which is built top-down. It constructs the tree according to the attribute importance. This paper utilizes a measure of node impurity to scale the attribute importance; common indices include the entropy index, the Gini index, the misclassification error, and so on [41]. The entropy index is used to measure the attribute importance in this paper. Algorithm 2 outlines the top-level overview of creating the searching tree.

The sorted attributes are denoted as A. Algorithm 2 builds every layer according to the attribute importance; the more important an attribute is, the earlier it is used. Lines 2 to 14 consider the situation in which two adjacent layers have roughly the same number of nodes; the algorithm then stops building the searching tree in order to reduce the depth of the tree. Here, Node(i) denotes all nodes in the i-th layer, and |Node(i)| is the number of nodes of the i-th layer. λ ∈ (0, 1) is a threshold: if |Node(i − 1)|/|Node(i)| ≥ λ, the algorithm stops.

Example 2. An example of creating the searching tree.

Fig. 2 is an example of creating the searching tree. There are 5 representative points in Fig. 2(a), where a solid line denotes a strong linked edge and a dotted line denotes a weak linked edge. Table 4 shows the value ranges of the representative points on every attribute; we assume the importance of a1 is greater than that of a2. According to Algorithm 2, the root of the searching tree first contains all five representative points, as denoted in Fig. 2(b). According to Definition 5 and the most important attribute a1, the root is split into two nodes to build the first layer, as shown in Fig. 2(b). Then, the algorithm splits the tree according to the second most important attribute a2. In this example, the second layer is the same as the first layer, so the algorithm stops.

We need to note that a node of the searching tree may map to one representative point or to multiple representative points.
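The following sketch illustrates the layer-splitting idea of Algorithm 2 under the assumptions of the earlier sketches. Grouping points whose ranges overlap (Definition 5) and the list-of-layers tree are our simplifications; the entropy-based attribute ranking is taken as given through attrs.

```python
# A sketch of Algorithm 2's layer construction with the lambda stop rule.
def split_by_attribute(points, d):
    """Group representative points whose value ranges on attribute d
    overlap (Definition 5); each group becomes one node."""
    nodes = []
    for p in sorted(points, key=lambda q: q.ranges[d][0]):
        left, right = p.ranges[d]
        if nodes and left <= nodes[-1][1]:       # overlaps the open node
            nodes[-1][0].append(p)
            nodes[-1][1] = max(nodes[-1][1], right)
        else:                                    # gap: open a new node
            nodes.append([[p], right])
    return [grp for grp, _ in nodes]

def build_tree(points, attrs, lam=0.9):
    """Split layer by layer along attributes sorted by importance; stop
    when |Node(i-1)|/|Node(i)| >= lam (Lines 2-14 of Algorithm 2)."""
    layers = [[points]]                          # root holds all points
    for d in attrs:
        nxt = [grp for node in layers[-1]
               for grp in split_by_attribute(node, d)]
        if len(layers[-1]) / len(nxt) >= lam:    # adjacent layers roughly equal
            break
        layers.append(nxt)
    return layers
```

With the Table 4 ranges below and a1 ranked above a2, this reproduces Example 2: the root splits on a1 into {r1, r2, r3} (range [1.0, 2.3]) and {r4, r5} (range [3.0, 4.0]), and the a2 split leaves both groups intact, so construction stops.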


Fig. 2. Example of creating the searching tree.

Table 4
Value ranges of the representative points in Fig. 2.

R     r1          r2          r3          r4          r5
a1    [1.2, 1.6]  [1.0, 1.5]  [1.4, 2.3]  [3.0, 3.6]  [3.3, 4.0]
a2    [2.0, 2.8]  [1.6, 1.9]  [1.8, 2.2]  [1.9, 2.5]  [1.7, 2.0]

4.4. Clustering the incremental data

The incremental data is denoted by ΔU. The TIOC-TWD clustering method does not need to cluster all data (U ∪ ΔU) again; it just needs to execute the algorithm for clustering the incremental data, abbreviated CIA-TWD.

The objects in ΔU are not completely unrelated to each other; there exist some links among them. Therefore, it is reasonable to first obtain representative points in the incremental data using Algorithm 1. That is to say, the CIA-TWD algorithm is based on representative points, and it includes two steps:

Step 1: obtain the representative points in ΔU according to the method in Algorithm 1; a new representative point in ΔU is denoted by r_wait.

Step 2: search and update the relation graph G obtained by Algorithm 1 according to the method described in Algorithm 4.

Obviously, how to carry out Step 1 of CIA-TWD is clear, and how to carry out Step 2 is introduced in Algorithm 4. The basic idea of CIA-TWD is to search the neighbors of every representative point r_wait, and to update the related area in the tree and the graph. Therefore, we first need to introduce how to find the neighbors of every representative point, which is recorded in Algorithm 3. In view of the different value regions of r_wait, we explain how to find the similar nodes through the following examples.

4.4.1. Finding neighbors of r_wait

Algorithm 3. FindingNeighbors(r_wait).

Algorithm 3 not only searches the tree but also updates the tree at the same time. Let SimilarNode(i) = {Node_{k1}^i, ..., Node_{kn}^i} be the set of nodes similar to r_wait in the i-th layer, and let Node_new^i be the new node in the i-th layer. Algorithm 3 first finds the nodes similar to r_wait according to Definition 5. Lines 28 to 32 describe how to find the neighbors of r_wait among the representative points contained in the nodes similar to r_wait.

Consider the relationships between r_wait and the nodes in the searching tree: in Case 1, exactly one node is similar to r_wait in every layer; in Case 2, more than one node is similar to r_wait in at least one layer; in Case 3, no node is similar to r_wait in at least one layer. In the similar cases, nodes are merged: for Case 1, the merging only involves r_wait and the similar node; for Case 2, the merging might involve the similar nodes and their children. On the contrary, in the no-similar case, splitting operations arise.

We take the example searching tree in Fig. 2(b) as the initial searching tree.
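A minimal sketch of the descent step of Algorithm 3 is given below, under the same assumptions as the earlier sketches (layers from build_tree, node_range from the Definition 1-4 sketch). It only classifies each layer into Case 1, 2 or 3 and records the search path; the merge and split updates are left to the walkthrough in Example 3.

```python
# Sketch of Algorithm 3's descent: collect similar nodes per layer.
def ranges_similar(a, b):
    return a[0] <= b[1] and b[0] <= a[1]        # Range1 ∩ Range2 ≠ ∅ (Def. 5)

def find_similar_path(layers, attrs, r_wait):
    path = []                                    # the searching path (Path)
    for depth, d in enumerate(attrs[:len(layers) - 1]):
        layer = layers[depth + 1]                # layer split on attribute d
        similar = [node for node in layer
                   if ranges_similar(node_range(node, d), r_wait.ranges[d])]
        path.append(similar)                     # len == 1 -> Case 1,
                                                 # len > 1 -> Case 2,
                                                 # len == 0 -> Case 3 (split)
    return path
```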


Example 3. Examples of finding similar nodes of r_wait.

Fig. 3 shows an example of how Case 1 is handled. Here, assume the value region of r_wait in attribute a1 is [1.2, 1.8], and the value region in a2 is [2.0, 2.5]. The incremental representative point r_wait is first added to the root. The indicator Path, which records the path of searching, points to the root.

Fig. 3(a) shows the processing of the 1st layer, where Node_1^1 = [1.0, 2.3] and the value region of r_wait in attribute a1 is [1.2, 1.8]. Because [1.0, 2.3] ∩ [1.2, 1.8] ≠ ∅, r_wait is similar to Node_1^1 according to Definition 5. Then, r_wait is added to the node Node_1^1, and the indicator Path moves to Node_1^1. The algorithm moves to the 2nd layer, to the child of Node_1^1, namely Node_1^2 = [1.6, 2.8]. We have [1.6, 2.8] ∩ [2.0, 2.5] ≠ ∅, thus r_wait is similar to Node_1^2. The new representative point is added to the node Node_1^2 and the Path moves to Node_1^2. The result is shown in Fig. 3(a).

Finally, the algorithm finds the neighbors of r_wait by computing the distance between the centers of the representative points in Node_1^2 and r_wait, which is described in Lines 28 to 32.

Fig. 3. Case 1 of finding similar nodes of r_wait.

Fig. 4 gives an example of Case 2. Here, assume the value region of r_wait in attribute a1 is [2.0, 3.1], and the value region in a2 is [2.2, 2.3]. In this situation, more than one node is similar to r_wait: r_wait is similar to both Node_1^1 and Node_2^1, because [1.0, 2.3] ∩ [2.0, 3.1] ≠ ∅ and [3.0, 4.0] ∩ [2.0, 3.1] ≠ ∅. Then, Node_1^1, Node_2^1 and r_wait are merged into one node, Node_1^1, which is shown in Fig. 4(a), and Path indicates Node_1^1. After that, the algorithm needs to determine whether the children of Node_1^1 can be merged. Because [1.6, 2.8] ∩ [1.7, 2.5] ≠ ∅, Node_1^2 and Node_2^2 are merged into one node, Node_1^2, which is shown in Fig. 4(b). Likewise, the algorithm deals with the 2nd layer: because [1.6, 2.8] ∩ [2.2, 2.3] ≠ ∅, r_wait is added to Node_1^2, which is shown in Fig. 4(c), and Node_1^2 is added to Path. Finally, the algorithm moves to Line 28 to find the neighbors of r_wait.

Fig. 4. Case 2 of finding similar nodes of r_wait.

Fig. 5 gives an example of Case 3. In this situation, there are no nodes similar to r_wait. Assume the value region of r_wait in attribute a1 is [2.8, 3.2], and the value region in a2 is [2.6, 3.0]. Fig. 5(a) shows the processing on the 1st layer: because [3.0, 4.0] ∩ [2.8, 3.2] ≠ ∅, r_wait is added to Node_2^1, and Node_2^1 is added to Path. Then, we observe the 2nd layer: because [1.7, 2.5] ∩ [2.6, 3.0] = ∅, there is no node similar to r_wait in the 2nd layer. Thus, the node Node_2^1 splits into two child nodes, Node_2^2 and Node_3^2, which is shown in Fig. 5(b).

Fig. 5. Case 3 of finding similar nodes of r_wait.

4.4.2. Updating operations

After processing Algorithm 3, we know that a new representative point r_wait might or might not have some neighbors. Algorithm 4 presents the high level of the updating processing.

Let R_neighbor be the set of representative points neighboring r_wait. Obviously, there exist three relationships between r_wait and its neighbors. Relationship 1: R_neighbor ≠ ∅, and the representative region of r_wait is covered completely by the representative regions of R_neighbor. Relationship 2: R_neighbor ≠ ∅, and the representative region of r_wait is partly covered by the representative regions of R_neighbor. Relationship 3: R_neighbor = ∅, namely r_wait has no neighbors.

In view of the different relationships, we explain how to update the graph through the following examples. We take the example in Fig. 2 as the initial clustering pattern.

Example 4. Examples of updating the relation graph G.

Fig. 6 shows an example of the updating operations for Relationship 1. Here, R_neighbor = {r1, r2, r3}, and r_wait is covered completely by the representative regions of R_neighbor. In this situation, no new representative point is produced. Objects in the representative region of r_wait are mapped to the corresponding areas represented by R_neighbor, which is shown in Fig. 6(a).

Fig. 6. Example of updating operations for Relationship 1.


Algorithm 4. UpdatingClustering.

Because no new representative point is produced in the updated tree, we need to remove r_wait from the Path. On the other hand, because new data has been added, the representative regions of R_neighbor change, and the similarities among R_neighbor are recalculated. In this example, suppose SimilarityRR(r1, r2) ≥ α, which means the relation between r1 and r2 changes from a weak link to a strong link. The updated graph is shown in Fig. 6(b).

Fig. 7 shows an example of Relationship 2. Here, R_neighbor = {r3, r4}. The representative region of r_wait is covered partly by the representative regions of R_neighbor. In this situation, a new representative point is produced in the tree. Objects in the representative region of r_wait that are covered by a representative region of R_neighbor are mapped to the corresponding areas represented by the neighbors, and objects in the representative regions of R_neighbor that are covered by the representative region of r_wait are mapped to the corresponding area represented by r_wait, which is shown in Fig. 7(a).

Because the representative regions of R_neighbor change, the similarity between R_neighbor and r_wait is recalculated. In this example, suppose β ≤ SimilarityRR(r4, r_wait) < α, so there is a weak link between r4 and r_wait; if SimilarityRR(r3, r_wait) ≥ α, there is a strong link between r3 and r_wait. The updated graph is shown in Fig. 7(b).

Fig. 7. Example of updating operations for Relationship 2.

Fig. 8 shows an example of updating operations for Relationship 3. In this situation, R_neighbor = ∅, and r_wait is a new representative point, which is shown in Fig. 8(a). The updated graph is shown in Fig. 8(b). A new cluster is produced.

Fig. 8. Example of updating operations for Relationship 3.

The three-way decision strategy is used to update the changed subgraphs from Line 19 to Line 25. If the similarity between representative regions is no less than α, namely SimilarityRR(r_i, r_j) ≥ α, we add a strong linked edge between them; if β ≤ SimilarityRR(r_i, r_j) < α, we add a weak linked edge between them. From Line 26 to Line 34, the algorithm outputs the final incremental clustering result. That is, a strong linked edge means the adjacent node is in the corresponding positive region, and a weak linked edge means the adjacent node is in the corresponding boundary region.

4.5. Time complexity analysis

We could also execute the static overlapping clustering Algorithm 1 on the whole dataset U ∪ ΔU to obtain clustering results, whereas the incremental clustering method CIA-TWD is executed mainly on the incremental data ΔU. On the other hand, the CIA-TWD algorithm of Section 4.4 mainly consists of Algorithms 3 and 4. Therefore, this subsection analyzes the time complexities of these algorithms.

Let us first consider the static overlapping clustering algorithm SOC-TWD, namely Algorithm 1. Assume the number of objects is |U| = n, the average number of objects in a representative region is p, and the number of representative points is R. Then, finding all the representative points has a complexity of O(n² + R·p + n log(n)). Constructing the relation graph G of representative points has a complexity of O(R²), and searching the graph has a complexity of O(R²). Assume the number of clusters is C and the average number of representative points in a cluster is k. Then, clustering on the relation graph of representative points has a complexity of O(C·k). Thus, the algorithm has a complexity of O(n² + R·p + n log(n) + 2·R² + C·k). In fact, p, C and k are very small, so the algorithm has a complexity of T(SOC-TWD) = O(n² + n log(n) + 2·R²).

Let us next consider Algorithm 3, which finds the neighbors of r_wait by searching and updating the tree. Assume the depth of the tree is h, and the average number of nodes similar to r_wait in every layer is k1. These k1 nodes have k2 child nodes. Merging these child nodes requires k2·log(k2) + k2 − 1 operations, and merging these subtrees requires k1 − 1 operations. Thus, searching the tree requires a complexity of O(h·((k1 − 1) + (k2·log(k2) + k2 − 1))). Assume the average number of representative points in the leaf nodes is k3; the algorithm needs k3 operations to find the neighbors. Thus, the algorithm has a complexity of O(h·((k1 − 1) + (k2·log(k2) + k2 − 1)) + k3). In fact, there are few nodes similar to the incremental representative point, that is, k1 is very small. Therefore, the algorithm has a complexity of O(h·(k2·log(k2)) + k3). In the worst case, the depth of the tree is the number of attributes, that is, h = m.

Let us finally consider Algorithm 4, which updates the clustering results. First, searching the neighbors of the new incremental representative points has a complexity of O(h·(k2·log(k2)) + k3) according to Algorithm 3. Assume the number of new representative points is R′, the average number of objects in a representative region is p1, and the average number of representative points in a neighborhood is p2. Mapping each object in r_wait to the neighbor representative points costs 2·p1·p2 operations. Assume there are p3 representative points in G related to the new representative point; updating the graph then needs p3² operations. Assume there are C′ subgraphs related to the new representative point, and the average number of representative points of a subgraph is k4. Then, updating the clustering results needs O(C′·k4). Therefore, the algorithm has a complexity of T4 = O(R′·(h·(k2·log(k2)) + k3) + R′·(2·p1·p2 + p3²) + C′·k4).

Considering the newly arriving data, the number of objects is n′, the number of new representative points is R′, and the average number of objects in a representative region is p1. According to Algorithm 1, finding all the representative points has a complexity of O(n′² + R′·p1 + n′ log(n′)). Therefore, the complete CIA-TWD algorithm has a complexity of T(CIA-TWD) = O(n′² + R′·p1 + n′ log(n′)) + T4. Generally speaking, the parameters in T4 are far less than n′; even if we set R′ or p1 near to n′, the complexity of the CIA-TWD algorithm is O(n′² + n′ log(n′)) in the worst case. However, because n′ ≪ (n + n′), T(CIA-TWD) ≪ T(SOC-TWD).

5. Experimental results

5.1. Evaluation indices and datasets

We evaluate the proposed TIOC-TWD clustering approach through the following experiments.

Table 5
Information about the datasets.

Datasets        N     D   M  δ     α     β     λ
AD              2000  3   4  0.17  0.15  0.01  0.9
Banknote        1372  5   2  2     0.06  0.03  0.9
LetterABC       2291  17  3  4.6   0.06  0.03  0.9
LetterAGI       2317  17  3  4.6   0.06  0.03  0.9
Page blocks     5473  11  5  400   0.10  0.05  0.9
Pendigits389    3165  17  3  45    0.26  0.13  0.9
Pendigits1234   4486  17  4  30    0.30  0.15  0.9
Pendigits1469   4398  17  4  38    0.20  0.10  0.9
Waveform        5000  22  3  8     0.9   0.7   0.9
Landsat         6435  37  6  38    0.6   0.3   0.9

Fig. 9. Clustering results of Test 1: (a) before the new data arriving; (b) after the new data arriving. Both panels show the positive regions POS(C1)-POS(C4).

Fig. 10. Clustering results of Test 2: (a) before the new data arriving, showing five clusters with boundary regions BND(C4) and BND(C5); (b) after the new data arriving, showing four clusters.


Fig. 11. Clustering results of Test 3: (a) before the new data arriving, showing three clusters POS(C1)-POS(C3); (b) after the new data arriving, showing four clusters POS(C1)-POS(C4).

Fig. 12. Clustering results of Test 4: (a) before the new data arriving; (b) after the new data arriving.

All the experiments are performed on a 2.67 GHz computer with 4 GB of memory, and all the algorithms are programmed in C++. The quality of the final clustering is evaluated by traditional indices such as accuracy, F-measure [42] and NMI [43], where the objects in boundary regions are deemed to belong to the positive regions in order to fit these common formulas.

Table 5 gives the summary information about the datasets and the parameters used in our experiments. AD is an artificial dataset; the other datasets come from real datasets [44]. N, D and M denote the number of objects, the number of attributes, and the number of ground-truth clusters, respectively. δ, α, β and λ are the input parameters. LetterABC is the subset of the original dataset with the letters "A", "B" or "C"; LetterAGI is the subset with the letters "A", "G" or "I". Pendigits389 is the subset with the digits 3, 8 and 9; Pendigits1234 and Pendigits1469 are the corresponding digit subsets.
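As a hedged illustration of this evaluation convention, the snippet below folds boundary objects into the positive region and assigns each object a single label before applying a standard index; the first-cluster tie-breaking for objects covered by several clusters is our simplification, as the paper does not spell one out.

```python
# Folding BND into POS and scoring with NMI (scikit-learn).
from sklearn.metrics import normalized_mutual_info_score

def flatten(clusters, universe):
    """clusters: list of (pos, bnd) set pairs; returns one label per object."""
    labels = {}
    for cid, (pos, bnd) in enumerate(clusters):
        for x in pos | bnd:                 # BND counted as positive
            labels.setdefault(x, cid)       # first covering cluster wins
    # uncovered objects (noise) get the label -1
    return [labels.get(x, -1) for x in sorted(universe)]

# y_true: ground-truth labels aligned with sorted(universe)
# print(normalized_mutual_info_score(y_true, flatten(clusters, U)))
```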

Table 6
Comparison of experimental results when the size of the incremental data block is 10% under Situation 1.

Index Method Banknote LetterABC LetterAGI Pageblocks Pendigits389 Pendigits1234 Pendigits1469 Waveform Landsat
Accuracy TIOC-TWD 0.85 ± 0.00 0.89 ± 0.01 0.73 ± 0.00 0.88 ± 0.00 0.83 ± 0.02 0.83 ± 0.01 0.88 ± 0.01 0.90 ± 0.01 0.73 ± 0.02
OFCMD 0.74 ± 0.11 0.91 ± 0.01 0.81 ± 0.02 0.90 ± 0.00 0.81 ± 0.11 0.89 ± 0.01 0.85 ± 0.10 0.52 ± 0.09 0.53 ± 0.01
HOFCMD 0.80 ± 0.13 0.74 ± 0.16 0.69 ± 0.12 0.90 ± 0.00 0.62 ± 0.10 0.76 ± 0.11 0.76 ± 0.19 0.67 ± 0.05 0.61 ± 0.11
IncDBSCAN 0.84 ± 0.00 0.80 ± 0.03 0.71 ± 0.02 0.87 ± 0.00 0.78 ± 0.00 0.71 ± 0.00 0.73 ± 0.00 0.34 ± 0.00 0.45 ± 0.00
SOC-TWD 0.86 ± 0.00 0.85 ± 0.00 0.73 ± 0.01 0.88 ± 0.00 0.91 ± 0.00 0.84 ± 0.00 0.86 ± 0.04 0.92 ± 0.00 0.79 ± 0.00
Fmeasure TIOC-TWD 0.87 ± 0.00 0.91 ± 0.01 0.72 ± 0.00 0.86 ± 0.00 0.89 ± 0.02 0.86 ± 0.00 0.93 ± 0.01 0.62 ± 0.01 0.70 ± 0.02
OFCMD 0.61 ± 0.09 0.89 ± 0.01 0.79 ± 0.02 0.85 ± 0.00 0.68 ± 0.08 0.87 ± 0.01 0.54 ± 0.08 0.43 ± 0.07 0.47 ± 0.01
HOFCMD 0.71 ± 0.09 0.64 ± 0.15 0.67 ± 0.14 0.85 ± 0.00 0.40 ± 0.08 0.49 ± 0.13 0.36 ± 0.11 0.38 ± 0.06 0.38 ± 0.06
IncDBSCAN 0.86 ± 0.00 0.86 ± 0.02 0.80 ± 0.02 0.84 ± 0.00 0.87 ± 0.00 0.79 ± 0.00 0.67 ± 0.00 0.17 ± 0.00 0.40 ± 0.00
SOC-TWD 0.92 ± 0.00 0.88 ± 0.00 0.82 ± 0.01 0.86 ± 0.00 0.91 ± 0.00 0.87 ± 0.00 0.91 ± 0.02 0.63 ± 0.00 0.70 ± 0.00
NMI TIOC-TWD 0.57 ± 0.00 0.77 ± 0.05 0.57 ± 0.00 0.04 ± 0.00 0.80 ± 0.03 0.77 ± 0.00 0.89 ± 0.01 0.22 ± 0.03 0.59 ± 0.01
OFCMD 0.04 ± 0.05 0.65 ± 0.03 0.51 ± 0.03 0.00 ± 0.00 0.43 ± 0.04 0.72 ± 0.01 0.45 ± 0.03 0.30 ± 0.04 0.43 ± 0.01
HOFCMD 0.11 ± 0.09 0.44 ± 0.06 0.41 ± 0.09 0.00 ± 0.00 0.40 ± 0.13 0.46 ± 0.12 0.14 ± 0.03 0.15 ± 0.07 0.40 ± 0.03
IncDBSCAN 0.56 ± 0.00 0.69 ± 0.03 0.66 ± 0.01 0.02 ± 0.00 0.80 ± 0.00 0.70 ± 0.00 0.81 ± 0.00 0.00 ± 0.00 0.52 ± 0.00
SOC-TWD 0.80 ± 0.00 0.71 ± 0.00 0.66 ± 0.03 0.11 ± 0.00 0.76 ± 0.00 0.78 ± 0.0 0.86 ± 0.03 0.24 ± 0.00 0.64 ± 0.00
CPU(s) TIOC-TWD 0.29 ± 0.01 1.45 ± 0.07 1.32 ± 0.03 1.48 ± 0.25 3.27 ± 0.08 6.83 ± 0.29 4.85 ± 0.20 6.69 ± 1.75 15.41 ± 0.26
OFCMD 0.35 ± 0.03 1.16 ± 0.03 1.26 ± 0.03 20.02 ± 8.33 1.06 ± 0.02 3.97 ± 0.15 3.66 ± 0.31 3.51 ± 0.24 12.60 ± 0.42
HOFCMD 0.46 ± 0.05 1.46 ± 0.14 2.05 ± 2.33 5.13 ± 0.57 2.32 ± 0.41 6.14 ± 0.46 4.63 ± 0.69 3.92 ± 0.46 30.14 ± 4.62
IncDBSCAN 1.41 ± 0.11 3.51 ± 0.38 3.64 ± 0.51 242.78 ± 56.80 4.97 ± 0.14 13.23 ± 0.53 21.12 ± 1.41 11.76 ± 0.24 36.46 ± 6.77
SOC-TWD 0.86 ± 0.05 3.41 ± 0.43 3.40 ± 0.13 7.59 ± 0.25 9.83 ± 0.30 35.10 ± 2.07 3.40 ± 0.13 34.00 ± 5.05 48.99 ± 1.81


Table 7
Comparison of experimental results when the size of the incremental data block is 20% under Situation 1.

Index Method Banknote LetterABC LetterAGI Pageblocks Pendigits389 Pendigits1234 Pendigits1469 Waveform Landsat
Accuracy TIOC-TWD 0.85 ± 0.00 0.89 ± 0.01 0.73 ± 0.00 0.88 ± 0.00 0.83 ± 0.01 0.83 ± 0.00 0.88 ± 0.01 0.90 ± 0.00 0.73 ± 0.02
OFCMD 0.83 ± 0.13 0.59 ± 0.15 0.81 ± 0.01 0.90 ± 0.00 0.85 ± 0.02 0.88 ± 0.00 0.74 ± 0.23 0.50 ± 0.03 0.52 ± 0.05
HOFCMD 0.62 ± 0.20 0.69 ± 0.19 0.38 ± 0.13 0.90 ± 0.00 0.50 ± 0.16 0.62 ± 0.15 0.71 ± 0.68 0.60 ± 0.15 0.62 ± 0.14
IncDBSCAN 0.84 ± 0.00 0.80 ± 0.03 0.71 ± 0.02 0.87 ± 0.00 0.78 ± 0.00 0.71 ± 0.00 0.73 ± 0.00 0.34 ± 0.00 0.45 ± 0.00
SOC-TWD 0.86 ± 0.00 0.85 ± 0.00 0.73 ± 0.01 0.88 ± 0.00 0.91 ± 0.00 0.84 ± 0.00 0.86 ± 0.04 0.92 ± 0.00 0.79 ± 0.00
Fmeasure TIOC-TWD 0.87 ± 0.00 0.91 ± 0.00 0.72 ± 0.00 0.86 ± 0.00 0.89 ± 0.01 0.86 ± 0.00 0.93 ± 0.01 0.62 ± 0.01 0.71 ± 0.02
OFCMD 0.67 ± 0.10 0.59 ± 0.13 0.79 ± 0.01 0.85 ± 0.00 0.71 ± 0.02 0.86 ± 0.00 0.47 ± 0.16 0.41 ± 0.01 0.46 ± 0.05
HOFCMD 0.57 ± 0.19 0.59 ± 0.16 0.25 ± 0.18 0.85 ± 0.00 0.31 ± 0.14 0.38 ± 0.14 0.35 ± 0.10 0.36 ± 0.12 0.41 ± 0.09
IncDBSCAN 0.86 ± 0.00 0.86 ± 0.02 0.80 ± 0.02 0.84 ± 0.00 0.87 ± 0.00 0.79 ± 0.00 0.67 ± 0.00 0.17 ± 0.00 0.40 ± 0.00
SOC-TWD 0.92 ± 0.00 0.88 ± 0.00 0.82 ± 0.01 0.86 ± 0.00 0.91 ± 0.00 0.87 ± 0.00 0.91 ± 0.02 0.63 ± 0.00 0.70 ± 0.00
NMI TIOC-TWD 0.56 ± 0.00 0.75 ± 0.01 0.57 ± 0.00 0.04 ± 0.00 0.80 ± 0.01 0.77 ± 0.00 0.89 ± 0.01 0.22 ± 0.01 0.60 ± 0.01
OFCMD 0.06 ± 0.04 0.47 ± 0.07 0.52 ± 0.03 0.00 ± 0.00 0.42 ± 0.02 0.72 ± 0.01 0.35 ± 0.09 0.31 ± 0.03 0.44 ± 0.01
HOFCMD 0.09 ± 0.11 0.39 ± 0.11 0.06 ± 0.13 0.00 ± 0.00 0.22 ± 0.19 0.43 ± 0.06 0.14 ± 0.03 0.15 ± 0.12 0.36 ± 0.04
IncDBSCAN 0.56 ± 0.00 0.69 ± 0.03 0.66 ± 0.01 0.02 ± 0.00 0.80 ± 0.00 0.70 ± 0.00 0.81 ± 0.00 0.00 ± 0.00 0.52 ± 0.00
SOC-TWD 0.80 ± 0.00 0.71 ± 0.00 0.66 ± 0.03 0.11 ± 0.00 0.76 ± 0.02 0.78 ± 0.00 0.86 ± 0.03 0.24 ± 0.00 0.64 ± 0.00
CPU(s) TIOC-TWD 0.37 ± 0.03 1.71 ± 0.11 1.43 ± 0.07 1.57 ± 0.23 3.27 ± 0.19 6.95 ± 0.33 5.04 ± 0.21 7.71 ± 3.40 16.24 ± 0.42
OFCMD 0.38 ± 0.03 1.40 ± 0.03 1.42 ± 0.04 3.08 ± 0.35 1.46 ± 0.01 5.90 ± 0.12 4.61 ± 0.28 4.93 ± 0.53 16.16 ± 0.55
HOFCMD 0.77 ± 0.18 2.07 ± 0.42 2.53 ± 4.25 6.25 ± 0.53 4.65 ± 1.76 10.91 ± 2.29 6.69 ± 0.88 5.96 ± 0.73 34.63 ± 4.61
IncDBSCAN 1.41 ± 0.11 3.51 ± 0.38 3.64 ± 0.51 242.78 ± 56.80 4.97 ± 0.14 13.23 ± 0.53 21.12 ± 1.41 11.76 ± 0.24 36.46 ± 6.77
SOC-TWD 0.86 ± 0.05 3.41 ± 0.43 3.40 ± 0.13 7.59 ± 0.25 9.83 ± 0.30 35.10 ± 2.07 3.40 ± 0.13 34.00 ± 5.05 48.99 ± 1.81

Table 8
Comparison of experimental results when the size of the incremental data block is 10% under Situation 2.

Index Method Banknote LetterABC LetterAGI Pageblocks Pendigits389 Pendigits1234 Pendigits1469 Waveform Landsat
Accuracy TIOC-TWD 0.82 ± 0.05 0.89 ± 0.00 0.73 ± 0.03 0.88 ± 0.00 0.85 ± 0.03 0.79 ± 0.05 0.76 ± 0.07 0.95 ± 0.01 0.64 ± 0.03
OFCMD 0.66 ± 0.09 0.89 ± 0.02 0.77 ± 0.15 0.90 ± 0.00 0.47 ± 0.13 0.70 ± 0.01 0.70 ± 0.09 0.65 ± 0.12 0.54 ± 0.02
HOFCMD 0.68 ± 0.06 0.58 ± 0.18 0.40 ± 0.09 0.90 ± 0.00 0.59 ± 0.09 0.36 ± 0.12 0.52 ± 0.00 0.53 ± 0.22 0.35 ± 0.06
IncDBSCAN 0.84 ± 0.00 0.67 ± 0.00 0.63 ± 0.01 0.86 ± 0.02 0.69 ± 0.04 0.72 ± 0.00 0.74 ± 0.00 0.34 ± 0.00 0.46 ± 0.00
SOC-TWD 0.85 ± 0.00 0.85 ± 0.00 0.74 ± 0.00 0.88 ± 0.00 0.91 ± 0.00 0.84 ± 0.00 0.87 ± 0.01 0.92 ± 0.00 0.78 ± 0.00
Fmeasure TIOC-TWD 0.86 ± 0.03 0.90 ± 0.00 0.74 ± 0.04 0.86 ± 0.00 0.91 ± 0.02 0.80 ± 0.09 0.80 ± 0.10 0.59 ± 0.01 0.64 ± 0.02
OFCMD 0.62 ± 0.08 0.88 ± 0.02 0.73 ± 0.19 0.85 ± 0.00 0.36 ± 0.11 0.63 ± 0.01 0.45 ± 0.06 0.57 ± 0.12 0.43 ± 0.03
HOFCMD 0.63 ± 0.05 0.48 ± 0.18 0.36 ± 0.12 0.85 ± 0.00 0.37 ± 0.07 0.18 ± 0.07 0.24 ± 0.01 0.28 ± 0.12 0.20 ± 0.04
IncDBSCAN 0.86 ± 0.00 0.75 ± 0.00 0.71 ± 0.01 0.84 ± 0.01 0.78 ± 0.04 0.80 ± 0.00 0.68 ± 0.01 0.17 ± 0.00 0.41 ± 0.00
SOC-TWD 0.91 ± 0.00 0.89 ± 0.00 0.82 ± 0.00 0.86 ± 0.00 0.88 ± 0.03 0.87 ± 0.00 0.92 ± 0.01 0.63 ± 0.00 0.70 ± 0.00
NMI TIOC-TWD 0.54 ± 0.04 0.74 ± 0.01 0.58 ± 0.01 0.11 ± 0.00 0.84 ± 0.03 0.74 ± 0.04 0.79 ± 0.06 0.10 ± 0.01 0.59 ± 0.01
OFCMD 0.05 ± 0.01 0.64 ± 0.05 0.45 ± 0.16 0.00 ± 0.00 0.38 ± 0.01 0.56 ± 0.00 0.48 ± 0.06 0.35 ± 0.04 0.46 ± 0.01
HOFCMD 0.04 ± 0.02 0.17 ± 0.13 0.07 ± 0.08 0.00 ± 0.00 0.12 ± 0.04 0.49 ± 0.07 0.27 ± 0.08 0.01 ± 0.03 0.35 ± 0.04
IncDBSCAN 0.55 ± 0.00 0.58 ± 0.00 0.60 ± 0.00 0.02 ± 0.00 0.74 ± 0.03 0.71 ± 0.00 0.82 ± 0.00 0.00 ± 0.00 0.53 ± 0.00
SOC-TWD 0.79 ± 0.00 0.72 ± 0.00 0.68 ± 0.00 0.11 ± 0.00 0.77 ± 0.03 0.79 ± 0.01 0.87 ± 0.01 0.24 ± 0.00 0.63 ± 0.00
CPU(s) TIOC-TWD 0.22 ± 0.04 1.52 ± 0.51 1.14 ± 0.05 1.62 ± 0.22 2.95 ± 0.06 5.26 ± 0.24 3.68 ± 0.18 9.45 ± 3.58 14.60 ± 0.81
OFCMD 0.31 ± 0.03 1.17 ± 0.04 2.47 ± 3.46 5.00 ± 1.82 1.17 ± 0.04 4.76 ± 0.22 3.73 ± 0.28 3.94 ± 0.16 13.38 ± 0.64
HOFCMD 0.44 ± 0.09 1.20 ± 0.16 6.78 ± 5.65 4.81 ± 0.51 2.12 ± 0.21 6.32 ± 0.25 5.45 ± 0.64 4.86 ± 0.80 23.53 ± 3.40
IncDBSCAN 1.75 ± 0.09 3.34 ± 0.10 2.97 ± 0.14 285.94 ± 38.42 4.85 ± 0.25 8.15 ± 0.11 25.04 ± 2.79 14.10 ± 1.18 46.83 ± 6.25
SOC-TWD 0.83 ± 0.08 3.29 ± 0.41 3.59 ± 0.30 7.57 ± 0.32 10.06 ± 0.33 34.61 ± 2.13 19.44 ± 1.32 34.07 ± 5.08 47.97 ± 1.98

5.2. Performance illustration with artificial data

This subsection conducts a number of experiments on artificial datasets to validate that the proposed method can process different incremental situations and can deal with datasets of arbitrary shape. 1900 objects of AD are used as the initial dataset, and 100 objects of the dataset are used as the incremental data.

Test 1: the incremental data is distributed randomly in every cluster. The clustering results before and after the new incremental data arrives are shown in Fig. 9(a) and (b), respectively.

Comparing Fig. 9(a) and Fig. 9(b), both results show that the proposed method can determine the ground-truth clusters correctly. Furthermore, the shapes of the clusters in this visual example are arbitrary, so the results also show that the proposed method has the ability to cluster datasets of arbitrary shape.

Test 2: the incremental data is added mainly in the boundary region between some clusters. The clustering results before and after the new incremental data arrives are shown in Fig. 10(a) and (b), respectively.

From Fig. 10(a), we see that there are 5 clusters, and there exists an overlapping boundary region between C4 and C5. This is reasonable because the density of this region is not high enough. However, when the objects in this region increase, as shown in Fig. 10(b), C4 and C5 might be merged into one cluster. The results simply reflect the inherent data structure of the datasets.

Test 3: the incremental data is just a new cluster added to the original dataset. The clustering results before and after the new incremental data arrives are shown in Fig. 11(a) and (b), respectively.

Observing Fig. 11, we can see that the proposed TIOC-TWD clustering method has the ability to detect new structure in datasets.

Test 4: the incremental data just increases on the cores of some clusters. The clustering results before and after the new incremental data arrives are shown in Fig. 12(a) and (b), respectively.


Table 9
Comparison of experimental results when the size of the incremental data block is 20% under Situation 2.

Index Method Banknote LetterABC LetterAGI Pageblocks Pendigits389 Pendigits1234 Pendigits1469 Waveform Landsat
Accuracy TIOC-TWD 0.81 ± 0.05 0.89 ± 0.00 0.75 ± 0.03 0.88 ± 0.00 0.86 ± 0.03 0.78 ± 0.05 0.76 ± 0.07 0.95 ± 0.00 0.66 ± 0.04
OFCMD 0.82 ± 0.08 0.88 ± 0.02 0.66 ± 0.18 0.90 ± 0.00 0.57 ± 0.12 0.68 ± 0.02 0.71 ± 0.17 0.63 ± 0.16 0.51 ± 0.01
HOFCMD 0.65 ± 0.02 0.55 ± 0.15 0.44 ± 0.08 0.90 ± 0.00 0.48 ± 0.15 0.36 ± 0.12 0.47 ± 0.10 0.44 ± 0.15 0.35 ± 0.07
IncDBSCAN 0.84 ± 0.00 0.67 ± 0.00 0.63 ± 0.01 0.86 ± 0.02 0.69 ± 0.04 0.72 ± 0.00 0.74 ± 0.00 0.34 ± 0.00 0.46 ± 0.00
SOC-TWD 0.85 ± 0.00 0.85 ± 0.00 0.74 ± 0.00 0.88 ± 0.00 0.91 ± 0.00 0.84 ± 0.00 0.87 ± 0.01 0.92 ± 0.00 0.78 ± 0.00
Fmeasure TIOC-TWD 0.85 ± 0.03 0.91 ± 0.00 0.76 ± 0.04 0.86 ± 0.00 0.92 ± 0.02 0.80 ± 0.09 0.79 ± 0.11 0.59 ± 0.00 0.65 ± 0.03
OFCMD 0.76 ± 0.07 0.87 ± 0.02 0.61 ± 0.22 0.85 ± 0.00 0.44 ± 0.11 0.61 ± 0.02 0.44 ± 0.11 0.56 ± 0.15 0.40 ± 0.01
HOFCMD 0.62 ± 0.03 0.47 ± 0.14 0.39 ± 0.09 0.85 ± 0.00 0.29 ± 0.12 0.17 ± 0.06 0.22 ± 0.05 0.23 ± 0.09 0.20 ± 0.05
IncDBSCAN 0.86 ± 0.00 0.75 ± 0.00 0.71 ± 0.01 0.84 ± 0.01 0.78 ± 0.04 0.80 ± 0.00 0.68 ± 0.01 0.17 ± 0.00 0.41 ± 0.00
SOC-TWD 0.91 ± 0.00 0.89 ± 0.00 0.82 ± 0.00 0.86 ± 0.00 0.88 ± 0.03 0.87 ± 0.00 0.92 ± 0.01 0.63 ± 0.00 0.70 ± 0.00
NMI TIOC-TWD 0.53 ± 0.04 0.75 ± 0.01 0.59 ± 0.02 0.11 ± 0.00 0.86 ± 0.02 0.76 ± 0.04 0.79 ± 0.06 0.10 ± 0.01 0.60 ± 0.02
OFCMD 0.21 ± 0.09 0.63 ± 0.05 0.38 ± 0.15 0.00 ± 0.00 0.38 ± 0.02 0.55 ± 0.02 0.37 ± 0.13 0.34 ± 0.03 0.46 ± 0.00
HOFCMD 0.06 ± 0.04 0.18 ± 0.10 0.10 ± 0.09 0.00 ± 0.00 0.11 ± 0.12 0.46 ± 0.07 0.21 ± 0.07 0.01 ± 0.03 0.35 ± 0.03
IncDBSCAN 0.55 ± 0.00 0.58 ± 0.00 0.60 ± 0.00 0.02 ± 0.00 0.74 ± 0.03 0.71 ± 0.00 0.82 ± 0.00 0.00 ± 0.00 0.53 ± 0.00
SOC-TWD 0.79 ± 0.00 0.72 ± 0.00 0.68 ± 0.00 0.11 ± 0.00 0.77 ± 0.03 0.79 ± 0.01 0.87 ± 0.01 0.24 ± 0.00 0.63 ± 0.00
CPU(s) TIOC-TWD 0.23 ± 0.06 1.72 ± 0.18 1.19 ± 0.06 1.61 ± 0.22 3.09 ± 0.20 5.43 ± 0.26 3.87 ± 0.22 6.49 ± 1.41 15.39 ± 0.74
OFCMD 16.52 ± 0.40 1.64 ± 0.17 6.67 ± 9.69 3.11 ± 0.52 1.62 ± 0.01 5.78 ± 0.18 4.70 ± 0.24 5.82 ± 0.36 16.52 ± 0.40
HOFCMD 0.57 ± 0.15 2.92 ± 4.94 7.49 ± 6.98 6.57 ± 0.53 5.50 ± 1.40 10.47 ± 2.03 9.48 ± 1.68 9.51 ± 1.33 36.96 ± 10.84
IncDBSCAN 1.75 ± 0.09 3.34 ± 0.10 2.97 ± 0.14 285.94 ± 38.42 4.85 ± 0.25 8.15 ± 0.11 25.04 ± 2.79 14.10 ± 1.18 46.83 ± 6.25
SOC-TWD 0.83 ± 0.08 3.29 ± 0.41 3.59 ± 0.30 7.57 ± 0.32 10.06 ± 0.33 34.61 ± 2.13 19.44 ± 1.32 34.07 ± 5.08 47.97 ± 1.98

Table 10
Comparison of experimental results when the size of the incremental data block is 10% under Situation 3.

Index Method Banknote LetterABC LetterAGI Pageblocks Pendigits389 Pendigits1234 Pendigits1469 Waveform Landsat
Accuracy TIOC-TWD 0.82 ± 0.05 0.87 ± 0.07 0.73 ± 0.02 0.88 ± 0.00 0.76 ± 0.10 0.82 ± 0.03 0.82 ± 0.03 0.89 ± 0.08 0.69 ± 0.03
OFCMD 0.66 ± 0.09 0.92 ± 0.01 0.52 ± 0.01 0.90 ± 0.00 0.52 ± 0.02 0.87 ± 0.00 0.99 ± 0.01 0.71 ± 0.03 0.56 ± 0.05
HOFCMD 0.68 ± 0.06 0.66 ± 0.14 0.50 ± 0.08 0.90 ± 0.00 0.63 ± 0.10 0.76 ± 0.07 0.44 ± 0.16 0.52 ± 0.14 0.50 ± 0.14
IncDBSCAN 0.84 ± 0.00 0.83 ± 0.00 0.70 ± 0.01 0.87 ± 0.00 0.73 ± 0.05 0.71 ± 0.01 0.73 ± 0.01 0.34 ± 0.00 0.40 ± 0.04
SOC-TWD 0.85 ± 0.00 0.85 ± 0.00 0.73 ± 0.01 0.88 ± 0.00 0.91 ± 0.00 0.84 ± 0.00 0.82 ± 0.02 0.92 ± 0.00 0.79 ± 0.00
Fmeasure TIOC-TWD 0.85 ± 0.03 0.86 ± 0.10 0.73 ± 0.04 0.86 ± 0.01 0.75 ± 0.13 0.85 ± 0.05 0.88 ± 0.02 0.53 ± 0.05 0.67 ± 0.02
OFCMD 0.62 ± 0.08 0.90 ± 0.00 0.45 ± 0.00 0.85 ± 0.00 0.43 ± 0.02 0.85 ± 0.00 0.64 ± 0.01 0.63 ± 0.02 0.48 ± 0.06
HOFCMD 0.63 ± 0.05 0.52 ± 0.14 0.48 ± 0.09 0.85 ± 0.00 0.40 ± 0.07 0.50 ± 0.07 0.20 ± 0.07 0.33 ± 0.10 0.28 ± 0.10
IncDBSCAN 0.86 ± 0.00 0.87 ± 0.00 0.80 ± 0.00 0.84 ± 0.00 0.82 ± 0.05 0.80 ± 0.01 0.67 ± 0.01 0.17 ± 0.03 0.35 ± 0.03
SOC-TWD 0.91 ± 0.00 0.86 ± 0.00 0.82 ± 0.00 0.86 ± 0.00 0.89 ± 0.03 0.87 ± 0.00 0.88 ± 0.01 0.63 ± 0.00 0.70 ± 0.00
NMI TIOC-TWD 0.54 ± 0.04 0.69 ± 0.04 0.57 ± 0.04 0.10 ± 0.03 0.71 ± 0.07 0.77 ± 0.01 0.83 ± 0.02 0.08 ± 0.05 0.62 ± 0.02
OFCMD 0.05 ± 0.01 0.68 ± 0.03 0.27 ± 0.02 0.00 ± 0.00 0.42 ± 0.03 0.73 ± 0.00 0.32 ± 0.01 0.39 ± 0.01 0.44 ± 0.02
HOFCMD 0.04 ± 0.02 0.26 ± 0.12 0.13 ± 0.10 0.00 ± 0.00 0.13 ± 0.04 0.36 ± 0.08 0.16 ± 0.02 0.08 ± 0.05 0.38 ± 0.08
IncDBSCAN 0.55 ± 0.00 0.70 ± 0.00 0.65 ± 0.00 0.02 ± 0.00 0.76 ± 0.04 0.71 ± 0.01 0.80 ± 0.01 0.00 ± 0.00 0.47 ± 0.04
SOC-TWD 0.79 ± 0.00 0.62 ± 0.00 0.69 ± 0.02 0.11 ± 0.00 0.76 ± 0.03 0.78 ± 0.00 0.82 ± 0.01 0.24 ± 0.00 0.64 ± 0.00
CPU(s) TIOC-TWD 0.22 ± 0.04 1.20 ± 0.08 1.48 ± 0.07 2.42 ± 0.64 2.51 ± 0.14 6.03 ± 0.30 4.57 ± 0.17 13.16 ± 2.89 14.25 ± 0.59
OFCMD 0.28 ± 0.02 1.22 ± 0.06 1.23 ± 0.08 3.79 ± 2.11 1.23 ± 0.01 3.91 ± 0.12 3.65 ± 0.36 4.01 ± 0.18 20.52 ± 0.47
HOFCMD 0.44 ± 0.09 1.29 ± 0.20 3.25 ± 3.52 5.39 ± 0.46 2.05 ± 0.22 6.30 ± 1.09 4.84 ± 0.66 4.42 ± 0.55 25.58 ± 4.50
IncDBSCAN 1.73 ± 0.09 3.22 ± 0.22 2.88 ± 0.12 250.76 ± 25.33 4.09 ± 0.17 10.97 ± 0.93 15.68 ± 2.26 12.05 ± 1.93 20.01 ± 2.15
SOC-TWD 0.84 ± 0.08 3.25 ± 0.80 3.31 ± 0.11 7.49 ± 0.29 9.98 ± 0.29 34.27 ± 1.88 21.08 ± 0.93 34.23 ± 5.13 59.25 ± 2.19

To observe Fig. 12, we see that the proposed TIOC-TWD clustering method has the ability to split a big cluster into smaller clusters. The new clusters may have overlapping regions, which again reveal the underlying structure of the dataset.

5.3. Results of comparison experiments

This subsection describes experiments on some of the UCI datasets [44] with the proposed approach TIOC-TWD, the SOC-TWD algorithm, the IncDBSCAN algorithm [8], the OFCMD algorithm and the HOFCMD algorithm [18]; the accuracy, F-measure and NMI indices are evaluated. The SOC-TWD algorithm is the only one that is not an incremental clustering approach, and it is used as a baseline against which the incremental approaches are compared. There are four parameters δ, α, β and k used in our method; the compared algorithms also depend on some parameters. For example, the OFCMD algorithm has to set the number of clusters, the number of candidates to determine the medoids, the size of the data chunks, the decay factor and so on. The parameters used in the compared algorithms are set as in the original references.

To simulate the incremental environment, 60% of each static UCI dataset is taken as the initial dataset, and the rest of the dataset is the incremental dataset. Algorithms 1 and 2 are carried out on the initial dataset, and CIA-TWD is applied to the incremental dataset. For the experiments described in the following, the results are always averaged over 10 runs, and the standard deviations are also reported.

On the one hand, considering that the data might be increasing continuously in a real incremental application environment, we need to simulate continued incremental data. Therefore, each incremental dataset is divided into a number of incremental data blocks; the CIA-TWD algorithm is then executed on each block until there is no new block.
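A minimal sketch of this evaluation protocol is given below. It is our own illustration under stated assumptions: the `clusterer` object and its `fit` and `update` methods are hypothetical stand-ins for Algorithms 1 and 2 and for CIA-TWD, whose actual interfaces are defined earlier in the paper.

```python
import numpy as np

def run_incremental_protocol(X, clusterer, block_frac=0.10, seed=0):
    """Simulate the incremental environment: 60% of the static dataset is the
    initial data; the remaining 40% arrives as blocks whose size is
    block_frac (10% or 20%) of the whole dataset."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_init = int(0.6 * len(X))
    init_idx, inc_idx = order[:n_init], order[n_init:]

    clusterer.fit(X[init_idx])                   # initial clustering
    block = max(1, int(block_frac * len(X)))
    for start in range(0, len(inc_idx), block):  # incremental updates, block by block
        clusterer.update(X[inc_idx[start:start + block]])
    return clusterer
```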

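For reference, the F-measure in the style of Larsen and Aone [42] and the NMI of Strehl and Ghosh [43] can be computed for crisp label vectors roughly as follows. This is a sketch under the usual definitions; since the paper evaluates overlapping clusters, its exact computation may differ.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def clustering_f_measure(labels_true, labels_pred):
    """F-measure for a crisp clustering: each true class is matched with the
    cluster giving its best F score, weighted by class size."""
    n = len(labels_true)
    total = 0.0
    for c in np.unique(labels_true):
        in_class = labels_true == c
        best = 0.0
        for k in np.unique(labels_pred):
            in_cluster = labels_pred == k
            tp = np.sum(in_class & in_cluster)
            if tp == 0:
                continue
            p, r = tp / in_cluster.sum(), tp / in_class.sum()
            best = max(best, 2 * p * r / (p + r))
        total += in_class.sum() / n * best
    return total

# NMI is available directly:
# nmi = normalized_mutual_info_score(labels_true, labels_pred)
```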

On the other hand, considering that the underlying structures in the new incremental data are unknown, we need to simulate different situations to evaluate the performance of the proposed method. Because the new incremental objects might belong to all known clusters, belong to only part of the known clusters, or form new clusters, we design three experiments corresponding to these three situations.

Therefore, for each run and for each dataset, the 40% incremental dataset is divided into incremental data blocks of different sizes. When the size of the incremental data block is 10%, the incremental dataset is divided into 4 blocks; when the size of the incremental data block is 20%, the incremental dataset is divided into 2 blocks. The accuracy, F-measure, NMI index, and the CPU time are recorded in each run.

Situation 1: the incremental data are distributed randomly over most of the clusters. That is, the new incremental objects might belong to all known clusters. Tables 6 and 7 show the comparison results when the size of the incremental data block is 10% and 20% of the original dataset, respectively.

From Table 6, we see that the accuracies of the proposed method are higher than those of the compared incremental algorithms on the Banknote, Pendigits389, Pendigits1469, Waveform and Landsat datasets, and very near to the best of the compared algorithms on the other datasets. Thus, the performance of the proposed approach is roughly equal to that of the compared algorithms in terms of the accuracy index. In terms of the F-measure and NMI indices, the performance of the proposed method is much better than that of the compared algorithms on most datasets. The CPU time of the proposed algorithm on some datasets is slightly higher than that of the compared algorithms; however, for the Page blocks dataset, the CPU time of the proposed algorithm is 1.48 ± 0.25, while the CPU times of the compared algorithms are 20.02 ± 8.33, 5.13 ± 0.57, 242.78 ± 56.80 and 7.59 ± 0.25, respectively. The advantage in computing time and in the standard deviation is obvious. Of course, the CPU time of TIOC-TWD is much less than that of the static algorithm SOC-TWD, and the difference is more obvious on the big datasets. It is interesting that the indices of the static algorithm are not always better than those of the incremental method, which shows, from another perspective, the necessity of developing incremental methods. Observing the results in Table 7, we find almost the same conclusions for these methods.

When we compare the results in Tables 6 and 7, we find that the performance of the proposed approach is better for the larger incremental data block, at the cost of higher computing time, though no algorithm is absolutely best on every index. Generally speaking, when the incremental data objects are distributed randomly, the performance of the proposed approach is slightly better than that of the compared methods.

Situation 2: the incremental data are distributed randomly over part of the clusters. That is, the new incremental objects might belong to some specific known clusters. Tables 8 and 9 show the comparison results when the size of the incremental data block is 10% and 20% of the original dataset, respectively.

In the experiments, the incremental data objects come from the class "2" of Banknote, the classes "B" and "C" of LetterABC, the classes "G" and "I" of LetterAGI, the classes "2" and "5" of Page blocks, the classes "8" and "9" of Pendigits389, the classes "3" and "4" of Pendigits1234, the classes "6" and "9" of Pendigits1469, the classes "1" and "2" of Waveform, and the classes "4", "5" and "7" of Landsat, respectively.

Observing the results in Tables 8 and 9, we can see that the performance of the proposed approach is better when the size of the incremental data block is bigger. Under this situation, the proposed approach outperforms the compared algorithms in most cases, especially on the F-measure and NMI indices.

Situation 3: the incremental data produce new clusters. That is, the new incremental objects might not belong to any known cluster. Tables 10 and 11 show the comparison results when the size of the incremental data block is 10% and 20% of the original dataset, respectively.

In the experiments, the 40% incremental data objects are composed of several entire classes of a dataset to simulate Situation 3. The incremental data objects come from the whole class "2" of Banknote, the class "C" of LetterABC, the class "G" of LetterAGI, the classes "1" and "2" of Page blocks, the class "9" of Pendigits389, the class "1" of Pendigits1234, the class "9" of Pendigits1469, the classes "1" and "2" of Waveform, and the classes "1", "2" and "3" of Landsat, respectively.

Comparing the results in Tables 10 and 11, we see that the performance of the proposed approach does not differ much between the different sizes of incremental data block under Situation 3, which shows, in some sense, the stability of the proposed approach in detecting new patterns.
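The three situations can be simulated from a labelled static dataset along the following lines. This is a sketch under our own assumptions: `special_classes` receives the held-out class labels listed above, and the function name and exact splitting rules are hypothetical.

```python
import numpy as np

def make_situation(X, y, situation, special_classes=(), seed=0):
    """Build (initial, incremental) subsets for the three situations."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    if situation == 1:
        # Situation 1: incremental objects drawn randomly from all classes.
        cut = int(0.6 * len(X))
        init, inc = idx[:cut], idx[cut:]
    elif situation == 2:
        # Situation 2: incremental objects drawn only from specific known classes.
        from_special = idx[np.isin(y[idx], special_classes)]
        inc = from_special[: int(0.4 * len(X))]
        init = np.setdiff1d(idx, inc)
    else:
        # Situation 3: the listed classes are withheld entirely from the initial
        # data and arrive only as increments, forming new clusters.
        special = np.isin(y[idx], special_classes)
        inc, init = idx[special], idx[~special]
    return X[init], y[init], X[inc], y[inc]
```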

Table 11
Comparison of experimental results when the size of the incremental data block is 20% under Situation 3.

Index Method Banknote LetterABC LetterAGI Pageblocks Pendigits389 Pendigits1234 Pendigits1469 Waveform Landsat
Accuracy TIOC-TWD 0.81 ± 0.05 0.85 ± 0.10 0.73 ± 0.00 0.88 ± 0.00 0.77 ± 0.10 0.82 ± 0.04 0.77 ± 0.06 0.88 ± 0.09 0.68 ± 0.03
OFCMD 0.82 ± 0.08 0.94 ± 0.01 0.46 ± 0.06 0.90 ± 0.00 0.51 ± 0.01 0.82 ± 0.08 0.99 ± 0.01 0.71 ± 0.13 0.60 ± 0.03
HOFCMD 0.65 ± 0.02 0.73 ± 0.15 0.44 ± 0.11 0.90 ± 0.00 0.53 ± 0.10 0.71 ± 0.07 0.36 ± 0.13 0.62 ± 0.10 0.51 ± 0.17
IncDBSCAN 0.84 ± 0.00 0.83 ± 0.00 0.70 ± 0.01 0.87 ± 0.00 0.73 ± 0.05 0.71 ± 0.01 0.73 ± 0.01 0.34 ± 0.00 0.40 ± 0.04
SOC-TWD 0.85 ± 0.00 0.85 ± 0.00 0.73 ± 0.01 0.88 ± 0.00 0.91 ± 0.00 0.84 ± 0.00 0.82 ± 0.02 0.92 ± 0.00 0.79 ± 0.00
Fmeasure TIOC-TWD 0.85 ± 0.03 0.83 ± 0.13 0.72 ± 0.01 0.85 ± 0.01 0.77 ± 0.13 0.85 ± 0.06 0.82 ± 0.10 0.53 ± 0.04 0.67 ± 0.02
OFCMD 0.76 ± 0.07 0.91 ± 0.00 0.39 ± 0.05 0.85 ± 0.00 0.42 ± 0.02 0.78 ± 0.09 0.64 ± 0.03 0.60 ± 0.10 0.52 ± 0.03
HOFCMD 0.62 ± 0.03 0.59 ± 0.15 0.42 ± 0.11 0.85 ± 0.00 0.29 ± 0.10 0.44 ± 0.06 0.16 ± 0.06 0.37 ± 0.07 0.28 ± 0.10
IncDBSCAN 0.86 ± 0.00 0.87 ± 0.00 0.80 ± 0.00 0.84 ± 0.00 0.82 ± 0.05 0.80 ± 0.01 0.67 ± 0.01 0.17 ± 0.03 0.35 ± 0.03
SOC-TWD 0.91 ± 0.00 0.86 ± 0.00 0.82 ± 0.00 0.86 ± 0.00 0.89 ± 0.03 0.87 ± 0.00 0.88 ± 0.01 0.63 ± 0.00 0.70 ± 0.00
NMI TIOC-TWD 0.53 ± 0.04 0.67 ± 0.05 0.56 ± 0.02 0.08 ± 0.03 0.71 ± 0.08 0.77 ± 0.01 0.80 ± 0.05 0.10 ± 0.07 0.61 ± 0.03
OFCMD 0.21 ± 0.09 0.67 ± 0.01 0.28 ± 0.01 0.00 ± 0.00 0.42 ± 0.03 0.66 ± 0.06 0.40 ± 0.04 0.32 ± 0.06 0.45 ± 0.01
HOFCMD 0.06 ± 0.04 0.29 ± 0.14 0.09 ± 0.06 0.00 ± 0.00 0.04 ± 0.04 0.28 ± 0.05 0.15 ± 0.03 0.10 ± 0.06 0.32 ± 0.05
IncDBSCAN 0.55 ± 0.00 0.70 ± 0.00 0.65 ± 0.00 0.02 ± 0.00 0.76 ± 0.04 0.71 ± 0.01 0.80 ± 0.01 0.00 ± 0.00 0.47 ± 0.04
SOC-TWD 0.79 ± 0.00 0.62 ± 0.00 0.69 ± 0.02 0.11 ± 0.00 0.76 ± 0.03 0.78 ± 0.00 0.82 ± 0.01 0.24 ± 0.00 0.64 ± 0.00
CPU(s) TIOC-TWD 0.23 ± 0.06 1.53 ± 0.12 1.47 ± 0.06 3.21 ± 1.28 2.53 ± 0.10 6.25 ± 0.39 4.49 ± 0.27 15.38 ± 4.26 14.41 ± 0.46
OFCMD 0.44 ± 0.04 1.93 ± 0.14 1.92 ± 0.16 3.66 ± 0.35 1.63 ± 0.02 4.98 ± 0.13 4.72 ± 0.30 6.32 ± 0.48 16.39 ± 0.40
HOFCMD 0.56 ± 0.14 1.82 ± 0.33 1.18 ± 0.15 6.10 ± 0.41 3.47 ± 0.46 10.69 ± 1.70 11.32 ± 2.36 8.51 ± 1.12 45.49 ± 10.74
IncDBSCAN 1.73 ± 0.09 3.22 ± 0.22 2.88 ± 0.12 250.76 ± 25.33 4.09 ± 0.17 10.97 ± 0.93 15.68 ± 2.26 12.05 ± 1.93 20.01 ± 2.15
SOC-TWD 0.84 ± 0.08 3.25 ± 0.80 3.31 ± 0.11 7.49 ± 0.29 9.98 ± 0.29 34.27 ± 1.88 21.08 ± 0.93 34.23 ± 5.13 59.25 ± 2.19


Under this situation, the proposed approach achieves higher performance on the NMI index and CPU time than the compared algorithms in most cases; the accuracy and F-measure of the proposed approach are very close to the best even when it is not the best.

To sum up, the performance of the proposed approach is better than that of the compared algorithms in most cases. Moreover, the proposed approach has the following advantages over the other methods, including the compared algorithms: the result of clustering is represented by three-way decisions, that is, a cluster is composed of a positive region and a boundary region, which is helpful for further investigation; and although the time cost of the proposed method is not always the best, the method is still very valuable in applications, because it does not require the number of clusters to be defined in advance as the other methods do.
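To make this interval-set representation concrete, a cluster under three-way decisions can be modelled as a pair of regions, for instance as in the following sketch. The thresholds α and β and the acceptance rules shown here are illustrative assumptions in the spirit of the three-way decision literature, not the paper's exact updating strategies.

```python
from dataclasses import dataclass, field

@dataclass
class ThreeWayCluster:
    positive: set = field(default_factory=set)  # objects that certainly belong
    boundary: set = field(default_factory=set)  # objects that possibly belong

def assign(obj_id, clusters, similarity, alpha=0.7, beta=0.4):
    """Three-way assignment: accept into the positive region when the
    similarity reaches alpha, defer to the boundary region when it lies in
    [beta, alpha), and reject otherwise.  An object may enter the boundary
    regions of several clusters, which is how overlapping clusters arise."""
    for c in clusters:
        s = similarity(obj_id, c)
        if s >= alpha:
            c.positive.add(obj_id)
        elif s >= beta:
            c.boundary.add(obj_id)
```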
6. Conclusions

Existing clustering approaches are either restricted to crisp clustering or to static datasets. In order to develop an approach that deals with overlapping clustering as well as incremental clustering, this paper proposed a new tree-based incremental overlapping clustering method using the three-way decision theory, called TIOC-TWD.

This paper first introduced three-way decision clustering to represent overlapping clustering as well as crisp clustering, described the problem of incremental overlapping clustering, and proposed the notions of representative points and of the similarity between representative regions. Then, the paper introduced a new search tree based on representative points, which can not only enhance the relevance of the search result but also save computation time. Besides, the paper devised three-way strategies to efficiently update the clustering after multiple objects are added. Moreover, the proposed method does not need the number of clusters to be defined in advance; it can determine the number of clusters dynamically. These characteristics make TIOC-TWD appropriate for handling overlapping clustering in applications where the data is increasing.

This paper conducted experiments to illustrate the salient features of the proposed algorithm and to evaluate its performance. The experimental results show that the proposed method can not only identify clusters of arbitrary shapes, but also merge small clusters into a big one when the data change; the proposed method can detect new clusters, which might be the result of splitting or of new patterns. Further results of the comparison experiments show that the proposed method has better performance, especially on the F-measure and NMI indices, than the compared methods. A further analysis of the parameters is planned as future work.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61379114 & 61272060.
References

[1] A.K. Jain, Data clustering: 50 years beyond k-means, Pattern Recogn. Lett. 31 (8) (2010) 651–666.
[2] G. Peters, F. Crespo, P. Lingras, R. Weber, Soft clustering – fuzzy and rough approaches and their extensions and derivatives, Int. J. Approx. Reason. 54 (2) (2013) 307–322.
[3] H.L. Sun, J.B. Huang, X. Zhang, J. Liu, D. Wang, H.L. Liu, J.H. Zou, Q.B. Song, Incorder: incremental density-based community detection in dynamic networks, Knowl.-Based Syst. 72 (2014) 1–12.
[4] S. Lee, G. Kim, S. Kim, Self-adaptive and dynamic clustering for online anomaly detection, Expert Syst. Appl. 38 (12) (2011) 14891–14898.
[5] A. Pérez-Suárez, J.F. Martínez-Trinidad, J.A. Carrasco-Ochoa, J.E. Medina-Pagola, A new overlapping clustering algorithm based on graph theory, in: Advances in Artificial Intelligence, vol. 7629, Springer, 2013, pp. 61–72.
[6] A.A. Abbasi, M. Younis, A survey on clustering algorithms for wireless sensor networks, Comput. Commun. 30 (14) (2007) 2826–2841.
[7] H. Yu, C. Zhang, F. Hu, An incremental clustering approach based on three-way decisions, in: Rough Sets and Current Trends in Computing, Springer, 2014, pp. 152–159.
[8] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, X.W. Xu, Incremental clustering for mining in a data warehousing environment, in: VLDB, vol. 98, 1998, pp. 323–333.
[9] N. Goyal, P. Goyal, K. Venkatramaiah, P.C. Deepak, P.S. Sannop, An efficient density based incremental clustering algorithm in data warehousing environment, in: 2009 International Conference on Computer Engineering and Applications, IPCSIT, vol. 2, 2009, pp. 482–486.
[10] B.K. Patra, O. Ville, R. Launonen, S. Nandi, K.S. Babu, Distance based incremental clustering for mining clusters of arbitrary shapes, in: Pattern Recognition and Machine Intelligence, Springer, 2013, pp. 229–236.
[11] R. Ibrahim, N. Ahmed, N.A. Yousri, M.A. Ismail, Incremental mitosis: discovering clusters of arbitrary shapes and densities in dynamic data, in: 11th International Conference on Machine Learning and Applications, vol. 1, IEEE Computer Society, 2012, pp. 102–107.
[12] H.Z. Ning, W. Xu, Y. Chi, Y.H. Gong, T.S. Huang, Incremental spectral clustering by efficiently updating the eigen-system, Pattern Recogn. 43 (1) (2010) 113–127.
[13] R.G. Pensa, D. Ienco, R. Meo, Hierarchical co-clustering: off-line and incremental approaches, Data Mining Knowl. Discovery 28 (1) (2014) 31–64.
[14] K.M. Hammouda, M.S. Kamel, Efficient phrase-based document indexing for web document clustering, IEEE Trans. Knowl. Data Eng. 16 (10) (2004) 1279–1296.
[15] R. Gil-García, A. Pons-Porrata, Dynamic hierarchical algorithms for document clustering, Pattern Recogn. Lett. 31 (6) (2010) 469–477.
[16] A. Pérez-Suárez, J.F. Martínez-Trinidad, J.A. Carrasco-Ochoa, J.E. Medina-Pagola, An algorithm based on density and compactness for dynamic overlapping clustering, Pattern Recogn. 46 (11) (2013) 3040–3055.
[17] E. Lughofer, A dynamic split-and-merge approach for evolving cluster models, Evolv. Syst. 3 (3) (2012) 135–151.
[18] N. Labroche, Online fuzzy medoid based clustering algorithms, Neurocomputing 126 (2014) 141–150.
[19] J.F. Peters, A. Skowron, Z. Suraj, W. Rząsa, M. Borkowski, Clustering: a rough set approach to constructing information granules, in: Soft Computing and Distributed Processing, 2002, pp. 57–61.
[20] D. Parmar, T. Wu, J. Blackhurst, MMR: an algorithm for clustering categorical data using rough set theory, Data Knowl. Eng. 63 (3) (2007) 879–893.
[21] H.M. Chen, T.R. Li, C. Luo, S.J. Horng, G.Y. Wang, A rough set-based method for updating decision rules on attribute values' coarsening and refining, IEEE Trans. Knowl. Data Eng. 26 (12) (2014) 2886–2899.
[22] H.M. Chen, T.R. Li, C. Luo, S.J. Horng, G.Y. Wang, A decision-theoretic rough set approach for dynamic data mining, IEEE Trans. Fuzzy Syst. (2015). http://dx.doi.org/10.1109/TFUZZ.2014.238787.
[23] G. Peters, R. Weber, R. Nowatzke, Dynamic rough clustering and its applications, Appl. Soft Comput. 12 (10) (2012) 3193–3207.
[24] P. Lingras, G. Peters, F. Crespo, R. Weber, Soft clustering – fuzzy and rough approaches and their extensions and derivatives, Int. J. Approx. Reason. 54 (2) (2013) 307–322.
[25] Y.Y. Yao, The superiority of three-way decisions in probabilistic rough set models, Inform. Sci. 181 (6) (2011) 1080–1096.
[26] Y.Y. Yao, An outline of a theory of three-way decisions, in: Rough Sets and Current Trends in Computing, Springer, 2012, pp. 1–17.
[27] C. Luo, T.R. Li, H.M. Chen, L.X. Lu, Fast algorithms for computing rough approximations in set-valued decision system while updating criteria values, Inform. Sci. 299 (2015) 221–242.
[28] C. Luo, T.R. Li, H.M. Chen, Dynamic maintenance of approximations in set-valued ordered decision systems under the attribute generalization, Inform. Sci. 257 (2014) 210–228.
[29] N. Azam, J.T. Yao, Analyzing uncertainties of probabilistic rough set regions with game-theoretic rough sets, Int. J. Approx. Reason. 55 (1) (2014) 142–155.
[30] B. Zhou, Y.Y. Yao, J.G. Luo, Cost-sensitive three-way email spam filtering, J. Intell. Inform. Syst. 42 (1) (2014) 19–45.
[31] D.C. Liang, D. Liu, A novel risk decision-making based on decision-theoretic rough sets under hesitant fuzzy information, IEEE Trans. Fuzzy Syst. PP (99) (2014) 1–11.
[32] D.C. Liang, D. Liu, Systematic studies on three-way decisions with interval-valued decision-theoretic rough sets, Inform. Sci. 276 (2014) 186–203.
[33] H. Yu, Y. Wang, P. Jiao, A three-way decisions approach to density-based overlapping clustering, in: Transactions on Rough Sets XVIII, Springer, 2014, pp. 92–109.
[34] H. Yu, Z.G. Liu, G.Y. Wang, An automatic method to determine the number of clusters using decision-theoretic rough set, Int. J. Approx. Reason. 55 (1) (2014) 101–115.
[35] P. Lingras, R. Yan, Interval clustering using fuzzy and rough set theory, in: Proceedings of NAFIPS'04: IEEE Annual Meeting of the Fuzzy Information, vol. 2, 2004, pp. 780–784.
[36] P. Lingras, C. West, Interval set clustering of web users with rough k-means, J. Intell. Inform. Syst. 23 (1) (2004) 5–16.
[37] Y.Y. Yao, P. Lingras, R.Z. Wang, D.Q. Miao, Interval set cluster analysis: a re-formulation, in: Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Springer, 2009, pp. 398–405.
[38] M. Chen, D.Q. Miao, Interval set clustering, Expert Syst. Appl. 38 (4) (2011) 2923–2932.
[39] E. Ikonomovska, J. Gama, S. Džeroski, Online tree-based ensembles and option trees for regression on evolving data streams, Neurocomputing 150 (2014) 458–470.
[40] C.W. Tsai, K.W. Huang, M.C. Chiang, C.S. Yang, A fast tree-based search algorithm for cluster search engine, in: IEEE International Conference on Systems, Man and Cybernetics, 2009, pp. 1603–1608.
[41] L. Breiman, J. Friedman, C.J. Stone, R. Olshen, Classification and regression trees, AMC 10 (1984) 12.
[42] B. Larsen, C. Aone, Fast and effective text mining using linear-time document clustering, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 1999, pp. 16–22.
[43] A. Strehl, J. Ghosh, Cluster ensembles – a knowledge reuse framework for combining multiple partitions, J. Machine Learn. Res. 3 (2003) 583–617.
[44] UCI Machine Learning Repository, 2014. <http://archive.ics.uci.edu/ml>.
