
A Domain Adaptive Density Clustering Algorithm for Data with Varying Density Distribution
Jianguo Chen, Member, IEEE, and Philip S. Yu, Fellow, IEEE

Abstract— As an efficient class of unsupervised learning methods, clustering algorithms have been widely used in data mining and
knowledge discovery with noticeable advantages. However, density-peak-based clustering algorithms have a limited clustering effect
on data with varying density distribution (VDD), equilibrium distribution (ED), and multiple domain-density maximums (MDDM), leading
to the problems of sparse cluster loss and cluster fragmentation. To address these problems, we propose a Domain-Adaptive Density
Clustering (DADC) algorithm, which consists of three steps: domain-adaptive density measurement, cluster center self-identification,
and cluster self-ensemble. For data with VDD features, clusters in sparse regions are often neglected when uniform density peak
thresholds are used, which results in the loss of sparse clusters. We define a domain-adaptive density measurement method based on K-Nearest
Neighbors (KNN) to adaptively detect the density peaks of different density regions. We treat each data point and its KNN neighborhood
as a subgroup to better reflect its density distribution from a domain view. In addition, for data with ED or MDDM features, a large number of
density peaks with similar values can be identified, which results in cluster fragmentation. We propose a cluster center self-identification
and cluster self-ensemble method to automatically extract the initial cluster centers and merge the fragmented clusters. Experimental
results demonstrate that, compared with existing algorithms, the proposed DADC algorithm obtains more reasonable
clustering results on data with VDD, ED, and MDDM features. Benefiting from few parameter requirements and a non-iterative nature,
DADC achieves low computational complexity and is suitable for large-scale data clustering.

Index Terms—Cluster fragmentation, density-peak clustering, domain-adaptive density clustering, varying density distribution.

1 INTRODUCTION

CLUSTERING algorithms have been widely used in various data analysis fields [1, 2]. Numerous clustering algorithms have been proposed, including partitioning-based, hierarchical, density-based, grid-based, model-based, and density-peak-based methods [3-6]. Among them, density-based methods (e.g., DBSCAN, CLIQUE, and OPTICS) can effectively discover clusters of arbitrary shape using the density connectivity of clusters, and do not require a pre-defined number of clusters [6]. In recent years, Density-Peak-based Clustering (DPC) algorithms, as a branch of density-based clustering, were introduced in [7, 8], assuming that cluster centers are surrounded by low-density neighbors and can be detected by efficiently searching for local density peaks.

Benefiting from few parameter requirements and a non-iterative nature, DPC algorithms can efficiently detect clusters of arbitrary shape from large-scale datasets with low computational complexity. However, as shown in Fig. 1, DPC algorithms have a limited clustering effect on data with varying density distribution (VDD), multiple domain-density maximums (MDDM), or equilibrium distribution (ED). (1) For data with VDD characteristics, there are regions of varying density, and data points in sparse regions are usually ignored as outliers or misallocated to adjacent dense clusters when uniform density peak thresholds are used, which results in the loss of sparse clusters. (2) Clustering results of DPC algorithms depend on a strict constraint that there is only one local density maximum in each candidate cluster. However, for data with MDDM or ED, there are zero or more local density maximums in a natural cluster, and DPC algorithms might lead to the problem of cluster fragmentation. (3) In addition, how to determine the parameter thresholds of local density and Delta distance in the clustering decision graph is another problem for DPC algorithms. Therefore, it is critical to address the problems of sparse cluster loss and cluster fragmentation for data with VDD, ED, and MDDM, and to improve clustering accuracy.

Fig. 1. Challenges of DPC algorithms on data with VDD, ED, and MDDM.

Aiming at the problems of sparse cluster loss and cluster fragmentation, we propose a Domain-Adaptive Density Clustering (DADC) algorithm.

• Jianguo Chen is with the College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China ([email protected]).
• Philip S. Yu is with the Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA, and Institute for Data Science, Tsinghua University, Beijing 100084, China ([email protected]).

As shown in Fig. 2, the DADC algorithm consists of three steps: domain-adaptive density measurement, cluster center self-identification, and cluster self-ensemble. A domain-adaptive density measurement method based on K-Nearest Neighbors (KNN) is defined, which can be used to adaptively detect the density peaks of different density regions. On this basis, cluster center self-identification and cluster self-ensemble methods are proposed to automatically extract the initial cluster centers and merge the fragmented clusters. Extensive experiments indicate that DADC outperforms comparison algorithms in clustering accuracy and robustness. The contributions of this paper are summarized as follows.

• To address the problem of sparse cluster loss of data with VDD, a domain-adaptive density measurement method is proposed to detect density peaks in different density regions. According to these density peaks, cluster centers in both dense and sparse regions are effectively discovered, which well addresses the sparse cluster loss problem.
• To automatically extract the initial cluster centers, we draw a clustering decision graph based on the domain density and Delta distance. We then propose a cluster center self-identification method and automatically determine the parameter thresholds and cluster centers from the clustering decision graph.
• To address the problem of cluster fragmentation on data with ED or MDDM, an innovative Cluster Fusion Degree (CFD) model is proposed, which consists of the inter-cluster density similarity, cluster crossover degree, and cluster density stability. Then, a cluster self-ensemble method is proposed to automatically merge the fragmented clusters by evaluating the CFD between adjacent clusters.

Fig. 2. Workflow of the proposed DADC algorithm.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 presents the domain-adaptive method for cluster center detection. A cluster self-identification method and a cluster ensemble method are respectively introduced in Section 4. Experimental results and evaluations are shown in Section 5. Finally, Section 6 concludes the paper.

2 RELATED WORK

Clustering is an efficient and unsupervised data mining method, and numerous clustering algorithms have been proposed and widely applied in various applications [2, 9, 10]. Partition-based methods (e.g., K-Means and K-Medoids) [3] are easy to understand and implement, but they are sensitive to noisy data and can only detect round or spherical clusters. Hierarchical methods (e.g., BIRCH, CURE, and ROCK) [5] do not need a pre-defined number of clusters and can extract the hierarchical relationship of clusters, but require high computational complexity. Density-based methods (e.g., DBSCAN, CLIQUE, and OPTICS) [6] also do not require a pre-defined number of clusters and can discover clusters of arbitrary shapes, but their clustering results are sensitive to the thresholds of their parameters.

Focusing on density-based clustering analysis, abundant improvements of traditional algorithms have been presented, while novel algorithms have been explored [7, 11-13]. Groups of Density-Peak-based Clustering (DPC) algorithms were proposed in [7, 8], where cluster centers are detected by efficiently searching for density peaks. In [7], Rodriguez et al. proposed a DPC algorithm titled "Clustering by fast search and find of density peaks" (widely quoted as CFSFDP). CFSFDP can effectively detect arbitrarily shaped clusters from large-scale datasets. Benefiting from its non-iterative nature, CFSFDP achieves low computational complexity and high efficiency for big data processing. In addition, considering large-scale noisy datasets, robust clustering algorithms were discussed in [14] by detecting density peaks and assigning points based on a fuzzy weighted KNN method.

For data that exhibit varying-density distribution or multiple local-density maximums, DPC algorithms face a variety of limitations, such as sparse cluster loss and cluster fragmentation. To address these problems, a variety of optimization solutions were presented in [15-17]. Zheng et al. proposed an approximate nearest neighbor search method for multiple distance functions with a single index [15]. To overcome the limitations of DPC, an adaptive method was presented in [18] for clustering, where heat diffusion is used to estimate density and the cutoff distance is simplified. In [19], an adaptive density-based clustering algorithm was introduced for spatial databases with noise, which uses a novel adaptive strategy for neighbor selection based on spatial object distribution to improve clustering accuracy. Aiming at clustering ensemble, an automatic clustering approach was introduced via outward statistical testing on density metrics in [16]. A nonparametric Bayesian clustering ensemble method was explored in [20] to seek the number of clusters in consensus clustering, which achieves versatility and superior stability. Yu et al. proposed an adaptive ensemble framework for semi-supervised clustering solutions [17]. Zeng et al. proposed a framework for hierarchical ensemble clustering [5]. Yu et al. introduced an incremental semi-supervised clustering ensemble approach for high-dimensional data clustering [21].

Compared with the existing clustering algorithms, the proposed domain-adaptive density method in this work can adaptively detect the domain densities and cluster centers in regions with different densities. This method is feasible and practical in actual big data applications. The proposed cluster self-identification method can effectively identify the candidate cluster centers with minimum artificial intervention. Moreover, the proposed CFD model takes full account of the relationships between clusters of large-scale datasets, including the inter-cluster density similarity, cluster crossover degree, and cluster density stability.

3 DOMAIN-ADAPTIVE DENSITY METHOD

We propose a domain-adaptive density method to address the problem of sparse cluster loss of DPC algorithms on VDD data. Domain-density peaks of data points in regions with different densities are adaptively detected. In addition, candidate cluster centers are identified based on the decision parameters of the domain densities and Delta distances.

Fig. 3. Example of the CFSFDP algorithm on a VDD dataset. (a) Data points. (b) Decision graph (local density vs. Delta distance).
3.1 Problem Definitions

Most DPC algorithms [7, 8] are based on the assumption that a cluster center is surrounded by neighbors with lower local densities and has a great Delta distance from any relative points with higher densities. For each data point xi, its local density ρi and Delta distance δi are calculated from the higher-density points. These two quantities depend only on the distances between the data points. The local density ρi of xi is defined as:

ρi = Σ_j χ(dij − dc),   (1)

where dc is a cutoff distance and χ(x) = 1 if x < 0; otherwise, χ(x) = 0. Basically, ρi is equal to the number of points closer than dc to xi. The Delta distance δi of xi is measured by computing the shortest distance between xi and any other point with a higher density, defined as:

δi = min_{j: ρj > ρi} dij.   (2)

For the highest-density point, δi = max_j dij. Points with a high ρ and a high δ are considered as cluster centers, while points with a low ρ and a high δ are considered as outliers. After finding the cluster centers, each remaining point is assigned to the same cluster as its nearest neighbor of higher density.

Most data in actual applications have the characteristics of noise, irregular distribution, and sparsity. In particular, the density distribution of data points is unpredictable and discrete in most cases. For a VDD dataset, regions with different degrees of density, such as dense and sparse regions, coexist, as defined as follows.

Definition 3.1 VDD Data. For a dataset with multiple regions, the average density of data points in each region is set as the region's density. If regions with obviously different region densities coexist, such as dense and sparse regions, we denote the dataset as a Varying-Density Distributed (VDD) dataset.

The CFSFDP algorithm and other DPC algorithms suffer from the limitation of sparse cluster loss on VDD datasets. According to Eq. (1), points in a relatively sparse area are easily ignored as outliers. An example of the CFSFDP clustering results on a VDD dataset is shown in Fig. 3. In Fig. 3 (a), the heart-shaped dataset has three regions with different densities. The clustering decision graph achieved by CFSFDP is shown in Fig. 3 (b), where only one point is obtained with high values of both ρ and δ. Consequently, the dataset is clustered into one cluster, while the data points in the sparse regions, indicated by blue dots and purple squares, are removed as outliers or incorporated into the dense cluster.

3.2 Domain-Adaptive Density Measurements

To adaptively detect the domain-density peaks in different density areas of VDD data, a domain-adaptive density calculation method is presented in this section. Domain distance and domain density calculation methods are presented based on the KNN method [22, 23]. These methods are very useful and handy on large-scale datasets that likely contain varying distribution densities in actual applications.

To more precisely describe the locality of VDD data, we propose a new definition of domain density based on the KNN method. Given a dataset X, the KNN-distance and KNN-density of each data point in X are calculated, respectively.

Definition 3.2 KNN-Distance. Given a dataset X, the KNN-distance of each data point xi refers to the average distance of xi to its K nearest neighbors. The KNN-distance of xi is defined as KDisti:

KDisti = (1/K) Σ_{j ∈ N(xi)} dij,   (3)

where K is the number of neighbors of xi and N(xi) is the set of its neighbors. Based on the KNN-distance, we calculate the KNN-density of each data point.

Definition 3.3 KNN-Density. The KNN-density of a data point xi in dataset X refers to the reciprocal of its KNN-distance. A smaller KNN-density indicates that the data point is located in a sparser area. The KNN-density of xi is defined as KDeni:

KDeni = 1 / KDisti = K / Σ_{j ∈ N(xi)} dij.   (4)

After obtaining the KNN-distance and KNN-density, the domain-adaptive density of each data point is defined. We treat the set of each data point xi and its neighbors N(xi) as a subgroup to observe its density distribution in X.

Definition 3.4 Domain Density. The domain density of each data point xi in dataset X is the sum of the KNN-density of xi and the weighted KNN-densities of its K nearest neighbors. The domain density of xi is defined as ∂i:

∂i = KDeni + Σ_{j ∈ N(xi)} (KDenj × wj),   (5)

where wj = 1/dij is the weight of the KNN-density between each neighbor xj and xi. Compared to the KNN-density, the domain density can better reflect the density distribution of data points in the local area.


Given a 2-dimensional dataset with 50 samples as an example, we calculate the distances among the data points, as shown in Fig. 4. Set K = 5; the KNN neighbors of x7 are x6, x8, x12, x3, and x13, with distances of 8.05, 8.05, 8.70, 8.79, and 12.58, respectively. According to Eq. (3), the KNN-distance of x7 is equal to 9.23. Hence, the KNN-density KDen7 of x7 is equal to 0.11 and the domain density ∂7 of x7 is 0.16. In the same way, the KNN-distances and KNN-densities of the neighbors of x7 are calculated successively. We further calculate the domain densities of these data points: ∂6 = ∂8 = 0.15, ∂3 = 0.12, ∂12 = 0.09, and ∂13 = 0.12. It is obvious that x7 has a higher domain density than its neighbors, reaching the value of 0.16.

Fig. 4. Example of domain density calculation (partial).

3.3 Clustering Decision Parameter Measurement

Based on the domain density, the Delta distance of each data point is computed as a clustering decision parameter. As defined in Eq. (2), the Delta distance δi of xi is measured by calculating the shortest distance between xi and any other point with a higher density. In such a case, only the point with the highest global density has the maximum value of the Delta distance. The domain-density peak of a sparse region yields a Delta distance value that is lower than those of the remaining points in a relatively dense region. An example of the Delta distances of a dataset is shown in Fig. 5.

Fig. 5. Example of Delta distance calculation (partial).

In Fig. 5, the domain densities of all data points are calculated. Since x7 owns the highest domain density, the Delta distance δ7 is the distance between x7 and the point x32 farthest from it, namely, δ7 = d(7,32) = 103.92. Because the remaining data points do not have the highest domain density, their Delta distances are the shortest distances between them and a point with a higher density. For example, δ1 = d(1,2) = 7.52, δ40 = d(40,9) = 42.76, δ49 = d(49,46) = 23.81, and δ50 = d(50,45) = 30.44.

Considering that a dataset has multiple regions with different densities, the domain densities of data points in a dense region are higher than those of points in a sparse region. To adaptively identify the density peaks of each region, we update the definition of the domain-adaptive density by combining the values of the domain density and Delta distance. The domain-adaptive density ∂i of each data point xi is updated as:

∂i = ∂i × δi = { ∂i × max_j (dij),              if ∂i = ∂max;
                 ∂i × min_{j: ∂j > ∂i} (dij),   otherwise.      (6)

There are three levels of domain density for data points: global density maximum, domain-density maximum, and normal density. (1) It is easy to identify the point with the highest global density and set it as a cluster center. For the global density maximum point, we set the largest distance between this point and any other point as its Delta distance. (2) For a density maximum point xi of a region, the point xj given by min_{j: ∂j > ∂i} (dij) must be in another region with a greater density rather than in the current region. Therefore, to clearly identify the density peaks of a region, we multiply the domain density and Delta distance for each point. (3) For the remaining points in each region, both their domain densities and Delta distances are much smaller than those of the peak points of the same region.

Based on the values of the domain density ∂ and Delta distance δ, a clustering decision graph is drawn to identify the candidate cluster centers. In the clustering decision graph, the horizontal axis represents ∂ and the vertical axis represents δ. Points having high values of both ∂ and δ are considered as cluster centers, while points with a low ∂ and a high δ are considered as outliers.

The process of domain density and Delta distance calculation of DADC is presented in Algorithm 3.1. Assuming that the number of data points in X is equal to n, for each data point xi in X we calculate its K nearest neighbors and obtain its domain density. Therefore, the computational complexity of Algorithm 3.1 is O(n).
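Continuing the earlier sketch, the next snippet derives the Delta distance of Eq. (2) and the domain-adaptive density update of Eq. (6) from the distance matrix D and the domain densities. It is again a hedged illustration (density ties are broken arbitrarily) rather than the paper's code; Algorithm 3.1 below gives the authors' full pseudocode.

import numpy as np

def delta_and_adaptive_density(D, domain_den):
    # D: n x n distance matrix; domain_den: domain densities from Eq. (5).
    n = len(domain_den)
    delta = np.zeros(n)
    order = np.argsort(-domain_den)        # indices from highest to lowest domain density
    top = order[0]
    delta[top] = D[top].max()              # Eq. (6), first case: farthest point for the global peak
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]              # points with a higher domain density than x_i
        delta[i] = D[i, higher].min()      # Eq. (2): distance to the nearest higher-density point
    return delta, domain_den * delta       # Eq. (6): domain-adaptive density

The clustering decision graph discussed above can then be drawn by plotting the returned domain-adaptive densities against the Delta distances.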


Algorithm 3.1 Domain-adaptive density and Delta distance calculation of DADC.
Input:
  X: the dataset for clustering;
  K: the number of neighbors of each data point;
Output:
  (∂, δ): the domain-adaptive densities and Delta distances of the data points of X.
1: calculate the distance matrix D for X;
2: for each xi in X do
3:   obtain the K nearest neighbors N(xi) of xi;
4:   calculate the KNN-distance KDisti ← (1/K) Σ_{j ∈ N(xi)} dij;
5:   calculate the KNN-density KDeni ← K / Σ_{j ∈ N(xi)} dij;
6:   calculate the domain density ∂i ← KDeni + Σ_{j ∈ N(xi)} (KDenj × 1/dij);
7: get the maximum domain density ∂max ← max(∂);
8: for each xi in X do
9:   if ∂i == ∂max then
10:    set the Delta distance δi ← max_j (dij);
11:  else
12:    set the Delta distance δi ← min_{j: ∂j > ∂i} dij;
13:  calculate the domain-adaptive density ∂i ← ∂i × δi;
14: return (∂, δ).

4 CLUSTER SELF-IDENTIFICATION METHOD

For data with ED or MDDM features, a large number of density peaks with similar values can be identified, which results in cluster fragmentation. In this section, aiming at the problem of cluster fragmentation, we propose a cluster self-identification method to extract initial cluster centers by automatically determining the parameter thresholds of the clustering decision graph. Then, a Cluster Fusion Degree (CFD) model is proposed to evaluate the relationship between adjacent clusters. Finally, a CFD-based cluster self-ensemble method is proposed to merge the fragmented clusters.

4.1 Problem Definitions

Based on the domain-adaptive densities and Delta distances, candidate cluster centers of a dataset can be obtained from the corresponding clustering decision graph. After the cluster centers are identified, each of the remaining data points is assigned to the cluster to which its nearest higher-density neighbor belongs.

(1) Decision-parameter threshold determination. A limitation of the CFSFDP algorithm is how to determine the thresholds of the decision parameters in the clustering decision graph. In CFSFDP, data points with high values of both local density and Delta distance are regarded as cluster centers. But in practice, these parameter thresholds are often set manually. An example of the Frame dataset and the corresponding clustering decision graph are shown in Fig. 6.

Fig. 6. Decision-parameter threshold determination. (a) Frame dataset. (b) Decision graph for Frame.

Fig. 6 (b) is the clustering decision graph of CFSFDP for the dataset in Fig. 6 (a). It is difficult to decide whether only the points in the red box, or those in both the red and blue boxes, should be regarded as cluster centers. Therefore, how to determine the threshold values of the decision parameters in an effective way is an important issue for our algorithm.

(2) Cluster fragmentation on MDDM or ED data. Most DPC algorithms have limitations of cluster fragmentation on datasets with multiple domain-density maximums (MDDM) or equilibrium distribution (ED).

Definition 4.1 MDDM Dataset. Given a dataset, the domain-adaptive densities of its data points are calculated. If multiple points with the same highest domain density coexist in a region, we say that the dataset holds the characteristic of multiple domain-density maximums. A dataset with multiple domain-density maximums is defined as an MDDM dataset.

Definition 4.2 ED Dataset. Given a dataset, the domain densities of its data points are calculated. If each data point has the same value of domain density, the dataset is under an equilibrium distribution and is defined as an ED dataset. In such a case, each data point, having the same value of domain density, is regarded as a domain-density peak and further considered as a candidate cluster center.

Clustering results of the CFSFDP algorithm depend on a strict constraint that only one local density maximum is assumed to exist in each candidate cluster. However, when there exist multiple local density maximums in a natural cluster, CFSFDP might lead to the problem of cluster fragmentation; namely, a cluster is split into many fragmented clusters. Two examples of the clustering decision graphs of CFSFDP on MDDM and ED datasets are shown in Fig. 7.

Fig. 7. Clustering decision graphs of MDDM and ED datasets. (a) MDDM dataset. (b) Decision graph. (c) ED dataset. (d) Decision graph.

In Fig. 7 (b), there are as many as 29 decision points that hold high values of both domain density ∂ and Delta distance δ. In such a case, the dataset is divided into 29 fragmented clusters instead of the 2 natural clusters shown in Fig. 7 (a). As shown in Fig. 7 (c), there are two isolated regions in the dataset, and the data points in each region are uniformly distributed. Hence, this dataset is expected to be divided into 2 natural clusters. However, many points exhibit similar values of local/domain densities and are regarded as cluster centers, as shown in Fig. 7 (d). Consequently, this dataset is incorrectly divided into numerous fragmented sub-clusters rather than the expected two clusters.

4.2 Initial Cluster Self-identification

4.2.1 Cluster Center Identification

We propose a self-identification method to automatically extract the cluster centers based on the clustering decision graph. To automatically determine the parameter threshold values of the domain density and Delta distance, a critical point of the clustering decision graph is defined. The critical point Cp(x, y) of a clustering decision graph is defined as a splitting point by which the candidate cluster centers, outliers, and remaining points can be divided obviously, as shown in Fig. 8.

Fig. 8. Critical point of a clustering decision graph.

Under the assumption of the CFSFDP and DADC algorithms, cluster centers are the points with relatively high domain-density peaks, while outliers have the lowest domain densities. It is easy to conclude that the domain-density values of density peaks are obviously different from those of outliers. Therefore, we take the middle value of the maximum domain density as the horizontal-axis value of the critical point, namely, Cp(x) = ∂max / 2. In addition, based on extensive experiments and applications, it is an effective solution to set the vertical-axis value of the critical point as one quarter of the maximum value of the Delta distance, namely, Cp(y) = δmax / 4. Therefore, the value of the critical point Cp(x, y) of the clustering decision graph is defined as:

Cp(x, y) = (∂max / 2, δmax / 4),   (7)

where δmax and ∂max are the maximum values of δ and ∂. Based on the critical point, data points in the clustering decision graph can be divided into three subsets, namely, cluster centers, outliers, and remaining points. The division of data points in the clustering decision graph is defined as:

Λ(xi) = { cluster centers,   if ∂i > Cp(x) and δi > Cp(y);
          outliers,          if ∂i < Cp(x) and δi > Cp(y) × ∂i / Cp(x);
          remaining points,  otherwise,                                   (8)

where Λ(xi) refers to the subset that xi belongs to. In Fig. 8, after obtaining the value of the critical point, the data points in the decision graph can be easily divided into three subsets. Red points are detected as candidate cluster centers. Black points have low values of domain density and high values of Delta distance; they are identified as outliers and removed from the clustering results. Blue points refer to the remaining data points, which are assigned to the related clusters in the next step. Hence, the initial cluster centers of the dataset are obtained with few parameter requirements and minimum artificial intervention.

4.2.2 Remaining Data Point Assignment

After the cluster centers are detected, each of the remaining data points is assigned to the cluster to which its nearest higher domain-density neighbor belongs. For each remaining data point xi, the neighbors with a higher density are labeled as N′(xi). For a data point xj ∈ N′(xi) with the shortest distance dij, if xj has been assigned to a cluster ca, then xi is also assigned to ca. Otherwise, the cluster of xj is further measured iteratively. This step is repeated until all of the remaining data points are assigned to the related clusters. An example of the remaining data point assignment is shown in Fig. 9. The process of initial cluster self-identification of the DADC algorithm is presented in Algorithm 4.1.

Fig. 9. Example of the remaining data points assignment.

Algorithm 4.1 consists of two steps: cluster center identification and remaining data point assignment. Assuming that the number of data points in X is n, the number of cluster centers is m, and that of the remaining data points is n′, in general the number of cluster centers and outliers is far less than that of the remaining points, which shows that m + n′ ≈ n. Therefore, the computational complexity of Algorithm 4.1 is O(n).
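To illustrate the split that Eqs. (7)-(8) describe and that Algorithm 4.1 below applies, here is a minimal Python sketch; the input arrays are those produced by the earlier snippets, and the function name is our own, not the authors'.

import numpy as np

def identify_centers(den, delta):
    # den: domain(-adaptive) densities ∂; delta: Delta distances δ.
    cp_x = den.max() / 2.0                      # Eq. (7), horizontal-axis value
    cp_y = delta.max() / 4.0                    # Eq. (7), vertical-axis value
    # Division rule of Eq. (8).
    centers = np.where((den > cp_x) & (delta > cp_y))[0]
    outliers = np.where((den < cp_x) & (delta > cp_y * den / cp_x))[0]
    remaining = np.setdiff1d(np.arange(len(den)), np.union1d(centers, outliers))
    return centers, outliers, remaining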


Algorithm 4.1 Cluster center self-identification of the DADC algorithm.
Input:
  X: the raw dataset for clustering;
  ∂: the domain densities of the data points of X;
  δ: the Delta distances of the data points of X;
Output:
  ICX: the initial clusters of X.
1: get the maximum domain density ∂max ← max(∂);
2: get the maximum Delta distance δmax ← max(δ);
3: calculate the critical point Cp(x, y) = (∂max / 2, δmax / 4);
4: for each xi in X do
5:   if ∂i > Cp(x) and δi > Cp(y) then
6:     append to the set of cluster centers Λc ← xi;
7:   else if ∂i < Cp(x) and δi > Cp(y) × ∂i / Cp(x) then
8:     append to the set of outliers Λo ← xi;
9:   else
10:    append to the set of remaining data points Λn ← xi;
11: for each xi in Λn do
12:   append xi to the nearest cluster in ICX;
13: return ICX.

Depending on the cluster self-identification method of DADC, we can obtain cluster centers and initial clusters quickly and simply. Although the number of cluster centers obtained in this way might be larger than the real one and thus cause many fragmented clusters, it does not lead to a clustering result in which multiple natural clusters are wrongly merged into one cluster. Focusing on the scenario of fragmented clusters, we introduce a cluster self-ensemble method to merge the preliminary clustering results.

4.3 Fragmented Cluster Self-ensemble

To address the limitation of cluster fragmentation of DADC on MDDM datasets, a fragmented cluster self-ensemble method is proposed in this section. The basic principle of clustering analysis is that individuals in the same cluster have high similarities with each other, while differing from individuals in other clusters. Therefore, to find out which clusters are misclassified into multiple sub-clusters, we propose an inter-cluster similarity measurement and a cluster fusion degree model for fragmented cluster self-ensemble. Clusters with a superior density similarity and cluster fusion degree are merged into the same cluster.

Definition 4.3 Inter-cluster Density Similarity (IDS). The inter-cluster density similarity between two clusters refers to the degree of similarity of their cluster densities. The average density of a cluster is the average value of the domain densities of all data points in the cluster.

Let Sa,b be the inter-cluster density similarity between clusters ca and cb. The larger the value of Sa,b, the more similar the densities of the two clusters. Sa,b is defined as:

Sa,b = 2 √(KDen_ca × KDen_cb) / (KDen_ca + KDen_cb),   (9)

where KDen_ca = (1/|ca|) Σ_{i ∈ ca} KDeni and 0 < Sa,b = Sb,a ≤ 1. Let f(u, v) = 2√(uv) / (u + v) = 2 / (√(u/v) + √(v/u)) and let x = √(u/v) ∈ (0, 1]; then f can be written as f(x) = 2x / (x² + 1). Since f′(x) = 2(1 − x²) / (x² + 1)² ≥ 0 on (0, 1], f(x) is a strictly monotonically increasing function and fmax = f(1) = 1. Hence, the closer the value of x is to 1, the more similar the densities of the two clusters ca and cb.

In addition, the distance between every two clusters is considered. In the relevant studies, different methods were introduced to calculate the distance between two clusters, such as the distance between the center points, the nearest points, or the farthest points of the two clusters [5]. However, these measures are easily affected by noise or outliers, while noisy data elimination is another challenge. We propose an innovative method to measure the closeness between clusters: crossing points between every two clusters are found and the crossover degree of the clusters is calculated.

For each boundary point xi in cluster ca, let N(xi) be the K nearest neighbors of xi. We denote N(i,b) as the set of points in N(xi) belonging to cluster cb, and N(i,a) as the set of points in N(xi) belonging to ca. If the number of neighbors belonging to cb is close to that of neighbors belonging to the current cluster ca, then xi is defined as a crossing point of ca and is represented as xi ∈ CP(a→b). The crossover degree c(i,a→b) of a crossing point xi in ca between clusters ca and cb is defined as:

c(i,a→b) = 2 √(|N(i,a)| × |N(i,b)|) / (|N(i,a)| + |N(i,b)|),   (10)

where N(i,a) ⊆ (N(xi) ∩ ca), N(i,b) ⊆ (N(xi) ∩ cb), and 0 < c(i,a→b) ≤ 1. Based on the crossover degrees of all crossing points of each cluster, we can define the crossover degree between every two clusters.

Definition 4.4 Cluster Crossover Degree (CCD). The cluster crossover degree Ca,b of two clusters ca and cb is calculated as the sum of the crossover degrees of all crossing points between ca and cb. The formula of CCD is defined as:

Ca,b = Σ_{xi ∈ CP(a→b)} c(i,a→b) + Σ_{xj ∈ CP(b→a)} c(j,b→a).   (11)

To measure whether the data points in a cluster have similar domain densities, we give a definition of the cluster density stability. By analyzing the internal density stability of the clusters to be merged and that of the merged cluster, we can determine whether the merger is conducive to the stability of these clusters. The internal density stability of clusters is an important indicator of cluster quality.

Definition 4.5 Cluster Density Stability (CDS). Cluster density stability is the reciprocal of the cluster density variance, which is calculated from the deviations between the domain density of each point and the average domain density of the cluster. The larger the CDS of a cluster, the smaller the domain-density differences among the points in the cluster. The CDS of a cluster ca is defined as:

da = log ( √( Σ_{i ∈ ca} (KDeni − KDen_ca)² ) ),   (12)

where KDen_ca is the average value of the domain densities of the data points in ca, and |ca| is the number of data points in ca. A cluster with a high CDS means that the data points in the cluster have small domain-density differences; namely, most data points in the same cluster have similar domain densities.
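The three indicators defined so far (Eqs. 9-12) can be computed directly from per-point densities, cluster labels, and KNN lists. The sketch below is our own illustrative reading: the crossing-point test is simplified to "both clusters appear among the neighbors", whereas the paper requires the two neighbor counts to be close, and a small epsilon guards the logarithm.

import numpy as np

def ids(den, members_a, members_b):
    # Eq. (9): inter-cluster density similarity from average per-point densities.
    ka, kb = den[members_a].mean(), den[members_b].mean()
    return 2.0 * np.sqrt(ka * kb) / (ka + kb)

def ccd(labels, neighbors, a, b):
    # Eqs. (10)-(11): sum crossover degrees over points whose KNN straddle clusters a and b.
    total = 0.0
    for i in np.where((labels == a) | (labels == b))[0]:
        na = np.sum(labels[neighbors[i]] == a)
        nb = np.sum(labels[neighbors[i]] == b)
        if na > 0 and nb > 0:        # simplified crossing-point test
            total += 2.0 * np.sqrt(na * nb) / (na + nb)
    return total

def cds(den, members):
    # Eq. (12): cluster density stability from density deviations within the cluster.
    dev = den[members] - den[members].mean()
    return np.log(np.sqrt(np.sum(dev ** 2)) + 1e-12)

These helpers feed the fusion-degree sketch given after Eq. (14) below.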


For two clusters ca and cb with a high inter-cluster density similarity and a high crossover degree, we can further calculate their CDSs. Assuming that da and db are the CDSs of ca and cb, and da+b is that of the new cluster merged from ca and cb, the CDS ratio Da,b between ca and cb is calculated in Eq. (13):

Da,b = (da+b / da) × (da+b / db).   (13)

If the CDS of the merged cluster is close to the average value of da and db, it indicates that the merger of the two clusters does not reduce their overall density stability.

Based on the above indicators of clusters, including the inter-cluster density similarity, cluster crossover degree, and cluster density stability, the definition of the cluster fusion degree is proposed.

Definition 4.6 Cluster Fusion Degree (CFD). The cluster fusion degree of two clusters is the degree of correlation between the clusters in terms of location and density distribution, which is calculated depending upon the values of IDS, CCD, and CDS. Two clusters with a high degree of fusion should satisfy the following conditions: (1) having a high value of IDS, (2) having a high value of CCD, and (3) the CDS of the merged cluster should be close to the average value of the two initial clusters' CDSs. If two adjacent and crossed clusters hold a high IDS and similar CDSs, they have a high fusion degree.

Based on Definition 4.6, the fusion degree F between two clusters is expressed as a triangle in an equilateral triangle framework, as shown in Fig. 10. The vertices of the triangle represent S, C, and D, respectively. The value of each indicator variable is represented by the segment from the triangle center point to the corresponding vertex. Then, the value of Fa,b between clusters ca and cb is obtained by calculating the area of the corresponding triangle consisting of Sa,b, Ca,b, and Da,b, defined as:

Fa,b = S(Sa,b, Ca,b, Da,b)
     = (√3 / 4) (Sa,b × Ca,b + Ca,b × Da,b + Da,b × Sa,b).   (14)

If the value of Fa,b exceeds a given threshold, then clusters ca and cb are merged into a single cluster. In Fig. 10, there are three triangles with different edge colors, representing the corresponding fusion degrees of three cluster pairs (c0, c1, c2). Fusion degrees between the merged cluster and other clusters continue to be evaluated. The process is repeated until the CFDs of all clusters are below the threshold. An example of the cluster fusion degrees between three clusters (c0 - c2) is given in Fig. 10.

Fig. 10. Cluster fusion degree measurement.

An example of a cluster ensemble is shown in Fig. 11. The detailed steps of the cluster self-ensemble process of DADC are presented in Algorithm 4.2.

Fig. 11. Example of fragmented cluster ensemble. (a) Fragmented sub-clusters. (b) Ensembled clusters.

Algorithm 4.2 Cluster self-ensemble of DADC.
Input:
  ICX: the initial clusters of X;
  θF: the threshold value of the cluster fusion degree for cluster self-ensemble.
Output:
  MCX: the merged clusters of dataset X.
1: while ICX ≠ ∅ do
2:   get the first cluster ca from ICX;
3:   for each cb (cb ≠ ca) in ICX do
4:     calculate the inter-cluster density similarity Sa,b;
5:     calculate the crossing points and crossover degrees c(i,a→b) and c(j,b→a);
6:     calculate the cluster crossover degree Ca,b;
7:     calculate the cluster density stabilities da, db, and da+b;
8:     calculate the cluster density stability ratio Da,b;
9:     calculate the cluster fusion degree Fa,b;
10:    if Fa,b > θF then
11:      merge clusters c′a ← merge(ca, cb);
12:      remove cb from ICX;
13:  if ca = c′a then
14:    append ca to the merged clusters MCX;
15:    remove ca from ICX;
16: return MCX.

In Algorithm 4.2, for each initial cluster ca in ICX, we respectively calculate the cluster crossover degree, cluster density stability, and cluster fusion degree between ca and each residual cluster cb. Then, in each iteration, we try to merge the two clusters with the highest cluster fusion degree. Assuming that the number of initial clusters is m, the computational complexity of Algorithm 4.2 is O(C_m^2).

The DADC algorithm consists of the processes in Algorithms 3.1, 4.1, and 4.2, requiring computational complexities of O(n), O(n), and O(C_m^2), respectively. Thus, the computational complexity of the DADC algorithm is O(2n + C_m^2), where n is the number of points in the dataset X and m is the number of initial clusters.
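Combining Eqs. (13)-(14) with the helper functions sketched after Definition 4.5, a hedged Python version of the per-pair fusion test in Algorithm 4.2 might look as follows; the threshold value and the merge bookkeeping are simplified assumptions, not the reference implementation.

import numpy as np

def fusion_degree(s_ab, c_ab, d_ab):
    # Eq. (14): area-style combination of IDS, CCD, and the CDS ratio.
    return (np.sqrt(3.0) / 4.0) * (s_ab * c_ab + c_ab * d_ab + d_ab * s_ab)

def try_merge(labels, den, neighbors, a, b, theta_f=0.5):
    # Decide whether fragments a and b should be merged (one pair of Algorithm 4.2).
    mem_a, mem_b = np.where(labels == a)[0], np.where(labels == b)[0]
    s_ab = ids(den, mem_a, mem_b)                     # Eq. (9)
    c_ab = ccd(labels, neighbors, a, b)               # Eqs. (10)-(11)
    d_a, d_b = cds(den, mem_a), cds(den, mem_b)       # Eq. (12)
    d_merged = cds(den, np.concatenate([mem_a, mem_b]))
    d_ab = (d_merged / d_a) * (d_merged / d_b)        # Eq. (13)
    if fusion_degree(s_ab, c_ab, d_ab) > theta_f:
        labels[mem_b] = a                             # merge c_b into c_a
        return True
    return False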


5 EXPERIMENTS

5.1 Experiment Setup

Experiments are conducted to evaluate the proposed DADC algorithm by comparing it with the CFSFDP [7], OPTICS [24], and DBSCAN [25] algorithms in terms of clustering result analysis and performance evaluation. The experiments are performed using a workstation equipped with an Intel Core i5-6400 quad-core CPU, 8 GB DRAM, and 2 TB main memory. Two groups of datasets, i.e., synthetic and large-scale real-world datasets, are used in the experiments. These datasets are downloaded from published online benchmarks, such as the clustering benchmark datasets [26] and the UCI Machine Learning Repository [27], as shown in Tables 1 and 2. An implementation of DADC is available from Github at https://github.com/JianguoChen2015/DADC.

TABLE 1
Synthetic datasets used in experiments.

Datasets                 #.Samples   #.Dimensions   #.Clusters
Aggregation              350         2              7
Compound                 399         2              6
Heartshapes              788         2              3
Yeast                    1484        2              10
Gaussian clusters (G2)   2048        2              2

TABLE 2
Large-scale datasets used in experiments.

Datasets                                            #.Samples   #.Dimensions   #.Clusters
Individual household electric power consumption     275,259     9              196
(IHEPC)
Flixster (ASU)                                      523,386     2              153
Heterogeneity activity recognition (HAR)            930,257     16             289
Twitter (ASU)                                       316,811     2              194

5.2 Clustering Results Analysis on Synthetic Datasets

To clearly and vividly illustrate the clustering results of DADC, multiple experiments are conducted on synthetic datasets in this section by comparing the related clustering algorithms, including DADC, CFSFDP in [7], DBSCAN in [25], and OPTICS in [24]. Synthetic datasets with the features of VDD, MDDM, and ED are used in the experiments.

5.2.1 Clustering Results on VDD Datasets

To illustrate the effectiveness of the proposed domain-adaptive density measurement in DADC, we conduct experiments on VDD datasets. Fig. 3 (a) is a synthetic dataset (Heartshapes) described in Table 1, which is composed of three heart-shaped regions with different densities. Each region contains 71 data points.

The local density and domain density of each data point are calculated by the original measurement of CFSFDP and the domain-adaptive measurement of DADC separately, as shown in Fig. 12 (b). It is evident from Fig. 12 (b) that, according to CFSFDP, the local densities of data points in the second region (no. 72-142) are far higher than those of the other two relatively sparse regions. In this case, it is difficult to detect the local-density peaks in the sparse regions. In contrast, according to DADC, although the density distributions of the three regions are different, the domain-density peaks of each region are obviously identified. As shown in Fig. 3 (b), there is only one decision point in the CFSFDP clustering decision graph that is detected as a cluster center. More than 140 points have low values of local density and high values of Delta distance and are detected as outliers. In contrast, in the DADC clustering decision graph of Fig. 12 (d), there are three decision points with high values of both domain-adaptive density and Delta distance, which are identified as the cluster centers of the three regions separately. The clustering results show that the proposed DADC algorithm can effectively detect the domain-density peaks of data points and identify clusters in regions of different density.

Fig. 12. Decision graphs for a VDD dataset. (a) Data points. (b) Local/domain density. (c) Decision graph of CFSFDP. (d) Decision graph of DADC.

Two groups of VDD datasets (Aggregation and Compound) are used in the experiments to further evaluate the clustering effectiveness of DADC by comparing it with the CFSFDP algorithm. The local density of each data point is obtained by the CFSFDP algorithm, while the KNN-density, domain density, and domain-adaptive density of each data point are calculated by the proposed DADC algorithm. The comparison results are illustrated in Fig. 13.

As shown in Fig. 13 (b) and Fig. 13 (d), the local densities of data points in dense regions are obviously higher than those of data points in sparse regions. It is easy to treat the data points in sparse regions as noisy data rather than independent clusters. In contrast, with the method of DADC, the domain-adaptive densities of all data points are detected with obvious differences. Although the datasets have multiple regions with different densities, the domain-density peaks in each region are quickly identified. More comparison results on VDD datasets are described in the supplementary material.

5.2.2 Clustering Results on ED Datasets

To evaluate the effect of the cluster self-identification method of the proposed DADC algorithm, experiments are performed on an ED dataset (Hexagon) by comparing with the CFSFDP, OPTICS, and DBSCAN algorithms, respectively. The clustering results are shown in Fig. 14.


More comparison results on ED datasets are described in the supplementary material.

Fig. 13. Adaptive domain densities on VDD datasets. (a) Data points of Aggregation. (b) Local/domain densities on Aggregation. (c) Data points of Compound. (d) Local/domain densities on Compound.

Fig. 14. Clustering results on equilibrium distributed datasets. (a) DADC on Hexagon. (b) CFSFDP on Hexagon. (c) OPTICS on Hexagon. (d) DBSCAN on Hexagon.

Since the ED dataset does not have local-density peaks, CFSFDP obtains numerous fragmented clusters. Making use of the cluster self-identification and cluster ensemble process, the proposed DADC algorithm can merge the fragmented clusters effectively. Therefore, DADC effectively solves this problem and obtains accurate clustering results. As shown in Fig. 14 (a), the dataset is clustered into two clusters by DADC. In contrast, as shown in Fig. 14 (b) - (d), more than 231 fragmented clusters are produced by CFSFDP. The clustering results of OPTICS and DBSCAN are very sensitive to the parameter thresholds of eps (connectivity radius) and minpts (minimum number of shared neighbors). For example, when we set the parameter thresholds of eps and minpts to 10 and 5 for both OPTICS and DBSCAN, 14 and 11 fragmented clusters are generated by OPTICS and DBSCAN, respectively. The experimental results show that the proposed DADC algorithm is more accurate than the other algorithms for ED dataset clustering.

5.2.3 Clustering Results on MDDM Datasets

For datasets with MDDM characteristics, multiple domain-density maximums might lead to fragmented clusters whose overall distribution is similar to that of adjacent clusters. We conduct comparison experiments on a dataset with MDDM characteristics to evaluate the clustering effect of the comparative clustering algorithms. A synthesized dataset (G2) with MDDM characteristics is used in the experiments. The clustering results are shown in Fig. 15. More comparison results on MDDM datasets are described in the supplementary material.

Fig. 15. Clustering results on an MDDM dataset. (a) DADC on G2. (b) CFSFDP on G2. (c) OPTICS on G2. (d) DBSCAN on G2.

After obtaining 17 density peaks using CFSFDP, 17 corresponding clusters are generated, as shown in Fig. 15 (b). However, these clusters have a similar overall density distribution, and it is reasonable to merge them into a single cluster. DADC can eventually merge the 17 fragmented clusters into one cluster, as shown in Fig. 15 (a). Again, the clustering results of OPTICS and DBSCAN are very sensitive to the parameter thresholds of eps and minpts. As shown in Fig. 15 (c) and (d), when we set the OPTICS and DBSCAN algorithms' parameter thresholds of eps and minpts to 13 and 10, 22 and 11 fragmented clusters are produced by OPTICS and DBSCAN, respectively. Compared with CFSFDP, OPTICS, and DBSCAN, the experimental results show that DADC achieves more reasonable clustering results on MDDM datasets.
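The eps/minpts sensitivity noted above is easy to reproduce independently. The snippet below is a stand-alone illustration using scikit-learn's DBSCAN on synthetic blobs; the dataset and parameter values are arbitrary stand-ins, not the settings used in the experiments reported here.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two well-separated groups of points, loosely mimicking a two-cluster dataset.
X, _ = make_blobs(n_samples=600, centers=[(0, 0), (12, 12)], cluster_std=1.5, random_state=0)

for eps in (0.5, 1.0, 2.0):
    for min_samples in (5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # label -1 marks noise
        print(f"eps={eps:4.1f}  min_samples={min_samples:2d}  clusters={n_clusters}")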


5.3 Performance Evaluation


5.3.1 Clustering Accuracy Analysis
Clustering Accuracy (CA) is introduced to evaluate the
clustering algorithms. CA measures the ratio of the correctly
classified/clustered instances to the pre-defined class labels.
Let X be the dataset in this experiment, C be the set of
classes/clusters detected by the corresponding algorithm,
and L be the set of pre-defined class labels. CA is defined in
Eq. (15):


K−1
max (Ci |Li )
CA = , (15)
i=0
|X| Fig. 16. Comparison of algorithm robustness.

where Ci is the set of data points in the i-th class/cluster, Li is the set of pre-defined class labels of the data points in Ci, and K is the number of clusters in C. max(Ci|Li) is the number of data points in Ci that carry the majority label. A greater value of CA indicates higher accuracy of the classification/clustering algorithm and higher purity of each cluster.
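For concreteness, the following is a minimal sketch, not taken from the paper, of how the CA score in Eq. (15) can be computed from predicted cluster assignments and pre-defined labels; the function name and the NumPy-based implementation are illustrative assumptions.

```python
# Illustrative sketch (assumed implementation) of Eq. (15):
# CA = (sum over detected clusters of the majority-label count) / |X|.
import numpy as np

def clustering_accuracy(cluster_ids, true_labels):
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    correct = 0
    for c in np.unique(cluster_ids):                      # the K detected clusters
        labels_in_c = true_labels[cluster_ids == c]
        _, counts = np.unique(labels_in_c, return_counts=True)
        correct += counts.max()                           # max(Ci | Li)
    return correct / len(true_labels)                     # divide by |X|

# Example: two clusters with one impure point -> CA = 5/6
print(clustering_accuracy([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"]))
```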
The experimental results of the clustering accuracy comparison are given in Table 3.

TABLE 3
Clustering accuracy comparison.

Datasets      DADC      CFSFDP    OPTICS    DBSCAN
Heartshapes   100.00%   83.42%    91.33%    91.33%
Yeast          91.67%   83.23%    82.54%    80.41%
G2            100.00%   90.45%    84.23%    82.85%
IHEPC          92.34%   87.72%    73.98%    62.03%
Flixster       87.67%   79.09%    65.31%    55.51%
Twitter        72.26%   68.85%    53.90%    51.42%
HAR            83.29%   84.23%    58.26%    56.92%
As shown in Table 3, DADC outperforms the other algorithms on both synthetic and large-scale real-world datasets. In the case of Flixster, the average CA of DADC is 87.67%, while that of CFSFDP is 79.09%, that of OPTICS is 65.31%, and that of DBSCAN is 55.51%. For the synthetic datasets, DADC achieves a high average CA of 97.22%. The average accuracies of the OPTICS and DBSCAN algorithms are noticeably lower than those of CFSFDP and DADC. For the large-scale real-world datasets, the CA of DADC is higher than that of the compared algorithms, remaining in the range of 72.26% to 92.34%. This illustrates that DADC achieves higher clustering accuracy than the CFSFDP, OPTICS, and DBSCAN algorithms.

5.3.2 Robustness Analysis
Experiments are conducted to evaluate the robustness of the compared algorithms on noisy datasets. Four groups of real-world datasets from practical applications described in Table 2 are used in the experiments with different degrees of noise. We generate different amounts of random and non-repetitive data points as noise in the value space of the original dataset. The noise level of each dataset gradually increases from 1.0% to 15.0%. The experimental results are presented in Fig. 16.
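The paper describes this noise-injection step only at a high level, so the following is a rough sketch under assumed details: uniform random points are drawn inside the per-feature value range of the original dataset and appended to it at a chosen noise level; the function name and sampling scheme are illustrative, not the authors' code.

```python
# Illustrative sketch (assumed procedure): append a fraction of uniform random
# noise points drawn inside the value space (per-feature min/max box) of X.
import numpy as np

def add_uniform_noise(X, noise_level, seed=0):
    rng = np.random.default_rng(seed)
    n_noise = int(round(noise_level * len(X)))
    low, high = X.min(axis=0), X.max(axis=0)                 # value space of the data
    noise = rng.uniform(low, high, size=(n_noise, X.shape[1]))
    return np.vstack([X, noise])

# Example: noise levels from 1% to 15%, as in the robustness experiments
X = np.random.rand(1000, 2)
for level in (0.01, 0.05, 0.10, 0.15):
    print(f"noise level {level:.0%}: {add_uniform_noise(X, level).shape[0]} points")
```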
Fig. 16. Comparison of algorithm robustness.

As observed in Fig. 16, with the increasing proportion of noise, the average clustering accuracy of each algorithm decreases. However, the average clustering accuracy of DADC drops at the slowest rate, those of CFSFDP and OPTICS rank second and third, and that of DBSCAN declines the fastest. When the noise level rises from 1.0% to 15.0%, the average accuracy of DADC decreases from 96.14% to 80.39%, which indicates that DADC is the most robust to data with different noise levels. The average accuracy of CFSFDP drops from 94.21% to 53.18%, and that of DBSCAN decreases from 78.52% to 31.74%. Compared with the other algorithms, DADC retains higher accuracy in each case. Therefore, DADC exhibits higher robustness to noisy data than the compared algorithms.

6 CONCLUSIONS
This paper presented a domain-adaptive density clustering algorithm, which is effective on datasets with varying-density distribution (VDD), multiple domain-density maximums (MDDM), or equilibrium distribution (ED). A domain-adaptive method was proposed to calculate domain densities and detect the density peaks of data points in VDD datasets, from which cluster centers are identified. In addition, a cluster fusion degree (CFD) model and a CFD-based cluster self-ensemble method were proposed to merge fragmented clusters with minimum artificial intervention in MDDM and ED datasets. In comparison with existing clustering algorithms, the proposed DADC algorithm requires fewer parameters and is non-iterative, achieving outstanding advantages in terms of accuracy and robustness.
As future work, we will further research issues of big data clustering analysis, including incremental clustering, time-series data clustering, and parallel clustering in distributed and parallel computing environments.

ACKNOWLEDGMENT
This research is partially funded by the National Key R&D Program of China (Grant No. 2016YFB0200201), the Key Program of the National Natural Science Foundation of China (Grant No. 61432005), the National Outstanding Youth Science Program of the National Natural Science Foundation of China (Grant No. 61625202), and the International Postdoctoral Exchange Fellowship Program (Grant No. 2018024). This work is also supported in part by NSF through grants IIS-1526499, IIS-1763325, CNS-1626432, and NSFC 61672313.
Jianguo Chen received the Ph.D. degree from the College of Computer Science and Electronic Engineering at Hunan University, China. He was a visiting Ph.D. student at the University of Illinois at Chicago from 2017 to 2018. He is currently a postdoctoral researcher at the University of Toronto and Hunan University. His major research areas include parallel computing, cloud computing, machine learning, data mining, bioinformatics, and big data.

Philip S. Yu received the B.S. degree in E.E. from National Taiwan University, the M.S. and Ph.D. degrees in E.E. from Stanford University, and the M.B.A. degree from New York University. He is a Distinguished Professor in Computer Science at the University of Illinois at Chicago and also holds the Wexler Chair in Information Technology. His research interest is in big data, including data mining, data streams, databases, and privacy. He has published more than 1,100 papers in refereed journals and conferences. He holds or has applied for more than 300 US patents. He was the Editor-in-Chief of ACM TKDD (2011-2017) and IEEE TKDE (2001-2004). Dr. Yu is a Fellow of the ACM and the IEEE.