A Domain Adaptive Density Clustering Algorithm for Data with Varying Density Distribution
Abstract— As one type of efficient unsupervised learning method, clustering algorithms have been widely used in data mining and knowledge discovery with noticeable advantages. However, clustering algorithms based on density peaks have limited clustering effectiveness on data with varying density distribution (VDD), equilibrium distribution (ED), and multiple domain-density maximums (MDDM), leading to the problems of sparse cluster loss and cluster fragmentation. To address these problems, we propose a Domain-Adaptive Density Clustering (DADC) algorithm, which consists of three steps: domain-adaptive density measurement, cluster center self-identification, and cluster self-ensemble. For data with VDD features, clusters in sparse regions are often neglected when uniform density-peak thresholds are used, which results in the loss of sparse clusters. We define a domain-adaptive density measurement method based on K-Nearest Neighbors (KNN) to adaptively detect the density peaks of regions with different densities, and we treat each data point and its KNN neighborhood as a subgroup to better reflect its density distribution from a domain view. In addition, for data with ED or MDDM features, a large number of density peaks with similar values can be identified, which results in cluster fragmentation. We propose a cluster center self-identification and cluster self-ensemble method to automatically extract the initial cluster centers and merge the fragmented clusters. Experimental results demonstrate that, compared with existing algorithms, the proposed DADC algorithm obtains more reasonable clustering results on data with VDD, ED, and MDDM features. Benefiting from requiring only a few parameters and from its non-iterative nature, DADC achieves low computational complexity and is suitable for large-scale data clustering.
Index Terms—Cluster fragmentation, density-peak clustering, domain-adaptive density clustering, varying density distribution.
We propose a domain-adaptive density method to address the problem of sparse cluster loss of DPC algorithms on VDD data. Domain density peaks of data points in regions with different densities are adaptively detected. In addition, candidate cluster centers are identified based on the decision parameters of the domain densities and Delta distances.

Fig. 3. Example of the CFSFDP algorithm on a VDD dataset: (a) data points; (b) decision graph (local density vs. Delta distance).
3.1 Problem Definitions
Most DPC algorithms [7, 8] are based on the assumption that a cluster center is surrounded by neighbors with lower local densities and has a great Delta distance from any points with higher densities. For each data point x_i, its local density ρ_i and Delta distance δ_i are calculated from the higher-density points. These two quantities depend only on the distances between the data points. The local density ρ_i of x_i is defined as:

ρ_i = Σ_j χ(d_ij − d_c),   (1)

where d_c is a cutoff distance and χ(x) = 1 if x < 0; otherwise, χ(x) = 0. Basically, ρ_i is equal to the number of points closer than d_c to x_i. The Delta distance δ_i of x_i is measured by computing the shortest distance between x_i and any other point with a higher density, defined as:

δ_i = min_{j: ρ_j > ρ_i} d_ij.   (2)

For the highest-density point, δ_i = max_j d_ij. Points with a high ρ and a high δ are considered cluster centers, while points with a low ρ and a high δ are considered outliers. After finding the cluster centers, each remaining point is assigned to the same cluster as its nearest neighbor of higher density.
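For concreteness, the following is a minimal NumPy sketch of Eqs. (1) and (2) computed from a precomputed distance matrix. The function names and the brute-force loop are our own illustrative choices, not the authors' implementation, and tie-breaking among equal densities is left unhandled.

```python
import numpy as np

def local_density(d, dc):
    """Eq. (1): rho_i = number of points x_j with d_ij < dc (j != i)."""
    # d is an (n, n) symmetric distance matrix; subtract 1 so the
    # zero self-distance d_ii does not count toward rho_i.
    return (d < dc).sum(axis=1) - 1

def delta_distance(d, rho):
    """Eq. (2): delta_i = min distance to any higher-density point;
    for the highest-density point, delta_i = max_j d_ij."""
    n = len(rho)
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    return delta
```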
Most data in actual applications have the characteristics of noise, irregular distribution, and sparsity. In particular, the density distribution of data points is unpredictable and discrete in most cases. In a VDD dataset, regions with different degrees of density coexist, such as dense and sparse regions, as defined as follows.

Definition 3.1 VDD Data. For a dataset with multiple regions, the average density of the data points in each region is set as the region's density. If regions with obviously different region densities coexist, such as dense and sparse regions, we denote the dataset as a Varying-Density Distributed (VDD) dataset.

The CFSFDP algorithm and other DPC algorithms suffer from the limitation of sparse cluster loss on VDD datasets. According to Eq. (1), points in relatively sparse areas are easily ignored as outliers. An example of the CFSFDP clustering results on a VDD dataset is shown in Fig. 3. In Fig. 3 (a), the heart-shaped dataset has three regions with different densities. The clustering decision graph achieved by CFSFDP is shown in Fig. 3 (b), where only one point is obtained with high values of both ρ and δ. Consequently, the dataset is clustered into one cluster, while the data points in the sparse regions, indicated by blue dots and purple squares, are removed as outliers or incorporated into the dense cluster.

3.2 Domain-Adaptive Density Measurements

To adaptively detect the domain-density peaks in different density areas of VDD data, a domain-adaptive density calculation method is presented in this section. Domain-distance and domain-density calculation methods are presented based on the KNN method [22, 23]. These methods are very useful on large-scale datasets that likely contain varying distribution densities in actual applications.

To more precisely describe the locality of VDD data, we propose a new definition of domain density based on the KNN method. Given a dataset X, the KNN-distance and KNN-density of each data point in X are calculated, respectively.

Definition 3.2 KNN-Distance. Given a dataset X, the KNN-distance of a data point x_i refers to the average distance of x_i to its K nearest neighbors. The KNN-distance of each data point x_i is defined as KDist_i:

KDist_i = (1/K) Σ_{j∈N(x_i)} d_ij,   (3)

where K is the number of neighbors of x_i and N(x_i) is the set of its neighbors. Based on the KNN-distance, we calculate the KNN-density of each data point.

Definition 3.3 KNN-Density. The KNN-density of a data point x_i in dataset X refers to the reciprocal of its KNN-distance. The smaller the KNN-density of a data point, the more sparse the area in which the point is located. The KNN-density of data point x_i is defined as KDen_i:

KDen_i = 1/KDist_i = K / Σ_{j∈N(x_i)} d_ij.   (4)

After obtaining the KNN-distance and KNN-density, the domain-adaptive density of each data point is defined. We treat the set consisting of each data point x_i and its neighbors N(x_i) as a subgroup to observe its density distribution in X.

Definition 3.4 Domain Density. The domain density of a data point x_i in dataset X is the sum of the KNN-density of x_i and the weighted KNN-densities of its K nearest neighbors. The domain density of the data point x_i is defined as ∂_i:

∂_i = KDen_i + Σ_{j∈N(x_i)} (KDen_j × w_j),   (5)

where w_j = 1/d_ij is the weight of the KNN-density of each neighbor x_j with respect to x_i. Compared to the KNN-density, the domain density better reflects the density distribution of data points in the local area.
is assigned to the cluster to which its nearest higher-density neighbor belongs.

(1) Decision-parameter threshold determination. A limitation of the CFSFDP algorithm is how to determine the thresholds of the decision parameters in the clustering decision graph. In CFSFDP, data points with high values of both local density and Delta distance are regarded as cluster centers, but in practice, these parameter thresholds are often set manually. An example of the Frame dataset and the corresponding clustering decision graph is shown in Fig. 6. Fig. 6 (b) is the clustering decision graph of CFSFDP for the dataset in Fig. 6 (a). It is difficult to decide whether only the points in the red box, or those in both the red and blue boxes, should be regarded as cluster centers. Therefore, how to determine the threshold values of the decision parameters in an effective way is an important issue for our algorithm.

(2) Cluster fragmentation on MDDM or ED data. Most DPC algorithms suffer from cluster fragmentation on datasets with multiple domain-density maximums (MDDM) or equilibrium distribution (ED). In Fig. 7 (b), there are as many as 29 decision points that hold high values of both domain density ∂ and Delta distance δ. In such a case, the dataset is divided into 29 fragmented clusters instead of the 2 natural clusters shown in Fig. 7 (a).

Fig. 7. Clustering decision graphs of MDDM and ED datasets: (a) MDDM dataset; (b) decision graph; (c) ED dataset; (d) decision graph.
Algorithm 4.1 Cluster center self-identification of the DADC algorithm.
Input:
    X: the raw dataset for clustering;
    ∂: the domain densities of the data points of X;
    δ: the Delta distances of the data points of X;
Output:
    ICX: the initial clusters of X.
1:  get the maximum domain density ∂_max ← max(∂);
2:  get the maximum Delta distance δ_max ← max(δ);
3:  calculate the critical point Cp(x, y) = (∂_max / 2, δ_max / 4);
4:  for each x_i in X do
5:      if ∂_i > Cp(x) and δ_i > Cp(y) then
6:          append to the set of cluster centers Λ_c ← x_i;
7:      else if ∂_i < Cp(x) and δ_i > Cp(y) × ∂_i / Cp(x) then
8:          append to the set of outliers Λ_o ← x_i;
9:      else
10:         append to the set of remaining data points Λ_n ← x_i;
11: for each x_i in Λ_n do
12:     append x_i to the nearest cluster in ICX;
13: return ICX.
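As a concrete reading of Algorithm 4.1, the decision rule of lines 3-10 can be sketched with vectorized masks; the function name is ours, and the nearest-cluster assignment of lines 11-12 is omitted since it depends on the cluster representation.

```python
import numpy as np

def self_identify_centers(dom, delta):
    """Sketch of the decision rule of Algorithm 4.1.

    dom   : array of domain densities (the paper's ∂)
    delta : array of Delta distances (the paper's δ)
    Returns boolean masks (centers, outliers, remaining).
    """
    cp_x = dom.max() / 2.0     # critical point, x-coordinate (line 3)
    cp_y = delta.max() / 4.0   # critical point, y-coordinate (line 3)

    centers = (dom > cp_x) & (delta > cp_y)                 # lines 5-6
    outliers = (dom < cp_x) & (delta > cp_y * dom / cp_x)   # lines 7-8
    remaining = ~(centers | outliers)                       # lines 9-10
    return centers, outliers, remaining
```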
Although this step may produce many fragmented clusters, it does not lead to the clustering result that multiple clusters are wrongly classified as one cluster. Focusing on the scenario of fragmented clusters, we introduce a cluster self-ensemble method to merge the preliminary clustering results.

4.3 Fragmented Cluster Self-ensemble

To address the limitation of cluster fragmentation of DADC on MDDM datasets, a fragmented cluster self-ensemble method is proposed in this section. The basic principle of clustering analysis is that individuals in the same cluster have high similarity with each other, while differing from individuals in other clusters. Therefore, to find out which clusters are misclassified into multiple sub-clusters, we propose an inter-cluster similarity measurement and a cluster fusion degree model for fragmented cluster self-ensemble. Clusters with a superior density similarity and cluster fusion degree are merged into the same cluster.

Definition 4.3 Inter-cluster Density Similarity (IDS). The inter-cluster density similarity between two clusters refers to the degree of similarity of their cluster densities, where the average density of a cluster is the average value of the domain densities of all data points in the cluster.

Let S_a,b be the inter-cluster density similarity between clusters c_a and c_b. The larger the value of S_a,b, the more similar are the densities of the two clusters. S_a,b is defined as a function f(x) of the two clusters' average domain densities whose maximum value is f_max(x) = f(1) = 1. Hence, the closer the value of x is to 1, the more similar are the two clusters c_a and c_b.

In addition, the distance between every two clusters is considered. In related studies, different methods were introduced to calculate the distance between two clusters, such as the distance between the center points, the nearest points, or the farthest points of the two clusters [5]. However, these measures are easily affected by noise or outliers, while noisy-data elimination is another challenge. We propose an innovative method to measure the distance between clusters: crossing points between every two clusters are found, and the crossover degree of the clusters is calculated.

For each boundary point x_i in cluster c_a, let N(x_i) be the K nearest neighbors of x_i. We denote N_(i,b) as the set of points in N(x_i) belonging to cluster c_b, and N_(i,a) as the set of points in N(x_i) belonging to c_a. If the number of neighbors belonging to c_b is close to the number of neighbors belonging to the current cluster c_a, then x_i is defined as a crossing point of c_a, represented as x_i ∈ CP_(a→b). The crossover degree c_(i,a→b) of a crossing point x_i in c_a between clusters c_a and c_b is defined as:

c_(i,a→b) = 2 √(|N_(i,a)| × |N_(i,b)|) / (|N_(i,a)| + |N_(i,b)|),   (10)

where N_(i,a) ⊆ (N(x_i) ∩ c_a), N_(i,b) ⊆ (N(x_i) ∩ c_b), and 0 < c_(i,a→b) ≤ 1. Based on the crossover degrees of all crossing points of each cluster, we can define the crossover degree between every two clusters.

Definition 4.4 Cluster Crossover Degree (CCD). The cluster crossover degree C_a,b of two clusters c_a and c_b is calculated as the sum of the crossover degrees of all crossing points between c_a and c_b. The formula of CCD is defined as:

C_a,b = Σ_{x_i∈CP_(a→b)} c_(i,a→b) + Σ_{x_j∈CP_(b→a)} c_(j,b→a).   (11)

To measure whether the data points in a cluster have similar domain densities, we give a definition of cluster density stability. By analyzing the internal density stability of the clusters to be merged and that of the merged cluster, we can determine whether the merger is conducive to the stability of these clusters. The internal density stability of clusters is an important indicator of cluster quality.

Definition 4.5 Cluster Density Stability (CDS). Cluster density stability is the reciprocal of the cluster density variance, which is calculated from the deviation between the domain density of each point and the average domain density of the cluster. The larger the CDS of a cluster, the smaller the domain-density differences among the points in the cluster. The CDS of a cluster c_a is defined as:

d_a = log √( Σ_{i∈c_a} (KDen_i − KDen_{c_a})² ),   (12)
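A small sketch of Eqs. (10)-(12) may help. The crossing-point test below (a point with at least one neighbor in each of the two clusters) is our simplification of the paper's condition that the two neighbor counts be "close", and the function names are ours.

```python
import numpy as np

def crossover_degree(n_same, n_other):
    """Eq. (10): 2*sqrt(|N(i,a)|*|N(i,b)|) / (|N(i,a)| + |N(i,b)|)."""
    return 2.0 * np.sqrt(n_same * n_other) / (n_same + n_other)

def cluster_crossover_degree(labels, knn, a, b):
    """Eq. (11): CCD of clusters a and b, summing Eq. (10) over the
    crossing points found in both directions.

    labels : cluster label of each point
    knn    : (n, k) array of each point's k-nearest-neighbor indices
    """
    ccd = 0.0
    for i in np.flatnonzero((labels == a) | (labels == b)):
        nbr_labels = labels[knn[i]]
        n_own = np.count_nonzero(nbr_labels == labels[i])
        other = b if labels[i] == a else a
        n_other = np.count_nonzero(nbr_labels == other)
        if n_own > 0 and n_other > 0:   # treat as a crossing point
            ccd += crossover_degree(n_own, n_other)
    return ccd

def cluster_density_quantity(kden, members):
    """The quantity d_a of Eq. (12), from the members' KNN-densities.
    (Degenerate if all densities are identical: log of zero.)"""
    dev = kden[members] - kden[members].mean()
    return np.log(np.sqrt(np.sum(dev ** 2)))
```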
analysis and performance evaluation. The experiments are performed on a workstation equipped with an Intel Core i5-6400 quad-core CPU, 8 GB DRAM, and a 2 TB hard disk. Two groups of datasets, i.e., synthetic and large-scale real-world datasets, are used in the experiments.

TABLE. Synthetic datasets used in experiments.
Dataset                  Instances   Dimensions   Clusters
Heartshapes              788         2            3
Yeast                    1484        2            10
Gaussian clusters (G2)   2048        2            2

Fig. 12. Decision graphs for the VDD dataset: (c) decision graph of CFSFDP; (d) decision graph of DADC.
Since the ED dataset does not have local-density peaks, CFSFDP obtains numerous fragmented clusters. Making use of the cluster self-identification and cluster self-ensemble process, the proposed DADC algorithm can merge the fragmented clusters effectively; therefore, DADC effectively solves this problem and obtains accurate clustering results. As shown in Fig. 14 (a), the dataset is clustered into two clusters by DADC. In contrast, as shown in Fig. 14 (b) - (d), more than 231 fragmented clusters are produced by CFSFDP. The clustering results of OPTICS and DBSCAN are very sensitive to the parameter thresholds of eps (connectivity radius) and minpts (minimum number of shared neighbors). For example, when we set the eps and minpts thresholds of both OPTICS and DBSCAN to 10 and 5, OPTICS and DBSCAN generate 14 and 11 fragmented clusters, respectively. The experimental results show that the proposed DADC algorithm is more accurate than the other algorithms for ED dataset clustering.

Fig. 14. Clustering results on equilibrium distributed datasets: (a) DADC on Hexagon; (b) CFSFDP on Hexagon; (c) OPTICS on Hexagon; (d) DBSCAN on Hexagon.

After obtaining 17 density peaks using CFSFDP, 17 corresponding clusters are generated, as shown in Fig. 15 (b). However, these clusters have a similar overall density distribution, and it is reasonable to merge them into a single cluster. DADC eventually merges the 17 fragmented clusters into one cluster, as shown in Fig. 15 (a). Again, the clustering results of OPTICS and DBSCAN are very sensitive to the parameter thresholds of eps and minpts. As shown in Fig. 15 (c) and (d), when we set the eps and minpts thresholds of OPTICS and DBSCAN to 13 and 10, 22 and 11 fragmented clusters are obtained by OPTICS and DBSCAN, respectively. Compared with CFSFDP, OPTICS, and DBSCAN, the experimental results show that DADC achieves more reasonable clustering results on MDDM datasets.

Fig. 15. Clustering results on the MDDM dataset: (a) DADC on G2; (b) CFSFDP on G2; (c) OPTICS on G2; (d) DBSCAN on G2.
The clustering accuracy (CA) is defined as:

CA = ( Σ_{i=0}^{K−1} max(C_i | L_i) ) / |X|.   (15)

Fig. 16. Comparison of algorithm robustness.
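Under a common purity-style reading of Eq. (15), where max(C_i | L_i) counts the members of predicted cluster i that carry that cluster's most frequent ground-truth label, CA can be sketched as follows; this interpretation and the function name are our assumptions, not the paper's stated definition.

```python
import numpy as np

def clustering_accuracy(pred, truth):
    """Sketch of Eq. (15): CA = sum_i max(C_i | L_i) / |X|.

    For each predicted cluster, count the points whose ground-truth
    label is the cluster's most frequent label, then divide by |X|.
    """
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    total = 0
    for c in np.unique(pred):
        labels_in_c = truth[pred == c]
        _, counts = np.unique(labels_in_c, return_counts=True)
        total += counts.max()   # majority-label count for this cluster
    return total / len(truth)
```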