KNN-BLOCK DBSCAN: Fast Clustering for Large-Scale Data

Yewang Chen, Lida Zhou, Songwen Pei, Zhiwen Yu, Xin Liu, Jixiang Du, and Naixue Xiong
Abstract—Large-scale data clustering is an essential key for the big data problem. However, no existing approach is "optimal" for big data due to high complexity, which remains a great challenge. In this article, a simple but fast approximate DBSCAN, namely, KNN-BLOCK DBSCAN, is proposed based on two findings: 1) the problem of identifying whether a point is a core point or not is, in fact, a kNN problem and 2) a point has a similar density distribution to its neighbors, and neighbor points are highly likely to be of the same type (core point, border point, or noise). KNN-BLOCK DBSCAN uses a fast approximate kNN algorithm, namely, FLANN, to detect core blocks (CBs), noncore blocks, and noise blocks, within each of which all points have the same type; a fast algorithm for merging CBs and assigning noncore points to proper clusters is also invented to speed up the clustering process. The experimental results show that KNN-BLOCK DBSCAN is an effective approximate DBSCAN algorithm with high accuracy, and it outperforms other current variants of DBSCAN, including ρ-approximate DBSCAN and AnyDBC.

Index Terms—DBSCAN, FLANN, kNN, KNN-BLOCK DBSCAN.
I. INTRODUCTION

CLUSTERING analysis is the task of grouping objects according to measured or perceived intrinsic characteristics or similarity, aiming to retrieve some natural groups from a set of patterns or points. It is a fundamental technique in many applications, such as data mining, pattern recognition, etc., and many researchers believe that clustering is an essential key for analyzing big data [1].

Currently, thousands of clustering algorithms have been proposed, for example, k-means [2], mean shift [3], DBSCAN [4], spectral clustering [5], [6], mixtures of Dirichlet models [7], [8], clustering based on supervised learning [9], and clustering by local cores [10], [11]. According to Jain et al. [12], different categories of these clustering approaches are recognized: centroid-based clustering, partitioning clustering, density-based clustering, etc.

The goal of density-based clustering is to identify dense regions with arbitrary shape, which can be measured by the density at a given point. An identified cluster is usually a region with high density, while outliers lie in regions with low density. Hence, density-based clustering is one of the most popular paradigms. There are many algorithms of this kind, such as DBSCAN [4], OPTICS [13], DPeak [14]-[16], mean shift [3], DCore [11], etc., where DBSCAN [4] is the most famous one and has been widely used.

Unfortunately, most existing clustering approaches do not work well for large-scale data, due to their high complexities. For example, the complexity of k-means is O(ktn), where t is the number of iterations, and DBSCAN runs in O(n^2). In this article, a fast approximate algorithm named KNN-BLOCK DBSCAN (https://github.com/XFastDataLab/KNN-BLOCK-DBSCAN) is proposed to speed up DBSCAN, which is able to deal with large-scale data. We also concentrate on comparing our algorithm with DBSCAN, ρ-approximate DBSCAN [17], and AnyDBC [28].
The main contributions of this article are listed as follows.
1) We find that the key problem in DBSCAN of identifying the type of each point is, in essence, a kNN problem. Therefore, many techniques from this field, such as FLANN [18], the k-d tree [19], the cover tree [20], etc., can be utilized.
2) As a general rule, a point has a similar density distribution to its neighbors, and neighbor points are likely to be of the same type (core, border, or noise). Accordingly, a technique is proposed to identify blocks within which all points have the same type, namely, core blocks (CBs), noncore blocks, and noise blocks.
3) A fast algorithm is also invented for merging CBs and assigning noncore points to corresponding clusters.

Before introducing the proposed algorithm, we present the main variables and symbols used in this article. Let P be a set of n points in the D-dimensional space R^D; p_i ∈ P be the ith point of P; d_{p,q} (or dist(p, q)) be the distance between points p and q, where the distance can be the Euclidean or Chebyshev distance; ε be the scanning radius of DBSCAN; d_{p,(i)} be the distance from p to its ith nearest neighbor; and p_(i) be the ith nearest neighbor of p. More symbols are shown in Table I.

TABLE I: Description of main variables and symbols used in this article
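In other words, finding 1 can be stated precisely with this notation: a point p is a core point if and only if

    |N_ε(p)| ≥ MinPts  ⟺  d_{p,(MinPts)} ≤ ε,

so a single kNN query with k = MinPts answers the core-point test that DBSCAN otherwise answers with a range query, which is what allows a fast kNN library such as FLANN to be substituted for repeated range searches.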
The remainder of this article is organized as follows. Section II introduces the related work on DBSCAN and nearest neighbor queries. Section III revisits FLANN, DBSCAN, and ρ-approximate DBSCAN. Section IV addresses the proposed method, KNN-BLOCK DBSCAN, in detail, including the basic ideas, processes, and algorithms. Section V presents experiments and comparisons with ρ-approximate DBSCAN on several data sets. Section VI gives the final conclusion and the future work that could improve the proposed method.

II. RELATED WORK

A. Variants of DBSCAN

DBSCAN is designed to discover clusters of arbitrary shape. It needs two parameters: one is the scanning radius ε, and the other is MinPts, which is used as a density threshold for deciding whether a point is a core point or not.

If a tree-based spatial index is used, the average complexity is claimed to be reduced to O(n log(n)) [4]. However, this turns out to be a misclaim: as pointed out by Gunawan and de Berg [21], DBSCAN actually runs in O(n^2) time, regardless of ε and MinPts. Unfortunately, this misclaim is widely accepted as a building brick in many research papers and textbooks, e.g., [22]-[24]. Furthermore, DBSCAN is almost useless in high dimensions, due to the so-called "curse of dimensionality."

Mahran and Mahar [25] introduced an algorithm named GriDBSCAN to enhance the performance of DBSCAN by using grid partitioning and merging, yielding high performance with the advantage of a high degree of parallelism. But this technique is inappropriate for high-dimensional data because the effect of redundancy in this algorithm increases exponentially with the dimension. Similarly, Gunawan and de Berg [21] proposed an algorithm named Fast-DBSCAN to improve DBSCAN for two-dimensional (2-D) data, which also imposes an arbitrary grid T on the 2-D space, where each cell of T has side length ε/√2. If a nonempty cell c contains at least MinPts points, then this cell is called a core cell, and all points in it are core points, because the maximum distance within such a cell is ε. Therefore, it is unnecessary to compute densities for the points in a core cell. Gan and Tao [17] proposed an algorithm named ρ-approximate DBSCAN, also based on the grid technique, for large data sets, and achieved an excellent O(n) complexity in low dimensions. But it degenerates to an O(n^2) algorithm in high, or even relatively high-dimensional, data spaces. Besides, parallel GridDBSCAN [26] and GMDBSCAN [27] are also grid-based DBSCAN variants.
the proposed method. Moreover, because exact clustering is too costly, this has
generated interest in many approximate methods, includ-
ing our algorithm, to speed up original DBSCAN in the
II. R ELATED W ORK
past two decades. Here, the approximation means that the
A. Variants of DBSCAN clustering result may be different from that of the original
DBSCAN is designed to discover clusters of arbitrary shape. DBSCAN. For example, in original DBSCAN, a data point
It needs two parameters, one is scanning radius , and the other p may be classified into one cluster, while in approximate
Fig. 2. Example of a CB. MinPts = 8; ε is drawn as eps, and there are eight red points within N_{ε/2}(p), so all red points are core points.

DBSCAN runs in O(n^2), and most of its variants still do not work well for large-scale data. In order to find the underlying causes, we analyzed the fundamental techniques used in traditional clustering approaches and found some significant deficiencies, as follows.
1) A brute-force algorithm is used in original DBSCAN to compute the density of an arbitrary data point, so the complexity is O(n) per point. However, there are many redundancies: suppose d_{i,k} and d_{j,k} are already known, while d_{i,j} is still computed from scratch, even though it can often be bounded without computing it.
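Concretely (this is the pruning rule studied in [40]), the triangle inequality gives

    |d_{i,k} − d_{j,k}| ≤ d_{i,j} ≤ d_{i,k} + d_{j,k},

so if |d_{i,k} − d_{j,k}| > ε, then p_i and p_j cannot be ε-neighbors, and if d_{i,k} + d_{j,k} ≤ ε, they must be; in both cases, computing d_{i,j} is unnecessary.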
Fig. 5. Framework of KNN-BLOCK DBSCAN. It uses FLANN to identify CBs, NCBs, and NOBs, then merges CBs, assigns points in NCBs to proper clusters, and discards noise.

Algorithm 3 KNN-BLOCK DBSCAN(P, ε, MinPts)
1: Input: P is the input data; [ε, MinPts];
2: Output: cluster id of each point;
3: Initialize the core-block set CBs = {φ}
4: Initialize the noncore-block set NCBs = {φ}
5: K := MinPts, cur_cid := 0 // current cluster id
6: for each unvisited point p ∈ P do
7:   {p_(1), . . . , p_(K)} := FLANN::kNN(p, P)
8:   ξ := d_{p,(K)}, N_ξ(p) := {p_(1), p_(2), . . . , p_(K)}
9:   if ξ ≤ ε then
10:    cur_cid := cur_cid + 1
11:    if ξ ≤ ε/2 then
12:      push N_ξ(p) into CBs // a core block found
13:      ∀s ∈ N_ξ(p), mark s as core point and visited
14:    else
15:      push N_0(p) into CBs // single core point
16:      mark p as core point and visited
17:    end if
18:    curCorePts := core points already found in N_ξ(p)
19:    exist_cids := clusters found in curCorePts
20:    merge exist_cids into cur_cid
21:    assign N_ξ(p) to cluster cur_cid
22:  else if ε < ξ ≤ 2ε then
23:    push N_{ξ−ε}(p) into NCBs
24:    mark all points within N_{ξ−ε}(p) as visited
25:  else if ξ > 2ε then
26:    mark ∀q ∈ N_{ξ−2ε}(p) as noise and visited
27:  end if
28: end for
29: CBCENT := extract all center points from CBs
30: Create an index tree by FLANN from CBCENT
31: MergeCoreBlocks(CBs, CBCENT, cbIDs, ε)
32: AssignNonCoreBlocks(NCBs, CBs, CBCENT, ε)
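The block-identification loop of Algorithm 3 maps naturally onto FLANN's C++ interface. Below is a minimal sketch of it (our illustration, assuming FLANN's flann::Index API; the authors' released implementation is at the GitHub link given in Section I, and the block bookkeeping is elided):

    #include <flann/flann.hpp>
    #include <cmath>
    #include <vector>

    void identifyBlocks(std::vector<float>& data, size_t n, size_t D,
                        float eps, size_t minPts) {
        flann::Matrix<float> points(data.data(), n, D);
        // Priority search k-means tree with branching factor chi = 10, as in
        // Section V-A; SearchParams.checks bounds the candidates examined.
        flann::Index<flann::L2<float>> index(points, flann::KMeansIndexParams(10));
        index.buildIndex();

        std::vector<int> idxBuf(minPts);
        std::vector<float> distBuf(minPts);
        std::vector<bool> visited(n, false);

        for (size_t i = 0; i < n; ++i) {
            if (visited[i]) continue;
            flann::Matrix<float> query(&data[i * D], 1, D);
            flann::Matrix<int> indices(idxBuf.data(), 1, minPts);
            flann::Matrix<float> dists(distBuf.data(), 1, minPts);
            index.knnSearch(query, indices, dists, minPts,
                            flann::SearchParams(128));

            // FLANN's L2 reports squared distances; xi = d_{p,(MinPts)}.
            float xi = std::sqrt(distBuf[minPts - 1]);
            if (xi <= eps / 2) {
                // Core block N_xi(p): any two members are within 2*xi <= eps
                // of each other, so all K members are core points (lines 11-13).
            } else if (xi <= eps) {
                // p alone is a core point: a single-point core block (line 15).
            } else if (xi <= 2 * eps) {
                // Noncore block N_{xi-eps}(p) (lines 22-24).
            } else {
                // Noise block N_{xi-2eps}(p) (line 26).
            }
            // ... push the block, mark its members visited, merge cluster ids ...
        }
    }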
Algorithm 4 MergeCoreBlocks(CBs, ε)
1: Input: CBs: core blocks; CBCENT: the set of core-block centers; ε is the parameter of DBSCAN;
2: for each core block CB(p, ξ1) do
3:   Neibs := FLANN::RangeSearch(p, 2ε, CBCENT)
4:   for each q ∈ Neibs do
5:     let CB(q, ξ2) be the core block of q
6:     if p and q are in different clusters then
7:       if d_{p,q} ≤ ξ1 + ξ2 + ε then
8:         BruteForceMerge(CB(p, ξ1), CB(q, ξ2))
9:       end if
10:    end if
11:  end for
12: end for
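The thresholds in Algorithm 4 follow from the triangle inequality. For any s ∈ CB(p, ξ1) and t ∈ CB(q, ξ2),

    d_{p,q} − ξ1 − ξ2 ≤ d_{s,t} ≤ d_{p,q} + ξ1 + ξ2,

so if d_{p,q} > ξ1 + ξ2 + ε, no pair (s, t) can satisfy d_{s,t} ≤ ε and the two blocks cannot be directly density-connected; a brute-force check is needed only when d_{p,q} ≤ ξ1 + ξ2 + ε. Moreover, since a core block has radius at most ε/2, any mergeable pair of centers satisfies d_{p,q} ≤ ε/2 + ε/2 + ε = 2ε, which is exactly the range-search radius in line 3.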
Algorithm 5 AssignNonCoreBlocks(NCBs, CBs, ε)
1: Input: NCBs: noncore blocks; CBs: core blocks; ε is the parameter of DBSCAN;
2: for each noncore block NCB(p, ξ1) do
3:   r := ξ1 + 1.5ε
4:   Neibs := FLANN::RangeSearch(p, r, CBCENT)
5:   if ∃q ∈ Neibs s.t. d_{p,q} ≤ (ε − ξ1) then
6:     merge NCB(p, ξ1) into the cluster of q
7:     process the next noncore block
8:   else
9:     for each unclassified o ∈ NCB(p, ξ1) do
10:      if ∃q ∈ Neibs s.t. d_{p,q} ≤ (ε + ξ1 + ξ2) then
11:        if ∃s ∈ CB(q, ξ2) s.t. d_{o,s} ≤ ε then
12:          assign o to the cluster of q
13:          process the next unclassified point o
14:        end if
15:      end if
16:    end for
17:  end if
18: end for
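The two tests in Algorithm 5 can be justified in the same way. If d_{p,q} ≤ ε − ξ1 for some core-block center q, then every o ∈ NCB(p, ξ1) satisfies

    d_{o,q} ≤ d_{o,p} + d_{p,q} ≤ ξ1 + (ε − ξ1) = ε,

so the whole block is directly density-reachable from the core point q and can be assigned at once (lines 5-7). Conversely, if some o in the block lies within ε of a core point s ∈ CB(q, ξ2), then

    d_{p,q} ≤ d_{p,o} + d_{o,s} + d_{s,q} ≤ ξ1 + ε + ξ2 ≤ ξ1 + 1.5ε,

since ξ2 ≤ ε/2 for any core block; this explains both the search radius r in line 3 and the filter in line 10.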
Compared with grid-based methods, the difference of KNN-BLOCK DBSCAN mainly lies in: 1) kNN is used, instead of a range query algorithm, to identify core points and noncore points by block (CBs, NCBs, and NOBs); 2) each block has a dynamic range, while the width of the grid used in ρ-approximate DBSCAN and Fast-DBSCAN is a constant; and 3) CBs can be processed in a simple way which is far more efficient than the grid.
b) For each CB, the total number of points in a CB is usually far less than n, i.e., MinPts << n; then the complexity of Algorithm 6 is, on average, about O(MinPts).
Hence, since MinPts << n can be regarded as a constant, the complexity of Algorithm 4 is about O(b1[L D log(b1)/log(χ)]).
4) The complexity of Algorithm 5: there are also two main parts, as follows.
a) There are b2 NCBs. For each NCB, we call FLANN::RangeSearch to find its (ξ1 + 1.5ε)-neighbors from CBCENT; the complexity is about O(b2[L D log(b1)/log(χ)]).
b) The average complexity of assigning an unclassified point in NCBs to a cluster (from line 5 to line 17) is about O(MinPts[L D log(b1)/log(χ)]).
Hence, the complexity of Algorithm 5 is less than O(b2 MinPts[L D log(b1)/log(χ)]) < O(UCPtsNum[L D log(b1)/log(χ)]), where UCPtsNum is the total number of unclassified points in all NCBs.
As mentioned above, b0 = b1 + b2 + b3 = (βn/MinPts) is far less than n provided [ε, MinPts] are well chosen; then the overall time complexity is about O(b0[L D log(n)/log(χ)]) = O([βn/MinPts][L D log(n)/log(χ)]) < O(L D n log(n)/log(χ)).
When dealing with very high-dimensional data sets, FLANN::kNN degenerates to an O(n) algorithm, and then the complexity of KNN-BLOCK DBSCAN is about O(b0[L D n/log(χ)]).
In the worst case, if there is no CB and FLANN::kNN runs in O(n), the complexity of KNN-BLOCK DBSCAN is O(n^2).
V. EXPERIMENTS

A. Algorithms and Setup

In this section, to evaluate the correctness and effectiveness of the proposed approach, several experiments are conducted on different data sets on an Intel Core i7-3630 CPU @2.50 GHz with 8 GB RAM. We mainly compare the proposed algorithm with ρ-approximate DBSCAN, AnyDBC [28], and pure kNN-based DBSCAN.
1) "KNN-BLOCK" is KNN-BLOCK DBSCAN, which is coded in C++ and runs on the Windows 10 64-bit operating system; the tree used in FLANN is the priority search k-means tree, and the cluster number χ of k-means is 10.
2) "Approx" is ρ-approximate DBSCAN, which is also written in C++ and runs on the Linux (Ubuntu 14.04 LTS) operating system.
3) "AnyDBC" is the efficient anytime density-based clustering algorithm [28].
4) "kNN-based DBSCAN" is an algorithm which only uses the FLANN::kNN technique to accelerate DBSCAN, as shown in Algorithm 7 below; its complexity is about O(L D n log(n)/log(χ)), where L is the number of data points examined by FLANN, D is the dimension, and χ is the branching factor of the tree used in FLANN.
Algorithm 7 Pure kNN-Based DBSCAN
1: Input: data set P, and ε, MinPts;
2: coreSet := {φ}
3: for each unclassified p ∈ P do
4:   neibors := FLANN::kNN(p, MinPts)
5:   if d_{p,(MinPts)} ≤ ε then
6:     push p into coreSet
7:   end if
8: end for
9: for each core point p ∈ coreSet do
10:  neibCores := find core points among the k-neighbors of p
11:  merge neibCores and p into one cluster
12: end for
13: for each pair of clusters c1 and c2 do
14:  merge c1 and c2 if ∃p1 ∈ c1 and p2 ∈ c2 s.t. p1 is density-reachable from p2
15: end for
16: find border points and assign them
B. Data Sets

Data sets come from UCI (https://archive.ics.uci.edu/ml/index.php), including PAM (PAMAP2), HOUSE (Household), USCENCUS (USCensus 1990), GasSensor, FMA (a dataset for music analysis), AAS-1K (Amazon access samples), HIGGS, etc., where AAS-1K is a 1000-dimensional data set extracted from the 20 000-dimensional data set AAS. For each data set, all duplicate points are removed to make each point unique, all missing values are set to 0, and each dimension is normalized to [0, 10^5]. The following gives brief descriptions of these data sets.

PAM39D is a real 39-dimensional data set, PAMAP2, with cardinality n = 3,850,505. PAM4D is a real data set obtained by taking the first four principal components (PCA) of PAMAP2. Household: dim = 7, n = 2049280. USCENCUS: dim = 36, n = 365100. GasSensor (Ethylene-CO): dim = 16, n = 4208261. MoCap: dim = 36, n = 65536. APS (APS Failure at Scania Trucks): dim = 170, n = 30000. Font (CALIBRI): dim = 36, n = 19068. HIGGS: dim = 28, n = 11000000. FMA: dim = 512, n = 106574. AAS-1K: AAS is a large sparse data set, and AAS-1K is a subset extracted from AAS with dim = 1000, n = 30000.
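The preprocessing above is simple to reproduce; the following is a small sketch of the per-dimension rescaling (our illustration; normalizeColumns is a hypothetical helper, and missing values are assumed to have been set to 0 beforehand):

    #include <algorithm>
    #include <vector>

    // data is row-major, n rows by D columns. Rescales every dimension
    // independently to [0, 1e5] by min-max scaling, as described above.
    void normalizeColumns(std::vector<double>& data, size_t n, size_t D) {
        for (size_t j = 0; j < D; ++j) {
            double lo = data[j], hi = data[j];
            for (size_t i = 1; i < n; ++i) {
                lo = std::min(lo, data[i * D + j]);
                hi = std::max(hi, data[i * D + j]);
            }
            const double span = (hi > lo) ? (hi - lo) : 1.0; // constant-column guard
            for (size_t i = 0; i < n; ++i)
                data[i * D + j] = (data[i * D + j] - lo) / span * 1e5;
        }
    }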
C. Two Examples of Clustering

We benchmark KNN-BLOCK DBSCAN on two 2-D test cases to reveal its process in detail, as shown in Fig. 8. The left case is Aggregation [44], and the right case comes from [1].

Fig. 8. Two examples presenting the process of KNN-BLOCK DBSCAN. (a) Original data distribution. (b) Three kinds of blocks found by KNN-BLOCK DBSCAN, where black circles are core blocks, green circles are NCBs, and red circles are NOBs. (c) Clusters found after merging CBs. (d) Assignment of NCBs to corresponding clusters; the red balls are NCBs that can be assigned to their nearest clusters, and the green circles are those that find no cluster to assign. (e) Final result of KNN-BLOCK DBSCAN, where black points are noise. (f) Result of original DBSCAN.

Specifically, Fig. 8(a) presents the original data distribution. Fig. 8(b) draws the CBs, NCBs, and NOBs, plotted by black, green, and red circles, respectively. The radius of each circle differs, which means each block has a different size. We can also see that NCBs usually distribute along the border of CBs, and NOBs appear far from CBs. Fig. 8(c) illustrates the result of merging CBs, which is the most important step to identify clusters. In Fig. 8(d), as mentioned in Section IV-C3, there are three cases to process NCBs: the green circles represent case (1); because they are far from all core points, all points within these NCBs are classified as noise. The red balls illustrate case (2); each of them is assigned to one cluster from which it is density-reachable. In case (3), for each point p within an unclassified NCB, if q is identified as a core point from which p is density-reachable, then p is classified to the cluster of q. Fig. 8(e) exhibits the final result of KNN-BLOCK DBSCAN, where black points are noise; and Fig. 8(f) shows the result obtained by original DBSCAN.

It is observed that KNN-BLOCK DBSCAN obtains nearly the same result as DBSCAN with high efficiency, because it processes data by blocks and thus avoids a large number of redundant distance computations.

TABLE II: Runtime comparisons on subsets of HOUSE and PAM with n increasing. The speedup of KNN-BLOCK DBSCAN over its competitor is given in brackets. (Unit: second)
D. Runtime Comparisons With ρ-Approximate DBSCAN

The first experiment is conducted on a set of subsets of HOUSE and PAM4D to observe the complexities of the proposed algorithm and ρ-approximate DBSCAN with different [ε, MinPts]. Figs. 9 and 10 present the results of the two algorithms, and Table II reveals more details. We also conduct experiments on the whole data sets of HOUSE, PAM4D, KDD04, USCENCUS, REACTION, MOCAP, BODONI, HIGGS, FMA, and AAS-1K, respectively, and Table III shows the comparison of runtime with different [ε, MinPts].

From the two figures and two tables, we can observe the following.
1) Both algorithms prefer large ε and small MinPts. For example, on the Household data set, both KNN-BLOCK DBSCAN and ρ-approximate DBSCAN run best when [ε, MinPts] = [5000, 100], and the worst case happens when [ε, MinPts] = [1000, 200]. On other data sets, things are similar, as shown in Table III.
2) Both algorithms run in linear expected time on low-dimensional data sets.
3) On the large-scale data sets PAM4D, HOUSEHOLD, and HIGGS, our algorithm runs significantly faster than ρ-approximate DBSCAN.
TABLE III: Runtime comparisons on different data sets with different ε and MinPts. The speedup of KNN-BLOCK DBSCAN over its competitor is given in brackets. (Unit: second)
TABLE IV: Runtime comparisons with pure kNN-based DBSCAN
TABLE V: Execution times of kNN, MergeCB, and AssignNCB, as well as blocks found on PAM with different [ε, MinPts]

TABLE VII: Comparisons of Omega-index for KNN-BLOCK DBSCAN and ρ-approximate DBSCAN (n = 5000)

TABLE IX: Example of computing precision for KNN-BLOCK DBSCAN based on three matched label pairs, ("A1," "B2"), ("A2," "B1"), and ("A3," "B4"), found by Kuhn-Munkres
TABLE X: Accuracy, recall, and F1-score of KNN-BLOCK DBSCAN and ρ-approximate DBSCAN on subsets of HOUSE, PAM4D, MOCAP, and APS
are clustering labels obtained by DBSCAN and KNN-BLOCK DBSCAN.

Step 2 (Matching): It is well known that different clustering algorithms may yield different labels on the same data set. For example, cluster "A1" labeled by DBSCAN may be the same as "B2" obtained by KNN-BLOCK DBSCAN. Hence, it is reasonable to match labels first, and use the matched labels to compute the accuracy. In this article, Kuhn-Munkres [48] performs the task of maximum matching between two different cluster label sets, as has been done in [11] and [49].

Step 3 (Computing Accuracy): Suppose there are three clusters with labels "A1," "A2," and "A3" obtained by DBSCAN on one data set, but KNN-BLOCK DBSCAN labels them with "B1," "B2," "B3," and "B4," and Kuhn-Munkres finds three matched pairs: ("A1," "B2"), ("A2," "B1"), and ("A3," "B4"). If the labels of a point p obtained by DBSCAN and KNN-BLOCK DBSCAN match, then the prediction of p is correct, e.g., ("A1" and "B2"); otherwise, it is wrong, e.g., ("A1" and "B1"). Table IX shows more details: suppose there are eight points in the data set, the second row lists the labels obtained by DBSCAN, and the third row is the clustering result of KNN-BLOCK DBSCAN. We can see that two cases are wrongly predicted because (A1, B4) and (A2, B3) are not matched pairs. Therefore, the total precision is (8 − 2)/8 = 75%.
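This computation is straightforward once the matching is fixed; a compact C++ sketch follows (our illustration; computePrecision is a hypothetical helper, and matched is assumed to encode the Kuhn-Munkres result of Step 2 for contiguous integer labels):

    #include <cstddef>
    #include <vector>

    // a[i]: cluster label of point i from DBSCAN; b[i]: label of point i from
    // KNN-BLOCK DBSCAN; matched[x] = y means DBSCAN label x was matched to
    // KNN-BLOCK label y (e.g., "A1" -> "B2" in Table IX). a is non-empty.
    double computePrecision(const std::vector<int>& a, const std::vector<int>& b,
                            const std::vector<int>& matched) {
        size_t correct = 0;
        for (size_t i = 0; i < a.size(); ++i)
            if (matched[a[i]] == b[i]) ++correct;
        return (double)correct / a.size();  // Table IX example: (8 - 2)/8 = 75%
    }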
Because the original DBSCAN has high complexity, we only test on small data sets. Here, we extract four subsets from HOUSE, PAM4D, APS, and MOCAP, and use them as test cases. Also, because DBSCAN is nondeterministic (sensitive to the iteration order), some border points may be assigned to different clusters according to the order in which they appear. Therefore, the accuracy is computed only by comparing core points.

Table X shows that both algorithms achieve high accuracy. On the low-dimensional data sets (HOUSE and PAM4D), the precision, recall, and F1-score of both approximate algorithms are about 98%-100%, and there is only a little drop on the high-dimensional data sets (MOCAP and APS), where they are about 94.5%-97.7%.
VI. CONCLUSION

DBSCAN runs in O(n^2) expected time and is not suitable for large-scale data. ρ-approximate DBSCAN is designed to replace DBSCAN for big data; however, it can only work in very low dimensions. In this article, we analyze the underlying causes of why current approaches fail in clustering large-scale data, and find that the grid technique is nearly useless for high-dimensional data.

Aiming to tame the problems mentioned above, an approximate approach named KNN-BLOCK DBSCAN is proposed for large-scale data based on two findings: 1) the key problem of DBSCAN, finding core points, is a kNN problem in essence and 2) a point has a similar density distribution to its neighbors, which implies that a point is highly likely to be of the same type (core/border/noise) as its neighbors. Therefore, we argue that a kNN technique, e.g., FLANN, can be utilized to identify CBs, NCBs, and NOBs, which contain only core points, border points, and noise, respectively. We then proposed an algorithm to merge CBs that are density-reachable from each other and assign each point in NCBs to a proper cluster.

The superiority of KNN-BLOCK DBSCAN over ρ-approximate DBSCAN is that it processes data by blocks, each of which has a dynamic range, instead of the fixed-width grids used in ρ-approximate DBSCAN, and the fast kNN technique is used to identify the types of points. Given a fixed intrinsic dimensionality, the complexity of the proposed algorithm is about O([βn/MinPts][L D log(n)/log(χ)]), where L is a constant, D is the dimension, β is a factor of the data distribution, and χ is the branching factor of the tree used in FLANN.

Experiments show that KNN-BLOCK DBSCAN runs faster than ρ-approximate DBSCAN and pure kNN-based DBSCAN with high accuracy, even on some relatively high-dimensional data sets, e.g., APS (170 dim), BODONI (256 dim), FMA (512 dim), and AAS-1K (1000 dim), where ρ-approximate DBSCAN degenerates to an O(n^2) algorithm while KNN-BLOCK DBSCAN can still run very fast.

Our future work is to improve the proposed algorithm and apply it in real applications in the following aspects.
1) Try other precise kNN techniques, such as the cover tree, semi-convex hull tree [36], etc., to improve the accuracy of KNN-BLOCK DBSCAN.
2) Parallelize KNN-BLOCK DBSCAN on GPUs with a highly efficient strategy for scheduling data to make the proposed algorithm faster.
3) Apply it in our other research, such as image retrieval [50], vehicle reidentification [51], [52], vehicle crushing analysis [53], and auditing for shared cloud data [54]-[56].

REFERENCES

[1] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognit. Lett., vol. 31, no. 8, pp. 651-666, 2010.
[2] A. Likas, N. Vlassis, and J. J. Verbeek, "The global k-means clustering algorithm," Pattern Recognit., vol. 36, no. 2, pp. 451-461, 2003.
[3] Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 8, pp. 790-799, Aug. 1995.
[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. KDD, vol. 96, 1996, pp. 226-231.
[5] U. von Luxburg, "A tutorial on spectral clustering," Stat. Comput., vol. 17, no. 4, pp. 395-416, 2007.
[6] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognit., vol. 41, no. 1, pp. 191-203, 2008.
[7] W. Fan, H. Sallay, and N. Bouguila, "Online learning of hierarchical Pitman-Yor process mixture of generalized Dirichlet distributions with feature selection," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 9, pp. 2048-2061, Sep. 2017.
[8] W. Fan, N. Bouguila, J. Du, and X. Liu, "Axially symmetric data clustering through Dirichlet process mixture models of Watson distributions," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 6, pp. 1683-1694, Jun. 2019.
[9] L. Duan, S. Cui, Y. Qiao, and B. Yuan, "Clustering based on supervised learning of exemplar discriminative information," IEEE Trans. Syst., Man, Cybern., Syst., to be published.
[10] D. Cheng, Q. Zhu, J. Huang, Q. Wu, and L. Yang, "A novel cluster validity index based on local cores," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 4, pp. 985-999, Apr. 2019.
[11] Y. Chen et al., "Decentralized clustering by finding loose and distributed density cores," Inf. Sci., vols. 433-434, pp. 649-660, Apr. 2018.
[12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comput. Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[13] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS: Ordering points to identify the clustering structure," in Proc. ACM SIGMOD Rec., vol. 28, 1999, pp. 49-60.
[14] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
[15] Y. Chen et al., "Fast density peak clustering for large scale data based on kNN," Knowl. Based Syst., vol. 187, Jan. 2020, Art. no. 104824.
[16] D. Cheng, Q. Zhu, J. Huang, Q. Wu, and Y. Lijun, "Clustering with local density peaks-based minimum spanning tree," IEEE Trans. Knowl. Data Eng., to be published.
[17] J. Gan and Y. Tao, "DBSCAN revisited: Mis-claim, un-fixability, and approximation," in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2015, pp. 519-530.
[18] M. Muja and D. G. Lowe, "Scalable nearest neighbor algorithms for high dimensional data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 11, pp. 2227-2240, Nov. 2014.
[19] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509-517, 1975.
[20] A. Beygelzimer, S. Kakade, and J. Langford, "Cover trees for nearest neighbor," in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 97-104.
[21] A. Gunawan and M. de Berg, "A faster algorithm for DBSCAN," Ph.D. dissertation, Dept. Math. Comput. Sci., Univ. Eindhoven, Eindhoven, The Netherlands, 2013.
[22] V. Chaoji, M. Al Hasan, S. Salem, and M. J. Zaki, "SPARCL: Efficient and effective shape-based clustering," in Proc. 8th IEEE Int. Conf. Data Min., 2008, pp. 93-102.
[23] E. H.-C. Lu, V. S. Tseng, and P. S. Yu, "Mining cluster-based temporal mobile sequential patterns in location-based service environments," IEEE Trans. Knowl. Data Eng., vol. 23, no. 6, pp. 914-927, Jun. 2011.
[24] S. K. Pal and P. Mitra, Pattern Recognition Algorithms for Data Mining. Boston, MA, USA: CRC Press, 2004.
[25] S. Mahran and K. Mahar, "Using grid for accelerating density-based clustering," in Proc. 8th IEEE Int. Conf. Comput. Inf. Technol. (CIT), 2008, pp. 35-40.
[26] K. Sonal, G. Poonam, S. Ankit, K. Dhruv, S. Balasubramaniam, and N. Goyal, "Exact, fast and scalable parallel DBSCAN for commodity platforms," in Proc. 18th Int. Conf. Distrib. Comput. Netw., 2017, p. 14.
[27] X. Chen, Y. Min, Y. Zhao, and P. Wang, "GMDBSCAN: Multi-density DBSCAN cluster based on grid," in Proc. IEEE Int. Conf. e-Bus. Eng., 2008, pp. 780-783.
[28] S. T. Mai, I. Assent, and M. Storgaard, "AnyDBC: An efficient anytime density-based clustering algorithm for very large complex datasets," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., 2016, pp. 1025-1034.
[29] B. Borah and D. K. Bhattacharyya, "An improved sampling-based DBSCAN for large spatial databases," in Proc. Int. Conf. Intell. Sens. Inf. Process., 2004, pp. 92-96.
[30] C.-F. Tsai and C.-W. Liu, "KIDBSCAN: A new efficient data clustering algorithm," in Proc. Artif. Intell. Soft Comput. (ICAISC), 2006, pp. 702-711.
[31] C. Tsai and T. Huang, "QIDBSCAN: A quick density-based clustering technique," in Proc. Int. Symp. Comput. Consum. Control, 2012, pp. 638-641.
[32] A. Bryant and K. Cios, "RNN-DBSCAN: A density-based clustering algorithm using reverse nearest neighbor density estimates," IEEE Trans. Knowl. Data Eng., vol. 30, no. 6, pp. 1109-1121, Jun. 2018.
[33] A. Lulli, M. Dell'Amico, P. Michiardi, and L. Ricci, "NG-DBSCAN: Scalable density-based clustering for arbitrary data," Proc. VLDB Endow., vol. 10, no. 3, pp. 157-168, 2016.
[34] F. Gieseke, J. Heinermann, C. E. Oancea, and C. Igel, "Buffer k-d trees: Processing massive nearest neighbor queries on GPUs," in Proc. ICML, 2014, pp. 172-180.
[35] Y. Chen, L. Zhou, Y. Tang, N. Bouguila, and H. Wang, "Fast neighbor search by using revised k-d tree," Inf. Sci., vol. 472, pp. 145-162, 2019.
[36] Y. Chen, L. Zhou, and N. Bouguila, "Semi-convex hull tree: Fast nearest neighbor queries for large scale data on GPUs," in Proc. IEEE Int. Conf. Data Min., 2018, pp. 911-916.
[37] J. Wang et al., "Trinary-projection trees for approximate nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 388-403, Feb. 2014.
[38] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in Proc. Int. Conf. Comput. Vis. Theory Appl. (VISAPP), 2009, pp. 331-340.
[39] C. Silpa-Anan and R. Hartley, "Optimised KD-trees for fast image descriptor matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2008, pp. 1-8.
[40] Y. Chen, S. Tang, N. Bouguila, C. Wang, J. Du, and H. L. Li, "A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data," Pattern Recognit., vol. 83, pp. 375-387, Nov. 2018.
[41] A. Karami and R. Johansson, "Choosing DBSCAN parameters automatically using differential evolution," Int. J. Comput. Appl., vol. 91, no. 7, pp. 1-11, 2014.
[42] H. Zhou, P. Wang, and H. Li, "Research on adaptive parameters determination in DBSCAN algorithm," J. Xian Univ. Technol., vol. 9, no. 7, pp. 1967-1973, 2012.
[43] F. O. Ozkok and M. Celik, "A new approach to determine eps parameter of DBSCAN algorithm," Int. J. Intell. Syst. Appl. Eng., vol. 4, no. 5, pp. 247-251, 2017.
[44] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," in Proc. Int. Conf. Data Eng. (ICDE), 2005, pp. 341-352.
[45] L. M. Collins and C. W. Dent, "Omega: A general formulation of the Rand index of cluster recovery suitable for non-disjoint solutions," Multivariate Behav. Res., vol. 23, no. 2, pp. 231-242, 1988.
[46] A. Strehl and J. Ghosh, "Cluster ensembles: A knowledge reuse framework for combining partitionings," in Proc. 18th Nat. Conf. Artif. Intell., 2002, pp. 93-99.
[47] M. A. Patwary, D. Palsetia, A. Agrawal, W.-K. Liao, F. Manne, and A. Choudhary, "Scalable parallel OPTICS data clustering using graph algorithmic techniques," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC), 2013, pp. 1-12.
[48] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Res. Logist. Quart., vol. 2, nos. 1-2, pp. 83-97, 1955.
[49] Y. Chen, S. Tang, S. Pei, C. Wang, J. Du, and N. Xiong, "DHeat: A density heat-based algorithm for clustering with effective radius," IEEE Trans. Syst., Man, Cybern., Syst., vol. 48, no. 4, pp. 649-660, Apr. 2018.
[50] X. Liu, Z. Hu, H. Ling, and Y. Cheung, "MTFH: A matrix tri-factorization hashing framework for efficient cross-modal retrieval," IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[51] J. Hou, H. Zeng, L. Cai, J. Zhu, J. Chen, and K.-K. Ma, "Multi-label learning with multi-label smoothing regularization for vehicle re-identification," Neurocomputing, vol. 345, pp. 15-22, Jun. 2019.
[52] J. Zhu et al., "Vehicle re-identification using quadruple directional deep learning features," IEEE Trans. Intell. Transp. Syst., to be published.
[53] Y. Zhang, X. Xu, J. Wang, T. Chen, and C. H. Wang, "Crushing analysis for novel bio-inspired hierarchical circular structures subjected to axial load," Int. J. Mech. Sci., vol. 140, pp. 407-431, May 2018.
[54] H. Tian, F. Nan, C.-C. Chang, Y. Huang, J. Lu, and Y. Du, "Privacy-preserving public auditing for secure data storage in fog-to-cloud computing," J. Netw. Comput. Appl., vol. 127, pp. 59-69, Feb. 2019.
[55] H. Tian, F. Nan, H. Jiang, C.-C. Chang, J. Ning, and Y. Huang, "Public auditing for shared cloud data with efficient and secure group management," Inf. Sci., vol. 472, pp. 107-125, Jan. 2019.
[56] H. Tian et al., "Public audit for operation behavior logs with error locating in cloud storage," Soft Comput., vol. 23, no. 11, pp. 3779-3792, Jun. 2019.

Yewang Chen received the B.S. degree in management of information system from Huaqiao University, Quanzhou, China, in 2001, and the Ph.D. degree in software engineering from Fudan University, Shanghai, China, in 2009.
He is currently an Associate Professor with the School of Computer Science and Technology, Huaqiao University, and with the Fujian Key Laboratory of Big Data Intelligence and Security, Huaqiao University (Xiamen Campus), Xiamen, China. He is also with the Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing Technology and Business University, Beijing, China, and the Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, China. His current research interests include machine learning and data mining.
Lida Zhou received the B.S. degree in computer science from the College of Computer Science and Technology, Central China Normal University, Wuhan, China, in 2012. He is currently pursuing the postgraduate degree with the School of Computer Science and Technology, Huaqiao University (Xiamen Campus), Xiamen, China.
His current research interests are machine learning and pattern recognition.

Songwen Pei (SM'19) received the B.S. degree in computer science from the National University of Defense Technology, Changsha, China, in 2003, the M.S. degree in computer science from Guizhou University, Guiyang, China, in 2006, and the Ph.D. degree in computer science from Fudan University, Shanghai, China, in 2009.
He is currently an Associate Professor with the University of Shanghai for Science and Technology, Shanghai. Since 2011, he has been a Guest Researcher with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He was a Research Scientist with the University of California at Irvine, Irvine, CA, USA, from 2013 to 2015, and with the Queensland University of Technology, Brisbane, QLD, Australia, in 2017. His research interests include heterogeneous multicore systems, cloud computing, and big data.
Dr. Pei is a board member of CCF-TCCET and CCF-TCARCH. He is a member of ACM and CCF in China.

Zhiwen Yu (SM'14) received the Ph.D. degree in computer science from the City University of Hong Kong, Hong Kong, in 2008.
He is a Professor with the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China. He has published more than 140 refereed journal papers and international conference papers, including 40 IEEE TRANSACTIONS papers. His research areas focus on data mining, machine learning, pattern recognition, and intelligent computing.
Prof. Yu is a Distinguished Member of the China Computer Federation and the Vice Chair of the ACM Guangzhou Chapter. He is a Senior Member of ACM.

Xin Liu (M'08) received the M.S. degree in applied mathematics from Hubei University, Wuhan, China, in 2009, and the Ph.D. degree in computer science from Hong Kong Baptist University, Hong Kong, in 2013.
He was a Visiting Scholar with the Computer and Information Sciences Department, Temple University, Philadelphia, PA, USA, from 2017 to 2018. He is currently an Associate Professor with the Department of Computer Science and Technology, Huaqiao University, Quanzhou, China, and also with the State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an, China. His present research interests include multimedia analysis, computer vision, pattern recognition, and machine learning.

Jixiang Du received the B.Sc. and M.Sc. degrees in vehicle engineering from the Hefei University of Technology, Hefei, China, in September 1999 and July 2002, respectively, and the Ph.D. degree in pattern recognition and intelligent system from the University of Science and Technology of China, Hefei, in December 2005.
He is currently a Professor with the College of Computer Science and Technology, Huaqiao University, Quanzhou, China.