0% found this document useful (0 votes)
24 views15 pages

KNN Block Dbscan

Uploaded by

Tejas
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
24 views15 pages

KNN Block Dbscan

Uploaded by

Tejas
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 51, NO.

6, JUNE 2021 3939

KNN-BLOCK DBSCAN: Fast Clustering


for Large-Scale Data
Yewang Chen , Lida Zhou, Songwen Pei , Senior Member, IEEE, Zhiwen Yu , Senior Member, IEEE,
Yi Chen, Xin Liu, Member, IEEE, Jixiang Du , and Naixue Xiong , Senior Member, IEEE

Abstract—Large-scale data clustering is an essential key for invented to speedup the clustering process. The experimental
big data problem. However, no current existing approach is results show that KNN-BLOCK DBSCAN is an effective approx-
“optimal” for big data due to high complexity, which remains imate DBSCAN algorithm with high accuracy, and outperforms
it a great challenge. In this article, a simple but fast approx- other current variants of DBSCAN, including ρ-approximate
imate DBSCAN, namely, KNN-BLOCK DBSCAN, is proposed DBSCAN and AnyDBC.
based on two findings: 1) the problem of identifying whether
a point is a core point or not is, in fact, a kNN problem and Index Terms—DBSCAN, FLANN, kNN, KNN-BLOCK
2) a point has a similar density distribution to its neighbors, DBSCAN.
and neighbor points are highly possible to be the same type
(core point, border point, or noise). KNN-BLOCK DBSCAN uses
a fast approximate kNN algorithm, namely, FLANN, to detect
core-blocks (CBs), noncore-blocks, and noise-blocks within which I. I NTRODUCTION
all points have the same type, then a fast algorithm for merg-
LUSTERING analysis is the task of grouping objects
ing CBs and assigning noncore points to proper clusters is also

Manuscript received December 29, 2018; revised May 31, 2019 and July 25,
C according to measured or perceived intrinsic characteris-
tics or similarity, aiming to retrieve some natural groups from a
2019; accepted November 18, 2019. Date of publication December 18, 2019; set of patterns or points. It is a fundamental technique in many
date of current version May 18, 2021. This work was supported in applications, such as data mining, pattern recognition, etc., and
part by the National Natural Science Foundation of China under Grant
61673186, Grant 61972010, Grant 61975124, Grant 61722205, Grant many researchers believe that clustering is an essential key for
61751205, Grant 61572199, and Grant U1611461, in part by the Funds analyzing big data [1].
from State Key Laboratory of Computer Architecture, ICT, CAS under Grant Currently, there are thousands of clustering algorithms have
CARCH201807, in part by the Open Project of Provincial Key Laboratory
for Computer Information Processing Technology, Soochow University under been proposed, for example, k-means [2], mean shift [3],
Grant KJS1839, in part by the Quanzhou City Science and Technology DBSCAN [4], spectral clustering [5], [6], mixtures of dirich-
Program of China under Grant 2018C114R, in part by the Open Project let model [7], [8], clustering based on supervised learn-
of Beijing Key Laboratory of Big Data Technology for Food Safety under
Grant BTBD-2019KF06, in part by the Key Research and Development ing [9], and clustering by local cores [10], [11]. According
Program of Guang Dong Province under Grant 2018B010107002, and in to Jain et al. [12], different categories of these clustering
part by the Grant from the Guang Dong Natural Science Funds under approaches are recognized: centroid-based clustering, parti-
Grant 2017A030312008. This article was recommended by Associate Editor
G. Nicosia. (Corresponding authors: Songwen Pei; Zhiwen Yu.) tioning clustering, density-based clustering etc.
Y. Chen is with the College of Computer Science and Technology, Huaqiao The goal of density-based clustering is to identify densely
University (Xiamen Campus), Xiamen 361021, China, also with the Beijing regions with arbitrary shape, which can be measured by the
Key Laboratory of Big Data Technology for Food Safety, Beijing Technology
and Business University, Beijing 100048, China, also with the Provincial density of a given point. An identified cluster is usually a
Key Laboratory for Computer Information Processing Technology, Soochow region with high density, while outliers are regions with low
University, Suzhou 215301, China, and also with the Fujian Key Laboratory densities. Hence, density-based clustering is one of the most
of Big Data Intelligence and Security, Huaqiao University (Xiamen Campus),
Xiamen 361021, China (e-mail: [email protected]). popular paradigms. There are many algorithms of this kind,
L. Zhou and X. Liu are with the College of Computer Science and such as DBSCAN [4], OPTICS [13], DPeak [14]–[16], mean
Technology, Huaqiao University, Quanzhou 362021, China. shift [3], DCore [11], etc., where DBSCAN [4] is the most
S. Pei is with the Shanghai Key Laboratory of Modern Optical Systems,
University of Shanghai for Science and Technology, Shanghai 200093, China famous one and has been widely used.
(e-mail: [email protected]). Unfortunately, most of the current existing clustering
Z. Yu is with the School of Computer Science and Engineering, approaches do not work well for large-scale data, due to their
South China University of Technology, Guangzhou 510640, China (e-mail:
[email protected]). high complexities. For example, the complexity of k-means
Y. Chen is with the Beijing Key Laboratory of Big Data Technology for is O(ktn) where t is the iterations times, DBSCAN runs in
Food Safety, Beijing Technology and Business University, Beijing, China. O(n2 ). In this article, a fast approximate algorithm named
J. Du is with the College of Computer Science and Technology, Huaqiao
University, Quanzhou 362021, China, and also with the Fujian Key Laboratory KNN-BLOCK DBSCAN,1 is proposed to speedup DBSCAN,
of Big Data Intelligence and Security, Huaqiao University, Quanzhou 362021, which is able to deal with large-scale data. We also concentrate
China. on comparing our algorithm with DBSCAN, ρ-approximate
N. Xiong is with the Department of Mathematics and Computer Science,
Northeastern State University, Tahlequah, OK 74464 USA. DBSCAN [17], and AnyDBC.
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TSMC.2019.2956527.
Digital Object Identifier 10.1109/TSMC.2019.2956527 1 https://github.com/XFastDataLab/KNN-BLOCK-DBSCAN

2168-2216 
c 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
3940 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 51, NO. 6, JUNE 2021

TABLE I
D ESCRIPTION OF M AIN VARIABLES AND S YMBOLS is MinPts which is used as a density threshold for deciding
U SED IN T HIS A RTICLE whether a point is a core point or not.
If a tree-based spatial index is used, the average complex-
ity is reduced to O(n log(n)) [4]. However, this turns out to
be a misclaim: as pointed out by Gunawan and de Berg [21],
DBSCAN actually runs in O(n2 ) time, regardless of  and
MinPts. Unfortunately, this misclaim is widely accepted as
a building brick in many research papers and textbooks,
e.g., [22]–[24], etc. Furthermore, DBSCAN is almost use-
less in high dimension, due to the so-called “curse of
dimensionality.”
Mahran and Mahar [25] introduced an algorithm named
GriDBSCAN to enhance the performance of DBSCAN
by using grid partitioning and merging, yielding a high
performance with the advantage of a high degree of
parallelism. But this technique is inappropriate for high-
dimensional data because the effect of redundancy in this
algorithm increases exponentially with dimension. Similarly,
Gunawan and de Berg [21] proposed an algorithm named
Fast-DBSCAN to improve DBSCAN for two-dimensional
The main contributions of this article are listed as follows. (2-D) data, which also imposes an arbitrary grid √ T on 2-
1) We find that the key problem in DBSCAN of identifying D space, where each cell of T has side length / 2. If a
the type of each point is a kNN problem in essence. nonempty cell c contains at least MinPts points, then this
Therefore, many techniques of this field, such as cell is called core cell, and all points in this cell are core
FLANN [18], kd-tree [19], cover tree [20], etc., can be points because the maximum distance within this cell is .
utilized. Therefore, it is unnecessary to compute densities for each
2) According to a general rule that a point has similar den- point in a core cell. Gan and Tao [17] proposed an algorithm
sity distribution to its neighbors, and neighbor points are named ρ-approximate DBSCAN also based on grid technique
likely to be the same type (core, border, or noise). Then, for large data set, and achieved an excellent complexity O(n)
a technique is proposed to identify blocks within which in low dimension. But it degenerates to an O(n2 ) algorithm
all points have the same type, such as CBs, noncore in high even relative high-dimensional data space. Besides,
blocks, and noise blocks. parallel GridDBSCAN [26] and GMDBSCAN [27] are also
3) A fast algorithm is also invented for merging CBs and grid-based DBSCAN.
assigning noncore points to corresponding clusters. AnyDBC [28] compresses the data into smaller density-
Before introducing the proposed algorithm, we would like to connected subsets called primitive clusters and labels objects
present the main variables and symbols used in this article as based on connected components of these primitive clusters
follows. Let P be a set of n points in D-dimensional space RD ; for reducing the label propagation time. To speedup the range
pi ∈ P be the ith point of P; dp,q (or dist(p, q)) be the distance query process, it uses kd-trees [14] for indexing data, and per-
between points p and q, where the distance can be Euclidean forms substantially fewer range queries compared to DBSCAN
or Chebychev distance;  be the scanning radius of DBSCAN; while still guaranteeing the exact final result of DBSCAN.
dp,(i) be the distance from p to its ith nearest neighbor, and There are some other variants of DBSCAN as follow-
p(i) be the ith nearest neighbor of p. More symbols are shown ing. IDBSCAN [29] is a sampling-based DBSCAN, which
in Table I. is able to handle large spatial databases with minimum
The remainder of this article is organized as fol- I/O cost by incorporating a better sampling technique, and
lows. Section II introduces the related work of DBSCAN reduces the memory requirement for clustering dramatically.
and nearest neighbor query. Section III revisits FLANN, KIDBSCAN [30] presents a new technique based on the con-
DBSCAN, and ρ-approximate DBSCAN. Section IV cept of IDBSCAN, in which k-means is used to find the
addresses the proposed method, KNN-BLOCK DBSCAN, high-density center points and then IDBSCAN is used to
in detail, including basic ideas, processes, and algorithms. expand clusters from these high-density center points. Based
Section V shows experiments and makes comparison with on IDBSCAN, Quick IDBSCAN [31] (QIDBSCAN) uses
ρ-approximate DBSCAN on some data sets. Section VI gives four marked boundary objects (MBOs) to expand computing
the final conclusion, and our future work that could improve directly.
the proposed method. Moreover, because exact clustering is too costly, this has
generated interest in many approximate methods, includ-
ing our algorithm, to speed up original DBSCAN in the
II. R ELATED W ORK
past two decades. Here, the approximation means that the
A. Variants of DBSCAN clustering result may be different from that of the original
DBSCAN is designed to discover clusters of arbitrary shape. DBSCAN. For example, in original DBSCAN, a data point
It needs two parameters, one is scanning radius , and the other p may be classified into one cluster, while in approximate
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
CHEN et al.: KNN-BLOCK DBSCAN: FAST CLUSTERING FOR LARGE-SCALE DATA 3941

Algorithm 1 [18] SearchKmeansTree Algorithm 2 [18] TraverseKmeansTree


1: Input: query point q; the K value of kNN; the maximum 1: Input: current node curNode; priority queue PQ; priority
number of examined points L; k-means tree T; queue R count; query point q
2: count := 0; 2: if curNode is leaf then
3: PQ := empty priority queue 3: search all points in curNode and add them to R
4: R := empty priority queue 4: count := count + |curNode|
5: curNode := T 5: else
6: TraverseKmeansTree(curNode, PQ, R, count, q) 6: subNodes := sub nodes of curNode
7: while PQ <> NULL and count < L do 7: nearestSubNode := nearest node of subNodes to q
8: curNode := top of PQ 8: subNodes := subNodes -nearestSubNode
9: TraverseKmeansTree(curNode, PQ, R, count, q) 9: PQ := PQ + subNodes
10: end while 10: TraverseKmeansTree(nearestSubNode, PQ, R)
11: Return K top points from R 11: end if

DBSCAN, it may be designated into another cluster. A scal-


able RNN-DBSCAN [32] solution was investigated to improve
DBSCAN by using an approximate kNN algorithm. NG-
DBSCAN [33] is an approximate density-based clustering
algorithm that operates on arbitrary data and any symmet-
ric distance measure. The distributed design of this algorithm Fig. 1. Example of core cells. Core cells are shown in gray, and each point
in the core cell is a core point [17].
makes it scalable to very large data sets; its approximate nature
makes it fast, yet capable of producing high-quality clustering
results.
1) Initially traverse the tree from root to the q s nearest leaf
node, followed by nonleaf node with the closest cluster
B. Nearest Neighbors Searching Algorithms center to q, and add all unexplored branches along the
In the past few decades, many researchers have launched path to a priority queue [(PQ): lines 7–9 in Algorithm 2],
large amounts of fruitful researches in the field of nearest which is sorted in increasing distance from q to the
neighbor query, many techniques have been proposed and boundary of the branch being added to the queue.
successfully applied to accelerate the processes of search- 2) Restart to traverse the tree in the queue from the top
ing neighbors. For example, partition trees (kd-tree [34], [35], branch (line 10 in Algorithm 2).
semi-convex hull tree [36]), hashing techniques such as ANN Let I be the maximum iteration times of k-means, and L be
based on trinary-project tree [37]. the number of examined points by FLANN. The height of the
Because the exact search is time-consuming for many tree is about log(n)/ log(χ ) if the tree is balanced. During each
applications, then the approximate nearest neighbor query is traversal from top to down, there are about O(log(n)/ log(χ ))
optional in some cases, which returns nonoptimal results, but inner nodes and one leaf node should be checked. Thus, the
runs much faster. For example, FLANN [18], [38] uses the complexity of FLANN is about O(LD(log(n)/ log(χ ))), where
priority search k-means tree or the multiple randomized kd L is the number of examined points.
forest [39] which can give the best performance on a wide ρ-Approximate DBSCAN: For simplicity, the basic con-
range of dimensional data space. In this article, we mainly cepts and terms of DBSCAN [4] (e.g., core points, density-
use it to improve the performance of DBSCAN. reachable, cluster, noise, etc.) are not presented here. Aiming
to improve DBSCAN, ρ-approximate algorithm imposes a
simple quadtree-like hierarchical grid T on D-dimensional
III. FLANN, ρ-A PPROXIMATE DBSCAN REVISITED space, and divides the data space into a set of nonempty cells.
FLANN: In this article, we use FLANN with the priority Each
√ cell is a D-dimensional hyper-square with side length
search k-means tree to perform the nearest neighbor query, / D. Fig. 1 shows an example in 2-D space. Then, it builds
where the priority k-means tree is constructed by k-means a graph by redefining the definition and computation of the
(see [18, Algorithm 1]) that partition the data points at each graph G = (V, E): 1) each vertex is a core cell and 2) given
level into χ (in [18], it is denoted as K which represents two different core cells c1 , c2 .
to the cluster number K of k-means tree. In this article, 1) ∃p1 ∈ c1 , p2 ∈ c2 , such that dist(p1 , p2 ) ≤ , there is an
we use character χ to replace with K in order to make it edge between c1 and c2 .
different from the K value of kNN), which is called the 2) If  ∃p1 ∈ c1 is within the (1 + ρ)-neighborhood of
branching factor with default value 512, distinct regions recur- p2 ∈ c2 , there is no edge between c1 and c2 .
sively, until the total number of points in a region is less 3) Otherwise, don’t care.
than χ . Based on the graph G and the quadtree-like hierarchical
As Algorithm 1 shows, given a query point q, the priority grid, an approximate range counting algorithm is designed to
k-means the tree is searched by the following steps. solving the problem of DBSCAN.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
3942 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 51, NO. 6, JUNE 2021

IV. P ROPOSED A LGORITHM


eps

A. Drawbacks of DBSCAN Analysis eps/2 pi


p

DBSCAN runs in O(n2 ), and most of its variants still do not eps
work well for large-scale data. In order to find the underlying
causes, we analyzed fundamental techniques used in traditional
clustering approaches, and find that there are some significant Fig. 2. Example of a CB. MinPts = 8,  = eps and there are eight red points
deficiencies as follows. which are within N(/2) (p), then all red points are core points.
1) Brute force algorithm is used in original DBSCAN to
compute density for an arbitrary data point, the com-
plexity is O(n). However, there are many redundancies.
eps r−eps
Suppose di,k and dj,k are already known, while di,j is q p eps

unknown. Suppose |di,k −dj,k | >  or di,k +dj,k ≤ , then


we can infer di,j >  or di,j ≤  according to triangle r

inequality, respectively. Then, the distance computation


for di,j is also unnecessary. Fig. 3. Example of an NCB. MinPts = 22,  = eps, r > eps, and the total
2) In the case of the grid technique
√ is used, the side length number of points within Nr (p) (the outer circle) is 21, then all red points are
of each cell is fixed to / D, which implies that it is noncore points, because they are all within Nr− (p).
almost useless in high dimension [40].

B. Basic Ideas r−2*eps


r−eps
p
eps
As mentioned above, DBSCAN cannot deal with large-scale
data due to its high complexity. According to our observation r

and analysis on DBSCAN, there are two findings as follows.


1) The key problem of DBSCAN is to find core points, Fig. 4. Example of a noise-CB. MinPts = 22,  = eps and r > 2, then all
which is a kNN problem in essence, because the density red points within green circle are noise, because Nr− (p) is noncore block
defined in DBSCAN is the total number of points within which implies there is no core point within the red circle.
a specified neighborhood, and all neighbors of a core
point should be reported for merging.
2) Point p and point q should have similar neighborhoods, then according to Theorem 2 all red points are core points.
provided p and q are close; the closer they are, the Therefore, N(/2) (p) is a CB.
more similar neighborhood they have. Thus, it is highly Theorem 3: Let dp,(K) = r, (1) if r > , then ∀q ∈ Nr− (p)
possible that a point has the same type as its neighbors. is noncore point. (2) if r > 2, then ∀q ∈ Nr−2 (p) is noise.
Hence, it is reasonable to utilize kNN technique to solve Proof:
the problem of DBSCAN. Formally, let K = MinPts and 1) Because dp,(K) = r > , which means ∀q ∈ Nr− (p),
p(1) , . . . , p(K) be the first K nearest neighbor points of p, where N (q) ∈ Nr (p), therefore, |N (q)| < |Nr (p)| = MinPts. Thus,
1 ≤ i ≤ K, then we have the following. q is a noncore point.
Theorem 1: 2) Because dp,(K) = r > 2, then ∀q ∈ Nr−2 (p), we have
1) If dp,(K) ≤ , then p is a core point. N (q) ∈ Nr− (p), and because Nr− (p) is a noncore-block
2) p is a noncore point if dp,(i) > , where 1 ≤ i ≤ K. (NCB), which implies there is no core point in N (q), then q
Proof: 1) Because dp,(K) ≤ , which means dp,(1) ≤ dp,(2) is noise.
≤, . . . , ≤ dp,(K) ≤ , |N (p)| ≥ K = MinPts, p is a core point. Definition 2 (None-Core Block (NCB)): Nξ (p) is an NCB
2) Because 1 ≤ i ≤ K and dp,(i) > ,  < dp,(i) ≤ dp,(K) . with respect to p and ξ , if ∀q ∈ Nξ (p) is noncore point. It is
Thus, |N (p)| < K = MinPts, i.e., p is a noncore point. noted as NCB(p, ξ ), and p is called the center of NCB(p, ξ ).
As a result of Theorem 1, we argue that the problem of Definition 3 (Noise-Block (NOB)): Nξ (p) is an NOB with
identifying whether a point is a core point or not is a kNN respect to p and ξ , if ∀q ∈ Nξ (p) is noise. It is noted as
problem. NOB(p, ξ ), and p is called the center of NOB(p, ξ ).
Theorem 2: If dp,(K) ≤ (/2), p(1) , p(2) , . . . , p(K) are all Obviously, an NOB is NCB, but an NCB may not be NOB;
core points. neither NCB nor NOB is CB, and vice versa.
Proof: Because dp,(K) ≤ (/2) ≤ , according to triangle Fig. 3 addresses an example of Theorem 3 (1). Because
inequality, we have ∀i, j ∈ [1, K] dist(p(i) , p(j) ) ≤ . Therefore, MinPts = 22,  = eps and r > eps, it is impossible for each
∀i ∈ [1, K] we have |N (p(i) )| ≥ K, i.e., p(1) , p(2) , . . . , p(K) point within the blue circle to find enough neighbors within
are all core points. its -neighborhood, (because the total number of points within
Definition 1 (Core-Block (CB)): Nξ (p) is a CB with respect Nr (p), i.e., the outer circle, is 21). Thus, all points within the
to p and ξ , if ∀q ∈ Nξ (p) is core point. It is noted as CB(p, ξ ), blue circle are noncore points, i.e., Nr− (p) is an NCB.
and p is called the center of CB(p, ξ ). Fig. 4 is another example to explain Theorem 3 (2). Because
As Fig. 2 shows, all red points are within N(/2) (p), and r > 2, all points within green circle are noncore points, and
the total number of red points is 8 which is equal to MinPts, it is also impossible for any point p within green circle to

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
CHEN et al.: KNN-BLOCK DBSCAN: FAST CLUSTERING FOR LARGE-SCALE DATA 3943

Algorithm 4 MergeCoreBlocks(CBs, )
1: Input: CBs: core-blocks; CBCENT: core-block centers
set;  is the parameter of DBSCAN ;
2: for each core-block CB(p, ξ1 ) do
3: Neibs := FLANN::RangeSearch(p, 2, CBCENT)
4: for each q ∈ Neibs do
5: CB(q, ξ2 ) be the core-block of q
Fig. 5. Framework of KNN-BLOCK DBSCAN. It uses FLANN to identify
6: if p and q are in different cluster then
CBs, NCBs, and NOB, then merges CBs, assigns points in NCBs to proper 7: if dp,q > ξ1 + ξ2 +  then
clusters and discards noises. 8: BruteForceMerge(CB(p, ξ1 ), CB(q, ξ2 ))
9: end if
Algorithm 3 KNN-BLOCK DBSCAN(P, , MinPts) 10: end if
1: Input: P is input data; [, MinPts]; 11: end for
2: Output: cluster id of each point; 12: end for
3: Initialize core-blocks set CBs = {φ}
4: Initialize non-core-blocks set NCBs = {φ}
Algorithm 5 AssignNonCoreBlocks(NCBs, CBs, )
5: K := MinPts, cur_cid := 0 // current cluster id
1: Input: NCBs: non-core-blocks; CBs: core blocks;  is the
6: for each unvisited point p ∈ P do
parameter of DBSCAN;
7: {p(1) , . . . , p(K) } := FLANN :: kNN(p, P)
2: for each non-core-block NCB(p, ξ1 ) do
8: ξ := dp,(K) , Nξ (p) := {p(1) , p(2) , . . . , p(K) }
3: r := ξ1 + 1.5;
9: if ξ ≤  then
4: Neibs := FLANN::RangeSearch(p, r,CBCENT)
10: cur_cid := cur_cid + 1
if ∃q ∈ Neibs s.t. dp,q ≤ ( − ξ1 ) then
if ξ ≤ 2 then
5:
11:
6: merge NCB(p, ξ1 ) into the cluster of q
12: push Nξ (p) into CBs //a core block found
7: process next non-core-block
13: ∀s ∈ Nξ (p) mark s as core-point and visited
8: else
14: else
9: for each unclassified o ∈ NCB(p, ξ1 ) do
15: push N0 (p) into CBs //single core point
10: if ∃q ∈ Neibs s.t. dp,q ≤ ( + ξ1 + ξ2 ) then
16: mark p as core-point and visited
11: if ∃s ∈ CB(q, ξ2 ) s.t.do,s ≤  then
17: end if
12: assign o to the cluster of q
18: curCorePts:= core points already found in Nξ (p)
13: process next unclassified point o
19: exist_cids:= clusters found in curCorePts
14: end if
20: merge exist_cids into cur_cid
15: end if
21: assign Nξ (p) to cluster cur_cid
16: end for
22: else if  < ξ ≤ 2 then
17: end if
23: push Nξ − (p) into NCBs
18: end for
24: mark all points within Nξ − (p) as visited
25: else if ξ > 2 then
26: mark ∀q ∈ Nξ −2 (p) as noise and visited
27: end if of DBSCAN mainly lies in: 1) kNN is used, instead of using
28: end for range query algorithm, to identify core points and noncore
29: CBCENT := extract all center points from CBs points by block (CBs, NCBs, and NOBs); 2) each block has a
30: Create a index tree by FLANN from CBCENT dynamic range, while the width of grid used in ρ-approximate
31: MergeCoreBlocks(CBs, CBCENT, cbIDs, ) DBSCAN and fast DBSCAN is a constant; and 3) CBs can
32: AssignNonCoreBlocks(NCBs, CBs, CBCENT, ) be processed by a simple way which is far more efficient than
grid.

find any core point from which p is directly density-reachable, C. Algorithms


because Nr− (p) is noncore block which implies there is no In this section, we outline the proposed method. The frame-
core point within the red circle. Thus, points within Nr−2 (p) work of KNN-BLOCK DBSCAN is shown in Fig. 5. First, it
are all outliers, i.e., Nr−2 (p) is an NOB. uses FLANN to identify CBs, NCBs, and NOB. Second, for
Definition 4: A core block CB(p, ξ1 ) is density-reachable any two pairs of CBs, it merges them into the same cluster
from another core block CB(q, ξ2 ), if ∃s ∈ CB(p, ξ1 ) and provided they are density-reachable from each other. Third, for
w ∈ CB(p, ξ2 ), such that s is density-reachable from w. each point p in NCBs, KNN-BLOCK DBSCAN may assign
Definition 5: A point p is density-reachable from core block p to a cluster if there exists a core point from which it is
CB(q, ξ ), if ∃s ∈ CB(q, ξ ) such that p is density-reachable density-reachable. The details are shown in Algorithm 3, 4, 5,
from q. and 6, respectively.
Comprehensively, based on the two findings mentioned 1) Types and Blocks Identification: As Algorithm 3 shows,
above, the difference of between this article and other variants for each an unvisited point p in P, it uses FLANN::kNN to

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
3944 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 51, NO. 6, JUNE 2021

Algorithm 6 BruteForceMerge(CB(p, ξ1 ), CB(q, ξ2 ))


1: Input: CB(p, ξ1 ): a core-block; CB(q, ξ2 ): another core- ξ1
ξ2
ε+ξ 2
block; ε+ξ 1
p q

2: Initialize two points set O = {φ} and S = {φ}


3: for each point o in CB(q, ξ2 ) do
4: push o to O if do,p <  + ξ1 Fig. 7. Example of case (3) for merging CBs. CB(p, ξ1 ) is a CB, CB(q, ξ2 )
5: end for is another CB, only points in the two shadow region are possible directly
6: for each point s in CB(p, ξ1 ) do density-reachable from each other.
7: push s to S if ds,q <  + ξ2
8: end for
9: if ∃o ∈ O, s ∈ S, s.t. do,s ≤  then
in the two shadow regions are checked, instead of whole
10: merge CB(p, ξ1 ) and CB(q, ξ2 )) two CBs.
11: end if
3) Borders Identification: At last, given a CB CB(p, ξ1 ) and
an NCB NCB(q, ξ2 ), Algorithm 5 (AssignNonCoreBlocks) is
called to identify border points in NCB(q, ξ2 ) that are density-
reachable from CB(p, ξ1 ). Similar to Fig. 6, there are also three
cases as described below.
Case 1 (dp,q >( + ξ1 + ξ2 )): NCB(q, ξ2 ) is far from
CB(p, ξ1 ), then it is unnecessary to merge them.
Case 2 (dp,q ≤( − ξ2 )): Because NCB(q, ξ2 ) is totally
Fig. 6. Three cases of two CBs. (a) Two CBs can be merged directly. (b) Is contained in N (p), all points within NCB(q, ξ2 ) are density-
a case that can skip directly for they are far from each other. (c) Addresses reachable from p. Therefore, all points in NCB(q, ξ2 ) are
the third case that is necessary to check in detail.
assigned to the cluster of p directly.
Case 3 (( − ξ2 )< dp,q ≤( + ξ2 )): it is necessary to check
whether each point within NCB(q, ξ2 ) is density-reachable
retrieve the first K (K = MinPts) nearest neighbors of p.
from p. Similar to Fig. 7, only points within two shadow
According to Theorem 1, the type of p can be identified.
regions are checked.
If p is a core point, we may find a core block according
to Theorem 2 (lines 11–13). If p is not a core point, we may
find an NCB (lines 22–24) or noise block (lines 25 and 26) D. Complexity Analysis
according to Theorem 3. Let n be the cardinality of data set, b0 = b1 + b2 + b3 be the
2) Blocks Merging: Let CB(p, ξ1 ) and CB(q, ξ2 ) be two total number of all blocks, where b1 , b2 , and b3 are the total
CBs, there are three cases as described below. number of CBs, NCBs, and NOBs, respectively. Averagely,
Case 1 (dp,q ≤ ): As image (a) in Fig. 6 shows, because p b0 = β(n/MinPts), where β is a factor about the distribu-
is directly density-reachable from q, both CBs can be merged tion of the data, and b0 is usually far less than n provided
into a same cluster directly. [, MinPts] are well chosen (how to choose good parameters
As shown from lines 20 and 21 in Algorithm 3, suppose for DBSCAN is another big topic, such as OPTICS [13] and
CB(p, ξ1 ) is a newly identified CB, and if there are some others [41]–[43], which is out of the scope of this article). The
points that have already been assigned to other clusters within complexity of Algorithm 3 is analyzed as follows.
CB(p, ξ1 ), then these clusters can be merged directly. Space Complexity: As shown in the above algorithms, we
Case 2 (dp,q > ( + ξ1 + ξ2 )): As illustrated in Fig. 6 (b), can see that each block should be saved, thus the space cost
they are far away from each other, there is no need to merge is about O(MinPts ∗ b0 ) = O(βn).
them, because according to triangle inequality, there is no point Time Complexity:
in CB(p, ξ1 ) that is density-reachable from another point in 1) From lines 6–29 of Algorithm 3, we can infer that
CB(q, ξ2 ). FLANN::kNN will be called about b0 times. As we
Case 3 ( < dp,q ≤(ξ1 + ξ2 + )): As Fig. 6(c) know, in the case of using priority search k-means tree,
addresses, CB(p, ξ1 ) and CB(q, ξ2 ) have no intersection, FLANN::kNN runs in O(L D log(n)/ log(χ )) expected
and they can be merged if there exists a pair of points time [18] for each query, where L is a data points exam-
(o1 , o2 ) where dist(o1 , o2 ) ≤ , o1 ∈ CB(p, ξ1 ) and ined by FLANN, D is dimension, and χ is a branching
o2 ∈ CB(q, ξ2 ). factor of the tree used in FLANN. Thus, the complexity
In order to detect this case effectively, a simple method is of finding blocks is about O(b0 [L D log(n)/ log(χ )]).
proposed as Algorithm 6 illustrates. First, we select point set 2) The complexity of creating a tree by FLANN from
O ⊆ CB(q, ξ2 ) such that ∀m ∈ O s.t. dp,m ≤  + ξ1 , and CBCENT is about O(b1 D log(b1 )).
point set S ⊆ CB(p, ξ1 ) such that ∀s ∈ S s.t. dp,m ≤  + ξ2 . 3) The complexity of Algorithm 4: There are two main
Then, we simply utilize brute force algorithm to check whether parts as follows.
there exist two points o ∈ O, s ∈ S that are directly a) There are b1 CBs, for each CB
density-reachable from each other, and merge two CBs if FLANN::RangeSearch is called to find its
yes. As Fig. 7 shows, set O is within the right shadow 2-neighbors from CBCENT, the complexity is
region, while S is within the left shadow region. Only points about O(b1 [L d log(b1 )/log(χ )]).

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
CHEN et al.: KNN-BLOCK DBSCAN: FAST CLUSTERING FOR LARGE-SCALE DATA 3945

b) For each CB, the total number of points in a CB Algorithm 7 Pure kNN-Based DBSCAN
is usually far less than n, i.e., MinPts << n, then 1: Input: data set P, and , MinPts;
the complexity of Algorithm 6 is averagely about 2: coreSet := {φ}
O(MinPts). 3: for each unclassified p ∈ P do
Hence, since MinPts << n can be regarded as 4: neibors:= FLANN::kNN(p,MinPts);
a constant, the complexity of Algorithm 4 is about 5: if dp,(MinPts) ≤  then
O(b1 [L D log(b1 )/ log(χ )]). 6: push p into coreSet
4) The complexity of Algorithm 5: there are also two main 7: end if
parts as follows. 8: end for
a) There are b2 NCBs. For each NCB we call 9: for each core point p ∈ coreSet do
FLANN::RangeSearch to find its (ξ1 + 1.5)- 10: neibCores := find core points from k-neighbors of p
neighbors from CBCENT, the complexity is about 11: merge neibCores and p into one cluster
O(b2 [L D log(b1 )/log(χ )]). 12: end for
b) The average complexity of assigning an unclassi- 13: for each two pair of clusters c1 and c2 do
fied point in NCBs to a cluster (from line 5 to line 14: merge c1 and c2 if ∃p1 ∈ c1 and p2 ∈ c2 s.t. p1 is
17) is about O(MinPts[L D log(b1 )/ log(χ )]). density reachable p2
Hence, the complexity of Algorithm 5 is less 15: end for
than O(b2 MinPts [L D log(b1 )/ log(χ )]) < 16: find border points and assign them
O(UCPtsNum [L D log(b1 )/ log(χ )]), where UCPtsNum is
the total number of unclassified points in all NCBs.
As mentioned above, b0 = b1 + b2 + b3 = B. Data Sets
(βn/MinPts) is far less than n provided [, MinPts]
Data sets come from UCI (https://archive.ics.uci.edu/ml/ind
are well chosen, then the overall time complexity is
ex.php), including PAM (PAMPA2), HOUSE (household),
about as O(b0 [L D log(n)/ log(χ )]) = O ([βn/MinPts]
USCENCUS (USCensus 1990), gas sensor, FMA (dataset for
[L D log(n)/ log(χ )]) < O(L D n log(n)/ log(χ )).
music analysis), AAS-1K (Amazon access samples), HIGGS,
In the case of dealing with very high dimensional data
etc., where AAS-1K is a 1000-dimensional data set which
sets, FLANN::kNN degenerates to be an O(n) algorithm, and
is extracted from 20 000-dimensional data set AAS. For each
then the complexity of KNN-BLOCK DBSCAN is about
data set, all duplicate points are removed to make each point
O(b0 [L D n/ log(χ )]).
unique, all missing values are set to 0, and each dimension of
In the worst case, if there is none CB and FLANN::kNN
these data sets is normalized to [0, 105 ]. The following part
runs in O(n), the complexity of KNN-BLOCK DBSCAN is
of this section lists brief descriptions of these data sets.
O(n2 ).
PAM39D is a real 39-dimensional dataset, PAMAP2, with
cardinality n = 3, 850, 505; PAM4D is a real dataset obtained
V. E XPERIMENTS by taking the first four principle components (PCA) of
A. Algorithms and Set Up PAMPA2; Household: dim = 7, n = 2049280; USCENCUS:
dim = 36, n = 365100; GasSensor (Ethylene-CO): dim = 16,
In this section, to evaluate the correctness and effectiveness
n = 4208261; MoCap: dim = 36, n = 65536; APS (APS
of the proposed approach, several experiments are conducted
Failure at Scania Trucks): dim = 170, n = 30000; Font
on different data sets at Intel Core i7-3630 CPU @2.50 GHz,
(CALIBRI): dim = 36, n = 19068; HIGGS: dim = 28,
8G RAM. We mainly compare the proposed algorithm with
n = 11000000; FMA: dim = 512, n = 106574; AAS − 1K:
ρ-approximate DBSCAN, AnyDBC [28] and pure kNN-based
AAS is a large sparse data set, and AAS-1K is a subset extracted
DBSCAN.
from AAS with dim = 1000, n = 30000.
1) “KNN-BLOCK” is KNN-BLOCK DBSCAN which is
coded in C++ and runs on Windows 10 64-bit operat-
ing system, the tree used in FLANN is priority search C. Two Examples of Clustering
k-means tree, and the cluster number χ of k-means is We benchmark KNN-BLOCK DBSCAN on two 2-D test
10. cases to reveal the processes in detail, as shown in Fig. 8.
2) Approx is ρ-approximate DBSCAN which is also writ- The left case is aggregation [44], and the right case comes
ten in C++ and runs on Linux (Ubuntu 14.04 LTS) from [1].
operating system. Specifically, in Fig. 8(a) presents the original data distribu-
3) AnyDBC is the efficient anytime density-based cluster- tion. Fig. 8(b) draws CBs, NCBs, and NOBs plotted by black,
ing algorithm [28]. green, and red circles, respectively. The radius of each circle is
4) kNN-based DBSCAN is an algorithm which only uses different, which means each block has a different size. We also
FLANN::kNN technique to accelerate DBSCAN, as can see that NCBs usually distribute along the border of CBs,
shown in Algorithm 7, and the complexity is about and NOBs appears far from CBs. Fig. 8(c) illustrates the result
O(L D n log(n)/ log(χ )), where L is a data points exam- of merging CBs, which is the most important step to identify
ined by FLANN, D is dimension, and χ is a branching clusters; in Fig. 8(d), as mentioned in Section IV-C3, there
factor of the tree used in FLANN. are three cases to process NCBs: the green circles represent

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
3946 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 51, NO. 6, JUNE 2021

30 30 30 0.5 0.5 0.5

20 20 20

0 0 0
10 10 10

0 0 0
0 20 40 0 20 40 0 20 40 −0.5 −0.5 −0.5
−0.5 0 0.5 −0.5 0 0.5 −0.5 0 0.5
(a) (b) (c) (a) (b) (c)
30 30 30
0.5 0.5 0.5

20 20 20

0 0 0
10 10 10

0 0 0
0 20 40 0 20 40 0 20 40 −0.5 −0.5 −0.5
−0.5 0 0.5 −0.5 0 0.5 −0.5 0 0.5
(d) (e) (f) (d) (e) (f)

Fig. 8. Two examples present the processes of KNN-BLOCK DBSCAN. (a) is the original data distribution; (b) shows 3 kinds of blocks found by
KNN-BLOCK DBSCAN, where black circles are core blocks, green circles are NCBs and red are NOBs; (c) illustrates clusters found after merging CBs;
(d) addresses the assignment of NCBs to corresponding clusters, the red balls are NCBs that can be assigned to their nearest clusters, and green circles in
(d) are those who find no cluster to assign; (e) exhibits the final result of KNN-BLOCK DBSCAN, where black points are noise; (f) demonstrates the result
of original DBSCAN.
TABLE II
RUNTIME C OMPARISONS ON S UBSETS OF HOUSE AND PAM W ITH n I NCREASING . T HE S PEEDUP OF KNN-BLOCK
DBSCAN OVER I TS C OMPETITOR I S G IVEN IN B RACKETS . (U NIT: S ECOND )

case (1), because they are far from all core-points, all points proposed algorithm and ρ-approximate DBSCAN with dif-
within these NCBs are classified as noise; the red balls illus- ferent [, MinPts]. Figs. 9 and 10 present the results of two
trate case (2), each of them is assigned to one cluster from algorithms, and Table II reveals more details. We also con-
which it is density-reachable; in case (3), for each point p duct experiments on the whole data sets of HOUSE, PAM4D,
within unclassified NCBs, if q is identified as a core from KDD04, USCENCUS, REACTION, MOPCAP, BODONI,
which p is density-reachable, then p is classified to the clus- HIGGS, FMA, and AAS1K, respectively, and Table III shows
ter of q. Fig. 8(e) exhibits the final result of KNN-BLOCK the comparison of runtime with different [, MinPts].
DBSCAN, where black points are noise; and Fig. 8(f) shows From two figures and two tables, we can observe as follows.
the result obtained by original DBSCAN. 1) Both algorithms prefer large  and small MinPts. For
It is observed that KNN-BLOCK DBSCAN nearly obtains example, on data set HouseHolod, both KNN-BLOCK
the same result as DBSCAN with high efficiency, because DBSCAN and ρ-approximate DBSCAN run best when
it processes data by blocks, and reduces a large number of [, MinPts] = [5000, 100], and the worst case happens
redundant distance computations. when [, MinPts] = [1000, 200]. On other data sets,
things are similar as shown in Table III.
2) Both algorithms run in linear expected time in low
D. Runtime Comparisons With ρ-Approximate DBSCAN dimensional data sets.
The first experiment is conducted on a set of subsets 3) We can see that on large-scale data sets
of HOUSE and PAM4D to observe the complexities of the PAM4D, HOUSEHOLD, and HIGGS, our algorithm

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
CHEN et al.: KNN-BLOCK DBSCAN: FAST CLUSTERING FOR LARGE-SCALE DATA 3947

TABLE III
RUNTIME C OMPARISONS ON D IFFERENT DATA S ETS W ITH D IFFERENT  AND M IN P TS . T HE S PEEDUP OF
KNN-BLOCK DBSCAN OVER I TS C OMPETITOR I S G IVEN IN B RACKETS . (U NIT: S ECOND )

is much better, the speedup of KNN-BLOCK


DBSCAN over its competitor is about 2.5–6 times
on HOUSEHOLD, 1.4–3 times on PAM4D, and 16
and 17 times on HIGGS (28 dim). On other relative
high-dimensional data sets, e.g., MOPCAP (36 dim)
APS (170 dim), BODONI (256 dim), FMA (512 dim),
and AAS-1K(1000 dim) KNN-BLOCK DBSCAN still
performs well, while ρ-approximate degenerates to be
an O(n2 ) algorithm which conforms to our analysis
mentioned in Section II. It is also notable that the
Fig. 9. Runtime comparisons on subsets of HOUSE with n increasing.
performance of KNN-BLOCK DBSCAN drops with
the dimension, e.g., the proposed algorithm spends
much more time on HIGGS than that on PAM4D, and E. Runtime Comparisons With AnyDBC
the  should be relatively larger in high dimension than To make comparisons with AnyDBC, we conduct exper-
that of low dimension. iments on the two same data sets, namely, GasSensor
From these experiments, we can see that KNN-BLOCK (Ethylence-co) and PAM39D, as shown in Fig. 11 (the result
DBSCAN accelerates ρ-approximate DBSCAN greatly, and of AnyDBC is obtained by running the binary program
is promising for processing such large-scale data. provided by the authors on our machine). It is observed

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
3948 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 51, NO. 6, JUNE 2021

Fig. 10. Runtime comparisons on subsets of PAM4D with n increasing.

TABLE IV
RUNTIME C OMPARISONS W ITH P URE K NN-BASED DBSCAN

Fig. 11. Runtime comparisons with AnyDBC and ρ-approximate DBCAN


on Gas Sensor and PAM39D, MinPts is fixed to 50.

that KNN-BLOCK DBSCAN outperforms AnyDBC and


ρ-approximate DBSCAN, especially, on PAM39D KNN-
BLOCK DBSCAN runs far faster than AnyDBC.

F. Runtime Comparisons With Pure kNN-Based DBSCAN


In this part, KNN-BLOCK DBSCAN is compared with
pure kNN-based DBSCAN on some data sets, and the results
are shown in Table IV. From this table, we can see that
KNN-BLOCK DBSCAN runs far faster than pure kNN-based
algorithm, and the speedup varies from 1.42 to 5.48. Clearly, Fig. 12. Runtime distributions with the changing of  and MinPts on PAM4D
in most cases, the speedup is larger than 2, which proves that and HOUSEHOLD, respectively.
the block techniques plays an important role in our algorithm,
and greatly speedup DBSCAN.
cardinality 3 850 505 and the dimension is 4. Table V reveals
G. Effect of  and MinPts the execution details of kNN, MergeCB (Algorithm 2) and
In this section, we check the effect of [, MinPts] on the AssignNCB (Algorithm 3), as well as the numbers of CBs,
proposed algorithm. PAM4D is used in this experiment with NCBs, and NOBs.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
CHEN et al.: KNN-BLOCK DBSCAN: FAST CLUSTERING FOR LARGE-SCALE DATA 3949

TABLE V
E XECUTION T IMES OF K NN, M ERGE CB, AND A SSIGN NCB, AS W ELL AS B LOCKS F OUND ON PAM W ITH D IFFERENT [, MinPts]

TABLE VI TABLE VIII


T OTAL N UMBERS OF CB S , NCB S , AND NOB S F OUND ON D IFFERENT C OMPARISONS OF NMI FOR KNN-BLOCK DBSCAN AND
DATA S ETS ρ-A PPROXIMATE DBSCAN (n = 5000)

TABLE VII
C OMPARISONS OF Omega-index FOR KNN-BLOCK DBSCAN AND
ρ-A PPROXIMATE DBSCAN (n = 5000)

TABLE IX
E XAMPLE OF C OMPUTING P RECISION FOR KNN-BLOCK DBSCAN
BASED ON T HREE M ATCHED L ABEL PAIRS : (“A1,” “B2”), (“A2,” “B1”),
AND (“A3,” “B4”) F OUND BY K UHN –M UNKRAS

times of kNN and 2) KNN-BLOCK DBSCAN prefers large


 and small MinPts, which yields less executions of kNN due
to larger number of CBs identified.
1) In the case of MinPts is small and  is large, most
As the two bold columns show, the execution times of kNN blocks will be identified as CBs, and the number is about
is the same as the number of blocks found by KNN-BLOCK N/MinPts. For example, as Table V shows, [7000, 100]
DBSCAN. It is observed: 1) the runtime and execution times runs fastest, followed by [5000, 100], then [3000, 100],
of kNN linearly increase with MinPts; 2) while the execution and then [1000, 100].
times of MergeCB rapidly decrease with MinPts; and 3) the 2) When MinPts is large and  is small, few CBs are
less CBs the more NCBs and NOBs. found, thus kNN will be called more frequently, and
Fig. 12 also provides more details of the runtime distribution it will degenerate to be an O(n2 ) algorithm in the worst
on PAM4D and HOUSEHOLD with the changing of  and case. As shown in Table III, when the parameters are
MinPts, respectively. Hence, we can infer: 1) the complexity [1000, 30 000] and [3000, 30 000], the runtime is much
of KNN-BLOCK DBSCAN mainly depends on the execution longer than others.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
3950 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 51, NO. 6, JUNE 2021

TABLE X
ACCURACY, R ECALL , AND F1-S CORE OF KNN-BLOCK DBSCAN AND ρ-A PPROXIMATE
DBSCAN ON S UBSETS OF HOUSE, PAM4D, MOPCAP, AND APS

H. Statistics of Three Kinds of Blocks J. Accuracy of KNN-BLOCK DBSCAN


In this section, to observe the numbers of three kinds To evaluate the accuracy of KNN-BLOCK DBSCAN, some
of blocks with respect to different  and MinPts, some experiments are conducted based on an assumption that the
experiments are conducted on some whole data sets, includ- clustering labels obtained by DBSCAN are ground truth. The
ing HOUSE, PAM4D, KDD04, USCENCUS, REACTION, reason is as follows.
MOPCAP, and BODONI, respectively. 1) This article is only motivated to accelerate the speed
Table VI exhibits some statistics of CBs, NCBs, and NOBs of DBSCAN, without concerning the clustering result
with respect to different  and MinPts on all data sets. From is good or not which is another topic out of the scope
this table, we can see: the total number of blocks, especially of this article. It is expected that the clustering results
CBs, is far less than the cardinality n, which reveals that many should be the same as original DBSCAN provided the
distance computations are filtered. parameters (, MinPts) are the same.
2) Both KNN-BLOCK DBSCAN and ρ-approximate
I. Omega-Index and NMI Evaluations DBSCAN are approximate algorithms, the more simi-
Omega-Index [45] and normalized mutual information lar of their clustering results to the original DBSCAN,
(NMI) [46] are two well known methods to evaluate clustering the better. Hence, we argue it is reasonable to use the
result, then similar to [47], we use them to make comparisons clustering results of DBSCAN as ground truth.
for KNN-BLOCK DBSCAN and ρ-approximate DBSCAN. Specifically, the idea is that each data point belongs to a
Because the complexities of Omega-index and NMI are high unique predefined cluster and its predicted cluster should cor-
(O(n2 )), we only conduct experiments on sub sets of HOUSE, respond either to only one predefined cluster or to none [11].
PAM4D, MOPCAP, and APS with n = 5000. Any pair of data points in the same predefined cluster is con-
In these experiments, we compute the Omega-index sidered to be incorrectly clustered if the predicted cluster does
and NMI scores of both algorithms by comparing the not match the predefined cluster to which they belong, even
results with those obtained from the original DBSCAN. As if both points appear in the same predicted cluster. Therefore,
Tables VII and VIII show the performances of both algorithms we evaluate the precision of two approaches as follows.
are similar, and the results are all close to 1, which indicate Step 1 (Clustering): Given a data set and [, MinPts], sup-
that both algorithms nearly agree on original DBSCAN. pose Lab1 = {A1 , A2 , . . . , Ak } and Lab2 = {B1 , B2 , . . . , Bm }

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
CHEN et al.: KNN-BLOCK DBSCAN: FAST CLUSTERING FOR LARGE-SCALE DATA 3951

are clustering labels obtained by the DBSCAN and KNN- includes core points, border points, and noises, respectively.
BLOCK DBSCAN. Then, we proposed an algorithm to merge CBs that are density-
Step 2 (Matching): It is well known that different clustering reachable from each other and assign each point in NCBs to
algorithms may yield different labels on the same data set. For a proper cluster.
example, cluster “A1” labeled by DBSCAN may be the same The superiority of KNN-BLOCK DBSCAN to
as “B2” obtained by KNN-BLOCK DBSCAN. Hence, it is ρ-approximate DBSCAN is that it processes data by blocks,
reasonable to match labels first, and use the matched labels each of which has a dynamic range, instead of grids used
to compute Accuracy. In this article, Kuhn–Munkras [48] per- in ρ-approximate DBSCAN with a fixed width, and fast the
forms the task of maximum matching two different cluster kNN technique is used to identify the types of points. Given a
label sets, which has been used in [11] and [49]. fixed intrinsic dimensionality, the complexity of the proposed
Step 3 (Computing Accuracy): Suppose there are three clus- algorithm is about O([βn/MinPts][L D log(n)/ log(χ )])
ters with labels “A1,” “A2,” and “A3” obtained by DBSCAN where L is a constant, D is dimension, β is a factor of data
on one data set, but KNN-BLOCK DBSCAN labels them distribution, and χ is the branching factor of the tree used in
with “B1,” “B2,” “B3,” and “B4,” and Kuhn–Munkras finds FLANN.
there are three matched pairs: (“A1,”‘ “B2,”) (“A2,” “B1,”) and Experiments address that KNN-BLOCK DBSCAN runs
(“A3,” “B4.”) If the labels of point p obtained by DBSCAN faster than ρ-approximate DBSCAN and pure kNN-based
and KNN-BLOCK DBSCAN match, then the prediction of p DBSCAN with high accuracy, even on some relative high-
is correct, e.g., (“A1” and “B2,”) otherwise it is wrong, e.g., dimensional data sets, e.g., APS (170 dim), BONONI
(“A1” and “B1”). Table IX shows more details. Suppose there (256 dim), FMA (512 dim), and AAS-1K (1000 dim), where
are eight points in the data set, the second row lists labels ρ-approximate DBSCAN degenerates to be an O(n2 ) algo-
obtained by DBSCAN, and the third line is the clustering rithm, KNN-BLOCK DBSCAN can still run very fast.
result of KNN-BLOCK DBSCAN. We can see that there are Our future work is to improve the proposed algorithm and
two cases that are wrongly predicated because (A1, B4) and apply it in real applications in the following aspects.
(A2, B3) are not matched pairs. Therefore, the total precision 1) Try to use other precise the kNN technique, such as
is (8 − 2)/8 = 75%. cover tree, semi-convex hull tree [36], etc., to improve
Because the original DBSCAN has high complexity, we the accuracy of KNN-BLOCK DBSCAN.
only test on small data sets. Here, we extract four subsets from 2) Parallelize KNN-BLOCK DBSCAN on GPUs with a
HOUSE, PAM4D, APS, and MOCAP, and use them as test highly efficient strategy for scheduling data to make the
cases. Also because DBSCAN is nondeterministic (sensitive proposed algorithm faster.
to iteration order), some border points may be assigned to dif- 3) Apply it in our other researches, such as image
ferent clusters according to the order they appear. Therefore, retrieval [50], vehicle reidentification [51], [52], vehi-
the accuracy is computed only by comparing core points. cle crushing analysis [53], and auditing for shared cloud
Table X shows that both algorithms achieve high accuracy. data [54]–[56].
In low-dimensional data sets (HOUSE and PAM4D), the
precision, recall, and F1-score of both approximate algo- R EFERENCES
rithms are about 98%–100%, and there is only a little drop [1] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern
in high-dimensional data sets (MOCAP and APS) which are Recognit. Lett., vol. 31, no. 8, pp. 651–666, 2010.
about 94.5%–97.7%. [2] A. Likas, N. Vlassis, and J. J. Verbeek, “The global k-means clustering
algorithm,” Pattern Recognit., vol. 36, no. 2, pp. 451–461, 2003.
[3] Y. Cheng, “Mean shift, mode seeking, and clustering,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 17, no. 8, pp. 790–799,
VI. C ONCLUSION Aug. 1995.
[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm
DBSCAN runs in O(n2 ) expected time and is not suitable for discovering clusters in large spatial databases with noise,” in Proc.
for large-scale data. ρ-approximate DBSCAN is designed to KDD, vol. 96, 1996, pp. 226–231.
replace with DBSCAN for big data, however, it only can work [5] U. Von Luxburg, “A tutorial on spectral clustering,” Stat. Comput.,
vol. 17, no. 4, pp. 395–416, 2007.
in a very low dimension. In this article, we analyze the under- [6] H. Chang and D.-Y. Yeung, “Robust path-based spectral clustering,”
lying causes that current approaches fail in clustering large Pattern Recognit., vol. 41, no. 1, pp. 191–203, 2008.
scale data, and find that the grid technique is nearly useless [7] W. Fan, H. Sallay, and N. Bouguila, “Online learning of hierarchical
Pitman–Yor process mixture of generalized Dirichlet distributions with
for high-dimensional data. feature selection,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 9,
Aiming to tame problems mentioned above, an approxi- pp. 2048–2061, Sep. 2017.
mate approach named KNN-BLOCK DBSCAN is proposed [8] W. Fan, N. Bouguila, J. Du, and X. Liu, “Axially symmetric data cluster-
ing through Dirichlet process mixture models of Watson distributions,”
for large-scale data based on two findings: 1) the key IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 6, pp. 1683–1694,
of DBSCAN to find core points is a kNN problem Jun. 2019.
in essence and 2) a point has similar density distribu- [9] L. Duan, S. Cui, Y. Qiao, and B. Yuan, “Clustering based on super-
vised learning of exemplar discriminative information,” IEEE Trans.
tion to its neighbors, which implies it is highly possible Syst., Man, Cybern., Syst., to be published.
that a point has the same type (core/border/noise) as its [10] D. Cheng, Q. Zhu, J. Huang, Q. Wu, and L. Yang, “A novel cluster
neighbors. validity index based on local cores,” IEEE Trans. Neural Netw. Learn.
Syst., vol. 30, no. 4, pp. 985–999, Apr. 2019.
Therefore, we argue that the kNN technique, e.g., FLANN, [11] Y. Chen et al., “Decentralized clustering by finding loose and distributed
can be utilized to identify CBs, NCBs, and NOBs, which only density cores,” Inf. Sci., vols. 433–434, pp. 649–660, Apr. 2018.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
3952 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 51, NO. 6, JUNE 2021

[12] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” [39] C. Silpa-Anan and R. Hartley, “Optimised KD-trees for fast image
ACM Comput. Surveys, vol. 31, no. 3, pp. 264–323, 1999. descriptor matching,” in Proc. IEEE Conf. Comput. Vis. Pattern
[13] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics: Recognit. (CVPR), 2008, pp. 1–8.
Ordering points to identify the clustering structure,” in Proc. ACM [40] Y. Chen, S. Tang, N. Bouguila, C. Wang, J. Du, and H. L. Li, “A fast
SIGMOD Rec., vol. 28, 1999, pp. 49–60. clustering algorithm based on pruning unnecessary distance computa-
[14] A. Rodriguez and A. Laio, “Clustering by fast search and find of density tions in DBSCAN for high-dimensional data,” Pattern Recognit., vol. 83,
peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014. pp. 375–387, Nov. 2018.
[15] Y. Chen et al., “Fast density peak clustering for large scale data based [41] A. Karami and R. Johansson, “Choosing DBSCAN parameters automat-
on KNN,” Knowl. Based Syst., vol. 187, Jan. 2020, Art. no. 104824. ically using differential evolution,” Int. J. Comput. Appl., vol. 91, no. 7,
[16] D. Cheng, Q. Zhu, J. Huang, Q. Wu, and Y. Lijun, “Clustering with pp. 1–11, 2014.
local density peaks-based minimum spanning tree,” IEEE Trans. Knowl. [42] H. Zhou, P. Wang, and H. Li, “Research on adaptive parameters deter-
Data Eng., to be published. mination in DBSCAN algorithm,” J. Xian Univ. Technol., vol. 9, no. 7,
[17] J. Gan and Y. Tao, “DBSCAN revisited: Mis-claim, un-fixability, and pp. 1967–1973, 2012.
approximation,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2015, [43] F. O. Ozkok and M. Celik, “A new approach to determine eps parameter
pp. 519–530. of DBSCAN algorithm,” Int. J. Intell. Syst. Appl. Eng., vol. 4, no. 5,
pp. 247–251, 2017.
[18] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for
[44] A. Gionis, H. Mannila, and P. Tsaparas, “Clustering aggregation,” in
high dimensional data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36,
Proc. Int. Conf. Data Eng. (ICDE), 2005, pp. 341–352.
no. 11, pp. 2227–2240, Nov. 2014.
[45] L. M. Collins and C. W. Dent, “Omega: A general formulation of
[19] J. L. Bentley, “Multidimensional binary search trees used for associative
the rand index of cluster recovery suitable for non-disjoint solutions,”
searching,” Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.
Multivariate Behav. Res., vol. 23, no. 2, pp. 231–242, 1988.
[20] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest [46] A. Strehl and J. Ghosh, “Cluster ensembles: A knowledge reuse frame-
neighbor,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 97–104. work for combining partitionings,” in Proc. 18th Nat. Conf. Artif. Intell.,
[21] A. Gunawan and M. de Berg, “A faster algorithm for DBSCAN,” Ph.D. 2002, pp. 93–99.
dissertation, Dept. Math. Comput. Sci., Univ. Eindhoven, Eindhoven, [47] M. A. Patwary, D. Palsetia, A. Agrawal, W.-K. Liao, F. Manne, and
The Netherlands, 2013. A. Choudhary, “Scalable parallel optics data clustering using graph algo-
[22] V. Chaoji, M. Al Hasan, S. Salem, and M. J. Zaki, “SPARCL: Efficient rithmic techniques,” in Proc. Int. Conf. High Perform. Comput. Netw.
and effective shape-based clustering,” in Proc. 8th IEEE Int. Conf. Data Storage Anal. (SC), 2013, pp. 1–12.
Min., 2008, pp. 93–102. [48] H. W. Kuhn, “The Hungarian method for the assignment problem,”
[23] E. H.-C. Lu, V. S. Tseng, and P. S. Yu, “Mining cluster-based tempo- Naval Res. Logist. Quart., vol. 2, nos. 1–2, pp. 83–97, 1955.
ral mobile sequential patterns in location-based service environments,” [49] Y. Chen, S. Tang, S. Pei, C. Wang, J. Du, and N. Xiong, “DHeat: A
IEEE Trans. Knowl. Data Eng., vol. 23, no. 6, pp. 914–927, Jun. 2011. density heat-based algorithm for clustering with effective radius,” IEEE
[24] S. K. Pal and P. Mitra, Pattern Recognition Algorithms for Data Mining. Trans. Syst., Man, Cybern., Syst., vol. 48, no. 4, pp. 649–660, Apr. 2018.
Boston, MA, USA: CRC Press, 2004. [50] X. Liu, Z. Hu, H. Ling, and Y. Cheung, “MTFH: A matrix tri-
[25] S. Mahran and K. Mahar, “Using grid for accelerating density-based factorization hashing framework for efficient cross-modal retrieval,”
clustering,” in Proc. 8th IEEE Int. Conf. Comput. Inf. Technol. (CIT), IEEE Trans. Pattern Anal. Mach. Intell., to be published.
2008, pp. 35–40. [51] J. Hou, H. Zeng, L. Cai, J. Zhu, J. Chen, and K.-K. Ma,
[26] K. Sonal, G. Poonam, S. Ankit, K. Dhruv, S. Balasubramaniam, and “Multi-label learning with multi-label smoothing regularization
N. Goyal, “Exact, fast and scalable parallel DBSCAN for commodity for vehicle re-identification,” Neurocomputing, vol. 345, pp. 15–22,
platforms,” in Proc. 18th Int. Conf. Distrib. Comput. Netw., 2017, p. 14. Jun. 2019.
[27] X. Chen, Y. Min, Y. Zhao, and P. Wang, “GMDBSCAN: Multi-density [52] J. Zhu et al., “Vehicle re-identification using quadruple directional deep
DBSCAN cluster based on grid,” in Proc. IEEE Int. Conf. e-Bus. Eng., learning features,” IEEE Trans. Intell. Transp. Syst., to be published.
2008, pp. 780–783. [53] Y. Zhang, X. Xu, J. Wang, T. Chen, and C. H. Wang, “Crushing
[28] S. T. Mai, I. Assent, and M. Storgaard, “AnyDBC: An efficient anytime analysis for novel bio-inspired hierarchical circular structures sub-
density-based clustering algorithm for very large complex datasets,” in jected to axial load,” Int. J. Mech. Sci., vol. 140, pp. 407–431,
Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., 2016, May 2018.
pp. 1025–1034. [54] H. Tian, F. Nan, C.-C. Chang, Y. Huang, J. Lu, and Y. Du,
[29] B. Borah and D. K. Bhattacharyya, “An improved sampling-based “Privacy-preserving public auditing for secure data storage in fog-
DBSCAN for large spatial databases,” in Proc. Int. Conf. Intell. Sens. to-cloud computing,” J. Netw. Comput. Appl., vol. 127, pp. 59–69,
Inf. Process., 2004, pp. 92–96. Feb. 2019.
[30] C.-F. Tsai and C.-W. Liu, “KIDBSCAN: A new efficient data clus- [55] H. Tian, F. Nan, H. Jiang, C.-C. Chang, J. Ning, and Y. Huang,
tering algorithm,” in Proc. Artif. Intell. Soft Comput. (ICAISC), 2006, “Public auditing for shared cloud data with efficient and secure group
pp. 702–711. management,” Inf. Sci., vol. 472, pp. 107–125, Jan. 2019.
[56] H. Tian et al., “Public audit for operation behavior logs with error locat-
[31] C. Tsai and T. Huang, “QIDBSCAN: A quick density-based cluster-
ing in cloud storage,” Soft Comput., vol. 23, no. 11, pp. 3779–3792,
ing technique,” in Proc. Int. Symp. Comput. Consum. Control, 2012,
Jun. 2019.
pp. 638–641.
[32] A. Bryant and K. Cios, “RNN-DBSCAN: A density-based clustering
algorithm using reverse nearest neighbor density estimates,” IEEE Trans.
Knowl. Data Eng., vol. 30, no. 6, pp. 1109–1121, Jun. 2018.
[33] A. Lulli, M. Dell’Amico, P. Michiardi, and L. Ricci, “NG-DBSCAN:
Scalable density-based clustering for arbitrary data,” Proc. VLDB
Endow., vol. 10, no. 3, pp. 157–168, 2016. Yewang Chen received the B.S. degree in man-
[34] F. Gieseke, J. Heinermann, C. E. Oancea, and C. Igel, “Buffer KD trees: agement of information system from Huaqiao
Processing massive nearest neighbor queries on GPUs,” in Proc. ICML, University, Quanzhou, China, in 2001, and the
2014, pp. 172–180. Ph.D. degree in software engineering from Fudan
[35] Y. Chen, L. Zhou, Y. Tang, N. Bouguila, and H. Wang, “Fast neighbor University, Shanghai, China, in 2009.
search by using revised k-d tree,” Inf. Sci., vol. 472, pp. 145–162, 2019. He is currently an Associate Professor with
[36] Y. Chen, L. Zhou, and N. Bouguila, “Semi-convex hull tree: Fast nearest the School of Computer Science and Technology,
neighbor queries for large scale data on GPUs,” in Proc. IEEE Int. Conf. Huaqiao University, and the Fujian Key Laboratory
Data Min., 2018, pp. 911–916. of Big Data Intelligence and Security, Huaqiao
[37] J. Wang et al., “Trinary-projection trees for approximate nearest neigh- University (Xiamen Campus), Xiamen, China. He
bor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, is also with the Beijing Key Laboratory of Big Data
pp. 388–403, Feb. 2014. Technology for Food Safety, Beijing Technology and Business University,
[38] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with Beijing, China, and the Provincial Key Laboratory for Computer Information
automatic algorithm configuration,” in Proc. Int. Conf. Comput. Vis. Processing Technology, Soochow University, Suzhou, China. His current
Theory Appl. (VISSAPP), 2009, pp. 331–340. research interests include machine learning and data mining.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.
CHEN et al.: KNN-BLOCK DBSCAN: FAST CLUSTERING FOR LARGE-SCALE DATA 3953

Lida Zhou received the B.S. degree in computer Xin Liu (M’08) received the M.S. degree in applied
science from the College of Computer Science mathematics from Hubei University, Wuhan, China,
and Technology, Central China Normal University, in 2009, and the Ph.D. degree in computer science
Wuhan, China, in 2012. He is currently pursu- from Hong Kong Baptist University, Hong Kong, in
ing the Post-Graduation degree with the School 2013.
of Computer Science and Technology, Huaqiao He was a Visiting Scholar with the Computer
University (Xiamen Campus), Xiamen, China. and Information Sciences Department, Temple
His current research interests is machine learning University, Philadelphia, PA, USA, from 2017 to
and pattern recognition. 2018. He is currently an Associate Professor with the
Department of Computer Science and Technology,
Huaqiao University, Quanzhou, China, and also with
the State Key Laboratory of Integrated Services Networks, Xidian University,
Xi’an, China. His present research interests include multimedia analysis,
Songwen Pei (SM’19) received the B.S. degree in computer vision, pattern recognition, and machine learning.
computer science from the National University of
Defence and Technology, Changsha, China, in 2003,
the M.S. degree in computer science from Guizhou
University, Guiyang, China, in 2006, and the Ph.D.
degree in computer science from Fudan University,
Shanghai, China, in 2009.
He is currently an Associate Professor with
the University of Shanghai for Science and
Technology, Shanghai. Since 2011, he has been a
Guest Researcher with the Institute of Computing
Technology, Chinese Academy of Sciences, Beijing, China, a Research
Scientist with the University of California at Irvine, Irvine, CA, USA from
2013 to 2015 and the Queensland University of Technology, Brisbane, QLD,
Australia, in 2017. His research interests include heterogeneous multicore Jixiang Du received the B.Sc. and M.Sc. degrees
system, cloud computing, and big data. in vehicle engineering from the Hefei University
Dr. Pei is a board member of CCF-TCCET and CCF-TCARCH. He is a of Technology, Hefei, China, in September 1999
member of ACM and CCF in China. and July 2002, respectively, and the Ph.D. degree
in pattern recognition and intelligent system from
the University of Science and Technology of China,
Hefei, in December 2005.
He is currently a Professor with the College
Zhiwen Yu (SM’14) received the Ph.D. degree of Computer Science and Technology, Huaqiao
in computer science from the City University of University, Quanzhou, China.
Hong Kong, Hong Kong, in 2008.
He is a Professor with the School of Computer
Science and Engineering, South China University
of Technology, Guangzhou, China, from 2015 to
2019. He has been published more than 140 referred
journal papers and international conference papers,
including 40 IEEE T RANSACTIONS papers. His
research areas focus on data mining, machine learn-
ing, pattern recognition, and intelligent computing.
Prof. Yu is a Distinguishable Member of China Computer Federation and
the Vice Chair of ACM Guangzhou Chapter. He is a Senior Member of ACM.

Naixue Xiong (SM’12) received the first Ph.D.


degree in software engineering from Wuhan
Yi Chen received the Ph.D. degree in computer University, Wuhan, China, in 2007, and the second
science from the Beijing Institute of Technology, Ph.D. degree in dependable networks from the Japan
Beijing, China, in 2002. Advanced Institute of Science and Technology,
She is currently a Professor of computer science Nomi, Japan, in 2007.
with Beijing Technology and Business University, He worked with Colorado Technical University,
Beijing, where she is the Director of Beijing Colorado Springs, CO, USA, Wentworth Technology
Key Laboratory of Big Data Technology for Food Institution, Boston, MA, USA, and Georgia State
Safety. Her research interests mainly focuses on University, Atlanta, GA, USA, for many years. He
information visualization, visual analytics and big is currently a Professor with Northeastern State
data technology for food quality and safety, includ- University, Tahlequah, OK, USA. His research interests include cloud comput-
ing high-dimensional, hierarchical, spatio-temporal, ing, security and dependability, parallel and distributed computing, networks,
and graph data visual analytics. and optimization theory.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on September 25,2024 at 10:49:07 UTC from IEEE Xplore. Restrictions apply.

You might also like