Parallel DBSCAN with Priority R-tree

Min Chen, XueDong Gao
School of Economics and Management, University of Science and Technology Beijing, Beijing, P.R. China
[email protected]

HuiFei Li
Software Group, IBM Global Services (China) Company Limited, Beijing, P.R. China
[email protected]

Abstract: To address the efficiency bottleneck of the DBSCAN algorithm, we present P-DBSCAN, a novel parallel version of this algorithm for distributed environments. The database is separated into several parts, the computer nodes carry out clustering independently, and the sub-results are then aggregated into one final result. P-DBSCAN achieves good results and much better efficiency than DBSCAN. Experiments show that P-DBSCAN reduces the run time by more than 40% on one PC and by more than 60% on two PCs. In addition, the parallel algorithm scales much better than DBSCAN, so it can be used for clustering analysis in huge databases.

Keywords: clustering; DBSCAN algorithm; parallel

The National Natural Science Foundation of China

978-1-4244-5265-1/10/$26.00 ©2010 IEEE

I. INTRODUCTION

Clustering analysis is one of the most important tasks of data mining. DBSCAN is a classic clustering algorithm for spatial databases. The key idea of DBSCAN [1] is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points, i.e. the density in the neighborhood has to exceed some threshold. DBSCAN can discover clusters of arbitrary shape; however, on huge spatial databases both its memory usage and its computational cost are high. In addition, the spatial access method it relies on, the R*-tree, is not always efficient.

To address this efficiency bottleneck, we present P-DBSCAN, a novel parallel version of the algorithm for distributed environments such as computer clusters. P-DBSCAN adopts a better spatial index, the Priority R-tree, or PR-tree, which is the first R-tree variant that is not only practically efficient but also provably asymptotically optimal. The database is separated into several parts, each computational node builds a PR-tree and carries out clustering independently, and the sub-results are then aggregated into one final result. P-DBSCAN achieves good results and much better efficiency than DBSCAN. In addition, the parallel algorithm scales much better than DBSCAN.

The rest of this paper is organized as follows. In the next section we introduce and analyze the DBSCAN algorithm. In Section 3 we present the novel parallel algorithm, P-DBSCAN. Section 4 gives experimental results and their analysis. The paper concludes in Section 5.

II. DBSCAN ALGORITHM

Here are some basic notions of the DBSCAN algorithm [1]. D is the database of points.

Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point p, denoted by $N_{Eps}(p)$, is defined by $N_{Eps}(p) = \{ q \in D \mid dist(p,q) \le Eps \}$.

Definition 2: (core point) q is a core point wrt. Eps, MinPts if $N_{Eps}(q)$ contains at least MinPts points.

Definition 3: (directly density-reachable) A point p is directly density-reachable from a point q wrt. Eps, MinPts if
1) $p \in N_{Eps}(q)$
2) $|N_{Eps}(q)| \ge MinPts$.

Definition 4: (density-reachable) A point p is density-reachable from a point q wrt. Eps and MinPts if there is a chain of points $p_1, \ldots, p_n$, $p_1 = q$, $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$.

Definition 5: (density-connected) A point p is density-connected to a point q wrt. Eps and MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.

Definition 6: (cluster) A cluster C wrt. Eps and MinPts is a non-empty subset of D satisfying the following conditions:
1) $\forall p, q$: if $p \in C$ and q is density-reachable from p wrt. Eps and MinPts, then $q \in C$.
2) $\forall p, q \in C$: p is density-connected to q wrt. Eps and MinPts.

Definition 7: (noise) Let $C_1, \ldots, C_k$ be the clusters of the database D wrt. $Eps_i$ and $MinPts_i$, $i = 1, \ldots, k$. Then we define the noise as the set of points in the database D not belonging to any cluster $C_i$.

To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p wrt. Eps and MinPts. If p is a core point, this procedure yields a cluster C wrt. Eps and MinPts, and all the points of $N_{Eps}(p)$ belong to it. If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.

The average run time complexity of DBSCAN is $O(n^2)$, where n is the total number of objects in the database. When a spatial access method such as the R*-tree is used to support the region queries, the time complexity drops to $O(n \log n)$.

The efficiency of DBSCAN is not good enough for the following two reasons. First, before the clustering process begins, DBSCAN needs to build the R*-tree spatial index. Second, DBSCAN needs the k-dist graph to determine appropriate values of its two input parameters, Eps and MinPts. Both processes are very time consuming, and when the database is huge both the memory usage and the computational cost are high.

In this paper, we present P-DBSCAN, a novel parallel DBSCAN, which is not only much more efficient than the DBSCAN algorithm but also scales better.

III. P-DBSCAN ALGORITHM
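As a reference point for the parallel variant, the sequential procedure of Section II (the clustering that each node later runs on its own partition) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the names `region_query` and `dbscan` are hypothetical, the neighborhood is the Euclidean Eps-neighborhood of Definition 1, and the region query is a brute-force scan rather than an index lookup, so the sketch runs in $O(n^2)$.

```python
from math import dist

NOISE = -1          # label for points that belong to no cluster (Definition 7)
UNVISITED = None    # label for points not yet processed


def region_query(points, i, eps):
    """Indices of all points in the Eps-neighborhood of points[i] (Definition 1).

    Brute-force scan; a spatial index would replace this function.
    """
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks noise points."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point: provisionally noise, may become a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core point (Definition 2): grow a new cluster from it.
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # points[j] is itself a core point, so its neighbors are
                # density-reachable (Definition 4) and join the cluster.
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```

Swapping `region_query` for an index-backed lookup changes the cost of each query but not the clustering logic, which is why the choice of spatial access method can be discussed separately from the procedure itself.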

A. Spatial Access Method and Parameter Determination

1) Priority R-tree
In DBSCAN, region queries are supported by the R*-tree [3]. However, the R*-tree and most other R-tree [2] variants do not guarantee good worst-case performance [4]. P-DBSCAN therefore uses the Priority R-tree, or PR-tree, as its spatial index. The PR-tree was proposed by Lars Arge et al. in 2004 [4]. It is the first R-tree variant that always answers a window query worst-case optimally, which makes it not only practically efficient but also provably asymptotically optimal. It performs similarly to the best known R-tree variants on real-life and relatively nicely distributed data, but outperforms them significantly on more extreme data [4].

2) Shape of Neighborhood
In most previous studies of DBSCAN the neighborhood is circular. However, the R-tree, the R*-tree and the PR-tree all split space with minimum bounding rectangles, and region queries are executed within these rectangles. P-DBSCAN therefore adopts a rectangular neighborhood instead.

3) Determination of Parameters
DBSCAN determines its two input parameters, Eps and MinPts, with the k-dist graph; in practice it eliminates the parameter MinPts by setting it to 4 for all 2-dimensional databases. However, our experiments indicate that with MinPts set to 4 the clustering results are not accurate in some cases with noise, whereas with MinPts set to 7 the results are precise. Furthermore, our experiments show that it is the value of Eps that affects the clustering speed dramatically: the higher the value, the more computation is needed. The parameter MinPts, by contrast, has almost no influence on the computational cost. The related experimental result is shown in Section IV.

B. Procedure of P-DBSCAN Algorithm

Step 1, Parameter determination. P-DBSCAN sets MinPts to 7 and then uses the 7-dist graph to determine the parameter Eps.

Step 2, Database partition. Before clustering, P-DBSCAN partitions the database into several parts as follows: it projects the database onto each coordinate axis and then partitions the database according to the distribution of the projected points. For example, the set of points depicted in Figure 1 is projected onto the X and Y coordinates separately; the result is shown in Figure 2.

Figure 1. Database

Figure 2. Projection on coordinates

If there are only two computational nodes, the database can be partitioned at point A and point B into 4 parts, and the 4 parts are then deployed to the 2 nodes (4 CPUs).

Step 3, Clustering independently. Each node builds a PR-tree and carries out the clustering independently. No further communication is necessary throughout this step.

Step 4, Sub-results aggregation. The correct clustering result is determined by merging the locally detected clusters according to the merging policy of [5].

The procedure of the P-DBSCAN algorithm is shown in Figure 3.
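The partitioning of Step 2 can be illustrated with a small sketch. The paper does not specify how the cut points A and B are chosen from the projections, so the rule used here is an assumption made only for illustration: each axis projection is turned into a histogram and the cut is placed in the emptiest interior bin, preferring bins near the middle. The names `choose_cut` and `partition` and the `bins` parameter are hypothetical, and the handling of points near a cut, which the merging step of [5] has to take care of, is not shown.

```python
def choose_cut(values, bins=32):
    """Pick a cut coordinate inside the sparsest interior histogram bin.

    Assumed heuristic: ties between equally sparse bins are broken in favor
    of the bin closest to the middle of the value range.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        k = min(int((v - lo) / width), bins - 1)   # clamp the maximum into the last bin
        counts[k] += 1
    interior = range(1, bins - 1)                  # never cut at the outer edge
    best = min(interior, key=lambda k: (counts[k], abs(k - bins / 2)))
    return lo + (best + 0.5) * width               # center of the chosen bin


def partition(points, bins=32):
    """Split 2-D points into 4 parts at one cut on X and one cut on Y."""
    cut_x = choose_cut([p[0] for p in points], bins)
    cut_y = choose_cut([p[1] for p in points], bins)
    parts = {(right, upper): [] for right in (False, True) for upper in (False, True)}
    for p in points:
        parts[(p[0] >= cut_x, p[1] >= cut_y)].append(p)
    return cut_x, cut_y, parts
```

Each of the four lists in `parts` could then be handed to a worker for Step 3. Cutting through sparse regions keeps the number of clusters that straddle a boundary small, which is the apparent motivation for partitioning according to the projected distribution.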

Figure 3. Procedure of P-DBSCAN

IV. IMPLEMENTATION AND PERFORMANCE EVALUATION

In this section, we compare the performance of P-DBSCAN and DBSCAN. The experimental configuration consists of 2 nodes in a computer cluster, interconnected by a 100 Mbps LAN. Each node runs the Windows XP operating system on a 2.0 GHz processor with 3 GB of main memory. The programs are written in Java.

A. Efficiency of P-DBSCAN Algorithm
The database with noise is depicted in Figure 1. It contains 126,862 points forming 11 clusters. The parameters Eps and MinPts are set to 3 and 7, respectively. The clustering result is shown in Figure 4.

Figure 4. Clustering result

The experimental clustering result is correct; moreover, P-DBSCAN is much more efficient, as shown in Table I.

TABLE I. EFFICIENCY COMPARISON

Algorithm    No. of PCs    Run time (s)    Run time ratio    Result
DBSCAN       1             97.015          100%              11 clusters
P-DBSCAN     1             55.422          57%               11 clusters
P-DBSCAN     2             31.812          33%               11 clusters

The experiments show that the execution time of P-DBSCAN is only about 60% of that of DBSCAN when both are executed on the same computer, and about 30% of it on two computers.

B. Scalability of P-DBSCAN Algorithm
To test the scalability of P-DBSCAN with the number of points, we report results on three datasets of different sizes. The results are shown in Figure 5. As the number of points increases, the run time of P-DBSCAN grows much more slowly than that of the DBSCAN algorithm.

Figure 5. Scalability comparison

C. Parameter Influence
We use the database depicted in Figure 1 to show how the parameters MinPts and Eps influence the computational cost. P-DBSCAN is executed on one computer, and the result is given in Table II. When Eps is set to 3, the execution time hardly changes even though MinPts varies; in contrast, when MinPts is set to 7, different values of Eps lead to very different computational costs. We can therefore conclude that it is Eps that chiefly affects the computational cost.

TABLE II. INFLUENCE OF PARAMETERS

             Eps = 1     Eps = 2      Eps = 3
MinPts = 4   -           -            54750 ms
MinPts = 7   9297 ms     23703 ms     55422 ms
MinPts = 10  -           -            56406 ms
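The percentages quoted for Table I follow directly from the measured run times. The short sketch below recomputes them; it assumes, as the accompanying text states, that the 55.422 s run is P-DBSCAN on one PC and the 31.812 s run is P-DBSCAN on two PCs. The function name `summarize` and the derived columns (reduction and speedup) are illustrative additions and do not appear in the paper's table.

```python
# Run times in seconds, keyed by (algorithm, number of PCs), taken from Table I.
RUN_TIME_S = {
    ("DBSCAN", 1): 97.015,
    ("P-DBSCAN", 1): 55.422,
    ("P-DBSCAN", 2): 31.812,
}


def summarize(times, baseline_key=("DBSCAN", 1)):
    """Return (algorithm, PCs, ratio %, reduction %, speedup) for each run."""
    baseline = times[baseline_key]
    rows = []
    for (algorithm, pcs), t in times.items():
        ratio = t / baseline                      # run time relative to DBSCAN
        rows.append((
            algorithm,
            pcs,
            round(100 * ratio),                   # "run time ratio" column
            round(100 * (1 - ratio)),             # share of run time saved
            round(baseline / t, 2),               # speedup factor
        ))
    return rows


for row in summarize(RUN_TIME_S):
    print(row)
```

Under these assumptions the run time ratios come out at 57% and 33%, i.e. reductions of about 43% and 67%, which is consistent with the statement that P-DBSCAN saves more than 40% of the run time on one PC and more than 60% on two.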

V. CONCLUSION

In this paper, we presented P-DBSCAN, a novel parallel DBSCAN algorithm for distributed environments, which adopts the Priority R-tree rather than the R*-tree as the spatial access method supporting region queries. The experimental results demonstrate that P-DBSCAN is much more efficient than the original DBSCAN algorithm: it reduces the run time by more than 40% on one PC and by more than 60% on two PCs. In addition, the parallel algorithm scales much better than DBSCAN, so P-DBSCAN can be used for clustering analysis in huge databases.

ACKNOWLEDGMENT

This work was supported in part by a grant from the National Natural Science Foundation of China for the project "High-Dimensional Sparse Data Clustering" (grant no. 70771007).

REFERENCES
[1] M. Ester, H.-P. Kriegel, J. Sander, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD '96), AAAI Press, Aug. 1996, pp. 226-231.
[2] A. Guttman, R-trees: A dynamic index structure for spatial searching, Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD '84), ACM Press, June 1984, pp. 47-57.
[3] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: An efficient and robust access method for points and rectangles, Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD '90), ACM Press, May 1990, pp. 322-331.
[4] L. Arge, M. de Berg, H. J. Haverkort, K. Yi, The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree, Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD '04), ACM Press, June 2004, pp. 347-358.
[5] Zhou ShuiGeng, Zhou AoYing, Cao Jing, A data-partitioning-based DBSCAN algorithm, Journal of Computer Research & Development, Vol. 37, Oct. 2000, pp. 1153-1159.
