Author's Accepted Manuscript: Pattern Recognition
www.elsevier.com/locate/pr
PII: S0031-3203(16)00103-5
DOI: http://dx.doi.org/10.1016/j.patcog.2016.03.008
Reference: PR5669
To appear in: Pattern Recognition
Received date: 30 October 2015
Revised date: 18 February 2016
Accepted date: 3 March 2016
Cite this article as: K. Mahesh Kumar and Dr. A. Rama Mohan Reddy, A fast
DBSCAN clustering algorithm by accelerating neighbor searching using Groups
method, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2016.03.008
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
A fast DBSCAN clustering algorithm by accelerating neighbor searching using Groups method
Corresponding Author:
K. Mahesh Kumar
Research Scholar
Department of Computer Science and Engineering,
SVU College of Engineering,
SV University, Tirupati-517 502
Andhra Pradesh, India
Email: mahesh_cse@outlook.com
Mobile: +91 9966880476
1. Introduction
Data clustering is an unsupervised learning technique that groups given data into
meaningful subclasses such that objects in the same subclass are more similar to one
another than to objects in other subclasses. Clustering techniques are used in different
fields like image analysis [1], pattern recognition [2], knowledge discovery [3] and
bio-informatics [4]. The application of clustering techniques to spatial databases [5]
poses the challenges of discovering clusters with arbitrary shapes, determining the input
parameters of the algorithm with minimal domain knowledge, and achieving good
efficiency on large databases.
Partitional clustering methods like k-means [6] can find clusters of spherical
shape only and require the number of clusters to be supplied as input to the algorithm.
Kernel k-means [7] can detect arbitrary shaped clusters by transforming the data into a
feature space using kernel functions, but it has time and space complexity of O(n²) and
hence is not scalable to large datasets. Hierarchical clustering methods partition the
dataset into subsets by splitting or merging the subsets at different levels using a
minimum distance criterion [8]. The single-linkage [9] method can detect arbitrary
shaped clusters but is sensitive to noise patterns; it also suffers from the chaining
effect [10]. CURE [11] is an improved version of single-link which selects a random
sample of points and shrinks them towards the centroid to mitigate the chaining effect.
Hierarchical methods have a time complexity of O(n³) and must also define an appropriate
stopping condition for the splitting or merging of partitions while deriving clusters.
BIRCH [12] is a hierarchical clustering algorithm which uses a tree based representation
to reduce the time complexity, but it can find only spherical, compact clusters, and its
clustering result is affected by the input or-
der of data. Recently, multi-view based [13,14] and semi-supervised [15] clustering
methods have been proposed that use labeled data from the user to achieve a better
clustering accuracy; the user provided data may be incorporated into the clustering
algorithm in the form of constraints guiding it towards a better clustering solution.
DBSCAN [19] is the pioneer of density based clustering techniques; it can discover
clusters of arbitrary shape and also handles noise and outliers effectively. However, the
DBSCAN algorithm has a time complexity that is quadratic in the dataset size. The
algorithm can be extended to large datasets by reducing its time complexity with spatial
index structures like R-trees [20] for finding the neighbors of a pattern, but such
structures cannot be applied to high dimensional datasets. In this paper we propose an
algorithm, Groups, to accelerate the neighbor search queries. Groups builds a graph-based
index structure on the data; unlike conventional hierarchical index structures like
R-trees, the proposed method scales well to high dimensional datasets. The Groups method
is also efficient in handling large amounts of noise present in the data without
degrading the performance of DBSCAN. The cluster results produced by our method are
exactly identical to those of DBSCAN, but at a reduced running time.
The rest of the paper is organized as follows: Section 2 reviews existing work on
density based clustering methods; Section 3 describes the proposed method, with the
Groups algorithm followed by the G-DBSCAN clustering algorithm; Section 4 presents the
experimental results; and Section 5 concludes the paper.
2. Related Work
Density based clustering methods can find arbitrary shaped clusters in the dataset
and are also insensitive to noise. In these methods, clusters are formed by merging dense
areas separated by sparse regions. DBSCAN was proposed for clustering large spatial
databases with noise or outliers. OPTICS [21] is an extension to
DBSCAN which can find clusters with varying densities by creating an augmented order-
ing of the given dataset representing its density-based cluster structure. This ordering
is equivalent to density-based clusterings over a varied range of parameter settings.
Chen et al. [22]
proposed a parameter free clustering method that utilizes Affinity Propagation algorithm
[23] to detect local densities in the dataset and obtains a normalized density list. Later,
the DBSCAN method is modified to cluster the dataset in terms of the parameters in the
normalized density list. DENCLUE [24] defines clusters by a local maximum of estimat-
ed kernel density function. A hill climbing procedure is used for assigning points to
the nearest local maximum. l-DBSCAN [25] is a hybrid density based clustering method that
first derives a set of prototypes from the dataset using leaders clustering method [26] and
runs DBSCAN on the prototypes to find clusters. Further, RoughDBSCAN [27] is pro-
posed by applying rough-set theory [28] to l-DBSCAN method. It has a time complexity
of O(n) but the cluster results are influenced by threshold parameter that is specified to
derive the prototypes. Recently, fast and scalable density based clustering methods using
Graphics Processing Units (GPU) are proposed to improve the performance of DBSCAN
[29,30]. Also, parallel and distributed versions of DBSCAN are proposed for handling
large datasets [31-33]. Mostofa et al. [31] proposed a parallel DBSCAN method based on
graph algorithmic concepts that achieves a well balanced workload across the processors.
Clustering algorithms based on graph theory [34] are attractive because of their
ability to detect clusters of diverse shape, size and density without requiring any prior
knowledge of the dataset; they also do not require the user to supply the number of
clusters as an input parameter. These methods represent the dataset as a graph, which is
suitably partitioned and merged to obtain the final clusters. CHAMELEON1 [35] represents
the dataset with a k-nearest neighbor graph, which is partitioned into sub-clusters; two
sub-clusters are merged only if the relative inter-connectivity and closeness between
them are comparable. Graph based methods also take advantage of the Minimum Spanning Tree
(MST): a clustering procedure is proposed in [38] based on the comparison between the
k-nearest neighbor graph and the MST of a dataset. Spectral clustering [39] represents
the dataset as a fully connected graph and relies on spectral graph theory for finding
clusters. In addition, relative neighbor graphs are also used to cluster data [40,41].
Mimaroglu et al. [42] adopt a similarity graph for combining multiple clustering results
into a final clustering solution. Graph-based manifold learning methods [43,44] also
employ a neighborhood graph; in [43], a manifold learning framework for image clustering
uses a sparse representation to select a few neighbors of each data point that span a
low-dimensional affine subspace passing near that point.
1 CHAMELEON is a reptile that has the ability to change its skin to different colors.
The algorithm was so named as it is based on a dynamic model to identify cluster
structures of varied shapes.
A multimodal hypergraph learning based sparse coding method [44] is proposed for the
click prediction of images; in a hypergraph, a set of vertices is connected by a
hyperedge rather than a pairwise edge.
The following notions underlie DBSCAN. The eps-neighborhood of a point x is the set of
points whose distance from x is less than or equal to ε; the cardinality of the
eps-neighborhood of x is called its threshold density. Points that lie in each other's
eps-neighborhood are called eps-connected points.
From the point of view of the DBSCAN method, every point in the dataset falls into
either the core point or the border point category; further, a border point can be
either a noise point or a density-connected point.
Core point: A point p is a core point if the threshold density of p is at least minpts.
Noise point: A point p is a noise point if the threshold density of p is less than
minpts and there is no core point in its eps-neighborhood.
Density-connected point: A border point with at least one core point in its eps-
neighborhood.
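As a small illustration, this point taxonomy can be written down directly from the
definitions. The following hedged Java sketch uses a plain O(n²) neighborhood scan; all
names are illustrative, and not counting the point itself in its own threshold density
is our choice of convention.

import java.util.*;

class PointTaxonomy {
    // Patterns within eps of data[i] (the point itself is not counted here).
    static List<Integer> epsNeighborhood(double[][] data, int i, double eps) {
        List<Integer> nbrs = new ArrayList<>();
        for (int j = 0; j < data.length; j++)
            if (j != i && dist(data[i], data[j]) <= eps) nbrs.add(j);
        return nbrs;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) { double d = a[k] - b[k]; s += d * d; }
        return Math.sqrt(s);
    }

    // Core if the threshold density reaches minpts; a border point with a
    // core neighbor is density-connected, otherwise it is noise.
    static String classify(double[][] data, int i, double eps, int minpts) {
        List<Integer> nbrs = epsNeighborhood(data, i, eps);
        if (nbrs.size() >= minpts) return "core";
        for (int j : nbrs)
            if (epsNeighborhood(data, j, eps).size() >= minpts) return "density-connected";
        return "noise";
    }
}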
The DBSCAN algorithm takes two input parameters, ε and minpts: ε specifies the
radius of the neighborhood around a point, and minpts specifies the minimum number of
points required in the eps-neighborhood of a point to form a cluster. Initially all points
are marked unvisited. The algorithm starts by randomly selecting an unvisited point and
computing its eps-neighborhood; if it contains at least minpts points, the point is
marked as a core point and a new cluster is created. Further points are added iteratively
to the cluster by finding dense points for each point in the eps-neighborhood of the
cluster. If no unvisited points can be added to the cluster, the new cluster is complete
and no points will be added to it in subsequent iterations. To find the next cluster, an
unvisited point in the dataset is selected and the above clustering process is repeated.
The process is halted when all the
points are either assigned to some cluster or marked noise. Every point in a cluster is eps-
connected with at least one point in the same cluster to which it belongs and is not eps-
connected with any other points in remaining clusters. However, there may exist a border
point which is eps-connected with points in some other clusters; in that case, the point
is assigned to the cluster that processed it first. Such exceptional cases are rare in
practice. The total number of eps-neighborhood operations performed is equal to the size
of the dataset. If no index structures are used, then each eps-neighborhood operation
involves computing the distance from the given point to all remaining points in the
dataset. Algorithm 1 outlines the DBSCAN procedure.
Algorithm 1: DBSCAN(D, ε, minpts)
mark all patterns in D as unvisited
for each unvisited pattern x in D
    mark x as visited
    N ← eps-neighborhood(x)
    if |N| ≥ minpts
        create a new cluster C and add x to C
        repeat
            for each unvisited pattern y in N
                mark y as visited
                N' ← eps-neighborhood(y)
                if |N'| ≥ minpts
                    N ← N ∪ N'
                end if
                if y is not yet assigned to any cluster
                    add y to C
                end if
            end for
        until no new patterns can be added to C
    else
        mark x as noise
    end if
end for
Output all patterns in D marked with a cluster-id or noise
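For concreteness, the following is a minimal Java sketch of this clustering loop with
the neighbor query left pluggable, so the full scan can later be swapped for the
Groups-based search of Section 3. Class and method names are illustrative, not taken
from the authors' implementation.

import java.util.*;
import java.util.function.BiFunction;

class Dbscan {
    static final int UNVISITED = -2, NOISE = -1;

    // labels[i] holds a cluster id or NOISE; neighborQuery(data, i) returns the
    // indices of points within eps of point i (the pluggable eps-neighborhood).
    static int[] cluster(double[][] data, int minpts,
                         BiFunction<double[][], Integer, List<Integer>> neighborQuery) {
        int[] labels = new int[data.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < data.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> seeds = neighborQuery.apply(data, i);
            if (seeds.size() < minpts) { labels[i] = NOISE; continue; }
            labels[i] = clusterId;                             // start a new cluster
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {                         // expand until complete
                int q = queue.poll();
                if (labels[q] == NOISE) labels[q] = clusterId; // noise becomes border
                if (labels[q] != UNVISITED) continue;
                labels[q] = clusterId;
                List<Integer> nbrs = neighborQuery.apply(data, q);
                if (nbrs.size() >= minpts) queue.addAll(nbrs); // q is a core point
            }
            clusterId++;
        }
        return labels;
    }
}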
3. Density based clustering method using groups
The proposed G-DBSCAN algorithm follows the conventional DBSCAN clustering method, but
the nearest neighbor search queries are accelerated by using the Groups method. The
proposed algorithm runs in two phases: in the first phase, the Groups algorithm is run on
the entire dataset to obtain a set of groups; the second phase runs the conventional
DBSCAN method using the groups derived in the first phase for a fast eps-neighborhood
operation.
3.1 Groups
The Groups method partitions the dataset by fitting it into a graph-based structure where
each vertex is a group and an edge is drawn between two groups if they are reachable
(def. 4). The Groups algorithm merges nearby patterns into groups. Each group is a
hypersphere with its center as the master pattern and can have a maximum radius of ε. The
Groups method classifies each pattern as either a master or a slave pattern. Groups are
formed by scanning the entire dataset twice. In the first round, each pattern is searched
against the existing groups for one to fit in: a pattern is added to a group if the
distance from the given pattern to its master pattern is less than or equal to ε. If the
distance from the given pattern to the master pattern of its nearest group is greater
than ε but less than two times ε, then the pattern is neither assigned to any group nor
created as a new group; such patterns are processed further in the second round of the
algorithm. If a pattern does not fit into any group and the distance from the master
pattern of its nearest group is greater than or equal to two times ε, then a new group is
created with the pattern itself as the master pattern. In the second round, the left out
patterns from the first round are assigned to a group if the distance from the given
pattern to the master pattern is less than or equal to ε; if there is no such group to
fit, then a new group is created with the given pattern as the master pattern of the
group. Different input orders of the patterns produce different sets of groups. Whenever
a slave pattern is added to a group, the threshold distance of the group is also updated.
The maximum threshold distance of groups created in the first round is less than or equal
to ε, and the threshold distance of groups created in the second round is less than ε.
The Groups method thus fits a graph based representation onto the dataset, as sketched
below.
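A compact Java sketch of this two-round construction is given below; the Group class,
its field names and the exact handling of border cases are our own reading of the
description above, not the authors' code.

import java.util.*;

class Groups {
    static class Group {
        double[] master;                        // master pattern (group center)
        List<double[]> slaves = new ArrayList<>();
        double threshold = 0;                   // largest master-to-slave distance so far
        Group(double[] m) { master = m; }
    }

    static List<Group> build(double[][] data, double eps) {
        List<Group> groups = new ArrayList<>();
        List<double[]> deferred = new ArrayList<>();
        for (double[] x : data) {                              // round 1
            double best = Double.MAX_VALUE; Group bestG = null;
            for (Group g : groups) {
                double d = dist(x, g.master);
                if (d < best) { best = d; bestG = g; }
            }
            if (bestG != null && best <= eps) add(bestG, x, best);
            else if (bestG != null && best < 2 * eps) deferred.add(x); // defer to round 2
            else groups.add(new Group(x));                     // >= 2*eps from every master
        }
        for (double[] x : deferred) {                          // round 2
            double best = Double.MAX_VALUE; Group bestG = null;
            for (Group g : groups) {
                double d = dist(x, g.master);
                if (d < best) { best = d; bestG = g; }
            }
            if (bestG != null && best <= eps) add(bestG, x, best);
            else groups.add(new Group(x));
        }
        return groups;
    }

    static void add(Group g, double[] x, double d) {
        g.slaves.add(x);                                       // assign x as a slave of g
        g.threshold = Math.max(g.threshold, d);                // update threshold distance
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) { double t = a[k] - b[k]; s += t * t; }
        return Math.sqrt(s);
    }
}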
The following are some of the advantages of the Groups method over conventional
index-based structures:
(1) The Groups method does not require any input parameters specific to the algorithm;
it is built using only the DBSCAN parameter ε.
(2) The Groups method handles noise effectively by pruning noise patterns early, without
performing unnecessary distance computations.
(3) The Groups method ensures that the search space of a pattern involved in an
eps-neighborhood operation always spans a small area (fig 3), irrespective of outliers in
the data.
Table 1 Notations
Symbol   Denotes
D        input set of data patterns
x        a pattern in D
d        number of dimensions of a pattern
n        size of the dataset
G        set of groups
s        a group in G
y        a slave pattern in group s
m_s      master pattern of group s
n_s      number of patterns in group s
EC(s)    eps-connected groups of s (def. 3)
t_s      threshold distance of group s
R(s)     reachable groups of s (def. 4)
Definition 1 (core group) A group s is a core group if the number of patterns in it is
greater than or equal to minpts.
core(G) = { s ∈ G : n_s ≥ minpts }
Definition 2 (border group) A group s is a border group if the number of patterns in it
is less than minpts.
border(G) = { s ∈ G : n_s < minpts }
Definition 3 (eps-connected groups) The set of eps-connected groups of a group s is
defined as follows:
EC(s) = { g ∈ G : dist(m_s, m_g) ≤ t_s + t_g + ε }
Definition 4 (Reachable groups) The set of reachable groups of a group s is defined as
follows:
R(s) = { g ∈ G : dist(m_s, m_g) ≤ 3ε }
The reachable group relationship is symmetric, viz., if g is a reachable group of s then
s is a reachable group of g.
Definition 5 (empty group) A group s is an empty group if it does not contain any slave
patterns.
empty(G) = { s ∈ G : n_s = 1 }
Definition 6 (noise group) A group s is a noise group if the number of patterns in it is
less than minpts and it has no reachable groups.
noise(G) = { s ∈ G : n_s < minpts, R(s) = ∅ }
[Figure: examples of a core group, a border group, an empty group and a noise group]
3.2 G-DBSCAN
G-DBSCAN follows the conventional DBSCAN clustering method, but the nearest neighbors are
searched for using the groups obtained by the Groups algorithm. The reachable groups of a
group are computed by finding the distance between their master patterns. If the distance
between the master patterns of any two groups is less than or equal to the sum of their
threshold distances and ε, i.e., dist(m_s, m_g) ≤ t_s + t_g + ε, they are eps-connected
groups of each other (def. 3). Since the threshold distance of a group is not known until
the scanning of the entire dataset is complete, we take the threshold distance of the
groups at its maximum value ε; hence, two groups are considered reachable if the distance
between their master patterns is less than or equal to three times ε (def. 4, corollary
1). If x, y are any two eps-connected patterns, then x and y are either patterns that
belong to the same group or patterns in reachable groups. To find the eps-neighborhood of
a pattern, firstly all patterns in its current group are searched for eps-connectivity,
followed by a search of the patterns in its reachable groups. If the threshold distance
of the current group s satisfies t_s ≤ ε/2, then the eps-neighborhood includes all
patterns in the current group (lemma 3). If s is a border group without any reachable
groups, then all patterns in s are noise patterns (lemma 2). For a given pattern in a
group, whether there can exist an eps-connected pattern in a reachable group is
determined using eq. 3; if eq. 3 is satisfied, then distance computations are made from
the given pattern to all patterns in that reachable group to obtain the eps-connected
patterns. The search space for computing the eps-neighborhood is therefore always
confined to a hypersphere of radius 5ε with the given pattern as its center (fig 7).
Algorithm 2 specifies the process of computing the eps-neighborhood of a given pattern
using the groups method; in DBSCAN clustering, the conventional eps-neighborhood
operation is simply replaced by this groups-based search.
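As a sketch, and reusing the hypothetical Group class and dist helper from the Groups
sketch in Section 3.1, the reachable-groups computation reduces to one distance test per
pair of master patterns:

// Reachable groups: masters within 3*eps of each other (corollary 1).
// The relationship is symmetric, so each pair is tested once.
static Map<Group, List<Group>> reachableGroups(List<Group> groups, double eps) {
    Map<Group, List<Group>> r = new HashMap<>();
    for (Group g : groups) r.put(g, new ArrayList<>());
    for (int i = 0; i < groups.size(); i++)
        for (int j = i + 1; j < groups.size(); j++)
            if (dist(groups.get(i).master, groups.get(j).master) <= 3 * eps) {
                r.get(groups.get(i)).add(groups.get(j));
                r.get(groups.get(j)).add(groups.get(i));
            }
    return r;
}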
Our method is more efficient than other graph-based clustering methods like CHAMELEON
[35], which uses a k-nearest neighbor graph for representing the dataset. Constructing a
k-nearest neighbor graph involves distance computations from a given point to all
remaining points in the dataset, and the graph is then partitioned into multiple
disconnected components. Moreover, CHAMELEON [35] is a hierarchical clustering method and
requires the user to specify a suitable split and merge criterion. In contrast, ours is a
method that uses an efficient graph based structure for fast neighbor search operations.
G-DBSCAN builds this structure by scanning the dataset only twice, and it involves
distance computations from a given point to the master patterns of groups only. Each
vertex is a group represented by its master pattern, and the eps-neighborhood patterns of
the master are added as its slaves; the edges are drawn towards its reachable groups
(fig 2(a)). If the clusters are well separated and valid parameters of ε and minpts are
selected, the Groups method obtains disconnected graph components equal in number to the
clusters present in the data (fig 2(b)). Our method is thus parameter free, requiring no
input parameters beyond those of DBSCAN itself.
Algorithm 2: eps-neighborhood(x)
/* This method finds the eps-neighborhood N of a given pattern x in group s and returns
it */
N ← ∅
if t_s ≤ ε/2
    add all patterns in s to N
else
    find patterns y in s such that dist(x, y) ≤ ε and add them to N
end if
for each reachable group g of s
    if dist(x, m_g) ≤ t_g + ε
        find patterns y in g such that dist(x, y) ≤ ε and add them to N
    end if
end for
Output N
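A Java transcription of Algorithm 2, again assuming the Group class, dist helper and
reachable-groups map from the sketches above, might look as follows:

// Eps-neighborhood of pattern x belonging to group s, searching only s
// and the reachable groups of s that pass the eq. 3 test.
static List<double[]> epsNeighborhood(double[] x, Group s, double eps,
                                      Map<Group, List<Group>> reach) {
    List<double[]> n = new ArrayList<>();
    if (s.threshold <= eps / 2) {                      // lemma 3: whole group qualifies
        n.addAll(s.slaves);
        n.add(s.master);
    } else {
        for (double[] y : s.slaves) if (dist(x, y) <= eps) n.add(y);
        if (dist(x, s.master) <= eps) n.add(s.master);
    }
    for (Group g : reach.get(s)) {
        if (dist(x, g.master) <= g.threshold + eps) {  // eq. 3 pruning test
            for (double[] y : g.slaves) if (dist(x, y) <= eps) n.add(y);
            if (dist(x, g.master) <= eps) n.add(g.master);
        }
    }
    n.removeIf(y -> y == x);                           // exclude the query pattern
    return n;
}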
[Figures 4-7: groups illustrated with master patterns (+) and slave patterns (*),
annotated with the distances eps, 2*eps and 3*eps]
Lemma 1: All patterns in a core group belong to the same cluster.
Proof: Let s be a core group, let x, y be any two slave patterns in group s, and let m_s
be its master pattern. Then dist(x, m_s) ≤ ε and dist(y, m_s) ≤ ε. Since n_s ≥ minpts,
the eps-neighborhood of m_s contains at least minpts patterns, so m_s is a core point and
both x and y are density-reachable from m_s. Hence x and y belong to the same cluster.
Lemma 2: All patterns in a border group without any reachable groups are noise patterns.
Proof: Let s be a border group without any reachable groups. Since the number of patterns
in s is less than minpts and no pattern outside s and its reachable groups can be
eps-connected with a pattern in s, the threshold density of any pattern in s cannot be
more than n_s < minpts. Hence, each pattern in s is noise.
Lemma 3: If t_s ≤ ε/2, then the eps-neighborhood of any pattern in s includes all
remaining patterns in s.
Proof: From fig 4, it is evident that if the threshold distance of a group is t_s, then
the maximum distance between any two patterns in the group is 2t_s. If t_s ≤ ε/2, then
2t_s ≤ ε, and therefore every pattern in the group falls in the eps-neighborhood of the
remaining patterns in the group.
Theorem 1: A pattern x in group s can have eps-connected patterns in a reachable group g
with respect to ε only if
dist(x, m_g) ≤ t_g + ε        (3)
Corollary 1: The distance between the master patterns of any two eps-connected groups is
at most 3ε.
Proof: Fig 6 shows a pair of eps-connected groups separated by the maximum distance. Let
s, g be any two eps-connected groups; by def. 3, the distance between their master
patterns satisfies dist(m_s, m_g) ≤ t_s + t_g + ε. Since the distance between the master
pattern and a slave pattern of a group is always less than or equal to ε, substituting
t_s = t_g = ε gives dist(m_s, m_g) ≤ 3ε.
Theorem 2: The distance between any two slave patterns of reachable groups is less than
or equal to 5ε.
Proof: Let s, g be any two reachable groups and let x ∈ s, y ∈ g. Using the triangle
inequality property, from fig 7,
dist(x, y) ≤ dist(x, m_s) + dist(m_s, m_g) + dist(m_g, y) ≤ ε + 3ε + ε = 5ε.
fig 7 Maximum distance between slave patterns in reachable groups
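As a worked instance of these bounds (the value of ε here is illustrative): taking
ε = 0.5,

dist(m_s, m_g) ≤ 3ε = 1.5 for the master patterns of reachable groups, and
dist(x, y) ≤ ε + 3ε + ε = 5ε = 2.5 for any two of their slave patterns,

so an eps-neighborhood query for a pattern never has to examine points farther than 2.5
from it.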
The time complexity of the proposed algorithm comprises the cost incurred in deriving the
groups from the dataset and the cost of running the G-DBSCAN algorithm on the groups. The
entire dataset is scanned once to obtain an initial partition of groups, and the patterns
not fitting into any group are processed further in the second round. If p is the number
of patterns left unassigned to any group in the first round, the Groups algorithm
processes n + p patterns in total, each involving distance computations only against the
master patterns of the existing groups. The eps-neighborhood operation of G-DBSCAN
involves computing the distance to all patterns in the reachable groups satisfying eq. 3;
in the worst case, it computes the distance to the patterns present in all of its
reachable groups. In conventional DBSCAN, the eps-neighborhood of a point involves
computing the distance from all other remaining points in the dataset, hence its time
complexity is O(n²). Using an index structure like R-trees, the search time can be
reduced to O(log n); overall, for the entire dataset, it becomes O(n log n). However,
index-based techniques are not efficient for high dimensional data. The Groups method
limits the neighbor search of a pattern to its own group and its reachable groups, and
for a given pattern, reachable groups are further pruned out of the search based on the
triangle inequality property (eq. 3). If k is the average number of distance computations
involved in computing the eps-neighborhood of a pattern, then the time complexity of
DBSCAN using groups is O(nk), and the time complexity of G-DBSCAN, including the time
taken for building the groups, is O(n + nk). Though the value of k cannot be established
theoretically, for all valid parameters of ε, k is small; its value is influenced by the
input order of the data and also by the separation between the clusters. Since the
distance between slave patterns of reachable groups is less than or equal to 5ε (theorem
2), the eps-neighborhood of a given pattern is always computed within a small local
region around the pattern.
4. Experimental Results
We performed experiments in three parts to study the effectiveness of the proposed
method. In the first part, we performed experiments on datasets obtained from the UCI
machine learning repository [45] of varied dimensions and on a two dimensional synthetic
dataset. In the second part, we performed experiments on synthetic datasets with
dimensions in a varied range. In both cases the performance of G-DBSCAN is empirically
evaluated against the conventional DBSCAN and its state-of-the-art index implementations.
In the third part, the behavior of the algorithm is analyzed in the presence of noise of
varied sizes in the dataset. All the experiments were performed on an Intel Core i3-4005U
processor at 1.7 GHz with 4 GB RAM running Windows 7 Ultimate Service Pack 1. All
programs are compiled and executed as single-threaded Java console applications.
The proposed method is compared with three commonly used index data structures for
neighbor searching: k-d trees, R-trees and M-trees. The k-d tree [46] is a
multi-dimensional search tree where each node is a k-dimensional point. The nodes are
split recursively using the mean or median of the data points across each node; in our
experiments the median was used. The R-tree [20] is a balanced search tree used for
spatial access methods; it groups nearby objects using minimum bounding hyper-rectangles.
Sort-Tile-Recursive (STR) [47] is a bulk-loading variant of the R-tree that repeatedly
splits the data across each dimension successively into equal-sized partitions; STR was
used in our experimental study. The M-tree [48] is a balanced search tree similar to the
R-tree, but it uses minimum-volume hyperspheres for grouping nearby objects instead of
hyper-rectangles; a random split policy is used for dividing the hyperspheres with
minimal overlap. The hierarchical index structures are sensitive to the split policy: a
variation of up to two to three orders of magnitude in running time was observed with
respect to the split policy selected. In each case the algorithm is executed five times
with different input orders of the dataset and their average is taken as the running
time. Experimental results show that the proposed method is faster than DBSCAN by a
factor of 1.5 to 2.2 on benchmark datasets and is also scalable for high dimensional
datasets (fig 9).
4.1 Experiment 1
In this empirical study we used seven popular datasets from the UCI machine learning
repository and one synthetic dataset. The datasets were selected on the criteria of
dimension and size. Combined Cycle Power Plant [49] consists of 9568 data points
collected from a power plant over a period of six years. Each point comprises four
attributes obtained when the power plant was set to work at full load; they are used to
calculate the net hourly electrical energy output of the plant. Page Blocks
Classification [50] is composed of 5473 blocks obtained from 54 distinct documents. The
blocks of the page layout of a document are obtained using a segmentation process, and
are classified to separate text from graphic areas. Pen-Based Recognition of Handwritten
Digits [45] is a digit database of 250 samples collected from 44 writers; each sample is
a 16-dimensional feature vector. Letter Recognition [45] consists of 20,000 instances of
16 features used to identify the 26 upper case English alphabets. Image Segmentation [45]
consists of seven classes of 2310 samples of 19 dimensions. Statlog Landsat Satellite
[45] consists of seven classes of 6435 instances; each instance consists of 36 features
obtained from multi-spectral satellite images. The One-Hundred Plant Species Leaves
dataset [51] comprises 1600 instances obtained from sixteen samples of leaf per species.
Concentric Rings is artificially generated with 3 classes of 3000 points each and 30
noise points. A summary of the datasets used in the empirical study, together with the
running time results, is shown in table 3. From the experimental results, it is observed
that index-based structures perform well for datasets with dimensionality less than 20
and tend to perform poorly as the dimensionality of the dataset increases. Our method is
more than twice as fast as DBSCAN for the pen-digits dataset. For the plant species
dataset, a performance improvement of about 40 percent in running time is observed, while
the index-based methods run slower than the full-search DBSCAN.
l-DBSCAN [25] and RoughDBSCAN [27] take a hybrid approach to speed up the DBSCAN
clustering method; l-DBSCAN requires two input distance parameters that are used to
create prototypes at coarse level and at fine-grained level. The running time comparison
with the above methods is shown in table 4. The clustering results are compared using the
Rand index: if C1 and C2 denote two different partitions of a dataset, n11 denotes the
number of pairs of patterns that are grouped into the same set in both C1 and C2, and n00
denotes the number of pairs of patterns in the dataset that are not grouped into a set in
both C1 and C2, then
Rand(C1, C2) = (n11 + n00) / (n(n-1)/2)        (9)
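Under our reading that eq. (9) is the Rand index, the pairwise computation can be
sketched in Java as follows (names illustrative):

// Rand index over all n(n-1)/2 pairs: a pair agrees when both clusterings
// put it in the same set (n11) or both keep it apart (n00).
static double randIndex(int[] c1, int[] c2) {
    long agree = 0, pairs = 0;
    for (int i = 0; i < c1.length; i++)
        for (int j = i + 1; j < c1.length; j++) {
            boolean same1 = c1[i] == c1[j], same2 = c2[i] == c2[j];
            if (same1 == same2) agree++;               // an n11 or n00 pair
            pairs++;
        }
    return (double) agree / pairs;
}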
From table 4, it is observed that l-DBSCAN and RoughDBSCAN can give a better reduction in
execution time than G-DBSCAN, but they deviate considerably from the clustering results
produced by DBSCAN. Also, the selection of input parameter(s) that minimize the execution
time and maximize the clustering accuracy simultaneously is a difficult task, and the
authors did not report any effective criteria for determining such suitable input
parameter(s) in their respective papers [25, 27]. G-DBSCAN improves the execution speed
of DBSCAN and always guarantees exactly the same clustering results as DBSCAN.
4.2 Experiment 2
In this experiment, the scalability of the proposed method is analyzed for high
dimensional datasets. Experiments were performed on synthetic datasets, each of size
10,000 and with dimensions ranging from 5 to 65 in steps of 5. Datasets are generated
using a multivariate normal distribution. The running time comparison of the proposed
method with DBSCAN and its index-based implementations is shown in fig 9. It is observed
that the conventional index-based techniques fail to scale for datasets with dimensions
above 20. Further, the running time of the index-based implementations becomes even worse
than that of full-search DBSCAN above some threshold value of dimension, while the
proposed method outperforms the index-based implementations even at higher dimensions.
fig 9 Running time (secs) comparison of the proposed method, DBSCAN (full search), and
DBSCAN with k-d tree, R-tree and M-tree indexes, against the dimension of the data
4.3 Experiment 3
One of the desirable features of spatial clustering methods is the ability to handle
noise effectively while still performing well. In this experiment, the performance of the
proposed method is analyzed in the presence of noise. For this purpose, we generate a two
dimensional synthetic dataset of 10,000 points using a multivariate normal distribution,
with the cluster centers generated by sampling from a uniform distribution (-2, 2). At
each step, noise is added in increments of 500 points, from 500, 1000 and so on up to
10,000 points. It is ensured that the majority of noise points are not eps-connected with
the clusters. An isolated noise pattern forms an empty group, while eps-connected noise
patterns constitute a border group or reachable empty groups. The running time of the
Groups method increases with the number of noise points, since the process of assigning
slaves to their respective groups involves searching the larger number of empty and noise
groups introduced by the noise points. During G-DBSCAN, the empty noise groups do not
involve any distance computations for computing the eps-neighborhood of a pattern, while
the points in noisy border groups involve fewer than minpts distance computations. From
fig 10, it is evident that with an increase in noise the execution time of the Groups
method increases rapidly, whereas the running time of G-DBSCAN varies by only a small
amount. In the presence of noise, G-DBSCAN thus gives a substantial reduction in running
time over DBSCAN.
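A sketch of the kind of generator used here is given below; the cluster count, cluster
spread and noise range marked as placeholders are our assumptions, since the text states
only the uniform center sampling, the Gaussian clusters and the 500-point noise
increments.

import java.util.*;

class NoiseExperiment {
    // Gaussian clusters around uniformly sampled centers plus uniform
    // background noise; values marked "placeholder" are not from the paper.
    static double[][] generate(int points, int clusters, int noise, long seed) {
        Random rnd = new Random(seed);
        double[][] data = new double[points + noise][2];
        double[][] centers = new double[clusters][2];
        for (double[] c : centers)
            for (int k = 0; k < 2; k++)
                c[k] = -2 + 4 * rnd.nextDouble();             // centers ~ U(-2, 2)
        for (int i = 0; i < points; i++) {                    // normal clusters
            double[] c = centers[i % clusters];
            for (int k = 0; k < 2; k++)
                data[i][k] = c[k] + 0.1 * rnd.nextGaussian(); // 0.1 std dev: placeholder
        }
        for (int i = points; i < points + noise; i++)         // scattered background noise
            for (int k = 0; k < 2; k++)
                data[i][k] = -3 + 6 * rnd.nextDouble();       // noise range: placeholder
        return data;
    }
}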
fig 10 Running time (ms) comparison of DBSCAN, Groups, G-DBSCAN including Groups
processing, and G-DBSCAN excluding Groups processing, against the ratio of noise size to
data size
5. Conclusion
In this paper we proposed the Groups method, a graph-based index structure for
accelerating the neighbor search operations of DBSCAN clustering. The Groups method scans
the entire dataset once to obtain a set of groups; a pattern which does not fit into an
existing group is processed in a second round, where it is either assigned to a group as
a slave or a new group is created with the pattern itself as the master pattern. The
Groups method ensures that, for a given pattern, the neighbor search never needs to
examine points farther than a small bounded distance from it. Hierarchical index
structures, in contrast, are sensitive to the split policy selected, which can lead to a
variation of up to two to three orders of magnitude in the actual running time; the
Groups method is more stable than such index-based structures, as it does not require any
specific input parameters from the user to build the groups index structure. Also, the
Groups method is robust to noise, pruning outliers early with zero or few distance
computations. In the future we plan to extend G-DBSCAN to very large datasets using
parallel and distributed versions of the proposed method.
Conflict of interest
None declared.
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable suggestions in
improving the quality of the paper. The work of Mahesh was supported by a monthly
scholarship from the Department of Higher Education, Ministry of Human Resource
Development (MHRD), Govt. of India, under the Technical Education Quality Improvement
Program (TEQIP-II-1.2.1).
References
[1] R.C. Gonzalez, R.E. Woods, Digital Image Processing, third ed., Pearson Pren-
tice-Hall, Upper Saddle River, NJ, 2008.
[4] S. Madeira, A. Oliveira, Biclustering algorithms for biological data analysis: a
survey, IEEE/ACM Trans. on Computational Biology and Bioinformatics 1 (1) (2004) 24-45.
[5] R.H. Gueting, An introduction to spatial database systems, The VLDB Journal 3 (4)
(1994) 357-399.
[6] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31
(8) (2010) 651-666.
[8] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: A review, ACM Computing Surveys
31 (3) (1999) 264-323.
[10] G. Nagy, State of the art in pattern recognition, In: Proceedings of the IEEE 56,
1968, pp. 836-862.
[11] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large
databases, In: Proc. of the Internat. Conf. on Management of Data (ACM SIGMOD), 1998,
pp. 73-84.
[12] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: An efficient data clustering method for
very large databases, In: Proc. ACM SIGMOD Internat. Conf. on Management of Data, 1996,
pp. 103-114.
[13] X. Chang, T. Dacheng, X. Chao, Multi-View Self-Paced Learning for Cluster-
ing, In: Proceedings of 24th International Joint conference on Artificial Intelli-
gence, 2015, pp. 3974-3980.
[17] K. Lu, J. Zhao, D. Cai, An algorithm for semi-supervised learning in image
retrieval, Pattern Recognition 39 (2006) 717-720.
[19] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering
clusters in large spatial databases with noise, In: Proc. 2nd ACM SIGKDD, Portland,
Oregon, 1996, pp. 226-231.
[20] A. Guttman, R-trees: A dynamic index structure for spatial searching, In: Proc. of
Int. Conf. on Management of Data (ACM SIGMOD), 1984, pp. 47-57.
[22] X. Chen, W. Liu, H. Qiu, J. Lai, APSCAN: A parameter free clustering algorithm,
Pattern Recognition Letters 32 (2011) 973-986.
[23] B.J. Frey, D. Dueck, Mixture modeling by affinity propagation, In: Proceedings of
18th Neural Information Processing Systems, 2005, pp. 379-386.
[26] J.A. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, 1975.
[27] P. Viswanath, V.S. Babu, Rough-DBSCAN: A fast hybrid density based clustering method
for large data sets, Pattern Recognition Letters 30 (16) (2009) 1477-1488.
[28] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer
Academic Publishing, Dordrecht, 1991.
[30] W.K. Loh, H. Yu, Fast density-based clustering through dataset partition using
graphics processing units, Information Sciences 308 (2015) 94-112.
[32] M. Chen, X. Gao, H. Li, Parallel DBSCAN with priority R-tree, In: Proceedings of
Information Management and Engineering (ICIME), 2010, pp. 508-511.
[33] B.R. Dai, I.C. Lin, Efficient MapReduce-based DBSCAN algorithm with optimized data
partition, In: Proceedings of IEEE 5th Int. Conf. on Cloud Computing (CLOUD), Hawaii,
USA, 2012, pp. 59-66.
[34] C.T. Zahn, Graph-Theoretical Methods for Detecting and Describing Gestalt
Clusters, IEEE Trans. Computers, 20(1) (1971) 68-86.
[38] M.G. Barrios, A.J. Quiroz, A clustering procedure based on the comparison between
the k nearest neighbors graph and the minimal spanning tree, Statistics & Probability
Letters 62 (2003) 23-34.
[39] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactions on
Pattern Analysis and Machine Intelligence 22 (2000) 888-905.
[41] G.T. Toussaint, The relative neighborhood graph of a finite planar set, Pattern
Recognition 12 (1980) 261-268.
[44] J. Yu, Y. Rui, D. Tao, Click prediction for web image re-ranking using multimodal
sparse coding, IEEE Trans. on Image Processing 23 (5) (2014) 2019-2032.
[45] A. Frank, A. Asuncion, UCI machine learning repository, 2010. URL
http://archive.ics.uci.edu/ml.
[46] J.L. Bentley, Multidimensional search trees in database applications, IEEE Trans.
Software Eng. 5 (4) (1979) 333-340.
[47] S.T. Leutenegger, J.M. Edgington, M.A. Lopez, STR: A Simple and Efficient Algorithm
for R-Tree Packing, Technical Report, Institute for Computer Applications in Science and
Engineering (ICASE), 1997.
[48] P. Ciaccia, M. Patella, P. Zezula, M-tree: An efficient access method for similarity
search in metric spaces, In: Proceedings of the 23rd International Conference on Very
Large Data Bases (VLDB), 1997, pp. 426-435.
[49] H. Kaya, P. Tufekci, S.F. Gurgen, Local and global learning methods for predicting
power of a combined gas & steam turbine, In: Proceedings of International Conference on
Emerging Trends in Computer and Electrical Engineering (ICETCEE), 2012, pp. 13-18.
[51] C. Mallah, J. Cope, J. Orwell, Plant Leaf Classification Using Probabilistic In-
tegration of Shape, Texture and Margin Features, Signal Processing, Pattern
Recognition and Applications (2013) 45-54.