Author's Accepted Manuscript: Pattern Recognition


Author's Accepted Manuscript

A fast DBSCAN clustering algorithm by
accelerating neighbor searching using Groups
method

K. Mahesh Kumar, Dr. A. Rama Mohan Reddy

www.elsevier.com/locate/pr

PII: S0031-3203(16)00103-5
DOI: http://dx.doi.org/10.1016/j.patcog.2016.03.008
Reference: PR5669
To appear in: Pattern Recognition
Received date: 30 October 2015
Revised date: 18 February 2016
Accepted date: 3 March 2016
Cite this article as: K. Mahesh Kumar and Dr. A. Rama Mohan Reddy, A fast
DBSCAN clustering algorithm by accelerating neighbor searching using Groups
method, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2016.03.008
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
TITLE

A fast DBSCAN clustering algorithm by accelerating neighbor
searching using Groups method

Corresponding Author:

K. Mahesh Kumar
Research Scholar
Department of Computer Science and Engineering,
SVU College of Engineering,
SV University, Tirupati-517 502
Andhra Pradesh, India
Email: mahesh_cse@outlook.com
Mobile : +91 9966880476

Dr. A. Rama Mohan Reddy


Professor & Chairman Board of Studies (UG&PG)
Department of Computer Science and Engineering,
SVU College of Engineering,
SV University, Tirupati-517 502
Andhra Pradesh, India
Email: ramamohansvu@yahoo.com
1. Introduction

Data clustering is an unsupervised learning technique that groups given data into meaningful subclasses such that objects in the same subclass are more similar to each other than to objects in other subclasses. Clustering techniques are used in different fields like image analysis [1], pattern recognition [2], knowledge discovery [3] and bio-informatics [4]. Applying clustering techniques to spatial databases [5] poses three challenges: discovering clusters of arbitrary shape, determining the input parameters of the algorithm with minimal domain knowledge, and achieving good efficiency on large databases.

Partitional clustering methods like k-means [6] can find clusters of spherical shape only and require the number of clusters to be supplied as input to the algorithm. Kernel k-means [7] can detect arbitrarily shaped clusters by transforming data into a feature space using kernel functions, but it has time and space complexity of O(n^2) and hence does not scale to large datasets. Hierarchical clustering methods partition the dataset into subsets represented by a hierarchical data structure, the dendrogram. Clusters are obtained by merging the subsets at different levels using a minimum distance criterion [8]. The single-linkage [9] method can detect arbitrarily shaped clusters but is sensitive to noise patterns; it also suffers from the chaining effect [10]. CURE [11] is an improved version of single-link which selects a random sample of points and shrinks them towards the centroid to counter the chaining effect. Hierarchical methods have a time complexity of O(n^3) and must also define an appropriate stopping condition for the splitting or merging of partitions while deriving suitable clusters. BIRCH [12] is an agglomerative hierarchical clustering algorithm which uses a tree-based representation to reduce time complexity, but it can find only spherical, compact clusters, and its clustering result is affected by the input order of the data. Recently, multi-view [13,14] and semi-supervised clustering [15] methods have been shown to be effective in improving the accuracy of clustering results. Multi-view clustering methods exploit information obtained from multiple views to improve clustering accuracy [16]. Semi-supervised clustering algorithms utilize a small amount of labeled data from the user to achieve better clustering accuracy; the user-provided data may be incorporated into the clustering algorithm in the form of constraints that guide it towards a better solution [17,18].

DBSCAN (Density Based Spatial Clustering of Applications with Noise) [19] is the pioneer of density based clustering techniques; it can discover clusters of arbitrary shape and also handles noise or outliers effectively. The DBSCAN algorithm has time complexity quadratic in the dataset size. The algorithm can be extended to large datasets by reducing its time complexity with spatial index structures like R-trees [20] for finding the neighbors of a pattern, but these cannot be applied to high dimensional datasets. In this paper we propose an algorithm, Groups, to accelerate the neighbor search queries. Groups builds a graph-based index structure on the data. Unlike conventional hierarchical index structures such as R-trees, the proposed method scales well for high dimensional datasets. Also, the Groups method is efficient in handling large amounts of noise present in the data without degrading the performance of DBSCAN. The cluster results produced by our method are exactly identical to those of DBSCAN, but at a reduced running time.

The rest of the paper is organized as follows: Section 2 reviews existing work in the literature on density based clustering techniques and gives a detailed description of the DBSCAN method. Section 3 describes the proposed method, with the Groups algorithm followed by G-DBSCAN (Groups-Density Based Spatial Clustering of Applications with Noise). Section 4 provides an experimental analysis of the proposed method. Section 5 gives conclusions and future work.

2. Related Work

Density based clustering methods can find arbitrarily shaped clusters in the dataset and are also insensitive to noise. In these methods, clusters are formed by merging dense areas separated by sparse regions. DBSCAN was proposed for clustering large spatial databases with noise or outliers. OPTICS [21] is an extension to DBSCAN which can find clusters of varying densities by creating an augmented ordering of the given dataset that represents its density-based cluster structure; this ordering is equivalent to density-based clusterings over a range of parameter settings. Chen et al. [22] proposed a parameter free clustering method that utilizes the Affinity Propagation algorithm [23] to detect local densities in the dataset and obtain a normalized density list; the DBSCAN method is then modified to cluster the dataset in terms of the parameters in the normalized density list. DENCLUE [24] defines clusters by local maxima of an estimated kernel density function; a hill climbing procedure is used to assign points to their nearest local maximum. l-DBSCAN [25] is a hybrid density based clustering method that first derives a set of prototypes from the dataset using the leaders clustering method [26] and then runs DBSCAN on the prototypes to find clusters. Further, RoughDBSCAN [27] was proposed by applying rough-set theory [28] to the l-DBSCAN method; it has a time complexity of O(n), but the cluster results are influenced by the threshold parameter that is specified to derive the prototypes. Recently, fast and scalable density based clustering methods using Graphics Processing Units (GPUs) have been proposed to improve the performance of DBSCAN [29,30]. Also, parallel and distributed versions of DBSCAN have been proposed for handling large datasets [31-33]. Patwary et al. [31] proposed a parallel DBSCAN method using graph algorithmic concepts which achieves a well balanced workload by taking advantage of a tree-based bottom-up approach to construct clusters.

Clustering algorithms based on graph theory [34] are attractive because of their ability to detect clusters of diverse shape, size and density without requiring any prior knowledge of the dataset, and they do not require the user to supply the number of clusters as an input parameter. These methods construct a similarity graph representation of the dataset which is suitably partitioned and merged to obtain the final clusters. CHAMELEON¹ [35] represents the dataset with a k-nearest neighbor graph, which is partitioned into sub-clusters by minimizing the edge-cut; clusters are obtained by merging sub-clusters only if the relative inter-connectivity and closeness between them are comparable. Graph based methods also take advantage of the Minimum Spanning Tree (MST) to represent the dataset [36,37]. Barrios [38] identified clusters by comparing the k nearest neighbor graph and the MST of a dataset. Spectral clustering [39] represents the dataset as a fully connected graph and relies on spectral graph theory to find clusters. In addition, relative neighborhood graphs have also been used to cluster data [40,41]. Mimaroglu et al. [42] adopt a similarity graph for combining multiple clustering results into a final clustering solution. Graph-based manifold learning methods [43,44] also employ a neighborhood graph representation of the dataset to identify clusters. Yu et al. [43] proposed a graph-based manifold learning framework for image clustering, using a sparse representation to select a few neighbors of each data point that span a low-dimensional affine subspace passing near that point. A multimodal hypergraph learning based sparse coding method [44] has been proposed for the click prediction of images; in a hypergraph, a set of vertices is connected by a hyperedge to preserve the local smoothness of the constructed sparse codes.

¹ CHAMELEON is a type of reptile that has the ability to change its skin to different colors. The algorithm was so called because it is based on a dynamic model to identify cluster structures of varied shape.

2.1 DBSCAN: A density-based approach

The DBSCAN algorithm defines a cluster as a region of densely connected points separated by regions of non-dense points. If the similarity measure is taken as Euclidean distance, the region is a hypersphere of radius eps with the given point p as center.

eps-neighborhood: for a point x, the eps-neighborhood N_eps(x) = { y ∈ D : d(x, y) ≤ eps } denotes the set of points whose distance from x is less than or equal to eps. The cardinality |N_eps(x)| defines the threshold density of x.

eps-connected: a pair of points x, y are eps-connected if d(x, y) ≤ eps.

From the point of view of the DBSCAN method, every point in the dataset is either a core point or a border point. Further, a border point is either a noise point or a density-connected point.

Core point: a point whose threshold density is greater than or equal to minpts, i.e., |N_eps(x)| ≥ minpts.

Border point: a point whose threshold density is less than minpts.

Noise point: a point p is a noise point if the threshold density of p is less than minpts and all points in the eps-neighborhood of p are border points.

Density-connected point: a border point with at least one core point in its eps-neighborhood.
The DBSCAN algorithm takes two input parameters, eps and minpts. eps specifies the maximum distance neighborhood for a given point; minpts is the minimum number of points required in the eps-neighborhood of a point to form a cluster. Initially all points are marked unvisited. The algorithm starts by randomly selecting an unvisited point and finding its eps-neighborhood. If the number of points in its eps-neighborhood is less than minpts, the point is marked as noise or outlier; otherwise it is considered a dense point and a new cluster is created. Further points are added iteratively to the cluster by finding dense points for each point in the eps-neighborhood of the cluster. If no unvisited points can be added, the new cluster is complete and no points will be added to it in subsequent iterations. To find the next cluster, an unvisited point in the dataset is chosen and the above clustering process is repeated. The process halts when all points are either assigned to some cluster or marked noise. Every point in a cluster is eps-connected with at least one point of the same cluster and is not eps-connected with any point in the remaining clusters. However, there may exist a border point which is eps-connected with points in some other clusters; in that case the point is assigned to the cluster that processed it first. Such exceptional cases are rare in practice. The total number of eps-neighborhood operations performed is equal to the size of the dataset. If no index structures are used, each eps-neighborhood operation involves computing distances to all remaining points in the dataset.
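
To make the cost of this baseline concrete, the following is a minimal sketch of such a brute-force eps-neighborhood query (the class and method names, and the double[]-per-pattern representation, are our own illustration, not the authors' code):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a brute-force eps-neighborhood query: without an index,
// every query scans the whole dataset, giving O(n) distance computations per
// query and O(n^2) over a full DBSCAN run.
final class BruteForceNeighbors {

    // Euclidean distance between two patterns represented as double arrays.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the indices of all patterns within eps of the query pattern.
    static List<Integer> epsNeighborhood(double[][] data, int queryIndex, double eps) {
        List<Integer> neighbors = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            if (i != queryIndex && distance(data[queryIndex], data[i]) <= eps) {
                neighbors.add(i);
            }
        }
        return neighbors;
    }
}
```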


Algorithm DBSCAN(D, eps, minpts)

mark all patterns in D as unvisited
for each unvisited pattern p in D do
    N = FindNeighbours(p, eps)
    if |N| < minpts
        mark p as NOISE
    else
        create a new cluster C
        mark p and each pattern of N with C
        add all patterns of N to the seed set S
        until S is empty do
            delete a pattern q from S
            N' = FindNeighbours(q, eps)
            if |N'| >= minpts
                for each pattern r in N'
                    mark r with C if r is not yet assigned to any cluster
                    if r is unvisited, add r to S
                end for
            end if
            mark q as visited
        end until
    end if
    mark p as visited
end for
Output: all patterns in D marked with a cluster id or NOISE
3. Density based clustering method using groups

In this section, we propose the G-DBSCAN algorithm, which is essentially the DBSCAN clustering method with its nearest neighbor search queries accelerated by the Groups method. The proposed algorithm runs in two phases: in the first phase, the Groups algorithm is run on the entire dataset to obtain a set of groups; the second phase runs the conventional DBSCAN method, using the groups derived in the first phase for a fast eps-neighborhood operation.

3.1 Groups

The Groups method partitions the dataset by fitting it into a graph-based structure where each vertex is a group and an edge is drawn between two groups if they are reachable (def. 4). The Groups algorithm merges nearby patterns into groups. Each group is a hypersphere with its master pattern as center and can have a maximum radius of eps. The Groups method classifies each pattern as either a master or a slave pattern. Groups are formed by scanning the entire dataset twice. In the first round, each pattern is searched against the existing groups for one it fits into. A pattern is added to a group if the distance from the given pattern to the group's master pattern is less than or equal to eps. If the distance from the given pattern to the master pattern of its nearest group lies between eps and two times eps, the pattern is neither assigned to any group nor created as a new group; such patterns are processed further in the second round of the algorithm. If a pattern does not fit into any group and the distance from the master pattern of its nearest group is greater than or equal to two times eps, then a new group is created with the pattern itself as master. In the second round, the patterns left out in the first round are assigned to a group if the distance from the given pattern to the master pattern is less than or equal to eps; if there is no such group, a new group is created with the given pattern as its master. Different input orders of patterns produce different sets of groups. Whenever a slave pattern is added to a group, the threshold distance of the group is updated; in both rounds the threshold distance of a group never exceeds eps. The Groups method thus fits a graph based representation of the dataset such that if x is a point in group s, then all patterns in the eps-neighborhood of x belong to either s or its reachable groups R(s).
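
As an illustration of this two-round construction, here is a compact sketch (our own code written against the description above, not the authors' implementation; the flat-list group representation and all names are our assumptions, and the reachable-group links of eq. 4 are omitted here):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the two-round Groups construction. Each group keeps
// its master pattern index, its slave indices, and its threshold distance
// (maximum master-to-slave distance, always <= eps).
final class GroupsIndex {
    final List<Integer> masters = new ArrayList<>();
    final List<List<Integer>> slaves = new ArrayList<>();
    final List<Double> threshold = new ArrayList<>();

    static GroupsIndex build(double[][] data, double eps) {
        GroupsIndex g = new GroupsIndex();
        List<Integer> leftOut = new ArrayList<>();
        for (int x = 0; x < data.length; x++) {              // first round
            int s = g.nearestMaster(data, x);
            double dist = (s < 0) ? Double.POSITIVE_INFINITY
                                  : distance(data[x], data[g.masters.get(s)]);
            if (dist <= eps) g.addSlave(s, x, dist);         // fits an existing group
            else if (dist >= 2 * eps) g.newGroup(x);         // far from every master
            else leftOut.add(x);                             // between eps and 2*eps: defer
        }
        for (int x : leftOut) {                              // second round
            int s = g.nearestMaster(data, x);
            double dist = distance(data[x], data[g.masters.get(s)]);
            if (dist <= eps) g.addSlave(s, x, dist);
            else g.newGroup(x);
        }
        return g;
    }

    private int nearestMaster(double[][] data, int x) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int s = 0; s < masters.size(); s++) {
            double d = distance(data[x], data[masters.get(s)]);
            if (d < bestDist) { bestDist = d; best = s; }
        }
        return best;
    }

    private void addSlave(int s, int x, double dist) {
        slaves.get(s).add(x);
        threshold.set(s, Math.max(threshold.get(s), dist));
    }

    private void newGroup(int x) {
        masters.add(x);
        slaves.add(new ArrayList<>());
        threshold.add(0.0);
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; sum += t * t; }
        return Math.sqrt(sum);
    }
}
```

Only distances to master patterns are computed during construction, which is what keeps the index build linear in practice.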

The following are some of the advantages of the Groups method over conventional index-based structures:

(1) The Groups method does not require any input parameters specific to the algorithm; it takes as input only the eps parameter that the user already specifies for DBSCAN clustering.

(2) The Groups method handles noise effectively by pruning noise patterns early, without performing eps-neighborhood operations on them while obtaining clusters.

(3) The Groups method ensures that the search space of a pattern involved in an eps-neighborhood operation always spans a small area (fig 3), irrespective of outliers in the data.
Table 1 Notations

Symbol      Denotes
D           input set of data patterns
x           a pattern in D
d           number of dimensions of a pattern
n           size of the dataset
G           set of groups
s           a group in G
p           a slave pattern in group s
m_s         master pattern of group s
|s|         number of patterns in group s
N_eps(s)    eps-connected groups of s (def. 3)
t_s         threshold distance of group s
R(s)        reachable groups of s (def. 4)

Definition 1 (core group) A group s is a core group if the number of patterns in it is greater than or equal to minpts:

core(G) = { s ∈ G : |s| ≥ minpts }

Definition 2 (border group) A group s is a border group if the number of patterns in it is less than minpts:

border(G) = { s ∈ G : |s| < minpts }

Definition 3 (eps-neighborhood of a group) The eps-neighborhood of a group s is defined as

N_eps(s) = { s' ∈ G : there exist p ∈ s, q ∈ s' with d(p, q) ≤ eps }

s and the groups in N_eps(s) are called eps-connected groups.

Definition 4 (reachable groups) The set of reachable groups of a group s is defined as

R(s) = { s' ∈ G : d(m_s, m_s') ≤ 3·eps }

The reachable group relationship is symmetric, viz., if s' is a reachable group of s then s is a reachable group of s'.

Definition 5 (empty group) A group s is an empty group if it does not contain any slave patterns:

empty(G) = { s ∈ G : |s| = 1 }

Definition 6 (noise group) A group s is a noise group if the number of patterns in s is less than minpts and s is not eps-connected with any other group in G (lemma 2):

noise(G) = { s ∈ G : |s| < minpts and N_eps(s) = ∅ }

Definition 7 (threshold distance) The threshold distance t_s of a group s is the maximum distance of a slave pattern from the master pattern of the group:

t_s = max over p ∈ s of d(m_s, p)


fig 1 Groups method: core, border, empty and noise groups (legend: master and slave patterns)


Algorithm Groups(D, eps)

mark all patterns in D as ungrouped
// first round
for each pattern x in D
    if there exists a group s in G such that d(m_s, x) <= eps
        add x to s as a slave; update t_s
    else if G is empty or there does not exist any group s in G such that d(m_s, x) < 2*eps
        createNewGroup(x)
    else mark x as left out
    end if
end for
// second round
for each left out pattern x in D
    if there exists a group s in G such that d(m_s, x) <= eps
        add x to s as a slave; update t_s
    else createNewGroup(x)
    end if
end for

Procedure createNewGroup(x)
    create a new group s with x as its master pattern m_s
    find the reachable groups R(s) of s using eq. 4
    add s to G
end procedure

Output G
3.2 G-DBSCAN

Groups DBSCAN (G-DBSCAN) is similar to the conventional DBSCAN clustering method, but the nearest neighbors are searched for using the groups obtained by the Groups algorithm. The reachable groups of a group are computed from the distances between master patterns. Two groups can be eps-connected (def. 3) only if the distance between their master patterns is less than or equal to the sum of their threshold distances t_1 and t_2 plus eps (eq. 3). Since the threshold distance of a group is not known until the entire dataset has been scanned, the threshold distances are taken at their maximum value, eps; hence two groups are considered reachable if the distance between their master patterns is less than or equal to three times eps (corollary 1). If p, q are any two eps-connected patterns, then p, q either belong to the same group or belong to reachable groups. To find the eps-neighborhood of a pattern, first all patterns in its own group are searched for eps-connectivity, followed by the patterns in its reachable groups. If the threshold distance of the current group is less than or equal to eps/2, then its eps-neighborhood includes all patterns in the current group (lemma 3). If s is a border group without any reachable groups, then all patterns in s are noise patterns (lemma 2). For a given pattern, whether an eps-connected pattern can exist in a reachable group is determined using the triangle inequality bound of eq. 3; only if the bound is satisfied are distance computations made from the given pattern to the patterns of that reachable group to obtain the eps-connected patterns. The search space for computing the eps-neighborhood of a pattern is thus bounded by a hypersphere of radius 5·eps centered at the given pattern (theorem 2, fig 7). The FindNeighbours algorithm below specifies the process of computing the eps-neighborhood of a given pattern using the Groups method; in the DBSCAN clustering given above, the step FindNeighbours uses this procedure.

Our method is more efficient than other graph-based clustering methods like CHAMELEON [35], which uses a k-nearest neighbor graph to represent the dataset. Constructing a k-nearest neighbor graph involves distance computations from each point to all remaining points in the dataset. The graph is partitioned into multiple disconnected components (sub-clusters) by removing the longest edges, and an agglomerative hierarchical clustering process is applied to the sub-clusters to obtain the final clustering results. CHAMELEON [35] is a hierarchical clustering method and requires the user to specify a suitable split and merge policy in addition to the parameter k. G-DBSCAN is a density based clustering method that uses an efficient graph based structure for fast neighbor search operations. G-DBSCAN finds a graph-based representation of the dataset by scanning the entire dataset twice, involving distance computations from each point to the master patterns of groups only. Each vertex is a group represented by its master pattern, with the eps-neighborhood patterns of the master added as its slaves; edges are drawn towards its reachable groups (fig 2(a)). If the clusters are well separated and valid parameters eps and minpts are selected, the Groups method obtains disconnected graph components equal in number to the clusters present in the data (fig 2(b)). Our method requires no inputs from the user beyond those of DBSCAN itself.


a)
b)

Border group Core group

fig 2 a) Groups obtained as multiple disconnected graph components of dataset (left)

b) Groups obtained as single graph component of dataset (right)


Algorithm FindNeighbours(x, eps)

/* This method finds the eps-neighborhood N of a given pattern x and returns it */

s = group of x
if t_s <= eps/2
    N = all patterns in s                              // lemma 3
else
    N = patterns q in s such that d(x, q) <= eps
end if
for each reachable group s' in R(s)
    if d(x, m_s') <= t_s' + eps                        // pruning by the triangle inequality (eq. 3)
        add to N the patterns q in s' such that d(x, q) <= eps
    end if
end for
Output N
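
A sketch of this group-accelerated query in the same vein (building on the illustrative GroupsIndex above; groupOf and the reachable lists, computed once via the 3·eps rule of corollary 1, are assumed precomputed, and all names are ours):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative group-accelerated eps-neighborhood query. groupOf[i] is the
// group of pattern i; reachable[s] lists the groups within 3*eps of group s's
// master (corollary 1).
final class GroupNeighbors {

    static List<Integer> findNeighbours(double[][] data, int x, double eps,
                                        GroupsIndex g, int[] groupOf, int[][] reachable) {
        int s = groupOf[x];
        List<Integer> result = new ArrayList<>();
        if (g.threshold.get(s) <= eps / 2) {
            addAll(g, s, x, result);                 // lemma 3: whole group qualifies
        } else {
            addWithin(data, g, s, x, eps, result);
        }
        for (int r : reachable[s]) {
            // Triangle-inequality pruning: a pattern of group r can lie within
            // eps of x only if d(x, master(r)) <= threshold(r) + eps (cf. eq. 3).
            if (GroupsIndex.distance(data[x], data[g.masters.get(r)])
                    <= g.threshold.get(r) + eps) {
                addWithin(data, g, r, x, eps, result);
            }
        }
        return result;
    }

    // Add every pattern of group s (except x itself) to the result.
    private static void addAll(GroupsIndex g, int s, int x, List<Integer> out) {
        int m = g.masters.get(s);
        if (m != x) out.add(m);
        for (int q : g.slaves.get(s)) if (q != x) out.add(q);
    }

    // Add the patterns of group s lying within eps of x to the result.
    private static void addWithin(double[][] data, GroupsIndex g, int s, int x,
                                  double eps, List<Integer> out) {
        int m = g.masters.get(s);
        if (m != x && GroupsIndex.distance(data[x], data[m]) <= eps) out.add(m);
        for (int q : g.slaves.get(s))
            if (q != x && GroupsIndex.distance(data[x], data[q]) <= eps) out.add(q);
    }
}
```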
fig 3 A group with its reachable groups (master patterns marked +, slave patterns marked *). Each group has radius at most eps, and reachable groups lie within 3·eps of the group's master pattern.

Lemma 1: All patterns in a core group belong to the same cluster.

Proof: Let s be a core group and p, q be any two patterns in s. Let p belong to cluster C. Every pattern of s lies within eps of the master pattern m_s, so q belongs to the eps-neighborhood of m_s; since s is a core group, |N_eps(m_s)| ≥ minpts, so m_s is a core point. Therefore q also belongs to C. Hence, all patterns of a core group belong to the same cluster.


Lemma 2: If the set of reachable groups is empty for a border group, then all patterns in the border group are noise.

Proof: Let s be a border group, so the number of patterns in s is less than minpts. Since R(s) is empty, the eps-neighborhood of any pattern in s includes patterns of s only. As the total number of patterns in s is less than minpts, the eps-neighborhood of any pattern in s cannot contain minpts or more patterns. Hence, each pattern in s is noise.

Lemma 3: If t_s ≤ eps/2, then the eps-neighborhood of any pattern in s includes all remaining patterns in s.

Proof: From fig 4, it is evident that if the threshold distance of a group is t_s, then the maximum distance between any two patterns in the group is 2·t_s. If t_s ≤ eps/2, this maximum distance is at most eps; therefore every pattern in the group falls in the eps-neighborhood of every remaining pattern in the group.

fig 4 Group with maximum distance between slave patterns


Theorem 1: Let s_1, s_2 be any two groups with master patterns m_1, m_2 and threshold distances t_1, t_2. s_1, s_2 are eps-connected with respect to eps only if d(m_1, m_2) ≤ t_1 + t_2 + eps.

Proof: If s_1, s_2 are eps-connected, there exist patterns p ∈ s_1, q ∈ s_2 with d(p, q) ≤ eps ... (1). Using the triangle inequality (fig 5), d(m_1, m_2) ≤ d(m_1, p) + d(p, q) + d(q, m_2) ≤ t_1 + d(p, q) + t_2 ... (2). Combining (1) and (2) gives d(m_1, m_2) ≤ t_1 + t_2 + eps ... (3)

fig 5 eps-connected groups

fig 6 Maximum separated eps-connected groups

Corollary 1: The distance between the master patterns of any two eps-connected groups is less than or equal to 3·eps.

Proof: fig 6 shows a pair of eps-connected groups separated by the maximum distance. Let s_1, s_2 be any two eps-connected groups; by eq. 3 the distance between their master patterns is less than or equal to t_1 + t_2 + eps. Since the distance between the master pattern and a slave pattern of a group is always less than or equal to eps, substituting eps for t_1 and t_2 in the above inequality yields d(m_1, m_2) ≤ 3·eps ... (4)


Theorem 2: The distance between any two slave patterns of reachable groups is less than or equal to 5·eps.

Proof: Let s_1, s_2 be any two reachable groups and p ∈ s_1, q ∈ s_2. Using the triangle inequality (fig 7), d(p, q) ≤ d(p, m_1) + d(m_1, m_2) + d(m_2, q) ... (5). Since d(p, m_1) ≤ eps and d(m_2, q) ≤ eps ... (6), and since d(m_1, m_2) ≤ 3·eps from eq. 4, it follows that d(p, q) ≤ 5·eps.

fig 7 Maximum distance between slave patterns in reachable groups
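
For reference, the three bounds established above, which together drive the pruning in G-DBSCAN, can be collected in display form (a restatement in the notation of Table 1, not a new result):

```latex
\begin{align*}
d(m_1, m_2) &\le t_1 + t_2 + \mathit{eps}
  && \text{(eq.\ 3, Theorem 1: necessary for eps-connectivity)}\\
d(m_1, m_2) &\le 3 \cdot \mathit{eps}
  && \text{(eq.\ 4, Corollary 1: substituting } t_1 = t_2 = \mathit{eps}\text{)}\\
d(p, q) &\le 5 \cdot \mathit{eps}
  && \text{(Theorem 2: } p \in s_1,\; q \in s_2,\; s_2 \in R(s_1)\text{)}
\end{align*}
```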

3.3 Performance analysis

The time complexity of the proposed algorithm comprises the cost of deriving the groups from the dataset and of running the G-DBSCAN algorithm on the groups. The entire dataset is scanned once to obtain an initial partition into groups, and the patterns not fitting into any group are processed further in a second round. If p is the number of patterns left unassigned to any group in the first round, the time complexity of the Groups algorithm is O(n+p); since p ≤ n, this is O(n). The eps-neighborhood of a pattern involves computing distances to the patterns in the reachable groups satisfying eq. 3; in the worst case, it involves computing distances to the patterns in all of its reachable groups. Without any index-based structure, the eps-neighborhood of a point involves computing distances to all remaining points in the dataset, hence a time complexity of O(n^2) overall. Using an index structure like R-trees, the search time can be reduced to O(log n) per query, or O(n log n) for the entire dataset; however, index-based techniques are not efficient for high dimensional data. The Groups method limits neighbor searching to a pattern's reachable groups, and for a given pattern, reachable groups are further pruned from the search using the triangle inequality property (eq. 3). If d denotes the maximum number of distance computations involved in computing the eps-neighborhood of a pattern, then the time complexity of DBSCAN using groups is O(nd), and the time complexity of G-DBSCAN, including the time taken for building the groups, is O(n+nd). Though the value of d cannot be established theoretically, for all valid parameters of eps and minpts, d is small. The value of d is influenced by the input order of the data and by the separation between clusters. Since the distance between slave patterns of reachable groups is less than or equal to 5·eps (theorem 2), the eps-neighborhood of a given pattern in the worst case involves distance computations to patterns at a distance of up to 5·eps from the given pattern.

4. Experimental Results

In this section, we present experimental results that demonstrate the effectiveness of the proposed method. In the first part, we performed experiments on datasets of varied dimensionality obtained from the UCI machine learning repository [45] and on a two dimensional synthetic dataset. In the second part, we performed experiments on synthetic datasets over a range of dimensions. In both cases the performance of G-DBSCAN is empirically evaluated against conventional DBSCAN and its state-of-the-art index implementations. In the third part, the behavior of the algorithm is analyzed in the presence of varying amounts of noise in the dataset. All experiments were performed on an Intel Core i3-4005U processor at 1.7 GHz with 4 GB RAM running Windows 7 Ultimate Service Pack 1. All programs were compiled and executed as single-threaded Java console applications using JDK 8u45 on the Java HotSpot 64-bit Server Virtual Machine.

The proposed method is compared with three commonly used index data structures for neighbor searching: k-d trees, R-trees and M-trees. The k-d tree [46] is a multi-dimensional search tree where each node is a k-dimensional point; nodes are split recursively using the mean or median of the data points under each node (in our experiments the median was used). The R-tree [20] is a balanced search tree used for spatial access methods; it groups nearby objects into minimum bounding rectangles. Sort-Tile-Recursive (STR) [47] is a bulk-loaded variant of the R-tree that repeatedly splits across each dimension successively into equal sized partitions; STR was used in our experimental study. The M-tree [48] is a balanced search tree similar to the R-tree, but it uses minimum volume hyperspheres instead of hyper-rectangles for grouping nearby objects; a random split policy is used for dividing the hyperspheres with minimal overlap. The hierarchical index structures are sensitive to input parameters: in the case of M-trees, a difference of up to two to three orders of magnitude in running time was observed depending on the split policy selected. In each case the algorithm was executed five times with different input orders of the dataset, and the average is reported as the running time. Experimental results show that the proposed method is faster than DBSCAN by a factor of 1.5 to 2.2 on benchmark datasets and is also scalable to high dimensional datasets (fig 9).

4.1 Experiment 1

In this empirical study we used seven popular datasets from the UCI machine learning repository and one synthetic dataset, selected on the criteria of dimensionality and size. Combined Cycle Power Plant [49] consists of 9568 data points collected from a power plant over a period of six years; each point comprises four attributes obtained when the power plant was set to work at full load, used to predict the net hourly electrical energy output of the plant. Page Blocks Classification [50] is composed of 5473 blocks obtained from 54 distinct documents; the blocks of the page layout of a document are obtained by a segmentation process and are classified to separate text from graphic areas. Pen-Based Recognition of Handwritten Digits [45] is a digit database of 250 samples from each of 44 writers; each sample is a 16-dimensional feature vector. Letter Recognition [45] is a database of image features used to identify the 26 upper case English letters. Image Segmentation [45] consists of seven classes of 2310 samples of 19 dimensions. Statlog Landsat Satellite [45] consists of seven classes of 6435 instances; each instance consists of 36 features obtained from multi-spectral values of pixel neighborhoods in a satellite image. The Plant Species Leaves dataset [51] comprises 1600 instances obtained from sixteen leaf samples for each of one hundred plant species. Concentric rings is artificially generated with 3 classes of 3000 points each plus 30 noise points. A summary of the datasets used in the empirical study is provided in Table 2.


Table 2 Characteristics of datasets used in testing

fig 8 Synthetic dataset concentric rings

A performance comparison of the proposed method with index structures is given in Table 3. From the experimental results, it is observed that index based structures perform well for datasets with dimensionality less than 20 and tend to perform poorly as the dimensionality of the dataset increases. Our method is more than twice as fast as DBSCAN on the pen-digits dataset. For the plant species dataset, a performance improvement of about 40 percent in running time is observed, while the index based methods are slower than DBSCAN. l-DBSCAN [25] and Rough-DBSCAN [27] use a prototype based hybrid approach to speed up the DBSCAN clustering method: l-DBSCAN requires two input parameters used to create prototypes at the coarse and fine-grain levels, while Rough-DBSCAN uses one input parameter to create prototypes at the coarse level only. A performance comparison of the execution time and Rand-Index of G-DBSCAN with the above methods is shown in Table 4. The clustering results are compared using the Rand-Index similarity measure [52], computed by eq. 9 below.

Table 3 Running time (s) comparison on the datasets for various indexes

                      DBSCAN          DBSCAN using index
Dataset               (full search)   Groups   k-d tree   R-tree   M-tree
CC Power plant        1.706           1.057    1.245      1.580    1.336
Page blocks           1.024           0.637    0.720      0.840    0.650
Pen-digits            4.708           2.213    3.495      3.835    3.978
Letter recognition    14.63           8.807    17.12      14.97    16.97
Image segmentation    0.221           0.131    0.231      0.277    0.357
SL Satellite          2.967           1.907    3.117      2.885    2.837
Plant species leaves  3.348           2.011    3.985      4.033    3.751
Concentric rings      0.982           0.459    0.594      0.562    0.716
Let C_1 and C_2 denote two different partitions of a dataset of n patterns. Let a denote the number of pairs of patterns that are placed in the same subset in both C_1 and C_2, and let b denote the number of pairs of patterns that are placed in different subsets in both C_1 and C_2. Then

Rand-Index(C_1, C_2) = (a + b) / (n(n-1)/2)     (9)
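
A small sketch of this computation over two flat label arrays (our own illustrative helper, not the authors' code; noise can simply be treated as a label of its own):

```java
// Illustrative O(n^2) Rand-Index computation over two cluster labelings.
final class RandIndex {

    static double compute(int[] labelsA, int[] labelsB) {
        int n = labelsA.length;
        long agree = 0;                       // counts a + b from eq. 9
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                boolean sameA = labelsA[i] == labelsA[j];
                boolean sameB = labelsB[i] == labelsB[j];
                if (sameA == sameB) agree++;  // together in both, or apart in both
            }
        }
        long pairs = (long) n * (n - 1) / 2;  // total number of pairs
        return (double) agree / pairs;
    }
}
```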

It is observed that, with suitable selection of their input parameter(s), l-DBSCAN and Rough-DBSCAN can reduce execution time more than G-DBSCAN, but their results deviate considerably from the clustering produced by DBSCAN. Moreover, selecting input parameter(s) that simultaneously minimize execution time and maximize clustering accuracy is a difficult task, and the authors did not report any effective criteria for determining such parameter(s) in their respective papers [25,27]. G-DBSCAN improves the execution speed of DBSCAN while always guaranteeing clustering results exactly identical to those produced by DBSCAN (Rand-Index = 1).

Table 4 Comparative analysis of running time and Rand-Index of G-DBSCAN

             l-DBSCAN                      Rough-DBSCAN              G-DBSCAN          DBSCAN
Dataset      Input       Running  Rand-    Input      Running Rand-  Running  Rand-   Running
             parameters  Time(s)  Index    parameter  Time(s) Index  Time(s)  Index   Time(s)
Pen-digits   5, 2.5      3.287    0.987    7.5        3.812   0.999  2.213    1       4.708
             13.5, 3.5   2.785    0.912    11         3.145   0.981
             25, 1.5     2.315    0.871    27.5       2.245   0.913
             42.5, 0.5   2.102    0.813    43         1.985   0.861
Letter       0.15, 0.1   12.25    0.982    0.15       11.21   0.998  8.807    1       14.63
recognition  0.35, 0.25  10.13    0.962    0.35       9.21    0.961
             0.45, 0.32  9.124    0.891    0.48       8.512   0.921
             0.65, 0.48  8.321    0.834    0.65       7.14    0.845
CC Power     2, 1.5      1.549    0.998    2.5        1.512   0.999  1.057    1       1.706
plant        6.5, 2.5    1.364    0.941    5          1.385   0.972
             12.5, 3.5   1.142    0.865    10.5       1.123   0.925
             17, 5       0.912    0.812    16         0.896   0.832
4.2 Experiment 2

In this experiment, the scalability of the proposed method is analyzed for high dimensional datasets. Experiments were performed on synthetic datasets, each of size 10,000, with dimensions ranging from 5 to 65 in steps of 5. The datasets are generated using a multivariate normal distribution, where the covariance is taken as the identity matrix and the cluster centers are generated by random sampling from a uniform distribution. A running time comparison of the proposed method with index based implementations of DBSCAN is shown in fig 9. It is observed that the conventional index based techniques fail to scale for datasets with more than about 20 dimensions. Beyond some threshold dimensionality, the running time of the index based implementations becomes even worse than that of plain DBSCAN, while the proposed method outperforms the index-based implementations even at higher dimensions.

fig 9 Running time (s) comparison of the proposed method and index based methods with the dimension of data (curves: proposed, DBSCAN full search, DBSCAN with k-d tree, R-tree and M-tree indexes)

4.3 Impact of noise

One of the desirable features of spatial clustering methods is the ability to handle noise effectively and still perform well. In this experiment, the performance of the proposed method is analyzed in the presence of noise. For this purpose, we generated a two dimensional synthetic dataset of 10,000 points using a multivariate normal distribution; the cluster centers are generated by sampling from a uniform distribution and the covariance is taken as the identity matrix. Noise is added by random sampling from a uniform distribution over (-2, 2), in increments of 500 points per step, from 500 up to 10,000 points, and it is ensured that the majority of noise points are not eps-connected. All patterns whose eps-neighborhood is empty constitute empty groups, while eps-connected noise patterns constitute border groups or reachable empty groups. The running time of the Groups method increases with the number of noise points, since assigning slaves to their respective groups involves searching the growing number of empty and noise groups introduced by the noise points. During G-DBSCAN, the empty noise groups involve no distance computations for computing the eps-neighborhood of a pattern, while the points in noisy border groups involve only a small number of distance computations. From fig 10, it is evident that with an increase in noise the execution time of the Groups method grows rapidly, whereas the running time of G-DBSCAN varies by only a small amount. In the presence of noise, G-DBSCAN gives a performance improvement in running time by a factor of 1.4 to 2.2 compared to DBSCAN.
fig 10 Running time (ms) comparison of DBSCAN, G-DBSCAN (including and excluding Groups processing) and Groups as the ratio of noise size to data size varies from 0 to 1

5. Conclusions & future work

In this paper we have presented a graph-based index structure, Groups, to speed up the neighbor search operations of DBSCAN clustering. The Groups method scans the entire dataset once to obtain a set of groups; a pattern which does not fit into an existing group is processed in a second round, where it is either assigned to a group as a slave or made the master pattern of a new group. The Groups method ensures that, for a given pattern, neighbor searching never needs to examine points farther than 5·eps away, which is an advantage over conventional DBSCAN, which requires searching all patterns in the dataset. It is observed from the experimental analysis that inappropriate parameter values for hierarchical index construction give a performance degradation of up to two to three times the actual running time. The Groups method is more stable than index-based structures, as it does not require any specific input parameters from the user to build the groups index structure. Also, the Groups method is robust to noise, pruning outliers early with zero or few distance computations. In future work we plan to extend G-DBSCAN to very large datasets using parallel and distributed versions of the proposed method, incorporating high performance computing (HPC) techniques.

Conflict of interest

None declared

Acknowledgements

The authors would like to thank anonymous reviewers for their valuable sugges-
tions in improving the quality of paper. The work of Mahesh was supported by a monthly
scholarship from the Department of Higher Education, Ministry of Human Resource De-
velopment (MHRD), Govt. of India, under Technical Education Quality Improvement
Program (TEQIP-II-1.2.1).

References

[1] R.C. Gonzalez, R.E. Woods, Digital Image Processing, third ed., Pearson Prentice-Hall, Upper Saddle River, NJ, 2008.

[2] S. Theodoridis, K. Koutroumbas, Pattern Recognition, second ed., Academic Press, New York, 2003.

[3] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, MIT Press, Boston, MA, 1996.

[4] S. Madeira, A. Oliveira, Biclustering algorithms for biological data analysis: a survey, IEEE/ACM Trans. on Computational Biology and Bioinformatics 1(1) (2004) 24-45.

[5] R.H. Gueting, An introduction to spatial database systems, The VLDB Journal 3(4) (1994) 357-399.

[6] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31(8) (2010) 651-666.

[7] M. Girolami, Mercer kernel-based clustering in feature space, IEEE Trans. on Neural Networks 13(3) (2002) 780-784.

[8] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31(3) (1999) 264-323.

[9] B. King, Step-wise clustering procedures, Journal of the American Statistical Association 62(317) (1967) 86-101.

[10] G. Nagy, State of the art in pattern recognition, In: Proceedings of the IEEE, vol. 56, 1968, pp. 836-862.

[11] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, In: Proc. of the Internat. Conf. on Management of Data (ACM SIGMOD), 1998, pp. 73-84.

[12] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, In: Proc. ACM SIGMOD Internat. Conf. on Management of Data, 1996, pp. 103-114.

[13] C. Xu, D. Tao, C. Xu, Multi-view self-paced learning for clustering, In: Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015, pp. 3974-3980.

[14] C. Xu, D. Tao, C. Xu, Large-margin multi-view information bottleneck, IEEE Trans. Pattern Analysis and Machine Intelligence 36(8) (2014) 1559-1572.

[15] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, vol. 2, MIT Press, Cambridge, MA, 2006.

[16] M. Liu, Y. Luo, D. Tao, C. Xu, Y. Wen, Low-rank multi-view learning in matrix completion for multi-label image classification, In: Proceedings of the 29th AAAI National Conference on Artificial Intelligence, 2015, pp. 2778-2784.

[17] K. Lu, J. Zhao, D. Cai, An algorithm for semi-supervised learning in image retrieval, Pattern Recognition 39 (2006) 717-720.

[18] K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning 39 (2000) 103-134.

[19] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In: Proc. 2nd ACM SIGKDD, Portland, Oregon, 1996, pp. 226-231.

[20] A. Guttman, R-trees: a dynamic index structure for spatial searching, In: Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 1984, pp. 47-57.

[21] M. Ankerst, M. Breunig, H.P. Kriegel, J. Sander, OPTICS: ordering points to identify the clustering structure, In: Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD), 1999, pp. 49-60.

[22] X. Chen, W. Liu, H. Qiu, J. Lai, APSCAN: a parameter free clustering algorithm, Pattern Recognition Letters 32 (2011) 973-986.

[23] B.J. Frey, D. Dueck, Mixture modeling by affinity propagation, In: Proceedings of the 18th Conf. on Neural Information Processing Systems, 2005, pp. 379-386.

[24] A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, In: Proc. of the 4th Int. Conf. on KDD, 1998, pp. 58-65.

[25] P. Viswanath, R. Pinkesh, l-DBSCAN: a fast hybrid density based clustering method, In: Proc. 18th Int. Conf. on Pattern Recognition, vol. 1, IEEE Computer Society, Hong Kong, 2006, pp. 912-915.

[26] J.A. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, 1975.

[27] P. Viswanath, V.S. Babu, Rough-DBSCAN: a fast hybrid density based clustering method for large data sets, Pattern Recognition Letters 30(16) (2009) 1477-1488.

[28] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishing, Dordrecht, 1991.

[29] C. Böhm, R. Noll, C. Plant, B. Wackersreuther, Density-based clustering using graphics processors, In: Proc. Conf. on Information and Knowledge Management (CIKM), Hong Kong, China, 2009, pp. 661-670.

[30] W.K. Loh, H. Yu, Fast density-based clustering through dataset partition using graphics processing units, Information Sciences 308 (2015) 94-112.

[31] M.A. Patwary, D. Palsetia, A. Agrawal, W.K. Liao, F. Manne, A. Choudhary, A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, In: Proc. of the Int. Conf. on High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1-11.

[32] M. Chen, X. Gao, H. Li, Parallel DBSCAN with priority R-tree, In: Proceedings of Information Management and Engineering (ICIME), 2010, pp. 508-511.

[33] B.R. Dai, I.C. Lin, Efficient MapReduce-based DBSCAN algorithm with optimized data partition, In: Proceedings of the IEEE 5th Int. Conf. on Cloud Computing (CLOUD), Hawaii, USA, 2012, pp. 59-66.

[34] C.T. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Trans. Computers 20(1) (1971) 68-86.

[35] G. Karypis, E.H. Han, V. Kumar, CHAMELEON: a hierarchical clustering algorithm using dynamic modeling, IEEE Computer 32(8) (1999) 68-75.

[36] O. Grygorash, Y. Zhou, Z. Jorgensen, Minimum spanning tree-based clustering algorithms, In: Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, 2006, pp. 73-81.

[37] C. Zhong, D. Miao, R. Wang, A graph-theoretical clustering method based on two rounds of minimum spanning trees, Pattern Recognition 43 (2010) 752-766.

[38] M.G. Barrios, A.J. Quiroz, A clustering procedure based on the comparison between the k nearest neighbors graph and the minimal spanning tree, Statistics & Probability Letters 62 (2003) 23-34.

[39] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 888-905.

[40] S. Bandyopadhyay, An automatic shape independent clustering technique, Pattern Recognition 37 (2004) 33-45.

[41] G.T. Toussaint, The relative neighborhood graph of a finite planar set, Pattern Recognition 12 (1980) 261-268.

[42] S. Mimaroglu, E. Erdil, Combining multiple clusterings using similarity graph, Pattern Recognition 44 (2011) 694-703.

[43] J. Yu, R. Hong, M. Wang, J. You, Image clustering based on sparse patch alignment framework, Pattern Recognition 47 (2014) 3512-3519.

[44] J. Yu, Y. Rui, D. Tao, Click prediction for web image reranking using multimodal sparse coding, IEEE Trans. on Image Processing 23(5) (2014) 2019-2032.

[45] A. Frank, A. Asuncion, UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

[46] J.L. Bentley, Multidimensional binary search trees in database applications, IEEE Trans. Software Eng. 5(4) (1979) 333-340.

[47] S.T. Leutenegger, M.A. Lopez, J. Edgington, STR: a simple and efficient algorithm for R-tree packing, Technical Report, Institute for Computer Applications in Science and Engineering (ICASE), 1997.

[48] P. Ciaccia, M. Patella, P. Zezula, M-tree: an efficient access method for similarity search in metric spaces, In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), 1997, pp. 426-435.

[49] H. Kaya, P. Tüfekci, S.F. Gürgen, Local and global learning methods for predicting power of a combined gas & steam turbine, In: Proceedings of the International Conference on Emerging Trends in Computer and Electrical Engineering (ICETCEE), 2012, pp. 13-18.

[50] F. Esposito, D. Malerba, G. Semeraro, Multistrategy learning for document recognition, Applied Artificial Intelligence 8 (1994) 33-84.

[51] C. Mallah, J. Cope, J. Orwell, Plant leaf classification using probabilistic integration of shape, texture and margin features, Signal Processing, Pattern Recognition and Applications (2013) 45-54.

[52] W.M. Rand, Objective criteria for the evaluation of clustering methods, J. Amer. Statist. Assoc. 66(336) (1971) 846-850.
HIGHLIGHTS

A graph-based index structure is built for speeding up neighbor search operations.
No additional inputs are required to build the index structure.
The proposed method is scalable to high-dimensional datasets.
It handles noise effectively to improve the performance of DBSCAN.
