
Published in the Proceedings of the 14th International Conference on Data Engineering (ICDE'98)

A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases

Xiaowei Xu, Martin Ester, Hans-Peter Kriegel, Jörg Sander


University of Munich
Oettingenstr. 67
D-80538 München

Abstract

The problem of detecting clusters of points belonging to a spatial point process arises in many applications. In this paper, we introduce the new clustering algorithm DBCLASD (Distribution Based Clustering of LArge Spatial Databases) to discover clusters of this type. The results of experiments demonstrate that DBCLASD, contrary to partitioning algorithms such as CLARANS, discovers clusters of arbitrary shape. Furthermore, DBCLASD does not require any input parameters, in contrast to the clustering algorithm DBSCAN, which requires two input parameters that may be difficult to provide for large databases. In terms of efficiency, DBCLASD is between CLARANS and DBSCAN, close to DBSCAN. Thus, the efficiency of DBCLASD on large spatial databases is very attractive when considering its nonparametric nature and its good quality for clusters of arbitrary shape.

1. Introduction
Increasingly large amounts of data obtained from satellite images, X-ray crystallography or other automatic equipment are stored in databases. Therefore, automated knowledge discovery becomes more and more important in databases. Data mining is a step in the process of knowledge discovery consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data [8]. Clustering, i.e. the grouping of the objects of a database into meaningful subclasses, is one of the prominent data mining tasks.

Spatial Database Systems (SDBS) [9] are database systems for the management of spatial data such as points and polygons representing a part of the surface of the earth. In this paper, we consider the task of clustering in spatial databases, in particular the problem of detecting clusters of points which are distributed as a homogeneous Poisson point process restricted to a certain part of the space. This type of distribution is also called uniform distribution or random distribution. The clusters may have arbitrary shape. This problem arises in many applications, e.g. seismology (grouping earthquakes clustered along seismic faults), minefield detection (grouping mines in a minefield) and astronomy (grouping stars in galaxies) (see [1], [4] and [12]).

The application to large spatial databases raises the following requirements for clustering algorithms:

(1) Minimal number of input parameters, because appropriate values are often not known in advance for many applications. In the application of detecting minefields, e.g., we usually do not know the shape, the density and the number of clusters.

(2) Discovery of clusters with arbitrary shape, because the shape of clusters in spatial databases may be spherical, drawn-out, elongated etc. In geographical information systems the clusters, e.g. houses whose price falls within a given range, are usually shaped by the neighboring geographical objects such as rivers or parks.

(3) Good efficiency on large databases, i.e. on databases of significantly more than just a few thousand objects.

None of the well-known clustering algorithms fulfills the combination of these requirements. In this paper, we present the new clustering algorithm DBCLASD which is based on the assumption that the points inside a cluster are uniformly distributed. This is quite reasonable for many applications. The application of DBCLASD to earthquake catalogues shows that DBCLASD also works effectively on real databases where the data is not exactly uniformly distributed (see section 5.5). DBCLASD dynamically determines the appropriate number and shape of clusters for a database without requiring any input parameters. Finally, the new algorithm is efficient even for large databases.

The rest of the paper is organized as follows. Clustering algorithms have been developed and applied in different areas of computer science, and we discuss related work in section 2. In section 3, we present our notion of clusters which is based on the distribution of the distance to the nearest neighbors. Section 4 introduces the algorithm DBCLASD which discovers such clusters in a spatial database. In section 5, we report on an experimental evaluation of DBCLASD according to our three major requirements. A comparison with the well-known clustering algorithms CLARANS and DBSCAN is presented. Furthermore, we apply DBCLASD to a real database, demonstrating its applicability to real world problems. Section 6 concludes with a summary and some directions for future research.
2. Related Work

One can distinguish two main types of clustering algorithms [10]: partitioning and hierarchical algorithms. Partitioning algorithms construct a partition of a database D of n objects into a set of k clusters. The partitioning algorithms typically start with an initial partition of D and then use an iterative control strategy to optimize an objective function. Each cluster is represented by the gravity center of the cluster (k-means algorithms) or by one of the objects of the cluster located near its center (k-medoid algorithms). Consequently, partitioning algorithms use a two-step procedure. First, determine k representatives minimizing the objective function. Second, assign each object to the cluster with its representative "closest" to the considered object. The second step implies that a partition is equivalent to a Voronoi diagram and each cluster is contained in one of the Voronoi cells.

Ng and Han [13] explore partitioning algorithms for KDD in spatial databases. An algorithm called CLARANS (Clustering Large Applications based on RANdomized Search) is introduced which is an improved k-medoid method. Compared to former k-medoid algorithms, CLARANS is more effective and more efficient.

Hierarchical algorithms create a hierarchical decomposition of D. The hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits D into smaller subsets until each subset consists of only one object. In such a hierarchy, each level of the tree represents a clustering of D. In contrast to partitioning algorithms, hierarchical algorithms do not need k as an input parameter. However, a termination condition has to be defined indicating when the merge or division process should be terminated.

In recent years, it has been found that probability model based cluster analysis can be useful in discovering clusters from noisy data (see [3] and [4]). The clusters and noise are represented by a mixture model. Hierarchical clustering then partitions the points between the clusters and noise. Both methods, however, are somewhat restricted when applied for class identification in spatial databases: [3] assumes the clusters to have a specific shape, and [4] only removes noise from the database, which restricts it to being a preprocessing step for further cluster analysis.

The run time of most of the above algorithms is too inefficient on large databases. Therefore, some focusing techniques have been proposed to increase the efficiency of clustering algorithms: [7] presents an R*-tree [2] based focusing technique, (1) creating a sample of the database that is drawn from each R*-tree data page and (2) applying the clustering algorithm only to that sample. BIRCH [14] is a multiphase clustering method based on the CF-tree, a hierarchical data structure designed for clustering. First, the database is scanned to build an initial in-memory CF-tree. Second, an arbitrary clustering algorithm is used to cluster the leaf nodes of the CF-tree. Both the R*-tree based focusing technique and the CF-tree based technique perform basically a preprocessing for clustering and can be used for any clustering algorithm.

DBSCAN (Density Based Spatial Clustering of Applications with Noise) [6] relies on a density-based notion of clusters which is designed to discover clusters of arbitrary shape in spatial databases with noise. A cluster is defined as a maximal set of density-connected points, i.e. the Eps-neighborhood of every point in a cluster contains at least MinPts many points.
3. A Notion of Clusters Based on the Distance Distribution

Consider the problem of detecting surface-laid minefields on the basis of an image from a reconnaissance aircraft. After processing, such an image is reduced to a set of points, some of which may be mines, and some of which may be noise, such as other metal objects or rocks. The aim of the analysis is to determine whether or not minefields are present, and where they are. A typical minefield database is shown in figure 1. Since actual minefield data were not made available to the public, the minefield data in this paper were simulated according to specifications developed at the Naval Coastal Systems Station, Panama City, Florida, to represent minefield data encountered in practice [12].

figure 1: Sample minefield database (two point clusters, cluster 1 and cluster 2)

Visually, two minefields can be easily discovered in figure 1 as clusters. We observe that the distances to the nearest neighbors for points inside the region of cluster 1 are typically smaller than the distances to the nearest neighbors for points outside, which are still in the neighborhood of cluster 1. The same holds for cluster 2. We conjecture that each cluster has a characteristic probability distribution of the distance to the nearest neighbors. If this characteristic distribution of a cluster has been discovered, it can be used to decide whether a neighboring point should be accepted as a member of the cluster or not.
3.1 Some Definitions

In the following, we introduce some basic definitions for our notion of clusters.

Definition 3.1 (nearest neighbor of a point and nearest neighbor distance) Let q be a query point and S be a set of points. Then the nearest neighbor of q in S, denoted by NN_S(q), is a point p in S − {q} which has the minimum distance to q. The distance from q to its nearest neighbor in S is called the nearest neighbor distance of q, NNdist_S(q) for short.

Definition 3.2 (nearest neighbor distance set of a set of points) Let S be a set of points and e_i be the elements of S. The nearest neighbor distance set of S, denoted by NNdistSet(S), or distance set for short, is the multi-set of all values NNdist_S(e_i).
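To make definitions 3.1 and 3.2 concrete, the following is a minimal sketch of computing the distance set, assuming 2-dimensional points stored in a NumPy array; the function name nn_dist_set and the use of SciPy's k-d tree are our own illustration, not part of the original implementation.

    import numpy as np
    from scipy.spatial import cKDTree

    def nn_dist_set(S):
        # NNdistSet(S): for every element e_i of S, the distance from e_i
        # to its nearest neighbor in S - {e_i} (definitions 3.1 and 3.2).
        tree = cKDTree(S)
        # k=2 because the closest hit for each point is the point itself
        # (distance 0); the second hit is the actual nearest neighbor.
        dists, _ = tree.query(S, k=2)
        return dists[:, 1]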
3.2 The Statistic Model for our Cluster Definition

In the following, we analyze the probability distribution of the nearest neighbor distances of a cluster. This analysis is based on the assumption that the points inside of a cluster are uniformly distributed, i.e. the points of a cluster are distributed as a homogeneous Poisson point process restricted to a certain part of the data space. However, all points of the database need not be uniformly distributed. This assumption seems to be reasonable for many applications, e.g. for the detection of minefields using reconnaissance aircraft images and for the detection of geological faults from an earthquake record (see [4], [11] and [12]).

We want to determine the probability distribution of the nearest neighbor distances. Let the N points be uniformly distributed over a data space R with volume Vol(R). We can imagine that these N points "fall" independently into the data space R, such that the probability of one of these N points falling into a subspace S of R with volume Vol(S) is equal to Vol(S) / Vol(R). The probability that the nearest neighbor distance D from any query point q to its nearest neighbor in the space R is greater than some x is therefore equal to the probability that none of the N points is located inside of a hypersphere around q with radius x, denoted by SP(q, x):

P(D > x) = (1 − Vol(SP(q, x)) / Vol(R))^N

Consequently, the probability that D is not greater than x is:

P(D ≤ x) = 1 − P(D > x) = 1 − (1 − Vol(SP(q, x)) / Vol(R))^N

In 2-dimensional space, the distribution function is therefore:

F(x) = P(D ≤ x) = 1 − (1 − πx² / Vol(R))^N    (i)

From (i), we know that the distribution function has two parameters N and Vol(R). While it is straightforward to determine N, it is not obvious how to calculate Vol(R) of a point set which may have arbitrary shape.
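As an illustration, formula (i) translates directly into code; this is a sketch under the paper's 2-dimensional uniformity assumption, with expected_cdf being our own name.

    import numpy as np

    def expected_cdf(x, n_points, vol_r):
        # F(x) = 1 - (1 - pi * x^2 / Vol(R))^N, formula (i); the formula
        # only applies while pi * x^2 <= Vol(R), so we clip the base at 0.
        base = np.clip(1.0 - np.pi * np.asarray(x) ** 2 / vol_r, 0.0, 1.0)
        return 1.0 - base ** n_points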
3.3 The Computation of the Area of a Cluster

Strictly speaking, there is no area of a set of points S. Therefore, we assign some approximation to the set S and calculate the area of that approximation. First, the shape of the approximation should be as similar as possible to the intuitive shape of the cluster. Second, the approximation should be connected, i.e. it should be one polygon which may contain holes.

We use a grid based approach for determining the approximating polygons. The key problem is the choice of a value for the grid length which is appropriate for some given set of points. If the chosen grid length is too large, the shape of the cluster is poorly approximated. On the other hand, if the chosen grid length is too small, the approximation may be split into several disconnected polygons. Figure 2 illustrates the influence of the grid length on the area of the approximation. We define the grid length such that the insertion of a distant point p into the cluster yields a large increase of the area of the cluster, implying that the distance distribution in this area no longer fits the expected distance distribution. We set the grid length for a set of points S as the maximum element of NNdistSet(S). An occupied grid cell is a grid cell containing at least one of the points of the set, and we define the approximation of set S as the union of all occupied grid cells.

figure 2: The influence of the grid length on the area of the approximation ((a) area before insertion of point p, (b) area after insertion of point p)

Based on the calculation of the area of a cluster, figure 3 compares the expected and the observed distance distributions for cluster 1 from figure 1. Figure 3 shows a good fit between the expected and the observed distance distribution. Formally, we use the concept of the χ²-test [5] to derive that the observed distance distribution fits the expected distance distribution.

figure 3: The expected and the observed distance distributions (frequency f over NNdist) for cluster 1 from figure 1
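A compact sketch of this grid-based area approximation, reusing nn_dist_set from section 3.1; anchoring the grid at the coordinate origin is our own simplification.

    import numpy as np

    def approximate_area(S):
        # Grid length = maximum element of NNdistSet(S) (section 3.3).
        g = nn_dist_set(S).max()
        # An occupied cell contains at least one point of the set; the
        # approximation of S is the union of all occupied cells.
        cells = {(int(np.floor(x / g)), int(np.floor(y / g))) for x, y in S}
        return len(cells) * g * g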
To conclude the above discussion, we state our definition of a cluster based on the distribution of the nearest neighbor distance set as follows:

Definition 3.3 (cluster) Let DB be a set of points. A cluster C is a non-empty subset of DB with the following properties:

(1) NNdistSet(C) has the expected distribution with a required confidence level.

(2) C is maximal, i.e. each extension of C by neighboring points does not fulfill condition (1) (maximality).

(3) C is connected, i.e. for each pair of points (a, b) of the cluster there is a path of occupied grid cells connecting a and b (connectivity).
4. The Algorithm DBCLASD

Having defined our notion of a cluster, we now design an algorithm to efficiently discover clusters of this type. We call this algorithm Distribution Based Clustering of LArge Spatial Databases (DBCLASD). DBCLASD is an incremental algorithm, i.e. the assignment of a point to a cluster is based only on the points processed so far, without considering the whole cluster or even the whole database. DBCLASD incrementally augments an initial cluster by its neighboring points as long as the nearest neighbor distance set of the resulting cluster still fits the expected distance distribution.

A candidate is a point not yet belonging to the current cluster which has to be checked for possible membership in this cluster. The generation of the candidates is discussed in section 4.1. The procedure of testing the candidates is the subject of section 4.2. While the incremental approach is crucial for the efficiency of DBCLASD on large databases, it implies an inherent dependency of the discovered clustering on the order of generating and testing candidates from the database. DBCLASD, however, incorporates two important features minimizing this dependency which are outlined in section 4.2. To conclude this section, the complete algorithm of DBCLASD is presented in section 4.3.
sets of this cluster. Thus, the order of testing the candidates
2.0 is crucial. Candidates which are not accepted by the test
when considered the first time are called unsuccessful can-
1.5 f : frequency
f / NNdist

didates. To minimize the dependency on the order of test-


: expected distance ing, DBCLASD incorporates two important features:
1.0 distribution
: observed distance
distribution (1) Unsuccessful candidates are not discarded but tried
0.5
again later.
0 NNdist
0 0.5 1.0 1.5 (2) Points already assigned to some cluster may switch to
another cluster later.
figure 3: The expected and the observed distance
distributions for cluster 1 from figure 1
4.2 Testing Candidates

The incremental approach of DBCLASD implies an inherent dependency of the discovered clustering on the order of generating and testing candidates. While the distance set of the whole cluster might fit the expected distance distribution, this does not necessarily hold for all subsets of this cluster. Thus, the order of testing the candidates is crucial. Candidates which are not accepted by the test when considered the first time are called unsuccessful candidates. To minimize the dependency on the order of testing, DBCLASD incorporates two important features:

(1) Unsuccessful candidates are not discarded but tried again later.

(2) Points already assigned to some cluster may switch to another cluster later.

In the following, we discuss these features in detail. Unsuccessful candidates are not discarded but stored. When all candidates of the current cluster have been processed, the unsuccessful candidates of that cluster are considered again. In many cases, they will now fit the distance distribution of the augmented cluster. Figure 4, e.g., depicts the history of the χ² values obtained for each of the candidates to be checked for sample cluster 1 from figure 1. For several of the candidates, a χ² value is obtained which is significantly higher than the threshold value, implying that this candidate is not assigned to the cluster. When performing the χ²-test for the distance set of the whole cluster, however, a χ² value significantly smaller than the threshold is obtained, indicating that the distance set of the cluster indeed fits the expected distance distribution.

figure 4: History of the χ²-values for cluster 1 from figure 1 (χ²-value per test, in the order of the tests, against the threshold value)

Even if none of the unsuccessful candidates separately passes the test, it may be possible that some larger subset of the set of unsuccessful candidates fits the distance distribution of the current cluster. Such subsets would then yield a cluster of their own, conflicting with our definition 3.3 of a cluster (maximality). To avoid such erroneous splits of clusters, DBCLASD tries to merge neighboring clusters whenever a candidate is generated which is already assigned to some other cluster. During the generation of candidates, DBCLASD does not check whether a candidate has already been assigned to some other cluster or not. While this approach yields an elegant way of merging clusters, clearly it is computationally expensive. It is possible that some point may be switched to a different cluster several times before it is assigned to its final cluster. In the average case, however, the number of reassignments of some given point tends to be reasonably small. The algorithm terminates because of the following properties. First, each point of the database is chosen at most once as a starting point for the generation of a new cluster. Second, the generation of a single cluster terminates because, if no more candidates exist, the unsuccessful candidates are not considered again if none of them fits the current cluster.

The test of a candidate is performed as follows. First, the current cluster is augmented by the candidate. Then, we use the χ²-test to verify the hypothesis that the nearest neighbor distance set of the augmented cluster still fits the expected distance distribution.
4.3 The Algorithm

In the following, we present the algorithm DBCLASD in a pseudo code notation. Major functions are explained below. Also, important details are numbered and commented below.

    procedure DBCLASD (database db)
      initialize the points of db as being assigned to no cluster;
      initialize an empty list of candidates;
      initialize an empty list of unsuccessful candidates;
      initialize an empty set of processed points;
      for each point p of the database db do
        if p has not yet been assigned to some cluster then
          create a new cluster C and insert p into C;
          reinitialize all data structures for cluster C;
    (1)   expand cluster C by 29 neighboring points;
          for each point p1 of the cluster C do
    (2)     answers := retrieve_neighborhood(C, p1);
    (3)     update_candidates(C, answers);
          end for each point p1 of the cluster C;
          expand_cluster(C);
        end if p has not yet been assigned to some cluster;
      end for each point of the database;

(1) The χ²-test can only be applied to clusters with a minimum size of 30. Therefore, the current cluster is expanded to the size of 30 using k nearest neighbor queries without applying the χ²-test.

(2) The list of answers is sorted in ascending order of the distances to point p1.

(3) Each element of the list of answers not yet processed is inserted into the list of candidates.
    procedure expand_cluster (cluster C)
      change := TRUE;
      while change do
        change := FALSE;
        while the candidate list is not empty do
          remove the first point p from the candidate list
            and assign it to cluster C;
          if distance set of C still has the expected distribution then
            answers := retrieve_neighborhood(C, p);
            update_candidates(C, answers);
            change := TRUE;
          else
            remove p from the cluster C;
            insert p into the list of unsuccessful candidates;
          end if distance set still has the expected distribution;
        end while the candidate list is not empty;
        list of candidates := list of unsuccessful candidates;
      end while change;
    listOfPoints procedure retrieve_neighborhood (cluster C; point p)
      calculate the radius m according to (ii) in section 4.1;
      return result of circle query with center p and radius m;

    procedure update_candidates (cluster C; listOfPoints points)
      for each point in points do
        if point is not in the set of processed points then
          insert point at the tail of the list of candidates;
          insert point into the set of processed points;
        end if point is not in the set of processed points;
      end for each point in points;
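To connect the pieces, here is a hedged, runnable sketch of the candidate-testing loop in Python, combining the helpers sketched in sections 3 and 4 (nn_dist_set, approximate_area, candidate_radius, fits_expected_distribution); the data structures and names are our own simplification of the pseudo code above, not the authors' C++ implementation.

    import numpy as np
    from scipy.spatial import cKDTree

    def expand_cluster(cluster, candidates, points, processed):
        # cluster, candidates: lists of point indices; points: (n, 2) array;
        # processed: set of indices already seen as candidates.
        tree = cKDTree(points)
        unsuccessful = []
        changed = True
        while changed:
            changed = False
            while candidates:
                p = candidates.pop(0)
                cluster.append(p)  # tentatively assign p to the cluster
                pts = points[cluster]
                area = approximate_area(pts)
                if fits_expected_distribution(nn_dist_set(pts), area):
                    # Circle query around p with radius m (section 4.1).
                    m = candidate_radius(area, len(cluster))
                    for q in tree.query_ball_point(points[p], m):
                        if q not in processed:
                            processed.add(q)
                            candidates.append(q)
                    changed = True
                else:
                    cluster.pop()  # remove p again
                    unsuccessful.append(p)
            # Retry the unsuccessful candidates against the augmented cluster.
            candidates, unsuccessful = unsuccessful, []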
5. Experimental Evaluation

We evaluate DBCLASD according to the three major requirements for clustering algorithms on large spatial databases as stated in section 1. We compare DBCLASD with the clustering algorithms CLARANS and DBSCAN in terms of effectiveness and efficiency. Furthermore, we apply DBCLASD to a real database from an earthquake catalog. The evaluation is based on an implementation of DBCLASD in C++ using the R*-tree. All experiments were run on HP 735 workstations.

5.1 Choice of Comparison Partners

In the following, we explain our choice of algorithms for comparison with DBCLASD. So far, most clustering algorithms have been designed for relatively small data sets. Recently, some new algorithms have been proposed with the goal of applicability to large spatial databases. To the best of our knowledge, CLARANS is the first clustering algorithm developed for spatial databases. DBSCAN is designed to discover clusters of arbitrary shape in spatial databases with noise. BIRCH is a CF-tree based multi-phase clustering method. Both the R*-tree based focusing technique and the CF-tree based technique perform basically a preprocessing for clustering and can be used for any clustering algorithm including DBCLASD. Therefore, we compare the performance of DBCLASD with CLARANS and DBSCAN without any preprocessing.
5.2 Discovery of Clusters With Arbitrary Shape

Clusters in spatial databases may be of arbitrary shape, e.g. spherical, drawn-out, linear, elongated etc. Furthermore, the databases may contain noise. We will use visualization to evaluate the quality of the clusterings obtained by the different algorithms. In order to create readable visualizations without using color, in these experiments we used small databases. Due to space limitations, we only present the results from one typical database which was generated as follows: (1) draw three polygons of different shape (one of them with a hole) for three clusters; (2) generate 1000, 500, and 500 uniformly distributed points in the respective polygons. Then, 400 noise points (about 17% of the database) were inserted into the database, which is depicted in figure 5.

The clustering result of DBCLASD on this database is shown in figure 5. Different clusters are depicted using different symbols and noise is represented using crosses. This result shows that DBCLASD assigns nearly all points to the "correct" clusters, i.e. as they were generated. Only very few points close to the borders of a cluster are assigned to "wrong" clusters. The clustering result of DBSCAN is rather similar to the result of DBCLASD and therefore it is not depicted.

figure 5: Clustering by DBCLASD

CLARANS splits up existing clusters and even merges parts of different clusters. Thus, CLARANS, like other partitioning algorithms, cannot be used to detect clusters of arbitrary shape.
5.3 Number of Input Parameters

It is very difficult for a user to determine suitable values for the input parameters of clustering methods when the database is large. In the following, we discuss these problems for CLARANS and DBSCAN.

Ng and Han [13] discuss methods to determine the "natural" number k_nat of clusters in a database. They propose to run CLARANS once for each possible k_nat, from 2 to n, where n is the total number of objects in the database. For each of the discovered clusterings, the silhouette coefficient, a numerical value indicating the quality of the clustering, is calculated, and finally, the clustering with the maximum silhouette coefficient is chosen as the "natural" clustering. Unfortunately, the run time of this approach is prohibitive for large n, because it implies n − 1 calls of CLARANS. Furthermore, if the shape of the clusters is not convex, then there exists no k_nat for CLARANS.

DBSCAN requires two parameters. A simple heuristic is also proposed for DBSCAN which is effective in many cases to determine the parameters Eps and MinPts of the "thinnest" cluster in the database. The heuristic helps the user to manually choose Eps and MinPts through the visualization of the k-dist graph, i.e. the distance to the k-th nearest neighbor for the points in the database. However, this heuristic also has the following drawbacks. First, a k-nearest neighbor query is required for each point, which is inefficient for very large databases. Second, the visualization of k-dist graphs is limited to relatively small databases due to the limited resolution of the screen. This limitation can be overcome by restricting the k-dist graph to a sample of the database, but then the accuracy of the parameter estimation may decrease significantly.
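For illustration, the k-dist computation behind this heuristic can be sketched in a few lines; this is our own illustration, with k = 4 being the value suggested for 2-dimensional data in [6].

    import numpy as np
    from scipy.spatial import cKDTree

    def k_dist_graph(points, k=4):
        tree = cKDTree(points)
        # k + 1 because the nearest hit for each point is the point itself.
        dists, _ = tree.query(points, k=k + 1)
        # Sorted in descending order; the user inspects this curve to read
        # off Eps for the "thinnest" cluster.
        return np.sort(dists[:, k])[::-1]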
To conclude, providing correct parameter values for clustering algorithms is a problem for which no general solution exists. DBCLASD, however, detects clusters of arbitrary shape without requiring any input parameters.
5.4 Efficiency

In the following, we compare DBCLASD with CLARANS and DBSCAN with respect to efficiency on synthetic databases. Since the runtime of CLARANS is very large, we had to restrict this comparison to relatively small test databases with the number of points ranging from 5,000 to 25,000. The run times for DBCLASD, DBSCAN and CLARANS on these test databases are listed in table 1.

Table 1: Run Time Comparison (sec)

Number of Points    5000    10000    15000     20000     25000
DBCLASD               77      159      229       348       457
DBSCAN                51       83      134       183       223
CLARANS             5031    22666    64714    138555    250790

The run time of DBCLASD is roughly twice the run time of DBSCAN, while DBCLASD outperforms CLARANS by a factor of at least 60.

We generated seven large synthetic test databases with 5000, 25000, 100000, 200000, 300000, 400000 and 500000 points to test the efficiency and scalability of DBCLASD and DBSCAN w.r.t. an increasing number of points. The run times are plotted in figure 6, which shows that both DBSCAN and DBCLASD have a good scalability w.r.t. the size of the database. The run time of DBCLASD is roughly three times the run time of DBSCAN.

figure 6: Scalability w.r.t. increasing number of points (run time in sec of DBSCAN and DBCLASD for 5,000 to 500,000 points)
5.5 Application to an Earthquake Database

We consider the problem of detecting seismic faults based on an earthquake catalog. The idea is that earthquake epicenters occur along seismically active faults, and are measured with some error. So, over time, observed earthquake epicenters should be clustered along such faults. We considered an earthquake catalog recorded over a 40,000 km² region of the central coast ranges in California from 1962 to 1981 [11]. The left part of figure 7 shows the locations of the earthquakes recorded in the database.

The right part of figure 7 shows the clustering obtained by DBCLASD. The algorithm detects 4 clusters corresponding to the 2 main seismic faults without requiring any input parameters. This example illustrates that DBCLASD also works effectively on real databases where the data are not as strictly uniformly distributed as the synthetic data.

figure 7: California earthquake database (recorded earthquake locations on the left, the clustering discovered by DBCLASD on the right)

6. Conclusions
The application of clustering algorithms to large spatial databases raises the following requirements: (1) minimal number of input parameters, (2) discovery of clusters with arbitrary shape and (3) efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements.

In this paper, we introduce the new clustering algorithm DBCLASD which is designed to fulfil the combination of these requirements. Our notion of a cluster is based on the distance of the points of a cluster to their nearest neighbors, and we analyze the expected distribution of these distances for a cluster. The analysis in this paper is based on the assumption that the points inside of a cluster are uniformly distributed. This assumption is quite reasonable for many applications. DBCLASD incrementally augments an initial cluster by its neighboring points as long as the nearest neighbor distance set of the resulting cluster still fits the expected distribution. The retrieval of neighboring points is based on region queries, which are efficiently supported by spatial access methods such as R*-trees.

Experiments on both synthetic and real data demonstrate that DBCLASD, contrary to partitioning clustering algorithms such as CLARANS, discovers clusters of arbitrary density, shape, and size. Furthermore, DBCLASD discovers the intuitive clusters without requiring any input parameters, in contrast to the algorithm DBSCAN which needs two input parameters. In terms of efficiency, DBCLASD is between CLARANS and DBSCAN, close to DBSCAN. Thus, the efficiency of DBCLASD on large spatial databases is very attractive when considering its nonparametric nature and its good quality for clusters of arbitrary shape. Furthermore, we applied DBCLASD to a real database. The result indicates that DBCLASD also works effectively on real databases where the data is not exactly uniformly distributed.

Future research will have to consider the following issues. The analysis in this paper is based on the assumption that the points inside of a cluster are uniformly distributed. Other distributions of the points should be investigated. Furthermore, we will consider the case of unknown distributions and explore the use of algorithms from the area of stochastic approximation in order to estimate the unknown distribution.

The application of DBCLASD to high-dimensional feature spaces (e.g. in CAD databases or in multi-media databases) seems to be especially useful because in such spaces it is nearly impossible to manually provide appropriate values for the input parameters required by the well-known clustering algorithms. Such applications will be investigated in the future.

Acknowledgments

We thank Abhijit Dasgupta, University of Washington, for providing both the earthquake data and the specification of the minefield data.

References

[1] Allard D. and Fraley C.: "Non Parametric Maximum Likelihood Estimation of Features in Spatial Point Processes Using Voronoi Tessellation", Journal of the American Statistical Association, to appear in December 1997. Available at http://www.stat.washington.edu/tech.reports/tr293R.ps.
[2] Beckmann N., Kriegel H.-P., Schneider R., Seeger B.: "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, 1990, pp. 322-331.
[3] Banfield J. D. and Raftery A. E.: "Model-based Gaussian and non-Gaussian clustering", Biometrics 49, September 1993, pp. 803-821.
[4] Byers S. and Raftery A. E.: "Nearest Neighbor Clutter Removal for Estimating Features in Spatial Point Processes", Technical Report No. 305, Department of Statistics, University of Washington. Available at http://www.stat.washington.edu/tech.reports/tr295.ps.
[5] Devore J. L.: "Probability and Statistics for Engineering and the Sciences", Duxbury Press, 1991.
[6] Ester M., Kriegel H.-P., Sander J., Xu X.: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, AAAI Press, 1996.
[7] Ester M., Kriegel H.-P., Xu X.: "Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification", Proc. 4th Int. Symp. on Large Spatial Databases, Portland, ME, 1995, in: Lecture Notes in Computer Science, Vol. 951, Springer, 1995, pp. 67-82.
[8] Fayyad U. M., Piatetsky-Shapiro G., Smyth P.: "From Data Mining to Knowledge Discovery: An Overview", in: Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, 1996, pp. 1-34.
[9] Gueting R. H.: "An Introduction to Spatial Database Systems", The VLDB Journal, Vol. 3, No. 4, October 1994, pp. 357-399.
[10] Kaufman L., Rousseeuw P. J.: "Finding Groups in Data: An Introduction to Cluster Analysis", John Wiley & Sons, 1990.
[11] McKenzie M., Miller R., and Uhrhammer R.: "Bulletin of the Seismographic Stations", University of California, Berkeley, Vol. 53, No. 1-2.
[12] Muise R. and Smith C.: "Nonparametric minefield detection and localization", Technical Report CSS-TM-591-91, Naval Surface Warfare Center, Coastal Systems Station.
[13] Ng R. T., Han J.: "Efficient and Effective Clustering Methods for Spatial Data Mining", Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, 1994, pp. 144-155.
[14] Zhang T., Ramakrishnan R., Livny M.: "BIRCH: An Efficient Data Clustering Method for Very Large Databases", Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 103-114.
