
Spatio-Temporal Outlier Detection in Large Databases: Derya Birant, Alp Kut

This document proposes a three-step approach to detect spatio-temporal outliers in large databases. The steps are clustering, checking spatial neighbors, and checking temporal neighbors. It introduces a new outlier detection algorithm that has the ability to discover outliers according to non-spatial, spatial, and temporal attribute values. The algorithm first identifies spatial outliers using a modified DBSCAN clustering algorithm, then checks temporal neighbors to identify temporal outliers, thus detecting spatio-temporal outliers. It combines the advantages of clustering-based and density-based outlier detection approaches while overcoming limitations such as inability to consider temporal aspects.


Journal of Computing and Information Technology - CIT 14, 2006, 4, 291–297

doi:10.2498/cit.2006.04.04

Spatio-Temporal Outlier Detection in Large Databases

Derya Birant, Alp Kut


Dokuz Eylul University, Department of Computer Engineering, Izmir, Turkey

Outlier detection is one of the major data mining methods. This paper proposes a three-step approach to detect spatio-temporal outliers in large databases. These steps are clustering, checking spatial neighbors, and checking temporal neighbors. In this paper, we introduce a new outlier detection algorithm to find small groups of data objects that are exceptional when compared with the remaining large amount of data. In contrast to the existing outlier detection algorithms, the new algorithm has the ability of discovering outliers according to the non-spatial, spatial and temporal values of the objects. In order to demonstrate the new algorithm, this paper also presents an example of application using a data warehouse.

Keywords: outlier detection, data mining, spatio-temporal data, data warehouse.

1. Introduction

Spatio-temporal databases are growing very rapidly, both in size and in number. This condition results in an increasing need for knowledge discovery in spatio-temporal databases. Most studies in KDD (Knowledge Discovery in Databases) focus on finding the common patterns. However, finding the outliers (rare events or exceptional cases) may be more interesting and useful than finding the common patterns.

Outliers can be defined as observations which appear to be inconsistent with the remainder of the dataset. They deviate too much from other observations. Outlier detection is a data mining technique like classification, clustering, and association rules. A Spatial Outlier (S-Outlier) is an object whose non-spatial attribute value is significantly different from the values of its spatial neighbors. Recently, a few studies have been conducted on spatial outlier detection for large datasets [4][16][19]. However, most of these studies don't consider temporal aspects. Temporal outlier detection should also be considered in many applications such as geographic phenomena-based applications. A Temporal Outlier (T-Outlier) is an object whose non-spatial attribute value is significantly different from those of other objects in its temporal neighborhood. The studies [1][2] focus on the representation of a time location which deviates too much from its temporal neighbors.

This paper combines S-Outlier and T-Outlier definitions to define a Spatio-Temporal Outlier (ST-Outlier) to be an object whose non-spatial attribute value is significantly different from those of other objects in its spatial and temporal neighborhoods. For many applications, identification of ST-Outliers can lead to the discovery of unexpected, interesting, and implicit knowledge.

This paper focuses on the question of how ST-Outliers can be detected. It presents a new outlier detection algorithm which is based on the DBSCAN clustering algorithm since clustering is a basic method for spatial outlier detection. From the viewpoint of a clustering algorithm, outliers are objects not located in any cluster. Furthermore, if a cluster is significantly different from other clusters, the objects in this cluster might be potential S-Outliers.

The algorithm proposed in this study first identifies S-Outliers and then T-Outliers. However, the identification of first T-Outliers and then S-Outliers yields the same result. So ST-Outliers and TS-Outliers are identical.

The rest of the paper is organized as follows. Section 2 describes related works on the problem of outlier detection. Section 3 explains
our algorithm to detect ST-Outliers. Section 4 presents performance evaluation of the algorithm. Section 5 shows the sensitivity analyses of the parameters of the algorithm. Using a real-world dataset, Section 6 presents an application to demonstrate our solution and shows the data mining results. Finally, the conclusion is given in Section 7.

2. Outlier Detection Approaches

The existing approaches to outlier detection can be classified into five categories: distribution-based, clustering-based, depth-based, distance-based, and density-based [14][16].

Distribution-based approaches use standard statistical distribution. They deploy some standard distribution model (e.g. Normal, Poisson, etc.) and recognize as outliers those points which deviate from the model [3]. However, for many KDD applications, the underlying distribution is unknown. Often a large number of tests are required in order to decide which distribution model fits the arbitrary dataset best, if any. Fitting the data with standard distributions is costly, and may not produce satisfactory results.

Clustering-based approaches detect outliers as by-products [10]. Some clustering algorithms such as CLARANS [15], DBSCAN [5][6], CURE [7] have the capability of handling exceptions. However, since the main objective of the clustering algorithms is to discover clusters, they are not developed to optimize outlier detection.

Depth-based approaches are based on computational geometry and compute different layers of k-d convex hulls [11][17]. Outliers are more likely to be data objects with smaller depths. However, in practice, this technique becomes inefficient for large datasets (k ≥ 4). Depth-based approach is also applied for spatial outlier detection [2].

Distance-based methods use a distance metric to measure the distances among the data points [12][13]. Problems may occur if the parameters of the data are very different from each other in different regions of the data set.

Density-based approach was proposed by M. Breunig, et al. [4]. This method assigns a Local Outlier Factor (LOF) to each sample based on their local neighborhood density. Samples with high LOF value are identified as outliers. The neighborhood is defined by using MinPts parameter.

One drawback of the existing methods is that they don't consider temporal aspects. In this paper, we propose a ST-Outlier detection algorithm to overcome this disadvantage. Our algorithm combines the advantages of the clustering-based and density-based approaches.

3. ST-Outlier Detection Algorithm

In our algorithm, a three-step approach is proposed to identify the spatio-temporal outliers. These steps are: clustering, checking spatial neighbors, and checking temporal neighbors.

3.1. Clustering

Clustering is a basic method to detect potential S-Outliers. From the viewpoint of a clustering algorithm, potential outliers are objects not located in any cluster. Furthermore, if a cluster is significantly different from other clusters, the objects in this cluster might be potential S-Outliers.

A clustering algorithm should satisfy three important requirements: (i) discovery of clusters with arbitrary shape (ii) good efficiency on large databases and (iii) some heuristics to determine the input parameters. DBSCAN algorithm satisfies all these requirements. But it doesn't consider temporal aspects and it can't detect some outliers when clusters have different densities. In order to overcome these disadvantages and in order to detect ST-Outliers from a dataset, we improved DBSCAN clustering algorithm in two important directions. The reason of the first modification is to support temporal aspects. In our algorithm, a tree is traversed to find both spatial and temporal neighbors of any object within a given radius. The second modification is necessary to find outliers when clusters have different densities. Our algorithm assigns a density factor to each cluster. Density factor is the degree of the density of the cluster. The algorithm also compares the average value of a cluster with the new coming value.

DBSCAN clustering algorithm needs two input parameters to define the notion of density: Eps and MinPts. The input parameter Eps is a radius value and it is based on a distance metric such as Manhattan, Euclidean etc. The second input parameter MinPts specifies the minimum number of points that should occur within Eps radius. While DBSCAN algorithm needs two inputs, our algorithm requires four input parameters: Eps1, Eps2, MinPts, and Δε. While Eps1 is the distance parameter for spatial attributes, Eps2 is the distance parameter for non-spatial attributes. MinPts is the minimum number of points within Eps1 and Eps2 distance of a point. If a region is dense, then it should contain more points than MinPts value. In [5], a simple heuristic is presented to determine the parameters Eps and MinPts. The last parameter Δε is used to prevent the discovery of combined clusters that would otherwise arise from small differences in the values of neighboring locations.
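
The roles of Eps1, Eps2 and MinPts can be illustrated with a minimal sketch. It is not the authors' implementation: the `Obs` record, the function names, the use of `<=` for the radius test, and the choice to exclude the query point from its own neighborhood are all assumptions made for illustration, and a linear scan stands in for the index structure described later in the paper.

```python
import math
from typing import NamedTuple


class Obs(NamedTuple):
    """Hypothetical record: two spatial attributes and one non-spatial value."""
    lat: float
    lon: float
    value: float


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_neighbors(data, idx, eps1, eps2):
    """Indices of objects within eps1 (spatial) AND eps2 (non-spatial) of data[idx].

    Assumption: the query object itself is not counted as its own neighbor.
    """
    p = data[idx]
    return [
        j
        for j, q in enumerate(data)
        if j != idx
        and euclidean((p.lat, p.lon), (q.lat, q.lon)) <= eps1
        and euclidean((p.value,), (q.value,)) <= eps2
    ]


def is_dense(data, idx, eps1, eps2, min_pts):
    """True if data[idx] has at least min_pts neighbors under both radii."""
    return len(retrieve_neighbors(data, idx, eps1, eps2)) >= min_pts


# Made-up observations: the third is spatially close but differs in value,
# the fourth has a similar value but is spatially far away.
data = [Obs(0.0, 0.0, 1.0), Obs(0.5, 0.0, 1.1), Obs(0.5, 0.5, 3.0), Obs(5.0, 5.0, 1.0)]
print(retrieve_neighbors(data, 0, eps1=1.0, eps2=0.25))  # -> [1]
```

In this sketch an object must pass both distance tests to count as a neighbor, which is why an object that is close in space but very different in value does not contribute to the density of the region.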
In order to discuss whether a set of points is similar enough to be considered a cluster, we need a distance measure Dist(i, j) which tells how far points i and j are. The most common distance measures are Manhattan, Euclidean, and Minkowski distance. In our algorithm, the Euclidean formula is used two times to calculate two different distance metrics: Eps1 (for spatial values) and Eps2 (for non-spatial values). Euclidean distance is defined as follows:

$$\mathrm{Dist}(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \dots + |x_{in}-x_{jn}|^2} \qquad (1)$$

where $i = (x_{i1}, x_{i2}, \dots, x_{in})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jn})$ are two n-dimensional data objects.

As shown in Figure 1, the algorithm starts with the first point in database D. After processing this point, it selects the next point in D (i). If the selected object doesn't belong to any cluster (ii), Retrieve Neighbors function is called (iii). A call of Retrieve Neighbors(object, Eps1, Eps2) returns the objects that have a distance less than Eps1 and Eps2 parameters to the selected object. If the total number of returned points is smaller than MinPts input, the object is assigned as outlier (iv). The points which have been marked as outliers may be changed later. This happens for border points of a cluster.

Fig. 1. ST-Outlier Detection Algorithm.

If the selected point has enough neighbors within Eps1 and Eps2 distances, then a new cluster is constructed (v). Then all neighbors within Eps1 and Eps2 radiuses of this object are also marked with new cluster label (vi). Then the algorithm iteratively collects all reachable objects from neighbors by using a stack (vii). If the object is not marked as outlier, or it is not in a cluster, and the difference between the average value of the cluster and the new coming value is smaller than Δε, it is placed into the current cluster (viii).

After processing the selected point, the algorithm selects the next point in D and algorithm continues iteratively until all of the points have been processed. Checking spatial neighbors and checking temporal neighbors functions are described in sections 3.2 and 3.3.

When the algorithm searches the neighbors of any object by using Retrieve Neighbors function, it takes into consideration both spatial and temporal neighborhoods. The non-spatial value of an object is compared with the non-spatial values of spatial neighbors and also with the values of temporal neighbors (previous day, next day in the same year and the same day in other years).
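
Steps (i) to (viii) above can be sketched in a few lines of Python. This is one reading of the prose description, not the authors' code: the function and variable names are hypothetical, neighbors are found by a linear scan instead of a tree, temporal neighborhoods and the density factor are left out, and two details are assumptions (objects previously marked as outliers may be absorbed by a cluster, and the Δε test uses the running average of the values already in the cluster).

```python
import math

NOISE = -1        # label for objects marked as outliers
UNVISITED = None  # label for objects not processed yet


def dist(a, b):
    """Euclidean distance, Eq. (1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_neighbors(points, values, i, eps1, eps2):
    """Objects within eps1 in space and eps2 in non-spatial value of object i."""
    return [
        j
        for j in range(len(points))
        if j != i
        and dist(points[i], points[j]) <= eps1
        and abs(values[i] - values[j]) <= eps2
    ]


def st_cluster(points, values, eps1, eps2, min_pts, delta_eps):
    """Return one label per object: a cluster id (1, 2, ...) or NOISE."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):                        # (i) next point in D
        if labels[i] is not UNVISITED:                  # (ii) already handled
            continue
        neighbors = retrieve_neighbors(points, values, i, eps1, eps2)  # (iii)
        if len(neighbors) < min_pts:                    # (iv) mark as outlier
            labels[i] = NOISE
            continue
        cluster_id += 1                                 # (v) new cluster
        labels[i] = cluster_id
        members = [i]
        stack = []
        for j in neighbors:                             # (vi) label neighbors
            if labels[j] in (UNVISITED, NOISE):
                labels[j] = cluster_id
                members.append(j)
                stack.append(j)
        while stack:                                    # (vii) expand via stack
            cur = stack.pop()
            cur_neighbors = retrieve_neighbors(points, values, cur, eps1, eps2)
            if len(cur_neighbors) < min_pts:
                continue                                # border point, no expansion
            for j in cur_neighbors:
                if labels[j] not in (UNVISITED, NOISE):
                    continue                            # keeps its first cluster
                cluster_avg = sum(values[m] for m in members) / len(members)
                if abs(cluster_avg - values[j]) <= delta_eps:          # (viii)
                    labels[j] = cluster_id
                    members.append(j)
                    stack.append(j)
    return labels


# Made-up example: four nearby points with similar values and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
vals = [1.0, 1.1, 0.9, 1.0, 1.0]
print(st_cluster(pts, vals, eps1=1.5, eps2=0.5, min_pts=2, delta_eps=0.5))
# -> [1, 1, 1, 1, -1]
```

The Δε test in step (viii) is what keeps a chain of slowly drifting values from being merged into a single cluster: a candidate is admitted only if its value stays close to the average of the cluster built so far.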
If two clusters C1 and C2 are very close to each other and a point p is the border point of both C1 and C2, then the algorithm assigns point p to the cluster discovered first.

3.2. Checking Spatial Neighbors

From the viewpoint of a clustering algorithm, potential outliers are objects not located in any cluster. In the previous step, potential outliers were detected when the data was clustered. In this step, these potential outliers are checked to verify whether these objects are actually S-Outliers or not. During the verification, the background knowledge (the characteristic) of the data is required. If no prior knowledge about the data is available, some methods such as neural network can be applied to handle it. Furthermore, if a cluster is significantly different from other clusters, the objects in this cluster might be S-Outliers. Thus this step also checks all clusters identified in the previous step to decide whether the cluster is S-Outlier or not. The formula used to verify S-Outliers is defined in definition 1.

Definition 1. Given a database of n data objects $D = \{o_1, o_2, \dots, o_n\}$. Assume that the object o is detected as potential outlier in clustering. The average value of the spatial neighbors of o within Eps1 radius is defined as

$$A \stackrel{\mathrm{def}}{=} \frac{o_{\mathrm{neigh}.1} + o_{\mathrm{neigh}.2} + \dots + o_{\mathrm{neigh}.m}}{m} \qquad (2)$$

where m is the number of spatial neighbors of o within Eps1 radius and the standard deviation for the object o is defined as $\sigma = \sqrt{V}$, where

$$V \stackrel{\mathrm{def}}{=} \frac{(o_{\mathrm{neigh}.1} - A)^2 + (o_{\mathrm{neigh}.2} - A)^2 + \dots + (o_{\mathrm{neigh}.m} - A)^2}{m}. \qquad (3)$$

The object o is classified as an S-Outlier if it is outside the interval [L, U] (i.e., if either o < L or o > U), where $L \stackrel{\mathrm{def}}{=} A - k_0 \cdot \sigma$, $U \stackrel{\mathrm{def}}{=} A + k_0 \cdot \sigma$, and $k_0 > 1$ is some pre-selected value.

3.3. Checking Temporal Neighbors

This step checks the temporal neighbors of the S-Outliers identified in the previous step. Two objects are temporal neighbors if the values of these objects are observed in consecutive time units such as consecutive days in the same year or in the same day in consecutive years. During the application of the algorithm, a tree is traversed to find the temporal neighbor objects of any object.

In order to support temporal aspects, S-Outliers are compared to other objects of the same local area, but in different times. In comparison operation, spatio-temporal data is first filtered by retaining only the temporal neighbors and their corresponding values. If the characteristic value of an S-Outlier does not have significant differences from its temporal neighbors, this is not an ST-Outlier. Otherwise, it is confirmed as an ST-Outlier. The formula used to detect ST-Outliers is similar to formula defined in definition 1. In this case temporal neighbors are checked instead of spatial neighbors.

4. Performance Evaluation

The average runtime complexity of the DBSCAN algorithm is O(n*logn), where n is the number of objects in the database. DBSCAN has been proven in its ability of processing very large datasets [5][6]. The algorithm yields significant speed-up factors, even for large numbers of data in the database. The paper [6] shows that the runtime of other clustering algorithms such as CLARANS [15], DBCLASD [20] is between 1.5 and 3 times the runtime of DBSCAN. This factor increases with increasing size of the database. Our modifications do not change the runtime complexity of the algorithm.

As in all databases, fast access to raw data in spatio-temporal databases depends on the structural organization of the stored information and on the availability of suitable indexing methods. While a well designed data structure makes it possible to rapidly extract the desired information from a set of data, suitable indexing methods make it possible to quickly locate single or multiple objects [1]. Well known spatial indexing techniques include Quadtrees [18], R-Trees [8] and others, see [9] for an overview. In our study, we made an improvement of the R-Tree indexing method to handle spatio-temporal information. We created some nodes in R-Tree for each spatial object and linked them in temporal order.
During the application of the algorithm, this tree is traversed to find the spatial or temporal neighbor objects of any object.

In addition to spatial index structure, some filters should also be used to reduce the search space for spatial data mining algorithms. These filters allow the operations on neighborhood paths by reducing the number of paths actually created. They are necessary to speed up the processing of queries.

5. Sensitivity Analyses of the Parameters

Sensitivity analysis is used to determine how a given algorithm output depends upon the input parameters. Most sensitivity analyses involve changing one parameter at a time. A sensitivity analysis was conducted by changing each parameter value by +/−10%. According to the heuristic defined in [5], the input parameters should be assigned as Eps1=1, Eps2=0.25, and MinPts=15. Figure 2 shows the sensitivity of our algorithm to the Eps1, Eps2 and MinPts parameters with respect to the number of noise points.

Fig. 2. The sensitivity analyses results.

According to the results of the tests, even where there is a huge variation of Eps1 parameter, there is little influence on the results. The number of noise points continuously increases when the value of Eps1 parameter decreases. But, there is no change in the number of noise points when the value of Eps1 parameter increases. So Eps1 parameter has little influence on the algorithm output. The results show that Eps2 parameter is the most sensitive parameter. Decreasing the value of the Eps2 parameter greatly increases the number of noise points. The output of the algorithm depends heavily on the reliability of this parameter. The results also show that the output of the algorithm is only slightly sensitive to changes in the MinPts parameter values.

6. Application

This section demonstrates how our algorithm detects ST-Outliers by using a real-world dataset. The purpose of the application is to detect rare events and exceptional cases related with sea waves in years between 1992 and 2002.

6.1. Dataset

We designed a spatio-temporal data warehouse which contained wave height values of four seas: the Black Sea, the Marmara Sea, the Aegean Sea, and the east of the Mediterranean Sea. These seas surround Turkey from the north, west, and south. The geographical coordinates of our work area are 30° to 47.5° north latitude and 17.0° to 42.5° east longitude. Dataset has been provided from Topex/Poseidon Satellite [21]. Topex/Poseidon data are released by NASA and CNES. Wave heights are measured in meters. Dataset has approximately six million rows of record. It has the following columns: StationID, RegionID, Year, Month, Day of the record, Latitude of the station, Longitude of the station, WaveHeight value, and ClusterID. Whereas the column, StationID, identifies the geographic location of monitoring station, RegionID identifies the name of the sea, ClusterID identifies a particular cluster of stations.

6.2. Implementation Details and Results

During the implementation, we first clustered the dataset to find the regions that have similar sea wave height characteristics. The input parameters were designated as Eps1 = 1, Eps2 = 0.25, and MinPts=15. Second, we checked all noise points and all clusters identified in the previous step to determine S-Outliers.
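
The check applied in this second step is the interval test of Definition 1, and Section 3.3 states that a similar formula is applied to temporal neighbors in the third step. A minimal sketch of both checks follows. The function names, the default k0 = 2.0, and all neighbor values are hypothetical choices for illustration; only the 6 meter wave height echoes the value reported for this application.

```python
import math


def is_outlier(value, neighbor_values, k0=2.0):
    """Definition 1: value lies outside [A - k0*sigma, A + k0*sigma].

    A is the mean of the neighbor values (Eq. 2) and sigma is the square
    root of their population variance V (Eq. 3). k0 > 1 is pre-selected;
    the default of 2.0 is an assumption made for this sketch.
    """
    m = len(neighbor_values)
    if m == 0:
        raise ValueError("at least one neighbor value is required")
    a = sum(neighbor_values) / m                          # Eq. (2)
    v = sum((x - a) ** 2 for x in neighbor_values) / m    # Eq. (3)
    sigma = math.sqrt(v)
    lower, upper = a - k0 * sigma, a + k0 * sigma
    return value < lower or value > upper


def is_st_outlier(value, spatial_values, temporal_values, k0=2.0):
    """An S-Outlier is confirmed as an ST-Outlier only if it also deviates
    from its temporal neighbors (same interval test, different neighbors)."""
    return is_outlier(value, spatial_values, k0) and is_outlier(value, temporal_values, k0)


# Hypothetical wave heights in meters.
spatial = [1.0, 1.2, 0.8, 1.1, 0.9]      # nearby locations, same day
temporal = [1.5, 2.0, 1.8, 1.7]          # same location, other days/years
print(is_outlier(6.0, spatial))              # -> True
print(is_st_outlier(6.0, spatial, temporal))  # -> True
```

If the temporal neighbors were themselves around 6 meters, the second test would fail and the object would remain an S-Outlier only, which matches the filtering role that Section 3.3 assigns to the temporal check.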
The region which is circled in dashed lines in Figure 3 had significantly high wave height values on January 24, 1998. The approximate wave height value in this region is 6 meters. So it contains S-Outliers. Third, we checked the temporal neighbors. We compared the wave height values of S-Outliers with other data points of the same location, but in different times. Figure 4 shows the wave height values of the same region in different years. We detected that the region circled in dashed lines in Figure 3 had extreme wave height values in 1998. But, in other years, wave height values of this region were not too high. For this reason, the objects in this region are confirmed as ST-Outliers.

Fig. 3. The region circled in dashed lines contains S-Outliers (January 24, 1998).

Fig. 4. The wave height values of the same region on the same day, but in different years.

7. Conclusions

This paper proposes a three-step approach to detect spatio-temporal outliers in large databases. These steps are clustering, checking spatial neighbors to identify spatial outliers, and checking temporal neighbors to identify spatio-temporal outliers. The paper introduces a new outlier detection algorithm. According to the performance tests of the algorithm, it has the ability of processing very large datasets. According to the results of the sensitivity analysis of the input parameters, Eps2 parameter is the most sensitive parameter. The example presented in Section 6 demonstrates that our algorithm appears to be very promising when spatio-temporal outliers need to be detected.

References

[1] T. ABRAHAM, J. F. RODDICK, Survey of Spatio-Temporal Databases, GeoInformatica (Springer) 1999; 3 (1), pp. 61–99.
[2] N. R. ADAM, V. P. JANEJA, V. ATLURI, Neighbourhood-Based Detection of Anomalies in High Dimension Spatio-Temporal Sensor Datasets, ACM Symposium on Applied Computing, Nicosia Cyprus; 2004. pp. 576–583.
[3] V. BARNETT, T. LEWIS, Outliers in Statistical Data, New York: John Wiley; 1994.
[4] M. M. BREUNIG, H-P. KRIEGEL, R. NG, J. SANDER, LOF: Identifying Density-Based Local Outliers, ACM SIGMOD Int. Conf. on Management of Data, Dallas, TX; 2000, pp. 93–104.
[5] M. ESTER, H-P. KRIEGEL, J. SANDER, X. XU, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, In: Proceedings of the 2nd Int. Conference on Knowledge Discovery and Data Mining, Portland, OR; 1996.
[6] M. ESTER, H-P. KRIEGEL, J. SANDER, X. XU, Clustering for Mining in Large Spatial Databases, KI-Journal (Artificial Intelligence), Special Issue on Data Mining 1998; 12 (1), pp. 18–24.
[7] S. GUHA, R. RASTOGI, K. SHIM, CURE: An Efficient Clustering Algorithm for Large Databases, In: Proc. ACM SIGMOD Int. Conf. on Management of Data, Seattle, WA; 1998. pp. 73–84.
[8] A. GUTTMAN, R-trees: A Dynamic Index Structure for Spatial Searching, In: Proceedings of ACM SIGMOD Int. Conf. on Management of Data, Boston, Massachusetts; 1984. pp. 47–57.
[9] R. H. GUTING, An Introduction to Spatial Database Systems, VLDB Journal, 1994; 3(4), pp. 357–399.
[10] A. JAIN, M. MURTY, P. FLYNN, Data Clustering: A Review, ACM Computing Surveys, 1999, 31(3), pp. 264–323.
[11] T. JOHNSON, I. KWOK, R. NG, Fast Computation of 2-Dimensional Depth Contours, In: Proc. 4th. Int. Conf. on KDD, New York, NY, 1998, pp. 224–228.
[12] E. M. KNORR, R. T. NG, Algorithms for Mining Distance-Based Outliers in Large Datasets, In: Proc. 24th Int. Conf. Very Large Data Bases, New York, NY; 1998, pp. 392–403.
[13] E. M. KNORR, R. T. NG, V. TUCAKOV, Distance-Based Outliers: Algorithms and Applications, Journal: Very Large Data Bases, 2000, 8 (3-4), pp. 237–253.
[14] L. KOVÁCS, D. VASS, A. VIDÁCS, Improving Quality of Service Parameter Prediction with Preliminary Outlier Detection and Elimination, In: Proc. 2nd Int. Workshop on Inter-Domain Performance and Simulation, Budapest, Hungary, 2004, pp. 194–199.
[15] R. T. NG, J. HAN, Efficient and Effective Clustering Methods for Spatial Data Mining, In: Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, 1994, pp. 144–155.
[16] S. PAPADIMITRIOU, C. FALOUTSOS, Cross-Outlier Detection, In: Proc. 8th International Symposium on Spatial and Temporal Databases, Greece, 2003, pp. 199–213.
[17] I. RUTS, P. ROUSSEEUW, Computing Depth Contours of Bivariate Point Clouds, Journal of Computational Statistics and Data Analysis, 1996, 23(1996), pp. 153–168.
[18] H. SAMET, The Design and Analysis of Spatial Data Structures, MA, Addison-Wesley, 1990.
[19] S. SHEKHAR, C-T. LU, P. ZHANG, A Unified Approach to Detecting Spatial Outliers, GeoInformatica, Kluwer Academic Publishers 2003, 7 (2), pp. 139–166.
[20] X. XU, M. ESTER, H-P. KRIEGEL, J. SANDER, A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases, In: Proceedings of IEEE International Conference on Data Engineering, Orlando, Florida, 1998. pp. 324–331.
[21] Topex/Poseidon Satellite Description, http://podaac.jpl.nasa.gov/woce/ [15/09/2005]

Received: June, 2006
Accepted: September, 2006

Contact addresses:

Derya Birant
Dokuz Eylul University
Department of Computer Engineering
35100, Izmir
Turkey
[email protected]

Alp Kut
Dokuz Eylul University
Department of Computer Engineering
35100, Izmir
Turkey
[email protected]

DERYA BIRANT received her PhD in computer engineering from Dokuz Eylul University in 2006. Currently, she is a research assistant at the Department of Computer Engineering, Dokuz Eylul University in Turkey. Her research interests include data mining in large databases, data warehousing, parallel computing, web systems modeling and engineering.

ALP KUT is a full professor of computer engineering at Dokuz Eylul University. He has been head of the Department of Computer Engineering of Dokuz Eylul University since the fall of 2003. His research interests include data mining in databases, database management systems and distributed systems. He has many publications on a variety of topics, including web-based systems and parallel systems.
