Spatio-Temporal Outlier Detection in Large Databases: Derya Birant, Alp Kut
doi:10.2498/cit.2006.04.04
Outlier detection is one of the major data mining methods. This paper proposes a three-step approach to detect spatio-temporal outliers in large databases. These steps are clustering, checking spatial neighbors, and checking temporal neighbors. In this paper, we introduce a new outlier detection algorithm to find small groups of data objects that are exceptional when compared with the remaining large amount of data. In contrast to the existing outlier detection algorithms, the new algorithm is able to discover outliers according to the non-spatial, spatial and temporal values of the objects. In order to demonstrate the new algorithm, this paper also presents an example application using a data warehouse.

Keywords: outlier detection, data mining, spatio-temporal data, data warehouse.

1. Introduction

Spatio-temporal databases are growing very rapidly, both in size and in number. This growth results in an increasing need for knowledge discovery in spatio-temporal databases. Most studies in KDD (Knowledge Discovery in Databases) focus on finding the common patterns. However, finding the outliers (rare events or exceptional cases) may be more interesting and useful than finding the common patterns.

Outliers can be defined as observations which appear to be inconsistent with the remainder of the dataset. They deviate too much from other observations. Outlier detection is a data mining technique like classification, clustering, and association rules. A Spatial Outlier (S-Outlier) is an object whose non-spatial attribute value is significantly different from the values of its spatial neighbors. Recently, a few studies have been conducted on spatial outlier detection for large datasets [4][16][19]. However, most of these studies do not consider temporal aspects. Temporal outlier detection should also be considered in many applications such as geographic phenomena-based applications. A Temporal Outlier (T-Outlier) is an object whose non-spatial attribute value is significantly different from those of other objects in its temporal neighborhood. The studies [1][2] focus on the representation of a time location which deviates too much from its temporal neighbors.

This paper combines the S-Outlier and T-Outlier definitions to define a Spatio-Temporal Outlier (ST-Outlier) to be an object whose non-spatial attribute value is significantly different from those of other objects in its spatial and temporal neighborhoods. For many applications, identification of ST-Outliers can lead to the discovery of unexpected, interesting, and implicit knowledge.

This paper focuses on the question of how ST-Outliers can be detected. It presents a new outlier detection algorithm which is based on the DBSCAN clustering algorithm, since clustering is a basic method for spatial outlier detection. From the viewpoint of a clustering algorithm, outliers are objects not located in any cluster. Furthermore, if a cluster is significantly different from other clusters, the objects in this cluster might be potential S-Outliers.

The algorithm proposed in this study first identifies S-Outliers and then T-Outliers. However, identifying T-Outliers first and then S-Outliers yields the same result. So ST-Outliers and TS-Outliers are identical.

The rest of the paper is organized as follows. Section 2 describes related work on the problem of outlier detection. Section 3 explains
our algorithm to detect ST-Outliers. Section 4 presents a performance evaluation of the algorithm. Section 5 shows the sensitivity analyses of the parameters of the algorithm. Using a real-world dataset, Section 6 presents an application to demonstrate our solution and shows the data mining results. Finally, the conclusion is given in Section 7.

2. Outlier Detection Approaches

on their local neighborhood density. Samples with a high LOF value are identified as outliers. The neighborhood is defined by using the MinPts parameter.

One drawback of the existing methods is that they do not consider temporal aspects. In this paper, we propose an ST-Outlier detection algorithm to overcome this disadvantage. Our algorithm combines the advantages of the clustering-based and density-based approaches.
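
The density-based LOF idea mentioned above can be illustrated with a small Python sketch. It follows the standard Local Outlier Factor definition (k-distance, reachability distance, local reachability density) and is not code from this paper. The function name `lof_scores`, the use of Euclidean distance, the quadratic-time neighbor search, and the sample points are all assumptions made for illustration.

```python
from math import dist


def lof_scores(points, min_pts):
    """Local Outlier Factor of every point (assumes all points are distinct)."""
    n = len(points)
    if not 1 <= min_pts < n:
        raise ValueError("min_pts must satisfy 1 <= min_pts < len(points)")

    # For each point: (distance, index) pairs to all other points, ascending.
    ranked = [
        sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    # k-distance: distance to the min_pts-th nearest neighbor.
    k_dist = [r[min_pts - 1][0] for r in ranked]
    # Neighborhood: every point within the k-distance (ties are included).
    neigh = [
        [(d, j) for d, j in r if d <= k_dist[i]]
        for i, r in enumerate(ranked)
    ]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(p, o) = max(k_dist(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    # Four corners of a unit square plus one distant point (made-up data).
    pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
    for p, score in zip(pts, lof_scores(pts, min_pts=2)):
        print(p, round(score, 2))
```

Points inside a uniformly dense group receive a score close to 1, while an isolated point receives a score well above 1, which is why a high LOF value is read as evidence of an outlier.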
If two clusters C1 and C2 are very close to each other and a point p is a border point of both C1 and C2, then the algorithm assigns point p to the cluster discovered first.

3.2. Checking Spatial Neighbors

From the viewpoint of a clustering algorithm, potential outliers are objects not located in any cluster. In the previous step, potential outliers were detected when the data was clustered. In this step, these potential outliers are checked to verify whether they are actually S-Outliers or not. During the verification, background knowledge (the characteristic) of the data is required. If no prior knowledge about the data is available, methods such as a neural network can be applied to handle it. Furthermore, if a cluster is significantly different from other clusters, the objects in this cluster might be S-Outliers. Thus this step also checks all clusters identified in the previous step to decide whether the cluster is an S-Outlier or not. The formula used to verify S-Outliers is defined in Definition 1.

Definition 1. Given a database of n data objects $D = \{o_1, o_2, \dots, o_n\}$. Assume that the object o is detected as a potential outlier in clustering. The average value of the spatial neighbors of o within Eps1 radius is defined as

$$A \stackrel{\text{def}}{=} \frac{o_{\text{neigh}.1} + o_{\text{neigh}.2} + \dots + o_{\text{neigh}.m}}{m} \tag{2}$$

where m is the number of spatial neighbors of o within Eps1 radius, and the standard deviation for the object o is defined as $\sigma = \sqrt{V}$, where

$$V \stackrel{\text{def}}{=} \frac{(o_{\text{neigh}.1} - A)^2 + (o_{\text{neigh}.2} - A)^2 + \dots + (o_{\text{neigh}.m} - A)^2}{m} \tag{3}$$

The object o is classified as an S-Outlier if it is outside the interval [L, U] (i.e., if either $o < L$ or $o > U$), where $L \stackrel{\text{def}}{=} A - k_0 \cdot \sigma$, $U \stackrel{\text{def}}{=} A + k_0 \cdot \sigma$, and $k_0 > 1$ is some pre-selected value.

3.3. Checking Temporal Neighbors

This step checks the temporal neighbors of the S-Outliers identified in the previous step. Two objects are temporal neighbors if the values of these objects are observed in consecutive time units, such as consecutive days in the same year or the same day in consecutive years. During the application of the algorithm, a tree is traversed to find the temporal neighbor objects of any object.

In order to support temporal aspects, S-Outliers are compared to other objects of the same local area, but at different times. In the comparison operation, spatio-temporal data is first filtered by retaining only the temporal neighbors and their corresponding values. If the characteristic value of an S-Outlier does not differ significantly from those of its temporal neighbors, it is not an ST-Outlier. Otherwise, it is confirmed as an ST-Outlier. The formula used to detect ST-Outliers is similar to the formula defined in Definition 1. In this case temporal neighbors are checked instead of spatial neighbors.

4. Performance Evaluation

The average runtime complexity of the DBSCAN algorithm is O(n log n), where n is the number of objects in the database. DBSCAN has proven able to process very large datasets [5][6]. The algorithm yields significant speed-up factors, even for large numbers of data in the database. The paper [6] shows that the runtime of other clustering algorithms such as CLARANS [15] and DBCLASD [20] is between 1.5 and 3 times the runtime of DBSCAN. This factor increases with increasing size of the database. Our modifications do not change the runtime complexity of the algorithm.

As in all databases, fast access to raw data in spatio-temporal databases depends on the structural organization of the stored information and on the availability of suitable indexing methods. While a well-designed data structure facilitates rapid extraction of the desired information from a set of data, suitable indexing methods make it possible to quickly locate single or multiple objects [1]. Well-known spatial indexing techniques include Quadtrees [18], R-Trees [8] and others; see [9] for an overview. In our study, we made an improvement of the R-Tree indexing method to handle spatio-temporal information. We created some nodes in the R-Tree for each spatial object and linked them in temporal order.
During the application of the algorithm, this tree is traversed to find the spatial or temporal neighbor objects of any object.

In addition to the spatial index structure, some filters should also be used to reduce the search space for spatial data mining algorithms. These filters allow the operations on neighborhood paths by reducing the number of paths actually created. They are necessary to speed up the processing of queries.

5. Sensitivity Analyses of the Parameters

Sensitivity analysis is used to determine how a given algorithm output depends upon the input parameters. Most sensitivity analyses involve changing one parameter at a time. A sensitivity analysis was conducted by changing each parameter value by ±10%. According to the heuristic defined in [5], the input parameters should be assigned as Eps1=1, Eps2=0.25, and MinPts=15. Figure 2 shows the sensitivity of our algorithm to the Eps1, Eps2 and MinPts parameters with respect to the number of noise points.

Fig. 2. The sensitivity analyses results.

value of the Eps2 parameter greatly increases the number of noise points. The output of the algorithm depends strongly on the reliability of this parameter. The results also show that the output of the algorithm is only slightly sensitive to changes in the MinPts parameter values.

6. Application

This section demonstrates how our algorithm detects ST-Outliers by using a real-world dataset. The purpose of the application is to detect rare events and exceptional cases related to sea waves in the years between 1992 and 2002.

6.1. Dataset

We designed a spatio-temporal data warehouse which contained wave height values of four seas: the Black Sea, the Marmara Sea, the Aegean Sea, and the east of the Mediterranean Sea. These seas surround Turkey from the north, west, and south. The geographical coordinates of our work area are 30° to 47.5° north latitude and 17.0° to 42.5° east longitude.

The dataset was provided by the Topex/Poseidon satellite [21]. Topex/Poseidon data are released by NASA and CNES. Wave heights are measured in meters. The dataset has approximately six million rows. It has the following columns: StationID, RegionID, Year, Month, Day of the record, Latitude of the station, Longitude of the station, WaveHeight value, and ClusterID. The StationID column identifies the geographic location of the monitoring station, RegionID identifies the name of the sea, and ClusterID identifies a particular cluster of stations.
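
As a rough illustration of how the interval test of Definition 1 (Section 3.2) and its temporal counterpart (Section 3.3) might be applied to wave height values of this kind, the following Python sketch checks one candidate value first against its spatial neighbors and then against its temporal neighbors. The function names, the sample wave heights, and the choice k0 = 2 are assumptions made for illustration and are not taken from the paper. Finding the neighbors (the Eps1 radius search and the tree traversal) is assumed to have been done already.

```python
from math import sqrt


def outlier_interval(neighbor_values, k0):
    """Return (L, U) = (A - k0*sigma, A + k0*sigma) as in Definition 1."""
    m = len(neighbor_values)
    if m == 0:
        raise ValueError("at least one neighbor value is required")
    a = sum(neighbor_values) / m                          # average, Eq. (2)
    v = sum((x - a) ** 2 for x in neighbor_values) / m    # variance, Eq. (3)
    sigma = sqrt(v)
    return a - k0 * sigma, a + k0 * sigma


def is_outlier(value, neighbor_values, k0=2.0):
    """True if value lies outside [L, U] computed from the neighbor values."""
    lower, upper = outlier_interval(neighbor_values, k0)
    return value < lower or value > upper


def is_st_outlier(value, spatial_values, temporal_values, k0=2.0):
    """Spatial check first, then the same test against temporal neighbors."""
    return (is_outlier(value, spatial_values, k0)
            and is_outlier(value, temporal_values, k0))


if __name__ == "__main__":
    # Hypothetical wave heights in meters (invented for this example).
    spatial = [1.0, 1.2, 0.8, 1.0]    # nearby stations, same day
    temporal = [0.9, 1.1, 1.0, 1.0]   # same station, neighboring days
    print(is_st_outlier(4.0, spatial, temporal))   # far from both groups
    print(is_st_outlier(1.1, spatial, temporal))   # consistent with neighbors
```

In this sketch a value is reported as an ST-Outlier only when it falls outside the interval for both neighbor sets, which mirrors the order of the two checking steps: a value that is unusual in space but typical for its own location over time is not confirmed.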
References