Anomaly Detection in Temperature Data Using DBSCAN Algorithm
Abstract—Anomaly detection is the problem of finding unexpected patterns in a dataset. Unexpected patterns can be defined as those that do not conform to the general behavior of the dataset. Anomaly detection is important for several application domains such as financial and communication services, public health, and climate studies. In this paper, we focus on the discovery of anomalies in monthly temperature data using the DBSCAN algorithm. DBSCAN is a density-based clustering algorithm that has the capability of discovering anomalous data. In the experimental evaluation, we compared the results of the DBSCAN algorithm with the results of a statistical method. The analysis showed that DBSCAN has several advantages over the statistical approach in discovering anomalies.

… or a summer month with an average temperature below 10 °C can be identified as an anomaly. In Fig. 1, on a sample monthly temperature data series, the data point marked as 1 can be called an anomaly.
In the algorithm, between steps 2 and 18, there is a for-loop that runs over the points in the dataset. In step 2, DBSCAN assigns D[i] as the center point. In step 3, the distances between the center point D[i] and the remaining points are calculated, and in step 4 the points whose distances are less than or equal to eps are accepted as neighbors of the center point. In step 5, the number of neighbors of the center point is counted. Step 6 checks whether there is any core point in the neighbor list. In step 7, the algorithm tests whether the number of neighbor points, neighbor_count, is greater than or equal to minpts; if it is, the center point is marked as a core point. Between steps 8 and 12, a unique class number is assigned to the core point D[i] and its neighbors. If the center point is not a core point but is near a core point, it is marked as a border point between steps 13 and 15. If the center point is neither a core point nor a border point (i.e., no core point lies within eps of it), it is labeled as an outlier. Finally, in step 19, the algorithm outputs the results.
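To make the per-point labeling described above concrete, the following is a minimal Python sketch of the core/border/outlier classification. The parameter names eps and minpts follow the pseudocode; the use of NumPy, absolute differences as the distance on a one-dimensional series, and the function name dbscan_status are illustrative assumptions of this sketch, which also omits the merging of density-connected core points into common clusters performed by the full algorithm.

import numpy as np

def dbscan_status(points, eps, minpts):
    # Classify every point of a 1-D series as "core", "border", or "outlier",
    # following the per-point description of the DBSCAN pseudo-code above.
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Steps 3-5: neighbours of each centre point D[i] within distance eps
    # (a point is counted in its own neighbourhood).
    neighbours = [np.where(np.abs(points - points[i]) <= eps)[0] for i in range(n)]

    # Step 7: a point is a core point if it has at least minpts neighbours.
    is_core = np.array([len(nbr) >= minpts for nbr in neighbours])

    status = []
    for i in range(n):
        if is_core[i]:
            status.append("core")       # steps 8-12: receives a class number
        elif is_core[neighbours[i]].any():
            status.append("border")     # steps 13-15: near a core point
        else:
            status.append("outlier")    # no core point within eps: anomaly
    return status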
IV. DESEASONING TIME SERIES DATA

In this study, the aim is to discover anomalies in monthly temperature data. As temperature data have a strong seasonal component, the series should be deseasoned in a pre-processing step before applying any of the anomaly detection techniques.
In the pre-processing step, the monthly z-score technique explained by Kumar et al. [6] was applied (Algorithm 2). The aim of the monthly z-score technique is to eliminate the seasonal component from the dataset and obtain a deseasoned dataset. The monthly_z_score function first splits the dataset into months. For each month, the mean and the standard deviation are calculated. Each value is then normalized using the mean and standard deviation of its month. Fig. 3 shows the original and deseasoned temperature data used in this study. The mean and standard deviation of the original data were 10.96 °C and 8.73 °C, respectively. The mean and standard deviation of the deseasoned data were 0 and 1, respectively. The temperature values in the original data were in the range of -8.25 to 26.55 °C, whereas the values in the deseasoned data were in the range of -2.6 to 3.11.

Algorithm 2. The pseudo-code of the pre-processing step
Input:
    X: the dataset
Output:
    deseasoned_X: the deseasoned data vector
Pre-processing function:
    1. initialization
    2. deseasoned_X = monthly_z_score(X)
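The following is a short Python sketch of the monthly_z_score step described above. The NumPy-based implementation and the start_month argument (the calendar month of the first observation) are assumptions of this illustration, not the authors' code.

import numpy as np

def monthly_z_score(x, start_month=1):
    # Deseason a monthly series by z-scoring every value against the mean and
    # standard deviation of its own calendar month.
    x = np.asarray(x, dtype=float)
    months = (np.arange(len(x)) + start_month - 1) % 12   # 0 = January, ..., 11 = December
    deseasoned = np.empty_like(x)
    for m in range(12):
        idx = months == m
        if not idx.any():                                 # month absent from the series
            continue
        mu, sigma = x[idx].mean(), x[idx].std()           # per-month statistics
        deseasoned[idx] = (x[idx] - mu) / sigma           # normalise that month's values
    return deseasoned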
V. EXPERIMENTAL EVALUATION

To evaluate the experimental results of the DBSCAN algorithm, we compared it with a statistical anomaly detection method. In the experiments, we used the deseasoned temperature time series data.

A. Detecting Anomalies using the Statistical Method

The basic assumption of the statistical method is that the data come from a normal distribution. The normal distribution has two parameters, the mean (μ) and the standard deviation (σ). According to Tan et al. [7], the chance that a value is located in the tails of the distribution is very low. Based on that, it is common practice to identify data points greater than μ+2σ or μ+3σ and smaller than μ-2σ or μ-3σ as anomalies.
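As a concrete illustration of this rule, the sketch below flags values outside μ ± kσ (with k = 2 or 3) in a deseasoned series; the NumPy implementation and the function name are illustrative assumptions rather than the authors' code.

import numpy as np

def statistical_anomalies(x, k=2):
    # Flag values outside mu +/- k*sigma (k = 2 or 3) as anomalies.
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return (x > mu + k * sigma) | (x < mu - k * sigma)    # boolean anomaly mask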
[Fig. 4(a): the results for μ ± 2σ]

As can be seen from Fig. 4, the statistical method detects anomalous points as those that lie below or above certain temperature values.

B. Detecting Anomalies using the DBSCAN Algorithm

The aim of the DBSCAN algorithm is to discover abnormal points that do not fit any of the clusters. In the experiments, we evaluated the effects of the values of the eps and minpts parameters using the deseasoned dataset.

As the eps parameter increases, the number of anomalies detected by the algorithm decreases. The results reported in Fig. 5b and Fig. 5c are somewhat similar to the results found with the statistical method; that is, anomalies were discovered above or below certain temperature values. In the case of Fig. 5a, the DBSCAN algorithm discovered anomalies in between normal points, so, unlike with the statistical method, there is no single temperature threshold separating normal points from abnormal points.

To evaluate the effect of the minpts parameter on discovering anomalies, we set eps to 0.1 and conducted experiments with minpts values of 2, 4, and 6. In Fig. 6, an increase in the number of anomalous points can be observed as minpts increases. This occurs because DBSCAN assigns a point to a cluster only if there are at least minpts neighbor points within the eps distance; when minpts is higher, the chance of a point having that many neighbors within the eps distance is lower.
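A parameter sweep of this kind can be reproduced, for example, by counting DBSCAN noise points for several eps and minpts values. The sketch below uses scikit-learn's DBSCAN purely for illustration (the paper's own implementation follows Algorithm 1), treating points labeled -1 as anomalies; deseasoned_X stands for the deseasoned series produced in the pre-processing step.

import numpy as np
from sklearn.cluster import DBSCAN

def count_anomalies(deseasoned_x, eps, minpts):
    # Run DBSCAN on a 1-D deseasoned series and count the points labelled as
    # noise (label -1), i.e. the anomalies. Note that scikit-learn counts the
    # point itself towards min_samples.
    X = np.asarray(deseasoned_x, dtype=float).reshape(-1, 1)
    labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
    return int((labels == -1).sum())

# Fix eps and vary minpts, as in the experiments reported above:
# for minpts in (2, 4, 6):
#     print(minpts, count_anomalies(deseasoned_X, eps=0.1, minpts=minpts))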
Overall, the experimental results showed that the anomalous points detected by the statistical method are selected purely by their values; in other words, the statistical method can detect anomalous points that lie above or below a certain threshold (extremes). However, anomalous points are not only extreme points; they are also data that do not occur frequently. The DBSCAN algorithm discovers these kinds of anomalies as well as extremely low and high values.

VI. CONCLUSION AND FUTURE WORKS

We applied the DBSCAN algorithm for detecting anomalies in time series data and compared this method with a statistical anomaly detection method. Because of the temporal nature of the dataset, we applied a pre-processing step to remove seasonality before applying the DBSCAN algorithm to the dataset. The results show that the DBSCAN algorithm can discover anomalies even if they are not extreme values.

In the future, we plan to apply other anomaly detection techniques and to test the performance of the algorithms with different parameter values.

ACKNOWLEDGMENT

This study was partially supported by The Scientific and Technological Research Council of Turkey (TUBITAK) (Project no: CAYDAG 110Y110) and the Research Fund of Erciyes University (Project no: FBA-09-866).

REFERENCES

[1] V. Kumar, A. Banerjee, and V. Chandola, "Anomaly detection: A survey," ACM Computing Surveys, vol. 41, July 2009.
[2] V. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, pp. 85-126, October 2004.
[3] J. Tang, Z. Chen, A. W.-c. Fu, and D. Cheung, "A robust outlier detection scheme for large datasets," in Knowledge Discovery and Data Mining, Pacific-Asia Conf., vol. 6, pp. 6-8, 1996.
[4] M. Petrovskiy, "Outlier detection algorithms in data mining systems," Programming and Computing Software, vol. 29, pp. 228-237, July-August 2003.
[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD-96 Proceedings, pp. 226-231, 1996.
[6] V. Kumar, P.-N. Tan, and M. Steinbach, "Finding spatio-temporal patterns in earth science data," in KDD 2001 Workshop on Temporal Data Mining, August 2001.
[7] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Boston: Addison-Wesley, April 2005.