Anomaly Detection

The document discusses using the DBSCAN algorithm to detect anomalies in monthly temperature data. DBSCAN is a density-based clustering algorithm that can identify anomalous data points. The study applies DBSCAN to temperature data from a weather station in Turkey to detect anomalous temperature readings and compares the results to a statistical anomaly detection method.

Anomaly Detection in Temperature Data Using DBSCAN Algorithm
Mete ÇELİK Filiz DADAŞER-ÇELİK Ahmet Şakir DOKUZ
Dept. of Computer Engineering Dept. of Environmental Engineering Dept. of Computer Engineering
Erciyes University Erciyes University Erciyes University
38039 Kayseri, Turkey 38039 Kayseri, Turkey 38039 Kayseri, Turkey
[email protected] [email protected] [email protected]

Abstract—Anomaly detection is the problem of finding unexpected patterns in a dataset. Unexpected patterns can be defined as those that do not conform to the general behavior of the dataset. Anomaly detection is important for several application domains such as financial and communication services, public health, and climate studies. In this paper, we focus on the discovery of anomalies in monthly temperature data using the DBSCAN algorithm. DBSCAN is a density-based clustering algorithm that has the capability of discovering anomalous data. In the experimental evaluation, we compared the results of the DBSCAN algorithm with the results of a statistical method. The analysis showed that DBSCAN has several advantages over the statistical approach in discovering anomalies.

Keywords-anomaly detection in time series data; DBSCAN algorithm; temperature data

I. INTRODUCTION
Anomaly detection can be defined as “the problem of finding patterns in data that do not conform to expected normal behavior” [1]. Anomaly detection is important when the abnormal behavior in the dataset provides significant information about the system. Anomalies can be caused by malicious activities, instrumentation errors, changes in the environment (e.g., climate change), and human errors [1].

Anomaly detection is an important problem in several application domains such as credit card fraud detection in financial systems, intrusion detection in communication systems, and contagious disease detection in public health data. In recent years, anomaly detection has also become an important problem in climate studies for detecting abnormal climatic conditions caused by global warming. Anomalies in a climate data series can be explained by the following example. In a time series of temperature data, one can see fluctuations between high and low temperatures. The normal behavior for temperature data is to reach high values (e.g., temperature over 20 °C) in summer months and drop to low values (e.g., temperature below 10 °C) in winter months. In a temperature dataset, a winter month with average temperature above 20 °C or a summer month with average temperature below 10 °C can be identified as an anomaly. In Fig. 1, on a sample monthly temperature data series, the data point marked as 1 can be called an anomaly.

Figure 1. Representation of anomalies on sample monthly temperature data

One of the major challenges of anomaly detection is to characterize normal and abnormal behaviors. Other challenges are to develop techniques to discover abnormal data and to design computationally efficient algorithms.

The aim of this study is to discover anomalies in temperature data. DBSCAN, a density-based clustering algorithm, is used as an anomaly detection technique, and the results of DBSCAN are compared with the results of a statistical anomaly detection method. The dataset used in this study consists of daily average temperature data collected at station number 17836 (Develi Station in Kayseri, Turkey) operated by the Turkish State Meteorological Service. The daily data were converted to monthly averages by averaging daily values for each month. The data were available for a 33-year period (from 1975 to 2008).

II. RELATED WORK
In the literature, there are a number of extensive reviews discussing anomaly detection approaches. Kumar et al. [1] provided a recent review of the anomaly detection problems, techniques, and application areas. Hodge et al. [2] provided a review of anomaly detection techniques. The

978-1-61284-922-5/11/$26.00 ©2011 IEEE


advantages and disadvantages, and the motivation of these techniques were discussed in a comparative approach. Tang et al. [3] reviewed the different solution methods of anomaly detection problems, i.e., density-based and connectivity-based methods, and the theoretical structure of the anomaly detection problems. Petrovskiy [4] discussed the anomaly detection algorithms, basic approaches, the advantages and disadvantages of these approaches, and proposed a new fuzzy-based anomaly detection algorithm.

The anomaly detection techniques can be classified as statistical approaches and distance-based approaches. Statistical approaches aim to develop a statistical model of the data and identify data that do not fit into the model. In distance-based approaches, the distances between data are considered in the detection of anomalies: a data point at a distance greater than a pre-defined distance is called an anomaly. This study focuses on detecting outliers using distance-based approaches.

DBSCAN, which was developed by Ester et al. [5], is one of the distance-based approaches and has been widely used for solving anomaly detection problems. Ester et al. [5] showed that DBSCAN is a much more powerful algorithm than the other algorithms for finding anomalies on large and dense datasets.

In this study, the DBSCAN algorithm is used for detection of anomalies in monthly temperature data, and this method is compared with a statistical anomaly detection method.

III. DBSCAN ALGORITHM
DBSCAN is a density-based spatial clustering algorithm that can also identify anomalies in a data series. It requires two user-defined parameters: the neighborhood distance epsilon (eps) and the minimum number of points (minpts). For a given point, the points within the eps distance are called neighbors of that point. If the number of neighboring points of a point is at least minpts, this group of points is called a cluster.

DBSCAN labels the data points as core points, border points, and outlier (anomalous) points. Core points are those that have at least minpts points within the eps distance. Border points are points that are not core points but are neighbors of core points. Outlier points are those that are neither core points nor border points.

In DBSCAN, the clustering approach is different from typical clustering approaches. DBSCAN can identify outlier (anomalous) points that do not fit into any cluster. In Fig. 2, the groups represent the results of a clustering approach that takes the distance threshold eps as a cluster metric. In addition to the distance metric, DBSCAN requires a minimum number of points minpts inside a group to label it as a cluster. For example, if the minpts value is selected as 3, group 1, group 2, group 4, and group 6 will be marked as clusters using DBSCAN. In contrast, group 3 and group 5 will be defined as outliers since they do not contain a sufficient number of points to form a cluster. Similarly, if the minpts value is selected as 5, group 3, group 5, and group 6 will be assigned as outliers using DBSCAN.

Figure 2. A sample dataset for distance-based anomaly detection approach

The pseudo-code of the DBSCAN algorithm is given in Algorithm 1. The inputs of the algorithm are the dataset and the user-defined eps and minpts parameter values.

Algorithm 1. The pseudo-code of the DBSCAN algorithm

Inputs:
    D: the dataset
    Eps: the neighborhood distance
    Minpts: the minimum number of points
Output:
    Discovered outliers and clusters
Variables:
    m, n: row and column values of D matrix, respectively
    Dist: distance vector
    indices: indices of the points whose distance is lower than Eps
    class_no: indicates the clusters (default 1)

Algorithm:
    1.  import the data-set into D
    2.  for i = 1 to m                      //row counter
    3.      Dist = distance(i, D)
    4.      neighbors = find(Dist <= Eps)
    5.      neighbor_count = count(neighbors)
    6.      core_neig = check_core_neighbor(neighbors)
    7.      if (neighbor_count >= minpts)
    8.          class(i) = class_no         //clustered point
    9.          while (more points near i)
    10.             class(point) = class_no
    11.         end while
    12.         class_no += 1
    13.     else if (neighbor_count < minpts & core_neig == True)
    14.         class(i) = 0                //border point
    15.     else if (neighbor_count < minpts)
    16.         class(i) = -1               //outlier point
    17.     end if
    18. end for
    19. return class
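
The core, border, and outlier labelling described in Section III can be sketched in Python as follows. This is a minimal illustration of the standard DBSCAN formulation with cluster expansion, not a transcription of Algorithm 1. The function names (`region_query`, `dbscan`), the toy data, and the convention that a point counts itself among its neighbors are assumptions made for this example. Border points receive the label of the cluster that reaches them instead of the separate label 0 used in Algorithm 1.

```python
from math import dist

NOISE = -1        # label for outlier (anomalous) points
UNVISITED = None  # placeholder for points not yet examined


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, minpts):
    """Return one label per point: 1, 2, ... for clusters and -1 for outliers."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < minpts:
            labels[i] = NOISE  # provisional: may turn out to be a border point
            continue
        cluster_id += 1        # points[i] is a core point: start a new cluster
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # reachable from a core point: border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= minpts:  # only core points expand the cluster
                queue.extend(j_neighbors)
    return labels


# Hypothetical one-dimensional data: two dense groups and one isolated value.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (5.0,), (5.1,), (5.2,), (2.5,)]
labels = dbscan(points, eps=0.15, minpts=3)
print(labels)  # the isolated value 2.5 is the only point labelled -1
```

With these toy values, the two dense groups become clusters 1 and 2, their end points are absorbed as border points, and the isolated value is reported as an outlier.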

In the algorithm, between steps 2 and 18, there is a for-loop running over the points in the dataset. In step 2, DBSCAN assigns D[i] as a center point. In step 3, the distances between the center point D[i] and the remaining points are calculated, and then in step 4 the points whose distances are less than or equal to eps are accepted as neighbors of the center point. In step 5, the number of neighbors of the center point is calculated. Step 6 of the algorithm checks if there is any core point in the neighbor list. In step 7, the algorithm checks the number of neighbor points, neighbor_count, to test if it is greater than or equal to minpts. If it is, the center is determined as a core point. Between steps 8 and 12, a unique class number is assigned to the core point D[i] and its neighbors. If the center point is not a core point but is near a core point, it is defined as a border point in steps 13 and 14. If the center point is neither a core point nor a border point, it is labeled as an outlier in steps 15 and 16. Finally, in step 19 the algorithm outputs the results.

IV. DESEASONING TIME SERIES DATA
In this study, the aim is to discover anomalies in monthly temperature data. As temperature data have a strong seasonal component, the series should be deseasoned in a pre-processing step before applying any of the anomaly detection techniques.

In the pre-processing step, the monthly z-score technique, explained by Kumar et al. [6], was applied (Algorithm 2). The aim of the monthly z-score technique is to eliminate the seasonal component from the dataset to obtain a deseasoned dataset. The monthly_z_score function first splits the dataset into months. For each month, the mean and the standard deviation are calculated. The value of each month is then normalized using the mean and standard deviation for that month. Fig. 3 shows the original and deseasoned temperature data used in this study. The mean and standard deviation of the original data were 10.96 °C and 8.73 °C, respectively. The mean and standard deviation of the deseasoned data were changed to 0 °C and 1 °C, respectively. The temperature values in the original data ranged from -8.25 to 26.55 °C. The temperature values in the deseasoned data ranged from -2.6 to 3.11 °C.

Algorithm 2. The pseudo-code of the pre-processing step

Input:
    X: The dataset
Output:
    Deseasoned data vector deseasoned_X
Variables:
    deseasoned_X: The deseasoned data vector
Pre-Processing function:
    1. initialization
    2. deseasoned_X = monthly_z_score(X)

Figure 3. The original data and deseasoned temperature data

V. EXPERIMENTAL EVALUATION
To evaluate the experimental results of the DBSCAN algorithm, we compared it with a statistical anomaly detection method. In the experiments, we used the deseasoned temperature time series data.

A. Detecting Anomalies using Statistical Method
The basic assumption of the statistical method is that the data are from a normal distribution. The normal distribution has two parameters, the mean (μ) and the standard deviation (σ). According to Tan et al. [7], the chance that a value is located at the tails of the distribution is very low. Based on that, it is a general practice to identify the data points greater than “μ+2σ” or “μ+3σ” or smaller than “μ-2σ” or “μ-3σ” as anomalies.

a) The results for μ ± 2σ
b) The results for μ ± 3σ

Figure 4. Anomalies detected by the statistical approach ((+) represents anomalous points and (x) represents normal points)
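
The monthly z-score pre-processing of Section IV and the μ ± kσ rule of Section V.A can be sketched together in Python. This is an illustration under stated assumptions, not the authors' implementation: the function `sigma_anomalies`, the argument layout of `monthly_z_score`, the use of the sample standard deviation, and the toy January and July values are all hypothetical. Each month is assumed to have at least two observations with non-zero spread.

```python
from statistics import mean, stdev


def monthly_z_score(values, months):
    """Deseason a series: values[i] was observed in calendar month months[i]."""
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    # Per-month mean and sample standard deviation (assumption: sample, not population).
    stats = {m: (mean(vs), stdev(vs)) for m, vs in by_month.items()}
    return [(v - stats[m][0]) / stats[m][1] for v, m in zip(values, months)]


def sigma_anomalies(series, k):
    """Indices of values lying more than k standard deviations from the mean."""
    mu, sigma = mean(series), stdev(series)
    return [i for i, v in enumerate(series) if abs(v - mu) > k * sigma]


# Hypothetical data: eight years of January and July monthly averages.
# The last January (8.0) is unusually warm for January but not for the whole series.
jan = [-2.0, -1.0, -3.0, -2.0, -1.0, -3.0, -2.0, 8.0]
jul = [22.0, 23.0, 21.0, 22.0, 23.0, 21.0, 22.0, 22.0]
values = [v for pair in zip(jan, jul) for v in pair]
months = [1, 7] * len(jan)

deseasoned = monthly_z_score(values, months)
print(sigma_anomalies(values, 2))      # raw series: the warm January is not flagged
print(sigma_anomalies(deseasoned, 2))  # deseasoned series: index 14 (warm January)
print(sigma_anomalies(deseasoned, 3))  # stricter threshold: nothing is flagged
```

In this toy series, the warm January only stands out after deseasoning, which is why the pre-processing step precedes both detection methods.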

As can be seen from Fig. 4, the statistical method detects as anomalous the points that are smaller or greater than certain temperature values.

B. Detecting Anomalies using DBSCAN Algorithm
The aim of the DBSCAN algorithm is to discover abnormal points that do not fit into any of the clusters. In the experiments, we evaluated the effects of the values of the eps and minpts parameters using the deseasoned dataset.

a) The results for 4 minpts, 0.05 eps
b) The results for 4 minpts, 0.1 eps
c) The results for 4 minpts, 0.15 eps

Figure 5. Anomalies detected by the DBSCAN algorithm for 4 minpts and 0.05, 0.1, and 0.15 eps ((+) represents anomalous points and (x) represents normal points)

To evaluate the effect of the eps parameter on discovering anomalies, we set the minpts parameter to 4 and conducted experiments with eps values of 0.05, 0.1, and 0.15. Fig. 5 shows the results of the DBSCAN algorithm for different eps values. As can be seen in Fig. 5, as the value of the eps parameter increases, the number of anomalies detected by the algorithm decreases. The results reported in Fig. 5b and Fig. 5c are somewhat similar to the results found with the statistical method. That is, anomalies were discovered above or below certain temperature values. In the case of Fig. 5a, the DBSCAN algorithm discovered anomalies in between normal points, so, unlike in the statistical method, no single threshold separates normal points from abnormal points.

To evaluate the effect of the minpts parameter on discovering anomalies, we set eps to 0.1 and conducted experiments with minpts values of 2, 4, and 6. In Fig. 6, an increase in the number of anomalous points can be observed as minpts increases. This situation occurs because DBSCAN assigns a point to a cluster only if there are at least minpts neighbor points within the eps distance. When minpts is higher, the chance of having at least minpts neighbor points within the eps distance is lower.

a) The results for 2 minpts, 0.1 eps
b) The results for 4 minpts, 0.1 eps
c) The results for 6 minpts, 0.1 eps

Figure 6. Anomalies detected by the DBSCAN algorithm for 0.1 eps and 2, 4, and 6 minpts ((+) represents anomalous points and (x) represents normal points)
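
The parameter effects reported in Fig. 5 and Fig. 6 can be checked with a small Python sweep. This is a sketch on synthetic data, not a reproduction of the paper's experiment: the function `count_outliers` and the randomly generated stand-in for the deseasoned series are assumptions. It relies on the fact that a point is a DBSCAN outlier exactly when it is not a core point and has no core point within eps, so no cluster expansion is needed to count outliers. A point is assumed to count itself among its neighbors.

```python
import random


def count_outliers(values, eps, minpts):
    """Number of points that are neither core points nor within eps of a core point."""
    n = len(values)
    neighbors = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= minpts for nb in neighbors]
    return sum(
        1
        for i in range(n)
        if not core[i] and not any(core[j] for j in neighbors[i])
    )


# Synthetic stand-in for a deseasoned series (mean near 0, spread near 1).
rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Effect of eps with minpts fixed at 4 (the setting of Fig. 5).
by_eps = [count_outliers(values, eps, 4) for eps in (0.05, 0.1, 0.15)]

# Effect of minpts with eps fixed at 0.1 (the setting of Fig. 6).
by_minpts = [count_outliers(values, 0.1, minpts) for minpts in (2, 4, 6)]

print("outliers for eps = 0.05, 0.1, 0.15:", by_eps)
print("outliers for minpts = 2, 4, 6:", by_minpts)
```

For any dataset, enlarging eps can only add neighbors and therefore cannot increase the outlier count, while raising minpts can only remove core points and therefore cannot decrease it. This matches the trends described for Fig. 5 and Fig. 6.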

Overall, the experimental results showed that the anomalous points detected by the statistical method are based on their value. In other words, the statistical method can detect anomalous points which are above or below a certain threshold (extremes). However, anomalous points are not only extreme points; they are also data that do not occur frequently. The DBSCAN algorithm discovers these kinds of anomalies as well as extremely low and high values.

VI. CONCLUSION AND FUTURE WORKS
We applied the DBSCAN algorithm for detecting anomalies in time series data and compared this method with a statistical anomaly detection method. Because of the temporal nature of the dataset, we applied a pre-processing step to remove seasonality before applying the DBSCAN algorithm to the dataset. The results show that the DBSCAN algorithm can discover anomalies even if they are not extreme values.

In the future, we plan to apply other anomaly detection techniques and to test the performance of the algorithm with different parameter values.

ACKNOWLEDGMENT
This study was partially supported by The Scientific and Technological Research Council of Turkey (TUBITAK) (Project no: CAYDAG 110Y110) and the Research Fund of Erciyes University (Project no: FBA-09-866).

REFERENCES
[1] V. Kumar, A. Banerjee, and V. Chandola, "Anomaly detection: A survey," ACM Computing Surveys, vol. 41, July 2009.
[2] V. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, pp. 85-126, October 2004.
[3] J. Tang, Z. Chen, A. W.-c. Fu, and D. Cheung, "A robust outlier detection scheme for large datasets," in Knowledge Discovery and Data Mining, Pacific-Asia Conf., vol. 6, pp. 6-8, 1996.
[4] M. Petrovskiy, "Outlier detection algorithms in data mining systems," Programming and Computing Software, vol. 29, pp. 228-237, July-August 2003.
[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density based algorithm for discovering clusters in large spatial databases with noise," in KDD-96 Proceedings, pp. 226-231, 1996.
[6] V. Kumar, P.-N. Tan, and M. Steinbach, "Finding spatio-temporal patterns in earth science data," in KDD 2001 Workshop on Temporal Data Mining, August 2001.
[7] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Boston: Addison-Wesley, April 2005.
