Anomaly Detection
Anomaly detection algorithms look at an unlabeled data set and identify unusual
or abnormal data points or events. Common use cases include applications where
unusual activity is a problem, such as fraud detection, quality analysis in
manufacturing, and monitoring computers in data centers.
Density Estimation
The most common way to carry out anomaly detection is with an algorithm called
density estimation. Given a dataset {x1 , x2 , … , xm }, the density estimation
algorithm computes a model p(x), which is the probability of x being seen in the
dataset.
The probability p(x) is determined by a Gaussian (normal) distribution with mean μ and variance σ². The standard deviation is σ, and the variance is its square. Since probabilities always sum to 1, the area under the curve is also equal to 1.
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$
The variance is the average of the squared differences between the data points and the mean,
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$
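As a minimal sketch of these two estimates for a single feature (the function names are just for illustration, assuming a NumPy array of m values):

```python
import numpy as np

def fit_gaussian(x):
    """Estimate mean and variance of a 1-D feature (maximum likelihood)."""
    mu = x.mean()                      # mu = (1/m) * sum(x_i)
    sigma2 = ((x - mu) ** 2).mean()    # sigma^2 = (1/m) * sum((x_i - mu)^2)
    return mu, sigma2

def gaussian_density(x, mu, sigma2):
    """Evaluate p(x) under the fitted Gaussian."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Example: fit on normal-looking values, then score a normal point and an outlier.
x_train = np.array([4.9, 5.1, 5.0, 4.8, 5.2])
mu, sigma2 = fit_gaussian(x_train)
print(gaussian_density(np.array([5.0, 6.0]), mu, sigma2))  # the outlier gets a tiny p(x)
```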
Given a training set {x^(1), x^(2), …, x^(m)}, where each example x^(i) has n features,
$$X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{bmatrix} \qquad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
The model for the probability is defined as follows, where each feature xj has its own mean μj and variance σj² as parameters of its probability function,
$$p(x) = p(x_1; \mu_1, \sigma_1^2) \cdot p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$$

$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$
$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$$

$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$
Vectorized formula
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}$$
$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$
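A minimal vectorized sketch of the full model (assuming a NumPy matrix X of shape (m, n) containing only normal examples; the function names are illustrative):

```python
import numpy as np

def estimate_parameters(X):
    """Per-feature mean and variance over the training set X of shape (m, n)."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2

def multivariate_p(X, mu, sigma2):
    """p(x) for each row of X, as the product of the per-feature Gaussian densities."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)
```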
When developing any learning algorithm (choosing features, etc.), making
decisions is much easier if there is a way to evaluate the algorithm. The
technique is to compute a single number that is used as an evaluation metric to
gauge performance. This process is called real-number evaluation.
In the case of anomaly detection, one extra step is needed: assume there is some
labeled data, with anomalous examples labeled y = 1 and non-anomalous (normal)
examples labeled y = 0. The training set should contain only normal data (y = 0);
if a few anomalous examples slip in, that is fine. Then cross validation and test
sets are created. Ideally, both of these sets should contain mostly normal examples
and a few anomalous examples.
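A rough sketch of such a split (the 60/20/20 proportions and the array names are just assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_for_anomaly_detection(X_normal, X_anomalous):
    """Train on normal data only; spread the few anomalies across CV and test sets."""
    X_normal = rng.permutation(X_normal)
    n = len(X_normal)
    X_train = X_normal[: int(0.6 * n)]            # 60% of normal data -> training
    X_cv = X_normal[int(0.6 * n): int(0.8 * n)]   # 20% of normal data -> cross validation
    X_test = X_normal[int(0.8 * n):]              # 20% of normal data -> test
    half = len(X_anomalous) // 2
    y_cv = np.r_[np.zeros(len(X_cv)), np.ones(half)]
    y_test = np.r_[np.zeros(len(X_test)), np.ones(len(X_anomalous) - half)]
    X_cv = np.vstack([X_cv, X_anomalous[:half]])
    X_test = np.vstack([X_test, X_anomalous[half:]])
    return X_train, X_cv, y_cv, X_test, y_test
```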
$$y = \begin{cases} 1 & \text{if } p(x) < \epsilon \ \text{(anomaly)} \\ 0 & \text{if } p(x) \geq \epsilon \ \text{(normal)} \end{cases}$$
After the algorithm has been trained on the training set, the cross validation set is
used to fine tune the probability threshold ϵ. Epsilon can be tuned by looking at
which examples were incorrectly labeled. Fine tuning the features xj is also an option
here. Once all parameters have been optimized, the model can be run
on the test set for a final verdict.
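One common way to do this tuning is to sweep candidate values of ϵ and keep the one with the best F1 score on the cross validation set. A sketch, assuming p_cv holds p(x) for the cross validation examples (for instance from the multivariate_p helper sketched earlier) and y_cv holds their labels:

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Try many thresholds between min and max of p_cv; keep the best-F1 epsilon."""
    best_epsilon, best_f1 = 0.0, 0.0
    for epsilon in np.linspace(p_cv.min(), p_cv.max(), 1000):
        predictions = (p_cv < epsilon).astype(int)   # 1 = flagged as anomaly
        tp = np.sum((predictions == 1) & (y_cv == 1))
        fp = np.sum((predictions == 1) & (y_cv == 0))
        fn = np.sum((predictions == 0) & (y_cv == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_epsilon = f1, epsilon
    return best_epsilon, best_f1
```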
Alternatively, there are times when it is best to use only a cross validation set and
no test set; for example, if the data contains very few labeled anomalous examples
or the data set itself is small. The downside is that the model cannot be evaluated
on held-out data after being fine tuned, so there is a higher risk of overfitting.
Precision/Recall
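Because the labels are heavily skewed, plain accuracy is misleading, so precision and recall are used instead: precision is the fraction of flagged anomalies that are actually anomalous, and recall is the fraction of actual anomalies that get flagged. In terms of true positives (TP), false positives (FP), and false negatives (FN) on the cross validation set:

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$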
F1-score
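The F1-score combines precision (P) and recall (R) into a single number, their harmonic mean, so it is only high when both are reasonably high:

$$F_1 = \frac{2PR}{P + R}$$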
Regardless of what approach is used, the key is to find the best value for the
probability boundary ϵ.
Anomaly detection works best when the dataset has a very small number of positive examples (y = 1) (0–20 is common) and a large number of negative (y = 0) examples. Supervised learning works well when the dataset has a large number of both positive and negative examples, or at least when it is not as skewed as an anomaly detection dataset.
When future anomalies occur, especially ones that look nothing like any of the
anomalous examples seen so far, the two approaches behave quite differently:
anomaly detection can still flag them as unusual, while supervised learning tends
to recognize only anomalies that resemble those it was trained on.
Fine Tuning Features
With supervised learning, if some features are not quite right or not relevant to the
problem, that is usually ok because the algorithm can figure out which features to
ignore, rescale, etc. However, anomaly detection runs on unlabeled data, so it is much
harder for the algorithm to compensate for such problems. Thus, carefully
choosing features is especially important.
Transforming features
One important step toward effective features is to make sure they are roughly
Gaussian. In other words, the data for that feature should closely resemble a
symmetric bell curve. There are several tricks for converting non-Gaussian
features, one of which is directly transforming the data. For example, given
feature data x, one could take log(x), e^x, √x, or any other mathematical
transformation for that matter.
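A quick sketch of this workflow (the toy data, candidate transformations, and exponent 0.3 are just assumptions for illustration; the feature values are assumed positive so log and square root are defined):

```python
import numpy as np
import matplotlib.pyplot as plt

# Compare the raw feature to a few candidate transformations by eye:
# pick whichever histogram looks closest to a symmetric bell curve.
x = np.random.default_rng(0).exponential(scale=2.0, size=1000)  # skewed toy data

candidates = {
    "x": x,
    "log(x)": np.log(x),
    "sqrt(x)": np.sqrt(x),
    "x**0.3": x ** 0.3,
}

fig, axes = plt.subplots(1, len(candidates), figsize=(12, 3))
for ax, (name, values) in zip(axes, candidates.items()):
    ax.hist(values, bins=50)
    ax.set_title(name)
plt.show()
```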
Error analysis
After modeling the initial training data, if the model does not perform well on the cross
validation set, there is also the option of carrying out error analysis. Put simply,
this involves looking at the errors and trying to reason about why they happened.
This usually starts with evaluating the probability function p(x). Remember, the
two conditions for an optimal model are: p(x) is large (at least ϵ) for normal
examples x, and p(x) is small (less than ϵ) for anomalous examples x. A common
failure is that p(x) comes out comparably large for both normal and anomalous
examples; looking at the misclassified anomalies often suggests a new feature
that separates them from the normal data.
For example, consider a fraud detection algorithm where the original feature is
the number of transactions, and an additional feature is typing speed (which is
directly related to carrying out many transactions quickly).
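As a small, hypothetical sketch (the column meanings and the combined feature are made up for illustration): after error analysis, a new feature can be derived from existing ones and appended before re-fitting p(x):

```python
import numpy as np

# Hypothetical columns: X[:, 0] = number of transactions, X[:, 1] = typing speed.
def add_derived_feature(X):
    """Append a feature that is large only when both originals are unusually large."""
    transactions, typing_speed = X[:, 0], X[:, 1]
    new_feature = transactions * typing_speed   # one of many possible combinations
    return np.column_stack([X, new_feature])
```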