
Anomaly Detection

Anomaly detection algorithms look at an unlabeled data set and determine unusual
or abnormal data points or events. Common use cases include applications where
unusual activity is a problem, such as fraud detection, quality analysis in
manufacturing, and monitoring computers in data centers.

Density Estimation
The most common way to carry out anomaly detection is with an algorithm called
density estimation. Given a dataset {x^(1), x^(2), …, x^(m)}, the density
estimation algorithm computes a model p(x), the probability of x being observed
in the dataset.

If this were graphed, boundaries would form around the data, corresponding to
various probabilities. As a data point moves closer to the "center" of the data,
it has a higher probability; likewise, as it moves further away, its probability
decreases.

In order to detect an anomaly, a cutoff point must be set, represented by ϵ.
This is usually a small number, depending on the application. New data is fed
into the model and its probability is compared to this value; each new example
is then labeled as an anomaly or not. This relationship is represented as,

anomaly if p(x_test) < ϵ

Gaussian (Normal) Distribution


In order to find an optimal probability cutoff ϵ and apply anomaly detection,
the Gaussian, or normal, distribution is used. This is commonly referred to as
a bell curve. Given a value x, the probability of x is determined by a Gaussian
with mean μ and variance σ², where σ is the standard deviation. Since
probabilities always sum to 1, the area under the curve is also equal to 1.

The equation for the probability as a function of x is as follows,

p(x) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))

To compute the mean, the values are simply averaged,

μ = (1/m) Σ_{i=1}^{m} x^(i)

The standard deviation measures how far the data points lie from the mean, and
the variance is the average of these squared differences,

σ² = (1/m) Σ_{i=1}^{m} (x^(i) − μ)²
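As a sketch, the mean, variance, and bell-curve probability above can be computed directly with NumPy; the data values here are made up for illustration:

```python
import numpy as np

# Hypothetical 1-D feature data; the last value is an obvious outlier.
x = np.array([4.9, 5.1, 5.0, 4.8, 5.2, 9.7])

# Fit the Gaussian parameters exactly as in the formulas above.
mu = x.mean()                      # mu = (1/m) * sum(x_i)
sigma2 = np.mean((x - mu) ** 2)    # sigma^2 = (1/m) * sum((x_i - mu)^2)

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2) from the bell-curve equation above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(gaussian_pdf(x, mu, sigma2))  # the outlier gets a much smaller p(x)
```

Note that the outlier both inflates the fitted variance and still lands far enough in the tail to receive a much lower probability than the clustered points.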

Defining the Algorithm


Taking the density estimation and Gaussian distribution, a more formal definition
for the anomaly detection algorithm is made:

Given a training set: {x(1) , x(2) , … , x(m) }, where each example x(i) has n
features,

    ⎡ x^(1) ⎤        ⎡ x_1 ⎤
    ⎢ x^(2) ⎥        ⎢ x_2 ⎥
X = ⎢   ⋮   ⎥    x = ⎢  ⋮  ⎥
    ⎣ x^(m) ⎦        ⎣ x_n ⎦

The model for the probability is defined as follows, where the means μ_j and
variances σ_j² are parameters to the probability function,

p(x) = p(x_1; μ_1, σ_1²) · p(x_2; μ_2, σ_2²) ⋯ p(x_n; μ_n, σ_n²)

Note that this models the features as statistically independent, though the
algorithm often works well even when they are not.

A more compact way to write this function is below,

p(x) = ∏_{j=1}^{n} p(x_j; μ_j, σ_j²)

Putting it all together


1. Choose n features x_j that might be indicative of anomalous examples.

2. Fit parameters μ_1, …, μ_n and σ_1², …, σ_n²:

μ_j = (1/m) Σ_{i=1}^{m} x_j^(i)

σ_j² = (1/m) Σ_{i=1}^{m} (x_j^(i) − μ_j)²

Vectorized formula:

μ = (1/m) Σ_{i=1}^{m} x^(i),  where μ = (μ_1, μ_2, …, μ_n)ᵀ

3. Given new example x, compute p(x):

p(x) = ∏_{j=1}^{n} p(x_j; μ_j, σ_j²) = ∏_{j=1}^{n} (1 / (√(2π) σ_j)) · exp(−(x_j − μ_j)² / (2σ_j²))

4. Determine an anomaly if p(x) < ϵ.
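The four steps can be sketched end to end as follows; the training data, the ϵ value, and the test points are all assumptions for illustration:

```python
import numpy as np

def fit_gaussian(X):
    """Step 2: fit mu_j and sigma_j^2 per feature (vectorized over columns)."""
    mu = X.mean(axis=0)
    sigma2 = np.mean((X - mu) ** 2, axis=0)
    return mu, sigma2

def p(X, mu, sigma2):
    """Step 3: p(x) as the product of per-feature Gaussians."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

# Hypothetical training data: mostly normal points around (5, 10).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5, 10], scale=[1, 2], size=(500, 2))

mu, sigma2 = fit_gaussian(X_train)

# Step 4: flag examples whose probability falls below epsilon.
epsilon = 1e-4  # assumed cutoff; in practice tune it on a CV set
X_new = np.array([[5.1, 9.8],    # typical point
                  [0.0, 25.0]])  # far from the training data
is_anomaly = p(X_new, mu, sigma2) < epsilon
print(is_anomaly)  # expect [False, True]
```

The typical point sits near the fitted mean and gets a probability well above ϵ, while the distant point falls many standard deviations out on both features and is flagged.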

Evaluating the Model’s Performance

When developing any learning algorithm (choosing features, etc.), making
decisions is much easier if there is a way to evaluate the algorithm. A common
technique is to produce a single number that serves as an evaluation metric to
gauge performance. This process is called real-number evaluation.

In the case of anomaly detection, there is one needed step, which is to assume
there is some labeled data of anomalous (y = 1) and non-anomalous, normal
(y = 0) examples. The training set should contain only normal data (y = 0); if
a few anomalous examples slip in, that is fine. Then cross validation and test
sets are created. Ideally, both of these sets should contain mostly normal
examples and a few anomalous examples.

y = 1 if p(x) < ϵ (anomaly)
y = 0 if p(x) ≥ ϵ (normal)

Cross validation set: (x_cv^(1), y_cv^(1)), …, (x_cv^(m_cv), y_cv^(m_cv))

Test set: (x_test^(1), y_test^(1)), …, (x_test^(m_test), y_test^(m_test))

After the algorithm has been trained on the training set, the cross validation
set is used to fine-tune the probability boundary ϵ. Epsilon can be tuned by
looking at which data was incorrectly labeled. Fine-tuning the features x_j is
also an option here. Once all parameters have been optimized, the model can be
run on the test set for a final verdict.

Alternatively, there are times when it is best to use only a cross validation
set and no test set; for example, if the training data contains very few
labeled anomalous examples or the data set itself is small. The downside is
that the model cannot be tested after the parameters are fine-tuned, so there
is a higher risk of overfitting.

Possible evaluation metrics


Just like other classification problems, there are alternative metrics to help
evaluate the model. These metrics are particularly useful for very skewed data
and were previously mentioned in detail in Practical Machine Learning tips.

True positive, false positive, false negative, true negative

Precision/Recall

F1-score

Regardless of what approach is used, the key is to find the best value for the
probability boundary ϵ.
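One common way to pick ϵ is to scan candidate values and keep the one with the best F1-score on the cross validation set. A minimal sketch, with made-up CV probabilities and labels:

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate cutoffs; keep the one with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = p_cv < eps                       # predicted anomalies
        tp = np.sum((preds == 1) & (y_cv == 1))  # true positives
        fp = np.sum((preds == 1) & (y_cv == 0))  # false positives
        fn = np.sum((preds == 0) & (y_cv == 1))  # false negatives
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Hypothetical CV probabilities p(x_cv) and labels (1 = anomaly).
p_cv = np.array([0.09, 0.08, 0.10, 0.07, 1e-5, 2e-5])
y_cv = np.array([0, 0, 0, 0, 1, 1])
eps, f1 = select_epsilon(p_cv, y_cv)
print(eps, f1)  # a cutoff separating the two anomalies gives F1 = 1.0
```

F1 is a sensible objective here precisely because the labels are skewed: plain accuracy could be maximized by never flagging anything.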

Comparing to Supervised Learning


Since anomaly detection requires labeled data for the cross validation and test
sets (at the very least), wouldn’t it make more sense to use a supervised learning
algorithm? Well, this can sometimes be the case, and the choice between the two
is often subtle.

The first criterion lies in the data itself; specifically, how many of the
examples are positive or negative.

Anomaly detection works best when the dataset has a very small number of
positive examples (y = 1) (0–20 is common) and a large number of negative
(y = 0) examples. Supervised learning works well when the dataset has a large
number of both positive and negative examples, or at least one that is not as
skewed as an anomaly detection dataset.

When future anomalies occur, especially ones that look nothing like any of the
anomalous examples seen so far, the two approaches can be quite different.

Anomaly detection: since the training data contains a large number of negative
examples and few positives, it is hard for the algorithm to learn from the
positive examples what anomalies look like. Therefore, it is more likely to
flag many different types of future anomalies it hasn't seen before.

Supervised learning: enough positive examples exist for the algorithm to get a
sense of what they are like, so future positive examples are likely to be
similar to ones in the training set. Therefore, it is harder to catch types of
anomalies the model hasn't seen before.

A use case to illustrate this difference is manufacturing. Anomaly detection is
more useful for finding new, previously unseen defects, while supervised
learning is more useful for finding known, previously seen defects.

Fine Tuning Features
With supervised learning, if some features are not quite right or not relevant
to the problem, that is usually okay, because the algorithm can figure out
which features to ignore, rescale, etc. However, for anomaly detection, which
runs on unlabeled data, it is harder for the algorithm to account for these
potential problems. Thus, carefully choosing features is especially important.

Transforming features
One important step to enforcing efficient features is to make sure they are
Gaussian. In other words, the data for that feature should closely resemble a
symmetric bell curve. There are several tricks to converting non-gaussian
features, one of which is directly manipulating the data. For example, given
feature data x, one could take log(x), ex , x , or any other mathematical
transformation for that matter.
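A quick way to compare candidate transformations is to check how symmetric each transformed feature looks, for example via sample skewness (a value near 0 suggests a more bell-shaped distribution). A sketch with synthetic right-skewed data:

```python
import numpy as np

# Hypothetical skewed feature (e.g. transaction counts): long right tail.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)

# Candidate transformations to make the feature more Gaussian.
candidates = {
    "x": x,
    "log(x + 1)": np.log1p(x),
    "sqrt(x)": np.sqrt(x),
    "x^(1/3)": np.cbrt(x),
}

# Sample skewness as a rough symmetry check.
for name, t in candidates.items():
    skew = np.mean(((t - t.mean()) / t.std()) ** 3)
    print(f"{name:12s} skewness = {skew:+.2f}")
```

In practice, plotting a histogram of each transformed feature gives the same information visually and is how this check is usually done.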

Error analysis
After modeling the initial training data, if it does not test perform well on the cross
validation set, there is also the option of carrying out error analysis. More simply,
this involves looking at errors and trying to reason why they are behaving that
way.

This usually starts with evaluating the probability function p(x). Remember, the
two conditions for an optimal model are:

1. p(x) is large for normal examples x: p(x) ≥ ϵ

2. p(x) is small for anomalous examples x: p(x) < ϵ

The most common problem uncovered by error analysis is that p(x) is comparable
for normal and anomalous examples (usually a large value for both). When an
anomalous example has a large value for p(x), it can often go undetected and
will not get flagged by the algorithm.

Deriving a new feature x_2 from the problem feature x_1 can help solve this.
The new values do not necessarily have to be directly derived from x_1, but the
two features should be very closely related. When the two features are plotted
against each other, the anomalies can stand out much more clearly.

For example, consider a fraud detection algorithm where the original feature is
the number of transactions, and the additional feature is typing speed (which
would be directly related to making many transactions quickly).

Existing features can also be combined into new features, and mathematical
transformations can be applied as well. For example, a new feature x_3 for a
data center algorithm could be defined as,

x_3 = (CPU load)² / (network traffic) = x_1² / x_2
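This derived feature can be sketched with made-up machine metrics; the machine with high CPU load but little network traffic stands out:

```python
import numpy as np

# Hypothetical data-center metrics per machine.
x1 = np.array([0.6, 0.7, 0.65, 0.9])   # CPU load
x2 = np.array([120., 140., 130., 5.])  # network traffic

# New feature: large when a machine is busy without serving traffic,
# e.g. a process stuck in an infinite loop.
x3 = x1 ** 2 / x2
print(x3)  # the last machine's value is far larger than the others
```

Neither x_1 nor x_2 alone flags the last machine as unusual, but their ratio does, which is exactly the point of deriving combined features.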
