
Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection

Lorenzo Perini 1, Paul-Christian Bürkner 2, Arto Klami 3

1 DTAI lab & Leuven.AI, Department of Computer Science, KU Leuven, Belgium; 2 Cluster of Excellence SimTech, University of Stuttgart, Germany; 3 Department of Computer Science, University of Helsinki, Finland. Correspondence to: Lorenzo Perini <[email protected]>.

arXiv:2210.10487v2 [cs.LG] 17 Oct 2023. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores need to be transformed into actual predictions by thresholding so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called the contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor for a given unlabeled dataset. We leverage several anomaly detectors to capture the basic notion of anomalousness and estimate the contamination using a specific mixture formulation. Empirically on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the detectors' performance over several alternative methods.

1. Introduction

Anomaly detection aims at automatically identifying samples that do not conform to the normal behaviour, according to some notion of normality (see e.g., Chandola et al. (2009)). Anomalies are often indicative of critical events such as intrusions in web networks (Malaiya et al., 2018), failures in petroleum extraction (Martí et al., 2015), or breakdowns in wind and gas turbines (Zaher et al., 2009; Yan & Yu, 2019). Such events have an associated high cost, and detecting them avoids wasting time and resources.

Typically, anomaly detection is tackled from an unsupervised perspective (Maxion & Tan, 2000; Goldstein & Uchida, 2016; Zong et al., 2018; Perini et al., 2020b; Han et al., 2022) because labeled samples, especially anomalies, may be expensive and difficult to acquire (e.g., you do not want to voluntarily break the equipment simply to observe anomalous behaviours), or simply rare (e.g., you may need to inspect many samples before finding an anomalous one). Unsupervised anomaly detectors exploit data-driven heuristic assumptions (e.g., anomalies are far away from normals) to assign a real-valued score to each sample denoting how anomalous it is. Using such anomaly scores enables ranking the samples from most to least anomalous.

Converting the anomaly scores into discrete predictions would practically allow the user to flag the anomalies. Commonly, one sets a decision threshold and labels samples with higher scores as anomalous and samples with lower scores as normal. However, setting the threshold is a challenging task, as it cannot be tuned (e.g., by maximizing the model performance) due to the absence of labels. One approach is to set the threshold such that the proportion of scores above it matches the dataset's contamination factor γ, i.e. the expected proportion of anomalies. If the ranking is correct (that is, all anomalies are ranked before any normal instance), then thresholding with exactly the correct γ correctly identifies all anomalies. However, in most real-world scenarios the contamination factor is unknown.

Estimating the contamination factor γ is challenging. Existing works provide an estimate by using either some normal labels (Perini et al., 2020a) or domain knowledge (Perini et al., 2022). Alternatively, one can directly threshold the scores through statistical threshold estimators and derive γ as the proportion of scores higher than the threshold. For instance, the Modified Thompson Tau test thresholder (MTT) finds the threshold through the modified Thompson Tau test (Rengasamy et al., 2021), while the Inter-Quartile Region thresholder (IQR) uses the third quartile plus 1.5 times the inter-quartile region (Bardet & Dimby, 2017). In Section 4 we provide a comprehensive list of estimators.

Transforming the scores into predictions using an incorrect estimate of the contamination factor (or, equivalently, an incorrect threshold) deteriorates the anomaly detector's performance (Fourure et al., 2021; Emmott et al., 2015) and reduces the trust in the detection system. If such an estimate were coupled with a measure of uncertainty, one could take this uncertainty into account to improve decisions. Although existing methods propose Bayesian anomaly detectors (Shen & Cooper, 2010; Roberts et al., 2019; Hou et al., 2022; Heard et al., 2010), none of them study how to transform scores into hard predictions.

Therefore, we are the first to study the estimation of the contamination factor from a Bayesian perspective. We propose γGMM, the first algorithm for estimating the contamination factor's (posterior) distribution in unlabeled anomaly detection setups. First, we use a set of unsupervised anomaly detectors to assign anomaly scores to all samples and use these scores as a new representation of the data. Second, we fit a Bayesian Gaussian Mixture model with a Dirichlet Process prior (DPGMM) (Ferguson, 1973; Rasmussen, 1999) in this new space. If we knew which components contain the anomalies, we could derive the contamination factor's posterior distribution as the distribution of the sum of such components' weights. Because we do not know this, as a third step γGMM estimates the probability that the k most extreme components are jointly anomalous, and uses this information to construct the desired posterior. The method is explained in detail in Section 3.

In summary, we make four contributions. First, we adopt a Bayesian perspective and introduce the problem of estimating the contamination factor's posterior distribution. Second, we propose an algorithm that is able to sample from this posterior. Third, we demonstrate experimentally that the implied uncertainty-aware predictions are well calibrated and that taking the posterior mean as a point estimate of γ outperforms several other algorithms on common benchmarks. Finally, we show that using the posterior mean as a threshold improves the actual anomaly detection accuracy.

2. Preliminaries

Let (Ω, F, P) be a probability space, and X : Ω → R^d a random variable from which a dataset D = {X_1, ..., X_N} of N random examples is drawn. Assume that X has a distribution of the form P = (1 − γ) · P_1 + γ · P_2, where P_1 and P_2 are the distributions on R^d corresponding to normal examples and anomalies, respectively, and γ ∈ [0, 1] is the contamination factor, i.e. the proportion of anomalies. An (unsupervised) anomaly detector is a measurable function f : R^d → R that assigns real-valued anomaly scores f(X) to the examples. Such anomaly scores follow the rule that the higher the score, the more anomalous the example.

A Gaussian mixture model (GMM) with K components (see e.g. Roberts et al. (1998)) is a generative model defined by a distribution on a space R^M such that

p(s) = Σ_{k=1}^K π_k N(s | µ_k, Σ_k)    for s ∈ R^M,

where N(s | µ_k, Σ_k) denotes the Gaussian distribution with mean vector µ_k and covariance matrix Σ_k ∈ R^{M×M}, and the π_k are the mixing proportions such that Σ_{k=1}^K π_k = 1. For finite mixtures, we typically have a Dirichlet prior over π = [π_1, ..., π_K], but Dirichlet Process (DP) priors allow treating also the number of components as unknown (Görür & Rasmussen, 2010). In both cases, we need approximate inference to estimate the posterior of the model parameters.

3. Methodology

We tackle the following problem: given an unlabeled dataset D and a set of M unsupervised anomaly detectors, estimate a (posterior) distribution of the contamination factor γ.

Learning from an unlabeled dataset has three key challenges. First, the absence of labels forces us to make relatively strong assumptions. Second, the anomaly detectors rely on different heuristics that may or may not hold, and their performance can hence vary significantly across datasets. Third, we need to be careful in introducing user-specified hyperparameters, because setting them properly may be as hard as directly specifying the contamination factor.

In this paper, we propose γGMM, a novel Bayesian approach that estimates the contamination factor's posterior distribution in four steps, which are illustrated in Figure 1:

Step 1. Because anomalies may not follow any particular pattern in covariate space, γGMM maps the covariates X ∈ R^d into an M-dimensional anomaly space, where the dimensions correspond to the anomaly scores assigned by the M unsupervised anomaly detectors. Within each dimension of such a space, the evident pattern is that "the higher, the more anomalous".

Step 2. We model the data points in the new space R^M using a Dirichlet Process Gaussian Mixture Model (DPGMM) (Neal, 1992; Rasmussen, 1999). We assume that each of the (potentially many) mixture components contains either only normals or only anomalies. If we knew which components contained anomalies, we could then easily derive γ's posterior as the sum of the mixing proportions π of the anomalous components. However, such information is not available in our setting.

Step 3. Thus, we order the components by decreasing anomalousness, and we estimate the probability of the largest k components being anomalous. This poses three challenges: (a) how to represent each M-dimensional component by a single value to sort them from the most to the least anomalous, (b) how to compute the probability that the kth component is anomalous given that the (k − 1)th is such, and (c) how to derive the target probability that k components are jointly anomalous.

Step 4. γGMM estimates the contamination factor's posterior by exploiting such a joint probability and the components' mixing proportions posterior.


[Figure 1]
Figure 1. Illustration of the γGMM's four steps on a 2D toy dataset (left plot): we 1) map the 2D dataset into an M = 2 dimensional anomaly space, 2) fit a DPGMM model on it, 3) compute the components' probability of being anomalous (conditional, in the plot), and 4) derive γ|S's posterior. γGMM's mean is an accurate point estimate for the true value γ*.

In the following, we describe these steps in detail.

3.1. Representing Data Using Anomaly Scores

Learning from an unlabeled anomaly detection dataset has two major challenges. First, anomalies are rare and sparse events, which makes it hard to use common unsupervised methods like clustering (Breunig et al., 2000). Second, making assumptions on the unlabeled data is challenging due to the absence of specific patterns in the anomalies, which makes it hard to choose a specific anomaly detector. Therefore, we use a set of M anomaly detectors to map the d-dimensional input space into an M-dimensional score space R^M, such that a sample x gets a score s:

R^d ∋ x → [f_1(x), f_2(x), ..., f_M(x)] = s ∈ R^M.

This has two main effects: (1) it introduces an interpretable space where the evident pattern is that, within each dimension, higher scores are more likely to be anomalous, and (2) it accounts for multiple inductive biases by using multiple arbitrary anomaly detectors.

To make the dimensions comparable, we (independently for each dimension) map the scores s ∈ S to log(s − min(S) + 0.01), where the log is used to shorten heavy right tails, and normalize them to have zero mean and unit variance.
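As a concrete illustration, the following is a minimal sketch of this mapping using PyOD, the library the experiments in Section 4.1 rely on; the particular three detectors and the toy data are only an example, not the paper's exact setup:

```python
import numpy as np
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF

def anomaly_score_space(X, detectors):
    """Step 1: map X (N x d) into the M-dimensional anomaly-score space."""
    # Each column holds one detector's training scores ("higher = more anomalous").
    S = np.column_stack([det.fit(X).decision_scores_ for det in detectors])
    # Per dimension: log-transform to shorten heavy right tails, then standardize.
    S = np.log(S - S.min(axis=0) + 0.01)
    return (S - S.mean(axis=0)) / S.std(axis=0)

X = np.random.randn(500, 8)                              # toy unlabeled dataset
S = anomaly_score_space(X, [KNN(), IForest(), LOF()])    # M = 3 detectors
```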
3.2. Modeling the Density with DPGMM

We use mixture models as the basis for quantifying the distribution of the contamination factor, relying on their ability to model the proportions of samples using the mixture weights. For flexible modeling, we use the DPGMM

s_i ∼ N(µ̃_i, Σ̃_i),    i = 1, ..., N,
(µ̃_i, Σ̃_i) ∼ G,
G ∼ DP(G_0, α),
G_0 = NIW(M, λ, V, u),

where G is a random distribution of the mean vectors µ_i and covariance matrices Σ_i, drawn from a DP with base distribution G_0. We use the explicit representation G = Σ_{k=1}^∞ π_k δ_{(µ_k, Σ_k)}, where δ_{(µ_k, Σ_k)} is the delta distribution at (µ_k, Σ_k) and the π_k follow the stick-breaking distribution. We set G_0 as a Normal Inverse Wishart (Nydick, 2012) with parameters M, λ, V, u common to all components. We use variational inference (VI; see e.g. Blei et al. (2017) for details) for approximating the posterior, as VI is computationally efficient and sufficiently accurate for our purposes. Alternative methods (e.g., Markov Chain Monte Carlo (Brooks et al., 2011)) could also be used but were not considered worth the additional computational effort here.

Choice of DPGMM. The DPGMM has two key properties that justify its use over other flexible density models. First, we choose Gaussian distributions over more robust heavy-tailed distributions because isolated samples are likely candidates for outliers, and encouraging the model to represent them using the heavy tails would be counter-productive. Second, the rich-get-richer property of DPs is desirable because we expect some very large components of normals but want to allow arbitrarily small clusters of anomalies. Moreover, the DP formulation allows us to refrain from specifying the number of components K. After fitting the model, we only consider the components with at least one observation assigned to them and propagate all the remaining density uniformly over the active components. Thus, for the following steps we can still proceed as if the model were a finite mixture with π following a Dirichlet distribution.
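In code, this model corresponds to scikit-learn's variational BayesianGaussianMixture with a Dirichlet Process (stick-breaking) prior, which is also the implementation the experimental section reports using. A minimal sketch, continuing from the score matrix S above and using the weakly-informative priors described in Section 4.1; the number of iterations is an arbitrary choice:

```python
from sklearn.mixture import BayesianGaussianMixture

M = S.shape[1]
dpgmm = BayesianGaussianMixture(
    n_components=100,                                    # truncation level: upper bound on K
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    mean_prior=np.zeros(M),                              # means' prior set to 0
    covariance_prior=np.eye(M),                          # identity covariance prior
    max_iter=500,
).fit(S)

# Keep only the "active" components (at least one observation assigned) and
# spread the leftover density uniformly over them, as described above.
active = np.unique(dpgmm.predict(S))
pi = dpgmm.weights_[active]
pi = pi + (1.0 - pi.sum()) / len(active)
means, covs = dpgmm.means_[active], dpgmm.covariances_[active]
```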


3.3. Estimating the Components' Anomalousness

We assume that each mixture component contains either only anomalous or only normal samples. All unsupervised methods rely on some assumption that nearby samples share latent characteristics, and this cluster assumption is a natural and weak one. If we knew which components contain anomalies, we could directly derive the posterior of the contamination factor γ as the sum of the mixing proportions π_k of those components. This is naturally not the case, so we need to estimate it in an unsupervised fashion.

More formally, we estimate the probability that k (out of K) components are anomalous, such that we can derive γ's posterior by averaging over all the values 0 ≤ k ≤ K. We do this in three steps. Initially, we sort the components of score vectors in decreasing order (by degree of anomalousness), which comes naturally from the representation we made in Step 1 (Sec. 3.1). Then, our insight is that the kth component can be anomalous only if the (k − 1)th is such. This points to the estimation of conditional probabilities, i.e., the probability of c_k = "the kth component is anomalous" given c_{k−1}. Finally, the probability that exactly the first k components are anomalous can be obtained using basic rules of probability theory.

Assigning an ordering to the components. As an initial step for computing the joint probability, we need to design a decreasing ordering map for the components based on their anomalousness. We do this in a manner that accounts for the uncertainty of the components' parameters, so as to rank high the components that can be reliably identified as anomalous: we want the means to be high but the variance low, to avoid the risk that samples with low anomaly scores could also belong to the component.

We construct the overall ranking using dimension-specific scores because our normalization cannot remove all statistical differences between the different detectors. Formally, let r : R^M × R^{M×M} → R be the function of the mean vector µ_k and the covariance matrix Σ_k that assigns a real value representing component k's anomalousness. We set r as

r(µ_k^(z), Σ_k^(z)) = (1/M) Σ_{j=1}^M µ_k^(z)(j) / sqrt(1 + Σ_k^(z)(j,j)),    (1)

where µ_k^(z) and Σ_k^(z) are samples from the parameters' posterior distributions of the kth component, µ_k^(z)(j) denotes the jth entry of the mean vector, and Σ_k^(z)(j,j) the jth diagonal entry of the covariance matrix. We obtain a representative value of the whole component by taking the expected value of r, i.e. through E[r(µ_k, Σ_k)]. Equation (1) intentionally does not consider inter-dimension correlations, as it remains unclear to us how those should ideally be included and what benefits that would actually provide.

We add 1 to the component's standard deviation for two reasons. First, if a component contains samples with almost the same covariate values, the standard deviation would be close to 0 and the ratio would explode towards infinity, masking any effect of the mean. Second, adding 1 is reasonable because it is equal to the theoretical upper bound of the components' variances, as they are normalized (Sec. 3.1).

Without loss of generality, from now on we assume that the components' index k is ordered based on their representative value, such that the kth component has a higher value (i.e., is more anomalous) than the (k + 1)th component.
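A short sketch of this ranking score, continuing the earlier sketches; it evaluates Equation (1) on one posterior draw of a component's parameters and averages over draws to approximate E[r(µ_k, Σ_k)]:

```python
def r_score(mu_z, Sigma_z):
    """Eq. (1): average over dimensions of mean / sqrt(1 + variance)."""
    return np.mean(mu_z / np.sqrt(1.0 + np.diag(Sigma_z)))

def representative_value(mu_draws, Sigma_draws):
    """Approximate E[r] for one component from posterior draws.

    mu_draws: (n_draws, M) array; Sigma_draws: (n_draws, M, M) array.
    """
    return np.mean([r_score(m, Sig) for m, Sig in zip(mu_draws, Sigma_draws)])
```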
Estimating the probability that the kth component is anomalous. Because the components are sorted by anomalousness, our key insight is that the kth component can be anomalous only if the (k − 1)th is anomalous. Formally,

P(c_k | c_{k−1}) > 0  and  P(c_k | c̄_{k−1}) = 0    (1 < k ≤ K),

where c̄_{k−1} means "not c_{k−1}". Moreover, we assume P(c_1) ∈ (0, 1). That is, we allow for the data to not have anomalies (< 1) but exclude certain knowledge of no anomalies (> 0). This is a sensible assumption because, if one knew for sure that no anomalies are in the data, then we trivially have γ = 0, whereas we still need to allow for the data to be free of anomalies if the evidence suggests so.

We estimate the conditional probability as

P(c_k | c_{k−1}) = 1 / (1 + e^(τ + δ·r(µ_k, Σ_k))),    (2)

where τ and δ are the two hyperparameters of the sigmoid function, which will be carefully discussed in Section 3.4. Note that the principle itself is not restricted to this particular choice of functional form. One could apply any transformation that maps to [0, 1], but the detailed derivations of the parameters would naturally be different.

Deriving the components' joint probability. Given the conditional probability P(c_k | c_{k−1}), the joint probability follows from simple steps. Taking inspiration from sequential ordinal models (Bürkner & Vuorre, 2019), our insight is that exactly k components are jointly anomalous if and only if each of them is conditionally anomalous and the (k + 1)th is not anomalous. We indicate this as C* = k. Essentially,

P(C* = k) := P(c_1, ..., c_k, c̄_{k+1}, ..., c̄_K) = P(c_1) · [Π_{t=1}^{k−1} P(c_{t+1} | c_t)] · (1 − P(c_{k+1} | c_k))    (3)

for any k ≤ K, where P(c_{K+1} | c_K) = 0 by convention.
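Equations (2) and (3) translate directly into code; a sketch, assuming the components are already sorted by decreasing representative value, with the k = 0 entry covering the "no anomalies" case implied by P(c_1) < 1:

```python
def joint_anomaly_probs(r_vals, tau, delta):
    """Eqs. (2)-(3): P(C* = k) for k = 0, ..., K, given sorted r values."""
    K = len(r_vals)
    p_cond = 1.0 / (1.0 + np.exp(tau + delta * r_vals))  # Eq. (2): P(c_k | c_{k-1})
    joint = np.zeros(K + 1)
    survive = 1.0                                        # running product P(c_1, ..., c_k)
    for k in range(K + 1):
        stop = 1.0 - p_cond[k] if k < K else 1.0         # P(c_{K+1} | c_K) = 0 by convention
        joint[k] = survive * stop                        # Eq. (3); k = 0 is "no anomalies"
        if k < K:
            survive *= p_cond[k]
    return joint                                         # sums to 1
```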


3.4. Estimating the Contamination Factor's Distribution

Given the joint probability that the first k components are anomalous (for k ≤ K), the contamination factor γ's posterior distribution can be obtained as

p(γ | S) = Σ_{k=1}^K P(C* = k) · p(Σ_{j=1}^k π_j | S),    (4)

where p(Σ_{j=1}^k π_j | S) is the posterior distribution of the sum of the first k components' mixing proportions, and the P(C* = k) are densities with respect to the counting measure. Note that p(Σ_{j=1}^k π_j | S) = Beta(Σ_{j=1}^k α_j, Σ_{j=k+1}^K α_j) if p(π_1, ..., π_K | S) = Dir(α_1, ..., α_K) (Lin, 2016).
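This aggregation property of the Dirichlet distribution is easy to verify numerically; a small self-contained check, with arbitrary illustrative parameters:

```python
import numpy as np
from scipy.stats import beta

alpha = np.array([0.5, 1.0, 2.0, 4.0])   # illustrative Dirichlet parameters
k = 2                                     # sum of the first k mixing proportions
pi = np.random.default_rng(0).dirichlet(alpha, size=100_000)
a, b = alpha[:k].sum(), alpha[k:].sum()
# Empirical mean of pi_1 + pi_2 vs. the Beta(a, b) mean a / (a + b) = 0.2.
print(pi[:, :k].sum(axis=1).mean(), beta.mean(a, b))
```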
Setting the sigmoid's hyperparameters τ and δ. Introducing new hyperparameters when the task is to estimate the contamination factor γ's posterior is risky, because setting their value may be as difficult as directly providing a point estimate of γ. Our key insight is that we can obtain τ and δ by asking the user two simple questions: (a) How likely is it that no anomalies are in the data? (b) How likely is it that a large amount of anomalies occurred, say, more than t = 15% of the data? Both of these values are supposed to be low. Let's call the two answers p0 and phigh. Formally,

p0 = 1 − P(c_1) = 1 − 1 / (1 + e^(τ + δ·r(µ̃_1, Σ̃_1))),
phigh = P(γ ≥ t | S) = Σ_{k=1}^K P(C* = k) · P(Σ_{j=1}^k π_j ≥ t | S).

One can use a numerical solver for non-linear equations with linear constraints (e.g., the least squares optimizer implemented in SKLEARN) to find the values of τ and δ that satisfy such constraints. The problem has a unique solution whenever phigh ≥ P(π_1 ≥ t | S). This holds almost always in our experimental cases but, in case such a constraint cannot be satisfied, we re-run the variational inference for the DPGMM (with different starting points) until the constraint on phigh holds. If this does not happen within 100 iterations, we reject the possibility of too high contamination factors and just set it to 0. In the experiments (Q5), we show that changing p0 and phigh does not have a large impact on γ's posterior.
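A sketch of this calibration step, using scipy's least-squares solver as a stand-in for the SKLEARN optimizer mentioned above; it assumes the expected r values (already sorted), the Dirichlet parameters α of the fitted mixture, and the joint_anomaly_probs routine from the sketch in Section 3.3:

```python
from scipy.optimize import least_squares
from scipy.stats import beta

def solve_tau_delta(r_vals, alpha, p0=0.01, p_high=0.01, t=0.15):
    """Find (tau, delta) reproducing the user-elicited p0 and p_high."""
    head = np.cumsum(alpha)              # sum of the first k Dirichlet parameters
    tail = alpha.sum() - head            # sum of the remaining parameters

    def residuals(params):
        tau, delta = params
        joint = joint_anomaly_probs(r_vals, tau, delta)
        # P(gamma >= t | S) = sum_k P(C* = k) * P(sum_{j<=k} pi_j >= t | S),
        # with the Beta survival function giving the inner probabilities.
        surv = beta.sf(t, head[:-1], tail[:-1])          # k = 1, ..., K-1
        surv = np.append(surv, 1.0)                      # k = K: the full sum is 1 >= t
        return [joint[0] - p0,                           # constraint for p0 = 1 - P(c_1)
                joint[1:] @ surv - p_high]               # constraint for p_high

    return least_squares(residuals, x0=[0.0, -1.0]).x
```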
Sampling from γ's posterior. Our estimate of the contamination factor's posterior p(γ|S) does not have a simple closed form. However, we can sample from the distribution using a simple process. The DPGMM inference determines an approximation for p(π, µ, Σ|S), and all the quantities required for Equations (2), (3), (4) can be computed based on samples from the approximation. Formally, we derive a sample from p(γ|S) in four steps by repeating the following operations for all k ≤ K. First, we draw samples π_k^(z), µ_k^(z), Σ_k^(z) from π_k (Dirichlet), µ_k (Normal), and Σ_k (Inverse Wishart). Second, we transform π_k^(z) by taking the cumulative sum and obtain a sample Σ_{j=1}^k π_j^(z). Third, we pass µ_k^(z) and Σ_k^(z) through the sigmoid function (2) to get the conditional probabilities P(c_k | c_{k−1}), and transform them into the exact joint probabilities P(C* = k) using Equation (3). Finally, we combine the samples following Formula (4) and obtain a sample γ^(z) from p(γ|S).

Additional technical details. Because our method uses a variational inference approximation, we run it 10 times and concatenate the samples to reduce the risk of biased distributions due to local minima. Moreover, after sorting the components, we set P(c_k | c_{k−1}) = 0 for all k > K′ = arg max{k : E[Σ_{j=1}^k π_j] < 0.25}. This has the effect of setting an upper bound of 0.25 on the contamination factor γ. Because anomalies must be rare, we realistically assume that it is not possible to have more than 25% of them. Although "0.25" could be considered a hyperparameter, this value has virtually no impact on the experimental results. Moreover, note that E[π_1] ≥ 0.25 cannot occur, as otherwise we could not set the hyperparameters p0 and phigh.


4. Experiments

We empirically evaluate two aspects of our method: (a) whether it accurately estimates the contamination factor's posterior, and (b) how thresholding the scores using our method affects the anomaly detectors' performance. To this end, we address the following five experimental questions:

Q1. Is the posterior estimate sharp and well-calibrated?
Q2. How does γGMM compare to threshold estimators?
Q3. Does a better point estimate of γ improve the anomaly detector performance?
Q4. What is the impact of the number of detectors M?
Q5. How sensitive is the method to p0 and phigh?

4.1. Experimental Setup

Methods. We compare the sample mean of γGMM (code and online Supplement are available at https://github.com/Lorenzo-Perini/GammaGMM) with 21 threshold estimators that we cluster into 9 groups:
1. Kernel-based. FGD (Qi et al., 2021) and AUCP (Ren et al., 2018) both use the kernel density estimator to estimate the score density; FGD exploits the inflection points of the density's first derivative, while AUCP uses the percentage of the total kernel density estimator's AUC to set the threshold;
2. Curve-based. EB (Friendly et al., 2013) creates elliptical boundaries by generating pseudo-random eccentricities, while WIND (Jacobson et al., 2013) is based on the topological winding number with respect to the origin;
3. Normality-based. ZSCORE (Bagdonavičius & Petkevičius, 2020) exploits the Z-scores, DSN (Amagata et al., 2021) measures the distance shift from a normal distribution, and CHAU (Bol'shev & Ubaidullaeva, 1975) follows Chauvenet's criterion before using the Z-score;
4. Regression-based. CLF and REGR (Aggarwal, 2017) are two regression models that separate the anomalies based on the y-intercept value;
5. Filter-based. FILTER (Hashemi et al., 2019) and HIST (Thanammal et al., 2014) use the Wiener filter and Otsu's method to filter out the anomalous scores;
6. Statistical test-based. GESD (Alrawashdeh, 2021), MCST (Coin, 2008) and MTT (Rengasamy et al., 2021) are based on, respectively, the generalized extreme studentized deviate, the Shapiro-Wilk, and the modified Thompson Tau statistical tests;
7. Statistical moment-based. BOOT (Martin & Roberts, 2006) derives the confidence interval through the two-sided bias-corrected and accelerated bootstrap; KARCH (Afsari, 2011) and MAD (Archana & Pawar, 2015) are based on means and standard deviations, i.e., the Karcher mean plus one standard deviation, and the mean plus the median absolute deviation over the standard deviation;
8. Quantile-based. IQR (Bardet & Dimby, 2017) and QMCD (Iouchtchenko et al., 2019) set the threshold based on quantiles, i.e., respectively, the third quartile Q3 plus 1.5 times the inter-quartile region |Q3 − Q1|, and the quantile of one minus the Quasi-Monte Carlo discrepancy;
9. Transformation-based. MOLL (Keyzer & Sonneveld, 1997) smooths the scores through Friedrichs' mollifier, while YJ (Raymaekers & Rousseeuw, 2021) applies the Yeo-Johnson monotonic transformations.

We apply each threshold estimator to the univariate anomaly scores of one detector at a time. We average the resulting contamination factors over the M detectors and use the average as the final point estimate for each dataset.

Data. We carry out our study on 20 commonly used benchmark datasets and additionally 2 (proprietary) real tasks. The benchmark datasets contain semantically useful anomalies and are widely used in the literature (Campos et al., 2016). The datasets vary in size, number of features, and true contamination factor. The online Supplement provides further details. For the real tasks, our experiments focus on preventing blade icing in wind turbines. We use two public wind turbine datasets, where sensors collect various measurements (e.g., wind speed, power energy, etc.) every 7 seconds for either 8 weeks (turbine 15) or 4 weeks (turbine 21). Following Zhang et al. (2018), we construct feature vectors by taking the average over time segments of one minute.

Evaluation metrics. We use three evaluation metrics to assess the performance of the methods. Contrary to all the threshold estimators, our method estimates the posterior of γ. Therefore, we measure the probabilistic calibration of γGMM's posterior using a QQ-plot, with the x-axis representing the expected probabilities and the y-axis the empirical frequencies. That is, for v ∈ [0, 0.5],

Expected Prob. = P(γ* ∈ [q(0.5 − v), q(0.5 + v)]) = 2v,
Empirical Freq. = |{γ* ∈ [q(0.5 − v), q(0.5 + v)]}| / #experiments,

where q(u) is the quantile at the value u of our distribution, for u ∈ [0, 1], and γ* refers to the true dataset's contamination factor. For evaluating the point estimates of the methods, we use the mean absolute error (MAE) between the method's point estimate and the true value. Finally, we measure the impact of thresholding the scores using the methods' point estimate through the F1 score (Goutte & Gaussier, 2005), as common metrics like the Area Under the ROC curve and the Average Precision are not affected by different thresholds. Specifically, for m = 1, ..., M, we measure the relative deterioration of the F1 score:

F1 deterioration = (F1(f_m, D, γ̂) − F1(f_m, D, γ*)) / F1(f_m, D, γ*),

where we compute the F1 score on the dataset D using the anomaly detector f_m, with either the true value γ* or an estimate γ̂ to threshold the scores. The F1 deterioration of a method is (mostly) negative, and the higher the better.
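A sketch of the thresholding and the deterioration computation, continuing the earlier sketches; the ground-truth labels are assumed to be available for evaluation only:

```python
from sklearn.metrics import f1_score

def f1_at_gamma(scores, y_true, gamma):
    """Flag the top-gamma fraction of scores as anomalies and compute F1."""
    threshold = np.quantile(scores, 1.0 - gamma)
    return f1_score(y_true, (scores >= threshold).astype(int))

def f1_deterioration(scores, y_true, gamma_true, gamma_hat):
    f1_star = f1_at_gamma(scores, y_true, gamma_true)    # oracle contamination
    f1_hat = f1_at_gamma(scores, y_true, gamma_hat)      # estimated contamination
    return (f1_hat - f1_star) / f1_star                  # mostly negative; higher is better
```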
Setup. In the experiments we assume a transductive setting (Campos et al., 2016; Scott & Blanchard, 2008; Toron et al., 2022), where a dataset D is used both for training and testing. This is the typical setting of anomaly detection (Breunig et al., 2000; Schölkopf et al., 2001; Angiulli & Pizzuti, 2002; Liu et al., 2012) because the absence of labels and of patterns (for the anomaly class) avoids overfitting issues. For each dataset, we proceed as follows: (i) use a set of M anomaly detectors to assign the anomaly scores S to each observation in the dataset D; (ii) map each anomaly score s ∈ S to log(s − min(S) + 0.01) and normalize the scores to have mean equal to 0 and standard deviation equal to 1; (iii) either use our method to estimate the contamination factor's posterior and extract the posterior mean as point estimate γ̂, or use one of the threshold estimators to directly obtain a point estimate γ̂ of the contamination factor (see the methods paragraph above); (iv) evaluate the point estimates using the mean absolute error (MAE) between each estimate and the true value γ*; (v) use the contamination factor's point estimate to threshold the anomaly scores of each of the M anomaly detectors f_m (individually); (vi) finally, measure the F1 score and compute the relative deterioration.

Hyperparameters, anomaly detectors and priors. Our method introduces two new hyperparameters: p0 and phigh. We set both of them to 0.01 as default value because extremely high contamination, as well as no anomalies at all, are unlikely events. We experimentally check the impact of these two hyperparameters in Q5.


We use 10 anomaly detectors with different inductive biases (Soenen et al., 2021): KNN (Angiulli & Pizzuti, 2002) assumes that the anomalies are far away from normals, IFOREST (Liu et al., 2012) assumes that the anomalies are easier to isolate, LOF (Breunig et al., 2000) exploits the examples' density, OCSVM (Green & Richardson, 2001) encapsulates the data into a multi-dimensional hypersphere, AE (Chen et al., 2018) and VAE (Kingma & Welling, 2013) use the reconstruction error as the anomaly score function from a deterministic and a probabilistic perspective, respectively, LSCP (Zhao et al., 2019a) is an ensemble method that selects competent detectors locally, HBOS (Goldstein & Dengel, 2012) calculates the degree of anomalousness by building histograms, LODA (Pevnỳ, 2016) is an ensemble of weak detectors that build histograms on randomly generated projected spaces, and COPOD (Li et al., 2020) is a copula-based method. All these methods are implemented in the Python library PyOD (Zhao et al., 2019b).

The threshold estimators are implemented in PYTHRESH (https://github.com/KulikDM/pythresh) with default hyperparameters. Finally, the DPGMM is implemented in SKLEARN: we use the stick-breaking representation (Dunson & Park, 2008), with 100 as the upper bound for K. We set the means' prior to 0, and the covariance matrices' prior to identities of appropriate dimension. We opt for such (in our context) weakly-informative priors because sensible prior knowledge of the DPGMM hyperparameters is hard to come by in practice.

4.2. Experimental Results

Q1. Does our method estimate a sharp and well-calibrated posterior of γ? Figure 2 shows the contamination factor γ's posterior estimated by our method on the 22 datasets. In several cases (e.g., WPBC, Cardio, SpamBase, Wilt and T21), the distribution looks accurate, as γ's true value (blue line) is close to the posterior mean (i.e., the expected value, the green line). On the contrary, some datasets (e.g., Arrhythmia, Shuttle, KDDCup99, Parkinson, Glass) obtain less accurate distributions: although γ's true value sometimes falls in low-density regions (Arrhythmia, Shuttle), in many cases it would be quite likely to sample the true value from our posterior (KDDCup99, Parkinson, Glass), which makes the density still quite reliable.

Figure 3 shows the calibration plot. The posterior is well-calibrated, as it is very close to the dashed black line indicating a perfectly calibrated distribution. The empirical frequencies deviate from the real probabilities by less than 5% (dark grey shade) in more than 76% of the cases, while never deviating by more than 10% (light grey shade).

Q2. How does γGMM compare to the threshold estimators? We take γGMM's posterior mean as our best point estimate of γ and compare this value to the point estimates obtained from the threshold estimators. Figure 4 illustrates the ordered MAE (mean ± std.) between the methods' estimates and the true γ. On average, γGMM obtains a MAE of 0.026, which is 20% lower than the best runner-up MTT and 27% lower than the third best method QMCD (MAE of 0.033 and 0.036, respectively). For each experiment, we rank the methods from the best (position 1, lowest MAE) to the worst (position 22, greatest MAE). Our method has the best average rank (2.13 ± 1.04). Moreover, γGMM ranks first 8 times (≈ 36% of the cases), and 13 times (≈ 60% of the cases) it is in the top two. The next best method, MTT, ranks first in 6 cases with an average rank of 2.30 ± 1.10.

Q3. Does a better contamination estimate improve the anomaly detectors' performance? We use γGMM's posterior mean as a point estimate to measure the F1 score of the anomaly detectors, because sampling from the distribution would not allow a fair comparison against the other methods, which can only provide a point estimate. Moreover, anomaly detectors that fail to rank the samples accurately perform poorly even when using the correct γ. Since our focus is studying the effect of γ, for each dataset D we compare F1 scores only over the detectors that achieve the greatest F1 score when using the true contamination factor γ*, i.e. arg max_{f_m} {F1(f_m, D, γ*)}. The online Supplement contains the list of detectors used for each experiment.

Figure 5 shows the average (± std.) deterioration for each of the methods. On average, γGMM has the best F1 deterioration (−0.117 ± 0.228), which is around 10% better than the runner-up QMCD (−0.131 ± 0.238), and 58% better than the next best KARCH (−0.279 ± 0.248). In 25% of the cases we get a higher F1 score with γGMM than when using the true γ*. This is due to the (still imperfect) rankings made by the detectors, which can achieve better performance with slightly incorrect contamination factors. The online Supplement provides further details on how the methods perform in terms of false alarms and false negatives.

Q4. What is the impact of M on γ's posterior? In the previous experiments, we used M = 10 detectors. We evaluate the effect of M by running all the experiments 10 times with (different) randomly chosen detectors for M = 3, 5, 7. Figure 6 shows that the calibration suffers when using fewer detectors, but already M = 5 lets the method work fairly well. The variance of the results (over repeated experiments) also increases for lower M.

Q5. Impact of the hyperparameters p0 and phigh. We evaluate the impact of p0 and phigh by running the experiments with smaller and larger values than 0.01: we vary, one at a time, p0, phigh ∈ [0.0001, 0.001, 0.05, 0.1] and keep the other set at its default. Figure 7 shows the QQ-plot for p0 (left) and phigh (right).


[Figure 2: posterior densities of γ for the 22 datasets (ALOI, Annthyroid, Arrhythmia, Cardiotocography, Glass, InternetAds, KDDCup99, Lymphography, PageBlocks, Parkinson, PenDigits, Pima, Shuttle, SpamBase, Stamps, T15, T21, WBC, WDBC, WPBC, Waveform, Wilt).]
Figure 2. Illustration of how γGMM estimates γ's posterior distribution (red) on all the 22 datasets. The blue vertical line indicates the true contamination factor, while the green line is the posterior's mean.

[Figure 3]
Figure 3. QQ-plot of γGMM's distribution estimate. The black dashed line illustrates perfect calibration, while shades indicate a deviation of 5% (dark) and 10% (light) from the black line.

[Figure 4]
Figure 4. Average MAE (± std.) of γGMM's sample mean compared to the other methods. Our method has the lowest (best) average, which is 20% lower than the runner-up.

In both cases, smaller hyperparameter values lead to slightly under-estimated expected probabilities. Overall, our method is robust to different values of p0, while phigh affects the calibration slightly more. Comparing the resulting 8 variants of γGMM in terms of MAE, we conclude that the posterior means produce values similar to our default setting, obtaining an MAE that varies from 0.252 (phigh = 0.001, the best) to 0.32 (p0 = 0.0001, the worst).

5. Conclusion

The literature on anomaly detection has focused on unsupervised algorithms, but largely ignored practical challenges in their application. The algorithms are evaluated on performance metrics focusing on the ranking of the samples (e.g., AUC), and the ultimate choice of detecting the actual anomalies by thresholding the predictions is left to the practitioners. They lack good means for thresholding and thus often resort to using labels for such a goal. This largely defeats the point of using unsupervised methods.

We presented the first practical method for estimating the posterior distribution of the contamination factor γ in a completely unsupervised manner. We empirically demonstrated on 22 datasets that our mean estimates effectively solve the question of where to threshold the predictions. We outperform all 21 comparison methods and show that the gap in detection accuracy between our estimate and the ground truth (available for these benchmark datasets) is small.

[Figure 5]
Figure 5. F1 deterioration (mean ± std) for each method, where the higher the better. γGMM ranks as the best method, obtaining a ≈ 10% higher average than QMCD.

[Figure 7]
Figure 7. QQ-plot showing how calibrated γGMM's posterior mean would be if we varied p0 (left) and phigh (right). While p0 does not have a large impact on the method, the empirical frequencies slightly under (over) estimate the expected probabilities for low (high) values of phigh.

[Figure 6]
Figure 6. QQ-plot comparing the calibration curves of γGMM when a different number M of detectors is used. The colored shades report the uncertainty obtained by randomly sampling the detectors from a set of 10 detectors. The plot shows that the higher the number of detectors, the more calibrated the distribution.

Besides solving the practical question of thresholding the predictions, we seek to persuade the anomaly detection community of the usefulness of a fully probabilistic solution to the problem. Especially in unsupervised settings, it would be completely unreasonable to expect that the contamination factor could be identified exactly; rather, we need to characterize its uncertainty. However, we are not aware of any previous works even attempting this. As shown in Fig. 2, the posterior distribution of γ may not only be wide but also multi-modal. Communicating these aspects to the practitioner is critical so that they can e.g. use additional domain knowledge to interpret the alternatives. We showed that our estimates have near-perfect calibration over the broad range of datasets and hence can be relied on in practical use.

On first impression, the success of our method in solving this challenging and seemingly ill-posed problem may seem surprising. However, it can be attributed to a careful choice of strong inductive biases built into the underlying probabilistic model. We argue that all of the following elements are necessary, each substantially contributing to the overall success: (i) representing the data in the space of anomaly detector scores defines a meaning for the dimensions and allows borrowing the inductive biases of arbitrary detector algorithms, (ii) the mixture model encodes a natural clustering assumption for both the normal samples and the anomalies, (iii) the ordering used for determining the final distribution incorporates both the location and shape of the mixture components in a carefully balanced manner, and (iv) the transformation from the ordering to probabilities is robustly parameterized via just two intuitive hyperparameters, enabling use of the same defaults for all cases.

Acknowledgments

This work was done during LP's research visit to the University of Helsinki, funded by the Gustave Boël - Sofina Fellowship (grant V407821N). Moreover, this work is supported by (LP) the FWO-Vlaanderen aspirant grant 1166222N and the Flemish government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" program, (PB) the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2075 - 390740016, and (AK) the Academy of Finland (grants 313125 and 336019), the Flagship program Finnish Center for Artificial Intelligence (FCAI), and the Finnish-American Research and Innovation Accelerator (FARIA).

9
Estimating the Contamination Factor’s Distribution in Unsupervised Anomaly Detection

References

Afsari, B. Riemannian L^p center of mass: existence, uniqueness, and convexity. Proceedings of the American Mathematical Society, 139(2):655–673, 2011.

Aggarwal, C. C. An introduction to outlier analysis. In Outlier Analysis, pp. 1–34. Springer, 2017.

Alrawashdeh, M. J. An adjusted Grubbs' and Generalized Extreme Studentized Deviation. Demonstratio Mathematica, 54(1):548–557, 2021.

Amagata, D., Onizuka, M., and Hara, T. Fast and exact outlier detection in metric spaces: a proximity graph-based approach. In Proceedings of the 2021 International Conference on Management of Data, pp. 36–48, 2021.

Angiulli, F. and Pizzuti, C. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 15–27. Springer, 2002.

Archana, N. and Pawar, S. Periodicity Detection of Outlier Sequences using Constraint Based Pattern Tree with MAD. International Journal of Advanced Studies in Computers, Science and Engineering, 4(6):34, 2015.

Bagdonavičius, V. and Petkevičius, L. Multiple outlier detection tests for parametric models. Mathematics, 8(12):2156, 2020.

Bardet, J.-M. and Dimby, S.-F. A new non-parametric detector of univariate outliers for distributions with unbounded support. Extremes, 20(4):751–775, 2017.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational Inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Bol'shev, L. and Ubaidullaeva, M. Chauvenet's test in the classical theory of errors. Theory of Probability & Its Applications, 19(4):683–692, 1975.

Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104, 2000.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

Bürkner, P.-C. and Vuorre, M. Ordinal regression models in psychology: A tutorial. Advances in Methods and Practices in Psychological Science, 2(1):77–101, 2019.

Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., Assent, I., and Houle, M. E. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 30(4):891–927, 2016.

Chandola, V., Banerjee, A., and Kumar, V. Anomaly Detection: A survey. ACM Computing Surveys (CSUR), 41(3):1–58, 2009.

Chen, Z., Yeo, C. K., Lee, B. S., and Lau, C. T. Autoencoder-based network Anomaly Detection. In 2018 Wireless Telecommunications Symposium (WTS), pp. 1–5. IEEE, 2018.

Coin, D. Testing normality in the presence of outliers. Statistical Methods and Applications, 17(1):3–12, 2008.

Dunson, D. B. and Park, J.-H. Kernel Stick-Breaking processes. Biometrika, 95(2):307–323, 2008.

Emmott, A., Das, S., Dietterich, T., Fern, A., and Wong, W.-K. A meta-analysis of the Anomaly Detection problem. arXiv preprint arXiv:1503.01158, 2015.

Ferguson, T. S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pp. 209–230, 1973.

Fourure, D., Javaid, M. U., Posocco, N., and Tihon, S. Anomaly Detection: how to artificially increase your F1-score with a biased evaluation protocol. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 3–18. Springer, 2021.

Friendly, M., Monette, G., and Fox, J. Elliptical insights: understanding statistical methods through elliptical geometry. Statistical Science, 28(1):1–39, 2013.

Goldstein, M. and Dengel, A. Histogram-based outlier score (HBOS): A fast unsupervised Anomaly Detection algorithm. KI-2012: Poster and Demo Track, 9, 2012.

Goldstein, M. and Uchida, S. A comparative evaluation of unsupervised Anomaly Detection algorithms for multivariate data. PloS One, 11(4):e0152173, 2016.

Görür, D. and Rasmussen, E. C. Dirichlet Process Gaussian Mixture Models: Choice of the base distribution. Journal of Computer Science and Technology, 25(4):653–664, 2010.

Goutte, C. and Gaussier, E. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In Advances in Information Retrieval: 27th European Conference on IR Research (ECIR 2005), pp. 345–359. Springer, 2005.

Green, P. J. and Richardson, S. Modelling heterogeneity with and without the Dirichlet Process. Scandinavian Journal of Statistics, 28(2):355–375, 2001.

Han, S., Hu, X., Huang, H., Jiang, M., and Zhao, Y. ADBench: Anomaly Detection Benchmark. arXiv preprint arXiv:2206.09426, 2022.

Hashemi, N., German, E. V., Ramirez, J. P., and Ruths, J. Filtering approaches for dealing with noise in Anomaly Detection. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 5356–5361. IEEE, 2019.

Heard, N. A., Weston, D. J., Platanioti, K., and Hand, D. J. Bayesian Anomaly Detection methods for social networks. The Annals of Applied Statistics, 4, 2010.

Hou, Y., He, R., Dong, J., Yang, Y., and Ma, W. IoT Anomaly Detection Based on Autoencoder and Bayesian Gaussian Mixture Model. Electronics, 11(20):3287, 2022.

Iouchtchenko, D., Raymond, N., Roy, P.-N., and Nooijen, M. Deterministic and quasi-random sampling of optimized Gaussian Mixture distributions for Vibronic Monte Carlo. arXiv preprint arXiv:1912.11594, 2019.

Jacobson, A., Kavan, L., and Sorkine-Hornung, O. Robust inside-outside segmentation using generalized winding numbers. ACM Transactions on Graphics (TOG), 32(4):1–12, 2013.

Keyzer, M. A. and Sonneveld, B. Using the mollifier method to characterize datasets and models: the case of the universal soil loss equation. ITC Journal, 3(4):263–272, 1997.

Kingma, D. P. and Welling, M. Auto-encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Li, Z., Zhao, Y., Botta, N., Ionescu, C., and Hu, X. COPOD: copula-based outlier detection. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 1118–1123. IEEE, 2020.

Lin, J. On the Dirichlet distribution. Department of Mathematics and Statistics, Queen's University, 2016.

Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation-based Anomaly Detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1):1–39, 2012.

Malaiya, R. K., Kwon, D., Kim, J., Suh, S. C., Kim, H., and Kim, I. An empirical evaluation of Deep Learning for Network Anomaly Detection. In 2018 International Conference on Computing, Networking and Communications (ICNC), pp. 893–898. IEEE, 2018.

Martí, L., Sanchez-Pi, N., Molina, J. M., and Garcia, A. C. B. Anomaly Detection based on sensor data in petroleum industry applications. Sensors, 15(2):2774–2797, 2015.

Martin, M. A. and Roberts, S. An evaluation of bootstrap methods for outlier detection in least squares regression. Journal of Applied Statistics, 33(7):703–720, 2006.

Maxion, R. A. and Tan, K. M. Benchmarking anomaly-based detection systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN 2000), pp. 623–630. IEEE, 2000.

Neal, R. M. Bayesian Mixture Modeling. In Maximum Entropy and Bayesian Methods, pp. 197–211. Springer, 1992.

Nydick, S. W. The Wishart and inverse Wishart distributions. Electronic Journal of Statistics, 6:1–19, 2012.

Perini, L., Vercruyssen, V., and Davis, J. Class prior estimation in active positive and unlabeled learning. In Proceedings of the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020), pp. 2915–2921, 2020a.

Perini, L., Vercruyssen, V., and Davis, J. Quantifying the confidence of anomaly detectors in their example-wise predictions. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 227–243. Springer, 2020b.

Perini, L., Vercruyssen, V., and Davis, J. Transferring the Contamination Factor between Anomaly Detection Domains by Shape Similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 4128–4136, 2022.

Pevnỳ, T. LODA: Lightweight on-line detector of anomalies. Machine Learning, 102(2):275–304, 2016.

Qi, Z., Jiang, D., and Chen, X. Iterative gradient descent for outlier detection. International Journal of Wavelets, Multiresolution and Information Processing, 19(04):2150004, 2021.

Rasmussen, C. The infinite Gaussian Mixture Model. Advances in Neural Information Processing Systems, 12, 1999.

Raymaekers, J. and Rousseeuw, P. J. Transforming variables to central normality. Machine Learning, pp. 1–23, 2021.

Ren, K., Yang, H., Zhao, Y., Chen, W., Xue, M., Miao, H., Huang, S., and Liu, J. A robust AUC maximization framework with simultaneous outlier detection and feature selection for positive-unlabeled classification. IEEE Transactions on Neural Networks and Learning Systems, 30(10):3072–3083, 2018.

Rengasamy, D., Rothwell, B. C., and Figueredo, G. P. Towards a more reliable interpretation of machine learning outputs for safety-critical systems using feature importance fusion. Applied Sciences, 11(24):11854, 2021.

Roberts, E., Bassett, B. A., and Lochner, M. Bayesian Anomaly Detection and Classification. arXiv preprint arXiv:1902.08627, 2019.

Roberts, S. J., Husmeier, D., Rezek, I., and Penny, W. Bayesian approaches to Gaussian Mixture Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1133–1142, 1998.

Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

Scott, C. and Blanchard, G. Transductive Anomaly Detection. Technical report, http://www.eecs.umich.edu/cscott, 2008.

Shen, Y. and Cooper, G. A new prior for Bayesian Anomaly Detection. Methods of Information in Medicine, 49(01):44–53, 2010.

Soenen, J., Van Wolputte, E., Perini, L., Vercruyssen, V., Meert, W., Davis, J., and Blockeel, H. The effect of hyperparameter tuning on the comparative evaluation of unsupervised Anomaly Detection methods. In Proceedings of the KDD, volume 21, pp. 1–9, 2021.

Thanammal, K., Vijayalakshmi, R., Arumugaperumal, S., and Jayasudha, J. Effective Histogram Thresholding Techniques for Natural Images Using Segmentation. Journal of Image and Graphics, 2(2):113–116, 2014.

Toron, N., Mourão-Miranda, J., and Shawe-Taylor, J. TransductGAN: a Transductive Adversarial Model for Novelty Detection. arXiv e-prints, pp. arXiv–2203, 2022.

Yan, W. and Yu, L. On accurate and reliable Anomaly Detection for gas turbine combustors: A deep learning approach. arXiv preprint arXiv:1908.09238, 2019.

Zaher, A., McArthur, S., Infield, D., and Patel, Y. Online wind turbine fault detection through automated SCADA data analysis. Wind Energy: An International Journal for Progress and Applications in Wind Power Conversion Technology, 12(6):574–593, 2009.

Zhang, L., Liu, K., Wang, Y., and Omariba, Z. B. Ice detection model of wind turbine blades based on Random Forest classifier. Energies, 11(10):2548, 2018.

Zhao, Y., Nasrullah, Z., Hryniewicki, M. K., and Li, Z. LSCP: Locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 585–593. SIAM, 2019a.

Zhao, Y., Nasrullah, Z., and Li, Z. PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of Machine Learning Research, 20:1–7, 2019b.

Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., and Chen, H. Deep Autoencoding Gaussian Mixture Model for unsupervised Anomaly Detection. In International Conference on Learning Representations, 2018.
