Feature Bagging for Outlier Detection
Conference Paper · August 2005
DOI: 10.1145/1081870.1081891 · Source: DBLP


Research Track Paper

Feature Bagging for Outlier Detection


Aleksandar Lazarevic
United Technologies Research Center
411 Silver Lane, MS 129-15
East Hartford, CT 06108, USA
1-860-610-7560
[email protected]

Vipin Kumar
Department of Computer Science, University of Minnesota
200 Union Street SE
Minneapolis, MN 55455, USA
1-612-626-8704
[email protected]

ABSTRACT
Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel feature bagging approach for detecting outliers in very large, high-dimensional and noisy databases is proposed. It combines results from multiple outlier detection algorithms that are applied using different sets of features. Every outlier detection algorithm uses a small subset of features that are randomly selected from the original feature set. As a result, each outlier detector identifies different outliers, and thus assigns to all data records outlier scores that correspond to their probability of being outliers. The outlier scores computed by the individual outlier detection algorithms are then combined in order to find better quality outliers. Experiments performed on several synthetic and real-life data sets show that the proposed methods for combining outputs from multiple outlier detection algorithms provide non-trivial improvements over the base algorithm.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications (data mining, scientific databases, spatial databases)

General Terms
Algorithms, Performance, Design, Experimentation.

Keywords
Outlier detection, bagging, feature subsets, integration, detection rate, false alarm.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'05, August 21-24, 2005, Chicago, Illinois, USA
Copyright 2005 ACM 1-59593-135-X/05/0008...$5.00.

1. INTRODUCTION
The explosion of very large databases and the World Wide Web has created extraordinary opportunities for monitoring, analyzing and predicting global economic, geographic, demographic, medical, political and other processes in the world. However, despite the enormous amount of data available, particular events of interest are still quite rare. These rare events, often called outliers or anomalies, are defined as events that occur very infrequently (their frequency ranges from 5% to less than 0.01%, depending on the application). Detection of outliers (rare events) has recently gained a lot of attention in many domains, ranging from detecting fraudulent transactions and intrusion detection to direct marketing and medical diagnostics. For example, in the network intrusion detection domain, the number of cyber attacks on the network is typically a very small fraction of the total network traffic. In medical databases, when classifying the pixels in mammogram images as cancerous or not, abnormal (cancerous) pixels represent only a very small fraction of the entire image. Among all users that visit an e-commerce web site, those that actually purchase are quite rare; for example, less than 2% of all people who visit Amazon.com's website make a purchase, and this is much higher than the industry average. Although outliers (rare events) are by definition infrequent, in each of these examples their importance is quite high compared to other events, making their detection extremely important.

The problem of detecting outliers (rare events) has been variously called in different research communities: novelty detection [23], chance discovery [24], outlier/anomaly detection [3, 5, 10, 19, 27, 36], exception mining [29], mining rare classes [11, 16-18], etc. Data mining techniques that have been developed for this problem are based on both supervised and unsupervised learning. Supervised learning methods typically build a prediction model for rare events based on labeled data (the training set) and use it to classify each event [11, 16, 18]. The major drawbacks of supervised data mining techniques include (1) the necessity to have labeled data, which can be extremely time consuming to obtain for real-life applications, and (2) the inability to detect new types of rare events. On the other hand, unsupervised learning methods typically do not require labeled data, and detect outliers (rare events) as data points that are very different from the normal (majority) data according to some measure [5]. These methods are typically called outlier/anomaly detection techniques, and their success depends on the choice of similarity measures, feature selection and weighting, etc. Outlier detection algorithms can detect new types of rare events as deviations from normal behavior, but on the other hand they suffer from a possibly high rate of false positives, primarily because previously unseen (yet normal) data are also recognized as outliers/anomalies, and hence flagged as interesting. In this paper, we focus on unsupervised methods for outlier detection.

Many outlier detection algorithms [3, 10, 19, 27, 31] attempt to detect outliers by computing distances in the full-dimensional space. However, in very high-dimensional spaces, the data is very sparse and the concept of similarity may not be meaningful any-


more [3, 6]. In fact, due to the sparse nature of distance distributions in high-dimensional spaces, the distances between any pair of data records may become quite similar [6]. Thus, using the notion of similarity in high-dimensional spaces, each data record may be considered a potential outlier. It has been shown recently that by examining the behavior of the data in subspaces, it is possible to develop more effective algorithms for cluster discovery [28] and similarity search in high-dimensional spaces [1, 2, 4]. It has been shown that this is also true for the problem of outlier detection [3], since in many applications only a subset of the attributes is useful for detecting anomalous behavior. In the example shown in Fig. 1, data records A and B can be seen as outliers only when certain two dimensions are selected (in Fig. 1b data record A is seen as an outlier, in Figure 1c data record B is observed as an outlier, and in Figure 1d both data records A and B may be detected as outliers), while in other two-dimensional projections they show average behavior (Fig. 1a) [3]. In addition, when a significant number of features in a database is considered noisy, finding outliers in all dimensions typically does not result in effective detection of outliers, while at the same time it is difficult to identify the few relevant dimensions where the outliers may be observed.

Furthermore, it is well known in machine learning that ensembles of classifiers can be effective in improving overall prediction performance. These combining techniques typically manipulate the training data patterns single classifiers use (e.g. bagging [9], boosting [14]) or the class labels (e.g. ECOC [20]). In general, an ensemble of classifiers must be both diverse and accurate in order to improve prediction of the whole. In addition to classifiers' accuracy, diversity is also required to ensure that all the classifiers do not make the same errors. However, it has been shown that standard combining methods (e.g. bagging) do not improve the prediction performance of simple local classifiers (e.g. k-Nearest Neighbor), due to correlated predictions across the outputs from multiple combined classifiers [9, 20] and their low sensitivity to data perturbation. Nevertheless, local classifiers are extremely sensitive to the selection of features that are used in the learning process, and the predictions of their ensembles can be decorrelated by selecting different feature representations (e.g., different sets of features) [6, 25]. Since many outlier detection techniques that compute full-dimensional distances are also local in their nature, they are also sensitive to the selection of features used in the distance computation. In addition, the presence of noisy and irrelevant features can significantly degrade the performance of outlier detection.

In this paper, we propose a novel feature bagging framework for combining predictions from multiple outlier detection algorithms when detecting outliers in high-dimensional and noisy data sets. Unlike the standard bagging approach, where the classification/regression models that are combined use randomly sampled data distributions, in this approach outlier detection algorithms are combined and their diversity is improved by sampling random subsets of features from the original feature set. Due to the aforementioned sensitivity of outlier detection algorithms to the selection of features used in the distance computation, each outlier detector identifies different outliers and assigns different outlier scores to data records. The outlier scores are then combined in order to find better quality outliers than those identified by a single outlier detection algorithm.

It is important to note that the proposed combining framework can be applied to any set of outlier detection algorithms, or even to a set of different outlier detection algorithms. Our experimental results on synthetic and real-life data sets have shown that combining outlier detection algorithms provides non-trivial improvements over the base algorithm.

2. BACKGROUND AND RELATED WORK
Outlier detection algorithms are typically evaluated using the detection rate, the false alarm rate, and the ROC curves [26]. In order to define these metrics, let's look at a confusion matrix, shown in Table 1. In the outlier detection problem, assuming class "C" is the outlier or the rare class of interest, and "NC" is the normal (majority) class, there are four possible outcomes when detecting outliers (class "C"), namely true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN).
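To make these evaluation terms concrete, here is a small illustration of our own (not code from the paper): the detection rate and false alarm rate computed from the confusion matrix entries, plus the trapezoidal-rule AUC that the authors use later in Section 5.

```python
import numpy as np

def detection_rate(tp, fn):
    """TP / (TP + FN): fraction of true outliers that are caught."""
    return tp / (tp + fn)

def false_alarm_rate(fp, tn):
    """FP / (FP + TN): fraction of normal records flagged as outliers."""
    return fp / (fp + tn)

def auc(false_alarm, detection):
    """Area under the ROC curve by the trapezoidal rule; the points are
    assumed sorted by increasing false alarm rate."""
    fa = np.asarray(false_alarm, dtype=float)
    dr = np.asarray(detection, dtype=float)
    return float(np.sum(np.diff(fa) * (dr[1:] + dr[:-1]) / 2.0))

print(detection_rate(90, 10))                  # 0.9
print(false_alarm_rate(5, 95))                 # 0.05
print(auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))   # 1.0 for the ideal ROC curve
```

The last call reproduces the ideal case described below: a curve that jumps to 100% detection at 0% false alarms has an AUC of exactly 1.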

[Figure 1 (plots omitted): four two-dimensional projections a)-d) of the same data set, with data records A and B marked in each.]

Figure 1. Different two-dimensional projections of data space reveal different set of outliers or may not reveal outliers at all.
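The loss of distance contrast that motivates Figure 1 is easy to reproduce numerically. The following sketch (our illustration, using uniform random data rather than the paper's data) computes the relative contrast (max - min) / min over all pairwise Euclidean distances, which shrinks as dimensionality grows:

```python
import numpy as np

def relative_contrast(n, d, seed=0):
    """(max - min) / min over all pairwise Euclidean distances of n
    uniform random points in the unit cube [0, 1]^d."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    dist = np.sqrt(d2[np.triu_indices(n, k=1)])   # all n*(n-1)/2 distances
    return (dist.max() - dist.min()) / dist.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:4d}  contrast={relative_contrast(200, d):.2f}")
```

With a few hundred points, the contrast drops sharply between d = 2 and d = 1000, which is exactly why full-dimensional distance-based detectors lose discriminative power in high dimensions.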


Table 1. Confusion matrix defines four possible scenarios when classifying class "C"

                             Predicted Outliers - Class C    Predicted Normal - Class NC
Actual Outliers - Class C    True Positives (TP)             False Negatives (FN)
Actual Normal - Class NC     False Positives (FP)            True Negatives (TN)

From Table 1, the detection rate and false alarm rate may be defined as follows:

Detection rate = TP / (TP + FN)
False alarm rate = FP / (FP + TN)

The detection rate gives information about the number of correctly identified outliers, while the false alarm rate reports the number of normal data records (class NC) misclassified as outliers. The ROC curve represents the trade-off between the detection rate and the false alarm rate, and is typically shown on a 2-D graph (Fig. 2), where the false alarm rate and detection rate are plotted on the x-axis and y-axis, respectively. The ideal ROC curve has a 0% false alarm rate together with a 100% detection rate (Figure 2). However, the ideal ROC curve is hardly ever achieved in practice, and therefore researchers typically compute the detection rate for different false alarm rates and present the results as ROC curves. Very often, the area under the curve (AUC) is also used to measure the performance of an outlier detection algorithm. The AUC of a specific algorithm is defined as the surface area under its ROC curve. The AUC of the ideal ROC curve is typically set to 1, while the AUCs of "less than perfect" outlier detection algorithms are less than 1. In Figure 2, the shaded area corresponds to the AUC of the lowest ROC curve.

[Figure 2 (plot omitted): ROC curves for different outlier detection techniques, with the false alarm rate on the x-axis, the detection rate on the y-axis, the ideal ROC curve in the upper left corner, and the AUC shown as the shaded area under the lowest curve.]
Figure 2. The ROC curves for different detection algorithms

Most outlier detection techniques can be categorized into four groups: (1) statistical approaches, (2) distance-based approaches, (3) profiling methods and (4) model-based approaches. In statistical techniques [5, 7, 12], the data points are typically modeled using a stochastic distribution, and points are determined to be outliers depending on their relationship with this model. However, most statistical approaches have limitations in higher dimensionality, since it becomes increasingly difficult and inaccurate to estimate the multidimensional distributions of the data points [3]. Distance-based approaches [3, 10, 19, 27, 35, 37] attempt to overcome the limitations of statistical techniques and detect outliers by computing distances among points. Several recently proposed distance-based outlier detection algorithms are based on (1) computing the full-dimensional distances of points from one another using all the available features [19, 27] or only feature projections [3], and (2) computing the densities of local neighborhoods [10]. In addition, a few clustering-based techniques have also been used to detect outliers, either as by-products of the clustering algorithms (points that do not belong to clusters) [2, 31] or as clusters that are significantly smaller than others [13]. In profiling methods, profiles of normal behavior are built using different data mining techniques or heuristic-based approaches, and deviations from them are considered intrusions. Finally, model-based approaches usually first characterize the normal behavior using some predictive models (e.g. replicator neural networks [15] or unsupervised support vector machines [13, 21]), and then detect outliers as deviations from the learned model.

On the other hand, extensive research has been devoted to classifier ensembles in recent years, and numerous techniques have been proposed in the literature for combining classification algorithms [9, 11, 14, 17, 20]. However, it is important to note here that the problem of combining outlier detection algorithms is not exactly the same as the problem of classifier ensembles, for several reasons. First, in classifier ensembles, classification algorithms deal with combining discrete outputs (class labels), typically using different types of voting techniques. In combining outlier detection algorithms, the outlier scores or rankings of the algorithms are combined instead of class labels, although some classifier ensembles also combine rankings (or class probability estimates) from single classifiers through averaging. Second, classifiers that are combined typically have complete knowledge of the training data records and their labels (supervised learning), while outlier detection algorithms typically deal only with data records without any labels (unsupervised learning). However, some classifier ensembles that do not use class labels effectively (e.g. bagging) are very similar to combining outlier detection algorithms. Finally, certain classifier ensembles (e.g. boosting [14]) can control the combining process by observing the error rate, which is not possible when combining outlier detection algorithms, since the label is not given and it is not known in advance which data records are really outliers.

3. OUTLIER DETECTION TECHNIQUES
The outlier detection algorithms that we utilize in this study are based on computing the full-dimensional distances of the points from one another, as well as on computing the densities of local neighborhoods. In our previous work [21], we have experimented with numerous outlier detection algorithms for the problem of network intrusion detection, and we have concluded that the density-based outlier detection approach (e.g. LOF) typically achieved the best prediction performance. Therefore, in this study, we have chosen the LOF approach to illustrate our findings.

3.1 Density Based Local Outlier Factor (LOF) Detection Approach
The main idea of this method [10] is to assign to each data example a degree of being an outlier. This degree is called the local outlier factor (LOF) of the data example. Data points with high LOF have


more sparse neighborhoods and typically represent stronger outliers, unlike data points belonging to dense clusters, which usually tend to have lower LOF values.

To illustrate the advantages of the LOF approach over the simple nearest neighbor approach, consider the simple two-dimensional data set given in Figure 3. It is apparent that the density of cluster C2 is significantly higher than the density of cluster C1. Due to the low density of cluster C1, for every example p3 inside cluster C1, the distance between p3 and its nearest neighbor is similar to the distance between the example p2 and its nearest neighbor from cluster C2, so the example p2 will not be considered an outlier by the simple nearest neighbor (NN) scheme. On the other hand, the LOF approach is able to capture the example p2 as an outlier, due to the fact that it considers the density around the points. The example p1, however, may be detected as an outlier using both the NN and LOF approaches, since it is too distant from both clusters.

[Figure 3 (plot omitted): a sparse cluster C1 containing example p3, a dense cluster C2 with nearby example p2, and an isolated example p1 far from both clusters.]
Figure 3. Advantages of the LOF approach

4. COMBINING OUTLIER DETECTION OUTPUTS
We propose two novel techniques for combining outlier detection algorithms. Their general framework is shown in Fig. 4. The procedure for combining outlier detection techniques proceeds in a series of T rounds, although these rounds may be run in parallel for faster execution. In every round t, the outlier detection algorithm is called and presented with a different set of features Ft that is used in the distance computation. The set of features Ft is randomly selected from the original data set, such that the number of features in Ft is also randomly chosen between d/2 and (d-1), where d is the number of features in the original data set. Once the number of features Nt in Ft is selected, the Nt features are randomly selected without replacement from the original feature set.

As a result, every outlier detection algorithm outputs a different outlier score vector ASt that reflects the probability of each data record from the data set S being an outlier. For example, if ASt(i) > ASt(j), data record xi has a higher probability of being an outlier than data record xj. At the end of the procedure, after T rounds, there are T outlier score vectors, each corresponding to a single outlier detection algorithm. The function COMBINE (Figure 4) is then used to coalesce these T outlier score vectors ASt, t = 1, …, T into a unique anomaly score vector ASFINAL, which is finally used to assign a final probability of being an outlier to every data record from the data set.

• Given: Set S = {(x1, y1), …, (xm, ym)}, xi ∈ X^d, with labels yi ∈ Y = {C, NC}, where C corresponds to outliers, NC corresponds to a normal class, and d corresponds to the dimensionality (number of features) of vector X.
• Normalize the data set S
• For t = 1, 2, 3, 4, …, T:
  1. Randomly choose the size of the feature subset Nt from a uniform distribution between ⌊d/2⌋ and (d-1)
  2. Randomly pick, without replacement, Nt features to create a feature subset Ft
  3. Apply outlier detection algorithm Ot by employing the feature subset Ft
  4. The output of the outlier detection algorithm Ot is the anomaly score vector ASt
• Combine the anomaly score vectors ASt and output a final anomaly score vector ASFINAL as:
  ASFINAL = COMBINE(ASt), t = 1, …, T

Figure 4. The general framework for combining outlier detection techniques

The problem of combining outlier score vectors is conceptually quite similar to the problem of meta search engines [32, 33, 34], where different rankings returned by individual search engines are combined in order to provide the pages that are most relevant to the search string. In both problems, there is no label that helps to understand how relevant the search results are, and the rank of results from individual algorithms is important in the combining process, since it gives the notion of result relevance. Motivated by several approaches used in meta search engines, in this paper we explore two variants of the function COMBINE that integrates the outputs of multiple outlier detection algorithms. The first variant, denoted the Breadth-First approach, is presented in Figure 5.

• Given: ASt, t = 1, …, T, and m, the size of the data set S and of each vector ASt
• Sort all outlier score vectors ASt into the vectors SASt and return the indices Indt of the sorted vectors, such that SASt(1) has the highest score and Indt(1) is the index of the data record in S with the highest score SASt(1)
• Let ASFINAL and IndFINAL be empty vectors
• For i = 1 to m:
  • For t = 1 to T:
    • If the index Indt(i) of the data record that is ranked at the i-th place by the t-th outlier detection algorithm, and that has the outlier score SASt(i), does not exist in the vector IndFINAL:
      • Insert Indt(i) at the end of the vector IndFINAL
      • Insert SASt(i) at the end of the vector ASFINAL
• Return IndFINAL and ASFINAL

Figure 5. The Breadth-First scheme for combining outlier detection scores
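The Figure 4 framework can be sketched in a few lines of Python. The code below is our own illustration: knn_score (distance to the k-th nearest neighbour) is a simple stand-in for the base detector Ot, since the paper's experiments use LOF, and the z-score normalization is an assumption, as the paper does not specify the normalization scheme.

```python
import numpy as np

def knn_score(X, k=10):
    """Stand-in detector: distance to the k-th nearest neighbour
    (the paper itself uses LOF as the base detector)."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    d2.sort(axis=1)                      # column 0 is the point itself
    return np.sqrt(d2[:, k])

def feature_bagging(X, detector=knn_score, T=10, seed=0):
    """Figure 4 sketch: run the detector T times, each time on a random
    feature subset of size N_t drawn uniformly from [d/2, d-1]."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Normalization scheme is unspecified in the paper; z-scoring assumed.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    score_vectors = []
    for _ in range(T):
        n_t = int(rng.integers(d // 2, d))            # N_t in [d/2, d-1]
        features = rng.choice(d, size=n_t, replace=False)
        score_vectors.append(detector(X[:, features]))  # AS_t
    return score_vectors
```

Each element of the returned list is one score vector ASt; any callable mapping a data matrix to one score per record can replace knn_score, for example an LOF implementation such as scikit-learn's LocalOutlierFactor.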

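The two variants of COMBINE can likewise be sketched (our code; breadth_first follows the pseudocode in Figure 5, while cumulative_sum implements the element-wise summation of the second variant described below):

```python
import numpy as np

def breadth_first(score_vectors):
    """Figure 5 sketch: visit rank positions i = 1..m; at each position
    take every detector's i-th ranked record, skipping records that have
    already been inserted."""
    AS = np.asarray(score_vectors)            # shape (T, m)
    order = np.argsort(-AS, axis=1)           # per-detector ranking, best first
    ind_final, as_final, seen = [], [], set()
    for i in range(AS.shape[1]):              # rank position i
        for t in range(AS.shape[0]):          # detector t
            idx = int(order[t, i])
            if idx not in seen:
                seen.add(idx)
                ind_final.append(idx)
                as_final.append(float(AS[t, idx]))
    return ind_final, as_final

def cumulative_sum(score_vectors):
    """Cumulative Sum variant: AS_FINAL(i) = sum over t of AS_t(i)."""
    return np.asarray(score_vectors).sum(axis=0)
```

For example, given scores [[0.9, 0.1, 0.5], [0.2, 0.8, 0.4]] from two detectors, breadth_first returns the indices [0, 1, 2]: record 0 is taken from the first detector's top rank, record 1 from the second detector's top rank, and record 2 on the second pass.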

Algorithm 1    Algorithm 2    …    Algorithm t
AS1,1          AS2,1               ASt,1
AS1,2          AS2,2               ASt,2
AS1,3          AS2,3               ASt,3
AS1,4          AS2,4               ASt,4
…              …                   …
AS1,k          AS2,k               ASt,k

Figure 6. Illustration of the Breadth-First approach for combining outlier detection scores.

The Breadth-First combining method first sorts all the outlier detection vectors ASt into the sorted vectors SASt and returns the indices Indt that give the correspondence between the elements of the sorted vectors and the original elements of the score vectors. For example, Indt(1) = k means that in the t-th outlier detection score vector ASt, data record xk has the highest anomaly score ASt(k). Thus in Figure 6, AS1,1 corresponds to the data record that is ranked as the most probable outlier by Algorithm 1, AS1,2 corresponds to the data record that is ranked as the second most probable outlier by Algorithm 1, and so on.

After sorting all outlier score vectors ASt, the Breadth-First approach simply takes the data records with the highest anomaly score from all outlier detection algorithms (scores AS1,1, AS2,1, AS3,1, …, ASt,1 in Figure 6) and inserts their indices in the vector IndFINAL, then takes the data records with the second highest anomaly score (scores AS1,2, AS2,2, AS3,2, …, ASt,2 in Figure 6) and appends their indices at the end of the vector IndFINAL, and so on. If the index of the current data record is already in the vector IndFINAL, it is not appended again. At the end of the Breadth-First method, the vector IndFINAL contains the indices of the data records sorted according to their probability of being outliers, and the vector ASFINAL contains these probabilities.

The final results of the Breadth-First method are in general sensitive to the order of the outlier detection algorithms. However, the differences are minor, since variations may happen only within T rankings (T is generally much smaller than the total number of data records), because at every i-th pass we go through the T indices of the data records ranked at the i-th place in the outlier detection vectors.

The second variant of the function COMBINE, denoted the Cumulative Sum approach, is presented in Figure 7.

• Given: ASt, t = 1, …, T, and m, the size of each vector ASt
• Sum all anomaly scores ASt from all T iterations as follows:
  • For i = 1 to m:
    ASFINAL(i) = Σ_{t=1..T} ASt(i)
• Return ASFINAL

Figure 7. The Cumulative Sum approach for combining outlier detection scores

This combining method first creates the final outlier score vector ASFINAL by summing all the outlier score vectors ASt from all T iterations, then sorts the vector ASFINAL, and finally identifies the data records with the highest outlier scores as outliers. For example, data record NC1 in Figure 8 may be ranked as the first outlier by Algorithm 1, ranked as fourth by Algorithm 2, …, and ranked as second by Algorithm t. In the Cumulative Sum approach we sum all the scores that correspond to data record NC1, namely the scores AS1,1, AS2,4, …, and ASt,2, and then sort all data records NCi, i = 1, …, m according to the newly computed score.

Algorithm 1    Algorithm 2    …    Algorithm t
AS1,1 - NC1    AS2,1               ASt,1
AS1,2          AS2,2               ASt,2 - NC1
AS1,3          AS2,3               ASt,3
AS1,4 - NC2    AS2,4 - NC1         ASt,4 - NC2
…              …                   …
AS1,k          AS2,k - NC2         ASt,k

Figure 8. Illustration of the Cumulative Sum approach for combining outlier detection scores

It is important to note that this method is analogous to the ranking method in meta search engines where the ranks are summed, but it is more flexible, since in the ranking method an outlier detected by a single algorithm may not be detected in the final decision, especially if it is ranked low by the other detection algorithms. In the Cumulative Sum approach, on the other hand, an outlier that is detected by a single algorithm may have a very large outlier score, and after all summations are performed it may still have a sufficiently large final outlier score to be detected. This fact is extremely important in scenarios where outliers are visible only in a few dimensions, since in that case it is sufficient to select the relevant features in only a small number of iterations, compute high outlier scores for these feature subsets, and thus cause these outliers to be ranked high in the final score.

5. EXPERIMENTS
Our experiments were performed on several synthetic and real-life data sets, summarized in Table 2. In all our experiments, we have assumed that we have information about the normal behavior (class) in the data set. Therefore, in the first (training) phase, we applied the outlier detection algorithms only to the normal data set (without any outliers) in order to set specific false alarm rates, and in the second (testing) phase, we applied the outlier detection algorithms to test data sets (with all outliers). Using this procedure we can achieve better detection performance than using a completely unsupervised approach.

5.1 Experiments on Synthetic Data Sets
Our first synthetic data set (synthetic-1 in Table 2) has 5100 data records, wherein 5000 data records correspond to normal (majority) behavior, and 100 data records represent outliers. The data set has five original (contributing) features that determine which data records are outliers (Figure 9). Normal behavior (blue points in Figure 9) is modeled as a Gaussian distribution of the five original contributing features, while the outliers (red crosses in Figure 9) are points that are far from the generated Gaussian distribution. We added 5 noisy features in order to test the robustness of the "feature bagging" approach with respect to detection performance.

Our experiments on the synthetic-1 data set were performed using only LOF approaches. The computed ROC curves for this scenario, for the LOF approach and for the Breadth-First and Cumulative Sum approaches employing LOF as the single outlier detection algorithm, are presented in Figure 10.


Table 2. Summary of data sets used in experiments

Dataset              Modifications in the data set        Size     Features (continuous / discrete)   Outliers (rare class records)   Percentage of outliers
Synthetic-1          -                                    5100     5+5 / 0                            100                             1.96%
Synthetic-2          -                                    5050     8 / 0                              50                              0.99%
Satimage             smallest class vs. rest              6435     36 / 0                             626                             9.73%
Coil 2000            -                                    5822     85 / 0                             348                             5.98%
Rooftop              -                                    17829    9 / 0                              781                             4.38%
Lymphography         merged classes 2 & 4 vs. rest        148      18 / 0                             6                               4.05%
Mammography          -                                    11183    6 / 0                              260                             2.32%
KDDCup 1999          U2R vs. normal                       60839    34 / 7                             246                             0.40%
Ann-thyroid          class 1 vs. class 3                  3428     6 / 15                             73                              2.13%
Ann-thyroid          class 2 vs. class 3                  3428     6 / 15                             177                             5.16%
LED                  each class vs. rest                  10000    0 / 7                              ~1000                           ~10%
Letter recognition   each class vs. rest                  6238     617 / 0                            240                             3.85%
Segment              each class vs. rest                  2310     19 / 0                             330                             14.29%
Shuttle              classes 2, 3, 5, 6 & 7 vs. class 1   14500    9 / 0                              2 - 809                         0.014% - 5.58%


Figure 9. Distribution of two contributing features for the synthetic-1 data set (blue points represent normal behavior, red crosses represent outliers).

Figure 10. ROC curves for the single LOF approach and the two combining methods employing the LOF approach, when applied to the synthetic-1 data set with 5 original and 5 noisy features. The number of combined outlier detection algorithms for all data sets was set to 10. The figures are best viewed in color.
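A data set in the spirit of synthetic-1 can be generated with a few lines of code. The generator below is our own approximation: the outlier radii (6 to 10 standard deviations from the Gaussian center) and the range of the noisy features are assumptions, since the paper does not give the exact parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
n_normal, n_outliers, d_real, d_noise = 5000, 100, 5, 5

# Normal (majority) behavior: Gaussian in the 5 contributing features.
normal = rng.normal(0.0, 1.0, size=(n_normal, d_real))

# Outliers: points placed far outside the Gaussian (radii are an assumption).
dirs = rng.normal(size=(n_outliers, d_real))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
outliers = dirs * rng.uniform(6.0, 10.0, size=(n_outliers, 1))

# Append 5 noisy, non-contributing features to every record.
X = np.vstack([normal, outliers])
X = np.hstack([X, rng.uniform(-1.0, 1.0, size=(X.shape[0], d_noise))])
y = np.r_[np.zeros(n_normal), np.ones(n_outliers)]   # 1 marks an outlier

print(X.shape)   # (5100, 10)
```

The labels y are used only for evaluation, mirroring the paper's setting in which the detectors themselves never see labels.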
Analyzing the ROC curves in Figure 10, it can be observed that the LOF approach applied with five original and five noisy features has a much worse ROC curve than the LOF approach that used only the five original features. This is to be expected, since the density computations in the LOF approach are significantly influenced by noisy and/or irrelevant features, and thus LOF performance degrades. However, when the proposed methods for combining outlier detection algorithms are applied to the synthetic-1 data set with five original and five noisy features, it can be observed that they were able to alleviate the effect of the noisy features and to outperform the single LOF approach. Furthermore, the Cumulative Sum combining method has a ROC curve very similar to that of the LOF approach with only the 5 original contributing features. The Breadth-First approach, on the other hand, is slightly worse than the Cumulative Sum, but still better than the LOF approach with both contributing and noisy features. This means that if there are irrelevant features in the data sets, the combining methods are able to decrease the influence of the noisy features on the detection performance. Depending on the number of relevant and irrelevant features, this decrease can vary. Our earlier experiments also show that this decrease is rather small if the number of irrelevant features significantly outnumbers the number of relevant features. To investigate the influence of the noisy features on the detection performance, we created two additional synthetic data sets with 10 and 20 noisy features in addition to the five contributing features. Instead of ROC curves, for these two data sets we have


reported the areas under the curve (AUC), since the AUC allows us to compare all three scenarios more easily. From Table 3, it can be observed that with an increasing number of noisy features, the gap between the single LOF and the combining methods is indeed decreasing. This means that the combining methods can alleviate the influence of the noisy features only up to a certain level. The AUC of the ideal ROC curve corresponds to 1, and it is computed using the trapezoidal rule.

Table 3. AUC (areas under the curve) for the single LOF, Cumulative Sum and Breadth-First approaches depending on the number of noisy features in the data set.

Number of noisy features   Single LOF   Cumulative Sum approach   Breadth-First approach
5                          0.9862       0.9948                    0.9899
10                         0.9745       0.9835                    0.9781
20                         0.9489       0.9547                    0.9501

Our second synthetic data set (synthetic-2) also has 5050 data records, wherein 5000 data records correspond to normal (majority) behavior and 50 data records represent outliers. This data set has 8 features, and all 8 features are responsible for de-

[…]

so the only differences are for the lower false alarm rates. The degradation in the performance of the combining methods compared to the single LOF approach is understandable, since the combining methods do not use all the features in any of the iterations, while at the same time, due to the nature of the generated data set, all the features are important for detecting outliers. However, in real-life scenarios it is hardly ever the case that all the features are relevant for detecting outliers. To check this claim, we also performed experiments on numerous real-life data sets.

5.2 Experiments on Real Life Data Sets
All real-life data sets used in our experiments have been used earlier by other researchers for the problem of detecting rare classes [11, 22, 25, 30]. These data sets are summarized in Table 2. Since rare class analysis is conceptually the same problem as outlier detection, we employed those data sets for the purpose of outlier detection, where we detected the rare classes as outliers. In addition to the data sets reported in Table 2, we have also used several data sets from the UCI repository [8] that do not directly correspond to rare class or outlier detection problems but can be converted into binary problems by taking one small
class (with less than ~10% proportion present in the data set) and
termining the outliers, i.e. the data set does not have any noisy
remaining data records or the biggest remaining class as a second
features. Like in the synthetic-1 data set (see Figure 9), the nor-
class. Therefore, we selected the following data sets for the con-
mal behavior in this data set corresponds to a Gaussian distribu-
version into binary data sets: ann-thyroid, LED, letter recognition,
tion of eight contributing features, while analogously to the first
segment, and shuttle. The same procedure was used earlier [18]
data set the outliers are data points far from the normal behavior.
when experimenting with the rare class learning. Using this tech-
The computed ROC curves for this data set for LOF approach,
nique, we have formed additional 50 data sets. Some of the data
Breadth-First and Cumulative Sum approaches are presented in
sets selected to perform the experiments have both continuous and
Figure 11. Note that ROC curves for the synthetic-2 data set use
discrete features. Since LOF approach is based on computing
different axis scale than ROC curves for the synthetic-1 data set in
distances between data records, measuring a distance between two
order to observe true differences.
discrete (categorical) values is not always straightforward. In our
implementation, for computing distances between data records
1 that have discrete attributes we have used the concept of inverse
document frequency (IDF) already used in outier detection prob-
lems [38], where each value of categorical attribute is represented
0.95 with the inverse frequency of its appearance in the data set.
When performing experiments on COIL 200 [30], mammography
0.9
[11] and rooftop [22] data sets, we did not change any class dis-
tribution. However, in the original lymphography data set [8],
there are four classes, but two of them are quite small (2 and 4
0.85 data records), so we merged them and considered them as outliers
LOF approach
compared to other two large classes (81 and 61 data records).
Breadth First Approach
When performing experiments on KDDCup’99 data set, we se-
Cumulative Sum Approach
0.8 lected to detect the smallest intrusion class (U2R), which had only
246 instances. Since the outliers are detected as deviations from
the normal behavior, we have modified original data set (311029
0.75
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18
data records with five classes) such that the new data set con-
tained only the data records from the normal class (60593 data
Figure 11. ROC curves for single LOF approach and two records) and from the U2R class. In such modified data set, we
combining methods employing LOF approach when applied have tried to detect the U2R class using outlier detection algo-
to the synthetic-2 data set. The number of combined outlier rithms. Finally, for satimage data set we chose the smallest class
detection algorithms for all data sets was set to 10. to represent outliers and collapsed the remaining classes into one
class as was done in [11]. This procedure gave us a skewed
It can be observed from Figure 11 that in the scenario when all 2-class dataset, with 5809 majority class examples and 626 minor-
features that determine the outliers are important, there is a slight ity class examples (outliers). For 50 created binary data sets, we
decrease in detection performance of combining methods. How- have typically selected one of the smallest classes and then con-
ever, this decrease is minor (e.g. for false alarm = 4%, detection verted either the remaining data records or the biggest remaining
rate was decreased approximately only 1% for the breadth first class into the majority class. Therefore, for ann-thyroid data set
approach and only 2% for the cumulative sum approach. For the we have detected classes 1 and classes 2 as outliers vs. the class 3
false alarm of 10% all three methods achieve 100% detection rate,
as the normal (majority) class. Similarly, for the shuttle data set we created five data sets by selecting classes 2, 3, 5, 6 and 7 to be detected as outliers compared to the biggest remaining class 1. For the other real life data sets (LED, letter recognition, and segment), we simply selected each of the classes to be detected as outliers and merged all remaining classes to form the normal (majority) class.

For our experiments performed on the first six real life data sets from Table 2, the computed ROC curves for the LOF, Breadth First and Cumulative Sum approaches are presented in Figure 12. Due to the lack of space, the experimental results for the remaining 50 created binary data sets are presented using areas under the curve (AUC) (Table 4). The computed AUCs for the chess, LED, letter, segment and shuttle data sets have been averaged over all data sets generated from the original data set. For example, there were 26 binary data sets generated from the original letter data set (since there are 26 classes), and the AUCs were averaged over all these 26 data sets when reporting the experimental results in Table 4.

Table 4. AUC (areas under the curve) for the single LOF, Cumulative Sum and Breadth First approaches for 50 real life data sets obtained by converting original data into binary problems.

  Data set                          Single LOF approach    Cumulative Sum approach    Breadth First approach
  ann-thyroid class 1 vs. class 3   0.869                  0.869                      0.856
  ann-thyroid class 2 vs. class 3   0.761                  0.769                      0.753
  LED (average)                     0.699                  0.695                      0.703
  letter (average)                  0.816                  0.820                      0.818
  segment (average)                 0.820                  0.845                      0.825
  shuttle (average)                 0.825                  0.839                      0.834

Analyzing Figure 12 and Table 4, it can be observed that both the Cumulative Sum and the Breadth First combining method outperformed the single LOF outlier detection approach on all real life data sets. The improvements in detection performance were the smallest (approximately 5% in detection rate for a chosen false alarm rate) on the COIL 2000 data set (Figure 12a) and on the satimage data set (Figure 12f). This was probably due to the poor performance of the individual outlier detection algorithms on these two data sets, so combining their outputs could not lead to significant improvements. When detecting outliers on the rooftop data set (Figure 12b), the improvements were slightly better than for the COIL 2000 data set, but again not large, due to the weak performance of the individual outlier detection algorithms. Nevertheless, the improvements in detection rate for false alarm rates ranging from 10% to 50% are not small, varying from 4% to 14%. The greatest enhancements in outlier detection were achieved for the mammography (Figure 12d) and KDD Cup'99 (Figure 12e) data sets. For those data sets, the single outlier detection results had reasonable detection performance, so combining their outputs further improved the overall results. However, when performing experiments on the lymphography data set (Figure 12c), the detection rate of the single LOF approach was already 100% at a 10% false alarm rate, so the combining methods could not improve detection performance very much. In order to illustrate even such a slight improvement of the combining methods for this data set, we report their ROC curves only for small false alarm rates (less than 0.15).

From Table 4, it can be observed that small improvements were also achieved for those binary data sets that were created by taking one small class as the outlier class and the remaining data records as a second class. This can be explained by the fact that the remaining classes merged together to form a single majority class were quite different, so it was not possible to distinguish the separated class from the remaining data. It can also be observed that in the two data sets where the binary data sets were created by taking the small class as the outlier class and the biggest one as the normal class, the improvements of the combining methods were more apparent.

Finally, it can be observed that for all 66 real life data sets used in our experiments, and for all values of the false alarm rate, both combining methods were consistently better than the single LOF approach. The only exceptions are the lymphography data set, the KDDCup'99 data set and certain data sets generated from the LED and letter data sets, where for low false alarm rates (less than 0.05 for the lymphography data set, less than 0.1 for the KDDCup'99 data set and less than 0.2 for the data sets created from the LED and letter data sets) the detection rates of all three approaches were quite similar.

6. CONCLUSIONS
A novel general framework for combining outlier detection algorithms was presented. Experiments on several synthetic and various real life data sets indicate that the proposed combining methods can result in much better detection performance than single outlier detection algorithms. The proposed combining methods successfully utilize the benefits of combining multiple outputs and of diversifying individual predictions through focusing on smaller feature projections. The data sets used in our experiments contained different percentages of outliers, different sizes and different numbers of features, thus providing a diverse test bed and demonstrating the wide capabilities of the proposed framework. The universal nature of the proposed framework allows the combining schemes to be applied to any combination of outlier detection algorithms, thus enhancing their usefulness in real life applications.

Although the performed experiments have provided evidence that the proposed methods can be very successful for the outlier detection task, future work is needed to fully characterize them, especially in very large and high dimensional databases, where new algorithms for combining the outputs from multiple outlier detection algorithms are worth considering. It would also be interesting to examine the influence of changing the data distributions when detecting outliers in every round of the combining methods, employing not only distance-based but also other types of outlier detection approaches.

7. ACKNOWLEDGMENTS
This work was partially supported by Army High Performance Computing Research Center contract number DAAD19-01-2-0014, by ARDA Grant AR/F30602-03-C-0243 and by NSF grant IIS-0308264. The content of the work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute.
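The AUC values reported in Tables 3 and 4 are computed with the trapezoidal rule, as noted in Section 5.1. A minimal sketch of that computation follows; the ROC points used below are illustrative, not values taken from the experiments.

```python
def trapezoidal_auc(false_alarm, detection):
    """Area under a ROC curve via the trapezoidal rule. The two lists
    hold matching (false alarm rate, detection rate) points, sorted by
    increasing false alarm rate and spanning [0, 1]."""
    area = 0.0
    for i in range(1, len(false_alarm)):
        width = false_alarm[i] - false_alarm[i - 1]
        avg_height = (detection[i] + detection[i - 1]) / 2.0
        area += width * avg_height  # one trapezoid per curve segment
    return area

# Illustrative ROC points only; an ideal detector that jumps straight
# to a 100% detection rate yields an AUC of 1.
print(trapezoidal_auc([0.0, 0.1, 0.3, 1.0], [0.0, 0.8, 0.95, 1.0]))  # close to 0.8975
print(trapezoidal_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))             # 1.0
```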
[Figure 12: six panels of ROC curves (detection rate vs. false alarm rate) for the single LOF, Cumulative Sum and Breadth First approaches: (a) COIL 2000, (b) Rooftop, (c) Lymphography, (d) Mammography, (e) KDD Cup 1999, (f) Satimage.]
Figure 12. ROC curves for the single LOF approach and two combining methods employing the LOF approach when applied to all six data sets. The number of combined outlier detection algorithms for all data sets was set to 50, except for the mammography data set, where this number was 10 due to the small number of features (6) in the data set. The figures are best viewed in color.
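Section 5.2 handles categorical attributes by representing each value with the inverse frequency of its appearance in the data set (IDF). The exact formula is not spelled out in the text, so the sketch below is one plausible reading: matching attribute values contribute nothing to the distance, while mismatches contribute the average IDF weight of the two values involved. The attribute names and records are hypothetical.

```python
from collections import Counter

def idf_weights(column):
    """Weight each categorical value by the inverse of its relative
    frequency in the column, so that rare values get large weights."""
    counts = Counter(column)
    n = len(column)
    return {value: n / count for value, count in counts.items()}

def idf_distance(rec_a, rec_b, per_attribute_weights):
    """Distance between two purely categorical records: matching
    attribute values contribute 0; mismatches contribute the average
    IDF weight of the two values being compared."""
    total = 0.0
    for a, b, weights in zip(rec_a, rec_b, per_attribute_weights):
        if a != b:
            total += (weights[a] + weights[b]) / 2.0
    return total

# Hypothetical records with two categorical attributes (protocol, service).
data = [("tcp", "http"), ("tcp", "http"), ("tcp", "ftp"), ("udp", "dns")]
columns = list(zip(*data))
weights = [idf_weights(col) for col in columns]
```

With these weights, two identical records are at distance 0, and a mismatch on a rare value such as "udp" pushes records farther apart than a mismatch on the common "tcp".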

8. REFERENCES
[1] C. Aggarwal, Re-designing distance functions and distance-based applications for high dimensional data, ACM SIGMOD Record, vol. 30, no. 1, pp. 13-18, March 2001.
[2] C. Aggarwal and P. Yu, Finding Generalized Projected Clusters in High Dimensional Spaces, In Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, TX, 70-81, 2000.
[3] C.C. Aggarwal and P. Yu, Outlier Detection for High Dimensional Data, In Proceedings of the ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, May 2001.
[4] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, In Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, 94-105, June 1998.
[5] V. Barnett and T. Lewis, Outliers in Statistical Data, New York, NY, John Wiley and Sons, 1994.
[6] K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft, When is nearest neighbor meaningful?, In Proceedings of the 7th International Conference on Database Theory (ICDT'99), Jerusalem, Israel, 217-235, 1999.
[7] N. Billor, A. Hadi and P. Velleman, BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators, Computational Statistics & Data Analysis, vol. 34, pp. 279-298, 2000.
[8] C. Blake and C. Merz, UCI Repository of machine learning databases, www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[9] L. Breiman, Bagging Predictors, Machine Learning, vol. 24, no. 2, pp. 123-140, August 1996.
[10] M.M. Breunig, H.P. Kriegel, R.T. Ng and J. Sander, LOF: Identifying Density-Based Local Outliers, In Proceedings of the ACM SIGMOD Conference, Dallas, TX, May 2000.
[11] N. Chawla, A. Lazarevic, L. Hall and K. Bowyer, SMOTEBoost: Improving the Prediction of the Minority Class in Boosting, In Proceedings of the Principles of Knowledge Discovery in Databases, PKDD-2003, Cavtat, Croatia, September 2003.
[12] E. Eskin, Anomaly Detection over Noisy Data using Learned Probability Distributions, In Proceedings of the International Conference on Machine Learning, Stanford University, CA, 2000.
[13] E. Eskin, A. Arnold, M. Prerau, L. Portnoy and S. Stolfo, A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data, in Applications of Data Mining in Computer Security, Advances in Information Security, D. Barbara and S. Jajodia, Eds., Boston: Kluwer, 2002.
[14] Y. Freund and R. Schapire, Experiments with a New Boosting Algorithm, In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 325-332, July 1996.
[15] S. Hawkins, H. He, G. Williams and R. Baxter, Outlier Detection Using Replicator Neural Networks, In Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science 2454, Aix-en-Provence, France, 170-180, September 2002.
[16] M. Joshi, R. Agarwal and V. Kumar, PNrule: Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction, In Proceedings of the ACM SIGMOD Conference on Management of Data, Santa Barbara, CA, May 2001.
[17] M. Joshi, R. Agarwal and V. Kumar, Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong?, In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, July 2002.
[18] M. Joshi and V. Kumar, CREDOS: Classification using Ripple Down Structure (A Case for Rare Classes), In Proceedings of the SIAM International Conference on Data Mining, Lake Buena Vista, FL, April 2004.
[19] E. Knorr and R. Ng, Algorithms for Mining Distance-based Outliers in Large Data Sets, In Proceedings of the Very Large Databases (VLDB) Conference, New York City, NY, August 1998.
[20] E. Kong and T. Dietterich, Error-Correcting Output Coding Corrects Bias and Variance, In Proceedings of the 12th International Conference on Machine Learning, San Francisco, CA, 313-321, 1995.
[21] A. Lazarevic, L. Ertoz, A. Ozgur, J. Srivastava and V. Kumar, A comparative study of anomaly detection schemes in network intrusion detection, In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, May 2003.
[22] M. Maloof, P. Langley, T. Binford, R. Nevatia and S. Sage, Improved Rooftop Detection in Aerial Images with Machine Learning, Machine Learning, vol. 53, no. 1-2, pp. 157-191, October-November 2003.
[23] M. Markou and S. Singh, Novelty detection: a review, part 1: statistical approaches, Signal Processing, vol. 83, no. 12, pp. 2481-2497, December 2003.
[24] P. McBurney and Y. Ohsawa, Chance Discovery, Advanced Information Processing, Springer, 2003.
[25] R. Michalski, I. Mozetic, J. Hong and N. Lavrac, The Multi-Purpose Incremental Learning System AQ15 and its Testing Applications to Three Medical Domains, In Proceedings of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA, 1041-1045, 1986.
[26] F. Provost and T. Fawcett, Robust Classification for Imprecise Environments, Machine Learning, vol. 42, pp. 203-231, 2001.
[27] S. Ramaswamy, R. Rastogi and K. Shim, Efficient Algorithms for Mining Outliers from Large Data Sets, In Proceedings of the ACM SIGMOD Conference, Dallas, TX, May 2000.
[28] A. Strehl and J. Ghosh, Cluster ensembles: a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research, vol. 3, pp. 583-617, March 2003.
[29] E. Suzuki and J. Zytkow, Unified Algorithm for Undirected Discovery of Exception Rules, In Proceedings of the Principles of Data Mining and Knowledge Discovery, 4th European Conference, PKDD 2000, Lyon, France, 169-180, September 13-16, 2000.
[30] P. van der Putten and M. van Someren, CoIL Challenge 2000: The Insurance Company Case, Sentient Machine Research, Amsterdam, and Leiden Institute of Advanced Computer Science, Leiden, LIACS Technical Report 2000-09, June 2000.
[31] D. Yu, G. Sheikholeslami and A. Zhang, FindOut: Finding Outliers in Very Large Datasets, The Knowledge and Information Systems (KAIS) Journal, vol. 4, no. 4, October 2002.
[32] A.E. Howe and D. Dreilinger, SavvySearch: A meta-search engine that learns which search engines to query, AI Magazine, vol. 18, no. 2, 1997.
[33] S. Lawrence and C.L. Giles, Inquirus, the NECI meta search engine, In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, 95-105, 1998.
[34] B.U. Oztekin, G. Karypis and V. Kumar, Expert Agreement and Content Based Reranking in a Meta Search Environment using Mearf, In Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002.
[35] S.D. Bay and M. Schwabacher, Mining distance-based outliers in near linear time with randomization and a simple pruning rule, In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, 29-38, 2003.
[36] S. Papadimitriou, H. Kitagawa, P.B. Gibbons and C. Faloutsos, LOCI: Fast Outlier Detection Using the Local Correlation Integral, In Proceedings of the IEEE International Conference on Data Engineering, Bangalore, India, March 2003.
[37] P. Sun and S. Chawla, On Local Spatial Outliers, In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), Brighton, United Kingdom, November 2004.
[38] L. Ertoz, Similarity Measures, PhD dissertation, University of Minnesota, in progress, 2005.