Anomaly Detection in Big Data
Ph.D. Thesis
arXiv:2203.01684v1 [cs.LG] 3 Mar 2022
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Applications to Big Data
  1.3 The Problem Statement and Research Scope
  1.4 Specific Research Contributions
  1.5 Organization of the Thesis
2 Literature Survey
  2.1 Traditional Approaches to Anomaly Detection
    2.1.1 Classification-Based Approaches
    2.1.2 Statistical Approaches
    2.1.3 Clustering-Based Approaches
    2.1.4 Information-Theoretic Approaches
    2.1.5 Spectral-Theory-Based Approaches
  2.2 Modern Approaches to Anomaly Detection
    2.2.1 Non-Parametric Techniques
    2.2.2 Multiple Kernel Learning
    2.2.3 Non-negative Matrix Factorization
    2.2.4 Random Projection
    2.2.5 Ensemble Techniques
  2.3 Relevant Algorithms for Anomaly Detection
    2.3.1 Online Learning
  5.3 Experiments
    5.3.1 Experimental Testbed and Setup
    5.3.2 Convergence of CILSD
    5.3.3 Comparative Study on Benchmark Data Sets
    5.3.4 Comparative Study on Benchmark and Real Data Sets
  5.4 Discussion
List of Figures
3.1 Evaluation of Gmean over various benchmark data sets: (a) pageblock (b) w8a (c) german (d) a9a (e) covtype (f) ijcnn1. In all the figures, the PAGMEAN algorithms either outperform or are on par with their parent PA algorithms and the CSOC algorithm.
3.2 Evaluation of Mistake rate over various benchmark data sets.
3.3 Evaluation of Gmean over various real data sets.
4.1 Evaluation of the online average of Gmean over various benchmark data sets: (a) news (b) news2 (c) gisette (d) realsim (e) rcv1 (f) url (g) pcmac (h) webspam.
4.2 Evaluation of mistakes over various benchmark data sets: (a) news (b) news2 (c) gisette (d) realsim (e) rcv1 (f) url (g) pcmac (h) webspam.
4.3 Effect of the regularization parameter λ on F-measure on (a) news (b) realsim (c) gisette (d) rcv1 (e) pcmac.
4.4 Effect of the learning rate η in the ASPGD algorithm for maximizing Gmean on (a) news (b) realsim (c) gisette (d) rcv1.
5.1 Objective function vs. DADMM iterations over benchmark data sets: (a) ijcnn1 (b) rcv1 (c) pageblocks (d) w8a.
5.2 Gmean versus cost over various data sets for the L-DSCIL algorithm. Cost is given on the x-axis, where each number denotes a cost pair such that 1={0.1,0.9}, 2={0.2,0.8}, 3={0.3,0.7}, 4={0.4,0.6}, 5={0.5,0.5}.
5.3 Gmean versus cost over various data sets for the R-DSCIL algorithm. Cost is given on the x-axis, where each number denotes a cost pair such that 1={0.1,0.9}, 2={0.2,0.8}, 3={0.3,0.7}, 4={0.4,0.6}, 5={0.5,0.5}.
5.4 Training time versus number of cores to measure the speedup of the R-DSCIL algorithm. Training time in Figure (a) is on the log scale.
5.5 Training time versus number of cores to measure the speedup of the L-DSCIL algorithm. Training time in both figures is on the log scale.
5.8 Gmean versus regularization parameter λ using R-DSCIL: (a) ijcnn1 (b) rcv1 (c) pageblocks (d) w8a (e) news (f) url (g) realsim (h) webspam.
5.9 Objective function vs. iterations over various benchmark data sets: (a) ijcnn1 (b) rcv1 (c) pageblocks (d) w8a. Obj1 denotes the objective function when the best learning rate is searched over {0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3}, while Obj2 denotes the objective value obtained with learning rate 1/L.
5.10 Training time versus number of cores to measure the speedup of the CILSD algorithm. Training time in both figures is on the log scale.
5.11 Gmean achieved by the CILSD algorithm versus number of cores on various benchmark data sets.
5.12 Effect of regularization parameter λ on Gmean: (i) ijcnn1 (ii) rcv1 (iii) gisette (iv) news (v) webspam (vi) url (vii) w8a (viii) realsim. λ varies in {3.00E-007, 0.000009, 0.00003, 0.0009, 0.003, 0.09, 0.3}.
List of Tables
3.1 Evaluation of Gmean and Mistake rate (%) on benchmark data sets. Entries marked * are not statistically significantly different from entries marked ** at the 95% confidence level under the Wilcoxon rank-sum test.
3.2 Evaluation of Gmean and Mistake rate (%) on real data sets. Entries marked * are not statistically significantly different from entries marked ** at the 95% confidence level under the Wilcoxon rank-sum test.
4.1 Evaluation of cumulative Gmean (%) and Mistake rate (%) on benchmark data sets. Entries marked * are statistically significantly different from entries marked **, and entries marked † are NOT statistically significantly different from entries marked ‡, at the 95% confidence level under the Wilcoxon rank-sum test.
Chapter 1
Introduction
Data mining is the process of discovering hidden patterns in data through computational techniques [6–8]. Anomaly detection is a sub-field of data mining. An anomaly is defined as a state of a system that does not conform to the normal behavior of that system/object [2, 9, 10]. For example, emission of neutrons in a nuclear reactor channel above a specified threshold is an anomaly. Similarly, suspicious activity of a person at a metro station is an anomaly. As a third example, abnormal usage of a credit card is an anomalous event. The above examples indicate that anomaly detection is an important data mining/machine learning task. The main focus of anomaly detection is to discover unusual patterns in the data. Depending on the application domain, the task has also been referred to as outlier mining, event detection, exception or contaminant mining, intrusion detection (IDS) [11–14], fraud detection [15], fault detection [16], etc. We emphasize that outlier mining (anomaly detection) is not a new task; its roots date back to the 19th century [17, 18]. Since then, a great deal of work by different researchers has resulted in various automated systems for anomaly detection, e.g., aircraft sensor monitoring, patient health monitoring, credit-card fraud detection, and video surveillance systems.
The significance of anomaly detection owes to the fact that anomalies are often critical and demand immediate action. For example, a modern aircraft records gigabytes of highly complex data from its propulsion system, navigation, control system, and pilot inputs, giving rise to so-called "Big data". Analyzing such complex and heterogeneous data indeed requires automated and sophisticated systems [19–21]. As another example, social media analytics looks for potentially anomalous nodes in the graph created from its users. Before we proceed, we define the term formally.
Figure 1.1: EEG data (heart rate shown with respect to time)
Definition 1. Anomalies are patterns in the data that deviate from the normal behavior or
working of the system.
As can be seen in Fig. 1.1, around time t = 700 there is a sudden drop in heart rate, which might indicate a potential anomaly in the patient at that time. For another example, of an anomaly in a crowd scene, see Fig. 1.2: an automobile in a park where only humans are allowed is an example of an anomaly, and similarly, a bicyclist in the park is also an anomaly. Since our aim is to detect such anomalies in big data, we define big data as:
Figure 1.3: An example of a point anomaly: an isolated anomalous point in the plane spanned by Feature 1 and Feature 2
Definition 2. Big data refers to data that is complex in nature and requires substantial computing resources for its processing.
It should be clear that big data may be small in sample size but have a huge number of dimensions, or have a large number of samples with a small number of dimensions (2 to 4, say). Data having both a large number of dimensions and a large number of samples is trivially big data. We characterize big data more formally later in this chapter. An example of big data is patient-monitoring sensor data consisting of hundreds of attributes of various types. Next, we describe the various types of anomalies as defined in [2].
We emphasize that there has been considerable research on finding global anomalies in large databases, but work on finding local anomalies is limited. In the present work, we do not distinguish between local and global anomalies; as we discuss later in the thesis, we take a different approach to handling anomalies.
1.1 Motivation
Our motivation comes from the data-laden domains in industry producing what is being termed big data. The natural question that arises is: why anomaly detection in big data? There are numerous reasons, which we examine one by one. Big data is often characterized along five dimensions: Volume, Velocity, Variety, Veracity, and Variability (the 5 V's). Below, we make the meaning of the 5 V's clear.
Thus, the above points indicate that anomaly detection in big data can be quite tedious and cumbersome. To date, only a few approaches address the curse of dimensionality, noise, sparsity, streaming, and heterogeneity while targeting anomalies efficiently.
Anomaly detection in big data finds several uses emerging from data-laden domains, some of which are described below.
The problem statement of the present work can be stated as follows:
“To efficiently detect anomalies in big data which is sparse, high-dimensional, streaming
and distributed”.
Since tackling all the characteristics of big data together is a complicated task, we make some assumptions and consider different scenarios of big data characteristics. Our assumptions are as follows:
Along the lines of solution methodology, we adopt the class-imbalance learning approach to solve the point anomaly detection problem. We argue that the class-imbalance learning problem is similar to the point anomaly detection problem and can therefore be used to solve it. Works that have followed this approach include [28–32]; they are discussed in detail in [33].
In order to detect anomalies in big data, the following scenarios have been considered in the present work:
By efficiently solving the problem, we mean that the technique is able to detect anomalies in a timely manner, is scalable (in terms of number of instances as well as dimensions), and incurs few false positives and few false negatives. In other words, we aim to achieve a higher Gmean and a lower Mistake rate (defined in the next chapter) than the existing methods in the literature.
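To make these metrics concrete, here is a minimal sketch (ours, not the thesis code) for a binary ±1 labeling, assuming the standard definitions: Gmean as the geometric mean of sensitivity and specificity, and mistake rate as the fraction of erroneous predictions; the thesis' precise definitions follow in the next chapter.

```python
import numpy as np

def gmean(y_true, y_pred):
    """Geometric mean of sensitivity (recall on +1) and specificity (recall on -1)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    sensitivity = tp / max(np.sum(y_true == 1), 1)
    specificity = tn / max(np.sum(y_true == -1), 1)
    return np.sqrt(sensitivity * specificity)

def mistake_rate(y_true, y_pred):
    """Fraction of examples on which the learner erred."""
    return np.mean(y_true != y_pred)

y_true = np.array([1, -1, -1, 1, -1, -1])
y_pred = np.array([1, -1, 1, -1, -1, -1])
print(gmean(y_true, y_pred), mistake_rate(y_true, y_pred))
```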
The thesis is divided into seven chapters. Each chapter can be read independently, without going back and forth. The content of each chapter is described below.
Chapter 1: This chapter introduces the proposed work. In particular, we discuss what an anomaly is, why anomaly detection is important, the various kinds of anomalies, and how to report the results of an anomaly detection algorithm. We also establish the connection between anomaly detection and related problems such as outlier detection and the class-imbalance problem. This chapter also covers the motivation and contributions of the work.
Chapter 3: This chapter begins with a literature survey of algorithms targeted at anomaly detection in streaming data, followed by their limitations. Then, we propose our first online algorithm for class-imbalance learning and anomaly detection. In particular, passive-aggressive (PA) algorithms [34], which have been successfully applied in the online classification setting, are sensitive to outliers because of their dependence on the norm of the data points, and as such cannot be applied to the class-imbalance learning task. In the proposed work, we make them insensitive to outliers by utilizing a modified hinge loss that arises from the maximization of the Gmean metric, and derive new algorithms called Passive-Aggressive GMEAN (PAGMEAN). Second, the derived algorithms are found to either outperform or perform as well as some state-of-the-art algorithms in terms of Gmean and mistake rate over various benchmark data sets in most cases. An application to online anomaly detection on real-world data sets is also presented, which shows the potential of PAGMEAN algorithms for real-world online anomaly detection tasks.
Chapter 4: The work presented in Chapter 3, although scalable to a large number of samples, does not exploit sparsity in the data, which is not uncommon these days. In this chapter, we propose a novel L1-regularized smooth hinge loss minimization problem and an algorithm based on the accelerated stochastic proximal learning framework (called ASPGD) to solve it. We demonstrate the application of proximal algorithms to real-world problems (class imbalance, anomaly detection), their scalability to big data, and their competitiveness with recently proposed algorithms in terms of Gmean, F-measure, and Mistake rate on several benchmark data sets.
Chapter 5: This chapter begins with a survey of techniques developed for anomaly detection in large, sparse, high-dimensional, distributed data and their limitations. Then, we describe our proposed framework. The DSCIL and CILSD algorithms are described in detail, followed by their distributed implementation in the MPI framework. Finally, we show the efficacy of the proposed approaches on benchmark and real-world data sets and compare their performance with state-of-the-art techniques from the literature.
Chapter 6: Chapter 6 elucidates the application of the support vector data description algorithm to finding anomalies in real-world data. We illustrate the working mechanism of the algorithm and show experimental results on nuclear power plant data.
Chapter 7: This chapter summarizes our main findings and ends with some open problems that we plan to explore in the future.
Chapter 2
Literature Survey
In this chapter, we present relevant works that have addressed the problem of anomaly detection in general. In particular, we discuss the research in three flavors: the first is based on the traditional approach to anomaly detection, the second on the modern approach, and the third on online learning. The reason for this categorization is that anomaly detection has been tackled in statistics as outlier detection since 1969 [18], and research has recently advanced by utilizing more state-of-the-art machine learning approaches to the problem. Our discussion treats anomaly detection, outlier detection, and novelty detection as essentially the same problem. Further, we also present the literature that has used class-imbalance learning to address the point anomaly detection problem.
As mentioned in Chapter 1, anomaly detection is not a new task; a great deal of work has been done in the statistics community. The ultimate goal of anomaly detection is to find any outlying pattern in the data. For example, given a data set from health care, the key task is to find any anomalous point or subsequence (in the case of gene expression data). Traditional approaches to anomaly detection are known by the umbrella term "outlier detection" [2]. Some literature distinguishes anomaly detection from outlier detection, e.g., [38]; however, we make no distinction between the two terms while presenting the survey.
Now, let us look at traditional approaches to anomaly detection and study some of the algorithms developed and the kinds of issues they aim to solve. Broadly speaking, they can be categorized based on the use of available data labels: supervised, semi-supervised, and unsupervised.
After anomalies have been found, we need a way to report the result. An anomaly detection algorithm typically outputs its result in one of the following two ways:
• Scores: A score reports the degree of outlierness of a data instance. Some techniques treat a high score as a high degree of outlierness, and some the reverse.
• Labels: A label is used when the number of anomaly categories is small and the algorithm reports whether a data instance is anomalous or normal.
Below, we describe various approaches to anomaly detection that have been developed in the statistics and data mining communities. These are:
• Classification-Based Approaches
• Statistical Approaches
• Clustering-Based Approaches
• Information-Theoretic Approaches
• Spectral-Theoretic Approaches
Classification-based anomaly detection techniques are built on the assumption that labeled instances (for both the normal and anomalous classes) are available for learning a model (typically a classifier). They work in two phases: (i) a model is trained using data from both the normal class and the anomalous class; (ii) the trained model is presented with unseen data to predict its class label. Techniques falling under classification-based approaches are classified as either one-class or multi-class classification.
Popular techniques for multi-class classification-based anomaly detection include neural networks [42], Bayesian networks [43, 44], Support Vector Machines (SVMs) [45, 46], rule-based classifiers [47, 48], etc.
One-class classification builds a discriminative model using only labeled data from normal instances, as shown in Fig. 2.2. In this setting, the learner draws a boundary around normal instances and leaves abnormal instances outside it. Indeed, complex boundaries can be drawn using non-linear models such as kernel methods, neural networks, etc. Popular techniques for anomaly detection using one-class classification are the one-class SVM and its various extensions [49, 50], one-class Kernel Fisher Discriminants [51], support vector data description (SVDD) [36], etc.
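As an illustration of one-class classification, the following minimal sketch uses scikit-learn's OneClassSVM; the synthetic data and the parameter nu=0.05 are illustrative assumptions, not choices from the works cited above.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(500, 2))      # training data: normal class only
X_test = np.vstack([rng.normal(0, 1, size=(10, 2)),
                    np.array([[6.0, 6.0]])])    # last point is an obvious outlier

# nu upper-bounds the fraction of training points left outside the boundary
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)
print(clf.predict(X_test))   # +1 = normal, -1 = anomalous
```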
The major disadvantages of classification-based techniques are the required availability of training data for the normal instances and the training time. If there are not sufficient examples from the abnormal classes, it is very hard to build a meaningful decision boundary. Secondly, in real-time anomaly detection applications, the classifier is expected to produce an anomaly score within reasonable time limits. Thirdly, they assign a class label to each test data point, which becomes a disadvantage when a meaningful anomaly score is required.
Statistical approaches to anomaly detection are model based. That is, the data is assumed to come from some (unknown) distribution. A model is built by estimating the parameters of the probability distribution from the data, and an anomalous object is one that does not fit the model well. Statistical anomaly detection techniques are based on the following assumption:
Assumption 1. Normal data instances reside in the high-probability region of the stochastic model, while anomalous data instances reside in the low-probability region of the stochastic model.
In a clustering problem, anomalous objects are those that do not fall in any particular cluster and lie in a low-density region (we shall see in the clustering-based approach that this is not always the case and the problem becomes tricky). Likewise, in a regression problem, anomalies fall far from the regression line.
• Box Plot Rule: The box plot has been used to detect anomalies in univariate and multivariate data and is perhaps the simplest anomaly detection technique. A box plot shows various statistical estimators on a graph: the largest non-anomaly, the upper quartile (Q3), the median, the lower quartile (Q1), and the smallest non-anomaly, as shown in Fig. 2.3. The difference Q3 − Q1 is called the Inter-Quartile Range (IQR), and the interval [Q1 − 1.5 IQR, Q3 + 1.5 IQR] covers most of the normal data (typically 99.3% under a Gaussian assumption). A data point lying more than 1.5 IQR above Q3 or below Q1 is declared an anomaly (a minimal sketch of this rule appears after this list). Some works that have used the box plot rule to identify anomalies are [52–54].
• Univariate Gaussian Distribution: As described previously, much real data can be modeled using standard distributions. When the data set is large (social media data, aircraft navigation data, etc.), it is often assumed that the data follows the normal distribution. The model parameters, mean µ and standard deviation σ, are computed from the data using the maximum likelihood principle. A point with attribute value x is then predicted to be an outlier at confidence level α when |x − µ| exceeds a threshold c chosen so that p(|x − µ| ≥ c) = α under the fitted model. The main difficulty with the univariate Gaussian assumption is choosing the parameters of the model using sampling theory; as a result, the accuracy of the prediction is reduced.
• Multivariate Gaussian Distribution: The univariate Gaussian assumption is applicable only to univariate data. To handle multivariate data, a multivariate Gaussian assumption is used: a point is classified as normal or anomalous depending on whether its probability under the distribution of the data is above or below a certain threshold. Since multivariate data tends to have high correlation among its attributes, asymmetry is invariably present in the model. To cope with this, we need a metric that takes into account the shape of the distribution. The Mahalanobis distance is such a metric:

mahalanobis(x, x̄) = \sqrt{(x − x̄)^T S^{−1} (x − x̄)},

where x is the data point, x̄ is the mean data point, and S is the covariance matrix (a minimal sketch appears after this list).
• Mixture Model Approach: The mixture model is a widely used technique for modeling problems in which the data is assumed to be generated from several distributions. A sample is generated from one of several component distributions with certain probabilities, as given by (2.2):

p(x; Θ) = \sum_{i=1}^{m} w_i \, p_i(x \mid Θ), \qquad (2.2)

where Θ is the parameter vector and w_i is the weight given to the i-th component of the mixture.
The basic idea of a mixture model for the anomaly detection task is the following. Two sets of objects are created: one for the normal objects, N, and the other for anomalous objects, A. Initially, set N contains all the objects and set A is empty. An iterative procedure is applied to move anomalous objects from set N to set A, and the algorithm stops as soon as there is no change in the likelihood of the data. Eventually, set A holds the anomalous objects and set N the normal objects.
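As promised above, a minimal sketch of the box plot rule (Tukey's 1.5 IQR fences; the sample data is an illustrative assumption):

```python
import numpy as np

def boxplot_outliers(x):
    """Flag points lying more than 1.5*IQR beyond Q1/Q3 (Tukey's rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (x < lo) | (x > hi)

x = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 25.0])   # 25.0 is anomalous
print(boxplot_outliers(x))
```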
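And a minimal sketch of scoring points by Mahalanobis distance under the multivariate Gaussian assumption (the synthetic correlated data is illustrative); note how the metric accounts for the shape of the distribution:

```python
import numpy as np

def mahalanobis(X, x):
    """Mahalanobis distance of point x from the distribution fitted to rows of X."""
    mean = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = x - mean
    return np.sqrt(d @ S_inv @ d)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=1000)
print(mahalanobis(X, np.array([0.5, 0.4])))   # small: aligned with the correlation
print(mahalanobis(X, np.array([0.5, -0.4])))  # larger: against the correlation
```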
Pros and Cons of the Statistical Approach: Below, we describe the pros and cons of using the statistical approach for the anomaly detection task.
• If the distribution underlying the data can be estimated accurately, statistical techniques provide a reasonable solution for anomaly detection.
• Statistical techniques can be used in an unsupervised mode provided the distribution estimation is robust to anomalies.
• The anomaly score produced by statistical techniques is often equipped with a confidence interval, which can be used to gain better insight into the decision for a test instance.
Some of the cons associated with statistical techniques for anomaly detection are:
• The major challenge in using statistical techniques is that they assume the data is distributed according to a particular distribution. This assumption rarely holds for real-world data, and the problem becomes severe in big data.
• The final decision about a test instance, whether to declare it an anomaly or not, depends on the test statistic used, the choice of which is non-trivial [55].
• They are unable to detect anomalies in streaming, sparse, or heterogeneous data efficiently.
Clustering is the process of grouping data into different clusters based on some similarity criterion. Clustering-based approaches to anomaly detection are not new; in fact, outliers are found as a by-product of clustering, provided the outliers do not form a coherent, compact group of their own. Clustering-based approaches have two subcategories, namely proximity-based and density-based approaches, described in Sections 2.1.3.1 and 2.1.3.2. Clustering-based techniques can be put into three categories depending on the assumption made about anomalies by different researchers [2].
Assumption 2. Normal instances form coherent clusters, while anomalies do not belong to any cluster.
Techniques built on the above assumption run a clustering algorithm on the given data set and report as anomalous those points that cannot be assigned to any cluster. Notable works based on this assumption are [56, 57]. Clustering algorithms that do not require data instances to belong to some cluster, such as ROCK [58], DBSCAN [59], and SNN [60], can be used under this assumption.
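For instance, under Assumption 2, DBSCAN's noise label can be read directly as an anomaly flag. A minimal sketch with scikit-learn (eps, min_samples, and the synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(5, 0.3, (100, 2)),
               np.array([[2.5, 2.5]])])        # isolated point between clusters

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.where(labels == -1)[0])               # DBSCAN labels unclustered points -1
```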
Assumption 3. Normal records lie close to the center of gravity of their cluster, while anomalous records lie far away from the closest cluster center of gravity.
Techniques based on this assumption work in two steps. In the first step, a clustering algorithm runs over the entire data to form natural clusters. In the second step, an anomaly score is calculated for each data point as its distance from its closest centroid. A noteworthy point is that clustering algorithms based on this second assumption can be executed in either an unsupervised or a semi-supervised setting. In [61], the authors propose a rough-set and fuzzy-clustering-based approach for intrusion detection. One limitation of techniques based on the second assumption is that they fail to find anomalies that form a homogeneous group of their own.
Assumption 4. Normal records form large, dense clusters, while abnormal records form small, sparse clusters.
Techniques in this subcategory fix a threshold on cluster size or density that enables them to accumulate outliers in a separate group. One notable work in this category is the cluster-based local outlier factor (CBLOF) [62]. CBLOF computes two things: (i) the size of the cluster and (ii) the distance of the data instance to its cluster centroid. It declares a data instance anomalous when the size and/or density of the cluster in which it falls is below a certain threshold.
Pros and Cons of Clustering-Based Approaches: The performance of clustering-based outlier detection depends on the training time of the underlying clustering algorithm. Some clustering algorithms run in quadratic time, and several optimizations have been proposed to reduce this to linear time O(Nd), but these are approximation algorithms.
Some salient features of clustering-based anomaly detection approaches are the following:
• Performance depends on how well the underlying algorithm captures the intrinsic structure of the data.
• Many techniques are not optimized for anomalies.
• Their performance hinges on the assumption that anomalies do not form significant clusters.
2.1.3.1 Proximity-Based
We note that the outlier score of a data point depends on the value of k, the number of nearest neighbors. If the value of k is too small, say 1, then a small number of nearby outliers will contribute a small outlier score and thus hamper the performance of the algorithm (see Fig. 2.5). On the other hand, a large value of k will make an entire group of fewer than k points become outliers (see Fig. 2.6).
Proximity-based anomaly detection using the sum of distances of a given data point from its k nearest neighbors as the anomaly score has been used in [63–65]. Nearest-neighbor-based anomaly detection similar to the aforementioned technique has been used to detect fraudulent credit card transactions in [66].
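A minimal sketch of this k-nearest-neighbor distance score, using scikit-learn's NearestNeighbors (k = 5 and the synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), np.array([[8.0, 8.0]])])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own 1-NN
dist, _ = nn.kneighbors(X)
score = dist[:, 1:].sum(axis=1)                   # sum of distances to k neighbors
print(np.argsort(score)[-3:])                     # indices of the top-3 outliers
```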
Pros and Cons of the Proximity-Based Approach: Proximity-based approaches are simple and easy to apply in comparison to statistical techniques. However, their running time is of the order of O(n²), which makes them inefficient on high-dimensional data comprising millions or billions of points. Secondly, the outlier score is sensitive to the value of k, which is difficult to determine in practice. In addition, they perform poorly over clusters of varying densities. To see this, consider the outlier scores of points A and B in Fig. 2.7. Point A is correctly identified as an outlier, but point B has an outlier score even lower than points in the green cluster. The reason is the differing sparsity and density of the clusters.
2.1.3.2 Density-Based
Density-based outlier detection schemes state that outliers are found in sparse regions. In fact, the density-based approach is similar to the nearest-neighbor-based approach in the sense that density can be calculated as the inverse of the distance to the k nearest neighbors: the distance to the k-th nearest neighbor of a data instance defines a hypersphere centered at that instance, and the reciprocal of this distance gives the density of the point. Nevertheless, density-based techniques cannot resolve the issue of varying densities, for reasons similar to those described for the proximity-based approach. Hence, the concept of relative density is introduced, as given by (2.3):

\text{avg relative density}(x, k) = \frac{\text{density}(x, k)}{\sum_{y \in N(x,k)} \text{density}(y, k) \, / \, |N(x, k)|} \qquad (2.3)
where density(x, k) is the density of the point x. LOF (local outlier factor), proposed by Breunig et al. [67], uses this concept of relative density. The LOF score of a point is the ratio of the average relative density of the point's k nearest neighbors to the local density of the point itself. The local density of a data point is found by dividing k (the number of nearest neighbors) by the volume of the hypersphere, centered at the point, that contains those k data points. Clearly, the local density of normal points lying in a dense region will be high, while the local density of anomalous points in a sparse region will be low.
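A minimal usage sketch of LOF via scikit-learn's LocalOutlierFactor (the library's implementation of [67]; n_neighbors = 20 and the data are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), np.array([[6.0, 6.0]])])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 for outliers, +1 for inliers
scores = -lof.negative_outlier_factor_      # larger = more anomalous
print(np.where(labels == -1)[0], scores.max())
```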
In the literature, several researchers have worked on variants of LOF. Some have attempted to reduce the original time complexity of LOF below O(N²); some compute local density in a different way; and others have proposed variants of LOF suitable for different kinds of data. Very recently, Kai Ming Ting et al. [68] proposed a novel density estimation technique that has average-case sublinear time complexity and constant space complexity in the number of instances. This order-of-magnitude improvement in performance can deal with anomaly detection in big data. They also proposed the DEMass-LOF algorithm, which does not require distance calculations and runs in sublinear time without any indexing scheme.
Pros and Cons of the Density-Based Approach: Density-based approaches suffer from the same malady as their proximity-based counterparts: they have a computational complexity of O(N²). To mitigate this, efficient data structures like k-d trees and R-trees have been proposed [69]. Despite this, the modified techniques neither scale over many attributes nor provide an anomaly score for each test instance when required.
Besides the approaches to anomaly detection mentioned above, there are several others. Below, we describe one such approach based on information theory, a stream of applied mathematics and computer science that deals with the quantitative information that can be gleaned from data, using measures such as Kolmogorov complexity, entropy, and relative entropy.
Assumption 5. Anomalies in the data infuse erratic information content into the data set.
The basic anomaly detection algorithm in this category works as follows. Let Θ(D) be the (Kolmogorov) complexity of the given data set D. The objective is to find the minimal subset of instances I such that Θ(D) − Θ(D − I) is maximized. All the instances thus obtained are anomalies.
One notable work under this assumption is [70], where the author uses the size of the compressed data file as a measure of the data set's Kolmogorov complexity. [71] utilizes the size of a regular expression to estimate the Kolmogorov complexity. Besides the above, information-theoretic measures such as entropy and relative uncertainty have been used in [72–74].
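A minimal sketch in the spirit of [70], approximating Kolmogorov complexity by compressed size with zlib (the records and the leave-one-out scoring loop are illustrative assumptions, not the cited system):

```python
import zlib

def complexity(blob: bytes) -> int:
    """Compressed size as a crude upper bound on Kolmogorov complexity."""
    return len(zlib.compress(blob, 9))

data = [b"abcabcabc" * 50] * 20 + [b"q7#xZ!pL0v" * 45]   # last record is erratic
total = complexity(b"".join(data))
# Score each record by how much its removal shrinks the data set's complexity.
scores = [total - complexity(b"".join(data[:i] + data[i + 1:]))
          for i in range(len(data))]
print(scores.index(max(scores)))   # index of the most anomalous record
```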
Pros and Cons of the Information-Theoretic Approach: A major drawback of information-theoretic approaches is that they involve a dual optimization: first, minimize the subset size, and second, maximize the decrease in the complexity of the data set. Hence their running time is exponential in the number of data points. Some approximate search techniques have been proposed; for example, a local search algorithm can find such a subset approximately in O(n) time. The advantage of the information-theoretic approach is that it can be employed in an unsupervised setting and makes no assumption about the distribution of the data.
Spectral approaches deal with problems in high dimensions. They assume that the data set can be embedded into a much lower-dimensional space while preserving its intrinsic structure. In fact, they derive from the Johnson-Lindenstrauss lemma [75] (see the Appendix for the definition).
Assumption 6. The data set can be projected into a lower-dimensional subspace in which normal and anomalous instances appear significantly different.
A consequence of projection onto a lower-dimensional manifold is that not only is the data set size reduced, but we can also search for outliers in the latent space, exploiting the correlation among attributes. The fundamental challenge for such techniques is to determine a lower-dimensional embedding that sufficiently distinguishes anomalies from normal instances. This problem is nontrivial because there is an exponential number of subspaces onto which the data can be projected. Notable works in this domain use Principal Component Analysis (PCA) for anomaly detection in network intrusion [76] and Compact Matrix Decomposition (CMD) for anomaly detection in sequences of graphs [77].
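A minimal sketch of one common PCA-based scheme, scoring points by reconstruction error (not necessarily the method of [76]; the synthetic near-low-rank data is an illustrative assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
z = rng.normal(size=(300, 2))
X = z @ rng.normal(size=(2, 10))                 # data lies near a 2-D subspace
X[0] += rng.normal(scale=5.0, size=10)           # push one point off the subspace

pca = PCA(n_components=2).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))  # project back to the input space
err = np.linalg.norm(X - X_hat, axis=1)          # residual = anomaly score
print(np.argmax(err))                            # expected: 0
```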
The advantage of spectral methods is that they are suitable for anomaly detection in high-dimensional data. Also, they can work in unsupervised as well as semi-supervised settings. The disadvantage of spectral techniques is that they separate anomalies from normal instances only if such a lower-dimensional embedding exists. Another disadvantage is their high computation time.
Traditional approaches to anomaly detection in big data suffer from miscellaneous issues. For example, statistical techniques require the underlying distribution to be known a priori. Proximity-based and density-based approaches require an appropriate metric for calculating the anomaly score and run in time quadratic in the number of data instances. Clustering-based techniques need some kind of optimization to reduce their quadratic time complexity and do not generalize to heterogeneous data. Similarly, information-theoretic techniques require an appropriate measure of information, and spectral techniques a suitable embedding. The point is that they cannot handle high-dimensional, heterogeneous, noisy, streaming, and distributed data, which is now found ubiquitously. For example, aircraft navigation data is highly complex, heterogeneous, and noisy, and requires sophisticated tools and techniques for online processing so as to thwart any likely accident. Similarly, mobile phone call records demand batch processing of millions and billions of calls every day to screen for potential terrorist attacks.
The above points indicate that we need mechanisms that not only reveal potentially anomalous records in complex data but also give us insight into them. This makes sense because manual knowledge discovery in big data is a nontrivial task. Therefore, in the forthcoming sections, we look at techniques that meet some of the aforementioned goals. In particular, we discuss recent approaches to anomaly detection, their results, and their drawbacks. Further, the techniques covered in the rest of this chapter are suitable for anomaly detection in unsupervised as well as semi-supervised mode. This is an important point, since real-world data sets are mostly unlabeled.
A non-parametric technique is one in which the number of parameters grows with the size of the data set, or which does not assume a fixed model structure. Some examples of non-parametric models are histograms, kernel density estimators, non-parametric regression, and models based on the Dirichlet process or Gaussian processes. A key point about non-parametric techniques is that they do not assume the data comes from some fixed but unknown distribution; rather, they make fewer assumptions about the data and hence are more widely applicable. Some notable works on anomaly detection using non-parametric models are described below.
Liang et al. [3] propose generalized latent Dirichlet allocation (LDA) and a mixture of Gaussian mixture models (MGMM) for unimodal and multimodal anomaly detection on galaxy data. They observe that data, besides being anomalous at the individual level, can also be anomalous at the group level, and hence techniques developed for point anomaly detection fail to identify anomalies at the group level. However, their model is highly complex and learns a large number of parameters using variational inference.
In the same line of work, Rose et al. [79] proposed the group latent anomaly detection (GLAD) algorithm for mining abnormal communities in social media. Their model suffers from the same problem as that of Liang et al. above: it is complex and involves learning a large number of parameters. In [80], the authors use a Gaussian process (GP) for one-class classification, similar to the one-class SVM approach but in a non-parametric way, to identify anomalies in wire ropes. However, their approach generates false alarms and does not incorporate prior knowledge about the structure of the rope. The same group also combined GPs with kernel functions for one-class classification [81], showing that GPs combined with kernel functions can outperform support vector data description [36] over various data sets.
The potential advantage of non-parametric techniques is that they do not assume the data comes from some fixed but unknown distribution. Also, as the number of parameters grows with the size of the input, these techniques can handle the dynamic nature of the data. On the flip side, however, there is a lack of suitable methods for estimating hyperparameters, such as the kernel bandwidth when a GP prior is combined with a kernel function; unless one has the right kernel bandwidth, the performance of GP methods for anomaly detection is poor. Secondly, cross-validation for finding parameters is infeasible, since non-parametric techniques involve parameters that can easily number in the hundreds or thousands.
Kernel methods [82] provide a powerful framework for analyzing data in high dimensions (see Fig. 2.8). They have been successfully applied to ranking, classification, and regression over multitudes of data.
Let us define some terms before delving deeper. A kernel is a function κ that, for all x, z ∈ X, satisfies

κ(x, z) = ⟨φ(x), φ(z)⟩, \qquad (2.4)

where φ is a mapping from the input space X to an (inner product) feature space F. Intuitively, the kernel implicitly computes the inner product between two feature vectors in a high-dimensional feature space, i.e., without actually computing the features.
Multiple Kernel Learning (MKL) [83] has recently gained a lot of attention in the data mining/machine learning community. The growing popularity of MKL is due to the fact that it combines the power of several kernel functions in one framework. This arouses a natural curiosity: can we apply MKL to anomaly detection in heterogeneous data coming from multiple sources? An answer has recently been given by empirical results of S. Das et al. [19] at NASA on Flight Operations Quality Assurance (FOQA) archive data. MKL theory is essentially based on the theory of kernel methods [84]. The idea is to use a kernel function satisfying Mercer's condition (see the Appendix for the definition) that finds the similarity between pairs of objects in a high-dimensional feature space. The salient feature is that a valid kernel can compute the similarity between objects of any kind. This leads to the idea of combining multiple kernels into one kernel and using it for classification, regression, anomaly detection, etc.

Figure 2.8: An example of the working of a kernel function. A non-linearly separable data set can be mapped into a higher-dimensional feature space through the kernel function, where the data can be separated by a linear decision boundary.
MKL learns the kernel from the training data. More specifically, the kernel κ is learned as a linear (convex) combination of given base kernels κ_k, as shown in (2.6):

κ(x_i, z_j) = \sum_{k=1}^{K} θ_k \, κ_k(x_i, z_j), \qquad (2.6)

where θ_k ≥ 0, k = 1, 2, . . . , K. The goal is to learn the parameters θ_k such that the resulting kernel κ is positive semi-definite (PSD) (see the Appendix for the definition of positive semi-definiteness).
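A minimal sketch of the combination idea: two base Gram matrices are mixed with fixed non-negative weights θ (a real MKL method would learn θ; the weights, kernel choices, and data here are illustrative assumptions) and fed to a one-class SVM with a precomputed kernel.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (200, 5))
X_test = np.vstack([rng.normal(0, 1, (5, 5)), rng.normal(7, 1, (1, 5))])

theta = [0.7, 0.3]   # fixed non-negative weights; MKL would learn these
K_train = theta[0] * rbf_kernel(X_train) + theta[1] * linear_kernel(X_train)
K_test = (theta[0] * rbf_kernel(X_test, X_train)
          + theta[1] * linear_kernel(X_test, X_train))

clf = OneClassSVM(kernel="precomputed", nu=0.05).fit(K_train)
print(clf.predict(K_test))   # the far-away last point should be flagged -1
```

Since any non-negative combination of PSD kernels is PSD, the combined matrix remains a valid kernel.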
Recently, Verma et al. [85] proposed Generalized MKL, which can learn millions of kernels over half a billion training points. After learning multiple kernels jointly, any anomaly detection technique capable of using a kernel as a similarity measure, such as the one-class SVM, can distinguish between normal and anomalous instances. In [19], the authors use two base kernels, one for discrete sequences and one for continuous data. The kernel κd for discrete sequences is computed using the LCS (longest common sub-sequence), while the kernel κc for continuous data is inversely proportional to the distance between the Symbolic Aggregate Approximation (SAX) [22] representations of the points xi and zj. The combined kernel is fed to a one-class SVM, and its performance is compared with two baseline algorithms, namely Orca and SequenceMiner. Results over various simulated and real data sets demonstrate that the multiple kernel anomaly detection (MKAD) algorithm outperforms the baselines in detecting different kinds of faults (discrete and continuous).
In [86], the authors use an MKL approach for anomaly detection in network traffic data, which is heavy-flow, high-dimensional, and non-linear. Essentially, they use a sparse and non-sparse kernel mixture based on the Lp-norm MKL proposed by Kloft et al. [87]. Another work, by Tax et al. [36], is based on support vector data description (SVDD); it builds a hypersphere around normal data, leaving outliers either on the boundary or outside it. Their use of a support vector classifier suggests that the MKL approach can be exploited to build a hypersphere in a high-dimensional space. This forms the motivation for Liu et al. [88] and Gornitz et al. [89] to use MKL for anomaly detection in semi-supervised as well as unsupervised settings.
Thus we see that MKL can tackle the high-dimensional and heterogeneous nature of big data very nicely. However, further work needs to be done to explore the possibility of using MKL in streaming and distributed anomaly detection scenarios.
Non-negative matrix factorization (NNMF) as a technique for anomaly detection in image data was studied by Lee and Seung in 1999 [90]. The non-negative matrix factorization problem is posed as follows. Let A be an m × n matrix whose components aij are non-negative, i.e., aij ≥ 0. The goal is to find non-negative matrices W and H of sizes m × k and k × n such that (2.7) is minimized:

F(W, H) = \frac{\|A − WH\|_F^2}{2}, \qquad (2.7)

where k ≤ min{m, n} and depends upon the specific problem to be solved; in practice, k is much smaller than rank(A). The product WH is called a non-negative matrix factorization of the matrix A. Note that the above problem is non-convex in W and H jointly. Therefore, algorithms proposed in the literature seek to approximate the matrix A via the product WH, i.e., A ≈ WH. Thus, WH represents A in a very compressed form.
After Lee and Seung's initial NNMF algorithm based on multiplicative update rules, several variants have been proposed to solve (2.7), for example, modified multiplicative updates [91], projected gradient descent [92], alternating least squares (ALS) [93], alternating non-negative least squares (ANLS) [92], and Quasi-Newton methods [94].

Figure 2.9: Video activity detection; from left to right: the original frame, background, and foreground [3]
Recently, Liang et al. [3] proposed direct robust matrix factorization (DRMF) for anomaly detection. Their basic idea is to exclude some outliers from the initial data and then ask: what is the optimal low rank one can obtain if some of the data is ignored? They formulate this as an optimization problem with constraints on the cardinality of the outlier set and the rank of the matrix. Essentially, they solve the problem shown in (2.8), reconstructed here from the surrounding description:

\min_{L, S} \; \|A − L − S\|_F^2 \quad \text{subject to} \quad \mathrm{rank}(L) ≤ K, \;\; \|S\|_0 ≤ e, \qquad (2.8)

where S is the outlier set, L is the low-rank approximation to A, K is the desired rank, and e is the maximal number of nonzero entries in S. ‖·‖F denotes the Frobenius norm of a matrix (the square root of the sum of squares of its elements). In the matrix factorization paradigm, solving optimization problems involving rank or set cardinality is nontrivial. The authors use the fact that the problem is decomposable in nature and hence solvable by a block-coordinate descent algorithm [95]. They use the DRMF algorithm to separate background from foreground (noise), which has applications in video surveillance (Fig. 2.9).
In another work, Allan et al. [96] use NMF to generate feature vectors that can be used to cluster text documents. More specifically, they exploit a concept derived from NMF called the sum-of-parts representation, which reveals term-usage patterns in a given document. The coefficient matrix factors and the features thus obtained are used to cluster documents, and this procedure maps anomalous training documents to feature vectors. In addition to locating outliers in a latent subspace, NMF has been used to interpret outliers in that subspace as well. In this domain, the work of Fei et al. [97] is significant: they combine NNMF with subspace analysis so that outliers are not only found but can also be interpreted.
From the above discussion, we see that non-negative matrix factorization can be employed for anomaly detection in large and sparse data. However, the suitability of NMF for anomaly detection in streaming, heterogeneous, and distributed settings is still unexplored.
Note that projection is done to reduce the dimensionality of the data set. Subsequently, we can apply any anomaly detection approach, provided certain conditions are met, as described below.
• Condition 1: The projection should preserve pairwise distances between data points with very high probability (an informal statement of the Johnson-Lindenstrauss lemma).
• Condition 2: The projection should preserve the distance of points to their k nearest neighbors (a result from the work of Vries et al. [24]).

Figure 2.10: Outliers (marked in red) and their orientation after projection from 3D to 2D [4].

Note that Condition 1 applies to approaches that use some metric (e.g., distance) to calculate outlier scores, while Condition 2 applies to the proximity-based approaches discussed in Section 2.1.3.1.
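A minimal sketch of Condition 1 in action, using scikit-learn's GaussianRandomProjection and checking how well pairwise distances survive the projection (dimensions and sample sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10000))                # high-dimensional data

proj = GaussianRandomProjection(n_components=500, random_state=0)
X_low = proj.fit_transform(X)

D_high = euclidean_distances(X)
D_low = euclidean_distances(X_low)
iu = np.triu_indices(100, k=1)                   # all distinct pairs
ratio = D_low[iu] / D_high[iu]
print(ratio.min(), ratio.max())                  # distortions close to 1
```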
In [24], Vries et al. introduce the projection-indexed nearest neighbor (PINN) approach, which is essentially based on projection pursuit. They first apply a random projection (RP) to reduce the dimension of the data set so as to satisfy Condition 1; thereafter, they use the local outlier factor (LOF) [67] to find local outliers in a data set of 300,000 instances and 102,000 dimensions. In [101], the authors use a convex optimization approach, outlier pursuit, for the matrix recovery problem; it recovers the optimal low-dimensional subspace and marks the distorted points (the anomalies in image data). In [102], Mazin et al. apply a variant of random projection which they call Random Spectral Projection, using Fourier/cosine spectral projections for dimension reduction. They show that random samples of the Fourier spectrum perform better than random projection in terms of accuracy and storage on text documents. In [100], Muller et al. propose OutRank, a novel approach to ranking outliers. OutRank uses a subspace view of the data and compares clustered regions in arbitrary subspaces, producing a degree-of-outlierness score for each object.
Challenges in using random projection pursuit: The major challenge for anomaly detection techniques employing random projection is that they must work in the reduced dimension while retaining the intrinsic structure of the data. A second issue is how to choose the number of dimensions onto which to project the data efficiently.
Ensemble techniques work on the principle that "unity is strength": they combine the power of individual outlier detection techniques and can produce striking results provided certain criteria are met. Although ensemble techniques were applied to classification [103] and clustering tasks long ago, they have been used in the anomaly detection scenario only recently. The increasing popularity of ensemble techniques stems from their ability to locate outliers effectively in high-dimensional and noisy data [23]. A typical outlier ensemble contains a number of components that contribute to its power.
In the literature, ensembles have been categorized on the basis of component independence and constituent components. The first categorization considers whether the components are developed independently or depend on each other, giving two types. In a sequential ensemble, algorithms are applied in tandem so that the output of one algorithm affects the next, producing either better-quality data or a specific choice of algorithm; the final output is either a weighted combination or the result of the last algorithm applied. In an independent ensemble, completely different algorithms, or the same algorithm with different instantiations, are applied to the whole or a part of the data under analysis (a minimal sketch of score fusion in an independent ensemble follows).
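The sketch below averages min-max-normalized scores from two independently built detectors, LOF and a one-class SVM; the choice of detectors, the normalization, and the data are illustrative assumptions rather than a method from the works cited here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)), np.array([[5.0, 5.0]])])

def normalize(s):
    """Map raw scores to [0, 1] so heterogeneous detectors can be averaged."""
    return (s - s.min()) / (s.max() - s.min())

lof = LocalOutlierFactor(n_neighbors=20).fit(X)
s1 = normalize(-lof.negative_outlier_factor_)        # larger = more anomalous
svm = OneClassSVM(nu=0.05, gamma="scale").fit(X)
s2 = normalize(-svm.decision_function(X))            # larger = more anomalous

ensemble_score = (s1 + s2) / 2                       # independent-ensemble fusion
print(np.argmax(ensemble_score))                     # expected: index 300
```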
In the same line of work, Cabrera et al. [5] proposed anomaly detection in the distributed setting of mobile ad-hoc networks (MANETs). They combine, through an averaging operation, the anomaly scores from local IDS (intrusion detection system) modules attached to each node, and this score is sent to the cluster head; all cluster heads send their cluster-level anomaly indices to a manager, which averages them (see Fig. 2.11). A dynamic trust management scheme is proposed in [106] for detecting anomalies in wireless sensor networks (WSNs). A hybrid ensemble approach for class-imbalance and anomaly detection was recently proposed in [107]; the authors use a mixture of oversampling and undersampling with bagging and AdaBoost and show improved performance. Ensembles of SVMs for imbalanced data sets are proposed in [108, 109].
Figure 2.11: Overall infrastructure to support the fusion of anomaly detectors [5]
Several issues arise when applying ensembles to anomaly detection. The first relates to the real-world scenario where data is often unlabeled; in such cases, ensemble techniques, which are mostly based on classification, cannot be applied. The second alludes to the fact that anomalies are present in tiny numbers among a huge bundle of normal instances. The normalization issue relates to the differing output formats of different detectors, especially when heterogeneous models are trained.
Since our work focuses on anomaly detection in big data, we discuss related research tailored to such data. Below, we present related work comprising techniques based on (1) online learning, (2) class-imbalance learning, and (3) anomaly detection in a streaming environment, and we discuss the differences from our work.
Online learning refers to a learning mechanism in which the learner is given one example at a time, as shown in Algorithm 1. In Algorithm 1, the learner is presented with an example xt in line 2. It makes its prediction ŷt in line 3 and receives the correct label yt in line 4. In line 5, it computes the loss due to any mistake made in the prediction, and it subsequently updates the model in line 6.
Algorithm 1 Online Learning
1: repeat
2:   receive instance: xt
3:   predict: ŷt
4:   receive correct label: yt
5:   suffer loss: ℓ(yt, ŷt)
6:   update model
7: until all examples are processed
Clearly, the online learning algorithm is free of memory issues when processing massive data sets, as it looks at one example at a time. Secondly, it has an optimal running time of O(nd), provided lines 5 and 6 take O(d) time, where n is the number of samples processed so far and d is the dimensionality of the data. Thirdly, it is easy to implement. In the next paragraphs, we discuss relevant literature based on online learning and its limitations in tackling outliers.
Online learning has its origin in the classic work of Rosenblatt on the perceptron algorithm [110]. The perceptron algorithm is based on the idea of a single neuron. It simply takes an input instance xt and learns a linear predictor of the form ft(xt) = wtᵀxt, where wt is the weight vector. If it makes a wrong prediction, it updates its parameter vector as follows:

wt+1 = wt + yt xt. \qquad (2.10)
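A minimal sketch of this mistake-driven protocol (our illustration of the classical perceptron; the synthetic linearly separable stream is an assumption):

```python
import numpy as np

def perceptron(stream):
    """Online perceptron: predict, then update on mistakes (eq. 2.10)."""
    w, mistakes = None, 0
    for x_t, y_t in stream:
        if w is None:
            w = np.zeros_like(x_t)
        y_hat = 1 if w @ x_t >= 0 else -1   # predict with the current model
        if y_hat != y_t:                    # mistake-driven update
            w = w + y_t * x_t
            mistakes += 1
    return w, mistakes

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.where(X @ w_true >= 0, 1, -1)        # linearly separable labels
print(perceptron(zip(X, y))[1])             # mistake count
```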
[111] propose online learning with kernels. Their algorithm, called NORMAλ, is based on regularized empirical risk minimization, which they solve via regularized stochastic gradient descent. They also show empirically how it can be used in an anomaly detection scenario; however, their algorithm requires tuning many parameters, which is costly for time-critical applications. Passive-Aggressive (PA) learning [34] is another online learning algorithm, based on the idea of maximizing the "margin" in the online learning framework. The PA algorithm updates the weight vector whenever the margin on the current example is below a certain threshold. Further, the authors introduce a slack variable to handle non-linearly separable data. Nonetheless, PA algorithms are sensitive to outliers, for the following reason: the PA algorithm applies the update rule wt+1 − wt = τt yt xt, where τt is a learning rate. In the presence of outliers, the minimum of ½‖w − wt‖² can be large, since ‖xt‖² is large for outliers. Other online learning algorithms in the literature are MIRA [112], ALMA [113], SOP [114], AROW [115], NAROW [116], CW [117], SCW [118], etc. Many of these are variants of the basic PA and perceptron algorithms and hence are sensitive to outliers. A thorough survey of these algorithms is not feasible here; for an exhaustive survey of online learning, see [119].
The online learning based algorithms presented above, although they scale in the number of data points, do not scale with the data dimensionality. For example, Passive-Aggressive (PA) learning [34], MIRA [112], ALMA [113], SOP [114], ARROW [115], NAROW [116], CW [117], SCW [118], etc., have a running time complexity of O(nd). Noteworthy points about the algorithms mentioned above are: (i) they are sensitive to outliers and cannot handle the class-imbalance problem without modification; (ii) they do not consider the sparsity present in the data, except CW and SCW. Though CW and SCW exploit the sparsity structure in the data, they are not designed for the class-imbalance problem. In the present work, we attempt to address these issues through the lens of online and stochastic learning in Chapter 3.
In [97], the authors proposed sampling with online bagging (SOB) for class-imbalance detection. Their idea is essentially based on resampling, that is, oversampling the minority class and undersampling the majority class using a Poisson distribution with average arrival rates of N/P and Rp respectively, where P is the total number of positive examples, N is the total number of negative examples, and Rp is the recall on positive examples. Essentially, [97] propose an online ensemble of classifiers which may achieve high accuracy, but training an ensemble of classifiers is a time-consuming process. In addition, [97] does not use the concept of a surrogate loss function to maximize Gmean.
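A minimal sketch of this resampling idea follows; the base learner interface (update) and the bookkeeping of N, P, and the positive-class recall Rp are placeholders, and the exact rate schedule in [97] is more involved.

import numpy as np

rng = np.random.default_rng(0)

def sob_step(learners, x, y, N, P, R_p):
    """One SOB-style online-bagging step: each base learner trains on (x, y)
    k ~ Poisson(rate) times, oversampling positives (rate N/P) and
    undersampling negatives (rate R_p, the recall on positives)."""
    rate = N / max(P, 1) if y == +1 else R_p
    for learner in learners:
        for _ in range(rng.poisson(rate)):  # number of copies this learner sees
            learner.update(x, y)            # placeholder online base learner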
[29] proposes online cost-sensitive classification for imbalanced data. One of their problem formulations is based on maximizing the weighted sum of sensitivity and specificity, and the other on minimizing the weighted cost. Their solution is based on minimizing a convex surrogate loss function (a modified hinge loss) instead of the non-convex 0-1 loss. Their work closely matches ours. However, in Section 3.2 we show that the problem formulation of [29] is different from our formulation, and the solution technique they adopt is based on online gradient descent, while ours is based on the online passive-aggressive framework. Specifically, the problem formulation of [29] is a special case of our problem formulation. Further differences will become clear in Section 3.3. In [128], the author proposes a methodology to detect spammers in a social network. Essentially, [128] identifies which features might be useful for detecting spammers on online platforms such as Facebook, Twitter, etc. One major drawback of their proposed method is that it is an offline solution and, therefore, cannot handle big data. Secondly, they apply a vanilla SVM for spammer detection, which may not be effective due to the use of the hinge loss within the SVM. The present work also attempts to solve the open problem in [128] via online spammer detection with low training time.
Recently, Gao et al. [129] proposed a class-imbalance learning method based on a two-stage extreme learning machine (ELM). Although they are able to handle class imbalance, the scalability of the two-stage ELM to high dimensions is not shown. Besides, the two-stage ELM solves the class-imbalance learning problem in the offline setting. ESOS-ELM, proposed in [26], uses an ensemble of a subset of sequential ELMs to detect concept drift in the class-imbalance scenario. Although ESOS-ELM is an online method, its performance is demonstrated on low-dimensional data sets only (the largest dimension tested is < 500), and it does not exploit the sparsity structure in the data explicitly. In short, none of the techniques mentioned above tackles the problem of class imbalance in huge dimensions (in the millions and above), nor do they handle the problem structure present in the data, such as sparsity. In our present work, we propose an algorithm that is scalable to high dimensions and can exploit the sparsity present in the data.
Cost-sensitive learning can be further categorized into offline cost-sensitive learning (OffCSL) and online cost-sensitive learning (OnCSL). OffCSL incorporates misclassification costs into offline learning algorithms such as cost-sensitive decision trees [130], [131], [125], cost-sensitive multi-label learning [127], cost-sensitive naive Bayes [132], etc. On the other hand, OnCSL-based algorithms use cost-sensitive learning within the online learning framework. Notable work in this direction includes the work of Jialei et al. [29], AdaCost [124], and SOC [28]. It is to be noted that cost-sensitive learning methods have been shown to outperform sampling-based methods on big data [133]. Moreover, OnCSL-based methods are more scalable to high dimensions than their OffCSL counterparts, as the former process one sample at a time.
Online outlier detection in sensor data is proposed in [134]. Their method uses kernel density estimation (KDE) to approximate the data distribution in an online way and employs distance-based algorithms for detecting outliers. However, their work suffers from several limitations. Firstly, although the approach used in [134] scales to multiple dimensions, their method does not take evolving data streams into account. Secondly, using KDE to estimate the data distribution in the case of streaming data is a non-trivial task. Thirdly, their algorithm is based on the concept of a sliding window, and determining the optimal width of the sliding window is again non-trivial. Abnormal event detection using online SVM is presented in [135]. In [136], the authors present a link-based algorithm (called LOADED) for outlier detection in mixed-attribute data. However, LOADED does not perform well with continuous features, and its experiments were conducted on data sets with at most 50 dimensions.
Fast anomaly detection using Half-Space Trees was proposed in [137]. Their Streaming HS-Tree algorithm has constant amortized time complexity of O(1) and constant space complexity. Essentially, they build an ensemble of HS-trees and store the mass¹ of the data in the nodes. Their work differs from ours in the sense that we use online learning to build our model instead of an ensemble of HS-trees. Numenta [139] is a recently proposed anomaly detection benchmark. It includes algorithms for tackling anomaly detection in a streaming setting. However, the working and scoring mechanism of Numenta is different from ours. Specifically, their Hierarchical Temporal Memory (HTM) algorithm is a window-based algorithm and uses NAB scores (please see [139]) to report anomaly detection results, whereas we use Gmean and Mistake rate to report experimental results.
¹Data mass is defined as the number of points in a region; two groups of data can have the same mass regardless of the characteristics of the regions [138].
In this section, we describe research that closely matches our work in Chapter 6. Some work exists that has tackled anomaly detection in nuclear power plants. In [140], the authors study health monitoring of a nuclear power plant. They propose an algorithm based on symbolic dynamic filtering (SDF) for feature extraction from time-series data, followed by optimization of the partitioning of the sensor time series. The key limitation of their work is that their algorithm performs supervised anomaly detection and was tested on a small number of features and a small data set (training and test sets each with 150 samples); hence, it cannot be applied as such to big data. [141] proposed a spectral method for feature extraction and passive acoustic anomaly detection in nuclear power plants. [142] developed an online fuzzy-logic-based expert system for providing clean alarm pictures to system operators for nuclear power plant monitoring. The key limitation of their method is that the model depends on hand-crafted rules, which may not be very accurate given the many possible ways an anomaly can occur. Model-based nuclear power plant monitoring is proposed in [143]. Their model consists of a neural network that takes input signals from the plant; the next component is an expert system that takes input from the neural network and a human operator to make an informed decision about the system's health. The shortcoming of the approach proposed in [143] is that the neural network requires lots of data for training and is supervised. The approach that we take in this chapter builds on the unsupervised learning paradigm and hence differs from the previous studies on nuclear power plant condition monitoring.
In this section, we discuss the datasets used in our experiments. The datasets, with their train/test size, feature size, ratio of positive to negative samples, and sparsity, are shown in Tables 2.1, 2.2 and 2.3. Note that the imbalance ratio shows the ratio of the positive to the negative class in the training set; the test set can have a different imbalance ratio. These benchmark datasets can be freely downloaded from the LIBSVM website [144]; pageblock is available from [145] and also at [146]. The datasets in Table 2.1 have a small number of features with little to no sparsity, while the datasets in Tables 2.2 and 2.3 are high-dimensional data with sparse features. For our purposes, in each dataset, the positive class will be treated as the anomaly that we wish to detect efficiently. We briefly describe the various datasets used in our experiments.
• Kddcup 2008 is a breast cancer detection dataset that consists of 4 X-ray images, two images of each breast. Each image is represented by several candidate regions. After much pre-processing, the kddcup 2008 dataset contains information on 102,294 suspicious regions overall, each described by 117 features. Each region is either "benign" or "malignant", and the ratio of malignant to benign regions is 1:163.19. Due to this huge imbalance, the task of identifying malignant regions (anomalies) is challenging.
• Breast Cancer Wisconsin Diagnostic data contains digitized images of a fine needle aspirate of a breast mass. The features delineate the characteristics of the cell nuclei present in the image; some of them include the radius, texture, area, perimeter, etc. of the nucleus. The key task is to classify images as benign or malignant.
• Page blocks consists of blocks from the page layout of a document. Each block is one of five types: (1) text, (2) horizontal line, (3) picture, (4) vertical line, (5) graphic, and is produced by a segmentation process. Some of the features are height, length, area, blackpix, etc. We converted the multi-class classification problem into a binary one by relabeling horizontal lines as the positive class (+1) and the rest of the labels as the negative class (-1). The task is to detect the positive class (anomaly) efficiently. Note that the positive class is only a small fraction (6%) of the total data.
• W8a is a dataset of keywords extracted from web pages, where each feature is a sparse binary feature. The task is to classify whether a web page falls into a given category or not. The w8a and a9a (described below) datasets were originally used by J.C. Platt [147].
• A9a is census data that contains features such as age, workclass, education, sex, marital status, etc. The task is to predict whether a person's income exceeds $50K/yr. The challenge is that the number of individuals with income above $50K/yr is very small.
• German contains credit assessments of customers in terms of good or bad credit risk. Some of the features in the dataset are credit history, status of the existing checking account, purpose, credit amount, etc. The challenge lies in identifying a small fraction of fraudulent customers among a huge number of loyal customers.
• Covtype contains information about forest types and associated attributes. The task is to predict the type of forest cover from cartographic variables (no remotely sensed images). The cartographic variables were derived from US Geological Survey and USFS data. Some of the features include Elevation, Aspect, Slope, Soil_type, etc. Covtype is a multi-class classification dataset; to convert it to a binary-class dataset, we follow the procedure given in [148]. In short, we treat class 2 as the positive class and the other 6 classes as the negative class.
• Ijcnn1 consists of time-series samples produced by a 10-cylinder internal combustion engine. Some of the features include crankshaft speed in RPM, load, acceleration, etc. The task is to detect misfires (anomalies) in certain regions of the load-speed map.
• Magic04 comprises simulations of high-energy gamma particles in a ground-based gamma telescope. The task is to discriminate images caused by primary gammas (the signal) from images of hadronic showers caused by cosmic rays (the background) [146]. The actual dataset is generated by Monte Carlo sampling.
• Cod-rna comes from the bioinformatics domain. It consists of long sequences of coding and non-coding RNAs (ncRNA). Non-coding RNAs play a vital role in the cell, and many of them remain undiscovered to date. The task is to detect novel non-coding RNAs (anomalies) to better understand their functionality.
• News20 is a collection of 20,000 newsgroup posts on 20 topics. Some of the topics include comp.graphics, sci.crypt, sci.med, talk.religion, etc. The original news20 dataset is a multi-class classification dataset; however, Chih-Jen Lin et al. [144] have converted it into a binary-class dataset, which we use directly.
• Rcv1 is a benchmark collection of newswire stories made available by Reuters, Ltd. The data is organized into four major topics: ECAT (Economics), CCAT (Corporate/Industrial), MCAT (Markets), and GCAT (Government/Social). Chih-Jen Lin et al. have preprocessed the dataset so that ECAT and CCAT denote the positive category, whereas MCAT and GCAT designate the negative category.
• Url [149] is a collection of URLs. The task is to detect malicious URLs (spam, exploits, phishing, DoS, etc.) among normal URLs. The authors represent the URLs using host-based and lexical features. Some of the lexical feature types are hostname, primary domain, path tokens, etc., and host-based features include WHOIS info, IP prefix, connection speed, etc.
• Realsim is a collection of UseNet articles [150] from 4 discussion groups: real autos, real aviation, simulated auto racing, and simulated aviation. The data is often used in binary classification to separate real from simulated, hence the name.
• Gisette was constructed from the MNIST dataset [151]. It is a handwritten digit recognition problem, and the task is to classify confusing digits. The dataset also appeared in the NIPS 2003 feature selection challenge [152].
• Pcmac dataset is a modified form of the news20 dataset.
• Webspam contains information about web pages. There exists a category of web pages whose primary goal is to manipulate search engines and web users; for example, a phishing site is created to mimic an e-commerce site so that the creators of the phishing site can divert credit card transactions to their account. To combat this issue, the web spam corpus was created in 2011. The corpus consists of approximately 0.35 million web pages, each represented by a bag-of-words model. The dataset also appeared in the Pascal Large-Scale Learning Challenge in 2008 [153]. The task is to classify each web page as spam or ham. The challenge comes from the high dimensionality and sparse features of the dataset.
Table 2.2: Summary of sparse data sets used in the experiments in Chapter 4
Table 2.3: Summary of sparse data sets used in the experiments in Chapter 5
In this chapter, we presented a detailed summary of the relevant work in the data mining and machine learning domains for combating the anomaly detection problem. From our literature survey, we find that:
• Traditional approaches to anomaly detection in big data have a number of limitations. Some important ones are as follows: statistical techniques require the underlying data distribution to be known a priori; proximity-based and density-based techniques require appropriate metrics to be defined for calculating the anomaly score and have high time complexity; and clustering-based techniques are computationally intensive. Most traditional anomaly detection techniques assume a static candidate anomaly set and are not able to handle evolving anomalies.
• Non-parametric techniques are useful for anomaly detection in real-world data for which class labels and the data distribution are not known in advance. Further, non-parametric techniques are also able to handle high-dimensional data with varying data distributions. However, most research based on non-parametric anomaly detection techniques has assumed the data to be homogeneous and static in nature. Therefore, there is scope for extending existing non-parametric techniques to heterogeneous, distributed data streams.
• Multiple kernel learning (MKL) methods and their variants have the advantage of addressing the curse of dimensionality. But such techniques have been applied mostly to homogeneous and static data. Further research needs to be done to explore the possibility of using MKL in streaming and distributed anomaly detection scenarios. In addition, the hyperparameters of the kernel function are typically set to pre-defined constant values; automatic learning of hyperparameters is also an open problem.
• Non-negative matrix factorization based methods have the advantage of being able to handle anomaly detection in high-dimensional and sparse data scenarios. But they have mostly been applied to centralized data, although much real-world data is distributed in nature. Therefore, there is scope for further research in this direction.
• Random projection based techniques are useful once the intrinsic structure of the data and the number of dimensions to be used for projection are known. Therefore, there is a need for devising methods that help choose the correct number of dimensions for projection.
In conclusion, we find that neither traditional nor modern approaches are able to detect anomalies in big data efficiently, i.e., to handle the major big data issues such as streaming, sparse, distributed and high-dimensional data. In the next and subsequent chapters, we propose our work to tackle the aforementioned issues in an incremental fashion.
Chapter 3
Proposed Algorithm: PAGMEAN
In this chapter, we tackle the anomaly detection problem in a streaming environment using online learning. The reason for conducting such a study is that most real-world data is streaming in nature; for example, measurements from sensors form a stream. Because of the dynamic nature of streaming data and the inability to store it, there is an urgent need to develop efficient algorithms for the anomaly detection problem in a streaming environment. We propose an algorithm called Passive-Aggressive GMEAN (PAGMEAN), based on the classic online Passive-Aggressive (PA) algorithm [34]. We show that the PA algorithm is sensitive to outliers and cannot be directly applied to anomaly detection. Therefore, we introduce a modified hinge loss that is a convex surrogate for the indicator function (defined later in this chapter); the indicator function is obtained from maximizing the Gmean metric directly. The major challenge is that the Gmean metric is non-decomposable, that is, it cannot be written as a sum of losses over individual data points. We exploit the modified hinge loss within the PA framework to arrive at the PAGMEAN algorithms. We empirically show the effectiveness and efficiency of PAGMEAN over various benchmark data sets and compare with state-of-the-art techniques from the literature.
3.1 Introduction
In this chapter, we aim to tackle the streaming aspect of big data while detecting anomalies. Throughout this chapter and subsequent chapters, we make the following assumption:
The reason for this assumption is that outliers/anomalies are present only in tiny amounts compared to normal samples, so the outlier/anomaly detection problem can be addressed via the class-imbalance learning problem. Our focus will be the detection of point anomalies through the use of class-imbalance learning.
First, we introduce some notation for ease of exposition. Examples in our data arrive in a streaming fashion. At time t, the instance-label pair is denoted as (xt, yt), where xt ∈ Rn and yt ∈ {−1, +1}. We consider a linear classifier of the form ft(xt) = wtT xt, where wt is the weight vector. Let ŷt be the prediction for the tth instance, i.e., ŷt = sign(ft(xt)), whereas the value |ft(xt)|, known as the "margin", is used as the confidence of the learner on the tth prediction step. In binary classification there are two classes; we assume that the minority class is the positive class, labeled +1. Let P and N denote the total number of positive and negative examples received so far, respectively. During prediction, four cases can occur: true positive (Tp), true negative (Tn), false positive (Fp) and false negative (Fn), i.e., the number of positive examples classified as positive, negative examples classified as negative, negative examples classified as positive, and positive examples classified as negative, respectively. Note that time t is implicit in the above notation, that is, Tp can also denote the total number of examples classified as positive up to time t = 1, 2, ..., T; the meaning will be clear from the context. Our objective is to maximize the Gmean metric for the class-imbalanced problem.
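In terms of these counts, the standard definitions underlying our objective are:

$$\text{sensitivity} = \frac{T_p}{T_p + F_n}, \qquad \text{specificity} = \frac{T_n}{T_n + F_p}, \qquad G_{\text{mean}} = \sqrt{\text{sensitivity} \times \text{specificity}}.$$

Since Gmean is the geometric mean of the per-class accuracies, a classifier that ignores the minority class drives it to zero, which is what makes it a suitable objective for class-imbalanced streams.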
3.3 Experiments
For comparative evaluation of our proposed algorithms, we use the datasets presented in Table 2.1. Note that the PAGMEAN algorithms are tested only on small-scale datasets. The reason is that the PAGMEAN algorithms, though online, may not handle high-dimensional data (the number of features going into millions and above) in a timely manner, i.e., they will run slowly. Secondly, the PAGMEAN algorithms do not exploit the sparsity structure present in the data.
We compare our algorithms with the parent algorithm PA and its variants, along with the recently proposed cost-sensitive algorithm CSOC of [29]. In [29], it is claimed that CSOC outperforms many state-of-the-art algorithms such as PAUM, ROMMA, agg-ROMMA, and CPA-PB; hence, we only compare with CSOC. We further emphasize that PA, PAGMEAN and CSOC are all first-order methods that only use the gradient of the loss function. On the other hand, comparison with ARROW [115], NAROW [116], CW [117], etc. is not presented, since they use second-order information (the Hessian) and their scalability over large data sets is poor. Comparison with SVMperf [154] would also not be fair because (i) it does not optimize Gmean using a surrogate loss function and (ii) it is an offline solution.
To make a fair comparison, we first split each dataset into validation and test sets randomly (online algorithms do not require separate train and test sets). The validation set is used to find the optimal values of the parameters, if any, of each algorithm. The aggressiveness parameter C in the PA algorithm is searched over 2^[−6:1:6], and the parameter λ in PAGMEAN and its variants, as well as in CSOC, is searched over 2^[−10:1:10]. We report our results on test data over 10 random permutations. Note that we do not perform feature scaling, as it is against the ethos of online learning, where we can access one example at a time and, as such, z-score normalization is not feasible.
Secondly, most implementations (in Matlab) of online learning available today, e.g., Libol [155], DOGMA [156], UOSLIB [157], are not out-of-core algorithms: they process examples by fetching the entire data into main memory and thus violate the principle of online learning. Our implementation is based on the premise that the data set is too large to fit into RAM and we can see one example at a time.
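A minimal sketch of this out-of-core reading pattern follows, assuming the data is stored in LIBSVM text format; the path and batch size are placeholders, and our actual Matlab implementation differs in detail.

import numpy as np

def stream_examples(path, d, batch_size=1000):
    """Out-of-core streaming sketch: read a LIBSVM-format file in mini-batches
    and yield one (dense) example at a time, so the full data never sits in RAM."""
    with open(path) as f:
        while True:
            lines = [f.readline() for _ in range(batch_size)]
            lines = [ln for ln in lines if ln.strip()]
            if not lines:
                break
            for ln in lines:
                tokens = ln.split()
                y = float(tokens[0])
                x = np.zeros(d)
                for feat in tokens[1:]:
                    idx, val = feat.split(":")
                    x[int(idx) - 1] = float(val)  # LIBSVM indices are 1-based
                yield x, y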
As described in Section 1, Gmean is a robust metric for class-imbalance problems; hence, we use Gmean to measure the performance of the various algorithms. Results on 6 benchmark datasets ("pageblock", "w8a", "a9a", "german", "ijcnn1", and "covtype") and 4 real-world data sets ("breast-cancer", "kddcup2008", "magic04", and "cod-rna") are shown in (Fig. 3.1, Table 3.1) and (Fig. 3.3, Table 3.2), respectively. We also show the mistake rate of all the compared algorithms on the benchmark and real data sets in (Fig. 3.2, Table 3.1) and (Fig. 3.4, Table 3.2), respectively. The mistake rate of an online algorithm is defined as the number of mistakes made by the algorithm over time. Note that we do not show the running time of the out-of-core implementation, since it depends on the speed of the device where the data is stored (hard disk, tape or network storage). We assume that the data is stored in files and read sequentially in mini-batches, where the batch size is arbitrary or can depend on the available RAM.
1. Evaluation of Gmean
We first evaluate Gmean on the various benchmark data sets given in Table 2.1. The online average of Gmean with respect to the sample size is reported in Fig. 3.1. From the figure, we can observe several things. Firstly, the PAGMEAN2 algorithm outperforms its parent algorithm PA on all 6 data sets. Secondly, PAGMEAN2 beats CSOC on all data sets but ijcnn1. Thirdly, among the PAGMEAN algorithms, PAGMEAN2 outperforms PAGMEAN and PAGMEAN1 on the pageblock, w8a, german, and a9a data sets. Table 3.1 reports Gmean averaged over 5 runs; from the table, we can see that the PAGMEAN algorithms outperform their parent algorithm PA on all data sets. These results indicate the potential applicability of the PAGMEAN algorithms for real-world class-imbalance detection tasks.
Another observation that can be drawn from Fig. 3.1 is that initially there is a performance drop in all the algorithms due to the small number of samples available for learning. This phenomenon is noticeable on the german and pageblock data sets, as these are small data sets. The online performance of the algorithms is not smooth on some data sets, e.g., pageblock, german, ijcnn1. This could be due to a sudden change in the class distribution (also known as concept drift), meaning that the presence of a cluster of examples from the positive class or the negative class in the data has a severe effect on the model.
2. Evaluation of Mistake Rate
In this section, we discuss the mistake rate of the PAGMEAN algorithms. The online average of the mistake rate of the various algorithms with respect to the sample size is shown in Fig. 3.2. We can draw several conclusions from it. Firstly, as more samples are received by the online algorithm, the mistake rate decreases on all data sets except pageblock and german, where it appears to increase due to the small-sample problem. Secondly, the PAGMEAN algorithms suffer a higher mistake rate compared to their parent algorithm PA. This is in contrast to the common intuition that the convex surrogate loss should be more sensitive to class imbalance and hence yield a lower mistake rate than the hinge loss employed by the PA algorithms. Thirdly, among the PAGMEAN algorithms, PAGMEAN1 suffers a smaller mistake rate than PAGMEAN2 on all data sets. The mistake rate averaged over 5 runs of all the compared algorithms is shown in Table 3.1.
Figure 3.1: Evaluation of Gmean over various benchmark data sets. (a) pageblock (b) w8a (c) german (d) a9a (e) covtype (f) ijcnn1. In all the figures, PAGMEAN algorithms either outperform or are equally good with respect to their parent algorithm PA and the CSOC algorithm. [Plots of the online average of Gmean vs. sample size for PAGMEAN, PAGMEAN1, PAGMEAN2, CSOC, PA, PA1, and PA2; images not recoverable from the text.]
Figure 3.2: Evaluation of Mistake rate over various benchmark data sets. [Plots of the online average of Mistake rate (%) vs. sample size for the same algorithms; images not recoverable from the text.]
It can be observed that the PAGMEAN algorithms suffer a mistake rate that is not statistically significantly higher (on the Wilcoxon rank-sum test) than that of their counterpart PA algorithms on 3 out of 6 data sets (w8a, german, covtype). All of these observations indicate that further work, in theory and practice, is needed.
Table 3.1: Evaluation of Gmean and Mistake rate (%) on benchmark data sets. Entries marked by * are not statistically significant at the 95% confidence level compared to the entries marked by ** on the Wilcoxon rank-sum test.
Table 3.2: Evaluation of Gmean and Mistake rate (%) on real data sets. Entries marked by * are not statistically significant at the 95% confidence level compared to the entries marked by ** on the Wilcoxon rank-sum test.
[Table 3.2 entries (Gmean and Mistake rate (%) of each algorithm on magic04 and cod-rna) are not recoverable from the text.]
Figure 3.3: Evaluation of Gmean over various real data sets. [Plots not recoverable from the text.]
Figure 3.4: Evaluation of Mistake rate over various real data sets. [Plots not recoverable from the text.]
3.4 Discussion
In the proposed work, we attempt to make the classical passive-aggressive algorithms insensitive to outliers and apply them to class-imbalance learning and anomaly detection problems. To solve the aforementioned problem, we maximize the Gmean metric directly. Since direct maximization of Gmean is NP-hard, we resort to a convex surrogate loss function and minimize a modified hinge loss instead. The modified hinge loss is utilized within the PA framework to make it insensitive to outliers, and new algorithms, collectively called PAGMEAN, are derived. The empirical performance of all the derived algorithms is tested on various benchmark and real data sets. From the discussion above, we conclude that our derived algorithms perform as well as the other algorithms (PA and CSOC) in terms of Gmean. This indicates the potential applicability of the PAGMEAN algorithms to real-world class-imbalance and anomaly detection problems. However, the mistake rate of the proposed algorithms is surprisingly higher than that of the compared algorithms on some datasets; therefore, further work is required to identify the exact reasons for the higher mistake rate.
Finally, we would like to highlight that results on high-dimensional data (where the number of features is in the millions) could not be included. The reason is that even online algorithms run slowly when working in the full feature space. In the next chapter, we propose an algorithm that exploits the sparsity present in the data and scales to millions of dimensions.
Chapter 4
Proposed Algorithm: ASPGD
In the previous chapter, we proposed an online algorithm that tackles the streaming anomaly detection problem. However, one limitation of the PAGMEAN algorithm is that it is not able to exploit the sparsity structure present in big data. To solve this problem, we propose another online algorithm that addresses the sparse, streaming, high-dimensional aspects of big data during anomaly detection. As before, we employ the class-imbalance learning mechanism to handle the point anomaly detection problem. The problem formulation in the present work uses an L1-regularized proximal learning framework and is solved via an Accelerated Stochastic Proximal Gradient Descent (ASPGD) algorithm. Within the ASPGD algorithm, we use a smooth and strongly convex loss function. This loss function is insensitive to class imbalance, since it is directly derived from the maximization of Gmean. The work presented in the current chapter demonstrates (i) the application of proximal algorithms to solve real-world problems (class imbalance), (ii) how it scales to big data, and (iii) how it outperforms some recently proposed algorithms in terms of Gmean, F-measure and Mistake rate on several benchmark data sets.
4.1 Introduction
As discussed in Section 2.3, there are serious issues with using classical approaches to anomaly detection. Firstly, sampling-based techniques are scalable neither in the number of samples nor in the data dimensionality in the case of big data. Secondly, existing works on sampling such as [97, 158] do not exploit the rich structure present in the data, such as sparsity. Kernel-based
methods suffer from poor scalability and long training times. For example, if the data dimensionality is of the order of millions, kernel-based methods will require storage of the Gram matrix (also known as the kernel matrix) of size million × million, which is prohibitive for machines with limited memory. Cost-sensitive learning has recently gained popularity in addressing the class imbalance problem [28, 29] because of: (i) the ability to learn the cost assigned to different classes in a data-dependent way, (ii) scalability, and (iii) the ability to exploit sparsity. The present work builds upon cost-sensitive learning and extends the work presented in [28].
To further address the class imbalance problem, the choice of metric used to evaluate the performance of different methods is crucial. For example, using accuracy as a class-imbalance performance measure may be misleading. Consider 99 negative examples and 1 positive example in a dataset, where our objective is to classify each positive example correctly. A classifier which labels every example as negative will have an accuracy of 99%, which is misleading because our goal was to detect the positive example. Therefore, researchers have devised alternative performance metrics to assess methods for the class imbalance problem. Among them are recall, precision, F-measure, ROC (receiver operating characteristic) and Gmean [159]. Gmean has been found to be robust to the class imbalance problem [159].
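The 99-to-1 example above can be checked in a few lines; the calculation below uses the standard definitions of the metrics.

import math

# A classifier that labels everything negative on 99 negatives / 1 positive.
tp, fn = 0, 1      # the lone positive example is missed
tn, fp = 99, 0     # every negative example is correct

accuracy    = (tp + tn) / (tp + tn + fp + fn)       # 0.99: looks excellent
sensitivity = tp / (tp + fn)                        # 0.0: no positive detected
specificity = tn / (tn + fp)                        # 1.0
gmean       = math.sqrt(sensitivity * specificity)  # 0.0: exposes the failure

print(accuracy, gmean)                              # 0.99 0.0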
In this work, we tackle the class imbalance problem in an online setting, exploiting the sparsity and high-dimensional characteristics of big data. Our contributions are as follows:
2. We show, through extensive experiments on real and benchmark data sets, the effectiveness and efficiency of the proposed algorithm over various state-of-the-art algorithms in the literature.
3. In our work, we also show that Nesterov's acceleration [160] does not always help in achieving higher Gmean.
4. The effect of the learning rate η and the sparsity regularization parameter λ on Gmean, F-measure, and Mistake rate is studied empirically.
First, we establish some notation for the sake of clarity. Input data is denoted as instance-label pairs {xt, yt}, where t = 1, 2, ..., T, xt ∈ X ⊆ Rd and yt ∈ Y ⊆ {−1, +1}. In the offline setting, we are allowed to access the entire data and, usually, T is finite. On the other hand, in the online setting we see one example at a time and T → ∞. At time t, an (instance, label) pair is denoted by (xt, yt). We consider a linear functional of the form ft(xt) = wtT xt, where wt is the weight vector. Let ŷt be the prediction for the tth instance, i.e., ŷt = sign(ft(xt)), whereas the value |ft(xt)|, known as the 'margin', is used as the confidence of the learner in the tth prediction step. We work under the following assumptions on the model function f.
A function f is L-smooth if its gradient is L-Lipschitz, i.e.,
$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \forall x, y.$$
In other words, the rate of change of the gradient of f is upper bounded by L > 0. Here, ‖·‖ denotes the Euclidean norm unless otherwise stated. The function f is µ-strongly convex if
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2} \|y - x\|^2,$$
where µ > 0 is the strong convexity parameter. Intuitively, strong convexity is a measure of the curvature of the function: a function with large µ has high positive curvature. Interested readers may refer to [161] and the appendix.
Our problem formulation is the same as that presented in Chapter 3, that is, we want to maximize Gmean, for which we use the lemma from Chapter 3. However, there are a number of issues in using the loss function from Chapter 3, namely
$$\ell(f, (x, y)) = \rho \max(0, 1 - y f(x)), \quad (4.1)$$
where
$$\rho = \frac{N}{P} I_{(y=1)} + \frac{P - F_n}{P} I_{(y=-1)}.$$
The parameter ρ controls the penalty imposed when the current model misclassifies the incoming example. The important point here is that we impose a high penalty (the ratio N/P is high for N ≫ P in the class imbalance scenario) on misclassifying a positive example (y = +1). On the other hand, when the model misclassifies a negative sample, the penalty imposed is (P − Fn)/P, which is small when we have a small number of false negatives. It is to be noted that the loss function in (4.1) is non-differentiable at the hinge point. Thus, it is not directly applicable to algorithms that require L-smooth and µ-strongly convex loss functions (please refer to the appendix for the definitions of smooth and strongly convex functions). In Section 4.2.3, we review some proximal algorithms that work under the assumption that the loss function is smooth and strongly convex. Therefore, we introduce a smooth and strongly convex function that upper bounds the indicator function in (4.1):
$$\ell(f, (x, y)) = \frac{\rho}{2} \max(0, 1 - y f(x))^2 \quad (4.2)$$
The above loss function is utilized within the accelerated stochastic proximal learning framework in Section 4.2.2. Note that the loss function in (4.2) is strongly convex with strong convexity parameter µ = ρ. The strong convexity parameter depends on the hinge point of the loss function in (4.2); since the hinge point is a data-dependent term, we estimate it in an online fashion.
We now describe the proximal learning framework [162], which aims to solve composite optimization problems of the form
$$\min_{w \in \mathbb{R}^d} f(w) + r(w), \quad (4.3)$$
where f(·) is an average of convex and differentiable functions, i.e., $f(w) = \frac{1}{T} \sum_{t=1}^{T} f_t(w)$, and r : Rd → R is a 'simple' convex function that can be non-differentiable. Note that the framework (4.3) is quite general and encompasses many algorithms. For example, if we set ft(w) = max(0, 1 − y wT x) and r(w) = λ‖w‖₂², we get the L2-regularized SVM. On the other hand, if we set ft(w) = (y − wT x)² and r(w) = λ‖w‖₂², we get ridge regression. For our problem, ft(w) is given in (4.2). Since we want to solve (4.3) under a sparsity constraint, we utilize the sparsity-inducing L1 norm of w, i.e., r(w) = λ‖w‖₁ (which is non-differentiable at 0).
There are many algorithms for solving problem (4.3) in the offline setting under different assumptions on the loss function f(·) and the regularizer r(·) [163–166]. Since our aim is to solve problem (4.3) in the online learning framework, the techniques mentioned there cannot be used. Under our assumptions on the forms of f(·) and r(·) (strongly convex and non-smooth, respectively), subgradient methods could be a candidate. However, these methods have a notoriously slow convergence rate of O(1/ε²) (iteration complexity), i.e., to obtain an ε-accurate solution we need O(1/ε²) iterations. Hence, we resort to proximal learning algorithms because of their simplicity, ability to handle non-smooth regularizers, scalability, and faster convergence under certain assumptions [162].
$$\operatorname{prox}_{\eta r}(v) \triangleq \operatorname*{arg\,min}_{u} \frac{1}{2\eta} \|u - v\|_2^2 + r(u), \quad (4.5)$$
where η is the step size and ‖·‖₂ is the 2-norm. ∇ denotes the gradient of the loss function f(·).
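For r(w) = λ‖w‖₁, the prox in (4.5) has a well-known closed form, elementwise soft-thresholding (the operator used later in Algorithm 2); a minimal sketch:

import numpy as np

def prox_l1(v, eta, lam):
    """prox_{eta * r}(v) for r(w) = lam * ||w||_1:
    elementwise soft-thresholding with threshold eta * lam."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)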
One of the major drawbacks of using Proximal Gradient Descent (PGD) in the offline setting is that it is not scalable to large-scale data sets, as it requires the entire data set to compute the gradient. In the proposed work, we focus on algorithms that are memory-aware. Such algorithms come under the online learning framework (and are also known as stochastic algorithms). A simple Stochastic Proximal Gradient Descent (SPGD) update rule at the tth time step is given by
$$w_{t+1} = \operatorname{prox}_{\eta r}\big(w_t - \eta \nabla f_t(w_t)\big),$$
where ∇ft(·) is evaluated at the tth example. In order to achieve acceleration (faster convergence) in online learning, we follow Nesterov's method [160]. Specifically, Nesterov's method achieves acceleration by introducing an auxiliary variable u such that the weight vector at time t is a convex combination of the weight at time t − 1 and ut, i.e.,
$$w_t = (1 - \gamma) w_{t-1} + \gamma u_t.$$
We note that similar ASP algorithms based on Nesterov's accelerated method have recently appeared in [167, 168]. However, [167] propose an accelerated algorithm with variance reduction, and [168] propose an accelerated algorithm that uses two sequences for computing the weight vector w, where one of the sequences utilizes the Lipschitz parameter of the smooth component of the composite optimization. On the other hand, we exploit the strong convexity of the smooth component. Besides, in our present work, we aim to show how effective ASP algorithms are in dealing with the class imbalance problem.
In this section, we present the algorithm for solving (4.3) in the online setting. Our proposed algorithm, called ASPGD, is based on the SPGD framework with Nesterov's acceleration. The ASPGD algorithm, presented in Algorithm 2, is able to handle class imbalance problems.
Algorithm 2 ASPGD: Accelerated Stochastic Proximal Gradient Descent Algorithm for Sparse Learning
Require: η > 0, λ, γ = (1 − √(µη)) / (1 + √(µη))
Ensure: wT+1
1: for t := 1, ..., T do
2:   receive instance xt
3:   vt = ∇Φ*t(θt)
4:   ut = prox_{ηr}(vt)
5:   wt = (1 − γ)wt−1 + γut
6:   predict ŷt = sign(wtT xt)
7:   receive true label yt ∈ {−1, +1}
8:   suffer loss: ℓt(yt, ŷt) as given in (4.2)
9:   if ℓt(yt, ŷt) > 0 then
10:    update:
11:    θt+1 = θt − η∇ℓt(wt)
12:  end if
13: end for
In Algorithm 2, Φ is some µ-strongly convex function, such as ½‖·‖₂², and the superscript ∗ on Φ denotes its convex conjugate (see the appendix for the definition); θ is some vector in Rn.
At this point, we emphasize that [169] recently proposed algorithms for sparse learning (see Algorithm 1 in [169]). Their algorithm is a special case of our algorithm and can be analyzed under the Stochastic Proximal Learning (SPL) framework: setting γ = 1 and ℓt to the hinge loss in ASPGD, we obtain Algorithm 1 in [169]. The same authors' extended paper [28] proposed algorithms for class-imbalanced sparse learning; Algorithm 6 in [28] is a special case of ASPGD without acceleration, and it too can be analyzed under the SPL framework.
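A minimal Python sketch of Algorithm 2 follows, assuming Φ = ½‖·‖₂² so that ∇Φ*(θ) = θ, and reusing the prox_l1 and smooth_cs_loss_grad helpers sketched earlier; the online estimator rho_of_t of ρ is a placeholder.

import numpy as np

def aspgd(stream, d, lam, rho_of_t):
    """Sketch of Algorithm 2 with Phi = (1/2)||.||_2^2, so grad Phi*(theta) = theta.
    gamma is recomputed per step because rho (= mu) is estimated online."""
    theta = np.zeros(d)
    w_prev = np.zeros(d)
    for t, (x, y) in enumerate(stream, start=1):
        mu = rho_of_t(t)                        # online estimate of rho (= mu)
        eta = 1.0 / (mu + 1.0)                  # guarantees mu * eta < 1
        gamma = (1 - np.sqrt(mu * eta)) / (1 + np.sqrt(mu * eta))
        v = theta                               # step 3: v_t = grad Phi*(theta_t)
        u = prox_l1(v, eta, lam)                # step 4: soft-thresholding
        w = (1 - gamma) * w_prev + gamma * u    # step 5: Nesterov combination
        y_hat = 1.0 if w @ x >= 0 else -1.0     # step 6: predict
        loss, grad = smooth_cs_loss_grad(w, x, y, mu)  # step 8: loss (4.2)
        if loss > 0:                            # steps 9-12: dual update
            theta = theta - eta * grad
        w_prev = w
    return w_prev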
4.3 Experiments
In this section, we empirically validate the performance of the ASPGD algorithm on the various benchmark data sets given in Table 2.2. Notice that NEWS2 and PCMAC are balanced data sets. All the algorithms were run in MATLAB 2012a (64-bit version) on a 64-bit Windows 8.1 machine.¹
For evaluation purposes, we compared ASPGD and its variant without acceleration (which we call ASPGDNOACC) with the recently proposed algorithm CSFSOL [28]. In [28], the authors compared their first- and second-order algorithms with a collection of cost-sensitive algorithms (CS-OGD, CPA, PAUM, etc.) and found that CSFSOL and its second-order version CSSSOL outperform the aforementioned algorithms in terms of a metric called balanced accuracy (defined as 0.5 × sensitivity + 0.5 × specificity). For this reason, we compare our ASPGD and ASPGDNOACC with the CSFSOL algorithm (notice that CSSSOL is a second-order algorithm, hence no comparison with CSSSOL is made). For performance evaluation, we use Gmean and Mistake rate as the performance metrics. The time taken by these algorithms is not shown, since each algorithm reads data in mini-batches and processes it online; hence, the total time consumed will depend on how fast we can read data from the storage device, such as a hard disk, the network, etc. The most time-consuming operation in Algorithm 2 is evaluating the prox operator; in our case, the prox operator is the soft-thresholding operator, which has a closed-form solution [170]. The other steps in Algorithm 2 take O(1) time.
¹Our code is available at https://sites.google.com/site/chandreshiitr/publication
For parameter selection (learning rate η), we fix the sparsity regularization parameter λ = 0 and perform a grid search as in [28]. Note that, for the ASPGD algorithm, the strong convexity parameter µ is equal to the ρ shown in (4.1) (which is easy to show), and the parameter ρ can be calculated in an online fashion. Hence, no parameter tuning is required for ASPGD, in contrast to CSFSOL and ASPGDNOACC. Further, the parameter ρ (and µ thereof) can be greater than 1; hence, we have to set η such that µη < 1 for the parameter γ to be a valid convex combination coefficient. In our subsequent experiments, we set η = 1/(µ + 1). In subsection 4, we also demonstrate the effect of varying the learning rate. In [167], it is stated that a diminishing learning rate helps in reducing the variance introduced by random sampling, but leads to a slower convergence rate; keeping that in mind, we set the aforementioned value of the learning rate.
The results presented in the next subsection have been averaged over 10 random permutations of the test data and are shown on semi-log plots. No feature scaling has been used, as it is against the ethos of online learning: online learning dictates that only a subset of the entire data set is seen at any point in time, thus matching the more practical scenario of real-world data in our simulation.
1. Evaluation of Gmean
In this section, we evaluate Gmean on the various benchmark data sets shown in Table 2.2. The results are presented in Figure 4.1 and Table 4.1. From Figure 4.1, several conclusions can be drawn. First, the ASPGD algorithm outperforms ASPGDNOACC on 6 out of 8 data sets (news2, gisette, rcv1, url, pcmac, and webspam). This indicates that Nesterov's acceleration helps in achieving higher Gmean. Secondly, ASPGD either outperforms or performs as well as CSFSOL on 6 out of 8 data sets (news, gisette, rcv1, url, pcmac, and webspam). Thirdly, on the news2 and realsim data sets, all algorithms suffer performance degradation. This may be due to a sudden change in the concept or the class distribution, and it shows an inherent limitation of the ASPGD and CSFSOL algorithms in addressing concept drift. The same observation can be made from the cumulative Gmean results presented in Table 4.1. For example, on the news data set, ASPGD achieves a cumulative Gmean that is statistically significantly higher than that achieved by CSFSOL, and vice versa on the rcv1 data set.
Figure 4.1: Evaluation of online average of Gmean over various benchmark data sets. (a) news (b) news2 (c) gisette (d) realsim (e) rcv1 (f) url (g) pcmac (h) webspam. [Plots of the online average of Gmean vs. sample size for ASPGD, ASPGD No Acc, and CSFSOL; images not recoverable from the text.]
Figure 4.2: Evaluation of mistake rate over various benchmark data sets. (a) news (b) news2 (c) gisette (d) realsim (e) rcv1 (f) url (g) pcmac (h) webspam. [Plots of the online average of Mistake rate (%) vs. sample size for ASPGD, ASPGD No Acc, and CSFSOL; images not recoverable from the text.]
Figure 4.3: Effect of regularization parameter λ on F-measure on (a) news (b) realsim (c) gisette (d) rcv1 (e) pcmac. [Plots of online cumulative F (%) vs. λ for ASPGD, ASPGD No Acc, and CSFSOL; images not recoverable from the text.]
Table 4.1: Evaluation of cumulative Gmean (%) and Mistake rate (%) on benchmark data sets. Entries marked by * are statistically significant compared to the entries marked by **, and entries marked by † are NOT statistically significant compared to the entries marked by ‡, at the 95% confidence level on the Wilcoxon rank-sum test.

                         news                                news2
Algorithm                Gmean(%)           Mistake rate(%)  Gmean(%)          Mistake rate(%)
Proposed ASPGD           90.7 ± 0.003*      2.087 ± 0.139†   99.686 ± 0.015    0.319 ± 0.014
Proposed ASPGDNOACC      96.1 ± 0.003       2.909 ± 0.117    99.426 ± 0.027    0.587 ± 0.029
CSFSOL                   89.5 ± 0.004**     2.013 ± 0.069‡   99.870 ± 0.014    0.136 ± 0.016

                         gisette                             realsim
Algorithm                Gmean(%)           Mistake rate(%)  Gmean(%)          Mistake rate(%)
Proposed ASPGD           74.333 ± 1.174     7.750 ± 0.539    96.319 ± 0.077**  1.022 ± 0.023**
Proposed ASPGDNOACC      41.810 ± 1.308     15.511 ± 0.826   96.636 ± 0.116    2.025 ± 0.036
CSFSOL                   47.727 ± 2.176     6.468 ± 0.184    96.783 ± 0.102*   1.059 ± 0.025*

                         rcv1                                url
Algorithm                Gmean(%)           Mistake rate(%)  Gmean(%)          Mistake rate(%)
Proposed ASPGD           97.422 ± 0.003**   2.576 ± 0.003**  92.139 ± 0.285    3.684 ± 0.151*
Proposed ASPGDNOACC      97.426 ± 0.003     2.574 ± 0.003    90.263 ± 0.380    7.880 ± 0.353
CSFSOL                   97.461 ± 0.004*    2.535 ± 0.004*   86.459 ± 0.497    3.791 ± 0.086**

                         pcmac                               webspam
Algorithm                Gmean(%)           Mistake rate(%)  Gmean(%)          Mistake rate(%)
Proposed ASPGD           79.339 ± 3.149‡    6.778 ± 0.547‡   71.052 ± 1.718    6.100 ± 0.727
Proposed ASPGDNOACC      72.840 ± 4.024     7.122 ± 0.424    51.856 ± 5.034    18.270 ± 3.196
CSFSOL                   80.655 ± 1.918†    6.511 ± 0.360†   46.655 ± 5.684    4.380 ± 0.225
Figure 4.4: Effect of learning rate η in the ASPGD algorithm for maximizing Gmean on (a) news (b) realsim (c) gisette (d) rcv1. [Plots of the online average of Gmean vs. sample size for η = 1/(µ+1), η = 1/(µ+5), and η = 1/(µ+10); images not recoverable from the text.]
Figure 4.5 shows the effect of the regularization parameter λ on Gmean; the algorithms achieve their highest Gmean at different values of λ on the news and gisette data sets. For example, ASPGD achieves its highest Gmean at λ = 0, whereas ASPGDNOACC does so at λ = 1 and CSFSOL at λ = 10 on the news data set. On the other hand, on the realsim and rcv1 data sets, all the algorithms achieve their highest Gmean at λ = 0. Another major observation is that the ASPGD algorithm achieves higher Gmean than CSFSOL on 3 out of 4 data sets (news, realsim, and rcv1) over the entire range of λ values tested. A higher λ value implies more sparsity and hence a sparser model; sparse models are easier to interpret and quicker to evaluate.
In Figure 4.6, the effect of the regularization parameter on Mistake rate is shown. As before, the ASPGD algorithm suffers a smaller Mistake rate than CSFSOL over the entire range of λ values tested on 3 out of 4 data sets (realsim, gisette, and rcv1). Further, smaller values of λ lead to a smaller Mistake rate, which is evident from the monotonically increasing Mistake rate on all data sets and algorithms except ASPGD on the gisette data set.
Remark: The difference between ASPGD and SOL (the generalized version of CSFSOL) is that ASPGD uses (1) a smooth, modified hinge loss and (2) Nesterov's acceleration. The time complexity of both algorithms is O(nd), which is linear in n and d, where n is the number of data points and d is the dimensionality. The only thing that differs in the time complexity is the hidden constant in the big-O notation. In fact, the extra step involved in ASPGD is the summation of two vectors of size d in step 5 of the algorithm, which costs O(d); for big data, where the vectors are usually sparse, this takes O(s) time, where s is the number of nonzeros in wt−1 and ut. The evaluation of the smooth hinge loss differs by a constant (O(1)). In addition, from the implementation point of view, the results reported here assume that the data resides in a back-end store such as a hard disk. We read data in mini-batches and process the examples one by one. Thus, our implementation is more amenable to online learning, where we are not allowed to see all the data in one go, whereas the implementation provided by the authors of SOL loads the entire data into main memory and processes one example at a time, violating the principle of online learning.
Figure 4.5: Effect of regularization parameter λ in the ASPGD algorithm for maximizing Gmean on (a) news (b) realsim (c) gisette (d) rcv1. [Plots of online cumulative Gmean (%) vs. λ for ASPGD, ASPGD No Acc, and CSFSOL; images not recoverable from the text.]
Figure 4.6: Effect of regularization parameter λ in the ASPGD algorithm for minimizing Mistake rate on (a) news (b) realsim (c) gisette (d) rcv1. [Plots of online cumulative Mistake (%) vs. λ for ASPGD, ASPGD No Acc, and CSFSOL; images not recoverable from the text.]
4.4 Discussion
In the present chapter, we handled the streaming, sparse, high-dimensional problems of big data to detect anomalies efficiently. As discussed in Chapter 3, the PAGMEAN algorithm neither scales to high dimensions nor exploits the sparsity present in big data. We follow the same recipe as in the PAGMEAN algorithm to derive the ASPGD algorithm; however, instead of the loss function employed by PAGMEAN, we use a smooth and strongly convex cost-sensitive loss function that is a convex surrogate for the 0-1 loss. The relaxed problem is solved via an accelerated stochastic proximal learning algorithm called ASPGD. Extensive experiments on several large-scale data sets show that the ASPGD algorithm outperforms a recently proposed algorithm (CSFSOL) in terms of Gmean, F-measure and Mistake rate on many of the data sets tested. Further, we also compared the non-accelerated version of the ASPGD algorithm, called ASPGDNOACC, with ASPGD and CSFSOL. From the discussion in Section 4.3, we also conclude that acceleration is not always helpful, neither in terms of Gmean nor Mistake rate.
Because of the massive growth in data size and its distributed nature, there is an immediate need to tackle class imbalance in the distributed setting. In the next chapter, we propose an algorithm for handling the class-imbalance problem in the distributed setting.
Chapter 5
Proposed Algorithms: DSCIL and CILSD
Globalization in the 21st century has given rise to a distributed work culture. As a result,
data is no longer collected at a single place. Instead, it is gathered at multiple locations in a
distributed fashion. Gleaning insightful information from distributed data is a challenging task,
and several concerns need to be addressed properly. Firstly, collecting the whole data
at a single place for knowledge discovery is costly. Secondly, transmitting data over the
network involves security risks; credit card transaction data is one example. To save cost and
minimize the risk of data transportation, there is an urgent need to develop algorithms that can
work in a distributed fashion. This is the main motivation behind the work proposed in the
current Chapter.
We study class-imbalance problems in a distributed setting exploiting the sparsity structure
in the data. We formulate the class-imbalance learning problem as a cost-sensitive learning
problem with L1 regularization. The cost-sensitive loss function is a cost-weighted smooth
hinge loss. The resultant optimization problem is minimized within (i) the Distributed Alternating
Direction Method of Multipliers (DADMM) [171] framework and (ii) a FISTA [163]-like update rule
in a distributed environment. We call the algorithm derived within the DADMM framework
Distributed Sparse Class-Imbalance Learning (DSCIL) and the one derived with the FISTA-like
update rule Class-Imbalance Learning on Sparse data in a Distributed environment (CILSD). The reason
for proposing CILSD is that it improves upon the convergence speed of DSCIL.
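In symbols, the resulting optimization problem can be written as follows; this is our own shorthand for the formulation just described, with cost weights c+ and c− for the positive and negative classes and ℓ denoting the smooth hinge loss (the thesis's exact notation may differ):

\[ \min_{w \in \mathbb{R}^d} \; \frac{1}{m} \sum_{i=1}^{m} c_{y_i}\, \ell\!\left(y_i\, w^{\top} x_i\right) + \lambda \|w\|_1, \qquad c_{y_i} = \begin{cases} c_+, & y_i = +1, \\ c_-, & y_i = -1. \end{cases} \]

The L1 term promotes sparsity in w, which is what both DSCIL and CILSD exploit on sparse data.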
In the DSCIL algorithm, we partition the data matrix across samples through DADMM. This
operation splits the original problem into a distributed L2-regularized smooth loss minimization
and an L1-regularized squared loss minimization. The L2-regularized subproblem is solved via L-
BFGS and a random coordinate descent method in parallel at multiple processing nodes using the
Message Passing Interface (MPI), while the L1-regularized problem reduces to a simple
soft-thresholding operation. We show, empirically, that the distributed solution matches the
centralized solution on many benchmark data sets. The centralized solution is obtained via
Cost-Sensitive Stochastic Coordinate Descent (CSSCD).
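For concreteness, the soft-thresholding step takes the standard closed form familiar from ADMM treatments of the lasso (see, e.g., [171]); in that notation, with penalty parameter ρ, the z-update is

\[ z^{k+1} = S_{\lambda/\rho}\!\left(\bar{x}^{k+1} + \bar{u}^{k}\right), \qquad S_{\kappa}(a) = \operatorname{sign}(a)\,\max\!\left(|a| - \kappa,\, 0\right), \]

with S applied elementwise, so this subproblem costs only O(d) per iteration.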
In the CILSD algorithm, we partition the data across examples and distribute the subsamples to
different processing nodes. Each node runs a local copy of the FISTA-like algorithm, which is a
distributed implementation of the prox-linear algorithm for cost-sensitive learning. Empirical
results on small and large-scale benchmark data sets show promising avenues for further
investigating real-world applications of the proposed algorithms, such as anomaly detection
and class-imbalance learning. To the best of our knowledge, ours is the first work to study
class imbalance in a distributed environment on large-scale sparse data.
5.1 Introduction
In the present work, we address point anomaly detection through class-imbalance learning
in big data in a distributed setting, exploiting the sparsity structure.
Without exploiting sparsity, learning algorithms run slower since they have to work
in the full feature space. We propose to solve the class-imbalance learning problem from a cost-
sensitive learning perspective for several reasons. First, cost-sensitive learning can be directly
applied to different classification algorithms. Second, cost-sensitive learning generalizes
well over large data. Third, it can take user-specified costs into account.
3. DADMM splits the original problem into two subproblems: a distributed L2-regularized
loss minimization and an L1-regularized squared loss minimization. The first subproblem
is solved by L-BFGS as well as a random coordinate descent method, while the second
subproblem is just a soft-thresholding operation obtained in closed form. We call our
algorithm using the L-BFGS method L-Distributed Sparse Class-Imbalance Learning (L-
DSCIL) and the one using the random coordinate descent method R-DSCIL.
4. All the algorithms are tested on various benchmark data sets, and the results are compared with
state-of-the-art algorithms as well as the centralized solution over various performance
measures besides Gmean (defined later).
5. We also show (i) the speedup, (ii) the effect of varying the cost, (iii) the effect of the number
of cores in the distributed implementation, and (iv) the effect of the regularization parameter.
6. At the end, we show a useful real-world application of the DSCIL and CILSD algorithms
on the KDDCUP 2008 data set.
5.2 Experiments
In this Section, we demonstrate the empirical performance of the proposed algorithm DSCIL
over various small and large-scale data sets [144]. A brief summary of the benchmark data
sets and their class-imbalance ratios is given in Table 2.3. All the algorithms were implemented
in C++ and compiled with g++ on a 64-bit Linux machine containing 48 cores (2.4 GHz CPUs).¹
We compare our DSCIL algorithm with a recently proposed algorithm called Cost-Sensitive
First Order Sparse Online learning (CSFSOL) of [28]. CSFSOL is a cost-sensitive online algo-
rithm based on the mirror descent update rule (see, for example, [119]). CSFSOL optimizes the
same objective function as ours, but its loss function is not smooth. Secondly, as mentioned in
the introduction section, CSFSOL is an online centralized algorithm whereas DSCIL is a dis-
tributed algorithm. In the forthcoming subsections, we will show (i) the convergence of DSCIL
on benchmark data sets and (ii) the performance comparison of the DSCIL, CSFSOL, and CSSCD
algorithms.
¹ Sample code and the data set for R-DSCIL can be downloaded from
https://sites.google.com/site/chandreshiitr/publication
Note also that the DSCIL algorithm contains the ADMM penalty parameter ρ, which we set to 1;
the convergence of DSCIL does not require tuning this parameter. The regularization parameter
λ in DSCIL is set to 0.1λmax, where λmax = (1/m)‖Xᵀỹ‖∞ (see [172]) and ỹ is
given by:

\[ \tilde{y}_i = \begin{cases} m_-/m, & y_i = 1, \\ -m_+/m, & y_i = -1, \end{cases} \qquad i = 1, \ldots, m. \]
Setting λ as above requires no tuning, unlike the λ in CSFSOL, for which no closed-form
choice is available and cross-validation is needed (a sketch of the λmax computation appears
below). The DSCIL algorithm is stopped when the primal and dual residuals fall below the primal
and dual residual tolerances (see chapter 3 of [171] for details). The CSFSOL algorithm con-
tains another parameter, namely the learning rate η. The parameters λ and η were searched
in the ranges {3 × 10−5 , 9 × 10−5 , 3 × 10−4 , 9 × 10−4 , 3 × 10−3 , 9 × 10−2 , 0.3, 1, 2, 4, 8} and
{0.0312, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32} respectively, as discussed in [28], and the best
value on the performance metric was chosen for testing. As an implementation note, we normalize
the columns of the data matrix so that each feature value lies in [−1, 1]. All the results are obtained
by running the distributed algorithms on 4 cores using MPI unless otherwise stated.
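The following is a minimal sketch of this λmax computation, written over a dense toy matrix for clarity (the thesis implementation operates on sparse data); all names and values here are illustrative.

// Hedged sketch: lambda_max = (1/m) * ||X^T y~||_inf with
// y~_i = m_-/m for positive examples and -m_+/m for negative ones.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Toy data: m = 4 examples, d = 2 features; labels y in {+1, -1}.
    std::vector<std::vector<double>> X = {{1, 0}, {0, 1}, {1, 1}, {0.5, 0}};
    std::vector<int> y = {1, 1, -1, -1};
    const std::size_t m = X.size(), d = X[0].size();

    // Class counts m_+ and m_-.
    double mp = 0.0, mn = 0.0;
    for (int yi : y) (yi == 1 ? mp : mn) += 1.0;

    // y~ as defined above.
    std::vector<double> yt(m);
    for (std::size_t i = 0; i < m; ++i)
        yt[i] = (y[i] == 1) ? mn / double(m) : -mp / double(m);

    // lambda_max = (1/m) * max_j | sum_i X_ij * y~_i |.
    double lmax = 0.0;
    for (std::size_t j = 0; j < d; ++j) {
        double s = 0.0;
        for (std::size_t i = 0; i < m; ++i) s += X[i][j] * yt[i];
        lmax = std::max(lmax, std::fabs(s));
    }
    lmax /= double(m);
    std::printf("lambda_max = %g, lambda = %g\n", lmax, 0.1 * lmax);
    return 0;
}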
In this Section, we discuss the convergence of the L-DSCIL and R-DSCIL algorithms. The con-
vergence plot with respect to the DADMM iterations over various benchmark data sets is shown
in Figure 5.1. From the Figure, it is clear that L-DSCIL converges faster than R-DSCIL,
which is expected since L-DSCIL is a second-order (quasi-Newton) method while R-DSCIL is a
first-order method. On another note, we observe that on the w8a and rcv1 data sets, R-DSCIL starts
increasing the objective function, which indicates that the CSRCD algorithm, a random co-
ordinate descent algorithm, overshoots the minimum after a certain number of iterations.
Figure 5.1: Objective Function vs DADMM iterations over benchmark data sets. (a) ijcnn1 (b)
rcv1 (c) pageblocks (d) w8a.
Table 5.1: Performance comparison of CSFSOL, L-DSCIL, R-DSCIL and CSSCD over various
benchmark data sets (news, url, ijcnn1, covtype; columns: Accuracy, Sensitivity, Specificity,
Gmean, Sum).
Table 5.2: Performance comparison of CSFSOL, L-DSCIL, R-DSCIL and CSSCD over various
benchmark data sets (rcv1, w8a, pageblocks; columns: Accuracy, Sensitivity, Specificity,
Gmean, Sum).
From Tables 5.1 and 5.2, we can draw several observations. For example, L-DSCIL gives
superior performance to CSSCD on the news, gisette, rcv1, realsim and pageblocks data sets
in terms of Gmean. Secondly, the performance of R-DSCIL is not as good as that of CSSCD;
it was able to outperform CSSCD on the realsim and webspam data sets only. Thirdly, the
Gmean achieved by CSFSOL on large-scale data sets such as news, rcv1, url and webspam is
higher than that of any other method. This observation can possibly be attributed to (i) the
use of a strongly convex objective function in our present work compared to the convex but
non-smooth objective function in CSFSOL, (ii) our stopping the L-DSCIL algorithm before it
reaches optimality, or (iii) the λmax value calculated as discussed in subsection 5.3.1 not being
the right choice. We believe that the second and third reasons are more likely than the first
one, as can be seen in the convergence plot of the L-DSCIL algorithm in Figure 5.1. We
stopped the L-DSCIL algorithm when either the primal and dual residuals fell below the primal
and dual feasibility tolerances or the maximum number of iterations was reached (which we set
to 20 for large-scale data sets). To verify our point, we ran another experiment with MaxIter
set to 50. This setting gives a Gmean of 0.923436 on the rcv1 data set, which is clearly larger
than the previously obtained value of 0.916809. Similarly, by running L-DSCIL and R-DSCIL
for a larger number of iterations, we can reach the desired accuracy. Finally, we also observe
that R-DSCIL fails to capture the class imbalance on gisette and covtype, as do CSSCD on
covtype and CSFSOL on realsim (their Gmean values being 0), which indicates the possibility
of using L-DSCIL in practical settings.
2. Study on Gmean versus Cost
We study the effect of varying the cost on Gmean. The results are presented in Figures 5.2
and 5.3 for L-DSCIL and R-DSCIL respectively. In each Figure, cost and Gmean appear
on the x and y-axis respectively. From these Figures, we can draw the following observations.
Firstly, for balanced data such as rcv1, Gmean increases as the cost for each class approaches
0.5. On the other hand, the more imbalanced the data set is, the more cost must be given to
the positive class to achieve a higher Gmean (see the plots for news, pageblocks, and url in
Figure 5.2). The same observation can be made from Figure 5.3. These observations point
to the fact that the right choice of cost is important; otherwise, classification is affected severely.
3. Speedup Measurement
Here, we discuss the speedup achieved by the R-DSCIL and L-DSCIL algorithms when we
run them on multiple cores.
Figure 5.2: Gmean versus Cost over various data sets for L-DSCIL algorithm. Cost is
given on the x-axis where each number denotes cost pair such that 1={0.1,0.9}, 2={0.2,0.8},
3={0.3,0.7}, 4={0.4,0.6}, 5={0.5,0.5}.
Figure 5.3: Gmean versus Cost over various data sets for R-DSCIL algorithm. Cost is
given on the x-axis where each number denotes cost pair such that 1={0.1,0.9}, 2={0.2,0.8},
3={0.3,0.7}, 4={0.4,0.6}, 5={0.5,0.5}.
Speedup results of the R-DSCIL algorithm are presented in Figure 5.4 (a) and (b), and those
of L-DSCIL in Figure 5.5 (a) and (b). The number of cores used is shown on the x-axis, and
the y-axis shows the training time. From these figures, we can draw multiple conclusions.
Firstly, from Figure 5.4 (a), we observe that as we increase the number of cores, training time
decreases for all the data sets except for webspam at 8 cores. The sudden increase in the
training time of webspam can be explained as follows: so long as the computation time (RCD
running time plus primal and dual variable updates) remains above the communication time
(the MPI_Allreduce operation; a minimal sketch of this collective call is given at the end of
this Section), adding more cores reduces the overall training time. On the other hand, in
Figure 5.4 (b), training time first increases and then decreases for all the data sets as the
number of cores grows. This could be due to the communication time increasing with the
number of cores; after a certain number of cores, computation time starts to dominate the
communication time again, and a decrease in the training time is observed. Further, in the
DSCIL algorithm, the communication time is data dependent (it depends on the data
dimensionality and sparsity). Because of these reasons, we observe different speedup patterns
on different data sets. Speedup results of L-DSCIL in Figure 5.5 (a) and (b) show decreasing
training time with an increasing number of cores for all the data sets. We also observe that
the training time of L-DSCIL is higher than that of R-DSCIL on all data sets and core counts.
4. Number of Cores versus Gmean
Here, we present experimental results showing the effect of utilizing a varying number of
cores on Gmean. This also shows how different partitionings of the data affect Gmean.
We divide the data set into equal chunks and distribute them to the various cores: given
a data set of size m and n cores, we allot m/n samples to each core. We choose the data
size such that it is divisible by all core counts utilized. Results are shown in Figures 5.6
and 5.7 for R-DSCIL and L-DSCIL respectively. In Figure 5.6 (a) and (b), we can observe
that Gmean remains almost constant over the various partitionings of the data (various
core counts). A small deviation is observed for the url and w8a data sets in Figure 5.6 (b).
These observations lead to the conclusion that the R-DSCIL algorithm is less sensitive to
the data partition.
Figure 5.4: Training time versus number of cores to measure the speedup of R-DSCIL algo-
rithm. Training time in Figure (a) is on the log scale.
Figure 5.5: Training time versus number of cores to measure the speedup of L-DSCIL algo-
rithm. Training time in both the figures is on the log scale.
Figure 5.6: Gmean achieved by the R-DSCIL algorithm versus number of cores on various
benchmark data sets.
Figure 5.7: Gmean achieved by the L-DSCIL algorithm versus number of cores on various
benchmark data sets.
On the other hand, the Gmean results of the L-DSCIL algorithm for the various partitionings
in Figures 5.7 (a) and (b) show chaotic behavior. For example, Gmean changes considerably
with the increasing number of cores for news in Figure 5.7 (a) and for realsim in Figure 5.7 (b).
For the other data sets, the fluctuations in Gmean are smaller.
5. Effect of Regularization Parameter on Gmean
In this subsection, we discuss the effect of the regularization parameter λ on the Gmean
produced by the R-DSCIL algorithm, as shown in Figure 5.8 for various benchmark data
sets. It is clear from Figure 5.8 that Gmean drops with an increasing regularization
parameter on all the data sets tested. On some data sets, such as rcv1, Gmean drops
gradually with increasing λ. On others, such as w8a and pageblocks, Gmean falls off to 0
quickly with increasing λ. This also shows that the higher the sparsity of a data set (such as
rcv1 or webspam), the larger the penalty it tolerates while still achieving a high Gmean,
and vice versa.
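As referenced in the speedup study above, the following is a minimal sketch of the MPI_Allreduce communication pattern, assuming a synchronous gradient aggregation across cores; the surrounding training loop is omitted and the names are illustrative.

// Hedged sketch: each core computes a gradient on its local data chunk,
// and one collective call sums the d-dimensional vectors across all cores.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, ncores = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncores);

    const int d = 1000;  // model dimensionality
    std::vector<double> local_grad(d, 0.0), global_grad(d, 0.0);

    // ... fill local_grad from this core's data chunk (omitted) ...

    // Communication step: every core receives the sum of all local gradients.
    MPI_Allreduce(local_grad.data(), global_grad.data(), d,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    // Each core can now take an identical (averaged) gradient step.
    for (double& g : global_grad) g /= ncores;

    if (rank == 0) std::printf("aggregated gradients across %d cores\n", ncores);
    MPI_Finalize();
    return 0;
}

The cost of this collective grows with d (and the sparsity of the vectors), which is why the communication time discussed above is data dependent.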
5.3 Experiments
Below, we present empirical simulation results on the various benchmark data sets given in Table
2.3. For our distributed implementation, we used the MPICH2 library [173]. We compare the perfor-
mance of the CILSD algorithm with that of the CSFSOL of [28]. In the forthcoming subsections,
we will show (i) the convergence of CILSD on benchmark data sets, (ii) the performance com-
parison of the CILSD, CSFSOL and CSSCD algorithms in terms of accuracy, sensitivity, specificity,
Gmean and balanced accuracy (also called Sum, defined as 0.5 × sensitivity + 0.5 × specificity),
(iii) speedup, (iv) Gmean versus number of cores, and (v) Gmean versus regularization parameter.
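For reference, these metrics follow their standard definitions (notation ours): with TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives,

\[ \text{sensitivity} = \frac{TP}{TP+FN}, \quad \text{specificity} = \frac{TN}{TN+FP}, \quad \text{Gmean} = \sqrt{\text{sensitivity} \times \text{specificity}}, \quad \text{Sum} = \frac{\text{sensitivity} + \text{specificity}}{2}. \]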
In this Section, we show the convergence of CILSD in two scenarios. In the first scenario, the
convergence of CILSD is shown when we search for the best learning rate over a validation set.
Figure 5.8: Gmean versus regularization parameter λ using R-DSCIL (a) ijcnn1 (b) rcv1 (c)
pageblocks (d) w8a (e) news (f) url (g) realsim (h) webspam.
The learning rate is searched in the range {0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3}. In the
second scenario, we use the learning rate 1/L. The convergence plots for both scenarios are
shown in Figure 5.9. We can clearly see that the objective function converges faster with the
learning rate set to 1/L (Obj2) than when it is searched over the range of possible learning
rates (Obj1). These results demonstrate the correctness of our implementation as well as how
to choose the learning rate.
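Here L denotes the Lipschitz constant of the gradient of the smooth part of the objective. As a rough guide (our derivation, under the assumption that the smooth hinge loss satisfies ℓ″(z) ≤ 1), for f(w) = (1/m) Σᵢ c_{yᵢ} ℓ(yᵢ wᵀxᵢ) one usable bound is

\[ L \;\le\; \frac{\max_i c_{y_i}}{m}\, \lambda_{\max}\!\left(X^{\top} X\right), \]

so the step size 1/L can be computed once from the data rather than tuned by cross-validation.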
Figure 5.9: Objective function vs iterations over various benchmark data sets. (a) ijcnn1 (b) rcv1
(c) pageblocks (d) w8a. Obj1 denotes objective function when best learning rate is searched
over {0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3} while Obj2 denotes objective function value ob-
tained with learning rate 1/L.
Table 5.3: Performance comparison of CSFSOL, CSSCD, and CILSD over various benchmark
data sets (news, rcv1, url, gisette, realsim, ijcnn1; columns: Accuracy, Sensitivity, Specificity,
Gmean, Sum).
Table 5.4: Performance comparison of CSFSOL, CSSCD, and CILSD over various benchmark
data sets (w8a, pageblocks; columns: Accuracy, Sensitivity, Specificity, Gmean, Sum).
3. Gmean versus Number of Cores
We observe that Gmean does not remain constant under different partitionings of the data.
For some data sets, such as realsim, Gmean decreases with an increasing number of cores.
On the other hand, Gmean first increases and then decreases for rcv1 and ijcnn1, whereas
it continuously increases for some data sets such as url. This chaotic behavior of CILSD
may be due to the different proportions of positive and negative samples in the data chunks
allotted to different nodes. The above observations indicate that the CILSD algorithm is
sensitive to the partitioning of the data, and the right choice of data partitioning is important.
4. Gmean versus Regularization Parameter
Here, the effect of the sparsity-promoting parameter λ on Gmean is demonstrated.
The results are shown in Figure 5.12. From these results, we can clearly observe that
Gmean is quite sensitive to the setting of the regularization parameter λ. For example, Gmean
on the url data set drops slowly with increasing λ, while it quickly drops to 0 for gisette.
Figure 5.10: Training time versus number of cores to measure the speedup of CILSD algorithm.
Training time in both the figures is on the log scale.
Figure 5.11: Gmean achieved by CILSD algorithm versus number of cores on various bench-
mark data sets.
Figure 5.12: Effect of regularization parameter λ on Gmean (i) ijcnn1 (ii) rcv1 (iii) gisette (iv)
news (v) webspam (vi) url (vii) w8a (viii) realsim. λ varies in {3.00E-007, 0.000009, 0.00003,
0.0009, 0.003, 0.09, 0.3}.
Table 5.5: Performance evaluation of R-DSCIL, CILSD, CSFSOL and CSOGD-I on KDDCUP
2008 data set.
Algorithm Sum
CSOGD-I 0.5741092
CSFSOL 0.71089
R-DSCIL 0.7336652
CILSD 0.714153
Breast cancer detection [174, 175] is a type of anomaly detection problem. One in every eight
women around the world is susceptible to breast cancer. In this section, we demonstrate the
applicability of our proposed algorithms to anomaly detection in X-ray images from the KDDCup
2008 data set [176]. The KDDCup 2008 data set contains information on 102294 suspicious regions,
each described by 117 features. Each region is either "benign" or "malignant", and the
ratio of malignant to benign regions is 1:163.19. We split the training data set into 5 chunks.
The first four chunks contain 20000 candidates each and are used for training on 4 cores. The
last chunk, of size 22294, is used for testing. We compare the performance of R-DSCIL
and CILSD with CSFSOL and with CSOGD-I, which was proposed in [29]. It is shown in [29] that
CSOGD-I outperformed many first-order algorithms such as ROMMA, PA-I, PA-II, PAUM,
CPA_PB etc. (see Table 6 in [29]); hence, we only compared with CSFSOL and CSOGD-I in
our experiments. We note that CSFSOL and CSOGD-I are online algorithms that do not require
separate training and test sets. We set the learning rate in CSOGD-I to 0.2 as discussed in
their paper and reproduced the results. Importantly, the ratio of malignant to benign regions
in our test set is 1:123, which is not very different from the ratio of malignant to benign
regions in the original data set. Comparative performance is reported in Table 5.5
with respect to the Sum metric as defined in the Experiments section. From the results reported
in Table 5.5, we can clearly see that the Sum value achieved by R-DSCIL is much larger than
the value obtained by CSOGD-I. The Sum performance of CILSD also remains above the Sum
performance of CSOGD-I and CSFSOL. These observations indicate the possibility of using
R-DSCIL and CILSD in real-world anomaly detection tasks.
5.4 Discussion
In the present work, we proposed two algorithms for handling class imbalance in a distributed
environment on small and large-scale data sets. The DSCIL algorithm is implemented in two flavors:
one uses a second-order method (L-BFGS) and the other a first-order method (RCD) to solve
the subproblem in the DADMM framework. In our empirical comparison, we showed the con-
vergence results of L-DSCIL and R-DSCIL, where L-DSCIL converges faster than R-DSCIL.
Secondly, the Gmean achieved by L-DSCIL is close to the Gmean of a centralized solution for most
of the data sets, whereas the Gmean achieved by R-DSCIL varies due to its random updates of co-
ordinates. Thirdly, coming to the training time comparison, we found in our experiments that
R-DSCIL has a cheaper per-iteration cost but takes a longer time to reach a given accuracy compared
to the L-DSCIL algorithm. Finally, the effects of varying the cost, the regularization parameter and the
number of cores were also demonstrated. The empirical comparison showed the potential appli-
cation of L-DSCIL and R-DSCIL to real-world class-imbalance and anomaly detection tasks,
where our algorithms outperformed some recently proposed algorithms.
Our second algorithm (CILSD) is based on a FISTA-like update rule. We show, through extensive
experiments on benchmark and real data sets, the convergence behavior of CILSD, the speedup,
and the effect of the number of cores and the regularization parameter on Gmean. In particular, the CILSD
algorithm does not require tuning of the learning rate parameter, which can be set to 1/L as in gradient
descent. Comparative evaluation with respect to a recently proposed class-imbalance learning
algorithm and a centralized algorithm shows that CILSD is able to either outperform or perform
equally well on many of the data sets tested. Speedup results demonstrate the advantage of
employing multiple cores. We also observed, in our experiments, chaotic behavior of Gmean
with respect to a varying number of cores. The experiment on the KDDCup data set indicates the possi-
bility of using the CILSD algorithm for real-world distributed anomaly detection tasks. Comparison
of DSCIL and CILSD shows that CILSD converges faster and achieves higher Gmean than
DSCIL.
Chapter 6
Unsupervised Anomaly Detection Using SVDD: A Case Study
Anomaly detection techniques discussed in the previous chapters are supervised; that is, they
require labels of normal as well as anomalous examples to build the model. However, real-world
data rarely contain labels, which means the techniques discussed in the previous chapters cannot
be applied. Therefore, we turn to unsupervised anomaly detection.
In this chapter, we study a robust algorithm for anomaly detection in real data, based on support
vector data description (SVDD) due to [36]. The data set used in the experiment comes
from a nuclear power plant and represents the count of neutrons in reactor channels. We apply
the SVDD algorithm to the nuclear power plant data to detect anomalous channels of neu-
tron emission. Experiments demonstrate the effectiveness of the algorithm as well as its ability
to find anomalies in the data set. We also discuss extensions of the algorithm for finding anomalies
in high-dimensional and non-linearly separable data.
6.1 Introduction
Our case study is based on the support vector data description algorithm. Tax et al. [36] first
proposed support vector data description in 2004. Their work was recently extended to un-
certain data by Liu et al. [88]. The original work of Tax was based on the support vector classifier in
an unsupervised setting. Later, this was applied in semi-supervised and supervised settings with
little modification by Gornitz [177]. Next, we describe some key terms related to SVDD.
Definition 5. Support vectors are the set of points that lie on the boundary of the region sepa-
rating normal and anomalous points as shown in Fig. 6.1.
Definition 6. Support vector domain description concerns the characterization of a data set
through support vectors.
We describe the support vector data description algorithm [36] for completeness. The problem is to
build a description of a training set of data instances and to detect which (new) data instances
resemble this training set. SVDD is essentially a one-class classifier [178]. The basic idea of
SVDD is that a good description of the data encompasses only normal instances. Outliers reside
either at the boundary or outside of the hypersphere (the generalization of a sphere to more than
three dimensions) containing the data set, as shown in Fig. 6.2. The method is made robust against
outliers in the training set and is capable of tightening the description by using anomalous
examples. The basic SVDD algorithm is given in Algorithm 1.
The idea of anomaly detection using a minimal hypersphere (SVDD) is as follows. From the
preprocessed data in the kernel matrix, we calculate the center of mass of the data, which forms
the center of the hypersphere. Subsequently, we compute the distances of the training points
from the center of mass so as to obtain an empirical estimate of its location. The empirical
estimation error in the center of mass is added to the maximum distance of a training point
from the center of mass to give the threshold. Any test point lying beyond the threshold
is classified as anomalous.
Below, we show how the kernel matrix and the center of mass (lines 1 and 3 of the algorithm) are
computed, because they form the heart of the algorithm.
• Kernel Matrix Construction: The kernel matrix is a matrix whose (i, j)th entry encodes the
similarity between instances i and j. From an implementation point of view, we have used the
RBF kernel (aka Gaussian kernel), given as:

\[ k(x, z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right), \]

where x and z are two samples and σ is the kernel width, analogous to a standard deviation.
Again, we note that the final result depends on the value of σ, so a careful choice of σ
is required. In our implementation, we have used σ = 1.
• Computing the center of mass (COM):
Intuitively, the center of mass (COM) of a set of points is the same as its center of gravity.
More formally, it is defined as:

\[ \phi_s = \frac{1}{m} \sum_{i=1}^{m} \phi(X_i) \qquad (6.2) \]

where m is the size of the training set and φ is a map from the input space to the feature space.
In practice, we compute the distances of the training data from the COM, which is equivalent
to centering the kernel matrix.
Figure 6.2: Illustrates the spheres for data generated according to a spherical two-dimensional
Gaussian distribution. The center part shows the center of mass of the training points. Anything
outside the boundary can be considered as outliers.
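The following is a minimal sketch of the distance-from-COM test described above, assuming an RBF kernel with σ = 1 and using the maximum training distance as the threshold (the estimation-error slack term of the full algorithm is omitted); all names are illustrative, not the thesis implementation.

// Hedged sketch of the center-of-mass anomaly test with an RBF kernel.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Point = std::vector<double>;

// RBF (Gaussian) kernel: k(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
double rbf(const Point& x, const Point& z, double sigma = 1.0) {
    double d2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) d2 += (x[i] - z[i]) * (x[i] - z[i]);
    return std::exp(-d2 / (2.0 * sigma * sigma));
}

// Squared feature-space distance from the center of mass:
// k(x,x) - (2/m) sum_i k(x, X_i) + (1/m^2) sum_{i,j} k(X_i, X_j).
double com_dist2(const Point& x, const std::vector<Point>& train, double kbar) {
    double s = 0.0;
    for (const Point& xi : train) s += rbf(x, xi);
    return rbf(x, x) - 2.0 * s / train.size() + kbar;
}

int main() {
    // Toy training set: points near the origin play the role of normal data.
    std::vector<Point> train = {{0.0, 0.1}, {0.1, -0.1}, {-0.1, 0.0}, {0.05, 0.05}};

    // kbar = (1/m^2) sum_{i,j} k(X_i, X_j), the mean of the kernel matrix.
    double kbar = 0.0;
    for (const Point& xi : train)
        for (const Point& xj : train) kbar += rbf(xi, xj);
    kbar /= double(train.size() * train.size());

    // Threshold: maximum training distance from the center of mass.
    double thresh = 0.0;
    for (const Point& xi : train) thresh = std::max(thresh, com_dist2(xi, train, kbar));

    Point test = {2.0, 2.0};  // far from the training mass
    std::printf("test point is %s\n",
                com_dist2(test, train, kbar) > thresh ? "anomalous" : "normal");
    return 0;
}

Note that the squared distance expands entirely in terms of kernel evaluations, so the map φ never needs to be computed explicitly.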
Note that the data can be (1) high-dimensional and (2) non-linearly separable. To handle the first
case, we can introduce multiple kernels, one per feature, and learn them from the training data.
To handle the second case, we can use a higher-order polynomial or Gaussian kernel
(a kernel implicitly maps low-dimensional data into a higher-dimensional feature space). Since we
have the model at our disposal, we now embark on evaluating it on a real data set, as described
in the next Section.
6.3 Experiments
Figure 6.3: Nuclear power plant data with marked anomalous subsequence
The model is trained on the first 600 samples, collected under normal conditions. During testing,
a test sample of size 5714 (= 6314 − 600) is presented to the model. In the course of our
experiment, we make the following assumptions.
• The threshold value for the confidence parameter is δ = 0.01; that is, the probability that the
test error is less than or equal to the training error is 99%.
• Algorithm 3 is run feature-wise; that is, we ran Algorithm 3 for each channel to
detect point anomalies individually.
• The kernel used is Gaussian.
• All the indices in the figures are with respect to the transformed data (log transformation).
The data covers a period of 6 years (from 2005 to 2010). After preprocessing, that is,
removing noise from the data, it constitutes 6314 tuples.
6.4 Results
The results of applying Algorithm 3 to the data from different channels of the reactor are shown
in Figs. 6.4, 6.5, and 6.6. The points flagged exactly match the known anomalous points (in fact,
the neutron count was as low as 0 to 5 and as high as 170+, which is considered an abnormal flow
of neutrons during the particular time period), which we verified with an expert at the Bhabha
Atomic Research Centre. In Fig. 6.6, a large number of anomalous points accumulates; it was
verified later that this was due to a technical fault in the detector for the specified duration.
6.5 Discussion
In the present work, we studied the support vector data description algorithm for anomaly detection
in nuclear power plant data. We observe that SVDD efficiently and effectively finds point
anomalies in the nuclear reactor data set. In the present work, we applied the SVDD algorithm to
each reactor channel individually. In the future, we plan to use more state-of-the-art work
on unsupervised anomaly detection in the multivariate setting.
Figure 6.4: Anomalies (marked in red) found in detector 1 of the power plant data
Figure 6.5: Anomalies (marked in red) found in detector 1 of the power plant data
Figure 6.6: Anomalies (marked in red) found in detector 1 of the power plant data
Chapter 7
Conclusions and Future Work
Anomaly detection is an important task in machine learning and data mining. Due to unprece-
dented growth in data size and complexity, data miners and practitioners are overwhelmed with
what is called big data. Big data incapacitates the traditional anomaly detection techniques. As
such, there is an urgent need to develop efficient and scalable techniques for anomaly detection
in big data.
7.1 Conclusions
In the present research work, we proposed four novel algorithms for handling anomaly detection
in big data. PAGMEAN and ASPGD are based on the online learning paradigm, whereas
DSCIL and CILSD are based on the distributed learning paradigm. In order to handle the anomaly
detection problem in big data, we took an approach different from many works in the literature:
specifically, we employ class-imbalance learning to tackle point anomalies.
PAGMEAN is an online algorithm for class-imbalance learning and anomaly detection in the
streaming setting. In Chapter 3, we showed how we can directly optimize a non-decomposable
performance metric, Gmean, in the binary classification setting. Doing so gives rise to a non-convex
loss function, and we employ a surrogate loss function for handling the non-convexity. Subsequently,
the surrogate loss is used within the PA framework to derive the PAGMEAN algorithms. We show
through extensive experiments on benchmark and real data sets that PAGMEAN outperforms
its parent algorithm PA and the recently proposed CSOC algorithm in terms of Gmean. However,
at the same time, we observed that the PAGMEAN algorithms suffer a higher Mistake rate than the
other algorithms we compared with.
In Chapter 4, we proposed the ASPGD algorithm for tackling anomaly detection in streaming, high-
dimensional and sparse data. We utilize an accelerated-stochastic-proximal learning framework
with a cost-sensitive smooth hinge loss, which applies a penalty based
on the number of positive and negative samples received so far. We also proposed a non-
accelerated variant of ASPGD, that is, one without Nesterov's acceleration, called ASPGDNOACC.
An empirical study on real and benchmark data sets shows that acceleration is not always helpful,
neither in terms of Gmean nor Mistake rate. In addition, we also compared with a recently
proposed algorithm called CSFSOL and found that ASPGD outperforms CSFSOL in terms of
Gmean, F-measure, and Mistake rate on many of the data sets tested.
In order to handle anomaly detection in sparse, high-dimensional and distributed data, we pro-
posed the DSCIL and CILSD algorithms in Chapter 5. In particular, the DSCIL algorithm is based
on the distributed ADMM framework and utilizes a cost-sensitive loss function. Within the
DSCIL algorithm, we solve the L2-regularized loss minimization problem via (1) the L-BFGS
method (called L-DSCIL) and (2) the Random Coordinate Descent method (called R-DSCIL). Firstly,
empirical convergence analysis shows that L-DSCIL converges faster than R-DSCIL. Secondly,
the Gmean achieved by L-DSCIL is close to the Gmean of a centralized solution on most of the
data sets, whereas the Gmean achieved by R-DSCIL varies due to its random updates of coor-
dinates. Thirdly, coming to the training time comparison, we found in our experiments that
R-DSCIL has a cheaper per-iteration cost but takes a longer time to reach a given accuracy compared
to the L-DSCIL algorithm. The real-world anomaly detection application on the KDDCup 2008 data
set clearly shows the potential advantage of using the R-DSCIL algorithm.
Our second algorithm, CILSD, is based on a FISTA-like update rule. Comparative evaluation with
respect to a recently proposed class-imbalance learning algorithm and a centralized algorithm
shows that CILSD is able to either outperform or perform equally well on
many of the data sets tested. Speedup results demonstrate the advantage of employing multiple
cores. We also observed, in our experiments, chaotic behavior of Gmean with respect to a vary-
ing number of cores. The experiment on the KDDCup data set indicates the possibility of using the
CILSD algorithm for real-world distributed anomaly detection tasks. Comparison of DSCIL and CILSD
shows that CILSD converges faster and achieves higher Gmean than DSCIL.
We presented a case study of anomaly detection on real-world data in Chapter 6. Our data came
from a nuclear power plant and is unlabeled. None of the algorithms discussed above can
be applied, since they are based on supervised learning, i.e., they require labels for normal and
anomalous instances. Therefore, we utilized SVDD, an unsupervised learning algorithm, for
anomaly detection in this real-world data. Empirical results show the effectiveness and efficiency of
the SVDD algorithm.
7.2 Future Work
In this section, we discuss some potential research directions for the future. Our algorithms, though
scalable to high dimensions and able to handle sparse and distributed data, have certain limita-
tions. For example, we did not specifically handle concept drift in our setting. A first line of future
work may look at utilizing concept-drift detection techniques within the framework we used.
Secondly, big data in the real world is often not only distributed but also streaming; therefore, on-
line distributed algorithms may be developed to handle anomaly detection. As a third direction, data
heterogeneity may be combined with the streaming, sparse and high-dimensional characteristics of
big data while detecting anomalies.
Besides the above tasks, one may consider extending the existing framework to handle subse-
quence and contextual anomalies in big data.
References
[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey.
Journal of ACM Comput. Surv., 41(3):15:1–15:58, July 2009.
[3] Liang Xiong, Xi Chen, and J. Schneider. Direct robust matrix factorizatoin for anomaly
detection. In 11th International Conference on Data Mining (ICDM), IEEE, pages 844–
853, 2011.
[4] Mia Hubert, Peter J. Rousseeuw, and Karlien Vanden Branden. Robpca: a new approach
to robust principal component analysis. pages 64–79. Journal of Technometrics, 2005.
[5] João B. D. Cabrera, Carlos Gutiérrez, and Raman K. Mehra. Ensemble methods for
anomaly detection and distributed intrusion detection in mobile ad-hoc networks. Journal
of Information Fusion, Elsevier, 9(1):96–119, 2008.
[6] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining,
(First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA,
2005.
[7] S.K. Gupta, V. Bhatnagar, and S.K. Wasan. Architecture for knowledge discovery and
knowledge management. Journal of Knowledge and Information Systems, 7(3):310–336,
2005.
[8] Vasudha Bhatnagar, S K Gupta, and S K Wasan. On mining of data. IETE Journal of
Research, 47(1-2):5–17, 2001.
[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A
survey. IEEE Transactions on Knowledge and Data Engineering, 24(5):823–839, May
2012.
[10] Varun Chandola, Varun Mithal, and Vipin Kumar. A reference based analysis framework
for understanding anomaly detection techniques for symbolic sequences. Data Mining
and Knowledge Discovery, 28(3):702–735, 2013.
[11] Deris Stiawan, Abdul Hanan Abdullah, and Mohd. Yazid Idris. Threat and vulnerability
penetration testing: Linux. Journal of Internet Technology, 15(3):333–342, 2014.
[12] Suvrojit Das, Debayan Chatterjee, Debidas Ghosh, and Narayan C. Debnath. Extracting
the system call identifier from within VFS: a kernel stack parsing-based approach. IJICS
Journal, 6(1):12–50, 2014.
[13] Dinil Mon Divakaran, Hema A. Murthy, and Timothy A. Gonsalves. Detection of SYN
flooding attacks using linear prediction analysis. In 14th IEEE Int'l Conf. on Networks,
ICON, Singapore, Sep 2006.
[14] Chundury Jagadish and Timothy A. Gonsalves. Distributed control of event floods in
a large telecom network. International Journal of Network Management, 20(2):57–70,
2010.
[15] Sanjay Mittal, Rahul Gupta, Mukesh Mohania, Shyam K. Gupta, Mizuho Iwaihara, and
Tharam Dillon. 7th International Conference, EC-Web 2006 E-Commerce and Web Tech-
nologies, Krakow, Poland, September 5-7, 2006. Proceedings, chapter Detecting Frauds
in Online Advertising Systems, pages 222–231. Springer Berlin Heidelberg, Berlin, Hei-
delberg, 2006.
[16] A. Ramasamy, Hema A. Murthy, and T.A. Gonsalves. Linear prediction for traffic man-
agement and fault detection. In ICIT, pages 141–144, Dec 2000.
[18] Frank E. Grubbs. Procedures for detecting outlying observations in samples. Technomet-
rics, 11(1):1–21, February 1969.
[19] Santanu Das, Bryan L. Matthews, Ashok N. Srivastava, and Nikunj C. Oza. Multiple
kernel learning for heterogeneous anomaly detection: algorithm and aviation safety case
study. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge
discovery and data mining, KDD ’10, pages 47–56, New York, NY, USA, 2010. ACM.
[20] Shreya Banerjee, Renuka Shaw, Anirban Sarkar, and Narayan C. Debnath. Towards
logical level design of big data. In 13th IEEE International Conference on Industrial
Informatics, INDIN 2015, Cambridge, United Kingdom, July 22-24, 2015, pages 1665–
1671, 2015.
[21] Neha Bharill and Aruna Tiwari. Handling big data with fuzzy based classification ap-
proach. In Advance Trends in Soft Computing - Proceedings of WCSC, December 16-18,
San Antonio, Texas, USA, pages 219–227, 2013.
[22] Eamonn J. Keogh, Jessica Lin, and Ada Wai-Chee Fu. Hot sax: Efficiently finding
the most unusual time series subsequence. In International Conference on Data Min-
ing(ICDM), pages 226–233, 2005.
[23] Charu C. Aggarwal and Philip S. Yu. Outlier detection for high dimensional data. In
Proceedings of the ACM SIGMOD International Conference on Management of Data,
SIGMOD ’01, pages 37–46, New York, NY, USA, 2001. ACM.
[24] Timothy de Vries, Sanjay Chawla, and Michael E. Houle. Density-preserving projections
for large-scale local anomaly detection. Journal of Knowledge and Information Systems,
Springer, 32(1):25–52, July 2012.
[25] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. Database Theory
— ICDT’99: 7th International Conference Jerusalem, Israel, January 10–12, Proceed-
ings, chapter When Is “Nearest Neighbor” Meaningful? Springer Berlin Heidelberg,
Berlin, Heidelberg, 1999.
[26] Bilal Mirza, Zhiping Lin, and Nan Liu. Ensemble of subset online sequential extreme
learning machine for class imbalance and concept drift. Neurocomputing, 149, Part
A:316 – 329, 2015.
[27] Sharanjit Kaur, Vasudha Bhatnagar, Sameep Mehta, and Sudhir Kapoor. Categorizing
concepts for detecting drifts in stream. In Proceedings of the 15th International Confer-
ence on Management of Data, December 9-12, Mysore, India, 2009.
[28] Dayong Wang, Pengcheng Wu, Peilin Zhao, and Steven C. H. Hoi. A framework of
sparse online learning and its applications. CoRR, abs/1507.07146, 2015.
[29] Jialei Wang, Peilin Zhao, and S.C.H. Hoi. Cost-sensitive online classification. IEEE
Transactions on Knowledge and Data Engineering, 26(10):2425–2438, Oct 2014.
[30] Shou Wang, Leandro L. Minku, and Xin Yao. Online class imbalance learning and its
applications in fault detection. International Journal of Computational Intelligence and
Applications, 12(04):1340001, 2013.
[31] Shuo Wang, Leandro L. Minku, and Xin Yao. Resampling-based ensemble methods for
online class imbalance learning. IEEE Trans. Knowl. Data Eng., 27(5):1356–1368, 2015.
[32] Amira Kamil Ibrahim Hassan and Ajith Abraham. Advances in Nature and Biologically
Inspired Computing: Proceedings of the 7th World Congress on Nature and Biologi-
cally Inspired Computing (NaBIC2015) in Pietermaritzburg, South Africa, held Decem-
ber 01-03, 2015, chapter Modeling Insurance Fraud Detection Using Imbalanced Data
Classification, pages 117–127. Springer International Publishing, Cham, 2016.
[34] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer.
Online passive-aggressive algorithms. J. Mach. Learn. Res., 7:551–585, December 2006.
[35] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for
linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[36] David M. J. Tax and Robert P. W. Duin. Support vector data description. Machine
Learning Journal, 54(1):45–66, 2004.
[37] C.K. Maurya and D. Toshniwal. Anomaly detection in nuclear power plant data using
support vector data description. In IEEE Students’ Technology Symposium (TechSym), at
IIT Kharagpur, pages 82–86, Feb 2014.
[38] Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised out-
lier detection in high-dimensional numerical data. Stat. Anal. Data Min., 5(5):363–387,
October 2012.
[39] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection.
The Journal of Machine Learning Research, 11:2973–3009, 2010.
[40] Dipankar Dasgupta and Fernando Nino. A comparison of negative and positive selection
algorithms in novel pattern detection. In IEEE international conference on Systems, man,
and cybernetics,, volume 1, pages 125–130. IEEE, 2000.
[42] C. De Stefano, C. Sansone, and M. Vento. To reject or not to reject: that is the question-an
answer in case of neural classifiers. IEEE Transactions on Systems, Man, and Cybernet-
ics, Part C: Applications and Reviews, 30(1):84–94, 2000.
[43] Christos Siaterlis and Basil Maglaris. Towards multisensor data fusion for dos detection.
In Proceedings of the ACM symposium on Applied computing, SAC ’04, pages 439–446,
New York, NY, USA, 2004. ACM.
[44] Kaustav Das and Jeff Schneider. Detecting anomalous records in categorical datasets. In
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery
and data mining, KDD ’07, pages 220–229, New York, NY, USA, 2007. ACM.
[45] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning Jour-
nal, 20(3):273–297, 1995.
[46] Manuel Davy and S. Godsill. Detection of abrupt spectral changes using support vector
machines an application to audio signal segmentation. In IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 1313–1316,
2002.
[47] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Proceedings
of the Eleventh International Conference on Data Engineering, ICDE ’95, pages 3–14,
Washington, DC, USA, 1995. IEEE Computer Society.
[48] Wei Fan, M. Miller, S.J. Stolfo, Wenke Lee, and P.K. Chan. Using artificial anoma-
lies to detect unknown and known network intrusions. In Proceedings of International
Conference on Data Mining( ICDM), IEEE, pages 123–130, 2001.
[49] G. Ratsch, S. Mika, B. Scholkopf, and K. Muller. Constructing boosting algorithms from
svms: an application to one-class classification. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 24(9):1184–1199, 2002.
[50] Mennatallah Amer, Markus Goldstein, and Slim Abdennadher. Enhancing one-class
support vector machines for unsupervised anomaly detection. In Proceedings of the
ACM SIGKDD Workshop on Outlier Detection and Description, ODD ’13, pages 8–15,
New York, NY, USA, 2013. ACM.
[51] Volker Roth. Outlier detection with one-class kernel fisher discriminants. In Advances
in Neural Information Processing Systems 17, pages 1169–1176. MIT Press, 2005.
[52] Jorma Laurikkala, Martti Juhola, Erna Kentala, N Lavrac, S Miksch, and B Kavsek.
Informal identification of outliers in medical data. In Fifth International Workshop on
Intelligent Data Analysis in Medicine and Pharmacology, pages 20–24, 2000.
[53] Helge Erik Solberg and Ari Lahti. Detection of outliers in reference distributions: per-
formance of horn’s algorithm. Clinical chemistry, 51(12):2326–2332, 2005.
[54] Paul S Horn, Lan Feng, Yanmei Li, and Amadeo J Pesce. Effect of outliers and non-
healthy individuals on reference interval estimation. Clinical Chemistry, 47(12):2137–
2145, 2001.
[55] Harvey Motulsky. Intuitive biostatistics: Choosing a statistical test, chapter-17. Technical
report, ISBN 0-19-508607-4), Oxford University Press, 1995.
[56] Martin Ester, Hans peter Kriegel, Jörg S, and Xiaowei Xu. A density-based algorithm for
discovering clusters in large spatial databases with noise. pages 226–231. AAAI Press,
1996.
[57] S. Guha, R. Rastogi, and Kyuseok Shim. Rock: a robust clustering algorithm for categor-
ical attributes. In Proceedings of 15th International Conference on Data Engineering,
pages 512–521, 1999.
[58] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Rock: A robust clustering algorithm
for categorical attributes. In Proceedings of 15th International Conference on Data En-
gineering, pages 512–521. IEEE, 1999.
[59] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algo-
rithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96,
pages 226–231, 1996.
[60] Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding topics in collections of doc-
uments: A shared nearest neighbor approach. In Clustering and Information Retrieval,
pages 83–103. Springer, 2004.
[61] Witcha Chimphlee, Abdul Hanan Abdullah, Mohd Noor Md Sap, Siriporn Chimphlee,
and Surat Srinoy. Unsupervised clustering methods for identifying rare events in
anomaly detection. In Sixth International Enformatika Conference, pages 26–28, Bu-
dapest, Hungary, 2005.
[62] Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern recognition
letters, 24(9-10):1641–1650, 2003.
[63] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Sal Stolfo. A ge-
ometric framework for unsupervised anomaly detection. In Applications of data mining
in computer security, pages 77–101. Springer, 2002.
[64] Fabrizio Angiulli and Clara Pizzuti. Fast outlier detection in high dimensional spaces.
In European Conference on Principles of Data Mining and Knowledge Discovery, pages
15–27. Springer, 2002.
[65] Ji Zhang and Hai Wang. Detecting outlying subspaces for high-dimensional data: the new
task, algorithms, and performance. Knowledge and information systems, 10(3):333–355,
2006.
[66] Richard J Bolton, David J Hand, et al. Unsupervised profiling methods for fraud detec-
tion. Credit Scoring and Credit Control VII, pages 235–255, 2001.
[67] Markus Breunig, Hans Peter Kriegel, Raymond T. Ng, and Jörg Sander. Lof: Identi-
fying density-based local outliers. In Proceedings of the ACM SIGMOD international
conference on management of data, pages 93–104. ACM, 2000.
[68] KaiMing Ting, Takashi Washio, JonathanR. Wells, FeiTony Liu, and Sunil Aryal. De-
mass: a new density estimator for big data. Journal of Knowledge and Information
Systems, 35(3):493–524, 2013.
[69] Jon Louis Bentley. Multidimensional binary search trees used for associative searching.
Commun. ACM, 18(9):509–517, September 1975.
[70] Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana. Towards
parameter-free data mining. In Proc. 10th ACM SIGKDD International Conf. Knowl-
edge Discovery and Data Mining, pages 206–215. ACM Press, 2004.
[71] Andreas Arning, Rakesh Agrawal, and Prabhakar Raghavan. A linear method for devia-
tion detection in large databases. In KDD, pages 164–169, 1996.
[72] Shin Ando. Clustering needles in a haystack: An information theoretic analysis of mi-
nority and outlier detection. In Seventh IEEE International Conference on Data Mining
(ICDM 2007), pages 13–22. IEEE, 2007.
[73] Zengyou He, Shengchun Deng, and Xiaofei Xu. An optimization model for outlier de-
tection in categorical data. In International Conference on Intelligent Computing, pages
400–409. Springer, 2005.
[74] Wenke Lee and Dong Xiang. Information-theoretic measures for anomaly detection. In
Proceedings of IEEE Symposium on Security and Privacy, S&P 2001, pages 130–143.
IEEE, 2001.
[75] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a
Hilbert space. In Conference in modern analysis and probability (New Haven, Conn.,
1982), volume 26 of Contemporary Mathematics, pages 189–206. American Mathemat-
ical Society, 1984.
[76] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Cheng. A novel anomaly detection
scheme based on principal component classifier. In Proceedings of the 3rd IEEE Interna-
tional Conference on Data Mining, ICDM '03, pages 353–365. IEEE, 2003.
[77] J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: Compact matrix representation
of large sparse graphs. In Proceedings of the 7th SIAM International Conference on Data
Mining. SIAM, 2007.
[78] Simon Gunter, Nicol N. Schraudolph, and S. V. N. Vishwanathan. Fast iterative kernel
principal component analysis. Journal of Machine Learning Research, 8:1893–1918,
2007.
[79] Rose Yu, Xinran He, and Yan Liu. Glad: Group anomaly detection in social media anal-
ysis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’14, pages 372–381, New York, NY, USA, 2014.
ACM.
[80] Erik Rodner, Esther-Sabrina Wacker, Michael Kemmler, and Joachim Denzler. One-class
classification for anomaly detection in wire ropes with gaussian processes in a few lines
of code. In Proceedings of the IAPR Conference on Machine Vision Applications (IAPR
MVA 2011), Nara Centennial Hall, Nara, Japan, June 13-15, 2011, pages 219–222,
2011.
[81] Michael Kemmler, Erik Rodner, Esther-Sabrina Wacker, and Joachim Denzler. One-class
classification with Gaussian processes. Pattern Recognition, 46(12):3507–3518, 2013.
[82] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cam-
bridge University Press, New York, NY, USA, 2004.
[83] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning,
conic duality, and the SMO algorithm. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, pages 6–, New York, NY, USA, 2004. ACM.
[84] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA,
2001.
[85] Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learn-
ing. In Proceedings of the 26th Annual International Conference on Machine Learning,
ICML ’09, pages 1065–1072, New York, NY, USA, 2009. ACM.
[86] G. Song, X. Jin, G. Chen, and Y. Nie. Multiple kernel learning method for network anomaly detection. In IEEE Conference on Intelligent Systems and Knowledge Engineering (ISKE), pages 296–299, 2010.
[87] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Lp-norm multiple
kernel learning. Journal of Machine Learning Research (JMLR), 12:953–997, July 2011.
[88] Bo Liu, Yanshan Xiao, Longbing Cao, Zhifeng Hao, and Feiqi Deng. SVDD-based outlier detection on uncertain data. Knowledge and Information Systems, 34(3):597–618, 2013.
[89] Nico Görnitz, Marius Kloft, Konrad Rieck, and Ulf Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235–262, 2013.
[90] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization.
In NIPS, pages 556–562. MIT Press, 2000.
[91] Chih-Jen Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks, 18(6):1589–1596, November 2007.
[92] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.
[93] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J.
Plemmons. Algorithms and applications for approximate nonnegative matrix factoriza-
tion. Computational Statistics and Data Analysis, pages 155–173, 2006.
[94] Dongmin Kim, Suvrit Sra, and Inderjit S. Dhillon. Fast newton-type methods for the least
squares nonnegative matrix approximation problem. In Proceedings of SIAM Conference
on Data Mining, pages 343–354, 2007.
[95] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate
descent methods for minimizing a composite function. Mathematical Programming,
144(1):1–38, 2012.
[96] Edward G. Allan, Michael R. Horvath, Christopher V. Kopek, Brian T. Lamb, Thomas S. Whaples, and Michael W. Berry. Anomaly detection using nonnegative matrix factorization. In Michael W. Berry and Malu Castellanos, editors, Survey of Text Mining II, pages 203–217. Springer London, 2008.
[97] Fei Wang, Sanjay Chawla, and Didi Surian. Latent outlier detection and the low preci-
sion problem. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and
Description, ODD ’13, pages 46–52, New York, NY, USA, 2013. ACM.
[98] Robert J. Durrant and Ata Kabán. Random projections as regularizers: Learning a linear
discriminant ensemble from fewer observations than dimensions. In Asian Conference
on Machine Learning, ACML 2013, Canberra, ACT, Australia, November 13-15, 2013,
pages 17–32, 2013.
[99] Robert J. Durrant and Ata Kabán. Sharp generalization error bounds for randomly-
projected classifiers. In Proceedings of the 30th International Conference on Machine
Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 693–701, 2013.
[100] Emmanuel Müller, Ira Assent, Patricia Iglesias, Yvonne Mülle, and Klemens Böhm.
Outlier ranking via subspace analysis in multiple views of the data. In ICDM, pages
529–538. IEEE Computer Society, 2012.
[101] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit.
In NIPS, pages 2496–2504. Curran Associates, Inc., 2010.
[102] Mazin Aouf and Laurence A. F. Park. Approximate document outlier detection using
random spectral projection. In Proceedings of the 25th Australasian joint conference
on Advances in Artificial Intelligence, AI’12, pages 579–590, Berlin, Heidelberg, 2012.
Springer-Verlag.
[103] Vipin Kumar and Sonajharia Minz. Multi-view ensemble learning: an optimal feature
set partitioning for high-dimensional data classification. Knowledge and Information Systems, pages 1–59, 2015.
[104] Aleksandar Lazarevic and Vipin Kumar. Feature bagging for outlier detection. In Pro-
ceedings of the eleventh ACM SIGKDD international conference on Knowledge discov-
ery in data mining, KDD ’05, pages 157–166, New York, NY, USA, 2005. ACM.
[105] Keith Noto, Carly Brodley, and Donna Slonim. Anomaly detection using an ensemble
of feature models. In Proceedings of the 2010 IEEE International Conference on Data
Mining, ICDM ’10, pages 953–958, Washington, DC, USA, 2010. IEEE Computer So-
ciety.
[107] Shaza M. Abd Elrahman and Ajith Abraham. Class imbalance problem using a hybrid
ensemble approach. Int. J. Hybrid Intell. Syst., 12(4):219–227, 2016.
[108] Vasudha Bhatnagar, Manju Bhardwaj, and Ashish Mahabal. Comparing SVM ensembles
for imbalanced datasets. In 10th International Conference on Intelligent Systems Design
and Applications, ISDA, November 29 - December 1, Cairo, Egypt, pages 651–657,
2010.
[109] Manju Bhardwaj, Debasis Dash, and Vasudha Bhatnagar. Accurate classification of bio-
logical data using ensembles. In IEEE International Conference on Data Mining Work-
shop, ICDMW, Atlantic City, NJ, USA, November 14-17, pages 1486–1493, 2015.
[111] J. Kivinen, A.J. Smola, and R.C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, August 2004.
[112] Koby Crammer and Yoram Singer. Ultraconservative online algorithms for multiclass
problems. J. Mach. Learn. Res., 3:951–991, March 2003.
[113] Claudio Gentile. A new approximate maximal margin classification algorithm. J. Mach.
Learn. Res., 2:213–242, March 2002.
[114] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron
algorithm. SIAM J. Comput., 34(3):640–668, March 2005.
[115] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight
vectors. Machine Learning, 91(2):155–187, 2013.
[116] Francesco Orabona and Koby Crammer. New adaptive algorithms for online classification. In Advances in Neural Information Processing Systems 23, pages 1840–1848, 2010.
[117] Koby Crammer, Mark Dredze, and Fernando Pereira. Exact convex confidence-weighted
learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances
in Neural Information Processing Systems 21, pages 345–352. Curran Associates, Inc.,
2009.
[118] Jialei Wang, Peilin Zhao, and Steven C. H. Hoi. Exact soft confidence-weighted learning.
CoRR, abs/1206.4612, 2012.
[119] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge
University Press, New York, NY, USA, 2006.
[120] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, June 2002.
[121] Nitesh V. Chawla, Aleksandar Lazarevic, Lawrence O. Hall, and Kevin W. Bowyer.
SMOTEBoost: Improving prediction of the minority class in boosting. In PKDD, volume
2838 of Lecture Notes in Computer Science, pages 107–119. Springer, 2003.
[122] Shuo Wang, Huanhuan Chen, and Xin Yao. Negative correlation learning for classi-
fication ensembles. In The 2010 International Joint Conference on Neural Networks
(IJCNN), pages 1–8, July 2010.
[123] Charles Elkan. The foundations of cost-sensitive learning. In Proceedings of the 17th
International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’01, pages
973–978, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[124] Wei Fan, Salvatore J. Stolfo, Junxin Zhang, and Philip K. Chan. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on
Machine Learning, ICML ’99, pages 97–105, San Francisco, CA, USA, 1999. Morgan
Kaufmann Publishers Inc.
[125] C.X. Ling, V.S. Sheng, and Qiang Yang. Test strategies for cost-sensitive decision trees.
IEEE Transactions on Knowledge and Data Engineering, 18(8):1055–1067, Aug 2006.
[126] Xu-Ying Liu and Zhi-Hua Zhou. The influence of class imbalance on cost-sensitive
learning: An empirical study. In ICDM ’06. Sixth International Conference on Data
Mining, pages 970–974, Dec 2006.
[127] Hung-Yi Lo, Ju-Chiang Wang, Hsin-Min Wang, and Shou-De Lin. Cost-sensitive multi-
label learning for audio tag annotation and retrieval. IEEE Transactions on Multimedia,
13(3):518–529, June 2011.
[128] Xianghan Zheng, Zhipeng Zeng, Zheyi Chen, Yuanlong Yu, and Chunming Rong. De-
tecting spammers on social networks. Neurocomputing, 159:27–34, 2015.
[129] Xingyu Gao, Zhenyu Chen, Sheng Tang, Yongdong Zhang, and Jintao Li. Adaptive
weighted imbalance learning with application to abnormal activity recognition. Neuro-
computing, 173, Part 3:1927–1935, 2016.
[130] Chris Drummond and Robert C. Holte. Severe class imbalance: Why better algorithms aren't the answer. In Machine Learning: ECML 2005, 16th European Conference on Machine Learning, Porto, Portugal, October 3-7, 2005, Proceedings, pages 539–546. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005.
[131] Charles X. Ling, Qiang Yang, Jianning Wang, and Shichao Zhang. Decision trees with
minimal costs. In Proceedings of the Twenty-first International Conference on Machine
Learning, ICML ’04, pages 69–, New York, NY, USA, 2004. ACM.
[132] Xiaoyong Chai, Lin Deng, Qiang Yang, and C.X. Ling. Test-cost sensitive naive Bayes
classification. In Fourth IEEE International Conference on Data Mining. ICDM ’04,
pages 51–58, Nov 2004.
[133] Gary Weiss, Kate McCarthy, and Bibi Zabar. Cost-sensitive learning vs. sampling:
Which is best for handling unbalanced classes with unequal error costs? In DMIN,
pages 35–41. CSREA Press, 2007.
[135] Manuel Davy, Frédéric Desobry, Arthur Gretton, and Christian Doncarli. An online
support vector machine for abnormal events detection. Signal Process., 86(8):2009–
2025, August 2006.
[136] Matthew Eric Otey, Amol Ghoting, and Srinivasan Parthasarathy. Fast distributed outlier
detection in mixed-attribute data sets. Data Min. Knowl. Discov., 12(2-3):203–228, May
2006.
[137] Swee Chuan Tan, Kai Ming Ting, and Tony Fei Liu. Fast anomaly detection for streaming
data. In Proceedings of the Twenty-Second International Joint Conference on Artificial
Intelligence - Volume Two, IJCAI'11, pages 1511–1516. AAAI Press, 2011.
[138] Kai Ming Ting, Guang-Tong Zhou, Fei Tony Liu, and James Swee Chuan Tan. Mass
estimation and its applications. In Proceedings of the 16th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’10, pages 989–998, New
York, NY, USA, 2010. ACM.
[139] Alexander Lavin and Subutai Ahmad. Evaluating real-time anomaly detection algorithms
- the Numenta Anomaly Benchmark. CoRR, abs/1510.03336, 2015.
[140] Xin Jin, Yin Guo, Soumik Sarkar, Asok Ray, and Robert M. Edwards. Anomaly detection
in nuclear power plants via symbolic dynamic filtering. IEEE Transactions on Nuclear
Science, 58(1):277–288, Feb 2011.
[141] Anders Riber Marklund and Jan Dufek. Development and comparison of spectral meth-
ods for passive acoustic anomaly detection in nuclear power plants. Applied Acoustics,
83:100–107, 2014.
[142] Seong Soo Choi, Ki Sig Kang, Han Gon Kim, and Soon Heung Chang. Development of
an on-line fuzzy expert system for integrated alarm processing in nuclear power plants.
IEEE Transactions on Nuclear Science, 42(4):1406–1418, Aug 1995.
[143] K Nabeshima, T Suzudo, T Ohno, and K Kudo. Nuclear reactor monitoring with the
combination of neural network and expert system. Mathematics and Computers in Sim-
ulation, 60(3–5):233–244, 2002. Intelligent Forecasting, Fault Diagnosis, Scheduling,
and Control.
[144] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[147] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA, 1999.
[148] Ronan Collobert, Samy Bengio, and Yoshua Bengio. A parallel mixture of SVMs for very large scale problems. Neural Comput., 14(5):1105–1114, May 2002.
[149] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Identifying
suspicious urls: An application of large-scale online learning. In Proceedings of the 26th
Annual International Conference on Machine Learning, ICML ’09, pages 681–688, New
York, NY, USA, 2009. ACM.
[151] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. NIPS 2003 feature selection challenge, 2003.
[154] Thorsten Joachims. A support vector method for multivariate performance measures.
In Proceedings of the 22nd International Conference on Machine Learning, ICML '05,
pages 377–384, New York, NY, USA, 2005. ACM.
[155] Steven C.H. Hoi, Jialei Wang, and Peilin Zhao. LIBOL: A library for online learning algorithms. Journal of Machine Learning Research, 15:495–499, 2014.
[156] Francesco Orabona. DOGMA: a MATLAB toolbox for Online Learning, 2009. Software
available at http://dogma.sourceforge.net.
[158] Shuo Wang, L.L. Minku, and Xin Yao. Resampling-based ensemble methods for on-
line class imbalance learning. IEEE Transactions on Knowledge and Data Engineering,
27(5):1356–1368, May 2015.
[159] Miroslav Kubat and Stan Matwin. Addressing the curse of imbalanced training sets:
One-sided selection. In Proceedings of the Fourteenth International Conference on
Machine Learning, pages 179–186. Morgan Kaufmann, 1997.
[160] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[161] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer Academic Publishers, Boston, Dordrecht, London, 2004.
[162] Neal Parikh and Stephen Boyd. Proximal algorithms. Found. Trends Optim., 1(3):127–
239, January 2014.
[163] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for
linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[164] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society (Series B), 58:267–288, 1996.
[165] Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical
Programming, 140(1):125–161, 2013.
[166] Yu. Nesterov. Gradient methods for minimizing composite objective function. CORE
Discussion Papers 2007076, Université catholique de Louvain, Center for Operations
Research and Econometrics (CORE), 2007.
[167] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In
Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors,
Advances in Neural Information Processing Systems 27, pages 1574–1582. Curran As-
sociates, Inc., 2014.
[168] Chonghai Hu, Weike Pan, and James T. Kwok. Accelerated gradient methods for stochas-
tic optimization and online learning. In Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I.
Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems
22, pages 781–789. Curran Associates, Inc., 2009.
[169] Dayong Wang, Pengcheng Wu, Peilin Zhao, Yue Wu, Chunyan Miao, and S.C.H. Hoi.
High-dimensional data stream classification via sparse online learning. In Data Mining
(ICDM), 2014 IEEE International Conference on, pages 1007–1012, Dec 2014.
[171] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed
optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, 2010.
[172] Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for
large-scale l1-regularized logistic regression. J. Mach. Learn. Res., 8:1519–1555, De-
cember 2007.
[174] Arpit Bhardwaj, Aruna Tiwari, Dharmil Chandarana, and Darshil Babel. A genetically
optimized neural network for classification of breast cancer disease. In 7th International
Conference on Biomedical Engineering and Informatics, BMEI, Dalian, China, October 14-16, pages 693–698, 2014.
[175] Arpit Bhardwaj and Aruna Tiwari. Breast cancer diagnosis using genetically optimized
neural network model. Expert Systems with Applications, 42(10):4611–4620, 2015.
[177] Nico Görnitz, Marius Kloft, Konrad Rieck, and Ulf Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235–262, 2013.
[178] M. Moya, M. Koch, and L. Hostetler. One-class classifier networks for target recognition applications. In Proceedings of the World Congress on Neural Networks, pages 797–801. International Neural Network Society, 1993.