
ANOMALY DETECTION IN BIG DATA

Ph.D. THESIS

by

CHANDRESH KUMAR MAURYA

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


INDIAN INSTITUTE OF TECHNOLOGY ROORKEE
ROORKEE - 247 667 (INDIA)
AUGUST, 2016
ANOMALY DETECTION IN BIG DATA

A THESIS

Submitted in partial fulfilment of the


requirements for the award of the degree

of

DOCTOR OF PHILOSOPHY

in

COMPUTER SCIENCE AND ENGINEERING

by

CHANDRESH KUMAR MAURYA

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


INDIAN INSTITUTE OF TECHNOLOGY ROORKEE
ROORKEE - 247 667 (INDIA)
AUGUST, 2016
Contents

List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Applications to Big Data
  1.3 The Problem Statement and Research Scope
  1.4 Specific Research Contributions
  1.5 Organization of the Thesis

2 Literature Survey
  2.1 Traditional Approaches to Anomaly Detection
    2.1.1 Classification-Based Approaches
    2.1.2 Statistical Approaches
    2.1.3 Clustering-Based Approaches
    2.1.4 Information-Theoretic Approaches
    2.1.5 Spectral-Theory Based Approaches
  2.2 Modern Approaches to Anomaly Detection
    2.2.1 Non-Parametric Techniques
    2.2.2 Multiple Kernel Learning
    2.2.3 Non-negative Matrix Factorization
    2.2.4 Random Projection
    2.2.5 Ensemble Techniques
  2.3 Relevant Algorithms for Anomaly Detection
    2.3.1 Online Learning
    2.3.2 Class-Imbalance Learning
    2.3.3 Anomaly Detection in a Streaming Environment
    2.3.4 Anomaly Detection in Nuclear Power Plant
  2.4 Datasets Used
  2.5 Research Gaps Identified

3 Proposed Algorithm: PAGMEAN
  3.1 Introduction
  3.2 Proposed Algorithm - PAGMEAN
    3.2.1 Problem Formulation
  3.3 Experiments
    3.3.1 Experimental Testbed and Setup
    3.3.2 Performance Evaluation Metrics
    3.3.3 Comparative Study on Benchmark Data Sets
    3.3.4 Comparative Study on Real and Benchmark Data Sets
  3.4 Discussion

4 Proposed Algorithm: ASPGD
  4.1 Introduction
  4.2 Proposed Algorithm - ASPGD
    4.2.1 Problem Formulation and the Loss Function
    4.2.2 Stochastic Proximal Learning Framework
    4.2.3 ASPGD Algorithm
  4.3 Experiments
    4.3.1 Experimental Testbed and Setup
    4.3.2 Comparative Study on Benchmark Data Sets
  4.4 Discussion

5 Proposed Algorithms: DSCIL and CILSD
  5.1 Introduction
  5.2 Experiments
    5.2.1 Experimental Testbed and Setup
    5.2.2 Convergence of DSCIL
    5.2.3 Comparative Study on Benchmark Data Sets
  5.3 Experiments
    5.3.1 Experimental Testbed and Setup
    5.3.2 Convergence of CILSD
    5.3.3 Comparative Study on Benchmark Data Sets
    5.3.4 Comparative Study on Benchmark and Real Data Sets
  5.4 Discussion

6 Unsupervised Anomaly Detection using SVDD - A Case Study
  6.1 Introduction
  6.2 Support Vector Data Description Revisited
  6.3 Experiments
    6.3.1 Experimental Testbed and Setup
    6.3.2 The Dataset
  6.4 Results
  6.5 Discussion

7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Works
List of Figures

1.1 EEG data (heart rate shown with respect to time)
1.2 Anomaly detection in crowd scene [1]
1.3 An example of point anomaly
1.4 Contextual anomaly at t2 in monthly temperature time series [2]
2.1 Anomaly detected by multi-class classifier
2.2 Anomaly detected by one-class classifier
2.3 Box plot showing anomaly [2]
2.4 Proximity-based outlier score with k = 5
2.5 Proximity-based outlier score with k = 1
2.6 Proximity-based outlier score with k = 5
2.7 Proximity-based outlier score with k = 5
2.8 An example of the working of the kernel function. Any non-linearly separable data set can be mapped to a higher-dimensional feature space through the kernel function, where the data can be separated via a linear decision boundary.
2.9 Video activity detection; from left to right: the original frame, background, and foreground [3]
2.10 Outliers marked in red, orientation after projection from 3D to 2D [4]
2.11 Overall infrastructure to support the fusion of anomaly detectors [5]
3.1 Evaluation of Gmean over various benchmark data sets: (a) pageblock (b) w8a (c) german (d) a9a (e) covtype (f) ijcnn1. In all the figures, the PAGMEAN algorithms either outperform or are equally good with respect to their parent algorithm PA and the CSOC algorithm.
3.2 Evaluation of Mistake rate over various benchmark data sets.
3.3 Evaluation of Gmean over various real data sets.
3.4 Evaluation of Mistake rate over various real data sets.
4.1 Evaluation of online average of Gmean over various benchmark data sets: (a) news (b) news2 (c) gisette (d) realsim (e) rcv1 (f) url (g) pcmac (h) webspam.
4.2 Evaluation of mistakes over various benchmark data sets: (a) news (b) news2 (c) gisette (d) realsim (e) rcv1 (f) url (g) pcmac (h) webspam.
4.3 Effect of regularization parameter λ on F-measure on (a) news (b) realsim (c) gisette (d) rcv1 (e) pcmac.
4.4 Effect of learning rate η in the ASPGD algorithm for maximizing Gmean on (a) news (b) realsim (c) gisette (d) rcv1.
4.5 Effect of regularization parameter λ in the ASPGD algorithm for maximizing Gmean on (a) news (b) realsim (c) gisette (d) rcv1.
4.6 Effect of regularization parameter λ in the ASPGD algorithm for minimizing Mistake rate on (a) news (b) realsim (c) gisette (d) rcv1.
5.1 Objective function vs. DADMM iterations over benchmark data sets: (a) ijcnn1 (b) rcv1 (c) pageblocks (d) w8a.
5.2 Gmean versus cost over various data sets for the L-DSCIL algorithm. Cost is given on the x-axis, where each number denotes a cost pair such that 1={0.1,0.9}, 2={0.2,0.8}, 3={0.3,0.7}, 4={0.4,0.6}, 5={0.5,0.5}.
5.3 Gmean versus cost over various data sets for the R-DSCIL algorithm. Cost pairs are encoded as in Figure 5.2.
5.4 Training time versus number of cores to measure the speedup of the R-DSCIL algorithm. Training time in figure (a) is on the log scale.
5.5 Training time versus number of cores to measure the speedup of the L-DSCIL algorithm. Training time in both figures is on the log scale.
5.6 Effect of varying the number of cores on Gmean in the R-DSCIL algorithm.
5.7 Effect of varying the number of cores on Gmean in the L-DSCIL algorithm.
5.8 Gmean versus regularization parameter λ using R-DSCIL: (a) ijcnn1 (b) rcv1 (c) pageblocks (d) w8a (e) news (f) url (g) realsim (h) webspam.
5.9 Objective function vs. iterations over various benchmark data sets: (a) ijcnn1 (b) rcv1 (c) pageblocks (d) w8a. Obj1 denotes the objective function when the best learning rate is searched over {0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3}, while Obj2 denotes the objective value obtained with learning rate 1/L.
5.10 Training time versus number of cores to measure the speedup of the CILSD algorithm. Training time in both figures is on the log scale.
5.11 Gmean achieved by the CILSD algorithm versus number of cores on various benchmark data sets.
5.12 Effect of regularization parameter λ on Gmean: (i) ijcnn1 (ii) rcv1 (iii) gisette (iv) news (v) webspam (vi) url (vii) w8a (viii) realsim. λ varies in {3.00E-007, 0.000009, 0.00003, 0.0009, 0.003, 0.09, 0.3}.
6.1 Support vectors in a two-class classification problem
6.2 Spheres for data generated according to a spherical two-dimensional Gaussian distribution. The center part shows the center of mass of the training points. Anything outside the boundary can be considered an outlier.
6.3 Nuclear power plant data with marked anomalous subsequence
6.4 Anomalies (marked in red) found in detector 1 of the power plant data
6.5 Anomalies (marked in red) found in detector 1 of the power plant data
6.6 Anomalies (marked in red) found in detector 1 of the power plant data
List of Tables

2.1 Summary of datasets used in the experiments in Chapter 3
2.2 Summary of sparse data sets used in the experiments in Chapter 4
2.3 Summary of sparse data sets used in the experiments in Chapter 5
3.1 Evaluation of Gmean and Mistake rate (%) on benchmark data sets. Entries marked * are not statistically significantly different from the entries marked **, at the 95% confidence level on the Wilcoxon rank-sum test.
3.2 Evaluation of Gmean and Mistake rate (%) on real data sets. Entries marked * are not statistically significantly different from the entries marked **, at the 95% confidence level on the Wilcoxon rank-sum test.
4.1 Evaluation of cumulative Gmean (%) and Mistake rate (%) on benchmark data sets. Entries marked * are statistically significantly different from the entries marked **, and entries marked † are NOT statistically significantly different from the entries marked ‡, at the 95% confidence level on the Wilcoxon rank-sum test.
5.1 Performance comparison of CSFSOL, L-DSCIL, R-DSCIL and CSSCD over various benchmark data sets.
5.2 Performance comparison of CSFSOL, L-DSCIL, R-DSCIL and CSSCD over various benchmark data sets.
5.3 Performance comparison of CSFSOL, CSSCD, and CILSD over various benchmark data sets.
5.4 Performance comparison of CSFSOL, CSSCD, and CILSD over various benchmark data sets.
5.5 Performance evaluation of R-DSCIL, CILSD, CSFSOL and CSOGD-I on the KDDCUP 2008 data set.
Chapter 1

Introduction

Data mining is the process of discovering hidden patterns in data through computational
techniques [6-8]. Anomaly detection is one of the sub-fields of data mining. An anomaly is
defined as a state of the system that does not conform to the normal behavior of the sys-
tem/object [2, 9, 10]. For example, the emission of neutrons in a nuclear reactor channel above a
specified threshold is an anomaly. Similarly, suspicious activity of a person at a metro sta-
tion is an anomaly. As a third example, abnormal usage of a credit card is an anomalous
event. These examples indicate that anomaly detection is an important data mining/machine
learning task. The main focus in anomaly detection is to discover unusual patterns in the data.
The task has also been referred to as outlier mining, event detection, exception or contaminant
mining, intrusion detection (IDS) [11-14], fraud detection [15], fault detection [16], etc.,
depending on the application domain. We emphasize here that outlier mining (anomaly detec-
tion) is not a new task; its roots date back to the 19th century [17, 18]. Since then, a great deal
of work by different researchers has resulted in various automated systems for anomaly
detection, e.g., aircraft sensor monitoring, patient health monitoring, credit-card fraud
detection, and video surveillance systems.

The significance of anomaly detection stems from the nature of anomalies, which are often critical
and demand immediate action. For example, a modern aircraft records gigabytes of highly complex
data from its propulsion system, navigation, control system, and pilot inputs,
giving rise to so-called "big data". Analyzing such complex and heterogeneous data indeed
requires automated and sophisticated systems [19-21]. As another example, social media
analytics looks for potentially anomalous nodes in the graph created from its users. Before we
proceed further, let us define anomaly formally [6].

Figure 1.1: EEG data (heart rate shown with respect to time)

Figure 1.2: Anomaly detection in crowd scene [1]


Definition 1. Anomalies are patterns in the data that deviate from the normal behavior or
working of the system.

As can be clearly seen in Fig. 1.1, around time t = 700 there is a sudden drop in heart rate,
which might indicate a potential anomaly in the patient's condition at that time. For
another example of an anomaly, consider the crowd scene in Fig. 1.2: an automobile in a park
where only humans are allowed is an anomaly, and similarly a bicyclist in the park
is an anomaly. Since our aim is to detect such anomalies in big data, we
define big data as:
Figure 1.3: An example of point anomaly (an isolated point in the (Feature 1, Feature 2) plane)

Definition 2. Big data refers to data that is complex in nature and requires substantial computing
resources for its processing.

It should be clear that big data may have a small sample size but a huge number of dimensions, or a
large number of samples with a small number of dimensions (2 to 4, say). Data having both a large
number of dimensions and a large number of samples is trivially big data. An example of big data is
patient-monitoring sensor data that consists of hundreds of attributes of various types.
Next, we describe the various types of anomalies as defined in [2].

• Categorization based on the nature of the anomaly

– Point Anomaly is a point with an unusually high or low value with respect to the other data
instances. For example, in Fig. 1.3, point O denotes a point anomaly in the dataset.
Another example of point anomalies is a small number of malicious transactions in a huge
transaction database. In the present work, we tackle point anomalies only.
– Contextual Anomaly is an instance that is considered anomalous in a specific
context but not in another. For example, as shown in Fig. 1.4, the
monthly temperatures at t1 and t2 are the same, but the temperature at t2 is anomalous
because temperatures are usually high in the month of June.
– Subsequence Anomaly is a collection of contiguous records that are abnormal with
respect to the entire sequence [22], as shown in red in Fig. 1.1.

Figure 1.4: Contextual Anomaly at t2 in monthly temperature time series [2]

• Categorization based on the neighborhood

– Local Anomaly is a point that is quite dissimilar with respect to the values in its
neighborhood.
– Global Anomaly is a point that is distinct with respect to the entire data set.

We emphasize that there has been considerable research on finding global anomalies
in large databases, but work on finding local anomalies is limited. In the present work, we do not
distinguish between local and global anomalies; as we discuss later in the thesis, we take
a different approach to handling anomalies.

1.1 Motivation

Our motivation comes from data-laden domains in industry, where the data is being termed big
data. The natural question that arises is: why anomaly detection in big data? There are numerous
reasons. Big data is often characterized along five dimensions:
Volume, Velocity, Variety, Veracity, and Value (the 5 V's). Below,
we make the meaning of the 5 V's clear.

• Volume: Massive volumes of data (petabytes, even zettabytes) are overwhelming enter-
prises. This leads to gleaning insights from high-dimensional data, where the scalability of
algorithms is challenged [23]. In fact, only a limited amount of work has been done to detect
anomalies in high-dimensional data [23, 24]. The reason is that the curse of dimensionality
prevents traditional anomaly detection techniques from performing effectively. The curse of
dimensionality refers to the problem whereby the notion of nearest neighbors becomes vacuous [25].
• Velocity: Massive data is being generated at a very high rate. A survey by IBM says that about
90% of all the data in the world today has piled up in the last two years alone. For a
time-critical process such as fraud detection, data must be scrutinized as it enters the
enterprise.
• Variety: Big data can be structured or unstructured, for example, sensor data, online
transaction data, audio, video, and click-stream data. For knowledge discovery and new
insights, these datasets must be analyzed together; monitoring hundreds of live video feeds
from CCTV cameras to target points of interest is a daunting task.
• Veracity: Veracity refers to the truthfulness of the data. The quality of
the data can vary severely, which hampers accurate anomaly detection and analysis.
• Value: Having storage of and access to big data is all well and good, but unless we can turn it
into value, it is useless. In terms of anomaly detection, value means efficiently and accurately
finding anomalies.

Thus, the above points indicate that anomaly detection in big data can be quite tedious and cumber-
some. To date, only a few approaches address the curse of dimensionality, noise,
sparsity, streaming, and heterogeneity while targeting anomalies efficiently.

1.2 Applications to Big Data

Anomaly detection in big data finds several uses in data-laden domains. Some of
them are described below.

• Business: Anomaly detection has tremendous applications in business. For example,
millions or billions of banking transactions happen daily around the world,
giving rise to what is being called big data. Anomaly detection arises in
the form of a very tiny fraction of fraudulent transactions among the large set of normal
transactions. The main challenge is that the data is continuously flowing from one point
to another, is high-dimensional, distributed, and often secure (meaning that not
all data is exposed to the analyst, for confidentiality reasons). The task is to detect
fraudulent transactions on the fly in the big data and possibly prevent them from happening.

• Healthcare: In the healthcare domain, a patient's health is continuously monitored through
various sensors that give the real-time condition of the patient. Due to the unavailability of
sufficient staff, any abnormal situation must be brought to the notice of the
doctors instantly to prevent loss of life. Such a situation can be modeled as an anomaly
detection problem in healthcare data such as the electroencephalogram (EEG), electrocardio-
gram (ECG), etc. The main idea is to detect different abnormal conditions, such as cardiac
arrest, low blood pressure, or low glucose level, in real time.
• Computer Networks and Data Centers: Anomaly detection has several uses in com-
puter networks and data centers. For example, an intrusion detection system (IDS) is a form
of anomaly detection where the task is to find potential denial-of-service (DoS)
attacks, unauthorized access to computing resources, servers, etc. Similarly, anomaly de-
tection in a data center is related to the problem of finding abnormal system conditions
from log files.
• Plant Monitoring: Power plants and nuclear plants are monitored through wireless sensors
distributed in space. The task is to detect abnormal conditions of the plant,
ideally before they occur, since once an anomaly has entered the system it is impossible
to prevent. Anomaly detectors are installed to continuously monitor the proper
working condition of the machines.
• Surveillance: Remote surveillance is another area of anomaly detection. Nowadays,
CCTV cameras can be seen installed at metro stations, shopping malls, pedestrian crossings, etc. in
order to monitor possible malicious activities. For example, CCTV footage is used to
find potential terrorist activity. The data from CCTV comes in the form of videos, which requires
state-of-the-art video processing technology in order to track anomalous activity.
• Satellite Imagery: Satellite images are hyperspectral images and huge in size. They are
used to find water bodies, rare metals, etc. on distant planets and in galaxies. Anomaly
detection here is concerned with finding these rare events in hyperspectral images.

1.3 The Problem Statement and Research Scope


Big data brings with it great opportunity and challenge together. The opportunity comes in
the form of plenty of data from which to glean insightful information. The challenge is that the
data is too huge to mine efficiently using conventional knowledge discovery methods. Our primary
aim is therefore to look at the problems posed by big data. We formulate our objective as
follows:
"To efficiently detect anomalies in big data which is sparse, high-dimensional, streaming
and distributed."
Since tackling all of the characteristics of big data together is a complicated task, we
make some assumptions and consider different scenarios of big data characteristics. Our
assumptions are as follows:

– Our proposed algorithms work with numeric data only.
– Concept drift [26, 27] (sudden change in the data distribution or concepts) is not han-
dled in the thesis.
– We tackle only point anomalies.

Along the line of solution methodologies, we adopt the class-imbalance learning approach
to solve the point anomaly detection problem. We argue that the class-imbalance learning
problem is similar to the point anomaly detection problem and can therefore be used to
solve it. Works that have followed this approach include [28-32], and the approach is
discussed in detail in [33].
In order to detect anomalies in big data, the following scenarios are considered
in the present work:

– Scenario 1: To detect anomalies when the data is streaming.
– Scenario 2: To detect anomalies when the data is streaming, sparse, and high-
dimensional.
– Scenario 3: To detect anomalies when the data is sparse, high-dimensional, and
distributed.

By efficiently solving the problem, we mean that the technique is able to detect anomalies in
a timely manner, is scalable (in terms of number of instances as well as dimensions),
and incurs few false positives and few false negatives. In other words, we aim to achieve a
higher Gmean and a lower Mistake rate (defined in the next chapter) than the existing
methods in the literature.
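
Gmean and the Mistake rate are defined formally in the next chapter; as an informal sketch, assuming the usual definition of Gmean as the geometric mean of sensitivity and specificity, they can be computed from binary predictions as follows:

import numpy as np

def gmean_and_mistake_rate(y_true, y_pred):
    # Labels in {-1, +1}; +1 denotes the anomalous (minority) class.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    gmean = np.sqrt(sensitivity * specificity)        # geometric mean of the two rates
    mistake_rate = (fp + fn) / len(y_true)            # fraction of misclassified points
    return gmean, mistake_rate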

1.4 Specific Research Contributions


Our contributions are as follows:

– To solve the problem of anomaly detection in big data in Scenario 1, we propose
an algorithm based on online learning, referred to as Passive-Aggressive GMEAN
(PAGMEAN). PAGMEAN is an improved version of the popular Passive-Aggressive (PA)
online learning algorithm of [34]; the PA algorithm is sensitive to outliers, as shown in
Chapter 3 (see also the sketch of its update rule at the end of this section). To alleviate
this problem, the PAGMEAN algorithm utilizes a modified hinge loss function that is a
convex surrogate for the 0-1 loss function, which in turn is obtained by directly
optimizing the Gmean performance metric. The challenge here is that the Gmean met-
ric is non-decomposable, i.e., it cannot be written as a sum of losses over
individual data points, and hence traditional classification models based on statisti-
cal learning cannot be used. We show the efficiency and effectiveness of the pro-
posed algorithms on several popular real and benchmark datasets and demonstrate
their competitiveness with respect to the state-of-the-art algorithms in the literature.
– The PAGMEAN algorithm discussed above is not able to handle high-dimensional
and sparse data. Therefore, we tackle Scenario 2 using the Accelerated Stochastic
Proximal Gradient Descent (ASPGD) algorithm. Specifically, we use a smooth ver-
sion of the modified hinge loss used within the PAGMEAN algorithm. A smooth loss
function gives us the freedom to employ any gradient-based algorithm; for that pur-
pose, we use the stochastic proximal learning framework with Nesterov's
acceleration. L1 regularization is used to handle the sparsity. In the experiment
section, we show encouraging results on several benchmark and real data sets and
compare with the state-of-the-art techniques in the literature.
– In order to address the sparse, high-dimensional, and distributed nature of big data in
Scenario 3, we propose two novel distributed algorithms called Distributed Sparse
Class-Imbalance Learning (DSCIL) and Class-Imbalance Learning on Sparse data
in a Distributed environment (CILSD). DSCIL is based on the distributed alternat-
ing direction method of multipliers (DADMM) framework. The loss function used
in DADMM is a cost-sensitive, smooth, and strongly convex hinge loss. Because DSCIL
converges only linearly, we also propose the CILSD algorithm. CILSD uses the same
loss function as DSCIL but is based on a FISTA-like update rule in a distributed en-
vironment. Since the FISTA algorithm is known to converge quadratically [35], we
obtain a faster algorithm (CILSD) than DSCIL. We demonstrate the efficiency and ef-
fectiveness of these algorithms in terms of various metrics such as Gmean, F-measure,
speedup, and training time, and compare with the state-of-the-art techniques in the
literature. We also show a real-world application of the proposed algorithms on the KDD
Cup 2008 anomaly detection challenge data set.
– Our algorithms based on online learning and distributed learning are supervised
machine learning algorithms, i.e., they require labeled data for the normal as well as the
anomalous class. However, in the real world, data is often noisy and unlabeled,
so the aforementioned techniques cannot be applied directly in a real-world setting.
We seek a solution for a real-world problem where our data comes from nuclear
reactor channels (obtained from Bhabha Atomic Research Centre, Mumbai, India).
The data set contains counts of neutron emissions from the nuclear reactor, and the
task is to find at what point in time a particular channel was behaving abnormally.
Since this is an unsupervised learning task, we utilize the support vector data description
(SVDD) algorithm [36, 37] for finding anomalies.
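
For reference, the update rule of the PA-I variant of [34] illustrates the norm dependence mentioned in the first contribution above: the step size is the hinge loss divided by the squared norm of the instance (capped by an aggressiveness parameter C), so a single large-norm outlier can trigger a disproportionately large update. The following is a minimal sketch of plain PA-I, not of the PAGMEAN variant proposed in this thesis:

import numpy as np

def pa1_update(w, x, y, C=1.0):
    # One PA-I online update: on a hinge-loss violation, step toward
    # correcting the mistake with size tau = min(C, loss / ||x||^2).
    # The division by ||x||^2 is the norm dependence that makes plain
    # PA sensitive to large-norm outliers.
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss > 0.0:
        tau = min(C, loss / np.dot(x, x))
        w = w + tau * y * x
    return w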

1.5 Organization of the Thesis

The thesis is divided into seven chapters. Each chapter can be read independently, without
going back and forth. The content of each chapter is described below.

Chapter 1: This chapter gives an introduction to the proposed work. In particular, we discuss
what an anomaly is, why anomaly detection is important, the various kinds of anomalies, and how
the result of an anomaly detection algorithm is reported. We also establish the connection
between anomaly detection and related problems such as outlier detection and the
class-imbalance problem. This chapter also covers the motivation and contributions of the
work.

Chapter 2: In this chapter, we exhaustively review related work on anomaly detection. In
particular, statistics-based, clustering-based, density-based, nearest-neighbor-based, and
information-theoretic anomaly detection techniques are discussed in depth. We argue why
traditional techniques for anomaly detection fail on large data sets. Then, we take a digression
and discuss some non-conventional techniques for anomaly detection. Specifically, multiple-
kernel-learning-based, non-negative-matrix-factorization-based, random-projection-based,
and ensemble-based anomaly detection techniques and their key limitations are discussed. From
these discussions, we identify research gaps, which we fill with our contributions.

Chapter 3: This chapter begins with a literature survey of algorithms targeted at anomaly
detection in streaming data, followed by their limitations. Then, we propose our first online
algorithm for class-imbalance learning and anomaly detection. In particular, passive-aggressive
(PA) algorithms [34], which have been successfully applied in the online classification setting, are
sensitive to outliers because of their dependence on the norm of the data points, and as such can-
not be applied to the class-imbalance learning task. In the proposed work, we make them insensitive
to outliers by utilizing a modified hinge loss that arises out of the maximization of the Gmean metric,
and derive new algorithms called Passive-Aggressive GMEAN (PAGMEAN).
The derived algorithms are found to either outperform or perform equally well compared
with some state-of-the-art algorithms in terms of Gmean and mistake rate over various
benchmark data sets in most cases. An application to online anomaly detection on real-world
data sets is also presented, which shows the potential of the PAGMEAN algorithms for
real-world online anomaly detection tasks.

Chapter 4: The work presented in Chapter 3, although scalable to a large number of samples,
does not exploit sparsity in the data, which is not uncommon these days. In this chapter, we pro-
pose a novel L1-regularized smooth hinge loss minimization problem and an algorithm based on
the accelerated stochastic proximal learning framework (called ASPGD) to solve it.
We demonstrate the application of proximal algorithms to real-world problems (class
imbalance, anomaly detection), their scalability to big data, and their competitiveness with
recently proposed algorithms in terms of Gmean, F-measure, and Mistake rate on several
benchmark data sets.

Chapter 5: This chapter begins with a survey of techniques developed for handling anomaly
detection in large, sparse, high-dimensional, distributed data and their limitations. Then, we
describe our proposed framework. The DSCIL and CILSD algorithms are described in detail, fol-
lowed by their distributed implementation in the MPI framework. Finally, we show the efficacy of
the proposed approaches on benchmark and real-world data sets and compare their performance
with the state-of-the-art techniques in the literature.

Chapter 6: Chapter 6 elucidates the application of the support vector data description algorithm
to finding anomalies in real-world data. We illustrate the working mechanism of the algorithm
and show experimental results on nuclear power plant data.

Chapter 7: This chapter summarizes our main findings and ends with some open problems that
we plan to explore in the future.
Chapter 2

Literature Survey

In this chapter, we present relevant works that have addressed the problem of anomaly detection
in general. In particular, we discuss the research in three flavors: the first is based on
traditional approaches to anomaly detection, the second on modern approaches,
and the third on online learning. The reason for this
categorization is that anomaly detection has been tackled in statistics as outlier detection since
1969 [18], while more recent research utilizes
state-of-the-art machine learning approaches to the problem. Our discussion
treats anomaly detection, outlier detection, and novelty detection as the same problem.
Further, we also present the literature that has used class-imbalance learning to address
the point anomaly detection problem.

2.1 Traditional Approaches to Anomaly Detection

As mentioned in Chapter 1, anomaly detection is not a new task; a lot of work has been
done in the statistics community. The ultimate goal of anomaly detection is to find any
outlying pattern in the data. For example, given a dataset from healthcare, the key task is to find
any anomalous point or subsequence (the latter, for instance, in gene expression data). Traditional
approaches to anomaly detection are known by the umbrella term "outlier detection" [2]. Some
literature distinguishes between anomaly detection and outlier detection, such as
[38]; however, owing to the large overlap, we make no distinction between the two terms
in the survey below.

Now, let us look at traditional approaches to anomaly detection, study some of the algo-
rithms developed, and see what kinds of issues they aim to solve. Broadly speaking, they can be
categorized based on the use of available data labels: supervised, semi-supervised, and unsu-
pervised.

• Supervised: In the supervised setting of anomaly detection, the common assumption is
that labeled data is available for both the normal and the anomalous class. First, a model is
built on the labeled training data. Any unseen data instance is then given to the model,
which predicts its class label. The main challenges associated with the supervised mode of
anomaly detection are that (i) data from the anomalous class is often rare (imbalanced), and (ii)
getting representative samples from the anomalous class is difficult [2]. In the present
work, we adopt the supervised mode of anomaly detection under the class-imbalance
setting. Later in this chapter, we discuss the classification-based approaches that fall
under the supervised setting of anomaly detection.
• Semi-supervised: The semi-supervised setting assumes that a small amount of labeled data
from the normal class and a large amount of unlabeled data are available. Because of the avail-
ability of some labeled data from the normal class, semi-supervised techniques often
perform better than their supervised counterparts [39]. Another advantage of needing only a small
amount of labeled normal instances is that the semi-supervised mode is far more widely applicable
than the supervised mode. The typical approach adopted in the semi-supervised mode is to build
a model that captures the normal behavior of the system, so that any deviation from normal
behavior raises an alarm. In another line of work, the model is built using
data from the anomalous operation of the system [40, 41]. However, it is
difficult to characterize all the abnormal behaviors of a system, and therefore such
approaches are rarely employed in practice.
• Unsupervised: The unsupervised setting assumes that normal instances are available in
abundance whereas anomalous instances are rare. A false alarm is raised if this assumption
turns out to be wrong [2]. Because of the less strict assumptions on the availability
of labeled instances, the unsupervised mode is more generic than the supervised
and semi-supervised modes. Later in this chapter, we discuss clustering-, density-, and
nearest-neighbor-based approaches, which work in the unsupervised mode.

After anomalies have been found, we need a way to report the result. An anomaly detection
algorithm typically outputs its result in one of the following two ways:

• Scores: A score reports the degree of outlierness of a data instance. Some techniques
take a high score to indicate a high degree of outlierness, and some assume the reverse.
• Labels: A label is used when the number of anomaly categories is small and the algorithm
reports only whether a data instance is anomalous or normal.

Below, we describe the various approaches that have been developed in the statistics and data
mining communities for anomaly detection. These are:

• Classification-Based Approaches
• Statistical Approaches
• Clustering-Based Approaches
• Information-Theoretic Approaches
• Spectral-Theoretic Approaches

2.1.1 Classification-Based Approaches

Classification-based anomaly detection techniques are built on the assumption that labeled
instances (for both the normal and the anomalous class) are available for learning a model
(typically a classifier). They work in two phases: (i) a model is trained using data from both
the normal class and the anomalous class; (ii) the trained model is presented with unseen data
and predicts its class label. Techniques falling under classification-based approaches are
classified as either one-class or multi-class classification.

In the Multi-Class Classification setting, a learner (classifier) is trained on labeled data comprising
various labels corresponding to normal classes, as shown in Fig. 2.1. For example, we can
model anomaly detection by training several binary classifiers, where one class is normal
and the rest of the classes are treated as anomalous. During testing, the unseen example is
presented to each of the binary classifiers and declared an anomaly if (i) none of the
classifiers predicts it as normal, or (ii) a majority vote decides on abnormality. In some
cases, classifiers assign a confidence score to the unseen data and declare it an anomaly if the
confidence score is below some threshold. Popular classifiers used for multi-class classification
based anomaly detection include neural networks [42], Bayesian networks [43, 44], Support
Vector Machines (SVMs) [45, 46], and rule-based classifiers [47, 48].

Figure 2.1: Anomaly detected by multi-class classifier

One-Class Classification builds a discriminatory model using only labeled data from normal in-
stances, as shown in Fig. 2.2. In this setting, the learner draws a boundary around the normal
instances and leaves abnormal instances outside it. Indeed, complex boundaries can be drawn
using non-linear models such as kernel methods and neural networks. Popular techniques for
anomaly detection using one-class classification are the one-class SVM and its various extensions
[49, 50], one-class kernel Fisher discriminants [51], and support vector data description (SVDD) [36].
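
As an illustration of the one-class setting, the following minimal sketch uses scikit-learn's OneClassSVM, a standard implementation of the one-class SVM; the data and hyperparameters are illustrative assumptions, and this is not the method proposed in this thesis:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))           # training data: normal class only
test = np.vstack([rng.normal(0.0, 1.0, size=(10, 2)),  # unseen normal points
                  rng.uniform(4.0, 6.0, size=(5, 2))])  # a few obvious outliers

clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(normal)
labels = clf.predict(test)             # +1 = normal, -1 = anomaly
scores = clf.decision_function(test)   # signed distance to the learned boundary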

Pros and Cons of Classification-Based Approaches: The performance of a classification-based
technique on the anomaly detection task depends on the generalization ability of the classifier on un-
seen data. There are many robust linear and nonlinear classifiers with provable guarantees that
they will find a decision boundary between normal and abnormal instances if one exists.

The major disadvantages of classification-based techniques are the availability of labeled train-
ing data and the training time. If there are not sufficient examples from the
abnormal classes, it is very hard to build a meaningful decision boundary. Secondly, in real-
time anomaly detection applications, the classifier is expected to produce an anomaly score within
reasonable time limits. Thirdly, such techniques assign a class label to each test data point, which
may become a disadvantage in cases where a meaningful anomaly score is required.

Figure 2.2: Anomaly detected by one-class classifier

2.1.2 Statistical Approaches

Statistical approaches to anomaly detection are model-based. That is, it is assumed that the data
comes from some (unknown) distribution. A model is built by estimating the parameters
of the probability distribution from the data; an anomalous object is then one that does not
fit the model well. Statistical anomaly detection techniques are based on the following
assumption:

Assumption 1. Normal data instances reside in the high-probability region of the stochastic
model, while anomalous data instances reside in the low-probability region of the stochastic
model.

In a clustering problem, anomalous objects are those that do not fall into any particular cluster and
lie in a low-density region (as we shall see in the clustering-based approach, this is not always the
case and the problem becomes tricky). Likewise, in a regression problem, anomalies fall far
from the regression line.

Briefly, we describe some of the techniques under this category.

• Box Plot Rule: The box plot has been used to detect anomalies in univariate and multivariate
data and is perhaps the simplest anomaly detection technique. A box plot shows
several statistical estimates on one graph: the largest non-anomalous value, the upper quartile (Q3),
the median, the lower quartile (Q1), and the smallest non-anomalous value, as shown in Fig. 2.3. The
difference Q3 − Q1 is called the Inter-Quartile Range (IQR). A data point that lies more than
1.5 IQR above Q3 or below Q1 is declared an anomaly; for normally distributed data, the
remaining range covers about 99.3% of the data. Works that have used the box plot rule to identify
anomalies include [52-54].

Figure 2.3: Box plot showing anomaly [2].
• Univariate Gaussian Distribution: As described previously, much real data
can be modeled using standard distributions. When the data set is large (social media
data, aircraft navigation data, etc.), it is often assumed that the data follows the normal distri-
bution. The model parameters, mean µ and standard deviation σ, are computed from the data
using the maximum likelihood principle. A point with attribute value x is then declared an
outlier at confidence level α if |x − µ| exceeds a threshold c chosen so that
p(|x − µ| ≥ c) = α under the fitted model. The main difficulty with the univariate Gaussian
assumption is choosing the parameters of the model using sampling theory; as a result, the
accuracy of the prediction is reduced.
• Multivariate Gaussian Distribution: The univariate Gaussian assumption is applicable to
univariate data. In order to handle multivariate data, a multivariate Gaussian
assumption is used. A point is classified as normal or anomalous
depending on whether its probability under the fitted distribution is above or below a certain
threshold. Since multivariate data tends to have high correlation among its different
attributes, the distribution is stretched unevenly along different directions. To cope with this,
we need a metric that takes the shape of the distribution into account. The Mahalanobis
distance is such a metric, given by (2.1) (a small code sketch follows this list):

$$ \text{MahalanobisDist}(x, \bar{x}) = (x - \bar{x})\, S^{-1} (x - \bar{x})^{T} \qquad (2.1) $$

where x is the data point, x̄ is the mean of the data, and S is the covariance matrix.
• Mixture Model Approach: The mixture model is a widely used technique for modeling
problems in which the data is assumed to be generated from several distributions, each
contributing with a certain probability, as given by (2.2):

$$ p(x; \Theta) = \sum_{i=1}^{m} w_i \, p_i(x \mid \Theta) \qquad (2.2) $$

where Θ is the parameter vector and w_i is the weight given to the i-th component of the mixture.
The basic idea of a mixture model for the anomaly detection task is the following. Two sets
of objects are created: one for normal objects, N, and the other for anomalous objects,
A. Initially, set N contains all the objects and set A is empty. An iterative procedure
is applied that moves anomalous objects from set N to set A, and the algorithm stops as soon
as there is no change in the likelihood of the data. Eventually, set A holds the
anomalous objects and set N the normal objects.
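
As an illustration of anomaly scoring under the multivariate Gaussian assumption, the following minimal sketch computes the Mahalanobis distance of (2.1) for each point; the injected anomaly and the 99th-percentile threshold are illustrative choices, not prescriptions from the works cited above:

import numpy as np

def mahalanobis_scores(X):
    # (x - mean) S^{-1} (x - mean)^T for every row of X, as in Eq. (2.1);
    # larger scores indicate points that fit the Gaussian model poorly.
    mean = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mean
    return np.einsum("ij,jk,ik->i", diff, S_inv, diff)

X = np.random.default_rng(1).normal(size=(500, 3))
X[0] = [8.0, 8.0, 8.0]                                       # inject one anomaly
scores = mahalanobis_scores(X)
anomalies = np.where(scores > np.quantile(scores, 0.99))[0]  # flag the top 1%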

Pros and Cons of Statistical Approaches: Below, we describe the pros and cons of using
the statistical approach for the anomaly detection task.

• If the distribution underlying the data can be estimated accurately, statistical techniques
provide a reasonable solution for anomaly detection.
• Statistical techniques can be used in an unsupervised mode, provided the distribution es-
timation is robust to anomalies.
• The anomaly score produced by statistical techniques is often equipped with a confidence
interval, which may be used to gain better insight into the decision for a test
instance.

Some of the cons associated with statistical techniques for anomaly detection are:

• The major challenge in using statistical techniques is that they assume that the data is
distributed according to a particular distribution. However, this assumption rarely holds
for real-world data, and the problem becomes severe in big data.
• The final decision about a test instance, whether to declare it an anomaly or not, depends on the
test statistic used, the choice of which is non-trivial [55].
• They are unable to detect anomalies in streaming, sparse, or heterogeneous data effi-
ciently.

2.1.3 Clustering-Based Approaches

Clustering is the process of grouping data into different clusters based on some similarity cri-
terion. Clustering-based approaches to anomaly detection are not new; in fact, outliers are
found as a by-product of clustering, provided the outliers do not form a coherent, compact group
of their own. Clustering-based approaches have two subcategories, namely proximity-based
and density-based approaches, which are described in Sections 2.1.3.1 and 2.1.3.2. Cluster-
ing-based techniques can be put into three categories depending on the assumption made by
different researchers about anomalies [2].

Assumption 2. Normal instances form coherent clusters while anomalies do not.

Techniques built on the above assumption run a clustering algorithm on the given data set and
report as anomalous those points that cannot be assigned to any cluster by the algorithm. Notable
works based on this assumption are [56, 57]. Clustering algorithms that do not require data
instances to necessarily belong to some cluster, such as ROCK [58], DBSCAN [59], and SNN [60],
can be used under this assumption.

Assumption 3. Normal records lie close to their closest cluster centroid, while anomalous
records lie far away from it.

Techniques based on this assumption work in two steps. In the first step, a clustering algorithm
runs over the entire data so as to form natural clusters. In the second step, an anomaly score is
calculated for each data point as its distance from the closest centroid (see the sketch below). A
noteworthy point is that clustering algorithms based on this assumption can be executed in either
an unsupervised or a semi-supervised setting. In [61], the authors propose a rough-set and
fuzzy-clustering based approach for intrusion detection. One limitation of techniques based on
this assumption is that they fail to find anomalies that form a homogeneous group of their own.
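
The two-step procedure under Assumption 3 can be sketched as follows, using k-means centroids and the distance to the closest centroid as the anomaly score; the cluster count and the three-sigma threshold are illustrative assumptions, not prescriptions from the works cited above:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(300, 2))
X[:3] += 10.0                                   # three records far from all centroids

# Step 1: form natural clusters. Step 2: score each point by its
# distance to the closest centroid (the one it was assigned to).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
outliers = np.where(dists > dists.mean() + 3 * dists.std())[0]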

Assumption 4. Normal records form large, dense clusters while abnormal records form tiny,
sparse clusters.

Techniques in this subcategory fix a threshold on cluster size or density that enables them to
accumulate outliers in a separate group. One notable work in this category is the cluster-based
local outlier factor (CBLOF) [62]. CBLOF computes two things: (i) the size of the cluster, and (ii)
the distance of the data instance to its cluster centroid. A data instance is declared anomalous
when the size and/or density of the cluster in which it falls is below a certain threshold.

Pros and Cons of Clustering-Based Approaches: The performance of clustering-based outlier
detection depends on the training time. Some clustering algorithms run in quadratic
time; several optimizations have been proposed to reduce this to linear time O(Nd), but
they are approximation algorithms.

Some salient features of clustering-based anomaly detection approaches are the following:

• They perform well in unsupervised and semi-supervised settings.
• They can be applied to complex data types simply by changing the baseline clustering algo-
rithm.

The downsides, however, include the following:

• Performance depends on how well the underlying clustering algorithm cap-
tures the intrinsic structure of the data.
• Many techniques are not optimized for anomalies.
• Their performance hinges on the assumption that anomalies do not form sig-
nificant clusters.

2.1.3.1 Proximity-Based

Proximity-based approaches (also known as nearest-neighbor based approaches), as the name
suggests, rely on some metric for computing the proximity (distance) be-
tween data points. Clearly, the performance of these techniques depends on how good the metric
is. The idea of the proximity-based approach is simple: anomalous objects are points that are
far away from most of the other points. The general strategy to compute the score is to use k-
nearest-neighbor techniques, with the outlier score of an object given by its distance to its
k nearest neighbors [6]. For example, in Fig. 2.4, the red-marked point has a very large outlier
score compared with the green points.
Figure 2.4: Proximity-based outlier score with k = 5 (the isolated red point scores 5.0; the clustered green points score in [0, 1])

We note that the outlier score of a data point depends on the value of k, the number of nearest
neighbors. If the value of k is too small, say 1, then a small group of neighboring outliers
will assign each other small outlier scores, hampering the performance of the algorithm (see Fig.
2.5). On the other hand, a large value of k can cause a whole group of points with fewer than k
neighbors to become outliers (Fig. 2.6).

Proximity-based anomaly detection using the sum of distances of a given data point to
its k nearest neighbors as the anomaly score has been used in [63-65]. Nearest-neighbor based
anomaly detection similar to the aforementioned technique has been used to detect fraudulent
credit card transactions in [66]. A sketch of this scoring scheme is given below.
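
The following minimal sketch computes the k-nearest-neighbor outlier score described above; here the score is the average distance to the k nearest neighbors, which differs from the sum used in [63-65] only by a constant factor:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    # Outlier score = average distance to the k nearest neighbors.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
    dists, _ = nn.kneighbors(X)
    return dists[:, 1:].mean(axis=1)                 # drop the zero self-distance

X = np.random.default_rng(3).normal(size=(200, 2))
X[0] = [6.0, 6.0]                                    # an isolated point
scores = knn_outlier_scores(X, k=5)                  # X[0] receives the largest score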

Pros and Cons of Proximity-Based Approaches: Proximity-based approaches are simple and
easy to apply in comparison with statistical techniques. However, their running time is of the order
of O(n²), which makes them less efficient over high-dimensional data comprising millions or
billions of points. Secondly, the outlier score is sensitive to the value of k, which is hard to
determine in practice. In addition, they perform poorly over clusters of varying densities. To
see this, consider the outlier scores of points A and B in Fig. 2.7. Point A is correctly
identified as an outlier, but point B receives an outlier score even lower than the points in the
green cluster. The reason is the differing sparsity and density of the clusters.

Figure 2.5: Proximity-based outlier score with k = 1

Figure 2.6: Proximity-based outlier score with k = 5

Figure 2.7: Proximity-based outlier score with k = 5 (point A scores 5.0; point B scores 1.1; the points of the dense green cluster score in [0, 1.5])

2.1.3.2 Density-Based

The density-based outlier detection scheme states that outliers are found in sparse regions. In fact,
the density-based approach is similar to the nearest-neighbor based approach in the sense that density
can be calculated as the inverse of the distance to the k nearest neighbors: the k-nearest-
neighbor distance of a data instance is the radius of the hypersphere, centered
at the instance, that encloses k other points, and taking the reciprocal of this distance gives the
density of the point. Nevertheless, a plain density-based technique cannot solve the issue of varying
densities, for reasons similar to those described for the proximity-based approach. Hence, the
concept of relative density is introduced, given by (2.3):

$$ \text{avg-relative-density}(x, k) = \frac{\text{density}(x, k)}{\sum_{y \in N(x,k)} \text{density}(y, k) \,/\, |N(x, k)|} \qquad (2.3) $$

where density(x, y) is the density of the point x. LOF (local outlier factor) proposed by Breunig
et. al. [67] uses the concept of relative density. LOF score for a point is equal to the ratio of
the average relative density of the k-nearest neighbor of the point and local density of the data
point. The local density of a data point is found by dividing k (the number of nearest neighbors)
to the volume of the hypersphere containing k-data points centered at the data instance. Clearly,

the local density of normal points lying in the dense region will be high while the local density
of anomalous points in the sparse region will be low.
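A minimal sketch of the relative-density score of (2.3) follows, with density estimated as the inverse of the mean k-nearest-neighbor distance (the helper name is ours; scikit-learn's LocalOutlierFactor implements the closely related LOF of [67]):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def avg_relative_density(X, k=5):
    """Eq. (2.3): density(x, k) divided by the mean density of x's
    k nearest neighbors. Values well below 1 suggest outliers."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)
    # Density estimated as the inverse mean distance to the k nearest neighbors.
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)
    neighbor_avg = density[idx[:, 1:]].mean(axis=1)
    return density / neighbor_avg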

In the literature, several researchers have worked on variants of LOF. Some have attempted to reduce the original time complexity of LOF below O(N²), some compute local density in a different way, and others have proposed variants of LOF suited to different kinds of data. Very recently, Kai Ming Ting et al. [68] have proposed a novel density estimation technique that has average-case sublinear time complexity and constant space complexity in the number of instances. This order-of-magnitude improvement in performance can deal with anomaly detection in big data. They have also proposed the DEMass-LOF algorithm, which does not require distance calculations and runs in sublinear time without any indexing scheme.

Pros and Cons of Density-Based Approach: Density-based approaches suffer from the same malady as their proximity-based counterparts: a computational complexity of O(N²). To mitigate this, efficient data structures like k-d trees and R-trees have been proposed [69]. Despite this, the modified techniques neither scale over many attributes nor provide an anomaly score for each test instance, if required.

2.1.4 Information-theoretic Approaches

Besides the approaches to anomaly detection mentioned above, there are several other ways of solving the anomaly detection problem. Below, we describe one such approach based on information theory, a branch of applied mathematics and computer science that deals with the quantitative information that can be gleaned from data. It uses measures such as Kolmogorov complexity, entropy, and relative entropy.

Assumption 5. Anomalies infuse erratic information content into the data set.

The basic anomaly detection algorithm in this category works as follows. Let Θ(D) denote the (Kolmogorov) complexity of the given data set D. The objective is to find the minimal subset of instances I such that Θ(D) − Θ(D \ I) is maximal. All the instances thus obtained are declared anomalies.
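As a toy illustration of this scheme, following [70] in using compressed size as a computable stand-in for Kolmogorov complexity, one can greedily remove the record whose deletion shrinks the compressed size the most (the greedy search is our simplification; the exact subset search is exponential):

import zlib

def complexity(records):
    """Proxy for Kolmogorov complexity: length of the zlib-compressed data."""
    return len(zlib.compress("\n".join(records).encode()))

def greedy_outliers(records, budget=1):
    """Repeatedly remove the record whose deletion shrinks the compressed
    size the most; the removed records are reported as anomalies."""
    remaining, found = list(records), []
    for _ in range(budget):
        base = complexity(remaining)
        gains = [base - complexity(remaining[:i] + remaining[i + 1:])
                 for i in range(len(remaining))]
        found.append(remaining.pop(max(range(len(gains)), key=gains.__getitem__)))
    return found

data = ["abcabcabc"] * 20 + ["q7#xZ!p"]  # one erratic record
print(greedy_outliers(data))             # expected: ['q7#xZ!p']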

One notable work under this assumption is [70], in which the author uses the size of the compressed data file as a measure of the dataset's Kolmogorov complexity. [71] utilizes the size of the regular expression to estimate the Kolmogorov complexity. Besides the above, information-theoretic measures such as entropy and relative uncertainty have been used in [72–74].
Pros and Cons of Information-Theoretic Approach: A major drawback of information-theoretic approaches is that they involve a dual optimization: first, minimize the subset size, and second, maximize the decrease in the complexity of the data set. Hence their running time is exponential in the number of data points. Some approximate search techniques have been proposed, for example, a local search algorithm that approximately finds such a subset in O(n) time. The advantage of information-theoretic approaches is that they can be employed in an unsupervised setting and make no assumption about the distribution of the data.

2.1.5 Spectral-Theory Based Approaches

Spectral-theory-based approaches deal with problems in high dimensions. They assume that the data set can be embedded into a much lower-dimensional space while still preserving the intrinsic structure. In fact, they are derived from the Johnson-Lindenstrauss lemma [75] (see the Appendix for the definition).

Assumption 6. The dataset can be projected into a lower-dimensional subspace such that normal and anomalous instances appear significantly different.

A consequence of projecting onto a lower-dimensional manifold is that not only is the data set size reduced, but we can also search for outliers in the latent space, exploiting the correlation among attributes. The fundamental challenge for such techniques is to determine a lower-dimensional embedding that sufficiently distinguishes anomalies from normal instances. This problem is nontrivial because there is an exponential number of subspaces onto which the data can be projected. Notable work in this domain uses Principal Component Analysis (PCA) for anomaly detection in network intrusion [76] and Compact Matrix Decomposition (CMD) for anomaly detection in sequences of graphs [77].
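A common instantiation of the PCA-based idea, sketched below under our own assumptions (scikit-learn, a numeric data matrix X), scores each point by its reconstruction error after projection onto the top principal components:

import numpy as np
from sklearn.decomposition import PCA

def pca_residual_score(X, n_components=2):
    """Anomaly score = squared reconstruction error of each point after
    projecting onto the top principal components; points that violate
    the dominant correlation structure leave large residuals."""
    pca = PCA(n_components=n_components).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return ((X - X_hat) ** 2).sum(axis=1)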

Pros and Cons of Spectral-Theoretic Approach: Dimensionality reduction techniques like PCA run linearly in the data size but quadratically in the number of dimensions. On the other hand, nonlinear techniques run linearly in the number of dimensions but polynomially in the number of principal components [78]. Techniques performing an SVD on the data have O(N²) time complexity.

The advantage of spectral methods is that they are suitable for anomaly detection in high-dimensional data. Also, they can work in unsupervised as well as semi-supervised settings. One disadvantage of spectral techniques is that they separate anomalies from normal instances only if a lower-dimensional embedding exists; another is that they suffer from high computation time.

2.2 Modern Approaches to Anomaly Detection

Traditional approaches to anomaly detection in big data suffer from miscellaneous issues. For example, statistical techniques require the underlying distribution to be known a priori. Proximity-based and density-based approaches require an appropriate metric for calculating the anomaly score and run in time quadratic in the number of data instances. Clustering-based techniques need some kind of optimization to reduce their quadratic time complexity and do not generalize to heterogeneous data. Similarly, information-theoretic and spectral techniques require an appropriate measure of information in the case of the former and a suitable embedding in the case of the latter.

The point is that these techniques cannot handle the high-dimensional, heterogeneous, noisy, streaming, and distributed data that is found ubiquitously today. For example, aircraft navigation data is highly complex, heterogeneous, and noisy in nature, requiring sophisticated tools and techniques for online processing so as to thwart any likely accident. Similarly, mobile phone call records demand the batch processing of millions and billions of calls every day to screen for a potential terrorist attack.

The above points indicate that we need a mechanism that not only reveals potentially anomalous records in complex data but also gives us insight about them; manual knowledge discovery in big data is a nontrivial task. Therefore, in the forthcoming sections we look at techniques that meet some of the aforementioned goals. In particular, we discuss recent approaches to anomaly detection, their results, and their drawbacks. Further, the techniques covered in this chapter are suitable for anomaly detection in unsupervised as well as semi-supervised mode. This is an important point, since real-world data sets are mostly unlabeled.

2.2.1 Non-Parametric Techniques

A non-parametric technique is one in which the number of parameters grows with the size of the data set, or which does not assume that the structure of the model is fixed. Some examples of non-parametric models are histograms, kernel density estimators, non-parametric regression, and models based on the Dirichlet process or the Gaussian process. A key point about non-parametric techniques is that they do not assume that the data come from some fixed but unknown distribution. Rather, they make fewer assumptions about the data and hence are more widely applicable. Some notable works on anomaly detection using non-parametric models are described below.

Liang et al. [3] propose generalized latent Dirichlet allocation (LDA) and a mixture of Gaussian mixture models (MGMM) for unimodal and multimodal anomaly detection on galaxy data. They observe that data, besides being anomalous at the individual level, can also be anomalous at the group level, and hence techniques developed for point anomaly detection fail to identify anomalies at the group level. However, their model is highly complex and learns a large number of parameters using a variational inference method.

In the same line of work, Rose et al. [79] have proposed the group latent anomaly detection (GLAD) algorithm for mining abnormal communities in social media. Their model suffers from the same problem as that of Liang et al. above: it is complex and involves learning a large number of parameters. In [80], the author uses a Gaussian process (GP) for one-class classification, similar to the one-class SVM approach but in a non-parametric way, for identifying anomalies in wire ropes. However, their approach generates false alarms and does not incorporate prior knowledge about the structure of the rope. The same group also combined GPs with kernel functions for one-class classification [81], showing that GPs combined with kernel functions can outperform support vector data description [36] over various data sets.

The potential advantage of non-parametric techniques is that they do not assume the data come from some fixed but unknown distribution. Also, as the number of parameters grows with the size of the input, these techniques can handle the dynamic nature of the data. On the flip side, there is a lack of suitable methods for estimating hyperparameters, such as the kernel bandwidth when a GP prior is combined with a kernel function; without the right kernel bandwidth, the performance of GP methods for anomaly detection is poor. Moreover, cross-validation for finding such parameters is infeasible, since non-parametric techniques can easily involve hundreds or thousands of parameters.

2.2.2 Multiple Kernel Learning

Kernel methods [82] provide a powerful framework for analyzing data in high dimensions (see Fig. 2.8). They have been successfully applied to ranking, classification, and regression over a multitude of data types.

Let us define some terms before delving deeper. A kernel is a function κ that for all x, z ∈ X satisfies

κ(x, z) = ⟨φ(x), φ(z)⟩        (2.4)

where φ is a mapping from the input space X to an (inner-product) feature space F:

φ : x ∈ X ↦ φ(x) ∈ F        (2.5)

Intuitively, this says that a kernel implicitly computes the inner product between two feature vectors in a high-dimensional feature space, i.e., without ever computing the feature mapping explicitly.

Some examples of kernel functions are:

• Gaussian kernel (RBF): κ(x, z) = exp( −‖x − z‖² / (2σ²) )
• Polynomial kernel: κ(x, z) = (1 + xᵀz)ⁿ

Multiple Kernel Learning (MKL) [83] has recently gained a lot of attention in the data mining and machine learning community. The growing popularity of MKL is due to the fact that it combines the power of several kernel functions in one framework. This arouses the curiosity: can we apply MKL to anomaly detection in heterogeneous data coming from multiple sources? The answer has recently been given by empirical results of S. Das et al. [19] at NASA on Flight Operations Quality Assurance (FOQA) archive data. MKL theory is essentially based on the theory of kernel methods [84]. The idea is to use a kernel function satisfying Mercer's condition (see the Appendix for the definition) that finds the similarity between pairs of objects in a high-dimensional feature space. The salient feature is that a valid kernel can compute the similarity between objects of any kind. This leads to the idea of combining multiple kernels into one
kernel and using it for classification, regression, anomaly detection, and other tasks.

Figure 2.8: An example of how a kernel function works. A non-linearly separable data set can be mapped into a higher-dimensional feature space through the kernel function, where the data set can be separated via a linear decision boundary.

MKL learns the kernel from training data. More specifically, it learns the kernel κ as a linear (convex) combination of given base kernels κk, as shown in (2.6):

κ(xi, zj) = Σ_k θk κk(xi, zj)        (2.6)

where θk ≥ 0, k = 1, 2, ..., K. The goal is to learn the parameters θk such that the resultant kernel κ is positive semi-definite (PSD) (see the Appendix for the definition of positive semi-definiteness).
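The resulting pipeline can be sketched as follows with fixed combination weights θ (learning θ jointly requires an MKL solver; the weights, kernel parameters, and random data below are illustrative assumptions, not the configuration of any cited system):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel
from sklearn.svm import OneClassSVM

def combined_kernel(X, Z, theta=(0.5, 0.5)):
    """Convex combination of two base kernels as in Eq. (2.6); a
    nonnegative combination of PSD kernels is itself PSD."""
    return (theta[0] * rbf_kernel(X, Z, gamma=0.1)
            + theta[1] * polynomial_kernel(X, Z, degree=2))

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(200, 10)), rng.normal(size=(20, 10))

# Feed the precomputed Gram matrix to a one-class SVM.
ocsvm = OneClassSVM(kernel="precomputed", nu=0.05)
ocsvm.fit(combined_kernel(X_train, X_train))
labels = ocsvm.predict(combined_kernel(X_test, X_train))  # -1 flags anomalies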

Recently, Verma et al. [85] have proposed Generalized MKL, which can learn millions of kernels over half a billion training points. After learning multiple kernels in a joint fashion, any anomaly detection technique capable of using kernels as a similarity measure, such as the one-class SVM, can distinguish between normal and anomalous instances. In [19], the authors use two base kernels: one for discrete sequences and one for continuous data. The kernel κd corresponding to discrete sequences is computed using the LCS (longest common sub-sequence), while the kernel κc corresponding to continuous data is inversely proportional to the distance between the Symbolic Aggregate Approximation (SAX) [22] representations of the points xi and zj. The combined kernel is fed to a one-class SVM and its performance is compared with two baseline algorithms, namely Orca and SequenceMiner. The results over various simulated and real data demonstrate that the multiple kernel anomaly detection algorithm (MKAD) outperforms the baseline algorithms in terms of detecting different kinds of faults (discrete and continuous).

In [86], the authors use an MKL approach for anomaly detection in network traffic data, which is high-volume, high-dimensional, and non-linear. Essentially, they use sparse and non-sparse kernel mixtures based on the Lp-norm MKL proposed by Kloft et al. [87]. Another work, by Tax et al. [36], is based on support vector data description (SVDD), which builds a hypersphere around the normal data, leaving outliers either on the boundary or outside of it. Their use of a support vector classifier gives an indication that the MKL approach can be exploited to build a hypersphere in a high-dimensional space. This line of motivation led Liu et al. [88] and Gornitz et al. [89] to use MKL for anomaly detection in semi-supervised as well as unsupervised settings.

Thus we see that MKL can nicely tackle the high-dimensional and heterogeneous nature of big data. However, further work needs to be done to explore the possibility of using MKL in streaming and distributed anomaly detection scenarios.

2.2.3 Non-negative Matrix factorization

Non-negative matrix factorization (NNMF), introduced by Lee and Seung in 1999 [90] for image data, has since been used as a technique for anomaly detection. The non-negative matrix factorization problem is posed as follows:

Let A be an m × n matrix whose components aij are non-negative, i.e., aij ≥ 0. Our goal is to find non-negative matrices W and H of sizes m × k and k × n such that (2.7) is minimized:

F(W, H) = ‖A − WH‖²_F / 2        (2.7)

where k ≤ min{m, n} and depends on the specific problem to be solved; in practice, k is much smaller than rank(A). The product WH is called a non-negative matrix factorization of the matrix A. It should be noted that the above problem is non-convex in W and H jointly. Therefore, the algorithms proposed in the literature seek to approximate the matrix A via the product WH, i.e., A ≈ WH. Thus WH represents A in a very compressed form.
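For concreteness, a minimal sketch of Lee and Seung's multiplicative updates for (2.7) follows (the random initialization and iteration count are our choices):

import numpy as np

def nmf(A, k, n_iter=200, eps=1e-9):
    """Lee-Seung multiplicative updates for min ||A - WH||_F^2 / 2.
    Each update preserves nonnegativity and never increases the objective."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, k)), rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H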

After Lee and Seung's initial NNMF algorithm based on a multiplicative update rule, several variants have been proposed to solve (2.7): for example, modified multiplicative updates [91], projected gradient descent [92], alternating least squares (ALS) [93], alternating non-negative least squares (ANLS) [92], and Quasi-Newton methods [94].

Figure 2.9: Video activity detection. From left to right: the original frame and the backgrounds extracted by SVD, RPCA, and DRMF [3].

Recently, Liang et al. [3] proposed direct robust matrix factorization (DRMF) for anomaly detection. The basic idea is to exclude some outliers from the initial data and then ask the following question: what is the optimal low rank obtainable if some of the data is ignored? They formulate this as an optimization problem with constraints on the cardinality of the outlier set and on the rank of the matrix. Essentially, they solve the problem shown in (2.8):

minimize over L, S:   ‖(X − S) − L‖_F
subject to:   rank(L) ≤ K,   ‖S‖₀ ≤ e        (2.8)

where S is the outlier set and L is the low-rank approximation to X; K is the desired rank and e is the maximal number of nonzero entries in S. ‖·‖_F denotes the Frobenius norm of a matrix (the square root of the sum of squares of its elements). In the matrix factorization paradigm, solving optimization problems involving a rank or set-cardinality constraint is nontrivial. The authors use the fact that the problem is decomposable in nature and hence solvable by a block-coordinate descent algorithm [95]. They use the DRMF algorithm to separate background from foreground (noise), which has applications in video surveillance (see Fig. 2.9).
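The alternating structure of the solution can be sketched as follows (our simplification of the block-coordinate scheme: with S fixed, L is the rank-K truncated SVD of the cleaned data X − S; with L fixed, S keeps the e largest-magnitude residual entries):

import numpy as np

def drmf(X, K, e, n_iter=20):
    """Block-coordinate sketch of problem (2.8)."""
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # L-step: best rank-K approximation of the cleaned data X - S.
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :K] * s[:K]) @ Vt[:K]
        # S-step: put the e largest-magnitude residuals into the outlier set.
        R = X - L
        S = np.zeros_like(X)
        top = np.argsort(np.abs(R), axis=None)[-e:]
        S.flat[top] = R.flat[top]
    return L, S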
In another work, Allan et al. [96] use NMF to generate feature vectors that can be used to cluster text documents. More specifically, they exploit the NMF-derived concept of a sum-of-parts representation, which exposes term-usage patterns in a given document. The coefficient matrix factors and features thus obtained are used to cluster documents, and this procedure maps anomalous training documents to feature vectors. In addition to locating outliers in a latent subspace, NMF has been used to interpret outliers in that subspace as well. In this domain, the work of Fei et al. [97] is significant: they combine NNMF with subspace analysis so that outliers are not only found but can also be interpreted.

From the above discussion, we see that non-negative matrix factorization can be employed for anomaly detection in large and sparse data. However, the suitability of NMF for anomaly detection in streaming, heterogeneous, and distributed settings is still unexplored.

2.2.4 Random Projection

In Section 2.1.5, we discussed spectral techniques for detecting outliers; these are based on classical ideas such as PCA and CMD. Random projection is also rooted in spectral theory; the main motivation for presenting it separately is the recent surge in theory and algorithms for random-projection-based techniques (see, for example, [98, 99]). Random projection pursuit is a spectral technique that looks for anomalies in a latent subspace. It is particularly suitable for high-dimensional data with noise and redundancy; the assumption is that the effective dimensionality in which outliers and normal data reside is very small. The technique works on the principle of the multiple subspace view [100]. The basic idea is to project high-dimensional data into a lower-dimensional subspace such that outliers stand out even after projection (see Fig. 2.10). In Fig. 2.10, after projection, point 1 loses its identity as an outlier, points 2 and 4 remain outliers, and points 3 and 5 are unaffected by the projection.

Note that the projection is done to reduce the dimensionality of the data set. Subsequently, we can apply any anomaly detection approach, provided certain conditions are met, as described below.

• Condition 1: The projection should preserve the pairwise distances between data points with very high probability (an informal statement of the Johnson-Lindenstrauss lemma).
• Condition 2: The projection should preserve the distance of points to their k-nearest neighbors (a result from the work of Vries et al. [24]).

Figure 2.10: Outliers marked in red; their orientation after projection from 3D to 2D [4].

Note that Condition 1 is applicable to approaches that use some metric (e.g., distance) to compute outliers, whereas Condition 2 applies to the proximity-based approaches discussed in Section 2.1.3.1.
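Condition 1 is easy to check empirically. The following sketch (our illustration, assuming scikit-learn; the choice of 200 projected dimensions is arbitrary) projects the data and measures the distortion of pairwise distances before handing the reduced data to any distance-based detector:

import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10000))               # high-dimensional data
X_low = GaussianRandomProjection(n_components=200,
                                 random_state=0).fit_transform(X)

# Condition 1: pairwise distances survive projection up to small distortion.
iu = np.triu_indices(50, 1)
ratio = euclidean_distances(X_low[:50])[iu] / euclidean_distances(X[:50])[iu]
print(ratio.min(), ratio.max())                  # both close to 1

# Any distance-based detector (e.g., k-NN scores or LOF) now runs on X_low.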

In [24], Vries et al. introduce the projection-indexed nearest neighbor (PINN) approach, which is essentially based on projection pursuit. They first apply a random projection (RP) to reduce the dimension of the dataset so as to satisfy Condition 1; thereafter, they use the local outlier factor (LOF) [67] to find local outliers in a dataset of 300,000 points and 102,000 dimensions. In [101], the author uses a convex-optimization approach to outlier (anomaly) pursuit for the matrix recovery problem; the approach recovers the optimal low-dimensional subspace and marks the distorted points (the anomalies in image data). In [102], Mazin et al. apply a variant of random projection which they call Random Spectral Projection, where Fourier/cosine spectral projections are used for dimension reduction. They show that random samples of the Fourier spectrum perform better than random projection in terms of accuracy and storage over text documents. In [100], Muller et al. propose OutRank, a novel approach to ranking outliers. OutRank essentially uses the subspace view of the data and compares clustered regions in arbitrary subspaces, producing a degree-of-outlierness score for each object.

Challenges in using Random Projection pursuit: The major challenge for anomaly detection techniques employing random projection is that they must be able to work in the reduced dimension while preserving the intrinsic structure of the data. A second issue is how to efficiently choose the number of dimensions onto which to project the data.

2.2.5 Ensemble Techniques

Ensemble techniques work on the principle of "unity is strength". That is, they combine the power of individual outlier detection techniques and can produce striking results provided certain criteria are met. Although ensemble techniques were applied to classification [103] and clustering tasks long ago, they have only recently been used in the anomaly detection scenario. The rising popularity of ensemble techniques stems from their ability to locate outliers effectively in high-dimensional and noisy data [23]. A typical outlier ensemble contains a number of components that contribute to its power. These are:

• Model Creation: An individual model/algorithm is required to create the ensemble in the first place. In some cases, the methodology can simply be random subspace sampling.
• Normalization: Different techniques produce outlier scores that assign different meanings to the outlierness of a point. For example, some models take a low outlier score to mean a high degree of outlierness, whereas others assume the opposite. It is therefore important to consider different ways of merging anomaly scores into a meaningful combined outlier score.
• Model Combination: This refers to the final combination function used to create the final outlier score.

In the literature, ensembles have been categorized on the basis of component independence and constituent components. The first categorization considers whether the components are developed independently or depend on each other, and it yields two types. In a sequential ensemble, algorithms are applied in tandem so that the output of one algorithm affects the next, producing either better-quality data or a specific choice of algorithm; the final output is either a weighted combination or the result of the last algorithm applied. In an independent ensemble, completely different algorithms, or the same algorithm with different instantiations, are applied to the whole or a part of the data under analysis.

In the categorization by constituent component, a data-centric ensemble picks a subset of the data or of the data dimensions (e.g., bagging/boosting in classification) in turn and applies an anomaly detection algorithm, while a model-centric approach attempts to combine outlier scores from different models built on the same data. The challenge encountered is how to combine scores produced on different scales or in different formats; different combination functions like min, max, and average have been proposed to make ensemble techniques work in practice (a minimal sketch is given below). A further noteworthy point is that the ensemble approach proves especially effective when classification-based anomaly detection techniques are employed.
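The normalization and combination steps admit a minimal sketch (the min-max rescaling and the max/average combiners below are common choices, not a prescription from the cited works):

import numpy as np

def normalize(scores, high_is_anomalous=True):
    """Map raw outlier scores to [0, 1], with 1 = most anomalous."""
    s = np.asarray(scores, dtype=float)
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)
    return s if high_is_anomalous else 1.0 - s

def combine(score_lists, flags, how="avg"):
    """Merge per-model scores; flags[i] says whether model i's high
    scores mean anomalous."""
    S = np.vstack([normalize(s, f) for s, f in zip(score_lists, flags)])
    return S.max(axis=0) if how == "max" else S.mean(axis=0)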

Although ensemble techniques have deep foundations in classification and clustering, their true power in outlier detection was revealed recently by Lazarevic et al. [104] as feature bagging. They show that the proposed feature bagging approach can combine outlier scores from different executions of the LOF algorithm in a breadth-first manner: first, the highest scores from all algorithms are combined; then the second-largest scores; and so on. In [105], Noto et al. use feature prediction with a combination of classifiers and predictors (FRaC). They combine the anomaly scores of FRaC using the surprisal anomaly score given by (2.9), which is an information-theoretic measure of prediction. FRaC is a semi-supervised approach that learns conserved relationships among features and characterizes the distribution of "normal" examples.

surprisal(p) = −log(p)        (2.9)

In the same line of work, Cabrera et al. [5] proposed anomaly detection in the distributed setting of mobile ad hoc networks (MANETs). They combine the anomaly scores from the local IDSs (intrusion detection systems) attached to each node through an averaging operation, and this score is sent to the cluster head; all cluster heads then send their cluster-level anomaly indices to a manager, which averages them (see Fig. 2.11). A dynamic trust management scheme is proposed in [106] for detecting anomalies in wireless sensor networks (WSNs). A hybrid ensemble approach for class-imbalance and anomaly detection was recently proposed in [107]; the authors use a mixture of oversampling and undersampling with bagging and AdaBoost, and show improved performance. Ensembles of SVMs for imbalanced datasets are proposed in [108, 109].

Challenges in using the ensemble approach to combat anomalies: Ensemble techniques, although they produce high accuracy, bring several challenges with them, such as (1) unsupervised nature, (2) the small sample space problem, and (3) normalization issues. The first problem refers to the real-world scenario where data is often unlabeled; in such cases, ensemble techniques, which are mostly based on classification, cannot be applied. The second issue alludes to the fact that anomalies are present in only a tiny amount among the huge bundle of normal instances. The normalization issue is related to the different output formats of different classifiers, especially when heterogeneous models are trained.

Figure 2.11: Overall infrastructure to support the fusion of anomaly detectors [5]

2.3 Relevant Algorithms for Anomaly Detection

Since our work focuses on anomaly detection in big data, we discuss related research that has been tailored to such data. Below, we present related work based on (1) online learning, (2) class-imbalance learning, and (3) anomaly detection in a streaming environment, and discuss the differences from our work.

2.3.1 Online Learning

Online learning refers to a learning mechanism where the learner is given one example at a time, as shown in Algorithm 1. In Algorithm 1, the learner is presented with an example xt in line 2. It makes its prediction ŷt in line 3 and receives the correct label yt in line 4. In line 5, it computes the loss incurred by any mistake in the prediction, and it subsequently updates the model in line 6. Clearly, an online learning algorithm is free of memory issues when processing massive datasets, as it looks at one example at a time.

Algorithm 1 Online Learning Algorithm


1: repeat

2: receive instance: xt
3: predict: ŷt
4: receive correct label: yt
5: suffer loss: `(yt , ŷt )
6: update model
7: until All examples processed

Secondly, it has an optimal running time of O(nd), provided lines 5 and 6 take O(d) time, where n is the number of samples processed so far and d is the dimensionality of the data. Thirdly, it is easy to implement. In the next paragraphs, we discuss the relevant literature based on online learning and its limitations in tackling outliers.

Online learning has its origin in the classic work of Rosenblatt on the perceptron algorithm [110]. The perceptron algorithm is based on the idea of a single neuron. It simply takes an input instance xt and learns a linear predictor of the form ft(xt) = wtT xt, where wt is the weight vector. If it makes a wrong prediction, it updates its parameter vector as follows:

wt+1 = wt + yt xt        (2.10)

where wt+1 is the weight vector at time t + 1.
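A minimal perceptron written in the template of Algorithm 1 is sketched below (our illustration; the stream is any iterable of (xt, yt) pairs with yt ∈ {−1, +1}):

import numpy as np

def perceptron(stream, d):
    """Online perceptron: predict, then update w <- w + y_t x_t on mistakes."""
    w, mistakes = np.zeros(d), 0
    for x_t, y_t in stream:                 # receive one instance at a time
        y_hat = 1 if w @ x_t >= 0 else -1   # predict
        if y_hat != y_t:                    # a mistake: suffer loss
            w += y_t * x_t                  # update model, Eq. (2.10)
            mistakes += 1
    return w, mistakes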

The authors of [111] propose online learning with kernels. Their algorithm, called NORMAλ, is based on regularized empirical risk minimization, which they solve via regularized stochastic gradient descent, and they show empirically how it can be used in an anomaly detection scenario. However, their algorithm requires the tuning of many parameters, which is costly for time-critical applications. Passive-Aggressive (PA) learning [34] is another online learning algorithm, based on the idea of maximizing the "margin" in the online learning framework. The PA algorithm updates the weight vector whenever the "margin" falls below a certain threshold on the current example. Further, the authors introduce a slack variable to handle non-linearly separable data. Nonetheless, PA algorithms are sensitive to outliers, for the following reason: the PA algorithm applies the update rule wt+1 − wt = τt yt xt, where τt is a learning rate, and in the presence of outliers the minimum of (1/2)‖w − wt‖² could be high, since ‖xt‖² is high for outliers.

Other online learning algorithms in the literature include MIRA [112], ALMA [113], SOP [114], AROW [115], NAROW [116], CW [117], and SCW [118]. Many of these are variants of the basic PA and perceptron algorithms and hence are sensitive to outliers. A thorough survey of these algorithms is not feasible here; for an exhaustive survey of online learning, see [119].

The online-learning-based algorithms presented above, although they scale in the number of data points, do not scale with the data dimensionality: Passive-Aggressive (PA) learning [34], MIRA [112], ALMA [113], SOP [114], AROW [115], NAROW [116], CW [117], SCW [118], etc. all have a running time of O(nd). Noteworthy points about the algorithms mentioned above are: (i) they are sensitive to outliers and cannot handle the class-imbalance problem without modification; (ii) with the exception of CW and SCW, they do not consider the sparsity present in the data, and though CW and SCW exploit the sparsity structure, they are not designed for the class-imbalance problem. In the present work, we attempt to address these issues through the lens of online and stochastic learning in Chapter 3.

2.3.2 Class-Imbalance Learning

Class-imbalance learning aims to correctly classify minority examples in a binary classification setting. In the literature, existing solutions are based either on sampling or on weighting schemes. In the former case, either majority examples are undersampled or minority examples are oversampled; examples of sampling-based techniques are SMOTE [120], SMOTEBoost [121], AdaBoost.NC [122], and so on. In the latter case, each example is weighted differently, and the idea is to learn these weights optimally; works using weighting schemes include cost-sensitive learning [123], Adacost [124], [125], [126], [127], and so on. It is worth mentioning that only a few works exist that jointly solve class-imbalance learning and online learning. Below, we mention some works that closely match ours.

In [97], the authors proposed sampling with online bagging (SOB) for class-imbalance detection. Their idea is essentially based on resampling: oversample the minority class and undersample the majority class from Poisson distributions with average arrival rates N/P and Rp, respectively, where P is the total number of positive examples, N is the total number of negative examples, and Rp is the recall on the positive examples. Essentially, [97] proposes an online ensemble of classifiers that may achieve high accuracy, but training an ensemble of classifiers is a time-consuming process. In addition, [97] does not use the concept of a surrogate loss function to maximize Gmean.

[29] proposes online cost-sensitive classification for imbalanced data. One of their problem formulations is based on the maximization of the weighted sum of sensitivity and specificity, and the other on the minimization of a weighted cost. Their solution is based on minimizing a convex surrogate loss function (a modified hinge loss) instead of the non-convex 0-1 loss. Their work closely matches ours; however, in Section 3.2 we show that the problem formulation of [29] is different from our formulation, and the solution technique they adopt is based on online gradient descent, while ours is based on the online passive-aggressive framework. Specifically, the problem formulation of [29] is a special case of our problem formulation; further differences will become clear in Section 3.3. In [128], the author proposes a methodology for detecting spammers in a social network. Essentially, [128] introduces features that might be useful for detecting spammers on online platforms such as Facebook and Twitter. One of the major drawbacks of the proposed method is that it is an offline solution and, therefore, cannot handle big data. Secondly, they apply a vanilla SVM for spammer detection, which may not be effective owing to the use of the hinge loss within the SVM. The present work also attempts to solve the open problem in [128] through online spammer detection with low training time.

Recently, Gao et al. [129] proposed a class-imbalance learning method based on a two-stage extreme learning machine (ELM). Although it handles class imbalance, the scalability of the two-stage ELM in high dimensions is not shown; besides, the two-stage ELM solves the class-imbalance learning problem in the offline setting. ESOS-ELM, proposed in [26], uses an ensemble of subsets of sequential ELMs to detect concept drift in the class-imbalance scenario. Although ESOS-ELM is an online method, its performance is demonstrated on low-dimensional datasets only (the largest dimension tested is < 500), and it does not explicitly exploit the sparsity structure in the data. In short, none of the techniques mentioned above tackles the problem of class imbalance in huge dimensions (in the millions and above), nor do they handle the problem structure present in the data, such as sparsity. In the present work, we propose an algorithm that is scalable to high dimensions and can exploit the sparsity present in the data.

Cost-sensitive learning can be further categorized into offline cost-sensitive learning (OffCSL) and online cost-sensitive learning (OnCSL). OffCSL incorporates the costs of misclassification into offline learning algorithms such as cost-sensitive decision trees [130], [131], [125], cost-sensitive multi-label learning [127], and cost-sensitive naive Bayes [132]. On the other hand, OnCSL-based algorithms use cost-sensitive learning within the online learning framework; notable work in this direction includes the work of Jialei et al. [29], Adacost [124], and SOC [28]. It is to be noted that cost-sensitive learning methods have been shown to outperform sampling-based methods over big data [133]. Moreover, OnCSL-based methods are more scalable over high dimensions than their OffCSL counterparts, owing to the processing of one sample at a time in the former.

2.3.3 Anomaly Detection in a Streaming environment

Online outlier detection in sensor data is proposed in [134]. The method uses kernel density estimation (KDE) to approximate the data distribution in an online way and employs distance-based algorithms for detecting outliers. However, the work suffers from several limitations. Firstly, although the approach of [134] is scalable to multiple dimensions, it does not take evolving data streams into account. Secondly, using KDE to estimate the data distribution in the streaming case is a non-trivial task. Thirdly, the algorithm is based on the concept of a sliding window, and determining the optimal window width is again non-trivial. Abnormal event detection using an online SVM is presented in [135]. In [136], the author presents a link-based algorithm (called LOADED) for outlier detection in mixed-attribute data. However, LOADED does not perform well with continuous features, and experiments were conducted on datasets with at most 50 dimensions.

Fast anomaly detection using Half-Space Trees was proposed in [137]. The Streaming HS-Trees algorithm has constant amortized time complexity O(1) and constant space complexity O(1). Essentially, the authors build an ensemble of HS-Trees and store the mass¹ of the data in the nodes. Their work differs from ours in that we use online learning to build our model instead of an ensemble of HS-Trees. Numenta [139] is a recently proposed anomaly detection benchmark that includes algorithms for tackling anomaly detection in a streaming setting. However, the working and scoring mechanisms of Numenta differ from our work: their Hierarchical Temporal Memory (HTM) algorithm is window based and uses NAB scores (please see [139]) to report anomaly detection results, whereas we use Gmean and the mistake rate to report our experimental results.

¹Data mass is defined as the number of points in a region; two groups of data can have the same mass regardless of the characteristics of the regions [138].

2.3.4 Anomaly Detection in Nuclear Power Plant

In this section, we describe research that closely matches our work in Chapter 6. Some works have tackled anomaly detection in nuclear power plants. In [140], the authors study health monitoring of a nuclear power plant. They propose an algorithm based on symbolic dynamic filtering (SDF) for feature extraction from time-series data, followed by optimization of the partitioning of the sensor time series. The key limitation of their work is that the algorithm performs supervised anomaly detection and the model was tested on a small number of features and a small dataset only (training and test sets of 150 samples each); hence, it cannot be applied as such to big data. [141] proposed a spectral method for feature extraction and passive acoustic anomaly detection in nuclear power plants. [142] developed an online fuzzy-logic-based expert system that provides clean alarm pictures to system operators for nuclear power plant monitoring. The key limitation of this method is that the model depends on hand-crafted rules, which may not be very accurate given the many ways in which anomalies can occur. Model-based nuclear power plant monitoring is proposed in [143]. Their model consists of a neural network that takes input signals from the plant; the next component is an expert system that takes input from the neural network and a human operator to make informed decisions about the system's health. The shortcomings of the approach in [143] are that the neural network requires a lot of data for training and is supervised. The approach we take in this chapter builds on the unsupervised learning paradigm and hence differs from previous studies on nuclear power plant condition monitoring.

2.4 Datasets Used

In this section, we discuss the datasets used in our experiments. The datasets, with their train/test sizes, feature counts, ratios of positive to negative samples, and sparsity, are shown in Tables 2.1, 2.2, and 2.3. Note that the imbalance ratio shows the ratio of the positive to the negative class in the training set; the test set can have a different imbalance ratio. These are benchmark datasets that can be freely downloaded from the LIBSVM website [144]; pageblock is from [145] and is also available at [146]. The datasets in Table 2.1 have a small number of features and little to no sparsity, while the datasets in Tables 2.2 and 2.3 are high-dimensional with sparse features. For our purposes, in each dataset the positive class is treated as the anomaly that we wish to detect efficiently. We briefly describe the various datasets used in our experiments below.

• Kddcup 2008 dataset is a breast cancer detection dataset consisting of 4 X-ray images, two of each breast. Each image is represented by several candidate regions. After much pre-processing, the kddcup 2008 dataset overall contains information on 102,294 suspicious regions, each described by 117 features. Each region is either "benign" or "malignant", and the ratio of malignant to benign regions is 1:163.19. Due to this huge imbalance ratio, the task of identifying malignant regions (anomalies) is challenging.
• Breast Cancer Wisconsin Diagnostic data contains digitized images of fine needle aspirates of breast masses. The features delineate the characteristics of the cell nuclei present in the image, including the radius, texture, area, perimeter, etc. of the nucleus. The key task is to classify images as benign or malignant.
• Page blocks dataset consists of blocks of the page layout of a document, where each block, produced by a segmentation process, is one of five types: (1) text, (2) horizontal line, (3) picture, (4) vertical line, (5) graphic. Some of the features are height, length, area, blackpix, etc. We converted the multi-class classification problem into a binary classification problem by relabeling horizontal lines as the positive class (+1) and the rest of the labels as the negative class (−1). The task is to detect the positive class (anomalies) efficiently; note that the positive class is only a small fraction (6%) of the total data.
• W8a is a dataset of keywords extracted from web pages, where each feature is a sparse binary feature. The task is to classify whether a web page falls into a given category or not. The w8a and a9a (described below) datasets were originally used by J.C. Platt [147].
• A9a is a census dataset that contains features such as age, workclass, education, sex, marital-status, etc. The task is to predict whether an individual's income exceeds $50K/yr. The challenge is that the number of individuals with income above $50K/yr is very small.
• German dataset contains credit assessments of customers in terms of good or bad credit risk. Some of the features in the dataset are credit history, status of the existing checking account, purpose, credit amount, etc. The challenge comes in the form of identifying a small fraction of fraudulent customers among a huge number of loyal customers.
• Covtype dataset contains information about types of forest cover and associated attributes. The task is to predict the type of forest cover from cartographic variables (no remotely sensed images); the cartographic variables were derived from US Geological Survey and USFS data. Some of the features include Elevation, Aspect, Slope, Soil_type, etc. Covtype is a multi-class classification dataset; to convert it to a binary-class dataset, we follow the procedure given in [148]. In short, we treat class 2 as the positive class and the other 6 classes as the negative class.
• Ijcnn1 dataset consists of time-series samples produced by a 10-cylinder internal combustion engine. Some of the features include the crankshaft speed in RPM, load, and acceleration. The task is to detect misfires (anomalies) in certain regions of the load-speed map.
• Magic04 dataset comprises simulations of high-energy gamma particles in a ground-based gamma telescope. The goal is to discriminate the showers caused by primary gammas (the signal) from the images of hadronic showers caused by cosmic rays (the background) [146]. The actual dataset is generated by Monte Carlo sampling.
• Cod-rna dataset comes from the bioinformatics domain. It consists of long sequences of coding and non-coding RNAs (ncRNA). Non-coding RNAs play a vital role in the cell, and several of them remain undiscovered to date. The task is to detect novel non-coding RNAs (anomalies) to better understand their functionality.

Next, we describe the large-scale datasets used in Chapters 4 and 5.

• News20 dataset is a collection of 20,000 newsgroup posts on 20 topics, including comp.graphics, sci.crypt, sci.med, talk.religion, etc. The original news20 dataset is a multi-class classification dataset; Chih-Jen Lin et al. [144] converted it into a binary-class dataset, and we use that version directly.
• Rcv1 dataset is a benchmark collection of newswire stories made available by Reuters, Ltd. The data is organized into four major topics: ECAT (Economics), CCAT (Corporate/Industrial), MCAT (Markets), and GCAT (Government/Social). Chih-Jen Lin et al. have preprocessed the dataset, taking ECAT and CCAT to denote the positive category and MCAT and GCAT the negative category.
• Url dataset [149] is a collection of URLs. The task is to detect malicious URLs (spam, exploits, phishing, DoS, etc.) among normal URLs. The authors represent the URLs by host-based and lexical features: lexical feature types include the hostname, primary domain, and path tokens, while host-based features include WHOIS info, IP prefix, connection speed, etc.
• Realsim dataset is a collection of UseNet articles [150] from 4 discussion groups: real autos, real aviation, simulated auto racing, and simulated aviation. The data is often used in binary classification separating real from simulated, hence the name.
• Gisette dataset was constructed from the MNIST dataset [151]. It is a handwritten digit recognition problem, and the task is to distinguish between confusable digits. The dataset also appeared in the NIPS 2003 feature selection challenge [152].
• Pcmac dataset is a modified form of the news20 dataset.
• Webspam dataset contains information about web pages. There exists a category of web pages whose primary goal is to manipulate search engines and web users; for example, a phishing site is created to duplicate an e-commerce site so that the creators of the phishing site can divert credit card transactions to their own accounts. To combat this issue, the web spam corpus was created in 2011. The corpus consists of approximately 0.35 million web pages, each represented by a bag-of-words model. The dataset also appeared in the Pascal Large-Scale Learning Challenge in 2008 [153]. The task is to classify each web page as spam or ham; the challenge comes from the high dimensionality and sparse features of the dataset.

Table 2.1: Summary of datasets used in the experiments in Chapter 3

Dataset #Test set size(validation set size) #Features #Pos:Neg

page blocks 3472(2000) 10 1:8.7889


w8a 14951(49749) 300 1:31.9317
a9a 16281(32561) 122 1:3.2332
german 334(666) 24 1:2.3
covtype 207711(100000) 54 1:29
ijcnn1 91691(50000) 22 1:10.44
breast cancer 228(455) 10 1:1.86
kddcup2008 68196(34098) 117 1:180
magic04 9020(10000) 10 1:1.8
cod-rna 231152(100000) 8 1:2

Table 2.2: Summary of sparse data sets used in the experiment in Chapter 4

Dataset Balance # Train # Test # Features #Pos:Neg

news20 False 3,000 7,000 1,355,191 1:9.99


rcv1 False 20,370 677,399 47,236 1:0.9064
url False 2,000 8,000 3,231,961 1:10
realsim False 3,000 4,000 20,958 1:2.2515
gisette False 1,000 2,800 5,000 1:11
news20 True 10,000 9,000 1,355,191 1:1
pcmac True 1,000 900 3,289 1:1.0219
webspam False 1,000 1,000 16,609,143 1:19

Table 2.3: Summary of sparse data sets used in the experiment in Chapter 5

Dataset Balance #Train # Test #Features Sparsity (%) #Pos:Neg

news20 False 6,598 4,399 1,355,191 99.9682 1:6.0042


rcv1 True 10,000 10,000 47,236 99.839 1:1.0068
url False 8,000 2,000 3,231,961 99.9964 1:10.0041
realsim False 56,000 14,000 20,958 99.749 1:1.7207
gisette False 3,800 1,000 5,000 0.857734 1:11.667
webspam False 8,000 2,000 16,609,143 99.9714 1:15
w8a False 40,000 14,951 300 95.8204 1:26.0453
ijcnn1 False 48,000 91,701 22 39.1304 1:9.2696
covtype False 240,000 60,000 54 0 1:29.5344
pageblocks False 3,280 2,189 10 0 1:11.1933

2.5 Research Gaps Identified

In this chapter, we presented a detailed summary of the relevant work in the data mining and machine learning domains on combating the anomaly detection problem. From our literature survey, we find that:

• Traditional approaches to anomaly detection in big data have a number of limitations. Some important ones are as follows: statistical techniques require the underlying data distribution to be known a priori; proximity-based and density-based techniques require appropriate metrics for calculating the anomaly score and have high time complexity; clustering-based techniques are also computationally intensive. Moreover, most traditional anomaly detection techniques assume a static candidate anomaly set and are not able to handle evolving anomalies.
• Non-parametric techniques are useful for anomaly detection in real-world data for which class labels and the data distribution are not known in advance. They are also able to handle high-dimensional data with varying data distributions. But most research based on non-parametric anomaly detection techniques has assumed data to be homogeneous and static in nature; there is therefore scope for extending existing non-parametric techniques to heterogeneous, distributed data streams.
• Multiple kernel learning (MKL) methods and their variants have the advantage of addressing the curse of dimensionality. But such techniques have been applied mostly to homogeneous and static data; further work is needed to explore the possibility of using MKL in streaming and distributed anomaly detection scenarios. In addition, the hyperparameters of the kernel functions are typically set to some predefined constant values; automatic learning of hyperparameters is also an open problem.
• Non-negative matrix factorization based methods have the advantage of being able to handle anomaly detection in high-dimensional and sparse data scenarios. But they have mostly been applied to centralized data, although much real-world data is distributed in nature; there is therefore scope for further research in this direction.
• Random projection based techniques are useful once the intrinsic structure of the data and the number of dimensions to use for projection are known. There is therefore a need for methods that help choose the correct number of projection dimensions without such prior knowledge.


• Finally, ensemble techniques have been able to address the issue of anomaly detection
in heterogeneous setting. But, most of the ensemble based techniques are not able to
handle sparse data sets. To account for sparsity and high dimension case, further work
using ensemble approach is required. Combining anomaly scores produced from different
ensemble models also needs to be looked into.

In conclusion, we find that neither traditional nor modern approaches are able to detect anomalies in big data efficiently, i.e., while addressing major big data issues such as streaming, sparse, distributed, and high-dimensional data. In the next and subsequent chapters, we propose our work to tackle the aforementioned issues in an incremental fashion.
Chapter 3

Proposed Algorithm : PAGMEAN

In this chapter, we tackle the anomaly detection problem in a streaming environment using online learning. The reason for conducting such a study is that most real-world data is streaming in nature; for example, the measurements from sensors form a stream. Because of the dynamic nature of streaming data and the inability to store it, there is an urgent need for efficient algorithms that solve the anomaly detection problem in a streaming environment. We propose an algorithm called Passive-Aggressive GMEAN (PAGMEAN), which is based on the classic online algorithm called Passive-Aggressive (PA) [34]. We show that the PA algorithm is sensitive to outliers and cannot be directly applied for anomaly detection. Therefore, we introduce a modified hinge loss that is a convex surrogate for the indicator function (defined later in this chapter); the indicator function is obtained from maximizing the Gmean metric directly. The major challenge is that the Gmean metric is non-decomposable, meaning it cannot be written as the sum of losses over individual data points. We exploit the modified hinge loss within the PA framework and arrive at the PAGMEAN algorithm. We empirically show the effectiveness and efficiency of PAGMEAN over various benchmark datasets and compare it with state-of-the-art techniques from the literature.

3.1 Introduction

In this chapter, we aim to tackle the streaming aspect of big data while detecting anomalies. Throughout this chapter and subsequent chapters, we make the following assumption:


Assumption 7. The terms outlier and anomaly are used interchangeably.

The reason for the above assumption is that outliers/anomalies are present only in a tiny amount compared to the normal samples, and the outlier/anomaly detection problem can therefore be addressed via class-imbalance learning. Our focus will be the detection of point anomalies through the use of class-imbalance learning.

3.2 Proposed Algorithm - PAGMEAN

First, we introduce some notation for ease of exposition. Examples in our data come in a streaming fashion. At time t, the instance-label pair is denoted (xt, yt), where xt ∈ Rn and yt ∈ {−1, +1}. We consider a linear classifier of the form ft(xt) = wtT xt, where wt is the weight vector. Let ŷt = sign(ft(xt)) be the prediction for the t-th instance; the value |ft(xt)|, known as the "margin", is used as the confidence of the learner on the t-th prediction step.

3.2.1 Problem Formulation

In binary classification, there are two classes. We assume that the minority class is the positive class, labeled +1. Let P and N denote the total numbers of positive and negative examples received so far, respectively. With two classes, four cases can occur during prediction: true positive (Tp), true negative (Tn), false positive (Fp), and false negative (Fn). They are defined as follows:

Definition 3. True positive Tp = {y = ŷ = +1}, true negative Tn = {y = ŷ = −1}, false positive Fp = {y = −1, ŷ = +1}, false negative Fn = {y = +1, ŷ = −1}.

Note that the time t is implicit in the above notation; that is, Tp can also denote the total number of examples classified as true positive up to time t = 1, 2, ..., T. The meaning will be clear from the context. Our objective is to maximize the Gmean metric for the class-imbalance problem.

Definition 4. Gmean is defined as:

Gmean = √( sensitivity × specificity )        (3.1)

where sensitivity and specificity are defined as:

sensitivity = Tp / (Tp + Fn),    specificity = Tn / (Tn + Fp)        (3.2)
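Because Gmean is computed from running confusion counts, it can be maintained over a stream. A small sketch follows (our illustration; the max(·, 1) guards avoid division by zero early in the stream):

import math

class OnlineGmean:
    """Maintain Tp, Tn, Fp, Fn over a stream and report the running Gmean."""
    def __init__(self):
        self.tp = self.tn = self.fp = self.fn = 0

    def update(self, y, y_hat):
        if y == +1:
            self.tp += (y_hat == +1)
            self.fn += (y_hat == -1)
        else:
            self.tn += (y_hat == -1)
            self.fp += (y_hat == +1)

    def gmean(self):
        sens = self.tp / max(self.tp + self.fn, 1)  # sensitivity, Eq. (3.2)
        spec = self.tn / max(self.tn + self.fp, 1)  # specificity, Eq. (3.2)
        return math.sqrt(sens * spec)               # Eq. (3.1)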

3.3 Experiments

For a comparative evaluation of our proposed algorithms, we use the datasets presented in Table 2.1. Note that the PAGMEAN algorithms are tested on small-scale datasets only. The reason is that the PAGMEAN algorithms, though online, may not handle high-dimensional data (with the number of features in the millions and above) in a timely manner; that is, they will run slowly. Secondly, the PAGMEAN algorithms do not exploit the sparsity structure present in the data.

3.3.1 Experimental Testbed and Setup

We compare our algorithms with the parent algorithm PA and its variants, along with the recently proposed cost-sensitive algorithm CSOC of [29]. In [29], it is claimed that CSOC outperforms many state-of-the-art algorithms such as PAUM, ROMMA, agg-ROMMA, and CPA-PB; hence, we compare only with CSOC. We further emphasize that PA, PAGMEAN, and CSOC are all first-order methods that use only the gradient of the loss function. A comparison with AROW [115], NAROW [116], CW [117], etc. is not presented, since they use second-order information (the Hessian) and their scalability over large datasets is poor. A comparison with SVMperf [154] is also not fair because (i) it does not optimize Gmean using a surrogate loss function and (ii) it is an offline solution.

To make a fair comparison, we first split every data set randomly into a validation set and a test
set (online algorithms do not require separate train and test sets). The validation set is used to
find the optimal value of any parameters in the algorithms. The aggressiveness parameter C in
the PA algorithm is searched over 2^[−6:1:6], and the parameter λ in PAGMEAN, its variants, and
CSOC is searched over 2^[−10:1:10]. We report our results on the test data over 10 random
permutations. Note that we do not perform feature scaling, as it is against the ethos of online
learning, where we can access only one example at a time; as such, z-score normalization is not
feasible.

Secondly, most online-learning implementations (in MATLAB) available today, e.g., Libol
[155], DOGMA [156], and UOSLIB [157], are not out-of-core algorithms. They process
examples after fetching the entire data set into main memory, which violates the principle of
online learning. Our implementation is based on the premise that the data set is too large to fit
into RAM and that we can see one example at a time.

3.3.2 Performance Evaluation Metrics

As described in Section 1, Gmean is a robust metric for class-imbalance problems; hence, we
use Gmean to measure the performance of the various algorithms. Results on 6 benchmark
data sets (“pageblock”, “w8a”, “a9a”, “german”, “ijcnn1”, and “covtype”) and 4 real-world
data sets (“breast-cancer”, “kddcup2008”, “magic04”, and “cod-rna”) are shown in (Fig. 3.1,
Table 3.1) and (Fig. 3.3, Table 3.2), respectively. We also show the mistake rate of all the
compared algorithms on the benchmark and real data sets in (Fig. 3.2, Table 3.1) and (Fig.
3.4, Table 3.2), respectively. The mistake rate of an online algorithm is the cumulative fraction
of prediction mistakes it makes over time. Note that we do not show the running time of the
out-of-core implementation, since it depends on the speed of the device where the data is stored
(hard disk, tape, or network storage). We assume that the data is stored in files and read
sequentially in mini-batches, where the batch size is arbitrary or can depend on the available RAM.

3.3.3 Comparative Study on Benchmark Data sets

1. Evaluation of Gmean
We first evaluate Gmean on the various benchmark data sets given in Table 2.1. The online
average of Gmean with respect to the sample size is reported in Fig. 3.1, from which we can
observe several things. Firstly, the PAGMEAN2 algorithm outperforms its parent algorithm
PA on all 6 data sets. Secondly, PAGMEAN2 beats CSOC on all data sets but ijcnn1. Thirdly,
among the PAGMEAN algorithms, PAGMEAN2 outperforms PAGMEAN and PAGMEAN1
on the pageblock, w8a, german, and a9a data sets. In addition, Table 3.1 reports Gmean
averaged over 5 runs; from the table, we can see that the PAGMEAN algorithms outperform
their parent algorithm PA on all data sets. These results indicate the potential applicability of
the PAGMEAN algorithms to real-world class-imbalance detection tasks.
Another observation that can be drawn from Fig. 3.1 is that, initially, there is a performance
drop in all the algorithms due to the small number of samples available for learning. This
phenomenon is noticeable on the german and pageblocks data sets, as these are small. The
online performance of the algorithms is not smooth on some data sets, e.g., pageblock, german,
and ijcnn1. This could be due to a sudden change in the class distribution (also known as
concept drift); that is, the presence of a cluster of examples from the positive class or the
negative class in the data has a severe effect on the model.
2. Evaluation of Mistake Rate
In this Section, we discuss the mistake rate of the PAGMEAN algorithms. The online average
of the mistake rate of the various algorithms with respect to the sample size is shown in Fig. 3.2,
from which we can draw several conclusions. Firstly, as more samples are received by the
online algorithms, the mistake rate decreases on all data sets except pageblock and german,
where it appears to increase due to the small-sample problem. Secondly, the PAGMEAN
algorithms suffer a higher mistake rate than their parent algorithms PA. This is contrary to
intuition: the convex surrogate loss is designed to be more sensitive to class imbalance and was
therefore expected to yield a lower mistake rate than the hinge loss employed by the PA
algorithms. Thirdly, among the PAGMEAN algorithms, PAGMEAN1 suffers a smaller mistake
rate than PAGMEAN2 on all data sets.
The mistake rate averaged over 5 runs of all the compared algorithms is shown in Table 3.1.

[Figure 3.1 shows six panels plotting the online average of Gmean against sample size (log scale) for PAGMEAN, PAGMEAN1, PAGMEAN2, CSOC, PA, PA1, and PA2.]

Figure 3.1: Evaluation of Gmean over various benchmark data sets: (a) pageblock (b) w8a
(c) german (d) a9a (e) covtype (f) ijcnn1. In all the panels, the PAGMEAN algorithms either
outperform or are on par with their parent algorithm PA and the CSOC algorithm.

[Figure 3.2 shows six panels plotting the online average of Mistake rate (%) against sample size (log scale) for the same seven algorithms.]

Figure 3.2: Evaluation of Mistake rate over various benchmark data sets: (a) pageblock (b) w8a
(c) german (d) a9a (e) covtype (f) ijcnn1.

It can be observed that the PAGMEAN algorithms suffer a mistake rate that is not statistically
significantly higher (on the Wilcoxon rank-sum test) than that of their counterpart PA
algorithms on 3 out of 6 data sets (w8a, german, covtype). All of these observations indicate
that further work, in both theory and practice, is needed.
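As an aside, the significance claims above can be reproduced with a standard two-sided Wilcoxon rank-sum test. A minimal sketch, assuming the per-run mistake rates of two algorithms are available as arrays (the numbers below are illustrative, not from our experiments):

```python
from scipy.stats import ranksums

# Hypothetical per-run mistake rates (%) for two algorithms on one data set.
pagmean_runs = [1.24, 1.31, 1.19, 1.27, 1.22]
pa_runs = [1.10, 1.15, 1.08, 1.12, 1.06]

stat, p_value = ranksums(pagmean_runs, pa_runs)
# At the 95% confidence level, the difference is significant when p < 0.05.
print(f"statistic={stat:.3f}, p={p_value:.3f}, significant={p_value < 0.05}")
```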

3.3.4 Comparative Study on Real and Benchmark Data sets


The proposed PAGMEAN algorithms can potentially be applied to real-world data for the
anomaly detection task. To illustrate this, we performed experiments on 4 real data sets
(“breast cancer”, “kddcup 2008”, “magic04”, and “cod-rna”). Performance is measured with
respect to Gmean and mistake rate, and the results are reported in Figs. 3.3 and 3.4 and
Table 3.2. In Fig. 3.3, the empirical evaluation of Gmean with respect to sample size is shown.
It is clear from the figure that the PAGMEAN algorithms outperform the PA algorithms on the
breast-cancer and kddcup 2008 data sets, and perform on par with the PA and CSOC algorithms
on the magic04 and cod-rna data sets. Interestingly, all algorithms achieve a Gmean of
approximately 0.97 on the cod-rna data set. On the other hand, Gmean on kddcup 2008,
surprisingly, falls as more samples are received by all algorithms. This may be due to the high
class-imbalance ratio (1:180) of positive to negative examples in the kddcup 2008 data set: the
algorithms receive very few positive examples towards the end. However, the drop in the
Gmean value of the PAGMEAN and PAGMEAN2 algorithms beyond 10^4 samples is
significantly smaller than that of the other compared algorithms. Similar conclusions on Gmean
are drawn from the results in Table 3.2. These results show the potential applicability of the
PAGMEAN algorithms to real-world online anomaly detection tasks.

The mistake rate of all the algorithms with respect to sample size is shown in Fig. 3.4 and
Table 3.2. Similar to the conclusions drawn on the benchmark data sets, we observe that the
mistake rate of all the algorithms drops with increasing sample size. At the same time, it is
clear that, among the PAGMEAN algorithms, PAGMEAN1 suffers the smallest mistake rate on
all 4 data sets, while on the kddcup 2008 and magic04 data sets PAGMEAN1 outperforms all
other algorithms except PA1 and CSOC. Lastly, PAGMEAN1 outperforms CSOC and PA on
the cod-rna data set.

Table 3.1: Evaluation of Gmean and Mistake rate (%) on benchmark data sets. Entries marked
by * are not statistically significantly different from the corresponding entries marked by ** at
the 95% confidence level on the Wilcoxon rank-sum test.

Algorithm | w8a Gmean | w8a Mistake rate (%) | german Gmean | german Mistake rate (%)
PA | 0.878 ± 0.025 | 1.194 ± 0.341 | 0.536 ± 0.031 | 35.255 ± 2.336
PA1 | 0.873 ± 0.031 | 1.101 ± 0.352** | 0.102 ± 0.000 | 28.859 ± 0.095
PA2 | 0.872 ± 0.032 | 1.116 ± 0.344 | 0.339 ± 0.043 | 27.057 ± 1.383**
CSOC | 0.849 ± 0.055 | 1.346 ± 0.411 | 0.102 ± 0.000 | 28.859 ± 0.095
Proposed PAGMEAN | 0.894 ± 0.014 | 1.243 ± 0.379 | 0.602 ± 0.028 | 39.820 ± 2.933
Proposed PAGMEAN1 | 0.822 ± 0.059 | 1.725 ± 0.411 | 0.396 ± 0.014 | 27.628 ± 1.464*
Proposed PAGMEAN2 | 0.894 ± 0.014 | 1.241 ± 0.372* | 0.610 ± 0.019 | 41.502 ± 1.976

Algorithm | covtype Gmean | covtype Mistake rate (%) | ijcnn1 Gmean | ijcnn1 Mistake rate (%)
PA | 0.537 ± 0.003 | 4.288 ± 0.024 | 0.776 ± 0.009 | 7.017 ± 0.220
PA1 | 0.463 ± 0.006 | 2.992 ± 0.017 | 0.802 ± 0.024 | 4.747 ± 0.484
PA2 | 0.397 ± 0.004 | 3.029 ± 0.008 | 0.754 ± 0.025 | 5.391 ± 0.478
CSOC | 0.449 ± 0.008 | 2.988 ± 0.018** | 0.871 ± 0.017 | 6.271 ± 0.526
Proposed PAGMEAN | 0.748 ± 0.002 | 9.056 ± 0.058 | 0.846 ± 0.005 | 11.761 ± 0.446
Proposed PAGMEAN1 | 0.329 ± 0.003 | 3.074 ± 0.009* | 0.877 ± 0.012 | 6.610 ± 0.587
Proposed PAGMEAN2 | 0.749 ± 0.002 | 9.096 ± 0.061 | 0.847 ± 0.005 | 11.782 ± 0.454

Algorithm | a9a Gmean | a9a Mistake rate (%) | pageblocks Gmean | pageblocks Mistake rate (%)
PA | 0.708 ± 0.007 | 20.105 ± 0.571 | 0.857 ± 0.024 | 4.749 ± 0.795
PA1 | 0.748 ± 0.016 | 14.875 ± 0.670 | 0.832 ± 0.059 | 4.083 ± 1.051
PA2 | 0.744 ± 0.013 | 14.931 ± 0.511 | 0.850 ± 0.039 | 3.884 ± 0.958
CSOC | 0.798 ± 0.014 | 16.106 ± 0.568 | 0.875 ± 0.036 | 6.393 ± 1.053
Proposed PAGMEAN | 0.755 ± 0.009 | 23.294 ± 0.758 | 0.900 ± 0.021 | 7.223 ± 1.083
Proposed PAGMEAN1 | 0.785 ± 0.008 | 17.295 ± 0.388 | 0.852 ± 0.029 | 6.831 ± 1.047
Proposed PAGMEAN2 | 0.806 ± 0.007 | 22.617 ± 0.652 | 0.900 ± 0.021 | 7.252 ± 1.080

Table 3.2: Evaluation of Gmean and Mistake rate (%) on real data sets. Entries marked by * are
not statistically significantly different from the corresponding entries marked by ** at the 95%
confidence level on the Wilcoxon rank-sum test.

Algorithm | breast cancer Gmean | breast cancer Mistake rate (%) | kddcup2008 Gmean | kddcup2008 Mistake rate (%)
PA | 0.990 ± 0.009 | 0.580 ± 0.597 | 0.563 ± 0.022 | 25.436 ± 0.748
PA1 | 0.980 ± 0.001 | 0.938 ± 0.141 | 0.542 ± 0.020 | 9.846 ± 0.826
PA2 | 0.987 ± 0.005 | 0.625 ± 0.312 | 0.572 ± 0.014 | 17.893 ± 1.208
CSOC | 0.987 ± 0.006 | 0.670 ± 0.482 | 0.561 ± 0.017 | 9.840 ± 0.808
Proposed PAGMEAN | 0.991 ± 0.011 | 0.804 ± 0.960 | 0.720 ± 0.019 | 28.596 ± 1.017
Proposed PAGMEAN1 | 0.984 ± 0.006 | 1.027 ± 0.368 | 0.555 ± 0.014 | 10.585 ± 0.813
Proposed PAGMEAN2 | 0.991 ± 0.007 | 0.759 ± 0.597 | 0.767 ± 0.025 | 27.931 ± 1.016

Algorithm | magic04 Gmean | magic04 Mistake rate (%) | cod-rna Gmean | cod-rna Mistake rate (%)
PA | 0.934 ± 0.002 | 5.282 ± 0.137 | 0.981 ± 0.001 | 1.780 ± 0.101
PA1 | 0.946 ± 0.003 | 4.286 ± 0.193 | 0.986 ± 0.002 | 1.293 ± 0.108
PA2 | 0.946 ± 0.002 | 4.281 ± 0.152** | 0.986 ± 0.002** | 1.299 ± 0.159
CSOC | 0.942 ± 0.002 | 4.118 ± 0.178 | 0.985 ± 0.001 | 1.614 ± 0.062
Proposed PAGMEAN | 0.926 ± 0.002 | 5.064 ± 0.133 | 0.982 ± 0.000 | 1.774 ± 0.041
Proposed PAGMEAN1 | 0.946 ± 0.002 | 4.181 ± 0.476* | 0.985 ± 0.003* | 1.389 ± 0.192
Proposed PAGMEAN2 | 0.926 ± 0.002 | 5.059 ± 0.135 | 0.982 ± 0.000 | 1.768 ± 0.041

[Figure 3.3 shows four panels plotting the online average of Gmean against sample size (log scale) for PAGMEAN, PAGMEAN1, PAGMEAN2, CSOC, PA, PA1, and PA2.]

Figure 3.3: Evaluation of Gmean over various real data sets: (a) breast cancer (b) kddcup 2008
(c) magic04 (d) cod-rna.



[Figure 3.4 shows four panels plotting the online average of Mistake rate (%) against sample size (log scale) for the same seven algorithms.]

Figure 3.4: Evaluation of Mistake rate over various real data sets: (a) breast cancer (b) kddcup
2008 (c) magic04 (d) cod-rna.

3.4 Discussion

In the proposed work, we attempt to make the classical passive-aggressive algorithms
insensitive to outliers and apply them to the class-imbalance learning and anomaly detection
problems. To solve the aforementioned problem, we maximize the Gmean metric directly.
Since direct maximization of Gmean is NP-hard, we resort to a convex surrogate loss function
and minimize a modified hinge loss instead. The modified hinge loss is utilized within the PA
framework to make it insensitive to outliers, and new algorithms, called PAGMEAN, are
derived. The empirical performance of all the derived algorithms is tested on various
benchmark and real data sets. From the discussion above, we conclude that our derived
algorithms perform on par with the other algorithms (PA and CSOC) in terms of Gmean. This
indicates the potential applicability of the PAGMEAN algorithms to real-world class-imbalance
and anomaly detection problems. However, the mistake rate of the proposed algorithms is
surprisingly higher than that of the compared algorithms on some data sets; further work is
therefore required to identify the exact reasons for the higher mistake rate.

Finally, we would like to highlight that results on high-dimensional data (where the number of
features is in the millions) could not be included. The reason is that even online algorithms run
slowly when working in the full feature space. In the next chapter, we propose an algorithm
that exploits the sparsity present in the data and scales to millions of dimensions.
Chapter 4

Proposed Algorithm : ASPGD

In the previous chapter, we proposed an online algorithm that tackles the streaming anomaly
detection problem. However, one limitation of the PAGMEAN algorithm is that it is unable to
exploit the sparsity structure present in big data. To solve this problem, we propose another
online algorithm that addresses the sparse, streaming, high-dimensional nature of big data
during anomaly detection. As before, we employ the class-imbalance learning mechanism to
handle the point anomaly detection problem. The problem formulation in the present work uses
the L1-regularized proximal learning framework and is solved via an Accelerated Stochastic
Proximal Gradient Descent (ASPGD) algorithm. Within the ASPGD algorithm, we use a
smooth and strongly convex loss function. This loss function is insensitive to class imbalance,
since it is derived directly from the maximization of Gmean. The work presented in the current
chapter demonstrates (i) the application of proximal algorithms to solve real-world problems
(class imbalance), (ii) how the approach scales to big data, and (iii) how it outperforms some
recently proposed algorithms in terms of Gmean, F-measure, and Mistake rate on several
benchmark data sets.

4.1 Introduction

As discussed in Section 2.3, there are serious issues in using the classical approaches to anomaly
detection. Firstly, sampling-based techniques are scalable neither in the number of samples nor
in the data dimensionality in the case of big data. Secondly, existing works on sampling such
as [97, 158] do not exploit the rich structure, such as sparsity, present in the data. Kernel-based


methods suffer from poor scalability and long training times. For example, if the number of
examples is of the order of millions, kernel-based methods require storing the Gram matrix
(also known as the kernel matrix) of size million × million, which is prohibitive for machines
with limited memory. Cost-sensitive learning has recently gained popularity in addressing the
class-imbalance problem [28, 29] because of (i) its ability to learn the cost to be assigned to
different classes in a data-dependent way, (ii) its scalability, and (iii) its ability to exploit
sparsity. The present work builds upon cost-sensitive learning and extends the work presented
in [28].

To further address the class-imbalance problem, the choice of metric used to evaluate the
performance of different methods is crucial; using accuracy as a class-imbalance performance
measure, for example, may be misleading. Consider a data set with 99 negative examples and
1 positive example, where our objective is to classify each positive example correctly. A
classifier that labels every example as negative attains an accuracy of 99%, which is misleading
because our goal was to detect the positive example. Therefore, researchers have devised
alternative performance metrics to assess methods for the class-imbalance problem, among
them recall, precision, F-measure, ROC (receiver operating characteristic), and Gmean [159].
It has been found that Gmean is robust to the class-imbalance problem [159].

In this work, we tackle the class-imbalance problem in an online setting, exploiting the sparsity
and high-dimensional characteristics of big data. Our contributions are as follows:

1. We propose a class-imbalance learning algorithm in an online setting within the Accelerated
Stochastic Proximal Gradient Descent learning framework (called ASPGD). The novelty of
ASPGD is that it uses a strongly convex loss function that we derive from the direct
maximization of a non-decomposable performance metric (Gmean in our case). A
non-decomposable performance metric is a metric that cannot be written as a sum of losses
over data points.

2. We show, through extensive experiments on real and benchmark data sets, the effectiveness
and efficiency of the proposed algorithm over various state-of-the-art algorithms in the
literature.

3. In our work, we also show that Nesterov's acceleration [160] does not always help in
achieving higher Gmean.

4. The effect of the learning rate η and the sparsity regularization parameter λ on Gmean,
F-measure, and Mistake rate is demonstrated as well.

4.2 Proposed Algorithm - ASPGD

First, we establish some notation for the sake of clarity. Input data is denoted as instance-label
pairs {x_t, y_t}, where t = 1, 2, ..., T, x_t ∈ X ⊆ R^d, and y_t ∈ Y ⊆ {−1, +1}. In the offline setting,
we are allowed to access the entire data and usually T is finite; in the online setting, we see
one example at a time and T → ∞. At time t, an (instance, label) pair is denoted by (x_t, y_t).
We consider a linear functional of the form f_t(x_t) = w_t^T x_t, where w_t is the weight vector.
Let ŷ_t = sign(f_t(x_t)) be the prediction for the t-th instance, whereas the value |f_t(x_t)|, known
as the ‘margin’, is used as the confidence of the learner in the t-th prediction step. We work
under the following assumptions on the model function f.

The function f is L-smooth if its gradient is L-Lipschitz, i.e.,

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

In other words, the variation of the gradient of f is bounded by a constant L > 0. Here ‖·‖
denotes the Euclidean norm unless otherwise stated. The function f is µ-strongly convex if

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (µ/2) ‖y − x‖²,

where µ > 0 is the strong convexity parameter. Intuitively, strong convexity is a measure of the
curvature of the function: a function with large µ has high positive curvature. Interested readers
can refer to [161] and the appendix.

4.2.1 Problem Formulation and The Loss Function

Our problem formulation is the same as that presented in Chapter 3; that is, we want to
maximize the Gmean, using the lemma from Chapter 3. However, there are a number of issues
in using the loss function from Chapter 3, which is:

ℓ(f; (x, y)) = max(0, ρ − y f(x))    (4.1)

where

ρ = (N/P) I(y=1) + ((P − Fn)/P) I(y=−1).

The parameter ρ controls the penalty imposed when the current model misclassifies the
incoming example. The important point here is that we impose a high penalty (the ratio N/P

is high for N ≫ P in the class-imbalance scenario) on misclassifying a positive example
(y = +1). On the other hand, when the model misclassifies a negative sample, the penalty
imposed is (P − Fn)/P ≤ 1, which is small in comparison. It is to be noted that the loss
function in (4.1) is non-differentiable at the hinge point. Thus, it is not directly applicable to
algorithms that require L-smooth and µ-strongly convex loss functions (please refer to the
appendix for the definitions of smooth and strongly convex functions). In Section 4.2.3, we
review some proximal algorithms that work under the assumption that the loss function is
smooth and strongly convex. Therefore, we introduce a smooth and strongly convex function
that upper bounds the indicator function in (4.1):

ℓ(f, (x, y)) = (ρ/2) max(0, 1 − y f(x))²    (4.2)

The above loss function is utilized within the accelerated stochastic proximal learning
framework in Section 4.2.2. Note that the loss function in (4.2) is strongly convex with strong
convexity parameter µ = ρ. The strong convexity parameter depends on the hinge point of the
loss in (4.2); since the hinge point is a data-dependent term, we estimate it in an online fashion.
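To make the loss concrete, here is a minimal Python sketch of the cost-weighted squared hinge loss (4.2) and its gradient with respect to w; the running counts N, P, and Fn that define ρ in (4.1) are assumed to be maintained online as in Chapter 3, and all function names are illustrative:

```python
import numpy as np

def rho(y, n_neg, n_pos, false_neg):
    """Cost weight from Eq. (4.1): heavy penalty for missing a positive."""
    if y == 1:
        return n_neg / max(n_pos, 1)
    return (n_pos - false_neg) / max(n_pos, 1)

def smooth_hinge_loss(w, x, y, rho_t):
    """Eq. (4.2): (rho/2) * max(0, 1 - y w^T x)^2 -- smooth, strongly convex."""
    margin = y * np.dot(w, x)
    return 0.5 * rho_t * max(0.0, 1.0 - margin) ** 2

def smooth_hinge_grad(w, x, y, rho_t):
    """Gradient of Eq. (4.2) w.r.t. w; zero whenever the margin exceeds 1."""
    margin = y * np.dot(w, x)
    if margin >= 1.0:
        return np.zeros_like(w)
    return -rho_t * (1.0 - margin) * y * x
```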

4.2.2 Stochastic Proximal Learning Framework

Now, we describe the proximal learning framework [162], which aims to solve composite
optimization problems of the following form:

φ(w) ≜ min_{w ∈ R^d} f(w) + r(w)    (4.3)

where f(·) is the average of convex and differentiable functions, i.e., f(w) = (1/T) Σ_{t=1}^{T} f_t(w),
and r : R^d → R is a ‘simple’ convex function that may be non-differentiable. Note that the
framework (4.3) is quite general and encompasses many algorithms. For example, if we set
f_t(w) = max(0, 1 − y w^T x) and r(w) = λ‖w‖²₂, we get the L2-regularized SVM; if we set
f_t(w) = (y − w^T x)² and r(w) = λ‖w‖²₂, we get ridge regression. For our problem, f_t(w) is
given in (4.2). Since we consider solving (4.3) under a sparsity constraint, we utilize the
sparsity-inducing L1 norm of w, i.e., r(w) = λ‖w‖₁ (which is non-differentiable at 0).

There are many algorithms to solve problem (4.3) in the offline setting under different
assumptions on the loss function f(·) and the regularizer r(·) [163–166]. Since our aim is to
solve (4.3) within the online learning framework, those techniques cannot be used. Under our
assumptions on the form of f(·) and r(·) (strongly convex and non-smooth, respectively),
subgradient methods could be a candidate. However, these methods have a notoriously slow
convergence rate of O(1/ε²) (iteration complexity); i.e., to obtain an ε-accurate solution, we
need O(1/ε²) iterations. Hence, we resort to proximal learning algorithms because of their
simplicity, ability to handle non-smooth regularizers, scalability, and faster convergence under
certain assumptions [162].

A proximal gradient step at iteration k is given by:

w_{k+1} = prox_{ηr}(w_k − η∇f(w_k))    (4.4)

where prox_{ηr}(·) is the proximal operator, defined as

prox_{ηr}(v) ≜ argmin_u { (1/(2η)) ‖u − v‖²₂ + r(u) }    (4.5)

where η is the step size, ‖·‖₂ is the 2-norm, and ∇ denotes the gradient of the loss function f(·).
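For r(w) = λ‖w‖₁, the proximal operator (4.5) has the well-known closed form of element-wise soft-thresholding. A one-line sketch:

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau * ||.||_1: shrink each coordinate towards zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```

With step size η, the L1 proximal gradient step (4.4) is then simply soft_threshold(w - eta * grad, eta * lam).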
One of the major drawbacks of using Proximal Gradient Descent (PGD) in the offline setting
is that it does not scale to large data sets, as it requires the entire data set to compute the
gradient. In the proposed work, we focus on algorithms that are memory-aware. Such
algorithms come under the online learning framework (and are also known as stochastic
algorithms). A simple Stochastic Proximal Gradient Descent (SPGD) update rule at the t-th
time step is given by:

w_{t+1} = prox_{η_t r}(w_t − η_t ∇f_t(w_t))    (4.6)

where ∇f_t(·) is evaluated at the t-th example. In order to achieve acceleration (faster
convergence) in online learning, we follow Nesterov's method [160]. Specifically, Nesterov's
method achieves acceleration by introducing an auxiliary variable u such that the weight vector
at time t is a convex combination of the weight at time t − 1 and u_t, i.e.,

w_t = (1 − γ) w_{t−1} + γ u_t    (4.7)

With all the tools of the accelerated stochastic proximal (ASP) learning framework in hand, we
are ready to present our ASP Gradient Descent (ASPGD) algorithm for sparse class-imbalance
learning.

We note that similar ASP algorithms based on Nesterov's accelerated method have recently
appeared in [167, 168]. However, [167] proposes an accelerated algorithm with variance
reduction, and [168] proposes an accelerated algorithm that uses two sequences for computing
the weight vector w, where one of the sequences utilizes the Lipschitz parameter of the smooth
component of the composite optimization. In contrast, we exploit the strong convexity of the
smooth component. Besides, in our present work, we aim at showing how efficient ASP
algorithms are in dealing with the class-imbalance problem.

4.2.3 ASPGD Algorithm

In this Section, we present the algorithm to solve (4.3) in an online setting. Our proposed
algorithm, called ASPGD, is based on the SPGD framework with Nesterov's acceleration. The
ASPGD algorithm, presented in Algorithm 2, is able to handle class-imbalance problems.

Algorithm 2 ASPGD: Accelerated Stochastic Proximal Gradient Descent Algorithm for Sparse
Learning

Require: η > 0, λ, γ = (1 − √(µη)) / (1 + √(µη))
Ensure: w_{T+1}
1: for t := 1, ..., T do
2:   receive instance x_t
3:   v_t = ∇Φ*_t(θ_t)
4:   u_t = prox_{ηr}(v_t)
5:   w_t = (1 − γ) w_{t−1} + γ u_t
6:   predict ŷ_t = sign(w_t^T x_t)
7:   receive true label y_t ∈ {−1, +1}
8:   suffer loss ℓ_t(y_t, ŷ_t) as given in (4.2)
9:   if ℓ_t(y_t, ŷ_t) > 0 then
10:    update: θ_{t+1} = θ_t − η∇ℓ_t(w_t)
11:  end if
12: end for

In Algorithm 2, Φ is some µ-strongly convex function such as ‖·‖²₂, the ∗ above Φ denotes
the dual (see the appendix for the definition of the dual norm), and θ is some vector in R^n.
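A minimal Python sketch of the ASPGD loop follows, instantiating Φ = (1/2)‖·‖²₂ so that ∇Φ* is the identity map, and reusing the soft-thresholding prox and the gradient of the loss (4.2) from the earlier sketches. The streaming data source and the online estimate of ρ are placeholders, and all names are illustrative:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def aspgd(stream, d, eta, lam, mu):
    """Sketch of Algorithm 2 with Phi = 0.5 * ||.||_2^2 (so grad Phi* = id)."""
    gamma = (1.0 - np.sqrt(mu * eta)) / (1.0 + np.sqrt(mu * eta))
    theta = np.zeros(d)                    # dual accumulator
    w = np.zeros(d)
    for x, y, rho_t in stream:             # stream yields (x_t, y_t, online rho)
        v = theta                          # step 3: v_t = grad Phi*(theta_t)
        u = soft_threshold(v, eta * lam)   # step 4: u_t = prox_{eta r}(v_t)
        w = (1.0 - gamma) * w + gamma * u  # step 5: Nesterov convex combination
        margin = y * np.dot(w, x)          # steps 6-8: predict and suffer loss
        if margin < 1.0:                   # nonzero loss under Eq. (4.2)
            grad = -rho_t * (1.0 - margin) * y * x
            theta = theta - eta * grad     # step 10: dual update
    return w
```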

At this point, we emphasize that [169] recently proposed algorithms for sparse learning (see
Algorithm 1 in [169]). Their algorithm is a special case of ours and can be analyzed under the
Stochastic Proximal Learning (SPL) framework: setting γ = 1 and ℓ_t to the hinge loss in
ASPGD, we obtain Algorithm 1 in [169]. The same authors' extended paper [28] proposed
algorithms for class-imbalance sparse learning; Algorithm 6 in [28] is a special case of ASPGD
without acceleration, and it too can be analyzed under the SPL framework.

4.3 Experiments

In this Section, we empirically validate the performance of the ASPGD algorithm over the
various benchmark data sets given in Table 2.2. Notice that NEWS2 and PCMAC are balanced
data sets. All the algorithms were run in MATLAB 2012a (64-bit version) on a 64-bit Windows
8.1 machine.¹

4.3.1 Experimental Testbed and Setup

For evaluation purposes, we compare ASPGD and its variant without acceleration (which we
call ASPGDNOACC) with the recently proposed algorithm CSFSOL [28]. In [28], the author
compared their first- and second-order algorithms with a number of cost-sensitive algorithms
(CS-OGD, CPA, PAUM, etc.) and found that CSFSOL and its second-order version CSSSOL
outperform the aforementioned algorithms in terms of a metric called balanced accuracy
(defined as 0.5 × sensitivity + 0.5 × specificity). For this reason, we compare our ASPGD and
ASPGDNOACC with the CSFSOL algorithm (notice that CSSSOL is a second-order algorithm,
hence no comparison with CSSSOL is made). For performance evaluation, we use Gmean and
Mistake rate as performance metrics. The time taken by these algorithms is not shown, since
each algorithm reads data in mini-batches and processes it online; the total time consumed
therefore depends on how fast data can be read from the storage device (hard disk, network,
etc.). The most time-consuming operation in Algorithm 2 is evaluating the prox operator; in our
case, the prox operator is the soft-thresholding operator, which has a closed-form solution
[170]. The other steps in Algorithm 2 take O(1) time.

¹ Our code is available at https://sites.google.com/site/chandreshiitr/publication

For parameter selection (the learning rate η), we fix the sparsity regularization parameter λ = 0
and perform a grid search as in [28]. Note that for the ASPGD algorithm, the strong convexity
parameter µ equals ρ in (4.1) (which is easy to show), and ρ can be calculated in an online
fashion. Hence, no parameter tuning is required for ASPGD, in contrast to CSFSOL and
ASPGDNOACC. Further, the parameter ρ (and hence µ) can be greater than 1, so we have to
set η such that µη < 1 for the parameter γ to be a valid convex-combination parameter. In our
subsequent experiments, we set η = 1/(µ + 1). In subsection 4, we also demonstrate the effect
of varying the learning rate. In [167], it is stated that a diminishing learning rate helps reduce
the variance introduced by random sampling but leads to a slower convergence rate; keeping
that in mind, we set the learning rate to the aforementioned value.

The results presented in the next subsection have been averaged over 10 random permutations
of the test data and are shown on semi-log plots. No feature scaling technique has been used,
as it is against the ethos of online learning: online learning dictates that only a subset of the
entire data set is seen at any point in time, which matches the more practical scenario of
real-world data in our simulation.

4.3.2 Comparative Study on Benchmark Data sets

1. Evaluation of Gmean
In this Section, we evaluate Gmean over the various benchmark data sets shown in Table 2.3.
The results are presented in Figure 4.1 and Table 4.1. From Figure 4.1, several conclusions
can be drawn. First, the ASPGD algorithm outperforms ASPGDNOACC on 6 out of 8 data
sets (news2, gisette, rcv1, url, pcmac, and webspam), indicating that Nesterov's acceleration
helps in achieving higher Gmean. Secondly, ASPGD either outperforms or performs on par
with CSFSOL on 6 out of 8 data sets (news, gisette, rcv1, url, pcmac, and webspam). Thirdly,
on the news2 and realsim data sets, all algorithms suffer performance degradation. This may be
due to a sudden change in the concept or the class distribution, and it reveals one inherent
limitation of the ASPGD and CSFSOL algorithms in addressing concept drift. The same
observation can be made from the cumulative Gmean results presented in Table 4.1; for
example, on the news data set, ASPGD achieves a cumulative Gmean that is statistically
significantly higher than that achieved by CSFSOL, and vice versa on the rcv1 data set.

[Figure 4.1 shows eight panels plotting the online average of Gmean against sample size (log scale) for ASPGD, ASPGDNOACC, and CSFSOL.]

Figure 4.1: Evaluation of the online average of Gmean over various benchmark data sets:
(a) news (b) news2 (c) gisette (d) realsim (e) rcv1 (f) url (g) pcmac (h) webspam.

2. Evaluation of Mistake rate
The online average of the Mistake rate is shown in Figure 4.2, and the cumulative Mistake rate
is shown in Table 4.1. The following conclusions can be drawn from the results in Figure 4.2.
ASPGD achieves a smaller Mistake rate than ASPGDNOACC on all the data sets. Secondly,
as more and more samples are consumed, the Mistake rate of ASPGD eventually reaches that
of CSFSOL on 5 out of 8 data sets (news, rcv1, url, pcmac, and realsim). Thirdly, the Mistake
rate of all algorithms increases (indeed, fluctuates) on the news2 and realsim data sets; the same
reason as discussed in the previous subsection applies here too, i.e., concept drift of the classes
may result in a higher Mistake rate. Finally, the Mistake rate results presented in Table 4.1 are
consistent with the observations drawn from Figure 4.2. For example, the cumulative Mistake
rate obtained by ASPGD is smaller than that of ASPGDNOACC on all data sets. On the other
hand, the Mistake rate of ASPGD is not statistically significantly higher than that suffered by
CSFSOL on news, pcmac, realsim, and rcv1.
3. Effect of the Regularization Parameter on F-measure
The effect of the regularization parameter λ on F-measure over various benchmark data sets is
shown in Figure 4.3. From these figures, we can observe that increasing the regularization
parameter λ decreases the F-measure. The important thing to notice is that the decrease in
F-measure with increasing λ is steeper for CSFSOL than for ASPGD and ASPGDNOACC. In
other words, for a given λ, we can obtain a higher F-measure from ASPGD and ASPGDNOACC
than from CSFSOL.
4. Experiment with Varying Learning Rate
In this Section, we demonstrate the effect of varying the learning rate η. As we can see in
Figure 4.4, there is no clear winner for maximizing Gmean. On the news data set, η = 1/(µ + 10)
obtains the highest Gmean overall, whereas η = 1/(µ + 1) achieves the highest Gmean overall
on realsim and rcv1. We set the learning rate η = 1/(µ + 1) in the experiments of the previous
section based on this observation.
5. Effect of the Regularization Parameter
In this Section, we demonstrate the effect of varying the sparsity regularization parameter λ
on maximizing Gmean and minimizing Mistake rate for the compared algorithms. The results
are shown in Figures 4.5 and 4.6, with λ varied in [0, 10] in steps of 1.

[Figure 4.2 shows eight panels plotting the online average of Mistake rate (%) against sample size (log scale) for ASPGD, ASPGDNOACC, and CSFSOL.]

Figure 4.2: Evaluation of Mistake rate over various benchmark data sets: (a) news (b) news2
(c) gisette (d) realsim (e) rcv1 (f) url (g) pcmac (h) webspam.

[Figure 4.3 shows five panels plotting the online cumulative F-measure (%) against the sparsity regularization parameter λ for ASPGD, ASPGDNOACC, and CSFSOL.]

Figure 4.3: Effect of the regularization parameter λ on F-measure on (a) news (b) realsim
(c) gisette (d) rcv1 (e) pcmac.

Table 4.1: Evaluation of cumulative Gmean (%) and Mistake rate (%) on benchmark data sets.
Entries marked by * are statistically significantly different from the entries marked by **, and
entries marked by † are NOT statistically significantly different from the entries marked by ‡,
at the 95% confidence level on the Wilcoxon rank-sum test.

Algorithm | news Gmean (%) | news Mistake rate (%) | news2 Gmean (%) | news2 Mistake rate (%)
Proposed ASPGD | 90.7 ± 0.003* | 2.087 ± 0.139† | 99.686 ± 0.015 | 0.319 ± 0.014
Proposed ASPGDNOACC | 96.1 ± 0.003 | 2.909 ± 0.117 | 99.426 ± 0.027 | 0.587 ± 0.029
CSFSOL | 89.5 ± 0.004** | 2.013 ± 0.069‡ | 99.870 ± 0.014 | 0.136 ± 0.016

Algorithm | gisette Gmean (%) | gisette Mistake rate (%) | realsim Gmean (%) | realsim Mistake rate (%)
Proposed ASPGD | 74.333 ± 1.174 | 7.750 ± 0.539 | 96.319 ± 0.077** | 1.022 ± 0.023**
Proposed ASPGDNOACC | 41.810 ± 1.308 | 15.511 ± 0.826 | 96.636 ± 0.116 | 2.025 ± 0.036
CSFSOL | 47.727 ± 2.176 | 6.468 ± 0.184 | 96.783 ± 0.102* | 1.059 ± 0.025*

Algorithm | rcv1 Gmean (%) | rcv1 Mistake rate (%) | url Gmean (%) | url Mistake rate (%)
Proposed ASPGD | 97.422 ± 0.003** | 2.576 ± 0.003** | 92.139 ± 0.285 | 3.684 ± 0.151*
Proposed ASPGDNOACC | 97.426 ± 0.003 | 2.574 ± 0.003 | 90.263 ± 0.380 | 7.880 ± 0.353
CSFSOL | 97.461 ± 0.004* | 2.535 ± 0.004* | 86.459 ± 0.497 | 3.791 ± 0.086**

Algorithm | pcmac Gmean (%) | pcmac Mistake rate (%) | webspam Gmean (%) | webspam Mistake rate (%)
Proposed ASPGD | 79.339 ± 3.149‡ | 6.778 ± 0.547‡ | 71.052 ± 1.718 | 6.100 ± 0.727
Proposed ASPGDNOACC | 72.840 ± 4.024 | 7.122 ± 0.424 | 51.856 ± 5.034 | 18.270 ± 3.196
CSFSOL | 80.655 ± 1.918† | 6.511 ± 0.360† | 46.655 ± 5.684 | 4.380 ± 0.225

[Figure 4.4 shows four panels plotting the online average of Gmean against sample size for the learning rates η = 1/(µ+1), 1/(µ+5), and 1/(µ+10).]

Figure 4.4: Effect of the learning rate η in the ASPGD algorithm for maximizing Gmean on
(a) news (b) realsim (c) gisette (d) rcv1.

From Figure 4.5, we observe that the different algorithms achieve their highest Gmean at
different values of λ on the news and gisette data sets. For example, on the news data set,
ASPGD achieves its highest Gmean at λ = 0, whereas ASPGDNOACC does so at λ = 1 and
CSFSOL at λ = 10. On the other hand, on the realsim and rcv1 data sets, all the algorithms
achieve their highest Gmean at λ = 0. Another major observation is that the ASPGD algorithm
achieves a higher Gmean than CSFSOL on 3 out of 4 data sets (news, realsim, and rcv1) over
the entire range of λ values tested. A higher λ value implies more sparsity and hence a sparser
model; sparse models are easier to interpret and quicker to evaluate.
In Figure 4.6, the effect of the regularization parameter on Mistake rate is shown. As before,
the ASPGD algorithm suffers a smaller Mistake rate than CSFSOL over the entire range of λ
values tested on 3 out of 4 data sets (realsim, gisette, and rcv1). Further, smaller values of λ
lead to a smaller Mistake rate, which is evident from the monotonically increasing Mistake rate
on all data sets and algorithms except ASPGD on the gisette data set.
Remark: The difference between ASPGD and SOL (the generalized version of CSFSOL) is
that ASPGD uses (1) a smooth, modified hinge loss and (2) Nesterov's acceleration. The time
complexity of both algorithms is O(nd), which is linear in n and d, where n is the number of
data points and d is the dimensionality; the only thing that differs is the hidden constant in the
big-O notation. In fact, the extra step involved in ASPGD is the summation of two vectors of
size d in step 5 of the algorithm, which costs O(d). For big data, where the data is usually
sparse, it takes O(s) time, where s is the number of nonzeros in w_{t−1} and u_t. The evaluation
of the smooth hinge loss differs by a constant (O(1)). In addition, from the implementation
point of view, the results reported here assume that the data resides in a back-end store such as
a hard disk; we read data in mini-batches and process them one by one. Thus, our
implementation is more amenable to online learning, where we are not allowed to see all the
data in one go, whereas the implementation provided by the authors of SOL loads the entire
data set into main memory and processes one example at a time, violating the principle of
online learning.

[Figure 4.5 shows four panels plotting the online cumulative Gmean (%) against the sparsity regularization parameter λ for ASPGD, ASPGDNOACC, and CSFSOL.]

Figure 4.5: Effect of the regularization parameter λ in the ASPGD algorithm for maximizing
Gmean on (a) news (b) realsim (c) gisette (d) rcv1.

[Figure 4.6 shows four panels plotting the online cumulative Mistake rate (%) against the sparsity regularization parameter λ for ASPGD, ASPGDNOACC, and CSFSOL.]

Figure 4.6: Effect of the regularization parameter λ in the ASPGD algorithm for minimizing
Mistake rate on (a) news (b) realsim (c) gisette (d) rcv1.

4.4 Discussion

In the present chapter, we handled the streaming, sparse, high-dimensional nature of big data
in order to detect anomalies efficiently. As discussed in Chapter 3, the PAGMEAN algorithm
neither scales to high dimensions nor exploits the sparsity present in big data. We follow the
same recipe as in the PAGMEAN algorithm to derive the ASPGD algorithm; however, instead
of the loss function employed by PAGMEAN, we use a smooth and strongly convex
cost-sensitive loss function that is a convex surrogate for the 0-1 loss. The relaxed problem is
solved via an accelerated stochastic proximal learning algorithm called ASPGD. Extensive
experiments on several large-scale data sets show that the ASPGD algorithm outperforms a
recently proposed algorithm (CSFSOL) in terms of Gmean, F-measure, and Mistake rate on
many of the data sets tested. Further, we also compared the non-accelerated version of ASPGD,
called ASPGDNOACC, with ASPGD and CSFSOL. From the discussion in Section 4.3, we
also conclude that acceleration is not always helpful, in terms of either Gmean or Mistake rate.

Because of the massive growth in data size and its distributed nature, there is an immediate
need to tackle class imbalance in the distributed setting. In the next chapter, we propose
algorithms for handling the class-imbalance problem in the distributed setting.
Chapter 5

Proposed Algorithms : DSCIL and CILSD

Globalization in the 21st century has given rise to a distributed work culture. As a result,
data is no longer collected at a single place; instead, it is gathered at multiple locations in a
distributed fashion. Gleaning insightful information from distributed data is a challenging task,
and several concerns need to be addressed properly. Firstly, collecting the whole data at a
single place for knowledge discovery is costly. Secondly, transmitting data over the network
involves security risks, for example, with credit card transaction data. To save cost and
minimize the risk of data transportation, there is an urgent need to develop algorithms that can
work in a distributed fashion. This is the main motivation behind the work proposed in the
current chapter.

We study class-imbalance problems in a distributed setting, exploiting the sparsity structure
in the data. We formulate the class-imbalance learning problem as a cost-sensitive learning
problem with L1 regularization, where the cost-sensitive loss function is a cost-weighted smooth
hinge loss. The resultant optimization problem is minimized within (i) the Distributed
Alternating Direction Method of Multipliers (DADMM) [171] framework and (ii) a FISTA
[163]-like update rule in a distributed environment. We call the algorithm derived within the
DADMM framework Distributed Sparse Class-Imbalance Learning (DSCIL), and the one
derived with the FISTA-like update rule Class-Imbalance Learning on Sparse data in a
Distributed environment (CILSD). The reason for proposing CILSD is that it improves upon
the convergence speed of DSCIL.

In the DSCIL algorithm, we partition the data matrix across samples through DADMM. This


operation splits the original problem into a distributed L2-regularized smooth loss minimization
and an L1-regularized squared loss minimization. The L2-regularized subproblem is solved via
L-BFGS and a random coordinate descent method, in parallel at multiple processing nodes
using the Message Passing Interface (MPI, a C++ library), while the L1-regularized subproblem
is just a simple soft-thresholding operation. We show empirically that the distributed solution
matches the centralized solution on many benchmark data sets; the centralized solution is
obtained via Cost-Sensitive Stochastic Coordinate Descent (CSSCD).
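To make the splitting concrete, the following is a minimal single-machine simulation of the consensus ADMM iteration behind DSCIL; the thesis implementation distributes the local w-updates over MPI nodes and uses L-BFGS (or coordinate descent), whereas here each node's cost-weighted L2-regularized subproblem is solved with a plain gradient loop, and all names and the per-example cost weights c are illustrative:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def dscil_admm_sketch(parts, d, lam, rho=1.0, outer=50, inner=25, lr=0.1):
    """Consensus ADMM: each 'node' i holds (X_i, y_i, c_i) and a local w_i."""
    k = len(parts)
    w = [np.zeros(d) for _ in range(k)]   # local primal variables
    u = [np.zeros(d) for _ in range(k)]   # scaled dual variables
    z = np.zeros(d)                       # global consensus variable
    for _ in range(outer):
        for i, (X, y, c) in enumerate(parts):
            # Local w-update: cost-weighted smooth hinge + (rho/2)||w - z + u||^2,
            # solved approximately by gradient descent (L-BFGS in the thesis).
            for _ in range(inner):
                margin = y * (X @ w[i])
                active = margin < 1.0
                g = -(c[active] * (1.0 - margin[active]) * y[active]) @ X[active]
                g = g / max(len(y), 1) + rho * (w[i] - z + u[i])
                w[i] = w[i] - lr * g
        # Global z-update: L1-regularized averaging reduces to soft-thresholding.
        z = soft_threshold(np.mean(w, axis=0) + np.mean(u, axis=0), lam / (rho * k))
        for i in range(k):                # dual updates
            u[i] = u[i] + w[i] - z
    return z
```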

In the CILSD algorithm, we partition the data across examples and distribute the subsamples to
different processing nodes. Each node runs a local copy of a FISTA-like algorithm, which is a
distributed implementation of the prox-linear algorithm for cost-sensitive learning. Empirical
results on small and large-scale benchmark data sets show promising avenues for further
investigation of real-world applications of the proposed algorithms, such as anomaly detection
and class-imbalance learning. To the best of our knowledge, ours is the first work to study
class imbalance in a distributed environment on large-scale sparse data.
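Since CILSD builds on a FISTA-like update, a generic sketch of the underlying prox-linear FISTA iteration may be helpful. This is the textbook accelerated proximal gradient scheme of [163] for f(w) + λ‖w‖₁, not the exact distributed update; grad_f is a placeholder for the gradient of the cost-weighted smooth loss:

```python
import numpy as np

def fista(grad_f, d, eta, lam, iters=100):
    """Textbook FISTA: accelerated proximal gradient for f(w) + lam*||w||_1."""
    def soft_threshold(v, tau):
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
    w_prev = np.zeros(d)
    y = np.zeros(d)
    t = 1.0
    for _ in range(iters):
        w = soft_threshold(y - eta * grad_f(y), eta * lam)   # prox-linear step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0    # momentum schedule
        y = w + ((t - 1.0) / t_next) * (w - w_prev)          # extrapolation
        w_prev, t = w, t_next
    return w_prev
```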

5.1 Introduction

In the present work, we attempt to address point anomaly detection through class-imbalance
learning on big data in a distributed setting, exploiting sparsity structure. Without exploiting
sparsity, learning algorithms run slower, since they must work in the full feature space. We
propose to solve the class-imbalance learning problem from the cost-sensitive learning
perspective for various reasons: first, cost-sensitive learning can be directly applied to different
classification algorithms; second, cost-sensitive learning generalizes well over large data; and
third, it can take user-specified costs into account.

In summary, our contributions are as follows:

1. We propose, to the best of our knowledge, the first regularized cost-sensitive learning
formulation in a distributed setting exploiting sparsity in the data.

2. The problem is solved via (i) the Distributed Alternating Direction Method of Multipliers
(DADMM) and (ii) a distributed FISTA-like algorithm, with examples split across different
processing nodes. The solution obtained by the DADMM algorithm is called DSCIL, and the
solution obtained by distributed FISTA is called CILSD.

3. DADMM splits the problem into two subproblems: a distributed L2-regularized loss
minimization and an L1-regularized squared loss minimization. The first subproblem is solved
by L-BFGS as well as a random coordinate descent method, while the second subproblem is
just a soft-thresholding operation obtained in closed form. We call our algorithm using the
L-BFGS method L-DSCIL (Distributed Sparse Class-Imbalance Learning with L-BFGS) and
the one using the random coordinate descent method R-DSCIL.

4. All the algorithms are tested on various benchmark data sets, and the results are compared
with state-of-the-art algorithms as well as the centralized solution over various performance
measures besides Gmean (defined later).

5. We also show (i) the speedup, (ii) the effect of varying the cost, (iii) the effect of the number
of cores in the distributed implementation, and (iv) the effect of the regularization parameter.

6. Finally, we show a useful real-world application of the DSCIL and CILSD algorithms on
the KDDCUP 2008 data set.

5.2 Experiments

In this Section, we demonstrate the empirical performance of the proposed algorithm DSCIL
over various small and large-scale data sets [144]. A brief summary of the benchmark data
sets and their class-imbalance ratios is given in Table 2.3. All the algorithms were implemented
in C++ and compiled with g++ on a 64-bit Linux machine containing 48 cores (2.4 GHz CPUs).¹

5.2.1 Experimental Testbed and Setup

We compare our DSCIL algorithm with a recently proposed algorithm called Cost-Sensitive
First-Order Sparse Online Learning (CSFSOL) [28]. CSFSOL is a cost-sensitive online
algorithm based on the mirror descent update rule (see, for example, [119]). CSFSOL
optimizes the same objective function as ours, but its loss function is not smooth. Secondly, as
mentioned in the introduction, CSFSOL is an online centralized algorithm, whereas DSCIL is
a distributed algorithm. In the forthcoming subsections, we will show (i) the convergence of
DSCIL on benchmark data sets and (ii) the performance comparison of the DSCIL, CSFSOL,
and CSSCD algorithms in terms of accuracy, sensitivity, specificity, Gmean, and balanced
accuracy (also called Sum, defined as 0.5 × sensitivity + 0.5 × specificity). Note that within
the DSCIL algorithm, the L2 minimization solved via the L-BFGS algorithm is referred to as
L-DSCIL, and via the CSRCD algorithm as R-DSCIL. For our distributed implementation, we
use the MPICH2 library; MPICH is a high-performance C++ library based on message passing
between nodes to realize the distributed implementation.

¹ Sample code and data set for R-DSCIL can be downloaded from https://sites.google.com/site/chandreshiitr/publication

Note also that the DSCIL algorithm contains the ADMM penalty parameter ρ, which we set to
1; the convergence of DSCIL does not require tuning of this parameter. The regularization
parameter λ in DSCIL is set to 0.1 λ_max, where λ_max = (1/m)‖X^T ỹ‖_∞ (see [172]) and ỹ
is given by:

ỹ_i = m₋/m if y_i = 1,    ỹ_i = −m₊/m if y_i = −1,    i = 1, ..., m.

Setting λ as discussed above requires no tuning, in contrast to λ in CSFSOL, for which no
closed-form choice is available and one needs to perform cross-validation. The DSCIL
algorithm is stopped when the primal and dual residuals fall below the primal and dual residual
tolerances (see Chapter 3 of [171] for details). The CSFSOL algorithm contains one more
parameter, namely the learning rate η. These two parameters were searched over the ranges
{3 × 10⁻⁵, 9 × 10⁻⁵, 3 × 10⁻⁴, 9 × 10⁻⁴, 3 × 10⁻³, 9 × 10⁻², 0.3, 1, 2, 4, 8} and
{0.0312, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32}, respectively, as discussed in [28], and the
best value on the performance metric was chosen for testing. As an implementation note, we
normalize the columns of the data matrix so that each feature value lies in [−1, 1]. All the
results are obtained by running the distributed algorithms on 4 cores using MPI unless
otherwise stated.
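For reference, a minimal sketch of the λ_max computation just described, in dense NumPy for clarity (the thesis implementation works on sparse data; m₊ and m₋ denote the positive and negative example counts):

```python
import numpy as np

def lambda_max(X, y):
    """lambda_max = (1/m) * ||X^T y_tilde||_inf with class-balanced y_tilde."""
    m = len(y)
    m_pos = np.sum(y == 1)
    m_neg = m - m_pos
    y_tilde = np.where(y == 1, m_neg / m, -m_pos / m)
    return np.linalg.norm(X.T @ y_tilde, ord=np.inf) / m

# lam = 0.1 * lambda_max(X, y)   # the setting used for DSCIL
```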

5.2.2 Convergence of DSCIL

In this Section, we discuss the convergence of the L-DSCIL and R-DSCIL algorithms. The
convergence plots with respect to the DADMM iterations over various benchmark data sets are
shown in Figure 5.1. From the figure, it is clear that L-DSCIL converges faster than R-DSCIL,
which is expected, as L-DSCIL is a second-order (quasi-Newton) method while R-DSCIL is a
first-order method. On another note, we observe that on the w8a and rcv1 data sets the objective
function of R-DSCIL starts increasing, which indicates that the CSRCD algorithm, a random
coordinate descent algorithm, overshoots the minimum after a certain number of iterations.

[Figure 5.1 shows four panels plotting objective function values against DADMM iterations for L-DSCIL and R-DSCIL.]

Figure 5.1: Objective function vs. DADMM iterations over benchmark data sets: (a) ijcnn1
(b) rcv1 (c) pageblocks (d) w8a.

The convergence plot also shows the correctness of our implementation.

5.2.3 Comparative Study on Benchmark Data sets

1. Performance Comparison with respect to Gmean

We show the performance comparison of the compared algorithms with respect to various
metrics (Accuracy, Sensitivity, Specificity, Gmean, and Sum) in Tables 5.1 and 5.2. Here we
focus on the Gmean column, as this is the metric of our interest. Several conclusions can be
drawn from these tables. Firstly, L-DSCIL's performance is on par with the centralized solution
of CSSCD on many of the data sets.

Table 5.1: Performance comparison of CSFSOL, L-DSCIL, R-DSCIL, and CSSCD over various
benchmark data sets.

news
Algorithm | Accuracy | Sensitivity | Specificity | Gmean | Sum
CSFSOL | 0.966356 | 0.982759 | 0.966137 | 0.974412 | 0.974448
CSSCD | 0.451466 | 1 | 0.444137 | 0.666436 | 0.722069
Proposed L-DSCIL | 0.992271 | 0.568966 | 0.997927 | 0.753516 | 0.783446
Proposed R-DSCIL | 0.991134 | 0.396552 | 0.999079 | 0.629433 | 0.697815

url
Algorithm | Accuracy | Sensitivity | Specificity | Gmean | Sum
CSFSOL | 0.9725 | 0.994505 | 0.970297 | 0.982327 | 0.982401
CSSCD | 0.947 | 0.967033 | 0.944994 | 0.95595 | 0.956014
Proposed L-DSCIL | 0.9385 | 0.824176 | 0.949945 | 0.884829 | 0.8870605
Proposed R-DSCIL | 0.944 | 0.791209 | 0.959296 | 0.871208 | 0.8752525

gisette
Algorithm | Accuracy | Sensitivity | Specificity | Gmean | Sum
CSFSOL | 0.813 | 0.896 | 0.73 | 0.808752 | 0.813
CSSCD | 0.945 | 0.918 | 0.972 | 0.944614 | 0.945
Proposed L-DSCIL | 0.954 | 0.926 | 0.982 | 0.953589 | 0.954
Proposed R-DSCIL | 0.5 | 0 | 1 | 0 | 0.5

ijcnn1
Algorithm | Accuracy | Sensitivity | Specificity | Gmean | Sum
CSFSOL | 0.831932 | 0.607668 | 0.855475 | 0.721002 | 0.731571
CSSCD | 0.83456 | 0.827824 | 0.835267 | 0.831537 | 0.831546
Proposed L-DSCIL | 0.868126 | 0.576217 | 0.89877 | 0.719643 | 0.737493
Proposed R-DSCIL | 0.718258 | 0.982438 | 0.690525 | 0.823649 | 0.836482

covtype
Algorithm | Accuracy | Sensitivity | Specificity | Gmean | Sum
CSFSOL | 0.908817 | 0.86642 | 0.910199 | 0.88804 | 0.888309
CSSCD | 0.968433 | 0 | 1 | 0 | 0.5
Proposed L-DSCIL | 0.953617 | 0.699578 | 0.961897 | 0.820318 | 0.830737
Proposed R-DSCIL | 0.968433 | 0 | 1 | 0 | 0.5

Table 5.2: Performance comparison of CSFSOL, L-DSCIL, R-DSCIL and CSSCD over various
benchmark data sets.

rcv1
Algorithm           Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL              0.957      0.960148     0.953312     0.956724   0.95673
CSSCD               0.8989     0.871548     0.930945     0.900757   0.901246
Proposed L-DSCIL    0.9175     0.925116     0.908578     0.916809   0.916847
Proposed R-DSCIL    0.8697     0.792215     0.960478     0.872299   0.8763465

webspam
Algorithm           Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL              0.9925     0.952        0.9952       0.97336    0.9736
CSSCD               0.988      0.808        1            0.898888   0.904
Proposed L-DSCIL    0.9705     0.528        1            0.726636   0.764
Proposed R-DSCIL    0.885      0.968        0.879467     0.922672   0.923733

realsim
Algorithm           Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL              0.8825     0            1            0          0.5
CSSCD               0.9105     0.33617      0.986969     0.576012   0.66157
Proposed L-DSCIL    0.812071   0.900304     0.800324     0.848843   0.850314
Proposed R-DSCIL    0.824286   0.90152      0.814002     0.856644   0.857761

w8a
Algorithm           Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL              0.970637   0.0330396    1            0.181768   0.51652
CSSCD               0.97231    0.563877     0.9851       0.745302   0.774489
Proposed L-DSCIL    0.975587   0.682819     0.984755     0.820006   0.833787
Proposed R-DSCIL    0.978597   0.348018     0.998344     0.589442   0.673181

pageblocks
Algorithm           Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL              0.793056   0.662069     0.81306      0.73369    0.737564
CSSCD               0.887163   0.751724     0.907846     0.826105   0.829785
Proposed L-DSCIL    0.919598   0.706897     0.95208      0.820379   0.829488
Proposed R-DSCIL    0.90772    0.344828     0.993681     0.585362   0.669254

For example, it gives superior performance to CSSCD on the news, gisette, rcv1, realsim,
and pageblocks data sets in terms of Gmean. Secondly, the performance of R-DSCIL is
not as good as that of CSSCD; it was able to outperform CSSCD only on the realsim and
webspam data sets. Thirdly, the Gmean achieved by CSFSOL on large-scale data sets such
as news, rcv1, url, and webspam is higher than that of any of the other methods. This
observation can possibly be attributed to (i) the use of a strongly convex objective function
in our present work compared to the convex but non-smooth objective function in CSFSOL,
(ii) stopping the L-DSCIL algorithm before it reaches optimality, or (iii) the λmax value
calculated as discussed in subsection 5.3.1 not being the right choice. We believe that the
second and third reasons are more likely than the first one, as can be seen in the convergence
plot of the L-DSCIL algorithm in Figure 5.1. We stopped the L-DSCIL algorithm either
when the primal and dual residuals fell below the primal and dual feasibility tolerances or
when the maximum number of iterations was reached (which we set to 20 for large-scale
data sets). To verify this point, we ran another experiment with MaxIter set to 50. This
setting gives a Gmean of 0.923436 on the rcv1 data set, which is clearly larger than the
previously obtained value of 0.916809. Similarly, by running L-DSCIL and R-DSCIL for a
larger number of iterations, we can reach the desired accuracy. Finally, we also observe that
R-DSCIL fails to capture the class imbalance on gisette and covtype, as do CSSCD on
covtype and CSFSOL on realsim (their Gmean values being 0), which indicates the
possibility of using L-DSCIL in practical settings.
2. Study on Gmean versus Cost
We study the effect of varying the cost on Gmean. The results are presented in Figures 5.2
and 5.3 for L-DSCIL and R-DSCIL respectively. In each figure, cost and Gmean appear
on the x- and y-axis respectively. From these figures, we can draw the following observations.
Firstly, for balanced data such as rcv1, Gmean increases as the cost pair approaches equal
costs of 0.5 for each class. On the other hand, the more imbalanced the data set, the more
cost must be given to the positive class to obtain a higher Gmean (see the bar plots for news,
pageblocks, and url in Figure 5.2). The same observation can be made from Figure 5.3. These
observations allude to the fact that the right choice of cost is important; otherwise,
classification performance is affected severely. (A sketch of a cost-weighted loss of this
kind is given below.)
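To make the role of the cost pair concrete, the sketch below shows one generic way of folding per-class costs (c_pos, c_neg) into a hinge-type loss. This is an illustrative weighted loss under the cost pairs of Figures 5.2 and 5.3, not the exact objective minimized by DSCIL.

```python
import numpy as np

def cost_sensitive_hinge(w, X, y, c_pos, c_neg):
    """Average hinge loss with per-class costs, labels y in {-1, +1}.

    (c_pos, c_neg) is a cost pair such as {0.1, 0.9} from Figure 5.2;
    increasing c_pos penalizes mistakes on the (rare) positive class more.
    """
    margins = y * (X @ w)
    losses = np.maximum(0.0, 1.0 - margins)
    weights = np.where(y == 1, c_pos, c_neg)
    return float(np.mean(weights * losses))
```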
3. Speedup Measurement
In this Section, we discuss the speedup achieved by the R-DSCIL and L-DSCIL algorithms
when we run these algorithms on multiple cores.

Figure 5.2: Gmean versus Cost over various data sets for L-DSCIL algorithm. Cost is
given on the x-axis where each number denotes cost pair such that 1={0.1,0.9}, 2={0.2,0.8},
3={0.3,0.7}, 4={0.4,0.6}, 5={0.5,0.5}

Figure 5.3: Gmean versus Cost over various data sets for R-DSCIL algorithm. Cost is
given on the x-axis where each number denotes cost pair such that 1={0.1,0.9}, 2={0.2,0.8},
3={0.3,0.7}, 4={0.4,0.6}, 5={0.5,0.5}

Speedup results of the R-DSCIL algorithm are presented in Figure 5.4 (a) and (b), and those
of L-DSCIL in Figure 5.5 (a) and (b). The number of cores used is shown on the x-axis, and
the y-axis shows the training time. From these figures, we can draw multiple conclusions.
Firstly, from Figure 5.4 (a), we observe that as we increase the number of cores, the training
time decreases for all the data sets except for webspam at 8 cores. The sudden increase in
the training time of webspam can be explained as follows: as long as the computation time
(RCD running time plus primal and dual variable updates) remains above the communication
time (the MPI_Allreduce operation), adding more cores reduces the overall training time.
On the other hand, in Figure 5.4 (b), the training time first increases and then decreases for
all the data sets as the number of cores grows. This could be due to the communication time
growing with the increasing number of cores. After a certain number of cores, the
computation time starts to dominate the communication time, and thereby a decrease in
the training time is observed. Further, in the DSCIL algorithm, the communication time is
data dependent (it depends on the data dimensionality and sparsity). Because of these
reasons, we observe different speedup patterns on different data sets. The speedup results
of L-DSCIL in Figure 5.5 (a) and (b) show decreasing training time with an increasing
number of cores for all the data sets. We also observe that the training time of L-DSCIL is
higher than that of R-DSCIL on all data sets and core counts.
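The computation/communication trade-off above can be inspected directly by timing the local work and the MPI_Allreduce call separately. The following mpi4py sketch is a simplified, hypothetical stand-in for our C/MPI implementation (the dummy local computation merely doubles a vector) and is meant only to show the measurement idea; run it with, e.g., mpiexec -n 4 python timing.py.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_grad = np.random.default_rng(rank).standard_normal(100_000)
global_grad = np.empty_like(local_grad)

t0 = MPI.Wtime()
local_grad *= 2.0                     # placeholder for the local work (e.g. an RCD sweep)
t_compute = MPI.Wtime() - t0

t0 = MPI.Wtime()
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)   # consensus/averaging step
t_comm = MPI.Wtime() - t0

if rank == 0:
    print(f"cores={size}  compute={t_compute:.6f}s  communicate={t_comm:.6f}s")
```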
4. Number of Cores versus Gmean
In this Section, we present experimental results showing the effect of utilizing a varying
number of cores on Gmean. This also shows how different partitionings of the data affect
the Gmean. We divide the data set into equal chunks and distribute them to the various
cores: if we have a data set of size m and want to utilize n cores, we allot samples of size
m/n to each core (a sketch of this partitioning is given at the end of this item). We choose
the data size such that it is divisible by all the core counts utilized. Results are shown in
Figures 5.6 and 5.7 for R-DSCIL and L-DSCIL respectively. In Figure 5.6 (a) and (b), we
can observe that Gmean remains almost constant over the various partitionings of the data
(various core counts). A small deviation is observed over the url and w8a data sets in
Figure 5.6 (b). These observations lead to the conclusion that the R-DSCIL algorithm is
not very sensitive to the data partitioning.

Figure 5.4: Training time versus number of cores to measure the speedup of R-DSCIL algo-
rithm. Training time in Figure (a) is on the log scale.

Figure 5.5: Training time versus number of cores to measure the speedup of L-DSCIL algo-
rithm. Training time in both the figures is on the log scale.

Figure 5.6: Effect of varying number of cores on Gmean in R-DSCIL algorithm.



Figure 5.7: Effect of varying number of cores on Gmean in L-DSCIL algorithm.



On the other hand, the Gmean results of the L-DSCIL algorithm for the various partitionings
in Figures 5.7 (a) and (b) show chaotic behavior. For example, Gmean changes considerably
with the increasing number of cores for news in Figure 5.7 (a) and for realsim in Figure
5.7 (b). For the other data sets, the fluctuations in Gmean values are smaller.
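The equal-chunk partitioning used throughout this item amounts to the following; the sketch assumes, as in our experiments, that the number of samples is divisible by the number of cores, and in a real distributed run each chunk would be handed to a core via MPI scatter.

```python
import numpy as np

def partition_rows(X, y, n_cores):
    """Split (X, y) into n_cores equal chunks of m/n examples each."""
    m = X.shape[0]
    assert m % n_cores == 0, "choose the data size to be divisible by the core count"
    chunk = m // n_cores
    return [(X[k * chunk:(k + 1) * chunk], y[k * chunk:(k + 1) * chunk])
            for k in range(n_cores)]
```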
5. Effect of Regularization Parameter on Gmean
In this subsection, we discuss the effect of the regularization parameter λ on the Gmean
produced by the R-DSCIL algorithm, as shown in Figure 5.8 for various benchmark data
sets. It is clear from Figure 5.8 that Gmean drops with an increasing regularization
parameter on all the data sets tested. On some data sets, such as rcv1, Gmean drops
gradually with increasing λ; on others, such as w8a and pageblocks, Gmean falls off to 0
quickly with increasing λ. This also shows that the higher the sparsity in a data set (such as
rcv1, webspam, etc.), the higher the penalty required to achieve a high Gmean, and vice versa.

Figure 5.8: Gmean versus regularization parameter λ using R-DSCIL over (a) ijcnn1 (b) rcv1
(c) pageblocks (d) w8a (e) news (f) url (g) realsim (h) webspam.

5.3 Experiments

Below, we present empirical simulation results on various benchmark data sets as given in Table
2.3.

5.3.1 Experimental Testbed and Setup

For our distributed implementation, we used the MPICH2 library [173]. We compare the
performance of the CILSD algorithm with that of the CSFSOL algorithm of [28]. In the
forthcoming subsections, we will show (i) the convergence of CILSD on benchmark data sets,
(ii) the performance comparison of the CILSD, CSFSOL, and CSSCD algorithms in terms of
accuracy, sensitivity, specificity, Gmean, and balanced accuracy (also called Sum, defined as
0.5*sensitivity + 0.5*specificity), (iii) the speedup, (iv) Gmean versus the number of cores,
and (v) Gmean versus the regularization parameter.
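For reference, all of the metrics in (ii) can be computed from the binary confusion matrix as in the sketch below; Gmean is the geometric mean of sensitivity and specificity. As a quick check, for CSFSOL on news in Table 5.1, sqrt(0.982759 × 0.966137) ≈ 0.974412, matching the reported Gmean.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, Gmean and Sum for labels in {-1, +1}."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == -1) & (y_pred == -1)))
    fp = int(np.sum((y_true == -1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == -1)))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / max(tp + fn, 1)              # true positive rate
    specificity = tn / max(tn + fp, 1)              # true negative rate
    gmean = np.sqrt(sensitivity * specificity)
    total = 0.5 * sensitivity + 0.5 * specificity   # balanced accuracy ("Sum")
    return accuracy, sensitivity, specificity, gmean, total
```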

5.3.2 Convergence of CILSD

In this Section, we show the convergence of CILSD in two scenarios. In the first scenario, the
convergence of CILSD is shown when we search for the best learning rate over a validation
set in the range {0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3}. In the second scenario, we use the
learning rate 1/L. The convergence plots for both scenarios are shown in Figure 5.9. We can
clearly see that the objective function converges faster with the learning rate set to 1/L (Obj2)
than when the learning rate is searched over the range of possible values (Obj1). These results
show the correctness of our implementation as well as how to choose the learning rate.
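Since CILSD builds on a FISTA-like update, the generic FISTA iteration of [35] for an L1-regularized smooth loss, sketched below, shows where the 1/L step size enters. This is the textbook scheme, not our distributed implementation; grad_f and L (the Lipschitz constant of the gradient) are assumed to be supplied by the caller.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(grad_f, L, lam, dim, iters=100):
    """Generic FISTA for min_x f(x) + lam * ||x||_1, with f having L-Lipschitz gradient.

    The step size is fixed to 1/L, so no learning-rate tuning is required.
    """
    x = np.zeros(dim)
    y = x.copy()
    t = 1.0
    for _ in range(iters):
        x_new = soft_threshold(y - grad_f(y) / L, lam / L)   # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0     # momentum coefficient
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)        # extrapolation
        x, t = x_new, t_new
    return x
```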

5.3.3 Comparative Study on Benchmark Data sets

1. Performance Comparison with respect to Gmean


We compare the performance of the CILSD, CSFSOL, and CSSCD algorithms with respect
to the various metrics mentioned at the beginning of this section. In particular, we focus
on the Gmean metric. The comparative results are shown in Tables 5.3 and 5.4. From these
results, we observe that CILSD achieves an equal or higher Gmean than CSSCD on 8 out
of 10 data sets. On the other side, the Gmean achieved by CILSD follows closely the Gmean
achieved by CSFSOL, an online algorithm, on most of the data sets. In some cases, such
as on realsim and w8a, CILSD outperforms both CSFSOL and CSSCD in terms of Gmean.
If we compare the results in Tables 5.1 and 5.2 with those in Tables 5.3 and 5.4, we can
see that CILSD achieves a higher Gmean than DSCIL (both R-DSCIL and L-DSCIL) on
most of the data sets. These observations indicate the possibility of using CILSD on real
data sets for class-imbalance learning in a distributed scenario.
2. Speedup Measurements
In order to see how the CILSD training time varies with the number of cores, we partition
the data matrix into equal chunks across examples and distribute them to the different
processing nodes (cores). Each core runs a local copy of the CILSD algorithm. The training
time of the CILSD algorithm for different partitionings of the data is shown in Figure 5.10
(a) and (b). We can clearly see that the training time decreases linearly with the number of
cores for all the data sets. These results demonstrate the utility of employing multiple cores
for the class-imbalance learning task.
3. Gmean versus Number of Cores
We conduct an experiment to see how Gmean varies with different partitionings of the
data. The results are shown in Figure 5.11 (a) and (b).

Figure 5.9: Objective function vs iterations over various benchmark data sets: (a) ijcnn1 (b) rcv1
(c) pageblocks (d) w8a. Obj1 denotes the objective function when the best learning rate is searched
over {0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3}, while Obj2 denotes the objective function value
obtained with the learning rate 1/L.

Table 5.3: Performance comparison of CSFSOL, CSSCD, and CILSD over various benchmark
data sets.

news
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.966356   0.982759     0.966137     0.974412   0.974448
CSSCD       0.451466   1            0.444137     0.666436   0.722069
CILSD       0.97795    0.913793     0.978807     0.945741   0.9463

rcv1
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.957      0.960148     0.953312     0.956724   0.95673
CSSCD       0.8989     0.871548     0.930945     0.900757   0.901246
CILSD       0.9414     0.9481       0.93355      0.940797   0.940825

url
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.9725     0.994505     0.970297     0.982327   0.982401
CSSCD       0.947      0.967033     0.944994     0.95595    0.956014
CILSD       0.965      0.978022     0.963696     0.970833   0.970859

webspam
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.9925     0.952        0.9952       0.97336    0.9736
CSSCD       0.988      0.808        1            0.898888   0.904
CILSD       0.943      0.928        0.944        0.935966   0.936

gisette
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.813      0.896        0.73         0.808752   0.8131
CSSCD       0.451466   1            0.444137     0.666436   0.722069
CILSD       0.799      0.6          0.998        0.773822   0.799

realsim
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.8825     0            1            0          0.5
CSSCD       0.9105     0.33617      0.986969     0.576012   0.66157
CILSD       0.789214   0.93617      0.769648     0.848835   0.852909

ijcnn1
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.831932   0.607668     0.855475     0.721002   0.731571
CSSCD       0.83456    0.827824     0.835267     0.831537   0.831546
CILSD       0.902084   0.507576     0.943499     0.692024   0.725537

Table 5.4: Performance comparison of CSFSOL, CSSCD, and CILSD over various benchmark
data sets.

w8a
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.970637   0.0330396    1            0.181768   0.51652
CSSCD       0.97231    0.563877     0.9851       0.745302   0.774489
CILSD       0.971908   0.585903     0.983997     0.759294   0.78495

covtype
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.908817   0.86642      0.910199     0.88804    0.888309
CSSCD       0.968433   0            1            0          0.5
CILSD       0.96855    0.0227033    0.99938      0.150629   0.511042

pageblocks
Algorithm   Accuracy   Sensitivity  Specificity  Gmean      Sum
CSFSOL      0.793056   0.662069     0.81306      0.73369    0.737564
CSSCD       0.887163   0.751724     0.907846     0.826105   0.829785
CILSD       0.910005   0.727586     0.937862     0.82606    0.832724

From these results, we can infer that Gmean does not remain constant over different
partitionings of the data. For some data sets, such as realsim, Gmean decreases with an
increasing number of cores. On the other hand, Gmean first increases and then decreases
for rcv1 and ijcnn1, whereas it continuously increases for some data sets, such as url.
This chaotic behavior of CILSD may be due to the different proportions of positive and
negative samples in the data chunks allotted to the different nodes. The above observation
indicates that the CILSD algorithm is sensitive to the partitioning of the data and that the
right choice of data partitioning is important.
4. Gmean versus Regularization Parameter
In this Section, the effect of the sparsity-promoting parameter λ on Gmean is demonstrated.
The results are shown in Figure 5.12. From these results, we can clearly observe that
Gmean is quite sensitive to the setting of the regularization parameter λ. For example, Gmean
on the url data set drops slowly with increasing λ, while it quickly drops to 0 for gisette.

Figure 5.10: Training time versus number of cores to measure the speedup of CILSD algorithm.
Training time in both the figures is on the log scale.

Figure 5.11: Gmean achieved by CILSD algorithm versus number of cores on various bench-
mark data sets.

Figure 5.12: Effect of regularization parameter λ on Gmean: (a) ijcnn1 (b) rcv1 (c) gisette
(d) news (e) webspam (f) url (g) w8a (h) realsim. λ varies in {3.00E-007, 0.000009, 0.00003,
0.0009, 0.003, 0.09, 0.3}.

Table 5.5: Performance evaluation of R-DSCIL, CILSD, CSFSOL and CSOGD-I on KDDCUP
2008 data set.

Algorithm Sum
CSOGD-I 0.5741092
CSFSOL 0.71089
R-DSCIL 0.7336652
CILSD 0.714153

5.3.4 Comparative Study on Benchmark and Real Data sets

Breast cancer detection [174, 175] is a type of anomaly detection problem. One in every eight
women around the world is susceptible to breast cancer. In this section, we demonstrate the
applicability of our proposed algorithms for anomaly detection in the X-ray images of the KDDCup
2008 data set [176]. The KDDCup 2008 data set contains information on 102294 suspicious regions,
each region described by 117 features. Each region is either "benign" or "malignant", and the
ratio of malignant to benign regions is 1:163.19. We split the training data set into 5 chunks.
The first four chunks have a size of 20000 candidates each and are used for training on 4 cores.
The last chunk has size 22294 and is used for testing. We compare the performance of R-DSCIL
and CILSD with CSFSOL and with CSOGD-I, which was proposed in [29]. It is shown in [29] that
CSOGD-I outperforms many first-order algorithms such as ROMMA, PA-I, PA-II, PAUM,
CPA_PB, etc. (see Table 6 in [29]); hence, we only compare with CSFSOL and CSOGD-I in
our experiments. We note that CSFSOL and CSOGD-I are online algorithms that do not require
separate training and test sets. We set the learning rate in CSOGD-I to 0.2, as discussed in
their paper, and reproduced the results. What is important here is that the ratio of malignant
to benign regions in our test set is 1:123, which does not differ much from the ratio of malignant
to benign tumors in the original data set. The comparative performance is reported in Table 5.5
with respect to the Sum metric, as defined in the experiments section. From the results reported
in Table 5.5, we can clearly see that the Sum value achieved by R-DSCIL is much larger than
the value obtained by CSOGD-I, and the Sum performance of CILSD remains above that of both
CSOGD-I and CSFSOL. These observations indicate the possibility of using
R-DSCIL and CILSD in real-world anomaly detection tasks.

5.4 Discussion

In the present work, we proposed two algorithms for handling class imbalance in a distributed
environment on small- and large-scale data sets. The DSCIL algorithm is implemented in two
flavors: one uses a second-order method (L-BFGS) and the other a first-order method (RCD) to
solve the subproblem in the DADMM framework. In our empirical comparison, we showed the
convergence behavior of L-DSCIL and R-DSCIL, where L-DSCIL converges faster than R-DSCIL.
Secondly, the Gmean achieved by L-DSCIL is close to the Gmean of a centralized solution for most
of the data sets, whereas the Gmean achieved by R-DSCIL varies due to its random updates of
coordinates. Thirdly, coming to the training time comparison, we found in our experiments that
R-DSCIL has a cheaper per-iteration cost but takes a longer time to reach ε-accuracy compared
to the L-DSCIL algorithm. Finally, the effects of varying the cost, the regularization parameter,
and the number of cores were also demonstrated. The empirical comparison showed the potential
application of L-DSCIL and R-DSCIL to real-world class-imbalance and anomaly detection tasks,
where our algorithms outperformed some recently proposed algorithms.

Our second algorithm (CILSD) is based on a FISTA-like update rule. We showed, through extensive
experiments on benchmark and real data sets, the convergence behavior of CILSD, its speedup,
and the effect of the number of cores and the regularization parameter on Gmean. In particular,
the CILSD algorithm does not require tuning of the learning rate parameter, which can be set to
1/L as in gradient descent. Comparative evaluation with respect to a recently proposed
class-imbalance learning algorithm and a centralized algorithm shows that CILSD is able to either
outperform them or perform equally well on many of the data sets tested. The speedup results
demonstrate the advantage of employing multiple cores. We also observed, in our experiments,
chaotic behavior of Gmean with respect to a varying number of cores. The experiment on the
KDDCup data set indicates the possibility of using the CILSD algorithm for a real-world distributed
anomaly detection task. A comparison of DSCIL and CILSD shows that CILSD converges faster
and achieves a higher Gmean than DSCIL.
Chapter 6

Unsupervised Anomaly Detection using SVDD - A Case Study

The anomaly detection techniques discussed in the previous chapters are supervised, that is, they
require labels for normal as well as anomalous examples to build the model. However, real-world
data rarely contain labels, which means the techniques discussed in the previous chapters cannot
be applied. Therefore, we turn to unsupervised anomaly detection.

In this chapter, we study a robust algorithm, based on the support vector data description (SVDD)
of [36], for anomaly detection in real data. The data set used in the experiment comes from a
nuclear power plant and represents the count of neutrons in the reactor channels. We apply the
SVDD algorithm to the nuclear power plant data to detect channels with anomalous neutron
emission. Experiments demonstrate the effectiveness of the algorithm as well as uncover anomalies
in the data set. We also discuss extensions of the algorithm to find anomalies in high-dimensional
and non-linearly separable data.

6.1 Introduction

Our case study is based on the support vector data description algorithm. Tax et al. [36] first
proposed support vector data description in 2004; their work was recently extended to uncertain
data by Liu et al. [88]. The original work of Tax was based on a support vector classifier in an
unsupervised setting. Later, this was applied in semi-supervised and supervised settings with
little modification by Gornitz [177]. Next, we describe some key terms related to SVDD.

Figure 6.1: Support vectors in a two-class classification problem

Definition 5. Support vectors are the set of points that lie on the boundary of the region
separating normal and anomalous points, as shown in Fig. 6.1.

Definition 6. Support vector domain description concerns the characterization of a data set
through support vectors.

6.2 Support Vector Data Description Revisited

We describe the support vector data description algorithm [36] for completeness. The problem is
to make a description of a training set of data instances and to detect which (new) data instances
resemble this training set. SVDD is essentially a one-class classifier [178]. The basic idea of
SVDD is that a good description of the data encompasses only the normal instances; outliers
reside either at the boundary or outside of the hypersphere (the generalization of a sphere to
more than three dimensions) containing the data set, as shown in Fig. 6.2. The method is made
robust against outliers in the training set and is capable of tightening the description by using
anomalous examples. The basic SVDD algorithm is given in Algorithm 3.

The idea of anomaly detection using a minimal hypersphere (SVDD) is as follows. From the
preprocessed data in the kernel matrix, we calculate the center of mass of the data, which forms
the center of the hypersphere. Subsequently, we compute the distances of the training points
from this empirical estimate of the center of mass.
Algorithm 3 Anomaly detection using SVDD

Require: Training data X = (x1 , x2 , . . . , xn )T and test data Y = (y1 , y2 , . . . , yn )T
Ensure: Novel point indices
1: Calculate the kernel matrix K from the training data using any kernel function, e.g. the RBF
function exp(−||x − z||^2 /(2σ^2))
2: Set the confidence parameter δ = 0.01
3: Compute the distances of the training data to the centre of mass
4: Compute the estimation error of the empirical centre of mass
5: Compute the resulting threshold
6: Compute the distances of the test data to the centre of mass
7: Indices of novel test points are calculated as: novelindices = find(testdist2 > threshold)

The empirical estimation error in the center of mass is added to the maximum distance of a
training point from the center of mass to give the threshold. Any point in the test data lying
beyond the threshold is classified as anomalous.

Below, we show how the kernel matrix and the center of mass (lines 1 and 3 of the algorithm)
are computed, because they form the heart of the algorithm.

• Kernel Matrix Construction: The kernel matrix is a matrix whose (i, j)th entry encodes the
similarity between instances i and j. From the implementation point of view, we have used
the RBF kernel (aka Gaussian kernel), given as

K(i, j) = exp(−||x − z||^2 /(2σ^2))    (6.1)

where x and z are two samples and σ is the kernel width, very similar to a standard deviation.
We note that the final result depends upon the value of σ, so a careful choice of σ is required.
In our implementation, we have used σ = 1.
• Computing the center of mass (COM): Intuitively, the center of mass of a set of points is
the same as its center of gravity. More formally, it is defined as

φ_s = (1/m) Σ_{i=1}^{m} φ(x_i)    (6.2)

where m is the size of the training set and φ is a map from input space to feature space. In
fact, we compute the distances of the training data from the COM, which is equivalent to
centering the kernel matrix. For a detailed explanation, see [36].

Figure 6.2: Illustrates the spheres for data generated according to a spherical two-dimensional
Gaussian distribution. The center part shows the center of mass of the training points. Anything
outside the boundary can be considered as outliers.

A compact sketch of these two computations (lines 1 and 3 of Algorithm 3), together with the
thresholding step, is given below.
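The sketch is in Python purely for illustration (the implementation used in our experiments is in MATLAB, as noted in Section 6.3.1). The squared feature-space distance to the centre of mass follows from expanding ||φ(y) − φ_s||^2 = k(y, y) − (2/m) Σ_i k(y, x_i) + (1/m^2) Σ_{i,j} k(x_i, x_j); the estimation-error term of lines 4-5 is passed in here as a user-supplied constant est_error, since its concentration bound is omitted in this simplified version.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Kernel matrix with K[i, j] = exp(-||x_i - z_j||^2 / (2 * sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svdd_novelties(X_train, Y_test, sigma=1.0, est_error=0.0):
    """Lines 1-7 of Algorithm 3 with a simplified threshold (max training
    distance to the centre of mass plus a supplied estimation error)."""
    K = rbf_kernel(X_train, X_train, sigma)                     # line 1
    train_dist = np.diag(K) - 2.0 * K.mean(axis=1) + K.mean()   # line 3
    threshold = train_dist.max() + est_error                    # lines 4-5
    K_ty = rbf_kernel(Y_test, X_train, sigma)
    test_dist = 1.0 - 2.0 * K_ty.mean(axis=1) + K.mean()        # line 6 (k(y, y) = 1 for RBF)
    return np.where(test_dist > threshold)[0]                   # line 7
```

In the nuclear power plant experiment of Section 6.3, such a routine would be invoked once per reactor channel with σ = 1.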

Note that the data can be (1) high-dimensional or (2) non-linearly separable. To handle the first
case, we can introduce multiple kernels, one corresponding to each feature, and learn them from
the training data. To take the second case into account, we can use a higher-order polynomial
or Gaussian kernel (a kernel being a function that maps low-dimensional data to a higher
dimension). Since we have the model at our disposal, we now embark on evaluating it on a real
data set, as described in the next Section.

6.3 Experiments

In this Section, we present the results of our numerical simulation.

6.3.1 Experimental Testbed and Setup

Algorithm 3 is implemented in MATLAB. To perform the experiment, we constructed the
hypersphere using a training sample of size 600, under the assumption that the training set
contains only a few anomalous points. We selected the first 600 data points as the training
sample, although we could also have selected the training samples uniformly at random. The
selection of the initial 600 data points as the training set is based on the assumption that the
reactor channels were working in the normal condition during that period.

Figure 6.3: Nuclear power plant data with marked anomalous subsequence

During testing, a test sample of size 5714 (= 6314 − 600) is presented to the model. In the
course of our experiment, we make the following assumptions.

• The threshold value for the confidence parameter is δ = 0.01; that is, the probability that
the test error is less than or equal to the training error is 99%.
• Algorithm 3 is run feature-wise, that is, we ran Algorithm 3 for each channel individually
to detect point anomalies.
• The kernel used is Gaussian.
• All the indices in the figures are with respect to the transformed data (log transformation).

6.3.2 The Dataset


The data set used in the experiments comes from the nuclear power plant at Bhabha Atomic
Research Center, Mumbai, India. It consists of the neutron flow in the nuclear reactor channels.
The raw data contain some textual data and irrelevant columns, for example, detector no.,
batch no., and column no. 15; hence, we removed them, resulting in 14 relevant columns
(features) overall. A plot of the preprocessed data from a single channel, comprising 500
tuples, is shown in Fig. 6.3. The given data span a duration of 6 years (from 2005 to 2010).
After preprocessing, that is, removing noise from the data, the data set constitutes 6314 tuples.

6.4 Results

The results of applying Algorithm 3 to the data from different channels of the reactor are shown
in Figs. 6.4, 6.5, and 6.6. The detected points exactly match the anomalous points (in fact, the
neutron count was as low as 0 to 5 and as high as 170+, which is considered an abnormal flow
of neutrons during some particular time periods), which we verified with the experts at Bhabha
Atomic Research Center. In Fig. 6.6, a large number of anomalous points accumulate; it was
verified later that this was due to a technical fault in the detector during the specified duration.

6.5 Discussion

In the present work, we studied the support vector data description algorithm for anomaly
detection in nuclear power plant data. We observe that SVDD efficiently and effectively finds
point anomalies in the nuclear reactor data set. Here, we applied the SVDD algorithm to each
reactor channel individually; in the future, we plan to use more state-of-the-art work on
unsupervised anomaly detection in the multivariate setting.
Figure 6.4: Anomalies (marked in red) found in detector 1 of the power plant data

Figure 6.5: Anomalies (marked in red) found in detector 1 of the power plant data

Figure 6.6: Anomalies (marked in red) found in detector 1 of the power plant data
Chapter 7

Conclusions and Future Work

Anomaly detection is an important task in machine learning and data mining. Due to unprece-
dented growth in data size and complexity, data miners and practitioners are overwhelmed with
what is called big data. Big data incapacitates the traditional anomaly detection techniques. As
such, there is an urgent need to develop efficient and scalable techniques for anomaly detection
in big data.

7.1 Conclusions

In the present research work, we proposed four novel algorithms for handling anomaly detection
in big data. PAGMEAN and ASPGD are based on the online learning paradigm, whereas
DSCIL and CILSD are based on the distributed learning paradigm. In order to handle the anomaly
detection problem in big data, we took an approach different from many works in the literature:
specifically, we employ the class-imbalance learning approach to tackle point anomalies.

PAGMEAN is an online algorithm for class-imbalance learning and anomaly detection in the
streaming setting. In chapter 3, we showed how we can directly optimize a non-decomposable
performance metric, Gmean, in the binary classification setting. Doing so gives rise to a non-convex
loss function, and we employ a surrogate loss function for handling the non-convexity. Subsequently,
the surrogate loss is used within the PA framework to derive the PAGMEAN algorithms. We show,
through extensive experiments on benchmark and real data sets, that PAGMEAN outperforms
its parent algorithms PA and the recently proposed algorithm CSOC in terms of Gmean. However,
at the same time, we observed that the PAGMEAN algorithms suffer a higher mistake rate than
the other algorithms we compared with.

In chapter 4, we proposed the ASPGD algorithm for tackling anomaly detection in streaming,
high-dimensional, and sparse data. We utilize the accelerated stochastic proximal learning
framework with a cost-sensitive smooth hinge loss, which applies a penalty based on the number
of positive and negative samples received so far. We also proposed a non-accelerated variant of
ASPGD, that is, one without Nesterov's acceleration, called ASPGDNOACC. An empirical study
on real and benchmark data sets shows that acceleration is not always helpful, either in terms of
Gmean or mistake rate. In addition, we also compared with a recently proposed algorithm called
CSFSOL; we found that ASPGD outperforms CSFSOL in terms of Gmean, F-measure, and
mistake rate on many of the data sets tested.

In order to handle anomaly detection in sparse, high-dimensional, and distributed data, we
proposed the DSCIL and CILSD algorithms in chapter 5. In particular, the DSCIL algorithm is
based on the distributed ADMM framework and utilizes a cost-sensitive loss function. Within
the DSCIL algorithm, we solve the L2-regularized loss minimization problem via (1) the L-BFGS
method (called L-DSCIL) and (2) the random coordinate descent method (called R-DSCIL).
Firstly, empirical convergence analysis shows that L-DSCIL converges faster than R-DSCIL.
Secondly, the Gmean achieved by L-DSCIL is close to the Gmean of a centralized solution on
most of the data sets, whereas the Gmean achieved by R-DSCIL varies due to its random updates
of coordinates. Thirdly, coming to the training time comparison, we found in our experiments
that R-DSCIL has a cheaper per-iteration cost but takes a longer time to reach ε-accuracy
compared to the L-DSCIL algorithm. The real-world anomaly detection application on the
KDDCup 2008 data set clearly shows the potential advantage of using the R-DSCIL algorithm.

CILSD, which is a cost-sensitive distributed FISTA-like algorithm, is a parameter-free algorithm
for anomaly detection. We show, through extensive experiments on benchmark and real data sets,
the convergence behavior of CILSD, its speedup, and the effect of the number of cores and the
regularization parameter on Gmean. In particular, the CILSD algorithm does not require tuning
of the learning rate parameter, which can be set to 1/L as in the gradient descent algorithm.
Comparative evaluation with respect to a recently proposed class-imbalance learning algorithm
and a centralized algorithm shows that CILSD is able to either outperform them or perform
equally well on many of the data sets tested. The speedup results demonstrate the advantage of
employing multiple cores. We also observed, in our experiments, chaotic behavior of Gmean with
respect to a varying number of cores. The experiment on the KDDCup data set indicates the
possibility of using the CILSD algorithm for a real-world distributed anomaly detection task. A
comparison of DSCIL and CILSD shows that CILSD converges faster and achieves a higher
Gmean than DSCIL.

We present a case study of anomaly detection on real-world data in chapter 6. Our data came
from a nuclear power plant and are unlabeled, so none of the algorithms discussed above can be
applied, since they are based on supervised learning, i.e., they require labels for normal and
anomalous instances. Therefore, we utilize SVDD, an unsupervised learning algorithm, for
anomaly detection in this real-world data. Empirical results show the effectiveness and efficiency
of the SVDD algorithm.

7.2 Future Works

In this section, we discuss some potential research directions for the future. Our algorithms,
though scalable to high dimensions and able to handle sparse and distributed data, have certain
limitations. For example, we did not handle concept drift specifically in our setting; as a first
line of future work, one may look at utilizing concept drift detection techniques within the
framework we used. Secondly, big data is often not only distributed but also streaming in the
real world; therefore, online distributed algorithms may be developed to handle anomaly detection.
As a third direction, data heterogeneity may be combined with the streaming, sparse, and
high-dimensional characteristics of big data while detecting anomalies.

Besides the above tasks, one may consider extending the existing framework to handle
subsequence and contextual anomalies in big data.
References

[1] V. Mahadevan, Weixin Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in


crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 1975–1981, 2010.

[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey.
Journal of ACM Comput. Surv., 41(3):15:1–15:58, July 2009.

[3] Liang Xiong, Xi Chen, and J. Schneider. Direct robust matrix factorization for anomaly
detection. In 11th International Conference on Data Mining (ICDM), IEEE, pages 844–
853, 2011.

[4] Mia Hubert, Peter J. Rousseeuw, and Karlien Vanden Branden. Robpca: A new approach
to robust principal component analysis. Journal of Technometrics, pages 64–79, 2005.

[5] João B. D. Cabrera, Carlos Gutiérrez, and Raman K. Mehra. Ensemble methods for
anomaly detection and distributed intrusion detection in mobile ad-hoc networks. Journal
of Information Fusion, Elsevier, 9(1):96–119, 2008.

[6] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining,
(First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA,
2005.

[7] S.K. Gupta, V. Bhatnagar, and S.K. Wasan. Architecture for knowledge discovery and
knowledge management. Journal of Knowledge and Information Systems, 7(3):310–336,
2005.

[8] Vasudha Bhatnagar, S K Gupta, and S K Wasan. On mining of data. IETE Journal of
Research, 47(1-2):5–17, 2001.


[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A
survey. IEEE Transactions on Knowledge and Data Engineering, 24(5):823–839, May
2012.

[10] Varun Chandola, Varun Mithal, and Vipin Kumar. A reference based analysis framework
for understanding anomaly detection techniques for symbolic sequences. Data Mining
and Knowledge Discovery, 28(3):702–735, 2013.

[11] Deris Stiawan, Abdul Hanan Abdullah, and Mohd. Yazid Idris. Threat and vulnerability
penetration testing: Linux. Journal of Internet Technology, 15(3):333–342, 2014.

[12] Suvrojit Das, Debayan Chatterjee, Debidas Ghosh, and Narayan C. Debnath. Extracting
the system call identifier from within VFS: a kernel stack parsing-based approach. IJICS
Journal, 6(1):12–50, 2014.

[13] Dinil Mon Divakaran, Hema A. Murthy, and Timothy A. Gonsalves. Detection of syn
flooding attacks using linear prediction analysis. In 14th IEEE Int'l Conf on Networks,
ICON, Singapore, Sep 2006.

[14] Chundury Jagadish and Timothy A. Gonsalves. Distributed control of event floods in
a large telecom network. International Journal of Network Management, 20(2):57–70,
2010.

[15] Sanjay Mittal, Rahul Gupta, Mukesh Mohania, Shyam K. Gupta, Mizuho Iwaihara, and
Tharam Dillon. 7th International Conference, EC-Web 2006 E-Commerce and Web Tech-
nologies, Krakow, Poland, September 5-7, 2006. Proceedings, chapter Detecting Frauds
in Online Advertising Systems, pages 222–231. Springer Berlin Heidelberg, Berlin, Hei-
delberg, 2006.

[16] A. Ramasamy, Hema A. Murthy, and T.A. Gonsalves. Linear prediction for traffic man-
agement and fault detection. In ICIT, pages 141–144, Dec 2000.

[17] F. Y. Edgeworth. XLI. On discordant observations. Philosophical Magazine Series 5,
23(143):364–375, 1887.

[18] Frank E. Grubbs. Procedures for detecting outlying observations in samples. Technomet-
rics, 11(1):1–21, February 1969.

[19] Santanu Das, Bryan L. Matthews, Ashok N. Srivastava, and Nikunj C. Oza. Multiple
kernel learning for heterogeneous anomaly detection: algorithm and aviation safety case
study. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge
discovery and data mining, KDD ’10, pages 47–56, New York, NY, USA, 2010. ACM.

[20] Shreya Banerjee, Renuka Shaw, Anirban Sarkar, and Narayan C. Debnath. Towards
logical level design of big data. In 13th IEEE International Conference on Industrial
Informatics, INDIN 2015, Cambridge, United Kingdom, July 22-24, 2015, pages 1665–
1671, 2015.

[21] Neha Bharill and Aruna Tiwari. Handling big data with fuzzy based classification ap-
proach. In Advance Trends in Soft Computing - Proceedings of WCSC, December 16-18,
San Antonio, Texas, USA, pages 219–227, 2013.

[22] Eamonn J. Keogh, Jessica Lin, and Ada Wai-Chee Fu. Hot sax: Efficiently finding
the most unusual time series subsequence. In International Conference on Data Min-
ing(ICDM), pages 226–233, 2005.

[23] Charu C. Aggarwal and Philip S. Yu. Outlier detection for high dimensional data. In
Proceedings of the ACM SIGMOD International Conference on Management of Data,
SIGMOD ’01, pages 37–46, New York, NY, USA, 2001. ACM.

[24] Timothy de Vries, Sanjay Chawla, and Michael E. Houle. Density-preserving projections
for large-scale local anomaly detection. Journal of Knowledge and Information Systems,
Springer, 32(1):25–52, July 2012.

[25] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. Database Theory
— ICDT’99: 7th International Conference Jerusalem, Israel, January 10–12, Proceed-
ings, chapter When Is “Nearest Neighbor” Meaningful? Springer Berlin Heidelberg,
Berlin, Heidelberg, 1999.

[26] Bilal Mirza, Zhiping Lin, and Nan Liu. Ensemble of subset online sequential extreme
learning machine for class imbalance and concept drift. Neurocomputing, 149, Part
A:316 – 329, 2015.

[27] Sharanjit Kaur, Vasudha Bhatnagar, Sameep Mehta, and Sudhir Kapoor. Categorizing
concepts for detecting drifts in stream. In Proceedings of the 15th International Confer-
ence on Management of Data, December 9-12, Mysore, India, 2009.

[28] Dayong Wang, Pengcheng Wu, Peilin Zhao, and Steven C. H. Hoi. A framework of
sparse online learning and its applications. CoRR, abs/1507.07146, 2015.

[29] Jialei Wang, Peilin Zhao, and S.C.H. Hoi. Cost-sensitive online classification. IEEE
Transactions on Knowledge and Data Engineering, 26(10):2425–2438, Oct 2014.

[30] Shou Wang, Leandro L. Minku, and Xin Yao. Online class imbalance learning and its
applications in fault detection. International Journal of Computational Intelligence and
Applications, 12(04):1340001, 2013.

[31] Shuo Wang, Leandro L. Minku, and Xin Yao. Resampling-based ensemble methods for
online class imbalance learning. IEEE Trans. Knowl. Data Eng., 27(5):1356–1368, 2015.

[32] Amira Kamil Ibrahim Hassan and Ajith Abraham. Advances in Nature and Biologically
Inspired Computing: Proceedings of the 7th World Congress on Nature and Biologi-
cally Inspired Computing (NaBIC2015) in Pietermaritzburg, South Africa, held Decem-
ber 01-03, 2015, chapter Modeling Insurance Fraud Detection Using Imbalanced Data
Classification, pages 117–127. Springer International Publishing, Cham, 2016.

[33] Charu C. Aggarwal. Outlier Analysis. Springer, 2013.

[34] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer.
Online passive-aggressive algorithms. J. Mach. Learn. Res., 7:551–585, December 2006.

[35] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for
linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[36] David M. J. Tax and Robert P. W. Duin. Support vector data description. Machine
Learning Journal, 54(1):45–66, 2004.

[37] C.K. Maurya and D. Toshniwal. Anomaly detection in nuclear power plant data using
support vector data description. In IEEE Students’ Technology Symposium (TechSym), at
IIT Kharragpur, pages 82–86, Feb 2014.

[38] Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised out-
lier detection in high-dimensional numerical data. Stat. Anal. Data Min., 5(5):363–387,
October 2012.

[39] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection.
The Journal of Machine Learning Research, 11:2973–3009, 2010.

[40] Dipankar Dasgupta and Fernando Nino. A comparison of negative and positive selection
algorithms in novel pattern detection. In IEEE international conference on Systems, man,
and cybernetics,, volume 1, pages 125–130. IEEE, 2000.

[41] D. Dasgupta and N. S. Majumdar. Anomaly detection in multidimensional data using
negative selection algorithm. In Proceedings of the 2002 Congress on Evolutionary
Computation, CEC '02, Volume 02, pages 1039–1044, Washington, DC, USA, 2002. IEEE
Computer Society.

[42] C. De Stefano, C. Sansone, and M. Vento. To reject or not to reject: that is the question-an
answer in case of neural classifiers. IEEE Transactions on Systems, Man, and Cybernet-
ics, Part C: Applications and Reviews, 30(1):84–94, 2000.

[43] Christos Siaterlis and Basil Maglaris. Towards multisensor data fusion for dos detection.
In Proceedings of the ACM symposium on Applied computing, SAC ’04, pages 439–446,
New York, NY, USA, 2004. ACM.

[44] Kaustav Das and Jeff Schneider. Detecting anomalous records in categorical datasets. In
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery
and data mining, KDD ’07, pages 220–229, New York, NY, USA, 2007. ACM.

[45] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning Jour-
nal, 20(3):273–297, 1995.

[46] Manuel Davy and S. Godsill. Detection of abrupt spectral changes using support vector
machines an application to audio signal segmentation. In IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 1313–1316,
2002.

[47] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Proceedings
of the Eleventh International Conference on Data Engineering, ICDE ’95, pages 3–14,
Washington, DC, USA, 1995. IEEE Computer Society.

[48] Wei Fan, M. Miller, S.J. Stolfo, Wenke Lee, and P.K. Chan. Using artificial anoma-
lies to detect unknown and known network intrusions. In Proceedings of International
Conference on Data Mining( ICDM), IEEE, pages 123–130, 2001.

[49] G. Ratsch, S. Mika, B. Scholkopf, and K. Muller. Constructing boosting algorithms from
svms: an application to one-class classification. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 24(9):1184–1199, 2002.

[50] Mennatallah Amer, Markus Goldstein, and Slim Abdennadher. Enhancing one-class
support vector machines for unsupervised anomaly detection. In Proceedings of the
ACM SIGKDD Workshop on Outlier Detection and Description, ODD ’13, pages 8–15,
New York, NY, USA, 2013. ACM.

[51] Volker Roth. Outlier detection with one-class kernel fisher discriminants. In Advances
in Neural Information Processing Systems 17, pages 1169–1176. MIT Press, 2005.

[52] Jorma Laurikkala, Martti Juhola, Erna Kentala, N Lavrac, S Miksch, and B Kavsek.
Informal identification of outliers in medical data. In Fifth International Workshop on
Intelligent Data Analysis in Medicine and Pharmacology, pages 20–24, 2000.

[53] Helge Erik Solberg and Ari Lahti. Detection of outliers in reference distributions: per-
formance of horn’s algorithm. Clinical chemistry, 51(12):2326–2332, 2005.

[54] Paul S Horn, Lan Feng, Yanmei Li, and Amadeo J Pesce. Effect of outliers and non-
healthy individuals on reference interval estimation. Clinical Chemistry, 47(12):2137–
2145, 2001.

[55] Harvey Motulsky. Intuitive Biostatistics: Choosing a Statistical Test, chapter 17. Oxford
University Press, ISBN 0-19-508607-4, 1995.

[56] Martin Ester, Hans peter Kriegel, Jörg S, and Xiaowei Xu. A density-based algorithm for
discovering clusters in large spatial databases with noise. pages 226–231. AAAI Press,
1996.

[57] S. Guha, R. Rastogi, and Kyuseok Shim. Rock: a robust clustering algorithm for categor-
ical attributes. In Proceedings of 15th International Conference on Data Engineering,
pages 512–521, 1999.

[58] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Rock: A robust clustering algorithm
for categorical attributes. In Proceedings of 15th International Conference on Data En-
gineering, pages 512–521. IEEE, 1999.

[59] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algo-
rithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96,
pages 226–231, 1996.

[60] Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding topics in collections of doc-
uments: A shared nearest neighbor approach. In Clustering and Information Retrieval,
pages 83–103. Springer, 2004.

[61] Witcha Chimphlee, Abdul Hanan Abdullah, Mohd Noor Md Sap, Siriporn Chimphlee,
and Surat Srinoy. Unsupervised clustering methods for identifying rare events in anomaly
detection. In Sixth International Enformatika Conference, pages 26–28, Budapest,
Hungary, 2005.

[62] Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern recognition
letters, 24(9-10):1641–1650, 2003.

[63] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Sal Stolfo. A ge-
ometric framework for unsupervised anomaly detection. In Applications of data mining
in computer security, pages 77–101. Springer, 2002.

[64] Fabrizio Angiulli and Clara Pizzuti. Fast outlier detection in high dimensional spaces.
In European Conference on Principles of Data Mining and Knowledge Discovery, pages
15–27. Springer, 2002.

[65] Ji Zhang and Hai Wang. Detecting outlying subspaces for high-dimensional data: the new
task, algorithms, and performance. Knowledge and information systems, 10(3):333–355,
2006.

[66] Richard J Bolton, David J Hand, et al. Unsupervised profiling methods for fraud detec-
tion. Credit Scoring and Credit Control VII, pages 235–255, 2001.

[67] Markus Breunig, Hans Peter Kriegel, Raymond T. Ng, and Jörg Sander. Lof: Identi-
fying density-based local outliers. In Proceedings of the ACM SIGMOD international
conference on management of data, pages 93–104. ACM, 2000.

[68] KaiMing Ting, Takashi Washio, JonathanR. Wells, FeiTony Liu, and Sunil Aryal. De-
mass: a new density estimator for big data. Journal of Knowledge and Information
Systems, 35(3):493–524, 2013.

[69] Jon Louis Bentley. Multidimensional binary search trees used for associative searching.
Commun. ACM, 18(9):509–517, September 1975.

[70] Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana. Towards
parameter-free data mining. In Proc. 10th ACM SIGKDD International Conf. Knowl-
edge Discovery and Data Mining, pages 206–215. ACM Press, 2004.

[71] Andreas Arning, Rakesh Agrawal, and Prabhakar Raghavan. A linear method for devia-
tion detection in large databases. In KDD, pages 164–169, 1996.

[72] Shin Ando. Clustering needles in a haystack: An information theoretic analysis of mi-
nority and outlier detection. In Seventh IEEE International Conference on Data Mining
(ICDM 2007), pages 13–22. IEEE, 2007.

[73] Zengyou He, Shengchun Deng, and Xiaofei Xu. An optimization model for outlier de-
tection in categorical data. In International Conference on Intelligent Computing, pages
400–409. Springer, 2005.

[74] Wenke Lee and Dong Xiang. Information-theoretic measures for anomaly detection. In
Proceedings of IEEE Symposium on Security and Privacy, S&P 2001, pages 130–143.
IEEE, 2001.

[75] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a
Hilbert space. In Conference in modern analysis and probability (New Haven, Conn.,
1982), volume 26 of Contemporary Mathematics, pages 189–206. American Mathemat-
ical Society, 1984.

[76] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Cheng. A novel anomaly detection
scheme based on principal component classifier. In Proceedings of the 3rd IEEE
International Conference on Data Mining, ICDM '03, pages 353–365. IEEE, 2003.

[77] J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: Compact matrix representation
of large sparse graphs. In Proceedings of the 7th SIAM International Conference on Data
Mining. SIAM, 2007.

[78] Simon Gunter, Nicol N. Schraudolph, and S. V. N. Vishwanathan. Fast iterative kernel
principal component analysis. Journal of Machine Learning Research, 8:1893–1918,
2007.

[79] Rose Yu, Xinran He, and Yan Liu. Glad: Group anomaly detection in social media anal-
ysis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’14, pages 372–381, New York, NY, USA, 2014.
ACM.

[80] Erik Rodner, Esther-Sabrina Wacker, Michael Kemmler, and Joachim Denzler. One-class
classification for anomaly detection in wire ropes with Gaussian processes in a few lines
of code. In Proceedings of the IAPR Conference on Machine Vision Applications (IAPR
MVA 2011), Nara Centennial Hall, Nara, Japan, June 13-15, 2011, pages 219–222,
2011.

[81] Michael Kemmler, Erik Rodner, Esther-Sabrina Wacker, and Joachim Denzler. One-class
classification with Gaussian processes. Pattern Recognition, 46(12):3507–3518, 2013.

[82] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cam-
bridge University Press, New York, NY, USA, 2004.

[83] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning,
conic duality, and the SMO algorithm. In Proceedings of the Twenty-first International
Conference on Machine Learning, ICML '04, page 6, New York, NY, USA, 2004. ACM.

[84] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA,
2001.

[85] Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learn-
ing. In Proceedings of the 26th Annual International Conference on Machine Learning,
ICML ’09, pages 1065–1072, New York, NY, USA, 2009. ACM.

[86] G. Song, X. Jin, G. Chen, and Y. Nie. Multiple kernel learning method for network
anomaly detection. In IEEE Conference on Intelligent Systems and Knowledge Engi-
neering (ISKE), pages 296–299, 2010.

[87] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Lp-norm multiple
kernel learning. Journal of Machine Learning Research (JMLR), 12:953–997, July 2011.

[88] Bo Liu, Yanshan Xiao, Longbing Cao, Zhifeng Hao, and Feiqi Deng. Svdd-based outlier
detection on uncertain data. Knowledge and Information Systems, Springer, 34(3):597–
618, 2013.

[89] N. Goernitz, Marius Kloft, Konrad Rieck, and Ulf Brefeld. Toward supervised anomaly
detection. J. Artif. Intell. Res. (JAIR), 46:235–262, 2013.

[90] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization.
In NIPS, pages 556–562. MIT Press, 2000.

[91] Chih-Jen Lin. On the convergence of multiplicative update algorithms for nonnegative
matrix factorization. Trans. Neur. Netw., 18(6):1589–1596, November 2007.

[92] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural
Computation, 19(10):2756–2779, 2007.

[93] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J.
Plemmons. Algorithms and applications for approximate nonnegative matrix factoriza-
tion. In Computational Statistics and Data Analysis, pages 155–173, 2006.

[94] Dongmin Kim, Suvrit Sra, and Inderjit S. Dhillon. Fast Newton-type methods for the least
squares nonnegative matrix approximation problem. In Proceedings of SIAM Conference
on Data Mining, pages 343–354, 2007.

[95] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate
descent methods for minimizing a composite function. Mathematical Programming,
144(1):1–38, 2012.

[96] Edward G. Allan, Michael R. Horvath, Christopher V. Kopek, Brian T. Lamb, Thomas S.
Whaples, and Michael W. Berry. Anomaly detection using nonnegative matrix factoriza-
tion. In Michael W. Berry and Malu Castellanos, editors, Survey of Text Mining II, pages
203–217. Springer London, 2008.

[97] Fei Wang, Sanjay Chawla, and Didi Surian. Latent outlier detection and the low preci-
sion problem. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and
Description, ODD ’13, pages 46–52, New York, NY, USA, 2013. ACM.

[98] Robert J. Durrant and Ata Kaban. Random projections as regularizers: Learning a linear
discriminant ensemble from fewer observations than dimensions. In Asian Conference
on Machine Learning, ACML 2013, Canberra, ACT, Australia, November 13-15, 2013,
pages 17–32, 2013.

[99] Robert J. Durrant and Ata Kaban. Sharp generalization error bounds for randomly-
projected classifiers. In Proceedings of the 30th International Conference on Machine
Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 693–701, 2013.

[100] Emmanuel Müller, Ira Assent, Patricia Iglesias, Yvonne Mülle, and Klemens Böhm.
Outlier ranking via subspace analysis in multiple views of the data. In ICDM, pages
529–538. IEEE Computer Society, 2012.

[101] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit.
In NIPS, pages 2496–2504. Curran Associates, Inc., 2010.

[102] Mazin Aouf and Laurence A. F. Park. Approximate document outlier detection using
random spectral projection. In Proceedings of the 25th Australasian joint conference
on Advances in Artificial Intelligence, AI’12, pages 579–590, Berlin, Heidelberg, 2012.
Springer-Verlag.

[103] Vipin Kumar and Sonajharia Minz. Multi-view ensemble learning: an optimal feature
set partitioning for high-dimensional data classification. Knowledge and Information
Systems, pages 1–59, 2015.

[104] Aleksandar Lazarevic and Vipin Kumar. Feature bagging for outlier detection. In Pro-
ceedings of the eleventh ACM SIGKDD international conference on Knowledge discov-
ery in data mining, KDD ’05, pages 157–166, New York, NY, USA, 2005. ACM.

[105] Keith Noto, Carly Brodley, and Donna Slonim. Anomaly detection using an ensemble
of feature models. In Proceedings of the 2010 IEEE International Conference on Data
Mining, ICDM ’10, pages 953–958, Washington, DC, USA, 2010. IEEE Computer So-
ciety.

[106] S. D. Roy, S. A. Singh, Subhrabrata Choudhury, and Narayan C. Debnath. Countering
sinkhole and black hole attacks on sensor networks using dynamic trust management.
In Proceedings of the 13th IEEE Symposium on Computers and Communications (ISCC
2008), July 6-9, Marrakech, Morocco, pages 537–542, 2008.

[107] Shaza M. Abd Elrahman and Ajith Abraham. Class imbalance problem using a hybrid
ensemble approach. Int. J. Hybrid Intell. Syst., 12(4):219–227, 2016.

[108] Vasudha Bhatnagar, Manju Bhardwaj, and Ashish Mahabal. Comparing SVM ensembles
for imbalanced datasets. In 10th International Conference on Intelligent Systems Design
and Applications, ISDA 2010, November 29 - December 1, Cairo, Egypt, pages 651–657,
2010.

[109] Manju Bhardwaj, Debasis Dash, and Vasudha Bhatnagar. Accurate classification of bio-
logical data using ensembles. In IEEE International Conference on Data Mining Work-
shop, ICDMW 2015, Atlantic City, NJ, USA, November 14-17, pages 1486–1493, 2015.

[110] F. Rosenblatt. The perceptron: A probabilistic model for information storage and orga-
nization in the brain. Psychological Review, 65(6):386–408, Nov 1958.

[111] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE
Transactions on Signal Processing, 52(8):2165–2176, Aug 2004.

[112] Koby Crammer and Yoram Singer. Ultraconservative online algorithms for multiclass
problems. J. Mach. Learn. Res., 3:951–991, March 2003.

[113] Claudio Gentile. A new approximate maximal margin classification algorithm. J. Mach.
Learn. Res., 2:213–242, March 2002.

[114] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron
algorithm. SIAM J. Comput., 34(3):640–668, March 2005.

[115] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight
vectors. Machine Learning, 91(2):155–187, 2013.

[116] Francesco Orabona and Koby Crammer. New adaptive algorithms for online classifica-
tion. In Advances in Neural Information Processing Systems 23, pages 1840–1848, 2010.

[117] Koby Crammer, Mark Dredze, and Fernando Pereira. Exact convex confidence-weighted
learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances
in Neural Information Processing Systems 21, pages 345–352. Curran Associates, Inc.,
2009.

[118] Jialei Wang, Peilin Zhao, and Steven C. H. Hoi. Exact soft confidence-weighted learning.
CoRR, abs/1206.4612, 2012.

[119] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge
University Press, New York, NY, USA, 2006.

[120] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer.
SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res., 16(1):321–357,
June 2002.

[121] Nitesh V. Chawla, Aleksandar Lazarevic, Lawrence O. Hall, and Kevin W. Bowyer.
SMOTEBoost: Improving prediction of the minority class in boosting. In PKDD, volume
2838 of Lecture Notes in Computer Science, pages 107–119. Springer, 2003.

[122] Shuo Wang, Huanhuan Chen, and Xin Yao. Negative correlation learning for classi-
fication ensembles. In The 2010 International Joint Conference on Neural Networks
(IJCNN), pages 1–8, July 2010.

[123] Charles Elkan. The foundations of cost-sensitive learning. In Proceedings of the 17th
International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’01, pages
973–978, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[124] Wei Fan, Salvatore J. Stolfo, Junxin Zhang, and Philip K. Chan. AdaCost: Misclassifica-
tion cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on
Machine Learning, ICML '99, pages 97–105, San Francisco, CA, USA, 1999. Morgan
Kaufmann Publishers Inc.

[125] C.X. Ling, V.S. Sheng, and Qiang Yang. Test strategies for cost-sensitive decision trees.
IEEE Transactions on Knowledge and Data Engineering, 18(8):1055–1067, Aug 2006.

[126] Xu-Ying Liu and Zhi-Hua Zhou. The influence of class imbalance on cost-sensitive
learning: An empirical study. In Sixth International Conference on Data Mining, ICDM
'06, pages 970–974, Dec 2006.

[127] Hung-Yi Lo, Ju-Chiang Wang, Hsin-Min Wang, and Shou-De Lin. Cost-sensitive multi-
label learning for audio tag annotation and retrieval. IEEE Transactions on Multimedia,
13(3):518–529, June 2011.

[128] Xianghan Zheng, Zhipeng Zeng, Zheyi Chen, Yuanlong Yu, and Chunming Rong. De-
tecting spammers on social networks. Neurocomputing, 159:27–34, 2015.

[129] Xingyu Gao, Zhenyu Chen, Sheng Tang, Yongdong Zhang, and Jintao Li. Adaptive
weighted imbalance learning with application to abnormal activity recognition. Neuro-
computing, 173, Part 3:1927–1935, 2016.

[130] Chris Drummond and Robert C. Holte. Severe class imbalance: Why better algorithms
aren't the answer. In Machine Learning: ECML 2005, 16th European Conference on
Machine Learning, Porto, Portugal, October 3-7, 2005, Proceedings, pages 539–546.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2005.

[131] Charles X. Ling, Qiang Yang, Jianning Wang, and Shichao Zhang. Decision trees with
minimal costs. In Proceedings of the Twenty-first International Conference on Machine
Learning, ICML '04, page 69, New York, NY, USA, 2004. ACM.

[132] Xiaoyong Chai, Lin Deng, Qiang Yang, and C.X. Ling. Test-cost sensitive naive Bayes
classification. In Fourth IEEE International Conference on Data Mining, ICDM '04,
pages 51–58, Nov 2004.

[133] Gary Weiss, Kate McCarthy, and Bibi Zabar. Cost-sensitive learning vs. sampling:
Which is best for handling unbalanced classes with unequal error costs? In DMIN,
pages 35–41. CSREA Press, 2007.

[134] S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. On-
line outlier detection in sensor data using non-parametric models. In Proceedings of the
32nd International Conference on Very Large Data Bases, VLDB '06, pages 187–198.
VLDB Endowment, 2006.

[135] Manuel Davy, Frédéric Desobry, Arthur Gretton, and Christian Doncarli. An online
support vector machine for abnormal events detection. Signal Process., 86(8):2009–
2025, August 2006.

[136] Matthew Eric Otey, Amol Ghoting, and Srinivasan Parthasarathy. Fast distributed outlier
detection in mixed-attribute data sets. Data Min. Knowl. Discov., 12(2-3):203–228, May
2006.

[137] Swee Chuan Tan, Kai Ming Ting, and Fei Tony Liu. Fast anomaly detection for streaming
data. In Proceedings of the Twenty-Second International Joint Conference on Artificial
Intelligence - Volume Two, IJCAI'11, pages 1511–1516. AAAI Press, 2011.

[138] Kai Ming Ting, Guang-Tong Zhou, Fei Tony Liu, and James Swee Chuan Tan. Mass
estimation and its applications. In Proceedings of the 16th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’10, pages 989–998, New
York, NY, USA, 2010. ACM.

[139] Alexander Lavin and Subutai Ahmad. Evaluating real-time anomaly detection algorithms
- the numenta anomaly benchmark. CoRR, abs/1510.03336, 2015.

[140] Xin Jin, Yin Guo, Soumik Sarkar, Asok Ray, and Robert M. Edwards. Anomaly detection
in nuclear power plants via symbolic dynamic filtering. IEEE Transactions on Nuclear
Science, 58(1):277–288, Feb 2011.

[141] Anders Riber Marklund and Jan Dufek. Development and comparison of spectral meth-
ods for passive acoustic anomaly detection in nuclear power plants. Applied Acoustics,
83:100–107, 2014.

[142] Seong Soo Choi, Ki Sig Kang, Han Gon Kim, and Soon Heung Chang. Development of
an on-line fuzzy expert system for integrated alarm processing in nuclear power plants.
IEEE Transactions on Nuclear Science, 42(4):1406–1418, Aug 1995.

[143] K. Nabeshima, T. Suzudo, T. Ohno, and K. Kudo. Nuclear reactor monitoring with the
combination of neural network and expert system. Mathematics and Computers in Sim-
ulation, 60(3–5):233–244, 2002. Intelligent Forecasting, Fault Diagnosis, Scheduling,
and Control.

[144] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[145] J. Alcalá-Fdez, L. Sánchez, S. García, M. J. del Jesus, S. Ventura, J. M. Garrell, J. Otero,
C. Romero, J. Bacardit, V. M. Rivas, J. C. Fernández, and F. Herrera. KEEL: A software
tool to assess evolutionary algorithms for data mining problems. Soft Comput.,
13(3):307–318, October 2008.

[146] M. Lichman. UCI machine learning repository, 2013.

[147] John C. Platt. Fast training of support vector machines using sequential minimal opti-
mization. In Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA,
USA, 1999.

[148] Ronan Collobert, Samy Bengio, and Yoshua Bengio. A parallel mixture of SVMs for very
large scale problems. Neural Comput., 14(5):1105–1114, May 2002.

[149] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Identifying
suspicious urls: An application of large-scale online learning. In Proceedings of the 26th
Annual International Conference on Machine Learning, ICML ’09, pages 681–688, New
York, NY, USA, 2009. ACM.

[150] Andrew McCallum. Data, 1995.

[151] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. NIPS 2003 feature selection
challenge, 2003.

[152] Isabelle Guyon. The MNIST database, 2013.

[153] Soeren Sonnenburg. PASCAL large scale learning challenge, 2008.

[154] Thorsten Joachims. A support vector method for multivariate performance measures.
In Proceedings of the 22nd International Conference on Machine Learning, ICML '05,
pages 377–384, New York, NY, USA, 2005. ACM.

[155] Steven C.H. Hoi, Jialei Wang, and Peilin Zhao. LIBOL: A library for online learning
algorithms. The Journal of Machine Learning Research, 15:495–499, 2014.

[156] Francesco Orabona. DOGMA: a MATLAB toolbox for Online Learning, 2009. Software
available at http://dogma.sourceforge.net.

[157] A. Buschermoehle, J. Huelsmann, and W. Brockmann. Uoslib – a library for analysis
of online-learning algorithms. In Proc. 23. Workshop Computational Intelligence, pages
355–369. KIT Scientific Publishing, 2013.

[158] Shuo Wang, L.L. Minku, and Xin Yao. Resampling-based ensemble methods for on-
line class imbalance learning. IEEE Transactions on Knowledge and Data Engineering,
27(5):1356–1368, May 2015.

[159] Miroslav Kubat and Stan Matwin. Addressing the curse of imbalanced training sets:
One-sided selection. In Proceedings of the Fourteenth International Conference on
Machine Learning, pages 179–186. Morgan Kaufmann, 1997.

[160] Yurii Nesterov. A method of solving a convex programming problem with convergence
rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.

[161] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied
Optimization. Kluwer Academic Publishers, Boston, Dordrecht, London, 2004.

[162] Neal Parikh and Stephen Boyd. Proximal algorithms. Found. Trends Optim., 1(3):127–
239, January 2014.

[163] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for
linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[164] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society (Series B), 58:267–288, 1996.

[165] Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical
Programming, 140(1):125–161, 2013.

[166] Yu. Nesterov. Gradient methods for minimizing composite objective function. CORE
Discussion Papers 2007076, Université catholique de Louvain, Center for Operations
Research and Econometrics (CORE), 2007.

[167] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In
Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors,
Advances in Neural Information Processing Systems 27, pages 1574–1582. Curran As-
sociates, Inc., 2014.

[168] Chonghai Hu, Weike Pan, and James T. Kwok. Accelerated gradient methods for stochas-
tic optimization and online learning. In Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I.
Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems
22, pages 781–789. Curran Associates, Inc., 2009.

[169] Dayong Wang, Pengcheng Wu, Peilin Zhao, Yue Wu, Chunyan Miao, and S.C.H. Hoi.
High-dimensional data stream classification via sparse online learning. In Data Mining
(ICDM), 2014 IEEE International Conference on, pages 1007–1012, Dec 2014.

[170] Simon Lucey. Soft-thresholding, 2012.

[171] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed
optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends® in Machine Learning, 3(1):1–122, 2010.

[172] Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for
large-scale l1-regularized logistic regression. J. Mach. Learn. Res., 8:1519–1555, De-
cember 2007.

[173] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Tech-
nical report, Knoxville, TN, USA, 1994.

[174] Arpit Bhardwaj, Aruna Tiwari, Dharmil Chandarana, and Darshil Babel. A genetically
optimized neural network for classification of breast cancer disease. In 7th International
Conference on Biomedical Engineering and Informatics, BMEI 2014, Dalian, China,
October 14-16, pages 693–698, 2014.

[175] Arpit Bhardwaj and Aruna Tiwari. Breast cancer diagnosis using genetically optimized
neural network model. Expert Systems with Applications, 42(10):4611–4620, 2015.

[176] Balaji Krishnapuram. KDD Cup 2008, 2008.

[177] Nico Görnitz, Marius Kloft, Konrad Rieck, and Ulf Brefeld. Toward supervised anomaly
detection. Journal of Artificial Intelligence Research, 46:235–262, 2013.

[178] M. Moya, M. Koch, and L. Hostetler. One-class classifier networks for target recognition
applications. International Neural Network Society, pages 797–801, 1993.
