
SPECIAL SECTION ON CYBER-THREATS AND COUNTERMEASURES IN THE HEALTHCARE SECTOR

Received June 3, 2018, accepted July 16, 2018, date of publication August 17, 2018, date of current version September 21, 2018.
Digital Object Identifier 10.1109/ACCESS.2018.2863036

Enhanced Network Anomaly Detection Based on Deep Neural Networks

SHERAZ NASEER1,2, YASIR SALEEM1, SHEHZAD KHALID3, MUHAMMAD KHAWAR BASHIR1,4, JIHUN HAN5, MUHAMMAD MUNWAR IQBAL6, AND KIJUN HAN5
1 Department of Computer Science & Engineering, University of Engineering and Technology, Lahore 54890, Pakistan
2 Department of Informatics and Systems, University of Management and Technology, Lahore 10033, Pakistan
3 Department of Computer Engineering, Bahria University, Islamabad 44000, Pakistan
4 Department of Statistics and Computer Science, University of Veterinary and Animal Sciences, Lahore 54000, Pakistan
5 School of Computer Science and Engineering, Kyungpook National University, Daegu 37224, South Korea
6 Department of Computer Science, University of Engineering and Technology, Taxila 47080, Pakistan

Corresponding author: Kijun Han ([email protected])

ABSTRACT Due to the monumental growth of Internet applications in the last decade, the need for security of information networks has increased manifold. As a primary defense of network infrastructure, an intrusion detection system is expected to adapt to the dynamically changing threat landscape. Many supervised and unsupervised techniques have been devised by researchers from the disciplines of machine learning and data mining to achieve reliable detection of anomalies. Deep learning is an area of machine learning which applies neuron-like structures for learning tasks. It has profoundly changed the way we approach learning tasks by delivering monumental progress in disciplines like speech processing, computer vision, and natural language processing, to name a few. It is therefore only natural that this new technology be investigated for information security applications. The aim of this paper is to investigate the suitability of deep learning approaches for anomaly-based intrusion detection systems. For this research, we developed anomaly detection models based on different deep neural network structures, including convolutional neural networks, autoencoders, and recurrent neural networks. These deep models were trained on the NSLKDD training dataset and evaluated on both test datasets provided by NSLKDD, namely NSLKDDTest+ and NSLKDDTest21. All experiments in this paper were performed by the authors on a GPU-based test bed. Conventional machine learning based intrusion detection models were implemented using well-known classification techniques, including extreme learning machine, nearest neighbor, decision tree, random forest, support vector machine, naive Bayes, and quadratic discriminant analysis. Both the deep and conventional machine learning models were evaluated using well-known classification metrics, including receiver operating characteristics, area under curve, precision-recall curve, mean average precision and accuracy of classification. The experimental results of the deep IDS models are promising for real-world application in anomaly detection systems.

INDEX TERMS Deep learning, convolutional neural networks, autoencoders, LSTM, k-NN, decision tree, intrusion detection, convnets, information security.

I. INTRODUCTION

Network intrusion detection refers to the problem of monitoring network flows and activities and differentiating those which deviate from the normal expected behavior of the network and can adversely impact the security of information systems. The search for reliable solutions by governments and organizations to protect their information assets from unauthorized disclosure and illegal access has brought intrusion detection and prevention to the forefront of the information security landscape.

Denning [1] proposed the idea of developing intrusion detection systems by employing Artificial Intelligence techniques on security events to identify abnormal usage patterns and intrusions. This idea pioneered a new breed of intrusion detection systems based on learning algorithms rather than always-updating signatures of intrusions. Over the last three decades, machine learning techniques have been applied as the conventional approach for developing network anomaly detection models. These approaches employ supervised, unsupervised and semi-supervised learning algorithms to propose solutions for the anomaly detection problem.

Anomaly detection is modeled as a classification problem in supervised learning. Supervised learning uses labeled data to train anomaly detection models; the goal of this type of training is to classify the test data as anomalous or normal on the basis of feature vectors. Unsupervised learning, on the other hand, uses unlabeled or untagged data to perform the learning task. One popular unsupervised learning technique is clustering [2], which searches for similarities among instances of the dataset to build clusters; instances sharing related characteristics are assumed to be alike and are placed in the same cluster. Semi-supervised learning (SSL) is a combination of supervised and unsupervised learning. The SSL approach utilizes both labeled and unlabeled data [3] for learning: SSL methods learn feature-label associations from labeled data and, on the basis of those learned associations, assign labels to unlabeled instances whose features are similar to those of a labeled instance.

Deep learning is an area of machine learning which applies neuron-like mathematical structures [4] for learning tasks. Neural networks have been around for many decades [5], gaining and losing the favor of the research community. The latest rise of this technology is attributed to AlexNet [6], a deep neural network which won the ImageNet classification challenge. AlexNet achieved top-1 and top-5 error rates of 37.5% and 17.0% on the ImageNet dataset [7], considerably better than the previous state of the art. Since then, deep neural networks (DNNs) have attracted the attention of the research community once again, and multiple DNN structures have been proposed, including convolutional neural networks (CNNs) [8], recurrent neural networks (LSTM) [9], deep belief nets (DBNs) and different types of autoencoders, among them sparse, denoising [10], convolutional [11], contractive [12] and variational autoencoders. These DNN structures have been successfully applied to devise state-of-the-art solutions in multiple disciplines.

Application of deep neural networks to information security problems is a relatively new area of research. We observed three research gaps during our literature review of the anomaly detection problem. The first gap was the lack of investigation of well-known deep learning approaches for anomaly detection: although isolated studies were available, as described in Section II, no comprehensive research work was available to fill this gap. The second research gap was the use of training datasets for both training and testing of models through cross-validation mechanisms. Most recent works followed this approach and reported very high detection rates; e.g., Kim et al. [13] used a four-layer DNN with 100 units for intrusion detection on the KDD99 dataset and reported 99% accuracy. We believe that this approach does not provide a reliable solution to the anomaly detection problem because, given sufficient training, models can be over-fitted to achieve such high rates. The third gap, a natural consequence of the previous two, was the lack of comparison and evaluation of deep learning models amongst themselves and against conventional machine learning based models using standardized classification quality metrics, which include the RoC curve, area under RoC, precision-recall curve, mean average precision (mAP) and accuracy of classification.

The primary contribution of this work is to fill the above-mentioned research gaps by designing and implementing anomaly detection models based on state-of-the-art deep neural networks and evaluating them using standardized classification quality metrics. The first gap is filled by developing anomaly detection models using a deep CNN, LSTM and multiple types of autoencoders. To the best of our knowledge, the DNN structures investigated in this study (DCNN, contractive and convolutional autoencoders) have not previously been analyzed for anomaly detection. In addition, comparisons of the deep learning based anomaly detection models are provided with well-known classification schemes including SVM, k-NN, Decision-Tree, Random-Forest, QDA and Extreme Learning Machine. To fill the second research gap, we opted to train all models on the training dataset without ever exposing the test dataset to the model during training, and then tested and evaluated the models on the testing datasets. This approach provides a fair estimate of model capabilities by using unseen data instances at evaluation time. To bridge the third research gap, the deep learning based anomaly detection models were evaluated amongst themselves and against conventional machine learning models on unseen test data using the standard classification quality metrics listed above.

All experiments in this study were performed by the authors on the NSLKDD dataset provided by [14] using a GPU-powered test-bed. NSLKDD is derived from KDDCUP99 [15], which was generated in 1999 from the DARPA98 network traffic. Tavallaee et al. [14] discovered some inherent flaws in the original KDDCUP99 dataset which had adverse impacts on the performance of IDS models trained and evaluated on it, and proposed a statistically enhanced version of the dataset, called NSLKDD, to counter the discovered problems. Advantages of NSLKDD over KDDCUP99 include the removal of redundant records from the training dataset, which reduces complexity and bias towards frequent records, and the introduction of non-duplicate records in the testing datasets for unbiased evaluation.

The NSLKDD dataset is available in four partitions. Two partitions, namely NSLKDDTrain20p and NSLKDDTrain+, serve as training datasets for model learning and provide 25,192 and 125,973 training records respectively. The remaining two partitions, called NSLKDDTest+ and NSLKDDTest21, are available for performance evaluation of trained models on unseen data and provide 22,543 and 11,850 data instances respectively. Additionally, NSLKDDTest21 contains records for attack types not available in the other NSLKDD train and test datasets. These attack types include processtable, mscan, snmpguess, snmpgetattack, saint, apache2, httptunnel, back and mailbomb. All models in our study were trained on the NSLKDD training datasets (NSLKDDTrain20p and NSLKDDTrain+) and tested on the NSLKDD test datasets (NSLKDDTest+ and NSLKDDTest21). This approach was also adopted by [16]–[18].


Like its predecessor, the NSLKDD dataset consists of 41 input features as well as class labels. Features 1 to 9 represent the basic features, which were created from TCP/IP connections without payload inspection. Features 10 to 22 comprise content features, generated from the payload of TCP segments of packets. Features 23 to 31 were extracted from time-based traffic properties, while features 32 to 41 contain application-based traffic features designed to measure attacks within intervals longer than 2 seconds. A class label was provided with each record, identifying the network traffic instance as either normal or an attack. The original KDDCUP99 dataset listed the different types of attacks shown in Table 1.

TABLE 1. Attack types in KDDCUP99 dataset.

The rest of the article is organized as follows. Section II highlights prominent works related to the IDS problem. Section III provides the architectural designs of the DNNs used in this study. Section IV describes how these DNNs were used to develop anomaly detection models. Section V sheds light on implementation details, including the hardware setup and software toolchain. Section VI presents the results of the proposed DNN-based models along with comparisons and timing information, and Section VII concludes the paper.

II. RELATED WORKS
Different machine learning techniques, including supervised, unsupervised and semi-supervised ones, have been proposed to enhance the performance of anomaly detection. Supervised approaches such as k-nearest neighbor (k-NN) [19], neural networks and support vector machines (SVM) [20] have been studied extensively for anomaly detection. Laskov et al. [21] provided a comparative analysis of supervised and unsupervised learning techniques with respect to their detection accuracy and ability to detect unknown attacks. Ghorbani et al. [22] provided a comprehensive review of supervised and unsupervised learning approaches for anomaly detection. Solanas and Martinez-Balleste [23] presented clustering algorithms for anomaly detection. A comprehensive repertoire of anomaly-based intrusion detection systems is presented by Bhattacharyya and Kalita [24], and Tavallaee [16] compared the performance of the NSLKDD dataset on different classification algorithms including Naive Bayes, Support Vector Machines, and Decision-Trees.

Application of deep neural networks to information security problems is a relatively new area of research. DNN structures like autoencoders (AE), deep belief networks (DBNs) and LSTM have been used for anomaly detection. Gao et al. [25] proposed an IDS architecture based on DBNs using energy-based restricted Boltzmann machines (RBMs) on the KDDCup99 dataset. Wang [26] proposed a deep network of stacked autoencoders (SAE) for network traffic identification. A semi-supervised learning based approach with a random-weights-based neural network (NNRw) was used by Ashfaq et al. [17] to implement an IDS architecture using NSLKDD.

Aygun and Yavuz [27] employed vanilla and denoising deep autoencoders on NSLKDD and claimed accuracies of 88.28% and 88.6% on the NSLKDDTest+ dataset. They did not provide results for the NSLKDDTest21 dataset, nor did they provide any other quality metrics for their classifiers, such as AuROC, precision, recall, or mAP. Yousefi-Azar et al. [18] used autoencoders as an unsupervised latent feature generation mechanism and provided a comparison of classifiers using conventional NSLKDD features and possible representations of NSLKDD; they reported an accuracy of 83.34% using latent representations of NSLKDD. Alom et al. [28] trained a deep belief network (DBN) using stacked restricted Boltzmann machines (RBMs) and reported an accuracy of 97.5% on the NSLKDD training dataset (training accuracy); they did not provide results for either NSLKDDTest+ or NSLKDDTest21. Hodo et al. [29] provided a taxonomy and survey of deep and conventional structures for intrusion detection. Javaid et al. [30] developed a ''self-taught learning'' classification mechanism by combining the encoder layers of a sparse autoencoder (for hidden feature extraction) with softmax regression (for probability estimates of classes) to perform classification on NSLKDD. They reported 98.3% accuracy on training data (training accuracy) for two-class classification, 98.2% training accuracy for 5-class classification and 98.8% training accuracy for 23-class classification, but provided results for neither NSLKDDTest+ nor NSLKDDTest21. Bontemps et al. [31] proposed an LSTM based anomaly detection system and reported variations in correct and false alarms of prediction errors as a β parameter of the proposed system was changed, but did not report results in well-known metrics. This shows the need for an experimental study which develops anomaly detection


models using well-known deep learning approaches and evaluates them on previously unseen test datasets using standardized classification quality metrics. In this study, we aim to close the above-mentioned research gaps and evaluate deep learning based models on unseen data using standard classification quality metrics including the RoC curve, area under RoC, precision-recall curve, mean average precision (mAP) and accuracy of classification.

III. PRELIMINARIES
In this section, a brief overview of the DNN structures implemented in this article is provided. The DNNs discussed in this section include:
1) Different autoencoders (sparse, denoising, contractive, convolutional)
2) LSTM
3) Convolutional neural networks (CNN)

A. AUTOENCODERS
An autoencoder (AE) is a neural network that is trained to regenerate its input vector as its output. It has a hidden layer 'h' that learns the latent representation (code) of the input vector in a different feature space with smaller dimensions [4]. Both the input and output layers contain N nodes, and the hidden layer contains K nodes. If K < N, the AE is called undercomplete. The hidden layer of an AE is known by different names in the literature, including the bottleneck, discriminative, coding or abstraction layer. We will stick to the name ''bottleneck'' and use the other names interchangeably if required. The learning task in an undercomplete AE forces it to capture the most significant features of the training data at the bottleneck layer so that the input can be regenerated at the output layer. This is achieved by minimizing a loss function L(x, g(f(x))) penalizing the dissimilarity of g(f(x)) from the training data x. In practice, AEs are created from multiple layers where the outputs of the preceding layer are connected to the inputs of the following layer.

Let W^l and b^l denote the parameters of each layer, where W and b are the weights and bias associated with the connection between layer l and layer l + 1. The encoding step of a stacked AE is defined by running the encoding step of each layer in forward order, a^l = f(z^l), where a denotes the activation outputs, f(.) denotes the activation function and z denotes the total weighted sum of inputs in layer l, as shown below:

    z^(l+1) = W^l · a^l + b^l

The decoding step of the AE is performed by running the decoding stack in reverse order, a^(n+l) = f(z^(n+l)), as shown below:

    z^(n+l+1) = W^(n−l) · a^(n+l) + b^(n−l)

where a^n denotes the activations of the bottleneck layer.

Deep AEs offer many advantages. One major benefit of a deep AE stems from the universal approximator theorem [32], which states that a feed-forward neural network with at least one hidden layer can represent an approximation of any function to an arbitrary degree of accuracy; this means a deep AE with more than one encoder layer can approximate any mapping from input to bottleneck arbitrarily well. Adding depth has the effect of exponentially reducing the computational cost and the amount of training data needed for representing many functions [4]. Different AE structures are described in the literature, but we will discuss the AEs relevant to our study.

1) SPARSE AUTOENCODERS
An autoencoder whose training criterion involves both a reconstruction error and a sparsity penalty Ω(h) on the bottleneck layer is known as a sparse autoencoder. The sparse AE criterion can be represented as:

    L(x, g(f(x))) + Ω(h)

Sparsity helps to thwart overfitting of an AE, which would otherwise act as an identity function for the training data. The penalty term is usually a regularizer added to the bottleneck layer.
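The sparsity penalty maps directly onto an L1 activity regularizer in Keras (the toolchain described in Section V). The following is a minimal sketch, not the paper's exact model: the layer sizes other than the 41-dimensional input and the regularization coefficient are illustrative assumptions.

    # Minimal Keras sketch of a sparse autoencoder: the L1 activity
    # regularizer on the bottleneck plays the role of Omega(h).
    # Hidden sizes and the 1e-5 coefficient are illustrative assumptions.
    from keras.layers import Input, Dense
    from keras.models import Model
    from keras.regularizers import l1

    inputs = Input(shape=(41,))                 # one NSLKDD feature vector
    hidden = Dense(32, activation='relu')(inputs)
    code = Dense(16, activation='relu',
                 activity_regularizer=l1(1e-5))(hidden)   # sparsity penalty
    hidden_out = Dense(32, activation='relu')(code)
    outputs = Dense(41, activation='linear')(hidden_out)  # reconstruction

    sparse_ae = Model(inputs, outputs)
    sparse_ae.compile(optimizer='adam', loss='mse')  # L(x, g(f(x))) + Omega(h)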
2) DENOISING AUTOENCODERS
Proposed by Vincent et al. [10], the denoising autoencoder (DAE) receives a corrupted data instance as input and is trained to predict the original, uncorrupted version of the instance as its output. Instead of the usual AE loss function, a DAE minimizes the objective function L(x, g(f(x̄))), where x̄ represents a version of x corrupted by some noise. The DAE must undo this corruption rather than simply replicating its input at the output, and in doing so it captures only the most significant features of the training data. A corrupting function C(x̄|x), which denotes a conditional distribution over corrupted data instances x̄ given original data instances x, is used to generate the inputs, and the DAE learns a reconstruction distribution P_re(x|x̄) estimated from the training pairs (x, x̄). The DAE can be conceptualized as performing stochastic gradient descent on the following expectation:

    −E_(x∼p̂_data) E_(x̄∼C(x̄|x)) log p_decoder(x | h = f(x̄))

An additional benefit of adding noise is a reduction in the overfitting of the models generated by the DAE.

3) CONTRACTIVE AUTOENCODERS
Contractive autoencoders (ContAE) were proposed by Rifai et al. [12]. ContAEs learn representations which are robust towards small changes in the training data. This robustness is achieved by imposing a penalty term based on the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input samples. According to [12], the penalty term compresses the localized space, and this contraction in the localized feature space yields robust features at the activation layer. The penalty term also aids in learning representations corresponding to a lower-dimensional non-linear feature space which are more aligned to the local directions of variation dictated by the data, while remaining invariant to the majority of directions orthogonal to the feature space. The loss function of the ContAE is given as follows:

    T_CAE(θ) = Σ_(x∈D_n) ( L(x, g(f(x))) + λ‖J_f(x)‖²_F )


in which ‖J_f(x)‖²_F is the Frobenius norm of the Jacobian matrix, i.e., the sum of squares over all elements of the matrix. The Frobenius norm is regarded as the generalization of the Euclidean norm and is represented by the following equation:

    ‖J_h(X)‖²_F = Σ_ij (∂h_j(X)/∂X_i)²
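For an encoder with a sigmoid bottleneck h = σ(Wx + b), the Jacobian is diag(h(1 − h)) · W, so the Frobenius penalty factorizes into terms computable from h and W. A minimal Keras sketch of this penalty follows; the layer sizes, λ value and optimizer are illustrative assumptions, and reading the bottleneck's symbolic output inside the loss function is a pattern that depends on the Keras version.

    # Sketch of the contractive penalty for an AE whose bottleneck uses
    # a sigmoid activation. Sizes and lam are illustrative assumptions.
    import keras.backend as K
    from keras.layers import Input, Dense
    from keras.models import Model

    inputs = Input(shape=(41,))
    code = Dense(16, activation='sigmoid', name='bottleneck')(inputs)
    outputs = Dense(41, activation='linear')(code)
    cae = Model(inputs, outputs)

    lam = 1e-4  # weight of the contractive term (lambda in the text)

    def contractive_loss(y_true, y_pred):
        mse = K.mean(K.square(y_true - y_pred), axis=-1)   # L(x, g(f(x)))
        W = cae.get_layer('bottleneck').kernel             # shape (41, 16)
        h = cae.get_layer('bottleneck').output             # shape (batch, 16)
        dh = h * (1 - h)                                   # sigmoid derivative
        # ||J_f(x)||_F^2 = sum_j (h_j(1-h_j))^2 * sum_i W_ij^2
        frob = K.sum(K.square(dh) * K.sum(K.square(W), axis=0), axis=-1)
        return mse + lam * frob

    cae.compile(optimizer='adam', loss=contractive_loss)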
4) CONVOLUTIONAL AUTOENCODERS
Convolutional autoencoders (ConvAE) were proposed by Masci et al. [11]. ConvAEs learn non-trivial features using simple stochastic gradient descent (SGD) and discover good initializations for convolutional neural networks (CNNs) that avoid the numerous distinct local minima of the highly non-convex objective functions arising in different deep learning problems. Fully connected AEs ignore the 2D structure of the input, which introduces redundancy in the parameters by forcing each feature to be global. A different approach is to discover localized features that repeat themselves all over the input. ConvAEs differ from other AEs in that their weights are shared among all locations of the input, preserving spatial locality. The latent representations generated by ConvAEs are more sensitive to transitive relations between features and capture semantics ignored by other AE structures. For an input x shaped as a single channel, the latent representation of the kth feature map is given by

    h^k = σ(x ∗ W^k + b^k)

where σ is the activation function and ∗ denotes 2D convolution, whereas x, W and b denote the input, weights and bias respectively. The reconstruction of a ConvAE is obtained using the following equation:

    y = σ( Σ_(k∈H) h^k ∗ W̄^k + c )

where H represents the collection of latent feature maps and W̄ represents the flip operation over both dimensions of the weights. Masci et al. [11] proposed plain SGD as the optimizer and mean squared error as the objective function for their ConvAE structure.

B. CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks (CNNs) [8] are neural network architectures specially crafted to handle high-dimensional data with some spatial semantics. Examples of such data include images, video, sound signals in speech, character sequences in text, or any other multi-dimensional data. In all of the above-mentioned cases, using fully connected networks becomes cumbersome due to the large feature space. CNNs are preferred in such cases because of their awareness of the spatial layout of the input, their specific local connectivity, and their parameter sharing schemes [33].

For a CNN, the input x consists of a multi-dimensional array called a tensor. The core computational blocks of CNNs are the convolution (Conv) and pooling layers. A Conv layer takes a tensor as input and convolves it with a set of filters (kernels) to produce an output tensor. For a single filter k of dimensions d_k, h_k and w_k, convolution is performed by sliding filter k over all spatial positions of the input tensor and calculating the dot product between the input chunk and filter k to produce a feature map. Since a Conv layer contains a set of filters, it produces a feature map for every filter, and these feature maps are stacked together to produce the output tensor. Formally, assuming the input is shaped as greyscale images, the operation of a convolution layer is specified by the following steps [33]:
1) It accepts a tensor of size D1 × H1 × W1 and requires four hyper-parameters:
• the number of filters K,
• the spatial dimensions of the filters F,
• the stride S with which they are applied, and
• the size of zero-padding P, if padding is enabled.
2) The Conv layer produces an output tensor of size D2 × H2 × W2, where D2 = K, W2 = (W1 − F + 2P)/S + 1 and H2 = (H1 − F + 2P)/S + 1.
3) The number of parameters in each filter is F × F × D1, for a total of (F × F × D1) × K weights and K biases.

The purpose of a pooling layer is to control overfitting by decreasing the size of the representation with a fixed down-sampling technique (max-pooling or mean-pooling) without any weights. Pooling layers operate on each feature map separately to reduce its size. A typical setting is to use max-pooling with 2×2 filters at a stride of 2, downsampling the representation by exactly half in both height and width.
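As a worked example of the output-size formula in step 2, consider a 32×32 greyscale input of the kind used later in this paper; the filter settings below are illustrative choices rather than the paper's exact hyper-parameters.

    # Worked example of W2 = (W1 - F + 2P)/S + 1 for a 32x32, single-channel
    # input; F, S, P and K are illustrative assumptions.
    def conv_output_size(w1, f, s, p):
        return (w1 - f + 2 * p) // s + 1

    W1, F, S, P, K = 32, 3, 1, 1, 16
    W2 = conv_output_size(W1, F, S, P)   # (32 - 3 + 2)/1 + 1 = 32
    params = (F * F * 1) * K + K         # weights plus biases for D1 = 1
    print(W2, params)                    # -> 32 160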
C. LSTM
Long short-term memory (LSTM) is a special recurrent neural network (RNN) architecture. An RNN is a connectivity pattern that performs computations on a sequence of vectors x_1, ..., x_n using a recurrence formula of the form h_t = f_θ(h_(t−1), x_t), where f, an activation function, and θ, a parameter, are used at every time-step to process sequences of arbitrary length. The hidden vector h_t is called the state of the RNN; it provides a running summary of all input vectors up to time t, and this summary is updated based on the current input vector x_t. Vanilla RNNs use linear combinations of x_t and h_(t−1), which are a weak form of coupling between inputs and hidden states [34]. The formal form of the LSTM is shown below:

    [i; f; o; g] = [sigm; sigm; sigm; tanh] ( W [x_t; h_(t−1)] )

In the equation above, the sigmoid and tanh functions are applied element-wise, each to its corresponding H-sized block of the product. The tensor W has dimensions [4H × (D + H)]. The vectors i, f, o ∈ R^H intuitively resemble binary gates controlling the operations of the memory cell. The vector g ∈ R^H is used to additively modify the memory contents, which allows gradients to be distributed equally during backpropagation.
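The gate equation can be made concrete with a small NumPy sketch of a single LSTM time-step. The dimensions follow the D = 41, H = 32 setting used later in this paper; the bias term is omitted and the random initialization is purely illustrative.

    # NumPy sketch of one LSTM time-step implementing the gate equation.
    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    D, H = 41, 32
    rng = np.random.RandomState(0)
    W = rng.randn(4 * H, D + H) * 0.01          # the [4H x (D + H)] tensor W

    def lstm_step(x_t, h_prev, c_prev):
        z = W.dot(np.concatenate([x_t, h_prev]))     # W [x_t; h_{t-1}]
        i, f, o = sigm(z[:H]), sigm(z[H:2*H]), sigm(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        c_t = f * c_prev + i * g                     # additive memory update
        h_t = o * np.tanh(c_t)
        return h_t, c_t

    h, c = lstm_step(np.zeros(D), np.zeros(H), np.zeros(H))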
IV. METHODOLOGY
This section describes the architectures of the deep neural networks used in this study and the methodology of their usage for developing anomaly detection models.


A. AUTOENCODERS

FIGURE 1. Architecture of implemented autoencoders for anomaly detection.

In our experiments, all AEs except the ConvAE share the same architecture, shown in Fig. 1. The model depicted under the solid border line in Fig. 1 is the AE model. The AEs are trained using the NSLKDDTrain+ dataset. After training of the AE, the bottleneck layer reduces the dimensionality from the 41-dimensional to a 16-dimensional feature space while keeping the feature information required for reconstruction. The bottleneck representations of the input are fed to an MLP for anomaly detection. The MLP model is shown by the dotted border in Fig. 1; it contains the encoder part of the AE after training and a fully connected layer. The output layer contains one sigmoid unit to classify the input: if the output is < 0.5, the input instance is classified as normal, and vice versa.
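A minimal Keras sketch of this AE-plus-MLP pipeline is shown below. Only the 41 → 16 bottleneck, the frozen encoder, the appended fully connected layer and the single sigmoid output are taken from the description above; the remaining layer sizes, activations, optimizer and losses are illustrative assumptions, and x_train/y_train stand for preprocessed NSLKDD arrays.

    # Sketch of the AE-plus-MLP IDS pipeline: train the AE, freeze its
    # encoder, then append an MLP with a sigmoid output for classification.
    from keras.layers import Input, Dense
    from keras.models import Model

    inputs = Input(shape=(41,))
    code = Dense(16, activation='relu', name='bottleneck')(inputs)  # 41 -> 16
    recon = Dense(41, activation='linear')(code)

    ae = Model(inputs, recon)
    ae.compile(optimizer='adam', loss='mse')
    # ae.fit(x_train, x_train, epochs=..., batch_size=...)

    encoder = Model(inputs, code)
    for layer in encoder.layers:
        layer.trainable = False                 # freeze trained encoder weights

    fc = Dense(32, activation='relu')(encoder.output)
    pred = Dense(1, activation='sigmoid')(fc)   # < 0.5 -> normal, else anomaly
    ids_model = Model(encoder.input, pred)
    ids_model.compile(optimizer='adam', loss='binary_crossentropy',
                      metrics=['accuracy'])
    # ids_model.fit(x_train, y_train, epochs=..., batch_size=...)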
1) The vanilla AE is built as described above.
2) For the sparse AE, L1 regularization is attached to the input of the bottleneck layer to ensure the acquisition of unique statistical features from the dataset. After training, the acquired features are fed to the MLP for classification as discussed above.
3) For the denoising AE (DAE), the input is corrupted using Gaussian noise, with the level of input corruption maintained at 50%. Like the earlier AEs, the weights of the DAE are frozen after training, and the MLP and output layer are appended to the frozen encoder part of the DAE to obtain an MLP classification model.
4) For the contractive AE, only the loss function is changed, to the contractive loss in accordance with the recommendations of [12]; the remaining structure of the AE and MLP is the same.
5) Output shapes and numbers of trainable parameters for the trained AEs and trained IDS models are shown in Tables 2 and 3.
6) The architecture of the implemented ConvAE is depicted in Fig. 2 and discussed below.

TABLE 2. Summary of implemented autoencoders.
TABLE 3. Summary of implemented IDS models based on AEs.

The ConvAE accepts input in the form of images: each NSLKDD training record is shaped as a 32×32 greyscale image. At first, the idea of converting a 41-feature input to a 32×32 2D array seems absurd, but this approach has its merits. Arranging the input features as a 2D array helps to discover interesting localized relationships between features that repeat themselves all over the input. ConvAEs differ from other algorithms in that their weights are shared among all locations of the input, preserving spatial locality. The latent representations generated by the ConvAE for classification are more sensitive to transitive relationships between features and help to learn high-level relationships between global features which would otherwise be ignored by other classifiers. As ConvAEs make use of GPUs for training, the training time of the network with 2D input is not much different from that of a conventional SVM or k-NN classifier; evidence for this observation is presented in the results section, where the training and testing times of the models are discussed.


FIGURE 2. Architecture of implemented convolutional autoencoder for IDS.

For converting the network flow dataset to a corresponding image dataset, we need to create a mapping F : θ → I, where θ = (φ_n), n = 1, ..., N, is the preprocessed network flow dataset and I represents the image dataset corresponding to θ. To achieve the image representation I corresponding to each training instance, a vector v1 of length 41 is generated from the preprocessed entries of the dataset features and replicated 3 times to generate a corresponding vector of 123 features, which is converted to a vector v̄ of length 128 by concatenating the first 5 features. For each training/testing instance, v̄ is replicated to generate the corresponding 32 × 32 greyscale representation. After transforming θ → I, the label data is preprocessed as per the requirements of the two-class structure. The result of the label transformation is represented by y = {y_n} ∈ {0, 1}^M, where M denotes the total number of instances; the entry y_n is zero if the corresponding image belongs to normal traffic and 1 otherwise. Both test datasets, NSLKDDTest+ and NSLKDDTest21, are subjected to the same preprocessing.
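A NumPy sketch of this record-to-image mapping follows directly from the description above (41 → 123 → 128 → 32 × 32):

    # Sketch of the mapping F : theta -> I. A 41-feature vector is tiled
    # to 123 values, padded with its first 5 features to length 128, and
    # the 128-vector is repeated 8 times to fill 32 x 32 = 1024 pixels.
    import numpy as np

    def record_to_image(v):              # v: preprocessed 41-feature vector
        v = np.asarray(v, dtype=np.float32)
        v123 = np.tile(v, 3)                         # 41 * 3 = 123
        v128 = np.concatenate([v123, v[:5]])         # append first 5 -> 128
        return np.tile(v128, 8).reshape(32, 32)      # 1024 values as 32 x 32

    image = record_to_image(np.random.rand(41))
    assert image.shape == (32, 32)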
B. DEEP CONVOLUTIONAL NEURAL NETWORKS
Like the ConvAE, the deep convolutional neural network (DCNN) also requires input in the form of images, hence the datasets were subjected to the same preprocessing as for the ConvAE. The architecture of the DCNN model implemented for anomaly detection in this study is shown in Figure 3.

FIGURE 3. Architecture of implemented deep convolutional neural network model for IDS.

The model consists of an input layer, three convolution-and-subsampling pairs, and three fully connected layers followed by an output layer consisting of a single sigmoid unit. A dropout layer (not shown in the figure) is placed between the flattened model and the first fully connected layer FC1. The dropout layer, introduced by Srivastava et al. [35], serves as a regularization layer. It randomly drops units from the DCNN, along with their weights, during training time. This has the effect of training an ensemble of neural networks where each member of the ensemble is a subset of the original neural network. At test time, it is easy to approximate the predictions of all ''thinned'' subsets by simply using an ''un-thinned'' original network with smaller weights.


The selected hyper-parameters include softsign activation, He-normal kernel initialization [36], and the Adadelta optimizer [37] with a batch size of 64 instances. Additional hyper-parameters of the DCNN included an output layer with a single sigmoid unit, a drop-out rate of 0.5, and zero-padding at each convolution layer input.
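A Keras sketch of such a DCNN is given below. The stated hyper-parameters (softsign activations, he_normal initialization, Adadelta, dropout of 0.5, zero-padding, single sigmoid output, batch size 64) are taken from the text; the filter counts, kernel sizes and fully connected layer widths are assumptions, since the exact values of Figure 3 are not reproduced here.

    # Sketch of the DCNN of Figure 3 under the stated hyper-parameters;
    # filter counts and layer widths are illustrative assumptions.
    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same', activation='softsign',
                     kernel_initializer='he_normal', input_shape=(32, 32, 1)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), padding='same', activation='softsign',
                     kernel_initializer='he_normal'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(128, (3, 3), padding='same', activation='softsign',
                     kernel_initializer='he_normal'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dropout(0.5))                      # regularization layer [35]
    model.add(Dense(256, activation='softsign'))
    model.add(Dense(64, activation='softsign'))
    model.add(Dense(16, activation='softsign'))
    model.add(Dense(1, activation='sigmoid'))    # anomaly probability
    model.compile(optimizer='adadelta', loss='binary_crossentropy',
                  metrics=['accuracy'])
    # model.fit(x_images, y, batch_size=64, epochs=...)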
C. LSTM
In the LSTM IDS model, each record was processed as a single-member sequence of a 41-dimensional vector fed to 32 LSTM units. Two Dense layers, with ten (10) and one neuron respectively, were attached to the LSTM outputs to make predictions for the two-class problem. The first Dense layer used ReLU activation, while the classification layer with a single unit used sigmoid activation to make predictions. A drop-out layer was introduced between the LSTM output and the MLP input to thwart over-fitting. The LSTM IDS model was trained on the combined training dataset for 15 epochs.
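A Keras sketch of this LSTM IDS model follows. The 32 LSTM units over a single-step, 41-dimensional sequence, the Dense layers of 10 (ReLU) and 1 (sigmoid) units, and the 15 epochs are taken from the text; the dropout rate and the optimizer are assumptions.

    # Sketch of the LSTM IDS model described above.
    from keras.models import Sequential
    from keras.layers import LSTM, Dense, Dropout

    model = Sequential()
    model.add(LSTM(32, input_shape=(1, 41)))   # one-step sequence of 41 features
    model.add(Dropout(0.5))                    # thwart over-fitting (rate assumed)
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    # x_train has shape (num_records, 1, 41):
    # model.fit(x_train.reshape(-1, 1, 41), y_train, epochs=15)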
V. IMPLEMENTATION
This section describes the experimental setup, the preprocessing of the datasets, and the implementation details of the deep and conventional models used for the experiments.

A. EXPERIMENTAL SETUP
The hardware setup used for implementing the proposed models included:
• CPU: Intel Xeon E-1650 Quad Core
• RAM: 16 GB
• GPU: nVidia GTX 1070 with 1920 CUDA cores and CUDA 8.0

B. PREPROCESSING
A network flow, φ, is an ordered set of packets π_1, ..., π_n, where π_i = (t_i, S_i, D_i, s_i, d_i, p_i, f_i) represents a packet such that:
1) ∀ π_i, π_j ∈ φ : p_i = p_j
2) ∀ π_i, π_j ∈ φ : (S_i = S_j, D_i = D_j, s_i = s_j, d_i = d_j) ∨ (S_i = D_j, S_j = D_i, s_i = d_j, d_i = s_j)
3) ∀ π_(i≠n) ∈ φ : (t_i ≤ t_(i+1)) and (t_(i+1) − t_i ≥ α)
where t_i, S_i, D_i, s_i, d_i, p_i and f_i represent the time-stamp, source IP address, destination IP address, source port, destination port, protocol and TCP flags respectively. The IDS problem is approached as a two-class problem where network flows are either anomalous or normal. The training dataset is prepared by combining NSLKDDTrain20p and NSLKDDTrain+, which collectively provide 151,165 network flow records.

NSLKDD has 41 features, like its predecessor KDDCUP99, and we have used all 41 features. Of these, 3 features, 'protocol-type', 'service' and 'flag', are symbolic features which require conversion into quantitative form before they can be consumed by DNNs. Different techniques [38]–[41] have been proposed in the literature for encoding symbolic features as quantitative features. We studied the impact of different category encoding schemes on the classification accuracy of the NSLKDD dataset using a conventional classifier; for this purpose we chose the Decision-Tree algorithm due to its time efficiency. The impact of different encoding schemes on the dimensionality of the dataset, the training time and the accuracy of the trained model is shown in Table 4.

TABLE 4. Impact of different category encoders on training results.

In Table 4, dimensionality shows the number of new features inserted by the encoding algorithm into each instance during encoding of the three symbolic features. Average scores show the training accuracy of the selected Decision-Tree classifier under a particular encoding scheme. Based on the performance of the symbolic feature encoders, we chose LeaveOneOut encoding, proposed by [41].

In general, learning algorithms benefit from standardization of the dataset. Since different feature vectors of the NSLKDD dataset had different numerical ranges, we applied scaling to convert the raw feature vectors into a more standardized representation for the DNNs. As the datasets contained both normal and anomalous traffic, to avoid the negative influence of the sample mean and variance we used the median and interquartile range (IQR) to scale the data: we removed the median and scaled the data according to the IQR.
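A sketch of this preprocessing is shown below, using the category_encoders package for leave-one-out encoding and scikit-learn's RobustScaler, which removes the median and scales by the IQR. The toy DataFrame and its column names are illustrative stand-ins for the real NSLKDD records.

    # Sketch of symbolic-feature encoding plus median/IQR scaling.
    import pandas as pd
    import category_encoders as ce
    from sklearn.preprocessing import RobustScaler

    df = pd.DataFrame({'protocol_type': ['tcp', 'udp', 'tcp'],
                       'service': ['http', 'dns', 'ftp'],
                       'flag': ['SF', 'S0', 'SF'],
                       'duration': [0.0, 1.2, 3.4],
                       'label': [0, 1, 0]})     # stand-in records

    symbolic = ['protocol_type', 'service', 'flag']
    encoder = ce.LeaveOneOutEncoder(cols=symbolic)
    X = encoder.fit_transform(df.drop('label', axis=1), df['label'])

    scaler = RobustScaler()                     # median removal + IQR scaling
    X_scaled = scaler.fit_transform(X)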
C. IMPLEMENTATION OF DNN MODELS
The software toolchain used to implement all DNNs consists of the Jupyter development environment using Keras 2.0 [42] on a Theano [43] backend with the nVidia CUDA API 8.0 [44]. Both training and testing datasets were manipulated in the form of numpy arrays. The Python scikit-learn [45] library was used for various ML-related tasks. Figures and graphs were created using the Python matplotlib and seaborn libraries.


D. IMPLEMENTATION OF CONVENTIONAL MODELS
For comparisons, we used scikit-learn [45] implementations of binary classification algorithms to train the conventional models. These models were trained on the unraveled version of the training datasets. The classification algorithms used to train the conventional models include the Extreme Learning Machine [46] with the generalized hidden layer proposed by [47], RBF SVM, Decision Tree (J48) with 10-node depth, Naive Bayes, Random-Forest with 10 J48 estimators, Quadratic Discriminant Analysis, and the multilayer perceptron (MLP).
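A sketch of these scikit-learn baselines is given below; the extreme learning machine has no scikit-learn implementation and is therefore omitted, the stand-in data is synthetic, and any hyper-parameter not named in the text is left at its library default.

    # Sketch of the conventional scikit-learn baselines named above.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.neural_network import MLPClassifier

    X_train, y_train = make_classification(n_samples=200, n_features=41,
                                           random_state=0)  # stand-in data
    models = {
        'RBF SVM': SVC(kernel='rbf', probability=True),
        'Decision Tree': DecisionTreeClassifier(max_depth=10),
        'Random Forest': RandomForestClassifier(n_estimators=10),
        'Naive Bayes': GaussianNB(),
        'k-NN': KNeighborsClassifier(),
        'QDA': QuadraticDiscriminantAnalysis(),
        'MLP': MLPClassifier(),
    }
    for name, clf in models.items():
        clf.fit(X_train, y_train)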
VI. RESULTS AND EVALUATIONS
This section presents the results and evaluations of the implemented DNN-based IDS models. Comparisons of the DNN IDS models are provided with each other and with the different conventional models. We used prominent metrics to evaluate the classification quality of the implemented DNNs, which include the receiver operating characteristic (RoC), area under curve (AuC), precision-recall curve, accuracy on the test datasets, and mean average precision (mAP). These evaluation metrics are computed from the confusion matrix, which presents four measures as follows:
• True Positive (TP): an anomaly correctly classified by the model as an anomaly
• False Positive (FP): a normal instance incorrectly classified by the model as an anomaly
• True Negative (TN): a normal instance correctly classified by the model as normal
• False Negative (FN): an anomaly incorrectly classified by the model as normal
In the following subsections, we give a brief introduction to each relevant quality metric and present the results for all implemented models. Results are presented in the form of graph plots to allow ease of comparison.

A. RECEIVER OPERATING CHARACTERISTICS (ROC)
The RoC is a plot of the false positive rate (FPR) against the true positive rate (TPR) of a binary classifier. FPR is defined as FP/(FP + TN); it corresponds to the proportion of negative data points mistakenly predicted positive among all negative data points. TPR, also called sensitivity or recall, is defined as TP/(TP + FN); it corresponds to the proportion of positive data points correctly predicted positive among all positive data points. The RoC shows a trade-off between the sensitivity and specificity of a classifier: the closer the RoC curve is to the top-left border, the better the quality of the predictions made by the model, and vice versa. The RoC curves of the implemented DNN models for the NSLKDDTest+ and NSLKDDTest21 datasets are shown in figures 4 and 5 respectively. Comparisons of the RoC curves for both DNN and conventional ML models are shown in figures 6 and 7.

FIGURE 4. Comparison of RoC curves of deep neural network IDS models for NSLKDDTest+ dataset.
FIGURE 5. Comparison of RoC curves of deep neural network IDS models for NSLKDDTest21 dataset.
FIGURE 6. Comparison of RoC curves for both deep and conventional IDS models for NSLKDDTest+ dataset.
FIGURE 7. Comparison of RoC curves for both deep and conventional IDS models for NSLKDDTest21 dataset.
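All of the metrics reported in this section can be computed with scikit-learn from a model's anomaly scores, as the following sketch illustrates on stand-in arrays (1 denotes an anomaly):

    # Sketch of computing RoC, AuC, PRC, mAP and accuracy with scikit-learn.
    import numpy as np
    from sklearn.metrics import (roc_curve, auc, precision_recall_curve,
                                 average_precision_score, accuracy_score)

    y_test = np.array([0, 0, 1, 1, 0, 1])                # stand-in labels
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])   # stand-in model scores

    fpr, tpr, _ = roc_curve(y_test, scores)          # RoC curve points
    roc_auc = auc(fpr, tpr)                          # area under RoC
    precision, recall, _ = precision_recall_curve(y_test, scores)
    mAP = average_precision_score(y_test, scores)    # mean average precision
    acc = accuracy_score(y_test, scores > 0.5)       # accuracy at 0.5 threshold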
B. AREA UNDER ROC CURVE
The area under the RoC curve (AuC) is a measure of how well a binary classifier can predict labels. The AuC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive (anomalous) record higher than a randomly chosen negative (normal) one. A perfect binary classifier has AuC = 1, and a greater AuC value indicates better performance; any AuC value less than 0.5 indicates poor performance. The AuC values for both the NSLKDDTest+ and NSLKDDTest21 datasets are shown in figures 6 and 7. The top 5 AuC scores for both test datasets are shown in Table 5.

TABLE 5. Top 5 area under RoC curve results of models for NSLKDDplus and NSLKDD21 datasets.

C. ACCURACY

FIGURE 8. Comparison of model accuracies for NSLKDDTest+ and NSLKDDTest21 datasets.


Accuracy results of both deep and shallow models are shown in figure 8. Among the DNN models, the LSTM model delivered the top accuracies of 89% and 83% on the NSLKDDTest+ and NSLKDDTest21 datasets respectively. The DCNN was the runner-up among the DNN models with 85% accuracy on NSLKDDTest+, while for NSLKDDTest21 the runner-up turned out to be the ConvAE. Among the conventional models, Decision-Tree, SVM and k-NN had a tie at 82% for NSLKDDTest+, while the Decision Tree outperformed the two conventional contender models on the NSLKDDTest21 dataset with an accuracy of 68%. The overall best accuracy was delivered by the LSTM, with accuracy scores of 89% and 83% respectively for the NSLKDDTest+ and NSLKDDTest21 datasets. The sharp difference in accuracies between NSLKDDTest+ and NSLKDDTest21 for all models is due to the fact that NSLKDDTest21 contains records for attack types not available in the other NSLKDD train and test datasets. These attack types include processtable, mscan, snmpguess, snmpgetattack, saint, apache2, httptunnel, back and mailbomb, as mentioned earlier. This means that the trained models never had the opportunity to see these attacks during training, as they were not present in the training data.


D. PRECISION-RECALL CURVE AND MAP
Precision is defined as a measure of the relevancy of results, while recall provides a measure of how many genuinely relevant results are returned. High scores for both show that the model is returning accurate results (high precision) while also returning the majority of positive results (high recall). Each classifier exhibits a trade-off between precision and recall. Because precision and recall individually provide only one piece of the puzzle of classifier performance, they are combined to form the precision-recall curve, which presents the relationship between them in a more meaningful manner. The stair-step nature of the precision-recall curve provides insight into this relationship: a small change in the threshold at the edges of a stair-step considerably reduces precision with only a small increase in recall.


FIGURE 9. Precision-recall curve and mAP scores of DNN models for NSLKDDTest+ dataset.
FIGURE 10. Precision-recall curve and mAP scores of DNN models for NSLKDDTest21 dataset.

Figures 9 and 10 depict the precision-recall curves (PRC) and mean average precision (mAP), shown as the area under the precision-recall curve in the legends, of the DNN models for both test datasets. Mean average precision (mAP) summarizes a precision-recall curve as the weighted mean of the precisions achieved at each threshold, with the differential increase in recall used as the weight. The mAP for all tested models is shown in the legends of Figures 9, 10, 11 and 12. Except for the SparseAE, all DNN models showed very good results. On NSLKDDTest+, the LSTM and DCNN models share the top position with mAP scores of 97%, while the DCNN showed marginally improved performance for NSLKDDTest21 with a 98% score. Three models, ContAE, ConvAE and LSTM, achieved a 97% mAP score for NSLKDDTest21. Figures 11 and 12 show the PRC and mAP performance of all models. The top six mAP scores are shown in Table 6.

FIGURE 11. Precision-recall curve and mAP scores of DNN and conventional models for NSLKDDTest+ dataset.
FIGURE 12. Precision-recall curve and mAP scores of DNN and conventional models for NSLKDDTest21 dataset.
TABLE 6. Top 6 mean average precision results from implemented IDS models.


E. TEST AND TRAIN TIMINGS
In this subsection, we provide the training and testing times of the models used in this study. For DNNs, the GPU was used as the training and testing device, while conventional models were trained and tested using the CPU. Among the DNNs, the ConvAE proved to be the most expensive algorithm because its training time included both autoencoder training and MLP classification model training; collectively, the ConvAE IDS model took approximately 367 seconds on the GPU. The DCNN and LSTM models took 109 and 208 seconds respectively. The smallest training time among the DNN models was that of the Sparse Autoencoder, but it did not show comparable results. SVM with an RBF kernel proved to be the most expensive model among the conventional IDS models and took approximately 314 seconds. The fastest in the conventional category was Random-Forest, closely followed by Decision Tree. Despite its small training time, the Decision-Tree model showed remarkable results and performed comparably to other, more complex models. The remaining conventional models each took under 100 seconds for training.


FIGURE 13. Training time in seconds for different algorithms used in experiments.
FIGURE 14. Test time for different algorithms used in experiments for both NSLKDDTest+ and NSLKDDTest21 datasets.

The training times of all models used in this study are shown in Figure 13. The evaluation times of all models are shown in Figure 14 for both the NSLKDDTest+ and NSLKDDTest21 datasets. The longest time to evaluate a complete dataset, approximately 8 seconds, was that of the k-NN based IDS model. The two best performing DNN models, DCNN and LSTM, took approximately 2 and 4 seconds respectively for NSLKDDTest+ and about 1 second for NSLKDDTest21. The Decision Tree based IDS model was imperceptibly fast during evaluation of both test datasets.

VII. CONCLUSION
In this paper, intrusion detection models were proposed, implemented and trained using different deep neural network architectures, including convolutional neural networks, autoencoders, and recurrent neural networks. These deep models were trained on the NSLKDD training dataset and evaluated on both test datasets provided by NSLKDD, namely NSLKDDTest+ and NSLKDDTest21. For training and evaluation of the deep models, a GPU-powered test-bed using Keras with a Theano backend was employed. To make the model comparisons more credible, we implemented conventional ML IDS models with different well-known classification techniques, including Extreme Learning Machine, k-NN, Decision-Tree, Random-Forest, Support Vector Machine, Naive Bayes, and QDA. Both the DNN and conventional ML models were evaluated using well-known classification metrics, including the RoC curve, area under RoC, precision-recall curve, mean average precision and accuracy of classification. The DCNN and LSTM models showed exceptional performance, with 85% and 89% accuracy on the test dataset, which demonstrates that deep learning is not only a viable but a promising technology for information security applications, as it has proven to be in other application domains. Our future research will be directed towards investigating deep learning as a feature extraction tool to learn efficient data representations for the anomaly detection problem.

REFERENCES
[1] D. E. Denning, ''An intrusion-detection model,'' IEEE Trans. Softw. Eng., vol. SE-13, no. 2, pp. 222–232, Feb. 1987.
[2] M. Luo, L. Wang, H. Zhang, and J. Chen, ''A research on intrusion detection based on unsupervised clustering and support vector machine,'' in Proc. 5th Int. Conf. Inf. Commun. Secur. (ICICS), Hohhot, China, Oct. 2003, pp. 325–336, doi: 10.1007/978-3-540-39927-8_30.
[3] X. Zhu and A. B. Goldberg, ''Introduction to semi-supervised learning,'' Synth. Lect. Artif. Intell. Mach. Learn., vol. 3, no. 1, pp. 1–130, 2009.
[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[5] M. Minsky and S. Papert, Perceptrons. Cambridge, MA, USA: MIT Press, 1969.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ''ImageNet classification with deep convolutional neural networks,'' in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[7] O. Russakovsky et al., ''ImageNet large scale visual recognition challenge,'' Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ''Gradient-based learning applied to document recognition,'' Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[9] S. Hochreiter and J. Schmidhuber, ''Long short-term memory,'' Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[10] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, ''Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,'' J. Mach. Learn. Res., vol. 11, no. 12, pp. 3371–3408, Dec. 2010.
[11] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, ''Stacked convolutional auto-encoders for hierarchical feature extraction,'' in Artificial Neural Networks and Machine Learning—ICANN. Berlin, Germany: Springer, 2011, pp. 52–59.
[12] S. Rifai, P. Vincent, X. Müller, X. Glorot, and Y. Bengio, ''Contractive auto-encoders: Explicit invariance during feature extraction,'' in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 833–840.
[13] J. Kim, N. Shin, S. Y. Jo, and S. H. Kim, ''Method of intrusion detection using deep neural network,'' in Proc. IEEE Int. Conf. Big Data Smart Comput. (BigComp), Feb. 2017, pp. 313–316.


[14] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, ''A detailed analysis of the KDD CUP 99 data set,'' in Proc. IEEE Symp. Comput. Intell. Secur. Defense Appl. (CISDA), 2009, pp. 53–58.
[15] S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth, ''The UCI KDD archive of large data sets for data mining research and experimentation,'' ACM SIGKDD Explor. Newslett., vol. 2, no. 2, pp. 81–85, 2000.
[16] M. Tavallaee, ''An adaptive hybrid intrusion detection system,'' Ph.D. dissertation, Fac. Comput. Sci., Univ. New Brunswick, Saint John, NB, Canada, 2011.
[17] R. A. R. Ashfaq, X.-Z. Wang, J. Z. Huang, H. Abbas, and Y.-L. He, ''Fuzziness based semi-supervised learning approach for intrusion detection system,'' Inf. Sci., vol. 378, pp. 484–497, Feb. 2017.
[18] M. Yousefi-Azar, V. Varadharajan, L. Hamey, and U. Tupakula, ''Autoencoder-based feature learning for cyber security applications,'' in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 3854–3861.
[19] Y. Liao and V. Vemuri, ''Use of K-nearest neighbor classifier for intrusion detection,'' Comput. Secur., vol. 21, no. 5, pp. 439–448, Oct. 2002.
[20] S. Mukkamala, G. Janoski, and A. Sung, ''Intrusion detection using neural networks and support vector machines,'' in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2002, pp. 1702–1707.
[21] P. Laskov, P. Düssel, C. Schäfer, and K. Rieck, ''Learning intrusion detection: Supervised or unsupervised?'' in Proc. 13th Int. Conf. Image Anal. Process. (ICIAP), Cagliari, Italy, Sep. 2005, pp. 50–57, doi: 10.1007/11553595_6.
[22] A. A. Ghorbani, W. Lu, and M. Tavallaee, Network Intrusion Detection and Prevention (Advances in Information Security), vol. 47. Boston, MA, USA: Springer, 2010.
[23] A. Solanas and A. Martinez-Balleste, Advances in Artificial Intelligence for Privacy Protection and Security (Intelligent Information Systems). Hackensack, NJ, USA: World Scientific, 2010.
[24] D. K. Bhattacharyya and J. K. Kalita, Network Anomaly Detection: A Machine Learning Perspective. Boca Raton, FL, USA: CRC Press, 2013.
[25] N. Gao, L. Gao, Q. Gao, and H. Wang, ''An intrusion detection model based on deep belief networks,'' in Proc. 2nd Int. Conf. Adv. Cloud Big Data, Nov. 2014, pp. 247–252.
[26] Z. Wang, ''The applications of deep learning on traffic identification,'' Blackhat, 2015. [Online]. Available: https://www.blackhat.com/docs/us-15/materials/us-15-Wang-The-Applications-Of-Deep-Learning-On-Traffic-Identification.pdf
[27] R. C. Aygun and A. G. Yavuz, ''Network anomaly detection with stochastically improved autoencoder based models,'' in Proc. IEEE 4th Int. Conf. Cyber Secur. Cloud Comput. (CSCloud), Jun. 2017, pp. 193–198.
[28] M. Z. Alom, V. Bontupalli, and T. M. Taha, ''Intrusion detection using deep belief networks,'' in Proc. Nat. Aerosp. Electron. Conf. (NAECON), Jun. 2015, pp. 339–344.
[29] E. Hodo, X. Bellekens, A. Hamilton, C. Tachtatzis, and R. Atkinson, ''Shallow and deep networks intrusion detection system: A taxonomy and survey,'' 2017.
[33] A. Karpathy, ''Connecting images and natural language,'' Ph.D. dissertation, Fac. Comput. Sci., Stanford Univ., Stanford, CA, USA, 2016.
[34] Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, and R. Salakhutdinov, ''On multiplicative integration with recurrent neural networks,'' in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2856–2864.
[35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, ''Dropout: A simple way to prevent neural networks from overfitting,'' J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[36] K. He, X. Zhang, S. Ren, and J. Sun, ''Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,'' in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1026–1034.
[37] M. D. Zeiler, ''ADADELTA: An adaptive learning rate method,'' CoRR, Dec. 2012. [Online]. Available: https://arxiv.org/abs/1212.5701
[38] W. McGinnis, ''BaseN encoding and grid search in categorical variables,'' Jul. 2017. [Online]. Available: http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
[39] W. McGinnis, ''Beyond one-hot: An exploration of categorical variables,'' Jul. 2017. [Online]. Available: http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
[40] SC Group, ''Contrast coding systems for categorical variables,'' Feb. 2011. [Online]. Available: https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
[41] O. Zhang, ''Strategies to encode categorical variables with many categories,'' Feb. 2017. [Online]. Available: https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154
[42] F. Chollet et al., ''Keras,'' GitHub, 2015. [Online]. Available: https://github.com/fchollet/keras
[43] R. Al-Rfou et al., ''Theano: A Python framework for fast computation of mathematical expressions,'' May 2016. [Online]. Available: https://arxiv.org/abs/1605.02688
[44] J. Nickolls, I. Buck, M. Garland, and K. Skadron, ''Scalable parallel programming with CUDA,'' Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008, doi: 10.1145/1365490.1365500.
[45] F. Pedregosa et al., ''Scikit-learn: Machine learning in Python,'' J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Oct. 2011.
[46] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, ''Extreme learning machine: Theory and applications,'' Neurocomputing, vol. 70, nos. 1–3, pp. 489–501, 2006.
[47] F. Fernández-Navarro, C. Hervás-Martínez, J. Sanchez-Monedero, and P. A. Gutiérrez, ''MELM-GRBF: A modified version of the extreme learning machine for generalized radial basis function neural networks,'' Neurocomputing, vol. 74, no. 16, pp. 2502–2510, 2011.

SHERAZ NASEER received the M.S. degree in information security along with distinguished professional certifications of information security, including CISSP, CoBit, and ITIL. He is currently pursuing the Ph.D. degree with the University of Engineering & Technology, Lahore. He has over 10 years of
onomy and survey.’’ [Online]. Available: https://arxiv.org/abs/1701.02145
experience in information security and IT. He is an Assistant Professor with
[30] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, ‘‘A Deep Learning
the University of Management and Technology, Lahore, Pakistan. He has
Approach for Network Intrusion Detection System,’’ in Proc. 9th EAI
Int. Conf. Bio-Inspired Inf. Commun. Technol. (BICT), 2016, pp. 21–26, been with various information security positions in financial, consulting,
doi: 10.4108/eai.3-12-2015.2262516. academia, and government sectors. He is very active in academic research
[31] L. Bontemps, V. L. Cao, J. McDermott, and N.-A. Le-Khac, ‘‘Collective with over six research publications in conferences and journals. His research
anomaly detection based on long short term memory recurrent neural interests include cryptography, data driven security, intrusion detection,
network,’’ in Proc. Int. Conf. Future Data Secur. Eng. Cham, Switzerland: malware detection, and application of deep neural networks for information
Springer, 2016, pp. 141–152. security. His other skills include ISO 27001, policy and procedure devel-
[32] K. Hornik, ‘‘Approximation capabilities of multilayer feedforward net- opment, IT security reviews and audits, vulnerability assessment and pen-
works,’’ Neural Netw., vol. 4, no. 2, pp. 251–257, 1991. [Online]. Avail- testing, secure software development, cryptography, log monitoring, and
able: http://linkinghub.elsevier.com/retrieve/pii/089360809190009T information security trainings.

VOLUME 6, 2018 48245


S. Naseer et al.: Enhanced Network Anomaly Detection Based on DNNs

YASIR SALEEM received his secondary education (O-level and A-level) in the U.K., the bachelor's, master's, and Ph.D. degrees from the Electrical Engineering Department, University of Engineering and Technology (UET), Lahore, Pakistan, in 2002, 2004, and 2011, respectively, and the MBA degree from ICBS, Lahore, in 2015, for a better understanding of management and the industry–academia relationship. He is currently an Associate Professor with UET. During his Ph.D., he carried out research for one semester under the supervision of Prof. Dr. Z. Salam at the Renewable Energy and Power Electronics Lab, Faculty of Electrical Engineering, UTM, Malaysia. He has authored and co-authored journal and conference papers at the national and international levels in the fields of electrical engineering, computer science, and engineering. His research interests include computer networks, information/network security, DSP, power electronics, computer vision, image processing, simulation, and control systems.

JIHUN HAN received the B.S. and M.S. degrees in mechanical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2009 and 2011, respectively, where he is currently pursuing the Ph.D. degree in mechanical engineering. His research interests include optimal control and predictive control, with an emphasis on their application to intelligent vehicular and transportation systems, such as hybrid electric vehicles and connected and automated vehicles.

SHEHZAD KHALID received the degree from the Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Pakistan, in 2000, the M.Sc. degree from the National University of Science and Technology, Pakistan, in 2003, and the Ph.D. degree in informatics from the University of Manchester, U.K., in 2009. He heads the Computer Vision and Pattern Recognition Research Group, a vibrant research group undertaking various research projects. He is currently a Professor and the Head of the Department of Computer Engineering, Bahria University, Pakistan. He is a qualified academician and researcher with over 50 international publications in conferences and journals. His areas of research include, but are not limited to, shape analysis and recognition, motion-based data mining and behavior recognition, medical image analysis, ECG analysis for disease detection, biometrics using fingerprints, vessel patterns of the hands and retina, and ECG, Urdu stemmer development, short and long multilingual text mining, and Urdu OCR. He received the Best Researcher Award from Bahria University in 2014. He was also a recipient of the Letter of Appreciation for Outstanding Research Contribution in 2013 and the Outstanding Performance Award from 2013 to 2014. He is a Reviewer for various leading ISI-indexed journals, such as Computer Vision and Image Understanding, the Journal of Visual Communication and Image Representation, the Journal of Medical Systems, the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, and Information Sciences.

MUHAMMAD MUNWAR IQBAL received the Ph.D. degree from the Department of Computer Science & Engineering, University of Engineering and Technology, Lahore, Pakistan, under the supervision of Dr. Y. Saleem, the M.S. degree in computer science from the COMSATS Institute of Information Technology, Lahore, in 2011, and the M.Sc. degree in computer science from the University of the Punjab, Lahore. He is currently an Assistant Professor with the Department of Computer Science, University of Engineering and Technology, Taxila, Pakistan. He has authored and co-authored journal and conference papers at the national and international levels in the field of computer science. His interests are machine learning, databases, the semantic web, e-learning, and artificial intelligence.

MUHAMMAD KHAWAR BASHIR received the M.Sc. degree in computer science from the Punjab University College of Information Technology, Lahore, and the M.Phil. degree in computer science from National University (FAST), Lahore. He is currently pursuing the Ph.D. degree with the University of Engineering & Technology, Lahore. He has over twelve years of teaching experience. He is a Lecturer with the Department of Statistics and Computer Science, University of Engineering & Technology, Lahore. His fundamental expertise is in computer software, image processing, and computer vision. His other research interests include e-commerce, management information systems, network security, and machine learning. He has local and international publications and has presented his research findings at national and international scientific conferences in various countries, including China and Turkey. He also has expertise in handling software solutions.

KIJUN HAN received the B.S. degree in electrical engineering from Seoul National University, South Korea, in 1979, the M.S. degree in electrical engineering from KAIST, South Korea, in 1981, and the M.S. and Ph.D. degrees in computer engineering from the University of Arizona in 1985 and 1987, respectively. Since 1988, he has been a Professor with the School of Computer Science and Engineering, Kyungpook National University, Daegu, South Korea.
