Enhanced Network Anomaly Detection Based On Deep Neural Networks
Received June 3, 2018, accepted July 16, 2018, date of publication August 17, 2018, date of current version September 21, 2018.
Digital Object Identifier 10.1109/ACCESS.2018.2863036
ABSTRACT Due to the monumental growth of Internet applications in the last decade, the need for security
of information networks has increased manifold. As a primary defense of network infrastructure, an intrusion
detection system is expected to adapt to a dynamically changing threat landscape. Many supervised and
unsupervised techniques have been devised by researchers from the disciplines of machine learning and
data mining to achieve reliable detection of anomalies. Deep learning is an area of machine learning that
applies neuron-like structures to learning tasks. Deep learning has profoundly changed the way we approach
learning tasks by delivering monumental progress in disciplines such as speech processing, computer
vision, and natural language processing. It is therefore natural to investigate this technology
for information security applications. The aim of this paper is to investigate the suitability of deep
learning approaches for anomaly-based intrusion detection systems. For this research, we developed anomaly
detection models based on different deep neural network structures, including convolutional neural networks,
autoencoders, and recurrent neural networks. These deep models were trained on the NSLKDD training data set
and evaluated on both test data sets provided by NSLKDD, namely NSLKDDTest+ and NSLKDDTest21.
All experiments in this paper were performed by the authors on a GPU-based test bed. Conventional machine
learning-based intrusion detection models were implemented using well-known classification techniques,
including extreme learning machine, nearest neighbor, decision tree, random forest, support vector machine,
naive Bayes, and quadratic discriminant analysis. Both deep and conventional machine learning models were
evaluated using well-known classification metrics, including receiver operating characteristics, area under
the curve, precision-recall curve, mean average precision, and accuracy of classification. Experimental
results of the deep IDS models were promising for real-world application in anomaly detection systems.
INDEX TERMS Deep learning, convolutional neural networks, autoencoders, LSTM, k-NN, decision tree,
intrusion detection, convnets, information security.
2169-3536 2018 IEEE. Translations and content mining are permitted for academic research only.
Personal use is also permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
VOLUME 6, 2018, page 48231
S. Naseer et al.: Enhanced Network Anomaly Detection Based on DNNs
type of training is to classify the test data as anomalous or normal on the basis of feature vectors.
Unsupervised learning, on the other hand, uses unlabeled or untagged data to perform the learning task.
One popular unsupervised learning technique is clustering [2], which searches for similarities among
instances of the dataset to build clusters. Instances sharing related characteristics are assumed to
be alike and are placed in the same cluster. Semi-Supervised Learning (SSL) is a combination of
supervised and unsupervised learning. The SSL approach utilizes both labeled and unlabeled data [3]
for learning. SSL methods learn feature-label associations from labeled data and, on the basis of
these learned associations, assign labels to unlabeled instances whose features are similar to those
of a labeled instance.

Deep Learning is an area of Machine Learning which applies neuron-like mathematical structures [4]
to learning tasks. Neural Networks have been around for many decades [5] and have repeatedly gained
and lost the favor of the research community. The latest rise of this technology is attributed to
AlexNet [6], a Deep Neural Network which won the ImageNet classification challenge. AlexNet achieved
top-1 and top-5 error rates of 37.5% and 17.0% on the ImageNet dataset [7], which were considerably
better than the previous state-of-the-art mechanisms. Since then, Deep Neural Networks (DNNs) have
attracted the attention of the research community once again, and multiple DNN structures have been
proposed, including Convolutional Neural Networks (CNNs) [8], Recurrent Neural Networks (LSTM) [9],
Deep Belief Nets (DBNs) and different types of Autoencoders, including Sparse, Denoising [10],
Convolutional [11], Contractive [12] and Variational Autoencoders. These DNN structures have been
successfully applied to devise state-of-the-art solutions in multiple disciplines.

Application of Deep Neural Networks to information security problems is a relatively new area of
research. We observed three research gaps during our literature review of the anomaly detection
problem. The first gap was the lack of investigation of well-known deep learning approaches for
anomaly detection. Although isolated studies were available, as described in Section II, no
comprehensive research work was available to fill this gap. The second research gap was the use
of training datasets for both training and testing of models using cross-validation mechanisms.
Most of the recent works followed this approach and reported very high detection rates; e.g.,
Kim et al. [13] used a four-layer DNN with 100 units for intrusion detection on the KDD99 dataset
and reported 99% accuracy. We believe that this approach does not provide a reliable solution to
the anomaly detection problem because, given sufficient training, models can be over-fitted to
achieve such high rates. The third gap was the lack of comparison/evaluation of deep learning
models amongst themselves and with conventional machine learning based models using standardized
classification quality metrics, which was a natural consequence of the previous two gaps.
Standardized classification quality metrics include RoC Curve, Area under RoC, Precision-Recall
Curve, mean average precision (mAP) and accuracy of classification.

The primary contribution of this work is filling the above-mentioned research gaps by designing and
implementing anomaly detection models based on state-of-the-art Deep Neural Networks and evaluating
them using standardized classification quality metrics. The first gap is filled by developing
anomaly detection models using Deep CNN, LSTM and multiple types of Autoencoders. To the best of our
knowledge, the DNN structures (DCNN, Contractive, and Convolutional Autoencoders) investigated in
this study have not previously been analyzed for anomaly detection. In addition, deep learning based
anomaly detection models are compared with well-known classification schemes including SVM, K-NN,
Decision-Tree, Random-Forest, QDA and Extreme Learning Machine. To fill the second research gap, we
opted to train all models on the training dataset without ever exposing the test dataset to the model
during training, and then tested/evaluated the models on the testing datasets. This approach provided
a fair estimate of model capabilities by using unseen data instances at evaluation time. To bridge
the third research gap, deep learning based anomaly detection models were evaluated amongst
themselves and against conventional machine learning models by using unseen test data and employing
standard classification quality metrics including RoC Curve, Area under RoC, Precision-Recall Curve,
mean average precision (mAP) and accuracy of classification.

All experiments in this study were performed by the authors on the NSLKDD dataset provided by [14]
using a GPU-powered test-bed. NSLKDD is derived from KDDCUP99 [15], which was generated in 1999 from
the DARPA98 network traffic. Tavallaee et al. [14] discovered some inherent flaws in the original
KDDCUP99 dataset which had adverse impacts on the performance of IDS models trained and evaluated
on that dataset. A statistically enhanced version of the dataset called NSLKDD was proposed by [14]
to counter the discovered statistical problems. Advantages of NSLKDD over the KDDCUP99 dataset
include the removal of redundant records from the training dataset, which reduces complexity and
bias towards frequent records, and the introduction of non-duplicate records in the testing datasets
for unbiased evaluation.

The NSLKDD dataset is available in four partitions. Two partitions, namely NSLKDDTrain20p and
NSLKDDTrain+, serve as training datasets for model learning and provide 25,192 and 125,973 training
records respectively. The remaining two partitions, called NSLKDDTest+ and NSLKDDTest21, are
available for performance evaluation of trained models on unseen data and provide 22,543 and 11,850
data instances respectively. Additionally, NSLKDDTest21 contains records for attack types not
available in the other NSLKDD train and test datasets. These attack types include processtable,
mscan, snmpguess, snmpgetattack, saint, apache2, httptunnel, back and mailbomb. All models in our
study were trained on the NSLKDD training datasets (NSLKDDTrain20p and NSLKDDTrain+) and tested on NSLKDD
test datasets (NSLKDDTest+ and NSLKDDTest21). This approach was also adopted by [16]–[18].

Like its predecessor, the NSLKDD dataset consists of 41 input features as well as class labels.
Features 1 to 9 represent the basic features, which were created from the TCP/IP connection without
payload inspection. Features 10 to 22 comprise content features, generated from the payload of TCP
segments of packets. Features 23 to 31 were extracted from time-based traffic properties, while
features 32 to 41 contain application based traffic features that were designed to measure attacks
within intervals longer than 2 seconds. A class label was provided with each record, which
identified the network traffic instance either as normal or as an attack. The original KDDCUP99
dataset listed the different types of attacks shown in Table 1.

The rest of the article is divided into VI sections. Section II highlights prominent works related
to the IDS problem. Section III provides the architectural designs of the DNNs used in this study.
Section IV sheds light on implementation details including hardware setup and software toolchain.
In Section V, we present results of the proposed DNN based models along with comparisons of results
and timing information. This section is followed by section V and VI which describe the conclusion
and references of research respectively.

II. RELATED WORKS
Different machine learning techniques, including supervised, unsupervised and semi-supervised
approaches, have been proposed to enhance the performance of anomaly detection. Supervised
approaches such as k-nearest neighbor (k-NN) [19], neural networks and support vector machine
(SVM) [20] have been studied extensively for anomaly detection. Laskov et al. [21] provided a
comparative analysis of supervised and unsupervised learning techniques with respect to their
detection accuracy and ability to detect unknown attacks. Ghorbani et al. [22] provided a
comprehensive review of supervised and unsupervised learning approaches for anomaly detection.
Solanas and Martinez-Balleste [23] presented clustering algorithms for anomaly detection. A
comprehensive repertoire of anomaly-based intrusion detection systems is presented by
Bhattacharyya and Kalita [24], and Tavallaee [16] compared the performance of different
classification algorithms, including Naive Bayes, Support Vector Machines and Decision Trees,
on the NSLKDD dataset.

Application of Deep Neural Networks to information security problems is a relatively new area of
research. DNN structures like Autoencoders (AE), Deep Belief Networks (DBNs) and LSTM have been
used for anomaly detection. Gao et al. [25] proposed an IDS architecture based on DBNs using
energy based restricted Boltzmann machines (RBMs) on the KDDCup99 dataset. Wang [26] proposed a
deep network of stacked autoencoders (SAE) for network traffic identification. A semi-supervised
learning based approach with a random weights based NN (NNRw) is used by Ashfaq et al. [17] to
implement an IDS architecture using NSLKDD.

Aygun and Yavuz [27] employed vanilla and denoising deep Autoencoders on NSLKDD and claimed
accuracies of 88.28% and 88.6% on the NSLKDDTest+ dataset. They did not provide results for the
NSLKDDTest21 dataset, nor did they provide any other quality metrics of their classifier such as
AuROC, Precision, Recall, and mAP. Yousefi-Azar et al. [18] used Autoencoders as an unsupervised
latent feature generation mechanism and provided a comparison of classifiers using conventional
NSLKDD features and possible representations of NSLKDD. They reported an accuracy of 83.34% by
using latent representations of NSLKDD. Alom et al. [28] trained a Deep Belief Network (DBN) using
stacked Restricted Boltzmann Machines (RBMs) and reported an accuracy of 97.5% on the NSLKDD
training dataset (training accuracy). They did not provide results for either the NSLKDDTest+ or
the NSLKDDTest21 dataset. Hodo et al. [29] provided a taxonomy and survey of deep and conventional
structures for intrusion detection. Javaid et al. [30] developed a ‘‘Self-taught Learning’’
classification mechanism by combining the encoder layers of a Sparse Autoencoder (for hidden
feature extraction) with Softmax regression (for probability estimates of classes) to perform
classification on NSLKDD. They reported 98.3% accuracy on training data (training accuracy) for
two-class classification, 98.2% training accuracy for 5-class classification and 98.8% training
accuracy for 23-class classification. Javaid et al. provided results for neither the NSLKDDTest+
nor the NSLKDDTest21 dataset. Bontemps et al. [31] proposed an LSTM based anomaly detection system
and reported variations in correct and false alarms of prediction errors with a change of a β
parameter of the proposed system, but did not provide results in well-known metrics. This shows
the need for an experimental study which develops anomaly detection
models using well-known Deep learning approaches and evaluates them on previously unseen test
datasets using standardized classification quality metrics. In this study, we aim to close the
above-mentioned research gap and evaluate Deep learning based models on unseen data using standard
classification quality metrics including RoC Curve, Area under RoC, Precision-Recall Curve, mean
average precision (mAP) and accuracy of classification.

an approximation of any function to an arbitrary degree of accuracy, which means a deep AE with
more than one encoder layer can approximate any mapping from input to bottleneck arbitrarily well.
Adding depth has the effect of exponential reduction in computation cost and amount of training
data needed for representing many functions [4]. Different AE structures are described in the
literature, but we will discuss the AEs relevant to our study.
in which $\|J_f(x)\|_F^2$ is the squared Frobenius norm of the Jacobian matrix, i.e. the sum of
squares over all elements inside the matrix. The Frobenius norm is regarded as the generalization
of the Euclidean norm and is given by the following equation, where $h = f(x)$ denotes the hidden
representation:

$$\|J_f(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2$$

4) CONVOLUTIONAL AUTOENCODERS
Convolutional Autoencoders (ConvAE) were proposed by Masci et al. [11]. ConvAEs learn non-trivial
features using simple stochastic gradient descent (SGD) and discover good initializations for
Convolutional Neural Networks (CNNs) that avoid the numerous distinct local minima of the highly
non-convex objective functions arising in different deep learning problems. Fully connected AEs
ignore the 2D structure of the input, which introduces redundancy in parameters and forces each
feature to be global. A different approach is to discover localized features that repeat themselves
all over the input. ConvAEs differ from other AEs in that their weights are shared among all
locations of the input, preserving spatial locality. The latent representations generated by
ConvAEs are more sensitive to transitive relations of features and capture semantics ignored by
other AE structures. For a mono-channel input $x$, the latent representation of the $k$-th feature
map is given by

$$h^k = \sigma(x * W^k + b^k)$$

where $\sigma$ is the activation function, $*$ denotes 2D convolution, and $x$, $W^k$ and $b^k$
denote the input, weights and bias respectively. The reconstruction of the ConvAE is obtained using
the following equation:

$$y = \sigma\Big(\sum_{k \in H} h^k * \bar{W}^k + c\Big)$$

where $H$ represents the collection of latent feature maps and $\bar{W}$ represents the flip
operation over both dimensions of the weights. Masci et al. [11] proposed plain SGD as the
optimizer and Mean Squared Error as the objective function for their ConvAE structure.

B. CONVOLUTIONAL NEURAL NETWORKS
Convolutional Neural Networks (CNNs) [8] are neural network architectures specially crafted to
handle high dimensional data with some spatial semantics. Examples of such data include images,
video, sound signals in speech, character sequences in text, or any other multi-dimensional data.
In all of the above-mentioned cases, using fully connected networks becomes cumbersome due to the
larger feature space. CNNs are preferred in such cases because of their awareness of the spatial
layout of the input, specific local connectivity, and parameter sharing schemes [33].

For a CNN, the input x consists of a multi-dimensional array called a tensor. The core
computational blocks of CNNs are convolution (Conv) and pooling layers. A Conv layer takes a tensor
as input and convolves it with a set of filters (kernels) to produce an output tensor. For a single
filter k of dimensions $d_k$, $h_k$ and $w_k$, convolution is performed by sliding filter k over all
spatial positions of the input tensor and calculating the dot product between the input chunk and
filter k to produce a feature map. Since a Conv layer contains a set of filters, it produces a
feature map for every filter, and these feature maps are stacked together to produce the output
tensor. Formally, assuming input shaped as greyscale images, the operation of a convolution layer
is specified by the following steps [33]:
1) It accepts a tensor of size $D_1 \times H_1 \times W_1$ and requires the following four
hyper-parameters:
• Number of filters K
• Spatial dimension of filters F
• Stride with which they are applied S and
• Size of zero-padding P, if padding is enabled
2) The Conv layer produces an output tensor of size $D_2 \times H_2 \times W_2$, where $D_2 = K$,
$W_2 = (W_1 - F + 2P)/S + 1$ and $H_2 = (H_1 - F + 2P)/S + 1$
3) The number of parameters in each filter is $F \cdot F \cdot D_1$, for a total of
$(F \cdot F \cdot D_1) \cdot K$ weights and K biases

The purpose of a pooling layer is to control overfitting by decreasing the size of the
representation with a fixed down-sampling technique (max-pooling or mean-pooling) that has no
weights. Pooling layers operate on each feature map separately to reduce its size. A typical
setting is to use max-pooling with 2x2 filters and a stride of 2, which downsamples the
representation by exactly half in both height and width.

C. LSTM
Long Short Term Memory (LSTM) is a special Recurrent Neural Network (RNN) architecture. An RNN is
a connectivity pattern that performs computations on a sequence of vectors $x_1, \cdots, x_n$ using
a recurrence formula of the form $h_t = f_\theta(h_{t-1}, x_t)$, where f, an activation function,
and θ, a parameter, are used at every time step to process sequences of arbitrary length. The
hidden vector $h_t$ is called the state of the RNN; it provides a running summary of all vectors x
seen up to that time step, and this summary is updated based on the current input vector $x_t$.
Vanilla RNNs use linear combinations of $x_t$ and $h_{t-1}$, which are a weak form of coupling
between inputs and hidden states [34]. The formal form of the LSTM is shown below:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} =
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}$$

In the equation above, the sigmoid and tanh functions are applied element-wise. The tensor W has
dimensions $[4H \times (D + H)]$. The vectors $i, f, o \in \mathbb{R}^H$ intuitively resemble binary
gates controlling the operations of the memory cell. The vector $g \in \mathbb{R}^H$ is used to
additively modify the memory contents, which allows gradients to be distributed equally during
backpropagation.

IV. METHODOLOGY
This section describes the architectures of the Deep Neural Networks used in this study and the
methodology of their usage for developing anomaly detection models.
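Before turning to the individual models, the convolution and pooling shape rules of Section III-B can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the choice of 32 filters of size 3 with zero-padding 1 is an illustrative assumption rather than the configuration of the implemented DCNN. Only the 1 × 32 × 32 greyscale input size is taken from this study.

```python
def conv_output_shape(d1, h1, w1, k, f, s=1, p=0):
    """Return (D2, H2, W2) for a Conv layer with K filters of size F, stride S, padding P."""
    if (h1 - f + 2 * p) % s or (w1 - f + 2 * p) % s:
        raise ValueError("filter does not tile the padded input for this stride")
    h2 = (h1 - f + 2 * p) // s + 1
    w2 = (w1 - f + 2 * p) // s + 1
    return k, h2, w2  # D2 equals the number of filters K


def conv_param_count(d1, k, f):
    """Return (weights, biases): F*F*D1 weights per filter, K filters, one bias each."""
    return f * f * d1 * k, k


def pool_output_shape(d, h, w, f=2, s=2):
    """Pooling has no weights and acts on every feature map separately."""
    return d, (h - f) // s + 1, (w - f) // s + 1


# Assumed example: one greyscale 32 x 32 input, 32 filters of size 3, stride 1, padding 1.
conv_shape = conv_output_shape(1, 32, 32, k=32, f=3, s=1, p=1)
print(conv_shape)                         # (32, 32, 32)
print(conv_param_count(1, k=32, f=3))     # (288, 32)
print(pool_output_shape(*conv_shape))     # (32, 16, 16)
```

With zero-padding of 1 and a 3 × 3 filter the spatial size is preserved, and the 2 × 2 max-pooling with stride 2 then halves height and width, as stated in Section III-B.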
FIGURE 3. Architecture of implemented deep convolutional neural network model for IDS.
be ignored by other classifiers. As ConvAEs make use of GPUs for training, the training time of a
network with 2D input is not much different from that of a conventional SVM or K-NN classifier.
Evidence for this observation is presented in the results section, where the training and testing
times of the models are discussed.

To convert the network flow dataset into a corresponding image dataset, we need to create a mapping
$F : \theta \rightarrow I$, where I represents the image dataset corresponding to θ and
$\theta = (\phi_n)_{n=1}^{N}$ is the preprocessed network flow dataset. To obtain the image
representation I corresponding to each training instance, a vector $v_1$ of length 41 is generated
from the preprocessed entries of the dataset features and replicated 3 times to generate a
corresponding vector of 123 features, which is converted to a vector $\bar{v}$ of length 128 by
concatenating the first 5 features. For each training/testing instance, $\bar{v}$ is replicated to
generate the corresponding 32 × 32 greyscale representation. After transforming
$\theta \rightarrow I$, the label data is preprocessed as per the requirements of the two-class
structure. The result of the label transformation is represented by $y = y_n \in \{0, 1\}^M$, where
M denotes the total number of classes. The entry of vector $y_n$ is zero if the corresponding image
belongs to normal traffic and 1 otherwise. Both test datasets, NSLKDDTest+ and NSLKDDTest21, are
subjected to the same preprocessing.

B. DEEP CONVOLUTIONAL NEURAL NETWORKS
Like ConvAE, the Deep Convolutional Neural Network (DCNN) also requires input in the form of images,
hence the datasets were subjected to the same preprocessing as for ConvAE. The architecture of the
DCNN model implemented for anomaly detection in this study is shown in Figure 3. The model consists
of an input layer, three convolution and subsampling pairs, and three fully connected layers
followed by an output layer consisting of a single sigmoid unit. A dropout layer (not shown in the
figure) is placed between the flattened model and the first fully connected layer FC1. The dropout
layer, introduced by Srivastava et al. [35], serves as a regularization layer. It randomly drops
units from the DCNN along with their weights during training time. This has the effect of training
an ensemble of neural networks where each member of the ensemble is a subset of the original neural
network. At test time, it is easy to approximate the predictions of all ‘thinned’ subsets by
simply using an ‘‘un-thinned’’ original network with smaller weights. The selected hyper-parameters
include softsign activation, He-normal kernel initialization [36], and the Adadelta optimizer [37]
with a batch size of 64 instances. Additional hyper-parameters of the DCNN included an output layer
of a single sigmoid unit, a drop-out rate of 0.5 and zero-padding at each convolution layer input.

C. LSTM
In the LSTM IDS model, each record was processed as a single-member sequence of a 41-dimensional
vector fed to 32 LSTM units. Two Dense layers with ten (10) and one neuron respectively were
attached to the LSTM outputs to make predictions for the two-class problem. The first Dense layer
used ReLU activation, while the classification layer with a single unit used sigmoid activation to
make predictions. A drop-out layer was introduced between the LSTM output and the MLP input to
thwart overfitting. The LSTM IDS model was trained on the combined training dataset for 15 epochs.

V. IMPLEMENTATION
This section describes the experimental setup, preprocessing of datasets and implementation
details of the deep and conventional models implemented for the experiments.

A. EXPERIMENTAL SETUP
The hardware setup used for implementing the proposed models included:
• CPU : Intel Xeon E-1650 Quad Core
• RAM : 16 GB
• GPU : nVidia GTX 1070 with 1920 CUDA cores and cuda 8.0

B. PREPROCESSING
A network flow, φ, is an ordered set of all packets $\pi_1, \cdots, \pi_n$, where
$\pi_i = \{t_i, S_i, D_i, s_i, d_i, p_i, f_i\}$ represents a packet such that:
1) $\forall \pi_i, \pi_j \in \phi:\ p_i = p_j$
2) $\forall \pi_i, \pi_j \in \phi:\ (S_i = S_j, D_i = D_j, s_i = s_j, d_i = d_j) \vee
(S_i = D_j, S_j = D_i, s_i = d_j, d_i = s_j)$
3) $\forall \pi_{i \neq n} \in \phi:\ (t_i \le t_{i+1}) \wedge (t_{i+1} - t_i \ge \alpha)$
where $t_i, S_i, D_i, s_i, d_i, p_i, f_i$ represent time-stamp, source IP address, destination IP
address, source port, destination port, protocol and TCP flags respectively. Condition 2 is
written as a disjunction because packets of one flow travel either in the same direction or in the
reverse direction between the same pair of endpoints. The IDS problem is approached as a two-class
problem where network flows are either anomalous or normal. The training dataset is prepared by
combining NSLKDDTrain20p and NSLKDDTrain+, which collectively provide 151,165 network flow records.
NSLKDD has 41 features like its predecessor KDDCUP99, and we have used all 41 features. Out of the
41 features, 3 features, ‘protocol-type’, ‘service’ and ‘flag’, are symbolic features which require
conversion into quantitative form before they can be consumed by DNNs. Different techniques
[38]–[40] and [41] have been proposed in the literature for encoding symbolic features as
quantitative features. We studied the impact of different category encoding schemes on the
classification accuracy for the NSLKDD dataset using a conventional classifier. For this purpose we
chose the Decision-Tree algorithm due to its time efficiency. The impact of different encoding
schemes on the dimensionality of the dataset, training time and accuracy of the trained model is
shown in Table 4.

In Table 4, Dimensionality shows the number of new features inserted by the encoding algorithm in
each instance during encoding of the three symbolic features. Average scores show the training
accuracy of the selected Decision-Tree classifier while using a particular encoding scheme. Based on
the performance of the symbolic feature encoders, we chose LeaveOneOutEncoding proposed by [41].

In general, learning algorithms benefit from standardization of the dataset. Since different
feature vectors of the NSLKDD dataset contained different numerical ranges, we applied scaling to
convert raw feature vectors into a more standardized representation for DNNs. As the datasets
contained both normal and anomalous traffic, to avoid the negative influence of outliers on the
sample mean and variance, we used the median and interquartile range (IQR) to scale the data for
better results. We removed the median and scaled the data according to the IQR.

C. IMPLEMENTATION OF DNN MODELS
The software toolchain used to implement all DNNs consists of the Jupyter development environment
using Keras 2.0 [42] on a Theano [43] backend and the nVidia cuda API 8.0 [44]. Both training and
testing datasets were manipulated in the form of numpy arrays. The Python Scikit-learn [45] library
was used for various ML related tasks. Figures and graphs were created using the python matplotlib
and seaborn libraries.
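To make two of the preprocessing steps concrete, the sketch below illustrates the median/IQR scaling of Section V-B and the 41 → 123 → 128 → 32 × 32 image mapping of Section IV. It is a hedged illustration under stated assumptions, not the authors' implementation: it uses plain Python lists and the standard library instead of numpy arrays, all function names are hypothetical, the quartiles use the "inclusive" method of `statistics.quantiles`, the five appended values are assumed to be the first five features of the record, and the 1024 values are assumed to be laid out row by row.

```python
from statistics import median, quantiles

N_FEATURES = 41    # input features per NSLKDD record
IMAGE_SIDE = 32    # side of the square greyscale image


def fit_median_iqr(train_rows):
    """Return one (median, IQR) pair per feature, computed on the training rows only."""
    stats = []
    for column in zip(*train_rows):
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        iqr = q3 - q1
        # Assumption: a constant feature (IQR == 0) is centred but not rescaled.
        stats.append((median(column), iqr if iqr else 1.0))
    return stats


def scale_rows(rows, stats):
    """Remove the median and divide by the IQR, feature by feature."""
    return [[(value - med) / iqr for value, (med, iqr) in zip(row, stats)]
            for row in rows]


def record_to_image(record):
    """Map one 41-feature record to a 32 x 32 nested list of values."""
    if len(record) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(record)}")
    v123 = list(record) * 3                        # replicate 3 times: 41 -> 123
    v128 = v123 + list(record[:5])                 # append first 5 features: 123 -> 128
    copies = IMAGE_SIDE * IMAGE_SIDE // len(v128)  # 1024 / 128 = 8 copies
    flat = v128 * copies
    # Assumption: row-major layout of the 1024 values.
    return [flat[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]
```

Fitting the statistics on the training rows and reusing them for the test rows keeps the test partitions unseen during training, in line with the evaluation protocol described in the introduction.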
FIGURE 4. Comparison of RoC curves of deep neural network IDS models for NSLKDDTest+ dataset.
FIGURE 5. Comparison of RoC curves of deep neural network IDS models for NSLKDDTest21 dataset.
FIGURE 6. Comparison of RoC curves for both deep and conventional IDS models for NSLKDDTest+ dataset.
FIGURE 7. Comparison of RoC curves for both deep and conventional IDS models for NSLKDDTest21 dataset.
accuracy was delivered by LSTM with accuracy scores of 89% and 83% for the NSLKDDTest+ and
NSLKDDTest21 datasets respectively. The sharp difference in accuracy between NSLKDDTest+ and
NSLKDDTest21 across all models is due to the fact that NSLKDDTest21 contains records for attack
types not available in the other NSLKDD train and test datasets. These attack types include
processtable, mscan, snmpguess, snmpgetattack, saint, apache2, httptunnel, back and mailbomb, as
mentioned earlier. This means that the trained models never had the opportunity to see these
attacks during training, as they were not available in the training data.

TABLE 5. Top 5 area under RoC curve results of models for NSLKDDplus and NSLKDD21 datasets.
FIGURE 9. Precision-recall curve and mAP scores of DNN models for NSLKDDTest+ dataset.

D. PRECISION-RECALL CURVE AND MAP
Precision is a measure of the relevancy of results, while recall measures how many of the genuinely
relevant results are returned. High scores for both show that the model is returning accurate
results (high precision) while also returning the majority of positive results (high recall). Each
classifier exhibits a trade-off between precision and recall. Because precision and recall
individually provide only one piece of the puzzle of classifier performance, they are combined to
form the Precision-Recall curve, which presents the relationship between them in a more meaningful
manner. The stair-step nature of the Precision-Recall curve provides insight into this
relationship: a small change in the threshold at the edges of a stair-step considerably reduces
precision with only a small increase in recall.

FIGURE 10. Precision-recall curve and mAP scores of DNN models for NSLKDDTest21 dataset.
FIGURE 11. Precision-recall curve and mAP scores of DNN and conventional models for NSLKDDTest+ dataset.

Figures 9 and 10 depict the precision-recall curves (PRC) and mean Average Precision (mAP), shown
as area under the precision-recall curve in the legends, of the DNN models for both test datasets.
Mean average precision (mAP) summarizes a precision-recall curve as the weighted mean of the
precisions achieved at each threshold, with the increase in recall from the previous threshold used
as the weight, i.e. $\mathrm{AP} = \sum_n (R_n - R_{n-1}) P_n$, where $P_n$ and $R_n$ are the
precision and recall at the n-th threshold. mAP for all tested models is shown in the legends of
Figures 9, 10, 11 and 12. Except SparseAE, all DNN models showed very good results. For
NSLKDDTest+, the LSTM and DCNN models share the top position with mAP scores of 97%, while DCNN
showed marginally improved performance for NSLKDDTest21 with a 98% score. Three models, ContAE,
ConvAE and LSTM, achieved a 97% mAP score for NSLKDDTest21. Figures 11 and 12 show the PRC and mAP
performance of all models. The top six mAP scores are shown in Table 6.

FIGURE 12. Precision-recall curve and mAP scores of DNN and conventional models for NSLKDDTest21 dataset.
TABLE 6. Top 6 mean average precision results from implemented IDS models.

E. TEST AND TRAIN TIMINGS
In this subsection, we provide the train and test timings of the models used in this study. For
DNNs, the GPU is used as the training and testing device, while conventional models were trained
and tested using the CPU. Among the DNNs, ConvAE proved to be the most expensive algorithm because
its training time included both Autoencoder model training and MLP classification model training.
Collectively, the ConvAE IDS model took approximately 367 seconds on the GPU. The DCNN and LSTM
models took 109 and 208 seconds respectively. The smallest training time among the DNN models was
that of the Sparse Autoencoder, but it did not show comparable results. SVM with RBF kernel proved
to be the most expensive model among the conventional IDS models and took approximately
314 seconds. Fastest in the conventional category was Random-Forest, closely followed by Decision
Tree. With smallest training time, the Decision-tree model showed remarkable results and performed
comparably to other, more complex models. Remaining conventional models took each under

FIGURE 13. Training time in seconds for different algorithms used in experiments.
FIGURE 14. Test time for different algorithms used in experiments for both NSLKDDTest+ and NSLKDDTest21 datasets.
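Returning to the mAP metric of Section D, the summary of a precision-recall curve as precisions weighted by recall increments can be written out in a few lines. This is a minimal sketch and not the evaluation code used for the reported figures: the function name is hypothetical, labels are assumed to be 1 for anomalous and 0 for normal traffic, and scores are assumed to be distinct so that every ranked instance defines its own threshold.

```python
def average_precision(labels, scores):
    """Sum of (R_n - R_{n-1}) * P_n over thresholds placed at each ranked score."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive (anomalous) label is required")
    # Rank instances from the highest to the lowest anomaly score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    ap = 0.0
    true_positives = 0
    previous_recall = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / rank
        recall = true_positives / total_positives
        ap += (recall - previous_recall) * precision  # zero weight when recall is unchanged
        previous_recall = recall
    return ap


# Hypothetical scores for five flows, three of them anomalous.
print(average_precision([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.1]))
```

For the five hypothetical flows above the weighted sum is 1/3 + 2/9 + 1/4 = 29/36, and a ranking that places every anomalous flow ahead of every normal one yields 1.0.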
SHERAZ NASEER received the M.S. degree in information security along with distinguished
professional certifications of information security, including CISSP, CoBit, and ITIL. He is
currently pursuing the Ph.D. degree with the University of Engineering & Technology, Lahore. He has
over 10 years of experience in information security and IT. He is an Assistant Professor with the
University of Management and Technology, Lahore, Pakistan. He has been with various information
security positions in financial, consulting, academia, and government sectors. He is very active in
academic research with over six research publications in conferences and journals. His research
interests include cryptography, data driven security, intrusion detection, malware detection, and
application of deep neural networks for information security. His other skills include ISO 27001,
policy and procedure development, IT security reviews and audits, vulnerability assessment and
pen-testing, secure software development, cryptography, log monitoring, and information security
trainings.
YASIR SALEEM received the secondary education (O-level and A-level) from the U.K., the bachelor’s,
master’s, and Ph.D. degrees from the Electrical Engineering Department, University of Engineering
and Technology (UET), Lahore, Pakistan, in 2002, 2004, and 2011, respectively, and the MBA from
ICBS, Lahore, in 2015, for better understanding of management and Industry–Academia relationship.
He is currently an Associate Professor with UET. During his Ph.D., he did his research work for one
semester under supervision of Prof. Dr. Z. Salam at the Renewable Energy and Power Electronics Lab,
Faculty of Electrical Engineering, UTM, Malaysia. He has authored and co-authored journal and
conference papers at national and international levels in the field of electrical and computer
science, and engineering. His research interests include computer networks, information/network
security, DSP, power electronics, computer vision, image processing, simulation and control system.
JIHUN HAN received the B.S. and M.S. degrees in mechanical engineering from the Korea Advanced
Institute of Science and Technology, Daejeon, South Korea, in 2009 and 2011, respectively, where he
is currently pursuing the Ph.D. degree in mechanical engineering. His research interests include
optimal control and predictive control, with an emphasis on their application to intelligent
vehicular and transportation systems, such as hybrid electric vehicles and connected and automated
vehicles.
SHEHZAD KHALID received the degree from the Ghulam Ishaq Khan Institute of Engineering Sciences and
Technology, Pakistan, in 2000, the M.Sc. degree from the National University of Science and
Technology, Pakistan, in 2003, and the Ph.D. degree in informatics from the University of
Manchester, U.K., in 2009. He is the Head of the Computer Vision and Pattern Recognition Research
Group which is a vibrant research group undertaking various research projects. He is currently a
Professor and also the Head of the Department of Computer Engineering, Bahria University, Pakistan.
He is a qualified academician and also a researcher with over 50 international publications in
conferences and journals. His areas of research include but are not limited to shape analysis and
recognition, motion-based data mining and behavior recognition, medical image analysis, ECG
analysis for disease detection, biometrics using fingerprints, vessel patterns of hands/retina of
eyes, ECG, Urdu stemmer development, short and long multi-lingual text mining, and Urdu OCR. He
received the Best Researcher Award from Bahria University in 2014. He has also been a recipient of
the Letter of Appreciation for Outstanding Research Contribution in 2013, and the Outstanding
Performance Award from 2013 to 2014. He is a Reviewer for various leading ISI indexed journals,
such as the Journal of Computer Vision and Image Understanding, the Journal of Visual Communication
and Image Representation, the Journal of Medical Systems, the IEEE TRANSACTIONS ON SYSTEM, MAN AND
CYBERNETICS, and the Journal of Information Sciences.
MUHAMMAD MUNWAR IQBAL received the Ph.D. degree from the Department of Computer Science &
Engineering, University of Engineering and Technology, Lahore, Pakistan, under the supervision of
Dr. Y. Saleem, the M.S. degree in computer science from the COMSATS Institute of Information
Technology, Lahore, in 2011, and the M.Sc. degree in computer science from the University of the
Punjab, Lahore. He is currently an Assistant Professor with the Department of Computer Science,
University of Engineering and Technology, Taxila, Pakistan. He has authored and co-authored journal
and conference papers at the national and international level in the field of computer science. His
interests are machine learning, databases, semantic web, eLearning, and artificial intelligence.