Improving Performance of Autoencoder-Based Network Anomaly Detection On NSL-KDD Dataset
ABSTRACT Network anomaly detection plays a crucial role as it provides an effective mechanism to block or stop cyberattacks. With the recent advancement of Artificial Intelligence (AI), there have been a number of autoencoder (AE)-based deep learning approaches for network anomaly detection that aim to improve our posture towards network security. The performance of existing state-of-the-art AE models used for network anomaly detection varies, and these models offer no holistic approach to understanding how the core set of performance indicators of an AE model affects detection accuracy. In this study, we propose a novel 5-layer autoencoder (AE)-based model better suited for network anomaly detection tasks. Our proposal is based on the results we obtained through an extensive and rigorous investigation of several performance indicators involved in an AE model. In our proposed model, we use a new data pre-processing methodology that transforms and removes the most affected outliers from the input samples to reduce model bias caused by data imbalance across different data types in the feature set. Our proposed model utilizes the most effective reconstruction error function, which plays an essential role in the model's decision on whether a network traffic sample is normal or anomalous. These innovative approaches and the optimal model architecture allow our model to be better equipped for feature learning and dimension reduction, thus producing better detection accuracy as well as F1-score. We evaluated our proposed model on the NSL-KDD dataset, where it outperformed other similar methods by achieving the highest accuracy and F1-score, at 90.61% and 92.26% respectively, in detection.
INDEX TERMS Network security, intrusion detection systems, network-based IDSs, anomaly detection,
NSL-KDD, artificial intelligence, machine learning, deep learning, autoencoders, unsupervised learning.
reconstruction loss between the input and output. The rate of reconstruction loss is used to decide whether a network sample is normal or anomalous. There have been a number of AE-based approaches for network anomaly detection that vary a number of performance indicators of the AE model, including the model architecture, the adoption of different data pre-processing methodologies, and the use of different reconstruction loss schemes. However, these existing state-of-the-art approaches do not offer a holistic approach that examines the impact of the core set of performance indicators for AE models, reports a solid set of investigations into what works and what does not, and proposes the best working AE model for the network anomaly detection task.

We propose a novel 5-layer AE model that is better equipped to accurately identify anomalous network traffic, based on the findings from an extensive investigation of the set of core performance indicators of AE model construction. The contributions of our proposed model are as follows:

• We confirm that there is a high correlation between the quality of data collection (e.g., input samples) and the detection accuracy. Unlike the data pre-processing methodologies adopted by existing state-of-the-art AE models, the best accuracy is achieved when data encoding is done before outlier removal and normalization. Our study found that by performing the data encoding as the first step in the data pre-processing, the data balance across different data types is better maintained, thus reducing model bias during the model training.
• The percentile rule provides a simple yet effective non-parametric method to identify outliers, which is especially useful for obtaining an adequate reconstruction loss distribution when training an AE model. It also offers the flexibility to tune the model for better performance by changing the percentile used in the outlier removal process.
• We validated the impact of different reconstruction loss functions on detection accuracy. Though the difference is not large, the Mean Absolute Error (MAE)-based reconstruction loss function provided the best accuracy for the AE model used in network anomaly detection.
• We studied the impact of different AE-based model architectures on performance. The best performing AE model was the 5-layer model, which consists of 1 input layer, 2 dense layers, 1 bottleneck layer, and 1 output layer. There was no significant difference in performance – less than 5% variation in both the accuracy and F1-score ranges – though the number of hidden layers and the number of neurons differed across the AE architectures. Our experimental results also illustrate that the model architecture has less influence on the performance than the data selection.
• The best performing AE model was achieved under the following performance indicator conditions: (1) when 95 percent of the feature-wise normal data was retained after one-hot encoding to train the autoencoder model, (2) when the MAE-based reconstruction loss function was used, and (3) with the 5-layer model architecture whose numbers of neurons are [122-32-5-32-122]. We test our proposed approach on the NSL-KDD dataset [8], the most widely used recent public dataset for intrusion detection methods, and obtained the high performance of 90.61% accuracy and 92.26% F1-score, which outperformed other similar methods.

We organize the rest of the paper as follows. We examine the related work in Section II. We provide the details of our proposed AE model along with its architecture and the algorithm in Section III. In Section IV, we provide the details of the NSL-KDD dataset and the data pre-processing methodology that works better for AE-based network anomaly detection. In Section V, we describe the experimental results, including the experiment setup, the description of the performance metrics, and the results. Finally, we provide a conclusion of our work and present future work directions in Section VI.

II. RELATED WORK
Anomaly detection using machine learning techniques has gained popularity in recent years over traditional signature-based intrusion detection methods [9], [10]. Owing to the automated nature of machine learning techniques, it is now possible to build different machine learning methods without the strong involvement of human domain experts [9], which was often a limitation and expensive. Depending on the existence of labels in the model training, proposed methods are categorized as either supervised or unsupervised learning algorithms. In the realm of supervised machine learning-based network intrusion detection, the problem becomes a classification task. To identify whether a traffic sample is an attack or not, researchers explored different binary classification algorithms to acquire a highly accurate detection rate. The authors in [8] used the J48 model on the KDD99 dataset to achieve an accuracy of 93.82%, and the Naïve Bayes Tree (NBTree) on the NSL-KDD dataset to achieve 82.02% accuracy. A number of methods using Decision Tree (DT), Naïve Bayes Network (NB), and Support Vector Machine (SVM) were introduced for network anomaly detection in [11]. The authors in [12] employed fuzzy logic in anomaly detection and obtained an accuracy of 84.54% in their experiment. The authors of [13] proposed an Artificial Neural Network (ANN) model and reported 81.2% accuracy on the NSL-KDD dataset. Hybrid models combining different state-of-the-art algorithms to deliver an improvement in detection performance were also proposed. For example, Kevric et al. [14] illustrated that combining two tree algorithms gains better performance than individual tree classification, and they reported that the best combination was the random tree and NB tree, with an accuracy of 89.24% on the KDD dataset. The autoencoder (AE), commonly used for feature extraction, has been widely used in the first stage of hybrid models. A benefit of using an AE is that it generates a condensed representation of the original input by removing noise from it [9], [15], [16]. Yousefi-Azar et al. [17] used an AE for feature learning, then used supervised machine learning algorithms such
as SVM and KNN for classification to achieve 83.3% accuracy. Similarly, Al-Qatf et al. [7] combined an AE and SVM together and obtained an 84.96% accuracy rate on the KDD dataset for binary classification. Their proposed approach also used an AE to reduce dimensions and learn the feature representation. Javaid et al. [18] proposed a sparse autoencoder for feature learning and a soft-max regression-based neural architecture for classification, and they achieved 88.39% accuracy in intrusion detection. Though the supervised learning approaches (including hybrid ones) gained high performance in numerical results, their success highly relied on correct labels and balanced data in the training process, which means they could only efficiently classify unseen samples by training with a large amount of similar data with corresponding labels [10]. However, in the network intrusion detection field, very little intrusion data is publicly available due to complex reasons, e.g., privacy issues and data confidentiality [19]. To address this limitation, unsupervised learning methods (e.g., the autoencoder (AE)) using anomaly detection approaches have been introduced that use only benign samples in the training phase. Ieracitano et al. [20] analyzed the NSL-KDD dataset with a statistical approach and tested it with a simple 3-layer AE architecture. They obtained 84.21% accuracy in binary classification. The authors in [21] automated threshold learning for anomaly detection in an autoencoder-based model and achieved a high of 88.98% accuracy.

The majority of these existing works [7], [12], [13], [20] use an encoding mechanism for the categorical (nominal) features in the dataset after different data pre-processing procedures when processing the features of the NSL-KDD dataset. We argue that their methodologies introduce data imbalance issues because they remove categorical values too early in the data pre-processing stage, which significantly affects the performance of the proposed models. The studies in [12], [13] analyze the features of the input samples using different clustering mechanisms applied to detect the most optimal number of outliers and to reduce the dimension of the features. We argue that these methods are neither applicable nor generalizable to other datasets in similar models. The study done by [20] only analyzes the outliers in the numerical features, leaving the symbolic features untouched. We argue that this also creates a bias because the symbolic features most likely also have outliers, and these need to be handled properly.
III. AUTOENCODER-BASED NETWORK ANOMALY DETECTION

A. GENERIC MODEL
An autoencoder (AE) is an unsupervised feed-forward neural network used for the reconstruction of its input. An AE is composed of an input layer, an output layer, and one or more hidden layers. It has a symmetrical pattern: the output layer has the same number of neurons as the input layer, while any hidden layer generally has fewer neurons than the input and output layers.

The bottleneck layer, also referred to as the latent space, is the hidden layer which has the smallest number of neurons. The latent space contains the compressed representation of the input. The mechanism of an autoencoder attempts to reconstruct the input at the output, i.e., to produce an output similar to the input, x̂ ≈ x. An example of a generic autoencoder is shown in Fig. 1.

FIGURE 1. A generic autoencoder model.

A generic autoencoder architecture consists of two operations, encoding and decoding respectively.

In the encoding operation, any input sample x is an m-dimensional vector [x_1, x_2, x_3, ..., x_m] and is mapped to the hidden layer representation (y), as shown in equation (1).

y = f_1(wx + b) (1)

where f_1 is an activation function for the encoder, w represents the weight matrix, and b is a bias vector.

In the decoding operation, the hidden representation (y) is mapped back into a reconstruction x̂, as shown in equation (2).

x̂ = f_2(w'y + b') (2)

where f_2 is an activation function for the decoder, and w' and b' represent the weights and bias for the output layer.

During these two operations, the neural network's parameters θ = (w, w', b, b') are continuously optimized by minimizing the reconstruction error. To minimize the reconstruction error on x with non-linear functions, the reconstruction loss (L) is calculated from equation (3).

L(x, x̂) = (1/m) Σ_{i=1}^{m} (x_i − x̂_i)² (3)
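To make the encode-decode mapping of equations (1)-(3) concrete, the following is a minimal NumPy sketch of a single forward pass. The dimensions, random parameters, and the choice of ReLU for f_1 and f_2 are illustrative assumptions only, and the parameter-update (training) step is omitted.

import numpy as np

def relu(z):
    # ReLU activation, used here for both f1 and f2 (illustrative choice)
    return np.maximum(0.0, z)

rng = np.random.default_rng(seed=42)
m, h = 6, 3                                     # m input features, h latent units (arbitrary)
w, b = rng.normal(size=(h, m)), np.zeros(h)     # encoder parameters
w2, b2 = rng.normal(size=(m, h)), np.zeros(m)   # decoder parameters

x = rng.random(m)                               # an m-dimensional input sample
y = relu(w @ x + b)                             # equation (1): encoding
x_hat = relu(w2 @ y + b2)                       # equation (2): decoding
loss = np.mean((x - x_hat) ** 2)                # equation (3): reconstruction loss
print(f"reconstruction loss: {loss:.4f}")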
B. OUR MODEL
The AE model in network anomaly detection tasks uses the reconstruction error to decide whether a network traffic sample is anomalous or not. Intuitively, when an AE trained on a normal network traffic dataset presents low reconstruction error, a network sample showing high reconstruction error during the testing phase should be considered an anomaly. Our proposed model is built on this concept; the details are in Algorithm 1.

Algorithm 1: AE-Based Network Anomaly Detection
Input: Training dataset S = {X_1, X_2, X_3, ..., X_n};
       Testing dataset N = {X'_1, X'_2, X'_3, ..., X'_n}
       /* X and X' are both m-dimensional vectors */
       Encoder E_φ; Decoder D_θ
Output: AnomalySet, NormalSet
begin
    /* Step 1: Training Phase */
    φ, θ ← initialize parameters
    /* training in mini-batches */
    for number of training iterations do
        sample a mini-batch of k samples {X_1, X_2, X_3, ..., X_k} from S
        /* calculate the sum of the mini-batch loss */
        V(E, D) = (1/m) Σ_{i=1}^{k} (X_i − D_θ(E_φ(X_i)))²
        φ, θ ← update parameters using stochastic gradient descent on V(E, D)
    end
    /* obtain the threshold from the training dataset */
    for each X ∈ S do
        X̂ = D_θ(E_φ(X))
        /* reconstruction loss metric: MAE */
        L(X, X̂) = |X − X̂|
    end
    Threshold α = max(L)
    /* Step 2: Testing Phase */
    for each X' ∈ N do
        L(X') = |X' − D_θ(E_φ(X'))|
        if L(X') > α then
            X' is an anomaly; insert X' into AnomalySet
        else
            X' is not an anomaly; insert X' into NormalSet
        end
    end
end

In the training phase, the original features of the network traffic are extracted and reduced by the encoding operation and then represented in the latent space. The latent space is then used to reconstruct the output. The difference between the output traffic sample and the original traffic sample is compared, and a reconstruction error is computed. Once all traffic samples are processed by the model, the maximum value of all reconstruction errors is marked as the threshold to identify anomalies.

During the testing phase, network traffic samples are input to the trained AE model and a reconstruction error is again calculated; it is now called an anomaly score. The anomaly score is compared with the threshold value obtained during the training phase. If the anomaly score is larger than the threshold, the traffic sample is considered anomalous.

In this study, we propose a 5-layer AE architecture. The AE encodes the 122-dimensional feature representation (x) into a 32-dimensional vector, which is further reduced to a 5-dimensional vector, and then decodes it back to the same input feature space. The proposed AE [122-32-5-32-122] is trained in an unsupervised manner using mini-batch stochastic gradient descent. All the hidden layers are dense layers (i.e., fully connected layers that connect all neurons from the previous layer) using rectified linear units (ReLU) as the activation function instead of the sigmoid function in the compression and reconstruction operations for faster computation. The reconstruction error between x and x̂ is quantified using the MAE value. Fig. 2 demonstrates the architecture of our proposed model.

FIGURE 2. Our proposed 5-layer AE model.
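As an illustration of the proposed [122-32-5-32-122] architecture and the threshold rule of Algorithm 1, below is a minimal sketch using TensorFlow/Keras. The batch size, optimizer settings, and the sigmoid output activation (a common choice when inputs are scaled to [0, 1]) are our assumptions rather than details confirmed by the paper, and x_train / x_test stand in for the pre-processed KDDTrain+ and KDDTest+ matrices.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_autoencoder(input_dim: int = 122) -> tf.keras.Model:
    # [122-32-5-32-122] fully connected AE with ReLU hidden layers
    inputs = tf.keras.Input(shape=(input_dim,))
    h = layers.Dense(32, activation="relu")(inputs)             # dense layer
    z = layers.Dense(5, activation="relu")(h)                   # bottleneck (latent space)
    h = layers.Dense(32, activation="relu")(z)                  # dense layer
    outputs = layers.Dense(input_dim, activation="sigmoid")(h)  # reconstruction (assumed activation)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="sgd", loss="mae")                  # mini-batch SGD, MAE reconstruction loss
    return model

# x_train: normal KDDTrain+ samples after pre-processing, shape (n, 122)
ae = build_autoencoder()
ae.fit(x_train, x_train, epochs=50, batch_size=128,             # batch size assumed
       shuffle=True, validation_split=0.2)

# Step 1: threshold = maximum per-sample MAE over the training set
train_mae = np.mean(np.abs(x_train - ae.predict(x_train)), axis=1)
alpha = train_mae.max()

# Step 2: flag a test sample as anomalous if its anomaly score exceeds alpha
test_mae = np.mean(np.abs(x_test - ae.predict(x_test)), axis=1)
anomalies = test_mae > alpha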
IV. DATA AND METHODOLOGIES
In this section, we describe the details of the data we use for our study (i.e., NSL-KDD), the methodology we employed for data processing, and the workflow of our proposed model. The NSL-KDD dataset has two subsets, KDDTrain+ and KDDTest+, respectively. Though both subsets contain both normal and abnormal network traffic samples, we only use the normal network traffic samples from the KDDTrain+ for training.

As seen in Fig. 3, we first use only the KDDTrain+ dataset after applying a number of data pre-processing techniques, such as one-hot-encoding to transform the categorical features into numeric data, disposal of outliers, and normalization that scales the dataset to fit in the range of [0, 1]. After pre-processing the KDDTrain+ dataset, we fit the dataset into our proposed AE model, which computes the threshold (i.e., the reconstruction error rate associated with the normal traffic pattern). At the testing phase, the KDDTest+ dataset is used on the trained AE to calculate an anomaly score (i.e., a reconstruction error calculated in the same manner as the threshold). The underlying assumption is that the reconstruction error rate calculated for the normal traffic must differ from the reconstruction error rate calculated for anomalous traffic.
A. NSL-KDD DATASET
NSL-KDD is a dataset suggested to solve many inherent problems [8] associated with earlier versions (e.g., KDDCup99) used for network intrusion detection. Though the dataset may not be a perfect representative of existing real networks, because of the lack of public datasets for network-based IDSs, it is often regarded as one of the most widely used recent network intrusion datasets that can be applied as an effective benchmark to compare different intrusion detection methods, along with UNSW-NB15 and CICIDS-2017.

We use two subsets of the NSL-KDD dataset, KDDTrain+ and KDDTest+ respectively, for AE model training and evaluation. Though both KDDTrain+ and KDDTest+ contain multiple class labels, we re-classify them into two categories, i.e., whether a traffic sample contained in these datasets is normal or abnormal, to focus on the impacts of the major performance indicators.
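A minimal pandas sketch of this re-labeling step follows; the file paths and column layout (41 features plus an attack label and a difficulty score) are our assumptions about the distributed NSL-KDD files, and the column names are hypothetical.

import pandas as pd

# Hypothetical layout: 41 features, an attack label, and a difficulty score
cols = [f"f{i}" for i in range(41)] + ["label", "difficulty"]
train = pd.read_csv("KDDTrain+.txt", header=None, names=cols)
test = pd.read_csv("KDDTest+.txt", header=None, names=cols)

# Re-classify the multi-class attack labels into normal (0) vs. abnormal (1)
train["binary"] = (train["label"] != "normal").astype(int)
test["binary"] = (test["label"] != "normal").astype(int)

# Only normal samples from KDDTrain+ are used to train the AE
normal_train = train[train["binary"] == 0].drop(columns=["label", "difficulty", "binary"])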
As illustrated in Table 1, the KDDTrain+ dataset contains a total of 125,973 records, of which 67,343 are labelled as ‘‘normal’’ while 58,630 are labelled as ‘‘abnormal’’. Similarly, the KDDTest+ contains a total of 22,544 records, of which 9,711 are labelled as ‘‘normal’’ while 12,833 are labelled as ‘‘abnormal’’.

FIGURE 4. The visualisation of PCA for NSL-KDD dataset.

Each traffic sample in the NSL-KDD dataset contains a total of 41 features, including 38 numeric values (e.g., ‘‘int64’’ or ‘‘float64’’) and 3 categorical values (e.g., ‘‘object’’). Table 2 shows the name and data type of all 41 features.

TABLE 2. NSL-KDD dataset features: 38 numeric and 3 categorical.

B. DATA PRE-PROCESSING
We go through three different data pre-processing procedures to organize and transform the NSL-KDD datasets before feeding them into the AE model. These include one-hot-encoding, outlier disposal, and min-max normalization.
1) ONE-HOT-ENCODING
To increase the efficiency of model training, AE models demand that non-numerical features (e.g., categorical values) be converted into numerical values. We use the one-hot-encoding technique to convert categorical features into n-dimensional vectors of binary code, where ‘‘n’’ is determined by the total number of attributes in the categorical feature. Take the feature ‘‘protocol_type’’ in the NSL-KDD dataset for example, where there are three distinct attributes, ‘‘tcp’’, ‘‘udp’’ and ‘‘icmp’’, each of which is encoded into a 3-dimensional binary vector: [1,0,0], [0,1,0] and [0,0,1] respectively. In other words, the single feature ‘‘protocol_type’’ is encoded into three features by one-hot-encoding. In the NSL-KDD dataset, there are three categorical features (namely ‘‘protocol_type’’, ‘‘service’’, and ‘‘flag’’), which have 3, 70, and 11 distinct attributes respectively. These are converted into a total of 84 features. Combined with the 38 numerical features, we now have a total of 122 features produced after the one-hot-encoding is applied.
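A minimal sketch of this encoding step with pandas follows, continuing the hypothetical DataFrame from the earlier sketch; the assertion only holds when every attribute value occurs in the data.

import pandas as pd

categorical = ["protocol_type", "service", "flag"]  # 3 + 70 + 11 distinct attributes

# pd.get_dummies expands each categorical column into one binary column per
# attribute, e.g., protocol_type -> protocol_type_tcp / _udp / _icmp
encoded = pd.get_dummies(normal_train, columns=categorical, dtype=float)

# 122 = 38 numeric + 84 one-hot features (when all attribute values are present)
assert encoded.shape[1] == 122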
2) OUTLIER ANALYSIS
An outlier is a data point that differs significantly from other data points in a dataset [22]. The source of an outlier varies. In our study, we specify an outlier as a feature in a dataset containing an extreme value that deviates from what we consider the ‘‘normal’’ range. It is important to remove such outliers because they tend to introduce bias into the correct calculation of the weights. This makes AE models less sensitive to anomalies and consequently decreases the accuracy of anomaly detection. To address this issue, we remove outliers before model training.

Towards outlier disposal, the first and foremost step is to identify outliers. Several outlier detection methods in statistics have been introduced in the literature. Tukey's fences [23] is one of the common methods used for outlier detection, as it calculates the outlier fence with the use of the interquartile range (IQR). The formula is depicted as follows:

[Q1 − k(Q3 − Q1), Q3 + k(Q3 − Q1)] (4)

where Q1, Q3, and k represent the lower quartile, the upper quartile, and a coefficient respectively. If the coefficient k = 1.5 and the test data is out of the IQR range, the test data will be regarded as an ‘‘outlier’’, and k = 3 means the data is ‘‘far out’’. However, this method alone is not practicable for the KDDTrain+ dataset because the distribution of the dataset is extremely imbalanced. In fact, 21 out of the 38 numerical features of the KDDTrain+ dataset have both Q1 and Q3 equal to the minimum value of zero. Hence, a massive number of mislabelled outliers may be produced.

Another popular outlier analysis method is the Z-score [24], [25]. The Z-score is calculated with the following formula:

Z_i = (X_i − X̄) / σ (5)

where X̄ and σ are the mean and standard deviation of the distribution of the feature X, and X_i is the attribute of the ith sample in that feature. The Z-score assumes that the feature is independent of other features and that the distribution of the feature follows a normal distribution. The three-sigma rule [26], also called the 68-95-99.7 rule, is generally applied for outlier identification with the Z-score. The rule expresses that about 68% of the instances lie within one sigma (or standard deviation) of the mean value, about 95% of the instances within two sigmas, and about 99.7% within three sigmas.

For our study, we adopted the outlier fence concept and chose a variation of the two-sigma (95%) effect for outlier detection. The proposed outlier detection method is called the 95th percentile rule: any sample that has an attribute greater than the 95th percentile of all instances in that feature is regarded as an outlier. All identified outliers are removed from the dataset afterward. Algorithm 2 depicts the process of the proposed outlier disposal.

Algorithm 2: Outlier Disposal
S: samples of the dataset {s_1, s_2, ..., s_m}
F: features in the samples {f_1, f_2, ..., f_n}
Calculate the upper outlier fence of each feature: OF = {of_1, of_2, ..., of_n}
for each s_j ∈ S do
    for each f_i ∈ F do
        if s_j.f_i > of_i then
            delete s_j
            break
        end
    end
end

Our hybrid approach of outlier removal has three distinct advantages compared to other similar statistical methods. Firstly, our hybrid outlier removal approach makes no assumption about the distribution of samples, so the method can be applied to any dataset. Secondly, the 95th percentile is the upper outlier fence, and no lower outlier fence is set in the experiment due to the analysis of the distribution of the KDDTrain+ dataset. As mentioned earlier, 75% of the samples with numerical values have the minimum value 0, so the lower outlier fence is equal to the minimum value 0; in other words, no lower outlier fence is necessary. The last advantage is that
we identify outliers after encoding the categorical features, which means the outlier detection rules are applied to the one-hot encoded features as well.

Note that only the outliers in the training dataset are removed, because only ‘‘normal’’ samples in the KDDTrain+ are used for the model training. Its sample size changed from 67,343 to 39,252 after our hybrid approach was applied.
from 67,343 to 39,252 after our hybrid approach was applied. We have made a number of different observations to under-
stand the performance implications both during the training
3) DATA NORMALIZATION and testing phases.
Normalization eliminates the impacts of different scales
across features thus reduces the execution time for model 1) DATA REPRESENTATION AT THE LATENT SPACE
training. The min-max normalization is applied after outliers Our model has been trained with 80% of the training dataset
are moved. This method maps the original range of each (presented as TrainingSet in Fig. 3) while validated with the
feature into a new range with the formula: left 20% (ValidationSet in Fig. 3) for 50 epochs. The training
dataset is shuffled at the beginning of each epoch to avoid
X − Xmin
Xstd = (6) overfitting.
Xmax − Xmin The visualization of the distribution between the normal
Xscaled = Xstd ∗ (max − min) + min (7) and abnormal samples across KDDTrain+ and KDDTest+
where min, max = (0, 1) in default are used in this experiment in the latent space, in which the data lies in the bottleneck
to normalize all numerical features [27]. layer, is shown in Fig. 5. The latent space contains a com-
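Equations (6)-(7) correspond directly to scikit-learn's MinMaxScaler [27]; a brief sketch follows, where fitting on the cleaned training set only (and reusing its min/max on the test set) is our assumption, and encoded_test is a placeholder for the encoded KDDTest+ matrix.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))   # defaults min, max = (0, 1)
x_train = scaler.fit_transform(clean_train)   # fit on training data only (assumed)
x_test = scaler.transform(encoded_test)       # reuse the training min/max on KDDTest+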
V. EXPERIMENTAL RESULTS
In this section, we provide the details of the performance metrics used in our experiment and the analysis of the results.

A. PERFORMANCE METRICS
To evaluate the performance of our proposed model, we use classification accuracy, precision, recall, and F1-score. We follow the convention which labels the normal samples as class 0 and the anomalous samples as class 1. Table 3 illustrates the confusion matrix, where True Positive (TP) is the number of correctly labeled cases for class 1 (in our case anomalous traffic samples), True Negative (TN) is the number of correctly labeled class 0 cases (in our case normal traffic samples), False Positive (FP) is the number of class 0 cases that are incorrectly labeled as class 1, and False Negative (FN) is the number of class 1 samples mis-classified as class 0.

TABLE 3. Confusion matrix.

True Positive Rate (TPR), also called Recall or sensitivity, indicates the proportion of data points correctly classified as anomalous data points, as shown in Equation 8.

TPR/Recall = TP / (TP + FN) (8)

Precision (Pr) denotes the proportion of TP data points among all data points predicted as positive, which is also known as the positive predictive value, as shown in Equation 9.

Pr = TP / (TP + FP) (9)

Accuracy (Acc) measures the proportion of correct predictions, i.e., the proportion of the number of correctly classified data points to the total data points for a given dataset, as shown in Equation 10.

Acc = (TP + TN) / (TP + TN + FP + FN) (10)

F1-score (F1) denotes the harmonic mean of recall (or TPR) and precision, as shown in Equation 11.

F1 = 2 ∗ TP / (2 ∗ TP + FP + FN) (11)
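Equations (8)-(11) can be computed directly from the predicted labels; a small sketch using scikit-learn's metric helpers follows, where y_true and y_pred are placeholder arrays.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: ground-truth labels (0 = normal, 1 = anomalous)
# y_pred: labels produced by thresholding the anomaly scores
acc = accuracy_score(y_true, y_pred)               # equation (10)
pr = precision_score(y_true, y_pred, pos_label=1)  # equation (9)
tpr = recall_score(y_true, y_pred, pos_label=1)    # equation (8)
f1 = f1_score(y_true, y_pred, pos_label=1)         # equation (11)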
B. RESULTS
We have made a number of different observations to understand the performance implications during both the training and testing phases.

1) DATA REPRESENTATION AT THE LATENT SPACE
Our model was trained with 80% of the training dataset (presented as TrainingSet in Fig. 3) and validated with the remaining 20% (ValidationSet in Fig. 3) for 50 epochs. The training dataset is shuffled at the beginning of each epoch to avoid overfitting.

The visualization of the distribution of the normal and abnormal samples across KDDTrain+ and KDDTest+ in the latent space, in which the data lies in the bottleneck layer, is shown in Fig. 5. The latent space contains a compressed representation of the traffic samples, which is the only information the decoder uses to try to reconstruct the input. For the model to perform well, it has to learn to extract the most relevant features in the bottleneck. Fig. 5 (a) shows two distinct clusters, one clearly belonging to the normal samples and the other to the abnormal samples, clustered yet widely scattered around, while Fig. 5 (b) shows the normal samples scattered widely across a wider space with no visible cluster formed.

FIGURE 5. The visualization of latent space representation of KDDTrain+ and KDDTest+.

2) IMPACT OF RECONSTRUCTION LOSS FUNCTIONS
The aim of this experiment was to understand the sensitivity of different reconstruction loss functions to the detection accuracy. The three reconstruction loss functions
were studied: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Squared Error (MSE), respectively. The definitions of these functions are described in the following equations.

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x̂_i)² ) (12)
MAE = (1/n) Σ_{i=1}^{n} |x_i − x̂_i| (13)
MSE = (1/n) Σ_{i=1}^{n} (x_i − x̂_i)² (14)

where n indicates the total number of traffic samples, x_i is the representation of the original input sample, and x̂_i is the reconstructed output.
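The three loss functions in equations (12)-(14) translate directly into NumPy; a brief sketch follows, applied to an input x and its reconstruction x̂.

import numpy as np

def rmse(x, x_hat):
    return np.sqrt(np.mean((x - x_hat) ** 2))  # equation (12)

def mae(x, x_hat):
    return np.mean(np.abs(x - x_hat))          # equation (13)

def mse(x, x_hat):
    return np.mean((x - x_hat) ** 2)           # equation (14)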
Table 4 illustrates that there are variations in the threshold values computed depending on the reconstruction loss function used. Though there are differences in the threshold values, both RMSE and MSE provide identical values in the four different performance metrics. Though the differences are small, MAE came out as the reconstruction loss function that works best for the AE models used in network anomaly detection compared to the other two functions. As shown in Fig. 6, the reconstruction loss values of the samples with the label ‘‘abnormal’’ tend to spread widely, with the majority of the values bigger than the threshold.

FIGURE 6. Threshold and the distribution of reconstruction loss from different metrics of KDDTest+, reconstruction loss range within [0,2].

3) IMPACT OF OUTLIERS
Next, we studied the relationship between the percentile of outliers removed from the input samples and the accuracy on the KDDTest+. As seen in Table 5, the detection accuracy improved as outliers were removed, until the 95th percentile rule was applied, which peaked the accuracy at 90.61% when the top 5% of outliers were removed from the input samples.

TABLE 5. Performance with different percentiles on KDDTest+.
TABLE 6. Performance of AE with different model architecture.

FIGURE 7. The top 5 correlated features within 122 encoded features: ‘‘dst_host_serror_rate’’, ‘‘flag_RSTP’’, ‘‘service_whois’’, ‘‘srv_serror_rate’’ and ‘‘flag_RSTOS0’’.

5) COMPARISON WITH OTHER SIMILAR METHODS
We compared the performance of our proposed model with other similar models using four metrics, namely accuracy, precision, recall, and F1-score. Table 7 illustrates that our proposed AE model obtains an accuracy higher than 90% and the highest F1-score of 92.26%.

TABLE 7. Performance comparison with other approaches on KDDTest+.

VI. CONCLUSION
We propose a novel 5-layer AE-based model better suited for detecting anomalous network traffic. The main components and architecture of our proposed model were obtained from the results of an extensive and rigorous study examining the impact of the major performance indicators of an AE model on the detection accuracy. Our experimental results show that our proposed 5-layer architecture model achieves the highest accuracy with the proposed two-sigma (95th percentile) outlier disposal approach and MAE as the reconstruction loss metric.

Our model uses an innovative data pre-processing methodology that effectively transforms the input datasets to contain more balanced data samples in terms of data type and data size, as well as removing outliers that would most likely affect the detection bias. The effectiveness of the proposed data pre-processing methodology was obtained by analyzing the
impact of the percentile rule in the outlier disposal stage. Our model utilizes MAE as the basis of the reconstruction loss function, which provides the best accuracy for the AE model used in network anomaly detection. Our 5-layer architecture, with the optimized number of neurons used at each hidden and latent space layer, provides the best performance compared to other model architectures. We evaluated our proposed model on the widely used NSL-KDD dataset. The test results demonstrate that our approach generates the best performance at 90.61% accuracy, 86.83% precision, 98.43% recall, and 92.26% F1-score compared to other similar models.

Our experimental results also confirm that among the performance indicators of an AE model, which include data pre-processing, the reconstruction loss metric, and the model architecture, data pre-processing has the largest effect on the performance. Though currently trained on the NSL-KDD dataset, our proposed model is equipped and tested to recognize any abnormal network traffic pattern deviating from normal traffic patterns very efficiently. Though the characteristics of intrusion samples may differ in other intrusion datasets, we believe our model can still work very well in detecting any abnormal patterns. However, further studies are required to test how effectively our proposed model can work in real-world large-scale operational network environments by incorporating deeper semantic insights into real systems' capabilities and limitations.

We have plans in place to apply different types of intrusion attack samples (e.g., Android-based malware samples [30] or ransomware [31], [32]) and other dataset samples from other applications (e.g., indoor air quality (IAQ) [24], [25], [33], medical annotations) to test the generalizability and practicability of our model. We also plan to extend our current work to multi-class classification.
REFERENCES
[1] Y. B. Zikria, R. Ali, M. K. Afzal, and S. W. Kim, ‘‘Next-generation Internet of Things (IoT): Opportunities, challenges, and solutions,’’ Sensors, vol. 21, no. 4, p. 1174, Feb. 2021.
[2] F. A. M. Khiralla, ‘‘Statistics of cybercrime from 2016 to the first half of 2020,’’ Int. J. Comput. Sci. Netw., vol. 9, no. 5, pp. 252–261, 2020.
[3] J. Jang-Jaccard and S. Nepal, ‘‘A survey of emerging threats in cybersecurity,’’ J. Comput. Syst. Sci., vol. 80, no. 5, pp. 973–993, 2014.
[4] J. L. McClelland, Parallel Distributed Processing, vol. 2. Cambridge, MA, USA: MIT Press, 1986.
[5] B. Zhang, Y. Yu, and J. Li, ‘‘Network intrusion detection based on stacked sparse autoencoder and binary tree ensemble method,’’ in Proc. IEEE Int. Conf. Commun. Workshops (ICC Workshops), May 2018, pp. 1–6.
[6] B. Yan and G. Han, ‘‘Effective feature extraction via stacked sparse autoencoder to improve intrusion detection system,’’ IEEE Access, vol. 6, pp. 41238–41248, 2018.
[7] M. Al-Qatf, Y. Lasheng, M. Al-Habib, and K. Al-Sabahi, ‘‘Deep learning approach combining sparse autoencoder with SVM for network intrusion detection,’’ IEEE Access, vol. 6, pp. 52843–52856, 2018.
[8] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, ‘‘A detailed analysis of the KDD CUP 99 data set,’’ in Proc. IEEE Symp. Comput. Intell. Secur. Defense Appl., Jul. 2009, pp. 1–6.
[9] H. Liu and B. Lang, ‘‘Machine learning and deep learning methods for intrusion detection systems: A survey,’’ Appl. Sci., vol. 9, no. 20, p. 4396, Oct. 2019.
[10] Z. Ahmad, A. S. Khan, C. W. Shiang, J. Abdullah, and F. Ahmad, ‘‘Network intrusion detection system: A systematic study of machine learning and deep learning approaches,’’ Trans. Emerg. Telecommun. Technol., vol. 32, no. 1, p. e4150, Jan. 2021.
[11] S. Agrawal and J. Agrawal, ‘‘Survey on anomaly detection using data mining techniques,’’ Proc. Comput. Sci., vol. 60, pp. 708–713, Jan. 2015.
[12] R. A. R. Ashfaq, X.-Z. Wang, J. Z. Huang, H. Abbas, and Y.-L. He, ‘‘Fuzziness based semi-supervised learning approach for intrusion detection system,’’ Inf. Sci., vol. 378, pp. 484–497, Feb. 2017.
[13] B. Ingre and A. Yadav, ‘‘Performance analysis of NSL-KDD dataset using ANN,’’ in Proc. Int. Conf. Signal Process. Commun. Eng. Syst., Jan. 2015, pp. 92–96.
[14] J. Kevric, S. Jukic, and A. Subasi, ‘‘An effective combining classifier approach using tree algorithms for network intrusion detection,’’ Neural Comput. Appl., vol. 28, no. 1, pp. 1051–1058, Dec. 2017.
[15] P. Mishra, V. Varadharajan, U. Tupakula, and E. S. Pilli, ‘‘A detailed investigation and analysis of using machine learning techniques for intrusion detection,’’ IEEE Commun. Surveys Tuts., vol. 21, no. 1, pp. 686–728, 1st Quart., 2019.
[16] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, ‘‘Auto-encoder bottleneck features using deep belief networks,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2012, pp. 4153–4156.
[17] M. Yousefi-Azar, V. Varadharajan, L. Hamey, and U. Tupakula, ‘‘Autoencoder-based feature learning for cyber security applications,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 3854–3861.
[18] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, ‘‘A deep learning approach for network intrusion detection system,’’ EAI Endorsed Trans. Secur. Saf., vol. 3, no. 9, p. e2, May 2016.
[19] R. Sommer and V. Paxson, ‘‘Outside the closed world: On using machine learning for network intrusion detection,’’ in Proc. IEEE Symp. Secur. Privacy, May 2010, pp. 305–316.
[20] C. Ieracitano, A. Adeel, F. C. Morabito, and A. Hussain, ‘‘A novel statistical analysis and autoencoder driven intelligent intrusion detection approach,’’ Neurocomputing, vol. 387, pp. 51–62, Apr. 2020.
[21] K. Sadaf and J. Sultana, ‘‘Intrusion detection based on autoencoder and isolation forest in fog computing,’’ IEEE Access, vol. 8, pp. 167059–167068, 2020.
[22] G. S. Maddala and K. Lahiri, Introduction to Econometrics, vol. 2. New York, NY, USA: Macmillan, 1992.
[23] J. W. Tukey, Exploratory Data Analysis, vol. 2. Reading, MA, USA: Addison-Wesley, 1977.
[24] Y. Wei, J. Jang-Jaccard, F. Sabrina, and T. McIntosh, ‘‘MSD-Kmeans: A novel algorithm for efficient detection of global and local outliers,’’ 2019, arXiv:1910.06588. [Online]. Available: http://arxiv.org/abs/1910.06588
[25] Y. Wei, J. Jang-Jaccard, F. Sabrina, and H. Alavizadeh, ‘‘Large-scale outlier detection for low-cost PM10 sensors,’’ IEEE Access, vol. 8, pp. 229033–229042, 2020.
[26] F. Pukelsheim, ‘‘The three sigma rule,’’ Amer. Statist., vol. 48, no. 2, pp. 88–91, 1994.
[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, ‘‘Scikit-learn: Machine learning in Python,’’ J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[28] T. D. V. Swinscow, Statistics at Square One. London, U.K.: BMJ, 2002.
[29] M. T. Ribeiro, S. Singh, and C. Guestrin, ‘‘‘Why should I trust you?’ Explaining the predictions of any classifier,’’ in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 1135–1144.
[30] J. Zhu, J. Jang-Jaccard, and P. A. Watters, ‘‘Multi-loss Siamese neural network with batch normalization layer for malware detection,’’ IEEE Access, vol. 8, pp. 171542–171550, 2020.
[31] T. R. McIntosh, J. Jang-Jaccard, and P. A. Watters, ‘‘Large scale behavioral analysis of ransomware attacks,’’ in Proc. Int. Conf. Neural Inf. Process., Siem Reap, Cambodia. Cham, Switzerland: Springer, 2018, pp. 217–229.
[32] T. McIntosh, J. Jang-Jaccard, P. Watters, and T. Susnjak, ‘‘The inadequacy of entropy-based ransomware detection,’’ in Proc. Int. Conf. Neural Inf. Process., Sydney, NSW, Australia. Cham, Switzerland: Springer, 2019, pp. 181–189.
[33] R. Weyers, J. Jang-Jaccard, A. Moses, Y. Wang, M. Boulic, C. Chitty, R. Phipps, and C. Cunningham, ‘‘Low-cost indoor air quality (IAQ) platform for healthier classrooms in New Zealand: Engineering issues,’’ in Proc. 4th Asia–Pacific World Congr. Comput. Sci. Eng. (APWC CSE), Dec. 2017, pp. 208–215.
WEN XU received the master's degree in information science from Massey University, Auckland, New Zealand. He is currently a Junior Research Officer with the School of Natural and Computational Sciences, Massey University. His current research interests include deep learning and AI-based network intrusion detection.

YUANYUAN WEI received the master's degree in information technology from Massey University, Auckland, New Zealand, where she is currently pursuing the Ph.D. degree with the School of Natural and Computational Sciences. Her research interests include AI-powered anomaly detection, network intrusion detection, machine learning, and deep learning.