Article
A Deep Learning Model for Network Intrusion Detection with
Imbalanced Data
Yanfang Fu 1 , Yishuai Du 1 , Zijian Cao 1 , Qiang Li 1 and Wei Xiang 2,3, *
1 School of Computer Science and Engineering, Xi’an Technological University, Xi’an 710021, China;
[email protected] (Y.F.); [email protected] (Y.D.); [email protected] (Z.C.);
[email protected] (Q.L.)
2 School of Computing, Engineering and Mathematical Sciences, La Trobe University,
Melbourne, VIC 3086, Australia
3 College of Science and Engineering, James Cook University, Cairns, QLD 4878, Australia
* Correspondence: [email protected]
Abstract: With an increase in the number and types of network attacks, traditional firewalls and data encryption methods can no longer meet the needs of current network security. As a result, intrusion detection systems have been proposed to deal with network threats. The current mainstream intrusion detection algorithms are aided by machine learning but suffer from low detection rates and the need for extensive feature engineering. To address the issue of low detection accuracy, this paper proposes a model for traffic anomaly detection named the deep learning model for network intrusion detection (DLNID), which combines an attention mechanism and the bidirectional long short-term memory (Bi-LSTM) network, first extracting sequence features of data traffic through a convolutional neural network (CNN), then reassigning the weights of each channel through the attention mechanism, and finally using Bi-LSTM to learn the sequence features. Public intrusion detection datasets generally suffer from severe class imbalance. To address this issue, this paper employs adaptive synthetic sampling (ADASYN) to expand the minority class samples, eventually forming a relatively balanced dataset, and uses a modified stacked autoencoder for data dimensionality reduction with the objective of enhancing information fusion. DLNID is an end-to-end model, so it does not need to undergo a process of manual feature extraction. After being tested on NSL-KDD, a public benchmark dataset for network intrusion detection, experimental results show that the accuracy and F1 score of this model are better than those of other comparison methods, reaching 90.73% and 89.65%, respectively.
Keywords: intrusion detection; Bi-LSTM; attention mechanism; NSL-KDD
activities. One is signature-based detection, similar to antivirus software that requires com-
parison with previously collected attack features, while the other is anomaly-based detection,
which requires comparison with normal traffic to make a judgment. In the KDD99 dataset,
Stolfo et al. classified network attacks into four categories—namely, the denial-of-service
attack (DoS), user-to-root attack (U2R), remote-to-local attack (R2L), and probe attack [5].
Nowadays, there are many researchers who advocate the combination of intrusion
detection and machine learning (ML) technologies for the detection of network attacks
by creating effective models. The authors in [6] propose the use of naive Bayes for the
identification of anomalous networks and compare it with decision trees (another clas-
sical machine learning algorithm). The authors in [7] combine support vector machine
(SVM) and the genetic algorithm to optimize the selection, parameters, and weights of
SVM features, thus improving the accuracy of network attack identification. The authors
in [8] improve the detection by constructing a multi-level random forest model to detect
network anomalous behavior. The authors in [9] improve the existing k-nearest neighbor (KNN) classifier by combining k-means clustering with the KNN classifier, improving the accuracy of detection. The authors in [10] propose a novel intrusion detection
method that first decomposes the network data into smaller subsets by a C4.5 decision
tree algorithm and then creates multiple SVM models for the subsets, which reduces the
time complexity and improves the detection rate of unknown attacks. However, traditional
machine learning methods usually emphasize feature engineering, which consumes con-
siderable computational resources and usually only learns shallow features, leading to
less satisfactory detection results. Many scholars have turned their attention to the current
trend of deep learning, hoping to import network traffic data directly into the model to
skip the feature selection step. In one study [11], the authors propose a model structure
based on deep belief networks (DBNs) and probabilistic neural networks (PNNs) to reduce
the dimensionality of the data using deep belief networks and then classify the data using a
probabilistic neural network, which is superior to the traditional PNNs. The authors in [12]
propose a convolutional neural network-based detection method by processing traffic data
into image form, saving the process of designing features manually. In another study [13],
the authors use recurrent neural networks (RNNs) for botnet anomaly detection, exploiting the effectiveness of RNNs on temporal features to further improve the accuracy of classification.
Table 1 summarizes the relevant research.
However, there is a problem of uneven distribution in network traffic data, and none
of the above networks exploits the correlation between traffic features. In this paper, a
DLNID model is proposed to solve the above remaining problems, using adaptive synthetic
sampling (ADASYN) for data augmentation of unbalanced samples and a modified stacked
autoencoder for data dimensionality reduction. To train and test the performance of
the DLNID model, we take the NSL-KDD dataset for simulation testing. The following
contributions are presented in this paper:
(1) A DLNID model combining attention mechanism and Bi-LSTM is proposed. This
DLNID model can classify network traffic data accurately;
(2) To address the issue of imbalanced network data, ADASYN is used for data augmentation of the minority class samples, eventually making the distribution of the number of each sample type relatively balanced and allowing the model to learn adequately;
(3) An improved stacked autoencoder is proposed and used for data dimensionality
reduction with the objective of enhancing information fusion.
The rest of this paper is organized as follows: Section 2 details the techniques and innovations used in this
paper and presents a diagram of the model architecture of the DLNID model. Section 3 presents
information about the NSL-KDD dataset used in this paper. Section 4 provides experimental
results and analysis. In Section 5, we summarize our study and propose future research.
2. Technology
2.1. ADASYN
Adaptive synthetic sampling (ADASYN) [15] is an adaptive oversampling algorithm
based on the minority class samples. Compared with other data expansion algorithms,
it is characterized by the fact that it generates more instances in regions of the feature space with lower density and fewer instances in regions with higher density. This property has
the advantage of adaptively shifting decision boundaries to difficult-to-learn samples, so
ADASYN is more suitable than other data augmentation algorithms to handle network
traffic with severe data imbalance. The algorithm is executed in the following steps:
Step 1: Calculate the number of samples to be synthesized, $G$, which can be expressed as

$$G = (n_b - n_s) \times \beta \tag{1}$$

where $n_b$ represents the number of majority class samples, $n_s$ represents the number of minority class samples, and $\beta \in (0, 1)$.

Step 2: For each minority class sample, find its $K$ nearest neighbors by the Euclidean distance and denote by $r_i$ the proportion of majority class samples contained in those neighbors, which can be expressed as

$$r_i = k / K \tag{2}$$

where $K$ represents the number of neighbors, and $k$ represents the number of majority class samples among the current neighbors.

Step 3: Calculate the number of samples that need to be synthesized for each minority class sample according to $G$, and synthesize new samples according to Equation (4):

$$g = G \times r_i \tag{3}$$

$$Z_i = X_i + (X_{zi} - X_i) \times \lambda \tag{4}$$

where $g$ represents the quantity to be synthesized for the current minority class sample, $Z_i$ represents the synthesized new sample, $X_i$ represents the current minority class sample, $X_{zi}$ represents a random minority class sample among the $K$ neighbors of $X_i$, and $\lambda \in (0, 1)$.
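To make these steps concrete, the following is a minimal NumPy/scikit-learn sketch of the procedure; the function name and defaults are ours, and, following He et al. [15], the $r_i$ are normalized so that the per-sample synthesis counts sum to $G$:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_sketch(X_min, X_maj, beta=1.0, K=5, seed=0):
    """Minimal ADASYN sketch following Equations (1)-(4)."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])   # minority samples occupy indices [0, len(X_min))
    # Step 1: total number of samples to synthesize, G = (n_b - n_s) * beta
    G = int((len(X_maj) - len(X_min)) * beta)
    # Step 2: for each minority sample, find its K neighbors among all samples
    # and compute r_i, the fraction of majority-class neighbors (Equation (2))
    nbrs = NearestNeighbors(n_neighbors=K + 1).fit(X_all)
    _, idx = nbrs.kneighbors(X_min)                       # idx[:, 0] is the sample itself
    r = (idx[:, 1:] >= len(X_min)).mean(axis=1)
    r = r / r.sum()                                       # normalize so the g_i sum to G
    # Step 3: synthesize g_i = G * r_i samples per minority point via
    # Z_i = X_i + (X_zi - X_i) * lambda, with lambda ~ U(0, 1), as in Equation (4)
    min_nbrs = NearestNeighbors(n_neighbors=min(K + 1, len(X_min))).fit(X_min)
    _, min_idx = min_nbrs.kneighbors(X_min)
    synthetic = []
    for i, g_i in enumerate(np.rint(G * r).astype(int)):
        for _ in range(g_i):
            z = X_min[rng.choice(min_idx[i, 1:])]         # random minority neighbor of X_i
            synthetic.append(X_min[i] + (z - X_min[i]) * rng.uniform())
    return np.asarray(synthetic)
```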
2.2. Autoencoder

An autoencoder [16] is an unsupervised learning network architecture, in which the input and output dimensions are the same, and the number of nodes in the middle layer is generally less than the number of nodes on the left and right sides. Figure 1 illustrates a typical autoencoder consisting of two main components, i.e., the encoder and decoder. It works by using deep learning techniques to find an efficient representation of the input data without losing information. In short, it compresses the original data by using the encoder to obtain a lower-dimensional representation, which is then reconstructed into the original data by the decoder. According to this working principle, we can use the trained encoder as a tool for data dimensionality reduction. Compared with the traditional principal component analysis (PCA) [17] dimensionality reduction method, the autoencoder can achieve nonlinear transformations, which facilitates the learning of deeper projections of the data.

Figure 1. Autoencoder structure.

Although the autoencoder can achieve better data dimensionality reduction compared with other dimensionality reduction methods, we aimed to propose an autoencoder that is able to both perform dimensionality reduction and enhance data robustness to adapt to complex network scenarios. Dropout [18] gives each neuron a probability p of being discarded during network training iterations; due to this mechanism, no neuron becomes overly dependent on other neurons, thus reducing overfitting and improving the generalization ability of the model to some extent. By combining the two ideas, a low-dimensional representation is obtained by applying dropout within the stacked autoencoder during dimensionality reduction. Since each dimension has a probability of being discarded, the information captured by each dimension is more comprehensive than that obtained by a traditional autoencoder after dimensionality reduction, thus facilitating model learning. Based on the above ideas, we propose a stacked autoencoder structure with added dropout, as shown in Figure 2.

Figure 2. Improved stacked autoencoder.
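As an illustration of this design, the following PyTorch sketch interleaves dropout between the stacked encoder layers, as in Figure 2; the layer widths and dropout probability are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    """Stacked autoencoder with dropout added between the encoder layers."""
    def __init__(self, in_dim=122, hid_dim=64, code_dim=32, p=0.3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hid_dim, code_dim), nn.ReLU(), nn.Dropout(p),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, in_dim),
        )

    def forward(self, x):
        code = self.encoder(x)      # low-dimensional representation
        return self.decoder(code)   # reconstruction of the input
```

After training with a reconstruction loss such as nn.MSELoss()(model(x), x), only model.encoder is kept as the dimensionality reducer; calling model.eval() disables dropout, so the encoder is deterministic at inference time.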
The bidirectional LSTM (Bi-LSTM) network [24] improves its LSTM predecessor by adding backward hidden states $\overleftarrow{h}_t$ to the existing forward hidden states $\overrightarrow{h}_t$, allowing it to obtain a forward-looking capability similar to that of the hidden Markov model (HMM). The following shows how the Bi-LSTM network updates itself in one time step:

$$\overrightarrow{h}_t = \tanh\left(W_{\overrightarrow{h}x} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\right) \tag{11}$$

$$\overleftarrow{h}_t = \tanh\left(W_{\overleftarrow{h}x} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t-1} + b_{\overleftarrow{h}}\right) \tag{12}$$

$$h_t = \overrightarrow{h}_t + \overleftarrow{h}_t \tag{13}$$

where $h_t$ represents the hidden state of the current cell, $h_{t-1}$ represents the hidden state of the previous cell, $\overrightarrow{h}_t$ represents the forward hidden state of the current cell, and $\overleftarrow{h}_t$ represents the backward hidden state of the current cell.
For network traffic, Bi-LSTM can effectively utilize the temporal features present in the contextual information to improve the model training, and its structure is shown in Figure 4.

Figure 4. Bi-LSTM structure.
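For reference, a minimal PyTorch sketch of this update (sizes are illustrative): nn.LSTM with bidirectional=True concatenates the forward and backward hidden states, so the two halves are split and summed to match Equation (13):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True, bidirectional=True)
x = torch.randn(8, 20, 64)               # (batch, time steps, features); illustrative sizes
out, _ = lstm(x)                         # out: (8, 20, 256), forward/backward concatenated
h_fwd, h_bwd = out[..., :128], out[..., 128:]
h = h_fwd + h_bwd                        # Equation (13): h_t = forward h_t + backward h_t
```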
2.5. Network Architecture

As shown in Figure 5, the overall architecture of the DLNID model consists of seven parts: the input layer, encoder layer, multiple convolutional layer, attention layer, Bi-LSTM layer, fully connected layer, and output layer. In the first layer, the model accepts the network traffic data from the dataset. In the encoder layer, the model uses the encoder part of the trained improved stacked autoencoder to perform dimensionality reduction on the data. In the multiple convolutional layer, the model uses multiple convolutional operations to extract features from the downscaled data. In the attention layer, the model uses the CBAM to redistribute the weights of each channel, assigning higher weights to more important channels. In the Bi-LSTM layer, the model extracts the feature information of each dimension and learns the relationships between the dimensions. In the fully connected layer and the output layer, the model passes the learned features onto the classifier and outputs the classification results. Algorithm 1 presents the training process of the DLNID model.

Figure 5. Overall structure of the model.
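To make the data flow concrete, the following is a simplified PyTorch sketch of this seven-part pipeline; the channel counts, kernel sizes, and SE-style channel attention are assumptions loosely based on Table 2, not the paper's exact implementation, and the encoder argument would be the trained encoder of the improved stacked autoencoder from Section 2.2:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE/CBAM-style channel attention: squeeze, excite, reweight."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (batch, channels, length)
        w = self.fc(x.mean(dim=2))               # global average pool over length
        return x * w.unsqueeze(2)                # reweight each channel

class DLNIDSketch(nn.Module):
    """Illustrative sketch of the seven-part DLNID pipeline."""
    def __init__(self, encoder, channels=32, classes=2):
        super().__init__()
        self.encoder = encoder                   # trained encoder of the stacked AE
        self.conv = nn.Sequential(               # multiple convolutional layer
            nn.Conv1d(1, channels, kernel_size=5, padding=2),
            nn.BatchNorm1d(channels), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(channels, channels, kernel_size=1),
        )
        self.attn = ChannelAttention(channels)
        self.bilstm = nn.LSTM(channels, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Dropout(0.3), nn.Linear(128, 32), nn.LeakyReLU(),
            nn.Dropout(0.2), nn.Linear(32, classes),
        )

    def forward(self, x):                        # x: (batch, input features)
        z = self.encoder(x).unsqueeze(1)         # (batch, 1, code_dim)
        f = self.attn(self.conv(z))              # extract features, reweight channels
        out, _ = self.bilstm(f.transpose(1, 2))  # (batch, length, 2 * 64)
        return self.fc(out[:, -1])               # classify from the final time step
```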
3. Datasets
3.1. Data Analysis
The experimental data in this paper adopt the NSL-KDD dataset [5], which is an
improved version of the KDD99 dataset [25] that addresses the data redundancy problem
present in the KDD99 dataset and is one of the benchmark datasets used to evaluate the
performance of IDS. It consists of a training set (KDDTrain+), containing 125,973 traffic
samples, and a test set (KDDTest+), containing 22,544 traffic samples. In order to better reflect the complexity of real network environments, only 19 attack types appear in the training set, and the other 17 attack types exist only in the test set.
The NSL-KDD dataset has a total of 42 dimensions, one of which is the classification label, and the rest are features. For binary classification, the classification labels are divided into two categories, i.e., normal and anomaly. For multiclassification, the classification labels are divided into five categories, i.e., normal, DoS, R2L, U2R, and probe.
3.2.3. Normalization
A large gap between different dimensional feature data within the dataset can bring
about problems such as slow model training and insignificant accuracy improvement;
therefore, in order to tackle this issue, the MinMaxScaler [26] was adopted to map the data
into the range of (0, 1) as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{14}$$

where $x_{\max}$ is the maximum value, and $x_{\min}$ is the minimum value.
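For example, with scikit-learn's MinMaxScaler [26]; fitting on the training features and reusing the learned statistics for the test features is our assumption of standard practice, and the small arrays are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[2.0, 40.0], [4.0, 80.0], [8.0, 120.0]])  # illustrative features
X_test = np.array([[3.0, 100.0]])

# Map every feature dimension into (0, 1) per Equation (14).
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns x_min and x_max per column
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics
```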
4. Results
In the following section, we detail the experimental settings and appraise the performance metrics of the model. In addition, we present two sets of ablation experiments to verify the reliability of the data augmentation and improved dimensionality reduction approaches proposed in Section 2. Finally, we compare the model with those proposed in other papers.
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

$$Pre = \frac{TP}{TP + FP} \tag{16}$$

$$Rec = \frac{TP}{TP + FN} \tag{17}$$

$$F1 = \frac{2 \times Pre \times Rec}{Pre + Rec} \tag{18}$$

$$FPR = \frac{FP}{FP + TN} \tag{19}$$
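Equations (15)-(19) translate directly into code; the following is a minimal sketch computing them from raw binary-classification counts:

```python
def metrics(TP, TN, FP, FN):
    acc = (TP + TN) / (TP + TN + FP + FN)   # Equation (15)
    pre = TP / (TP + FP)                    # Equation (16)
    rec = TP / (TP + FN)                    # Equation (17)
    f1 = 2 * pre * rec / (pre + rec)        # Equation (18)
    fpr = FP / (FP + TN)                    # Equation (19)
    return acc, pre, rec, f1, fpr
```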
4.3. Result Analysis

The experiment studied the performance of the proposed network on normal, DoS, R2L, U2R, and probe for binary and multiclassification experiments, respectively. When the network parameters were chosen as shown in Table 2, a high accuracy and F1 score could be achieved on the KDDTest+ test set. Figures 6 and 7 show the experimental results using the confusion matrix. The experimental results show that most samples were classified correctly, appearing on the diagonal, indicating a good classification performance. However, the comparison between the two figures shows that the performance of the proposed model was somewhat degraded on the multiclassification experiments compared with the binary classification experiments. Table 3 provides the false-positive and recall rates corresponding to different attacks under the multiclassification task; the aim was to achieve a lower false-positive rate and a higher recall rate in intrusion detection. From the analysis, it can be concluded that, despite the data augmentation process, the U2R category was more likely to be misclassified because the U2R category in the test set was much larger than in the training set.
Table 2. Model parameters.

Type                            Parameter
Encoder                         -
Conv1d                          5 × 5
BatchNorm1d                     -
Maxpool1d                       3 × 3
Conv1d                          1 × 1
ChannelAttention                -
Bidirectional LSTM              -
Dropout                         0.3
Fully connected (LeakyReLU)     32
Dropout                         0.2
Fully connected                 16
Loss function                   CrossEntropy
Optimizer                       Adam
Learning rate                   0.0005
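As a usage illustration, the loss function, optimizer, and learning rate of Table 2 correspond to the following PyTorch training loop; the epoch count and data loader are illustrative assumptions:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=100):
    """Training loop using the loss, optimizer, and learning rate of Table 2."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # cross-entropy on class logits
            loss.backward()
            optimizer.step()
    return model
```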
Figure 6. Confusion matrix (2 classes).

Figure 7. Confusion matrix (5 classes).
4.3.2. Dimensionality Reduction Comparison

Table 5 shows the experimental results of selecting different dimensionality reduction methods for horizontal comparison under the condition in which the model was the same and the data augmentation method remained unchanged. Compared with PCA, the performance of the model in this paper is greatly improved. Compared with the autoencoder, the improved stacked autoencoder used in this paper also shows some improvement in accuracy and F1 score, with increases of 4.64% and 3.28%, respectively.

Table 5. Dimensionality reduction comparison.

Type                            ACC (%)   Pre (%)   Rec (%)   F1 (%)
PCA                             85.29     83.45     82.14     82.79
Autoencoder                     86.09     85.08     82.12     83.57
Improved stacked autoencoder    90.73     86.38     93.17     89.65

4.3.3. Model Comparison

Figure 8 compares the proposed DLNID model with other reference models in terms of accuracy, and it can be seen that the accuracy of DLNID is higher than that of the other models. Table 6 compares the proposed model with other network models in terms of various performance metrics, from which it can be seen that the proposed DLNID model outperforms its comparison peers in terms of accuracy and F1 score, reaching 90.73% and 89.65% on the KDDTest+ dataset, respectively. Compared with the traditional machine learning
of network intrusion detection. In the future, we plan to apply the DLNID model in a real environment, combined with a network capture module, to implement an online intrusion detection system.
Author Contributions: Methodology, Y.F. and Y.D.; funding acquisition, W.X.; investigation, Y.F., Y.D.,
W.X. and Q.L.; resources, Z.C. and W.X.; validation, Z.C. and Q.L.; writing—original draft preparation,
Y.D.; writing—review and editing, Y.F. All authors have read and agreed to the published version of
the manuscript.
Funding: The work of Y.F., Y.D., Z.C. and Q.L. is supported, in part, by Shaanxi S&T under Grant 2021KW-07 and Shaanxi Education under Fund 19jk0414.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Patel, A.; Qassim, Q.; Wills, C. A survey of intrusion detection and prevention systems. Inf. Manag. Comput. Secur. 2010, 18,
277–290. [CrossRef]
2. Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of intrusion detection systems: Techniques, datasets and challenges.
Cybersecurity 2019, 2, 20. [CrossRef]
3. Yuan, L.; Chen, H.; Mai, J.; Chuah, C.N.; Su, Z.; Mohapatra, P. Fireman: A toolkit for firewall modeling and analysis. In
Proceedings of the 2006 IEEE Symposium on Security and Privacy (S&P’06), Berkeley/Oakland, CA, USA, 21–24 May 2006; IEEE:
Manhattan, NY, USA, 2006; pp. 15–213.
4. Musa, U.S.; Chhabra, M.; Ali, A.; Kaur, M. Intrusion detection system using machine learning techniques: A review. In
Proceedings of the 2020 International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 10–12
September 2020; IEEE: Manhattan, NY, USA, 2020; pp. 149–155.
5. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the 2009
IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, ON, Canada, 8–10 July 2009;
IEEE: Manhattan, NY, USA, 2009; pp. 1–6.
6. Amor, N.B.; Benferhat, S.; Elouedi, Z. Naive Bayes vs. decision trees in intrusion detection systems. In Proceedings of the 2004 ACM Symposium on Applied Computing, Nicosia, Cyprus, 14–17 March 2004; Association for Computing Machinery: New York, NY, USA, 2004; pp. 420–424.
7. Tao, P.; Sun, Z.; Sun, Z. An improved intrusion detection algorithm based on GA and SVM. IEEE Access 2018, 6, 13624–13631.
[CrossRef]
8. Jiadong, R.; Xinqian, L.; Qian, W.; Haitao, H.; Xiaolin, Z. A multi-level intrusion detection method based on KNN outlier detection
and random forests. J. Comput. Res. Dev. 2019, 56, 566.
9. Shapoorifard, H.; Shamsinejad, P. Intrusion detection using a novel hybrid method incorporating an improved KNN. Int. J.
Comput. Appl. 2017, 173, 5–9. [CrossRef]
10. Kim, G.; Lee, S.; Kim, S. A novel hybrid intrusion detection method integrating anomaly detection with misuse detection. Expert
Syst. Appl. 2014, 41, 1690–1700. [CrossRef]
11. Zhao, G.; Zhang, C.; Zheng, L. Intrusion detection using deep belief network and probabilistic neural network. In Proceedings of
the 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on
Embedded and Ubiquitous Computing (EUC), Guangzhou, China, 21–24 July 2017; IEEE: Manhattan, NY, USA, 2017; Volume 1,
pp. 639–642.
12. Wang, W.; Zhu, M.; Zeng, X.; Ye, X.; Sheng, Y. Malware traffic classification using convolutional neural network for representation
learning. In Proceedings of the 2017 International Conference on Information Networking (ICOIN), Da Nang, Vietnam, 11–13
January 2017; IEEE: Manhattan, NY, USA, 2017; pp. 712–717.
13. Torres, P.; Catania, C.; Garcia, S.; Garino, C.G. An analysis of recurrent neural networks for botnet detection behavior. In
Proceedings of the 2016 IEEE Biennial Congress of Argentina (ARGENCON), Buenos Aires, Argentina, 15–17 June 2016; IEEE:
Manhattan, NY, USA, 2016; pp. 1–6.
14. Su, T.; Sun, H.; Zhu, J.; Wang, S.; Li, Y. BAT: Deep learning methods on network intrusion detection using NSL-KDD dataset.
IEEE Access 2020, 8, 29575–29585. [CrossRef]
15. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings
of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence),
Hong Kong, China, 1–8 June 2008.
16. Meng, Q.; Catchpoole, D.; Skillicorn, D.; Kennedy, P.J. Relational autoencoder for feature extraction. In Proceedings of the 2017
International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: Manhattan, NY, USA,
2017; pp. 364–371.
17. Roweis, S. EM algorithms for PCA and SPCA. Adv. Neural Inf. Process. Syst. 1998, 10, 626–632.
18. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks
from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
19. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
21. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw.
Learn. Syst. 2017, 28, 2222–2232. [CrossRef] [PubMed]
22. Gui, Z.; Sun, Y.; Yang, L.; Peng, D.; Li, F.; Wu, H.; Guo, C.; Guo, W.; Gong, J. LSI-LSTM: An attention-aware LSTM for real-time
driving destination prediction by considering location semantics and location importance of trajectory points. Neurocomputing
2021, 440, 72–88. [CrossRef]
23. Lin, T.; Horne, B.G.; Tino, P.; Giles, C.L. Learning long-term dependencies in NARX recurrent neural networks. IEEE Trans. Neural
Netw. 1996, 7, 1329–1338. [PubMed]
24. Hochreiter, S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain.
Fuzziness Knowl.-Based Syst. 1998, 6, 107–116. [CrossRef]
25. Engen, V.; Vincent, J.; Phalp, K. Exploring discrepancies in findings obtained with the KDD Cup 99 data set. Intell. Data Anal.
2011, 15, 251–276. [CrossRef]
26. Bisong, E. Introduction to scikit-learn. In Building Machine Learning and Deep Learning Models on Google Cloud Platform; Apress:
Berkeley, CA, USA, 2019; pp. 215–229.
27. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell.
Res. 2002, 16, 321–357. [CrossRef]
28. Wisanwanichthan, T.; Thammawichai, M. A double-layered hybrid approach for network intrusion detection system using
combined naive bayes and SVM. IEEE Access 2021, 9, 138432–138450. [CrossRef]
29. Ieracitano, C.; Adeel, A.; Morabito, F.C.; Hussain, A. A novel statistical analysis and autoencoder driven intelligent intrusion
detection approach. Neurocomputing 2020, 387, 51–62. [CrossRef]
30. Ding, Y.; Zhai, Y. Intrusion detection system for NSL-KDD dataset using convolutional neural networks. In Proceedings of the 2018
2nd International Conference on Computer Science and Artificial Intelligence, Shenzhen, China, 8–10 December 2018; pp. 81–85.
31. Gao, X.; Shan, C.; Hu, C.; Niu, Z.; Liu, Z. An adaptive ensemble machine learning model for intrusion detection. IEEE Access 2019,
7, 82512–82521. [CrossRef]
32. Tama, B.A.; Comuzzi, M.; Rhee, K. TSE-IDS: A two-stage classifier ensemble for intelligent anomaly-based intrusion detection
system. IEEE Access 2019, 7, 94497–94507. [CrossRef]
33. Kanakarajan, N.K.; Muniasamy, K. Improving the accuracy of intrusion detection using GAR-forest with feature selection. In
Proceedings of the 4th International Conference on Frontiers in Intelligent Computing: Theory and Applications (FICTA) 2015,
Durgapur, India, 16–18 November 2015; Springer: New Delhi, India, 2016; pp. 539–547.
34. Jiang, K.; Wang, W.; Wang, A.; Wu, H. Network intrusion detection combined hybrid sampling with deep hierarchical network.
IEEE Access 2020, 8, 32464–32476. [CrossRef]
35. Pervez, M.S.; Farid, D.M. Feature selection and intrusion classification in NSL-KDD Cup 99 dataset employing SVMs. In Proceedings
of the 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014), Dhaka,
Bangladesh, 18–20 December 2014; pp. 1–6.