
computers & security 103 (2021) 102177

Available online at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/cose

Intrusion detection methods based on integrated deep learning model

Zhendong Wang a, Yaodi Liu a,∗, Daojing He b, Sammy Chan c

a School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
b School of Software Engineering, East China Normal University, Shanghai 200000, China
c Department of Electrical Engineering, City University of Hong Kong, Hong Kong 999077, China

article info

Article history: Received 18 July 2020; Revised 12 November 2020; Accepted 3 January 2021; Available online 7 January 2021

Keywords: Deep learning; Deep neural network; Feature learning; Mini-batch gradient descent; Intrusion detection

abstract

An intrusion detection system can effectively identify abnormal data in complex network environments, and is an effective means of ensuring computer network security. Recently, deep neural networks have been widely used in image recognition, natural language processing, network security and other fields. For network intrusion detection, this paper designs an integrated deep intrusion detection model based on SDAE-ELM to overcome the long training time and low classification accuracy of existing deep neural network models and to achieve a timely response to intrusion behavior. For host intrusion detection, an integrated deep intrusion detection model based on DBN-Softmax is constructed, which effectively improves the detection accuracy of host intrusion data. At the same time, in order to improve the training efficiency and detection performance of the SDAE-ELM and DBN-Softmax models, a mini-batch gradient descent method is used for network training and optimization. Experiments on the KDD Cup99, NSL-KDD, UNSW-NB15, CIDDS-001, and ADFA-LD datasets show that the SDAE-ELM and DBN-Softmax integrated deep detection models perform better than other classic machine learning models.

© 2021 Elsevier Ltd. All rights reserved.

1. Introduction

With the rapid development of global informatization, network connections are becoming more and more convenient. People enjoy the convenience brought by the rapid development of the network, but at the same time, network security is increasingly threatened. Intrusion detection technology, as a proactive defense mechanism, has therefore received more and more attention. The process of identifying behaviors that attempt to invade, are invading, or have already occurred is called intrusion detection (Tidjon et al., 2019). The technical core of intrusion detection is to determine whether the various behaviors in the network are safe by analyzing the collected network data. However, with the increasing complexity of network structures and the diversity of intrusion behaviors, existing intrusion detection systems gradually show some drawbacks, such as high false positive and false negative rates, difficult data feature extraction, and low data processing efficiency.

Network-based intrusion detection systems (NIDS) (Hamed et al., 2018) and host-based intrusion detection systems (HIDS) (Marteau, 2019) are the two types of intrusion detection systems. A NIDS determines possible intrusions by analyzing network traffic and network protocols. In the past, researchers mostly used pattern matching algorithms


Corresponding author.
E-mail addresses: [email protected] (Z. Wang), [email protected] (Y. Liu), [email protected] (D. He),
[email protected] (S. Chan).
https://doi.org/10.1016/j.cose.2021.102177
0167-4048/© 2021 Elsevier Ltd. All rights reserved.

(Kim et al., 2009) to analyze it. The matched features mainly include string features, port features and packet header features, so feature selection is one of the effective methods to improve the reliability and timeliness of a NIDS. Current feature selection algorithms mainly include particle swarm optimization (Wei et al., 2019), the genetic algorithm (Ghamisi and Benediktsson, 2015), the gray wolf algorithm (Al-Tashi et al., 2020), the cuckoo algorithm (Usman et al., 2020), etc., but they all have some problems, such as the relatively large randomness of the genetic algorithm and the tendency of the gray wolf algorithm to get trapped in local optima. The earliest intrusion detection system applied to networks was the HIDS, whose detection module is installed on the system hard disk. A HIDS analyzes log files and audits extracted system operation data to achieve intrusion detection. With the increasing complexity of the network environment, detection methods migrated from rule-based expert systems (Ilgun et al., 1995) to machine learning based methods. Currently, commonly used machine learning methods include SVM, DT (Teng et al., 2018), LR (Besharati et al., 2019), neural networks (Canedo and Romariz, 2019), deep learning (Al-Qatf et al., 2018) and so on.

Deep learning appeared relatively late: Hinton first proposed the concept in 2006 (Hinton et al., 2006). It has strong feature learning capabilities thanks to unsupervised layer-by-layer training, and has therefore been widely used in speech (Tu et al., 2019), image (Liu et al., 2018) and natural language processing (Wang et al., 2020). In recent years, with the growing number of intrusions, the increase of data dimensions, and the maturing of deep learning theory, researchers have begun to apply deep learning to intrusion detection. In this setting, the autoencoder (Sadaf and Sultana, 2020) replaces high-dimensional input layer neurons with labeled hidden layer neurons to achieve dimensionality reduction and feature extraction; intrusion data can be preprocessed into uniform grayscale traffic images, with a convolutional neural network (Nguyen and Kim, 2020) used to extract features from the gray images, thereby expanding the number of training samples; and a deep belief network (Zhang et al., 2020) can use restricted Boltzmann machines for unsupervised pre-training, taking the results as the initial values of the supervised probabilistic training model, which greatly improves learning performance. Although deep learning models have achieved good research results in the field of intrusion detection, in most cases the performance of a model is only verified on network-based intrusion detection datasets, without considering whether it can also detect attacks in host-based datasets.

Therefore, this paper fully considers the characteristics of different intrusion detection systems and proposes integrated deep learning models for different intrusion data sources, in order to fully exploit the advantages of integrated deep learning and achieve the best detection effect for each data source. At present, network-based intrusion detection systems use raw network packets as data sources. The main problems are that the network traffic is noisy, responses are not timely, the signature database is not updated in time, and timeliness is poor. Since the stacked denoising autoencoder (SDAE) can reduce the noise of the data very well and is robust, while the extreme learning machine (ELM) has a fast learning speed, we propose their integration: the resulting model improves detection speed while reducing the noise of network traffic, and can detect network-based intrusions in real time. On the other hand, host-based intrusion detection systems use the log files of each host as the main data source. As the number of log files is limited, some intrusion means and paths never appear in the logs. Considering that the deep belief network (DBN) can fully dig out the features in the data and the Softmax classifier can better identify multiple attack types, we propose the integrated deep learning model DBN-Softmax, which has high accuracy and is a better detection method for host-based intrusion detection systems.

Overall, this work makes the following contributions to the intrusion detection domain:

1) Considering that the performance of the SDAE model can still be improved, this paper proposes an integrated deep intrusion detection model based on SDAE-ELM. The SDAE reduces the noise of network traffic well and is strongly robust, while the ELM has the advantage of fast training and enables faster intrusion detection.

2) Considering that the Sigmoid activation function in the DBN model is more suitable for binary classification, this paper proposes a method for combining the deep DBN structure with the Softmax classifier, and designs an integrated deep intrusion detection model based on DBN-Softmax that trains the network using pre-training and fine-tuning.

3) SDAE-ELM and DBN-Softmax are applied to NIDS and HIDS data sets, respectively. This is because the NIDS datasets contain huge amounts of data and much noise, and the SDAE-ELM model can remove that noise and improve detection performance and speed; while the number of log files in the HIDS dataset is limited and most intrusions are not recorded by the host, and the DBN-Softmax model has better data mining capabilities, thereby improving detection performance.

4) To verify the effectiveness of the proposed methods and models, we apply SDAE-ELM and DBN-Softmax to the network-based intrusion detection datasets KDD Cup99, NSL-KDD, UNSW-NB15, and CIDDS-001 and to the host-based intrusion detection dataset ADFA-LD, respectively, and compare them with various machine learning algorithms and deep learning models using several experimental evaluation metrics.

The rest of the paper is organized as follows. Related works are discussed in Section 2. In Section 3, we present the deep learning and intrusion detection models proposed in this paper. Section 4 describes the intrusion detection data sets. The mathematical details of the intrusion detection models and the evaluation criteria of the models are introduced in Section 5. The performance of the models is evaluated in Section 6. Finally, we draw conclusions and suggest future work in Section 7.

2. Related works

Intrusion detection has become an important part of network security defense. More and more researchers are focusing on intrusion detection and many studies have been conducted. Through a literature review, we found that the current research on intrusion detection mainly includes dataset preprocessing methods, optimization of detection models, and detection technologies for different network environments.

Some researchers start with the intrusion detection datasets, using dimensionality reduction techniques to remove irrelevant and redundant data before model training, which reduces the design complexity of the intrusion detection system and improves its performance. Feature selection and feature extraction are the two types of dimensionality reduction techniques. Serpen and Aghaei (2018) used principal component analysis for feature extraction of data in the Linux operating system, and applied it to a host-based misuse intrusion detection system; in general, a misuse detection system can detect and predict the type of attacks. Tama et al. (HFSTE, 2017) considered that it is difficult to distinguish the boundary between normal and attack types in anomaly detection, which leads to a high false alarm rate. Therefore, by combining three algorithms for feature selection, integrating tree-based classifiers, and using a reduced error pruning tree for classification, their method achieves better detection performance than existing methods. Beulah and Punithavathani (2017) proposed a hybrid method for feature selection, which selects and combines the best features from different feature selection methods and can be used for feature compression in any application domain. Although hybrid feature selection and feature extraction technologies can remove irrelevant and redundant data, reduce the dimensionality of the dataset, and save model training time, most current techniques mainly consider linear combinations of the original data and ignore the potential relationships within the variables, which may delete important features during dimensionality reduction and thus harm the training effect of the model.

In addition to focusing on the preprocessing of the dataset, researchers are also interested in how to optimize the classifier model to improve its detection performance. For traditional intrusion detection models, most researchers use swarm intelligence algorithms to optimize the parameters of the classifier model, so as to avoid the model falling into a local optimum and to improve its classification performance. Ye et al. (2019) proposed the grasshopper algorithm to search for better SVM kernel parameters; it builds on the genetic algorithm and the particle swarm algorithm in order to avoid the slow convergence and local-minimum problems of traditional algorithms. They conducted comparative experiments using MATLAB tools, which showed that the method has superior performance in intrusion detection. Duan et al. (2019) used an improved artificial bee colony algorithm to optimize the initial weights and thresholds of the back-propagation (BP) neural network, avoiding local optima and improving the training speed; this method has good classification and intrusion detection capabilities. In addition to optimizing the parameters of the model, researchers also focused on the characteristics of the model itself and proposed a series of integrated models. Teng et al. (2018) proposed an adaptive collaborative intrusion detection method, which uses decision trees and support vector machines to design objects, roles and agents, and establishes an adaptive scheduling mechanism; this method is more effective than a single support vector machine. Shone et al. (2018) proposed a new asymmetric deep autoencoder (NDAE) for unsupervised feature learning, and stacked it with the random forest algorithm to form a new deep learning classification model, S-NDAE. Tested on the benchmark datasets KDD Cup99 and NSL-KDD, the model incurred less training time and achieved better training results.

In the traditional network environment, hybrid feature selection technology and detection model optimization have achieved good detection results. With the increasing complexity of network environments, cloud computing and fog computing environments have appeared, where the detection effects of these methods may be slightly worse. In order to adapt intrusion detection to cloud and fog computing environments, researchers have in recent years proposed a series of intrusion detection technologies and architectures. de Araujo-Filho et al. (2020) used generative adversarial networks to realize intrusion detection for cyber-physical systems (CPS) in fog environments; attack detection on CPS must comply with strict delay requirements, detecting and preventing attacks before the system is threatened. Their system places intrusion detection at fog nodes, bringing the computing resources closer to the terminal nodes, which helps meet the low-latency requirements, and it addresses response speed by training an encoder that accelerates the reconstruction loss calculation. Prabavathy et al. (2018) used fog computing to detect network attacks in IoT applications in a distributed manner, using the OS-ELM algorithm to implement intrusion detection on distributed fog nodes; the distributed architecture of fog computing makes the distributed intrusion detection mechanism scalable, flexible, and interoperable. In addition, Wang et al. (2019) proposed an effective feature selection method and an SVM classification algorithm to construct a cloud intrusion detection system. Experimental results show that the detection system achieves good results in detecting intrusions in cloud computing networks, but the Libsvm classifier cannot effectively identify some new attacks in the test dataset. Aljamal et al. (2019) proposed a network-based anomaly detection system at the cloud hypervisor level, which combines K-means and SVM classification algorithms to improve the accuracy of the anomaly detection system. Distributed denial of service (DDoS) is a common form of attack against cloud computing, and new features of cloud computing (e.g., virtualization and virtual machine migration) also bring challenges to cloud security. Therefore, Ibrahim and Zainal (2018) proposed an adaptive and distributed intrusion detection model (A-D-CIDS) for cloud computing to detect coordinated attacks and solve the problems caused by the migration of virtual machines in the cloud intrusion detection system. From the current research, the intrusion detection technologies and architectures for cloud computing and fog computing mainly adopt a distributed detection structure. Such a structure is

based on existing algorithms and implements intrusion detection on distributed nodes, which can greatly improve the detection speed of the model and achieve the purpose of real-time detection.

In summary, good progress has been made in hybrid feature selection technology, classifier model optimization, and intrusion detection technology for different network environments. Table 1 compares the intrusion detection models used by researchers. Different from the existing intrusion detection models, two integrated deep learning models, SDAE-ELM and DBN-Softmax, are proposed in this paper based on the deep learning approach; they fully consider the influence of the data source on the detection performance of the models. The two models are applied to the network-based intrusion detection datasets KDD Cup99, NSL-KDD, UNSW-NB15, and CIDDS-001 and to the host-based intrusion detection dataset ADFA-LD, respectively. The intrusion detection datasets used in this paper are relatively complete, and we use multiple evaluation indicators, accuracy, precision, true positive rate, false positive rate, F1-score, P-R curve, ROC curve and AUC value, to evaluate the performance of the proposed models, so the evaluation is more scientific and comprehensive.

3. Intrusion detection models

As classic methods in deep learning, SDAE and DBN have achieved good results when applied to shallower intrusion detection models, but they have certain limitations. Both use the BP algorithm in the fine-tuning process, which, as the number of layers increases, suffers from sparse gradients and local optima. In SDAE, the BP algorithm is prone to local minima and requires multiple iterations to determine the network output weights, which affects the learning of the network; in contrast, the weights and thresholds of the ELM network need only be calculated once by the least squares method to obtain their optimal values, without iterative updates through the BP algorithm, so model training is faster and takes less time. In DBN, the BP fine-tuning process uses the Sigmoid function as the activation function of the last layer, treating each category as a separate binary decision. The Sigmoid activation function is computationally expensive, and the BP error propagation involves division operations and is prone to vanishing gradients, so deep network training may fail to complete. The Softmax activation function, on the other hand, directly implements multi-class classification, and its output categories are mutually exclusive, avoiding multiple parallel possibilities, which better solves the problems of the Sigmoid function. Based on this, we propose the integrated deep intrusion detection models SDAE-ELM and DBN-Softmax, and apply them to network-based data sets and the host-based data set, respectively.

The integrated deep intrusion detection models proposed in this paper are mainly divided into three modules, and the structure is shown in Fig. 1.

Data preprocessing module: preprocessing operations on the data sets, mainly including feature extraction, data conversion, and data normalization, make the data sets meet the requirements of the input data. The training sets and testing sets are used for model training and model testing, respectively.

Intrusion detection module: it uses the dimensions of the preprocessed data to determine the number of input and output nodes of the network model, then determines the entire network structure and training parameters according to the hidden layers and other parameters, uses the training sets to train the model, and saves the model for testing after training.

Detection and classification module: the test sets are fed to the saved model for testing, and the detection and classification results are displayed to the user.

3.1. Integrated deep intrusion detection model based on SDAE-ELM

3.1.1. Stacked denoising autoencoder
If only one layer of Denoising AutoEncoder is used, the coding ability is relatively limited, whereas an SDAE uses multiple Denoising AutoEncoders for feature extraction: the Denoising AutoEncoders are stacked together, the output of each Denoising AutoEncoder is used as the input of the next, and the stack is pre-trained using a layer-by-layer initialization strategy. After pre-training is completed, the overall network parameters are corrected according to the output error of the last layer of the network. Finally, the classification of a sample is determined by the classifier.

3.1.2. Denoising autoencoder
The Denoising AutoEncoder (DAE) (Kachuee et al., 2019) is an extension of the AutoEncoder (AE) (Bengio et al., 2013) in which noise is added to the training data; the DAE learns to remove the noise from the input during training, so as to recover data that is not contaminated by noise. This learning method generalizes better than the plain AutoEncoder. The DAE schematic is shown in Fig. 2.

In Fig. 2, x is the original input data, x̄ is the damaged (noise-corrupted) data, y is the feature obtained by encoding x̄, and x̃ is the output obtained by decoding y. The reconstruction error is:

L(x, g(f(\bar{x}))) = \| x - g(f(\bar{x})) \|^2   (1)

The AutoEncoder uses the reconstruction error to represent the training effect during the training process, and requires the reconstruction error to be minimized so that the common features are preserved as much as possible. During the training of the DAE, the feature mapping f(·) is applied to the damaged data x̄; therefore, compared to the AutoEncoder, the DAE increases the robustness of the features learned during training.

The objective function of the Denoising AutoEncoder is given by:

J(W, b) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{2} \left\| x^{(i)} - g\big(f\big(x^{(i)}\big)\big) \right\|^2 \right) + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( W_{ji}^{(l)} \big)^2   (2)

where x^{(i)} represents the ith sample input of the Denoising AutoEncoder, W_{ji}^{(l)} denotes the connection weight between the ith unit of the lth layer and the jth unit of the (l+1)th layer, n is the number of samples, n_l is the number of network layers, s_l is the number of neurons in the lth layer, and λ is the regularization coefficient that limits the weight magnitudes to prevent overfitting.

After determining the objective function of the Denoising AutoEncoder, the error BP algorithm is used to fine-tune the parameters of each layer of the Denoising AutoEncoder, namely:

W = W - \alpha \frac{\partial}{\partial W} J_D(W, b)   (3)

b = b - \alpha \frac{\partial}{\partial b} J_D(W, b)   (4)

where α is the learning rate.
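To make Eqs. (1)–(4) concrete, the following minimal NumPy sketch performs one denoising-autoencoder update: it corrupts the input with masking noise, reconstructs it, and applies the gradient steps of Eqs. (3)–(4) together with the weight-decay term of Eq. (2). The layer sizes, sigmoid activations, masking-noise corruption, and all constants are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 41 inputs (as in KDD Cup99), 256 hidden units.
n_in, n_hid = 41, 256
W1 = rng.normal(0.0, 0.01, (n_hid, n_in)); b1 = np.zeros(n_hid)  # encoder f
W2 = rng.normal(0.0, 0.01, (n_in, n_hid)); b2 = np.zeros(n_in)   # decoder g
alpha, lam, p_noise = 0.0001, 1e-4, 0.1  # learning rate, weight decay, corruption rate

def dae_step(x):
    """One stochastic-gradient update of J(W, b) in Eq. (2) for one sample x."""
    global W1, b1, W2, b2
    x_bar = x * (rng.random(x.shape) > p_noise)   # masking noise: damaged input
    y = sigmoid(W1 @ x_bar + b1)                  # code y = f(x_bar)
    x_hat = sigmoid(W2 @ y + b2)                  # reconstruction g(f(x_bar))
    err = x_hat - x                               # compared against the CLEAN x, Eq. (1)
    d2 = err * x_hat * (1.0 - x_hat)              # sigmoid-output delta
    d1 = (W2.T @ d2) * y * (1.0 - y)              # backpropagate into the encoder
    W2 -= alpha * (np.outer(d2, y) + lam * W2)    # Eq. (3) with the weight-decay term
    W1 -= alpha * (np.outer(d1, x_bar) + lam * W1)
    b2 -= alpha * d2                              # Eq. (4)
    b1 -= alpha * d1
    return 0.5 * float(err @ err)                 # reconstruction loss, for monitoring

loss = dae_step(rng.random(n_in))  # toy usage on one random record
```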

Table 1 – Comparison of intrusion detection systems. Each entry lists category / sub-category, model, dataset, advantages, disadvantages, and the reported Accuracy (%), TPR (%), and FPR (%); "/" means not reported.

Data preprocessing / feature extraction — Serpen and Aghaei (2018), ADFA-LD. Advantages: higher detection effectiveness; feature extraction using Eigentraces. Disadvantages: single data set; no comparison experiment. Accuracy: /; TPR: 99.9 (binary), 99.9 (multi-class); FPR: 0.2 (binary), 0.2 (multi-class).

Data preprocessing / feature selection — Tama et al. (HFSTE, 2017), NSL-KDD. Advantages: a hybrid swarm intelligence algorithm for feature selection that yields a better feature subset; integrates hybrid feature selection techniques with multiple classifiers. Disadvantages: single data set; single evaluation indicator. Accuracy: 99.7; TPR: /; FPR: /.

Data preprocessing / feature selection — Beulah and Punithavathani (2017), NSL-KDD. Advantages: the feature selection method can be applied to any field; able to choose the best attributes. Disadvantages: single data set. Accuracy: 79.66; TPR: /; FPR: /.

Detection model optimization / parameter optimization — Ye et al. (2019), KDD Cup99. Advantages: speeds up convergence to avoid the model falling into a local optimum. Disadvantages: single data set; single evaluation indicator. Accuracy: 97.838; TPR: /; FPR: /.

Detection model optimization / parameter optimization — Duan et al. (2019), NSL-KDD. Advantages: avoids the model falling into a local optimum; improves the detection ability of the model. Disadvantages: single data set. Accuracy: 98.12; TPR: 97.23; FPR: /.

Detection model optimization / integrated model — Teng et al. (2018), KDD Cup99. Advantages: shortens the training time of the model; improves the detection effect; proposes an adaptive intrusion detection model. Disadvantages: single data set; single evaluation indicator. Accuracy: 89.02; TPR: /; FPR: /.

Detection model optimization / integrated model — Shone et al. (2018), KDD Cup99 and NSL-KDD. Advantages: builds a new NDAE; shorter model training time; more refined classification results. Disadvantages: only the DBN model was used for experimental comparison; the effect was not validated on newer data sets. Accuracy: 97.85 (KDD Cup99), 85.42 (NSL-KDD); FPR: 2.15 (KDD Cup99), 14.58 (NSL-KDD).

Different network environments / fog computing — de Araujo-Filho et al. (2020), SWaT, WADI and NSL-KDD. Advantages: improves the detection rate; detection is at least 5.5 times faster than traditional IDS. Disadvantages: only the reconstruction error is considered. The experimental results are expressed in graphical form.

Different network environments / fog computing — Prabavathy et al. (2018), NSL-KDD. Advantages: improves the detection speed of the model; reduces the false alarm rate. Disadvantages: single data set. Accuracy: 97.36 (binary), 96.54 (multi-class); TPR: 97.72 (binary); FPR: 0.37 (binary).

Different network environments / cloud computing — Wang et al. (2019), KDD Cup99 and NSL-KDD. Advantages: reduces the dimension of the data set and shortens training time while maintaining algorithm performance. Disadvantages: older data sets; multi-class classification was not considered. Accuracy: 99.85 (KDD Cup99), 98.64 (NSL-KDD); TPR: /; FPR: /.

Different network environments / cloud computing — Aljamal et al. (2019), UNSW-NB15. Advantages: uses a newer intrusion detection data set; clusters first and then classifies; K-means has better accuracy. Disadvantages: single data set; the SVM accuracy rate is low. Accuracy: 84.6 (16 clusters), 84 (32 clusters), 84.7 (64 clusters); TPR: /; FPR: /.

Different network environments / cloud computing — Ibrahim and Zainal (2018), a simulated cloud environment generating new data sets. Advantages: the data sets are more representative; the detection effect of the model is better. Disadvantages: no comparison experiments with other classical algorithms. Accuracy: 98.6 (Normal), 98 (DoS/DDoS); TPR: /; FPR: 0.07 (Normal), 0.15 (DoS/DDoS).

Fig. 1 – Intrusion detection model structure diagram.

Fig. 2 – Schematic diagram of DAE.

3.1.3. Extreme learning machine
The Extreme Learning Machine (ELM) is a single-hidden-layer feedforward network (SLFN) proposed by Huang (Huang et al., 2014). It can solve the problems of low efficiency and complicated parameter tuning in the BP algorithm. ELM completes training by minimizing the error function; its main feature is that the connection weights between the input layer and the hidden layer and the thresholds of the hidden layer nodes need to be calculated only once by the least squares method, instead of using a traditional iterative method. This training approach greatly shortens the training time and has strong generalization ability.

Given N training samples (x_i, t_i), where x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in R^n is the sample input and t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T \in R^m is the sample output value, for a single-hidden-layer network with L hidden nodes and excitation function g(x), the network output is

y_j = \sum_{i=1}^{L} \beta_i g_i(\omega_i x_j + b_i), \quad j = 1, 2, \ldots, N   (5)

where β_i is the weight vector between the ith hidden node and the output layer nodes; ω_i is the weight vector between the ith hidden node and the input layer nodes; b_i is the offset of the ith hidden node; y_j is the output value of the network.

Fig. 3 – SDAE-ELM model structure diagram.

When the activation function can approximate the N samples with zero error, that is, \sum_{i=1}^{N} \| y_i - t_i \| = 0, there exist β_i, ω_i and b_i such that

t_j = \sum_{i=1}^{L} \beta_i g_i(\omega_i x_j + b_i), \quad j = 1, 2, \ldots, N   (6)

The above formula can be expressed as

H\beta = T   (7)

where

H = \begin{bmatrix} g_i(w_1 x_1 + b_1) & \cdots & g_i(w_L x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g_i(w_1 x_N + b_1) & \cdots & g_i(w_L x_N + b_L) \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_L \end{bmatrix}, \quad T = \begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix}

Here H is the output matrix of the hidden layer, β is the output weight, and T is the desired output vector.

The connection weights between the hidden layer and the output layer are obtained by solving the least squares problem

\min_{\beta} \| H\beta - T \|   (8)

whose least squares solution is

\hat{\beta} = H^{+} T   (9)

where H^{+} is the generalized (Moore–Penrose) inverse of the hidden layer output matrix H.

3.1.4. SDAE-ELM model and detection procedures
In this paper, the integrated deep SDAE-ELM model is used as the network-based intrusion detection system, and the experimental data sets are all network-based data sets. The model first uses the SDAE to learn the features of the data sets; then the features learned by the SDAE are input into the ELM algorithm for fine-tuning, yielding the trained SDAE-ELM model. Finally, the test set data are input into the SDAE-ELM model to complete the intrusion detection. The structure of the SDAE-ELM model is shown in Fig. 3, and the SDAE-ELM intrusion detection flowchart is shown in Fig. 4.

The specific steps of SDAE-ELM intrusion detection are as follows:

Step 1: Preprocess the intrusion detection data sets, which mainly includes high-dimensional data feature mapping and data normalization.

Step 2: SDAE-ELM model training:

1) Initialize the model parameters and determine the structure of the network model;
2) Train the first layer of DAE with an unsupervised algorithm; once the reconstruction error of the reconstructed samples is controlled within a certain range, its output is used as the input of the next layer of DAE;
3) Using the output of the previous layer of DAE as input, train the DAE of the current layer with the unsupervised algorithm, so that its reconstruction error is also controlled within a certain range;
4) Repeat step 3) until all DAEs are trained;
5) Use the ELM algorithm to learn the features extracted by the DAEs; the optimal weights and thresholds of the model are determined in a single pass by the least squares method, and training continues until the specified number of training iterations is reached.
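The closed-form ELM solve of Eqs. (5)–(9), used in step 5 above, can be sketched as follows. The tanh excitation function, the uniform weight initialization, and the placeholder data are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, L=256):
    """Fit an ELM: random (w, b), then solve min ||H @ beta - T|| per Eqs. (8)-(9)."""
    n_feat = X.shape[1]
    W = rng.uniform(-1.0, 1.0, (n_feat, L))   # input-to-hidden weights, never updated
    b = rng.uniform(-1.0, 1.0, L)             # hidden thresholds
    H = np.tanh(X @ W + b)                    # hidden output matrix H of Eq. (7)
    beta = np.linalg.pinv(H) @ T              # least squares solution beta = H^+ T, Eq. (9)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Usage on placeholder SDAE features: 1000 samples, 256-dim codes, binary labels.
X = rng.random((1000, 256))
T = (rng.random((1000, 1)) > 0.5).astype(float)
W, b, beta = elm_train(X, T)
pred = elm_predict(X, W, b, beta) > 0.5
```

Because beta is obtained from a single pseudo-inverse rather than iterative BP updates, this step is where the model gains its training-speed advantage.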

Fig. 4 – SDAE-ELM intrusion detection flowchart.

Step 3: SDAE-ELM model test: input the test data sets into the trained SDAE-ELM model and obtain the classification result for each data set.

3.2. Integrated deep intrusion detection model based on DBN-Softmax

3.2.1. Deep belief network
A DBN is a multi-layer perceptron neural network formed by stacking multiple restricted Boltzmann machines (RBM) and one layer of BP neural network. The DBN training process includes two parts: pre-training and fine-tuning. Pre-training adopts an unsupervised layer-by-layer training method to learn the RBM parameters of each layer; the output of the hidden layer of a lower-level RBM is used as the input of the visible layer of the next RBM, so abstract feature parameters can be extracted from the original signal data. The BP neural network is used in the fine-tuning phase: the difference between the actual output and the label information is taken as the measurement error, and this error is propagated back layer by layer to fine-tune the weights and offsets of the entire DBN. After multiple iterations, the optimal parameters of the entire DBN are obtained.
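The greedy layer-by-layer pre-training shared by the SDAE steps above and the DBN just described can be expressed generically. In this sketch, `train_layer` is a hypothetical stand-in for any single-layer unsupervised trainer (a DAE or an RBM), not an API from the paper.

```python
def pretrain_stack(X, layer_sizes, train_layer):
    """Greedy layer-wise pre-training: each layer is trained unsupervised on the
    previous layer's output. `train_layer(data, n_hidden)` must return
    (layer_params, hidden_representation)."""
    params, H = [], X
    for n_hidden in layer_sizes:          # e.g. [256, 128, 64] as in Table 11
        p, H = train_layer(H, n_hidden)   # train this layer on the current codes
        params.append(p)                  # its output feeds the next layer
    return params, H                      # stacked parameters + top-level features
```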

Fig. 5 – Schematic diagram of the restricted Boltzmann Machine.

3.2.2. Restricted Boltzmann machine
The Restricted Boltzmann Machine (Chen et al., 2019) is an energy-based model, which can also be regarded as a special type of Markov random field; in general, an RBM represents the relationship between random variables. The Restricted Boltzmann Machine is composed of two layers, the visible layer v and the hidden layer h. As can be seen from Fig. 5, the visible layer and the hidden layer are fully connected to each other, with no connections within a layer, so the outputs of the hidden layer units capture higher-order correlations of the visible units, that is, the characteristics of the input data.

Let the visible layer units be v = \{v_1, v_2, \ldots, v_i\}, the hidden layer units be h = \{h_1, h_2, \ldots, h_j\}, and the internal parameter vector be θ = {w, a, b}, where a and b are the offsets of the visible layer and the hidden layer and w is the weight matrix between the visible layer and the hidden layer. The energy-based probability model of the RBM is then:

p(v, h, \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h, \theta))   (10)

where Z(θ) is the regularization (partition) factor, given by the sum of the energy terms over all visible and hidden configurations:

Z(\theta) = \sum_{v} \sum_{h} \exp(-E(v, h, \theta))   (11)

The system energy function of the RBM is:

E(v, h, \theta) = -\sum_{i=1}^{D} a_i v_i - \sum_{j=1}^{F} b_j h_j - \sum_{i=1}^{D} \sum_{j=1}^{F} w_{ij} v_i h_j = -a^T v - b^T h - v^T W h   (12)

where w_{ij} represents the weight between visible unit i and hidden unit j, and a_i and b_j are the bias values of the visible layer and hidden layer units, respectively.

In the classification task, the output of the RBM is 0 or 1 and the Sigmoid activation function is usually used; the conditional probabilities of h and v are then:

P(h_j = 1 \mid v) = \mathrm{sigmoid}\Big( b_j + \sum_{i} v_i w_{ij} \Big)   (13)

P(v_i = 1 \mid h) = \mathrm{sigmoid}\Big( a_i + \sum_{j} h_j w_{ij} \Big)   (14)

Eqs. (13) and (14) are the core of RBM training. When an input is presented to the visible layer, the mapping from the visible layer to the hidden layer is first computed by Eq. (13), and that mapping is fed into Eq. (14) to recompute the probabilities of the visible layer. The error between the input data and the reconstructed data is calculated and used to adjust the network parameters until the error is reduced to a minimum; at that point, the output of the hidden layer can be used to represent the visible layer input. This entire process is how the RBM extracts data features.

Because Gibbs sampling is time-consuming for high-dimensional data, Hinton proposed the Contrastive Divergence algorithm (CD-k) (Hinton, 2002) in 2002 to learn RBMs. The main steps of the Contrastive Divergence method are shown in Algorithm 1.

3.2.3. Softmax classifier
The principle of the Softmax classifier (Zeng et al., 2014) is very simple: it is an extension of Logistic Regression (LR). The biggest difference between the two is that LR category labels can take only two values, while Softmax allows multiple category labels, making it more suitable for multi-class classification problems. The Softmax classifier maps input vectors from an N-dimensional space to categories and outputs the classification results as probabilities. The probability formula is:

h_\theta\big(x^{(i)}\big) = \begin{bmatrix} p\big(y^{(i)} = 1 \mid x^{(i)}; \theta\big) \\ p\big(y^{(i)} = 2 \mid x^{(i)}; \theta\big) \\ \vdots \\ p\big(y^{(i)} = K \mid x^{(i)}; \theta\big) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} e^{\theta_k^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_K^T x^{(i)}} \end{bmatrix}   (15)

where p(y^{(i)} = K \mid x^{(i)}; \theta) represents the probability that x^{(i)} belongs to category K, and the sum of all elements in the vector is equal to one.
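A minimal sketch of Eq. (15) follows, with the standard max-subtraction trick for numerical stability (an implementation detail not discussed in the paper); the class and feature counts are illustrative.

```python
import numpy as np

def softmax_probs(theta, x):
    """Class probabilities of Eq. (15) for one sample x (theta: K x d, x: d)."""
    logits = theta @ x            # theta_k^T x for k = 1..K
    logits -= logits.max()        # leaves Eq. (15) unchanged; avoids exp overflow
    e = np.exp(logits)
    return e / e.sum()            # entries are positive and sum to one

theta = np.zeros((5, 41))         # e.g. K = 5 classes, 41 features (illustrative)
p = softmax_probs(theta, np.ones(41))
label = int(np.argmax(p)) + 1     # pick the most probable category
```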

For each training sample x^{(i)}, we choose the category K with the maximum probability as the final classification result. The parameter θ is obtained from the cost function, which in this paper is:

J(\theta) = -\frac{1}{n} \left[ \sum_{i=1}^{n} \sum_{j=1}^{K} 1\{ y^{(i)} = j \} \log \frac{ e^{\theta_j^T x^{(i)}} }{ \sum_{k=1}^{K} e^{\theta_k^T x^{(i)}} } \right]   (16)

where 1{·} is an indicator function that equals 1 when its argument is true and 0 when it is false. Minimizing J(θ) yields the parameter θ.

Algorithm 1 – CD-k algorithm steps.

CD-k algorithm of RBM
Input: training data S = {x_1, x_2, …, x_l}; v_i (i = 1, 2, …, n) and h_j (j = 1, 2, …, m) are the visible layer units and hidden layer units, respectively; learning rate γ; CD algorithm parameter k.
Output: model parameters θ = {w, a, b}
1. Randomly initialize the model parameters θ = {w, a, b}
2. For each training sample x_l in S, do
   v^(0) ← x_l
   For t = 0, 1, …, k−1 do
      For i = 1, …, n: sample h_i^(t) ∼ P(h_i | v^(t)) according to Eq. (13)
      For j = 1, …, m: sample v_j^(t+1) ∼ P(v_j | h^(t)) according to Eq. (14)
   For i = 1, …, n, j = 1, …, m:
      Compute Δw_ij = γ × (P(h_j = 1 | v^(0)) v_i^(0) − P(h_j = 1 | v^(k)) v_i^(k)), and update the weight w_ij = w_ij + Δw_ij
      Compute Δa_i = γ × (v_i^(0) − v_i^(k)), and update a_i = a_i + Δa_i
      Compute Δb_j = γ × (P(h_j = 1 | v^(0)) − P(h_j = 1 | v^(k))), and update b_j = b_j + Δb_j
   end

3.2.4. DBN-Softmax model and detection procedures
In a DBN, each layer is a Restricted Boltzmann Machine; that is, the entire network is regarded as a stack of several RBMs. After unsupervised layer-by-layer training, the BP algorithm is used to train the entire network. After training, the DBN model uses a Sigmoid classifier to classify the data. However, the outputs of the Sigmoid classifier are independent of each other and may yield multiple results in parallel, whereas for intrusion detection we want to identify the exact type of intrusion. Therefore, we use a Softmax classifier instead of the original Sigmoid classifier: because the outputs of the Softmax classifier are interrelated, it can pin down the type of the data and improve the accuracy of intrusion detection. The model structure of DBN-Softmax is shown in Fig. 6, and the DBN-Softmax intrusion detection flowchart is shown in Fig. 7.

The specific steps of DBN-Softmax intrusion detection are as follows:

Step 1: Preprocess the intrusion detection data set: use the bag-of-words model to process the data set and normalize the processed data.

Step 2: DBN-Softmax model training:

1) Initialize the model parameters and determine the structure of the network model;
2) Train the first layer of RBM with the unsupervised algorithm: the hidden-layer response and the visible-layer reconstruction are calculated by Eqs. (13) and (14), and the layer's output is taken as the input of the next RBM;
3) The output of the previous RBM is used as the input of the current RBM, and the unsupervised algorithm is used to train this layer's RBM;
4) Repeat step 3) until all RBMs are trained;
5) Learn the features extracted by the RBMs through the Softmax classifier, and use the BP algorithm to adjust the weights and thresholds of the network so as to reduce the prediction error of the output and bring the results toward the expected output, until the specified number of training iterations is reached.

Step 3: DBN-Softmax model test: input the test data set into the trained DBN-Softmax model and obtain the classification result for each data set.

3.3. Model complexity analysis

Suppose the number of training samples is m, the mini-batch size is batchsize, the number of iterations is l, the total number of hidden layers is N, the numbers of neurons in the input and output layers are n_1 and n_3, and the number of neurons in the jth hidden layer is n_{2j}. In the original SDAE model with a single hidden layer, matrix multiplications of size n_1 × n_2 and n_2 × n_3 are carried out in the pre-training stage, so the time complexity of processing one sample in pre-training is O(n_1 n_2 + n_2 n_3); the fine-tuning stage has the same per-sample complexity. The overall time complexity of the SDAE model is therefore

O\big( m \cdot l \cdot 2 \cdot \big( n_1 n_{21} + \sum_{j=1}^{N-1} n_{2j} n_{2j+1} + n_{2N} n_3 \big) \big).

The SDAE-ELM model improves on the SDAE: it only needs feature extraction and a single fine-tuning pass, and it uses mini-batch training, so its overall time complexity is

O\big( batchsize \cdot l \cdot \big( n_1 n_{21} + \sum_{j=1}^{N-1} n_{2j} n_{2j+1} + n_{2N} n_3 \big) \big).

The training processes of the original DBN and SDAE models are basically the same, so the overall time complexity of the DBN model is

O\big( m \cdot l \cdot 2 \cdot \big( n_1 n_{21} + \sum_{j=1}^{N-1} n_{2j} n_{2j+1} + n_{2N} n_3 \big) \big).

Compared with the DBN, DBN-Softmax only replaces the original BP classifier with the Softmax classifier, so their time complexities are basically the same; however, DBN-Softmax uses mini-batch training, so its overall time complexity is

O\big( batchsize \cdot l \cdot 2 \cdot \big( n_1 n_{21} + \sum_{j=1}^{N-1} n_{2j} n_{2j+1} + n_{2N} n_3 \big) \big).

4. Data sets description

In order to verify the detection capabilities of the SDAE-ELM model for network-based intrusion detection, this paper performs intrusion detection not only on the older intrusion detection data sets KDD Cup99 and NSL-KDD, but also on the newer data sets UNSW-NB15 and CIDDS-001, to verify whether the model can detect new types of network attacks. To verify the detection capability of the DBN-Softmax model for host-based intrusion detection, we evaluate its detection ability on the ADFA-LD data set.
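Returning to Algorithm 1 above, the following minimal sketch performs one CD-k update for a binary RBM, using the conditionals of Eqs. (13) and (14). Vectorized NumPy replaces the explicit index loops, and the learning rate is an arbitrary example value.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd_k(v0, W, a, b, gamma=0.01, k=1):
    """One CD-k update (Algorithm 1): sample h from Eq. (13), resample v from
    Eq. (14) k times, then move the parameters along the difference between
    data-driven and reconstruction-driven statistics.
    Shapes: v0 (n,), W (n, m), a (n,) visible bias, b (m,) hidden bias."""
    p_h0 = sigmoid(b + W.T @ v0)                  # P(h = 1 | v^(0)), Eq. (13)
    v = v0
    for _ in range(k):
        h = (rng.random(b.shape) < sigmoid(b + W.T @ v)).astype(float)
        v = (rng.random(a.shape) < sigmoid(a + W @ h)).astype(float)   # Eq. (14)
    p_hk = sigmoid(b + W.T @ v)                   # P(h = 1 | v^(k))
    W += gamma * (np.outer(v0, p_h0) - np.outer(v, p_hk))  # weight update
    a += gamma * (v0 - v)                         # visible-bias update
    b += gamma * (p_h0 - p_hk)                    # hidden-bias update
    return W, a, b
```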

Fig. 6 – DBN-Softmax model structure.

Fig. 7 – DBN-Softmax intrusion detection flowchart.



Table 2 – Training and testing connection records of KDD Cup99 (10% data) and NSL-KDD (20% data).

Attack category | Description | KDD Cup99 Train | KDD Cup99 Test | NSL-KDD Train | NSL-KDD Test
Normal | Normal connection records | 97,277 | 60,592 | 13,357 | 9690
Probe | Obtaining detailed statistics of system and network configuration details | 4107 | 4166 | 2289 | 2421
DoS | Attacker aims at making network resources down | 391,438 | 229,825 | 9234 | 7435
U2R | Obtaining root or super-user access on a particular computer | 52 | 228 | 11 | 200
R2L | Illegal access from a remote computer | 1126 | 16,189 | 209 | 2754
Total | | 494,000 | 311,000 | 25,100 | 22,500

4.1. Network-based intrusion detection data sets

(1) KDD Cup99 (KDD, 2020): This data set comes from the 1998 DARPA intrusion detection evaluation project. All the network data comes from a simulated US Air Force LAN, into which many simulated attacks were injected. The training data is 7 weeks of network traffic containing about 5 million network connections; the test data is 2 weeks of network traffic, including about 2 million network connections. The data set has two forms, the complete data set and a 10 percent subset. It contains forty-one attributes and one category label with five categories: Normal, Probe, DoS, U2R, and R2L. Table 2 describes the KDD Cup99 data set in detail.

(2) NSL-KDD (NSL-KDD, 2020): This data set is an improvement of KDD Cup99 in which the redundant data and duplicated records of KDD Cup99 are deleted. It includes the complete data set and a 20 percent subset, and is more suitable for misuse detection than KDD Cup99. Table 2 details the NSL-KDD data set.

(3) UNSW-NB15 (UNSW-NB15, 2020): This data set was generated from network traffic collected by the Australian security laboratory in 2015. It is a comprehensive network attack traffic data set with a training set and a test set, consisting of one normal flow class and nine abnormal flow classes. Each data flow is described by forty-two features plus the final label. The detailed description of the UNSW-NB15 data set is shown in Table 3.

(4) CIDDS-001 (CIDDS-001, 2020): This data set is based on labeled flows and is used to evaluate anomaly intrusion detection systems. The data was generated from OpenStack and External servers by simulating a small business. The data set consists of three log files (attack log, client configuration and client log). The OpenStack and External servers captured 3.12 million and 60,000 network flows, respectively, including ten feature attributes and one category label. Table 4 describes the External part of the CIDDS-001 data set in detail.

4.2. Host-based intrusion detection data set

ADFA-LD (Linux Dataset) (ADFA-LD, 2020): This data set is a set of host-level intrusion detection data released by the Australian Defence Force Academy; it records the system call sequences of intrusion events (single process, system call API within a time window). The data set mainly contains three types of data. Tables 5 and 6 detail the ADFA-LD data set.

For the network-based intrusion detection data sets, the original data cannot be directly input into the network model, so the data sets must be preprocessed in advance. The preprocessing includes two steps: (1) convert the symbolic features in the training and testing sets into numerical representations; (2) convert the category labels into numerical representations. The data set preprocessing is described in detail in Tables 7–9.

For the host-based intrusion detection data set, because the data is generated as time series of system calls, we cannot directly input the original data into the detection model and must preprocess it. The preprocessing mainly includes two steps: (1) use the Bag of Words (BOW) model (Bahmanyar et al., 2015) to process the data set. The BOW model is mainly used for text data; it does not consider the contextual relationship between the words in a text, only their weights, where a word's weight is related to the frequency with which it appears in the text. We therefore use the BOW model to characterize the ADFA-LD data set and convert it into a data set that the neural network can process. (2) Convert the category labels into numerical representation. This preprocessing is described in detail in Table 10.

We randomly select 20,000 connections from each network-based data set and 2000 connections from the host-based data set and visualize them using the t-SNE method (Van der Maaten and Hinton, 2008). The effectiveness of t-SNE can be seen in Fig. 8(a), where the horizontal axis represents distance and the vertical axis represents density. For points of greater similarity, the distance of the t-distribution in the low-dimensional space needs to be slightly smaller; for points of low similarity, the distance needs to be longer. This just meets our needs: points within the same cluster (closer distance) are more closely aggregated, while points in different clusters (longer distance) are more separated.
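The two preprocessing routes just described can be sketched as follows: index-mapping of symbolic features per Tables 7–9, and bag-of-words counts over a system-call trace for ADFA-LD. The column position of protocol_type and the vocabulary size are illustrative assumptions.

```python
import numpy as np

# Symbolic-to-numeric mapping as in Tables 7-9 (protocol_type shown; the same
# idea applies to service, flag, and the category labels).
protocol_map = {"tcp": 1, "udp": 2, "icmp": 3}

def encode_record(record):
    """Replace each symbolic field of one connection record by its table index."""
    out = list(record)
    out[1] = protocol_map[out[1]]   # assumes protocol_type sits in column 1
    return out

def bow_trace(trace, n_calls):
    """Bag-of-words vector for one ADFA-LD trace: counts of each system-call id,
    ignoring call order, as described for the host-based preprocessing."""
    vec = np.zeros(n_calls)
    for call_id in trace:
        vec[call_id] += 1
    return vec

x = bow_trace([3, 3, 5, 102, 5], n_calls=340)   # toy trace; 340 is illustrative
```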

Table 3 – Training and testing connection records of UNSW-NB15.

Attack_cat Description Train Test


Normal Normal connection records 56,000 37,000
Backdoor Technology to gain access to programs or systems by bypassing 1746 583
security controls
Analysis An intrusion method to infiltrate web applications through ports, 2000 677
emails, and web scripts
Fuzzers An attack method that attempts to discover security vulnerabilities 18,184 6062
in a program, operating system, or network, by entering a large
amount of random data to crash
Shellcode An attack method that controls the target machine by sending 1133 378
code that exploits specific vulnerabilities
Reconnaissance An attack method to collect computer network information to 10,491 3496
escape security control
Exploit A piece of code that controls a target system by triggering a 33,352 11,132
vulnerability (or several vulnerabilities)
DoS An attack method that directly or indirectly exhausts the resources 12,264 4089
of the attacked object, so that the target computer or network
cannot provide normal service or resource access
Worms A malicious computer virus that actively spreads through the 130 44
network
Generic A technique that uses a hash function to collide each block cipher 40,000 18,871
regardless of the configuration of the block cipher
Total 175,300 82,332

From Fig. 8(b), 8(c), and 8(d), we can see that all the data sets are non-linearly separable; and from the connection records we can see that, compared to the KDD Cup99 data set, the UNSW-NB15 and ADFA-LD data sets are more complicated.

Table 4 – Training and testing connection records of CIDDS-001-External.

Class | Train | Test
Normal | 130,000 | 4240
Attacker | 10,000 | 2260
Suspicious | 430,000 | 7911
Unknown | 70,000 | 7923
Victim | 8000 | 907
Total | 648,000 | 23,241

Table 5 – Number of system call traces in different categories of ADFA-LD.

Category | Traces | System Calls
Training data | 833 | 308,077
Validation data | 4372 | 2,122,085
Attack data | 746 | 317,388
Total | 5951 | 2,747,550

Table 6 – ADFA-LD dataset attack types.

Attack | Description | Trace Count
Adduser | Client poisoned executable file | 91
Hydra_FTP | FTP brute-force cracking | 162
Hydra_SSH | SSH brute-force cracking | 176
Java_Meterpreter | TikiWiki vulnerability attack | 124
Meterpreter | Client poisoned executable file | 75
Web_shell | PHP remote file inclusion vulnerability | 118

5. Experimental setup

The experiments reported in this paper were conducted on an Intel Core i7 dual-core CPU with a 2.5 GHz main frequency, 8 GB of memory, and the Windows 10 operating system. The simulation experiments were carried out using MATLAB R2017b.

5.1. Hyperparameter setting

Since the SDAE-ELM and DBN-Softmax models are parameterized, the selection of their parameters greatly affects the classification performance of the models. In this paper, the optimal parameters and network topology of the models are determined only on the KDD Cup99 dataset, and these parameters and topology are then applied to the NSL-KDD, UNSW-NB15, CIDDS-001, and ADFA-LD data sets. We chose shallow SDAE-ELM and DBN-Softmax models for the experiments. The SDAE-ELM model includes three layers: an input layer, a hidden layer, and an output layer. For the KDD Cup99 dataset, the input layer contains 41 neurons, the hidden layer can contain 32, 64, 128, 256 or 512 units, and the output layer contains 1 unit, used to distinguish normal connections from the attack types. The input layer and the hidden layer, and the hidden layer and the output layer, are fully connected. For each candidate hidden layer size we run the model 100 times; with 256 hidden units the model detects the most attacks, and when the number of hidden units increases from 256 to 512, the model's detection efficiency deteriorates due to overfitting.
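The hidden-width selection just described can be expressed as a simple search loop; `train` and `validate` are hypothetical placeholders for the actual SDAE-ELM training and evaluation pipeline.

```python
import numpy as np

def select_hidden_size(train, validate, sizes=(32, 64, 128, 256, 512), runs=100):
    """Grid search over the hidden-layer width as described above: each candidate
    width is trained `runs` times and the width detecting the most attacks wins.
    `train(size)` returns a fitted model; `validate(model)` returns a score."""
    best_size, best_score = None, -np.inf
    for size in sizes:
        score = np.mean([validate(train(size)) for _ in range(runs)])
        if score > best_score:
            best_size, best_score = size, score
    return best_size   # 256 for the SDAE-ELM model on KDD Cup99 in this paper
```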

Fig. 8 – Dataset visualization.

Table 7 – Pretreatment of KDD Cup99 and NSL-KDD.

Symbolic protocol_type tcp=1,udp=2,icmp=3


features service auth=1,bgp=2,courier=3,cenet_ns=4,ctf=5,daytime=6,discard=7,domain=8,domain_u = 9,echo=10,
eco_i = 11,ecr_i = 12,efs=13,exec=14,finger=15,ftp=16,ftp_data=17,gopher=18,hostname=19,http=20,
http_443=21,http_8001=22,imap4=23,IRC=24,iso_tsap=25,klogin=26,kshell=27,ldap=28,link=29,
login=30,mtp=31,name=32,netbios_dgm=33,netbios_ns=34,netbios_ssn=35,netstat=36,nnsp=37,
nntp=38,ntp_u = 39,other=40,pm_dump=41,pop_2 = 42,pop_3 = 43,printer=44,private=45,red_i = 46,
remote_job=47,rje=48,shell=49,smtp=50,sql_net=51,ssh=52,sunrpc=53,supdup=54,systat=55,
telnet=56,tftp_u = 57,tim_i = 58,time=59,urh_i = 60,urp_i = 61,uucp=62,uucp_path=63,vmnet=64,
whois=65,X11=66,Z39_50=67
flag OTH=1,REJ=2,RSTO=3,RSTOSO=4,RSTR=5,S0=6,S1=7,S2=8,S3=9,SF=10,SH=11
Category label Normal=1,Probe=2,DoS=3,R2L=4,U2R=5
label

Therefore, in the following experiments, the number of hidden layer neurons is set to 256. With this relatively small number of parameters, the SDAE-ELM model achieves good classification results early in the iterations, so we do not tune the final number of iterations for this model; applying the above parameters to the DBN-Softmax model, it likewise obtains a good detection effect at the early stage of the iterations. To determine the final number of iterations for the remaining algorithms: after 100 rounds of model training a good classification effect is achieved, and increasing the number of training rounds further does not significantly improve it, so we set the number of training iterations to 100. The learning rate has a great influence on the training speed of the model.

Table 8 – Pretreatment of UNSW-NB15.

Symbolic state ACC=1,CLO=2,CON=3,ECO=4,FIN=5,INT=6,PAR=7,REQ=8,RST=9,URN=10,no=11


features proto 3pc=1,a/n = 2,aes-sp3-d = 3,any=4,argus=5,aris=6,arp=7,axe.25=8,bbn-rcc=9,bna=10,br-sat-mon=11,
cbt=12,cftp=13,chaos=14,compaq-peer=15,cphb=16,cpnx=17,crtp=18,crudp=19,dcn=20,ddp=21,
ddx=22,dgp=23,egp=24,eigrp=25,emcon=26,encap=27,etherip=28,fc=29,fire=30,ggp=31,gmtp=32,
gre=33,hmp=34,i-nlsp=35,iatp=36,ib=37,icmp=38,idpr=39,idpr-cmtp=40,idrp=41,ifmp=42,igmp=43,
igp=44,il=45,ip=46,ipcomp=47,ipcv=48,ipip=49,iplt=50,ipnip=51,ippc=52,ipv6=53,ipv6-frag=54,
ipv6-no=55,ipv6-opts=56,ipv6-route=57,ipx-n-ip=58,irtp=59,isis=60,isoip=61,isotp4=62,
kryptolan=63,l2tp=64,larp=65,leaf-1 = 66,leaf-2 = 67,merit-inp=68,mfe-nsp=69,mhrp=70,micp=71,
mobile=72,mtp=73,mux=74,narp=75,netblt=76,nsfnet_igp=77,nvp=78,ospf=79,pgm=80,pim=81,
pipe=82,pnni=83,pri-enc=84,prm=85,ptp=86,pup=87,pvp=88,qnx=89,rdp=90,rsvp=91,rtp=92,rvd=93,
sat-expak=94,sat-mon=95,sccopmce=96,scps=97,sctp=98,sdrp=99,secure-rmtp=100,sep=101,
skip=102,sm=103,smp=104,snp=105,sprite-rpc=106,sps=107,srp=108,st2=109,stp=110,sun-nd=111,
swipe=112,tcf=113,tcp=114,tlsp=115,tp++=116,trunk-1 = 117,trunk-2 = 118,ttp=119,udp=120,unas=121,
uti=122,vines=123,visa=124,vmtp=125,vrrp=126,wb-expak=127,wb-mon=128,wsn=129,xnet=130,
xns-idp=131,xtp=132,zero=133
service dhcp=1,dns=2,ftp=3,ftp-data=4,http=5,irc=6,pop3=7,radius=8,smtp=9,snmp=10,ssh=11,ssl=12,-=13
Category Attack_cat Normal=1,Fuzzers=2,Analysis=3,Backdoors=4,DoS=5,Exploit=6,Generic=7,Reconnaissance=8,
label Shellcode=9,Worms=10

Table 9 – Pretreatment of CIDDS-001.

Symbolic Proto GRE=1,ICMP=2,TCP=3,UDP=4


features Flags ...…=1,....S.=2,…R..=3,…RS..=4,.A....=5,.A…F = 6,.A..SF=7,.A.R..=9,.A.R.F.=10,.A.RS.=11,.A.RSF=
12,.AP…=13,.AP.S.=14,.AP.SF=15,.APRS=16,.APRSF=17,Ox52=18,Ox53=19,Ox5a=20,Ox5b=21,
Oxc2=22,Oxc6=23,Oxd2=24,Oxd3=25,Oxd6=26,Oxd7=27,Oxda=28,Oxdb=29,Oxde=30,Oxdf=31
attackType bruteForce=1,portScan=2,—=3
attackID —=23
Attack 100 passwords=1,20 passwords=2,nmap args:-sS –T2=3,nmap args:-sU –T2=4,—=5
Description
Category class Normal=1,Attacker=2,Suspicious=3,Unknown=4,Victim=5
label

Table 10 – Pretreatment of ADFA-LD.

Category label class Normal=1,Adduser=2,Hydra_FTP=3,Hydra_SSH=4,Java_Meterpreter=5,Meterpreter=6,Web_Shell=7

The learning rate can be taken directly in [0, 1], so a large number of experiments are required. To avoid repeated experiments, we refer to the literature (Tang et al., 2016) to determine the final learning rate as 0.0001. Finally, we use the same optimal parameters and network topology for the SDAE-ELM and DBN-Softmax models. The final model topology determined in this paper is shown in Table 11.

Table 11 – Model topology.

Model name                   Number of hidden neurons
SDAE-ELM1 / DBN-Softmax1     256
SDAE-ELM2 / DBN-Softmax2     256, 128
SDAE-ELM3 / DBN-Softmax3     256, 128, 64
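As an illustration of Table 11, the sketch below builds a plain dense stack with the three hidden-layer configurations and the learning rate fixed above. It is only a stand-in for the layer widths, not the SDAE-ELM or DBN-Softmax training procedure itself, and the helper name is ours.

```python
# Sketch of the Table 11 topologies (256 / 256,128 / 256,128,64) with
# the learning rate of 0.0001 fixed above. A plain dense stack is used
# as a stand-in; build_topology() is illustrative, not the authors' code.
import tensorflow as tf

TOPOLOGIES = {1: (256,), 2: (256, 128), 3: (256, 128, 64)}

def build_topology(input_dim, variant=3, n_classes=2):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(input_dim,)))
    for units in TOPOLOGIES[variant]:          # hidden stack from Table 11
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```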
Besides, the unique features of SDAE-ELM and DBN-Softmax in this paper are the loss function, the activation function, and the use of Mini-Batch gradient descent to train the model, which together maximize the learning efficiency of the model.

5.1.1. Loss function

The main goal of model training is to minimize the loss function by optimizing the parameters of the neural network, thereby improving the classification effect of the model. There are many types of loss functions, and different training targets require different loss functions. The final output of the output layer of the SDAE-ELM and DBN-Softmax models is the probability distribution of the actual output, so we choose the cross-entropy loss function as the loss function in this paper. The cross-entropy mainly describes the distance between the actual output probability and the expected output probability: the smaller the value of the cross-entropy, the closer the actual output probability is to the expected output probability. Because this paper carries out both binary classification and multi-class classification on intrusion detection data, we adopt a different loss function for each classification standard: the binary_crossentropy loss function for binary classification and the categorical_crossentropy loss function for multi-class classification. Eqs. (17) and (18) give the binary_crossentropy and categorical_crossentropy loss functions, respectively.
loss = -(y_i log p_i + (1 - y_i) log(1 - p_i))    (17)

where y_i is the label of the sample data (the positive class is 1 and the negative class is 0), and p_i is the probability that sample i is predicted to be positive.

loss = -Σ_c y_ic log p_ic    (18)

where y_ic is the indicator variable, equal to 1 if category c is the same as the category of sample i and 0 otherwise, and p_ic is the probability that sample i is predicted to be of category c.
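For reference, Eqs. (17) and (18) can be transcribed directly into NumPy as follows; the epsilon clipping is our addition to guard against log(0).

```python
# NumPy transcriptions of Eqs. (17) and (18); eps guards against log(0).
import numpy as np

def binary_crossentropy(y, p, eps=1e-12):
    """Eq. (17): y is the 0/1 label, p the predicted P(positive)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def categorical_crossentropy(y_onehot, p, eps=1e-12):
    """Eq. (18): y_onehot holds y_ic, p the predicted class probabilities."""
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))
```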
5.1.2. Activation function

If no activation function is used in the neural network, then no matter how many layers we train, the final model output is a linear combination of the inputs. Therefore, we need an activation function to act on the weighted sum of the input and the bias and determine whether a neuron node fires. The Rectified Linear Unit (ReLU) activation function (Nair and Hinton, 2010) was proposed by Nair and Hinton in 2010; compared to the Sigmoid and tanh activation functions, ReLU retains as much linearity as possible and has no saturation region and no gradient disappearance. Moreover, its calculation is simple, its efficiency is very high, and its convergence speed is fast. The formula of ReLU is defined as follows:

f(x) = max(0, x)    (19)

where x is the output of the input layer.
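Eq. (19) amounts to a single element-wise maximum, as the minimal sketch below shows.

```python
# Eq. (19): ReLU zeroes negative pre-activations and passes the rest.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# relu(np.array([-2.0, 0.0, 3.0])) -> array([0., 0., 3.])
```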
5.1.3. Mini-Batch gradient descent

Gradient Descent (GD) is an iterative method that seeks the global minimum of the objective function in the direction of the negative gradient; in deep learning, the objective function is the loss function to be minimized. In traditional deep neural network models, Stochastic Gradient Descent (SGD) (Liu et al., 2019) and Batch Gradient Descent (BGD) (Si et al., 2019) are often used to optimize the objective function. The two methods have their own advantages and disadvantages in different application ranges, so in order to improve the training speed of the model, Mini-Batch Gradient Descent (MBGD) (Messaoud et al., 2020) is used to train the network model in this paper. BGD needs to update the parameters with all samples each time the weights are updated in order to obtain the global optimal solution and the loss function. However, this becomes very tricky for big data: the training process is very slow, and training may be unable to continue because of insufficient memory. SGD updates the parameters by training on a single sample each time, which is fast. But the gradient of the loss function calculated from a random sample deviates from the gradient calculated from all samples, which may give a bad gradient direction and may fail to reach the global optimal solution. MBGD uses a part of the samples to update the parameters each time, which overcomes the shortcomings of SGD and BGD and takes into account the advantages of both methods. MBGD can update model parameters faster, improve the computational efficiency of the model, and help the model avoid falling into a local optimum. Table 12 shows the advantages and disadvantages of SGD, BGD and MBGD.

The size of the mini-batch is a parameter that is independent of the overall architecture of the network, so we do not need to use the rest of the optimized hyperparameters to optimize it. Finally, we determined the size of the small batch of data as 100. Fig. 9 shows the loss function curves of BGD and MBGD on the SDAE-ELM network on the KDD Cup99 dataset. Due to the noise, MBGD oscillates during the learning process; but overall, the loss function value of MBGD is less than that of BGD.
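A minimal sketch of such a mini-batch update loop, assuming a user-supplied gradient function and the batch size of 100 chosen above, is given below; it illustrates the MBGD scheme generically rather than reproducing the authors' implementation.

```python
# Generic MBGD loop with the batch size of 100 chosen above. grad_fn is
# a user-supplied placeholder returning parameter gradients on a batch;
# this is a sketch, not the authors' implementation.
import numpy as np

def mbgd(X, y, params, grad_fn, lr=0.0001, batch_size=100, epochs=100):
    n = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(n)      # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grads = grad_fn(params, X[idx], y[idx])
            for k in params:                  # gradient step per parameter
                params[k] -= lr * grads[k]
    return params
```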
5.2. Model evaluation criteria

This paper uses the accuracy, precision, true positive rate, false positive rate, F value, P-R curve, ROC curve, and AUC value to evaluate the classification ability of the models. These values are obtained from the confusion matrix in Table 13, where True Positive (TP) is the number of connection records correctly classified to the Normal class, True Negative (TN) is the number of connection records correctly classified to the Attack class, False Positive (FP) is the number of Normal connection records wrongly classified as Attack connection records, and False Negative (FN) is the number of Attack connection records wrongly classified as Normal connection records.

Indicators for model evaluation (Zaidi et al., 2016; Liang et al., 2020) are defined as follows:

Accuracy: It estimates the ratio of the correctly recognized connection records to the entire test dataset. The higher the accuracy, the better the detection model (Accuracy ∈ [0, 1]). Accuracy serves as a good measure for a test dataset that contains balanced classes and is defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (20)

Precision: It estimates the ratio of the correctly identified attack connection records to the number of all identified attack connection records. The higher the Precision, the better the detection model (Precision ∈ [0, 1]). Precision is defined as follows:

Precision = TP / (TP + FP)    (21)

F1-Score: F1-Score is also called F1-Measure. It is the harmonic mean of Precision and Recall. The higher the F1-Score, the better the detection model (F1-Score ∈ [0, 1]). F1-Score is defined as follows:

F1-Score = (2 × Precision × Recall) / (Precision + Recall)    (22)

True Positive Rate (TPR): It is also called Recall. It estimates the ratio of the correctly classified attack connection records to the total number of attack connection records. The higher the TPR, the better the detection model (TPR ∈ [0, 1]). TPR is defined as follows:

TPR = TP / (TP + FN)    (23)

False Positive Rate (FPR): It estimates the ratio of the normal connection records flagged as attacks to the total number of normal connection records. The lower the FPR, the better the detection model (FPR ∈ [0, 1]). FPR is defined as follows:

FPR = FP / (TN + FP)    (24)

Receiver Operating Characteristics (ROC) curve: ROC is plotted based on the trade-off between the TPR on the y axis and the FPR on the x axis across different thresholds. The area under the ROC curve (AUC) is used along with ROC as a comparison metric for the detection model. The higher the AUC, the better the machine learning model.

AUC = ∫₀¹ [TP / (TP + FN)] d[FP / (TN + FP)]    (25)
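The indicators of Eqs. (20)-(24) follow directly from the four confusion-matrix counts, and Eq. (25) is in practice computed from labels and predicted scores, for example with scikit-learn; the helper below is a sketch under these assumptions.

```python
# Eqs. (20)-(24) from the confusion-matrix counts; Eq. (25) is computed
# from labels and predicted scores with scikit-learn's ROC integration.
from sklearn.metrics import roc_auc_score

def detection_metrics(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)          # Eq. (20)
    precision = tp / (tp + fp)                           # Eq. (21)
    tpr       = tp / (tp + fn)                           # Eq. (23), recall
    f1        = 2 * precision * tpr / (precision + tpr)  # Eq. (22)
    fpr       = fp / (tn + fp)                           # Eq. (24)
    return accuracy, precision, f1, tpr, fpr

# Eq. (25): auc = roc_auc_score(y_true, y_score)
```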
Table 12 – The advantages and disadvantages of GD.

SGD. Advantages: fast training every time. Disadvantages: decreased accuracy; not necessarily the global optimal solution; not easy to implement in parallel; poor convergence.
BGD. Advantages: global optimal solution; easy to implement in parallel. Disadvantages: when the sample data is large, the computation overhead is high and the computation is slow.
MBGD. Advantages: reduced computational overhead; reduced randomness; easy to implement in parallel. Disadvantages: (none listed).
Fig. 9 – Loss function.
Table 13 – Confusion matrix.

Actual value        Predictive value
                    Positive example      Counterexample
Positive example    TP (True Positive)    FN (False Negative)
Counterexample      FP (False Positive)   TN (True Negative)

6. Analysis of experimental results

In the field of intrusion detection, any detection method will eventually be applied to different actual scenarios, but application environments, data sizes, and network environments differ, so there is no unique standard to measure the superiority of an algorithm. To adopt a unified standard in the experimental environment, the SDAE-ELM model is applied to the network-based KDD Cup99, NSL-KDD, UNSW-NB15, and CIDDS-001 data sets, and the DBN-Softmax model is applied to the host-based ADFA-LD dataset.

6.1. Experimental results based on network datasets

6.1.1. Binary classification experimental results

Combining the attack types into the abnormal class, the experiment becomes a binary classification problem. Tables 14-17 show the experimental results of the binary classification. From the tables, it can be seen that the performances of the algorithms on binary classification are not much different, but in general the deep models perform better than the shallow neural networks. The binary classification data sets are unbalanced, and the amount of data of the attack type far exceeds the amount of data of the normal samples, so using accuracy alone as an evaluation index cannot reflect the detection ability of a model on unbalanced data sets. Therefore, in addition to accuracy, the model's binary classification ability is evaluated from precision, recall and F1-Score.
Table 14 – KDD Cup99 binary test results.

Algorithm     Accuracy  Precision  Recall  F1-score  Time
AdaBoost      0.9492    0.9425     0.9272  0.9348    7202s
DT            0.9315    0.9806     0.9327  0.9560    14400s
ELM           0.9404    0.9831     0.9318  0.9568    2305s
SVM           0.9305    0.9818     0.9302  0.9553    >17h
LR            0.9301    0.9814     0.9300  0.9551    3409s
SOM           0.2020    0          0       NaN       100s
DNN           0.9379    0.9310     0.9959  0.9624    >12h
DBN           0.9339    0.9867     0.9297  0.9573    >12h
SDAE          0.9332    0.9858     0.9297  0.9569    >12h
SDAE-ELM1     0.9346    0.9867     0.9306  0.9579    4302s
SDAE-ELM2     0.9357    0.9869     0.9319  0.9586    4709s
SDAE-ELM3     0.9356    0.9875     0.9310  0.9584    4878s
Table 15 – NSL-KDD binary test results.

Algorithm     Accuracy  Precision  Recall  F1-score  Time
AdaBoost      0.7475    0.9889     0.5629  0.7175    1801s
DT            0.7621    0.9558     0.6106  0.7451    3600s
ELM           0.7449    0.8667     0.6523  0.7444    576s
SVM           0.7317    0.9522     0.5569  0.7028    >5h
LR            0.7283    0.8942     0.5931  0.7132    825s
SOM           0.4305    0          0       NaN       27s
DNN           0.7392    0.9228     0.5915  0.7209    >3h
DBN           0.7160    0.8972     0.5661  0.6942    >3h
SDAE          0.7293    0.9293     0.5679  0.7049    >3h
SDAE-ELM1     0.7804    0.9599     0.6412  0.7687    1075s
SDAE-ELM2     0.7708    0.9244     0.6507  0.7637    1177s
SDAE-ELM3     0.7664    0.9598     0.6155  0.7500    1220s
Table 16 – UNSW-NB15 binary test results.

Algorithm     Accuracy  Precision  Recall  F1-score  Time
AdaBoost      0.7663    0.7024     0.9887  0.8248    8304s
DT            0.7638    0.7216     0.9612  0.8243    15763s
ELM           0.7651    0.7237     0.9275  0.8130    3098s
SVM           0.6865    0.6513     0.9271  0.7651    >20h
LR            0.6814    0.6605     0.8674  0.7499    4789s
SOM           0.4494    NaN        0       NaN       200s
DNN           0.7342    0.7202     0.8459  0.7780    >17h
DBN           0.6812    0.6594     0.8706  0.7504    >17h
SDAE          0.6816    0.6588     0.8748  0.7516    >17h
SDAE-ELM1     0.7220    0.6693     0.9785  0.7949    5430s
SDAE-ELM2     0.7211    0.6698     0.9736  0.7936    5763s
SDAE-ELM3     0.7238    0.6994     0.8742  0.7771    5943s
In each data set, the detection effect of the other models is good, except for the poor performance of the SOM algorithm.

In most cases, the SDAE-ELM model achieves better detection performance than the traditional machine learning models. SDAE-ELM1, SDAE-ELM2, and SDAE-ELM3 achieve similar detection results, but are overall better than DT, ELM, SVM, and DNN. For the KDD Cup99 dataset, the accuracy of each algorithm is greater than 93%; in terms of precision and recall, the greater the model depth, the higher the precision and recall of SDAE-ELM, and in most cases the precision and recall of SDAE-ELM are better. SDAE-ELM3 shows an increase of 0.17% compared to SDAE. Compared with AdaBoost and DNN, the recall of SDAE-ELM is slightly worse, mainly because the KDD Cup99 data set has data redundancy and duplication, and fully learning this data set can achieve better classification results. Although SDAE-ELM can better remove the noise existing in the data set, its data mining ability is relatively weak compared with AdaBoost and DNN, so the model's ability on the KDD Cup99 data set is somewhat reduced, which is expected.
Table 17 – CIDDS-001 binary test results.

Algorithm     Accuracy  Precision  Recall  F1-score  Time
AdaBoost      0.8182    0.8181     0.9397  0.8747    9432s
DT            0.9374    0.9572     0.9497  0.9534    17987s
ELM           0.8176    0.8176     1       0.8996    3987s
SVM           0.8920    0.9983     0.8693  0.9294    >25h
LR            0.8146    0.8146     0.9806  0.8899    5763s
SOM           0.1824    NaN        0       NaN       390s
DNN           0.8176    0.8176     1       0.8996    >20h
DBN           0.9766    0.9962     0.8696  0.9286    >20h
SDAE          0.8806    0.9983     0.8555  0.9214    >20h
SDAE-ELM1     0.9537    0.9979     0.9454  0.9710    6543s
SDAE-ELM2     0.9238    0.9978     0.9089  0.9512    6879s
SDAE-ELM3     0.9258    0.9951     0.9137  0.9527    7012s
For the NSL-KDD data set, compared to the KDD Cup99 data set, the number of training samples is greatly reduced and the models learn fewer features, so the detection ability of each algorithm decreases, but each algorithm can maintain its original detection behavior. For the UNSW-NB15 data set, in most cases SDAE-ELM achieves better detection performance, and as the depth increases the performance of the model improves, while the remaining algorithms achieve similar detection performance. For the CIDDS-001 data set, the detection performance of each algorithm is better than on the other data sets, because the CIDDS-001 data set has more training data and each algorithm can better learn its characteristics. In terms of accuracy, SDAE is 4.32% worse than SDAE-ELM; SDAE-ELM is slightly worse than SVM in precision, but better than the other algorithms; in terms of recall, AdaBoost, ELM, and DNN reach 100%, that is, all attack data can be detected by these three algorithms.

F1-Score is a comprehensive evaluation index of precision and recall and can better reflect the classification ability of a model. In most cases, the F1-Score of SDAE-ELM is better. On the KDD Cup99 data set, the F1-Score improves as the model depth increases, with a difference of 0.0038 compared with DNN. On the NSL-KDD data set, SDAE-ELM obtains the optimal F1-Score. On the UNSW-NB15 data set, SDAE-ELM obtains the optimal F1-Score, which is much higher than that of the other detection models. In general, across multiple data sets, the SDAE-ELM model has better detection performance for binary classification.

Applying the SDAE-ELM model to a network-based intrusion detection system, in addition to ensuring that the model achieves good detection performance, we also need to consider the time performance of the model. In the process of detecting network-based data sets, we envision that SDAE-ELM can complete intrusion detection faster, which is confirmed by the Time evaluation indicators in Tables 14-17. Although SDAE-ELM requires a longer training time than ELM and SOM, compared with the other machine learning and deep learning models the training time of SDAE-ELM is greatly reduced, which can better meet our requirements for real-time performance.

The confusion matrix uses a heat map to show the differences in the data through color difference and brightness, which is easy to understand. It can summarize the records in the data set according to the actual results and prediction results to achieve visualization, and the evaluation indexes of the model are also obtained through the confusion matrix. Fig. 10 shows the confusion matrices of some algorithms on each data set. It can be seen that in most cases the classification of the SDAE-ELM model is the most accurate; that is, the values in the first and third quadrants of its confusion matrix are the largest, and those in the second and fourth quadrants are the smallest. The confusion matrices of other algorithms such as SVM, ELM, and SDAE are very similar: although the values in the first and third quadrants are large, the values in the second and fourth quadrants are also large, so there are false positives in these algorithms.
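A heat-map confusion matrix of this kind can be produced, for example, with seaborn; the sketch below assumes binary Normal/Attack labels and is illustrative only.

```python
# Illustrative heat-map confusion matrix (binary Normal/Attack case).
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion(y_true, y_pred, labels=("Normal", "Attack")):
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted value")
    plt.ylabel("Actual value")
    plt.show()
```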
Fig. 11 and Fig. 12 show the accuracy curves and the accuracy boxplots for each data set. As can be seen from Fig. 11, on the KDD Cup99 and UNSW-NB15 data sets, AdaBoost, DT, SVM, and LR achieve their optimal accuracy at the beginning of the iteration, and as the number of iterations increases the accuracy remains unchanged. For DNN, DBN, and SDAE, the accuracy increases with the number of iterations and is basically stable at the later stage of the iteration; after 25 iterations of DBN and SDAE and 55 iterations of DNN, the accuracy of the models is basically unchanged. For the SDAE-ELM model, the optimal accuracy is obtained at the beginning of the iteration. In most cases, the accuracy of the SDAE-ELM model is better than that of the traditional machine learning models. The boxplot can reflect the outliers in the data and the distribution of the data. As can be seen from Fig. 12, for the NSL-KDD and CIDDS-001 data sets, the AdaBoost, DT, SVM, LR, and SDAE-ELM models have very stable accuracy distributions and no outliers; among the remaining models, the ELM algorithm in particular fluctuates greatly, especially on the CIDDS-001 data set. For the DNN model, the accuracy fluctuates over several iterations and outliers appear, because the accuracy of the DNN model changes continuously during these iterations. Comparing the accuracy on the CIDDS-001 dataset and the NSL-KDD dataset, the accuracy of each algorithm on the CIDDS-001 dataset is better, indicating that the SDAE-ELM model also has better detection ability for newer attack types. Fig. 13 shows the P-R curves of the data sets.
Fig. 10 – Confusion matrix.
The P-R curve describes the relationship between the precision and the recall; compared to the ROC curve, the P-R curve pays more attention to positive samples. As can be seen from Fig. 13, when the original deep learning models are applied directly to each data set for intrusion detection, the detection effect is slightly worse than that of the traditional machine learning models, and for each data set the P-R curves of the SDAE-ELM models are very close to each other. Comparing the traditional machine learning models and SDAE-ELM, we can conclude that the P-R curve of SDAE-ELM is slightly worse than that of DT but better than those of the other models; this is because, in binary classification, DT can mine more data features, so its P-R curve is better.
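Such P-R curves can be drawn from the predicted attack probabilities, for example as in the following sketch; the function name is ours.

```python
# Illustrative P-R curve; y_score is the predicted attack probability.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def plot_pr(y_true, y_score, label):
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    plt.plot(recall, precision, label=label)
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.legend()
```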
6.1.2. Multi-class classification experimental results

When the intrusion detection data sets contain the normal samples and the different types of attack data as separate classes, the experiment becomes a multi-class classification experiment. Tables 18-21 show the experimental results of the multi-class classification on each data set. The multi-class classification data sets alleviate the imbalance of the binary classification data sets to a certain extent, but there are still large differences in the amount of data of each type. Overall, each model has better detection performance for samples with larger data volumes. For multi-class classification, the classification ability of each model is evaluated from the true positive rate, false positive rate, and AUC value. As with the binary classification, on each data set, except for the poor multi-class classification ability of SOM, the other models achieve a good classification effect.

In most cases, the SDAE-ELM model achieves better detection results, but when detecting a certain type of data its detection performance may be slightly worse than that of a traditional machine learning model. As can be seen from Table 18, for the "Normal" type data, the comprehensive detection performance of AdaBoost and ELM is the worst, with true positive rates below 85%, while the true positive rates of the remaining models are greater than 93%; in particular, the true positive rate of SDAE-ELM reaches 96.82%, showing that SDAE-ELM can fully learn the characteristics of the "Normal" samples and obtain better true positive and false positive rates. For the "Probe" attack type, the detection performances of AdaBoost, DT, ELM, SVM, LR and SDAE-ELM are similar and far better than those of DNN, DBN, and SDAE. For the "DoS" attack type, the detection performance of each algorithm is good; even the worst true positive rate, that of SDAE-ELM2, reaches 99.93%, and the true positive rates of AdaBoost, DT, SVM, LR, and SDAE all reach 100%. Because "DoS" has the largest number of attacks, and this paper judges according to the difference between intrusion data and normal data, each algorithm detects the heavily sampled "DoS" attacks well. For the "R2L" and "U2R" attack types, in most cases the performances of AdaBoost, DT, ELM and SDAE-ELM are similar and poor, while SVM, LR, DNN, DBN, and SDAE have the worst detection performance and basically cannot recognize the "R2L" and "U2R" attack types; however, the true positive rate of SDAE-ELM1 is 5.26% higher than that of SDAE, an effect in line with our expectations. There are few samples of the "R2L" and "U2R" attack types, and most of these attacks are disguised as legitimate users, making their characteristics very similar to normal data, which makes "R2L" and "U2R" attacks difficult to detect.
Fig. 11 – Data set accuracy.
The NSL-KDD and KDD Cup99 datasets have the same data types and differ only in the number of training and testing records. Comparing Tables 19 and 18, it can be found that each algorithm maintains detection performance on NSL-KDD similar to that on KDD Cup99, but each detection rate decreases. This is mainly because the training data of NSL-KDD is much smaller than that of KDD Cup99, so the features learned by the models are not sufficient, which leads to slightly worse detection performance on the NSL-KDD data set.

At present, in the field of intrusion detection, most people still use the KDD Cup99 dataset of the Lincoln Laboratory in the United States for experiments. This dataset was good for a certain period of time, but it was collected more than 20 years ago, and today's complex networks cannot be evaluated with it. Therefore, we further apply the SDAE-ELM model to the UNSW-NB15 and CIDDS-001 data sets, which contain newer attack types. As can be seen from Table 20, the UNSW-NB15 data set has more types of attacks, ten types in total including the normal sample data. Besides, this data set also contains the newer types of network attacks that now appear on the network; for example, the "Backdoors", "Shellcode", "Reconnaissance" and "Worms" attack types can better reflect the characteristics of current network intrusions. Considering all classification results, SDAE-ELM obtains better true positive rates, false positive rates, and AUC values. However, its detection rate on some attack types is slightly worse than that of DT: for the "Worms" attack type, only the true positive rate of DT reaches 13.64%, while the true positive rates of the remaining algorithms are 0%; but for the "Generic" attack type, the SDAE-ELM model obtains the best true positive rate and false positive rate. For the "Analysis", "Backdoors" and "Worms" attack types, the detection of every algorithm is poor, because the number of training samples is too small and the data distribution is imbalanced, which affects the classification ability of the algorithms. "Generic", as a general attack, mainly attacks the server and is very close in its characteristics to the "Exploits" attack type, but there is a data imbalance between the two attack types, so there may be misjudgments in the detection results.

Although there are only four types of attacks in the CIDDS-001 data set, they were captured in a simulated small-scale business environment in 2017, and so far few researchers have applied this data set to the field of intrusion detection.
Fig. 12 – Data set accuracy boxplot.
Therefore, it is meaningful for us to apply the models to this data set to test their performance. As can be seen from Table 21, SDAE-ELM achieves better detection results when all evaluation indicators are considered together. For the "Normal" data type, except for ELM, LR, and DNN, the remaining algorithms all achieve a good comprehensive detection effect, with true positive rates greater than 95%, indicating that each of these algorithms learns the characteristics of the "Normal" data well and classifies this type of data well. For the "Attackers" type of attack, only AdaBoost can classify this type perfectly: its true positive rate is 100%, its false positive rate is 0%, and its AUC value is 1. The detection performance of the remaining algorithms is poor; their true positive rates are all 0%, so this attack type cannot be identified at all. This is because "Attackers" records have characteristics similar to "Suspicious" ones, so the algorithms misinterpret the "Attackers" attacks as "Suspicious" attacks during the classification process. For the "Suspicious" attack type, apart from the SVM algorithm, whose true positive rate is 87.36%, the true positive rates of the other algorithms are greater than 93%; since the "Suspicious" attacks are the most numerous, each algorithm can fully learn their characteristics, and therefore each algorithm detects this large class well. For the "Unknown" and "Victim" attack types, SDAE-ELM achieves better detection rates; the algorithm can fully mine the characteristics of the "Unknown" and "Victim" attack types and thus achieve better classification.

Table 22 shows the time performance of the algorithms on each data set. It can be seen that the larger the data volume of a data set, the longer the training time the model needs, which is reasonable. On each data set, except for SOM and ELM, which take less time, the training times of the remaining models are longer. In particular, SVM, as a classic machine learning method, has achieved good results in many fields, but its time cost is relatively high, while SDAE-ELM also takes some time to train; compared with the other models, however, its time performance is greatly improved.
Fig. 13 – P-R graph of data set.
In the process of implementing intrusion detection, each layer of the neural network helps us understand how data is classified as "normal" or "attack" and how attacks are assigned to specific attack categories. To understand this process more intuitively, the activation values are passed to t-SNE for visualization. The results for the KDD Cup99 and CIDDS-001 data sets are shown in Fig. 14(a) and 14(b), respectively. For the KDD Cup99 data set, the "Normal", "DoS", and "R2L" features have each appeared completely in a separate cluster; for the CIDDS-001 data set, the "Normal" type features appear in their own cluster, but "Suspicious" and "Unknown" appear in the same cluster. This shows that the algorithm can identify some types of attacks well at this stage, but the optimal partition is not achieved. For the CIDDS-001 data set, the connection records that SDAE-ELM1 assigns to "Suspicious" are shown in Fig. 15(b); they have characteristics similar to "Unknown". For the NSL-KDD data set, the connection records of AdaBoost are shown in Fig. 15(a); at this point, different types of data in the data set have different characteristics. Although AdaBoost is able to correctly distinguish the attack categories, for samples with similar features we need to attach more features to distinguish them correctly.

The hidden layer contains many parameters. In deep learning, the parameters of the hidden layer have a great impact on the convergence speed and the performance of the model. During the training process, we continuously iterate the parameters of the hidden layer to obtain a better classification model; in general, these parameters are obtained in the self-learning process of the model. To understand the hidden layer parameters more intuitively, we visualize the weights of the first DAE hidden layer in the model. The visualization results are expressed in grayscale, as shown in Fig. 16.
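A sketch of this visualization step, assuming a trained Keras-style model whose last hidden layer is read out and embedded with scikit-learn's t-SNE, is shown below; the layer index is illustrative.

```python
# Sketch: read the last hidden layer of a trained Keras-style model and
# embed the activations with t-SNE. The layer index is illustrative.
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.manifold import TSNE

def plot_hidden_tsne(model, X, y):
    extractor = tf.keras.Model(model.input, model.layers[-2].output)
    activations = extractor.predict(X, verbose=0)
    embedding = TSNE(n_components=2).fit_transform(activations)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=4, cmap="tab10")
    plt.show()
```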
The ROC curve is an easy-to-understand graphical tool that can be used with all classification models, and it has a huge advantage: when the distribution of positive and negative samples changes, its shape remains basically unchanged, so it can more objectively measure the performance of the model itself.
Table 18 – KDD Cup99 multi-class classification test results.

Algorithm     Normal                  Probe                   DoS
              TPR     FPR     AUC     TPR     FPR     AUC     TPR     FPR      AUC
AdaBoost      0.8338  0.0022  0.9158  0.9904  0.0056  0.9924  1       0.32189  0.8301
DT            0.9645  0.0675  0.9485  0.9904  0.0117  0.9894  1       0.0055   0.9973
ELM           0.8487  0.0035  0.9885  0.9832  0.0882  0.9927  0.9999  0.0061   0.9984
SVM           0.9525  0.0705  0.9410  0.9039  0.0072  0.9483  1       0.0427   0.9787
LR            0.9386  0.0702  0.9342  0.8966  0.0064  0.9451  1       0.0634   0.9683
SOM           0       0       0.0093  0.1851  0.9499  0.2183  0       0.9548   0.04517
DNN           0.9668  0.0716  0.9668  0.1899  0.0066  0.5919  0.9995  0.0695   0.9652
DBN           0.9647  0.0719  0.9442  0.1899  0.0065  0.5877  0.9996  0.0814   0.9594
SDAE          0.9512  0.8461  0.9377  0       0       0.5     1       0.1090   0.9455
SDAE-ELM1     0.9647  0.0695  0.9510  0.9928  0.0069  0.9932  0.9997  0.0155   0.9944
SDAE-ELM2     0.9682  0.0697  0.9514  0.9423  0.0072  0.9800  0.9993  0.0158   0.9964
SDAE-ELM3     0.9365  0.0695  0.9512  0.9183  0.0072  0.9885  0.9995  0.0467   0.9947

Algorithm     R2L                     U2R
              TPR     FPR     AUC     TPR     FPR     AUC
AdaBoost      0.1140  0.0002  0.5569  0.0012  0.0007  0.5009
DT            0.1974  0.0008  0.5983  0.0037  0.0009  0.5014
ELM           0.0789  0.0001  0.5589  0.0025  0.0014  0.966
SVM           0       0       0.5     0       0.0009  0.499
LR            0       0       0.5     0.0012  0.0003  0.5005
SOM           0.6667  0.8369  0.1397  0       0       0.0259
DNN           0       0       0.5     0.0037  0.0030  0.5003
DBN           0       0       0.5     0.0006  0.0005  0.5001
SDAE          0       0       0.5     0.0056  0.0015  0.5020
SDAE-ELM1     0.0526  0.0002  0.5459  0.0062  0.0036  0.5033
SDAE-ELM2     0.0044  0       0.5066  0.0080  0.0039  0.5022
SDAE-ELM3     0.0102  0       0.5123  0.0080  0.0043  0.5013
Fig. 14 – Feature mapping of last hidden layer activation function.
As can be seen from Fig. 17, for the "Normal" data type in the KDD Cup99 dataset, as the model depth increases, the AUC value of SDAE-ELM improves, although its optimal AUC value is 0.0371 lower than that of ELM. For the "DoS" attack type in the NSL-KDD data set, SDAE-ELM is slightly worse than DT, by 0.0073, but compared to the other models SDAE-ELM obtains the optimal AUC value. For the "Generic" attack type in the UNSW-NB15 data set, the AUC value of every algorithm is good; even the worst AUC value, that of LR, reaches 0.8228, though SDAE-ELM is 0.0537 lower than DT. For the "Suspicious" attack type of the CIDDS-001 data set, SDAE-ELM obtains a better AUC value, while some algorithms have poor AUC values, such as ELM with 0.3183 and DBN with 0.4849.
Table 19 – NSL-KDD multi-class classification test results.

Algorithm     Normal                  Probe                   DoS
              TPR     FPR     AUC     TPR     FPR     AUC     TPR     FPR     AUC
AdaBoost      0.9726  0.5110  0.7308  0.6126  0.0493  0.7817  0.6109  0.0009  0.8050
DT            0.9624  0.3527  0.8049  0.8030  0.0243  0.8893  0.7658  0.0168  0.8745
ELM           0.9757  0.4166  0.8133  0.6051  0.0279  0.8702  0.7447  0.0465  0.7659
SVM           0.9329  0.5239  0.7045  0.0950  0.0102  0.5424  0.6720  0.2015  0.7352
LR            0.9262  0.5180  0.7041  0.1297  0.0474  0.5412  0.6548  0.1834  0.7357
SOM           0.7567  0.7243  0.5287  0.1404  0.3726  0.4384  0.0793  0.0841  0.3894
DNN           0.9420  0.5214  0.7137  0       0       0.5     0.7147  0.1991  0.7663
DBN           0.9299  0.4988  0.7101  0.2359  0.0602  0.5863  0.6706  0.1357  0.7494
SDAE          0.9319  0.5587  0.6822  0       0       0.5     0.6725  0.1848  0.7335
SDAE-ELM1     0.9333  0.4088  0.7867  0.6468  0.0383  0.8199  0.7294  0.0531  0.8439
SDAE-ELM2     0.9324  0.4334  0.7495  0.5287  0.0574  0.7357  0.6927  0.1083  0.8522
SDAE-ELM3     0.9656  0.4700  0.7478  0.3317  0.0493  0.6412  0.6755  0.1214  0.8672

Algorithm     R2L                     U2R
              TPR     FPR     AUC     TPR     FPR        AUC
AdaBoost      0.0300  0       0.5150  0.0007  0.0001     0.5003
DT            0.0350  0.0050  0.5150  0.0835  0.0200     0.5318
ELM           0.0250  0.0001  0.5224  0.0044  0.0029     0.5080
SVM           0       0       0.5     0       0          0.5
LR            0       0       0.5     0       0          0.5
SOM           0       0.1515  0.2003  0.0054  0.3696     0.3062
DNN           0       0       0.5     0       0          0.5
DBN           0       0       0.5     0       0          0.5
SDAE          0       0       0.5     0.0076  0.0028     0.5024
SDAE-ELM1     0       0       0.5     0.0243  0.0209     0.5055
SDAE-ELM2     0       0       0.5     0.0257  0.0001     0.5130
SDAE-ELM3     0       0       0.5     0.0271  6.5867e-5  0.5171
In summary, in most cases, using AUC as the evaluation index, the SDAE-ELM model performs well compared with the existing models, which also shows that SDAE-ELM obtains a higher TPR and a lower FPR.

6.2. Experimental results based on host dataset

Combining the attack types into the abnormal class, the experiment becomes a binary classification problem. Table 23 shows the results of the binary classification experiment on the ADFA-LD data set.
Fig. 15 – Connection record of last hidden layer activation function.
Table 20 – UNSW-NB15 multi-class classification test results.

Algorithm     Normal                  Fuzzers                 Analysis
              TPR     FPR     AUC     TPR     FPR     AUC     TPR     FPR     AUC
AdaBoost      0.5663  0.0505  0.7579  0.6140  0.2077  0.7032  0       0.0027  0.4987
DT            0.7479  0.0510  0.8485  0.4261  0.1272  0.6495  0.0827  0.0331  0.5248
ELM           0.6469  0.1320  0.8130  0.2801  0.0823  0.7943  0       0.0064  0.4951
SVM           0.8735  0.4559  0.7088  0       0       0.5     0       0       0.5
LR            0.865   0.4561  0.7045  0       0       0.5     0       0       0.5
SOM           0.4240  0.9606  0.1380  0       0       0.5     0       0       0.5
DNN           0.6935  0.3154  0.6639  0       0.0005  0.4996  0       0       0.5007
DBN           0.8698  0.4568  0.7001  0       0       0.5     0       0       0.5
SDAE          0.8729  0.4576  0.7071  0       0       0.5     0       0       0.5
SDAE-ELM1     0.7044  0.2672  0.7186  0.0200  0.0037  0.5081  0       0       0.5
SDAE-ELM2     0.7318  0.3131  0.7094  0.0048  0.0009  0.5019  0       0       0.5
SDAE-ELM3     0.7945  0.3987  0.6979  0.0007  0.0004  0.5002  0       0       0.5

Algorithm     Backdoors               DoS                      Exploit
              TPR     FPR     AUC     TPR     FPR      AUC     TPR     FPR     AUC
AdaBoost      0       0       0.5     0.0037  0.0007   0.5015  0.8824  0.1991  0.8417
DT            0.2487  0.0513  0.5987  0.1289  0.0280   0.5505  0.6387  0.0798  0.7795
ELM           0       0       0.5     0.4444  0.0534   0.7752  0.6288  0.1147  0.8526
SVM           0       0       0.5     0       0        0.5     0.0230  0.0097  0.5066
LR            0       0       0.5     0.0176  0.0031   0.5073  0.0163  0.0121  0.5021
SOM           0       0       0.5     0.2194  0.7366   0.2954  0       0       0.5
DNN           0       0       0.5     0       2.121e-5 0.5008  0.2869  0.1831  0.4427
DBN           0       0       0.5     0.0051  0.0008   0.5023  0.0166  0.0098  0.5032
SDAE          0       0       0.5     0.0465  0.0082   0.5191  0       0       0.5
SDAE-ELM1     0       0       0.5     0.1029  0.0002   0.5124  0.5650  0.2789  0.6430
SDAE-ELM2     0       0       0.5     0.1037  0.0002   0.5128  0.5705  0.1933  0.6536
SDAE-ELM3     0       0       0.5     0.1049  0.0004   0.5122  0.5442  0.0808  0.6317

Algorithm     Generic                 Reconnaissance          Shellcode
              TPR     FPR     AUC     TPR     FPR     AUC     TPR     FPR     AUC
AdaBoost      0.9518  0.0036  0.9742  0.5712  0.0227  0.7743  0.0714  0.0020  0.5347
DT            0.9665  0.0043  0.9811  0.7589  0.0154  0.8717  0.5238  0.0111  0.7564
ELM           0.9627  0.0015  0.9805  0.7203  0.1321  0.8828  0       0       0.5
SVM           0.9693  0.3214  0.8240  0       0       0.5     0       0       0.5
LR            0.9694  0.3239  0.8228  0       0       0.5     0       0       0.5
SOM           0       0       0.5     0       0       0.5     0       0       0.5
DNN           0.9693  0.3481  0.8243  0       0       0.4999  0       0       0.4588
DBN           0.9695  0.3244  0.8224  0       0       0.5     0       0       0.5
SDAE          0.9695  0.3235  0.8233  0       0       0.5     0       0       0.5
SDAE-ELM1     0.9629  0.1416  0.9106  0.0023  0       0.5134  0       0       0.5
SDAE-ELM2     0.9640  0.1482  0.9129  0.0031  0       0.5178  0       0       0.5
SDAE-ELM3     0.9691  0.1342  0.9274  0.0043  0       0.5213  0       0       0.5

Algorithm     Worms
              TPR     FPR     AUC
AdaBoost      0       0       0.5
DT            0.1364  0.0010  0.5677
ELM           0       0       0.5
SVM           0       0       0.5
LR            0       0       0.5
SOM           0       0       0.5
DNN           0       0       0.5
DBN           0       0       0.5
SDAE          0       0       0.5
SDAE-ELM1     0       0       0.5
SDAE-ELM2     0       0       0.5
SDAE-ELM3     0       0       0.5
Fig. 16 – First DAE visualization weights.
Fig. 17 – ROC curve.
Table 21 – CIDDS-001 multi-class classification test results.

Algorithm     Normal                  Attackers               Suspicious
              TPR     FPR     AUC     TPR     FPR     AUC     TPR     FPR     AUC
AdaBoost      0.9693  0.0406  0.9644  1       0       1       0.9565  0.4615  0.7475
DT            0.9929  0.0002  0.9964  0.1128  0       0.5564  0.9963  0.0200  0.9681
ELM           0       0.0069  0.3653  0       0       0.4997  0.9936  1       0.3183
SVM           0.9941  0.2676  0.8632  0       0       0.5     0.8736  0.6547  0.6094
LR            0       0.0092  0.4954  0       0       0.5     0.9927  0.9700  0.5113
SOM           0.4854  0.8828  0.0557  0       0       0.5     0       0       0.5
DNN           0       0       0.5     0       0       0.5     1       1       0.5
DBN           0.0101  0.1365  0.4363  0       0       0.5     0.9350  0.9671  0.4849
SDAE          0.9689  0.0023  0.9905  0       0       0.4916  0.9879  0.3325  0.8711
SDAE-ELM1     0.9948  0.0548  0.97    0.8487  0.0024  0.9231  0.9493  0.2199  0.8730
SDAE-ELM2     0.9599  0.0301  0.9649  0.8379  0.0002  0.9499  0.9731  0.2579  0.8830
SDAE-ELM3     0.9653  0.1860  0.8897  0.8543  0.0012  0.9378  0.9300  0.1986  0.8930

Algorithm     Unknown                 Victim
              TPR     FPR     AUC     TPR     FPR     AUC
AdaBoost      0.1106  0.0028  0.5539  1       0       1
DT            0.9821  0.0048  0.9886  1       0.0850  0.9575
ELM           0       0.0003  0.2759  0       0       0.4007
SVM           0.0617  0.0002  0.5308  0       0       0.5
LR            0.0579  0.0003  0.5388  0       0       0.5
SOM           0       0       0.5     1       0.8672  0.1373
DNN           0       0       0.5     0       0       0.5
DBN           0.0553  0.0003  0.5287  0       0       0.5
SDAE          0.7621  0.0158  0.9467  0       0.0001  0.5823
SDAE-ELM1     0.1934  0.0017  0.5958  0.9945  0.0136  0.9905
SDAE-ELM2     0.3155  0.0298  0.6429  0.9932  0.0124  0.9912
SDAE-ELM3     0.3222  0.0341  0.6111  0.9950  0.1089  0.9872
Table 22 – Time performance evaluation index.

Algorithm     KDD Cup99  NSL-KDD  UNSW-NB15  CIDDS-001
AdaBoost      4604s      3809s    5123s      6732s
DT            7800s      4328s    8763s      9087s
ELM           1039s      789s     2432s      4321s
SVM           >20h       >10h     >24h       >28h
LR            3421s      2319s    5342s      6123s
SOM           100s       78s      140s       200s
DNN           >7h        >4h      >12h       >15h
DBN           >7h        >4h      >12h       >15h
SDAE          >7h        >4h      >12h       >15h
SDAE-ELM1     3908s      2987s    4321s      5891s
SDAE-ELM2     4109s      3034s    4498s      5987s
SDAE-ELM3     4530s      3298s    4510s      6032s

Table 23 – ADFA-LD binary test results.

Algorithm     Accuracy  Precision  Recall  F1-score
AdaBoost      0.8742    0.3654     0.0896  0.1439
DT            0.8606    0.4531     0.4458  0.4494
ELM           0.8724    0.7143     0.0766  0.1384
SVM           0.8775    0.8        0.0755  0.1379
LR            0.8590    0.25       0.0268  0.0484
SOM           0.4024    0.0952     0.4355  0.1563
DNN           0.8790    NaN        0       NaN
SDAE          0.8770    NaN        0       NaN
DBN           0.8721    0.3214     0.0892  0.1396
DBN-Softmax1  0.8790    0.8108     0.1230  0.2136
DBN-Softmax2  0.8790    0.8113     0.1241  0.2153
DBN-Softmax3  0.8831    0.8243     0.1342  0.2308
When the intrusion detection dataset contains the normal samples and the individual types of attack data as separate classes, the experiment becomes a multi-class classification experiment; Table 24 gives the experimental results of the multi-class classification. Comparing Table 23 and Table 24, it can be seen that the overall detection effect of binary classification is better than that of multi-class classification, because for binary classification, if the attack type "Hydra_FTP" is mistakenly detected as the "Hydra_SSH" attack type, the detection result of the binary classification is still correct, while the detection result of the multi-class classification is wrong. From the results of the binary classification experiment and the multi-class classification experiment, it can be seen that the overall detection effect of DBN-Softmax is better than that of the other detection models, and its overall false positive rate in the multi-class classification is lower than that of the other detection models. However, in the multi-class classification, the detection performance of each algorithm on the attack types is not very good, mainly because, out of the 5951 sample records, the largest attack class, "Hydra_SSH", has only 176 samples, the smallest, "Meterpreter", has only 75, and the total amount of data is too small, so the detection rates are poor; nevertheless, the results prove the feasibility of DBN-Softmax in general.
Table 24 – ADFA-LD multi-class classification test results.

Algorithm     Normal                  Adduser                 Hydra_FTP
              TPR     FPR     AUC     TPR     FPR     AUC     TPR     FPR     AUC
AdaBoost      0.9988  1       0.4994  0       0.0012  0.4994  0       0       0.5
DT            0.9226  0.9150  0.5038  0.0313  0.0226  0.5043  0.0385  0.0368  0.5008
ELM           1       1       0.5039  0       0       0.5     0       0       0.5099
SVM           1       0.9962  0.5019  0       0       0.5     0       0       0.5
LR            0.9894  0.9876  0.5009  0       0       0.5     0.02    0.0053  0.5073
SOM           0.0228  0.0469  0.4625  0       0.4382  0.2760  0.4906  0.9007  0.2967
DNN           1       1       0.5     0       0       0.5     0       0       0.5
SDAE          0.9786  0.9694  0.5046  0       0       0.5     0.0702  0.0242  0.5
DBN           1       1       0.5     0       0       0.5     0       0       0.5
DBN-Softmax1  0.9641  0.9157  0.5813  0.0432  0       0.5356  0.0413  0       0.5223
DBN-Softmax2  0.9743  0.9212  0.5798  0.1032  0       0.5373  0.0798  0       0.5267
DBN-Softmax3  1       1       0.5     0.1143  0       0.5390  0.0942  0       0.5312

Algorithm     Hydra_SSH               Java_Meterpreter        Meterpreter
              TPR     FPR     AUC     TPR     FPR     AUC     TPR     FPR     AUC
AdaBoost      0       0       0.5     0       0.0012  0.4994  0       0.0006  0.4997
DT            0.0702  0.0322  0.5190  0.0857  0.0075  0.5391  0       0.0145  0.4926
ELM           0       0       0.5081  0       0       0.4997  0       0       0.5379
SVM           0.0213  0.0047  0.5083  0       0       0.5     0       0       0.5
LR            0.0147  0.0041  0.5053  0       0.0018  0.4991  0.0417  0.0036  0.5191
SOM           0.6140  0.9148  0.3516  0       0.2908  0.33    0       0.1525  0.3992
DNN           0       0       0.5     0       0       0.5     0       0       0.5
SDAE          0       0       0.5     0.0667  0.0146  0.5257  0       0       0.5
DBN           0       0       0.5     0       0       0.5     0       0       0.5
DBN-Softmax1  0.2341  0       0.6168  0.0270  0.0150  0.5044  0.1132  0.0018  0.5410
DBN-Softmax2  0.2432  0       0.6179  1       1       0.5     0.1243  0       0.5611
DBN-Softmax3  0.2498  0       0.6234  0.1043  0       0.5976  0.1034  0       0.5543

Algorithm     Web_Shell
              TPR     FPR     AUC
AdaBoost      0       0       0.5
DT            0.1035  0.0154  0.5440
ELM           0       0       0.4997
SVM           0       0       0.5
LR            0       0.0018  0.4991
SOM           0       0.7738  0.1114
DNN           0       0       0.5
SDAE          0       0       0.5
DBN           0       0       0.5
DBN-Softmax1  0.0432  0.0319  0.5675
DBN-Softmax2  0.0791  0.0432  0.5700
DBN-Softmax3  0.0913  0.0443  0.5708
It can be seen from Table 23 that DBN-Softmax obtains the best accuracy. Except for SOM, the accuracy of every algorithm is greater than 85.9%, and as the depth of DBN-Softmax increases, its accuracy improves. In terms of precision, DBN-Softmax is the best of the compared algorithms, with an optimum of 82.43%, which is 57.43% higher than the precision of LR. From the perspective of recall, DBN-Softmax is slightly worse than DT and SOM but better than the other algorithms. In most cases, the F1-Score of DBN-Softmax is good: it is second only to DT, being 0.2186 lower.
Fig. 18 – Binary confusion matrix, accuracy and boxplot.
It can be seen from Table 24 that the detection performance of SOM is always poor, while overall DBN-Softmax achieves a good detection rate. For the "Normal" type, the true positive rate of each algorithm is good, greater than 92.26% and for some algorithms even 100%, but each algorithm also has a high false positive rate on the "Normal" type, which leads to a poor comprehensive evaluation index: because the samples of the other attack types are relatively few, the models cannot accurately learn the characteristics of each attack type, causing each model to falsely report attack samples as normal samples and resulting in a high false positive rate. For the attack types, the detection effect of each algorithm is poor; in most cases the detection effect of DBN-Softmax is significantly better than that of the other algorithms, and as the depth of the model increases, the detection effect improves. Although the ADFA-LD data set contains attack types that do not appear in the network-based data sets, the number of samples of each type is too small, so the detection performance of the algorithms is slightly worse than on the network-based data sets.

Fig. 18 shows the binary classification confusion matrices, accuracy, and boxplot for the ADFA-LD data set: Fig. 18(a)-18(c) are the confusion matrices of some algorithms, Fig. 18(d) is the accuracy, and Fig. 18(e) is a boxplot of the accuracy. As can be seen from Fig. 18(a)-18(c), the normal sample data occupies a considerable proportion of the test data set, and each algorithm performs well on the normal samples, as the larger values in the third quadrant of each confusion matrix show; but because the attack samples are too few, the prediction ability on the attack samples is poor. Fig. 18(d) and Fig. 18(e) show that the accuracy of ELM fluctuates during multiple iterations, while the remaining algorithms obtain their optimal accuracy at the early stage of the iteration, and the accuracy of DBN-Softmax3 is significantly better than that of the other algorithms.

To intuitively understand the distribution of the activation function values of the hidden layer of the model, we use t-SNE to visualize them. The feature mappings of AdaBoost and DBN-Softmax1 are shown in Fig. 19(a) and Fig. 19(b), respectively. In the two algorithms, some of the "Normal" type features appear in one cluster, but the remaining "Normal" features and the attack types appear in another cluster, which shows that the algorithms can identify some normal samples well at this stage. For ELM and DBN-Softmax3, the connection records belonging to "Normal" are shown in Fig. 19(c) and 19(d), respectively. These connection records contain some features of other attack types, which indicates that although each algorithm can distinguish some normal samples, more features need to be added to better distinguish the types of attacks.
Fig. 19 – Feature mapping and connection record of last hidden layer activation function of multi-class classifications.

Fig. 20 – First RBM visualization weights of multi-class classifications.

Fig. 21 – ROC curve of multi-class classifications.
In addition to visualizing the activation function values and the connection weights of the hidden layer, we also visualize the hidden layer weights of the first RBM; the magnitudes of these weights can be clearly seen in Fig. 20.

The ROC curve is a comprehensive indicator for evaluating TPR and FPR, and it is easy to understand: in the face of an imbalance between the numbers of positive and negative samples, the ROC curve is a more stable indicator that can reflect the quality of the model. Fig. 21 depicts the ROC curves. It can be seen from the figure that, compared with the other algorithms, the AUC value of DBN-Softmax is significantly improved, and as the number of layers increases, the AUC value of DBN-Softmax improves further. For the "Adduser" attack type, the AUC value of DBN-Softmax3 is 0.0396 higher than that of AdaBoost; for the "Web_Shell" attack type, the worst AUC value, that of ELM, is 0.4997.
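Per-category ROC curves of this kind are typically computed one-vs-rest from the predicted class probabilities; the sketch below illustrates this under that assumption.

```python
# Sketch of one-vs-rest ROC/AUC per attack category, as in Fig. 21;
# y_prob is an (n_samples, n_classes) matrix of predicted probabilities.
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

def per_class_auc(y_true, y_prob, classes):
    Y = label_binarize(y_true, classes=classes)
    scores = {}
    for i, c in enumerate(classes):
        fpr, tpr, _ = roc_curve(Y[:, i], y_prob[:, i])
        scores[c] = auc(fpr, tpr)
    return scores
```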
7. Conclusion and future work

Based on the deep Denoising AutoEncoder and the Deep Belief Network, this paper has proposed the integrated deep intrusion detection models SDAE-ELM and DBN-Softmax. SDAE-ELM uses the distributed deep learning model of SDAE, which can handle real-time data, analyze large-scale data, and reduce the noise in the data set. The distributed deep learning model of DBN is used to deeply mine the features in the data set and improve the classification accuracy of the model. At the same time, in order to avoid the problems that the BP algorithm is prone to gradient sparseness and local optima and that the original classifier is not suitable for multi-class classification during the fine-tuning process, we have used the ELM algorithm and the Softmax classifier to optimize the SDAE and DBN models, respectively. In addition, in order to quickly update the parameters and improve the computational efficiency of the models, SDAE-ELM and DBN-Softmax are trained using the Mini-Batch gradient descent method. The SDAE-ELM and DBN-Softmax models have been verified on the network-based and host-based intrusion detection data sets, respectively. The results have demonstrated that, for both binary classification and multi-class classification, the detection performance of the SDAE-ELM and DBN-Softmax models on their respective data sets is better than that of the traditional machine learning models.

Compared with the traditional machine learning models, the SDAE-ELM and DBN-Softmax models have achieved better detection results in intrusion detection. However, the data mining ability of the SDAE-ELM model is relatively weak, and its detection effect on small datasets is poor. In addition, the DBN-Softmax model has the disadvantage of a long training time on large datasets and cannot realize real-time detection of intrusions. In the future, we will consider using hybrid feature extraction techniques to reduce the dimensionality of the dataset and reduce the training time of the model under the premise of ensuring the accuracy of the intrusion detection target. In addition, due to the complex structure of the deep intrusion detection models and their large number of parameters, we will consider improving the model neurons and calculation methods, simplifying the network structure, and improving the model efficiency.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Zhendong Wang: Writing - review & editing. Yaodi Liu: Validation, Formal analysis, Visualization, Supervision, Data curation, Writing - original draft. Daojing He: Writing - review & editing. Sammy Chan: Writing - review & editing.
Acknowledgements

This work is supported by the National Natural Science Foundation of China (61562037, 61562038, 61563019, 61763017, U1936120, U1636216), the National Key Research and Development Program of China (2017YFB0802805 and 2017YFB0801701), and the Natural Science Foundation of Jiangxi Province (20171BAB202026, 20181BBE58018).

R E F E R E N C E S

ADFA-LD 2020 dataset [Online], available: https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-IDS-Datasets/.

Al-Qatf M, Lasheng Y, Al-Habib M, Al-Sabahi K. Deep learning approach combining sparse autoencoder with SVM for network intrusion detection. IEEE Access 2018;6:52843-56. doi:10.1109/ACCESS.2018.2869577.

Al-Tashi Q, Abdulkadir SJ, Rais HM, Mirjalili S, Alhussian H, Ragab MG, Alqushaibi A. Binary multi-objective grey wolf optimizer for feature selection in classification. IEEE Access 2020;8:106247-63. doi:10.1109/ACCESS.2020.3000040.

Aljamal I, Tekeoglu A, Bekiroglu K, Sengupta S. Hybrid intrusion detection system using machine learning techniques in cloud computing environments. In: 2019 IEEE 17th International Conference on Software Engineering Research, Management and Applications (SERA); 2019. p. 84-9. doi:10.1109/SERA.2019.8886794.

Bahmanyar R, Cui S, Datcu M. A comparative study of bag-of-words and bag-of-topics models of EO image patches. IEEE Geosci. Remote Sens. Lett. 2015;12(6):1357-61. doi:10.1109/LGRS.2015.2402391.

Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013;35(8):1798-828. doi:10.1109/TPAMI.2013.50.

Besharati E, Naderan M, Namjoo E. LR-HIDS: logistic regression host-based intrusion detection system for cloud environments. 2019;10(9):3669-92. doi:10.1007/s12652-018-1093-8.

Beulah JR, Punithavathani DS. A hybrid feature selection method for improved detection of wired/wireless network intrusions. Wirel. Pers. Commun. 2017;98(2):1853-69. doi:10.1007/s11277-017-4949-x.

Canedo DRC, Romariz ARSR. Intrusion detection system in ad hoc networks with artificial neural networks and K-means algorithm. IEEE Latin Am. Trans. 2019;17(7):1109-15. doi:10.1109/TLA.2019.8931198.

Chen CLP, Zhang C, Chen L, Gan M. Fuzzy restricted Boltzmann machine for the enhancement of deep learning. IEEE Trans. Fuzzy Syst. 2015;23(6):2163-73. doi:10.1109/TFUZZ.2015.2406889.

CIDDS-001 2020 dataset [Online], available: https://www.hs-coburg.de/index.php?id=927.

de Araujo-Filho PF, Kaddoum G, Campelo DR, Santos AG, Macedo D, Zanchettin C. Intrusion detection for cyber-physical systems using generative adversarial networks in fog environment. IEEE Internet Things J. 2020. doi:10.1109/JIOT.2020.3024800.

Duan LT, Han DZ, Tian QT. Design of intrusion detection system based on improved ABC_elite and BP neural networks. Comput. Sci. Inf. Syst. 2019;16(3):773-95. doi:10.2298/CSIS181001026D.

Ghamisi P, Benediktsson JA. Feature selection based on hybridization of genetic algorithm and particle swarm optimization. IEEE Geosci. Remote Sens. Lett. 2015;12(2):309-13. doi:10.1109/LGRS.2014.2337320.

Hamed T, Dara R, Kremer SC. Network intrusion detection system based on recursive feature addition and bigram technique. Comput. Secur. 2018;73:137-55. doi:10.1016/j.cose.2017.10.011.

HFSTE: hybrid feature selections and tree-based classifiers ensemble for intrusion detection system. IEICE Trans. Inf. Syst. 2017;E100D(8):1729-37. doi:10.1587/transinf.2016ICP0018.

Hinton GE, Osindero S, Teh YW. A fast learning algorithm for deep belief nets. Neural Comput. 2006;18(7):1527-54.

Hinton GE. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002;14(8):1771-800. doi:10.1162/089976602760128018.

Huang G, Song S, Gupta JND, Wu C. Semi-supervised and unsupervised extreme learning machines. IEEE Trans. Cybern. 2014;44(12):2405-17. doi:10.1109/TCYB.2014.2307349.

Ibrahim NM, Zainal A. A model for adaptive and distributed intrusion detection for cloud computing. In: 2018 Seventh ICT International Student Project Conference (ICT-ISPC); 2018. p. 1-6. doi:10.1109/ict-ispc.2018.8523905.

Ilgun K, Kemmerer RA, Porras PA. State transition analysis: a rule-based intrusion detection approach. IEEE Trans. Softw. Eng. 1995;21(3):181-99. doi:10.1109/32.372146.

Kachuee M, Darabi S, Moatamed B, Sarrafzadeh M. Dynamic feature acquisition using denoising autoencoders. IEEE Trans. Neural Netw. Learn. Syst. 2019;30(8):2252-62. doi:10.1109/TNNLS.2018.2880403.

KDD Cup99 2020 dataset [Online], available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

Kim H, Hong H, Kim H-S, Kang S. A memory-efficient parallel string matching for intrusion detection systems. IEEE Commun. Lett. 2009;13(12):1004-6.

Liang W, Li K, Long J, Kui X, Zomaya AY. An industrial network intrusion detection algorithm based on multifeature data clustering optimization model. IEEE Trans. Ind. Inf. 2020;16(3):2063-71. doi:10.1109/TII.2019.2946791.

Liu J, Pan Y, Li M, Chen ZY, Tang L, Lu CQ, Wang JX. Applications of deep learning to MRI images: a survey. Big Data Mining Anal. 2018;1(1):1-18. doi:10.26599/BDMA.2018.9020001.

Liu Y, Huangfu W, Zhang H, Long K. An efficient stochastic gradient algorithm to maximize the coverage of cellular networks. IEEE Trans. Wireless Commun. 2019;18(7):3424-36. doi:10.1109/TWC.2019.2914040.

Marteau P. Sequence covering for efficient host-based intrusion detection. IEEE Trans. Inf. Forensics Secur. 2019;14(4):994-1006.

Messaoud S, Bradai A, Moulay E. Online GMM clustering and mini-batch gradient descent based optimization for industrial IoT 4.0. IEEE Trans. Ind. Inf. 2020;16(2):1427-35. doi:10.1109/TII.2019.2945012.

Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10); 2010.

Nguyen MT, Kim K. Genetic convolutional neural network for intrusion detection systems. Future Gener. Comput. Syst. 2020;113:418-27. doi:10.1016/j.future.2020.07.042.

NSL-KDD 2020 dataset [Online], available: http://users.cis.fiu.edu/~lpeng/Datasets_detail.html.

Prabavathy S, Sundarakantham K, Shalinie SM. Design of cognitive fog computing for intrusion detection in internet of things. J. Commun. Netw. 2018;20(3):291-8. doi:10.1109/JCN.2018.000041.

Sadaf K, Sultana J. Intrusion detection based on autoencoder and isolation forest in fog computing. IEEE Access 2020;8:167059-68. doi:10.1109/ACCESS.2020.3022855.
Serpen G, Aghaei E. Host-based misuse intrusion detection using PCA feature extraction and KNN classification algorithms. Intell. Data Anal. 2018;22(5):1101–14. doi:10.3233/IDA-173493.
Shone N, Ngoc TN, Phai VD, Shi Q. A deep learning approach to network intrusion detection. IEEE Trans. Emerg. Top. Comput. Intell. 2018;2(1):41–50. doi:10.1109/TETCI.2017.2772792.
Si Z, Wen S, Dong B. NOMA codebook optimization by batch gradient descent. IEEE Access 2019;7:117274–81. doi:10.1109/ACCESS.2019.2936483.
Tang TA, Mhamdi L, McLernon D, Zaidi SAR, Ghogho M. Deep learning approach for network intrusion detection in software defined networking. In: 2016 International Conference on Wireless Networks and Mobile Communications (WINCOM); 2016. p. 258–63. doi:10.1109/WINCOM.2016.7777224.
Teng S, Mu N, Zhu H, Teng L, Zhang W. SVM-DT-based adaptive and collaborative intrusion detection. IEEE/CAA J. Automat. Sinica 2018a;5(1):108–18. doi:10.1109/JAS.2017.7510730.
Teng SH, Wu NQ, Zhu HB, Teng LY, Zhang W. SVM-DT-based adaptive and collaborative intrusion detection. IEEE/CAA J. Automat. Sinica 2018b;5(1):108–18. doi:10.1109/JAS.2017.7510730.
Tidjon LN, Frappier M, Mammar A. Intrusion detection systems: a cross-domain overview. IEEE Commun. Surveys Tuts. 2019;21(4):3639–81. doi:10.1109/COMST.2019.2922584.
Tu Y, Du J, Lee C. Speech enhancement based on teacher-student deep learning using improved speech presence probability for noise-robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2019;27(12):2080–91. doi:10.1109/TASLP.2019.2940662.
UNSW-NB15 2020 dataset [Online], available: https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/.
Usman AM, Yusof UK, Naim S. Filter-based multi-objective feature selection using NSGA III and cuckoo optimization algorithm. IEEE Access 2020;8:76333–56. doi:10.1109/ACCESS.2020.2987057.
Van der Maaten L, Hinton G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008;9:2579–605.
Wang D, Su J, Yu H. Feature extraction and analysis of natural language processing for deep learning English language. IEEE Access 2020;8:46335–45. doi:10.1109/ACCESS.2020.2974101.
Wang W, Du X, Wang N. Building a cloud IDS using an efficient feature selection method and SVM. IEEE Access 2019;7:1345–54. doi:10.1109/ACCESS.2018.2883142.
Wei B, Zhang W, Xia X, Zhang Y, Yu F, Zhu Z. Efficient feature selection algorithm based on particle swarm optimization with learning memory. IEEE Access 2019;7:166066–78. doi:10.1109/ACCESS.2019.2953298.
Ye Z, Sun Y, Sun S, Zhan S, Yu H, Yao Q. Research on network intrusion detection based on support vector machine optimized with grasshopper optimization algorithm. In: 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS); 2019. p. 378–83. doi:10.1109/IDAACS.2019.8924234.
Zaidi K, Milojevic MB, Rakocevic V, Nallanathan A, Rajarajan M. Host-based intrusion detection for VANETs: a statistical approach to rogue node detection. IEEE Trans. Veh. Technol. 2016;65(8):6703–14. doi:10.1109/TVT.2015.2480244.
Zeng R, Wu J, Shao Z, Senhadji L, Shu H. Quaternion softmax classifier. Electron. Lett. 2014;50(25):1929–31. doi:10.1049/el.2014.2526.
Zhang H, Li Y, Lv Z, Sangaiah AK, Huang T. A real-time and ubiquitous network attack detection based on deep belief network and support vector machine. IEEE/CAA J. Automat. Sinica 2020;7(3):790–9. doi:10.1109/JAS.2020.1003099.

Zhendong Wang [S'06, M'09] received the B.E. (2006) and M.Eng. (2009) degrees from Changchun University of Science and Technology and Harbin University of Science and Technology, respectively, and a Ph.D. degree in computer applied technology from Harbin Engineering University in 2013. Since 2014, he has been with the Department of Information Engineering, Jiangxi University of Science and Technology, P.R. China, where he is currently an associate professor. His research interests include wireless sensor networks, artificial intelligence and network security.

Yaodi Liu received the B.S. degree in Information and Computing Science from Jiangsu Ocean University, in 2018. She is currently pursuing the M.S. degree with Jiangxi University of Science and Technology. Her main research interests include network security and swarm intelligence optimization algorithms.

Daojing He [S'07, M'13] received the B.Eng. (2007) and M.Eng. (2009) degrees from Harbin Institute of Technology (China) and the Ph.D. degree (2012) from Zhejiang University (China), all in computer science. He is currently a professor in the School of Computer Science and Software Engineering, East China Normal University, P.R. China. His research interests include network and systems security. He is on the editorial board of international journals such as IEEE Communications Magazine and IEEE Network.

Sammy Chan [S'87, M'89] received his B.E. and M.Eng.Sc. degrees in electrical engineering from the University of Melbourne, Australia, in 1988 and 1990, respectively, and a Ph.D. degree in communication engineering from the Royal Melbourne Institute of Technology, Australia, in 1995. Since December 1994 he has been with the Department of Electronic Engineering, City University of Hong Kong, where he is currently an associate professor.