Article history: Received 18 July 2020; Revised 12 November 2020; Accepted 3 January 2021; Available online 7 January 2021.

Keywords: Deep learning; Deep neural network; Feature learning; Mini-batch gradient descent; Intrusion detection

Abstract: Intrusion detection systems can effectively identify abnormal data in complex network environments and are an effective means of ensuring computer network security. Recently, deep neural networks have been widely used in image recognition, natural language processing, network security and other fields. For network intrusion detection, this paper designs an integrated deep intrusion detection model based on SDAE-ELM to overcome the long training time and low classification accuracy of existing deep neural network models and to respond to intrusion behavior in a timely manner. For host intrusion detection, an integrated deep intrusion detection model based on DBN-Softmax is constructed, which effectively improves the detection accuracy on host intrusion data. At the same time, in order to improve the training efficiency and detection performance of the SDAE-ELM and DBN-Softmax models, a mini-batch gradient descent method is used for network training and optimization. Experiments on the KDD Cup99, NSL-KDD, UNSW-NB15, CIDDS-001, and ADFA-LD datasets show that the integrated deep SDAE-ELM and DBN-Softmax detection models perform better than other classic machine learning models.
(Kim et al., 2009) to analyze it. The matching features mainly include string features, port features and data packet header features, so feature selection is one of the effective methods to improve the reliability and timeliness of NIDS. Current feature selection algorithms mainly include particle swarm optimization (Wei et al., 2019), the genetic algorithm (Ghamisi and Benediktsson, 2015), the gray wolf algorithm (Al-Tashi et al., 2020), the cuckoo algorithm (Usman et al., 2020), etc., but they all have some problems, such as the relatively large randomness of the genetic algorithm and the tendency of the gray wolf algorithm to become trapped in local optima. The earliest intrusion detection system applied to the network is HIDS, whose detection module is installed on the system hard disk. HIDS analyzes log files and audited system operation data to detect intrusions. With the increasing complexity of the network environment, detection methods migrated from rule-based expert systems (Ilgun et al., 1995) to machine learning based methods. Currently, commonly used machine learning methods include SVM, DT (Teng et al., 2018), LR (Besharati et al., 2019), neural networks (Canedo and Romariz, 2019), deep learning (Al-Qatf et al., 2018) and so on.

Deep learning appeared relatively late. In 2006, Hinton first proposed the concept of deep learning (Hinton et al., 2006), which has strong feature learning capabilities through unsupervised layer-by-layer training. It has therefore been widely used in speech (Tu et al., 2019), image (Liu et al., 2018) and natural language processing (Wang et al., 2020). In recent years, with the growing number of intrusions, the increase in data dimensionality, and the maturing of deep learning theory, researchers have begun to apply deep learning to intrusion detection. In intrusion detection, the autoencoder (Sadaf and Sultana, 2020) replaces high-dimensional input layer neurons with hidden layer labeled neurons to achieve dimensionality reduction and feature extraction; intrusion data can be preprocessed into uniform traffic gray maps, and a convolutional neural network (Nguyen and Kim, 2020) is used to extract features from the gray images, thereby expanding the number of training samples. A deep belief network (Zhang et al., 2020) uses restricted Boltzmann machines for unsupervised pre-training, and using the obtained results as the initial values of the supervised probabilistic training model greatly improves the learning performance. Although deep learning models have achieved good research results in the field of intrusion detection, in most cases a model's performance is only verified on a network-based intrusion detection dataset, without considering whether it can detect intrusions in host-based datasets.

Therefore, this paper fully considers the characteristics of different intrusion detection systems and proposes integrated deep learning models for different intrusion data sources, in order to fully exploit the advantages of the integrated deep learning models and achieve the best detection effect for each data source. At present, network-based intrusion detection systems use raw network packets as data sources. The main problems are that the network data traffic is noisy, responses are not timely, the signature database is not updated in time, and timeliness is poor. Since the stacked denoising AutoEncoder (SDAE) can reduce the noise of the data very well and has better robustness, and the extreme learning machine (ELM) has fast learning speed, the integration of SDAE and ELM is proposed. The integrated model improves the detection speed while reducing the noise of network data traffic, and can detect network-based intrusions in real time. On the other hand, host-based intrusion detection systems use the log files of each host as the main data source. As the number of log files is limited, some intrusion means and methods will not appear in the log files. Considering that deep belief networks (DBN) can fully mine the features in the data and the Softmax classifier can better identify multiple attack types, we propose an integrated deep learning model, DBN-Softmax, which has high accuracy and is a better detection method for host-based intrusion detection systems.

Overall, this work makes the following contributions to the intrusion detection domain:

1) Considering that the performance of the SDAE model can still be improved, this paper proposes an integrated deep intrusion detection model based on SDAE-ELM. The SDAE reduces the noise of network data traffic well and has strong robustness, while the ELM has the advantage of fast training speed and enables faster intrusion detection.
2) Considering that the Sigmoid activation function in the DBN model is more suitable for binary classification problems, this paper builds a deep learning framework combining the deep structure of DBN with the Softmax classifier, and designs an integrated deep intrusion detection model based on DBN-Softmax that trains the network using pre-training and fine-tuning.
3) SDAE-ELM and DBN-Softmax are applied to NIDS and HIDS data sets, respectively. This is because the NIDS datasets contain huge amounts of data and much noise, and the SDAE-ELM model can remove the noise of the NIDS datasets and improve detection performance and speed; the number of log files in the HIDS dataset is limited and most intrusions will not be recorded by the host, and the DBN-Softmax model has better data mining capabilities, thereby improving detection performance.
4) In order to verify the effectiveness of the proposed methods and models, we apply the SDAE-ELM and DBN-Softmax models to the network-based intrusion detection datasets KDD Cup99, NSL-KDD, UNSW-NB15, and CIDDS-001, and the host-based intrusion detection dataset ADFA-LD, respectively, and compare them with various machine learning algorithms and deep learning models using several experimental evaluation metrics.

The rest of the paper is organized as follows. Related works are discussed in Section 2. In Section 3, we present the deep learning and intrusion detection models proposed in this paper. Section 4 describes the intrusion detection data sets. The mathematical details of the intrusion detection models and the evaluation criteria are introduced in Section 5. The performance of the models is evaluated in Section 6. Finally, we draw conclusions and suggest future work in Section 7.
based on the existing algorithm and implements intrusion detection on distributed nodes, which can greatly improve the detection speed of the model and achieve the purpose of real-time detection.

In summary, good progress has been made in hybrid feature selection technology, classifier model optimization, and intrusion detection technology for different network environments. Table 1 compares the intrusion detection models used by different researchers. Unlike the existing intrusion detection models, the two integrated deep learning models proposed in this paper, SDAE-ELM and DBN-Softmax, are built on the deep learning approach and fully consider the influence of the data source on the detection performance of the models. The two models are applied to the network-based intrusion detection datasets KDD Cup99, NSL-KDD, UNSW-NB15, and CIDDS-001 and the host-based intrusion detection dataset ADFA-LD, respectively. The intrusion detection datasets used in this paper are relatively complete, and we use multiple evaluation indicators, including accuracy, precision, true positive rate, false positive rate, F1-Score, P-R curve, ROC curve and AUC value, to evaluate the performance of the proposed models, so the evaluation is more scientific and comprehensive.

3. Intrusion detection models

As classic methods in deep learning, SDAE and DBN have achieved good results when applied to shallower intrusion detection models, but they have certain limitations. The BP algorithm is used in the fine-tuning process, and as the number of layers increases it suffers from sparse gradients and local optima. In SDAE, the BP algorithm is prone to local minima and requires multiple iterations to determine the network output weights, which hampers the learning of the network; in contrast, the weights and thresholds of the ELM network are calculated only once by the least squares method to obtain the optimal values, with no iterative BP updates, so model training is faster and takes less time. In DBN, the BP fine-tuning process uses the Sigmoid function as the activation function of the last layer and treats each category as a binary classification. The Sigmoid activation function is computationally expensive, and the BP error propagation involves division operations and is prone to vanishing gradients, so training of a deep network may fail to complete. The Softmax activation function, on the other hand, directly implements multi-class classification with mutually exclusive output categories, avoiding ambiguous multiple assignments, and thus better addresses the problems of the Sigmoid function. Based on this, we propose the integrated deep intrusion detection models SDAE-ELM and DBN-Softmax, and apply them to the network-based data sets and the host-based data set, respectively.

The integrated deep intrusion detection models proposed in this paper are mainly divided into three modules; the training sets and testing sets are used for model training and model testing, respectively.

Intrusion detection module: It determines the numbers of input and output nodes of the network model from the dimensionality of the preprocessed data, then fixes the entire network structure and training parameters according to the hidden layer and other parameters, uses the training sets to train the model, and saves the trained model for testing.

Detection and classification module: The test sets are run through the saved model, and the detection and classification results are displayed to the user.

3.1. Integrated deep intrusion detection model based on SDAE-ELM

3.1.1. Stacked denoising autoencoder
If only one layer of Denoising AutoEncoder is used, the coding ability is relatively limited. SDAE therefore stacks multiple Denoising AutoEncoders for feature extraction: the output of each Denoising AutoEncoder is used as the input of the next, and the stack is pre-trained with a layer-by-layer initialization strategy. After training is completed, the overall network parameters are corrected according to the output error of the last layer of the network. Finally, the classification of a sample is determined by the classifier.

3.1.2. Denoising autoencoder
Denoising AutoEncoder (DAE) (Kachuee et al., 2019) is an extension of the AutoEncoder (AE) (Bengio et al., 2013). Building on the AutoEncoder, the DAE adds noise to the training data and learns to remove that noise from the input, recovering data uncontaminated by noise. This learning method generalizes better than the plain AutoEncoder. The DAE schematic is shown in Fig. 2.

In Fig. 2, $x$ is the original input data, $\bar{x}$ is the corrupted data, $y$ is the feature obtained by encoding $\bar{x}$, and $\tilde{x}$ is the output obtained by decoding $y$. The reconstruction error is:

$L(x, g(f(\bar{x}))) = \|x - g(f(\bar{x}))\|^2$  (1)

The AutoEncoder uses the reconstruction error to represent the training effect during training, and requires the reconstruction error to be minimal so that the common features are preserved as fully as possible. During DAE training, the feature mapping $f(\cdot)$ is applied to the corrupted data $\bar{x}$; therefore, compared to the AutoEncoder, the DAE increases the robustness of the learned features.

The objective function of the Denoising AutoEncoder is given by:

$J(W, b) = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{1}{2}\left\|x^{(i)} - g\!\left(f\!\left(x^{(i)}\right)\right)\right\|^{2}\right] + \frac{\lambda}{2}\sum_{l=1}^{n_{l}-1}\sum_{i=1}^{s_{l}}\sum_{j=1}^{s_{l+1}}\left(W_{ji}^{(l)}\right)^{2}$  (2)
When the activation function can approximate the $N$ samples with zero error, that is, $\sum_{i=1}^{N}\|y_i - t_i\| = 0$, there exist $\beta_i$, $\omega_i$ and $b_i$ such that

$t_j = \sum_{i=1}^{L} \beta_i \, g_i(\omega_i \cdot x_j + b_i), \quad j = 1, 2, \ldots, N$  (6)

The above formula can be expressed as

$H\beta = T$  (7)

Among them,

$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_L \end{bmatrix}, \quad T = \begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix}$

where $H$ is the output matrix of the hidden layer, $\beta$ is the output weight, and $T$ is the desired output vector.

The connection weights between the hidden layer and the output layer can be obtained as the least squares solution of this system of equations:

$\hat{\beta} = H^{+}T$

where $H^{+}$ is the generalized inverse matrix of the hidden layer output matrix $H$.
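A minimal sketch of the ELM training rule in Eqs. (6) and (7): the input weights and biases are drawn at random and frozen, and only $\beta$ is computed in closed form with the Moore-Penrose pseudo-inverse. The names, sizes, and the tanh activation are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def elm_train(X, T, L=100):
    """Solve H @ beta = T in the least-squares sense (Eqs. 6-7)."""
    n_in = X.shape[1]
    w = rng.normal(size=(n_in, L))   # random input weights, never updated
    b = rng.normal(size=L)           # random hidden biases
    H = np.tanh(X @ w + b)           # hidden layer output matrix H
    beta = np.linalg.pinv(H) @ T     # beta = H^+ T  (Moore-Penrose inverse)
    return w, b, beta

def elm_predict(X, w, b, beta):
    return np.tanh(X @ w + b) @ beta

# Toy usage: one-hot 2-class targets for 500 synthetic records.
X = rng.random((500, 41))
T = np.eye(2)[rng.integers(0, 2, 500)]
w, b, beta = elm_train(X, T)
pred = elm_predict(X, w, b, beta).argmax(axis=1)
```

Because the solve is a single pseudo-inverse rather than an iterative descent, this is where the ELM's speed advantage over BP fine-tuning comes from.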
3.1.4. SDAE-ELM model and detection procedures
In this paper, the SDAE-ELM integrated deep model is used as the network-based intrusion detection system, and the experimental data sets are all network-based data sets. The model first uses SDAE to learn the features of the data sets; the features learned by the SDAE are then input into the ELM algorithm for fine-tuning to obtain the trained SDAE-ELM model. Finally, the test set data are input into the SDAE-ELM model to complete the intrusion detection. The structure of the SDAE-ELM model is shown in Fig. 3, and the SDAE-ELM intrusion detection flowchart is shown in Fig. 4.

The specific steps of SDAE-ELM intrusion detection are as follows:

Step 1: Preprocess the intrusion detection data sets, which mainly includes high-dimensional data feature mapping and data normalization.

Step 2: SDAE-ELM model training:

2) The unsupervised algorithm is used to train the first DAE so that its reconstruction error is controlled within a certain range, and its output is used as the input of the next layer of DAE;
3) The output of the previous layer of DAE is used as input, and the unsupervised algorithm is used to train the DAE of this layer, so that its reconstruction error is controlled within a certain range;
4) Repeat step 3) until all DAEs have been trained;
5) The ELM algorithm is used to learn the features extracted by the DAEs, and the optimal weights and thresholds of the model are determined in a single pass by the least squares method, until the specified number of training iterations is reached.

Step 3: SDAE-ELM model testing: input the test data sets into the trained SDAE-ELM model to obtain the classification result for each data set.
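Steps 1-3 can be summarized in a compact sketch. Here the stacked encoder weights are random placeholders standing in for layer-wise pre-trained DAEs, so the sketch illustrates only the data flow (preprocess, stacked encoding, ELM head), not the trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Step 1: preprocessing stand-in -- min-max normalization of numeric features.
X = rng.random((1000, 41))
y = rng.integers(0, 2, 1000)
X = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-12)

# Step 2a: stacked encoders (weights would come from layer-wise DAE pre-training;
# random weights are used here only to keep the sketch self-contained).
enc_weights = [(rng.normal(0, 0.1, (41, 32)), np.zeros(32)),
               (rng.normal(0, 0.1, (32, 16)), np.zeros(16))]

def sdae_features(X):
    for W, b in enc_weights:
        X = sigmoid(X @ W + b)
    return X

# Step 2b: ELM head on the learned features (beta = H^+ T).
F = sdae_features(X)
w, b = rng.normal(size=(16, 64)), rng.normal(size=64)
H = np.tanh(F @ w + b)
beta = np.linalg.pinv(H) @ np.eye(2)[y]

# Step 3: detection -- here run on the training records themselves.
scores = np.tanh(sdae_features(X) @ w + b) @ beta
print("predicted attack rate:", (scores.argmax(1) == 1).mean())
```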
entire DBN. After multiple iterations, the optimal parameters of the entire DBN can be obtained.

3.2.2. Restricted Boltzmann machine
The Restricted Boltzmann Machine (Chen et al., 2019) is an energy-based model, which can also be regarded as a special type of Markov random field. Generally, an RBM represents the relationships between random variables. The RBM defines a joint distribution over the visible units $v$ and hidden units $h$:

$P(v, h; \theta) = \frac{\exp(-E(v, h, \theta))}{Z(\theta)}$

where $Z(\theta)$ is the regularization factor, given by the sum of the energy terms over all visible and hidden unit configurations:

$Z(\theta) = \sum_{v}\sum_{h} \exp(-E(v, h, \theta))$  (11)

The system energy function of the RBM is:

$E(v, h, \theta) = -\sum_{i=1}^{D} a_i v_i - \sum_{j=1}^{F} b_j h_j - \sum_{i=1}^{D}\sum_{j=1}^{F} w_{ij} v_i h_j = -a^{T}v - b^{T}h - v^{T}Wh$  (12)

where $w_{ij}$ represents the weight between visible layer unit $i$ and hidden layer unit $j$, and $a_i$ and $b_j$ are the bias values of the visible layer and hidden layer units, respectively.

In the classification task, the output of the RBM is 0 or 1, and the Sigmoid activation function is usually used. The conditional probabilities of $h$ and $v$ are then calculated as follows:

$P(h_j = 1 \mid v) = \mathrm{sigmoid}\Big(b_j + \sum_{i} v_i w_{ij}\Big)$  (13)

$P(v_i = 1 \mid h) = \mathrm{sigmoid}\Big(a_i + \sum_{j} h_j w_{ij}\Big)$  (14)

3.2.3. Softmax classifier
The principle of the Softmax classifier (Zeng et al., 2014) is very simple: it is an extension of Logistic Regression (LR). The biggest difference between the two is that LR category labels can take only two values, whereas Softmax admits multiple category labels, making it more suitable for multi-class classification problems. The Softmax classifier maps input vectors from an N-dimensional space to categories and outputs the classification results as probabilities. The probability formula is:

$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 \mid x^{(i)}; \theta) \\ p(y^{(i)} = 2 \mid x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = K \mid x^{(i)}; \theta) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} e^{\theta_k^{T} x^{(i)}}} \begin{bmatrix} e^{\theta_1^{T} x^{(i)}} \\ e^{\theta_2^{T} x^{(i)}} \\ \vdots \\ e^{\theta_K^{T} x^{(i)}} \end{bmatrix}$  (15)

where $p(y^{(i)} = K \mid x^{(i)}; \theta)$ represents the probability that $x^{(i)}$ belongs to category $K$, and the sum of all elements in the vector is 1.
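A small numerical sketch of Eqs. (13)-(15): one half-step of Gibbs sampling using the two RBM conditionals, followed by the Softmax mapping. All sizes and parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

D, F = 6, 4                      # visible and hidden units
W = rng.normal(0, 0.1, (D, F))   # weights w_ij
a = np.zeros(D)                  # visible biases a_i
b = np.zeros(F)                  # hidden biases b_j

v = rng.integers(0, 2, D).astype(float)
p_h = sigmoid(b + v @ W)                  # Eq. (13): P(h_j = 1 | v)
h = (rng.random(F) < p_h).astype(float)   # sample binary hidden states
p_v = sigmoid(a + W @ h)                  # Eq. (14): P(v_i = 1 | h)

def softmax(logits):
    z = logits - logits.max()             # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()                    # Eq. (15), normalized over K classes

theta = rng.normal(size=(5, F))           # one parameter vector per class
print(softmax(theta @ p_h))               # class probabilities; sums to 1
```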
Table 2 – Training and testing connection records of KDD Cup99 and NSL-KDD.
detection system, we verify its detection ability on the ADFA-LD data set.

4.1. Network-based intrusion detection data sets

(1) KDD Cup99 (KDD, 2020): This data set comes from the 1998 DARPA intrusion detection evaluation project. All the network data comes from a simulated US Air Force LAN into which many simulated attacks were injected. The training data consists of 7 weeks of network traffic containing about 5 million network connections; the test data consists of 2 weeks of network traffic with about 2 million network connections. The data set comes in two forms, a complete data set and a 10 percent subset. It contains forty-one attributes and one category label, with five label categories: Normal, Probe, DoS, U2R, and R2L. Table 2 describes the KDD Cup99 dataset in detail.
(2) NSL-KDD (NSL-KDD, 2020): This data set is an improvement of KDD Cup99 in which the redundant data and duplicated records have been deleted. It includes the complete data set and a 20 percent subset, and is more suitable for misuse detection than KDD Cup99. Table 2 details the NSL-KDD data set.
(3) UNSW-NB15 (UNSW-NB15, 2020): This data set was generated from network traffic collected by the Australian Security Laboratory in 2015. It is a comprehensive network attack traffic data set, split into a training set and a test set, consisting of one normal flow class and nine abnormal flow classes. Each data flow is described by forty-two features plus the final label. The detailed description of the UNSW-NB15 data set is shown in Table 3.
(4) CIDDS-001 (CIDDS-001, 2020): This data set is based on labeled flows and is used to evaluate anomaly intrusion detection systems. The data was generated by simulating a small business environment with OpenStack and External servers. It consists of three log files (attack log, client configuration and client log). The OpenStack and External servers captured 3.12 million and 60,000 network flows, respectively, each with ten characteristic attributes and one category label. Table 4 describes the External part of the CIDDS-001 data set in detail.

4.2. Host-based intrusion detection data set

ADFA-LD (Linux Dataset) (ADFA-LD, 2020): This data set is a set of host-level intrusion detection data released by the Australian Defence Force Academy, consisting of system call sequences of intrusion events (single process, system call APIs within a time window). The data set mainly contains three types of data. Tables 5 and 6 detail the ADFA-LD data set.

For the network-based intrusion detection data sets, the original data cannot be fed directly into the network model for intrusion detection; the data sets must be preprocessed in advance. Preprocessing includes two steps: (1) convert the symbolic features in the training and testing sets into numerical representations; (2) convert the category labels to numerical representations. Data set preprocessing is described in detail in Tables 7–9.

For the host-based intrusion detection data set, since the data are generated as time series of system calls, the raw data likewise cannot be fed directly into the detection model and must be preprocessed. The preprocessing includes 2 steps: (1) Use the Bag of Words (BOW) model (Bahmanyar et al., 2015) to process the data set. The BOW is mainly used to process text data; it ignores the contextual relationships between the words in a text and keeps only word weights, where each weight is related to the frequency of the word in the text. We therefore use the BOW to characterize the ADFA-LD data set and convert it into a data set that the neural network can process. (2) Convert category labels to a numerical representation. This preprocessing is described in detail in Table 10.

We randomly select 20,000 connections from the network-based data sets and 2000 connections from the host-based data set and visualize them using the t-SNE method (Van der Maaten and Hinton, 2008). The effectiveness of t-SNE can be seen from Fig. 8(a), where the horizontal axis represents distance and the vertical axis represents density. For points of greater similarity, the distance under the t distribution in the low-dimensional space needs to be slightly smaller; for points of low similarity, the distance under the t distribution in the low-dimensional space needs to be longer. This is exactly what we need: points within the same cluster (closer distance) are aggregated more closely, and points in different clusters are farther apart.
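The two ADFA-LD preprocessing steps and the t-SNE projection can be sketched with scikit-learn; the system-call traces below are invented stand-ins for real ADFA-LD trace files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from sklearn.preprocessing import LabelEncoder

# Each record is a system-call number sequence rendered as a space-separated
# string (made-up traces; real ones come from the ADFA-LD trace files).
traces = ["6 174 11 45 33 192", "6 174 45 45 6 11",
          "240 311 6 6 102 45", "240 311 102 6 45 45",
          "6 11 45 174 192 33", "311 240 102 45 6 6"]
labels = ["Normal", "Normal", "Adduser", "Adduser", "Normal", "Adduser"]

# Step (1): Bag of Words -- frequency of each system call, order discarded.
bow = CountVectorizer(token_pattern=r"\d+")
X = bow.fit_transform(traces).toarray()

# Step (2): category labels to numeric representation.
y = LabelEncoder().fit_transform(labels)

# t-SNE projection to 2-D for visualization (perplexity must be < n_samples).
emb = TSNE(n_components=2, perplexity=2.0, random_state=0).fit_transform(
    X.astype(float))
print(emb.shape, y)
```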
When the number of neurons in the hidden layers increases from 256 to 512, the model's detection efficiency deteriorates due to overfitting. Therefore, in the following experiments, the number of hidden layer neurons is set to 256. With fewer parameters, the SDAE-ELM model achieves good classification results early in the iterations, so we do not tune a final iteration number for this model. Applying the above parameters to the DBN-Softmax model, the model also obtains a good detection effect at the early stage of the iterations. To determine the final number of iterations, the remaining algorithms are optimized: after 100 training epochs a good classification effect is achieved, and further training does not significantly improve it. We therefore set the number of training epochs to 100. The learning rate has a great influence on the training speed of the model.
and categorical_crossentropy loss functions, respectively.

$loss = -(y_i \log p_i + (1 - y_i)\log(1 - p_i))$  (17)

where $y_i$ is the label of the sample (1 for the positive class and 0 for the negative class), and $p_i$ is the probability that sample $i$ is predicted to be positive.

$loss = -\sum_{c} y_{ic} \log p_{ic}$  (18)

where $y_{ic}$ is an indicator variable that is 1 if category $c$ is the category of sample $i$ and 0 otherwise, and $p_{ic}$ is the probability that sample $i$ is predicted to belong to category $c$.
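A direct transcription of Eqs. (17) and (18) in NumPy, with illustrative probabilities:

```python
import numpy as np

def binary_crossentropy(y, p):
    # Eq. (17): y is 1 for the positive class, 0 otherwise.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_crossentropy(y_onehot, p):
    # Eq. (18): y_ic is the one-hot indicator, p_ic the predicted probability.
    return -np.sum(y_onehot * np.log(p), axis=-1)

print(binary_crossentropy(1, 0.9))                    # confident & correct -> small loss
print(categorical_crossentropy(np.array([0, 1, 0]),
                               np.array([0.1, 0.7, 0.2])))
```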
5.1.2. Activation function
If no activation function is used in the neural network, then no matter how many layers we train, the final model output is a linear combination of the inputs. We therefore need an activation function, applied to the weighted sum of the inputs and the bias, to determine whether a neuron node fires. The Rectified Linear Unit (ReLU) activation function was proposed by Nair and Hinton in 2010 (Nair and Hinton, 2010). Compared to the Sigmoid and tanh activation functions, ReLU preserves as much linearity as possible and has no saturation region and no gradient vanishing; moreover, it is simple to compute, highly efficient, and fast to converge. ReLU is defined as follows:

$f(x) = \max(0, x)$  (19)

where $x$ is the input to the activation function.
5.1.3. Mini-Batch gradient descent
Gradient Descent (GD) is an iterative method that seeks the global minimum of the objective function by moving in the direction of the negative gradient; in deep learning, the objective is the minimization of the loss function. In traditional deep neural network models, Stochastic Gradient Descent (SGD) (Liu et al., 2019) and Batch Gradient Descent (BGD) (Si et al., 2019) are often used to optimize the objective function; the two methods have their own advantages and disadvantages in different settings. In order to improve the training speed of the model, Mini-Batch Gradient Descent (MBGD) (Messaoud et al., 2020) is used to train the network models in this paper. BGD needs all samples to update the parameters and the loss function each time the weights are updated; it can reach the global optimal solution, but processing big data becomes very tricky, training is very slow, and training may become impossible due to insufficient memory. SGD updates the parameters with a single training sample each time, which is fast, but the gradient computed from one random sample deviates from the gradient computed over all samples, so the update direction may be poor and the global optimal solution may not be reached. MBGD uses a subset of the samples for each update, which overcomes the shortcomings of SGD and BGD and combines the advantages of both: MBGD updates the model parameters quickly, improves the computational efficiency of the model, and keeps the model from falling into a local optimum. Table 12 shows the advantages and disadvantages of SGD, BGD and MBGD.

The mini-batch size of MBGD is a parameter independent of the overall architecture of the network, so it does not need to be tuned jointly with the other optimized hyperparameters. We finally set the mini-batch size to 100. Fig. 9 shows the loss function curves of BGD and MBGD for the SDAE-ELM network on the KDD Cup99 dataset. Due to noise, MBGD oscillates during learning, but overall its loss value is lower than that of BGD.

5.2. Model evaluation criteria

This paper uses the accuracy, precision, true positive rate, false positive rate, F1-Score, P-R curve, ROC curve, and AUC value to evaluate the classification ability of the models. These values are obtained from the confusion matrix in Table 13, where True Positive (TP) is the number of connection records correctly classified to the Normal class, True Negative (TN) is the number of connection records correctly classified to the Attack class, False Positive (FP) is the number of Normal connection records wrongly classified as Attack records, and False Negative (FN) is the number of Attack connection records wrongly classified as Normal records.

Indicators for model evaluation (Zaidi et al., 2016; Liang et al., 2020) are defined as follows:

Accuracy: the ratio of correctly recognized connection records to the entire test dataset; the higher the accuracy, the better the detection model ($Accuracy \in [0, 1]$). Accuracy is a good measure for a test dataset with balanced classes:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$  (20)

Precision: the ratio of correctly identified attack connection records to all records identified as attacks; the higher the precision, the better the detection model ($Precision \in [0, 1]$):

$Precision = \frac{TP}{TP + FP}$  (21)

F1-Score: also called F1-Measure, the harmonic mean of Precision and Recall; the higher the F1-Score, the better the detection model ($F1\text{-}Score \in [0, 1]$):

$F1\text{-}Score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$  (22)

True Positive Rate (TPR): also called Recall, the ratio of correctly classified attack connection records to the total number of attack connection records; the higher the TPR, the better the detection model ($TPR \in [0, 1]$):

$TPR = \frac{TP}{TP + FN}$  (23)

False Positive Rate (FPR): the ratio of normal connection records flagged as attacks to the total number of normal connection records.
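The indicators of Eqs. (20)-(23), together with FPR, reduce to a few lines given raw confusion-matrix counts; the counts below are illustrative:

```python
def detection_metrics(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)          # Eq. (20)
    precision = tp / (tp + fp)                           # Eq. (21)
    tpr       = tp / (tp + fn)                           # Eq. (23), also Recall
    f1        = 2 * precision * tpr / (precision + tpr)  # Eq. (22)
    fpr       = fp / (fp + tn)                           # normal flagged as attack
    return dict(accuracy=accuracy, precision=precision,
                recall=tpr, f1=f1, fpr=fpr)

print(detection_metrics(tp=960, tn=910, fp=40, fn=90))
```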
Table 12 – Advantages and disadvantages of SGD, BGD and MBGD.

Method  Advantages                                                                          Disadvantages
SGD     fast training every time                                                            decreased accuracy; not necessarily the global optimal solution; not easy to implement in parallel; poor convergence
BGD     global optimal solution; easy to implement in parallel                              when the sample data is large, the computation overhead is high and the computation is slow
MBGD    reduced computational overhead; reduced randomness; easy to implement in parallel   —
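To make the MBGD row of Table 12 concrete, a minimal mini-batch update loop on a least-squares objective follows; the batch size of 100 matches the paper's choice, while the data, learning rate and epoch count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((5000, 41))
true_w = rng.normal(size=41)
y = X @ true_w + 0.1 * rng.normal(size=5000)

w = np.zeros(41)
batch, lr = 100, 0.05              # mini-batch size of 100, as in the paper
for epoch in range(20):
    order = rng.permutation(len(X))            # reshuffle each epoch
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch    # gradient on the mini-batch only
        w -= lr * grad                         # one parameter update per batch
```

Each update touches only 100 records, which is why MBGD keeps memory use bounded (unlike BGD) while averaging out most of the gradient noise of single-sample SGD.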
model's binary classification ability is evaluated in terms of precision, recall and F1-Score. On each data set, every model detects well except for the SOM algorithm, whose effect is poor.

In most cases, the SDAE-ELM model achieves better detection performance than the traditional machine learning models. SDAE-ELM1, SDAE-ELM2, and SDAE-ELM3 achieve similar detection results and are overall better than DT, ELM, SVM, and DNN. For the KDD Cup99 dataset, the accuracy of the algorithm is greater than 93%; as the model depth increases, the precision and recall of SDAE-ELM improve, and in most cases they are better than those of the other models. SDAE-ELM3 shows an increase of 0.17% over SDAE. Compared with AdaBoost and DNN, the recall of SDAE-ELM is slightly worse, mainly because the KDD Cup99 data set contains redundant and duplicated data, and fully learning this data set yields better classification results. Although SDAE-ELM removes the noise in the data set well, its data mining ability is relatively weak compared with AdaBoost and DNN, so its ability to detect the KDD Cup99 data set is somewhat reduced, which is also expected. For the NSL-KDD
data set, compared to the KDD Cup99 data set, the number of training records is greatly reduced, so the model learns fewer features and the detection ability of every algorithm drops, although each algorithm retains its relative standing. For the UNSW-NB15 data set, in most cases SDAE-ELM achieves better detection performance, and performance improves as depth increases, while the remaining algorithms achieve similar detection performance. For the CIDDS-001 data set, the detection performance of every algorithm is better than on the other data sets, because CIDDS-001 has more training data and each algorithm can better learn its characteristics. In terms of accuracy, SDAE is 4.32% worse than SDAE-ELM; SDAE-ELM is slightly worse than SVM in precision, but better than the other algorithms; in terms of recall, AdaBoost, ELM, and DNN reach 100%, that is, these three algorithms detect all attack data.

F1-Score is a comprehensive evaluation index of precision and recall and better reflects the classification ability of the model. In most cases, the F1-Score of SDAE-ELM is better. On the KDD Cup99 data set, the F1-Score improves as the model depth increases, trailing DNN by only 0.0038. On the NSL-KDD data set, SDAE-ELM obtains the optimal F1-Score. On the UNSW-NB15 data set, SDAE-ELM obtains the optimal F1-Score, which is much higher than that of the other detection models. In general, across multiple data sets, the SDAE-ELM model has better binary classification detection performance.

When applying the SDAE-ELM model to a network-based intrusion detection system, in addition to ensuring good detection performance, we also need to consider the time performance of the model. We expect SDAE-ELM to complete intrusion detection on network-based data sets quickly, which is confirmed by the Time indicators in Tables 14-17. Although SDAE-ELM requires a longer training time than ELM and SOM, compared with the other machine learning and deep learning models its training time is greatly reduced, which better meets our real-time requirements.

The confusion matrix uses a heat map to show differences in the data through color and brightness, which is easy to understand. It summarizes the records in the data set according to the actual and predicted results to achieve visualization, and the evaluation indexes of the model are also obtained from it. Fig. 10 shows the confusion matrices of some algorithms on each data set. It can be seen that in most cases the classification of the SDAE-ELM model is the most accurate: the values in the first and third quadrants of its confusion matrix are the largest, and those in the second and fourth quadrants are the smallest. The confusion matrices of other algorithms such as SVM, ELM, and SDAE are very similar: although the values in the first and third quadrants are large, the values in the second and fourth quadrants are also large, so these algorithms produce false positives. Figs. 11 and 12 show the accuracy curves and accuracy boxplots for each data set. As can be seen from Fig. 11, on the KDD Cup99 and UNSW-NB15 data sets, AdaBoost, DT, SVM, and LR reach their optimal accuracy at the beginning of the iterations, and the accuracy remains unchanged as the number of iterations increases. For DNN, DBN, and SDAE, the accuracy increases with the number of iterations and is basically stable in the later stages; after 25 iterations for DBN and SDAE, and 55 iterations for DNN, the accuracy is basically unchanged. The SDAE-ELM model obtains its optimal accuracy at the beginning of the iterations. In most cases, the accuracy of the SDAE-ELM model is better than that of traditional machine learning. The boxplot reflects the outliers and the distribution of the data. As can be seen from Fig. 12, for the NSL-KDD and CIDDS-001 data sets, the AdaBoost, DT, SVM, LR, and SDAE-ELM models have very stable accuracy distributions with no outliers; the remaining models, in particular the ELM algorithm, fluctuate strongly, especially on the CIDDS-001 data set. For the DNN model, the accuracy fluctuates over several iterations and outliers appear, because the accuracy of the DNN model changes continuously during those iterations. Comparing the accuracy on the CIDDS-001 and NSL-KDD datasets, the accuracy of each algorithm on CIDDS-001 is better, indicating that the SDAE-ELM model also detects newer attack types well. Fig. 13 shows the P-R curves of the data sets. The P-R curve describes the
relationship between precision and recall, but compared to the ROC curve, the P-R curve pays more attention to positive samples. As can be seen from Fig. 13, when an original deep learning model is applied directly to each data set for intrusion detection, its detection effect is slightly worse than that of the traditional machine learning models; for each data set, the P-R curves of the SDAE-ELM models are very close to each other. Comparing the traditional machine learning models and SDAE-ELM, we can conclude that the P-R curve of SDAE-ELM is slightly worse than that of DT but better than those of the other models; this is because, in binary classification, DT can mine more data features, so its P-R curve is better.

6.1.2. Multi-class classification experimental results
When the intrusion detection data sets contain normal samples and different types of attack data, the experiment becomes a multi-class classification experiment. Tables 18-21 show the multi-class classification results for each data set. The multi-class data sets alleviate the class imbalance of the binary classification data sets to a certain extent, but there are still large differences in the amount of data of each type. Overall, each model detects classes with larger data volumes better. For multi-class classification, the classification ability of the model is evaluated using the true positive rate, false positive rate, and AUC value. As with binary classification, on each data set all models classify well except for the SOM, whose multi-class classification ability is poor.

In most cases, the SDAE-ELM model achieves better detection results, but when detecting certain types of data its performance may be slightly worse than that of a traditional machine learning model. As can be seen from Table 18, for "Normal" type data, the comprehensive detection performance of AdaBoost and ELM is the worst, with true positive rates below 85%, while the true positive rates of the remaining models are greater than 93%; in particular, the true positive rate of SDAE-ELM reaches 96.82%, showing that SDAE-ELM fully learns the characteristics of the "Normal" samples and obtains better true positive and false positive rates. For the "Probe" attack type, the detection performance of AdaBoost, DT, ELM, SVM, LR and SDAE-ELM is similar and far better than that of DNN, DBN, and SDAE. For the "DoS" attack type, every algorithm detects well: even the worst, SDAE-ELM2, reaches 99.93%, and the true positive rates of AdaBoost, DT, SVM, LR, and SDAE all reach 100%. Because "DoS" has the largest number of attacks, and this paper discriminates according to the difference between intrusion data and normal data, each algorithm detects the abundantly sampled "DoS" attacks well. For the "R2L" and "U2R" attack types, in most cases the performance of AdaBoost, DT, ELM and SDAE-ELM is similar and poor, while SVM, LR, DNN, DBN, and SDAE perform worst and basically cannot recognize the "R2L" and "U2R" attack types; however, the true positive rate of SDAE-ELM1 is 5.26% higher than that of SDAE, which is in line with our expectations. There are few "R2L" and "U2R" attacks, and most of them are disguised as
legitimate users, making their characteristics very similar to normal data, which makes it difficult to detect "R2L" and "U2R" attacks. The NSL-KDD and KDD Cup99 datasets have the same data types and differ only in the number of training and testing records. Comparing Tables 19 and 18, each algorithm maintains detection performance on NSL-KDD similar to that on KDD Cup99, but the detection rates decrease, mainly because the training data of NSL-KDD is much smaller than that of KDD Cup99; the model learns fewer features, which leads to slightly worse detection performance on the NSL-KDD data set.

At present, in the field of intrusion detection, most researchers still experiment with the KDD Cup99 dataset from the Lincoln Laboratory in the United States. This dataset was good for a certain period, but it was collected more than 20 years ago, and today's complex networks cannot be evaluated with it. Therefore, we further apply the SDAE-ELM model to the UNSW-NB15 and CIDDS-001 data sets, which contain newer attack types. As can be seen from Table 20, the UNSW-NB15 data set has more attack types, ten classes in total including the normal samples. Moreover, this data set contains newer types of network attacks that now appear on the network; for example, the "Backdoors", "Shellcode", "Reconnaissance" and "Worms" attack types better reflect the characteristics of current network intrusions. Considering all classification results, SDAE-ELM obtains better true positive rates, false positive rates, and AUC values. However, its detection rate on some attack types is slightly worse than that of DT. For example, for the "Worms" attack type, only DT achieves a non-zero true positive rate, 13.64%, while the true positive rates of the remaining algorithms are 0%; but for the "Generic" attack type, the SDAE-ELM model obtains the best true positive rate and false positive rate. For the "Analysis", "Backdoors" and "Worms" attack types, every algorithm detects poorly because the number of training samples is too small and the data distribution is imbalanced, which affects the classification ability of the algorithms. "Generic", as a general attack, mainly attacks servers and is very close in characteristics to the "Exploits" attack type, but there is data imbalance between the two attack types, so misjudgments may occur in the detection results.

Although there are only four types of attacks in the CIDDS-001 data set, they were captured in a 2017 simulated small-scale business environment, and so far few researchers have
applied it to the field of intrusion detection. It is therefore all the more meaningful for us to use it to assess the performance of the models. As can be seen from Table 21, SDAE-ELM achieves better detection results when all evaluation indicators are considered together. For the "Normal" data type, all algorithms except ELM, LR, and DNN achieve a good comprehensive detection effect, with true positive rates greater than 95%, indicating that each of these algorithms learns the characteristics of "Normal" data well and classifies this type of data well. For the "Attackers" type of attack, only AdaBoost classifies it perfectly: its true positive rate is 100%, its false positive rate is 0%, and its AUC value is 1. The detection performance of the remaining algorithms is poor; their true positive rates are all 0%, so they cannot identify this attack type at all. This is because "Attackers" have characteristics similar to "Suspicious", so the algorithms misinterpret "Attackers" attacks as "Suspicious" attacks during the classification process. For the "Suspicious" attack type, apart from the SVM algorithm, whose true positive rate is 87.36%, the true positive rates of the other algorithms are greater than 93%. Since "Suspicious" attacks are the most numerous, each algorithm can fully learn their characteristics; therefore, each algorithm detects the large number of "Suspicious" attacks well. For the "Unknown" and "Victim" attack types, SDAE-ELM achieves a better detection rate: the algorithm fully mines the characteristics of these attack types to achieve better classification.

Table 22 shows the time performance of the algorithms on each data set. It can be seen that the larger the data volume of a data set, the longer the training time the model needs, which is reasonable. On each data set, only SOM and ELM take little time, and the training times of the remaining models are longer. In particular, SVM, a classic machine learning method that has achieved good results in many fields, has a relatively high time cost,
while SDAE-ELM also takes some time to train, but compared with the other models, its time performance is greatly improved.

In the process of implementing intrusion detection, each layer of the neural network helps to understand how data are classified into "normal" or "attack" and how attacks are assigned to specific attack categories. To understand this process more intuitively, the activation values are passed to t-SNE for visualization. The KDD Cup99 and CIDDS-001 data sets are shown in Fig. 14(a) and 14(b), respectively. For the KDD Cup99 data set, the "Normal", "DoS", and "R2L" features each appear fully in separate clusters; for the CIDDS-001 data set, the "Normal" type features appear in their own cluster, but "Suspicious" and "Unknown" appear in the same cluster. This shows that the algorithm can identify some attack types well at this stage, but the optimal partition is not yet achieved. For the CIDDS-001 data set, the connection records that SDAE-ELM1 assigns to "Suspicious" are shown in Fig. 15(b); they have characteristics similar to "Unknown". For the NSL-KDD data set, the connection records of AdaBoost are shown in Fig. 15(a); at this point, different types of data in the data set have different characteristics. Although AdaBoost can correctly distinguish the attack categories, for samples with similar features we need to attach more features to distinguish them correctly.

The hidden layer contains many parameters. In deep learning, the parameters of the hidden layer have a great impact on the convergence speed and performance of the model. During training, we continuously iterate the hidden layer parameters to obtain a better classification model; in general, they are obtained in the self-learning process of the model. To understand the hidden layer parameters more intuitively, we visualize the weights of the first DAE hidden layer in the model; the visualization results are expressed in grayscale, as shown in Fig. 16.

The ROC curve is an easy-to-understand graphical tool that can be used with all classification models, and it has a huge advantage: when the distribution of positive and negative samples changes, its shape remains basically unchanged, so it measures the performance of the model itself more objectively. As can be seen from Fig. 17, for the "Normal" data type in the KDD Cup99 dataset, as the model depth increases, the
AUC value of SDAE-ELM improves, but its optimal AUC value is 0.0371 lower than that of ELM. For the "DoS" attack type in the NSL-KDD data set, SDAE-ELM is slightly worse than DT, by 0.0073, but compared to the other models SDAE-ELM obtains the optimal AUC value. For the "Generic" attack type in the UNSW-NB15 data set, the AUC value of each algorithm is good: even the worst, that of LR, reaches 0.8228, although SDAE-ELM is 0.0537 lower than DT. For the "Suspicious" attack type of the CIDDS-001 data set, SDAE-ELM obtains a better AUC value, but some algorithms have poor AUC values: the AUC of ELM is 0.3183 and that of DBN is 0.4849. In summary, in most cases, taking AUC as the
evaluation index, the SDAE-ELM model performs well compared with the existing models. This also shows that SDAE-ELM obtains a higher TPR and a lower FPR.

6.2. Experimental results based on host dataset

Combining the attack types into one abnormal class, the experiment becomes a binary classification problem. Table 23 shows the results of the binary classification experiment on the ADFA-LD data set. When the normal samples and individual types of attack data are considered separately, the experiment becomes a multi-class classification experiment.
Algorithm     TPR     FPR     AUC     TPR     FPR     AUC     TPR     FPR     AUC
AdaBoost      0.9518  0.0036  0.9742  0.5712  0.0227  0.7743  0.0714  0.0020  0.5347
DT            0.9665  0.0043  0.9811  0.7589  0.0154  0.8717  0.5238  0.0111  0.7564
ELM           0.9627  0.0015  0.9805  0.7203  0.1321  0.8828  0       0       0.5
SVM           0.9693  0.3214  0.8240  0       0       0.5     0       0       0.5
LR            0.9694  0.3239  0.8228  0       0       0.5     0       0       0.5
SOM           0       0       0.5     0       0       0.5     0       0       0.5
DNN           0.9693  0.3481  0.8243  0       0       0.4999  0       0       0.4588
DBN           0.9695  0.3244  0.8224  0       0       0.5     0       0       0.5
SDAE          0.9695  0.3235  0.8233  0       0       0.5     0       0       0.5
SDAE-ELM1     0.9629  0.1416  0.9106  0.0023  0       0.5134  0       0       0.5
SDAE-ELM2     0.9640  0.1482  0.9129  0.0031  0       0.5178  0       0       0.5
SDAE-ELM3     0.9691  0.1342  0.9274  0.0043  0       0.5213  0       0       0.5

Algorithm     Worms: TPR  FPR     AUC
AdaBoost      0           0       0.5
DT            0.1364      0.0010  0.5677
ELM           0           0       0.5
SVM           0           0       0.5
LR            0           0       0.5
SOM           0           0       0.5
DNN           0           0       0.5
DBN           0           0       0.5
SDAE          0           0       0.5
SDAE-ELM1     0           0       0.5
SDAE-ELM2     0           0       0.5
SDAE-ELM3     0           0       0.5
Table 22 – Time performance evaluation index.

Algorithm     KDD Cup99  NSL-KDD  UNSW-NB15  CIDDS-001
AdaBoost      4604s      3809s    5123s      6732s
DT            7800s      4328s    8763s      9087s
ELM           1039s      789s     2432s      4321s
SVM           >20h       >10h     >24h       >28h
LR            3421s      2319s    5342s      6123s
SOM           100s       78s      140s       200s
DNN           >7h        >4h      >12h       >15h
DBN           >7h        >4h      >12h       >15h
SDAE          >7h        >4h      >12h       >15h
SDAE-ELM1     3908s      2987s    4321s      5891s
SDAE-ELM2     4109s      3034s    4498s      5987s
SDAE-ELM3     4530s      3298s    4510s      6032s

Table 23 – ADFA-LD binary test results.

Algorithm      Accuracy  Precision  Recall  F1-score
AdaBoost       0.8742    0.3654     0.0896  0.1439
DT             0.8606    0.4531     0.4458  0.4494
ELM            0.8724    0.7143     0.0766  0.1384
SVM            0.8775    0.8        0.0755  0.1379
LR             0.8590    0.25       0.0268  0.0484
SOM            0.4024    0.0952     0.4355  0.1563
DNN            0.8790    NaN        0       NaN
SDAE           0.8770    NaN        0       NaN
DBN            0.8721    0.3214     0.0892  0.1396
DBN-Softmax1   0.8790    0.8108     0.1230  0.2136
DBN-Softmax2   0.8790    0.8113     0.1241  0.2153
DBN-Softmax3   0.8831    0.8243     0.1342  0.2308
Web_Shell

Algorithm      TPR     FPR     AUC
AdaBoost       0       0       0.5
DT             0.1035  0.0154  0.5440
ELM            0       0       0.4997
SVM            0       0       0.5
LR             0       0.0018  0.4991
SOM            0       0.7738  0.1114
DNN            0       0       0.5
SDAE           0       0       0.5
DBN            0       0       0.5
DBN-Softmax1   0.0432  0.0319  0.5675
DBN-Softmax2   0.0791  0.0432  0.5700
DBN-Softmax3   0.0913  0.0443  0.5708
small, so the detection rate is poor, but overall it proves the feasibility of DBN-Softmax.

It can be seen from Table 23 that DBN-Softmax obtains the best accuracy. Except for the SOM, the accuracy of all algorithms is greater than 85.9%. In terms of accuracy, the deeper the DBN-Softmax, the better. In terms of precision, DBN-Softmax is the best among all algorithms, reaching 82.43% at its optimum, which is 57.43% higher than the precision of LR. In terms of recall, DBN-Softmax is slightly worse than DT and SOM but better than the other algorithms. In most cases, the F1-Score of DBN-Softmax is good; it is second only to DT, trailing it by 0.2186.
It can be seen from Table 24 that the detection performance of the SOM is consistently poor. Overall, DBN-Softmax achieves a good detection rate. For the "Normal" type, the true positive rate of every algorithm is good, greater than 92.26%, and for some algorithms even 100%, but each algorithm also has a high false positive rate for the "Normal" type, which leads to a poor comprehensive evaluation of the model: because the samples of the other attack types are relatively few, the models cannot accurately learn the characteristics of each attack type, causing them to falsely report attack samples as normal samples and resulting in a high false positive rate. For the attack types, the detection effect of every algorithm is poor; in most cases, compared with the other algorithms, the detection effect of DBN-Softmax is significantly improved, and the deeper the model, the better the detection effect. Although the ADFA-LD data set contains attack types that do not appear in the network-based data sets, the number of samples of each type is too small, so the detection performance on this data set is slightly worse than on the network-based data sets.

Fig. 18 shows the ADFA-LD binary classification confusion matrices, accuracy, and boxplot. Fig. 18(a)-18(c) are the confusion matrices of some algorithms, Fig. 18(d) is the accuracy curve, and Fig. 18(e) is a boxplot of accuracy. As can be seen from Fig. 18(a)-18(c), the normal samples occupy a considerable proportion of the test data set, and each algorithm performs well on them, as shown by the larger values in the third quadrant for each algorithm; but because the attack samples are too few, the prediction ability on attack samples is poor. Fig. 18(d) and Fig. 18(e) show that the accuracy of the ELM fluctuates across iterations, while the remaining algorithms obtain their optimal accuracy early in the iterations; the figures also show that the accuracy of DBN-Softmax3 is significantly better than that of the other algorithms.

To intuitively understand the distribution of the activation function values of the hidden layer of the model, we use t-SNE to visualize them. The feature mappings of AdaBoost and DBN-Softmax1 are shown in Fig. 19(a) and Fig. 19(b), respectively. In both algorithms, some of the "Normal" type features appear in one cluster, but the remaining "Normal" features and the attack types appear in another cluster, which shows that the algorithms can identify some normal samples well at this stage. For ELM and DBN-Softmax3, the connection records belonging to "Normal" are shown in Fig. 19(c) and 19(d), respectively. These connection records contain some features of other attack types, which
Fig. 19 – Feature mapping and connection record of last hidden layer activation function of multi-class classifications.
To intuitively understand the distribution of the activation function values of the hidden layer of the model, we use t-SNE to visualize them. The feature mappings of AdaBoost and DBN-Softmax1 are shown in Fig. 19(a) and Fig. 19(b), respectively. In both algorithms, some of the "Normal" type features appear in one cluster, while the remaining "Normal" features and the attack types appear in another cluster, which shows that the algorithms can identify some normal samples well at this stage. For ELM and DBN-Softmax3, the connection records belonging to "Normal" are shown in Fig. 19(c) and 19(d), respectively. These connection records contain some features of other attack types, which indicates that although each algorithm can distinguish some normal samples, more features need to be added to better distinguish the attack types. In addition to visualizing the activation function values and connection weights of the hidden layer, we also visualize the hidden layer weights of the first RBM; the size of these weights can be clearly seen from Fig. 20.
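For reference, a projection of this kind can be produced with the scikit-learn implementation of t-SNE; the sketch below substitutes random data for the hidden-layer activations, which are not reproduced here, and the perplexity value is illustrative rather than the paper's setting.

    import numpy as np
    from sklearn.manifold import TSNE

    # Placeholder for the last-hidden-layer activation values of a model
    # (e.g., 500 connection records with 64 hidden units).
    rng = np.random.default_rng(0)
    activations = rng.normal(size=(500, 64))

    # Project the high-dimensional activations to 2-D for plotting.
    embedding = TSNE(n_components=2, perplexity=30,
                     random_state=0).fit_transform(activations)
    print(embedding.shape)  # (500, 2)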
The ROC curve is a comprehensive indicator for evaluating TPR against FPR, and it is easy to interpret. When the numbers of positive and negative samples are imbalanced, the ROC curve is a more stable indicator of model quality. Fig. 21 depicts the ROC curves. It can be seen from the figure that, compared with the other algorithms, the AUC value of DBN-Softmax is significantly improved, and as the number of layers increases, the AUC value of DBN-Softmax gets better; for the "Adduser" attack type, the AUC value of DBN-Softmax3 is 0.0396 higher than that of AdaBoost, while for the "Web-Shell" attack type, the worst AUC value, that of ELM, is 0.4997.
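The ROC curve and its AUC can likewise be computed directly from a model's scores. The following minimal sketch uses scikit-learn with placeholder labels and scores, purely to show the computation described above.

    from sklearn.metrics import roc_curve, roc_auc_score

    # Placeholder labels and positive-class (attack) scores; AUC
    # summarizes the TPR/FPR trade-off and, as noted above, remains
    # stable when positive and negative samples are imbalanced.
    y_true = [0, 0, 1, 1, 0, 1, 0, 1]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(roc_auc_score(y_true, y_score))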
7. Conclusion and future work

Based on the deep Denoising AutoEncoder and the Deep Belief Network, this paper has proposed the integrated deep intrusion detection models SDAE-ELM and DBN-Softmax. SDAE-ELM uses the distributed deep learning model of SDAE, which can handle real-time data, analyze large-scale data, and reduce the noise in the data set. The distributed deep learning model of DBN is used to deeply mine the features in the data set and improve the classification accuracy of the model. At the same time, in order to avoid the problems that the BP algorithm is prone to gradient sparseness and local optima during fine-tuning, and that the original classifier is not suitable for multi-class classification, we have used the ELM algorithm and the Softmax classifier to optimize the SDAE and DBN models, respectively. In addition, in order to update the parameters quickly and improve the computational efficiency of the models, SDAE-ELM and DBN-Softmax are trained using the mini-batch gradient descent method. The SDAE-ELM and DBN-Softmax models have been verified on network-based and host-based intrusion detection data sets, respectively. The results have demonstrated that, for both binary and multi-class classification, the detection performance of the SDAE-ELM and DBN-Softmax models on their respective data sets is better than that of traditional machine learning models.
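To make the training scheme concrete, the following sketch shows a generic mini-batch gradient descent loop in NumPy for a simple linear least-squares model. It illustrates the update rule only; it is not the paper's SDAE-ELM or DBN-Softmax implementation, and the batch size and learning rate are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))      # placeholder training features
    y = X @ rng.normal(size=10)          # placeholder targets
    w = np.zeros(10)                     # model parameters
    lr, batch_size = 0.01, 64            # illustrative hyperparameters

    for epoch in range(20):
        idx = rng.permutation(len(X))    # reshuffle once per epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch only.
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad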
Compared with traditional machine learning models, the SDAE-ELM and DBN-Softmax models have achieved better detection results in intrusion detection. However, while the data mining ability of the SDAE-ELM model is effective, its detection effect on small data sets is poor. In addition, the DBN-Softmax model has the disadvantage of long training times on large data sets and cannot realize real-time detection of intrusions. In the future, we will consider using hybrid feature extraction techniques to reduce the dimensionality of the data set and shorten the training time of the model while ensuring the accuracy of intrusion detection. Furthermore, because the deep intrusion detection model has a complex structure and a large number of parameters, we will consider improving the model neurons and calculation methods, simplifying the network structure, and improving the model efficiency.
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Zhendong Wang: Writing - review & editing. Yaodi Liu: Validation, Formal analysis, Visualization, Supervision, Data curation, Writing - original draft. Daojing He: Writing - review & editing. Sammy Chan: Writing - review & editing.
Serpen G, Aghaei E. Host-based misuse intrusion detection using PCA feature extraction and KNN classification algorithms. Intell. Data Anal. 2018;22(5):1101–14. doi:10.3233/IDA-173493.
Shone N, Ngoc TN, Phai VD, Shi Q. A deep learning approach to network intrusion detection. IEEE Trans. Emerg. Top. Comput. Intell. 2018;2(1):41–50. doi:10.1109/TETCI.2017.2772792.
Si Z, Wen S, Dong B. NOMA codebook optimization by batch gradient descent. IEEE Access 2019;7:117274–81. doi:10.1109/ACCESS.2019.2936483.
Tang TA, Mhamdi L, McLernon D, Zaidi SAR, Ghogho M. Deep learning approach for network intrusion detection in software defined networking. In: 2016 International Conference on Wireless Networks and Mobile Communications (WINCOM); 2016. p. 258–63. doi:10.1109/WINCOM.2016.7777224.
Teng S, Mu N, Zhu H, Teng L, Zhang W. SVM-DT-based adaptive and collaborative intrusion detection. IEEE/CAA J. Automat. Sinica 2018a;5(1):108–18. doi:10.1109/JAS.2017.7510730.
Teng SH, Wu NQ, Zhu HB, Teng LY, Zhang W. SVM-DT-based adaptive and collaborative intrusion detection. IEEE/CAA J. Automat. Sinica 2018b;5(1):108–18. doi:10.1109/JAS.2017.7510730.
Tidjon LN, Frappier M, Mammar A. Intrusion detection systems: a cross-domain overview. IEEE Commun. Surv. Tutor. 2019;21(4):3639–81. doi:10.1109/COMST.2019.2922584.
Tu Y, Du J, Lee C. Speech enhancement based on teacher-student deep learning using improved speech presence probability for noise-robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2019;27(12):2080–91. doi:10.1109/TASLP.2019.2940662.
UNSW-NB15 dataset [Online]. 2020. Available: https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/.
Usman AM, Yusof UK, Naim S. Filter-based multi-objective feature selection using NSGA III and cuckoo optimization algorithm. IEEE Access 2020;8:76333–56. doi:10.1109/ACCESS.2020.2987057.
Van der Maaten L, Hinton G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008;9(11):2579–625.
Wang D, Su J, Yu H. Feature extraction and analysis of natural language processing for deep learning English language. IEEE Access 2020;8:46335–45. doi:10.1109/ACCESS.2020.2974101.
Wang W, Du X, Wang N. Building a cloud IDS using an efficient feature selection method and SVM. IEEE Access 2019;7:1345–54. doi:10.1109/ACCESS.2018.2883142.
Wei B, Zhang W, Xia X, Zhang Y, Yu F, Zhu Z. Efficient feature selection algorithm based on particle swarm optimization with learning memory. IEEE Access 2019;7:166066–78. doi:10.1109/ACCESS.2019.2953298.
Ye Z, Sun Y, Sun S, Zhan S, Yu H, Yao Q. Research on network intrusion detection based on support vector machine optimized with grasshopper optimization algorithm. In: 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS); 2019. p. 378–83. doi:10.1109/IDAACS.2019.8924234.
Zaidi K, Milojevic MB, Rakocevic V, Nallanathan A, Rajarajan M. Host-based intrusion detection for VANETs: a statistical approach to rogue node detection. IEEE Trans. Veh. Technol. 2016;65(8):6703–14. doi:10.1109/TVT.2015.2480244.
Zeng R, Wu J, Shao Z, Senhadji L, Shu H. Quaternion softmax classifier. Electron. Lett. 2014;50(25):1929–31. doi:10.1049/el.2014.2526.
Zhang H, Li Y, Lv Z, Sangaiah AK, Huang T. A real-time and ubiquitous network attack detection based on deep belief network and support vector machine. IEEE/CAA J. Automat. Sinica 2020;7(3):790–9. doi:10.1109/JAS.2020.1003099.

Zhendong Wang [S'06, M'09] received the B.E. (2006) and M.Eng. (2009) degrees from Changchun University of Science and Technology and Harbin University of Science and Technology, respectively, and a Ph.D. degree in computer applied technology from Harbin Engineering University in 2013. Since 2014, he has been with the Department of Information Engineering, Jiangxi University of Science and Technology, P.R. China, where he is currently an associate professor. His research interests include wireless sensor networks, artificial intelligence and network security.

Yaodi Liu received the B.S. degree in Information and Computing Science from Jiangsu Ocean University, in 2018. She is currently pursuing the M.S. degree with Jiangxi University of Science and Technology. Her main research interests include network security and group intelligence optimization algorithms.

Daojing He [S'07, M'13] received the B.Eng. (2007) and M.Eng. (2009) degrees from Harbin Institute of Technology (China) and the Ph.D. degree (2012) from Zhejiang University (China), all in computer science. He is currently a professor in the School of Computer Science and Software Engineering, East China Normal University, P.R. China. His research interests include network and systems security. He is on the editorial board of international journals such as IEEE Communications Magazine and IEEE Network.

Sammy Chan [S'87, M'89] received his B.E. and M.Eng.Sc. degrees in electrical engineering from the University of Melbourne, Australia, in 1988 and 1990, respectively, and a Ph.D. degree in communication engineering from the Royal Melbourne Institute of Technology, Australia, in 1995. Since December 1994 he has been with the Department of Electronic Engineering, City University of Hong Kong, where he is currently an associate professor.