Received May 25, 2018, accepted July 18, 2018, date of publication July 23, 2018, date of current version August 15, 2018.


Digital Object Identifier 10.1109/ACCESS.2018.2858277

Effective Feature Extraction via Stacked Sparse Autoencoder to Improve Intrusion Detection System
BINGHAO YAN AND GUODONG HAN
National Digital Switching System Engineering & Technology Research Center, Zhengzhou 450001, China
Corresponding author: Binghao Yan ([email protected])
This work was supported in part by the National Science Technology Major Project of China under Grant 2016ZX01012101, in part by the
National Natural Science Foundation Project of China under Grant 61572520, and in part by the National Natural Science Foundation
Innovation Group Project of China under Grant 61521003.

ABSTRACT Classification features are crucial for an intrusion detection system (IDS), and the detection
performance of an IDS changes dramatically when it is given different input features. Moreover, the large
volume of network traffic and its high-dimensional features result in a very lengthy classification
process. Recently, there has been increasing interest in applying deep learning approaches to classification
and to learning feature representations. So, in this paper, we propose using the stacked sparse autoencoder
(SSAE), an instance of a deep learning strategy, to extract high-level feature representations of intrusive
behavior information. The original classification features are introduced into SSAE to learn the deep sparse
features automatically for the first time. Then, the low-dimensional sparse features are used to build different
basic classifiers. We compare SSAE with other feature extraction methods proposed by previous researchers.
The experimental results both in binary classification and multiclass classification indicate the following:
1) the low-dimensional sparse features learned by SSAE are more discriminative for intrusion behaviors
compared to previous methods and 2) the classification process of basic classifiers is significantly accelerated
by using low-dimensional sparse features. In summary, it is shown that the SSAE is a feasible and efficient
feature extraction method and provides a new research method for intrusion detection.

INDEX TERMS Intrusion detection, deep learning, machine learning, SSAE, feature extraction.

I. INTRODUCTION
With the advent of new Internet technologies such as file sharing, mobile payment and instant messaging, the network security situation is becoming ever more complicated. At the same time, network attackers are becoming harder to detect and the cost of mounting an attack keeps falling, both of which seriously threaten the network security environment. As an active defense technology, intrusion detection has gradually become a key technology for ensuring the security of network systems. An intrusion detection system (IDS) is a proactive network security protection system that, based on a given security policy, monitors the operation of the network system, discovers intrusion behaviors, attempts or results, and automatically responds in order to prevent illegal access or intrusion. An IDS usually relies on one of two kinds of processing methods: misuse detection and anomaly detection. A misuse detection system needs to accurately define the intrusive behavior patterns in advance, and an intrusion is detected if the attacker's attack pattern exactly matches a pattern in the library of the detection system. An anomaly detection system assumes that intrusion activity is unknown and is a subset of abnormal activity. When any behavior deviates from the normal behavior pattern to a certain extent, it is considered an intrusion event.
In order to improve the efficiency and performance of intrusion detection systems, in recent years scholars have used machine learning methods in the construction of network intrusion detection systems and have achieved breakthrough progress [1]. However, most machine learning algorithms achieve satisfactory results only on small-sample datasets. When these algorithms are actually used in large-scale intrusion detection systems, they usually face limitations of time complexity and space complexity. The essential reason is that the input data in the feature space have high

2169-3536 © 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. VOLUME 6, 2018
B. Yan, G. Han: Effective Feature Extraction via SSAE to Improve IDS

dimensional and nonlinear characteristics. Therefore, proposing or adopting more effective methods for dimension reduction of high-dimensional data is an indispensable step in the intrusion detection process.
In 2006, Hinton, a professor at the University of Toronto in Canada, published an article on deep learning in Science [2], setting off a wave of research on big data and artificial intelligence. One of the main messages of this article is that deep artificial neural networks (DANN) with many hidden layers have excellent feature learning capabilities, and that the learned features characterize the original data more essentially, which facilitates visualization or classification. Moreover, DANN reduce the huge workload of feature extraction and improve its efficiency.
Deep learning is a promising solution to the challenge of intrusion detection because of its outstanding performance in dealing with complex, large-scale data. In this paper, we therefore use the stacked autoencoder (SAE) model based on deep learning theory to perform unsupervised dimension reduction of intrusion detection samples, and we then add sparsity constraints to the SAE model to improve the generalization ability and classification accuracy of the model. Compared with existing methods, SSAE not only effectively compresses the feature dimension of the original data and accelerates the classification process of the classifier, but also extracts features markedly better than the known methods.
In the following sections, we review related work on network intrusion detection methods based on machine learning and deep learning (Section II), present our deep learning based approach to feature extraction (Section III), and describe our test bed, dataset and experimental preparation process (Section IV). Section V highlights SSAE with a discussion of the experimental results and a comparison with a few previous methods. Section VI presents a summary of conclusions and future work.

II. RELEVANT WORK
The initial intrusion detection methods include misuse detection [3] and anomaly detection [4]. Although these two detection methods have achieved good results, they still have inherent flaws: misuse detection cannot determine whether an unknown behavior is safe, anomaly detection suffers from a high false alarm rate, and the dismissal of alarms relies heavily on domain specialist knowledge. With the development of artificial intelligence theory, intrusion detection technology based on machine learning has made up for the defects of the original methods and become a new research hotspot. In [5], the authors combined an artificial neural network with a fuzzy clustering algorithm to solve the problem of the low detection rate of low-frequency attacks and to help the IDS achieve a higher detection rate, a lower false positive rate and stronger stability. Bamakan et al. [6] introduced time-varying chaos particle swarm optimization into IDS to adaptively select the parameters of the support vector machine (SVM), which has the characteristics of high detection rate and low false alarm rate. In addition, other representative machine learning algorithms such as decision trees [7], Bayesian classification [8], and K-nearest neighbors [9] are also used in the construction of IDS. Furthermore, in order to avoid the defects of a single classifier, the ideas of hybrid classifiers [10], [11] and ensemble classifiers [12] are also applied in IDS, and their classification efficiency is generally better than that of single-classifier models.
Although the above methods have strong adaptability and scalability, researchers believe that, in addition to the detection rate and the false alarm rate, an IDS also needs to meet system requirements for real-time capability and low power consumption. Ambusaidi et al. [13] combined flexible mutual information feature selection with least square SVM, which contributes to lower computational costs and faster detection speeds. Osanaiye et al. [14] combined the output of four filter methods into an ensemble-based multi-filter feature selection method to achieve an optimum selection and showed encouraging performance in DDoS detection. However, the above-mentioned feature selection algorithms only rank the manually designed features according to the detection targets of the IDS and remove the comparatively redundant ones, which results in the loss of part of the information.
The popularity of deep learning technology has further promoted the progress of intrusion detection systems. On the one hand, deep learning algorithms can uncover deeper relationships between input data, which cannot be achieved with traditional shallow machine learning algorithms. On the other hand, deep learning algorithms have more powerful feature extraction and representation capabilities while retaining as much data information as possible. The deep learning methods used in intrusion detection mainly include the Deep Convolutional Neural Network (DCNN) [15], the Recurrent Neural Network (RNN) [16], the Deep Belief Network (DBN) [17] and their improved models, and the main purpose of these models is likewise to improve the detection rate and reduce the false alarm rate. Deep learning models used in intrusion detection rarely focus on feature extraction. Therefore, in this paper, we use the stacked sparse autoencoder (SSAE) as the feature extraction method in IDS. Compared with previous works, we use the SSAE for feature extraction rather than for sample anomaly detection.

III. METHODOLOGY
A. AUTOENCODER
An autoencoder (AE) is an unsupervised three-layer neural network consisting of an input layer, a hidden layer, and an output layer (also referred to as the reconstruction layer). The typical structure of AE is shown in Figure 1, and the representation is shown in Figure 2.
The AE can gradually transform specific feature vectors into abstract feature vectors, and thus realizes the nonlinear transformation from a high-dimensional data space to a low-dimensional data space. The working process of the


autoencoder can be divided into two stages, encoding and decoding, which are defined as follows.
The encoding process from the input layer to the hidden layer:
$$H = g_{\theta_1}(X) = \sigma(W_{ij} X + \phi_1) \tag{1}$$
The decoding process from the hidden layer to the reconstruction layer:
$$Y = g_{\theta_2}(H) = \sigma(W_{jk} H + \phi_2) \tag{2}$$
In the above formulas, $X = (x_1, x_2, \ldots, x_n)$ is the input data vector, $Y = (y_1, y_2, \ldots, y_n)$ is the reconstruction vector of the input data and $H = (h_1, h_2, \ldots, h_m)$ is the low-dimensional vector output from the hidden layer, with $X \in \mathbb{R}^{n}$, $Y \in \mathbb{R}^{n}$, $H \in \mathbb{R}^{m}$ ($n$ is the dimension of the input vector and $m$ is the number of hidden units). $W_{ij} \in \mathbb{R}^{m \times n}$ is the weight connection matrix between the input layer and the hidden layer. $W_{jk} \in \mathbb{R}^{n \times m}$ is the weight connection matrix between the hidden layer and the output layer. In order to reconstruct the input data as accurately as possible while reducing the resource consumption during model training, the tied-weight constraint $W_{jk} = W_{ij}^{T}$ is usually used in the experiments. $\phi_1 \in \mathbb{R}^{m \times 1}$ and $\phi_2 \in \mathbb{R}^{n \times 1}$ are the bias vectors added when computing the hidden layer and the output layer, respectively; these dimensions follow from formulas (1) and (2). $g_{\theta_1}(\cdot)$ and $g_{\theta_2}(\cdot)$ are the activation functions of the hidden layer neurons and output layer neurons, respectively, whose role is to map the network summation result to the interval (0, 1). We use the sigmoid function as the activation function in this paper:
$$g_{\theta_1}(x) = g_{\theta_2}(x) = \frac{1}{1 + e^{-x}} \tag{3}$$

FIGURE 1. The structure of basic Autoencoder.
FIGURE 2. Autoencoder representation.

By adjusting the parameters of the encoder and the decoder, the error between the output reconstructed data and the original data can be minimized, which means that the AE learns to reconstruct the original data through training. We believe that the data output by the hidden layer units at this point is the optimal low-dimensional representation of the original data and includes all the information that exists in the original data. The reconstruction error function $J_E(W, \phi)$ between $X$ and $Y$ is the mean squared error shown in formula (4), where $N$ is the number of input samples.
$$J_E(W, \phi) = \frac{1}{2N} \sum_{r=1}^{N} \left\| Y^{(r)} - X^{(r)} \right\|^{2} \tag{4}$$

B. SPARSE AUTOENCODER
The idea of sparse coding was originally proposed by Olshausen [18] to simulate the computational learning of the receptive fields of simple cells in the mammalian primary visual cortex. A plain autoencoder has an unavoidable problem: for example, it may transmit the input data to the output layer by simple copying. Although the original input data can then be recovered perfectly, the autoencoder does not extract any meaningful features in this case. Therefore, Ng et al. [19] used the idea of sparse coding to introduce a sparse penalty term in the hidden layer of the autoencoder, so that under sparse constraints the autoencoder obtains more concise and efficient low-dimensional data features that better express the input data. Suppose that the average activation of neuron $j$ in the hidden layer is $\hat{\rho}_j$:
$$\hat{\rho}_j = \frac{1}{N} \sum_{i=1}^{N} \left[ n_j(x_i) \right]$$
We hope that the average activation $\hat{\rho}_j$ approaches a constant $\rho$ which is close to zero.
Therefore, the Kullback–Leibler (KL) divergence is added as a regularization term to the error function of the autoencoder to achieve the above purpose:
$$KL(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \tag{5}$$
At this point, the error function of the sparse autoencoder consists of two parts: the first term is the mean squared error term, and the second term is the regularization term. It is shown in formula (6), in which $b$ denotes the bias vectors written $\phi_1$ and $\phi_2$ above:
$$J_{sparse}(W, b) = J_E(W, b) + \mu \sum_{j=1}^{m} KL(\rho \,\|\, \hat{\rho}_j) \tag{6}$$
where $m$ is the number of hidden units and $\mu$ is a weighting factor that controls the strength of the sparsity term. Furthermore, in order to prevent overfitting, we also add a weight decay term to the error function, as shown in formula (7), where $\lambda$ is the weight decay coefficient.
$$J_{sparse}(W, b) = J_E(W, b) + \mu \sum_{j=1}^{m} KL(\rho \,\|\, \hat{\rho}_j) + \frac{\lambda}{2} \sum_{r=1}^{3} \sum_{i=1}^{m} \sum_{j=1}^{m+1} \left( w_{ij}^{r} \right)^{2} \tag{7}$$
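The encoder, the decoder and the sparse loss of formulas (1)-(7) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' TensorFlow implementation: the function names, the tiny example weights and the values of µ and λ are hypothetical, the weights are tied (W_jk = W_ij^T) as described above, and the weight decay term is summed over the single weight matrix of one autoencoder. Only ρ = 0.04 is taken from the paper.

```python
import math

def sigmoid(z):
    # Formula (3); toy scale only, no safeguards against overflow.
    return 1.0 / (1.0 + math.exp(-z))

def encode(W, phi1, x):
    # Formula (1): H = sigma(W x + phi1), with W of shape m x n.
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, phi1)]

def decode(W, phi2, h):
    # Formula (2) with tied weights: Y = sigma(W^T h + phi2).
    n = len(W[0])
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(W))) + phi2[i])
            for i in range(n)]

def kl_divergence(rho, rho_hat):
    # Formula (5): KL divergence between the target and the average activation.
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def sparse_autoencoder_loss(W, phi1, phi2, X, rho=0.04, mu=1.0, lam=1e-4):
    # Formula (7) for one autoencoder; mu and lam are hypothetical values.
    N = len(X)
    H = [encode(W, phi1, x) for x in X]
    Y = [decode(W, phi2, h) for h in H]
    # Formula (4): mean squared reconstruction error.
    mse = sum(sum((y - x) ** 2 for y, x in zip(y_row, x_row))
              for y_row, x_row in zip(Y, X)) / (2.0 * N)
    # Average activation of each hidden unit over the N samples.
    rho_hat = [sum(h[j] for h in H) / N for j in range(len(W))]
    sparsity = sum(kl_divergence(rho, r) for r in rho_hat)
    # Weight decay over the (single, tied) weight matrix.
    decay = 0.5 * lam * sum(w * w for row in W for w in row)
    return mse + mu * sparsity + decay

# Hypothetical toy example: n = 3 inputs, m = 2 hidden units, 2 samples.
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
X = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25]]
print(sparse_autoencoder_loss(W, [0.0, 0.0], [0.0, 0.0, 0.0], X))
```

With sigmoid units the average activations stay strictly between 0 and 1 at this scale, so the logarithms in the KL term are well defined; a practical implementation would still clip them for numerical safety.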


C. STACKED SPARSE AUTOENCODER
The stacked sparse autoencoder (SSAE) is a neural network composed of multiple sparse autoencoders connected end to end, with the structure shown in Figure 3. The output of each sparse autoencoder layer is used as the input of the next autoencoder layer, so that higher-level feature representations of the input data can be obtained.

FIGURE 3. The structure of stacked autoencoder model.

The greedy layer-wise pre-training method [20] is used to train each layer of the SSAE in sequence and so obtain optimized connection weights and bias values for the entire stacked sparse autoencoder network. Then the error back propagation method is used to fine-tune the SSAE until the value of the error function between the input data and the output data meets the expected requirements, which yields the optimal model parameters.
For the error function $J_{sparse}(W, b)$ defined in Section III-B:
$$\frac{\partial}{\partial w_{ij}^{r}} J_{sparse}(W, b) = \frac{1}{2 n_r} \sum_{n=1}^{n_r} \frac{\partial}{\partial w_{ij}^{r}} J_{sparse}(W, b, X(n), Y(n)) + \lambda w_{ij}^{r} \tag{8}$$
$$\frac{\partial}{\partial b^{r}} J_{sparse}(W, b) = \frac{1}{2 n_r} \sum_{n=1}^{n_r} \frac{\partial}{\partial b^{r}} J_{sparse}(W, b, X(n), Y(n)) \tag{9}$$
Therefore, the update process of the weights and biases is as follows:
$$w_{ij}^{r} = w_{ij}^{r} - \eta \frac{\partial}{\partial w_{ij}^{r}} J_{sparse}(W, b) \tag{10}$$
$$b^{r} = b^{r} - \eta \frac{\partial}{\partial b^{r}} J_{sparse}(W, b) \tag{11}$$
where $X(n)$ and $Y(n)$ denote the $n$th original vector and its corresponding reconstruction vector, respectively, and $\eta$ is the learning rate.
Considering that there are sparse constraints in the SSAE network, we want to use different learning rates for different parameters, for example reducing the update frequency for infrequent features. However, most traditional gradient descent algorithms, including stochastic gradient descent and mini-batch gradient descent, use the same learning rate for all network parameters that need to be updated, which makes it difficult to choose a suitable learning rate and makes it easy to become trapped in a local minimum [21]. Therefore, in order to train a better SSAE network model, we use the adaptive moment estimation (Adam) gradient descent algorithm proposed by Kingma and Ba [22] to achieve dynamic adaptive adjustment of different parameters.
The Adam algorithm implements dynamic adjustment of different parameters by calculating the first-order moment estimate $m_t$ and the second-order moment estimate $v_t$ of the gradient, as shown in formulas (12)-(14), where $\beta_1$ and $\beta_2$ are the exponential decay rates of the first-order and second-order moment estimates, respectively, and $g_t$ is the gradient of the loss function $J_{sparse}(W, b)$ with respect to the parameters at timestep $t$.
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \cdot g_t \tag{12}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \cdot g_t^{2} \tag{13}$$
$$g_t \leftarrow \nabla_{\theta} J_t(\theta_{t-1}) \tag{14}$$
Compute the bias-corrected estimates of $m_t$ and $v_t$:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}} \tag{15}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}} \tag{16}$$
Update the parameters:
$$\theta_{t+1} = \theta_t - \frac{\gamma}{\sqrt{\hat{v}_t} + \xi} \cdot \hat{m}_t \tag{17}$$
where $\gamma$ is the update step size and $\xi$ is a small constant that prevents the denominator from being zero.

IV. EXPERIMENTAL SETUP
A. THE FRAMEWORK
The framework of the SSAE-based intrusion detection system is shown in Figure 4. First, the original input dataset is preprocessed so that the data can be used for training and testing of the SSAE network. Then, the processed dataset is divided into two parts: a training set and a testing set. The training set is used for the pre-training and fine-tuning of the model, while the testing set is fed into the optimal model to obtain the low-dimensional representation dataset. Finally, the classifiers are trained on the low-dimensional dataset to test the effect of low-dimensional data on the performance of the classifier, thereby validating the effectiveness of the SSAE model.

B. DATASET
There are few public data sets for intrusion detection; the main ones are the KDD99 [23] data set, the NSL-KDD [24] data


set and the Kyoto2006 [25] data set. Considering the advantages and disadvantages of the existing datasets comprehensively, and in order to make an effective and fair comparison with the existing intrusion detection models and methods, we have selected the NSL-KDD dataset to evaluate the performance of our SSAE model in intrusion detection.

FIGURE 4. Flowchart of the proposed method.

TABLE 1. Classes and numbers of attacks of NSL-KDD dataset.

The NSL-KDD dataset is an improved version of the KDD 99 dataset that eliminates the redundant data contained in KDD 99. It contains a total of 125973 training samples and 22543 test samples, covering four attack types: Denial of Service attacks (DoS), Probing attacks (Probe), Remote to Local attacks (R2L) and User to Root attacks (U2R); the specific distribution is shown in Table 1. In addition, in order to simplify the experimental process and ease the computational pressure, we randomly sample the original data in NSL-KDD and reassemble the samples into several independent datasets, including training, validation and testing datasets, as shown in Table 2. The training datasets are used to train the model. The validation datasets are used to fine-tune the model and adjust the parameters of the neural network and classifier. The test datasets are used to test the detection ability of the trained model. In the experimental process, the experiment on each of datasets No.1-5 is repeated several times independently, and cross-validation is performed between the datasets. The final results of each experiment are averaged to ensure that they are unbiased.

TABLE 2. Number of training, validation and testing samples in new datasets.

C. DATA PREPROCESSING
The NSL-KDD dataset contains 41 classification features, which are divided into symbolic features, 0-1-type features and percentage-type features. Among them, the value of Num_outbound_cmds is always 0, so it has no effect on the classification process and this feature is removed. Besides, since the input of the SSAE network is a numeric matrix, we need to transform the symbolic features into numerical features. In addition, in order to facilitate the comparison, the original feature values are subjected to max-min normalization so that the feature values are in the same order of magnitude.

1) NUMERALIZATION
We use one-hot encoding to perform the numeralization. The symbolic features of the NSL-KDD dataset are "Protocol_type", "Service" and "Flag", where "Protocol_type" has three different symbolic values, "Service" has 70 and "Flag" has 11. Therefore, after numeralization, the feature dimension of the NSL-KDD dataset is extended to 121.

2) NORMALIZATION
To facilitate the comparison of the results, the max-min normalization method shown in formula (18) is used to normalize the feature values in the NSL-KDD dataset, where $x_{\max}$ and $x_{\min}$ represent the maximum and minimum values of the original feature, respectively, $x$ denotes the original feature value and $x_{norm}$ denotes the normalized feature value.
$$x_{norm} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{18}$$


D. EVALUATION
We use metrics based on the confusion matrix to measure the experimental results of this paper. The definition of the confusion matrix is shown in Table 3. In the table, TP (True Positive) indicates the number of attack records correctly identified as attacks, TN (True Negative) the number of normal records correctly identified as normal, FP (False Positive) the number of normal records incorrectly identified as attacks, and FN (False Negative) the number of attack records incorrectly identified as normal.

TABLE 3. Confusion matrix.

The metrics used in this paper mainly include ACC (accuracy), DR (detection rate), and FAR (false alarm rate), calculated as shown in formulas (19)-(21). In addition, the model training and testing times are denoted by $T_{train}$ and $T_{test}$, respectively.
$$ACC = \frac{TP + TN}{TP + FP + TN + FN} \tag{19}$$
$$DR = \frac{TP}{TP + FN} \tag{20}$$
$$FAR = \frac{FP}{FP + TN} \tag{21}$$

E. MODEL PARAMETERS
We use TensorFlow, one of the most popular machine learning frameworks, to conduct the experimental simulation in this paper, and Python is selected as the programming language. The hardware environment is a desktop with an Intel Core i7-7700 quad-core processor, 16G RAM, 256G SSD, and the 64-bit Windows 10 operating system. We also use a GeForce GTX1060 graphics processing unit to speed up the training and testing of the model.
According to Section IV-C, after the samples in the NSL-KDD dataset are preprocessed, the features are extended from 41 dimensions to 121 dimensions. Therefore, we set the number of input layer neurons of the SSAE to 121. In addition, experiments show that the four-layer hidden structure of the SSAE network is the optimal experimental model (Section V-A), and the best sparse parameter is 0.04 (Section V-B). The parameters of the Adam algorithm adopt the default values recommended by its authors. Table 4 shows the specific experimental model parameters.

TABLE 4. The experimental parameters of SSAE and Adam.

V. EXPERIMENTAL RESULTS AND DISCUSSION
Here, we evaluate the performance of the proposed SSAE model by performing a variety of experiments on the NSL-KDD dataset, including five-category classification (Normal, DoS, Probe, R2L, U2R) and binary classification (Normal, Anomaly). More specifically, the experiments in this paper aim to achieve the following:
a. Evaluate the impact of network structure on the performance of the SSAE.
b. Evaluate the impact of the sparse parameter on the performance of the SSAE.
c. Evaluate the impact of the low-dimensional features extracted by the SSAE on classifiers.
d. Compare SSAE with other state-of-the-art feature selection and feature compression methods.
e. Compare SSAE with other state-of-the-art shallow learning and deep learning models.

A. IMPACT OF THE DIFFERENT NETWORK STRUCTURE
In a deep neural network, there is no fixed method to determine the number of hidden layers and the number of neurons in each layer. We need to set different network structures according to different experimental backgrounds. If the number of hidden layers is small and the number of neurons in each layer is insufficient, the model may be unable to fit the distribution of the data effectively; for the SSAE network, this means that the high-dimensional features cannot be effectively compressed. Conversely, if the number of hidden layers is too large and the number of neurons in each layer is excessive, the training process becomes extremely complex, which greatly increases the training time and the consumption of computing resources and may also cause the model to overfit.
Table 5 shows the effect on the performance of the classifier when SSAE takes different numbers of hidden layers on the validation datasets, including the mean and standard deviation of each indicator. In order to achieve the purpose of


controlling variables, we connect the same Softmax classifier to the output layer of SSAE with different structures, and the remaining algorithm parameters in the experiment are kept the same. With these settings, as the number of hidden layers increases, the classification results of the classifier become more and more satisfactory. However, a major drawback of a multilayer network structure is that it requires more time to train, which makes it infeasible in a real network environment. For example, in our experiments, although the detection accuracy of the five-layer hidden structure is better than that of the four-layer hidden structure, the training time of the five-layer structure is almost double that of the four-layer structure. Further, when the model is in a large data environment, the training time will increase explosively. Therefore, after a comprehensive comparison, the four-layer hidden structure is the best network structure for our experiment.

TABLE 5. The ACC, DR, FAR and training time of SSAE with different hidden layer structure.

TABLE 6. The ACC, DR and FAR of SSAE with different hidden layers neurons when the number of hidden layers is the same.

Table 6 shows the effect of different numbers of hidden neurons on the classification results when the number of hidden layers is the same. It can be concluded that the performance of the SSAE with the same number of neurons in each layer is worse than that when the number of neurons in each layer is different. It can also be observed that better classification results are obtained when the number of neurons in the hidden layers decreases layer by layer. Further, the experimental results show that the number of neurons in a hidden layer should be less than that in the input layer, which not only reduces the training time of the model, but is also more conducive to the compression of the original features.
Another important study in this paper is to choose the smallest feature dimension without losing the information between original samples. Since the compressed feature dimension is equal to the number of output layer neurons, we vary the number of neurons in the output layer of the SSAE network to observe how it affects the performance of intrusion detection. The [100, 85, 55, 30] four-hidden-layer model with the optimal results in Table 6 is used to perform the experiments, while the other parameters remain unchanged and the number of output layer neurons changes from 1 to 10. Figure 5 shows the experimental results, from which we can see that the metrics yield the best results when the number of output layer neurons is 5.

B. IMPACT OF THE DIFFERENT SPARSE PARAMETER
Bengio [26] conducted a large number of experiments and concluded that a good experimental effect can be achieved when the sparse parameter is chosen to be 0.05 in deep architectures, and much of the existing literature uses this value. However, we think this value may not apply to our model because the sparse parameter is related to the dimension of the bottleneck code. Figure 6 shows how the classification results of the SSAE with the four-hidden-layer model [100, 85, 55, 30] are affected as the sparse parameter varies from 0.01 to 0.1. It can be seen intuitively that when the value of the sparse parameter is 0.04, the overall performance of the SSAE is the best. Below this value, there is a deficit in the number of


neurons used for training, so the detection performance trends downward; above this value, the input features are not effectively compressed, so the training time of the model becomes long.

FIGURE 5. Experimental results as the number of output layer neurons changes from 1 to 10. (a) ACC. (b) DR. (c) FAR.

FIGURE 6. Experimental results as the sparse parameter changes from 0.01 to 0.1. (a) ACC and DR. (b) FAR. (c) Training time.

TABLE 7. Effect of low-dimensional features extracted by the SSAE on the performance of the base classifiers for overall classification.

C. IMPACT OF THE LOW-DIMENSIONAL FEATURES ON CLASSIFIERS
To verify the effectiveness of the low-dimensional features extracted by the SSAE, we used three base classifiers, namely support vector machine (SVM), K-nearest neighbor (KNN), and random forests (RF), to conduct comparative experiments. The parameters of the SSAE are the optimal parameters determined in the preceding experiments. Table 7 and Table 8 summarize the experimental results for the overall classification and the five-category classification, respectively. It can be seen that the low-dimensional features compressed by the SSAE not only have no negative effect on the performance of the classifiers, but also significantly reduce their training and testing time. The results also clearly demonstrate that the SSAE retains almost all of the information contained in the original data while learning the high-level representation of features.
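As a rough illustration of how a base classifier on compressed features might be scored with ACC, DR and FAR from formulas (19)-(21), consider the following plain-Python sketch. Everything in it is an assumption for illustration: the hand-written brute-force k-NN stands in for the classifiers used in the paper, the two-dimensional toy points stand in for SSAE codes, and label 1 (attack) is treated as the positive class, which is the convention that DR and FAR imply.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Brute-force k-NN: each query costs O(N * d) distance work, so shorter
    # feature vectors (smaller d) make every distance cheaper to compute.
    dists = sorted((math.dist(x, t), label) for t, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def confusion(y_true, y_pred):
    # Positive class = attack (label 1), negative class = normal (label 0).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + fp + tn + fn)   # formula (19)
    dr = tp / (tp + fn)                     # formula (20)
    far = fp / (fp + tn)                    # formula (21)
    return acc, dr, far

# Hypothetical toy data: 2-D points standing in for low-dimensional codes.
train_X = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.3), (0.9, 0.8), (0.8, 0.9), (0.9, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(0.15, 0.2), (0.85, 0.85), (0.3, 0.2), (0.7, 0.9)]
test_y = [0, 1, 0, 1]

pred = [knn_predict(train_X, train_y, x) for x in test_X]
print(metrics(*confusion(test_y, pred)))
```

The toy run says nothing about the results in Tables 7 and 8; it only shows how the three metrics are derived from the four confusion-matrix counts.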


TABLE 8. Effect of low-dimensional features extracted by the SSAE on the performance of the base classifiers for five-category classification.

TABLE 9. Performance comparison obtained by the proposed and other approaches for overall classification.

D. COMPARISON OF DIFFERENT ALGORITHMS
To evaluate the performance of the proposed method, the optimal SSAE+SVM model is selected for comparative experiments against conventional and topical algorithms in three categories: feature selection methods, shallow learning models, and deep learning models. The overall ACC, DR, FAR, and the size of the training dataset are listed in Table 9, while Table 10 shows the detection rate for the five categories. According to the experimental results shown in these tables, the performance of SSAE+SVM is very close to or better than that of other state-of-the-art approaches in terms of overall accuracy and detection rate. Moreover, its false alarm rate is inferior only to that of HAST-IDS [15] and LSTM [16], and the gap remains within 0.05%. This indicates that the method proposed in this paper has reached or exceeded the average detection level of other state-of-the-art methods and models. The strength of our method is that it achieves the above experimental results with only 74,487 training samples, far fewer than the number of training samples required by the other methods except the RNN-IDS model. However, the RNN-IDS is inferior to our method in FAR and DR. Fewer training samples mean that, under the same experimental conditions, our method trains the model faster, which helps meet the system's real-time requirements.

On the other hand, in the multi-classification experiments, the proposed method does not achieve satisfactory detection rates for the two low-frequency attack types, R2L and U2R: only 84.43% and 67.94%, respectively. DBN4 + LR [17] and NBC-A [8] achieve the best performance in the detection of R2L and U2R, respectively. Furthermore, the detection rate of NBC-A exceeds 97.5% for all five types of samples.
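For readers who want to reproduce this kind of comparison, the sketch below computes ACC, DR, and FAR for the overall (normal versus attack) case and a per-class detection rate for the multi-class case. It assumes the usual confusion-matrix definitions (ACC = (TP+TN)/(TP+TN+FP+FN), DR = TP/(TP+FN), FAR = FP/(FP+TN)) and treats per-class DR as the fraction of samples of a class that are assigned to that class. The function names and the ten toy labels are hypothetical and are not taken from the paper's tables.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, normal="normal"):
    """ACC, DR and FAR, treating every label other than `normal` as an attack."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        t_attack, p_attack = t != normal, p != normal
        if t_attack and p_attack:
            tp += 1          # attack flagged as (some) attack
        elif not t_attack and not p_attack:
            tn += 1          # normal traffic passed as normal
        elif p_attack:
            fp += 1          # normal traffic flagged as attack
        else:
            fn += 1          # attack missed
    total = tp + tn + fp + fn
    return {
        "ACC": (tp + tn) / total,
        "DR": tp / (tp + fn) if tp + fn else 0.0,
        "FAR": fp / (fp + tn) if fp + tn else 0.0,
    }


def per_class_detection_rate(y_true, y_pred):
    """Fraction of each true class that is predicted as exactly that class."""
    totals, hits = Counter(y_true), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy labels; rare classes (r2l, u2r) have a single sample each.
y_true = ["normal", "normal", "normal", "normal",
          "dos", "dos", "dos", "probe", "r2l", "u2r"]
y_pred = ["normal", "normal", "normal", "dos",
          "dos", "dos", "normal", "probe", "normal", "u2r"]

print(binary_metrics(y_true, y_pred))
print(per_class_detection_rate(y_true, y_pred))
```

With so few samples in a rare class, a single miss moves its detection rate by a large amount, which is one reason per-class rates are reported alongside the overall figures.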


TABLE 10. Comparisons of DR obtained by the proposed and other approaches for five-category classification.

Further study of the existing literature suggests that the main reason for this result is that the R2L and U2R attack samples contained in the training set used in our experiments are scarce and their classification features are insufficient, so the SVM classifier fails to learn the sample features effectively. Moreover, conventional machine learning classifiers tend to be biased towards the majority samples, because they do not consider the distribution of the data in the optimization of the loss function. Worse still, in this case, low-frequency attack samples may be ignored as noise points or outliers of the majority class [31]. Therefore, we believe that all of the above reasons lead to the low detection rate of R2L and U2R in our experiments.

VI. CONCLUSION AND PROSPECT
Conventional shallow machine learning methods cannot effectively deal with high-dimensional feature classification tasks on large data volumes, so intrusion detection systems based on these methods cannot meet real-time requirements. Therefore, this paper proposes a novel feature extraction method based on deep learning theory, which uses a stacked sparse autoencoder to realize the nonlinear mapping of high-dimensional features to low-dimensional features. Then, the new dataset containing the optimal low-dimensional features is used to train the classifiers and test their performance. The experimental results on the NSL-KDD dataset show that the SSAE with the optimal structure can compress the original features to 5 dimensions without losing the information contained in the original data. In addition, comparison with the methods proposed in the existing literature shows that the SSAE can reach or even exceed the average detection level of conventional machine learning classifiers and other state-of-the-art approaches in terms of overall detection performance, with relatively low resource consumption.

The disadvantage of the proposed method is that it cannot effectively detect the low-frequency R2L and U2R attack samples; that is, it cannot overcome the adverse effects of imbalanced data distribution. In future research, how to use existing methods or propose new ones to handle imbalanced data in the feature extraction process deserves further attention. Besides, it would be very interesting to find out the patterns of the features learned by the SSAE, which could help with manual feature engineering.

REFERENCES
[1] G. Kumar, K. Kumar, and M. Sachdeva, “The use of artificial intelligence based techniques for intrusion detection: A review,” Artif. Intell. Rev., vol. 34, no. 4, pp. 369–387, 2010.
[2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
[3] R. F. Erbacher, K. L. Walker, and D. A. Frincke, “Intrusion and misuse detection in large-scale systems,” IEEE Comput. Graph. Appl., vol. 22, no. 1, pp. 38–47, Jan. 2002.
[4] W. Lee and D. Xiang, “Information-theoretic measures for anomaly detection,” in Proc. IEEE Symp. Secur. Privacy, Oakland, CA, USA, May 2001, pp. 130–143.
[5] G. Wang, J. Hao, J. Ma, and L. Huang, “A new approach to intrusion detection using artificial neural networks and fuzzy clustering,” Expert Syst. Appl., vol. 37, no. 9, pp. 6225–6232, 2010.
[6] S. M. H. Bamakan, H. Wang, T. Yingjie, and Y. Shi, “An effective intrusion detection framework based on MCLP/SVM optimized by time-varying chaos particle swarm optimization,” Neurocomputing, vol. 199, pp. 90–102, Jul. 2016.
[7] D. Moon, M. Im, I. Kim, and J. H. Park, “DTB-IDS: An intrusion detection system based on decision tree using behavior analysis for preventing APT attacks,” J. Supercomputing, vol. 73, no. 7, pp. 2881–2895, 2017.
[8] Y. Wang et al., “A novel intrusion detection system based on advanced naive Bayesian classification,” in Proc. Int. Conf. 5G Fut. Wireless Netw. (5GWN), Beijing, China, Dec. 2017, pp. 581–588.


[9] W.-C. Lin, S.-W. Ke, and C.-F. Tsai, “CANN: An intrusion detection system based on combining cluster centers and nearest neighbors,” Knowl.-Based Syst., vol. 78, pp. 13–21, Apr. 2015.
[10] B. M. Aslahi-Shahri et al., “A hybrid method consisting of GA and SVM for intrusion detection system,” Neural Comput. Appl., vol. 27, no. 6, pp. 1669–1676, Aug. 2016.
[11] C. Guo, Y. Ping, N. Liu, and S.-S. Luo, “A two-level hybrid approach for intrusion detection,” Neurocomputing, vol. 214, pp. 391–400, Nov. 2016.
[12] A. A. Aburomman and M. B. I. Reaz, “A novel SVM-kNN-PSO ensemble method for intrusion detection system,” Appl. Soft Comput., vol. 38, pp. 360–372, Jan. 2016.
[13] M. A. Ambusaidi, X. He, P. Nanda, and Z. Tan, “Building an intrusion detection system using a filter-based feature selection algorithm,” IEEE Trans. Comput., vol. 65, no. 10, pp. 2986–2998, Oct. 2016.
[14] O. Osanaiye, H. Cai, K.-K. R. Choo, A. Dehghantanha, Z. Xu, and M. Dlodlo, “Ensemble-based multi-filter feature selection method for DDoS detection in cloud computing,” Eur. J. Wireless Commun. Netw., vol. 2016, pp. 130–140, Dec. 2016.
[15] W. Wang et al., “HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection,” IEEE Access, vol. 6, no. 1, pp. 1792–1806, Dec. 2017.
[16] R. C. Staudemeyer, “Applying long short-term memory recurrent neural networks to intrusion detection,” South Afr. Comput. J., vol. 56, pp. 136–154, Jul. 2015.
[17] K. Alrawashdeh and C. Purdy, “Toward an online anomaly intrusion detection system based on deep learning,” in Proc. 15th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Anaheim, CA, USA, Dec. 2016, pp. 195–200.
[18] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996.
[19] A. Ng, “Sparse autoencoder,” CS294A Lect. Notes, vol. 72, pp. 1–19, 2011.
[20] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Proc. 21st Int. Conf. Neural Inform. Process. Syst. (NIPS), Vancouver, BC, Canada, Jan. 2007, pp. 153–160.
[21] T. T. H. Le, J. Kim, and H. Kim, “An effective intrusion detection classifier using long short-term memory with gradient descent optimization,” in Proc. IEEE Int. Conf. Plat. Technol. Service (PlatCon), Busan, South Korea, Feb. 2017, pp. 1–6.
[22] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, Dec. 2015, pp. 1–13.
[23] S. Rosset and A. Inger, “KDD-CUP 99: Knowledge discovery in a charitable organization’s donor database,” ACM SIGKDD Explor. Newslett., vol. 1, no. 2, pp. 85–90, 2000.
[24] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD CUP 99 data set,” in Proc. IEEE Symp. Comput. Intell. Secur. Defense Appl. (CISDA), Ottawa, ON, Canada, Dec. 2009, pp. 1–6.
[25] J. Song, H. Takakura, Y. Okabe, M. Eto, D. Inoue, and K. Nakao, “Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation,” in Proc. 1st Workshop Building Anal. Datasets Gathering Exper. Returns Secur., Salzburg, Austria, Apr. 2011, pp. 29–36.
[26] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks (Lecture Notes in Computer Science). Berlin, Germany: Springer, 2012, pp. 437–478.
[27] H. H. Pajouh, R. Javidan, R. Khayami, D. Ali, and K.-K. R. Choo, “A two-layer dimension reduction and two-tier classification model for anomaly-based intrusion detection in IoT backbone networks,” IEEE Trans. Emerg. Topics Comput., Nov. 2016, doi: 10.1109/TETC.2016.2633228.
[28] W. L. Al-Yaseen, Z. A. Othman, and M. Z. A. Nazri, “Multi-level hybrid support vector machine and extreme learning machine based on modified K-means for intrusion detection system,” Expert Syst. Appl., vol. 67, pp. 296–303, Jan. 2017.
[29] D. Papamartzivanos, F. G. Mármol, and G. Kambourakis, “Dendron: Genetic trees driven rule induction for network intrusion detection systems,” Future Gener. Comput. Syst., vol. 79, pp. 558–574, Feb. 2018.
[30] C. Yin, Y. Zhu, J. Fei, and X. He, “A deep learning approach for intrusion detection using recurrent neural networks,” IEEE Access, vol. 5, no. 99, pp. 21954–21961, Oct. 2017.
[31] T. Fawcett. (Aug. 2016). Learning From Imbalanced Classes. [Online]. Available: https://www.svds.com/learning-imbalanced-classes

BINGHAO YAN was born in 1993. He received the B.S. degree from the University of Electronic Science and Technology of China, in 2015, and the M.S. degree from PLA Information Engineering University, Zhengzhou, China, in 2017. He is currently pursuing the Ph.D. degree with the National Digital Switching System Engineering & Technology Research Center, Zhengzhou. His research areas are intrusion detection, information security, machine learning, and deep learning.

GUODONG HAN was born in 1964. He received the B.S. and M.S. degrees from PLA Information Engineering University, Zhengzhou, China, in 1986 and 1990, respectively, and the Ph.D. degree from the National Digital Switching System Engineering & Technology Research Center, Zhengzhou. He is currently an Associate Professor with the Provincial Key Laboratory of System on a Chip, National Digital Switching System Engineering & Technology Research Center. He has authored over 40 journal and conference publications. His research areas are network security, signal processing, broadband network, and deep learning.
