
Multi-Dimensional Feature Fusion and Stacking Ensemble Mechanism

This paper proposes a multi-dimensional feature fusion and stacking ensemble mechanism (MFFSEM) for effective network intrusion detection. It establishes multiple basic feature datasets considering different aspects of network traffic information like time, space, and load. It then sets up multiple comprehensive feature datasets by exploring the association between the basic feature datasets. Finally, it conducts stacking ensemble learning on the comprehensive feature datasets to build an effective multi-dimensional global anomaly detection model. Experiments on four datasets show that MFFSEM outperforms basic classifiers and other ensemble approaches.


Future Generation Computer Systems 122 (2021) 130–143


Multi-dimensional feature fusion and stacking ensemble mechanism for network intrusion detection

Hao Zhang, Jie-Ling Li, Xi-Meng Liu, Chen Dong
College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China
Fujian Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116, China

Article info

Article history:
Received 16 November 2020
Received in revised form 19 March 2021
Accepted 31 March 2021
Available online 6 April 2021

Keywords:
Network intrusion detection
Multi-dimensional data
Feature fusion
Stacking ensemble learning

Abstract

A robust network intrusion detection system (NIDS) plays an important role in cyberspace security by protecting confidential systems from potential threats. In real-world networks, complex correlations exist among the various types of network traffic information; these correlations may be attributed to different abnormal behaviors and should be fully utilized in a NIDS. Given such complex network traffic information, traditional learning-based abnormal behavior detection methods can hardly meet the requirements of real-world network environments. Existing methods have not taken into account the impact of the various modalities of data or the mutual support among different data features. To address these concerns, this paper proposes a multi-dimensional feature fusion and stacking ensemble mechanism (MFFSEM), which can detect abnormal behaviors effectively. To accurately explore the connotation of traffic information, multiple basic feature datasets are established considering different aspects of traffic information such as time, space, and load. Then, considering the association and correlation among the basic feature datasets, multiple comprehensive feature datasets are set up to meet the requirements of real-world abnormal behavior detection. Specifically, stacking ensemble learning is conducted on the multiple comprehensive feature datasets, yielding an effective multi-dimensional global anomaly detection model. Experimental results on the KDD Cup 99, NSL-KDD, UNSW-NB15, and CIC-IDS2017 datasets show that MFFSEM significantly outperforms the basic and meta classifiers adopted in our method. Furthermore, its detection performance is superior to that of other well-known ensemble approaches.

© 2021 Elsevier B.V. All rights reserved.

∗ Corresponding author at: College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China.
E-mail address: [email protected] (C. Dong).
https://doi.org/10.1016/j.future.2021.03.024
0167-739X/© 2021 Elsevier B.V. All rights reserved.

1. Introduction

The wide application of information technology and the rapid development of cyberspace have enriched leisure life and broadened horizons, but the large amount of network traffic information that stems from people's heavy reliance on cyberspace brings security and management issues. Many cyberspace security problems have appeared, bringing many potential threats to our online life. In particular, attackers usually exploit the vulnerabilities of popular software to attack computer systems on the network [1]. The damage caused by these attacks may lead to serious problems such as service interruption or even large financial losses.

A network intrusion detection system (NIDS) is considered to be the most effective technology for detecting attacks and guaranteeing network security; it carefully examines every data packet passing through the network and safeguards against intrusive data packets [2]. Traditional intrusion detection methods usually estimate whether a network connection is in a normal state by evaluating existing network behavior patterns or rules. However, with the evolution of the Internet, a massive amount of high-dimensional and complex data is being aggregated. As a result, traditional network intrusion detection methods have become insufficient [3].

Learning algorithms have been applied to building NIDS, such as Random Forest (RF) [4,5], Decision Tree (DT) [6,7], Naive Bayes (NB) [8], Support Vector Machine (SVM) [9,10], and Genetic Algorithm [11,12]. Alsaeedi et al. [13] compared the performance of DT, NB, and SVM in detecting network attacks. Their experimental results show that RF and DT are both effective in detecting attacks. However, a traditional classifier often cannot meet the requirements of high classification accuracy and high generalization ability. Deep learning has also been successfully applied in the field of NIDS, for example in neural-network-based methods [14]. Fernandez et al. [15] showed that deep neural network (DNN) based methods outperform previous machine-learning-based intrusion detection systems and demonstrated robustness in the presence of dynamic IP addresses. Nonetheless, existing deep learning algorithms often require high computation costs. Compared with existing methods relying on a single classifier, ensemble learning can combine several models to jointly solve a particular problem, which demonstrates excellent performance and robustness while adding only limited computation overhead [16].

The cyberspace data stream contains a large amount of information from the perspectives of time, space, load, and statistics [17]. It may contain incomplete or redundant information, which may influence the data analysis process and results. In particular, anomalies may exist in only a few related feature dimensions, so noisy data may severely confuse the classification models. Besides, the optimal feature representation for anomaly detection differs across tasks. Therefore, an anomaly detection model based on the full set of data features cannot capture the anomaly information hidden in local data features. Data analysis from different dimensions or perspectives can provide different contributions and support for anomaly detection and analysis results. This often implies a mutually supportive and complementary relationship among the basic features of different dimensions. Anomalies are usually manifested in multiple basic feature data, so a comprehensive feature dataset needs to be constructed.

In addition, real-world network data streams can be diverse and dynamically changing. A fixed feature combination alone cannot meet the needs of a high-velocity environment, so grouped combinations can be used to cover them all. Moreover, the learning result of a single comprehensive feature dataset is one-sided, and anomalies cannot be identified from a global perspective.

Therefore, motivated by the aforementioned limitations and weaknesses, this paper proposes a stacking ensemble learning method based on multi-dimensional feature fusion to improve generalization ability and accuracy. Firstly, we divide cyberspace traffic data into basic feature datasets of multiple dimensions according to their characteristics, which meets the data requirements for detecting associated abnormal data streams. Secondly, combining the association and correlation among the basic feature datasets, we propose a construction strategy and implementation method for comprehensive feature datasets. Finally, we adopt the stacking ensemble learning strategy on multiple comprehensive feature datasets and realize a global anomaly detection mechanism in a multi-dimensional space. To evaluate its effectiveness, several experiments are conducted on four public intrusion detection datasets, namely KDD Cup 99, NSL-KDD, UNSW-NB15, and CIC-IDS2017. Experimental results indicate that our model can increase detection accuracy on these four datasets. The main contributions of this paper are summarized as follows.

• A novel fusion of data characteristics using permutation and combination is proposed based on the multi-dimensional characteristics of cyberspace traffic data, which can highlight the mutual support or complementary relationship between different features and increase the difference among basic classifiers.
• A comprehensive detection strategy of stacking ensemble learning is proposed based on basic classifiers from different perspectives of the multi-dimensional space, which can detect abnormal behaviors robustly and identify the intentions of attackers.
• To the best of our knowledge, this is the first attempt to apply homogeneous stacking ensemble learning based on different feature categories to network intrusion detection, which brings strong versatility.

The remainder of this article is structured as follows. Section 2 reviews the related works and Section 3 formulates the problem. Section 4 describes our proposed model. Section 5 presents the experimental setup. Section 6 discusses the experimental results.

2. Background and related works

As an important tool for ensuring network security, NIDS has continuously attracted the attention of the research community. There are many popular monitoring technologies, and the academic community generally divides them into three categories: misuse detection [18], anomaly detection [19–21], and hybrid detection [22]. Misuse detection is powerless against new types of attacks. Although anomaly-based detection methods can prevent new types of attacks, they suffer from high false alarm rates. Hybrid detection methods combine the advantages of misuse-based and anomaly-based methods and achieve outstanding performance. In this paper, we focus on related work on anomaly detection methods that use the strategies of feature fusion and stacking ensemble learning.

2.1. Feature fusion

A growing number of researchers believe that more and better reference data can be obtained through feature fusion and association, which greatly improves the ability to identify malicious network behaviors [23]. Wei et al. [24] used a convolutional neural network (CNN) to learn the spatial features of network packets and then adopted Long Short-Term Memory (LSTM) to learn the temporal features among multiple network packets. Li et al. [25] proposed a deep learning method for NIDS using a multi-convolutional neural network (multi-CNN). They divided the feature data into four parts according to the correlation and used the same CNN structure for different parts of the dataset. However, one basic feature cannot reflect the traffic characteristics well, and the detection effect of one feature subset may be so poor that it affects other feature subsets. Necati et al. [26] employed RF for feature selection and combined multiple feature subsets to generate 100 models. These 100 models were reduced to 10 models according to three evaluation criteria: accuracy, information gain, and recall. The optimized feature model is used as the input of the second layer. Their study achieves good detection results, but the intermediate feature selection costs a lot of training time. Therefore, current feature fusion technology in network anomaly detection still has some limitations.

2.2. Stacking ensemble learning

Stacking ensemble learning is a learning method that integrates multiple basic classifiers through a meta-model. The ensemble process is performed using a dependent framework, i.e., the construction of the meta-model depends on the output of the preceding basic classifiers [27]. The study in [28] shows that boosting and stacking perform better than bagging, rotation forest, and voting schemes. In particular, stacking achieves the best results in terms of performance metrics such as accuracy, recall, and precision. Unlike bagging and boosting, stacking is often used to combine different classifiers. Ensembles that use different basic learning techniques are called heterogeneous ensemble classifiers [16]; their purpose is to ensure the diversity of basic classifiers.

Table 1 shows the different algorithm combinations that use stacking ensemble learning to detect network intrusions. Subudhi et al. [29] found that using a Radial Basis Function Network (RBFN) as a meta-classifier can detect intrusions more effectively while keeping the false alarm rate to a minimum. The algorithm combination of [30] has a better predictive effect on real-time datasets than on simulated datasets. A similar method was also advocated by the study in [31], which can greatly reduce the false positive rate and obtain a higher detection rate while maintaining computational efficiency.
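As an illustration of the dependent framework described in Section 2.2, the following minimal sketch uses scikit-learn's `StackingClassifier` on synthetic data. The choice of basic classifiers (decision tree, naive Bayes, k-nearest neighbors), the logistic-regression meta-model, and the synthetic dataset are assumptions made for this example only; they do not reproduce any of the schemes listed in Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for labelled traffic records (assumption).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Heterogeneous basic classifiers (illustrative choice).
base_learners = [
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
]

# The meta-model is trained on cross-validated class probabilities
# produced by the basic classifiers, not on the raw features.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print(f"stacked accuracy: {accuracy:.3f}")
```

Because the meta-model only sees the outputs of the basic classifiers, its quality depends on how different those outputs are, which is why the diversity of basic classifiers matters in the schemes discussed here.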
Dutta et al. [32] proposed deep models, such as stacked neural networks, to improve the efficiency of traditional machine learning methods in network anomaly detection. Tama et al. [33] focused on evaluating selected meta-learner algorithms for NIDS improvement, namely Meta Decision Tree (MDT), Multi Response Linear Regression (MLR), and Multiple Model Trees (MMT). Among them, the MDT model records the highest intrusion detection accuracy, and the accuracy recorded by all models of the three meta-learners is higher than the best accuracy recorded by any basic-level model.

Table 1
Different algorithm combinations of stacking ensemble learning.
Scheme | Basic classifiers | Meta classifier | Validation dataset
[29] | NB, K-Nearest Neighbor (KNN), Rule Induction (RI), DT | RBFN | Large-scale synthetic dataset
[30] | RF, Logistic Regression (LR), KNN | SVM | UNSW NB-15 and UGR’16
[31] | RF, NB, C4.5 | SVM | NSL-KDD
[32] | Deep Neural Network (DNN), Long Short-Term Memory (LSTM) | LR | IoT-23, LITNET-2020, and NetML-2020
[33] | DT, NB, KNN | MDT, MLR, MMT | UNSW NB-15

However, the innovation of the mentioned methods is not significant, and they improve little on the classic stacking ensemble. Oriola et al. [34] proposed a stacked generalization ensemble method with two meta-learners. The first meta-learner is based on the classic stacking ensemble, and the second meta-learner is based on the multi-feature stacking ensemble. They optimized it to obtain the best meta-feature combination. The improved stacking model optimizes the selection of meta-feature combinations but needs more training time. In addition, their approach is also a heterogeneous ensemble, which used LR, SVM, and NB as the basic classifiers. According to the related works, these studies always adopt the full set of data features to reduce preprocessing and always employ heterogeneous stacking ensembles, which is too simplistic a way to increase the diversity of basic classifiers.

3. Problem formulation

3.1. Definitions

Definition 1. $D = \{x_1, x_2, \ldots, x_N\}$ is defined as the traffic data, in which each traffic record is described by $d$ features. Each example $x_i = (x_{i1}; x_{i2}; \ldots; x_{id})$ is a vector in the $d$-dimensional sample space $\chi$, i.e., $x_i \in \chi$, where $d$ denotes the dimension of sample $x_i$.

Definition 2. Define a mapping $f$ from the input space $\chi$ to the output space $\gamma$, i.e., $f: \chi \mapsto \gamma$. For binary classification tasks, usually $\gamma = \{-1, +1\}$ or $\{0, 1\}$; for multi-class classification tasks, usually $\gamma = \{r_1, r_2, \ldots, r_k\}$, $k > 2$.

Definition 3. An anomaly in traffic data is defined as an observation that deviates from the previous normal behaviors [35].

3.2. Problem description

Given traffic data $D$, we would like to detect anomalies in the traffic data through a binary classification learner or a multi-class classification learner.

4. Methodology

This paper proposes a network intrusion detection method based on multi-dimensional feature fusion and homogeneous stacking ensemble learning, which is divided into the following three stages. Firstly, based on the properties of the data, we extract several basic feature datasets from the raw data or the benchmark datasets. Then, we combine the basic feature datasets to form several comprehensive feature datasets. Finally, we use the same basic algorithm to train on the different comprehensive feature datasets, so that a basic classifier is learned for each multi-dimensional feature subspace, and input the predicted probabilities of all basic classifiers into a meta module. The process of the proposed algorithm is shown in Fig. 1. In the following, the proposed method is elaborated.

Fig. 1. The proposed algorithm flowchart.

4.1. Construction of comprehensive feature datasets

Different data features have varying degrees of impact on detecting various abnormal behaviors. How to construct data features is an important and complex problem. We suggest considering this issue from the perspective of feature fusion.

The cyberspace data stream contains a large number of different types of traffic information. Assume that traffic data is analyzed from three aspects: temporal dimension, spatial dimension, and data content. The temporal dimension refers to the time period in which the content of each data item occurs; time units can be hour, day, week, month, quarter, year, etc. The spatial dimension refers to the scope of content covered in each data item, such as different service types, different network-layer protocols, different ports, etc. Data content refers to the network behavior data obtained within a given spatial–temporal dimension, such as the number of data packets, the size of data packets, application request types, application load snapshots, etc. Based on the above data dimensions, the basic feature datasets required by the detection algorithm can be constructed.

In addition, the basic features of different dimensions are mutually supportive and complementary. Anomalies are usually manifested through multiple basic features, so a comprehensive feature dataset needs to be constructed. For example, we can combine the time feature and the content feature into a comprehensive feature, as shown in Fig. 2.

Fig. 2. Feature construction.

Different anomalies are often concealed in different comprehensive data characteristics. Therefore, a single comprehensive feature dataset or a fixed feature combination mode cannot meet the needs of a complex environment. In this paper, we use grouped combinations to cover them all, and the strategy is permutation and combination. Moreover, a good fusion model must provide both diversity and accuracy. The sub-models need to be different, i.e., their strengths and weaknesses differ. Highly related, homogenized sub-models cannot complement each other. Suppose we extract $m$ basic feature datasets and select $n$ of them for each fusion. Then we can get $C_m^n$ comprehensive feature datasets by combining the basic feature datasets. Every comprehensive feature dataset needs a basic classifier to learn a basic model, which means we need $C_m^n$ basic classifiers. Besides, a single basic feature dataset ($n = 1$) cannot reflect traffic characteristics well, resulting in poor performance of the basic classifier, while fusing all $m$ sets of features ($n = m$) yields no difference among the datasets, resulting in poor performance of the ensemble classifier; hence $1 < n < m$. The most important point is that the $C_m^n$ combinations should have a low repetition rate of features to maximize their difference.

4.2. Stacking ensemble learning based on multi-dimensional feature fusion

This combination approach makes each group of comprehensive feature datasets different and helps the homogeneous ensemble to learn more effectively. As shown in Fig. 3, the comprehensive feature dataset of each multi-dimensional subspace is trained separately through the basic detection algorithm to realize part of the learning in the detection data space. The output data of the basic detection models are concatenated (a series connection strategy) as the input of the meta detection algorithm, and we train the meta detection model to achieve the overall detection effect from a macro perspective. The training results of the basic detection algorithm form recognition differences from different angles and expand the hypothesis space of the learning algorithm. The meta detection algorithm trains the model to form a comprehensive detection result with stronger generalization ability.

As mentioned before, we train the corresponding basic learning algorithm separately to get the basic training models $\{h_1, h_2, \ldots, h_K\}$ on several multi-dimensional subspace datasets $\{D_1, D_2, \ldots, D_K\}$, where $K = C_m^n$. Then, we use the predicted probabilities of each basic classifier, $Z_1 = \{z_{11}, z_{12}, \ldots, z_{1N}\}$, $Z_2 = \{z_{21}, z_{22}, \ldots, z_{2N}\}$, $\ldots$, $Z_K = \{z_{K1}, z_{K2}, \ldots, z_{KN}\}$, as the input to the meta classifier, where $N$ is the number of samples. Next, we perform ensemble training on the results of the differential classifiers, $D' = \{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N\}$, to obtain the meta detection model $h'$, where $\varepsilon_i = \{z_{1i}, z_{2i}, \ldots, z_{Ki}\}$, $i \in [1, N]$. The two-level stacking
Table 2
Setup in the experiment (DT and RF use default parameters).
Classifier | Parameters
MFFSEM | basic classifier=DT, meta classifier=RF, use_probas=True, average_probas=True
DT | criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False
RF | n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None
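The configuration in Table 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn, uses synthetic data, and invents four contiguous feature groups (`groups`) with n = 2 groups per combination. A decision tree with default parameters is trained on each comprehensive feature dataset, the class probabilities are concatenated as described in Section 4.2, and a random forest is trained on them as the meta classifier. Obtaining out-of-fold probabilities with `cross_val_predict` is a choice made here to keep training labels from leaking into the meta-features; the source does not say how the meta-features were generated, and the `use_probas`/`average_probas` flags of Table 2 are not modeled one-to-one. The sketch also relies on library defaults rather than passing every parameter listed in Table 2, some of which may not be accepted by current scikit-learn releases.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for traffic records (assumption): 20 features, binary labels.
X, y = make_classification(
    n_samples=800, n_features=20, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Hypothetical basic feature groups (m = 4), given as column indices.
groups = {
    "basic": list(range(0, 5)),
    "content": list(range(5, 10)),
    "time": list(range(10, 15)),
    "host": list(range(15, 20)),
}
n = 2  # groups fused per comprehensive feature dataset, with 1 < n < m

# Each combination of n groups defines one comprehensive feature dataset D_k.
subspaces = [
    sum((groups[g] for g in combo), []) for combo in combinations(groups, n)
]

meta_train, meta_test = [], []
for cols in subspaces:
    h_k = DecisionTreeClassifier(random_state=0)  # same basic algorithm for every D_k
    # Out-of-fold probabilities become the meta-features of the training set.
    meta_train.append(
        cross_val_predict(h_k, X_tr[:, cols], y_tr, cv=5, method="predict_proba")
    )
    h_k.fit(X_tr[:, cols], y_tr)
    meta_test.append(h_k.predict_proba(X_te[:, cols]))

# Concatenate the K probability blocks: one row per sample, as in D'.
D_meta_train = np.hstack(meta_train)
D_meta_test = np.hstack(meta_test)

meta_model = RandomForestClassifier(n_estimators=100, random_state=0)
meta_model.fit(D_meta_train, y_tr)
accuracy = meta_model.score(D_meta_test, y_te)
print(f"K = {len(subspaces)}, meta-features = {D_meta_train.shape}, accuracy = {accuracy:.3f}")
```

With m = 4 and n = 2 the sketch trains K = 6 basic classifiers, so each sample is represented by 6 probability blocks in the meta-level dataset; other values of n only change the list of combinations.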

ensemble learning model is shown in Eq. (1).

$$Z = H(D_1, D_2, \ldots, D_K) = h'(h(D_1), h(D_2), \ldots, h(D_K)). \tag{1}$$

Each sub-item $D_k$ of the above dataset $D$ is a comprehensive feature dataset as analyzed in the previous section, and each sub-item is different. The algorithm corresponding to each sub-model of the above basic training model $h$ is the same. We use the same basic algorithm and different feature datasets to train different basic models.

The detailed process of the multi-dimensional feature fusion and stacking ensemble mechanism (MFFSEM) is shown in Algorithm 1. The main computational cost of Algorithm 1 lies in updating $D'$; its computational complexity is $O(N \cdot K)$ and the temporary variable takes up $O(N)$ space, where $N$ is the number of samples. $K$ is much smaller than $N$ and can be set according to actual conditions, so it is equivalent to a constant. Besides, the construction of the comprehensive feature datasets $D$ only needs $O(1)$. Therefore, the total time and space complexity of the algorithm turns out to be $O(N)$.

Algorithm 1 MFFSEM
Input: Raw data $T = \{X, Y\}$ $(X, Y \subset \mathbb{R}^N)$; basic algorithm $L$; meta algorithm $R$.
Output: Predictions from the ensemble $H$.
1: Extract basic feature datasets based on traffic characteristics, i.e., $B = \{X_i, Y\}_{i=1}^{m}$ $(X_i \subset X)$ // $X_i$ represents all data containing one type of characteristic attributes
2: Use permutation and combination on $B$ to get a cube, i.e., $D_k = \{(X_i \cup \cdots \cup X_{n+i-1}), Y\}_{i=1}^{m}$, $k \in [1, K]$, $K = C_m^n$ // the value of $n$ depends on $m$
3: for $k \leftarrow 1$ to $K$ do
4:     $h_k = L(D_k)$ // learn a classifier $h_k$ from $D_k$
5: $D' = \emptyset$
6: for $j \leftarrow 1$ to $N$ do
7:     for $k \leftarrow 1$ to $K$ do
8:         $\varepsilon_{jk} = h_k(D_k)$ // extract a new instance $\varepsilon_{jk}$
9:     $D' = D' \cup ((\varepsilon_{j1}, \ldots, \varepsilon_{jK}), y_j)$ $(y_j \in Y)$
10: return $H = h' = R(D')$

4.3. Selection of basic and meta detection algorithms

Most of the sub-models should be able to help improve the final output quality, so the sub-models need high accuracy and efficiency. In this paper, we chose Decision Tree (DT) as the basic learning algorithm and Random Forest (RF) as the meta learning algorithm to effectively improve the detection accuracy of NIDS. DT requires relatively little computation and can produce feasible and good results for large data sources in a short time [36]. RF is an effective prediction method, which improves classification accuracy through the integration of many decision trees [37]. At the same time, RF is robust to noise and outliers without overfitting.

5. Experiment setup

The implementation and evaluation of the proposed framework are conducted on a laptop with a Core i5 2.4 GHz CPU and 8 GB RAM, running Python 3.6 on Windows 10. To evaluate the performance of our approach, experiments are carried out on four public standard intrusion detection datasets. The setup in this work is presented in Table 2. No specific hyper-parameter tuning was performed; all classifiers use the default parameters.

5.1. Dataset used for experiments

Many datasets have been employed in evaluating intrusion detection systems. The KDD Cup 99 [38] is one of the well-known but dated datasets. Although it has some limitations, it is still the most commonly used dataset for evaluating intrusion detection models. The NSL-KDD [39], generated by deleting duplicates from KDD Cup 99, is also used. The UNSW-NB15 [40] emerged in 2015 and covers many real-world attacks. Besides, the CIC-IDS2017 dataset [41] was published by the Canadian Institute for Cybersecurity of the University of New Brunswick in 2017.

5.1.1. The KDD Cup 99 dataset

The KDD Cup 99 dataset was generated from an intrusion assessment project conducted by the Defence Advanced Research Projects Agency (DARPA). After the original data were processed by Wenke and others, the public KDD Cup 99 dataset was formed [38]. It contains about 5 million records, including normal and attack events collected over several weeks [42]. Each record contains 41 continuous and nominal features, as well as a type label. As shown in Table 3, we divide the features of the KDD Cup 99 and NSL-KDD datasets into four categories. The first part is the basic features, which refer to the scope of content coverage in each data item; the second part is the content features, which refer to the network behavior data; the third part is the time-based network traffic statistics features, which refer to the time period in which the content of each data item occurs; the fourth part is the host-based network traffic statistics features, which refer to the built-in firewall function to protect incoming and outgoing traffic. Attack types are mainly divided into four
Fig. 3. Stacking ensemble learning model on cube.

Table 3
KDD Cup 99, NSL-KDD features with their category.
Category | No. | Name
Basic | 1 | duration
Basic | 2 | protocol_type
Basic | 3 | service
Basic | 4 | flag
Basic | 5 | src_bytes
Basic | 6 | dst_bytes
Basic | 7 | land
Basic | 8 | wrong_fragment
Basic | 9 | urgent
Content | 10 | hot
Content | 11 | num_failed_logins
Content | 12 | logged_in
Content | 13 | num_compromised
Content | 14 | root_shell
Content | 15 | su_attempted
Content | 16 | num_root
Content | 17 | num_file_creations
Content | 18 | num_shells
Content | 19 | num_access_files
Content | 20 | num_outbound_cmds
Content | 21 | is_host_login
Content | 22 | is_guest_login
Time | 23 | count
Time | 24 | srv_count
Time | 25 | serror_rate
Time | 26 | srv_serror_rate
Time | 27 | rerror_rate
Time | 28 | srv_rerror_rate
Time | 29 | same_srv_rate
Time | 30 | diff_srv_rate
Time | 31 | srv_diff_host_rate
Host | 32 | dst_host_count
Host | 33 | dst_host_srv_count
Host | 34 | dst_host_same_srv_rate
Host | 35 | dst_host_diff_srv_rate
Host | 36 | dst_host_same_src_port_rate
Host | 37 | dst_host_srv_diff_host_rate
Host | 38 | dst_host_serror_rate
Host | 39 | dst_host_srv_serror_rate
Host | 40 | dst_host_rerror_rate
Host | 41 | dst_host_srv_rerror_rate
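The grouping in Table 3 lends itself to a direct enumeration of the comprehensive feature datasets described in Section 4.1. The sketch below is illustrative only: the group boundaries are read off Table 3, while the choice n = 2 and the names `GROUPS` and `comprehensive_feature_sets` are assumptions made for this example, not values reported in the text.

```python
from itertools import combinations
from math import comb

# 1-based feature numbers per category, as listed in Table 3.
GROUPS = {
    "basic": range(1, 10),     # features 1-9
    "content": range(10, 23),  # features 10-22
    "time": range(23, 32),     # features 23-31
    "host": range(32, 42),     # features 32-41
}


def comprehensive_feature_sets(groups, n):
    """Map every n-subset of group names to its sorted feature numbers."""
    if not 1 < n < len(groups):
        raise ValueError("n must satisfy 1 < n < m")
    return {
        combo: sorted(f for name in combo for f in groups[name])
        for combo in combinations(groups, n)
    }


n = 2  # hypothetical choice; Section 4.1 only requires 1 < n < m
feature_sets = comprehensive_feature_sets(GROUPS, n)

# The number of comprehensive feature datasets equals C(m, n).
assert len(feature_sets) == comb(len(GROUPS), n)
for combo, features in feature_sets.items():
    print("+".join(combo), len(features))
```

Each entry of `feature_sets` corresponds to one comprehensive feature dataset and therefore to one basic classifier; the function rejects n = 1 and n = m, mirroring the constraint 1 < n < m.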

Table 4
Distribution statistics of the KDD Cup 99 training and testing datasets.
Type | Training set: Actual | Training set: Percentage (%) | Testing set: Actual | Testing set: Percentage (%)
DoS | 391,458 | 79.24 | 229,853 | 73.90
Probe | 4,107 | 0.83 | 4,166 | 1.34
R2L | 1,126 | 0.23 | 16,189 | 5.21
U2R | 52 | 0.01 | 228 | 0.07
Normal | 97,278 | 19.69 | 60,593 | 19.48
Total | 494,021 | 100 | 311,029 | 100

Table 5
Distribution statistics of the NSL-KDD training and testing datasets.
Type | Training set: Actual | Training set: Percentage (%) | Testing set: Actual | Testing set: Percentage (%)
Attack | 11,743 | 46.61 | 12,833 | 56.92
Normal | 13,449 | 53.39 | 9,711 | 43.08
Total | 25,192 | 100 | 22,544 | 100
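The percentage columns of Tables 4 and 5 follow directly from the record counts. The short sketch below recomputes them for the KDD Cup 99 training set (counts copied from Table 4) and makes the class imbalance explicit; the variable names are chosen for this example.

```python
# Record counts of the KDD Cup 99 training set, copied from Table 4.
train_counts = {
    "DoS": 391458,
    "Probe": 4107,
    "R2L": 1126,
    "U2R": 52,
    "Normal": 97278,
}

total = sum(train_counts.values())
percentages = {
    label: round(100 * count / total, 2) for label, count in train_counts.items()
}

for label, pct in percentages.items():
    print(f"{label:<7}{train_counts[label]:>8}{pct:>8.2f}")
print(f"{'Total':<7}{total:>8}")

# Ratio between the largest and the smallest class.
imbalance = max(train_counts.values()) / min(train_counts.values())
print(f"largest/smallest class ratio: {imbalance:.0f}")
```

The ratio between the DoS and U2R classes is several thousand to one, which is worth keeping in mind when reading per-class results on this dataset.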
categories, namely the Denial of Service (DoS) attack, Probing (Probe) attack, Remote-to-Local (R2L) attack, and User-to-Root (U2R) attack. We use the KDD Cup 99 dataset because it maintains the original state of the information in the network data stream. In this work, the 10% corrected training dataset and the testing dataset are used. Its data distribution is shown in Table 4.

5.1.2. The NSL-KDD dataset

The NSL-KDD dataset is a modified version of the KDD Cup 99 dataset. The redundant records and duplicates are removed, and all attack types are fairly distributed in NSL-KDD [39]. It has the same features as KDD Cup 99 and places higher requirements on the intrusion detection algorithm. We use the two datasets to test the binary and multi-class performance of the algorithm, respectively. In this work, the 20% corrected training dataset and the full testing dataset are used. Table 5 shows the record distribution of the training and testing datasets in the NSL-KDD dataset.

5.1.3. The UNSW-NB15 dataset

The UNSW-NB15 dataset contains 100 GB of captured raw traffic [40], with nine comprehensive attack types, namely Reconnaissance, Backdoor, DoS, Exploits, Analysis, Fuzzers, Worms, Shellcode, and Generic [43]. This paper uses the training set and testing set of UNSW-NB15 because the original dataset is too large and contains much redundancy. The dataset has a total of 45 columns; excluding id, attack_cat, and label, the original feature dimension is 42. As shown in Table 6, we divide the features of the UNSW-NB15 dataset into
Table 6
UNSW-NB15 features with their category.
Category | No. | Name
Basic | 1 | proto
Basic | 2 | state
Basic | 3 | dur
Basic | 4 | sbytes
Basic | 5 | dbytes
Basic | 6 | sttl
Basic | 7 | dttl
Basic | 8 | sloss
Basic | 9 | dloss
Basic | 10 | service
Basic | 11 | sload
Basic | 12 | dload
Basic | 13 | spkts
Basic | 14 | dpkts
Content | 15 | swin
Content | 16 | dwin
Content | 17 | stcpb
Content | 18 | dtcpb
Content | 19 | smean
Content | 20 | dmean
Content | 21 | trans_depth
Content | 22 | response_body_len
Time | 23 | sjit
Time | 24 | djit
Time | 25 | sinpkt
Time | 26 | dinpkt
Time | 27 | tcprtt
Time | 28 | synack
Time | 29 | ackdat
Time | 30 | rate
General | 31 | is_sm_ips_ports
General | 32 | ct_state_ttl
General | 33 | ct_flw_http_mthd
General | 34 | is_ftp_login
General | 35 | ct_ftp_cmd
Connection | 36 | ct_srv_src
Connection | 37 | ct_srv_dst
Connection | 38 | ct_dst_ltm
Connection | 39 | ct_src_ltm
Connection | 40 | ct_src_dport_ltm
Connection | 41 | ct_dst_sport_ltm
Connection | 42 | ct_dst_src_ltm

5.1.4. The CIC-IDS2017 dataset

The CIC-IDS2017 dataset is one of the latest intrusion detection datasets and covers the necessary criteria with updated attacks. In detail, the dataset contains 2,830,743 records spread over 8 files, and each record contains 78 different features with its label [41]. As shown in Table 8, we divide the features of the CIC-IDS2017 dataset into three categories. The first part is the space features; the second part is the time features; the third part is the content features. Taking into account the requirements of multi-class classification and the huge amount of traffic, we chose the Wednesday-workingHours set for experiments using the 10-fold cross-validation method. The statistical information of the set is given in Table 9.

5.2. Data preprocessing

Dataset preprocessing is an important step and is executed on all the datasets used in this work. For KDD Cup 99, NSL-KDD, and UNSW-NB15, we convert nominal attribute values into numeric values, because the back-end calculations of machine learning algorithms are done on numeric values, not nominal values. These datasets include nominal features, for example, protocol features (e.g., tcp, udp, arp, etc.), service features (e.g., http, ftp, dns, etc.), and the state feature. Therefore, it is essential to ensure
Table 7
Distribution statistics of the UNSW-NB15 training and testing datasets.
that all the symbols are a set of numeric values. For example,
we convert the characteristic value ‘tcp’ to the value 1, ‘udp’ to
Type Training set Testing set
the value 2, and so on. For CIC-IDS2017, the raw data inevitably
Actual Percentage (%) Actual Percentage (%)
contains anomalies and redundant instances, for instance, the
Attack 119 341 68.06 45 332 55.06 feature ‘Fwd Header Length’ appears twice, ‘Flow Packets/s’ and
Normal 56 000 31.94 37 000 44.94 ‘Flow Packets/s’ includes abnormal values such as ‘Infinity’ and
Total 175 341 100 82 332 100
‘NaN’, so we remove these columns.
Similarly, the label column inputs are converted to binary class
0 or 1, where 0 represents a normal record and 1 represents an
five categories. The first part is the basic features; the second attack record. For the multi-classification experiment, the attack
part is the content features; the third part is the time features; types are sequentially converted into integer values.

the fourth part is the general features, which is slightly different 5.3. Performance metrics for NIDS
from basic features, mainly refers to the general operation of the
Internet; the fifth part is the connection features. It is important The proposed models are compared and evaluated based on
six performance metrics: Accuracy Rate (ACC), Recall, Precision, F-
to note that the features scrip, sport, dstip, stime, and ltime are
Measure, False Positive Rate (FPR) and Receiver Operating Charac-
missing in the training datasets. Besides, the distribution statistic teristic (ROC). The introduction of the confusion matrix is shown
of the UNSW-NB15 dataset is shown in Table 7. in Table 10.
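The preprocessing steps described in Section 5.2 can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' code: the helper names (`encode_nominal`, `binarize_label`, `drop_bad_columns`), the set of labels treated as normal, and the toy records are all assumptions made for the example.

```python
import math

def encode_nominal(values):
    """Map each distinct nominal value to an integer code, numbered from 1
    in order of first appearance (e.g. 'tcp' -> 1, 'udp' -> 2)."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
        encoded.append(codes[value])
    return encoded, codes

def binarize_label(label, normal_names=("normal", "benign")):
    """Return 0 for a normal record and 1 for an attack record.
    The names counted as normal are an assumption of this sketch."""
    return 0 if label.strip().lower() in normal_names else 1

def drop_bad_columns(header, rows):
    """Remove columns whose name is duplicated or that hold non-finite values
    such as 'Infinity' or 'NaN'."""
    name_counts = {}
    for name in header:
        name_counts[name] = name_counts.get(name, 0) + 1
    keep = []
    for j, name in enumerate(header):
        duplicated = name_counts[name] > 1
        non_finite = any(not math.isfinite(float(row[j])) for row in rows)
        if not duplicated and not non_finite:
            keep.append(j)
    return [header[j] for j in keep], [[row[j] for j in keep] for row in rows]

# Toy records (hypothetical values, for illustration only).
protocol_codes, mapping = encode_nominal(["tcp", "udp", "tcp", "arp"])
header = ["Flow Duration", "Fwd Header Length", "Flow Packets/s", "Fwd Header Length"]
rows = [["10", "20", "Infinity", "20"], ["12", "32", "3.5", "32"]]
clean_header, clean_rows = drop_bad_columns(header, rows)
```

In this toy run only `Flow Duration` survives: both copies of the duplicated column and the column containing `Infinity` are dropped, which mirrors the column removal described for CIC-IDS2017.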

Table 8
CIC-IDS2017 features with their category.
Category No. Name Category No. Name Category No. Name
Space 1 Destination Port Time 26 Idle Std Content 51 RST Flag Count
Time 2 Flow Duration 27 Idle Max 52 PSH Flag Count
3 Flow IAT Mean 28 Idle Min 53 ACK Flag Count
4 Flow IAT Std 29 Fwd Packets/s 54 URG Flag Count
5 Flow IAT Max 30 Bwd Packets/s 55 CWE Flag Count
6 Flow IAT Min Content 31 Total Fwd Packets 56 Avg Fwd Segment Size
7 Fwd IAT Total 32 Total Backward Packets 57 Avg Bwd Segment Size
8 Fwd IAT Mean 33 Total Length of Fwd Packets 58 Fwd Avg Bytes/Bulk
9 Fwd IAT Std 34 Total Length of Bwd Packets 59 Fwd Avg Packets/Bulk
10 Fwd IAT Max 35 Fwd Packet Length Max 60 Bwd Avg Bytes/Bulk
11 Fwd IAT Min 36 Fwd Packet Length Min 61 Bwd Avg Packets/Bulk
12 Bwd IAT Total 37 Fwd Packet Length Mean 62 Bwd Avg Bulk Rate
13 Bwd IAT Mean 38 Fwd Packet Length Std 63 Subflow Fwd Packets
14 Bwd IAT Std 39 Bwd Packet Length Max 64 Subflow Fwd Bytes
15 Bwd IAT Max 40 Bwd Packet Length Min 65 Subflow Bwd Packets
16 Bwd IAT Min 41 Bwd Packet Length Mean 66 Subflow Bwd Bytes
17 Fwd PSH Flags 42 Bwd Packet Length Std 67 Fwd Avg Bulk Rate
18 Bwd PSH Flags 43 Bwd Header Length 68 Init_Win_bytes_forward
19 Fwd URG Flags 44 Min Packet Length 69 Init_Win_bytes_backward
20 Bwd URG Flags 45 Max Packet Length 70 act_data_pkt_fwd
21 Active Mean 46 Packet Length Mean 71 min_seg_size_forward
22 Active Std 47 Packet Length Std 72 ECE Flag Count
23 Active Max 48 Packet Length Variance 73 Down/Up Ratio
24 Active Min 49 FIN Flag Count 74 Average Packet Size
25 Idle Mean 50 SYN Flag Count


Table 9
Statistics of the CIC-IDS2017(Wed.) datasets.
Type               Actual     Percentage (%)
DoS slowloris      5796       0.84
DoS Slowhttptest   5499       0.80
Dos Hulk           230,124    33.28
DoS GoldenEye      10,293     1.49
Heartbleed         11         0.001
Benign             439,683    63.59
Total              691,406    100

ACC returns the percentage of all instances that are correctly classified, as shown in Eq. (2).

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}. \tag{2}$$

Recall gives the ratio of the number of successfully classified normal instances to the overall number of normal instances. It is also known as the True Positive Rate (TPR) or Detection Rate (DR) and is computed according to Eq. (3).

$$\mathrm{Recall} = \frac{TP}{TP + FN}. \tag{3}$$

Precision is relative to the prediction result and indicates how many samples with positive predictions are correct. It is a term borrowed from information retrieval and is usually paired with Recall, as shown in Eq. (4).

$$\mathrm{Precision} = \frac{TP}{TP + FP}. \tag{4}$$

Precision and Recall sometimes pull in opposite directions, so they need to be considered together. F-measure computes a trade-off between precision and recall of both classes, as shown in Eq. (5).

$$F\text{-}\mathrm{measure} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}. \tag{5}$$

FPR is also known as the false alarm rate. It is computed by dividing the number of attack instances that are misclassified as normal by the overall number of attack instances, as shown in Eq. (6).

$$\mathrm{FPR} = \frac{FP}{FP + TN}. \tag{6}$$

Micro-averaging pays more attention to classes with a large sample size, whereas macro-averaging pays more attention to classes with a small sample size, as shown in Eqs. (7)–(12), where $n$ is the number of classes and $TP_i$, $FP_i$, and $FN_i$ are the counts for class $i$. In this paper, we use macro-averaging and micro-averaging for multi-classification.

$$\mathrm{Macro\_R} = \frac{1}{n}\sum_{i=1}^{n}\frac{TP_i}{TP_i + FN_i}. \tag{7}$$

$$\mathrm{Macro\_P} = \frac{1}{n}\sum_{i=1}^{n}\frac{TP_i}{TP_i + FP_i}. \tag{8}$$

$$\mathrm{Macro\_F} = \frac{2 \times \mathrm{Macro\_R} \times \mathrm{Macro\_P}}{\mathrm{Macro\_R} + \mathrm{Macro\_P}}. \tag{9}$$

$$\mathrm{Micro\_R} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FN_i}. \tag{10}$$

$$\mathrm{Micro\_P} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FP_i}. \tag{11}$$

$$\mathrm{Micro\_F} = \frac{2 \times \mathrm{Micro\_R} \times \mathrm{Micro\_P}}{\mathrm{Micro\_R} + \mathrm{Micro\_P}}. \tag{12}$$

The ROC curve is a graphical method that shows the trade-off between the true positive rate and the false positive rate of the model, and it is also used to study the generalization performance of the model. Its quantitative indicator is the area under the ROC curve (AUC). The larger the AUC, the better the detection model performs.

6. Results and discussion

This section evaluates our approach on standard NIDS datasets. We perform binary classification evaluation on NSL-KDD and UNSW-NB15 and multi-classification evaluation on KDD Cup 99 and CIC-IDS2017. For the CIC-IDS2017 dataset, we conduct experiments with 10-fold cross-validation. For NSL-KDD, UNSW-NB15, and KDD Cup 99, we use the training and testing configurations suggested for those datasets. To fairly evaluate the robustness of our model, we run each experiment 10 times and report the average.

6.1. Feature combination strategy

In this section, the results obtained after permutation and combination of the basic feature sets of the four datasets are discussed, and the effectiveness of this feature fusion strategy is verified through experiments.
In this paper, we partition the features of the datasets into several categories according to their contents. As shown in Table 3, we divide the KDD Cup 99 and NSL-KDD features into four categories. For the combination $C_4^2$, six comprehensive feature datasets are obtained; each basic feature dataset appears three times among them, that is, two additional times, so the repetition rate is 33.3%. Similarly, the combination $C_4^3$ yields four comprehensive feature datasets in which each basic feature dataset has two additional occurrences, so the repetition rate is 50%. Table 6 shows that UNSW-NB15 is divided into five basic feature sets. The combination $C_5^2$ has ten comprehensive feature datasets, and each basic feature dataset has three additional occurrences, so the repetition rate is 33.3%. Similarly, the repetition rate of the combination $C_5^3$ can be calculated to be 50%, and the repetition rate of the combination $C_5^4$ is 60%. Table 8 shows that CIC-IDS2017 is divided into three basic feature sets. The combination $C_3^2$ has three comprehensive feature datasets, and each basic feature dataset has one additional occurrence, so the repetition rate is 33.3%. Besides, the combinations $C_3^3$, $C_4^4$, and $C_5^5$ each have only one comprehensive dataset, which means that the basic classifiers do not differ.
In Tables 11–13, experimental results show that better detection results can be obtained by combining datasets with less

Table 10
Description of confusion matrix.
Symbol Description
True positive (TP) The amount of normal data detected is actually normal data
True negative (TN) The amount of attack data detected is actually attack data
False positive (FP) The attack data that are detected as normal data
False negative (FN) The normal data that are detected as attack data
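The binary metrics of Eqs. (2)–(6) follow directly from the four counts described in Table 10. The sketch below is a minimal illustration rather than the authors' implementation; the function names `confusion_counts` and `binary_metrics` and the toy labels are assumptions. Following Table 10, the normal class is treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive="normal"):
    """Count TP, TN, FP, FN with `positive` as the positive class
    (Table 10 treats normal data as positive)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1  # attack detected as normal
        else:
            fn += 1  # normal detected as attack
    return tp, tn, fp, fn

def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def binary_metrics(tp, tn, fp, fn):
    """ACC, Recall, Precision, F-measure and FPR as fractions, Eqs. (2)-(6)."""
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "ACC": _ratio(tp + tn, tp + tn + fp + fn),
        "Recall": recall,
        "Precision": precision,
        "F-measure": _ratio(2 * recall * precision, recall + precision),
        "FPR": _ratio(fp, fp + tn),
    }

# Toy example (hypothetical labels).
y_true = ["normal", "normal", "attack", "attack", "normal"]
y_pred = ["normal", "attack", "attack", "normal", "normal"]
counts = confusion_counts(y_true, y_pred)
metrics = binary_metrics(8, 6, 2, 4)
```

Multiplying each value by 100 gives percentages in the form reported in the result tables.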


repetition rate and diverse data. Generally speaking, the repetition rate should not exceed 50%. As shown in Table 11, the results of the combinations $C_4^2$ and $C_4^3$ on the KDD Cup 99 dataset are better than those of other combinations, except for the two minority classes R2L and U2R. For the CIC-IDS2017 dataset, the results listed in Table 12 indicate that the combination $C_3^2$ achieves better performance. In Table 13, the combinations $C_4^2$ and $C_4^3$ on the NSL-KDD dataset outperform other combinations. For the UNSW-NB15 dataset, Table 13 also indicates that the combinations $C_5^2$ and $C_5^3$, which have lower repetition rates, achieve better classification performance than other combinations. If n = 1 (each comprehensive dataset consists of a single basic feature set), the classification effect is not very good except for the minority classes of KDD Cup 99. In addition, the classification performance of the combinations $C_3^3$, $C_4^4$, and $C_5^5$ is always poor because the datasets of the basic classifiers are the same. This implies that a single basic feature set, or a combination of features with a high repetition rate, usually causes poor classification performance.

Table 11
Performance comparison of different combinations of multi-classification for KDD Cup 99 (%).
Combinations   Metric      Normal   DoS     Probe   R2L     U2R
$C_4^1$        Precision   72.45    99.61   84.44   82.91   53.73
$C_4^1$        Recall      98.03    97.07   79.72   8.87    15.79
$C_4^1$        F-measure   83.72    98.32   82.01   16.03   24.41
$C_4^2$        Precision   72.18    99.87   89.17   78.95   8.93
$C_4^2$        Recall      99.53    97.20   76.07   0.28    11.40
$C_4^2$        F-measure   83.67    98.51   82.10   0.55    10.02
$C_4^3$        Precision   72.52    99.87   85.79   77.08   15.43
$C_4^3$        Recall      99.42    97.16   84.21   0.46    10.96
$C_4^3$        F-measure   83.86    98.50   84.99   0.91    12.82
$C_4^4$        Precision   67.25    99.61   61.02   50.41   0.00
$C_4^4$        Recall      98.84    93.94   52.52   4.97    0.00
$C_4^4$        F-measure   80.04    96.69   56.45   9.04    0.00

6.2. Stacking ensemble learning strategy

In order to evaluate the detection performance of our stacking ensemble learning strategy, we compare our method with its basic and meta classifiers on the datasets. Tables 14–16 compare the detailed performance measurements of the algorithms in the process of detecting anomalies.
The results show that our proposed approach, which combines the classifiers, achieves the best performance. As shown in Table 14, on the KDD Cup 99 dataset MFFSEM achieves superior performance in terms of most metrics except for the R2L class. In Table 15, for the CIC-IDS2017 dataset, the results of MFFSEM are best in almost all metrics. From Table 16, for the NSL-KDD dataset, the metrics of MFFSEM are optimal except for Recall. For the UNSW-NB15 dataset, the ACC, Recall, and F-measure evaluation indicators are improved. In summary, the results show that the proposed stacking ensemble strategy can significantly improve the overall performance of the classifiers.

Table 14
Performance analysis of stacking ensemble learning strategy of multi-classification for KDD Cup 99 (%).
Algorithms   Metric      Normal   DoS     Probe   R2L     U2R
DT           Precision   71.75    99.80   77.56   58.65   1.59
DT           Recall      99.38    96.85   78.66   1.51    5.26
DT           F-measure   83.33    98.30   78.11   2.94    2.44
RF           Precision   72.29    99.80   91.89   98.53   70.59
RF           Recall      99.32    97.06   75.11   2.49    5.26
RF           F-measure   83.67    98.41   82.66   4.86    9.8
MFFSEM       Precision   72.52    99.87   85.79   77.08   15.43
MFFSEM       Recall      99.42    97.16   84.21   0.46    10.96
MFFSEM       F-measure   83.86    98.50   84.99   0.91    12.82
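The stacking idea evaluated in Section 6.2, base classifiers trained on different comprehensive feature datasets whose predictions feed a meta classifier, can be sketched in plain Python. The paper uses DT as the basic classifier and RF as the meta classifier; to stay dependency-free, this sketch swaps in a simple nearest-centroid classifier at both levels, so it illustrates the data flow only and is not the MFFSEM implementation. The class and function names, the toy data, and the use of in-sample base predictions (instead of out-of-fold predictions) are assumptions.

```python
class NearestCentroid:
    """Tiny stand-in classifier: predict the class with the closest mean vector."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, X):
        def sq_dist(row, centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))
        return [min(self.centroids, key=lambda c: sq_dist(row, self.centroids[c]))
                for row in X]

def select(X, cols):
    """Keep only the columns of one comprehensive feature dataset."""
    return [[row[j] for j in cols] for row in X]

def base_predictions(bases, X, feature_subsets):
    """One meta-feature per base classifier: its predicted label (0 or 1)."""
    preds = [b.predict(select(X, cols)) for b, cols in zip(bases, feature_subsets)]
    return [list(p) for p in zip(*preds)]

def fit_stacking(X, y, feature_subsets):
    """Level 0: one base classifier per feature subset. Level 1: meta classifier
    trained on the base predictions (in-sample here, for brevity)."""
    bases = [NearestCentroid().fit(select(X, cols), y) for cols in feature_subsets]
    meta = NearestCentroid().fit(base_predictions(bases, X, feature_subsets), y)
    return bases, meta

def predict_stacking(model, X, feature_subsets):
    bases, meta = model
    return meta.predict(base_predictions(bases, X, feature_subsets))

# Toy data: three single-column "basic feature sets", combined two at a time.
X_train = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [0.0, 0.3, 0.2],
           [0.9, 0.8, 0.7], [0.8, 0.9, 1.0], [1.0, 0.7, 0.9]]
y_train = [0, 0, 0, 1, 1, 1]
subsets = [(0, 1), (0, 2), (1, 2)]
model = fit_stacking(X_train, y_train, subsets)
predicted = predict_stacking(model, [[0.15, 0.1, 0.2], [0.95, 0.9, 0.8]], subsets)
```

Using the predicted labels directly as numeric meta-features only makes sense for binary 0/1 labels; a multi-class variant would need to encode the base predictions differently, for example one indicator per class.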

Table 12
Performance comparison of different combinations of multi-classification for CIC-IDS2017(Wed.) (%).
Combinations   Metric      Benign   DoS slowloris   DoS Slowhttptest   Dos Hulk   DoS GoldenEye   Heartbleed
$C_3^1$        Precision   99.81    98.58           97.81              99.88      99.10           100.00
$C_3^1$        Recall      99.91    99.46           96.58              99.71      98.80           100.00
$C_3^1$        F-measure   99.86    99.02           97.19              99.80      98.95           100.00
$C_3^2$        Precision   99.99    99.28           99.64              99.94      99.50           100.00
$C_3^2$        Recall      99.98    99.49           99.10              99.96      99.60           100.00
$C_3^2$        F-measure   99.98    99.38           99.37              99.95      99.55           100.00
$C_3^3$        Precision   99.98    99.11           99.64              99.94      99.30           100.00
$C_3^3$        Recall      99.97    99.46           98.74              99.96      99.50           100.00
$C_3^3$        F-measure   99.97    99.28           99.19              99.95      99.40           100.00

Table 13
Performance comparison of different combinations of binary classification for NSL-KDD and UNSW-NB15 (%).
Datasets    Combinations   ACC     Recall   Precision   F-measure   FPR
NSL-KDD     $C_4^1$        76.59   93.83    66.07       77.54       36.45
NSL-KDD     $C_4^2$        80.52   94.07    70.53       80.62       29.73
NSL-KDD     $C_4^3$        84.33   96.43    74.61       84.13       24.82
NSL-KDD     $C_4^4$        80.11   93.65    70.16       80.22       30.14
UNSW-NB15   $C_5^1$        85.94   79.99    87.64       83.64       9.20
UNSW-NB15   $C_5^2$        88.85   80.44    93.88       86.64       2.27
UNSW-NB15   $C_5^3$        87.76   76.18    95.71       84.84       2.78
UNSW-NB15   $C_5^4$        86.86   74.34    95.40       83.56       2.92
UNSW-NB15   $C_5^5$        86.34   74.91    93.38       83.13       4.32
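The combinations compared in Tables 11–13 can be enumerated with a few lines of Python. The sketch below is an illustration under stated assumptions: `comprehensive_sets` and `repetition_rate` are hypothetical helper names, the toy feature groups are a small excerpt in the spirit of Table 8, and the repetition-rate formula (additional occurrences of one basic feature set divided by the number of comprehensive datasets) is inferred from the worked figures in Section 6.1 rather than quoted from a definition.

```python
from itertools import combinations
from math import comb

def comprehensive_sets(basic_sets, k):
    """Build one comprehensive feature dataset per k-combination of basic sets."""
    result = {}
    for combo in combinations(sorted(basic_sets), k):
        features = []
        for name in combo:
            features.extend(basic_sets[name])
        result[combo] = features
    return result

def repetition_rate(n, k):
    """Inferred definition: (occurrences of one basic set - 1) / C(n, k)."""
    total = comb(n, k)               # number of comprehensive datasets
    occurrences = comb(n - 1, k - 1)  # datasets that contain a given basic set
    return (occurrences - 1) / total

# Toy basic feature sets (illustrative excerpt, not the full Table 8).
basic = {
    "space": ["Destination Port"],
    "time": ["Flow Duration", "Flow IAT Mean"],
    "content": ["Total Fwd Packets"],
}
pairs = comprehensive_sets(basic, 2)
rates = {(n, k): round(100 * repetition_rate(n, k), 1)
         for n, k in [(4, 2), (4, 3), (5, 2), (5, 3), (5, 4), (3, 2)]}
```

Under this inferred definition the sketch reproduces the rates stated in Section 6.1 for $C_4^2$, $C_4^3$, $C_5^3$, $C_5^4$, and $C_3^2$; for $C_5^2$ it gives 30% rather than the 33.3% stated in the text, so the authors' exact definition may differ.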


Table 15
Performance analysis of stacking ensemble learning strategy of multi-classification for CIC-IDS2017(Wed.) (%).
Algorithms   Metric      Benign   DoS slowloris   DoS Slowhttptest   Dos Hulk   DoS GoldenEye   Heartbleed
DT           Precision   99.98    99.11           99.64              99.92      99.30           100.00
DT           Recall      99.96    99.46           98.92              99.95      99.50           100.00
DT           F-measure   99.97    99.28           99.28              99.93      99.40           100.00
RF           Precision   99.98    99.11           99.64              99.90      99.60           100.00
RF           Recall      99.96    99.46           99.10              99.95      99.50           100.00
RF           F-measure   99.97    99.28           99.37              99.93      99.55           100.00
MFFSEM       Precision   99.99    99.28           99.64              99.94      99.50           100.00
MFFSEM       Recall      99.98    99.49           99.10              99.96      99.60           100.00
MFFSEM       F-measure   99.98    99.38           99.37              99.95      99.55           100.00

Table 16
Performance analysis of stacking ensemble learning strategy of binary classification for NSL-KDD and UNSW-NB15 (%).
Datasets    Algorithms   ACC     Recall   Precision   F-measure   FPR
NSL-KDD     DT           80.02   93.61    70.29       80.29       29.93
NSL-KDD     RF           77.89   97.36    66.66       79.14       36.84
NSL-KDD     MFFSEM       84.33   96.43    74.61       84.13       24.82
UNSW-NB15   DT           86.16   74.87    92.97       82.94       4.61
UNSW-NB15   RF           87.12   73.47    97.19       83.68       1.73
UNSW-NB15   MFFSEM       88.85   80.44    93.88       86.64       2.27

6.3. Comparison with ensemble learning methods

To further explore the validity of the proposed MFFSEM model, we perform comparison experiments with some well-known ensemble methods. The classifiers used in the voting ensemble method are DT and RF, which are the same as those used in MFFSEM. Categorical Boosting (CatBoost), Light Gradient Boosting Machine (LightGBM), and Adaptive Boosting (AdaBoost) are boosting algorithms. Extremely Randomized Trees (ExtRa Trees) is a bagging algorithm. To make the comparison fair, these well-known ensemble learning methods also use default parameters. As with our stacking ensemble learning, we do not specifically tune their hyper-parameters.
Table 17 reports the outcomes of multi-classification experiments on the KDD Cup 99 and CIC-IDS2017(Wed.) datasets. The proposed method achieves promising results in terms of Macro_R, Macro_F, Micro_P, Micro_R, and Micro_F. CatBoost achieves the best results on the Macro_P metric because its default number of iterations is 500. Table 18 reports the outcomes of binary classification experiments on the NSL-KDD and UNSW-NB15 datasets. For the NSL-KDD dataset, the metrics of MFFSEM are optimal except for Recall. For the UNSW-NB15 dataset, even though the FPR and Precision are not ideal, the other indicators perform well. Fig. 4 shows that MFFSEM reports the best AUC score in the binary classification. Figs. 5 and 6 show the ROC curves on the KDD Cup 99 and CIC-IDS2017(Wed.) datasets. As shown in the figures, the area under the ROC curve of MFFSEM is larger than those of most ensemble methods. Although the AUC of MFFSEM is not as good as those of CatBoost and AdaBoost, the training time of these two algorithms is much longer than that of MFFSEM, as shown in Table 19.
Table 19 lists the training time and testing time of different ensemble methods on each dataset. For datasets KDD Cup 99 and CIC-IDS2017(Wed.), we calculate the running time of

Table 17
Compared with well-known ensemble methods for KDD Cup 99 and CIC-IDS2017(Wed.) (%).
Datasets            Ensemble methods   Macro_P   Macro_R   Macro_F   Micro_P   Micro_R   Micro_F
KDD Cup 99          CatBoost           86.43     56.63     56.94     92.45     92.45     92.45
KDD Cup 99          LightGBM           56.33     50.47     48.93     89.64     89.64     89.64
KDD Cup 99          AdaBoost           82.92     55.25     55.03     92.44     92.44     92.44
KDD Cup 99          ExtRa Trees        79.80     56.82     55.81     92.39     92.39     92.39
KDD Cup 99          Voting             71.55     56.07     55.38     92.32     92.32     92.32
KDD Cup 99          MFFSEM             78.75     59.89     60.90     92.50     92.50     92.50
CIC-IDS2017(Wed.)   CatBoost           99.72     99.56     99.65     99.94     99.94     99.94
CIC-IDS2017(Wed.)   LightGBM           67.60     70.01     66.69     96.85     96.85     96.85
CIC-IDS2017(Wed.)   AdaBoost           99.70     99.67     99.68     99.88     99.88     99.88
CIC-IDS2017(Wed.)   ExtRa Trees        99.70     99.67     99.68     99.91     99.91     99.91
CIC-IDS2017(Wed.)   Voting             99.63     99.61     99.62     99.90     99.90     99.90
CIC-IDS2017(Wed.)   MFFSEM             99.72     99.68     99.70     99.95     99.95     99.95
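The macro- and micro-averaged scores reported in Table 17 follow Eqs. (7)–(12). A minimal sketch of that computation is given below; the function name `macro_micro` and the toy labels are assumptions for illustration, and the code is not taken from the authors' experiments.

```python
def macro_micro(y_true, y_pred):
    """Macro- and micro-averaged Recall, Precision and F, Eqs. (7)-(12)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1  # the true class missed this record
            fp[pred] += 1   # the predicted class gained a wrong record

    def ratio(num, den):
        return num / den if den else 0.0

    n = len(classes)
    macro_r = sum(ratio(tp[c], tp[c] + fn[c]) for c in classes) / n
    macro_p = sum(ratio(tp[c], tp[c] + fp[c]) for c in classes) / n
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_r = ratio(total_tp, total_tp + total_fn)
    micro_p = ratio(total_tp, total_tp + total_fp)
    return {
        "Macro_R": macro_r,
        "Macro_P": macro_p,
        "Macro_F": ratio(2 * macro_r * macro_p, macro_r + macro_p),
        "Micro_R": micro_r,
        "Micro_P": micro_p,
        "Micro_F": ratio(2 * micro_r * micro_p, micro_r + micro_p),
    }

# Toy three-class example (hypothetical labels).
scores = macro_micro(["a", "a", "a", "a", "b", "b", "c"],
                     ["a", "a", "a", "b", "b", "a", "c"])
```

When every record has exactly one true and one predicted class, each misclassification adds one false negative and one false positive, so Micro_P, Micro_R, and Micro_F coincide with overall accuracy. This is consistent with the identical Micro columns in Table 17.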

Table 18
Compared with well-known ensemble methods for NSL-KDD and UNSW-NB15 (%).
Datasets    Ensemble methods   ACC     Recall   Precision   F-measure   FPR
NSL-KDD     CatBoost           79.05   97.12    67.97       79.97       34.62
NSL-KDD     LightGBM           78.24   97.00    67.12       79.34       35.95
NSL-KDD     AdaBoost           80.42   96.81    69.61       80.99       31.98
NSL-KDD     ExtRa Trees        77.20   97.15    65.99       78.59       37.88
NSL-KDD     Voting             81.48   96.81    70.86       81.83       30.12
NSL-KDD     MFFSEM             84.33   96.43    74.61       84.13       24.82
UNSW-NB15   CatBoost           87.40   73.92    97.43       84.06       1.59
UNSW-NB15   LightGBM           87.61   74.14    97.76       84.33       1.38
UNSW-NB15   AdaBoost           86.41   72.83    95.96       82.81       2.50
UNSW-NB15   ExtRa Trees        86.73   72.60    97.14       83.10       1.74
UNSW-NB15   Voting             86.15   74.71    93.12       82.90       4.50
UNSW-NB15   MFFSEM             88.85   80.44    93.88       86.64       2.27


Table 19
Comparison of running time of well-known ensemble learning methods on four datasets (sec).
Time    Datasets            CatBoost   LightGBM   AdaBoost   ExtRa Trees   Voting    MFFSEM
Train   KDD Cup 99          326.192    21.510     98.073     19.711        29.987    18.995
Train   CIC-IDS2017(Wed.)   784.124    64.156     1616.356   145.927       332.575   89.919
Train   NSL-KDD             17.153     0.758      12.433     1.659         2.086     0.999
Train   UNSW-NB15           77.516     3.393      93.929     18.830        37.967    22.776
Test    KDD Cup 99          1.635      7.147      8.800      1.6974        4.625     4.547
Test    CIC-IDS2017(Wed.)   1.401      1.810      2.822      1.678         1.494     1.244
Test    NSL-KDD             0.465      0.204      0.500      0.2060        0.324     0.265
Test    UNSW-NB15           0.691      0.605      1.838      2.025         1.587     1.249

Fig. 4. ROC curve of binary classification.

Fig. 5. ROC curve for KDD Cup 99.

multi-classification. For datasets NSL-KDD and UNSW-NB15, we measure the running time of the binary classification. MFFSEM achieves a lower computation cost than most other methods, regardless of whether the task is binary classification or multi-classification. The testing time of our approach is also relatively short, which makes it well suited to online detection.

6.4. Evaluation results

From the evaluation results given in the previous subsections, we can safely argue that the stacking ensemble method is more effective for detecting network intrusions. MFFSEM outperforms the basic and meta classifiers adopted in our method, and is better than other well-known ensemble classifiers. These results are almost consistent across all the experimental result sets for

Fig. 6. ROC curve for CIC-IDS2017(Wed.).
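The AUC that summarizes the ROC curves in Figs. 4–6 can be computed without plotting. The sketch below uses the standard rank-based equivalence: the area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy scores are assumptions, and the paper does not state how its AUC values were computed.

```python
def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy scores (hypothetical): higher score means more likely positive.
auc = roc_auc([1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.6, 0.6])
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

The pairwise loop is quadratic in the number of records, which is fine for a small illustration; a sort-based version would be preferable for datasets of the size used in this paper.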

the different benchmark datasets. This finding is critical because our approach is evaluated against various NIDS benchmark datasets, which suggests that it can protect the entire network environment and help ensure its safe operation.
To further explore the novelty of the proposed MFFSEM model, comparisons with classical machine learning algorithms and with current state-of-the-art schemes are presented in Tables 20 and 21. In Table 20, we consider the KNN and NB algorithms for comparison. These two classic algorithms also use default parameters. Besides, the recall rate for the multi-class datasets is reported as the micro-averaged indicator. Specifically, MFFSEM surpasses its competitors in all three classification metrics on the four datasets. In Table 21, we compare with state-of-the-art schemes such as Autoencoder (AE) [44], Genetic Algorithm-Logistic Regression (GALR) [45], Hybrid ABC and MBO (HAM) [46], and Correlation Based Feature Selection-Bat Algorithm (CFS-BA) [47]. The proposed model also achieves the best performance in terms of ACC, Recall, and FPR. Hence, MFFSEM outperforms both classical and recent algorithms and is feasible for application in NIDS research.

Table 20
A comparison of MFFSEM with the classical machine learning algorithms (%).
Datasets            Algorithms   ACC     Recall   FPR
NSL-KDD             KNN          78.16   96.18    35.92
NSL-KDD             NB           48.69   95.70    66.87
NSL-KDD             MFFSEM       84.33   96.43    24.82
UNSW-NB15           KNN          78.29   58.45    5.52
UNSW-NB15           NB           70.61   52.51    14.60
UNSW-NB15           MFFSEM       88.85   80.44    2.27
KDD Cup 99          KNN          92.02   92.02    2.12
KDD Cup 99          NB           83.12   83.12    11.63
KDD Cup 99          MFFSEM       92.48   92.50    2.03
CIC-IDS2017(Wed.)   KNN          99.48   99.48    0.15
CIC-IDS2017(Wed.)   NB           51.05   51.05    11.32
CIC-IDS2017(Wed.)   MFFSEM       99.95   99.95    0.013

Table 21
A comparison of MFFSEM with the current state-of-the-art schemes (%).
Datasets            Schemes       ACC     Recall   FPR
NSL-KDD             AE [44]       84.21   87.00    N/A
NSL-KDD             MFFSEM        84.33   96.43    24.82
UNSW-NB15           GALR [45]     81.42   N/A      6.39
UNSW-NB15           MFFSEM        88.85   80.44    2.27
KDD Cup 99          HAM [46]      87.19   90.89    16.7
KDD Cup 99          MFFSEM        92.48   92.50    2.03
CIC-IDS2017(Wed.)   CFS-BA [47]   99.89   99.90    0.12
CIC-IDS2017(Wed.)   MFFSEM        99.95   99.95    0.013

7. Conclusion and future works

In this paper, we propose a novel approach for intrusion detection systems based on multi-dimensional feature fusion and stacking ensemble learning. According to the correlation of traffic information, the benchmark datasets are divided into several basic feature datasets, and then several comprehensive feature datasets are constructed through permutation and combination. Leveraging the stacking ensemble learning scheme, MFFSEM can be effectively applied to address the intrusion detection problem. Permutation and combination provide mutual support among different features and increase the diversity of basic classifiers. The homogeneous stacking ensemble learning based on permutation and combination makes MFFSEM more robust than other state-of-the-art solutions in the field. The proposed approach is applied on four known datasets including the KDD Cup 99, NSL-KDD, UNSW-NB15, and CIC-IDS2017. MFFSEM achieves improved attack detection accuracy while maintaining the balance of Recall and Precision metrics among all the classes of the datasets. The detection result after applying the ensemble scheme is better than those of the basic and meta classifiers. Moreover, compared with well-known ensemble learning detection methods, the proposed MFFSEM model achieves promising classification performance.

At present, the basic feature datasets are constructed based on the existing features of the datasets. As future work, data construction standards for different dimensions should be established. The abnormal data flow of the network may originate from multiple different network layer services, and multiple dimensional subspace datasets can be built on these services respectively. We can then conduct anomaly detection in multiple services and detect network data streams from different service perspectives. Moreover, some optimal adjustments for the classification model, such as feature selection strategies and DT's pruning techniques, should be employed and lead to a more robust IDS.

CRediT authorship contribution statement

Hao Zhang: Conceptualization, Methodology, Supervision, Funding acquisition, Writing - review & editing. Jie-Ling Li: Methodology, Validation, Formal analysis, Investigation, Data curation, Software, Writing - original draft. Xi-Meng Liu: Validation, Writing - review & editing. Chen Dong: Formal analysis, Investigation, Software, Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant U1804263 and 61877010, the Natural Science Foundation of Fujian Province China under Grant 2020J01130167 and 2018J07005, and the Joint Straits Fund of Key Program of the National Natural Science Foundation of China under Grant U1705262.

References

[1] N. Moustafa, J. Hu, J. Slay, A holistic review of network anomaly detection systems: A comprehensive survey, J. Netw. Comput. Appl. 128 (FEB.) (2019) 33–55.
[2] M.T. Nguyen, K. Kim, Genetic convolutional neural network for intrusion detection systems, Future Gener. Comput. Syst. 113 (2020) 418–427.
[3] N. Ye, S.M. Emran, Q. Chen, S. Vilbert, Multivariate statistical analysis of audit trails for host-based intrusion detection, IEEE Trans. Comput. 51 (7) (2002) 810–820.
[4] Y. Duan, X. Li, X. Yang, L. Yang, Network security situation factor extraction based on random forest of information gain, in: Proceedings of the 2019 4th International Conference on Big Data and Computing, 2019, pp. 194–197.
[5] N.B. Nanda, A. Parikh, Hybrid approach for network intrusion detection system using random forest classifier and rough set theory for rules generation, in: International Conference on Advanced Informatics for Computing Research, Springer, 2019, pp. 274–287.
[6] L.E. Jim, J. Chacko, Decision tree based AIS strategy for intrusion detection in MANET, in: TENCON 2019-2019 IEEE Region 10 Conference (TENCON), IEEE, 2019, pp. 1191–1195.
[7] P. Nancy, S. Muthurajkumar, S. Ganapathy, S.S. Kumar, M. Selvi, K. Arputharaj, Intrusion detection using dynamic feature selection and fuzzy temporal decision tree classification for wireless sensor networks, IET Commun. 14 (5) (2020) 888–895.
[8] B.G. Narendrasinh, D. Vdevyas, Flbs: Fuzzy lion bayes system for intrusion detection in wireless communication network, J. Cent. South Univ. 26 (11) (2019) 3017–3033.
[9] H. Zhang, Y. Li, Z. Lv, A.K. Sangaiah, T. Huang, A real-time and ubiquitous network attack detection based on deep belief network and support vector machine, IEEE/CAA J. Autom. Sin. 7 (3) (2020) 790–799.
[10] C. Di, Y. Su, Z. Han, S. Li, Learning automata based svm for intrusion detection, in: International Conference in Communications, Signal Processing, and Systems, Springer, 2017, pp. 2067–2074.
[11] S.S.S. Reddy, P. Chatterjee, C. Mamatha, Intrusion detection in wireless network using fuzzy logic implemented with genetic algorithm, in: Computing and Network Sustainability, Springer, 2019, pp. 425–432.
[12] Y. Zhang, P. Li, X. Wang, Intrusion detection for iot based on improved genetic algorithm and deep belief network, IEEE Access 7 (2019) 31711–31722.
[13] A. Alsaeedi, M.Z. Khan, Performance analysis of network intrusion detection system using machine learning, Int. J. Adv. Comput. Sci. Appl. 10 (2019) 671–678.
[14] M.G. Raman, N. Somu, K. Kirthivasan, V.S. Sriram, A hypergraph and arithmetic residue-based probabilistic neural network for classification in intrusion detection systems, Neural Netw. 92 (2017) 89–97.
[15] G.C. Fernández, S. Xu, A case study on using deep learning for network intrusion detection, in: MILCOM 2019-2019 IEEE Military Communications Conference (MILCOM), IEEE, 2019, pp. 1–6.
[16] G. Kumar, K. Thakur, M.R. Ayyagari, Mlesidss: machine learning-based ensembles for intrusion detection systems - a review, J. Supercomput. 76 (2020) 1–34.
[17] A. Zimba, H. Chen, Z. Wang, M. Chishimba, Modeling and detection of the multi-stages of advanced persistent threats attacks based on semi-supervised learning and complex networks characteristics, Future Gener. Comput. Syst. 106 (2020) 501–517.
[18] F. Sabahi, A. Movaghar, Intrusion detection: A survey, in: 2008 Third International Conference on Systems and Networks Communications, IEEE, 2008, pp. 23–26.
[19] E. Kabir, J. Hu, H. Wang, G. Zhuo, A novel statistical technique for intrusion detection systems, Future Gener. Comput. Syst. 79 (2018) 303–318.
[20] D. Papamartzivanos, F.G. Mármol, G. Kambourakis, Dendron: Genetic trees driven rule induction for network intrusion detection systems, Future Gener. Comput. Syst. 79 (2018) 558–574.
[21] S. Carta, A.S. Podda, D.R. Reforgiato Recupero, R. Saia, A local feature engineering strategy to improve network anomaly detection, Future Internet 12 (10) (2020) 177.
[22] A. Khraisat, I. Gondal, P. Vamplew, J. Kamruzzaman, A. Alazab, Hybrid intrusion detection system based on the stacking ensemble of c5 decision tree classifier and one class support vector machine, Electronics 9 (1) (2020) 173.
[23] G. Li, Z. Yan, Y. Fu, H. Chen, Data fusion for network intrusion detection: a review, Secur. Commun. Netw. 2018 (2018) 1–16.
[24] W. Wang, Y. Sheng, J. Wang, X. Zeng, X. Ye, Y. Huang, M. Zhu, Hast-ids: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection, IEEE Access 6 (2017) 1792–1806.
[25] Y. Li, Y. Xu, Z. Liu, H. Hou, Y. Zheng, Y. Xin, Y. Zhao, L. Cui, Robust detection for network intrusion of industrial iot based on multi-cnn fusion, Measurement 154 (2020) 107450.
[26] N. Demir, G. Dalkılıç, Modified stacking ensemble approach to detect network intrusion, Turk. J. Electr. Eng. Comput. Sci. 26 (1) (2018) 418–433.
[27] R. Saia, S. Carta, D.R. Recupero, A probabilistic-driven ensemble approach to perform event classification in intrusion detection system, in: KDIR, 2018, pp. 139–146.
[28] B.A. Tama, K.-H. Rhee, Performance evaluation of intrusion detection system using classifier ensembles, Int. J. Internet Protoc. Technol. 10 (1) (2017) 22–29.
[29] S. Subudhi, S. Panigrahi, Application of optics and ensemble learning for database intrusion detection, J. King Saud Univ.-Comput. Inf. Sci. (2019).
[30] S. Rajagopal, P.P. Kundapur, K.S. Hareesha, A stacking ensemble for network intrusion detection using heterogeneous datasets, Secur. Commun. Netw. 2020 (2020) 1–9.
[31] A. Alaba, S. Maitanmi, O. Ajayi, An ensemble of classification techniques for intrusion detection systems, Int. J. Comput. Sci. Inf. Secur. 17 (2019) 24–33.
[32] V. Dutta, M. Choraś, M. Pawlicki, R. Kozik, A deep learning ensemble for network anomaly and cyber-attack detection, Sensors 20 (16) (2020) 4583.
[33] O.O. Olasehinde, O.V. Johnson, O.C. Olayemi, Evaluation of selected meta learning algorithms for the prediction improvement of network intrusion detection system, in: 2020 International Conference in Mathematics, Computer Engineering and Computer Science (ICMCECS), IEEE, 2020, pp. 1–7.
[34] O. Oriola, A stacked generalization ensemble approach for improved intrusion detection, Int. J. Comput. Sci. Inf. Secur. (IJCSIS) 18 (5) (2020) 62–67.
[35] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: A survey, ACM Comput. Surv. (CSUR) 41 (3) (2009) 1–58.
[36] M.A. Ferrag, L. Maglaras, A. Ahmim, M. Derdour, H. Janicke, Rdtids: Rules and decision tree-based intrusion detection system for internet-of-things networks, Future Internet 12 (3) (2020) 44.

H. Zhang, J.-L. Li, X.-M. Liu et al. Future Generation Computer Systems 122 (2021) 130–143

[37] X. Li, W. Chen, Q. Zhang, L. Wu, Building auto-encoder intrusion detection system based on random forest feature selection, Comput. Secur. 95 (2020) 101851.
[38] A. Özgür, H. Erdem, A review of kdd99 dataset usage in intrusion detection and machine learning between 2010 and 2015, PeerJ Preprints 4 (2016) e1954v1.
[39] M. Ahmed, A.N. Mahmood, J. Hu, A survey of network anomaly detection techniques, J. Netw. Comput. Appl. 60 (2016) 19–31.
[40] N. Moustafa, J. Slay, The evaluation of network anomaly detection systems: Statistical analysis of the unsw-nb15 data set and the comparison with the kdd99 data set, Inf. Secur. J.: Glob. Perspect. 25 (1–3) (2016) 18–31.
[41] I. Sharafaldin, A.H. Lashkari, A.A. Ghorbani, Toward generating a new intrusion detection dataset and intrusion traffic characterization, in: ICISSP, 2018, pp. 108–116.
[42] S. Elhag, A. Fernández, A. Bawakid, S. Alshomrani, F. Herrera, On the combination of genetic fuzzy systems and pairwise learning for improving detection rates on intrusion detection systems, Expert Syst. Appl. 42 (1) (2015) 193–202.
[43] N. Moustafa, J. Slay, Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set), in: 2015 Military Communications and Information Systems Conference (MilCIS), IEEE, 2015, pp. 1–6.
[44] C. Ieracitano, A. Adeel, F.C. Morabito, A. Hussain, A novel statistical analysis and autoencoder driven intelligent intrusion detection approach, Neurocomputing 387 (2020) 51–62.
[45] C. Khammassi, S. Krichen, A ga-lr wrapper approach for feature selection in network intrusion detection, Comput. Secur. 70 (2017) 255–277.
[46] W.A. Ghanem, A. Jantan, Training a neural network for cyberattack classification applications using hybridization of an artificial bee colony and monarch butterfly optimization, Neural Process. Lett. 51 (1) (2020) 905–946.
[47] Y. Zhou, G. Cheng, S. Jiang, M. Dai, Building an efficient intrusion detection system based on feature selection and ensemble classifier, Comput. Netw. 174 (2020) 107247.

Hao Zhang was born in An Hui, China in 1981. He received the B.S. and M.S. degrees in Computer Science from the University of Electronic Science and Technology of China, in 2002 and 2006. He received the Ph.D. degree in Applied Mathematics from Fuzhou University, in 2015. He is currently an associate professor at the College of Mathematics and Computer Science, Fuzhou University, China. His research interests include artificial intelligence, machine learning and cyberspace security.

Jieling Li was born in Fu Jian, China in 1995. He is currently pursuing the master’s degree in computer with Fuzhou University. His research interests include machine learning and cyberspace security.

Ximeng Liu received the B.Sc. degree in electronic engineering from Xidian University, Xi’an, China, in 2010, and the Ph.D. degree in cryptography from Xidian University, China, in 2015. He is currently a Full Professor with the College of Mathematics and Computer Science, Fuzhou University. He was a Research Fellow with the School of Information System, Singapore Management University, Singapore. He has published more than 250 papers on the topics of cloud security and big data security, including papers in the IEEE TRANSACTIONS ON COMPUTERS, the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, the IEEE TRANSACTIONS ON SERVICE COMPUTING, and the IEEE INTERNET OF THINGS JOURNAL. His research interests include cloud security, applied cryptography, and big data security. He is a member of ACM and CCF. He was awarded the Minjiang Scholars Distinguished Professor, Qishan Scholars in Fuzhou University, and ACM SIGSAC China Rising Star Award, in 2018. He served as a program committee member for several conferences, such as the 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 2017 IEEE Global Communications Conference, and 2016 IEEE Global Communications Conference. He served as the Lead Guest Editor for Wireless Communications and Mobile Computing (WCMC), International Journal of Distributed Sensor Networks (IJDSN), and Wiley Transactions on Emerging Telecommunications Technologies (ETT).

Chen Dong received the B.S. and M.S. degrees from the College of Mathematics and Computer Science, Fuzhou University, China, in 2002 and 2005, respectively, and the Ph.D. degree in computer science from the Computer School, Wuhan University, China, in 2011. She was a Visiting Researcher with the University of California at Los Angeles, from 2015 to 2016. She is currently an Assistant Professor with the College of Mathematics and Computer Science, Fuzhou University. Her research interests include intelligent computing, integrated circuit physical and security design, and artificial intelligence.
