International Journal of Information Security (2023) 22:1781–1798

https://doi.org/10.1007/s10207-023-00718-7

REGULAR CONTRIBUTION

Effective network intrusion detection using stacking-based ensemble approach

Muhammad Ali1,2 · Mansoor-ul-Haque1,2 · Muhammad Hanif Durad1,2 · Anila Usman1 ·
Syed Muhammad Mohsin3,4 · Hana Mujlid5 · Carsten Maple6

Published online: 17 July 2023


© The Author(s) 2023

Abstract
The increasing demand for communication between networked devices connected either through an intranet or the internet
increases the need for a reliable and accurate network defense mechanism. Network intrusion detection systems (NIDSs),
which are used to detect malicious or anomalous network traffic, are an integral part of network defense. This research aims
to address some of the issues faced by anomaly-based network intrusion detection systems. In this research, we first identify
some limitations of the legacy NIDS datasets, including the recent CICIDS2017 dataset, which led us to develop our novel
dataset, CIPMAIDS2023-1. Then, we propose a stacking-based ensemble approach that outperforms the overall state of the
art for NIDS. Various attack scenarios were implemented along with benign user traffic on the network topology created using
graphical network simulator-3 (GNS-3). Key flow features are extracted using cicflowmeter for each attack and are evaluated
to analyze their behavior. Several different machine learning approaches are applied to the features extracted from the traffic
data, and their performance is compared. The results show that the stacking-based ensemble approach is the most promising
and achieves the highest weighted F1-score of 98.24%.

Keywords Machine learning · Intrusion detection system · Denial of service · Ensemble-based learning · CICIDS2017 ·
GNS-3 · Performance metrics

Muhammad Ali, Mansoor-ul-Haque, Anila Usman and Hana Maple have contributed equally to this work.

https://github.com/muhamad-ali/ML-NIDS/

Corresponding author: Carsten Maple, [email protected]

Muhammad Ali, [email protected]
Mansoor-ul-Haque, [email protected]
Muhammad Hanif Durad, [email protected]
Anila Usman, [email protected]
Syed Muhammad Mohsin, [email protected]
Hana Mujlid, [email protected]

1 Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Islamabad 45650, Pakistan
2 Critical Infrastructure Protection and Malware Analysis Lab, Pakistan Institute of Engineering and Applied Sciences, Islamabad 45650, Pakistan
3 Department of Computer Science, COMSATS University Islamabad, Islamabad 45550, Pakistan
4 College of Intellectual Novitiates (COIN), Virtual University of Pakistan, Lahore 55150, Pakistan
5 Department of Computer Engineering, Taif University, Taif, Saudi Arabia
6 Cyber Security Centre, University of Warwick, Coventry, UK

1 Introduction

With the increasing use of the Internet, attacks on networks are also on the rise, with denial of service (DoS) being the most common example of a network attack. Intrusion detection systems (IDSs) are an important part of the defense mechanisms through which continuous monitoring of the

system is performed. IDS can be categorized according to their detection methodology, such as signature-based IDS, anomaly-based IDS, and stateful protocol analysis [1]. It is recognized that traditional mechanisms for detecting these attacks are no longer sufficient to deal with the increasing and diverse attacks [2, 3]. To cope with these types of attacks, an effective intrusion detection system that can handle unexpected and previously undetected attacks is required.

Network-based intrusion detection systems (NIDS) usually use signature-based detection methods to help detect attacks for which predefined attack identification rules are available. The problem occurs when there are unknown attacks for which there are no predefined rules for detection. Anomaly-based methods for detecting attacks on the network generally use artificial intelligence-based techniques, more specifically machine learning-based techniques that can detect unknown attacks on the network by distinguishing malicious traffic from normal traffic. Rather than static rules, machine learning techniques can automatically learn about network anomalies and misuse based on data behavior rather than explicit rules or information. This behavior of an anomaly-based intrusion detection system is one of the most important features that help in the detection of zero-day attacks.

Various machine learning algorithms are available, often enabling higher detection rates and optimal computation in intrusion detection systems [4, 5]. However, machine learning-based intrusion detection systems are used with caution because they can suffer from a high number of false positives [4, 6, 7]. The key to developing such an IDS is the ability to efficiently extract and select features that truly represent network traffic. Cicflowmeter [8], for example, has proven to be a good feature extractor for network traffic. It extracts layer 3 and 4 time-based features, including time intervals and various flag counts. Since it is an open-source tool, it can be easily integrated into any new solution, such as the one we proposed. All features extracted in the cicflowmeter are flow-based. This means that these features can be extracted from packet headers and session data with timestamps.

There are several datasets, such as CICIDS2017, KDD, and UNSW-NB15, that can be used to evaluate the performance of an IDS. In this study, the CICIDS2017 dataset is used to evaluate the proposed stacking-based ensemble model because it has several important features over other datasets [9]. In this work, our objective is to build an approach that can detect seen as well as unseen malicious traffic without static rules with higher detection rates while keeping the false-positive rate low. The key contributions of our study are:

• Critical analysis of the CICIDS2017 dataset, including the new insights on its understanding.
• Generation of our new dataset (CIPMAIDS2023-1) to overcome the limitations of the CICIDS2017 dataset.
• Design and implementation of the proposed stacking-based ensemble model.
• Evaluation and comparative analysis of our proposed stacking-based ensemble model on our dataset, CIPMAIDS2023-1.

The remainder of the manuscript is organized as follows: Section 2 discusses the background and related concepts and Section 3 provides a brief literature review of existing datasets and techniques applied to them in the area of network intrusion detection. Section 4 sheds light on issues with datasets, most notably those of CICIDS2017. Section 5 describes the simulation setup and implementation details, whereas the generation of the new dataset CIPMAIDS2023-1 and its methodology are discussed in Sect. 6. The results and discussions are presented in Sect. 7, and finally, Sect. 8 concludes this study along with future research directions.

2 Preliminaries

This section briefly explains the basic concepts belonging to IDS, machine learning, performance evaluation metrics, and related terms used in this study.

2.1 Intrusion detection systems

Intrusion detection systems, or more specifically network intrusion detection systems, provide a solution that examines incoming and outgoing network traffic from servers and clients to detect any type of malicious activity. Fred Cohen stated in 1987 that it is difficult to detect an invasion in any situation and that the resources required to detect anomalies grow proportionally to the number of requests [10]. A NIDS should not be confused with, but rather works together with, two other cybersecurity components: firewalls and intrusion prevention systems (IPSs). Firewalls, for example, are usually placed on the outer perimeter of the network and look for intruders before they enter the network. Their main purpose is to analyze network packet headers and filter incoming and outgoing network traffic based on some predetermined rules, which may include protocol, IP address, and port numbers. IDS, on the other hand, can also monitor and analyze activities inside the protected network. An IDS cannot take actions such as blocking the source of the attack or the malicious traffic. This is where network intrusion prevention systems (IPS) come into play. Their job is to monitor and analyze network activity and proactively block any malicious activity that could pose a threat to the network. This is one of the most difficult tasks, as an appropriate response is difficult to program and any inappropriate response will cause problems

in the active network. In this section, we address the topic of IDS and therefore focus on IDS.

Fig. 1 Intrusion detection system (IDS) categorization

2.2 IDS detection methods

The National Institute of Standards and Technology (NIST) distinguishes intrusion detection systems based on the types of events they monitor and the methods used to deploy them [1]. Further, these IDSs use a variety of approaches to detect incidents. Figure 1 shows the classification of IDSs based on their type (deployment type) and their detection methods.

A host-based intrusion detection system (HIDS) detects unusual activity by monitoring the characteristics of a particular host and the events that occur on that host. A network-based intrusion detection system (NIDS), on the other hand, monitors network traffic for specific network segments or devices and observes the behavior of network and application layer protocols for abnormal activity. Intrusion detection systems (IDSs) deployed for wireless network traffic observe and examine wireless network traffic and attempt to detect the abnormal behavior of wireless network protocols. Network behavior analysis is another type of intrusion detection system that analyzes network data to detect attacks such as denial-of-service (DoS)/distributed denial-of-service (DDoS) attacks, certain types of malware, and policy violations. To provide more comprehensive and accurate detection, most IDS technologies use different detection approaches, either alone or in combination. These basic types of detection methods are signature-based, anomaly-based, and stateful protocol analysis.

Signature-based detection methods match specific patterns in network traffic, such as a sequence of bytes, against an already known sequence of malicious actions. Once the attack patterns are identified, they are used to create signatures that can be used to detect the same threat again. Each attack has its unique signature. Already known attacks are easier to detect because we only need to match the observed pattern with one of the existing signatures in the database. It is difficult to detect new attacks based on existing signatures because their signatures are not identified.

Anomaly-based detection methods deal with unknown malicious attacks since the number of unseen attacks is increasing daily and these cannot be detected using signature mechanisms. Most anomaly-based detection methods use machine learning to define trusted traffic, and anything that deviates from this is classified as anomalous or malicious. These methods have better generalization capabilities compared to signature-based methods but have relatively high false-positive rates. One of the most important areas of research in anomaly-based network intrusion detection is to reduce these false positives.

Stateful protocol analysis is a process for identifying deviations from predefined profiles, also known as protocol models, with commonly accepted parameters for benign protocol behavior for each protocol state. These predefined profiles, or protocol models, are based on vendor-created universal profiles that specify how certain protocols should and should not be used. Variations in the implementation of each protocol are also often accounted for in the protocol models. The fundamental drawback of stateful protocol analysis is that it requires a significant amount of resources due to the complexity of the analysis and the cost of tracking the state of multiple concurrent sessions. In addition, these approaches are also unable to detect attacks that do not violate the characteristics of commonly accepted protocol behavior.

2.3 Machine learning

The impact of artificial intelligence (AI) on our daily lives is increasing. It is changing the nature of our daily activities and influencing the traditional approach to human thinking and interaction with the environment [11].

Machine learning is often considered a subfield of artificial intelligence, whose goal is to allow learning from data without being explicitly programmed to do so. The focus is on computer programs that can learn and grow with new data. This enables them to deal with unseen data [12]. For example, machine learning techniques can learn and extract sophisticated patterns for network attacks from large amounts of network data that can pose a threat to the network if undetected [13].

Machine learning is considered a viable solution especially when working with a large amount of data. It consists of two main phases, namely training and testing. In the training phase, we provide a large amount of data to an algorithm from which it learns patterns and relationships. The validation data is used in the training phase to validate each training step. As discussed in [14–16], there are numerous machine learning algorithms, each with its own advantages and disadvantages. A comprehensive survey of deep learning methods and their application in various domains is provided in [17].
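As a minimal sketch of the training, validation, and testing phases described above, the following Python snippet partitions a labeled set of flow records into three disjoint subsets. The 70/15/15 ratios, the function name `split_dataset`, and the toy records are illustrative assumptions, not the split used in this study.

```python
import random

def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle samples and split them into train/validation/test lists.

    The default fractions are illustrative assumptions, not values
    taken from the paper.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test share")
    shuffled = list(samples)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out for testing
    return train, val, test

# Toy records of the form (feature_vector, label); hypothetical data.
records = [([i, i % 7], "attack" if i % 5 == 0 else "benign") for i in range(100)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

The model is fitted on `train`, tuned against `val` during training, and scored once on `test`, which it has never seen. For strongly imbalanced attack classes, a stratified split that preserves class proportions in each subset would usually be preferable to the plain shuffle shown here.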

In this research work, the ML-based approach is used to eliminate false positives and false negatives from the system. A false positive is a type of alert or detection that occurs in normal traffic when there is no intrusion or attack in progress, which is a problem in rule-based NIDS. A false negative is defined as the failure of a rule to alert or detect an actual attack, which also occurs in rule-based NIDS [18].

Another popular approach is ensemble methods. In this method, multiple models are trained independently and their predictions are combined to create a final prediction. In the stacking-based ensemble approach, the base-level models are trained and their predictions are passed to the secondary-level models, also known as meta-models. Meta-models are generally simpler models. This technique has its own advantages, including combining the strengths of multiple models. Ensemble methods generally improve the performance of the final model [19]. However, they require careful fine-tuning of base and meta-models.

In this work, we focus on flow-based network features as it is more convenient to extract and store these features. There are many open-source extractors for flow features that are helpful in this area. Deep packet inspection is generally complicated and difficult to use and in most cases very slow at processing the traffic load. In developing a machine learning-based classifier, two steps are required: training and testing. Each selected model has its own training techniques, advantages, and disadvantages. This process of building a model, specifying its architecture, training, and testing is usually driven as a whole by well-known frameworks such as Scikit-learn [20], TensorFlow [21], PyTorch [22], MATLAB [23], or Weka [24].

Each framework has its own impact on the optimization time of the algorithm and the degree of optimizations that can be performed on it. TensorFlow provides the most support for developers, while PyTorch is currently used with more control over machine learning tasks but less support for developers. We used Keras, a wrapper around TensorFlow, and Scikit-learn for machine learning tasks.

2.4 Evaluation metrics

The main evaluation measures commonly used to evaluate machine learning-based IDS include recognition rate (true-positive rate), false-positive rate, accuracy, F1-score, and ROC curves [16, 25, 26]. True positive (TP) and false positive (FP) represent the number of samples correctly classified as attack class and samples falsely classified as attack class, respectively. True negative (TN) and false negative (FN) represent the number of samples correctly classified as a normal class and samples incorrectly classified as a normal class, respectively.

ROC curves are mostly used in binary detection. Accuracy measures the ratio between the total number of correct classifications and the total number of samples. Accuracy can be calculated using Eq. 1.

\[ \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

Precision measures the ratio of correct positive classifications to the total number of predicted positive classifications, as shown in Eq. 2.

\[ \text{precision} = \frac{TP}{TP + FP} \tag{2} \]

Recall measures the number of correct positive classifications relative to the total number of positive samples in the data. It is calculated as shown in Eq. 3.

\[ \text{recall} = \frac{TP}{TP + FN} \tag{3} \]

The F1-score measures the weighted average, i.e., the harmonic mean of precision and recall, and is calculated as shown in Eq. 4.

\[ \text{F1-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{4} \]

We used balanced precision, which is most often used in anomaly detection when one of the classes under study occurs much more frequently than others. Another metric used for unbalanced multiclass data is the weighted F1-score. This score is calculated by evaluating the F1-score for each label and then taking the weighted average corresponding to the number of samples of that class in the dataset. The weighted F1-score is the primary evaluation metric used in this work. One could use execution time as an evaluation metric. In this work, execution time is not used as a primary evaluation criterion because it can vary depending on the computational power of the computer used, and our goal is offline detection from a full packet capture.

3 Literature review

The following section presents a comprehensive review of the literature on machine learning approaches to anomaly-based network intrusion detection systems. With the ever-evolving nature of network protocols, older datasets have become inadequate for modern intrusion detection scenarios. One of the most significant examples is the QUIC protocol introduced by Google in 2012 [27], which is now a major component of sample traffic for most Google-based services. Consequently, there has been a growing need for dynamic datasets that account for these changes.

In [28], the authors used a convolutional neural network to detect network attacks, but the model only achieved high

detection rates for 10 out of 14 network attacks, highlighting the class bias of deep neural networks for the CICIDS2017 dataset. Similarly, in [29], the authors proposed an artificial neural network (ANN) approach to shellcode detection, resulting in better performance than signature-based detection methods and significantly reducing false positives.

In [30], the authors developed an effective intrusion detection approach using decision trees for the CICIDS2018 dataset and concluded that this approach could be extended to real-time networks, as long as the feature set remained consistent. In [31], the authors used an efficient wrapper-based feature selection method combined with an ANN to achieve a high detection rate. V. Kanimozhi and T. Prem Jacob [32] trained a deep neural network for binary classification using the recent dataset CSE-CICIDS2018 and concluded that the model could be extended to all classes of network attacks.

In [33], the authors used a hybrid multimodal solution to enhance the performance of intrusion detection systems (IDS). They developed an ensemble model using a meta-classification approach enabled by stacked generalization and used two datasets, UNSW NB15 (a packet-based dataset) and UGR’16 (a flow-based dataset), for experiments and performance evaluation. In [34], a comprehensive study of different ML techniques and their performance on the CICIDS2017 dataset is presented. This study includes both supervised and unsupervised ML algorithms for evaluation.

Additionally, a comprehensive review of anomaly detection strategies for high-dimensional Big Data is presented in [35], including limitations of traditional techniques and current strategies for high-dimensional data. Various techniques based on Big Data that can be used to optimize anomaly detection are also discussed. In [36], the authors achieved the highest accuracy of 99.46% against the NSL-KDD dataset using the CatBoost classifier. In [37, 38], ensemble methods were employed to achieve accuracies of 99.90% and 99.86%, respectively, using the KDD CUP’99 dataset.

In conclusion, several key points can be derived from the existing literature on network intrusion detection systems (NIDSs). Firstly, rule-based systems are outdated due to emerging attack patterns and should be replaced by more advanced methods of intelligent detection based on mathematical models. Secondly, machine learning approaches show great potential for detecting intrusions in large networks, but the field is still developing due to problems such as insufficient training data and high false-positive rates. Finally, network-specific data are crucial for testing the performance of machine learning models. A chronological summary of the state-of-the-art literature on NIDS is presented in Table 1.

4 Datasets

One of the most important issues in network intrusion detection is testing and evaluating the approach. Datasets used for training in NIDS, therefore, play an important role in the evaluation and validation of NIDS. Datasets are also important because they help in replicating the desired experiment, evaluating new approaches, and comparing different approaches. Variance in datasets allows machine learning algorithms to learn abnormal or malicious patterns and distinguish them from normal patterns. Authors have shown the applicability of AI algorithms on several IDS datasets [39]. Several benchmark datasets are publicly available, e.g., KDD CUP’99, NSL-KDD, CAIDA 2007, UNSW-NB15, CICIDS2017, etc. [15, 40, 41].

The most frequently used dataset in the literature is KDD CUP’99 and its derived NSL-KDD dataset. Table 1 shows the chronological summary of some recent studies performed on these datasets [42, 43]. However, these datasets do not represent the current network architecture and attack methods.

Cyber security researchers have created synthetic NIDS datasets using simulated environments to obtain labeled realistic network traffic [44]. For this purpose, a test environment is first developed to simulate normal traffic between different servers and clients. This simulation is performed to overcome the security and privacy issues faced by real production networks. Therefore, these controlled environments are more reliable and secure than production or real networks. Various attacks are launched, attack scenarios are simulated, and tagged traffic is generated. The captured network traffic can be in the form of packet capture (pcap) or NetFlow (network flow) or sFlow (advanced network format). Features can then be extracted from the pcap files using appropriate tools such as cicflowmeter, a tool developed by the CIC Institute to extract previously created features, or an open-source tool such as Zeek (IDS). This results in the labeled network flows, which are composed of benign and attack traffic.

These network traffic features are critical because they are used to distinguish between normal and attack traffic. Therefore, they should have the necessary variance to capture these patterns. They must also be an optimal size so that solutions based on these characteristics are scalable and practical. Their use is the most important task.

However, due to the lack of a standard set of features in network intrusion datasets, most approaches are domain-specific and cannot be generalized as a whole. As a result, each dataset consists of unique features derived from domain research. These exclusive and distinct feature sets pose a problem for evaluating ML models across multiple datasets and limit the reliability of the evaluation methods. The main problems, leading to a large gap between academic research and deployed ML-based NIDS solutions, are described below.

• Different capture and storage methods for NIDS features due to domain-specific security events.
• The inability of machine learning models to generalize with a fixed feature set across multiple datasets from different network topologies.
• Lack of a universal aggregation method that can combine network data streams from different sources and generate standard feature sets.

Machine learning algorithms require a large amount of data to train. Quality and quantity of data play a crucial role in the testing performance of any machine learning algorithm. Most of the time, high-quality data beat better algorithms. Unfortunately, these types of datasets are expensive, have privacy issues, and are difficult to generate. They are very time-consuming to generate, which is why most approaches use publicly available datasets for validation and testing [45].

Table 1 Chronological summary of state-of-the-art literature on network intrusion detection systems (NIDSs)

Paper  Year  Approach             Dataset                 Results
[29]   2018  ANN                  Exploit Database        Average accuracy = 98%
[30]   2018  Decision trees       CICIDS2017              Average accuracy = 99%
[32]   2019  ANN                  CSE-CIC-IDS2018         Average accuracy = 99.97%
[33]   2020  Ensemble             UNSW NB15 and UGR’16    Average accuracy = 97%
[28]   2021  CNN                  CICIDS2017              TNR = 98.984%
[36]   2021  CatBoost classifier  NSL-KDD                 Accuracy = 99.46%
[37]   2022  Ensemble             KDD CUP’99              Accuracy = 99.90%
[38]   2022  Ensemble             KDD CUP’99              Accuracy = 99.86%

4.1 Limitations of CICIDS2017 dataset

In our study, we used the CICIDS2017 dataset to train our machine learning models to detect network attacks. However, we observed some issues with the dataset that led us to capture our own traffic based on similar attacks. The purpose of this was to test our models on traffic from our network to avoid the problem of generalization that is commonly encountered when testing models on different network topologies.

The generalization problem arises when models trained on a specific network topology fail to detect attacks on other network topologies. In our case, even though our machine learning models achieved a balanced accuracy greater than 95% on CICIDS2017 traffic, they had only a 26% recall on other network topologies, with less than 50% accuracy. Therefore, we concluded that capturing attack traffic from the network on which we intend to test the model is necessary to ensure that the model’s performance can be generalized. However, due to privacy and security concerns, it was not possible for us to capture real network traffic that is publicly accessible. As such, we used a network simulator to simulate victim and attack networks, and the traffic generated in these networks was captured in near real-time.

It is important to note that each network has a specific type of benign traffic that must be considered when applying machine learning models. For instance, in our network topology, Windows operating systems have Windows Update services that constantly communicate with Microsoft servers to receive important security updates. Additionally, most Google services communicate using the QUIC protocol, which is a low-latency Internet transport protocol over UDP that combines the advantages of both TCP and UDP [46].

To generate normal traffic, we simulated web browsing, file downloads, email, SSH, FTP sessions, and other major services such as Google and Microsoft for an entire week. The traffic was generated from multiple versions of Windows and Linux operating systems running on users’ virtual machines (VMs). Our goal was to create a comprehensive dataset that accurately reflects the traffic patterns and network topology of our network. Figure 2 provides a visual representation of our network topology.

5 Simulation setup and implementation details

5.1 Simulation setup

In [47], a standard feature set is created based on NetFlow features. The authors concluded that this leads to a reduction in overall prediction time with comparable results. Therefore, these features can lead to overall superior features when additional metrics are used with them. All experimental and simulation work is performed on a desktop computer with 52GB of memory, Intel Core i7-7800 CPU @ 3.40 GHz, and 930GB of hard drive space. The host operating system is Windows 10, and the main programming frameworks used are Python-based libraries such as Sklearn, Keras, and Flask. We have used the Anaconda distribution as the simulation environment

Fig. 2 GNS3 network topology

for this research that included the aforementioned Python libraries.

5.2 Network topology

Graphical network simulator-3 (GNS-3) [48] is used in this study to simulate a full network topology and implement various attack scenarios along with benign user traffic. Figure 2 shows our experimental setup with a victim network and an attacker network separated by a typical firewall. The firewall we used in our network topology is Pfsense. By default, the local area network (LAN) side of the firewall is a trusted network, while the wide area network (WAN) side of the firewall is untrusted. Traffic initiated from WAN to LAN is blocked by default. We have allowed this traffic so that we can perform network attack scenarios. The victim’s network also has access to the Internet through network address translation (NAT).

5.2.1 Victim network

We use several popular operating systems (OS) for the victim network: Microsoft Windows in different versions, including Windows 10, Windows 7, and Windows XP, and several Linux distributions, including Ubuntu 16, Ubuntu 18, and Ubuntu 12. The victim network can communicate with the Internet to simulate normal day-to-day traffic, including Windows updates and other network communications. The services that run on the victim network include FTP, SSH, HTTP, OpenSSL, and HTTPS. These services are later exploited to simulate network attacks. DVWA, a vulnerable web application, is also run on one of the victim computers to simulate web attacks.

5.2.2 Attacker network

On the attacker side, we have Kali Linux, a popular offensive cybersecurity OS. Ubuntu 18 and Windows 10 are also used on the attacker side so that network attacks are performed from multiple operating systems. The main tools used in Kali Linux are Patator, LOIC, and the Goldeneye DoS script.

5.2.3 Firewall

The firewall used in this network topology is Pfsense [49]. It was chosen because it is open source and can be easily configured. It can act as both a firewall and a router. We have two network sides, namely LAN and WAN. The LAN side is trusted by default, while the WAN side is untrusted. Traffic from WAN to LAN is blocked by default. We need to allow this traffic to perform our network attacks.

5.2.4 Traffic capture

The victim’s network traffic is recorded using the well-known network recording and processing tool Wireshark, running on a virtual machine with Ubuntu 18 as the operating system. All communication that takes place over the victim’s network is mirrored to this virtual machine. Here, the traffic is recorded for several attack scenarios and forwarded to the cicflowmeter.


5.3 Attack traffic generation on GNS3 topology

This subsection discusses the implemented attacks in detail. The following attacks were implemented on the GNS3 network shown in Fig. 2, and their network traffic was recorded. The main attack categories include denial of service, brute force, reconnaissance, and OpenSSL exploits.

5.3.1 DoS attacks

First, a DoS attack is carried out using the low orbit ion cannon (LOIC) tool. This tool performs flooding attacks in three variants: UDP, TCP, and HTTP. All variants use the same mechanism. We used the TCP variant of the attack in our network. The LOIC app shows the current status of the attack by indicating the requests that failed and the threads that are trying to connect to the server or are already connected to it. We used TCP SYN Flood to exploit the three-way TCP handshake. The attacker and victim machines are Kali Linux and Ubuntu 16, respectively. In this way, multiple incomplete connection requests overload the server in terms of resources, making it unreachable.

The second DoS attack is slowloris. It is carried out using the slowloris.py Python script from Kali Linux. The victim machines are Ubuntu 16 and Ubuntu 12. This attack works by sending partial HTTP requests that never complete, so that they occupy all connections to the web server. This is intended to consume all system resources, causing the system service to be interrupted. The web server is contacted with a large number of connections, and the connections persist for a long time.

The third DoS attack is Goldeneye. It is implemented using the Goldeneye DoS script on the Kali Linux attacker machine. The victim machine is Ubuntu 16. This is an application-level DoS attack whose intent is to disrupt the service. It is executed using a Python script and therefore can run on all Linux and Windows platforms.

The next attack is the DoS attack using the slowhttptest tool. This attack is similar to slowloris, but it is a more stealthy and interactive version. It can also act like a slowloris attack. The attacker machine is Kali Linux, while the victim machine is Ubuntu 16. This application-level DoS attack tool uses low bandwidth and consumes server resources through a concurrent connection pool. The HTTP protocol requires a request to be received completely before it is processed. So, if the connection speed is slow, the server is kept busy waiting for clients to complete their requests, making it inaccessible to legitimate users.

The last DoS attack uses Hulk, an application-layer DoS attack tool. Kali Linux and an Ubuntu VM are used as attacker and victim, respectively. It is an HTTP flood tool that generates a unique and obfuscated traffic volume. Due to this obfuscation, it is harder to detect than other DoS attack tools, which produce predictable, repetitive patterns. The main principle of this tool is to create a unique pattern for each request so that it can evade intrusion detection and prevention systems. This attack causes a high number of packets, which leads to a high number of data flows.

5.3.2 Brute force attacks

The brute force attack is used to crack the credentials of a user or service. We have performed this attack using three protocols. The first protocol used is secure shell (SSH). Patator, a Kali Linux tool, is used to perform brute force attacks on the SSH protocol. SSH is a protocol used to secure multiple services over the Internet. Most commonly, it is used for logging into a remote server. The victim VM is Ubuntu 16, which runs an SSH service. We provided the tool with two files containing usernames and passwords. The brute force attempt produces a large number of small TCP packets at high frequency while trying to crack usernames and passwords.

The second protocol that is exploited is the file transfer protocol (FTP). The brute force attack on the FTP protocol is performed using the Patator tool (a penetration testing tool). FTP is used for file transfer between clients and servers. It requires a valid username and password. This attack is used to extract the username and password of the user to gain access to the remote system. The victim is Ubuntu 16, which runs the FTP service. The username and password files used in the SSH brute force attack are also used here. We find similar behavior to the SSH brute force attack. A large number of unsuccessful password attempts over a short period of time resulted in small TCP packets with high frequency. Overall, the bandwidth consumption of this attack is low.

The last protocol used is the hypertext transfer protocol (HTTP). For the brute force attack on HTTP, we configured the damn vulnerable web app (DVWA), a deliberately vulnerable web application used for practicing network attacks. Its brute force section contains a web page that is used to test such attacks. The Patator tool is used to perform this attack. The behavior of this attack is very similar to the previously mentioned brute force attacks. The brute force page comes after the first login page of the app. For this reason, we also need to use session cookies in the attack command.

5.3.3 Heartbleed attack

Heartbeat is an important part of the TLS protocol. It is used to confirm that the corresponding device is online. Heartbleed is a vulnerability in the OpenSSL library’s implementation of this TLS heartbeat mechanism. Heartbeat works with a technique where a client sends a random payload to the server. Then, the server must


respond to the client with the same payload. This is called a heartbeat request and response. The client must also specify the length of the random payload. This is where the weakness lies. There is no verification that the specified length matches the actual payload sent by the client.

Suppose the client sends 20 bytes of random payload and specifies a length of 500. The server then sends back those 20 bytes of random payload followed by 480 bytes of data from the adjacent memory block, whatever it may contain. This adjacent block may contain sensitive information such as personal data and passwords. The attacker can repeat this process indefinitely to read further portions of the server memory. The victim’s machine is Ubuntu 12, running OpenSSL v1. The attacker first uses nmap to detect the OpenSSL vulnerability and then performs the attack with the Metasploit tool.

5.3.4 Web attacks

Two attacks are carried out based on the exploitation of web forms. The first one is SQL injection. SQL injection is one of the most popular attacks on the Internet and targets vulnerable web forms that accept uncontrolled queries from users. It involves exploiting the database layer of a web application that uses user input directly in database queries. Typical malicious queries steal or delete database data. The victim is a DVWA application configured on Ubuntu 16, while the attacker is running Kali Linux. This attack is quite difficult to detect because its traffic is almost the same as normal traffic. Even a small change in the user’s input request can lead to this dangerous attack.

The second web attack is a cross-site scripting attack. This attack is similar to the SQL injection attack in that we inject malicious scripts into the uncontrolled user input. Once a malicious script is injected, it is delivered to the victims when they visit the website. This can lead to fraud, execution of malicious scripts, and data theft. The DVWA application is attacked from the Kali Linux machine. Like SQL injection, this attack is also difficult to detect.

5.3.5 Portscan

This is a popular attack used in the reconnaissance phase of an attack. The attacker uses it to gather information about systems and plan how to penetrate them. It reveals information about operating systems, possible vulnerabilities, running services, and port status. In a PortScan attack, the attacker attempts to communicate with each port, which is usually in the range of 0 to 65,535, and then interprets from the response whether that port is in use or not, leading to the detection of weak entry points.

6 Dataset generation and processing methodology

Due to several issues discussed previously, we created our own new dataset CIPMAIDS2023-1 that includes all the attacks listed in CICIDS2017 and uses similar tools so that we can compare the performance of our proposed stacking-based ensemble model. The dataset generation flowchart is shown in Fig. 3.

Fig. 3 Dataset generation flowchart

6.1 Cicflowmeterv4 features extraction

Cicflowmeter is an open-source tool for extracting 84 network traffic features proposed in [8]. These features represent useful network traffic characteristics. The tool takes the traffic capture file and generates a CSV file based on these 84 features. The features are based on the layer 3 and 4 header information of the OSI model and on other time- and volume-based information of the network traffic, such as the number of bytes, packets, and durations in both forward and reverse directions. The layer 3 header specifies the source IP, destination IP, and type of service, while the layer 4 header contains a variety of fields, including source and destination ports, sequence number, acknowledgement number, and flags such as URG, ACK, PSH, RST, SYN, and FIN. Both TCP and UDP flows are generated. UDP flows usually end with the flow timeout, while TCP flows end with either a FIN packet or the flow timeout. The default timeout for flows is 120,000,000 microseconds, i.e., 120 s.

6.2 Data distribution

Twenty percent of the total data is set aside as test data. The remaining data is split into 75% training data and 25% validation data, which corresponds to 60% training, 20% validation, and 20% test data overall. The overall distribution of the data is shown in Fig. 4.

Fig. 4 Distribution of all network flows against each attack

Three classes have the fewest flows: Heartbleed, SQL injection, and cross-site scripting. The absolute frequencies of network flows for each attack are shown in Table 2.

6.2.1 Feature preprocessing

First, all socket features are removed from the overall features because they do not properly reflect network traffic. A socket is basically a combination of IP and port. When two devices communicate with each other, each side has its own socket. Sockets can change while the traffic exhibits the same benign or malicious behavior. These features include flow ID, destination IP, source IP, destination port, and timestamp. The attacker can also use spoofed IP addresses and ports other than the known ones to perform network attacks. After removing these features, we found some features with near-zero variance. Zero variance means that they do not contribute to the representation of attacking or benign network traffic. These features are BwdPSHFlags, BwdURGFlags, FwdBulkRateAvg, FwdBytes/BulkAvg, FwdPacket/BulkAvg, FwdURGFlags, SubflowBwdPackets, and URGFlagCount. Three of these 8 features relate to the URG flag, which is one of the layer 4 flags in the open systems interconnection (OSI) model. This may be because this flag is seldom used in practice. Some of the flows in the "FlowBytes/s" feature contained zero values. These flows were removed from the data because zero values cannot be entered into the machine learning algorithm. Another problem was the non-finite values in the flow features "FlowBytes/s" and "FlowPackets/s". The total number of these values was less than 60. These flows were also removed from the data to prepare it for a machine learning algorithm.

6.2.2 Feature importance

A random forest regressor is used to calculate the importance of the features. The importance is calculated for each attack individually and for the combined brute force, DoS, and web attack categories in the dataset. As a result, each feature is ranked with an importance value relative to the other features. These features are then used to analyze network attack behavior. The most important features of each attack give us a good insight into the behavior of network attacks. Tables 4 and 5 show the 5 most important features of each attack.

Based on the above important features for each class, we can conclude that Flow IAT is the most important feature for


network attacks. This is essentially the time interval between two consecutive packets in a flow. The corresponding features include the maximum, minimum, mean, and standard deviation of the time between consecutive packets in a flow.

Table 2 Distribution of network flows of each class

Class label       No. of samples   Category
Normal            70,692           Benign traffic
Portscan          48,085           Reconnaissance
DoS(hulk)         15,927           DoS
DoS(slowhttp)     9668             DoS
DoS(goldeneye)    9136             DoS
DoS(slowloris)    5683             DoS
Ftppatator        5587             Brute-force
Httpwebattack     5456             Brute-force
Sshpatator        5405             Brute-force
DoS(LOIC)         5239             DoS
Xssweb            477              Cross-site scripting
Sqlinjweb         363              SQL injection
Heartbleed        247              SSL exploit

Table 3 Comparison of baseline machine learning models results on CIPMAIDS2023-1

Model                 Parameters          Weighted F1-score (%)   Dataset
SVM (Linear)          C=10, gamma=0.1     89.12                   CIPMAIDS2023-1
SVM (RBF)             C=100, gamma=1      91.11                   CIPMAIDS2023-1
KNN                   k=3                 96.6                    CIPMAIDS2023-1
RF                    120 Estimators      97.7                    CIPMAIDS2023-1
Logistic Regression   Max_iter=1000       87.14                   CIPMAIDS2023-1

Table 4 The distribution of the top 5 important features of each attack

Attack             Feature                    Importance
Port Scan          FlowDuration               0.042642
                   FlowIATMax                 0.003185
                   FlowPackets/s              0.00094
                   FlowBytes/s                0.000524
                   FlowIATMean                0.000468
SSH Brute-force    FlowIATMax                 0.001384
                   FlowIATMean                0.000573
                   FlowPackets/s              0.000236
                   TotalLengthofBwdPacket     0.000222
                   FlowBytes/s                0.000145
DoS Hulk           BwdPacketLengthMean        0.014337
                   BwdPacketLengthStd         0.003544
                   FwdPacketLengthMax         0.003221
                   FlowIATMin                 0.003066
                   FlowIATMean                0.002792
DoS LOIC           TotalLengthofBwdPacket     0.009078
                   FlowDuration               0.00152
                   FwdPacketLengthMean        0.000714
                   FlowIATMax                 0.000644
                   TotalLengthofFwdPacket     0.000643
DoS slowhttptest   TotalLengthofBwdPacket     0.017336
                   FlowIATMin                 0.014788
                   FlowPackets/s              0.002359
                   FlowIATMean                0.001507
                   FlowDuration               0.001143
HTTP Brute-force   BwdPacketLengthMin         0.002317
                   FwdIATTotal                0.001473
                   FlowDuration               0.001223
                   FwdPacketLengthStd         0.001211
                   FlowIATMin                 0.001103

Both SQL injection and cross-site scripting have similar important features. These features include FlowPackets/s, FlowIATMin, BwdPacketLengthStd, and FlowDuration. Because they share these features, the two attacks are difficult to distinguish. The important features selected by the random forest regressor for all brute force attacks combined were FlowIATMin, FlowDuration, FwdIATTotal, FwdPacketLengthStd, and TotalBwdPackets.

For the DoS LOIC attack, no other feature even comes close to the most important feature, TotalLengthofBwdPacket. It represents the total size of the packets in the backward direction, that is, the direction in which a response is sent to the attacker. The reason for this is that this attack is a volumetric attack that consumes bandwidth using layer 3 or 4 protocols, unlike other attacks that take place at the application layer. Attacks of this type primarily include internet control message protocol (ICMP) ping flooding and TCP SYN flooding. Application layer attacks, such as slowhttptest and goldeneye, tend to be more complex and target specific application layer protocols.

If we look at the relevant features of the portscan attack, we can see that the most important features are FlowDuration and FlowIATMax. This is because in most cases portscan


focuses on small packets to test each port. FlowIATMax can be a good differentiator between short durations and a high number of attack packets compared to normal traffic. Other features include FlowBytes/s and FlowPackets/s. These are also important characteristics because portscan traffic has a high number of packets for each flow and an extremely low number of bytes.

Flow duration is one of the most important features in almost all brute force attacks. This is due to the fast testing of different usernames and passwords. The main characteristics of the combined DoS attacks are FwdPacketLengthMax, FwdIATTotal, FlowIATMean, BwdPacketLengthMean, and FlowIATMin. DoS attacks tend to disrupt the service so that a continuous stream of packets comes from the attacker’s machines. The most important features, FwdPacketLengthMax and FwdIATTotal, help distinguish the incoming traffic from the attacker network.

6.2.3 Normalization

Normalization of the data is performed after removing socket features and features with zero variance. In this work, min–max normalization is used. This is the most common type of normalization. One of its disadvantages is that it is sensitive to outliers, because the observed minimum and maximum determine the scaling, so algorithms that are strongly affected by outliers may not benefit from this transformation. Another normalization that was tried is z-score normalization. This is a simple normalization by mean and standard deviation, where the mean of the transformed data is 0 and the standard deviation is 1. This method does not depend on the specification of minimum and maximum values. We tried both normalizations on our data and decided to use the min–max normalization because of the better overall results. First, a scaler is fitted to the training data. Then, the training data, the test data, and the validation data are normalized using this scaler. Importantly, the scaler is not fitted on the validation and test data, so this data remains completely unseen.

6.2.4 Smote up-sampling

Most classes have a high number of flows, either due to high traffic or their nature. Web attacks and Heartbleed attacks do not have a large number of flows due to their nature. This leads to a bias in the machine learning models. To deal with this bias, two approaches are commonly used: first, up-sampling the minority classes or down-sampling the majority class; second, weighting each class according to its importance or number of samples. Both approaches can address the class imbalance problem, but the latter is generally recommended because it does not introduce synthetic or augmented data. Class weighting often leads to better performance, but not necessarily. In our case, several classes with a smaller number of samples were up-sampled using the SMOTE technique.

Cicflowmeter, a flow-based feature extractor, is highly dependent on the unique source-to-destination and reverse packets [8]. Therefore, it generates a large number of flows for traffic containing unique packets, such as benign traffic, portscan attacks, DoS attacks, and brute force attacks. On the other hand, web attacks and Heartbleed attacks perform minimal network communication to reach their target and therefore generate a smaller number of flows. This problem of unbalanced class data is addressed by up-sampling.

The criteria for up-sampling are as follows. We had approximately 40 thousand samples for the benign class. Any attack class that had fewer than 5000 samples in the training data was up-sampled to 5000 samples. This produced overall balanced results compared to a strong bias toward some classes when the data was not up-sampled.

7 Results and discussion

The contents of this section are as follows. First, the results of the basic machine learning models on the extracted features are presented, followed by the results of the proposed ensemble stacking classifier. The distribution of the data is the same as explained in Sect. 6.2. The main evaluation metric is the weighted F1-score, which is already explained in Sect. 2.4. The same scoring metric was used by the authors of the CICIDS2017 dataset. An ablation study has been carried out in Sects. 7.1 and 7.2 to justify our proposed ensemble approach.

7.1 Comparative evaluation of baseline models

Before turning to complex deep learning and ensemble models, several base classifiers are used. Table 3 shows the results of the base classifiers in terms of the weighted F1-score. We used SVM with multiple kernels; the best performing kernels were linear and RBF. KNN, random forest, and logistic regression are also used as base classifiers.

It can be observed in Table 3 that the overall performance of the basic classifiers on our generated dataset CIPMAIDS2023-1 is reasonably good, with weighted F1-scores between 87.14% and 97.7%. Random forest with 120 estimators performed best with a weighted F1-score of 97.7%. The confusion matrix of the random forest (Fig. 5) and its per-class precision, recall, and F1-score (Table 6) show the correct and incorrect predictions for each class. We can see that many false-positive and false-negative predictions are made for SQL injection web attacks and cross-site scripting attacks. This is due to the high similarity between these attacks. This similarity was also evident when we observed the important features of these attacks using a random forest regressor.
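To make the evaluation metric concrete, the following Python sketch computes per-class precision, recall, and F1-score and then a weighted F1-score. It assumes the usual definition, in which the per-class F1-scores are averaged with the class supports as weights; the function names `per_class_scores` and `weighted_f1` and the toy labels are hypothetical and only serve this illustration, they are not taken from the authors' code.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class receives a false positive
            fn[truth] += 1  # true class receives a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, support)
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1-scores, weighted by class support."""
    scores = per_class_scores(y_true, y_pred)
    total = len(y_true)
    return sum(f1 * support for _, _, f1, support in scores.values()) / total


# Toy example: one SQL injection flow is mistaken for cross-site scripting.
y_true = ["Normal"] * 6 + ["Sqlinjweb"] * 2 + ["Xssweb"] * 2
y_pred = ["Normal"] * 6 + ["Sqlinjweb", "Xssweb"] + ["Xssweb"] * 2
score = weighted_f1(y_true, y_pred)
```

Because the weights are the class supports, a large class such as normal traffic dominates the weighted score, which is why the per-class values for small classes are worth inspecting separately.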


Table 5 The distribution of the top 5 important features of each attack

Attack             Feature                Importance
DoS goldeneye      FlowIATMean            1.90E-03
                   FlowIATMin             1.49E-03
                   BwdPacketLengthMean    1.16E-03
                   BwdPacketLengthMax     8.87E-04
                   FwdIATTotal            8.65E-04
XSS web attack     BwdPacketLengthStd     0.095654
                   FlowDuration           0.09433
                   FlowIATMax             0.006082
                   FlowPackets/s          0.005945
                   FlowIATMin             0.004159
DoS slowloris      FwdPacketLengthMin     6.42E-04
                   FwdPacketLengthStd     5.39E-04
                   FlowPackets/s          1.31E-04
                   FlowIATMean            1.20E-04
                   FlowIATMin             1.19E-04
SQL injection      FlowDuration           0.053775
                   BwdPacketLengthMean    0.026574
                   FlowIATMin             0.009624
                   FlowPackets/s          0.004858
                   BwdPacketLengthStd     0.003083
FTP Brute-force    FlowIATMin             0.00246
                   FlowBytes/s            0.002149
                   FlowIATMean            0.001718
                   FlowPackets/s          0.00096
                   FlowDuration           0.000907
Heartbleed         BwdPacketLengthMean    0.015019
                   FlowIATMin             0.009121
                   BwdPacketLengthStd     0.007426
                   FlowDuration           0.006603
                   FlowIATMax             0.004415

Fig. 5 Confusion matrix for random forest

We can see that this classifier worked well for multiclass classification. For web attacks, including SQL injection and cross-site scripting attacks, there are many false positives due to the high correlation.

7.2 Performance analysis of deep learning and ensemble classifiers

In this subsection, we address the results for deep learning models and other ensemble classifiers on the dataset. First, a


multilayer perceptron (MLP) with 4 hidden layers of 450, 250, 150, and 50 neurons, respectively, is used. A dropout of 0.2 is used in the last layer to avoid overfitting. The model is trained for 48 epochs with a batch size of 64. Then, a one-dimensional convolutional neural network (1DCNN) is also optimized on the data. The CNN has 3 convolutional layers with 60 kernels each and a dropout of 0.2. The optimizer used is Adam, and the loss is sparse categorical cross-entropy. The model is trained for 150 epochs with a batch size of 24.

Table 6 Precision, Recall, F1-score for each class for random forest

Class               Support (# of samples)   Precision   Recall   F1-score
DoS(goldeneye)      1858                     0.95        0.93     0.94
DoS(hulk)           3199                     0.95        0.95     0.95
DoS(LOIC)           1062                     1.00        0.99     0.99
DoS(slowhttptest)   1934                     0.95        0.93     0.94
DoS(slowloris)      1140                     1.00        1.00     1.00
Ftppatator          1114                     0.90        0.95     0.92
Sshpatator          1103                     0.98        0.99     0.99
Sqlinjweb           87                       0.55        0.60     0.57
Xssweb              103                      0.64        0.75     0.69
Heartbleed          46                       0.71        0.85     0.77
Httpwebattack       259                      0.85        0.87     0.86
Normal              14,123                   1.00        1.00     1.00
Portscan            4549                     1.00        0.99     0.99

When classical machine learning methods are insufficient to avoid overfitting the data, ensemble methods come into play. Ensemble methods typically perform better than a single model because they can reduce prediction error by adding bias [19] and make model performance more robust. We used ensemble methods to create an optimal predictive model that performs better than any of the contributing models. Instead of relying on a single model, other models can be considered in the final prediction, and the final prediction is based on the aggregated results of all models. Nowadays, ensemble methods are considered one of the best state-of-the-art solutions in various areas of machine learning. In general, ensemble models very often achieve better results than any single model.

Several ensemble techniques are also applied to the data. Among them, the following three techniques are the most powerful. First, the majority-based voting ensemble method is used with SVM, KNN, and random forest as the base estimators. Second, XGBOOST, KNN, and random forest are used as the base estimators for the voting ensemble; both techniques use soft voting criteria. In hard voting, each model casts a vote for a class label and the majority label is chosen, while soft voting uses the predicted probability distribution of each model to assign a label to the test data. In the third technique, a stacking-based ensemble is used. Here, KNN, SVM, and random forest are used as the base estimators, and XGBOOST is used as the metamodel. This classifier gave the best results for the test data. Finally, the same model was trained on the combined training and validation data and then tested with the test data. Table 7 shows the results for several complex classifiers applied to the data.

The comparative results of the baseline models (Table 3) and of the deep learning and ensemble models (Table 7) on our generated dataset CIPMAIDS2023-1 show that the weighted F1-score of our proposed ensemble approach is the highest, i.e., 98.24%. These results indicate the superiority of our proposed ensemble approach on our own generated dataset CIPMAIDS2023-1.

The confusion matrix for the best-performing classifier, which is the stacking-based ensemble classifier, is shown in Fig. 6. These experimental results show that the ensemble stacking approach achieved the highest weighted F1-score compared to the other classifiers and that the selected features from the cicflowmeter are effective and can be used for network intrusion detection. This technique achieved the highest score of 98.05%, so we slightly improved the weighted F1-score and achieved good multiclass accuracy by using this approach. Table 8 shows the per-class F1-score for each attack using this stacking-based ensemble method.

Comparing Tables 6 and 8, we can see an improvement in the F1-score for SQL injection attacks (0.57 to 0.62), Heartbleed (0.77 to 0.92), and HTTP web attacks (0.86 to 0.90), while the F1-score for cross-site scripting web attacks remains at 0.69. We now compare our stacking-based ensemble technique with other techniques in Table 9.

The primary objective of the comparative analysis shown in Table 9 is to highlight the overall performance of our proposed method relative to other state-of-the-art techniques. As stated earlier, we attempted to overcome the limitations of the CICIDS2017 dataset by generating our own dataset CIPMAIDS2023-1, so it is pertinent to mention


that our generated dataset CIPMAIDS2023-1 shares similar characteristics with the CICIDS2017 dataset. This means that both datasets have the same baseline structure, and hence, it is reasonable to compare the performance of our proposed stacking-based ensemble technique and other state-of-the-art techniques on the mutually comparable CICIDS2017 and CIPMAIDS2023-1 datasets.

Table 7 Comparison of deep ML models and ensemble models results on CIPMAIDS2023-1

Model                      Technique                        Weighted F1-score (%)   Dataset
MLP                        4 Layers                         95.40                   CIPMAIDS2023-1
1DCNN                      3 Layers                         96.50                   CIPMAIDS2023-1
Ensemble                   Majority based voting            97.70                   CIPMAIDS2023-1
                           ensemble (SVM, KNN, RF)
Proposed stacking-based    Stacking based ensemble          98.24                   CIPMAIDS2023-1
ensemble approach          Base learner: KNN, SVM, RF
                           Meta learner: XGBOOST

The proposed stacking-based ensemble approach uses the base learners KNN, SVM, and RF together with the meta-learner XGBoost to achieve good results compared to the benchmark models

Fig. 6 Confusion matrix for ensemble model

Sharafaldin et al. [8] achieved the highest score of 98.0%, and Bakhshi et al. [50] used hybrid deep learning for multiclass attack detection. In this case, multiple deep learning models were used. The best-performing model was CNN with a weighted F1-score of 93.56%. Yulianto et al. [51] used the synthetic minority oversampling technique (SMOTE) and ensemble feature selection to improve the results. The evaluation results showed that the proposed model achieved a maximum score of 90.01%. The evaluation results in Table 9 show that, among all techniques, our proposed ensemble-based model achieved the maximum F1-score of 98.24% on CIPMAIDS2023-1. Our proposed ensemble model, when tested on the CICIDS2017 dataset, resulted in a noticeably lower score of 78% compared with the 98% achieved on our proposed similar distribution dataset (CIPMAIDS2023-1). It is important to note that the CICIDS2017 dataset may possess unique characteristics that can present potential challenges for the generalization


Table 8 Precision, Recall,


Class Support (# of samples) Precision Recall Weighted F1-score
F1-score for each class for
stacking-based ensemble DoS(goldeneye) 1858 0.95 0.94 0.95
method
DoS(hulk) 3199 0.96 0.95 0.95
DoS(LOIC) 1062 1.00 0.99 0.99
DoS(slowhttptest) 1934 0.95 0.94 0.95
DoS(slowloirs) 1140 1.00 1.00 1.00
Ftppatator 1114 0.90 0.95 0.93
Sshpatator 1103 0.99 1.00 0.99
Sqlinjweb 87 0.57 0.68 0.62
Xssweb 103 0.76 0.63 0.69
Heartbleed 46 0.95 0.89 0.92
Httpwebattack 259 0.89 0.91 0.90
Normal 14,123 1.00 1.00 1.00
Portscan 4549 1.00 1.00 1.00

Table 9 Comparison of proposed stacking-based ensemble technique with other techniques


Model Technique Weighted F1-score Dataset

Sharafaldin et. al., [8], 2018 ID3 98.00% CICIDS2017


Bakhshi et. al., [50], 2021 Hybrid deep learning 93.56% CICIDS2017
Yulianto et. al., [51], 2019 AdaBoost + EFS + SMOTE 90.01% CICIDS2017
Proposed model Stacking based ensemble 78.0% CICIDS2017
98.24% CIPMAIDS2023-1

of the intrusion detection model trained on our generated dataset. Some of these characteristics are dataset bias, temporal factors, and collection methodology [52, 53].

The dataset exhibits bias in terms of network traffic pattern distribution due to the network environment or the data collection method. Temporal factors also affect generalization, as network traffic protocols are continuously upgraded and hence introduce variance in traffic behavior. Lastly, the collection methodology also introduces noise and inconsistency in the generated data. We have seen this inconsistency in CICIDS2017 in the form of not-a-number (NaN) values of some features that we had to remove in the preprocessing steps.

8 Conclusion and future work

In conclusion, we have proposed a stacking-based ensemble approach for intrusion detection, which achieved a weighted F1-score of up to 98.24%. We have also analyzed the effect of using different techniques for handling imbalanced data and found that using class weights improves the overall performance, but can degrade the multi-class performance in some cases. We have also observed similarities between SQL injection attacks and cross-site scripting attacks, leading to a high percentage of false positives in these classes.

In future work, we plan to strengthen the validity of our results by evaluating our proposed model on other datasets. The performance of the proposed model on the CICIDS2017 dataset exhibited a decrease of approximately 20.24 percentage points in the weighted F1-score (from 98.24% to 78.0%) compared to its performance on our proposed similar distribution dataset (CIPMAIDS2023-1). To enhance the performance of the model on the CICIDS2017 dataset in the future, we plan to employ improved evaluation metrics and validation techniques, which can contribute to maximizing the generalization of the model across datasets. Additionally, we will explore the use of additional network features not provided by the cicflowmeter to improve discrimination between attack and benign classes. We plan to investigate the transferability of models trained on one dataset to another dataset and analyze how the diversity of the dataset affects the transferability. We will also use a realistic network with public IPs to obtain more realistic traffic, which can lead to better generalization. Finally, we will explore the use of deep packet inspection to generate more features that truly represent the discrimination between attack and benign classes, which will require us to allocate more resources to the processing of network traffic packets.

Acknowledgements All authors read and approved the final manuscript.

Author contributions MA contributed conceptualization, data curation, methodology, writing (original draft), software, and writing (review and editing); MH contributed conceptualization, data curation, methodology, writing (original draft), software, writing (review and editing), formal analysis, project administration, visualization, and investigation; MHD was


involved in supervision, project administration, visualization, investigation, writing (review and editing), and formal analysis; AU contributed to writing (review and editing), formal analysis, data curation, methodology, and project administration; SMM contributed writing (review and editing), formal analysis, methodology, investigation, project administration, and funding acquisition; HM was involved in writing (review and editing), formal analysis, data curation, methodology, and project administration; CM contributed writing (review and editing), formal analysis, methodology, investigation, and project administration.

Funding This research is funded by the Ministry of Planning, Development, and Special Initiatives through the Higher Education Commission of Pakistan under the National Center for Cyber Security (NCCS) and Cyber Security Centre, University of Warwick, UK.

Data availability There are no public data or materials for this paper.

Declarations

Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Scarfone, K., Mell, P.: Guide to intrusion detection and prevention systems (idps), special publication (nist sp). National Institute of Standards and Technology, Gaithersburg (2007). https://doi.org/10.6028/NIST.SP.800-94
2. Patil, N.V., Krishna, C.R., Kumar, K.: Distributed frameworks for detecting distributed denial of service attacks: a comprehensive review, challenges and future directions. Concurr. Comput. Pract. Exp. 33, 6197 (2021). https://doi.org/10.1002/CPE.6197
3. Jazi, H.H., Gonzalez, H., Stakhanova, N., Ghorbani, A.A.: Detecting http-based application layer dos attacks on web servers in the presence of sampling. Comput. Netw. 121, 25–36 (2017). https://doi.org/10.1016/J.COMNET.2017.03.018
4. Jallad, K.A., Aljnidi, M., Desouki, M.S.: Anomaly detection optimization using big data and deep learning to reduce false-positive. J. Big Data 7, 1–12 (2020). https://doi.org/10.1186/S40537-020-00346-1
5. Gupta, N., Jindal, V., Bedi, P.: Lio-ids: handling class imbalance using lstm and improved one-vs-one technique in intrusion detection system. Comput. Netw. 192, 108076 (2021)
6. Verma, A., Ranga, V.: Machine learning based intrusion detection systems for iot applications. Wirel. Pers. Commun. 111, 2287–2310 (2020). https://doi.org/10.1007/S11277-019-06986-8
7. Kasim, Ö.: An efficient and robust deep learning based network anomaly detection against distributed denial of service attacks. Comput. Netw. 180, 107390 (2020)
8. Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: ICISSP 2018 - Proceedings of the 4th International Conference on Information Systems Security and Privacy 2018-Janua, 108–116 (2018). https://doi.org/10.5220/0006639801080116
9. Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: A detailed analysis of the cicids2017 data set. Commun. Comput. Inf. Sci. 977, 172–188 (2019)
10. Cohen, F.: Computer viruses: Theory and experiments. Comput. Secur. 6, 22–35 (1987). https://doi.org/10.1016/0167-4048(87)90122-2
11. Ullah, Z., Al-Turjman, F., Mostarda, L., Gagliardi, R.: Applications of artificial intelligence and machine learning in smart cities. Comput. Commun. 154, 313–323 (2020). https://doi.org/10.1016/J.COMCOM.2020.02.069
12. Sravani, K., Srinivasu, P.: Comparative study of machine learning algorithm for intrusion detection system. Adv. Intell. Syst. Comput. 247, 189–196 (2014). https://doi.org/10.1007/978-3-319-02931-3_23
13. Sahu, S.K., Sarangi, S., Jena, S.K.: A detail analysis on intrusion detection datasets. In: Souvenir of the 2014 IEEE International Advance Computing Conference, IACC 2014, 1348–1353 (2014). https://doi.org/10.1109/IADCC.2014.6779523
14. Al-Garadi, M.A., Mohamed, A., Al-Ali, A.K., Du, X., Ali, I., Guizani, M.: A survey of machine and deep learning methods for internet of things (iot) security. IEEE Commun. Surv. Tutor. 22, 1646–1685 (2020). https://doi.org/10.1109/COMST.2020.2988293
15. Kilincer, I.F., Ertam, F., Sengur, A.: Machine learning methods for cyber security intrusion detection: datasets and comparative study. Comput. Netw. 188, 107840 (2021). https://doi.org/10.1016/J.COMNET.2021.107840
16. Liu, H., Lang, B.: Machine learning and deep learning methods for intrusion detection systems: a survey. Appl. Sci. (2019). https://doi.org/10.3390/APP9204396
17. Aslam, S., Herodotou, H., Mohsin, S.M., Javaid, N., Ashraf, N., Aslam, S.: A survey on deep learning methods for power load and renewable energy forecasting in smart microgrids. Renew. Sustain. Energy Rev. 144, 110992 (2021)
18. Shah, S.N., Singh, M.P.: Signature-based network intrusion detection system using snort and winpcap - ijert. Int. J. Eng. Res. Technol. (IJERT) 01
19. Krishnaveni, S., Sivamohan, S., Sridhar, S.S., Prabakaran, S.: Efficient feature selection and classification through ensemble method for network intrusion detection on cloud computing. Clust. Comput. 24(3), 1761–1779 (2021). https://doi.org/10.1007/s10586-020-03222-y
20. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011). https://doi.org/10.5555/1953048
21. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for Large-Scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016)
22. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
23. MATLAB: (R2022a). The MathWorks Inc., Natick, Massachusetts (2022)


24. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software. ACM SIGKDD Explor. Newsl 11, 10–18 (2009). https://doi.org/10.1145/1656274.1656278
25. Moulahi, T., Zidi, S., Alabdulatif, A., Atiquzzaman, M.: Comparative performance evaluation of intrusion detection based on machine learning in in-vehicle controller area network bus. IEEE Access 9, 99595–99605 (2021). https://doi.org/10.1109/ACCESS.2021.3095962
26. Çavuşoğlu, Ünal.: A new hybrid approach for intrusion detection using machine learning methods. Appl. Intell. 49, 2735–2761 (2019). https://doi.org/10.1007/S10489-018-01408-X/TABLES/16
27. Van, T., Tran, H.A., Souihi, S., Mellouk, A.: Empirical study for dynamic adaptive video streaming service based on google transport quic protocol. In: 2018 IEEE 43rd Conference on Local Computer Networks (LCN), pp. 343–350 (2018). IEEE
28. Ho, S., Al Jufout, S., Dajani, K., Mozumdar, M.: A novel intrusion detection model for detecting known and innovative cyberattacks using convolutional neural network. IEEE Open J. Comput. Soc. 2, 14–25 (2021)
29. Shenfield, A., Day, D., Ayesh, A.: Intelligent intrusion detection systems using artificial neural networks. ICT Express 4, 95–99 (2018). https://doi.org/10.1016/J.ICTE.2018.04.003
30. Jamadar, R.A.: Network intrusion detection system using machine learning. Indian J. Sci. Technol. (2018). https://doi.org/10.17485/ijst/2018/v11i48/139802
31. Taher, K.A., Jisan, B.M.Y., Rahman, M.M.: Network intrusion detection using supervised machine learning technique with feature selection. In: 1st International Conference on Robotics, Electrical and Signal Processing Techniques, ICREST 2019, 643–646 (2019). https://doi.org/10.1109/ICREST.2019.8644161
32. Kanimozhi, V., Jacob, T.P.: Artificial intelligence based network intrusion detection with hyper-parameter optimization tuning on the realistic cyber dataset cse-cic-ids2018 using cloud computing. In: Proceedings of the 2019 IEEE International Conference on Communication and Signal Processing, ICCSP 2019, 33–36 (2019). https://doi.org/10.1109/ICCSP.2019.8698029
33. Rajagopal, S., Kundapur, P.P., Hareesha, K.S.: A stacking ensemble for network intrusion detection using heterogeneous datasets. Secur. Commun. Netw. (2020). https://doi.org/10.1155/2020/4586875
34. Maseer, Z.K., Yusof, R., Bahaman, N., Mostafa, S.A., Foozy, C.F.M.: Benchmarking of machine learning for anomaly based intrusion detection systems in the cicids2017 dataset. IEEE Access 9, 22351–22370 (2021). https://doi.org/10.1109/ACCESS.2021.3056614
35. Thudumu, S., Branch, P., Jin, J., Singh, J.J.: A comprehensive survey of anomaly detection techniques for high dimensional big data. J. Big Data 7, 1–30 (2020). https://doi.org/10.1186/S40537-020-00320-X/TABLES/6
36. Bhati, N.S., Khari, M.: A new intrusion detection scheme using catboost classifier. In: Forthcoming Networks and Sustainability in the IoT Era: First EAI International Conference, FoNeS–IoT 2020, Virtual Event, October 1-2, 2020, Proceedings 1, pp. 169–176. Springer (2021)
37. Bhati, N.S., Khari, M., Malik, H., Chaudhary, G., Srivastava, S.: A new ensemble based approach for intrusion detection system using voting. J. Intell. Fuzzy Syst. 42(2), 969–979 (2022). https://doi.org/10.3233/JIFS-189764
38. Bhati, N.S., Khari, M.: An ensemble model for network intrusion detection using adaboost, random forest and logistic regression. In: Applications of Artificial Intelligence and Machine Learning: Select Proceedings of ICAAAIML 2021, pp. 777–789. Springer (2022). https://doi.org/10.1007/978-3-319-10840-7_32
39. Bhati, N.S., Khari, M., García-Díaz, V., Verdú, E.: A review on intrusion detection systems and techniques. Internat. J. Uncertain. Fuzziness Knowl. Based Syst. 28(Supp02), 65–91 (2020). https://doi.org/10.1142/S0218488520400140
40. Ferrag, M.A., Maglaras, L., Moschoyiannis, S., Janicke, H.: Deep learning for cyber security intrusion detection: approaches, datasets, and comparative study. J. Inf. Secur. Appl. 50, 102419 (2020). https://doi.org/10.1016/j.jisa.2019.102419
41. Khraisat, A., Gondal, I., Vamplew, P., Kamruzzaman, J.: Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity (2019). https://doi.org/10.1186/s42400-019-0038-7
42. Siddique, K., Akhtar, Z., Khan, F.A., Kim, Y.: Kdd cup 99 data sets: a perspective on the role of data sets in network intrusion detection research. Computer 52, 41–51 (2019)
43. Hindy, H., Brosset, D., Bayne, E., Seeam, A.K., Tachtatzis, C., Atkinson, R., Bellekens, X.: A taxonomy of network threats and the effect of current datasets on intrusion detection systems. IEEE Access 8, 104650–104675 (2020). https://doi.org/10.1109/ACCESS.2020.3000179
44. Ferriyan, A., Thamrin, A.H., Takeda, K., Murai, J.: Generating network intrusion detection dataset based on real and encrypted synthetic attack traffic. Appl. Sci. (2021). https://doi.org/10.3390/app11177868
45. Sarhan, M., Layeghy, S., Moustafa, N., Portmann, M.: Netflow datasets for machine learning-based network intrusion detection systems. In: Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST 371 LNICST, 117–135 (2021). https://doi.org/10.1007/978-3-030-72802-1_9
46. Tong, V., Tran, H.A., Souihi, S., Mellouk, A.: Empirical study for dynamic adaptive video streaming service based on google transport quic protocol. In: Proceedings - Conference on Local Computer Networks, LCN 2018-October, 343–350 (2019). https://doi.org/10.1109/LCN.2018.8638062
47. Sarhan, M., Layeghy, S., Portmann, M.: Towards a standard feature set for network intrusion detection system datasets. Mobile Netw. Appl. 27, 357–370 (2022). https://doi.org/10.1007/S11036-021-01843-0/FIGURES/4
48. GNS3 The software that empowers network professionals. https://www.gns3.com/. (accessed: 15/10/2022) (2022)
49. Patel, K.C., Sharma, P.: A review paper on pfsense-an open source firewall introducing with different capabilities and customization. IJARIIE 3, 2395–4396 (2017)
50. Bakhshi, T., Ghita, B.: Anomaly detection in encrypted internet traffic using hybrid deep learning. Secur. Commun. Netw. (2021). https://doi.org/10.1155/2021/5363750
51. Yulianto, A., Sukarno, P., Suwastika, N.A.: Improving adaboost-based intrusion detection system (ids) performance on cic ids 2017 dataset. J. Phys: Conf. Ser. 1192, 12018 (2019). https://doi.org/10.1088/1742-6596/1192/1/012018
52. Verkerken, M., D’hooge, L., Wauters, T., Volckaert, B., De Turck, F.: Towards model generalization for intrusion detection: Unsupervised machine learning techniques. J. Netw. Syst. Manag. 30, 1–25 (2022)
53. Dhooge, L., Verkerken, M., Wauters, T., De Turck, F., Volckaert, B.: Investigating generalized performance of data-constrained supervised machine learning models on novel, related samples in intrusion detection. Sensors (2023). https://doi.org/10.3390/s23041846

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© The Author(s) 2023. This work is published under
http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding
the ProQuest Terms and Conditions, you may use this content in accordance
with the terms of the License.
