BoT-IoT - Generation
Abstract
The proliferation of IoT systems has seen them targeted by malicious third
parties. To address this, realistic protection and investigation countermeasures
need to be developed. Such countermeasures include network intrusion detection
and network forensic systems. For that purpose, a well-structured and
representative dataset is paramount for training and validating the credibility
of the systems. Although several network datasets exist, in most cases not much
information is given about the Botnet scenarios that were used. This paper proposes
a new dataset, Bot-IoT, which incorporates legitimate and simulated IoT
network traffic, along with various types of attacks. We also present a realistic
testbed environment for addressing the drawbacks of existing datasets, namely
capturing complete network information, accurate labeling, as well as recent and
complex attack diversity. Finally, we evaluate the reliability of the BoT-IoT dataset
using different statistical and machine learning methods for forensics purposes,
compared with the existing datasets. This work provides the baseline for allowing
botnet identification across IoT-specific networks. The Bot-IoT dataset can
be accessed at [1].
Keywords: Bot-IoT Dataset, Network Flow, Network Forensics, Forensics
Analytics.
1. Introduction
The rapid development of the Internet has been responsible for the emer-
gence of the Internet of Things (IoT), with such prominent examples as smart
homes and cities, healthcare systems, and cyber-physical systems. The IoT is
a collection of interconnected everyday devices, augmented with lightweight
processors and network cards, capable of being managed through web, app or other interfaces.
With new and unique threats capable of compromising IoT networks, and with
existing techniques inadequate to address them, it is important to develop
advanced forensic methods for identifying and investigating adversarial behavior.
Network forensic techniques are widely used for analyzing network traffic
and identifying infected devices taking part in major cyber-attacks [4]. Additionally,
due to the number and nature of IoT devices, the workload of processing
collected data would be an ideal application of Big Data analytics. Big Data
analytics is a collection of sophisticated analytical processes designed to handle
three main problems: variety, velocity and volume [5]. Since IoT networks
generate enormous volumes of data, it is imperative to employ analytical
techniques capable of handling them in a near-real-time fashion. As such, the
term forensic analytics is used to denote the combination of forensic techniques
and big data analytics.
Forensic analytics demand big data sources for validating their credibility in IoT
networks. In order to develop forensics and intrusion detection solutions that
identify and investigate cyber-attacks, the development of a realistic dataset is
still an important topic of research [6]. Over the years, a number of datasets were
developed, each of them having its own advantages and disadvantages. Existing
datasets, although applicable in some scenarios, introduce various challenges,
for example, a lack of reliably labeled data, poor attack diversity such as botnet
scenarios, redundancy of traffic, and missing ground truth [7, 8, 9, 10]. How-
ever, a realistic Botnet traffic dataset in IoT networks has not been effectively
designed. The new Bot-IoT dataset addresses the above challenges, by having a
realistic testbed, multiple tools being used to carry out several botnet scenarios,
and by organizing packet capture files in directories, based on attack types.
The main contributions of this paper are as follows:
• We design a new realistic Bot-IoT dataset and give a detailed description
of the testbed configuration and the simulated IoT sensors.
• We then statistically analyze the proposed features of the dataset using Cor-
relation Coefficient and Joint Entropy techniques.
• We also evaluate the performance of network forensic methods, based on
machine and deep learning algorithms using the Bot-IoT dataset com-
pared with popular datasets.
This study is structured as follows. The literature review is discussed in
Section 2. In Section 3, the testbed that was used to develop the Bot-IoT
dataset is presented. Sections 4, 5 and 6 discuss the feature extraction process,
the benign and malicious scenarios, and the statistical and Machine Learning
methods used to analyze the dataset, respectively. Finally, in Section 7, the
experimental results are presented and discussed, followed by the conclusion
of this paper.
This section outlines and analyzes the IoT and the machine learning-based
forensic techniques that are used in this work. Additionally, information is given
about popular network datasets and their limitations for designing forensic
techniques in the IoT.
effective techniques in law enforcement; such techniques can be utilized using
Machine Learning (ML) algorithms for designing reliable models that can
efficiently determine cybercrime. Therefore, we term forensic techniques that use
Machine Learning and big data as forensic analytics. Forensic analytics could
be used in the examination phase, where the forensic analyst seeks to identify
patterns that would answer the aforementioned questions related to the
occurrence of a crime.
Destination IP, Destination Port, Protocol. In both categories, Clustering,
Classification and Association Rule Mining techniques have been applied
in the literature.
• Email Authorship is an example of an ML application in the context of
Cyber Forensics, not directly related to Network Forensics. In such cases,
Support Vector Machines have been used to identify the author of an
e-mail based on the text itself [22].
Taking a different approach to building their testbed, Bhatia et al. [25] employed
physical machines arranged appropriately in a local network. Their testbed was
tasked with simulating flash events and various types of DDoS attacks, the latter
through the use of specialized software called Botloader and IP-Aliasing. Compared
to our virtualized approach, their choice of physical machines incurs added
costs, is not easily deployable, as mentioned by the research team itself [25], and
does not include the added resiliency that virtualized environments offer.
Additionally, our approach included a greater variety of Botnet activities,
including but not limited to DDoS, DoS and Port Scanning.
Similarly to [25], Behal et al. [26] developed their testbed, DDoSTB, to generate
DDoS attacks and flash events. Their approach was to use a combination of
real computers arranged locally into subnetworks, and emulation software that
created multiple virtual nodes per physical machine. As mentioned for [25], the
use of physical machines in such a testbed lacks the resiliency, ease and speed
with which Virtual Machines can be deployed. Again, the focus of the testbed
was DDoS attacks, which, although a significant and particularly destructive
capability of Botnets, are far from the only kind of action that Botnets can
perform; Botnets in the wild have been observed to exhibit a number of diverse
malicious actions, such as Keylogging, Phishing, Data theft, DDoS, DoS and
Malware proliferation.
The work of Doshi et al. [20] focused on generating an ML model that would
identify IoT-produced DDoS attacks. In their approach, they made use of physical
IoT consumer appliances, which were deployed in a local network. Normal
traffic was collected by interacting with the IoT devices, while DoS attacks were
simulated. The main differences between their proposed testbed and the one we
implemented are scale and attack variety. For our testbed, we chose to use four
Kali Linux machines to simulate both DoS and DDoS attacks, along with other
Botnet actions.
The main novelty of the proposed dataset is the introduction of the IoT element
in the environment. Namely, we employed popular middleware (Node-red) to
simulate the existence of IoT devices in our Virtual Network. This simulated IoT
traffic is in the form of MQTT, a publish-subscribe communication protocol
implemented over TCP/IP, often used for lightweight network communication
such as in IoT [28]. In contrast, IoT devices were not represented in the testbeds
that were presented in the previous section [8, 9][23, 24, 25, 26], with the
exception of [20].
Choosing this virtualized setup carries a number of pros and cons. To start
with the pros, using a virtualized environment means that the setup is portable
and easy to set up at relatively low cost. Additionally, using simulations to
represent the servers, PCs and IoT devices made the launching of malicious
attacks easier and safer, with the extra bonus that the machines could be
recovered easily if necessary. From a security perspective, not using actual Botnet
malware to infect our machines made the process a lot safer, as using real
malware would run the risk of our machines unwillingly taking part in real
attacks by becoming part of a real botnet.
Furthermore, many newer versions of Bot malware can detect a virtualized
environment, producing fake data as subterfuge. With regards to the generation
of realistic normal network traffic, the Ostinato [27] software was used, to make
the dataset appear as if it were collected from a real-world network. Lastly, we
executed a set of standard Botnet attacks, which makes this dataset useful for
the development of realistic Botnet traffic detection.
Since the applications of forensics discussed in the IoT and Forensic analytics
section employ machine learning techniques, they require big datasets for ana-
lyzing network flows, di↵erentiating between normal and abnormal traffic, and
producing forensic reports, which could be useful to forensic specialists in law
enforcement. The development of a realistic network dataset is a very important
task for designing network forensics, intrusion detection and privacy-preserving
models. Over the years, several datasets have been produced [8] and although a
good number of them remain private, primarily due to privacy concerns, some
have become publicly available. The most commonly used datasets are briefly
explained below, with a comparison between them and Bot-IoT given in Table
1.
• The DARPA 98 dataset was generated by MIT's Lincoln Laboratory for assessing
intrusion detection systems. The resulting dataset was produced over a
period of 7 weeks, was made up of 4GB of binary data and simulated a
small Air Force network connected to the Internet [29][30]. It was later
enhanced in 1999, generating the features of the KDD99 dataset.
• The KDD99 dataset was generated from the DARPA 98 dataset for evaluating
intrusion detection systems that distinguish between inbound attacks
and normal connections [7][31][30]. Even though it is still used to
this day, it has several problems, for example, the non-normal distribution
of attack and normal data, known as the imbalanced learning problem. The
NSL-KDD dataset was proposed to address the limitations of the KDD99,
but both versions are outdated and do not reflect current normal and
attack events [31][8].
• The DEFCON-8 dataset consists of port scanning and buffer overflow attacks,
which were recorded during a Capture the Flag competition [8]. As
it lacks a significant quantity of normal background traffic, its applicability
for evaluating Network Forensics and IDS systems is limited.
• The UNIBS [32][33] dataset was developed by the University of Brescia,
Italy. In their configuration, the researchers installed 20 workstations running
the Ground Truth daemon, and traffic was collected through tcpdump
at the router to which they were connected. Although the researchers used
a realistic configuration, there are some drawbacks to this dataset. First,
the attack scenarios are limited to DoS attacks. Secondly, the dataset
exists in packet form with no extra features generated on it. Additionally,
no information is given about the labels.
• The CAIDA datasets [33][34] are collections of varied data types. They
are comprised of anonymized header traffic, excluding the payload. The
datasets are made up of very specific attacks, such as DDoS. One popular
dataset from the CAIDA collection is the CAIDA DDoS 2007, which
includes one hour of anonymized attack traces from DDoS attacks that took
place on August 4, 2007. One drawback of the CAIDA datasets is that they
do not include ground truth about the attack instances. Additionally, the
gathered data was not processed to generate new features, which could
improve the differentiation between attack and normal traffic.
• The LBNL [35][33] dataset consists of anonymized traffic containing
only header data. It was developed at the Lawrence Berkeley National
Laboratory, by collecting real inbound, outbound and routing traffic from
two edge routers. Similarly to UNIBS, the labeling process is lacking
and no extra features were generated, with the data existing as a collection
of .pcap files.
• The UNSW-NB15 is a dataset developed at UNSW Canberra by Moustafa
et al. [9]. The researchers employed the IXIA PerfectStorm tool to generate a
mixture of benign and attack traffic, resulting in a 100GB dataset in the
form of PCAP files, with a significant number of novel features generated.
The purpose of the produced dataset was the generation and validation
of intrusion detection systems. However, the dataset was designed
based on a synthetic environment for generating attack activities.
• The ISCX dataset [36][37] was produced at the Canadian Institute for
Cybersecurity. The concept of profiles was used to define attack and
distribution techniques in a network environment. Several real traces were
analyzed to generate accurate profiles of attacks and other activities to
evaluate intrusion detection systems. Recently, a new dataset was generated
at the same institution, the CICIDS2017 [8]. The CICIDS2017 is
comprised of a variety of attack scenarios, with realistic user-related
background traffic generated by using the B-Profile system. Nevertheless, the
ground truth of the datasets, which would enhance the reliability of the
labeling process, was not published. Furthermore, applying the concept
of profiling, which was used to generate these datasets, in real networks
could be problematic due to their innate complexity.
• The TUIDS [38][33] dataset was generated at Tezpur University, India.
This dataset features DDoS, DoS and Scan/Probing attack scenarios,
carried out in a physical testbed. However, the flow-level data does not
include any new features other than the ones generated by the flow-capturing
process.
Although various research studies have been conducted [7, 8, 9, 32, 35, 38, 34, 36]
to generate network datasets, the development of a realistic IoT and network
traffic dataset that includes recent Botnet scenarios is still an unexplored topic.
More importantly, some datasets lack the inclusion of IoT-generated traffic,
while others neglected to generate any new features. In some cases, the testbed
used was not realistic, while in other cases the attack scenarios were not diverse.
This work seeks to address these shortcomings by designing the new Bot-IoT
dataset and evaluating it using multiple forensic mechanisms, based on machine
and deep learning algorithms.
3. The proposed Bot-IoT dataset
of the AWS via the Message Queuing Telemetry Transport (MQTT) protocol
[28], as detailed in the following sub-section.
A packet filtering firewall and two Network Interface Cards (NICs) were also
configured in the environment. One of the NICs was configured for the LAN
and the other for the WAN. The main reason for using this firewall is to ensure
the validity of the dataset labeling process, as it enables managing network
access by monitoring incoming and outgoing network packets, based on the specific
source and destination Internet Protocol (IP) addresses of the attack and normal
platforms. VMs which needed to communicate with the Internet sent their traffic
through the PFSense machine which, in turn, forwarded the traffic through
a switch and a second firewall before it could be routed further into the Internet.
Our network of VMs consists of four Kali machines, Bot Kali 1, Bot Kali 2,
Bot Kali 3, and Bot Kali 4, an Ubuntu Server, Ubuntu mobile, Windows 7,
Metasploitable and an Ubuntu Tap machine. The Kali VMs, which belong
to the attacking machines, performed port scanning, DDoS and other Botnet-
related attacks by targeting the Ubuntu Server, Ubuntu mobile, Windows 7 and
Metasploitable VMs. In the Ubuntu Server, a number of services had been de-
ployed, such as DNS, email, FTP, HTTP, and SSH servers, along with simulated
IoT services, in order to mimic real network systems.
To generate a massive amount of normal traffic, we used the Ostinato tool [27],
due to its flexibility in generating realistic benign traffic with given IPs and
ports. We also periodically maintained normal connections between the VMs
by executing normal functions of the services installed on the Ubuntu server;
examples include the DNS server, which resolved the names of the VMs to
their IPs, and the FTP server, which was used to transfer particular files between
the VMs. To collect the entire normal and attack raw packet volume exchanged
within the configured network, the tshark tool was used on the Ubuntu Tap
machine, with its NIC set to promiscuous mode, which ensured the scalability
of the testbed.
Figure 2: Flowchart of the weather-station IoT simulation in the Node-red tool used for the dataset
The various pieces of code were triggered for publishing and subscribing to
the topics, as shown in the example of Figure 2. The MQTT protocol [28] was
used as a lightweight communication protocol for machine-to-machine
(M2M) communications, making it a viable choice for IoT solutions. It works in
a publish/subscribe model, where a device publishes data to an MQTT broker
(server side) under a topic, which is used to organize the published data and
allows clients to connect to the broker and fetch information from the topic of
the device they wish to interact with.
We applied the following IoT scenarios in the testbed of the dataset:
1. A weather station (Topic: /smarthome/weatherStation), which generates
information on air pressure, humidity and temperature.
2. A smart fridge (/smarthome/fridge), which measures the fridge's temperature
and, when necessary, adjusts it below a threshold.
3. Motion-activated lights (/smarthome/motionLights), which turn on or off
based on a pseudo-randomly generated signal.
4. A remotely activated garage door (/smarthome/garageDoor), which opens
or closes based on a probabilistic input.
5. A smart thermostat (/smarthome/thermostat), which regulates the house's
temperature by starting the air-conditioning system.
The five IoT scenarios were connected to both the Ubuntu server, where the
Mosquitto MQTT broker [44] was installed, as well as the AWS IoT hub. While
running the testbed environment, MQTT messages were published periodically
from all clients to both brokers. The connections allowed us to simulate regular
IoT traffic, since the MQTT brokers were used as intermediaries that connected
smart devices with web/smartphone applications.
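To illustrate the publish/subscribe interaction of these simulated devices, the following is a minimal Python sketch of a weather-station publisher using the paho-mqtt client library. It is an illustration only, not the testbed implementation (the actual simulation was built with Node-red); the broker address reuses the Ubuntu server IP mentioned later in the paper, while the port, payload fields and publishing interval are assumptions.

```python
import json
import random
import time

from paho.mqtt import publish  # pip install paho-mqtt

BROKER = "192.168.100.3"  # assumed: Mosquitto broker on the Ubuntu server
TOPIC = "/smarthome/weatherStation"

while True:
    # Pseudo-random readings mimicking the weather station scenario
    reading = {
        "pressure": round(random.uniform(980.0, 1040.0), 2),   # hPa
        "humidity": round(random.uniform(20.0, 90.0), 2),      # percent
        "temperature": round(random.uniform(15.0, 35.0), 2),   # Celsius
    }
    publish.single(TOPIC, json.dumps(reading), hostname=BROKER, port=1883)
    time.sleep(5)  # assumed publishing interval
```

A subscriber, such as a web or smartphone application, would connect to the same broker and subscribe to /smarthome/weatherStation to receive these messages.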
After collecting the pcap files from the virtualized setup, with normal and
attack traffic in the same files, we extracted the flow data by using the Argus
tools, producing .argus files. Following the flow extraction process, the
produced flow data was imported into a MySQL database for further processing.
We then employed statistical measures, using Correlation Coefficient [45]
and Entropy [46] techniques, to assess the original dataset and select important
features, as described in Section 6. New features were generated based on the
transactional flows of network connections in order to discover normal and
intrusive events. Finally, three Machine Learning models, which could be applied
for forensic purposes, were trained and validated on several versions of the
dataset to assess its value compared with other datasets, as discussed in
Section 7.
Figure 3: Process of converting pcap to final csv dataset
After collecting the pcap files, the Argus tool [49] was used to generate the
relevant network flows. The process can be viewed in Figure 3. The pcap files
were converted into Argus format by using the Argus client program. Then,
the rasqlinsert command was applied to extract network flow information and
simultaneously log the extracted features into MySQL tables.
The final features produced by Argus during the Network flow extraction process
are listed in Table 2.
Feature        Description
pkSeqID        Row identifier
stime          Record start time
flgs           Flow state flags seen in transactions
flgs_number    Numerical representation of feature flgs
proto          Textual representation of transaction protocols present in network flow
proto_number   Numerical representation of feature proto
saddr          Source IP address
sport          Source port number
daddr          Destination IP address
dport          Destination port number
pkts           Total count of packets in transaction
bytes          Total number of bytes in transaction
state          Transaction state
state_number   Numerical representation of feature state
ltime          Record last time
seq            Argus sequence number
dur            Record total duration
mean           Average duration of aggregated records
stddev         Standard deviation of aggregated records
sum            Total duration of aggregated records
min            Minimum duration of aggregated records
max            Maximum duration of aggregated records
spkts          Source-to-destination packet count
dpkts          Destination-to-source packet count
sbytes         Source-to-destination byte count
dbytes         Destination-to-source byte count
rate           Total packets per second in transaction
srate          Source-to-destination packets per second
drate          Destination-to-source packets per second
attack         Class label: 0 for normal traffic, 1 for attack traffic
category       Traffic category
subcategory    Traffic subcategory
4.2. New feature generation
We developed new features that were generated based on the features listed
in Table 2. The main purpose of this process is to improve the predictive
capabilities of classifiers. The new features, shown in Table 3, were designed
over a sliding window of 100 connections. The number 100, although chosen
arbitrarily, plays a significant role in the generation of these new features, as it
captures the statistics of groups of flows in a relatively small time window,
inside of which patterns of several attacks can be discovered. In order to generate
these features in the MySQL DB, we made use of stored procedures; an
illustrative sketch of the same computation is given after Table 3.
Feature                               Description
1  TnBPSrcIP                         Total number of bytes per source IP
2  TnBPDstIP                         Total number of bytes per destination IP
3  TnP_PSrcIP                        Total number of packets per source IP
4  TnP_PDstIP                        Total number of packets per destination IP
5  TnP_PerProto                      Total number of packets per protocol
6  TnP_Per_Dport                     Total number of packets per destination port
7  AR_P_Proto_P_SrcIP                Average rate per protocol per source IP (calculated as pkts/dur)
8  AR_P_Proto_P_DstIP                Average rate per protocol per destination IP
9  N_IN_Conn_P_SrcIP                 Number of inbound connections per source IP
10 N_IN_Conn_P_DstIP                 Number of inbound connections per destination IP
11 AR_P_Proto_P_Sport                Average rate per protocol per source port
12 AR_P_Proto_P_Dport                Average rate per protocol per destination port
13 Pkts_P_State_P_Protocol_P_DestIP  Number of packets grouped by state of flows and protocols per destination IP
14 Pkts_P_State_P_Protocol_P_SrcIP   Number of packets grouped by state of flows and protocols per source IP
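The paper generates these features with MySQL stored procedures, which are not reproduced here. As an illustration only, the following pandas sketch computes one of them, TnBPSrcIP, over the same sliding window of 100 connections; the input file name is hypothetical, with saddr, bytes and stime as in Table 2.

```python
import pandas as pd

WINDOW = 100  # sliding window of 100 connections, as in the paper

def tnbp_src_ip(flows: pd.DataFrame) -> pd.Series:
    """Total number of bytes per source IP, computed over the last WINDOW flows."""
    totals = []
    for i in range(len(flows)):
        window = flows.iloc[max(0, i - WINDOW + 1): i + 1]     # current flow plus up to 99 predecessors
        same_src = window["saddr"] == flows["saddr"].iloc[i]   # flows sharing this source IP
        totals.append(int(window.loc[same_src, "bytes"].sum()))
    return pd.Series(totals, index=flows.index)

# flows = pd.read_csv("botiot_flows.csv").sort_values("stime")  # hypothetical export
# flows["TnBPSrcIP"] = tnbp_src_ip(flows)
```

The remaining window features in Table 3 follow the same pattern, grouping by destination IP, protocol, port or flow state instead of the source IP.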
5. Benign and Botnet scenarios
The configuration of the VMs and utilized platforms represents a realistic smart-
home network, as the five IoT devices: 1) Smart Refrigerator, 2) Smart Garage
door, 3) Weather Monitoring System, 4) Smart Lights, and 5) Smart thermo-
stat, could be deployed in smart homes. Also, the generated messages were
transferred to a Cloud Service provider (AWS) using the MQTT protocol. The
statistics of normal traffic included in the dataset are shown in Table 4.
Protocol Number
UDP 7225
TCP 1750
ARP 468
IPV6-ICMP 88
ICMP 9
IGMP 2
RARP 1
Total 9543
a stealthy manner [51][52]. On the other hand, during an active probe,
an attacker generates network traffic targeting the system and records
the responses, comparing them with known responses, which allows them
to make inferences about services and the OS [51][52]. With regards to the
goal of the probe, there are two major subcategories: OS and service
fingerprinting. In OS fingerprinting, a scanner gathers information about
the remote system's OS by comparing its responses to pre-existing ones,
or based on differences in TCP/IP stack implementations. In service
fingerprinting (scanning), a scanner identifies the services which run behind
the system's ports (0-65535), by sending request packets [52]. Here, we
used active scans, as passive scans produce close to zero generated traffic.
– Port scanning: we used the Nmap and Hping3 tools in order to perform
a number of different types of port scans. Nmap [53] was
launched to scan open services and protocols of the VMs, for example,
nmap -sT 192.168.100.3, where -sT issues a complete TCP
connection scan and 192.168.100.3 is the IP of the Ubuntu server.
Likewise, Hping3 [54] was employed to perform similar port scans, one
example being hping3 -S --scan 1-1000 192.168.100.3, where -S issues
a SYN scan and --scan dictates the port numbers to be scanned.
– OS fingerprinting: we used the Nmap and Xprobe2 tools to launch
different types of OS fingerprinting scans. The Nmap [53] tool was
used to identify the OS of our target VMs with different options, for
example, nmap -sV -T5 -PO -O 192.168.100.3, where -sV enables
service version detection, -T5 selects the fastest (most overt) timing
template, -PO includes IP protocol ping packets and -O enables OS
detection. The Xprobe2 [55] tool was used in conjunction with Nmap.
While performing our scans, we used the default operation of Xprobe2,
with no options specified.
• Denial of Service [26][20][3][56] - malicious activities that attempt
to disrupt a service, thus making it unavailable to legitimate users. The
DDoS and DoS attack types included in the dataset are described as
follows: Distributed Denial of Service (DDoS) and Denial of Service (DoS)
attacks are performed by a group of compromised machines called Bots,
and target a remote machine, usually a server [26][20][3]. The purpose of
such attacks is the disruption of services accessible by legitimate users.
These attacks can be classified based on their attack methodology; two
such groups are volumetric and protocol-based DDoS/DoS attacks [56].
Volumetric attacks generate a great amount of network traffic, which
either forces the victim to process all the attack-generated requests,
or causes the machine to crash, thus making the provided service
unavailable. Protocol-based attacks abuse the mechanics of Internet protocols,
causing CPU and memory resources to be depleted and thus rendering the
targeted machine unable to respond to requests. In our attack scenarios,
we performed both DDoS and DoS attacks, and used the following protocols:
TCP, UDP and HTTP.
– DDoS, DoS: We used the tool Hping3 [54] for both DDoS and DoS
over TCP and UDP, for example, hping3 --syn --flood -d 100 -p 80
192.168.100.3, where --syn indicates a TCP SYN attack, --flood indicates
that packets are sent as fast as possible, -d specifies the packet body size
and -p sets the targeted port. For HTTP DDoS and DoS attacks, we used
the GoldenEye tool, one example being goldeneye.py http://192.168.100.3:80
-m post -s 75 -w 1, with http://192.168.100.3:80 indicating the IP
address of the Ubuntu Server and the targeted port number, -m setting
the HTTP method to POST, -s setting the number of sockets and -w
specifying the number of concurrent workers.
• Information Theft [57][58] - a group of attacks where an adversary
seeks to compromise the security of a machine in order to obtain sensitive
data. The information theft attack types included in the dataset are
described as follows:
Information theft attacks can be split into subcategories, based on the
target of the attack. The first subcategory is data theft. During data
theft attacks, an adversary targets a remote machine and attempts to
compromise it, thus gaining unauthorized access to data, which can be
downloaded to the remote attacking machine. The second subcategory is
keylogging. In keylogging activities, an adversary compromises a remote
host in order to record a user's keystrokes, potentially stealing sensitive
credentials. Attackers usually employ the Advanced Persistent Threat (APT)
methodology in conjunction with information theft attacks, in order to
maximize their attacks' efficiency [57].
The statistics of attacks involved in the dataset are described in Table 5.

H(X, Y) = -\sum_{x} \sum_{y} \big( P(x, y) \cdot \log_2 P(x, y) \big) \quad (1)
6.2. Machine and Deep Learning analysis techniques
Machine and Deep Learning models were used to evaluate the quality of
Bot-IoT, when used to train a classifier. The models that were trained were:
Support Vector Machine (SVM), Recurrent Neural Network (RNN) and Long
Short-Term Memory Recurrent Neural Network (LSTM-RNN).
• SVM: is based on the idea that data instances can be viewed as coordinates
in an N-dimensional space, with N being the number of features. During
training, a hyperplane is sought that best separates the data into distinct
groups (classes) and maximizes the margin. In our work, we made use
of an SVM classifier with a linear kernel [63].
• RNN: incorporates a form of memory in its structure [64]. The output
of an RNN during an iteration depends both on the input at any given
time and on the output of the hidden state of the previous iteration. What
makes the RNN stand out is that, contrary to other NNs, its output depends
on both the current input as well as previous outputs, making it ideal
for processing temporal data, such as the data present in our dataset.
Usual applications of RNNs include machine translation, speech recognition
and generating image descriptions.
• LSTM: is a special kind of Recurrent Neural Network, where a collection
of internal cells is tasked with maintaining a form of memory, which
makes LSTMs ideal for finding temporally distant associations [65]. LSTMs
improve on the RNN's 'vanishing gradient' and 'exploding gradient' problems
by incorporating a 'memory cell' which handles updates in the model's
memory. This improvement renders LSTMs ideal for learning long-term
dependencies in data.
Furthermore, the generated dataset is very large (more than 72,000,000 records,
at 16.7 GB for the CSV files and 69.3 GB for the pcap files), which made
handling the data very cumbersome. As such, we extracted 5% of the original
dataset via the use of select MySQL queries similar to the ones mentioned
previously. The extracted 5%, which we will refer to as the training and testing
sets for the rest of the paper, is comprised of 4 files of approximately 0.78 GB
total size, and about 3 million records.
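The MySQL extraction queries themselves are not shown in the paper; as a rough functional equivalent, a random 5% sample of the flow records could be drawn as sketched below (file names are hypothetical, and a fixed seed is assumed for reproducibility).

```python
import pandas as pd

# Hypothetical full-dataset CSV; in practice the 16.7 GB file may need chunked reading
full = pd.read_csv("botiot_full.csv")
sample = full.sample(frac=0.05, random_state=42)  # ~5% of the records
sample.to_csv("botiot_5pct.csv", index=False)
```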
Moreover, due to the existence of certain protocols (ARP), source and destination
port number values were sometimes missing (not applicable); as such, these values
were set to -1, which is an invalid port number, again for the purpose of
evaluating the dataset. We converted the categorical feature values in the dataset
into consecutive numeric values, for easily applying statistical methods. For
example, the state attribute has categorical values such as 'RST', 'CON'
and 'REQ' that were mapped into '1', '2' and '3'.
Moreover, normalization was applied in order to scale the data into a specific
range, such as [0,1], without changing the normality of data behaviors. This step
helps statistical models and deep learning methods to converge and achieve their
objectives by addressing local optimum challenges. We performed a Min-Max
transformation on our data, according to the following formula:
x'_i = (x_i - x_{min}) \times \frac{(b - a)}{(x_{max} - x_{min})} + a \quad (2)

where x_{max} and x_{min} are the initial max and min values from the original set,
b and a are the new max and min values, and x'_i \in [a, b]. For our purposes,
a = 0 and b = 1, making the new set [0, 1]. In order to measure the performance
of the trained models, corresponding confusion matrices were generated, along
with a collection of metrics, as given in Table 6.
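A direct rendering of Equation 2 with a = 0 and b = 1, equivalent to scikit-learn's MinMaxScaler, is sketched below for reference.

```python
import numpy as np

def min_max(x: np.ndarray, a: float = 0.0, b: float = 1.0) -> np.ndarray:
    """Scale x into [a, b] as in Equation 2, without changing the shape of its distribution."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) * (b - a) / (x_max - x_min) + a

print(min_max(np.array([2.0, 4.0, 6.0])))  # -> [0.  0.5 1. ]
```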
optimization of ML models. The idea behind these evaluations is to identify any
features that are highly correlated and to reduce the dimensionality, in order to
improve the performance of the machine learning models.
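The pairwise Pearson Correlation Coefficients can be computed directly with pandas; a minimal sketch follows, with the input file name assumed and only numeric columns considered.

```python
import pandas as pd

flows = pd.read_csv("botiot_5pct.csv")          # hypothetical 5% extraction
numeric = flows.select_dtypes(include="number")
corr = numeric.corr(method="pearson")           # pairwise Pearson Correlation Coefficients

# Mean absolute correlation per feature (the 1.0 self-term is included in the mean)
avg_corr = corr.abs().mean().sort_values(ascending=False)
print(avg_corr.head(10))
```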
7.2.2. Entropy
In order to calculate the joint entropy between our features, we wrote
Python code which traversed the loaded CSV files and calculated the sums of
each joint probability times the base-2 logarithm of that probability, as given
in Equation 1. It was due to this measure that we performed the discretization
mentioned at the beginning of this section.
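A condensed sketch of that computation is given below, assuming the two feature columns have already been discretized into bins; it implements Equation 1 with the conventional negative sign.

```python
import numpy as np
import pandas as pd

def joint_entropy(x: pd.Series, y: pd.Series) -> float:
    """Joint entropy H(X, Y) in bits over two discretized feature columns (Equation 1)."""
    p = pd.crosstab(x, y, normalize=True).to_numpy().ravel()  # joint probabilities P(x, y)
    p = p[p > 0]                                              # skip empty cells to avoid log(0)
    return float(-(p * np.log2(p)).sum())

# Example: average joint entropy of one feature against the others
# flows = pd.read_csv("botiot_5pct_discretized.csv")  # hypothetical discretized extraction
# scores = {c: joint_entropy(flows["rate"], flows[c]) for c in flows.columns}
```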
Feature        Average JE
flgs_number    1.68267
stime          0.798997
ltime          0.798997
dur            0.665417
rate           0.664154
sum            0.661829
spkts          0.661812
pkts           0.661783
sbytes         0.661772
bytes          0.661768
srate          0.661734
drate          0.66172
dbytes         0.661718
dpkts          0.661718
we compared the new mapped values of Correlation Coefficient and Joint
Entropy in order to extract a subset of 10 features which, overall, had the
best scores in both statistical measures. As such, we identified the following
best 10 features: srate, drate, rate, max, state_number, mean, min, stddev,
flgs_number, seq. Having completed the unsupervised evaluation process, we
will further evaluate the worthiness of the final 10 features in the 'supervised
evaluation' section.
We then mapped the average scores into the set [0,1] and plotted the values, in
order to identify the 10 best features.
By observing Figure 5, we can determine that the 10 best features, that is,
the 10 features with the highest combined average Correlation Coefficient and
Joint Entropy, are: seq, stddev, N_IN_Conn_P_SrcIP, min, state_number, mean,
N_IN_Conn_P_DstIP, drate, srate, max.
7.3.1.1. SVM.
The SVM model that was trained was a linear Support Vector Classifier.
The parameters of this classifier were a penalty parameter C=1, 4-fold
cross-validation and a maximum number of iterations equal to 100000, on the
final 10-best features. Similar settings were selected for the dataset version
comprised of all available features, with the only difference being that the
maximum number of iterations was set to 1000.
These settings were adjusted empirically to obtain the best performance of the
SVM model. Initially, the SVM classifier was trained with default parameters,
but it was later observed that increasing the maximum iteration number,
particularly for the second (all-features) model, caused a longer training time.
With regards to the number of folds, we observed a loss of accuracy when a
higher number of folds was chosen.
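The configuration described above maps closely onto scikit-learn's linear SVM classifier. The sketch below reproduces the stated settings (C=1, 100000 maximum iterations, 4-fold cross-validation) on placeholder data standing in for the normalized 10-best-feature matrix; it is an illustrative reading of the setup, not the authors' exact code.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Placeholder data standing in for the normalized 10-best features and binary labels
rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.integers(0, 2, 1000)

svm = LinearSVC(C=1.0, max_iter=100_000)   # penalty parameter C=1, 100000 max iterations
scores = cross_val_score(svm, X, y, cv=4)  # 4-fold cross-validation
print("mean CV accuracy:", scores.mean())
```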
Table 10: Confusion matrices of SVM models (10-best feature model first, full-feature model second).

10-best feature model:
True\Predict   Normal (0)   Attack (1)
Normal (0)     477          0
Attack (1)     426550       3241495

Full-feature model:
True\Predict   Normal (0)   Attack (1)
Normal (0)     64           413
Attack (1)     0            3668045
Figure 6: ROC curves for the SVM models. (10-best feature model on the left, full-feature model on the right)
7.3.1.2. LSTM.
The LSTM models were defined to have one input layer, with the number of
neurons equal to the number of input features, several hidden layers and an
output layer. For the 10-best feature dataset, the model was trained over 4
epochs with batch size 100. The neural network was comprised of 10 input
neurons (in the first layer, the same number as the features), intermediate
(hidden) layers with 20, 60, 80 and 90 neurons, and 1 output neuron for the
binary classification.
For our full-feature dataset, the model was trained over 4 epochs with a batch
size of 100 records. The network had a 35-neuron input layer (again, the same
number as the features of the dataset), hidden layers similar to those of the
model used for the 10-best feature version of our dataset (20, 60, 80, 90 neurons)
and 1 output neuron for the binary classification.
We initially tested the model with a batch size of 1000, but due to poor
performance, we sought a different value. It was observed that, by choosing a batch
size of 100 records, the specificity of the model was improved. In both cases, for
the input and hidden layers, the activation function that was used was 'tanh',
while the output layer activation function was 'sigmoid'. Both tanh and sigmoid
activation functions are often used for building Neural Networks, with sigmoid
being an ideal choice for binary classification, as its output is within the [0,1]
set. The structure of the LSTM model can be viewed in Figure 7 below.
Figure 7: Structure of LSTM model (in layers). (10-best feature model on the left, full-feature
model on the right)
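One plausible Keras rendering of the 10-best-feature LSTM described above is sketched below. The text lists hidden layers of 20, 60, 80 and 90 neurons, so the sketch stacks them in that order; treating each flow record as a single-timestep sequence, and the choice of optimizer and loss, are assumptions rather than details from the paper.

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

model = Sequential([
    LSTM(20, activation="tanh", return_sequences=True, input_shape=(1, 10)),
    LSTM(60, activation="tanh", return_sequences=True),
    LSTM(80, activation="tanh", return_sequences=True),
    LSTM(90, activation="tanh"),
    Dense(1, activation="sigmoid"),  # binary classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data shaped (samples, timesteps=1, features=10)
X = np.random.random((1000, 1, 10)).astype("float32")
y = np.random.randint(0, 2, size=1000)
model.fit(X, y, epochs=4, batch_size=100)  # 4 epochs, batch size 100, as described
```

Replacing the LSTM layers with SimpleRNN layers of the same sizes yields the corresponding RNN model of the next subsection.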
Next, we present the training times for both LSTM models, and the confusion
matrices followed by four metrics.
Table 11: Confusion matrices of LSTM models (10-best feature model first, full-feature model second).

10-best feature model:
True\Predict   Normal (0)   Attack (1)
Normal (0)     149          328
Attack (1)     9139         3658906

Full-feature model:
True\Predict   Normal (0)   Attack (1)
Normal (0)     430          47
Attack (1)     71221        3596824
Figure 8: ROC curve for LSTM models. (10-best feature model on the left, full-feature model
on the right)
7.3.1.3. RNN.
The RNN models were defined to have 1 input layer with the number of neurons
equal to the number of input features, two hidden layers and an output layer.
For the 10-best features dataset, the Model was trained in 4 epochs (batch size
100), had 10 input neurons (in the first layer, the same number as the features),
with hidden layers similar to the ones in the LSTM models, and 1 output neuron
for the binary classification.As with LSTM-RNN, the parameters were chosen
through experimentation. Higher values of batch size a↵ected the models speci-
ficity, as such we experimented with lower values.
For our full-feature dataset, the model was trained in 4 epochs (batch size 100),
and had a 35-neuron input layer, same number and consistency of hidden layers
as with the 10-best feature model and 1 output neuron for the binary classifica-
tion. In both cases, for the input and hidden layers, the activation function that
was used was ‘tanh’, while the output layer activation function was ‘sigmoid’.
As mentioned previously, the sigmoid function is ideal for output layers for a
binary classification problem, as its output is within the [0,1] set. Bellow the
structure of the LSTM model can be viewed in Figure 9.
Figure 9: Structure of RNN model (in layers). (10-best feature model on the left, full-feature
model on the right)
Next, we present the training times for both RNN models, and the confusion
matrices followed by four metrics.
Table 12: Confusion matrices of RNN models (10-best feature model first, full-feature model second).

10-best feature model:
True\Predict   Normal (0)   Attack (1)
Normal (0)     127          350
Attack (1)     9171         3658874

Full-feature model:
True\Predict   Normal (0)   Attack (1)
Normal (0)     379          98
Attack (1)     76718        3591327
Figure 10: ROC curve for RNN models. (10-best feature model on the left, full-feature model
on the right)
The network had 10 input neurons, and 8 hidden layers comprised of 30, 40, 40,
60, 80, 80 and 100 neurons. Finally, the output layer had 1 neuron for binary
classification. The resulting confusion matrices can be viewed in Table 14.
In the following table (Table 13), we present the four metrics, Accuracy, Precision,
Recall and Fall-out, along with the training time of all the models that were trained.
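The four metrics follow directly from the confusion matrix counts; the standard definitions are sketched below, taking Attack as the positive class.

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, Precision, Recall and Fall-out from confusion matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),   # true positive rate
        "fallout": fp / (fp + tn),  # false positive rate
    }

# Example: the full-feature SVM matrix from Table 10
print(metrics(tp=3668045, tn=64, fp=413, fn=0))
```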
Table 13: Accuracy, Precision, Recall, Fall-out and training time metrics. (In tables (b) and (c), the model was an RNN. Training time is in seconds.)
was DoS UDP-Normal traffic.
Additionally, in the binary classification, Fall-out values were rather high, with
the exception of the RNN and LSTM models for the fully-featured version of the
dataset. This could be explained by a number of factors, such as poor optimization
of the models in use and the relatively low number of epochs that we chose in
order to speed up the process. In Table 15, the parameters that were chosen for
the three models are presented.
8. Conclusion
This paper presents a new dataset, Bot-IoT, which incorporates both normal
IoT-related and other network traffic, along with various types of attack traffic
commonly used by botnets. This dataset was developed on a realistic testbed
and has been labeled, with the label features indicating an attack flow, and the
attack's category and subcategory for possible multiclass classification purposes.
Additional features were generated to enhance the predictive capabilities of
classifiers trained on this dataset. Through statistical analysis, a subset of the
original dataset was produced, comprised of the 10 best features. Finally, four
metrics were used in order to assess the validity of the dataset, specifically
Accuracy, Precision, Recall and Fall-out. We observed the highest accuracy and
recall from the SVM model that was trained on the full-featured dataset, while
the highest precision and lowest fall-out came from the SVM model of the
10-best feature dataset version. With further optimization of these models, we
argue that better results could be achieved. In the future, we plan to develop a
network forensic model using deep learning and evaluate its reliability using the
BoT-IoT dataset.
References
[5] C. Liu, C. Yang, X. Zhang, J. Chen, External integrity verification for
outsourced big data in cloud and iot: A big picture, Future generation
computer systems 49 (2015) 58–67.
[6] C. Grajeda, F. Breitinger, I. Baggili, Availability of datasets for digital
forensics–and what is missing, Digital Investigation 22 (2017) S94–S105.
[7] KDD Cup 99 dataset.
URL http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[8] I. Sharafaldin, A. H. Lashkari, A. A. Ghorbani, Toward generating a new
intrusion detection dataset and intrusion traffic characterization, in: Pro-
ceedings of fourth international conference on information systems security
and privacy, ICISSP, 2018.
[9] N. Moustafa, J. Slay, Unsw-nb15: a comprehensive data set for network in-
trusion detection systems (unsw-nb15 network data set), in: Military Com-
munications and Information Systems Conference (MilCIS), 2015, IEEE,
2015, pp. 1–6.
[10] 1998 DARPA intrusion detection evaluation data set.
URL https://www.ll.mit.edu/ideval/data/1998data.html
[11] J. Gubbi, R. Buyya, S. Marusic, M. Palaniswami, Internet of things (iot):
A vision, architectural elements, and future directions, Future generation
computer systems 29 (7) (2013) 1645–1660.
[12] S. S. Silva, R. M. Silva, R. C. Pinto, R. M. Salles, Botnets: A survey,
Computer Networks 57 (2) (2013) 378–403.
[13] S. Khattak, N. R. Ramay, K. R. Khan, A. A. Syed, S. A. Khayam, A tax-
onomy of botnet behavior, detection, and defense, IEEE communications
surveys & tutorials 16 (2) (2014) 898–924.
[14] P. Amini, M. A. Araghizadeh, R. Azmi, A survey on botnet: classification,
detection and defense, in: Electronics Symposium (IES), 2015 Interna-
tional, IEEE, 2015, pp. 233–238.
[15] G. Palmer, A road map for digital forensic research: Report from the
first digital forensic workshop, 7–8 August 2001, DFRWS Technical Report
DTR-T001-01.
[16] E. Hodo, X. Bellekens, A. Hamilton, P.-L. Dubouilh, E. Iorkyase, C. Tach-
tatzis, R. Atkinson, Threat analysis of iot networks using artificial neural
network intrusion detection system, in: Networks, Computers and Com-
munications (ISNCC), 2016 International Symposium on, IEEE, 2016, pp.
1–6.
[17] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Maciá-Fernández, E. Vázquez,
Anomaly-based network intrusion detection: Techniques, systems and chal-
lenges, computers & security 28 (1-2) (2009) 18–28.
[18] K. Wang, M. Du, Y. Sun, A. Vinel, Y. Zhang, Attack detection and dis-
tributed forensics in machine-to-machine networks, IEEE Network 30 (6)
(2016) 49–55.
[19] K. Rieck, T. Holz, C. Willems, P. Düssel, P. Laskov, Learning and classi-
fication of malware behavior, in: International Conference on Detection of
Intrusions and Malware, and Vulnerability Assessment, Springer, 2008, pp.
108–125.
[20] R. Doshi, N. Apthorpe, N. Feamster, Machine learning ddos detection for
consumer internet of things devices, arXiv preprint arXiv:1804.04159.
[26] S. Behal, K. Kumar, Detection of ddos attacks and flash events using in-
formation theory metrics–an empirical investigation, Computer Communi-
cations 103 (2017) 18–28.
[27] Ostinato.
URL https://ostinato.org/
[31] M. Tavallaee, E. Bagheri, W. Lu, A. A. Ghorbani, A detailed analysis
of the kdd cup 99 data set, in: Computational Intelligence for Security
and Defense Applications, 2009. CISDA 2009. IEEE Symposium on, IEEE,
2009, pp. 1–6.
[32] UNIBS, University of Brescia dataset (2009).
URL http://www.ing.unibs.it/ntw/tools/traces/
[33] M. H. Bhuyan, D. K. Bhattacharyya, J. K. Kalita, Towards generating real-
life datasets for network intrusion detection., IJ Network Security 17 (6)
(2015) 683–701.
[34] Center for Applied Internet Data Analysis.
URL https://www.caida.org/data/
[35] Lawrence Berkeley National Laboratory (LBNL), ICSI, LBNL/ICSI enterprise
tracing project (2005).
URL http://www.icir.org/enterprise-tracing/
[36] Canadian Institute for Cybersecurity, University of New Brunswick, ISCX dataset.
URL http://www.unb.ca/cic/datasets/index.html
[37] A. Ammar, A decision tree classifier for intrusion detection priority tagging,
Journal of Computer and Communications 3 (04) (2015) 52.
[38] P. Gogoi, M. H. Bhuyan, D. Bhattacharyya, J. K. Kalita, Packet and flow
based network intrusion dataset, in: International Conference on Contem-
porary Computing, Springer, 2012, pp. 322–334.
[39] Node-red tool.
URL https://nodered.org/
[40] Argus tool.
URL https://qosient.com/argus/index.shtml
[41] ESXi hypervisor.
URL https://www.vmware.com/au/products/esxi-and-esx.html
[42] vSphere client.
URL https://www.vmware.com/au/products/vsphere.html
[43] AWS IoT hub.
URL https://aws.amazon.com/iot-core/features/
[44] Mosquitto MQTT broker.
URL https://mosquitto.org/
[45] G. Hall, Pearson's correlation coefficient, Other Words 1 (9).
[46] A. Lesne, Shannon entropy: a rigorous notion at the crossroads between
probability, information theory, dynamical systems and statistical physics,
Mathematical Structures in Computer Science 24 (3).
[47] Cron scheduling package.
URL https://packages.ubuntu.com/search?keywords=cron
[48] Tshark network analysis tool.
URL https://www.wireshark.org/
[62] Y. Zheng, C. K. Kwoh, A feature subset selection method based on high-
dimensional mutual information, Entropy 13 (4) (2011) 860–901.
[63] D. Meyer, F. T. Wien, Support vector machines, R News 1 (3) (2001)
23–26.