Distributed Denial of Services (DDoS) & IoT Botnet Malware Identification Using Machine Learning & Deep Learning Models
Teekam Singh
Graphic Era Deemed to be University,
Dehradun, India
Abstract— In this work, the detection of distributed denial of service (DDoS) and IoT botnet attacks is performed using machine learning (ML) and deep learning (DL) models. DDoS attack and IoT botnet datasets are used to implement the proposed work. The instances were collected by running Mirai and BASHLITE. The dataset comprises 7999 instances, and each instance has 29 attributes. The collected instances are pre-processed to eliminate redundant attributes, which leaves a final set of 10 attributes for the experiments. The dataset is then divided into training and testing sets. The machine learning models (KNN classifier, logistic regression, SVM, and random forest) and the deep learning models (CNN and LSTM) are trained on the training set and validated on the testing set. The experiments show that the deep learning-based LSTM model obtained the best performance in terms of accuracy, with a testing accuracy of 99.80 % and a training accuracy of 99.82 %.
Keywords— Internet of Things (IoT), IoT botnet, distributed denial of service (DDoS) attacks, machine learning, CNN, LSTM.
I. INTRODUCTION
The Internet of Things (IoT) is a system of interconnected computing devices, mechanical and digital machines, people, and human-made objects that can perform meaningful tasks in a well-defined manner and can send or receive data over a network without human-to-computer interaction. IoT enables new computing architectures that are longer-lived, real-time, and autonomous. IoT devices are connected to the internet and can therefore send and receive data. They collect data from the environment through various sensors, including temperature, motion, and sound sensors. The data can be analysed to develop insights, support decision-making, and drive automation. IoT also makes it possible to automate various activities easily [1]. Examples include a homeowner setting an optimum temperature for the home or office, a smart refrigerator that tracks its contents and reorders whenever necessary, and the automation of a smart city’s traffic regulation systems. IoT devices can also communicate with each other and with other systems and, as a result, enable collaborative actions and complex interactions.
An IoT botnet is a group of Internet of Things devices that have been compromised by malware and are controlled by an attacker from a remote location. An attacker can use the botnet for DDoS attacks [1], [2], spam email campaigns, data theft, and other malicious activities. The most important issue is security and privacy. IoT devices are prone to hacking and unauthorized data breaches, which puts people at risk of having their personal information leaked or losing control of their systems. Moreover, because of the large volume of data these devices handle, data privacy is also challenging. Another security issue in IoT is the lack of established standards and principles for interoperability and online security. Figure 1 represents the structure of the IoT environment.
Fig. 1. Structure of the IoT environment
IoT environments are also deeply affected by DDoS attacks [3]. “In a DDoS attack, a goal is bombarded with traffic so large that their normal operation is disrupted multiple times”. Because of their wide deployment, weak security, and limited computing power, IoT devices are prime candidates for being infected and used in a botnet for DDoS attacks. To perform DDoS attacks, a criminal creates an IoT botnet: many IoT devices that have fallen under the attacker’s control because of software vulnerabilities or weak passwords join the botnet. The attacker then establishes control over the botnet by setting up a command-and-control (C&C) server, through which the attacker communicates with the infected IoT devices and instructs them to send traffic to the target so that it exceeds its capacity. The last step is the selection of a target to overload [4], [5]. The botnet then sends traffic to the attacker-selected target to cause it to malfunction. The major impacts of DDoS attacks include service disruption, financial loss, and extensive resource consumption.
II. LITERATURE SURVEY
This section reviews the advances in employing ML and DL methods (CNN & LSTM) to identify DDoS attacks in IoT networks. Research on new models and algorithms to improve the detection and prevention of these attacks on IoT networks is still ongoing. Recent works have shown promising results for ML and DL models (CNN & LSTM) in identifying DDoS attacks in different network environments, particularly in IoT [5]. Numerous algorithms and feature selection methods have been implemented, with varying levels of detection accuracy and timing.
The study in [6] recognizes the need to improve intrusion detection systems in lightweight IoT networks and introduces a novel data pre-processing technique. The approach addresses the peculiarities of IoT networks by accounting for the combined constraints of the IoT devices, and it promotes cybersecurity through successful detection of DDoS attacks. The experiments used the TON-IOT and BOTNET-IOT datasets, with binary and multi-class classification models to separate the DDoS attacks from the other two types of attacks.
The study in [7] investigates the use of machine learning and deep learning techniques to identify DDoS attacks and distinguish their impact in IoT networks. Adequate detection techniques must be applied to establish whether or not the network has experienced an attack. Detection relies on artificial intelligence techniques, chiefly machine learning and deep learning. Supervised machine learning models use structured data to learn, recognize patterns, and predict outcomes.
The paper [8] shows that an autonomous defence system using edge computing and a 2D-CNN can correctly and autonomously recognize attacking patterns, transmitting modes, and normal mode. In that work, an independent, self-governing defence model based on edge computing and a 2D-CNN recognizes DDoS attacks in an IoT environment. The 2D-CNN attained training accuracies of 99.50 % and 99.8 % for network packet traffic and network packet feature identification, respectively.
The study in [9] presents a novel intrusion detection system based on ML and CNN models that implicitly differentiates records by their timestamps and achieves an average accuracy above 99 % with three distinct attribute sets, for both two-class and multi-class classification. The procedure in that work does not avoid territories between features, as triggered by the flow data generator.
The study in [10] detects eleven DDoS attacks from multiple DDoS attack datasets using six ML classification algorithms. It used CICDDoS2019, one of the datasets obtained from the CICDS. The classification methods were tested on each DDoS attack to determine the optimal classification algorithm. More generally, the work assesses the efficiency of ML-based classification algorithms in the detection of DDoS attacks. It used performance metrics (accuracy, precision, recall, and F1-score) to determine the suitability of each model.
III. MATERIALS & METHODOLOGY
A. Dataset Description
The dataset was collected from a Mirai-type botnet attacking an emulated IoT network in OpenStack. Mirai is a well-known botnet that exploits common vulnerabilities of Internet of Things devices, such as factory-default passwords, obsolete firmware, and exposed network services. According to CISA, the attackers behind Mirai gain access to large numbers of vulnerable IoT devices [5]. The actor then uses this acquired power to launch DDoS attacks and other malicious activities. The dataset comprises 7999 instances, each consisting of twenty-nine attributes. Not all attributes are important for the detection of attacks, so only ten attributes are selected. The correlation matrix of the selected attributes is given in Figure 2.
Fig. 2. Correlation matrix of the selected feature set.
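The dataset description states that redundant attributes were removed and reports a correlation matrix, but it does not give the selection rule. One common way to flag redundant attributes is pairwise Pearson correlation, sketched below under that assumption. The column names, the toy values, and the 0.95 threshold are hypothetical and are not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if |r| with every already-kept column is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy attributes (assumption: not the paper's real attributes).
toy = {
    "pkt_count": [1, 2, 3, 4, 5],
    "byte_count": [10, 20, 30, 40, 50],  # linear in pkt_count, so redundant
    "duration": [5, 1, 4, 2, 3],
}
selected = drop_redundant(toy)
print(selected)
```

Here `byte_count` is dropped because it is perfectly correlated with `pkt_count`, while `duration` is kept. A greedy filter like this depends on column order, so it is only one of several reasonable selection rules.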
B. Proposed model
The proposed model comprises data preprocessing, dataset splitting, model training, and the model’s decision. The flowchart of the proposed model is given in Figure 3. Each stage is described in detail below.
C. Data preprocessing
Data preprocessing is one of the essential steps in developing a model for identifying DDoS and IoT botnet attacks using machine learning; it converts raw data into clean, formatted data for modelling. High-quality data preprocessing can significantly improve the performance of the learning model and lead to accurate and valid results. Data preprocessing covers data collection, data cleaning, data normalization, and data splitting. Data collection gathers data from different sources such as databases, APIs, or files; the data should be representative of the problem to be solved. Data cleaning replaces missing values with a given value such as the mean, median, or mode, or deletes the rows or columns that have missing values; outliers can be removed or transformed. Data normalization scales the data to a standardized range of 0 to 1. Data splitting divides the data into training and testing sets in the ratio of 80:20.
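The cleaning, normalization, and splitting stages described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation. The function names, the toy column, the choice of mean imputation, and the random seed are all assumptions.

```python
import random

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into (train, test) by test_ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

# Hypothetical toy column with one missing value.
raw = [2.0, None, 4.0, 10.0, 6.0]
clean = impute_mean(raw)        # None becomes the mean of 2, 4, 10, 6
scaled = min_max_scale(clean)   # every value now lies in [0, 1]
train, test = train_test_split(range(10))
print(clean, scaled, len(train), len(test))
```

With this rounding, 7999 instances would give 6399 training and 1600 testing rows; the paper does not report its exact counts. In practice the minimum and maximum are usually computed on the training set only and then reused for the testing set, so that no information from the test data leaks into training.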
D. Classification Models
KNN classifier: K-Nearest Neighbours (KNN) is a simple and straightforward supervised ML algorithm used for classification tasks. KNN identifies [11] the category of a new observation based on its proximity to known observations. It is used in supervised learning tasks, where the learning algorithm is given a labelled dataset. That is, it labels input data, assigning it to a label such as a binary category, a category name, or an anomaly flag, to produce a model that can be used to predict a target outcome. The steps involved in this model are as follows:
Input: A dataset with feature values and the
corresponding labels.
Normalize Data: It is useful, though not strictly necessary, to standardize or normalize the data so that no single feature dominates the distance metric.
Choose k: For KNN with two classes, selecting an odd value of k avoids ties in the classification vote.
Find Nearest Neighbours: For a new data point, the distance to each training data point is calculated and the k nearest neighbours are selected.
Classification: The new data point is assigned the majority class among its k nearest neighbours.
Logistic regression: Logistic regression (LR) is a simple yet widely used ML algorithm for both binary and multiclass classification [12]. Its objective is to estimate the probability that a given input belongs to a certain class. It is mainly used for binary classification, that is, classifying a set of data into only two classes. It can be extended to multi-class classification problems with a technique such as one-vs-all or one-vs-one, or directly with softmax regression. LR uses the logistic function, also called the sigmoid function, for the classification task [12]. The output of LR is modelled as the probability that an input belongs to a certain class. The logistic function, an S-shaped curve, maps the output of a linear function to a probability value in the range 0 to 1; any real number is squashed into that range. It is defined as follows, where $z$ is the output of the linear function:
$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (1)$$
(2)
(3)
(4)
(5)
the dataset when training each decision tree into each bootstrapped sample.
Convolutional Neural Network: Convolutional neural networks (CNNs) are DL-based models most frequently used for tasks that involve image data [15]. Although they are also used for other types of data, such as text (NLP) and time series, they have shown outstanding performance on spatial data. A CNN performs automatic feature extraction, learning a compact description of the features of the data. CNNs [15] are powerful neural networks that are particularly useful for processing and analysing spatial data, most notably images. They are effective at finding complex patterns and relationships in the data and can be transferred to other domains; however, they require a carefully designed architecture and advanced techniques to perform well. Figure 4 shows the architecture of the CNN.
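Returning to the two classical classifiers described earlier in this section, the KNN steps (distance computation, selection of the k nearest neighbours, majority vote) and the logistic function of logistic regression can be sketched as follows. This is a minimal standard-library illustration, not the authors' code. The toy points, the labels `benign` and `attack`, and the weights and bias are hypothetical, and the features are assumed to be already scaled to the range 0 to 1.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def sigmoid(z):
    """Logistic function: maps any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)

def lr_probability(weights, bias, x):
    """Probability of the positive class under a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical, already-normalized two-feature training points.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
           (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
train_y = ["benign"] * 3 + ["attack"] * 3

print(knn_predict(train_x, train_y, (0.8, 0.85), k=3))
print(lr_probability([2.0, -1.0], 0.5, (1.0, 2.5)))  # hypothetical weights
```

An odd k such as 3 avoids tied votes when there are only two classes. For logistic regression, a probability above a chosen cut-off (0.5 is a common default) is mapped to the positive class.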