
Volume 9, Issue 10, October 2024    International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165    https://doi.org/10.38124/ijisrt/IJISRT24OCT547

A Data-Driven Approach for Classifying and Predicting DDoS Attacks with Machine Learning

Prinshu Sharma (1); Unmukh Datta (2) (Professor)
(1, 2) Computer Science, Maharana Pratap College of Technology,
Rajiv Gandhi Proudyogiki Vishwavidyalaya, Gwalior

Abstract:- The importance of IoT security is growing as a result of the growing number of IoT devices and their many applications. Distributed denial of service (DDoS) assaults on IoT systems have become more frequent, sophisticated, and of a different kind, according to recent research on network security, making DDoS one of the most formidable dangers. Real, lucrative, and efficient cybercrimes are carried out using DDoS attacks. One of the most dangerous types of assaults in network security is the DDoS attack. ML-based DDoS-detection systems continue to face obstacles that negatively impact their accuracy. AI, which incorporates ML to detect cyberattacks, is the most often utilised approach for these goals. In this study, it is suggested that DDoS assaults in Software-Defined Networking be identified and countered using ML approaches. The F1-score, recall, accuracy, and precision of several ML techniques, including Cat Boost and the Extra Tree classifier, are compared in the suggested model. DDoS-Net is designed to handle data imbalance effectively and incorporates thorough feature analysis to enhance the model's detection capabilities. Evaluation on the UNSW-NB15 dataset demonstrates the strong performance of DDoS-Net: the highest accuracies achieved by the Cat Boost and Extra Tree classifiers are 90.78% and 90.27% respectively on this widely used dataset. This work presents a robust and precise approach for DDoS attack detection, which improves the cybersecurity environment and strengthens digital infrastructures against these ubiquitous threats.

Keywords:- Denial-of-Service (DoS), Attack, Classification, Identification, Machine Learning.

I. INTRODUCTION

These days, almost every aspect of contemporary life is impacted by the "IoT" [1]. A diverse array of devices comprise the IoT, each with a different technical background, which leaves them open to potential security risks. Each entity has different security basics and qualities, so it has become difficult to find a single solution that can safely solve every issue. Attackers may choose to target IoT devices due to insufficient security infrastructure. Furthermore, the Internet's service offering makes it possible to conduct banking and financial operations, communicate, engage in e-commerce, shop, make payments online, access healthcare, and get an education online [2]. The aforementioned services are particularly susceptible to cyberattacks due to their extensive use. The most prevalent and deadly kind of cyberattack is the DDoS attack [3], and numerous services are being interrupted by it.

Denial of service (DoS) describes what happens when a system delivers a malicious message to a server. When several hacked systems or computers launch DoS assaults against a single application, it is known as a DDoS attack. A deluge of packets from all corners of the globe is then sent towards the designated network. The proliferation of disruptive Internet technologies is causing DDoS assaults to evolve and grow in both number and sophistication [4][5]. Cyber threats that might seriously affect a business's operations include ransom demands from attackers, data theft, and disruptions.

Responding quickly to DDoS assaults is the best way to prevent them. Internet-connected devices have become a more appealing target for cyberattacks due to the expanding use of the internet. As ML and DL [6][7] reveal their enormous potential in multiple areas, academia and industry are investigating the notion of using these technologies for DDoS detection. Traditional approaches are slower and less accurate when it comes to risk detection; using an ML method, threats may be identified, and DL may thus also be a useful DDoS detection technique.

 Contribution of Study
This research contributes to the field of cybersecurity by implementing ML techniques for the classification and prediction of DDoS attacks. This study's main contributions are:

 Implementation of ML models for DDoS attack detection and classification with the UNSW-NB15 dataset.
 Feature selection using the Select K-Best method with the ANOVA F-test to identify relevant features.
 Data normalization using the Min-Max Scaler to ensure consistent data scaling.
 Application of Cat Boost and ETC for robust prediction performance.
 Metrics for assessing the models' efficacy, including F1-score, recall, accuracy, and precision.

 Structure of Paper
For the sections that follow, this study is organised as follows: In Section 2, the study's context is examined. Section 3 provides a full approach for this investigation. In Section 4,
IJISRT24OCT547 www.ijisrt.com 633



the study's conclusions and assessments are discussed. Findings from the study and recommendations for the future are given in Section 5.

II. LITERATURE REVIEW

Machine learning/deep learning (ML/DL) has previously been shown to be an effective method for identifying DDoS assaults. Some of the previous researchers' work is explained below:

In this research, Jiyad et al. (2024) present a novel ensemble model that can identify DDoS attacks. The approach leverages ML algorithms such as LR, RF, DT, and XGBoost classifiers to detect and classify these malicious attacks effectively. The research uses the potent explainable Artificial Intelligence (XAI) models SHAP and LIME. By utilizing SHAP and LIME's capabilities, the ML models' readability and transparency are improved, giving a better understanding of difficult predictions and model behavior. The evaluation results demonstrate that the XGBoost ensemble model outperforms the other classifiers, achieving an impressive accuracy rate of 97%, with an outstanding F1-score of 97%. The precision and recall are 98% and 96% respectively [8].

In this research, Al-Eryani, Hossny and Omara (2024) focus on providing a comparative study between recent ML algorithms that were tested using the CICDoS2019 dataset. The objective of this comparison is to determine the most effective ML algorithm for DDoS detection. Based on the comparative study results, it is found that the Gradient Boosting (GB) and XGBoost algorithms are extraordinarily accurate and correctly predicted the type of network traffic with 99.99% and 99.98% accuracy respectively, in addition to a low false alarm rate of approximately 0.004 for GB [9].

In this research, Kaur, Sandhu and Bhandari (2023) developed effective ML classifiers utilising attributes from the SDN dataset to identify DDoS assaults at the application layer. To narrow down the feature set of the data, they used ICA, PCA, and LDA. Furthermore, ML classifiers are developed using the extracted characteristics, and DDoS attack prediction is carried out at the application layer. Out of 13 features, one was recovered using the LDA model, which provides the highest detection accuracy possible for the classifiers in use. Results are analysed by comparing the suggested work to earlier research. The study's result analysis using DT, RF, and SVC achieves up to 99.6% [10].

In this research work, Patil et al. (2022) create a model based on ML to forecast DDoS flooding assaults. The expected DDoS flooding assaults encompass several kinds. These assaults were classified using ML models such as decision tree classifiers, MLP, KNN, and LR. A Jupyter notebook with the necessary Python libraries loaded was used for the implementation. KNN and DTC have shown almost identical performance, with the highest accuracy of 99.98 percent, in predicting TCP and ICMP flooding attacks out of these four classifiers. When it came to predicting UDP flooding attacks, the DTC performed best, with an accuracy rate of 77.23 percent [11].

Cybersecurity is a critical topic in the field of internet security (Tufail, Batool and Sarwat, 2022). Cyberattacks affect many industries, with thousands occurring every year. DDoS and FDIA are two of the most deadly cyberattacks. Two machine learning techniques, LR and SNN, were compared in this research in order to predict DDoS assaults. Accuracy of 99.85% was attained for the SNN and 98.63% for logistic regression. In contrast to logistic regression, the analysis reveals that the SNN required a significantly longer training period [12].

Despite significant advancements in machine learning techniques for DDoS attack detection and classification, several gaps remain in the current research. While numerous studies have demonstrated high accuracy using various algorithms, there is a lack of comprehensive comparison across diverse datasets and attack types. Studies showcasing impressive performance with XGBoost and Gradient Boosting do not address performance consistency across different attack scenarios. Additionally, research often focuses on specific attack types or datasets but lacks a holistic approach incorporating a wide range of attacks and feature reduction techniques. Furthermore, the computational efficiency and scalability of models are not thoroughly explored. Closing these shortcomings could improve DDoS detection systems' resilience and applicability. For a detailed overview of related work, refer to Table 1.

Table 1 Related Work on DDoS Attacks using Machine and Deep Learning Techniques

Ref | Methods | Dataset | Performance | Limitation/Remarks
Jiyad et al. (2024) | LR, RF, DT, XGBoost + SHAP, LIME (XAI tools) | Custom dataset | XGBoost: Accuracy 97%, F-score 97%, Precision 98%, Recall 96% | Limited to a specific dataset; lacks real-time implementation analysis
Al-Eryani, Hossny, and Omara (2024) | Gradient Boosting, XGBoost | CICDoS2019 | GB Accuracy: 99.99%, XGBoost Accuracy: 99.98% | Focuses only on ML algorithms; no DL models explored
Kaur, Sandhu, and Bhandari (2023) | PCA, LDA, ICA with Decision Tree, Random Forest, SVM | SDN dataset | LDA Accuracy: 99.6% with ML classifiers | Limited to application-layer DDoS attacks; lacks DL exploration
Patil et al. (2022) | LR, KNN, MLP, DT | Custom dataset | KNN & Decision Tree: 99.98% (TCP/ICMP attacks), Decision Tree: 77.23% (UDP attacks) | Lower accuracy for UDP attack prediction (77.23%); only classical ML methods




Tufail, Batool, and Sarwat (2022) | Logistic Regression, Shallow Neural Network (SNN) | Custom dataset | SNN Accuracy: 99.85%, Logistic Regression: 98.63% | High training time for SNN; no other DL models evaluated

III. METHODOLOGY

There are numerous stages and phases included in the strategy that has been presented. Machine learning methodologies and techniques are utilized in DDoS attack classification and prediction. For this project's implementation, the Python programming language was used. The implementation additionally makes use of Python packages and libraries, including NumPy, Seaborn, Matplotlib, Pandas, etc. The proposed methodology's first step is data collection. This study uses the UNSW-NB15 dataset, obtained from the Kaggle website. After data collection, preprocessing is conducted to check the dataset's shape, remove missing or duplicate values, and perform label encoding on categorical columns. Then the feature selection task is performed using the Select K-Best method with the ANOVA F-test. Next, the data are normalized with the Min-Max Scaler method. After that, the dataset is split into 80% for training and 20% for testing. For classification, Cat Boost and Extra Tree classifiers are used to predict DDoS attacks. Finally, the models' effectiveness is determined using F1-score, recall, accuracy, and precision as performance metrics. The flowchart in Figure 1 outlines the stages and subsequent steps of the suggested methodology.

 Data Collection
For classification and prediction of DDoS attacks, data collection is the very first step. This study collects the UNSW-NB15 dataset¹ from publicly available sources. This dataset contains the following nine types of attacks: exploits, worms, shellcode, DoS, backdoors, fuzzers, analysis, generic, and reconnaissance. To generate 49 characteristics with the class label, twelve algorithms are constructed in conjunction with the Argus and Bro-IDS tools. A total of 2,540,044 records are kept in four CSV files: UNSW-NB15_1.csv, UNSW-NB15_2.csv, UNSW-NB15_3.csv, and UNSW-NB15_4.csv.

 Data Preprocessing
Data preparation eliminates confusing data from the acquired dataset; without it, accuracy and prediction rate are reduced. It is necessary to exclude the possibility of human error as the cause of data loss prior to training the model. Datasets undergo further preprocessing after collection to eliminate duplicate or missing values. The dataset is then used for training the model after unnecessary values have been removed. The further preprocessing steps are defined below:

¹ https://www.kaggle.com/datasets/mrwellsdavid/unsw-nb15?select=UNSW_NB15_training-set.csv




Fig 1 Proposed Flowchart for DDoS Attacks Prediction
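The pipeline in Figure 1 can be sketched with scikit-learn. This is a minimal, illustrative version only: the DataFrame below is a synthetic stand-in for the UNSW-NB15 CSV (the column names merely echo some of its fields), and k=3 is an arbitrary choice, not the value used in the paper.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Synthetic stand-in for the UNSW-NB15 CSV; in the real pipeline this
# would be pd.read_csv("UNSW-NB15_1.csv"). Column names are illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "dur":    rng.random(200),                         # flow duration
    "sbytes": rng.integers(0, 10_000, 200),            # source bytes
    "dpkts":  rng.integers(0, 100, 200),               # destination packets
    "proto":  rng.choice(["tcp", "udp", "arp"], 200),  # categorical column
    "label":  rng.integers(0, 2, 200),                 # attack / normal
})

# 1) Preprocessing: drop duplicate and missing rows.
df = df.drop_duplicates().dropna()

# 2) Label-encode the categorical column.
df["proto"] = LabelEncoder().fit_transform(df["proto"])

# 3) Feature selection with SelectKBest + the ANOVA F-test (k is arbitrary here).
X, y = df.drop(columns="label"), df["label"]
X_best = SelectKBest(f_classif, k=3).fit_transform(X, y)

# 4) Min-Max normalisation to [0, 1], as in equation (1).
X_scaled = MinMaxScaler().fit_transform(X_best)

# 5) 80/20 train-test split.
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, test_size=0.2,
                                          random_state=42)
print(X_tr.shape, X_te.shape)
```

Each step mirrors one box of the flowchart; on the real dataset only the loading line and the choice of k change.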

 Label Encoding on the Categorical Column
Categorical variables are those that can take on a small, fixed range of values. Some examples of these factors include colour (red, blue, green), size (small, medium, big), and location (city, suburban, rural, etc.) [13]. Encoding categorical variables may be done in a number of ways.

Label Encoding is one approach; it entails assigning a numeric value to each separate category. For a colour characteristic that includes green, blue, and red categories, for example, the corresponding encoded values would be 0, 1, and 2, respectively. Keep in mind that this method may mislead the model if it unintentionally implies an ordinal connection among the numerical variables.

 Feature Selection using Select K-Best with ANOVA F-Test
The first step is to partition the dataset according to the features and the variable of relevance [14]. After that, the most significant features are found by using the SelectKBest technique combined with the ANOVA F-test. Select the desired




number of features to be preserved. To find the best features, the SelectKBest technique takes each feature's score relative to the target variable and uses that score to choose the top k features [15]. To improve the model's performance, this method focuses on the features that are most strongly related to the dependent variable.

 Normalization with Min-Max Scaler
Normalisation, or Min-Max scaling, is a commonly used method. To make values lie between 0 and 1, this approach adjusts and rescales the values [16]. Formula (1) is used to do the transformation.

x′ = (x − x_min) / (x_max − x_min) … (1)

where x′ stands for the normalized value, x for the original value, and x_max and x_min for the maximum and minimum values of the corresponding feature.

 Train-Test Split
A dataset's ability to be divided into training and testing portions is crucial for both model assessment and a deeper understanding of the properties of models. The ML model is fitted using the train dataset, whereas the test dataset is utilized to evaluate the ML model. In this study, 80% of the data have been used for training and 20% for testing for better performance.

 Classification Models
The proposed method includes machine-learning algorithms. This study uses the Cat Boost and Extra Tree classifiers for DDoS attack prediction. Each classifier is described below:

 Extra Tree Classifier
The RF model served as the initial inspiration for the development of the Extra Tree classifier (ETC) technique, which was proposed by [17]. The ETC algorithm creates a set of unpruned decision or regression trees in accordance with the traditional top-down methodology. The RF model uses bootstrapping and bagging, respectively, in two phases to achieve the regression. During the bootstrapping phase, a random training dataset sample is used to fuel the development of each individual tree, resulting in a collection of decision trees. After the DT nodes reach the ensemble, they are divided into groups using the two-step bagging phase. Many subsets of training data are chosen at random in the initial bagging stage. A choice is finished when the optimal subset and its value are selected.

The RF model is made up of a series of decision trees, where the r-th prediction tree is represented by G(x, θ_r), and θ is a uniform independent distribution vector that is provided before the tree develops. By averaging the trees, equation (2) builds an ensemble of trees G(x), therefore forming a forest.

G(x, θ_1, …, θ_R) = (1/R) Σ_{r=1}^{R} G(x, θ_r) … (2)

The ETR and RF systems differ from one another in two important ways. The ETR first separates nodes by randomly selecting a subset of all the cutting points. Secondly, to reduce bias, it cultivates the trees using all of the learning samples. Two parameters control the splitting procedure in the ETR approach: k, the number of attributes randomly picked at each node, and nmin, the minimum sample size needed to split a node. Respectively, k and nmin dictate the intensity of the attribute selection and the average output noise strength. The ETR model's accuracy is increased and overfitting is decreased by these two parameters [18][19].

 Cat Boost Classifier
Cat Boost is a GBDT system that uses a lightly parameterised oblivious tree as its base learner. It achieves good accuracy and supports categorical variables. It improves the algorithm's accuracy and applicability by training a sequence of learners sequentially using the boosting approach and then accumulating their results [20]. Consider a training set of n samples with labelled values and m-dimensional input features. After the training is complete, a powerful learner is created. The goal of the k-th round of training is to choose a tree t_k from the CART decision tree set T that minimises the expectation of the loss function:

t_k = argmin_{t∈T} E[L(y, F_{k−1}(x) + t(x))] … (3)

The samples used for testing are separate from those used for training. Model M, shown in equation (4), is generated from the initial weak learner M_0 and the per-round training step sizes a_k after N iterations. To match the trained CART decision tree, the negative gradient of the loss function is used.

M = M_0 + Σ_{k=1}^{N} a_k t_k … (4)

In comparison to previous boosting algorithms, Cat Boost improves upon the classic GBDT and introduces the following new features:

 The Cat Boost algorithm incorporates ordered boosting to counteract the training set's noise points [21];
 To improve the direct support for categorical features, Cat Boost automatically uses the Ordered TS approach to transform them to numerical features;
 The introduction of categorical feature combinations further enhances the feature dimension in Cat Boost; and
 Based on a completely symmetric tree, it applies the same splitting criterion to each layer, leading to faster predictions and more stability [22].

IV. EXPERIMENT AND DISCUSSION

This work streamlines package management and distribution using the widely-used scientific computing programming language, Python. This system comes pre-installed with essential machine learning libraries such as



Keras, Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn, and TensorFlow, enabling efficient model development and data processing. The hardware setup for the pre-processing phase includes a system equipped with an Intel(R) Core(TM) i3-6100U CPU @ 2.30GHz, 2304 MHz, 2 cores, and 4 logical processors, along with 8 GB of RAM and a 256 GB SSD. Additionally, for computationally intensive tasks, Google Research provides access to dedicated GPUs and TPUs, enhancing the performance of the ML models used in this project.

 Exploratory Data Analysis
This section of the research uses exploratory data analysis (EDA) to look at the data closely. To facilitate understanding, this study employs a graphical representation of the data. EDA is used to explore the data and gather a synopsis of the most important findings; its statistical insights and visualizations help to find patterns or trends. The following data visualization graphs are provided in this section.
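Count plots like the ones shown below take only a few lines. The sketch here uses a tiny illustrative series in place of the real encoded "service" column, and renders off-screen; seaborn's countplot is a one-line equivalent.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the label-encoded 'service' column of UNSW-NB15.
service = pd.Series([0, 0, 0, 1, 2, 2, 3, 0, 1, 4])

counts = service.value_counts().sort_index()  # one bar per service code
ax = counts.plot(kind="bar")                  # seaborn.countplot(x=service) is equivalent
ax.set_xlabel("service")
ax.set_ylabel("count")
plt.savefig("service_countplot.png")
print(counts.to_dict())
```

The same pattern, applied per column, yields Figures 2 to 4.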

Fig 2 Count Plot for Distribution of Service on UNSW_NB15 Data

Figure 2 represents the count plot for the distribution of service on the UNSW_NB15 data. Values on the "count" y-axis may go up to 40,000, while values on the "service" x-axis range from 0 to 6. The tallest bar corresponds to service value "0," indicating the highest count (well above 40,000).

Fig 3 Count plot for Distribution of state on UNSW_NB15 data




The distribution of seven network traffic states is shown in Figure 3 by the count plot of the UNSW_NB15 dataset. The x-axis represents "state," and the y-axis indicates "count." The first two states have significantly higher counts (around 40,000 and 35,000), while the remaining states range from 10,000 down to 5,000, and the last state has a count of 0.

Fig 4 Count Plot for Distribution of Attack_cat on UNSW_NB15 Data

The bar graph in Figure 4 displays the distribution of attack_cat on the UNSW_NB15 data, with the 9 different attack categories on the x-axis and their respective counts on the y-axis. The first bar is significantly taller, indicating a higher frequency for that attack category. Although the exact labels for the categories are not visible, the graph effectively shows the overall distribution of cyber-attacks within the dataset.

Fig 5 Box Plot for Features in UNSW_NB15 Data




The box plot for features in the UNSW_NB15 dataset, shown in Figure 5, displays various features on the x-axis, such as 'dur', 'spkts', 'dpkts', and 'sbytes', while the y-axis, scaled logarithmically, shows the values of these features. Each box represents the distribution of a feature, indicating the median (line inside the box), quartiles (box edges), and potential outliers (dots beyond the whiskers). This visualization facilitates quick comparison of central tendency, variability, and outliers across different features.

Fig 6 Feature Importance Score Graph

Figure 6 displays the feature importance score graph generated by SelectKBest. The y-axis represents various features (such as 'ct_dst_sport_ltm', 'ct_src_dport_ltm', etc.). The x-axis shows the importance scores, ranging from 0 to 8000. Each feature has a corresponding bar, with its length indicating its importance score.

 Evaluation Parameter
Model performance may be better understood with the use of evaluation metrics. The ability of evaluation metrics to differentiate between different model outputs is a key feature. In general, the values used to compute these measures are obtained from the confusion matrix (see Figure 7 below), which displays the correctness of the model in a very intuitive way. This matrix is N × N, where N is the number of predicted classes.

A four-class classification system divides instances (examples) into four separate groups: Class A, Class B, Class C, and Class D. Positive (1) and negative (0) stand for the predicted values, whereas true (1) and false (0) indicate the actual values. Estimates of the classification models are derived using the confusion matrix quantities TP, TN, FP, and FN.

 Accuracy
The percentage of correct forecasts compared to the total number of predictions is known as accuracy. Equation (5) is used to calculate accuracy.

Accuracy = (TP + TN) / (TP + TN + FP + FN) … (5)
 Recall
Recall, which may be expressed as a ratio of positively
categorised samples to the total number of samples in the real
class (including both TP and FN samples), is given by equation
(6).

Recall = TP / (TP + FN) … (6)

Fig 7 Representation of Confusion Matrix




 Precision
Precision measures how many of the samples flagged as positive (TP and FP combined) were properly detected. The focus is mostly on how well the model detects positive samples. The formula follows in (7).

Precision = TP / (TP + FP) … (7)

 F1 Score
Precision and recall are the two main components of the F1 score. The F1-score accounts for misclassified samples that are FP as well as FN, weighting both error types equally, which gives a balanced measure of detection quality. It is computed with formula (8):

F1 = 2 × (precision × recall) / (precision + recall) … (8)

The F1-score ranges from 0 to 1; analysing the model's proximity to 1 is another way to judge its efficiency.

 Results Analysis
The performance of the proposed Extra Tree and Cat Boost models across the performance parameters is provided in this section. Table 2 below provides the model performance, which shows that both models achieve high performance across the performance parameters. The ETC model achieves 90.27% accuracy and Cat Boost achieves 90.78% accuracy.
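Equations (5) to (8) can be checked with a small worked example; the TP/TN/FP/FN counts below are illustrative only, not taken from the paper's experiments.

```python
# Worked example of equations (5)-(8) for a single binary class,
# using illustrative confusion-matrix counts.
TP, TN, FP, FN = 80, 90, 10, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)                # equation (5)
recall    = TP / (TP + FN)                                 # equation (6)
precision = TP / (TP + FP)                                 # equation (7)
f1        = 2 * precision * recall / (precision + recall)  # equation (8)

print(accuracy, recall, precision, f1)
```

With these counts, accuracy is 0.85, recall 0.80, precision about 0.889, and F1 about 0.842; sklearn.metrics computes the same quantities directly from label arrays.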

Table 2 Proposed Model Performance on the UNSW_NB15 Dataset

Performance metric | ETC | Cat Boost
Accuracy | 90.27 | 90.78
Precision | 89.86 | 90.58
Recall | 90.27 | 90.78
F1-score | 89.89 | 90.37
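Training and scoring one of the proposed classifiers can be sketched as below. The data come from make_classification as a synthetic stand-in for the preprocessed UNSW-NB15 features, and the hyperparameters are defaults, not the paper's settings, so the accuracy will not match Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic multi-class data standing in for the preprocessed features.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=4, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

# Extra Tree classifier; hyperparameters here are defaults, not the paper's.
etc = ExtraTreesClassifier(n_estimators=100, random_state=7)
etc.fit(X_tr, y_tr)
acc = accuracy_score(y_te, etc.predict(X_te))
print(f"ETC accuracy on synthetic data: {acc:.4f}")

# CatBoostClassifier (from the separate 'catboost' package) exposes the
# same fit/predict interface and can be swapped in directly:
#   from catboost import CatBoostClassifier
#   cb = CatBoostClassifier(verbose=0).fit(X_tr, y_tr)
```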

Fig 8 Bar Graph for proposed model performance

The bar graph for the proposed models' performance is shown in Figure 8. When comparing the performance metrics between ETC and Cat Boost, both models demonstrate strong capabilities across accuracy, precision, recall, and F1-score. Cat Boost slightly outperforms ETC in accuracy (90.78% vs. 90.27%) and precision (90.58% vs. 89.86%), showing a slight edge in correctly predicting positive instances and minimizing false positives. Recall mirrors accuracy for each model (90.27% for ETC and 90.78% for Cat Boost), indicating both capture true positive instances well. F1-scores also favor Cat Boost slightly, achieving 90.37% compared to ETC's 89.89%, reflecting a better balance between precision and recall. Overall, while both models perform well, Cat Boost demonstrates slightly superior performance in accuracy and F1-score, making it a favorable choice for tasks requiring robust predictive performance.




Fig 9 Classification Report of Extra Tree Classifier

Figure 9 displays the ETC's classification report, which includes a total of ten categories. The classifier's accuracy is 90.27%, showing a good match between model predictions and labels. The precision of ETC is 89.86, recall is 90.27, and F1-score is 89.89. The model displays varied performance across different classes: it excels in precision for classes 0, 5, and 6 but struggles with recall in classes 0, 8, 1, and 9. Classes 3, 4, and 7 show moderate to good performance with balanced precision and recall. The overall accuracy is 0.90 with a support value of 15124.

Fig 10 Confusion matrix for Extra tree classifier

The confusion matrix of the ETC is shown in Figure 10, where the real class labels (0-9) are shown on the y-axis, and the predicted class labels are represented on the x-axis. More predictions for a true-predicted label pair are represented by deeper hues in each cell. Diagonal cells stand for each class's accurate predictions, also known as true positives.
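Reports and matrices like those in Figures 9 to 12 are produced by scikit-learn's classification_report and confusion_matrix. A small illustrative sketch with 3 classes instead of the paper's 10, on made-up labels:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative true/predicted labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 1, 2, 2, 0, 2, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true labels, cols: predicted
print(cm)
print(classification_report(y_true, y_pred, digits=2))
```

Plotting cm with a heatmap (e.g. seaborn.heatmap(cm, annot=True)) reproduces the style of Figures 10 and 12.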




Fig 11 Classification Report of CatBoost Classifier

Figure 11 illustrates the Cat Boost classifier's classification report, which includes 10 classes. The classifier's accuracy is 90.79%, showing a good match between model predictions and labels. The precision of the Cat Boost classifier is 90.58, recall is 90.78, and F1-score is 90.37. The model displays varied performance across different classes: it excels in precision for classes 0, 5, and 6 but struggles with recall in classes 0, 8, 1, and 9. Classes 3, 4, and 7 show moderate to good performance with balanced precision and recall. The overall accuracy is 0.91 with a support value of 15124.

Fig 12 Confusion Matrix for CatBoost Classifier




Figure 12 displays the confusion matrix for the CatBoost classifier. The y-axis shows the actual labels and the x-axis the predicted labels, both ranging from 0 to 9. Correct predictions lie along the diagonal, with darker blue indicating higher counts, such as 7058 for class 6; off-diagonal cells show misclassifications, such as 55 instances where true label 0 was predicted as 1. This matrix makes both correct classifications and common confusions visible, guiding model improvements.

 Comparative Study
This section compares the performance of the base and proposed models across the evaluation metrics. The model performance comparison in Table 3 below demonstrates how well the proposed models perform relative to the base models.

Table 3 Comparison of Base and Proposed Model Performance on UNSW_NB15 Dataset

                      Proposed Models        Base Models
Performance Metric    ETC      CatBoost      RF       XGBoost
Accuracy              90.27    90.78         88.94    89.95
Precision             89.86    90.58         89.03    90.89
Recall                90.27    90.78         88.94    89.95
F1-score              89.89    90.37         88.96    89.67
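A Table-3-style comparison can be assembled by scoring each model on the same held-out split. In this hedged sketch, XGBoost and CatBoost are replaced by a scikit-learn gradient-boosting stand-in so the snippet stays self-contained, and synthetic data replaces the UNSW-NB15 features; the real study would plug in the actual libraries and dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed dataset.
X, y = make_classification(n_samples=1500, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "ETC": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GB (stand-in)": GradientBoostingClassifier(n_estimators=60, random_state=0),
}

rows = {}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    p, r, f1, _ = precision_recall_fscore_support(
        y_te, y_pred, average="weighted", zero_division=0)
    rows[name] = (accuracy_score(y_te, y_pred), p, r, f1)
    print(f"{name:14s} acc={rows[name][0]:.4f} prec={p:.4f} "
          f"rec={r:.4f} f1={f1:.4f}")
```

Holding the split and random seeds fixed across models, as above, is what makes a row-by-row comparison like Table 3 meaningful.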

Comparing the performance metrics of the proposed ensemble models (ETC and CatBoost) against the base models (RF and XGBoost), Table 3 shows consistently high performance across all metrics. The proposed models achieve higher accuracy, with CatBoost slightly ahead at 90.78%, and recall scores closely match accuracy for every model. F1-scores show CatBoost leading marginally at 90.37%, indicating balanced precision and recall; only XGBoost's precision (90.89%) slightly exceeds CatBoost's 90.58%. Overall, the ETC and CatBoost ensembles demonstrate robustness and reliability, making them effective choices for scenarios requiring high predictive accuracy and comprehensive model performance.

V. CONCLUSION AND FUTURE SCOPE

The emergence of applications for intelligent buildings raises the possibility of cybersecurity risks for people, companies, and the technology they use. This study emphasises how crucial machine learning methods are in cybersecurity, particularly when accuracy and speed are critical. While ML-based research provides encouraging results, this study shows that deep learning is not the only approach that works: models that are straightforward, understandable, and practical can be used to counter DDoS assaults. The study aimed to advance the classification and prediction of DDoS attacks by employing sophisticated machine learning methodologies on the UNSW-NB15 dataset, and it showed how effectively several ML methods, including Extra Tree and CatBoost, can be applied to the detection and categorisation of DDoS assaults. Specifically, CatBoost delivered an accuracy of 90.78%, precision of 90.58%, recall of 90.78%, and an F1-score of 90.37%, and both the CatBoost and Extra Tree classifiers outperformed the base models on accuracy, recall, and F1-score. This comparative edge indicates that the proposed models not only provide superior detection and prediction of DDoS attacks but also enhance overall system robustness. The results affirm the reliability and effectiveness of the proposed methodology, highlighting its potential to significantly improve the ability of intrusion detection systems to identify and respond to DDoS threats.

REFERENCES

[1]. S. Kumar, P. Tiwari, and M. Zymbler, "Internet of Things is a revolutionary approach for future technology enhancement: a review," J. Big Data, 2019, doi: 10.1186/s40537-019-0268-2.
[2]. M. Snehi and A. Bhandari, "Vulnerability retrospection of security solutions for software-defined Cyber-Physical System against DDoS and IoT-DDoS attacks," Computer Science Review, 2021, doi: 10.1016/j.cosrev.2021.100371.
[3]. R. K. C. Chang, "Defending against flooding-based distributed denial-of-service attacks: A tutorial," IEEE Commun. Mag., 2002, doi: 10.1109/MCOM.2002.1039856.
[4]. B. Patel, V. K. Yarlagadda, N. Dhameliya, K. Mullangi, and S. C. R. Vennapusa, "Advancements in 5G Technology: Enhancing Connectivity and Performance in Communication Engineering," Eng. Int., vol. 10, no. 2, pp. 117–130, 2022, doi: 10.18034/ei.v10i2.715.
[5]. R. K. Gupta, K. K. Almuzaini, R. K. Pateriya, K. Shah, P. K. Shukla, and R. Akwafo, "An Improved Secure Key Generation Using Enhanced Identity-Based Encryption for Cloud Computing in Large-Scale 5G," Wirel. Commun. Mob. Comput., 2022, doi: 10.1155/2022/7291250.
[6]. V. Rohilla, S. Chakraborty, and M. Kaur, "An Empirical Framework for Recommendation-based Location Services Using Deep Learning," Eng. Technol. Appl. Sci. Res., 2022, doi: 10.48084/etasr.5126.
[7]. P. Khuphiran, P. Leelaprute, P. Uthayopas, K. Ichikawa, and W. Watanakeesuntorn, "Performance comparison of machine learning models for DDoS attacks detection," in 2018 22nd International Computer Science and Engineering Conference, ICSEC 2018, 2018, doi: 10.1109/ICSEC.2018.8712757.




[8]. Z. M. Jiyad, A. Al Maruf, M. M. Haque, M. Sen Gupta, A. Ahad, and Z. Aung, "DDoS Attack Classification Leveraging Data Balancing and Hyperparameter Tuning Approach Using Ensemble Machine Learning with XAI," in 2024 Third International Conference on Power, Control and Computing Technologies (ICPC2T), 2024, pp. 569–575, doi: 10.1109/ICPC2T60072.2024.10475035.
[9]. A. M. Al-Eryani, E. Hossny, and F. A. Omara, "Efficient Machine Learning Algorithms for DDoS Attack Detection," in 2024 6th International Conference on Computing and Informatics (ICCI), 2024, pp. 174–181, doi: 10.1109/ICCI61671.2024.10485168.
[10]. S. Kaur, A. K. Sandhu, and A. Bhandari, "Feature Extraction and Classification of Application Layer DDoS Attacks using Machine Learning Models," in 2023 International Conference on Communication, Security and Artificial Intelligence, ICCSAI 2023, 2023, doi: 10.1109/ICCSAI59793.2023.10421652.
[11]. P. S. Patil, S. L. Deshpande, G. S. Hukkeri, R. H. Goudar, and P. Siddarkar, "Prediction of DDoS Flooding Attack using Machine Learning Models," in Proceedings of the 3rd International Conference on Smart Technologies in Computing, Electrical and Electronics, ICSTCEE 2022, 2022, doi: 10.1109/ICSTCEE56972.2022.10100083.
[12]. S. Tufail, S. Batool, and A. I. Sarwat, "A Comparative Study of Binary Class Logistic Regression and Shallow Neural Network for DDoS Attack Prediction," in Conference Proceedings - IEEE SOUTHEASTCON, 2022, doi: 10.1109/SoutheastCon48659.2022.9764108.
[13]. W. Yustanti, N. Iriawan, and Irhamah, "Categorical encoder based performance comparison in preprocessing imbalanced multiclass classification," Indones. J. Electr. Eng. Comput. Sci., 2023, doi: 10.11591/ijeecs.v31.i3.pp1705-1715.
[14]. V. Rohilla, S. Chakraborty, and R. Kumar, "Deep learning based feature extraction and a bidirectional hybrid optimized model for location based advertising," Multimed. Tools Appl., vol. 81, no. 11, pp. 16067–16095, May 2022, doi: 10.1007/s11042-022-12457-3.
[15]. R. C. Chen, C. Dewi, S. W. Huang, and R. E. Caraka, "Selecting critical features for data classification based on machine learning methods," J. Big Data, 2020, doi: 10.1186/s40537-020-00327-4.
[16]. A. Bhandari, "Feature Engineering: Scaling, Normalization and Standardization," Analytics Vidhya.
[17]. P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees," Mach. Learn., vol. 63, no. 1, pp. 3–42, 2006, doi: 10.1007/s10994-006-6226-1.
[18]. V. John, Z. Liu, C. Guo, S. Mita, and K. Kidono, "Real-time lane estimation using deep features and extra trees regression," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2016, doi: 10.1007/978-3-319-29451-3_57.
[19]. G. Mishra, D. Sehgal, and J. K. Valadi, "Quantitative Structure Activity Relationship study of the Anti-Hepatitis Peptides employing Random Forest and Extra Tree regressors," Bioinformation, 2017, doi: 10.6026/97320630013060.
[20]. A. V. Dorogush, V. Ershov, and A. Gulin, "CatBoost: gradient boosting with categorical features support," pp. 1–7, 2018.
[21]. L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, "CatBoost: Unbiased boosting with categorical features," in Advances in Neural Information Processing Systems, 2018.
[22]. H. Liu, L. Guo, H. Li, W. Zhang, and X. Bai, "Matching Areal Entities with CatBoost Ensemble Method," J. Geo-Information Sci., 2022, doi: 10.12082/dqxxkx.2022.220050.

