Predictive Analysis of Network-Based Attacks by Hybrid Machine Learning Algorithms Utilizing Bayesian Optimization Logistic Regression and Random Forest Algorithm
ABSTRACT Intrusion detection systems have become an essential line of defense against the wide variety of security breaches, whose number has been rising steadily. Because the types of attacks that emerge are constantly changing, the need for adaptive security solutions is pressing. This study aims to enhance the performance of intrusion detection models on the KDD99
and NSL-KDD datasets through advanced optimization techniques. By addressing challenges related to
evolving attack strategies and intricate tasks, the research introduces innovative machine learning approaches
tailored for intrusion detection, focusing on both binary and multiclass classification scenarios. The study
employs a Bayesian Optimization-enhanced Random Forest (BO_RF) algorithm for binary classification
and a hybrid Logistic Regression and Random Forest (LR_RF) algorithm for multiclass classification. Our
models were implemented and evaluated in a Jupyter Notebook environment using key metrics: Accuracy,
Precision, Recall, and F1-Score. For binary classification, eight metrics were assessed, while twenty-six
were analyzed for multiclass classification across both datasets. The results demonstrate the effectiveness
of the proposed approaches in both classification types, highlighting their potential for robust and adaptable
intrusion detection. Theoretical contributions include advancing the understanding of intrusion detection
methodologies and the effectiveness of machine learning algorithms in cybersecurity. From a practical
perspective, the proposed model offers a robust and adaptable solution for real-world intrusion detection
scenarios, potentially minimizing security breaches and enhancing the overall cybersecurity posture.
INDEX TERMS Machine learning (ML), network attacks, classification, intrusion detection system (IDS).
Anomaly detection establishes a typical pattern of network traffic to identify new attacks, while misuse detection relies on signatures already registered in the system to identify known attacks.

An intrusion detection system built using machine learning techniques makes use of logs of standard network activity as well as data collected from network traffic. It assists in locating known attacks as well as unidentified ones that have not yet been observed. Machine learning models can detect attacks or respond to threats in real time. Machine learning is suited for both large-scale network system analysis and monitoring, as well as the efficient handling of large volumes of data [6]. It can be applied to analyze user and system behavior to spot insider threats or unapproved access. Machine learning techniques can be used to connect complex attacks with numerous stages or to dynamically alter network segments in response to a recognized danger. These models have drawbacks in addition to their advantages. First, some models report incomplete information: they merely predict the existence of an attack, not its kind. Second, detection performance is poor for low-frequency attacks. The primary cause of this is an unbalanced dataset, wherein certain attacks have significantly more cases than others. As such, it is exceedingly challenging for any model to detect attacks with low frequency [7].

The research work makes several notable contributions to the area of intrusion detection and cyber security.
• Given the increasing complexity and diversity of cyber threats, there is a crucial need for security solutions that can adapt effectively. To address this demand, the proposed methods utilize advanced optimization techniques and hybrid algorithms capable of adjusting dynamically to evolving attack scenarios.
• The study introduces innovative intrusion detection approaches based on machine learning algorithms to improve detection accuracy and adaptability. Through the utilization of a Random Forest with Bayesian optimization (BO_RF) algorithm for binary classification and a hybrid Logistic Regression and Random Forest (LR_RF) algorithm for multiclass classification, the research presents fresh strategies for addressing intrusion detection challenges.
• Extensive evaluation and comparison of the proposed algorithms with existing methods, such as Random Forest, Logistic Regression, Naïve Bayes, and SVM, are conducted using the KDD99 and NSL-KDD datasets in a Jupyter Notebook environment. This thorough analysis provides valuable insights into the effectiveness of the proposed models over traditional models.
Overall, the research facilitates the development of more efficient, adaptable, and effective intrusion detection systems, thereby bolstering overall cybersecurity resilience.

The subsequent sections of the paper are structured as follows. Section II outlines the prior research conducted on this subject. Section III discusses different machine learning algorithms, along with their respective strengths and limitations. Section IV introduces the datasets utilized in the research, including details about the KDD99 cup dataset and the NSL-KDD dataset. Data visualization and preprocessing techniques are elaborated upon in Section V, aimed at facilitating a better understanding of the datasets. Section VI presents the proposed framework designed to enhance the interpretability of any Intrusion Detection System (IDS). This section is further divided into two parts, addressing binary classification and multiclass classification, respectively. It encompasses the experiments conducted using the NSL-KDD and KDD99 datasets, along with a detailed presentation of the results. Finally, Section VII offers concluding remarks and outlines potential future research directions.

II. RELATED WORK
There are a variety of reasons behind the significance of machine learning in intrusion detection. It is useful for securing the integrity of computer networks and systems. Çavuşoğlu [8] proposed a hybrid IDS by combining different feature selection and machine learning algorithms over the NSL-KDD dataset. The performance is demonstrated by comparing the proposed system with other studies; it is shown that the proposed system has a low false positive rate and high accuracy. Li et al. [9] proposed a hybrid method based on K-NN and binary classification which achieved suitable results over the NSL-KDD dataset. The model is evaluated by comparing five other learning techniques. The result showed that the proposed method performs better than all other baselines in different evaluation criteria.

Al-Khassawneh [10] evaluated the effectiveness of different classification algorithms in detecting anomalies in network traffic patterns using the NSL-KDD dataset. Additionally, the relationship between hacker attacks and commonly used network protocols is investigated to understand how attackers generate abnormal network traffic. The proposed model enhances IDS precision and suggests new research directions in the field. Fuhnwi et al. [11] proposed an approach for Network Intrusion Detection Systems (NIDS) using XGBoost and Recursive Feature Elimination (RFE). Evaluation on the NSL-KDD dataset demonstrates superior performance in detecting various attack types, with XGBoost outperforming other machine learning algorithms and achieving high classification accuracy. Vibhute et al. [12] focused on developing a network-based IDS using the NSL-KDD dataset. Utilizing an ensemble learning-enabled random forest algorithm, features were selected. Three machine learning models (KNN, logistic regression, and SVM) achieved validation accuracies of 98.24%, 88.86%, and 87.58%, respectively, indicating applicability for real-time monitoring and detection of cyberattacks. Shehadeh et al. [13] evaluated intrusion detection using Random Forest, KNN, and Naïve Bayes on three datasets (KDDCUP-99, UNSW-NB15, and NSL-KDD). Random Forest proves the most reliable. Limitations include dataset quantity and focusing solely on classification algorithms. Future research could explore additional data mining techniques like Neural Networks and analyze specific dataset intricacies for improved algorithm performance.
Unlike previous studies that primarily relied on traditional algorithms, this paper introduces the application of hybrid algorithms. Earlier research often focused exclusively on binary classification and limited its evaluation to one or two metrics. In contrast, this study incorporates both binary and multiclass classification and conducts a more comprehensive analysis by considering all four evaluation metrics.

III. MACHINE LEARNING ALGORITHMS
Machine Learning is a branch of Artificial Intelligence that accomplishes specific goals by simulation. It can take the results of previous experiences as instructions for future operations without being explicitly programmed [14]. Machine Learning can be classified into three major types: supervised, unsupervised, and semi-supervised learning [15]. If the target labels and classes are known before execution, the learning is called supervised. If the target class is unknown, the learning is called unsupervised. Learning that combines supervised and unsupervised methods is called semi-supervised learning.

In this study, supervised learning algorithms are employed due to the availability of labeled datasets, facilitating the classification task. The proposed techniques are hybrids, combining the strengths of multiple algorithms to overcome limitations observed in existing approaches. Each existing algorithm brings its unique advantages and drawbacks, which are carefully considered and addressed [16]. Below is a concise analysis of the methods evaluated in this study.

A. NAÏVE BAYES
This algorithm can be explained as a probabilistic classifier obtained from the application of Bayes' Theorem, an equation of statistical quantities that describes the relationship between conditional probabilities [17]. The naïve Bayes classifier is very useful because it is a very fast and simple classification algorithm for datasets of high dimension. The likelihood of an outcome based on a previous outcome that has occurred in similar circumstances is known as conditional probability.

Naive Bayes operates under the assumption of feature independence, which may not be valid for intricate relationships within network data. Its performance diminishes when dealing with highly correlated features, potentially resulting in inflated probability estimates. Moreover, Naive Bayes encounters difficulties with continuous features, particularly if the underlying distribution is not accurately represented by the selected probability distribution [19].
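To make the discussion above concrete, the following is a minimal, illustrative sketch (not the authors' code) of a Gaussian Naive Bayes baseline on a labeled feature matrix; the variable names, the synthetic data, and the use of scikit-learn are assumptions.

```python
# Minimal Gaussian Naive Bayes baseline (illustrative sketch, not the paper's code).
# Assumes X is a numeric feature matrix and y holds binary labels (0 = normal, 1 = attack).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))          # placeholder features standing in for KDD-style inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

nb = GaussianNB()                        # assumes each feature is Gaussian within a class
nb.fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test)))
```

Because GaussianNB assumes conditional independence and a Gaussian likelihood per feature, correlated or non-Gaussian network features can degrade it, which is exactly the limitation noted above.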
B. SUPPORT VECTOR MACHINE (SVM)
SVM is a Machine Learning algorithm that learns by training over a specific type of dataset in order to make accurate predictions and generalize to the remaining data. SVMs belong to the supervised class of Machine Learning and are mainly utilized for analyzing data, pattern recognition, regression, and classification analysis [20]. The main objective of an SVM is to find a hyperplane in an N-dimensional space where data points can be distinctly divided into two categories. The commonly used kernel functions are:
Linear kernel: K(x, x′) = x · x′
Polynomial kernel: K(x, x′) = (γ x · x′ + r)^d, where d = specified degree parameter
Radial basis function kernel: K(x, x′) = exp(−γ ||x − x′||²), where γ = specified gamma parameter, which is always greater than zero
Sigmoid kernel: K(x, x′) = tanh(γ x · x′ + r), where r = coefficient term
The general working structure of SVM is shown in figure (1).

FIGURE 1. SVM classification.
C. LOGISTIC REGRESSION
Logistic Regression estimates the probability that a data point belongs to a class by using a logistic function [22]. Given data (x, y), where x is an (m × n) matrix with m samples and n attributes and y is a vector of m examples, the weight vector w is randomly initialized and the linear combination a is defined in equation (3).

a = w0 + w1x1 + w2x2 + . . . + wnxn   (3)

The output a is then passed to the link function formulated in equation (4).

ŷi = 1 / (1 + e^(−a))   (4)

Then the cost function is calculated, which is derived in equation (5).

cost(w) = −(1/m) Σ_{i=1}^{m} [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]   (5)

The weights are updated according to the derivative of the cost; the formulas are shown in equations (6) and (7).

dwj = Σ_{i=1}^{m} (ŷi − yi) xji   (6)

wj = wj − α · dwj   (7)

where α is the learning rate. Logistic regression is used to calculate the probability that a given data point belongs to either class '0' or '1' for given values of w and x. The exponential function in the sigmoid is used because the probability must be greater than zero, and dividing by a quantity larger than the numerator keeps the value below one [23]. Dividing through by the numerator term yields the sigmoid function, which is expressed in equation (8). Figure (2) shows the curve of logistic regression.

P = 1 / (1 + e^(−(w1x1 + w2x2 + . . . + wnxn)))   (8)

FIGURE 2. Curve of logistic regression.

Logistic Regression relies on the assumption of a linear relationship between features and the log odds of the target variable, potentially overlooking complex non-linear relationships within the data. Challenges arise with high-dimensional or feature-rich datasets, as Logistic Regression may struggle to achieve optimal performance [24]. Moreover, Logistic Regression may exhibit poor results when dealing with imbalanced datasets.
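The update rule in equations (3)-(7) can be written out directly; the short NumPy sketch below is an illustrative rendering of those equations, where the learning rate, iteration count, and synthetic data are assumed placeholders rather than values from the paper.

```python
# Gradient-descent logistic regression following equations (3)-(8) (illustrative sketch).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))           # equations (4) and (8)

def fit_logistic(X, y, lr=0.1, epochs=500):
    m, n = X.shape
    w = np.zeros(n)                           # weights w1..wn
    w0 = 0.0                                  # bias term
    for _ in range(epochs):
        a = w0 + X @ w                        # equation (3)
        y_hat = sigmoid(a)
        cost = -np.mean(y * np.log(y_hat + 1e-12) +
                        (1 - y) * np.log(1 - y_hat + 1e-12))   # equation (5)
        dw = X.T @ (y_hat - y) / m            # equation (6), averaged over the m samples
        dw0 = np.mean(y_hat - y)
        w -= lr * dw                          # equation (7)
        w0 -= lr * dw0
    return w0, w, cost

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)
w0, w, final_cost = fit_logistic(X, y)
print("final cost:", round(final_cost, 4))
```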
D. RANDOM FOREST
This algorithm creates a randomized forest, a group of decision trees trained with the bagging method. Bagging is the amalgamation of several learning models to improve the overall result [25]. The algorithm is useful for both regression and classification problems and underlies many current machine learning systems. A greater number of trees generally leads to higher accuracy and also helps prevent overfitting. The formulas used in the evaluation of a random forest are shown in equations (9) and (10).

Gini Index = 1 − Σ_{i=1}^{n} pi²   (9)

Entropy = − Σ_{i=1}^{n} pi log(pi)   (10)

Information Gain = E(Parent) − E(Parent | Child), or equivalently Gini(Parent) − Gini(Parent | Child), where Gini = Gini Index, E = Entropy, and p = probability.

For the final evaluation, the majority (hard) voting method is used; its formula is given in equation (11).

ŷ = mode{C1(x), C2(x), . . . , Cm(x)}   (11)

where ŷ = class label and C1, . . . , Cm = the set of classifiers; the final class label is obtained by majority voting over the predictions of the individual classifiers.

Random Forest models, despite their effectiveness, can pose challenges in interpretation due to their complexity, particularly with numerous trees and features. Although they offer reduced overfitting risks compared to single decision trees, they still need careful tuning to avoid overfitting noisy data [26]. Additionally, training a Random Forest model can be computationally demanding, especially with sizable datasets or when employing a high number of trees.
E. BAYESIAN OPTIMIZATION
Optimization is considered the heart of a machine learning model. Bayesian Optimization constructs a probabilistic model of an objective function in order to select hyperparameters for evaluating that objective function [27]. Bayesian Optimization differs from Grid and Random search because it speeds up the search by using past performance, whereas the other methods are independent of previous evaluations. It has two components: a probabilistic model and an acquisition function. The probabilistic model starts with a prior probability distribution over the function being optimized, and the acquisition function is computed from the posterior distribution of the function [28]. The next sampling point is determined by maximizing the acquisition function, as defined in equation (12).

xt = argmax_x u(x | D1:t−1)   (12)

where u = acquisition function and D1:t−1 = the t − 1 samples evaluated so far. There are three main types of acquisition functions. First, the upper confidence bound is defined in equation (13).

UCB[x∗] = µ[x∗] + β^(1/2) σ[x∗]   (13)

Second, the probability of improvement is defined in equation (14).

PI[x∗] = ∫_{f[x̂]}^{∞} N(f[x∗]; µ[x∗], σ[x∗]) df[x∗]   (14)

Third, the expected improvement is defined in equation (15).

EI[x∗] = ∫_{f[x̂]}^{∞} (f[x∗] − f[x̂]) N(f[x∗]; µ[x∗], σ[x∗]) df[x∗]   (15)

where N(·; µ[x∗], σ[x∗]) denotes the Gaussian posterior density at the candidate point x∗ and f[x̂] is the best objective value observed so far.
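Given a Gaussian posterior with mean µ and standard deviation σ at each candidate point, the three acquisition functions above have simple closed forms; the sketch below evaluates them. The closed-form EI used here is the standard analytic solution of the integral in equation (15), and the numeric posterior values are placeholders rather than results from the paper.

```python
# UCB, PI and EI acquisition functions for a Gaussian posterior (equations (13)-(15)).
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, beta=4.0):
    return mu + np.sqrt(beta) * sigma                           # equation (13)

def probability_of_improvement(mu, sigma, f_best):
    z = (mu - f_best) / sigma
    return norm.cdf(z)                                          # closed form of equation (14)

def expected_improvement(mu, sigma, f_best):
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)    # closed form of equation (15)

# Placeholder posterior over three candidate hyperparameter settings.
mu = np.array([0.90, 0.92, 0.88])        # posterior mean of the objective (e.g., accuracy)
sigma = np.array([0.01, 0.05, 0.02])     # posterior standard deviation
f_best = 0.91                            # best objective value observed so far

for name, acq in [("UCB", ucb(mu, sigma)),
                  ("PI", probability_of_improvement(mu, sigma, f_best)),
                  ("EI", expected_improvement(mu, sigma, f_best))]:
    print(name, "-> next point:", int(np.argmax(acq)), acq.round(4))
```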
TABLE 1. Classification of the KDD99 dataset.

IV. DATASETS
A. KDD99 DATASET
In 1999, the International Knowledge Discovery and Data Mining Tools Competition was organized to collect traffic records; an environment was set up by Lincoln Labs to acquire nine weeks of raw TCP dump data from a LAN of the US Air Force [31]. The training data, collected over seven weeks, is 4 GB in size and was processed into about 5 million connection records, while the test data, collected over two weeks, contains around 2 million connection records. The attack types explored here are R2L, U2R, DoS, and Probing. The description of the dataset is given in table 1.
B. NSLKDD DATASET
In 2009, the NSL-KDD dataset was introduced by the University of New Brunswick as a cleaned and revised version of the KDD99 dataset. The dataset contains 43 features for each record: 41 are traffic input features, while the other two are a label (whether the traffic is normal or malicious) and a score (the severity of the traffic input) [32].
There are 4 different attack classes in the dataset: DoS, R2L, U2R, and Probing. There are four types of features in the dataset: 4 categorical, 6 binary, 10 continuous, and 23 discrete. The features of the dataset are explained in table 2.
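As a practical note, NSL-KDD records are typically distributed as CSV-style files with the 41 feature columns followed by the label and score; the sketch below shows one way to load and encode them with pandas. The file name, generic column names, and assumed positions of the categorical columns are illustrative assumptions, not details taken from the paper.

```python
# Loading NSL-KDD style records with pandas (illustrative sketch; file name is hypothetical).
import pandas as pd

# 41 feature columns + class label + difficulty/severity score, as described above.
n_features = 41
columns = [f"f{i}" for i in range(1, n_features + 1)] + ["label", "score"]

df = pd.read_csv("KDDTrain+.txt", header=None, names=columns)   # path is an assumption

# Binary target: 'normal' vs. any attack type.
df["binary_label"] = (df["label"] != "normal").astype(int)

# One-hot encode the categorical features (assumed to sit in columns 2-4).
categorical = ["f2", "f3", "f4"]
X = pd.get_dummies(df[columns[:n_features]], columns=categorical)
y = df["binary_label"]
print(X.shape, y.value_counts().to_dict())
```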
V. DATA EXPLORATION
Visualization is the most effective method for comprehending data and how it functions. Knowing the distribution of the data provides information about the data. Two crucial elements of data processing are normalization and standardization [33]. As the number of characteristics increases, the size of the dataset required to obtain significant results grows exponentially. This causes an overfitting issue, which lengthens computation times and lowers model accuracy. The following describes the methods applied in this paper.
A. NORMALIZATION AND STANDARDIZATION
Normalization is a method of arranging data to reduce duplication, in which data points are scaled between zero and one [34]. It is utilized to remove undesired characteristics from a dataset. Mathematically it is represented in equation (16).

Xnew = (X − Xmin) / (Xmax − Xmin)   (16)

The method of restructuring data into a uniform format is called data standardization. It compares the data points by putting them on the same scale. This process is also called the Z-score. Mathematically it is represented in equation (17).

Z = (X − µ) / σ   (17)

where µ = mean of the data points and σ = standard deviation.
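Equations (16) and (17) correspond directly to min-max scaling and Z-score standardization; the sketch below applies both with scikit-learn. Fitting the scalers on the training split only is an assumption about good practice rather than a detail stated in the paper.

```python
# Min-max normalization (equation 16) and Z-score standardization (equation 17).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10, scale=3, size=(700, 5))   # placeholder feature matrices
X_test = rng.normal(loc=10, scale=3, size=(300, 5))

minmax = MinMaxScaler()                 # Xnew = (X - Xmin) / (Xmax - Xmin)
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)    # reuse training min/max to avoid leakage

zscore = StandardScaler()               # Z = (X - mean) / std
X_train_z = zscore.fit_transform(X_train)
X_test_z = zscore.transform(X_test)

print(X_train_mm.min().round(3), X_train_mm.max().round(3),
      X_train_z.mean().round(3), X_train_z.std().round(3))
```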
ical evaluation data to construct a probabilistic model,
B. PRINCIPAL COMPONENT ANALYSIS (PCA) termed a ‘‘surrogate,’’ mapping hyperparameters to the
It is a method to examine interrelations between variables of a likelihood of achieving a score on the objective function.
dataset [35]. The algorithm uses an orthogonal transformation This surrogate function simplifies optimization compared
to convert correlated variables into uncorrelated variables. to the actual objective function. By iteratively updat-
Principal component analysis is used to avoid overfitting ing the surrogate probability model with each evaluation,
problems by reducing the dimensionality of the original Bayesian reasoning aims to refine predictions and improve
dataset and transforming it to lower lower-dimensional accuracy.
dataset preserving most of the actual sample information [36]. The random forest technique has eighteen hyperparam-
The variance of the low-dimensional dataset is greater than eters, three of which—max_smaples, max_features, and
the higher-dimensional dataset. Figure (3) gives a general n_estimators—are selected for optimization. In the end, the
overview of how PCA works. attack kinds are classified using an optimized Random Forest
approach. The Random Forest method heavily relies on key
hyperparameters such as max_samples, max_features, and
n_estimators to govern the model’s performance, complexity,
and generalization capability. These hyperparameters control
the bootstrap sample size, feature selection randomness, and
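A minimal PCA step, standardizing first and then keeping enough components to explain a fixed share of the variance, is sketched below; the 95% threshold and the synthetic data are illustrative assumptions, not values reported in the paper.

```python
# Dimensionality reduction with PCA after standardization (illustrative sketch).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))                     # placeholder for the 41-feature inputs
X_std = StandardScaler().fit_transform(X)           # PCA expects centered, scaled data

pca = PCA(n_components=0.95, svd_solver="full")     # keep components explaining 95% variance
X_reduced = pca.fit_transform(X_std)

print("components kept:", pca.n_components_)
print("explained variance ratio sum:", pca.explained_variance_ratio_.sum().round(3))
```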
VI. PROPOSED FRAMEWORK
A computer system will always have vulnerabilities, which opens the door to different network attacks attempting to compromise system integrity. For network security purposes, it is more crucial to identify the sort of attack than merely whether an attack has taken place. Any network administrator must constantly have accurate information to take the necessary precautions to safeguard the computer infrastructure [37]. We present a hybrid approach in this paper that outperforms existing approaches in identifying network threats. The proposed algorithm is evaluated over the KDD99 and NSLKDD datasets to measure detection performance. The ratio of the training and testing datasets is 70:30, and the implementation is done in the Jupyter Notebook environment [38]. There are mainly two types of classification techniques used in machine learning: one is binary and the other is multiclass classification.

A. BINARY CLASSIFICATION
The process of classification in which the data is divided into two classes or groups is called binary classification [39]. The two classes are labelled as either 0 or 1. Many machine learning algorithms can perform binary classification, such as SVM, logistic regression, and random forest. PCA is utilized to reduce the dimensionality of the dataset, and the Bayesian optimization technique is employed to identify the optimal parameter configuration for improved results. Bayesian techniques employ historical evaluation data to construct a probabilistic model, termed a "surrogate," mapping hyperparameters to the likelihood of achieving a score on the objective function. This surrogate function simplifies optimization compared to the actual objective function. By iteratively updating the surrogate probability model with each evaluation, Bayesian reasoning aims to refine predictions and improve accuracy.

The random forest technique has eighteen hyperparameters, three of which (max_samples, max_features, and n_estimators) are selected for optimization. In the end, the attack kinds are classified using an optimized Random Forest approach. The Random Forest method heavily relies on key hyperparameters such as max_samples, max_features, and n_estimators to govern the model's performance, complexity, and generalization capability. These hyperparameters control the bootstrap sample size, the feature selection randomness, and the number of trees in the forest, respectively. Fine-tuning these hyperparameters is crucial for achieving optimal performance, balancing model complexity, and ensuring generalization ability in Random Forest models [40].

The framework for binary classification is illustrated in figure (4). Initially, the datasets undergo standardization and normalization, following the processes outlined in equations (16) and (17). Then, PCA is employed for dimensionality reduction. Following these preparatory steps, the proposed algorithm, along with other traditional algorithms, is applied for the final analysis.

Traditional algorithms like SVM and Logistic Regression may struggle with imbalanced datasets, where one class dominates the other. In contrast, the Random Forest method, optimized with Bayesian optimization, can address this issue by dynamically balancing the class distribution during training, ensuring adequate representation of rare attack types. While Random Forest models are robust to noise, they may overfit without proper tuning [41]. Bayesian optimization helps tune Random Forest hyperparameters, such as tree count and depth, mitigating overfitting and enhancing generalization, particularly in noisy environments. Moreover, Random Forest with Bayesian optimization offers adaptability by updating model parameters over time, enabling continuous learning.
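The tuning step described above can be sketched as follows, assuming the scikit-optimize library provides the Bayesian search. The search ranges, iteration budget, and use of BayesSearchCV are illustrative assumptions rather than the paper's exact configuration, but the three tuned hyperparameters (n_estimators, max_features, max_samples) and the 70:30 split follow the text.

```python
# Bayesian-optimized Random Forest for binary classification (illustrative sketch).
# Assumes scikit-optimize (skopt) is installed; search ranges are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from skopt import BayesSearchCV
from skopt.space import Integer, Real

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)          # 70:30 split as in the paper

search_space = {
    "n_estimators": Integer(50, 500),        # number of trees in the forest
    "max_features": Real(0.1, 1.0),          # fraction of features considered per split
    "max_samples": Real(0.5, 1.0),           # bootstrap sample fraction per tree
}

opt = BayesSearchCV(
    estimator=RandomForestClassifier(bootstrap=True, random_state=0, n_jobs=-1),
    search_spaces=search_space,
    n_iter=25,                               # number of Bayesian optimization steps
    cv=3,
    scoring="f1",
    random_state=0,
)
opt.fit(X_train, y_train)

print("best hyperparameters:", opt.best_params_)
print(classification_report(y_test, opt.predict(X_test)))
```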
Algorithm 1 The Proposed Bayesian-Based Random Forest (BO_RF) Algorithm
Input: The dataset X = [X1, X2, . . . , Xn]
The target variables Y = [Y1, Y2, . . . , Ym]
Output: Classification report for each target variable.
The results show that the proposed approach performs better than traditional methods in all aspects of the evaluation criteria.

TABLE 3. Classification of the KDD99 dataset.

FIGURE 6. Comparison over the NSLKDD dataset.

Bayesian optimization enhances Random Forest models by efficiently tuning hyperparameters like the number of trees and the maximum depth. This optimization reduces computational costs and improves predictive performance by adapting to changes in the performance landscape during training [43]. This adaptability is particularly beneficial in real-time applications where model performance may fluctuate. In summary, Bayesian optimization and Random Forest work together effectively, leveraging optimization efficiency and model robustness to develop efficient machine learning models for real-time scenarios [44].

B. MULTICLASS CLASSIFICATION
The process of classification in which data is divided into three or more classes or categories is called multiclass classification [45]. The classes are labelled from 0 to n-1, where n is the total number of classes. A hybrid algorithm using Logistic Regression and Random Forest is proposed. The proposed algorithm is implemented using grid search, which can find the optimal parameters from a large number of candidate parameters. The optimal parameters are used to improve the performance of the algorithm. The combination of Logistic Regression and Random Forest in a hybrid approach offers complementary benefits. While Logistic Regression may overlook complex data patterns, Random Forest can capture them effectively by aggregating predictions from multiple decision trees.
Moreover, Random Forest's robustness to noise and outliers enhances model performance, mitigating the impact of outliers. On the other hand, Logistic Regression provides interpretable coefficients that elucidate the relationship between features and the target variable, aiding in understanding the model's predictions. By leveraging the strengths of both methods, hybrid algorithms achieve improved performance and prediction accuracy.

Figure 5 depicts the multiclass classification framework. The datasets are initially subjected to standardization, normalization, and dimensionality reduction. Once these steps are completed, the proposed algorithm, along with other traditional algorithms, is implemented for the final analysis.

Traditional algorithms such as Logistic Regression and Naive Bayes may struggle to capture intricate data relationships. However, by integrating Logistic Regression with Random Forest in a hybrid model and optimizing hyperparameters through grid search, the model can effectively address non-linear relationships and high-dimensional feature spaces, ultimately enhancing classification accuracy [46]. Although Random Forest models are potent, they are often perceived as "black box" models, hindering interpretability. Yet, by combining Logistic Regression in a hybrid approach, interpretability can be enhanced, as Logistic Regression offers coefficients that signify the significance of each feature in the classification process.

The classification of the KDD99 and NSL-KDD datasets was carried out using both traditional algorithms and the proposed algorithm. Traditional methods leveraged a comprehensive set of equations to conduct their analyses, establishing a foundational benchmark for comparison; these methods are described previously in Section III. The proposed algorithm, however, introduces a more advanced and nuanced approach. In the initial phase, it applies equations (3)-(8) (which are rooted in logistic regression) to optimize the model's parameters and enhance its predictive capabilities. This phase lays the groundwork for a more precise classification. The second phase of the proposed algorithm employs equations (9)-(11) (which utilize the random forest technique) to further refine the classification process. This dual-phase methodology, which combines logistic regression with random forest, represents a significant improvement over traditional approaches. It not only ensures greater accuracy and efficiency in the detection of network intrusions but also underscores the innovative and robust nature of the proposed algorithm in addressing complex cybersecurity challenges.
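The dual-phase procedure just described (a logistic-regression phase based on equations (3)-(8) followed by a random-forest phase based on equations (9)-(11)) is not fully specified in this excerpt, so the sketch below shows one plausible realization under that reading: the logistic-regression class probabilities are appended as features for a grid-searched Random Forest. This is an illustrative assumption, not the authors' published implementation, and the parameter grid is a placeholder.

```python
# One plausible LR_RF hybrid for multiclass classification (illustrative assumption).
# Phase 1: logistic regression produces class probabilities; Phase 2: a grid-searched
# Random Forest classifies using the original features plus those probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=4000, n_features=25, n_informative=10,
                           n_classes=4, random_state=0)      # stand-in for 4 attack classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Phase 1: logistic regression (equations (3)-(8)).
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_aug = np.hstack([X_train, lr.predict_proba(X_train)])
test_aug = np.hstack([X_test, lr.predict_proba(X_test)])

# Phase 2: random forest (equations (9)-(11)) tuned with grid search.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 20], "max_features": ["sqrt", 0.5]}
rf = GridSearchCV(RandomForestClassifier(random_state=0, n_jobs=-1),
                  param_grid, cv=3, scoring="f1_macro")
rf.fit(train_aug, y_train)

print("best parameters:", rf.best_params_)
print(classification_report(y_test, rf.predict(test_aug)))
```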
The multiclass classification of the KDD99 dataset using the proposed LR_RF algorithm is described in table 5. The comparison of experimental results over the KDD99 dataset is shown below in figures (8) and (9). The result of multiclass classification over the four different attack classes of the NSLKDD dataset is explored below in table 6. The comparison between different machine learning algorithms and the proposed algorithm over the four attack classes of the NSLKDD dataset is shown in figures (10) and (11).

In deploying hybrid models combining Logistic Regression and Random Forest for Intrusion Detection Systems (IDS), practical considerations such as computational costs and real-time applicability need to be addressed. Random Forest typically demands more computational resources and training time, especially with large datasets or numerous
trees, whereas Logistic Regression is computationally lighter and better suited for swift decision-making, particularly in scenarios where computational resources are limited [47].

VII. CONCLUSION
In summary, the experiments conducted with the KDD99 and NSLKDD datasets, combined with PCA for dimensionality reduction, confirm the efficacy of ML techniques in intrusion detection. Through the use of hybrid machine learning methods, the study effectively tackles challenges posed by attack types with high prediction rates across diverse evaluation criteria. These findings underscore the importance of flexible and resilient intrusion detection approaches in the face of constantly evolving cyber threats. The suggested method, which integrates PCA for dimensionality reduction and hybrid machine learning techniques, surpasses alternative algorithms in both binary and multi-class classification scenarios. Looking ahead, future research efforts will focus on advancing intrusion detection capabilities by leveraging deep learning methods. These methods have the potential to further enhance the accuracy and efficiency of IDS by automatically learning intricate patterns and representations from the data. By exploring deep learning approaches, our aim is to develop even more effective and reliable security solutions capable of addressing the evolving landscape of cyber threats.

REFERENCES
[1] R. D. Ravipati and M. Abualkibash, "Intrusion detection system classification using different machine learning algorithms on KDD-99 and NSL-KDD datasets—A review paper," Int. J. Comput. Sci. Inf. Technol., vol. 11, pp. 1–16, Jun. 2019, doi: 10.2139/ssrn.3428211.
[2] S. Ganesan, G. Shanmugaraj, and A. Indumathi, "A survey of data mining and machine learning-based intrusion detection system for cyber security," in Risk Detection and Cyber Security for the Success of Contemporary Computing, 2023, pp. 52–74, doi: 10.4018/978-1-6684-9317-5.ch004.
[3] K. Ashok and S. Gopikrishnan, "Statistical analysis of remote health monitoring based IoT security models & deployments from a pragmatic perspective," IEEE Access, vol. 11, pp. 2621–2651, 2023, doi: 10.1109/ACCESS.2023.3234632.
[4] M. Rampavan and E. P. Ijjina, "Genetic brake-net: Deep learning based brake light detection for collision avoidance using genetic algorithm," Knowl.-Based Syst., vol. 264, Mar. 2023, Art. no. 110338, doi: 10.1016/j.knosys.2023.110338.
[5] Z. Ahmad, A. Shahid Khan, C. Wai Shiang, J. Abdullah, and F. Ahmad, "Network intrusion detection system: A systematic study of machine learning and deep learning approaches," Trans. Emerg. Telecommun. Technol., vol. 32, no. 1, p. e4150, Jan. 2021, doi: 10.1002/ett.4150.
[6] L. Cui, Y. Qu, L. Gao, G. Xie, and S. Yu, "Detecting false data attacks using machine learning techniques in smart grid: A survey," J. Netw. Comput. Appl., vol. 170, Nov. 2020, Art. no. 102808, doi: 10.1016/j.jnca.2020.102808.
[7] T. Meng, X. Jing, Z. Yan, and W. Pedrycz, "A survey on machine learning for data fusion," Inf. Fusion, vol. 57, pp. 115–129, May 2020, doi: 10.1016/j.inffus.2019.12.001.
[8] Ü. Çavuşoğlu, "A new hybrid approach for intrusion detection using machine learning methods," Appl. Intell., vol. 49, pp. 2735–2761, Feb. 2019. [Online]. Available: https://link.springer.com/article/10.1007/s10489-018-01408-x
[9] L. Li, Y. Yu, S. Bai, Y. Hou, and X. Chen, "An effective two-step intrusion detection approach based on binary classification and k-NN," IEEE Access, vol. 6, pp. 12060–12073, 2018, doi: 10.1109/ACCESS.2017.2787719.
[10] Y. A. Al-Khassawneh, "An investigation of the Intrusion detection system for the NSL-KDD dataset using machine-learning algorithms," in Proc. IEEE Int. Conf. Electro Inf. Technol. (eIT), May 2023, pp. 518–523, doi: 10.1109/eIT57321.2023.10187360.
[11] G. S. Fuhnwi, M. Revelle, and C. Izurieta, "Improving network intrusion detection performance: An empirical evaluation using extreme gradient boosting (XGBoost) with recursive feature elimination," in Proc. IEEE 3rd Int. Conf. AI Cybersecur. (ICAIC), Feb. 2024, pp. 1–8, doi: 10.1109/ICAIC60265.2024.10433805.
[12] A. D. Vibhute, C. H. Patil, A. V. Mane, and K. V. Kale, "Towards detection of network anomalies using machine learning algorithms on the NSL-KDD benchmark datasets," Proc. Comput. Sci., vol. 233, pp. 960–969, Jan. 2024, doi: 10.1016/j.procs.2024.03.285.
[13] A. Shehadeh, H. ALTaweel, and A. Qusef, "Analysis of data mining techniques on KDD-cup'99, NSL-KDD and UNSW-NB15 datasets for intrusion detection," in Proc. 24th Int. Arab Conf. Inf. Technol. (ACIT), Dec. 2023, pp. 1–6, doi: 10.1109/ACIT58888.2023.10453884.
[14] T. Mehmood and H. B. Md Rais, "Machine learning algorithms in context of intrusion detection," in Proc. 3rd Int. Conf. Comput. Inf. Sci. (ICCOINS), Aug. 2016, pp. 369–373, doi: 10.1109/ICCOINS.2016.7783243.
[15] N. A. Solekha, "Analysis of NSL-KDD dataset for classification of attacks based on intrusion detection system using binary logistics and multinomial logistics," Seminar Nasional Off. Statist., vol. 2022, no. 1, pp. 507–520, Nov. 2022, doi: 10.34123/semnasoffstat.v2022i1.1138.
[16] S. K. Mehak, Z. Rasheed, N. A. Ibupoto, and S. Ashraf, "Machine learning algorithms for prediction of thyroid syndrome at initial stages in females," Kurdish Stud., vol. 12, no. 5, pp. 466–470, Jul. 2024, doi: 10.53555/ks.v12i5.3247.
[17] N. Wattanapongsakorn, S. Srakaew, E. Wonghirunsombat, C. Sribavonmongkol, T. Junhom, P. Jongsubsook, and C. Charnsripinyo, "A practical network-based intrusion detection and prevention system," in Proc. IEEE 11th Int. Conf. Trust, Secur. Privacy Comput. Commun., Jun. 2012, pp. 209–214, doi: 10.1109/TRUSTCOM.2012.46.
[18] T. Alves, R. Das, and T. Morris, "Embedding encryption and machine learning intrusion prevention systems on programmable logic controllers," IEEE Embedded Syst. Lett., vol. 10, no. 3, pp. 99–102, Sep. 2018, doi: 10.1109/LES.2018.2823906.
[19] S. A. Repalle and V. R. Kolluru, "Intrusion detection system using AI and machine learning algorithm," Int. Res. J. Eng. Technol., vol. 4, no. 12, pp. 1709–1715, 2017. [Online]. Available: https://d1wqtxts1xzle7.cloudfront.net/55496979/IRJET-V4I12314
[20] N. K. Trivedi, R. G. Tiwari, A. K. Agarwal, and V. Gautam, "A detailed investigation and analysis of using machine learning techniques for thyroid diagnosis," in Proc. Int. Conf. Emerg. Smart Comput. Informat. (ESCI), Mar. 2023, pp. 1–5, doi: 10.1109/ESCI56872.2023.10099542.
[21] K. A. Taher, B. Mohammed Yasin Jisan, and Md. M. Rahman, "Network intrusion detection using supervised machine learning technique with feature selection," in Proc. Int. Conf. Robot., Elect. Signal Process. Techn. (ICREST), Jan. 2019, pp. 643–646, doi: 10.1109/ICREST.2019.8644161.
[22] K. Shaukat, S. Luo, V. Varadharajan, I. Hameed, S. Chen, D. Liu, and J. Li, "Performance comparison and current challenges of using machine learning techniques in cybersecurity," Energies, vol. 13, no. 10, p. 2509, May 2020, doi: 10.3390/en13102509.
[23] S. A. R. Shah and B. Issac, "Performance comparison of intrusion detection systems and application of machine learning to snort system," Future Gener. Comput. Syst., vol. 80, pp. 157–170, Mar. 2018, doi: 10.1016/j.future.2017.10.016.
[24] W. Seo and W. Pak, "Real-time network intrusion prevention system based on hybrid machine learning," IEEE Access, vol. 9, pp. 46386–46397, 2021, doi: 10.1109/ACCESS.2021.3066620.
[25] J. Ribeiro, F. B. Saghezchi, G. Mantas, J. Rodriguez, and R. A. Abd-Alhameed, "HIDROID: Prototyping a behavioral host-based intrusion detection and prevention system for Android," IEEE Access, vol. 8, pp. 23154–23168, 2020, doi: 10.1109/ACCESS.2020.2969626.
[26] M. A. Al-Naeem, "Prediction of re-occurrences of spoofed ACK packets sent to deflate a target wireless sensor network node by DDOS," IEEE Access, vol. 9, pp. 87070–87078, 2021, doi: 10.1109/ACCESS.2021.3089683.
[27] A. H. Azizan, S. A. Mostafa, A. Mustapha, C. F. M. Foozy, M. H. A. Wahab, M. A. Mohammed, and B. A. Khalaf, "A machine learning approach for improving the performance of network intrusion detection systems," Ann. Emerg. Technol. Comput., vol. 5, no. 5, pp. 201–208, Mar. 2021, doi: 10.33166/aetic.2021.05.025.
[28] S. Asiri, Y. Xiao, S. Alzahrani, S. Li, and T. Li, "A survey of intelligent detection designs of HTML URL phishing attacks," IEEE Access, vol. 11, pp. 6421–6443, 2023, doi: 10.1109/ACCESS.2023.3237798.
[29] E. N. Crothers, N. Japkowicz, and H. L. Viktor, "Machine-generated text: A comprehensive survey of threat models and detection methods," IEEE Access, vol. 11, pp. 70977–71002, 2023, doi: 10.1109/ACCESS.2023.3294090.
[30] Q. Xiong, C. Yuan, B. He, H. Xiong, and Q. Kong, "GTRF: A general deep learning framework for tuples recognition towards supervised, semi-supervised and unsupervised paradigms," Eng. Appl. Artif. Intell., vol. 124, Sep. 2023, Art. no. 106500, doi: 10.1016/j.engappai.2023.106500.
[31] S. Mohanty and M. Agarwal, "Recursive feature selection and intrusion classification in NSL-KDD dataset using multiple machine learning methods," in Proc. Int. Conf. Comput., Commun. Learn., Cham, Switzerland: Springer, 2023, pp. 3–14, doi: 10.1007/978-3-031-56998-2_1.
[32] M. Zakariah, S. A. AlQahtani, A. M. Alawwad, and A. A. Alotaibi, "Intrusion detection system with customized machine learning techniques for NSL-KDD dataset," Comput., Mater. Continua, vol. 77, no. 3, pp. 4025–4054, 2023, doi: 10.32604/cmc.2023.043752.
[33] M. Wang, K. Zheng, Y. Yang, and X. Wang, "An explainable machine learning framework for intrusion detection systems," IEEE Access, vol. 8, pp. 73127–73141, 2020, doi: 10.1109/ACCESS.2020.2988359.
[34] S. Neupane, J. Ables, W. Anderson, S. Mittal, S. Rahimi, I. Banicescu, and M. Seale, "Explainable intrusion detection systems (X-IDS): A survey of current methods, challenges, and opportunities," IEEE Access, vol. 10, pp. 112392–112415, 2022, doi: 10.1109/ACCESS.2022.3216617.
[35] E. K. Boahen, W. Changda, and B.-M. Brunel Elvire, "Detection of compromised online social network account with an enhanced Knn," Appl. Artif. Intell., vol. 34, no. 11, pp. 777–791, Sep. 2020, doi: 10.1080/08839514.2020.1782002.
[36] E. K. Boahen, B. E. Bouya-Moko, F. Qamar, and C. Wang, "A deep learning approach to online social network account compromisation," IEEE Trans. Computat. Social Syst., vol. 10, no. 6, pp. 3204–3216, Dec. 2023, doi: 10.1109/tcss.2022.3199080.
[37] B. Sharma, L. Sharma, C. Lal, and S. Roy, "Explainable artificial intelligence for intrusion detection in IoT networks: A deep learning based approach," Expert Syst. Appl., vol. 238, Mar. 2024, Art. no. 121751, doi: 10.1016/j.eswa.2023.121751.
[38] E. K. Boahen, S. A. Frimpong, M. M. Ujakpa, R. N. A. Sosu, O. Larbi-Siaw, E. Owusu, J. K. Appati, and E. Acheampong, "A deep multi-architectural approach for online social network intrusion detection system," in Proc. IEEE World Conf. Appl. Intell. Comput. (AIC), Jun. 2022, pp. 919–924, doi: 10.1109/AIC55036.2022.9848865.
[39] H. Attou, A. Guezzaz, S. Benkirane, M. Azrour, and Y. Farhaoui, "Cloud-based intrusion detection approach using machine learning techniques," Big Data Mining Anal., vol. 6, no. 3, pp. 311–320, Sep. 2023, doi: 10.26599/BDMA.2022.9020038.
[40] J. P. Bharadiya, "A tutorial on principal component analysis for dimensionality reduction in machine learning," Int. J. Innov. Sci. Res. Technol., vol. 8, no. 5, pp. 2028–2032, 2023. [Online]. Available: https://www.researchgate.net/profile/Jasmin-Bharadiya-4/publication/371306692
[41] A. Verma and V. Ranga, "On evaluation of network intrusion detection systems: Statistical analysis of CIDDS-001 dataset using machine learning techniques," Authorea Preprints, 2023, doi: 10.36227/techrxiv.11454276.v1.
[42] P. Dini, A. Elhanashi, A. Begni, S. Saponara, Q. Zheng, and K. Gasmi, "Overview on intrusion detection systems design exploiting machine learning for networking cybersecurity," Appl. Sci., vol. 13, no. 13, p. 7507, Jun. 2023, doi: 10.3390/app13137507.
[43] A. Shokeen, N. Yadav, and V. Sisaudia, "Performance analysis of different machine learning algorithms for intrusion detection on KDD-CUP-99 dataset," in Proc. AIP Conf., 2024, vol. 3072, no. 1, Art. no. 020010, doi: 10.1063/5.0203394.
[44] S. M. Kasongo, "A deep learning technique for intrusion detection system using a recurrent neural networks based framework," Comput. Commun., vol. 199, pp. 113–125, Feb. 2023, doi: 10.1016/j.comcom.2022.12.010.
[45] A. O. Alzahrani and M. J. F. Alenazi, "ML-IDSDN: Machine learning based intrusion detection system for software-defined network," Concurrency Comput., Pract. Exper., vol. 35, no. 1, p. e7438, Jan. 2023, doi: 10.1002/cpe.7438.
[46] K. Johnson Singh, D. Maisnam, and U. S. Chanu, "Intrusion detection system with SVM and ensemble learning algorithms," Social Netw. Comput. Sci., vol. 4, no. 5, p. 517, Jul. 2023, doi: 10.1007/s42979-023-01954-3.
[47] S. Ahmadi, "Network intrusion detection in cloud environments: A comparative analysis of approaches," Int. J. Adv. Comput. Sci. Appl., vol. 15, no. 3, pp. 1–9, 2024, doi: 10.14569/IJACSA.2024.0150301.

MANISANKAR SANNIGRAHI received the M.Tech. degree in computer science and information security from the Kalinga Institute of Industrial Technology, Bhubaneswar, Odisha, India, in 2020. He is currently pursuing the Ph.D. degree with the School of Computer Engineering and Information Systems, Vellore Institute of Technology, Vellore, India. His research interests include machine learning, network security, and cryptography.

R. THANDEESWARAN received the B.E., M.Tech., and Ph.D. degrees from Vellore Institute of Technology, Vellore. He is currently a Professor with the School of Computer Engineering and Information Systems, Vellore Institute of Technology. He has 25 years of teaching experience and expertise in computer and communication networks, data and information security, network protocols, traffic analysis, and the IoT security domains. He has published 27 research articles in SCI, Scopus, and highly reputed journals, and also published several books and completed a funded project by the Government of India. He is a member of CSI and the Soft Computing Research Society.