1 Introduction
The scale and complexity of cybersecurity threats have significantly expanded due to
the rapid development of cloud-native technologies and digital infrastructure.
According to Mishra and Mishra [1], traditional signature-based intrusion detection
systems (IDS) often struggle to keep pace with evolving attack vectors and zero-day
exploits. By contrast, machine learning and deep learning techniques have
emerged as effective tools for anomaly detection, offering greater accuracy and
adaptability in dynamic threat environments. Techniques such as Random Forest,
Support Vector Machines (SVM), and feature-based dimensionality reduction have
demonstrated notable effectiveness [3], [4], [9].
2 Literature Survey
3 Proposed Methodology
The proposed Next-Generation IDS workflow, shown in Figure 1, begins with data
preparation and aggregation on the CIC-IDS 2017 dataset. The raw traffic data is
cleaned, normalized with StandardScaler, and reduced in dimensionality with PCA.
Feature selection is guided by Random Forest importance scores and correlation
analysis. Several supervised classifiers, including Naive Bayes, KNN, Decision Trees,
SVM, and Logistic Regression, are trained to detect network intrusions; Random Forest
and LSTM are selected as the final models based on their superior performance.
Federated learning with PyTorch allows distributed, privacy-preserving model updates,
while reinforcement learning improves sample selection. Data synthesis with GANs
increases the model's resistance to new threats. A customized LLM-powered assistant
receives the final prediction outputs and automatically creates structured cyber threat
reports, which are stored in a SQLite database for later examination and analysis. The
system's Flask backend and Gradio-based user interface, which can be hosted on
localhost or deployed to AWS EC2, enable real-time interaction, monitoring, and
automated report generation. The sections that follow cover each part of this pipeline
in detail.
The CIC-IDS 2017 dataset was chosen for its high quality, diversity, and alignment with
contemporary cybersecurity concerns. It addresses the outdated attack signatures and
limited protocol-level variety of earlier benchmarks such as KDD'99 and NSL-KDD.
Because it provides labeled data for both supervised and semi-supervised learning, it is
well suited to training conventional machine learning models and deep learning
networks such as LSTM, and to assessing federated and reinforcement-based learning
strategies. Its temporal depth and structured format also make it appropriate for robust
anomaly detection and sequential modeling in contemporary IDS research.
Data preprocessing is essential to the effectiveness of the proposed intrusion detection
system, because it ensures that the CIC-IDS 2017 network traffic is clean, consistent,
and suitable for machine learning pipelines. The raw data is cleaned with Python tools
such as Pandas, NumPy, and Scikit-learn: missing values are handled, infinite values
and duplicate rows are removed, and column names are standardized. This step reduces
noise and inconsistency that could impair model accuracy. Because the dataset includes
both string-based and numeric features, categorical fields such as attack types are
label-encoded so that supervised algorithms can classify them. Numerical features are
normalized with StandardScaler, which standardizes each feature by removing the mean
and scaling to unit variance; this promotes stable convergence in both deep and
conventional models. Principal Component Analysis (PCA) then reduces dimensionality,
compressing the feature space while retaining most of the variance, which reduces
overfitting and improves training efficiency. SMOTE addresses class imbalance,
producing balanced class distributions for binary and multi-class classification. The
processed data is further sampled to avoid memory problems and improve model
generalization. Overall, the preprocessing pipeline turns raw traffic into a clean,
informative dataset, which speeds up training and improves the system's ability to
identify intrusions accurately.
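The cleaning, encoding, scaling, and projection steps of the preprocessing stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `preprocess`, the toy column names and values, and the 95% retained-variance threshold are hypothetical, and the SMOTE and sampling steps are omitted.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler


def preprocess(df: pd.DataFrame, label_col: str = "label", variance: float = 0.95):
    """Clean, encode, scale, and project a flow-level traffic table."""
    df = df.copy()
    # Standardize column names: trim whitespace, lowercase, use underscores.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Treat infinite values as missing, then drop incomplete and duplicate rows.
    df = df.replace([np.inf, -np.inf], np.nan).dropna().drop_duplicates()

    # Label-encode the categorical attack type.
    encoder = LabelEncoder()
    y = encoder.fit_transform(df[label_col])

    # Standardize each feature to zero mean and unit variance.
    X = StandardScaler().fit_transform(df.drop(columns=[label_col]))
    # Keep as many principal components as needed to explain `variance`
    # of the total variance (assumed threshold, not taken from the paper).
    pca = PCA(n_components=variance, svd_solver="full")
    X_reduced = pca.fit_transform(X)
    return X_reduced, y, encoder, pca


# Toy stand-in for a CIC-IDS 2017 CSV (hypothetical columns and values).
raw = pd.DataFrame({
    " Flow Duration": [10.0, 20.0, np.inf, 40.0, 40.0, 55.0, 10.0],
    "Total Fwd Packets": [1.0, 2.0, 3.0, np.nan, 4.0, 6.0, 1.0],
    " Label": ["BENIGN", "DDoS", "BENIGN", "DDoS", "BENIGN", "DDoS", "BENIGN"],
})
X_reduced, y, encoder, pca = preprocess(raw)
```

In this toy table, one row contains an infinite value, one a missing value, and one duplicates the first row, so four rows survive cleaning. In practice the scaler and PCA would be fitted on the training split only and then applied to validation and test data.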
6 Nilamadhab Mishra | Deepika Ajalkar | Jerome Daniel Sameer Kulkarni | Aarif Sheikh
The core functionality of the proposed intrusion detection system lies in accurately
identifying malicious behavior by combining traditional machine learning models, deep
learning architectures, and reinforcement-enhanced federated learning. Using the
CIC-IDS 2017 dataset as the foundation, multiple supervised classifiers are trained to
differentiate between benign traffic and various attack types: Logistic Regression,
Naive Bayes, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and
Decision Trees. Logistic Regression serves as a baseline for binary classification, while
SVM employs kernel methods to handle high-dimensional decision boundaries. Naive
Bayes and Decision Trees offer interpretability and efficiency for structured input,
whereas KNN classifies based on proximity to labeled instances.
Random Forest is chosen for its ensemble capabilities, contributing both high accuracy
and feature importance insights. To further enhance performance, XGBoost is
incorporated as a gradient boosting method whose built-in regularization helps limit
overfitting. Long Short-Term Memory (LSTM) networks are implemented to learn
temporal dependencies and sequential behavior patterns in network traffic, which are
critical for detecting time-sensitive threats such as infiltration or brute-force attacks.
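A minimal sketch of how the classical classifiers might be trained and compared is given below, together with a Random Forest feature ranking. It rests on several assumptions: the data is synthetic (generated with scikit-learn's `make_classification`, not CIC-IDS 2017), all hyperparameters are illustrative defaults, and XGBoost and LSTM are left out to keep the example dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for preprocessed traffic features (not CIC-IDS 2017 data).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hyperparameters below are illustrative assumptions, not the paper's settings.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm_rbf": SVC(kernel="rbf"),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit every model on the same split and record its held-out accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

# Rank features by Random Forest impurity-based importance (highest first).
importances = models["random_forest"].feature_importances_
top_features = importances.argsort()[::-1][:5]
```

Accuracies obtained on this synthetic data say nothing about the results reported for CIC-IDS 2017; the sketch only shows the comparison pattern and how importance scores can guide feature selection.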
As shown in Figure 3, the Random Forest model achieved a slightly higher test accuracy
(0.98) than the LSTM model (0.97), which is consistent with the effectiveness of
ensemble methods on structured data. Throughout the training phase, models are
evaluated using accuracy, precision, recall, F1-score, and ROC curves to check
robustness across both centralized and federated configurations.
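The evaluation metrics named above can be computed with scikit-learn as in the following sketch. The labels, predictions, and scores are a hand-made binary toy example (1 = attack, 0 = benign), not outputs of the paper's models.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy ground truth, hard predictions, and attack-class scores (hypothetical).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.4]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of the two
    "roc_auc": roc_auc_score(y_true, y_score),     # area under the ROC curve
}
```

With three true positives, one false positive, one false negative, and three true negatives, accuracy, precision, recall, and F1 all equal 0.75 here, and the ROC AUC is 0.9375 because 15 of the 16 attack/benign score pairs are ranked correctly. For the multi-class case, the precision, recall, and F1 functions need an explicit `average` argument such as `"macro"` or `"weighted"`.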
3.5 Deployment
The proposed intrusion detection system combines Flask and Gradio interfaces to
support real-time monitoring, automation, and ease of use. The system runs locally or
on cloud infrastructure such as AWS EC2, which provides scalable computational
resources for handling fluctuating network traffic. As a lightweight API layer, the Flask
backend handles user authentication, initiates the IDS pipeline, and mediates
communication between the machine learning models, the federated learning modules,
and the LLM-based report generator. After logging in, users can run the entire detection
process directly from the front-end interface. The front end, built with custom HTML
and Gradio as shown in Fig. 4, offers interactive features such as "Run Full Pipeline"
and "Load Batch Reports" for creating dynamic threat reports. The Gradio-based
interface allows rapid deployment and clear presentation of results, and generated
reports are stored in SQLite for later querying or viewing in DB Browser. For local
deployment the system is typically hosted at http://127.0.0.1:7860/; it can also be
deployed to AWS EC2 instances for wider accessibility. Because the architecture
supports continuous monitoring, the federated and reinforcement learning modules can
adapt to fresh traffic data. To support explainability for analysts, a proprietary LLM
assistant automatically generates structured risk summaries from the outputs of the RF,
LSTM, and FL-RL models, which makes the type and severity of detected intrusions
easier to understand. Visual dashboards and logs built from Matplotlib plots and
Pandas-based summaries make IDS activity easy to follow. From real-time detection to
LLM-based reporting, the deployment pipeline provides end-to-end automation and
supports operational scalability, explainability, and high availability.
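To make the role of the Flask API layer concrete, the sketch below exposes a single endpoint that triggers the detection pipeline and returns JSON. Everything specific in it is an assumption made for illustration: the route name `/run-pipeline`, the request and response shapes, and the placeholder `run_pipeline` function, which stands in for the trained models. User authentication, the Gradio front end, and the LLM report generator are omitted.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def run_pipeline(records):
    """Placeholder for the IDS pipeline (hypothetical).

    A real system would preprocess the records and call the trained models;
    here every flow is labeled BENIGN so that the sketch returns something.
    """
    return [{"flow_id": r.get("flow_id"), "prediction": "BENIGN"} for r in records]


@app.route("/run-pipeline", methods=["POST"])
def run_pipeline_endpoint():
    # Expect a JSON body of the form {"records": [{...}, ...]} (assumed shape).
    payload = request.get_json(silent=True) or {}
    records = payload.get("records", [])
    if not isinstance(records, list):
        return jsonify({"error": "records must be a list"}), 400
    return jsonify({"results": run_pipeline(records)})
```

The endpoint can be exercised without starting a server through Flask's test client, for example `app.test_client().post("/run-pipeline", json={"records": [{"flow_id": 1}]})`. Keeping the pipeline behind a plain function makes it callable from both the API layer and a front-end button handler.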
This study evaluated several machine learning, deep learning, and federated learning
algorithms for detecting complex network intrusions using the CIC-IDS 2017 dataset.
Among the traditional models, Decision Tree and Random Forest achieved the highest
test accuracies (99.85% and 99.77%, respectively), demonstrating their effectiveness
in handling structured, high-dimensional input features. The LSTM model also
performed well, with a test accuracy of 98.71%, and is particularly useful for capturing
long-range temporal dependencies associated with sophisticated and persistent threat
patterns. In contrast, Naive Bayes and Logistic Regression performed significantly
worse, indicating their limitations in modeling the non-linear and intricate interactions
present in modern attack scenarios.
The batch evaluation interface is presented in Figure 6, where reports for multiple
threat instances—including DDoS and Infiltration—are generated and displayed. This
interface enables users to interactively load and download grouped reports, which is
particularly beneficial for real-time operations in high-volume traffic environments.
Furthermore, Figure 7 illustrates the persistent storage of these reports in an SQLite
database. Each report includes fields for predicted attack type, model insights, and sum-
mary interpretation. This structured storage approach not only aids historical analysis
but also supports downstream integration with other forensic or SIEM systems.
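Persisting reports with the fields listed above could look like the following sketch, which uses Python's built-in `sqlite3` module. The table name `threat_reports`, the column names, the helper functions, and the sample report texts are hypothetical; the paper does not give the actual schema.

```python
import sqlite3

# Assumed schema: one row per generated report.
SCHEMA = """
CREATE TABLE IF NOT EXISTS threat_reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    predicted_attack TEXT NOT NULL,
    model_insights TEXT,
    summary TEXT
)
"""


def save_report(conn, predicted_attack, model_insights, summary):
    """Insert one report and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO threat_reports (predicted_attack, model_insights, summary) "
            "VALUES (?, ?, ?)",
            (predicted_attack, model_insights, summary),
        )
    return cur.lastrowid


def reports_for(conn, predicted_attack):
    """Return (id, predicted_attack, summary) rows for one attack type."""
    rows = conn.execute(
        "SELECT id, predicted_attack, summary FROM threat_reports "
        "WHERE predicted_attack = ? ORDER BY id",
        (predicted_attack,),
    )
    return rows.fetchall()


# In-memory database for the example; a file path would make it persistent.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first_id = save_report(conn, "DDoS", "RF and LSTM agree",
                       "High-volume flood from one subnet")
save_report(conn, "Infiltration", "LSTM only", "Slow lateral movement")
ddos_reports = reports_for(conn, "DDoS")
```

Parameterized queries (the `?` placeholders) keep report text, which may contain arbitrary characters produced by the LLM, from being interpreted as SQL.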
Validation performance across models confirmed that Random Forest and Decision
Tree maintained minimal validation loss, indicating strong generalization without
overfitting. LSTM and KNN also achieved low validation losses, reflecting consistent
reliability, whereas Naive Bayes and Logistic Regression showed higher validation
losses, suggesting underfitting or limited capacity to fit complex data distributions.
The ensemble nature of Random Forest provides resilience to noisy features, while
LSTM's temporal modeling capabilities allow it to detect threats that traditional
models may miss.
accelerated using PyTorch on Google Colab with GPU support, ensuring scalable and
efficient development cycles.
5 Methodology
The methodology for this intrusion detection system is structured around a multi-stage
pipeline that integrates data processing, model training, federated learning, and auto-
mated threat reporting. The CIC-IDS 2017 dataset forms the foundation and is sub-
jected to a comprehensive pre-processing routine that includes cleaning, handling miss-
ing values, normalization using StandardScaler, and dimensionality reduction via Prin-
cipal Component Analysis (PCA). Class imbalance is addressed using SMOTE to en-
sure equitable model training across attack categories.
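The idea behind SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors, can be illustrated with the simplified sketch below. It is not the reference SMOTE implementation and not necessarily what the authors used: the function name `smote_like_oversample`, the choice of `k`, and the toy data are all assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, minority_label, n_new, k=3, seed=0):
    """Append n_new synthetic minority rows made by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    # Pick a random minority seed point and one of its k neighbors per sample.
    base = rng.integers(0, len(X_min), size=n_new)
    picks = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]
    # Interpolation factor in [0, 1) places each sample on the joining segment.
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picks] - X_min[base])

    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority_label)])
    return X_out, y_out


# Toy imbalanced data: 40 majority rows and 8 minority rows (hypothetical).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(3, 1, (8, 4))])
y = np.array([0] * 40 + [1] * 8)
X_bal, y_bal = smote_like_oversample(X, y, minority_label=1, n_new=32)
```

After the call both classes have 40 rows. Oversampling of this kind is normally applied to the training split only, so that synthetic points never leak into validation or test data.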
6 Conclusion
References
1. Mishra, N., Mishra, S.: A review of machine learning-based intrusion detection system. EAI
Endors. Trans. Internet Things 10 (2024)
2. Philip, J., Harshini, M., Patil, S., Shareef, S.K.K., Reddy, B.V., Sridevi, S.: Deep learning
for web intrusion detection. In: Proc. 2nd Int. Conf. Sustainable Computing and Smart Sys-
tems (ICSCSS), pp. 10–12 (2024)
3. Arthy, J., Variar, S.D., Tharakeshwar, N., Narayan, R.A.: End-to-end network traffic exam-
ination for intrusion detection using feature embedding learning. In: Proc. 4th Int. Conf.
Intelligent Technologies (CONIT), pp. 21–23 (2024)
4. Kotnur, H., Muthusamy, S., Ravindran, S., Vijean, V.: A comprehensive survey of intrusion
detection system using machine learning and deep learning approaches. In: Proc. 10th Int.
Conf. Advanced Computing and Communication Systems (ICACCS), pp. 14–15 (2024)
5. Mishra, N., Mishra, S.: A novel intrusion detection technique of the computer networks us-
ing ML. Int. J. Intell. Syst. Appl. Eng. 11(5s), 247–260 (2023)
6. Mishra, N., Mishra, S., Patnaik, B.: A novel IDS based on random oversampling and deep
neural network. Indian J. Comput. Sci. Eng. 13(6), 1924–1936 (2022)
7. Mande, S., Ramachandran, N., Kumar, C.K., Priyanka, C.N.: A brief analysis on machine
learning classifiers for intrusion detection to enhance network security. In: Proc. Int. Conf.
Automation, Computing and Renewable Systems (ICACRS), pp. 1–6 (2022)
8. Mishra, N., Mishra, S.: On the NSL-KDD dataset, a survey of machine learning-based in-
trusion detection systems. J. Huazhong Univ. Sci. Technol. 50(4), 4512 (2021)
9. Mishra, N., Mishra, S.: Support vector machine used in network intrusion detection. In: Nat.
Workshop on Internet of Things (IoT) (2018)
10. Mishra, N., Mishra, S., Patnaik, B.: Intrusion detection using IoT. In: The Things Services
and Applications of Internet of Things, pp. 79–84 (2018)
11. Talukder, M.A., Islam, M.M., Uddin, M.A., Hasan, K.F., Sharmin, S., Alyami, S.A., et al.:
Machine learning-based network intrusion detection for big and imbalanced data using over-
sampling, stacking, feature embedding and feature extraction. J. Big Data 11(1), 33 (2024)
12. Fan, Z., Sohail, S., Sabrina, F., Gu, X.: Sampling-based machine learning models for intru-
sion detection in imbalanced dataset. Electronics 13(10), 1878 (2024)
13. Azar, A.T., Shehab, E., Mattar, A.M., Hameed, I.A., Elsaid, S.A.: Deep learning-based hy-
brid intrusion detection systems to protect satellite networks. J. Netw. Syst. Manag. 31(4),
82 (2023)
14. Rani, M., Gagandeep: Effective network intrusion detection by addressing class imbalance
with deep neural networks. Multimed. Tools Appl. 81(6), 8499–8518 (2022)
15. Praveena, V., Vijayaraj, A., Chinnasamy, P., Ali, I., Alroobaea, R., et al.: Optimal deep
reinforcement learning for intrusion detection in UAVs. Comput. Mater. Contin. 70(2),
2639–2653 (2022)