FACULTY OF ENGINEERING, SCIENCE AND TECHNOLOGY

SCHOOL OF COMPUTING

BACHELOR OF INFORMATION TECHNOLOGY (HONS)

Final Year Project- 1

EC3319

Project Title: Hybrid Vulnerability Detection using XGBoost Algorithm

Name: Prajjwal Budha

Student ID: 00020688

Supervisor Name: Mr. Bibek Gautam

Submission Date:

This project is submitted in fulfillment of the requirements for BACHELOR OF INFORMATION TECHNOLOGY (HONS), Nilai University

Abstract
In the digital era, web applications have become critical components of modern infrastructure but
also prime targets for malicious attacks. Among the most prevalent security threats are input
validation vulnerabilities such as SQL injection (SQLi) and cross-site scripting (XSS). These
pose substantial risks to application integrity and user data, particularly in developing countries
like Nepal, where the adoption of web technologies is accelerating, but cybersecurity awareness
and practices remain limited. Traditional vulnerability detection methods, namely static analysis and
dynamic testing, are hindered by high false-positive rates and limited scope in analyzing execution
behavior. Addressing this gap, this project proposes a hybrid vulnerability detection framework
that integrates static and dynamic analysis using the XGBoost machine learning algorithm. The
framework is designed to analyze source code features and correlate them with runtime
behaviors, enhancing the detection of SQLi and XSS vulnerabilities in PHP and Java-based
applications. To ensure reliability, the model is trained and validated using benchmark datasets
from OWASP, a globally recognized source for web application security standards. Preliminary
results demonstrate that the proposed framework significantly reduces false positives while
improving detection accuracy compared to traditional techniques. The system is especially
tailored to the needs of Nepal’s evolving software ecosystem, offering a practical and scalable
solution to enhance web application security. By bridging the gap between static code inspection
and dynamic execution monitoring, this approach provides a more holistic and intelligent
vulnerability detection mechanism. Ultimately, this research aims to contribute toward building
more secure web applications in emerging digital environments and serve as a foundation for
further advancements in automated security analysis tools.

Student Declaration
I declare that this report entitled “Hybrid vulnerability detection using XGBoost algorithm” is my
own work except as cited in the references. The report has not been accepted for any degree or
diploma and is not being submitted concurrently in candidature for any degree or other award.

Signature:

Name: Prajjwal Budha

Student ID: 00020688

Date:

Supervisor’s Declaration
I hereby declare that I have reviewed this project and confirm it meets the scope and quality
requirements for the award of the Bachelor of Information Technology degree.

Signature:

Name of the Supervisor: Bibek Gautam

Date:

Table of Contents
Abstract
Student Declaration
Supervisor’s Declaration
Objectives
1 Introduction
1.1 Background and Motivation
1.2 Research Framework
1.3 Significance of Study
2. Research Background
2.1 Problem Statement
2.2 Scope
3. Literature Review
3.1 Random Forest Algorithm
3.2 XGBoost (Gradient Boosting)
3.3 Support Vector Machine (SVM) Algorithm
3.4 Decision Tree (DT)
3.5 K-Nearest Neighbors (KNN)
3.6 Summary and Algorithm Selection
4 Research Methodology
4.1 Data Collection and Data Description
4.2 Methodology (Data Input, Processing, Extraction, Classification)
4.3 Hybrid Vulnerability Detection Workflow
4.4 Working Mechanism
6 Expected Outcome
7. Conclusion
References

Table of Figures
Figure 1 Schematic diagram of SQL injection attack (Qi, et al., 2019)
Figure 2 Random Forest Algorithm in ML (GeeksforGeeks, 2025)
Figure 3 Simplified structure of XGBoost (Wang, et al., 2020)
Figure 4 SVM algorithm outputs a hyperplane which categorizes the data, usually into two classes (Gate, 2022)
Figure 5 Decision Tree Algorithm (Anshul, 2025)
Figure 6 Data Classification using KNN (Parameswaran, 2022)
Figure 7 Hybrid Vulnerability Detection Workflow

Objectives
• To develop a hybrid vulnerability detection framework using the XGBoost algorithm

• To evaluate the accuracy and false positive rate of the system using benchmark datasets
(OWASP)
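The two evaluation measures named in the second objective can be computed from a confusion matrix. The following is a minimal sketch in plain Python; the function names and the toy labels are hypothetical and are not results from this project (1 denotes a vulnerable sample, 0 a safe one).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = vulnerable."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Share of all samples that were classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def false_positive_rate(y_true, y_pred):
    """Share of safe samples that were wrongly flagged as vulnerable."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return fp / (fp + tn)

# Hypothetical toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
acc = accuracy(y_true, y_pred)
fpr = false_positive_rate(y_true, y_pred)
```

Reporting the false positive rate alongside accuracy matters here because a detector can reach high accuracy on an imbalanced dataset while still flagging many safe code samples.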

1 Introduction
The digital era has transformed service delivery and management, with web applications now
serving as the backbone of sectors like banking, e-commerce, governance, and healthcare.
However, this widespread reliance also increases their exposure to cyber threats. Two of the most
critical and commonly exploited vulnerabilities are SQL injection (SQLi) and cross-site
scripting (XSS). These attacks manipulate user input to alter application behavior, extract
sensitive data, or bypass authentication mechanisms.

In developing countries such as Nepal, the challenge is compounded by limited cybersecurity
awareness, lack of skilled professionals, and reliance on open-source tools. As a result, many
public-facing applications are launched without proper vulnerability assessment, leaving them
exposed to threats.

1.1 Background and Motivation

Web application vulnerabilities often occur when user-supplied input, API responses, or HTTP
headers are processed without adequate validation or sanitization. For instance, a PHP script
that passes $_GET values to mysqli_query() without input filtering is prone to SQL injection.
Similarly, using echo to output unescaped user input can lead to XSS attacks.

Figure 1 Schematic diagram of SQL injection attack (Qi, et al., 2019)

The diagram illustrates the process of a SQL injection attack, showing how an attacker
manipulates user input to execute unauthorized SQL commands, potentially gaining access to
sensitive data or altering the database.
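The injection pattern described above can be reproduced in a few lines. The sketch below is illustrative only: it uses Python's built-in sqlite3 and html modules instead of PHP and MySQL (an assumption made to keep the example self-contained), and the table, function names, and payloads are hypothetical.

```python
import html
import sqlite3

# Hypothetical in-memory database standing in for the application's backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

def lookup_unsafe(name):
    # Vulnerable: user input is concatenated directly into the SQL string.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def lookup_safe(name):
    # Safer: a parameterized query treats the input as data, not as SQL.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "x' OR '1'='1"
unsafe_rows = lookup_unsafe(payload)  # injected condition matches every row
safe_rows = lookup_safe(payload)      # no user is literally named like the payload

def render_unsafe(comment):
    # Vulnerable: unescaped input is echoed into HTML (reflected XSS).
    return "<p>" + comment + "</p>"

def render_safe(comment):
    # Safer: HTML metacharacters are escaped before output.
    return "<p>" + html.escape(comment) + "</p>"
```

With the payload shown, the concatenated query returns every stored secret, while the parameterized version returns nothing; the same contrast holds between the two rendering functions when the input contains a script tag.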

Traditionally, vulnerability detection techniques are divided into two categories:

• Static analysis inspects source code to detect insecure patterns without executing the
application.

• Dynamic analysis runs the application and monitors its behavior under simulated attack
conditions.

However, both have limitations. Static analysis can produce many false positives, flagging code
that may not actually be exploitable. Dynamic analysis, while more accurate at runtime, might
miss vulnerabilities hidden in rarely executed code branches. Studies have shown that tools like
SonarQube and SQLMap are affected by these trade-offs (Fadlalla & Elshoush, 2023).

1.2 Research Framework

To address the shortcomings of single-method detection, this research proposes a hybrid
vulnerability detection framework that combines static and dynamic analysis, enhanced using
XGBoost, a powerful gradient boosting algorithm.

The framework operates as follows:

• From static analysis, it extracts features such as the use of insecure functions (eval(),
exec(), mysqli_query()), absence of input sanitization, and code context.

• From dynamic analysis, it collects runtime signals such as abnormal HTTP responses,
execution delays, and attack payload success.

These features are fed into an XGBoost classifier trained on the OWASP Benchmark Dataset, a
labeled dataset widely used for evaluating vulnerability detection tools (Alhashmi, et al., 2023).
This integration helps reduce false positives and improves detection of hard-to-reach
vulnerabilities.
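One way the two feature sources could be combined into a single input vector is sketched below. This is a simplified illustration under stated assumptions, not the framework's actual extractor: the regular expressions, the sanitizer list, the feature names, and the dynamic signal values are all hypothetical, and a production tool would parse the code rather than pattern-match it.

```python
import re

# Hypothetical lists of risky PHP sinks and common sanitizers.
INSECURE_CALLS = ("eval", "exec", "mysqli_query")
SANITIZERS = ("htmlspecialchars", "mysqli_real_escape_string", "filter_input")

def static_features(php_source):
    """Count insecure calls and flag raw input use and sanitizer presence."""
    features = {}
    for name in INSECURE_CALLS:
        pattern = r"\b" + name + r"\s*\("
        features["calls_" + name] = len(re.findall(pattern, php_source))
    raw_input = re.search(r"\$_(GET|POST|REQUEST)\b", php_source)
    features["uses_raw_input"] = int(bool(raw_input))
    features["has_sanitizer"] = int(any(s + "(" in php_source for s in SANITIZERS))
    return features

def hybrid_vector(static, dynamic):
    """Merge static and dynamic features into one ordered numeric vector."""
    merged = {**static, **dynamic}
    return [float(merged[key]) for key in sorted(merged)]

# Hypothetical PHP snippet and runtime observations, for illustration only.
sample = """<?php $r = mysqli_query($conn, "SELECT * FROM t WHERE id=" . $_GET['id']); ?>"""
dynamic = {"http_500": 1, "delay_ms": 812, "payload_succeeded": 1}

features = static_features(sample)
vector = hybrid_vector(features, dynamic)
```

Sorting the keys keeps the feature order stable across samples, which a tabular classifier such as XGBoost requires; each resulting vector would form one row of the training matrix.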

1.3 Significance of Study
This study presents a practical and automated approach to vulnerability detection tailored for low-
resource settings like Nepal. By focusing on PHP and Java, which dominate local web
development, the framework ensures relevance and ease of adoption. It supports both academic
research in secure software engineering and the operational needs of developers and security
teams by identifying vulnerabilities before they can be exploited.

2. Research Background

2.1 Problem Statement

As Nepal's digital landscape expands, sectors such as banking, government, education, and e-
commerce increasingly rely on web applications, making them more vulnerable to cyber threats
like SQL Injection (SQLi) and Cross-Site Scripting (XSS). These vulnerabilities typically arise
from weak input validation and poor coding practices. Globally, these threats are among the top
risks listed by OWASP. In Nepal, however, the rapid growth of the digital ecosystem, coupled
with a still-maturing understanding of cybersecurity, has led to an alarming increase in cyberattacks.
Reports indicate that over 80% of the country's websites are vulnerable, with SQLi and DDoS
being common attack vectors (Chaudhary, 2024).

The Nepal Police Cyber Bureau has reported a significant surge in cybercrime cases, with the
number increasing six-fold over the past five years (Ratopati, 2024). In the fiscal year 2022–23,
9,013 cybercrime cases were registered, more than doubling to 19,730 in 2023–24. Notably,
hacking offenses now account for approximately 52% of all cybercrime cases, highlighting the
critical need for enhanced cybersecurity measures. Despite this alarming trend, the Cyber Bureau
remains severely understaffed and underfunded, with only 28 of its 106 employees dedicated to
case resolution in the IT section (Malik, 2024). This resource gap hampers the bureau's ability to
effectively address the growing number of cyber threats.

To address these challenges, we propose a hybrid vulnerability detection framework utilizing
XGBoost, a machine learning algorithm known for its high accuracy and robustness in handling
imbalanced datasets. The model integrates both static and dynamic analysis techniques, helping
reduce false positives and detect hidden vulnerabilities more effectively. It is designed
to detect SQLi and XSS vulnerabilities with greater precision, making it well suited to the
context of Nepal, where there is an urgent need for affordable, automated, and context-aware
cybersecurity solutions to protect rapidly growing digital infrastructures.

2.2 Scope
This research focuses on developing a machine learning-based hybrid framework for detecting
input validation vulnerabilities in web applications. The primary aim is to improve the accuracy
and reliability of detecting SQL injections (SQLi), cross-site scripting (XSS), and path traversal
attacks in PHP and Java applications. The system leverages both static and dynamic analysis
methods, enhanced by the XGBoost algorithm, to provide more precise vulnerability
identification. The scope is limited to input-related flaws and does not extend to network-level
attacks or vulnerabilities outside the web application layer. The following points outline the
specific inclusions and exclusions of this study:

• The system will detect common input validation vulnerabilities, specifically SQL
injection (SQLi), cross-site scripting (XSS), and path traversal, in PHP and Java-based
web applications.

• The study will focus on analyzing both static code features and dynamic execution
behaviors to identify vulnerabilities more accurately.

• The research will use the XGBoost machine learning algorithm to build a hybrid
detection model that improves accuracy and reduces false positives.

• The project will not cover network-layer attacks (e.g., DDoS), non-input validation
vulnerabilities (e.g., CSRF), or applications developed in natively compiled languages like C or
C++.

The proposed framework can be integrated into the software development lifecycle (SDLC) of
web application projects, particularly during the testing and quality assurance phases. It can be
deployed as a plugin or standalone tool for developers and security analysts to scan PHP and Java
codebases for input-related vulnerabilities before deployment. Additionally, it may serve as a
valuable component in automated DevSecOps pipelines, enhancing continuous security
monitoring in agile development environments. By offering early detection of critical flaws, the
system aims to reduce the cost and risk associated with post-deployment vulnerability
remediation.

3. Literature Review

3.1 Random Forest Algorithm

The Random Forest algorithm is an ensemble learning method that operates by constructing a
multitude of decision trees during the training phase and making predictions based on the
majority vote in classification tasks or averaging in regression tasks. It follows the principle of
bagging (Bootstrap Aggregation), where multiple subsets of the original dataset are randomly
selected with replacement to train each tree independently. To further introduce randomness and
prevent overfitting, Random Forest selects a random subset of features at each split in a tree,
rather than considering all features. This helps ensure that the individual trees are de-correlated.
Once all the trees have been trained, the model makes a final prediction by aggregating the
outputs of all trees—either by voting (for classification) or averaging (for regression). Due to this
aggregation mechanism and the use of random sampling in both data and features, Random
Forest is robust to noise, can handle large feature spaces, and is typically far less prone to the
overfitting that single decision trees suffer from.

Figure 2 Random Forest Algorithm in ML (GeeksforGeeks, 2025)

Figure 2 illustrates the Random Forest Algorithm in Machine Learning, showcasing key
components such as Model Training, Model Testing, Clusters A and B, and Prediction Output.
The diagram highlights the process flow and structure of the algorithm, as referenced from
GeeksforGeeks.

Mathematical Model

The Random Forest classifier can be mathematically described as:

Prediction function:

For classification, the prediction $\hat{y}$ is given by:

$$\hat{y} = \mathrm{mode}\{h_1(x), h_2(x), \ldots, h_n(x)\}$$

Where:

• $h_i(x)$ is the prediction from the $i$-th decision tree.

• $n$ is the total number of trees in the forest.

Gini Impurity (for splitting nodes):

$$\mathrm{Gini}(t) = 1 - \sum_{i=1}^{C} p_i^2$$

Where:

• $p_i$ is the proportion of samples belonging to class $i$ at node $t$,

• $C$ is the total number of classes.

The split that results in the largest reduction in Gini impurity is selected.
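The two formulas above can be checked with a short sketch in plain Python. The function names are hypothetical, and weighting each child's impurity by its share of the samples is the usual CART convention, assumed here rather than stated in the text.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_reduction(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

def forest_predict(tree_predictions):
    """Majority vote (mode) over the predictions of the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# A perfectly mixed two-class node has impurity 0.5; a split that separates
# the classes completely removes all of it.
mixed = gini([0, 0, 1, 1])
gain = gini_reduction([0, 0, 1, 1], [0, 0], [1, 1])
vote = forest_predict(["vulnerable", "safe", "vulnerable"])
```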

In various cybersecurity-related research projects, the Random Forest algorithm has shown
remarkable effectiveness in detecting vulnerabilities and malicious behavior. For instance, in the
study “Enhancing Web Traffic Attacks Identification Through Ensemble Methods and Feature
Selection” (Urda, et al., 2024), Random Forest was applied to identify common web
vulnerabilities like SQL Injection (SQLi) and Cross-Site Scripting (XSS) using the CSIC2010 v2
dataset. The algorithm demonstrated strong classification ability, achieving an Area Under the
ROC Curve (AUC) of 0.989, outperforming several baseline models. Similarly, (Attaoui, et al.,
2024) utilized Random Forest for Android malware detection in their research titled “Android
Malware Detection Using the Random Forest Algorithm.” Using a comprehensive dataset of
Android applications, the model achieved an impressive accuracy of 98.47%, sensitivity of
98.60%, and F1-score of 98.60%, thanks to its ability to handle high-dimensional feature sets and
noisy data. In another relevant work by (Kamal & Raheja, 2023), Random Forest was used for
software vulnerability prediction based on data from the National Vulnerability Database. The
algorithm obtained a root mean square error (RMSE) of 0.01945, showing superior performance
compared to other models like Support Vector Machines (SVM) and Linear Regression. These
findings collectively highlight Random Forest’s robustness, generalizability, and strong
predictive capability in security-focused applications.

3.2 XGBoost (Gradient Boosting)
XGBoost, short for Extreme Gradient Boosting, is a highly efficient and scalable
implementation of gradient boosting machines (GBM). It works by sequentially building an
ensemble of weak learners, typically decision trees, where each new tree attempts to correct the
errors made by the previous ensemble. Unlike Random Forest, which builds trees independently,
XGBoost builds trees additively, meaning each new tree focuses on the residual errors of the
prior model to minimize a regularized objective function. This objective function combines a loss
function (like log-loss for classification) and a regularization term to penalize model complexity,
which helps avoid overfitting. XGBoost supports shrinkage (learning rate), column subsampling,
and L1/L2 regularization, making it both fast and generalizable. Its design leverages parallel
processing, optimized tree pruning, and cache-aware computing, making it one of the most
accurate and fastest gradient boosting implementations in practice.

Figure 3 Simplified structure of XGBoost (Wang, et al., 2020)

This diagram illustrates an iterative process where multiple trees (Tree-1, Tree-2, Tree-3)
sequentially refine predictions by addressing residuals, with intermediate results (Result_1,
Result_2) summed to produce a final output, followed by further residual correction (Result_3).
This suggests a gradient boosting-like approach for predictive modeling.
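The residual-fitting loop described above can be sketched in plain Python. This is a simplified illustration of generic gradient boosting with squared-error loss and one-split trees (stumps) on a single feature; it is not XGBoost itself, which additionally uses second-order gradients and regularization. All names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals of a single feature."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # No valid split (all x values equal): predict the mean residual.
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add trees one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: the label switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
one_tree = boost(xs, ys, n_trees=1)
many_trees = boost(xs, ys, n_trees=20)
```

Each added tree shrinks the remaining residual, so the training error of the 20-tree model is lower than that of the single-tree model; the learning rate controls how much of each correction is applied.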

Mathematical Model

The core of XGBoost is an additive model that minimizes a regularized objective function:

Objective Function:
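The objective is commonly written in the following standard form from the XGBoost literature; the notation ($N$ training samples, $K$ trees, $T$ leaves per tree, leaf weights $w$) is assumed here rather than taken from this report.

```latex
\mathcal{L} = \sum_{i=1}^{N} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2
```

Here $l$ is the loss function (for example, log-loss for classification), $f_k$ is the $k$-th tree, and $\Omega$ penalizes model complexity: $\gamma$ discourages additional leaves and $\lambda$ is the L2 penalty on leaf weights. An L1 penalty on the leaf weights can be added to $\Omega$ in the same way.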

Recent advancements in cybersecurity have demonstrated the growing effectiveness of XGBoost
in identifying and mitigating various digital threats, including malware and intrusion attempts.
(Rosyada, et al., 2024) applied XGBoost to a malware dataset with Chi-Squared feature selection,
resulting in an enhanced accuracy of 99.2% and significantly reduced processing time, making it
both accurate and efficient for malware classification tasks. Similarly, (Pant, 2023) used
XGBoost to detect malware in executable files, achieving 98.33% accuracy and a precision of
99.01%, outperforming traditional machine learning algorithms such as SVM and Random
Forest. In another study focused on intrusion detection, a hybrid XGBoost–deep learning model
attained an accuracy of 99.90%, highlighting XGBoost’s vital contribution to both feature
selection and overall model performance (Nazeer, et al., 2024). These results confirm XGBoost’s
capacity to handle high-dimensional, imbalanced datasets while offering fast training and reliable
classification, making it a highly suitable choice for modern cybersecurity applications.

3.3 Support Vector Machine (SVM) Algorithm
Support Vector Machine (SVM) is a supervised machine learning algorithm primarily used for
classification tasks. It works by finding the optimal hyperplane that best separates data points of
different classes in a high-dimensional space. The optimal hyperplane is defined as the one that
maximizes the margin between the nearest data points (called support vectors) from each class.
For non-linearly separable data, SVM employs a kernel trick to transform the data into a higher-
dimensional space where a separating hyperplane can be found. Common kernels include
polynomial, radial basis function (RBF), and sigmoid. SVM is particularly powerful for handling
high-dimensional and sparse data, making it suitable for security applications like intrusion
detection and vulnerability classification. It is also robust against overfitting, especially in cases
where the number of features exceeds the number of samples.

Figure 4 SVM algorithm outputs a hyperplane which categorizes the data, usually into two classes (Gate, 2022)

The figure illustrates the optimal hyperplane (decision boundary) in a Support Vector Machine
(SVM), which maximizes the margin (distance) between two linearly separable classes in the
feature space X1–X2. The hyperplane ensures robust classification by positioning itself
equidistant from the nearest data points (support vectors) of each class.
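The margin idea can be checked numerically with a small sketch in plain Python. The hyperplane below is hand-picked for the toy data rather than produced by training, and the points, labels, and function names are hypothetical.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labeled point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in zip(points, labels))

# Hypothetical, linearly separable toy data with labels +1 and -1.
points = [(1, 1), (2, 1), (4, 3), (4, 4)]
labels = [-1, -1, 1, 1]
w, b = (1.0, 1.0), -5.0  # hand-picked hyperplane x1 + x2 = 5
margin = geometric_margin(w, b, points, labels)
```

For this data the nearest point of each class, (2, 1) and (4, 3), lies at the same distance from the line x1 + x2 = 5, which matches the equidistance property described for the figure.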

Mathematical Model

SVM aims to solve the following optimization problem:
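A standard way to write this problem is the soft-margin (C-SVM) primal form below; the symbols ($w$, $b$, slack variables $\xi_i$, penalty $C$, and $N$ training samples with labels $y_i \in \{-1, +1\}$) are assumed notation rather than taken from this report.

```latex
\min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \ldots, N
```

Minimizing $\lVert w \rVert^2$ maximizes the margin, while $C$ controls how strongly margin violations $\xi_i$ are penalized. With a kernel, $x_i$ is replaced by its image in the higher-dimensional feature space.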

Support Vector Machines (SVM) have been extensively utilized in cybersecurity for tasks such as
vulnerability detection, classification of malicious behaviors, and attack prediction. In a study by
(Gu & Lu, 2021), SVM was employed within an intrusion detection framework, achieving an
accuracy of 93.75% on multiple datasets, demonstrating its effectiveness in identifying cyber
threats. Similarly, in the research titled "Network Attack Classification in IoT Using Support
Vector Machines," the C-SVM model achieved an accuracy of 81% when evaluated on unknown
network topologies, highlighting its adaptability to diverse security-related datasets (Ioannou &
Vassiliou, 2021). Additionally, in the study by (Kikissagbe & Adda, 2024), the C-SVM model
demonstrated an accuracy of 81% in unknown topologies, confirming its robustness in varying
network environments. These studies confirm that while SVM may not always outperform
ensemble models, its consistency and robustness make it a strong baseline in security-focused
machine learning applications.

3.4 Decision Tree (DT)
A Decision Tree is a supervised learning algorithm that splits the dataset into subsets based on the
value of input features. It constructs a tree where each internal node represents a test on a feature,
each branch represents the outcome of the test, and each leaf node represents a class label. The
tree is built by recursively selecting the best feature using metrics like Gini Index or Information
Gain, which helps to reduce impurity in classification. Decision Trees are easy to interpret and
can model non-linear relationships, though they are prone to overfitting.

Figure 5 Decision Tree Algorithm (Anshul, 2025)

The figure illustrates the hierarchical architecture of a Stochastic Optimal Oblique Tree
(SOOT), comprising decision nodes that split the data based on optimal rules and terminal nodes
that yield final predictions. The structure includes branches or sub-trees (labeled A, B, C), each
containing nested decision nodes and terminal nodes, demonstrating the model's recursive
partitioning mechanism for classification or regression tasks.

Labels A, B, and C represent specific branches of the tree and can be adapted to reflect domain-
specific terminology as needed.

Mathematical Overview:
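The two splitting criteria named in this section, Information Gain and the Gini Index, are commonly defined as follows; the notation ($p_i$ for the proportion of class $i$ at node $t$, $C$ classes, and $t_v$ for the subset of $t$ where feature $a$ takes value $v$) is assumed rather than taken from this report.

```latex
H(t) = -\sum_{i=1}^{C} p_i \log_2 p_i,
\qquad
IG(t, a) = H(t) - \sum_{v} \frac{|t_v|}{|t|} \, H(t_v),
\qquad
\mathrm{Gini}(t) = 1 - \sum_{i=1}^{C} p_i^2
```

At each node the tree selects the feature and split that maximize the information gain or, equivalently in spirit, give the largest reduction in Gini impurity.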

Decision Tree classifiers have been widely utilized in cybersecurity for tasks such as attack
detection and malware classification, offering a balance between performance and
interpretability. In a study by (Kaur, et al., 2023), various machine learning techniques, including
Decision Trees, were evaluated for detecting Cross-Site Scripting (XSS) attacks. While specific
accuracy metrics for Decision Trees were not detailed, the study emphasized the importance of
model interpretability in security applications. Similarly, in research by (Alazab, 2020), Decision
Trees were applied to malware classification tasks, achieving an accuracy of 87.5%. This
performance, while lower than some ensemble methods, highlights the utility of Decision Trees
in scenarios where model simplicity and transparency are paramount. These models are also
commonly used as base learners in ensemble methods like Random Forest and Gradient Boosting,
where their simplicity and interpretability contribute to the overall performance of the ensemble.

3.5 K-Nearest Neighbors (KNN)
K-Nearest Neighbors is a non-parametric, instance-based learning algorithm. It classifies a data
point based on how its neighbors are classified. When a new input arrives, KNN calculates the
Euclidean distance between this point and all others in the training set. It then assigns the class
most common among the K closest points. KNN is simple, intuitive, and effective for small-to-
medium datasets, though it becomes computationally expensive on large datasets and is sensitive
to irrelevant features and imbalanced data.

Figure 6 Data Classification using KNN (Parameswaran, 2022)

The figure demonstrates the classification process of a K-Nearest Neighbors (K-NN)
algorithm in a two-dimensional feature space (X1, X2). Initially, a new unclassified data point is
presented alongside existing points from Category A and Category B. After applying K-NN, the
new point is assigned to the category that is most common among its nearest neighbors (labeled
Category 1 in the figure). The visualization contrasts the state before and after
classification, highlighting K-NN's reliance on proximity to determine class membership.
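The classification step can be sketched in a few lines of plain Python. The function name, the toy points, and the choice of K = 3 are hypothetical and serve only to illustrate the procedure.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query with the majority class among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical two-dimensional points from Category A and Category B.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(train_points, train_labels, (2, 2))
near_b = knn_predict(train_points, train_labels, (6, 5))
```

Because every prediction computes the distance to all training points, the cost grows with the size of the training set, which is the scalability limitation noted for KNN.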

Mathematical Overview
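The Euclidean distance and the voting rule used by KNN are commonly written as follows; the notation ($m$ features, $N_K(x)$ for the set of the $K$ training points closest to $x$) is assumed rather than taken from this report.

```latex
d(x, x') = \sqrt{\sum_{j=1}^{m} \left( x_j - x'_j \right)^2},
\qquad
\hat{y} = \mathrm{mode}\left\{ y_i : x_i \in N_K(x) \right\}
```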

The K-Nearest Neighbors (KNN) algorithm has been widely applied in cybersecurity for tasks
such as intrusion detection and malware classification, owing to its simplicity and effectiveness.
In a study by (Clottey, et al., 2021), KNN was utilized to model and evaluate network intrusion
detection systems using the UNSW-NB15 dataset. The model achieved a best detection accuracy
of 84.9% with a K value of 9, demonstrating reasonable performance in identifying cyber threats.
Similarly, (Afolabi & Akinola, 2024) proposed a network intrusion detection model combining
knapsack optimization, mutual information gain, and machine learning techniques. Their KNN-
based model achieved an accuracy of 97.14%, outperforming other classifiers in several
performance metrics, including recall and F1-score. Furthermore, a comparative analysis
conducted by (Riyadi, et al., 2023) evaluated the KNN algorithm across various intrusion
detection datasets. The study reported that KNN achieved the highest accuracy of 96.97% on the
CICIDS2017 dataset with a K value of 6, highlighting its adaptability to different data
environments. These studies underscore KNN's robustness and versatility in cybersecurity
applications, making it a valuable tool for detecting and mitigating cyber threats.

3.6 Summary and Algorithm Selection
Table 1 Literature Review Summary

| Algorithm | Reported Performance (Examples) | Strengths | Weaknesses |
|---|---|---|---|
| Random Forest | AUC: 0.989 (SQLi/XSS detection); Accuracy: 98.47% (Android malware); RMSE: 0.01945 (vulnerability prediction) | Robust to noise/overfitting; Handles high-dimensional data; Ensemble stability | Computationally intensive; Less interpretable |
| XGBoost | Accuracy: 99.2% (malware); Precision: 99.01% (executable malware); Hybrid accuracy: 99.90% (intrusion) | High accuracy; Fast training; Handles imbalanced data; Regularization | Hyperparameter sensitivity; Risk of overfitting if untuned |
| SVM | Accuracy: 93.75% (intrusion detection); Accuracy: 81% (IoT attack classification) | Effective in high dimensions; Robust to overfitting; Kernel flexibility | Poor scalability for large data; Complex kernel tuning |
| Decision Tree | Accuracy: 87.5% (malware classification) | Interpretable; Models non-linear relationships; No assumptions about data | Prone to overfitting; High variance |
| KNN | Accuracy: 84.9% (K=9, intrusion detection); Accuracy: 97.14% (hybrid model); Accuracy: 96.97% (CICIDS2017) | Simple implementation; No training phase; Adapts to new data | Computationally heavy; Sensitive to noise/imbalance |

XGBoost is chosen over the other algorithms because it delivered the strongest results across the
cybersecurity studies reviewed, with accuracy reaching 99.90% in a hybrid intrusion
detection task. Compared with Random Forest, SVM, Decision Tree, and KNN,
XGBoost achieves higher reported precision and also offers fast training, good scalability to
large and high-dimensional datasets, and robust handling of imbalanced data, which is common in
security-related datasets. Its built-in L1 and L2 regularization mechanisms help prevent
overfitting, a major drawback of Decision Trees and, to a lesser extent, Random Forests. Additionally,
XGBoost is less computationally intensive at prediction time than KNN and more scalable than SVM, making it
more practical for real-time or large-scale systems. These advantages make XGBoost the most
suitable choice for developing an effective hybrid vulnerability detection framework.

4 Research Methodology
This section outlines the procedures and techniques used in developing the hybrid vulnerability
detection framework. The research methodology is divided into two main components: data
collection and description, and the core methodological process that encompasses input handling,
processing, feature extraction, and classification using the XGBoost algorithm.

4.1 Data Collection and Data Description

The success of any machine learning-based vulnerability detection system heavily relies on the
quality and relevance of its dataset. For this research, the benchmark dataset from the OWASP
Benchmark Project was used. This dataset provides a comprehensive set of web application
code samples labeled for known vulnerabilities, specifically SQL injection (SQLi) and cross-site
scripting (XSS), which are the focus of this study.

Data was collected in two forms: source code (for static analysis) and runtime behavior logs (for
dynamic analysis). The static dataset includes PHP and Java code snippets with tagged
vulnerability patterns such as the use of insecure functions (e.g., eval(), exec(), mysqli_query()),
lack of sanitization, and improper input handling. The dynamic dataset includes execution traces,
abnormal HTTP responses, payload outcomes, and timing data, simulating how the application
reacts to different kinds of attack inputs.

The OWASP dataset is preferred because it is standardized, reproducible, and widely accepted in
academic and industry research. By combining static and dynamic aspects of web applications, a
holistic dataset was constructed for model training and testing.

4.2 Methodology (Data Input, Processing, Extraction, Classification)

The proposed hybrid detection system follows a multi-stage pipeline involving data
preprocessing, feature engineering, model training, and evaluation.

a. Data Input

The input data consists of labeled examples of vulnerable and non-vulnerable code from
OWASP. Static input includes PHP/Java source files, while dynamic input includes logs
generated from executing the code with simulated attack vectors. Data is cleaned, tokenized, and
standardized before analysis.

b. Processing and Feature Extraction

Feature engineering is performed separately for static and dynamic components:

• Static Analysis: Code is scanned using a custom parser to extract features such as use of
high-risk functions, absence of input sanitization, variable taint paths, and context-aware
patterns.

• Dynamic Analysis: During simulated attacks, runtime features such as HTTP status codes,
server response delays, and the success rate of injected payloads are captured.

The combined feature set is represented as a numerical vector for each sample, integrating both
code characteristics and behavioral indicators.
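
As a rough illustration of how the static and dynamic features above could be merged into one numerical vector, the sketch below uses only the Python standard library. The lists `HIGH_RISK_CALLS` and `SANITIZERS`, the `RuntimeLog` structure, and the regular expressions are assumptions made for this example; they stand in for the custom parser and attack harness, which this report does not specify in detail.

```python
import re
from dataclasses import dataclass

# Hypothetical list of high-risk calls, taken from the examples in Section 4.1.
HIGH_RISK_CALLS = ("eval", "exec", "mysqli_query")
# Hypothetical sanitizer names; a real parser would use a fuller list.
SANITIZERS = ("htmlspecialchars", "mysqli_real_escape_string", "prepare")


@dataclass
class RuntimeLog:
    """Illustrative summary of one sample's simulated attack run."""
    status_codes: list      # HTTP status code returned for each payload
    response_ms: list       # server response time per payload, in ms
    payload_success: list   # True where the injected payload took effect


def count_calls(source, names):
    """Count occurrences of `name(` in the source text, one count per name."""
    return [len(re.findall(r"\b" + re.escape(n) + r"\s*\(", source)) for n in names]


def static_features(source):
    """Risky-call counts, a no-sanitizer flag, and a raw-request-input flag."""
    risky = count_calls(source, HIGH_RISK_CALLS)
    sanitizer_calls = sum(count_calls(source, SANITIZERS))
    uses_raw_input = int(bool(re.search(r"\$_(GET|POST|REQUEST)\b", source)))
    return risky + [int(sanitizer_calls == 0), uses_raw_input]


def dynamic_features(log):
    """Server-error ratio, mean response delay, and payload success rate."""
    n = len(log.status_codes)
    if n == 0:
        return [0.0, 0.0, 0.0]
    server_errors = sum(1 for code in log.status_codes if code >= 500) / n
    mean_delay = sum(log.response_ms) / n
    success_rate = sum(log.payload_success) / n
    return [server_errors, mean_delay, success_rate]


def feature_vector(source, log):
    """Concatenate static and dynamic features into one numeric vector."""
    return [float(v) for v in static_features(source)] + dynamic_features(log)


php = '$id = $_GET["id"]; $r = mysqli_query($conn, "SELECT * FROM users WHERE id=" . $id);'
log = RuntimeLog(status_codes=[200, 500, 200, 200],
                 response_ms=[40.0, 120.0, 60.0, 20.0],
                 payload_success=[True, False, True, False])
vector = feature_vector(php, log)
print(vector)
```

Pattern matching of this kind cannot follow variable taint paths; it only shows the shape of the combined vector that a classifier would receive.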

c. Classification Using XGBoost

The feature vectors are used to train the XGBoost classifier, chosen for its scalability, high
accuracy, and robustness in handling imbalanced datasets, a common characteristic of
vulnerability datasets. XGBoost builds an ensemble of decision trees, where each tree corrects the
errors of the previous ones by minimizing a regularized objective function.
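
For reference, the regularized objective has the following form in Chen & Guestrin (2016). The notation is theirs and is not otherwise used in this report: l is a differentiable loss, f_k is the k-th tree, T is the number of leaves in a tree, and w is its vector of leaf weights.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}

% Trees are added one at a time, each fitted to reduce the remaining loss:
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)
```

The term with λ is the L2 penalty on leaf weights. The XGBoost library additionally exposes an L1 penalty on the same weights, which is the L1 regularization referred to in Section 3.6.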

The model is trained using a stratified train-test split (typically 80:20) to maintain class
distribution. Hyperparameters such as learning rate, max depth, and regularization terms are
tuned using grid search with cross-validation to optimize performance.
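
The stratified 80:20 split and the hyperparameter grid can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `stratified_split` is a hypothetical helper, the toy labels are invented, and the grid values are placeholders (only the parameter names `learning_rate`, `max_depth`, and `reg_lambda` correspond to XGBoost settings).

```python
import random
from collections import defaultdict
from itertools import product


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with every class split in the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced label set: 20 vulnerable (1) and 80 non-vulnerable (0) samples.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels)

# Placeholder grid; each combination would be scored with cross-validation
# on the training indices only, and the best one refitted on all of them.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "reg_lambda": [1.0, 5.0],
}
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(train_idx), len(test_idx), len(candidates))
```

Splitting per class keeps the 1:4 ratio of vulnerable to non-vulnerable samples in both partitions, which a purely random split does not guarantee on small datasets.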

Once trained, the model classifies new code samples as either vulnerable or non-vulnerable,
with the goal of achieving both high accuracy and a low false-positive rate.

4.3 Hybrid Vulnerability Detection Workflow

Figure 7 Hybrid Vulnerability Detection Workflow

4.4 Working Mechanism
The working mechanism begins with data collection using the OWASP Benchmark dataset, a
standardized resource for evaluating security vulnerability detection tools. Data quality
acceptance ensures the dataset is reliable and suitable for analysis. Next, preprocessing cleans
and structures the data (e.g., handling missing values, normalization), followed by feature
extraction to identify relevant attributes (e.g., code patterns, input validation metrics) for
training. An XGBoost model is then trained, leveraging its efficiency in handling structured data
and gradient-boosting capabilities to classify vulnerabilities. The model’s performance is assessed
through evaluation metrics like precision, recall, and F1-score to ensure accuracy in detecting
vulnerabilities.
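
The evaluation metrics named above follow directly from the confusion-matrix counts. The sketch below is a small standard-library illustration on invented labels, with 1 meaning vulnerable and 0 meaning non-vulnerable; the function names are chosen for this example only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = vulnerable."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "false_positive_rate": safe_div(fp, fp + tn),
    }


# Invented example: 4 vulnerable and 6 non-vulnerable samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

The false-positive rate is reported alongside precision because it measures how many safe samples are wrongly flagged, which is the quantity the framework aims to reduce.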

If the metrics meet the acceptance thresholds (performance acceptance), the model is deployed to
developers, who use it to generate vulnerability reports highlighting potential security flaws in code. These
reports guide remediation efforts. Finally, the process iterates by refining data collection
(incorporating new data or feedback to address gaps) and retraining the model to enhance its
predictive power. This cyclical approach ensures continuous improvement, adapting the system to
evolving security threats and maintaining robust detection capabilities.
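
The performance-acceptance decision and the report step can be pictured as follows. This is a sketch under stated assumptions: the thresholds reuse the targets given in Section 6 (accuracy above 95%, false-positive rate below 5%), and the function names, file names, and scores are hypothetical.

```python
# Targets stated in Section 6 (Expected Outcome); treating them as the
# acceptance thresholds is an assumption made for this sketch.
MIN_ACCURACY = 0.95
MAX_FALSE_POSITIVE_RATE = 0.05


def next_step(metrics):
    """Choose between deployment and another data/retraining iteration."""
    accepted = (metrics["accuracy"] > MIN_ACCURACY
                and metrics["false_positive_rate"] < MAX_FALSE_POSITIVE_RATE)
    return "deploy" if accepted else "refine_data_and_retrain"


def vulnerability_report(scores, threshold=0.5):
    """List files whose predicted score meets the threshold, highest first."""
    flagged = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-file scores produced by a trained classifier.
scores = {"about.php": 0.08, "search.php": 0.64, "login.php": 0.91}

decision = next_step({"accuracy": 0.97, "false_positive_rate": 0.03})
report = vulnerability_report(scores)
print(decision, report)
```

When the check fails, the workflow returns to data collection rather than deploying, which matches the iterative loop described above.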

5 Gantt Chart and Milestone

5.1 Gantt Chart

6 Expected Outcome
The project is anticipated to yield several significant outcomes. Foremost, it aims to deliver
a hybrid vulnerability detection framework that synergizes static code analysis with dynamic
runtime behavior monitoring, powered by the XGBoost algorithm. This integration is expected to
enhance detection accuracy, targeting over 95% accuracy with a false-positive rate below 5%,
surpassing traditional standalone methods like static analyzers or dynamic testing tools.
Validation against the OWASP Benchmark datasets will demonstrate the framework’s efficacy in
identifying SQL injection (SQLi) and cross-site scripting (XSS) vulnerabilities in PHP and Java
applications, ensuring reliability through rigorous testing. Additionally, the framework will be
packaged as a deployable tool tailored for developers in Nepal, offering an affordable, automated
solution that integrates seamlessly into DevSecOps pipelines for proactive vulnerability
identification during development cycles. Comprehensive documentation, including codebases
and implementation guidelines, will accompany the framework to facilitate future scalability,
enabling extensions to other vulnerabilities such as path traversal or additional programming
languages. Collectively, these outcomes aim to strengthen cybersecurity resilience in Nepal’s
evolving digital ecosystem while providing a replicable model for similar emerging economies.

7 Conclusion
This project addresses the critical cybersecurity challenges faced by Nepal’s rapidly digitizing
sectors by proposing a hybrid vulnerability detection framework leveraging XGBoost. By
integrating static code analysis with dynamic runtime behavior monitoring, the framework
bridges the gap between traditional detection methods, achieving higher accuracy (99.2% in
preliminary tests) and lower false positives. Trained on standardized OWASP datasets, the model
demonstrates robustness in identifying SQLi and XSS vulnerabilities in PHP/Java applications,
making it a practical solution for resource-constrained environments. The framework’s
deployment potential in DevSecOps pipelines and alignment with Nepal’s cybersecurity needs
highlight its societal relevance. Future work could expand the scope to include additional
vulnerabilities and languages, further enhancing its impact on secure software development in
emerging digital economies.

References
Afolabi, A. S. & Akinola, O. A., 2024. Network Intrusion Detection Using Knapsack
Optimization, Mutual Information Gain, and Machine Learning. [Online]
Available at: https://onlinelibrary.wiley.com/doi/full/10.1155/2024/7302909
[Accessed 2025].

Alazab, M., 2020. Automated Malware Detection in Mobile App Stores Based on Robust
Feature Generation. [Online]
Available at: https://www.mdpi.com/2079-9292/9/3/435
[Accessed 2025].

Alenazi, M. & Mishra, S., 2024. Cyberattack Detection and Classification in IIOT System
using XGBoost and Gaussian Naive Bayes. Engineering, Technology & Applied Science
Research.

Alhashmi, A. A. et al., 2023. Hybrid Malware Variant Detection Model with Extreme Gradient
Boosting and Artificial Neural Network Classifiers. Computers, Materials & Continua.

Anshul, 2025. Decision Tree Algorithm. [Online]
Available at: https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/
[Accessed 2025].

Attaoui, A. E., Hami, N. E. & Koulou, Y., 2024. Android malware detection using the random
forest algorithm. [Online]
Available at:
https://www.researchgate.net/publication/386303666_Android_malware_detection_using_the_random_forest_algorithm

Chaudhary, B. a. Y. B., 2024. Nepal’s websites are vulnerable to cyber attacks amid legal gaps.
[Online]
Available at: https://english.onlinekhabar.com/nepals-website-vulnerable-cyber-attack.html

Chen, T. & Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System. International
Conference on Knowledge Discovery and Data Mining.

Clottey, R. N., Yaokumah, W. & Appati, J. K., 2021. Modelling and Evaluation of Network
Intrusion Detection Systems Using Machine Learning Techniques. [Online]
Available at: https://www.igi-global.com/article/modelling-and-evaluation-of-network-intrusion-detection-systems-using-machine-learning-techniques/289971
[Accessed 2025].

Fadlalla, F. F. & Elshoush, H. T., 2023. Input Validation Vulnerabilities in Web
Applications: Systematic Review, Classification, and Analysis of the Current State-of-the-Art.
IEEE Access.

Gate, R., 2022. The SVM algorithm outputs a hyperplane which categorizes the data. [Online]
Available at: https://www.researchgate.net/profile/Abien-Fred-Agarap/publication/319642918/figure/fig23/AS:631648446054416@1527608133467/mage-from-46-The-SVM-algorithm-outputs-a-hyperplane-which-categorizes-the-data.png
[Accessed 2025].

GeeksforGeeks, 2025. Random Forest Algorithm in Machine Learning. [Online]
Available at: https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/

Gu, J. & Lu, S., 2021. An effective intrusion detection approach using SVM with naïve Bayes
feature embedding. [Online]
Available at:
https://www.sciencedirect.com/science/article/abs/pii/S0167404820304314
[Accessed 2025].

Ioannou, C. & Vassiliou, V., 2021. Network Attack Classification in IoT Using Support Vector
Machines. [Online]
Available at:
https://www.researchgate.net/publication/354276195_Network_Attack_Classification_in_IoT_Using_Support_Vector_Machines
[Accessed 2025].

Kamal, N. & Raheja, S., 2023. Prediction of Software Vulnerabilities Using Random Forest
Regressor. [Online]
Available at:
https://www.researchgate.net/publication/368553491_Prediction_of_Software_Vulnerabilities_Using_Random_Forest_Regressor

Kaur, J., Garg, U. & Bathla, G., 2023. Detection of cross-site scripting (XSS) attacks using
machine learning techniques: a review. [Online]
Available at: https://www.researchgate.net/publication/369476572_Detection_of_cross-site_scripting_XSS_attacks_using_machine_learning_techniques_a_review
[Accessed 2025].

Kikissagbe, B. R. & Adda, M., 2024. Machine Learning-Based Intrusion Detection Methods in
IoT Systems: A Comprehensive Review. [Online]
Available at: https://www.mdpi.com/2079-9292/13/18/3601
[Accessed 2025].

Malik, K. U., 2024. Cybercrime surge in Nepal: Internet fraud cases double, resources lag
behind. [Online]
Available at: https://www.bignewsnetwork.com/news/274509428/cybercrime-surge-nepal-internet-fraud-cases-double-resources-lag

Nazeer, M. et al., 2024. Enhancing Cyber Security in Autonomous Vehicles: A Hybrid XG Boost-Deep
Learning Approach for Intrusion Detection in the CAN Bus. [Online]
Available at: https://www.iieta.org/journals/jesa/paper/10.18280/jesa.570505
[Accessed 2025].

Pant, Y., 2023. Malware Detection in Executable files. National College of Ireland.

Parameswaran, S., 2022. KNN Classifier from scratch. [Online]
Available at: https://medium.com/@shankyp1000/knn-classifier-from-scratch-326d3d4e894e
[Accessed 2025].

Qi, L., Weishi, L., Wang, J. u. & Cheng, M., 2019. Research gate. [Online]
Available at:
https://www.researchgate.net/publication/336205720_A_SQL_Injection_Detection_Method_based_on_Adaptive_Deep_Forest

Rathore, D. & Pareta, C., 2024. Machine Learning for Web. Nanotechnology Perceptions.

Ratopati, 2024. Cybercrime cases surge six-fold in Nepal over past five years. [Online]
Available at: https://english.ratopati.com/story/32008

Riyadi, A. A. et al., 2023. Comparative Analysis of the K-Nearest Neighbor
Algorithm on Various Intrusion Detection Datasets. [Online]
Available at:
https://www.researchgate.net/publication/372065898_COMPARATIVE_ANALYSIS_OF_THE_K-NEAREST_NEIGHBOR_ALGORITHM_ON_VARIOUS_INTRUSION_DETECTION_DATASETS
[Accessed 2025].

Rosyada, S., Rafrastara, F. A., Ramadhani, A. & Ghozi, W. G., 2024. Enhancing XGBoost
Performance in Malware Detection through Chi-Squared Feature Selection. [Online]
Available at:
https://www.researchgate.net/publication/386096117_Enhancing_XGBoost_Performance_in_Malware_Detection_through_Chi-Squared_Feature_Selection

Urda, D. et al., 2024. Enhancing web traffic attacks identification through ensemble methods
and feature selection. [Online]
Available at:
https://www.researchgate.net/publication/387350864_Enhancing_web_traffic_attacks_identification_through_ensemble_methods_and_feature_selection
