Major Project Proposal
INSTITUTE OF ENGINEERING
PASHCHIMANCHAL CAMPUS
Lamachaur-16, Pokhara
A Proposal On
“Sentinel: Proactive and Vigilant AI-based Web Security Solution”
Submitted by:
Aarati Mahato [PAS077BCT001]
Ashim Karki [PAS077BCT012]
Niraj Neupane [PAS077BCT023]
Prayash Mishra [PAS077BCT027]
Submitted to:
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
June, 2023
LIST OF ABBREVIATIONS
AI Artificial Intelligence
CNN Convolutional Neural Network
DNN Deep Neural Network
LSTM Long Short-Term Memory
ML Machine Learning
RF Random Forest
RNN Recurrent Neural Network
WAF Web Application Firewall
XSS Cross-Site Scripting
CHAPTER 1: INTRODUCTION
1.1 BACKGROUND
In today’s age, web applications have become an integral part of society. Most services are now provided online, and they are becoming the backbone of the modern economy. That backbone, however, must be robust and resilient against security flaws and web vulnerabilities. Firewalls are used to provide this security and safeguard web applications from different types of web attacks.
A firewall is a security system that monitors and controls network traffic based on a set of security rules. The firewall matches incoming traffic against the rule set defined in its table; once a rule matches, the associated action is applied to the traffic. There are different types of firewalls, and a Web Application Firewall (WAF) is one of them. A WAF is intended to protect the server from different types of attacks by malicious actors, and it is also a type of reverse proxy.
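The rule-matching behavior described above can be sketched as follows; the patterns and actions are simplified, hypothetical stand-ins for real WAF policies:

```python
# Minimal sketch of classical rule-based firewall matching (illustrative only;
# the rule patterns and actions are hypothetical, not real WAF policies).
import re

# Each rule pairs a regex pattern with the action applied when it matches.
RULES = [
    (re.compile(r"(?i)union\s+select"), "drop"),   # naive SQL injection signature
    (re.compile(r"(?i)<script"), "drop"),          # naive XSS signature
]

def apply_rules(request_body: str) -> str:
    """Return the action for the first matching rule, or 'forward' if none match."""
    for pattern, action in RULES:
        if pattern.search(request_body):
            return action
    return "forward"
```

Because every rule is hardcoded, a request that evades all listed patterns is forwarded untouched; this is the limitation our learned model aims to address.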
A reverse proxy is an intermediate server that sits in front of the main backend server and ensures that no client ever communicates directly with it. The reverse proxy accepts requests from clients and forwards them to the appropriate server, then receives the responses and sends them back to the clients. This layer of abstraction between clients and origin servers helps optimize traffic routing and improves performance and security.
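A minimal sketch of this forwarding behavior, using only the Python standard library (a real deployment would use a production server such as Nginx, not this script; the backend response text is a placeholder):

```python
# Reverse-proxy sketch: the client talks only to the proxy, which relays
# requests to a backend whose address is never exposed to the client.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Backend(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello from backend"          # placeholder origin response
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):             # silence request logging
        pass

class Proxy(BaseHTTPRequestHandler):
    backend_url = ""                          # set once the backend port is known
    def do_GET(self):
        # Forward the client's request to the backend and relay the response.
        with urllib.request.urlopen(self.backend_url + self.path) as resp:
            body = resp.read()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass

def start(handler_cls):
    srv = HTTPServer(("127.0.0.1", 0), handler_cls)  # port 0 = pick a free port
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv

backend = start(Backend)
Proxy.backend_url = f"http://127.0.0.1:{backend.server_port}"
proxy = start(Proxy)

# The client only ever sees the proxy's address.
with urllib.request.urlopen(f"http://127.0.0.1:{proxy.server_port}/") as resp:
    answer = resp.read()
```

In our design, the firewall model would sit inside the proxy's forwarding step, inspecting each request before it reaches the backend.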
A WAF operates through a list of rules called policies, which aim to filter out malicious requests from clients while forwarding legitimate ones to the main server. We intend instead to train an ML model to determine by itself whether a request contains malicious content, without relying on hardcoded policies. By doing so, our system can detect different attacks more reliably and accurately and prevent them from penetrating the web application.
Due to the popularity of online services, cyber attacks have been on the rise. In the first quarter of 2024 alone, the average number of attacks per organization increased by 28% compared to the previous quarter. These attacks not only jeopardize the security and operations of organizations but also compromise the privacy and safety of users, potentially leading to the leakage or deletion of their private data. Products already on the market use hardcoded sets of policies and rules to protect web applications, but only a handful of solutions leverage advanced technologies such as machine learning and artificial intelligence to dynamically adapt to evolving threats. The field of AI is also booming, and it will significantly impact every IT sector, including the cybersecurity domain. Malicious actors may likewise use these technologies to perform phishing attacks, password cracking, social engineering with AI-generated audio and video, and more. We therefore need something more advanced and intelligent to distinguish all these different attacks from normal, clean traffic. For this purpose, Sentinel, our AI-based WAF, will be very handy.
1.2 OBJECTIVES
To develop a machine learning model that detects and blocks malicious web requests, and to implement it as a Web Application Firewall (WAF) without relying on classical predefined rules.
To present web attack logs in a graphical and user-friendly manner.
1.3 FEASIBILITY STUDY
1.3.1 Economic Feasibility
The economic feasibility of any product depends on various factors such as initial investment, operating costs, client adoption, and the ability to attract and sustain partnerships with multiple companies. A detailed financial analysis, considering revenue streams and potential cost savings, would provide a more accurate assessment of the project’s economic viability.
For our project, most of the development tools and technologies we will be using are open source. The reverse proxy server and the firewall model are the backbone of the project. The reverse proxy can be implemented with an open-source web server, and the model can be built using popular tools available for free. The dataset required for training the model can be generated in a way that replicates real-world data, or it can be taken from sources such as Kaggle. The model can also be deployed on a cloud platform for free. To sum up, our project is economically feasible.
1.3.2 Technical Feasibility
As stated before, the tools and resources required during the development of the project are already available. Knowledge of machine learning and of implementing suitable algorithms is required for developing the model; it can be learnt through the many free courses available on YouTube, Coursera, Udemy, and so on. As our project is related to the cybersecurity domain, our team members have experience working in this field, which will significantly help us build a better product. Therefore, there will not be any major technical difficulties in our project.
1.3.3 Operational Feasibility
Our project will be designed and developed using the latest technology available to avoid issues that might be encountered while integrating our services with multiple clients. There are hundreds of thousands of web products that need to be secured against different types of web attacks, and the project we are proposing will help those platforms defend themselves from malicious attack vectors while also making them aware of vulnerabilities that exist in their systems. For the operational part, our services must be properly configured and integrated on the client side. To make this process easy, we will design the project so that our services are modular and customizable according to each client’s requirements.
CHAPTER 2: LITERATURE REVIEW
In recent years, artificial intelligence (AI) has greatly impacted how we protect web
application systems and networks from cyber threats. The most common way AI is used in
cybersecurity is through machine learning (ML). Since it's hard to predict exactly when,
where, or how cyber-attacks will happen, and we can't completely prevent them, early
detection is crucial to minimize damage. Over the past decade, many ML techniques have
been developed to detect web attacks and improve web security. Various methods have been
used to identify threats to web servers, and as cyber systems evolve, adaptability becomes
essential.
Paper [1] applied ML models such as RF, DNN, and CNN, which proved highly effective in detecting various client-based web attacks. The high accuracy rates achieved by these
models demonstrate their potential to significantly enhance web security measures. It
suggests future research should focus on optimizing these models for real-time detection and
exploring their scalability in diverse cybersecurity applications. Additionally, combining
multiple AI techniques and incorporating advanced data preprocessing methods may further
improve detection accuracy and robustness.
Paper [2] highlights the differences between models trained with traditional machine
learning approaches and those utilizing deep learning techniques for detecting cyber-attacks.
It emphasizes the effectiveness of deep learning methods, which can automatically learn and
extract features from large datasets without human intervention. These deep learning
techniques, such as Convolutional Neural Networks (CNN), Recurrent Neural Networks
(RNN), and Deep Neural Networks (DNN), exhibit superior performance and robustness in
handling complex, multidimensional data compared to traditional machine learning
approaches. This is particularly evident in the detection of various types of cyber-attacks, including DoS, DDoS, MitM, password attacks, SQL injection, XSS, eavesdropping, and phishing.
Paper [3] proposes a web application firewall (WAF) leveraging a Convolutional
Neural Network (CNN) to analyze web traffic features and classify them as normal or
malicious. The model demonstrates effectiveness in detecting various web attacks, including
SQL injection and Cross-Site Scripting (XSS), highlighting the potential of deep learning
for web security.
Paper [4] presents a distributed deep learning system for real-time web attack
detection on edge devices. The system utilizes a centrally trained deep learning model
deployed on edge devices, which reduces latency and improves scalability for large
deployments. This approach showcases the potential of deep learning for scalable web
security solutions.
Paper [5] provides a broader context for AI in web security by reviewing existing
cyberattack detection models for web-based systems. It highlights various attack detection
techniques beyond deep learning, emphasizing the importance of evaluating models based
on factors like accuracy, efficiency, and adaptability. This broader perspective is crucial for
considering the full landscape of AI-based solutions.
Paper [6] introduces a system that utilizes deep learning for anomaly detection
and diagnosis from system logs. This work demonstrates the potential of deep learning for
broader security applications by analyzing system logs for anomalous behavior, suggesting
possible applications for AI-based firewalls to analyze network traffic logs for suspicious
activity.
Paper [7] presents the effect of different parameters on the performance of LSTM and Word2Vec for text classification. The text classification accuracies obtained by the proposed methodology on the evaluated datasets are 95.78%, 94.93%, 94.87%, 94.88%, 91.79%, 93.04%, and 91.98%, respectively. Six different experiments show that a batch size of 100, 50 epochs, the Adagrad optimizer, 5 hidden nodes, a word-vector length of 100, 2 LSTM layers, L2 regularization of 0.001, and a learning rate of 0.001 give better accuracy.
CHAPTER 3: METHODOLOGY
The diagram above shows the high-level system architecture of how our firewall will be embedded into existing systems. A reverse proxy server, configured by the service provider, is responsible for passing each incoming client request to the firewall model. The firewall model analyzes the request and detects whether it is malicious. If it is, the request is further analyzed to determine which type of web attack it belongs to, and it is logged so that statistics can be shown in the dashboard. A response from the model is sent back to the reverse proxy server; on the basis of that response, the reverse proxy itself determines the type of request and forwards it to the main backend server only if the request is clean and normal. Otherwise, the request is dropped.
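The proxy-side decision flow can be sketched as follows; `classify` is a hypothetical stand-in for the trained model, and the attack label is a placeholder:

```python
# Sketch of the proxy-side decision flow. The classifier below is a trivial
# hypothetical stand-in for the trained model described in this chapter.
attack_log = []  # logged entries feed the statistics dashboard

def classify(request_body: str) -> str:
    """Placeholder classifier: returns 'normal' or an attack-type label."""
    return "sqli" if "union select" in request_body.lower() else "normal"

def handle_request(request_body: str) -> str:
    """Forward clean requests to the backend; log and drop malicious ones."""
    label = classify(request_body)
    if label == "normal":
        return "forwarded"
    attack_log.append({"label": label, "request": request_body})
    return "dropped"
```

In the real system the same decision is made per request, with the model's verdict returned to the reverse proxy before forwarding.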
There will be various tools used during the development of the machine learning model
which is the core of our product. They can be listed as:
Jupyter Notebook
Numpy
Pandas
Matplotlib
Seaborn
Google Colab
PyTorch
TensorFlow
Keras
Scikit-learn
Also, there are multiple steps that will be involved during the development phase.

3.3 Data Preprocessing
Data preprocessing is performed to reduce the noise of the data. The nature of our data demands the following steps to be performed during data preprocessing.
3.3.1 Data Cleaning
In this step, missing or duplicate data are fixed or removed from the dataset. Since our data is text based, fixing it might be quite challenging, so the best solution is to remove the corrupted data after identifying it.
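A minimal sketch of this cleaning step, using plain Python on a hypothetical list of raw request strings:

```python
# Data-cleaning sketch: drop missing/blank entries and exact duplicates
# while preserving order (the sample requests are hypothetical).
def clean(requests):
    seen = set()
    cleaned = []
    for req in requests:
        if not req or not req.strip():  # drop missing or blank entries
            continue
        if req in seen:                 # drop exact duplicates
            continue
        seen.add(req)
        cleaned.append(req)
    return cleaned
```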
3.3.2 Decoding and Feature Extraction
Each web request will be processed by a decoder, which parses the request to extract relevant features such as headers, payload, and parameters. This step is crucial for transforming raw web traffic data into a structured format that the LSTM model can process.
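The decoder step can be sketched with the standard library; the raw request below is a hypothetical sample:

```python
# Decoder sketch: split a raw HTTP request into structured fields
# (method, path, parameters, headers, body). Sample request is hypothetical.
from urllib.parse import urlsplit, parse_qs

def decode_request(raw: str) -> dict:
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ")
    headers = dict(line.split(": ", 1) for line in lines[1:] if line)
    url = urlsplit(target)
    return {
        "method": method,
        "path": url.path,
        "params": parse_qs(url.query),
        "headers": headers,
        "body": body,
    }

raw = ("GET /search?q=books HTTP/1.1\r\n"
       "Host: example.com\r\n"
       "\r\n")
features = decode_request(raw)
```

The resulting dictionary is the structured representation that later feature-selection and transformation steps operate on.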
3.3.3 Feature Selection
Feature selection is critical to identify the most relevant features from the decoded web requests that will help the model distinguish between normal and malicious traffic. Key features might include the request parameters, headers, and payload extracted by the decoder. Selecting the right features reduces the complexity of the model and improves its performance by focusing on the most informative parts of the data.
3.3.4 Data Transforming
Once the features are selected, the next step is to normalize and scale the data. Normalization and scaling are important to ensure that all features contribute equally to the model's learning process. By normalizing and scaling the data, we ensure that the LSTM model can learn effectively and make accurate predictions on web traffic patterns.
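As one possible scaling scheme, min-max normalization maps each numeric feature column into [0, 1]; the sample values are hypothetical (e.g. request lengths):

```python
# Min-max normalization sketch for one numeric feature column
# (sample values are hypothetical, e.g. request body lengths).
def min_max_scale(values):
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```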
3.4.1 Key Features of LSTM Networks
Memory Cells
LSTMs have a unique structure called memory cells that can maintain information
in memory for long periods.
This helps overcome the short-term memory limitation of traditional RNNs.
Gates
Input Gate: Controls the extent to which new information flows into the memory
cell.
Forget Gate: Decides what information to discard from the memory cell.
Output Gate: Determines the information to output based on the cell state.
The interactions between these gates allow the LSTM to learn when to remember and forget
information over long sequences, making them effective for modeling complex time
dependencies.
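The gate interactions described above can be written as a single cell step; this NumPy sketch follows the standard LSTM cell equations, with random placeholder weights rather than trained values:

```python
# One forward step of a standard LSTM cell in NumPy, showing how the input,
# forget, and output gates update the memory cell. Weights are random
# placeholders, not a trained model.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3                        # input and hidden sizes (arbitrary)
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1  # stacked gate weights
b = np.zeros(4 * n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev):
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0*n_hid:1*n_hid])       # input gate: how much new info enters
    f = sigmoid(z[1*n_hid:2*n_hid])       # forget gate: how much old state survives
    o = sigmoid(z[2*n_hid:3*n_hid])       # output gate: how much state is exposed
    g = np.tanh(z[3*n_hid:4*n_hid])       # candidate cell update
    c = f * c_prev + i * g                # new memory cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c

h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid))
```

Processing a request as a token sequence means applying this step once per token, carrying `h` and `c` forward, which is how the network retains context across the sequence.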
3.5 Model Evaluation
We will use various evaluation metrics to assess the performance of our model, including:
Accuracy
Precision
Recall
F1 Score
Confusion Matrix
AUC-ROC
These metrics will help us ensure that the model accurately identifies malicious requests
while minimizing false positives and false negatives.
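These metrics can all be computed from the binary confusion matrix; a plain-Python sketch (labels are hypothetical, with 1 = malicious and 0 = normal):

```python
# Evaluation-metrics sketch from binary predictions (1 = malicious, 0 = normal).
def evaluate(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of flagged, how many malicious
    recall = tp / (tp + fn) if tp + fn else 0.0      # of malicious, how many caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "confusion": (tp, fp, fn, tn)}
```

For a WAF, precision tracks false alarms on clean traffic while recall tracks missed attacks, which is why both are reported alongside accuracy.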
This methodology will ensure that our intelligent WAF is capable of effectively detecting
and blocking malicious web requests, thereby safeguarding web applications from
potential attacks.
CHAPTER 4: DEVELOPMENT AND SOFTWARE
REQUIREMENTS
4.1 Development Environment
VSCode
Jupyter Notebook
Google Colab
Detect Malicious Requests: Automatically identify and filter out harmful web
traffic using a trained machine learning model.
Enhance Security and Performance: Protect web applications from attacks without
relying on hardcoded rules, thereby reducing latency and improving overall
performance.
CHAPTER 5: EPILOGUE
The project is divided into multiple phases. The following Gantt chart shows different phases
along with their estimated time for development.
CHAPTER 6: EXPECTED OUTPUT
At the end of this project, we expect to develop a firewall capable of deep packet inspection that will be able to identify web attacks such as SQL injection (SQLi), Cross-Site Scripting (XSS), Server-Side Request Forgery (SSRF), insecure file upload, command injection, NoSQL injection, CSS injection, and HTML injection, and display the attack details in the admin dashboard.
REFERENCES
[1] Hong, Jiwon, et al. "Client-Based Web Attacks Detection Using Artificial Intelligence."
(2023).
[2] Awuor, Odiaga Gloria. "Assessment of existing cyber-attack detection models for web-
based systems." Global Journal of Engineering and Technology Advances 15.01 (2023):
070-089.
[3] Dawadi, Babu R., Bibek Adhikari, and Devesh Kumar Srivastava. "Deep learning
technique-enabled web application firewall for the detection of web attacks." Sensors 23.4
(2023): 2073.
[4] Tian, Zhihong, et al. "A distributed deep learning system for web attack detection on
edge devices." IEEE Transactions on Industrial Informatics 16.3 (2019): 1963-1971.
[5] Awuor, Odiaga Gloria. "Assessment of existing cyber-attack detection models for web-
based systems." Global Journal of Engineering and Technology Advances 15.01 (2023):
070-089.
[6] Du, Min, et al. "Deeplog: Anomaly detection and diagnosis from system logs through
deep learning." Proceedings of the 2017 ACM SIGSAC conference on computer and
communications security. 2017.
[7] Adamuthe, Amol C. "Improved text classification using long short-term memory and word embedding technique." Int J Hybrid Inf Technol 13.1 (2020): 19-32.