Major Project Proposal

TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING
PASHCHIMANCHAL CAMPUS
Lamachaur-16, Pokhara

A Proposal On
“Sentinel: Proactive and Vigilant AI-based Web Security Solution”

Submitted by:
Aarati Mahato [PAS077BCT001]
Ashim Karki [PAS077BCT012]
Niraj Neupane [PAS077BCT023]
Prayash Mishra [PAS077BCT027]

Submitted to:
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
June, 2023
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1 BACKGROUND
1.2 Problem Statement
1.3 Objectives of Project
1.4 Feasibility Study
1.4.1 Economic Feasibility
1.4.2 Technical Feasibility
1.4.3 Operational Feasibility
CHAPTER 2: LITERATURE REVIEW
CHAPTER 3: METHODOLOGY
3.2 Data Collection
3.3 Data Preprocessing
3.3.1 Data Cleaning
3.3.2 Request Decoding
3.3.3 Feature Selection
3.3.4 Data Transforming
3.3 Model Generation
3.3.1 Key Features of LSTM Networks
3.3.2 LSTM Cell Structure
3.5 Model Evaluation
3.6 Experimentation Tracking
3.7 Model Deployment
CHAPTER 4: DEVELOPMENT AND SOFTWARE REQUIREMENTS
4.1 Development Environment
4.2 Programming Language
4.3 Hardware and Software Configuration
4.4 Description of the Proposed System
CHAPTER 5: EPILOGUE
5.1 Estimated Cost
CHAPTER 6: EXPECTED OUTPUT
REFERENCES

LIST OF FIGURES

Figure 3.1 High Level System Architecture
Figure 3.2 Framework of Proposed Model
Figure 3.3 LSTM Cell with Internal Implementation
Figure 5.1 Gantt Chart

LIST OF TABLES

Table 1 List of Abbreviations

LIST OF ABBREVIATIONS

AI      Artificial Intelligence
WAF     Web Application Firewall
RNN     Recurrent Neural Network
DNN     Deep Neural Network
CNN     Convolutional Neural Network
ANN     Artificial Neural Network
XSS     Cross-Site Scripting
RF      Random Forest
LSTM    Long Short-Term Memory
NLP     Natural Language Processing
GPU     Graphics Processing Unit
SSRF    Server-Side Request Forgery
SQL     Structured Query Language
AUC     Area Under Curve
ROC     Receiver Operating Characteristic

Table 1 List of Abbreviations

CHAPTER 1: INTRODUCTION

1.1 BACKGROUND

In today’s age, web applications have become an integral part of society. Many services are
now provided online, and they are becoming the backbone of the economy of our modern
society. That backbone must be robust and resilient against security flaws and web
vulnerabilities. Firewalls are used to provide this security and to safeguard web applications
from different types of web attacks.

A firewall is a security system that monitors and controls network traffic based on a set of
security rules. The firewall matches network traffic against the rule set defined in its table.
Once a rule is matched, the associated action is applied to the network traffic. There are
different types of firewalls, and the web application firewall (WAF) is one of them. A WAF is
intended to protect the server from different types of attacks by malicious actors. A WAF is
typically deployed as a reverse proxy.
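
As a minimal sketch of the rule-matching behaviour described above, the following Python
snippet checks a request against an ordered rule table and applies the action of the first
rule that matches. The `Rule` class, the two example patterns, and the default `allow` action
are illustrative assumptions and do not come from any particular firewall product.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One entry of a hypothetical firewall rule table."""
    pattern: str   # regular expression searched for in the request text
    action: str    # action applied when the pattern matches, e.g. "block"


# Illustrative rule set; real WAF policies are far more extensive.
RULES = [
    Rule(pattern=r"(?i)<script\b", action="block"),           # naive XSS check
    Rule(pattern=r"(?i)\bunion\s+select\b", action="block"),  # naive SQLi check
]


def apply_rules(request_text: str, rules=RULES, default: str = "allow") -> str:
    """Return the action of the first rule whose pattern matches the request."""
    for rule in rules:
        if re.search(rule.pattern, request_text):
            return rule.action
    return default


print(apply_rules("GET /search?q=1 UNION SELECT password FROM users"))  # block
print(apply_rules("GET /index.html"))                                   # allow
```

A table like this only catches the patterns someone thought to write down, which is the
limitation the learned model is meant to address.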

A reverse proxy is an intermediate server that sits in front of a main backend server and
ensures that no client ever communicates directly with the main backend server. It accepts
requests from clients and forwards them to the appropriate server. It also receives
responses from the server and sends them back to the client. It provides a layer of
abstraction between clients and origin servers to help optimize traffic routing and improve
performance and security.

A WAF operates through a list of rules called policies, which aim to filter out malicious
requests from clients and forward the remaining ones to the main server. Instead of relying on
hardcoded policies, we intend to train an ML model to determine for itself whether a request
holds any malicious code. We expect this to make our system more reliable and accurate in
detecting different attacks and preventing them from penetrating the web application.

1.2 Problem Statement

Due to the popularity of online services, there has been an increase in cyber attacks. In the
first quarter of 2024 alone, the average number of attacks per organization increased by 28%
compared with the previous quarter. These attacks not only jeopardize the security and
operations of organizations but also compromise the privacy and safety of users, potentially
leading to the leakage or deletion of their private data. Products already on the market use
hardcoded sets of policies and rules to protect web applications, but only a handful of
solutions leverage advanced technologies such as machine learning and artificial intelligence
to adapt dynamically to evolving threats. The field of AI is also advancing rapidly, which will
significantly impact every IT sector, including the cyber security domain. Malicious actors
may also use these technologies to perform phishing attacks, password cracking, social
engineering with AI-generated audio and video, and more. We therefore need something more
advanced and intelligent to distinguish these different attacks from normal, clean
communication. Sentinel, our AI-based WAF, is intended to serve this purpose.

1.3 Objectives of Project

• To develop a machine learning model that detects and blocks malicious web requests,
and to implement it as a Web Application Firewall (WAF) without relying on classical
predefined rules.
• To present the web attack logs in a graphical and user-friendly manner.

1.4 Feasibility Study

A feasibility study for a digital product analyzes whether the product is technically,
financially, and operationally feasible. It helps identify risks and obstacles early in
the development process, saves time and money by avoiding the development of non-viable
products, and provides a foundation for decision-making and planning. It can also help secure
funding and support for the product. A feasibility study is therefore essential for every
project.

1.4.1 Economic Feasibility

The economic feasibility of any product depends on various factors such as initial
investment, operating costs, client adoption, and the ability to attract and sustain partnerships
with multiple companies. A detailed financial analysis that considers revenue streams and
potential cost savings would provide a more accurate assessment of the project’s economic
viability.

For our project, most of the development tools and technologies we will be using are open
source. The reverse proxy server and the firewall model are the backbone of the project.
The reverse proxy can be implemented on an open-source web server, and the model can be
implemented using popular tools that are available for free. The dataset required for training
the model can be generated so that it replicates real-world data, or it can be taken from
multiple sources such as Kaggle. The model can also be deployed on a cloud platform for
free. To sum up, our project is economically feasible for development.

1.4.2 Technical Feasibility

As noted above, the tools and resources required for developing the project are already
available. The project requires knowledge of machine learning and of suitable algorithms for
developing the model, which can be learned through various free courses available on YouTube,
Coursera, Udemy, and similar platforms. Because our project belongs to the cyber security
domain, our team members' experience in this field will significantly help us build a better
product. We therefore do not expect major technical difficulties in our project.

1.4.3 Operational Feasibility

Our project will be designed and developed using the latest available technology to avoid
issues that might arise while integrating our services with multiple clients. Hundreds of
thousands of web products need to be secured against different types of web attacks. The
project we are proposing will help those platforms protect themselves from malicious attack
vectors and make them aware of the vulnerabilities that exist in their systems. On the
operational side, our services must be properly configured and integrated with the clients'
systems in order to work. To make this process easy, we will design the project so that our
services are modular and customizable to each client's requirements.

CHAPTER 2: LITERATURE REVIEW

In recent years, artificial intelligence (AI) has greatly impacted how we protect web
application systems and networks from cyber threats. The most common way AI is used in
cybersecurity is through machine learning (ML). Since it is hard to predict exactly when,
where, or how cyber-attacks will happen, and we cannot completely prevent them, early
detection is crucial to minimize damage. Over the past decade, many ML techniques have
been developed to detect web attacks and improve web security. Various methods have been
used to identify threats to web servers, and as cyber systems evolve, adaptability becomes
essential.

Paper [1] applied ML models such as RF, DNN, and CNN, which proved highly effective
in detecting various client-based web attacks. The high accuracy rates achieved by these
models demonstrate their potential to significantly enhance web security measures. The paper
suggests that future research should focus on optimizing these models for real-time detection
and exploring their scalability in diverse cybersecurity applications. Additionally, combining
multiple AI techniques and incorporating advanced data preprocessing methods may further
improve detection accuracy and robustness.

Paper [2] highlights the differences between models trained with traditional machine
learning approaches and those using deep learning techniques for detecting cyber-attacks.
It emphasizes the effectiveness of deep learning methods, which can automatically learn and
extract features from large datasets without human intervention. Deep learning techniques
such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Deep
Neural Networks (DNN) exhibit superior performance and robustness in handling complex,
multidimensional data compared to traditional machine learning approaches. This is
particularly evident in the detection of various types of cyber-attacks, including DoS, DDoS,
MitM, password attacks, SQL injection, XSS, eavesdropping, and phishing.

Paper [3] proposes a web application firewall (WAF) that leverages a Convolutional
Neural Network (CNN) to analyze web traffic features and classify them as normal or
malicious. The model is effective in detecting various web attacks, including SQL injection
and Cross-Site Scripting (XSS), highlighting the potential of deep learning for web security.

Paper [4] presents a distributed deep learning system for real-time web attack detection on
edge devices. The system uses a centrally trained deep learning model deployed on edge
devices, which reduces latency and improves scalability for large deployments. This approach
showcases the potential of deep learning for scalable web security solutions.

Paper [5] provides a broader context for AI in web security by reviewing existing
cyber-attack detection models for web-based systems. It highlights various attack detection
techniques beyond deep learning and emphasizes the importance of evaluating models on
factors such as accuracy, efficiency, and adaptability. This broader perspective is crucial for
considering the full landscape of AI-based solutions.

Paper [6] introduces a system that uses deep learning for anomaly detection and diagnosis
from system logs. This work demonstrates the potential of deep learning for broader security
applications by analyzing system logs for anomalous behavior, and it suggests that AI-based
firewalls could similarly analyze network traffic logs for suspicious activity.

Paper [7] studies the effect of different parameters on the performance of LSTM and
Word2Vec for text classification. The text classification accuracies reported for the proposed
methodology across its datasets are 95.78%, 94.93%, 94.87%, 94.88%, 91.79%, 93.04%, and
91.98%. The experiments show that a batch size of 100, 50 epochs, the Adagrad optimizer,
5 hidden nodes, a word-vector length of 100, 2 LSTM layers, an L2 regularization of 0.001,
and a learning rate of 0.001 give better accuracy.

CHAPTER 3: METHODOLOGY

Figure 3.1 High Level System Architecture

The diagram above shows the high-level system architecture and how our firewall will be
embedded into existing systems. A reverse proxy server configured by the service provider
will be responsible for passing each incoming client request to the firewall model. The
firewall model will analyze the request and detect whether it is malicious. If it is, the request
will be analyzed further to determine which type of web attack it belongs to, and it will be
logged so that statistics can be shown in the dashboard. The model's response will be sent
back to the reverse proxy server. Based on that response, the reverse proxy determines the
type of request: it forwards the request to the main backend server only if the request is
clean and normal; otherwise, the request is dropped.
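
The request flow described above can be sketched as follows. This is a simplified
illustration under stated assumptions, not the final implementation: `classify_request` is a
keyword-based stand-in for the trained firewall model, `forward_to_backend` and the in-memory
`attack_log` are hypothetical placeholders for the real backend call and the logging store,
and answering with `403 Forbidden` is one possible way of dropping a request.

```python
from typing import Optional

attack_log = []  # hypothetical in-memory store that would feed the dashboard


def classify_request(raw_request: str) -> Optional[str]:
    """Stand-in for the firewall model: return an attack type, or None if clean."""
    lowered = raw_request.lower()
    if "<script" in lowered:
        return "XSS"
    if "union select" in lowered:
        return "SQLi"
    return None


def forward_to_backend(raw_request: str) -> str:
    """Placeholder for forwarding the request to the main backend server."""
    return "200 OK"


def handle_request(raw_request: str) -> str:
    """Reverse-proxy logic: forward clean requests, log and drop malicious ones."""
    attack_type = classify_request(raw_request)
    if attack_type is not None:
        attack_log.append({"type": attack_type, "request": raw_request})
        return "403 Forbidden"  # the request never reaches the backend
    return forward_to_backend(raw_request)
```

Keeping the forward-or-drop decision inside `handle_request` mirrors the architecture: the
model only labels the request, and the reverse proxy acts on that label.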
Various tools will be used during the development of the machine learning model, which is
the core of our product:

• Jupyter Notebook
• NumPy
• Pandas
• Matplotlib
• Seaborn
• Google Colab
• PyTorch
• TensorFlow
• Keras
• Scikit-learn

The development phase will involve multiple steps, as shown in the following framework:

Figure 3.2 Framework of Proposed Model

3.2 Data Collection

For our training dataset, we plan to use data from two sources. The first is publicly
available datasets from open-source communities and platforms such as Kaggle. The second
is a dataset that we generate ourselves by following the HTTP request standard. A
self-generated dataset can resemble real-world data as long as it follows the standard
practices and protocols used for HTTP requests.
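
As a rough illustration of the self-generated dataset idea, the sketch below builds labeled
HTTP/1.1 request lines from small pools of values. The paths, parameter name, payload strings,
and the `request`/`label` record layout are all assumptions chosen for the example; a real
generator would need far larger and more varied pools.

```python
import random
from urllib.parse import quote

# Illustrative values only; a real generator would use much larger pools.
PATHS = ["/login", "/search", "/products"]
BENIGN_VALUES = ["shoes", "john", "42"]
MALICIOUS_VALUES = ["' OR '1'='1", "<script>alert(1)</script>"]


def make_request(malicious: bool, rng: random.Random) -> dict:
    """Build one labeled HTTP/1.1 GET request line with a single query parameter."""
    value = rng.choice(MALICIOUS_VALUES if malicious else BENIGN_VALUES)
    path = rng.choice(PATHS)
    # Percent-encode the value so the request line stays syntactically valid.
    request_line = f"GET {path}?q={quote(value)} HTTP/1.1"
    return {"request": request_line, "label": int(malicious)}


def make_dataset(n: int, seed: int = 0) -> list:
    """Generate n labeled samples, alternating benign (0) and malicious (1)."""
    rng = random.Random(seed)
    return [make_request(malicious=(i % 2 == 1), rng=rng) for i in range(n)]


for row in make_dataset(4):
    print(row)
```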

3.3 Data Preprocessing

In data preprocessing, the raw data from the source is cleaned, reduced, and transformed
before being fed into the model. This makes training more efficient by removing noise from
the data. The nature of our data requires the following preprocessing steps.

3.3.1 Data Cleaning

In this step, missing or duplicate data are fixed or removed from the dataset. Since our
data is text-based, repairing corrupted records can be quite challenging, so the preferred
approach is to identify and remove them.
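
A minimal sketch of this cleaning step is shown below. It assumes each sample is a dictionary
with a `request` string and a `label` of 0 (normal) or 1 (malicious); that record layout is a
hypothetical choice for the example, not a format fixed by the proposal.

```python
def clean_requests(rows):
    """Drop rows with a missing request or label, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        request = (row.get("request") or "").strip()
        label = row.get("label")
        if not request or label not in (0, 1):
            continue  # corrupted or incomplete sample: remove rather than repair
        key = (request, label)
        if key in seen:
            continue  # exact duplicate of an earlier sample
        seen.add(key)
        cleaned.append({"request": request, "label": label})
    return cleaned


rows = [
    {"request": "GET / HTTP/1.1", "label": 0},
    {"request": "GET / HTTP/1.1", "label": 0},     # duplicate
    {"request": "", "label": 1},                   # missing request text
    {"request": "GET /a HTTP/1.1", "label": None}, # missing label
]
print(clean_requests(rows))
```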

3.3.2 Request Decoding

Each web request will be processed by a decoder, which will parse the request to extract
relevant features such as headers, payload, and parameters. This step is crucial for
transforming raw web traffic data into a structured format that the LSTM model can
process.
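
One possible shape for such a decoder, using only the Python standard library, is sketched
below. The function name `decode_request` and the returned dictionary layout are assumptions
for illustration, and the sketch expects a well-formed HTTP/1.1 request with CRLF line endings.

```python
from urllib.parse import urlsplit, parse_qs


def decode_request(raw: str) -> dict:
    """Parse a raw HTTP request into method, path, parameters, headers and payload."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)
    url = urlsplit(target)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "path": url.path,
        # parse_qs also percent-decodes the values, e.g. %3C becomes "<"
        "params": parse_qs(url.query, keep_blank_values=True),
        "headers": headers,
        "payload": body,
    }


raw = (
    "POST /search?q=%3Cscript%3E HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "\r\n"
    "user=admin"
)
decoded = decode_request(raw)
print(decoded["method"], decoded["path"], decoded["params"])
```

Percent-decoding at this stage matters because attackers often encode payloads to slip past
simple pattern checks.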

3.3.3 Feature Selection

Feature selection is critical for identifying the features of the decoded web requests that
best help the model distinguish between normal and malicious traffic. Key features might
include:

• Request method (GET, POST, etc.)
• URL patterns
• Header information
• Payload content
• Request frequency and timing

Selecting the right features helps reduce the complexity of the model and improves its
performance by focusing on the most informative parts of the data.
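
To make the idea concrete, the sketch below derives a few numeric features from one decoded
request. The specific features, the `SPECIAL_CHARS` character set, and the function signature
are illustrative assumptions; frequency and timing features are left out because they require
state across multiple requests.

```python
import re

# Characters that frequently appear in injection payloads (illustrative choice).
SPECIAL_CHARS = re.compile(r"[<>'\";()=]")


def extract_features(method: str, url: str, headers: dict, payload: str) -> dict:
    """Turn one decoded request into a small dictionary of numeric features."""
    return {
        "is_post": int(method.upper() == "POST"),
        "url_length": len(url),
        "url_special_chars": len(SPECIAL_CHARS.findall(url)),
        "header_count": len(headers),
        "payload_length": len(payload),
        "payload_special_chars": len(SPECIAL_CHARS.findall(payload)),
    }


features = extract_features(
    method="GET",
    url="/search?q=<script>alert(1)</script>",
    headers={"host": "example.com"},
    payload="",
)
print(features)
```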

3.3.4 Data Transforming

Once the features are selected, the next step is to normalize and scale the data. Normalization
and scaling are important to ensure that all features contribute equally to the model's learning
process. This involves:

• Normalization: Adjusting the range of feature values to a common scale, typically
between 0 and 1, to prevent features with larger ranges from dominating those with
smaller ranges.
• Scaling: Standardizing the feature values so that they have a mean of 0 and a standard
deviation of 1, which helps in accelerating the training process and achieving better
performance.

By normalizing and scaling the data, we ensure that the LSTM model can learn effectively
and make accurate predictions on web traffic patterns.
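
The two transformations above correspond to the formulas x' = (x - min) / (max - min) and
z = (x - mean) / std. A small standard-library sketch follows; the function names and the
sample `url_lengths` values are assumptions for the example, and the handling of constant
features (returning zeros) is one reasonable convention among several.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


url_lengths = [10, 20, 30, 40]
print(min_max_normalize(url_lengths))
print(standardize(url_lengths))
```

In practice the minimum, maximum, mean, and standard deviation would be computed on the
training split only and then reused for validation and live traffic.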

3.3 Model Generation

For the development of the model, we decided to use the LSTM (Long Short-Term Memory)
model, a type of recurrent neural network architecture designed to model sequential data.
This model is suitable for tasks that require learning long-term dependencies, such as
time-series forecasting, NLP, and speech recognition.

Figure 3.3 LSTM Cell with Internal Implementation

3.3.1 Key Features of LSTM Networks

Memory Cells

• LSTMs have a unique structure called memory cells that can maintain information
in memory for long periods.
• This helps overcome the short-term memory limitation of traditional RNNs.

Gates

• Input Gate: Controls the extent to which new information flows into the memory
cell.
• Forget Gate: Decides what information to discard from the memory cell.
• Output Gate: Determines the information to output based on the cell state.

3.3.2 LSTM Cell Structure

An LSTM cell consists of several components:

• Cell State (c_t): Carries information across the sequence.
• Hidden State (h_t): Output of the LSTM cell at each time step.
• Input Gate (i_t): Controls which values from the input update the memory state.
• Forget Gate (f_t): Controls which parts of the memory state to forget.
• Output Gate (o_t): Controls which parts of the cell state are exposed as output.

The interactions between these gates allow the LSTM to learn when to remember and forget
information over long sequences, making it effective for modeling complex time
dependencies.
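
The components listed above interact as in the following sketch of one time step, written
for a single-unit cell with scalar input and state so that it runs without any deep learning
library. It follows the commonly used LSTM formulation; the `lstm_step` name, the weight
layout in `w`, the candidate value `g_t`, and the weight values themselves are illustrative
assumptions, and a framework such as PyTorch or Keras would use vectorized, learned weights.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, w):
    """One time step of a single-unit LSTM cell with scalar input and state.

    `w` maps each of "i", "f", "o", "g" to a tuple
    (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x_t + wh * h_prev + b

    i_t = sigmoid(pre("i"))    # input gate: how much new information to write
    f_t = sigmoid(pre("f"))    # forget gate: how much old cell state to keep
    o_t = sigmoid(pre("o"))    # output gate: how much of the cell state to expose
    g_t = math.tanh(pre("g"))  # candidate values for the cell state

    c_t = f_t * c_prev + i_t * g_t   # updated cell state
    h_t = o_t * math.tanh(c_t)       # updated hidden state (the cell's output)
    return h_t, c_t


# Arbitrary illustrative weights; a trained model would learn these.
weights = {name: (0.5, 0.1, 0.0) for name in ("i", "f", "o", "g")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```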

3.5 Model Evaluation
We will use various evaluation metrics to assess the performance of our model, including:

• Accuracy
• Precision
• Recall
• F1 Score
• Confusion Matrix
• AUC-ROC

These metrics will help us ensure that the model accurately identifies malicious requests
while minimizing false positives and false negatives.
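
As a worked illustration of how most of these metrics relate to the confusion matrix, the
sketch below computes them from lists of true and predicted labels, with 1 meaning malicious.
The function name and the sample labels are assumptions for the example. AUC-ROC is omitted
because it needs predicted scores rather than hard labels; in practice a library such as
Scikit-learn would typically compute all of these.

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics (1 = malicious)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "confusion_matrix": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```

For a firewall, a false positive blocks a legitimate user and a false negative lets an attack
through, so precision and recall are worth reading separately rather than relying on accuracy
alone.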

3.6 Experimentation Tracking

We will use tools such as MLflow to track parameters, metrics, and models throughout the
experimentation process. The tracked logs of different experiments and the corresponding
results will help us understand how changes to the model, hyperparameters, and data affect
its performance.

3.7 Model Deployment

The best performing model will be saved and deployed to provide real-time protection
against web attacks. The deployed model will be integrated with a web application to
monitor and filter incoming web requests, ensuring enhanced security for web applications.

This methodology will ensure that our intelligent WAF is capable of effectively detecting
and blocking malicious web requests, thereby safeguarding web applications from
potential attacks.

CHAPTER 4: DEVELOPMENT AND SOFTWARE REQUIREMENTS

4.1 Development Environment

• VSCode
• Jupyter Notebook
• Google Colab

4.2 Programming Language

• Python: For analyzing and processing data and for building the machine learning
model.
• React and Node.js: For building the dashboard where clients monitor web attack
logs.

4.3 Hardware and Software Configuration

Our project will require substantial data and computing power to train, test, and process web
requests. We will be using Google Colab for intensive computational tasks and a local
computer with Python and Jupyter Notebook, along with necessary libraries installed, for
development and testing.

4.4 Description of the Proposed System

Our system will function as an intelligent Web Application Firewall (WAF) that uses
machine learning to detect and block malicious web requests. The proposed system will:

• Detect Malicious Requests: Automatically identify and filter out harmful web
traffic using a trained machine learning model.
• Enhance Security and Performance: Protect web applications from attacks without
relying on hardcoded rules, with the aim of keeping latency low and improving overall
performance.

CHAPTER 5: EPILOGUE

The project is divided into multiple phases. The following Gantt chart shows different phases
along with their estimated time for development.

Figure 5.1 Gantt Chart

5.1 Estimated Cost

The estimated cost of this project may include the following:
• GPU: NRs. 100,000
• Dataset collection and validation: NRs. 100,000
• Deployment: NRs. 100,000

CHAPTER 6: EXPECTED OUTPUT
At the end of this project, we expect to have developed a firewall capable of deep packet
inspection that can identify web attacks such as SQL injection (SQLi), Cross-Site Scripting
(XSS), Server-Side Request Forgery (SSRF), insecure file upload, command injection, NoSQL
injection, CSS injection, and HTML injection, and display the attack details in an admin
dashboard.

REFERENCES

[1] Hong, Jiwon, et al. "Client-Based Web Attacks Detection Using Artificial Intelligence."
(2023).

[2] Awuor, Odiaga Gloria. "Assessment of existing cyber-attack detection models for web-
based systems." Global Journal of Engineering and Technology Advances 15.01 (2023):
070-089.

[3] Dawadi, Babu R., Bibek Adhikari, and Devesh Kumar Srivastava. "Deep learning
technique-enabled web application firewall for the detection of web attacks." Sensors 23.4
(2023): 2073.

[4] Tian, Zhihong, et al. "A distributed deep learning system for web attack detection on
edge devices." IEEE Transactions on Industrial Informatics 16.3 (2019): 1963-1971.

[5] Awuor, Odiaga Gloria. "Assessment of existing cyber-attack detection models for web-
based systems." Global Journal of Engineering and Technology Advances 15.01 (2023):
070-089.

[6] Du, Min, et al. "Deeplog: Anomaly detection and diagnosis from system logs through
deep learning." Proceedings of the 2017 ACM SIGSAC conference on computer and
communications security. 2017.

[7] Adamuthe, Amol C. "Improved text classification using long short-term memory and
word embedding technique." Int J Hybrid Inf Technol 13.1 (2020): 19-32.