ML Based Web Application Firewall For Signature and Anomaly Detection Using Feature Extraction
Abstract—In today’s digital landscape, web applications have become indispensable tools for businesses, facilitating access to vital data and services. However, this ubiquity also exposes them to a myriad of cyber threats, ranging from SQL injection to cross-site scripting. Traditional Web Application Firewalls (WAFs), while essential, often struggle with rule-based detection methods, leading to issues like false positives and an inability to adapt swiftly to emerging threats. This paper introduces an advanced Web Application Firewall (WAF) fortified with Machine Learning (ML) techniques, specifically a Random Forest classifier. The proposed WAF achieved nearly 100% accuracy in classifying requests as normal or anomalous and 89.5% accuracy in distinguishing between different attack types. Furthermore, it significantly reduces computational time compared to traditional WAFs. By harnessing ML, the proposed WAF enhances web application security, effectively identifying and mitigating attacks while minimizing false positives, showcasing a practical ML-driven solution for safeguarding online assets against evolving cyber threats.

Index Terms—Web Application Firewall (WAF), Cybersecurity, Web Application Security, Structured Query Language (SQL) injection, Cross-Site Scripting (XSS), Distributed Denial of Service (DDoS), Machine Learning (ML), Deep Learning (DL), Rule-based detection mechanisms, False positives, False negatives, Anomaly detection, Signature-based filtering

I. INTRODUCTION

Web apps are now essential to company operations in today’s digital world since they enable worldwide data interchange, transactions, and interactions. But this widespread presence also leaves these apps open to a wide range of cyberthreats, from common flaws like Cross-Site Scripting (XSS) and SQL injection [1] [2] to more complex attacks like Distributed Denial-of-Service (DDoS) attacks. Although signature-based web application firewalls (WAFs) and other conventional security measures have offered some protection, they frequently fail to keep up with the ever-evolving strategies used by malevolent actors. A promising paradigm for improving web application security has emerged in response to these issues: the integration of deep learning (DL) [4] [10], machine learning (ML) [7] [9], and artificial intelligence (AI) [11] techniques.

Because cyber dangers are constantly evolving, a proactive approach to security is necessary. This approach ought to move beyond rigid rule-based defences and instead adjust to novel threats on a dynamic basis [3]. ML-based WAFs are a tempting substitute since they employ complex algorithms to quickly analyse and respond to patterns of online traffic. Unlike traditional WAFs that rely on predefined signatures to identify known attacks [?], ML-based approaches can learn from past events and detect irregularities suggestive of malicious behavior [5] [6] [8]. Web application security solutions may become much more successful with adaptability, allowing organisations to stay ahead of constantly changing threats.

In light of this, the purpose of this study is to investigate the deployment of an ML-based WAF intended to improve web application security. The main goal is to create a defence system that is strong and flexible enough to counter a variety of cyberthreats while reducing false positives and false negatives [12] [13]. The suggested WAF aims to improve overall system security and resilience by reducing response times and increasing threat detection accuracy by utilising ML techniques. Furthermore, in order to shed light on the effectiveness and scalability of the ML-based WAF, this study aims to assess its performance in a variety of use cases and deployment situations.

A number of crucial processes are involved in implementing the ML-based WAF, including feature engineering [19], data preparation, model training, and deployment. In order to produce a trustworthy dataset for ML model training and testing, data preprocessing comprises gathering and cleaning web traffic data. In order to feed the ML algorithms, feature engineering entails choosing and extracting pertinent features from the dataset, such as HyperText Transfer Protocol (HTTP) request parameters, headers, and payloads. In order to identify patterns suggestive of malicious behavior [16] [17], model training entails training machine learning (ML) models, such
as neural networks [20], Support Vector Machines (SVM), or decision trees, on the extracted features. In order to detect and respond to such threats in a timely manner, the trained models are then deployed in a production environment to monitor and analyse incoming web traffic in real time [15].

II. METHODOLOGY

A. Proposed Model

The proposed approach to web application security is founded on the synergy between anomaly-based detection and signature-based detection methodologies. Anomaly-based detection involves a meticulous observation of web application behavior, discerning patterns indicative of normal user interactions and those signaling potentially malicious activity. Leveraging advanced artificial intelligence techniques like classification algorithms and neural networks, the aim is to develop models capable of swiftly identifying abnormal behaviors in real time, facilitating rapid response and mitigation measures against emerging threats.

In parallel, the methodology integrates signature-based detection, leveraging databases housing known attack patterns and malicious request signatures. Analyzing these repositories gives insight into prevalent web application vulnerabilities, making it possible to identify familiar attack patterns and compare incoming requests against them. This proactive approach allows the WAF to promptly trigger alerts or block suspicious traffic, effectively safeguarding web applications against known threats and vulnerabilities.

The structured workflow of this methodology begins with the parsing of incoming requests and the extraction of relevant features crucial for identifying malicious traffic. These features are then utilized in the classification process, where requests are categorized as either malicious or normal. Subsequently, the decision-making unit within the model utilizes the classification results to take appropriate actions, blocking nefarious traffic while permitting legitimate requests to proceed seamlessly to the server. Through the integration of anomaly-based and signature-based detection methodologies, the approach aims to establish a robust framework for web application security, fortifying applications against a wide array of threats and ensuring resilience in the face of evolving cyber threats.

B. Architecture

As seen in Fig. 1, the Web Application Firewall’s (WAF) architecture is centred on intercepting and handling incoming HTTP requests prior to them reaching the web application server. The request handler of the WAF is the first to receive and process requests to access web applications hosted on the server. Serving as the first point of contact, it is in charge of forwarding requests to later parts for review and decision-making.

The Machine Learning (ML) model, an advanced algorithm trained on historical web traffic data, is the central component of the WAF design. The parser retrieves relevant data, including request headers, parameters, cookies, and payload data, as requests are intercepted by the WAF. The ML model takes this parsed data as input and applies sophisticated anomaly detection algorithms to find patterns that point to malicious activities. The machine learning model continuously learns from previous attack patterns, which allows it to adjust and develop over time to increase detection accuracy and resistance to new threats.

Fig. 1. Architecture Diagram

The WAF design includes a signature database with known attack patterns, signatures, and rules in parallel with the ML model. These signatures are the result of community contributions, threat intelligence feeds, and in-depth security research. Signature-based detection techniques compare incoming HyperText Transfer Protocol (HTTP) requests to the signatures stored in the database. This adds another line of defence against popular attack vectors like SQL injection, cross-site scripting, and malware propagation, allowing the WAF to quickly identify and neutralise known threats.

As illustrated in Fig. 2, the proposed Web Application Firewall (WAF) architecture is made up of a number of essential parts collaborating to guarantee the integrity and security of web applications. When combined, these elements create a strong defence system that thwarts a variety of online attacks and guarantees the security and accessibility of web applications.

1. Power Unit: The Power Unit serves as the entry point of the Web Application Firewall (WAF) architecture, responsible for receiving and routing incoming HTTP requests from clients to subsequent processing units. It acts as the gateway, channeling all web traffic through the WAF for inspection and analysis. The Power Unit ensures high availability and scalability of the WAF system, handling incoming requests efficiently and effectively distributing the workload across multiple instances or servers as needed.

2. Parsing Unit: Once an HTTP request enters the WAF through the Power Unit, it is passed on to the Parsing Unit for detailed analysis and extraction of relevant information. The Parsing Unit dissects each request, extracting essential features such as request headers, parameters, cookies, URL paths, and payload data. This process involves parsing and tokenizing the request to generate a structured representation that can be utilized by subsequent units for further analysis. The Parsing Unit plays a crucial role in preparing the request data for both signature-based and anomaly-based detection methods employed by the Classification Unit.

3. Classification Unit: The heart of the WAF architecture lies within the Classification Unit, where the actual detection and
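As a rough illustration of the Parsing Unit and the signature lookup described above, the following Python sketch splits a raw HTTP request into a structured representation (path, parameters, headers, cookies, payload) and compares the extracted values against a tiny signature table. The ParsedRequest container, the function names, and the three regular expressions are hypothetical and chosen only for illustration; they are assumptions, not the paper's implementation or its signature database.

```python
import re
from dataclasses import dataclass, field
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlsplit


@dataclass
class ParsedRequest:
    """Structured view of one HTTP request (hypothetical container)."""
    method: str
    path: str
    params: dict = field(default_factory=dict)
    headers: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)
    payload: str = ""


def parse_request(raw: str) -> ParsedRequest:
    """Split a raw HTTP/1.x request into the fields the Parsing Unit extracts."""
    head, _, payload = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)
    url = urlsplit(target)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    jar = SimpleCookie()
    jar.load(headers.get("cookie", ""))
    cookies = {name: morsel.value for name, morsel in jar.items()}
    # parse_qs also percent-decodes the query string values.
    params = parse_qs(url.query, keep_blank_values=True)
    return ParsedRequest(method, url.path, params, headers, cookies, payload)


# Illustrative signatures only (assumption); a real database is far larger.
SIGNATURES = {
    "sqli": re.compile(r"'\s*(?:or|and)\b.*=|union\s+select", re.IGNORECASE),
    "xss": re.compile(r"<\s*script\b|javascript:", re.IGNORECASE),
    "path-traversal": re.compile(r"\.\.[/\\]"),
}


def tokenize(request: ParsedRequest) -> list[str]:
    """Collect the strings to inspect: path, parameter values, cookies, payload."""
    values = [request.path, request.payload]
    for items in request.params.values():
        values.extend(items)
    values.extend(request.cookies.values())
    return [value for value in values if value]


def match_signatures(request: ParsedRequest) -> list[str]:
    """Return the sorted labels of all signatures found in the request."""
    hits = set()
    for value in tokenize(request):
        for label, pattern in SIGNATURES.items():
            if pattern.search(value):
                hits.add(label)
    return sorted(hits)


if __name__ == "__main__":
    raw = (
        "GET /search?q=1%27+OR+%271%27%3D%271 HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Cookie: session=abc123\r\n\r\n"
    )
    request = parse_request(raw)
    print(request.params, match_signatures(request))
```

In a design like the one described, an empty result from match_signatures would not mean the request is safe; the same parsed fields would still be passed to the anomaly-based classifier.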
3.3. Manipulate Payloads Weight: Capturing instances of data manipulation within payloads and headers, this subfeature increments with each detected manipulation, thereby quantifying the degree of malicious data alteration.

3.4. Alphanumeric Characters to Special Character Ratio: Calculated as the ratio of alphanumeric characters to special characters, this subfeature aids in discerning anomalous patterns, particularly in scenarios where this ratio deviates significantly from expected norms.

D. Tech Stack

We leverage several Python libraries and machine learning concepts to develop and evaluate our proposed model. Here’s a brief introduction to each:

Pandas: A robust data manipulation and analysis library, Pandas provides data structures such as the DataFrame that make it simple to handle and work with structured data.

NumPy: A core Python library for scientific computing, NumPy supports large, multi-dimensional arrays and matrices and provides a set of mathematical functions to work with these arrays efficiently.

Scikit-learn: A flexible machine learning package, Scikit-learn offers a large selection of tools for data mining and analysis. It is a top option for creating machine learning models in Python since it contains a variety of methods for model selection, dimensionality reduction, regression, clustering, and classification.

Seaborn: This Matplotlib-based data visualisation library offers a high-level interface for making visually appealing and informative statistical visualisations. It simplifies the creation of intricate visual representations like regression plots, distribution plots, and heatmaps.

Matplotlib: Matplotlib is a feature-rich Python plotting package that generates figures of publication quality in several formats. Plots, charts, histograms, and other visualisations can be made using a wide range of customisation choices to effectively express data insights.

Random Forest Classifier: Random Forest is an ensemble learning approach that builds a large number of decision trees during training. Its output is the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression). It is renowned for being very accurate and resilient in classification tasks.

Convolutional neural networks, or CNNs, are a subclass of deep neural networks that are mostly used for image analysis. They can automatically and adaptively learn the spatial hierarchies of features from input images since they are made up of numerous layers of convolutional filters followed by pooling layers.

III. RESULTS

The matrix in Fig. 5 is a grid with both the columns and rows labeled with types of attacks: 'norm' (normal), 'sqli' (SQL injection), 'xss' (cross-site scripting), 'cmdi' (command injection), and 'path-traversal'. The rows represent the actual classes, while the columns represent the predicted classes.

Fig. 5. Confusion Matrix

Here are the values from the confusion matrix:

1. 'norm' (normal) was correctly predicted 2128 times (true positives for 'norm').
2. 'sqli' (SQL injection) was correctly predicted 1171 times, with 3 instances incorrectly predicted as 'xss' and 2 as 'path-traversal'.
3. 'xss' (cross-site scripting) had 58 instances correctly predicted, with some instances incorrectly predicted as 'sqli' (7), 'cmdi' (1), and 'path-traversal' (1).
4. 'cmdi' (command injection) had 4 instances incorrectly predicted as 'norm', with 7 correct predictions, and 1 instance incorrectly predicted as 'path-traversal'.
5. 'path-traversal' had 26 instances correctly predicted, with 1 instance incorrectly predicted as 'sqli', 3 as 'xss', and 1 as 'cmdi'.

The matrix shows that the classifier performed very well on 'norm' and 'sqli' classes, with a high number of true
positives and very few false positives or false negatives. The other classes have a lower number of samples and show some misclassifications, but the majority of predictions are still correct for their respective classes.

The proposed Random Forest classifier performed exceptionally well in detecting normal and anomalous requests. It achieved 100% accuracy in classifying requests as either normal or anomalous. For distinguishing between different attack types, the classifier attained an accuracy of 89.5%.

• Precision for identifying attack types: 89.21%
• Recall for identifying attack types: 89.30%
• Precision for overall label classification (normal or attack): 100%
• Recall for overall label classification: 100%

The confusion matrix reflects robust performance across various attack types, with the classifier demonstrating high true positive rates and low false positive and negative rates. This indicates superior detection capabilities compared to traditional WAF methods.

IV. CONCLUSION AND FUTURE WORK

To sum up, this study offers insightful information about the evolving field of machine learning-based web application firewalls (WAFs). Emphasizing the critical importance of rigorous feature engineering and selection procedures, the proposed Random Forest-based WAF demonstrated an accuracy of 100% in classifying requests as normal or anomalous and 89.5% accuracy in classifying different attack types. The model achieved a precision of 89.21% and recall of 89.30% for attack type classification, while achieving nearly perfect precision and recall for overall attack detection. Compared to traditional rule-based, behavior-based, and reputation-based WAFs, the proposed ML-based WAF offers superior accuracy and reduced computational time, making it a more efficient and reliable solution.

The proposed ML-based WAF also demonstrated a notable reduction in computational time compared to traditional WAF methods. This efficiency gain is attributed to the streamlined process of feature extraction and the use of the Random Forest classifier, which processes requests more quickly and accurately.

Key areas for future improvement include integrating anomaly-based and signature-based detection techniques and exploring various classifier approaches. Addressing ethical, legal, and societal issues related to deploying ML-based security solutions remains crucial. Future efforts should also focus on defense mechanisms against a broader range of web-based assaults, such as advanced injection techniques, Distributed Denial of Service (DDoS), and Cross-Site Scripting (XSS), ensuring continuous advancements in ML-based WAF technology for enhanced security against a dynamic threat landscape.

REFERENCES

[1] D. Appelt, A. Panichella and L. Briand, "Automatically Repairing Web Application Firewalls Based on Successful SQL Injection Attacks," 2017 IEEE 28th International Symposium on Software Reliability Engineering (ISSRE), Toulouse, France, 2017, pp. 339-350, doi: 10.1109/ISSRE.2017.28.
[2] A. Ghafarian, "A hybrid method for detection and prevention of SQL injection attacks," 2017 Computing Conference, London, UK, 2017, pp. 833-838, doi: 10.1109/SAI.2017.8252192.
[3] Clincy, V. and Shahriar, H., (2018) Web Application Firewall: Network security models and configuration. In 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC) (Vol. 1, pp. 835-836). IEEE.
[4] Ito, M. and Iyatomi, H., (2018) Web application firewall using character-level convolutional neural network. In 2018 IEEE 14th International Colloquium on Signal Processing & Its Applications (CSPA) (pp. 103-106). IEEE.
[5] A. Divekar, M. Parekh, V. Savla, R. Mishra and M. Shirole, "Benchmarking datasets for Anomaly-based Network Intrusion Detection: KDD CUP 99 alternatives," 2018 IEEE 3rd International Conference on Computing, Communication and Security (ICCCS), Kathmandu, Nepal, 2018, pp. 1-8, doi: 10.1109/CCCS.2018.8586840.
[6] Betarte, G., Giménez, E., Martinez, R. and Pardo, A., (2018) Improving Web Application Firewalls through anomaly detection. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA) (pp. 779-784). IEEE.
[7] G. Betarte, Á. Pardo and R. Martínez, "Web Application Attacks Detection Using Machine Learning Techniques," 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 2018, pp. 1065-1072, doi: 10.1109/ICMLA.2018.00174.
[8] A. M. Vartouni, S. S. Kashi and M. Teshnehlab, "An anomaly detection method to detect web attacks using Stacked Auto-Encoder," 2018 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), Kerman, Iran, 2018, pp. 131-134, doi: 10.1109/CFIS.2018.8336654.
[9] Y. Xin et al., "Machine Learning and Deep Learning Methods for Cybersecurity," in IEEE Access, vol. 6, pp. 35365-35381, 2018, doi: 10.1109/ACCESS.2018.2836950.
[10] Vartouni, A.M., Teshnehlab, M. and Kashi, S.S., (2019) Leveraging deep neural networks for anomaly-based Web Application Firewall. IET Information Security, 13(4), pp. 352-361.
[11] Tekerek, Adem & Bay, Omer. (2019). Design and Implementation of an Artificial Intelligence-Based Web Application Firewall Model. Neural Network World. 29. 189-206. 10.14311/NNW.2019.29.013.
[12] Abdulhammed, Razan, Hassan Musafer, Ali Alessa, Miad Faezipour, and Abdelshakour Abuzneid. 2019. "Features Dimensionality Reduction Approaches for Machine Learning Based Network Intrusion Detection" Electronics 8, no. 3: 322. https://doi.org/10.3390/electronics8030322
[13] G. T. Reddy, M. P. K. Reddy, K. Lakshmanna et al., "Analysis of dimensionality reduction techniques on big data," IEEE Access, vol. 8, pp. 54776-54788, 2020.
[14] Arwa Aldweesh, Abdelouahid Derhab, Ahmed Z. Emam, Deep learning approaches for anomaly-based intrusion detection systems: A survey, taxonomy, and open issues, Knowledge-Based Systems, Volume 189, 2020, 105124, ISSN 0950-7051, https://doi.org/10.1016/j.knosys.2019.105124.
[15] Dainya Thomas-Reynolds, Sergey Butakov, "Factors Affecting the Performance of Web Application Firewall", WISP 2020 Proceedings.
[16] Thang, N.M., 2020. Improving Efficiency of Web Application Firewall to Detect Code Injection Attacks with Random Forest Method and Analysis Attributes HTTP Request. Programming and Computer Software, 46(5), pp. 351-361.
[17] Demetrio, L., Valenza, A., Costa, G. and Lagorio, G., (2020) WAF-A-MoLE: evading web application firewalls through adversarial machine learning. In Proceedings of the 35th Annual ACM Symposium on Applied Computing (pp. 1745-1752).
[18] Simon Applebaum, Tarek Gaber, Ali Ahmed, Signature based and Machine-Learning-based Web Application Firewalls: A Short Survey, Procedia Computer Science, Volume 189, 2021, Pages 359-367, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2021.05.105
[19] Aref Shaheed, M. H. D. Bassam Kurdy, "Web Application Firewall Using Machine Learning and Features Engineering", Security and Communication Networks, vol. 2022, Article ID 5280158, 14 pages, 2022. https://doi.org/10.1155/2022/5280158.
[20] Dawadi, Babu R., Bibek Adhikari, and Devesh Kumar Srivastava. 2023. "Deep Learning Technique-Enabled Web Application Firewall for the Detection of Web Attacks" Sensors 23, no. 4: 2073.