
Security and Communication Networks - 2022 - Shaheed - Web Application Firewall Using Machine Learning and Features

The article presents a web application firewall model that utilizes machine learning and feature engineering to detect common web attacks by analyzing HTTP request features. The model classifies requests as normal or anomalous based on five extracted features and achieves high classification accuracy of 99.6% with research datasets and 98.8% with real web server logs. The study addresses limitations of previous works by providing a more comprehensive feature set and evaluating multiple classification algorithms.


Hindawi

Security and Communication Networks


Volume 2022, Article ID 5280158, 14 pages
https://doi.org/10.1155/2022/5280158

Research Article
Web Application Firewall Using Machine Learning and
Features Engineering

Aref Shaheed¹ and M. H. D. Bassam Kurdy²
¹ Department of Web Technologies, Syrian Virtual University, Damascus, Syria
² Department of Artificial Intelligence, Syrian Virtual University, Damascus, Syria

Correspondence should be addressed to Aref Shaheed; [email protected]

Received 22 May 2021; Revised 12 July 2021; Accepted 26 March 2022; Published 6 June 2022

Academic Editor: David Megias

Copyright © 2022 Aref Shaheed and M. H. D. Bassam Kurdy. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Web application security has become a major requirement for any business, especially as web attacks continue to spread despite defensive measures and the continuous development of software frameworks and servers. In this study, we present a proposed model for a web application firewall that uses machine learning and features engineering to detect common web attacks. Our proposed model analyses incoming requests to the web server, parses these requests to extract four features that completely describe the parts of the HTTP request (URL, payload, and headers), and classifies whether a request is normal or an anomaly. We took into consideration the limitation of previous works that use only the URL and payload in classification and provided four features that describe and summarize all parts of the HTTP request using features engineering and previous experience in the software security domain. The extracted features are length of request, percentage of characters allowed, percentage of special characters, and attack weight. These features were calculated for four different datasets: CSIC 2010, HTTPParams 2015, a Hybrid dataset (CSIC 2010 and HTTPParams), and real logs from a compromised web server. We evaluated our proposed model by using these updated datasets with four classification algorithms (Naive Bayes, logistic regression, decision tree, and support vector machine) and two evaluation methods (train-test split and cross-validation) to reduce the probability of overfitting and ensure that the features are effective. A normal request usually has a short request length, a large allowed character ratio, a small special character ratio, and an attack weight of zero or close to zero. An anomalous request has a large request length, a small allowed character percentage, a large special character percentage, and a numerically very large attack weight. Our proposed model achieved a classification accuracy of 99.6% with datasets used in research studies in this field and 98.8% with datasets of real web servers.

1. Introduction

Cyberattacks targeting web servers and applications were, and still are, one of the important points taken into consideration when an organization uses technology in its various types of work (applications, operating systems, databases, networks, etc.), and these attacks remain high risk despite the great diversity in the methods of combating them. These methods limited the impact of the attacks but were unable to make a tangible effect.

Despite the implementation of defensive measures by web application developers, attacks are constantly evolving, and there is an urgent need for dedicated software or a product that supports these defensive procedures and works in an integrated manner with them to raise the security level of web applications [1]. Security projects and standards, such as OWASP [2], were published to help developers and white hat hackers increase the security level. Traditional firewalls interact with packets in the network and transport layers [3], while web application firewalls interact with web requests in the application layer [4]. These firewalls operated on signatures [5]: they recognize an attack through its distinct fingerprint, which requires large databases and storing the fingerprint of each attack after it is executed. Reliance on databases (signature-based protection) and hardcoded logic and rules (traditional programming) makes it more difficult to take advantage of expert knowledge by transferring it to the computer.
2037, 2022, 1, Downloaded from https://onlinelibrary.wiley.com/doi/10.1155/2022/5280158 by INASP/HINARI - CAMEROON, Wiley Online Library on [04/04/2025]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License

In recent decades, artificial intelligence has become a scientific revolution [6] and has achieved peerless superiority in mastering the work that humans do. We tend to think that a computer cannot learn and make decisions like humans, but it has become a competitor to human capabilities. In the coming decades, it is expected that artificial intelligence will eliminate many human jobs [7]. Researchers and information security professionals have specifically moved to harness the capabilities of artificial intelligence to detect and combat attacks [8]. The time has come for the machine to work side by side with the human to do what is difficult for him despite having hundreds of millions of real neurons.

Most recent works relied on one dataset only and worked with the URL and payload only. In this article, we used features engineering to present four generalizable features that summarize the whole HTTP request information (URL, payload, and headers), and we used four machine learning classification algorithms in the classification phase to evaluate our proposed model.

The rest of this article is organized as follows: Section 2 presents materials and methods (related works and proposed model), Section 3 presents results and discussion, Section 4 gives the conclusion, and Section 5 contains future work.

2. Materials and Methods

2.1. Related Works. Vulnerabilities of web applications have not changed as concepts; what changes is how they are exploited. The most popular vulnerabilities of web applications are as follows:

(i) Injections: manipulating the input to force a web application to execute arbitrary commands in the operating system and queries in databases [9]. SQL injection is the most famous injection attack [10], and it allows the attacker to interact with the database by reading, writing, and modifying records.
(ii) Broken authentication: exploiting logical flaws and weak points in the authentication mechanism to take over and control accounts [11].
(iii) Sensitive data exposure: manipulating a web application to make it throw exceptions and expose sensitive data such as database credentials [12].
(iv) XML external entity (XXE): manipulating inputs using functions that parse XML to execute arbitrary commands [13].
(v) Broken access control: accessing unauthorized resources in a web application due to weak access control rules, such as accessing the administrator panel when there is no restriction on access to it [14].
(vi) Security misconfigurations: using brute force to find and exploit security misconfigurations such as unpatched flaws, default configurations, unused pages, unprotected files and directories, and unnecessary services [15].
(vii) Cross-site scripting (XSS): injecting JavaScript code into a web application to modify its display and force the victim to execute the code in his browser; there are many types of it, such as Reflected XSS and DOM XSS [16].
(viii) Insecure deserialization: manipulating the inputs of a web application by deserializing them, modifying them, and serializing them again to compromise the web application [17].
(ix) Using components with known vulnerabilities: failing to update a component used in a web application allows attackers to exploit its known vulnerabilities; this type of vulnerability is found in abundance, especially in CMS web applications [18].
(x) Insufficient logging and monitoring: the lack of logging and monitoring mechanisms and techniques, which allows attackers to find and exploit vulnerabilities without being detected [19].

Studies and research on protecting web applications from malicious requests follow two methodologies to detect an attack: identifying a particular attack (such as detecting only SQL injection or only cross-site scripting), or classifying a request as anomalous or normal in general, regardless of the type of attack.

They also follow two approaches to transfer this experience to computers: behavioral-based detection, designed and implemented using artificial intelligence techniques such as classification algorithms or a custom algorithm, and signature-based detection, which uses databases that contain patterns of attacks.

Most of the studies relied on old datasets such as CSIC 2010 [20] and ECML-PKDD 2007 [21]. The proposed models were not evaluated using modern datasets, and the datasets created by some researchers are not available online.

Zhang et al. proposed a framework to detect web attacks by extracting seven features, including web resource, attribute sequence, attribute value, HTTP version, header, and header input value. This framework includes three components: the probability distribution model, the hidden Markov model, and the one-class SVM model. Each of these components can be considered a machine learning model. Each model is trained on a dataset that contains normal requests only and is evaluated using two datasets: Wikipedia access traces [22] and FuzzDB [23]. A multimodel-based method takes advantage of all the models in it; with this method, the authors mitigated the false positive issue significantly [24]. The main advantage of this model is that it combines multiple components into a hybrid one. On the other side, running multiple components may affect performance. A WAF is a real-time service that handles many requests, and performance is an important parameter to take into consideration.

Tekerek and Bay provided a hybrid model for the detection of malicious and normal requests, using signature-based detection to overcome the speed issue for traditional attacks and behavioral-based detection to address the zero-day attack issue, using neural networks with three features

described by mathematical equations as input to this neural network [25]. The advantage of this model is that it uses hybrid detection methods (signature-based and behavioral-based), which increases the performance and speed of implementation. In addition to the simplicity and speed of implementation of the neural network, this research has undergone successive development work in recent years and the model is mature. On the other side, the extracted features cannot be generalized to all web applications: the values of the calculated features depend on the average and deviation of other requests, so a massive log of an application is required before the model can classify the requests arriving at that application. In addition, the use of statistical functions produces anomalous cases that cannot be handled (using the average in the denominator of Equation 2 generates a very large number for requests whose length is near the average request length).

Sharma et al. used features engineering to extract seven features from the incoming request and used three classification algorithms to test their effectiveness. They applied preprocessing procedures to the CSIC 2010 dataset to identify subcategories of malicious requests and overcome the missing features issue [26]. This method yielded many useful features, but some of these features cannot be extracted from the dataset used in this article because the corresponding fields are unavailable in the dataset, such as the length of cookies (see the structure of the CSIC 2010 dataset). In addition, the researchers did not use another dataset and relied only on CSIC 2010.

Vartouni et al. applied an n-gram character-based model to construct features. The size of the feature set increases with the value of n; to overcome this issue and avoid using dimensionality reduction techniques, they applied an autoencoder to extract features as a data abstraction. Deep learning algorithms were used to work more effectively with the extracted features. This approach gave a generalizable model, but the classification accuracy was low in comparison with other models. In addition, they used a single dataset, CSIC 2010 [27].

Hoang used supervised machine learning (an inexpensive decision tree algorithm as a suitable real-time classifier to mitigate the performance issue in terms of speed) to detect four major web attacks (SQLi, XSS, command injection, and path traversal), with N-grams of fixed n value (n = 3) and the PCA (principal component analysis) method to obtain a reduced number of features. He used the HTTPParams dataset [28] and CSIC 2010 to evaluate the proposed model and achieved a high accuracy of 98.56% [29]. Hoang's proposed model takes web server logs as input and turns every single row into a vector; the results of the model focus only on the HTTPParams dataset, and the experiments used only one algorithm.

Niu and Li extracted eight statistical features and used a convolutional neural network (CNN) combined with a gated recurrent unit (GRU) on the CSIC 2010 dataset. Using these techniques together with eight features improved detection performance but at the expense of speed. They evaluated the detection performance of this model by comparing it with other deep learning methods and achieved 99.00% accuracy [30]. Niu and Li also used only one dataset (CSIC 2010). In addition, using deep learning techniques for real-time services like a WAF will affect speed if the model is deployed in a real environment.

All of these previous studies classify the request as normal or anomalous; they use the second methodology of detection.

Other studies detect specific attacks, such as an attack on database rights; they use the first methodology. An example is the study of Dr. Ahmad Ghafarian, in which he provided an approach that puts a line in each table and uses an algorithm that verifies queries before they are executed on the web application. Malicious requests fetch the added line and normal requests do not. This way, an SQL injection attack is detected in real time before it is executed on the web application, but the proposed model detects only one of the seven types mentioned by the researcher in his article. In addition, this study did not discuss the overload on resources and databases or the delay in execution time caused by testing each query before executing it [31].

There are many studies similar to that of Dr. Ahmad Ghafarian that detect SQL injection at runtime, such as AMNESIA (proposed by Halfond and Orso) [32] and CANDID (proposed by Bisht, Madhusudan, and Venkatakrishnan) [33].

2.2. Proposed Model

2.2.1. Architecture. Our proposed model of WAF works as an operating system service, which acts as an intermediary between the web server and the clients. This service receives the request, parses it, extracts features, classifies it, and makes decisions based on the classification result.

The WAF can be configured through a dedicated web application (web control panel, see Figure 1). The proposed WAF consists of the following five basic units (see Figure 2):
(1) Power on/off unit
(2) Training unit
(3) Parsing unit
(4) Classification unit
(5) Decision-making unit

The process starts in the first unit, the power on/off unit. When the WAF is running, the OS service contacts the database and fetches the configurations to run the WAF, initiates the listener, and waits for incoming requests to the WAF, which mediates between the client and the web server.

After running the WAF, the training process starts using the selected dataset and the selected classification algorithm.

After the training process is complete and the first and second units have finished their work, the WAF is ready to receive requests.

When a request arrives at the WAF, the first unit to handle it is the parsing unit, which breaks down the request, extracts the features, and passes them as a vector to the classification unit, which classifies the request according to the classification method that the administrator chose in the training unit.

[Figure 1 components: Client, WAF Web Service, Web Server, Control Panel.]
Figure 1: Architecture of the web server environment with proposal WAF.

[Figure 2 flow: Request, then Power Unit, Training Unit, Parsing Unit, Classification Unit, Decision Making Unit. Is normal request? Yes: pass to web server. No: drop request and redirect to custom page.]
Figure 2: Units of proposal WAF (brief diagram).

After classifying the request, the classification unit sends the result to the decision-making unit, which takes the appropriate action to pass or drop the request (see Figure 3 or Algorithm 1).

(1) Power on/Off Unit. This unit is responsible for controlling the WAF by turning it on and off. When the WAF starts, all configurations are fetched from the database, such as the IP address and port of the firewall and the IP address and port of the web server, to run the service (the listener).

(2) Training Unit. After running the WAF, the dataset name and classification algorithm are fetched from the database to train the model; the WAF is then ready to receive requests.

We used popular classification algorithms; any algorithm can be added by inserting its name in the database (algorithms table) and calling it programmatically in the WAF implementation code.

In addition, any dataset can be added by inserting its name in the database (datasets table) and adding the CSV file of the desired dataset to the dataset folder inside the WAF implementation code folder.

The algorithms used are Naive Bayes, Logistic Regression, Decision Tree, and Support Vector Machine (popular algorithms in binary classification problems) [34].
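As a rough illustration of what the training unit does, the sketch below fits a minimal Gaussian Naive Bayes classifier (one of the four algorithm families named above) on four-element feature vectors of the kind described in Section 2.2.2. The class name `TinyGaussianNB`, the toy training rows, and the variance floor are assumptions made for this example; the article does not show the authors' implementation, which may differ.

```python
import math
from collections import defaultdict


class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes; an illustrative stand-in, not the authors' code."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        total = len(X)
        self.stats = {}
        for label, rows in by_class.items():
            columns = list(zip(*rows))
            means = [sum(col) / len(col) for col in columns]
            # Assumed small floor so a constant feature does not give zero variance.
            variances = [
                sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                for col, m in zip(columns, means)
            ]
            self.stats[label] = (math.log(len(rows) / total), means, variances)
        return self

    def predict(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for x, m, var in zip(row, means, variances)
            )

        return max(self.stats, key=log_posterior)


# Toy vectors: [input length, alphanumeric %, special %, attack weight] (made up).
X_train = [
    [12, 90, 10, 0], [20, 85, 15, 0], [15, 95, 5, 0],
    [120, 55, 45, 300], [200, 50, 50, 450], [90, 60, 40, 150],
]
y_train = ["normal", "normal", "normal", "anomaly", "anomaly", "anomaly"]

model = TinyGaussianNB().fit(X_train, y_train)
print(model.predict([18, 88, 12, 0]))
print(model.predict([150, 52, 48, 300]))
```

In a deployment along the lines the article describes, the algorithm name read from the configuration database would select which classifier class to instantiate before calling `fit`.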

[Figure 3 flow: Start. Power unit: get configuration from database, power on WAF, start WAF listener. Train model using dataset and algorithm (from configuration). New request arrived? No: wait for new requests. Yes: parsing unit parses the request to retrieve basic features and extracts final features; classification unit classifies the request; decision-making unit checks whether the request is normal. No: drop request and redirect to custom page with message "Attack !!". Yes: pass request to web server. Store all data in database. End.]
Figure 3: Units of proposal WAF (detailed diagram).

We preferred to use these algorithms instead of neural networks, which can be used in future works (see Future works for Researchers section).

The datasets used include CSIC 2010, HTTPParams 2015, a Hybrid dataset (generated by combining CSIC 2010 and HTTPParams 2015), and a Custom dataset (logs of real web servers).

(3) Parsing Unit. After the WAF is turned on and the model is trained, the WAF is ready to receive requests. When an HTTP request arrives at the WAF, the parsing unit breaks down the request to extract features (features will be discussed in the next section).

The parsing unit creates a final vector consisting of the features and passes this vector to the classification unit.

Input of data: d (dataset), a (algorithm), p1 (web server port), i1 (web server IP), p2 (WAF port), i2 (WAF IP).
(1) Start
(2) Connect to database to initialize Inputs (d, a, p1, i1, p2, i2)
(3) Start WAF listener using Inputs (p1, i1, p2, i2)
(4) Train WAF using Inputs (d, a)
(5) While WAF listener is "ON":
(6)     If new request R arrived:
(7)         Parse R
(8)         Compute basic features vector B from parsed R
(9)         Compute final features vector V from B
(10)        Compute C (class) of parsed request R by classifying based on V
(11)        If C = "anomaly"
(12)            Drop request
(13)            Redirect to custom page with message "Attack"
(14)        Else // C = "normal"
(15)            Pass request to web server
(16)            Store V and C in database
(17)        Endif
(18)    Endif
(19) EndWhile
(20) End

ALGORITHM 1: Units of proposal WAF (detailed algorithm).
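The per-request part of Algorithm 1 (steps 7 to 16) can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `handle_request`, the stub extractor, and the stub classifier are hypothetical, and the sketch stores every result, following the reading of Figure 3 in which data is stored after both branches.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def handle_request(
    raw_request: str,
    extract_features: Callable[[str], Vector],
    classify: Callable[[Vector], str],
    store: List[Tuple[Vector, str]],
) -> str:
    """Return "pass" or "drop" for one request (steps 7 to 16 of Algorithm 1)."""
    vector = extract_features(raw_request)   # steps 7 to 9
    label = classify(vector)                 # step 10
    store.append((vector, label))            # step 16 (assumed for both branches)
    return "pass" if label == "normal" else "drop"


# Hypothetical stand-ins for the parsing and classification units.
def stub_extract(raw: str) -> Vector:
    attack_weight = 150.0 * raw.count("' or 1=1")
    return [float(len(raw)), 0.0, 0.0, attack_weight]


def stub_classify(vector: Vector) -> str:
    return "anomaly" if vector[3] > 0 else "normal"


log: List[Tuple[Vector, str]] = []
print(handle_request("user=alice", stub_extract, stub_classify, log))
print(handle_request("user=' or 1=1 --", stub_extract, stub_classify, log))
```

Passing the extractor and classifier as arguments mirrors the article's separation of the parsing, classification, and decision-making units, so any trained model can be swapped in without changing the decision logic.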

The parsing unit is implemented using MITM Proxy in Python [35].

(4) Classification Unit. This unit receives the final vector from the parsing unit and classifies the request based on it. The classification unit sends the classification results to the decision-making unit.

(5) Decision-Making Unit. This unit receives the classification results from the classification unit, forwards the request to the web server if the request is normal, and drops the request if it is an anomaly.

2.2.2. Features Engineering. When an HTTP request arrives at the parsing unit, it is dismantled to extract the basic features, and these basic features are used to calculate the final features that are sent to the classification unit (see Algorithm 2).

(1) Basic Features. All basic information extracted directly from the request is called a basic feature; it is the content of the HTTP message [36]. In our proposed model, we have five basic features (see Table 1):
(1) HTTP Method: the HTTP protocol method, or verb, that the client uses to request the resource from the web server; it may be POST, GET, HEAD, OPTIONS, PUT, PATCH, or DELETE
(2) Absolute URL (URL): it includes the IP address or domain of a web application with the resource, for example, https://www.mysite.com/login
(3) Payload: all data submitted by the client (text input, dropdown menu, text area, etc.)
(4) Headers: request headers (original headers or custom headers added by the client or web application)
(5) Files: files are usually included in the payload, but we separated them as an independent feature

(2) Final Features. Final features are used by the WAF to classify the request as normal or anomalous; these features are calculated and extracted from the basic features, and we have four final features (see Table 2):

(1) Input length: it describes the number of characters in the payload, and it is calculated as follows:

$$ l = \sum_{i=0}^{n} c_i \qquad (1) $$

where l is the input length, c is the character in payload, and n is the payload length.
Usually, this feature value is bigger in anomaly requests compared to normal requests (see Figure 4).

(2) Alphanumeric character ratio: it describes the ratio of alphanumeric characters over the input length.
Normal requests usually contain more numeric and alphabetic characters than special characters such as symbols, so this feature has a large value in normal requests compared to anomaly requests (see Figure 5).
This feature is calculated as follows:

$$ a = \sum_{i=0}^{n} \frac{(c_i \mid c_i \in e)}{l} \times 100 \qquad (2) $$

Input of data: R (raw HTTP request)
Output: V (final features vector)
(1) Start
(2) Parse R
(3) Compute B (basic features vector) from parsed R
(4) Compute V (final features vector) from B
(5) End

ALGORITHM 2: Features extraction from raw HTTP request by parsing unit.
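Steps 2 and 3 of Algorithm 2 can be illustrated with a short standard-library sketch that splits a raw HTTP/1.1 request into the five basic features. The article states that the parsing unit is built on MITM Proxy; the plain-text parser below is only a stand-in for illustration. The function name `parse_basic_features`, the `http://` scheme, the treatment of query-string parameters as part of the payload, and the empty `files` list are all assumptions.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_basic_features(raw: str) -> dict:
    """Split a raw HTTP/1.1 request into the five basic features (illustrative only)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    parts = urlsplit(target)
    # Assumption: query-string and form-encoded body parameters both count as payload.
    payload = parse_qsl(parts.query, keep_blank_values=True)
    payload += parse_qsl(body, keep_blank_values=True)

    return {
        "method": method,
        # Assumption: the scheme is not visible in the raw request, so http is used.
        "url": "http://" + headers.get("host", "") + parts.path,
        "payload": payload,
        "headers": headers,
        "files": [],  # multipart file parsing is omitted in this sketch
    }


raw_request = (
    "POST /login?next=%2Fhome HTTP/1.1\r\n"
    "Host: www.mysite.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "\r\n"
    "user=alice&passwd=secret"
)
basic = parse_basic_features(raw_request)
print(basic["method"], basic["url"], basic["payload"])
```

The resulting dictionary plays the role of the basic features vector B, from which the final features vector V is then computed.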

In equation (2), a is the alphanumeric character ratio, c is the character in payload, n is the payload length, e is the set of allowed characters (letters and digits), and l is the input length.

Table 1: Basic features of the proposed model.

Feature | Description
HTTP method | HTTP protocol method
URL | IP address or domain of web application with resource
Payload | All data submitted from the client
Headers | Request headers
Files | Uploaded files in payloads

Table 2: Extracted features of the proposed model.

Feature | Description
Input length | Number of characters in payload
Alphanumeric character ratio | The ratio of alphanumeric characters over input length
Special characters ratio | The ratio of nonalphanumeric characters over input length
Attack weight | Sum of five subfeatures (see Table 3): (i) URL weight; (ii) Number of attack words in inputs; (iii) Manipulate payload weight; (iv) Alphanumeric character to special character ratio; (v) Files weight

Figure 4: Histogram implementation of input length feature in the CSIC 2010 dataset; green values represent normal requests and red values represent anomaly requests. (Axes: number of requests; frequencies in dataset.)

Figure 5: Histogram implementation of alphanumeric character ratio feature in the CSIC 2010 dataset; green values represent normal requests and red values represent anomaly requests. (Axes: number of requests in dataset; frequencies in dataset.)

(3) Special character ratio: it describes the ratio of special (nonalphanumeric) characters over the input length.
Anomaly requests usually contain fewer numeric and alphabetic characters than special characters such as symbols, so this feature has a large value in anomaly requests compared to normal requests (see Figure 6).
This feature is calculated as follows:

$$ s = \sum_{i=0}^{n} \frac{(c_i \mid c_i \in f)}{l} \times 100 \qquad (3) $$

where s is the special character ratio, c is the character in payload, n is the payload length, f is the set of not allowed characters (any character that is not a letter or digit), and l is the input length.
Alternatively, it can be calculated as follows:

$$ s = 1 - a \qquad (4) $$

where s is the special character ratio and a is the alphanumeric character ratio, both taken as fractions; on the percentage scale of equations (2) and (3), the same relation reads s = 100 − a.

(4) Attack weight: it is the most important feature in the classification process. It is calculated by summing five subfeatures.
Anomaly requests usually have a large attack weight compared to normal requests (see Figures 7 and 8).
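Equations (1) to (4) can be checked on a concrete payload with a few lines of Python. This is a sketch under assumptions: the function name `ratio_features` is hypothetical, "alphanumeric" is taken to mean ASCII letters and digits, and an empty payload is given ratios of zero, a case the article does not discuss.

```python
def ratio_features(payload: str):
    """Return (input length, alphanumeric %, special %) for a payload string."""
    length = len(payload)                      # equation (1)
    if length == 0:
        return 0, 0.0, 0.0                     # assumed convention for empty input
    alnum = sum(1 for ch in payload if ch.isascii() and ch.isalnum())
    a = alnum / length * 100                   # equation (2)
    s = (length - alnum) / length * 100        # equation (3), equal to 100 - a
    return length, a, s


l, a, s = ratio_features("id=1' or 1=1 --")
print(l, round(a, 2), round(s, 2))
```

For the injected value above, 7 of the 15 characters are alphanumeric, so the special character percentage exceeds the alphanumeric percentage, which is the pattern the article associates with anomaly requests.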

(1) URL weight


-e value of this subfeature is initialized to
80
zero. Value of this subfeature for each dis-
covered special character, attack word, or
Frequencies in Dataset

access of unauthorized resource.


60
Each discovered malicious have its weight.
-is subfeature is calculated as follows:
40 n
u � 􏽘 wi |di ∈ URL􏼁, (5)
i�0
20
where u is the URL Weight, d is the discovered
malicious, w is the weight of discovered malicious, n
0
–40 –20 0 20 40 60 80 100 is the number of discovered malicious in WAF
Number of requests in Dataset database, and URL is request absolute URL.
Example: http://www.example.com/.env.
Figure 6: Histogram implementation of special character ratio
feature in the CSIC 2010 dataset; green values represent normal u � 200 (200 is the weight of discovered access
request and red values represent anomaly requests. of unauthorized resource—only one discov-
ered malicious in URL).
Figure 7: Histogram of the attack weight feature in the CSIC 2010 dataset for normal requests; red values represent anomaly requests. (x-axis: number of requests in dataset; y-axis: frequencies in dataset.)

Figure 8: Histogram of the attack weight feature in the CSIC 2010 dataset for anomaly requests. (x-axis: number of requests in dataset; y-axis: frequencies in dataset.)

(2) Number of attack words in inputs
The value of this subfeature is initialized to zero. It increases for each attack word discovered in the payload and headers.
Each discovered attack word has its own weight.
This subfeature is calculated as follows:

$$v = \sum_{i=0}^{n} \left( w_i \mid d_i \in \mathrm{Input} \right), \tag{6}$$

where v is the value of the attack words subfeature, d_i is a discovered attack word, w_i is the weight of that attack word, n is the number of attack words in the WAF database, and Input is the request headers and payloads.
Example: Email = [email protected]&passwd=' or 1=1 --&mode=' or 1=1 --
v = 150 + 150 = 300 (150 is the weight of the discovered SQLI, which occurs twice).

(3) Manipulate payloads weight
The value of this subfeature is initialized to zero. It increases for each manipulation discovered in the payload and headers.
Manipulation is passing wrong data to the application to throw an exception and expose sensitive data, for example, passing a string as a mobile number.
Each discovered manipulation has its own weight.
This subfeature is calculated as follows:

$$m = \sum_{i=0}^{n} \left( w_i \mid d_i \in \mathrm{Input} \right), \tag{7}$$

where m is the manipulate payload weight, d_i is a discovered manipulation, w_i is the weight of that manipulation, n is the number of manipulations in the WAF database, and Input is the request headers and payloads.
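Equations (6) and (7) can be sketched in Python as shown below. The pattern table, the field rules, and both function names are hypothetical stand-ins for the WAF database, which the paper does not list. The SQLI weight of 150 and the manipulation weight of 100 are the values used in the paper's own examples; every other entry is assumed.

```python
import re

# Hypothetical attack-word table (regex pattern -> weight).
ATTACK_WORDS = {
    r"'\s*or\s+1\s*=\s*1": 150,  # SQL injection tautology (weight from the example)
    r"<script\b": 150,           # assumed weight for an XSS marker
}

# Hypothetical field rules: a parameter name maps to the pattern its value must match.
FIELD_RULES = {
    "mobile": re.compile(r"\+?\d{6,15}"),
}
MANIPULATION_WEIGHT = 100  # value used in the paper's mobile-number example


def attack_word_weight(text, table=ATTACK_WORDS):
    """Eq. (6): add a word's weight once for each occurrence in the input."""
    return sum(
        weight * len(re.findall(pattern, text, flags=re.IGNORECASE))
        for pattern, weight in table.items()
    )


def manipulation_weight(params, rules=FIELD_RULES, weight=MANIPULATION_WEIGHT):
    """Eq. (7): add a weight for every field whose value violates its rule."""
    return sum(
        weight
        for name, value in params.items()
        if name in rules and not rules[name].fullmatch(value)
    )
```

With these tables, the input `passwd=' or 1=1 --&mode=' or 1=1 --` scores 150 + 150 = 300, and a request whose `mobile` field contains `Hello` scores 100.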

Example: Email = aref@mail.com&mobile = Hello
m = 100 (100 is the weight of the discovered manipulation in the payload, passing a string as a mobile number; only one manipulation was discovered).

(4) Alphanumeric characters to special character ratio
The value of this subfeature is determined by dividing the special character ratio by the alphanumeric character ratio (s/a) and comparing the result with a threshold.
If the alphanumeric character ratio or the special character ratio equals zero, the subfeature value is set to zero.
This subfeature is calculated as follows:

$$r = \begin{cases} 500, & \dfrac{s}{a} \ge 0.3, \\ 0, & \dfrac{s}{a} < 0.3, \end{cases} \tag{8}$$

where r is the alphanumeric character to special character ratio, s is the special character ratio, and a is the alphanumeric character ratio.
Example: Email = [email protected]&passwd = ' or 1 = 1 -- &mode = ' or 1 = 1 --
a = 40/55 (number of letters and digits over input length).
s = 15/55 (number of nonalphanumeric characters such as ', =, -, @, and . over input length).
s/a = 0.375
r = 500.

(5) Files weight
The value of this subfeature is initialized to zero. It increases for each suspicious case discovered in files.
Suspicious cases:
(i) Invalid file extension (.exe, .bin, .php, etc.)
(ii) Positive results of antivirus scanning (we used three antiviruses: Kaspersky, MalwareBytes, and BitDefender)
Each discovered malicious item has its own weight.
This subfeature is calculated as follows:

$$f = w_1 + w_2 + w_3 + w_4, \tag{9}$$

where f is the file weight, w_1 = 300 if the file extension is invalid or 0 if it is valid, w_2 = 200 if Kaspersky detects the file as a virus or 0 if not, w_3 = 200 if MalwareBytes detects the file as a virus or 0 if not, and w_4 = 200 if BitDefender detects the file as a virus or 0 if not.
Example: uploaded file: shell.php
w_1 = 300 (invalid extension).
w_2 = 200 (Kaspersky detected it as a virus).
w_3 = 200 (MalwareBytes detected it as a virus).
w_4 = 200 (BitDefender detected it as a virus).
f = 300 + 200 + 200 + 200 = 900.
The file weight is calculated for each uploaded file, and the results are summed to get the final files weight for all files in the request as follows:

$$F = \sum_{i=0}^{n} f_i, \tag{10}$$

where F is the files weight, n is the number of files in the request, and f_i is the weight of file i.
Finally, the value of the attack weight feature is calculated by summing all subfeatures as follows:

$$z = u + v + m + r + F, \tag{11}$$

where z is the attack weight, u is the URL weight, v is the number of attack words in inputs, m is the manipulation payload weight, r is the alphanumeric character to special character ratio, and F is the files weight.
Now, all requests are converted from the raw form to the final form (four features with the label).
Table 4 shows a sample of the dataset generated by our model; the label is 1 for anomaly requests and 0 for normal requests.

3. Results and Discussion

We extracted a set of generalizable features from HTTP requests to detect common attacks on web applications.
We used four datasets: CSIC 2010, HTTPParams 2015, a Hybrid dataset (CSIC + HTTPParams), and custom web server logs (compromised real server). The last dataset was not published in the GitHub repository for privacy reasons.
We used five basic features extracted from HTTP requests to calculate the final features. The basic features are HTTP Protocol (HTTP Method), Absolute URL (URL), payload, headers, and files. The extracted features are the length of the request, the percentage of allowed characters, the percentage of special characters, and the attack weight.
We used various classification algorithms that work efficiently on binary classification problems, such as Logistic Regression, Decision Tree, and Naive Bayes, but we focus on Naive Bayes.

3.1. Experiments

3.1.1. Experimental Environment. The WAF was implemented on a web server running Linux Xubuntu 20.04 LTS. This server contains the apache2 service (web service), a web control panel (web application developed using Django-Python), and the WAF service (daemon service).

3.1.2. Preprocessing Datasets. Four datasets were used in this study: CSIC 2010, HTTPParams, a Hybrid dataset (CSIC 2010 and HTTPParams), and a custom dataset of compromised web server logs. Data preparation was performed on these datasets (removing missing values, duplicates, and outliers), and they were exported as CSV so that they can be used with machine learning algorithms in Python (Scikit-learn package).
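The data preparation step described above (removing missing values, duplicates, and outliers, then exporting to CSV) might look like the following pandas sketch. The function name, the column names in the usage note, and the interquartile-range rule for outliers are assumptions; the paper does not state how outliers were identified.

```python
import pandas as pd


def prepare_dataset(df, numeric_cols, out_path=None, iqr_factor=1.5):
    """Drop missing values, duplicates, and IQR outliers; optionally export CSV."""
    clean = df.dropna().drop_duplicates()
    for col in numeric_cols:
        # Assumed outlier rule: keep values within iqr_factor * IQR of the quartiles.
        q1, q3 = clean[col].quantile([0.25, 0.75])
        spread = q3 - q1
        low, high = q1 - iqr_factor * spread, q3 + iqr_factor * spread
        clean = clean[clean[col].between(low, high)]
    clean = clean.reset_index(drop=True)
    if out_path is not None:
        clean.to_csv(out_path, index=False)
    return clean
```

A call such as `prepare_dataset(df, ["payload_len"], "dataset.csv")` would write the cleaned rows to a CSV file that Scikit-learn code can load. Outlier removal should be applied with care in this setting, since unusually long requests may be exactly the anomalies of interest.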

Table 3: Subfeatures of attack weight feature.

Subfeature | Description
URL weight | Sum of weights of discovered manipulation in URL
Number of attack words in inputs | Sum of weights of discovered attack words in inputs
Manipulate payload weight | Sum of weights of discovered manipulation in payloads
Alphanumeric character to special character ratio | The ratio of the number of alphanumeric characters to the number of nonalphanumeric characters
Files weight | Sum of weights of the malicious files (malicious file weight is the weight of extension + sum of weights of scan using three antiviruses)
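The last two subfeatures in Table 3 and the final sum, equations (8) to (11), can be sketched in Python as follows. The function names and the extension set are hypothetical; the set lists only the extensions the paper gives as examples. The antivirus verdicts are passed in as booleans because the sketch does not call any scanner, and it treats every non-alphanumeric character, including whitespace, as special, which is an assumption.

```python
import os

INVALID_EXTENSIONS = {".exe", ".bin", ".php"}  # examples from the paper, not a full list


def ratio_weight(text, threshold=0.3, weight=500):
    """Eq. (8): return 500 when special/alphanumeric reaches the threshold."""
    alnum = sum(ch.isalnum() for ch in text)
    special = len(text) - alnum
    if alnum == 0 or special == 0:
        return 0
    # s/a equals special/alnum because both ratios share the input length.
    return weight if special / alnum >= threshold else 0


def file_weight(filename, antivirus_hits):
    """Eq. (9): 300 for an invalid extension plus 200 per positive scan."""
    ext = os.path.splitext(filename)[1].lower()
    w1 = 300 if ext in INVALID_EXTENSIONS else 0
    return w1 + 200 * sum(bool(hit) for hit in antivirus_hits)


def attack_weight(u, v, m, r, files):
    """Eq. (11): z = u + v + m + r + F, with F from Eq. (10)."""
    total_files = sum(file_weight(name, hits) for name, hits in files)
    return u + v + m + r + total_files
```

For the paper's examples, an input with 40 alphanumeric and 15 special characters gives `ratio_weight` = 500, and `file_weight("shell.php", [True, True, True])` gives 900.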

Table 4: Sample of the dataset used to train the model; each request has four features and a label.
payload_len Alpha non_alpha attack_feature Label
0 0 0 0 0
41 95.45454545 4.545454545 200 1
241 100 0 0 0
9 100 0 0 0
24 94.73684211 5.263157895 2600 1
54 77.77777778 22.22222222 90000 1
75 100 0 0 0
103 87.37864078 12.62135922 60000 1
91 84.61538462 15.38461538 90000 1
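The first three columns of Table 4 could be derived from a raw request string roughly as in the sketch below. It assumes that "allowed" characters are the alphanumeric ones and that percentages are taken over the full input length. The paper does not spell out either point, so this is an illustration rather than the authors' exact procedure, and the function name is hypothetical.

```python
def request_features(text, attack_weight=0):
    """Return (payload_len, alpha, non_alpha, attack_feature) for one request."""
    length = len(text)
    if length == 0:
        # An empty request maps to an all-zero row.
        return (0, 0.0, 0.0, attack_weight)
    alnum = sum(ch.isalnum() for ch in text)
    alpha = 100.0 * alnum / length
    return (length, alpha, 100.0 - alpha, attack_weight)
```

The fourth value is passed in because the attack weight is computed separately from the subfeatures in Table 3.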

CSIC 2010 contains 18 columns, the custom compromised web server logs contain 10 columns, and HTTPParams 2015 contains 4 columns.
After preprocessing and dimension reduction, each of these datasets has only 5 numeric columns: all columns that contain information about the request were removed and replaced by the final feature columns, which describe the request briefly and effectively. Thereafter, the Hybrid dataset became very simple.

3.1.3. Training. We used four classification algorithms (Naive Bayes, Logistic Regression, Decision Tree, and SVM). The four datasets were fed to the classifier using two methods, a train-test split (80%, 20%) and cross-validation (100 folds), and the results were very close.
Mixing and shuffling the rows of CSIC 2010 and HTTPParams 2015 into a new dataset (Hybrid dataset) gave results very close to those of the classifier on each dataset separately (see Table 5).
These experiments make overfitting unlikely and show that the final features of our proposed model are effective.

3.2. Results

3.2.1. Results Based on Datasets. Our proposed model used Naive Bayes with cross-validation (100 folds) and achieved an accuracy of 98.8% with the dataset created from logs of a compromised real web server, 97.61% with the HTTPParams dataset, 99.58% with the CSIC dataset, and 96.40% with the Hybrid dataset (combination of CSIC 2010 and HTTPParams 2015).
Using four different datasets reduces the likelihood of overfitting; to confirm this, k-fold cross-validation was also used in training [37].
Most of the related works used the CSIC 2010 dataset with or without a custom dataset, and we used it in the proposed model so that it can be compared with previous models (see Figure 9 or Table 5).
The implementation of the proposed model includes a function to export WAF records as a new dataset with the ability to correct records. Administrators can train the proposed model using this exported dataset to strengthen the WAF in protecting its web applications.
Most misclassifications are false positives, that is, normal requests classified as anomaly requests (not the opposite).

3.2.2. Results Compared to Related Works. Our proposed model achieved a high accuracy of 98.8% compared with related works. The following table shows the results for CSIC 2010, HTTPParams, and custom datasets created by the researchers (see Figure 10 or Table 6).

3.3. Comparison

3.3.1. Limitations of Previous Works. Researchers have provided many models for detecting web attacks, and despite their various features, these studies share some common weaknesses, which can be summarized as follows:

(1) The extracted features are not general; most of them fit only the web application from which they were extracted.
(2) Old datasets such as CSIC are used, and the evaluation of the model depends on the results of training on them. In addition, the modern datasets used are not available on the Internet.

Table 5: Classification accuracy of our proposed model for various datasets using Naive Bayes.

 | CSIC 2010 | HTTPParams 2015 | Hybrid dataset | Compromised web server dataset
Number of normal requests | 28,800 | 19,305 | 48,105 | 60,250
Number of anomaly requests | 11,213 | 11,764 | 22,977 | 5,210
Classification accuracy (80% training, 20% testing) | 99.59% | 97.91% | 96.40% | 98.80%
Classification accuracy (100-fold cross-validation) | 99.71% | 98.02% | 96.66% | 98.97%
False positive rate | 0.54% | 1.20% | 3.35% | 0.84%
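The two evaluation protocols and the false positive rate reported in Table 5 can be sketched with Scikit-learn as below. `GaussianNB` is an assumption, since the paper names Naive Bayes without stating the variant, and the stratified split and fixed seed are also assumptions. The helper `make_demo_data` generates synthetic data only to make the sketch runnable; it is not the paper's data and will not reproduce the reported numbers.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB


def evaluate(X, y, folds=100, seed=0):
    """Return hold-out accuracy, mean cross-validated accuracy, and hold-out FPR."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = GaussianNB().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    holdout_acc = float((y_pred == y_test).mean())

    # Rows are true labels, columns are predictions; label 1 means anomaly.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    fpr = float(fp / (fp + tn)) if (fp + tn) else 0.0

    cv_acc = float(cross_val_score(GaussianNB(), X, y, cv=folds).mean())
    return holdout_acc, cv_acc, fpr


def make_demo_data(n=500, seed=0):
    """Synthetic four-feature rows (length, alpha %, non-alpha %, attack weight)."""
    rng = np.random.default_rng(seed)
    normal = rng.normal([40, 95, 5, 0], [10, 3, 3, 1], size=(n, 4))
    attack = rng.normal([90, 80, 20, 500], [20, 5, 5, 100], size=(n, 4))
    X = np.vstack([normal, attack])
    y = np.array([0] * n + [1] * n)
    return X, y
```

Running `evaluate(*make_demo_data())` returns the three metrics for the synthetic data. Here the false positive rate is the share of normal requests in the hold-out set that are classified as anomalies.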

Figure 9: Classification accuracy of our proposed model for various datasets. (Bar chart; x-axis: datasets (CSIC 2010, HTTPParams 2015, Hybrid (CSIC + HTTPParams), Custom IIS Logs); y-axis: accuracy.)

Figure 10: Classification accuracy of our proposed model compared with related works. (Bar chart; x-axis: web application firewalls (Tekerek A. and Bay O.F. (2019), Sharma S., Zavarsky P. and Butakov S. (2020), Ghafarian A. (2017), proposed model); y-axis: accuracy.)

Table 6: Classification accuracy of our proposed model compared with related works.

 | Our proposed model | Tekerek and Bay [25] | Sharma et al. [26] | Ghafarian [31]
CSIC 2010 | 99.59% | 96.74% | 94.7% | 88.32%
ECML-PKDD 2007 | Not tested | 94.53% | Not tested | Not tested
HTTPParams 2015 | 97.61% | Not tested | Not tested | Not tested
Custom dataset | 98.8% | 98.52% | Not tested | Not tested

(3) Some related works contain errors and inaccurate information, such as the study by Sharma S., Zavarsky P., and Butakov S. (2020) [26]. They used features that cannot be extracted from CSIC 2010 (e.g., the _cookie_len feature).
(4) Most of the related works process the payload only, without taking headers and files into consideration.
(5) Hybrid models are rare (among the related works, only the paper by Tekerek A. and Bay O.F. (2019) presents a hybrid model).
(6) Most of the related works detect common web attacks such as XSS and SQLI; no suggested model can detect attacks that are performed using normal requests, such as DOS attacks.

3.3.2. Advantages of the Proposed Model. The disadvantages and weak points of related works were taken into consideration while designing and preparing our proposed model. The features extracted in this model are general and can work with any web application. In addition, we used various datasets: standard datasets such as CSIC 2010 to compare our model with related works, modern datasets such as HTTPParams 2015, a Hybrid dataset, and a custom dataset of a real compromised web server. The final features describe all parts of the HTTP request, including headers and files. Finally, a high accuracy rate was achieved (98.8% for the custom dataset and 99.6% for the standard dataset).

4. Conclusion

In this article, we proposed a web application firewall model that used machine learning techniques and features engineering to detect common web attacks. We took into consideration major limitations in previous works (not using request headers, using one dataset only, and the absence of general features). Features engineering and previous experience in the software security domain were used to extract general and comprehensive features that describe and summarize requests and make the classification problem much easier. We extract the final four features from HTTP requests using basic features. Basic features are all the basic information extracted directly from the request; we have five basic features: HTTP protocol (HTTP method), absolute URL (URL), payload, headers, and files. Final features are all features that are calculated and extracted based on basic features; we have four extracted features: input length, alphanumeric character ratio, special character ratio, and attack weight. For normal requests, the extracted features usually show a short request length, a large allowed character ratio, a small special character ratio, and a risk weight of zero or close to zero. For anomaly requests, they usually show a large request length, a small allowed character percentage, a large special character percentage, and a very large numeric risk weight. To increase the security level, we suggest training our proposed model on web server records of the web applications that will be protected by the WAF. Any classification algorithm can be used, but we used algorithms that work efficiently on binary classification problems, such as Logistic Regression, Decision Tree, and Naive Bayes; we focus on Naive Bayes. Our proposed model achieved a high classification accuracy of 99.6% with the standard dataset used in research studies in this field (CSIC 2010), and 98.8% with the dataset of a real compromised web server.

5. Future Works

The domains of future work are wide and can be summarized as follows (see the next three subsections for more information):
Researchers in the information security domain can develop the proposed WAF by feeding it with more datasets (a dataset generated by our proposed WAF or a custom dataset created from web server logs) or by adding or modifying current features. They can also develop separate components and integrate them with our proposed WAF (a signature-based model to check the request before passing it to the classifier, a DOS attack detector, and a natural language processing model to identify attack words instead of using a table in the database to store these attack words).
Software engineers, developers, and information security engineers can use our proposed model to evaluate their applications and improve their skills by learning how to write secure source code.
Sponsors and businesspersons can invest money to develop the proposed model into a commercial product.

5.1. Future Works for Researchers in the Information Security Domain

(1) Export web server logs as a dataset after deploying the WAF for a specified period in a real environment, and use neural networks instead of the algorithms used in this article.
(2) Use natural language processing to generate rules to detect common attack words and malicious payloads instead of using hardcoded arrays. Common attack words and malicious payloads are implemented as arrays within the proposed model.
(3) Use reinforcement learning to obtain feedback and use it during decision-making (this proposal can be implemented after the proposed model becomes mature and ready, so reinforcement learning is not a good option in real-time applications).
(4) Use various and new datasets (whether dimensionality reduction is required depends on the number of features [38, 39]).
(5) Extend the proposed model to detect attacks that use normal requests, such as DOS attacks and brute force attacks. The proposed model cannot detect these types of attacks because it detects attacks only by detecting anomaly requests.
(6) Use features engineering to modify or add features to the model. Features are the most important component in the model, and they depend on the researcher's experience in the field of information security.

(7) Extend the proposed model by adding components that work with signature-based detection. The proposed model works by using one technique, which is detection depending on the content through features extracted from the incoming request. The proposed system can be combined with different detection techniques to provide a hybrid model (see the study by Tekerek A. and Bay O.F. (2019) [25] in related works).
(8) Use ensemble classifiers instead of the proposed classifier (Naive Bayes) to increase the efficiency of request classification [40]. This depends on balancing security sensitivity against performance: ensemble classifiers usually increase the efficiency of classification at the expense of speed, which matters because the WAF is a real-time service.

5.2. Future Works for Software Engineers, Developers, and Information Security Engineers

(1) Training the model by supplying logs of their web applications as a dataset. This will increase the security level of the WAF that protects their web applications, and the WAF will gain experience.
(2) Implementing the proposed model to support Windows operating systems (the current implementation supports Linux distributions only).
(3) Installing vulnerable web applications such as DVWA on the web server and trying to bypass the WAF; this will help information security engineers learn new methods of bypassing a WAF and help researchers modify features to prevent these bypasses.

5.3. Future Works for Sponsors and Businesspersons. Investing money to develop the proposed model into a product in the security and IT market.

Data Availability

CSIC 2010, HTTPParams 2015, and a hybrid dataset with Python code to train these datasets are available in the following repository: https://github.com/aref2008/waf. We recommend reading README.md for all usage instructions. We did not publish the custom dataset (real compromised web server logs) due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The authors thank everyone who provided any kind of support to complete this work, especially the Hindawi Foundation, Mrs. Nadia Ali, the reviewers, and the editors Prof. Megias David, Prof. Roberto Di Pietro, and Prof. Jhaveri Rutvij for their great efforts.

References

[1] M. Choraś and R. Kozik, “Machine learning techniques applied to detect cyber attacks on web applications,” Logic Journal of IGPL, vol. 23, no. 1, pp. 45–56, 2015.
[2] D. Wichers and J. Williams, “Owasp Top Ten,” The open web application security project, vol. 3, 2017.
[3] Z. J. Huang, O. Ai, and X. U. Hong-xian, “Network Security and Firewall Technology,” Journal of Naval University of Engineering, vol. 1, 2002.
[4] A. H. Yaacob, M. Nazrul, N. Ahmad, and M. Roslee, “Moving towards positive security model for web application firewall,” International Journal of Computer and Information Engineering, vol. 6, no. 12, pp. 1763–1768, 2012.
[5] P. P. Mukkamala and S. Rajendran, “A survey on the different firewall technologies,” International Journal of Engineering Applied Sciences and Technology, vol. 5, no. 1, pp. 363–365, 2020.
[6] W. Wang and K. Siau, “Artificial intelligence, machine learning, automation, robotics, future of work and future of humanity,” Journal of Database Management, vol. 30, pp. 61–79, 2019.
[7] M.-H. Huang and R. T. Rust, “Artificial intelligence in service,” Journal of Service Research, vol. 21, no. 2, pp. 155–172, 2018.
[8] J. H. Li, “Cyber security meets artificial intelligence: a survey,” Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 12, pp. 1462–1474, 2018.
[9] A. K. Dalai and S. Kumar Jena, “Neutralizing SQL Injection Attack Using Server Side Code Modification in Web applications,” Security and Communication Networks, vol. 2017, Article ID 3825373, 2017.
[10] D. Mitropoulos, V. Karakoidas, P. Louridas, and D. Spinellis, “Countering Code Injection Attacks: A Unified Approach,” Information Management & Computer Security, vol. 19, no. 3, 2011.
[11] M. M. Hassan, S. S. Nipa, M. Akter et al., “Broken authentication and session management vulnerability: a case study of web application,” International Journal of Simulation: Systems, Science & Technology, vol. 19, no. 2, pp. 1–6, 2018.
[12] J. Doshi and T. Bhushan, “Sensitive data exposure prevention using dynamic database security policy,” International Journal of Computer Application, vol. 106, no. 15, pp. 18600–19869, 2014.
[13] A. A. Osincev and O. R. Laponina, “Vulnerability testing in web applications external entities XML,” International Journal of Open Information Technologies, vol. 7, no. 10, pp. 71–79, 2019.
[14] D. H. Lee, J. W. Lee, and J. G. Kim, “Verification methods of OWASP TOP 10 security vulnerability under multi-tenancy web site’s environments,” Convergence security journal, vol. 16, no. 4, pp. 43–51, 2016.
[15] P. Jayabalan, R. Ibrahim, and A. A. Manaf, “Understanding cybercrime in Malaysia: an overview,” Sains Humanika, vol. 2, no. 2, 2014.
[16] B. K. Ayeni, J. B. Sahalu, and K. R. Adeyanju, “Detecting cross-site scripting in web applications using fuzzy inference system,” Journal of Computer Networks and Communications, vol. 2018, pp. 1–10, 2018.
[17] V. Pedreira, D. Barros, and P. Pinto, “A review of attacks, vulnerabilities, and defenses in industry 4.0 with new challenges on data sovereignty ahead,” Sensors, vol. 21, p. 5189, 2021.

[18] N. Mendes, J. Duraes, and H. Madeira, “Benchmarking the security of web serving systems based on known vulnerabilities,” in Proceedings of the 2011 5th Latin-American Symposium on Dependable Computing, pp. 55–64, IEEE, Sao Jose dos Campos, Brazil, April 2011.
[19] M. Malviya, A. Jain, and N. Gupta, “Improving security by predicting anomaly user through web mining: a review,” International Journal of Advances in Engineering & Technology, vol. 1, no. 2, p. 28, 2011.
[20] C. T. Giménez, A. P. Villegas, and G. Á. Marañón, “Http dataset CSIC 2010,” 2010, https://www.isi.csic.es/dataset/.
[21] F. Eisterlehner, A. Hotho, and R. Jäschke, “ECML/PKDD dataset,” 2007, https://gitlab.fing.edu.uy/gsi/web-application-attacks-datasets/-/tree/master/ecml_pkdd.
[22] G. P. Urdaneta and G. V. S. Maarten, “Wikipedia access traces Datasets,” 2008, http://www.wikibench.eu/?page_id=60.
[23] FuzzDB, 2007, https://code.google.com/p/fuzzdb/.
[24] M. Zhang, S. Lu, and B. Xu, “An anomaly detection method based on multi-models to detect web attacks,” in Proceedings of the 2017 10th International Symposium on Computational Intelligence and Design (ISCID), pp. 404–409, IEEE, Hangzhou, China, December 2017.
[25] A. Tekerek and O. F. Bay, “Design and implementation of an artificial intelligence-based web application firewall model,” Neural Network World, vol. 29, no. 4, pp. 189–206, 2019.
[26] S. Sharma, P. Zavarsky, and S. Butakov, “Machine learning based intrusion detection system for web-based attacks,” in Proceedings of the 2020 IEEE 6th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS), pp. 227–230, IEEE, Baltimore, MD, USA, May 2020.
[27] A. M. Vartouni, S. S. Kashi, and M. Teshnehlab, “An anomaly detection method to detect web attacks using stacked auto-encoder,” in Proceedings of the 2018 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), pp. 131–134, IEEE, Kerman, Iran, March 2018.
[28] “HttpParams dataset,” 2015, https://github.com/Morzeux/HttpParamsDataset.
[29] X. D. Hoang, “Detecting common web attacks based on machine learning using web log,” in Proceedings of the International Conference on Engineering Research and Applications, pp. 311–318, Springer, Thai Nguyen, December 2020.
[30] Q. Niu and X. Li, “A high-performance web attack detection method based on CNN-GRU model,” in Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), pp. 804–808, IEEE, Chongqing, China, June 2020.
[31] A. Ghafarian, “A hybrid method for detection and prevention of SQL injection attacks,” in Proceedings of the 2017 Computing Conference, pp. 833–838, IEEE, London, UK, July 2017.
[32] W. G. J. Halfond and A. Orso, “Preventing SQL injection attacks using AMNESIA,” in Proceedings of the 28th International Conference on Software Engineering, pp. 795–798, Shanghai, China, May 2006.
[33] P. Bisht, P. Madhusudan, and V. N. Venkatakrishnan, “Candid,” ACM Transactions on Information and System Security, vol. 13, pp. 1–39, 2010.
[34] R. Kumari and S. K. Srivastava, “Machine learning: a review on binary classification,” International Journal of Computer Application, vol. 160, p. 7, 2017.
[35] M. Proxy: https://docs.mitmproxy.org/stable/.
[36] S. Suroto, “A review of defense against slow HTTP attack,” JOIV International Journal on Informatics Visualization, vol. 1, no. 4, pp. 127–134, 2017.
[37] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, “Supervised machine learning: a review of classification techniques,” Emerging artificial intelligence applications in computer engineering, vol. 160, no. 1, pp. 3–24, 2007.
[38] R. Abdulhammed, H. Musafer, A. Alessa, M. Faezipour, and A. Abuzneid, “Features dimensionality reduction approaches for machine learning based network intrusion detection,” Electronics, vol. 8, no. 3, p. 322, 2019.
[39] G. T. Reddy, M. P. K. Reddy, K. Lakshmanna et al., “Analysis of dimensionality reduction techniques on big data,” IEEE Access, vol. 8, pp. 54776–54788, 2020.
[40] G. T. Reddy, S. Bhattacharya, S. S. Ramakrishnan et al., “An ensemble based machine learning model for diabetic retinopathy classification,” in Proceedings of the 2020 International Conference on Emerging Trends in Information Technology and Engineering (Ic-ETITE), pp. 1–6, IEEE, Vellore, India, Feb 2020.
