Security and Communication Networks - 2022 - Shaheed - Web Application Firewall Using Machine Learning and Features
Research Article
Web Application Firewall Using Machine Learning and Features Engineering
Aref Shaheed (1) and M. H. D. Bassam Kurdy (2)
(1) Department of Web Technologies, Syrian Virtual University, Damascus, Syria
(2) Department of Artificial Intelligence, Syrian Virtual University, Damascus, Syria
Received 22 May 2021; Revised 12 July 2021; Accepted 26 March 2022; Published 6 June 2022
Copyright © 2022 Aref Shaheed and M. H. D. Bassam Kurdy. This is an open access article distributed under the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.
Web application security has become a major requirement for any business, especially as web attacks continue to spread despite defensive measures and the continuous development of software frameworks and servers. In this study, we present a model for a web application firewall that uses machine learning and features engineering to detect common web attacks. The model analyses incoming requests to the web server, parses them to extract four features that fully describe the parts of an HTTP request (URL, payload, and headers), and classifies each request as normal or anomalous. We took into consideration the limitation of previous works, which use only the URL and payload in classification, and provide four features that describe and summarize all parts of the HTTP request, using features engineering and previous experience in the software security domain. The extracted features are the length of the request, the percentage of allowed characters, the percentage of special characters, and the attack weight. These features were calculated for four datasets: CSIC 2010, HTTPParams 2015, a Hybrid dataset (CSIC 2010 and HTTPParams), and real logs from a compromised web server. We evaluated the model on these updated datasets with four classification algorithms (Naive Bayes, logistic regression, decision tree, and support vector machine) and two evaluation methods (train/test split and cross-validation) to rule out overfitting and confirm that the features are effective. A normal request usually has a short request length, a large allowed-character ratio, a small special-character ratio, and an attack weight at or close to zero. An anomalous request usually has a large request length, a small allowed-character percentage, a large special-character percentage, and a very large attack weight. The model achieved a classification accuracy of 99.6% with datasets used in research studies in this field and 98.8% with datasets from real web servers.
In recent decades, artificial intelligence has become a scientific revolution [6] and has achieved peerless superiority in mastering the work that humans do. We used to think that a computer cannot learn and make decisions like humans, but it has instead become a competitor to human capabilities. In the coming decades, artificial intelligence is expected to eliminate many human jobs [7]. Researchers and information security professionals have specifically moved to harness the capabilities of artificial intelligence to detect and combat attacks [8]. The time has come for the machine to work side by side with the human to do what is difficult for him, despite his having hundreds of millions of real neurons.
Most recent works relied on one dataset only and worked with the URL and payload only. In this article, we used features engineering to present four generalizable features that summarize the whole HTTP request (URL, payload, and headers), and we used four machine learning classification algorithms in the classification phase to evaluate our proposed model.
The rest of this article is organized as follows: Section 2 presents materials and methods (related works and proposed model), Section 3 presents results and discussion, Section 4 gives the conclusion, and Section 5 contains future work.
2. Materials and Methods
2.1. Related Works. Vulnerabilities of web applications have not changed as concepts; what changes is how they are exploited. The most popular vulnerabilities of web applications are as follows:
(i) Injections: manipulating the input to force a web application to execute arbitrary commands in the operating system and queries in databases [9]. SQL injection is the most famous injection attack [10]; it allows the attacker to interact with the database by reading, writing, and modifying records.
(ii) Broken authentication: exploiting logical flaws and weak points in the authentication mechanism to take over and control accounts [11].
(iii) Sensitive data exposure: manipulating a web application to make it throw exceptions and expose sensitive data such as database credentials [12].
(iv) XML external entity (XXE): manipulating inputs using functions that parse XML to execute arbitrary commands [13].
(v) Broken access control: accessing unauthorized resources in a web application due to weak access control rules, such as accessing the administrator panel when access to it is not restricted [14].
(vi) Security misconfigurations: using brute force to find and exploit security misconfigurations such as unpatched flaws, default configurations, unused pages, unprotected files and directories, and unnecessary services [15].
(vii) Cross-site scripting (XSS): injecting JavaScript code into a web application to modify its display and force the victim to execute the code in his browser; there are many types, such as Reflected XSS and DOM XSS [16].
(viii) Insecure deserialization: manipulating the inputs of a web application by deserializing them, modifying them, and serializing them again to compromise the web application [17].
(ix) Using components with known vulnerabilities: failing to update the components used in a web application allows attackers to exploit their known vulnerabilities; this type of vulnerability is found in abundance, especially in CMS web applications [18].
(x) Insufficient logging and monitoring: the lack of logging and monitoring mechanisms and techniques, which allows attackers to find and exploit vulnerabilities without being detected [19].
Studies and research on protecting web applications from malicious requests have followed two methodologies to detect an attack: identifying a particular attack (such as detecting only SQL injection or only cross-site scripting), or classifying a request as an anomaly or normal in general, regardless of the type of attack.
They also followed two approaches to transfer this experience to computers: behavioral-based detection, designed and implemented using artificial intelligence techniques such as classification algorithms or a custom algorithm, and signature-based detection using databases that contain attack patterns.
Most of the studies relied on old datasets such as CSIC 2010 [20] and ECML-PKDD 2007 [21]. The proposed models were not evaluated using modern datasets, and the datasets created by some researchers are not available online.
Zhang et al. proposed a framework to detect web attacks by extracting seven features: web resource, attribute sequence, attribute value, HTTP version, header, and header input value. This framework includes three components: the probability distribution model, the hidden Markov model, and the one-class SVM model. Each of these components can be considered a machine learning model. Each model is trained on a dataset that contains normal requests only and is evaluated using two datasets: Wikipedia access traces [22] and FuzzDB [23]. Using a multimodel-based method takes advantage of all the models in it; with this method, the authors mitigated the false positive issue significantly [24]. The main advantage of this model is that it combines multiple components into a hybrid one. On the other hand, running multiple components may affect performance. A WAF is a real-time service that handles many requests, and performance is an important parameter to take into consideration.
Tekerek and Bay provided a hybrid model for the detection of malicious and normal requests, using signature-based detection to overcome the speed issue for traditional attacks and behavioral-based detection to address zero-day attacks, using neural networks with three features
2037, 2022, 1, Downloaded from https://onlinelibrary.wiley.com/doi/10.1155/2022/5280158 by INASP/HINARI - CAMEROON, Wiley Online Library on [04/04/2025]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
described by mathematical equations as input to this neural network [25]. The advantages of this model are that it uses hybrid detection methods (signature-based and behavioral-based), which increase the performance and speed of implementation. In addition to the simplicity and speed of implementation of the neural network, this research has undergone successive development work in recent years and the model is mature. On the other hand, the extracted features cannot be generalized to all web applications (the values of the calculated features depend on the average and derivation of other requests, so a massive log of an application is required before the model can classify the requests arriving at that application). In addition, the use of statistical functions produces anomalous cases that cannot be handled (using the average in the denominator of 2 generates a very large value for requests whose length is near the average request length).
Sharma et al. used features engineering to extract seven features from the incoming request and used three classification algorithms to test their effectiveness. They applied preprocessing procedures to the CSIC 2010 dataset to identify subcategories of malicious requests and overcome the missing-features issue [26]. This method yielded many useful features, but some of them cannot be extracted from the dataset used in that article because the required fields are unavailable in it, such as the length of cookies (see the structure of the CSIC 2010 dataset). In addition, the researchers did not use another dataset and relied only on CSIC 2010.
Vartouni et al. applied a character-based n-gram model to construct features. The size of the features grows with the value of n; to overcome this issue and avoid dimensionality reduction techniques, they applied an autoencoder to extract features as a data abstract. Deep learning algorithms were used to work more effectively with the extracted features. This approach gave a generalizable model, but the classification accuracy was low in comparison with other models. In addition, it used a single dataset, CSIC 2010 [27].
Hoang used supervised machine learning (an inexpensive decision tree algorithm, as a suitable real-time classifier that mitigates the speed issue) to detect four major web attacks (SQLi, XSS, command injection, and path traversal), with n-grams of fixed n (n = 3) and PCA (principal component analysis) to obtain a reduced number of features. He used the HTTPParams dataset [28] and CSIC 2010 to evaluate the proposed model and achieved a high accuracy of 98.56% [29]. Hoang's model takes web server logs as input and turns every row into a vector; the results focus only on the HTTPParams dataset, and the experiments used only one algorithm.
Niu and Li extracted eight statistical features and used a convolutional neural network (CNN) combined with a gated recurrent unit (GRU) on the CSIC 2010 dataset. Using these techniques together with the eight features improved detection performance but at the expense of speed. They evaluated the detection performance of this model by comparing it with other deep learning methods and achieved 99.00% accuracy [30]. Niu and Li also used only one dataset (CSIC 2010). In addition, using deep learning techniques for real-time services like a WAF will affect speed if deployed in a real environment.
All of these previous studies classify a request as normal or anomalous; that is, they use the second detection methodology.
Other studies detect specific attacks, such as attacks on database rights, and so use the first methodology. One example is the study of Dr. Ahmad Ghafarian, who proposed an approach that adds a line to each table and uses an algorithm to verify queries before they are executed on the web application. Malicious requests fetch the added line and normal requests do not. In this way, an SQL injection attack is detected in real time before it is executed on the web application, but the proposed model reveals only one of the seven types mentioned by the researcher in his article. In addition, this study did not discuss the load on resources and databases or the delay in execution time caused by testing each query before executing it [31].
There are many studies similar to Dr. Ahmad Ghafarian's that detect SQL injection at runtime, such as AMNESIA (proposed by Halfond and Orso) [32] and CANDID (proposed by Bisht, Madhusudan, and Venkatakrishnan) [33].
2.2. Proposed Model
2.2.1. Architecture. Our proposed WAF works as an operating system service that acts as an intermediary between the web server and the clients. This service receives the request, parses it, extracts features, classifies it, and makes a decision based on the classification result.
The WAF can be configured through a dedicated web application (web control panel, see Figure 1). The proposed WAF consists of the following five basic units (see Figure 2):
(1) Power on/off unit
(2) Training unit
(3) Parsing unit
(4) Classification unit
(5) Decision-making unit
The process starts in the first unit, the power on/off unit. When the WAF is started, the OS service contacts the database and fetches the configuration needed to run the WAF, initiates the listener, and waits for incoming requests to the WAF, which mediates between the client and the web server.
After the WAF is running, the training process starts using the selected dataset and the selected classification algorithm.
Once the training process and the work of the first and second units are complete, the WAF is ready to receive requests.
When a request arrives at the WAF, the first unit to handle it is the parsing unit, which breaks down the request, extracts the features, and passes them as a vector to the classification unit, which classifies the request using the classification method that the administrator chose for the training unit.
Figure 1: Architecture of the web server environment with the proposed WAF (components shown: client, web server, control panel).
[Figure 2: Power Unit, Training Unit, Parsing Unit, Classification Unit, and Decision Making Unit handling a request, ending in the decision "Is normal request?" (Yes/No).]
After classifying the request, the classification unit sends the result to the decision-making unit, which takes the appropriate action to pass or drop the request (see Figure 3 or Algorithm 1).
(1) Power on/Off Unit. This unit is responsible for controlling the WAF by turning it on and off. When the WAF is started, all configurations are fetched from the database, such as the IP address and port of the firewall and the IP address and port of the web server, to run the service (the listener).
(2) Training Unit. After the WAF is running, the dataset name and classification algorithm are fetched from the database to train the model; the WAF is then ready to receive requests.
We used popular classification algorithms; any algorithm can be added by inserting its name in the database (algorithms table) and calling it programmatically in the WAF implementation code.
In addition, any dataset can be added by inserting its name in the database (datasets table) and adding the CSV file of the desired dataset to the dataset folder inside the WAF implementation code folder.
The algorithms used are Naive Bayes, Logistic Regression, Decision Tree, and Support Vector Machine (popular algorithms for binary classification problems) [34].
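As a rough illustration of the training unit, the sketch below looks up a classifier by the name stored in configuration and fits it on four-feature rows. It is not the authors' implementation: the `ALGORITHMS` registry, the `train` helper, and the pure-Python `GaussianNB` class are hypothetical stand-ins for whatever library the WAF actually calls, and the training rows are the sample rows printed in Table 4 (rounded to two decimals).

```python
import math
from collections import defaultdict


class GaussianNB:
    """Tiny Gaussian Naive Bayes for numeric feature vectors (illustrative only)."""

    def fit(self, X, y):
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)
        self.stats = {}
        for label, rows in rows_by_class.items():
            columns = list(zip(*rows))
            means = [sum(col) / len(col) for col in columns]
            # A small constant keeps zero-variance columns from dividing by zero.
            variances = [
                sum((v - m) ** 2 for v in col) / len(col) + 1e-6
                for col, m in zip(columns, means)
            ]
            log_prior = math.log(len(rows) / len(X))
            self.stats[label] = (log_prior, means, variances)
        return self

    def predict(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for x, m, var in zip(row, means, variances)
            )
        return max(self.stats, key=log_posterior)


# Hypothetical registry: a name from the algorithms table maps to a classifier factory.
ALGORITHMS = {"naive_bayes": GaussianNB}


def train(algorithm_name, X, y):
    """Instantiate the configured algorithm and fit it on the feature rows."""
    return ALGORITHMS[algorithm_name]().fit(X, y)


# Columns: payload_len, alpha, non_alpha, attack_feature; label 1 = anomaly.
X = [
    [0, 0.0, 0.0, 0], [41, 95.45, 4.55, 200], [241, 100.0, 0.0, 0],
    [9, 100.0, 0.0, 0], [24, 94.74, 5.26, 2600], [54, 77.78, 22.22, 90000],
    [75, 100.0, 0.0, 0], [103, 87.38, 12.62, 60000], [91, 84.62, 15.38, 90000],
]
y = [0, 1, 0, 0, 1, 1, 0, 1, 1]
model = train("naive_bayes", X, y)
```

With a registry of this kind, adding an algorithm amounts to adding one entry whose key matches the name stored in the database, which is one way to read the "insert its name and call it programmatically" step described above.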
[Figure 3: flowchart. Start; Power Unit: power on WAF; Classification Unit: train model using the dataset and algorithm from the configuration; wait for new requests; "New request arrived?" (No: keep waiting; Yes: continue); Parsing Unit: parse request to retrieve basic features; Classification Unit: classify request; End.]
We preferred to use these algorithms instead of neural networks, which can be used in future works (see Future works for Researchers section).
The datasets used include CSIC 2010, HTTPParams 2015, a Hybrid dataset (generated by combining CSIC 2010 and HTTPParams 2015), and a Custom dataset (logs of real web servers).
(3) Parsing Unit. After the WAF is turned on and the model is trained, the WAF is ready to receive requests. When an HTTP request arrives at the WAF, the parsing unit breaks down the request to extract features (features are discussed in the next section).
The parsing unit creates a final vector consisting of the features and passes this vector to the classification unit.
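To make the "breaks down the request" step concrete, here is a minimal standard-library sketch that splits a raw HTTP/1.1 request into method, URL, headers, and payload. The function name `parse_http_request`, the dictionary keys, and the assumption of an `http://` scheme are illustrative choices, not part of the paper; file uploads are not handled here.

```python
def parse_http_request(raw: str) -> dict:
    """Split a raw HTTP/1.1 request into the parts used as basic features."""
    # The header section and the body are separated by an empty line.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")

    # Request line: METHOD SP request-target SP HTTP-version
    method, target, _version = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Assumption: plain HTTP; a proxy would know the real scheme.
    url = "http://" + headers.get("host", "") + target
    return {"method": method, "url": url, "headers": headers, "payload": body}


raw_request = (
    "POST /login HTTP/1.1\r\n"
    "Host: www.mysite.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "\r\n"
    "user=alice&passwd=secret"
)
parsed = parse_http_request(raw_request)
```

In a proxy-based deployment these fields would normally come from the proxy's own request object rather than from manual string splitting; the sketch only shows which pieces of the message the later feature calculations rely on.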
Input: d (dataset), a (algorithm), p1 (web server port), i1 (web server IP), p2 (WAF port), i2 (WAF IP).
Start
Connect to database to initialize inputs (d, a, p1, i1, p2, i2)
Start WAF listener using inputs (p1, i1, p2, i2)
Train WAF using inputs (d, a)
While WAF listener is "ON":
    If new request R arrived:
        Parse R
        Compute basic features vector B from parsed R
        Compute final features vector V from B
        Compute class C of parsed request R by classifying based on V
        If C = "anomaly":
            Drop request
            Redirect to custom page with message "Attack"
        Else // C = "normal"
            Pass request to web server
            Store V and C in database
        EndIf
    EndIf
EndWhile
End
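The request-handling loop above can be sketched in Python as follows. This is a simplified illustration, not the authors' code: `run_waf` and its callback parameters are hypothetical names, the listener is replaced by a plain list of requests, and the toy feature extractor and classifier exist only to exercise the control flow.

```python
def run_waf(requests, extract_features, classify, forward, store):
    """Apply the parse -> classify -> decide steps to each incoming request."""
    decisions = []
    for request in requests:
        vector = extract_features(request)  # parsing unit: request -> feature vector
        label = classify(vector)            # classification unit
        if label == "anomaly":
            # Decision-making unit: drop the request (a real WAF would also
            # redirect the client to a custom "Attack" page).
            decisions.append((request, "dropped"))
        else:
            forward(request)                # pass the request to the web server
            store(vector, label)            # keep the vector and class for later use
            decisions.append((request, "passed"))
    return decisions


# Toy stand-ins, for illustration only: one feature (length) and a length threshold.
forwarded, stored = [], []
decisions = run_waf(
    ["id=7", "id=7' or 1=1 -- and more padding"],
    extract_features=lambda r: [len(r)],
    classify=lambda v: "anomaly" if v[0] > 20 else "normal",
    forward=forwarded.append,
    store=lambda v, c: stored.append((v, c)),
)
```

Passing the units in as callables mirrors the separation between parsing, classification, and decision-making described in the text, and makes each unit replaceable without touching the loop.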
The parsing unit is implemented using MITM Proxy in Python [35].
(4) Classification Unit. This unit receives the final vector from the parsing unit and classifies the request based on it. The classification unit sends the classification result to the decision-making unit.
(5) Decision-Making Unit. This unit receives the classification result from the classification unit, forwards the request to the web server if the request is normal, and drops it if it is an anomaly.
2.2.2. Features Engineering. When an HTTP request arrives at the parsing unit, it is dismantled to extract the basic features, and these basic features are used to calculate the final features that are sent to the classification unit (see Algorithm 2).
(1) Basic Features. All basic information extracted directly from the request is called a basic feature; it is the content of the HTTP message [36]. In our proposed model, we have five basic features (see Table 1):
(1) HTTP Method: the HTTP method, or verb, used by the client to request the resource from the web server; it may be POST, GET, HEAD, OPTIONS, PUT, PATCH, or DELETE
(2) Absolute URL (URL): it includes the IP address or domain of a web application with the resource, for example, https://www.mysite.com/login
(3) Payload: all data submitted by the client (text input, dropdown menu, text area, etc.)
(4) Headers: request headers (original headers or custom headers added by the client or web application)
(5) Files: usually included in the payload, but we separated them as an independent feature
(2) Final Features. The final features are used by the WAF to classify the request as normal or anomalous; these features are calculated from the basic features, and we have four final features (see Table 2):
(1) Input length: it describes the number of characters in the payload, and it is calculated as follows:
$$l = \sum_{i=0}^{n} c_i, \qquad (1)$$
where l is the input length, c is the character in payload, and n is the payload length.
Usually, this feature value is bigger in anomaly requests compared to normal requests (see Figure 4).
(2) Alphanumeric character ratio: it describes the ratio of alphanumeric characters over the input length. Normal requests usually contain more numeric and alphabetic characters compared to special characters such as symbols, so this feature will have a big value in normal requests compared to anomaly requests (see Figure 5).
This feature is calculated as follows:
$$a = \frac{\sum_{i=0}^{n} c_i \mid c_i \in e}{l} \times 100, \qquad (2)$$
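The two features defined by Equations (1) and (2) can be computed as in the sketch below. The function names are illustrative, and the set e of allowed characters is assumed here to be the ASCII letters and digits; the paper may use a different character set.

```python
import string

# Assumption: the allowed set e is ASCII letters plus digits.
ALPHANUMERIC = set(string.ascii_letters + string.digits)


def input_length(payload: str) -> int:
    """Equation (1): number of characters in the payload."""
    return len(payload)


def alphanumeric_ratio(payload: str) -> float:
    """Equation (2): percentage of payload characters that are alphanumeric."""
    length = input_length(payload)
    if length == 0:
        return 0.0  # an empty payload gives 0, avoiding division by zero
    allowed = sum(1 for ch in payload if ch in ALPHANUMERIC)
    return allowed / length * 100


sample = "user=alice&id=42"
sample_length = input_length(sample)      # 16 characters
sample_ratio = alphanumeric_ratio(sample) # 13 of 16 characters are alphanumeric
```

Returning 0 for an empty payload is a design choice made for this sketch so that requests without a body still yield a well-defined feature vector.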
Table 1 (basic features):
URL | IP address or domain of web application with resource
Payload | All data submitted from the client
Headers | Request headers
Files | Uploaded files in payloads
Table 2 (final features):
Feature | Description
Input length | Number of characters in payload
Alphanumeric character ratio | The ratio of alphanumeric characters over input length
Special characters ratio | The ratio of nonalphanumeric characters over input length
Attack weight | Sum of five subfeatures (see Table 3): (i) URL weight; (ii) Number of attack words in inputs; (iii) Manipulate payload weight; (iv) Alphanumeric character to special character ratio; (v) Files weight
Figure 5: Histogram implementation of alphanumeric character ratio feature in the CSIC 2010 dataset; green values represent normal request and red values represent anomaly requests (axes: Number of requests in Dataset, Frequencies in Dataset).
(3) Special character ratio: it describes the ratio of special characters (nonalphanumeric) over the input length. Anomaly requests usually contain fewer numeric and alphabetic characters compared to special characters such as symbols, so this feature will have a big value in anomaly requests compared to normal requests (see Figure 6).
weight.
This subfeature is calculated as follows:
$$v = \sum_{i=0}^{n} w_i \mid d_i \in \text{Input}, \qquad (6)$$
where v is the number of attack words in inputs, d is the discovered attack word, w is the weight of the attack word, n is the number of attack words in the WAF database, and Input is the request headers and payloads.
Example: Email = [email protected] & passwd=’ or 1=1 --&mode=’ or 1=1 --
v = 150 + 150 (150 is the weight of the discovered SQLI, which occurs twice).
Figure 7: Histogram implementation of attack weight feature in the CSIC 2010 dataset for normal request and red values represent anomaly requests (axes: Number of requests in Dataset, Frequencies in Dataset).
(3) Manipulate payloads weight
The value of this subfeature is initialized to zero. The value increases for each manipulation discovered in the payload and headers.
Manipulation is passing wrong data to the application to throw an exception and to
Table 4: Sample of the dataset used to train the model; each request has four features and a label.
payload_len   Alpha         non_alpha     attack_feature   Label
0             0             0             0                0
41            95.45454545   4.545454545   200              1
241           100           0             0                0
9             100           0             0                0
24            94.73684211   5.263157895   2600             1
54            77.77777778   22.22222222   90000            1
75            100           0             0                0
103           87.37864078   12.62135922   60000            1
91            84.61538462   15.38461538   90000            1
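The attack_feature column aggregates several subfeatures. One of them, the attack-word weight of Equation (6), can be sketched as below. The `ATTACK_WORDS` table is hypothetical: only the weight 150 for an SQL injection pattern comes from the worked example in the text, while the pattern strings and the second entry are made up for illustration. Each occurrence of a pattern adds its weight, matching the "150 + 150" example.

```python
# Hypothetical attack-word table (pattern -> weight). In the paper this lives in
# the WAF database; only the SQLI weight of 150 is taken from the text's example.
ATTACK_WORDS = {
    "or 1=1": 150,   # SQL injection tautology
    "<script": 200,  # illustrative XSS marker with a made-up weight
}


def attack_words_weight(text: str, attack_words=ATTACK_WORDS) -> int:
    """Sum the weight of every attack word occurrence found in the input."""
    lowered = text.lower()
    return sum(weight * lowered.count(word) for word, weight in attack_words.items())


malicious = "email=user@example.com&passwd=' or 1=1 --&mode=' or 1=1 --"
benign = "email=user@example.com&passwd=hunter2&mode=login"

malicious_weight = attack_words_weight(malicious)
benign_weight = attack_words_weight(benign)
```

Plain substring counting is the simplest possible matcher; a production rule set would likely need normalization (URL decoding, whitespace folding) before matching, which is outside what the text specifies.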
CSIC 2010 contains 18 columns, the custom compromised web server logs contain 10 columns, and HTTPParams 2015 contains 4 columns.
After preprocessing and dimension reduction, all of these datasets have only 5 numeric columns: all columns that contain information about the request were removed and replaced by the final feature columns, which describe the request briefly and effectively. As a result, the Hybrid dataset became very simple.
3.1.3. Training. We used four classification algorithms (Naive Bayes, Logistic Regression, Decision Tree, and SVM). The four datasets were fed to the classifier using two methods, train/test split (80%, 20%) and cross-validation (100 folds), and the results were very close.
Mixing and shuffling the rows of CSIC 2010 and HTTPParams 2015 into a new dataset (the Hybrid dataset) gave results very close to those of the classifier on each dataset separately (see Table 5).
These experiments rule out overfitting and show that the final features of our proposed model are effective.
3.2. Results
3.2.1. Results Based on Datasets. Our proposed model used Naive Bayes with cross-validation (100 folds) and achieved an accuracy of 98.8% with the dataset created from logs of a compromised real web server, 97.61% with the HTTPParams dataset, 99.58% with the CSIC dataset, and 96.40% with the Hybrid dataset (combination of CSIC 2010 and HTTPParams 2015).
The use of four different datasets rules out overfitting; to confirm this, k-fold cross-validation was also used in training [37].
Most of the related works used the CSIC 2010 dataset, with or without a custom dataset, and we used it in the proposed model so that it can be compared with previous models (see Figure 9 or Table 5).
The implementation of the proposed model includes a function to export WAF records as a new dataset, with the ability to correct records. Administrators can train the proposed model using this exported dataset to strengthen the WAF in protecting their web applications.
Most misclassified requests are normal requests classified as anomalies (false positives), not the opposite.
3.2.2. Results Compared to Related Works. Our proposed model achieved a high accuracy of 98.8% compared with related works. The following table shows the results for CSIC 2010, HTTPParams, and custom datasets created by the researchers (see Figure 10 or Table 6).
3.3. Comparison
3.3.1. Limitations of Previous Works. Researchers have provided many models for detecting web attacks, and despite their various features, these studies share some common weaknesses, which can be summarized as follows:
(1) The extracted features are not general; most of them fit only the web applications from which they were extracted.
(2) Old datasets such as CSIC are used, and the evaluation of the model depends on the results of training it. In addition, the modern datasets used are not available on the Internet.
Table 5: Classification accuracy of our proposed model for various datasets using Naive Bayes.
Metric | CSIC 2010 | HTTPParams 2015 | Hybrid dataset | Compromised web server dataset
Number of normal requests | 28,800 | 19,305 | 48,105 | 60,250
Number of anomaly | 11,213 | 11,764 | 22,977 | 5,210
Classification accuracy (80% training, 20% testing) | 99.59% | 97.91% | 96.40% | 98.80%
Classification accuracy (100-fold cross-validation) | 99.71% | 98.02% | 96.66% | 98.97%
False positive rate | 0.54% | 1.20% | 3.35% | 0.84%
[Figure 9: bar chart of accuracy per dataset (axes: Datasets, Accuracy); values shown: 99.59, 98.8, 97.91, 96.4.]
Figure 10: Classification accuracy of our proposed model compared with related works (axes: Web Application Firewalls, Accuracy; legend: Tekerek A. and Bay O.F. (2019); Sharma S., Zavarsky P. and Butakov S. (2020); Ghafarian A. (2017); proposed model; values shown: 98.8, 97.4, 96.74, 88.32).
Table 6: Classification accuracy of our proposed model compared with related works.
Dataset | Our proposed model | Tekerek and Bay [25] | Sharma et al. [26] | Ghafarian [31]
CSIC 2010 | 99.59% | 96.74% | 94.7% | 88.32%
ECML-PKDD 2007 | Not tested | 94.53% | Not tested | Not tested
HTTPParams 2015 | 97.61% | Not tested | Not tested | Not tested
Custom dataset | 98.8% | 98.52% | Not tested | Not tested
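The paper reports classification accuracy and false positive rate without stating the formulas. Assuming the usual confusion-matrix definitions, with "anomaly" as the positive class, the two metrics can be computed as in this sketch; the function name and the toy labels are illustrative only and do not reproduce any figure from the tables.

```python
def accuracy_and_fpr(y_true, y_pred, positive=1):
    """Return (accuracy, false positive rate) for binary labels.

    Assumed definitions: accuracy = (TP + TN) / total and
    FPR = FP / (FP + TN), where the positive class is "anomaly".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Toy example: four normal requests (one wrongly flagged) and two anomalies.
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1]
toy_accuracy, toy_fpr = accuracy_and_fpr(toy_true, toy_pred)
```

Under these definitions a normal request flagged as an anomaly counts as a false positive, which is the error type the text reports as the most common.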
(3) Some related works contain errors and inaccurate information. For example, Sharma, Zavarsky, and Butakov (2020) [26] used features that cannot be extracted from CSIC 2010 (e.g., the _cookie_len feature).
(4) Most of the related works process the payload only, without taking headers and files into consideration.
(5) Hybrid models are rare (among the related works, only the paper by Tekerek and Bay (2019) proposes a hybrid model).
(6) Most of the related works detect common web attacks such as XSS and SQLI; none of the suggested models can detect attacks that are performed using normal requests, such as DOS attacks.

3.3.2. Advantages of the Proposed Model. The disadvantages and weak points of related works were taken into consideration while designing and preparing our proposed model. The features extracted in this model are general and can work with any web application. In addition, we used various datasets: standard datasets such as CSIC 2010 to compare our model with related works, modern datasets such as HTTPParams 2015, a Hybrid dataset, and a custom dataset of a real compromised web server. The final features describe all parts of the HTTP request, including headers and files. Finally, a high accuracy rate was achieved (98.8% for the custom dataset and 99.6% for the standard dataset).

Regression, Decision Tree, and Naive Bayes, we focus on Naive Bayes. Our proposed model achieved a high classification accuracy of 99.6% with the standard dataset used in research studies in this field (CSIC 2010), and 98.8% with the dataset of a real compromised web server.

5. Future Works

Future work can proceed in several directions, which can be summarized as follows (see next three subsections for more information):

Researchers in the information security domain can develop the proposed WAF by feeding it with more datasets (a dataset generated by our proposed WAF or a custom dataset created from web server logs) or by adding or modifying the current features. They can also develop separate components and integrate them with our proposed WAF: a signature-based model to check a request before passing it to the classifier, a DOS attack detector, and a natural language processing model that identifies attack words instead of using a database table to store these words.

Software engineers, developers, and information security engineers can use our proposed model to evaluate their applications and improve their skills by learning how to write secure source code.

Sponsors and businesspersons can invest money to develop the proposed model into a commercial product.
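The signature-based component suggested for future work (checking a request before passing it to the classifier) could be sketched roughly as follows. This is only an illustration under stated assumptions: the three regular expressions, the function names `matches_signature` and `inspect_request`, and the `classify` callable that stands in for the trained model are all hypothetical and are not part of the proposed WAF.

```python
import re
from typing import Callable
from urllib.parse import unquote_plus

# Hypothetical signatures for illustration only; a real deployment would
# load a maintained rule set instead of this short list.
SIGNATURES = [
    re.compile(r"<script\b", re.IGNORECASE),               # XSS marker
    re.compile(r"\bunion\b.+\bselect\b", re.IGNORECASE),   # SQL injection marker
    re.compile(r"\.\./"),                                  # path traversal marker
]


def matches_signature(raw_request: str) -> bool:
    """Return True if any signature occurs in the URL-decoded request text."""
    decoded = unquote_plus(raw_request)  # undo %XX escapes and '+' for spaces
    return any(sig.search(decoded) for sig in SIGNATURES)


def inspect_request(raw_request: str, classify: Callable[[str], str]) -> str:
    """Run the signature check first; call the classifier only if it passes.

    `classify` is a stand-in for the trained model: it receives the raw
    request and returns "normal" or "anomaly".
    """
    if matches_signature(raw_request):
        return "anomaly"
    return classify(raw_request)


def always_normal(_request: str) -> str:
    """Dummy classifier used only to exercise the pre-check."""
    return "normal"


print(inspect_request("GET /item?id=1+UNION+SELECT+pass+FROM+users", always_normal))  # anomaly
print(inspect_request("GET /item?id=1", always_normal))                               # normal
```

Placing the signature check in front of the classifier means that requests matching a known pattern are rejected without computing any features, while everything else is still judged by the learned model.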
[18] N. Mendes, J. Duraes, and H. Madeira, “Benchmarking the security of web serving systems based on known vulnerabilities,” in Proceedings of the 2011 5th Latin-American Symposium on Dependable Computing, pp. 55–64, IEEE, Sao Jose dos Campos, Brazil, April 2011.
[19] M. Malviya, A. Jain, and N. Gupta, “Improving security by predicting anomaly user through web mining: a review,” International Journal of Advances in Engineering & Technology, vol. 1, no. 2, p. 28, 2011.
[20] C. T. Giménez, A. P. Villegas, and G. Á. Marañón, “Http dataset CSIC 2010,” 2010, https://www.isi.csic.es/dataset/.
[21] F. Eisterlehner, A. Hotho, and R. Jäschke, “ECML/PKDD dataset,” 2007, https://gitlab.fing.edu.uy/gsi/web-application-attacks-datasets/-/tree/master/ecml_pkdd.
[22] G. P. Urdaneta and G. V. S. Maarten, “Wikipedia access traces Datasets,” 2008, http://www.wikibench.eu/?page_id=60.
[23] FuzzDB, 2007, https://code.google.com/p/fuzzdb/.
[24] M. Zhang, S. Lu, and B. Xu, “An anomaly detection method based on multi-models to detect web attacks,” in Proceedings of the 2017 10th International Symposium on Computational Intelligence and Design (ISCID), pp. 404–409, IEEE, Hangzhou, China, December 2017.
[25] A. Tekerek and O. F. Bay, “Design and implementation of an artificial intelligence-based web application firewall model,” Neural Network World, vol. 29, no. 4, pp. 189–206, 2019.
[26] S. Sharma, P. Zavarsky, and S. Butakov, “Machine learning based intrusion detection system for web-based attacks,” in Proceedings of the 2020 IEEE 6th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC), and IEEE Intl Conference on Intelligent Data and Security (IDS), pp. 227–230, IEEE, Baltimore, MD, USA, May 2020.
[27] A. M. Vartouni, S. S. Kashi, and M. Teshnehlab, “An anomaly detection method to detect web attacks using stacked auto-encoder,” in Proceedings of the 2018 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), pp. 131–134, IEEE, Kerman, Iran, March 2018.
[28] “HttpParams dataset,” 2015, https://github.com/Morzeux/HttpParamsDataset.
[29] X. D. Hoang, “Detecting common web attacks based on machine learning using web log,” in Proceedings of the International Conference on Engineering Research and Applications, pp. 311–318, Springer, Thai Nguyen, December 2020.
[30] Q. Niu and X. Li, “A high-performance web attack detection method based on CNN-GRU model,” in Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), pp. 804–808, IEEE, Chongqing, China, June 2020.
[31] A. Ghafarian, “A hybrid method for detection and prevention of SQL injection attacks,” in Proceedings of the 2017 Computing Conference, pp. 833–838, IEEE, London, UK, July 2017.
[32] W. G. J. Halfond and A. Orso, “Preventing SQL injection attacks using AMNESIA,” in Proceedings of the 28th International Conference on Software Engineering, pp. 795–798, Shanghai, China, May 2006.
[33] P. Bisht, P. Madhusudan, and V. N. Venkatakrishnan, “Candid,” ACM Transactions on Information and System Security, vol. 13, pp. 1–39, 2010.
[34] R. Kumari and S. K. Srivastava, “Machine learning: a review on binary classification,” International Journal of Computer Application, vol. 160, p. 7, 2017.
[35] M. Proxy: https://docs.mitmproxy.org/stable/.
[36] S. Suroto, “A review of defense against slow HTTP attack,” JOIV International Journal on Informatics Visualization, vol. 1, no. 4, pp. 127–134, 2017.
[37] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, “Supervised machine learning: a review of classification techniques,” Emerging artificial intelligence applications in computer engineering, vol. 160, no. 1, pp. 3–24, 2007.
[38] R. Abdulhammed, H. Musafer, A. Alessa, M. Faezipour, and A. Abuzneid, “Features dimensionality reduction approaches for machine learning based network intrusion detection,” Electronics, vol. 8, no. 3, p. 322, 2019.
[39] G. T. Reddy, M. P. K. Reddy, K. Lakshmanna et al., “Analysis of dimensionality reduction techniques on big data,” IEEE Access, vol. 8, pp. 54776–54788, 2020.
[40] G. T. Reddy, S. Bhattacharya, S. S. Ramakrishnan et al., “An ensemble based machine learning model for diabetic retinopathy classification,” in Proceedings of the 2020 International Conference on Emerging Trends in Information Technology and Engineering (Ic-ETITE), pp. 1–6, IEEE, Vellore, India, Feb 2020.