SQL Injection Attack Detection Framework Based on HTTP Traffic
Zhongdong Zhu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, 100876 China, [email protected]

Shilin Jia∗
National Computer Network Emergency Response Technical Team/Coordination Center of China, [email protected]

Jishuai Li
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, 100876 China, [email protected]
ACM TURC, July 30–August 01, 2021, Hefei, China ZhongDong Zhu et al.
building. The traffic collection module proposes a method for constructing SQL injection attack data sets. The data cleaning module improves the ability to detect SQL injection attacks in complex traffic environments by reducing the interference of irrelevant information. The feature representation module proposes a way to generate lexical features containing special characters, which can enhance model performance by using the semantic information between words. In the model construction phase, the characteristics of the detection model are discussed, which can further reduce the alarm omission of SQL injection attacks.

2 RELATED WORK

As a simple and easy-to-understand attack technique, SQL injection has always been a research hotspot in the field of network security. Researchers have proposed many SQL injection detection methods based on HTTP traffic. According to the implementation technology, these methods can be divided into signature-based methods, machine learning-based methods and deep learning-based methods.

Signature-based methods filter traffic using rules collected from known attacks. Although they are simple, easy to implement, and have a low false positive rate, a large number of rules limits detection speed, and they cannot cope with unknown attacks. To overcome the limitations of signature-based detection, many researchers have proposed detection methods based on machine learning because of its generalizability. These methods need to obtain appropriate feature representations from traffic and then train machine learning models. Howard et al. [2] proposed an automatic signature generation system for SQL injection attacks based on machine learning. The system first actively collected 30,000 SQL injection attack samples, constructed statistical features using SQL reserved words, firewall rules and expert knowledge, and then used a hierarchical clustering algorithm to cluster the samples and features successively. They derived the characteristics of each cluster and trained a logistic regression model as a signature for each cluster. Compared with rule-based intrusion detection systems (Snort and Bro) and a Web application firewall (ModSec), the system is highly competitive in detecting SQL injection. The true positive rate and false positive rate are 90.52% and 0.03% respectively. Kar et al. [3] proposed a detection method for SQL injection attacks deployed in database firewalls. SQL statements are converted into token sequences that maintain their syntactic structure, a graph with tokens as nodes is generated, the interaction between tokens is used as a weighted edge, and the centrality measure of the nodes is used as a feature to train an SVM classifier. A variety of parameter settings were evaluated in detail. The accuracy of the final model is 99.47% and the false positive rate is 0.31%. Makiou et al. [4] used a combination of a machine learning classifier and a rule engine to detect SQL injection attacks in order to improve the detection performance of traditional rule-based methods. First, the HTTP request traffic is divided into different parts to detect the existence of SQL injection. A Naive Bayes classifier is used to characterize the existence of some keywords and symbols in the SQL statement. To reduce false negatives during training, a higher penalty was given when the model misclassified an SQL injection attack as normal traffic. After classification, the suspicious traffic is forwarded to a pattern matching detection engine based on a simplified security rule set for further in-depth analysis. The accuracy of the method was 97.6%. Its advantage is that detection covers not only the URL and POST data but also the HTTP headers, whereas most detection methods ignore SQL injection in the headers for the sake of simplicity and efficiency. Its disadvantage is that it continues to use rule engine analysis after the machine learning model, which loses the generalization of machine learning, does not overcome the limitations of the rules themselves, and cannot find unknown SQL attacks.

In recent years, deep learning has achieved revolutionary results in many fields due to its powerful feature representation and learning ability, which has also attracted the attention of many network security researchers. Many researchers have also applied deep learning to SQL injection attack detection. Tang et al. [5] proposed an SQL injection detection model based on neural networks. Eight statistical features were used to model traffic, and the detection performance of LSTM and SVM was compared. They extract the characteristics of SQL injection attacks by defining a keyword table. This method ignores the sequential information between keywords, and the keyword table must cover all SQL keywords, functions, etc.; otherwise attacks will be missed. Secondly, SQL keywords may also appear in normal traffic and lead to false positives. In addition, this feature extraction method cannot deal with mutated SQL injection attacks. Xie et al. [6] proposed using Char-CNN with an elastic pooling layer to detect SQL injection; compared with RandomForest, SVM and other traditional detection methods, the accuracy of this method reached 99.98%. The elastic pooling layer used in this method can determine the pooling range according to the input feature vector, so that variable-length sequences can be detected. However, it has the following problems: the two-dimensional convolution over the character sequence cuts characters apart, and with variable-length input only one sample can be used for training at a time, resulting in unstable model updates, long training time and difficult fitting. Fang et al. [7] used grammatical analysis and a likelihood ratio test to convert SQL query strings into word sequences, and then used an LSTM for classification. Experimental results showed that the accuracy of this method was 98.60%. Abaimov et al. [8] proposed a code injection detection system based on CNN, which transformed SQL statements into codes to reduce the training time of the neural network. The accuracy of this model on the experimental data set reached 94%. The input of the detection methods in [7, 8] is the SQL query statement, and the combination of semantic analysis and deep learning can be used to effectively detect SQL injection attacks.

3 ANALYSIS

Due to the convenience of the Internet, more and more services are being moved to the network, often using the HTTP protocol. HTTP traffic carries malicious information injected by attackers as well as ordinary interactive data. HTTP request traffic passes the data entered by the user, and HTTP response traffic returns the information of the server receiving the requests. An attacker sends HTTP request
traffic containing SQL injection to break into the server. Therefore, the HTTP request traffic records all the information of the attacker's injection attack and is an ideal detection object for the SQL injection attack detection task. The parameters in the URL or POST are the actual information transmitted over the network, often referred to as the Payload. The SQL injection exists in the Payload and the rest of the HTTP traffic is irrelevant information, so the detection system first needs to extract the Payload from the HTTP request traffic.

Figure 1: Example of SQL injection attack traffic

Figure 2: SQL injection attack in Cookie

SQL injection attack traffic has four characteristics. First, SQL injection conforms to SQL grammar rules and must contain SQL reserved words, built-in functions, etc. Second, annotation symbols in SQL injection are short, consisting of only 1 or 2 characters, but they are very important features. Third, SQL injection statements may appear anywhere in the traffic, so information will be lost if the traffic is truncated. Last, strings or numbers that appear in the SQL injection are artificial random inputs that have no meaning for the detection system.

In existing SQL injection attack detection schemes, the detection targets are mostly parameters in the URL or POST, but in actual traffic some SQL injection attacks also appear in the HTTP header fields. Some scanners also support injection into HTTP header fields. The header fields in which SQL injection attacks may appear include User-Agent, Referer, Cookie, X-Forwarded-For, and so on. SQL injection attacks occur in header fields because the server processes the information in these fields without filtering it, which creates SQL injection risks.

4 WORKFLOW

The design of the detection framework revolves around the four characteristics of SQL injection attacks. For characteristics 1 and 2, lexical features that retain special characters are adopted as the feature representation. For characteristic 3, the model construction is required to support detection of Payloads of arbitrary length. For characteristic 4, the Payload is standardized during data cleaning to reduce the interference of irrelevant information. The detection framework is shown in Figure 3. The detection framework has two phases: the training phase and the prediction phase. The training phase consists of four modules and finally produces a detection model. The prediction phase uses the detection model to predict the label of the input traffic.

In the training phase, Step I is the data collection module, which collects SQL injection attack traffic and normal traffic, extracts the parameters in the URL and POST as payloads, and assigns the corresponding labels. Step II, the data cleaning module, decodes, unifies the case of, replaces and standardizes the Payload in turn. Step III is the feature representation module, which extracts the lexical features of the Payload. The Payload is segmented into words using character boundaries and special characters. Word vectors are pre-trained to convert the word sequence into feature vectors. Step IV is the model construction module, in which the characteristics of the detection model are discussed.

In the prediction phase, the input of the detection framework is the HTTP request traffic and the output is the label of that traffic. First, the parameters in the URL or POST and the HTTP request headers (User-Agent, Cookie, and so on) are extracted from the incoming HTTP request traffic and concatenated into a Payload using &. The Payload is then passed through the same data cleaning and feature extraction modules as in the training phase to obtain the feature vector. Finally, the feature vector is input into the detection model to get the label of the traffic. This article treats request headers as a class of parameters, concatenating them with the parameters in the URL or POST using &. The sample Payload in the prediction stage in Figure 3 is id=1' and '1'='1'&127.0.0.1' and 1=2#, which combines the parameters in the URL with the header field X-Forwarded-For. To fully detect SQL injection attacks that exist in different locations, more header fields can be combined to generate the Payload, but too many header fields increase the length of the input sequence and the resource consumption during detection. In a practical application, security personnel can combine only the header fields that interact with the back-end database.

4.1 Data Collection

Data collection is Step I in Figure 3, the first stage of the detection system. SQL injection attack traffic and normal traffic are collected separately. Deep learning models need to learn from a large number of data samples, which determine the effectiveness of the detection model. SQL injection attacks are diverse and can be divided into many different types. To improve the generalization of the detection model, it is important to collect as much data as possible, covering all types of SQL injection attacks. The data collection module adopts both active collection and passive collection, and the data format is HTTP request traffic.

In the active collection method, SQL injection attack data publicly available on the Internet are collected first, including the SQL injection attack cases shared by network security portals, hackers or security experts, and the public SQL injection attack libraries on GitHub. SQL injection vulnerabilities recently released in CVE are also collected. Secondly, a test sandbox environment was built locally. The test environment was a Web application containing SQL injection vulnerabilities, implemented with Apache, PHP and MySQL. The popular injection tool SQLMap was used to test the sandbox, and the request traffic during the test was saved as
part of the training set. In the passive collection mode, a rule-based network firewall was built in the lab LAN to identify malicious traffic in the campus network. The traffic identified by the firewall as SQL injection attacks was saved, together with a large amount of normal HTTP request traffic as white samples.

The data collected through the channels above is then used to construct the data set. Building a data set can be divided into two steps: extracting payloads and assigning labels. SQL injection statements exist in the part of the input that the user can control, that is, the parameters that are passed. In the HTTP protocol, the GET method passes the parameters in the URL, and the POST method passes the parameters in the request body. To reduce the interference of irrelevant background traffic, the parameter values in the URL or POST of the HTTP traffic are extracted as payloads, and the corresponding labels are assigned: 1 represents SQL injection attack traffic and 0 represents normal traffic. The sample data set for Step I.2 in Figure 3 has a Payload of id=1' and '1'='1' from the parameters section of the URL. The label is 1, indicating that the Payload is an SQL injection attack.

4.2 Data cleaning

In the SQL injection attack detection task, the objective of data preprocessing is to filter out irrelevant background information in the traffic while avoiding the loss of SQL injection attack characteristics.

1. Decoding. In the decoding operation, if the HTTP method is GET, the URL needs to be decoded first. The parameter string in the URL is made up of key-value pairs of the form Key=Value, with pairs separated by &. The Payload is decoded from URL, HTML and Base64 encodings, which effectively helps the detection model learn the characteristics of SQL injection attacks.

2. Uniform case. First, a tag whose characters differ only in case is perceived by the computer as a different tag, which greatly increases the dimensionality of the lexical features. Second, the case of the characters in the traffic has no meaning for the detection task. At the same time, some attackers use case changes as a bypass, such as select rewritten as SELECT, in order to evade inspection by the intrusion detection system. Therefore, in this paper, characters in the Payload are uniformly converted to lowercase in preprocessing.

3. Replace. All reserved words, functions, etc. in the SQL language are composed of ASCII-encoded characters. To reduce the interference of irrelevant information, non-ASCII-encoded strings are replaced with the fixed unk tag. The detection framework uses substitution rather than filtering. Filtering loses information, which leads to omissions, and may join unrelated parts together, which leads to false positives. The ideal way to handle this is to replace a combination of characters of the same type with the same tag.

4. Traffic standardization. The string or number that appears in the Payload is an artificial random input that is not meaningful for
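As a rough illustration, the first three cleaning steps of Section 4.2 (decoding, case unification and replacement) could be sketched in Python as below. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `try_base64` and `clean_payload`, the order in which the three decodings are tried, and the heuristic for deciding whether a value is Base64 are all assumptions made for illustration. Step 4 (standardization) is not included.

```python
import base64
import binascii
import html
import re
from urllib.parse import unquote_plus

UNK_TAG = "unk"  # fixed replacement tag named in step 3
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")
BASE64_SHAPE = re.compile(r"^[A-Za-z0-9+/]{8,}={0,2}$")


def try_base64(value: str) -> str:
    """Return the Base64-decoded text if value plausibly is Base64, else value unchanged.

    Assumption: a value counts as Base64 only if it has the right shape and
    decodes to printable ASCII. This heuristic is illustrative only.
    """
    if len(value) % 4 != 0 or not BASE64_SHAPE.match(value):
        return value
    try:
        decoded = base64.b64decode(value, validate=True).decode("ascii")
    except (binascii.Error, UnicodeDecodeError):
        return value
    return decoded if decoded.isprintable() else value


def clean_payload(payload: str) -> str:
    """Apply decoding, lowercasing and non-ASCII replacement to a Key=Value&... payload."""
    cleaned_pairs = []
    for pair in payload.split("&"):
        key, sep, value = pair.partition("=")
        target = value if sep else key
        # Step 1: URL decoding, then HTML entity decoding, then Base64 (if plausible).
        decoded = try_base64(html.unescape(unquote_plus(target)))
        cleaned_pairs.append(f"{key}{sep}{decoded}" if sep else decoded)
    # Step 2: unify case so that SELECT and select map to the same tag.
    text = "&".join(cleaned_pairs).lower()
    # Step 3: replace (rather than delete) each run of non-ASCII characters.
    return NON_ASCII_RUN.sub(UNK_TAG, text)


print(clean_payload("id=1%27%20AND%20%271%27%3D%271"))
```

Replacing a non-ASCII run with a tag, rather than deleting it, keeps the tokens on either side from being joined together, which is the reason the text gives for preferring substitution over filtering.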
symbols in the SQL injection are distributed in the lower left corner of the graph, and that tags with similar meanings are close together. For example, the tags "and" and "or" almost overlap. In SQL statements, "and" and "or" are used to join conditional statements. The special symbols --, /* and backspaces are very close. It can be seen from this that the pre-trained word vectors effectively represent the semantic information between words and between words and injection symbols. This effective representation of semantic information can enhance the detection ability of the model.

Step III.3, word embedding, is the last step of the lexical feature extraction module; it converts text sequences into feature vectors. Step III.3 in Figure 3 shows the process of word-level embedding, where each word token is transformed into a vector of fixed dimension.

4.4 Model Construction

Step IV in Figure 3 is the model construction module. Model construction needs to take into account the characteristics of SQL injection attacks. Here the detection framework puts forward two requirements for model construction.

The first requirement is to detect Payloads of arbitrary length. Because the Payload has variable length, SQL injection may occur at any location. Truncating the Payload causes information loss and leads to missed detections. Therefore, the detection model must accept variable-length sequences as input in order to detect Payloads of any length. In detection methods based on statistical features, a statistical feature is an indirect feature obtained by computing statistics over the Payload, such as the proportion of backspaces in the Payload, so the problem of variable-length sequences does not need to be considered. However, detection methods based on lexical features take the text sequence directly as input, so the network structure of the model must be designed to meet the requirement of detecting input of any length.

As a classifier, the fully connected layer is an essential part of deep learning model structures, but it can only accept input with fixed dimensions, which means that many existing detection models cannot accept variable-length input. They
typically define an artificial length threshold for the input text, truncate text above the threshold, and pad text below it. In SQL injection attack detection, if the input length is fixed and the traffic is truncated, SQL injection attacks located after the length threshold are discarded.

Deep learning models typically use a feature extraction layer to obtain abstract features before the fully connected layer, and the dimensions of these features vary with the length of the input sequence. Therefore, if feature vectors of fixed dimensions can be obtained before the fully connected layer, sequences of arbitrary length can be detected. To achieve this goal, the common method is to downsample along the sequence direction. For example, the combination of a CNN and a global pooling layer fixes the dimension of the feature vector to the number of convolution kernels. In an RNN structure, the last state is used as the feature vector, which can be understood as discarding the state features of the other sequence positions. The combination of an RNN and Attention can also fix the feature dimension by taking a weighted sum of all states in the sequence. The detection model can detect Payloads of arbitrary length by using these network structures.

The second requirement is to ensure training efficiency for variable-length sequence detection. The detection model can detect traffic sequences of any length, which means that it receives input of variable length. Variable-length input makes it impossible for the model to be updated in parallel, so the training stage can only rely on a single sample at a time. The gradient variance changes too much during model updates, which makes it difficult for the model to fit and makes the training time too long. However, with fixed-length input, batch training can be used to shorten the training time and achieve smoother model updates. Therefore, when building the model, we should look for a solution that ensures training efficiency while supporting variable-length sequence detection. This framework proposes a training method for variable-length sequences. Payloads of similar lengths in the data set are first grouped together so that N data subsets are obtained. The average length of each data subset is used as its length threshold. Payloads above the length threshold are truncated, and Payloads below the length threshold are padded with zero vectors. When truncating, a priori, the SQL injection attack follows the parameter, so traffic that exceeds the length threshold is truncated from behind. This training method allows the model to be trained in parallel while ensuring, to the greatest extent, that the attack part of the SQL injection is not lost.

5 CONCLUSION

Existing technical schemes have some problems, such as an imperfect detection process and the inability to detect SQL injection attacks that use various bypassing methods. In view of the characteristics of SQL injection attacks against the background of complex HTTP traffic, this paper systematically proposes a framework for SQL injection attack detection based on HTTP traffic, including four modules: data collection, data cleaning, feature representation and model construction. The data collection module introduces a variety of channels to obtain data. The data cleaning module avoids false positives caused by too much detection data by reducing the interference of irrelevant information. A variety of cleaning methods effectively counter SQL injection attacks that use various bypassing methods. The feature representation module describes an efficient and easy-to-obtain feature generation method, namely lexical features that retain special characters. The model construction module proposes a model construction method for detecting Payloads of arbitrary length and a variable-length sequence training method to guarantee efficiency. The detection location covers HTTP request headers, URLs, and POST data, providing multi-dimensional protection against SQL injection attacks. In a real network environment, the framework detects SQL injection attacks with a low alarm omission rate and a low false positive rate.

REFERENCES

[1] OWASP. 2017. Top 10 Web Application Security Risks. Retrieved from https://owasp.org/www-project-top-ten/.
[2] Howard G M, Gutierrez C N, Arshad F A, et al. pSigene: Webcrawling to Generalize SQL Injection Signatures[C]. // IEEE/IFIP International Conference on Dependable Systems & Networks. Atlanta: IEEE Computer Society, 2014: 45-46.
[3] Kar D, Panigrahi S, Sundararajan S. SQLiGoT: Detecting SQL injection attacks using graph of tokens and SVM[J]. Computers & Security, 2016, 60(jul.):206-225.
[4] Makiou A, Begriche Y, Serhrouchni A. Improving Web Application Firewalls to detect advanced SQL injection attacks[C]. // 2014 10th International Conference on Information Assurance and Security, Okinawa: IEEE, 2014: 35-40.
[5] Tang P, Qiu W, Huang Z, et al. Detection of SQL injection based on artificial neural network[J]. Knowledge-Based Systems, 2020, 190:105528.
[6] Xie X, Ren C, Fu Y, et al. SQL Injection Detection for Web Applications Based on Elastic-Pooling CNN[J]. IEEE Access, 2019, 7:151475-151481.
[7] Fang Y, Peng J, Liu L, et al. WOVSQLI: Detection of SQL injection behaviors using word vector and LSTM[C]. // Proceedings of the 2nd International Conference on Cryptography, Security and Privacy. 2018: 170-174.
[8] Abaimov S, Bianchi G. CODDLE: Code-injection detection with deep learning[J]. IEEE Access, 2019, 7: 128617-128627.
[9] Maaten L, Hinton G. Visualizing data using t-SNE[J]. Journal of machine learning research, 2008, 9(Nov): 2579-2605.
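For illustration, the variable-length training method of Section 4.4 (group Payloads of similar length into N subsets, use each subset's average length as its threshold, then truncate or zero-pad) could be sketched as follows. This is a hedged sketch rather than the authors' code. The function name `bucket_by_length` is hypothetical, token id 0 stands in for the zero vector, and truncation is assumed to keep the trailing tokens, on the reading that the injected part follows the parameter. The text's phrase "truncated from behind" could also be read the other way.

```python
from typing import List


def bucket_by_length(sequences: List[List[int]], num_buckets: int) -> List[List[List[int]]]:
    """Split token-id sequences into length-sorted subsets of fixed per-subset length.

    Each subset uses its own average length (rounded) as the length threshold,
    so every subset can be stacked into one batch for parallel training.
    """
    if not sequences:
        return []
    ordered = sorted(sequences, key=len)       # similar lengths end up adjacent
    size = -(-len(ordered) // num_buckets)     # ceiling division: subset size
    batches = []
    for start in range(0, len(ordered), size):
        group = ordered[start:start + size]
        threshold = max(1, round(sum(len(s) for s in group) / len(group)))
        batch = []
        for seq in group:
            if len(seq) > threshold:
                # Assumption: keep the tail, where the injected part is expected.
                seq = seq[-threshold:]
            # Pad short sequences with 0, a stand-in for the zero vector.
            batch.append(seq + [0] * (threshold - len(seq)))
        batches.append(batch)
    return batches


samples = [[1], [2, 3], [4, 5, 6], [7, 8, 9, 10, 11], [1, 2, 3, 4, 5, 6, 7]]
for batch in bucket_by_length(samples, num_buckets=2):
    print(batch)
```

Because the threshold is computed per subset rather than globally, long Payloads are grouped with other long Payloads and lose fewer tokens than they would under a single global length limit.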