2016 3rd International Conference on Advanced Computing and Communication Systems (ICACCS-2016), Jan. 22 & 23, 2016, Coimbatore, INDIA

Hybrid SQL Injection Detection System

B. Deva Priyaa (PG Student)
Department of Computer Science and Engineering
Kamaraj College of Engineering and Technology
Virudhunagar, India
devapriyaa92@gmail.com

M. Indra Devi (Professor)
Department of Computer Science and Engineering
Kamaraj College of Engineering and Technology
Virudhunagar, India
indradevicse@kamarajengg.edu.in

Abstract: The use of database-driven web applications is increasing every day, and attacks on those applications are increasing as well. One of the most common web application attacks is the SQL injection attack, a code injection in which an SQL query is inserted via input data sent from the client to the application. Many detection techniques have been implemented, but they focus on the SQL structure at the application level, so they fail to detect some attacks at the database level. Existing approaches use classification techniques with suitable kernel functions to detect attacks at the database level. Because SVM classification is a supervised learning algorithm, it cannot detect unknown attacks. In this paper, we propose a hybrid framework that combines the EDADT (Efficient Data Adaptive Decision Tree) algorithm, which is a semi-supervised algorithm, with the SVM classification algorithm. For good performance, the framework uses the internal query tree from the database log. The query tree is converted into an n-dimensional feature vector by way of multi-dimensional sequences. Semantic features are used as the components of the feature vector, and both syntactic and semantic features are used to generate the multi-dimensional sequences. Any extracted feature that contains a string value is then converted into a numeric value. Experimental results show that the proposed approach is more accurate in detecting attacks than existing approaches.

Keywords: SQL Injection Attack, Database, Data mining, SVM

I. INTRODUCTION

The World Wide Web is growing rapidly, with many web applications serving purposes such as financial transactions, educational endeavors, and countless other activities. Computer files have replaced paper files as electronic records take the place of their carbon-based counterparts. Everyone who uses computer technology also uses the internet and web applications. Web applications provide the user interface for database tasks such as inserting data, updating data, making queries and so forth. Those databases contain information such as customer names, preferences, credit card numbers, and purchase orders. It is therefore very important to prevent attackers from gaining unauthorized access to the web application and from accessing private information. The web application communicates with the database using Structured Query Language (SQL).

Many web developers believe that SQL queries are secure, but SQL queries can be tampered with by attackers. This means that attackers can bypass authentication and authorization checks and sometimes even gain access to operating-system-level commands. The SQLIA (Structured Query Language Injection Attack) is a code injection attack commonly used against websites and web applications.

The Open Web Application Security Project (OWASP) ranked SQLIA among the top security risks in 2013 [1]. These attacks are frequently used by malicious users for purposes such as financial fraud, theft of confidential data, web site defacement, sabotage and so forth. Common types of SQL injection attack include tautologies, illegal/incorrect queries, union queries, piggy-backed queries, alternate encodings, stored procedures, inference, evasion techniques, and stacked queries.

Many techniques have been proposed to detect SQLI attacks. These include pattern or string matching, input sanitization, randomization of SQL keywords, rule-based verification, positive tainting, entropy computations, reverse proxies, semantic comparisons, statement sequence digests and so forth. However, these techniques do not cover all known attacks and cannot be implemented on all platforms.

II. LITERATURE SURVEY

Yi Wang and Zhoujun Li [2] proposed a novel approach for finding malicious SQL statements. They used machine learning techniques, such as SVM one-class classification, to detect unauthorized access between the database and the application. Their approach incorporates the query tree structure of SQL queries as well as the similarity of input parameters and query values, which they used to distinguish malicious queries from benign ones. They used tree-vector kernels in the SVM for SQL statements; SQL injection in web applications was prevented by incorporating the syntax information of the query and the semantic context from the application when analyzing SQL queries.

Cristian I. Pinzon, Javier Bajo, Alvaro Herrero, Juan F. De Paz, Emilio Corchado, and Juan M. Corchado [3] proposed a technique incorporating a new classification model. They combined a neural network and a Support Vector Machine to classify SQL queries in a reliable way, and they also combined clustering and neural projection techniques to support the visual analysis and identification of target attacks. Their idea was a multi-agent architecture. The analysis, classification and decision-making capabilities, among others, were distributed across several layers of their proposed idMAS-SQL architecture.


Shailendra Kumar Shrivastav and Romil Rawat [4] described a framework for detecting SQL injection attacks using the Support Vector Machine algorithm. Suspicious queries were classified by analyzing datasets of original queries and suspicious queries. Their classifier learned from the dataset and classified queries according to the learned model. For detection, they compared SQL query strings and blocks with suspicious SQL query strings and blocks.

Jashan Koshal and Monark Bag [5] combined two algorithms, the C4.5 decision tree and the Support Vector Machine (SVM), into a hybrid for developing an intrusion detection system.

G. V. Nadiammai and M. Hemalatha [6] proposed the EDADT algorithm, which is a semi-supervised learning algorithm. They applied the particle swarm optimization (PSO) technique to extract efficient features from the training datasets. The result was passed to ant colony optimization (ACO) to identify the local and global values. Based on these, the unique values were found and assigned to the specified class, with no value left unspecified. To achieve this, they used probability to split the unique values.

III. SQL INJECTION METHOD

A. Based on the extraction channel

1) Inband or inline: SQL injections that use the same communication channel as the input to dump the information back are called inband or inline SQL injections. Example: a query parameter.
2) Out-of-band: Injections that use a secondary or different communication channel to dump the output of queries performed via the input channel are referred to as out-of-band SQL injections. Example: an injection made through the web application with the output delivered via a DNS query.

B. Based on the response from the server

1) Error-based SQL injection: Error-based SQL injections are primarily those in which the SQL server dumps some errors back to the user via the web application, and these errors aid successful exploitation.
a) Union query type
b) Double query injection
2) Blind SQL injection: Blind SQL injections are those in which the backend database reacts to the input, but the errors are concealed by the web application and not displayed to the end user, or the output is not dumped directly to the screen. The name 'blind' comes from the fact that the attacker injects blindly, relying on calculated assumptions and trial and error.
a) Boolean-based blind injection
b) Time-based blind injection

C. Based on how the input is treated

1) String-based
2) Numeric or integer-based

Based on how the input parameter is treated in the back-end SQL query, an injection can be classified as string-based or integer-based.

D. Based on the degree or order of injection

1) First-order injections
2) Second-order injections

The degree or order of injection identifies the way in which the injection yields its output. If the injection directly delivers the result, it is considered a first-order injection. If the injected input yields no successful result at extraction but instead affects some other place or page, it is called a second-order injection. In a second-order injection, the input is stored in the application and later rendered on some other page, thereby affecting that page indirectly because of the initial malicious input.

E. Based on the injection point or location

1) Injection through user input form fields: Here, attackers inject SQL commands by using suitably crafted user input. Depending on the application's environment, the web application reads user input in several ways. Most attacker input comes from form submissions that are sent to the web application via HTTP GET or POST requests.
2) Injection through cookies: Cookies are files that contain state information generated by web applications and stored on the client machine. When a client returns to a web application, cookies can be used to restore the client's state information. Since the client has control over the storage of the cookie, a malicious client could tamper with the cookie's contents. If a web application uses the cookie's contents to build SQL queries, an attacker could easily submit an attack by embedding it in the cookie.
3) Injection through server variables (header-based injection): Server variables are a collection of variables that contain HTTP headers, network headers, and environment variables. Web applications use these server variables in a variety of ways, such as logging usage statistics and identifying browsing trends. If these variables are logged to a database without sanitization, this can create an SQL injection vulnerability. Because attackers can forge the values placed in HTTP and network headers, they can exploit this vulnerability by placing an SQLIA directly into the headers. When the query that logs the server variable is issued to the database, the attack in the forged header is triggered.

IV. SQLIA DETECTION FRAMEWORK

To determine whether an SQL statement is malicious, we build the SQLI detection framework shown in Fig. 1 using the EDADT algorithm [6] and the SVM algorithm [7] with suitable kernel functions.
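The string-based and numeric categories of Section III-C can be made concrete with a short sketch. It is only an illustration, not part of the proposed framework: the `users` table, its columns, and the helper names are hypothetical, and Python's built-in `sqlite3` module stands in for the web application's database. The queries are built by string concatenation on purpose, to show how a tautology changes the meaning of each kind of query.

```python
import sqlite3

# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])


def string_based_query(name: str) -> str:
    # String-based: the input lands inside a quoted SQL string literal.
    return f"SELECT id FROM users WHERE name = '{name}' ORDER BY id"


def numeric_query(user_id: str) -> str:
    # Numeric: the input lands where an unquoted integer is expected.
    return f"SELECT id FROM users WHERE id = {user_id} ORDER BY id"


def run(sql: str) -> list:
    # Execute the (deliberately unsafe) concatenated query and return the ids.
    return [row[0] for row in conn.execute(sql)]


benign_string = run(string_based_query("alice"))
attack_string = run(string_based_query("x' OR '1'='1"))  # closes the quote, adds a tautology
benign_numeric = run(numeric_query("2"))
attack_numeric = run(numeric_query("2 OR 1=1"))          # no quote needed for numeric input

print(benign_string, attack_string)
print(benign_numeric, attack_numeric)
```

In both cases the benign input matches a single row, while the injected tautology makes the WHERE clause true for every row. The only difference is that the string-based payload must first close the quoted literal.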


Fig. 1. SQLIA Detection Framework

A. Collection Phase

In this phase, data are collected from an attacker or a normal user who submits input parameters to the web application through the HTTP URL using the POST method. The submitted queries are stored in the database logs as query trees. The Query Tree Log Collector collects the query trees stored in the log files of the database system.

B. Pre-processing Phase

Here the query trees are converted into n-dimensional feature vectors using the following steps.

1) Feature Extraction: The feature extractor fetches the query tree from the database logs [8] and loads it into memory by a linear scan. The in-memory query tree is implemented using Java objects. The name of the top-level class is the label of the root node, the names of attributes are the labels of internal nodes, and the values of attributes are the labels of leaf nodes. While loading the query tree, the feature extractor extracts the internal nodes and leaf nodes. The result of feature extraction for a query is a set of multi-dimensional sequences, in which the leaf nodes are used as the features. The structure of the query varies from one query tree to another. To maintain the same sequential dimension for all query trees, we need a common basis for generating each sequence. We therefore use the internal paths of the query tree based on the full SQL grammar, instead of the individual internal paths of the various query trees. To find the internal paths of the full query tree, it is traversed by depth-first search (DFS), and the labels of the internal nodes on each path are listed in pre-order, that is, in the order in which they are first visited by DFS. The internal paths found in this way generate the criteria sequence. The length of the sequence dimension of the multi-dimensional sequences is set to that of the criteria sequence. Then the leaf nodes are extracted to generate the multi-dimensional sequences. In our method, the sequences possess both syntactic and semantic features [9]. The order and length of the sequences imply the syntactic features of the query tree, and the objects of the sequences explicitly express the semantic features of the query tree.

2) Feature Transformation: If a feature value in the multi-dimensional sequences is of string type, the feature extractor asks the feature transformer to transform it. The transformer replaces the string value with a numeric value based on the nature of the string value by combining statistical models. A string value extracted from the query tree falls into one of three cases: a constant used as a column value of a table, a constant used as a parameter of a function, or the type of the node.

• In the case of a column value, the transformation method combines the string length model and the histogram model. The histogram stored in the database system catalog represents the distribution of values for a specific column. The histogram model determines whether the string value lies within the bounds of the histogram and calculates an indicator expressed as a Boolean value {true = 1, false = -1}. The string value is then replaced with the numeric value obtained by multiplying the string length by the indicator.
• In the case of a function parameter, the transformation method combines the string length model and the character distribution model. The character distribution model determines whether each character of the string value is an element of the set of previously observed characters, and the indicator is calculated and expressed as a Boolean value. The numeric value is then found by multiplying the string length by the value of the indicator.
• In the case of the type of a node, the string value is treated as a nominal value and is replaced with the numeric value of its enumeration.

3) N-dimensional feature vector generation: We build the vector generator, which concatenates the multi-dimensional sequences, using a padding value, and appends the length of the SQL statement to the end of the concatenated sequences. The vector generator generates a set of instances using the WEKA library [10]. Each instance expresses an n-dimensional feature vector. The generated instances are then used as the input to the SVM classification.

C. Training Phase

The training module is built by cascading EDADT and SVM for accurate classification.

1) EDADT (Efficient Data Adapted Decision Tree): The EDADT algorithm is a semi-supervised learning algorithm. Using it, we split the data set into attack and normal classes without leaving any data unclassified. It splits the data set into three class labels, known as the same class label, the different class label and the relevant class label. Initially, the PSO technique [11] was applied to extract the efficient features from the given
training datasets. The best features are applied to ACO [12] as input, and the pheromone is then initialized to obtain the optimal solution. Classification is made based on the class label. The probability of each class value is computed to find the class label for each value accurately. Then the normalized information gain is found for each attribute, and a decision node is created that splits on the attribute with the highest normalized information gain.

2) SVM (Support Vector Machine): This module contains two components, the model generator and the model evaluator. We generate several SVM binary classification models using the model generator. The EDADT result is used to train the SVM classifier. The model evaluator chooses the optimal binary classifier from the generated models. To increase classification accuracy, a suitable kernel function and suitable kernel parameters are used. To optimize the SVM, the Sequential Minimal Optimization (SMO) learning algorithm is used, which handles large training datasets. The model evaluator evaluates the performance of the binary classification models by performing k-fold cross-validation for each SMO model. The evaluator reports the performance of the classifier using several measurements, such as true positive rate, false positive rate, accuracy and AUC. Depending on the results, the best SVM model is chosen.

D. Detection Phase

In this module, the class label of the testing data is predicted. The n-dimensional feature vector for the testing data is converted from the query tree of the testing data in the same manner as for the training data. The SQLIA classifier determines whether the new testing feature vector is normal or malicious using the optimized SVM classification model.

V. QUERY TREE CONVERSION

A. Multi-dimensional sequence generation

The input to our framework is gathered from the database logs of the database system in the form of query trees. The result of feature extraction from a query tree $T$ is the set of multi-dimensional sequences $S$. The criteria sequence $P = \{P_1, P_2, \ldots, P_k\}$, where $k$ is the number of internal nodes, is generated from the internal path of each internal node $n_i$, i.e., $(\mathrm{root}(T), n_i)$. The length of the multi-dimensional sequences of each query is set to that of the criteria sequence. The leaf nodes of each internal node are extracted as the features. This generates the sequence $P_i = \{o_{i1}, o_{i2}, \ldots, o_{i l_i}\}$, where $l_i$ is the number of children of $n_i$.

B. Feature Transformation

If any feature in the multi-dimensional sequences is a string value, it is converted into a numeric value based on the category of the string value described above.

C. N-dimensional vector generation

We convert the multi-dimensional sequences into the n-dimensional vector using the converted sequences $P_x$. $P_x$ is transformed into

$$V_x = \{o_{11}, o_{12}, \ldots, o_{1 l_1},\; o_{21}, o_{22}, \ldots, o_{2 l_2},\; \ldots,\; o_{k1}, o_{k2}, \ldots, o_{k l_k}\},$$

where the length of the vector is $|V_x| = \sum_{i=1}^{k} l_i$.

VI. FRAMEWORK EVALUATION

For the experiments, we developed a movie recommendation system consisting of user login, rating and movie details. We created the web application using PHP and a PostgreSQL v9.4 database, which we accessed through the XAMPP web server v3.2.1. The database for our system was created from the MovieLens dataset.

The dataset for our framework was created with the movie recommendation system and consists of a normal query dataset and a malicious query dataset.

We used normal and malicious query generators to submit queries to the web application. We categorized these queries into three groups depending on the type of command: SELECT statements belong to Group 1, INSERT statements to Group 2, and stored procedures to Group 3.

Whenever these queries are submitted to the web application, the system stores the submitted queries in the log file in the form of query trees. We collect these query trees to obtain the multi-dimensional sequences. The average time taken to generate the multi-dimensional sequences across the three groups is 14.811 seconds for normal queries and 19.775 seconds for malicious queries.

The n-dimensional feature vectors are obtained from the multi-dimensional sequences in an average time of 1.25 seconds.

We use those vectors as the instances for our hybrid detection framework. Table I shows the results of our proposed framework for each group.

TABLE I. Result of Our Framework

Groups     Accuracy   True Positive Rate   False Positive Rate
Group 1    99.96%     0.999                0.0
Group 2    99.96%     1.0                  0.0
Group 3    99.7%      0.996                0.02

Table II shows the comparative analysis of our framework with some previous techniques.

TABLE II. Comparative Analysis of Our Framework

Techniques                      Accuracy
Our Hybrid Approach             99.87%
Composite kernel in SVM [2]     99.6%
idMAS-SQL [3]                   99.01%
SVM [4]                         96.47%
C4.5 and SVM [5]                99.87%
EDADT [6]                       98.12%
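The measures reported in Table I (accuracy, true positive rate and false positive rate) are all derived from the counts of correct and incorrect predictions. The sketch below shows one way to compute them, assuming that "malicious" is treated as the positive class. The labels in the example are toy data invented for illustration; they are not the paper's experimental data, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ConfusionCounts:
    tp: int  # malicious queries flagged as malicious
    fp: int  # normal queries flagged as malicious
    tn: int  # normal queries passed as normal
    fn: int  # malicious queries passed as normal


def count_outcomes(actual: Sequence[str], predicted: Sequence[str],
                   positive: str = "malicious") -> ConfusionCounts:
    counts = ConfusionCounts(0, 0, 0, 0)
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                counts.tp += 1
            else:
                counts.fp += 1
        else:
            if a == positive:
                counts.fn += 1
            else:
                counts.tn += 1
    return counts


def true_positive_rate(c: ConfusionCounts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def false_positive_rate(c: ConfusionCounts) -> float:
    return c.fp / (c.fp + c.tn) if (c.fp + c.tn) else 0.0


def accuracy(c: ConfusionCounts) -> float:
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


# Toy labels for illustration only (not results from the paper).
actual = ["malicious"] * 3 + ["normal"] * 5
predicted = ["malicious", "malicious", "normal",
             "normal", "normal", "normal", "malicious", "normal"]

counts = count_outcomes(actual, predicted)
print(counts, true_positive_rate(counts), false_positive_rate(counts), accuracy(counts))
```

In a k-fold cross-validation, the same counts would be accumulated over the held-out folds before the rates are computed.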

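The query tree conversion of Section V (feature transformation followed by n-dimensional vector generation) can be illustrated with a small sketch. Several details are assumptions made for the example and are not stated in the paper: the histogram bounds are modelled as a lexicographic (low, high) pair, the padding value is 0, each sequence is padded to a fixed slot length, and the node-type names are hypothetical. The paper's own implementation uses Java objects and the WEKA library; Python is used here only for brevity.

```python
from typing import Dict, List, Sequence, Set, Tuple


def indicator(ok: bool) -> int:
    # Boolean indicator as described in the text: true = 1, false = -1.
    return 1 if ok else -1


def transform_column_value(value: str, bounds: Tuple[str, str]) -> int:
    # String length model x histogram model.
    # Assumption: the histogram bounds are a lexicographic (low, high) pair.
    low, high = bounds
    return len(value) * indicator(low <= value <= high)


def transform_function_parameter(value: str, seen_chars: Set[str]) -> int:
    # String length model x character distribution model: every character
    # must belong to the set of previously observed characters.
    return len(value) * indicator(all(ch in seen_chars for ch in value))


def transform_node_type(value: str, enumeration: Dict[str, int]) -> int:
    # A nominal value is replaced by its enumeration number.
    return enumeration[value]


def build_feature_vector(sequences: Sequence[Sequence[int]],
                         slot_lengths: Sequence[int],
                         sql_length: int, pad: int = 0) -> List[int]:
    # Concatenate the sequences, pad each one to its slot length
    # (assumed padding scheme), and append the SQL statement length.
    vector: List[int] = []
    for seq, slot in zip(sequences, slot_lengths):
        if len(seq) > slot:
            raise ValueError("sequence longer than its slot")
        vector.extend(seq)
        vector.extend([pad] * (slot - len(seq)))
    vector.append(sql_length)
    return vector


# Illustrative values only.
normal_value = transform_column_value("alice", ("a", "m"))
attack_value = transform_column_value("x' OR '1'='1", ("a", "m"))
observed = set("abcdefghijklmnopqrstuvwxyz%")
normal_param = transform_function_parameter("ali%", observed)
attack_param = transform_function_parameter("a';--", observed)
node_code = transform_node_type("FUNCTION", {"CONST": 0, "COLUMN": 1, "FUNCTION": 2})
vector = build_feature_vector([[normal_value], [normal_param, node_code], []],
                              slot_lengths=[2, 2, 1], sql_length=42)
print(normal_value, attack_value, normal_param, attack_param, vector)
```

Under these assumptions, a value outside the histogram bounds or containing unseen characters becomes a negative number whose magnitude is the string length. This gives the classifier a numeric signal for both the size and the abnormality of the input.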

VII. CONCLUSION

In this paper, we proposed a framework for detecting SQL injection attacks at the database level using the SVM and EDADT algorithms. The main contribution of this paper is the combination of the EDADT and binary SVM classification algorithms so that unknown attacks can also be learned efficiently. The comparative results show that our framework detects malicious queries accurately in comparison with previous work.

REFERENCES

[1] https://www.owasp.org/index.php/Top_10_2013-Top_10
[2] Yi Wang and Zhoujun Li, “SQL Injection Detection with Composite Kernel in Support Vector Machine”, International Journal of Security and Its Applications, vol. 6, no. 2, April 2012.
[3] Cristian I. Pinzon, Javier Bajo, Alvaro Herrero, Juan F. De Paz, Emilio Corchado, and Juan M. Corchado, “idMAS-SQL: Intrusion Detection based on MAS to Detect and Block SQL injection through data mining”, Elsevier: Information Sciences journal, 2011.
[4] Shailendra Kumar Shrivastav and Romil Rawat, “SQL Injection Attack detection using SVM”, International Journal of Science Applications, vol. 42, no. 13, March 2012.
[5] Jashan Koshal and Monark Bag, “Cascading of C4.5 Decision Tree and Support Vector Machine for Rule Based Intrusion Detection System”, I.J. Computer Network and Information Security, pp. 8-20, August 2012.
[6] G. V. Nadiammai and M. Hemalatha, “Effective approach toward Intrusion Detection System using data mining techniques”, Elsevier: Egyptian Informatics Journal, pp. 37-50, December 2013.
[7] Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, Elsevier, 3rd ed., pp. 408-415, 2012.
[8] W. L. Low, J. Lee, and P. Teoh, “DIDAFIT: Detecting intrusions in databases through fingerprinting transactions”, Springer: Databases and Information systems integrations, pp. 121-128, 2010.
[9] S. Y. Lee, W. L. Low, and P. Y. Wong, “Learning fingerprints for the database intrusion detection system”, Springer: In Computer Security, pp. 264-279, 2002.
[10] I. H. Witten, E. Frank, and M. A. Hall, “Data mining: Practical machine learning tools and techniques”, Elsevier, 2011.
[11] Fatima Ardjani and Kaddour Sadouni, “Optimization of SVM Multiclass by Particle Swarm (PSO-SVM)”, I.J. Modern Education and Computer Science, no. 2, pp. 32-38, 2010.
[12] K. Praveen Kumar and P. Kamakshi, “Ant colony optimization algorithm for computer intrusion detection”, 2006.
