
Prevention of SQL Injection Attacks Using LightGBM

D. Rashmi¹, N. Anvesh², CH. Sai Hanisha³, G. Shravya⁴, Preethi Jeevan⁵

¹ ² ³ ⁴ Student, Department of Computer Science and Engineering
⁵ Associate Professor, Department of Computer Science and Engineering

ABSTRACT: SQL injection attacks are among the most common and fundamental types of cyber attack: they execute arbitrary malicious code to obtain sensitive, confidential information belonging to a client or an organisation and residing in the database. This paper aims to provide an overview of SQL injection attacks and the vulnerabilities they exploit. Based on this analysis, we study various preventive solutions, one of which is prevention with the help of the LightGBM algorithm. The paper also covers the different stages of the working model and future applications of the algorithm.
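To make the attack class concrete before the discussion below, here is a minimal sketch (using Python's sqlite3 with a hypothetical `users` table, for illustration only) contrasting a string-concatenated query, which is injectable, with the parameterized form discussed later in the paper:

```python
import sqlite3

# Hypothetical in-memory database with a users table (illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

attacker_input = "' OR '1'='1"  # classic tautology payload

# Vulnerable: attacker input is concatenated directly into the SQL text,
# so the tautology makes the WHERE clause always true.
unsafe = "SELECT name FROM users WHERE password = '" + attacker_input + "'"
print(conn.execute(unsafe).fetchall())   # returns every row: [('alice',)]

# Safe: a parameterized query treats the payload as a literal string value.
safe = "SELECT name FROM users WHERE password = ?"
print(conn.execute(safe, (attacker_input,)).fetchall())  # returns no rows: []
```

The only difference between the two queries is whether the attacker-controlled string is interpreted as SQL code or as data.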

I. INTRODUCTION

Many cyber-physical systems are vital for safety and need to be protected from cyberattacks and unplanned malfunctions. One of the strongest risks is the SQL injection attack (SQLIA). SQLIAs use malicious SQL code to corrupt backend databases and gain access to sensitive data, including private consumer information or confidential company data. As of 2023, they were ranked sixth in the Common Weakness Enumeration. These attacks, first noted by cybersecurity researcher Jeff Forristal in 1998, can take over an administrator's account on a database server, spoof identities, and expose, alter, destroy, or render data unavailable. Error disclosure, insufficient input validation, and the mixing of code and data in dynamic SQL statements are the main problems that lead to SQLIAs. Strong input validation and the use of parameterized queries, including prepared statements, provide significant security, even though there is no one-size-fits-all solution that totally prevents SQLIAs. Furthermore, deep learning and artificial intelligence models can evaluate past attack data, identify trends, and forecast upcoming assaults, improving defences against SQL injection. Moreover, vulnerabilities can be found before they are exploited through frequent security audits and ongoing monitoring. By combining these tactics, businesses can protect their vital systems and data while drastically lowering the threat that SQL injection attacks pose.

II. LITERATURE SURVEY:

Existing approaches to addressing SQL injection (SQLI) involve various strategies, such as statistical analysis, dynamic analysis, or hybrid strategies. These methods tackle web application vulnerabilities, covering facets like scanning, input validation, parameterized queries, regular software updates, security audits, code reviews, and Web Application Firewalls (WAF). Upon careful investigation of current machine learning algorithms, including XGBoost, SVM, Logistic Regression, K-Nearest Neighbour (KNN), Random Forest, Decision Tree, Neural Networks (Deep Learning), Gradient Boosting Machines (GBM), Linear Regression, and Principal Component Analysis (PCA), we have identified methodologies that enhance protection against SQL attacks. These algorithms play a pivotal role in preventing SQL intrusion by providing supply-chain optimization and fraud detection, and by improving efficiency and quality through the use of Natural Language Processing (NLP).

III. PROPOSED MODEL:

A. METHODOLOGY:

The methodology helps the user implement decisions in a more effective way. The terms "analysis of system" and "analysis of requirements" are identical. In order to deliver the requirements to the user, a particular process is required. This process mainly involves imagination and dividing the system into several constituents, which helps in better analysis of the structure and in evaluating the goals of a project.
The pictorial representation portrays the functionalities of the software: generally, the user produces SQL attack traffic; before pre-processing is performed, the data is sent to the software, where it is processed and data cleaning takes place; finally, the machine learning models are fitted to the data and the best-performing model is determined.
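The model-comparison step just described can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data; it assumes cleaning and feature extraction have already produced numeric arrays, and the candidate models shown are placeholders, not the paper's full set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for cleaned, feature-extracted SQL-query data
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit several candidate models and keep the one with the best test accuracy
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_model = max(scores, key=scores.get)
print(best_model, scores[best_model])
```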
The process of evolving the data, modules, architecture, and interfaces is known as system design. The move towards producing system designs is rapidly increasing. System design can also be described as the application of systems theory to producing a product.
The architecture of the system can be elaborated as the process in which the user inputs data to the system; data cleaning then takes place, and a new set of testing data is created from the existing cleaned data, which is later supplied to the machine learning models. The data takes the form of a dataset and is cleaned; this data is then given as input to the various algorithms, and according to the precision of the results the most unambiguous model is selected. The major constituents of the flow diagram are input data, model, and prediction.

B. ALGORITHM:

LIGHT GBM:
Light gradient boosting machine, also termed LightGBM, is constructed using decision trees and is applied to classification. It is a boosting framework contributed by Microsoft for various machine learning algorithms. It is widely popular because it is free and open source. Its structural advances play a crucial role in performance. The framework supports various algorithms such as MART, GBM, GBDT, and GBT. Its main advantages are multiple loss functions, regularization, sparse optimization, parallel training, etc.
LightGBM uses a tree structure. It does not grow level by level; instead, it grows from the ground up, one leaf at a time, unlike previous massive applications. Instead of the sorted decision-tree implementation used by XGBoost and its variants, LightGBM finds the optimal split point on histograms and conserves memory, which improves speed. It runs multiple methods, including gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB).
The distributed version of LightGBM can train a CTR predictor on datasets such as the Criteo dataset. It can be executed on Linux and used from C++, C#, R, and Python.

VARIATION OF LIGHT GBM AND TREE-BASED ALGORITHMS:
The standard tree-based algorithm grows in a horizontal manner, whereas LightGBM grows in a vertical manner. The major distinction is that tree-based methods increment level by level while LightGBM proceeds leaf by leaf. The pictorial representation below portrays the tree-based algorithm and LightGBM.

B. RESISTANCE OF LIGHT GBM:
The term LightGBM comprises the word "light", which denotes speediness equivalent to the speed of light. It occupies a minimal amount of space but can tackle a humongous amount of data. It expands the data in an exponential format and uses typical methodologies for producing rapid results. It is popular among data scientists due to its use in GUI-based data science applications.

C. BENEFITS OF LIGHT GBM:
1. It is capable of handling humongous datasets within a minimal training period, unlike XGBoost.
2. It consumes less memory, as it replaces continuous values with discrete bins.
3. Speed is improved by splitting on discrete bins using histogram-based processing.
4. As it follows leaf-based rather than level-based division, high accuracy is obtained. The problem of over-fitting is overcome by constraining the maximum depth.

D. CLAIMS FOR LIGHT GBM:
1. For multi-class problems, log-loss objective functions are utilized.
2. There are many ways to designate classes.
3. To classify binary data, an objective function such as log loss is used, and regression models are based on L2 loss.
LightGBM is a fast algorithm which uses machine learning techniques and produces more accurate results, for which it is often used. Gradient-based processes increase the performance of under-performing trees in well-known tree-based methods such as GBM, XGBoost, and LGBM. A few LightGBM settings are manual, but many are chosen automatically to increase its efficacy. It also enhances the system and tunes the algorithm to upgrade a rudimentary framework.

THE LIGHTGBM ALGORITHM
Input:
- Training data: \( D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\} \), where \( y_i \in \{-1, +1\} \)
- Feature space: \( \chi \)
- Loss function: \( L(y, \theta(x)) \)
- Number of iterations: \( M \)
- Big-gradient data sampling ratio: \( a \)
- Small-gradient data sampling ratio: \( b \)

Procedure:
1. Exclusive Feature Bundling (EFB):
Combine mutually exclusive features (features that never take nonzero values simultaneously) into bundles using the EFB technique.

2. Initialize Model Parameters:
Set the initial model \( \theta_0 \) to minimize \( \sum_{(x_i, y_i) \in D} L(y_i, \theta(x_i)) \).

3. Gradient-based Iterations:
For \( m = 1 \) to \( M \) do:

a. Calculate Gradient Absolute Values:
Compute the absolute gradient values \( r_i = \left| \frac{\partial L(y_i, \theta_{m-1}(x_i))}{\partial \theta_{m-1}(x_i)} \right| \) with respect to the current model.

b. Gradient-based One-Side Sampling (GOSS):
Resample the dataset \( D \) using GOSS:
- Determine the top \( a \times |D| \) instances with the largest gradient values, forming set \( A \).
- Randomly select \( b \times |D| \) instances from the remaining data, forming set \( B \).
- Construct a new dataset \( D' = A \cup B \).

c. Calculate Information Gains:
Compute the information gain \( v_j(d) \) for each feature \( j \) and split point \( d \) on dataset \( D' \), up-weighting the sampled small-gradient instances by \( \frac{1-a}{b} \):
\[ v_j(d) = \frac{1}{n} \left( \frac{\left( \sum_{x_i \in A_l} r_i + \frac{1-a}{b} \sum_{x_i \in B_l} r_i \right)^2}{n_l^j(d)} + \frac{\left( \sum_{x_i \in A_r} r_i + \frac{1-a}{b} \sum_{x_i \in B_r} r_i \right)^2}{n_r^j(d)} \right) \]
where \( A_l, B_l \) (respectively \( A_r, B_r \)) are the instances of \( A, B \) falling to the left (respectively right) of split \( d \), and \( n_l^j(d), n_r^j(d) \) are the corresponding instance counts.

d. Build a New Decision Tree \( f_m(x) \) on Dataset \( D' \):
Develop a new decision tree \( f_m(x) \) using dataset \( D' \) and the calculated information gains.

e. Update Model Parameters:
Update the model:
\[ \theta_m(x) = \theta_{m-1}(x) + f_m(x) \]

4. Final Model:
Return the trained model \( \theta(x) = \theta_M(x) \).
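The GOSS resampling of step 3b can be illustrated in a few lines of NumPy. This is a sketch of the sampling step only (the \( \frac{1-a}{b} \) re-weighting keeps the gain estimate of step 3c approximately unbiased); function and variable names are illustrative:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=np.random.default_rng(0)):
    """Return (indices, weights) for GOSS resampling of one boosting round."""
    n = len(gradients)
    order = np.argsort(np.abs(gradients))[::-1]  # sort by |gradient|, descending
    top_k, rand_k = int(a * n), int(b * n)
    A = order[:top_k]                            # keep all large-gradient instances
    B = rng.choice(order[top_k:], size=rand_k, replace=False)  # sample the rest
    idx = np.concatenate([A, B])
    # Large-gradient rows keep weight 1; sampled rows are up-weighted by (1-a)/b
    weights = np.concatenate([np.ones(top_k), np.full(rand_k, (1 - a) / b)])
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx))  # 0.2*1000 + 0.1*1000 = 300 instances feed the next tree
```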
IV. OPTIMISATION OF SYSTEM:
Hardware optimization is described as the process of increasing performance and making efficient use of the available resources.
Key aspects of hardware optimization involve:

A. CACHE AWARENESS:
The main purpose of a cache is to store the most frequently accessed data, which helps in reducing fetch time. Fetching data from the cache is much faster than fetching it from main memory. Cache awareness involves designing algorithms with the cache architecture in mind. This reduces cache misses and increases access speed; hence overall performance is improved.

B. OUT-OF-CORE COMPUTING:
Out-of-core computing comes into the picture when dealing with datasets that are too large to fit into available memory. This approach divides the large dataset into a number of chunks, and each chunk is processed sequentially. Instead of loading the entire dataset into memory, only the necessary chunks are loaded at a time. This approach helps in handling large datasets and provides efficient utilization of storage resources. Hardware optimization also involves memory management. In this research, we have come across different ensemble machine learning techniques such as XGBoost, AdaBoost, LGBM, and Random Forest.

V. UML DIAGRAMS:
UML stands for Unified Modelling Language. It is a common modelling language for software projects, used for visualizing various aspects of software systems such as their design, architecture, and implementation. A UML diagram is a visual representation of a set of UML elements (such as classes, interfaces, and state machines) and the relationships among them; the language supports visualizing, specifying, constructing, and documenting software projects.

VI. FEATURE EXTRACTION:
Feature extraction uses tokenization to extract attributes. Tokenization is the process of dividing SQL queries into a number of tokens; these tokens can be keywords or other individual units. Features are the characteristics or properties of the tokenized queries, identified from the counts of specific attributes associated with the tokens. These features are extracted based on the counts of elements such as double quotes, single quotes, keywords, alphabetic characters, etc. At least twenty features are extracted for each query. These features are helpful in detecting SQL injection attacks and in differentiating legitimate from fraudulent SQL queries. Detecting SQL injection attacks involves identifying traits such as stored procedures, illegal queries, and redundancies. The selected features help distinguish legitimate from illegal SQL queries, especially those containing SQL injection keywords.

VII. CODE:

Gradient Boosting Classifier:

# Gradient Boosting Classifier (using sklearn)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# Instantiate the Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier()

# Fit the model on the training dataset
gb_classifier.fit(train_features, train_labels)

# Predict on training and test datasets
train_pred_gbc = gb_classifier.predict(train_features)
test_pred_gbc = gb_classifier.predict(test_features)

# Evaluate performance - accuracy
acc_train_gbc = accuracy_score(train_labels, train_pred_gbc)
acc_test_gbc = accuracy_score(test_labels, test_pred_gbc)

# Print accuracy results
print("Gradient Boosting: Accuracy on training data: {:.7f}".format(acc_train_gbc))
print("Gradient Boosting: Accuracy on test data: {:.7f}".format(acc_test_gbc))

# Generate classification report (includes per-class support by default)
classification_report_gbc = classification_report(test_labels, test_pred_gbc, labels=[0, 1])

# Storing the results
results_gbc = pd.DataFrame([{
    'Model': 'Gradient Boosting',
    'Train Accuracy': acc_train_gbc,
    'Test Accuracy': acc_test_gbc,
}])
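As a companion to the classifier code, the tokenization-based feature extraction described in Section VI can be sketched as follows. The feature names and the small keyword set are illustrative placeholders, not the paper's exact twenty features:

```python
import re

# Illustrative subset of SQL keywords (the real feature set is larger)
SQL_KEYWORDS = {"select", "union", "insert", "update", "delete", "drop", "or", "and"}

def extract_features(query: str) -> dict:
    """Count-based features of a SQL query, in the spirit of Section VI."""
    tokens = re.findall(r"[A-Za-z_]+|\S", query.lower())
    return {
        "length": len(query),
        "keyword_count": sum(t in SQL_KEYWORDS for t in tokens),
        "single_quotes": query.count("'"),
        "double_quotes": query.count('"'),
        "semicolons": query.count(";"),
        "has_comment": int("--" in query or "/*" in query),
    }

print(extract_features("SELECT name FROM users WHERE id = '' OR '1'='1' --"))
```

A feature vector like this, computed for every query in the dataset, is what the classifiers above consume.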
XGBoost Classifier:

# XGBoost Classifier (using xgboost)
from xgboost import XGBClassifier

# Instantiate the XGBoost Classifier
xgb_classifier = XGBClassifier(learning_rate=0.4, max_depth=7)

# Fit the model on the training dataset
xgb_classifier.fit(train_features, train_labels)

# Predict on training and test datasets
train_pred_xgb = xgb_classifier.predict(train_features)
test_pred_xgb = xgb_classifier.predict(test_features)

# Evaluate performance - accuracy
acc_train_xgb = accuracy_score(train_labels, train_pred_xgb)
acc_test_xgb = accuracy_score(test_labels, test_pred_xgb)

# Print accuracy results
print("XGBoost: Accuracy on training data: {:.7f}".format(acc_train_xgb))
print("XGBoost: Accuracy on test data: {:.7f}".format(acc_test_xgb))

# Generate classification report (includes per-class support by default)
classification_report_xgb = classification_report(test_labels, test_pred_xgb, labels=[0, 1])

# Storing the results
results_xgb = pd.DataFrame([{
    'Model': 'XGBoost',
    'Train Accuracy': acc_train_xgb,
    'Test Accuracy': acc_test_xgb,
}])

Support Vector Machine Classifier:

# Support Vector Machine Classifier (using sklearn)
from sklearn.svm import SVC

# Instantiate the Support Vector Machine Classifier
svm_classifier = SVC(kernel='linear', C=1.0, random_state=12)

# Fit the model on the training dataset
svm_classifier.fit(train_features, train_labels)

VIII. DATASET:

Building and testing machine learning models targeted at identifying and averting SQL injection attacks requires access to the dataset utilized in this work. It provides a thorough collection of SQL queries classified as either dangerous or regular, sourced from the Open Source Security Foundation (OSSF) SQL Injection Dataset. The dataset's salient characteristics comprise the SQL query string itself, a label designating the query's maliciousness (1) or normality (0), the query's length, the quantity of SQL keywords it contains, the count of special characters (like quotes and semicolons), and a binary indicator of the existence of SQL comments. Through the use of these attributes, the machine learning model is better able to recognize harmful patterns in the text and structure of SQL queries. To facilitate the construction of additional features like query length and keyword count, preprocessing stages included tokenizing SQL queries into components like keywords, operators, and values. To distinguish between legitimate and malevolent queries, labels were encoded as binary values. To increase model accuracy and robustness, methods including undersampling and oversampling (such as SMOTE, the Synthetic Minority Over-sampling Technique) were used to resolve any imbalances in the dataset, resulting in a balanced representation of both malicious and regular queries. The final dataset has a sizable number of queries that provide a variety of instances for the model to learn from. By utilizing this varied and extensive dataset, the project intends to further the creation of strong machine learning models capable of successfully identifying and preventing SQL injection attacks, thereby improving cybersecurity measures.

IX. TRAINING THE MODEL:
The aforesaid methodologies are utilized to separate the useful datasets from the others. These datasets are constructed into the required form, and the model is trained on them. For training we make use of a machine learning algorithm such as LightGBM, which is brought into the project and used to train the model.

X. TESTING THE MODEL:
To determine the accuracy, we perform a testing phase for our trained model. The model obtained after the training phase is evaluated using the test data. As we make use of numerous machine learning algorithms to categorize and train our datasets, accuracy is a key factor to be considered: the more accurate our estimated models are, the easier it becomes to predict upcoming attacks.

XI. RESULTS:

S.NO | ML MODEL          | TRAIN ACCURACY | TEST ACCURACY
1    | XGBoost           | 1.000          | 0.998
2    | LightGBM          | 1.000          | 0.998
3    | Gradient Boosting | 0.997          | 0.997
4    | SVM               | 0.978          | 0.982

XII. CONCLUSION:
Ultimately, LightGBM's ability to stop SQL injection attacks is a major improvement over existing cybersecurity protocols. Detecting the complicated patterns suggestive of SQLIAs is made possible by the gradient boosting framework LightGBM, which delivers high performance and efficiency. LightGBM is capable of accurately differentiating between benign and harmful activity by utilizing historical data and learning from previous attacks. Enhancing database security is made easier by its real-time predictions and its speedy processing of huge datasets. SQL injection threats can be significantly reduced when LightGBM is combined with strong input validation and parameterized queries, protecting the confidentiality and integrity of important data. This method offers a dynamic and robust way to protect against SQLIAs while simultaneously fortifying an organization's defensive stance and catering to changing threats.
XIII. FUTURE SCOPE:
To increase the efficiency of the current model and to detect all types of SQL injection attacks, other algorithms can also be used. We will be measuring the available web-based application code. By integrating SQL with the Nikto HTTP scanner, HTTP scanning proxies, Metasploit, and many more tools, we can detect the attacks. We have only built a model to intercept attacks, but the model can later be enhanced to also prevent attacks and deploy countermeasures.