Final Paper
LightGBM
D. Rashmi1, N. Anvesh2, CH. Sai Hanisha3, G. Shravya4, Preethi Jeevan5
ABSTRACT SQL injection attacks are among the most common and fundamental types of cyber attacks: they execute arbitrary malicious code to obtain the sensitive and confidential information of a client or an organisation residing in a database. This paper aims to provide an overview of SQL injection attacks and the vulnerabilities they exploit. Based on this analysis, we study various preventive solutions, one of which is detection with the help of the LightGBM algorithm. The paper also focuses on the different stages of the working model and the future application of the algorithm in various methods.
3. Final Model: Return the trained model parameters \( \theta(x) = \theta_M(x) \).

THE LIGHTGBM ALGORITHM
Input:
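In code, the trained model \( \theta_M(x) \) returned above corresponds to a fitted booster. The following is a minimal sketch, assuming the scikit-learn-style LGBMClassifier API from the lightgbm package; train_features and train_labels mirror the variables used in the code section below, and the hyperparameter values are illustrative assumptions, not the paper's settings.

# Minimal LightGBM training sketch (illustrative hyperparameters)
from lightgbm import LGBMClassifier

lgbm_classifier = LGBMClassifier(learning_rate=0.1, n_estimators=100)
lgbm_classifier.fit(train_features, train_labels)
test_pred_lgbm = lgbm_classifier.predict(test_features)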
IV. OPTIMISATION OF SYSTEM:
Hardware optimization is described as the process of increasing efficiency and making efficient use of available resources.
Key aspects of hardware optimization include:

A. CACHE AWARENESS:
The main purpose of a cache is to store the most frequently accessed data, which helps reduce fetching time. Fetching data from the cache is much faster than fetching it from main memory. Cache awareness involves designing algorithms with the cache architecture in mind. This helps reduce cache misses and increases access speed, so overall performance is improved.
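As a rough illustration of why cache-aware access patterns matter (an addition for illustration, not from the paper's experiments), the following sketch sums a NumPy matrix row by row versus column by column; with NumPy's default row-major layout, the row-wise traversal touches memory sequentially and suffers far fewer cache misses.

# Cache-awareness sketch: row-major traversal vs. strided traversal
import numpy as np
import time

matrix = np.random.rand(5000, 5000)  # stored row-major (C order) by default

start = time.perf_counter()
row_total = sum(matrix[i, :].sum() for i in range(matrix.shape[0]))  # sequential, cache-friendly
row_time = time.perf_counter() - start

start = time.perf_counter()
col_total = sum(matrix[:, j].sum() for j in range(matrix.shape[1]))  # strided, cache-hostile
col_time = time.perf_counter() - start

print("Row-wise: {:.3f}s, column-wise: {:.3f}s".format(row_time, col_time))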
B. OUT-OF-CORE COMPUTING:
Out-of-core computing comes into the picture when dealing with datasets that are too large to fit into available memory. This approach divides the large dataset into a number of chunks, and each chunk is processed sequentially. Instead of loading the entire dataset into memory, only the necessary chunks are loaded at a time. This approach helps us handle large datasets and also provides efficient utilization of storage resources.
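A minimal out-of-core sketch, assuming the dataset lives in a CSV file (the file name and chunk size here are hypothetical): pandas reads and processes the file in fixed-size chunks rather than loading it whole.

# Out-of-core sketch: process a large CSV chunk by chunk
import pandas as pd

row_count = 0
for chunk in pd.read_csv("sql_queries.csv", chunksize=100_000):  # hypothetical file
    # Each chunk is an ordinary DataFrame small enough to fit in memory.
    row_count += len(chunk)
print("Processed {} rows chunk by chunk".format(row_count))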
Optimization of hardware also involves memory management. In this research, we have come across different ensemble machine learning techniques such as XGBoost, AdaBoost, LGBM and Random Forest.
V. UML DIAGRAMS:
UML stands for Unified Modelling Language. It is a common language for modelling software projects, used for visualizing various aspects of software systems such as their design, architecture and implementation. A UML diagram is a means of visualizing, specifying, constructing, and documenting software projects. It is the visual representation of a set of UML things (such as classes, interfaces and state machines) and the relationships among them.
VI. FEATURE EXTRACTION:
Feature extraction involves using tokenization to extract attributes. Tokenization is the process of dividing SQL queries into a number of tokens. These tokens can be keywords or any individual units. Features are the characteristics or properties of the tokenized queries. Features are identified based on the count of specific attributes associated with the tokens, such as the number of double quotes, single quotes, keywords, alphabetic characters etc. At least twenty features are extracted for each query. These features are helpful in detecting SQL injection attacks, since they help differentiate legitimate and fraudulent SQL queries. Detecting these SQL injection attacks involves identifying traits such as stored procedures, illegal queries and redundancies. The selected features are especially helpful in differentiating legitimate and illegal SQL queries that contain SQL injection keywords.

Each query is labelled for maliciousness (1) or normality (0), and the extracted features include the query's length, the quantity of SQL keywords it contains, the count of special characters (like quotes and semicolons), and a binary indicator marking the existence of SQL comments. Through the use of these attributes, the machine learning model is better able to recognize harmful patterns in the text and structure of SQL queries. To facilitate the construction of additional features like query length and keyword count, the preprocessing stages included tokenizing SQL queries into components like keywords, operators, and values. To distinguish between legitimate and malevolent queries, labels were encoded as binary values. To increase model accuracy and robustness, methods including undersampling and oversampling (such as SMOTE, the Synthetic Minority Over-sampling Technique) were used to resolve any imbalances in the dataset. This resulted in a balanced representation of both malicious and regular queries. The final dataset has a sizable number of queries that provide a variety of instances for the model to learn from. By utilizing this varied and extensive dataset, the project intends to further the creation of strong machine learning models capable of successfully identifying and preventing SQL injection attacks, thereby improving cybersecurity measures.
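The following is a minimal sketch of the tokenization and count-based features described above; the keyword list and feature names are illustrative assumptions, not the paper's exact twenty features.

# Feature extraction sketch: tokenize a query and count attributes
import re

SQL_KEYWORDS = {"select", "union", "insert", "update", "delete", "drop", "or", "and"}

def extract_features(query):
    # Tokenize on word characters; punctuation is inspected separately.
    tokens = re.findall(r"\w+", query.lower())
    return {
        "length": len(query),
        "keyword_count": sum(1 for t in tokens if t in SQL_KEYWORDS),
        "single_quotes": query.count("'"),
        "double_quotes": query.count('"'),
        "semicolons": query.count(";"),
        "has_comment": int("--" in query or "/*" in query),
    }

print(extract_features("SELECT * FROM users WHERE name = '' OR 1=1 --"))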
VII. CODE:

Gradient Boosting Classifier:

# Gradient Boosting Classifier (using sklearn)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# Instantiate the Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier()

# Fit the model on the training dataset
gb_classifier.fit(train_features, train_labels)

# Predict on training and test datasets
train_pred_gbc = gb_classifier.predict(train_features)
test_pred_gbc = gb_classifier.predict(test_features)

# Evaluate performance - Accuracy
acc_train_gbc = accuracy_score(train_labels, train_pred_gbc)
acc_test_gbc = accuracy_score(test_labels, test_pred_gbc)

# Print accuracy results
print("Gradient Boosting: Accuracy on training data: {:.7f}".format(acc_train_gbc))
print("Gradient Boosting: Accuracy on test data: {:.7f}".format(acc_test_gbc))

# Generate classification report for both classes
classification_report_gbc = classification_report(test_labels, test_pred_gbc, labels=[0, 1])

# Store the results (values wrapped in lists so pandas accepts them)
results_gbc = pd.DataFrame({
    'Model': ['Gradient Boosting'],
    'Train Accuracy': [acc_train_gbc],
    'Test Accuracy': [acc_test_gbc]
})

XGBoost Classifier:

# XGBoost Classifier (using xgboost)
from xgboost import XGBClassifier

# Instantiate the XGBoost Classifier
xgb_classifier = XGBClassifier(learning_rate=0.4, max_depth=7)

# Fit the model on the training dataset
xgb_classifier.fit(train_features, train_labels)

# Predict on training and test datasets
train_pred_xgb = xgb_classifier.predict(train_features)
test_pred_xgb = xgb_classifier.predict(test_features)

# Evaluate performance - Accuracy
acc_train_xgb = accuracy_score(train_labels, train_pred_xgb)
acc_test_xgb = accuracy_score(test_labels, test_pred_xgb)

# Print accuracy results
print("XGBoost: Accuracy on training data: {:.7f}".format(acc_train_xgb))
print("XGBoost: Accuracy on test data: {:.7f}".format(acc_test_xgb))