
XSS Cross-Site Scripting Attack Detection by Machine Learning Classifiers

This document is the abstract of a research paper presented at the 11th International Conference on System Modeling & Advancement in Research Trends. The paper discusses detecting cross-site scripting (XSS) attacks using machine learning classifiers. It trains classifiers like Logistic Regression, AdaBoost, Naive Bayes, XGBoost, and Decision Tree on a Kaggle XSS dataset. The results show that the AdaBoost classifier achieved 99.92% accuracy, outperforming previous work by 0.03%. The paper aims to detect XSS attacks through input sanitization using machine learning.


Proceedings of the SMART–2022, IEEE Conference ID: 55829

11th International Conference on System Modeling & Advancement in Research Trends, 16th–17th, December, 2022
College of Computing Sciences & Information Technology, Teerthanker Mahaveer University, Moradabad, India

XSS: Cross-site Scripting Attack Detection by Machine Learning Classifiers
2022 11th International Conference on System Modeling & Advancement in Research Trends (SMART) | 978-1-6654-8734-4/22/$31.00 ©2022 IEEE | DOI: 10.1109/SMART55829.2022.10046960

Prince Roy¹, Rajneesh Kumar², Pooja Rani³, Tanmoy Saha Joy⁴

¹Student, Department of Computer Science & Engineering, MMEC, Maharishi Markandeshwar (Deemed to be University), Mullana, Ambala, Haryana (India)
²Professor, Department of Computer Science & Engineering, MMEC, Maharishi Markandeshwar (Deemed to be University), Mullana, Ambala, Haryana (India)
³Asst. Professor, MMICTBM, Maharishi Markandeshwar (Deemed to be University), Mullana, Ambala, Haryana (India)
⁴Student, Department of Computer Science & Engineering (S.D), MMEC, Maharishi Markandeshwar (Deemed to be University)
Email: [email protected], [email protected], [email protected], [email protected]

Abstract: Cross-site scripting (XSS) is a serious attack that features prominently in the OWASP Top 10. It is primarily caused by insufficient sanitization of a web application's inputs and endpoints. Developers often overlook this issue, which gives attackers the opportunity to carry out the attack. To detect it, this paper employs multiple machine-learning classifiers (Logistic Regression, AdaBoost, Naive Bayes, XGBoost, and Decision Tree) and conducts experiments on the Kaggle Cross-site Scripting dataset. After numerous experiments, we found that the AdaBoost classifier achieves 99.92% accuracy, 0.03 percentage points better than previous work.

Keywords: Cross-site Scripting, Web Application, Machine Learning, AdaBoost, Logistic Regression, Naive Bayes, XGBoost, Decision Tree

I.  Introduction
Cross-site Scripting is a code injection flaw that allows an attacker to inject malicious code into a system in order to retrieve user data, gain control of accounts, and spread the attack to every user who visits the affected page. It is caused by inadequate sanitization of user input and of the application's endpoints [1].
Web application security is more important today than ever because, as technology advances, almost every company and large organization uses web applications. In addition, every customer wants to be safe online.
Attackers, however, improve their techniques day by day and can bypass strong security measures quickly by crafting a new payload. In this paper, we apply several classifiers to detect cross-site scripting payloads and achieve 99.92% accuracy, which is better than previously reported work.

Fig. 1:  Cross-site scripting (XSS) attack [2]

A.  Cross-site Scripting (XSS) Types
There are three common types of cross-site scripting that attackers can perform on a web application, distinguished by methodology:
1)  Reflected Cross-site Scripting
In this case, the code is executed on the user's system. The payload is not stored on the server; it is reflected back in the server's response. If an attacker discovers such a flaw, they can embed malicious script code in specially crafted links and URLs and distribute them to users. If a user inadvertently clicks on such a link, they are directed to a response web page containing the malicious script and become a victim of the attack [3].
2)  Stored Cross-site Scripting
Here the payload is saved in the application's database. This is the most dangerous type because the payload persists on the server. The attack is launched whenever a user visits the affected page. The attacker can then obtain information about users and the system [3].


3)  DOM-Based Cross-site Scripting
This flaw usually arises when data passes from an attacker-controllable source, such as a URL, to a sink that supports dynamic code execution. As a result, attackers can use malicious JavaScript to gain access to other users' accounts. The attack modifies the web page in the browser without needing to contact the server [3].

II.  Literature Review
Web application security has become a pressing need, as new attack techniques have made web systems more vulnerable. In this section, we review researchers who have proposed different ideas for mitigating Cross-site Scripting attacks using machine learning methods.
Dimaz Arno Prasetio et al. proposed a hybrid features model to classify XSS attacks and obtained their best result of 99.87 percent accuracy with almost no false positives: with this model the false-positive rate can be reduced to 0.039 percent. They used the Kaggle dataset as well as datasets from various sources on GitHub, totaling 16,361 records that typically contain XSS attacks [4].
Shreyas Sudhir Barde proposed Random Forest over the other models, combining Gradient Boosting, Decision Tree, and Bagging algorithms with the knowledge base's unified graph. The results showed a number of advantages over traditional methods in the majority of cases, with 97.16 percent accuracy in the worst case on a balanced ensemble dataset [5].
Ravi Pallam et al. proposed that data tokenization is the best pre-processing technique and showed that it significantly reduced computational time. Using the Kaggle dataset, LightGBM clearly outperformed the other boosting algorithms, with 99.51 percent and 99.59 percent accuracy for SQLi and XSS, respectively [6].
Fawaz Mahiuob Mohammed Mokbal et al. presented a method for detecting Cross-site Scripting attacks with NLP-SVM. Text payloads were processed using Natural Language Processing, and an SVM model was used to detect attacks. After a thorough examination of the results, the authors determined that the proposed method is capable of precisely detecting XSS-based attacks with low FN and FP rates. Compared with eight other algorithms on the same data, the proposed method had several significant advantages. With an accuracy of 99.44 percent, it produced promising results on the customized dataset [7].
Nunan et al. proposed a model that detects the presence of XSS scripts using Support Vector Machine and Naive Bayes. Features such as malicious code, duplicated special characters, and keywords are used in data preprocessing and classification with Weka. According to their findings, the maximum accuracy rate in similar works is 99.89 percent [8].
Umehara et al. proposed SVM, SCW, and Random Forest models, converting raw data into 128-dimensional feature vectors. The models were tested on a five-dimensional feature vector dataset as well as a 128-dimensional feature vector dataset. The five-dimensional feature vectors are classified into five groups. Five-fold cross-validation was used in the experiments. They achieved 98.9 percent accuracy using SVM on the 128-dimensional extracted features dataset [9].
To detect XSS attacks, PMD Nagarjun et al. proposed training models on a large, labeled, and balanced dataset using supervised ensemble learning techniques. They used several methods, including Random Forest, AdaBoost, SVM bagging, Extra-Trees, gradient boosting, and histogram-based gradient boosting. They obtained an accuracy of 99.89 percent with the histogram-based gradient-boosting classification model [10].
Rathore et al. proposed a model for detecting Cross-site scripting (XSS) attacks. They used URLs and features extracted from web pages to train their models. The features include domain names in URLs, iframes, external links, and malicious scripts in SNS webpages. They used a dataset of 1000 XSS malware codes for testing and scored 97.2 percent on their tests [11].
Mereani and Howe proposed a model for detecting Cross-Site Scripting attacks that combines Random Forest, kNN, and SVM. They used 2000 samples to train and 13,000 samples to test their model, and achieved up to 99.75 percent accuracy. According to the experiment, their extraction was successful [12].

III.  Materials and Approaches
A. Dataset
The most crucial step in detecting cross-site scripting attacks is compiling a dataset that contains Cross-site scripting attack payloads. In this study, the Kaggle dataset is used to test and evaluate the results of various classifiers. It contains 13,600 distinct records in total and is available in Kaggle's repository. The dataset contains several payload types that aid in the detection of Cross-site Scripting.

B. Methodology
Machine learning is a technology that uses algorithms to let a system learn from data and make decisions, ideally as accurately as a human being.
Attackers have continually refined their strategies over the last few decades to gain control over systems. We applied Logistic Regression, AdaBoost, Naive Bayes, XG-Boost, and Decision Tree to identify the Cross-site Scripting payloads in the dataset. The Cross-site Scripting dataset from Kaggle is used to find the best classifier for detecting cross-site scripting.
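As a rough illustration of the approach described above, the following Python sketch turns raw payload strings into numeric features and trains a small AdaBoost ensemble of one-feature threshold stumps. The paper does not specify its feature extraction, hyperparameters, or library, so the feature set, the function names (extract_features, train_adaboost, predict), and the toy payloads are all assumptions made for illustration. A from-scratch AdaBoost is used only to keep the example dependency-free; it is not the authors' implementation.

```python
import math
import re

# Hypothetical hand-crafted features; the paper does not list its feature set.
SUSPICIOUS = re.compile(
    r"<\s*script|javascript:|on\w+\s*=|alert\s*\(|document\.cookie", re.I
)

def extract_features(payload):
    """Map a raw input string to a small numeric feature vector."""
    return [
        len(SUSPICIOUS.findall(payload)),         # script-like tokens
        payload.count("<") + payload.count(">"),  # angle brackets
        payload.count("%"),                       # URL-encoding markers
        len(payload),                             # overall length
    ]

def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost over threshold stumps; labels must be +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for thr in sorted({row[f] for row in X}):
                for sign in (1, -1):
                    preds = [sign if row[f] > thr else -sign for row in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, f, thr, sign, preds)
        err, f, thr, sign, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, sign))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps; returns +1 (XSS) or -1 (benign)."""
    score = sum(a * (s if x[f] > thr else -s) for a, f, thr, s in ensemble)
    return 1 if score >= 0 else -1

# Toy, made-up payloads: +1 = XSS, -1 = benign.
samples = [
    ("<script>alert(1)</script>", 1),
    ("<img src=x onerror=alert(1)>", 1),
    ("javascript:alert(document.cookie)", 1),
    ("hello world", -1),
    ("search?q=machine+learning", -1),
    ("a < b and b > c", -1),
]
X = [extract_features(p) for p, _ in samples]
y = [label for _, label in samples]
model = train_adaboost(X, y)
```

On this tiny, cleanly separable toy set a single stump on the first feature already classifies every sample, so the boosting rounds are redundant; on a real dataset such as the Kaggle one, later rounds would concentrate on the payloads that earlier stumps misclassify.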

Fig. 2:  Methodology of Detection of Cross-site Scripting

Classifiers
1. Logistic Regression: A well-known supervised machine learning classifier. In essence, it predicts a categorical dependent variable from a number of independent variables. It can be applied to both continuous and discrete data to classify new observations and estimate probabilities [13].
2. AdaBoost Classifier: AdaBoost first fits a classifier on the dataset. It then fits additional copies of the classifier on the same dataset, giving more weight to misclassified samples, to improve performance [13].
3. XG-Boost: XG-Boost is an efficient and flexible implementation of gradient boosting for classification. It uses a greedy algorithm to split existing tree nodes [13].
4. Naive Bayes: Naive Bayes is an effective supervised learning algorithm based on Bayes' theorem. It is commonly used to classify data [13].
5. Decision Tree: A supervised machine learning method used to solve regression and classification problems. Its goal is to create a model that can predict the value of a target variable. A tree can be seen as a piecewise constant approximation [14].

Several classifiers (XG-Boost, Decision Tree, Logistic Regression, AdaBoost, and Naive Bayes) are applied to the Kaggle dataset in this study. For each classifier we calculate accuracy, sensitivity, specificity, precision, and F1 score, where sensitivity is the same quantity as recall.

F1 score = 2 * (Precision * Recall) / (Precision + Recall)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

After analyzing the performance of the classification algorithms, we determined that AdaBoost provided the best overall performance, leading on every parameter except sensitivity, where Naive Bayes scored slightly higher.

Table 1: Output of Used Classifiers (%)

Parameter      Logistic Regression   Decision Tree   Naive Bayes   XG-Boost   AdaBoost
Accuracy       99.85                 99.89           88.89         99.70      99.92
F1 Score       99.86                 99.89           90.44         99.72      99.93
Sensitivity    99.79                 99.86           99.93         99.58      99.86
Specificity    99.92                 99.92           76.65         99.84      100
Precision      99.93                 99.93           82.60         99.86      100

From the output table, we observed that AdaBoost provided 99.92 percent accuracy, 99.93 percent F1 score, 99.86 percent sensitivity, 100 percent specificity, and 100 percent precision. Logistic Regression reached 99.85% accuracy, Decision Tree 99.89%, XG-Boost 99.70%, and Naive Bayes 88.89%. Figures 3, 4, 5, 6, and 7 compare the classifiers.

Fig. 3:  Comparison of Accuracy

Fig. 4:  Comparison of F1-score

Fig. 5:  Comparison of Sensitivity


Fig. 6:  Comparison of Specificity

Fig. 7:  Comparison of Precision
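The metrics reported in Table 1 and compared in the figures above can all be derived from the four confusion-matrix counts. The sketch below is a minimal Python illustration using the standard definitions; the function names and the example counts are hypothetical and are not taken from the paper's experiments. The final line checks that the F1 score reported for AdaBoost in Table 1 is consistent with its reported precision and sensitivity.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (recall is the sensitivity)."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # true-positive rate, also called recall
    specificity = tn / (tn + fp)   # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1_score(precision, sensitivity),
    }

# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)

# Consistency check against Table 1 (AdaBoost: precision 100, sensitivity 99.86).
table1_adaboost_f1 = f1_score(100.0, 99.86)  # rounds to 99.93, as reported
```

Because the F1 score depends only on precision and recall, the same check can be repeated for any other column of Table 1.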


The ROC curves of the various machine learning classifiers are shown in Fig. 8. A ROC curve represents classification ability graphically. Here we can see that AdaBoost attains a substantial area under the ROC curve.

Fig. 8:  ROC Curve of Classifiers

We can aver that AdaBoost is an efficient method for detecting Cross-site Scripting attacks, with 99.92 percent accuracy.
Source Code:

To gain full system access, attackers are constantly developing new techniques to bypass security. An attacker tries to find an endpoint of the web application in order to inject a Cross-Site Scripting payload. Using this classifier, web applications can quickly detect cross-site scripting payloads injected at an endpoint by an attacker. Table 2 contrasts the proposed method with previous research.

Table 2: Comparison of Performances Obtained Using AdaBoost

Study                      Year   Data Set                                     Classifier            Accuracy (%)
Angelo Eduardo Nunan [8]   2012   Web Pages Using Document-based classifiers   Naive Bayes and SVM   99.89
Sota Akaishi [15]          2019   Customized large data set                    CNN+SVM               99.37

Table 2 (contd.)

Study                               Year   Data Set              Classifier                                           Accuracy (%)
Shreyas Sudhir Barde [5]            2020   Kaggle XSS Data set   Random Forest Bagging. Dataset Ensemble Modelling.   97.16
Fawaz Mahiuob Mohammed Mokbal [7]   2022   Kaggle XSS Data set   Average Word Embedding and Support Vector Machine    99.44
Proposed Method                     2022   Kaggle XSS Data set   AdaBoost                                             99.92

V.  Conclusion
Cross-site Scripting is a critical web vulnerability. Attackers constantly search for hidden endpoints that accept any kind of argument or input. If input filtering is lacking, an attacker can quickly compromise the web application and begin attacking the users of the company or organization. To address this issue, we used various machine learning classifiers to detect payloads, using a dataset from Kaggle. AdaBoost detects Cross-site Scripting payloads with 99.92 percent accuracy. It is capable of detecting payloads and thereby helping to protect against Cross-site Scripting attacks.

References
[1] S. Akaishi and R. Uda, "Classification of XSS Attacks by Machine Learning with Frequency of Appearance and Co-occurrence," 2019 53rd Annual Conference on Information Sciences and Systems (CISS), 2019, pp. 1-6, DOI: 10.1109/CISS.2019.8693047.
[2] Mwila, Kingston. (2020). An Assessment of Cyber Attacks Preparedness Strategy for Public and Private Sectors in Zambia.
[3] Vishnu, B. & Kp, Jevitha. (2014). Prediction of Cross-Site Scripting Attack Using Machine Learning Algorithms. 1-5. 10.1145/2660859.2660969.
[4] Prasetio, D., Kusrini, K. and Arief, M. R. (2021) "Cross-site Scripting Attack Detection Using Machine Learning with Hybrid Features", JURNAL INFOTEL, vol: 13(1), pp. 1-6. doi: 10.20895/Infotel.v13i1.606.
[5] Shreyas Sudhir Barde, "Cross-Site Scripting detection using Random Forest and Dataset Ensemble Modelling." pp:1-17, 2020, http://norma.ncirl.ie/4486/1/shreyassudhirbarde.pdf.
[6] Ravi Pallam, Sai Prasad Konda, Lasya Manthripragada, Ram Akhilesh Noone, "Detection of Web Attacks using Ensemble Learning", International Research Journal of Engineering and Technology (IRJET), pp:2931-2939, vol:08(7), July 2021.
[7] Mokbal, Fawaz & Dan, Wang & Wang, Xiaoxi. (2022). Detect Cross-Site Scripting Attacks Using Average Word Embedding and Support Vector Machine. International Journal of Network Security. 24. 20-28. 10.6633/IJNS.202201.
[8] Nunan, Angelo & Souto, Eduardo & Santos, Eulanda & Feitosa, Eduardo. (2012). Automatic Classification of Cross-Site Scripting in Web Pages Using Document-based and URL-based Features. 10.1109/ISCC.2012.6249380.
[9] A. Umehara, T. Matsuda, M. Sonoda, S. Mizuno and J. Chao, Consideration on the Cross-Site Scripting Attacks Detection Using Machine Learning, IPSJ SIG Technical Report, Vol.2015-CSEC-71, No.13, pp.1–4, 2015 (Japanese).
[10] PMD Nagarjun and Shaik Shakeel Ahamad, "Ensemble Methods to Detect XSS Attacks" International Journal of Advanced Computer Science and Applications (IJACSA), 11(5), 2020. http://dx.doi.org/10.14569/IJACSA.2020.0110585.
[11] S. Rathore, P. K. Sharma, and J. H. Park, "XSSClassifier: An Efficient XSS Attack Detection Approach Based on Machine Learning Classifier on SNSs," JIPS, vol. 13, no. 4, pp. 1014–1028, 2017.
[12] F. A. Mereani and J. M. Howe, "Detecting cross-site scripting attacks using machine learning," in International Conference on Advanced Machine Learning Technologies and Applications, 2018, pp. 200–210.
[13] Prince Roy, Rajneesh Kumar, and Pooja Rani, "SQL Injection Attack Detection By Machine Learning Classifier," International Conference on Applied Artificial Intelligence and Computing ICAAIC 2022, pp: 396-401, Proceedings Paper.
[14] S., Shilpashree. (2019). Decision Tree: A Machine Learning for Intrusion Detection. International Journal of Innovative Technology and Exploring Engineering. 8. 5. 10.35940/ijitee.F1234.0486S419.
[15] S. Akaishi and R. Uda, "Classification of xss attacks by machine learning with frequency of appearance and co-occurrence," in The 53rd Annual Conference on Information Sciences and Systems (CISS'19), pp. 1–6, 2019.
