ISeCure: The Int'l Journal of Information Security
Manuscript template of ISeCure Journal (pp. 1–8)
http://www.isecure-journal.org
ARTICLE INFO

Keywords: Artificial Intelligence-based Code Review, ChatGPT Model, Common Weakness Enumeration, Static Application Security Testing, Vulnerability Detection

Abstract

In recent years, artificial intelligence has grown conspicuously in almost every aspect of life. One of the most applicable areas is security code review, for which many AI-based tools and approaches have been proposed. Recently, ChatGPT has attracted a great deal of attention with its remarkable performance in following instructions and providing detailed responses. Given the similarities between natural language and code, in this paper we study the feasibility of using ChatGPT for vulnerability detection in Python source code. Toward this goal, we feed an appropriate prompt along with vulnerable data to ChatGPT and compare its results on two datasets with the results of three widely used Static Application Security Testing (SAST) tools (Bandit, Semgrep, and SonarQube). We carry out different kinds of experiments with ChatGPT, and the results indicate that ChatGPT reduces the false positive and false negative rates and has the potential to be used for Python source code vulnerability detection.

© 2023 ISC. All rights reserved.
effort are required by a security expert to validate the findings of the SAST tool. Moreover, this increases the rate of human error, which may lead to some vulnerabilities being overlooked. On the other hand, a high false negative rate can lead to catastrophic events.

In recent years, Machine Learning (ML) and deep learning have made remarkable advances in various areas such as natural language processing [10, 11]. Therefore, considering the high similarity between code and natural languages, deep learning-based models are expected to be successful in code processing tasks. Likewise, studies in this area have shown the interest of researchers in applying deep learning techniques to vulnerability detection [12, 13]. Machine learning models can automatically learn the patterns of software vulnerabilities from datasets. Furthermore, research indicates that ML models produce fewer false positives than SAST tools [6, 14]. A recent study has shown the superior performance of deep learning-based models over three open-source tools for C/C++, reducing false positive and false negative rates at the same time [15].

Recently, ChatGPT, an AI-powered chatbot that uses Natural Language Processing (NLP) and machine learning algorithms to understand and respond to user inquiries, has drawn a lot of attention. It can save time and resources by automating tasks that would otherwise require human intervention. An important point to note is that ChatGPT has been trained on a huge amount of data up to 2021, so it can be a great help in finding known patterns across thousands of packages in an automated way. The model is also trained on a large amount of code and is thus able to recognize common patterns. In this paper, we evaluate the performance of ChatGPT in identifying security vulnerabilities in Python code and compare the results with three well-known SAST tools for Python vulnerability detection (Bandit, Semgrep, and SonarQube). We chose Python because in 2022 it was ranked among the most popular programming languages, along with Java, by the Popularity of Programming Language Index (PYPL) and IEEE reports; Stackscale ranked Python third [16]. Although Python is mainly used in machine learning and data science, its applications are not limited to these fields, and with its popular web frameworks such as Django and Flask, it is prone to vulnerabilities.

The rest of the paper is organized as follows: In Section 2, we provide a brief literature review of this area. Section 3 is dedicated to the datasets we used. Section 4 provides the details of the experiments we performed with ChatGPT. In Section 5, we present the evaluation and the analysis of the obtained results 2. In Section 6 we discuss some factors that may threaten the validity of the results. Finally, Section 7 concludes the paper.

2 https://github.com/abakhshandeh/ChatGPTasSAST.git

2 Related Work

In this section, we review some of the works that have used different kinds of AI models for vulnerability detection. Note that we do not cover works that propose models for repairing the identified vulnerabilities. In this area, the main idea is supervised learning. Accordingly, various machine learning models relied on feature engineering, using features such as the number of lines of code, code complexity, and the number of operations, and also utilized textual features [15, 17]. In general, research shows that text-based models perform better than feature-engineering approaches, and studies also report that machine learning models outperform existing SAST tools.

Recently, more research has been devoted to deep learning. In this scope, researchers have often used deep learning models such as Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Multilayer Perceptrons (MLP) [13, 18–20]. Some of the models were based on different kinds of code property graphs and used Graph Neural Networks [13, 14], while others relied only on tokens [20]. A new study has investigated how deep learning models behave in vulnerability detection tasks. Its results reveal several points: first, the results of different models are not consistent with each other; second, fine-tuned models show better performance in this field; third, around 1000 samples per class are usually enough to train a neural network; and finally, models usually rely on the same features for prediction [21]. Although earlier studies supported the superiority of graph-based models, a newer study indicates the superior performance of transformer-based models over graph-based ones [22]. In 2022, Hanif and Maffeis proposed a model named VulBERTa [23]. This model is based on RoBERTa and is used for vulnerability detection in C/C++ code. Another recent study has used the BERT architecture and CodeBERT vectors for predicting code vulnerabilities; its results confirm the superiority of transformer-based models over both traditional deep learning models and graph-based models [24]. Overall, it seems that transformer-based models are effective in this area. Another recent work evaluated ChatGPT as a large language model for detecting vulnerabilities in Java source code, compared the results with a dummy classifier, and achieved no better results than it [25]. However, there is still no academic study comparing the results of the ChatGPT model with traditional SAST tools for Python, and this paper aims to answer the question of whether the ChatGPT model outperforms SAST tools.

3 Datasets Description

In this section, we provide the details of our dataset and the labels we used. Our dataset consists of 156 Python code files. Of these, 130 files come from the SecurityEval dataset proposed in [26]. As its authors mention, these 130 files cover 75 vulnerability types mapped to the Common Weakness Enumeration (CWE). The remaining 26 files belong to a project called PyT, in which the author developed a tool for Python code vulnerability detection and used these 26 vulnerable code files to evaluate it [27, 28]. Since the source datasets do not specify the vulnerable line, a security expert on our team rechecked the data and, for each file, identified the line of code corresponding to the assigned CWE labels. The datasets' information and the distribution of their corresponding labels are presented in Table A.1 in Appendix A.

4 Working with ChatGPT API

In this section, we describe the process of using the ChatGPT model API to identify vulnerabilities. In this study, we used the GPT-3.5-Turbo model. Unlike the previous version, which only allowed a single text prompt, GPT-3.5-Turbo accepts a series of messages as input. This capability provides some interesting features, such as the ability to store prior responses or to query with a predefined set of instructions and context, which is likely to improve the generated response. GPT-3.5-Turbo is a superior option compared to the GPT-3 model, as it offers better performance across all aspects while being 10 times more cost-effective per token. We performed four kinds of experiments with the GPT-3.5-Turbo model.

(1) In our first experiment, we give the model the vulnerable files and ask whether they contain any security vulnerabilities, without specifying the corresponding CWEs. We ask the model to return only the line number of the vulnerability, if any, and then compare these lines with the ground truth labels. In effect, this experiment is a binary classification.

(2) In our second experiment, we provide the list of the corresponding CWEs and ask the model to find the vulnerabilities from the labels' list in the vulnerable Python file. In this experiment, we ask the model to respond in JSON format, like [{"label": "CWE-X", "line of Code": "line no."}], so that we can compare our results with those of the SAST tools.

(3) In the third experiment, for each vulnerable file, we give the model all the labels returned by the Bandit, Semgrep and SonarQube tools for that Python code as the classes that ChatGPT should use, and then ask the model whether the file contains any of those vulnerabilities. The main difference from our second experiment is that we specify the classes per vulnerable file separately. In other words, we use the model as an assistant to the SAST tools, verifying the vulnerabilities they detect. In this experiment, we use the same JSON format as in the second experiment for the responses. Note that although we provide the labels' list beforehand for each vulnerable file, in some cases the model returned a new CWE that is not among its input labels. This is natural behavior for a language model, and to address it in our evaluation we consider two cases: in one case, we ignore the new labels and calculate the metrics without them, a policy which can reduce the number of false positives of the SAST tools; in the other case, we consider them as well, and this time the number of false negatives may decrease.

(4) In our fourth experiment, we do not provide any label list; we ask the model to detect the vulnerabilities in the files and determine their corresponding CWEs from its own trained knowledge. Here, the responses use the same JSON structure as in the previous experiments.

To use the model in our experiments, we put all the vulnerable Python files of our dataset in a directory and called the GPT-3.5 API with an optimized prompt for each file. The choice of prompt is the most challenging task in this process, as it directly affects the results the model provides. We optimized our prompts according to [29]. Table 2 lists the prompts we used for each experiment.

4.1 Parameters

The parameters of the experiment are the prompt, which contains the instructions the model will execute, and a parameter called temperature, which determines the randomness of the model's response. The temperature can take values between 0 and 6, with 6 giving the most random output. Because we want to assign labels to a piece of code, our task is deterministic in nature, so we set the temperature to zero to get the most stable behavior.
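The querying loop described above (one API call per vulnerable file, temperature fixed at zero) can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names are ours, not the authors' code, and the endpoint and payload shape follow the public Chat Completions API:

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(code, prompt_template):
    """Build the JSON body for one query: gpt-3.5-turbo, temperature 0."""
    return {
        "model": "gpt-3.5-turbo",
        "temperature": 0,  # deterministic labeling, as in Section 4.1
        "messages": [
            {"role": "user", "content": prompt_template.format(code=code)}
        ],
    }

def query(code, prompt_template, api_key):
    """Send one vulnerable file to the model and return its text reply."""
    data = json.dumps(build_request(code, prompt_template)).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=data,
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

def scan_directory(path, prompt_template, api_key):
    """Query the model once per .py file in the dataset directory."""
    results = {}
    for name in sorted(os.listdir(path)):
        if name.endswith(".py"):
            with open(os.path.join(path, name), encoding="utf-8") as f:
                results[name] = query(f.read(), prompt_template, api_key)
    return results
```

The prompt template passed in would be one of the per-experiment prompts of Table 2, with `{code}` standing in for the vulnerable file's contents.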
Another important parameter is the prompt, which is very influential on the results; adjusting it to get the best results is a challenging task. We used the prompts given in Table 2, chosen based on [29]. According to [29], some key points for creating an optimized prompt are: using delimiters such as triple quotes or triple backticks to mark the piece of code; asking for structured output such as JSON or HTML; specifying the steps needed to complete a task clearly; and instructing the model to work out its own solution before rushing to a conclusion. To show the sensitivity of ChatGPT to its prompts, Table 1 gives an example in which a prompt with a minor modification is given to the model and the model responds with different answers, the second response containing one less vulnerability than the first.

Table 1. Comparison of the responses of two prompts with a minor change

Prompt 1: find all the vulnerabilities with the CWE standard in the python code which is delimited with triple backticks. also give the line of the vulnerability in the code. python code: '''{Vul_code}''' Format your response as a list of JSON objects with "label" and "line of Code" as the keys for each element. only answer with JSON.

Response 1: [ { "label": "CWE-78: Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')", "line of Code": "command = 'echo' + param + ' >> ' + 'menu.txt'" }, { "label": "CWE-78: Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')", "line of Code": "subprocess.call(command, shell=True)" }, { "label": "CWE-78: Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')", "line of Code": "subprocess.call('echo Menu: > menu.txt', shell=True)" } ]

Prompt 2: find all the vulnerabilities with the CWE standard in the python code. also give the line of the vulnerability in the code. python code: '''{Vul_code}''' Format your response as a list of JSON objects with "label" and "line of Code" as the keys for each element. only answer with JSON.

Response 2: [ { "label": "Command Injection", "line of Code": "subprocess.call(command, shell=True)" }, { "label": "Command Injection", "line of Code": "subprocess.call('echo Menu: > menu.txt', shell=True)" } ]

Table 2. GPT-3.5 prompts used. Vul_code refers to the vulnerable code; labels1 is the list of labels of all vulnerable files, and labels2 is the list of labels of each vulnerable file, iterated through a loop.

Experiment 1: You will be provided with a python code delimited by triple backticks. If it contains any security vulnerability, identify the lines of vulnerable code and only write the line in quotation. If the code does not contain a vulnerability, then simply write None. python code: '''{Vul_code}'''

Experiment 2: Which of the following vulnerabilities from list of vulnerabilities exist in the python code which is delimited with triple backticks. also give the line of the vulnerability in the code. python code: '''{Vul_code}''' list of vulnerabilities: {", ".join(labels1)} Format your response as a list of JSON objects with "label" and "line of Code" as the keys for each element. only answer with JSON.

Experiment 3: Which of the following vulnerabilities from list of vulnerabilities exist in the python code which is delimited with triple backticks. also give the line of the vulnerability in the code. python code: '''{Vul_code}''' list of vulnerabilities: {", ".join(labels2)} Format your response as a list of JSON objects with "label" and "line of Code" as the keys for each element. only answer with JSON.

Experiment 4: Your task is to determine whether the following python code, which is delimited with triple backticks, is vulnerable or not. Identify the following items: - CWE of its vulnerabilities. - lines of vulnerable code. Format your response as a list of JSON objects with "label" and "line of Code" as the keys for each vulnerability. If the information isn't present, use "unknown" as the value. Make your response as short as possible and only answer with JSON. python code: '''{Vul_code}'''

5 Results

In this section, we present the results of our experiments. First we explain the metrics we use for evaluation, and then we present the GPT-3.5 results and compare them with three popular SAST tools for Python vulnerability detection. To be precise, we perform the following actions: we give the dataset of 156 vulnerable Python files to the Bandit, Semgrep and SonarQube SAST tools, and we also query the ChatGPT model on the same dataset using the appropriate prompts. We then calculate the following metrics for each tool's results and for the model's results, based on our ground truth labels. Finally, we compare the tools' results with those of the GPT-3.5 model.

5.1 Evaluation Metrics

In classification, the condition positive is the number of real positive cases in the data; similarly, the condition negative is the number of real negative cases.
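As an illustration of how the model's JSON responses can be scored against the ground-truth labels, the following sketch (the helper names are ours; it assumes each reply is either the string None, as in Experiment 1, or the requested list of {"label", "line of Code"} objects, and that ground truth for a file is a set of (CWE, line) pairs) computes per-file precision, recall, and F1:

```python
import json

def parse_response(reply_text):
    """Parse a model reply into a set of (label, line-of-code) pairs.

    A reply of "None" (Experiment 1, no vulnerability found) yields the
    empty set; otherwise the reply is the JSON list asked for in Table 2.
    """
    reply_text = reply_text.strip()
    if reply_text == "None":
        return set()
    return {(item["label"], item["line of Code"])
            for item in json.loads(reply_text)}

def score(predicted, ground_truth):
    """Precision, recall and F1 over sets of (label, line) pairs."""
    tp = len(predicted & ground_truth)   # correctly flagged findings
    fp = len(predicted - ground_truth)   # spurious findings
    fn = len(ground_truth - predicted)   # missed findings
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, a reply that flags one of two true findings plus one spurious finding yields precision 0.5, recall 0.5, and F1 0.5.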
recall = TP / (TP + FN)    (2)

• F-measure: a measure which combines precision and recall and is defined as F1 = 2 * (precision * recall) / (precision + recall).

Precision  Recall  F1
Semgrep    0.6694  0.1504  0.2457
Bandit     0.7450  0.1447  0.2424
SonarQube  0.9104  0.1161  0.2060

Table 5. Results of Experiment 3 (SAST assistant)

                             Precision  Recall  F1
Semgrep                      0.4682     0.1123  0.1812
Bandit                       0.3168     0.0609  0.1022
SonarQube                    0.3283     0.0419  0.0743
Experiment3, GPT-3.5-Case 1  0.7807     0.2781  0.4101
Experiment3, GPT-3.5-Case 2  0.333      0.1542  0.2109

Table 6. Results of Experiment 4 (Free Classification)

           Precision  Recall  F1
Semgrep    0.4682     0.1123  0.1812
Bandit     0.3168     0.0609  0.1022
SonarQube  0.3283     0.0419  0.0743
GPT-3.5    0.3350     0.1238  0.1808

6 Threats to Validity

In this section, we discuss some factors in our experiments that could affect the correctness of the results. Our biggest challenge was the choice of ChatGPT prompts. There are metrics for measuring the effectiveness of a prompt for LLMs; in [31], naturalness and expressiveness are mentioned as two important factors. We tried to choose the most efficient prompts in terms of these metrics and based on the guidelines explained in Section 4.1 [29], but it is possible that a more careful selection of prompts could change the results. Another factor that may affect the results is the size of the dataset and its accessibility on the Internet. Furthermore, the distribution of the CWEs in the dataset is of great importance. To mitigate this threat, we chose three different datasets for better generalization over the vulnerabilities they cover, but the coverage of vulnerability types may still be limited. Moreover, we only compare the model with three SAST tools for the Python language; additional SAST tools might change the comparison. Finally, we only test the GPT-3.5 model of ChatGPT, and it is possible that the newer paid version (GPT-4) performs better.

7 Conclusion

In this paper, we performed four types of experiments with the ChatGPT model to detect security vulnerabilities in Python code. We compared the model with Bandit, Semgrep and SonarQube, which are popular SAST tools for Python. We conclude that using the GPT-3.5 model for code vulnerability detection in certain settings gives promising results. Specifically, when used as a SAST tool assistant, it produces results that can help improve the results returned by the SAST tools. Overall, we believe this model has the potential to be used in vulnerability detection tasks, bearing in mind the factors that may affect the correctness of the results, described in Section 6. However, we admit that this study is not general in all aspects and provides primary steps toward this path. In future studies, the behavior of the latest ChatGPT model (GPT-4), which is more powerful than GPT-3.5, can be examined for code vulnerability detection in the hope of obtaining better results. Moreover, the temperature parameter of the model can be set to values other than zero, and innovative rules can be devised to select the most efficient results. Another suggestion is to use one-shot learning in future work. Finally, a security caution applies to using ChatGPT as a SAST tool, because the source code must be uploaded to OpenAI's servers.

A Appendix

The distribution of labels of our dataset is provided in Table A.1.

References

[1] Wikipedia. https://en.wikipedia.org/wiki/GitHub, 2023. Accessed: 2023-03-27.
[2] CVE Details. https://www.cvedetails.com/browse-by-date.php, 2023. Accessed: 2015-08-23.
[3] Kumar V, Anjum M, Agarwal V, and Kapur PK. A hybrid approach for evaluation and prioritization of software vulnerabilities. In Predictive Analytics in System Reliability, pages 39–51. Cham: Springer International Publishing, 2023.
[4] Zhou Y and Sharma A. Automated identification of security issues from commit messages and bug reports. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017.
[5] Zhou Y, Liu S, Siow J, Du X, and Liu Y. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[6] Perl H, Dechand S, Smith M, Arp D, Yamaguchi F, Rieck K, et al. VCCFinder: Finding potential vulnerabilities in open-source projects to assist code audits. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015.
[7] Jabeen G, Rahim S, Afzal W, Khan D, Khan A, Hussain Z, et al. Machine learning techniques for software vulnerability prediction: a comparative study. Applied Intelligence, 52, 2022.
[8] Hanif H and Maffeis S. VulBERTa: Simplified source code pre-training for vulnerability detection. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2022.
[9] Berabi B, He J, Raychev V, and Vechev M. TFix: Learning to fix coding errors with a text-to-text