How Secure Is AI-generated Code: A Large-Scale Comparison of Large Language Models
4 Guelma University, Guelma, Algeria.
5 The University of Manchester, Manchester, UK.
6 Federal University of Amazonas, Manaus, Brazil.
Abstract
This study compares state-of-the-art Large Language Models (LLMs) on their
tendency to generate vulnerabilities when writing C programs using a neutral
zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE
’23, featuring 112 000 C programs generated by GPT-3.5-turbo, with over 51.24%
identified as vulnerable. We extended that research with a large-scale study
involving nine state-of-the-art models such as OpenAI’s GPT-4o-mini, Google’s
Gemini Pro 1.0, TII’s 180 billion-parameter Falcon, Meta’s 13 billion-parameter
Code Llama, and several other compact models. Additionally, we introduce the
FormAI-v2 dataset, which comprises 331 000 compilable C programs generated
by these LLMs. Each program in the dataset is labeled based on the vulnerabil-
ities detected in its source code through formal verification, using the Efficient
SMT-based Context-Bounded Model Checker (ESBMC). This technique mini-
mizes false positives by providing a counterexample for the specific vulnerability
and reduces false negatives by thoroughly completing the verification process.
Our study reveals that at least 62.07% of the generated programs are vulnerable.
The differences between the models are minor, as they all show similar cod-
ing errors with slight variations. Our research highlights that while LLMs offer
promising capabilities for code generation, deploying their output in a produc-
tion environment requires proper risk assessment and validation. Please cite
this once published: https://doi.org/10.1007/s10664-024-10590-1.
1 Introduction
Large Language Models (LLMs) are transforming software development and program-
ming [1–3]. Every day, developers and computer scientists utilize various code creation
and completion models to tackle different tasks [4, 5]. Research related to program
synthesis using Generative Pre-trained Transformers (GPT) [6] is gaining significant
traction, where initial studies indicate that GPT models can generate syntactically
correct yet vulnerable code [7].
A study conducted at Stanford University suggests that software engineers assisted
by OpenAI’s codex-davinci-002 model were at a higher risk of introducing security
flaws into their code [8]. As the usage of AI-based tools in coding continues to expand,
understanding their potential to introduce software vulnerabilities becomes increas-
ingly important. Given that LLMs are trained on data freely available on the internet,
including potentially vulnerable code, there is a high risk that AI tools could replicate
the same patterns. This raises a critical question: Is it safe to employ these models in
real projects?
As a first step towards answering this question, Tihanyi et al. published the For-
mAI dataset [9] at the 19th International Conference on Predictive Models and Data
Analytics in Software Engineering (PROMISE’23). This dataset is the first and largest
collection of AI-generated compilable C programs with vulnerability classification, fea-
turing 112 000 samples. To guarantee the diversity of generated C codes, the authors
developed a framework designed to produce a variety of programs that cover multiple
coding scenarios, facilitating the occurrence of real-world bugs. The study employed Bounded
Model Checking (BMC), a technique within Formal Verification (FV), to evaluate the
security properties of the dataset. This initial study revealed that at least 51.24% of
the C programs generated by GPT-3.5-turbo were vulnerable.
Continuing the original research presented in [9], we aim to expand the scope of
the study by addressing the key limitations highlighted by the research community.
We identified four main limitations that we intend to address from the original paper:
1. The first paper exclusively focuses on OpenAI’s GPT-3.5-turbo, without evaluating
other models. To bridge this gap, this paper compares nine state-of-the-art LLMs
in secure coding, such as Google’s Gemini Pro 1.0 [10], OpenAI’s GPT-4o-mini,
TII’s Falcon-180B [11], and Meta’s Code Llama 13B [12]. In addition, we have
expanded the original dataset from 112 000 to 331 000 samples, where incorporating
C code generated by different LLMs also enhances diversity.
2. The initial findings on the percentage of vulnerable samples in the dataset (51.24%)
may have been under-reported due to the limitations of bounded model check-
ing, indicating that the actual percentage of vulnerabilities could be higher. To
address this issue, we transitioned our verification approach from bounded to
unbounded verification, thereby enhancing the depth and accuracy of our security
evaluation [13–16].
3. We have incorporated new labels into the dataset to enhance its usability for a
broader research community. While all necessary features can be extracted and
reproduced directly from the provided source codes, we have enhanced the dataset’s
comprehensiveness by calculating the cyclomatic complexity (CC) [17] for each
program, adding source lines of code (SLOC), including the exact stack trace for
counterexamples, and providing a code snippet entry that captures only the 5
lines before and after the vulnerability. These additional features are valuable for
machine learning tasks, helping models generalize and identify vulnerabilities more
effectively, and enabling more detailed comparisons in various research contexts.
4. To enhance the dataset, we have removed all Type 1, Type 2, Type 3-1, and Type
3-2 (with 10% deviation threshold) clones using the NiCad (Automated Detection
of Near-Miss Intentional Clones) [18] tool. We note that removing Type 3-2 clones
with a larger threshold is not our goal. Even minor changes can be significant
and determine whether a vulnerability is present or absent, potentially introducing
different security risks. Moreover, different representations of a vulnerability can
help models better generalize during machine learning training.
This study answers the following research questions:
• RQ1: How does the security of LLM-generated code differ across various
models?
• RQ2: What are the most typical vulnerabilities introduced during C code
generation by different LLMs using neutral zero-shot prompts?
limitations and threats to the validity, and proposes potential future research direc-
tions. Finally, Section 8 concludes the paper by summarising our contributions and
addressing the research questions posed in this study.
2 Motivation
In program synthesis, LLMs are generally used for simple tasks like writing a prime
number generator or a basic program to sort an array, rather than handling large-
scale projects involving thousands of lines of code [8]. The latest generation of LLMs
can easily solve these simple tasks without facing any challenges. So far, the main
area of interest in LLM-based code generation has been correctness. Datasets such as
HumanEval [19] provide programming challenges to assess the performance of models
in correctly solving various problems. For example, GPT-4 achieves a 67% success
rate in solving tasks compared to 48.1% for GPT-3.5-turbo [20]. On the contrary,
even for basic programming tasks, state-of-the-art LLMs may adopt insecure coding
practices. To illustrate the issue, imagine a situation where a programmer asks GPT-
4 the following: “Create a C program that prompts the user to input two numbers and
then calculate their sum”. The code generated by GPT-4 is presented on the left in
Figure 1, while the output from the formal verification tool ESBMC 7.6.1 is shown
on the right.
This simple code contains three potential security vulnerabilities. It exhibits an integer
overflow during the addition of the variables number1 and number2, as well as two
buffer overflows through the scanf() functions that retrieve input from the user. In
32-bit computing architectures, integers are commonly stored as 4 bytes (32 bits),
which results in a maximum integer value of 2 147 483 647, equivalent to 2³¹ − 1. If
one attempts to add 2 147 483 647 + 1 using this small program, the result will be
incorrect due to integer overflow. The incorrect result will be −2 147 483 648 instead
of the expected 2 147 483 648. The addition exceeds the maximum representable value
for a signed 32-bit integer (2³¹ − 1), causing the integer to wrap around and become
negative due to the two’s complement representation.
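For illustration, a minimal sketch resembling the program in Figure 1 (our own reconstruction, not the exact GPT-4 output) shows how these three issues arise:

#include <stdio.h>

int main(void) {
    int number1, number2, sum;

    printf("Enter the first number: ");
    scanf("%d", &number1);   /* flagged by ESBMC as a buffer overflow on scanf: input is not validated */

    printf("Enter the second number: ");
    scanf("%d", &number2);   /* second unchecked scanf call */

    sum = number1 + number2; /* possible arithmetic overflow on add for large inputs */

    printf("Sum: %d\n", sum);
    return 0;
}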
When GPT-4 is requested to write a secure version of this code using the following
prompt: “Create a C program that prompts the user to input two numbers and then
calculates their sum. Be careful and avoid security vulnerabilities.”, it only attempts
to fix entering non-integer inputs by adding the following code snippet (Figure 2):
Fig. 2: GPT-4 generated code snippet response after requesting a secure version of
the code in Figure 1.
Even after requesting a “secure” zero-shot prompt, all three original issues remain
unresolved. Despite the significant advancements from GPT-3.5-turbo—which exhib-
ited the same issue [9]—to GPT-4, our motivational example indicates that GPT-4
continues to produce code with vulnerabilities. Even if specifically requested in the
prompt to avoid integer overflow in the program, the issue persists (Figure 3).
Fig. 3: Zero-shot prompt requesting a fix for integer overflow (failing to do so).
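For completeness, a hedged sketch of our own (not produced by any of the tested models) shows how the addition could be guarded so the reported overflow cannot occur:

#include <limits.h>
#include <stdio.h>

/* Adds a and b only if the result fits in an int; returns 1 on success, 0 on overflow. */
static int safe_add(int a, int b, int *result) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return 0; /* the addition would overflow */
    }
    *result = a + b;
    return 1;
}

int main(void) {
    int sum;
    if (safe_add(2147483647, 1, &sum)) {
        printf("Sum: %d\n", sum);
    } else {
        printf("Error: the result would not fit in a signed 32-bit integer.\n");
    }
    return 0;
}

Bounds-checked input handling (for example, fgets followed by strtol with range checks) would similarly address the scanf-related reports.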
1 https://github.com/features/copilot/
2 https://aws.amazon.com/codewhisperer/
attempted through comments in the code). In addition, GitHub Copilot is powered by
a variant of the GPT (Generative Pre-trained Transformer) model called Codex, which
OpenAI developed. The underlying issue will remain if these models are not trained to
produce secure code. Based on this observation, we aim to conduct a comprehensive
study involving various state-of-the-art models to address our research questions.
3 Related Work
This section overviews automated vulnerability detection and notable existing
datasets containing vulnerable code samples for various training and benchmarking
purposes.
3 This metric highlights the model’s ability to produce correct and functional code on its first try without any revisions or corrections.
4 Code smells are patterns in code that hint at potential problems, making maintenance harder but not necessarily causing immediate errors. They suggest areas where the code may need to be refactored for better quality and reliability.
This work had an interesting finding: the proposed software process models improved
the quality of the generated code by significantly reducing code smells compared to
what GPT-3.5-turbo outputs by itself. Code smells or bad coding practices will not
outright introduce vulnerabilities. However, several small-scale studies point to the
fact that LLMs negatively impact software development from a security perspective.
In [40], the authors generated 21 small programs in five different languages: C, C++,
Python, HTML, and Java. Combining manual verification with GPT-based vulner-
ability detection, the study found that only 5 of the 21 generated programs were
initially secure.
In [41], Pearce et al. conclude that the control group utilized GitHub’s Copilot to
solve arithmetic operations accurately. This work highlights an important lesson: to
accurately measure the role of AI tools in code generation or completion, it is essential
to choose coding scenarios mirroring a diverse set of relevant real-world settings,
thereby facilitating the occurrence of various vulnerabilities. This necessitates the
creation of code bases replicating a wide range of scenarios, which is one of the primary
goals the FormAI dataset strives to achieve. These studies indicate that AI tools, and
in particular ChatGPT, can produce code containing vulnerabilities as of today.
Ma et al. [42] assessed the capabilities and limitations of ChatGPT for SE and
provided initial insights into why the programs generated by language models are
syntactically correct but potentially vulnerable. A study by Microsoft [43] found that
GPT models encounter difficulties when accurately solving arithmetic operations. This
aligns with our findings in Section 2.
In a comprehensive literature review, Hou et al. [44] examined LLMs’ application,
effects, and possible limitations on SE. This study reveals that LLMs are extensively
employed across software development, appearing in 229 papers for 24 distinct SE
tasks, predominantly in code generation and program repair. It also identifies over 70
LLMs, classifying them into three architectural types: decoder-only, encoder-decoder,
and encoder-only. Each architecture serves specific functions—encoder-only for in-
depth understanding, encoder-decoder for combined understanding and generation,
and decoder-only primarily for generation tasks. This work highlights an interesting
gap: there are dozens of research papers aiming to perform vulnerability detection
in source code using machine learning (ML) and LLMs [45–57], however, assessing
software safety and security properties of LLM-generated code on a large scale has
not yet been performed apart from our original work [9] for C, and recently by [58]
for PHP code. Both studies evaluated a single model in a zero-shot code generation
scenario, while our current work also conducts a comparison of the performance of
different models.
In [59] Shumailov et al. highlighted a phenomenon known as “model collapse”.
Their research demonstrated that integrating content generated by LLMs can lead to
persistent flaws in subsequent models when using the generated data for training. This
hints that training ML algorithms only on purely AI-generated content is insufficient
if one aims to prepare these models for detecting vulnerabilities in human-generated
code. This is essentially due to training on a dataset that is
not diverse enough and misrepresents edge cases. This raises the question of whether
the FormAI dataset is suitable for fine-tuning and ML purposes. It is important to
note that the AI-generated code is just one part of the dataset. Most importantly, the
vulnerability labeling was not done by AI but by the ESBMC formal verification tool.
This way, models trained on this dataset can essentially learn the skills of a formal
verification tool (or at least approximate its results).
The programs are generated through a dynamic zero-shot prompting method, and
the generated programs are not modified by any AI system afterward. While the
primary goal of our paper is to investigate and compare the secure coding abilities
of different LLMs, these conditions make the FormAI-v2 dataset suitable for ML
purposes. On the other hand, AI models were trained on human-generated content;
thus, the vulnerabilities produced have roots in incorrect code created by humans.
Yet, as discussed in the next section, existing datasets notoriously include synthetic
data (different from AI-generated), which can be useful for benchmarking vulnerability
scanners but has questionable value for training purposes [60].
Big-Vul, Draper, Devign, REVEAL, and DiverseVul comprise vulnerable real-world
functions from open-source applications. These five datasets do not include all the
samples’ dependencies; therefore, they are non-compilable. SARD and Juliet contain
synthetic, compilable programs. In their general composition, the programs contain a
vulnerable function, its equivalent patched function, and a main function calling these
functions. All datasets indicate whether a code is vulnerable, using various vulnerabil-
ity labeling methodologies such as P, where functions are considered vulnerable before
receiving GitHub commits that address detected vulnerabilities; M, which involves
manual labeling; S, which uses static analyzers; and B, designated as by design vul-
nerable without the use of a vulnerability verification tool. It’s important to note that
the size of these datasets can be misleading, as many include samples from languages
other than the one primarily studied. For example, SARD includes not only C and
C++ but also Java, PHP, and C#. Moreover, newly released sets often incorporate
previous datasets or scrape the same GitHub repositories, making them redundant.
For example, Draper contains C and C++ code from the SATE IV Juliet Test
Suite, Debian Linux distribution, and public Git repositories. Since the open-source
functions from Debian and GitHub were not labeled, the authors used a suite of
static analysis tools: CPPcheck [69] and Flawfinder [62]. However, the paper does
not mention if vulnerabilities were manually verified or if any confirmation has been
performed to root out false positives. In [60], on top of creating DiverseVul, Chen et al.
merged all datasets that were based on GitHub commits and removed duplicates, thus
making the most comprehensive collection of GitHub commits containing vulnerable
C and C++ code.
and is used in applications where reliability is critical, such as aerospace and medi-
cal devices. However, FV can be time-consuming and requires specialized knowledge,
limiting its widespread adoption [22]. Recent advancements include machine learning
techniques, particularly LLMs, in various aspects of software verification [75]. LLMs
can assist in automated code review by suggesting improvements, detecting vulnera-
bilities, generating test cases, fixing bugs, and creating documentation. Despite their
potential [76–82], LLMs on their own face limitations such as a lack of understanding
of code semantics and difficulty in handling highly domain-specific knowledge [83],
and they depend heavily on the quality and variety of the training data. Using LLMs
as part of a framework to complement other techniques is, however, a promising area
of research [7, 9, 84, 85]. An earlier work from 2022 examined the ability of various
LLMs to fix vulnerabilities, where the models showed promising results, especially
when combined. Still, the authors noted that such tools are not ready to be used
in a program repair framework, where further research is necessary to incorporate
bug localization. They further highlighted challenges in the tool’s ability to generate
functionally correct code [86].
While LLMs struggle with detection by themselves, in [7], the authors demon-
strated that GPT-3.5-turbo could efficiently fix errors if the output of the ESBMC
verifier is provided. Program repair is another emerging area where the application of
LLMs is showing real promise, where in addition to fine-tuning strategies, the com-
bination of LLMs with other tools appears to be an effective method [47, 87–103].
In [104], the authors call for innovations to enhance automated vulnerability repair,
particularly for developing more extensive training datasets to optimize LLMs.
4.1.1 State Transition System
A state transition system M = (S, T, S0 ) represents an abstract machine consisting
of a collection of states S, where S0 ⊆ S indicates the initial states, and T ⊆ S × S
specifies the transition relation, illustrating the potential state transitions within the
system. Every state s ∈ S is characterized by the value of the program counter (pc)
and the values of all program variables. The initial state s0 sets the program’s starting
location. Transitions between any two states si and si+1, denoted as (si, si+1) ∈ T,
are associated with a logical formula T(si, si+1) that describes the
constraints on the program counter and program variables relevant to that transition.
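As a small illustration (our own example, not part of the original formalization), consider a program whose only statement is x = x + 1, executed once. Its states are pairs (pc, x) with pc ∈ {1, 2}; the initial state is s0 = (1, x0), and the single transition is described by

T(s0, s1) = (pc0 = 1) ∧ (pc1 = 2) ∧ (x1 = x0 + 1),

so any satisfying assignment captures one legal execution step in which the program counter advances and x is incremented.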
Termination and error conditions are mutually exclusive, rendering ϕ(s) ∧ ψ(s) inher-
ently unsatisfiable. If T(si, si+1) ∨ ϕ(s) is unsatisfiable, state s is considered a deadlock
state.
k, thereby minimizing false negatives. By adopting this strategy, we aim to classify
each program by detecting violated properties in the generated code.
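For reference, the bounded model checking problem over such a transition system is commonly encoded as the following formula (our reconstruction in the notation of Section 4.1, shown here because the original equation is not reproduced; P(s) denotes the property being checked in state s):

BMC(k) = I(s0) ∧ T(s0, s1) ∧ ... ∧ T(sk−1, sk) ∧ (¬P(s0) ∨ ... ∨ ¬P(sk)),

where I(s0) characterizes the initial states. A satisfying assignment yields a counterexample of length at most k, while unsatisfiability shows the absence of property violations up to the bound k.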
Table 2: The Four Main Categories for Vulnerability Classification With ESBMC
VS (Verification Success): The set of samples for which the verification process was
completed successfully with no vulnerabilities detected.
VU (Verification Unknown / Timeout): The set of samples for which the verification
process did not complete within the allotted time frame. Although no counterexample
was found within the time limit, this does not guarantee the absence of vulnerabilities
given a longer time frame; therefore, the verification status remains unknown.
VF (Verification Failed): The set of samples for which verification failed, i.e.,
vulnerabilities were detected by ESBMC based on counterexamples.
ER (Error): The set of samples for which the verification process resulted in an error.
This typically occurs due to a parsing error in ESBMC, an issue in the GOTO
converter, or other problems with the SMT solver.
datasets ReVeal, BigVul, and Diversevul, a function is vulnerable if the vulnerability-
fixing commit changes it, while in Juliet, a single vulnerability is introduced for each
program.
In FormAI, a single file often contains multiple vulnerabilities. As noted, a single
vulnerability can be associated with multiple CWEs. Additionally, multiple CWEs can
be required for a vulnerability to be exploitable. As an example, “CWE-120: Buffer
Copy without Checking Size of Input (Classic Buffer Overflow)” can happen as a
result of “CWE-676: Use of Potentially Dangerous Function”, such as the scanf
function. If this is combined with “CWE-20: Improper Input Validation”, it can result
in “CWE-787: Out-of-bounds Write”. Labeling the vulnerable function name, line
number, and vulnerability type identified by the ESBMC module provides granular
information that can benefit machine learning training. This level of detail can allow
models to discern patterns and correlations with higher precision, thereby improving
vulnerability prediction and detection capabilities.
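To make this chain concrete, the following hedged sketch (illustrative only, not a sample from the dataset) shows how an unchecked scanf call leads to an out-of-bounds write:

#include <stdio.h>

int main(void) {
    char name[16];              /* fixed-size buffer */
    printf("Enter your name: ");
    /* CWE-676: scanf with an unbounded "%s" is a potentially dangerous function.   */
    /* CWE-20:  the input length is never validated.                                */
    /* CWE-120 / CWE-787: any input longer than 15 characters overflows 'name'      */
    /* and writes past the end of the buffer.                                       */
    scanf("%s", name);
    printf("Hello, %s!\n", name);
    return 0;
}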
Since our programs exhibit numerous vulnerabilities, including multiple occur-
rences of the same type, categorizing each solely into one CWE group, as seen with
Juliet, would be sub-optimal for training purposes. This method fails to communicate
crucial details about the vulnerabilities. For instance, both “Arithmetic overflow on
add” and “Arithmetic overflow on div” are assigned the same primary CWE, man-
ifesting differently in the source code. Therefore, merely labeling them with CWEs
does not offer the same level of granularity and makes the dataset less suitable for ML.
While other datasets focus more on CWEs related to vulnerabilities that could be
exploited, ESBMC also detects issues related to software safety. For this reason, in
the FormAI dataset, we did not assign a single CWE to each vulnerability. However,
based on our mapping in Table 4, one can easily associate an ESBMC vulnerability
with the closest CWE number if needed.
Table 4: Detailed Categorization of Vulnerabilities Detected by ESBMC
Description CWE Associated CWEs
DF: Dereference failures:
1. NULL pointer CWE-476 CWE-690, CWE-391
2. Invalid pointer CWE-822 CWE-119, CWE-787, CWE-822
3. Forgotten memory CWE-825 CWE-401, CWE-404, CWE-459
4. Array bounds violated CWE-125 CWE-119, CWE-787
5. Invalidated dynamic object CWE-824 CWE-416, CWE-415
6. Access to object out of bounds CWE-125 CWE-119, CWE-787
7. Accessed expired variable pointer CWE-416 CWE-825
8. Write access to string constant CWE-843 CWE-758
9. Of non-dynamic memory CWE-590 CWE-415, CWE-762
10. IBTA CWE-843 CWE-119
11. Oversized field offset CWE-787 CWE-119, CWE-125, CWE-823
12. Data object accessed with code type CWE-843 CWE-686, CWE-704
AO: Arithmetic overflows:
13. On sub CWE-191 CWE-20, CWE-190, CWE-192
14. On add CWE-190 CWE-20, CWE-191, CWE-192
15. On mul CWE-190 CWE-20, CWE-191, CWE-192
16. Floating-point ieee_mul CWE-190 CWE-681
17. Floating-point ieee_div CWE-682 CWE-369, CWE-681
18. Floating-point ieee_add CWE-190 CWE-681
19. Floating-point ieee_sub CWE-190 CWE-681
20. On div CWE-190 CWE-20, CWE-369
21. On shl CWE-190 CWE-192
22. On modulus CWE-190 CWE-20, CWE-191
23. On neg CWE-191 CWE-190, CWE-192
BO: Buffer overflow:
24. On scanf CWE-120 {CWE-20, CWE-121, CWE-122
25. On fscanf CWE-120 CWE-129, CWE-131, CWE-628
26. On sscanf CWE-120 CWE-676, CWE-680, CWE-787}
ABV: Array bounds violations:
27. lower bound CWE-129 {CWE-119, CWE-125, CWE-129
28. upper bound CWE-788 CWE-131, CWE-193, CWE-787}
29. VLA array size in bytes overflows CWE-190 CWE-131, CWE-680
MV: Miscellaneous Vulnerabilities:
30. Division by zero CWE-369 CWE-691
31. The pointer to a file must be valid CWE-476 CWE-690, CWE-459
32. Same object violation CWE-628 CWE-843, CWE-668
33. ZOFO CWE-761 CWE-415, CWE-590
Legend:
ZOFO: Operand of free must have zero pointer offset, IBTA: Object accessed with incompatible base type
The dynamic parts of the prompt, highlighted as [Type] and [Style], represent
distinct categories within the prompt, each encompassing different elements. Every
API call randomly selects a Type category from a set of 200 elements. This cate-
gory contains topics such as Wi-Fi Signal Strength Analyzer, QR code reader, Image
Steganography, Pixel Art Generator, Scientific Calculator Implementation, etc. Sim-
ilarly, a coding Style is chosen from a set of 100 elements during each query. This
helps minimize repetition, as coding styles such as “excited”, “relaxed”, or “mathe-
matical” are randomly combined with a Type category. Our primary objective was to
Fig. 4: Overview of the FormAI-v2 generation and classification pipeline. Dynamic prompts combining a Coding Task and a Coding Style are sent to the LLMs to generate C programs; the programs are compiled (gcc 13.2), analyzed for cyclomatic complexity (Lizard 1.17.10), filtered for clones (NiCad 7.0), and verified by ESBMC (clang AST converter, GOTO converter, symbolic execution, SMT solver). Each sample is labeled as Verification Successful, Verification Failed (property violation), or Unknown, yielding the final FormAI-v2 dataset with vulnerability classification in JSON format.
identify and capture as many vulnerabilities as possible. This method can generate
200 × 100 = 20 000 distinct combinations. As demonstrated by insights from [86, 114],
there’s a need for a code base that supports diverse settings while ensuring tasks
remain concise to fit within the token constraints of large language models (LLMs).
This raises a key question: If we generate a dataset of over 300 000 instances
but only 20 000 distinct combinations, will it lead to redundancy? Will the same or
different models produce identical outputs for these repeated prompts? To address
this, we will conduct clone code detection in the next section to ensure the generated
code is unique. Selecting prompts that LLMs can efficiently process is important,
Fig. 5: Dynamic Code Generation Prompt.
therefore we designed tasks in the Type category accordingly. For instance, complex
prompts like “Create a CRUD application using React for the front-end, Node.js with
Express for the back-end, and MongoDB for the database” must be broken down
into smaller, manageable tasks. Furthermore, tasks with different styles, such as ’File
handling’ with a ’romantic’ versus a ’happy’ style, lead to distinct outputs, which are
reflected in different representations in the vector space upon tokenization. Despite
potential compatibility issues between certain Type-Style combinations, encouraging
LLMs to code in varied styles has generally enhanced the diversity of responses to
identical Types.
Decreasing the number of unsuccessful queries by refining the prompt is important
from an efficiency perspective. We have established five instructions in each prompt
to minimize the error within the generated code. These instructions, along with their
corresponding explanations, are the following:
1. Minimum 50 lines: This encourages the LLM to avoid the generation of overly
simplistic code with only a few lines (which occasionally still happens);
2. Be creative!: The purpose of this instruction is to generate a more diverse
dataset;
3. Do not say I am sorry: This instruction aims to circumvent objections and
responses such as “As an AI model, I cannot generate code”, and similar statements.
4. Make sure the program compiles: This instruction encourages the model to
include header files and create a complete and compilable program.
5. Generate a code snippet that starts with ```c: Enable easy extraction of
the C code from the response.
Once a C code is generated, the GNU C compiler5 is employed to verify whether
the corresponding code is compilable. During the code generation process, we ensure
that the FormAI-v2 dataset exclusively consists of compilable code while excluding
any other code that does not meet this criterion. Different models can generate vary-
ing percentages of compilable code depending on their parameter size. Models like
GPT-4o-mini, Gemini Pro, or Falcon-180B can achieve compilation rates higher than
90%, whereas smaller models with 7B parameters typically produce C code with a
compilability rate between 55% and 70%.
The primary reason for having non-compilable code was due to the absence of
necessary headers, such as math.h, ctype.h, or stdlib.h. As the cost of generation
associated with different models can significantly vary, we did not generate the same
5 https://gcc.gnu.org
Table 5: Content of the FormAI-v2 Dataset.
LLM Model Company Size License Sample Size
GPT-4o-mini OpenAI N/A Proprietary 40 000
Llama2 13B Meta 13B Open 20 000
Mistral-7B Mistral AI 7B Apache 2.0 10 000
Code Llama 13B Meta 13B Proprietary 12 000
Gemini Pro 1.0 Google N/A Proprietary 40 000
Gemma-7B Google 7B Gemma-TOU 47 000
Falcon-180B TII 180B Apache 2.0 72 000
Falcon2-11B TII 11B Apache 2.0 12 000
GPT-3.5-turbo OpenAI 175B Proprietary 78 000
number of samples from each model. While some tested models are open source, their
operational costs and GPU usage remain significant. For instance, running Falcon-
180B on AWS can cost around 40 USD per hour. Table 5 presents the samples obtained
from each LLM.
Table 6: Different Types of Clones Removed From the Dataset
LLM Model Sample size Type1 Type2 Type 3-1 Type 3-2 ∆(%)
Falcon2-11B 12 000 0 0 1 36 0.30
Mistral-7B 10 000 1 1 11 59 0.59
CodeLlama-13B 12 000 3 5 12 128 1.10
GPT-3.5-turbo 78 000 118 301 502 1 756 2.25
GPT-4o-mini 40 000 0 24 31 1 075 2.69
Gemini Pro 1.0 40 000 12 150 187 1 255 3.14
Falcon-180B 72 000 42 363 541 3 464 4.81
Llama2-13B 20 000 165 607 1 001 2 214 11.07
Gemma-7B 47 000 657 3 229 2 997 10 199 21.70
The last column, ∆(%), shows the percentage of Type 3-2 clones that were
detected and removed from the dataset. A higher percentage indicates that the LLM
generated more similar, redundant code samples.
In terms of clone categories, Type 1 and Type 2 are hierarchical: Type 1 clones
are a subset of Type 2, meaning that all Type 1 clones are also considered Type 2.
Similarly, Type 3-1 and Type 3-2 are inclusive, where Type 3-1 clones fall within
the broader Type 3-2 category. In other words, Type 1 ⊆ Type 2 and Type 3-1 ⊆
Type 3-2.
However, Type 2 is not a subset of Type 3-1 because the threshold for Type 2
clones is exactly zero—meaning only variable changes are allowed across the entire
code with no additional modifications. In contrast, Type 3-1 allows for up to a 10%
modification threshold, which can include variable changes, deletions, additions, or
structural modifications, as long as they remain within the 10% limit.
The most flexible clone category is Type 3-2, where a 10% threshold applies to the
entire code without restrictions. This means that any kind of modification, including
variable changes throughout the entire program, is allowed. To ensure the dataset’s
quality, we removed all clones up to and including Type 3-2.
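To illustrate the clone categories with a hedged example of our own (not taken from the dataset), consider a function and two of its clones:

#include <stddef.h>

/* Original function */
int sum_array(int *arr, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

/* Type 2 clone: only identifiers are renamed, the structure is identical */
int add_values(int *data, int len) {
    int acc = 0;
    for (int j = 0; j < len; j++) {
        acc += data[j];
    }
    return acc;
}

/* Type 3 clone: renamed identifiers plus a small structural change
   (an added guard), staying within the ~10% modification threshold */
int add_values_checked(int *data, int len) {
    int acc = 0;
    if (data == NULL) return 0;
    for (int j = 0; j < len; j++) {
        acc += data[j];
    }
    return acc;
}

A Type 1 clone would be identical to the original except for whitespace, layout, and comments.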
After filtering out these clones from each LLM-generated subset, we applied Type
3-2 detection to the entire dataset to identify any similar code across different models.
This process revealed an additional 283 Type 3-2 clones, which were subsequently
removed. In total, 20 469 programs were excluded, resulting in a final dataset of 310 531
unique files. This demonstrates that the dataset is diverse, with only 6.18% of the
original programs being similar.
false impression. As a result, in the FormAI-v1 dataset, numerous samples were previ-
ously classified as “NON-VULNERABLE up to bound k”. We have transitioned from
bounded to unbounded model checking to capture more vulnerabilities or prove their
absence for each sample. This approach incrementally unwinds the program until a
bug is found or the completeness threshold is reached, meaning all possible terminat-
ing states have been explored. Incremental BMC ensures that smaller problems are
solved sequentially, avoiding the need to guess an upper bound for verification.
Applying these settings, we have successfully identified more vulnerabilities in the
programs. Consequently, if the verification process is completed successfully, we can
conclude that the program has no violated properties (that can be detected by the
currently used ESBMC version). While this approach requires significantly more com-
putational power, it has proven effective in revealing more vulnerabilities or proving
their absence, as we will demonstrate in Section 6.
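To illustrate why this transition matters, consider the following hedged example of our own (not a dataset sample): the out-of-bounds write occurs only in the final loop iteration, so a bounded run that unrolls the loop fewer than 11 times (without unwinding assertions) would miss it, whereas incremental unwinding up to the completeness threshold exposes it.

#include <stdio.h>

int main(void) {
    int buf[10];
    /* The loop body executes 11 times (i = 0..10); the last iteration
       writes buf[10], one element past the end of the array. */
    for (int i = 0; i <= 10; i++) {
        buf[i] = i;
    }
    printf("%d\n", buf[0]);
    return 0;
}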
Table 7: Classification Results for the 1000-Sample Dataset With Varying Param-
eters.
ESBMC Parameters RESULTS
u time k-ind bmc fls Runtime |ϕ| VS VF VU ER
Legend:
✓: Enabled; x: Not set; |ϕ|: Number of Vulnerabilities detected; k-ind: k-induction;
bmc: incremental-bmc; fls: falsification technique; u: unwind; (Runtime in (m:s))
Fig. 6: ESBMC Command Employed to Verify Each Sample in the Dataset.
6 Verification Results
In this section, we summarize our key results, beginning with an analysis of statis-
tics for the entire dataset, focusing on overall verification outcomes and vulnerability
types. It is important to note that the analysis is based on 310 531 programs, as all
clones up to Type 3-2 have been excluded from the initial 331 000. We then evaluate
each LLM, comparing the complexity of the code they generate and their secure coding
capabilities.
In the original FormAI dataset, only 112 000 compilable C samples were created
using GPT-3.5-turbo. Furthermore, the complexity of each program was not mea-
sured. This research closes this gap by comparing nine state-of-the-art LLMs and
providing a vulnerability-labelled dataset to the research community. We have exam-
ined 26 633 156 lines of C code, with an average of 85.77 lines per sample. In total,
we performed the verification process on 310 531 C program files, and our results for
the entire dataset are shown in Table 8. The TOP 10 violations throughout the entire
dataset are presented in Table 9. Table 10 provides a breakdown of the distribution
for each of the top five main categories of vulnerabilities.
During the 500-second verification time-frame, ESBMC identified 192 757 unique
programs with vulnerabilities. In contrast, only 25 674 programs, representing 8.27%,
were verified as secure. Expanding computational resources may resolve more of the
programs currently in the VU category, thereby potentially extending the VF category. These
results provide a tighter lower bound than [9] on what percentage of
LLM-generated code is vulnerable. The situation is more concerning than merely
Table 8: Overview of Statistics and Verification Results for Each LLM.
Model Samples Max Avg Avg VS VU VF ER
Name w.o clones SLOC SLOC CC (%) (%) (%) (%)
stating that 62.07% of the generated files are vulnerable, as a single file can contain
multiple vulnerabilities. On average, each file contains 3.97 vulnerabilities. The total
number of property violations detected by ESBMC for the overall dataset is 765 366.
The most common type of vulnerability is related to “Dereference failures”
accounting for 54.54% of the cases, predominantly due to NULL pointer issues. This
category includes a variety of pointer-related issues, such as invalid pointers, forgotten
memory, and array bounds violations, among others. “Buffer overflows”, mainly
triggered by the scanf function, comprise a significant 27.99% of the vulnerabili-
ties. This highlights common issues in handling buffer sizes and input functions.
Table 10: Detailed Categorisation of Vulnerabilities in the Entire Dataset
Description Count Percentage
Dereference failures:
- NULL pointer 289 548 37.83%
- Invalid pointer 73 838 9.65%
- Forgotten memory 21 108 2.76%
- Array bounds violated 23 586 3.08%
- Invalidated dynamic object 3 145 0.41%
- Access to object out of bounds 3 221 0.42%
- Accessed expired variable pointer 1 227 0.16%
- Write access to string constant 913 0.12%
- Non-dynamic memory 342 0.04%
- Object accessed with incompatible base type 379 0.05%
- Oversized field offset 170 0.02%
- Data object accessed with code type 14 0.00%
Arithmetic overflows:
- On sub 18 345 2.40%
- On add 15 966 2.09%
- On mul 12 462 1.63%
- IEEE mul 9 673 1.26%
- IEEE div 3 522 0.46%
- IEEE add 2 375 0.31%
- IEEE sub 1 632 0.21%
- On div 813 0.11%
- On shl 972 0.13%
- On modulus 348 0.05%
- On neg 155 0.02%
Buffer overflows:
- On scanf 214 255 27.99%
- On fscanf 8 252 1.08%
- On sscanf 4 184 0.55%
Array bounds violations:
- Upper bound 23 380 3.05%
- Lower bound 19 918 2.60%
- VLA array size in bytes overflows address space size 4 222 0.55%
Miscellaneous Vulnerabilities:
- Division by zero 4 311 0.56%
- The pointer to a file object must be a valid argument 1 225 0.16%
- Invalid Function argument issues 443 0.06%
- Same object violation 123 0.02%
- Operand of free must have zero pointer offset 134 0.02%
“Arithmetic overflows” are also notable, covering various operations like subtraction,
addition, multiplication, and division, indicating frequent issues in handling numeric
calculations without adequate checks. The table further lists “Array bounds viola-
tions” and “Division by zero” as common issues, illustrating challenges in correctly
managing arrays and arithmetic operations. A smaller portion of the table covers
“Miscellaneous Vulnerabilities” which includes a variety of less frequent but notable
issues such as invalid file object pointers and operand violations in memory dealloca-
tion. Overall, the data emphasizes the need for robust handling of pointers, buffers,
and numeric operations within the source code to mitigate the risk of vulnerabilities.
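The most frequent pattern behind the dereference-failure category can be illustrated with a minimal hedged example of our own (not a dataset sample), where the result of malloc is used without a NULL check:

#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buffer = malloc(64);
    /* If malloc fails it returns NULL; copying into it then dereferences a NULL
       pointer (CWE-476), which ESBMC reports as a dereference failure. */
    strcpy(buffer, "unchecked allocation");
    free(buffer);
    return 0;
}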
6.1 General observation about code complexity
NIST defines Cyclomatic Complexity (CC) as “the amount of decision logic in a
source code function” and recommends a maximum value of 10 [115]. According to
NIST, “higher numbers are bad and lower numbers are good.” As Figure 7 shows,
many individual programs generated by Gemma-7B exceed the threshold of 10. While
SLOC and CC cannot be used to determine whether code is vulnerable directly, we
observed that higher cyclomatic complexity can lead to an increased likelihood of
vulnerabilities. Models such as GPT-3.5-turbo, Gemma-7B, and Falcon2-11B, which
have high CC, also display the highest rates of verification failures.
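As a brief reminder of how CC is computed (a hedged illustration of our own), for structured code CC equals the number of decision points plus one:

/* Cyclomatic complexity illustration: this function contains three decision
   points (one for loop and two if statements), so CC = 3 + 1 = 4. */
int count_positive_even(const int *arr, int n) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (arr[i] > 0) {
            if (arr[i] % 2 == 0) {
                count++;
            }
        }
    }
    return count;
}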
As earlier shown in Table 8, the Avg. CC (Average cyclomatic complexity per sam-
ple) and Avg. SLOC (Average Source Lines of Code per sample) provide insight into
the complexity of the code generated by a certain model. As previously mentioned, if a
model produces only non-vulnerable code, it doesn’t necessarily indicate high quality;
it could suggest that the generated code is very simple (e.g., generating only “print
’hello world”’ examples). While observing SLOC and CC cannot precisely determine
a model’s code quality, it is interesting to observe that GPT-4o-mini, CodeLlama-
13B, and Llama2-13B had the lowest verification failed rates and the lowest
CC scores.
The analysis of Table 8 shows that GPT-4o-mini does not necessarily generate
shorter or simpler code. It produces the longest C programs, with an average SLOC
of 103.48, and has the highest verification unknown score (36.77%), indicating that
the ESBMC verification process takes longer for GPT-4o-mini samples. In contrast,
Gemma-7B generates the shortest average SLOC and also has the lowest verification
unknown result (16.30%). Additionally, GPT-4o-mini produces code with a lower CC,
which implies better maintainability and quality, while Gemma-7B has a much higher
average CC.
Fig. 7: Cyclomatic complexity of individual generated programs for two of the models (left panel: Avg. CC = 3.40; right panel: Avg. CC = 5.25), with CC per program on the y-axis (0–50) and the program index on the x-axis.
Fig. 8: Normalized Average Keyword Frequency Heatmap (Per Million Lines of Code)
Keyword     BigVul    GPT-4o-mini    Gemma-7B    Falcon-180B    Gemini-Pro
if           96012          45687       32027          41093         46165
return       44190          28197       25819          30799         41172
struct       27847          15001       23654          21025         22625
int          27303          97717       91534         106378         92644
const        19971          15632          68           3446          3989
case         17261          13219       19869           7412          7092
else         16199          12050       11528          10342         10188
void         12604          40261       17174          27736         21408
char         11541          39740       24815          38441         34986
goto         10144              9           4              6            88
unsigned      9653           3924         597           1321          3735
for           7313          25320       29811          27322         27974
long          4258           3223         196            480          1144
bool          4006           1599           1            457           772
while         2993          10656        9139          12055          7691
switch        2377           2590        3936           1356          1522
double        1709           8492        6350           6652          9748
static         927            144           5            153           675
sizeof         858             25           0              4            18
register       808              0           0              0             1
float          615           5234        1732           2141          2812
short          558            346         122             54           146
do             477           1179          26            709           189
enum           451            329          56            190           817
auto           316              0           0              0             0
union          182             24          47              3            16
volatile       134             52           0              1            13
typedef         64           7188        8909           8308          7638
break           53              0           0              0             2
signed          46              0           0              0             0
extern          28              0           0              0             1
default         18              0           0              0             1
continue         7              0           0              0             1
code. A practical starting point for this analysis is comparing keyword frequencies.
In real-world C/C++ projects, such as those from GitHub and datasets like BigVul,
common keywords include ‘if’, ‘return’, ‘struct’, ‘int’, and ‘const’, while less frequent
keywords include ‘continue’, ‘default’, and ‘extern’. Significant differences in keyword
frequencies between LLM-generated and real-world code would question the dataset’s
validity.
To investigate, we used a token-based keyword-counting method to analyze the
frequency of 32 C keywords in each LLM-generated subset. Ideally, LLM-generated
code should exhibit a similar keyword distribution to real-world code. Figure 8 shows
the normalized keyword frequency (occurrences per million lines of code) for various
LLM-generated codes, with BigVul as a real-world benchmark. The heatmap reveals
that LLM-generated and real-world code have closely matching keyword distributions,
likely due to the LLMs being trained on human-written GitHub projects.
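A minimal sketch of such a token-based counter (our own simplified illustration; the actual analysis scripts are not part of the paper) is shown below:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* The 32 keywords of C89/C90; later additions such as _Bool are omitted in this sketch. */
static const char *keywords[] = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if",
    "int", "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile", "while"
};
#define NUM_KEYWORDS (sizeof(keywords) / sizeof(keywords[0]))

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s file.c\n", argv[0]);
        return 1;
    }
    FILE *fp = fopen(argv[1], "r");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    long counts[NUM_KEYWORDS] = {0};
    char token[64];
    size_t len = 0;
    int c;

    /* Token-based scan: collect maximal identifier-like tokens and compare each
       against the keyword list. Comments and string literals are not skipped in
       this simplified sketch. */
    while ((c = fgetc(fp)) != EOF) {
        if (isalnum(c) || c == '_') {
            if (len < sizeof(token) - 1)
                token[len++] = (char)c;
        } else if (len > 0) {
            token[len] = '\0';
            for (size_t k = 0; k < NUM_KEYWORDS; k++)
                if (strcmp(token, keywords[k]) == 0)
                    counts[k]++;
            len = 0;
        }
    }
    if (len > 0) { /* flush a trailing token that ends at end-of-file */
        token[len] = '\0';
        for (size_t k = 0; k < NUM_KEYWORDS; k++)
            if (strcmp(token, keywords[k]) == 0)
                counts[k]++;
    }
    fclose(fp);

    for (size_t k = 0; k < NUM_KEYWORDS; k++)
        printf("%-10s %ld\n", keywords[k], counts[k]);
    return 0;
}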
While there are slight variations in the distribution between the LLMs and BigVul,
mainly for the less frequent keywords, the LLMs show strong similarity to real-world
code in how they use statements, expressions, and variables. Note that while all
LLM-generated programs in our dataset are fully compilable, this is not the case for
BigVul samples and other human-written code datasets.
6.3 Vulnerability Ranking
Table 11 (Parts I, II, and III) provides an overview of the top 10 vulnerabilities gen-
erated by each model. Note that raw vulnerability counts are not directly comparable
due to the differing number of samples produced by each model. To enable a fair com-
parison across LLMs, the table also includes the percentage representation of each
vulnerability. This analysis does not offer a comprehensive review of all identified
CWEs but focuses on vulnerabilities explicitly verified by ESBMC.
Buffer overflow vulnerabilities related to scanf are consistently ranked among the
top three across all LLM models. The functions scanf, fscanf, and sscanf do not
restrict the input size of their respective buffers, creating a risk of buffer overflow.
This vulnerability can allow attackers to execute arbitrary code or trigger crashes.
As previously mentioned, these issues relate to several CWEs, including CWE-676,
CWE-20, and CWE-787. Although buffer overflow is a type of out-of-bounds write,
CWE-787 covers a broader range of vulnerabilities. CWE-120 specifically addresses
classic buffer overflow scenarios caused by unchecked input sizes during buffer copy
operations. While more complex issues like arithmetic overflows and array bounds vio-
lations require deeper programming context, simpler issues such as scanf errors should
be easier to avoid. However, all tested models consistently exhibit buffer overflow
errors with scanf.
Dereference failures, particularly NULL pointer dereferences, are among the most
prevalent vulnerabilities across all LLMs. This is due in part to the varied and often
unsafe examples of pointer usage in training datasets, combined with the inherent
complexity of dynamic memory management in C. LLMs rely on pattern recognition
rather than deep understanding, which leads them to mishandle pointers and fail
to replicate the nuanced behavior of real-world applications. This results in frequent
dereference issues and flawed pointer handling, highlighting significant risks when
deploying LLM-generated code in critical systems where security and reliability are
paramount.
The severity and frequency of these vulnerabilities vary significantly among mod-
els. For instance, Gemma-7B exhibits a notably high rate of NULL pointer dereference
failures at 60.50%, indicating substantial weaknesses in memory management. Arith-
metic overflows also consistently appear across all models in the top 10 list, and differ
based on specific operations (addition, subtraction, multiplication), underscoring var-
ied arithmetic handling. Notably, Llama2-13B stands out with less than 10% of scanf
violations, with Gemini Pro 1.0 close behind at approximately 11%; however, both
models, like Gemma-7B, show high rates of NULL pointer dereference failures.
The consistent occurrence of certain errors across different models underscores the
need for comprehensive testing and validation frameworks to address these recurring
issues before deployment. While all models share similar vulnerabilities, significant
differences in the frequency and types of other vulnerabilities—such as arithmetic
overflows—suggest that model-specific optimizations and enhancements are neces-
sary. To mitigate these risks, developing enhanced training methodologies focused on
robust memory handling is crucial. Implementing advanced code analysis tools and
frameworks is also essential to detect and rectify vulnerabilities before deployment for
real-world applications.
Table 11: Top 10 Vulnerabilities in LLM Generated Code - Part I
Rank Category Violation Type Count Percentage
GPT-3.5-turbo
1 BO Buffer overflow on scanf 84 213 38.23%
2 DF Dereference failure: NULL pointer 56 690 25.74%
3 DF Dereference failure: invalid pointer 20 617 9.36%
4 DF Dereference failure: forgotten memory 4 631 2.10%
5 DF Array bounds violated: lower bound 8 102 3.68%
6 DF Array bounds violated: upper bound 8 101 3.68%
7 AO Arithmetic overflow on sub 6 627 3.01%
8 DF Dereference failure: array bounds violated 6 537 2.97%
9 AO Arithmetic overflow on add 5 228 2.37%
10 AO Arithmetic overflow on mul 4 285 1.95%
Falcon-180B
1 BO Buffer overflow on scanf 49 175 34.37%
2 DF Dereference failure: NULL pointer 42 177 29.48%
3 DF Dereference failure: invalid pointer 15 732 11.00%
4 DF Dereference failure: forgotten memory 5 442 3.80%
5 DF Dereference failure: array bounds violated 4 545 3.18%
6 DF Array bounds violated: upper bound 4 310 3.01%
7 AO Arithmetic overflow on sub 3 315 2.32%
8 DF Array bounds violated: lower bound 3 611 2.52%
9 AO Arithmetic overflow on add 2 858 2.00%
10 BO Buffer overflow on fscanf 2 532 1.77%
Llama2-13B
1 DF Dereference failure: NULL pointer 17 630 54.45%
2 DF Dereference failure: invalid pointer 3 089 9.54%
3 BO Buffer overflow on scanf 2 775 8.57%
4 DF Dereference failure: array bounds violated 1 611 4.98%
5 DF Dereference failure: forgotten memory 1 254 3.87%
6 AO Arithmetic overflow on add 883 2.73%
7 DF Array bounds violated: upper bound 818 2.53%
8 AO Arithmetic overflow on mul 599 1.85%
9 AO Arithmetic overflow on sub 571 1.76%
10 BO Division by zero 462 1.43%
Gemma-7B
1 DF Dereference failure: NULL pointer 59 433 60.50%
2 BO Buffer overflow on scanf 14 950 15.22%
3 DF Dereference failure: invalid pointer 3 617 3.68%
4 DF Dereference failure: forgotten memory 3 191 3.25%
5 DF Array bounds violated: upper bound 3 379 3.44%
6 DF Array bounds violated: lower bound 2 784 2.83%
7 AO Arithmetic overflow on sub 2 040 2.08%
8 DF Dereference failure: array bounds violated 1 786 1.82%
9 AO Arithmetic overflow on floating-point ieee_mul 1 152 1.17%
10 AO Arithmetic overflow on add 1 302 1.33%
Table 11 (Cont.): Top 10 Vulnerabilities in LLM Generated Code - Part II
Rank Category Violation Type Count Percentage
CodeLlama-13B
1 DF Dereference failure: NULL pointer 11 546 44.75%
2 BO Buffer overflow on scanf 5 169 20.03%
3 DF Dereference failure: invalid pointer 3 481 13.49%
4 DF Dereference failure: array bounds violated 897 3.48%
5 DF Dereference failure: forgotten memory 695 2.69%
6 DF Array bounds violated: upper bound 683 2.65%
7 AO Arithmetic overflow on add 524 2.03%
8 DF Array bounds violated: lower bound 465 1.80%
9 AO Arithmetic overflow on mul 456 1.77%
10 AO Arithmetic overflow on sub 380 1.47%
Gemini Pro 1.0
1 DF Dereference failure: NULL pointer 65 376 55.95%
2 DF Dereference failure: invalid pointer 13 272 11.36%
3 BO Buffer overflow on scanf 12 948 11.08%
4 DF Dereference failure: array bounds violated 4 250 3.64%
5 DF Dereference failure: forgotten memory 3 340 2.86%
6 AO Arithmetic overflow on mul 2 466 2.11%
7 DF Array bounds violated: upper bound 2 285 1.96%
8 DF Array bounds violated: lower bound 1 952 1.67%
9 AO Arithmetic overflow on sub 1 899 1.63%
10 AO Arithmetic overflow on add 1 895 1.62%
Mistral-7B
1 DF Dereference failure: NULL pointer 6 294 33.17%
2 BO Buffer overflow on scanf 5 125 27.01%
3 DF Dereference failure: invalid pointer 2 460 12.97%
4 DF Dereference failure: array bounds violated 738 3.89%
5 DF Array bounds violated: lower bound 622 3.28%
6 AO Arithmetic overflow on sub 473 2.49%
7 DF Array bounds violated: upper bound 453 2.39%
8 DF Dereference failure: forgotten memory 414 2.18%
9 AO Arithmetic overflow on add 400 2.11%
10 BO Buffer overflow on sscanf 388 2.04%
GPT-4o-mini
1 BO Buffer overflow on scanf 33 307 42.60%
2 DF Dereference failure: NULL pointer 17 539 22.43%
3 DF Dereference failure: invalid pointer 7 055 9.02%
4 AO Arithmetic overflow on sub 2 479 3.17%
5 DF Array bounds violated: upper bound 2 277 2.91%
6 AO Arithmetic overflow on add 2 114 2.70%
7 DF Dereference failure: array bounds violated 1 956 2.50%
8 AO Arithmetic overflow on floating-point ieee_mul 1 857 2.38%
9 AO Arithmetic overflow on mul 1 536 1.96%
10 DF Array bounds violated: lower bound 1 398 1.79%
Table 11 (Cont.): Top 10 Vulnerabilities in LLM Generated Code - Part III
Rank Category Violation Type Count Percentage
Falcon2-11B
1 DF Dereference failure: NULL pointer 12 863 40.71%
2 BO Buffer overflow on scanf 6 593 20.87%
3 DF Dereference failure: invalid pointer 4 515 14.29%
4 DF Dereference failure: array bounds violated 1 266 4.01%
5 DF Dereference failure: forgotten memory 1 106 3.50%
6 DF Array bounds violated: upper bound 1 074 3.40%
7 AO Arithmetic overflow on add 762 2.41%
8 DF Array bounds violated: lower bound 613 1.94%
9 AO Arithmetic overflow on sub 561 1.78%
10 AO Arithmetic overflow on mul 459 1.45%
Legend:
VS: Verification Success; VF : Verification Failed; VU : Verification Unknown (Timeout).
Best performance in a category is highlighted with bold and/or Rank.
GPT-4o-mini outperforms GPT-3.5-turbo while showing the highest VU percent-
age under current ESBMC settings, indicating its ability to produce more complex and
longer outputs. It is important to note that this complexity is not reflected by the CC
number as discussed earlier, which echoes the criticism of Cyclomatic Complexity
voiced by practitioners. While GPT-4o-mini ranks third in VS and second in VF, it
finishes first in terms of average property violations per line. This might be the fairest way
to compare models, as the more lines a program has, the more chances for vulnerabilities, while
this metric does not penalize models that produce shorter code. While there is no defini-
tive winner in this analysis, Gemma-7B, Gemini-Pro, and GPT-3.5-turbo—with the
current verification settings— have the highest VF ratios and highest average prop-
erty violation both per line and file which indicates that these models are performing
worse in our test.
It is important to underline that, while it might be tempting to speculate on a winner,
having such a high verification failed ratio is unacceptable from an SE perspective for
any model. All models surpassed the 50% VF threshold, indicating that at least half
of the generated programs are vulnerable. The conclusions of this analysis
must be clear: Using code generated by the state-of-the-art Large Language
Models, without any additional framework for validation and vulnerability
analysis, carries severe risks. While LLMs can be useful for automating simple
tasks and scripting, directly including such codes in production software without
oversight from experienced software engineers is irresponsible and should
be avoided.
6 https://github.com/FormAI-Dataset
• What is the right path towards LLMs producing secure code: Re-training models
on better data, fine-tuning, or using current models in various few-shot frameworks
with better prompting?
• Since several codes contain multiple vulnerabilities, this dataset is ideal for bench-
marking and testing various vulnerability detection tools.
• As our motivation section showcased, GPT-4o-mini did not excel at avoiding
and fixing the vulnerability in the example. How do different LLMs compare in
understanding, correctly fixing, and detecting coding errors?
• We aim further to grow the FormAI dataset, including more state-of-the-art models,
and increase the number of samples for each LLM to have an overall larger dataset.
• How do different programming Tasks or Styles impact vulnerable coding patterns?
Are there tasks that LLMs consistently mess up?
While we can partially address the last question, noting the use of insecure functions
and poor input sanitization when handling user inputs, exploring this issue across
various domains, such as networking or cryptography, would be beneficial.
8 Conclusions
This research analyzed nine state-of-the-art Large Language Models to assess their
likelihood of introducing vulnerabilities during neutral prompt-based code genera-
tion, and to compare their performance. The models included in our analysis were
Mistral-7B, Falcon-180B, Falcon2-11B, GPT-4o-mini, Llama2-13B, CodeLlama-13B,
Gemma-7B, GPT-3.5-turbo, and Gemini-Pro. We employed a zero-shot prompt-
ing method to encompass numerous programming scenarios for C code generation.
These programs constitute the FormAI-v2 dataset, containing 331 000 independent
compilable C programs.
We used the Efficient SMT-based Context-Bounded Model Checker (ESBMC), a state-of-
the-art formal verification tool, to identify vulnerabilities. Each program was given
a verification period of 500 seconds with the unwinding parameter set to infinite,
uncovering a total of 765 366 vulnerabilities. Overall 62.07% of the codes were vul-
nerable. Detailed labeling of each sample—including filename, type of vulnerability,
function name, error type, and source code—is documented in a .json file, as detailed
in Appendix Fig. 1, to facilitate the dataset’s use in machine learning applications.
Additionally, the FormAI-v2 dataset proved instrumental for fuzzing various appli-
cations and identifying multiple bugs in ESBMC and CBMC. These findings provide
clear answers to our research questions:
• RQ1: How does the security of LLM-generated code differ across various
models?
• Answer: CodeLlama-13B, Llama2-13B, and GPT-4o-mini perform slightly
better, but all examined models consistently introduce vulnerabilities into the
C code they generate at unacceptable rates. Our research revealed that all
examined models introduced vulnerabilities in at least 50% of the generated
code.
While the literature reveals significant variation in these models' ability to solve
tasks, this is not mirrored in their propensity to introduce vulnerabilities into source
code. Our findings show that, despite differences among the examined models in
terms of code generation, they all consistently introduce severe vulnerabilities when
prompted with simple coding tasks. Thus, although Large Language Models have
impressive code-generation capabilities, employing their output in production requires
a detailed risk assessment, and relying on these models without expert oversight in a
production context is inadvisable.
Acknowledgement
We extend our sincere thanks to the anonymous reviewers for their valuable feedback,
which has significantly improved the quality of this paper. This research is supported
by the Technology Innovation Institute (TII), Abu Dhabi. Additionally, partial sup-
port is provided by the EPSRC grant EP/T026995/1, titled “EnnCore: End-to-End
Conceptual Guarding of Neural Architectures” under the Security for All in an AI-
enabled Society initiative. This work is also partially supported by the TKP2021-NVA
Funding Scheme under Project TKP2021-NVA-29.
Conflicts of interest
The authors have no competing interests to declare that are relevant to the content
of this article.
References
[1] Wang, J., Huang, Y., Chen, C., Liu, Z., Wang, S., Wang, Q.: Software testing
with large language models: Survey, landscape, and vision. IEEE Transactions
on Software Engineering (2024)
[2] Xu, F.F., Alon, U., Neubig, G., Hellendoorn, V.J.: A systematic evaluation
of large language models of code. In: Proceedings of the 6th ACM SIGPLAN
International Symposium on Machine Programming, pp. 1–10 (2022)
[3] Jain, N., Vaidyanath, S., Iyer, A., Natarajan, N., Parthasarathy, S., Rajamani,
S., Sharma, R.: Jigsaw: Large language models meet program synthesis. In:
Proceedings of the 44th International Conference on Software Engineering, pp.
1219–1231 (2022)
[4] Bui, N.D.Q., Le, H., Wang, Y., Li, J., Gotmare, A.D., Hoi, S.C.H.: CodeTF:
One-stop Transformer Library for State-of-the-art Code LLM. arXiv (2023).
http://arxiv.org/abs/2306.00029 Accessed 2023-06-22
[5] Ross, S.I., Martinez, F., Houde, S., Muller, M., Weisz, J.D.: The Programmer’s
Assistant: Conversational Interaction with a Large Language Model for Software
Development. In: Proceedings of the 28th International Conference on Intelligent
User Interfaces. IUI ’23, pp. 491–514. Association for Computing Machinery,
New York, NY, USA (2023). https://dl.acm.org/doi/10.1145/3581641.3584037
Accessed 2023-06-22
[6] Chavez, M.R., Butler, T.S., Rekawek, P., Heo, H., Kinzler, W.L.: Chat Genera-
tive Pre-trained Transformer: why we should embrace this technology. American
Journal of Obstetrics and Gynecology 228(6), 706–711 (2023) https://doi.org/
10.1016/j.ajog.2023.03.010 . Accessed 2023-06-22
[7] Charalambous, Y., Tihanyi, N., Jain, R., Sun, Y., Ferrag, M.A., Cordeiro, L.C.:
A New Era in Software Security: Towards Self-Healing Software via Large Lan-
guage Models and Formal Verification. arXiv (2023). http://arxiv.org/abs/2305.
14752 Accessed 2023-05-31
[8] Perry, N., Srivastava, M., Kumar, D., Boneh, D.: Do users write more inse-
cure code with ai assistants? In: Proceedings of the 2023 ACM SIGSAC
Conference on Computer and Communications Security. CCS ’23, pp. 2785–
2799. Association for Computing Machinery, New York, NY, USA (2023).
https://doi.org/10.1145/3576915.3623157
[9] Tihanyi, N., Bisztray, T., Jain, R., Ferrag, M.A., Cordeiro, L.C., Mavroeidis,
V.: The formai dataset: Generative ai in software security through the lens
of formal verification. In: Proceedings of the 19th International Conference on
Predictive Models and Data Analytics in Software Engineering. PROMISE 2023,
pp. 33–43. Association for Computing Machinery, New York, NY, USA (2023).
https://doi.org/10.1145/3617555.3617874
[10] Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R.,
Schalkwyk, J., Dai, A.M., Hauth, A., et al.: Gemini: a family of highly capable
multimodal models. arXiv preprint arXiv:2312.11805 (2023)
[11] Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah,
M., Goffinet, É., Hesslow, D., Launay, J., Malartic, Q., et al.: The falcon series
of open language models. arXiv preprint arXiv:2311.16867 (2023)
[12] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y.,
Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for
code. arXiv preprint arXiv:2308.12950 (2023)
[13] Gadelha, M.Y.R., Ismail, H.I., Cordeiro, L.C.: Handling loops in bounded model
checking of C programs via k-induction. Int. J. Softw. Tools Technol. Transf.
19(1), 97–114 (2017) https://doi.org/10.1007/s10009-015-0407-9
[14] Gadelha, M.R., Monteiro, F.R., Morse, J., Cordeiro, L.C., Fischer, B., Nicole,
D.A.: Esbmc 5.0: an industrial-strength c model checker. In: Proceedings of the
33rd ACM/IEEE International Conference on Automated Software Engineering,
pp. 888–891. ACM, Montpellier, France (2018)
[15] Gadelha, M.Y.R., Monteiro, F.R., Cordeiro, L.C., Nicole, D.A.: ESBMC v6.0:
Verifying C programs using k-induction and invariant inference - (competition
contribution). In: Beyer, D., Huisman, M., Kordon, F., Steffen, B. (eds.) Tools
and Algorithms for the Construction and Analysis of Systems (TACAS). LNCS,
vol. 11429, pp. 209–213 (2019). Springer
[16] Menezes, R.S., Aldughaim, M., Farias, B., Li, X., Manino, E., Shmarov, F.,
Song, K., Brauße, F., Gadelha, M.R., Tihanyi, N., Korovin, K., Cordeiro, L.C.:
ESBMC v7.4: Harnessing the power of intervals - (competition contribution). In:
Tools and Algorithms for the Construction and Analysis of Systems (TACAS).
LNCS, vol. 14572, pp. 376–380 (2024). Springer
[18] Cordy, J.R., Roy, C.K.: The nicad clone detector. 2011 IEEE 19th International
Conference on Program Comprehension, 219–220 (2011)
[19] Chen, M., Tworek, J., Jun, H., Yuan, Q., Oliveira Pinto, H.P., Kaplan, J.,
Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger,
G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder,
N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such,
F.P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A.,
Guss, W.H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji,
S., Jain, S., Saunders, W., Hesse, C., Carr, A.N., Leike, J., Achiam, J., Misra, V.,
Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K.,
Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., Zaremba,
W.: Evaluating large language models trained on code (2021) arXiv:2107.03374
[cs.LG]
[22] Cordeiro, L.C., Lima Filho, E.B., Bessa, I.V.: Survey on automated symbolic
verification and its application for synthesising cyber-physical systems. IET
Cyber-Phys. Syst.: Theory & Appl. 5(1), 1–24 (2020) https://doi.org/10.1049/
IET-CPS.2018.5006
[23] Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., Lubana,
E.S., Jenner, E., Casper, S., Sourbut, O., et al.: Foundational challenges
in assuring alignment and safety of large language models. arXiv preprint
arXiv:2404.09932 (2024)
[24] Kirova, V.D., Ku, C.S., Laracy, J.R., Marlowe, T.J.: Software engineering edu-
cation must adapt and evolve for an llm environment. In: Proceedings of the
55th ACM Technical Symposium on Computer Science Education V. 1. SIGCSE
2024, pp. 666–672. Association for Computing Machinery, New York, NY, USA
(2024). https://doi.org/10.1145/3626252.3630927
[25] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang,
E., Cai, C., Terry, M., Le, Q., et al.: Program synthesis with large language
models (2021)
[26] Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., Clement, C.,
Drain, D., Jiang, D., Tang, D., et al.: Codexglue: A machine learning benchmark
dataset for code understanding and generation. arXiv preprint arXiv:2102.04664
(2021)
[27] White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A.,
Spencer-Smith, J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt
engineering with chatgpt. arXiv preprint arXiv:2302.11382 (2023)
[28] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.:
Tree of thoughts: Deliberate problem solving with large language models. In:
Advances in Neural Information Processing Systems, vol. 36 (2024)
[29] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou,
D.: Chain-of-thought prompting elicits reasoning in large language models.
Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
[30] Tihanyi, N., Bisztray, T., Dubniczky, R.A., Toth, R., Borsos, B., Cherif, B.,
Ferrag, M.A., Muzsai, L., Jain, R., Marinelli, R., et al.: Dynamic intelligence
assessment: Benchmarking llms on the road to agi with a focus on model
confidence. arXiv preprint arXiv:2410.15490 (2024)
[31] Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., Farajtabar,
M.: Gsm-symbolic: Understanding the limitations of mathematical reasoning in
large language models. arXiv preprint arXiv:2410.05229 (2024)
[32] Honarvar, S., Wilk, M., Donaldson, A.: Turbulence: Systematically and auto-
matically testing instruction-tuned large language models for code. arXiv
preprint arXiv:2312.14856 (2023)
[33] Wang, S., Long, Z., Fan, Z., Wei, Z., Huang, X.: Benchmark self-
evolving: A multi-agent framework for dynamic llm evaluation. arXiv preprint
arXiv:2402.11443 (2024)
[34] Liang, X., Song, S., Zheng, Z., Wang, H., Yu, Q., Li, X., Li, R.-H., Xiong, F., Li,
Z.: Internal consistency and self-feedback in large language models: A survey.
arXiv preprint arXiv:2407.14507 (2024)
[35] Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X.,
Wu, Y., Li, Y., et al.: Deepseek-coder: When the large language model meets
programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196
(2024)
[36] Wang, H., Liu, Z., Wang, S., Cui, G., Ding, N., Liu, Z., Yu, G.: Intervenor:
Prompt the coding ability of large language models with the interactive chain
of repairing. arXiv preprint arXiv:2311.09868 (2023)
[37] Huang, D., Bu, Q., Zhang, J.M., Luck, M., Cui, H.: Agentcoder: Multi-agent-
based code generation with iterative testing and optimisation. arXiv preprint
arXiv:2312.13010 (2023)
[38] Muennighoff, N., Liu, Q., Zebaze, A., Zheng, Q., Hui, B., Zhuo, T.Y., Singh, S.,
Tang, X., Von Werra, L., Longpre, S.: Octopack: Instruction tuning code large
language models. arXiv preprint arXiv:2308.07124 (2023)
[39] Lin, F., Kim, D.J., et al.: When llm-based code generation meets the software
development process. arXiv preprint arXiv:2403.15852 (2024)
[40] Khoury, R., Avila, A.R., Brunelle, J., Camara, B.M.: How secure is code gener-
ated by chatgpt? In: 2023 IEEE International Conference on Systems, Man, and
Cybernetics (SMC), pp. 2445–2451 (2023). https://doi.org/10.1109/SMC53992.
2023.10394237
[41] Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., Karri, R.: Asleep at the
keyboard? assessing the security of github copilot’s code contributions. In: 2022
IEEE Symposium on Security and Privacy (SP), pp. 754–768. IEEE (2022)
[42] Ma, W., Liu, S., Wang, W., Hu, Q., Liu, Y., Zhang, C., Nie, L., Liu, Y.: The
Scope of ChatGPT in Software Engineering: A Thorough Investigation. arXiv
(2023). http://arxiv.org/abs/2305.12138 Accessed 2023-06-10
[43] Imani, S., Du, L., Shrivastava, H.: Mathprompter: Mathematical reasoning using
large language models (2023). https://doi.org/10.48550/arXiv.2303.05398
[44] Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy,
J., Wang, H.: Large language models for software engineering: A systematic
literature review. ACM Transactions on Software Engineering and Methodology
(2023)
[45] Chan, A., Kharkar, A., Moghaddam, R.Z., Mohylevskyy, Y., Helyar, A., Kamal,
E., Elkamhawy, M., Sundaresan, N.: Transformer-based vulnerability detec-
tion in code at edittime: Zero-shot, few-shot, or fine-tuning? arXiv preprint
arXiv:2306.01754 (2023)
[46] Nguyen, V., Yuan, X., Wu, T., Nepal, S., Grobler, M., Rudolph, C.: Deep
learning-based out-of-distribution source code data identification: How far we
have gone? arXiv preprint arXiv:2404.05964 (2024)
[47] Gao, Z., Wang, H., Zhou, Y., Zhu, W., Zhang, C.: How far have we
gone in vulnerability detection using large language models. arXiv preprint
arXiv:2311.12420 (2023)
[48] Gao, S., Mao, W., Gao, C., Li, L., Hu, X., Xia, X., Lyu, M.R.: Learning in the
wild: Towards leveraging unlabeled data for effectively tuning pre-trained code
models. In: Proceedings of the IEEE/ACM 46th International Conference on
Software Engineering, pp. 1–13 (2024)
[49] Grishina, A., Hort, M., Moonen, L.: The earlybird catches the bug: On exploiting
early layers of encoder models for more efficient code classification. In: Proceed-
ings of the 31st ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering, pp. 895–907 (2023)
[50] Khare, A., Dutta, S., Li, Z., Solko-Breslin, A., Alur, R., Naik, M.: Understanding
the effectiveness of large language models in detecting security vulnerabilities.
arXiv preprint arXiv:2311.16169 (2023)
[51] Noever, D.: Can large language models find and fix vulnerable software? arXiv
preprint arXiv:2308.10345 (2023)
[52] Shestov, A., Cheshkov, A., Levichev, R., Mussabayev, R., Zadorozhny, P.,
Maslov, E., Vadim, C., Bulychev, E.: Finetuning large language models for
vulnerability detection. arXiv preprint arXiv:2401.17010 (2024)
[53] Steenhoek, B., Gao, H., Le, W.: Dataflow analysis-inspired deep learn-
ing for efficient vulnerability detection. In: Proceedings of the IEEE/ACM
46th International Conference on Software Engineering. ICSE ’24.
Association for Computing Machinery, New York, NY, USA (2024).
https://doi.org/10.1145/3597503.3623345
[54] Sun, Y., Wu, D., Xue, Y., Liu, H., Ma, W., Zhang, L., Shi, M., Liu, Y.: Llm4vuln:
A unified evaluation framework for decoupling and enhancing llms’ vulnerability
reasoning. arXiv preprint arXiv:2401.16185 (2024)
[55] Tang, W., Tang, M., Ban, M., Zhao, Z., Feng, M.: Csgvd: A deep learning
approach combining sequence and graph embedding for source code vulnerabil-
ity detection. J. Syst. Softw. 199(C) (2023) https://doi.org/10.1016/j.jss.2023.
111623
[56] Thapa, C., Jang, S.I., Ahmed, M.E., Camtepe, S., Pieprzyk, J., Nepal, S.:
Transformer-based language models for software vulnerability detection. In:
Proceedings of the 38th Annual Computer Security Applications Conference.
ACSAC ’22, pp. 481–496. Association for Computing Machinery, New York,
NY, USA (2022). https://doi.org/10.1145/3564625.3567985
[57] Zhang, C., Liu, H., Zeng, J., Yang, K., Li, Y., Li, H.: Prompt-enhanced software
vulnerability detection using chatgpt. arXiv preprint arXiv:2308.12697 (2023)
[58] Tóth, R., Bisztray, T., Erdődi, L.: Llms in web development: Evaluating llm-
generated php code unveiling vulnerabilities and limitations. In: Computer
Safety, Reliability, and Security. SAFECOMP 2024 Workshops, pp. 425–437.
Springer, Cham (2024)
[59] Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., Anderson, R.:
The Curse of Recursion: Training on Generated Data Makes Models Forget.
arXiv (2023). http://arxiv.org/abs/2305.17493 Accessed 2023-06-27
[60] Chen, Y., Ding, Z., Alowain, L., Chen, X., Wagner, D.: DiverseVul: A
New Vulnerable Source Code Dataset for Deep Learning Based Vulner-
ability Detection. In: Proceedings of the 26th International Symposium
on Research in Attacks, Intrusions and Defenses. RAID ’23, pp. 654–
668. Association for Computing Machinery, New York, NY, USA (2023).
https://doi.org/10.1145/3607199.3607242
[61] Fan, J., Li, Y., Wang, S., Nguyen, T.N.: A C/C++ Code Vulnerability
Dataset with Code Changes and CVE Summaries. In: Proceedings of the
17th International Conference on Mining Software Repositories. MSR ’20, pp.
508–512. Association for Computing Machinery, New York, NY, USA (2020).
https://doi.org/10.1145/3379597.3387501 Accessed 2023-06-27
[62] Russell, R.L., Kim, L.Y., Hamilton, L.H., Lazovich, T., Harer, J.A., Ozdemir,
O., Ellingwood, P.M., McConley, M.W.: Automated Vulnerability Detection
in Source Code Using Deep Representation Learning. In: 2018 17th IEEE
International Conference on Machine Learning and Applications (ICMLA), pp.
757–762. IEEE, Orlando, FL, USA (2018). https://doi.org/10.1109/ICMLA.
2018.00120
[63] Kim, L., Russell, R.: Draper VDISC Dataset - Vulnerability Detection in Source
Code. Publisher: OSF (2018). https://osf.io/d45bw/ Accessed 2023-06-27
[65] Boland Jr., F.E., Black, P.E.: The Juliet 1.1 C/C++ and Java Test Suite. NIST
45(10), 88–90 (2012). Accessed 2023-05-28
[66] Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y.: Devign: Effective Vulnerability
Identification by Learning Comprehensive Program Semantics via Graph Neural
Networks, pp. 10197–10207. Curran Associates Inc., Red Hook, NY, USA (2019)
[67] Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep Learning Based Vulnera-
bility Detection: Are We There Yet? IEEE Transactions on Software Engineering
48(9), 3280–3296 (2022) https://doi.org/10.1109/TSE.2021.3087402
[68] Jain, R., Gervasoni, N., Ndhlovu, M., Rawat, S.: A code centric evaluation of
c/c++ vulnerability datasets for deep learning based vulnerability detection
techniques. In: Proceedings of the 16th Innovations in Software Engineering
Conference, pp. 1–10. ACM, Prayagraj, India (2023)
[69] Marjamäki, D.: Cppcheck: A Tool for Static Analysis of C/C++ Code. https://
cppcheck.sourceforge.io/ (Accessed: 12 September 2024) (2024)
[70] Cordeiro, L., Fischer, B., Marques-Silva, J.: SMT-Based Bounded Model
Checking for Embedded ANSI-C Software. IEEE Transactions on Software
Engineering 38(4), 957–974 (2012) https://doi.org/10.1109/TSE.2011.59
[71] D’Silva, V., Kroening, D., Weissenbacher, G.: A Survey of Automated Tech-
niques for Formal Software Verification. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems 27(7), 1165–1178 (2008) https:
//doi.org/10.1109/TCAD.2008.923410
[72] Morse, J., Cordeiro, L.C., Nicole, D.A., Fischer, B.: Context-bounded model
checking of LTL properties for ANSI-C software. In: Barthe, G., Pardo, A.,
Schneider, G. (eds.) Software Engineering and Formal Methods - 9th Interna-
tional Conference, SEFM 2011, Montevideo, Uruguay, November 14-18, 2011.
Proceedings. Lecture Notes in Computer Science, vol. 7041, pp. 302–317 (2011).
Springer
[73] Wallace, D.R., Fujii, R.U.: Software verification and validation: an overview.
IEEE Software 6(3), 10–17 (1989) https://doi.org/10.1109/52.28119 . Accessed
2023-06-22
[74] Alshmrany, K.M., Aldughaim, M., Bhayat, A., Cordeiro, L.C.: Fusebmc: An
energy-efficient test generator for finding security vulnerabilities in C programs.
In: Loulergue, F., Wotawa, F. (eds.) Tests and Proofs - 15th International Con-
ference, TAP 2021, Held as Part of STAF 2021, Virtual Event, June 21-22, 2021,
Proceedings. Lecture Notes in Computer Science, vol. 12740, pp. 85–105 (2021).
Springer
[76] Hao, Y., Chen, W., Zhou, Z., Cui, W.: E&v: Prompting large language models to
perform static analysis by pseudo-code execution and verification. arXiv preprint
arXiv:2312.08477 (2023)
[77] Yang, A.Z., Le Goues, C., Martins, R., Hellendoorn, V.: Large language mod-
els for test-free fault localization. In: Proceedings of the 46th IEEE/ACM
International Conference on Software Engineering, pp. 1–12 (2024)
[78] Quan, V.L.A., Phat, C.T., Van Nguyen, K., Duy, P.T., Pham, V.-H.: Xgv-bert:
Leveraging contextualized language model and graph neural network for efficient
software vulnerability detection. arXiv preprint arXiv:2309.14677 (2023)
[79] Sun, T., Allix, K., Kim, K., Zhou, X., Kim, D., Lo, D., Bissyandé, T.F., Klein,
J.: Dexbert: Effective, task-agnostic and fine-grained representation learning of
android bytecode. IEEE Transactions on Software Engineering 49(10), 4691–
4706 (2023) https://doi.org/10.1109/TSE.2023.3310874
[80] Tian, H., Liu, K., Li, Y., Kaboré, A.K., Koyuncu, A., Habib, A., Li, L., Wen, J.,
Klein, J., Bissyandé, T.F.: The best of both worlds: Combining learned embed-
dings with engineered features for accurate prediction of correct patches. ACM
Trans. Softw. Eng. Methodol. 32(4) (2023) https://doi.org/10.1145/3576039
[81] Wang, W., Wang, Y., Joty, S., Hoi, S.C.H.: Rap-gen: Retrieval-augmented
patch generation with codet5 for automatic program repair. In: Proceedings
of the 31st ACM Joint European Software Engineering Conference and Sym-
posium on the Foundations of Software Engineering. ESEC/FSE 2023, pp.
146–158. Association for Computing Machinery, New York, NY, USA (2023).
https://doi.org/10.1145/3611643.3616256
[82] Zhang, Y., Jin, Z., Xing, Y., Li, G.: Steam: simulating the interactive behavior of
programmers for automatic bug fixing. arXiv preprint arXiv:2308.14460 (2023)
[83] Wu, Y., Li, Z., Zhang, J.M., Papadakis, M., Harman, M., Liu, Y.: Large language
models in fault localisation. arXiv preprint arXiv:2308.15276 (2023)
[84] Mohajer, M.M., Aleithan, R., Harzevili, N.S., Wei, M., Belle, A.B., Pham,
H.V., Wang, S.: Skipanalyzer: An embodied agent for code analysis with large
language models. arXiv preprint arXiv:2310.18532 (2023)
[85] Li, T.-O., Zong, W., Wang, Y., Tian, H., Wang, Y., Cheung, S.-C.: Finding
Failure-Inducing Test Cases with ChatGPT (2023)
[86] Pearce, H., Tan, B., Ahmad, B., Karri, R., Dolan-Gavitt, B.: Examining zero-
shot vulnerability repair with large language models. In: 2023 IEEE Symposium
on Security and Privacy (SP), pp. 2339–2356. IEEE (2023)
[87] Cao, J., Li, M., Wen, M., Cheung, S.-c.: A study on prompt design, advantages
and limitations of chatgpt for deep learning program repair. arXiv preprint
arXiv:2304.08191 (2023)
[88] Deligiannis, P., Lal, A., Mehrotra, N., Rastogi, A.: Fixing rust compilation errors
using llms. arXiv preprint arXiv:2308.05177 (2023)
[89] Fan, Z., Gao, X., Mirchev, M., Roychoudhury, A., Tan, S.H.: Automated repair
of programs from large language models. In: 2023 IEEE/ACM 45th International
Conference on Software Engineering (ICSE), pp. 1469–1481 (2023). IEEE
[90] Huang, Q., Zhu, J., Xing, Z., Jin, H., Wang, C., Xu, X.: A chain of ai-based solu-
tions for resolving fqns and fixing syntax errors in partial code. arXiv preprint
arXiv:2306.11981 (2023)
[91] Islam, N.T., Najafirad, P.: Code security vulnerability repair using reinforcement
learning with large language models. arXiv preprint arXiv:2401.07031 (2024)
[92] Jin, M., Shahriar, S., Tufano, M., Shi, X., Lu, S., Sundaresan, N., Svyatkovskiy,
A.: Inferfix: End-to-end program repair with llms. In: Proceedings of the 31st
ACM Joint European Software Engineering Conference and Symposium on the
Foundations of Software Engineering, pp. 1646–1656 (2023)
[93] Lajkó, M., Csuvik, V., Vidács, L.: Towards javascript program repair with gener-
ative pre-trained transformer (gpt-2). In: Proceedings of the Third International
Workshop on Automated Program Repair, pp. 61–68. IEEE (2022)
[94] Paul, R., Mohib Hossain, M., Hasan, M., Iqbal, A.: Automated program repair
based on code review: How do pre-trained transformer models perform? arXiv
e-prints, 2304 (2023)
[95] Peng, Y., Gao, S., Gao, C., Huo, Y., Lyu, M.: Domain knowledge matters:
Improving prompts with fix templates for repairing python type errors. In:
Proceedings of the IEEE/ACM 46th International Conference on Software Engi-
neering. ICSE ’24. Association for Computing Machinery, New York, NY, USA
(2024). https://doi.org/10.1145/3597503.3608132
[96] Tian, H., Liu, K., Kaboré, A.K., Koyuncu, A., Li, L., Klein, J., Bissyandé,
T.F.: Evaluating representation learning of code changes for predicting patch
correctness in program repair. In: Proceedings of the 35th IEEE/ACM Inter-
national Conference on Automated Software Engineering. ASE ’20, pp. 981–
992. Association for Computing Machinery, New York, NY, USA (2021).
https://doi.org/10.1145/3324884.3416532
[97] Wei, Y., Xia, C.S., Zhang, L.: Copiloting the copilots: Fusing large language
models with completion engines for automated program repair. In: Proceed-
ings of the 31st ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering. ESEC/FSE 2023, pp.
172–184. Association for Computing Machinery, New York, NY, USA (2023).
https://doi.org/10.1145/3611643.3616271
[98] Widjojo, P., Treude, C.: Addressing compiler errors: Stack overflow or large
language models? arXiv preprint arXiv:2307.10793 (2023)
[99] Xia, C.S., Wei, Y., Zhang, L.: Practical program repair in the era of large pre-
trained language models. arXiv preprint arXiv:2210.14179 (2022)
[100] Xia, C.S., Zhang, L.: Keep the conversation going: Fixing 162 out of 337 bugs
for $0.42 each using chatgpt. arXiv preprint arXiv:2304.00385 (2023)
[101] Zhang, Q., Fang, C., Sun, W., Liu, Y., He, T., Hao, X., Chen, Z.: Appt:
Boosting automated patch correctness prediction via fine-tuning pre-trained
models. IEEE Transactions on Software Engineering 50(3), 474–494 (2024)
https://doi.org/10.1109/TSE.2024.3354969
[102] Zhang, Q., Fang, C., Zhang, T., Yu, B., Sun, W., Chen, Z.: Gamma: Revis-
iting template-based automated program repair via mask prediction. In: 2023
38th IEEE/ACM International Conference on Automated Software Engineering
(ASE), pp. 535–547 (2023). IEEE
[103] Zhang, Y., Li, G., Jin, Z., Xing, Y.: Neural program repair with program depen-
dence analysis and effective filter mechanism. arXiv preprint arXiv:2305.09315
(2023)
[104] Wu, Y., Jiang, N., Pham, H.V., Lutellier, T., Davis, J., Tan, L., Babkin, P.,
Shah, S.: How effective are neural networks for fixing security vulnerabilities. In:
Proceedings of the 32nd ACM SIGSOFT International Symposium on Software
Testing and Analysis, pp. 1282–1294 (2023)
[105] Gadelha, M.Y.R., Steffinlongo, E., Cordeiro, L.C., Fischer, B., Nicole, D.A.:
Smt-based refutation of spurious bug reports in the clang static analyzer. In:
Atlee, J.M., Bultan, T., Whittle, J. (eds.) Proceedings of the 41st International
Conference on Software Engineering, pp. 11–14. IEEE / ACM, Montreal, QC,
Canada (2019). https://doi.org/10.1109/ICSE-Companion.2019.00026
[106] Sadowski, C., Yi, J.: How developers use data race detection tools. In: Pro-
ceedings of the 5th Workshop on Evaluation and Usability of Programming
Languages and Tools, pp. 43–51. ACM, Portland, USA (2014)
[107] White, M., Tufano, M., Vendome, C., Poshyvanyk, D.: Deep learning code
fragments for code clone detection. In: Proceedings of the 31st IEEE/ACM Inter-
national Conference on Automated Software Engineering, pp. 87–98. Association
for Computing Machinery, New York, USA (2016)
[108] Zhao, G., Huang, J.: Deepsim: deep learning code functional similarity. In: Pro-
ceedings of the 2018 26th ACM Joint Meeting on European Software Engineering
Conference and Symposium on the Foundations of Software Engineering, pp.
141–151. ACM, Lake Buena Vista, USA (2018)
[109] Cordeiro, L.C., Kroening, D., Schrammel, P.: JBMC: bounded model checking
for java bytecode - (competition contribution). In: Tools and Algorithms for the
Construction and Analysis of Systems (TACAS). LNCS, vol. 11429, pp. 219–223
(2019). Springer
[110] Menezes, R., Moura, D., Cavalcante, H., Freitas, R., Cordeiro, L.C.: Esbmc-
jimple: verifying kotlin programs via jimple intermediate representation. In:
Ryu, S., Smaragdakis, Y. (eds.) ISSTA ’22: 31st ACM SIGSOFT International
Symposium on Software Testing and Analysis, Virtual Event, South Korea, July
18 - 22, 2022, pp. 777–780 (2022). ACM
[111] Gadelha, M.R., Monteiro, F.R., Morse, J., Cordeiro, L.C., Fischer, B., Nicole,
D.A.: Esbmc 5.0: an industrial-strength c model checker. In: Proceedings of the
33rd ACM/IEEE International Conference on Automated Software Engineering.
ASE ’18, pp. 888–891. Association for Computing Machinery, New York, NY,
USA (2018). https://doi.org/10.1145/3238147.3240481
[112] Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D.: Compilers: Principles, Tech-
niques, And Tools, 2nd edn. Addison-Wesley Longman Publishing Co., Inc.,
Boston, MA (2006)
[113] Beyer, D.: Competition on software verification and witness validation: Sv-comp
2023. In: Sankaranarayanan, S., Sharygina, N. (eds.) Tools and Algorithms for
the Construction and Analysis of Systems, pp. 495–522. Springer, Cham (2023)
[114] Sandoval, G., Pearce, H., Nys, T., Karri, R., Garg, S., Dolan-Gavitt, B.: Lost
at c: A user study on the security implications of large language model code
assistants. In: 32nd USENIX Security Symposium (USENIX Security 23), pp.
2205–2222 (2023). USENIX Association
[116] Kroening, D., Tautschnig, M.: Cbmc–c bounded model checker: (competition
contribution). In: Tools and Algorithms for the Construction and Analysis of
Systems: TACAS 2014, pp. 389–391. Springer, Grenoble, France (2014)
Appendix
Fig. 1: Example JSON Labels for a GPT-3.5-turbo Generated Sample: FormAI-v2
dataset