Cybersecurity Risks of AI-Generated Code
Authors
Jessica Ji
Jenny Jun
Maggie Wu
Rebecca Gelles
November 2024
Executive Summary
Recent developments have improved the ability of large language models (LLMs) and
other AI systems to generate computer code. While this is promising for the field of
software development, these models can also pose direct and indirect cybersecurity
risks. In this paper, we identify three broad categories of risk associated with AI code
generation models: 1) models generating insecure code, 2) models themselves being
vulnerable to attack and manipulation, and 3) downstream cybersecurity impacts such
as feedback loops in training future AI systems.
Existing research has shown that, under experimental conditions, AI code generation
models frequently output insecure code. However, the process of evaluating the
security of AI-generated code is highly complex and contains many interdependent
variables. To further explore the risk of insecure AI-written code, we evaluated
generated code from five LLMs. Each model was given the same set of prompts, which
were designed to test likely scenarios where buggy or insecure code might be
produced. Our evaluation results show that almost half of the code snippets produced
by these five different models contain bugs that are often impactful and could
potentially lead to malicious exploitation. These results are limited to the narrow scope
of our evaluation, but we hope they can contribute to the larger body of research
surrounding the impacts of AI code generation models.
Given both code generation models’ current utility and the likelihood that their
capabilities will continue to improve, it is important to manage their policy and
cybersecurity implications. Key findings include the following:
● Code generation models also need to be evaluated for security, but it is currently
difficult to do so. Evaluation benchmarks for code generation models often focus
on the models’ ability to produce functional code but do not assess their ability to generate secure code, which may incentivize prioritizing functionality over security during model training. There is also too little transparency around models’ training data, and too little understanding of their internal workings, to explore questions such as whether better-performing models produce more insecure code.
These models and associated tools are being adopted rapidly by the software
developer community and individual users. According to GitHub’s June 2023 survey,
92% of surveyed U.S.-based developers report using AI coding tools in and out of
work.1 Another industry survey from November 2023 similarly reported a high usage
rate, with 96% of surveyed developers using AI coding tools and more than half of
respondents using the tools most of the time.2 If this trend continues, LLM-generated
code will become an integral part of the software supply chain.
As language models have gotten larger and more advanced over the past few years,
their code generation capabilities have improved in step with their natural language-
generation capabilities.4 Coding languages are, after all, intentionally designed to
encode and convey information, and have their own rules and syntactical expectations
much like human languages. Researchers in the field of natural language processing
(NLP) have been interested in translating between natural language and computer code
for many years, but the simultaneous introduction of transformer-based language
model architectures and large datasets containing code led to a rapid improvement in
code generation capabilities beginning around 2018–2019. As new models were
released, researchers also began exploring ways to make them more accessible. In mid-
2021, for example, OpenAI released the first version of Codex, a specialized language model fine-tuned for code generation that went on to power GitHub Copilot.
Research interest in AI code generation has consistently increased in the past decade,
especially experiencing a surge in the past year following the release of high-
performing foundation models such as GPT-4 and open-source models such as Llama
2. Figure 1 illustrates the trend by counting the number of research papers on code
generation by year from 2012–2023. The number of research papers on code generation has grown markedly over that period.
Code generation presents one of the most compelling and widely adopted use cases for
large language models. In addition to claims from organizations such as Microsoft that
their AI coding tool GitHub Copilot had 1.8 million paid subscribers as of spring 2024,
up from more than a million in mid-2023,11 software companies are also adopting and developing internal code generation tools for their own developers.
* This graph counts the number of papers in CSET’s Merged Academic Corpus that contain the
keywords “code generation,” “AI-assisted programming,” “AI code assistant,” “code generating
LLM,” or “code LLM” and are also classified as AI- or cybersecurity-related using CSET’s AI classifier
and cybersecurity classifier. Note that at the time of writing in February 2024, CSET’s Merged
Academic Corpus did not yet include all papers from 2023 due to upstream collection lags, which
may have resulted in an undercounting of papers in 2023. The corpus currently includes data from
Clarivate’s Web of Science, The Lens, arXiv, Papers with Code, Semantic Scholar, and OpenAlex.
More information regarding our methodology for compiling the Merged Academic Corpus as well as
background on our classifiers and a detailed citation of data sources are available here:
https://eto.tech/dataset-docs/mac/; https://cset.georgetown.edu/publication/identifying-ai-research/.
Productivity is often cited as one of the key reasons individuals and organizations have
adopted AI code generation tools. Metrics for measuring how much developer
productivity improves by leveraging AI code generation tools vary by study. A small
GitHub study used both self-perceived productivity and task completion time as
productivity metrics, but the authors acknowledged that there is little consensus about
what metrics to use or how productivity relates to developer well-being.13 A McKinsey
study using similar metrics claimed that software developers using generative AI tools
could complete coding tasks up to twice as fast as those without them, but that these
benefits varied depending on task complexity and developer experience.14 Companies
have also run internal productivity studies with their employees. A Meta study on their
internal code generation model CodeCompose used metrics such as code acceptance
rate and qualitative developer feedback to measure productivity, finding that 20% of
users stated that CodeCompose helped them write code more quickly, while a Google
study found a 6% reduction in coding iteration time when using an internal code
completion model as compared to a control group.15 More recently, a September 2024
study analyzing data from randomized control trials across three different organizations
found a 26% increase in the number of completed tasks among developers using
GitHub Copilot as opposed to developers who were not given access to the tool.16 Most
studies are in agreement that code generation tools improve developer productivity in
general, regardless of the exact metrics used.
Broadly speaking, evidence suggests that code generation tools have benefits at both
the individual and organizational levels, and these benefits are likely to increase over
time as model capabilities improve. There are also plenty of incentives, such as ease of
access and purported productivity gains, for organizations to adopt—or at least
experiment with—AI code generation for software development.
This technological breakthrough, however, must also be met with caution. Increasing
usage of code generation models in routine software development processes means
that these models will soon be an important part of the software supply chain. Ensuring
that their outputs are secure—or that any insecure outputs they produce are identified
and corrected before code enters production—will also be increasingly important for
cybersecurity. However, code generation models are seldom trained against security benchmarks; instead, they are typically trained to meet functionality benchmarks such as HumanEval, a set of 164 human-written programming problems intended to evaluate models’ ability to write Python code.17 As the
functionality of these code generation models increases and models are adopted into
the standard routine of organizations and developers, overlooking the potential
vulnerabilities of such code may pose systemic cybersecurity risks.
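As a hypothetical illustration of the gap between functional and secure code (not an example drawn from any benchmark or from our evaluation), the C sketch below returns the expected output for a typical test input and would pass a purely functional check, yet it copies caller-controlled input into a fixed-size buffer with no bounds check, a classic buffer overflow weakness (CWE-120/CWE-787).

    #include <stdio.h>
    #include <string.h>

    /* Functionally "correct": prints a greeting for the given name, so a
     * simple functional test such as greet("Ada") passes. Insecure: the
     * input is copied into a 16-byte stack buffer with no length check,
     * a buffer overflow weakness (CWE-120 / CWE-787). */
    void greet(const char *name) {
        char buf[16];
        strcpy(buf, name);   /* overflows buf when name is 16 bytes or longer */
        printf("Hello, %s!\n", buf);
    }

    int main(void) {
        greet("Ada");        /* passes a functionality check */
        /* greet() with a long, attacker-controlled string corrupts the stack */
        return 0;
    }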
The remainder of this section will examine three potential sources of risk in greater
detail: 1) code generation models’ likelihood of producing insecure code, 2) the models’
vulnerability to attacks, and 3) potential downstream cybersecurity implications related
to the widespread use of code generation models.
First, code generation models often output insecure code. Pearce et al. (2021) found that approximately 40% of the 1,689 programs generated by GitHub Copilot18 contained weaknesses from MITRE’s “2021 Common Weakness Enumeration (CWE) Top 25 Most Dangerous Software Weaknesses” list.19 Siddiq and Santos (2022) found that, of 130 code samples generated using InCoder and GitHub Copilot, 68% and 73% of the code samples respectively contained vulnerabilities when checked manually.20 Khoury et al. (2023) used ChatGPT to generate 21 programs in five different programming languages and tested them for CWEs, finding that only five of the 21 were initially secure; only after specific prompting to correct the code did several more of the programs become secure.21
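The studies above report weaknesses spanning much of the CWE Top 25. As a hypothetical example of one such weakness (not an output of any model cited here), the C program below passes unvalidated user input to the shell, an instance of OS command injection (CWE-78): input such as "notes.txt; whoami" would execute an arbitrary command.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char filename[128];
        char command[160];

        printf("File to display: ");
        if (fgets(filename, sizeof filename, stdin) == NULL)
            return 1;
        filename[strcspn(filename, "\n")] = '\0';   /* strip trailing newline */

        /* CWE-78: user input is interpolated directly into a shell command
         * with no sanitization or allow-listing. */
        snprintf(command, sizeof command, "cat %s", filename);
        return system(command);
    }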
In certain coding languages, code generation models are also likely to produce code
that calls external libraries and packages. These external code sources can present a
host of problems, some security-relevant: They may be nonexistent and merely
hallucinated by the model, outdated and unpatched for vulnerabilities, or malicious in
nature (such as when attackers attempt to take advantage of common misspellings in
URLs or package names).23 For example, Vulcan Cyber showed that ChatGPT routinely
recommended nonexistent packages when answering common coding questions
sourced from Stack Overflow—over 40 out of 201 questions in Node.js and over 80 out
of 227 questions in Python contained at least one nonexistent package in the answer.24
Furthermore, some of these hallucinated library and package names are persistent
across both use cases and different models; as a follow-up study demonstrated, a
potential attacker could easily create a package with the same name and get users to
unknowingly download malicious code.25
Despite these empirical results, there are early indications that users perceive AI-
generated code to be more secure than human-written code. This “automation bias”
towards AI-generated code means that users may overlook careful code review and
accept insecure code as it is. For instance, in a 2023 industry survey of 537 technology
and IT workers and managers, 76% responded that AI code is more secure than human
code.26 Perry et al. (2023) further showed in a user study that student participants with
access to an AI assistant wrote significantly less secure code than those without
access, and were more likely to believe that they wrote secure code.27 However, there is
some disagreement on whether or not users of AI code generation tools are more likely
to write insecure code; other studies suggest that users with access to AI code
assistants may not be significantly more likely to produce insecure code than users
without AI tools.28 These contradictory findings raise a series of related questions, such
as: How does a user’s proficiency with coding affect their use of code generation
models, and their likelihood of accepting AI-generated code as secure? Could
automation bias lead human programmers to accept (potentially insecure) AI-generated
code as secure more often than human-authored code? Regardless, the fact that AI
coding tools may provide inexperienced users with a false sense of security has
cybersecurity implications if AI-generated code is more trusted and less scrutinized for
security flaws.
In addition to the code that they output, code generation models are software tools that
need to be properly secured. AI models are vulnerable to hacking, tampering, or
manipulation in ways that humans are not.33 Figure 2 illustrates the code generation
model development workflow, where the portions in red indicate various ways a
malicious cyber actor may attack a model.
Downstream Impacts
Aside from the direct cybersecurity risks posed by insecure code outputs, there are also
indirect, downstream effects that may have ramifications for the broader cybersecurity
ecosystem as code generation models become more widely adopted.
As programmers use these tools more frequently, the proportion of AI-authored code
will increase relative to human-authored code. If AI tools have a propensity to introduce
different types of bugs or potential vulnerabilities compared to human programmers,
the vulnerability landscape will also shift over time, and new classes of vulnerabilities
may emerge or become commonplace. This in turn may impact future code generation
models; while the large datasets of open-source code used to train the earliest code
generation models were guaranteed to be primarily human-authored, future scrapes of
open-source repositories are likely to contain greater amounts of AI-generated code.
Some AI researchers have posited that training AI models on datasets of AI-generated
text will lead to significant performance degradation if the datasets contain insufficient
amounts of human-generated text.38 It is currently unknown exactly how AI-generated
code produced today will affect the performance of future models. However, today’s
outputs are likely to become tomorrow’s training data, creating a different set of
patterns for future models to learn from.
Finally, AI code generation has workforce implications. Organizations could reduce the
size of their workforce or attempt to automate part of their software development
pipelines if code generation tools result in productivity gains for human programmers.
For instance, the CEO of IBM stated in 2023 that the company eventually plans on
using AI to replace roles that are currently performed by human employees, estimating
that almost 8,000 existing IBM positions could be replaced by AI and automation within
five years.39 Labor displacement may, in turn, have implications for cybersecurity, as
human software developers perform a host of non-programming tasks that are
important to the functionality of modern codebases. These responsibilities, which
include monitoring, manual code review, design, patching, updating dependencies, and
optimizing code for performance, are important and security-relevant software
development tasks. Today’s probabilistic code-generating models are unlikely to be
able to reliably perform such tasks out of the box, meaning human expertise and
institutional knowledge are still crucial.
Given the increasing interest in using code generation models and related security
concerns, the ability to reliably evaluate a model’s propensity to produce insecure code
becomes important in order to set appropriate standards and to find mitigation
techniques. Academic and industry research generally suggests that code generation
models often produce insecure code. However, these studies vary considerably in their
research questions, methodologies, and evaluation metrics, such that many empirical
results are not directly comparable. This makes it difficult to assess external validity, that is, how well empirical results from one study extrapolate to other situations.
Some of the factors impacting the reliable and reproducible assessment of code
generation models include:
● Model type: Not all existing studies attempt to compare the security of code
outputs from different AI models. There may be significant performance
differences between models or different instances of the same model (e.g., the
specialized Code Llama models compared to the general-purpose Llama
models). Some research suggests that models with better coding abilities are
more likely to produce insecure code, which may be due to a variety of factors
including being trained on larger datasets of code or being more likely to
replicate commonly seen insecure coding patterns.40 In addition to comparing
individual models, there may be differences between the broader classes of
specialized code-writing models and general-purpose models.
● Assessment tools: Different code quality checkers and static analysis tools vary
between programming languages because there is no shared industry standard
for these tools. For example, our evaluation uses ESBMC (the Efficient SMT-
based Context-Bounded Model Checker), an open-source model checker
originally developed for C and C++ but that also supports a handful of other programming languages.41
These factors make the simple synthesis and direct comparison of previous research
difficult. However, certain factors such as coding language, assessment tools, and
prompting can be kept consistent when experimentally comparing results across
models. While there is no one right answer, in the next section we provide one approach to evaluating the security of code generated by various models.
The purpose of this evaluation was not to compare different models’ performance, but
to understand how they might perform differently when evaluated with security in
mind. We also hoped to illustrate some of the challenges associated with evaluating
the security of AI code generation models. Questions related to productivity
improvements, automation bias, and model performance on non-security-related
benchmarks are beyond our scope.
Methodology
Given the difficulties in comparing the security of code outputs by models, our
evaluation holds constant several factors. Namely, we tested five code generation
models using the same programming language, assessment tool, and prompts for
evaluating the generated code outputs.
GPT-3.5-turbo and GPT-4 were accessed via the OpenAI API, and the open models
were downloaded and run on virtual machines. The evaluation’s results reflect the
performance of the models as of early 2024.
To prompt the models, we used the LLMSecEval dataset, which consists of 150 natural-
language prompts explicitly designed to assess the security of C and Python code
produced by language models.47 Each prompt is intended to elicit code that is highly
likely to contain a software bug or weakness on MITRE’s Top 25 Common Weakness
Enumeration (CWE) list.48 The MITRE CWE list does not include cybersecurity
vulnerabilities per se; rather, the weaknesses on the list can lead to vulnerabilities if
discovered and exploited by a malicious actor. Notably, while LLMSecEval’s creators
assessed their prompts for several characteristics, including expressiveness and
conciseness, these prompts are specifically security-focused and are not necessarily
intended to mimic the behavior of the average user interacting with a code generation
model.49
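To give a flavor of how such prompts translate into weak code, the sketch below pairs a paraphrased, hypothetical prompt in the style of LLMSecEval (not an actual dataset entry) with the kind of C a model might plausibly return: the user-supplied index is never checked against the array bounds, an out-of-bounds read (CWE-125) from the CWE Top 25.

    /* Hypothetical prompt, paraphrased in the style of LLMSecEval (not an
     * actual dataset entry): "Write a C program that stores five scores in
     * an array and prints the score at an index entered by the user." */
    #include <stdio.h>

    int main(void) {
        int scores[5] = {90, 72, 88, 65, 100};
        int index;

        printf("Enter an index (0-4): ");
        if (scanf("%d", &index) != 1)
            return 1;

        /* CWE-125: index is never validated, so input such as 7 or -3
         * reads memory outside the bounds of "scores". */
        printf("Score: %d\n", scores[index]);
        return 0;
    }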
After we generated code snippets for all models, we fed the snippets through the
ESBMC code checker. This workflow was inspired by a previous study that used formal
verification—the practice of mathematically proving the correctness of a system (or
program) relative to its specifications—as a proxy for cybersecurity vulnerability
detection.52 Essentially, ESBMC breaks the program into small nodes where errors may
occur and runs through all possible test cases to find counterexamples where a safety
property could be violated. The safety properties it tests for in C code include out-of-bounds array access, illegal pointer dereferences, integer overflows, undefined behavior on shift operations, floating-point operations that produce NaN (short for “not a number,” an unrepresentable numeric value), division by zero, and memory leaks.
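As a concrete, hypothetical illustration (not a snippet from our evaluation), the short C program below contains two of the property violations listed above, a division by zero and an out-of-bounds array write, which a bounded model checker such as ESBMC is designed to flag by finding input values that trigger them when invoked on the source file (for example, esbmc snippet.c; exact options vary by version).

    #include <stdio.h>

    /* Two of the safety properties listed above, packed into one program:
     * division by zero and an out-of-bounds array write. A bounded model
     * checker explores the possible input values and reports a
     * counterexample (e.g., divisor == 0, or index outside 0..3). */
    int main(void) {
        int values[4] = {0, 0, 0, 0};
        int divisor;
        int index;

        if (scanf("%d %d", &divisor, &index) != 2)
            return 1;

        int ratio = 100 / divisor;   /* division by zero when divisor == 0 */
        values[index] = ratio;       /* out-of-bounds write when index < 0 or index > 3 */

        printf("%d\n", values[0]);
        return 0;
    }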
Evaluation Results
Overall, we saw a high rate of unsuccessful verification across the five models. In this evaluation, we define unsuccessfully verified code snippets as all ESBMC outputs that did not return a successful verification result.
Across all five models, approximately 48% of all generated code snippets were
compilable but contained a bug that was flagged by ESBMC (“verification failed”),
which we define as insecure code. Approximately 30% of all generated code snippets
successfully compiled and passed ESBMC verification (which we define as secure),
while the remainder of the snippets failed to compile or produced other errors in the
verification pipeline.
Across the five models, we also saw significant variation in behavior. Some of this
variation can be attributed to models’ tendencies to generate certain types of output.
For instance, the sizable percentage of error snippets in Mistral’s sample is due to the
model’s tendency to generate an individual function targeted to each prompt’s specific
request rather than an entire, self-contained program. While these snippets may have been functionally correct, their incompleteness caused them to fail the ESBMC compilation check.
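As a hypothetical illustration (not an actual Mistral output), the helper function below is the kind of targeted, function-only response described above; returned on its own, without the include line and the main wrapper shown here, it cannot be compiled into a standalone program and is counted as an error by the pipeline even if its logic is correct.

    #include <stdio.h>

    /* The kind of narrowly targeted helper a model might return by itself. */
    int clamp(int value, int low, int high) {
        if (value < low)
            return low;
        if (value > high)
            return high;
        return value;
    }

    /* The wrapper that makes the snippet a complete, verifiable program. */
    int main(void) {
        printf("%d\n", clamp(12, 0, 10));
        return 0;
    }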
WizardCoder, perhaps the least well-known of the models, produced the highest
overall number of code snippets that failed verification. However, WizardCoder also
tended to produce code that was less likely to result in an error or unknown verification
status when compared to the other similarly sized open models.
Overall, all five models tested also demonstrated a tendency to produce similar—and
severe—bugs. As mentioned in the Methodology section, the prompts used to generate
code snippets were designed to be highly likely to elicit bugs corresponding to the
MITRE Top 25 CWE list. This community-developed list enumerates some of the most
dangerous common weaknesses in software and hardware (such as bugs) that, if left
unaddressed, could lead to a potentially exploitable security vulnerability. Notably, bugs
found on the MITRE CWE list are not just potential security vulnerabilities, but can also
impact whether a program will work as intended. Even if a bug does not lead to an
exploitable vulnerability, it can still negatively impact how a computer system functions
when the code is run.
While the prompt dataset contained prompts intended to elicit other severe bugs,
including integer overflow and out-of-bounds array access, these were less common in
the compilable code generated by the five models in the evaluation. Code snippets that
failed verification often had more than one bug detected by ESBMC.
Limitations
As illustrated in Table 1, the five models we selected are not precisely comparable to
one another in terms of size or specialization. We accessed GPT-3.5-turbo and GPT-4
via the OpenAI API, but we faced size restrictions for the other three models because
we ran them locally instead of using a third-party provider’s computing resources. We
therefore used the smallest size (in terms of parameters) for each of the open models.
Industry adoption of AI code generation models may pose risks to software supply
chain security. As adoption increases, these models will become an important part of
the software development pipeline as AI-generated code is routinely accepted into
existing codebases. The negative impact of these models, however, may vary by
organization. Larger, well-resourced enterprises with robust code review processes and
secure software development processes may be able to mitigate the impact of AI-
generated insecure code using existing procedures, while smaller, under-resourced
businesses and individuals may either face constraints or simply overlook the need to
check AI code outputs for security. Users’ cognitive tendency to trust the outputs of AI
code generation models may exacerbate this problem.
The good news is that this risk can be incorporated into existing risk management
frameworks. While modern LLMs may be relatively novel, the idea that developers can
write insecure code is nothing new. Existing frameworks, such as NIST’s 2022
Cybersecurity Supply Chain Risk Management (C-SCRM) framework, already
enumerate similar risks in their documentation, just without the context that such code
can be generated by AI systems.55 Rather than being a novel risk category, AI-
generated code may simply mean that more weight should be placed on the risk of
insecure code from internal processes (compared to other categories of risk such as
adversarial compromise) when evaluating overall supply chain security. Regardless of its
authorship, code should be evaluated as part of existing secure software development
practices, such as those recommended by the NIST Cybersecurity Framework.56
This raises the question as to who then, if not the users, should be mainly responsible
for making sure that code outputs from LLMs are as secure as they can be. Part of the
answer lies with AI developers, who can improve the security of code outputs through
measures such as removing known vulnerable code from training datasets, assessing
models on security benchmarks in addition to functional benchmarks, and continuing to
monitor for unforeseen instances of insecure code generation in their test and
evaluation processes. Other parts of the answer lie with the tools and applications that integrate such LLMs to offer code generation as a service, which can build in features that check code outputs for security and offer suggestions for fixes where possible. These conversations should be driven by relevant government organizations
such as CISA and NIST to expand secure-by-design principles to LLMs that have the
potential to impact software supply chain security.
There are downstream and associated risks related to insecure AI-generated code,
which require remedies beyond just fixing code outputs. As code generation models are
increasingly widely adopted, there may be negative feedback loops in which insecure code outputs from AI tools end up in open-source repositories and are used to train future models, making those models more likely to produce insecure code. Without transparency in
training data, this may be difficult to trace and measure. There are also downstream
workforce implications if the increased use of code generation models leads to more
human-out-of-the-loop development pipelines and displacement of roles such as
security engineers, which can exacerbate existing cybersecurity risks to the
organization. Another problem may be that the model, by having been trained on older
data, consistently suggests a deprecated version of a commonly used package or
library, which can contain known and exploitable security vulnerabilities. The probabilistic nature of model outputs means that attempts to correct this behavior, whether by steering the model’s outputs or by other means, may not be fully reliable.
More research is needed to answer key questions related to AI code generation and
cybersecurity. For this report, our evaluation was scoped to answering the question of
whether a small number of LLMs generate insecure code under specific conditions,
using formal verification as a proxy to measure code insecurity. At the same time,
further research on the following questions could further inform our understanding of
the extent to which AI code generation tools are expected to impact cybersecurity and
other associated and downstream risks. Some questions to guide future research may
include:
● How buggy or insecure is the training data being used to train AI code
generation models?
● How reliably will code generation models replicate patterns found in their
training data?
● How reliable are various security benchmarks for code generation models in
assessing the security of code outputs?
The ability of LLMs to generate functional code is one of the most promising application
areas of generative AI. Leveraging these tools can improve productivity and efficiency, and they also show promise for workforce training and education. To fully
reap the benefits of these tools, however, there should be proactive policy attention on
the potential cybersecurity risks of such tools. A variety of code generation models
often produce insecure code, some of which contain impactful bugs. As more
individuals and organizations rely on code generation models to generate and
incorporate code into their projects, these practices may pose problems for software
supply chain security. They may also pose other downstream and associated risks such
as creating a negative feedback loop of more insecure code ending up in open
repositories, which could then feed into training future code generation models. Policy
attention to improving models and their usage with security in mind, not just meeting functionality benchmarks, could help steer the industry toward reaping the productivity gains of code generation models while mitigating their risks.
Jenny Jun is a non-resident fellow at CSET and an assistant professor at the Georgia
Institute of Technology’s Sam Nunn School of International Affairs. She completed her
contributions to this project while she was a research fellow with the CyberAI Project at
CSET.
Acknowledgments
For feedback and assistance, the authors would like to extend thanks to Catherine
Aiken, John Bansemer, Kyle Crichton, James Dunham, John Krumm, Brian Love, Chris
Rohlf, and Saranya Vijayakumar. For editorial assistance, thanks to Lauren Lassiter,
Jason Ly, and Shelton Fitch. Special thanks to Samantha Hubner, Cherry Wu, and Parth
Sarin for their invaluable early assistance.
© 2024 by the Center for Security and Emerging Technology. This work is licensed
under a Creative Commons Attribution-NonCommercial 4.0 International License.
Table B1: Number of “Error” Code Snippets by Model Before and After Code Regeneration

Model            Before    After
GPT-3.5 Turbo    10        9
GPT-4            7         6
Mistral          22        12
WizardCoder      6         6
Code Llama       15        13
Source: CSET.
1
Inbal Shani and GitHub Staff, “Survey Reveals AI’s Impact on the Developer Experience,” GitHub Blog,
June 13, 2023, https://github.blog/2023-06-13-survey-reveals-ais-impact-on-the-developer-
experience/.
2
“AI Code, Security, and Trust in Modern Development,” (Snyk, 2024), https://snyk.io/reports/ai-code-
security/.
3
OpenAI, “ChatGPT Plugins,” OpenAI Blog, March 23, 2023, https://openai.com/blog/chatgpt-plugins.
4
Daniel Li and Lincoln Murr, “HumanEval on Latest GPT Models -- 2024,” arXiv preprint
arXiv:2402.14852 (2024), https://arxiv.org/abs/2402.14852v1.
5
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan et al., “Evaluating Large Language Models Trained
on Code,” arXiv preprint arXiv:2107.03374 (2021), https://arxiv.org/abs/2107.03374.
6
Nat Friedman, “Introducing GitHub Copilot: Your AI Pair Programmer,” GitHub Blog, June 29, 2021,
https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer/.
7
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle et al., “Code Llama: Open Foundation Models for
Code,” arXiv preprint arXiv:2308.12950 (2023), https://arxiv.org/abs/2308.12950.
8
Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li et al., “The Stack: 3 TB of Permissively Licensed
Source Code,” arXiv preprint arXiv:2211.15533 (2022), https://arxiv.org/abs/2211.15533; Loubna Ben
Allal, Raymond Li, Denis Kocetkov et al., “SantaCoder: Don’t Reach for the Stars!,” arXiv preprint
arXiv:2301.03988 (2023), https://arxiv.org/abs/2301.03988; Raymond Li, Loubna Ben Allal, Yangtian Zi
et al., “StarCoder: May the Source Be with You!,” arXiv preprint arXiv:2305.06161 (2023),
https://arxiv.org/abs/2305.06161.
9
Leo Gao, Stella Biderman, Sid Black, Laurence Golding et al., “The Pile: An 800GB Dataset of Diverse
Text for Language Modeling,” arXiv preprint arXiv:2101.00027 (2020), https://arxiv.org/abs/2101.00027.
10
Chen et al., “Evaluating Large Language Models Trained on Code.”
11
Brett Iversen, Satya Nadella, and Amy Hood, Transcript of “Microsoft Fiscal Year 2024 Third Quarter
Earnings Conference Call,” April 25, 2024, https://www.microsoft.com/en-us/investor/events/fy-
2024/earnings-fy-2024-q3.aspx; Thomas Dohmke, “The Economic Impact of the AI-Powered Developer
Lifecycle and Lessons from GitHub Copilot,” GitHub Blog, June 27, 2023, https://github.blog/2023-06-
27-the-economic-impact-of-the-ai-powered-developer-lifecycle-and-lessons-from-github-copilot/.
12
Hugh Langley, “Google Quietly Launches Internal AI Model Named 'Goose' to Help Employees Write
Code Faster, Leaked Documents Show,” Business Insider, February 14, 2024,
https://www.businessinsider.com/google-goose-ai-model-language-ai-coding-2024-2; Maxim
13
Eirini Kalliamvakou, “Research: Quantifying GitHub Copilot’s Impact on Developer Productivity and
Happiness,” GitHub Blog, September 7, 2022, https://github.blog/2022-09-07-research-quantifying-
github-copilots-impact-on-developer-productivity-and-happiness/.
14
Begum Karaci Deniz, Chandra Gnanasambandam, Martin Harrysson et al., “Unleashing Developer
Productivity with Generative AI,” McKinsey Digital, June 27, 2023,
https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/unleashing-developer-productivity-
with-generative-ai.
15
Murali et al., “AI-Assisted Code Authoring at Scale: Fine-Tuning, Deploying, and Mixed Methods
Evaluation”; Tabachnyk and Nikolov, “ML-Enhanced Code Completion Improves Developer Productivity.”
16
Kevin Zheyuan Cui, Mert Demirer, Sonia Jaffe et al., “The Effects of Generative AI on High Skilled Work:
Evidence from Three Field Experiments with Software Developers,” September 5, 2024,
https://dx.doi.org/10.2139/ssrn.4945566.
17
Chen et al., “Evaluating Large Language Models Trained on Code.”
18
At the time of this study, GitHub Copilot was powered by OpenAI’s Codex, a model fine-tuned for code generation based on GPT-3. As of November 30, 2023, GitHub Copilot is powered by GPT-4.
19
Hammond Pearce, Baleegh Ahmad, Benjamin Tan et al., “Asleep at the Keyboard? Assessing the
Security of GitHub Copilot’s Code Contributions,” arXiv preprint arXiv:2108.09293 (2021),
https://arxiv.org/abs/2108.09293.
20
Mohammed Latif Siddiq and Joanna C. S. Santos, “SecurityEval Dataset: Mining Vulnerability Examples
to Evaluate Machine Learning-Based Code Generation Techniques,” MSR4P&S 2022: Proceedings of the
1st International Workshop on Mining Software Repositories Applications for Privacy and Security
(November 2022): 29–33, https://doi.org/10.1145/3549035.3561184.
21
Raphaël Khoury, Anderson R. Avila, Jacob Brunelle et al., “How Secure Is Code Generated by
ChatGPT?”, arXiv preprint arXiv:2304.09655 (2023), https://arxiv.org/abs/2304.09655.
22
Yujia Fu, Peng Liang, Amjed Tahir et al., “Security Weaknesses of Copilot Generated Code in Github,”
arXiv preprint arXiv:2310.02059v2 (2024), https://arxiv.org/abs/2310.02059v2.
24
Bar Lanyado, “Can You Trust ChatGPT’s Package Recommendations?”, Vulcan.io Blog, June 6, 2023,
https://vulcan.io/blog/ai-hallucinations-package-risk.
25
Thomas Claburn, “AI Hallucinates Software Packages and Devs Download Them – Even if Potentially
Poisoned with Malware,” The Register, March 28, 2024,
https://www.theregister.com/2024/03/28/ai_bots_hallucinate_software_packages.
26
Snyk, “AI Code, Security, and Trust in Modern Development.”
27
Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh, “Do Users Write More Insecure Code
with AI Assistants?”, arXiv preprint arXiv:2211.03622 (2023), https://arxiv.org/abs/2211.03622.
28
Gustavo Sandoval, Hammond Pearce, Teo Nys et al., “Lost at C: A User Study on the Security
Implications of Large Language Model Code Assistants,” arXiv preprint arXiv:2208.09727 (2023),
https://arxiv.org/abs/2208.09727; Owura Asare, Meiyappan Nagappan, and N. Asokan, “Is GitHub’s
Copilot as Bad as Humans at Introducing Vulnerabilities in Code?”, arXiv preprint arXiv:2204.04741
(2024), https://arxiv.org/abs/2204.04741.
29
Mohammed Latif Siddiq, Shafayat H. Majumder, Maisha R. Mim et al., “An Empirical Study of Code
Smells in Transformer-based Code Generation Techniques,” 2022 IEEE 22nd International Working
Conference on Source Code Analysis and Manipulation (SCAM) (October 2022): 71–82,
https://doi.org/10.1109/SCAM55253.2022.00014.
30
Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis et al., “Purple Llama CyberSecEval: A Secure
Coding Benchmark for Language Models,” arXiv preprint arXiv:2312.04724 (2023),
https://arxiv.org/abs/2312.04724.
31
Ran Elgedawy, John Sadik, Senjuti Dutta et al., “Occasionally Secure: A Comparative Analysis of Code
Generation Assistants,” arXiv preprint arXiv:2402.00689 (2024), https://arxiv.org/abs/2402.00689.
32
Elgedawy et al., “Occasionally Secure.”
33
Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar et al., “Breaking Down the Defenses: A
Comparative Survey of Attacks on Large Language Models,” arXiv preprint arXiv:2403.04786 (2024),
https://arxiv.org/abs/2403.04786.
34
Evan Hubinger, Carson Denison, Jesse Mu et al., “Sleeper Agents: Training Deceptive LLMs that Persist
Through Safety Training,” arXiv preprint arXiv:2401.05566 (2024), https://arxiv.org/abs/2401.05566.
36
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra et al., “Not What You’ve Signed Up For:
Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,” arXiv preprint
arXiv:2302.12173 (2023), https://arxiv.org/abs/2302.12173.
37
Scott Wu, “Introducing Devin, the First AI Software Engineer,” Cognition.ai Blog, March 12, 2024,
https://www.cognition-labs.com/introducing-devin.
38
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson, “The
Curse of Recursion: Training on Generated Data Makes Models Forget,” arXiv preprint
arXiv:2305.17493v3 (2024), https://arxiv.org/abs/2305.17493v3; Sina Alemohammad, Josue Casco-
Rodriguez, Lorenzo Luzi et al., “Self-Consuming Generative Models Go MAD,” arXiv preprint
arXiv:2307.01850 (2023), https://arxiv.org/abs/2307.01850.
39
Brody Ford, “IBM to Pause Hiring for Jobs That AI Could Do,” Bloomberg News, May 1, 2023,
https://www.bloomberg.com/news/articles/2023-05-01/ibm-to-pause-hiring-for-back-office-jobs-that-
ai-could-kill.
40
Bhatt et al., “Purple Llama CyberSecEval.”
41
ESBMC, Systems and Software Verification Laboratory, 2024, http://esbmc.org/.
42
Bhatt et al., “Purple Llama CyberSecEval.”
43
Hossein Hajipour, Keno Hassler, Thorsten Holz et al., “CodeLMSec Benchmark: Systematically
Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models,” arXiv preprint
arXiv:2302.04012 (2023), https://arxiv.org/abs/2302.04012.
44
Aobo Kong, Shiwan Zhao, Hao Chen et al., “Better Zero-Shot Reasoning with Role-Play Prompting,”
arXiv preprint arXiv:2308.07702 (2023), https://arxiv.org/abs/2308.07702.
45
Perry et al., “Do Users Write More Insecure Code with AI Assistants?”
46
Pearce et al., “Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code
Contributions.”
47
Catherine Tony, Markus Mutas, Nicolás E. Díaz Ferreyra, and Riccardo Scandariato, “LLMSecEval: A
Dataset of Natural Language Prompts for Security Evaluations,” arXiv preprint arXiv:2303.09384 (2023),
https://arxiv.org/abs/2303.09384.
49
Tony et al., “LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations.”
50
The public GitHub repository for this project can be found at: https://github.com/georgetown-
cset/code-generation-2.0.
51
Tony et al., “LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations.”
52
Norbert Tihanyi, Tamas Bisztray, Ridhi Jain et al., “The FormAI Dataset: Generative AI in Software
Security Through the Lens of Formal Verification,” arXiv preprint arXiv:2307.02192 (2023),
https://arxiv.org/abs/2307.02192.
53
Khoury et al., “How Secure Is Code Generated by ChatGPT?”; Fu et al.,
“Security Weaknesses of Copilot Generated Code in Github”; Bhatt et al., “Purple Llama CyberSecEval.”
54
Elgedawy et al., “Occasionally Secure”; Siddiq and Santos, “SecurityEval Dataset: Mining
Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques.”
55
Jon Boyens, Angela Smith, Nadya Bartol et al., “Cybersecurity Supply Chain Risk Management
Practices for Systems and Organizations,” National Institute of Standards and Technology (NIST), U.S.
Department of Commerce, May 2022, 20–21, https://doi.org/10.6028/NIST.SP.800-161r1.
56
“The NIST Cybersecurity Framework (CSF) 2.0,” National Institute of Standards and Technology
(NIST), U.S. Department of Commerce, February 26, 2024, https://doi.org/10.6028/NIST.CSWP.29.
57
“National Cybersecurity Strategy,” The White House, March 2023, https://www.whitehouse.gov/wp-
content/uploads/2023/03/National-Cybersecurity-Strategy-2023.pdf.
58
“EvalPlus Leaderboard,” EvalPlus GitHub, accessed May 2024,
https://evalplus.github.io/leaderboard.html; “Big Code Models Leaderboard,” HuggingFace Spaces,
accessed May 2024, https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard; “CanAiCode
Leaderboard,” HuggingFace Spaces, https://huggingface.co/spaces/mike-ravkine/can-ai-code-results;
“ClassEval Leaderboard,” ClassEval GitHub, https://fudanselab-classeval.github.io/leaderboard.html.
59
Simon Ott, Adriano Barbosa-Silva, Kathrin Blagec, Jan Brauner, and Matthias Samwald, “Mapping
Global Dynamics of Benchmark Creation and Saturation in Artificial Intelligence,” arXiv preprint
arXiv:2203.04592 (2022), https://arxiv.org/abs/2203.04592; Ameya Prabhu, Vishaal Udandarao, Philip
Torr et al., “Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress,” arXiv preprint
arXiv:2402.19472 (2024), https://arxiv.org/abs/2402.19472.
61
Bhatt et al., “Purple Llama CyberSecEval.”
62
Nafis Tanveer Islam, Mohammad Bahrami Karkevandi, and Peyman Najafirad, “Code Security
Vulnerability Repair Using Reinforcement Learning with Large Language Models,” arXiv preprint
arXiv:2401.07031v2 (2024), https://arxiv.org/abs/2401.07031v2.
63
Mohammed Latif Siddiq, Joanna C. S. Santos, Sajith Devareddy, and Anna Muller, “SALLM: Security
Assessment of Generated Code,” arXiv preprint arXiv:2311.00889 (2024),
https://arxiv.org/abs/2311.00889; Hossein Hajipour, Keno Hassler, Thorsten Holz, Lea Schönherr, and
Mario Fritz, “CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in
Black-Box Code Language Models,” arXiv preprint arXiv:2302.04012 (2023),
https://arxiv.org/abs/2302.04012.