Code Generation 2308.10335v5
Abstract

Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and generate programming code. It has become a common practice for software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of code generation from LLMs have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. For example, the misuse of APIs in the generated code could lead to severe problems, such as resource leaks and program crashes. Existing code evaluation benchmarks and datasets focus on crafting small tasks such as programming questions in coding interviews, which deviates from the problems developers typically consult LLMs about. To fill the missing piece, we propose a dataset, ROBUSTAPI, for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from Stack Overflow on 18 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate them on current popular LLMs. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code were introduced into real-world software.

Introduction

The new era of language modeling arrived when large language models (LLMs) became capable of generating customized code according to the user's needs (Ye et al. 2023; OpenAI 2023a; Anil et al. 2023). It is not surprising that more and more software engineers choose to query large language models for answers to coding questions, such as generating a code snippet using certain APIs or detecting bugs in a few lines of code. Large language models can provide more suitable and customized answers to such questions than searching online programming forums, such as Stack Overflow.

Such a fast pace conceals potential risks in the code generation of large language models. From the perspective of software engineering, the robustness and reliability of generated code have not yet been thoroughly studied, even though numerous works have been made to avoid syntax errors and improve semantic understanding in the generated code (Xu et al. 2022; Chen et al. 2021; Shen et al. 2023a; Luo et al. 2023). Unlike answers on online programming forums, the generated code snippets are not reviewed by community peers and thus suffer from API misuse, such as missing boundary checking in file reading and variable indexing, missing file stream closing, failure in transaction completion, etc. Even if the code samples are executable or functionally correct, such misuse can trigger serious potential risks in production, such as memory leaks, program crashes, and garbage collection failures, as shown in Figure 1. To make things worse, the programmers asking these questions are vulnerable to these risks if they are novices to the APIs and cannot tell the violations in the generated code snippets. Therefore, it is essential to consider code reliability when evaluating the code generation of large language models.

To evaluate the code generation of large language models, most existing benchmarks focus on the functional correctness of the execution result of the generated code, which means the code is acceptable as long as it is functional for the user's purpose (Chen et al. 2021; Yin et al. 2018; Lu et al. 2021). We argue that a correct execution result is important, but it is not all that matters in the software development scenario. What engineers really need is a reliable code sample without potential risks in the long run. Moreover, the domain of most current programming datasets is far from software engineering: the data source is mostly online coding challenge websites, such as Codeforces, Kattis, and Leetcode (Hendrycks et al. 2021; Austin et al. 2021). Although remarkable progress has been made, we argue that these benchmarks fail to substantially help software development in practical scenarios.

To this end, we propose ROBUSTAPI, a comprehensive benchmark to evaluate the reliability and robustness of code generated by large language models, including a dataset of coding questions and an evaluator using the abstract syntax tree (AST) (Fischer, Lusiardi, and Von Gudenberg 2007). In the dataset, we target an evaluation setting that is close to real software development; thus we collect representative questions about Java from Stack Overflow. Java is one of the most popular programming languages and is widely used in software development because of its write once, run anywhere (WORA) feature¹.
Figure 1: The scenario where software engineers consult large language models (e.g., LLaMA 2, Vicuna) for the answer to programming questions. The generated code snippet is not reliable and has potential risks in software development.

The figure shows a user asking: "How can I create a file with Java? I want to create a file through Java. What functions shall I use?"

LLM-generated code snippet (Llama 2) — syntax correct ✓, function correct ✓, semantics aligned ✓, reliable & robust ✗:

    static void CreateNewFile(String filePath) {
        File file = new File(filePath);
        if (!file.exists()) { file.createNewFile(); }
    }

Correct API usage — createNewFile requires catching IO exceptions when the file already exists or the parent folder doesn't exist:

    File file = new File(filePath);
    try {
        file.createNewFile();
    } catch (IOException e) {
        e.printStackTrace();
    }
For each question, we provide a detailed description and the related Java API. We design templates to trigger large language models to generate the code snippet and the corresponding explanation; one possible template is sketched below.
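The following is our own illustration of what such a prompt template might look like; the exact wording of the templates released with ROBUSTAPI may differ:

    You are a software engineer. Answer the following Stack Overflow
    question with a Java code snippet and a brief explanation.
    Please use the API: {api}
    Question: {question}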
We also provide an evaluator that analyzes the generated code snippets using the abstract syntax tree (AST) and compares them with the expected API usage patterns. Following Zhang et al. (2018), we formalize the API usage patterns into structured call sequences, as shown in Figure 2. The structured call sequences present how these APIs can be properly used to eliminate potential system risks. Any violation of such structured call sequences is considered an API misuse from the perspective of software engineering; an illustrative rule is sketched below.
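For instance, the usage rule for File.createNewFile from Figure 1 can be expressed as a structured call sequence. The notation below is our own sketch modeled on Zhang et al. (2018), not the exact rule syntax used in ROBUSTAPI:

    // Illustrative rule (hypothetical notation):
    //   new File(path) ; TRY { file.createNewFile() } CATCH (IOException) { ... }
    // A snippet whose extracted call sequence invokes createNewFile() outside a
    // try block catching IOException violates the rule and counts as API misuse.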
We collect 1208 real questions from Stack Overflow, which involve 18 representative Java APIs. We run experiments on closed-source language models (GPT-3.5 and GPT-4 (OpenAI 2023a)) as well as open-source language models (Llama-2 (Touvron et al. 2023) and Vicuna-1.5 (Chiang et al. 2023)). We use the default hyper-parameter settings of the models without extensive hyper-parameter tuning. We further design two experiment settings, zero-shot and one-shot, where none or one demonstration sample is provided in the prompt. We conduct a comprehensive analysis of the generated code and study the common API misuse cases of current large language models. We would like to bring up the important issue of API misuse in code generation by large language models, and provide a new dimension for evaluating large language models other than the commonly used functional correctness. The main purpose of this benchmark is not to evaluate the functional correctness of the generated code; instead, we focus on reliability and robustness. We hope this work can facilitate future research on this topic and help create a more robust coding helper out of large language models, stepping further toward real artificial general intelligence. We open-source our dataset and evaluator on GitHub². We summarize our contributions as follows.

• We propose a new benchmark, ROBUSTAPI, to evaluate the reliability and robustness of code generation by large language models. This is an important but not yet well-studied perspective for evaluating code quality apart from functional correctness.

• We provide a well-formalized evaluation framework including a dataset of Stack Overflow questions and an API usage checker using AST. We report the performance of popular large language models, including GPT-3.5, GPT-4, Llama-2, and Vicuna-1.5.

• We conduct a comprehensive analysis of the code generation performance of current large language models. We summarize the common API misuses for each model and point out promising directions for future research.

¹ https://en.wikipedia.org/wiki/Java_(programming_language)
² https://github.com/FloridSleeves/RobustAPI

Related Work

Code Quality of LLM-Synthesized Code With the release of Copilot (Chen et al. 2021) and other commercial code assistant tools based on LLMs, the security and code quality of these tools have gradually drawn the attention of the research community. Yetistiren, Ozsoy, and Tuzun (2022) assess the quality of LLM-generated code from the aspects of compilation correctness, functional correctness, and code efficiency. Siddiq et al. (2022) study code smells in code generated by LLMs, i.e., poor design such as unusually long methods or duplicated code. Poesia et al. (2022) show that LLMs can make implementation errors in the code, like syntax errors or semantic errors that deviate from users' intention. Jesse et al. (2023) study simple, stupid bugs in Codex and other LLMs, showing that AI code assistants can help avoid some of such simple bugs but have a higher chance of introducing bugs that are hard to detect. As for security impact, Pearce et al. (2022) design 89 security-sensitive scenarios for Copilot to complete code for users, showing that approximately 40% of the code is vulnerable. Perry et al. (2022) conduct the first large-scale user study to examine whether users interacting with AI code assistants write secure code; they find that those users wrote significantly less secure code while believing their code was secure. Sandoval et al. (2023) conduct a user study to assess the security of low-level code with pointer and array manipulations generated by AI-based coding assistants, finding that in this specific scenario the assistants do not introduce more security bugs than humans.
Liu et al. (2023) enlarge HumanEval (Chen et al. 2021) by generating test cases with higher coverage, which serve as an add-on to the existing programming benchmarks, but the evaluation still focuses on functional correctness and simple programming questions far from software development. Shen et al. (2023b) evaluate the reliability of ChatGPT by testing on adversarial examples, which however carries a different meaning of 'reliability' in their context. In this paper, we refer to reliability as the ability of code to resist failure, high workload, and unexpected input.

Quality Assessment of Code in Online Forums Existing literature in the software engineering field has investigated the quality of code from online forums and warned developers of the potential issues. Yang, Hussain, and Lopes (2016) find that the majority of code examples given in Stack Overflow answers cannot be compiled. Zhou and Walker (2016) point out that 43% of the posts they investigated contained deprecated APIs, while Fischer et al. (2017) find that 29% of the code contains security risks. Zhang et al. (2018) analyze code by call sequence extraction and slicing, and compare it to manually validated API usage rules, concluding that 31% of the code examples in Stack Overflow answers contain API misuse and could produce unexpected behaviors.

Methodology

In this section, we describe ROBUSTAPI, a comprehensive benchmark to thoroughly evaluate the reliability and robustness of LLM-generated code. We describe the process of data collection and prompt generation when constructing the dataset. Then we present the API misuse patterns evaluated in ROBUSTAPI and discuss the potential consequences of violations. Finally, we introduce the static analysis method in ROBUSTAPI for detecting API usage violations, which leverages the abstract syntax tree and achieves higher accuracy in evaluating API misuse in LLM-generated code compared to rule-based methods such as keyword matching.

The distribution of questions for each domain is shown in Table 1.

API                        | Domain               | Conseq* | GitHub*
StringTokenizer.nextToken  | String Process (307) | (iii)   | 13.3K
String.getBytes            | String Process (307) | (iii)   | 88.1K
JsonElement.getAsString    | String Process (307) | (iii)   | 4.4K
List.get                   | Data Structure (404) | (iii)   | 2.7M
Map.get                    | Data Structure (404) | (iii)   | 2.4M
Iterator.next              | Data Structure (404) | (iii)   | 918K
ProgressDialog.dismiss     | Mobile Develop (75)  | (iii)   | 54K
TypedArray.getString       | Mobile Develop (75)  | (iv)    | 6.8K
ApplicationInfo.loadIcon   | Mobile Develop (75)  | (v)     | 3.6K
Activity.setContentView    | Mobile Develop (75)  | (v)     | 4.6K
Cipher.init                | Crypto (10)          | (iii)   | 66.3K
RandomAccessFile.write     | I/O (390)            | (i)     | 129K
BufferedReader.readLine    | I/O (390)            | (iii)   | 74.8K
PrintWriter.write          | I/O (390)            | (i)     | 1.1M
File.mkdirs                | I/O (390)            | (ii)    | 73.2K
File.createNewFile         | I/O (390)            | (i)     | 176K
FileChannel.write          | I/O (390)            | (i)     | 5.2K
SQLiteDatabase.query       | Database (22)        | (iv)    | 4K
Total                      | 1208                 |         | 7.8M

Table 1: 18 popular Java APIs in ROBUSTAPI. They are easily misused by developers according to the existing literature of software engineering (Zhang et al. 2018). *Consequences: (i) data loss; (ii) file system corruption; (iii) program crash; (iv) resource leak; (v) user interface bug. *GitHub: occurrences of this API on GitHub.

After collecting the questions, we convert them into the JSON format with the following fields: {id, api, question, origin}. The id field contains the unique id we assign to each sample. The api field contains the API that we specifically instruct the large language models to use as a question hint. The question field contains the title and description of the Stack Overflow question. The origin field contains the original URL of the sample.
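A minimal sketch of one dataset record follows; the id, api value format, and question text here are hypothetical illustrations (the question is borrowed from Figure 1), not taken from the released dataset:

    {
      "id": 1,
      "api": "File.createNewFile",
      "question": "How can I create a file with Java? I want to create a file through Java. What functions shall I use?",
      "origin": "https://stackoverflow.com/questions/..."
    }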
Figure 2: The workflow of our API checker. The API checker uses static analysis and analyzes the generated code with the abstract syntax tree (AST). An API misuse is detected when the AST call sequence and the API usage rule do not match.
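To make the workflow concrete, the following is a minimal sketch of such a checker, assuming the open-source JavaParser library. The actual ROBUSTAPI evaluator extracts full call sequences and matches them against the usage rules; this single-rule check (and the simplification of testing only for an enclosing try block, not the caught exception types) is illustrative only:

    import com.github.javaparser.StaticJavaParser;
    import com.github.javaparser.ast.CompilationUnit;
    import com.github.javaparser.ast.expr.MethodCallExpr;
    import com.github.javaparser.ast.stmt.TryStmt;

    public class ApiMisuseChecker {
        // Hypothetical single rule: File.createNewFile() must be called
        // inside a try block (cf. Figure 1).
        public static boolean violatesCreateNewFileRule(String javaSource) {
            CompilationUnit ast = StaticJavaParser.parse(javaSource);
            // Find every call to createNewFile and flag any call that has
            // no TryStmt ancestor in the AST.
            return ast.findAll(MethodCallExpr.class).stream()
                    .filter(call -> call.getNameAsString().equals("createNewFile"))
                    .anyMatch(call -> call.findAncestor(TryStmt.class).isEmpty());
        }

        public static void main(String[] args) {
            String snippet = "class Demo { static void f(java.io.File file)"
                    + " throws Exception { file.createNewFile(); } }";
            System.out.println(violatesCreateNewFileRule(snippet)); // true
        }
    }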
Research Questions

We conduct a series of experiments on state-of-the-art LLMs based on ROBUSTAPI, which demonstrate the usability and effectiveness of ROBUSTAPI. The experiments provide insights into the models' ability to answer real-world coding questions and into the robustness and reliability of these answers regarding API misuse problems. In the experiments, we try to answer the following questions:

• Q1: What are the API misuse rates in answering real-world coding questions by these LLMs?
• Q2: How do irrelevant shots affect the results?
• Q3: Can correct API usage examples reduce the misuse?
• Q4: Why does LLM-generated code fail the API usage check?

API Misuse Rate

Firstly, we present the API misuse rate of each model based on ROBUSTAPI on the left of Figure 3. In this figure, the higher the API misuse rate, the worse the code reliability and robustness of the large language model. The API misuse rate is calculated by dividing the number of answers that can be compiled and contain API misuses by the number of all answers that can be compiled. From the evaluation results, all the evaluated models suffer from API misuse problems, even state-of-the-art commercial models like GPT-3.5 and GPT-4. In the zero-shot setting, Llama has the lowest API misuse rate; however, this is partially because most of Llama's answers do not include any code. A counter-intuitive finding is that GPT-4 actually has a higher API misuse rate than GPT-3.5, even though the coding ability of GPT-4 is claimed to be "40% more advanced than its predecessor, GPT-3.5" (OpenAI 2023b). We also evaluate a code-specialized large language model, DeepSeekCoder (Piplani and Bamman 2018), which is trained on a variety of programming languages including Java and surpasses many existing code LLMs. We report the results of deepseek-coder-6.7b-base and deepseek-coder-6.7b-instruct. We observe that the code-specialized large language model can generate more compilable samples; however, its API misuse rate is not significantly better than that of the other models. This indicates that with the code generation ability of large language models
                         Zero-shot                        One-shot-irrelevant              One-shot-relevant
Model                    Misuse↓  Compil.↑  Overall↓     Misuse↓  Compil.↑  Overall↓     Misuse↓  Compil.↑  Overall↓
GPT 3.5                  62.97%   79.14%    49.83%       68.09%   91.06%    62.00%       38.56%   80.71%    31.13%
GPT 4                    68.81%   90.23%    62.09%       70.38%   91.39%    64.32%       54.40%   90.40%    49.17%
Llama 2∗                 7.34%∗   9.02%∗    0.66%∗       61.36%   80.13%    49.17%       64.47%   72.93%    47.02%
Vicuna 1.5               45.66%   37.17%    16.97%       57.85%   83.86%    48.51%       42.53%   64.24%    27.32%
ds-coder-6.7b-base       41.55%   40.65%    16.89%       75.60%   95.90%    72.43%       64.12%   67.14%    43.05%
ds-coder-6.7b-instruct   47.52%   50.00%    23.76%       59.04%   96.61%    57.04%       38.40%   86.01%    33.03%

Table 2: Performance of each LLM on ROBUSTAPI. ↓: the lower the better. ↑: the higher the better. Misuse Rate is the proportion of misuse cases among the compilable cases; Compilation Rate is the proportion of compilable cases among all questions; Overall Misuse is the proportion of misuse cases among all questions. ∗Though Llama 2 has a low misuse rate, its compilation rate is significantly lower than that of the other models.
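As a consistency check on these definitions, Overall Misuse equals Misuse Rate times Compilation Rate (up to rounding). A minimal sketch of the computation follows; the variable names and counts are ours, reverse-engineered from the GPT-3.5 zero-shot column, not from the ROBUSTAPI evaluator:

    // Metric definitions from Table 2, with example counts chosen to
    // reproduce the GPT-3.5 zero-shot numbers (1208 total questions).
    public class Metrics {
        public static void main(String[] args) {
            int total = 1208;                                    // all questions
            int compilable = Math.round(0.7914f * total);        // answers that compile (956)
            int misused = Math.round(0.6297f * compilable);      // compilable answers with misuse (602)

            double misuseRate = (double) misused / compilable;   // ~62.97%
            double compilationRate = (double) compilable / total; // ~79.14%
            double overallMisuse = (double) misused / total;     // ~49.83%

            System.out.printf("%.2f%% %.2f%% %.2f%%%n",
                    100 * misuseRate, 100 * compilationRate, 100 * overallMisuse);
        }
    }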
(Figure 3: per-API results; only the rotated axis labels, i.e., the 18 Java API names from Table 1, are recoverable.)