
Can LLM Replace Stack Overflow?
A Study on Robustness and Reliability of Large Language Model Code Generation

Li Zhong, Zilong Wang
University of California, San Diego
[email protected], [email protected]

arXiv:2308.10335v5 [cs.CL] 27 Jan 2024

Abstract
Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and generate programming code. It has become a common practice for software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of code generation from LLMs have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. For example, the misuse of APIs in the generated code could lead to severe problems, such as resource leaks and program crashes. Existing code evaluation benchmarks and datasets focus on crafting small tasks such as programming questions in coding interviews, which deviates from the problems developers typically consult LLMs about. To fill the missing piece, we propose a dataset, ROBUSTAPI, for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from Stack Overflow on 18 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate them on current popular LLMs. The evaluation results show that, even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code were introduced into real-world software.

Introduction
The new era of language modeling arrives when large language models (LLMs) are capable of generating customized code according to the user's needs (Ye et al. 2023; OpenAI 2023a; Anil et al. 2023). It is not surprising that more and more software engineers choose to query large language models for answers to coding questions, such as generating a code snippet using certain APIs or detecting bugs in a few lines of code. Large language models are able to give more suitable and customized answers to the question than searching in online programming forums, such as Stack Overflow.

Such a fast pace conceals potential risks in the code generation of large language models. From the perspective of software engineering, the robustness and reliability of generated code have not yet been thoroughly studied, even though numerous efforts have been made to avoid syntax errors and improve semantic understanding in the generated code (Xu et al. 2022; Chen et al. 2021; Shen et al. 2023a; Luo et al. 2023). Unlike on online programming forums, the generated code snippets are not reviewed by community peers and thus suffer from API misuse, such as missing boundary checking in file reading and variable indexing, missing file stream closing, and failure in transaction completion. Even if the code samples are executable or functionally correct, such misuse can trigger serious potential risks in production, such as memory leaks, program crashes, and garbage collection failures, as shown in Figure 1. To make things worse, the programmers asking these questions could be vulnerable to the risk if they are novices to the APIs and cannot tell the violations in the generated code snippets. Therefore, it is essential to contemplate code reliability while evaluating the code generation of large language models.

To evaluate the code generation of large language models, most of the existing benchmarks focus on the functional correctness of the execution result of the generated code, which means the code is acceptable as long as it is functional for the user's purpose (Chen et al. 2021; Yin et al. 2018; Lu et al. 2021). We argue that a correct execution result is important, but it is not all that matters in the software development scenario. What engineers really need is a reliable code sample without potential risks in the long run. Moreover, the domain of most current programming datasets is far from software engineering: the data sources are mostly online coding challenge websites, such as Codeforces, Kattis, and Leetcode (Hendrycks et al. 2021; Austin et al. 2021). Although remarkable progress has been made, we argue that they fail to substantially help software development in practical scenarios.

To this end, we propose ROBUSTAPI, a comprehensive benchmark to evaluate the reliability and robustness of code generated by large language models, including a dataset of coding questions and an evaluator using the abstract syntax tree (AST) (Fischer, Lusiardi, and Von Gudenberg 2007). In the dataset, we target creating an evaluation setting that is close to real software development. Thus we collect representative questions about Java from Stack Overflow.

Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[Figure 1 contrasts an LLM-generated code snippet (Llama 2) for the question "How can I create a file with Java? I want to create a file through Java. What functions shall I use?" with the correct API usage. The generated snippet, File file = new File(filePath); if (!file.exists()) { file.createNewFile(); }, is marked syntax correct, function correct, and semantically aligned, but not reliable and robust. The correct usage wraps file.createNewFile() in a try-catch for IOException, which is required when the file already exists or the parent folder doesn't exist.]

Figure 1: The scenario where software engineers consult large language models for the answer to programming questions. The generated code snippet is not reliable and has potential risks in software development.
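For concreteness, here is a runnable rendering of the correct usage shown in Figure 1; the wrapper class and main method are our additions, not part of the figure:

import java.io.File;
import java.io.IOException;

public class CreateFileExample {
    // createNewFile() declares IOException (thrown, e.g., when the parent
    // folder does not exist), so the call is wrapped in try-catch.
    static void createNewFile(String filePath) {
        File file = new File(filePath);
        try {
            if (!file.exists()) {
                file.createNewFile();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        createNewFile("example.txt");
    }
}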

Java is one of the most popular programming languages and is widely used in software development because of its write once, run anywhere (WORA) feature¹. For each question, we provide a detailed description and the related Java API. We design templates to trigger large language models to generate the code snippet and the corresponding explanation. We also provide an evaluator that analyzes the generated code snippets using the abstract syntax tree (AST) and compares them with the expected API usage patterns. Following Zhang et al. (2018), we formalize the API usage patterns into structured call sequences, as shown in Figure 2. The structured call sequences present how these APIs can be properly used to eliminate potential system risks. Any violation of such structured call sequences is considered an API misuse from the perspective of software engineering.

We collect 1208 real questions from Stack Overflow which involve 18 representative Java APIs. We run experiments on closed-source language models (GPT-3.5 and GPT-4 (OpenAI 2023a)) as well as open-source language models (Llama-2 (Touvron et al. 2023) and Vicuna-1.5 (Chiang et al. 2023)). We use the default hyper-parameter settings of the models without extensive hyper-parameter tuning. We further design two experiment settings, zero-shot and one-shot, where none or one demonstration sample is provided in the prompt. We conduct a comprehensive analysis of the generated code and study the common API misuse cases of current large language models. We would like to bring up the important issue of API misuse in code generation by large language models, and provide a new dimension for evaluating large language models beyond the commonly used functional correctness. The main purpose of this benchmark is not to evaluate the functional correctness of the generated code; instead, we focus on reliability and robustness. We hope this work can facilitate future research on this topic and help create a more robust coding helper out of large language models to step further into real artificial general intelligence. We open-source our dataset and evaluator on GitHub². We summarize our contributions as follows.
• We propose a new benchmark, ROBUSTAPI, to evaluate the reliability and robustness of code generation by large language models. This is an important but not yet well-studied perspective for evaluating code quality apart from functional correctness.
• We provide a well-formalized evaluation framework including a dataset of Stack Overflow questions and an API usage checker using AST. We report the performance of popular large language models, including GPT-3.5, GPT-4, Llama-2, and Vicuna-1.5.
• We conduct a comprehensive analysis of the code generation performance of current large language models. We summarize the common API misuses for each model and point out promising directions for future research.

¹ https://en.wikipedia.org/wiki/Java_(programming_language)
² https://github.com/FloridSleeves/RobustAPI

Related Work
Code Quality of LLM-Synthesized Code  With the release of Copilot (Chen et al. 2021) and other commercial code assistant tools based on LLMs, the security and code quality of these tools have gradually drawn the attention of the research community. Yetistiren, Ozsoy, and Tuzun (2022) assess the quality of LLM-generated code from the aspects of compilation correctness, functional correctness, and code efficiency. Siddiq et al. (2022) studied code smells, i.e., poor design in code such as unusually long methods or duplicated code, in code generated by LLMs. Poesia et al. (2022) show that LLMs can make implementation errors in code, such as syntax errors or semantic errors deviating from users' intention. Jesse et al. (2023) studied simple, stupid bugs in Codex and other LLMs, showing that AI code assistants can help avoid some such simple bugs but have a higher chance of introducing bugs that are hard to detect. As for the security impact, Pearce et al. (2022) designed 89 security-sensitive scenarios for Copilot to complete code for users, showing that approximately 40% of the completed code is vulnerable. Perry et al. (2022) conducted the first large-scale user study to examine whether users interacting with AI code assistants write secure code. They find that those users wrote significantly less secure code while believing their code was secure. Sandoval et al. (2023) conducted a user study to assess the security of low-level code with pointer and array manipulations generated by AI-based coding assistants.
They find that, under this specific scenario, the assistants do not introduce more security bugs than humans. Liu et al. (2023) enlarge HumanEval (Chen et al. 2021) by generating test cases with higher coverage, which serve as an add-on to existing programming benchmarks, but the evaluation still focuses on functional correctness and simple programming questions far from software development. Shen et al. (2023b) evaluate the reliability of ChatGPT by testing on adversarial examples, which however uses a different meaning of 'reliability' in their context. In this paper, we refer to reliability as the ability of code to resist failure, high workload, and unexpected input.

Quality Assessment of Code in Online Forums  Existing literature in the software engineering field has investigated the quality of code from online forums and warned developers of the potential issues. Yang, Hussain, and Lopes (2016) find that the majority of code examples given in Stack Overflow answers cannot be compiled. Zhou and Walker (2016) pointed out that 43% of the posts they investigated contained deprecated APIs, while Fischer et al. (2017) found that 29% of the code contains security risks. In Zhang et al. (2018), the authors analyze the code by call sequence extraction and slicing, and compare it to manually validated API usage rules, concluding that 31% of the code examples in Stack Overflow answers contain API misuse and could produce unexpected behaviors.

Methodology
In this section, we describe ROBUSTAPI, a comprehensive benchmark to thoroughly evaluate the reliability and robustness of LLM-generated code. We describe the process of data collection and prompt generation when constructing the dataset. Then we present the API misuse patterns evaluated in ROBUSTAPI and discuss the potential consequences of violations. Finally, we introduce the static analysis method in ROBUSTAPI for detecting API usage violations, which leverages the abstract syntax tree and achieves higher accuracy in evaluating API misuse in code generated by LLMs compared to rule-based methods such as keyword matching.

Data Collection
To take advantage of existing research efforts in the software engineering field, we build ROBUSTAPI on the dataset from ExampleCheck (Zhang et al. 2018) as our starting point. ExampleCheck was proposed to study frequent Java API misuse in online Q&A forums. We select 18 popular Java APIs from the dataset, as shown in Table 1. These 18 APIs cover 6 domains: string processing, data structure, mobile development, crypto, I/O, and database operation. Then we crawl questions relevant to these APIs from Stack Overflow. We only select questions with online answers, and we keep the questions whose provided answer contains API misuse. In this way, we guarantee that the questions in ROBUSTAPI are answerable and non-trivial, so we can use them to effectively evaluate the LLMs' ability to answer coding questions on which humans are prone to make mistakes. After filtering, we get 1208 questions in total. The distribution of questions for each domain is shown in Table 1.

API                        Domain                  Conseq*   Github*
StringTokenizer.nextToken  String Process (307)    (iii)     13.3K
String.getBytes            String Process (307)    (iii)     88.1K
JsonElement.getAsString    String Process (307)    (iii)     4.4K
List.get                   Data Structure (404)    (iii)     2.7M
Map.get                    Data Structure (404)    (iii)     2.4M
Iterator.next              Data Structure (404)    (iii)     918K
ProgressDialog.dismiss     Mobile Develop (75)     (iii)     54K
TypedArray.getString       Mobile Develop (75)     (iv)      6.8K
ApplicationInfo.loadIcon   Mobile Develop (75)     (v)       3.6K
Activity.setContentView    Mobile Develop (75)     (v)       4.6K
Cipher.init                Crypto (10)             (iii)     66.3K
RandomAccessFile.write     I/O (390)               (i)       129K
BufferedReader.readLine    I/O (390)               (iii)     74.8K
PrintWriter.write          I/O (390)               (i)       1.1M
File.mkdirs                I/O (390)               (ii)      73.2K
File.createNewFile         I/O (390)               (i)       176K
FileChannel.write          I/O (390)               (i)       5.2K
SQLiteDatabase.query       Database (22)           (iv)      4K
Total                      (1208)                            7.8M

Table 1: 18 popular Java APIs in ROBUSTAPI. They are easily misused by developers according to the existing software engineering literature (Zhang et al. 2018). The number in parentheses after each domain is the number of questions in that domain. *Consequences: (i) data loss; (ii) file system corruption; (iii) program crash; (iv) resource leak; (v) user interface bug. *Github: occurrences of this API on Github.

After collecting the questions, we convert them into the JSON format with the following fields: {id, api, question, origin}. The id field contains the unique id we assign to each sample. The api field contains the API that we specifically instruct the large language models to use, as a question hint. The question field contains the title and description of the Stack Overflow question. The origin field contains the original URL of the sample.

Prompt Generation
In the prompt, we start with the task introduction and the required response format. Then we append the few-shot demonstrations on this API when conducting experiments in the few-shot settings. The demonstration examples satisfy our provided response format. Next, we append the question and the corresponding API hint for this question. This prompt simulates a user asking coding questions without providing any additional hints from the API documentation, which is a typical scenario when novice developers seek help from large language models. Due to the chat completion nature of state-of-the-art LLMs, we wrap the question and answer with special tags to instruct LLMs to generate answers to the questions. The prompt template is adapted from Patil et al. (2023), which helps LLMs follow a specific generation template so that we can extract more compilable code snippets from the response.
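The paper does not reproduce the exact template here. For illustration only, a prompt in this style might look like the following, where the tag names and wording are our assumptions rather than the template actually used in ROBUSTAPI:

    You are a coding assistant. Answer the question below with a Java
    code snippet and a short explanation, wrapped in the given tags.
    [QUESTION] How do I append a string to a text file? [/QUESTION]
    [API HINT] PrintWriter.write [/API HINT]
    [ANSWER] ...code and explanation... [/ANSWER]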
Demonstration Samples
Demonstration samples have been proven helpful to LLMs in understanding natural language. To thoroughly analyze LLMs' ability in code generation, we design two few-shot settings, one-shot-irrelevant and one-shot-relevant.

In the one-shot-irrelevant setting, we provide LLMs with an example using an irrelevant API (e.g., Arrays.stream). We assume this demonstration example helps eliminate syntax errors in the generated code.

In the one-shot-relevant setting, we provide LLMs with an example using the same API as the given question. The provided example contains a pair of question and answer. The question in the demo example is not present in the testing dataset, and we manually revise the answer to ensure that it contains no API misuse and that its semantics align well with the question.

Java API Misuse
When using the APIs provided by language libraries, developers need to follow the API usage rules in order to take full advantage of the intended API behavior. Violating these rules and misusing the APIs can result in unexpected behaviors in production. A typical example is file operation. When opening and writing a file through RandomAccessFile, two usage rules need to be enforced: (1) Reading the file could throw exceptions. If the buffer limit is reached before the expected bytes are read, the API throws IndexOutOfBoundsException; if the file is concurrently closed by other processes, the API throws ClosedChannelException. To deal with these exceptions, the correct implementation should enclose the API in try-catch blocks. (2) The file channel should be closed after usage. Otherwise, if the code snippet is inside a long-lasting program that concurrently runs in multiple instances, the file resources could run out. Therefore, the code needs to invoke the close API after all file operations. The correct usage is shown as follows:

Correct API Usage:
try {
    RandomAccessFile raf =
        new RandomAccessFile("/tmp/file.json", "r");
    byte[] buffer = new byte[1024 * 1024];
    int bytesRead = raf.read(buffer, 0, buffer.length);
    raf.close();
} catch (Exception e) {...}
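As a side note from us (not part of the ROBUSTAPI rules): the pattern above closes the file only when read() succeeds, since a thrown exception skips close(). The same two rules can be satisfied more defensively with try-with-resources, which closes the file automatically on both paths:

import java.io.RandomAccessFile;

public class ReadFileExample {
    public static void main(String[] args) {
        // try-with-resources closes raf automatically, even if read() throws.
        try (RandomAccessFile raf =
                 new RandomAccessFile("/tmp/file.json", "r")) {
            byte[] buffer = new byte[1024 * 1024];
            int bytesRead = raf.read(buffer, 0, buffer.length);
            System.out.println("read " + bytesRead + " bytes");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}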
In ROBUSTAPI, we summarized 41 API usage rules from the 18 APIs, which are validated against the documentation of these APIs (Zhang et al. 2018). These rules include: (1) the guard condition of an API, which should be checked before API calls (for example, check the result of File.exists() before File.createNewFile()); (2) the required call sequence of an API, whose calls should occur in a specific order (for example, call close() after File.write()); and (3) the control structures of an API (for example, enclose SimpleDateFormat.parse() in a try-catch structure).

Detecting API Misuse
Existing research in evaluating the code generated by LLMs usually uses test cases, which falls short when testing the reliability and robustness of code. To deal with this challenging problem, we use static analysis for ROBUSTAPI, which has relatively mature solutions in detecting API misuse (Zhang et al. 2018; Nguyen et al. 2014; Wang et al. 2013; Huang et al. 2023). To evaluate the API usage correctness in code, ROBUSTAPI detects API misuses against the API usage rules by extracting call sequences and control structures from the source code, as shown in Figure 2. The code checker first checks whether the code snippet is a bare snippet, a method, or a method of a class, so that it can enclose the code snippet and construct an abstract syntax tree (AST) from it. Then the checker traverses the AST to record all the method calls and control structures in order, which generates a call sequence. Next, the checker compares the call sequence against the API usage rules. It infers the instance type of each method call and uses the type and method as keys to retrieve the corresponding API usage rules. Finally, the checker computes the longest common sequence between the call sequence and the API usage rules. If the call sequence does not match the expected API usage rules, the checker reports an API misuse.

Experiment
Experiment Setup
In the experiments, we evaluate ROBUSTAPI on four LLMs: GPT-3.5 (OpenAI 2023a), GPT-4 (OpenAI 2023a), Llama-2 (Touvron et al. 2023), and Vicuna-1.5 (Chiang et al. 2023). We use the default hyper-parameter settings of each model without further extensive hyper-parameter tuning. All experiment results are Pass@1 unless specified. For all models, we evaluate three experiment settings:
• Zero-shot: No example is provided in the prompt. The prompt only contains the instruction and question.
• One-shot-irrelevant: ROBUSTAPI provides one example of an irrelevant task in the prompt.
• One-shot-relevant: ROBUSTAPI provides one example of the same API with the correct usage in the prompt.
The examples for the shot generations are manually written and double-checked by the authors. Then they are evaluated against the API usage checkers to make sure they are aligned with the API usage rules.

Evaluation Metrics
To quantitatively evaluate the reliability of the generated code, we define the following values, and our metrics are computed based on them. Supposing that we have N questions in our dataset, we divide them into three groups.
• Nmisuse: the number of cases where our API usage checker detects API usage violations.
• Npass: the number of cases where our API usage checker does not detect any API usage violations.
• Nnon-comp: the number of cases where the LLM fails to generate code or the generated code is not compilable.
[Figure 2 illustrates the workflow on an example snippet: (i) generate the AST for the code snippet generated by the LLM; (ii) compare the AST call sequence (TRY, RandomAccessFile(), RandomAccessFile().read(), END_BLOCK, CATCH, Exception.printStackTrace(), END_BLOCK) with the API usage rule (TRY, RandomAccessFile().read(), RandomAccessFile().close(), END_BLOCK, CATCH, END_BLOCK) via the longest common sequence; (iii) detect the mismatched pattern: RandomAccessFile().close() cannot be found in the AST call sequence, so a violation is reported.]

Figure 2: The workflow of our API checker. The API checker uses the static analysis method and analyzes the generated code with the abstract syntax tree (AST). An API misuse is detected when the AST call sequence and the API usage rule do not match.
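As a concrete illustration of the extraction-and-matching loop described above, consider the sketch below. This is our simplified sketch, not the released ROBUSTAPI implementation: it assumes the open-source JavaParser library, uses a plain subsequence test in place of the longest-common-sequence comparison, and omits control-structure markers and receiver-type inference.

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.expr.MethodCallExpr;
import java.util.ArrayList;
import java.util.List;

public class ApiMisuseCheckerSketch {
    // Enclose a bare snippet in a class and method so it parses as a unit.
    static CompilationUnit parseSnippet(String snippet) {
        return StaticJavaParser.parse(
            "class Wrapper { void wrapper() throws Exception {\n"
            + snippet + "\n} }");
    }

    // Traverse the AST and record method calls in source order,
    // producing a call sequence.
    static List<String> callSequence(CompilationUnit unit) {
        List<String> sequence = new ArrayList<>();
        unit.findAll(MethodCallExpr.class)
            .forEach(call -> sequence.add(call.getNameAsString()));
        return sequence;
    }

    // The rule is satisfied if it appears as a subsequence of the calls.
    static boolean followsRule(List<String> calls, List<String> rule) {
        int matched = 0;
        for (String call : calls) {
            if (matched < rule.size() && call.equals(rule.get(matched))) {
                matched++;
            }
        }
        return matched == rule.size();
    }

    public static void main(String[] args) {
        String snippet =
            "RandomAccessFile raf = new RandomAccessFile(\"f\", \"r\");\n"
            + "byte[] buffer = new byte[1024];\n"
            + "raf.read(buffer, 0, buffer.length);\n";  // close() missing
        List<String> rule = List.of("read", "close");
        if (!followsRule(callSequence(parseSnippet(snippet)), rule)) {
            System.out.println("API misuse: expected call sequence " + rule);
        }
    }
}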

Based on these values, we define our metrics.
• API Misuse Rate = Nmisuse / (Nmisuse + Npass): the proportion of misuse cases among the compilable code snippets. It reveals how reliable the generated code is after the users filter out the non-compilable cases.
• Compilation Rate = (Nmisuse + Npass) / N: the proportion of compilable cases among all questions. It is necessary to consider the percentage of compilable cases in order to eliminate the influence of extreme situations, such as when only a few compilable code snippets are generated.
• Overall API Misuse Percentage = Nmisuse / N: the proportion of misuse cases among all questions.

[Figure 3 plots, for each model (GPT-3.5, GPT-4, Llama 2, Vicuna) under the zero-shot, one-shot-irrelevant, and one-shot-relevant settings, the percentage of answers that are not compilable, that contain API misuse, and that pass.]

Figure 3: Result of checking API usage from LLMs. Red bars are the percentage of answers that contain API misuse (the lower, the better). The white bars in dotted lines are the percentage of code answers that are not compilable.

Research Questions
We conduct a series of experiments on state-of-the-art LLMs based on ROBUSTAPI, which demonstrate the usability and effectiveness of ROBUSTAPI. The experiments provide insights into the models' ability to answer real-world coding questions and into the robustness and reliability of these answers regarding API misuse problems. In the experiments, we try to answer the following questions:
• Q1: What are the API misuse rates in answering real-world coding questions by these LLMs?
• Q2: How do irrelevant shots affect the results?
• Q3: Can correct API usage examples reduce the misuse?
• Q4: Why does LLM-generated code fail the API usage check?

API Misuse Rate
Firstly, we present the API misuse rate of each model based on ROBUSTAPI on the left of Figure 3. In this figure, the higher the API misuse rate, the worse the code reliability and robustness of the large language model. The API misuse rate is calculated by dividing the number of answers that compile and contain API misuses by the number of all answers that compile. From the evaluation results, all the evaluated models suffer from API misuse problems, even the state-of-the-art commercial models like GPT-3.5 and GPT-4. In the zero-shot setting, Llama has the lowest API misuse rate. However, this is partially because most of Llama's answers do not include any code. A counter-intuitive finding is that GPT-4 actually has a higher API misuse rate than GPT-3.5, though the coding ability of GPT-4 is claimed to be "40% more advanced than its predecessor, GPT-3.5" (OpenAI 2023b). We also evaluate a code-specialized large language model, DeepSeek-Coder (Piplani and Bamman 2018), which is trained on a variety of programming languages including Java and surpasses many existing code LLMs. We report the results of deepseek-coder-6.7b-base and deepseek-coder-6.7b-instruct. We observe that the code-specialized large language model can generate more compilable samples. However, its API misuse rate is not significantly better than that of the other models.
                         Zero-shot                      One-shot-irrelevant            One-shot-relevant
Model                    Misuse  Compilable  Overall    Misuse  Compilable  Overall    Misuse  Compilable  Overall
                         Rate ↓  Rate ↑      Misuse ↓   Rate ↓  Rate ↑      Misuse ↓   Rate ↓  Rate ↑      Misuse ↓
GPT 3.5                  62.97%  79.14%      49.83%     68.09%  91.06%      62.00%     38.56%  80.71%      31.13%
GPT 4                    68.81%  90.23%      62.09%     70.38%  91.39%      64.32%     54.40%  90.40%      49.17%
Llama 2*                 7.34%*  9.02%*      0.66%*     61.36%  80.13%      49.17%     64.47%  72.93%      47.02%
Vicuna 1.5               45.66%  37.17%      16.97%     57.85%  83.86%      48.51%     42.53%  64.24%      27.32%
ds-coder-6.7b-base       41.55%  40.65%      16.89%     75.60%  95.90%      72.43%     64.12%  67.14%      43.05%
ds-coder-6.7b-instruct   47.52%  50.00%      23.76%     59.04%  96.61%      57.04%     38.40%  86.01%      33.03%

Table 2: Performance of each LLM on ROBUSTAPI. ↓: the lower the better. ↑: the higher the better. Misuse Rate is the proportion of misuse cases among the compilable cases; Compilation Rate is the proportion of compilable cases among all questions; Overall Misuse is the proportion of misuse cases among all questions. *Though Llama 2 has a low misuse rate, its compilation rate is significantly lower than that of the other models.
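As a consistency check on how the three metrics relate, note that Overall Misuse = Misuse Rate × Compilation Rate. Taking the GPT 3.5 zero-shot row with N = 1208 questions: a Compilation Rate of 79.14% gives Nmisuse + Npass ≈ 0.7914 × 1208 ≈ 956 compilable answers; a Misuse Rate of 62.97% gives Nmisuse ≈ 0.6297 × 956 ≈ 602; and the Overall Misuse is then 602 / 1208 ≈ 49.8%, matching the reported 49.83% up to rounding.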

This indicates that, while the code generation ability of large language models has largely improved nowadays, the reliability and robustness of the generated code in real-world production arises as an unnoticed issue, and the space for improvement on this problem is huge. The execution time of the static analysis is shown in Table 3. The time differences are due to the different coding styles of each LLM; all runs finish within 7 minutes.

GPT 3.5   GPT 4    Llama 2   Vicuna 1.5   DeepSeek-Coder
6m 31s    6m 56s   6m 36s    6m 19s       6m 36s

Table 3: Execution time of static analysis in ROBUSTAPI.

Finding 1. Answers to real-world coding questions from the state-of-the-art large language models widely have API misuse problems.

One-Shot-Irrelevant Results
In this experiment, ROBUSTAPI gives a pair of question and answer as an example to show the model how to follow the template required by the instructions. The example contains no information about the API usage checked by ROBUSTAPI. The result is shown in the middle of Figure 3. For most models, the irrelevant shot does not significantly reduce the API misuse rate but, on the contrary, slightly increases it. One possible reason is that the irrelevant shot provided to the large language models actually encourages the models to give a lengthy code solution, which increases the chance of API misuse. The API misuse rate of Llama increases significantly after adding the irrelevant shot because it produces more valid answers that contain code snippets. Overall, adding an irrelevant shot triggers the large language models to generate more valid answers, which enables a better evaluation of the code reliability and robustness.

Finding 2. Among all the answers containing compilable code, 57-70% of the LLM answers contain API misuse, which could lead to severe consequences in production.

Finding 3. Irrelevant shot examples do not help decrease the API misuse rate but trigger more valid answers, which proves effective for benchmarking the model performance.

One-Shot-Relevant Results
In this experiment, ROBUSTAPI adds a manually written shot in the prompt, which performs a different task but uses the same API. This gives hints to LLMs on how to use these APIs correctly. From the results, after adding the correct usage shot, the API misuse rates of GPT-3.5, GPT-4, and Vicuna significantly drop, which indicates an effective improvement under this experiment setting. As for Llama, the relevant shot does not improve the performance. This experiment shows that some LLMs can effectively 'learn' the correct API usage from the shot and follow it. However, since existing language models are trained with data from code repositories, if the training datasets contain a large number of API violations, the language models are prone to generate code with API misuses, which explains the high API misuse rates in the zero-shot and one-shot-irrelevant evaluations. We show Pass@k results of the one-shot-relevant setting in Table 4.

Pass@k    Misuse Rate   Compilation Rate   Overall Misuse
Pass@1    39.06%        76.08%             29.72%
Pass@5    21.98%        93.79%             20.61%
Pass@10   16.51%        96.27%             15.89%

Table 4: Pass@k results of GPT 3.5 (T=1, one-shot-relevant).

Finding 4. Some LLMs can learn from correct API usage examples, which reduces the API misuse rate.

Robustness Analysis
We evaluate the benchmark on GPT 3.5 under different temperatures (Table 5). From the results, changing the temperature does not significantly change the misuse rate or the compilation rate. To study the effect of different prompting methods, we study how the API misuse rate changes when we replace the one-shot examples with the API usage rules. We feed the symbolized rules to ChatGPT and get the rules in natural language. We add the usage rules as part of the prompts and evaluate GPT-3.5 with ROBUSTAPI. The results are shown in Table 6, which indicates that the API usage rules might not help reduce the API misuse rate compared to one-shot-relevant examples.
Temperature   Misuse Rate   Compilation Rate   Overall Misuse
T = 0         38.56%        80.71%             31.13%
T = 0.5       39.77%        80.13%             31.87%
T = 1.0       39.06%        76.08%             29.72%

Table 5: Results of GPT 3.5 with different temperatures (Pass@1, one-shot-relevant).

Prompt              Misuse Rate   Compilation Rate   Overall Misuse
API Usage Rule      65.01%        79.78%             51.86%
One-shot-relevant   38.56%        80.71%             31.13%

Table 6: Results of GPT 3.5 with API usage rules (T=0, Pass@1).

Finding 5. Increasing the temperature or replacing one-shot examples with API rules does not affect the API misuse rate significantly.

Case Study: API Misuse in GPT-3.5
Taking GPT-3.5 as an example, we show a typical sample where GPT-3.5 reacts differently under different experiment settings. This question asks the model to help write a string to a file using the API PrintWriter.write. Under the zero-shot and one-shot-irrelevant settings, the answers differ slightly, but both misuse the API by not catching exceptions. After giving the model the correct API usage example, the model learns how to use the API and responds correctly.

Zero Shot:
PrintWriter writer = new PrintWriter("f.txt", true);
writer.write("text to append");
writer.close();

One Irrelevant Shot:
String text = "Hello, World!";
PrintWriter writer = new PrintWriter("f.txt", true);
writer.write(text);
writer.close();

One Relevant Shot:
try {
    String text = "Hello, World!";
    PrintWriter writer = new PrintWriter("f.txt", true);
    writer.write(text);
} catch (IOException e) {
    e.printStackTrace();
}
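For reference, a variant that handles the checked exception and reliably closes the writer might look like the following. This is our sketch, not taken from the paper; note that appending requires FileWriter's (String, boolean) constructor, since PrintWriter itself has no such constructor:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class AppendExample {
    public static void main(String[] args) {
        // FileWriter's second argument enables append mode; the writer is
        // closed automatically by try-with-resources.
        try (PrintWriter writer =
                 new PrintWriter(new FileWriter("f.txt", true))) {
            writer.write("text to append");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}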
Error Analysis
In this section, we discuss the answers from LLMs that cannot pass the API usage check in the ROBUSTAPI evaluation. There are two categories of failure cases: cases that are not compilable, and cases that are compilable but contain API misuses, as shown in Figure 3. We refer to the ability to be compiled successfully as compilability. The compilation failure rate is calculated by dividing the number of cases that cannot be compiled by the total number of cases in the benchmark. GPT-4 performs the best among all the models regarding compilability: less than 10% of its answers cannot be compiled across all experiment settings. Adding a few shots to the prompts helps reduce the compilation failure rate in the evaluation results for all models. As for the API misuse rate, we dive deeper into the APIs that LLMs are prone to misuse. Figure 4 details the misuse rate of each API for each LLM. Among all APIs, the Android development API Activity.setContentView has the lowest misuse rate across all the models.

[Figure 4 is a heatmap titled "Misuse Rate for Each API in Each Model", with one row per model and shot setting (zero-shot, one-shot-irrelevant, and one-shot-relevant for GPT-3.5, GPT-4, Llama 2, and Vicuna 1.5) and one column per API from Table 1; cell color encodes the misuse rate from 0 to 100.]

Figure 4: Misuse rate of each API by each LLM. The deeper the color, the higher the misuse rate. G3.5, G4, LMA, Vic are short for GPT3.5, GPT4, Llama2, Vicuna1.5.

Discussion
Extending to Other Languages  ROBUSTAPI focuses on Java API usage since Java is one of the most widely used languages in software development and has a special niche in the web and Android ecosystems, so its API misuses may cause more serious problems in real applications. Theoretically, the method proposed in this paper can also be applied to other languages like Python.

Future Work  The API misuse problem raised in our research can motivate many further research directions. The first is how to improve the quality of generated code beyond functionality alignment. To achieve this goal, in-context learning, fine-tuning, and pre-training would be critical to improving existing models. Besides, other online code communities like Github could also be a useful resource for evaluating code models, as proposed in a recent work (Jimenez et al. 2023). We believe that evaluating and improving LLMs from the perspective of real-world software development is a demanding and important problem.

Conclusion
In this paper, we propose a benchmark, ROBUSTAPI, to study the API misuse behaviors in code generated by LLMs. From the benchmark results on state-of-the-art models, we find that API misuse widely exists in large language models even when the code is executable and aligned with users' intention. Under different experiment settings, we explore effective methods for benchmarking and improving the API misuse rate of LLMs. To inspire and accelerate future research on this problem, we open-source the dataset and benchmark at https://github.com/FloridSleeves/RobustAPI.
Acknowledgments
The authors sincerely appreciate the reviewers and chairs of AAAI for their constructive and insightful comments. Their expertise and thorough reviews have significantly contributed to the enhancement of this paper.

References
Anil, R.; Dai, A. M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
Austin, J.; Odena, A.; Nye, M.; Bosma, M.; Michalewski, H.; Dohan, D.; Jiang, E.; Cai, C.; Terry, M.; Le, Q.; et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H. P. d. O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
Fischer, F.; Böttinger, K.; Xiao, H.; Stransky, C.; Acar, Y.; Backes, M.; and Fahl, S. 2017. Stack overflow considered harmful? The impact of copy&paste on Android application security. In 2017 IEEE Symposium on Security and Privacy (SP), 121–136. IEEE.
Fischer, G.; Lusiardi, J.; and Von Gudenberg, J. W. 2007. Abstract syntax trees and their role in model driven software development. In International Conference on Software Engineering Advances (ICSEA 2007), 38–38. IEEE.
Hendrycks, D.; Basart, S.; Kadavath, S.; Mazeika, M.; Arora, A.; Guo, E.; Burns, C.; Puranik, S.; He, H.; Song, D.; et al. 2021. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938.
Huang, H.; Shen, B.; Zhong, L.; and Zhou, Y. 2023. Protecting data integrity of web applications with database constraints inferred from application code. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 632–645.
Jesse, K.; Ahmed, T.; Devanbu, P. T.; and Morgan, E. 2023. Large Language Models and Simple, Stupid Bugs. arXiv preprint arXiv:2303.11455.
Jimenez, C. E.; Yang, J.; Wettig, A.; Yao, S.; Pei, K.; Press, O.; and Narasimhan, K. 2023. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770.
Liu, J.; Xia, C. S.; Wang, Y.; and Zhang, L. 2023. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210.
Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664.
Luo, Z.; Xu, C.; Zhao, P.; Sun, Q.; Geng, X.; Hu, W.; Tao, C.; Ma, J.; Lin, Q.; and Jiang, D. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568.
Nguyen, H. A.; Dyer, R.; Nguyen, T. N.; and Rajan, H. 2014. Mining preconditions of APIs in large-scale code corpus. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 166–177.
OpenAI. 2023a. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
OpenAI. 2023b. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
Patil, S. G.; Zhang, T.; Wang, X.; and Gonzalez, J. E. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.
Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; and Karri, R. 2022. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP), 754–768. IEEE.
Perry, N.; Srivastava, M.; Kumar, D.; and Boneh, D. 2022. Do users write more insecure code with AI assistants? arXiv preprint arXiv:2211.03622.
Piplani, T.; and Bamman, D. 2018. DeepSeek: Content based image search & retrieval. arXiv preprint arXiv:1801.03406.
Poesia, G.; Polozov, O.; Le, V.; Tiwari, A.; Soares, G.; Meek, C.; and Gulwani, S. 2022. Synchromesh: Reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227.
Sandoval, G.; Pearce, H.; Nys, T.; Karri, R.; Garg, S.; and Dolan-Gavitt, B. 2023. Lost at C: A user study on the security implications of large language model code assistants. arXiv preprint arXiv:2208.09727.
Shen, B.; Zhang, J.; Chen, T.; Zan, D.; Geng, B.; Fu, A.; Zeng, M.; Yu, A.; Ji, J.; Zhao, J.; et al. 2023a. PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback. arXiv preprint arXiv:2307.14936.
Shen, X.; Chen, Z.; Backes, M.; and Zhang, Y. 2023b. In ChatGPT we trust? Measuring and characterizing the reliability of ChatGPT. arXiv preprint arXiv:2304.08979.
Siddiq, M. L.; Majumder, S. H.; Mim, M. R.; Jajodia, S.; and Santos, J. C. 2022. An Empirical Study of Code Smells in Transformer-based Code Generation Techniques. In 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM), 71–82. IEEE.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wang, J.; Dang, Y.; Zhang, H.; Chen, K.; Xie, T.; and Zhang, D. 2013. Mining succinct and high-coverage API usage patterns from source code. In 2013 10th Working Conference on Mining Software Repositories (MSR), 319–328. IEEE.
Xu, F. F.; Alon, U.; Neubig, G.; and Hellendoorn, V. J. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 1–10.
Yang, D.; Hussain, A.; and Lopes, C. V. 2016. From query to usable code: An analysis of Stack Overflow code snippets. In Proceedings of the 13th International Conference on Mining Software Repositories, 391–402.
Ye, J.; Chen, X.; Xu, N.; Zu, C.; Shao, Z.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y.; et al. 2023. A comprehensive capability analysis of GPT-3 and GPT-3.5 series models. arXiv preprint arXiv:2303.10420.
Yetistiren, B.; Ozsoy, I.; and Tuzun, E. 2022. Assessing the quality of GitHub Copilot's code generation. In Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering, 62–71.
Yin, P.; Deng, B.; Chen, E.; Vasilescu, B.; and Neubig, G. 2018. Learning to mine aligned code and natural language pairs from Stack Overflow. In Proceedings of the 15th International Conference on Mining Software Repositories, 476–486.
Zhang, T.; Upadhyaya, G.; Reinhardt, A.; Rajan, H.; and Kim, M. 2018. Are code examples on an online Q&A forum reliable? A study of API misuse on Stack Overflow. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), 886–896. IEEE.
Zhou, J.; and Walker, R. J. 2016. API deprecation: A retrospective analysis and detection method for code examples on the web. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 266–277.
