An Initial Investigation of ChatGPT Unit Test Generation Capability∗

Vitor H. Guilherme† and Auri M. R. Vincenzi∗
Federal University of São Carlos, São Carlos, SP, Brazil
ABSTRACT
Context: Software testing ensures software quality, but developers often disregard it. Automated test generation is pursued to reduce the consequences of overlooked test cases in a software project. Problem: In the context of Java programs, several tools can completely automate the generation of unit test sets. Additionally, studies have been conducted to offer evidence regarding the quality of the generated test sets. However, these tools rely on machine learning and other AI algorithms rather than incorporating the latest advancements in Large Language Models (LLMs). Solution: This work aims to evaluate the quality of Java unit tests generated by an OpenAI LLM, using metrics like code coverage and mutation score. Method: For this study, 33 programs used by other researchers in the field of automated test generation were selected; this choice establishes a baseline for comparison purposes. For each program, 33 unit test sets were generated automatically, without human interference, by changing OpenAI API parameters. After executing each test set, metrics such as line coverage, mutation score, and test execution success rate were collected to evaluate the efficiency and effectiveness of each set. Summary of Results: Our findings reveal that the OpenAI LLM test sets demonstrated performance similar to that of the traditional automated Java test generation tools used in previous research across all evaluated aspects. These results are particularly remarkable considering the simplicity of the experiment and the fact that the generated test code did not undergo human analysis.

CCS CONCEPTS
• Software and its engineering → Software verification and validation; Empirical software validation; Software defect analysis.

KEYWORDS
software testing, experimental software engineering, automated test generation, coverage testing, mutation testing, testing tools

ACM Reference Format:
Vitor H. Guilherme and Auri M. R. Vincenzi. 2023. An initial investigation of ChatGPT unit test generation capability. In 8th Brazilian Symposium on Systematic and Automated Software Testing (SAST 2023), September 25–29, 2023, Campo Grande, MS, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3624032.3624035

∗ This work is partially supported by the Brazilian Funding Agencies CAPES - Grant 001, FAPESP - Grant nº 2019/23160-0, and CNPq.
† All authors contributed equally to the paper content.

1 INTRODUCTION
Unit testing is an essential practice in software development to ensure the correctness and robustness of individual code units. These tests, typically written by developers, play a crucial role in identifying defects and validating the expected behavior of software components. DevOps pipelines are strongly based on the quality of the unit tests. However, manually generating comprehensive unit tests can be challenging and time-consuming, often requiring significant effort and expertise. To address these challenges, researchers have explored automated approaches for test generation [1, 8], leveraging advanced techniques and tools.

In this work, we focus on evaluating the quality of Java unit tests generated by an OpenAI Large Language Model (LLM), a family of models that has demonstrated remarkable capabilities in generating tests across various domains [23, 28, 29]. Our evaluation uses three key quality parameters: code coverage, mutation score, and the build-and-execute success rate of test sets. Code coverage quantifies the extent to which the tests exercise different parts of the code, indicating the thoroughness of the test suite. The mutation score measures the ability of the tests to detect and kill mutated versions of the code, providing insights into their fault-detection capability [2]. Finally, the build-and-execute success rate measures the reliability of the generated tests.

To conduct a thorough and comprehensive analysis, this study compares the quality of the unit tests generated by the selected LLM with those produced by other prominent Java test generation tools, such as EvoSuite1. This comparative evaluation aims to determine whether LLMs can outperform state-of-the-art Java test generation tools and leverages relevant data from the research of Araujo and Vincenzi [3] to provide a meaningful benchmark for comparison. By assessing the effectiveness and performance of the LLMs against established tools, we can gain valuable insights into their capabilities and potential advantages in generating high-quality unit tests for Java programs.

This study is part of an ongoing project that aims to support mutation testing in a fully automated way2.

1 https://www.evosuite.org/
2 Mutation-Based Software Testing with High Efficiency and Low Technical Debt: Automated Process and Free Support Environment Prototype (FAPESP Grant Nº 2019/23160-0)
In this sense, we are investigating tools that allow us to generate test cases in a fully automated way, without human intervention/interaction.

Therefore, we can summarize this paper's contributions:
• To provide evidence of the quality of LLMs in generating unit test sets for Java programs concerning their efficiency and efficacy;
• To evaluate the improvements a combination of test sets can achieve over individual test sets concerning efficiency and efficacy;
• To collect data for supporting further comparison of different LLMs on generating Java unit test sets;
• To develop and make available a set of artifacts for easing the experimentation for different sets of programs.

The structure of the rest of this paper is as follows: We outline the essential subjects for comprehension of this paper in Section 2. In Section 3, we touch on other studies that are related to ours and highlight the differences. The design of our experiment, along with our choices of programs and tools, is detailed in Section 4. Section 5 displays the data we gathered and the subsequent analysis. A discussion of the outcomes derived from the collected data is provided in Section 6. We then discuss potential risks that may affect our experiment results in Section 7. Finally, in Section 8, we wrap up the paper by indicating possible future research directions informed by this study and the data collected.

2 BACKGROUND
This section explains software testing, automatic test data generation, and large language models so that the rest of the paper can be understood.

2.1 Software testing
In the sphere of software development, it is crucial to ensure the robustness and reliability of a program. A primary technique used for this goal is software testing, a systematic process that checks the functionality and accuracy of a software application. However, as software systems grow increasingly complex and versatile, covering a broadening range of use cases and inputs makes software testing an arduous task.

To analyze the effectiveness of a test set, various criteria come into play, two of which are line coverage and mutation testing [20]. Code coverage entails analyzing the extent to which the test suite exercises the internal structure of the software product, such as its statements or conditions. The goal is to achieve complete or near-complete coverage, ensuring that each statement or branch in the code has been executed at least once during testing.

On the other hand, mutation testing evaluates the test suite's ability to identify and "kill" mutated versions of the software [7]. Mutation testing can be seen as a fault model representation [2]. These mutations involve making small syntactic changes to the code to simulate potential faults. A successful mutation test is one in which the test suite effectively detects these mutations, highlighting its proficiency in identifying vulnerabilities and potential issues within the software. Both code coverage and mutation testing are used in this study as metrics to measure the reliability and thoroughness of the automatically generated test sets. Moreover, these are traditional metrics used in other studies, like the one developed by Araujo and Vincenzi [3], which we will use as a baseline.
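To make the mutation idea concrete, the sketch below shows a single hypothetical mutant and a test that kills it. It is only a language-agnostic illustration written in Python; the paper's experiments target Java classes tested with JUnit and mutated with PITest, and the function is_adult and the specific mutation are invented for this example.

```python
# Original unit under test (invented for illustration).
def is_adult(age):
    return age >= 18

# A typical conditionals-boundary mutant: '>=' is replaced by '>'.
def is_adult_mutant(age):
    return age > 18

def test_is_adult_boundary():
    # The boundary-value case kills the mutant: the original returns
    # True for age == 18, while the mutant returns False.
    assert is_adult(18) is True
    assert is_adult(17) is False
```

A test suite without the age == 18 case would let this mutant survive, which is exactly the kind of weakness a low mutation score exposes.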
2.2 Traditional automatic test data generation
The automatic generation of test data is an undecidable problem from a computational perspective. While random testing and search-based strategies are commonly employed, research has shown that the problem remains unsolved when using traditional tools that rely on these approaches [3, 25]. The shortcomings of traditional test data generators become apparent when attempting to achieve all testing objectives, such as complete code coverage or eliminating all mutants [1]. Consequently, pursuing comprehensive and efficient test data generation techniques continues to be an ongoing challenge in the dynamic field of software testing.

Even EvoSuite [9], the state-of-the-art unit test generation tool for Java, produces test sets that reach low mutation scores in traditional competitions [22, 26]. Other tools that also employed search-based algorithms, like Palus [30] and JTExpert [21], have been discontinued. There are also tools that employ a random generation approach, like Randoop [18], that are still being updated.

2.3 LLM and Software Engineering
LLMs, like ChatGPT3, are state-of-the-art language models based on the Transformer architecture [24]. They are designed to process and understand human language, enabling machines to generate coherent and contextually relevant text. These models have been trained on vast amounts of language data, allowing them to capture intricate patterns and relationships in language usage. As a result, they demonstrate impressive capabilities in tasks such as text generation, translation, question answering, and even software-related activities.

Ma et al. [16] comprehensively explore ChatGPT's applicability and potential in the software engineering field. The authors examine various tasks, including code generation, code summarization, bug detection, and code completion, to evaluate the performance of ChatGPT. Through a rigorous investigation and comparison with existing software engineering tools and techniques, the study reveals both the strengths and limitations of ChatGPT in different software engineering scenarios. The findings provide valuable insights into the capabilities of ChatGPT and offer guidance for leveraging its potential to improve software development practices while highlighting areas where further advancements are needed.

White et al. [27] also explore the potential applications of ChatGPT in various software engineering tasks. The researchers introduce a collection of prompt patterns specifically designed to leverage ChatGPT's language generation capabilities for code quality improvement, refactoring, requirements elicitation, and software design tasks. Through experiments and case studies, they demonstrate the effectiveness of using ChatGPT with these prompt patterns in aiding developers and software engineers in their day-to-day activities. The article highlights the versatility of ChatGPT as a tool for supporting software engineering practices and fostering better code development and design.

3 https://chat.openai.com/
2.4 LLM for automatic test data generation
Section 3 explores the possibility of leveraging LLMs for automatic test data generation. These studies involved exploratory investigations into using LLMs to generate test data across different testing phases, from unit to end-to-end testing. Notably, the context provided to the LLM was the only aspect that changed during these experiments.

In the case of unit testing, the LLM was presented with code snippets as input [14, 23, 28, 29]. For instance, a prompt could be formulated as follows:

"Given the code snippet provided, please generate test cases to cover all possible scenarios and branches within the code."

The LLM then utilized its language generation capabilities to produce comprehensive test data sets that catered to various testing scenarios.

On the other hand, for end-to-end testing, the LLM was supplied with a description of the system's functional specifications [19] or a GUI [15]. The prompt may have asked the LLM to:

"Generate test cases that validate the entire system's functionality based on the provided functional specification."

The results of these exploratory studies demonstrated the promising potential of LLMs in automating the test data generation process, streamlining testing efforts, and enhancing software quality. By tailoring the input context to the LLM's capabilities, it was possible to obtain effective test cases for different testing phases, further showcasing the versatility and adaptability of LLMs in software testing.

3 RELATED WORK
The field of test generation has witnessed significant advancements in recent years, with researchers exploring innovative approaches to automate the process and enhance software quality assurance practices. Among these emerging techniques, one notable area of exploration is the use of LLMs for test generation. This section provides a comprehensive overview of the existing literature investigating the application of these powerful tools in test generation.

Li et al. [14] introduce a novel approach to detecting failure-inducing test cases using ChatGPT. By leveraging the model's ability to understand natural language and generate coherent responses, they propose an interactive debugging technique that allows developers to converse with ChatGPT to identify test cases that are likely to trigger failures. Through experiments on real-world software projects, they demonstrate the effectiveness of their approach in improving fault localization and aiding in the debugging process, highlighting its potential to enhance quality assurance practices.

Yuan et al. [29] explore the application of ChatGPT for automating unit test generation. The researchers evaluate the performance of ChatGPT in generating meaningful and effective unit tests by comparing them with existing test-generation tools. They also propose a novel approach to enhance ChatGPT's ability to generate high-quality unit tests by incorporating reinforcement learning techniques. Through rigorous experimentation and evaluation on various code bases, the authors demonstrate the potential of ChatGPT as a promising tool for automating the labor-intensive task of unit test generation, highlighting its ability to improve software testing.

Siddiq et al. [23] investigate the efficacy of large language models, specifically GPT-3, in generating unit tests for software programs. The study explores the ability of GPT-3 to understand the requirements of software functionalities and generate relevant test cases. The authors analyze the quality, diversity, and coverage of the generated unit tests through experiments conducted on real-world projects, comparing them with manually written tests. The findings highlight the potential of large language models in automated unit test generation but also reveal certain limitations and challenges that need to be addressed for more effective and reliable results. The research contributes to understanding the capabilities and limitations of large language models in the context of unit testing and provides insights for further advancements in this area.

Xie et al. [28] present ChatUniTest, a tool that allows developers to interact with ChatGPT in a conversational manner to generate unit tests for their code. By formulating test generation as a dialogue-based problem, developers can provide natural language prompts to ChatGPT, which then responds with relevant test case suggestions. The article discusses the implementation details of ChatUniTest and evaluates its effectiveness through experiments on open-source projects. The results demonstrate that ChatUniTest successfully generates meaningful unit tests, assisting developers in improving software quality and productivity. The study highlights the potential of ChatGPT in the context of automated unit test generation and presents an innovative approach for facilitating the software testing process.

Liu et al. [15] explore the application of GPT-3 for automated GUI testing in the context of mobile applications. The study proposes an innovative approach where GPT-3 is utilized as a conversational agent to interact with mobile apps and generate test cases. A series of experiments conducted on various real-world mobile apps demonstrate the feasibility of GPT-3 in performing human-like GUI testing. The approach achieves high code coverage and successfully detects critical issues, showcasing the potential of leveraging GPT-3 for efficient and effective automated GUI testing of mobile applications. The findings highlight the capabilities of GPT-3 in the domain of mobile app testing, opening avenues for further advancements in automated testing techniques.

Considering the studies carried out so far, the majority explore the use of ChatGPT in an interactive way. We intend to investigate the ChatGPT test generation capability in a fully automated setting, without human intervention, interaction, or correction of test cases, considering a possible scenario of no-touch testing [1, 8]. In this sense, we consider our study not directly comparable to the previous ones.

4 EXPERIMENT DESIGN
This paper evaluates the quality of test sets automatically generated by an LLM. A set of Java programs was carefully selected to accomplish this, and multiple JUnit4 test sets were generated using the LLM. We use the OpenAI API5 (Application Programming Interface) and develop a Python script for interacting with the model via the API. The generated test sets are evaluated based on code coverage, mutation score, and build and execution success rate using selected tools.

4 https://junit.org/
The collected data will be summarized and analyzed using simple statistics to compare the test sets generated by the LLM with the ones generated by other automated test-generation tools. Figure 1 illustrates the experiment workflow and the steps involved in the evaluation process.

[Figure 1: Experiment workflow — each Java source program is analyzed by a source-code metric tool (metric report) and submitted to the OpenAI API (gpt-3.5-turbo) while the temperature parameter is varied from 0.0 to 1.0; the generated test sets are built with Maven and evaluated, producing one coverage/mutation report per temperature value.]

Higher temperature values lead the model to produce more random and creative responses. On the other hand, a lower temperature, like 0.2, produces more focused and deterministic output, favoring predictable and conservative responses. By adjusting the temperature parameter, users can fine-tune the balance between generating creative and coherent text, enabling them to obtain the desired output level for their specific application or task.

We conducted the experiment using the full range of temperature values to investigate the variation in the results we would obtain. By exploring the entire spectrum of available temperature values, we aimed to identify the most suitable setting to yield the best results for our specific test generation requirements. Because of the randomness of the model, especially with higher temperature values, we generated 3 test sets for each temperature value and program, resulting, in the best case, in 33 test sets per program.
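The text above only describes the scripted interaction with the OpenAI API at a high level. The sketch below is an illustrative reconstruction of that temperature sweep, not the authors' actual script: it assumes the pre-1.0 openai Python package (ChatCompletion interface) and a hypothetical helper save_test_class, and the base_prompt parameter stands in for the prompts shown later in Figures 2 and 3.

```python
import openai  # assumes the pre-1.0 client; openai.api_key must be set beforehand

TEMPERATURES = [round(t * 0.1, 1) for t in range(11)]  # 0.0, 0.1, ..., 1.0
SAMPLES_PER_TEMPERATURE = 3                            # 3 x 11 = 33 test sets per program


def generate_test_sets(class_name, source_code, base_prompt):
    """Request JUnit test classes for one program at every temperature value."""
    for temperature in TEMPERATURES:
        for sample in range(SAMPLES_PER_TEMPERATURE):
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                temperature=temperature,
                messages=[{
                    "role": "user",
                    # base_prompt is a placeholder for the paper's prompt,
                    # with {cut} and {code} filled in.
                    "content": base_prompt.format(cut=class_name, code=source_code),
                }],
            )
            generated_code = response["choices"][0]["message"]["content"]
            # save_test_class is a hypothetical helper that strips any natural
            # language around the Java code and writes it to a .java file.
            save_test_class(class_name, temperature, sample, generated_code)
```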
Table 1: Static information of the Java programs (extracted from Araujo and Vincenzi [3])

4.3 Tools Selection
We used JUnit as the unit testing framework to evaluate the results; it is widely recognized as the industry standard for testing and generating comprehensive reports. Another noteworthy aspect is the utilization of JUnit in the study of Araujo and Vincenzi [3], which is a valuable reference point for comparing our results.

By employing the same testing framework, we establish a meaningful basis for comparison, enabling us to analyze and assess the effectiveness of our LLM-generated tests concerning their findings. The same logic was used to select PITest8 as our mutation tool. As in Araujo and Vincenzi [3], we employed all the mutation operators of PITest9 to generate as many mutants as possible for each program under testing.

Therefore, Table 2 summarizes the tools and versions we used, which we kept the same as the ones adopted by Araujo and Vincenzi [3] to minimize threats.

Table 2: Tools version and purpose (adapted from Araujo and Vincenzi [3])
Tool | Version | Purpose
JavaNCSS | 32.53 | Static Metric Computation
PITest | 1.3.2 | Mutation and Coverage Testing
Maven | 3.6.3 | Application Builder
JUnit | 4.12 | Framework for Unit Testing
Python | 3.7 | Script language
Java | 8 | Programs language
LLM | gpt-3.5-turbo | OpenAI LLM for generating tests

5 DATA COLLECTION AND ANALYSIS
The initial step involved creating a centralized repository housing all the selected programs, scripts, and experimental results. To achieve version control and facilitate seamless collaboration, we opted for GitHub as our hosting platform10.

8 https://pitest.org/
9 https://pitest.org/quickstart/mutators/
10 https://github.com/aurimrv/initial-investigation-chatgpt-unit-tests.git

Since we use the programs from Araujo and Vincenzi [3], we simply reuse the static metrics from their work without recomputing them. Table 1 presents such data about the programs.

Subsequently, we proceeded with the test set generation using the gpt-3.5-turbo model. To accomplish this, we formulated a specific base prompt designed to request the model's assistance in generating JUnit unit tests tailored for a program. The first prompt version is as follows:

    Generate test cases just for the {cut}
    Java class in one Java class file with
    imports using JUnit 4 and Java 8:

    {code}

Figure 2: Prompt version 1 for test set generation

In Figure 2, {cut} represents the name of the class under testing, and {code} is a variable containing the CUT source and its dependencies. We developed a Python script (generate-chatgpt.py), which sends the request to the OpenAI API. Upon receiving the response from the API, the script removes any natural language comments that the LLM added before or after the generated code.
Additionally, the script ensures that the Java test class name matches the file name, following a pattern to enhance test data organization. As an output, the script generates 33 Java test classes for every selected program, with 3 test classes for each LLM temperature value, as mentioned in Section 4.1.
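As a rough illustration of the post-processing just described, the following sketch (not the authors' actual generate-chatgpt.py) strips Markdown fences and surrounding natural-language commentary from an API response and writes the remaining Java code to a file named after the public test class; the regular expressions and directory layout are assumptions made for this example.

```python
import re
from pathlib import Path


def extract_java_code(response_text):
    """Keep only the Java code in the model's reply, dropping prose around it."""
    # If the reply uses a fenced block, prefer its contents.
    fenced = re.search(r"```(?:java)?\s*(.*?)```", response_text, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    # Otherwise, keep everything from the first package/import/class keyword on.
    match = re.search(r"^(?:package|import|public\s+class)\b", response_text, re.MULTILINE)
    return response_text[match.start():].strip() if match else response_text.strip()


def save_test_class(cut_name, temperature, sample, response_text,
                    out_dir=Path("generated-tests")):
    """Write the test class so the file name matches the public class name."""
    code = extract_java_code(response_text)
    declared = re.search(r"public\s+class\s+(\w+)", code)
    class_name = declared.group(1) if declared else "{}Test".format(cut_name)
    # One file per (program, temperature, sample), e.g. generated-tests/t0.3/2/MaxTest.java
    target = out_dir / "t{}".format(temperature) / str(sample) / "{}.java".format(class_name)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(code, encoding="utf-8")
    return target
```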
Then, with all tests generated for every program, it was time to build and run them using Maven and PITest. To automate this process, we developed another Python script (compile-and-test-chatgpt.py). However, at this stage, we encountered an issue where some tests generated by the model did not build successfully due to problems such as syntax errors and missing imports. The script simply discards any test set with failing cases, since it is only possible to call PITest after a successful build and test execution. The script moves all test files to a directory outside the project, copies one test file at a time to the project's test directory, and then builds and runs the test for that specific file. This way, any build issues or errors in one test won't affect the others, ensuring a smoother and more effective testing process.
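A minimal sketch of this one-test-at-a-time build loop is shown below. It is an illustration under stated assumptions rather than the authors' compile-and-test-chatgpt.py: it assumes a standard Maven layout (src/test/java), invokes Maven's test phase and the pitest-maven plugin's mutationCoverage goal, and simply skips any generated test class whose build or execution fails.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def run(cmd, cwd):
    """Run a Maven command in the project directory; return True on success."""
    return subprocess.run(cmd, cwd=str(cwd), capture_output=True).returncode == 0


def evaluate_test_files(project_dir: Path, generated_tests: List[Path]) -> List[Path]:
    """Copy one generated test class at a time into the project and evaluate it."""
    test_dir = project_dir / "src" / "test" / "java"
    successful = []
    for test_file in generated_tests:
        target = test_dir / test_file.name
        shutil.copy(str(test_file), str(target))
        try:
            # Build and execute only this test class; discard it on any failure.
            if not run(["mvn", "-q", "test"], project_dir):
                continue
            # Only after a clean test run does it make sense to mutate.
            if run(["mvn", "-q", "org.pitest:pitest-maven:mutationCoverage"], project_dir):
                successful.append(test_file)
        finally:
            # Remove the class so the next candidate is evaluated in isolation.
            if target.exists():
                target.unlink()
    return successful
```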
Finally, we developed the last Python script (reports-chatgpt.py) for extracting coverage and mutation scores from the PITest reports. It is responsible for generating one CSV file for each Java program, including all test results that executed successfully. Tables 3 and 4 present parts of the collected data. Considering the first prompt version, presented in Figure 2, Table 3 presents the average data for each temperature value we investigated.
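The report-extraction step can be sketched as follows. This is not the authors' reports-chatgpt.py; it assumes PITest was configured with XML output so that a mutations.xml file is available under target/pit-reports, from which the mutation score is the fraction of generated mutants marked as detected (line coverage is taken from the PITest report as well, passed in here as a plain parameter).

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def mutation_score(mutations_xml: Path) -> float:
    """Compute the percentage of killed mutants from a PITest mutations.xml report."""
    root = ET.parse(str(mutations_xml)).getroot()
    mutations = root.findall("mutation")
    if not mutations:
        return 0.0
    killed = sum(1 for m in mutations if m.get("detected") == "true")
    return 100.0 * killed / len(mutations)


def append_result(csv_file: Path, program, temperature, sample, coverage, score):
    """Accumulate one CSV row per successfully executed test set."""
    new_file = not csv_file.exists()
    with csv_file.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if new_file:
            writer.writerow(["program", "temperature", "sample", "coverage", "mutation_score"])
        writer.writerow([program, temperature, sample, coverage, score])
```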
With the first prompt version, one of the temperature values produced only 36 successful test sets, around 16% fewer than temperature 0.6; however, with those 36 test sets, the average coverage and mutation score are the highest: 88.9 and 54.8, respectively. Although we consider these results impressive due to the simplicity of the prompt, we analyzed the errors produced by the test sets and the parts of the source code not covered by the tests, and we tried to improve the prompt to mitigate some of the problems found. Figure 3 shows the prompt's second version.

Observe that in the prompt presented in Figure 3, {cut} and {code} have the same meaning: the name of the class under testing and the source code of the class under testing and its dependencies. We were more incisive regarding how we wanted the test set, including mandatory dependencies, a timeout, throws Exception, the test set name, and the calling of void methods and default constructors. We also enforce two testing criteria: decision coverage and boundary values.
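Figure 3 itself does not survive in this text, so the snippet below is only an illustrative rendering, as a Python string, of the kinds of constraints just listed (mandatory imports, a timeout, throws Exception, a fixed test-class name, calls to void methods and default constructors, decision coverage and boundary values); it is not the authors' exact prompt.

```python
# Hypothetical second-version prompt template; {cut} and {code} keep the same
# meaning as in Figure 2 (class under test, and its source plus dependencies).
PROMPT_V2_SKETCH = """\
Generate JUnit 4 test cases, using Java 8, only for the {cut} class below,
in a single Java class file named {cut}Test with all required imports.
Every test method must declare 'throws Exception' and use @Test(timeout=1000).
Also call the void methods and the default constructor of {cut}.
Cover every decision (true and false outcomes) and use boundary values.

{code}
"""
```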
After this prompt upgrade, we reran all scripts to generate new test sets, check their quality, and measure coverage and mutation scores. Table 4 presents the average data per temperature. The new prompt improved successful test set execution by more than 12%: observing temperature 0.2, 64 out of 99 test sets executed without failures, a success rate of 64.6%. We also improved coverage and mutation scores to an average of 93.5% and 58.8%, respectively.

Table 4: Average Data for Each Temperature Parameter – Prompt version 2
Temp. | # of Suc. Test | % of Suc. | AVG Cov. | AVG Score
0.0 | 61 | 61.6 | 93.5 | 58.8
0.1 | 59 | 59.6 | 93.4 | 57.4
0.2 | 64 | 64.6 | 90.7 | 57.4
0.3 | 63 | 63.6 | 91.2 | 57.7
0.4 | 59 | 59.6 | 92.0 | 57.8
0.5 | 55 | 55.6 | 93.3 | 57.7
0.6 | 63 | 63.6 | 88.0 | 55.9
0.7 | 54 | 54.5 | 89.9 | 55.4
0.8 | 55 | 55.6 | 88.6 | 55.3
0.9 | 54 | 54.5 | 85.8 | 54.1
1.0 | 61 | 61.6 | 87.7 | 54.1

[Table: Number of successful test sets per program for each temperature value – Prompt version 2]
ID | Program | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 | All
1 | Max | 0 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 2 | 10
2 | MaxMin1 | 2 | 1 | 2 | 3 | 3 | 2 | 2 | 2 | 0 | 3 | 2 | 22
3 | MaxMin2 | 3 | 3 | 3 | 2 | 2 | 2 | 3 | 3 | 2 | 2 | 2 | 27
4 | MaxMin3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1
5 | Sort1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 33
6 | FibRec | 0 | 0 | 1 | 0 | 1 | 2 | 2 | 2 | 2 | 3 | 2 | 15
7 | FibIte | 0 | 0 | 0 | 1 | 2 | 1 | 2 | 0 | 3 | 1 | 2 | 12
8 | MaxMinRec | 3 | 3 | 3 | 3 | 2 | 2 | 3 | 2 | 3 | 2 | 3 | 29
9 | Mergesort | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 33
10 | MultMatrixCost | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
11 | ListArray | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 2 | 2 | 2 | 29
12 | ListAutoRef | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 2 | 1 | 29
13 | StackArray | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 32
14 | StackAutoRef | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 2 | 3 | 2 | 3 | 30
15 | QueueArray | 0 | 0 | 1 | 2 | 0 | 2 | 3 | 3 | 3 | 2 | 2 | 18
16 | QueueAutoRef | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 2 | 2 | 3 | 1 | 28
17 | Sort2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 3
18 | HeapSort | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
19 | PartialSorting | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
20 | BinarySearch | 3 | 2 | 3 | 2 | 3 | 2 | 2 | 3 | 0 | 0 | 2 | 22
21 | BinaryTree | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 2 | 2 | 3 | 3 | 30
22 | Hashing1 | 3 | 3 | 3 | 3 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 24
23 | Hashing2 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 8
24 | GraphMatAdj | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 2 | 2 | 3 | 28
25 | GraphListAdj1 | 3 | 3 | 2 | 3 | 3 | 1 | 2 | 1 | 1 | 3 | 2 | 24
26 | GraphListAdj2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 2 | 31
27 | DepthFirstSearch | 3 | 2 | 3 | 3 | 2 | 2 | 3 | 2 | 2 | 0 | 2 | 24
28 | BreadthFirstSearch | 3 | 2 | 3 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 16
29 | Graph | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 20
30 | PrimAlg | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 3
31 | ExactMatch | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 33
32 | AproximateMatch | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 2 | 3 | 31
33 | Identifier | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 3
# Successful Test | 61 | 59 | 64 | 63 | 59 | 55 | 63 | 54 | 55 | 54 | 61 | 648
% Successful Test | 61.6 | 59.6 | 64.6 | 63.6 | 59.6 | 55.6 | 63.6 | 54.5 | 55.6 | 54.5 | 61.6 | 59.5
# of Programs without test | 12 | 10 | 8 | 8 | 9 | 8 | 8 | 8 | 9 | 8 | 3 | 3
% of Programs without test | 36.4 | 30.3 | 24.2 | 24.2 | 27.3 | 24.2 | 24.2 | 24.2 | 27.3 | 24.2 | 9.1 | 9.1
Prompts also show huge flexibility in asking for test cases considering specific testing criteria or asking for test cases to reach a specific objective, like covering a specific statement or killing a specific mutant. In this work, we decided on only a standard pre-defined prompt, as shown in Figure 3, to use the generated unit tests in a fully automated way, i.e., without interacting with the chat to ask for additional tests or test corrections.

We do not think LLMs will solve all testing problems automatically. We believe a good automated testing strategy has now gained important support from LLMs. We intend to observe the LLM limits for unit test generation. If some important testing requirement is missing, and time and people are available for testing, it is possible to develop specialized prompts to generate specific test cases with human support to check and correct possible mistakes. This is especially true since it is difficult to maintain software test generators. For instance, considering the ones used by Araujo and Vincenzi [3], two of them (Palus and JTExpert) are unavailable or did not work with newer versions of Java.

On the other hand, LLMs need a huge amount of data to work and can be easily personalized to meet different testing objectives. A possible alternative to improve the LLM capabilities, considering Java programs, for instance, is to use EvoSuite to start the test set generation and, later, to provide to the LLM the source code of the class under testing together with the previously generated EvoSuite test set.
In this way, we suppose the prompt can better understand the test case style, which may reduce the test case failures generated by LLMs.

Table 6: All LLM test sets versus baseline test sets
ID | LLM Suite Coverage | LLM Suite Score | Baseline Suite11 Coverage | Baseline Suite11 Score
1 | 100.0 | 85.7 | 100.0 | 64.3
2 | 100.0 | 85.7 | 100.0 | 83.8
3 | 100.0 | 85.7 | 100.0 | 84.3
4 | 100.0 | 64.5 | 100.0 | 79.8
5 | 100.0 | 80.0 | 100.0 | 78.5
6 | 100.0 | 100.0 | 100.0 | 100.0
7 | 100.0 | 100.0 | 100.0 | 100.0
8 | 100.0 | 83.3 | 100.0 | 83.1
9 | 100.0 | 96.4 | 100.0 | 95.5
10 | 0.0 | 0.0 | 100.0 | 45.6
11 | 100.0 | 93.5 | 100.0 | 78.1
12 | 100.0 | 87.0 | 100.0 | 83.9
13 | 100.0 | 96.8 | 100.0 | 81.3
14 | 100.0 | 93.5 | 100.0 | 85.5
15 | 100.0 | 93.0 | 100.0 | 90.5
16 | 100.0 | 67.6 | 100.0 | 82.4
17 | 100.0 | 57.6 | 100.0 | 68.8
18 | 0.0 | 0.0 | 100.0 | 73.9
19 | 0.0 | 0.0 | 100.0 | 84.4
20 | 100.0 | 94.7 | 100.0 | 70.4
21 | 81.3 | 62.5 | 88.8 | 93.8
22 | 100.0 | 57.3 | 100.0 | 93.9
23 | 98.1 | 63.3 | 100.0 | 86.6
24 | 100.0 | 73.4 | 96.2 | 69.0
25 | 100.0 | 78.9 | 100.0 | 81.9
26 | 98.0 | 77.2 | 99.2 | 78.1
27 | 100.0 | 78.7 | 100.0 | 81.0
28 | 100.0 | 78.7 | 100.0 | 82.1
29 | 100.0 | 78.7 | 100.0 | 81.4
30 | 100.0 | 71.1 | 100.0 | 42.4
31 | 100.0 | 39.5 | 100.0 | 58.6
32 | 100.0 | 38.4 | 100.0 | 51.6
33 | 100.0 | 64.0 | 100.0 | 75.8
AVG | 90.2 | 70.5 | 99.5 | 78.5
SD | 29.2 | 27.5 | 2.0 | 13.9
AVG∗∗ | 99.2 | 77.6 | 99.5 | 79.5
SD∗∗ | 3.4 | 16.5 | 2.1 | 13.1
∗∗ - AVG and SD removing zeros.

Another point is that we decided to provide the source code of the class under testing to the LLM and asked it to generate tests for the entire class. However, we believe that if we ask for tests for only a specific method inside a class, we will get better results, since the scope is reduced and the LLM will create more tests for each specific method.

Finally, we explored a single OpenAI API model, called gpt-3.5-turbo, but OpenAI offers a variety of models, each with different capabilities. Deciding which one is more suitable for each situation demands additional experimentation. Moreover, there are also many new LLMs available, like Bing12, Bard13, and LLaMa14, which may also demand further investigation concerning their capacity to automatically generate unit tests for specific languages.

12 https://www.bing.com/
13 https://bard.google.com/
14 https://labs.perplexity.ai/

7 THREATS TO VALIDITY
There are several potential threats in this paper. One possible threat is sampling bias, which means the selection of programs and tools used in the experiment may not accurately represent the entire software development landscape. This could lead to biased results that may not apply to other contexts. To minimize this threat, we tried to use tools and programs already explored in other experiments. Moreover, especially for the automated test generator, at least EvoSuite is a tool used in a vast number of experiments both in academia [22, 26] and in industry [10] and is also integrated into professionals' integrated development environments [4].

Another threat is the limited generalization of our findings. The study's conclusions may only be relevant to a specific set of programs and tools and may not apply to different scenarios. Additionally, there is a risk of measurement bias, where the metrics used to measure the effectiveness of the generated test data may not fully capture its quality and comprehensiveness. To address this, we manually revised the Python scripts and checked the collected data for some programs to ensure the information is accurate. Coverage and mutation scores are traditional metrics for evaluating software testing quality, and mutation in particular is confirmed to be an excellent fault model to evaluate the quality of test sets [2, 12, 13]. Although we work with Java programs in this initial investigation, other studies in progress also explore the LLM test generation capabilities for programs written in other languages, like Python and C, for instance.

In Section 4.2, we presume the correctness of all programs, as they are basic and well-known algorithms. However, there remains the possibility of bugs that could result in inaccurate mutation scores and failure to run correct tests.

Using large language models for automatic test data generation may have limitations or biases that could impact the quality and comprehensiveness of the generated test sets. Using baseline results obtained from traditional automated test case generators [3] to confront the results obtained from the test sets generated by the LLM aims to minimize this threat. Moreover, we used only one LLM engine and model in this experiment, which may not represent the results for other LLMs or models. We intend to extend the experiment to more programs, LLM engines, and models in further studies.

8 CONCLUSION
In this work, we presented an initial investigation of the use of the OpenAI API, considering the LLM named gpt-3.5-turbo, for unit test generation in a fully automated way, i.e., with no human interaction for test case correction after the prompt returns. The idea was to detect to what extent the test cases would run directly, with no errors, for testing a set of Java programs.

Basically, we developed a prompt to request test sets via the API, varying only the code of the class under testing and the "temperature" parameter of the gpt-3.5-turbo model. We asked for three test sets for each of the eleven different temperature values (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) for each program, resulting, in the best case, in a total of 33 test sets per program.

Our results show that not for all temperatures was the API able to produce useful test sets that run automatically, with no errors and without human intervention.
In this way, we discarded these test sets in our experiment. For 3 out of 33 programs, the model was not able to generate useful test sets for any temperature, especially due to the non-overriding of traditional Java methods for object comparison, like equals() and compareTo(), in the application under testing.

We observed interesting results by keeping only the test sets that run automatically and comparing our results with those obtained by other researchers that used traditional automated test set generators [3]. We consider that, despite the simplicity of the prompt asking the LLM for tests, the results in terms of code coverage were very similar to the ones obtained in the baseline. Moreover, concerning mutation score, we observed complementary aspects between the LLM Suite and the Baseline Suite: they complement each other.

Further work intends to investigate the best way to use a traditional automated test generator and LLM prompts together to obtain better results than when using the tools in isolation.

Moreover, this initial investigation raised more questions than it produced answers. To answer the raised questions, more experimentation is necessary. A few of them are:
(1) Do the other OpenAI models produce similar or complementary results?
(2) Does the language used in the prompt influence the results?
(3) Does the language of the product under testing influence the results?
(4) How do other LLMs and their prompts perform at automating unit test generation?
(5) Does the LLM perform better when asked to test a method instead of a class?

REFERENCES
[1] Mehrdad Abdi and Serge Demeyer. 2022. Steps towards zero-touch mutation testing in Pharo. In 21st Belgium-Netherlands Software Evolution Workshop – BENEVOL'2022 (CEUR Workshop Proceedings, Vol. 1). Mons, 10.
[2] J. H. Andrews, L. C. Briand, and Y. Labiche. 2005. Is mutation an appropriate tool for testing experiments?. In XXVII International Conference on Software Engineering – ICSE'05. ACM Press, St. Louis, MO, USA, 402–411. https://doi.org/10.1145/1062455.1062530
[3] Filipe Santos Araujo and Auri Vincenzi. 2020. How far are we from testing a program in a completely automated way, considering the mutation testing criterion at unit level?. In Anais do Simpósio Brasileiro de Qualidade de Software (SBQS). SBC, 151–159. https://doi.org/10.1145/3439961.3439977
[4] Andrea Arcuri, José Campos, and Gordon Fraser. 2016. Unit Test Generation During Software Development: EvoSuite Plugins for Maven, IntelliJ and Jenkins. In 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST). 401–408. https://doi.org/10.1109/ICST.2016.44
[5] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2015. The Oracle Problem in Software Testing: A Survey. IEEE Transactions on Software Engineering 41, 5 (2015), 507–525. https://doi.org/10.1109/TSE.2014.2372785
[6] Henry Coles. 2015. PITest: real world mutation testing. Available at: http://pitest.org/. Accessed: 04/07/2016.
[7] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. 1978. Hints on Test Data Selection: Help for the Practicing Programmer. IEEE Computer 11, 4 (April 1978), 34–43. https://doi.org/10.1109/C-M.1978.218136
[8] Leo Fernandes, Márcio Ribeiro, Rohit Gheyi, Marcio Delamaro, Márcio Guimarães, and André Santos. 2022. Put Your Hands In The Air! Reducing Manual Effort in Mutation Testing. In Proceedings of the XXXVI Brazilian Symposium on Software Engineering (SBES '22, Vol. 1). Association for Computing Machinery, New York, NY, USA, 198–207. https://doi.org/10.1145/3555228.3555233
[9] Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE '11). ACM, Szeged, Hungary, 416–419. https://doi.org/10.1145/2025113.2025179
[10] Gordon Fraser and Andrea Arcuri. 2014. A Large-Scale Evaluation of Automated Unit Test Generation Using EvoSuite. ACM Trans. Softw. Eng. Methodol. 24, 2 (Dec. 2014), 1–42. https://doi.org/10.1145/2685612
[11] Gordon Fraser and Andrea Arcuri. 2016. EvoSuite at the SBST 2016 Tool Competition. In Proceedings of the 9th International Workshop on Search-Based Software Testing. ACM, Austin, Texas, 33–36. https://doi.org/10.1145/2897010.2897020
[12] Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Development of Mutation Testing. IEEE Transactions on Software Engineering 37, 5 (Sept. 2011), 649–678. https://doi.org/10.1109/TSE.2010.62
[13] René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Are mutants a valid substitute for real faults in software testing?. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014, Vol. 1). Association for Computing Machinery, Hong Kong, China, 654–665. https://doi.org/10.1145/2635868.2635929
[14] Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. 2023. Finding Failure-Inducing Test Cases with ChatGPT.
[15] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2023. Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing. https://doi.org/10.48550/arXiv.2305.09434
[16] Wei Ma, Shangqing Liu, Wenhan Wang, Qiang Hu, Ye Liu, Cen Zhang, Liming Nie, and Yang Liu. 2023. The Scope of ChatGPT in Software Engineering: A Thorough Investigation. https://doi.org/10.48550/arXiv.2305.12138
[17] OpenAI. 2023. OpenAI GPT-3.5 Models Documentation. (July 2023). https://platform.openai.com/docs/models/gpt-3-5
[18] Carlos Pacheco and Michael D. Ernst. 2007. Randoop: Feedback-directed Random Testing for Java. In Companion to the 22nd ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications Companion (OOPSLA '07). ACM, Montreal, Quebec, Canada, 815–816. https://doi.org/10.1145/1297846.1297902
[19] Marco Tulio Ribeiro. 2023. Testing Language Models (and Prompts) Like We Test Software. (May 2023). https://towardsdatascience.com/testing-large-language-models-like-we-test-software-92745d28a359
[20] M. Roper. 1994. Software Testing. McGraw Hill.
[21] Abdelilah Sakti, Gilles Pesant, and Yann-Gaël Guéhéneuc. 2015. JTExpert at the Third Unit Testing Tool Competition. 52–55. https://doi.org/10.1109/SBST.2015.20
[22] Sebastian Schweikl, Gordon Fraser, and Andrea Arcuri. 2023. EvoSuite at the SBST 2022 Tool Competition. In Proceedings of the 15th Workshop on Search-Based Software Testing (SBST '22). Association for Computing Machinery, New York, NY, USA, 33–34. https://doi.org/10.1145/3526072.3527526
[23] Mohammed Latif Siddiq, Joanna C. S. Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinicius Carvalho Lopes. 2023. Exploring the Effectiveness of Large Language Models in Generating Unit Tests. https://doi.org/10.48550/arXiv.2305.00418
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17, Vol. 1). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.
[25] Auri M. R. Vincenzi, Tiago Bachiega, Daniel G. de Oliveira, Simone R. S. de Souza, and José C. Maldonado. 2016. The Complementary Aspect of Automatically and Manually Generated Test Case Sets. In Proceedings of the 7th International Workshop on Automating Test Case Design, Selection, and Evaluation (A-TEST 2016, Vol. 1). ACM, Seattle, WA, USA, 23–30. https://doi.org/10.1145/2994291.2994295
[26] Sebastian Vogl, Sebastian Schweikl, Gordon Fraser, Andrea Arcuri, Jose Campos, and Annibale Panichella. 2021. EvoSuite at the SBST 2021 Tool Competition. In 2021 IEEE/ACM 14th International Workshop on Search-Based Software Testing (SBST). IEEE, 28–29.
[27] Jules White, Sam Hays, Quchen Fu, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design. https://doi.org/10.48550/arXiv.2303.07839
[28] Zhuokui Xie, Yinghao Chen, Chen Zhi, Shuiguang Deng, and Jianwei Yin. 2023. ChatUniTest: a ChatGPT-based automated unit test generation tool. https://doi.org/10.48550/arXiv.2305.04764
[29] Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, and Xin Peng. 2023. No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation. https://doi.org/10.48550/arXiv.2305.04207
[30] Sai Zhang. 2011. Palus: A Hybrid Automated Test Generation Tool for Java. In Proceedings of the 33rd International Conference on Software Engineering (ICSE '11). Association for Computing Machinery, New York, NY, USA, 1182–1184. https://doi.org/10.1145/1985793.1986036