Software Testing With Large Language Models: Survey, Landscape, and Vision
types of prompt engineering, input of the LLMs, as well as the accompanying techniques used with these LLMs.
• We highlight the challenges in existing studies and present potential opportunities for further studies.
• We maintain a GitHub website https://github.com/LLM-Testing/LLM4SoftwareTesting that serves as a platform for sharing and hosting the latest publications about software testing with LLMs.
We believe that this work will be valuable to both researchers and practitioners in the field of software engineering, as it provides a comprehensive overview of the current state and future vision of using LLMs for software testing. For researchers, this work can serve as a roadmap for future research in this area, highlighting potential avenues for exploration and identifying gaps in our current understanding of the use of LLMs in software testing. For practitioners, this work can provide insights into the potential benefits and limitations of using LLMs for software testing, as well as practical guidance on how to effectively integrate them into existing testing processes. By providing a detailed landscape of the current state and future vision of using LLMs for software testing, this work can help accelerate the adoption of this technology in the software engineering community and ultimately contribute to improving the quality and reliability of software systems.

II. BACKGROUND

A. Large Language Model (LLM)

Recently, pre-trained language models (PLMs) have been proposed by pretraining Transformer-based models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks [16], [17], [18], [19]. Studies have shown that model scaling can lead to improved model capacity, prompting researchers to investigate the scaling effect through further parameter size increases. Interestingly, when the parameter scale exceeds a certain threshold, these larger language models demonstrate not only significant performance improvements but also special abilities such as in-context learning, which are absent in smaller models such as BERT.
To distinguish language models of different parameter scales, the research community has coined the term large language models (LLMs) for the PLMs of significant size. LLMs typically refer to language models that have hundreds of billions (or more) of parameters and are trained on massive text data, such as GPT-3, PaLM, Codex, and LLaMA. LLMs are built using the Transformer architecture, which stacks multi-head attention layers in a very deep neural network. Existing LLMs adopt similar model architectures (Transformer) and pre-training objectives (language modeling) as small language models, but largely scale up the model size, pre-training data, and total compute power. This enables LLMs to better understand natural language and generate high-quality text based on a given context or prompt.
Note that, in the existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since model capacity is also related to data size and total compute. In a recent survey of LLMs [17], the authors focus on discussing the language models with a model size larger than 10B. Under their criteria, the first LLM is T5, released by Google in 2019, followed by GPT-3, released by OpenAI in 2020, and there are more than thirty LLMs released between 2021 and 2023, indicating their popularity. In another survey, on unifying LLMs and knowledge graphs [24], the authors categorize the LLMs into three types of network architecture: encoder-only (e.g., BERT), encoder-decoder (e.g., T5), and decoder-only (e.g., GPT-3). In our review, we take into account the categorization criteria of the two surveys and only consider the encoder-decoder and decoder-only architectures of pre-trained language models, since they can both support generative tasks. We do not consider the encoder-only architecture because such models cannot handle generative tasks, were proposed relatively early (e.g., BERT in 2018), and there are almost no models using this architecture after 2021. In other words, the LLMs discussed in this paper not only include models with parameters of over 10B (as mentioned in [17]) but also include other models that use the encoder-decoder and decoder-only network architectures (as mentioned in [24]), such as BART with 140M parameters and GPT-2 with parameter sizes ranging from 117M to 1.5B. This choice also allows us to potentially include more studies to demonstrate the landscape of this topic.

B. Software Testing

Software testing is a crucial process in software development that involves evaluating the quality of a software product. The primary goal of software testing is to identify defects or errors in the software system that could potentially lead to incorrect or unexpected behavior. The whole life cycle of software testing typically includes the following tasks (demonstrated in Fig. 4):
• Requirement Analysis: analyze the software requirements and identify the testing objectives, scope, and criteria.
• Test Plan: develop a test plan that outlines the testing strategy, test objectives, and schedule.
• Test Design and Review: develop and review the test cases and test suites that align with the test plan and the requirements of the software application.
• Test Case Preparation: the actual test cases are prepared based on the designs created in the previous stage.
• Test Execution: execute the tests that were designed in the previous stage. The software system is executed with the test cases and the results are recorded.
• Test Reporting: analyze the results of the tests and generate reports that summarize the testing process and identify any defects or issues that were discovered.
• Bug Fixing and Regression Testing: defects or issues identified during testing are reported to the development team for fixing. Once the defects are fixed, regression testing is performed to ensure that the changes have not introduced new defects or issues.
• Software Release: once the software system has passed all of the testing stages and the defects have been fixed, the software can be released to the customer or end user.
TABLE I
DETAILS OF THE COLLECTED PAPERS
JSS      Journal of Systems and Software
JSEP     Journal of Software: Evolution and Process
STVR     Software Testing, Verification and Reliability
IEEE SOFTW.  IEEE Software
IET SOFTW.   IET Software
IST      Information and Software Technology
SQJ      Software Quality Journal
AI Venues
ICLR     International Conference on Learning Representations
NeurIPS  Conference on Neural Information Processing Systems
ICML     International Conference on Machine Learning
AAAI     AAAI Conference on Artificial Intelligence
EMNLP    Conference on Empirical Methods in Natural Language Processing
ACL      Annual Meeting of the Association for Computational Linguistics
IJCAI    International Joint Conference on Artificial Intelligence

Exclusion Criteria. The following studies would be excluded during study selection:
• The paper does not involve software testing tasks, e.g., code comment generation.
• The paper does not utilize LLMs, e.g., using recurrent neural networks.
• The paper mentions LLMs only in future work or discussions rather than using LLMs in the approach.
• The paper utilizes language models with encoder-only architecture, e.g., BERT, which cannot be directly utilized for generation tasks (as demonstrated in Section II-A).
• The paper focuses on testing the performance of LLMs, such as fairness, stability, security, etc. [124], [125], [126].
• The paper focuses on evaluating the performance of LLM-enabled tools, e.g., evaluating the code quality of the code generation tool Copilot [127], [128], [129].
For the papers collected through automatic search and manual search, we conduct a manual inspection to check whether they satisfy our inclusion criteria and filter them according to our exclusion criteria. Specifically, the first two authors read each paper to carefully determine whether it should be included based on the inclusion and exclusion criteria, and any paper with different decisions is handed over to the third author to make the final decision.
4) Quality Assessment: In addition, we establish quality assessment criteria to exclude low-quality studies, as shown below. For each question, the study's quality is rated as "yes", "partial", or "no", which are assigned values of 1, 0.5, and 0, respectively. Papers with a score of less than eight will be excluded from our study.
• Is there a clearly stated research goal related to software testing?
• Is there a defined and repeatable technique?
• Is there any explicit contribution to software testing?
• Do the results presented in the study align with the research objectives and are they presented in a clear and relevant manner?
5) Snowballing: At the end of searching database repositories, conference proceedings, and journals, and applying the inclusion/exclusion criteria and quality assessment, we obtain the initial set of papers. Next, to mitigate the risk of omitting relevant literature from this survey, we also perform backward snowballing [130] by inspecting the references cited by the collected papers so far. Note that this procedure did not introduce new studies, which might be because the surveyed topic is quite new and the referenced studies tend to have been published earlier, and we already include a relatively comprehensive automatic and manual search.

B. Collection Results

As shown in Fig. 2, the collection process started with a total of 14,623 papers retrieved from four academic databases employing keyword searching. Then, after automated filtering, manual search, applying inclusion/exclusion criteria, and quality assessment, we finally collected a total of 102 papers involving software testing with LLMs. Table I shows the details of the collected papers. Besides, we provide a more comprehensive overview of these papers regarding their specific characteristics (which will be illustrated in Section IV and Section V) in the online appendix of the paper.
Note that there are two studies which are respectively the extension of a previously published paper by the same authors ([46] and [131], [68] and [132]), and we only keep the extended version to avoid duplication.

C. General Overview of Collected Papers

Among the papers, 47% are published in software engineering venues, among which 19 papers are from ICSE, 5 papers are from FSE, 5 papers are from ASE, and 3 papers are from ISSTA. 2% of the papers are published in artificial intelligence venues such as EMNLP and ICLR, and 5% of the papers are published in program analysis or security venues like PLDI and S&P. Besides, 46% of the papers have not yet been published via peer-reviewed venues, i.e., they are disclosed on arXiv. This is understandable because this field is emerging and many works have just been completed and are in the process of submission. Although these papers did not undergo peer review, we have applied the quality assessment described above to these papers to help ensure the quality of the included studies.
Fig. 4. Distribution of testing tasks with LLMs (aligned with the software testing life cycle [1], [133], [134]; the number in brackets indicates the number of collected studies per task, and one paper might involve multiple tasks).
TABLE III
PERFORMANCE OF UNIT TEST CASE GENERATION
in terms of revealing bugs by leveraging mutation testing. They augment prompts with surviving mutants, as those mutants highlight the limitations of test cases in detecting bugs. Zhang et al. [39] used LLMs to generate security tests for vulnerable dependencies.
Yuan et al. [7] first performed an empirical study to evaluate ChatGPT's capability of unit test generation with both a quantitative analysis and a user study in terms of correctness, sufficiency, readability, and usability. The results show that the generated tests still suffer from correctness issues, including diverse compilation errors and execution failures. They further propose an approach that leverages ChatGPT itself to improve the quality of its generated tests with an initial test generator and an iterative test refiner. Specifically, the iterative test refiner iteratively fixes the compilation errors in the tests generated by the initial test generator, following a validate-and-fix paradigm that prompts the LLM based on the compilation error messages and additional code context. Guilherme et al. [30] and Li et al. [40] respectively evaluate the quality of the unit tests generated by LLMs using different metrics and different prompts.
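To make the validate-and-fix paradigm concrete, the following minimal Python sketch iterates between compiling a candidate JUnit test and re-prompting the model with the resulting compiler messages. The query_llm callable, the fixed class name, and the bare javac invocation (which ignores classpath details such as the JUnit jar) are illustrative assumptions rather than the setup used in the cited work.

import pathlib
import subprocess
import tempfile
from typing import Callable

def compile_errors(java_source: str) -> str:
    """Compile the candidate test with javac and return its error output ('' if it compiles)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "GeneratedTest.java"   # assumes the generated class is named GeneratedTest
        path.write_text(java_source)
        result = subprocess.run(["javac", str(path)], capture_output=True, text=True)
        return result.stderr

def generate_and_refine(focal_method: str,
                        query_llm: Callable[[str], str],
                        max_rounds: int = 3) -> str:
    """Validate-and-fix loop: generate a test, then iteratively repair it from compiler feedback."""
    test = query_llm("Write a JUnit test for the following Java method:\n" + focal_method)
    for _ in range(max_rounds):
        errors = compile_errors(test)
        if not errors:                                    # the candidate compiles; stop refining
            break
        test = query_llm(
            "The following JUnit test does not compile.\n"
            "Test:\n" + test + "\n\nCompiler errors:\n" + errors +
            "\n\nReturn only a corrected version of the test."
        )
    return test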
3) Test Generation With Additional Documentation: Vikram et al. [33] went a step further by investigating the potential of using LLMs to generate property-based tests when provided with API documentation. They believe that the documentation of an API method can assist the LLM in producing logic to generate random inputs for that method and in deriving meaningful properties of the result to check. Instead of generating unit tests from the source code, Plein et al. [32] generated the tests based on user-written bug reports.
4) LLM and Search-Based Method for Unit Test Generation: The aforementioned studies utilize LLMs for the whole unit test case generation task, while Lemieux et al. [36] focus on a different direction, i.e., first letting a traditional search-based software testing technique (e.g., Pynguin [135]) generate unit test cases until its coverage improvements stall, and then asking the LLM to provide example test cases for under-covered functions. These examples can help the original test generator redirect its search to more useful areas of the search space. Tang et al. [8] conduct a systematic comparison of test suites generated by the LLM and the state-of-the-art search-based software testing tool EvoSuite, considering correctness, readability, code coverage, and bug detection capability. Similarly, Bhatia [42] experimentally investigates the quality of unit tests generated by the LLM compared to a commonly-used test generator, Pynguin.
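The hybrid loop described by Lemieux et al. [36] can be sketched as follows; the search step, coverage measurement, and LLM call are injected as hypothetical callables, so the sketch illustrates only the control flow of stalling and re-seeding rather than the tool itself.

from typing import Callable, List

def hybrid_test_generation(
    search_step: Callable[[List[str]], List[str]],   # one search-based iteration (e.g., Pynguin-style)
    coverage_of: Callable[[List[str]], float],       # measures coverage of the current suite
    llm_examples_for: Callable[[str], List[str]],    # asks the LLM for example tests of a function
    least_covered: Callable[[List[str]], str],       # picks the function with the lowest coverage
    iterations: int = 1000,
    stall_window: int = 50,
) -> List[str]:
    """Run a search-based generator and consult the LLM only when coverage improvements stall."""
    suite: List[str] = []
    best, stalled = 0.0, 0
    for _ in range(iterations):
        suite = search_step(suite)
        cov = coverage_of(suite)
        if cov > best:
            best, stalled = cov, 0
        else:
            stalled += 1
        if stalled >= stall_window:                  # coverage has plateaued: redirect the search
            target = least_covered(suite)            # an under-covered function
            suite.extend(llm_examples_for(target))   # seed the search with LLM-provided example tests
            stalled = 0
    return suite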
5) Performance of Unit Test Case Generation: Since the aforementioned studies of unit test case generation are based on different datasets, one can hardly derive a fair comparison, and we present the details in Table III to let readers obtain a general view. We can see that on the SF110 benchmark, all three evaluated LLMs have quite low performance, i.e., 2% coverage [38]. SF110 is an EvoSuite (a search-based unit test case generation technique) benchmark consisting of 111 open-source Java projects retrieved from SourceForge, containing 23,886 classes, over 800,000 bytecode-level branches, and 6.6 million lines of code. The authors did not present detailed reasons for the low performance, which can be further explored in the future.

B. Test Oracle Generation

A test oracle is a source of information about whether the output of a software system (or program, function, or method) is correct or not [136]. Most of the collected studies in this category target test assertion generation, where the assertion is part of a unit test case. Nevertheless, we opted to treat these studies in a separate section to facilitate a more thorough analysis.
input. Liu et al. [49] employ the LLM to intelligently generate the semantic input text according to the GUI context. In detail, their proposed QTypist automatically extracts the component information related to the EditText for generating the prompts, and then inputs the prompts into the LLM to generate the input text.
Besides the text input, there are other forms of input for mobile apps, i.e., operations like 'click a button' and 'select a list'. To fully test an app, it is required to cover more GUI pages and conduct more meaningful exploration traces through the GUI operations, yet existing studies with random-/rule-based methods [9], [10], model-based methods [11], [12], and learning-based methods [13] are unable to understand the semantic information of the GUI page and thus could not conduct the trace planning effectively. Liu et al. [14] formulate the test input generation problem of mobile GUI testing as a Q&A task, which asks the LLM to chat with the mobile app by passing the GUI page information to the LLM to elicit testing scripts (i.e., GUI operations), executing them, and keeping passing the app feedback to the LLM, iterating the whole process. The proposed GPTDroid extracts the static context of the GUI page and the dynamic context of the iterative testing process, and designs prompts for inputting this information to the LLM, which enables the LLM to better understand the GUI page as well as the whole testing process. It also introduces a functionality-aware memory prompting mechanism that equips the LLM with the ability to retain testing knowledge of the whole process and conduct long-term, functionality-based reasoning to guide exploration. Similarly, Zimmermann et al. utilize the LLM to interpret natural language test cases and programmatically navigate through the application under test [54].
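The iterative "chat with the app" loop can be sketched as follows; the GUI driver and the LLM call are injected as hypothetical callables, and a simple list of strings stands in for the functionality-aware memory described above.

from typing import Callable, List, Tuple

def llm_guided_gui_exploration(
    get_gui_state: Callable[[], str],        # serialized view hierarchy of the current page
    execute: Callable[[str], str],           # runs an action script on the device and returns app feedback
    query_llm: Callable[[str], str],
    steps: int = 50,
) -> List[Tuple[str, str]]:
    """Iteratively ask the LLM for the next GUI action, execute it, and feed the outcome back."""
    memory: List[str] = []                   # simplified stand-in for functionality-aware memory
    trace: List[Tuple[str, str]] = []
    for _ in range(steps):
        prompt = (
            "You are testing an Android app. Actions performed so far:\n"
            + "\n".join(memory[-10:])
            + "\n\nCurrent GUI page:\n" + get_gui_state()
            + "\n\nReply with exactly one action script, e.g. click(\"Login\") or input(\"email\", \"a@b.c\")."
        )
        action = query_llm(prompt).strip()
        feedback = execute(action)
        memory.append(action + " -> " + feedback)
        trace.append((action, feedback))
    return trace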
Yu et al. [61] investigate the LLM's capabilities in the mobile app test script generation and migration task, including scenario-based test generation and cross-platform/app test migration.
b) Test input generation for DL libraries: The input for testing DL libraries is DL programs, and the difficulty in generating diversified input DL programs is that they need to satisfy both the input language (e.g., Python) syntax/semantics and the API input/shape constraints for tensor computations. Traditional techniques with API-level fuzzing [139], [140] or model-level fuzzing [141], [142] suffer from the following limitations: 1) lack of diverse API sequences, so they cannot reveal bugs caused by chained API sequences; 2) inability to generate arbitrary code, so they cannot explore the huge search space that exists when using the DL libraries. Since LLMs can include numerous code snippets invoking DL library APIs in their training corpora, they can implicitly learn both language syntax/semantics and intricate API constraints for valid DL program generation. In this sense, Deng et al. [59] used both generative and infilling LLMs to generate and mutate valid/diverse input DL programs for fuzzing DL libraries. In detail, their approach first uses a generative LLM (Codex) to generate a set of seed programs (i.e., code snippets that use the target DL APIs). Then it replaces part of the seed program with masked tokens using different mutation operators and leverages the ability of an infilling LLM (InCoder) to perform code infilling to generate new code that replaces the masked tokens. Their follow-up study [58] goes a step further to prime LLMs to synthesize unusual programs for fuzzing DL libraries. It is built on the well-known hypothesis that historical bug-triggering programs may include rare/valuable code ingredients important for bug finding, and it shows improved bug detection performance.
c) Test input generation for other types of software: There are also dozens of studies that address testing tasks in various other domains; due to space limitations, we present a selection of representative studies in these domains.
Finding bugs in a commercial cyber-physical system (CPS) development tool such as Simulink is even more challenging. Given the complexity of the Simulink language, generating valid Simulink model files for testing is an ambitious task for traditional machine learning or deep learning techniques. Shrestha et al. [51] employ a small set of Simulink-specific training data to fine-tune the LLM for generating Simulink models. Results show that it can create Simulink models quite similar to the open-source models, and can find a super-set of the bugs that traditional fuzzing approaches found.
Sun et al. [63] utilize the LLM to generate test formulas for fuzzing SMT solvers. Their approach retrains the LLMs on a large corpus of SMT formulas to enable them to acquire SMT-specific domain knowledge. It then further fine-tunes the LLMs on historical bug-triggering formulas, which are known to involve structures that are more likely to trigger bugs and solver-specific behaviors. The LLM-based compiler fuzzer proposed by Yang et al. [69] adopts a dual-model framework: (1) an analysis LLM examines the low-level optimization source code and produces requirements on the high-level test programs that can trigger the optimization; (2) a generation LLM produces test programs based on the summarized requirements. Ye et al. [48] utilize the LLM for generating JavaScript programs and then use the well-structured ECMAScript specifications to automatically generate test data along with the test programs; after that, they apply differential testing to expose bugs.
2) Input Generation in Terms of Testing Techniques: By utilizing system test inputs generated by LLMs, the collected studies aim to enhance traditional testing techniques and make them more effective. Among these techniques, fuzz testing is the most commonly involved one. Fuzz testing, as a general concept, revolves around generating invalid, unexpected, or random data as inputs to evaluate the behavior of software. LLMs play a crucial role in improving traditional fuzz testing by facilitating the generation of diverse and realistic input data. This enables fuzz testing to uncover potential bugs in the software by subjecting it to a wide range of input scenarios. In addition to fuzz testing, LLMs also contribute to enhancing other testing techniques, which will be discussed in detail later.
a) Universal fuzzing framework: Xia et al. [67] present Fuzz4All, which can target many different input languages and many different features of these languages. The key idea behind it is to leverage LLMs as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, they present a novel auto-prompting technique, which creates LLM prompts that are well-suited for fuzzing, and a
novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. They experiment with six different languages (C, C++, Go, SMT2, Java, and Python) as inputs and demonstrate higher coverage than existing language-specific fuzzers. Hu et al. [52] propose a greybox fuzzer augmented by the LLM, which picks a seed from the fuzzer's seed pool and prompts the LLM to produce mutated seeds that might trigger a new code region of the software. They experiment with three categories of input formats, i.e., formatted data files (e.g., JSON, XML), source code in different programming languages (e.g., JS, SQL, C), and text with no explicit syntax rules (e.g., HTTP response, MD5 checksum). In addition, effective fuzzing relies on effective fuzz drivers, and Zhang et al. [66] utilize LLMs for fuzz driver generation, in which five query strategies, from basic to enhanced, are designed and analyzed.
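A minimal sketch of such an LLM-in-the-loop greybox fuzzer is shown below; the target runner and the LLM mutation call are injected as hypothetical callables, and a real fuzzer would additionally maintain coverage bitmaps, seed energy schedules, and crash triage.

import random
from typing import Callable, List, Set, Tuple

def llm_greybox_fuzz(
    initial_seeds: List[str],
    run_target: Callable[[str], Set[int]],     # executes the target and returns the covered code regions
    llm_mutate: Callable[[str], List[str]],    # asks the LLM for mutated variants of a seed
    iterations: int = 1000,
) -> Tuple[List[str], Set[int]]:
    """Coverage-guided fuzzing loop in which the LLM serves as the mutation engine."""
    pool = list(initial_seeds)
    covered: Set[int] = set()
    for _ in range(iterations):
        seed = random.choice(pool)
        for mutant in llm_mutate(seed):        # e.g., "produce variants of this seed likely to reach new code"
            new_regions = run_target(mutant)
            if not new_regions.issubset(covered):   # the mutant reached a new code region
                covered |= new_regions
                pool.append(mutant)                 # keep interesting mutants in the seed pool
    return pool, covered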
b) Fuzzing techniques for specific software: There are studies that focus on fuzzing techniques tailored to specific software, e.g., deep learning libraries [58], [59], compilers [69], SMT solvers [63], the input widgets of mobile apps [65], cyber-physical systems [51], etc. One key focus of these fuzzing techniques is to generate diverse test inputs so as to achieve higher coverage. This is commonly achieved by combining the mutation technique with LLM-based generation, where the former produces various candidates while the latter is responsible for generating the executable test inputs [59], [63]. Another focus of these fuzzing techniques is to generate risky test inputs that can trigger bugs earlier. To achieve this, a common practice is to collect historical bug-triggering programs to fine-tune the LLM [63] or treat them as the demonstrations when querying the LLM [58], [65].
c) Other testing techniques: There are studies that utilize LLMs for enhancing GUI testing by generating meaningful text input [49] and functionality-oriented exploration traces [14], which have been introduced in the "Test input generation for mobile apps" part of Section IV-C1.
Besides, Deng et al. [62] leverage the LLMs to carry out penetration testing tasks automatically. Their approach involves setting a penetration testing goal for the LLM, soliciting it for the appropriate operation to execute, implementing it in the testing environment, and feeding the test outputs back to the LLM for next-step reasoning.
3) Input Generation in Terms of Input and Output:
a) Output format of test generation: Although most works use LLMs to generate test cases directly, there are also some works generating indirect inputs like testing code, test scenarios, metamorphic relations, etc. Liu et al. [65] propose InputBlaster, which leverages the LLM to automatically generate unusual text inputs for fuzzing the text input widgets in mobile apps. It formulates the unusual input generation problem as a task of producing a set of test generators, each of which can yield a batch of unusual text inputs under the same mutation rule. In detail, InputBlaster leverages the LLM to produce the test generators together with the mutation rules serving as the reasoning chain, and utilizes the in-context learning schema to demonstrate the LLM with examples for boosting the performance. Deng et al. [64] use the LLM to extract key information related to the test scenario from a traffic rule, represent the extracted information in a test scenario schema, and then synthesize the corresponding scenario scripts to construct the test scenario. Luu et al. [56] examine the effectiveness of LLMs in generating metamorphic relations (MRs) for metamorphic testing. Their results show that ChatGPT can be used to advance software testing intelligence by proposing MR candidates that can later be adapted for implementing tests, but human intelligence should still inevitably be involved to justify and rectify their correctness.
b) Input format of test generation: The aforementioned studies primarily take the source code or the software as the input of the LLM, yet there are also studies that take natural language descriptions as the input for test generation. Mathur et al. [53] propose to generate test cases from natural language described requirements. Ackerman et al. [60] generate instances from natural language described requirements recursively to serve as the seed examples for a mutation fuzzer.

D. Bug Analysis

This category involves analyzing and categorizing the identified software bugs to enhance understanding of the bug and facilitate subsequent debugging and bug repair. Mukherjee et al. [73] generate relevant answers to follow-up questions for deficient bug reports to facilitate bug triage. Su et al. [76] transform bug-component triaging into a multi-classification task and a generation task with the LLM, and then ensemble the prediction results from both to further improve the performance of bug-component triaging. Zhang et al. [72] first leverage the LLM under the zero-shot setting to get essential information on bug reports, then use the essential information as the input to detect duplicate bug reports. Mahbub et al. [74] propose to explain software bugs with the LLM, which generates natural language explanations for software bugs by learning from a large corpus of bug-fix commits. Zhang et al. [70] aim to automatically generate the bug title from the description of the bug, in order to help developers write issue titles and facilitate the bug triaging and follow-up fixing process.

E. Debug

This category refers to the process of identifying and locating the cause of a software problem (i.e., bug). It involves analyzing the code, tracing the execution flow, and collecting error information to understand the root cause of the issue and fix it. Some studies concentrate on the comprehensive debug process, while others delve into specific sub-activities within the process.
1) Overall Debug Framework: Bui et al. [77] propose a unified Detect-Localize-Repair framework based on the LLM for debugging, which first determines whether a given code snippet is buggy or not, then identifies the buggy lines, and translates the buggy code to its fixed version. Kang et al. [83] propose automated scientific debugging, a technique that, given buggy code and a bug-revealing test, prompts LLMs to automatically generate hypotheses, uses debuggers to actively interact with buggy code, and thus automatically reaches conclusions prior to patch generation. Chen et al. [88] demonstrate that self-debugging can teach the LLM to perform rubber duck
debugging; i.e., without any human feedback on the code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language. Cao et al. [89] conduct a study of the LLM's debugging ability for deep learning programs, including fault detection, fault localization, and program repair.
2) Bug Localization: Wu et al. [85] compare two LLMs (ChatGPT and GPT-4) with existing fault localization techniques, and investigate the consistency of LLMs in fault localization, as well as how prompt engineering and the length of code context affect the results. Kang et al. [79] propose AutoFL, an automated fault localization technique that only requires a single failing test; during its fault localization process, it also generates an explanation about why the given test fails. Yang et al. [84] propose LLMAO to overcome the left-to-right nature of LLMs by fine-tuning a small set of bidirectional adapter layers on top of the representations learned by LLMs, which can locate buggy lines of code without any test coverage information. Tu et al. [86] propose LLM4CBI to tame LLMs to generate effective test programs for finding suspicious files.
3) Bug Reproduction: There are also studies focusing on a sub-phase of the debugging process. For example, Kang et al. [78] and Plein et al. [81] respectively propose frameworks to harness the LLM to reproduce bugs and suggest bug reproducing test cases to the developer for facilitating debugging. Li et al. [87] focus on a similar aspect of finding the failure-inducing test cases whose test input can trigger the software's fault. It synergistically combines the LLM and differential testing to do that.
There are also studies focusing on the bug reproduction of mobile apps to produce the replay script. Feng et al. [75] propose AdbGPT, a new lightweight approach to automatically reproduce bugs from bug reports through prompt engineering, without any training and hard-coding effort. It leverages few-shot learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning from LLMs to accomplish the bug replay in a manner similar to a developer. Huang et al. [71] propose CrashTranslator to automatically reproduce bugs directly from the stack trace. It accomplishes this by leveraging the LLM to predict the exploration steps for triggering the crash, and designing a reinforcement learning based technique to mitigate inaccurate predictions and guide the search holistically. Taeb et al. [55] convert manual accessibility test instructions into replayable, navigable videos by using the LLM and UI element detection models, which can also help reveal accessibility issues.
4) Error Explanation: Taylor et al. [82] integrate the LLM into the Debugging C Compiler to generate unique, novice-focused explanations tailored to each error. Widjojo et al. [80] study the effectiveness of Stack Overflow and LLMs at explaining compiler errors.

F. Program Repair

This category denotes the task of fixing the identified software bugs. The high frequency of repair-related studies can be attributed to the close relationship between this task and the source code. With their advanced natural language processing and understanding capabilities, LLMs are well-equipped to process and analyze source code, making them an ideal tool for performing code-related tasks such as fixing bugs.
There have been template-based [143], heuristic-based [144], and constraint-based [145], [146] automatic program repair techniques. With the development of deep learning techniques in the past few years, there have also been several studies employing deep learning techniques for program repair. They typically adopt deep learning models to take a buggy software program as input and generate a patched program. Based on the training data, they build a neural network model that learns the relations between the buggy code and the corresponding fixed code. Nevertheless, these techniques still fail to fix a large portion of bugs, and they typically have to generate hundreds to thousands of candidate patches and take hours to validate these patches to fix enough bugs. Furthermore, deep learning based program repair models need to be trained with huge amounts of labeled training data (typically pairs of buggy and fixed code), making it time- and effort-consuming to collect high-quality datasets. Subsequently, with the popularity and demonstrated capability of the LLMs, researchers have begun to explore LLMs for program repair.
1) Patch Single-Line Bugs: In the early era of program repair, the focus was mainly on addressing defects related to single-line code errors, which are relatively simple and do not require the repair of complex program logic. Lajkó et al. [95] propose to fine-tune the LLM with JavaScript code snippets to serve the purpose of JavaScript program repair. Zhang et al. [116] employ program slicing to extract contextual information directly related to the given buggy statement as repair ingredients from the corresponding program dependence graph, which makes the fine-tuning more focused on the buggy code. Zhang et al. [121] propose a stage-wise framework, STEAM, for patching single-line bugs, which simulates the interactive behavior of multiple programmers involved in bug management, e.g., bug reporting, bug diagnosis, patch generation, and patch verification.
Since most real-world bugs involve multiple lines of code, later studies explore these more complex situations (although some of them can also patch single-line bugs).
2) Patch Multiple-Line Bugs: The studies in this category input a buggy function to the LLM, and the goal is to output the patched function, which might involve complex semantic understanding, code hunk modification, as well as program refactoring. Earlier studies typically employ the fine-tuning strategy to enable the LLM to better understand the code semantics. Fu et al. [123] fine-tune the LLM by employing BPE tokenization to handle Out-Of-Vocabulary (OOV) issues, which enables the approach to generate new tokens that never appear in a training function but are newly introduced in the repair. Wang et al. [120] train the LLM based on both the buggy input and retrieved bug-fix examples, which are retrieved in terms of lexical and semantic similarities. The aforementioned studies (including the ones on patching single-line bugs) predict the fixed programs directly, while Hu et al. [92] utilize
a different setup that predicts the scripts that can fix the bugs when executed, with a delete and insert grammar. For example, it predicts whether an original line of code should be deleted, and what content should be inserted.
Nevertheless, fine-tuning may face limitations in terms of its reliance on abundant high-quality labeled data, significant computational resources, and the possibility of overfitting. To approach the program repair problem more effectively, later studies focus on how to design an effective prompt for program repair. Several studies empirically investigate the effectiveness of prompt variants of the latest LLMs for program repair under different repair settings and commonly-used benchmarks (which will be explored in depth later), while other studies focus on proposing new techniques. Ribeiro et al. [109] take advantage of the LLM to conduct code completion in a buggy line for patch generation, and elaborate on how to circumvent the open-ended nature of code generation to appropriately fit the new code into the original program. Xia et al. [115] propose a conversation-driven program repair approach that interleaves patch generation with instant feedback to perform the repair in a conversational style. They first feed the LLM with relevant test failure information to start with, and then learn from both failures and successes of earlier patching attempts on the same bug for more powerful repair. For earlier patches that failed to pass all tests, they combine the incorrect patches with their corresponding relevant test failure information to construct a new prompt for the LLM to generate the next patch, in order to avoid making the same mistakes. For earlier patches that passed all the tests (i.e., plausible patches), they further ask the LLM to generate alternative variations of the original plausible patches. This can further build on and learn from earlier successes to generate more plausible patches and increase the chance of having correct patches. Zhang et al. [94] propose a similar approach designed by leveraging multimodal prompts (e.g., natural language description, error message, input-output-based test cases), iterative querying, and test-case-based few-shot selection to produce repairs. Moon et al. [102] propose an approach for bug fixing with feedback. It consists of a critic model to generate feedback, an editor to edit code based on the feedback, and a feedback selector to choose the best possible feedback from the critic.
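The conversation-driven repair loop can be sketched as follows; test execution and the chat-style LLM call are injected as hypothetical callables, and the plausible-patch diversification step and prompt-length management are omitted for brevity.

from typing import Callable, Dict, List, Optional, Tuple

def conversational_repair(
    buggy_code: str,
    run_tests: Callable[[str], Tuple[bool, str]],       # returns (all_passed, failure_log) for a candidate patch
    query_llm: Callable[[List[Dict[str, str]]], str],   # chat-style LLM call over a message history
    max_attempts: int = 10,
) -> Optional[str]:
    """Interleave patch generation with test feedback in a conversational style."""
    history: List[Dict[str, str]] = [{
        "role": "user",
        "content": "Fix the bug in this function and return only the fixed code:\n" + buggy_code,
    }]
    for _ in range(max_attempts):
        patch = query_llm(history)
        passed, log = run_tests(patch)
        if passed:
            return patch                                 # a plausible patch: it passes all tests
        history.append({"role": "assistant", "content": patch})
        history.append({
            "role": "user",
            "content": "The patch still fails the tests:\n" + log +
                       "\nAvoid the previous mistake and return a new fixed version.",
        })
    return None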
Wei et al. [103] propose Repilot to copilot the AI "copilots" (i.e., LLMs) by synthesizing more valid patches during the repair process. Its key insight is that many LLMs produce outputs autoregressively (i.e., token by token) and, by resembling how humans write programs, the repair can be significantly boosted and guided through a completion engine. Brownlee et al. [105] propose to use the LLM as a mutation operator for search-based program repair techniques.
3) Repair With Static Code Analyzer: Most of the program repair studies suppose the bug has already been detected, while Jin et al. [114] propose a program repair framework paired with a static analyzer to first detect the bugs and then fix them. In detail, the static analyzer first detects an error (e.g., a null pointer dereference), and the context information provided by the static analyzer is sent into the LLM for querying the patch for this specific error. Wadhwa et al. [110] focus on a similar task, and additionally employ an LLM as a ranker to assess the likelihood of acceptance of generated patches, which can effectively catch plausible but incorrect fixes and reduce developer burden.
4) Repair for Specific Bugs: The aforementioned studies all consider the buggy code as the input for automatic program repair, while other studies conduct program repair in terms of other types of bug descriptions, specific types of bugs, etc. Fakhoury et al. [122] focus on program repair from natural language issue descriptions, i.e., generating the patch with the bug and fix-related information described in the issue reports. Garg et al. [119] aim at repairing performance issues, in which they first retrieve a prompt instruction from a pre-constructed knowledge base of previous performance bug fixes and then generate a repair prompt using the retrieved instruction. There are also studies focusing on the bug fixing of Rust programs [108] or OCaml programs (an industrial-strength programming language) [111].

TABLE IV
PERFORMANCE OF PROGRAM REPAIR

5) Empirical Study About Program Repair: There are several studies related to the empirical or experimental evaluation of the various LLMs on program repair, and we summarize the performance in Table IV. Jiang et al. [113], Xia et al. [93], and Zhang et al. [118] respectively conduct comprehensive experimental evaluations with various LLMs and on different automated program repair benchmarks, while other
Fig. 7. Distribution of how the LLMs are used (note that a study can involve multiple types of prompt engineering).
(e.g., UI screenshots, programming screencasts) in software testing tasks.

B. Types of Prompt Engineering

As shown in Fig. 7, among our collected studies, 38 studies utilize the LLMs through the pre-training or fine-tuning schema, while 64 studies employ prompt engineering to communicate with LLMs and steer their behavior toward desired outcomes without updating the model weights. When using the early LLMs, their performance might not be as impressive, so researchers often use pre-training or fine-tuning techniques to adjust the models for specific domains and tasks in order to improve their performance. Then, with the upgrading of LLM technology, especially with the introduction of GPT-3 and later LLMs, the knowledge contained within the models and their understanding/inference capability increased significantly. Therefore, researchers now typically rely on prompt engineering and consider how to design appropriate prompts to stimulate the model's knowledge.
Among the 64 studies with prompt engineering, 51 studies involve zero-shot learning, and 25 studies involve few-shot learning (a study may involve multiple types). There are also studies involving chain-of-thought (7 studies), self-consistency (1 study), and automatic prompting (1 study).
Zero-shot learning simply feeds the task text to the model and asks for results. Many of the collected studies employ Codex, CodeT5, and CodeGen (as shown in Section V-A), which are already trained on source code. Hence, for tasks dealing with source code, like unit test case generation and program repair as demonstrated in previous sections, directly querying the LLM with prompts is the common practice. There are generally two manners of zero-shot learning, i.e., with and without instructions. For example, Xie et al. [36] provide the LLMs with instructions such as "please help me generate a JUnit test for a specific Java method..." to facilitate the unit test case generation. In contrast, Siddiq et al. [39] only provide the code header of the unit test case (e.g., "class ${className}${suffix}Test {"), and the LLMs carry out the unit test case generation automatically. Generally speaking, prompts with clear instructions will yield more accurate results, while prompts without instructions are typically suitable for very specific situations.
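These two manners can be contrasted with the following minimal sketch of prompt construction; the instruction wording, placeholder names, and header template are illustrative assumptions rather than the exact prompts used by the cited tools.

def instruction_prompt(method_source: str) -> str:
    """Zero-shot prompt with an explicit instruction."""
    return ("Please help me generate a JUnit test for the following Java method. "
            "Return only the test class.\n\n" + method_source)

def header_only_prompt(method_source: str, class_name: str, suffix: str = "") -> str:
    """Zero-shot prompt without an instruction: only the code context and the test class header."""
    return (method_source + "\n\n"
            "class " + class_name + suffix + "Test {\n")   # the LLM is expected to complete the body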
Few-shot learning presents a set of high-quality demonstrations, each consisting of both an input and the desired output, on the target task. As the model first sees the examples, it can better understand human intention and the criteria for what kinds of answers are wanted, which is especially important for tasks that are not so straightforward or intuitive to the LLM. For example, when conducting automatic test generation from general bug reports, Kang et al. [78] provide examples of bug reports (questions) and the corresponding bug reproducing tests (answers) to the LLM, and their results show that providing two examples achieves higher performance than providing no examples or other numbers of examples. In another example, for test assertion generation, Nashid et al. [47] provide demonstrations of the focal method, the test method containing an <AssertPlaceholder>, and the expected assertion, which enables the LLMs to better understand the task.
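A hedged sketch of how such a few-shot prompt can be assembled is shown below; the demonstration format simply mirrors the description above, and the exact template used by the cited work may differ.

from typing import List, Tuple

def few_shot_assertion_prompt(demos: List[Tuple[str, str, str]],
                              focal_method: str,
                              test_method: str) -> str:
    """Build a few-shot prompt from (focal method, test with <AssertPlaceholder>, expected assertion) demos."""
    parts = []
    for demo_focal, demo_test, demo_assertion in demos:
        parts.append("Focal method:\n" + demo_focal +
                     "\nTest method:\n" + demo_test +
                     "\nAssertion:\n" + demo_assertion + "\n")
    parts.append("Focal method:\n" + focal_method +
                 "\nTest method:\n" + test_method +
                 "\nAssertion:\n")                        # the LLM completes the missing assertion
    return "\n".join(parts)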
Chain-of-thought (CoT) prompting generates a sequence of short sentences that describe the reasoning logic step by step (also known as reasoning chains or rationales) for the LLMs to produce the final answer. For example, for program repair from natural language issue descriptions [122], given the buggy code and the issue report, the authors first ask the LLM to localize the bug, then ask it to explain why the localized lines are buggy, and finally ask the LLM to fix the bug. Another example is generating unusual programs for fuzzing deep learning libraries, where Deng et al. [58] first generate a possible "bug" (bug description) before generating the actual "bug-triggering" code snippet that invokes the target API. The predicted bug description provides an additional hint to the LLM, indicating that the generated code should try to cover specific potential buggy behavior.
Self-consistency involves evaluating the coherence and consistency of the LLM's responses on the same input in different contexts. There is one study with this prompt type, and it is about debugging. Kang et al. [83] employ a hypothesize-observe-conclude loop, which first generates a hypothesis about what the bug is and constructs an experiment to verify it using an LLM, then decides whether the hypothesis is correct based on the experiment result (obtained with a debugger or code execution) using an LLM; after that, depending on the conclusion, it either starts with a new hypothesis or opts to terminate the debugging process and generate a fix.
Automatic prompting aims to automatically generate and select the appropriate instruction for the LLMs, instead of requiring
Fig. 8. Mapping between testing tasks and how LLMs are used.
Fig. 9. Input of LLM.
Fig. 10. Distribution of other techniques incorporated with LLMs (note that a study can involve multiple types).
studies are related to the unit test case generation and program repair, since in these scenarios, the running information can be acquired easily.
When testing mobile apps, since the utilized LLM could not understand the image of the GUI page, the view hierarchy file which represents the details of the GUI page usually acts as the input to LLMs. Nevertheless, with the emergence of GPT-4, which is a multimodal model and accepts both image and text inputs, the GUI screenshots might be directly utilized as the LLM's input.

D. Incorporating Other Techniques With LLM

There are divided opinions on whether the LLM has reached an all-powerful status that requires no other techniques. As shown in Fig. 10, among our collected studies, 67 of them utilize LLMs to address the entire testing task, while 35 studies incorporate additional techniques. These techniques include mutation testing, differential testing, syntactic checking, program analysis, statistical analysis, etc.
The reason why researchers still choose to combine LLMs with other techniques might be that, despite exhibiting enormous potential in various tasks, LLMs still possess limitations such as comprehending code semantics and handling complex program structures. Therefore, combining LLMs with other techniques balances their strengths and weaknesses to achieve better outcomes in specific scenarios. In addition, it is important to note that while LLMs are capable of generating correct code, they may not necessarily produce sufficient test cases to check for edge cases or rare scenarios. This is where mutation and other testing techniques come into play, as they allow for the generation of more diverse and complex code that can better simulate real-world scenarios. In this sense, a testing approach can incorporate a combination of different techniques, including both LLMs and other testing strategies, to ensure comprehensive coverage and effectiveness.
1) LLM + Statistical Analysis: As LLMs can often generate a multitude of outputs, manually sifting through them and identifying the correct output can be overwhelmingly laborious. As such, researchers have turned to statistical analysis techniques like ranking and clustering [28], [45], [78], [93], [116] to efficiently filter through the LLM's outputs and ultimately obtain more accurate results.
2) LLM + Program Analysis: When utilizing LLMs to accomplish tasks such as generating unit test cases and repairing software code, it is important to consider that software code inherently possesses structural information, which may not be fully understood by LLMs. Hence, researchers often utilize program analysis techniques, including code abstract syntax trees (ASTs) [74], to represent the structure of code more effectively and increase the LLM's ability to comprehend the code accurately. Researchers also perform structure-based subsetting of code lines to narrow the focus for the LLM [94], or extract additional code context from other code files [7], to enable the models to focus on the most task-relevant information in the codebase and lead to more accurate predictions.
3) LLM + Mutation Testing: This combination mainly targets generating more diversified test inputs. For example, Deng et al. [59] first use the LLM to generate the seed programs (e.g., code snippets using a target DL API) for fuzzing deep learning libraries. To enrich the pool of these test programs, they replace parts of the seed program with masked tokens using mutation operators (e.g., replacing the API call arguments with the span token) to produce masked inputs, and again utilize the LLMs to perform code infilling to generate new code that replaces the masked tokens.
achieve better outcomes in specific scenarios. In addition, it is processing tasks, the generated code from these models can
important to note that while LLMs are capable of generating sometimes be syntactically incorrect, leading to potential errors
correct code, they may not necessarily produce sufficient test and reduced usability. Therefore, researchers have proposed
cases to check for edge cases or rare scenarios. This is where to leverage syntax checking to identify and correct errors in
mutation and other testing techniques come into play, as they the generated code. For example, in their work for unit test
allow for the generation of more diverse and complex code that case generation, Alagarsamy et al. [29] additionally introduce
can better simulate real-world scenarios. Taken in this sense, a verification method to check and repair the naming con-
a testing approach can incorporate a combination of different sistency (i.e., revising the test method name to be consistent
techniques, including both LLMs and other testing strategies, with the focal method name) and the test signatures (i.e.,
to ensure comprehensive coverage and effectiveness. adding missing keywords like public, void, or @test annota-
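For Python targets, such a syntactic gate can be as simple as the following sketch, which parses each candidate and asks the (injected) LLM to repair candidates that fail to parse; rule-based fixes such as the naming and annotation repairs described above would be applied at the same point. This is an illustrative sketch, not the checkers used in the cited works.

import ast
from typing import Callable, Optional

def syntactically_valid(code: str) -> bool:
    """Return True if the generated test parses as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def check_and_repair(candidate: str,
                     query_llm: Callable[[str], str],
                     max_rounds: int = 2) -> Optional[str]:
    """Keep a generated test only if it is, or can be repaired to be, syntactically valid."""
    for _ in range(max_rounds + 1):
        if syntactically_valid(candidate):
            return candidate
        candidate = query_llm(
            "The following generated test is not syntactically valid Python. "
            "Fix the syntax without changing its intent and return only the code:\n" + candidate)
    return None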
5) LLM + Differential Testing: Differential testing is well-suited to find semantic or logic bugs that do not exhibit explicit erroneous behaviors like crashes or assertion failures. In this category of our collected studies, the LLM is mainly responsible
for generating valid and diversified inputs, while the differential testing helps to determine whether there is a triggered bug based on the software's output. For example, Ye et al. [48] first use the LLM to produce random JavaScript programs, and leverage the language specification document to generate test data, then conduct differential testing on JavaScript engines such as JavaScriptCore, ChakraCore, SpiderMonkey, QuickJS, etc. There are also studies utilizing the LLMs to generate test inputs and then conduct differential testing for fuzzing DL libraries [58], [59] and SMT solvers [63]. Li et al. [87] employ the LLM in finding the failure-inducing test cases. In detail, given a program under test, they first request the LLM to infer the intention of the program, then request the LLM to generate programs that have the same intention, which are alternative implementations of the program and are likely free of the program's bug. Then they perform differential testing with the program under test and the generated programs to find the failure-inducing test cases.
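A hedged sketch of this division of labor is shown below: the LLM (behind an injected generator callable) proposes inputs, and a plain differential check over two interchangeable implementations flags any disagreement as a candidate bug. The wrappers impl_a and impl_b are assumptions standing in for, e.g., two JavaScript engines or two SMT solvers.

from typing import Callable, List

def differential_test(
    generate_inputs: Callable[[int], List[str]],   # LLM-backed input generator
    impl_a: Callable[[str], object],               # e.g., a wrapper around one JavaScript engine
    impl_b: Callable[[str], object],               # a second, independent implementation
    n_inputs: int = 100,
) -> List[str]:
    """Run the same LLM-generated inputs through two implementations and report disagreements."""
    suspicious = []
    for test_input in generate_inputs(n_inputs):
        try:
            out_a, out_b = impl_a(test_input), impl_b(test_input)
        except Exception:                          # a crash in either implementation is also suspicious
            suspicious.append(test_input)
            continue
        if out_a != out_b:                         # semantic divergence: a candidate bug without an explicit oracle
            suspicious.append(test_input)
    return suspicious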
SMT solvers, Sun et al. adopt differential testing which involves
VI. CHALLENGES AND OPPORTUNITIES comparing the results of multiple SMT solvers (i.e., Z3, cvc5,
and Bitwuzla) on the same generated test formulas by LLM
Based on the above analysis from the viewpoints of software
[63]. However, this approach is limited to systems where coun-
testing and LLM, we summarize the challenges and opportuni-
terpart software or running environment can easily be found,
ties when conducting software testing with LLM.
potentially restricting its applicability. Moreover, to mitigate
the oracle problem, other studies only focus on the crash bugs
A. Challenges
which are easily observed automatically. This is particularly the
As indicated by this survey, software testing with LLMs has case for mobile applications testing, in which the LLMs guide
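For readers who want to reproduce such coverage measurements on their own LLM-generated suites, a minimal sketch using the coverage.py command line is shown below; the package name myapp and the tests/llm_generated directory are placeholders.

    import subprocess

    # Assumed layout: LLM-generated tests live in tests/llm_generated/ and
    # exercise the package `myapp`; both names are placeholders.
    def measure_coverage():
        """Run the generated tests under coverage.py and print line/branch coverage."""
        subprocess.run(["coverage", "run", "--branch", "--source=myapp",
                        "-m", "pytest", "tests/llm_generated", "-q"], check=False)
        subprocess.run(["coverage", "report", "-m"], check=False)

    if __name__ == "__main__":
        measure_coverage()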
From our collected studies, we observe that researchers often utilize mutation testing together with the LLMs to generate more diversified outputs. For example, when fuzzing a DL library, instead of directly generating the code snippet with an LLM, Deng et al. [59] replace parts of the selected seed (code generated by the LLM) with masked tokens using different mutation operators to produce masked inputs. They then leverage the LLM to perform code infilling to generate new code that replaces the masked tokens, which can significantly increase the diversity of the generated tests. Liu et al. [65] leverage an LLM to produce the test generators (each of which can yield a batch of unusual text inputs under the same mutation rule) together with the mutation rules for text-oriented fuzzing, which reduces the human effort required for designing mutation rules.
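A simplified sketch of this mask-then-infill style of mutation is shown below; the llm_infill function stands in for a generic code-infilling model, and the sketch is not the exact pipeline of [59] or [65].

    import random

    MASK = "<MASK>"

    def mask_seed(seed_lines, n_masks=1):
        """Replace randomly chosen lines of an LLM-generated seed with a mask
        token so that an infilling model can propose new variants."""
        lines = list(seed_lines)
        for idx in random.sample(range(len(lines)), k=min(n_masks, len(lines))):
            lines[idx] = MASK
        return "\n".join(lines)

    def mutate(seed_code, llm_infill, n_variants=5):
        """Produce diversified variants of one seed via LLM code infilling."""
        variants = []
        for _ in range(n_variants):
            masked = mask_seed(seed_code.splitlines())
            # `llm_infill` is a hypothetical call that fills in the masked span.
            variants.append(llm_infill(masked, mask_token=MASK))
        return variants

Each variant keeps most of the seed's structure while letting the model rewrite the masked region, which is what yields diversity beyond simply raising the sampling temperature.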
A potential research direction could involve utilizing testing-specific data to train or fine-tune a specialized LLM designed to understand the nature of testing. By doing so, the LLM could inherently internalize the requirements of testing and autonomously generate diverse outputs.

2) Challenges in Test Oracle Problem: The oracle problem has been a longstanding challenge in various testing applications, e.g., testing machine learning systems [148] and testing deep learning libraries [59]. To alleviate the impact of the oracle problem on the overall testing activities, a common practice in our collected studies is to transform it into a more easily derived form, often by utilizing differential testing [63] or focusing only on identifying crash bugs [14].

There are successful applications of differential testing with LLMs, as shown in Fig. 10. For instance, when testing SMT solvers, Sun et al. adopt differential testing, which involves comparing the results of multiple SMT solvers (i.e., Z3, cvc5, and Bitwuzla) on the same LLM-generated test formulas [63]. However, this approach is limited to systems for which counterpart software or a comparable running environment can easily be found, potentially restricting its applicability. Moreover, to mitigate the oracle problem, other studies focus only on crash bugs, which are easily observed automatically. This is particularly the case for mobile application testing, in which the LLMs guide the testing in exploring more diversified pages, conducting more complex operational actions, and covering more meaningful operational sequences [14]. However, this significantly restricts the potential of utilizing the LLMs for uncovering various types of software bugs.

Exploring the use of LLMs to derive other types of test oracles represents an interesting and valuable research direction. Specifically, metamorphic testing is also widely used in software testing practice to help mitigate the oracle problem, yet in most cases, defining metamorphic relations relies on human ingenuity. Luu et al. [56] have examined the effectiveness of LLMs in generating metamorphic relations, yet they only experimented with straightforward prompts by directly querying ChatGPT. Further exploration, potentially incorporating human-computer interaction or domain knowledge, is highly encouraged. Another promising avenue is exploring the capability of LLMs to automatically generate test cases based on metamorphic relations, covering a wide range of inputs.
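To make the idea concrete, the sketch below encodes a metamorphic relation as an executable oracle that needs no expected output; the numeric routine and the relation sin(x) = sin(pi - x) are illustrative assumptions, not relations drawn from [56].

    import math
    import random

    def check_metamorphic_relation(func, transform, relate, inputs):
        """Generic metamorphic oracle: for each input x, check that
        relate(func(x), func(transform(x))) holds, without needing the
        expected output of func(x) itself."""
        failures = []
        for x in inputs:
            if not relate(func(x), func(transform(x))):
                failures.append(x)
        return failures

    # Illustrative relation for a numeric routine: sin(x) should equal sin(pi - x).
    inputs = [random.uniform(-10, 10) for _ in range(1000)]
    failing = check_metamorphic_relation(
        math.sin,
        transform=lambda x: math.pi - x,
        relate=lambda a, b: math.isclose(a, b, abs_tol=1e-9),
        inputs=inputs,
    )
    print(f"{len(failing)} metamorphic violations found")

An LLM-assisted workflow would replace the hand-written transform and relate arguments with relations proposed by the model and vetted by a human.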
The advancement of multimodal LLMs like GPT-4 may open up possibilities for exploring their ability to detect bugs in software user interfaces and assist in deriving test oracles. By leveraging the image understanding and reasoning capabilities of these models, one can investigate their potential to automatically identify inconsistencies, errors, or usability issues in user interfaces.

3) Challenges for Rigorous Evaluations: The lack of benchmark datasets and the potential data leakage issues associated with LLM-based techniques present challenges in conducting rigorous evaluations and comprehensive comparisons of proposed methods.

For program repair, there are only two well-known and commonly used benchmarks, i.e., Defects4J and QuixBugs, as demonstrated in Table IV. Furthermore, these datasets are not specially designed for testing the LLMs. For example, as reported by Xia et al. [93], 39 out of 40 Python bugs in the QuixBugs dataset can be fixed by Codex, yet in real-world practice the successful fix rate can be nowhere near as high. For unit test case generation, there are no widely recognized benchmarks, and different studies utilize different datasets for performance evaluation, as demonstrated in Table III. This indicates the need to build more specialized and diversified benchmarks.

Furthermore, the LLMs may have seen the widely used benchmarks in their pre-training data, raising data leakage issues. Jiang et al. [113] check CodeSearchNet and BigQuery, which are data sources of common LLMs, and find that four repositories used by the Defects4J benchmark are also in CodeSearchNet, and the whole Defects4J repository is included in BigQuery. Therefore, it is very likely that existing program repair benchmarks have been seen by the LLMs during pre-training. This data leakage issue has also been investigated in machine learning-related studies. For example, Tu et al. [149] focus on data leakage in issue tracking data, and their results show that information leaked from the "future" makes prediction models misleadingly optimistic. This reminds us that the performance of LLMs on software testing tasks may not be as good as reported in previous studies. It also suggests that we need more specialized datasets that have not been seen by LLMs to serve as benchmarks. One way is to collect them from specialized sources, e.g., user-generated content from niche online communities.
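A rough way to screen a candidate benchmark for such contamination is to measure verbatim n-gram overlap against whatever slice of the pre-training corpus is accessible, as sketched below; the 13-token window and the locally mirrored corpus files are assumptions, and a low overlap on a sample proves nothing about the full training data.

    def ngrams(tokens, n=13):
        """Return the set of n-grams of a token sequence (the 13-token window
        is an assumption, not a standard mandated by the cited studies)."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap_ratio(benchmark_code, corpus_documents, n=13):
        """Fraction of benchmark n-grams that also appear verbatim in the
        available corpus sample; high values suggest possible leakage."""
        bench = ngrams(benchmark_code.split(), n)
        if not bench:
            return 0.0
        corpus = set()
        for doc in corpus_documents:
            corpus |= ngrams(doc.split(), n)
        return len(bench & corpus) / len(bench)

    # Hypothetical usage with locally mirrored corpus files:
    # ratio = overlap_ratio(open("benchmark_sample.java").read(),
    #                       (open(p).read() for p in corpus_paths))
    # print(f"suspicious overlap: {ratio:.1%}")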
4) Challenges in Real-World Application of LLMs in Software Testing: As we mentioned in Section V-B, in the early days of using LLMs, pre-training and fine-tuning were the common practice, since the relatively small number of model parameters resulted in weaker model capabilities (e.g., T5). As time progressed, the number of model parameters increased significantly, leading to the emergence of models with greater capabilities (e.g., ChatGPT), and in recent studies prompt engineering has become a common approach. However, due to concerns regarding data privacy, in real-world practice most software organizations tend to avoid using commercial LLMs and prefer to adopt open-source ones, training or fine-tuning them with organization-specific data. Furthermore, because some companies face limitations in computational power or pay close attention to energy consumption, they tend to fine-tune medium-sized models. It is quite challenging for these models to achieve performance similar to what our collected papers have reported. For instance, on the widely used QuixBugs dataset, it has been reported that 39 out of 40 Python bugs and 34 out of 40 Java bugs can be automatically fixed [93]. However, when it comes to DL programs collected from Stack Overflow, which represent real-world coding practice, only 16 out of 72 Python bugs can be automatically fixed [89].

Recent research has highlighted the importance of high-quality training data in improving the performance of models for code-related tasks [150], yet manually building high-quality organization-specific datasets for training or fine-tuning is time-consuming and labor-intensive. To address this, one is encouraged to utilize automated mining of software repositories to build such datasets; for example, key information extraction techniques from Stack Overflow [151] offer potential solutions for automatically gathering relevant data.

In addition, exploring the methodology for better fine-tuning the LLMs with software-specific data is worth considering, because software-specific data differs from natural language data in that it contains more structural information, such as data flow and control flow. Previous research on code representations has shown the benefits of incorporating data flow, which captures the semantic-level structure of code and represents the relationship between variables in terms of "where-the-value-comes-from" [152]. These insights can provide valuable guidance for effectively fine-tuning LLMs with software-specific data.
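As a sketch of what such organization-internal adaptation might look like, the following example fine-tunes a medium-sized open-source code model on a JSONL corpus of mined focal-method/test pairs using the Hugging Face Trainer; the model name, file path, and hyperparameters are placeholders rather than recommendations from the surveyed studies.

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    MODEL = "Salesforce/codegen-350M-mono"  # placeholder medium-sized code model
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(MODEL)

    # Assumed format: one JSON object per line with a "text" field pairing a
    # focal method with its test, mined from internal repositories.
    dataset = load_dataset("json", data_files="org_test_corpus.jsonl", split="train")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=1024)

    tokenized = dataset.map(tokenize, batched=True,
                            remove_columns=dataset.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ft-test-model",
                               per_device_train_batch_size=2,
                               num_train_epochs=1,
                               learning_rate=5e-5),
        train_dataset=tokenized,
        # Causal LM objective: labels are the (shifted) input tokens themselves.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

Structural signals such as data flow would enter this pipeline through the way the "text" field is constructed, not through the trainer itself.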
B. Opportunities

There are also many research opportunities in software testing with LLMs. While not necessarily challenges, these opportunities can contribute to advancements in software testing, benefiting developers, users, practitioners, and the wider research community.

1) Exploring LLMs in the Early Stage of Testing: As shown in Fig. 4, LLMs have not been used in the early stages of testing, e.g., test requirements and test planning. There might be two main reasons behind that. The first is the subjectivity of early-stage testing tasks. Many tasks in the early stages of testing, such as requirements gathering, test plan creation, and design reviews, may involve subjective assessments that require significant input from human experts. This could make these tasks less suitable for LLMs, which rely heavily on data-driven approaches. The second might be the lack of open-sourced data for the early stages. Unlike in later stages of testing, there may be limited data available online for early-stage activities. This could mean that LLMs have not seen much of this type of data and therefore may not perform well on these tasks.

Adopting a human-computer interaction schema for tackling early-stage testing tasks would harness the domain-specific knowledge of human developers and leverage the general knowledge embedded in LLMs. Additionally, software development companies are highly encouraged to record and provide access to early-stage testing data, allowing for improved training and performance of LLMs in these critical testing activities.

2) Exploring LLMs in Other Testing Phases: We have analyzed the distribution of testing phases for the collected studies. As shown in Fig. 11, we can observe that LLMs are most commonly used in unit testing, followed by system testing. However, there is still no research on the use of LLMs in integration testing and acceptance testing. Integration testing involves testing the interfaces between different software modules. In some software organizations, integration testing might be merged with unit testing, which can be a possible reason why LLMs are rarely utilized in this phase.

LLMs are currently utilized only in the unit and system testing phases, and only for functional testing. This highlights the research opportunities for exploring the uncovered areas. Regarding how the LLMs are utilized, we find that various pre-training/fine-tuning and prompt engineering methods have been developed to enhance the capabilities of LLMs in addressing testing tasks. However, more advanced techniques in prompt design have yet to be explored and can be an avenue for future research.

This survey can serve as a roadmap for future research in this area, identifying gaps in our current understanding of the use of LLMs in software testing and highlighting potential avenues for exploration. We believe that the insights provided in this paper will be valuable to both researchers and practitioners in the field of software engineering, assisting them in leveraging LLMs to improve software testing practices and ultimately enhance the quality and reliability of software systems.

REFERENCES

[17] W. X. Zhao et al., "A survey of large language models," 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.18223
[18] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," in Proc. NeurIPS, 2022. [Online]. Available: http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html
[19] J. Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," in Proc. NeurIPS, 2022. [Online]. Available: http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
[20] J. Li, G. Li, Y. Li, and Z. Jin, "Structured chain-of-thought prompting for code generation," 2023, arXiv:2305.06599.
[21] J. Li, Y. Li, G. Li, Z. Jin, Y. Hao, and X. Hu, "Skcoder: A sketch-based approach for automatic code generation," in Proc. IEEE/ACM 45th Int. Conf. Softw. Eng. (ICSE), 2023, pp. 2124–2135.
[22] J. Li, Y. Zhao, Y. Li, G. Li, and Z. Jin, "AceCoder: Utilizing existing code to enhance code generation," 2023, arXiv:2303.17780.
[23] Y. Dong, X. Jiang, Z. Jin, and G. Li, "Self-collaboration code generation via chatGPT," 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2304.07590
[24] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, "Unifying large language models and knowledge graphs: A roadmap," 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.08302
[1] G. J. Myers, T. Badgett, T. M. Thomas, and C. Sandler, The Art of [25] M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan,
Software Testing, 2nd ed. Hoboken, NJ, USA: Wiley, 2004. “Unit test case generation with transformers and focal context,” 2020,
[2] M. Pezze and M. Young, Software Testing and Analysis—Process, arXiv:2009.05617.
Principles and Techniques. Hoboken, NJ, USA: Wiley, 2007. [26] B. Chen et al., “Codet: Code generation with generated tests,” 2022,
[3] M. Harman and P. McMinn, “A theoretical and empirical study of arXiv:2207.10397.
search-based testing: Local, global, and hybrid search,” IEEE Trans. [27] S. K. Lahiri et al., “Interactive code generation via test-driven user-
Softw. Eng., vol. 36, no. 2, pp. 226–247, Mar./Apr. 2010.
intent formalization,” 2022, arXiv:2208.05950.
[4] P. Delgado-Pérez, A. Ramírez, K. J. Valle-Gómez, I. Medina-Bulo, and
[28] S. Alagarsamy, C. Tantithamthavorn, and A. Aleti, “A3test: Assertion-
J. R. Romero, “InterEvo-TR: Interactive evolutionary test generation
augmented automated test case generation,” 2023, arXiv:2302.10352.
with readability assessment,” IEEE Trans. Softw. Eng., vol. 49, no. 4,
[29] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation
pp. 2580–2596, Apr. 2023.
[5] X. Xiao, S. Li, T. Xie, and N. Tillmann, “Characteristic studies of of using large language models for automated unit test generation,”
loop problems for structural test generation via symbolic execution,” IEEE Trans. Softw. Eng., vol. 50, no. 1, pp. 85–105, Jan. 2024.
in Proc. 28th IEEE/ACM Int. Conf. Automated Softw. Eng., (ASE), [30] V. Guilherme and A. Vincenzi, “An initial investigation of chatGPT
Silicon Valley, CA, USA, E. Denney, T. Bultan, and A. Zeller, Eds., unit test generation capability,” in Proc. 8th Brazilian Symp. Systematic
Piscataway, NJ, USA: IEEE Press, Nov. 2013, pp. 246–256. Automated Softw. Testing (SAST), Campo Grande, Brazil, A. L. Fontão
[6] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Feedback-directed et al., Eds., New York, NY, USA: ACM, Sep. 2023, pp. 15–24.
random test generation,” in Proc. 29th Int. Conf. Softw. Eng. (ICSE), [31] S. Hashtroudi, J. Shin, H. Hemmati, and S. Wang, “Automated test case
Minneapolis, MN, USA, Los Alamitos, CA, USA: IEEE Comput. Soc. generation using code models and domain adaptation,” 2023. [Online].
Press, May 2007, pp. 75–84. Available: https://doi.org/10.48550/arXiv.2308.08033
[7] Z. Yuan, et al., “No more manual tests? Evaluating and improving [32] L. Plein, W. C. Ouédraogo, J. Klein, and T. F. Bissyandé, “Automatic
chatGPT for unit test generation,” 2023, arXiv:2305.04207. generation of test cases based on bug reports: A feasibility study with
[8] Y. Tang, Z. Liu, Z. Zhou, and X. Luo, “ChatGPT vs SBST: A large language models,” 2023. [Online]. Available: https://doi.org/10.
comparative assessment of unit test suite generation,” 2023. [Online]. 48550/arXiv.2310.06320
Available: https://doi.org/10.48550/arXiv.2307.00588 [33] V. Vikram, C. Lemieux, and R. Padhye, “Can large language models
[9] Android Developers, “Ui/application exerciser monkey,” 2012. Ac- write good property-based tests?” 2023. [Online]. Available: https://
cessed: Dec. 27, 2023. [Online]. Available: https://developer.android. doi.org/10.48550/arXiv.2307.04346
google.cn/studio/test/other-testing-tools/monkey [34] N. Rao, K. Jain, U. Alon, C. L. Goues, and V. J. Hellendoorn, “CAT-
[10] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI- LM training language models on aligned code and tests,” in Proc.
guided test input generator for android,” in Proc. IEEE/ACM 39th Int. 38th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Luxembourg,
Conf. Softw. Eng. Companion (ICSE), Piscataway, NJ, USA: IEEE Piscataway, NJ, USA: IEEE Press, Sep. 2023, pp. 409–420, doi:
Press, 2017, pp. 23–26. 10.1109/ASE56229.2023.00193.
[11] T. Su et al., “Guided, stochastic model-based gui testing of an- [35] Z. Xie, Y. Chen, C. Zhi, S. Deng, and J. Yin, “Chatunitest: A chatGPT-
droid apps,” in Proc. 11th Joint Meeting Found. Softw. Eng., 2017, based automated unit test generation tool,” 2023, arXiv:2305.04764.
pp. 245–256.
[36] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “CodaMosa: Escaping
[12] Z. Dong, M. Böhme, L. Cojocaru, and A. Roychoudhury, “Time-travel
coverage plateaus in test generation with pre-trained large language
testing of android apps,” in Proc. IEEE/ACM 42nd Int. Conf. Softw.
models,” in Proc. Int. Conf. Softw. Eng. (ICSE), 2023, pp. 919–931.
Eng. (ICSE), Piscataway, NJ, USA: IEEE Press, 2020, pp. 481–492.
[37] A. M. Dakhel, A. Nikanjam, V. Majdinasab, F. Khomh, and M. C.
[13] M. Pan, A. Huang, G. Wang, T. Zhang, and X. Li, “Reinforcement
learning based curiosity-driven testing of android applications,” in Desmarais, “Effective test generation using pre-trained large language
Proc. 29th ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2020, models and mutation testing,” 2023. [Online]. Available: https://doi.
pp. 153–164. org/10.48550/arXiv.2308.16557
[14] Z. Liu et al., “Make LLM a testing expert: Bringing human-like [38] M. L. Siddiq, J. Santos, R. H. Tanvir, N. Ulfat, F. A. Rifat, and V.
interaction to mobile GUI testing via functionality-aware decisions,” C. Lopes, “Exploring the effectiveness of large language models in
2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.15780 generating unit tests,” 2023, arXiv:2305.00418.
[15] T. Su, J. Wang, and Z. Su, “Benchmarking automated GUI testing for [39] Y. Zhang, W. Song, Z. Ji, D. Yao, and N. Meng, “How well does LLM
android against real-world bugs,” in Proc. 29th ACM Joint Eur. Softw. generate security tests?” 2023. [Online]. Available: https://doi.org/10.
Eng. Conf. Symp. Found. Softw. Eng. (ESEC/FSE), Athens, Greece, 48550/arXiv.2310.00710
New York, NY, USA: ACM, Aug. 2021, pp. 119–130. [40] V. Li and N. Doiron, “Prompting code interpreter to write better unit
[16] M. Shanahan, “Talking about large language models,” 2022. [Online]. tests on quixbugs functions,” 2023. [Online]. Available: https://doi.org/
Available: https://doi.org/10.48550/arXiv.2212.03551 10.48550/arXiv.2310.00483
[41] B. Steenhoek, M. Tufano, N. Sundaresan, and A. Svyatkovskiy, “Rein- [64] Z. Liu et al., “Testing the limits: Unusual text inputs generation for
forcement learning from automatic feedback for high-quality unit test mobile app crash detection with large language model,” 2023. [Online].
generation,” 2023, arXiv:2310.02368. Available: https://doi.org/10.48550/arXiv.2310.15657
[42] S. Bhatia, T. Gandhi, D. Kumar, and P. Jalote, “Unit test generation [65] C. Zhang et al., “Understanding large language model based fuzz driver
using generative AI: A comparative performance analysis of autogen- generation,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.
eration tools,” 2023, arXiv:2312.10622. 2307.12469
[43] M. Tufano, D. Drain, A. Svyatkovskiy, and N. Sundaresan, “Gener- [66] C. Xia, M. Paltenghi, J. Tian, M. Pradel, and L. Zhang, “Universal
ating accurate assert statements for unit test cases using pretrained fuzzing via large language models,” 2023. [Online]. Available: https://
transformers,” in Proc. 3rd ACM/IEEE Int. Conf. Automat. Softw. Test, api.semanticscholar.org/CorpusID:260735598
2022, pp. 54–64. [67] C. Tsigkanos, P. Rani, S. Müller, and T. Kehrer, “Variable discovery
[44] P. Nie, R. Banerjee, J. J. Li, R. J. Mooney, and M. Gligoric, “Learning with large language models for metamorphic testing of scientific
deep semantics for test completion,” 2023, arXiv:2302.10166. software,” in Proc. 23rd Int. Conf. Comput. Sci. (ICCS), Prague,
[45] A. Mastropaolo et al., “Using transfer learning for code-related tasks,” Czech Republic, J. Mikyska, C. de Mulatier, M. Paszynski, V. V.
IEEE Trans. Softw. Eng., vol. 49, no. 4, pp. 1580–1598, Apr. 2023, Krzhizhanovskaya, J. J. Dongarra, and P. M. A. Sloot, Eds., vol. 14073.
doi: 10.1109/TSE.2022.3183297. Springer, Jul. 2023, pp. 321–335, doi: 10.1007/978-3-031-35995-8_23.
[46] N. Nashid, M. Sintaha, and A. Mesbah, “Retrieval-based prompt [68] C. Yang et al., “White-box compiler fuzzing empowered by large
selection for code-related few-shot learning,” in Proc. 45th Int. Conf. language models,” 2023. [Online]. Available: https://doi.org/10.48550/
Softw. Eng. (ICSE), 2023, pp. 2450–2462. arXiv.2310.15991
[47] G. Ye et al., “Automated conformance testing for javascript engines [69] T. Zhang, I. C. Irsan, F. Thung, D. Han, D. Lo, and L. Jiang, “iTiger:
via deep compiler fuzzing,” in Proc. 42nd ACM SIGPLAN Int. Conf. An automatic issue title generation tool,” in Proc. 30th ACM Joint Eur.
Program. Lang. Des. Implementation, 2021, pp. 435–450. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2022, pp. 1637–1641.
[48] Z. Liu et al., “Fill in the blank: Context-aware automated text input [70] Y. Huang et al., “Crashtranslator: Automatically reproducing mobile
generation for mobile gui testing,” 2022, arXiv:2212.04732. application crashes directly from stack trace,” 2023. [Online]. Avail-
[49] M. R. Taesiri, F. Macklon, Y. Wang, H. Shen, and C.-P. Bezemer, able: https://doi.org/10.48550/arXiv.2310.07128
“Large language models are pretty good zero-shot video game bug [71] T. Zhang, I. C. Irsan, F. Thung, and D. Lo, “Cupid: Leveraging chatGPT
detectors,” 2022, arXiv:2210.02506. for more accurate duplicate bug report detection,” 2023. [Online].
[50] S. L. Shrestha and C. Csallner, “SlGPT: Using transfer learning to Available: https://doi.org/10.48550/arXiv.2308.10022
directly generate simulink model files and find bugs in the simulink [72] U. Mukherjee and M. M. Rahman, “Employing deep learning and
toolchain,” in Proc. Eval. Assessment Softw. Eng., 2021, pp. 260–265. structured information retrieval to answer clarification questions on bug
[51] J. Hu, Q. Zhang, and H. Yin, “Augmenting greybox fuzzing with reports,” 2023, arXiv:2304.12494.
generative AI,” 2023. [Online]. Available: https://doi.org/10.48550/ [73] P. Mahbub, O. Shuvo, and M. M. Rahman, “Explaining software
arXiv.2306.06782
bugs leveraging code structures in neural machine translation,” 2022,
[52] A. Mathur, S. Pradhan, P. Soni, D. Patel, and R. Regunathan, “Auto-
arXiv:2212.04584.
mated test case generation using t5 and GPT-3,” in Proc. 9th Int. Conf.
[74] S. Feng and C. Chen, “Prompting is all your need: Automated android
Adv. Comput. Commun. Syst. (ICACCS), vol. 1, 2023, pp. 1986–1992.
bug replay with large language models,” 2023. [Online]. Available:
[53] D. Zimmermann and A. Koziolek, “Automating GUI-based software
https://doi.org/10.48550/arXiv.2306.01987
testing with GPT-3,” in Proc. IEEE Int. Conf. Softw. Testing, Verifica-
[75] Y. Su, Z. Han, Z. Gao, Z. Xing, Q. Lu, and X. Xu, “Still confusing for
tion Validation Workshops (ICSTW), 2023, pp. 62–65.
bug-component triaging? Deep feature learning and ensemble setting to
[54] M. Taeb, A. Swearngin, E. Schoop, R. Cheng, Y. Jiang, and J. Nichols,
rescue,” in Proc. 31st IEEE/ACM Int. Conf. Program Comprehension
“Axnav: Replaying accessibility tests from natural language,” 2023.
(ICPC), Melbourne, Australia, Piscataway, NJ, USA: IEEE Press, May
[Online]. Available: https://doi.org/10.48550/arXiv.2310.02424
2023, pp. 316–327, doi: 10.1109/ICPC58990.2023.00046.
[55] Q. Luu, H. Liu, and T. Y. Chen, “Can chatGPT advance software
testing intelligence? An experience report on metamorphic testing,” [76] N. D. Bui, Y. Wang, and S. Hoi, “Detect-localize-repair: A
2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.19204 unified framework for learning to debug with codet5,” 2022,
[56] A. Khanfir, R. Degiovanni, M. Papadakis, and Y. L. Traon, “Ef- arXiv:2211.14875.
ficient mutation testing via pre-trained language models,” 2023, [77] S. Kang, J. Yoon, and S. Yoo, “Large language models are few-
arXiv:2301.03543. shot testers: Exploring llm-based general bug reproduction,” 2022,
[57] Y. Deng, C. S. Xia, C. Yang, S. D. Zhang, S. Yang, and L. Zhang, arXiv:2209.11515.
“Large language models are edge-case fuzzers: Testing deep learning [78] S. Kang, G. An, and S. Yoo, “A preliminary evaluation of LLM-based
libraries via fuzzGPT,” 2023, arXiv:2304.02014. fault localization,” 2023. [Online]. Available: https://doi.org/10.48550/
[58] Y. Deng, C. S. Xia, C. Yang, S. D. Zhang, S. Yang, and L. Zhang, arXiv.2308.05487
“Large language models are zero shot fuzzers: Fuzzing deep learning [79] P. Widjojo and C. Treude, “Addressing compiler errors: Stack overflow
libraries via large language models,” 2023, arXiv:2209.11515. or large language models?” 2023. [Online]. Available: https://doi.org/
[59] J. Ackerman and G. Cybenko, “Large language models for fuzzing 10.48550/arXiv.2307.10793
parsers (registered report),” in Proc. 2nd Int. Fuzzing Workshop [80] L. Plein and T. F. Bissyandé, “Can LLMs demystify bug reports?”
(FUZZING) Seattle, WA, USA, M. Böhme, Y. Noller, B. Ray, and 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.06310
L. Szekeres, Eds., New York, NY, USA: ACM, Jul. 2023, pp. 31–38, [81] A. Taylor, A. Vassar, J. Renzella, and H. A. Pearce, “DCC—Help:
doi: 10.1145/3605157.3605173. Generating context-aware compiler error explanations with large lan-
[60] S. Yu, C. Fang, Y. Ling, C. Wu, and Z. Chen, “LLM for test script guage models,” 2023. [Online]. Available: https://api.semanticscholar.
generation and migration: Challenges, capabilities, and opportunities,” org/CorpusID:261076439
2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.13574 [82] S. Kang, B. Chen, S. Yoo, and J.-G. Lou, “Explainable automated
[61] G. Deng et al., “PentestGPT: An llm-empowered automatic penetration debugging via large language model-driven scientific debugging,” 2023,
testing tool,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv. arXiv:2304.02195.
2308.06782 [83] A. Z. H. Yang, R. Martins, C. L. Goues, and V. J. Hellendoorn,
[62] M. Sun, Y. Yang, Y. Wang, M. Wen, H. Jia, and Y. Zhou, “SMT “Large language models for test-free fault localization,” 2023. [Online].
solver validation empowered by large pre-trained language models,” Available: https://doi.org/10.48550/arXiv.2310.01726
in Proc. 38th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), [84] Y. Wu, Z. Li, J. M. Zhang, M. Papadakis, M. Harman, and Y.
Luxembourg, Piscataway, NJ, USA: IEEE Press, 2023, pp. 1288–1300, Liu, “Large language models in fault localisation,” 2023. [Online].
doi: 10.1109/ASE56229.2023.00180. Available: https://doi.org/10.48550/arXiv.2308.15276
[63] Y. Deng, J. Yao, Z. Tu, X. Zheng, M. Zhang, and T. Zhang, “Target: Au- [85] H. Tu, Z. Zhou, H. Jiang, I. N. B. Yusuf, Y. Li, and L. Jiang,
tomated scenario generation from traffic rules for testing autonomous “LLM4CBI: Taming llms to generate effective test programs for
vehicles,” 2023. [Online]. Available: https://api.semanticscholar.org/ compiler bug isolation,” 2023. [Online]. Available: https://doi.org/10.
CorpusID:258588387 48550/arXiv.2307.00593
[86] T.-O. Li et al., “Nuances are the key: Unlocking chatGPT to find [107] P. Deligiannis, A. Lal, N. Mehrotra, and A. Rastogi, “Fixing rust
failure-inducing tests with differential prompting,” in Proc. 38th compilation errors using LLMs,” 2023. [Online]. Available: https://doi.
IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), 2023, pp. 14–26. org/10.48550/arXiv.2308.05177
[87] X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large language [108] F. Ribeiro, R. Abreu, and J. Saraiva, “Framing program repair as code
models to self-debug,” 2023. [Online]. Available: https://doi.org/10. completion,” in Proc. 3rd Int. Workshop Automated Program Repair,
48550/arXiv.2304.05128 2022, pp. 38–45.
[88] J. Cao, M. Li, M. Wen, and S.-c. Cheung, “A study on prompt design, [109] N. Wadhwa et al., “Frustrated with code quality issues? LLMs can
advantages and limitations of chatGPT for deep learning program help!” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.
repair,” 2023, arXiv:2304.08191. 12938
[89] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Ex- [110] F. Ribeiro, J. N. C. de Macedo, K. Tsushima, R. Abreu, and J.
amining zero-shot vulnerability repair with large language models,” in Saraiva, “GPT-3-powered type error debugging: Investigating the use of
Proc. IEEE Symp. Secur. Privacy (SP), Los Alamitos, CA, USA: IEEE large language models for code repair,” in Proc. 16th ACM SIGPLAN
Comput. Soc., 2022, pp. 1–18. Int. Conf. Softw. Lang. Eng. (SLE), Cascais, Portugal, J. Saraiva, T.
[90] Z. Fan, X. Gao, A. Roychoudhury, and S. H. Tan, “Automated repair Degueule, and E. Scott, Eds., New York, NY, USA: ACM, Oct. 2023,
of programs from large language models,” 2022, arXiv:2205.10583. pp. 111–124, doi: 10.1145/3623476.3623522.
[91] Y. Hu, X. Shi, Q. Zhou, and L. Pike, “Fix bugs with transformer [111] Y. Wu et al., “How effective are neural networks for fixing security
through a neural-symbolic edit grammar,” 2022, arXiv:2204.06643. vulnerabilities,” 2023, arXiv:2305.18607.
[92] C. S. Xia, Y. Wei, and L. Zhang, “Practical program repair in the era [112] N. Jiang, K. Liu, T. Lutellier, and L. Tan, “Impact of code language
of large pre-trained language models,” 2022, arXiv:2210.14179. models on automated program repair,” 2023, arXiv:2302.05020.
[93] J. Zhang et al., “Repairing bugs in python assignments using large [113] M. Jin et al., “Inferfix: End-to-end program repair with LLMs,” 2023,
language models,” 2022, arXiv:2209.14876. arXiv:2303.07263.
[94] M. Lajkó, V. Csuvik, and L. Vidács, “Towards javascript program [114] C. S. Xia and L. Zhang, “Keep the conversation going: Fixing 162 out
repair with generative pre-trained transformer (GPT-2),” in Proc. 3rd of 337 bugs for $0.42 each using chatGPT,” 2023, arXiv:2304.00385.
Int. Workshop Automated Program Repair, 2022, pp. 61–68. [115] Y. Zhang, G. Li, Z. Jin, and Y. Xing, “Neural program repair with
[95] D. Sobania, M. Briesch, C. Hanna, and J. Petke, “An analysis of the program dependence analysis and effective filter mechanism,” 2023,
automatic bug fixing performance of chat,” 2023, arXiv:2301.08653. arXiv:2305.09315.
[96] K. Huang et al., “An empirical study on fine-tuning large lan- [116] J. A. Prenner and R. Robbes, “Out of context: How important is local
guage models of code for automated program repair,” in Proc. 38th context in neural program repair?” 2023, arXiv:2312.04986.
IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Luxembourg, [117] Q. Zhang, C. Fang, B. Yu, W. Sun, T. Zhang, and Z. Chen, “Pre-
Piscataway, NJ, USA: IEEE Press, Sep. 2023, pp. 1162–1174, doi: trained model-based automated software vulnerability repair: How far
10.1109/ASE56229.2023.00181. are we?” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.
[97] M. C. Wuisang, M. Kurniawan, K. A. Wira Santosa, A. Agung Santoso
2308.12533
Gunawan, and K. E. Saputra, “An evaluation of the effectiveness
[118] S. Garg, R. Z. Moghaddam, and N. Sundaresan, “Rapgen: An approach
of openai’s chatGPT for automated python program bug fixing us-
for fixing code inefficiencies in zero-shot,” 2023. [Online]. Available:
ing quixbugs,” in Proc. Int. Seminar Appl. Technol. Inf. Commun.
https://doi.org/10.48550/arXiv.2306.17077
(iSemantic), 2023, pp. 295–300.
[119] W. Wang, Y. Wang, S. Joty, and S. C. H. Hoi, “Rap-gen: Retrieval-
[98] D. Horváth, V. Csuvik, T. Gyimóthy, and L. Vidács, “An extensive
augmented patch generation with codet5 for automatic program repair,”
study on model architecture and program representation in the domain
in Proc. 31st ACM Joint Eur. Softw. Eng. Conf. Symp. Found. Softw.
of learning-based automated program repair,” in Proc. IEEE/ACM
Eng. (ESEC/FSE), San Francisco, CA, USA, S. Chandra, K. Blincoe,
Int. Workshop Automated Program Repair (APR@ICSE), Melbourne,
and P. Tonella, Eds., New York, NY, USA: ACM, Dec. 2023, pp. 146–
Australia, Piscataway, NJ, USA: IEEE Press, May 2023, pp. 31–38,
158, doi: 10.1145/3611643.3616256.
doi: 10.1109/APR59189.2023.00013.
[99] J. A. Prenner, H. Babii, and R. Robbes, “Can openai’s codex fix bugs? [120] Y. Zhang, Z. Jin, Y. Xing, and G. Li, “STEAM: Simulating the
An evaluation on quixbugs,” in in Proc. 3rd Int. Workshop Automated interactive behavior of programmers for automatic bug fixing,” 2023.
Program Repair, 2022, pp. 69–75. [Online]. Available: https://doi.org/10.48550/arXiv.2308.14460
[100] W. Yuan et al., “Circle: Continual repair across programming lan- [121] S. Fakhoury, S. Chakraborty, M. Musuvathi, and S. K. Lahiri, “Towards
guages,” in Proc. 31st ACM SIGSOFT Int. Symp. Softw. Testing Anal., generating functionally correct code edits from natural language issue
2022, pp. 678–690. descriptions,” 2023, arXiv:2304.03816.
[101] S. Moon et al., “Coffee: Boost your code llms by fixing bugs with [122] M. Fu, C. Tantithamthavorn, T. Le, V. Nguyen, and D. Phung, “Vul-
feedback,” 2023, arXiv:2311.07215. repair: A t5-based automated software vulnerability repair,” in Proc.
[102] Y. Wei, C. S. Xia, and L. Zhang, “Copiloting the copilots: Fusing 30th ACM Joint Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2022,
large language models with completion engines for automated program pp. 935–947.
repair,” in Proc. 31st ACM Joint Eur. Softw. Eng. Conf. Symp. Found. [123] S. Gao, X. Wen, C. Gao, W. Wang, H. Zhang, and M. R. Lyu, “What
Softw. Eng. (ESEC/FSE), San Francisco, CA, USA, S. Chandra, K. makes good in-context demonstrations for code intelligence tasks with
Blincoe, and P. Tonella, Eds., New York, NY, USA: ACM, Dec. 2023, LLMs?” in Proc. 38th IEEE/ACM Int. Conf. Automated Softw. Eng.
pp. 172–184, doi: 10.1145/3611643.3616271. (ASE), Luxembourg, Piscataway, NJ, USA: IEEE Press, Sep. 2023,
[103] Y. Peng, S. Gao, C. Gao, Y. Huo, and M. R. Lyu, “Domain knowledge pp. 761–773, doi: 10.1109/ASE56229.2023.00109.
matters: Improving prompts with fix templates for repairing python [124] C. Treude and H. Hata, “She elicits requirements and he tests: Software
type errors,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv. engineering gender bias in large language models,” 2023. [Online].
2306.01394 Available: https://doi.org/10.48550/arXiv.2303.10131
[104] A. E. I. Brownlee et al., “Enhancing genetic improvement mutations [125] R. Kocielnik, S. Prabhumoye, V. Zhang, R. M. Alvarez, and A. Anand-
using large language models,” in Proc. 15th Int. Symp. Search-Based kumar, “Autobiastest: Controllable sentence generation for automated
Softw. Eng. (SSBSE), San Francisco, CA, USA, P. Arcaini, T. Yue, and open-ended social bias testing in language models,” 2023. [Online].
and E. M. Fredericks, Eds., vol. 14415. Cham, Switzerland: Springer Available: https://doi.org/10.48550/arXiv.2302.07371
Nature, Dec. 2023, pp. 153–159, doi: 10.1007/978-3-031-48796-5_13. [126] M. Ciniselli, L. Pascarella, and G. Bavota, “To what extent do deep
[105] M. M. A. Haque, W. U. Ahmad, I. Lourentzou, and C. Brown, “Fix- learning-based code recommenders generate predictions by cloning
eval: Execution-based evaluation of program fixes for programming code from the training set?” in Proc. 19th IEEE/ACM Int. Conf. Mining
problems,” in Proc. IEEE/ACM Int. Workshop Automated Program Re- Softw. Repositories (MSR), Pittsburgh, PA, USA, New York, NY, USA:
pair (APR@ICSE), Melbourne, Australia, Piscataway, NJ, USA: IEEE ACM, 2022, pp. 167–178, doi: 10.1145/3524842.3528440.
Press, May 2023, pp. 11–18, doi: 10.1109/APR59189.2023.00009. [127] D. Erhabor, S. Udayashankar, M. Nagappan, and S. Al-Kiswany,
[106] B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce, “Fix- “Measuring the runtime performance of code produced with GitHub
ing hardware security bugs with large language models,” 2023, copilot,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.
arXiv:2302.01215. 2305.06439
[128] R. Wang, R. Cheng, D. Ford, and T. Zimmermann, “Investigating [147] S. Song, X. Li, and S. Li, “How to bridge the gap between modalities:
and designing for trust in AI-powered code generation tools,” 2023. A comprehensive survey on multimodal large language model,” 2023,
[Online]. Available: https://doi.org/10.48550/arXiv.2305.11248 arXiv:2311.07594.
[129] B. Yetistiren, I. Özsoy, M. Ayerdem, and E. Tüzün, “Evaluating [148] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, “Machine learning testing:
the code quality of AI-assisted code generation tools: An empirical Survey, landscapes and horizons,” IEEE Trans. Softw. Eng., vol. 48,
study on github copilot, amazon codewhisperer, and ChatGPT,” 2023. no. 2, pp. 1–36, Jan. 2022.
[Online]. Available: https://doi.org/10.48550/arXiv.2304.10778 [149] F. Tu, J. Zhu, Q. Zheng, and M. Zhou, “Be careful of when: An
[130] C. Wohlin, “Guidelines for snowballing in systematic literature studies empirical study on time-related misuse of issue tracking data,” in Proc.
and a replication in software engineering,” in Proc. 18th Int. Conf. Eval. ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng.
Assessment in Softw. Eng. (EASE), London, U.K., M. J. Shepperd, T. (ESEC/SIGSOFT FSE), Lake Buena Vista, FL, USA, G. T. Leavens,
Hall, and I. Myrtveit, Eds., New York, NY, USA: ACM, May 2014, A. Garcia, and C. S. Pasareanu, Eds., New York, NY, USA: ACM,
pp. 38: 1–38:10, doi: 10.1145/2601248.2601268. Nov. 2018, pp. 307–318, doi: 10.1145/3236024.3236054.
[131] A. Mastropaolo et al., “Studying the usage of text-to-text transfer [150] Z. Sun, L. Li, Y. Liu, X. Du, and L. Li, “On the importance of
transformer to support code-related tasks,” in Proc. 43rd IEEE/ACM building high-quality training datasets for neural code search,” in
Int. Conf. Softw. Eng. (ICSE), Madrid, Spain, Piscataway, NJ, USA: Proc. 44th IEEE/ACM Int. Conf. Softw. Eng. (ICSE), Pittsburgh, PA,
IEEE Press, May 2021, pp. 336–347. USA, Piscataway, NJ, USA: ACM, May 2022, pp. 1609–1620, doi:
[132] C. Tsigkanos, P. Rani, S. Müller, and T. Kehrer, “Large language 10.1145/3510003.3510160.
models: The next frontier for variable discovery within metamorphic [151] L. Shi et al., “ISPY: Automatic issue-solution pair extraction from
testing?” in Proc. IEEE Int. Conf. Softw. Anal., Evol. Reeng. (SANER), community live chats,” in Proc. 36th IEEE/ACM Int. Conf. Automated
Taipa, Macao, T. Zhang, X. Xia, and N. Novielli, Eds., Piscataway, Softw. Eng. (ASE), Melbourne, Australia, Piscataway, NJ, USA:
NJ, USA: IEEE Press, Mar. 2023, pp. 678–682, doi: 10.1109/SANER IEEE Press, Nov. 2021, pp. 142–154, doi: 10.1109/ASE51524.2021.
56733.2023.00070. 9678894.
[133] P. Farrell-Vinay, Manage Software Testing. New York, NY, USA: [152] D. Guo et al., “GraphCodeBERT: Pre-training code representations
Auerbach, 2008. with data flow,” in Proc. 9th Int. Conf. Learn. Representations
[134] A. Mili and F. Tchier, Software Testing: Concepts and Operations. (ICLR), Virtual Event, Austria, May 2021. [Online]. Available: https://
Hoboken, NJ, USA: Wiley, 2015. openreview.net/forum?id=jLoC4ez43PZ
[135] S. Lukasczyk and G. Fraser, “Pynguin: Automated unit test gener- [153] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “Lsun:
ation for python,” in Proc. 44th IEEE/ACM Int. Conf. Softw. Eng., Construction of a large-scale image dataset using deep learning with
(ICSE) Companion, Pittsburgh, PA, USA, ACM/IEEE Press, May 2022, humans in the loop,” 2015, arXiv:1506.03365.
pp. 168–172, doi: 10.1145/3510454.3516829. [154] “Loadrunner, Inc.” Accessed: Dec. 27, 2023. [Online]. Available:
[136] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The microfocus.com
oracle problem in software testing: A survey,” IEEE Trans. Softw. Eng., [155] “Langchain, Inc.” Accessed: Dec. 27, 2023. [Online]. Available: https://
vol. 41, no. 5, pp. 507–525, May 2015. docs.langchain.com/docs/
[137] C. Watson, M. Tufano, K. Moran, G. Bavota, and D. Poshyvanyk, [156] Prompt Engineering. “Prompt engineering guide.” GitHub. Accessed:
“On learning meaningful assert statements for unit test cases,” in Proc. Dec. 27, 2023. [Online]. Available: https://github.com/dair-ai/Prompt-
42nd Int. Conf. Softw. Eng. (ICSE), Seoul, South Korea, G. Rother- Engineering-Guide
mel and D. Bae, Eds., New York, NY, USA: ACM, Jun./Jul. 2020, [157] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola,
pp. 1398–1409. “Multimodal chain-of-thought reasoning in language models,” 2023,
[138] Y. He et al., “Textexerciser: Feedback-driven text input exercising for arXiv:2302.00923.
android applications,” in Proc. IEEE Symp. Secur. Privacy (SP), San [158] Z. Liu, X. Yu, Y. Fang, and X. Zhang, “Graphprompt: Unifying pre-
Francisco, CA, USA, Piscataway, NJ, USA: IEEE Press, May 2020, training and downstream tasks for graph neural networks,” in Proc.
pp. 1071–1087. ACM Web Conf. (WWW), Austin, TX, USA, Y. Ding, J. Tang, J. F.
[139] A. Wei, Y. Deng, C. Yang, and L. Zhang, “Free lunch for testing: Sequeda, L. Aroyo, C. Castillo, and G. Houben, Eds., New York, NY,
Fuzzing deep-learning libraries from open source,” in Proc. 44th USA: ACM, Apr./May 2023, pp. 417–428.
IEEE/ACM 44th Int. Conf. Softw. Eng. (ICSE), Pittsburgh, PA, USA, [159] Y. Charalambous, N. Tihanyi, R. Jain, Y. Sun, M. A. Ferrag, and L.
New York, NY, USA: ACM, May 2022, pp. 995–1007. C. Cordeiro, “A new era in software security: Towards self-healing
[140] D. Xie et al., “Docter: Documentation-guided fuzzing for testing deep software via large language models and formal verification,” 2023,
learning API functions,” in Proc. 31st ACM SIGSOFT Int. Symp. arXiv:2305.14752.
Softw. Testing Anal. (ISSTA), Virtual Event, South Korea, S. Ryu [160] S. Wang et al., “Machine/deep learning for software engineering: A
and Y. Smaragdakis, Eds., New York, NY, USA: ACM, Jul. 2022, systematic literature review,” IEEE Trans. Softw. Eng., vol. 49, no. 3,
pp. 176–188. pp. 1188–1231, Mar. 2023, doi: 10.1109/TSE.2022.3173346.
[141] Q. Guo et al., “Audee: Automated testing for deep learning frame- [161] Y. Yang, X. Xia, D. Lo, and J. C. Grundy, “A survey on deep learning
works,” in Proc. 35th IEEE/ACM Int. Conf. Automated Softw. Eng. for software engineering,” ACM Comput. Surv., vol. 54, no. 10s,
(ASE), Melbourne, Australia, Piscataway, NJ, USA: IEEE Press, Sep. pp. 206: 1–206:73, 2022, doi: 10.1145/3505243.
2020, pp. 486–498. [162] C. Watson, N. Cooper, D. Nader-Palacio, K. Moran, and D.
[142] Z. Wang, M. Yan, J. Chen, S. Liu, and D. Zhang, “Deep learning library Poshyvanyk, “A systematic literature review on the use of deep learning
testing via effective model generation,” in Proc. 28th ACM Joint Eur. in software engineering research,” ACM Trans. Softw. Eng. Methodol.,
Softw. Eng. Conf. Symp. Found. Softw. Eng. (ESEC/FSE), Virtual Event, vol. 31, no. 2, pp. 32:1–32:58, 2022, doi: 10.1145/3485275.
USA, P. Devanbu, M. B. Cohen, and T. Zimmermann, Eds., New York, [163] M. Bajammal, A. Stocco, D. Mazinanian, and A. Mesbah, “A survey
NY, USA: ACM, Nov. 2020, pp. 788–799. on the use of computer vision to improve software engineering tasks,”
[143] J. Jiang, Y. Xiong, H. Zhang, Q. Gao, and X. Chen, “Shaping program IEEE Trans. Softw. Eng., vol. 48, no. 5, pp. 1722–1742, May 2022,
repair space with existing patches and similar code,” in Proc. 27th doi: 10.1109/TSE.2020.3032986.
ACM SIGSOFT Int. Symp. Softw. Testing Anal., New York, NY, USA: [164] X. Hou et al., “Large language models for software engineering: A
ACM, 2018, pp. 298–309, doi: 10.1145/3213846.3213871. systematic literature review,” 2023. [Online]. Available: https://doi.
[144] M. Wen, J. Chen, R. Wu, D. Hao, and S.-C. Cheung, “Context-aware org/10.48550/arXiv.2308.10620
patch generation for better automated program repair,” in Proc. 40th [165] A. Fan et al., “Large language models for software engineering:
Int. Conf. Softw. Eng., New York, NY, USA: ACM, 2018, pp. 1–11, Survey and open problems,” 2023. [Online]. Available: https://doi.
doi: 10.1145/3180155.3180233. org/10.48550/arXiv.2310.03533
[145] Y. Xiong et al., “Precise condition synthesis for program repair,” [166] D. Zan et al., “Large language models meet NL2Code: A survey,”
in Proc. IEEE/ACM 39th Int. Conf. Softw. Eng. (ICSE), 2017, in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics (Vol.
pp. 416–426. 1, Long Papers) (ACL), Toronto, ON, Canada, A. Rogers, J. L.
[146] J. Xuan et al., “Nopol: Automatic repair of conditional statement bugs Boyd-Graber, and N. Okazaki, Eds., Association for Computational
in java programs,” IEEE Trans. Softw. Eng., vol. 43, no. 1, pp. 34–55, Linguistics, Jul. 2023, pp. 7443–7464, doi: 10.18653/v1/2023.acl-
Jan. 2017. long.411.
Junjie Wang (Member, IEEE) received the Ph.D. degree from ISCAS, in 2015. She is a Research Professor with the Institute of Software, Chinese Academy of Sciences (ISCAS). She was a Visiting Scholar with North Carolina State University from September 2017 to September 2018 and worked with Prof. Tim Menzies. Her research interests include AI for software engineering, software testing, and software analytics. She has more than 50 high-quality publications, including in IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, ICSE, TOSEM, FSE, and ASE, and five of them have received the Distinguished/Best Paper Award, respectively at ICSE 2019, ICSE 2020, and ICPC 2022. She is currently serving as an Associate Editor of IEEE TRANSACTIONS ON SOFTWARE ENGINEERING. For more information, see https://people.ucas.edu.cn/0058217?language=en

Zhe Liu received the Ph.D. degree from the University of Chinese Academy of Sciences, in 2023. He is an Assistant Researcher with the Institute of Software, Chinese Academy of Sciences. His research interests include software engineering, mobile testing, deep learning, and human-computer interaction. He has published 15 papers at top-tier international software engineering conferences/journals, including IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, ICSE, CHI, and ASE. Specifically, he applies AI and light-weight program analysis technology in the following directions: AI(LLM)-assisted automated mobile GUI testing, usability testing, and bug replay; human-machine collaborative testing, including testing guidance for testers; and AI-empowered mining of software repositories, including issue report mining. He won 1st Place (Graduate Category) in the ACM Student Research Competition (SRC) 2023 Grand Finals. For more information, see https://zheliu6.github.io/