

Software Testing With Large Language Models: Survey, Landscape, and Vision

Junjie Wang, Member, IEEE, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, Member, IEEE, and Qing Wang, Member, IEEE

Abstract—Pre-trained large language models (LLMs) have recently emerged as a breakthrough technology in natural language processing and artificial intelligence, with the ability to handle large-scale datasets and exhibit remarkable performance across a wide range of tasks. Meanwhile, software testing is a crucial undertaking that serves as a cornerstone for ensuring the quality and reliability of software products. As the scope and complexity of software systems continue to grow, the need for more effective software testing techniques becomes increasingly urgent, making it an area ripe for innovative approaches such as the use of LLMs. This paper provides a comprehensive review of the utilization of LLMs in software testing. It analyzes 102 relevant studies that have used LLMs for software testing, from both the software testing and LLMs perspectives. The paper presents a detailed discussion of the software testing tasks for which LLMs are commonly used, among which test case preparation and program repair are the most representative. It also analyzes the commonly used LLMs, the types of prompt engineering that are employed, as well as the accompanying techniques used with these LLMs. It also summarizes the key challenges and potential opportunities in this direction. This work can serve as a roadmap for future research in this area, highlighting potential avenues for exploration, and identifying gaps in our current understanding of the use of LLMs in software testing.

Index Terms—Pre-trained large language model, software testing, LLM, GPT.

Manuscript received 15 July 2023; revised 8 February 2024; accepted 9 February 2024. Date of publication 20 February 2024; date of current version 19 April 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62232016, Grant 62072442, and Grant 62272445; in part by the Youth Innovation Promotion Association Chinese Academy of Sciences, Basic Research Program of ISCAS under Grant ISCAS-JCZD-202304; and in part by the Major Program of ISCAS under Grant ISCAS-ZD-202302. Recommended for acceptance by L. Mariani. (Corresponding authors: Junjie Wang; Qing Wang.)
Junjie Wang, Yuchao Huang, Zhe Liu, and Qing Wang are with the State Key Laboratory of Intelligent Game, Institute of Software Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Chunyang Chen is with Technical University of Munich, D-80333 Munich, Germany (e-mail: [email protected]).
Song Wang is with York University, Toronto, ON M3J 1P, Canada (e-mail: [email protected]).
This article has supplementary downloadable material available at https://doi.org/10.1109/TSE.2024.3368208, provided by the authors.
Digital Object Identifier 10.1109/TSE.2024.3368208

I. INTRODUCTION

SOFTWARE testing is a crucial undertaking that serves as a cornerstone for ensuring the quality and reliability of software products. Without the rigorous process of software testing, software enterprises would be reluctant to release their products into the market, knowing the potential consequences of delivering flawed software to end-users. By conducting thorough and meticulous testing procedures, software enterprises can minimize the occurrence of critical software failures, usability issues, or security breaches that could potentially lead to financial losses or jeopardize user trust. Additionally, software testing helps to reduce maintenance costs by identifying and resolving issues early in the development lifecycle, preventing more significant complications down the line [1], [2].
The significance of software testing has garnered substantial attention within the research and industrial communities. In the field of software engineering, it stands as an immensely popular and vibrant research area. One can observe the undeniable prominence of software testing by simply examining the landscape of conferences and symposiums focused on software engineering. Amongst these events, topics related to software testing consistently dominate the submission numbers and are frequently selected for publication.
While the field of software testing has gained significant popularity, there remain dozens of challenges that have not been effectively addressed. For example, one such challenge is automated unit test case generation. Although various approaches have been proposed, including search-based [3], [4], constraint-based [5], or random-based [6] techniques to generate a suite of unit tests, the coverage and the meaningfulness of the generated tests are still far from satisfactory [7], [8]. Similarly, when it comes to mobile GUI testing, existing studies with random-/rule-based methods [9], [10], model-based methods [11], [12], and learning-based methods [13] are unable to understand the semantic information of the GUI page and often fall short in achieving comprehensive coverage [14], [15]. Considering these limitations, numerous research efforts are currently underway to explore innovative techniques that can enhance the efficacy of software testing tasks, among which large language models are the most promising ones.
Large language models (LLMs) such as T5 and GPT-3 have revolutionized the field of natural language processing (NLP) and artificial intelligence (AI).
These models, initially pre-trained on extensive corpora, have exhibited remarkable performance across a wide range of NLP tasks including question-answering, machine translation, and text generation [16], [17], [18], [19]. In recent years, there has been a significant advancement in LLMs with the emergence of models capable of handling even larger-scale datasets. This expansion in model size has not only led to improved performance but also opened up new possibilities for applying LLMs as Artificial General Intelligence. Among these advanced LLMs, models like ChatGPT1 and LLaMA2 boast billions of parameters. Such models hold tremendous potential for tackling complex practical tasks in domains like code generation and artistic creation. With their expanded capacity and enhanced capabilities, LLMs have become game-changers in NLP and AI, and are driving advancements in other fields like coding and software testing.

1 https://openai.com/blog/chatgpt
2 https://ai.meta.com/blog/large-language-model-llama-meta-ai/

LLMs have been used for various coding-related tasks including code generation and code recommendation [20], [21], [22], [23]. On one hand, in software testing, there are many tasks related to code generation, such as unit test generation [7], where the utilization of LLMs is expected to yield good performance. On the other hand, software testing possesses unique characteristics that differentiate it from code generation. For example, code generation primarily focuses on producing a single, correct code snippet, whereas software testing often requires generating diverse test inputs to ensure better coverage of the software under test [1]. The existence of these differences introduces new challenges and opportunities when employing LLMs for software testing. Moreover, people have benefited from the excellent performance of LLMs in generation and inference tasks, leading to the emergence of dozens of new practices that use LLMs for software testing.

Fig. 1. Structure of the contents in this paper (the numbers in brackets indicate the number of involved papers, and a paper might involve zero or multiple items).

This article presents a comprehensive review of the utilization of LLMs in software testing. We collect 102 relevant papers and conduct a thorough analysis from both the software testing and LLMs perspectives, as roughly summarized in Fig. 1.
From the viewpoint of software testing, our analysis involves an examination of the specific software testing tasks for which LLMs are employed. Results show that LLMs are commonly used for test case preparation (including unit test case generation, test oracle generation, and system test input generation), program debugging, and bug repair, while we do not find practices for applying LLMs in the tasks of the early testing life-cycle (such as test requirements, test plans, etc.). For each testing task, we provide detailed illustrations showcasing the utilization of LLMs in addressing the task, highlighting commonly-used practices, tracking technology evolution trends, and summarizing achieved performance, so as to facilitate readers in gaining a thorough overview of how LLMs are employed across various testing tasks.
From the viewpoint of LLMs, our analysis includes the commonly used LLMs in these studies, the types of prompt engineering, the input of the LLMs, as well as the accompanying techniques used with these LLMs. Results show that about one-third of the studies utilize the LLMs through a pre-training or fine-tuning schema, while the others employ prompt engineering to communicate with LLMs to steer their behavior for desired outcomes. For prompt engineering, the zero-shot learning and few-shot learning strategies are most commonly used, while other advances like chain-of-thought prompting and self-consistency are rarely utilized. Results also show that traditional testing techniques like differential testing and mutation testing are usually combined with LLMs to help generate more diversified tests.
Furthermore, we summarize the key challenges and potential opportunities in this direction. Although software testing with LLMs has undergone significant growth in the past two years, there are still challenges in achieving high coverage of the testing, the test oracle problem, rigorous evaluations, and real-world application of LLMs in software testing. Since it is a newly emerging field, there are many research opportunities, including exploring LLMs in the early stages of testing, exploring LLMs for more types of software and non-functional testing, exploring advanced prompt engineering, as well as incorporating LLMs with traditional techniques.
This paper makes the following contributions:
• We thoroughly analyze 102 relevant studies that used LLMs for software testing, regarding publication trends, distribution of publication venues, etc.
• We conduct a comprehensive analysis from the perspective of software testing to understand the distribution of software testing tasks with LLMs and present a thorough discussion about how these tasks are solved with LLMs.
• We conduct a comprehensive analysis from the perspective of LLMs, and uncover the commonly-used LLMs, the types of prompt engineering, the input of the LLMs, as well as the accompanying techniques used with these LLMs.
• We highlight the challenges in existing studies and present potential opportunities for further studies.
• We maintain a GitHub website, https://github.com/LLM-Testing/LLM4SoftwareTesting, that serves as a platform for sharing and hosting the latest publications about software testing with LLMs.
We believe that this work will be valuable to both researchers and practitioners in the field of software engineering, as it provides a comprehensive overview of the current state and future vision of using LLMs for software testing. For researchers, this work can serve as a roadmap for future research in this area, highlighting potential avenues for exploration and identifying gaps in our current understanding of the use of LLMs in software testing. For practitioners, this work can provide insights into the potential benefits and limitations of using LLMs for software testing, as well as practical guidance on how to effectively integrate them into existing testing processes. By providing a detailed landscape of the current state and future vision of using LLMs for software testing, this work can help accelerate the adoption of this technology in the software engineering community and ultimately contribute to improving the quality and reliability of software systems.

II. BACKGROUND

A. Large Language Model (LLM)
Recently, pre-trained language models (PLMs) have been proposed by pretraining Transformer-based models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks [16], [17], [18], [19]. Studies have shown that model scaling can lead to improved model capacity, prompting researchers to investigate the scaling effect through further parameter size increases. Interestingly, when the parameter scale exceeds a certain threshold, these larger language models demonstrate not only significant performance improvements but also special abilities such as in-context learning, which are absent in smaller models such as BERT.
To discriminate the language models at different parameter scales, the research community has coined the term large language models (LLM) for the PLMs of significant size. LLMs typically refer to language models that have hundreds of billions (or more) of parameters and are trained on massive text data, such as GPT-3, PaLM, Codex, and LLaMA. LLMs are built using the Transformer architecture, which stacks multi-head attention layers in a very deep neural network. Existing LLMs adopt similar model architectures (Transformer) and pre-training objectives (language modeling) as small language models, but largely scale up the model size, pre-training data, and total compute power. This enables LLMs to better understand natural language and generate high-quality text based on given context or prompts.
Note that, in the existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since model capacity is also related to data size and total compute. In a recent survey of LLMs [17], the authors focus on discussing the language models with a model size larger than 10B. Under their criteria, the first LLM is T5, released by Google in 2019, followed by GPT-3, released by OpenAI in 2020, and there are more than thirty LLMs released between 2021 and 2023, indicating the topic's popularity. In another survey unifying LLMs and knowledge graphs [24], the authors categorize the LLMs into three types: encoder-only (e.g., BERT), encoder-decoder (e.g., T5), and decoder-only network architectures (e.g., GPT-3). In our review, we take into account the categorization criteria of the two surveys and only consider the encoder-decoder and decoder-only network architectures of pre-trained language models, since they can both support generative tasks. We do not consider the encoder-only network architecture because such models cannot handle generative tasks, were proposed relatively early (e.g., BERT in 2018), and there are almost no models using this architecture after 2021. In other words, the LLMs discussed in this paper not only include models with parameters of over 10B (as mentioned in [17]) but also include other models that use the encoder-decoder and decoder-only network architecture (as mentioned in [24]), such as BART with 140M parameters and GPT-2 with parameter sizes ranging from 117M to 1.5B. This is also to potentially include more studies to demonstrate the landscape of this topic.
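To make the architectural distinction concrete, the sketch below shows how an encoder-decoder model and a decoder-only model are both driven by a textual prompt to produce generated text, which is the property that makes both families usable for the generative testing tasks surveyed here. It is a minimal illustration assuming the Hugging Face transformers library and the small public t5-small and gpt2 checkpoints (whose outputs are only illustrative), not a setup used by any particular surveyed study.

```python
# Minimal sketch: encoder-decoder (T5) and decoder-only (GPT-2) models both
# support prompt-driven generation, which is why this survey considers both
# families. Assumes the Hugging Face `transformers` library; the tiny public
# checkpoints used here produce toy output and are for illustration only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM

prompt = "Write a unit test for a function that reverses a string."

# Encoder-decoder: the prompt is encoded, and a separate decoder generates text.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
t5_out = t5.generate(**t5_tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(t5_tok.decode(t5_out[0], skip_special_tokens=True))

# Decoder-only: the prompt is the prefix and the same decoder continues it.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
gpt_out = gpt.generate(**gpt_tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(gpt_tok.decode(gpt_out[0], skip_special_tokens=True))
```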
B. Software Testing
Software testing is a crucial process in software development that involves evaluating the quality of a software product. The primary goal of software testing is to identify defects or errors in the software system that could potentially lead to incorrect or unexpected behavior. The whole life cycle of software testing typically includes the following tasks (demonstrated in Fig. 4):
• Requirement Analysis: analyze the software requirements and identify the testing objectives, scope, and criteria.
• Test Plan: develop a test plan that outlines the testing strategy, test objectives, and schedule.
• Test Design and Review: develop and review the test cases and test suites that align with the test plan and the requirements of the software application.
• Test Case Preparation: the actual test cases are prepared based on the designs created in the previous stage.
• Test Execution: execute the tests that were designed in the previous stage. The software system is executed with the test cases and the results are recorded.
• Test Reporting: analyze the results of the tests and generate reports that summarize the testing process and identify any defects or issues that were discovered.
• Bug Fixing and Regression Testing: defects or issues identified during testing are reported to the development team for fixing. Once the defects are fixed, regression testing is performed to ensure that the changes have not introduced new defects or issues.
• Software Release: once the software system has passed all of the testing stages and the defects have been fixed, the software can be released to the customer or end user.
The testing process is iterative and may involve multiple cycles of the above stages, depending on the complexity of the software system and the testing requirements.
During the testing phase, various types of tests may be performed, including unit tests, integration tests, system tests, and acceptance tests.
• Unit Testing involves testing individual units or components of the software application to ensure that they function correctly.
• Integration Testing involves testing different modules or components of the software application together to ensure that they work correctly as a system.
• System Testing involves testing the entire software system as a whole, including all the integrated components and external dependencies.
• Acceptance Testing involves testing the software application to ensure that it meets the business requirements and is ready for deployment.
In addition, there can be functional testing, performance testing, unit testing, security testing, accessibility testing, etc., which explore various aspects of the software under test [1].

Fig. 2. Overview of the paper collection process.

III. PAPER SELECTION AND REVIEW SCHEMA

A. Paper Collection Methodology
Fig. 2 shows our paper search and selection process. To collect as much relevant literature as possible, we use both automatic search (from paper repository databases) and manual search (from major software engineering and artificial intelligence venues). We searched papers from Jan. 2019 to Jun. 2023 and further conducted a second round of search to include papers from Jul. 2023 to Oct. 2023.
1) Automatic Search: To ensure that we collect papers from diverse research areas, we conduct an extensive search using four popular scientific databases: the ACM Digital Library, the IEEE Xplore Digital Library, arXiv, and DBLP.
We search for papers whose title contains keywords related to software testing tasks and testing techniques (as shown below) in the first three databases. In the case of DBLP, we use additional keywords related to LLMs (as shown below) to filter out irrelevant studies, as relying solely on testing-related keywords would result in a large number of candidate studies. While using two sets of keywords for DBLP may result in overlooking certain related studies, we believe it is still a feasible strategy. This is due to the fact that a substantial number of studies present in this database can already be found in the first three databases, and the fourth database only serves as a supplementary source for collecting additional papers.
• Keywords related to software testing tasks and techniques: test OR bug OR issue OR defect OR fault OR error OR failure OR crash OR debug OR debugger OR repair OR fix OR assert OR verification OR validation OR fuzz OR fuzzer OR mutation.
• Keywords related to LLMs: LLM OR language model OR generative model OR large model OR GPT-3 OR ChatGPT OR GPT-4 OR LLaMA OR PaLM2 OR CodeT5 OR CodeX OR CodeGen OR Bard OR InstructGPT. Note that we only list the top ten most popular LLMs (based on Google search), since these are search keywords for matching paper titles, rather than for matching the paper content.
The above search strategy based on the paper title can recall a large number of papers, and we further conduct automatic filtering based on the paper content. Specifically, we keep only the papers whose content contains "LLM" or "language model" or "generative model" or "large model" or the name of an LLM (using the LLMs in [17], [24] except those in our exclusion criteria). This can help eliminate the papers that do not involve the neural models.
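As an illustration of the two-stage filtering just described (a title-level keyword match followed by a content-level check for LLM-related terms), a rough sketch follows. The keyword lists are abbreviated and the paper records are hypothetical; this is not the authors' actual tooling.

```python
# Rough sketch of the two-stage selection: title keywords about testing, then a
# content check for LLM-related terms. Keyword lists are abbreviated here and
# `papers` is a hypothetical list of {"title": ..., "content": ...} records.
import re

TESTING_KEYWORDS = ["test", "bug", "defect", "fault", "crash", "fuzz", "repair", "debug"]
LLM_KEYWORDS = ["LLM", "language model", "generative model", "large model",
                "GPT-3", "ChatGPT", "GPT-4", "LLaMA", "Codex"]

def title_matches(title: str) -> bool:
    return any(re.search(rf"\b{kw}", title, re.IGNORECASE) for kw in TESTING_KEYWORDS)

def content_mentions_llm(full_text: str) -> bool:
    text = full_text.lower()
    return any(kw.lower() in text for kw in LLM_KEYWORDS)

def select_candidates(papers: list[dict]) -> list[dict]:
    """Keep papers whose title mentions a testing term and whose body mentions an LLM."""
    return [p for p in papers
            if title_matches(p["title"]) and content_mentions_llm(p["content"])]
```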
2) Manual Search: To compensate for the potential omissions that may result from automated searches, we also conduct manual searches. In order to make sure we collect highly relevant papers, we conduct a manual search within the conference proceedings and journal articles from top-tier software engineering venues (listed in Table II).
In addition, given the interdisciplinary nature of this work, we also include the conference proceedings of the artificial intelligence field. We select the top ten venues based on the h5 index from Google Scholar, and exclude three computer vision venues, i.e., CVPR, ICCV, and ECCV, as listed in Table II.
3) Inclusion and Exclusion Criteria: The search conducted on the databases and venues is, by design, very inclusive. This allows us to collect as many papers as possible in our pool. However, this generous inclusivity results in having papers that are not directly related to the scope of this survey. Accordingly, we define a set of specific inclusion and exclusion criteria, and then we apply them to each paper in the pool and remove papers not meeting the criteria. This ensures that each collected paper aligns with our scope and research questions.
Inclusion Criteria. We define the following criteria for including papers:
• The paper proposes or improves an approach, study, or tool/framework that targets testing specific software or systems with LLMs.
• The paper applies LLMs to software testing practice, including all tasks within the software testing lifecycle as demonstrated in Section II-B.
• The paper presents an empirical or experimental study about utilizing LLMs in software testing practice.
• The paper involves specific testing techniques (e.g., fuzz testing) employing LLMs.
If a paper satisfies any of these criteria, we will include it.


TABLE I
DETAILS OF THE COLLECTED PAPERS

ID Topic Paper title Year Reference


1 Unit test case generation Unit Test Case Generation with Transformers and Focal Context 2020 [25]
2 Unit test case generation Codet: Code Generation with Generated Tests 2022 [26]
3 Unit test case generation Interactive Code Generation via Test-Driven User-Intent Formalization 2022 [27]
4 Unit test case generation A3Test: Assertion-Augmented Automated Test Case Generation 2023 [28]
5 Unit test case generation An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation 2023 [29]
6 Unit test case generation An Initial Investigation of ChatGPT Unit Test Generation Capability 2023 [30]
7 Unit test case generation Automated Test Case Generation Using Code Models and Domain Adaptation 2023 [31]
8 Unit test case generation Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models 2023 [32]
9 Unit test case generation Can Large Language Models Write Good Property-Based Tests? 2023 [33]
10 Unit test case generation CAT-LM Training Language Models on Aligned Code And Tests 2023 [34]
11 Unit test case generation ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation 2023 [8]
12 Unit test case generation ChatUniTest: a ChatGPT-based Automated Unit Test Generation Tool 2023 [35]
13 Unit test case generation CODAMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models 2023 [36]
14 Unit test case generation Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing 2023 [37]
15 Unit test case generation Exploring the Effectiveness of Large Language Models in Generating Unit Tests 2023 [38]
16 Unit test case generation How Well does LLM Generate Security Tests? 2023 [39]
17 Unit test case generation No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation 2023 [7]
18 Unit test case generation Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions 2023 [40]
19 Unit test case generation Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation 2023 [41]
20 Unit test case generation Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools 2023 [42]
21 Test oracle generation Generating Accurate Assert Statements for Unit Test Cases Using Pretrained Transformers 2022 [43]
22 Test oracle generation Learning Deep Semantics for Test Completion 2023 [44]
23 Test oracle generation; Program repair Using Transfer Learning for Code-Related Tasks 2023 [45]
24 Test oracle generation; Program repair Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning 2023 [46]
25 System test input generation Automated Conformance Testing for JavaScript Engines via Deep Compiler Fuzzing 2021 [47]
26 System test input generation Fill in the Blank: Context-aware Automated Text Input Generation for Mobile GUI Testing 2022 [48]
27 System test input generation Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors 2022 [49]
28 System test input generation Slgpt: Using Transfer Learning to Directly Generate Simulink Model Files and Find Bugs in the Simulink Toolchain 2021 [50]
29 System test input generation Augmenting Greybox Fuzzing with Generative AI 2023 [51]
30 System test input generation Automated Test Case Generation Using T5 and GPT-3 2023 [52]
31 System test input generation Automating GUI-based Software Testing with GPT-3 2023 [53]
32 System test input generation AXNav: Replaying Accessibility Tests from Natural Language 2023 [54]
33 System test input generation Can ChatGPT Advance Software Testing Intelligence? An Experience Report on Metamorphic Testing 2023 [55]
34 System test input generation Efficient Mutation Testing via Pre-Trained Language Models 2023 [56]
35 System test input generation Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries 2023 [57]
36 System test input generation Large Language Models are Zero Shot Fuzzers: Fuzzing Deep Learning Libraries via Large Language Models 2023 [58]
37 System test input generation Large Language Models for Fuzzing Parsers (Registered Report) 2023 [59]
38 System test input generation LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities 2023 [60]
39 System test input generation Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions 2023 [14]
40 System test input generation PentestGPT: An LLM-empowered Automatic Penetration Testing Tool 2023 [61]
41 System test input generation SMT Solver Validation Empowered by Large Pre-Trained Language Models 2023 [62]
42 System test input generation TARGET: Automated Scenario Generation from Traffic Rules for Testing Autonomous Vehicles 2023 [63]
43 System test input generation Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model 2023 [64]
44 System test input generation Understanding Large Language Model Based Fuzz Driver Generation 2023 [65]
45 System test input generation Universal Fuzzing via Large Language Models 2023 [66]
46 System test input generation Variable Discovery with Large Language Models for Metamorphic Testing of Scientific Software 2023 [67]
47 System test input generation White-box Compiler Fuzzing Empowered by Large Language Models 2023 [68]
48 Bug analysis Itiger: an Automatic Issue Title Generation Tool 2022 [69]
49 Bug analysis CrashTranslator: Automatically Reproducing Mobile Application Crashes Directly from Stack Trace 2023 [70]
50 Bug analysis Cupid: Leveraging ChatGPT for More Accurate Duplicate Bug Report Detection 2023 [71]
51 Bug analysis Employing Deep Learning and Structured Information Retrieval to Answer Clarification Questions on Bug Reports 2023 [72]
52 Bug analysis Explaining Software Bugs Leveraging Code Structures in Neural Machine Translation 2022 [73]
53 Bug analysis Prompting Is All Your Need: Automated Android Bug Replay with Large Language Models 2023 [74]
54 Bug analysis Still Confusing for Bug-Component Triaging? Deep Feature Learning and Ensemble Setting to Rescue 2023 [75]
55 Debug Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5 2022 [76]
56 Debug Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction 2022 [77]
57 Debug A Preliminary Evaluation of LLM-Based Fault Localization 2023 [78]
58 Debug Addressing Compiler Errors: Stack Overflow or Large Language Models? 2023 [79]
59 Debug Can LLMs Demystify Bug Reports? 2023 [80]
60 Debug Dcc –help: Generating Context-Aware Compiler Error Explanations with Large Language Models 2023 [81]
61 Debug Explainable Automated Debugging via Large Language Model-driven Scientific Debugging 2023 [82]
62 Debug Large Language Models for Test-Free Fault Localization 2023 [83]
63 Debug Large Language Models in Fault Localisation 2023 [84]
64 Debug LLM4CBI: Taming LLMs to Generate Effective Test Programs for Compiler Bug Isolation 2023 [85]
65 Debug Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting 2023 [86]
66 Debug Teaching Large Language Models to Self-Debug 2023 [87]
67 Debug; Program repair A study on Prompt Design, Advantages and Limitations of ChatGPT for Deep Learning Program Repair 2023 [88]
68 Program repair Examining Zero-Shot Vulnerability Repair with Large Language Models 2022 [89]
69 Program repair Automated Repair of Programs from Large Language Models 2022 [90]
70 Program repair Fix Bugs with Transformer through a Neural-Symbolic Edit Grammar 2022 [91]
71 Program repair Practical Program Repair in the Era of Large Pre-trained Language Models 2022 [92]
72 Program repair Repairing Bugs in Python Assignments Using Large Language Models 2022 [93]
73 Program repair Towards JavaScript Program Repair with Generative Pre-trained Transformer (GPT-2) 2022 [94]
74 Program repair An Analysis of the Automatic Bug Fixing Performance of ChatGPT 2023 [95]
75 Program repair An Empirical Study on Fine-Tuning Large Language Models of Code for Automated Program Repair 2023 [96]
76 Program repair An Evaluation of the Effectiveness of OpenAI’s ChatGPT for Automated Python Program Bug Fixing using QuixBugs 2023 [97]
77 Program repair An Extensive Study on Model Architecture and Program Representation in the Domain of Learning-based Automated Program Repair 2023 [98]
78 Program repair Can OpenAI’s Codex Fix Bugs? An Evaluation on QuixBugs 2022 [99]
79 Program repair CIRCLE: Continual Repair Across Programming Languages 2022 [100]
80 Program repair Coffee: Boost Your Code LLMs by Fixing Bugs with Feedback 2023 [101]
81 Program repair Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair 2023 [102]
82 Program repair Domain Knowledge Matters: Improving Prompts with Fix Templates for Repairing Python Type Errors 2023 [103]
83 Program repair Enhancing Genetic Improvement Mutations Using Large Language Models 2023 [104]
84 Program repair FixEval: Execution-based Evaluation of Program Fixes for Programming Problems 2023 [105]
85 Program repair Fixing Hardware Security Bugs with Large Language Models 2023 [106]
86 Program repair Fixing Rust Compilation Errors using LLMs 2023 [107]
87 Program repair Framing Program Repair as Code Completion 2022 [108]
88 Program repair Frustrated with Code Quality Issues? LLMs can Help! 2023 [109]
89 Program repair GPT-3-Powered Type Error Debugging: Investigating the Use of Large Language Models for Code Repair 2023 [110]
90 Program repair How Effective Are Neural Networks for Fixing Security Vulnerabilities 2023 [111]
91 Program repair Impact of Code Language Models on Automated Program Repair 2023 [112]
92 Program repair Inferfix: End-to-end Program Repair with LLMs 2023 [113]
93 Program repair Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT 2023 [114]
94 Program repair Neural Program Repair with Program Dependence Analysis and Effective Filter Mechanism 2023 [115]
95 Program repair Out of Context: How important is Local Context in Neural Program Repair? 2023 [116]
96 Program repair Pre-trained Model-based Automated Software Vulnerability Repair: How Far are We? 2023 [117]
97 Program repair RAPGen: An Approach for Fixing Code Inefficiencies in Zero-Shot 2023 [118]
98 Program repair RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair 2023 [119]
99 Program repair STEAM: Simulating the InTeractive BEhavior of ProgrAMmers for Automatic Bug Fixing 2023 [120]
100 Program repair Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions 2023 [121]
101 Program repair VulRepair: a T5-based Automated Software Vulnerability Repair 2022 [122]
102 Program repair What Makes Good In-Context Demonstrations for Code Intelligence Tasks with LLMs? 2023 [123]


TABLE II
CONFERENCE PROCEEDINGS AND JOURNALS CONSIDERED FOR MANUAL SEARCH

Acronym  Venue
SE Conference:
ICSE  International Conference on Software Engineering
ESEC/FSE  Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
ASE  International Conference on Automated Software Engineering
ISSTA  International Symposium on Software Testing and Analysis
ICST  International Conference on Software Testing, Verification and Validation
ESEM  International Symposium on Empirical Software Engineering and Measurement
MSR  International Conference on Mining Software Repositories
QRS  International Conference on Software Quality, Reliability and Security
ICSME  International Conference on Software Maintenance and Evolution
ISSRE  International Symposium on Software Reliability Engineering
SE Journal:
TSE  Transactions on Software Engineering
TOSEM  Transactions on Software Engineering and Methodology
EMSE  Empirical Software Engineering
ASE  Automated Software Engineering
JSS  Journal of Systems and Software
JSEP  Journal of Software: Evolution and Process
STVR  Software Testing, Verification and Reliability
IEEE SOFTW.  IEEE Software
IET SOFTW.  IET Software
IST  Information and Software Technology
SQJ  Software Quality Journal
AI Venues:
ICLR  International Conference on Learning Representations
NeurIPS  Conference on Neural Information Processing Systems
ICML  International Conference on Machine Learning
AAAI  AAAI Conference on Artificial Intelligence
EMNLP  Conference on Empirical Methods in Natural Language Processing
ACL  Annual Meeting of the Association for Computational Linguistics
IJCAI  International Joint Conference on Artificial Intelligence

Exclusion Criteria. The following studies are excluded during study selection:
• The paper does not involve software testing tasks, e.g., code comment generation.
• The paper does not utilize LLMs, e.g., it uses recurrent neural networks.
• The paper mentions LLMs only in future work or discussions rather than using LLMs in the approach.
• The paper utilizes language models with an encoder-only architecture, e.g., BERT, which cannot directly be utilized for generation tasks (as demonstrated in Section II-A).
• The paper focuses on testing the performance of LLMs, such as fairness, stability, security, etc. [124], [125], [126].
• The paper focuses on evaluating the performance of LLM-enabled tools, e.g., evaluating the code quality of the code generation tool Copilot [127], [128], [129].
For the papers collected through automatic search and manual search, we conduct a manual inspection to check whether they satisfy our inclusion criteria and filter out those matching our exclusion criteria. Specifically, the first two authors read each paper to carefully determine whether it should be included based on the inclusion criteria and exclusion criteria, and any paper with different decisions is handed over to the third author to make the final decision.
4) Quality Assessment: In addition, we establish quality assessment criteria to exclude low-quality studies, as shown below. For each question, the study's quality is rated as "yes", "partial", or "no", which are assigned values of 1, 0.5, and 0, respectively. Papers with a score of less than eight are excluded from our study.
• Is there a clearly stated research goal related to software testing?
• Is there a defined and repeatable technique?
• Is there any explicit contribution to software testing?
• Is there an explicit description of which LLMs are utilized?
• Is there an explicit explanation about how the LLMs are utilized?
• Is there a clear methodology for validating the technique?
• Are the subject projects selected for validation suitable for the research goals?
• Are there control techniques or baselines to demonstrate the effectiveness of the proposed technique?
• Are the evaluation metrics relevant (e.g., evaluate the effectiveness of the proposed technique) to the research objectives?
• Do the results presented in the study align with the research objectives and are they presented in a clear and relevant manner?
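To make the scoring rule concrete (yes = 1, partial = 0.5, no = 0 over the ten questions above, with papers scoring below eight excluded), a minimal sketch with hypothetical answers is shown below.

```python
# Minimal sketch of the quality-assessment scoring: each of the ten questions is
# rated "yes" (1), "partial" (0.5), or "no" (0); papers below 8 are excluded.
SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}

def quality_score(answers: list[str]) -> float:
    """`answers` holds one rating per assessment question."""
    return sum(SCORES[a] for a in answers)

def passes_quality_gate(answers: list[str], threshold: float = 8.0) -> bool:
    return quality_score(answers) >= threshold

# Hypothetical example: eight "yes" and two "partial" ratings give 9.0, so kept.
example = ["yes"] * 8 + ["partial"] * 2
assert passes_quality_gate(example)
```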
5) Snowballing: At the end of searching the database repositories and the conference proceedings and journals, and applying the inclusion/exclusion criteria and quality assessment, we obtain the initial set of papers. Next, to mitigate the risk of omitting relevant literature from this survey, we also perform backward snowballing [130] by inspecting the references cited by the collected papers so far. Note that this procedure did not yield new studies, which might be because the surveyed topic is quite new and the referenced studies tend to have been published previously, and we already include a relatively comprehensive automatic and manual search.

B. Collection Results
As shown in Fig. 2, the collection process started with a total of 14,623 papers retrieved from four academic databases employing keyword searching. Then, after automated filtering, manual search, applying the inclusion/exclusion criteria, and quality assessment, we finally collected a total of 102 papers involving software testing with LLMs. Table I shows the details of the collected papers. Besides, we provide a more comprehensive overview of these papers regarding their specific characteristics (which will be illustrated in Section IV and Section V) in the online appendix of the paper.
Note that there are two studies which are respectively the extension of a previously published paper by the same authors ([46] and [131], [68] and [132]), and we only keep the extended version to avoid duplication.

C. General Overview of Collected Paper
Among the papers, 47% are published in software engineering venues, among which 19 papers are from ICSE, 5 papers are from FSE, 5 papers are from ASE, and 3 papers are from ISSTA. 2% of the papers are published in artificial intelligence venues such as EMNLP and ICLR, and 5% are published in program analysis or security venues like PLDI and S&P. Besides, 46% of the papers have not yet been published via peer-reviewed venues, i.e., they are disclosed on arXiv. This is understandable because this field is emerging and many works are just completed and in the process of submission. Although these papers did not undergo peer review, we have a quality assessment process that eliminates papers with low quality, which potentially ensures the quality of this survey.
Fig. 3. Trend in the number of papers with year.

Fig. 3 demonstrates the trend of our collected papers per year. We can see that as the years go by, the number of papers in this field is growing almost exponentially. In 2020 and 2021, there were only 1 and 2 papers, respectively. In 2022, there were 19 papers, and in 2023, there have been 82 papers. It is conceivable that there will be even more papers in the future, which indicates the popularity and attention that this field is receiving.

IV. ANALYSIS FROM SOFTWARE TESTING PERSPECTIVE

This section presents our analysis from the viewpoint of software testing and organizes the collected studies in terms of testing tasks. Fig. 4 lists the distribution of each involved testing task, aligned with the software testing life cycle. We first provide a general overview of the distribution, followed by further analysis for each task. Note that, for each following subsection, the cumulative total of subcategories may not always match the total number of papers since a paper might belong to more than one subcategory.
We can see that LLMs have been effectively used in both the mid and late stages of the software testing lifecycle. In the test case preparation phase, LLMs have been utilized for tasks such as generating unit test cases, test oracle generation, and system test input generation. These tasks are crucial in the mid-phase of software testing to help catch issues and prevent further development until the issues are resolved. Furthermore, in later phases such as the test report/bug report and bug fix phases, LLMs have been employed for tasks such as bug analysis, debugging, and repair. These tasks are critical towards the end of the testing phase when software bugs need to be resolved to prepare for the product's release.

A. Unit Test Case Generation
Unit test case generation involves writing unit test cases to check individual units/components of the software independently and ensure that they work correctly. For a method under test (i.e., often called the focal method), its corresponding unit test consists of a test prefix and a test oracle. In particular, the test prefix is typically a series of method invocation statements or assignment statements, which aim at driving the focal method to a testable state; the test oracle then serves as the specification to check whether the current behavior of the focal method satisfies the expected one, e.g., the test assertion.
To alleviate manual efforts in writing unit tests, researchers have proposed various techniques to facilitate automated unit test generation. Traditional unit test generation techniques leverage search-based [3], [4], constraint-based [5], or random-based strategies [6] to generate a suite of unit tests with the main goal of maximizing the coverage of the software under test. Nevertheless, the coverage and the meaningfulness of the generated tests are still far from satisfactory.
Since LLMs have demonstrated promising results in tasks such as code generation, and given that both code generation and unit test case generation involve generating source code, recent research has extended the domain of code generation to encompass unit test case generation. Despite initial success, there are nuances that set unit test case generation apart from general code generation, signaling the need for more tailored approaches.
1) Pre-Training or Fine-Tuning LLMs for Unit Test Case Generation: Due to the limitations of LLMs in their earlier stages, a majority of the earlier published studies adopt this pre-training or fine-tuning schema. Moreover, in some recent studies, this schema continues to be employed to increase the LLMs' familiarity with domain knowledge. Alagarsamy et al. [28] first pre-trained the LLM with focal methods and assert statements to enable the LLM to have a stronger foundational knowledge of assertions, then fine-tuned the LLM for the test case generation task, where the objective is to learn the relationship between the focal method and the corresponding test case. Tufano et al. [25] utilized a similar schema by pre-training the LLM on a large unsupervised Java corpus and then performing supervised fine-tuning on a downstream translation task for generating unit tests. Hashtroudi et al. [31] leveraged the existing developer-written tests for each project to generate a project-specific dataset for domain adaptation when fine-tuning the LLM, which can facilitate generating human-readable unit tests. Rao et al. [34] trained a GPT-style language model by utilizing a pre-training signal that explicitly considers the mapping between code and test files. Steenhoek et al. [41] utilize reinforcement learning to optimize models by providing rewards based on static quality metrics that can be automatically computed for the generated unit test cases.
2) Designing Effective Prompts for Unit Test Case Generation: The advancement of LLMs has allowed them to excel at targeted tasks without pre-training or fine-tuning. Therefore, most later studies typically focus on how to design the prompt, to make the LLM better at understanding the context and nuances of this task. Xie et al. [35] generated unit test cases by parsing the project, extracting essential information, and creating an adaptive focal context that includes a focal method and its dependencies within the pre-defined maximum prompt token limit of the LLM, and incorporating this context into a prompt to query the LLM.
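To illustrate the kind of prompt such approaches assemble, the sketch below packs a focal method and its dependency context into a prompt, trims the context to a budget, and queries a chat-style LLM for a JUnit test. The helper names, the character-based budget, and the use of the OpenAI client with a gpt-3.5-turbo model are illustrative assumptions rather than the exact implementation of any surveyed tool.

```python
# Illustrative sketch (not the implementation of any surveyed tool): build a
# prompt from a focal method plus dependency context, trim it to a budget, and
# ask a chat-style LLM for a unit test. The OpenAI client and model name are
# assumptions; any LLM API could be substituted.
from openai import OpenAI

MAX_PROMPT_CHARS = 6000  # crude stand-in for the model's prompt token limit

def build_prompt(focal_method: str, dependencies: list[str]) -> str:
    deps = list(dependencies)
    while True:
        prompt = (
            "Write a compilable JUnit 5 test class for the focal method below.\n\n"
            "// Context: signatures the focal method depends on\n"
            + "\n".join(deps)
            + "\n\n// Focal method\n" + focal_method + "\n"
        )
        if len(prompt) <= MAX_PROMPT_CHARS or not deps:
            return prompt
        deps.pop()  # drop the least relevant dependency when over budget

def generate_unit_test(focal_method: str, dependencies: list[str]) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": build_prompt(focal_method, dependencies)}],
    )
    return resp.choices[0].message.content
```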


Fig. 4. Distribution of testing tasks with LLMs (aligned with the software testing life cycle [1], [133], [134]; the number in brackets indicates the number of collected studies per task, and one paper might involve multiple tasks).

TABLE III
PERFORMANCE OF UNIT TEST CASE GENERATION

Dataset Correctness Coverage LLM Paper


5 Java projects from Defects4J 16.21% 5%-13% (line coverage) BART [25]
10 Java projects 40% 89% (line coverage), 90% (branch coverage) ChatGPT [35]
CodeSearchNet 41% N/A ChatGPT [7]
HumanEval 78% 87% (line coverage), 92% (branch coverage) Codex [38]
SF110 2% 2% (line coverage), 1% (branch coverage) Codex [38]
Note that, [39] experiments with Codex, CodeGen, and ChatGPT, and the best performance was achieved by Codex.

Dakhel et al. [37] introduced MuTAP for improving the effectiveness of test cases generated by LLMs in terms of revealing bugs by leveraging mutation testing. They augment prompts with surviving mutants, as those mutants highlight the limitations of test cases in detecting bugs. Zhang et al. [39] generated security tests for vulnerable dependencies with LLMs.
Yuan et al. [7] first performed an empirical study to evaluate ChatGPT's capability of unit test generation with both a quantitative analysis and a user study in terms of correctness, sufficiency, readability, and usability. Results show that the generated tests still suffer from correctness issues, including diverse compilation errors and execution failures. They further propose an approach that leverages ChatGPT itself to improve the quality of its generated tests with an initial test generator and an iterative test refiner. Specifically, the iterative test refiner iteratively fixes the compilation errors in the tests generated by the initial test generator, following a validate-and-fix paradigm that prompts the LLM with the compilation error messages and additional code context. Guilherme et al. [30] and Li et al. [40] respectively evaluated the quality of the unit tests generated by the LLM using different metrics and different prompts.
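The validate-and-fix paradigm described above can be pictured as a simple loop that compiles the generated test and feeds the compiler diagnostics back into the next prompt. In the sketch below, ask_llm and compile_test are hypothetical placeholders for an LLM call and a compilation step, not the interfaces of the surveyed tools.

```python
# Sketch of an iterative validate-and-fix loop: generate a test, try to compile
# it, and re-prompt the LLM with the error messages until it compiles or the
# round budget is exhausted. `ask_llm` and `compile_test` are hypothetical
# placeholders (an LLM call and a javac/JUnit compilation check).
def refine_test(focal_method: str, ask_llm, compile_test, max_rounds: int = 4) -> str:
    test_code = ask_llm(f"Write a JUnit test for this method:\n{focal_method}")
    for _ in range(max_rounds):
        ok, errors = compile_test(test_code)  # returns (bool, compiler error text)
        if ok:
            return test_code
        test_code = ask_llm(
            "The following JUnit test does not compile.\n"
            f"Compiler errors:\n{errors}\n\n"
            f"Test code:\n{test_code}\n\n"
            f"Focal method:\n{focal_method}\n"
            "Return a corrected, compilable test class."
        )
    return test_code  # best effort after the budget is spent
```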
3) Test Generation With Additional Documentation: Vikram et al. [33] went a step further by investigating the potential of using LLMs to generate property-based tests when provided with API documentation. They believe that the documentation of an API method can assist the LLM in producing logic to generate random inputs for that method and in deriving meaningful properties of the result to check. Instead of generating unit tests from the source code, Plein et al. [32] generated the tests based on user-written bug reports.
4) LLM and Search-Based Method for Unit Test Generation: The aforementioned studies utilize LLMs for the whole unit test case generation task, while Lemieux et al. [36] focus on a different direction, i.e., first letting a traditional search-based software testing technique (e.g., Pynguin [135]) generate unit test cases until its coverage improvements stall, and then asking the LLM to provide example test cases for under-covered functions. These examples can help the original test generation redirect its search to more useful areas of the search space. Tang et al. [8] conduct a systematic comparison of test suites generated by the LLM and the state-of-the-art search-based software testing tool EvoSuite, considering correctness, readability, code coverage, and bug detection capability. Similarly, Bhatia [42] experimentally investigates the quality of unit tests generated by the LLM compared to a commonly-used test generator, Pynguin.
5) Performance of Unit Test Case Generation: Since the aforementioned studies of unit test case generation are based on different datasets, one can hardly derive a fair comparison, and we present the details in Table III to let the readers obtain a general view. We can see that on the SF110 benchmark, all three evaluated LLMs have quite low performance, i.e., 2% coverage [38]. SF110 is an EvoSuite (a search-based unit test case generation technique) benchmark consisting of 111 open-source Java projects retrieved from SourceForge, containing 23,886 classes, over 800,000 bytecode-level branches, and 6.6 million lines of code. The authors did not present detailed reasons for the low performance, which can be further explored in the future.

B. Test Oracle Generation
A test oracle is a source of information about whether the output of a software system (or program or function or method) is correct or not [136]. Most of the collected studies in this category target test assertion generation, which happens inside a unit test case. Nevertheless, we opted to treat these studies as a separate section to facilitate a more thorough analysis.


A test assertion, which indicates the potential issues in the tested code, is an important aspect that distinguishes unit test cases from regular code. This is why some studies specifically focus on the generation of effective test assertions.
Actually, before using LLMs, researchers have proposed RNN-
based approaches that aim at learning from thousands of unit
test methods to generate meaningful assert statements [137], yet
only 17% of the generated asserts can exactly match with the
ground truth asserts. Subsequently, to improve the performance,
several researchers utilized the LLMs for this task.
Mastropaolo et al. [45], [131] pre-trained a T5 model on a
dataset composed of natural language English text and source code. Then, they fine-tuned such a model by reusing the datasets used in four previous works that applied deep learning techniques (such as the RNN mentioned before), covering test assertion generation, program repair, and other tasks. Results showed that the exact match rate of the generated test assertions is 57%. Tufano et al. [43] proposed a similar approach which separately pre-trained the LLM on an English corpus and a code corpus, and then fine-tuned it on the asserts dataset (with test methods, focal methods, and asserts). This further improved the performance to a 62% exact match rate. Besides the syntax-level data used in previous studies, Nie et al. [44] fine-tuned the LLMs with six kinds of code semantics data, including the execution result (e.g., the types of the local variables) and the execution context (e.g., the last called method in the test method), which enabled LLMs to learn to understand code execution information. The exact match rate is 17% (note that this paper is based on a different dataset from all other studies mentioned under this topic).
The aforementioned studies utilized the pre-training and fine-tuning schema when using LLMs; with the increasingly powerful capabilities of LLMs, they can perform well on specific tasks without these specialized pre-training or fine-tuning datasets. Subsequently, Nashid et al. [46] utilized prompt engineering for this task and proposed a technique for prompt creation that automatically retrieves code demonstrations similar to the task at hand, based on embedding or frequency analysis. They also present evaluations of few-shot learning with various numbers (e.g., zero-shot, one-shot, or n-shot) and forms (e.g., random vs. systematic, or with vs. without natural language descriptions) of prompts, to investigate its feasibility for test assertion generation. With only a few relevant code demonstrations, this approach can achieve an accuracy of 76% for exact matches in test assertion generation, which is the state-of-the-art performance for this task.
plied to domains such as cyber-physical systems, quantum
computing platforms, and more. This widespread adoption of
C. System Test Input Generation
LLMs demonstrates their effectiveness in handling diverse test
This category encompasses the studies related to creating inputs and enhancing testing activities across various software
test input of system testing for enabling the automation of domains. A detailed analysis is provided below.
test execution. We employ three subsections to present the a) Test input generation for mobile apps: For mobile app
analysis from three different orthogonal viewpoints, and each testing, one difficulty is to generate the appropriate text inputs
of the collected studies may be analyzed in one or more of to proceed to the next page, which remains a prominent obstacle
these subsections. for testing coverage. Considering the diversity and semantic
The first subsection is input generation in terms of software requirement of valid inputs (e.g., flight departure, movie name),
types. The generation of system-level test inputs for software traditional techniques with heuristic-based or constraint-based
testing varies for specific types of software being tested. For techniques [10], [138] are far from generating meaningful text


input. Liu et al. [49] employ the LLM to intelligently generate semantic input text according to the GUI context. In detail, their proposed QTypist automatically extracts the component information related to the EditText widget for generating the prompts, and then feeds the prompts into the LLM to generate the input text.

Besides text input, there are other forms of input for mobile apps, e.g., operations like 'click a button' and 'select a list'. To fully test an app, it is required to cover more GUI pages and conduct more meaningful exploration traces through GUI operations, yet existing studies with random-/rule-based methods [9], [10], model-based methods [11], [12], and learning-based methods [13] are unable to understand the semantic information of the GUI page and thus cannot plan the exploration traces effectively. Liu et al. [14] formulate the test input generation problem of mobile GUI testing as a Q&A task, which asks the LLM to chat with the mobile app: the GUI page information is passed to the LLM to elicit testing scripts (i.e., GUI operations), the scripts are executed, and the app feedback is passed back to the LLM, iterating the whole process. The proposed GPTDroid extracts the static context of the GUI page and the dynamic context of the iterative testing process, and designs prompts for inputting this information to the LLM, which enables the LLM to better understand the GUI page as well as the whole testing process. It also introduces a functionality-aware memory prompting mechanism that equips the LLM with the ability to retain testing knowledge of the whole process and conduct long-term, functionality-based reasoning to guide exploration. Similarly, Zimmermann et al. utilize the LLM to interpret natural language test cases and programmatically navigate through the application under test [54].

Yu et al. [61] investigate the LLM's capabilities in mobile app test script generation and migration tasks, including scenario-based test generation and cross-platform/app test migration.

b) Test input generation for DL libraries: The input for testing DL libraries is DL programs, and the difficulty in generating diversified input DL programs is that they need to satisfy both the input language (e.g., Python) syntax/semantics and the API input/shape constraints for tensor computations. Traditional techniques with API-level fuzzing [139], [140] or model-level fuzzing [141], [142] suffer from the following limitations: 1) they lack diverse API sequences and thus cannot reveal bugs caused by chained API sequences; 2) they cannot generate arbitrary code and thus cannot explore the huge search space that exists when using the DL libraries. Since LLMs can include numerous code snippets invoking DL library APIs in their training corpora, they can implicitly learn both the language syntax/semantics and the intricate API constraints for valid DL program generation. Taken in this sense, Deng et al. [59] used both generative and infilling LLMs to generate and mutate valid/diverse input DL programs for fuzzing DL libraries. In detail, their approach first uses a generative LLM (Codex) to generate a set of seed programs (i.e., code snippets that use the target DL APIs). It then replaces part of the seed program with masked tokens using different mutation operators and leverages the ability of an infilling LLM (InCoder) to perform code infilling to generate new code that replaces the masked tokens. Their follow-up study [58] goes a step further to prime LLMs to synthesize unusual programs for fuzzing DL libraries. It is built on the well-known hypothesis that historical bug-triggering programs may include rare/valuable code ingredients important for bug finding, and it shows improved bug detection performance.

c) Test input generation for other types of software: There are also dozens of studies that address testing tasks in various other domains; due to space limitations, we present a selection of representative studies in these domains.

Finding bugs in a commercial cyber-physical system (CPS) development tool such as Simulink is even more challenging. Given the complexity of the Simulink language, generating valid Simulink model files for testing is an ambitious task for traditional machine learning or deep learning techniques. Shrestha et al. [51] employ a small set of Simulink-specific training data to fine-tune the LLM for generating Simulink models. Results show that it can create Simulink models quite similar to the open-source models, and can find a superset of the bugs found by traditional fuzzing approaches.

Sun et al. [63] utilize the LLM to generate test formulas for fuzzing SMT solvers. Their approach retrains the LLMs on a large corpus of SMT formulas to enable them to acquire SMT-specific domain knowledge, and then further fine-tunes the LLMs on historical bug-triggering formulas, which are known to involve structures that are more likely to trigger bugs and solver-specific behaviors. The LLM-based compiler fuzzer proposed by Yang et al. [69] adopts a dual-model framework: (1) an analysis LLM examines the low-level optimization source code and produces requirements on the high-level test programs that can trigger the optimization; (2) a generation LLM produces test programs based on the summarized requirements. Ye et al. [48] utilize the LLM for generating JavaScript programs and then use the well-structured ECMAScript specifications to automatically generate test data along with the test programs; after that, they apply differential testing to expose bugs.

2) Input Generation in Terms of Testing Techniques: By utilizing system test inputs generated by LLMs, the collected studies aim to enhance traditional testing techniques and make them more effective. Among these techniques, fuzz testing is the most commonly involved one. Fuzz testing, as a general concept, revolves around generating invalid, unexpected, or random data as inputs to evaluate the behavior of software. LLMs play a crucial role in improving traditional fuzz testing by facilitating the generation of diverse and realistic input data. This enables fuzz testing to uncover potential bugs in the software by subjecting it to a wide range of input scenarios. In addition to fuzz testing, LLMs also contribute to enhancing other testing techniques, which will be discussed in detail later.

a) Universal fuzzing framework: Xia et al. [67] present Fuzz4All, which can target many different input languages and many different features of these languages. The key idea behind it is to leverage LLMs as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, they present a novel auto-prompting technique, which creates LLM prompts that are well-suited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. They experiment with six different languages (C, C++, Go, SMT2, Java and Python) as inputs and demonstrate higher coverage than existing language-specific fuzzers.
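To make this kind of LLM-powered fuzzing loop concrete, the following is a minimal, illustrative sketch rather than the implementation of [67]; generate_with_llm and run_target are placeholder stubs that would be backed by a real LLM API and an instrumented system under test in practice.

import random

def generate_with_llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder for a real LLM completion call; here it fabricates an input."""
    return f"// synthetic input {random.randint(0, 999)} (t={temperature}) for: {prompt.splitlines()[0]}"

def run_target(test_input: str) -> set:
    """Placeholder for executing the target and collecting covered branch ids."""
    return {hash(test_input) % 50}

def fuzz(user_description: str, budget: int = 20):
    # Distil the user-provided description into an initial fuzzing prompt.
    prompt = f"Generate a small, unusual test input for: {user_description}\n"
    covered, corpus = set(), []
    for _ in range(budget):
        candidate = generate_with_llm(prompt, temperature=1.0)
        new_edges = run_target(candidate) - covered
        if new_edges:                      # keep inputs that add coverage
            covered |= new_edges
            corpus.append(candidate)
        if corpus:
            # Iteratively update the prompt with a recent interesting input,
            # steering the LLM toward different, unexplored behavior.
            example = random.choice(corpus)
            prompt = (f"Generate a small, unusual test input for: {user_description}\n"
                      f"Here is a previous interesting input:\n{example}\n"
                      f"Now produce a different one.\n")
    return corpus, covered

if __name__ == "__main__":
    inputs, cov = fuzz("a JSON parser written in C")
    print(f"kept {len(inputs)} inputs, {len(cov)} covered branches")

The essential idea is that inputs adding coverage are retained and folded back into the prompt, so the model is continually nudged toward behaviors the fuzzer has not yet exercised.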


Hu et al. [52] propose a greybox fuzzer augmented by the LLM, which picks a seed from the fuzzer's seed pool and prompts the LLM to produce mutated seeds that might trigger a new code region of the software. They experiment with three categories of input formats, i.e., formatted data files (e.g., JSON, XML), source code in different programming languages (e.g., JS, SQL, C), and text with no explicit syntax rules (e.g., HTTP responses, MD5 checksums). In addition, effective fuzzing relies on an effective fuzz driver, and Zhang et al. [66] utilize LLMs for fuzz driver generation, in which five query strategies are designed and analyzed, from basic to enhanced.

b) Fuzzing techniques for specific software: There are studies that focus on fuzzing techniques tailored to specific software, e.g., deep learning libraries [58], [59], compilers [69], SMT solvers [63], input widgets of mobile apps [65], cyber-physical systems [51], etc. One key focus of these fuzzing techniques is to generate diverse test inputs so as to achieve higher coverage. This is commonly achieved by combining the mutation technique with LLM-based generation, where the former produces various candidates while the latter is responsible for generating the executable test inputs [59], [63]. Another focus of these fuzzing techniques is to generate risky test inputs that can trigger bugs earlier. To achieve this, a common practice is to collect historical bug-triggering programs to fine-tune the LLM [63] or treat them as demonstrations when querying the LLM [58], [65].

c) Other testing techniques: There are studies that utilize LLMs for enhancing GUI testing by generating meaningful text inputs [49] and functionality-oriented exploration traces [14], which have been introduced in the test input generation for mobile apps part of Section IV-C1.

Besides, Deng et al. [62] leverage the LLMs to carry out penetration testing tasks automatically. It involves setting a penetration testing goal for the LLM, soliciting it for the appropriate operation to execute, implementing the operation in the testing environment, and feeding the test outputs back to the LLM for next-step reasoning.

3) Input Generation in Terms of Input and Output:

a) Output format of test generation: Although most works use the LLM to generate test cases directly, there are also some works generating indirect outputs like testing code, test scenarios, metamorphic relations, etc. Liu et al. [65] propose InputBlaster, which leverages the LLM to automatically generate unusual text inputs for fuzzing the text input widgets in mobile apps. It formulates the unusual input generation problem as a task of producing a set of test generators, each of which can yield a batch of unusual text inputs under the same mutation rule. In detail, InputBlaster leverages the LLM to produce the test generators together with the mutation rules serving as the reasoning chain, and utilizes the in-context learning schema to demonstrate examples to the LLM for boosting the performance. Deng et al. [64] use the LLM to extract key information related to the test scenario from a traffic rule and represent the extracted information in a test scenario schema, then synthesize the corresponding scenario scripts to construct the test scenario. Luu et al. [56] examine the effectiveness of LLMs in generating metamorphic relations (MRs) for metamorphic testing. Their results show that ChatGPT can be used to advance software testing intelligence by proposing MR candidates that can later be adapted for implementing tests, but human intelligence should still inevitably be involved to justify and rectify their correctness.

b) Input format of test generation: The aforementioned studies primarily take the source code or the software as the input of the LLM, yet there are also studies that take natural language descriptions as the input for test generation. Mathur et al. [53] propose to generate test cases from natural language requirements. Ackerman et al. [60] recursively generate instances from natural language requirements to serve as the seed examples for a mutation fuzzer.

D. Bug Analysis

This category involves analyzing and categorizing the identified software bugs to enhance understanding of the bugs and facilitate subsequent debugging and bug repair. Mukherjee et al. [73] generate relevant answers to follow-up questions for deficient bug reports to facilitate bug triage. Su et al. [76] transform bug-component triaging into a multi-classification task and a generation task with the LLM, then ensemble the prediction results from them to further improve the performance of bug-component triaging. Zhang et al. [72] first leverage the LLM under the zero-shot setting to get essential information from bug reports, then use the essential information as the input to detect duplicate bug reports. Mahbub et al. [74] propose to explain software bugs with the LLM, which generates natural language explanations for software bugs by learning from a large corpus of bug-fix commits. Zhang et al. [70] aim to automatically generate bug titles from the descriptions of the bugs, which helps developers write issue titles and facilitates the bug triaging and follow-up fixing process.

E. Debug

This category refers to the process of identifying and locating the cause of a software problem (i.e., bug). It involves analyzing the code, tracing the execution flow, and collecting error information to understand the root cause of the issue, and then fixing the issue. Some studies concentrate on the comprehensive debug process, while others delve into specific sub-activities within the process.

1) Overall Debug Framework: Bui et al. [77] propose a unified Detect-Localize-Repair framework based on the LLM for debugging, which first determines whether a given code snippet is buggy or not, then identifies the buggy lines, and finally translates the buggy code to its fixed version. Kang et al. [83] propose automated scientific debugging, a technique that, given buggy code and a bug-revealing test, prompts LLMs to automatically generate hypotheses and uses debuggers to actively interact with the buggy code, thus automatically reaching conclusions prior to patch generation. Chen et al. [88] demonstrate that self-debugging can teach the LLM to perform rubber duck debugging; i.e., without any human feedback on the code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language.
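The general shape of such a self-debugging loop can be sketched as follows; this is an illustrative outline rather than the implementation of [88], with llm as a placeholder for a real model call and a deliberately tiny test harness.

def llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to an LLM."""
    return "def add(a, b):\n    return a + b\n"

def run_tests(code: str) -> str | None:
    """Execute the candidate code against a tiny test; return an error
    message on failure, or None on success."""
    env = {}
    try:
        exec(code, env)                      # illustrative only
        assert env["add"](2, 3) == 5
        return None
    except Exception as exc:                 # capture any failure as feedback
        return repr(exc)

def self_debug(task: str, max_rounds: int = 3) -> str:
    code = llm(f"Write a Python function for this task:\n{task}")
    for _ in range(max_rounds):
        error = run_tests(code)
        if error is None:
            return code                      # tests pass, stop iterating
        # Rubber-duck step: ask the model to explain its own code and the
        # observed execution result, then produce a corrected version.
        code = llm(
            f"Task: {task}\nYour code:\n{code}\n"
            f"Execution result: {error}\n"
            "Explain line by line what the code does, identify the mistake, "
            "and output only the fixed code."
        )
    return code

if __name__ == "__main__":
    print(self_debug("add two numbers"))

The loop needs no human-written oracle beyond the existing tests: the execution result itself is the only feedback the model receives.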

Cao et al. [89] conduct a study of the LLM's debugging ability for deep learning programs, including fault detection, fault localization, and program repair.

2) Bug Localization: Wu et al. [85] compare two LLMs (ChatGPT and GPT-4) with existing fault localization techniques, and investigate the consistency of LLMs in fault localization, as well as how prompt engineering and the length of code context affect the results. Kang et al. [79] propose AutoFL, an automated fault localization technique that only requires a single failing test; during its fault localization process, it also generates an explanation about why the given test fails. Yang et al. [84] propose LLMAO to overcome the left-to-right nature of LLMs by fine-tuning a small set of bidirectional adapter layers on top of the representations learned by LLMs, which can locate buggy lines of code without any test coverage information. Tu et al. [86] propose LLM4CBI to tame LLMs to generate effective test programs for finding suspicious files.

3) Bug Reproduction: There are also studies focusing on a sub-phase of the debugging process. For example, Kang et al. [78] and Plein et al. [81] respectively propose frameworks that harness the LLM to reproduce bugs and suggest bug-reproducing test cases to the developer for facilitating debugging. Li et al. [87] focus on a similar aspect of finding failure-inducing test cases whose test input can trigger the software's fault; they synergistically combine the LLM and differential testing to do so.

There are also studies focusing on the bug reproduction of mobile apps to produce the replay script. Feng et al. [75] propose AdbGPT, a new lightweight approach to automatically reproduce bugs from bug reports through prompt engineering, without any training and hard-coding effort. It leverages few-shot learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning from LLMs to accomplish the bug replay in a manner similar to a developer. Huang et al. [71] propose CrashTranslator to automatically reproduce bugs directly from the stack trace. It accomplishes this by leveraging the LLM to predict the exploration steps for triggering the crash, and designing a reinforcement learning based technique to mitigate inaccurate predictions and guide the search holistically. Taeb et al. [55] convert manual accessibility test instructions into replayable, navigable videos by using the LLM and UI element detection models, which can also help reveal accessibility issues.

4) Error Explanation: Taylor et al. [82] integrate the LLM into the Debugging C Compiler to generate unique, novice-focused explanations tailored to each error. Widjojo et al. [80] study the effectiveness of Stack Overflow and LLMs at explaining compiler errors.

F. Program Repair

This category denotes the task of fixing the identified software bugs. The high frequency of repair-related studies can be attributed to the close relationship between this task and the source code. With their advanced natural language processing and understanding capabilities, LLMs are well-equipped to process and analyze source code, making them an ideal tool for performing code-related tasks such as fixing bugs.

There have been template-based [143], heuristic-based [144], and constraint-based [145], [146] automatic program repair techniques. With the development of deep learning techniques in the past few years, there have also been several studies employing deep learning for program repair. They typically adopt deep learning models that take a buggy software program as input and generate a patched program. Based on the training data, they build a neural network model that learns the relations between the buggy code and the corresponding fixed code. Nevertheless, these techniques still fail to fix a large portion of bugs, and they typically have to generate hundreds to thousands of candidate patches and take hours to validate these patches in order to fix enough bugs. Furthermore, deep learning based program repair models need to be trained with huge amounts of labeled training data (typically pairs of buggy and fixed code), and collecting such high-quality datasets is time- and effort-consuming. Subsequently, with the popularity and demonstrated capability of the LLMs, researchers began to explore LLMs for program repair.

1) Patch Single-Line Bugs: In the early era of program repair, the focus was mainly on addressing defects related to single-line code errors, which are relatively simple and do not require the repair of complex program logic. Lajkó et al. [95] propose to fine-tune the LLM with JavaScript code snippets for JavaScript program repair. Zhang et al. [116] employ program slicing to extract contextual information directly related to the given buggy statement as repair ingredients from the corresponding program dependence graph, which makes the fine-tuning more focused on the buggy code. Zhang et al. [121] propose a stage-wise framework, STEAM, for patching single-line bugs, which simulates the interactive behavior of multiple programmers involved in bug management, e.g., bug reporting, bug diagnosis, patch generation, and patch verification.

Since most real-world bugs involve multiple lines of code, later studies explore these more complex situations (although some of them can also patch single-line bugs).

2) Patch Multiple-Lines Bugs: The studies in this category input a buggy function to the LLM, and the goal is to output the patched function, which might involve complex semantic understanding, code hunk modification, as well as program refactoring. Earlier studies typically employ the fine-tuning strategy to enable the LLM to better understand the code semantics. Fu et al. [123] fine-tune the LLM by employing BPE tokenization to handle Out-Of-Vocabulary (OOV) issues, which enables the approach to generate new tokens that never appear in a training function but are newly introduced in the repair. Wang et al. [120] train the LLM based on both the buggy input and retrieved bug-fix examples, which are retrieved in terms of lexical and semantic similarities. The aforementioned studies (including the ones patching single-line bugs) predict the fixed programs directly, while Hu et al. [92] utilize a different setup that predicts edit scripts that can fix the bugs when executed, expressed in a delete-and-insert grammar. For example, it predicts whether an original line of code should be deleted, and what content should be inserted.
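For illustration, a minimal sketch of such an edit-script style of patch representation and how it would be applied is shown below; the data structure and field names are hypothetical and merely mirror the delete/insert idea, not the exact grammar used in [92].

from dataclasses import dataclass

@dataclass
class Edit:
    """One predicted repair action: optionally delete line `line_no`
    (1-based) and/or insert `content` at that position."""
    line_no: int
    delete: bool
    content: str = ""

def apply_edits(source: str, edits: list) -> str:
    lines = source.splitlines()
    # Apply from the bottom up so earlier line numbers stay valid.
    for edit in sorted(edits, key=lambda e: e.line_no, reverse=True):
        idx = edit.line_no - 1
        if edit.delete:
            del lines[idx]
        if edit.content:
            lines.insert(idx, edit.content)
    return "\n".join(lines) + "\n"

buggy = (
    "def is_even(n):\n"
    "    return n % 2 == 1\n"   # bug: wrong comparison
)

# A hypothetical model output: replace line 2 (delete it, insert the fix).
predicted = [Edit(line_no=2, delete=True, content="    return n % 2 == 0")]

print(apply_edits(buggy, predicted))

Predicting compact edit scripts instead of whole functions keeps the model's output short and anchored to the original code.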


TABLE IV
PERFORMANCE OF PROGRAM REPAIR

Dataset | % Correct patches | LLM | Paper
Defects4J v1.2, Defects4J v2.0, QuixBugs, HumanEval-Java | 22/40 Java bugs (QuixBugs dataset, with InCoder-6B, correct code infilling setting) | PLBART, CodeT5, CodeGen, InCoder (each with variant parameters, 10 LLMs in total) | [112]
QuixBugs | 23/40 Python bugs, 14/40 Java bugs (complete function generation setting) | Codex-12B | [99]
Defects4J v1.2, Defects4J v2.0, QuixBugs, ManyBugs | 39/40 Python bugs, 34/40 Java bugs (QuixBugs dataset, with Codex-12B, correct code infilling setting); 37/40 Python bugs, 32/40 Java bugs (QuixBugs dataset, with Codex-12B, complete function generation setting) | Codex, GPT-Neo, CodeT5, InCoder (each with variant parameters, 9 LLMs in total) | [92]
QuixBugs | 31/40 Python bugs (complete function generation setting) | ChatGPT-175B | [95]
DL programs from StackOverflow | 16/72 Python bugs (complete function generation setting) | ChatGPT-175B | [89]
Note that, for studies with multiple datasets or LLMs, we only present the best performance or the most commonly utilized dataset.

Nevertheless, fine-tuning may face limitations in terms of its reliance on abundant high-quality labeled data, significant computational resources, and the possibility of overfitting. To approach the program repair problem more effectively, later studies focus on how to design an effective prompt for program repair. Several studies empirically investigate the effectiveness of prompt variants of the latest LLMs for program repair under different repair settings and commonly-used benchmarks (which will be explored in depth later), while other studies focus on proposing new techniques. Ribeiro et al. [109] take advantage of the LLM to conduct code completion in a buggy line for patch generation, and elaborate on how to circumvent the open-ended nature of code generation to appropriately fit the new code into the original program. Xia et al. [115] propose a conversation-driven program repair approach that interleaves patch generation with instant feedback to perform the repair in a conversational style. They first feed the LLM with relevant test failure information to start with, and then learn from both failures and successes of earlier patching attempts of the same bug for more powerful repair. For earlier patches that failed to pass all tests, they combine the incorrect patches with their corresponding relevant test failure information to construct a new prompt for the LLM to generate the next patch, in order to avoid making the same mistakes. For earlier patches that passed all the tests (i.e., plausible patches), they further ask the LLM to generate alternative variations of the original plausible patches. This can further build on and learn from earlier successes to generate more plausible patches and increase the chance of having correct patches. Zhang et al. [94] propose a similar approach by leveraging multimodal prompts (e.g., natural language descriptions, error messages, input-output-based test cases), iterative querying, and test-case-based few-shot selection to produce repairs. Moon et al. [102] propose an approach for bug fixing with feedback. It consists of a critic model to generate feedback, an editor to edit code based on the feedback, and a feedback selector to choose the best possible feedback from the critic. Wei et al. [103] propose Repilot to copilot the AI "copilots" (i.e., LLMs) by synthesizing more valid patches during the repair process. Its key insight is that many LLMs produce outputs autoregressively (i.e., token by token), and that, resembling how humans write programs, the repair can be significantly boosted and guided through a completion engine. Brownlee et al. [105] propose to use the LLM as a mutation operator for search-based program repair techniques.

3) Repair With Static Code Analyzer: Most program repair studies assume the bug has already been detected, while Jin et al. [114] propose a program repair framework paired with a static analyzer to first detect the bugs and then fix them. In detail, the static analyzer first detects an error (e.g., a null pointer dereference), and the context information provided by the static analyzer is then sent to the LLM to query the patch for this specific error. Wadhwa et al. [110] focus on a similar task, and additionally employ an LLM as a ranker to assess the likelihood of acceptance of generated patches, which can effectively catch plausible but incorrect fixes and reduce developer burden.

4) Repair for Specific Bugs: The aforementioned studies all consider the buggy code as the input for automatic program repair, while other studies conduct program repair based on other types of bug descriptions, specific types of bugs, etc. Fakhoury et al. [122] focus on program repair from natural language issue descriptions, i.e., generating the patch from the bug- and fix-related information described in the issue reports. Garg et al. [119] aim at repairing performance issues, in which they first retrieve a prompt instruction from a pre-constructed knowledge base of previous performance bug fixes and then generate a repair prompt using the retrieved instruction. There are also studies focusing on the bug fixing of Rust programs [108] or OCaml programs (an industrial-strength programming language) [111].

5) Empirical Study About Program Repair: There are several studies related to the empirical or experimental evaluation of the various LLMs on program repair, and we summarize the performance in Table IV. Jiang et al. [113], Xia et al. [93], and Zhang et al. [118] respectively conduct comprehensive experimental evaluations with various LLMs and on different automated program repair benchmarks, while other


researchers [89], [96], [98], [100] focus on a specific LLM and on one dataset, e.g., QuixBugs. In addition, Gao et al. [124] empirically investigate the impact of in-context demonstrations for bug fixing, including the selection, order, and number of demonstration examples. Prenner et al. [117] empirically study how the local context (i.e., code that comes before or after the bug location) affects the repair performance. Horváth et al. [99] empirically study the impact of program representation and model architecture on the repair performance.

There are two commonly-used repair settings when using LLMs to generate patches: 1) complete function generation (i.e., generating the entire patched function), and 2) correct code infilling (i.e., filling in a chunk of code given the prefix and suffix); different studies might utilize different settings, which are marked in Table IV. The commonly-used datasets are QuixBugs, Defects4J, etc. These datasets only involve fundamental functionalities such as sorting algorithms, with each program averaging 13 to 22 lines of code, implementing one functionality, and involving few dependencies. To tackle this, Cao et al. [89] conduct an empirical study on a more complex dataset with DL programs collected from Stack Overflow. Every program contains about 46 lines of code on average, implementing several functionalities including data preprocessing, DL model construction, model training, and evaluation. The dataset involves more than 6 dependencies for each program, including TensorFlow, Keras, and PyTorch. Their results demonstrate a much lower rate of correct patches than on other datasets, which again reveals the potential difficulty of this task. Similarly, Haque et al. [106] introduce a dataset comprising buggy code submissions and their corresponding fixes collected from online judge platforms, which offers an extensive collection of unit tests to enable evaluations of the correctness of fixes, as well as further information regarding time, memory constraints, and acceptance based on a verdict.

V. ANALYSIS FROM LLM PERSPECTIVE

This section discusses the analysis from the viewpoint of LLMs; specifically, it is unfolded from the viewpoints of the utilized LLMs, the types of prompt engineering, the input of the LLMs, as well as the accompanying techniques when utilizing LLMs.

Fig. 6. LLMs used in the collected papers.

A. LLM Models

As shown in Fig. 6, the most commonly utilized LLM in software testing tasks is ChatGPT, which was released in Nov. 2022 by OpenAI. It is trained on a large corpus of natural language text data and primarily designed for natural language processing and conversation. ChatGPT is the most widely recognized and popular LLM up until now, known for its exceptional performance across various tasks. Therefore, it comes as no surprise that it ranks in the top position in terms of our collected studies.

Codex, an LLM based on GPT-3, is the second most commonly used LLM in our collected studies. It is trained on a massive code corpus containing examples from many programming languages such as JavaScript, Python, C/C++, and Java. Codex was released in Sep. 2021 by OpenAI and powers GitHub Copilot, an AI pair programmer that generates whole code snippets given a natural language description as a prompt. Since a large portion of our collected studies involve source code (e.g., repair, unit test case generation), it is not surprising that researchers choose Codex as the LLM to assist them in accomplishing coding-related tasks.

The third-ranked LLM is CodeT5, which is an open-source LLM developed by Salesforce3. Thanks to its open-source nature, researchers can easily conduct pre-training and fine-tuning with domain-specific data to achieve better performance. Similarly, CodeGen is also open-source and ranked relatively high. Besides, for CodeT5 and CodeGen, more than half of the related studies involve empirical evaluations (which employ multiple LLMs), e.g., program repair [112], [113] and unit test case generation [39].

There are already 14 studies that utilize GPT-4, ranking in fourth place, which was launched in March 2023. Several studies directly utilize this state-of-the-art LLM of OpenAI, since it demonstrates excellent performance across a wide range of generation and reasoning tasks. For example, Xie et al. utilize GPT-4 to generate fuzzing inputs [67], while Vikram et al. employ it to generate property-based tests with the assistance of API documentation [34]. In addition, some studies conduct experiments using both GPT-4 and ChatGPT or other LLMs to provide a more comprehensive evaluation of these models' performance. In their proposed LLM-empowered automatic penetration testing technique, Deng et al. find that GPT-4 surpasses ChatGPT and LaMDA from Google [62]. Similarly, Zhang et al. find that GPT-4 shows its performance superiority over ChatGPT when generating fuzz drivers with both the basic and enhanced query strategies [66]. Furthermore, GPT-4, as a multi-modal LLM, sets itself apart from the other mentioned LLMs by showcasing additional capabilities such as generating image narratives and answering questions based on images [147]. Yet we have not come across any studies that explore the utilization of GPT-4's image-related features (e.g., UI screenshots, programming screencasts) in software testing tasks.

3 https://blog.salesforceairesearch.com/codet5/


Fig. 7. Distribution of how LLMs are used (note that a study can involve multiple types of prompt engineering).

B. Types of Prompt Engineering

As shown in Fig. 7, among our collected studies, 38 studies utilize the LLMs through the pre-training or fine-tuning schema, while 64 studies employ prompt engineering to communicate with LLMs and steer their behavior toward desired outcomes without updating the model weights. When using the early LLMs, their performance might not be as impressive, so researchers often use pre-training or fine-tuning techniques to adjust the models for specific domains and tasks in order to improve their performance. With the upgrading of LLM technology, especially with the introduction of GPT-3 and later LLMs, the knowledge contained within the models and their understanding/inference capability has increased significantly. Therefore, researchers typically rely on prompt engineering and consider how to design appropriate prompts to stimulate the model's knowledge.

Among the 64 studies with prompt engineering, 51 studies involve zero-shot learning, and 25 studies involve few-shot learning (a study may involve multiple types). There are also studies involving chain-of-thought prompting (7 studies), self-consistency (1 study), and automatic prompting (1 study).

Zero-shot learning simply feeds the task text to the model and asks for results. Many of the collected studies employ Codex, CodeT5, and CodeGen (as shown in Section V-A), which are already trained on source code. Hence, for tasks dealing with source code like unit test case generation and program repair, as demonstrated in previous sections, directly querying the LLM with prompts is the common practice. There are generally two manners of zero-shot learning, i.e., with and without instructions. For example, Xie et al. [36] provide the LLMs with instructions such as "please help me generate a JUnit test for a specific Java method..." to facilitate unit test case generation. In contrast, Siddiq et al. [39] only provide the code header of the unit test case (e.g., "class ${className}${suffix}Test {"), and the LLMs carry out the unit test case generation automatically. Generally speaking, prompts with clear instructions will yield more accurate results, while prompts without instructions are typically suitable for very specific situations.

Few-shot learning presents a set of high-quality demonstrations, each consisting of both an input and the desired output, on the target task. As the model first sees the examples, it can better understand human intention and the criteria for what kinds of answers are wanted, which is especially important for tasks that are not so straightforward or intuitive to the LLM. For example, when conducting automatic test generation from general bug reports, Kang et al. [78] provide examples of bug reports (questions) and the corresponding bug-reproducing tests (answers) to the LLM, and their results show that two examples achieve higher performance than no examples or other numbers of examples. As another example, for test assertion generation, Nashid et al. [47] provide demonstrations of the focal method, the test method containing an <AssertPlaceholder>, and the expected assertion, which enables the LLMs to better understand the task.

Chain-of-thought (CoT) prompting provides a sequence of short sentences that describe the reasoning logic step by step (also known as reasoning chains or rationales) to the LLMs for generating the final answer. For example, for program repair from natural language issue descriptions [122], given the buggy code and issue report, the authors first ask the LLM to localize the bug, then ask it to explain why the localized lines are buggy, and finally ask the LLM to fix the bug. Another example is generating unusual programs for fuzzing deep learning libraries: Deng et al. [58] first generate a possible "bug" (bug description) before generating the actual "bug-triggering" code snippet that invokes the target API. The predicted bug description provides an additional hint to the LLM, indicating that the generated code should try to cover specific potential buggy behavior.

Self-consistency involves evaluating the coherence and consistency of the LLM's responses to the same input in different contexts. There is one study with this prompt type, and it is about debugging. Kang et al. [83] employ a hypothesize-observe-conclude loop, which first generates a hypothesis about what the bug is and constructs an experiment to verify it using an LLM, then decides whether the hypothesis is correct based on the experiment result (with a debugger or code execution) using an LLM; after that, depending on the conclusion, it either starts with a new hypothesis or opts to terminate the debugging process and generate a fix.

Automatic prompting aims to automatically generate and select the appropriate instruction for the LLMs, instead of requiring the user to manually engineer a prompt. Xia et al. [67] introduce an auto-prompting step that automatically distils all user-provided inputs into a concise and effective prompt for fuzzing. Specifically, they first generate a list of candidate prompts by incorporating the user inputs and an auto-prompting instruction while setting the LLM at a high temperature; then a small-scale fuzzing experiment is conducted to evaluate each candidate prompt, and the best one is selected.
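For concreteness, the snippet below sketches how a zero-shot prompt and a few-shot prompt for unit test generation might be assembled; the instruction wording and the demonstration pair are illustrative placeholders rather than the prompts used in any particular study.

FEW_SHOT_DEMOS = [
    # (focal method, expected unit test) pairs; toy examples for illustration.
    ("public int add(int a, int b) { return a + b; }",
     "@Test public void testAdd() { assertEquals(5, calc.add(2, 3)); }"),
]

def zero_shot_prompt(focal_method: str) -> str:
    # Instruction-only prompt: the model is told the task but shown no examples.
    return ("Please generate a JUnit test for the following Java method.\n\n"
            f"{focal_method}\n")

def few_shot_prompt(focal_method: str, demos=FEW_SHOT_DEMOS) -> str:
    # Demonstrations come first so the model can infer the expected output form.
    parts = ["Generate a JUnit test for the last method, following the examples.\n"]
    for method, test in demos:
        parts.append(f"### Method\n{method}\n### Test\n{test}\n")
    parts.append(f"### Method\n{focal_method}\n### Test\n")
    return "\n".join(parts)

if __name__ == "__main__":
    focal = "public int sub(int a, int b) { return a - b; }"
    print(zero_shot_prompt(focal))
    print(few_shot_prompt(focal))

The difference is purely in the prompt: the few-shot variant trades some context window for demonstrations that pin down the expected output format.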


Note that there are fourteen studies that apply an iterative prompt design when using zero-shot or few-shot learning, in which the approach continuously refines the prompts with the running information of the testing task, e.g., the test failure information. For example, for program repair, Xia et al. [115] interleave patch generation with test validation feedback to iteratively prompt future generation. In detail, they incorporate various information from a failing test, including its name, the relevant code line(s) triggering the test failure, and the error message produced, into the next round of prompting, which can help the model understand the failure reason and provide guidance towards generating the correct fix. Another example is mobile GUI testing: Liu et al. [14] iteratively query the LLM about the operation (e.g., click a button, enter a text) to be conducted in the mobile app, and at each iteration, they provide the LLM with current context information such as which GUI pages and widgets have just been explored.

Fig. 8. Mapping between testing tasks and how LLMs are used.

Mapping between testing tasks and how LLMs are used. Fig. 8 demonstrates the mapping between the testing tasks (mentioned in Section IV) and how LLMs are used (as introduced in this subsection). Unit test case generation and program repair share similar patterns of communicating with the LLMs, since both tasks are closely related to the source code. Typically, researchers utilize pre-training and/or fine-tuning and zero-shot learning methods for these two tasks. Zero-shot learning is suitable because these tasks are relatively straightforward and can be easily understood by LLMs. Moreover, since the training data for these two tasks can be automatically collected from source code repositories, pre-training and/or fine-tuning methods are widely employed for these two tasks, which can enhance LLMs' understanding of domain-specific knowledge.

In comparison, for system test input generation, zero-shot learning and few-shot learning methods are commonly used. This might be because this task often involves generating specific types of inputs, and demonstrations in few-shot learning can assist the LLMs in better understanding what should be generated. Besides, for this task, the utilization of pre-training and/or fine-tuning methods is not as widespread as in unit test case generation and program repair. This might be attributed to the fact that training data for system testing varies across different software and is relatively challenging to collect automatically.

Fig. 9. Input of LLM.

C. Input of LLM

We also find that different testing tasks or software under test might involve diversified inputs when querying the LLM, as demonstrated in Fig. 9.

The most commonly utilized input is source code, since a large portion of the collected studies relate to program repair or unit test case generation, whose inputs are source code. For unit test case generation, typical code-related information would be (i) the complete focal method, including the signature and body; (ii) the name of the focal class (i.e., the class that the focal method belongs to); (iii) the fields in the focal class; and (iv) the signatures of all methods defined in the focal class [7], [26]. For program repair, there can be different setups involving different inputs, including (i) inputting a buggy function with the goal of outputting the patched function, and (ii) inputting the buggy location with the goal of generating the correct replacement code (which can be a single-line change) given the prefix and suffix of the buggy function [93]. Besides, there can be variations in the buggy location input, i.e., (i) the input does not contain the buggy lines (but the bug location is still known), or (ii) the buggy lines are given as lines of comments.

There are also 12 studies taking the bug description as the input for the LLM. For example, Kang et al. [78] take the bug description as input when querying the LLM and let the LLM generate the bug-reproducing test cases. Fakhoury et al. [122] input the natural language descriptions of bugs to the LLM and generate the correct code fixes.

There are 7 studies that provide intermediate error information, e.g., test failure information, to the LLM and conduct the iterative prompt (as described in Section V-B) to enrich the context provided to the LLM. These studies are related to unit test case generation and program repair, since in these scenarios the running information can be acquired easily.
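The following minimal sketch illustrates what such a feedback-enriched prompt can look like for program repair; the prompt wording, the buggy function, and the failure record are illustrative placeholders rather than the setup of any specific study.

def build_repair_prompt(buggy_code: str, prev_patch=None, failure=None) -> str:
    """Assemble the prompt for one round of conversational repair. The first
    round only contains the buggy code; later rounds are enriched with the
    previous patch and the concrete test-failure information."""
    parts = ["Fix the bug in the following function.", buggy_code]
    if prev_patch is not None and failure is not None:
        parts += [
            "Your previous patch was:",
            prev_patch,
            f"It still fails test `{failure['test_name']}` "
            f"at line {failure['line']} with the error:",
            failure["message"],
            "Please return a corrected version that avoids this failure.",
        ]
    return "\n\n".join(parts)

buggy = "def mid(a, b, c):\n    return max(a, b, c)\n"

# Hypothetical feedback gathered by running the test suite on a first patch.
failure = {"test_name": "test_mid_unordered", "line": 2,
           "message": "AssertionError: expected 2 but got 3 for mid(3, 1, 2)"}

print(build_repair_prompt(buggy, prev_patch=buggy, failure=failure))

Each round therefore hands the model exactly the running information that a developer would look at: which test failed, where, and with what message.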


When testing mobile apps, since the utilized LLMs could not understand the image of the GUI page, the view hierarchy file, which represents the details of the GUI page, usually acts as the input to the LLMs. Nevertheless, with the emergence of GPT-4, which is a multimodal model and accepts both image and text inputs, the GUI screenshots might be directly utilized as the LLM's input.

Fig. 10. Distribution of other techniques incorporated with LLMs (note that a study can involve multiple types).

D. Incorporating Other Techniques With LLM

There are divided opinions on whether LLMs have reached an all-powerful status that requires no other techniques. As shown in Fig. 10, among our collected studies, 67 of them utilize LLMs to address the entire testing task, while 35 studies incorporate additional techniques. These techniques include mutation testing, differential testing, syntactic checking, program analysis, statistical analysis, etc.

The reason why researchers still choose to combine LLMs with other techniques might be that, despite exhibiting enormous potential in various tasks, LLMs still possess limitations such as comprehending code semantics and handling complex program structures. Therefore, combining LLMs with other techniques balances their strengths and weaknesses to achieve better outcomes in specific scenarios. In addition, it is important to note that while LLMs are capable of generating correct code, they may not necessarily produce sufficient test cases to check for edge cases or rare scenarios. This is where mutation and other testing techniques come into play, as they allow for the generation of more diverse and complex code that can better simulate real-world scenarios. Taken in this sense, a testing approach can incorporate a combination of different techniques, including both LLMs and other testing strategies, to ensure comprehensive coverage and effectiveness.

1) LLM + Statistical Analysis: As LLMs can often generate a multitude of outputs, manually sifting through and identifying the correct output can be overwhelmingly laborious. As such, researchers have turned to statistical analysis techniques like ranking and clustering [28], [45], [78], [93], [116] to efficiently filter the LLM's outputs and ultimately obtain more accurate results.

2) LLM + Program Analysis: When utilizing LLMs to accomplish tasks such as generating unit test cases and repairing software code, it is important to consider that software code inherently possesses structural information, which may not be fully understood by LLMs. Hence, researchers often utilize program analysis techniques, including code abstract syntax trees (ASTs) [74], to represent the structure of code more effectively and increase the LLM's ability to comprehend the code accurately. Researchers also perform structure-based subsetting of code lines to narrow the focus for the LLM [94], or extract additional code context from other code files [7], to enable the models to focus on the most task-relevant information in the codebase and lead to more accurate predictions.

3) LLM + Mutation Testing: This combination mainly targets generating more diversified test inputs. For example, Deng et al. [59] first use the LLM to generate seed programs (e.g., code snippets using a target DL API) for fuzzing deep learning libraries. To enrich the pool of these test programs, they replace parts of the seed program with masked tokens using mutation operators (e.g., replacing the API call arguments with the span token) to produce masked inputs, and again utilize the LLMs to perform code infilling to generate new code that replaces the masked tokens.

4) LLM + Syntactic Checking: Although LLMs have shown remarkable performance in various natural language processing tasks, the code generated by these models can sometimes be syntactically incorrect, leading to potential errors and reduced usability. Therefore, researchers have proposed to leverage syntax checking to identify and correct errors in the generated code. For example, in their work on unit test case generation, Alagarsamy et al. [29] additionally introduce a verification method to check and repair the naming consistency (i.e., revising the test method name to be consistent with the focal method name) and the test signatures (i.e., adding missing keywords like public, void, or @Test annotations). Xie et al. [36] also validate the generated unit test cases and employ rule-based repair to fix syntactic and simple compile errors.

5) LLM + Differential Testing: Differential testing is well-suited to finding semantic or logic bugs that do not exhibit explicit erroneous behaviors like crashes or assertion failures. In this category of our collected studies, the LLM is mainly responsible for generating valid and diversified inputs, while differential testing helps to determine whether there is a triggered bug based on the software's output.

For example, Ye et al. [48] first use the LLM to produce random JavaScript programs and leverage the language specification document to generate test data, then conduct differential testing on JavaScript engines such as JavaScriptCore, ChakraCore, SpiderMonkey, QuickJS, etc. There are also studies utilizing the LLMs to generate test inputs and then conduct differential testing for fuzzing DL libraries [58], [59] and SAT solvers [63]. Li et al. [87] employ the LLM in finding failure-inducing test cases. In detail, given a program under test, they first request the LLM to infer the intention of the program, then request the LLM to generate programs that have the same intention, which are alternative implementations of the program and are likely free of the program's bug. They then perform differential testing with the program under test and the generated programs to find the failure-inducing test cases.

VI. CHALLENGES AND OPPORTUNITIES

Based on the above analysis from the viewpoints of software testing and LLMs, we summarize the challenges and opportunities when conducting software testing with LLMs.

A. Challenges

As indicated by this survey, software testing with LLMs has undergone significant growth in the past two years. However, it is still in its early stages of development, and numerous challenges and open questions need to be addressed.

1) Challenges for Achieving High Coverage: Exploring the diverse behaviors of the software under test to achieve high coverage is always a significant concern in software testing. In this context, test generation differs from code generation, as code generation primarily focuses on producing a single, correct code snippet, whereas software testing requires generating diverse test inputs to ensure better coverage of the software. Although setting a high temperature can help the LLMs generate different outputs, it remains challenging for LLMs to directly achieve the required diversity. For example, for unit test case generation on the SF110 dataset, the line coverage is merely 2% and the branch coverage is merely 1% [39]. For system test input generation, in terms of fuzzing DL libraries, the API coverage for TensorFlow is reported to be 66% (2215/3316) [59].

From our collected studies, we observe that researchers often utilize mutation testing together with the LLMs to generate more diversified outputs. For example, when fuzzing a DL library, instead of directly generating the code snippet with the LLM, Deng et al. [59] replace parts of the selected seed (code generated by the LLM) with masked tokens using different mutation operators to produce masked inputs. They then leverage the LLM to perform code infilling to generate new code that replaces the masked tokens, which can significantly increase the diversity of the generated tests. Liu et al. [65] leverage the LLM to produce test generators (each of which can yield a batch of unusual text inputs under the same mutation rule) together with the mutation rules for text-oriented fuzzing, which reduces the human effort required for designing mutation rules.

A potential research direction could involve utilizing testing-specific data to train or fine-tune a specialized LLM that is specifically designed to understand the nature of testing. By doing so, the LLM can inherently acknowledge the requirements of testing and autonomously generate diverse outputs.

2) Challenges in Test Oracle Problem: The oracle problem has been a longstanding challenge in various testing applications, e.g., testing machine learning systems [148] and testing deep learning libraries [59]. To alleviate the oracle problem in the overall testing activities, a common practice in our collected studies is to transform it into a more easily derived form, often by utilizing differential testing [63] or focusing only on identifying crash bugs [14].

There are successful applications of differential testing with LLMs, as shown in Fig. 10. For instance, when testing SMT solvers, Sun et al. adopt differential testing, which involves comparing the results of multiple SMT solvers (i.e., Z3, cvc5, and Bitwuzla) on the same test formulas generated by the LLM [63]. However, this approach is limited to systems for which counterpart software or running environments can easily be found, potentially restricting its applicability. Moreover, to mitigate the oracle problem, other studies only focus on crash bugs, which are easily observed automatically. This is particularly the case for mobile application testing, in which the LLMs guide the testing to explore more diversified pages, conduct more complex operational actions, and cover more meaningful operational sequences [14]. However, this significantly restricts the potential of utilizing the LLMs for uncovering various types of software bugs.

Exploring the use of LLMs to derive other types of test oracles represents an interesting and valuable research direction. Specifically, metamorphic testing is also widely used in software testing practices to help mitigate the oracle problem, yet in most cases, defining metamorphic relations relies on human ingenuity. Luu et al. [56] have examined the effectiveness of LLMs in generating metamorphic relations, yet they only experiment with straightforward prompts by directly querying ChatGPT. Further exploration, potentially incorporating human-computer interaction or domain knowledge, is highly encouraged. Another promising avenue is exploring the capability of LLMs to automatically generate test cases based on metamorphic relations, covering a wide range of inputs.

The advancement of multi-modal LLMs like GPT-4 may open up possibilities for exploring their ability to detect bugs in software user interfaces and assist in deriving test oracles. By leveraging the image understanding and reasoning capabilities of these models, one can investigate their potential to automatically identify inconsistencies, errors, or usability issues in user interfaces.

3) Challenges for Rigorous Evaluations: The lack of benchmark datasets and the potential data leakage issues associated with LLM-based techniques present challenges in conducting rigorous evaluations and comprehensive comparisons of proposed methods.
which can yield a batch of unusual text inputs under the same posed methods.


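The mask-and-infill mutation strategy described earlier (masking parts of an LLM-generated seed and asking the model to fill the gap) can also be pictured with a small sketch. Everything here is illustrative: infill_with_llm is a hypothetical stand-in for a real infilling model call, and the seed program and mutation operator are deliberately simplified.

```python
import random

MASK = "<MASK>"

# Seed test program (e.g., previously generated by an LLM for a DL library).
# It is kept as a string; the sketch never executes it.
SEED = """import torch
x = torch.randn(3, 4)
y = torch.nn.functional.relu(x)
print(y.sum())"""

def mask_random_line(code: str, rng: random.Random) -> str:
    """Mutation operator: replace one randomly chosen statement with a mask token."""
    lines = code.splitlines()
    idx = rng.randrange(1, len(lines))  # keep the import line intact
    lines[idx] = MASK
    return "\n".join(lines)

def infill_with_llm(masked_code: str) -> str:
    """Hypothetical stand-in for an infilling-capable code model.

    In a real pipeline this would send masked_code to the model and return its
    completion; here a canned replacement is used so the sketch runs end to end.
    """
    return masked_code.replace(MASK, "y = torch.nn.functional.gelu(x)")

def generate_variants(seed: str, n: int = 5, seed_value: int = 0) -> list:
    # In a real pipeline each infill call would yield a different completion,
    # which is what increases the diversity of the generated tests.
    rng = random.Random(seed_value)
    return [infill_with_llm(mask_random_line(seed, rng)) for _ in range(n)]

if __name__ == "__main__":
    for i, variant in enumerate(generate_variants(SEED)):
        print(f"--- variant {i} ---\n{variant}\n")
```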
For program repair, there are only two well-known and commonly-used benchmarks, i.e., Defects4J and QuixBugs, as demonstrated in Table IV. Furthermore, these datasets are not specially designed for testing the LLMs. For example, as reported by Xia et al. [93], 39 out of 40 Python bugs in the QuixBugs dataset can be fixed by Codex, yet in real-world practice the successful fix rate can be nowhere near as high. For unit test case generation, there are no widely recognized benchmarks, and different studies would utilize different datasets for performance evaluation, as demonstrated in Table III. This indicates the need to build more specialized and diversified benchmarks.

Furthermore, the LLMs may have seen the widely-used benchmarks in their pre-training data, i.e., data leakage issues. Jiang et al. [113] check CodeSearchNet and BigQuery, which are the data sources of common LLMs, and the results show that four repositories used by the Defects4J benchmark are also in CodeSearchNet, and the whole Defects4J repository is included in BigQuery. Therefore, it is very likely that existing program repair benchmarks were seen by the LLMs during pre-training. This data leakage issue has also been investigated in machine learning-related studies. For example, Tu et al. [149] focus on data leakage in issue tracking data, and the results show that information leaked from the "future" makes prediction models misleadingly optimistic. This reminds us that the performance of LLMs on software testing tasks may not be as good as reported in previous studies. It also suggests that we need more specialized datasets that are not seen by LLMs to serve as benchmarks. One way is to collect them from specialized sources, e.g., user-generated content from niche online communities.

4) Challenges in Real-World Application of LLMs in Software Testing: As we mentioned in Section V-B, in the early days of using LLMs, pre-training and fine-tuning were common practice, since the relatively small number of model parameters resulted in weaker model capabilities (e.g., T5). As time progressed, the number of model parameters increased significantly, leading to the emergence of models with greater capabilities (e.g., ChatGPT), and in recent studies prompt engineering has become a common approach. However, due to concerns regarding data privacy, when considering real-world practice, most software organizations tend to avoid using commercial LLMs and prefer to adopt open-source ones, with training or fine-tuning using organization-specific data. Furthermore, some companies also consider the current limitations in terms of computational power, or pay close attention to energy consumption, so they tend to fine-tune medium-sized models. It is quite challenging for these models to achieve performance similar to what our collected papers have reported. For instance, on the widely-used QuixBugs dataset, it has been reported that 39 out of 40 Python bugs and 34 out of 40 Java bugs can be automatically fixed [93]. However, when it comes to DL programs collected from Stack Overflow, which represent real-world coding practice, only 16 out of 72 Python bugs can be automatically fixed [89].

Recent research has highlighted the importance of high-quality training data in improving the performance of models for code-related tasks [150], yet manually building high-quality organization-specific datasets for training or fine-tuning is time-consuming and labor-intensive. To address this, one is encouraged to utilize automated techniques for mining software repositories to build the datasets; for example, techniques like key information extraction from Stack Overflow [151] offer potential solutions for automatically gathering relevant data.

In addition, exploring the methodology for better fine-tuning the LLMs with software-specific data is worth considering, because software-specific data differs from natural language data in that it contains more structural information, such as data flow and control flow. Previous research on code representations has shown the benefits of incorporating data flow, which captures the semantic-level structure of code and represents the relationship between variables in terms of "where-the-value-comes-from" [152]. These insights can provide valuable guidance for effectively fine-tuning LLMs with software-specific data.

B. Opportunities

There are also many research opportunities in software testing with LLMs, which can greatly benefit developers, users, and the research community. While not necessarily challenges, these opportunities contribute to advancements in software testing, benefiting practitioners and the wider research community.

1) Exploring LLMs in the Early Stage of Testing: As shown in Fig. 4, LLMs have not been used in the early stages of testing, e.g., test requirements and test planning. There might be two main reasons behind that. The first is the subjectivity of early-stage testing tasks. Many tasks in the early stages of testing, such as requirements gathering, test plan creation, and design reviews, may involve subjective assessments that require significant input from human experts. This could make them less suitable for LLMs that rely heavily on data-driven approaches. The second might be the lack of open-sourced data from the early stages. Unlike in later stages of testing, there may be limited data available online for early-stage activities. This could mean that LLMs may not have seen much of this type of data, and therefore may not perform well on these tasks.

Adopting a human-computer interaction schema for tackling early-stage testing tasks would harness the domain-specific knowledge of human developers and leverage the general knowledge embedded in LLMs. Additionally, it is highly encouraged for software development companies to record and provide access to early-stage testing data, allowing for improved training and performance of LLMs in these critical testing activities.

2) Exploring LLMs in Other Testing Phases: We have analyzed the distribution of testing phases for the collected studies. As shown in Fig. 11, we can observe that LLMs are most commonly used in unit testing, followed by system testing. However, there is still no research on the use of LLMs in integration testing and acceptance testing.
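The data-leakage concern raised in the discussion of rigorous evaluations above can be made concrete with a simple decontamination check that measures n-gram overlap between benchmark samples and the pre-training corpus. The sketch below is a minimal, illustrative version; the tokenization, n-gram size, threshold, and toy corpus are assumptions rather than a prescribed protocol.

```python
from __future__ import annotations

def ngrams(text: str, n: int = 8) -> set:
    """Token n-grams of a code snippet (whitespace tokenization keeps it simple)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_sample: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the sample's n-grams that also appear in the corpus.

    A high ratio flags the sample as potentially seen during pre-training and
    therefore unsuitable for an uncontaminated evaluation.
    """
    sample_grams = ngrams(benchmark_sample, n)
    if not sample_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(sample_grams & corpus_grams) / len(sample_grams)

if __name__ == "__main__":
    # Toy corpus and benchmark sample; a real check would stream over the actual
    # data sources of the model under study.
    corpus = ["def add(a, b):\n    return a + b", "def mul(a, b):\n    return a * b"]
    sample = "def add(a, b):\n    return a + b"
    ratio = overlap_ratio(sample, corpus, n=4)
    print(f"overlap ratio: {ratio:.2f}  (flag if above, say, 0.5)")
```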


For integration testing, it involves testing the interfaces between different software modules. In some software organizations, integration testing might be merged with unit testing, which can be a possible reason why LLMs are rarely utilized in integration testing. Another reason might be that the size and complexity of the input data in this circumstance may exceed the capacity of the LLM to process and analyze (e.g., the source code of all involved software modules), which can lead to errors or unreliable results. To tackle this, a potential reference can be found in Section IV-A, where Xie et al. [36] design a method to organize the necessary information into the pre-defined maximum prompt token limit of the LLM. Furthermore, integration testing requires diversified data to be generated to sufficiently test the interfaces among multiple modules. As mentioned in Section IV-C, previous work has demonstrated the LLM's capability in generating diversified test inputs for system testing, in conjunction with mutation testing techniques [48], [59]. These can provide insights about generating diversified interface data for integration testing.

Acceptance testing is usually conducted by business analysts or end-users to validate the system's functionality and usability, which requires more non-technical language and domain-specific knowledge, thus making it challenging to apply LLMs effectively. Since acceptance testing involves humans, it is well-suited for the use of a human-in-the-loop schema with LLMs. This has been studied in traditional machine learning [153], but has not yet been explored with LLMs. Specifically, the LLMs can be responsible for automatically generating test cases, evaluating test coverage, etc., while human testers are responsible for checking the program's behavior and verifying the test oracle.

Fig. 11. Distribution of testing phases (note that we omit the studies which do not explicitly specify the testing phases, e.g., program repair).

3) Exploring LLMs for More Types of Software: We analyze what types of software have been explored in the collected studies, as shown in Fig. 5. Note that, since a large portion of studies are focused on unit testing or program repair, they are conducted on publicly available datasets and do not involve specific software types.

From the analysis in Section IV-C, the LLM can generate not only the source code for testing DL libraries but also the textual input for testing mobile apps, and even the models for testing CPS. Overall, the LLM provides a flexible and powerful framework for generating test inputs for a wide range of applications. Its versatility would make it useful for testing software in other domains.

From one point of view, some proposed techniques can be applied to other types of software. For example, in the paper proposed for testing deep learning libraries [58], since it proposes techniques for generating diversified, complicated, and human-like DL programs, the authors state that the approach can be easily extended to test software systems from other application domains, e.g., interpreters, database systems, and other popular libraries. More than that, there are already studies that focus on universal fuzzing techniques [52], [67], which are designed to be adaptable and applicable to different types of test inputs and software.

From another point of view, other types of software can also benefit from the capabilities of LLMs to design testing techniques that are better suited to their specific domain and characteristics. For instance, the metaverse, with its immersive virtual environments and complex interactions, presents unique challenges for software testing. LLMs can be leveraged to generate diverse and realistic inputs that mimic user behavior and interactions within the metaverse, which have never been explored.

4) Exploring LLMs for Non-Functional Testing: In our collected studies, LLMs are primarily used for functional testing, with no reported practice in performance testing, usability testing, or other non-functional testing. One possible reason for the prevalence of LLM-based solutions in functional testing is that they can convert functional testing problems into code generation or natural language generation problems [14], [59], which LLMs are particularly adept at solving.

On the other hand, performance testing and usability testing may require more specialized models that are designed to detect and analyze specific types of data, handle complex statistical analyses, or determine the criteria for buggy behavior. Moreover, there are already dozens of performance testing tools (e.g., LoadRunner [154]) that can generate a workload simulating real-world usage scenarios and achieve relatively satisfactory performance. A potential opportunity is to let the LLM orchestrate such performance testing tools, acting like LangChain [155], to better simulate different types of workloads based on real user behavior. Furthermore, the LLMs can identify the parameter combinations and values that have the highest potential to trigger performance problems; this is essentially a way to rank and prioritize different parameter settings based on their impact on performance and improve the efficiency of performance testing.

5) Exploring Advanced Prompt Engineering: There are a total of 11 commonly used prompt engineering techniques listed in a popular prompt engineering guide [156], as shown in Fig. 12. Currently, in our collected studies, only the first five techniques are being utilized. The more advanced techniques have not been employed yet and can be explored in the future for prompt design.

For instance, multimodal chain-of-thought prompting involves using diverse sensory and cognitive cues to stimulate thinking and creativity in LLMs [157]. Providing images (e.g., GUI screenshots) or audio recordings related to the software under test can help the LLM better understand the software's context and potential issues. Besides, one can prompt the LLM to imagine itself in different roles, such as a developer, user, or quality assurance specialist. This perspective-shifting exercise enables the LLM to approach software testing from multiple viewpoints and uncover different aspects that might require attention or investigation.
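As a small illustration of the role prompting and multimodal cues mentioned above, the sketch below builds prompts that cast the model in different tester roles and reference a GUI screenshot. The role names, template wording, and the screenshot placeholder are hypothetical and do not reproduce any surveyed tool.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class TestingPrompt:
    """Illustrative prompt builder combining role prompting with a multimodal cue."""
    role: str
    app_context: str
    screenshot_path: str | None = None

    def render(self) -> str:
        parts = [
            f"You are acting as a {self.role}.",
            f"Software under test: {self.app_context}",
            "Task: propose test scenarios and the checks (oracles) you would apply.",
        ]
        if self.screenshot_path:
            # With a multimodal model the image itself would be attached;
            # here we only record the reference.
            parts.append(f"[Attached GUI screenshot: {self.screenshot_path}]")
        return "\n".join(parts)

if __name__ == "__main__":
    # Asking the same question from several perspectives is the point of
    # perspective-shifting prompts; the aggregated answers can then be compared.
    for role in ("developer", "end user", "quality assurance specialist"):
        prompt = TestingPrompt(role=role,
                               app_context="shopping-cart checkout flow",
                               screenshot_path="checkout.png")
        print(prompt.render(), end="\n\n")
```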


Graph prompting [158] involves the representation of information using graphs or visual structures to facilitate understanding and problem-solving. Graph prompting can be a natural match with software engineering, considering that software involves various dependencies, control flow, data flow, state transitions, or other relevant graph structures. Graph prompting can be beneficial in analyzing this structural information and enabling the LLMs to comprehend the software under test more effectively. For instance, testers can use graph prompts to visualize test coverage, identify untested areas or paths, and ensure adequate test execution.

Fig. 12. List of advanced prompt engineering practices and those utilized in the collected papers.

6) Incorporating LLMs With Traditional Techniques: There is currently no clear consensus on the extent to which LLMs can solve software testing problems. From the analysis in Section V-D, we have seen some promising results from studies that have combined LLMs with traditional software testing techniques. This implies that LLMs are not the sole silver bullet for software testing. Considering the availability of many mature software testing techniques and tools, and the limited capabilities of LLMs, it is necessary to explore better ways to combine LLMs with traditional testing or program analysis techniques and tools for better software testing.

Based on the collected studies, the LLMs have been successfully utilized together with various techniques such as differential testing (e.g., [63]), mutation testing (e.g., [59]), and program analysis (e.g., [104]), as shown in Fig. 10. From one perspective, future studies can explore improved integration of these traditional techniques with LLMs. Take mutation testing as an example: current practices mainly rely on human-designed mutation rules to mutate the candidate tests and let the LLMs re-generate new tests [38], [59], [67], while Liu et al. directly utilize the LLMs for producing the mutation rules alongside the mutated tests [65]. Further explorations in this direction are of great interest.

From another point of view, more traditional techniques can be incorporated with LLMs for software testing. For instance, besides the aforementioned traditional techniques, the LLMs have been combined with formal verification for self-healing software detection in the field of software security [159]. More attempts are encouraged. Moreover, considering the existence of numerous mature software testing tools, one can explore the integration of LLMs with these tools, allowing them to act as a "LangChain" to better explore the potential of these tools.

VII. RELATED WORK

The systematic literature review is a crucial means of gaining insights into the current trends and future directions within a particular field. It enables us to understand and stay updated on the developments in that domain.

Wang et al. surveyed the machine learning and deep learning techniques for software engineering [160]. Yang et al. and Watson et al. respectively carried out surveys about the use of deep learning in the software engineering domain [161], [162]. Bajammal et al. surveyed the utilization of computer vision techniques to improve software engineering tasks [163]. Zhang et al. provided a survey of techniques for testing machine learning systems [150].

With the advancements of artificial intelligence and LLMs, researchers have also conducted systematic literature reviews about LLMs and their applications in various fields (e.g., software engineering). Zhao et al. [17] reviewed recent advances in LLMs by providing an overview of their background, key findings, and mainstream techniques. They focused on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Additionally, they summarized the available resources for developing LLMs and discussed the remaining issues for future directions. Hou et al. conducted a systematic literature review on using LLMs for software engineering, with a particular focus on understanding how LLMs can be exploited to optimize processes and outcomes [164]. Fan et al. conducted a survey of LLMs for software engineering, and set out open research challenges for the application of LLMs to technical problems faced by software engineers [165]. Zan et al. conducted a survey of existing LLMs for the NL2Code task (i.e., generating code from a natural language description), and reviewed benchmarks and metrics [166].

While these studies either targeted the broader software engineering domain (with a limited focus on software testing tasks) or focused on other software development tasks (excluding software testing), this paper specifically focuses on the use of LLMs for software testing. It surveys related studies, summarizes key challenges and potential opportunities, and serves as a roadmap for future research in this area.

VIII. CONCLUSION

This paper provides a comprehensive review of the use of LLMs in software testing. We have analyzed relevant studies that have utilized LLMs in software testing from both the software testing and LLMs perspectives. This paper also highlights the challenges and potential opportunities in this direction. Results of this review demonstrate that LLMs have been successfully applied in a wide range of testing tasks, including unit test case generation, test oracle generation, system test input generation, program debugging, and program repair. However, challenges still exist in achieving high testing coverage, addressing the test oracle problem, conducting rigorous evaluations, and applying LLMs in real-world scenarios. Additionally, it is observed that LLMs are commonly used in only a subset of the entire testing lifecycle; for example, they are primarily utilized in the middle and later stages of testing, only serving


the unit and system testing phases, and only for functional testing. This highlights the research opportunities for exploring the uncovered areas. Regarding how the LLMs are utilized, we find that various pre-training/fine-tuning and prompt engineering methods have been developed to enhance the capabilities of LLMs in addressing testing tasks. However, more advanced techniques in prompt design have yet to be explored and can be an avenue for future research.

It can serve as a roadmap for future research in this area, identifying gaps in our current understanding of the use of LLMs in software testing and highlighting potential avenues for exploration. We believe that the insights provided in this paper will be valuable to both researchers and practitioners in the field of software engineering, assisting them in leveraging LLMs to improve software testing practices and ultimately enhance the quality and reliability of software systems.

[17] W. X. Zhao et al., "A survey of large language models," 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.18223
[18] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," in Proc. NeurIPS, 2022. [Online]. Available: http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html
[19] J. Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," in Proc. NeurIPS, 2022. [Online]. Available: http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
[20] J. Li, G. Li, Y. Li, and Z. Jin, "Structured chain-of-thought prompting for code generation," 2023, arXiv:2305.06599.
[21] J. Li, Y. Li, G. Li, Z. Jin, Y. Hao, and X. Hu, "Skcoder: A sketch-based approach for automatic code generation," in Proc. IEEE/ACM 45th Int. Conf. Softw. Eng. (ICSE), 2023, pp. 2124–2135.
[22] J. Li, Y. Zhao, Y. Li, G. Li, and Z. Jin, "AceCoder: Utilizing existing code to enhance code generation," 2023, arXiv:2303.17780.
[23] Y. Dong, X. Jiang, Z. Jin, and G. Li, "Self-collaboration code generation via chatGPT," 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2304.07590
[24] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, “Unifying large
REFERENCES language models and knowledge graphs: A roadmap,” 2023. [Online].
Available: https://doi.org/10.48550/arXiv.2306.08302
[1] G. J. Myers, T. Badgett, T. M. Thomas, and C. Sandler, The Art of [25] M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan,
Software Testing, 2nd ed. Hoboken, NJ, USA: Wiley, 2004. “Unit test case generation with transformers and focal context,” 2020,
[2] M. Pezze and M. Young, Software Testing and Analysis—Process, arXiv:2009.05617.
Principles and Techniques. Hoboken, NJ, USA: Wiley, 2007. [26] B. Chen et al., “Codet: Code generation with generated tests,” 2022,
[3] M. Harman and P. McMinn, “A theoretical and empirical study of arXiv:2207.10397.
search-based testing: Local, global, and hybrid search,” IEEE Trans. [27] S. K. Lahiri et al., “Interactive code generation via test-driven user-
Softw. Eng., vol. 36, no. 2, pp. 226–247, Mar./Apr. 2010.
intent formalization,” 2022, arXiv:2208.05950.
[4] P. Delgado-Pérez, A. Ramírez, K. J. Valle-Gómez, I. Medina-Bulo, and
[28] S. Alagarsamy, C. Tantithamthavorn, and A. Aleti, “A3test: Assertion-
J. R. Romero, “InterEvo-TR: Interactive evolutionary test generation
augmented automated test case generation,” 2023, arXiv:2302.10352.
with readability assessment,” IEEE Trans. Softw. Eng., vol. 49, no. 4,
[29] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation
pp. 2580–2596, Apr. 2023.
[5] X. Xiao, S. Li, T. Xie, and N. Tillmann, “Characteristic studies of of using large language models for automated unit test generation,”
loop problems for structural test generation via symbolic execution,” IEEE Trans. Softw. Eng., vol. 50, no. 1, pp. 85–105, Jan. 2024.
in Proc. 28th IEEE/ACM Int. Conf. Automated Softw. Eng., (ASE), [30] V. Guilherme and A. Vincenzi, “An initial investigation of chatGPT
Silicon Valley, CA, USA, E. Denney, T. Bultan, and A. Zeller, Eds., unit test generation capability,” in Proc. 8th Brazilian Symp. Systematic
Piscataway, NJ, USA: IEEE Press, Nov. 2013, pp. 246–256. Automated Softw. Testing (SAST), Campo Grande, Brazil, A. L. Fontão
[6] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Feedback-directed et al., Eds., New York, NY, USA: ACM, Sep. 2023, pp. 15–24.
random test generation,” in Proc. 29th Int. Conf. Softw. Eng. (ICSE), [31] S. Hashtroudi, J. Shin, H. Hemmati, and S. Wang, “Automated test case
Minneapolis, MN, USA, Los Alamitos, CA, USA: IEEE Comput. Soc. generation using code models and domain adaptation,” 2023. [Online].
Press, May 2007, pp. 75–84. Available: https://doi.org/10.48550/arXiv.2308.08033
[7] Z. Yuan, et al., “No more manual tests? Evaluating and improving [32] L. Plein, W. C. Ouédraogo, J. Klein, and T. F. Bissyandé, “Automatic
chatGPT for unit test generation,” 2023, arXiv:2305.04207. generation of test cases based on bug reports: A feasibility study with
[8] Y. Tang, Z. Liu, Z. Zhou, and X. Luo, “ChatGPT vs SBST: A large language models,” 2023. [Online]. Available: https://doi.org/10.
comparative assessment of unit test suite generation,” 2023. [Online]. 48550/arXiv.2310.06320
Available: https://doi.org/10.48550/arXiv.2307.00588 [33] V. Vikram, C. Lemieux, and R. Padhye, “Can large language models
[9] Android Developers, “Ui/application exerciser monkey,” 2012. Ac- write good property-based tests?” 2023. [Online]. Available: https://
cessed: Dec. 27, 2023. [Online]. Available: https://developer.android. doi.org/10.48550/arXiv.2307.04346
google.cn/studio/test/other-testing-tools/monkey [34] N. Rao, K. Jain, U. Alon, C. L. Goues, and V. J. Hellendoorn, “CAT-
[10] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI- LM training language models on aligned code and tests,” in Proc.
guided test input generator for android,” in Proc. IEEE/ACM 39th Int. 38th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Luxembourg,
Conf. Softw. Eng. Companion (ICSE), Piscataway, NJ, USA: IEEE Piscataway, NJ, USA: IEEE Press, Sep. 2023, pp. 409–420, doi:
Press, 2017, pp. 23–26. 10.1109/ASE56229.2023.00193.
[11] T. Su et al., “Guided, stochastic model-based gui testing of an- [35] Z. Xie, Y. Chen, C. Zhi, S. Deng, and J. Yin, “Chatunitest: A chatGPT-
droid apps,” in Proc. 11th Joint Meeting Found. Softw. Eng., 2017, based automated unit test generation tool,” 2023, arXiv:2305.04764.
pp. 245–256.
[36] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “CodaMosa: Escaping
[12] Z. Dong, M. Böhme, L. Cojocaru, and A. Roychoudhury, “Time-travel
coverage plateaus in test generation with pre-trained large language
testing of android apps,” in Proc. IEEE/ACM 42nd Int. Conf. Softw.
models,” in Proc. Int. Conf. Softw. Eng. (ICSE), 2023, pp. 919–931.
Eng. (ICSE), Piscataway, NJ, USA: IEEE Press, 2020, pp. 481–492.
[37] A. M. Dakhel, A. Nikanjam, V. Majdinasab, F. Khomh, and M. C.
[13] M. Pan, A. Huang, G. Wang, T. Zhang, and X. Li, “Reinforcement
learning based curiosity-driven testing of android applications,” in Desmarais, “Effective test generation using pre-trained large language
Proc. 29th ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2020, models and mutation testing,” 2023. [Online]. Available: https://doi.
pp. 153–164. org/10.48550/arXiv.2308.16557
[14] Z. Liu et al., “Make LLM a testing expert: Bringing human-like [38] M. L. Siddiq, J. Santos, R. H. Tanvir, N. Ulfat, F. A. Rifat, and V.
interaction to mobile GUI testing via functionality-aware decisions,” C. Lopes, “Exploring the effectiveness of large language models in
2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.15780 generating unit tests,” 2023, arXiv:2305.00418.
[15] T. Su, J. Wang, and Z. Su, “Benchmarking automated GUI testing for [39] Y. Zhang, W. Song, Z. Ji, D. Yao, and N. Meng, “How well does LLM
android against real-world bugs,” in Proc. 29th ACM Joint Eur. Softw. generate security tests?” 2023. [Online]. Available: https://doi.org/10.
Eng. Conf. Symp. Found. Softw. Eng. (ESEC/FSE), Athens, Greece, 48550/arXiv.2310.00710
New York, NY, USA: ACM, Aug. 2021, pp. 119–130. [40] V. Li and N. Doiron, “Prompting code interpreter to write better unit
[16] M. Shanahan, “Talking about large language models,” 2022. [Online]. tests on quixbugs functions,” 2023. [Online]. Available: https://doi.org/
Available: https://doi.org/10.48550/arXiv.2212.03551 10.48550/arXiv.2310.00483


[41] B. Steenhoek, M. Tufano, N. Sundaresan, and A. Svyatkovskiy, “Rein- [64] Z. Liu et al., “Testing the limits: Unusual text inputs generation for
forcement learning from automatic feedback for high-quality unit test mobile app crash detection with large language model,” 2023. [Online].
generation,” 2023, arXiv:2310.02368. Available: https://doi.org/10.48550/arXiv.2310.15657
[42] S. Bhatia, T. Gandhi, D. Kumar, and P. Jalote, “Unit test generation [65] C. Zhang et al., “Understanding large language model based fuzz driver
using generative AI: A comparative performance analysis of autogen- generation,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.
eration tools,” 2023, arXiv:2312.10622. 2307.12469
[43] M. Tufano, D. Drain, A. Svyatkovskiy, and N. Sundaresan, “Gener- [66] C. Xia, M. Paltenghi, J. Tian, M. Pradel, and L. Zhang, “Universal
ating accurate assert statements for unit test cases using pretrained fuzzing via large language models,” 2023. [Online]. Available: https://
transformers,” in Proc. 3rd ACM/IEEE Int. Conf. Automat. Softw. Test, api.semanticscholar.org/CorpusID:260735598
2022, pp. 54–64. [67] C. Tsigkanos, P. Rani, S. Müller, and T. Kehrer, “Variable discovery
[44] P. Nie, R. Banerjee, J. J. Li, R. J. Mooney, and M. Gligoric, “Learning with large language models for metamorphic testing of scientific
deep semantics for test completion,” 2023, arXiv:2302.10166. software,” in Proc. 23rd Int. Conf. Comput. Sci. (ICCS), Prague,
[45] A. Mastropaolo et al., “Using transfer learning for code-related tasks,” Czech Republic, J. Mikyska, C. de Mulatier, M. Paszynski, V. V.
IEEE Trans. Softw. Eng., vol. 49, no. 4, pp. 1580–1598, Apr. 2023, Krzhizhanovskaya, J. J. Dongarra, and P. M. A. Sloot, Eds., vol. 14073.
doi: 10.1109/TSE.2022.3183297. Springer, Jul. 2023, pp. 321–335, doi: 10.1007/978-3-031-35995-8_23.
[46] N. Nashid, M. Sintaha, and A. Mesbah, “Retrieval-based prompt [68] C. Yang et al., “White-box compiler fuzzing empowered by large
selection for code-related few-shot learning,” in Proc. 45th Int. Conf. language models,” 2023. [Online]. Available: https://doi.org/10.48550/
Softw. Eng. (ICSE), 2023, pp. 2450–2462. arXiv.2310.15991
[47] G. Ye et al., “Automated conformance testing for javascript engines [69] T. Zhang, I. C. Irsan, F. Thung, D. Han, D. Lo, and L. Jiang, “iTiger:
via deep compiler fuzzing,” in Proc. 42nd ACM SIGPLAN Int. Conf. An automatic issue title generation tool,” in Proc. 30th ACM Joint Eur.
Program. Lang. Des. Implementation, 2021, pp. 435–450. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2022, pp. 1637–1641.
[48] Z. Liu et al., “Fill in the blank: Context-aware automated text input [70] Y. Huang et al., “Crashtranslator: Automatically reproducing mobile
generation for mobile gui testing,” 2022, arXiv:2212.04732. application crashes directly from stack trace,” 2023. [Online]. Avail-
[49] M. R. Taesiri, F. Macklon, Y. Wang, H. Shen, and C.-P. Bezemer, able: https://doi.org/10.48550/arXiv.2310.07128
“Large language models are pretty good zero-shot video game bug [71] T. Zhang, I. C. Irsan, F. Thung, and D. Lo, “Cupid: Leveraging chatGPT
detectors,” 2022, arXiv:2210.02506. for more accurate duplicate bug report detection,” 2023. [Online].
[50] S. L. Shrestha and C. Csallner, “SlGPT: Using transfer learning to Available: https://doi.org/10.48550/arXiv.2308.10022
directly generate simulink model files and find bugs in the simulink [72] U. Mukherjee and M. M. Rahman, “Employing deep learning and
toolchain,” in Proc. Eval. Assessment Softw. Eng., 2021, pp. 260–265. structured information retrieval to answer clarification questions on bug
[51] J. Hu, Q. Zhang, and H. Yin, “Augmenting greybox fuzzing with reports,” 2023, arXiv:2304.12494.
generative AI,” 2023. [Online]. Available: https://doi.org/10.48550/ [73] P. Mahbub, O. Shuvo, and M. M. Rahman, “Explaining software
arXiv.2306.06782
bugs leveraging code structures in neural machine translation,” 2022,
[52] A. Mathur, S. Pradhan, P. Soni, D. Patel, and R. Regunathan, “Auto-
arXiv:2212.04584.
mated test case generation using t5 and GPT-3,” in Proc. 9th Int. Conf.
[74] S. Feng and C. Chen, “Prompting is all your need: Automated android
Adv. Comput. Commun. Syst. (ICACCS), vol. 1, 2023, pp. 1986–1992.
bug replay with large language models,” 2023. [Online]. Available:
[53] D. Zimmermann and A. Koziolek, “Automating GUI-based software
https://doi.org/10.48550/arXiv.2306.01987
testing with GPT-3,” in Proc. IEEE Int. Conf. Softw. Testing, Verifica-
[75] Y. Su, Z. Han, Z. Gao, Z. Xing, Q. Lu, and X. Xu, “Still confusing for
tion Validation Workshops (ICSTW), 2023, pp. 62–65.
bug-component triaging? Deep feature learning and ensemble setting to
[54] M. Taeb, A. Swearngin, E. Schoop, R. Cheng, Y. Jiang, and J. Nichols,
rescue,” in Proc. 31st IEEE/ACM Int. Conf. Program Comprehension
“Axnav: Replaying accessibility tests from natural language,” 2023.
(ICPC), Melbourne, Australia, Piscataway, NJ, USA: IEEE Press, May
[Online]. Available: https://doi.org/10.48550/arXiv.2310.02424
2023, pp. 316–327, doi: 10.1109/ICPC58990.2023.00046.
[55] Q. Luu, H. Liu, and T. Y. Chen, “Can chatGPT advance software
testing intelligence? An experience report on metamorphic testing,” [76] N. D. Bui, Y. Wang, and S. Hoi, “Detect-localize-repair: A
2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.19204 unified framework for learning to debug with codet5,” 2022,
[56] A. Khanfir, R. Degiovanni, M. Papadakis, and Y. L. Traon, “Ef- arXiv:2211.14875.
ficient mutation testing via pre-trained language models,” 2023, [77] S. Kang, J. Yoon, and S. Yoo, “Large language models are few-
arXiv:2301.03543. shot testers: Exploring llm-based general bug reproduction,” 2022,
[57] Y. Deng, C. S. Xia, C. Yang, S. D. Zhang, S. Yang, and L. Zhang, arXiv:2209.11515.
“Large language models are edge-case fuzzers: Testing deep learning [78] S. Kang, G. An, and S. Yoo, “A preliminary evaluation of LLM-based
libraries via fuzzGPT,” 2023, arXiv:2304.02014. fault localization,” 2023. [Online]. Available: https://doi.org/10.48550/
[58] Y. Deng, C. S. Xia, C. Yang, S. D. Zhang, S. Yang, and L. Zhang, arXiv.2308.05487
“Large language models are zero shot fuzzers: Fuzzing deep learning [79] P. Widjojo and C. Treude, “Addressing compiler errors: Stack overflow
libraries via large language models,” 2023, arXiv:2209.11515. or large language models?” 2023. [Online]. Available: https://doi.org/
[59] J. Ackerman and G. Cybenko, “Large language models for fuzzing 10.48550/arXiv.2307.10793
parsers (registered report),” in Proc. 2nd Int. Fuzzing Workshop [80] L. Plein and T. F. Bissyandé, “Can LLMs demystify bug reports?”
(FUZZING) Seattle, WA, USA, M. Böhme, Y. Noller, B. Ray, and 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.06310
L. Szekeres, Eds., New York, NY, USA: ACM, Jul. 2023, pp. 31–38, [81] A. Taylor, A. Vassar, J. Renzella, and H. A. Pearce, “DCC—Help:
doi: 10.1145/3605157.3605173. Generating context-aware compiler error explanations with large lan-
[60] S. Yu, C. Fang, Y. Ling, C. Wu, and Z. Chen, “LLM for test script guage models,” 2023. [Online]. Available: https://api.semanticscholar.
generation and migration: Challenges, capabilities, and opportunities,” org/CorpusID:261076439
2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.13574 [82] S. Kang, B. Chen, S. Yoo, and J.-G. Lou, “Explainable automated
[61] G. Deng et al., “PentestGPT: An llm-empowered automatic penetration debugging via large language model-driven scientific debugging,” 2023,
testing tool,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv. arXiv:2304.02195.
2308.06782 [83] A. Z. H. Yang, R. Martins, C. L. Goues, and V. J. Hellendoorn,
[62] M. Sun, Y. Yang, Y. Wang, M. Wen, H. Jia, and Y. Zhou, “SMT “Large language models for test-free fault localization,” 2023. [Online].
solver validation empowered by large pre-trained language models,” Available: https://doi.org/10.48550/arXiv.2310.01726
in Proc. 38th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), [84] Y. Wu, Z. Li, J. M. Zhang, M. Papadakis, M. Harman, and Y.
Luxembourg, Piscataway, NJ, USA: IEEE Press, 2023, pp. 1288–1300, Liu, “Large language models in fault localisation,” 2023. [Online].
doi: 10.1109/ASE56229.2023.00180. Available: https://doi.org/10.48550/arXiv.2308.15276
[63] Y. Deng, J. Yao, Z. Tu, X. Zheng, M. Zhang, and T. Zhang, “Target: Au- [85] H. Tu, Z. Zhou, H. Jiang, I. N. B. Yusuf, Y. Li, and L. Jiang,
tomated scenario generation from traffic rules for testing autonomous “LLM4CBI: Taming llms to generate effective test programs for
vehicles,” 2023. [Online]. Available: https://api.semanticscholar.org/ compiler bug isolation,” 2023. [Online]. Available: https://doi.org/10.
CorpusID:258588387 48550/arXiv.2307.00593


[86] T.-O. Li et al., “Nuances are the key: Unlocking chatGPT to find [107] P. Deligiannis, A. Lal, N. Mehrotra, and A. Rastogi, “Fixing rust
failure-inducing tests with differential prompting,” in Proc. 38th compilation errors using LLMs,” 2023. [Online]. Available: https://doi.
IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), 2023, pp. 14–26. org/10.48550/arXiv.2308.05177
[87] X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large language [108] F. Ribeiro, R. Abreu, and J. Saraiva, “Framing program repair as code
models to self-debug,” 2023. [Online]. Available: https://doi.org/10. completion,” in Proc. 3rd Int. Workshop Automated Program Repair,
48550/arXiv.2304.05128 2022, pp. 38–45.
[88] J. Cao, M. Li, M. Wen, and S.-c. Cheung, “A study on prompt design, [109] N. Wadhwa et al., “Frustrated with code quality issues? LLMs can
advantages and limitations of chatGPT for deep learning program help!” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.
repair,” 2023, arXiv:2304.08191. 12938
[89] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Ex- [110] F. Ribeiro, J. N. C. de Macedo, K. Tsushima, R. Abreu, and J.
amining zero-shot vulnerability repair with large language models,” in Saraiva, “GPT-3-powered type error debugging: Investigating the use of
Proc. IEEE Symp. Secur. Privacy (SP), Los Alamitos, CA, USA: IEEE large language models for code repair,” in Proc. 16th ACM SIGPLAN
Comput. Soc., 2022, pp. 1–18. Int. Conf. Softw. Lang. Eng. (SLE), Cascais, Portugal, J. Saraiva, T.
[90] Z. Fan, X. Gao, A. Roychoudhury, and S. H. Tan, “Automated repair Degueule, and E. Scott, Eds., New York, NY, USA: ACM, Oct. 2023,
of programs from large language models,” 2022, arXiv:2205.10583. pp. 111–124, doi: 10.1145/3623476.3623522.
[91] Y. Hu, X. Shi, Q. Zhou, and L. Pike, “Fix bugs with transformer [111] Y. Wu et al., “How effective are neural networks for fixing security
through a neural-symbolic edit grammar,” 2022, arXiv:2204.06643. vulnerabilities,” 2023, arXiv:2305.18607.
[92] C. S. Xia, Y. Wei, and L. Zhang, “Practical program repair in the era [112] N. Jiang, K. Liu, T. Lutellier, and L. Tan, “Impact of code language
of large pre-trained language models,” 2022, arXiv:2210.14179. models on automated program repair,” 2023, arXiv:2302.05020.
[93] J. Zhang et al., “Repairing bugs in python assignments using large [113] M. Jin et al., “Inferfix: End-to-end program repair with LLMs,” 2023,
language models,” 2022, arXiv:2209.14876. arXiv:2303.07263.
[94] M. Lajkó, V. Csuvik, and L. Vidács, “Towards javascript program [114] C. S. Xia and L. Zhang, “Keep the conversation going: Fixing 162 out
repair with generative pre-trained transformer (GPT-2),” in Proc. 3rd of 337 bugs for $0.42 each using chatGPT,” 2023, arXiv:2304.00385.
Int. Workshop Automated Program Repair, 2022, pp. 61–68. [115] Y. Zhang, G. Li, Z. Jin, and Y. Xing, “Neural program repair with
[95] D. Sobania, M. Briesch, C. Hanna, and J. Petke, “An analysis of the program dependence analysis and effective filter mechanism,” 2023,
automatic bug fixing performance of chat,” 2023, arXiv:2301.08653. arXiv:2305.09315.
[96] K. Huang et al., “An empirical study on fine-tuning large lan- [116] J. A. Prenner and R. Robbes, “Out of context: How important is local
guage models of code for automated program repair,” in Proc. 38th context in neural program repair?” 2023, arXiv:2312.04986.
IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Luxembourg, [117] Q. Zhang, C. Fang, B. Yu, W. Sun, T. Zhang, and Z. Chen, “Pre-
Piscataway, NJ, USA: IEEE Press, Sep. 2023, pp. 1162–1174, doi: trained model-based automated software vulnerability repair: How far
10.1109/ASE56229.2023.00181. are we?” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.
[97] M. C. Wuisang, M. Kurniawan, K. A. Wira Santosa, A. Agung Santoso
2308.12533
Gunawan, and K. E. Saputra, “An evaluation of the effectiveness
[118] S. Garg, R. Z. Moghaddam, and N. Sundaresan, “Rapgen: An approach
of openai’s chatGPT for automated python program bug fixing us-
for fixing code inefficiencies in zero-shot,” 2023. [Online]. Available:
ing quixbugs,” in Proc. Int. Seminar Appl. Technol. Inf. Commun.
https://doi.org/10.48550/arXiv.2306.17077
(iSemantic), 2023, pp. 295–300.
[119] W. Wang, Y. Wang, S. Joty, and S. C. H. Hoi, “Rap-gen: Retrieval-
[98] D. Horváth, V. Csuvik, T. Gyimóthy, and L. Vidács, “An extensive
augmented patch generation with codet5 for automatic program repair,”
study on model architecture and program representation in the domain
in Proc. 31st ACM Joint Eur. Softw. Eng. Conf. Symp. Found. Softw.
of learning-based automated program repair,” in Proc. IEEE/ACM
Eng. (ESEC/FSE), San Francisco, CA, USA, S. Chandra, K. Blincoe,
Int. Workshop Automated Program Repair (APR@ICSE), Melbourne,
and P. Tonella, Eds., New York, NY, USA: ACM, Dec. 2023, pp. 146–
Australia, Piscataway, NJ, USA: IEEE Press, May 2023, pp. 31–38,
158, doi: 10.1145/3611643.3616256.
doi: 10.1109/APR59189.2023.00013.
[99] J. A. Prenner, H. Babii, and R. Robbes, “Can openai’s codex fix bugs? [120] Y. Zhang, Z. Jin, Y. Xing, and G. Li, “STEAM: Simulating the
An evaluation on quixbugs,” in in Proc. 3rd Int. Workshop Automated interactive behavior of programmers for automatic bug fixing,” 2023.
Program Repair, 2022, pp. 69–75. [Online]. Available: https://doi.org/10.48550/arXiv.2308.14460
[100] W. Yuan et al., “Circle: Continual repair across programming lan- [121] S. Fakhoury, S. Chakraborty, M. Musuvathi, and S. K. Lahiri, “Towards
guages,” in Proc. 31st ACM SIGSOFT Int. Symp. Softw. Testing Anal., generating functionally correct code edits from natural language issue
2022, pp. 678–690. descriptions,” 2023, arXiv:2304.03816.
[101] S. Moon et al., “Coffee: Boost your code llms by fixing bugs with [122] M. Fu, C. Tantithamthavorn, T. Le, V. Nguyen, and D. Phung, “Vul-
feedback,” 2023, arXiv:2311.07215. repair: A t5-based automated software vulnerability repair,” in Proc.
[102] Y. Wei, C. S. Xia, and L. Zhang, “Copiloting the copilots: Fusing 30th ACM Joint Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2022,
large language models with completion engines for automated program pp. 935–947.
repair,” in Proc. 31st ACM Joint Eur. Softw. Eng. Conf. Symp. Found. [123] S. Gao, X. Wen, C. Gao, W. Wang, H. Zhang, and M. R. Lyu, “What
Softw. Eng. (ESEC/FSE), San Francisco, CA, USA, S. Chandra, K. makes good in-context demonstrations for code intelligence tasks with
Blincoe, and P. Tonella, Eds., New York, NY, USA: ACM, Dec. 2023, LLMs?” in Proc. 38th IEEE/ACM Int. Conf. Automated Softw. Eng.
pp. 172–184, doi: 10.1145/3611643.3616271. (ASE), Luxembourg, Piscataway, NJ, USA: IEEE Press, Sep. 2023,
[103] Y. Peng, S. Gao, C. Gao, Y. Huo, and M. R. Lyu, “Domain knowledge pp. 761–773, doi: 10.1109/ASE56229.2023.00109.
matters: Improving prompts with fix templates for repairing python [124] C. Treude and H. Hata, “She elicits requirements and he tests: Software
type errors,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv. engineering gender bias in large language models,” 2023. [Online].
2306.01394 Available: https://doi.org/10.48550/arXiv.2303.10131
[104] A. E. I. Brownlee et al., “Enhancing genetic improvement mutations [125] R. Kocielnik, S. Prabhumoye, V. Zhang, R. M. Alvarez, and A. Anand-
using large language models,” in Proc. 15th Int. Symp. Search-Based kumar, “Autobiastest: Controllable sentence generation for automated
Softw. Eng. (SSBSE), San Francisco, CA, USA, P. Arcaini, T. Yue, and open-ended social bias testing in language models,” 2023. [Online].
and E. M. Fredericks, Eds., vol. 14415. Cham, Switzerland: Springer Available: https://doi.org/10.48550/arXiv.2302.07371
Nature, Dec. 2023, pp. 153–159, doi: 10.1007/978-3-031-48796-5_13. [126] M. Ciniselli, L. Pascarella, and G. Bavota, “To what extent do deep
[105] M. M. A. Haque, W. U. Ahmad, I. Lourentzou, and C. Brown, “Fix- learning-based code recommenders generate predictions by cloning
eval: Execution-based evaluation of program fixes for programming code from the training set?” in Proc. 19th IEEE/ACM Int. Conf. Mining
problems,” in Proc. IEEE/ACM Int. Workshop Automated Program Re- Softw. Repositories (MSR), Pittsburgh, PA, USA, New York, NY, USA:
pair (APR@ICSE), Melbourne, Australia, Piscataway, NJ, USA: IEEE ACM, 2022, pp. 167–178, doi: 10.1145/3524842.3528440.
Press, May 2023, pp. 11–18, doi: 10.1109/APR59189.2023.00009. [127] D. Erhabor, S. Udayashankar, M. Nagappan, and S. Al-Kiswany,
[106] B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce, “Fix- “Measuring the runtime performance of code produced with GitHub
ing hardware security bugs with large language models,” 2023, copilot,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.
arXiv:2302.01215. 2305.06439


[128] R. Wang, R. Cheng, D. Ford, and T. Zimmermann, “Investigating [147] S. Song, X. Li, and S. Li, “How to bridge the gap between modalities:
and designing for trust in AI-powered code generation tools,” 2023. A comprehensive survey on multimodal large language model,” 2023,
[Online]. Available: https://doi.org/10.48550/arXiv.2305.11248 arXiv:2311.07594.
[129] B. Yetistiren, I. Özsoy, M. Ayerdem, and E. Tüzün, “Evaluating [148] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, “Machine learning testing:
the code quality of AI-assisted code generation tools: An empirical Survey, landscapes and horizons,” IEEE Trans. Softw. Eng., vol. 48,
study on github copilot, amazon codewhisperer, and ChatGPT,” 2023. no. 2, pp. 1–36, Jan. 2022.
[Online]. Available: https://doi.org/10.48550/arXiv.2304.10778 [149] F. Tu, J. Zhu, Q. Zheng, and M. Zhou, “Be careful of when: An
[130] C. Wohlin, “Guidelines for snowballing in systematic literature studies empirical study on time-related misuse of issue tracking data,” in Proc.
and a replication in software engineering,” in Proc. 18th Int. Conf. Eval. ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng.
Assessment in Softw. Eng. (EASE), London, U.K., M. J. Shepperd, T. (ESEC/SIGSOFT FSE), Lake Buena Vista, FL, USA, G. T. Leavens,
Hall, and I. Myrtveit, Eds., New York, NY, USA: ACM, May 2014, A. Garcia, and C. S. Pasareanu, Eds., New York, NY, USA: ACM,
pp. 38: 1–38:10, doi: 10.1145/2601248.2601268. Nov. 2018, pp. 307–318, doi: 10.1145/3236024.3236054.
[131] A. Mastropaolo et al., “Studying the usage of text-to-text transfer [150] Z. Sun, L. Li, Y. Liu, X. Du, and L. Li, “On the importance of
transformer to support code-related tasks,” in Proc. 43rd IEEE/ACM building high-quality training datasets for neural code search,” in
Int. Conf. Softw. Eng. (ICSE), Madrid, Spain, Piscataway, NJ, USA: Proc. 44th IEEE/ACM Int. Conf. Softw. Eng. (ICSE), Pittsburgh, PA,
IEEE Press, May 2021, pp. 336–347. USA, Piscataway, NJ, USA: ACM, May 2022, pp. 1609–1620, doi:
[132] C. Tsigkanos, P. Rani, S. Müller, and T. Kehrer, “Large language 10.1145/3510003.3510160.
models: The next frontier for variable discovery within metamorphic [151] L. Shi et al., “ISPY: Automatic issue-solution pair extraction from
testing?” in Proc. IEEE Int. Conf. Softw. Anal., Evol. Reeng. (SANER), community live chats,” in Proc. 36th IEEE/ACM Int. Conf. Automated
Taipa, Macao, T. Zhang, X. Xia, and N. Novielli, Eds., Piscataway, Softw. Eng. (ASE), Melbourne, Australia, Piscataway, NJ, USA:
NJ, USA: IEEE Press, Mar. 2023, pp. 678–682, doi: 10.1109/SANER IEEE Press, Nov. 2021, pp. 142–154, doi: 10.1109/ASE51524.2021.
56733.2023.00070. 9678894.
[133] P. Farrell-Vinay, Manage Software Testing. New York, NY, USA: [152] D. Guo et al., “GraphCodeBERT: Pre-training code representations
Auerbach, 2008. with data flow,” in Proc. 9th Int. Conf. Learn. Representations
[134] A. Mili and F. Tchier, Software Testing: Concepts and Operations. (ICLR), Virtual Event, Austria, May 2021. [Online]. Available: https://
Hoboken, NJ, USA: Wiley, 2015. openreview.net/forum?id=jLoC4ez43PZ
[135] S. Lukasczyk and G. Fraser, “Pynguin: Automated unit test gener- [153] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “Lsun:
ation for python,” in Proc. 44th IEEE/ACM Int. Conf. Softw. Eng., Construction of a large-scale image dataset using deep learning with
(ICSE) Companion, Pittsburgh, PA, USA, ACM/IEEE Press, May 2022, humans in the loop,” 2015, arXiv:1506.03365.
pp. 168–172, doi: 10.1145/3510454.3516829. [154] “Loadrunner, Inc.” Accessed: Dec. 27, 2023. [Online]. Available:
[136] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The microfocus.com
oracle problem in software testing: A survey,” IEEE Trans. Softw. Eng., [155] “Langchain, Inc.” Accessed: Dec. 27, 2023. [Online]. Available: https://
vol. 41, no. 5, pp. 507–525, May 2015. docs.langchain.com/docs/
[137] C. Watson, M. Tufano, K. Moran, G. Bavota, and D. Poshyvanyk, [156] Prompt Engineering. “Prompt engineering guide.” GitHub. Accessed:
“On learning meaningful assert statements for unit test cases,” in Proc. Dec. 27, 2023. [Online]. Available: https://github.com/dair-ai/Prompt-
42nd Int. Conf. Softw. Eng. (ICSE), Seoul, South Korea, G. Rother- Engineering-Guide
mel and D. Bae, Eds., New York, NY, USA: ACM, Jun./Jul. 2020, [157] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola,
pp. 1398–1409. “Multimodal chain-of-thought reasoning in language models,” 2023,
[138] Y. He et al., “Textexerciser: Feedback-driven text input exercising for arXiv:2302.00923.
android applications,” in Proc. IEEE Symp. Secur. Privacy (SP), San [158] Z. Liu, X. Yu, Y. Fang, and X. Zhang, “Graphprompt: Unifying pre-
Francisco, CA, USA, Piscataway, NJ, USA: IEEE Press, May 2020, training and downstream tasks for graph neural networks,” in Proc.
pp. 1071–1087. ACM Web Conf. (WWW), Austin, TX, USA, Y. Ding, J. Tang, J. F.
[139] A. Wei, Y. Deng, C. Yang, and L. Zhang, “Free lunch for testing: Sequeda, L. Aroyo, C. Castillo, and G. Houben, Eds., New York, NY,
Fuzzing deep-learning libraries from open source,” in Proc. 44th USA: ACM, Apr./May 2023, pp. 417–428.
IEEE/ACM 44th Int. Conf. Softw. Eng. (ICSE), Pittsburgh, PA, USA, [159] Y. Charalambous, N. Tihanyi, R. Jain, Y. Sun, M. A. Ferrag, and L.
New York, NY, USA: ACM, May 2022, pp. 995–1007. C. Cordeiro, “A new era in software security: Towards self-healing
[140] D. Xie et al., “Docter: Documentation-guided fuzzing for testing deep software via large language models and formal verification,” 2023,
learning API functions,” in Proc. 31st ACM SIGSOFT Int. Symp. arXiv:2305.14752.
Softw. Testing Anal. (ISSTA), Virtual Event, South Korea, S. Ryu [160] S. Wang et al., “Machine/deep learning for software engineering: A
and Y. Smaragdakis, Eds., New York, NY, USA: ACM, Jul. 2022, systematic literature review,” IEEE Trans. Softw. Eng., vol. 49, no. 3,
pp. 176–188. pp. 1188–1231, Mar. 2023, doi: 10.1109/TSE.2022.3173346.
[141] Q. Guo et al., “Audee: Automated testing for deep learning frame- [161] Y. Yang, X. Xia, D. Lo, and J. C. Grundy, “A survey on deep learning
works,” in Proc. 35th IEEE/ACM Int. Conf. Automated Softw. Eng. for software engineering,” ACM Comput. Surv., vol. 54, no. 10s,
(ASE), Melbourne, Australia, Piscataway, NJ, USA: IEEE Press, Sep. pp. 206: 1–206:73, 2022, doi: 10.1145/3505243.
2020, pp. 486–498. [162] C. Watson, N. Cooper, D. Nader-Palacio, K. Moran, and D.
[142] Z. Wang, M. Yan, J. Chen, S. Liu, and D. Zhang, “Deep learning library Poshyvanyk, “A systematic literature review on the use of deep learning
testing via effective model generation,” in Proc. 28th ACM Joint Eur. in software engineering research,” ACM Trans. Softw. Eng. Methodol.,
Softw. Eng. Conf. Symp. Found. Softw. Eng. (ESEC/FSE), Virtual Event, vol. 31, no. 2, pp. 32:1–32:58, 2022, doi: 10.1145/3485275.
USA, P. Devanbu, M. B. Cohen, and T. Zimmermann, Eds., New York, [163] M. Bajammal, A. Stocco, D. Mazinanian, and A. Mesbah, “A survey
NY, USA: ACM, Nov. 2020, pp. 788–799. on the use of computer vision to improve software engineering tasks,”
[143] J. Jiang, Y. Xiong, H. Zhang, Q. Gao, and X. Chen, “Shaping program IEEE Trans. Softw. Eng., vol. 48, no. 5, pp. 1722–1742, May 2022,
repair space with existing patches and similar code,” in Proc. 27th doi: 10.1109/TSE.2020.3032986.
ACM SIGSOFT Int. Symp. Softw. Testing Anal., New York, NY, USA: [164] X. Hou et al., “Large language models for software engineering: A
ACM, 2018, pp. 298–309, doi: 10.1145/3213846.3213871. systematic literature review,” 2023. [Online]. Available: https://doi.
[144] M. Wen, J. Chen, R. Wu, D. Hao, and S.-C. Cheung, “Context-aware org/10.48550/arXiv.2308.10620
patch generation for better automated program repair,” in Proc. 40th [165] A. Fan et al., “Large language models for software engineering:
Int. Conf. Softw. Eng., New York, NY, USA: ACM, 2018, pp. 1–11, Survey and open problems,” 2023. [Online]. Available: https://doi.
doi: 10.1145/3180155.3180233. org/10.48550/arXiv.2310.03533
[145] Y. Xiong et al., “Precise condition synthesis for program repair,” [166] D. Zan et al., “Large language models meet NL2Code: A survey,”
in Proc. IEEE/ACM 39th Int. Conf. Softw. Eng. (ICSE), 2017, in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics (Vol.
pp. 416–426. 1, Long Papers) (ACL), Toronto, ON, Canada, A. Rogers, J. L.
[146] J. Xuan et al., “Nopol: Automatic repair of conditional statement bugs Boyd-Graber, and N. Okazaki, Eds., Association for Computational
in java programs,” IEEE Trans. Softw. Eng., vol. 43, no. 1, pp. 34–55, Linguistics, Jul. 2023, pp. 7443–7464, doi: 10.18653/v1/2023.acl-
Jan. 2017. long.411.


Junjie Wang (Member, IEEE) received the Ph.D. degree from ISCAS in 2015. She is a Research Professor with the Institute of Software, Chinese Academy of Sciences (ISCAS). She was a Visiting Scholar with North Carolina State University from September 2017 to September 2018, where she worked with Prof. Tim Menzies. Her research interests include AI for software engineering, software testing, and software analytics. She has more than 50 high-quality publications in venues including IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, ICSE, TOSEM, FSE, and ASE, five of which have received Distinguished/Best Paper Awards, including at ICSE 2019, ICSE 2020, and ICPC 2022. She is currently serving as an Associate Editor of IEEE TRANSACTIONS ON SOFTWARE ENGINEERING. For more information, see https://people.ucas.edu.cn/0058217?language=en

Yuchao Huang is a Doctoral Student with the Institute of Software, Chinese Academy of Sciences (ISCAS). His research interests include software engineering, mobile testing, and deep learning. He has published four papers at top-tier international software engineering conferences, including ICSE and FSE. Specifically, he applies AI and LLM technology to AI(LLM)-assisted automated mobile GUI bug replay.

Chunyang Chen received the bachelor's degree from BUPT, China, and the Ph.D. degree from NTU, Singapore. He is a Full Professor with the School of Computation, Information and Technology, Technical University of Munich, Germany. His main research interest is automated software engineering, especially data-driven mobile app development. He is also interested in human–computer interaction and software security. His research has won awards including the ACM SIGSOFT Early Career Researcher Award, the Facebook Research Award, four ACM SIGSOFT Distinguished Paper Awards (ICSE'23/21/20, ASE'18), and multiple best paper/demo awards. For more information, see https://chunyang-chen.github.io/

Zhe Liu received the Ph.D. degree from the University of Chinese Academy of Sciences in 2023. He is an Assistant Researcher with the Institute of Software, Chinese Academy of Sciences. His research interests include software engineering, mobile testing, deep learning, and human–computer interaction. He has published 15 papers at top-tier international software engineering conferences and journals, including IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, ICSE, CHI, and ASE. Specifically, he applies AI and lightweight program analysis technology in the following directions: AI(LLM)-assisted automated mobile GUI testing, usability, and bug replay; human–machine collaborative testing, including testing guidance for testers; and AI-empowered mining of software repositories, including issue report mining. He won 1st Place in the Graduate Category of the ACM Student Research Competition (SRC) 2023 Grand Finals. For more information, see https://zheliu6.github.io/

Song Wang (Member, IEEE) received dual B.E. degrees from Sichuan University, the master's degree from the Institute of Software, Chinese Academy of Sciences, and the Ph.D. degree from the University of Waterloo. He is an Assistant Professor with York University, Canada. He works at the intersection of software engineering and artificial intelligence. He has more than 50 high-quality publications in venues including IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, ICSE, TOSEM, FSE, and ASE, and is the recipient of four Distinguished/Best Paper Awards. He is currently serving as an Associate Editor of ACM Transactions on Software Engineering and Methodology (TOSEM). For more information, see https://www.eecs.yorku.ca/wangsong/index.html

Qing Wang (Member, IEEE) is a Research Professor with the Institute of Software, Chinese Academy of Sciences (ISCAS). She is also the Deputy Chief Engineer of ISCAS and the Director of the State Key Laboratory of Intelligent Game of ISCAS. Her research lies in the areas of software process, software quality assurance, and artificial intelligence for software engineering. She currently serves as a member of the International Software and Systems Processes Association (ISSPA), a member of the International Software Engineering Research Network (ISERN), a member of the Editorial Boards of the Information and Software Technology Journal (IST) and the Journal of Software Evolution and Process (JSEP), the Deputy Chair of the Software Quality and Testing Group in China National Information Technology Standardization (SAC/TC28/SC7/WG1), and a CMMI lead appraiser. She has edited/co-edited five books and published more than 100 papers in high-level conferences and journals.

