
Unveiling Large Language Models: Comparative Analysis in Software Defect Detection

Institute of Information and Communication Technology,
Shahjalal University of Science and Technology

Authors

Prottya Roy Chayan (2018831001)
Abid Al Mahmud (2018831071)

Supervisor

Partha Protim Paul
Lecturer
Institute of Information and Communication Technology,
Shahjalal University of Science and Technology
Recommendation Letter from Thesis Supervisor

The thesis entitled "Unveiling Large Language Models: Comparative Analysis in
Software Defect Detection" submitted by Prottya Roy Chayan (2018831001) and
Abid Al Mahmud (2018831071) on 12th February 2024 is under my supervision.

I, hereby, agree that the thesis can be submitted for examination.

Partha Protim Paul,


Lecturer,
Institute of Information and
Communication Technology,
Shahjalal University of Science and
Technology.

Certificate of Acceptance

The thesis entitled "Unveiling Large Language Models: Comparative Analysis in
Software Defect Detection" submitted by Prottya Roy Chayan (2018831001) and
Abid Al Mahmud (2018831071) on 12th February 2024 is, hereby, accepted as
the partial fulfillment of the requirements for the award of their Bachelor's Degrees.

Director, IICT: Dr. M. Jahirul Islam, PhD., PEng., Institute of Information and
Communication Technology

Chairman, Exam Committee: Dr. M. Jahirul Islam, PhD., PEng., Institute of
Information and Communication Technology

Supervisor: Partha Protim Paul, Lecturer, Institute of Information and
Communication Technology
Abstract

Large Language Models (LLMs) have exhibited remarkable prowess in addressing
diverse software engineering challenges. Despite their successes, the application of
LLMs to the domain of Fault Localization (FL), a critical task involving the iden-
tification of the code element responsible for a bug within a potentially extensive
codebase, remains an uncharted territory. This thesis investigates the unexplored
potential of leveraging LLMs for fault localization, presenting a novel technique that
requires only a single failing test. The fault localization process is augmented with
an explanation generation mechanism, providing insights into the reasons behind
the test failure. The methodology harnesses the function-calling API of OpenAI's
gpt-3.5-turbo-1106 language model. This enables
the model to ’explore’ vast source code repositories, overcoming the challenge posed
by the limited prompt length typical in LLM interactions. Through comprehen-
sive experiments conducted on the widely used Defects4J benchmark, our results
demonstrate that the proposed LLM-based fault localization technique consistently
outperforms existing standalone methods from prior research. Notably, it frequently
identifies the faulty method on the first attempt, showcasing the potential of lan-
guage model-based approaches in enhancing fault localization accuracy.

While our findings present promising advancements in the realm of LLM-based
fault localization, it is essential to acknowledge the existing room for performance
improvement. This work invites and encourages further experimentation and ex-
ploration of language model-based fault localization as an evolving and promising
research area within the broader field of software engineering. The insights gained
from this study contribute to the ongoing discourse on the potential applications
of LLMs in fault localization, setting the stage for future research endeavors and
advancements in this critical domain.

Keywords:
Large Language Model, Fault Localization, Failing Test, Function Call.

Table of Contents

Abstract iii

Table of Contents iv

List of Figures vi

List of Tables 1

1 Introduction 2
1.1 The Impact on APR . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivation for LLM Integration . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background 5
2.1 Exploration of FL Techniques . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Overview of Spectrum-Based FL (SBFL) . . . . . . . . . . . . 5
2.2 Large Language Models (LLM) . . . . . . . . . . . . . . . . . . . . . 7

3 Related Work 9
3.1 Fault Localization Techniques In Debugging . . . . . . . . . . . . . . 9
3.2 Integration of Large Language Models (LLMs) in Software Engineering 9

4 Methodology 11
4.1 Overview of Fault Localization Using LLM . . . . . . . . . . . . . . . 11
4.2 Step by Step Approach of Fault Localization . . . . . . . . . . . . . . 12
4.2.1 LLM Configuration . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2.2 FL Query Initiation . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.3 Knowledge Base Navigation for Failure Analysis . . . . . . . . 14
4.2.4 Receiving & Analysing Potential Faulty Lines . . . . . . . . . 14
4.2.5 Fault Localization Performance Evaluation . . . . . . . . . . . 15

5 Experimental Setup 16
5.1 LLM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.3 Baseline FL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

6 Experimental Results and Analysis 19
6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.2 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.2.1 Analysis of the Chart Program: . . . . . . . . . . . . . . . . . 20
6.2.2 Examination of the Lang Program: . . . . . . . . . . . . . . . 21
6.2.3 Insight into the Time Program: . . . . . . . . . . . . . . . . . 22
6.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.3.1 Dataset Size and Complexity . . . . . . . . . . . . . . . . . . 23
6.3.2 Dependency on Training Data . . . . . . . . . . . . . . . . . . 23
6.3.3 Need for Further Validation . . . . . . . . . . . . . . . . . . . 23

7 Conclusion 25

Bibliography 28

List of Figures

4.1 Workflow for LLM Based Fault Localization . . . . . . . . . . . . . . 12

6.1 Performance For Chart Program . . . . . . . . . . . . . . . . . . . . . 20


6.2 Performance For Lang Program . . . . . . . . . . . . . . . . . . . . . 21
6.3 Performance For Time Program . . . . . . . . . . . . . . . . . . . . . 22

List of Tables

5.1 Statistics of the Defects4J Dataset . . . . . . . . . . . . . . . . . . . 17

6.1 Comparison of SBFL and LLM for Different Programs . . . . . . . . 19

Chapter 1

Introduction

In the intricate landscape of software engineering, debugging continues to represent
a resource-intensive and meticulous endeavor. It is in this context that Automated
Program Repair (APR) emerges as a transformative strategy, promising to signifi-
cantly reduce debugging costs by autonomously identifying and rectifying software
faults. At the heart of APR, fault localization assumes a pivotal role, marking a cru-
cial phase in the intricate process of debugging software systems. It serves not only
as a foundational prerequisite for developers aiming to accurately rectify program
errors [13] but also as a linchpin for the successful implementation of automated
program repair (APR) [7]. The efficiency and precision of fault localization tech-
niques hold paramount significance, contributing significantly to the enhancement
of software repair and maintenance processes [11], [17].

The past two decades have witnessed a concerted focus on fault localization (FL),
captivating researchers who delve into innovative approaches to heighten the identi-
fication of faulty program elements. At its core, fault localization aims to precisely
pinpoint specific program elements responsible for failures during test scenarios.
Contemporary fault localization methods exhibit a rich tapestry of techniques, em-
bracing a spectrum from statistical analysis [11] to coverage analysis [4], [10], and
extending to the application of machine learning algorithms [1], [14]. Despite the
diversity of methodologies, a prevailing challenge surfaces in the guise of limited
adaptability. The efficacy of current fault localization techniques encounters restric-
tions imposed by the intricacies of programming languages, the quality of test cases,
or the complexities embedded within code structures. This constraint invariably
diminishes the overall effectiveness of these methodologies [6]. It is imperative to ac-
knowledge that existing fault localization often begets inaccuracies after exhaustive
and resource-intensive FL analysis. This realization prompts a critical examination
of the methodologies that underpin the fault localization domain, paving the way for
a nuanced understanding of their limitations and potential avenues for improvement.

1.1 The Impact on APR


The efficacy of Automated Program Repair (APR) is intricately tied to the selec-
tion of the Fault Localization (FL) technique employed. The effectiveness metric
serves as a yardstick for gauging the tool’s proficiency in rectifying faults within the
codebase. Concurrently, the performance aspect measures the temporal or compu-
tational steps required to pinpoint potential repairs, thereby influencing the overall
efficiency of the repair process. The critical dimension of repair correctness delves
into the assessment of how adeptly a potentially repaired program retains its in-
tended functionality.

It is imperative to recognize that the choice of FL techniques plays a pivotal role in
shaping the APR landscape. Ineffectual FL methodologies run the risk of misleading
the repair process. This misdirection can manifest in two distinct scenarios: firstly,
by overlooking pertinent faulty statements, thereby compromising the tool’s abil-
ity to address crucial issues; and secondly, by erroneously identifying an excessive
number of potentially faulty statements, which can introduce noise and ambiguity
into the repair process. Both scenarios bear significant implications for the efficacy
of APR, underscoring the importance of a judicious selection of FL techniques in
ensuring optimal repair outcomes.

1.2 Motivation for LLM Integration


In the realm of software engineering, the landscape of language models has under-
gone a transformative evolution. Language models, statistical constructs designed
to approximate the probability distribution of a language, have witnessed recent ad-
vancements in machine learning. Notably, trainable language models with a substan-
tial number of parameters, commonly referred to as large language models (LLMs),
exhibit emergent properties that elude smaller models [15]. These emergent prop-
erties include but are not limited to few-shot learning [9], commonsense reasoning
[21], and logical step following [16].

This recognition of emergent properties has propelled LLMs into the spotlight of
software engineering research, captivating attention for their applications in bug di-
agnosis [20], patch generation [22], [27], test generation [24], and bug reproduction
[23]. The comprehensive capabilities displayed by LLMs in handling various facets
of software engineering challenges have underscored their potential as versatile tools
in advancing the field.

The motivation to integrate LLMs, exemplified by the likes of GPT-3.5, into fault
localization stems from their distinctive capabilities when juxtaposed with tradi-
tional Fault Localization (FL) techniques. While conventional methods, notably
coverage-based ones, have demonstrated value, they may encounter challenges in
capturing the intricate linguistic and contextual aspects inherent in software code.
LLMs, renowned for their contextual understanding and semantic representation of
language, present a fresh perspective on fault localization. Their ability to consider
multifaceted linguistic cues enhances fault identification accuracy.

Moreover, the versatility and adaptability inherent in LLMs render them promis-
ing candidates for fault localization tasks, particularly in diverse codebases. The
transfer learning paradigm empowers LLMs to glean insights from varied language
contexts during pre-training and subsequently specialize in code-specific nuances
during fine-tuning. This adaptability proves advantageous in scenarios where code-
bases exhibit significant variations in style and structure. The amalgamation of
these traits positions LLMs as potentially transformative elements within the do-
main of fault localization, offering a sophisticated and context-aware approach to
identifying and addressing software faults.

1.3 Thesis Objectives


This research endeavor is fundamentally geared towards a meticulous evaluation and
comparative analysis of the efficacy exhibited by Large Language Models (LLMs)
when juxtaposed against the backdrop of well-established Fault Localization (FL)
techniques. The overarching objective lies in harnessing the intrinsic generative and
contextual prowess inherent in LLMs to discern and address potential constraints
identified within traditional FL approaches. Through a rigorous and systematic
process of experimentation and analytical scrutiny, the aspiration is to delve into
the nuanced dimensions of how LLMs can potentially transcend prevailing method-
ologies, promising advancements in terms of fault localization precision, operational
efficiency, and repair correctness.

1.4 Research Questions


Acknowledging the pivotal role Fault Localization (FL) plays in steering Automated
Program Repair (APR), this study embarks on a quest to scrutinize the intricacies
and repercussions of diverse FL techniques. The focal point of inquiry revolves
around elucidating the effectiveness, performance, and correctness of APR tools
when juxtaposed with conventional methodologies against an avant-garde approach
leveraging the prowess of Large Language Models (LLMs) such as GPT-3.5.

1. How does the effectiveness of traditional FL techniques, such as Ochiai and
DStar, compare to the LLM-based FL approach in guiding Automated Program
Repair (APR)?

2. Can the LLM-based FL approach potentially improve the accuracy of fault
localization compared to traditional FL techniques?
Chapter 2

Background

2.1 Exploration of FL Techniques


Fault localization is an indispensable process in the domain of software debugging,
crucial for identifying precise locations within a program’s source code where bugs
manifest. It serves as the bedrock of effective debugging strategies, allowing devel-
opers to strategically allocate their resources and efforts towards rectifying software
defects. Spectrum-Based Fault Localization (SBFL) stands out prominently among
the myriad techniques available, renowned for its sophisticated methodology that
facilitates the identification of potential fault locations with precision and efficiency.

2.1.1 Overview of Spectrum-Based FL (SBFL)


Spectrum-Based Fault Localization (SBFL) stands as a pivotal cornerstone in the
expansive domain of fault localization, founded upon the fundamental premise of
meticulously analyzing the dynamic behavior of a program during the execution of
tests. This intricate process involves a detailed comparison of the execution traces
between passing and failing tests, where SBFL methods engage in a diligent cal-
culation of suspiciousness scores for individual program statements. These scores,
functioning as critical indicators, serve to illuminate the likelihood of a statement
harboring a fault, providing valuable insights for further investigation and resolution.

Within the intricate domain of SBFL, a diverse array of heuristics comes into play,
ranging from foundational elementary coverage metrics to sophisticated algorithms.
These heuristics, intricately designed, contribute to a nuanced evaluation of sus-
piciousness scores by considering the intricate interplay of test outcomes and exe-
cution patterns. Each heuristic within SBFL, including but not limited to Ochiai,
DStar, and Tarantula, adds a layer of sophistication to the evaluation process. These
heuristics operate in harmony, leveraging dynamic execution information to compre-
hensively assess the likelihood of a statement containing a fault.

In essence, SBFL, positioned as a cornerstone in fault localization, employs a so-
phisticated methodology rooted in the dynamic analysis of a program during test
executions. The finesse with which SBFL navigates the intricacies of execution traces
and calculates suspiciousness scores, coupled with the diverse array of heuristics at
its disposal, solidifies its role as a vital instrument in identifying and addressing
faults within software systems.

Principles of SBFL:

1. Comparing Program Behavior: At the heart of SBFL lies the sophisticated
task of comparing the program's behavior between a passing execution and a
failing execution. This approach aims to discern the dynamic behavior of
program statements under varying test conditions.

2. Dynamic Information Collection: SBFL meticulously collects dynamic
information about the execution of each statement when the program under-
goes testing. This includes a comprehensive account of the number of passing
and failing tests in which a statement is executed.

3. Computing Suspiciousness Scores: The crux of SBFL lies in the com-
putation of suspiciousness scores for each program statement. These scores,
derived from dynamic information, reflect the likelihood of a statement con-
taining a fault. Statements that exhibit more frequent execution during failing
tests receive elevated suspiciousness scores.

Heuristics in SBFL
The domain of Spectrum-Based Fault Localization (SBFL) is characterized by its
intricate utilization of a diverse set of heuristics, each standing as a testament to
the nuanced complexity inherent in the assessment of suspiciousness scores for pro-
gram statements. These heuristics, akin to sophisticated evaluative instruments,
meticulously navigate the landscape of software code, intricately weighing the sig-
nificance of statements. Their evaluative prowess extends to a detailed analysis of
the execution patterns exhibited in both passing and failing tests, thereby contribut-
ing to a comprehensive understanding of the dynamic interplay within the SBFL
methodology. In essence, these heuristics serve as discerning guides, unraveling the
intricacies of program behavior to assign scores that elucidate the likelihood of a
statement harboring a fault.

Commonly Used Heuristics in SBFL

• Ochiai

  \[ \mathrm{Ochiai}(s) = \frac{\mathrm{Pos}(s)}{\sqrt{\mathrm{TotalPos} \cdot (\mathrm{Pos}(s) + \mathrm{Neg}(s))}} \]

  Ochiai measures the cosine similarity between the coverage vector of a statement
  and the vector of failing executions; the score grows with the number of failing
  tests that cover the statement and shrinks as more passing tests also cover it.

• DStar

  \[ \mathrm{DStar}(s) = \frac{\mathrm{Pos}(s)^{2}}{\mathrm{Neg}(s) + (\mathrm{TotalPos} - \mathrm{Pos}(s))} \]

  DStar (shown here with the commonly used exponent of 2) amplifies the weight
  of failing-test coverage in the numerator, while the denominator penalizes
  statements that are covered by passing tests or missed by failing tests.
In the domain of Spectrum-Based Fault Localization (SBFL), a set of crucial metrics
serves as the cornerstone for evaluating the dynamics of program statements during
testing and calculating the suspiciousness scores of different heuristics. These met-
rics, each with its distinctive role, contribute to the comprehensive understanding
of the fault localization process. Let’s delve into these key metrics:

• Pos(s): This metric signifies the number of failing tests that encompass the
program statement "s." Simply put, it quantifies the instances where the state-
ment "s" is executed during failed test executions.

• Neg(s): Representing the count of passing tests covering the program state-
ment "s," this metric offers insights into the successful executions where the
statement "s" is involved.

• TotalPos: Encompassing the entirety of failing tests in the test suite, TotalPos
provides a holistic view of all failing executions within the entire test suite.

• TotalNeg: Serving as its counterpart, TotalNeg represents the overall count of
passing tests in the test suite, encapsulating the sum of all successful executions
across the entire test suite.
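
To make these definitions concrete, the following minimal Python sketch computes
both suspiciousness scores from the four metrics above; the function signatures
and the toy numbers are illustrative assumptions, and DStar uses the common
exponent of 2.

import math

def ochiai(pos_s, neg_s, total_pos):
    # Ochiai(s) = Pos(s) / sqrt(TotalPos * (Pos(s) + Neg(s)))
    denom = math.sqrt(total_pos * (pos_s + neg_s))
    return pos_s / denom if denom else 0.0

def dstar(pos_s, neg_s, total_pos, star=2):
    # DStar(s) = Pos(s)**star / (Neg(s) + (TotalPos - Pos(s)))
    denom = neg_s + (total_pos - pos_s)
    return float("inf") if denom == 0 else pos_s ** star / denom

# Toy example: a statement covered by 4 of 5 failing tests and 1 passing test.
print(ochiai(pos_s=4, neg_s=1, total_pos=5))  # 4 / sqrt(5 * 5) = 0.8
print(dstar(pos_s=4, neg_s=1, total_pos=5))   # 16 / (1 + 1) = 8.0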

2.2 Large Language Models (LLM)


Large Language Models (LLMs) represent a transformative leap in the field of ar-
tificial intelligence, reshaping the landscape of natural language processing (NLP)
and machine learning. At their core, LLMs possess unparalleled capabilities, distin-
guished by their monumental scale, contextual understanding, proficiency in transfer
learning, and remarkable aptitude for code understanding and generation.

Unprecedented Scale and Training


At the core of Large Language Models (LLMs) resides their monumental scale and
vastness. These models undergo rigorous training on colossal datasets, frequently
spanning billions of sentences and trillions of words. This expansive training corpus
imparts to LLMs an extraordinary depth of understanding, encompassing intricate
linguistic nuances, syntactic structures, and semantic relationships. Noteworthy ex-
amples of LLMs that exemplify this scale and proficiency include OpenAI’s GPT-3,
Google’s BERT, and similar architectures. These models, far from merely setting
new standards, have ascended to the status of benchmarks in the sphere of language
comprehension.

Contextual Understanding
Setting themselves apart from their predecessors, Large Language Models (LLMs)
showcase an extraordinary proficiency in capturing nuanced contextual intricacies
within language. They go beyond simplistic keyword matching, immersing them-
selves in the complex interplay of words, phrases, and sentences. This elevated
level of contextual understanding bestows upon LLMs the capability to generate re-
sponses that are not only coherent but also profoundly contextually relevant. This
prowess renders them exceptionally skilled in a diverse array of tasks, spanning from
the intricate nuances of language translation to the sophisticated realm of intricate
question answering.

Transfer Learning Paradigm


A characteristic that stands out prominently in Large Language Models (LLMs) is
their notable proficiency in the domain of transfer learning. This involves a com-
prehensive two-step process, commencing with initial pre-training on an extensive
language dataset, followed by the meticulous fine-tuning on data specific to the task
at hand. Through this intricate process, LLMs develop the remarkable ability to
generalize knowledge across diverse domains. This adaptability emerges as a signif-
icant advantage, especially in scenarios where an intricate understanding of specific
contexts, such as the complexities inherent in software code, is deemed paramount
for optimal performance.

Code Understanding and Generation


Although originally conceptualized for tasks centered around natural language, Large
Language Models (LLMs) have showcased an extraordinary proficiency in not only
comprehending but also generating programming code. Their exceptional capabil-
ities encompass the discernment of intricate syntactic structures, the recognition
of diverse coding patterns, and the facilitation of suggestions for code completion.
This multifaceted competence has firmly established LLMs as indispensable tools
within the domain of software development. Their utility extends beyond the con-
ventional realm of mere syntax checking, delving into a profound understanding of
the intricate logic and semantics embedded in diverse code snippets.

Chapter 3

Related Work

3.1 Fault Localization Techniques In Debugging


Fault Localization is pivotal for developers as it allows for the precise identification
of faulty statements, methods, or components, thereby enabling them to allocate
debugging efforts judiciously and minimize resource expenditure. Particularly in
large codebases, where manual debugging is impractical, automated fault localiza-
tion techniques play a crucial role in swiftly pinpointing potential bug locations,
thus conserving valuable developer time [25].

Despite its significance, FL encounters challenges and limitations. Common FL tech-
nique families, such as Spectrum-based FL (SBFL), Information Retrieval-based FL
(IRFL), and Mutation-based FL (MBFL), face adaptability issues [6]. SBFL tech-
niques, though effective in isolation, struggle in large enterprise software due to the
computational costs incurred in coverage measurement [12].

Moreover, many FL techniques lack clear rationales or explanations in their out-
put, posing reliability and practicality challenges in real-world debugging scenarios.
Kochhar et al. emphasize the importance of a transparent rationale in FL for bug
fixing and incorporating practitioners’ domain knowledge [5]. Providing developers
with an understanding of why a particular location is identified as the bug’s culprit
facilitates informed decision-making during the fixing process and enables them to
assess the correctness of FL output based on their domain knowledge.

3.2 Integration of Large Language Models (LLMs) in Software Engineering

In recent years, the integration of Large Language Models (LLMs) in software en-
gineering has sparked significant interest and innovation. LLMs, such as OpenAI’s
GPT-3 and Google’s BERT, have revolutionized natural language processing and
demonstrated remarkable capabilities beyond traditional applications.

ReAct pioneered the integration of chain-of-thought prompting with LLMs, show-
casing improved performance on various tasks within software engineering [18]. By
leveraging the capabilities of LLMs, ReAct demonstrated enhanced productivity and
effectiveness in software-related tasks.

Similarly, HuggingGPT demonstrated LLMs' ability to compose computer vision
pipelines dynamically [26]. This innovative application of LLMs showcases their
versatility and potential for integrating with diverse tools and domains within soft-
ware engineering.

The advent of LLMs represents a paradigm shift in machine learning, particularly
in their unparalleled capacity to comprehend and generate text that closely mirrors
human language. As the field continues to evolve, the integration of LLMs into soft-
ware engineering promises to unlock new possibilities and enhance various aspects
of software development and maintenance.

Chapter 4

Methodology

4.1 Overview of Fault Localization Using LLM


In the quest to unravel the complex tapestry of software failures, our methodology
adopts a meticulous and iterative prompting process, leveraging the immense ca-
pabilities of Large Language Models (LLMs). This section intricately details the
methodology, offering a profound exploration of its core components and the step-
by-step approach employed.

Initiation of Inquiry:
The initiation phase serves as the bedrock of our methodology, involving the crafting
of a comprehensive and insightful prompt. This prompt is not merely a surface-level
introduction; rather, it intricately encapsulates vital information about the failing
test. This includes the test’s nomenclature, the encapsulating code snippet, and a
crucial detail—the specific line within the code where the test encountered failure.
This comprehensive initiation lays a robust foundation for a focused, informed ex-
ploration of software failures.

Controlled LLM Interaction:


Central to our methodology is a carefully designed framework for LLM interaction,
with a key focus on a dedicated function: get_code_snippet. This function em-
powers LLMs to extract intricate details about the method from a specific class,
ensuring a controlled and insightful exploration of potential fault locations. The
introduction of this function strategically guides LLMs in their analysis, enhancing
the depth and relevance of insights.

Iterative Exploration:
The inherent strength of the methodology lies in its iterative framework, character-
ized by a continuous cycle of inquiries and responses orchestrated by LLMs. This
dynamic process facilitates an evolving comprehension of the failure, with each inter-
action contributing to a refined understanding. The iterative nature of the approach
aligns with the adaptive capabilities inherent in LLMs, as they systematically narrow
down potential fault locations. Through a cumulative exploration of the codebase,
each interaction serves to build upon the insights gleaned from preceding steps, fos-
tering a progressively deeper and more nuanced analysis of the failure scenario.

Response Dynamics:
The culmination of LLM interaction yields responses that unveil potential faulty
lines or code blocks. These responses serve as dynamic artifacts, representing a sym-
biotic collaboration between human-guided inquiry and the language-understanding
capabilities of LLMs. Far beyond being mere outputs, these responses provide tan-
gible insights into the intricate interplay of code elements and their potential impact
on observed failures.

Token Constraints Management:


Acknowledging token limitations in free LLM models, the methodology adopts a ju-
dicious approach by allowing seven function calls. This decision balances between ex-
tracting meaningful information and staying within computational boundaries. The
pragmatic approach ensures a focused and effective interaction, optimizing available
computational resources without compromising the depth of exploration.

4.2 Step by Step Approach of Fault Localization

Figure 4.1: Workflow for LLM Based Fault Localization

4.2.1 LLM Configuration


Configuring the LLM with meticulous precision is of paramount importance, serv-
ing as the linchpin for the effectiveness of our LLM-based fault localization process.
This setup transforms the LLM into a dedicated debugging assistant, a crucial role
that demands careful attention to the nuances of the provided instruction set. A
system was implemented to process a code snippet upon initialization, identifying
and localizing defects within the provided code.

To configure the LLM model, a set of instructions was provided, delineating the
precise steps for the model to follow. These instructions served as the blueprint
for configuring the LLM, ensuring that it adhered to the specified parameters and
criteria essential for its optimal performance and functionality. The LLM model is
to be initialized with the following configuration instructions:

Now, you are going to work as a debugging assistant.
In your role as a debugging assistant, focus on failing tests
labeled "// error occurred here". Utilize the method
get_code_snippet to request code snippets and access the
system under test (SUT) source code. Your objective is to deliver a
step-by-step explanation of the bug, drawing insights from both the
failing test and information gathered through SUT tests. You have a
maximum of 7 chances to request functions for collecting relevant
information. Provide analysis only when certain about the fault's
reason and can specify the faulty lines. Your task is to identify
potential faulty lines contributing to the test failure.
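
As an illustration of this configuration step, the following sketch uses the openai
Python SDK to register the system prompt and declare the function the model may
call. The tool schema (parameter names class_name and method_name) is an
assumption about how get_code_snippet could be exposed, not the thesis's exact
implementation.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Declare get_code_snippet so the model can request method bodies from the SUT.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_code_snippet",
        "description": "Return the source code of a method in the system under test.",
        "parameters": {
            "type": "object",
            "properties": {
                "class_name": {"type": "string"},
                "method_name": {"type": "string"},
            },
            "required": ["class_name", "method_name"],
        },
    },
}]

SYSTEM_PROMPT = "Now, you are going to work as a debugging assistant. ..."  # the instruction set above

messages = [{"role": "system", "content": SYSTEM_PROMPT}]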

4.2.2 FL Query Initiation


The inception of the fault localization process is marked by the introduction of a
meticulously constructed prompt, designed to stand as the linchpin in the investi-
gation (step 1 in the workflow figure). This carefully structured prompt plays a pivotal role
in the unraveling of essential details surrounding the encountered failure, laying a
meticulous foundation that intricately paves the way for the exacting task of fault
identification. The LLM model intricately examines the code snippet in conjunction
with the provided prompt, skillfully identifying any inconsistencies within the code.
Utilizing this analysis, the LLM methodically constructs a thorough knowledge base,
essential for facilitating accurate fault localization.

Failing Test Information:


The initial step involves meticulously parsing the stack trace of the failed test case
to pinpoint the exact failing test class. Through this process, specific lines of code
implicated in the failure are carefully identified and isolated for further analysis.
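
A small sketch of this parsing step, assuming JUnit-style stack-trace frames; the
regular expression and helper name are illustrative, not the thesis's implementation.

import re

# Matches frames such as:
#   at org.jfree.chart.SomeTests.test2947660(SomeTests.java:153)
FRAME = re.compile(r"at ([\w.$]+)\.(\w+)\(([\w.]+\.java):(\d+)\)")

def failing_frame(stack_trace):
    # Return (class, method, source file, line) of the first matching frame.
    for line in stack_trace.splitlines():
        m = FRAME.search(line)
        if m:
            cls, method, src, lineno = m.groups()
            return cls, method, src, int(lineno)
    return None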

Code Snippet Presentation:


Following the identification of the failing test, we present an elaborate and detailed
Java code snippet. This isn’t merely a fragment of code but a carefully curated ex-
cerpt that encapsulates the entire test method. Every line of code is systematically
numbered, ensuring a precise and unambiguous reference for subsequent analysis.

Error Localization Markers:


At the heart of this code snippet, a crucial element emerges—the meticulous place-
ment of error localization markers. The comment "// error occurred here" serves
as a laser-focused indicator, pointing unambiguously to the exact line where the
failure manifests within the code. This strategic marker becomes a guiding beacon
for subsequent investigative steps.

Error Message Unveiling:


Beyond the code snippet, the prompt seamlessly integrates the error message asso-
ciated with the failure. This error message provides crucial insights into the nature
of the encountered issue, contributing to the comprehensive understanding of the
failure context. Presented herewith is an illustrative example of the prompt that
marks the commencement of the fault localization process.

@Test
public void test2947660() {
    AbstractCategoryItemRenderer r = new LineAndShapeRenderer();
    assertNotNull(r.getLegendItems());
    assertEquals(0, r.getLegendItems().getItemCount());

    DefaultCategoryDataset dataset = new DefaultCategoryDataset();
    CategoryPlot plot = new CategoryPlot();
    plot.setDataset(dataset);
    plot.setRenderer(r);
    assertEquals(0, r.getLegendItems().getItemCount());

    dataset.addValue(1.0, "S1", "C1");
    LegendItemCollection lic = r.getLegendItems();
    assertEquals(1, lic.getItemCount()); // error occurred here
    assertEquals("S1", lic.get(0).getLabel());
}

It failed with the following error message:

junit.framework.AssertionFailedError: expected:<1> but was:<0>

4.2.3 Knowledge Base Navigation for Failure Analysis


Upon receiving the detailed prompt outlining the error specifics (step 2 in the
workflow figure), the LLM implements a series of strategic techniques to meticulously
discern errors within the code snippet. Additionally, it systematically extracts the
methods embedded within the code snippet, thus culminating in the creation of a
robust knowledge base. The primary objective is to localize the cause of the failure
and systematically enumerate potential faulty lines (step 3 in the workflow figure).
To facilitate this exploration, the LLM leverages the get_code_snippet function, enabling the request
of code snippets from various class methods for in-depth analysis. It’s essential to
note that, considering the token limitations imposed on the gpt-3.5-turbo-1106
language model from OpenAI, we restrict the LLM to acquire code snippets a max-
imum of seven times. This limitation is in place to ensure efficient utilization of
the available tokens while systematically pinpointing the failure’s root cause and
identifying the potentially faulty lines in the codebase.
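
A minimal sketch of this bounded exploration loop, reusing the client, TOOLS, and
messages objects from the configuration step; lookup_method_source is a
hypothetical helper that retrieves a method body from the SUT sources.

import json

MAX_CALLS = 7  # the token-budget cap described above

for _ in range(MAX_CALLS):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
        tools=TOOLS,
    )
    reply = response.choices[0].message
    messages.append(reply)
    if not reply.tool_calls:   # the model produced its analysis; stop exploring
        break
    for call in reply.tool_calls:  # serve each requested snippet back to the model
        args = json.loads(call.function.arguments)
        snippet = lookup_method_source(args["class_name"], args["method_name"])
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": snippet,
        })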

4.2.4 Receiving & Analysing Potential Faulty Lines


The fault localization process, characterized by the LLM’s meticulous analysis to
unveil the cause of failure, transitions into a pivotal phase. Herein (step 4 in the
workflow figure), we delve into the reception and thorough examination of the potential faulty
lines identified by the LLM. This critical step signifies the culmination of our fault
localization endeavors, with the LLM, having traversed through code snippets and
extracted valuable insights, presenting a comprehensive list of potential lines that
might be causative factors for the encountered failure.

These identified potential faulty lines assume the role of invaluable indicators
(step 5 in the workflow figure), guiding further investigation and laying the foundation for the
resolution of issues within the software. Notably, to ensure heightened reliability
and robustness, we adopt an iterative refinement approach. The fault localization
process is systematically repeated for the same test failure, a minimum of three
times. This iterative strategy aims to enhance the consistency of outcomes, pro-
viding a nuanced and comprehensive understanding of potential faults. Such an
approach contributes significantly to the efficacy of the debugging process, fostering
an environment conducive to thorough fault analysis and resolution.
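
One way to realize this repetition is sketched below; the thesis specifies at least
three runs per failure, while the majority-vote threshold is an illustrative choice
of ours rather than a prescribed rule.

from collections import Counter

def stable_candidates(run_fl, failing_test, runs=3, min_votes=2):
    # run_fl is assumed to perform one full LLM fault-localization pass
    # and return the candidate faulty lines it reports.
    votes = Counter()
    for _ in range(runs):
        for line in run_fl(failing_test):
            votes[line] += 1
    # Keep only lines that recur across runs, for more consistent output.
    return [line for line, count in votes.items() if count >= min_votes]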

4.2.5 Fault Localization Performance Evaluation


The meticulous evaluation of fault localization accuracy serves as a critical dimen-
sion in determining the efficacy of fault localization techniques—a key focus within
the landscape of software engineering. In the context of our research, where preci-
sion and reliability are paramount, we employ the esteemed TOP-N metric as our
evaluative yardstick. This metric, widely embraced in the scholarly domain and
validated by prior studies such as [19], unfolds as a robust tool for quantifying the
performance of fault localization techniques.

The TOP-N metric functions as an illuminating gauge by numerically capturing
the prowess of a technique. It does so by delineating the count of faulty functions
housing erroneous statements, where the suspiciousness rank of these statements is
either less than or equal to N. This nuanced analysis underscores the technique’s
proficiency in delineating the exact locations of faulty statements—essentially, how
effectively it zeros in on the crux of software defects.
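
In code, the metric reduces to a simple count; the sketch below assumes each bug
is recorded with the (1-based) suspiciousness rank of its actual faulty element, or
None when the technique misses it entirely. The input shape and bug ids are
illustrative.

def top_n_count(ranks, n):
    # ranks: mapping from bug id to the rank of its real faulty element.
    return sum(1 for rank in ranks.values() if rank is not None and rank <= n)

# Toy example: three bugs, ranked 1, 4, and not found.
print(top_n_count({"Chart-1": 1, "Chart-2": 4, "Chart-3": None}, n=3))  # 1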

A pivotal characteristic of this metric is its responsiveness to varying values of


N, allowing us to explore fault localization effectiveness across different granulari-
ties. Higher TOP-N values act as beacons, signaling a more potent and nuanced
fault localization capability. As we delve into this evaluative journey, we align our-
selves with established practices in the field, ensuring a robust and comprehensive
assessment of our fault localization techniques’ performance.

Chapter 5

Experimental Setup

5.1 LLM Model


In our relentless pursuit of advancing fault localization techniques, our experimen-
tal endeavors found a formidable ally in the gpt-3.5-turbo-1106 language model
crafted by OpenAI. Renowned for its state-of-the-art capabilities, this Large Lan-
guage Model (LLM) stood at the forefront of our investigative framework, contribut-
ing to a nuanced and in-depth analysis of intricate details within the expansive land-
scape of software testing and debugging.

The gpt-3.5-turbo-1106 language model emerged as an indispensable and influen-
tial tool. Built on the GPT-3.5 family (whose GPT-3 predecessor contained 175
billion parameters; OpenAI has not disclosed the exact size of gpt-3.5-turbo), it
served as a testament to the remarkable strides made in the field of natural
language processing. The
model’s sophisticated architecture, deeply embedded with extensive language un-
derstanding, played a pivotal role in facilitating a comprehensive exploration of
various fault localization methodologies. This exploration not only pushed the con-
ventional boundaries of the model’s capabilities but also unearthed novel insights,
enriching the landscape of software engineering practices.

However, the dynamics of our interaction with the gpt-3.5-turbo-1106 model were
intricately shaped by its token limitations: the model caps each response at 4,096
output tokens within a 16,385-token context window. Each interaction demanded
meticulous consideration of the text's
length and complexity, creating a crucial dimension in our experimental design. The
constraint became particularly pivotal in our approach to crafting prompts, queries,
and instructions during fault localization experiments. Navigating within the con-
fines of this token limit, we strategically tailored prompts and interactions to extract
optimal insights, achieving a delicate and intentional balance between input rich-
ness and token constraints. This strategic interplay played an indispensable role
in shaping the effectiveness of our experimental approach, ensuring that the vast
and powerful capabilities of the gpt-3.5-turbo-1106 model were meticulously har-
nessed to their fullest extent.
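
In practice, such budgeting can be checked programmatically; a short sketch
assuming the tiktoken library, with the budget figures taken from the limits
discussed above:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fits_budget(prompt, budget=16385, reserve=4096):
    # Leave room for the model's response within the context window.
    return len(enc.encode(prompt)) <= budget - reserve

print(fits_budget("public void test2947660() { ... }"))  # True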

5.2 Dataset
Defects4J [2] serves as an indispensable and pioneering resource, occupying a cen-
tral role in the landscape of fault localization datasets. This reservoir of real-world
open-source programs is not merely a collection of code; rather, it is an expansive
repository enriched with authentic bugs that have unfolded across various histor-
ical versions. What elevates the significance of this dataset is not just the bugs
themselves but the meticulous curation of corresponding test cases [3], amplifying
its value for researchers and practitioners delving into the intricacies of fault local-
ization studies.

For the trajectory of experimentation in this study, our focal point is an explo-
ration of three expansive open-source Java projects meticulously cataloged within
Defects4J. These selected projects, namely JFreeChart (Chart), Apache Commons-
Lang (Lang), and Joda-Time (Time), represent a diverse tapestry of functionalities
and applications. The deliberate selection ensures a comprehensive exploration of
fault localization methodologies across a spectrum of scenarios.

The dataset unfurls a comprehensive panorama of the chosen projects, unraveling
nuanced insights into various facets, including the number of available versions, the
count of active bugs, and the average lines of code for each program. It’s crucial to
underscore the inclusivity of these projects, encompassing multiple program versions
while judiciously excluding those afflicted with compilation errors or segmentation
faults.

Moreover, the dataset’s meticulous consideration extends beyond just the bugs,
delving into the total number of bugs within the selected projects. This nuanced
approach contributes to a more profound and comprehensive understanding of the
fault landscape, laying the groundwork for extracting meaningful insights.

Table 5.1: Statistics of the Defects4J Dataset

Program   Number of Versions   Number of Active Bugs   Average Lines of Code
Chart             17                    26                     62,395
Lang              35                    64                     13,441
Time              15                    26                     20,357
Sum               67                   116                          -

The projects enshrined in this dataset exhibit variations in scale, complexity, and
code length, adding an extra layer of diversity to our experimental exploration. By
strategically leveraging the Defects4J dataset, our endeavors align not only with
widely recognized research practices [8] but also ensure the comparability of our re-
sults with a substantial body of existing literature. This deliberate and methodical
approach fosters a dependable evaluation of fault localization techniques, establish-
ing a robust foundation for deriving meaningful insights from the rich tapestry of
collected data.

5.3 Baseline FL
Within our investigation, Spectrum-Based Fault Localization (SBFL) takes a cen-
tral role as a debugging technique, utilizing execution traces from both passed and
failed test cases to gauge the likelihood of program statements being faulty. At its
core, SBFL operates on the premise that faulty statements are more prominently
covered by failed tests compared to their counterparts in passed tests. To quantify
this intuition, SBFL employs suspiciousness formulas, with our study exclusively
focusing on the Ochiai and DStar methods due to their established superior per-
formance in existing research.

In our experimental paradigm, we deliberately restricted our evaluation to the
Ochiai and DStar formulas as the baseline SBFL techniques. This intentional
focus allows for a meticulous comparison with our Language Model-based Fault Lo-
calization (LMFL) approach, shedding light on the nuanced distinctions between
conventional SBFL methods and our innovative LLM-based fault localization strat-
egy.

It is paramount to highlight that the intrinsic strength of Spectrum-Based Fault
Localization (SBFL) lies in its lightweight instrumentation, which enables the col-
lection of coverage traces without the need for intricate program semantics. This
characteristic streamlines the implementation of SBFL, contributing to its efficiency
in fault localization processes.

However, a critical caveat emerges in the form of its reliance on the quality of
the test suite. The practical efficacy of SBFL becomes intricately entwined with
how effectively the test cases execute the code. The performance of SBFL is contin-
gent on the test suite’s capability to not only cover the code comprehensively but,
more importantly, to unveil potential faults latent within the intricate fabric of the
software.

This interplay between lightweight instrumentation and test suite quality under-
scores the nuanced dynamics of SBFL, emphasizing the need for a robust testing
infrastructure to fully realize its potential in identifying and localizing faults within
software systems.

Chapter 6

Experimental Results and Analysis

6.1 Results
In the initial phase of our study, we undertook a comprehensive comparison between
existing fault localization techniques and our proposed approach based on Large
Language Models (LLMs). This comparative analysis aimed to gauge the effective-
ness and efficiency of fault localization methodologies in identifying and pinpointing
software defects. To facilitate this evaluation, we meticulously measured and ana-
lyzed the performance metrics of both traditional fault localization techniques and
our LLM-based approach across a diverse range of scenarios.

Table 6.1: Comparison of SBFL and LLM for Different Programs

Program   Top-N   SBFL Ochiai   SBFL DStar   LLM-Based
Chart     Top-1        3             4           11
          Top-2        8             8           13
          Top-3        9             8           14
          Top-4       11            12           16
          Top-5       13            13           18
Lang      Top-1        8             8           23
          Top-2       17            16           26
          Top-3       23            22           31
          Top-4       27            27           35
          Top-5       29            28           37
Time      Top-1        5             5            8
          Top-2        5             7            9
          Top-3       10            11           11
          Top-4       12            12           12
          Top-5       18            17           13

Central to our comparative analysis is the measurement of fault localization accuracy
using the widely adopted TOP-N metric. This metric serves as a robust benchmark
for assessing the efficacy of fault localization techniques by quantifying the number
of faulty functions within which faulty statements are correctly identified within the
top N-ranked suspicious statements. By employing the TOP-N metric, we aimed

to provide a nuanced understanding of the fault localization capabilities of both
traditional techniques and our novel LLM-based approach.

To present our findings comprehensively, we curated a detailed table encapsulat-
ing the results obtained from the evaluation of three Defects4J programs. This
table serves as a visual aid, allowing for a systematic comparison of fault localiza-
tion accuracy across different methodologies and scenarios. Through the meticulous
examination of these results, we aimed to shed light on the strengths and limita-
tions of existing fault localization techniques while showcasing the potential of our
LLM-based approach to enhance fault localization efficacy.

6.2 Comparative Analysis


6.2.1 Analysis of the Chart Program:
Upon an in-depth examination of the fault localization outcomes for the Chart
program, our analysis unveils a multifaceted performance landscape where both tra-
ditional SBFL methodologies, Ochiai and DStar, and the cutting-edge LLM-based
fault localization approach exhibit notable performance characteristics.

Figure 6.1: Performance For Chart Program

Each bar within the graph corresponds to a specific Top-N value, ranging from
1 to 5, denoting the number of top-ranked statements considered in the fault local-
ization process. The height of each bar signifies the corresponding fault localization
accuracy achieved by the respective methodologies at the given Top-N value.

A careful examination of the graph reveals distinct patterns and trends across the dif-
ferent methodologies. For instance, the bars representing the LLM-based approach
consistently exhibit greater heights compared to those of the SBFL techniques, in-
dicative of superior fault localization accuracy across varying Top-N scenarios.

Moreover, as we progress from lower to higher Top-N values, the disparity in fault
localization accuracy between the LLM-based approach and SBFL methodologies
becomes more pronounced. This trend underscores the LLM’s remarkable ability to
excel in scenarios requiring the identification and localization of faults within the
top-ranked statements.

6.2.2 Examination of the Lang Program:


In a manner akin to the discoveries observed within the Chart program dataset,
upon a comprehensive analysis of the Lang program dataset, our exploration reveals
a captivating narrative that underscores the efficacy and potency of deploying Large
Language Model (LLM)-based fault localization techniques. Within this narrative,
LLM emerges as the paramount contender in terms of fault localization accuracy and
effectiveness, showcasing its superiority when compared to conventional Spectrum-
Based Fault Localization (SBFL) methodologies, particularly Ochiai and DStar,
across diverse Top-N values.

Figure 6.2: Performance For Lang Program

The obtained results underscore the remarkable consistency exhibited by LLM-based
fault localization in localizing more faults at every TOP-N cutoff than its SBFL
counterparts. This consistent pattern underscores the robustness and reliability
of the LLM approach in accurately pinpointing faults within the program’s source
code. Moreover, these findings highlight the superior performance of LLM-based
fault localization, demonstrating its capability to outperform conventional SBFL
methodologies in terms of accuracy and effectiveness.

6.2.3 Insight into the Time Program:
In the context of the Time program, the analysis uncovers a nuanced performance
landscape characterized by a degree of variability in the efficacy of LLM-based fault
localization when compared to Spectrum-Based Fault Localization (SBFL) tech-
niques.

Figure 6.3: Performance For Time Program

Specifically, while LLM exhibits superior performance in terms of Top-1 and Top-2
values, the SBFL techniques draw level at Top-3 and Top-4 and pull ahead at Top-5.
This disparity indicates a divergence in effectiveness across different Top-N
scenarios, suggesting that the optimal choice between LLM and SBFL methods may
vary depending on the specific Top-N metric considered.

Moreover, this variability in performance sheds light on the multifaceted nature
of fault localization efficacy. It suggests that the success of LLM-based techniques
may be influenced by nuanced factors such as the inherent complexity and structural
intricacies of the program under analysis. Such factors could potentially impact the
ability of LLM to accurately identify and localize faults within the codebase, leading
to fluctuations in its comparative performance against traditional SBFL methodolo-
gies.

6.3 Limitations
While our study has provided valuable insights into the comparative effectiveness
of Large Language Models (LLMs) in fault localization when contrasted with tra-
ditional Spectrum-Based Fault Localization (SBFL) techniques, it is essential to
acknowledge several limitations that may influence the interpretation and general-
ization of our findings.

6.3.1 Dataset Size and Complexity
The dimensions of the analyzed programs, specifically in terms of their size and
complexity, introduce a potential factor that may impede the generalizability of our
findings. The inherent intricacies associated with larger or more complex codebases
could engender distinct fault localization dynamics, thereby introducing a level of
variability in the efficacy of Large Language Models (LLMs) for fault localization.
It is plausible that the performance characteristics observed in our study may not
seamlessly extend to codebases of greater magnitude or intricacy, necessitating cau-
tion in extrapolating our results to scenarios featuring more expansive and intricate
software architectures.

6.3.2 Dependency on Training Data


The efficacy of Large Language Models (LLMs) is intricately tied to the caliber
and representational adequacy of the training data upon which these models are
nurtured. Potential challenges, including but not limited to, biases embedded within
the training data and a potential inadequacy in encompassing the full spectrum of
programming languages, have the capacity to exert a discernible influence on the
LLM’s proficiency in precisely identifying and localizing faults. In the event of biases
present in the training data, the model may inadvertently learn and perpetuate
those biases, potentially compromising its objectivity and the fairness of its fault
localization outcomes. Moreover, a deficiency in adequately covering the diverse
linguistic nuances of various programming languages may limit the LLM’s ability to
comprehensively comprehend and analyze code, potentially resulting in suboptimal
fault localization accuracy. Hence, a meticulous consideration of the quality and
inclusiveness of the training data emerges as a pivotal factor in comprehending and
interpreting the fault localization capabilities of LLMs.

6.3.3 Need for Further Validation


Our research endeavors, encapsulated in this study, constitute an inaugural foray
into the burgeoning realm of leveraging Large Language Models (LLMs) for the in-
tricate task of fault localization. However, it is imperative to underscore that our
current exploration merely scratches the surface, representing a preliminary investi-
gation into the potential applications of LLMs in this domain. To fortify and fortify
the foundation laid by our study, a subsequent phase of rigorous validation is in-
dispensable. This validation should encompass a multifaceted approach, involving
the pursuit of additional case studies, the execution of meticulously designed ex-
periments, and the immersion of the proposed LLM-based approach in real-world
applications. Only through such comprehensive validation endeavors can we aspire
to unveil the full extent of the robustness and reliability inherent in our proposed
approach, thereby substantiating its efficacy across a diverse array of contextual
landscapes.

Through the explicit acknowledgment of these identified limitations, our primary
objective is to construct an intricate and transparent conceptual framework. This
framework is intended to serve as a guide for comprehending the delineated bound-
aries and potential challenges that arise in the application of Large Language Models
(LLMs) within the specific domain of fault localization. By openly articulating these
limitations, we aim to furnish a lucid and comprehensive understanding of the con-
straints and nuances that may influence the interpretability and generalizability of
our study’s results.

The meticulous recognition of these limitations functions as a deliberate effort to
contribute to the scholarly discourse by providing a detailed map of potential pit-
falls and constraints associated with the integration of LLMs in fault localization
methodologies. This transparency, in turn, sets the stage for future research endeav-
ors, offering valuable insights that can inform the design and execution of subsequent
investigations in this field.

Moreover, elucidating these considerations is intended to foster a nuanced
interpretation of the results presented in our study. By foregrounding the inherent
limitations, we encourage a discerning, context-aware evaluation of the findings,
enriching the intellectual discourse surrounding the application of LLMs in fault
localization. Ultimately, our aim is to catalyze a continuous and informed dialogue,
guiding future research trajectories and contributing to the advancement of
knowledge in this evolving domain.

Chapter 7

Conclusion

In summary, this thesis addresses two key research questions, comparing traditional
fault localization (FL) techniques such as Ochiai and DStar with the Large Language
Model-based FL (LLM-FL) approach, focusing specifically on guiding Automated Program
Repair (APR) and on improving fault localization accuracy.
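
For readers unfamiliar with the spectrum-based baselines, both Ochiai and DStar rank
code elements by a suspiciousness score computed from test coverage counts. The
sketch below shows the standard published formulas (using the common exponent of 2
for DStar); the function and variable names are illustrative, not drawn from our
implementation.

    import math

    def ochiai(ef, ep, total_failed):
        # ef: number of failing tests that execute the element
        # ep: number of passing tests that execute the element
        # total_failed: number of failing tests in the whole suite
        denom = math.sqrt(total_failed * (ef + ep))
        return ef / denom if denom > 0 else 0.0

    def dstar(ef, ep, total_failed, star=2):
        # D* rewards coverage by failing tests (ef ** star) and penalizes
        # coverage by passing tests and misses by failing tests.
        denom = ep + (total_failed - ef)
        return float("inf") if denom == 0 else (ef ** star) / denom

Elements are then sorted in descending order of suspiciousness before the Top-N
comparison reported below.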

For the first question, which examines performance in guiding APR, the LLM-FL
approach consistently outperforms traditional methods. On the Chart program at Top-1,
LLM-FL identified 11 faults, compared to 3 and 4 for SBFL Ochiai and DStar,
respectively. Similar trends were observed across programs, highlighting the
potential superiority of LLM-FL in identifying faults during APR.

Regarding the second question, LLM-FL consistently enhances fault localization
accuracy compared to SBFL Ochiai and DStar across the various Top-N scenarios. In
the Lang program at Top-3, LLM-FL identified 31 faults, surpassing SBFL Ochiai and
DStar, which identified 23 and 22 faults, respectively. These results underscore
the significant improvements in accuracy offered by the LLM-FL approach.
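
The Top-N counting used above can be made precise with a small sketch: a bug counts
as localized at Top-N if at least one genuinely faulty element appears among the N
highest-ranked candidates. The names ranked, faulty, and results are hypothetical
placeholders, not identifiers from our experimental scripts.

    def top_n_hit(ranked, faulty, n):
        # ranked: code elements sorted by descending suspiciousness
        # faulty: set of elements known to be buggy for this defect
        return any(elem in faulty for elem in ranked[:n])

    # The Top-N total for a benchmark is the number of bugs with a hit,
    # e.g. top_3 = sum(top_n_hit(r, f, 3) for r, f in results)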

In the broader context, this research sheds light on the evolving landscape of FL and
Automated Program Repair. While LLM-based approaches introduce innovation,
acknowledging limitations, such as computational costs, is crucial. Future research
can focus on optimizing the efficiency of LLM-FL methods while maintaining their
enhanced accuracy.

In conclusion, this thesis provides valuable insights into FL techniques and their
application in guiding APR. The findings, supported by experimental results, pave
the way for future research and practical implementations in software engineering.
Emphasizing the need for a nuanced understanding of traditional and emerging FL
approaches, this research encourages ongoing exploration and refinement of fault
localization methods for improved reliability in software maintenance.

Bibliography

[1] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest, “Automatically finding
patches using genetic programming,” in 2009 IEEE 31st International Con-
ference on Software Engineering, IEEE, 2009, pp. 364–374.
[2] R. Just, Defects4j, 2014. [Online]. Available: https://github.com/rjust/
defects4j.
[3] R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to
enable controlled testing studies for java programs,” in Proceedings of the 2014
international symposium on software testing and analysis, 2014, pp. 437–440.
[4] M. Papadakis and Y. Le Traon, “Metallaxis-fl: Mutation-based fault localiza-
tion,” Software Testing, Verification and Reliability, vol. 25, no. 5-7, pp. 605–
628, 2015.
[5] P. S. Kochhar, X. Xia, D. Lo, and S. Li, “Practitioners’ expectations on auto-
mated fault localization,” in Proceedings of the 25th international symposium
on software testing and analysis, 2016, pp. 165–176.
[6] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, “A survey on software
fault localization,” IEEE Transactions on Software Engineering, vol. 42, no. 8,
pp. 707–740, 2016.
[7] F. Y. Assiri and J. M. Bieman, “Fault localization for automated program
repair: Effectiveness, performance, repair correctness,” Software Quality Jour-
nal, vol. 25, pp. 171–199, 2017.
[8] D. Zou, J. Liang, Y. Xiong, M. D. Ernst, and L. Zhang, “An empirical study
of fault localization families and their combinations,” IEEE Transactions on
Software Engineering, vol. 47, no. 2, pp. 332–347, 2019.
[9] T. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learners,”
in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–
1901.
[10] Y.-H. Wu, Z. Li, Y. Liu, and X. Chen, “Fatoc: Bug isolation based multi-
fault localization by using optics clustering,” Journal of Computer Science
and Technology, vol. 35, pp. 979–998, 2020.
[11] T. A. Khan, A. Sullivan, and K. Wang, “Alloyfl: A fault localization framework
for alloy,” in Proceedings of the 29th ACM Joint Meeting on European Soft-
ware Engineering Conference and Symposium on the Foundations of Software
Engineering, 2021, pp. 1535–1539.
[12] T. Bach, A. Andrzejak, C. Seo, et al., “Testing very large database manage-
ment systems: The case of sap hana,” Datenbank-Spektrum, vol. 22, no. 3,
pp. 195–215, 2022.

[13] Y. Li, S. Wang, and T. N. Nguyen, “Fault localization to detect co-change
fixing locations,” in Proceedings of the 30th ACM Joint European Software
Engineering Conference and Symposium on the Foundations of Software En-
gineering, 2022, pp. 659–671.
[14] C. Ni, W. Wang, K. Yang, X. Xia, K. Liu, and D. Lo, “The best of both
worlds: Integrating semantic features with expert features for defect predic-
tion and localization,” in Proceedings of the 30th ACM Joint European Soft-
ware Engineering Conference and Symposium on the Foundations of Software
Engineering, 2022, pp. 672–683.
[15] J. Wei, Y. Tay, R. Bommasani, et al., “Emergent abilities of large language
models,” arXiv preprint arXiv:2206.07682, 2022.
[16] J. Wei, X. Wang, D. Schuurmans, et al., “Chain-of-thought prompting elicits
reasoning in large language models,” Advances in Neural Information Process-
ing Systems, vol. 35, pp. 24824–24837, 2022.
[17] Y. Wu, Y. Liu, W. Wang, Z. Li, X. Chen, and P. Doyle, “Theoretical analysis
and empirical study on the impact of coincidental correct test cases in multiple
fault localization,” IEEE Transactions on Reliability, vol. 71, no. 2, pp. 830–
849, 2022.
[18] S. Yao, J. Zhao, D. Yu, et al., “React: Synergizing reasoning and acting in
language models,” arXiv preprint arXiv:2210.03629, 2022.
[19] M. Zeng, Y. Wu, Z. Ye, Y. Xiong, X. Zhang, and L. Zhang, “Fault localization
via efficient probabilistic modeling of program semantics,” in Proceedings of
the 44th International Conference on Software Engineering, 2022, pp. 958–969.
[20] T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan,
“Recommending root-cause and mitigation steps for cloud incidents using large
language models,” arXiv preprint arXiv:2301.03797, 2023.
[21] A. Chowdhery, S. Narang, J. Devlin, et al., “Palm: Scaling language modeling
with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–
113, 2023.
[22] N. Jiang, K. Liu, T. Lutellier, and L. Tan, “Impact of code language models
on automated program repair,” arXiv preprint arXiv:2302.05020, 2023.
[23] S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Ex-
ploring llm-based general bug reproduction,” in 2023 IEEE/ACM 45th Inter-
national Conference on Software Engineering (ICSE), IEEE, 2023, pp. 2312–
2323.
[24] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa: Escaping cov-
erage plateaus in test generation with pre-trained large language models,” in
International conference on software engineering (ICSE), 2023.
[25] M. Motwani and Y. Brun, “Better automatic program repair by using bug
reports and tests together,” in 2023 IEEE/ACM 45th International Conference
on Software Engineering (ICSE), IEEE, 2023, pp. 1225–1237.
[26] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt:
Solving ai tasks with chatgpt and its friends in huggingface,” arXiv preprint
arXiv:2303.17580, 2023.

[27] C. Xia, Y. Wei, and L. Zhang, “Practical program repair in the era of large
pre-trained language models,” arXiv preprint arXiv:2210.14179, 2022.

