Unveiling Large Language Models: A Comparative Analysis in Software Defect Detection
Authors
Supervisor
Certificate of Acceptance
Abstract
Keywords: Large Language Model, Fault Localization, Failing Test, Function Call.
Table of Contents
Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 The Impact on APR
1.2 Motivation for LLM Integration
1.3 Thesis Objectives
1.4 Research Questions
2 Background
2.1 Exploration of FL Techniques
2.1.1 Overview of Spectrum-Based FL (SBFL)
2.2 Large Language Models (LLM)
3 Related Work
3.1 Fault Localization Techniques in Debugging
3.2 Integration of Large Language Models (LLMs) in Software Engineering
4 Methodology
4.1 Overview of Fault Localization Using LLM
4.2 Step-by-Step Approach of Fault Localization
4.2.1 LLM Configuration
4.2.2 FL Query Initiation
4.2.3 Knowledge Base Navigation for Failure Analysis
4.2.4 Receiving & Analysing Potential Faulty Lines
4.2.5 Fault Localization Performance Evaluation
5 Experimental Setup
5.1 LLM Model
5.2 Dataset
5.3 Baseline FL
6 Experimental Results and Analysis
6.1 Results
6.2 Comparative Analysis
6.2.1 Analysis of the Chart Program
6.2.2 Examination of the Lang Program
6.2.3 Insight into the Time Program
6.3 Limitations
6.3.1 Dataset Size and Complexity
6.3.2 Dependency on Training Data
6.3.3 Need for Further Validation
7 Conclusion
Bibliography
List of Figures
List of Tables
Chapter 1
Introduction
The past two decades have seen a concerted focus on fault localization (FL), with researchers exploring innovative approaches to improve the identification of faulty program elements. At its core, fault localization aims to pinpoint the specific program elements responsible for failures during test scenarios. Contemporary fault localization methods span a range of techniques, from statistical analysis [11] to coverage analysis [4], [10], and the application of machine learning algorithms [1], [14]. Despite this diversity of methodologies, a prevailing challenge is limited adaptability: the efficacy of current fault localization techniques is constrained by the intricacies of programming languages, the quality of test cases, and the complexities embedded within code structures. These constraints diminish the overall effectiveness of such methodologies [6]. It is also important to acknowledge that existing fault localization often produces inaccurate results even after exhaustive and resource-intensive FL analysis. This realization prompts a critical examination of the methodologies that underpin the fault localization domain, paving the way for a clearer understanding of their limitations and potential avenues for improvement.
Repair efficiency concerns the computational steps required to pinpoint potential repairs, thereby influencing the overall efficiency of the repair process. Repair correctness concerns how adeptly a potentially repaired program retains its intended functionality.
This recognition of emergent properties has propelled LLMs into the spotlight of software engineering research, drawing attention for their applications in bug diagnosis [20], patch generation [22], [27], test generation [24], and bug reproduction [23]. The comprehensive capabilities displayed by LLMs in handling these varied facets of software engineering challenges underscore their potential as versatile tools for advancing the field.
The motivation to integrate LLMs, exemplified by GPT-3.5, into fault localization stems from their distinctive capabilities compared with traditional Fault Localization (FL) techniques. While conventional methods, notably coverage-based ones, have demonstrated value, they can struggle to capture the intricate linguistic and contextual aspects inherent in software code. LLMs, renowned for their contextual understanding and semantic representation of language, present a fresh perspective on fault localization: their ability to consider multifaceted linguistic cues enhances fault identification accuracy.
Moreover, the versatility and adaptability inherent in LLMs render them promising candidates for fault localization tasks, particularly in diverse codebases. The transfer learning paradigm empowers LLMs to glean insights from varied language contexts during pre-training and subsequently specialize in code-specific nuances during fine-tuning. This adaptability proves advantageous in scenarios where codebases exhibit significant variations in style and structure. The amalgamation of these traits positions LLMs as potentially transformative elements within the domain of fault localization, offering a sophisticated and context-aware approach to identifying and addressing software faults.
Chapter 2
Background
Within the intricate domain of Spectrum-Based Fault Localization (SBFL), a diverse array of heuristics comes into play, ranging from elementary coverage metrics to sophisticated algorithms. These heuristics contribute to a nuanced evaluation of suspiciousness scores by considering the interplay of test outcomes and execution patterns. Each heuristic within SBFL, including but not limited to Ochiai, DStar, and Tarantula, adds a layer of sophistication to the evaluation process, leveraging dynamic execution information to assess the likelihood of a statement containing a fault.
SBFL uses these execution spectra to estimate the likelihood of faults within software systems.
Principles of SBFL:
Heuristics in SBFL
The domain of Spectrum-Based Fault Localization (SBFL) is characterized by its use of a diverse set of heuristics, each reflecting the nuanced complexity inherent in assessing suspiciousness scores for program statements. These heuristics weigh the significance of each statement by analyzing the execution patterns exhibited in passing and failing tests, contributing to a comprehensive understanding of the dynamic interplay within the SBFL methodology. In essence, they serve as discerning guides, unraveling the intricacies of program behavior to assign each statement a score that reflects the likelihood of it harboring a fault.
• Ochiai

\[
\mathrm{Ochiai}(s) = \frac{\mathrm{Pos}(s)}{\sqrt{\mathrm{TotalPos} \cdot \left(\mathrm{Pos}(s) + \mathrm{Neg}(s)\right)}}
\]

Ochiai measures the cosine of the angle between the statement's coverage vector and the vector of failing executions, considering both the number of failing and passing tests that cover the statement.

• DStar

\[
\mathrm{DStar}(s) = \frac{\mathrm{Pos}(s)^{2}}{\mathrm{Neg}(s) + \left(\mathrm{TotalPos} - \mathrm{Pos}(s)\right)}
\]

DStar, used here with the common exponent of 2, assigns higher suspiciousness to statements covered by many failing tests, while penalizing statements that are covered by many passing tests or missed by failing tests.
In the domain of Spectrum-Based Fault Localization (SBFL), a small set of crucial metrics serves as the cornerstone for evaluating the dynamics of program statements during testing and for calculating the suspiciousness scores of the different heuristics. Each metric plays a distinctive role in the fault localization process:

• Pos(s): the number of failing tests that cover the program statement s; that is, the count of failed test executions in which s is executed.

• Neg(s): the number of passing tests that cover the program statement s, capturing the successful executions in which s is involved.

• TotalPos: the total number of failing tests in the test suite.
Contextual Understanding

Setting themselves apart from their predecessors, Large Language Models (LLMs) show an extraordinary proficiency in capturing nuanced contextual intricacies within language. They go beyond simplistic keyword matching, modeling the complex interplay of words, phrases, and sentences. This elevated level of contextual understanding allows LLMs to generate responses that are not only coherent but also profoundly contextually relevant, making them exceptionally capable across a diverse array of tasks, from language translation to intricate question answering.
Chapter 3
Related Work
ReAct pioneered the integration of chain-of-thought reasoning with action taking in LLMs, showing improved performance on a variety of tasks [18]. By leveraging the capabilities of LLMs in this interactive manner, ReAct-style prompting has demonstrated enhanced productivity and effectiveness in software-related tasks.
Chapter 4
Methodology
Initiation of Inquiry:
The initiation phase serves as the bedrock of our methodology: crafting a comprehensive and informative prompt. This prompt is not merely a surface-level introduction; it encapsulates vital information about the failing test, including the test's name, the enclosing code snippet, and, crucially, the specific line within the code where the test failed. This comprehensive initiation lays a robust foundation for a focused, informed exploration of software failures.
Iterative Exploration:
The inherent strength of the methodology lies in its iterative framework, characterized by a continuous cycle of inquiries and responses orchestrated by LLMs. This dynamic process facilitates an evolving comprehension of the failure, with each interaction contributing to a refined understanding. The iterative nature of the approach aligns with the adaptive capabilities inherent in LLMs, as they systematically narrow down potential fault locations. Through a cumulative exploration of the codebase, each interaction builds upon the insights gleaned from preceding steps, fostering a progressively deeper and more nuanced analysis of the failure scenario.
Response Dynamics:
The culmination of LLM interaction yields responses that unveil potential faulty lines or code blocks. These responses serve as dynamic artifacts, representing a symbiotic collaboration between human-guided inquiry and the language-understanding capabilities of LLMs. Far beyond being mere outputs, these responses provide tangible insights into the intricate interplay of code elements and their potential impact on observed failures.
To configure the LLM, a set of instructions was provided, delineating the precise steps for the model to follow. These instructions served as the blueprint for its configuration, ensuring that it adhered to the parameters and criteria essential for optimal performance and functionality. The LLM is initialized with the following configuration instructions:
Now, you are going to work as a debugging assistant.

In your role as a debugging assistant, focus on failing tests
labeled "// error occurred here". Utilize the method
get_code_snippet to request code snippets and access the
system under test (SUT) source code. Your objective is to deliver a
step-by-step explanation of the bug, drawing insights from both the
failing test and information gathered through SUT tests. You have a
maximum of 7 chances to request functions for collecting relevant
information. Provide analysis only when you are certain about the
fault's reason and can specify the faulty lines. Your task is to
identify potential faulty lines contributing to the test failure.
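As a concrete illustration of how such a system prompt might be supplied to the model, the sketch below sends it to the OpenAI Chat Completions REST endpoint over plain HTTP from Java. This is a minimal sketch under our own assumptions: the request shape follows the public API, the API key is read from the OPENAI_API_KEY environment variable, the prompts are abbreviated and kept free of characters that would need JSON escaping, and the tool definition that exposes get_code_snippet to the model is omitted for brevity.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: send the debugging-assistant system prompt to the chat API.
public class LlmConfigSketch {
    public static void main(String[] args) throws Exception {
        // Abbreviated placeholders; a real run would insert the full
        // configuration instructions and the failing test source, JSON-escaped.
        String systemPrompt = "Now, you are going to work as a debugging assistant. ...";
        String userPrompt = "Failing test source, with the failing line marked, goes here.";

        // Chat Completions request body; the experiments would also attach a
        // tool definition for get_code_snippet so the model can request code.
        String body = """
                {
                  "model": "gpt-3.5-turbo-1106",
                  "messages": [
                    {"role": "system", "content": "%s"},
                    {"role": "user", "content": "%s"}
                  ]
                }
                """.formatted(systemPrompt, userPrompt);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON containing the model's reply
    }
}

In the actual workflow, the conversation then alternates between such requests and tool responses until the model commits to its faulty-line analysis.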
@Test
public void test2947660() {
    AbstractCategoryItemRenderer r = new LineAndShapeRenderer();
    assertNotNull(r.getLegendItems());
    assertEquals(0, r.getLegendItems().getItemCount());
    // ... remainder of the failing test omitted
}
(Step 5 in the workflow figure.) These identified potential faulty lines serve as valuable indicators, guiding further investigation and laying the foundation for resolving issues within the software. Notably, to improve reliability and robustness, we adopt an iterative refinement approach: the fault localization process is systematically repeated for the same test failure a minimum of three times. This iterative strategy enhances the consistency of outcomes and provides a more comprehensive understanding of potential faults, contributing significantly to the efficacy of the debugging process and fostering thorough fault analysis and resolution.
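The thesis does not fix a particular rule for combining the repeated runs, so the sketch below shows one plausible aggregation under our own naming: tally how often each line is reported across the (at least three) repetitions and rank lines by that frequency.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: aggregate faulty-line reports from repeated FL runs of one failure.
public class RunAggregator {
    public static void main(String[] args) {
        // Hypothetical output of three independent LLM-FL runs on the same failure.
        List<List<String>> runs = List.of(
                List.of("Renderer.java:1797", "Renderer.java:1785"),
                List.of("Renderer.java:1797"),
                List.of("Renderer.java:1797", "Plot.java:402"));

        // Count how many runs reported each candidate line.
        Map<String, Integer> votes = new TreeMap<>();
        for (List<String> run : runs)
            for (String line : run)
                votes.merge(line, 1, Integer::sum);

        // Rank lines by how consistently they were reported across runs.
        votes.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .forEach(e -> System.out.println(e.getKey() + " reported in "
                        + e.getValue() + "/" + runs.size() + " runs"));
    }
}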
Chapter 5
Experimental Setup
5.1 LLM Model
The dynamics of our interaction with the gpt-3.5-turbo-1106 model were shaped by its token limitations, with a maximum input limit of 4096 tokens. Each interaction demanded careful consideration of the text's length and complexity, a crucial dimension of our experimental design. The constraint was particularly important when crafting prompts, queries, and instructions for the fault localization experiments. Working within this token limit, we strategically tailored prompts and interactions to extract optimal insights, striking a deliberate balance between input richness and token constraints. This balance shaped the effectiveness of our experimental approach and ensured that the capabilities of the gpt-3.5-turbo-1106 model were harnessed to their fullest extent.
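To illustrate how a prompt can be kept within such a budget, the sketch below trims a code snippet to fit an approximate token allowance before it is sent. The four-characters-per-token ratio is a rough heuristic of ours, not the model's actual tokenizer, and fitToBudget is a hypothetical helper.

// Sketch: trim a code snippet so the full prompt stays under a token budget.
public class TokenBudget {
    // Rough heuristic: about 4 characters per token for English text and code.
    private static final int CHARS_PER_TOKEN = 4;

    static String fitToBudget(String instructions, String snippet, int maxTokens) {
        int budgetChars = maxTokens * CHARS_PER_TOKEN - instructions.length();
        if (budgetChars <= 0)
            throw new IllegalArgumentException("Instructions alone exceed the budget");
        if (snippet.length() <= budgetChars)
            return instructions + "\n" + snippet;
        // Keep the head of the snippet and mark the truncation explicitly.
        return instructions + "\n" + snippet.substring(0, budgetChars)
                + "\n// ... snippet truncated to fit the token limit";
    }

    public static void main(String[] args) {
        String prompt = fitToBudget("Explain the failing test below.",
                "public void test2947660() { /* ... */ }".repeat(200), 4096);
        System.out.println(prompt.length() / CHARS_PER_TOKEN + " tokens (approx.)");
    }
}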
5.2 Dataset
Defects4J [2] serves as an indispensable resource, occupying a central role in the landscape of fault localization datasets. It is more than a collection of code: it is an expansive repository of real-world open-source programs enriched with authentic bugs drawn from various historical versions, together with meticulously curated corresponding test cases [3], which amplifies its value for researchers and practitioners studying fault localization.
For the experiments in this study, we focus on three expansive open-source Java projects cataloged within Defects4J: JFreeChart (Chart), Apache Commons-Lang (Lang), and Joda-Time (Time). These projects represent a diverse range of functionalities and applications, and their deliberate selection ensures a comprehensive exploration of fault localization methodologies across a spectrum of scenarios.
Moreover, we consider not only individual bugs but also the total number of bugs within the selected projects. This contributes to a more comprehensive understanding of the fault landscape and lays the groundwork for extracting meaningful insights.
The projects in this dataset vary in scale, complexity, and code length, adding a further layer of diversity to our experimental exploration. By leveraging the Defects4J dataset, our work aligns with widely recognized research practices [8] and ensures the comparability of our results with a substantial body of existing literature. This deliberate and methodical approach fosters a dependable evaluation of fault localization techniques, establishing a robust foundation for deriving meaningful insights from the collected data.
5.3 Baseline FL
Within our investigation, Spectrum-Based Fault Localization (SBFL) takes a central role as a debugging technique, utilizing execution traces from both passed and failed test cases to gauge the likelihood of program statements being faulty. At its core, SBFL operates on the premise that faulty statements are more prominently covered by failed tests than by passed tests. To quantify this intuition, SBFL employs suspiciousness formulas; our study focuses exclusively on the Ochiai and DStar methods due to their established superior performance in existing research.

However, a critical caveat is SBFL's reliance on the quality of the test suite. Its practical efficacy is intricately entwined with how effectively the test cases exercise the code: SBFL's performance is contingent on the test suite's capability not only to cover the code comprehensively but, more importantly, to expose the faults latent within the software.

This interplay between lightweight instrumentation and test suite quality underscores the nuanced dynamics of SBFL, emphasizing the need for a robust testing infrastructure to fully realize its potential in identifying and localizing faults within software systems.
Chapter 6
Experimental Results and Analysis
6.1 Results
In the initial phase of our study, we undertook a comprehensive comparison between existing fault localization techniques and our proposed approach based on Large Language Models (LLMs). This comparative analysis aimed to gauge the effectiveness and efficiency of fault localization methodologies in identifying and pinpointing software defects. To facilitate this evaluation, we meticulously measured and analyzed the performance metrics of both traditional fault localization techniques and our LLM-based approach across a diverse range of scenarios.
The results are visualized as a grouped bar chart to provide a nuanced understanding of the fault localization capabilities of both traditional techniques and our novel LLM-based approach.
Each bar within the graph corresponds to a specific Top-N value, ranging from 1 to 5, denoting the number of top-ranked statements considered in the fault localization process. The height of each bar signifies the corresponding fault localization accuracy achieved by the respective methodologies at the given Top-N value.
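For reference, a Top-N hit can be computed as sketched below: a bug counts as localized at Top-N if any ground-truth faulty line appears among the N highest-ranked statements. The data structures and names are ours, for illustration only.

import java.util.List;
import java.util.Set;

// Sketch: decide whether a bug is localized within the top-N ranked statements.
public class TopNAccuracy {
    // True if any ground-truth faulty line appears in the first n ranked lines.
    static boolean hitAtTopN(List<String> ranked, Set<String> faulty, int n) {
        return ranked.stream().limit(n).anyMatch(faulty::contains);
    }

    public static void main(String[] args) {
        // One hypothetical bug: ranked suspicious lines and the ground truth.
        List<String> ranked = List.of("A.java:10", "B.java:22", "A.java:57");
        Set<String> faulty = Set.of("A.java:57");
        for (int n = 1; n <= 5; n++)
            System.out.println("Top-" + n + ": " + hitAtTopN(ranked, faulty, n));
    }
}

Counting such hits over all bugs in a project yields the per-project Top-N totals reported in our results.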
A careful examination of the graph reveals distinct patterns and trends across the different methodologies. For instance, the bars representing the LLM-based approach consistently exhibit greater heights compared to those of the SBFL techniques, indicative of superior fault localization accuracy across varying Top-N scenarios.
Moreover, as we progress from lower to higher Top-N values, the disparity in fault
localization accuracy between the LLM-based approach and SBFL methodologies
becomes more pronounced. This trend underscores the LLM’s remarkable ability to
excel in scenarios requiring the identification and localization of faults within the
top-ranked statements.
6.2.3 Insight into the Time Program:
In the context of the Time program, the analysis uncovers a nuanced performance landscape characterized by a degree of variability in the efficacy of LLM-based fault localization when compared to Spectrum-Based Fault Localization (SBFL) techniques.

Specifically, while LLM exhibits superior performance in terms of Top-1 and Top-2 values, the SBFL techniques demonstrate an advantage at Top-3, Top-4, and Top-5. This disparity indicates a divergence in effectiveness across different Top-N scenarios, suggesting that the optimal choice between LLM and SBFL methods may vary depending on the specific Top-N metric considered.
6.3 Limitations
While our study has provided valuable insights into the comparative effectiveness of Large Language Models (LLMs) in fault localization when contrasted with traditional Spectrum-Based Fault Localization (SBFL) techniques, it is essential to acknowledge several limitations that may influence the interpretation and generalization of our findings.
6.3.1 Dataset Size and Complexity
The dimensions of the analyzed programs, specifically in terms of their size and
complexity, introduce a potential factor that may impede the generalizability of our
findings. The inherent intricacies associated with larger or more complex codebases
could engender distinct fault localization dynamics, thereby introducing a level of
variability in the efficacy of Large Language Models (LLMs) for fault localization.
It is plausible that the performance characteristics observed in our study may not
seamlessly extend to codebases of greater magnitude or intricacy, necessitating cau-
tion in extrapolating our results to scenarios featuring more expansive and intricate
software architectures.
By explicitly acknowledging these limitations, we aim to construct a transparent conceptual framework for understanding the boundaries and potential challenges that arise in applying Large Language Models (LLMs) to fault localization, and to provide a clear account of the constraints and nuances that may influence the interpretability and generalizability of our study's results.
Chapter 7
Conclusion
In summary, this thesis addresses two key research questions, comparing traditional fault localization (FL) techniques like Ochiai and DStar with the Language Model-based FL (LLM-FL) approach, focusing specifically on guiding Automated Program Repair (APR) and improving fault localization accuracy.

For the first question, examining performance in guiding APR, the LLM-FL approach consistently outperforms traditional methods. In the Chart program at Top-1, LLM-FL identified 11 faults compared to 3 and 4 by SBFL Ochiai and DStar, respectively. Similar trends were observed across programs, highlighting the potential superiority of LLM-FL in identifying faults during APR.

In the broader context, this research sheds light on the evolving landscape of FL and Automated Program Repair. While LLM-based approaches introduce innovation, acknowledging limitations, such as computational costs, is crucial. Future research can focus on optimizing the efficiency of LLM-FL methods while maintaining their enhanced accuracy.

In conclusion, this thesis provides valuable insights into FL techniques and their application in guiding APR. The findings, supported by experimental results, pave the way for future research and practical implementations in software engineering. Emphasizing the need for a nuanced understanding of traditional and emerging FL approaches, this research encourages ongoing exploration and refinement of fault localization methods for improved reliability in software maintenance.
Bibliography
[13] Y. Li, S. Wang, and T. N. Nguyen, "Fault localization to detect co-change fixing locations," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 659–671.

[14] C. Ni, W. Wang, K. Yang, X. Xia, K. Liu, and D. Lo, "The best of both worlds: Integrating semantic features with expert features for defect prediction and localization," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 672–683.

[15] J. Wei, Y. Tay, R. Bommasani, et al., "Emergent abilities of large language models," arXiv preprint arXiv:2206.07682, 2022.

[16] J. Wei, X. Wang, D. Schuurmans, et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.

[17] Y. Wu, Y. Liu, W. Wang, Z. Li, X. Chen, and P. Doyle, "Theoretical analysis and empirical study on the impact of coincidental correct test cases in multiple fault localization," IEEE Transactions on Reliability, vol. 71, no. 2, pp. 830–849, 2022.

[18] S. Yao, J. Zhao, D. Yu, et al., "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.

[19] M. Zeng, Y. Wu, Z. Ye, Y. Xiong, X. Zhang, and L. Zhang, "Fault localization via efficient probabilistic modeling of program semantics," in Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 958–969.

[20] T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan, "Recommending root-cause and mitigation steps for cloud incidents using large language models," arXiv preprint arXiv:2301.03797, 2023.

[21] A. Chowdhery, S. Narang, J. Devlin, et al., "PaLM: Scaling language modeling with pathways," Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.

[22] N. Jiang, K. Liu, T. Lutellier, and L. Tan, "Impact of code language models on automated program repair," arXiv preprint arXiv:2302.05020, 2023.

[23] S. Kang, J. Yoon, and S. Yoo, "Large language models are few-shot testers: Exploring LLM-based general bug reproduction," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), IEEE, 2023, pp. 2312–2323.

[24] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, "CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models," in International Conference on Software Engineering (ICSE), 2023.

[25] M. Motwani and Y. Brun, "Better automatic program repair by using bug reports and tests together," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), IEEE, 2023, pp. 1225–1237.

[26] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, "HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face," arXiv preprint arXiv:2303.17580, 2023.
[27] C. Xia, Y. Wei, and L. Zhang, "Practical program repair in the era of large pre-trained language models," arXiv preprint arXiv:2210.14179, 2022.