WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models
1 Introduction
Modern compilers [9, 24, 39, 66, 72, 78] play a critical role in translating high-level programming
languages into efficient machine code. However, incorrect or misapplied optimizations can lead
to subtle and hard-to-detect bugs, and even vulnerabilities [15, 90, 94]. For instance, compiler
misoptimizations have led to severe security vulnerabilities, e.g., system hanging, memory errors,
and information leaks, in the Linux kernel [90] and safety-critical deep learning (DL) applica-
tions/systems [1, 15, 63, 71]. Given the ubiquity of compilers in software development, it is vital
to ensure the correctness of compiler optimizations. To this end, fuzzing (or fuzz testing) [76, 97]
has been applied to automatically generate a large number of test inputs, aiming to explore com-
piler defects [74]. To date, a large body of fuzzing tools has been tailored for different languages
and compilers [21, 46, 51, 93], highlighting their success by finding a large number of real-world
compiler bugs.
In the literature, researchers have proposed various fuzzing techniques to incorporate the
knowledge about the system under test (SUT) during test generation [11, 21]. They are generally
classified into three categories according to the extent of SUT knowledge visible to the fuzzer:
black-box [46, 51, 93], grey-box [11, 21, 45], and white-box fuzzing [37, 69]. Black-box fuzzing has
zero knowledge about the internal workings of the SUT and simply considers system input/interface
information. Consequently, the inputs are generated without complying with the intended structures
or behaviors of the SUT. In contrast, white-box fuzzing techniques, by inspecting the source code
of the SUT, aim to synthesize test cases to exhaustively explore all possible code paths. Grey-box
fuzzing lies between black- and white-box fuzzing. By leveraging limited program information of
the SUT (e.g., code coverage), grey-box fuzzers attempt to efficiently produce tests that are more
likely to exercise new program behaviors. While in theory we can apply any of these approaches to
compiler fuzzing, each grapples with its own challenges and limitations owing to the immense
complexity/scale of modern compiler systems. For instance, the widely-used LLVM [39] compiler is
implemented with 14M lines of code, and the popular DL compiler, TensorFlow [78], has 3.5M lines.
Challenges. Black-box fuzzing, without knowledge of the internal workings, struggles due to the
intricate conditions required to trigger optimizations. Simply generating random inputs, without
any guidance, often proves impractical for reaching the deep corners of optimization logic. For
example, a recent study [11] shows that the black-box fuzzer for C compilers, Csmith [93], which
produces test cases through grammar-based generation, can be significantly less effective than
coverage-guided fuzzing. Grey-box fuzzing, though better informed by source code instrumentation
to achieve higher code coverage than its black-box counterpart, frequently falls short of fully
understanding the nuanced criteria required to trigger particular optimizations. This shortcoming
stems from the fact that compiler optimizations typically hinge on meeting precise and strict
conditions. Vanilla coverage-driven strategies might not navigate these specific states effectively.
Moreover, grey-box compiler fuzzers [11, 21] often fail to generate even semantically correct inputs,
leading to the discovery of mostly front-end crash bugs. On the other hand, traditional white-box
fuzzing, which relies on strict analysis of the SUT source code, becomes daunting with modern
compilers. The sheer complexity of modern optimization techniques, combined with the vast
landscape of programming paradigms and hardware targets, makes modeling all behaviors an
uphill task. For instance, symbolic execution [37] executes a program by using symbolic variables
in place of concrete values, enabling the systematic exploration of every potential execution path.
However, when applied to compiler systems, it becomes infeasible to designate every variable as
symbolic. Even if such a feat were achievable, the million-line scale of compiler codebase inevitably
leads to path explosion, rendering the approach highly challenging.
Furthermore, traditional compiler fuzzing techniques are typically tailored to specific lan-
guages/compilers. Yet, designing and implementing a fuzzing framework for a new compiler
is both time-intensive and laborious. For instance, Csmith [93] comprises tens of thousands
of lines of code accumulated over years of development. Given the unique characteristics of each target
language/compiler, reusing the efforts of one fuzzing implementation for a different input lan-
guage/compiler often presents significant challenges.
Motivation. Figure 1 presents a motivating example of the optimization permute_linear_fusion in
PyTorch Inductor [66]. This optimization fuses the permute and linear operators when the permute
method is invoked on an input tensor with more than two dimensions, specifically swapping the
last two dimensions. On the left side of the figure, we see its source code implementation. Here, the
constraints required to trigger this optimization are explicitly detailed with nested if condition
statements and a helper function check_permute. However, when applying fuzzing techniques to
test this optimization, black/grey-box fuzzing struggles to generate models that align the permute
and linear operators with these constraints. For instance, consider a scenario where a grey-box
compiler fuzzer produces a model with the linear operator and covers the first if-check (Line 3-6,
Figure 1) in this optimization. Even if a black/grey-box fuzzer repeatedly selects this test as a seed
for mutation, it is challenging to successfully mutate the model to invoke the permute method on a
tensor—specifically, to swap its last two dimensions—where the output should then serve as the
input for linear. This is because both black-box and grey-box fuzzing are unaware that the models
should include the permute and linear operators in this specific way, due to the absence of guidance
from the source code implementation. As there are thousands of operators in PyTorch, such fuzzers
will likely choose a different operator than permute or apply permute in many other ways. As a
result, the generated models often fail validation checks and cannot activate this optimization,
let alone discover deep bugs in it (Line 16, Figure 1). On the other hand, though the white-box
techniques have the potential to trigger this optimization theoretically, it is impractical to apply
traditional program analysis to extract constraints from the detailed source code due to the data
structure complexity in compilers, e.g., torch.fx.GraphModule and torch.fx.Node. These structures
are further composed of several other intricate classes with diverse attributes (e.g., args, shape, and
rank). Additionally, the intricate constraints are often expressed in hierarchical conditions (e.g.,
nested if statements) and even complex check functions. Therefore, it is extremely challenging,
if not impossible, to accurately extract and express these constraints symbolically for constraint
solvers, let alone apply any formal method to solve them.
Insight. Can we scale white-box fuzzing to fully test optimizations for any compiler? We address
this question based on the insight that Large Language Models (LLMs) [8, 22, 43, 60, 61, 87, 95] are
pre-trained on a vast array of code spanning various programming languages. This broad foundation
enables them to excel in comprehending and generating code across diverse languages [6]. Therefore,
for the permute_linear_fusion optimization, different from typical white-box fuzzing, we can
leverage LLMs to summarize the requirements for the models that could trigger it based on the
source code information, as highlighted in the yellow text box of Figure 1. Subsequently, we can
leverage the generated requirement description to further prompt LLMs to create the corresponding
inputs, which are PyTorch models in this case. In our experiments, the generated tests indeed
triggered the permute_linear_fusion optimization and even detected a previously unknown bug in
it! Notably, this bug was confirmed by developers and labeled as high-priority.
Proposal. We present WhiteFox, the first white-box compiler fuzzing approach via Large Language
Models (LLMs) to fully test the core optimization modules in DL compilers, which represent the
fastest-evolving segment in the field of compilers. As discussed, existing approaches to white-box
testing cannot scale to model the behavioral information of complex compiler systems. Therefore,
the key idea of WhiteFox is to leverage LLMs to automatically infer the requirements of test
programs that can trigger the compiler optimizations based on their source code implementation.
LLMs, having been pre-trained on an extensive corpus comprising natural languages and a variety of
programming languages, possess the ability to comprehend and succinctly summarize optimization
source code. The input to WhiteFox is the source code that implements compiler optimizations.
First, an LLM-based analysis agent automatically analyzes and summarizes the testing requirements
for triggering optimizations. Subsequently, an LLM-based generation agent produces numerous
meaningful test programs guided by the generated requirements. To generate test programs that
can directly exercise corresponding optimizations, WhiteFox further employs a feedback-loop
mechanism that uses optimization-triggering tests as few-shot examples to guide future test
generation.
Summary. This work makes the following contributions:
• Novelty. We introduce a new dimension of white-box compiler fuzzing by using LLMs as both
optimization source code analyzers and test input generators. To the best of our knowledge, this work
is the first to demonstrate that LLMs can transform the low-level implementation information
into the corresponding high-level test programs, making it practical to employ the white-box
fuzzing techniques to test complex DL compilers. Furthermore, beyond DL compiler testing,
WhiteFox can also be adapted for white-box fuzzing of other compilers, and even complex,
real-world software systems in general, inspiring future work in this promising direction.
• Approach. While our approach is general and applicable to various compiler systems, we
implement WhiteFox as a practical fuzzer for the three most popular DL compilers, PyTorch
Inductor, TensorFlow-XLA, and TensorFlow Lite. We utilize GPT4 [61] as an analysis agent to
summarize the requirements based on the source code, and StarCoder [43] as a generation agent
to create diverse test inputs. Our artifact is available at https://github.com/ise-uiuc/WhiteFox.
fuzzers [7, 11, 40, 48]. Besides the path explosion issue due to the compiler-scale complexity, another
key limitation is that hybrid fuzzing typically operates on binary inputs, overlooking the nuanced
semantics present in source code, which is crucial for compiler testing.
3 Design
Figure 2 depicts the overview of WhiteFox, consisting of three main components: Requirement
Summarization, Test Generation, and Feedback Loop. Overall, WhiteFox takes the source code of
optimization passes from the tested compiler as inputs and generates high-quality test programs
via LLMs. To that end, during the Requirement Summarization phase, an analysis LLM is used to
summarize the requirements of the test programs to trigger the optimization by examining its
source-code implementation (§ 3.1). Next, the analyzed requirements are used to prompt a generation
LLM, which synthesizes test programs that specifically exercise the corresponding optimizations (§ 3.2).
The generated test programs are then compiled and executed, and WhiteFox observes whether
they can indeed activate the corresponding optimizations via instrumentation.

[Fig. 2. Overview of WhiteFox. The analysis LLM summarizes the low-level optimization implementation (e.g., the matchAndRewrite pass for TFL::ConcatenationOp in TensorFlow Lite, which checks that all operands come from the same unpack op) into an NL description and pseudo code (x1 = tf.unstack(x); x2 = tf.concat(x1, ...)); the generation LLM then produces test programs such as a tf.keras.Model whose call invokes tf.unstack followed by tf.concat, and triggering tests feed back into later generation.]

If a test program
triggers any optimization, it will be incorporated into WhiteFox’s feedback loop as an example to
guide the generation LLM towards more optimization-targeted generation in subsequent iterations
(§ 3.3). To detect compiler bugs, every test program is executed with test oracles including result
inconsistency, as well as compile- and execution-time crash (§ 3.4).
Notably, WhiteFox employs a multi-agent framework: (i) an LLM-based analysis agent is
prompted to infer the conditions for triggering optimizations by inspecting the implementation
code; and (ii) an LLM-based generation agent is prompted to create a large number of meaningful
test programs. This design allows us to balance the trade-off between the costs and benefits that
different LLMs provide. For example, we can let the analysis LLM be one with broad knowledge
and reasoning ability (in both natural language and code) and let the generation LLM be one that is
specialized for efficient program generation.
and is closer to the structure of tests that we want to generate ultimately. Additionally, we map
the low-level implementation to a high-level summarization using natural language or pseudo
code resembling user code. For instance, TFL::ConcatenationOp is a low-level TensorFlow Lite IR
used in the implementation source code of TensorFlow Lite optimizations (in Figure 2), which
corresponds to the high-level TensorFlow public API tf.concat, and semantically means joining
data from multiple input tensors. In our natural language description/pseudo code, instead of
using the low-level TFL::ConcatenationOp, we directly use “concatenation operator”/tf.concat to
summarize.
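As a sketch of this mapping, the high-level TensorFlow program corresponding to the low-level unpack/concatenation pattern from Figure 2 might look as follows (the axes and the input value are illustrative assumptions):

```python
import tensorflow as tf

class Model(tf.keras.Model):
    def call(self, x):
        # High-level counterparts of the low-level TFL ops: unstack (Unpack)
        # followed by concat (Concatenation) over the unpacked pieces.
        parts = tf.unstack(x, axis=0)
        return tf.concat(parts, axis=0)

model = Model()
y = model(tf.constant([[1.0, 2.0], [3.0, 4.0]]))
```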
While both natural language and pseudo-code can help shorten the context and remove low-
level information to avoid confusing the LLM, each has its own unique strengths. For example,
the permute_linear_fusion optimization from PyTorch Inductor requires that “the tensor method
permute is invoked first, and then the torch.nn.functional.linear function is invoked on the permuted
tensor”. Such a NL description is not as straightforward as the pseudo-code format for this case.
Specifically, “the tensor method permute” could be simply represented as input_tensor.permute(...)
in pseudo-code. However, for the description “it swaps the last two dimensions of this tensor”, it
is pretty challenging to leverage pseudo-code to describe it clearly and briefly. To combine and
maximize their strengths, we adopt a hybrid format that blends NL and pseudo-code to describe
the requirements for triggering optimization, rather than relying solely on either format. This
mixed format provides the analysis LLM with greater flexibility to utilize NL and/or pseudo-code
as needed for each component of requirements, resulting in a more expressive and higher-quality
summarization. Our experimental findings (§ 6.2.1) also support that the mix of NL and pseudo-code
achieves the best performance in this task.
Despite all these aforementioned benefits of a high-quality optimization summary, it is worth
noting that the requirement summarization is a very comprehensive and challenging task, even for
domain experts. First, it requires understanding the logic of the implementation code and rephrasing
it with semantic-preserving natural language or pseudo-code. Second, the mapping from low-level
implementation code to high-level information necessitates a broad background knowledge of
the corresponding programming language and compiler. This ranges from understanding general
technical terminology (e.g., “variable arguments” and “tail calls”), to in-depth domain-specific
knowledge (e.g., from low-level LLVM IRs to high-level C++ grammar). As demonstrated in our
evaluation (§ 6.2.1), the most powerful LLM to date (namely GPT4) is capable of performing this
challenging analysis process (while the current open-source models, such as StarCoder, still lag
far behind). This capability stems from its extensive training on vast datasets, enabling it to gain
a broad knowledge and implicit understanding of various programming languages and systems.
Additionally, its proficiency in performing these domain-specific analyses in our work likely results
from its training on the source code of these open-source compiler systems.
Therefore, WhiteFox first leverages the analysis LLM to infer the requirement of high-level
inputs that could trigger the optimization, utilizing its implementation written in the low-level
source code. More specifically, for each optimization, we use few-shot in-context learning [5] to
prompt the analysis LLM to generate its trigger requirements for the inputs in the mixed format of
NL and pseudo-code. Figure 3(a) presents the general few-shot prompt template used to summarize
requirements for optimizations in target compilers. This prompt template starts with the instruction
“Please describe the [TARGET INPUT] that can trigger the [OPTIMIZATION NAME] optimization...”, where
[TARGET INPUT] is the input format specific to the target compiler. Then it is followed by the source
code of the optimization implementation and concludes with the description of requirements that
the input should fulfill. Note that the description is in the mixed format of NL and pseudo-code,
consisting of [PSEUDO CODE] and [NL DESCRIPTION]. The target optimization has the same structure
as the few-shot examples, but its requirements field is left empty, awaiting generation by the LLM.
Fig. 3. (a) Prompt template for requirement summarization. (b) Requirement summarization prompt in PyTorch Inductor.
Take the prompt of requirement summarization in PyTorch Inductor as an example, whose few-shot
prompt is shown in Figure 3(b). The expected input format for PyTorch Inductor is a PyTorch model;
therefore, [TARGET INPUT] is populated with “PyTorch model”. Following this, the source code for
the example optimization, permute_linear_fusion, is provided. Finally, example descriptions are
given in the mix of pseudo-code and NL formats to outline the constraints necessary to trigger the
example optimization. Note that such few-shot examples can guide the analysis LLM to generate
desired output formats. For instance, the example description (emphasized with a yellow box in
Figure 3(b)) provides the LLM with a clearer illustration of expected formats. Furthermore, they
facilitate the learning process of analysis LLM by providing the example mappings from low-level
optimization implementation to high-level input program requirements.
More formally, let P𝐴 be the analysis LLM that models the probability of token sequences.
It takes the following types of information as inputs: (i) 𝐼𝑖 , the instruction on summarizing an
optimization 𝑂𝑖 ; (ii) 𝐶𝑖 , the source code implementation of 𝑂𝑖 ; (iii) 𝑅𝑖 , the summarized trigger
pattern or requirement of the optimization 𝑂𝑖 . Let 𝐸𝐾 be the few-shot prompt prefix consisting
of 𝐾 example optimizations: 𝐸𝐾 = (𝐼1, 𝐶1, 𝑅1) ◦ (𝐼2, 𝐶2, 𝑅2) ◦ · · · ◦ (𝐼𝐾, 𝐶𝐾, 𝑅𝐾). Let 𝑂𝑡 be the target
optimization. The probability distribution of the generated requirement 𝑅𝑡 can be defined as
P𝐴(𝑅𝑡 | 𝐸𝐾, 𝐼𝑡, 𝐶𝑡).
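As a simplified sketch of how such a few-shot prompt can be assembled programmatically (the template wording, function name, and data layout are our own illustrations, not the exact prompt text used by WhiteFox):

```python
def build_summarization_prompt(examples, target_name, target_code,
                               target_input="PyTorch model"):
    """Assemble E_K followed by (I_t, C_t); the requirement R_t is left for the LLM."""
    instruction = ("Please describe the " + target_input +
                   " that can trigger the {name} optimization.")
    parts = []
    for name, code, requirement in examples:      # few-shot (I_i, C_i, R_i) triples
        parts += [instruction.format(name=name), code,
                  "# Description\n" + requirement]
    # Target optimization: same structure, with its requirement left empty.
    parts += [instruction.format(name=target_name), target_code, "# Description"]
    return "\n\n".join(parts)

prompt = build_summarization_prompt(
    examples=[("permute_linear_fusion", "<source code>", "<requirement>")],
    target_name="permute_matmul_fusion", target_code="<source code>")
```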
[Fig. 4. (a) Prompt template for test generation; (b) test generation prompt in PyTorch Inductor. The description states that the model should contain the pattern t1 = input_tensor.permute(...); t2 = torch.nn.functional.linear(t1, ...), characterizing scenarios where the permute method is invoked on an input tensor with more than 2 dimensions and swaps its last two dimensions; the example triggering model's forward computes v1 = x1.permute(0, 2, 1) and v2 = F.linear(v1, self.linear.weight, self.linear.bias).]
For the target optimization, the test input is to be generated by the LLM. Figure 4(b) shows the test generation prompt used
in PyTorch Inductor. The format of test input to the PyTorch Inductor is a PyTorch model ([TARGET
INPUT]) with public PyTorch APIs ([INPUT SPECIFICATION]). Next, we specify the requirements
for the test inputs to activate the optimization permute_linear_fusion. This is complemented by
an illustrative model that can trigger this optimization. The provided few-shot examples aid
LLM in generating the test in the desired format, such as a torch.nn.Module composing public
PyTorch APIs. Furthermore, the example could help the LLM learn the relationship between the
requirement description of the optimization and the corresponding test input that can trigger it. In
Figure 4(b), the model input highlighted in a blue box (# Model) contains the code x1.permute(0,
2, 1). This corresponds to the requirement: "The permute method is invoked on an input tensor with
more than 2 dimensions, and it swaps the last two dimensions of this tensor". Such examples elucidate
for the LLM: (i) the meaning of "permute method is invoked on an input tensor", which should
not be torch.permute(x1, ...), and (ii) the implication of "swaps the last two dimensions of this
tensor", which is permute(0, 2, 1) for a tensor with three dimensions.
Again, let P𝐺 be the generation LLM that models the probability of token sequences. Recall
that 𝐼 represents the summarization instruction and 𝑅 denotes the requirement for triggering
an optimization. Let 𝐸𝐾 be the few-shot prompt prefix consisting of 𝐾 example optimizations:
𝐸𝐾 = (𝐼1, 𝑅1, 𝑇1) ◦ (𝐼2, 𝑅2, 𝑇2) ◦ · · · ◦ (𝐼𝐾, 𝑅𝐾, 𝑇𝐾), where 𝑇𝑖 is a valid test input that triggers the
optimization 𝑂𝑖. Let 𝑂𝑡 be the target optimization. The probability distribution of the generated
test 𝑇𝑡 can be defined as P𝐺(𝑇𝑡 | 𝐸𝐾, 𝐼𝑡, 𝑅𝑡).
When generated tests successfully trigger the corresponding optimization, we collect them as candidates for few-shot examples for future test generation prompts.
By incorporating such successful triggering tests into the prompt, we provide targeted guidance
to the generation LLM, enabling it to produce more inputs that trigger the target optimization.
Based on this inspiration, WhiteFox incorporates tests that have successfully triggered the corresponding optimization as supplementary examples during iterative test generation. The target compiler is instrumented to record the triggered optimizations for each test input. If an optimization has been triggered, WhiteFox selects from the collected triggering tests to form the few-shot examples (defaulting to 3). These example triggering inputs will be plugged into the prompt shown in Figure 5, along with the instruction and requirement summary of the target optimization (same instruction and summary as the previous prompt), and will be used to generate the next batch of test inputs. This feedback design aids the LLM in generating tests that have a higher likelihood of triggering the targeted pattern, which is demonstrated in our ablation study (§ 6.2.2). In the case where there are no triggering inputs for a particular optimization, WhiteFox continues to use the initial prompt (Figure 4) in subsequent iterations until it finds an input capable of triggering that optimization.

[Fig. 5. Prompting for test generation with feedback. The prompt begins with the instruction "Please generate different valid [TARGET INPUT] example with [INPUT SPECIFICATION] meets the specified requirements.", repeats the requirement description ("The [TARGET INPUT] should contain the following pattern: ..."), includes the selected triggering examples, and ends with the target input to be generated.]
To further investigate this, we observe that not all triggering examples are equally effective in
guiding the LLM to generate new valuable triggering tests. One useful signal for assessing their
effectiveness is the triggering rate of the newly generated tests when we employ them as few-
shot examples. To effectively select triggering examples with an evolving knowledge of example
effectiveness, it is crucial to find a balance between exploration and exploitation. Exploration is
critical because it not only allows us to evaluate under-explored options but also helps to produce
a diverse set of tests for fuzzing. On the other hand, some level of exploitation is desirable, as it
enables us to fully harness the potential of effective examples. To address this, WhiteFox adopts
an (adapted) Multi-Armed Bandit (MAB) algorithm, Thompson Sampling [81], as the selection
strategy for triggering examples to balance the exploiting and exploration trade-off. Each triggering
example is conceptualized as an arm in the MAB framework. The main assumption here is that each
triggering example is associated with a probability representing the triggering rate, which quantifies
the proportion of generated tests capable of triggering the optimization when utilizing the given
example. During the fuzzing loop, our objective is to estimate the probability associated with each
triggering example, with the aim of using the most effective example to achieve more valuable
triggering tests. More specifically, following the classical Thompson Sampling algorithm [81], when
we do not have any prior information about an arm, we choose the standard beta distribution [54] 𝐵(1, 1)
(or, equivalently, Uniform(0, 1)) as its prior distribution. The beta distribution is parameterized
by two shape parameters 𝛼 > 0 and 𝛽 > 0, which represent the numbers of successes and failures in
historical trials. The probability density function of the beta distribution can be formally written as
follows:

𝑓𝛼,𝛽(𝑥) = 𝑥^(𝛼−1) (1 − 𝑥)^(𝛽−1) / 𝐵(𝛼, 𝛽)
where 𝐵(𝛼, 𝛽) is a constant for normalization. After drawing a new sample and observing the
reward (in our case, 1 if a generated test successfully triggers the targeted optimization and 0
otherwise), the posterior probability can be conveniently updated by increasing 𝛼 or 𝛽 by one,
depending on whether the sample was a success or failure. More formally, if our prior belief about
𝑋 is represented by 𝑓𝛼,𝛽 , the posterior distribution of 𝑋 will be updated to 𝑓𝛼+1,𝛽 or 𝑓𝛼,𝛽+1 after we
observe 𝑋 = 1 or 𝑋 = 0.
Algorithm 1 shows the example test selection process. Firstly, we sample 𝜃 𝑡 from each of the pos-
terior distributions (Line 2-3). Subsequently, we opt for top-𝑁 arms with the highest sampled value,
which are the chosen example tests (ExampleTests) in this iteration (Line 4). Unlike conventional
scenarios where a single arm is chosen, we simultaneously select multiple test examples at a time
to construct a single few-shot prompt for generating a batch of new tests. Consequently, when we
observe the number of triggering tests among all newly generated tests, we use this information to
update the posterior of each of ExampleTests (Line 7-9). We next initialize the new trigger tests
(NewTriggerTests) using the mean values of 𝛼 and 𝛽 of the distribution of the ExampleTests (Line 10-
14), to reduce the overhead of re-exploring the distribution of NewTriggerTests from scratch. This
comes from our assumption that the effectiveness of the new test is highly correlated with those
few-shot examples, as the new test is generated by the LLM based on those specific examples,
potentially inheriting valuable code patterns from such “parent” examples [5]. In the end, we update
the pool of existing trigger tests with newly found tests (Line 15).
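A minimal sketch of this adapted selection-and-update step (the class and function names are our own; the real Algorithm 1 additionally interleaves this with the fuzzing loop):

```python
import random

class Arm:
    """One triggering example, with a Beta(alpha, beta) posterior over its triggering rate."""
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta        # Beta(1, 1) == Uniform(0, 1) prior

def select_examples(arms, n=3):
    # Sample theta from each posterior and keep the top-N arms (Lines 2-4).
    sampled = [(random.betavariate(a.alpha, a.beta), a) for a in arms]
    return [a for _, a in sorted(sampled, key=lambda p: p[0], reverse=True)[:n]]

def update(chosen, num_triggering, num_generated):
    # Reward the chosen examples with the observed successes/failures (Lines 7-9).
    for arm in chosen:
        arm.alpha += num_triggering
        arm.beta += num_generated - num_triggering
    # New triggering tests inherit the mean alpha/beta of their "parents" (Lines 10-14).
    mean_a = sum(a.alpha for a in chosen) / len(chosen)
    mean_b = sum(a.beta for a in chosen) / len(chosen)
    return [Arm(mean_a, mean_b) for _ in range(num_triggering)]
```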
Specifically, for each test program that is both compilable and executable, given the same set of
inputs to the program (if required), we cross-check the produced outputs over the optimized and
non-optimized (or minimally optimized) versions of the program.
Crash. Following prior work [11, 17, 21, 41, 46, 48, 84, 93], it is undesirable to let the compiler and
the compiled executable crash unexpectedly. Consequently, WhiteFox actively captures crashing
signals at both the compile- and execution-time for the test program, including process aborts,
segmentation faults, and unexpected internal exceptions (e.g., INTERNAL_ASSERT_FAILED in PyTorch).
To summarize, we compile each test input in two modes: with and without optimization. We
consider the following conditions as bug candidates:
• Crashes during either optimization compilation or optimized program execution.
• Discrepancies in compilation status (pass/fail) between the two modes.
• Different program outputs between the two modes.
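A hedged sketch of these oracles for PyTorch Inductor, where torch.compile is the public entry point to the compiler; the tolerance values and the coarse exception handling are our own simplifications:

```python
import torch

def run_oracles(model_cls, example_inputs):
    model = model_cls()
    compiled = torch.compile(model)        # optimized mode (shares weights with `model`)
    try:
        ref = model(*example_inputs)       # non-optimized (eager) reference run
    except Exception:
        ref = None
    try:
        out = compiled(*example_inputs)
    except Exception:
        # Compilation or execution fails only in optimized mode.
        return "bug candidate: crash or failed optimization" if ref is not None else "ok"
    if ref is None:
        return "bug candidate: incorrectly passed optimization"
    if not torch.allclose(ref, out, rtol=1e-3, atol=1e-3):
        return "bug candidate: result inconsistency"
    return "ok"
```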
4 Implementation
Optimization collection. We gather optimization-specific compiler source code by spec-
ifying the relevant directories. For example, the source code of optimization passes for PyTorch
Inductor is managed under the torch/_inductor directory, and that for TensorFlow-XLA is mainly
placed under tensorflow/compiler/xla/service. We next identify code fragments (e.g., functions)
that perform optimizations through simple keyword pattern matching. For instance, operator fusion
is an important optimization in DL compilers and we collect the relevant functions by searching
“fusion” or “fuse”. In addition, auxiliary functions invoked by the main optimizations are also
collected since they may unveil essential conditions for activating the optimizations. Curating and
identifying optimization-relevant code fragments is required to drive WhiteFox; however, it should
be easy for compiler developers, who have a deep understanding of the code they maintain.
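A simple sketch of this keyword-based collection for PyTorch Inductor (the directory comes from the text above; matching whole files rather than individual functions is a simplification):

```python
import re
from pathlib import Path

FUSION_KEYWORDS = re.compile(r"\b(fuse|fusion)\b", re.IGNORECASE)

def collect_optimization_files(root="torch/_inductor"):
    """Return optimization-related Python sources that mention operator fusion."""
    hits = []
    for path in Path(root).rglob("*.py"):
        if FUSION_KEYWORDS.search(path.read_text(errors="ignore")):
            hits.append(path)
    return hits
```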
Instrumentation. To gather the triggering information for the feedback loop (§ 3.3), we instrument
each collected optimization function by inserting a logging statement at the function entry. As
such, when a test program is compiled, the sequence of activated optimization passes can be
obtained from the logs.
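Conceptually, the inserted logging statement behaves like the decorator below, which records every entry into an optimization function (a sketch; in practice the logging call is added directly to the compiler source):

```python
import functools
import logging

def log_optimization(fn):
    """Record each invocation of an optimization pass for the feedback loop."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logging.info("[whitefox] triggered optimization: %s", fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

@log_optimization
def permute_linear_fusion(graph_module):
    ...  # original optimization body
```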
Analysis and generation LLMs. While our approach is general and thus agnostic to the LLMs
being employed, our tool WhiteFox is built on the state-of-the-art GPT4 [61] and StarCoder [43].
Specifically, we utilize GPT4 [61] as the analysis LLM for its recognized excellence in code compre-
hension and proficiency in natural language processing tasks [6]. For each optimization, we let
the analysis LLM generate one requirement description with the temperature set to zero via the
OpenAI APIs. Meanwhile, we choose StarCoder [43] (StarCoder-15B) as the generation LLM,
which is an open-source model with 15.5B parameters and a context length of 8K. In each iteration,
we let StarCoder generate a batch of ten test inputs with the temperature set to one through the
HuggingFace APIs [33]. The model choices allow us to balance the trade-off between the costs and
benefits that different LLMs provide: (i) GPT4 is a powerful unified LLM (i.e., with broad knowledge
and reasoning ability over both natural language and code) but costly, making it suitable as the
analysis LLM where its use is a one-time effort; (ii) StarCoder is an affordable LLM specialized for
code and is thus suitable for efficient continuous test generation.
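A sketch of the batch generation step with the HuggingFace transformers library (the decoding parameters mirror the setting described above, ten samples at temperature one, but the exact invocation in WhiteFox may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigcode/starcoder")
lm = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", device_map="auto")

def generate_batch(prompt, n=10, max_new_tokens=512):
    inputs = tok(prompt, return_tensors="pt").to(lm.device)
    outputs = lm.generate(**inputs, do_sample=True, temperature=1.0,
                          num_return_sequences=n, max_new_tokens=max_new_tokens)
    prompt_len = inputs["input_ids"].shape[1]
    # Strip the prompt tokens and return only the newly generated test programs.
    return [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
```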
Few-shot prompting. For the requirement summarization and initial test generation few-shot
prompt specific to each target compiler, we opt for one-shot prompting to minimize the manual effort
involved in prompt construction and keep the LLM context size affordable. To accomplish this, we select
an optimization from each target compiler. Subsequently, we manually write the requirement
description and a demonstrative input test capable of triggering the optimization. This serves as
the one-shot example in the prompts for both requirement summarization and test generation. One
exception is that PyTorch Inductor has two distinct types of optimizations (7 use a conventional
optimization check function, and 54 involve a pattern matcher). Therefore, we design a separate
prompt for each type and choose the corresponding prompt for each optimization based on its
type. For the feedback prompt, the requirement is produced by the analysis LLM, and the sample
tests are created by the generation LLM. We use three-shot as our default setting in the feedback
prompt. Overall, constructing prompts for each target compiler is relatively straightforward. We
only included a single example per compiler to illustrate the task format, requiring minimal effort.
It is even easier for compiler developers who are familiar with optimization logic. Notably, such
examples might already exist in test suites for many compilers. For comparison, many traditional
compiler fuzzing techniques even require numerous example tests as seeds [40, 41, 99].
5 Evaluation
5.1 Research Questions
We investigate the following research questions in our experiments:
• RQ1: How does WhiteFox compare to state-of-the-art DL compiler fuzzers?
• RQ2: Are all the key components in WhiteFox effective?
• RQ3: Is WhiteFox able to detect real-world bugs?
programming languages (e.g., C++, Python, and CUDA) and rely heavily on various backend libraries
(e.g., Triton [82], oneDNN [59], and MKL-DNN [57]). Additionally, generating arbitrary inputs for
DL compilers is extremely difficult for general-purpose fuzzers due to dual requirements: satisfying
language syntax/semantics (e.g., Python’s dynamic typing and syntax checks) and tensor/operator
constraints for valid computational graphs [17, 46]. As a result, we opt to compare with the current
best techniques for fuzzing DL compilers, i.e., TitanFuzz [17] and NNSmith [46].
Ablation variants. Multiple WhiteFox variants are evaluated in our ablation study. Considering
that PyTorch Inductor has the most optimizations, we conduct our ablation study exclusively on
PyTorch Inductor. For requirement generation, we consider four variants: WF-Mix (the default
setting of WhiteFox), WF-NL, WF-Code, and WF-Impl. For each optimization, we let the
generation LLM generate 100 test inputs, guided by different requirement formats. Our default
setting, WF-Mix, describes the requirements in the mixed format of natural language and pseudo-
code generated by the analysis LLM. By contrast, the requirements used in WF-NL (resp. WF-Code)
are the natural language (resp. pseudo-code) description extracted from the mixed format, for a fair
comparison. Besides, we also evaluate the performance of feeding the implementation source code
directly to the generation LLM, i.e., the WF-Impl variant. Regarding the feedback loop, in addition
to our default configuration, which uses feedback with Thompson Sampling, we contemplate two
alternative variants: one without any feedback (WF-No-Feedback) and another that incorporates
feedback but employs a simple uniform random selection (WF-Naive). Furthermore, we revisit the
decision of using GPT4 as the analysis LLM by introducing an additional variant, WF-SC, which
employs StarCoder as the analysis LLM.
Environment. Our test-bed runs Ubuntu 20.04.5 LTS with 64-core CPUs, 256 GB RAM, and NVIDIA
RTX A6000 GPUs.
Fuzzing budget. Our default setting is to generate a total of 1000 tests for each optimization
in 100 iterations. In each iteration, WhiteFox by default generates a batch of 10 tests based on
optimization triggering feedback from previous iterations. If the optimization was triggered in
previous iterations, WhiteFox picks three triggering inputs used as few-shot examples in the
feedback-guided prompt (Section 3.3, Figure 5) for the following iterations. Otherwise, WhiteFox
uses the default few-shot prompt (Section 3.2, Figure 4) to generate tests.
5.3 Metrics
Following prior work [11, 17, 20, 21, 46, 48, 51], we use the number of detected bugs as our metric,
which essentially reflects the goal of fuzzing – finding more bugs. Meanwhile, the primary goal of
our approach is to effectively test the optimizations within compilers. As such, we also let the number
of triggered optimizations and the number of (optimization-)triggering tests be our principal metrics.
Specifically, an optimization is deemed “triggered” if its corresponding optimization function (§ 4)
logs its presence during fuzzing. Meanwhile, a test qualifies as a “triggering test” only if during its
execution, any of the optimizations are triggered. Given that optimization bugs can only manifest
when the optimization is activated, similar to the concept of coverage, a higher number of triggering
tests correlates with an increased likelihood of bug discovery.
To further show the effectiveness of every component, we also use code coverage [17, 21, 38, 46]
as a metric. Specifically, we report line coverage in the source languages where the optimizations
are implemented: Python for PyTorch Inductor, and C++ for both TensorFlow Lite and TensorFlow-
XLA. Following previous work [27, 86, 91], we measure line coverage using Coverage.py [2] for
Python and GCOV [3] for C++.
6 Result Analysis
6.1 Comparison with Prior Work
Table 2 compares WhiteFox against the baselines on the three target compilers under their default
settings. Because NNSmith has a shorter execution time than our default setting, for fair comparison,
we also present results from WhiteFox-Mini, which produces 100 tests for each optimization, as
opposed to the default 1000 tests. Notably, Column Time in Table 2 encompasses both the generation
time of requirements/tests (including LLM invocations) and the test-execution time.
In terms of optimization triggering, we observe that WhiteFox outperforms the baselines signif-
icantly in PyTorch Inductor and TensorFlow Lite. Overall, among the tested compilers, WhiteFox
outperforms existing testers by up to 8.2x in terms of the number of triggered optimizations. For
example, out of the 61 optimizations in PyTorch Inductor, WhiteFox is able to trigger 41 opti-
mizations, while the baseline approaches can trigger at most 5 optimizations, which is a subset of
optimizations covered by WhiteFox. Regarding the time cost, WhiteFox consumes less time than
all other techniques except NNSmith. Meanwhile, given the inferior performance of NNSmith,
WhiteFox-Mini can still trigger more optimizations than NNSmith using less time.
WhiteFox outperforms all baselines on all compilers except TensorFlow-XLA, with two fewer
optimizations being triggered compared to TitanFuzz. One possible reason is that the targeted
optimizations in TensorFlow-XLA are relatively simple per our optimization filtering for fair
comparison with baselines (§ 5.2). Upon inspection, many of these optimizations represent common
model patterns that are widely used in practice. Therefore, they can be effectively triggered by
TitanFuzz since TitanFuzz leverages LLMs to generate human-like programs by resembling the
distribution of training data. Nevertheless, despite generating slightly fewer total tests compared
to TitanFuzz, WhiteFox demonstrates its own edge by triggering four unique optimizations
which TitanFuzz cannot. In addition, we note that NNSmith has much more triggering tests than
WhiteFox and TitanFuzz over TensorFlow-XLA. This is largely due to the implementation choice
of NNSmith, which always outputs the model with redundant reshapes. Thus, the vast majority of
test inputs from NNSmith can trigger the IdentityReshapeRemoving optimization (117,006/117,381).
Regarding unique optimizations triggered by each approach, the baselines trigger 7 optimizations
for PyTorch Inductor, while WhiteFox covers these 7 and an additional 34 unique optimizations.
For TensorFlow Lite, the baselines trigger 9 optimizations, which are all covered by WhiteFox,
plus 3 more unique optimizations. For TensorFlow-XLA, the baselines trigger 26 optimizations,
including the 20 covered by WhiteFox.
Additionally, Table 3 compares WhiteFox with the baselines over a 24-hour testing period, a
common setting for fuzzing studies [38]. WhiteFox performs the best on the number of triggered
optimizations, substantially outperforming others on PyTorch Inductor and TensorFlow Lite. In
terms of code coverage, WhiteFox covers more lines than the baselines in PyTorch Inductor and
TensorFlow-XLA by up to 19.9%. For TensorFlow Lite, WhiteFox performs slightly worse than
TitanFuzz (by 5.9%). This may be attributed to the limited number of optimizations (13) in TensorFlow
Lite, which inherently restricts WhiteFox’s code coverage exploration, as WhiteFox does not
have much white-box information (i.e., optimization implementation) to guide the generation.
Note, however, that code coverage is only a proxy indicator and does not always
correlate strongly with bug finding abilities for complicated systems [34, 73]. Despite slightly
lower code coverage on TensorFlow Lite, WhiteFox still performs better on the ultimate goal, bug
finding (detailed in Section 6.3). Overall, these results demonstrate the effectiveness of WhiteFox
in generating test cases to cover not only optimizations but also various compiler behaviors.
6.2.1 Requirement Generation & Test Generation. We first study the effectiveness of the require-
ment description and the multiple choices for the format (shown in Table 4). The goal of the
requirement generation is to assist the generation LLM in producing more tests that can trigger
additional optimizations within the compiler. Thus, our main points of comparison are the number
of triggered optimizations (Column “# Triggered optim.”) and the number of tests that can trigger
the optimization (Column “# Triggered tests”).

[Fig. 6. Impact of the feedback loop and Thompson Sampling on PyTorch Inductor: the number of triggering tests per optimization pass for WhiteFox, WF-Naive, and WF-No-Feedback.]
Effectiveness of requirement description. Compared with WF-Impl, which feeds the imple-
mentation source code directly to the generation LLM, all three variants that use requirements
(WF-Mix, WF-NL, and WF-Code) demonstrate superior performance in generating triggering tests.
Notably, our default setting WF-Mix is able to generate 1.74x more triggering tests than directly
using implementation code. Besides, WF-Mix can also trigger 7 more optimizations than WF-Impl,
emphasizing the importance of using requirement description. This aligns with our statement in the
Design section (§ 3) that optimization source code is not the most effective guide for the generation
LLM due to its redundant, unrelated information and low-level format.
Effectiveness of mixed format. As shown in Table 4, WF-Mix achieves the best number of
triggered optimizations and triggering tests, underlining the effectiveness of combining NL and
pseudo-code for requirement description. Concurrently, while WF-NL triggers more optimizations
than WF-Code, it results in fewer triggering tests. This is because NL usually carries more
information than pseudo-code, ensuring that vital triggering requirements are not missed during
conversion from the implementation source code. Conversely, it is more straightforward for the
generation LLM to correlate requirements formatted in pseudo-code with the respective test
programs, leading to a higher number of triggering tests.
Analysis LLM. When employing requirement descriptions generated by StarCoder, WF-SC not
only results in fewer triggered optimizations but also a reduced number of triggering tests compared
to our default setting, which utilizes GPT4 to summarize the implementation source code. This
discrepancy is anticipated, given that GPT4 is recognized as the cutting-edge LLM in tasks related
to code comprehension and natural language generation [6]. Essentially, GPT4 exhibits a superior
ability in translating intricate source code details into high-level input requirements compared to
StarCoder. This observation underscores our rationale for choosing GPT4 as the analytical LLM.
Interestingly, even WF-SC generates a higher number of triggering tests than WF-Impl, which
creates the input straight from the implementation source code. Such a discovery confirms that a
dual-model infrastructure might be better aligned for white-box compiler fuzzing than directly
utilizing the implementation source code, emphasizing the value of having a distinct phase dedicated
to requirement generation.
6.2.2 Feedback Loop. Next, we examine the effectiveness of our feedback loop and the Thompson
Sampling algorithm. The primary aim of the feedback loop is to enhance the likelihood of generating
additional test inputs that activate already-triggered optimizations. Therefore, in this ablation study,
we focus on comparing the number of triggering tests. Figure 6 showcases a bar chart that details
the number of triggering tests for each triggered optimization, spanning the range of variants
explored in this ablation study. Besides, we also collect the coverage results for these three variants,
shown in Table 5.
Effectiveness of feedback loop. WhiteFox and WF-Naive, incorporating the feedback loop,
produce significantly more triggering tests than the variant without it (WF-No-Feedback), with
respectively 2.6x and 2.1x increases. This improvement not only emphasizes the effectiveness of the
feedback loop but also indicates that its guidance is more attuned to triggering the target optimization
than relying on few-shot examples for other different optimizations. Furthermore, both approaches
outperform WF-No-Feedback in terms of code coverage, demonstrating that the feedback loop can
guide LLMs to generate more diverse test cases.
Effectiveness of Thompson Sampling. In our default configuration, WhiteFox leverages the
Thompson Sampling algorithm for selecting triggering examples and achieves a remarkable 1.3x
more triggering tests compared to WF-Naive, which uses uniform sampling to select the examples.
As shown in Figure 6, WhiteFox outperforms the rest over 32 out of the 41 triggered optimizations
in terms of the number of triggering tests. Meanwhile, the code coverage of WhiteFox is
higher than that of WF-Naive. Overall, the experimental results show the effectiveness of our MAB-based
triggering example selection.
Regarding the 68 fixed bugs, we further explore the root cause of the bugs by inspecting their
corresponding developer fixes. Impressively, 47 (69.1%) of the fixed bugs are repaired in the
optimization code of PyTorch Inductor. This demonstrates the effectiveness of WhiteFox for
finding optimization bugs, which is the primary goal of our approach. Specifically, only 3 of these
47 optimization bugs can be covered by TitanFuzz and NNSmith, highlighting the significant
edge of WhiteFox in testing compiler optimizations. One interesting observation is that certain
optimizations appear to be more error-prone than others; however, bugs in such optimizations
turn out to be harder to discover. For example, WhiteFox detects 5 bugs in the optimization
for the important attention modules [83], which are the fundamental building blocks to LLMs. The
fact that developer-crafted tests overlooked multiple critical bugs may seem surprising, but it reflects
the challenge of creating precise model patterns that reveal deeply hidden issues. By exposing
such critical bugs, WhiteFox demonstrates the power of white-box fuzzing with LLMs.
6.3.2 Bug Characteristics. We further study the detailed characteristics of the WhiteFox-detected
bugs, which include crashes, mis-compilations, failed optimizations, incorrectly passed optimizations,
and vulnerabilities, shown in Table 7. A mis-compilation occurs when the optimized program returns
different outputs than the non-optimized one. Failed optimizations refer to cases where compilation
with optimization fails, while it is valid without optimization. Incorrectly passed optimizations
occur when the optimization compiles invalid models successfully. Regarding the vulnerabilities, in
addition to the 6 crash bugs that could be used for DoS attacks, there are another 5 out-of-bound
read vulnerabilities detected within the incorrectly passed optimizations.
6.3.3 Won’t Fix Bugs. For the won’t fix bugs, in PyTorch Inductor, one is due to the compiler
not supporting quantized APIs, another is from undefined behavior in operators, and the third is
because developers considered our input invalid, despite the optimization compiling the model
and returning different results. In TensorFlow Lite, two bugs stem from a feature that does not
guarantee input-output order, and another involves the optimized output having different shapes, which
is rare and not expected in either PyTorch Inductor or TensorFlow-XLA.
6.3.4 Bug Examples. We demonstrate representative bugs detected by WhiteFox and discuss
their exploitation or security implications. Figure 7(a) illustrates a misoptimization of PyTorch
Inductor, manifested when compiling attention modules [83]. The faulty optimization identifies
self-attention via pattern matching and fuses its sub-operators into a compact and efficient
implementation. However, the optimized attention module, by rearranging the tensor layout to
a channel-last format and rendering the last dimension non-contiguous, results in an accuracy
issue. This leads to incorrect outputs when compared to those from the unoptimized module. Given
the prevalence and impact of attention modules and LLMs, this bug is labeled with high-priority
and subsequently fixed. The developers highlighted the importance of the issue, stating, “raising
priority due to being an accuracy problem on an important operator”.
Figure 7(b) presents two bugs detected in PyTorch Inductor for the binary_unary_fusion opti-
mization, which fuses the Linear (i.e., binary) and ReLU (i.e., unary) operators into a compact and
thus efficient operation. However, after compilation, the exemplified module returns impermissible
negative outputs since its final layer is ReLU whose output is always non-negative [4]. Because
such Linear-ReLU structures are incorporated in many fundamental architectures such as ResNet
(Residual Network) [29], this bug, uniquely found by WhiteFox, is labeled as high-priority and was
fixed immediately after our report. Furthermore, configuring the ReLU operation with inplace=True
results in a crash of the optimized model for inputs of bfloat16 data type, which can be leveraged
for DoS attacks by requesting data in bfloat16 format. Given the severity of this potential security
issue, the developers promptly patched the vulnerability within two days.
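The kind of module involved can be sketched as follows; the layer sizes are illustrative, and the final comment describes the reported (since fixed) misbehavior rather than current behavior:

```python
import torch

class LinearReLU(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        self.relu = torch.nn.ReLU()         # inplace=True additionally crashed on bfloat16

    def forward(self, x):
        return self.relu(self.linear(x))    # ReLU output must be non-negative

model = LinearReLU()
compiled = torch.compile(model)
x = torch.randn(4, 8)
assert (model(x) >= 0).all()                # eager mode respects ReLU
# The reported mis-optimization made compiled(x) return negative values.
```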
Figure 7(c) depicts a bug manifested in the FuseUnpackAndConcatToReshape optimization in Ten-
sorFlow Lite. Specifically, this optimization aims to streamline unpack-concat operation pairs into
a single reshape operation, provided that the two operations are semantically inverse to each
other. In this optimization, the unpacked dimension should match the concatenated dimension
in the original input of the unpack-concat operation pair. The example listed in Figure 7(c) vi-
olates this assumption and in theory cannot be simplified to a reshaping operation. However, the
FuseUnpackAndConcatToReshape function still erroneously transforms it into a wrong reshaping op-
eration. Notably, this bug is exclusively detected by WhiteFox as it hinges on generating valid tests
to trigger this particular optimization, which other techniques consistently struggle to accomplish.
Figure 7(d) shows a TensorFlow-XLA bug exclusively detected by WhiteFox. The bug-triggering
model contains an embedding layer with a vocabulary size of 64 tokens, followed by one multiplica-
tion operation. When the first input to the model is 64, exceeding the embedding layer’s maximum
token index (63), the unoptimized model raises an InvalidArgumentError as expected. However,
the model optimized by XLA omits index validation, leading to an out-of-bounds read vulnerability.
Given the prevalence/impact of the important embedding layers, developers have swiftly addressed
and fixed this issue after seeing our report.
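Since an embedding lookup lowers to a gather over the embedding table, the essence of the bug can be sketched directly with tf.gather (the table shape and scale value are illustrative assumptions):

```python
import tensorflow as tf

table = tf.random.normal([64, 8])          # embedding table: 64-token vocabulary

@tf.function(jit_compile=True)             # XLA-compiled path
def lookup_and_scale(idx, scale):
    return tf.gather(table, idx) * scale   # embedding lookup followed by a multiply

# Valid indices are 0..63. For index 64, the unoptimized graph raises
# InvalidArgumentError, while the reported XLA-compiled version skipped the
# bounds check and read past the end of the table (out-of-bounds read).
print(lookup_and_scale(tf.constant([63]), tf.constant(2.0)))
```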
7 Discussion
7.1 Real-world Impact
Notably, we received acknowledgment from the PyTorch team along with a request to integrate
WhiteFox into the development pipeline of the PyTorch Inductor compiler.
“Thanks for your contributions to surfacing TorchInductor issues with Whitefox and sharing
details. It will be great to figure out the next steps (for integration).” — PyTorch Team
Consequently, we further extended WhiteFox to accommodate the most recent version of Py-
Torch Inductor, incorporating support for an additional 38 newly introduced optimizations. This
underscores a distinct dynamic of DL compilers: they diverge from traditional compilers due to
the brisk pace of DL model architecture evolution and the pressing need to optimize for nascent
architectures. For context, PyTorch Inductor has received 1,846 commits in the last year alone.
Therefore, the principal focus of WhiteFox on DL compilers is driven by the necessity for an
approach that can evolve in tandem with the rapid development of new optimizations. As described
in § 6.3, WhiteFox helped detect 14 new bugs for the newly introduced optimizations, all of which
have been confirmed by the developers. This demonstrates WhiteFox’s effectiveness and ability to
adapt to evolving optimizations, showing the value and significance of incorporating WhiteFox
into the development workflow.
Bug detection. WhiteFox detects 6 bugs for LLVM, with 2 confirmed as previously unknown, 3
pending, and 1 marked as won’t-fix. Figure 7(e) presents a confirmed LLVM bug, which is revealed only
when a test program references a huge array through a sufficiently large index, crashing LLVM post-
optimization. Attackers can exploit this vulnerability for DoS by crafting specific input programs
to crash Just-in-Time-enabled systems that apply this optimization.
8 Conclusion
We present WhiteFox, the first practical white-box compiler fuzzer to test compiler optimizations.
WhiteFox adopts a multi-agent design: an analysis LLM reads through the implementation code of
compiler optimizations and summarizes desired patterns of test programs, with which a generation
LLM is then prompted to efficiently and continuously synthesize meaningful test programs to
exercise the corresponding optimizations. Our evaluation shows that WhiteFox is effective in testing
emerging DL compilers and is also adaptable to conventional C/C++ compilers. To date,
WhiteFox has found in total 101 bugs for DL compilers, with 92 confirmed as previously unknown
and 70 already fixed.
Data-Availability Statement
The artifact of WhiteFox is available at https://github.com/ise-uiuc/WhiteFox.
Acknowledgments
We thank Chunqiu Steven Xia for providing the help and resources to run some experiments. This
work was partially supported by NSF grant CCF-2131943 and Kwai Inc. This project is supported,
in part, by funding from Two Sigma Investments, LP. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the authors and do not necessarily reflect
the views of Two Sigma Investments, LP.
References
[1] 2021. News. https://www.vice.com/en_us/article/9kga85/uber-is-giving-up-on-self-driving-cars-in-california-after-deadly-crash.
[2] 2022. Coverage.py. https://github.com/nedbat/coveragepy.
[3] 2022. GCOV. https://gcc.gnu.org/onlinedocs/gcc/Gcov.html.
[4] Abien Fred Agarap. 2018. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375 (2018).
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[6] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat
Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4.
arXiv preprint arXiv:2303.12712 (2023).
[7] Junjie Chen, Jibesh Patra, Michael Pradel, Yingfei Xiong, Hongyu Zhang, Dan Hao, and Lu Zhang. 2020. A survey of
compiler testing. ACM Computing Surveys (CSUR) 53, 1 (2020), 1–36.
[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv
preprint arXiv:2107.03374 (2021).
[9] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang,
Yuwei Hu, Luis Ceze, et al. 2018. {TVM}: An automated {End-to-End} optimizing compiler for deep learning. In 13th
USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.
[10] Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. 2024. ChatUniTest: A Framework
for LLM-Based Test Generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations
of Software Engineering. 572–576.
[11] Yongheng Chen, Rui Zhong, Hong Hu, Hangfan Zhang, Yupeng Yang, Dinghao Wu, and Wenke Lee. 2021. One engine
to fuzz’em all: Generic language processor testing with semantic validation. In 2021 IEEE Symposium on Security and
Privacy (SP). IEEE, 642–658.
[12] Mingi Cho, Seoyoung Kim, and Taekyoung Kwon. 2019. Intriguer: Field-level constraint solving for hybrid fuzzing. In
Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. 515–530.
[13] Jaeseung Choi, Joonun Jang, Choongwoo Han, and Sang Kil Cha. 2019. Grey-box concolic testing on binary code. In
2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 736–747.
[14] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways.
arXiv preprint arXiv:2204.02311 (2022).
[15] Neophytos Christou, Di Jin, Vaggelis Atlidakis, Baishakhi Ray, and Vasileios P Kemerlis. 2023. {IvySyn}: Automated
Vulnerability Discovery in Deep Learning Frameworks. In 32nd USENIX Security Symposium (USENIX Security 23).
2383–2400.
[16] Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C Desmarais. 2024. Effective
test generation using pre-trained large language models and mutation testing. Information and Software Technology
171 (2024), 107468.
[17] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large Language Models
are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In Proceedings of the 32nd ACM
SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023).
[18] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2023.
Large language models are edge-case fuzzers: Testing deep learning libraries via fuzzgpt. arXiv preprint arXiv:2304.02014
(2023).
[19] Alastair F Donaldson, Hugues Evrard, Andrei Lascu, and Paul Thomson. 2017. Automated testing of graphics shader
compilers. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 1–29.
[20] Alastair F Donaldson, Paul Thomson, Vasyl Teliman, Stefano Milizia, André Perez Maselco, and Antoni Karpiński.
2021. Test-case reduction and deduplication almost for free with transformation-based compiler testing. In Proceedings
of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 1017–1032.
[21] Karine Even-Mendoza, Arindam Sharma, Alastair F Donaldson, and Cristian Cadar. 2023. GrayC: Greybox Fuzzing of
Compilers and Analysers for C. (2023).
[22] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu,
Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint
arXiv:2002.08155 (2020).
[23] Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software.
In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software
engineering. 416–419.
[24] GCC 2023. GCC. https://gcc.gnu.org/.
[25] Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed Automated Random Testing. In Proceedings
of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (Chicago, IL, USA) (PLDI
’05). Association for Computing Machinery, New York, NY, USA, 213–223. https://doi.org/10.1145/1065010.1065036
[26] Rahul Gopinath, Bachir Bendrissou, Björn Mathis, and Andreas Zeller. 2020. Fuzzing with fast failure feedback. arXiv
preprint arXiv:2012.13516 (2020).
[27] J. Gu, X. Luo, Y. Zhou, and X. Wang. 2022. Muffin: Testing Deep Learning Libraries via Neural Architecture Fuzzing. In
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos,
CA, USA, 1418–1430. https://doi.org/10.1145/3510003.3510092
[28] Qianyu Guo, Xiaofei Xie, Yi Li, Xiaoyu Zhang, Yang Liu, Xiaohong Li, and Chao Shen. 2020. Audee: Automated testing
for deep learning frameworks. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering
(ASE). IEEE, 486–498.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In
Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings,
Part IV 14. Springer, 630–645.
[30] Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with code fragments. In 21st USENIX Security
Symposium (USENIX Security 12). 445–458.
[31] Heqing Huang, Peisen Yao, Rongxin Wu, Qingkai Shi, and Charles Zhang. 2020. Pangolin: Incremental hybrid fuzzing
with polyhedral path abstraction. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1613–1627.
[32] Linghan Huang, Peizhou Zhao, Huaming Chen, and Lei Ma. 2024. Large language models based fuzzing techniques: A
survey. arXiv preprint arXiv:2402.00350 (2024).
[33] HuggingFace 2023. Hugging Face. https://huggingface.co.
[34] Laura Inozemtseva and Reid Holmes. 2014. Coverage is not strongly correlated with test suite effectiveness. In
Proceedings of the 36th international conference on software engineering. 435–445.
[35] Ling Jiang, Hengchen Yuan, Mingyuan Wu, Lingming Zhang, and Yuqun Zhang. 2023. Evaluating and improving
hybrid fuzzing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 410–422.
[36] Kyungtae Kim, Dae R Jeong, Chung Hwan Kim, Yeongjin Jang, Insik Shin, and Byoungyoung Lee. 2020. HFL: Hybrid
Fuzzing on the Linux Kernel.. In NDSS.
[37] James C King. 1976. Symbolic execution and program testing. Commun. ACM 19, 7 (1976), 385–394.
[38] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating fuzz testing. In Proceedings
of the 2018 ACM SIGSAC conference on computer and communications security. 2123–2138.
[39] Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation.
In International symposium on code generation and optimization, 2004. CGO 2004. IEEE, 75–86.
[40] Vu Le, Mehrdad Afshari, and Zhendong Su. 2014. Compiler validation via equivalence modulo inputs. ACM Sigplan
Notices 49, 6 (2014), 216–226.
[41] Vu Le, Chengnian Sun, and Zhendong Su. 2015. Finding deep compiler bugs via guided stochastic program mutation.
ACM SIGPLAN Notices 50, 10 (2015), 386–399.
[42] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. CODAMOSA: Escaping coverage
plateaus in test generation with pre-trained large language models. In International conference on software engineering
(ICSE).
[43] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone,
Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161
(2023).
[44] Wen Li, Haoran Yang, Xiapu Luo, Long Cheng, and Haipeng Cai. 2023. PyRTFuzz: Detecting Bugs in Python
Runtimes via Two-Level Collaborative Fuzzing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and
Communications Security. 1645–1659.
[45] libFuzzer 2023. libFuzzer – a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html.
[46] Jiawei Liu, Jinkun Lin, Fabian Ruffy, Cheng Tan, Jinyang Li, Aurojit Panda, and Lingming Zhang. 2023. NNSmith:
Generating Diverse and Valid Test Cases for Deep Learning Compilers. In ASPLOS. 530–543.
[47] Jiawei Liu, Jinjun Peng, Yuyao Wang, and Lingming Zhang. 2023. NeuRI: Diversifying DNN Generation via Inductive
Rule Inference. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the
Foundations of Software Engineering (San Francisco, CA, USA) (ESEC/FSE 2023). Association for Computing Machinery,
New York, NY, USA, 657–669. https://doi.org/10.1145/3611643.3616337
[48] Jiawei Liu, Yuxiang Wei, Sen Yang, Yinlin Deng, and Lingming Zhang. 2022. Coverage-Guided Tensor Compiler
Fuzzing with Joint IR-Pass Mutation. Proc. ACM Program. Lang. 6, OOPSLA1, Article 73 (apr 2022), 26 pages. https://doi.org/10.1145/3527317
[49] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023.
Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172 (2023).
[50] Yinxi Liu and Wei Meng. 2023. DSFuzz: Detecting Deep State Bugs with Dependent State Exploration. In Proceedings
of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 1242–1256.
[51] Vsevolod Livinskii, Dmitry Babokin, and John Regehr. 2020. Random testing for C and C++ compilers with YARPGen.
Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–25.
[52] LLVM 2023. LLVM’s Analysis and Transform Passes. https://llvm.org/docs/Passes.html.
[53] Björn Mathis, Rahul Gopinath, and Andreas Zeller. 2020. Learning input tokens for effective fuzzing. In Proceedings of
the 29th ACM SIGSOFT international symposium on software testing and analysis. 27–37.
[54] James B McDonald and Yexiao J Xu. 1995. A generalization of the beta distribution with applications. Journal of
Econometrics 66, 1-2 (1995), 133–152.
[55] William M McKeeman. 1998. Differential testing for software. Digital Technical Journal 10, 1 (1998), 100–107.
[56] Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. 2024. Large language model guided protocol
fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS).
[57] MKLDNN 2024. MKL-DNN. https://github.com/rsdubtso/mkl-dnn.
[58] Pengyu Nie, Rahul Banerjee, Junyi Jessy Li, Raymond J Mooney, and Milos Gligoric. 2023. Learning Deep Semantics
for Test Completion. arXiv preprint arXiv:2302.10166 (2023).
[59] oneDNN 2024. oneDNN. https://github.com/oneapi-src/oneDNN.
[60] OpenAI. 2023. ChatGPT. (2023). https://openai.com/blog/chatgpt.
[61] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[62] Xianfei Ou, Cong Li, Yanyan Jiang, and Chang Xu. 2024. The Mutators Reloaded: Fuzzing Compilers with Large
Language Model Generated Mutation Operators. (2024).
[63] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. Deepxplore: Automated whitebox testing of deep learning
systems. In proceedings of the 26th Symposium on Operating Systems Principles. 1–18.
[64] Hung Viet Pham, Thibaud Lutellier, Weizhen Qi, and Lin Tan. 2019. CRADLE: Cross-Backend Validation to Detect and
Localize Bugs in Deep Learning Libraries. In 2019 IEEE/ACM 41st International Conference on Software Engineering
(ICSE). 1027–1038. https://doi.org/10.1109/ICSE.2019.00107
[65] PyTorch 2023. PyTorch. http://pytorch.org.
[66] PyTorch 2023. PyTorch 2.0. https://pytorch.org/get-started/pytorch-2.0.
[67] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. Adaptive test generation using a large language model.
arXiv preprint arXiv:2302.06527 (2023).
[68] Mozilla Security. 2007. jsfunfuzz. https://github.com/MozillaSecurity/funfuzz.
[69] Koushik Sen. 2007. Concolic testing. In Proceedings of the 22nd IEEE/ACM international conference on Automated
software engineering. 571–572.
[70] Koushik Sen, Darko Marinov, and Gul Agha. 2005. CUTE: A Concolic Unit Testing Engine for C. In Proceedings of
the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on
Foundations of Software Engineering (Lisbon, Portugal) (ESEC/FSE-13). Association for Computing Machinery, New
York, NY, USA, 263–272. https://doi.org/10.1145/1081706.1081750
[71] Weijie Shao, Yuyang Gao, Fu Song, Sen Chen, and Lingling Fan. 2023. An Empirical Study of Bugs in Open-Source
Federated Learning Framework. ArXiv abs/2308.05014 (2023). https://api.semanticscholar.org/CorpusID:265221980
[72] Dave Shreiner et al. 2009. OpenGL programming guide: the official guide to learning OpenGL, versions 3.0 and 3.1.
Pearson Education.
[73] Ting Su, Jue Wang, and Zhendong Su. 2021. Benchmarking automated GUI testing for Android against real-world
bugs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the
Foundations of Software Engineering. 119–130.
[74] Chengnian Sun, Vu Le, Qirun Zhang, and Zhendong Su. 2016. Toward understanding compiler bugs in GCC and
LLVM. In Proceedings of the 25th international symposium on software testing and analysis. 294–305.
[75] Maolin Sun, Yibiao Yang, Yang Wang, Ming Wen, Haoxiang Jia, and Yuming Zhou. 2023. SMT Solver Validation
Empowered by Large Pre-trained Language Models. In ASE.
[76] Michael Sutton, Adam Greene, and Pedram Amini. 2007. Fuzzing: brute force vulnerability discovery. Pearson Education.
[77] Yutian Tang, Zhijie Liu, Zhichao Zhou, and Xiapu Luo. 2024. Chatgpt vs sbst: A comparative assessment of unit test
suite generation. IEEE Transactions on Software Engineering (2024).
[78] TensorFlow 2023. TensorFlow. https://www.tensorflow.org.
[79] TensorFlowLite 2023. TensorFlow Lite. https://www.tensorflow.org/lite.
[80] TensorFlowXLA 2023. TensorFlow XLA. https://www.tensorflow.org/xla.
[81] William R Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence
of two samples. Biometrika 25, 3-4 (1933), 285–294.
[82] Triton 2024. Triton. https://github.com/openai/triton.
[83] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[84] Jiannan Wang, Thibaud Lutellier, Shangshu Qian, Hung Viet Pham, and Lin Tan. 2022. EAGLE: Creating Equivalent
Graphs to Test Deep Learning Libraries. (2022).
[85] Zan Wang, Ming Yan, Junjie Chen, Shuang Liu, and Dongdi Zhang. 2020. Deep learning library testing via effective
model generation. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering. 788–799.
[86] Anjiang Wei, Yinlin Deng, Chenyuan Yang, and Lingming Zhang. 2022. Free Lunch for Testing: Fuzzing Deep-Learning
Libraries from Open Source. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). 995–1007.
https://doi.org/10.1145/3510003.3510041
[87] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: Empowering code generation
with oss-instruct. In Forty-first International Conference on Machine Learning.
[88] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2023. Universal fuzzing via
large language models. arXiv preprint arXiv:2308.04748 (2023).
[89] Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, and Michael W Godfrey. 2022.
DocTer: documentation-guided fuzzing for testing deep learning API functions. In Proceedings of the 31st ACM SIGSOFT
International Symposium on Software Testing and Analysis. 176–188.
[90] Jianhao Xu, Kangjie Lu, Zhengjie Du, Zhu Ding, Linke Li, Qiushi Wu, Mathias Payer, and Bing Mao. 2023. Silent Bugs
Matter: A Study of {Compiler-Introduced} Security Bugs. In 32nd USENIX Security Symposium (USENIX Security 23).
3655–3672.
[91] Chenyuan Yang, Yinlin Deng, Jiayi Yao, Yuxing Tu, Hanchi Li, and Lingming Zhang. 2023. Fuzzing Automatic
Differentiation in Deep-Learning Libraries. In Proceedings of the 45th International Conference on Software Engineering
(Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 1174–1186. https://doi.org/10.1109/ICSE48619.2023.00105
[92] Chenyuan Yang, Zijie Zhao, and Lingming Zhang. 2023. Kernelgpt: Enhanced kernel fuzzing via large language models.
arXiv preprint arXiv:2401.00563 (2023).
[93] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation. 283–294.
[94] Zhaomo Yang, Brian Johannesmeyer, Anders Trier Olesen, Sorin Lerner, and Kirill Levchenko. 2017. Dead store
elimination (still) considered harmful. In 26th USENIX Security Symposium (USENIX Security 17). 1025–1040.
[95] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-
Decoder Models for Code Understanding and Generation. In EMNLP 2021.
[96] Insu Yun, Sangho Lee, Meng Xu, Yeongjin Jang, and Taesoo Kim. 2018. {QSYM}: A practical concolic execution engine
tailored for hybrid fuzzing. In 27th USENIX Security Symposium (USENIX Security 18). 745–761.
[97] Andreas Zeller, Rahul Gopinath, Marcel Böhme, Gordon Fraser, and Christian Holler. 2019. The fuzzing book.
[98] Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen.
2023. A survey on large language models for software engineering. arXiv preprint arXiv:2312.15223 (2023).
[99] Qirun Zhang, Chengnian Sun, and Zhendong Su. 2017. Skeletal program enumeration for rigorous compiler testing. In
Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 347–361.