
Toward Understanding Compiler Bugs in GCC and LLVM

The study examines over 50,000 bugs from GCC and LLVM compilers over more than a decade. It finds that (1) C++ components had the most bugs, accounting for 20% of bugs in both compilers, (2) test cases triggering bugs typically had fewer than 45 lines of code, and (3) most bug fixes touched a single source file with small code modifications.
Toward Understanding Compiler Bugs in GCC and LLVM

Chengnian Sun Vu Le Qirun Zhang Zhendong Su


Department of Computer Science, University of California, Davis, USA
{cnsun, vmle, qrzhang, su}@ucdavis.edu

ABSTRACT
Compilers are critical, widely-used complex software. Bugs in them have significant impact, and can cause serious damage when they silently miscompile a safety-critical application. An in-depth understanding of compiler bugs can help detect and fix them. To this end, we conduct the first empirical study on the characteristics of the bugs in two mainstream compilers, GCC and LLVM. Our study is significant in scale: it exhaustively examines about 50K bugs and 30K bug fix revisions over more than a decade’s span.
This paper details our systematic study. Summary findings include: (1) In both compilers, C++ is the most buggy component, accounting for around 20% of the total bugs and twice as many as the second most buggy component; (2) the bug-revealing test cases are typically small, with 80% having fewer than 45 lines of code; (3) most of the bug fixes touch a single source file with small modifications (43 lines for GCC and 38 for LLVM on average); (4) the average lifetime of GCC bugs is 200 days, and 111 days for LLVM; and (5) high priority tends to be assigned to optimizer bugs, most notably 30% of the bugs in GCC’s inter-procedural analysis component are labeled P1 (the highest priority).
This study deepens our understanding of compiler bugs. For application developers, it shows that even mature production compilers still have many bugs, which may affect development. For researchers and compiler developers, it sheds light on interesting characteristics of compiler bugs, and highlights challenges and opportunities to more effectively test and debug compilers.

CCS Concepts
•General and reference → Empirical studies; •Software and its engineering → Software testing and debugging;

Keywords
empirical studies, compiler bugs, compiler testing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ISSTA ’16, July 18-22, 2016, Saarbrücken, Germany
© 2016 ACM. ISBN 978-1-4503-4390-9/16/07...$15.00
DOI: http://dx.doi.org/10.1145/2931037.2931074

1. INTRODUCTION
Compilers are an important category of system software. Extensive research and development efforts have been devoted to increasing compilers’ performance and reliability. However, it may still surprise application developers that, similar to application software, production compilers also contain bugs, and in fact quite a few. Furthermore, compiler bugs impact application code, and can even lead to severe damage, especially when a buggy compiler is used to compile safety-critical applications.
Different from most application bugs, compiler bugs are difficult to recognize as they usually manifest indirectly as application failures. For example, a compiler bug may cause a program to be optimized and transformed into a wrong executable, and the bug manifests only when the executable misbehaves. Even worse, in most cases the application developer will first assume the misbehavior is caused by some bug introduced by herself/himself, and it may take a long time for her/him to realize that the compiler is the culprit.
In order to better understand compiler bugs, we conduct the first empirical study on their characteristics. Although there have been a number of empirical studies on software bugs [6, 7, 17, 25, 27, 31, 36, 37], none focuses on compiler bugs. For example, Lu et al. [17] study the characteristics of concurrency bugs, Sahoo et al. [25] investigate the bugs in server software, and Chou et al. [7] research the errors in operating system kernels. In contrast, this paper examines the bugs of two mainstream production compilers, GCC and LLVM, in total 39,890 bugs of GCC and 12,842 bugs of LLVM. In order to explore the properties of compiler bug fixes, we also examine 22,947 GCC revisions and 8,452 LLVM revisions, which are fixes to most of the bugs. In particular, we attempt to investigate compiler bugs along four central aspects:
(1) Location of Bugs. We compute the distribution of bugs in compiler components and the source files. The data shows that in both GCC and LLVM, the component C++ is always the most buggy one, with a defect rate twice as much as the second most buggy component. In addition, we find that most of the source files only contain one bug (i.e., 60% for GCC and 53% for LLVM), and most of the top ten buggy source files belong to the front end of C++.
(2) Test Cases, Localization and Fixes of Bugs. We investigate the size of the test cases that trigger compiler bugs, and find that on average GCC test cases have 32 lines of code and those of LLVM have 29 lines. This observation can guide random testing of compilers. For example, instead
of generating large simple test programs, compiler testing may be more effective by generating small, yet complex test programs. Moreover, it confirms that most of the bugs can be effectively reduced by techniques such as Delta [38] or C-Reduce [24]. The fixes of compiler bugs are not big either, which on average only touch 43 lines of code for GCC and 38 for LLVM. 92% of them involve fewer than 100 lines of code changes, and 58% for GCC and 54% for LLVM involve only one function. Our findings reveal that most of the bug fixes are local. Even for optimizers, each bug is usually from a single optimization pass.
(3) Duration of Bugs. We compute information on the duration of bugs (i.e. the time between when a bug was filed and when it was closed), and show that the average duration of GCC bugs is 200 days and that of LLVM bugs is 111 days. We also show that the GCC community confirms bugs faster than LLVM but spends more time fixing them.
(4) Priorities of Bugs. The field priority of a bug report is determined by developers for prioritizing bugs, expressing the order in which bugs should be fixed. We investigate the distribution of bugs over priorities. Of the GCC bugs, 68% have the default P3 priority, and only 6.02% are labeled as P1. We then study how priorities are correlated with components. The inter-procedural analysis component, ipa, of GCC is the most “impactful” as 30% of its bugs are labeled as P1. We also study how priorities affect the lifetime of bugs. On average, P1 bugs are fixed the fastest, whereas P5 bugs are fixed the slowest. However, for the rest, the fix time is not strictly correlated with their priorities.
As a proof-of-concept demonstration of the practical potential of our findings, we design a simple yet effective program mutation algorithm tkfuzz to find compiler crashing bugs, by leveraging the second observation that bug-revealing test programs are usually small. Applying tkfuzz on the 36,966 test programs in the GCC test suite (with an average of 32 lines of code each), we have found 18 crashing bugs in GCC and LLVM, of which 12 have already been fixed or confirmed.
The results presented in this paper provide further insights into understanding compiler bugs and better guidance toward effectively testing and debugging compilers. To ensure reproducibility and to benefit the community, we have made all our data and code publicly available at http://chengniansun.bitbucket.org/projects/compiler-bug-study/.
Paper Organization. Section 2 describes the bugs used in this study and potential threats to validity. Section 3 introduces general properties of these bugs, while Section 4 studies the distribution of compiler bugs in components and source files respectively. In Section 5, we study bug-revealing test cases and bug fixes. Section 6 then investigates the lifetime of bugs, and Section 7 studies the priorities of bugs and their relation to other properties such as components and duration. Section 8 presents our preliminary application of the findings in this paper for compiler testing. We discuss, in Section 9, how to utilize the findings in this paper. Finally, Section 10 surveys related work, and Section 11 concludes.

2. METHODOLOGY
This section briefly introduces the compilers used in our study, and describes how bugs are collected. We also discuss limitations and threats to the validity of this study.

2.1 Source of Bugs
In this paper, we study the bugs in two mainstream compiler systems, GCC and LLVM.
GCC. GCC is a compiler system produced by the GNU Project supporting various languages (e.g. C, C++, and Fortran) and various target architectures (e.g. PowerPC, x86, and MIPS). It has been under active development since the late 1980s. The latest version is 5.3, which was released on December 4, 2015.
LLVM. LLVM is another popular compiler infrastructure, providing a collection of modular and reusable compiler and toolchain technologies for arbitrary programming languages. Similar to GCC, LLVM also supports multiple languages and multiple target architectures. The project started in 2000 and has drawn much attention from both industry and academia. The latest version is 3.7.1, released on January 5, 2016.

Table 1: The information of the bugs used in this study.
Compiler  Start     End       Bugs    Revisions
GCC       Aug-1999  Oct-2015  39,890  22,947
LLVM      Oct-2003  Oct-2015  12,842  8,452

Collection of Bugs. Our study focuses on fixed bugs. We say a bug is fixed if its resolution field is set to fixed and the status field is set to resolved, verified or closed in the bug repositories of GCC and LLVM.
Identifying Bug Fix Revisions. We then identify the revisions that correspond to these fixed bugs. We collect the entire revision log from the code repositories, and for each revision, we scan its commit message to check whether this revision is a fix to a bug.
GCC and LLVM developers usually add a marker in the commit message following one of the two patterns below:
• “PR ⟨bug-id⟩”
• “PR ⟨component⟩/⟨bug-id⟩”
where the prefix “PR” stands for “Problem Report” and ⟨bug-id⟩ is the id of the corresponding bug. We use these patterns to link a revision to the bug that it fixes.
Table 1 shows the numbers of bugs and their accompanying revisions that are used in our study. We analyze 50K fixed bugs and 30K revisions in total, including 1,858 GCC and 1,643 LLVM enhancement requests.

2.2 Threats to Validity
Similar to other empirical studies, our study is potentially subject to several threats, namely the representativeness of the chosen compilers, the generalization of the used bugs, and the correctness of the examination methodology.
Regarding representativeness of the chosen compilers, we use GCC and LLVM, which are two popular compiler projects written in C/C++ with multiple language frontends, a set of highly effective optimization algorithms and various architecture backends. Both are widely deployed on Linux and OS X operating systems. We believe these two compilers can well represent most of the traditional compilers. However, they may not well reflect the characteristics of Just-In-Time compilers (e.g., Java Hotspot virtual machine) or interpreters (e.g., Perl, Python).
Regarding the generalization of the used bugs, we uniformly use all the bug reports satisfying the selection criteria stated in Section 2.1 with no human intervention. For those unresolved or invalid bugs, we believe that they are not likely as interesting as the bugs investigated in this paper. As for the identification of revisions containing bug fixes, based on the interaction with the GCC and LLVM developers on a large number of bugs we reported before, GCC developers usually link the revisions and the bugs explicitly, while LLVM developers do so less often. Given the large number of GCC and LLVM revisions, we believe that the analysis results should be representative.
Regarding the correctness of the analysis methodology, we have automated every analysis mentioned in this paper. We also have hands-on experience on analyzing bug report repositories and code revision histories from our previous studies.

3. GENERAL STATISTICS
This section shows the general statistics of the two bug repositories. Figure 1 shows the overall evolution of GCC and LLVM’s bug repositories. In particular, it shows the number of bugs reported, rejected or fixed each month. It also shows the number of bug reports that have never been confirmed. Each vertical dashed line represents the date of a compiler release, and the label on the top the version number.

[Figure 1: monthly plots of the number of bugs (new, fixed, rejected, unconfirmed) for (a) GCC and (b) LLVM, with release versions marked along the top.]
Figure 1: The overall evolution history of the bug repositories (in months). The plot filled in gray background shows the new bug reports submitted every month. The blue dashdotted plot shows the number of bugs fixed every month. The red dashed plot shows the number of bugs that are resolved as not fixed (e.g. invalid, duplicate, worksforme or wontfix). The black curve shows the number of bug reports per month which have not been confirmed yet. Clearly there is an increasing trend of unconfirmed bugs for LLVM.

The trends of the plots for GCC are relatively stable in these years compared to those of LLVM. After gaining its popularity recently, LLVM has drawn much more attention than before and more bug reports are being submitted monthly. However, the bug-fixing rate (indicated by the blue dashdotted fixed curve) does not increase much and more bugs are being left as unconfirmed (shown as the black unconfirmed curve), which is likely due to limited human resources as we were told by active members in the LLVM community that some Apple developers were pulled into the Swift project (footnote 1). This may give external bug reporters the impression that the community is not as responsive as GCC although bugs are fixed as regularly as before.
Footnote 1: https://developer.apple.com/swift/

3.1 Rejected Bug Reports

Table 2: The number and the percentage of bug reports resolved as invalid, worksforme or wontfix.
      invalid      worksforme  wontfix
GCC   7,072/10.4%  1,151/1.7%  1,463/2.2%
LLVM  1,639/6.7%   717/2.9%    593/2.4%

Table 2 shows the information on the bugs which are resolved as invalid, worksforme or wontfix. An invalid bug report is one in which the associated test program is invalid (for example, containing undefined behaviors, a bug in another software, misunderstanding of the language standard, misuse of standard libraries), or the reported “anomalous” behavior is in fact deliberate. If a bug described in a report cannot be reproduced, this report is labeled as worksforme. A bug report is resolved as wontfix if the affected version, component or library of the compiler is not maintained although the reported bug can be confirmed.
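As a minimal sketch of the selection and linking rules from Section 2.1 and the rejection categories above, the following Python fragment classifies a bug report by its status and resolution fields and extracts bug ids from “PR” markers in a commit message. The dict-based report layout, the function names and the exact regular expression are assumptions made for illustration; they are not taken from the tooling used in the study.

```python
import re

# Field values follow Sections 2.1 and 3.1 and the Figure 1 caption; the
# dict-based report layout and the function names are assumptions.
FIXED_STATUSES = {"resolved", "verified", "closed"}
REJECTED_RESOLUTIONS = {"invalid", "duplicate", "worksforme", "wontfix"}

# Matches "PR 14963" and "PR c++/52748", the two marker patterns in the text.
PR_MARKER = re.compile(r"\bPR\s+(?:[\w+.-]+/)?(\d+)\b")


def classify(report):
    """Return 'fixed', 'rejected' or 'other' for one bug report."""
    resolution = report.get("resolution", "").lower()
    status = report.get("status", "").lower()
    if resolution == "fixed" and status in FIXED_STATUSES:
        return "fixed"
    if resolution in REJECTED_RESOLUTIONS:
        return "rejected"
    return "other"


def linked_bug_ids(commit_message):
    """Return the bug ids referenced by PR markers in a commit message."""
    return [int(bug_id) for bug_id in PR_MARKER.findall(commit_message)]
```

A revision would then count as a fix when `linked_bug_ids` returns at least one id whose report is classified as `fixed`.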
[Figure 2(a): bar chart of the fraction of bugs in each of the top ten GCC components.]
(a) The top ten buggy components out of 52 in GCC. The most buggy component is C++, containing 22% of the 39,890 bugs, nearly twice as buggy as the second one. These ten components account for 79% of the GCC bugs.

3.2 Duplicate Bug Reports
Reporting bugs in a bug tracking system is uncoordinated. Different people may file multiple reports on the same bug. In this case, the later bug reports are referred to as duplicates.

Table 3: Distribution of duplicate bug reports. The first row lists the number of duplicate bugs, and the second shows the number of bugs with the given number of duplicates.
(a) Duplicate bugs of GCC.
#Duplicate  0       1      2    3    4   5   ≥6
#Report     35,933  2,924  596  215  98  41  83
(b) Duplicate bugs of LLVM.
#Duplicate  0       1    2   3   4   5  ≥6
#Report     12,157  570  72  17  13  5  8

Table 3 shows the information on duplicate bug reports in the GCC and LLVM bug repositories. 9.9% of GCC bugs and 5.3% of LLVM bugs have duplicates. The duplicate rates are similar to other open source projects such as Eclipse, OpenOffice and Mozilla [30].
The GCC bug with the most duplicates is #4529, which has 60 duplicates. The compiler crashes when it is compiling the Linux kernel 2.4. The LLVM bug with the most duplicates is #2431 with 15 duplicates. It is a meta-bug report to manage all the crashing bugs related to the component gfortran.
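The duplicate rates quoted for Table 3 can be re-derived from the table itself. The short Python check below copies the report counts from Table 3 and computes the share of bugs with at least one duplicate; only the variable and function names are introduced here.

```python
# Counts copied from Table 3: index k holds the number of bugs with k
# duplicates; the last entry is the "6 or more" bucket.
GCC_REPORTS = [35933, 2924, 596, 215, 98, 41, 83]
LLVM_REPORTS = [12157, 570, 72, 17, 13, 5, 8]


def duplicate_rate(reports):
    """Fraction of bugs that have at least one duplicate report."""
    total = sum(reports)
    with_duplicates = total - reports[0]
    return with_duplicates / total


for name, reports in (("GCC", GCC_REPORTS), ("LLVM", LLVM_REPORTS)):
    print(f"{name}: {sum(reports)} bugs, {duplicate_rate(reports):.1%} with duplicates")
```

The totals come out to 39,890 and 12,842, matching Table 1, and the rates round to 9.9% and 5.3%.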
[Figure 2(b): bar chart of the fraction of bugs in each of the top ten LLVM components.]
(b) The top ten buggy components out of 96 in LLVM. The most buggy component is new-bugs, containing 23% of the 12,842 bugs. The second component is C++ (14%) and the third is Frontend (9%). These ten components account for 79% of the LLVM bugs.
Figure 2: Distribution of Bugs in Components

3.3 Reopening Bugs
A bug report is labeled as fixed if a revision has been committed to fix the bug and passed certain test cases. However, sometimes the bug may later be found to be not correctly or fully fixed. In this case, the bug report will be reopened. There is another scenario of reopening a bug. If the bug repository administrator first rejects a bug report because the bug is deemed invalid (e.g., not reproducible) and later finds it valid, then the bug report will be reopened.
Both scenarios are normal but undesirable. Reopening a bug could mean careless bug fixes, inadequate communication between developers and testers or users, or inadequate information to reproduce the bug [40]. It can even become a bad practice if a bug report is reopened more than once, as this behavior is usually regarded as an anomalous software process execution [4, 28], and should be avoided in software development and management process.

Table 4: Breakdown of reopened bugs. The first row is the number of reopened times, and the second is the percentage of the corresponding bugs.
(a) Reopened bugs of GCC.
#Reopen  0      1     2     3     4     8
%        96.95  2.85  0.15  0.04  0.01  0.003
(b) Reopened bugs of LLVM.
#Reopen  0       1      2      3
%        94.596  5.007  0.374  0.023

Table 4 shows the reopening information of bugs in GCC and LLVM. Nearly 97% of the GCC bugs and 94% of the LLVM bugs are fixed directly without ever being reopened. The most frequently reopened bug of GCC is #29975 (footnote 2), reopened eight times. It is a meta bug to track Fortran bugs revealed by compiling the CP2K program (footnote 3). Another GCC bug, #52748 (footnote 4), was reopened three times before it was fixed. It is related to the new feature decltype of the C++11 standard, and took developers several attempts to fix correctly.
Footnote 2: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29975
Footnote 3: http://www.cp2k.org/
Footnote 4: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52748

4. LOCATION OF BUGS
This section answers the following question: Where are the bugs? Subsection 4.1 shows how the bugs are distributed in various compiler components, while Subsection 4.2 shows how bugs are distributed in different source files.

4.1 Distribution of Bugs in Components
Figure 2 shows the distribution of bugs in compiler components. We only show the top ten buggy components for each compiler, accounting for the majority of the bugs (79% for GCC and 79% for LLVM). These components touch all critical parts of compilers: front end (e.g. syntactic and semantic parsing), middle end (e.g. optimizations) and back end (e.g. code generation).

Table 5: The top 10 buggy files of GCC and LLVM.
(a) GCC.
File                 #    Description
cp/pt.c              817  C++ templates
cp/decl.c            638  C++ declarations and variables
cp/parser.c          595  C++ parser
config/i386/i386.c   569  code generation on IA-32
cp/semantics.c       457  semantic phase of parsing C++
fortran/resolve.c    452  type resolution for Fortran
cp/cp-tree.h         415  parsing and type checking C++
fold-const.c         386  constant folding
cp/typeck.c          374  type checking C++
cp/call.c            354  method invocations of C++
(b) LLVM.
File                             #    Description
SemaDecl.cpp                     301  semantic analysis for declarations
DiagnosticSemaKinds.td           268  definitions of diagnostics messages
SemaExpr.cpp                     232  semantic analysis for expressions
SemaDeclCXX.cpp                  221  semantic analysis for C++ declarations
X86ISelLowering.cpp              194  lowering LLVM code into a DAG for X86
SemaTemplate.cpp                 127  semantic analysis for C++ templates
SemaExprCXX.cpp                  124  semantic analysis for C++ expressions
SemaOverload.cpp                 123  semantic analysis for C++ overloading
lib/Sema/Sema.h                  122  semantic analysis and AST building
SemaTemplateInstantiateDecl.cpp  119  C++ template declaration instantiation
As the plots show, C++ is the most buggy component in GCC, accounting for around 22% of the bugs. It is much more buggy than the other components as the bug rate of the second most buggy component is only half. In LLVM, bug reports can be submitted without specifying their components. Therefore a large number of bug reports are closed with the component new-bugs. However, based on the distribution of bugs in source files (which will be discussed in the next section), C++ is also the most buggy component in LLVM.
One possible explanation is that C++ has many more features than the other programming languages, supporting multiple paradigms (i.e. procedural, object-oriented, and generic programming). It is surprising that most of the research and engineering efforts have been devoted to testing or validating C compilers [3, 11, 12, 13, 14, 23, 29, 34, 35] but little to C++, although C++ is one of the most popular programming languages in industry (among the top three according to the TIOBE index [33]).

4.2 Distribution of Bugs in Files
This section analyzes the distribution of bugs in the source files (i.e., how many bugs in each source file). We exclude the source files of test cases as they contribute nothing to the core functionalities of the compilers.
Figure 3 shows the fraction of source files with a given number of bugs, where the x-axis is the number of bugs, and the y-axis is the fraction of the source files containing that number of bugs. It shows that more than half of the source files only contain one bug (60% of GCC and 53% of LLVM), and very few files have a large number of bugs.
The plots in Figure 3 only show the files with no more than 30 bugs. In order to describe the other extreme of the distribution, Table 5 shows the top ten most buggy source files. Consistent with the distribution of bugs in components in Figure 2, the source files of the C++ component account for most bugs. For example, the file implementing C++ templates, perhaps the most complex feature of C++, is the most buggy file in GCC. This skewness of bugs again implies that we should devote more effort to testing C++ compilers, considering that C++ is one of the most widely used programming languages.

[Figure 3: plot of the fraction of files (y-axis) against the number of bugs (x-axis, 1 to 30) for GCC and LLVM. Marked points: GCC (1, 60%), LLVM (1, 53.2%), GCC (30, 0.1%), LLVM (30, 0.3%).]
Figure 3: The figure shows the fraction of source files with a given number (≤ 30) of bugs for GCC and LLVM. The files with more than 30 bugs are skipped and their fraction is approximately monotonically decreasing. For GCC, there are 5,039 files in total, and the file with the most bugs is ‘cp/pt.c’ (with 817 bugs); while for LLVM, there are 2,850 files in total, and the most buggy is ‘SemaDecl.cpp’ (with 301 bugs).

5. REVEALING AND FIXING BUGS
In this section, we investigate the properties of bug-revealing test cases and the bug fixes.

[Figure 4(a): empirical CDF plot; x-axis: Lines of Code of Regression Test Case, y-axis: Fraction of Bugs; curves for GCC and LLVM.]
(a) This graph shows the empirical cumulative distribution function of the sizes of the test cases that trigger the bugs in GCC and LLVM.
(b) The statistics of the lines of code in regression test cases.
      Mean  Median  SD  Min  Max
GCC   32    21      75  1    5,554
LLVM  29    16      59  1    916
Figure 4: Size of regression test cases.

5.1 Size of Bug-Revealing Test Cases
Figure 4a shows the empirical cumulative distribution function over the sizes of the test cases triggering the bugs of GCC and LLVM. As shown, most of the test cases (95%)
are smaller than 100 lines of code, and more than 60% are smaller than 25 lines of code (footnote 5). Figure 4b shows the statistics including mean, median and standard deviation (SD).

[Figure 5: diagram relating a test program t, a revision r, versions v_i, v_{i+1} and v_{i+2}, and a regression bug b.]
Figure 5: Compiler regression bugs.

[Figure 6(a): empirical CDF plot for GCC; x-axis: Time to Reveal Regressions (in days), y-axis: Fraction of Bugs.]
(a) This graph shows the empirical cumulative distribution function of the time interval between when a regression is introduced and when it is triggered.
(b) The statistics of the time to reveal regressions.
     Mean  Median  SD   Min  Max
GCC  163   20      335  1    3492
Figure 6: Time to reveal regression bugs.

These test cases are collected from the test suites of GCC and LLVM. For a new bug, besides the fix, the developer often creates a regression test case by extracting the attached bug-triggering code in the bug report. Meanwhile, a bug label is created by concatenating the prefix pr and the bug id; the label is later used as the file name of the test case or inserted into the test case as a comment. For example, the test case pr14963.c is the regression test case for the bug 14963.
By identifying the bug labels, we have collected 11,142 test cases of GCC, whose size distribution is shown as the blue solid plot in Figure 4. We collected only 347 test cases of LLVM, which is why its plot (i.e., the red dashed one) is not as smooth as that of GCC. However, the overall trend is similar in the two plots, and supports the conclusion stated at the beginning of this section.
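To illustrate the label-based collection and the size distribution discussed above, the Python sketch below recovers a bug id from a label such as pr14963.c and evaluates the empirical cumulative distribution at a given size. The helper names and the sample line counts are hypothetical and serve only to exercise the code; they are not data from the study.

```python
import re
from statistics import median

# "pr" followed by the bug id, e.g. pr14963.c (naming scheme from the text).
BUG_LABEL = re.compile(r"\bpr(\d+)")


def bug_id_from_label(text):
    """Return the bug id encoded in a file name or comment, or None."""
    match = BUG_LABEL.search(text)
    return int(match.group(1)) if match else None


def fraction_smaller_than(sizes, limit):
    """Empirical CDF value: share of test cases with fewer than `limit` lines."""
    return sum(1 for size in sizes if limit > size) / len(sizes)


# Hypothetical line counts of seven test cases, for illustration only.
sizes = [4, 9, 16, 21, 30, 48, 120]
print(bug_id_from_label("gcc.dg/pr14963.c"))
print(median(sizes), fraction_smaller_than(sizes, 25), fraction_smaller_than(sizes, 100))
```

Applied to real test suites, `fraction_smaller_than(sizes, 100)` and `fraction_smaller_than(sizes, 25)` correspond to the two points on the curve that the text quotes.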
This observation can be leveraged for improving the effectiveness of random testing for compiler validation. A random program generator such as Csmith [35] or Orion [11] should focus on generating small but complex test programs rather than big but simple ones.
Footnote 5: We do not exclude comments from counting in this paper.

5.2 Time to Reveal Compiler Regressions
A compiler regression is a bug which is introduced by a revision, breaking the functionality of a feature. Consider the illustration in Figure 5. Let r be the committed revision between two consecutive versions v_i and v_{i+1} in the source code repository. We say revision r introduces a regression bug b into version v_{i+1}
In this section, we investigate how much time it takes for a regression bug to be uncovered by testing. For the regression bug b, its regression revealing time can be obtained by computing the time span between when the revision r introducing b is committed and when the bug report of b is filed. As GCC provides richer meta-information in its bug repository than LLVM, in this study we only focus on the regression bugs of GCC. The following describes how we collect the necessary information for this study.
Identifying Regression Bug Reports. If a bug report is confirmed by a GCC developer as a regression, the summary of the bug report will be prefixed with a keyword regression.
Identifying Culprit Revisions. GCC developers often link the culprit revisions with the regressions as additional comments to the bug reports. Although these correlations are only expressed in natural language, they follow certain frequent patterns. For example, given a culprit revision r, the link can be written as started with r, caused by r, or regressed with r.
We first collect bug reports of which the summaries contain the word regression, and then search the comments in these reports for the patterns of culprit revisions. In total, we have collected 1,248 regression bugs.
Figure 6 shows the empirical cumulative distribution function of the time interval between when a regression is introduced and when it is triggered. On average it takes 163 days, but 50% of these regressions only need 20 days to uncover.

[Figure 7(a): empirical CDF plot; x-axis: Number of Lines of Code in a Fix, y-axis: Fraction of Bugs; curves for GCC and LLVM, with the point (100, 92%) marked.]
(a) This graph shows the number of lines of code in a bug fix. The empirical cumulative distribution curves of GCC and LLVM are almost the same. Most of the bug fixes (92%) contain fewer than 100 lines of code.
(b) The statistics of the lines of code modification in bug fixes.
      Mean  Median  SD   Min  Max
GCC   43    10      264  1    20028
LLVM  38    11      161  1    5333
Figure 7: Lines of code modification in bug fixes.

5.3 Size of Bug Fixes
We consider all the 22,947 revisions of GCC and 8,452 revisions of LLVM and exclude the non-functional files (e.g., change logs, documents, executables, test cases) in revisions.
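One way to approximate the size of a fix, assuming each revision is available as a unified diff, is to count added and removed lines while skipping non-functional files. The Python sketch below does this; the path markers used to recognize non-functional files and the function names are assumptions for illustration and are not the filter used in the study.

```python
import re

# Path fragments treated as non-functional; these markers are assumptions
# for the sketch, not the filter used in the study.
NON_FUNCTIONAL_MARKERS = ("ChangeLog", "/testsuite/", "/test/", "/doc/", "/docs/")

# Hunk header "@@ -start,count +start,count @@"; a missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def is_functional(path):
    """Return True unless the path looks like a change log, document or test."""
    return not any(marker in path for marker in NON_FUNCTIONAL_MARKERS)


def fix_size(unified_diff):
    """Count added plus removed lines in functional files of a unified diff."""
    changed = 0
    counting = False          # is the current file functional?
    old_left = new_left = 0   # lines still expected in the current hunk
    for line in unified_diff.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk: classify the line by its first character.
            if line.startswith("+"):
                new_left -= 1
                if counting:
                    changed += 1
            elif line.startswith("-"):
                old_left -= 1
                if counting:
                    changed += 1
            elif line.startswith("\\"):
                pass  # e.g. "\ No newline at end of file"
            else:
                old_left -= 1   # context line belongs to both versions
                new_left -= 1
            continue
        if line.startswith("+++ "):
            # New-file header; anything after a tab is a timestamp or revision.
            counting = is_functional(line[4:].split("\t")[0])
            continue
        header = HUNK_HEADER.match(line)
        if header:
            old_left = int(header.group(1) or 1)
            new_left = int(header.group(2) or 1)
    return changed
```

Tracking the line counts from each hunk header keeps file headers such as `--- a/file` from being mistaken for removed lines when several files appear in one diff.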
[Plot (a): fraction of bugs (y-axis) against the number of functions in a fix (x-axis), for GCC and LLVM.]
(a) This graph shows the number of functions modified in a bug fix. The empirical cumulative distribution curve indicates that over half of the investigated bugs only involve one function (i.e., 58% for GCC, and 54% for LLVM), and most of the bug fixes (90% or so) involve no more than 5 functions.

(b) The statistics of the number of functions in bug fixes.
        Mean  Median  SD    Min  Max
GCC     2.7   1       7.0   1    434
LLVM    3.4   1       18.4  1    972

Figure 8: Number of functions modified in bug fixes.

[Plot (a): fraction of bugs (y-axis) against the duration of bugs in days (x-axis), for GCC and LLVM.]
(a) This graph shows the empirical cumulative distribution function of bugs through time.

(b) The statistics of the duration of bugs.
        Mean  Median  SD   Min  Max
GCC     200   28      448  1    5686
LLVM    111   7       268  1    2967

Figure 9: This figure shows the statistics of the duration of bugs. Figure (a) is the empirical cumulative distribution function of the duration of bugs, and Table (b) displays the statistics of the duration. On average, a bug is fixed within a year. Comparatively, GCC takes a longer time to fix bugs than LLVM.

5.3.1 Lines of Code
We obtain the difference between two versions from the two version control systems respectively, and count the lines of code modified in the source files. As Figure 7a shows, 92% of the bug fixes contain fewer than 100 lines of code, and 50% of the bug fixes contain fewer than 10 lines. This indicates that most of the bugs only touch a small portion of the compiler code. Table 7b shows the statistics of the sizes of the bug fixes. On average, the number of lines of code modified in a bug fix is approximately 40, and the median is about 11.

5.3.2 Number of Functions

We investigate the number of functions modified in a bug fix. The information is acquired by (1) retrieving the changed source files at a specific revision, (2) parsing the files, and (3) locating the functions whose line ranges intersect with the line numbers recorded in the version control systems. Figure 8a shows the relation between the number of functions revised in a revision and the fraction of bugs. Around 90% of the bugs involve at most 5 functions in GCC and LLVM. Moreover, more than 58% of the GCC bugs and 54% of the LLVM bugs involve only a single function. Table 8b shows the summary statistics. In terms of number of functions, the median is one for both compilers, and the mean is 2.7 for GCC and 3.4 for LLVM.

5.3.3 Discussion

The data (the lines of code and number of functions modified in a revision) imply that a compiler bug is usually not so complex that a severe collateral effect is imposed on multiple functions. Considering the plots and tables in Figures 7 and 8, we can conclude that compiler bugs are often local. Although compilers are intricate (such as the complex optimization algorithms), the bug fixes only have limited impact on the entire code-base. On the other hand, this indirectly demonstrates that both compilers are well designed with clear modularity and limited coupling among modules.

6. DURATION OF BUGS

This section investigates how bugs are distributed through time. A duration spans the time from when a bug report is filed in the bug tracking system to the time when the bug is fixed. Ideally, a bug should be fixed right after it is reported, which is a zero-day duration. However, in reality, due to limited resources and various constraints, a bug usually takes time, sometimes years, to be fixed.

6.1 Collecting Duration Data

In the bug tracking systems of GCC and LLVM, when a bug report is filed, its creation date is saved in the database, which can be used as the birth date of the bug. The other field of date type in a bug report is the modification field, which records the date of the last revision to the report. However, it cannot be used as the killed date of the bug, as even after a bug is resolved as fixed, the developers may still edit the bug report, such as adjusting the target version or adding comments. Hence, using the time interval between the creation date and the modification date is inaccurate. For example, the third bug6 of GCC was reported on 1999-08-26, the last modification was on 2013-07-23, and therefore the time interval is more than ten years. However, based on its revision history,7 the bug was already fixed on 2001-01-08, so the duration of the bug should be less than two years, not more than ten.

6 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3
7 https://gcc.gnu.org/bugzilla/show_activity.cgi?id=3

In order to compute the bug lifetime information accurately, we retrieve both the bug reports and the bug history information from the two bug tracking systems. For the start date of a bug, we use the creation date recorded in the bug report. For the end date of the bug, we scan its revision history in reverse chronological order, and find the date of
the last revision to set the resolution field to fixed. Taking bug #3 as an example, its duration is the time interval between its creation date 1999-08-26 in the report and the revision date 2001-01-08 that set it as fixed.

6.2 Duration Analysis

Figure 9 shows the empirical cumulative distribution function of bugs over time, and the statistics of the bug duration. On average, the bugs of GCC are fixed within 200 days, and those of LLVM within 111 days. The medians of the durations are 28 days and 7 days respectively.

Table 6 further breaks down the duration into two segments, i.e., the duration between when a bug is reported and when it is confirmed, and the duration between when it is confirmed and when it is fixed. On average, GCC takes less time than LLVM to confirm a bug but more time to fix it.

Table 6: The breakdown of the duration.

(a) Duration of bugs between when they are reported and confirmed.
        Mean  Median  SD   Min  Max
GCC     61    2       205  1    3816
LLVM    95    5       245  1    2967

(b) Duration of bugs between when they are confirmed and fixed.
        Mean  Median  SD   Min  Max
GCC     139   4       396  1    5427
LLVM    17    1       113  1    2446

7. PRIORITIES OF BUGS

This section first studies the priority of compiler bugs, and then investigates the correlation between priorities and other types of bug information, i.e., compiler components and bug duration.

7.1 Priority Distribution

The priority field of a bug report is set by developers to prioritize their bugs, expressing the order in which bugs should be fixed. GCC has five levels: P1, P2, P3, P4 and P5. P1 is the most urgent level and bugs of this level should be fixed soon, and P5 has the lowest priority. P3 is the default priority level. Any new bug report is initially labeled as P3 by default, then the triage team will determine its priority based on the impact of the bug on users and the available human and time resources [32].

LLVM developers do not prioritize bugs explicitly in the bug repository, so we only conduct priority-related analyses on the GCC data set.

Table 7: The priority distribution of the GCC bugs.
P1     P2      P3      P4     P5
6.02%  22.92%  68.24%  1.82%  0.99%

Table 7 lists the breakdown of the five priority categories in GCC. As the default priority, P3 accounts for about two thirds of the bugs (68.24%), whereas P4 and P5 are the extreme cases with the fewest bugs.

7.2 Priority and Component Correlation

We study the correlation between priorities and components of bugs, that is, which compiler component tends to be more important and thus its bugs are more severe than the other components.

Given a compiler component c and a priority level p ∈ [1, 5] corresponding to ⟨P1, P2, P3, P4, P5⟩, let R be the set of the bugs in c. We define the following function to compute the fraction of the bugs with the given priority p among all the bugs in the component c.

\[ \psi(c, p) = \frac{|\{r \in R \mid \text{the priority of } r \text{ is } p\}|}{|R|} \]

We then define a total order between components in the order of their fractions of five different priorities.

\[ \Theta(c_1, c_2) = \theta(c_1, c_2, 1) \]

\[ \theta(c_1, c_2, p) = \begin{cases} > & \text{if } \psi(c_1, p) > \psi(c_2, p) \\ < & \text{else if } \psi(c_1, p) < \psi(c_2, p) \\ = & \text{else if } p = 5 \\ \theta(c_1, c_2, p + 1) & \text{otherwise} \end{cases} \]

Basically, if a component c1 has a larger fraction of bugs with high priorities than c2, then c1 is greater than c2. Bugs in a "large" component are more likely to be prioritized higher than those in "small" components, and should draw more attention for testing and validation.

[Plot: fractions of P1 to P5 bugs (y-axis, 0 to 1) for each of the top 15 GCC components (x-axis); the component labels are not recoverable from the extraction.]
Figure 10: Correlation between priorities and components

We rank all the components of GCC in the descending order of Θ. Figure 10 shows the top 15 components. The first component is ipa, which includes the inlining and other inter-procedural optimizations and the infrastructure supporting them. 30% of its bugs are labeled as P1. This ratio is significantly higher than that of the second component, which is only 14%. Many other optimization components also appear in this list, such as tree-opt (optimizations over the high-level tree representation), middle-end (optimizations over the GIMPLE representation [9]), rtl-opt (optimizations over the low-level architecture-neutral register transfer language representation [10]), and lto (link-time optimization).

7.3 Priority and Duration Correlation

This subsection studies the correlation between priorities and the lifetime of bugs. Intuitively, bugs with higher priorities should be fixed in a shorter time than those with lower priorities. The results in this section partially invalidate this hypothesis.

Table 11a lists the statistics of the lifetime of bugs for the five priority levels. Each row shows the information of all the bugs that are labeled with a certain priority. For example,
the second row shows the mean, median and standard deviation of all the bugs of P1. Additionally, we performed t-tests between every two priorities, and validated that, except for the difference between P2 and P4, the differences between all the others are statistically significant (p < 0.001).

(a) The statistics of the duration of bugs for different priority levels in GCC.
Priority  Mean  Median  SD   Min  Max
P1        72    22      149  1    2196
P2        262   49      502  1    4937
P3        185   21      420  1    5686
P4        268   48      473  1    3326
P5        450   170     617  1    3844

[Plot (b): fraction of bugs (y-axis) against the duration of bugs in months (x-axis, 1 to 5), one curve per priority level P1 to P5.]
(b) The cumulative breakdown of the bug lifetime for different priorities within 5 months

Figure 11: The correlation between duration and priorities of GCC bugs.

From a different perspective, Figure 11b graphically shows the cumulative breakdown of the bug lifetime for each priority level within five months. For example, 79% of P1 bugs are fixed within three months, 58% for P2, 68% for P3, 59% for P4 and 38% for P5.

Both the table and the figure demonstrate that the time spent in fixing bugs does not strictly follow the order of their priorities. On average, only the bugs with P1 are fixed much faster than the other priorities, and the bugs with P5 take the most time among all the priorities. The P2 bugs take more time than P3. The P1 bug with the maximum lifetime is Bug #25130.8 It was reopened twice, and took 2,196 days in total to resolve.

8 https://gcc.gnu.org/bugzilla/show_activity.cgi?id=25130

8. A PRELIMINARY APPLICATION

This section presents a proof-of-concept application of our findings. Specifically, we have designed tkfuzz, a simple yet effective program mutation algorithm for compiler testing, by leveraging the observation in Subsection 5.1 that bug-revealing test cases are usually small. Moreover, since Subsection 4.1 indicates that C++ is the most buggy component, our algorithm also works for C++, making it the first research effort on C++ compiler testing.

tkfuzz Given a test program represented as a sequence of tokens, tkfuzz randomly substitutes an identifier token (e.g., a variable or function name) with a different identifier token in the token list. Then the mutated programs are used to find crashing bugs in compilers.

Table 8: Bugs Found by tkfuzz
     Bug ID      Component   Status
 1   LLVM-24610  frontend    fixed
 2   LLVM-24622  frontend    fixed
 3   LLVM-24797  new-bugs    –
 4   LLVM-24798  c++         fixed
 5   LLVM-24803  new-bugs    –
 6   LLVM-24884  new-bugs    –
 7   LLVM-24943  new-bugs    –
 8   LLVM-25593  new-bugs    –
 9   LLVM-25634  new-bugs    –
10   GCC-67405   target      fixed
11   GCC-67581   c++         fixed
12   GCC-67619   middle-end  fixed
13   GCC-67639   middle-end  confirmed
14   GCC-67653   middle-end  fixed
15   GCC-67845   c++         fixed
16   GCC-67846   c++         fixed
17   GCC-67847   c++         fixed
18   GCC-68013   tree-opt    fixed

We applied tkfuzz to the GCC test suite to generate mutated programs. The test suite contains 36,966 test programs (24,949 for C and 12,017 for C++); the average size of each test program is only 32 lines of code, including blank lines and comments. Our testing was done on a quad-core Intel 3.40 GHz machine. We have reported 18 bugs, of which 12 have already been accepted or fixed. Table 8 shows the details of these bugs. Note that five bugs are in the two compilers' C++ components, confirming our observation in Subsection 4.1. Although this is a simple application, it clearly demonstrates the practical potential of our findings in this paper.

9. CALL FOR ACTIONS

We have shown that compiler bugs are common: there is a large number of them. This section discusses several directions that are worth pursuing based on the analysis results presented earlier.

Buggy C++ Components It is not very surprising that C++ components are the most buggy in both compilers, as C++ is one of the most complex programming languages. However, it is surprising that little research has been directed at testing C++. Although it is more difficult than testing C compilers [11, 35], it is very worthwhile because C++ is used as widely as C. Furthermore, based on the most buggy files of the two compilers (cf. Table 5), we can take gradual steps toward testing C++ by starting from the most buggy features such as templates and overloaded methods.

Small Regression Tests Figure 4 shows the size of the regression test cases, which are extracted from the test programs attached to bug reports. 95% of them have fewer than 100 lines of code, and more than 50% have fewer than 25 lines of code. This interesting (and surprising) finding indicates that randomized compiler testing techniques can leverage this fact by producing small but complex test programs. This not only can stress test compilers, but may also enhance testing efficiency. Our preliminary application of this finding in Section 8 also demonstrates its potential impact.

Locality of Fixes Section 5.3 shows that most of the bug fixes only touch one function (58% for GCC and 54% for LLVM), and they tend to be local. Thus, it would be promising to investigate techniques that can direct testing
at a chosen component, rather than treating the compiler as a monolithic whole.

Duration of Bugs Section 6 shows that the average time to triage and resolve a compiler bug is a few months. This may be because compilers are very complex and compiler bugs are difficult to resolve. One has to understand the root cause of a bug and decide how to fix it; at the same time, it is important not to overlook certain cases and to avoid regressions. Thus, practical techniques are needed to aid compiler developers with such tasks. Lopes et al.'s recent work on Alive [16] can be viewed as a relevant example in this direction: it helps developers write and debug peephole optimizations.

10. RELATED WORK

We survey two lines of closely related research.

Empirical Studies on Bugs Much work has been devoted to studying the characteristics of various bugs in various software systems. Chou et al. [7] conducted an empirical study on approximately one thousand operating system errors. The errors were collected by applying static automatic compiler analysis to Linux and OpenBSD kernels. They found that device drivers had many more bugs than the rest of the kernels. Lu et al. [17] studied the characteristics of concurrency bugs by examining 105 concurrency bugs randomly selected from four real-world programs (MySQL, Apache, Mozilla and OpenOffice). Their findings further the understanding of concurrency bugs and highlight future directions for concurrency bug detection, diagnosis and fixing. Sahoo et al. [25] analyzed 266 reported bugs found in released server software, such as MySQL, Apache, and SVN. Based on the findings, they discussed several implications on reproducing software failures and designing automated diagnosis tools for production runs of server software. Li et al. [15] analyzed the trend of bugs by applying natural language text classification techniques to about 29,000 bugs of Mozilla and Apache. Thung et al. [31] studied the bugs in machine learning systems and categorized them based on the characteristics of bugs. Song et al. [26] studied performance bugs in open source projects and, based on their characteristics, proposed a statistical debugging technique.

Our work complements these previous studies, with a specific focus on compiler bugs. Different from application bugs, compiler bugs can be much more difficult to notice and debug, and for application developers, compilers are usually assumed to be bug-free. In this paper, we show that compiler bugs are also common, and more than 65% of the bugs of GCC and LLVM are reported by external users. For researchers working on compiler testing and validation, we show that certain compiler components have higher bug rates than the others, and should be paid more attention to.

Compiler Testing Due to compilers' complexity, testing is still the major technique to validate the correctness of production compilers. In addition to internal regression test suites, compiler developers can also use external commercial conformance testing suites [1, 22] to further test whether compilers conform to language standards or specifications. However, such manual test suites may still be inadequate and therefore recently researchers have started to employ randomized testing to further stress test compilers.

One of the most successful approaches is Csmith [5, 24, 35], which has found several hundred bugs in GCC and LLVM. Compared to traditional random C program generators, which target compiler crashes, Csmith is able to generate valid C programs by avoiding introducing undefined behaviors, and is hence capable of finding mis-compilation bugs. It has also been applied to test virtual machines [18], CPU emulators [19] and static analyzers, such as Frama-C [8].

Another successful compiler testing technique is Equivalence Modulo Inputs (EMI) [11, 12, 13]. It has found several hundred bugs in GCC and LLVM. EMI is a general compiler testing methodology to derive semantically equivalent variants from existing programs. It introduces an alternative view of differential testing. Given a program P, instead of verifying the consistency between executables compiled by multiple compilers or multiple versions of a compiler on P, it tests the consistency w.r.t. an input I between executables from P and P′ compiled by the same compiler, where P′ is an EMI variant of P w.r.t. I.

A considerable amount of effort has also been put into testing different compilers or different components in compilers. Zhao et al. proposed a tool JTT to test the EC++ embedded compiler [39]. Nagai et al. [20, 21] proposed a technique to test the arithmetic optimizers of compilers. CCG is another random C program generator that targets compiler crashing bugs [2]. Sun et al. [29] proposed an approach to finding bugs in compiler warning diagnostics.

Our study highlights that in GCC and LLVM, C++ has the highest bug rate of all the compiler components, much higher than the C component. It would be interesting to devise effective testing strategies, theories and tools to test C++ compilers, as it is also a popular but more complex programming language widely used in industry. We also analyze the statistics of bug-revealing test cases and bug fixes and find that most of the test cases are small and most of the bug fixes are local to a small number of lines. These observations can potentially serve as good heuristics to guide random program generators to generate test programs that are small in size but effective at detecting new bugs.

11. CONCLUSION

This paper has presented a study of a total of 39,890 bugs and 22,947 bug fixes of GCC, and 12,842 bugs and 8,452 bug fixes of LLVM, and analyzed the characteristics of compiler bugs. In particular, we have shown how bugs are distributed in components and source files (skewness of bugs in a small number of components and files), how bugs are triggered and fixed (both bug-triggering test cases and fixes are small in size), how long bugs live (on average 200 days for GCC and 111 for LLVM), and how bugs are prioritized.

We believe that our analysis results and findings provide insight into understanding compiler bugs and guidance toward better testing and debugging compilers. All our data and code are publicly available at http://chengniansun.bitbucket.org/projects/compiler-bug-study/.

Acknowledgments

We are grateful to the anonymous reviewers for their insightful comments. This research was supported in part by the United States National Science Foundation (NSF) Grants 1117603, 1319187, 1349528 and 1528133, and a Google Faculty Research Award. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
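As a supplementary illustration, the token-substitution step that Section 8 describes for tkfuzz can be sketched in a few lines of Python. This is a minimal sketch and not the tkfuzz implementation itself: the regular-expression lexer, the small keyword list, and the function names `tokenize`, `is_identifier` and `mutate` are all assumptions made for the example.

```python
import random
import re

# Very rough lexer (an assumption): identifiers, integers, or any other
# single non-space character. A real tool would use a proper C/C++ lexer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")

# Small illustrative keyword subset, so keywords are not mutated as identifiers.
KEYWORDS = {"int", "char", "void", "return", "if", "else", "for", "while",
            "struct", "class", "template", "typename"}


def tokenize(source):
    """Split a program into a flat list of tokens."""
    return TOKEN_RE.findall(source)


def is_identifier(token):
    return IDENT_RE.fullmatch(token) is not None and token not in KEYWORDS


def mutate(tokens, rng):
    """Replace one randomly chosen identifier token with a different
    identifier taken from the same token list; return None if the program
    has fewer than two distinct identifiers."""
    positions = [i for i, tok in enumerate(tokens) if is_identifier(tok)]
    names = sorted({tokens[i] for i in positions})
    if len(names) < 2:
        return None
    target = rng.choice(positions)
    replacement = rng.choice([n for n in names if n != tokens[target]])
    mutated = list(tokens)
    mutated[target] = replacement
    return mutated
```

A mutated program could then be obtained with `" ".join(mutate(tokenize(source), random.Random(seed)))` and passed to the compiler under test; invoking the compiler and detecting crashes are left out of the sketch.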
References

[1] ACE. SuperTest compiler test and validation suite. http://www.ace.nl/compiler/supertest.html.

[2] A. Balestrat. CCG: A random C code generator. https://github.com/Merkil/ccg/.

[3] S. Blazy, Z. Dargaye, and X. Leroy. Formal Verification of a C Compiler Front-End. In Int. Symp. on Formal Methods (FM), pages 460–475, 2006.

[4] N. Chen, S. C. H. Hoi, and X. Xiao. Software Process Evaluation: A Machine Learning Approach. In ASE, pages 333–342, Washington, DC, USA, 2011.

[5] Y. Chen, A. Groce, C. Zhang, W.-K. Wong, X. Fern, E. Eide, and J. Regehr. Taming compiler fuzzers. In Proceedings of the 2013 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 197–208, 2013.

[6] R. Chillarege, W.-L. Kao, and R. G. Condit. Defect Type and Its Impact on the Growth Curve. In Proceedings of the 13th International Conference on Software Engineering (ICSE), pages 246–255, 1991.

[7] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An Empirical Study of Operating Systems Errors. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (SOSP), pages 73–88, 2001.

[8] P. Cuoq, B. Monate, A. Pacalet, V. Prevosto, J. Regehr, B. Yakobowski, and X. Yang. Testing static analyzers with randomly generated programs. In A. Goodloe and S. Person, editors, NASA Formal Methods, volume 7226 of Lecture Notes in Computer Science, pages 120–125. Springer Berlin Heidelberg, 2012.

[9] GCC. GIMPLE – GNU Compiler Collection (GCC) Internals. https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html, accessed: 2014-06-25.

[10] GCC. RTL – GNU Compiler Collection (GCC) Internals. https://gcc.gnu.org/onlinedocs/gccint/RTL.html, accessed: 2014-06-25.

[11] V. Le, M. Afshari, and Z. Su. Compiler Validation via Equivalence Modulo Inputs. In Proceedings of the 2014 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2014.

[12] V. Le, C. Sun, and Z. Su. Finding Deep Compiler Bugs via Guided Stochastic Program Mutation. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 386–399. ACM, 2015.

[13] V. Le, C. Sun, and Z. Su. Randomized Stress-Testing of Link-Time Optimizers. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA), pages 327–337. ACM, 2015.

[14] X. Leroy, A. W. Appel, S. Blazy, and G. Stewart. The CompCert Memory Model, Version 2. Research report RR-7987, INRIA, June 2012.

[15] Z. Li, L. Tan, X. Wang, S. Lu, Y. Zhou, and C. Zhai. Have Things Changed Now?: An Empirical Study of Bug Characteristics in Modern Open Source Software. In Proceedings of the 1st Workshop on Architectural and System Support for Improving Software Dependability (ASID), pages 25–33, 2006.

[16] N. P. Lopes, D. Menendez, S. Nagarakatte, and J. Regehr. Provably correct peephole optimizations with alive. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 22–32, 2015.

[17] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 329–339, 2008.

[18] L. Martignoni, R. Paleari, G. Fresi Roglia, and D. Bruschi. Testing system virtual machines. In Proceedings of the 19th International Symposium on Software Testing and Analysis (ISSTA), pages 171–182, 2010.

[19] L. Martignoni, R. Paleari, A. Reina, G. F. Roglia, and D. Bruschi. A methodology for testing cpu emulators. ACM Trans. Softw. Eng. Methodol., 22(4):29:1–29:26, Oct. 2013.

[20] E. Nagai, H. Awazu, N. Ishiura, and N. Takeda. Random testing of C compilers targeting arithmetic optimization. In Workshop on Synthesis And System Integration of Mixed Information Technologies (SASIMI 2012), pages 48–53, 2012.

[21] E. Nagai, A. Hashimoto, and N. Ishiura. Scaling up size and number of expressions in random testing of arithmetic optimization of C compilers. In Workshop on Synthesis And System Integration of Mixed Information Technologies (SASIMI 2013), pages 88–93, 2013.

[22] Plum Hall, Inc. The Plum Hall Validation Suite for C. http://www.plumhall.com/stec.html.

[23] A. Pnueli, M. Siegel, and E. Singerman. Translation Validation. In 4th International Conference on Tools and Algorithms for Construction and Analysis of Systems (TACAS), pages 151–166, 1998.

[24] J. Regehr, Y. Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang. Test-case reduction for C compiler bugs. In Proceedings of the 2012 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 335–346, 2012.

[25] S. K. Sahoo, J. Criswell, and V. Adve. An Empirical Study of Reported Bugs in Server Software with Implications for Automated Bug Diagnosis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE), pages 485–494, 2010.

[26] L. Song and S. Lu. Statistical Debugging for Real-world Performance Problems. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA), pages 561–578, 2014.
[27] M. Sullivan and R. Chillarege. A Comparison of Software Defects in Database Management Systems and Operating Systems. In Twenty-Second International Symposium on Fault-Tolerant Computing (FTCS), pages 475–484, July 1992.

[28] C. Sun, J. Du, N. Chen, S.-C. Khoo, and Y. Yang. Mining Explicit Rules for Software Process Evaluation. In ICSSP, pages 118–125, 2013.

[29] C. Sun, V. Le, and Z. Su. Finding and Analyzing Compiler Warning Defects. In Proceedings of the 38th International Conference on Software Engineering (ICSE). ACM, 2016.

[30] C. Sun, D. Lo, X. Wang, J. Jiang, and S.-C. Khoo. A Discriminative Model Approach for Accurate Duplicate Bug Report Retrieval. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE), pages 45–54, 2010.

[31] F. Thung, S. Wang, D. Lo, and L. Jiang. An Empirical Study of Bugs in Machine Learning Systems. In Software Reliability Engineering (ISSRE), 2012 IEEE 23rd International Symposium on, pages 271–280, Nov 2012.

[32] Y. Tian, D. Lo, and C. Sun. DRONE: Predicting Priority of Reported Bugs by Multi-factor Analysis. In 29th IEEE International Conference on Software Maintenance (ICSM), pages 200–209, Sept 2013.

[33] TIOBE. TIOBE Index for May 2016. http://www.tiobe.com/tiobe_index, accessed: 2016-05-15.

[34] J.-B. Tristan and X. Leroy. Formal Verification of Translation Validators: A Case Study on Instruction Scheduling Optimizations. In Proceedings of the 35th ACM Symposium on Principles of Programming Languages (POPL), pages 17–27, Jan. 2008.

[35] X. Yang, Y. Chen, E. Eide, and J. Regehr. Finding and Understanding Bugs in C Compilers. In Proceedings of the 2011 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 283–294, 2011.

[36] Z. Yin, X. Ma, J. Zheng, Y. Zhou, L. N. Bairavasundaram, and S. Pasupathy. An empirical study on configuration errors in commercial and open source systems. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP), pages 159–172, 2011.

[37] Z. Yin, D. Yuan, Y. Zhou, S. Pasupathy, and L. Bairavasundaram. How Do Fixes Become Bugs? In 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE), pages 26–36, 2011.

[38] A. Zeller and R. Hildebrandt. Simplifying and Isolating Failure-Inducing Input. IEEE Trans. Softw. Eng., 28(2):183–200, Feb. 2002.

[39] C. Zhao, Y. Xue, Q. Tao, L. Guo, and Z. Wang. Automated test program generation for an industrial optimizing compiler. In ICSE Workshop on Automation of Software Test (AST), pages 36–43, 2009.

[40] T. Zimmermann, N. Nagappan, P. J. Guo, and B. Murphy. Characterizing and Predicting Which Bugs Get Reopened. In Proceedings of the 34th International Conference on Software Engineering (ICSE), pages 1074–1083, 2012.
