Toward Understanding Compiler Bugs in GCC and LLVM
ABSTRACT

Compilers are critical, widely-used complex software. Bugs in them have significant impact, and can cause serious damage when they silently miscompile a safety-critical application. An in-depth understanding of compiler bugs can help detect and fix them. To this end, we conduct the first empirical study on the characteristics of the bugs in two mainstream compilers, GCC and LLVM. Our study is significant in scale: it exhaustively examines about 50K bugs and 30K bug-fix revisions over more than a decade's span.

This paper details our systematic study. Summary findings include: (1) in both compilers, C++ is the most buggy component, accounting for around 20% of the total bugs and twice as many as the second most buggy component; (2) the bug-revealing test cases are typically small, with 80% having fewer than 45 lines of code; (3) most of the bug fixes touch a single source file with small modifications (43 lines for GCC and 38 for LLVM on average); (4) the average lifetime of GCC bugs is 200 days, and 111 days for LLVM; and (5) high priority tends to be assigned to optimizer bugs; most notably, 30% of the bugs in GCC's inter-procedural analysis component are labeled P1 (the highest priority).

This study deepens our understanding of compiler bugs. For application developers, it shows that even mature production compilers still have many bugs, which may affect development. For researchers and compiler developers, it sheds light on interesting characteristics of compiler bugs, and highlights challenges and opportunities to more effectively test and debug compilers.

CCS Concepts

• General and reference → Empirical studies; • Software and its engineering → Software testing and debugging;

Keywords

empirical studies, compiler bugs, compiler testing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ISSTA '16, July 18-22, 2016, Saarbrücken, Germany
(c) 2016 ACM. ISBN 978-1-4503-4390-9/16/07...$15.00
DOI: http://dx.doi.org/10.1145/2931037.2931074

1. INTRODUCTION

Compilers are an important category of system software. Extensive research and development efforts have been devoted to increasing compilers' performance and reliability. However, it may still surprise application developers that, similar to application software, production compilers also contain bugs, and in fact quite many. Furthermore, compiler bugs impact application code, and can even lead to severe damage, especially when a buggy compiler is used to compile safety-critical applications.

Different from most application bugs, compiler bugs are difficult to recognize, as they usually manifest indirectly as application failures. For example, a compiler bug may cause a program to be optimized and transformed into a wrong executable, and the bug manifests only when the executable misbehaves. Even worse, in most cases the application developer will first assume the misbehavior is caused by some bug introduced by herself/himself, and it may take a long time for her/him to realize that the compiler is the culprit.

In order to better understand compiler bugs, we conduct the first empirical study on their characteristics. Although there have been a number of empirical studies on software bugs [6, 7, 17, 25, 27, 31, 36, 37], none focuses on compiler bugs. For example, Lu et al. [17] study the characteristics of concurrency bugs, Sahoo et al. [25] investigate the bugs in server software, and Chou et al. [7] research the errors in operating system kernels. In contrast, this paper examines the bugs of two mainstream production compilers, GCC and LLVM: in total, 39,890 bugs of GCC and 12,842 bugs of LLVM. In order to explore the properties of compiler bug fixes, we also examine 22,947 GCC revisions and 8,452 LLVM revisions, which are fixes to most of these bugs. In particular, we investigate compiler bugs along four central aspects:

(1) Location of Bugs. We compute the distribution of bugs over compiler components and source files. The data shows that in both GCC and LLVM, the C++ component is the most buggy one, with a defect rate twice as high as the second most buggy component. In addition, we find that most of the source files contain only one bug (60% for GCC and 53% for LLVM), and most of the top ten buggy source files belong to the C++ front end.

(2) Test Cases, Localization and Fixes of Bugs. We investigate the size of the test cases that trigger compiler bugs, and find that on average GCC test cases have 32 lines of code and those of LLVM have 29 lines. This observation can guide random testing of compilers. For example, instead
of generating large, simple test programs, compiler testing may be more effective when generating small, yet complex test programs. Moreover, it confirms that most of the bugs can be effectively reduced by techniques such as Delta [38] or C-Reduce [24]. The fixes of compiler bugs are not big either: on average they touch only 43 lines of code for GCC and 38 for LLVM. 92% of them involve fewer than 100 lines of code changes, and 58% for GCC and 54% for LLVM involve only one function. Our findings reveal that most of the bug fixes are local. Even for optimizers, each bug usually comes from a single optimization pass.

(3) Duration of Bugs. We compute information on the duration of bugs (i.e., the time between when a bug was filed and when it was closed), and show that the average duration of GCC bugs is 200 days and that of LLVM bugs is 111 days. We also show that the GCC community confirms bugs faster than LLVM but spends more time fixing them.

(4) Priorities of Bugs. The priority field of a bug report is determined by developers for prioritizing bugs, expressing the order in which bugs should be fixed. We investigate the distribution of bugs over priorities. Of the GCC bugs, 68% have the default P3 priority, and only 6.02% are labeled P1. We then study how priorities are correlated with components. The inter-procedural analysis component of GCC, ipa, is the most "impactful", as 30% of its bugs are labeled P1. We also study how priorities affect the lifetime of bugs. On average, P1 bugs are fixed the fastest, whereas P5 bugs are fixed the slowest. However, for the rest, the fix time is not strictly correlated with their priorities.

As a proof-of-concept demonstration of the practical potential of our findings, we design a simple yet effective program mutation algorithm, tkfuzz, to find compiler crashing bugs, leveraging the second observation that bug-revealing test programs are usually small. Applying tkfuzz to the 36,966 test programs in the GCC test suite (with an average of 32 lines of code each), we have found 18 crashing bugs in GCC and LLVM, of which 12 have already been fixed or confirmed.

The results presented in this paper provide further insights into understanding compiler bugs and better guidance toward effectively testing and debugging compilers. To ensure reproducibility and to benefit the community, we have made all our data and code publicly available at http://chengniansun.bitbucket.org/projects/compiler-bug-study/.

Paper Organization. Section 2 describes the bugs used in this study and potential threats to validity. Section 3 introduces general properties of these bugs, while Section 4 studies the distribution of compiler bugs over components and source files, respectively. In Section 5, we study bug-revealing test cases and bug fixes. Section 6 then investigates the lifetime of bugs, and Section 7 studies the priorities of bugs and their relation to other properties such as components and duration. Section 8 presents a preliminary application of the findings in this paper to compiler testing. We discuss, in Section 9, how to utilize the findings in this paper. Finally, Section 10 surveys related work, and Section 11 concludes.

2. METHODOLOGY

This section briefly introduces the compilers used in our study and describes how bugs are collected. We also discuss limitations and threats to the validity of this study.

2.1 Source of Bugs

In this paper, we study the bugs in two mainstream compiler systems, GCC and LLVM.

GCC. GCC is a compiler system produced by the GNU Project, supporting various languages (e.g., C, C++, and Fortran) and various target architectures (e.g., PowerPC, x86, and MIPS). It has been under active development since the late 1980s. The latest version is 5.3, which was released on December 4, 2015.

LLVM. LLVM is another popular compiler infrastructure, providing a collection of modular and reusable compiler and toolchain technologies for arbitrary programming languages. Similar to GCC, LLVM also supports multiple languages and multiple target architectures. The project started in 2000 and has drawn much attention from both industry and academia. The latest version is 3.7.1, released on January 5, 2016.

Table 1: The information of the bugs used in this study.

Compiler | Start    | End      | Bugs   | Revisions
GCC      | Aug-1999 | Oct-2015 | 39,890 | 22,947
LLVM     | Oct-2003 | Oct-2015 | 12,842 | 8,452

Collection of Bugs. Our study focuses on fixed bugs. We say a bug is fixed if its resolution field is set to fixed and its status field is set to resolved, verified or closed in the bug repositories of GCC and LLVM.

Identifying Bug Fix Revisions. We then identify the revisions that correspond to these fixed bugs. We collect the entire revision log from the code repositories, and for each revision, we scan its commit message to check whether the revision is a fix to a bug. GCC and LLVM developers usually add a marker to the commit message following one of the two patterns below:

• "PR <bug-id>"
• "PR <component>/<bug-id>"

where the prefix "PR" stands for "Problem Report" and <bug-id> is the id of the corresponding bug. We use these patterns to link a revision to the bug that it fixes.

Table 1 shows the numbers of bugs and their accompanying revisions used in our study. We analyze 50K fixed bugs and 30K revisions in total, including 1,858 GCC and 1,643 LLVM enhancement requests.

2.2 Threats to Validity

Similar to other empirical studies, our study is potentially subject to several threats, namely the representativeness of the chosen compilers, the generalizability of the studied bugs, and the correctness of the examination methodology.

Regarding the representativeness of the chosen compilers, we use GCC and LLVM, two popular compiler projects written in C/C++ with multiple language front ends, a set of highly effective optimization algorithms, and various architecture back ends. Both are widely deployed on the Linux and OS X operating systems. We believe these two compilers represent most traditional compilers well. However, they may not reflect the characteristics of just-in-time compilers (e.g., the Java HotSpot virtual machine) or interpreters (e.g., Perl, Python).

Regarding the generalizability of the studied bugs, we uniformly use all the bug reports satisfying the selection criteria stated in Section 2.1, with no human intervention. For the unresolved or invalid bugs, we believe that they are not likely to be as interesting as the bugs investigated in this paper. As for the identification of revisions containing bug fixes, based on our interaction with the GCC and LLVM developers on a large number of bugs we reported before, GCC developers usually link the revisions and the bugs explicitly, while LLVM developers do so less often. Given the large number of GCC and LLVM revisions, we believe that the analysis results should be representative.

Regarding the correctness of the analysis methodology, we have automated every analysis mentioned in this paper. We also have hands-on experience in analyzing bug report repositories and code revision histories from our previous studies.

3. GENERAL STATISTICS

This section shows the general statistics of the two bug repositories. Figure 1 shows the overall evolution of GCC and LLVM's bug repositories. In particular, it shows the number of bugs reported, rejected or fixed each month. It also shows the number of bug reports that have never been confirmed. Each vertical dashed line represents the date of a compiler release, and the label on the top shows the version number.

[Figure 1 plots omitted: monthly bug counts for (a) GCC and (b) LLVM, with vertical dashed lines marking compiler releases.]
Figure 1: The overall evolution history of the bug repositories (in months). The plot filled with a gray background shows the new bug reports submitted every month. The blue dash-dotted plot shows the number of bugs fixed every month. The red dashed plot shows the number of bugs that are resolved as not fixed (e.g., invalid, duplicate, worksforme or wontfix). The black curve shows the number of bug reports per month that have not been confirmed yet. Clearly, there is an increasing trend of unconfirmed bugs for LLVM.

The trends of the plots for GCC are relatively stable over these years compared to those of LLVM. After gaining popularity recently, LLVM has drawn much more attention than before, and more bug reports are being submitted monthly. However, the bug-fixing rate (indicated by the blue dash-dotted fixed curve) does not increase much, and more bugs are being left unconfirmed (shown as the black unconfirmed curve). This is likely due to limited human resources: we were told by active members of the LLVM community that some Apple developers were pulled into the Swift project.1 This may give external bug reporters the impression that the community is not as responsive as GCC, although bugs are fixed as regularly as before.

1 https://developer.apple.com/swift/

3.1 Rejected Bug Reports

Table 2: The number and the percentage of bug reports resolved as invalid, worksforme or wontfix.

     | invalid     | worksforme | wontfix
GCC  | 7,072/10.4% | 1,151/1.7% | 1,463/2.2%
LLVM | 1,639/6.7%  | 717/2.9%   | 593/2.4%

Table 2 shows the information on the bugs which are resolved as invalid, worksforme or wontfix. An invalid bug report is one in which the associated test program is invalid (for example, it contains undefined behavior, a bug in other software, a misunderstanding of the language standard, or a misuse of the standard libraries), or the reported "anomalous" behavior is in fact deliberate. If the bug described in a report cannot be reproduced, the report is labeled as worksforme.
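The revision-to-bug linking described in Section 2.1 amounts to matching the two "PR" marker patterns in commit messages. The following is our own illustrative sketch of that matching (not the study's tooling); the example messages are hypothetical.

```python
import re

# Matches "PR <bug-id>" and "PR <component>/<bug-id>" markers as described
# in Section 2.1; component names like "middle-end" or "c++" are allowed.
PR_MARKER = re.compile(r"\bPR\s+(?:([\w+.-]+)/)?(\d+)")

def linked_bug(commit_message):
    """Return (component, bug_id) if the message references a bug, else None."""
    m = PR_MARKER.search(commit_message)
    if m is None:
        return None
    component, bug_id = m.groups()
    return component, int(bug_id)

print(linked_bug("re PR middle-end/67619 (wrong code)"))
print(linked_bug("PR 14963: add regression test"))
```

A real pipeline would run this over every commit message in the revision log and keep only the revisions whose referenced bug is in the fixed-bug set.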
A bug report is resolved as wontfix if the affected version, component or library of the compiler is not maintained, although the reported bug can be confirmed.

3.2 Duplicate Bug Reports

Reporting bugs in a bug tracking system is uncoordinated. Different people may file multiple reports on the same bug. In this case, the latter bug reports are referred to as duplicates.

Table 3: Distribution of duplicate bug reports. The first row lists the number of duplicates, and the second shows the number of bugs with the given number of duplicates.

(a) Duplicate bugs of GCC.
#Duplicates | 0      | 1     | 2   | 3   | 4  | 5  | ≥6
#Reports    | 35,933 | 2,924 | 596 | 215 | 98 | 41 | 83

(b) Duplicate bugs of LLVM.
#Duplicates | 0      | 1   | 2  | 3  | 4  | 5 | ≥6
#Reports    | 12,157 | 570 | 72 | 17 | 13 | 5 | 8

[Figure 2(a) bar chart omitted (x-axis: component, e.g., c++, target, fortran, tree-opt, middle-end, rtl-opt, libstdc++; y-axis: fraction of bugs).]
(a) The top ten buggy components out of 52 in GCC. The most buggy component is C++, containing 22% of the 39,890 bugs, nearly twice as buggy as the second one. These ten components account for 79% of the GCC bugs.

for GCC and 79% for LLVM). These components touch all critical parts of compilers: the front end (e.g., syntactic and semantic parsing), the middle end (e.g., optimizations) and the back end (e.g., code generation).

As the plots show, C++ is the most buggy component in GCC, accounting for around 22% of the bugs. It is much more buggy than the other components, as the bug rate of the second most buggy component is only half. In LLVM, bug reports can be submitted without specifying their components; therefore, a large number of bug reports are closed with the component new-bugs. However, based on the distribution of bugs in source files (which will be discussed in the next section), C++ is also the most buggy component in LLVM.

One possible explanation is that C++ has many more features than the other programming languages, supporting multiple paradigms (i.e., procedural, object-oriented, and generic programming). It is surprising that most of the research and engineering efforts have been devoted to testing or validating C compilers [3, 11, 12, 13, 14, 23, 29, 34, 35] but little to C++, although C++ is one of the most popular programming languages in industry (among the top three according to the TIOBE index [33]).

Figure 4: Size of regression test cases.
(a) The empirical cumulative distribution function of the sizes of the test cases that trigger the bugs in GCC and LLVM (x-axis: lines of code of regression test case; plot omitted).
(b) The statistics of the lines of code in regression test cases.

     | Mean | Median | SD | Min | Max
GCC  | 32   | 21     | 75 | 1   | 5,554
LLVM | 29   | 16     | 59 | 1   | 916
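The cumulative-distribution view used in Figure 4a is straightforward to reproduce from raw test-case sizes. Below is a minimal sketch of that computation (our illustration, not the study's code); the `loc` values are made up, not the study's data.

```python
from statistics import mean, median

def ecdf_at(sizes, threshold):
    """Fraction of test cases with at most `threshold` lines of code."""
    return sum(1 for n in sizes if n <= threshold) / len(sizes)

# Hypothetical per-test-case line counts.
loc = [3, 8, 12, 16, 21, 25, 40, 60, 90, 5554]

print(ecdf_at(loc, 100))        # fraction of tests with <= 100 LOC
print(mean(loc), median(loc))   # summary statistics as in Figure 4b
```

Evaluating `ecdf_at` over a grid of thresholds yields the full ECDF curve plotted in Figure 4a.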
files. Consistent with the distribution of bugs over components in Figure 2, the source files of the C++ component account for most bugs. For example, the file implementing perhaps the most complex feature of C++, the C++ template, is the most buggy file in GCC. This skewness of bugs again implies that we should devote more effort to testing C++ compilers, considering that C++ is one of the most widely-used programming languages.

[Figure 3 plot omitted (x-axis: number of bugs; y-axis: fraction of source files; GCC reaches (30, 0.1%), LLVM (30, 0.3%)).]
Figure 3: The figure shows the fraction of source files with a given number (≤ 30) of bugs for GCC and LLVM. The files with more than 30 bugs are skipped; their fraction is approximately monotonically decreasing. For GCC, there are 5,039 files in total, and the file with the most bugs is 'cp/pt.c' (with 817 bugs); for LLVM, there are 2,850 files in total, and the most buggy is 'SemaDecl.cpp' (with 301 bugs).

5. REVEALING AND FIXING BUGS

In this section, we investigate the properties of bug-revealing test cases and bug fixes.

5.1 Size of Bug-Revealing Test Cases

Figure 4a shows the empirical cumulative distribution function over the sizes of the test cases triggering the bugs of GCC and LLVM. As shown, most of the test cases (95%)
are smaller than 100 lines of code, and more than 60% are smaller than 25 lines of code.5 Table 4b shows the statistics, including the mean, median and standard deviation (SD).

These test cases are collected from the test suites of GCC and LLVM. For a new bug, besides the fix, the developer often creates a regression test case by extracting the attached bug-triggering code from the bug report. Meanwhile, a bug label is also created by concatenating a prefix pr and an infix bug-id; the label is later used as the file name of the test case or inserted into the test case as a comment. For example, the test case pr14963.c is the regression test case for bug 14963.

By identifying the bug labels, we have collected 11,142 test cases for GCC, whose size distribution is shown as the blue solid plot in Figure 4. We collected only 347 for LLVM, which is why its plot (i.e., the red dashed one) is not as smooth as that of GCC. However, the overall trend is similar in the two plots, and it supports the conclusion stated at the beginning of this section.

This observation can be leveraged to improve the effectiveness of random testing for compiler validation. A random program generator such as Csmith [35] or Orion [11] should focus on generating small but complex test programs rather than big but simple ones.

[Figure 5 diagram omitted: a test program t exercised across successive compiler revisions v_i, v_{i+1}, v_{i+2}.]
Figure 5: Compiler regression bugs.

Figure 6: Time to reveal regression bugs.
(a) The empirical cumulative distribution function of the time interval between when a regression is introduced and when it is triggered (x-axis: time to reveal regressions, in days; plot omitted).
(b) The statistics of the time to reveal regressions.

     | Mean | Median | SD  | Min | Max
GCC  | 163  | 20     | 335 | 1   | 3,492

[Figure 7 plots omitted; its sub-captions follow.]
(a) This graph shows the number of lines of code in a bug fix; the curve reaches (100, 92%). The empirical cumulative distribution curves of GCC and LLVM are almost the same, and most of the bug fixes (92%) contain fewer than 100 lines of code.
(b) The statistics of the lines of code modified in bug fixes.
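The bug-label identification described in Section 5.1 (a label is the prefix pr concatenated with a bug id, appearing in a test's file name or in a comment in its body) can be sketched as follows. This is our own illustration, not the study's tooling, and the file names below are hypothetical.

```python
import re

# A regression-test bug label is "pr" followed by the bug id, e.g. "pr14963".
LABEL = re.compile(r"\bpr(\d+)\b", re.IGNORECASE)

def bug_ids(test_name, test_text=""):
    """Collect the bug ids referenced by a test's file name and body."""
    ids = set()
    for source in (test_name, test_text):
        ids.update(int(m) for m in LABEL.findall(source))
    return ids

print(bug_ids("pr14963.c"))               # label in the file name
print(bug_ids("alias-1.c", "/* pr67619 */"))  # label in a comment
```

Running such a scan over the 36,966 files of a compiler test suite is how one would link regression tests back to the bug reports they guard against.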
Figure 8: Number of functions modified in bug fixes.
(a) This graph shows the number of functions modified in a bug fix (plot omitted). The empirical cumulative distribution curve indicates that over half of the investigated bugs involve only one function (58% for GCC and 54% for LLVM), and most of the bug fixes (90% or so) involve no more than 5 functions.
(b) The statistics of the number of functions in bug fixes.

     | Mean | Median | SD   | Min | Max
GCC  | 2.7  | 1      | 7.0  | 1   | 434
LLVM | 3.4  | 1      | 18.4 | 1   | 972

Figure 9: This figure shows the statistics of the duration of bugs. Figure (a) is the empirical cumulative distribution function of the duration of bugs (in days; plot omitted), and Table (b) displays the statistics of the duration. On average, a bug is fixed within a year. Comparatively, GCC takes a longer time to fix bugs than LLVM.
(b) The statistics of the duration of bugs.

     | Mean | Median | SD  | Min | Max
GCC  | 200  | 28     | 448 | 1   | 5,686
LLVM | 111  | 7      | 268 | 1   | 2,967

5.3.1 Lines of Code
We obtain the difference between two versions from the two version control systems, respectively, and count the lines of code modified in the source files.

As Figure 7a shows, 92% of the bug fixes contain fewer than 100 lines of code, and 50% of the bug fixes contain fewer than 10 lines. This indicates that most of the bugs touch only a small portion of the compiler code. Table 7b shows the statistics of the sizes of the bug fixes. On average, the number of lines of code modified in a bug fix is approximately 40, and the median is about 11.

5.3.2 Number of Functions

We investigate the number of functions modified in a bug fix. The information is acquired by (1) retrieving the changed source files at a specific revision, (2) parsing the files, and (3) locating the functions whose line ranges intersect with the line numbers recorded in the version control systems.

Figure 8a shows the relation between the number of functions revised in a revision and the fraction of bugs. Around 90% of the bugs involve at most 5 functions in GCC and LLVM. Moreover, more than 58% of the GCC bugs and 54% of the LLVM bugs involve only a single function. Table 8b shows the summary statistics: in terms of the number of functions, the median is one and the mean is no more than three.

5.3.3 Discussion

The data (the lines of code and the number of functions modified in a revision) imply that a compiler bug is usually not so complex that a severe collateral effect is imposed on multiple functions. Considering the plots and tables in Figures 7 and 8, we can conclude that compiler bugs are often local. Although compilers are intricate (containing, for example, complex optimization algorithms), the bug fixes have only limited impact on the entire code base. On the other hand, this indirectly demonstrates that both compilers are well designed, with clear modularity and limited coupling among modules.

6. DURATION OF BUGS

This section investigates how bugs are distributed through time. A duration spans from the time when a bug report is filed in the bug tracking system to the time when the bug is fixed. Ideally, a bug should be fixed right after it is reported, i.e., a zero-day duration. However, in reality, due to limited resources and various constraints, a bug usually takes time, sometimes years, to be fixed.

6.1 Collecting Duration Data

In the bug tracking systems of GCC and LLVM, when a bug report is filed, its creation date is saved in the database, which can be used as the birth date of the bug. The other field of date type in a bug report is the modification field, which records the date of the last revision to the report. However, it cannot be used as the end date of the bug, as even after a bug is resolved as fixed, the developers may still edit the bug report, for instance to adjust the target version or add comments. Hence, using the time interval between the creation date and the modification date is inaccurate. For example, the third bug6 of GCC was reported on 1999-08-26, and its last modification was on 2013-07-23; the time interval is therefore more than ten years. However, based on its revision history,7 the bug was already fixed on 2001-01-08, so the duration of the bug should be about two years, not ten.

In order to compute the bug lifetime information accurately, we retrieve both the bug reports and the bug history information from the two bug tracking systems. For the start date of a bug, we use the creation date recorded in the bug report. For the end date of the bug, we scan its revision history in reverse chronological order, and find the date of

6 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3
7 https://gcc.gnu.org/bugzilla/show_activity.cgi?id=3
the last revision that sets the resolution field to fixed. Taking bug #3 as an example, its duration is the time interval between its creation date, 1999-08-26, recorded in the report, and the revision date, 2001-01-08, that set it as fixed.

6.2 Duration Analysis

Figure 9 shows the empirical cumulative distribution function of bugs over time, and the statistics of the bug duration. On average, the bugs of GCC are fixed within 200 days, and those of LLVM within 111 days. The medians of the durations are 28 days and 7 days, respectively.

Table 6 further breaks the duration down into two segments, i.e., the duration between when a bug is reported and when it is confirmed, and the duration between when it is confirmed and when it is fixed. On average, it takes less time for GCC to confirm a bug, but a longer time to fix a bug, than LLVM.

Table 6: The breakdown of the duration.
(a) Duration of bugs between when they are reported and when they are confirmed.

     | Mean | Median | SD  | Min | Max
GCC  | 61   | 2      | 205 | 1   | 3,816
LLVM | 95   | 5      | 245 | 1   | 2,967

(b) Duration of bugs between when they are confirmed and when they are fixed.

     | Mean | Median | SD  | Min | Max
GCC  | 139  | 4      | 396 | 1   | 5,427
LLVM | 17   | 1      | 113 | 1   | 2,446

be more important and thus its bugs are more severe than the other components.

Given a compiler component c and a priority level p ∈ [1, 5] corresponding to <P1, P2, P3, P4, P5>, let R be the set of the bugs in c. We define the following function to compute the fraction of the bugs with the given priority p among all the bugs in the component c:

    ψ(c, p) = |{r ∈ R | the priority of r is p}| / |R|

We then define a total order between components in the order of their fractions of the five different priorities:

    Θ(c1, c2) = θ(c1, c2, 1)

    θ(c1, c2, p) =
        >                 if ψ(c1, p) > ψ(c2, p)
        <                 else if ψ(c1, p) < ψ(c2, p)
        =                 else if p = 5
        θ(c1, c2, p + 1)  otherwise

Basically, if a component c1 has a larger fraction of bugs with high priorities than c2, then c1 is greater than c2. Bugs in a "large" component are more likely to be prioritized higher than those in "small" components, and such components should draw more attention for testing and validation.

[Figure 11(b) plot omitted: the cumulative breakdown of the bug lifetime for priorities P1–P5 within 5 months (x-axis: duration of bugs, in months).]
Figure 11: The correlation between duration and priorities of GCC bugs.

[Table 8 fragment: the last seven of the 18 reported bugs.]
12 GCC-67619 middle-end fixed
13 GCC-67639 middle-end confirmed
14 GCC-67653 middle-end fixed
15 GCC-67845 c++ fixed
16 GCC-67846 c++ fixed
17 GCC-67847 c++ fixed
18 GCC-68013 tree-opt fixed

We applied tkfuzz to the GCC test suite to generate mutated programs. The test suite contains 36,966 test programs (24,949 for C and 12,017 for C++); the average size of each test program is only 32 lines of code, including blank lines and comments. Our testing was done on a quad-core Intel 3.40 GHz machine. We have reported 18 bugs, of which 12 have already been accepted or fixed. Table 8 shows the details of these bugs. Note that five bugs are in the two compilers' C++ components, confirming our observation in Subsection 4.1. Although this is a simple application, it clearly demonstrates the practical potential of our findings in this paper.

the second row shows the mean, median and standard deviation of all the bugs of P1. Additionally, we performed a t-test between every two priorities, and validated that, except for the difference between P2 and P4, the differences between all the others are statistically significant (p < 0.001).
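The fraction ψ and the component ordering Θ defined above translate directly into code. The following is a minimal sketch of that computation (our rendering of the definitions; the (component, priority) pairs below are made up, not the study's data).

```python
from functools import cmp_to_key

def psi(bugs, component, priority):
    """ψ(c, p): fraction of `component`'s bugs that carry the given priority."""
    in_c = [p for c, p in bugs if c == component]
    return in_c.count(priority) / len(in_c)

def theta(bugs, c1, c2):
    """θ: compare components by P1 fraction, breaking ties with P2..P5."""
    for priority in (1, 2, 3, 4, 5):
        f1, f2 = psi(bugs, c1, priority), psi(bugs, c2, priority)
        if f1 != f2:
            return 1 if f1 > f2 else -1
    return 0  # identical fractions at every priority level

# Hypothetical (component, priority) pairs.
bugs = [("ipa", 1), ("ipa", 1), ("ipa", 3),
        ("c++", 1), ("c++", 3), ("c++", 3)]

ranked = sorted({c for c, _ in bugs},
                key=cmp_to_key(lambda a, b: theta(bugs, a, b)),
                reverse=True)
print(ranked)  # components ordered from most to least "impactful"
```

Sorting all components with this comparator yields the "impactfulness" ranking under which ipa tops the GCC components.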
From a different perspective, Figure 11b graphically shows the cumulative breakdown of the bug lifetime for each priority level within five months. For example, 79% of P1 bugs are fixed within three months, 58% for P2, 68% for P3, 59% for P4, and 38% for P5.

Both the table and the figure demonstrate that the time spent in fixing bugs does not strictly follow the order of their priorities. On average, only the bugs with P1 are fixed much faster than those with the other priorities, and the bugs with P5 take the most time among all the priorities. The P2 bugs take more time than the P3 bugs. The P1 bug with the maximum lifetime is Bug #25130.8 It was reopened twice, and took 2,196 days in total to resolve.

9. CALL FOR ACTIONS

We have shown that compiler bugs are common: there is a large number of them. This section discusses several directions that are worth pursuing based on the analysis results presented earlier.

Buggy C++ Components. It is not very surprising that the C++ components are the most buggy in both compilers, as C++ is one of the most complex programming languages. However, it is surprising that little research has been directed at testing C++. Although it is more difficult than testing C compilers [11, 35], it is well worth it, because C++ is used as widely as C. Furthermore, based on the most buggy
files of the two compilers (cf. Table 5), we can take gradual steps toward testing C++ by starting from the most buggy features, such as templates and overloaded methods.

8. A PRELIMINARY APPLICATION

This section presents a proof-of-concept application of our findings. Specifically, we have designed tkfuzz, a simple yet effective program mutation algorithm for compiler testing, by leveraging the observation in Subsection 5.1 that bug-revealing test cases are usually small. Moreover, as indicated in Subsection 4.1, C++ is the most buggy component, and our algorithm works for C++ too, making this the first research effort on C++ compiler testing.

tkfuzz. Given a test program represented as a sequence of tokens, tkfuzz randomly substitutes an identifier token (e.g., a variable or function name) with another, different identifier token in the token list. The mutated programs are then used to find crashing bugs in compilers.

8 https://gcc.gnu.org/bugzilla/show_activity.cgi?id=25130

Small Regression Tests. Figure 4 shows the size of the regression test cases, which are extracted from the test programs attached to bug reports. 95% of them have fewer than 100 lines of code, and more than 50% have fewer than 25 lines of code. This interesting (and surprising) finding indicates that randomized compiler testing techniques can leverage this fact by producing small but complex test programs. This not only can stress-test compilers, but may also enhance testing efficiency. Our preliminary application of this finding in Section 8 also demonstrates its potential impact.

Locality of Fixes. Section 5.3 shows that most of the bug fixes touch only one function (58% for GCC and 54% for LLVM), and they tend to be local. Thus, it would be promising to investigate techniques that can direct testing
at a chosen component, rather than treating the compiler as LLVM. Compared to traditional random C program genera-
a monolithic whole. tors which target at compiler crashes, Csmith is able to gen-
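The two observations above, small bug-revealing tests and local fixes, make lightweight mutation-based testing attractive. As an illustration, the token-substitution mutation behind tkfuzz (Section 8) can be sketched in a few lines of Python. The sketch below is a hypothetical reimplementation for illustration only: the regex-based tokenizer, the keyword list, and the `mutate` function are our simplifications, not the authors' implementation.

```python
import random
import re

# Sketch of tkfuzz's mutation step (Section 8): pick one identifier
# occurrence in the program and replace it with a different identifier
# drawn from the same program. The regex "lexer" below is a
# simplification; a real tool would use a proper C/C++ tokenizer.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
KEYWORDS = {"int", "return", "if", "else", "for", "while", "void", "char"}

def mutate(source, rng=random):
    # Collect identifier occurrences as (start, end, name) spans.
    spans = [(m.start(), m.end(), m.group())
             for m in IDENT.finditer(source)
             if m.group() not in KEYWORDS]
    names = {name for _, _, name in spans}
    if len(names) < 2:
        return source  # nothing to substitute with
    start, end, old = rng.choice(spans)
    new = rng.choice(sorted(names - {old}))
    # Splice the replacement identifier into the source text.
    return source[:start] + new + source[end:]

if __name__ == "__main__":
    prog = "int add(int a, int b) { return a + b; }"
    print(mutate(prog, random.Random(0)))
```

In a full harness, each mutant would then be fed to gcc or clang, and an abnormal compiler exit (an internal compiler error) would be recorded as a crashing bug.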
Duration of Bugs Section 6 shows that the average time to triage and resolve a compiler bug is a few months. This may be because compilers are very complex and compiler bugs are difficult to resolve. One has to understand the root cause of a bug and decide how to fix it; at the same time, it is important not to overlook certain cases and to avoid regressions. Thus, practical techniques are needed to aid compiler developers with such tasks. Lopes et al.'s recent work on Alive [16] can be viewed as a relevant example in this direction: it helps developers write and debug peephole optimizations.

10. RELATED WORK

We survey two lines of closely related research.

Empirical Studies on Bugs Much work has been devoted to studying the characteristics of various bugs in various software systems. Chou et al. [7] conducted an empirical study of approximately one thousand operating system errors, collected by applying automatic static compiler analysis to the Linux and OpenBSD kernels. They found that device drivers had many more bugs than the rest of the kernel. Lu et al. [17] studied the characteristics of concurrency bugs by examining 105 concurrency bugs randomly selected from four real-world programs (MySQL, Apache, Mozilla and OpenOffice). Their findings further the understanding of concurrency bugs and highlight future directions for concurrency bug detection, diagnosis and fixing. Sahoo et al. [25] analyzed 266 reported bugs found in released server software, such as MySQL, Apache, and SVN. Based on the findings, they discussed several implications for reproducing software failures and designing automated diagnosis tools for production runs of server software. Li et al. [15] analyzed the trend of bugs by applying natural language text classification techniques to about 29,000 bugs of Mozilla and Apache. Thung et al. [31] studied the bugs in machine learning systems and categorized them based on their characteristics. Song et al. [26] studied performance bugs in open source projects and, based on their characteristics, proposed a statistical debugging technique.

Our work complements these previous studies, with a specific focus on compiler bugs. Different from application bugs, compiler bugs can be much more difficult to notice and debug, and application developers usually assume compilers to be bug-free. In this paper, we show that compiler bugs are also common, and that more than 65% of the bugs of GCC and LLVM are reported by external users. For researchers working on compiler testing and validation, we show that certain compiler components have higher bug rates than the others and should receive more attention.

Compiler Testing Due to compilers' complexity, testing is still the major technique for validating the correctness of production compilers. In addition to internal regression test suites, compiler developers can also use external commercial conformance test suites [1, 22] to further check whether compilers conform to language standards or specifications. However, such manually written test suites may still be inadequate, and therefore researchers have recently started to employ randomized testing to further stress-test compilers.

One of the most successful approaches is Csmith [5, 24, 35], which has found several hundred bugs in GCC and LLVM. Compared to traditional random C program generators, which target compiler crashes, Csmith is able to generate valid C programs by avoiding the introduction of undefined behavior, and is hence capable of finding miscompilation bugs. It has also been applied to test virtual machines [18], CPU emulators [19] and static analyzers, such as Frama-C [8].

Another successful compiler testing technique is Equivalence Modulo Inputs (EMI) [11, 12, 13], which has found several hundred bugs in GCC and LLVM. EMI is a general compiler testing methodology that derives semantically equivalent variants from existing programs. It introduces an alternative view of differential testing. Given a program P, instead of verifying the consistency between executables compiled from P by multiple compilers or multiple versions of a compiler, it tests the consistency, w.r.t. an input I, between the executables compiled from P and P′ by the same compiler, where P′ is an EMI variant of P w.r.t. I.

A considerable amount of effort has also been put into testing different compilers or different compiler components. Zhao et al. proposed a tool, JTT, to test the EC++ embedded compiler [39]. Nagai et al. [20, 21] proposed a technique to test the arithmetic optimizers of compilers. CCG is another random C program generator targeting compiler crashing bugs [2]. Sun et al. [29] proposed an approach to finding bugs in compiler warning diagnostics.

Our study highlights that in GCC and LLVM, C++ has the highest bug rate of all the compiler components, much higher than the C component. It would be interesting to devise effective testing strategies, theories and tools for testing C++ compilers, as C++ is also a popular but more complex programming language widely used in industry. We also analyze the statistics of bug-revealing test cases and bug fixes, and find that most of the test cases are small and most of the bug fixes are local to a small number of lines. These observations can potentially serve as good heuristics to guide random program generators to produce test programs that are small in size but effective at detecting new bugs.

11. CONCLUSION

This paper has presented a study of in total 39,890 bugs and 22,947 bug fixes of GCC, and 12,842 bugs and 8,452 bug fixes of LLVM, and analyzed the characteristics of compiler bugs. In particular, we have shown how bugs are distributed across components and source files (bugs are skewed toward a small number of components and files), how bugs are triggered and fixed (both bug-triggering test cases and fixes are small in size), how long bugs live (on average 200 days for GCC and 111 for Clang), and how bugs are prioritized.

We believe that our analysis results and findings provide insight into understanding compiler bugs and guidance toward better testing and debugging of compilers. All our data and code are publicly available at http://chengniansun.bitbucket.org/projects/compiler-bug-study/.

Acknowledgments

We are grateful to the anonymous reviewers for their insightful comments. This research was supported in part by the United States National Science Foundation (NSF) Grants 1117603, 1319187, 1349528 and 1528133, and a Google Faculty Research Award. The information presented here does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

[1] ACE. SuperTest compiler test and validation suite. http://www.ace.nl/compiler/supertest.html.

[2] A. Balestrat. CCG: A random C code generator. https://github.com/Merkil/ccg/.

[3] S. Blazy, Z. Dargaye, and X. Leroy. Formal Verification of a C Compiler Front-End. In International Symposium on Formal Methods (FM), pages 460–475, 2006.

[4] N. Chen, S. C. H. Hoi, and X. Xiao. Software Process Evaluation: A Machine Learning Approach. In Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 333–342, 2011.

[5] Y. Chen, A. Groce, C. Zhang, W.-K. Wong, X. Fern, E. Eide, and J. Regehr. Taming compiler fuzzers. In Proceedings of the 2013 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 197–208, 2013.

[6] R. Chillarege, W.-L. Kao, and R. G. Condit. Defect Type and Its Impact on the Growth Curve. In Proceedings of the 13th International Conference on Software Engineering (ICSE), pages 246–255, 1991.

[7] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An Empirical Study of Operating Systems Errors. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (SOSP), pages 73–88, 2001.

[8] P. Cuoq, B. Monate, A. Pacalet, V. Prevosto, J. Regehr, B. Yakobowski, and X. Yang. Testing static analyzers with randomly generated programs. In A. Goodloe and S. Person, editors, NASA Formal Methods, volume 7226 of Lecture Notes in Computer Science, pages 120–125. Springer Berlin Heidelberg, 2012.

[9] GCC. GIMPLE – GNU Compiler Collection (GCC) Internals. https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html, accessed: 2014-06-25.

[10] GCC. RTL – GNU Compiler Collection (GCC) Internals. https://gcc.gnu.org/onlinedocs/gccint/RTL.html, accessed: 2014-06-25.

[11] V. Le, M. Afshari, and Z. Su. Compiler Validation via Equivalence Modulo Inputs. In Proceedings of the 2014 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2014.

[12] V. Le, C. Sun, and Z. Su. Finding Deep Compiler Bugs via Guided Stochastic Program Mutation. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 386–399, 2015.

[13] V. Le, C. Sun, and Z. Su. Randomized Stress-Testing of Link-Time Optimizers. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA), pages 327–337, 2015.

[14] X. Leroy, A. W. Appel, S. Blazy, and G. Stewart. The CompCert Memory Model, Version 2. Research report RR-7987, INRIA, June 2012.

[15] Z. Li, L. Tan, X. Wang, S. Lu, Y. Zhou, and C. Zhai. Have Things Changed Now?: An Empirical Study of Bug Characteristics in Modern Open Source Software. In Proceedings of the 1st Workshop on Architectural and System Support for Improving Software Dependability (ASID), pages 25–33, 2006.

[16] N. P. Lopes, D. Menendez, S. Nagarakatte, and J. Regehr. Provably Correct Peephole Optimizations with Alive. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 22–32, 2015.

[17] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 329–339, 2008.

[18] L. Martignoni, R. Paleari, G. Fresi Roglia, and D. Bruschi. Testing System Virtual Machines. In Proceedings of the 19th International Symposium on Software Testing and Analysis (ISSTA), pages 171–182, 2010.

[19] L. Martignoni, R. Paleari, A. Reina, G. F. Roglia, and D. Bruschi. A Methodology for Testing CPU Emulators. ACM Transactions on Software Engineering and Methodology, 22(4):29:1–29:26, Oct. 2013.

[20] E. Nagai, H. Awazu, N. Ishiura, and N. Takeda. Random Testing of C Compilers Targeting Arithmetic Optimization. In Workshop on Synthesis And System Integration of Mixed Information Technologies (SASIMI 2012), pages 48–53, 2012.

[21] E. Nagai, A. Hashimoto, and N. Ishiura. Scaling up Size and Number of Expressions in Random Testing of Arithmetic Optimization of C Compilers. In Workshop on Synthesis And System Integration of Mixed Information Technologies (SASIMI 2013), pages 88–93, 2013.

[22] Plum Hall, Inc. The Plum Hall Validation Suite for C. http://www.plumhall.com/stec.html.

[23] A. Pnueli, M. Siegel, and E. Singerman. Translation Validation. In 4th International Conference on Tools and Algorithms for Construction and Analysis of Systems (TACAS), pages 151–166, 1998.

[24] J. Regehr, Y. Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang. Test-Case Reduction for C Compiler Bugs. In Proceedings of the 2012 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 335–346, 2012.

[25] S. K. Sahoo, J. Criswell, and V. Adve. An Empirical Study of Reported Bugs in Server Software with Implications for Automated Bug Diagnosis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE), pages 485–494, 2010.

[26] L. Song and S. Lu. Statistical Debugging for Real-World Performance Problems. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA), pages 561–578, 2014.

[27] M. Sullivan and R. Chillarege. A Comparison of Software Defects in Database Management Systems and Operating Systems. In Twenty-Second International Symposium on Fault-Tolerant Computing (FTCS), pages 475–484, July 1992.

[28] C. Sun, J. Du, N. Chen, S.-C. Khoo, and Y. Yang. Mining Explicit Rules for Software Process Evaluation. In International Conference on Software and System Process (ICSSP), pages 118–125, 2013.

[29] C. Sun, V. Le, and Z. Su. Finding and Analyzing Compiler Warning Defects. In Proceedings of the 38th International Conference on Software Engineering (ICSE), 2016.

[30] C. Sun, D. Lo, X. Wang, J. Jiang, and S.-C. Khoo. A Discriminative Model Approach for Accurate Duplicate Bug Report Retrieval. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE), pages 45–54, 2010.

[31] F. Thung, S. Wang, D. Lo, and L. Jiang. An Empirical Study of Bugs in Machine Learning Systems. In 2012 IEEE 23rd International Symposium on Software Reliability Engineering (ISSRE), pages 271–280, Nov. 2012.

[32] Y. Tian, D. Lo, and C. Sun. DRONE: Predicting Priority of Reported Bugs by Multi-Factor Analysis. In 29th IEEE International Conference on Software Maintenance (ICSM), pages 200–209, Sept. 2013.

[33] TIOBE. TIOBE Index for May 2016. http://www.tiobe.com/tiobe_index, accessed: 2016-05-15.

[34] J.-B. Tristan and X. Leroy. Formal Verification of Translation Validators: A Case Study on Instruction Scheduling Optimizations. In Proceedings of the 35th ACM Symposium on Principles of Programming Languages (POPL), pages 17–27, Jan. 2008.

[35] X. Yang, Y. Chen, E. Eide, and J. Regehr. Finding and Understanding Bugs in C Compilers. In Proceedings of the 2011 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 283–294, 2011.

[36] Z. Yin, X. Ma, J. Zheng, Y. Zhou, L. N. Bairavasundaram, and S. Pasupathy. An Empirical Study on Configuration Errors in Commercial and Open Source Systems. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP), pages 159–172, 2011.

[37] Z. Yin, D. Yuan, Y. Zhou, S. Pasupathy, and L. Bairavasundaram. How Do Fixes Become Bugs? In 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE), pages 26–36, 2011.

[38] A. Zeller and R. Hildebrandt. Simplifying and Isolating Failure-Inducing Input. IEEE Transactions on Software Engineering, 28(2):183–200, Feb. 2002.

[39] C. Zhao, Y. Xue, Q. Tao, L. Guo, and Z. Wang. Automated Test Program Generation for an Industrial Optimizing Compiler. In ICSE Workshop on Automation of Software Test (AST), pages 36–43, 2009.

[40] T. Zimmermann, N. Nagappan, P. J. Guo, and B. Murphy. Characterizing and Predicting Which Bugs Get Reopened. In Proceedings of the 34th International Conference on Software Engineering (ICSE), pages 1074–1083, 2012.