
2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE)

Automatic Self-Validation for Code Coverage Profilers


Yibiao Yang∗† , Yanyan Jiang∗ , Zhiqiang Zuo∗ , Yang Wang∗ ,
Hao Sun∗ , Hongmin Lu∗ , Yuming Zhou∗ , and Baowen Xu∗
∗State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
†School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
{yangyibiao, jyy, zqzuo}@nju.edu.cn, [email protected]
[email protected], {hmlu, zhouyuming, bwxu}@nju.edu.cn

Abstract—Code coverage, as the primitive dynamic program behavior information, is widely adopted to facilitate a rich spectrum of software engineering tasks, such as testing, fuzzing, debugging, fault detection, reverse engineering, and program understanding. Thanks to these widespread applications, it is crucial to ensure the reliability of code coverage profilers. Unfortunately, due to the lack of research attention and the existence of the testing oracle problem, coverage profilers are far from being tested sufficiently. Bugs are still regularly seen in the widely deployed profilers, like gcov and llvm-cov, shipped with gcc and llvm, respectively.

This paper proposes Cod, an automated self-validator for effectively uncovering bugs in coverage profilers. Starting from a test program (either from a compiler's test suite or generated randomly), Cod detects profiler bugs with zero false positives using a metamorphic relation in which the coverage statistics of that program and a mutated variant are bridged. We evaluated Cod on two of the most well-known code coverage profilers, namely gcov and llvm-cov. Within a four-month testing period, a total of 196 potential bugs (123 for gcov, 73 for llvm-cov) were found, among which 23 are confirmed by the developers.

Index Terms—Code coverage, Metamorphic testing, Coverage profilers, Bug detection.

I. INTRODUCTION

Profiling code coverage data [1] (e.g., executed branches, paths, functions, etc.) of instrumented subject programs is the cornerstone of a rich spectrum of software engineering practices, such as testing [2], fuzzing [3], debugging [4]–[6], specification mining [7], [8], fault detection [9], reverse engineering, and program understanding [10]. Incorrect coverage information would severely mislead developers in their software engineering practices.

Unfortunately, coverage profilers themselves (e.g., gcov and llvm-cov) are prone to errors. Even a simple randomized differential testing technique exposed more than 70 bugs in coverage profilers [11]. The reasons are two-fold. First, neither application-end developers nor academic researchers have paid sufficient attention to testing code coverage profilers. Second, automatic testing of coverage profilers remains challenging due to the lack of test oracles. In code coverage testing, the oracle is supposed to constitute rich execution information, e.g., the execution frequency of each code statement in the program under a given test case. Unlike a functional oracle, which can usually be obtained from the given specification, a complete code coverage oracle is extremely hard to achieve. Even though programming experts can specify the oracle precisely, doing so requires enormous human intervention, making it impractical.

A simple differential testing approach, C2V, tried to uncover coverage bugs by comparing the coverage profiling results of the same input program over two different profiler implementations (e.g., gcov and llvm-cov) [11]. For instance, if gcov and llvm-cov provide different coverage information for the same statement of the profiled program, a bug is reported. Due to the inconsistency of coverage semantics defined by different profiler implementations, it is rather common that independently implemented coverage profilers exhibit different opinions on the line-based statistics (e.g., the case in Figure 1) — this essentially contradicts the fundamental assumption of differential testing that distinct coverage profilers should output identical coverage statistics for the same input program.

Approach To tackle the flaws of the existing approach, this paper presents Cod, a fully automated self-validator of coverage profilers, based on the metamorphic testing formulation [12]. Instead of comparing outputs from two independent profilers, Cod takes a single profiler and a program P (either from a compiler's test suite or generated randomly) as input and uncovers bugs by identifying inconsistencies between the coverage results of P and its equivalent mutated variants, whose coverage statistics are expected to be identical. The equivalent program variants are generated based on the assumption that modifying unexecuted code blocks should not affect the coverage statistics of executed blocks under the identical profiler, which should generally hold in a non-optimized setting¹. This idea originates from EMI [2], a metamorphic testing approach targeted at compiler optimization bugs.

Specifically, assuming that the compiler is correct² and given a deterministic program P under profiling (either from a compiler's test suite or generated randomly) with its input fixed, Cod obtains a reference program P′ by removing the unexecuted statements in P. P′ should strictly follow the same execution path as long as the coverage profiling data of P is correct. Therefore, Cod asserts that the coverage statistics should be exactly the same over all unchanged statements in P and P′.

¹ According to the developers [13], coverage statistics are only stable under zero optimization level.
² We assume this because mis-compilations are rare.

978-1-7281-2508-4/19/$31.00 ©2019 IEEE 79


DOI 10.1109/ASE.2019.00018

Authorized licensed use limited to: Nanjing University. Downloaded on February 09,2022 at 00:39:31 UTC from IEEE Xplore. Restrictions apply.
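The self-validation check described above reduces to comparing two per-line count maps. The following is an illustrative sketch, not the authors' implementation, assuming each coverage report has been parsed into a dictionary mapping line numbers to execution counts, with -1 standing for "no information":

```python
# Classify the disagreement between the coverage maps of a program P
# and its pruned variant P' on lines that both programs still contain.
# Counts are execution frequencies; -1 means "no information".

def classify(c_p: int, c_p2: int):
    """Return None if consistent, else 'strong' or 'weak'."""
    if c_p == c_p2:
        return None          # identical statistics: no bug signal
    if c_p >= 0 and c_p2 >= 0:
        return "strong"      # both instrumented, but counts differ
    return "weak"            # exactly one side reports -1

def find_inconsistencies(cov_p, cov_p2):
    """Compare two {line: count} maps over their common lines."""
    result = {}
    for s in sorted(cov_p.keys() & cov_p2.keys()):
        kind = classify(cov_p[s], cov_p2[s])
        if kind is not None:
            result[s] = kind
    return result

# A count dropping from 1 to 0 on a common line is a strong
# inconsistency; a 0 turning into -1 is a weak one.
print(find_inconsistencies({8: 0, 9: 1}, {8: -1, 9: 0}))
# → {8: 'weak', 9: 'strong'}
```

Both counts non-negative but unequal signals a strong inconsistency; any disagreement involving -1 signals a weak one, mirroring the two cases defined next.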
However, consider a line of code s in P and P′ for which the same profiler reported different coverage results, i.e., CP(s) ≠ CP′(s), where CP(s) refers to the profiled runtime execution count of statement s in program P. The execution count CP(s) is usually a nonnegative number, except for a special value −1 indicating unknown coverage information. This can happen when the coverage profiler fails to obtain the coverage information of statement s due to the information loss caused by the abstraction gap between the source code and the intermediate code produced during compilation. Given CP(s) ≠ CP′(s), either of two cases applies:

1) (Strong Inconsistency) CP(s) ≥ 0 ∧ CP′(s) ≥ 0, meaning that the coverage profiler reports inconsistent coverage information. There is definitely a bug in the profiler, because P′ should follow exactly the same execution path as P, assuming that the coverage statistics of program P are correct.

2) (Weak Inconsistency) CP(s) = −1 ∨ CP′(s) = −1, indicating an inaccurate statistic, because a non-instrumented line is actually executed in its equivalent. This is also certainly a bug, because non-optimized coverage statistics should faithfully reflect the program's execution path.

The self-validator Cod fully exploits the inconsistencies between path-equivalent programs with zero false positives. Cod addresses the limitation of C2V discussed in Section II-C and handles weak inconsistencies, whereas C2V has to ignore all weak inconsistencies between independent profiler implementations to avoid being flooded by false positives. It is worth noting that this technique of obtaining path-equivalent programs, first known as EMI, was proposed to validate the correctness of compiler optimizations [2]. We found that the idea is also powerful in the validation of coverage profilers. Nevertheless, Cod differs from EMI in that we propose specialized mutation strategies to acquire the program variants, and adopt a results-verification criterion tailored to testing coverage profilers. We defer the detailed comparison to Section V.

Results We implemented Cod as a prototype and evaluated it on two popular coverage profilers, namely gcov and llvm-cov, integrated with the compilers gcc and llvm, respectively. As of the submission deadline, a total of 196 potential bugs (123 for gcov, 73 for llvm-cov) were uncovered within 4 months, among which 23 bugs have already been confirmed by the developers. Promisingly, all the detected bugs are new bugs according to the developers' feedback.

Outline The rest of the paper is organized as follows. We introduce the necessary background and a brief motivation in Section II. Section III elaborates on the detailed approach, followed by the evaluation in Section IV. We discuss related work in Section V and conclude the paper in Section VI.

II. BACKGROUND AND MOTIVATION

A. Coverage Profilers

Code coverage profiling data (each line of code's execution count in a program execution) is the foundation of a broad spectrum of software engineering practices. Code coverage is the most widely adopted criterion for measuring testing thoroughness, and is also widely used in the automation of software engineering tasks. For example, test input generation techniques leverage code coverage to guide search iterations [3]; fault localization techniques use code coverage to isolate potentially faulty branches [4]–[6].

To obtain code coverage statistics, a code coverage profiler maintains an execution counter for each source code line and updates the counters along with the program execution. Specifically, given a program P, a coverage profiler runs P and outputs for each line of code s ∈ P a number CP(s) = n, indicating that s was executed n times. A special value n = −1 indicates that the profiler provides no coverage information for this line. Whether a profiler should report a coverage statistic for a line of code s (i.e., whether CP(s) = −1) is not well defined. Code transformations (e.g., expansion of macros, or compilation from source code to intermediate code) and optimizations may lead to CP(s) = −1 for a line, and different profilers generally have different opinions on which lines should have CP(s) ≥ 0. As we will see, this is a major limitation of existing techniques for validating coverage profilers.

B. Validating Coverage Profilers

Validating the correctness of a coverage profiler is challenging because it is labor-intensive to obtain the ground truth of coverage statistics. Though we have a large number of test inputs (any program used to test a compiler also works for testing a profiler), the lack of a test oracle remains the problem.

The only known technique for uncovering coverage profiler bugs is C2V, which is based on differential testing [11]. Given a program P under profiling, C2V profiles it using two independently implemented profilers to obtain, for each statement s, the coverage statistics CP(s) and C′P(s). When CP(s) ≠ C′P(s), an inconsistency is found. When CP(s) ≥ 0 ∧ C′P(s) ≥ 0 (a strong inconsistency), C2V reports it as a bug candidate and uses clustering to filter out potential false positives and duplicates.

C. Limitations of Differential Testing

Though effective in uncovering coverage profiler bugs, differential testing has the following major limitations:

First, differential testing cannot be applied when there is only a single coverage profiler. This is the case for many mainstream programming languages (e.g., Python and Perl).

Second, differential testing requires heavy human effort in analyzing the reports, because it is hard to determine which profiler is faulty when the two disagree on the statistics of a line of code.

Third, differential testing misses many potential bugs involving weak inconsistencies. Two profilers can have inconsistent (but both correct) interpretations of the coverage statistics.


(a) CP (gcov)          (b) CP (llvm-cov)        (c) CP\{s8}∪{s8′} (llvm-cov)

 1:  1:int main()        1| -1|int main()         1| -1|int main()
-1:  2:{                 2|  1|{                  2|  1|{
-1:  3:  switch (8)      3|  1|  switch (8)       3|  1|  switch (8)
-1:  4:  {               4|  1|  {                4|  1|  {
-1:  5:  case 8:         5|  1|  case 8:          5|  1|  case 8:
 1:  6:    break;        6|  1|    break;         6|  1|    break;
-1:  7:  default:        7|  1|  default:         7|  1|  default:
-1:  8:    abort ();     8| ×0|    abort ();      8| ×0|    ; // abort ();
-1:  9:    break;        9|  1|    break;         9| ×0|    break;
-1: 10:  }              10|  1|  }               10|  1|  }
 1: 11:  return 0;      11| ✓1|  return 0;       11|  1|  return 0;
-1: 12:}                12|  1|}                 12|  1|}

Fig. 1. The bug case of LLVM #41821. llvm-cov incorrectly reported that the break in Line 9 is executed. This bug cannot be detected by differential testing [11] because gcov does not provide coverage statistics for Line 9. Visual conventions of coverage statistics: for a line of code s, gcov and llvm-cov output the coverage statistic CP(s) in the first and second columns, respectively. A -1 denotes that the profiler does not provide coverage information for s. A check mark or cross mark followed by a number n denotes that CP(s) = n.
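The transformation from Figure 1 (b) to Figure 1 (c), neutralizing a statement the profiler claims was never executed, can be sketched as follows. This is an illustrative reconstruction rather than Cod's actual implementation; the gen_variant helper and its comment-out strategy are assumptions for demonstration:

```python
def gen_variant(source_lines, coverage):
    """Comment out every line whose reported execution count is 0.

    source_lines: list of source text, 1-indexed by position.
    coverage: {line_number: count}, where -1 means "no information".
    Lines with count -1 or > 0 are left untouched, mirroring the rule
    that only provably unexecuted code may be changed.
    """
    variant = []
    for lineno, text in enumerate(source_lines, start=1):
        if coverage.get(lineno, -1) == 0:
            # Replace by an empty statement, keeping the line count,
            # exactly as in Figure 1 (c).
            variant.append("; // " + text)
        else:
            variant.append(text)
    return variant

src = ["default:", "abort ();", "break;"]
print(gen_variant(src, {1: 1, 2: 0, 3: 1}))
# → ['default:', '; // abort ();', 'break;']
```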

Section IV reveals that 92.9% of the inconsistencies found by differential testing are weak, i.e., CP(s) ≠ C′P(s) with CP(s) = −1 ∨ C′P(s) = −1. Our motivating example in Figures 1 (a)–(b) shows that for 10 of the 12 lines, exactly one of gcov and llvm-cov provides no coverage information. Unfortunately, C2V has to ignore all such weak inconsistencies (i.e., not report any of them as a bug) because the vast majority of them are attributable to the nature of independently implemented profilers.

Finally, differential testing reports false positives even when two profilers report strongly inconsistent coverage statistics, because the two profilers may disagree over the definition of the execution count of a line, e.g., whether the initialization code of a for loop counts as one execution.

D. Motivation

The key observation leading to automatic self-validation of a single profiler is that changing unexecuted statements in a program should not affect the coverage statistics. Take the program in Figure 1 (a)–(b) as an example. Suppose that we comment out the function call in Line 8 of P and obtain P′ = P \ {s8} ∪ {s8′} as shown in Figure 1 (c). We assert that unchanged statements should have identical coverage statistics, i.e.,

    ∀s ∈ P ∩ P′. CP(s) = CP′(s),

reasonably assuming that:
1) P is deterministic, contains no undefined behavior, and does not depend on the external environment;
2) the coverage statistics are correct; and
3) the executions of P and P′ are consistent with their semantics.

Since we only remove "unexecuted" statements reported by the coverage profiler, P and P′ should be semantically equivalent. Furthermore, a profiler (particularly under minimal optimization) should be self-consistent in terms of which statements have a coverage statistic. Therefore, if there is an inconsistency CP(s) ≠ CP′(s), no matter whether it is a strong or a weak inconsistency, one of the above assumptions is violated. It turns out that we should blame the profiler, because we have full control over P (thus we can easily guarantee it is deterministic) and there is little chance that the compiler or hardware is defective.

In the motivating example, CP(s9) ≠ CP′(s9) revealed a previously unknown profiler bug in which llvm-cov incorrectly reported an unexecuted break as being executed once. This bug case is missed by differential testing (particularly, C2V) because all inconsistencies between CP(gcov) and CP(llvm-cov) are weak (Lines 1–5, 7–10, and 12). Generally, weak inconsistencies between different profiler implementations indicate different compilation strategies (thus do not indicate a profiler bug) and should not be reported by C2V.

[Figure 2 is a flowchart: a Program Pruner derives program P′ from program P; both are run under the coverage profiler, yielding output O with coverage report C and output O′ with coverage report C′; if the outputs are unequal or the reports are inconsistent, a bug is reported.]
Fig. 2. The framework of Cod

III. APPROACH

A. Metamorphic Testing

A test oracle is a mechanism for determining whether a test has passed or failed. Under certain circumstances, however, the oracle is not available or is too expensive to apply.


This is known as the oracle problem [14]. For example, in compiler testing, it is not easy to verify whether the executable code generated by a compiler is functionally equivalent to the given source code. Even if an oracle is available, manually checking the oracle results is tedious and error-prone [15], [16]. As a matter of fact, the oracle problem has been "one of the most difficult tasks in software testing" [16].

Metamorphic testing (MT) was coined by T. Y. Chen in 1998 [12] and can be exploited to alleviate the oracle problem. Based on existing successful test cases (those that have not revealed any failure, such as running too long or returning abnormal values), MT generates follow-up test cases by reference to metamorphic relations (MRs), which are necessary properties of the target function or algorithm in terms of multiple inputs and their expected outputs. Let us consider a program P implementing function F on domain D. Let t be an initial successful test case, i.e., t ∈ D and the execution output P(t) equals the expected value F(t). MT can be applied to generate a follow-up test case t′ ∈ D based on t and a pre-defined MR. For program P, an MR is a property of its target function F. For instance, suppose F(x) = sin(x); then the property sin(π − x) = sin(x) is a typical MR with respect to F. Hence, given a successful test case, say t = 1.2, MT generates its follow-up test case t′ = π − 1.2 and then runs the program on t′. Finally, the two outputs (i.e., P(t) and P(t′)) are checked to see whether they satisfy the expected relation P(t) = P(t′). If the identity does not hold, a failure manifests.

In our work, we apply MT to the validation of code coverage profilers. Each program becomes a test case fed to the profilers. Given a program P and the code coverage profiler under test, running P with an input i produces the execution output O and a coverage report C from the coverage profiler. The coverage report C records which lines of code are executed (or unexecuted) and exactly how many times they are executed. Note that this program P is the initial test case in the notions of MT. A follow-up test program P′ can be generated based on the following MR:

Given a program, the unexecuted code can be eliminated, since this code has no impact on the execution output or on the coverage report for the executed part.

In other words, by taking advantage of the coverage information C, we generate P′, a functionally equivalent variant of P, by removing the unexecuted code statements of P. We run P′ on the same input i and obtain the output O′ and coverage results C′, accordingly. A bug is then discovered if 1) the execution output O is not equal to the new output O′, or 2) there exists an inconsistency between the coverage information for the executed code in C and C′. Figure 2 shows our framework for the self-validation of code coverage profilers.

B. Our Algorithm

Based on our formulation, we implemented a tool, Cod, for detecting bugs in C code coverage profilers. Cod consists of three main steps: (1) extracting the output and coverage information of a given test program, (2) generating equivalent variants based on the code coverage report, and (3) comparing the outputs and coverage results to uncover bugs.

Algorithm 1: Cod's process for coverage tool validation
   Data: the profiler T under test, the program P, the input i
   Result: reported bugs
 1 begin
      /* Step 1: Extract output and coverage information */
 2    Pexe ← compile(P)
 3    O ← getOutput(execute(Pexe, i))
 4    C ← T.extractCoverage(execute(Pexe, i))
      /* Step 2: Generate a variant via transformation */
 5    P′ ← genVariant(P, C)
 6    P′exe ← compile(P′)
 7    O′ ← getOutput(execute(P′exe, i))
 8    C′ ← T.extractCoverage(execute(P′exe, i))
      /* Step 3: Compare outputs and reports */
      // First stage
 9    if O ≠ O′ then
10       reportBug()
      // Second stage
11    else if inconsistent(C, C′) then
12       reportBug()

   /* Generate a variant for program P under coverage C */
13 Function genVariant(Program P, Coverage C)
14    P′ ← P
15    foreach s ∈ getStmts(P) ∧ CP(s) = 0 do
16       P′.delete(s)
17    if isCompilable(P′) then
18       return P′
19    else
20       return genVariant(P, C)

   /* Check whether the coverages are inconsistent */
21 Function inconsistent(Coverage C, Coverage C′)
22    foreach s ∈ C ∧ s ∈ C′ do
23       if CP(s) ≠ CP′(s) then
24          return True
25    return False

Algorithm 1 shows the main process of Cod. First, Cod compiles the program P and profiles its execution with input i to collect: (1) the output O (which may correspond to a return value or an exit code) and (2) the code coverage information C (Lines 2 - 4). It then generates the variant P′ with respect to program P (Line 5) and collects the respective output O′ and code coverage C′ (Lines 6 - 8). Finally, it compares the outputs together with the code coverage reports to validate the coverage profiler. A potential bug in T is reported if either is inconsistent (Lines 9 - 12). We discuss each step in detail as follows.

Extracting Coverage Information For each test program P, we first compile it with particular options to generate the executable binary Pexe. The options enable the compiler


to integrate the necessary instrumentation code into the executable. While executing Pexe with input i, we obtain the output O of the program. Meanwhile, the code coverage report C can be readily extracted by the coverage profiler T. Each code coverage report contains the lines of code executed and unexecuted in the test program P under the input i. The statements marked as unexecuted will be randomly pruned for the purpose of generating P's equivalent variants, as discussed shortly. Cod implements support for both gcov and llvm-cov. Taking gcov as an example, Cod extracts coverage information by compiling the program P with the flags "-O0 --coverage" under gcc. This tells the compiler to instrument additional code in the object files for generating the extra profiling information at runtime. Cod then runs the executable binary under input i to produce the coverage report for the program P.

Generating Variants via Transformation Based on the coverage report C for the original program P, its variants are generated (Line 5). Cod produces the variants by stochastically removing unexecuted program statements from the original program P. Specifically, for each of these removable lines of code, we make a random choice. As such, we obtain a number of variants P′ that should be equivalent to the original program. The function genVariant (Lines 13 - 20) describes Cod's process for generating equivalent mutants via transformation. Note that stochastically removing unexecuted program statements would lead to many uncompilable mutants. Only the compilable ones are returned by genVariant (Line 17).

Comparing the Outputs and Coverage Reports Having the outputs and the coverage reports for the original program P and its variants P′, we detect bugs in the code coverage tool T by checking for inconsistencies. More specifically, we first compare the outputs of P and P′. If they are not identical, a potential bug in the code coverage tool is reported. Otherwise, the code coverage reports are further compared to seek inconsistencies. Note that only the code coverage of the common lines of code between the programs P and P′ (i.e., those lines of code left in the variant program) is considered for comparison. If the code coverage reports are not consistent over the common lines, a potential bug is reported as well (Lines 9–12).

C. Illustrative Examples

In the following, we take three of our reported concrete bug examples to illustrate how Cod works. The three bugs are newly discovered by Cod and confirmed by the GCC developers.

Bug Example Exposed by Different Outputs Figure 3 shows a real bug example exposed via different outputs of two "equivalent" programs in gcov [17], a C code coverage tool integrated in GCC [18]. Figure 3 (a) and (b) are the code coverage reports produced by gcov for the original program P and its equivalent program P′ (obtained by removing the unexecuted Line 6), respectively. Note that all the test programs are reformatted for presentation. As can be seen, a code coverage report is an annotated version of the source code augmented with the execution frequency of each line. The first and second columns list the execution frequency and the line number. The frequency number "-1" in the first column indicates that the coverage information is unknown.

In this example, we first use gcc to compile the program P and then execute it to produce the output and coverage report (shown in Figure 3 (a)). Note that the output in this case is 0. According to the original code coverage report of P, Cod decides to remove the 6th statement from the original program, resulting in an equivalent program P′ shown in Figure 3 (b). Next, we compile and execute P′ to get the new output and coverage report. Here, the output turns out to be 1.

Since the outputs of these two programs are not equal, P and P′ are somehow not equivalent, meaning that we actually deleted some executed code. The code coverage tool wrongly marked some executed statements as not executed. A potential bug is identified. We reported this bug to Bugzilla. The gcov developers quickly confirmed and fixed it.

Bug Example Exposed by Strongly Inconsistent Coverage Figure 4 illustrates another real bug example uncovered by strongly inconsistent code coverage reports between the program and its "equivalent" variant. Figure 4 (a) shows the coverage report for P. We can read from it that Line 10 is not executed at all (i.e., the execution count is 0). Cod prunes Line 10 to generate the equivalent program P′. After compiling and executing P′, another coverage report, shown in Figure 4 (b), is produced. As can be seen, there exists a strong inconsistency in terms of the execution frequency of Line 6, indicating a potential bug. This bug has been submitted and already confirmed by the gcov developers.

Bug Example Exposed by Weakly Inconsistent Coverage Figure 5 presents another confirmed real bug example, found via weakly inconsistent code coverage reports between the program and its equivalent variant. In Figure 5 (a), Line 6 in P is not executed (i.e., the execution count is 0). Cod gets rid of Line 6 to generate the equivalent program P′. Upon compiling and executing P′, another coverage report, shown in Figure 5 (b), is generated. Apparently, a weak inconsistency with respect to the execution frequency of Line 5 appears, indicating a potential bug.

IV. EVALUATION

This section presents our evaluation of Cod. We evaluated Cod using the most popular practical code coverage profilers, gcov and llvm-cov, and a set of test programs for testing compilers, and compared the results with the existing differential testing technique C2V [11].

A. Evaluation Setup

Profilers for Validation We evaluated Cod using the latest versions of gcov and llvm-cov, the two most popular code coverage profilers for C programs, as our experimental subjects. Both profilers are:
1) popular in the software engineering community;

(a) CP (gcov, output: 0)             (b) CP\{s6}∪{s6′} (gcov, output: 1)

-1:  1:#include <stdio.h>             -1:  1:#include <stdio.h>
-1:  2:int *p=0, a=0, b=2;            -1:  2:int *p=0, a=0, b=2;
 1:  3:int *foo() {                    1:  3:int *foo() {
 1:  4:  int *r = (int *)1;            1:  4:  int *r = (int *)1;
-1:  5:  while (1) {                  -1:  5:  while (1) {
×0:  6:    r = (int)(a+p) & ~1;       -1:  6:    // r = (int)(a+p) & ~1;
 1:  7:    if (a < b) return r;        1:  7:    if (a < b) return r;
-1:  8:  }                            -1:  8:  }
-1:  9:  return r;                    -1:  9:  return r;
-1: 10:}                              -1: 10:}
 1: 11:void main () {                  1: 11:void main () {
 1: 12:  int *r = foo();               1: 12:  int *r = foo();
 1: 13:  printf("%d\n", r);            1: 13:  printf("%d\n", r);
 1: 14:}                               1: 14:}

Fig. 3. A real bug example exposed by Cod via different outputs. This is Bug #89675 of gcov 8.2.0. In (a), Line #6 is marked as not executed; (b) is the "equivalent" program obtained by deleting Line #6 from the original program in (a). The outputs of these two "equivalent" programs are not identical, indicating a bug in gcov 8.2.0.
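Reports in this textual format can be processed mechanically. Below is an illustrative parser, not part of Cod, for records of the form "count: lineno: source". Note that real gcov output prints "#####" for unexecuted lines and "-" for lines without coverage information; these are normalized here to 0 and -1 to match the conventions of the figures:

```python
def parse_gcov_line(line):
    """Parse one 'count: lineno: source' record of a .gcov report.

    Returns (lineno, count), where count is -1 when gcov gives no
    information ('-') and 0 when the line was never executed ('#####').
    """
    count_field, lineno_field, _source = line.split(":", 2)
    count_field = count_field.strip()
    if count_field == "-":
        count = -1                       # no coverage information
    elif count_field.startswith("#"):
        count = 0                        # '#####' marks unexecuted code
    else:
        count = int(count_field)
    return int(lineno_field), count

report = [
    "        1:    3:int *foo() {",
    "    #####:    6:  r = (int)1;",
    "        -:    8:}",
]
print(dict(parse_gcov_line(l) for l in report))
# → {3: 1, 6: 0, 8: -1}
```

The resulting line-to-count map is exactly the shape of input that the comparison step operates on.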

(a) CP (gcov)                        (b) CP\{s10}∪{s10′} (gcov)

 1:  1:int foo() {                    1:  1:int foo() {
 1:  2:  int h=2, f=1, k=0;           1:  2:  int h=2, f=1, k=0;
 1:  3:  int y=18481, x=y;            1:  3:  int y=18481, x=y;
 1:  4:  if(y!=0 && (k<=x>>4)) {      1:  4:  if(y!=0 && (k<=x>>4)) {
1*:  5:    h=y>0 ? 2:1;              1*:  5:    h=y>0 ? 2:1;
 2:  6:    if (f) {                   1:  6:    if (f) {
 1:  7:      h^=3;                    1:  7:      h^=3;
-1:  8:    }                         -1:  8:    }
-1:  9:  } else {                    -1:  9:  } else {
×0: 10:    h = 0;                    -1: 10:    // h = 0;
-1: 11:  }                           -1: 11:  }
 1: 12:  return h;                    1: 12:  return h;
-1: 13:}                             -1: 13:}
 1: 14:void main() { foo(); }         1: 14:void main() { foo(); }

Fig. 4. A real bug example discovered by Cod, with confirmed bug id #89470 of gcov 8.2.0. When the unexecuted Line #10 is pruned from the original program in (a), the code coverage of Line #6 is inconsistent between the original program and the new program in (b), which indicates a bug. A star after a number in Line #5 denotes that this number may be inaccurate.
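Cod flags the case in Figure 4 during the second of its two comparison stages: outputs are compared first, and only when they match are the per-line statistics checked. A minimal sketch of that decision logic (illustrative only; coverage reports are assumed to be parsed into line-to-count dictionaries):

```python
def validate(output, output2, cov, cov2):
    """Two-stage check over a program P and its pruned variant P'.

    Stage 1: differing outputs mean an executed statement was deleted,
    so the profiler must have mislabeled it as unexecuted.
    Stage 2: otherwise, any disagreement on a common line (including
    -1 versus a real count) is an inconsistency.
    """
    if output != output2:
        return "bug: outputs differ"
    for line in cov.keys() & cov2.keys():
        if cov[line] != cov2[line]:
            return f"bug: inconsistent coverage at line {line}"
    return "no inconsistency found"

# Figure 4 in miniature: line 6 is counted twice in P but once in P',
# while both runs produce the same output.
print(validate(0, 0, {5: 1, 6: 2, 7: 1}, {5: 1, 6: 1, 7: 1}))
# → bug: inconsistent coverage at line 6
```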

(a) CP (gcov)                            (b) CP\{s6}∪{s6′} (gcov)

 1:  1:void foo(int x, unsigned u) {      1:  1:void foo(int x, unsigned u) {
 1:  2:  if ((1U << x) != 64              1:  2:  if ((1U << x) != 64
 1:  3:      || (2 << x) != u             1:  3:      || (2 << x) != u
-1:  4:      || (1 << x) == 14           -1:  4:      || (1 << x) == 14
 1:  5:      || (3 << 2) != 12)          -1:  5:      || (3 << 2) != 12)
×0:  6:    __builtin_abort ();           -1:  6:    ; // __builtin_abort ();
 1:  7:}                                 ✓1:  7:}
 1:  8:int main() {                       1:  8:int main() {
 1:  9:  foo(6, 128U);                    1:  9:  foo(6, 128U);
 1: 10:  return 0;                        1: 10:  return 0;
-1: 11:}                                 -1: 11:}

Fig. 5. A real bug example discovered by Cod, with confirmed bug id #90439 of gcov 9.0. When the unexecuted Line #6 is pruned from the original program in (a), the code coverage of Line #5 is weakly inconsistent between the original program and the new program in (b).

2) integrated in the most widely used production compilers, i.e., GCC and Clang;
3) extensively validated by existing research, both for the compilers and the profilers.

Following existing research [11], we use the default compiler flags to obtain coverage reports for gcov and llvm-cov under zero-level optimization. Given a piece of source code test.c, the following commands are used to produce the coverage report test.c.gcov:

    gcc -O0 --coverage -o test test.c
    ./test
    gcov test.c

For llvm-cov, we use the following commands to produce the coverage report test.c.lcov:

    clang -O0 -fcoverage-mapping -fprofile-instr-generate \
        -o test test.c
    ./test
    llvm-profdata merge default.profraw -o test.pd
    llvm-cov show test -instr-profile=test.pd \
        test.c > test.c.lcov

Evaluation Steps To run either differential testing or Cod, we obtain code coverage statistics for the 26,530 test programs


Authorized licensed use limited to: Nanjing University. Downloaded on February 09,2022 at 00:39:31 UTC from IEEE Xplore. Restrictions apply.
TABLE I
STATISTICS OF BUG-TRIGGERING TEST PROGRAMS.

Profilers   Different Outputs   Inconsistent Reports
                                Strong   Weak
gcov        1                   69       54
llvm-cov    0                   62       11

TABLE II
LIST OF CONFIRMED OR FIXED BUGS. PN DENOTES A NORMAL PRIORITY. DIFFTEST DENOTES WHETHER THE BUG CAN BE FOUND BY DIFFERENTIAL TESTING.

ID   Profiler   Bugzilla ID   Priority   Status   Type          DiffTest
1    gcov       88913         P3         Fixed    Wrong Freq.   ✓
2    gcov       88914         P3         Fixed    Wrong Freq.   ✓
3    gcov       88924         P5         New      Wrong Freq.   ✓
4    gcov       88930         P3         Fixed    Wrong Freq.   ✓
5    gcov       89465         P3         Fixed    Missing       ×
6    gcov       89467         P3         Fixed    Wrong Freq.   ✓
7    gcov       89468         P5         New      Wrong Freq.   ×
8    gcov       89469         P5         New      Wrong Freq.   ✓
9    gcov       89470         P5         New      Wrong Freq.   ✓
10   gcov       89673         P5         New      Spurious      ×
11   gcov       89674         P5         New      Spurious      ×
12   gcov       89675         P3         Fixed    Missing       ×
13   gcov       90023         P5         New      Spurious      ×
14   gcov       90054         P3         Fixed    Missing       ✓
15   gcov       90057         P3         Fixed    Wrong Freq.   ✓
16   gcov       90066         P5         New      Wrong Freq.   ×
17   gcov       90091         P3         New      Wrong Freq.   ✓
18   gcov       90104         P3         New      Wrong Freq.   ×
19   gcov       90425         P5         New      Wrong Freq.   ×
20   gcov       90439         P3         New      Missing       ×
21   llvm-cov   41051         PN         New      Wrong Freq.   ✓
22   llvm-cov   41821         PN         New      Spurious      ×
23   llvm-cov   41849         PN         New      Missing       ×

in the test-suite shipped with the latest gcc release (7.4.0) and 5,000 random programs generated by csmith [19]. All evaluated programs contain neither external environmental dependencies nor undefined behavior. We run Cod over all the test programs and collect all reported inconsistencies for manual inspection. We also compare these results with the state-of-the-art differential testing technique C2V [11].

Testing Environment We evaluated the latest versions of gcov (shipped with gcc, until gcc 9.0.1-20190414) and llvm-cov (until llvm 9.0.0-svn358899) during our experiments. All experiments were conducted on a hexa-core Intel(R) Core(TM) [email protected] virtual machine with 10GiB of RAM running Ubuntu Linux 18.04.

B. Experimental Results

Inconsistent Reports For each of the test cases in our testbed, only one variant was generated by Cod for the validation. This variant is generated by removing all the unexecuted statements reported by the coverage profilers from the original test case. Obviously, generating more variants for each test program may trigger more inconsistencies over the test programs and probably detect more bugs in those coverage profilers. Table I shows the statistics of bug-triggering test programs over the two code coverage profilers under test, i.e., gcov and llvm-cov. Column 2 gives the total number of test program/variant pairs that lead to different execution outputs, and Column 3 shows the total number that yield inconsistent coverage reports.

The single case in which the variant outputs a different value (Figure 3) is due to incorrect coverage statistics causing Cod to create a functionally different "equivalent" mutated variant. The other inconsistencies are also due to profiler bugs, which are discussed as follows.

Bugs Found We manually inspected all cases and found that all reported (strong and weak) inconsistencies revealed defects in the profilers. By far, we have reported a total of 26 bugs to the developers of gcov and llvm-cov. The manual classification and reporting of profiler bugs is still on-going; we believe that more bugs will be reported in the future.

23/26 bugs are confirmed³ by the developers, as listed in Table II. One of the remaining three is still pending confirmation, one was marked as duplicate, and only one was rejected by the developer (gcov #90438). This rejected case is controversial because gcc performs optimization even under the zero optimization level (as shown in Figure 6), which may mislead a developer or an automated tool that relies on the branch information in the coverage statistics.

Following the notions from C2V, code coverage bugs inside coverage profilers can be categorized as Spurious Marking, Missing Marking, and Wrong Frequency. As shown in Column 6 of Table II, Cod is able to detect all three types of bugs in coverage profilers: 14 bugs belong to Wrong Frequency, 5 bugs belong to Missing Marking, and the remaining 4 are Spurious Marking. Most of the bugs are thus Wrong Frequency bugs, i.e., the execution frequencies are wrongly reported.

Among all these bugs, nearly half (12/26) cannot be manifested by differential testing. Considering that differential testing leverages the coverage statistics of an independent profiler implementation (which produces correct coverage information in all these cases, and is thus essentially comparing against a golden version) while Cod is merely self-validation, we expect Cod to be effective and useful in finding code coverage profiler bugs.

³ Consistent with C. Sun et al.'s [20] and V. Le et al.'s [21] studies: because the bug management process of LLVM is not as organized as that of GCC, if an llvm-cov bug report has been CCed by Clang developers and there is no objection in the comments, we label the bug as confirmed. In addition, as stated by developers, if someone does not close a reported bug as "invalid", then the bug is real in LLVM Bugzilla.
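The variant generation used throughout the evaluation, pruning every statement that the profiler reports as unexecuted, can be sketched as follows. This is an illustrative Python sketch of ours, not Cod's implementation; parse_gcov and make_variant are hypothetical helper names, and mapping "-" to -1 and "#####" to 0 follows the convention of the paper's figures.

```python
# Illustrative sketch (not Cod's actual implementation) of two steps:
# parse a gcov-style report into per-line execution counts, then build
# the variant by pruning every unexecuted statement.  In a .gcov
# report each line has the form "count:lineno:source"; "-" marks a
# non-instrumented line (-1 in the paper's figures) and "#####" marks
# an instrumented but unexecuted one.

def parse_gcov(report):
    coverage, source = {}, {}
    for raw in report.splitlines():
        parts = raw.split(":", 2)
        if len(parts) < 3:
            continue
        count, lineno, text = parts[0].strip(), parts[1].strip(), parts[2]
        if lineno == "0":            # metadata such as "Source:test.c"
            continue
        n = int(lineno)
        source[n] = text
        if count == "-":
            coverage[n] = -1
        elif count.startswith("#"):
            coverage[n] = 0
        else:
            coverage[n] = int(count)
    return coverage, source

def make_variant(coverage, source):
    # Replace each unexecuted statement with ";" (keeping the original
    # text as a comment), as in Figs. 5-7, so line numbers survive.
    return {n: ("; // " + text.strip() if coverage[n] == 0 else text)
            for n, text in source.items()}

report = """\
        -:    0:Source:test.c
        1:    1:int main() {
        1:    2:  if (0)
    #####:    3:    return 1;
        1:    4:  return 0;
        -:    5:}"""
cov, src = parse_gcov(report)
print(make_variant(cov, src)[3])   # "; // return 1;"
```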

(a) P (gcov):
 1:  1: int f(int i) {
-1:  2:   int res;
 1:  3:   switch (i) {
0*:  4:   case 5:
0*:  5:     res = i - i;
0*:  6:     break;
 1:  7:   default:
 1:  8:     res = i * 2;
 1:  9:     break;
-1: 10:   }
 1: 11:   return res;
-1: 12: }
 1: 13: int main(void) {
 1: 14:   f(2);
 1: 15:   return 0;
-1: 16: }

(b) P' = P \ {s5, s6} ∪ {s5', s6'} (gcov):
 1:  1: int f(int i) {
-1:  2:   int res;
-1:  3:   switch (i) {
-1:  4:   case 5:
-1:  5:     // res = i - i;
-1:  6:     // break;
-1:  7:   default:
 1:  8:     res = i * 2;
 1:  9:     break;
-1: 10:   }
 1: 11:   return res;
-1: 12: }
 1: 13: int main(void) {
 1: 14:   f(2);
 1: 15:   return 0;
-1: 16: }

Fig. 6. In the case of gcov #90438, gcov refuses to report coverage information for the case statement after removing Lines #5–6, but reports the execution of its default branch, which may mislead a developer or an automated tool. Note that though Line #4 is not covered, it is not removed; otherwise, the removal would result in a compilation error.

TABLE III
SUMMARY OF THE TEST PROGRAMS WITH INCONSISTENT COVERAGE REPORTS BY COD ON THE CONSISTENT TEST PROGRAMS BY C2V.

# weakly consistent   # inconsistent in terms of Cod
under C2V             gcov              llvm-cov
                      Strong   Weak     Strong   Weak
3745                  10       23       19       9

TABLE IV
SUMMARIZATION OF THE COMMON AND NON-COMMON INSTRUMENTATION SITES BETWEEN GCOV AND LLVM-COV FOR THE TEST PROGRAMS IN GCC TESTSUITES 7.4.0.
C / C̄: NUMBER OF COMMON / NON-COMMON INSTRUMENTATION SITES.

     C                     C̄
     Total     Avg.        Total     Avg.
#    83026     16.49       98523     19.56
%    45.73%    -           54.27%    -

C. Discussions

Statistics of Inconsistencies Table III summarizes the test programs in which inconsistencies are identified by Cod but unable to be identified by C2V. All these inconsistencies are true positives: each is either a bug or may mislead a developer or an automated tool.

Using these test programs to test gcov with Cod, we respectively identified 23 weak and 10 strong inconsistencies. For llvm-cov, 28 weak and 19 strong inconsistencies were identified. This indicates that Cod has a unique ability to identify many inconsistencies that C2V cannot, given the same test programs. We thus believe that Cod is more powerful and useful than C2V.

Weak Inconsistencies Between Independently Implemented Coverage Profilers As aforementioned, independently implemented code coverage profilers might have different interpretations of the same code. This is the major source of the weak inconsistencies that C2V cannot recognize as a bug.

To further understand weak inconsistencies among profilers, we collected the common instrumentation sites between gcov 9.0.0 and llvm-cov 9.0 using the test programs in GCC testsuites 7.4.0. A code line s is a common instrumentation site (s ∈ C) if C^G_P(s) ≠ −1 ∧ C^L_P(s) ≠ −1, where C^G_P(s) and C^L_P(s) refer to the profiled runtime execution count of code line s in program P by gcov and llvm-cov, respectively. When C^G_P(s) ≠ C^L_P(s) ∧ (C^G_P(s) = −1 ∨ C^L_P(s) = −1), s is a non-common instrumentation site (s ∈ C̄).

Only 5036 test programs in GCC testsuites 7.4.0 can be successfully compiled and further processed by both gcov and llvm-cov. Table IV summarizes the total number and percentage of common and non-common instrumentation sites. The second and fourth columns respectively show the total number/percentage of C and C̄, and the third and last columns respectively show the average |C| and |C̄| per test program. From Table IV, we find that about 46% of the code lines are in C and each test program has about 16 code lines in C on average.

Table V summarizes the statistics of the proportion of C̄ for the 5036 test programs in GCC testsuite 7.4.0. We calculate the proportion as p = |C̄|/(|C| + |C̄|) for each test program, and then count how many test programs fall into each interval, as listed in the second row of Table V. From Table V, we find that 40%∼70% of the code lines in most test programs are in C̄. This indicates that most code lines of each program are instrumented by only one of the two coverage profilers. Besides, we also found that only 1.14% of the test programs have exactly the same instrumentation sites under the two profilers.

Overall, our core observation is that different coverage profilers indeed have quite different interpretations of the same piece of code.

Reliability of Code Coverage Profilers Under Compiler Optimizations Finally, even though coverage profilers guarantee faithful statistics only under the zero optimization level, we

TABLE V
DISTRIBUTION OF THE PERCENTAGE OF NON-COMMON INSTRUMENTATION SITES.
p = |C̄|/(|C| + |C̄|): THE PROPORTION OF NON-COMMON INSTRUMENTATION SITES.

p   0       0∼10%   10%∼20%  20%∼30%  30%∼40%  40%∼50%  50%∼60%  60%∼70%  70%∼80%  80%∼90%  90%∼100%
#   61      21      91       228      703      1722     1433     678      276      84       58
%   1.14%   0.39%   1.70%    4.26%    13.13%   32.16%   26.76%   12.66%   5.15%    1.57%    1.08%
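The site partition behind Table IV and the bucketing behind Table V can be made concrete with a short script. This is an illustrative sketch of ours, not the paper's tooling; site_partition and bucket are hypothetical names, and the coverage maps use -1 for non-instrumented lines, as in the figures.

```python
# Sketch of the common / non-common site partition (Table IV) and the
# per-program bucketing of p = |C-bar| / (|C| + |C-bar|) (Table V).
# g and l map line numbers to the execution counts reported by gcov
# and llvm-cov respectively; -1 means "not instrumented".

def site_partition(g, l):
    common, non_common = set(), set()
    for s in set(g) | set(l):
        cg, cl = g.get(s, -1), l.get(s, -1)
        if cg != -1 and cl != -1:
            common.add(s)            # s in C: both profilers instrument s
        elif cg != cl:
            non_common.add(s)        # s in C-bar: exactly one instruments s
    return common, non_common

def bucket(ps):
    # Count programs per 10% interval of p; p == 0 gets its own
    # bucket, mirroring the first column of Table V.
    labels = ["0"] + [f"{10*i}%~{10*(i+1)}%" for i in range(10)]
    counts = dict.fromkeys(labels, 0)
    for p in ps:
        counts["0" if p == 0 else labels[1 + min(int(p * 10), 9)]] += 1
    return counts

g = {1: 5, 2: -1, 3: 0, 4: 2}
l = {1: 5, 2: 3, 3: 0, 4: -1}
C, Cbar = site_partition(g, l)
p = len(Cbar) / (len(C) + len(Cbar))
print(sorted(C), sorted(Cbar), p)                 # [1, 3] [2, 4] 0.5
print(bucket([0, 0.05, 0.45, 0.45, 1.0])["40%~50%"])   # 2
```

Note that lines uninstrumented by both profilers land in neither set, consistent with the definitions above.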

TABLE VI
SUMMARY OF INCONSISTENT REPORTS WITH DIFFERENT OPTIMIZATION LEVELS.

                     Inconsistent lines
Optimization level   gcov                  llvm-cov
                     Strongly   Weakly     Strongly   Weakly
-O0                  69         54         62         11
-O1                  1115       635        62         11
-O2                  678        936        63         11
-O3                  679        937        63         11
-Os                  799        977        63         11
-Ofast               677        927        61         11

still wonder whether inconsistencies reported by Cod for optimized binaries may reveal bugs in a profiler. Therefore, we conducted our experiments under optimized compiler settings, and the results are summarized in Table VI.

After manually inspecting a few cases, we found that gcov generally does not provide comprehensive coverage statistics for optimized code (sometimes it is even obviously wrong); llvm-cov, however, is much more reliable: its coverage statistics barely change across optimization levels.

We attempted to report one such obviously incorrect coverage statistic of gcov to the developers, as shown in Figure 7 (gcov #90420, under the -O3 optimization level). Line #11 cannot be executed 11 times in any circumstance; however, the developer rejected this bug report and argued that "it's the nature of any optimizing compiler. If you want to have the best results, then don't use -O3, or any other optimization level." This case is also controversial; however, it revealed that providing guarantees on coverage statistics under compiler optimizations would be a worthwhile future direction.

V. RELATED WORK

This section surveys related work on coverage profiler testing, metamorphic testing, testing via equivalence modulo inputs, and techniques that rely on code coverage.

A. Code Coverage Profiler Testing

To the best of our knowledge, C2V [11] is the first and also the state-of-the-art work for hunting bugs in code coverage profilers. It feeds a randomly generated program to both gcov and llvm-cov, and then reports a bug if there exist inconsistencies between the produced coverage reports. Within four non-continuous months of testing, C2V uncovered 83 bugs, among which 42 and 28 bugs were confirmed/fixed by the gcov and llvm-cov developers, respectively. In essence, C2V is a randomized differential testing approach. As stated in Section II-C, C2V suffers from a number of drawbacks; the work presented in this paper attempts to fill this gap.

B. Metamorphic Testing

As a simple but effective approach to alleviating the oracle problem, metamorphic testing (MT) exploits metamorphic relations (MRs) among multiple inputs and their expected outputs to generate follow-up test cases from existing ones, and verifies the corresponding outputs against the MRs. Since its first publication in 1998, MT has been successfully applied in a variety of domains including bioinformatics [22], [23], web services [24], [25], embedded systems [26], components [27], databases [28], machine learning classifiers [29], online search functions and search engines [30], [31], and security [32].

Several representative works are listed below. Chan et al. [24], [25] presented a metamorphic testing methodology for Service Oriented Applications (SOA); their method relies on so-called metamorphic services to encapsulate the services under test, executes the seed test and the follow-up test cases, and finally checks their results. Zhou et al. [30], [31] employed metamorphic testing to detect inconsistencies in online web search applications; several metamorphic relations are proposed and utilized in a number of experiments with web search engines such as Google, Yahoo!, and Live Search. Jiang et al. [26] presented several metamorphic relations for fault detection in Central Processing Unit (CPU) scheduling algorithms; two real bugs were found in one of the simulators under test. Beydeda [27] proposed a self-testing method for commercial off-the-shelf components via metamorphic testing. Zhou et al. [33] applied metamorphic testing to self-driving cars, and detected fatal software bugs in the LiDAR obstacle-perception module. Chen et al. [22] presented several metamorphic relations for the detection of faults in two open-source bioinformatics programs for gene regulatory network simulations and short sequence mapping.

In this paper, we applied MT to a new domain, i.e., validating the correctness of coverage profilers.

C. Testing via Equivalence Modulo Inputs

Testing via equivalence modulo inputs (EMI) [2], [34], [35] is a testing technique proposed in recent years, targeted at discovering compiler optimization bugs. The basic idea of EMI is to modify a program to generate variants

(a) P (gcov):
0*:  1: int func (int *p) {
11:  2:   int x = 0;
0*:  3:   for (int i = 0; i < 10; i++)
10:  4:     x += p[i];
 1:  5:   return x;
-1:  6: }
 1:  7: int main() {
 1:  8:   int a[10];
11:  9:   for (int i = 0; i < 10; i++)
10: 10:     a[i] = 1;
11: 11:   if (func(a) != 10)
0*: 12:     return 1;
-1: 13:   return 0;
-1: 14: }

(b) P' = P \ {s12} ∪ {s12'} (gcov):
0*:  1: int func (int *p) {
 1:  2:   int x = 0;
0*:  3:   for (int i = 0; i < 10; i++)
0*:  4:     x += p[i];
 1:  5:   return x;
-1:  6: }
 1:  7: int main() {
 1:  8:   int a[10];
 1:  9:   for (int i = 0; i < 10; i++)
-1: 10:     a[i] = 1;
 1: 11:   if (func(a) != 10)
-1: 12:     ; // return 1;
 1: 13:   return 0;
-1: 14: }

Fig. 7. The bug case of GCC #90420. gcov incorrectly reported that the if (func(a) != 10) in Line #11 was executed 11 times. Deleting Line #12 revealed this bug.
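The report comparison illustrated in Figs. 5–7 can be sketched as follows. This is our simplified approximation, not Cod's exact strong/weak inconsistency definitions, and inconsistent_lines is a hypothetical helper: a line kept verbatim in the variant should keep its reported coverage, so any kept line whose count differs between the two reports is flagged.

```python
# Simplified sketch of the coverage comparison behind Figs. 5-7 (an
# approximation of Cod's check, not its exact strong/weak rules).
# orig and variant map line numbers to reported counts (-1 means the
# line was not instrumented); kept is the set of lines carried over
# unchanged from the original program into the variant.

def inconsistent_lines(orig, variant, kept):
    return sorted(s for s in kept
                  if orig.get(s, -1) != variant.get(s, -1))

orig    = {1: 1, 2: 5, 3: 0, 4: 1}   # line 3 is unexecuted and gets pruned
variant = {1: 1, 2: 3, 4: 1}         # line 2's count changed after pruning
print(inconsistent_lines(orig, variant, kept={1, 2, 4}))   # [2]
```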

with the same outputs as the original program. Initially, Le et al. [2] proposed to generate equivalent versions of a program by profiling the program's execution and pruning the unexecuted code inside. Once a program and its equivalent variant are constructed, both are fed to the compiler under test, and the outputs are checked for inconsistencies. Following this work, Athena [34] and Hermes [35] were developed subsequently. Athena [34] generates EMI variants by randomly inserting code into and removing statements from dead code regions. Hermes [35] complements these mutation strategies by operating on live code regions, which overcomes the limitations of mutating only dead code regions.

In Cod, we followed a similar way of generating program variants as EMI, but focused on validating the correctness of coverage profilers instead of finding optimization bugs in compilers. As such, during result verification, Cod checks not only the inconsistencies in the outputs but, more importantly, those in the coverage reports. Our evaluation also shows that only few bugs (1 among the 23 confirmed bugs) can be discovered by looking at only the outputs. Moreover, different from EMI's random modifications, Cod mutates the original program by aggressive statement pruning, thus triggering as many different coverage behaviors as possible.

D. Techniques Relying on Code Coverage

Code coverage is widely adopted in practice and extensively used to facilitate many software engineering tasks, such as coverage-based regression testing, coverage-based compiler testing, and coverage-based debugging. In the context of regression testing, test case prioritization and test suite augmentation are two widely used techniques [36]–[43]. The former aims to improve the ability of test cases to find faults by scheduling them in a specific order [41], [43], [44]; achieving high code coverage as fast as possible is a common practice [45]. The latter generates new test cases to strengthen the fault-finding ability of a test suite [42], [46], [47]; in practice, new test cases are often generated to cover the source code affected by code changes.

Recent years have seen an increasing interest in compiler testing, which aims to validate the correctness of compilers. One of the most attractive compiler testing techniques is based on the code coverage of a program's execution: it generates equivalence modulo inputs by stochastically pruning the program's unexecuted code [2], [48]. With the equivalence modulo inputs, compilers can be tested differentially. Obviously, the correctness of this "equivalence" relies on the reliability of code coverage. Debugging is a common activity in software development which aims at locating the root cause of a fault. Spectrum-Based Fault Localization (SBFL) is one of the most extensively studied debugging techniques and is heavily based on code coverage [4], [49]–[51]. Given a specific test suite, SBFL leverages the code coverage and the corresponding failed/passed information to statistically infer which code is the root cause of a fault.

As we can see, correct code coverage information is one of the prerequisites for the techniques above, indicating the importance of our work.

VI. CONCLUSION

This paper presents Cod, an automated self-validator for code coverage profilers based on metamorphic testing. Cod addresses the limitations of the state-of-the-art differential testing approach, and encouragingly found many previously unknown bugs which cannot be revealed by existing approaches.

ACKNOWLEDGMENT

We thank the anonymous reviewers for their constructive comments. We also thank the GCC and LLVM developers, especially Martin Liška, for analyzing and fixing our reported bugs. This work is supported by the National Key R&D Program of China (2018YFB1003901), the National Natural Science Foundation of China (61832009, 61432001, 61932021, 61690204, 61772259, 61772263, 61802165, 61802168, 61702256), the Natural Science Foundation of Jiangsu Province (BK20191247, BK20170652), the China Postdoctoral Science Foundation (2018T110481), and the Fundamental Research Funds for the Central Universities (020214380032, 02021430047). We would also like to thank the support from the Collaborative Innovation Center of Novel Software Technology and Industrialization, Jiangsu, China. Yuming Zhou and Baowen Xu are the corresponding authors.

REFERENCES

[1] J. C. Miller and C. J. Maloney, "Systematic mistake analysis of digital computer programs," Commun. ACM, vol. 6, no. 2, pp. 58–63, Feb. 1963.
[2] V. Le, M. Afshari, and Z. Su, "Compiler validation via equivalence modulo inputs," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '14. New York, NY, USA: ACM, 2014, pp. 216–226.
[3] M. Böhme, V.-T. Pham, and A. Roychoudhury, "Coverage-based greybox fuzzing as markov chain," in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '16. New York, NY, USA: ACM, 2016, pp. 1032–1043.
[4] J. A. Jones and M. J. Harrold, "Empirical evaluation of the tarantula automatic fault-localization technique," in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '05. New York, NY, USA: ACM, 2005, pp. 273–282.
[5] Z. Zuo, S.-C. Khoo, and C. Sun, "Efficient predicated bug signature mining via hierarchical instrumentation," in Proceedings of the 2014 International Symposium on Software Testing and Analysis, ser. ISSTA '14. New York, NY, USA: ACM, 2014, pp. 215–224.
[6] Z. Zuo, L. Fang, S.-C. Khoo, G. Xu, and S. Lu, "Low-overhead and fully automated statistical debugging with abstraction refinement," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA '16. New York, NY, USA: ACM, 2016, pp. 881–896.
[7] D. Lo, S.-C. Khoo, J. Han, and C. Liu, Mining Software Specifications: Methodologies and Applications, 1st ed. Boca Raton, FL, USA: CRC Press, Inc., 2011.
[8] Z. Zuo and S.-C. Khoo, "Mining dataflow sensitive specifications," in Formal Methods and Software Engineering, L. Groves and J. Sun, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 36–52.
[9] S. Park, R. W. Vuduc, and M. J. Harrold, "Falcon: Fault localization in concurrent programs," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ser. ICSE '10. New York, NY, USA: ACM, 2010, pp. 245–254.
[10] "Lighthouse - a code coverage explorer for reverse engineers," https://github.com/gaasedelen/lighthouse.
[11] Y. Yang, Y. Zhou, H. Sun, Z. Su, Z. Zuo, L. Xu, and B. Xu, "Hunting for bugs in code coverage tools via randomized differential testing," in Proceedings of the 41st International Conference on Software Engineering, ser. ICSE '19. Piscataway, NJ, USA: IEEE Press, 2019, pp. 488–499.
[12] T. Y. Chen, S. C. Cheung, and S. M. Yiu, "Metamorphic testing: A new approach for generating next test cases," Technical Report HKUST-CS98-01, Department of Computer Science, Hong Kong, Tech. Rep., 1998.
[13] M. Liška, "Explanations on the coverage results under optimizations," https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90420.
[14] E. J. Weyuker, "On testing non-testable programs," The Computer Journal, vol. 25, no. 4, pp. 465–470, 1982.
[15] D. Hamlet, "Predicting dependability by testing," in ACM SIGSOFT Software Engineering Notes, vol. 21, no. 3. ACM, 1996, pp. 84–91.
[16] L. Manolache and D. G. Kourie, "Software testing using model programs," Software: Practice and Experience, vol. 31, no. 13, pp. 1211–1236, 2001.
[17] "Gcov," https://gcc.gnu.org/onlinedocs/gcc/Gcov.html.
[18] "Gcc," https://gcc.gnu.org/.
[19] X. Yang, Y. Chen, E. Eide, and J. Regehr, "Finding and understanding bugs in C compilers," in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '11. New York, NY, USA: ACM, 2011, pp. 283–294.
[20] C. Sun, V. Le, and Z. Su, "Finding and analyzing compiler warning defects," in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE '16. New York, NY, USA: ACM, 2016, pp. 203–213.
[21] V. Le, C. Sun, and Z. Su, "Randomized stress-testing of link-time optimizers," in Proceedings of the 2015 International Symposium on Software Testing and Analysis, ser. ISSTA '15. New York, NY, USA: ACM, 2015, pp. 327–337.
[22] T. Y. Chen, J. W. Ho, H. Liu, and X. Xie, "An innovative approach for testing bioinformatics programs using metamorphic testing," BMC Bioinformatics, vol. 10, no. 1, p. 24, 2009.
[23] L. L. Pullum and O. Ozmen, "Early results from metamorphic testing of epidemiological models," in Proceedings of the ASE/IEEE International Conference on BioMedical Computing, ser. BioMedCom '12.
[24] W. Chan, S. C. Cheung, and K. R. Leung, "Towards a metamorphic testing methodology for service-oriented software applications," in Proceedings of the 5th International Conference on Quality Software, ser. QSIC '05. IEEE, 2005, pp. 470–476.
[25] W. K. Chan, S. C. Cheung, and K. R. Leung, "A metamorphic testing approach for online testing of service-oriented software applications," International Journal of Web Services Research (IJWSR), vol. 4, no. 2, pp. 61–81, 2007.
[26] M. Jiang, T. Y. Chen, F.-C. Kuo, and Z. Ding, "Testing central processing unit scheduling algorithms using metamorphic testing," in Proceedings of the 4th IEEE International Conference on Software Engineering and Service Science. IEEE, 2013, pp. 530–536.
[27] S. Beydeda, "Self-metamorphic-testing components," in 30th Annual International Computer Software and Applications Conference, ser. COMPSAC '06.
[28] M. Lindvall, D. Ganesan, R. Árdal, and R. E. Wiegand, "Metamorphic model-based testing applied on NASA DAT: An experience report," in Proceedings of the 37th International Conference on Software Engineering - Volume 2. IEEE Press, 2015, pp. 129–138.
[29] X. Xie, J. W. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, "Testing and validating machine learning classifiers by metamorphic testing," Journal of Systems and Software, vol. 84, no. 4, pp. 544–558, 2011.
[30] Z. Q. Zhou, S. Zhang, M. Hagenbuchner, T. Tse, F.-C. Kuo, and T. Y. Chen, "Automated functional testing of online search services," Software Testing, Verification and Reliability, vol. 22, no. 4, pp. 221–243, 2012.
[31] Z. Q. Zhou, S. Xiang, and T. Y. Chen, "Metamorphic testing for software quality assessment: A study of search engines," IEEE Transactions on Software Engineering, vol. 42, no. 3, pp. 264–284, 2016.
[32] T. Y. Chen, F.-C. Kuo, W. Ma, W. Susilo, D. Towey, J. Voas, and Z. Q. Zhou, "Metamorphic testing for cybersecurity," Computer, vol. 49, no. 6, pp. 48–55, 2016.
[33] Z. Q. Zhou and L. Sun, "Metamorphic testing of driverless cars," Commun. ACM, vol. 62, no. 3, pp. 61–67, Feb. 2019.
[34] V. Le, C. Sun, and Z. Su, "Finding deep compiler bugs via guided stochastic program mutation," in Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA '15. New York, NY, USA: ACM, 2015, pp. 386–399.
[35] C. Sun, V. Le, and Z. Su, "Finding compiler bugs via live code mutation," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA '16. New York, NY, USA: ACM, 2016, pp. 849–863.
[36] M. Gligoric, A. Groce, C. Zhang, R. Sharma, M. A. Alipour, and D. Marinov, "Guidelines for coverage-based comparisons of non-adequate test suites," ACM Trans. Softw. Eng. Methodol., vol. 24, no. 4, pp. 22:1–22:33, Sep. 2015.
[37] G. Fraser, M. Staats, P. McMinn, A. Arcuri, and F. Padberg, "Does automated white-box test generation really help software testers?" in Proceedings of the 2013 International Symposium on Software Testing and Analysis, ser. ISSTA '13. New York, NY, USA: ACM, 2013, pp. 291–301.
[38] S. Yoo and M. Harman, "Regression testing minimization, selection and prioritization: A survey," Softw. Test. Verif. Reliab., vol. 22, no. 2, pp. 67–120, Mar. 2012.
[39] J. A. Jones and M. J. Harrold, "Test-suite reduction and prioritization for modified condition/decision coverage," IEEE Trans. Softw. Eng., vol. 29, no. 3, pp. 195–209, Mar. 2003.
[40] L. Zhang, D. Marinov, L. Zhang, and S. Khurshid, "Regression mutation testing," in Proceedings of the 2012 International Symposium on Software Testing and Analysis, ser. ISSTA '12. New York, NY, USA: ACM, 2012, pp. 331–341.
[41] D. Hao, L. Zhang, L. Zhang, G. Rothermel, and H. Mei, "A unified test case prioritization approach," ACM Trans. Softw. Eng. Methodol., vol. 24, no. 2, pp. 10:1–10:31, Dec. 2014.
[42] S. Artzi, J. Dolby, F. Tip, and M. Pistoia, "Directed test generation for effective fault localization," in Proceedings of the 19th International Symposium on Software Testing and Analysis. ACM, 2010, pp. 49–60.
[43] Z. Li, M. Harman, and R. M. Hierons, "Search algorithms for regression test case prioritization," IEEE Trans. Softw. Eng., vol. 33, no. 4, pp. 225–237, Apr. 2007.
[44] G. Rothermel, R. J. Untch, and C. Chu, "Prioritizing test cases for regression testing," IEEE Trans. Softw. Eng., vol. 27, no. 10, pp. 929–948, Oct. 2001.
[45] W. E. Wong, J. R. Horgan, S. London, and H. A. Bellcore, "A study of effective regression testing in practice," in Proceedings of the Eighth International Symposium on Software Reliability Engineering, ser. ISSRE '97. Washington, DC, USA: IEEE Computer Society, 1997, pp. 264–274.
[46] G. Fraser and A. Arcuri, "A large-scale evaluation of automated unit test generation using evosuite," ACM Trans. Softw. Eng. Methodol., vol. 24, no. 2, pp. 8:1–8:42, Dec. 2014.
[47] Y. Li, Z. Su, L. Wang, and X. Li, "Steering symbolic execution to less traveled paths," in Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications, ser. OOPSLA '13. New York, NY, USA: ACM, 2013, pp. 19–32.
[48] C. Lidbury, A. Lascu, N. Chong, and A. F. Donaldson, "Many-core compiler fuzzing," in Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '15. New York, NY, USA: ACM, 2015, pp. 65–76.
[49] X. Xie, T. Y. Chen, F.-C. Kuo, and B. Xu, "A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization," ACM Trans. Softw. Eng. Methodol., vol. 22, no. 4, pp. 31:1–31:40, Oct. 2013.
[50] S. Yoo, M. Harman, and D. Clark, "Fault localization prioritization: Comparing information-theoretic and coverage-based approaches," ACM Trans. Softw. Eng. Methodol., vol. 22, no. 3, pp. 19:1–19:29, Jul. 2013.
[51] R. Santelices, J. A. Jones, Y. Yu, and M. J. Harrold, "Lightweight fault-localization using multiple coverage types," in Proceedings of the 31st International Conference on Software Engineering, ser. ICSE '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 56–66.