0% found this document useful (0 votes)
8 views11 pages

Swarm 12

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views11 pages

Swarm 12

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Swarm Testing

Alex Groce Chaoqiang Zhang Eric Eide Yang Chen John Regehr
School of Electrical Engineering and Computer Science University of Utah, School of Computing
Oregon State University, Corvallis, OR USA Salt Lake City, UT USA
[email protected], [email protected] {eeide, chenyang, regehr}@cs.utah.edu

ABSTRACT the frequency of various method calls [6], or how to choose a length
Swarm testing is a novel and inexpensive way to improve the di- for tests [5]. As a rule, the notion that some test configurations
versity of test cases generated during random testing. Increased are “good” and that finding a good (if not truly optimal, given the
diversity leads to improved coverage and fault detection. In swarm size of the search space) configuration is important has not been
testing, the usual practice of potentially including all features in challenged. Furthermore, in the interests of maximizing coverage
every test case is abandoned. Rather, a large “swarm” of randomly and fault detection, it has been assumed that a good random test
generated configurations, each of which omits some features, is configuration includes as many API calls or other input domain
used, with configurations receiving equal resources. We have iden- features as possible, and this has been the guiding principle in large-
tified two mechanisms by which feature omission leads to better scale efforts to test C compilers [36], file systems [17], and utility
exploration of a system’s state space. First, some features actively libraries [29]. The rare exceptions to this rule have been cases where
prevent the system from executing interesting behaviors; e.g., “pop” a feature makes tests too difficult to evaluate or slow to execute,
calls may prevent a stack data structure from executing a bug in or when static analysis or hand inspection can demonstrate that an
its overflow detection logic. Second, even when there is no active API call is unrelated to state [17]. For example, including pointer
suppression of behaviors, test features compete for space in each assertions may make compiling random C programs too slow with
test, limiting the depth to which logic driven by features can be some compilers.
explored. Experimental results show that swarm testing increases In general, if a call or feature is omitted from some tests, it
coverage and can improve fault detection dramatically; for example, is usually omitted from all tests. This approach seems to make
in a week of testing it found 42% more distinct ways to crash a intuitive sense: omitting features, unless it is necessary, means
collection of C compilers than did the heavily hand-tuned default giving up on detecting some faults. However, this objection to
configuration of a random tester. feature omission only holds so long as testing is performed using a
single test configuration. Swarm testing, in contrast, uses a diverse
Categories and Subject Descriptors D.2.5 [Software Engineer- “swarm” of test configurations, each of which deliberately omits
ing]: Testing and Debugging—testing tools; D.3.4 [Programming certain API calls or input features. As a result, given a fixed testing
Languages]: Processors—compilers budget, swarm testing tends to test a more diverse set of inputs than
would be tested under a so-called “optimal” configuration (perhaps
General Terms Algorithms, Experimentation, Languages, Relia- better referred to as a default configuration) in which every feature
bility is available for use by every test.
Keywords Random testing, configuration diversity One can visualize the impact of swarm testing by imagining a
“test space” defined by the contents of tests. As a simple example,
consider testing an implementation of a stack ADT that provides two
1. INTRODUCTION operations, push and pop. One can visualize the test space for the
This paper focuses on answering a single question: In random stack ADT using these features as axes: each test is characterized
testing, can a diverse set of testing configurations perform better by the number of times it invokes each operation. Any method for
than a single, possibly “optimal” configuration? An example of a test randomly generating test cases results in a probability distribution
configuration would be, for example, a list of API calls that can be over the test space, with the value at each point (x, y) giving the
included in test cases. Conventional wisdom in random testing [19] probability that a given test will contain exactly x pushes and y pops
has assumed a policy of finding a “good” configuration and running (in any order). To make this example more interesting, imagine the
as many tests as possible with that configuration. Considerable stack implementation has a capacity bug, and will crash whenever
research effort has been devoted to the question of how to tune a the stack is required to hold more than 32 items.
“good configuration,” e.g., how to use genetic algorithms to optimize Figure 1(a) illustrates the situation for testing the stack with a
test generator that chooses pushes and pops with equal probability.
The generator randomly chooses an input length and then decides if
each operation is a push or a pop. The graph shows the distribution
of tests produced by this generator over the test space. The graph
also shows contour lines for significant regions of the test space.
c ACM, 2012. This is the author’s version of the work. It is posted here by
permission of ACM for your personal use. Not for redistribution. Where Pf ail = 1, a test chosen randomly from that region is certain
The definitive version was published in Proceedings of the 2012 International
to trigger the stack’s capacity bug; where Pf ail = 0, no test can
Symposium on Software Testing and Analysis (ISSTA), Minneapolis, MN, trigger the bug. As Figure 1(a) shows, this generator only rarely
Jul. 2012, http://doi.acm.org/10.1145/NNNNNNN.NNNNNNN

1
Number of pops per test case
50 testing makes significantly better use of a fixed CPU time budget
"no_swarm" than does random testing using a single test configuration, in terms
40 Pfail = 0 Pfail < 10-6
of both coverage and fault detection. For example, we performed
30
an experiment where two machines, differing only in that one used
swarm testing and one did not, used Csmith [36] to generate tests for
20 Pfail > 10-6
a collection of production-quality C compiler versions for x86-64.
During one week of testing, the swarm machine found 104 distinct
10 ways to crash compilers in the test suite whereas the other machine—
Pfail = 1 running the default Csmith test configuration, which enables all
0 features—found only 73. An improvement of more than 40% in
0 10 20 30 40 50 terms of number of bugs found, using a random tester that has been
Number of pushes per test case intensively tuned for several years, is surprising and significant.
(a) Random testing with uniform probabilities Even more surprising were some of the details. We found, for
example, a compiler bug that could only be triggered by programs
containing pointers, but which was almost never triggered by inputs
Number of pops per test case

50
that contained arrays. This is odd because pointer dereferences and
"swarm" array accesses are very nearly the same thing in C.1 Moreover, we
40 Pfail = 0 Pfail < 10-6
found another bug in the same compiler that was only triggered by
30 programs containing arrays, but which was almost never triggered by
inputs containing pointers. Fundamentally, it appears that omitting
20 Pfail > 10-6 features while generating random test cases can lead to improved
test effectiveness.
10
Pfail = 1
Our contributions are as follows. First, we characterize swarm
testing, a pragmatic variant of random testing that increases the
0
diversity of generated test cases with little implementation effort.
0 10 20 30 40 50
The swarm approach to diversity differs from previous methods
Number of pushes per test case in that it focuses solely on feature omission diversity: variance in
(b) Swarm testing which possible input features are not present in test cases. Second,
we show that—in three case studies—swarm testing offers improved
Figure 1: Swarm testing changes the distribution of test cases coverage and bug-finding power. Third, we offer some explanations
for a stack. If push and pop operations are selected with equal as to why swarm testing works.
probability, about 1 in 370,000 test cases will trigger a bug
in a 32-element stack’s overflow-detection logic. Swarm test- 2. SWARM TESTING
ing (note the test cases concentrated near the x-axis and y-axis
of Figure 1(b)) triggers this bug in about 1 of every 16 cases. Swarm testing uses test configurations that correspond to sets of
features of test cases. A feature is an attribute of generated test
inputs that the generator can directly control, in a computationally
produces test cases that can trigger the bug. efficient way. For example, an API-based test generator might define
Now consider a test generator based on swarm testing. This features corresponding to inclusion of API functions (e.g., push and
generator first chooses a non-empty subset of the stack API and pop); a C program generator might define features corresponding
then generates a test case using that subset. Thus, one-third of the to the use of language constructs (e.g., arrays and pointers); and
test cases contain both pushes and pops, one-third just pushes, and a media-player tester might define features over the properties of
one-third just pops. Figure 1(b) shows the distribution of test cases media files, e.g., whether or not the tester will generate files with
output by this generator. As is evident from the graph, this generator corrupt headers. In our work, a feature determines a configuration
often produces test cases that trigger the capacity bug. of test generation, not the System Under Test (SUT)—in this work
Although simple, this example illustrates the dynamics that make we use the same build of the SUT for all testing. In particular, we
swarm testing work. The low dimensionality of the stack example are configuring which aspects of the SUT will be tested (and not
is contrived, of course, and we certainly believe that programmers tested) only by controlling the test cases output. Features can be
should make explicit efforts to test boundary conditions. As evi- thought of simply as constraints on test cases, in particular those
denced by the results presented in this paper, however, swarm testing the test case generator lets us control.
generalizes to real situations in which there may be dozens of fea- Assume that a test configuration C is a set of features, f1 . . . fn . C
tures that can be independently turned on or off. It also generalizes is used as the input to a random testing function [8, 19] gen(C, s),
to testing real software in which faults are very well hidden. which given configuration C and seed s generates a test case for the
Every test generated by any swarm configuration can, in princi- SUT containing only features in C. We may ignore the details of
ple, be generated by a test configuration with all features enabled. how the exact test case is built. The values f1 . . . fn determine which
However—as the stack example illustrates—the probability of cov- features are allowed to appear in the test case. For example, ifwe are
ering parts of the state space and detecting certain faults can be testing a simple file system, the set of all features might be: read ,
demonstrably higher when a diverse set of configurations is tested. write , open , close , unlink , sync , mkdir, rmdir , unmount ,
Swarm testing has several important advantages. First, it is low mount . A typical default C would then be read , write , open ,
cost: in our experience, existing random test case generators already close , unlink , sync , mkdir , rmdir , which omits mount and
support or can be easily adapted to support feature omission. Second, unmount in order to avoid wasting test time on operations while the
swarm testing reduces the amount of human effort that must be file system is unmounted.
devoted to tuning the random tester. In our experience, tuning is a
significant ongoing burden. Finally—and most importantly—swarm 1 In C/C++, a[i] is syntactic sugar for *(a+i).

2
Assume that a test engineer has two CPUs available and 24 hours in a test does not always improve the ability of the test to cover
to test a file system. The conventional strategy would be to choose a behavior and expose faults: some features can actively suppress
“good” test case length, divide the set of random-number-generator the exhibition of some behaviors. Formally, we say that a feature
seeds into two sets, and simply generate, execute, and evaluate as suppresses a behavior in a given tester if, over the set of all test
many tests as possible on each CPU, with a single C. cases the tester in question can produce, test cases containing the
In contrast, a swarm approach to testing the same system, under suppressing feature are less likely to display the behavior than those
the same assumptions, would use a “swarm”—a set {C1 ,C2 , . . .Cn }. without the suppressing feature.
A fixed set could be chosen in advance, or a fresh Ci could be Furthermore, if we assume that some aspects of system state are
generated for each test. In most of our experimental results, we use affected more by some features than others, and assume that test
large but fixed-size sets, generated randomly. That is, we “toss a fair cases are limited in size, then by shifting the distribution of calls
coin” to determine feature presence or absence in C. In Section 3.1.3 within each test case (though not over all test cases), swarm testing
we discuss other methods for generating each C. With a fixed swarm results in a much higher probability of exploring “deep” values of
set, we divide the total time budget on each CPU such that each Ci state variables. Consider adding a top call that simply returns the
receives equal testing time, likely generating multiple tests for the top value of the stack to the ADT above. For every call to top in
each Ci . For testing without a fixed swarm set, we would simply a finite test case, the number of push calls possible is reduced by
keep generating a Ci and running a test until time is up. For the file one. Only if all features equally affect all state variables are swarm
system example, where there are nine features and thus 29 (512) and using just CD equally likely to explore “deep” states. Given
possible configurations, a fixed set might consist of 64 C, each of that real systems exhibit a strong degree of modularity and that API
which would receive a test budget of 45 minutes (48 CPU-hours calls and input features are typically designed to have predictable,
divided by 64)—a large number of tests would be generated for each localized effects on system state or behavior, this seems extremely
Ci . The default approach is equivalent to swarm testing if we use a unlikely. Many fault-detection or property-proof techniques, from
singleton set {CD }, where CD includes all features we are interested abstraction to compositional verification to k-wise combinatorial
in testing. Some Ci (those omitting features that slow down the testing, take this modularity of interaction for granted. We therefore
SUT) may generate tests that are quicker to execute than CD ; others hypothesize that many features “passively” suppress some behaviors
may produce tests that execute slower due to a high concentration of by “crowding out” relevant features in finite test cases.
expensive features. On average, the total number of tests executed Active and passive suppression mean that we may need tests that
will be similar to the standard approach, though perhaps with a exhibit a high degree of feature omission diversity, since we do not
greater variance. The distribution of calls made, over all tests, will know which features will suppress which behaviors, and features
also be similar to that found using just CD : each call will be absent are almost certain both to suppress some behaviors and be required
in roughly half of configurations, but will be called more frequently or at least helpful for producing others!
in other configurations. Why, then, might results from swarms differ
from those with the default approach? 2.2 Disadvantages of Configuration Diversity
An immediate objection to swarm testing is that it may significantly
2.1 Advantages of Configuration Diversity reduce the probability of detecting certain faults. Consider a file
The key insight motivating this paper is that the possibility of a system bug that can only be exposed by a combination of calls to
test being produced is not the same as the probability that it will read, write, open, mkdir, rmdir, unlink, and sync. Because
be produced. In particular, consider a fault that relies on making there is only a 1/128 chance that a given Ci will enable all of these,
64 calls to open, without any calls to close, at which point the it is likely that a swarm set of size 64 cannot find this fault. At
file descriptor table overflows and the file system crashes. If the first examination, it seems that the swarm approach to testing will
test length in both the default and swarm settings is fixed at 512 expose fewer faults and result in worse coverage than using a single
operations, we know that testing with CD is highly unlikely to expose inclusive C. Furthermore, recalling that any test possible with any
the fault: the rate is much less than 1 in 100,000 tests. With swarm, Ci in a swarm set is also possible under CD , but that some tests
on the other hand, many Ci (16 on average) will produce tests produced by the CD may be impossible to produce for almost all Ci ,
containing calls to open but no calls to close. Furthermore, some it may be hard to imagine how swarm can compensate.
of these Ci will also disable other calls, increasing the proportion This apparent disadvantage of swarm—that feature subsetting
of open calls made. For Ci such that open ∈ Ci but without close will necessarily miss some bugs—in fact has rather limited impact.
and at least one other feature, the probability of detecting the fault First, when features appear together only infrequently over Ci , this
improves from close to 0% to over 80%. If 48 hours of testing may lower the probability of finding the “right” test for a particular
produces approximately 100,000 tests, it is almost certain that using bug, but does not preclude it. Second, since other features will
CD will fail to detect the fault, and at the same time almost certain almost certainly be omitted from the few Ci that do contain the
that any swarm set of size 64 will detect it. The same argument right combination, the features may interact more than in CD —thus
holds even if we improve the chances of CD by assuming that close increasing the likelihood of finding the bug (Section 2.1) in each
calls do not decrement the file descriptor count: swarm is still much test.
more likely to produce any failure that requires many open calls. For bugs that can be discovered only when many features are
While such resource-exhaustion faults may seem to be a rare enabled, the relevant question is this: how likely is it that swarm test-
special case, the concept can be generalized. Obviously, many data ing will not include any Ci with all the needed features? Using the
structures, as in the stack example, may have overflow or underflow simplest form of coin-toss generation for swarm sets, the chance of
problems where one or more API calls moves the system away from a given set of k features never appearing together in any of C1 . . .Cn
exhibiting failure. In the file system setting, it seems likely that is (1 − 0.5k )n . Even a very small swarm set of 100 configurations
many faults related to buffering will be masked by calls to sync. is 95% likely to contain at least one Ci for any given choice of five
In a compiler, many potentially faulty optimizations will never be features. If a tester uses a swarm set size of 1,000 (as we do in
applied to code that contains pointer accesses, because of failed Section 3.2) there is a 95% chance of covering any given set of eight
safety checks based on aliasing. In other words, including a feature features, and a 60% chance with ten. If a tester believes that even

3
these probabilities are unsatisfying, a simple mitigation strategy is using any random C for each test, and we believe this may be the
to include CD in every swarm set. We chose not to do this in our best practice when it is possible.
experiments in order to heighten the value of our comparison with
the default, all-inclusive configuration strategy. 3.1 Case Study: YAFFS Flash File System
YAFFS2 [35] is a popular open-source NAND flash file system for
2.3 An Empirical Question embedded use; it is the default image format for the Android operat-
In general, all that can be said is that what we call CD may be opti- ing system. Our test generator for YAFFS2 produces random tests
mal for some hypothetical set of coverage targets and faults, while of any desired length and executes the tests using the RAM flash
a swarm set {C1 . . .Cn } will contain some Ci that are optimal for emulation mode. By default, tests can include or not include any of
different coverage targets and faults, but will perform less testing 23 core API calls, as specified by a command line argument to the
under each Ci than the conventional approach will perform under CD . test generator: these command line arguments are the Ci , and calls
Our hypothesis is that for many real-world programs, the conven- are features. Our tester generates a test case by randomly choosing
tional “CD approach” to testing will expose the same faults and cover an API from the feature set, and calling the API with random pa-
the same behaviors many times, while swarm testing may expose rameters (not influenced by our test configuration). This is repeated
more faults and cover more targets, but might well produce fewer n times to produce a length n test case, consisting of the API calls
failing tests for each fault and execute fewer tests that cover each and parameter choices. Feedback [17, 29] is used in the YAFFS2
branch/statement/path/etc. Given that the precise distribution of tester to ensure that calls such as close and readdir occur only
faults and coverage targets in the state space of any realistic system in states where valid inputs are available. We ran one experiment
is complex and not amenable to a priori analysis, only experimental with 100 test configurations and another with 500 configurations,
investigation of how the two random testing approaches compare all including API calls with 50% probability. Both sets were large
on real systems can give us practical insight into what strategies enough compared to the number of tests run to make unusual effec-
might be best for large-scale, random testing of critical systems. The tiveness or poor performance due to a small set of especially good
remainder of this paper shows how conventional, single-C testing or bad Ci highly unlikely. Both experiments compared 72 hours of
and swarm testing compare for coverage and fault detection on the swarm testing to 72 hours of testing with CD only, evaluating test
software in three case studies: (1) a widely used open-source flash effectiveness based on block, branch, du-path, prime path, path, and
file system, (2) seventeen versions of five widely used C compilers, mutation coverage [3]. For prime and du-paths, we limited lengths
and (3) a container in the widely used Sglib library. to a maximum of ten. Path coverage was measured at the function
The thesis of this paper is that turning features off during test level (that is, a path is the path taken from entry to exit of a single
case generation can lead to more effective random testing. Thus, function).
one of our evaluation criteria is that swarm should find more defects Both experiments used 532 mutants, randomly sampled from the
and lead to improved code coverage, when compared to the default space of all 12,424 valid YAFFS2 mutants, using the C program
configuration of a random tester, with other factors being equal. mutation approach (and software) shown to provide a good proxy for
In the context of any individual bug, it is possible to evaluate fault detection by Andrews et al. [4]. Unfortunately, evaluation on all
swarm in a more detailed fashion by analyzing the features found possible mutants would require prohibitive computational resources:
and not found in test cases that trigger the bug. We say that a test evaluation on 532 mutants required over 11 days of compute time.
case feature is significant with respect to some bug if its presence Random sampling of mutants has been shown to provide useful
or absence affects the likelihood that the bug will be found. For results in cases where full evaluation is not feasible [38]. Our
example, push operations are obviously a significant feature with sampled mutants were not guaranteed to be killable by the API calls
respect to the example in Section 1 because a call to push causes and emulation mode tested. We expect that considerably more than
the bug to manifest. But this is banal: it has long been known that half of these mutants lie in code that cannot execute due to our using
effective random testing requires “featureful” test inputs. The power YAFFS2’s RAM emulation in place of actual flash hardware, or due
of swarm is illustrated when the absence of one or more features to the set of YAFFS2 API calls that we test. Some unknown portion
is statistically significant in triggering a bug. Section 3 shows that of the remainder are semantically equivalent mutants. Unfortunately,
such (suppressing) features for bugs are commonplace, providing excluding mutants for these three cases statically is very difficult,
strong support for our claim that swarm testing is beneficial. and we do not want to prune out all hard-to-kill mutants—those
are precisely the mutants we want to keep! The only difference
between experiments, other than the number of configurations, was
3. CASE STUDIES that in the first experiment mutation coverage was computed online,
We evaluated swarm testing using three case studies in which we and was included in the test budget for each approach. The second
tested software systems of varying size and complexity. The first experiment did not count mutation coverage computations as part of
study was based on YAFFS2, a flash file system; the second (and the test budgets, but executed all mutation tests offline, in order to
largest) used seventeen versions of five production-quality C com- show how results changed with increased number of tests.
pilers; and the third, a “mini-study,” focused on a red-black tree
implementation. The file system and red-black tree were small 3.1.1 Results—Coverage and Mutants Killed
enough (15 KLOC and 476 LOC, respectively) to be subjected to Table 1 shows how swarm configuration affects testing of YAFFS2.
mutation testing. The compilers, on the other hand, were not conve- For each experiment, the first column of results shows how CD
niently sized for mutation testing, but provided something better: a performed, the second column shows the coverage for swarm testing,
large set of real faults that caused crashes or other abnormal exits. and the last column shows the coverage for combining the two
In all case studies, we used relatively small (n ≤ 1, 000) swarm test suites. Each individual test case contained 200 file system
sets, to show that swarm testing improves over using CD even with operations: the swarm test suite in the second experiment executed
relatively small sets, which may be necessary if there are complex a total of 1.17 million operations, and testing all mutants required
or expensive-to-check constraints on valid Ci . In practice, all of an additional 626 million YAFFS2 calls.
our case studies would support a much simpler approach of simply The first experiment (columns 2–4 of Table 1) shows that, despite

4
fd0 = yaffs_open("/ram2k/umtpaybhue",O_APPEND|O_EXCL
Table 1: YAFFS2 coverage results |O_TRUNC|O_RDWR|O_CREAT,S_IREAD);
# = number of tests; coverage: bl = blocks; br = branches; du = du-paths; yaffs_write(fd0, rwbuf, 9243);
pr = prime paths; pa = paths; mu = mutants killed fd2 = yaffs_open("/ram2k/iri",O_WRONLY|O_RDONLY|O_CREAT,
S_IWRITE);
swarm 100, mutants online swarm 500, offline fd3 = yaffs_open("/ram2k/iri",O_WRONLY|O_TRUNC|O_RDONLY
|O_APPEND,S_IWRITE);
CD Swarm Both CD Swarm Both
yaffs_write(fd3, rwbuf, 5884);
# 1,747 1,593 3,340 5,665 5,888 11,553 yaffs_write(fd3, rwbuf, 903);
bl 1,161 1,168 1,173 1,173 1,172 1,178 fd6 = yaffs_open("/ram2k/wz",O_WRONLY|O_CREAT,S_IWRITE);
br 1,247 1,253 1,261 1,261 1,259 1,268 yaffs_write(fd2, rwbuf, 3437);
yaffs_write(fd6, rwbuf, 8957);
du 2,487 2,507 2,525 2,525 2,538 2,552
yaffs_write(fd3, rwbuf, 2883);
pr 2,834 2,872 2,964 2,907 2,967 3,018 yaffs_write(fd3, rwbuf, 4181);
pa 14,153 25,484 35,478 35,432 64,845 91,280 yaffs_read(fd2, rwbuf, 8405);
mu 94 97 97 95 97 97 fd12 = yaffs_open("/ram2k/gddlktnkd",
O_TRUNC|O_RDWR|O_WRONLY|O_APPEND|O_CREAT, S_IREAD);
yaffs_write(fd0, rwbuf, 3387);
yaffs_write(fd12, rwbuf, 2901);
yaffs_write(fd12, rwbuf, 9831);
yaffs_freespace("/ram2k/wz");

executing 154 fewer tests, swarm testing improved on the default Figure 2: Operations in a minimized test case for killing
configuration in all coverage measures—the difference is particu- YAFFS2 mutant #62. The mutant returns an incorrect amount
larly remarkable for path coverage, where swarm testing explored of remaining free space.
over 10,000 more paths. The combined test suite results show that
100
default and swarm testing overlapped in most forms of coverage, but
that CD did explore some blocks, branches, and paths that swarm did

Percentage of Killing Test Cases


not. For pure path coverage, the two types of testing produced much 75
more disjoint coverage. Surprisingly, swarm was strictly superior in
terms of mutation kills: swarm killed three mutants that CD did not,
50
and killed every mutant killed by CD . On average, CD killed each
mutant 1,173 times. Swarm only killed each mutant (counting only
those killed by both test suites, to avoid any effect from swarm also 25
killing particularly hard-to-kill mutants) an average of 725 times.
An improvement of three mutants out of 94 may seem small, but a
better measure of fault detection capabilities may be kill rates for 0
ch
cl od
cl se
fc sed
fr mo r
fs sp
lin t ce
ls
ls k
m t
op dir
op n
re nd
re
re d
re dli
re am
rm ind
st ir ir
sy t
tr lin
un nca
w ink e
ee d

ee
ta

u k

rit
o
o

ta a
h i

a
k

ad ir
ad
a ir
n nk
w e

m
m

k
nontrivially detectable mutants. It seems reasonable to consider any

e
e

l t
d d

e
mutant killed by more than 10% of random tests (each with only
200 operations) to be uninteresting. Even quite desultory random
testing will catch such faults. Of the 97 mutants killed by CD , only Figure 3: 95% confidence intervals for the percentage of test
14 were killed by < 10% of tests, making swarm’s 17 nontrivial cases killing YAFFS2 mutant #62 containing each call
kills an improvement of over 20%.
In the second experiment (columns 5–7 of Table 1), test through- 3.1.2 Results—Significant Features
put was about 3× greater due to offline computation of mutation Figure 2 shows a delta-debugged [37] version of one of the swarm
coverage. Here CD covered slightly more blocks and branches than tests that killed mutant #62. Using CD never killed this mutant. The
the swarm tests. However, of the six blocks and nine branches cov- original line of code at line 2 of yaffs_UseChunkCache is:
ered only by CD , all but four (two of each) were low-probability if (dev->srLastUse < 0 || dev->srLastUse > 100000000)
behaviors of the rename operation. In file system development and
testing at NASA’s Jet Propulsion Laboratory, rename was by far the The mutant changes < 0 to > 0. The minimized test requires no
most complex and faulty operation, and we expect that this is true operations other than yaffs_open, yaffs_write, yaffs_read
for many file systems [17]. Discussion with the primary author of and yaffs_freespace. The five tests killing this mutant in the
YAFFS has confirmed that he also believes rename to be the most first experiment were all produced by two Ci , both of which disabled
complex function we tested. This result suggests a vulnerability (or close, lseek, symlink, link, readdir, and truncate. The uni-
intelligent use requirement) for swarm: if a single feature is expected versal omission of close, in particular, probably indicates that this
to account for a large portion of the behavioral complexity and po- API call actively interferes with triggering this fault: it is difficult to
tential faults in a system, it may well be best to set the probability perform the necessary write operations to expose the bug if files can
of that feature to more than 50%. Nonetheless, swarm managed be closed at any point in the test. The other missing features may
to produce better coverage for all other metrics, executing almost all indicate passive interference: without omitting a large number
30,000 more paths than testing under CD only, and still killed all of of features it is difficult to explore the space of combinations of
the mutants killed by CD , plus two additional mutants. The swarm open and write, and observe the incorrect free space, in the 200
advantage in nontrivial mutant kills was reduced to 13% (15 vs. 17 operations performed in each test.
kills). In fact, the additional 4,295 tests (with a different, larger, set Figure 3 shows which YAFFS2 test features were significant in
of Ci ) did not add mutants to the set killed by swarm, suggesting killing mutant #62. The 95% confidence intervals (computed using
good fault detection ability for swarm on YAFFS2, even with a small the Wilson score for a binomial proportion) are somewhat wide
test budget. Using CD once more tended towards “overkill,” with an because this mutant was killed only 10 times in the second experi-
average of 3,756 killing tests per mutant. Swarm only killed each ment. Calls to freespace, open, and write are clearly “triggers”
mutant an average of 2,669 times. (and almost certainly necessary) for this bug, while close is, as

5
100
Table 2: Top trigger and suppressor features for YAFFS2
Percentage of Killing Test Cases

75 Triggers Suppressors
fchmod 66% rename 29%
lseek 65% unlink 26%
50
read 62% link 24%
write 61% rewinddir 22%
25 open 58% closedir 12%
close 57% lstat 12%
fstat 56% opendir 12%
0
symlink 41% stat 11%
ch
cl od
cl se
fc ed
fr mo r
fs sp
lin t ce
ls
ls ek
m t
op dir
op n
re nd
re d
re d
re dli
re am
rm ind
st ir ir
sy t
tr lin
un nca
w ink e
ee d

e
ta

u k

rit
o
os

ta a
h i

a
k

a ir
ad
a ir
n nk
w e

m
m

e
e

l t
d d

e
closedir 35% mkdir 11%
opendir 35% readdir 9%
Figure 4: 95% confidence intervals for the percentage of test truncate 34% close 9%
cases killing mutant #400 containing each YAFFS2 API call rmdir 32% symlink 8%
chmod 32% truncate 8%
readdir 31% rmdir 5%
mkdir 30% chmod 5%
expected, an active suppressor. Calls to mkdir and rmdir are prob-
freespace 28% fstat 4%
ably passive suppressors (we have not discovered any mechanism
link 26% lseek 4%
for active suppression, at least); we know that in model checking
stat 25% readlink 4%
[16], omitting directory operations can give much better coverage
readlink 18% open 4%
of operations such as write and read.
unlink 16% freespace 4%
Figure 4 shows 95% confidence intervals for each feature in the
lstat 14% read 3%
137 test cases (from the second experiment, again) killing another
rename 11% write 3%
mutant that was only killed by swarm testing. This mutant negates a
rewinddir 10% fchmod 3%
condition in the code for marking chunks dirty: bi->pagesInUse
== 0 becomes bi->pagesInUse != 0. Here, freespace is again
Values in the table show the percentage of YAFFS2 mutants that were statis-
required, but only chmod acts as a major trigger. This mutant affects tically likely to be triggered (killed) or suppressed by test cases containing
a deeper level of block management, so either a file or directory the listed API calls.
can expose the fault: thus mkdir is a moderate trigger and open
is possibly a marginal trigger but neither is required, as either call
shows that picking any single C is likely to weaken testing. The
will set up conditions for exposure. A typical delta-debugged test
three features that suppress the most faults are, respectively, also
case killing this mutant includes 15 calls to chmod, but chmod is not
triggers for 11%, 16%, and 26% of mutants killed. The obvious
required to expose the fault—it is simply a very effective way to re-
active suppressors close and closedir are triggers for 57% and
peatedly force any directory entry to be rewritten to flash. Moreover,
35% of mutants killed, respectively.
rmdir completely suppresses the fault—presumably by deleting
directories before the chmod-induced problems can exhibit. It is
possible to imagine hand-tuning of the YAFFS2 call mix helping 3.1.3 Other Configuration Possibilities
to detect a fault like mutant #62. It is very hard to imagine a file Random generation with 50% probability per feature of omission
system expert, one not already aware of the exact fault, tuning a (“coin tossing”) is not the only way to build a set {C1 . . .Cn }. Al-
tester to expose a bug like mutant #400. though we did not explore this space in depth, we did perform some
As a side note, we suspect that these kinds of confidence interval limited comparisons of the coin-toss approach with using 5-way
graphs, which appear as a natural byproduct of swarm testing, may covering arrays from NIST [27] to produce Ci . A 5-way covering
be quite helpful as aids to fault localization, debugging, and program array guarantees that all 32 possible combinations for each set of 5
understanding. In principle such graphs may also be produced from features are covered in the set; we used 5-way coverage because we
CD , but with random testing and realistic test case sizes, the chance speculate that very few YAFFS2 bugs require more than 5 features to
of a feature not appearing in a test case is close to zero; even if expose. For the 23 YAFFS2 test features, a 5-way covering requires
measuring feature frequency provides useful information, which n = 167. Each Ci≤n has some features specified as “on,” some as
seems unlikely, this complicates producing useful graphs. “off,” and others as “don’t care.” The “don’t care” features can be
Table 2 shows which YAFFS2 features contributed to mutant included or excluded as desired.
killing and which features suppressed mutant killing. Percentages We compared coverage results for covering-array-based swarm
represent the fraction of killed mutants for which the feature’s pres- sets with “don’t care” set two ways—first to inclusion, then to
ence or absence was statistically significant. While mutants are not exclusion—with a swarm set using our coin-toss approach and with
necessarily good representatives of real faults (we show how swarm the {CD } only. Both non-random 5-way covering sets and random
performs for real faults below), the particular features that trigger swarm performed much better than CD for path coverage. The non-
and suppress the most bugs are quite surprising. For example, it is at random 5-way covering sets slightly improved on coin tossing if
first puzzling why fchmod is such a helpful feature. We believe this “don’t care” features were included, but gave lower path coverage
to be a result of the power of fchmod and chmod to “burn” through when they were omitted. Coin-toss swarm always performed best
flash pages quickly by forcing rewrites of directory entries for either (by at least six blocks and seven branches) for block and branch
a file or a directory. The most likely explanation for fchmod’s value coverage—both non-random 5-way covering sets performed slightly
over chmod lies in feedback’s ability to always select a valid file worse than CD for block and branch coverage.
descriptor, giving a rewrite of an open file; we know that open files These results at minimum suggest that random generation of Ci is
are more likely to be involved in file system faults. Table 2 also not obviously worse than some combinatorics-based approaches. As

6
Table 3: YAFFS2 37 API results Table 4: Distinct crash bugs found during one week of testing

Method bl br pa pr mu rt Compiler CD Swarm Both


Swarm 1,459 1,641 112,944 61,864 123 136 LLVM/Clang 2.6 10 12 14
CD 1,446 1,626 70,587 61,380 113 158 LLVM/Clang 2.7 5 6 7
LLVM/Clang 2.8 1 1 1
we would expect, comparison of coin-toss swarm with 5-way cov- LLVM/Clang 2.9 0 1 1
ering sets with all “don’t care” values picked via coin toss showed GCC 3.2.0 5 10 11
very little difference in performance, with pure coin-toss slightly GCC 3.3.0 3 4 5
better by some measures and the covering-array sets slightly better GCC 3.4.0 1 2 2
by other measures. Any set of unbiased random Ci of size 120 or GCC 4.0.0 8 8 10
greater is 99% likely to be 3-way covering, and 750 Ci are 99% GCC 4.1.0 7 8 10
likely to be 5-way covering [25]. Sets the size of those used in our GCC 4.2.0 2 5 5
primary YAFFS2 experiments are very likely 3-way covering, and GCC 4.3.0 7 8 9
quite possibly approximately 5-way covering.
GCC 4.4.0 2 3 4
Finally, we investigated the simplest approach to swarm: using a
GCC 4.5.0 0 1 1
new random Ci for each test. Table 3 compares 5,000 tests with CD
GCC 4.6.0 0 1 1
and 5,000 tests with random Ci . For these results, we were able to
use a new, much faster, version of our YAFFS2 tester, supporting 14 Open64 4.2.4 13 18 20
more features (calls) and computing a version of Ball’s predicate- Sun CC 5.11 5 14 14
complete test (PCT) coverage (pr) [9]. Since this result was based on Intel CC 12.0.5 4 2 5
equal tests, not equal time, we also show total test runtime in seconds Total 73 104 120
(rt), not counting mutant analysis or coverage computations, which
required an additional 21-27 hours. If Ci generation is inexpensive, considers two crashes of the same compiler to be distinct if and only
choosing a random Ci for each test simplifies swarm and produces if the compiler tells us that it crashed in two different ways. For
very good results, at least for YAFFS2: a set of tests based on full example
random swarm takes less time to run than a CD based suite and
produces universally better coverage, including an additional 10 internal compiler error: in vect_enhance_data_refs_alignment,
at tree-vect-data-refs.c:1550
mutant kills.
and
3.2 Case Study: C Compilers
internal compiler error: in vect_create_epilog_for_reduction,
Csmith [36] is a random C program generator; its use has resulted at tree-vect-loop.c:3725
in more than 400 bugs being reported to commercial and open-
source compiler developers. Most of these reports have led to are two distinct ways that GCC 4.6.0 can crash. We believe that
compiler defects being fixed. Csmith generates test cases in the this metric represents a conservative estimate of the number of true
form of random C programs. A Csmith test configuration is a set C compiler bugs. Our experience—based on hundreds of bug reports to
language features that can be included in these generated random real compiler teams—is that it is almost always the case that distinct
C programs. In most cases the feature is essentially a production error messages correspond to distinct bugs. The converse is not
rule in the grammar for C programs. By default, Csmith errs on the true: many different bugs may hide behind a generic error message
side of expressiveness: CD emits test cases containing all supported such as Segmentation fault. Our method for counting crash
parts of the C language. Command line arguments can prohibit errors may over-count in the case where we are studying multiple
Csmith from including any of a large variety of features in generated versions of the same compiler, and several of these versions contain
programs, however. To support swarm testing, we did not have to the same (unfixed) bug. However, because the symptom of this kind
modify Csmith in any way: we simply called the Csmith tool with of bug typically changes across versions (e.g., the line number of an
arguments for our test configuration, and compiled the resulting assertion failure changes due to surrounding code being modified), it
random C program with each compiler to be tested. Feature control is difficult to reliably avoid double-counting. We did not attempt to
had previously been added by the Csmith developers (including do so. However, as noted below, our results retain their significance
ourselves) to support testing compilers for embedded platforms if we simply consider the single buggiest member of each compiler
that only compile subsets of the C language, and for testing and family.
debugging Csmith itself. We tested each compiler using vanilla optimization options rang-
ing from “no optimization” to “maximize speed” and “minimize
3.2.1 Methodology size.” For example, GCC and LLVM/Clang were tested using -O0,
We used Csmith to generate random C programs and fed these -O1, -O2, -Os, and -O3. We did not use any of the architecture or
programs to 17 compiler versions targeting the x86-64 architecture; feature-specific options (e.g., GCC’s -m3dnow or Intel’s -openmp)
these compilers are listed in Table 4. While these 17 versions options that typically make compilers extremely easy to crash.
arise from only 5 different base compilers, in our experience major We generated 1,000 unique Ci , each of which included some of
releases of the GCC and LLVM compilers are quite different, in the following (with 50% probability for each feature):
terms of code base as well as, most critically, new bugs introduced • declaration of main() with argc and argv
and old bugs fixed. All of these tools are (or have been) in general • the comma operator, as in x = (y, 1);
use to compile production code. All compilers were run under Linux. • compound assignment operators, e.g. x += y;
When possible, we used the pre-compiled binaries distributed by the • embedded assignments, as in x = (y = 1);
compilers’ vendors. • the auto-increment and auto-decrement operators ++ and --
Our testing focused on distinct compiler crash errors. This metric • goto

7
Table 5: Compiler code coverage Table 6: Top trigger and suppressor features for C compilers

Compiler Metric CD Swarm Change Triggers Suppressors


(95% conf.) Pointers 33% Pointers 41%
line 95,021 95,695 446–903 Arrays 31% Embedded assignments 24%
Clang branch 63,285 64,052 619–915 Structs 29% Jumps 21%
function 43,098 43,213 37–193 Volatiles 21% Arrays 17%
line 142,422 144,347 1,547–2,303 Bitfields 15% ++ and – 16%
GCC branch 114,709 116,664 1,631–2,377 Embedded assignments 15% Volatiles 15%
function 9,177 9,263 61–112 Consts 13% Unions 13%
Comma operator 11% Comma operator 11%
• integer division Jumps 11% Long long ints 11%
• integer multiplication Unions 11% Compound assignments 11%
• long long integers Packed structs 10% Bitfields 10%
• 64-bit math operations Long long ints 10% Consts 10%
• structs 64-bit math 10% Volatile pointers 10%
• bitfields Integer division 8% 64-bit math 8%
• packed structs Compound assignments 8% Structs 7%
• unions Integer multiplication 6% Packed structs 7%
• arrays
Values in the table show the percentage of compiler crash bugs that were
• pointers statistically likely to be triggered or suppressed by test cases containing the
• const-qualified objects listed C program features.
• volatile-qualified objects
• volatile-qualified pointers intervals for the increase in coverage, we ran seven 24-hour tests for
3.2.2 Results—Distinct Bugs Found each compiler and for each of swarm and CD . The absolute values
of these results should be taken with a grain of salt: LLVM/Clang
The top-level result from this case study is that with all other factors and GCC are both large (2.6 MLOC and 2.3 MLOC, respectively)
being equal, a week of swarm testing on a single, reasonably fast and contain much code that is impossible to cover during our tests.
machine found 104 distinct ways to crash our collection of compil- We believe that these incremental coverage values—for example,
ers, whereas using CD (the way we have always run Csmith in the around 2,000 additional branches in GCC were covered—support
past) found only 73—an improvement of 42%. Table 4 breaks these our claim that swarm testing provides a useful amount of additional
results down by compiler. A total of 47,477 random programs were test coverage.
tested under CD , of which 22,691 crashed at least one compiler.2 A
total of 66,699 random programs were tested by swarm, of which 3.2.4 Results—Significant Features
15,851 crashed at least one compiler. Thus, swarm found 42% more
During the week-long swarm test run described in Section 3.2.2,
distinct ways to crash a compiler while finding 30% fewer actual
swarm testing found 104 distinct ways to crash a compiler in our
instances of crashes. Test throughput increased for the swarm case
test set. Of these 104 crash symptoms, 54 occurred enough times
because simpler test cases (i.e., those lacking some features) are
for us to analyze crashing test cases for significant features. (Four
faster to generate and compile.
occurrences are required for a feature present in all four test cases to
To test the statistical significance of our results, we split each
become recognizable as not including the baseline 50% occurrence
of the two one-week tests into seven independent 24-hour periods
rate for the feature in its 95% confidence interval.) 52 of these 54
and used the t-test to check if the samples were from different
crashes had at least one feature whose presence was significant and
populations. The resulting p-value for the data is 0.00087, indicating
42 had at least one feature whose absence was significant.
significance at the 99.9% level. We also normalized for number
Table 6 shows which of the C program features that we used in
of test cases, giving CD a 40% advantage in CPU time. Swarm
this swarm test run were statistically likely to trigger or suppress
remained superior in a statistically significant sense.
crash bugs. Some of these results, such as the frequency with which
Even if we attribute some of this success to over-counting of bugs
pointers, arrays, and structs trigger compiler bugs, are unsurprising.
across LLVM or GCC versions, we observe that taking only the
On the other hand, we did not expect pointers, embedded assign-
most buggy version of each compiler (thus eliminating all double
ments, jumps, arrays, or the auto increment/decrement operators to
counting), swarm revealed 56 distinct faults compared to only 37
figure so highly in the list of bug suppressors.
for CD , which is actually a larger improvement (51%) than in the
We take two lessons from the data in Table 6. First, some features
full case study. Using CD detected more faults than swarm in only
(most notably, pointers) strongly trigger some bugs while strongly
one compiler version, Intel CC 12.0.5.
suppressing others. This observation directly motivates swarm test-
3.2.3 Results—Code Coverage ing. Second, our intuitions (built up over the course of reporting 400
compiler bugs) did not serve us well in predicting which features
Table 5 shows the effect that swarm testing has on coverage of two would most often trigger and suppress bugs.
compilers: LLVM/Clang 2.9 and GCC 4.6.0. To compute confidence
2 We realize that it may be hard to believe that nearly half of random 3.2.5 An Example Bug
test cases would crash some compiler. Nevertheless, this is the case. A bug we found in Clang 2.6 causes it to crash when compiling—at
The bulk of the “easy” crashes come from Open64 and Sun CC, any optimization level—the code in Figure 5, with this message:
which have apparently not been the target of much random testing.
Clang, GCC, and Intel CC are substantially more robust, particularly Assertion ‘NextFieldOffsetInBytes <= FieldOffsetInBytes &&
in recent versions. "Field offset mismatch!"’ failed.

8
struct S1 {
int f0; Table 7 compares coin-toss swarm sets of two sizes (one set of ten
char f1; Ci and another of twenty), 2-way and 3-way covering-array swarm
} __attribute__((packed)); sets, a complete swarm set (all 127 feature combinations), and the
struct S2 {
char f0; default strategy, all for test cases of length ten. The benefits of
struct S1 f1; swarm here are limited: with length-10 test cases and only seven
int f2; features, each feature already has a 20% chance of being omitted
};
struct S2 g = { 1, { 2, 3 }, 4 };
from any given test. The best value for each coverage type is shown
int foo (void) { in bold, and the worst in italics. The swarm sets in this experiment
return g.f0; outperformed CD by a respectable margin for every kind of coverage.
}
The results for the size-20 coin toss, however, show that where the
Figure 5: Code triggering a crash bug in Clang 2.6 benefits of swarm over the default in terms of diversity are marginal,
a bad set can make testing less effective. It is also interesting
to note that even when it is easy to do, complete coverage of all
Percentage of Triggering Test Cases

100 combinations does not do best in all coverage metrics, and increased
75
k for covering is also sometimes harmful.

50 3.4 Threats to Validity


25 The statistical analysis strongly supports the claim that the swarm
treatment did indeed have positive effects for Csmith, including
0
increased code coverage for GCC and LLVM. While there is a
Ar
Ar /
Bi ys v
Co eld
Co m
Co po pe
In sts d tor
Em ger
+ d is
Ju an d a n
Lo s
64 lo
In it
Pa ger th
Po ed ul
St ter tru lica
U cts
Vo ns
Vo tile
ni
+ d io
te

m d

te m nts
tfi

ru s cts ti
gc
ra arg

ng

ck m
in s tip

la
la p
-b ng
m s
m ao
n un ra

be div

o
p

threat from multiple counting of bugs across GCC and LLVM ver-
til oi
es nt
e
-- ssi

sions, the overall fault-detection advantage of swarm increased if we


i
as

er
si

s
gn
gn

considered only the version of each compiler with the most faults.
m
m

on
en
en

The YAFFS2 coverage results are more anecdotal and varied. For
ts
ts

the mutation results, the 95% confidence intervals on features show-


Figure 6: 95% confidence intervals for the percentage of test ing that some features were included in no killing tests support the
cases triggering the Clang bug triggered by the code in Figure 5 claim that detection of these particular mutants is a result of swarm’s
containing each Csmith program feature configuration diversity, as CD is extremely unlikely to produce tests
of the needed form (e.g., without any close/mkdir/rmdir calls).
This crash happened 395 times during our test run. Figure 6 Sampling more mutants would increase our confidence in these
shows that packed structs (and therefore, of course, also structs) results, but we believe it is safe to say we found at least 5 mutants
are found in 100% of test cases triggering this bug. This is not that statistically, CD , will not kill with reasonable-sized test suites;
surprising since the error is in Clang’s logic for dealing with packed we found no such mutants swarm is unlikely to kill.
structs. The only other feature that has a significant effect on the The primary threats come from external validity: use of limited
incidence of this bug is bitfields, which suppresses it. We looked at systems/feature definitions limits generalization. However, file sys-
the relevant Clang source code and found that the assertion violation tems and compilers are good representatives of programs for which
happens in a function that helps build up a struct by adding its next people will devote the effort to build strong random testers.
field. It turns out that this (incorrect) function is not called when a
struct containing bitfield members is being built. This explains why
the presence of bitfields suppress this bug. 4. RELATED WORK
Swarm testing is a low-cost and effective approach to increasing
3.3 Miniature Case Study: Sglib RBTree

Our primary target for swarm testing is complex systems software with many features. However, swarm testing can also be applied to simpler SUTs. To evaluate how swarm performs on smaller programs, we applied it to the red-black tree implementation in the widely used Sglib library. The implementation is 476 lines of code and has seven API calls (the features). A test case, as with YAFFS2, is a sequence of API calls with parameters. Like the YAFFS2-37 API results, these results include PCT coverage. Test-case execution time varied only trivially with C, so we were able to simply perform 20,000 tests for each experiment.
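As a concrete illustration of this kind of harness, the sketch below generates a random test (a sequence of API calls with parameters) under a test configuration that omits some features. It is a minimal sketch under stated assumptions: the feature names, parameter ranges, and the 50% omission probability are placeholders for illustration, not Sglib's actual identifiers or the exact harness used in the experiments.

    import random

    # Hypothetical stand-ins for the seven Sglib red-black tree API calls.
    FEATURES = ["add", "delete", "find", "find_min", "find_max",
                "traverse", "size"]

    def swarm_config(rng, p_omit=0.5):
        # One test configuration: each feature is omitted with probability p_omit.
        enabled = {f for f in FEATURES if rng.random() >= p_omit}
        return enabled or set(FEATURES)      # never use an empty configuration

    def random_test(config, length, rng):
        # A test case is a sequence of (API call, argument) pairs drawn only
        # from the calls enabled by the configuration.
        calls = sorted(config)
        return [(rng.choice(calls), rng.randrange(1000)) for _ in range(length)]

    rng = random.Random(0)
    config = swarm_config(rng)
    test = random_test(config, length=50, rng=rng)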
Table 7: Sglib red-black tree coverage results

Method         bl    br    pa     pr    mu
Coin-Toss 10   181   206   169   2,839  190
Coin-Toss 20   157   182   165   2,469  175
2-Way Cover    182   209   219   2,823  188
3-Way Cover    169   209   167   2,518  187
Complete       176   203   192   2,688  190
CD             170   192   149   2,504  187

Each feature already has a 20% chance of being omitted from any given test. The best value for each coverage type is achieved by one of the swarm sets; the worst value comes from the size-20 coin-toss set for every metric except one, for which CD is worst. The swarm sets in this experiment outperformed CD by a respectable margin for every kind of coverage. The results for the size-20 coin toss, however, show that where the benefits of swarm over the default in terms of diversity are marginal, a bad set can make testing less effective. It is also interesting to note that, even when it is easy to do, complete coverage of all combinations does not do best in all coverage metrics, and increasing k for the covering arrays is also sometimes harmful.
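The swarm sets compared in Table 7 can be viewed as different ways of sampling the space of feature subsets. The sketch below shows two of them under stated assumptions: a coin-toss set built by omitting each feature independently with some probability (whether "Coin-Toss 10/20" names the set size or the omission probability is not spelled out in this excerpt), and the complete set of all non-empty feature subsets, which is feasible here only because there are seven features (2^7 - 1 = 127 configurations).

    import itertools, random

    FEATURES = ["f%d" % i for i in range(7)]     # seven features; names immaterial

    def coin_toss_set(features, n_configs, p_omit, seed=0):
        # A swarm set of n_configs configurations; each feature is omitted
        # independently with probability p_omit (empty configurations are skipped).
        rng = random.Random(seed)
        configs = []
        while len(configs) < n_configs:
            cfg = frozenset(f for f in features if rng.random() >= p_omit)
            if cfg:
                configs.append(cfg)
        return configs

    # The complete set: every non-empty subset of the seven features.
    complete = [frozenset(c)
                for r in range(1, len(FEATURES) + 1)
                for c in itertools.combinations(FEATURES, r)]
    assert len(complete) == 2 ** len(FEATURES) - 1   # 127 configurations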
3.4 Threats to Validity

The statistical analysis strongly supports the claim that the swarm treatment did indeed have positive effects for Csmith, including increased code coverage for GCC and LLVM. While there is a threat from multiple counting of bugs across GCC and LLVM versions, the overall fault-detection advantage of swarm increased if we considered only the version of each compiler with the most faults. The YAFFS2 coverage results are more anecdotal and varied. For the mutation results, the 95% confidence intervals on feature inclusion, showing that some features were included in no killing tests, support the claim that detection of these particular mutants is a result of swarm's configuration diversity, as CD is extremely unlikely to produce tests of the needed form (e.g., without any close/mkdir/rmdir calls). Sampling more mutants would increase our confidence in these results, but we believe it is safe to say that we found at least 5 mutants that, statistically, CD will not kill with reasonable-sized test suites; we found no such mutants that swarm is unlikely to kill.
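A rough calculation shows why CD is so unlikely to produce tests of that form; the feature count and test length below are assumed for illustration and are not the paper's numbers.

    # Probability that a default-configuration (CD) test making n calls drawn
    # uniformly from k API features contains no close/mkdir/rmdir call at all,
    # versus the fraction of swarm configurations that omit all three features.
    k, n = 37, 200                        # assumed feature count and test length
    p_default = ((k - 3) / k) ** n        # about 5e-8 under CD
    p_swarm = 0.5 ** 3                    # 1/8 of 50%-omission swarm configurations
    print(p_default, p_swarm)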
The primary threats are to external validity: the use of a limited set of systems and feature definitions limits generalization. However, file systems and compilers are good representatives of the kinds of programs for which people will devote the effort to build strong random testers.
4. RELATED WORK

Swarm testing is a low-cost and effective approach to increasing the diversity of (randomly generated) test cases. It is inspired by swarm verification [21], which runs a model checker in multiple search configurations to improve coverage of large state spaces. The core idea of swarm verification is that given a fixed time/memory budget, a "swarm" of diverse searches can explore more of the state space than a single search. Swarm verification is successful in part because a single "best" search cannot easily exploit parallelism: the communication overhead for checking whether states have already been visited by another worker gives a set of independent searches an advantage. This advantage disappears in random testing: runs are always completely independent. The benefits of our swarm thus do not depend on any loss of parallelism inherent in the default test configuration or on the failure of depth-first search to "escape" a subtree when exploring a very large state space [14]. Our results reflect the value of (feature omission) diversity in test configurations. Swarm verification and swarm testing are orthogonal approaches: swarm verification could be applied in combination with feature omission to produce further state-space exploration diversity.

Another related area is configuration testing, which diversifies the SUT itself (e.g., for programs with many possible builds) [13, 32]. As noted above, we vary the "configuration" of the test generation system to produce a variety of tests, rather than varying the SUT. Configuration testing is thus also orthogonal to our work. In practice, our test configurations often do not require new builds of even the test generation system, but only require use of different command-line arguments to constrain tests.
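For example, a test configuration can often be realized simply by passing feature-disabling options to the generator. The flag names in this small sketch are hypothetical and do not correspond to any particular tool's options.

    def config_to_args(config, all_features):
        # Turn a configuration (the set of enabled features) into hypothetical
        # command-line options that disable everything else.
        return ["--no-" + f for f in sorted(set(all_features) - set(config))]

    # e.g., omitting "bitfields" and "pointers" from a feature set might yield
    #   ["--no-bitfields", "--no-pointers"]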
Another approach that may be conceptually related is the idea of testability transformation proposed by Korel et al. [23]. While considerably more heavyweight than swarm, and aimed at improving source-based test data generation rather than random testing, the idea of "slicing away" some parts of the program under test is in some ways like configuration testing, but is directed by picking a class of test cases on which the slice is based (those hitting a given target).

In partition testing [18] an input space is divided into disjoint partitions, and the goal of testing is to sample (cover) at least one input from each partition. Two underlying assumptions are usually that (1) the partition forces diversity and (2) inputs from a single partition are largely interchangeable. Without a sound basis for the partitioning scheme (which can be hard to establish), partition testing can perform worse than pure random testing [18]. Category-partition testing [28] may improve partition testing, through programmer identification of functional units and parameters and external conditions that affect each unit. The kind of thinking used in category-partition testing could potentially be used in determining features in a random tester. Swarm testing differs from partition testing and category-partition testing partly in that it has no partitions which must be disjoint and cover the space, only a set of features which somehow constrain generated tests.
Combinatorial testing [24, 26] seeks to test input parameters in combinations, with the goal of either complete test coverage for interactions of parameters, as in k-wise testing, or reducing total testing time while maximizing diversity (in a linear arithmetical sense) when k-way coverage of combinations is too expensive, as in orthogonal array testing [30]. Combinatorial techniques can be used to generate swarm sets (the input parameters are the features). Combinatorial testing has been shown to be quite effective for testing appropriate systems, almost as effective as exhaustive testing.

Swarm testing differs from partition testing and combinatorial testing approaches primarily in the number of tests associated with a "configuration." In partition approaches, each partition typically includes a large number of test cases, but coverage is usually based on picking one test from each partition. Swarm does not aim at exhaustively covering a set of partitions (such as the cross-product of all feature values), but may generate many tests for the same test configuration. Traditional combinatorial testing is based on actual input parameter values: each combination in the covering array will result in one test case. In swarm testing, a combination of features defines only a constraint on test cases, and thus a very large or even infinite set of tests. Many tests may be generated from the set defined by a test configuration. The use of combinatorial techniques in generating test configurations, rather than actual tests, merits further study.
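As a sketch of that direction, the following greedy construction grows a set of configurations until every pair of features has appeared in all four enabled/omitted combinations. It is an illustration under assumptions of our own, not the method used to build the 2-way set in Table 7.

    import itertools, random

    def pairwise_config_set(features, candidates=50, seed=0):
        # Greedy 2-way covering: repeatedly add the random candidate
        # configuration that covers the most still-uncovered
        # (feature, feature, on/off, on/off) combinations.
        rng = random.Random(seed)
        uncovered = {(f, g, a, b)
                     for f, g in itertools.combinations(features, 2)
                     for a in (True, False) for b in (True, False)}
        configs = []
        while uncovered:
            best, best_new = None, set()
            for _ in range(candidates):
                cand = {f: rng.random() < 0.5 for f in features}
                new = {t for t in uncovered
                       if cand[t[0]] == t[2] and cand[t[1]] == t[3]}
                if len(new) > len(best_new):
                    best, best_new = cand, new
            if best is None:                  # force progress on a leftover pair
                f, g, a, b = next(iter(uncovered))
                best = {h: rng.random() < 0.5 for h in features}
                best[f], best[g] = a, b
                best_new = {t for t in uncovered
                            if best[t[0]] == t[2] and best[t[1]] == t[3]}
            configs.append(frozenset(f for f, on in best.items() if on))
            uncovered -= best_new
        return configs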
Adaptive random testing (ART) [11] modifies random testing by sampling the space of tests and only executing those most "distant," as determined by a distance metric over inputs, from all previously executed tests. Many variations on this approach have been proposed. Unlike ART, swarm testing does not require the work, whether human effort or computational, of using a distance metric. The kind of "feature breakdown" that swarm requires is commonly provided by test generators; in our experience developing over a dozen test harnesses, we implemented the generator-configuration options long before we considered utilizing them as part of any methodology other than simply finding a good default configuration; implementing "features" where these are API call choices or grammar productions is usually almost trivial. Swarm testing has been applied to real-world systems with encouraging results; ART has not always been shown to be effective for complex real-world programs [7], and has mostly been applied to programs with numeric inputs.

More generally, structural testing [33], statistical testing [31], many meta-heuristic testing approaches [1], and even concolic testing [15] can be viewed as aiming at a set of test cases exhibiting diversity in the targets they cover—e.g., statements, branches, or paths [10]. Other approaches make diversity explicit—e.g., in looking for operational abstractions [20] or contract violations [34], or in feedback [29]. Some of these techniques are, given sufficient compute time and an appropriate SUT, highly effective. Swarm is a more lightweight technique than most of these approaches, which often require symbolic execution, considerable instrumentation, or machine learning. The techniques that are most scalable while remaining effective often focus on a certain kind of application and type of fault. Some approaches refine their notion of diversity in such a way that future exploration relies on past results, making them nontrivial to parallelize. Swarm testing is inherently massively parallel. Finally, swarm testing is in principle applicable to any software-testing (or model-checking) approach that can use a test configuration, including many of those discussed above, whereas, e.g., ART is tied to random testing. We have performed preliminary experiments in using swarm to improve the scalability of bounded model checking of C programs [12], with some success [2].

5. CONCLUSIONS AND FUTURE WORK

Swarm testing relies on the following claim: for realistic systems, randomly excluding some features from some tests can improve coverage and fault detection, compared to a test suite that potentially uses every feature in every test. The benefit of using a single, inclusive, default configuration—that every test can potentially expose any fault and cover any behavior, heretofore usually taken for granted in random testing—does not, in practice, make up for the fact that some features can, statistically, suppress behaviors. Effective testing therefore may require feature omission diversity. We show that this holds not only for simple container-class examples (e.g., pop operations suppress stack overflow) but also for a widely used flash file system and 14 out of 17 versions of five production-quality C compilers. For these real-world systems, if we compare testing with a single inclusive configuration to testing with a set of 100–1,000 unique configurations, each omitting features with 50% probability per feature, we have observed (1) significantly better fault detection, (2) significantly better branch and statement coverage, and (3) strictly superior mutant detection. Test configuration diversity does indeed produce better testing in many realistic situations.

Swarm testing was inspired by swarm verification, and we hope that its ideas can be ported back to model checking. We also plan to investigate swarm in the context of bounded exhaustive testing and learning-based testing methods. Finally, we believe there is room to better understand why swarm provides its benefits, particularly in the context of large, idiosyncratic SUTs such as compilers, virtual machines, and OS kernels. More case studies will be needed to generate data to support this work. We also plan to investigate how swarm testing's increased diversity of code coverage in test cases can benefit fault localization and program understanding algorithms that rely on test cases [22]; traditional random tests are far more homogeneous than swarm tests.

We have made Python scripts supporting swarm testing available at http://beaversource.oregonstate.edu/projects/cswarm/browser/release.
6. ACKNOWLEDGMENTS

The authors thank the anonymous reviewers for their comments, Jamie Andrews for the use of mutation generation code for C, and Amin Alipour, Mike Ernst, Patrice Godefroid, Gerard Holzmann, Rajeev Joshi, Alastair Reid, and Shalini Shamasunder for helpful comments on the swarm testing concept. A portion of this research was funded by NSF grant CCF–1054786.
7. REFERENCES

[1] S. Ali, L. C. Briand, H. Hemmati, and R. K. Panesar-Walawege. A systematic review of the application and empirical investigation of search-based test case generation. IEEE Trans. Software Eng., 36(6):742–762, Nov./Dec. 2010.
[2] A. Alipour and A. Groce. Bounded model checking and feature omission diversity. In Proc. CFV, Nov. 2011.
[3] P. Ammann and J. Offutt. Introduction to Software Testing. Cambridge University Press, 2008.
[4] J. H. Andrews, L. C. Briand, and Y. Labiche. Is mutation an appropriate tool for testing experiments? In Proc. ICSE, pages 402–411, May 2005.
[5] J. H. Andrews, A. Groce, M. Weston, and R.-G. Xu. Random test run length and effectiveness. In Proc. ASE, pages 19–28, Sept. 2008.
[6] J. H. Andrews, F. C. H. Li, and T. Menzies. Nighthawk: A two-level genetic-random unit test data generator. In Proc. ASE, pages 144–153, Nov. 2007.
[7] A. Arcuri and L. Briand. Adaptive random testing: An illusion of effectiveness? In Proc. ISSTA, pages 265–275, July 2011.
[8] A. Arcuri, M. Z. Iqbal, and L. Briand. Formal analysis of the effectiveness and predictability of random testing. In Proc. ISSTA, pages 219–230, July 2010.
[9] T. Ball. A theory of predicate-complete test coverage and generation. In Proc. FMCO, pages 1–22, Nov. 2004.
[10] T. Y. Chen. Fundamentals of test case selection: Diversity, diversity, diversity. In Proc. SEDM, pages 723–724, June 2010.
[11] T. Y. Chen, H. Leung, and I. K. Mak. Adaptive random testing. In Advances in Computer Science, volume 3321 of LNCS, pages 320–329. Springer, 2004.
[12] E. Clarke, D. Kroening, and F. Lerda. A tool for checking ANSI-C programs. In Proc. TACAS, volume 2988 of LNCS, pages 168–176. Springer, 2004.
[13] H. Dai, C. Murphy, and G. Kaiser. Configuration fuzzing for software vulnerability detection. In Proc. ARES, pages 525–530, Feb. 2010.
[14] M. B. Dwyer, S. Elbaum, S. Person, and R. Purandare. Parallel randomized state-space search. In Proc. ICSE, pages 3–12, May 2007.
[15] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In Proc. PLDI, pages 213–223, June 2005.
[16] A. Groce. (Quickly) testing the tester via path coverage. In Proc. WODA, pages 22–28, July 2009.
[17] A. Groce, G. Holzmann, and R. Joshi. Randomized differential testing as a prelude to formal verification. In Proc. ICSE, pages 621–631, May 2007.
[18] D. Hamlet and R. Taylor. Partition testing does not inspire confidence. IEEE Trans. Software Eng., 16(12):1402–1411, Dec. 1990.
[19] R. Hamlet. Random testing. In Encyclopedia of Software Engineering, pages 970–978. Wiley, 1994.
[20] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. ICSE, pages 60–71, May 2003.
[21] G. J. Holzmann, R. Joshi, and A. Groce. Swarm verification techniques. IEEE Trans. Software Eng., 37(6):845–857, Nov./Dec. 2011.
[22] J. A. Jones and M. J. Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In Proc. ASE, pages 273–282, Nov. 2005.
[23] B. Korel, M. Harman, S. Chung, P. Apirukvorapinit, R. Gupta, and Q. Zhang. Data dependence based testability transformation in automated test generation. In Proc. ISSRE, pages 245–254, Nov. 2005.
[24] D. R. Kuhn, D. R. Wallace, and A. M. Gallo Jr. Software fault interactions and implications for software testing. IEEE Trans. Software Eng., 30(6):418–421, June 2004.
[25] J. Lawrence, R. N. Kacker, Y. Lei, D. R. Kuhn, and M. Forbes. A survey of binary covering arrays. Electron. J. Comb., 18(1), 2011.
[26] C. Nie and H. Leung. A survey of combinatorial testing. ACM Comput. Surv., 43(2), Jan. 2011.
[27] NIST. NIST covering array tables. http://math.nist.gov/coveringarrays/ipof/tables/table.5.2.html, Feb. 2008.
[28] T. J. Ostrand and M. J. Balcer. The category-partition method for specifying and generating functional tests. CACM, 31(6):676–686, June 1988.
[29] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball. Feedback-directed random test generation. In Proc. ICSE, pages 75–84, May 2007.
[30] M. S. Phadke. Planning efficient software tests. CrossTalk, 10(10):11–15, Oct. 1997.
[31] S. Poulding and J. A. Clark. Efficient software verification: Statistical testing using automated search. IEEE Trans. Software Eng., 36(6):763–777, Nov./Dec. 2010.
[32] X. Qu, M. B. Cohen, and G. Rothermel. Configuration-aware regression testing: An empirical study of sampling and prioritization. In Proc. ISSTA, pages 75–86, July 2008.
[33] P. Thévenod-Fosse and H. Waeselynck. An investigation of statistical software testing. J. Software Testing, Verification and Reliability, 1(2):5–25, 1991.
[34] Y. Wei, H. Roth, C. A. Furia, Y. Pei, A. Horton, M. Steindorfer, M. Nordio, and B. Meyer. Stateful testing: Finding more errors in code and contracts. Computing Research Repository, Aug. 2011.
[35] YAFFS: A flash file system for embedded use. http://www.yaffs.net/.
[36] X. Yang, Y. Chen, E. Eide, and J. Regehr. Finding and understanding bugs in C compilers. In Proc. PLDI, pages 283–294, June 2011.
[37] A. Zeller and R. Hildebrandt. Simplifying and isolating failure-inducing input. IEEE Trans. Software Eng., 28(2):183–200, Feb. 2002.
[38] L. Zhang, S.-S. Hou, J.-J. Hu, T. Xie, and H. Mei. Is operator-based mutant selection superior to random mutant selection? In Proc. ICSE, pages 435–444, May 2010.