Adaptive Random Testing: The ART of Test Case Diversity
T.Y. Chen et al. / The Journal of Systems and Software 83 (2010) 60–66
doi:10.1016/j.jss.2009.02.022

Article history: Received 7 November 2008; Accepted 13 February 2009; Available online 4 March 2009.

Abstract

Random testing is not only a useful testing technique in itself, but also plays a core role in many other testing methods. Hence, any significant improvement to random testing has an impact throughout the software testing community. Recently, Adaptive Random Testing (ART) was proposed as an effective alternative to random testing. This paper presents a synthesis of the most important research results related to ART. In the course of our research and through further reflection, we have realised how the techniques and concepts of ART can be applied in a much broader context, which we present here. We believe such ideas can be applied in a variety of areas of software testing, and even beyond software testing. Amongst these ideas, we particularly note the fundamental role of diversity in test case selection strategies. We hope this paper serves to provoke further discussions and investigations of these ideas.

Keywords: Software testing; Random testing; Adaptive random testing; Adaptive random sequence; Failure-based testing; Failure pattern

© 2009 Elsevier Inc. All rights reserved.
are, by definition, not truly random, all published articles on random testing to date treat such pseudorandom sequences as equivalent to random. For simplicity, therefore, we shall refer to testing according to such pseudorandom sequences as "random" in this paper.

When conducting random testing, testers may choose an appropriate sampling distribution to meet their requirements. When trying to accurately estimate the delivered reliability of software, for instance, testers may choose to sample according to a profile that reflects the expected usage profile of the software, known as an operational profile. On the other hand, analysis of random testing as a testing strategy has normally assumed a uniform sampling profile. Throughout this paper, unless otherwise specified, we also assume a uniform sampling profile.

2. Failure patterns

Essentially, the testing process can be viewed as taking samples from the set of all possible inputs to the software under test (known as the input domain), executing the samples one by one, and determining whether the outputs from each sample match the software specification. If the outputs do not match the specification, a software failure is revealed. The presence of a software failure implies the existence of a fault, an actual code defect in the software concerned. (Obviously, many software failures can be related to the same fault.) A tester seeks to select test data with a view to maximising the number of distinct faults detected. To help the tester in this task, it is natural to consider how faults may cause different parts of the input domain to produce erroneous outputs when executed; in other words, to reveal failures.

A pioneering work in this area was that of White and Cohen (1980), who analysed certain types of program fault in numerical programs. They observed that when the contents of predicates (decision-making points in the source code) were erroneous, an incorrect computation path would be taken (referred to as domain errors). This would, therefore, often result in contiguous regions of the input domain that reveal failures. White and Cohen then proposed a systematic technique for detecting such errors.

More empirical studies came to similar conclusions about the tendency for software faults to result in contiguous "failure regions" within the input domain. Ammann and Knight (1988) analysed a number of sample numerical programs to determine the distribution of failures caused by various faults. In their small sample, they found that the faults resulted in "locally continuous" failure regions. A more comprehensive study was conducted by Bishop (1993), who examined program faults in control functions for nuclear reactors. He found that virtually all the faults were "blob" faults; that is, each fault revealed failures in a contiguous region of the input domain.

Chan et al. (1996) also noted that certain common types of fault in numerical software would lead to typical distributions of failure-causing inputs throughout the input domain, which they termed failure patterns. They categorised three such patterns: (i) the block pattern, where failures form a locally compact, contiguous region of the input domain; (ii) the strip pattern, similar to the patterns resulting from White and Cohen's domain errors, in which a "strip", contiguous but elongated along one or more dimensions, would reveal failures; and (iii) the point pattern, where failures would spread in a non-contiguous manner throughout the input domain. They argued that strip and block failure patterns were much more common than point patterns.

All of these quite different studies lead to a more general conclusion: that, in numerical programs, many program faults lead to contiguous failure regions of the program input domain.

3. Adaptive Random Testing

If contiguous failure regions are indeed common, this suggests that one way to improve the failure-detection effectiveness of random testing is to somehow take advantage of this phenomenon.

One corollary of the existence of contiguous failure regions is that "non-failure regions", that is, regions of the input domain where the software produces outputs according to specification, will also be contiguous. Therefore, given a set of previously executed test cases that have not revealed any failures, new test cases located away from these old ones are more likely to reveal failures; in other words, test cases should be more evenly spread throughout the input domain. Based on this intuition, Adaptive Random Testing (ART) was developed to improve the failure-detection effectiveness of random testing.

The first ART method proposed, the Fixed Size Candidate Set ART algorithm (FSCS-ART) (Chen et al., 2004b), is described in Fig. 1. Essentially, to choose a new test case, k candidates are randomly generated. For each candidate ci, the closest previously executed test case is located, and the distance di is determined. The candidate with the largest di is selected, and the other candidates are discarded. The process is repeated until the desired stopping criterion, be it the exhaustion of testing resources or the detection of enough failures, is reached.

Fig. 2 shows FSCS-ART in operation, on a program with a two-dimensional input space, with k = 3. In Fig. 2a, we show four previously executed test cases, t1 to t4. We wish to select an additional test case, so three candidates, c1 to c3, are randomly generated as shown. To choose among the candidates, we must calculate di for each. Fig. 2b depicts this process for candidate c1, and Fig. 2c shows the nearest ti for each candidate. The dashed lines in Fig. 2c indicate di for the respective candidates. We choose the candidate with the largest di, which is c2 in this example. Thus, we discard candidates c1 and c3, treat c2 as test case t5, and execute it. We repeat the process until the stopping criterion is reached.

To assess the effectiveness of the FSCS-ART method, Chen et al. compared the failure-detection effectiveness of FSCS-ART to random testing (that is, testing by uniform random sampling with replacement) on a sample of 12 error-seeded numerical programs. The original, unmodified programs were used as a testing oracle to check the correctness of the outputs. The statistic used to compare the methods was the average number of tests required to detect the first failure, which is commonly known as the F-measure. In most cases, the F-measure of FSCS-ART was 30–50% lower than that of random testing. Results of simulations using a variety

Fig. 1. FSCS-ART algorithm.
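As a rough illustration of the procedure in Fig. 1, the FSCS-ART selection loop might be sketched as follows. This is our own minimal sketch, not the authors' implementation; the function names, the unit-square input domain and the Euclidean distance measure are our assumptions.

```python
import math
import random

def fscs_art_next(executed, k, gen_candidate, dist):
    """Choose the next test case by FSCS-ART: among k random candidates,
    pick the one whose nearest previously executed test case is farthest."""
    best, best_d = None, -1.0
    for _ in range(k):
        c = gen_candidate()
        # Distance from this candidate to its closest executed test case.
        d = min(dist(c, t) for t in executed)
        if d > best_d:
            best, best_d = c, d
    return best

def fscs_art_sequence(n, k, gen_candidate, dist):
    """Generate n test cases: the first purely at random,
    the rest by repeated FSCS-ART selection."""
    tests = [gen_candidate()]
    while len(tests) < n:
        tests.append(fscs_art_next(tests, k, gen_candidate, dist))
    return tests

if __name__ == "__main__":
    # Hypothetical 2-D numeric input domain: the unit square, Euclidean distance.
    random.seed(0)
    gen = lambda: (random.random(), random.random())
    for t in fscs_art_sequence(5, 3, gen, math.dist):
        print(t)
```

In a failure-detection experiment, each selected test would be run against the program under test, stopping at the first revealed failure; the F-measure is then the average number of tests used over many such runs.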
Fig. 2. FSCS-ART in operation. Previously executed test cases are denoted by crosses, and randomly generated candidates are denoted by triangles. To select a new test case, (a) multiple candidates are randomly generated; (b) the nearest previously executed test case to each candidate is determined; (c) these nearest distances are compared among all candidates; and (d) the candidate with the longest such distance is selected.
of failure patterns, with different failure rates and geometries, are consistent with the experimental results.

While such improvements are significant, it is reasonable to speculate that there might be other, more efficient ways to take advantage of contiguous failure regions which would result in an even smaller F-measure. A number of different methods, using different intuitions to achieve "even spread", have been investigated in the literature. An example is Restricted Random Testing (RRT), which is based on the notion of exclusion to achieve the even spreading fundamental to ART (Chan et al., 2006). It involves the creation of "exclusion zones" around test cases that have been executed. A randomly generated input will be used as the next test case if it lies outside all exclusion regions; otherwise it will be discarded and the process will be repeated. The effectiveness of RRT is very similar to that of FSCS-ART. ART by Partitioning (Chen et al., 2004a) uses a rather different intuition: in essence, that partitioning the input domain, and allocating test cases evenly to partitions, will achieve even spread. Other attempts to take advantage of failure region contiguity, but using various other intuitions to achieve the "even spreading" of test cases, include Quasi-Random Testing (Chen and Merkel, 2007), and Lattice-Based ART (Mayer, 2005).

Interestingly, all of these methods have similar ranges of failure-detection performance, with the maximum improvement over random testing being around 50%. However, the different methods perform best under various circumstances. For instance, some methods offer lower selection overheads, or work well when the dimensionality of the input domain is large.

Antirandom testing (Malaiya, 1995) is another testing method that uses a related concept of "distance" to distribute test cases. However, there are several major differences between it and ART. Antirandom testing is almost exclusively deterministic; the only non-determinism comes in the choice of the first test case in the set. Furthermore, the method requires the number of test cases to be chosen in advance, unlike the flexibility of incremental generation offered by ART.

4. Theoretical limits

If so many different approaches to taking advantage of failure contiguity achieve similar results, an interesting question arises: is the failure to make further improvements a lack of imagination by researchers in identifying better methods? Are existing ART methods too similar to one another; might an entirely different approach achieve better results? Or are existing solutions close to optimally effective already? In the computer science world, such questions are traditionally answered by theoretical complexity analyses of problems. We set out to apply the same approach to this problem: how much can we improve on random testing by using failure contiguity information? We have proved (Chen and Merkel, 2008) that there is indeed a fundamental limit to how
much failure contiguity information, when used on its own, can improve failure-detection effectiveness.

Our approach to doing so is simple in principle. We first consider a case where the tester has more information about the failure pattern than is available in reality: in essence, the tester knows that there is one single, contiguous failure region of the input domain. The tester knows the size, shape and orientation of this single failure region, everything about it except where it is located in the input domain. In fact, the tester does not have any information about the location of the failure region in the input domain. Laplace's Principle of Indifference (Keynes, 2006) states that, if a decision maker knows the possible states of the world, and truly has no information about the plausibility of each possible state, they should act as if each state were equally likely. In this context, as the tester has no information about the location of the failure region, they should assume that it is equally likely to be located in any possible location within the input domain. The tester thus has strictly more information about failure contiguity than that assumed by various ART algorithms, and absolutely no information about the failure region location, which is also assumed in ART.

Given these assumptions, we then devise an optimal strategy for selecting test cases, and show definitively that it will have an F-measure lower than or equal to any other strategy (recalling that the F-measure is an average). In essence, we create a "grid" of test cases at regularly spaced locations throughout the input domain, and execute the resulting tests in an arbitrary order. On particular occasions, such a strategy might "get lucky" and reveal a failure on the first test case. Over many trials, however, the F-measure of such a strategy will be at least half that of the F-measure of random testing with replacement, given the same failure rate.

That is, no strategy using failure pattern information, other than information about its location, can reduce the F-measure by more than 50% compared to random testing.

This result still holds even if there are multiple contiguous failure regions. The proof is complex, but based on the same principles as the single failure region case. Interested readers may refer to Chen and Merkel (2008).

The implications of our result are quite clear. ART, which uses strictly less information, still often achieves effectiveness improvements which are quite close to the maximum theoretically possible. Therefore, any further improvements in the testing effectiveness of ART must come from taking account of additional information about the program's failure location. Alternatively, rather than attempting to improve failure-detection effectiveness, researchers can develop ART methods that have lower overheads in evenly spreading the test cases, in order to improve the overall cost-effectiveness. Furthermore, the closeness of ART's performance to the theoretical bound indicates that the bound is indeed a tight one.

5. ART beyond numeric programs

Initial studies of ART showed that it can improve the failure-detection effectiveness of random testing substantially, and this improvement is indeed close to the theoretical maximum possible (in the absence of further information on failure location). Nevertheless, these initial studies were limited to software with numeric inputs. Much, perhaps most, software of practical interest does not have such simple input parameters. It is therefore of considerable importance to study how to apply ART to a broader class of programs.

As an illustration of the broader application of ART, we again consider FSCS-ART. To apply FSCS-ART in a given situation, two issues must first be resolved: a method to sample randomly from the input domain of the software under test, and some way to compare any two members of the input domain and determine the "distance" between them.

The first issue is not unique to ART: by definition, pure random testing also requires the ability to sample randomly from the input domain! In practice, random generation of test cases can be a challenging problem. However, the generation of random test cases of sufficient quality to reveal significant software faults has been thoroughly studied and demonstrated in numerous application domains such as SQL servers (Slutz, 1998; Bati et al., 2007) and Java Virtual Machines (Yoshikawa et al., 2003), among many others. Hence, we shall not address this issue further in this paper.

By contrast, the second issue, a "distance" measure, is unique to ART. The algorithm will execute given any trivial distance measure, such as one simply returning a distance of zero regardless of the locations of the members of the input domain in question. In such cases, however, the algorithm degenerates into a more costly version of pure random testing. Therefore, in designing an appropriate distance measure, we need to consider why contiguous failure patterns occur in numeric programs, and how this concept might be generalised for a wider range of software.

Essentially, the "distance" measure needs to provide an estimation of the likelihood of two inputs having common failure behaviour: the smaller the distance, the more likely they are to trigger the same failure behaviour. In fact, the "distance" measure is really a difference measure. Studies revealing contiguous failure patterns in numeric input domains show that adjacent test cases (as reflected by the Cartesian distance measure) were likely to result in similar computations. In turn, it is our intuition that the similarity of computation is a good predictor of the similarity of failure behaviour. To apply ART effectively in a non-numeric context, therefore, alternative methods to measure the similarity of computation resulting from the executions of two test cases are required.

We have proposed a difference measure (Kuo, 2006; Merkel, 2005) that can be applied to a broad range of software input types, based on the concepts of categories and choices proposed by Ostrand and Balcer (1988) for the category-partition method. The category-partition method is a specification-based testing method. The tester must first identify the parameters and environment conditions determining the behaviour of the software under test, known as categories. For each category, choices are defined as mutually exclusive sets of values which are expected to trigger similar computation.

Our work makes use of the concepts of categories and choices as the basis of a difference measure for ART, allowing ART to be applied to a broader range of software. Intuitively, the more categories in which two inputs have different choices, the more different will be the computation they trigger. Therefore, a count of the categories with differing choices can be used as a difference measure.

As an illustration, consider a simple object recognition system, which can distinguish shapes, sizes and colours. Suppose that the colour of objects can only be light-red, red, deep-red, light-blue, blue, deep-blue, light-green, green and deep-green, and objects are spheres, cubes or pyramids in shape. The size is in the range (0, 10] m^3. The system behaviour depends only on the object shape, the "base colour" (red, blue or green), and whether the object is larger than 1 m^3. In this case, we can define three categories: Colour, Shape and Size; three choices for the Colour category: [red], [blue] and [green]; three choices for the Shape category: [sphere], [cube] and [pyramid]; and two choices for the Size category: [large] and [small]. Some choices contain more than one possible value. For example, the [red] choice has light-red, red and deep-red as its possible values, and [large] has any size more than 1 m^3.

Consider two program inputs T1 and T2, where T1 is a light-red sphere of size 3.2 m^3, and T2 is a deep-blue sphere of size 2.7 m^3. T1 has the choices [red], [sphere] and [large] while T2 has the choices [blue], [sphere] and [large]. In this case, therefore, there is only one
category, Colour, in which T1 and T2 differ, so the difference between the two inputs is 1 using our measure.

We have used this distance measure as the basis for the development of new ART algorithms for non-numeric software. We have used these new algorithms in several case studies including the Unix command-line utility grep, and other programs from the UNL Software-artifact Infrastructure Repository (Rothermel et al., 2009). Details can be found in Barus et al. (in preparation).

While we have demonstrated that it is possible to construct meaningful difference measures for a broad range of input types, this is not the only feasible approach. Recently, Ciupa et al. (2008) proposed an alternative type of difference measure in the context of object-oriented software. They provide a method for computing object distance between two arbitrary objects. They first define some distance measures for elementary types, such as numbers, Booleans, strings and references. Next, they describe how to measure distances between composite objects, made up of three elements: the type distance, based on the difference between the two object types; the field distance, the distance between the matching fields; and the recursive distance, the distance between matching reference attributes. This method has the advantage that it completely specifies how to calculate the difference measure, supporting the complete automation of the method. However, further empirical research will need to be conducted to determine its effectiveness.

6.1. Adaptive random sequences

One obvious application for AR sequences is regression testing. In regression testing, a large test suite may be accumulated over time, even to the extent that not all of its tests can necessarily be run each time a change is made. Hence, various techniques have been developed to prioritise the elements of a test suite, based on a number of different criteria. We believe that AR sequences may be a simple, effective and relatively low-overhead alternative. Furthermore, there are many other testing techniques (such as path testing techniques) that can generate a larger set of test cases than can be run with available resources; AR sequences may be very useful in these circumstances.

It may even be that AR sequences have uses beyond testing. Quasi-random sequences (Chen and Merkel, 2007), which have been proposed as an alternative to ART for testing purposes, are used in a wide variety of contexts. AR sequences may have similarly wide applications. Quasi-random sequence generation is intimately tied to the properties of the binary representation of floating-point numbers (Bratley et al., 1992); AR sequences are in fact more easily applicable to a wider range of data types. Furthermore, standard quasi-random sequence generation algorithms only generate very few distinct sequences; AR sequencing can trivially be used to generate large numbers of distinct sequences.

6.2. Failure-based testing
refinement of existing ones. One complicating factor is, of course, that failure patterns for non-numeric inputs are defined on the basis of particular difference measures, as discussed in Section 5; research on such programs will need to take this into account.

From another perspective, testing based on failure patterns can be viewed as a type of specialised search problem. In this view, feedback from the tests as they are executed guides the continuing search for failure-causing input. ART is a simple and successful realisation of this concept. Many testing methods regard tests that do not reveal a failure as, essentially, wasted, but ours is not the only work to disagree with this view. For instance, metamorphic testing (Chen et al., 1998) uses previously executed "original" test cases to construct "follow-up" test cases. The relationship between the outputs for the original and follow-up test cases is then checked to verify program correctness. This approach is designed specifically to alleviate the oracle problem. Pacheco et al. (2007) take advantage of non-failure-causing test cases as "feedback", by using past test cases that do not reveal failure as building blocks to construct more complex ones. Search-based testing also seeks to use feedback from past test cases to guide future test case selection, but generally much more selectively. Most such work to date has taken into account additional information from test execution, such as execution paths, to help guide the search, and often only considers the most recent few test cases rather than the entirety. Without such guidance, conventional search algorithms will not be able to tell where to "go next", so it will be very challenging to adapt these techniques for failure-based testing. Research into the geometry and distribution of failure patterns will help in the design of appropriate search algorithms. These algorithms could take into account the results of many previous test cases, thus making best use of the limited information available from each test case.

6.3. A theory of software testing

Our theoretical analysis showing that ART's performance is close to the theoretical maximum is significant in itself. However, the approach used to show this is also worthy of further consideration. While different types of theoretical analyses have been conducted for different testing strategies, we believe that our approach is novel and can serve as a model to build a deeper understanding of the relationship between failure information and software testing.

There have been a number of theoretical analyses of various coverage criteria. These analyses have shown that achieving one type of coverage may imply another coverage; a trivial example is that branch coverage implies statement coverage. Such analyses are useful, but they do not directly correlate to failure-detection capabilities. By contrast, in fault-based testing, fault subsumption relationships for certain fault classes have been developed (Kapoor and Bowen, 2007). As such, a hierarchy of certain fault types has been described, and the subsumption relationships among fault-based testing strategies have been explored. There have also been a number of papers that evaluate individual testing techniques for failure-detection effectiveness, or compare two testing techniques. These tend to be quite specific in their applicability, such as proving a sufficient condition for one testing technique to be more effective than another (Chen and Yu, 1996; Morasca and Serra-Capizzano, 2004).

Our approach was quite different from all of the above. We explicitly develop a model of the tester's prior knowledge about failures, and identify the best performance that can be achieved by any testing strategy using only this information. This not only evaluates the performance of known strategies, but can also be applied to testing strategies that have not yet been invented. In this way, it can help to identify where methodological research should best be applied to improve the state of the art, similarly to how time and space complexity analysis is useful to algorithm researchers in computer science.

It might well be possible to use this general approach to identify other relationships between available information and testing effectiveness. Ultimately, a complexity hierarchy that relates various types of information about the software under test to a class of testing strategies may be possible. For such a class, the best practical strategies developed to date, and theoretical effectiveness bounds, can be identified. Our work represents a first step towards such a hierarchy. We also note it is unlikely that any such hierarchy would be as elegant or comprehensive as the complexity class hierarchy of problems in theoretical computer science. Nevertheless, we believe that the development of such a hierarchy will have a significant impact on the theory of software testing, and will identify where methodological research can best be directed.

6.4. The role of test case diversity

The key intuition that led us to develop ART was the concept of "even spreading" throughout the input domain. We have come to realise that "even spreading" can be better described as a form of diversity. For the numeric case, at least, neighbouring inputs tend to result in similar computations. An even spread of test cases throughout the input domain, therefore, gives rise to a diversity of computations.

While the importance of diversity is hardly a new or surprising insight, ART achieves a very simple form, perhaps the simplest possible form, of test case diversity. There have been a variety of different notions of test case diversity intrinsic in some testing methods over the years; for instance, the different types of control coverage and dataflow coverage criteria yield test sets with different notions of diversity. Ultimately, in testing, the tester seeks diversity in failure behaviour, so that test cases reveal as many different ways in which the program can fail as possible with the given testing resources. As failure behaviour information can never be completely available before the test cases are executed, testers must find other models of diversity that strongly correlate with failure behaviour.

In the study of partition testing, the proportional sampling (PS) strategy (Chen and Yu, 1996) stipulates that the number of randomly selected test cases from each partition should be proportional to the corresponding partition size. This strategy provides a sufficient condition for partition testing to have a probability of detecting at least one failure that is no smaller than that of random testing with replacement.

Chen et al. (2001) observed that "a comparison of ART with PS strategy reveals an interesting similarity; the PS strategy can be regarded as a form of ART. Such a similarity between the PS strategy and ART appears striking . . . the two strategies were initially proposed for very different reasons. The PS strategy was motivated by the need of providing a universally safe strategy [which is guaranteed to outperform random testing], whereas ART attempts to improve random testing in those situations where the failure-causing inputs tend to cluster . . . no distribution of the failure-causing inputs was assumed when deriving the PS strategy." This interesting similarity can now be interpreted as being due to their common "diversity over the input domain". Hence, we believe that a new way to classify test case selection strategies may be based on various forms of diversity.

ART achieves diversity not only in the context of the entire test suite, but also within the subset of test cases executed at any one time. When an AR sequence is used to order test suites that already exhibit diversity according to some specific criterion, the current subset of executed tests, at any stage of testing, exhibits additional local diversity. Such local diversity will improve the chances of detecting failures earlier.
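To make the idea of ordering an existing suite for diversity concrete, here is a small sketch of our own (a hypothetical construction, not code from any of the cited papers) that builds an adaptive random sequence over non-numeric tests, using the category-choice difference measure of Section 5 as the distance:

```python
def category_difference(t1, t2):
    """Count the categories in which two tests select different choices.
    Tests are modelled as dicts mapping category name to choice label."""
    return sum(1 for cat in t1 if t1[cat] != t2[cat])

def ar_sequence(suite, dist=category_difference):
    """Reorder a test suite into an adaptive random sequence: start with the
    first test, then repeatedly pick the remaining test whose nearest
    already-ordered test is farthest away (max-min distance)."""
    remaining = list(suite)
    ordered = [remaining.pop(0)]
    while remaining:
        best = max(remaining, key=lambda c: min(dist(c, t) for t in ordered))
        remaining.remove(best)
        ordered.append(best)
    return ordered

if __name__ == "__main__":
    # The object recognition example from Section 5.
    t1 = {"Colour": "red", "Shape": "sphere", "Size": "large"}
    t2 = {"Colour": "blue", "Shape": "sphere", "Size": "large"}
    t3 = {"Colour": "green", "Shape": "cube", "Size": "small"}
    print(category_difference(t1, t2))  # T1 and T2 differ only in Colour: 1
    print(ar_sequence([t1, t2, t3]))
```

Under such an ordering, early prefixes of the suite already spread over the category space, giving the local diversity discussed above.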
66 T.Y. Chen et al. / The Journal of Systems and Software 83 (2010) 60–66
7. Conclusion

Based on empirical observations that contiguous failure regions are common, adaptive random testing combines random candidate selection with a filtering process to encourage an even spread of test cases throughout the input domain. Experimental studies have shown that ART can detect failures using up to 50% fewer test cases than random testing. In fact, ART methods achieve close to the theoretical maximum test case effectiveness attainable by any testing method using the same information. Early work on ART concentrated mainly on numeric input domains; however, recent research has shown that it can also be applied to a broad range of software. As such, we believe that it represents an effective, efficient alternative to random testing in many applications.
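As a concrete illustration of this candidate-selection-plus-filtering idea, the sketch below implements the fixed-size-candidate-set flavour of ART over the unit square. It is a sketch under our own illustrative assumptions (a two-dimensional numeric input domain, Euclidean distance, and a candidate-set size of k = 10); the function name `fscs_art` is ours.

```python
import math
import random

def fscs_art(n_tests, k=10, rng=random.Random(1)):
    """Fixed-size-candidate-set ART over the unit square: each round,
    generate k random candidates and keep the one whose nearest
    already-executed test is farthest away."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    executed = [(rng.random(), rng.random())]  # first test is purely random
    while len(executed) < n_tests:
        candidates = [(rng.random(), rng.random()) for _ in range(k)]
        # Filtering step: maximise distance to the nearest executed test.
        best = max(candidates, key=lambda c: min(dist(c, t) for t in executed))
        executed.append(best)
    return executed
```

Selecting, among the k random candidates, the one farthest from its nearest executed neighbour is what encourages the even spread of test cases described above.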
On the other hand, research on ART may have broader implications, and we have discussed a number of them in this paper. The AR sequence is a promising, general method of incremental ordering. The success of ART illustrates the potential of the approach of failure-based testing, and the impact and importance that diversity has on the effectiveness of test suites. Our theoretical work, motivated by ART, paves the way for a more rigorous and scientific analysis of the relationships between the information available to the software tester and the effectiveness of families of testing strategies, including those not yet developed. We believe such an analytic approach will provide a significant contribution to the foundations of software testing.

Acknowledgements

This work was supported in part by a discovery grant of the Australian Research Council (project no. ARC DP 0880295). We would also like to thank our colleagues who have worked with us on adaptive random testing over the years: A. Barus, K.P. Chan, G. Eddy, D.H. Huang, H. Leung, H. Liu, I.K. Mak, G. Rothermel, K.Y. Sim, C.A. Sun, D.P. Towey, P.K. Wong, W.E. Wong, and Z.Q. Zhou.
Tsong Yueh Chen received his BSc and MPhil degrees from the University of Hong Kong; his MSc degree and DIC from the Imperial College of Science and Technology; and his PhD degree from the University of Melbourne. He is currently a Professor of Software Engineering in the Faculty of Information and Communication Technologies, Swinburne University of Technology, Australia. His research interests include software testing, debugging, software maintenance, and software design.

Fei-Ching Kuo is currently a Lecturer at Swinburne University of Technology, Australia. She received her PhD degree in Software Engineering and BSc (Honours) in Computer Science, both from the Swinburne University of Technology. Her research interests include software testing, debugging and project management. She has been an IEEE member for many years, a PC member of several international conferences and workshops, including SAC and QSIC amongst others, and has also acted as a reviewer for several international journals, including the Journal of Systems and Software and the Journal of Software.

Robert Merkel received his BSc (Hons) from the University of Melbourne, and his PhD from Swinburne University of Technology. He is currently a Lecturer in Software Engineering at Swinburne in the Faculty of Information and Communication Technologies. His current main research interests include software testing and program verification. Prior to his PhD study, he worked in industry on an open source software project.

T.H. Tse is a Professor in Computer Science at The University of Hong Kong. His research interest is in software testing and analysis. He is an editorial board member of Software Testing, Verification and Reliability and the Journal of Systems and Software, the steering committee chair of QSIC, and a standing committee member of COMPSAC. He is a fellow of the British Computer Society, a fellow of the Institute for the Management of Information Systems, a fellow of the Institute of Mathematics and its Applications, and a fellow of the Hong Kong Institution of Engineers. He was decorated with an MBE by Queen Elizabeth II of the United Kingdom.