Unit 3 Using Combinatorial Testing To Reduce Software Rework
At issue was whether the NIST approach could cost-effectively reduce the number of latent software defects escaping into system testing and, at the same time, achieve the structural coverage required by regulatory authorities.
…combinatorial explosion. The smaller the number, the greater is the likelihood of missed defects and inadequate structural coverage. A compromise is to limit input values to those representing equivalence classes [16]. For each input variable, possible values are segregated into groups that would ostensibly produce no difference of interest in code behavior or output value. One or more representative values are then picked from each group. This typically includes values that test behavior across instruction and memory architecture boundaries (e.g., positive and negative minimum and maximum values, and 0), data definition ranges, coordinate systems, units of measure, and so on, and also those that drive decision conditions.
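As a hypothetical illustration (the variable and its range are not taken from the article), representative values for a 16-bit signed input might be recorded one per equivalence class, plus the architecture boundary values:

    /* Hypothetical example: representative values for a 16-bit signed input,
       one per equivalence class plus the architecture boundary values. */
    #include <stdint.h>

    static const int16_t representative_values[] = {
        INT16_MIN,  /* most negative representable value */
        -1,         /* largest negative value            */
        0,          /* zero boundary                     */
        1,          /* smallest positive value           */
        INT16_MAX   /* most positive representable value */
    };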
Identifying representative values for boundary values was straightforward. Finding values for condition variables in complex, nested logic—values that would force the execution paths required for code coverage—took more time. MC/DC requires that every condition in a decision has taken all possible outcomes at least once, and that each condition in each decision has been shown to independently affect that decision’s outcome. Demonstrating independence-of-outcome typically requires modifying each condition in a decision while all others remain fixed, and showing that this modification has changed the outcome of the decision. For the while-loop in
    if ((a != b) && (a != c))
    {
        …
        while ((a != b) && (a != c))
        {
            a = chan ();
        }
    }
tests must be run to show that when both conditions are true, the loop is executed, and that when each is false but the other true, the loop is not executed. To determine the input space, values that force execution of each such path under the required conditions must be selected for each variable of each condition of each decision.
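For the two-condition decision above, MC/DC can be satisfied with three tests. The concrete values below are hypothetical (only the decision itself comes from the article); each test toggles one condition while the other stays true, flipping the decision outcome:

    /* Hypothetical MC/DC test set for the decision ((a != b) && (a != c)),
       with b = 2 and c = 3 held fixed. */
    #include <assert.h>
    #include <stdbool.h>

    static bool decision(int a, int b, int c)
    {
        return (a != b) && (a != c);
    }

    static void mcdc_example(void)
    {
        assert( decision(1, 2, 3));  /* both conditions true: loop executes     */
        assert(!decision(2, 2, 3));  /* a == b: first condition alone flips it  */
        assert(!decision(3, 2, 3));  /* a == c: second condition alone flips it */
    }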
Enabling those values was difficult when the condition variable was an input and the values had to be loaded by an external procedure invoked from within a decision. In the example, the loop decision must be tested when a = b and when a = c, neither of which conditions can be created by direct input from a test case. The value of a must be changed at runtime by the call to the external procedure chan (), which is stubbed out for unit test. The work-around was to add test-unique variables to the test cases generated by ACTS and the model checker. Test stubs were replaced with small procedures that loaded the value of the test-unique variable directly or indirectly into the condition variable. In the example, the test variable’s value would be loaded into the return value of chan ().
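A minimal sketch of such a stub, assuming the harness sets a global test-unique variable before each test case (the name test_chan_value is illustrative; only chan () appears in the article):

    /* Test-unique variable, loaded from the test case by the harness. */
    static int test_chan_value;

    /* Unit-test stub replacing the external procedure chan(): it returns the
       test-unique value, so a test case can force a == b or a == c in the
       loop decision at runtime. */
    int chan(void)
    {
        return test_chan_value;
    }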
Generating a state space for all 34 input variables of the mode-state controller produced combinatorial explosion. Several separate sets of test vectors had to be generated instead, each set covering only those variables that interact to produce an output. The test harness assigned default values to those variables not included in a test case. Maximizing structural coverage required running all such sets of tests. In no case, however, was there an output value affected by interactions among more than six input variables, and in aggregate all 6-way combinations of interacting variables were tested.
Generating Expected Outputs and Executing Tests

The model checker is given a model containing variable definitions, their relationships, their values in an initial state, and how their values are determined in subsequent states. It then generates the state space (or a binary decision diagram of it), each state mapping a combination of input variable values to output variable values. See Fig. 1 showing the mapping of the input values from Table 1 to the output variable, e. For all states in which the value of c is true, the value of e will be equal to the value of a plus the value of b, which is expressed as c = true : a + b. In all other states, the value of e will be equal to the value of a times the value of d, expressed as TRUE : a * d. Fig. 1b shows a segment of the generated state space—the value of e followed by the input values that produced it.

Fig. 1a. NuSMV Model

    MODULE main
    VAR
        a : {0,15,16};
        b : {255,256};
        c : {true,false};
        d : {-1,0,1};
    DEFINE
        e :=
            case
                (c = true) : a + b;
                TRUE : a * d;
            esac;

Fig. 1b. State Space Segment

    ------- State 4 ------
    e = 0
    a = 0
    b = 255
    c = false
    d = 1
    ------- State 5 ------
    e = 272
    a = 16
    b = 256
    c = true
    d = 1
    ------- State 6 ------
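The article does not show the implementation under test; a hypothetical C rendering of the relationship captured by the Fig. 1a model might be:

    /* Hypothetical code under test corresponding to the Fig. 1a model:
       e = a + b when c is true, otherwise e = a * d. */
    #include <stdbool.h>

    int compute_e(int a, int b, bool c, int d)
    {
        if (c)
            return a + b;
        return a * d;
    }

With the Fig. 1b values, compute_e(0, 255, false, 1) returns 0 and compute_e(16, 256, true, 1) returns 272, matching states 4 and 5.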
In the NIST approach, the process of creating expected outputs for an input test vector relies on a model checker’s counter-examples [17]. Ordinarily, to verify requirements or a design, developers using a model checker would create a model like the one in Fig. 1a, but they would also write properties the model must preserve—e.g., there must always be a way for the variable e to be 0, there must always be a way for it to be 272. The model checker attempts to prove that the model preserves these properties. Where it finds a violation of a property (a counter-example—e.g., an execution path in which e can never be 0), it produces a trace of the states that led to the violation. To have a model checker determine an expected output for a given input vector, developers could negate a property and use the counter-example to trace back to the input values that produced it. For example, they could specify that the variable e must never be 0. The model checker would detect a state that violated this property and generate a counter-example showing the state transitions from the initial input values (the input vector) to the point at which e became 0. A simple utility could create a complete test case from a counter-example by merging the value of the output variable with the values of the input variables that produced it [16].

This study used a slightly different approach, requiring a smaller learning curve. Instead of searching through counter-examples generated by the model checker, the utility function searches for each input vector across the entire state space generated by the model checker. The model in Fig. 1a generated 36 states: those containing all possible combinations of variable values. As shown in Table 1, all 2-way combinations of inputs can be covered by the nine input vectors generated by ACTS. The utility function finds state 4 containing the input vector {0,255,false,1}, eliminates any irrelevant inputs and outputs from the state, reformats the remainder (the input vector and its expected outputs), and exports the result, {0,0,255,false,1}, to the test harness. When it has found and exported all nine test cases, it is finished.
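A minimal sketch of that search-and-export step, assuming the generated state space has been parsed into an array of C structures (the structure layout, names, and output format are illustrative; only the ordering of expected output followed by inputs comes from the article):

    /* Hypothetical search-export utility: for each ACTS input vector, find the
       model-checker state with matching inputs and emit the expected output
       followed by the inputs, e.g. {0,0,255,false,1}. */
    #include <stdbool.h>
    #include <stdio.h>

    struct state  { int e; int a; int b; bool c; int d; };  /* one generated state   */
    struct vector { int a; int b; bool c; int d; };         /* one ACTS input vector */

    static const struct state *find_state(const struct state *states, int n,
                                          const struct vector *v)
    {
        for (int i = 0; i < n; i++)
            if (states[i].a == v->a && states[i].b == v->b &&
                states[i].c == v->c && states[i].d == v->d)
                return &states[i];
        return NULL;  /* no state matches this input vector */
    }

    static void export_test_case(FILE *out, const struct state *s)
    {
        fprintf(out, "{%d,%d,%d,%s,%d}\n",
                s->e, s->a, s->b, s->c ? "true" : "false", s->d);
    }

For state 4 in Fig. 1b, export_test_case would write {0,0,255,false,1}, the test case cited in the text.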
Developers then load the test harness with both the source code and the test cases, and map the test case entries to input and output variable names—e.g., map the first entry of the input test case in Fig. 1b (0) to the source code variable e, the second entry (0) to the variable a. They can then execute the tests. Failures and the achieved code coverage can be monitored in test harness windows. Correctness of the expected outputs (verifying the oracle) is established when the resulting test cases are able to detect all seeded defects with no false positives.
Results

Putting aside defective or incomplete requirements, misinterpretations of requirements and design decisions, and other errors not revealed by exercising the code, at issue was whether such an automated test approach could cost-effectively detect all (or nearly all) implementation defects. Evaluation criteria included accuracy, structural coverage, scalability, execution time, maturity, ease of learning, and ease of use.

Accuracy was measured in two ways: as the percent of seeded defects the tests detected, and as the percent of false detections (the number of false positive detections as a percent of total detections). Defects were manually and arbitrarily seeded into versions of the code by changing values in arithmetic and logic statements, changing arithmetic signs, reversing and negating comparisons, deleting statements, and so on. In all, there were over 200. After debugging the NuSMV model, the search-export utility, and the test harness definition, the generated tests triggered all defects with no false detections.

The initial set of tests achieved 75% statement coverage, 71% branch coverage, and 68% MC/DC. The relatively low initial coverage was the result of the inadequately defined input space, described earlier. With a better understanding of how the input space was to be defined, the subsequently generated test cases achieved 100% MC/DC.

Scalability was an evaluation of both size (in this case, the number of input and output variables) and logical complexity. As mentioned earlier, after limiting inputs to only interacting variables, test generation never again produced state space explosion. After using test variables to deal with loops that changed the value of their condition variables, there were no further complexity issues.

Execution time was acceptable: for the largest vector generation model (19 input variables, 1 output variable), ACTS produced 2775 input vectors in six seconds, NuSMV generated the state space in about 60 minutes, and searching it and building the test cases took just over eight minutes. The test harness imported them in 15 seconds, created their executable tests in 12 seconds, and executed and analyzed them in under eight minutes.

Cost effectiveness was a measure of the value-in-use (accuracy, coverage, scalability, and performance), the effort required to learn the approach, and the effort required to use it on an ongoing basis. Learning to use ACTS was simple. NIST provides a tutorial that takes about two hours to process and contains everything needed to begin using the tool. Initial definition of the 34 input variables used by the mode controller took four hours, including initial equivalence class determination and value selection. Using the .pdf tutorial from the NuSMV web site, learning to develop NuSMV models and to use the NuSMV simulator to generate the state space took 20 hours. After encountering state space explosion, generating sets of input vectors for only interacting variables and selecting equivalence class values to achieve 100% branch coverage took an additional 16 hours. Finding a way of achieving 100% MC/DC coverage without manual intervention took another 16 hours. In total, the learning curve was 84 hours. As errors were found in models, the worst-case time spent completely regenerating and re-executing tests was under 90 minutes, but more commonly it was less than 15 minutes.

Maturity was an evaluation of readiness for deployment across a potential population of several thousand engineers—e.g., whether the tools crash frequently or produce inconsistent, incorrect, or confusing results. The study used the 9-level NASA/DoD Technology Readiness scale3 and found the toolset to be at Level 7, “System Prototype Demonstrated in [an operational environment]”. In summary, prototype software exists and all key functionality is available for demonstration or test; the tools were well integrated with operational systems; operational feasibility was demonstrated and most of the software bugs have been eliminated; and at least some documentation is available. A general deployment would require Level 9, “Actual system [performance] proven through successful [developmental use].”

Conclusion

For unit test, this approach appears to be much more effective than the standard manual, iterative approach of writing tests, running them, checking coverage, writing more tests to fill coverage gaps, running more tests, and so on. Defining the input space to achieve required coverage consumed the largest amount of time, requiring several iterations of test case generation, especially to achieve full MC/DC. With experience, however, the number of iterations was significantly reduced. The study used staff with significant experience, but in general the approach required no knowledge or skills that could not easily be learned by an above-average entry-level software engineer—e.g., creating and debugging the test generation models was much easier than writing and debugging the source code being tested.

Overall, results of the study were positive, although there are remaining issues of deployment packaging and tool licensing, training, mentoring, and technical support. Data for an empirical comparative evaluation of defect detection capability between combinatorial testing and other approaches do not exist, but there is enough evidence from the literature to justify a pilot project or a trial deployment in a business unit. This is the current plan going forward.