
LEGACY SYSTEM SOFTWARE SUSTAINMENT

Using Combinatorial Testing to Reduce Software Rework

Redge Bartholomew, Rockwell Collins

This paper describes an industry proof-of-concept study that used NIST's approach to automate unit testing of a software defined radio's control software. The goal was to determine if the NIST approach could cost-effectively reduce the number of latent software defects escaping into system testing and at the same time achieve the structural coverage required by regulatory authorities.

Abstract. In developing many safety-critical, embedded systems, rework to fix software defects detected late in the test phase is the largest single cause of cost overrun and schedule delay. Typically, these defects involve the interactions among no more than 6 variables, suggesting that 6-way combinatorial tests could detect them much earlier. NIST developed an approach to automatically generating, executing, and analyzing such tests. This paper describes an industry proof-of-concept demonstration to see if this approach could significantly reduce the number of defects that escape into the test and evaluation phase of safety-critical embedded systems.

Introduction

Studies of safety-critical, embedded systems have shown that the rework required to fix late-detected software defects is one of the largest single components of their development cost and schedule—e.g., [1][2][3][4][5]. They also show that detection of these latent defects accelerates during late-stage testing and that those detected during operational test and evaluation have become more than just problematic. Much of this is attributable to verification tools and techniques that are becoming increasingly inadequate as the scale and complexity of software continues to increase [6][7][8][9]. An emerging need to develop parallel software for embedded multicore processors will make this problem worse [10]. Improvement requires tools and methods that prevent defect injection or that accelerate detection. They must do so, however, without a prohibitively large impact on normal development.

A study conducted by NIST and NASA looked at software defects detected over a 15-year period [11]. Systems studied included avionics, medical devices, web browsers, servers, space systems, and network security systems, and ranged in size from tens of thousands to hundreds of thousands of lines of code. It found that defects were triggered by the interactions among no more than six variables. This being the case, 6-way combinatorial test vectors might be able to detect them. Subsequently, NIST and the University of Texas-Arlington found an efficient algorithm for minimizing the number of test vectors that would cover up to 6-way combinations of input values [12][13][14][15]. They implemented this algorithm in a tool called Automated Combinatorial Test System (ACTS)1.

The tool was effective at triggering defects, but verification testing required expected outputs, not just inputs, and creating these manually for thousands of inputs would be prohibitively expensive. NIST found an approach to automating this process using a model checker's counter-examples. It also created a utility that merged the input vectors with their expected outputs, as well as a test harness that read complete test cases, executed tests, analyzed results (compared actual versus expected outputs), and identified anomalies [16].

The Test Environment

Tests were generated, executed, and analyzed on a Windows 7, quad-core, 2.5 GHz, i5 laptop with 4GB memory. ACTS was used to create a model of the input variables, generate 6-way combinatorial test vectors, and export them to a networked server. The NuSMV2 model checker generated the state space and exported it to the same networked server. An in-house utility function read the two files, searched the state space for states containing the ACTS vectors, reformatted them, and exported them as test cases back to the server. A commercial test harness, VectorCAST, instrumented the source code to track structural coverage, measured code complexity, imported the test case file, loaded test values into input variables, and executed tests. It also accumulated the achieved modified condition/decision coverage (MC/DC) [18], collected output variable values, compared actual with expected values, and identified discrepancies.

The code being tested was a software defined radio's control interface, containing 196,000 executable source lines of C++ code. The initial focus of the study was a code unit responsible for controlling the radio's waveform mode (e.g., HAVEQUICK, SINCGARS, Link 4) and operational state (e.g., idle, ready, running). This unit had 579 lines of code, 34 input variables, and 4 output variables of interest, used by 47 decisions nested up to 8 levels deep, spread over a 6-case switch. Its measured complexity (number of unique execution paths) was 46. In addition to the mode and state controller, the study tested another 70 of 717 code files.
Defining the Input Space

Developers provide ACTS with a name, a data type, and a set of values for each input variable. They also select the combinatorial strength of the vector generation (2-way through 6-way). ACTS then generates a set of input vectors containing all combinations of input variable values for the selected strength. Table 1 shows the 2-way vectors ACTS generated for the function:

    if (c == true)
        e = a + b;
    else
        e = a * d;
    return e;

         a     b     c      d
    1    0     255   true   -1
    2    0     256   false   0
    3    0     255   false   1
    4    15    256   true   -1
    5    15    255   true    0
    6    15    256   false   1
    7    16    255   false  -1
    8    16    256   true    0
    9    16    255   true    1

    Table 1: Two-Way Combinatorial Vectors
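Exhaustively enumerating the four inputs above would require 3 × 2 × 2 × 3 = 36 test cases, while the nine vectors in Table 1 cover every 2-way combination. As a minimal sketch only (the study used the VectorCAST harness, not a hand-written driver), a unit-test driver for this example might load each vector into the inputs and compare the actual output with an expected value. The expected values below are worked out by hand from the if/else definition above, standing in for the oracle values the model checker provides.

    // Minimal, hypothetical unit-test driver for the example function.
    // Each Table 1 vector is loaded into the inputs and the actual output
    // is compared with the expected value (computed from the definition of
    // e: a + b when c is true, otherwise a * d).
    #include <cstdio>

    // Example function under test, as shown above.
    static int example(int a, int b, bool c, int d) {
        int e;
        if (c == true)
            e = a + b;
        else
            e = a * d;
        return e;
    }

    struct TestCase { int a; int b; bool c; int d; int expected_e; };

    int main() {
        const TestCase cases[] = {
            { 0, 255, true,  -1, 255 }, { 0, 256, false,  0,   0 },
            { 0, 255, false,  1,   0 }, {15, 256, true,  -1, 271 },
            {15, 255, true,   0, 270 }, {15, 256, false,  1,  15 },
            {16, 255, false, -1, -16 }, {16, 256, true,   0, 272 },
            {16, 255, true,   1, 271 },
        };
        int failures = 0;
        for (const TestCase& t : cases) {
            int actual = example(t.a, t.b, t.c, t.d);
            if (actual != t.expected_e) {   // discrepancy: report it
                ++failures;
                std::printf("FAIL: a=%d b=%d c=%d d=%d expected %d got %d\n",
                            t.a, t.b, t.c, t.d, t.expected_e, actual);
            }
        }
        std::printf("%d failure(s) in 9 test cases\n", failures);
        return failures == 0 ? 0 : 1;
    }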

Defining the input space to maximize defect detection and structural coverage without significant test iteration (test, measure coverage, determine coverage gaps, add input vectors, repeat) is nontrivial [12]. The greater the number of input test values, the greater the code coverage but also the greater the likelihood of combinatorial explosion. The smaller the number, the greater the likelihood of missed defects and inadequate structural coverage.

A compromise is to limit input values to those representing equivalence classes [16]. For each input variable, possible values are segregated into groups that would ostensibly produce no difference of interest in code behavior or output value. One or more representative values are then picked from each group. This typically includes values that test behavior across instruction and memory architecture boundaries (e.g., positive and negative minimum and maximum values, and 0), data definition ranges, coordinate systems, units of measure, and so on, and also those that drive decision conditions.
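As a purely hypothetical illustration (the paper does not list the controller's actual variables or ranges), the representative values for a signed 16-bit input with an assumed valid range of -1000 to 1000 might be chosen like this, one or two values per equivalence class plus the architecture and range boundaries mentioned above:

    // Hypothetical example: equivalence-class representatives for one input.
    // The "gain" parameter and its -1000..1000 valid range are assumptions
    // made only to illustrate the selection pattern described in the text.
    #include <cstdint>
    #include <vector>

    std::vector<int16_t> gainRepresentatives() {
        return {
            INT16_MIN,   // architecture boundary: most negative representable value
            -1001,       // just below the valid data-definition range (invalid class)
            -1000,       // lower boundary of the valid range
            -1,          // negative in-range representative
            0,           // zero, a common decision-condition driver
            1,           // positive in-range representative
            1000,        // upper boundary of the valid range
            1001,        // just above the valid range (invalid class)
            INT16_MAX    // architecture boundary: most positive representable value
        };
    }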
Identifying representative values for boundary values was straightforward. Finding values for condition variables in complex, nested logic—values that would force the execution paths required for code coverage—took more time. MC/DC requires that every condition in a decision has taken all possible outcomes at least once, and that each condition in each decision has been shown to independently affect that decision's outcome. Demonstrating independence of outcome typically requires modifying each condition in a decision while all others remain fixed, and showing that this modification has changed the outcome of the decision. For the while-loop in

    if ((a != b) && (a != c))
    {
        ...
        while ((a != b) && (a != c))
        {
            a = chan();
        }
    }

tests must be run to show that when both conditions are true, the loop is executed, and that when each is false but the other true, the loop is not executed. To determine the input space, values that force execution of each such path under the required conditions must be selected for each variable of each condition of each decision.

Enabling those values was difficult when the condition variable was an input and the values had to be loaded by an external procedure invoked from within a decision. In the example, the loop decision must be tested when a = b and when a = c, neither of which conditions can be created by direct input from a test case. The value of a must be changed at runtime by the call to the external procedure chan(), which is stubbed out for unit test. The work-around was to add test-unique variables to the test cases generated by ACTS and the model checker. Test stubs were replaced with small procedures that loaded the value of the test-unique variable directly or indirectly into the condition variable. In the example, the test variable's value would be loaded into the return value of chan().
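A minimal sketch of that work-around, assuming chan() is a free function as in the example and that a test-unique variable (here called test_chan_value, an illustrative name) is appended to each generated test case:

    // Hypothetical sketch of the stub work-around described above.
    // The test harness loads test_chan_value from the extra, test-unique
    // entry appended to each generated test case; the replacement stub then
    // feeds that value into the condition variable through chan().
    static int test_chan_value = 0;   // set by the harness before each test case

    // Replacement for the stubbed-out external procedure chan().
    int chan() {
        // Returning the test-unique value lets a test case drive a toward b
        // or c at runtime, forcing the loop conditions (a == b, a == c) that
        // direct inputs alone cannot create.
        return test_chan_value;
    }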
Generating a state space for all 34 input variables of the mode-state controller produced combinatorial explosion. Several separate sets of test vectors had to be generated instead, each set covering only those variables that interact to produce an output. The test harness assigned default values to those variables not included in a test case. Maximizing structural coverage required running all such sets of tests. In no case, however, was there an output value affected by interactions among more than six input variables, and in aggregate all 6-way combinations of interacting variables were tested.

Generating Expected Outputs and Executing Tests

The model checker is given a model containing variable definitions, their relationships, their values in an initial state, and how their values are determined in subsequent states. It then generates the state space (or a binary decision diagram of it), each state mapping a combination of input variable values to output variable values. See Fig. 1, showing the mapping of the input values from Table 1 to the output variable, e. For all states in which the value of c is true, the value of e will be equal to the value of a plus the value of b, which is expressed as c = true : a + b. In all other states, the value of e will be equal to the value of a times the value of d, expressed as TRUE : a * d. Fig. 1b shows a segment of the generated state space—the value of e followed by the input values that produced it.

    MODULE main
    VAR
        a : {0,15,16};
        b : {255,256};
        c : {true,false};
        d : {-1,0,1};
    DEFINE
        e :=
            case
                (c = true) : a + b;
                TRUE : a * d;
            esac;

    Fig. 1a. NuSMV Model

    ------- State 4 -------
    e = 0
    a = 0
    b = 255
    c = false
    d = 1
    ------- State 5 -------
    e = 272
    a = 16
    b = 256
    c = true
    d = 1
    ------- State 6 -------

    Fig. 1b. State Space Segment

In the NIST approach, the process of creating expected outputs for an input test vector relies on a model checker's counter-examples [17]. Ordinarily, to verify requirements or a design, developers using a model checker would create a model like the one in Fig. 1a, but they would also write properties the model must preserve—e.g., there must always be a way for the variable e to be 0, there must always be a way for it to be 272. The model checker attempts to prove that the model preserves these properties. Where it finds a violation of a property (a counter-example—e.g., an execution path in which e can never be 0), it produces a trace of the states that led to the violation. To have a model checker determine an expected output for a given input vector, developers could negate a property and use the counter-example to trace back to the input values that produced it. For example, they could specify that the variable e must never be 0. The model checker would detect a state that violated this property and generate a counter-example showing the state transitions from the initial input values (the input vector) to the point at which e became 0. A simple utility could create a complete test case from a counter-example by merging the value of the output variable with the values of the input variables that produced it [16].

This study used a slightly different approach, requiring a smaller learning curve. Instead of searching through counter-examples generated by the model checker, the utility function searches for each input vector across the entire state space generated by the model checker. The model in Fig. 1a generated 36 states: those containing all possible combinations of variable values. As shown in Table 1, all 2-way combinations of inputs can be covered by the nine input vectors generated by ACTS. The utility function finds state 4, containing the input vector {0,255,false,1}, eliminates any irrelevant inputs and outputs from the state, reformats the remainder (the input vector and its expected outputs), and exports the result, {0,0,255,false,1}, to the test harness. When it has found and exported all 9 test cases, it is finished.
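A minimal sketch of such a search-and-merge utility follows. The file formats are assumptions, since the paper does not describe them: ACTS vectors as one comma-separated line per vector, and the state space as a text listing of "name = value" lines under "------- State n -------" headers, as in Fig. 1b.

    // Hypothetical sketch of the in-house search-and-merge utility described
    // above; file names and formats are assumptions, not the study's actual ones.
    // states.txt : "------- State n -------" headers followed by "name = value" lines
    // vectors.csv: one ACTS vector per line, e.g. "0,255,false,1" for a,b,c,d
    // output     : one test case per line, expected output first, e.g. "0,0,255,false,1"
    #include <fstream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    using State = std::map<std::string, std::string>;  // variable name -> value

    static std::string trim(const std::string& s) {
        const auto b = s.find_first_not_of(" \t\r");
        const auto e = s.find_last_not_of(" \t\r");
        return b == std::string::npos ? "" : s.substr(b, e - b + 1);
    }

    // Parse the model checker's state listing into one map per state.
    static std::vector<State> readStates(const std::string& path) {
        std::vector<State> states;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line)) {
            if (line.find("State") != std::string::npos) {
                states.emplace_back();                 // start a new state
            } else if (!states.empty()) {
                const auto eq = line.find('=');
                if (eq != std::string::npos)
                    states.back()[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
            }
        }
        return states;
    }

    int main() {
        const std::vector<std::string> inputs = {"a", "b", "c", "d"};  // input variables
        const std::vector<std::string> outputs = {"e"};                // expected outputs
        const std::vector<State> states = readStates("states.txt");

        std::ifstream vecs("vectors.csv");
        std::ofstream out("testcases.csv");
        std::string line;
        while (std::getline(vecs, line)) {
            // Split one ACTS vector into its input values.
            std::vector<std::string> values;
            std::stringstream ss(line);
            for (std::string v; std::getline(ss, v, ',');) values.push_back(trim(v));
            if (values.size() != inputs.size()) continue;

            // Search the state space for a state containing this input vector.
            for (const State& s : states) {
                bool match = true;
                for (size_t i = 0; i < inputs.size(); ++i)
                    if (s.count(inputs[i]) == 0 || s.at(inputs[i]) != values[i]) {
                        match = false;
                        break;
                    }
                if (!match) continue;

                // Merge: expected output(s) first, then the input vector.
                for (const std::string& o : outputs) out << s.at(o) << ",";
                for (size_t i = 0; i < values.size(); ++i)
                    out << values[i] << (i + 1 < values.size() ? "," : "\n");
                break;
            }
        }
        return 0;
    }

For the Fig. 1a model and the Table 1 vectors, a utility along these lines would emit nine test cases, the state 4 match producing 0,0,255,false,1 as described above.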


Developers then load the test harness with both the source code and the test cases, and map the test case entries to input and output variable names—e.g., map the first entry of the test case in Fig. 1b (0) to the source code variable e, and the second entry (0) to the variable a. They can then execute the tests. Failures and the achieved code coverage can be monitored in test harness windows. Correctness of the expected outputs (verifying the oracle) is established when the resulting test cases are able to detect all seeded defects with no false positives.

Results

Putting aside defective or incomplete requirements, misinterpretations of requirements and design decisions, and other errors not revealed by exercising the code, at issue was whether such an automated test approach could cost-effectively detect all (or nearly all) implementation defects. Evaluation criteria included accuracy, structural coverage, scalability, execution time, maturity, ease of learning, and ease of use.

Accuracy was measured in two ways: as the percent of seeded defects the tests detected, and as the percent of false detections (the number of false positive detections as a percent of total detections). Defects were manually and arbitrarily seeded into versions of the code by changing values in arithmetic and logic statements, changing arithmetic signs, reversing and negating comparisons, deleting statements, and so on. In all, there were over 200. After debugging the NuSMV model, the search-export utility, and the test harness definition, the generated tests triggered all defects with no false detections.

The initial set of tests achieved 75% statement coverage, 71% branch coverage, and 68% MC/DC. The relatively low initial coverage was the result of the inadequately defined input space, described earlier. With a better understanding of how the input space was to be defined, the subsequently generated test cases achieved 100% MC/DC.

Scalability was an evaluation of both size (in this case, the number of input and output variables) and logical complexity. As mentioned earlier, after limiting inputs to only interacting variables, test generation never again produced state space explosion. After using test variables to deal with loops that changed the value of their condition variables, there were no further complexity issues.

Execution time was acceptable: for the largest vector generation model (19 input variables, 1 output variable), ACTS produced 2,775 input vectors in six seconds, NuSMV generated the state space in about 60 minutes, and searching it and building the test cases took just over eight minutes. The test harness imported them in 15 seconds, created their executable tests in 12 seconds, and executed and analyzed them in under eight minutes.

Cost effectiveness was a measure of the value-in-use (accuracy, coverage, scalability, and performance), the effort required to learn the approach, and the effort required to use it on an ongoing basis. Learning to use ACTS was simple. NIST provides a tutorial that takes about two hours to process and contains everything needed to begin using the tool. Initial definition of the 34 input variables used by the mode controller took four hours, including initial equivalence class determination and value selection. Using the .pdf tutorial from the NuSMV web site, learning to develop NuSMV models and to use the NuSMV simulator to generate the state space took 20 hours. After encountering state space explosion, generating sets of input vectors for only interacting variables and selecting equivalence class values to achieve 100% branch coverage took an additional 16 hours. Finding a way of achieving 100% MC/DC coverage without manual intervention took another 16 hours. In total, the learning curve was 84 hours. As errors were found in models, the worst-case time spent completely regenerating and re-executing tests was under 90 minutes, but more commonly it was less than 15 minutes.

Maturity was an evaluation of readiness for deployment across a potential population of several thousand engineers—e.g., whether the tools crash frequently or produce inconsistent, incorrect, or confusing results. The study used the 9-level NASA/DoD Technology Readiness scale3 and found the toolset to be at Level 7, "System Prototype Demonstrated in [an operational environment]." In summary, prototype software exists and all key functionality is available for demonstration or test; the tools were well integrated with operational systems; operational feasibility was demonstrated and most of the software bugs have been eliminated; and at least some documentation is available. A general deployment would require Level 9, "Actual system [performance] proven through successful [developmental use]."

Conclusion

For unit test, this approach appears to be much more effective than the standard manual, iterative approach of writing tests, running them, checking coverage, writing more tests to fill coverage gaps, running more tests, and so on. Defining the input space to achieve required coverage consumed the largest amount of time, requiring several iterations of test case generation, especially to achieve full MC/DC. With experience, however, the number of iterations was significantly reduced. The study used staff with significant experience, but in general the approach required no knowledge or skills that could not easily be learned by an above-average entry-level software engineer—e.g., creating and debugging the test generation models was much easier than writing and debugging the source code being tested.

Overall, results of the study were positive, although there are remaining issues of deployment packaging and tool licensing, training, mentoring, and technical support. Data for an empirical comparative evaluation of defect detection capability between combinatorial testing and other approaches do not exist, but there is enough evidence from the literature to justify a pilot project or a trial deployment in a business unit. This is the current plan going forward.


ABOUT THE AUTHOR

Redge Bartholomew is with Rockwell Collins, currently researching tools and methods for automating the development of embedded software and for reducing the number of latent software defects found during test and evaluation.

400 Collins Road
M.S. 108-265
Cedar Rapids, Iowa 52498
Phone: 319-295-1906
E-mail: [email protected]

NOTES

1. The ACTS executable is available from NIST. See <http://csrc.nist.gov/groups/SNS/acts/index.html>
2. NuSMV is available from <http://nusmv.fbk.eu>
3. Technology Readiness Calculator at <https://acc.dau.mil/CommunityBrowser.aspx?id=320594&lang=en-US>

REFERENCES

1. Government Accountability Office, "F-35 Joint Strike Fighter: Current Outlook Is Improved, but Long-Term Affordability Is a Major Concern," GAO-13-309, March 2013.
2. Government Accountability Office, "KC-46 Tanker Aircraft: Program Generally Stable but Improvements in Managing Schedule Are Needed," GAO-13-258, February 2013.
3. Government Accountability Office, "Airborne Electronic Attack: Achieving Mission Objectives Depends on Overcoming Acquisition Challenges," GAO-12-175, March 2012.
4. Jones, "Software Quality and Software Economics," SoftwareTech News, April 2010.
5. Dvorak (ed.), NASA Study on Flight Software Complexity, March 2009.
6. National Academy of Sciences, Critical Code: Software Producibility for Defense, 2010.
7. Baldwin, "DoD Software Engineering and System Assurance," NDIA Proceedings of the 11th Annual Systems Engineering Conference, October 2008.
8. Afzal, Torkar, Feldt, "Search-Based Prediction of Fault-Slip-Through in Large Software Projects," IEEE Symposium on Search Based Software Engineering, September 2010.
9. Andersin, "TPI – a Model for Test Process Improvement," Seminar on Quality Models for Software Engineering, University of Helsinki, October 2004.
10. Lu, Park, Seo, Zhou, "Learning from Mistakes – A Comprehensive Study on Real World Concurrency Bug Characteristics," ACM Proceedings of the 13th Annual International Conference on Architectural Support for Programming Languages and Operating Systems, March 2008.
11. Kuhn, Wallace, Gallo, "Software Fault Interactions and Implications for Software Testing," IEEE Transactions on Software Engineering, June 2004.
12. Borazjany, Yu, Lei, Kacker, Kuhn, "Combinatorial Testing of ACTS: A Case Study," Proceedings of the International Conference on Software Testing, Verification, and Validation, April 2012.
13. Lei, Kacker, Kuhn, Okun, Lawrence, "IPOG: A General Strategy for T-Way Software Testing," IEEE Proceedings of the Conference and Workshops on the Engineering of Computer-Based Systems, March 2007.
14. Kuhn, Lei, Kacker, "Practical Combinatorial Testing: Beyond Pairwise," IEEE IT Pro, May/June 2008.
15. Kuhn, Okun, "Pseudo-Exhaustive Testing for Software," NASA/IEEE Proceedings of the 30th Software Engineering Workshop, April 2006.
16. Kuhn, Kacker, Lei, Practical Combinatorial Testing, NIST Special Publication 800-142, National Institute of Standards and Technology, October 2010.
17. Ammann, Black, Majurski, "Using Model Checking to Generate Tests from Specifications," IEEE Proceedings of the 2nd International Conference on Formal Engineering Methods, December 1998.
18. RTCA, DO-178C: Software Considerations in Airborne Systems and Equipment Certification, RTCA, 2011.

