Stefan Biffl
Johannes Bergsmann (Eds.)

Software Quality
Model-Based Approaches for Advanced Software and Systems Engineering

LNBIP 166
Lecture Notes in Business Information Processing 166
Series Editors
Wil van der Aalst
Eindhoven Technical University, The Netherlands
John Mylopoulos
University of Trento, Italy
Michael Rosemann
Queensland University of Technology, Brisbane, Qld, Australia
Michael J. Shaw
University of Illinois, Urbana-Champaign, IL, USA
Clemens Szyperski
Microsoft Research, Redmond, WA, USA
Dietmar Winkler
Stefan Biffl
Johannes Bergsmann (Eds.)
Software Quality
Model-Based Approaches for Advanced
Software and Systems Engineering
Volume Editors
Dietmar Winkler
Vienna University of Technology
Institute of Software Technology
and Interactive Systems
Vienna, Austria
E-mail: [email protected]
Stefan Biffl
Vienna University of Technology
Institute of Software Technology
and Interactive Systems
Vienna, Austria
E-mail: stefan.biffl@tuwien.ac.at
Johannes Bergsmann
Software Quality Lab GmbH
Linz, Austria
E-mail: [email protected]
The Software Quality Days (SWQD) conference and tools fair started in 2009
and has grown into one of the largest software quality conferences in Europe,
with a strong community. The program of the SWQD conference is designed to
offer a stimulating mixture of practical presentations, scientific presentations of
new research topics, tutorials, and an exhibition area for tool vendors and other
organizations in the area of software quality.
This professional symposium and conference offers a range of comprehensive
and valuable opportunities for advanced professional training, new ideas, and
networking, through a series of keynote speeches, professional lectures, exhibits,
and tutorials.
The SWQD conference is suitable for anyone with an interest in software
quality, such as test managers, software testers, software process and quality
managers, product managers, project managers, software architects, software
designers, user interface designers, software developers, IT managers, develop-
ment managers, application managers, and those in similar roles.
The 6th Software Quality Days (SWQD) conference and tools fair brought to-
gether researchers and practitioners from business, industry, and academia work-
ing on quality assurance and quality management for software engineering and
information technology. The SWQD conference is one of the largest software
quality conferences in Europe.
Over the past years, a growing number of scientific contributions has been
submitted to the SWQD symposium. Starting in 2012, the SWQD symposium has
included a dedicated scientific track published in scientific proceedings. In this
third year, we received 24 high-quality submissions from researchers across
Europe, each of which was peer-reviewed by three or more reviewers. Out of
these submissions, the editors selected four contributions as full papers, for an
acceptance rate of 17%. Further, ten short papers, which represent promising
research directions, were accepted to spark discussions between researchers and
practitioners at the conference.
The main topics from academia and industry focused on systems and software
quality management methods, improvements of software development methods
and processes, latest trends in software quality, and testing and software quality
assurance.
This book is structured according to the sessions of the scientific track fol-
lowing the guiding conference topic “Model-Based Approaches for Advanced
Software and Systems Engineering”:
• Software Process Improvement and Measurement
• Requirements Management
• Value-Based Software Engineering
• Software and Systems Testing
• Automation-Supported Testing
• Quality Assurance and Collaboration
SWQD 2014 was organized by the Software Quality Lab GmbH and the Vienna
University of Technology, Institute of Software Technology and Interactive Sys-
tems, and the Christian Doppler Laboratory “Software Engineering Integration
for Flexible Automation Systems.”
Organizing Committee
General Chair
Johannes Bergsmann Software Quality Lab GmbH
Proceedings Chair
Dietmar Winkler Vienna University of Technology
Program Committee
SWQD 2014 established an international committee of well-known experts in
software quality and process improvement to peer-review the scientific submis-
sions.
Additional Reviewers
Asim Abdulkhaleq Marcel Ibe
Alarico Campetelli Jan-Peter Ostberg
Peter Engel Jasmin Ramadani
Daniel Mendez Fernandez Joachim Schramm
Benedikt Hauptmann Fabian Sobiech
Table of Contents
Keynote
Software Quality Assurance by Static Program Analysis
Reinhard Wilhelm
Requirements Management
Statistical Analysis of Requirements Prioritization for Transition to
Web Technologies: A Case Study in an Electric Power Organization
Panagiota Chatzipetrou, Christos Karapiperis,
Chrysa Palampouiki, and Lefteris Angelis
Automation-Supported Testing
Automated Test Generation for Java Generics (Short Paper)
Gordon Fraser and Andrea Arcuri
Software Quality Assurance by Static Program Analysis

Reinhard Wilhelm
Fachrichtung Informatik, Universität des Saarlandes, Saarbrücken, Germany
[email protected]
http://rw4.cs.uni-saarland.de/people/wilhelm.shtml
1 Introduction
Soundness, however, comes at the price of completeness: the static analysis may
fail to derive all correctness statements that actually hold and may thus issue
warnings that are, in fact, false alarms. Bug-chasing tools, in general, are both
unsound and incomplete; they produce false alarms and they fail to detect all bugs.
The precision of static analysis, i.e., the set of valid correctness statements
proved by the analysis, increases with more information made available to it.
This additional information can be given by programmer annotations, or it can
be transferred from the model level in model-based software design.
Static analysis has emancipated itself from its origins as a compiler technology
and has become an important verification method [12]. Static analyses prove
safety properties of programs, such as the absence of run-time errors [3,15],
and they prove partial correctness of programs [14]. They determine execution-
time bounds for embedded real-time systems [5,19] and check synchronization
properties of concurrent programs [20]. Static analysis has become indispensable
for the development of reliable software. It is one of several candidate formal
methods for the analysis and verification of software.
The acceptance of these formal methods in industry has been slow. This is
partly due to the competences required of the users of these methods.
[Figure: the workflow of verification by abstract interpretation; its labels are: system, property P, abstract semantics, abstract domain, abstract transfer functions, abstract interpreter for P, verification of the system wrt. P, verification time, developer.]
The foundations of static program analysis were laid by Gary A. Kildall (1972) [11]
and by Patrick Cousot (1978) [4]. Gary Kildall clarified the lattice-theoretic
foundations of data-flow analysis. Patrick Cousot
established the relation between the semantics of a programming language and
static analyses of programs written in this language. He therefore called such
a semantics-based program analysis abstract interpretation. This relation to the
language semantics allowed for a correctness proof of static analyses and even
the design of analyses that were correct by construction.
[Fig. 3 also depicts the rules-of-signs lattice: the power set of {−, 0, +}, with {} at the bottom, the singletons {−}, {0}, {+} above it, the two-element sets {−, 0}, {−, +}, {0, +} above those, and {−, 0, +} at the top.]

    +#          {0}         {+}          {−}
    {0}         {0}         {+}          {−}
    {+}         {+}         {+}          {−, 0, +}
    {−}         {−}         {−, 0, +}    {−}
    {−, 0}      {−, 0}      {−, 0, +}    {−, 0}
    {−, +}      {−, +}      {−, 0, +}    {−, 0, +}
    {0, +}      {0, +}      {0, +}       {−, 0, +}
    {−, 0, +}   {−, 0, +}   {−, 0, +}    {−, 0, +}

Fig. 3. The lattice for rules-of-signs and the table for the abstract addition. Missing
columns can be added by symmetry.
The association of sets of signs with program variables is called sign envi-
ronment, as said above. In order for the program analysis to determine (new)
signs of variables it needs to execute statements in sign environments instead of
in value environments. These abstract versions of the semantics of statements
are called abstract transfer functions. For an assignment statement x = e; its
abstract transfer function looks up the sign information for the variables in e,
attempts to determine the possible signs of e, and associates these signs with x
in the sign environment. To do so, the analysis must be able to evaluate expressions
in sign environments and obtain the possible signs of an expression's value.
We know the rules for this evaluation from our school days: − × − gives +,
− × + gives −, and so on. We can easily extend them to sets of signs by computing
the resulting signs for all combinations and collecting the results in a set. The
rules for addition are given in the table in Fig. 3; the table defines an abstract
addition operator +#.
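To make these abstract operations concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that lifts the school rules for addition and multiplication to sets of signs and evaluates simple expressions in a sign environment; the expression encoding and all helper names are assumptions made for this example.

```python
from itertools import product

SIGNS = {'-', '0', '+'}
TOP = frozenset(SIGNS)  # no information: any sign is possible

def sign_of(n):
    return '+' if n > 0 else '-' if n < 0 else '0'

# School rules for single signs.
def add1(a, b):
    if a == '0': return {b}
    if b == '0': return {a}
    return {a} if a == b else set(SIGNS)   # (+) + (-) may be -, 0 or +

def mul1(a, b):
    if a == '0' or b == '0': return {'0'}
    return {'+'} if a == b else {'-'}

# Lift a rule to sets of signs: combine all pairs and collect the results,
# in the style of the abstract addition +# of Fig. 3.
def lift(rule, xs, ys):
    out = set()
    for a, b in product(xs, ys):
        out |= rule(a, b)
    return frozenset(out)

# Abstract evaluation of expressions in a sign environment.
# Expressions are numbers, variable names, or tuples ('+' | '*', e1, e2).
def eval_signs(expr, env):
    if isinstance(expr, (int, float)):
        return frozenset({sign_of(expr)})
    if isinstance(expr, str):
        return env.get(expr, TOP)
    op, e1, e2 = expr
    rule = add1 if op == '+' else mul1
    return lift(rule, eval_signs(e1, env), eval_signs(e2, env))

# Abstract transfer function for an assignment x = e.
def assign(x, expr, env):
    new_env = dict(env)
    new_env[x] = eval_signs(expr, env)
    return new_env

if __name__ == "__main__":
    print(lift(add1, {'-', '+'}, {'+'}))            # frozenset({'-', '0', '+'})
    env = {'x': frozenset({'0'}), 'y': frozenset({'+'})}
    print(assign('y', ('+', 'y', 'x'), env)['y'])   # frozenset({'+'})
```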
We have now met the essential ingredients of a static program analysis: an
abstract domain, which forms a lattice, and a set of abstract transfer functions,
one for each statement type. Following this principle we could design more complex
static analyses. A very helpful static analysis, called interval analysis, would compute
at each program point an enclosing interval for all possible values a numerical
variable may take on. The results of this analysis allow one to exclude index-out-of-bounds
errors for arrays: any expression used as an index into an array whose interval of
possible values is completely contained in the corresponding index range of the
array will never cause such an error. The run-time check for this error can thus
be eliminated to gain efficiency.
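As a hedged illustration of this use of interval information (the interval domain itself, widening, and the analysis proper are not shown; all names are our own), such a bounds check can be discharged as follows:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float   # lower bound, may be float('-inf')
    hi: float   # upper bound, may be float('inf')

    def join(self, other):
        # least upper bound of two intervals, used at control-flow joins
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

def bounds_check_needed(index_iv: Interval, array_len: int) -> bool:
    # If the interval of the index expression is contained in [0, array_len - 1],
    # the access can never be out of bounds and the run-time check can be elided.
    return not (0 <= index_iv.lo and index_iv.hi <= array_len - 1)

print(bounds_check_needed(Interval(0, 9), 10))    # False: check can be removed
print(bounds_check_needed(Interval(-1, 9), 10))   # True: a check is still required
```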
Fig. 4 shows a simple program and an associated control-flow graph. We will
demonstrate the working of the rules-of-signs analysis on this program. As a result
we will obtain sets of possible signs for the values of the program variables x and y
at all program points.
The analysis goes iteratively through the program as shown in Fig. 5. It starts
with initial assumptions at node 0 about the signs of the variables, e.g. that these
are unknown and could thus be any sign. It then evaluates the expressions it
encounters along the edges of the control-flow graph in the current sign environments
and propagates the results until the sign environments no longer change, i.e., a
fixed point is reached.
1: x = 0;
2: y = 1;
3: while (y > 0) do
4:   y = y + x;
5:   x = x + (-1);

[Fig. 4: the program and its control-flow graph with nodes 0-5; the edges are labeled
x = 0, y = 1, true(y>0), false(y>0), y = y+x, and x = x+(-1).]
Point:    0                   1                 2               3                  4               5
          x        y          x      y          x      y        x       y          x      y        x      y
Iter. 1   {−,0,+}  {−,0,+}    {0}    {−,0,+}    {0}    {+}      {}      {}         {0}    {+}      {0}    {+}
Iter. 2   {−,0,+}  {−,0,+}    {0}    {−,0,+}    {−,0}  {+}      {}      {}         {−,0}  {+}      {−,0}  {−,0,+}
Iter. 3   {−,0,+}  {−,0,+}    {0}    {−,0,+}    {−,0}  {+}      {−,0}   {−,0,+}    {−,0}  {+}      {−,0}  {−,0,+}
Iter. 4   {−,0,+}  {−,0,+}    {0}    {−,0,+}    {−,0}  {+}      {−,0}   {−,0,+}    {−,0}  {+}      {−,0}  {−,0,+}
Fig. 5. (Cleverly) iterating through the program. Column numbers denote program
points, rows describe iterations.
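The iteration sketched in Fig. 5 can be reproduced in spirit with a small round-robin fixpoint computation over the control-flow graph of Fig. 4. The following Python sketch is our own encoding (the edge representation, the decision to refine only the true branch of the loop condition, and the visit order are assumptions), so individual intermediate values may be more or less precise than the cells printed in Fig. 5; the principle of joining environments and iterating until stabilization is the same.

```python
from itertools import product

SIGNS = frozenset({'-', '0', '+'})
BOT = None  # "node not yet reached"

def add1(a, b):
    if a == '0': return {b}
    if b == '0': return {a}
    return {a} if a == b else set(SIGNS)

def sign_add(xs, ys):
    out = set()
    for a, b in product(xs, ys):
        out |= add1(a, b)
    return frozenset(out)

def join(e1, e2):
    if e1 is BOT: return e2
    if e2 is BOT: return e1
    return {v: e1[v] | e2[v] for v in e1}

# Abstract transfer functions for the edge labels of Fig. 4.
def x_eq_0(env):       return {**env, 'x': frozenset({'0'})}
def y_eq_1(env):       return {**env, 'y': frozenset({'+'})}
def true_y_gt_0(env):  return {**env, 'y': env['y'] & frozenset({'+'})}
def false_y_gt_0(env): return dict(env)        # no refinement on the exit edge here
def y_plus_x(env):     return {**env, 'y': sign_add(env['y'], env['x'])}
def x_minus_1(env):    return {**env, 'x': sign_add(env['x'], frozenset({'-'}))}

# Edges (source, target, transfer); node 2 is the loop head, node 3 the loop exit.
EDGES = [(0, 1, x_eq_0), (1, 2, y_eq_1),
         (2, 4, true_y_gt_0), (2, 3, false_y_gt_0),
         (4, 5, y_plus_x), (5, 2, x_minus_1)]

def analyze():
    envs = {n: BOT for n in range(6)}
    envs[0] = {'x': SIGNS, 'y': SIGNS}      # nothing is known at the entry point
    changed = True
    while changed:                          # iterate until a fixed point is reached
        changed = False
        for src, dst, transfer in EDGES:
            if envs[src] is BOT:
                continue
            new = join(envs[dst], transfer(envs[src]))
            if new != envs[dst]:
                envs[dst], changed = new, True
    return envs

if __name__ == "__main__":
    for node, env in sorted(analyze().items()):
        print(node, {v: ''.join(sorted(s)) for v, s in env.items()})
```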
One could be tempted to say that everything is fine, and on the correctness
side it actually is. If the analysis evaluates e to {+, 0}, we are assured that
executing sqrt(e) never takes the square root of a negative value. And if
the analysis associates e with {+, −}, we know that a/e will never attempt to
divide by 0. On the precision side, the world may look different: a set of signs
{+, 0, −} computed for an expression e in a/e will force the analysis to issue a
warning "Possible division by 0", even where such a division is actually impossible.
2 Applications
Static program analysis is one of the verification techniques required by the
transition from process-based assurance to product-based assurance. It is strongly
recommended for high criticality levels by several international standards, such
as ISO 26262, DO-178C, CENELEC EN 50128, and IEC 61508; see Fig. 6 and
[8,9].
Fig. 6. ISO 26262 highly recommends static code analysis for ASIL levels B to D
Astrée provides many abstract domains for particular program features such as
feedback control loops, digital filters, and clocks. The Astrée user is provided with
options and directives to reduce the number of false alarms significantly. Besides the
usual run-time errors, such as index-out-of-bounds accesses, division by 0, and
dereferences of null pointers, it has strong analyses for the rounding errors of floating-point
computations. These are particularly important for embedded control systems.
The designers of these systems in general assume that the programs work with
reals and base their stability considerations on this assumption. In fact, round-
ing errors may lead to divergence of a control loop with potentially disastrous
consequences.
The PolySpace Verifier [15] aims at similar errors. However, Astrée and
the PolySpace Verifier have different philosophies with regard to the continuation
of the analysis after an identified run-time error. Verifier's philosophy
is "grey follows red", meaning that code following an identified run-time error
is considered as non-reachable and will be disregarded by the analysis. Astrée
continues the analysis with whatever information it may safely assume. Verifier's
philosophy would make sense if the programmer did not remove the detected
run-time error. In practice, he will remove it. The code previously found
to be unreachable thereby becomes reachable (assuming that the modification
removed the error) and will be subject to analysis with possibly newly discovered
errors and warnings. So, Verifier may need more iterations than Astrée
until all errors and warnings are found and removed. Given that the analysis time
is in the order of hours or even days for industrial-size programs, this increased
number of iterations may be annoying.
The PolySpace Verifier is able to check compliance with several coding standards,
such as MISRA C.
4 Conclusion
References
1. AbsInt, http://www.absint.com/ait
2. AbsInt, http://www.absint.com/stackanalyzer/index.htm
3. Blanchet, B., Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux,
D., Rival, X.: A static analyzer for large safety-critical software. In: Proceedings
of the ACM SIGPLAN 2003 Conference on Programming Language Design and
Implementation, PLDI 2003, pp. 196–207. ACM, New York (2003)
4. Cousot, P., Cousot, R.: Abstract interpretation: A unified lattice model for static
analysis of programs by construction or approximation of fixpoints. In: POPL
1977: Proceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles
of Programming Languages, pp. 238–252. ACM, New York (1977)
5. Ferdinand, C., Heckmann, R., Langenbach, M., Martin, F., Schmidt, M., Theiling,
H., Thesing, S., Wilhelm, R.: Reliable and precise WCET determination for a
real-life processor. In: Henzinger, T.A., Kirsch, C.M. (eds.) EMSOFT 2001. LNCS,
vol. 2211, pp. 469–485. Springer, Heidelberg (2001)
6. Ferdinand, C., Heckmann, R., Sergent, T.L., Lopes, D., Martin, B., Fornari, X.,
Martin, F.: Combining a high-level design tool for safety-critical systems with a
tool for WCET analysis on executables. In: ERTS2 (2008)
7. Ferdinand, C., Wilhelm, R.: Efficient and precise cache behavior prediction for
real-time systems. Real-Time Systems 17(2-3), 131–181 (1999)
An Industry Ready DCA Approach Exploring Bayesian Networks

M. Kalinowski, E. Mendes, and G.H. Travassos

Abstract. Defect causal analysis (DCA) has shown itself to be an efficient means to
improve the quality of software processes and products. A DCA approach
exploring Bayesian networks, called DPPI (Defect Prevention-Based Process
Improvement), resulted from research following an experimental strategy. Its
conceptual phase considered evidence-based guidelines acquired through
systematic reviews and feedback from experts in the field. Afterwards, in order
to move towards industry readiness, the approach was evolved based on the results
of an initial proof of concept and a set of primary studies. This paper describes the
experimental strategy followed and provides an overview of the resulting DPPI
approach. Moreover, it presents results from applying DPPI in industry in the
context of a real software development lifecycle, which allowed further
comprehension and insights into using the approach from an industrial
perspective.
1 Introduction
Regardless of the notion of software quality that a project or organization adopts,
most practitioners would agree that the presence of defects indicates a lack of quality,
as noted by Card [1]. Consequently, learning from defects is profoundly
important to quality-focused software practitioners.
Defect Causal Analysis (DCA) [2] allows learning from defects in order to
improve software processes and products. It encompasses the identification of the
causes of defects and ways to prevent them from recurring in the future. Many popular
process improvement approaches (e.g., Six Sigma, CMMI, and Lean) incorporate
causal analysis activities. In industrial settings, effective DCA has helped organizations
to reduce defect rates by over 50 percent. For instance, companies such as IBM
[3], Computer Sciences Corporation [4], and Infosys [5] reported such achievements.
the feasibility of using the approach [11]; and (iv) conducting experimental studies to
evaluate possible benefits of using it [12]. A summary on how each of those activities
was accomplished is provided in the subsections hereafter. The new results obtained
from applying DPPI in industry in a real software development project lifecycle will
be provided in Section 4.
relations identified in DCA meetings. Therefore, this initial concept suggested the use
of Bayesian networks. Using such networks would allow the diagnostic inference to
support defect cause identification in subsequent DCA meetings. The hypotheses
were that this kind of inference could help to answer questions during DCA
meetings, such as: "Given the past projects within my organizational context, with
which probability does a certain cause lead to a specific defect type?". Additionally,
in order to allow the use of the Bayesian diagnostic inferences during DCA
meetings, the traditional cause-effect diagram [15] was extended into a probabilistic
cause-effect diagram [10].
Figure 2 illustrates the approach’s proposed feedback and inference cycle,
representing the cause-effect learning mechanism to support the identification of
defect causes.
Fig. 2. The proposed feedback cycle. Building the Bayesian networks for each development
activity upon results of the DCA meetings.
Later, after an additional SLR trial (conducted in 2009) and feedback gathered
from experts in the field, this initial concept was evolved and tailored into the DPPI
(Defect Prevention-Based Process Improvement) approach [11]. Besides using and
feeding Bayesian networks to support DCA meetings with probabilistic cause-effect
diagrams, DPPI also addresses the specific practices of the CMMI CAR (Causal
Analysis and Resolution) process area, described in [16].
Once proposed, there was a need to evaluate the feasibility of using the approach in
the context of real development projects. Moreover, if the approach showed itself
feasible, there would be a need to evaluate the hypotheses related to the benefits of using
the approach's main innovation: that the Bayesian inference could help in identifying
causes in subsequent DCA meetings. The feasibility was addressed through a proof of
concept (Section 2.3), while the hypotheses were evaluated through a set of
experimental studies (Section 2.4).
Figure 3 also shows the development activity’s causal model. DPPI considers those
causal models to be dynamically established and maintained by feeding Bayesian
networks. Pearl [20] suggests that probabilistic causal models can be built using
Bayesian networks, if concrete examples of cause-effect relations can be gathered in
order to feed the network. In the case of DPPI, the examples to feed the Bayesian
network can be directly taken from the results of each DCA meeting.
A brief description of the four DPPI activities and their tasks, with examples of
how they can be applied considering real software project data taken from the proof
of concept described in [11], is provided in the following subsections.
Analyze Development Activity Results. This task aims at analyzing the development
activity’s defect-related results by comparing them against historical defect-related
results for the same development activity in similar projects (or previous iterations of
the same project). Therefore, the number of defects per unit of size and per inspection
hour should be analyzed against historical data using a statistical process control
chart. As suggested by [21], the type of the statistical process control chart for those
metrics should be a U-chart, given that defects follow a Poisson distribution. Those
charts can indicate if the defect metrics of the development activity are under control
by applying basic statistical tests. An example U-chart based on real project data for
defects per inspection hour is shown in Figure 4.
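The control limits of a U-chart can be computed directly from the defect counts and the sizes (e.g., inspection hours) of past samples, using the standard formulas ū = Σc/Σn and ū ± 3·sqrt(ū/nᵢ). The sketch below is a generic illustration with invented data; none of the names or numbers come from the paper.

```python
import math

def u_chart(defect_counts, sample_sizes):
    """Center line and per-sample 3-sigma control limits of a U-chart.

    defect_counts[i]: defects found in sample i
    sample_sizes[i]:  size of sample i (e.g., inspection hours or use case points)
    """
    u_bar = sum(defect_counts) / sum(sample_sizes)      # center line
    points = []
    for c, n in zip(defect_counts, sample_sizes):
        u = c / n                                       # defects per unit in this sample
        sigma = math.sqrt(u_bar / n)
        ucl = u_bar + 3 * sigma
        lcl = max(0.0, u_bar - 3 * sigma)
        points.append({"u": u, "lcl": lcl, "ucl": ucl,
                       "in_control": lcl <= u <= ucl})
    return u_bar, points

# Hypothetical data: defects and inspection hours for five inspections.
defects = [9, 12, 7, 15, 10]
hours = [60, 75, 55, 80, 70]
center, points = u_chart(defects, hours)
print(f"center line: {center:.3f} defects per inspection hour")
for i, p in enumerate(points, start=1):
    print(f"inspection {i}: u={p['u']:.3f}, "
          f"limits=[{p['lcl']:.3f}, {p['ucl']:.3f}], in control: {p['in_control']}")
```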
The resulting Pareto chart is shown in Figure 5. It shows that most defects were of type incorrect fact, and that the
sum of incorrect facts and omissions represents about 60% of all defects found.
Find Systematic Errors. This task comprises analyzing the defect sample (reading
the description of the sampled defects) in order to find its systematic errors. Only the
defects related to those systematic errors should be considered in the DCA meeting.
At this point the moderator could receive support from representatives of the
document authors and the inspectors involved in finding the defects.
Identifying Main Causes. This task represents the core of the DPPI approach. Given
the causal model elaborated based on prior DCA meeting results for the same
development activity considering similar projects (by feeding the Bayesian network
with the identified causes for the defect types), the probabilities for causes to lead to
the defect types related to the systematic errors being analyzed can be calculated
(using the Bayesian diagnostic inference).
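For a single cause-to-defect-type relation fed with the outcomes of past DCA meetings, the diagnostic inference P(cause | defect type) amounts to Bayes' rule over the recorded cause/defect-type pairs. The sketch below is our own simplified illustration of that idea; the paper builds the actual networks with the Netica tool, and the record format used here is an assumption.

```python
from collections import Counter

class CauseDefectModel:
    """Toy two-node model: P(cause) and P(defect_type | cause), estimated from
    (cause, defect_type) pairs recorded in past DCA meetings."""

    def __init__(self):
        self.pair_counts = Counter()
        self.cause_counts = Counter()

    def feed(self, meeting_results):
        """meeting_results: iterable of (cause, defect_type) pairs."""
        for cause, defect_type in meeting_results:
            self.pair_counts[(cause, defect_type)] += 1
            self.cause_counts[cause] += 1

    def diagnose(self, defect_type):
        """Diagnostic inference: P(cause | defect_type) via Bayes' rule."""
        total = sum(self.cause_counts.values())
        scores = {}
        for cause, n_cause in self.cause_counts.items():
            prior = n_cause / total
            likelihood = self.pair_counts[(cause, defect_type)] / n_cause
            scores[cause] = prior * likelihood
        norm = sum(scores.values())
        return {c: s / norm for c, s in scores.items()} if norm else {}

# Hypothetical meeting records (cause, defect type):
model = CauseDefectModel()
model.feed([("lack of domain knowledge", "incorrect fact"),
            ("lack of domain knowledge", "incorrect fact"),
            ("size and complexity of the problem", "incorrect fact"),
            ("oversight", "incorrect fact"),
            ("oversight", "omission")])

for cause, p in sorted(model.diagnose("incorrect fact").items(), key=lambda kv: -kv[1]):
    print(f"{cause}: {p:.2f}")
```

Feeding the model again with the causes agreed upon in each new meeting is what closes the feedback cycle described above.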
Afterwards, those probabilities support the DCA meeting team in identifying the main
causes. Therefore, probabilistic cause-effect diagrams for the defect types related to the
analyzed systematic errors can be used. The probabilistic cause-effect diagram was
proposed in [10]. It extends the traditional cause-effect diagram [15] by (i) showing the
probabilities for each possible cause to lead to the analyzed defect type, and (ii)
representing the causes using grey tones, where the darker tones represent causes with
higher probability. This representation can be easily interpreted by causal analysis teams
and highlights the causes with greater probabilities of causing the analyzed defect type.
Figure 6 shows the probabilistic cause-effect diagram for defects of type “incorrect
fact” in a real project. In this diagram it can be seen that, for this project (based on its
prior iterations), typically the causes for incorrect facts are “lack of domain
knowledge” (25%), “size and complexity of the problem” (18.7%), and “oversight”
(15.6%). Figure 7 shows the underlying Bayesian network and its diagnostic inference
for defects of type “incorrect fact” (built using the Netica Application software [22]).
The main hypotheses were related to the fact that showing such probabilistic cause-
effect diagrams could help to effectively use the cause-effect knowledge stored in the
Bayesian network to improve the identification of causes during DCA meetings. In
fact, in the particular experience described in [11], the identified causes for the
analyzed systematic errors were “lack of domain knowledge”, “size and complexity
of the problem”, and “oversight”. Those causes correspond to the three main causes of
the probabilistic cause-effect diagram and were identified by the author, inspectors,
and SEPG members reading the defect descriptions.
Thus, the probabilistic cause-effect diagram showed itself helpful in this case.
Afterwards, experimental investigations indicated that those probabilistic cause-effect
diagrams may in fact increase the effectiveness and reduce the effort of identifying
defect causes during DCA meetings [12].
According to DPPI, one of the outcomes of a meeting is the set of causes identified,
by consensus of the team, for the analyzed defect type. These causes are used to feed
the Bayesian network so that the probabilities of the causes can be dynamically
updated, closing the feedback cycle for the next DCA event. A prototype tool
automating this dynamic feedback cycle was also built [12].
Fig. 7. The Bayesian network inference underlying DPPI's probabilistic cause-effect diagram
for incorrect facts, adapted from [11]
For the proposed actions, the effort of implementing them should be recorded. The effect of such actions on
improving the development activity will be objectively seen in DPPI’s Development
Activity Result Analysis activity (Section 3.1), once data on the defects for the same
activity is collected in the next similar project (or in the next iteration of the same
project) and DPPI is launched again.
The results of the proof of concept and of the set of experimental studies indicated
that DPPI could bring possible benefits to industrial software development practice.
However, at this point there was no real industrial usage data (in the context of a real
software development lifecycle) supporting such claims and allowing further
understanding of the approach's industry readiness. According to Shull et al. [23],
applying a newly proposed technology in an industrial setting of a real software
development lifecycle is important to allow investigating any unforeseen
negative or positive interactions with industrial settings.
Given the potential benefits, Kali Software decided to adopt DPPI in May 2012
in the context of one of its biggest development projects. Kali Software is a small-sized
(~25 employees) Brazilian company that worked on software development
projects for nearly a decade (November 2004 to March 2013), providing services to
customers in different business areas in Brazil and abroad. The following subsections
provide more information on the development project context, the preparation of the
causal models, and the use of DPPI and its obtained results.
documents of the fifth (billing) and sixth (financial) modules. Up to this point DPPI
had only been applied to functional specifications; therefore the focus herein is on the
technical specifications, although some quantitative results from applying it to the
functional specifications are also provided. Details on these modules, including
the number of use cases (#UC), the size in use case points (#UCP), the number of
domain model classes (#DC), the total development effort in hours (Effort), the effort
spent on inspecting the technical specification (Insp. Effort), and the number of
defects found in the technical specification (#Defects) are shown in Table 1.
As mentioned before, the functional and technical specification defects of the first four
modules were analyzed retroactively with representatives of the development team
in order to feed the Bayesian network with causes for the different defect types.
Hereafter, our discussion will be scoped to the technical specifications. Table 2 shows
the ten identified systematic errors for the most frequent technical specification defect
types.
The causes identified for those systematic errors in the retroactive DCA meetings
are shown in Table 3. This data was used to feed the Bayesian network so that it
would be possible to use its inference when applying DPPI to the project (in the fifth
module). Performing such retroactive analyses is not mandatory in order to use DPPI;
without them, however, in the first sessions DPPI's Bayesian network would only be able to
learn from the DCA meeting results, without yet supporting the team in obtaining them.
Table 2. Systematic errors identified for the most frequent defect types
Table 3. Causes identified for the systematic errors of the first four modules
Fig. 8. Defects per inspection hour of the first five project modules
Fig. 9. Defects per use case point of the first five project modules
DCA Preparation. This activity relates to preparing for the DCA meeting by
sampling defects (plotting a Pareto chart) and identifying the systematic errors leading
to several defects of the same type by reading the sampled defects. Figure 10 shows
the Pareto chart. In this chart, it is possible to observe that more than 60% of the
technical specification defects are related to violating GRASP design principles and to
omission. Therefore, defects from these two categories were sampled in order to
identify systematic errors; this was carried out by the moderator during a one-hour
meeting with the first author and one of the two inspectors who found the defects.
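A Pareto-based selection of the defect categories to sample can be automated along the following lines; the counts below are invented for illustration and are not the project's actual data.

```python
def pareto_categories(defect_counts, threshold=0.6):
    """Return the smallest set of defect categories (by descending frequency)
    whose cumulative share of all defects reaches the given threshold."""
    total = sum(defect_counts.values())
    selected, cumulative = [], 0
    for category, count in sorted(defect_counts.items(),
                                  key=lambda kv: kv[1], reverse=True):
        selected.append(category)
        cumulative += count
        if cumulative / total >= threshold:
            break
    return selected, cumulative / total

# Hypothetical distribution of technical specification defect types:
counts = {"GRASP violation": 5, "omission": 5, "incorrect fact": 3,
          "ambiguity": 2, "extraneous information": 1}
categories, share = pareto_categories(counts, threshold=0.6)
print(categories, f"cover {share:.0%} of the defects")
```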
The systematic errors related to the GRASP category were high coupling (3
defects) and using inheritance improperly (2 defects). The systematic error related to
the Omission category was omitting classes and attributes (5 defects). Since the total
amount of defects was low during the meeting the team also read the incorrect facts,
in which another systematic error could be identified: misunderstanding several parts
of the functional specification.
DCA Meeting. In this meeting, the moderator, the first author and a software
engineering process group member identified the main causes for each of the
systematic errors. The probabilistic cause-effect diagram for each defect type was
used to support this task. The Bayesian network built (using the Netica tool) based on
the data provided in Table 3 is shown in Figure 11.
The diagnostic inferences of this network based on the defect types (GRASP violation
and Omission) of the systematic errors to be analyzed are shown in Figure 12. Based on
these inferences it was possible to obtain the probabilistic cause-effect diagrams. For
instance, the diagram to support analyzing the systematic errors associated to the GRASP
violation defects is shown in Figure 13. Hence, the probabilistic cause-effect diagrams
were used to support the DCA meeting team in identifying the causes.
Of course, the data obtained from only four modules is still preliminary and will be
complemented with the results of each new DCA meeting. In this particular case, the diagram of
Figure 13 shows that, in prior projects, Lack of Good Design Practices (36.4%), the
Size and Complexity of the Problem Domain (27.3%), and the Lack of Domain
Knowledge (18.2%) usually caused the GRASP violation defects.
In fact, the identified causes were Lack of Good Design Practices and the Size and
Complexity of the Problem Domain; the latter led to high coupling and inheritance
problems due to difficulties in understanding the entities and their responsibilities.
Fig. 11. The Bayesian network built based on the DCA meetings of the four prior modules
(a) (b)
Fig. 12. Bayesian diagnostic inference for defect types GRASP (a) and Omission (b)
Regarding the systematic error of omitting classes and attributes, the causes were
Requirements Omissions (in the initial versions of the functional specification) and
Inefficient Requirements Management (changes in the requirements not being
communicated properly to the designer).
Once the causes were identified, the meeting continued with action proposals
addressing them. Five actions were proposed to treat the four identified causes: (i)
providing training on GRASP design principles, (ii) publishing a list with common
design defects, (iii) providing training on the problem domain, with an overview of the
whole system and the relation between its concepts, (iv) reinforcing the importance of
avoiding omissions in the functional specifications, and (v) adjusting the requirements
management process to make sure that changes to the requirements are always
communicated to the designer.
After the meeting was finished, identified causes for the defect types were fed into
the Bayesian network, updating the causal model for the development activity and
closing the knowledge learning feedback cycle.
Fig. 14. Defects per inspection hour, including the sixth module
Fig. 15. Defects per use case point, including the sixth module
Those charts can be compared to the quantitative goal established in the prior DPPI
enactment. As expected [7], defect rates were reduced by about 50 percent (from 0.14 to
0.07). This expected defect rate reduction of about 50 percent (in fact, 46%)
also occurred for the functional specification activity. Moreover, for the functional
specifications, analyzing the defects allowed producing a set of nine "improvement
factors" that can be used by the employees for writing better use cases.
The fifth module of this project represented the first time that DPPI was applied
end-to-end to a real project by the project team, allowing defect rate reductions to be
observed in a subsequent iteration.
The total effort of applying DPPI was about 15.5 hours: 1 hour spent by the
moderator in the Development Result Analysis activity, 2 hours spent by each team
member in the DCA Preparation activity (6 hours in total), and 2.5 hours spent by each
team member in the DCA Meeting (7.5 hours in total). Of course, this effort does not
consider the time spent in implementing the action proposals; otherwise the GRASP
training (8 hours), publishing the list of design problems (2 hours), the domain training
(4 hours), and the adjustments to the requirements management process (1 hour)
would have to be accounted for. Nevertheless, comparing this effort to defect rate
reductions of 50 percent, the trade-off is clearly favorable.
5 Conclusions
The DPPI approach resulted from research following an experimental strategy. Its
conceptual phase considered evidence-based guidelines acquired through systematic
reviews and feedback from experts in the field. Afterwards the approach was evolved
based on results of an initial proof of concept and a set of experimental studies.
The results of the proof of concept and of the set of experimental studies indicated
that DPPI could bring possible benefits to industrial software development practice.
However, even though the proof of concept showed the feasibility of using the
approach, it was applied retroactively by the researcher. Hence, it did not allow
observing some interesting aspects, such as the application effort and the results in
terms of defect rate reductions. The set of experimental studies, on the other hand,
allowed evaluating the usefulness of the probabilistic cause-effect diagrams in
identifying causes. Nevertheless, as they were not applied as part of a real development
lifecycle, they were not able to provide further information on the results of
addressing those causes.
In this paper, we extended our research by outlining the complete adopted
experimental strategy, providing an overview of DPPI and presenting additional
results from applying it to a real industrial software development lifecycle. In this
experience, DPPI was successfully applied to different development activities
(functional specification and technical specification), allowing further comprehension
of its industry readiness and objectively measuring the effort and the obtained
benefits. The total application effort was reasonably low (on average 15.5 hours)
when compared to the obtained benefits (reducing defect rates by 46 percent for
functional specifications and by 50 percent for technical specifications).
Although these results were obtained in a specific context, not allowing any
external validity inferences on other types of projects or other industrial contexts, we
believe that applying DPPI to a real software project lifecycle helped to understand
possible benefits and constraints of using the approach from an industrial perspective.
Acknowledgments. We thank David N. Card for his contributions to our research, the
subjects involved in the experimental study, the COPPETEC Foundation, and
Tranship. Without their support this paper would not have been possible. Thanks also
to CAPES and CNPq for financial support.
References
1. Card, D.N.: Defect Analysis: Basic Techniques for Management and Learning. In:
Advances in Computers, ch. 7, vol. 65, pp. 259–295 (2005)
2. Card, D.N.: Defect Causal Analysis Drives Down Error Rates. IEEE Software 10(4), 98–99
(1993)
3. Mays, R.G., Jones, C.L., Holloway, G.J., Studinski, D.P.: Experiences with Defect
Prevention. IBM Systems Journal 29(1), 4–32 (1990)
4. Dangerfield, O., Ambardekar, P., Paluzzi, P., Card, D., Giblin, D.: Defect Causal Analysis:
A Report from the Field. In: Proceedings of International Conference of Software Quality,
American Society for Quality Control (1992)
5. Jalote, P., Agrawal, N.: Using Defect Analysis Feedback for Improving Quality and
Productivity in Iterative Software Development. In: 3rd ICICT, Cairo, pp. 701–713 (2005)
6. Boehm, B., Basili, V.R.: Software Defect Reduction Top 10 List. IEEE Computer 34(1),
135–137 (2001)
7. Kalinowski, M., Card, D.N., Travassos, G.H.: Evidence-Based Guidelines to Defect
Causal Analysis. IEEE Software 29(4), 16–18 (2012)
8. Kalinowski, M., Travassos, G.H., Card, D.N.: Guidance for Efficiently Implementing
Defect Causal Analysis. In: VII Brazilian Symposium on Software Quality (SBQS),
Florianopolis, Brazil, pp. 139–156 (2008)
9. Mafra, S.N., Barcelos, R.F., Travassos, G.H.: Aplicando uma Metodologia Baseada em
Evidência na Definição de Novas Tecnologias de Software. In: Proc. of the XX Brazilian
Symposium on Software Engineering (SBES), Florianopolis, Brazil, pp. 239–254 (2006)
10. Kalinowski, M., Travassos, G.H., Card, D.N.: Towards a Defect Prevention Based Process
Improvement Approach. In: 34th Euromicro Conference on Software Engineering and
Advanced Applications, Parma, Italy, pp. 199–206 (2008)
11. Kalinowski, M., Mendes, E., Card, D.N., Travassos, G.H.: Applying DPPI: A Defect
Causal Analysis Approach Using Bayesian Networks. In: Ali Babar, M., Vierimaa, M.,
Oivo, M. (eds.) PROFES 2010. LNCS, vol. 6156, pp. 92–106. Springer, Heidelberg (2010)
12. Kalinowski, M., Mendes, E., Travassos, G.H.: Automating and Evaluating Probabilistic
Cause-Effect Diagrams to Improve Defect Causal Analysis. In: Caivano, D., Oivo, M.,
Baldassarre, M.T., Visaggio, G. (eds.) PROFES 2011. LNCS, vol. 6759, pp. 232–246.
Springer, Heidelberg (2011)
13. Pai, M., McCulloch, M., Gorman, J.D.: Systematic reviews and meta-analyses: An
illustrated step-by-step guide. National Medical Journal of India 17(2) (2004)
14. Kitchenham, B.A., Charters, S.: Guidelines for Performing Systematic Literature Reviews
in Software Engineering. Technical Report (version 2.3), Keele University (2007)
15. Ishikawa, K.: Guide to Quality Control. Asian Productivity Organization, Tokyo (1976)
16. SEI: CMMI for Development (CMMI-DEV), Version 1.3. CMU/SEI-2010. Pittsburgh,
PA: Software Engineering Institute, Carnegie Mellon University (2010)
17. Kalinowski, M., Spínola, R.O., Dias Neto, A.C., Bott, A., Travassos, G.H.: Inspeções de
Requisitos de Software em Desenvolvimento Incremental: Uma Experiência Prática. In: VI
Brazilian Symposium on Software Quality (SBQS), Porto de Galinhas, Brazil (2007)
18. Kalinowski, M., Travassos, G.H.: A Computational Framework for Supporting Software
Inspections. In: International Conference on Automated Software Engineering (ASE
2004), Linz, Austria, pp. 46–55 (2004)
19. Fagan, M.E.: Design and Code Inspection to Reduce Errors in Program Development.
IBM Systems Journal 15(3), 182–211 (1976)
20. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press (2000)
21. Hong, G., Xie, M., Shanmugan, P.: A Statistical Method for Controlling Software Defect
Detection Process. Computers and Industrial Engineering 37(1-2), 137–140 (1999)
22. Netica Application, http://www.norsys.com/netica.html
23. Shull, F., Carver, J., Travassos, G.H.: An Empirical Methodology for Introducing Software
Processes. In: European Software Engineering Conference, Vienna, Austria, pp. 288–296
(2001)
24. Shull, F.: Developing Techniques for Using Software Documents: A Series of Empirical
Studies. Ph.D. thesis, University of Maryland, College Park (1998)
25. Larman, C.: Applying UML and Patterns: An Introduction to Object-oriented Analysis and
Design and Iterative Development. Prentice Hall (2008)
Business Intelligence in Software Quality
Monitoring: Experiences and Lessons Learnt
from an Industrial Case Study
1 Motivation
The new generation of interconnected and open IT systems in areas like healthcare,
telematics services for traffic monitoring, or energy management is evolving
dynamically: for example, software updates are carried out in rather short
cycles in the presence of a complex technology stack. Tomorrow's IT systems
are subject to stringent quality requirements (e.g., safety, security, or compliance
issues) and evolve continuously; that is, the software is habitually subject to
change. Given this context, monitoring of quality attributes becomes a necessity.
Efficient and effective management of a software product or service thus
requires continuous, tool-supported monitoring of well-defined quality attributes.
The role of software as a driver of innovation in business and society leads to
an increasing industrialization of the entire software life cycle. Thus, the devel-
opment and operation of software nowadays is perceived as a mature engineering
discipline (even in the secondary sector) and the division of labor is driven by
coordinated tools (in the sense of an integrated tool chain). As a consequence,
monitoring of quality attributes has to be carried out holistically, i.e. taking into
account various quality dimensions. This includes the continuous monitoring of
process- (e.g. processing time of for bugs or change requests), resource- (e.g.
alignment of bugs to software engineers) and product-metrics (e.g. code metrics
like test coverage as well as metrics regarding abstract models of the software,
if in place). However, to establish a holistic view on software quality we lack
methods and tools that allow one for an integration of the relevant dimensions
and (key) quality indicators.
To address these issues, as part of the research programme Softnet Austria II,
a web-enabled software dashboard making use of powerful OLAP technology has
been developed. This cockpit strives to provide the technological underpinning
to enable the integration of the various dimensions. In this article we briefly
describe the software architecture of our dashboard and point out experiences
and lessons learnt in the course of a pilot project. We considered issue management,
that is, the treatment of defects and change requests (CRQs), alongside
related parts regarding project management. In this industrial experience
paper we present exhaustive empirical results regarding the running time of the
queries implementing the various quality indicators. We contribute to the state
of the art in the field of quality monitoring in terms of (1) providing experiences
on setting up a software dashboard relying on an open-source BI solution and
(2) a performance analysis regarding the execution times.
In Section 2 we discuss the importance of metrics and quality models and
point out the need for integration of the different views on software quality. In
Section 3 we subsume the main ideas behind OLAP, typical usage scenarios and
operations and the overall architecture of our web-enabled dashboard solution.
In Section 4 we present our industrial case study conducted with a company from
the secondary sector. In the context of the prevailing processes around CRQs
we outline (1) examples of quality metrics, (2) show how to use the OLAP
technology for ad-hoc analysis of an industrial software repository, and (3) provide
empirical insights on the execution times of the various multi-dimensional queries.
In Section 5 we report on lessons learned from our industrial case study, Section
6 discusses related work and Section 7 points out open issues and concludes our
industrial experience paper.
The main idea behind BI is to transform simple data into useful information
that enhances the decision-making process by basing it on broad knowledge
about the company itself and its environment. This minimizes the risks and
uncertainties associated with any decision that a company has to take [17].
Moreover, business intelligence helps to translate the defined objectives
of a company into indicators, with the possibility of analysing them from different
points of view. This transforms the data into information that not only is able
to answer questions about current or past events but also makes it possible to
build models to predict future events [6].
The concepts outlined in the previous sections are typically used in various business
areas for the measurement, analysis, and mining of specific data pools. Our
dashboard primarily serves the purpose of carrying out data-driven research,
particularly addressing the integration of various views on software quality (the
process, product, and resource views). Figure 1 outlines the architecture of the web-
enabled dashboard making use of BI technology. The dashboard uses the open
source tool JPivot [10] as front end. It operates on the open source OLAP
server Mondrian [4]. JPivot allows the interactive composition of MDX (MultiDimensional
eXpressions) queries via a Web interface that also displays the results
in tabular and graphical form. Predefined queries are stored on the server
and are accessible from the web interface. JPivot loads these queries from the query
files or takes them from the MDX editor provided through the web interface;
the OLAP server then parses those queries against the multidimensional cubes and
converts them to SQL queries, which are sent to a MySQL database.
– Group 1: Metrics regarding the inflow states and the elapsed time from
submission of an issue to a particular state (a minimal sketch of such an
elapsed-time computation is given after this list). The roughly 20 performance
indicators from this group concern the decision phase and the implementation
phase. Depending on the concrete metric, drill-down and slicing into 15 to
20 different dimensions (version, milestone, product, status, priority, error
severity, person, role, age, etc.) is required.
– Group 2: Metrics regarding the outflow states and the elapsed time from
submission to a particular state. The roughly 15 performance indicators concern
the review and testing phases and have requirements for ad hoc analysis
similar to those of Group 1.
– Group 3: Metrics regarding the inflow states and the elapsed time in that
state. The 11 performance indicators deal with the elapsed time of a
CRQ in the decision and implementation phases. Some of the performance
indicators are drilled down to around 15 dimensions; often standard aggregates
(maximum, minimum, average) are used.
– Group 4: Metrics regarding the outflow states and the elapsed time in that
state. The 14 performance indicators require ad-hoc analysis in up to 20
dimensions.
– Group 5 and Group 6: These metrics have a high level of complexity, as they
relate states that are not consecutive. The complexity is caused by the fact
that it is necessary to keep track of an issue from its submission
through all its changes in order to compute the desired metrics. Nevertheless,
these performance indicators need to be drilled down to a couple of dimensions.
However, they are less time critical than the metrics of Groups 1-4, as the
quality indicators of Groups 5 and 6 are used only occasionally.
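As an illustration of the kind of computation behind the Group 1-4 indicators (referenced in Group 1 above), the sketch below derives, from a plain list of state transitions, the elapsed time from the submission of each issue to its first arrival at a given state. The transition format and field names are our own assumptions, not the schema used by the dashboard.

```python
from datetime import datetime

def elapsed_to_state(transitions, target_state):
    """transitions: list of (issue_id, timestamp, new_state), with the first
    transition of each issue taken as its submission (state 'N').
    Returns issue_id -> timedelta from submission to first reaching target_state."""
    submitted, reached = {}, {}
    for issue, ts, state in sorted(transitions, key=lambda t: t[1]):
        if issue not in submitted:
            submitted[issue] = ts          # first record: submission
        if state == target_state and issue not in reached:
            reached[issue] = ts - submitted[issue]
    return reached

# Hypothetical issue history:
log = [
    ("CRQ-1", datetime(2013, 1, 7, 9, 0), "N"),
    ("CRQ-1", datetime(2013, 1, 9, 14, 0), "Resolved"),
    ("CRQ-1", datetime(2013, 1, 10, 10, 0), "Verified"),
    ("CRQ-2", datetime(2013, 1, 8, 11, 0), "N"),
    ("CRQ-2", datetime(2013, 1, 15, 16, 30), "Resolved"),
]

for issue, dt in elapsed_to_state(log, "Resolved").items():
    print(issue, dt)
# Aggregates (minimum, maximum, average) per drill-down dimension can then be
# computed on top of such per-issue values.
```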
Using the where clause, one has the possibility to slice the represented cube.
That means that not all the data of the four dimensions will be shown, but only
the data specified through the slicer defined in the where clause: in this case, those
issues which reach the state Resolved (R) and were successfully tested in the
previous step (Verified (V)).
The output given by the application is a table showing the required informa-
tion on the desired rows (Figure 5). Products and Milestone can be drilled down,
since they compose a hierarchy with several levels.
Fig. 4. The three previous dimensions are queried along the years, which constitute the fourth
dimension
characteristics. The reason why this happens can be found in the complexity of
the queries which is caused by the where clause (slicer) of the query.
For the different groups of queries the following execution times have been
collected:
Figure 6 shows the execution times for the queries of Group 1 with different
sets of data. The average response time of all the considered metrics of Group 1
is shown in Figure 7.
Figure 8 shows the execution times for the queries of Group 2 with different
sets of data. Queries 2.7 and 2.8 are excluded since they check the correct use
of the open state and have been implemented for verification purposes.
These queries are not used for continuous monitoring of the process. To ease the
illustration, queries 2.10 and 2.13 are also excluded, as they can be computed
within a second even for a large amount of data. The average of all the quality
indicators of Group 2 is shown in Figure 9.

Fig. 6. Response time regarding the amount of the data for Group 1
Execution times for the queries of Group 3 and 4 look very similar. For Group
5 and Group 6 queries we did not evaluate the response times, because of the
complexity of these queries. However, as mentioned before, these queries are not
used for continuous quality monitoring and are executed only occasionally.
Fig. 8. Response time regarding the amount of the data for Group 2
However, inside these two groups one can also differentiate between the met-
rics which start in state New (’N’, creation of an issue) and go to a non-
terminating state (we refer to a middle state in the following, see Figure 2) and
the metrics which start in a middle state and end in another middle state. Since
the way through the hierarchy is longer for those metrics starting in state New
('N'), the complexity of these metrics is higher. Furthermore, the number of OR
operations in their definition is higher, which increases the execution times further.
Therefore, a comparison between the metrics going from state New ('N') to a middle
state and the metrics going from a middle state to another middle state
can be of interest.
Figure 10 shows the differences in performance between those metrics.
Fig. 10. Performance comparison between initial and middle states queries
6 Related Work
Software cockpits for the interpretation and visualization of data were already
conceptually prepared ten years ago. A reference model for the concepts and definitions
around software cockpits is presented in [15]. Concepts and research prototypes
have been developed in research projects like Soft-pit [14] and Q-Bench [9].
Whereas Soft-pit has the goal of monitoring process metrics [14], the Q-Bench
project is aimed at developing and using an approach to assure the internal
quality of object-oriented software systems by detecting quality defects in the
code and adjusting them automatically.
The authors of [2] report that the integration of data from different analysis
tools gives a comprehensive view of the quality of a project. As one of the
few publications in this field, the authors also deal with the possible impact of
introducing a software dashboard. The authors of [3], [5], [13], [12] report on
experiences and lessons learnt regarding software dashboards.
According to [8], in the context of explicit quality models, a viable qual-
ity model has to be adaptable and the handling of the obtained values (i.e.
a table result) should be easily understandable. Business intelligence systems
with OLAP functionalities support these requirements and are used nowadays
in several fields [20]. Therefore, using BI technology for quality monitoring offers
an excellent foundation for data-driven research in collaboration with companies
[16].
Related to the work presented herein are industrial-strength tools such as Bugzilla
(http://www.bugzilla.org), JIRA (http://www.atlassian.com/software/jira),
Polarion (http://www.polarion.com), Codebeamer (http://www.intland.com),
Swat4j (http://www.codeswat.com), Rational Logiscope
(http://www-01.ibm.com/software/awdtools/logiscope/), and
Sonar (http://www.sonarsource.org), which strive to integrate quality metrics in an
environment of heterogeneous data sources [18]. Some tools (Swat4j or Rational
Logiscope) support the evaluation of quality metrics explicitly taking into account a
quality model. Whereas these tools are very flexible in data storage and representation,
they offer few possibilities for the integration and analysis of different
quality dimensions.
7 Conclusion
Acknowledgement. The work presented herein has been partially carried out
within the competence network Softnet Austria II (www.soft-net.at, COMET
K-Projekt) and funded by the Austrian Federal Ministry of Economy, Family
and Youth (bmwfj), the province of Styria, the Steirische Wirtschaftsförderungsgesellschaft
mbH (SFG), and the city of Vienna in support of the Center for
Innovation and Technology (ZIT). We listed the authors in alphabetical order.
References
8. Deissenboeck, F., Juergens, E., Lochmann, K., Wagner, S.: Software quality mod-
els: Purposes, usage scenarios and requirements. In: ICSE Workshop on Software
Quality, WOSQ 2009, pp. 9–14. IEEE (2009)
9. http://www.qbench.de/
10. JPivot, http://jpivot.sourceforge.net/
11. Lang, S.M., Peischl, B.: Nachhaltiges software management durch lebenszyklus-
übergreifende überwachung von qualitätskennzahlen. In: Tagungsband der Fach-
tagung Software Management, Nachhaltiges Software Management. Deutsche
Gesellschaft für Informatik (November 2012)
12. Larndorfer, S., Ramler, R., Buchwiser, C.: Dashboards, cockpits und projekt-leitstände:
Herausforderung messsysteme für die softwareentwicklung. OBJEKTspektrum 4,
72–77 (2009)
13. Larndorfer, S., Ramler, R., Buchwiser, C.: Experiences and results from estab-
lishing a software cockpit at bmd systemhaus. In: 35th Euromicro Conference on
Software Engineering and Advanced Applications, SEAA 2009, pp. 188–194 (2009)
14. Münch, J., Heidrich, J., Simon, F., Lewerentz, C., Siegmund, B., Bloch, R., Kurpicz,
B., Dehn, M.: Soft-pit - ganzheitliche projekt-leitstände zur ingenieurmässigen
software-projektdurchführung. In: Proceedings of the Status Conference of the Ger-
man Research Program Software Engineering, vol. 70 (2006)
15. Münch, J., Heidrich, J.: Software project control centers: concepts and approaches.
Journal of Systems and Software 70(1), 3–19 (2004)
16. Peischl, B., Lang, S.M.: What can we learn from in-process metrics on issue management?
In: Testing: Academic and Industrial Conference - Practice and Research
Techniques (TAIC PART), IEEE Digital Library (2013)
17. Raisinghani, M.S.: Business intelligence in the digital economy: opportunities, lim-
itations and risks. Idea Group Pub. (2004)
18. Staron, M., Meding, W., Nilsson, C.: A framework for developing measurement
systems and its industrial evaluation. Information and Software Technology 51(4),
721–737 (2009)
19. Torrents, V.R.: Development and optimization of a web-enabled OLAP-based soft-
ware dashboard, Master thesis, Universidad de Alcala (2013)
20. Watson, H.J., Wixom, B.H.: The current state of business intelligence. Com-
puter 40(9), 96–99 (2007)
Dealing with Technical Debt
in Agile Development Projects
Harry M. Sneed
1 Technical Debt
"Technical debt" is a term for the work required to put a piece of software into the
state that it should be in. It may be that this is never done and the software remains in
the substandard state forever, but the debt still remains. In other words, technical debt
is a term for substandard software. It was coined to describe poor-quality software in
terms that business managers can relate to, namely in terms of money: the money
required to fix a problem [1]. Managers should become aware of the fact that the neglect
of software quality costs them money and that these costs can be calculated. The
notion of "debt" should remind them that someday they have to pay it back, or at least
try to reduce it, just as a country should reduce its national debt. The size of the
national debt is an indicator that a national economy is spending more than what it is
producing. It has a negative balance. The size of the technical debt is an indicator that
an organization is producing more software than it can build correctly.
The amount of the national debt can be measured absolutely, in terms of dollars or
euros, and relatively, in relation to the gross national product. The same applies to the
technical debt. It too can be measured absolutely in terms of money required to
renovate the software and relatively in terms of the costs of renovation relative to the
development costs. Just as a country whose national debt exceeds its annual gross
national product is in danger of bankruptcy, a software producer whose technical debt
exceeds its annual development budget is in danger of collapsing. Something must be
done to eliminate or at least to reduce the size of the debt.
The notion of "technical debt" was coined by Ward Cunningham at the OOPSLA
conference in 1992. The original meaning as used by Cunningham was "all the not
quite right code which we postpone making it right" [2]. With this statement he was
referring to the inner quality of the code. Later the term was extended to cover everything
that should belong to a properly developed software system but was purposely left
out to stay on schedule or within budget: system features such as error-handling routines,
exception conditions, security checks, and backup and recovery procedures, as well as
essential documents such as the architecture design, the user guide, the data model
and the updated requirement specification. All of these security and emergency
features can be left out by the developer, and the users will never notice it until a
problem comes up. Should someone inquire about them, the developers can always
say that these features were postponed to a later release. In an agile development they
would say that they were put in the backlog. In his book "Managing Software Debt
– Building for Inevitable Change", Sterling describes how this debt accumulates – one
postponement at a time, one compromise after another [3].
Many compromises are made in the course of software development. Each one
is in itself not much of a problem, but together the compromises add up to a tremendous burden
on the product. The longer the missing or poorly implemented features are pushed off, the
less likely it is that they will ever be added or corrected. Someone has to be responsible for
the quality of the product under construction. It would be the job of the testers in the team
to insist that these problems be tended to before they get out of hand. That means that the
testers not only test but also inspect the code and review the documents. Missing security
checks and exception handling are just as important as incorrect results. For that the testers
must be in a position to recognize what is missing in the code and what has been coded
poorly. This will not come out of testing alone: testing only shows what has been
implemented, not what should have been implemented. What is not there cannot be
tested, nor will testing show how the code has been implemented. Testers must have
both the opportunity and the ability to look into the code and see what is missing. Otherwise
they would have to test each and every exception condition, security threat and incorrect
data state, which would require far too much testing effort. For this reason testers should
also have good knowledge of the programming language and be able to recognize what
is missing in the code. Missing technical features make up a good part of the technical
debt [4].
The other part of the technical debt is the poor quality of the code. This is what
Ward Cunningham meant when he coined the term. In the heat of development
developers make compromises in regard to the architecture and the implementation of
the code in order to get on with the job, or simply because they do not know better.
Instead of adding new classes, they embed additional code into existing classes
because that is simpler. Instead of calling new methods, they nest the code deeper with
more if statements. Instead of using polymorphic methods, they use switch statements
to distinguish between variations of the same function. Instead of thinking through the
class hierarchy, they tend to clone classes, thus creating redundant code. There is no
end to the possibilities developers have to ruin the code, and under pressure, as is the
case with sprints in Scrum, they all too often leave none out. In
their book on "Improving the Design of Existing Code", Fowler and Beck identify 22
code deficiencies, which they term bad smells, or candidates for refactoring [5].
These smells include such bad practices as duplicated code, long methods, large classes, long
parameter lists, divergent change, shotgun surgery and feature envy. In a contribution
to the Journal of Software Maintenance and Evolution entitled "Code Bad Smells
– a Review of Current Knowledge", Zhang, Hall and Baddoo review the literature on
bad smells and their effects on maintenance costs [6]. The effects of bad smells range
from decreasing readability to causing an entire system to be rewritten. In any case
they cause additional costs and are often the source of runtime errors which cost even
more. The fact is undisputed that certain bad coding habits drive up maintenance
costs. Unfortunately these costs are not obvious to the user. They do not immediately
affect runtime behavior. The naïve product owner in an agile development project will
not recognize what is taking place in the code. However, if nothing is done to clean
up the code, the technical debt grows from release to release (see Fig. 1).
Fig. 1. Accumulating technical debt
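To make the kind of compromise meant here concrete, the following short Java fragment (a hypothetical example, not taken from any of the systems discussed) shows the "switch instead of polymorphism" smell mentioned above:

// Hypothetical example of the "switch instead of polymorphism" smell:
// every new customer type forces another case to be added here and in every
// other switch over the same type code scattered through the code base.
public class ShippingCostCalculator {

    public double costFor(String customerType, double orderValue) {
        switch (customerType) {
            case "RETAIL":    return orderValue * 0.10;
            case "WHOLESALE": return orderValue * 0.05;
            case "INTERNAL":  return 0.0;
            default:
                throw new IllegalArgumentException("Unknown customer type: " + customerType);
        }
    }
}

A refactored design would give each customer type its own class with its own shipping-cost method, so that adding a variant no longer means editing already tested code.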
It has always been known that poor coding practices and missing features cause higher
maintenance costs but no one has ever calculated the exact costs. This has now been done
by Bill Curtis and his colleagues from CAST Software Limited in Texas. Their suggestion
for computing the costs of bad quality joins each deficiency type with the effort required to
remove that deficiency. The basis for their calculation is an experience database from
which the average cost of removing each deficiency type is calculated. The data from
scores of refactoring projects has been accumulated in this database [7]. The refactoring
and retesting of a single overly complex method can cost half a day; at an
hourly rate of US $70, that amounts to $280. Fixing a missing exception
handler may cost only an hour, but if five hundred such handlers are missing the cost
will amount to some $35,000. Adding a missing security check can cost up to three
person-days, or $1,600. The main contribution of Curtis and his colleagues is that they have
identified and classified more than a hundred software problem types and put a price tag
on them. Samples of their problem types are:
• Built-in SQL queries (subject to hacking)
• Empty Catch Blocks (no exception handling)
• Overly complex conditions (error prone)
• Missing comments (detracts from comprehension)
• Redundant Code (swells the code amount and leads to maintenance errors)
• Non-uniform variable naming (making the code hard to understand)
• Loops without an emergency brake (can cause system crash)
• Too deeply nested statements (are difficult to change).
The total cost of such deficiencies is the sum of the costs for each deficiency type.
The cost of each deficiency type is the number of occurrences of that deficiency times
the average effort required to remove and retest it, as taken from the
experience database (see Fig. 2).
Fig. 2. Code deficiencies
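The calculation can be sketched in a few lines of Java; the deficiency names, occurrence counts and average removal efforts below are illustrative placeholders, not figures from the CAST experience database:

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the debt calculation described above: for every deficiency type,
// multiply the number of occurrences by the average removal-and-retest effort
// (hypothetical values standing in for the experience database) and by the hourly rate.
public class TechnicalDebtEstimate {

    static final double HOURLY_RATE_USD = 70.0;   // the hourly rate used in the example above

    public static void main(String[] args) {
        Map<String, Integer> occurrences = new LinkedHashMap<>();   // findings of the code audit
        occurrences.put("Missing exception handling", 500);
        occurrences.put("Overly complex method", 12);
        occurrences.put("Embedded SQL query", 30);

        Map<String, Double> avgEffortHours = new LinkedHashMap<>(); // average effort per removal
        avgEffortHours.put("Missing exception handling", 1.0);
        avgEffortHours.put("Overly complex method", 4.0);
        avgEffortHours.put("Embedded SQL query", 3.0);

        double totalDebt = 0.0;
        for (Map.Entry<String, Integer> entry : occurrences.entrySet()) {
            double cost = entry.getValue() * avgEffortHours.get(entry.getKey()) * HOURLY_RATE_USD;
            System.out.printf("%-30s %10.2f USD%n", entry.getKey(), cost);
            totalDebt += cost;
        }
        System.out.printf("Estimated technical debt: %.2f USD%n", totalDebt);
    }
}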
To the costs of these statically recognizable deficiencies must be added the costs of
removing errors which come up during operations. Due to time constraints not all
errors will be fixed before release. These errors may lead later to interruptions in
production if they are not corrected. Other errors may lead to false results which must
be manually corrected. Bad code can lead to bad performance and that too will
become annoying to the users. The users may accept living with these inconveniences
for a while but eventually they will insist that they be fixed. The cost of fixing them is
not trivial. Together with the code smells and the missing features this adds up to a
significant debt. Bill Curtis estimates that the median debt of an agile project is $ 3.61
per statement [8].
That is the absolute measure for debt. It is also possible to compute the relative
debt. That is the cost of renovating the software relative to the cost of development.
Relative Debt = Renovation Cost / Development Cost
It could be that the cost of renovation is almost as high as the development cost. In
that case the software should be abandoned and rewritten.
In their endeavor to give the user a functioning software product in the shortest
possible time, developers are inclined to take shortcuts. They leave one or another
feature out with the good intention of adding it later. A typical example is error
handling. If a remote function is called on another server, the developer should
know that he will not always get an answer: the remote server may be
overloaded or out of operation. This applies particularly to web services. For
this reason, he should always include an exception handler that monitors the elapsed time and,
when a given time limit is exceeded, invokes an error-handling routine. The same applies to
accessing files and databases as well as to all external devices like printers and
displays. In addition, a cautious developer will always check the returned results to
make sure that they are at least plausible. The result of this defensive programming is
a mass of additional code. Experts estimate that at least 50% of the code should be for
handling exception conditions and erroneous states if the developer is really
concerned about reliability [9].
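As a minimal sketch of this defensive style, the following Java fragment guards a remote call with explicit timeouts and routes failures to an error-handling routine; the service URL, the timeout values and the fallback behavior are hypothetical:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Defensive handling of a call to a remote service: set explicit timeouts,
// catch the failure and invoke an error-handling routine instead of letting
// the transaction hang or crash.
public class RemoteServiceClient {

    public String fetchCustomerData(String customerId) {
        try {
            URL url = new URL("https://remote.example.org/customers/" + customerId);
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setConnectTimeout(2000);   // give up connecting after 2 seconds
            con.setReadTimeout(5000);      // give up waiting for the answer after 5 seconds

            if (con.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return handleServiceError("HTTP " + con.getResponseCode());
            }
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }
            return body.toString();
        } catch (IOException e) {            // covers timeouts and unreachable servers
            return handleServiceError(e.getMessage());
        }
    }

    private String handleServiceError(String reason) {
        // A real system would log the incident and return a defined fallback value.
        System.err.println("Remote service unavailable: " + reason);
        return "";
    }
}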
By disregarding reliability problems the developer can save half of his effort for
coding and testing, since the error handling he does not code will also not have to be
tested. The temptation to do this is very great. The agile developer knows that the user
representative will not notice what goes on behind the user interface; the code does
not interest him. Often it takes a tremendous load or an extreme situation before an
exception condition is triggered. Other exceptions come up only
in certain cases. Even those exceptions that do come up in test, leading to short
interruptions, will not be immediately registered by the user, who is too busy dealing
with other things. Therefore, if an on-time release is the overriding goal, it is all too
tempting to leave the exception handling out. As John Shore, the manager of software
development at the Naval Research Lab once pointed out, programmers are only poor
sinners, who, under pressure, cannot withstand the temptation to cheat, especially if
they know they will probably not be caught [10]. The developer has the good
intention of building in the missing code later when he has more time. Unfortunately,
this “later” never comes. In the end it is forgotten until one day a big crash takes place
in production. The system goes down because an exception occurs and is not handled
properly. The path to hell is paved with good intentions.
Tilo Linz and his coauthors also comment on this phenomenon in their book on
testing in Scrum projects [11]. There they note that the goal of finishing a release in a
fully transparent environment, with an immovable deadline and a prescribed time box,
can place a tremendous mental burden on the developers. Theoretically, according to
the Scrum philosophy they should be able to adjust the scope of the functionality to
their time and their abilities. In practice, this is most often not possible. Instead of
cutting back on functionality, the team cuts back on quality. Discipline is sacrificed to
the benefit of velocity. The functionality is not properly specified, test cases that
should be automated are executed manually, necessary refactoring measures are
pushed off. The software degrades to a point where it can no longer be evolved. The
team is sucked under by a downward spiral and in the end the project is discarded.
The situation with security is similar. Secure code is code that protects itself
against all potential intrusions, but achieving that means more code. A secure method
or function checks all incoming calls before it accepts them. Every parameter value
should be checked to make sure it is not corrupt. The developer should also confirm
that the number and types of the arguments match what is expected, so that no
additional parameters are added to the list; these could be hooks to gain access to
internal data. The values of the incoming parameters should be checked against
predefined value ranges to prevent corrupted data from getting into the
software component. To make sure that only authorized clients are calling, the
called function can require an identification key with each call. In Java, incoming
objects should be cloned, because otherwise they can be mutated by the caller to exploit race
conditions in the method. Methods which are declared public for testing purposes
should later be changed to private or protected to restrict access rights, and classes
should be finalized when they are finished to protect their byte code from being
overwritten [12]. Embedded SQL statements are particularly vulnerable to attack
since they can be easily manipulated at runtime. They should be moved out into a
separate access layer, but that entails extra classes which also have to be tested
[13]. Thus, there are many things a developer can do to make his code more secure, if
he takes the time to do it. But here too, the developer can rest assured that nobody will
notice the missing security checks unless an explicit security test is made. So
security is something that the developer can easily postpone. The users will only
become aware of the problem when one day their customer data is stolen. Then it will
be too late.
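The following sketch shows two of the measures discussed above in Java: checking an incoming parameter against a predefined value range and keeping embedded SQL safe from injection by using a parameterized query. The table, column and value range are hypothetical:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Defensively written data access: the incoming value is checked against a
// predefined range before use, and the SQL statement is parameterized so that
// the caller cannot inject SQL fragments at runtime.
public final class AccountRepository {

    private final Connection connection;

    public AccountRepository(Connection connection) {
        this.connection = connection;
    }

    public double balanceFor(int accountNumber) throws SQLException {
        if (accountNumber < 100000 || accountNumber > 999999) {   // hypothetical valid range
            throw new IllegalArgumentException("Account number out of range: " + accountNumber);
        }
        String sql = "SELECT balance FROM accounts WHERE account_no = ?";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setInt(1, accountNumber);
            try (ResultSet rs = stmt.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("No such account: " + accountNumber);
                }
                return rs.getDouble("balance");
            }
        }
    }
}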
That is the problem with software. Users cannot easily distinguish between
prototypes and products. A piece of software appears to be complete and functioning,
even when it is not. As long as the user avoids the pitfalls, everything seems to be OK.
Users have no way of perceiving what is missing; for that they would either have to
test every possible usage or dig into the code. The question of whether a
piece of software is done or not cannot be answered by the user, since he has no clue
what "done" really means. The many potential dangers lurking in the code cannot
be recognized by a naive user, so it is absurd to expect that of him. He needs
an agent acting on his behalf who can tell him whether the software is really done or not. This
agent is the tester, or testers, in the development team [14].
• erroneous code = statements that may cause an error when the code is
executed
• deficient code = statements that reduce the overall quality of the code,
maintainability, portability, testability, etc.
• cloned code, i.e., code sections that are almost identical except for minor variances
• variables declared but not used
• methods implemented but never called
• dead code or statements which can never be reached
Many of the security threats in the code can also be detected through automated
static analysis. Jslint from Cigital is representative of the tools which perform such a
security analysis. It enforces 12 rules for secure Java code. These rules are:
1) Guard against the allocation of uninitialized objects
2) Limit access to classes and their members
3) Declare all classes and methods as final
4) Prevent new classes from being added to finished packages
5) Forbid the nesting of classes
6) Avoid the signing of code
7) Put all of the code in one jar or archive file
8) Define the classes as uncloneable
9) Make the classes unserializeable
10) Make the classes undeserializeable
11) Forbid comparing classes by name
12) Avoid the hardwiring of sensitive data in the code
Jslint scans the code for infringement of these rules. All of the rule violations can
be recognized except for the last one. Here it is not possible to distinguish between
sensitive and non-sensitive data. Therefore it is better not to use hard-wired text in the
code at all; if it is used, it can at least be easily recognized. In some cases Jslint can even
correct the code, for instance by adjusting the access modifiers, adding the final
clause and removing code signatures. The automated correction of code can however
be dangerous and cause undesired side effects. Therefore, it is better to use automated
tools to detect the potential problems and to manually eliminate them [16].
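To give an impression of what several of these rules look like in code, here is a hypothetical Java class hardened along those lines (final class, private members, uncloneable, neither serializable nor deserializable); it is a sketch, not output of Jslint:

import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical class hardened according to several of the rules listed above.
public final class SessionToken {                          // rule 3: declared final

    private final String value;                            // rule 2: access limited to the class

    public SessionToken(String value) {
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException("Token value must not be empty");
        }
        this.value = value;
    }

    public String value() {
        return value;
    }

    @Override
    protected final Object clone() throws CloneNotSupportedException {
        throw new CloneNotSupportedException();              // rule 8: uncloneable
    }

    private void writeObject(ObjectOutputStream out) throws NotSerializableException {
        throw new NotSerializableException("SessionToken");  // rule 9: not serializable
    }

    private void readObject(ObjectInputStream in) throws NotSerializableException {
        throw new NotSerializableException("SessionToken");  // rule 10: not deserializable
    }
}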
A study was performed at North Carolina State University on over 3 million
lines of C++ code from the Nortel Networks Corporation to demonstrate how
effective automated static analysis can be. Using the tool "FlexLint", the researchers
were able to detect more than 136,000 problems, or one problem for every 22
code statements. Even though this code had been in operation for over three years,
there remained a significant technical debt. The problems uncovered by the automated
static analysis also correlated well with the defects discovered during prerelease
testing, with an accuracy of 83%. That implies that 83% of the erroneous modules
could have been identified already through static analysis.
Of the defects uncovered in the static analysis, 20% were critical, 39% were major
and 41% were minor. The coding standard violations were not included in this count.
Based on their findings, the authors of this study came to the following conclusions:
• the defect detection rate of automated static analysis is not significantly
different from that of manual inspections
• the defect density recorded by automated static analysis correlates highly
with the defect density recorded in prerelease testing
• the cost of automated static analysis is an order of magnitude lower – under 10% –
than the cost incurred by manual inspection and prerelease testing
Thus, automated static analysis can be considered an effective and economic
means of detecting problematic code modules [17].
One of the areas addressed by the static analysis study at North Carolina State was
that of security vulnerabilities. The authors of the study paid particular attention to
code constructs that have the potential to cause security vulnerabilities if proper
protection mechanisms are not in place. Altogether, ten types of security vulnerability were
identified by the automated analysis tool:
• Use of Null pointers
• Access out of bound references, i.e. potential buffer overflow
• Suspicious use of malformed data
• Type mismatch in which statements
• Passing a null pointer to another function
• Failure to check return results
• Division by zero
• Null pointer dereference
• Unrecognizable formats due to malformed data
• Wrong output messages
The number of false positives in the check of these potential security violations
was higher than average; however, the results indicate that automated static analysis
can also be used to find coding deficiencies that have the potential to cause security
vulnerabilities. The conclusion of this research is that static analysis is the best
instrument for detecting and measuring technical debt [18].
The main contribution of agile testing is to establish rapid feedback from testing to development.
This is the rationale for having the testers in the team. When it comes to preventing quality
degradation and growing debt, the testers in an agile project have more to do than just
test. They are there to assure the quality of the product at all levels during construction.
This should be accomplished through a number of control measures, such as
reviewing the stories, transforming the stories into testable specifications, reviewing the
architectural design, inspecting the code, validating the unit tests, performing continuous
integration testing and running an acceptance test with the user. On top of that, there still
have to be special performance, load and security tests, as well as the usability and
reliability tests. If the product under construction is not just a trivial temporary solution,
then those special tests should be made by a professional testing team which works
independently of the agile development teams. This solution is referred to by Linz and his
coauthors as a "system test team", whose job it is to shake down the systems delivered by
the Scrum teams independently of the schedule imposed on the project. That means that
the releases of the agile development team are really only prototypes and should be treated
as such. The final product comes only after the prototypes have gone through all of the
special tests. The final auditing of the code also belongs to these special tests [19] (see Fig. 3).
Fig. 3. The role of the tester
The review of the stories is concerned with discussing and analyzing the stories
with a view to enhancing and improving them. They should also be checked for
testability. The user representative might overlook something or fail to
describe a function adequately. It is up to the testers in the team to point this out and
to clarify the missing and unclear issues with him. Special attention should be paid by
the testers to the non-functional aspects of the stories such as security and usability. In
any case the stories need to be cleaned up before the developers start to implement them.
The testers need stories they can interpret and test.
In order to establish a testing baseline, the testers should convert the informally
defined user stories into an at least semi-formal requirement specification. As noted
above, a significant portion of the technical debt is caused by missing functions, but
who is to know that they are missing if they are not specified anywhere? All of the
exception conditions, security checks and other quality-assuring functions of a system
should be noted down as non-functional requirements or as business rules. They
should be specified in such a way as to be recognizable, for instance with keywords,
and to be convertible into test cases. For every business rule and non-functional
requirement at least one test case should be generated to ensure that the possible
exceptions and security threats are tested. In this way the requirements can be used as
an oracle for measuring test coverage [20].
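As an illustration of how a business rule can be turned into test cases, the following JUnit sketch assumes a hypothetical rule ("an order above 10,000 EUR may only be submitted if a supervisor has approved it") and a minimal stand-in for the production code; the names and the limit are invented for the example:

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

// Hypothetical business rule BR-17: an order above 10,000 EUR may only be
// submitted if it has been approved by a supervisor. Writing the rule down in
// a testable form yields at least one test case per possible outcome.
public class OrderApprovalRuleTest {

    // Minimal stand-in for the production code under test (hypothetical).
    static final class OrderApprovalRule {
        static final double APPROVAL_LIMIT_EUR = 10000.0;

        boolean maySubmit(double amountEur, boolean approvedBySupervisor) {
            return amountEur <= APPROVAL_LIMIT_EUR || approvedBySupervisor;
        }
    }

    private final OrderApprovalRule rule = new OrderApprovalRule();

    @Test
    public void orderAboveLimitWithoutApprovalIsRejected() {
        assertFalse(rule.maySubmit(12000.0, false));
    }

    @Test
    public void orderAboveLimitWithApprovalIsAccepted() {
        assertTrue(rule.maySubmit(12000.0, true));
    }

    @Test
    public void orderAtTheLimitNeedsNoApproval() {
        assertTrue(rule.maySubmit(10000.0, false));
    }
}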
In checking the code for conformity with the coding rules, testers should also check
its format and readability. Almost all of the missing exception handling, security
measures and data validation checks can be recognized in the code. This is the
purpose of an automated code audit. It only takes a few minutes of the tester’s time to
set up and run such an audit. More time is needed to explain the reported deficiencies
to the developers and to convince them that they should clean them up. Of course the
tester cannot force the developers to do anything, but he can post the deficiencies on
the project white board and try to convince the team members that it is to their benefit
to have these coding issues settled, if not in the next release then sometime in a future
release. In no case should he let the team forget them. A good way of reminding the
team is to post a chart of the accrued technical debt in the project room for
everyone to see, similar to the clock in Times Square which constantly reminds Americans
of the current state of their national debt. Just as in New York, where many
American citizens simply shrug their shoulders and say "so what", project members
who could not care less about the final quality of their product will ignore the issue of
technical debt, but at least it will remind them of their sins.
By controlling the results of the unit tests, the testers can point out to the
developers what they have missed and what they can do to improve their test. In
checking through the trace and coverage reports, the tester can identify what functions
have not been adequately tested and can suggest to the developer additional test cases
required to traverse those hidden corners of the code. Here too, he must convince the
developer that it is to his benefit to filter out the defects as early as possible. Pushing
them off to the acceptance test only increases the cost of correcting them. Every
defect removed in unit testing is one defect less to deal with in integration and
acceptance testing [21].
The testers in an agile development team must strive to be no more than one or at most
two days behind the developers [22]. No later than two days after turning over a
component or class, the developer should know what he has done wrong. "Continuous
integration" makes this possible [23]. The tester should maintain an automated
integration test frame into which he can insert the new components. The existing
components will already be there. Every day new or corrected components are taken
over into the build. In the case of new functions, the test scripts and test data have to
be extended. Not only will the new functions be tested, but the test of the old
functions will be repeated. The setting up of such an automated regression test may
take days, but after that the execution of each additional test should take only a few
hours. The goal is to keep a steady flow of testing. The actual data results should be
Fig. 4.
5 What Is “Done”?
At the Belgium Testing Days conference in 2012, Johanna Rothman and Lisa
Crispin, two recognized experts on agile testing, discussed what is meant by "done"
[27]. In the opinion of Rothman this is something that must be decided upon by the
team as a whole. It is, however, up to the testers to trigger the discussion and to lead it
to a conclusion. The testers should feed the discussion with their experience from
previous projects and steer it in a positive direction. Rothman proclaims: "you have
to get the team thinking about what is done. Does it mean partially done, as in it is
ready for testing, or fully done, as in it is ready for release?" A minimum quality line
is required in order to make an intermediate release. If the users know they are
working with a prototype, they are more likely to be error tolerant. If they know this
release will be the final product, they will be stricter. In the end the users must decide
what is done. But since users only have a restricted view of the system, they need
others to help them in passing final judgment on the state of the software. Those
others are the testers, who are prepared to assess the system on behalf of the potential
users. It is all too easy just to declare a product as being "done". The lead tester
should, in any case, be involved in the discussion. The decision as to whether
something is “done” or not is, in the end, a political decision, which must be
considered from several points of view.
Developers tend to become impatient and to push the product through in any case,
even if it is falling apart. They will always argue that the quality is sufficient. Testers
will, on the other hand, claim that the quality level is not high enough. They will
argue for more time to test. Otherwise the quality problems will only be pushed off on
the maintenance team. Rothman suggests using Kanban progress charts to
show the current quality state of each and every component as well as the product
under development. This way everyone in the project can see where they are, relative
to the goals they have set for themselves. In effect, the project should produce two
reports, one on the functionality and one on the quality (see Fig. 5).
The functional state of a project is easier to assess than the qualitative state. It is
visible whether a function is performing or not. The quality state is not so visible. To
see what security and data checks are missing, the tester has to dig into the code. To
see how many rules are violated he has to have the code audited. There is no easy way
to determine how many defects remain. This can only be known after every last
function has been tested for all variations. That is why one must often guess when
deciding on the state of product quality. Perhaps the best indicators of product quality
are the number of code deficiencies relative to the number of code statements and the
number of defects detected so far relative to the test coverage. For both metrics there
are benchmark levels which the testers can discuss with the user and the other team
members [28].
Fig. 5. Measuring technical debt
Johanna Rothman argues that testers must be involved in the discussion as to what is
"done" from the beginning of the project on. "To be done also means that the quality
criteria set by the team are met". For that, these criteria have to be accepted and practiced by
all team members. Everyone in the team must be fully aware of his or her responsibility
for quality and must be dedicated to improving it. Rothman summarizes: "Everybody in
the team needs to take responsibility for quality and for keeping technical debt at a
manageable level. The whole team has to make a meaningful commitment to quality". It is
true that quality is a team responsibility but the testers have a particular role to play. They
should keep track of the technical debt and keep it visible to the other team members [29].
Lisa Crispin points out that software quality is the ultimate measure of the success
of agile development [30]. Functional progress should not be made by sacrificing
quality. After each release, i.e. every 2 to 4 weeks, the quality should be assessed. If it
falls below the quality line, then there has to be a special release devoted strictly to
the improvement of quality, even if it means putting important functions on ice. In
this release the code is refactored and the defects and deficiencies removed. Crispin
even suggests having a separate quality assurance team working in parallel with the
development team to monitor the quality of the software and report to the
developers. This would mean going back to the old division of labor between
development and test, and to the problems associated with that. The fact is that there
is no one way of ensuring quality and preventing the accumulation of technical debt.
Those responsible for software development projects must decide on a case-by-case
basis which way to go. The right solution is, as always, context dependent.
References
1. Kruchten, P., Nord, R.: Technical Debt – from Metaphor to Theory and Practice. IEEE
Software, S.18 (December 2012)
2. Cunningham, W.: The WyCash Portfolio Management System. In: Proc. of ACM Object-
Oriented Programming Systems, Languages and Applications, OOPSLA, New Orleans,
p. 29 (1992)
3. Sterling, C.: Managing Software Debt – Building for Inevitable Change. Addison-Wesley
(2011)
4. Lim, E., Taksande, N., Seaman, C.: A Balancing Act – What Practitioners say about
Technical Debt. IEEE Software, S.22 (December 2012)
5. Fowler, M., Beck, K.: Improving the Design of existing Code. Addison-Wesley, Boston
(2011)
6. Zhang, M., Hall, T., Baddoo, M.: Code Smells – A Review of Current Knowledge. Journal
of Software Maintenance and Evolution 23(3), 179 (2011)
7. Curtis, B., Sappidi, J., Szynkarski, A.: Estimating the Principal of an Application’s
Technical Debt. IEEE Software, S. 34 (December 2012)
8. Wendehost, T.: Der Source Code birgt eine Kostenfalle. Computerwoche 10, S.34 (2013)
9. Nakajo, T.: A Case History Analysis of Software Error Cause and Effect Relationships.
IEEE Trans. on S.E. 17(8), S.830 (1991)
10. Shore, J.: Why I never met a Programmer that I could trust. ACM Software Engineering
Notes 5(1) (January 1979)
11. Linz, T.A.O.: Testing in Scrum Projects, p. 177. Dpunkt Verlag, Heidelberg
12. Lai, C.: Java Insecurity – Accounting for Subtleties that can Compromise Code. IEEE
Software Magazine 13 (January 2008)
13. Merlo, E., Letrte, D., Antoniol, G.: Automated Protection of PHP Applications against
SQL Injection Attacks. In: IEEE Proc. 11th European Conference on Software
Maintenance and Reengineering (CSMR 2007), Amsterdam, p. 191 (March 2007)
14. Crispin, L., Gregory, J.: Agile Testing – A practical Guide for Testers and agile Teams.
Addison-Wesley-Longman, Amsterdam (2009)
15. Ayewah, N., Pugh, W.: Using Static Analysis to find Bugs. IEEE Software Magazine, 22
(September 2008)
16. Viega, J., McGraw, G., Felten, E.: Statically Scanning Java Code – Finding Security
Vulnerabilities. IEEE Software Magazine, 68 (September 2000)
17. Zheng, J., Williams, L., Nagappan, N., Snipes, W.: On the Value of Static Analysis for
Fault Detection in Software. IEEE Trans. on S.E. 32(4), 240 (2006)
18. Gueheneu, Y.-G., Moha, N., Duchien, L.: DÉCOR – A Method for the Detection of Code
and Design Smells. IEEE Trans. on S.E 36(1), 20 (2010)
19. Linz, T.A.O.: Testing in Scrum Projects, p. 179. Dpunkt Verlag, Heidelberg
20. Sneed, H., Baumgartner, M.: Value-driven Testing – The economics of software testing.
In: Proc. of 7th Conquest Conference, Nürnberg, p. 17 (September 2009)
21. Marré, M., Bertolino, A.: Using Spanning Sets for Coverage Measurement. IEEE Trans. on
S.E. 29(11), 974 (2003)
22. Bloch, U.: Wenn Integration mit Agilität nicht Schritt hält. Computerwoche 24, S. 22
(2011)
23. Duvall, P., Matyas, S., Glover, A.: Continuous Integration – Improving Software Quality
and reducing Risk. Addison-Wesley, Reading (2007)
24. Humble, J., Farley, D.: Continuous Delivery. Addison-Wesley, Boston (2011)
25. Cockburn, A.: Agile Software Development. Addison-Wesley, Reading (2002)
26. Bavani, R.: Distributed Agile Testing and Technical Debt. IEEE Software, S.28
(December 2012)
27. Rothman, J.: Do you know what fracture-it is? Blog (June 2, 2011),
http://www.jrothman.com
28. Letouzey, J.L., Ilkiewicz, M.: Managing Technical Debt. IEEE Software Magazine, 44
(December 2012)
29. Rothman, J.: Buy Now, Pay Later. In: Proc. of Belgium Testing Days, Brussels (March
2012)
30. Crispin, L., House, T.: Testing Extreme Programming. Addison-Wesley-Longmann,
Reading (2002)
Statistical Analysis of Requirements Prioritization
for Transition to Web Technologies:
A Case Study in an Electric Power Organization
1 Introduction
The paper presents a quantitative case study related to requirements prioritization by
various stakeholders in different roles. Although the study concerns a specific
organization, the problem of transitioning an existing critical system to modern Web
technologies is general and quite common, while the research questions and the
methodologies applied in the paper can be used to address a wide range of similar
problems in other domains. In the following subsections we provide a detailed
introduction to the background and the objectives of the paper, we state the research
questions and we briefly mention the methods followed (these are detailed in later
sections).
1.1 Background
Nowadays, the energy environment is rapidly changing. Even for monopoly activities,
like transmission and distribution of electric power, an Information Technology (IT)
The new ongoing project aims to convert the front-end application to modern Web
technologies for the parts of the application related to customer service and network
operation. The modernization of the application architecture to advanced Web
technologies facilitates the system's integration with other available systems. Such a
transition was decided, on the basis of a benefit-cost analysis, to be especially
advantageous for the organization and the customers, due to the reasonable cost, the
low risk of implementation and also the ease of training.
This improvement in IT processes optimizes the planning of the work effort and
improves the system quality, the quality of services and user satisfaction. The
ultimate goal of the company is to proceed to enterprise system integration as soon as
possible. Another advantage of this project is the fact that it is of medium scale, so a
small, skillful and flexible team is capable of completing it on schedule.
Furthermore, the generated knowledge remains in-house and the company retains its
independence from external partners specialized in IT systems.
The main goals for the new deployment are:
• high-quality services;
• reduction of process complexity;
• improvement of the communication between departments;
• reduction of human effort;
• improvement of data storage and management;
• minimization of failure rates due to overtime;
• interoperability;
• scalability.
To reach these goals, the deployment is based on the existing IT infrastructure,
using a carefully designed enterprise system integration plan. As we already
mentioned, such a transition involves various groups of employees with different
relationships to the company. Discussions with these stakeholders indicated that their
prioritizations of requirements regarding the software under development and the
services supported differ. However, the origins of the different views are not always
clear. Divergences may originate from their different responsibilities and authorities,
the working environment, even from individual cultural, educational and personality
characteristics. On the other hand, decision makers have to make strategic decisions
based not only on technical specifications, but also on the need to achieve the
maximum possible stakeholders’ satisfaction.
The specific research questions posed in this paper were related to the study of
prioritizations of stakeholders in different roles. Hence, the first question was:
The findings are interesting since they revealed that prioritization attitudes are able
to discriminate between major groups of stakeholders. Furthermore, the overall approach
followed in the specific case study can be applied as a methodology in similar situations.
The rest of the paper is structured as follows: Section 2 provides an outline of the
related work. Section 3 describes the survey procedure and the resulting dataset.
Section 4 presents the basic principles of the statistical tools used in the analysis.
Section 5 presents the results of the statistical analysis. In Section 6 we conclude by
discussing the findings of the paper and the threats to the validity of the study, and we
provide directions for future work.
respect to the type of companies, the roles of managers and their interdependencies. In
[6] an online survey was used for collecting quantitative and qualitative data regarding
quality requirements in Australian organizations.
The data of the present paper were obtained by the Hierarchical Cumulative Voting
(HCV) method [7], [8]. HCV is a very effective way to elicit the priorities given to the
various requirements of software quality. The methodology is essentially an extension
of the Cumulative Voting (CV) procedure to two levels: each stakeholder first uses
CV to prioritize issues at the upper level and then uses the same technique at the lower
level. The original CV method (also known as the hundred-dollar test) is described in [9].
CV is a simple, straightforward and intuitively appealing voting scheme where each
stakeholder is given a fixed amount (e.g. 100, 1000 or 10000) of imaginary units (for
example monetary) that can be used for voting in favor of the most important items
(issues, aspects, requirements etc). In this way, the amount of money assigned to an
item represents the respondent’s relative preference (and therefore prioritization) in
relation to the other items. The imaginary money can be distributed according to the
stakeholder's opinion without restrictions: each stakeholder is free to put the whole
amount on a single item of dominating importance or to distribute the amount equally
over several or even all of the issues.
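As a hypothetical illustration with 100 units and four items: a stakeholder who regards usability as dominant might allocate (70, 20, 10, 0), while an undecided colleague might allocate (25, 25, 25, 25). Both vectors sum to 100, and only the relative sizes of the entries carry information about the respondent's priorities.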
In software engineering, CV and HCV have been used as prioritization tools in
various areas such as requirements engineering, impact analysis or process
improvement ([10], [11], [12], [13]). Prioritization is performed by stakeholders (users,
developers, consultants, marketing representatives or customers), under different
perspectives or positions, who respond to appropriately designed questionnaires.
Recently, HCV was used in an empirical study [14] undertaken in a global software
development (GSD) of a software product. The 65 stakeholders, in different roles with
the company, prioritized 24 software quality aspects using HCV in two levels.
The multivariate data obtained from CV or HCV are of a special type, since the
values of all variables have a constant sum. An exploratory methodology for studying
data obtained from CV-based methods is Compositional Data Analysis (CoDA).
CoDA has been widely used in the methodological analysis of materials composition
in various scientific fields like chemistry, geology and archaeology [15].
In a former study [16], authors of the present paper derived data concerning the
prioritization of requirements and issues from questionnaires using the Cumulative
Voting (CV) method and used CoDA to analyze them. Compositional Data Analysis
(CoDA) has been used also for studying different types of data. For example, in [17]
the authors used CoDA for obtaining useful information on allocated software effort
within project phases and in relation to external or internal project characteristics.
CoDA, apart from being a method of analysis, provides useful tools for transforming
the original data for further analysis by other methods such as cluster analysis [14].
Finally, a recent paper [18] provides a systematic review of issues related to CV
and a method for detecting prioritization items with equal priority.
In the present paper we used the experience from our former studies on requirements
prioritization in order to collect data from the organization under study and then to apply
statistical techniques in order to study divergences between different types of stakeholders
in prioritization. The data were collected using a 2-level HCV procedure and then we
applied specific CoDA techniques for transforming the data. The transformed data were
subsequently analyzed with univariate and multivariate statistical methodologies in order
Each stakeholder was asked to prioritize 18 requirements, which were decided on after
long interaction with members of the organization. It is important to emphasize that the
authors of this paper are actively involved in the procedures of this project, so the
defined requirements are realistic demands expressed by the personnel. Furthermore, it
is necessary to highlight that the requirements examined here are not all quality
requirements: some of them can be considered functional requirements, while others
are simply requirements for properties that were considered practically important for
the specific organization's transition to Web technologies.
B4. Procedures Implementation, which refers to the full integration of all business
processes within the system,
B5. Prevention of duplicate data entries (systems’ interoperability), which refers
to the reduction of the primary data entry due to the interoperability with
other IT systems.
C1. Time, which refers to the time that will be needed to develop the system
software,
C2. Costs, which refers to the total cost of the development and the transition, but
also other costs too.
C3. Insource Development, which refers to the work which is undertaken by
employees internal to the organization,
C4. Outsource Development, which refers to the work which is undertaken by
employees from a consultancy company or any other subcontractor working
for the organization.
D1. Mobile Technologies, which stands for the ability to store data via mobile
devices or tablet PCs,
D2. Internet Publications, which refers to the possibility of publishing
information online,
D3. Barcode/RFID Implementation, which is related to managing and monitoring
materials via barcodes or RFID technology,
D4. GIS Interconnection of Distribution Network, which refers to the geographical
mapping of the distribution network through GIS technology.
A summary description of the requirements in both levels (Higher and Lower) is
shown in Table 2.
In order to apply specific statistical techniques to the data regarding the 18 lower-
level requirements considering all of them together, we applied a weighting conversion
schema. The data obtained by Hierarchical Cumulative Voting (HCV) were
transformed to simple Cumulative Voting (CV) results by a procedure described by
Berander and Jönsson in [7]. The basic idea of this procedure is to convert the amount
assigned to each requirement of the lower level by taking into account the
amount assigned to the high-level requirement to which it belongs and also the number
of low-level requirements belonging to that higher-level requirement. The
data obtained after this transformation are weighted on a common basis and are of
a specific kind, since they sum to 1 (actually their sum is 100, but by a simple
division by 100 we can consider that they sum to 1). This methodology
essentially addresses RQ1.
Since the sum is fixed, there is a problem of interdependence of the proportions and
therefore they cannot be treated as independent variables. Moreover, normality
assumptions are invalid, since the values are restricted to the [0,1] interval, and finally
the relative values of the proportions are of particular interest rather than their absolute
values. So our dataset needs to be transformed in order to address a problem of
analysis and interpretation of ratios of proportions. Therefore, in order to apply
various classical univariate and multivariate statistical methods like ANOVA and
Discriminant Analysis, the proportions are transformed to ratios by appropriate
formulas like the clr transformation described below.
Since we are interested in ratios, the problem of zeros in these data is of principal
importance since division by zero is prohibited. According to the general theory, the
zeros can be essential (i.e. complete absence of a component) or rounded (i.e. the
instrument used for the measurements cannot detect the component). However, in our
context, the importance of a requirement is expressed as a measurement according to
human judgment. So we can assume that a low prioritization is actually measured
by a very low value on the $100 scale which is rounded to zero. Of course, when all
respondents allocate zero to an issue, it can be excluded from the entire analysis and
considered an essential zero.
Due to the problem of zeros, the various ratios needed for the analysis cannot be
computed directly. It is therefore necessary first to find a way of dealing with
the zeros. In our case we used a simple method proposed by [27], known as the
multiplicative zero replacement strategy. According to this method, every vector
$p = (p_1, \ldots, p_k)$ consisting of proportions or percentages as described in (1) and having
$c$ zeros can be replaced by a vector $r = (r_1, \ldots, r_k)$, where

$$ r_j = \begin{cases} \delta_j & \text{if } p_j = 0 \\ p_j \left( 1 - \sum_{l:\, p_l = 0} \delta_l \right) & \text{if } p_j > 0 \end{cases} \qquad (2) $$

and $\delta_j$ denotes the small value used to replace the $j$-th zero. The geometric mean of the components of $r$ is

$$ g(r) = \left( \prod_{i=1}^{k} r_i \right)^{1/k} \qquad (4) $$
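For completeness, the centered log-ratio (clr) transformation applied to the zero-replaced vectors is the standard one from the CoDA literature: each component is expressed as a log-ratio against the geometric mean of the vector,

$$ \mathrm{clr}(r) = \left( \ln\frac{r_1}{g(r)},\ \ln\frac{r_2}{g(r)},\ \ldots,\ \ln\frac{r_k}{g(r)} \right) $$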
The zero replacements as well as CLR transformations were carried out with the
CoDAPack 3D [28].
by box-plots. In total, we performed 18 one-way ANOVA tests, one for each of the
transformed lower-level requirements. It should be emphasized that the initial
transformation described in [7] essentially weights the lower-level requirements using
the values and the structure of the higher-level requirements, i.e. the new transformed
values contain information from both levels. That is why we do not perform ANOVA for
the higher level (A-D). With this method we address RQ2.
5 Results
The results from an initial descriptive analysis (Table 3 and Fig. 1) revealed that,
regarding the higher level, all the stakeholders prioritize Features (A) first, assigning in
general higher values (mean value 34.5, median 35). On the other hand, the
requirements related to Project Management (C) seem to receive the lowest values
(mean value 17.5, median 20).
Fig. 1. Box Plot for the requirements prioritizations at the Higher Level (original values)
Fig. 2. Bar chart of means of all high-level requirements in the groups of stakeholders
The results obtained at the lower level are presented in Table 5. Here we report the
statistics of the weighted percentages taking into account the number of lower level
issues within each issue of the higher level [7]. Again, requirements related to
Features are prioritized with the highest values (Usability (A1), Operability (A2) and
System Response Time (A3)). The requirement with the lowest values is Outsource
Development (C4). It is also remarkable that Prevention of duplicate data entries (B5)
receives in general high values of prioritization.
Similar results are obtained regarding the different groups of stakeholders (Table 6).
It is clear that the requirements Usability (A1) and Operability (A2) are prioritized
high, especially by senior stakeholders. On the contrary, it is noticeable that Users
seem to assign low amounts to Data Security (A4) and Adaptability (A5), which
deserves further investigation. On the other hand, Users seem to attribute high values
to the Ability of Statistical Services (B1), Bureaucracy issues (B3), as well as to the
Prevention of Duplicate Data entries (B5).
Taking into consideration the Project Management requirements, the results were
more or less expected. Senior stakeholders prioritize issues related to Cost (C2)
higher, whereas Software Developers assign higher values to Time (C1). Finally,
Software Developers assign high values to all the requirements related to New
Technology Adaptation (D) issues.
Table 5. Results of Descriptive Statistics: Lower Level

                                                     Mean   Median   Min    Max     Std Deviation
A1: Usability                                        9.80   9.51     1.14   31.91   6.01
A2: Operability                                      8.09   6.67     1.14   17.39   4.17
A3: System Response Time                             9.13   8.51     1.11   26.6    4.35
A4: Data Security                                    4.89   4.79     0      9.78    2.37
A5: Adaptability                                     5.44   4.89     0      21.28   3.56
B1: Ability of Statistic Services                    6.79   6.62     0      16.48   3.78
B2: Configurability                                  3.57   3.30     0      15.63   2.74
B3: Publishing-printing Forms/Limiting Bureaucracy   5.74   5.62     0.31   21.54   3.40
B4: Procedures Implementation                        3.86   3.33     0      10.77   2.36
B5: Prevention of duplicate data entries             7.09   6.59     2.13   13.64   3.27
C1: Time                                             4.88   4.44     0.64   10.67   2.57
C2: Cost                                             4.07   3.52     0.08   10.55   2.45
C3: Insource Development                             4.09   3.64     0.08   8.89    2.07
C4: Outsource Development                            2.30   1.98     0      8.89    1.69
D1: Mobile Technologies                              6.10   6.02     1.70   16.74   3.21
D2: Internet Publications                            4.87   4.35     0      18.18   3.37
D3: Barcode/RFID Implementation                      3.82   3.40     0      10.21   2.21
D4: GIS Interconnection of Distribution Network      5.47   5.22     0.42   16.74   2.99
Table 6. Results of Descriptive Statistics regarding the different groups of stakeholders: Lower Level

                                                     Senior Executives   Executives   Software Developers   Users
A1: Usability                                        13.95               9.86         7.84                  10.12
A2: Operability                                      11.98               8.19         7.65                  7.81
A3: System Response Time                             7.10                9.14         6.67                  10.31
A4: Data Security                                    7.84                6.85         5.92                  3.54
A5: Adaptability                                     6.17                6.15         7.32                  4.39
B1: Ability of Statistic Services                    4.21                7.69         5.04                  7.46
B2: Configurability                                  2.02                3.09         3.96                  3.74
B3: Publishing-printing Forms/Limiting Bureaucracy   2.29                4.95         4.55                  6.82
B4: Procedures Implementation                        3.72                4.86         4.35                  3.37
B5: Prevention of duplicate data entries             4.21                6.39         6.54                  7.85
C1: Time                                             4.85                5.10         5.21                  4.68
C2: Cost                                             4.99                4.70         4.10                  3.77
C3: Insource Development                             4.26                4.41         3.10                  4.37
C4: Outsource Development                            1.99                2.68         2.22                  2.24
D1: Mobile Technologies                              5.29                5.07         7.33                  6.02
D2: Internet Publications                            3.52                3.54         6.49                  4.80
D3: Barcode/RFID Implementation                      4.36                3.66         4.38                  3.59
D4: GIS Interconnection of Distribution Network      7.26                3.68         7.34                  5.13
The prioritized data of the lower level were transformed and analyzed separately
with the CoDA methods described in the previous section (replacement of zeros and
clr transformation). One-way ANOVA and DA were applied to the obtained data.
The results from one-way ANOVA revealed that there is a statistically significant
difference between the way the different groups of stakeholders prioritize some of the
18 requirements of the lower level. Specifically, the requirements with significant
differences between the different roles of the stakeholders (p<0.05) are the following
five: Data Security (A4), Adaptability (A5), Publishing-printing Forms/Limiting
Bureaucracy (B3), Insource Development (C3) and GIS Interconnection of
Distribution Network (D4).
Furthermore, for those requirements for which ANOVA showed a significant difference, we
used Tukey's post-hoc test (for equal variances) and the Games-Howell test (for
unequal variances) in order to see which of the groups differ. For each of these
five requirements the appropriate post-hoc test gave the following results:
Data Security (A4): There is a statistically significant difference only between Users and
all the other roles. The important finding here is that Users prioritize this requirement
with significantly lower values than all the other groups (Fig. 3). It is also notable that
A4 is a very important issue for the Senior Executive stakeholders.
Adaptability (A5): The only statistically significant difference here is between Users
and Software Developers. Again it is remarkable that Users prioritize adaptability
issues low (Fig. 4). This is not such an unexpected result, since the system's ability to
adjust to new processes or to changes of the existing framework is not a main concern
of a simple user.
Publishing-printing Forms/Limiting Bureaucracy (B3): The significant differences are
between Users and Senior Executives, and between Users and Software Developers
(Fig. 5). Here, it seems that simple users are more concerned about issues related to
the workload caused by bureaucracy that can be limited by the new system.
Insource Development (C3): The only significant difference here is between Software
Developers and Users. The unexpected finding is that Users are more concerned about
insource development than the developers themselves (Fig. 6). It is true that users
count on the company's development team: the developers are able to understand
the users' daily routine and can deliver a better application.
GIS Interconnection of Distribution Network (D4): There was only one significant
difference, between Executives and Software Developers. It seems that the group
of executives assigns low priority to the issue of geographical mapping of the
distribution network through GIS technology (Fig. 7).
In the next step of our analysis, we performed stepwise Discriminant Analysis (DA). DA is used to address the following research question: Is there a combination of requirements that is able to discriminate the various groups of stakeholders according to their prioritization attitude? As already mentioned, DA was first applied to the initial four inherent groups, but the model was not satisfactory, since only 58.8% of the cross-validated cases were classified correctly. Hence, we applied it to a new grouping with only two categories; specifically, we aggregated our stakeholders into two groups.
The first group contains the first three role categories (Senior Executives, Executives and Software Developers), while the Users constitute the second group. This division was chosen so as to have in the first group employees with high expertise and broad access to internal company information on technical, managerial, and administrative issues (Experts), and in the second group Regular Users. Of course, it can be argued that users are experts too, especially regarding specific operations within the organization. However, their perspective of the system as a whole is quite different, since they do not actually have to take managerial decisions. The distribution of the stakeholders within the second grouping is shown in Table 7.
According to the stepwise DA results, the statistically significant requirements for the discrimination of the stakeholder groups are the following four: Data Security (A4), System Response Time (A3), Configurability (B2) and Insource Development (C3). Moreover, according to the Wilks' Lambda criterion, the most important requirement for the discrimination of the groups is Data Security (A4).
The classification accuracy is satisfactory, since the overall cross-validation accuracy rate is 84.3%. More specifically, DA classifies 91.3% of the Experts correctly, while the Regular Users are correctly classified at a 78.6% rate. The ROC curve of the DA discriminant scores is presented in Fig. 8. The AUC statistic is 0.936 (p<0.001), showing that the classification of the groups of stakeholders is achieved with high accuracy based on their prioritization of the four aforementioned requirements.
The coefficients of the discriminant function resulting from DA show that Experts prioritize the requirement for Data Security higher than Regular Users. On the other hand, Regular Users prioritize all the other three requirements higher, i.e. System Response Time, Configurability and Insource Development.
Fig. 8. ROC curve for the Discriminant Analysis for Experts and Regular Users prioritization
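The two-group discriminant analysis, its cross-validated accuracy, and the ROC assessment can be approximated with the sketch below; it uses scikit-learn's linear discriminant analysis with leave-one-out cross-validation as a stand-in for the reported cross-validated classification, omits the stepwise variable selection, and assumes a hypothetical "User" role label.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.metrics import accuracy_score, roc_auc_score

data = pd.read_csv("clr_transformed.csv")              # clr-transformed prioritizations + 'role'
X = data[["A4", "A3", "B2", "C3"]].values              # the four retained requirements
y = (data["role"] == "User").astype(int).values        # 1 = Regular User, 0 = Expert (assumed label)

lda = LinearDiscriminantAnalysis()
y_cv = cross_val_predict(lda, X, y, cv=LeaveOneOut())  # leave-one-out class predictions
print("cross-validated accuracy:", accuracy_score(y, y_cv))

lda.fit(X, y)
scores = lda.decision_function(X)                      # discriminant scores for the ROC curve
print("AUC:", roc_auc_score(y, scores))
print("coefficients:", dict(zip(["A4", "A3", "B2", "C3"], lda.coef_[0])))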
The fact that the differences found in the stakeholders' prioritizations are few and, in general, anticipated and interpretable is positive, in the sense that they can easily be taken into account and implemented in the developed software. The lack of strong and conflicting disagreements is an indication that the transition to the new, modernized Web technologies will be achieved smoothly and with limited criticism.
Regarding the validity threats, the study is clearly exploratory and by no means can the findings be generalized to other companies or situations. The stakeholders do not constitute a random sample; however, they were approached for their experience and expertise, so their responses are considered especially valid.
As future work, we plan to repeat the study with the same respondents after their interaction with the new system. This replication of the survey will concern issues of the new system's functionality, corresponding to the requirements studied here. This will essentially provide feedback for the quality evaluation of the new system.
Challenges and Solutions
in Global Requirements Engineering
– A Literature Survey
Klaus Schmid
1 Introduction
Today, software development increasingly happens in a globally distributed context. This is due to a number of factors. In particular, markets themselves are increasingly globalized, which means that sometimes even small companies build software for a customer on another continent. Moreover, the competition for competent software engineers is strong in many parts of the world, leading many large corporations to create development centers around the world. Whatever the reason, the consequence is that it is important to perform requirements engineering in a way that deals with the aspects of distribution and internationalization. In this study, we aim to summarize the results of an analysis of the literature on global requirements engineering.
The recognition that global requirements engineering poses particular and
challenging problems is not new. Cheng and Atlee described this as an important
requirements engineering problem [1]. Herbsleb also described requirements
engineering as an important problem in global software engineering [2].
When discussing the problem of global requirements engineering, we found it
important to distinguish:
1. Internationalization: the development for a set of international (globally
distributed) customers (perhaps with a single, localized development team) and
2. Distribution: the development in a globally distributed environment, where many
developers are in a different location from the customer(s).
While both problems often co-occur, they are different and may require slightly different strategies for handling them. This is particularly true as key parameters, like the possibility to influence the organizations involved, differ. For example, a development organization may decide to change its processes for all development personnel, but it is not able to influence processes in the customer organization. However, common strategies also exist, as some problems, such as handling cultural differences, are crosscutting.
The first of these problems also entails that systems must be customized to the
demands of a particular customer, especially if the same or similar systems must be
provided to multiple customers. The adaptation of software to an international
customer base is often described as Internationalization [3][4][5]. Hence, we refer to
this form of Requirements Engineering as International Requirements Engineering
(IRE). While internationalization addresses the customization to regional differences such as language, time zones, etc., functional customizations are often also required to adapt the software to different customer needs [6]. This need for diversity also makes product line engineering relevant to this situation [7].
On the other hand, requirements engineering may need to interact with a globally distributed development team. This might even mean that the people within the company who are involved in requirements engineering are globally distributed. We will refer to this company-internal aspect as (Globally) Distributed Requirements Engineering (DRE), as its main characteristic is the distribution of requirements engineering and the consequences this leads to, such as company-internal culture clashes.
We will use the term Global Requirements Engineering (GRE) to refer to the
combination of both. In global software engineering the main focus is usually on
those aspects that are related to distribution. It should be noted that our use of the term
global is intentionally broader than this and also includes the identification of
international differences in requirements and providing the basis for the
corresponding adaptation, as this is part of the globalization problem as a whole.
While in practice DRE and IRE may occur together, they are conceptually different
and each may appear alone as well. In particular, a situation where IRE will occur
without (relevant) DRE is when the customer itself is distributed or there are a
number of different international customers, but the development is collocated at a
single site. On the other hand, the customer might be next door to the requirements
engineer, while the development team is globally distributed.
In Section 2 we will survey results on IRE, while in Section 3 we will discuss
DRE. Finally, in Section 4 we will conclude.
IRE is characterized by a difference in location and often in culture and language between the customer and the development organization. Moreover, an additional dimension of complexity is often introduced if multiple customers in different countries must be supported at the same time and their demands must be combined into a single system or product line. While this seems a rather common situation in software engineering, we could only identify few sources in the globalization and requirements engineering literature that address the requirements engineering problems in internationalization explicitly.
In the literature, we identified the following ways of organizing the interaction with an international customer:
1. Own sales office: there is an office of the development organization in the customer's country.
2. Partnering: a partner relationship with a company from the target country exists. Thus, requirements exchange usually happens with this organization, as opposed to direct customer interaction.
3. Repeated meetings: repeated meetings with the customer organization are organized to ensure good communication. This might culminate in bringing developers to the customer organization.
4. Co-located customer: a member of the customer organization is embedded with the development team, e.g., as a product owner in an agile setting.
5. Remote interaction: use of techniques like video-teleconferences, email, etc.
Of course, the different approaches are not mutually exclusive. Rather, they can be combined well to address the various problems of International Requirements Engineering. It is interesting to note that we hardly found approach 2 discussed in the scientific literature, although it seems useful for small companies that produce standardized software for a broad market, as it simplifies interaction and reduces costs. However, it implies a three-tiered organization: development organization, partner organization, and customer. As a downside, this makes it hard to correctly identify the requirements of a customer, as they may become distorted in the process. This is emphasized in [12], where it is mentioned that "It is typical that resellers distort the facts...". It also leads to added complexity of the stakeholder network, but this problem is only partially related to internationalization; it is rather due to the very large number of entities with which interaction is necessary. In this context, the problems of dealing with prioritization and assessing the relative business value of requirements are also amplified. The authors of [12] also report that the added link in the market communication introduces delays. These issues are mostly relevant to market-driven development, which often relies on strategies 1 or 2.
Some work also aims to provide techniques to analyze and improve distributed
information flow in a general way (i.e., independent of a specific setup). This is
typically related to communication and organizational issues. These approaches are
not necessarily specific to requirements engineering [19], [20].
Most papers focus on cases with direct customer interaction. In this context, it is still rather unclear how to organize this interaction, as the evidence so far seems inconclusive. Hanisch and Corbitt [10] describe a case where several approaches are tried over the course of a project (users at the development site, each party at their own site, developers (partially) at the user site), but problems arise regardless of how the organization is done. They summarize their experience in the recommendation that it is important to put one person in charge of communication and also to use a video-conferencing facility. But the question remains whether this should be regarded as a success case, as they state that the people involved would not like to repeat the experience.
While the importance of personal interaction is often emphasized (a viewpoint that is also shared by the author), the study by Calefato et al. [13] shows that text-based elicitation can be as effective as face-to-face communication. However, they also note that participants prefer face-to-face communication. Thus, it might be more an issue of (social) preference than a fundamental quality of an approach.
[Table footnote a: this refers to cultural issues that have an impact on the final product, as opposed to cultural issues that have an impact on the customer-developer interaction.]
So far, cultural issues in global software engineering have mainly been discussed in the context of development-team-internal issues. Based on our overview, it is important to note that many of the typically recommended practices, such as cultural training, seem to provide less benefit than typically claimed.
While the previous sections mainly focused on aspects that influence the requirements engineering method, context issues also influence the requirements themselves. Thus, it is important for IRE that these requirements are also identified as part of requirements engineering. The importance of context is also emphasized in [8].
We use the term (usage) context issue in a broad sense, referring to any issue that impacts the requirements of a system due to the place where the software is used. Sub-categories that we could identify from work on internationalization are given in Table 1. While this list is not necessarily complete, it is a good starting
point for addressing differences. It is interesting, however, that in the requirements engineering and global software engineering literature we analyzed, we hardly found guidelines on how to deal with them.
However, these issues are recognized in the area of software internationalization [3][6], although most work there typically focuses on languages and language-induced differences. The first three items in Table 1 are widely considered in internationalization [6][4]. For many of them (e.g., different languages for the user interface or different calendars or units of measurement) very good support also exists from a technical perspective. However, some impacts are much more subtle. For example, the authors of [6] describe a case study where differences in language (German vs. Chinese) led to differences in fonts and finally required a hardware redesign to ensure readability of the user interface. The authors also state that taking such requirements into account is of great importance for achieving product success (their first version, which did not take this into account, was not successful). Another language issue is that some countries have more than one language, and thus systems must be capable of switching among these languages at any time. An extreme case is India, where more than 20 languages and 11 scripts are in use [18].
Laws and regulations in different countries might introduce very subtle but key distinctions, while having very profound effects on the product requirements. As customers cannot be expected to know them all, the development team faces the question of how to identify these requirements in a foreign state. This is a non-trivial challenge for which we could not find good advice in the literature, especially as the development team might not even be aware that relevant laws and regulations exist. The standard requirements engineering answer for identifying such tacit knowledge is to use ethnographic methods like observation. However, these methods also require a significant amount of time, while still being prone to omissions of rare events [17].
Cultural differences may also significantly impact the details of the product requirements, as they shape expectations. This might take the form of product requirements, like the supported business process, or it might impact the design of the user interface. Examples of the former are everyday processes like the use of a gas pump, where substantial differences exist among countries, with a corresponding impact on related products. This makes it very important to identify and understand cultural expectations early on, as otherwise the result may not be acceptable.
The educational background of people or environmental characteristics might also significantly impact the requirements of the resulting system. For example, in some countries one needs to include particularly low-skilled workers. This might then have an impact on how they are expected to use machinery or what they are allowed to do. This might be further compounded if different groups with different skill levels are introduced [6]. It is very hard to identify such issues without ethnographic studies, as the customers will often not be able to voice them in the beginning (because they are obvious to them). Jantunen et al. [12] also emphasized the problems of identifying the needs (even in the presence of a proxy).
Requirements need not only be elicited, but also negotiated and prioritized. It seems that these activities differ significantly in some respects from elicitation. For example, while Calefato et al. [13] found that for requirements elicitation text-based communication is roughly as efficient as face-to-face communication, their experiments show that for requirements negotiation face-to-face is more effective. A key problem of internationalization is identified in [12], where the authors conclude that it is very difficult to determine the value of a requirement in a distributed situation. However, their conclusion might be partially due to the fact that they look at a case where the company can only communicate indirectly with the customer. In their context (market-driven development) they additionally face the problem of integrating multiple requests in a balanced way. This problem is also addressed in the context of product line engineering as scoping from different perspectives [15], [16].
Bhat and Gupta address requirements engineering in a maintenance situation [22]. They emphasize that requirements must be clustered to identify which requirements belong together and with which role they are associated, in order to prioritize them correctly. They also emphasize the need to communicate priorities back to the stakeholders. Damian emphasizes the importance and difficulty of establishing a trust relationship in a distributed setting [8]. Crnkovic et al. also mention the importance of relationship building as a success factor in the context of the many student courses they conducted [24]. In their courses they support this by enforcing intensive communication.
2.6 Summary
In this section, we discussed the problem of identifying requirements with customers who are in a different country than the development organization. We introduced the term International Requirements Engineering (IRE) to emphasize this situation. We identified a number of different issues in this context, which we discussed based on existing literature from requirements engineering and global software engineering. We also used additional literature where appropriate (in particular from software internationalization).
Overall, there is a broad range of issues that come up in the context of IRE: they impact all phases of requirements engineering, from requirements elicitation through requirements analysis to negotiation and prioritization. A fundamental question here is how to organize customer interaction, as this strongly influences further decisions in IRE. However, we found little discussion of this, as most cases reported on single-system, contract-based development with direct customer-developer interaction. The introduction of a separate sales office leads to a significant difficulty in creating close cooperation between development and customer, but it also provides major benefits to small- and medium-size companies when they aim at an international market.
Organizational issues are a very important problem in IRE. The challenge in IRE is that the possibility to influence them is rather low, as the development organization typically has no direct power to make changes to the customer organization. A fundamental decision is how to organize the customer interface. Using an intermediary organization has fundamental advantages, but also disadvantages. Thus, we assume this is only useful if the development organization deals with a large range of customers, as in the case studies by Jantunen et al. [12].
Cultural differences between the developer and customer organization may negatively impact the requirements engineering process. This is not only true for regional differences; company cultures and the cultural difference between a small and a large company must also be taken into account. Cultural differences may easily lead to interpretations of requirements that are inadequate and hence result in unacceptable products.
We also identified context issues that give rise to specific requirements that are often hard to identify. Typically, ethnographic studies are proposed for this. This reinforces the need for an exchange of people between both organizations, and the literature seems to suggest that more benefit is gained if the development organization visits the customer organization than vice versa.
Finally, requirements negotiation and prioritization are impacted by international
development. It seems to be very important to facilitate direct communication
between development and customer and among different customer groups, preferably
face-to-face. If this is not possible, at least transparency of process and status must be
achieved for all involved stakeholders. In particular, the multiple, distributed
stakeholder needs must be clustered to adequately represent the corresponding roles.
Many of the issues (and approaches) we found are not specific to requirements
engineering. Actually, many are of general importance to distributed software
development. However, we restricted our survey to issues that are explicitly raised in
the context of RE.
The organization of elicitation has already been partially addressed in Section 2.2. However, our focus here is on the elicitation problems that may occur due to the distributed nature of the development and not because of distributed customer sites. This situation is characterized by the fact that the development organization is distributed and that different parts of the development organization may have different information needs and may even try to communicate with the customer simultaneously, perhaps in an uncoordinated fashion. Unfortunately, we could not find much information on this, as most work that addresses global requirements engineering addresses the basic situation of one customer interacting with one (part of a) development organization. Probably the closest to this is the situation described by Jantunen et al. [12]. In their case study, the core development organization communicates with the customer through a number of sales offices and partner companies. In this case, the core elicitation function is actually handled not by the development company but by the sales offices; thus, they are responsible for part of requirements engineering. The developers emphasize that through this indirection time is lost and the integration of requirements that arrive through different channels becomes hard. Moreover, it is extremely difficult in such a situation to assess the (relative) business value of the various requirements.
Bhat et al. also provide an interesting case study [23]. They discuss a situation in which both the business managers on the client side and the development team are distributed. This arrangement became particularly problematic because access to the business managers had to be routed through the IT organization of the customer. While this simplified communication, as only one point of contact was needed, it made it much harder to communicate with the users. In this situation, multiple parts of the developer organization also communicated with different parts of the client organization. This led to disagreements among the different development teams on issues like the requirements approach or the current status of the project. The fact that multiple teams worked in parallel also led to further misunderstandings and conflicts.
These examples highlight the issues that may occur. In both cases, the authors of the studies also made some recommendations on how to deal with the situation. Jantunen et al. [12] provide some examples of how the organization aimed to address the problem, for instance by establishing a central product management and trying to create direct links from there to the customers, or by introducing rigorous prioritization techniques to ensure that tradeoffs among the requirements relevant to the different markets can be made. Bhat et al. [23] also discuss some proposals for improving the situation. However, while they give several recommendations, they do not clearly relate them to the situation described above. One of the probably relevant practices is to create a shared process; this can include methods like distributed quality function deployment for requirements prioritization. Another one that is relevant to these
Nicholson and Sahay point out that cultural issues can also strongly influence requirements communication [9]. In a case study of shared development between England and India, they identified cultural differences as major obstacles in the area of requirements engineering. For example, they mention that the Indian side tended to accept all requirements without questioning them, while the site in England expected that requirements would be questioned. This led to significant misunderstandings with respect to the requirements.
3.5 Summary
DRE requires adapting the requirements engineering process and activities, the organizational structure, and, most importantly, the mindset of people. It is critical that the corresponding problems are explicitly recognized and addressed in order to achieve a successful requirements process. Surprisingly, the organization of elicitation in particular has received rather little attention in the scientific literature (e.g., the role of sales organizations in the overall requirements engineering structure). More research would certainly be beneficial here.
Important guidelines are to explicitly recognize and address distribution difficulties and challenges, especially language issues, cultural issues, and geographic distribution. Major approaches to achieve this are proxies (embedded personnel from the other organization), intensive communication (e.g., visits, scheduled interactions, etc.), and, most importantly, direct communication. Some models are also emerging that aim to give prescriptive guidance on how to set up an organization in a way that supports good global communication, like the global teaming model [29].
A somewhat unclear point, however, is the importance of addressing cultural distance, as different authors assign it different importance. Based on the existing literature, we think that it is important to address cultural issues, but this can in no way replace other measures, and the technical content must remain in the foreground. We were also surprised that the use of tools (e.g., internal requirements databases to which everyone has access) and similar measures that aim at supporting continuous communication did not come up more often.
4 Conclusion
Global requirements engineering is a challenge and will remain one in the future. It is inherently more complex and difficult than requirements engineering in closely collocated settings. However, given today's situation of globalized development, practitioners must cope with these problems. Here, we tried to condense some of the current knowledge on requirements engineering in global software engineering.
We decided to separate our analysis into two main parts: International Requirements Engineering and Distributed Requirements Engineering. While the first addresses customer-oriented interaction, the second focuses on development-internal interaction. While the corresponding issues overlap, rather different issues also came up. We could identify a number of specific solutions that have been proposed in the literature; however, we also had to concede several times that the evidence is so far inconclusive. Often, the specific details supporting a proposal are also not given precisely enough. Most studies cover only a single case study or a small number of case studies. Often, it is also not clear how reliable the case study information is.
Further, in some cases we have only conflicting information, leading to different recommendations. It is at this point unclear how to resolve these conflicts. We believe this can only be achieved by further detailed studies which also explicitly take into account the variations in contexts for global requirements engineering (for example, an approach that is very adequate in one of the contexts identified in Section 2.2 is not necessarily adequate in the others).
Acknowledgment. The work on this publication was partially supported by the Software
Engineering Center (SEC), NIPA, South Korea. The author was also partially supported
by the Brazilian National Council for Scientific and Technological Development
(CNPq).
References
[1] Cheng, B., Atlee, J.: Research directions in requirements engineering. In: Future of
Software Engineering (FOSE 2007) Part of International Conference on Software
Engineering, pp. 285–303 (2007)
[2] Herbsleb, J.: Global software engineering: The future of socio-technical coordination. In:
Future of Software Engineering (FOSE 2007), Part of International Conference on
Software Engineering, pp. 188–198 (2007)
[3] Esselink, B.: A Practical Guide to Localization. John Benjamins (2000)
[4] Phillips, A.: Internationalization: An introduction part II: Enabling. In: Tutorial at the
Internationalization and Unicode Conference (2009)
[5] Watson, G.L.: A Quick Guide to Software Internationalization Issues. Kindle E-Book
(2010)
[6] VDMA. Software-Internationalisierung. VDMA (2009) (in German)
[7] Schmid, K.: Requirements Engineering for Globalization (RE4G): Understanding the
Issues, Report for the Software Engineering Center (SEC), NIPA, South Korea (2012)
[8] Damian, D.: Stakeholders in global requirements engineering: Lessons learned from
practice. IEEE Software 24(2), 21–27 (2007)
[9] Nicholson, B., Sahay, S.: Embedded Knowledge and Offshore Software Development.
Information and Organization, pp. 329–365. Elsevier (2004)
[10] Hanisch, J., Corbitt, B.: Requirements engineering during global software development:
Some impediments to the requirements engineering process — a case study. In: 12th
European Conference on Information Systems (ECIS), pp. 628–640 (2004)
[11] Prikladnicki, R., Audy, J., Damian, D., de Oliveira, T.: Distributed Software Development:
Practices and challenges in different business strategies of offshoring and onshoring. In: 2nd
International Conference on Global Software Engineering, pp. 262–274 (2007)
[12] Jantunen, S., Smolander, K., Gause, D.: How internationalization of a product changes
requirements engineering activities: An exploratory study. In: 15th International
Requirements Engineering Conference, pp. 163–172 (2007)
[13] Calefato, F., Damian, D., Lanubile, F.: An empirical investigation on text-based
communication in distributed requirements workshops. In: 2nd International Conference
on Global Software Engineering, pp. 3–11 (2007)
[14] Brockmann, P., Thaumüller, T.: Cultural aspects of global requirements engineering: An empirical Chinese-German case study. In: 4th International Conference on Global Software Engineering, pp. 353–357 (2009)
[15] John, I., Eisenbarth, M.: A decade of scoping — a survey. In: 13th International
Conference on Software Product Lines, pp. 31–40 (2009)
[16] Helferich, A., Schmid, K., Herzwurm, G.: Reconciling marketed and engineered software
product lines. In: 10th International Software Product Line Conference, pp. 23–27 (2006)
[17] Maiden, N., Rugg, G.: ACRE: Selecting Methods for Requirements Acquisition.
Software Engineering Journal 11(3), 183–192 (1996)
[18] Bhatia, M., Vasal, A.: Localisation and Requirement Engineering in Context to Indian
Scenario. In: 15th IEEE International Requirements Engineering Conference, pp. 393–394
(2007)
[19] Laurent, P., Mäder, P., Cleland-Huang, J., Steele, A.: A Taxonomy and Visual Notation
for Modeling Globally Distributed Requirements Engineering Projects. In: International
Conference on Global Software Engineering, pp. 35–44 (2010)
[20] Stapel, K., Knauss, E., Schneider, K., Zazworka, N.: FLOW Mapping: Planning and Managing Communication in Distributed Teams. In: 6th International Conference on Global Software Engineering, pp. 190–199 (2011)
[21] Berenbach, B., Wolf, T.: A unified requirements model; integrating features, use cases,
requirements, requirements analysis and hazard analysis. In: 2nd International Conference
on Global Software Engineering, pp. 197–203 (2007)
[22] Bhat, J., Gupta, M.: Enhancing Requirement Stakeholder Satisfaction during Far-shore
Maintenance of Custom Developed Software using Shift-Pattern Model. In: 2nd
International Conference on Requirements Engineering, pp. 322–327 (2007)
[23] Bhat, J., Gupta, M., Murthy, S.: Overcoming requirements engineering challenges:
Lessons from offshore outsourcing. IEEE Software 23(5), 38–44 (2006)
[24] Crnkovic, I., Bosnic, I., Žagar, M.: Ten tips to succeed in global software engineering
education. In: International Conference on Software Engineering, Software Engineering
Education, pp. 101–105 (2012)
[25] Gotel, O., Kulkarni, V., Say, M., Scharff, C., Sunetnanta, T.: Quality indicators on global
software development projects: Does “getting to know you” really matter? In: 4th IEEE
International Conference on Global Software Engineering, pp. 3–7 (2009)
[26] Gorschek, T., Fricker, S., Feldt, R., Torkar, R., Wohlin, C., Mattsson, M.: 1st international global requirements engineering workshop (GREW 2007). ACM SIGSOFT Software Engineering Notes 33(2), 29–32 (2008)
[27] Hagge, L., Lappe, K.: Sharing requirements engineering experience using patterns. IEEE
Software 22(1), 24–31 (2005)
[28] Gumm, D.: Distribution dimensions in software development projects: A taxonomy.
IEEE Software 23(5), 45–51 (2006)
[29] Beecham, S., Noll, J., Richardson, I., Dhungana, D.: A Decision Support System for
Global Software Development. In: 6th International Conference on Global Software
Engineering, Workshops, pp. 48–53 (2011)
Automated Feature Identification
in Web Applications
S. Marciuska, C. Gencel, and P. Abrahamsson
1 Introduction
Feature creep [1,2] (i.e., the addition or expansion of features over time) has become a significant challenge for market-driven, software-intensive product development. Today's software-intensive products are overloaded with features, which has led to an uncontrollable growth of size and complexity. A recent study [3] revealed that most software products contain 30 to 50 percent features that have no or only marginal value.
One of the major consequences of feature creep is feature fatigue [4], where a product becomes too complex and has too many low-value features. Users then usually switch to other, simpler products. Moreover, feature creep can also result in software bloat [5], which makes a computer application slower, requires higher hardware capacities, and increases the cost of maintenance. One of the most recent examples of software bloat is the Nokia Symbian 60 smartphone platform [6]. The system grew so much that it became too expensive to maintain, and therefore it was abandoned.
Currently, the lean start-up [7] software business development methodology tackles the feature creep problem by finding a minimum viable product that contains
only the essential and most valuable features. However, not all lean start-up companies start development from scratch and can easily determine the minimum viable product, as many of them already have complex systems. For example, by understanding how users are using the features, a company might discover that a set of features it maintains is actually not very valuable to its customers. Thus, decision makers could analyse whether removing such features would bring any long-term benefits for the company, as there would be fewer features to maintain. Therefore, there is a need to monitor and identify the features that are not very valuable, in order to systematically remove them from the product [6].
To start the feature reduction process, it is crucial to identify the complete set of features. Features should be identified automatically in order to reduce the cost of the feature reduction process. After identifying the features, they can be monitored by the company to detect how their values change over time. For example, feature usage monitoring could indicate that the usage of some features is decreasing, and thus such features might become candidates for feature reduction. In our previous work, we showed that low feature usage is a good indicator of features that are potentially losing customer value [8].
There exist some approaches that tackle the feature monitoring and identification problem (see Section 2). For example, a number of methods aim to locate features in the source code [9], but they still lack precision. Others aim to monitor system changes by observing user activity [10,11]. However, such approaches collect too much irrelevant information (e.g., random mouse clicks, mouse scrolling, all keystrokes) and fail to monitor the system at the feature level. Another set of approaches [12,13] monitors system usage at too high a level of abstraction (i.e., page usage rather than feature-level usage).
The focus of this paper is to address the problem of automated feature identification in web applications for feature reduction purposes. We set our research questions as follows:
RQ1: What constitutes a feature in web applications for the purposes of feature reduction?
RQ2: How well can features be identified in an automated manner in the context of web applications?
Here, we present our approach and an associated tool that identifies elements of a web application (based on HTML5 technologies) that correspond to features. To this end, we first investigated definitions of a feature by conducting a literature survey. Then, we defined formally what constitutes a feature in web applications. Finally, we developed a tool, which implements the rules for automatically identifying features in web applications.
To evaluate the performance of our tool, we conducted a multiple-case study. We selected three well-known websites as the cases: Google, BBC, and Youtube. The features of these web applications were first identified manually by the participants of the case study and then in an automated way using our tool. In the end, we compared the results using two operational measures: precision and recall [14].
The rest of the paper is structured as follows: Section 2 presents the related work. In Section 3, we elaborate on definitions of a feature in the literature. Then, by formalising the definition of a feature for web applications, we describe our approach for feature identification. Section 4 presents the case study and the results. In Section 5, we discuss threats to validity for this study. Finally, Section 6 concludes the work and presents future directions.
2 Related Work
The feature location field aims to locate features and their dependencies in the code of a software system. A recent systematic literature review on feature location [9] categorizes the existing techniques into four groups: static, dynamic, textual, and historical.
Static feature location techniques [15,16,17] use static analysis to locate features and their dependencies in the source code. The results present detailed information such as variable, class, and method names and the relations between them. The main advantage of these approaches is that they do not require executing the system in order to collect the information. However, they require access to the source code. Moreover, static analysis generates a set of candidate features that depends on the source code, so the results involve a lot of noise (e.g., variable names that do not represent features).
Dynamic feature location techniques [18,19,20] use dynamic analysis to locate features during runtime. As input, these techniques require a set of features, which has to be mapped to source code elements of the system (i.e., variables, methods, classes). As a result, a dependency graph among the given features is generated. The main advantage of these techniques is that they show the parts of the code called at execution time. However, dynamic feature location techniques rely on a user-predefined initial feature set, so they cannot generate a complete feature set beforehand.
Textual feature location techniques [21,22,23,24,25] examine the textual parts of the code to locate features. As input, these techniques require a query with feature descriptions to be defined. The method then uses information retrieval and language processing techniques to check variables, classes, method names, and comments in order to locate them. The main advantage of these techniques is that they map features to code. However, like dynamic feature location techniques, they require a predefined feature set with descriptions.
Historical feature location techniques [26,27] use information from software repositories to locate features. The idea is to extract features from commit comments and then associate them with the lines that were changed in the respective code commit. The main advantage of these techniques is that they can map features to a very low granularity of the source code, that is, to exact lines. However, as with the dynamic and textual approaches, this technique cannot determine a complete feature set in an automated manner. In the next section, we present our approach to address this issue.
This definition makes it clear that 1) features are identified based on events triggered by users, and 2) they realise functional requirements. For example, a case where a user has to enter his email and password and press the login button in order to log in is different from the case where the system remembers his credentials and he just has to press the login button in order to log in. In the first scenario, three features are identified, while in the second only one is identified, even though the final state is the same (that is, the user is logged into the system).
These two scenarios might be interpreted differently by different people, depending on the abstraction level at which they perceive what a feature is. In this study, we focus on features at the lowest granularity level in order to be able to identify them automatically. Our approach then allows decision makers to group similar features to represent them at higher granularity levels (see [38] for details).
Moreover, according to the definition, non-functional requirements (such as performance requirements) are not features, but they might affect how features are implemented. In addition, some of them (such as usability requirements) might evolve into functional requirements during implementation, but these would then be identified based on the events triggered by users.
In this study, we extended the definition to include features that are triggered not only by users, but by other systems as well (e.g., web services), since in some cases they can also be considered as users. We used this version of the definition when conducting our case study to evaluate the performance of the tool in automatically identifying features against manual identification.
Table 1. Mouse events

Event        Description
onclick      Event occurs when the user clicks on an element
ondblclick   Event occurs when the user double-clicks on an element
onmousedown  Event occurs when a user presses a mouse button over an element
onmousemove  Event occurs when the pointer is moving while it is over an element
onmouseover  Event occurs when the pointer is moved onto an element
onmouseout   Event occurs when a user moves the mouse pointer out of an element
onmouseup    Event occurs when a user releases a mouse button over an element

Table 2. Keyboard events

Event        Description
onkeydown    Event occurs when the user is pressing a key
onkeypress   Event occurs when the user presses a key
onkeyup      Event occurs when the user releases a key

Table 3. Frame/object events

Event        Description
onabort      Event occurs when an image is stopped from loading before completely loaded (for object)
onerror      Event occurs when an image does not load properly
onload       Event occurs when a document, frameset, or object has been loaded
onresize     Event occurs when a document view is resized
onscroll     Event occurs when a document view is scrolled
onunload     Event occurs once a page has unloaded (for body and frameset)

Table 4. Form events

Event        Description
onblur       Event occurs when a form element loses focus
onchange     Event occurs when the content of a form element, the selection, or the checked state have changed (for input, select, and textarea)
onfocus      Event occurs when an element gets focus (for label, input, select, textarea, and button)
onreset      Event occurs when a form is reset
onselect     Event occurs when a user selects some text (for input and textarea)
onsubmit     Event occurs when a form is submitted
Using the tool, the events of interest are selected from these categories: a mouse event (see Table 1), a keyboard event (see Table 2), a frame/object event (see Table 3), or a form event (see Table 4). Then, a custom JavaScript library is generated, which contains the selected event set.
There are two ways to use this tool when identifying the set of features for a web application of interest: 1) put the aforementioned library directly on the website where it is hosted; 2) download the website and include the library in the downloaded version of it. Obviously, the latter approach is able to identify features only in the downloaded pages. It does not work on pages that are dynamically generated or cannot be downloaded. Nevertheless, in a normal use-case scenario, companies have full access to their websites and can apply the first method.
Finally, the following information is presented as the result:
1. the full path to an element,
2. the title attribute of the element,
3. the name attribute of the element,
4. the id attribute of the element,
5. the value attribute of the element,
6. the text field of the element.
The full path to the element helps to find it in the DOM. We assumed that the title, name, id, and value attributes and the text field, if present, usually contain human-understandable information that describes the feature. We test this assumption in our case study, which we present in the next section.
4 Case Study
We conducted a multiple-case study according to [40] to evaluate whether our
tool performs well in identifying the features in web applications with respect to
manual identification.
We selected three web applications for the case study: Google, BBC, and Youtube. The selected websites cover a wide range of daily used applications, having search, news, and video streaming features. One major reason why we chose these websites was that they are used in different contexts, and thereby we could test our tool's ability to identify a variety of features developed for different contexts. Another reason was that most readers would be familiar with these applications, as they are widely used (according to the Alexa traffic rating [41], they are among the top 10 most popular websites in Great Britain). Finally, as these websites are not customised based on user demographics, the set of features would be the same for all participants of the case study and hence the results could be compared.
In addition to the first author of this paper, 9 subjects, who are frequent users of the case applications, participated in this study. This is a convenience sample, where the participants have varying backgrounds (e.g., medicine, design, computer science) and sufficient knowledge and experience in using web applications.
Before the case study, we downloaded the main web pages of the case websites to make sure that all participants would have the same version of each website. After that, we distributed the downloaded pages as executable systems to the participants, introduced our formal feature definition to them, and then asked them to manually identify the features. The participants were given as much time as they needed to complete the task.
As our purpose in this case study was to manually identify all the features of the selected applications in order to compare them to the results of our automated tool, we needed a correct and complete set of manually identified features. Therefore, once the set of features identified by the participants converged to one final set, we considered this the complete set and stopped asking more people to participate in the case study. In parallel, the first author of this paper also identified the set of features in the same way as the participants, to cross-check the features identified by the participants.
Later, we asked the participants to write down the information about the features in a text file. Finally, we used our tool to identify the features from the downloaded web pages. The results from the tool were printed to the console of a web browser.
To evaluate the performance of our tool, we used two operational measures: precision and recall. Precision indicates whether the tool collects elements that are not features, and recall indicates whether the tool determines the complete feature set.
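For clarity, the two measures reduce to a simple set comparison against the complete feature set, as in the following sketch with hypothetical feature identifiers.

def precision_recall(identified, reference):
    # precision: fraction of identified elements that are real features;
    # recall: fraction of the complete feature set that was identified.
    identified, reference = set(identified), set(reference)
    hits = identified & reference
    precision = len(hits) / len(identified) if identified else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical example: 30 of 33 visible features found, nothing spurious.
reference = {f"feature_{i}" for i in range(33)}
found = {f"feature_{i}" for i in range(30)}
print(precision_recall(found, reference))  # -> (1.0, 0.909...)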
As a pre-analysis, we verified with all the participants each feature that they had identified and recorded. We excluded the data provided by two of the participants, as they provided outputs that were too generalized (e.g., "menu widget features"). They mentioned that the task was too time-consuming; therefore, they could not provide a complete feature set.
To compute precision and recall, we compared the features identified by the participants and by the tool to the complete feature set. In addition, we compared the data collected by the tool with the feature descriptions provided by the participants.
Then, we held informal interviews with the participants to receive feedback on the features that were identified by them but not by the tool, and vice versa.
The results showed that there were both visible and hidden features in the applications of the case study. Visible features are the ones that participants were expected to detect manually through a web browser, without using other tools or looking at the code of a given website. Google had 33, Youtube had 80, and BBC had 96 visible features in total.
The analysis of the results showed that the precision was 100% for both manual and automatic identification of visible features. This means that both the tool
and the participants could identify the elements that correspond to features.
Figure 1 presents the results of the recall measure.
The recall measure to automatically identify visible features by the tool was
100% for the Google website, 91% for the Youtube website, and 100% for the
BBC website. When we investigated why our tool could not identify all the
features on the Youtube website, we found that this was due to the Flash
technology used on that website. This showed that our tool has some technological
limitations when identifying features.
When we compared the performance of our tool with the manual identification
of the features, we saw that our tool outperforms the participants in most of the
cases. We asked the subjects why they failed to identify all the visible features.
The interviews revealed the following reasons:
– Missed. Some of the features were simply missed by the participants due
to carelessness during manual identification. The great majority of such
features were not instantly visible on the main page, as they were placed in
drop-down or pull-down lists.
– Redundant. The participants reported that some features had the same
functionality as ones they had already identified, so they did not add these
features to the list. For example, the link on a commenter’s icon and the
link on the commenter’s name in Youtube lead to the same page, so some
participants identified only one feature instead of two in this case.
Finally, we compared the representation of the results provided by our tool and
the results provided by the participants. We noticed that most of the participants
used the name of a feature from the website to describe its functionality. After
analyzing the tool results, we found that this information was stored in the text,
value, or title attributes. The remaining attributes (the id and name attributes)
were not useful for this task. To describe the location of a feature, the participants
indicated the main container to which the selected feature belongs. The full path
recorded by the tool, in contrast, provides even more detailed location information
than that provided by the participants.
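A minimal sketch of this kind of extraction is given below, assuming a BeautifulSoup-based traversal; the tag list, the attribute order, and the path format are our own illustrative assumptions, not the actual implementation of the tool.

    # Illustrative sketch: derive a feature label from the text, value, or title
    # attribute and record the full DOM path of the element.
    from bs4 import BeautifulSoup

    def describe_features(html):
        soup = BeautifulSoup(html, "html.parser")
        for el in soup.find_all(["a", "button", "input", "select", "textarea"]):
            label = el.get_text(strip=True) or el.get("value") or el.get("title") or "<unnamed>"
            ancestors = [p.name for p in reversed(list(el.parents))
                         if p.name not in (None, "[document]")]
            yield label, "/".join(ancestors + [el.name])  # full path, e.g. html/body/div/a

    html = '<html><body><div id="menu"><a title="Search">Search</a></div></body></html>'
    for label, path in describe_features(html):
        print(label, path)  # Search html/body/div/a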
One major result of the case study was that our tool could identify a high
number of hidden features, which could not be detected by the participants
manually. In the Google website, our tool identified 15, in the Youtube website
8, and in the BBC website 126 hidden features.
After analyzing the source code of such features, we found out that most of
these were related to personal user preferences, or the device that is used to
open the website. For example, if a user opens a website using a mobile phone
or tablet, then the feature set is adapted accordingly. In addition, there were a
few features that did not seem to add any value (such as three hidden textarea
elements on the main Google page). Those features might be important to the
developers of the website for some reason, or might just be obsolete features that
remain in the code without any specific purpose. This shows that our tool has
the potential to indicate different kinds of features that could be removed after
developers assess them.
5 Threats to Validity
calculated based on this figure. To mitigate this threat, first, one of the authors
of this study manually identified the features. Then, we observed whether the
manually identified features converged to one final set as we received input from
more participants. If participants identified new features that were not present
in the initial complete feature set, we extended it with those features. We stopped
involving further subjects in manual feature identification when we saw that
adding more participants did not yield any new information. Therefore, we
believe that this set should reflect the complete set.
Internal Validity. Internal validity concerns the causality relation between the
treatment and the outcome, and whether the results do follow from the data. In
this study, we evaluated the performance of our method and tool in comparison
to manual identification of features. One validity threat could have been that the
participants did not have enough knowledge about the selected websites to perform
the task well. To mitigate this threat, we chose participants who were existing
users of the selected websites. Furthermore, we purposefully chose case
applications that are frequently used. In this way, we believe that the subjects had
sufficient knowledge about the features of the websites. In fact, all participants
showed similar performance for all three cases.
Reliability. Reliability reflects to what extent the data and the analysis depend
on the specific researchers. We used two objective operational measures in this
study: precision and recall. Therefore, we do not see any validity threat in inter-
preting them. However, one validity threat could have been the interpretation of
the feature set descriptions provided by the participants. To mitigate this threat,
we verified with each participant what we understood from their inputs.
References
1. Elliott, B.: Anything is possible: Managing feature creep in an innovation rich
environment. In: Engineering Management Conference, pp. 304–307. IEEE (2007)
2. Davis, F.D., Venkatesh, V.: Toward preprototype user acceptance testing of new
information systems: implications for software project management. IEEE Trans-
actions on Engineering Management 51(1) (2004)
3. Ebert, C., Dumke, R.: Software Measurement. Springer (2007)
4. Rust, R.T., Thompson, D.V., Hamilton, R.W.: Defeating feature fatigue. Harvard
Business Review 84(2), 98–107 (2006)
5. Xu, G., Mitchell, N., Arnold, M., Rountev, A., Sevitsky, G.: Software bloat analysis:
Finding, removing, and preventing performance problems in modern large-scale
object-oriented applications. In: Proceedings of the FSE/SDP Workshop on Future
of Software Engineering Research (2010)
6. Ebert, C., Abrahamsson, P., Oza, N.: Lean Software Development. IEEE Software,
22–25 (October 2012)
7. Ries, E.: The Lean Startup: How Today’s Entrepreneurs Use Continuous Inno-
vation to Create Radically Successful Businesses. Journal of Product Innovation
Management (2011)
8. Marciuska, S., Gencel, C., Abrahamsson, P.: Exploring How Feature Usage Relates
to Customer Perceived Value: A Case Study in a Startup Company. In: Herzwurm,
G., Margaria, T. (eds.) ICSOB 2013. LNBIP, vol. 150, pp. 166–177. Springer,
Heidelberg (2013)
9. Dit, B., Revelle, M., Gethers, M., Poshyvanyk, D.: Feature Location in Source
Code: A Taxonomy and Survey. Journal of Software Maintenance and Evolution:
Research and Practice (2011)
10. Atterer, R., Wnuk, M., Schmidt, A.: Knowing the user’s every move: user activity
tracking for website usability evaluation and implicit interaction. In: Proceedings
of the International Conference on World Wide Web, p. 203 (2006)
11. Microsoft Spy++, http://msdn.microsoft.com/en-us/library/
aa264396(v=vs.60).aspx (last visited on March 31, 2013)
12. OpenSpan Desktop Analytics, http://www.openspan.com/products/
desktop analytics (last visited on March 31, 2013)
13. Google Analytics, http://www.google.com/analytics (last visited on March 31,
2013)
14. Manning, C.D., Raghavan, P., Schutze, H.: Introduction to information retrieval.
Cambridge University Press, Cambridge (2008)
15. Chen, K., Rajlich, V.: Case Study of Feature Location Using Dependence Graph.
In: Proceedings of 8th IEEE International Workshop on Program Comprehension,
pp. 241–249 (2000)
16. Robillard, M.P., Murphy, G.C.: Concern Graphs: Finding and describing concerns
using structural program dependencies. In: Proceedings of International Conference
on Software Engineering, pp. 406–416 (2002)
17. Trifu, M.: Using Dataflow Information for Concern Identification in Object-
Oriented Software Systems. In: Proceedings of European Conference on Software
Maintenance and Reengineering, pp. 193–202 (2008)
18. Eisenberg, A.D., De Volder, K.: Dynamic Feature Traces: Finding Features in Un-
familiar Code. In: Proceedings of 21st IEEE International Conference on Software
Maintenance, Budapest, Hungary, pp. 337–346 (2005)
19. Bohnet, J., Voigt, S., Dollner, J.: Locating and Understanding Features of Complex
Software Systems by Synchronizing Time, Collaboration and Code-Focused Views
on Execution Traces. In: Proceedings of 16th IEEE International Conference on
Program Comprehension, pp. 268–271 (2008)
20. Edwards, D., Wilde, N., Simmons, S., Golden, E.: Instrumenting Time-Sensitive
Software for Feature Location. In: Proceedings of International Conference on Pro-
gram Comprehension, pp. 130–137 (2009)
21. Petrenko, M., Rajlich, V., Vanciu, R.: Partial Domain Comprehension in Software
Evolution and Maintenance. In: International Conference on Program Comprehen-
sion (2008)
22. Marcus, A., Sergeyev, A., Rajlich, V., Maletic, J.: An Information Retrieval Ap-
proach to Concept Location in Source Code. In: Proceedings of 11th IEEE Working
Conference on Reverse Engineering, pp. 214–223 (2004)
23. Grant, S., Cordy, J.R., Skillicorn, D.B.: Automated Concept Location Using In-
dependent Component Analysis. In: Proceedings of 15th Working Conference on
Reverse Engineering, pp. 138–142 (2008)
24. Hill, E., Pollock, L., Shanker, K.V.: Automatically Capturing Source Code Con-
text of NL-Queries for Software Maintenance and Reuse. In: Proceedings of 31st
IEEE/ACM International Conference on Software Engineering (2009)
25. Poshyvanyk, D., Marcus, A.: Combining formal concept analysis with information
retrieval for concept location in source code. In: Program Comprehension, pp. 37–48
(2007)
26. Chen, A., Chou, E., Wong, J., Yao, A.Y., Zhang, Q., Zhang, S., Michail, A.:
CVSSearch: searching through source code using CVS comments. In: Proceedings
of IEEE International Conference on Software Maintenance, pp. 364–373 (2001)
27. Ratanotayanon, S., Choi, H.J., Sim, S.E.: Using Transitive changesets to Support
Feature Location. In: Proceedings of 25th IEEE/ACM International Conference
on Automated Software Engineering, pp. 341–344 (2010)
28. Classen, A., Heymans, P., Schobbens, P.-Y.: What’s in a feature: A requirements
engineering perspective. In: Fiadeiro, J.L., Inverardi, P. (eds.) FASE 2008. LNCS,
vol. 4961, pp. 16–30. Springer, Heidelberg (2008)
29. Riebisch, M.: Towards a more precise definition of feature models. In: Modelling
Variability for Object-Oriented Product Lines, pp. 64–76 (2003)
30. Kang, K.C., Cohen, S.G., Hess, J.A., Novak, W.E., Peterson, A.S.: Feature-oriented
domain analysis (FODA) feasibility study, Carnegie-Mellon University, Pittsburgh,
Software Engineering Institute (1990)
31. Kang, K.C.: Feature-oriented development of applications for a domain. In: Pro-
ceedings of Software Reuse, pp. 354–355. IEEE (1998)
32. Bosch, J.: Design and use of software architectures: adopting and evolving a
product-line approach. Addison-Wesley Professional (2000)
33. Chen, K., Zhang, W., Zhao, H., Mei, H.: An approach to constructing feature mod-
els based on requirements clustering. In: Proceedings of Requirements Engineering,
pp. 31–40. IEEE (2005)
34. Batory, D.: Feature modularity for product-lines. Tutorial at: OOPSLA, 6 (2006)
35. Batory, D., Benavides, D., Ruiz-Cortes, A.: Automated analysis of feature models:
challenges ahead. Communications of the ACM 49(12), 45–47 (2006)
36. Apel, S., Lengauer, C., Batory, D., Moller, B., Kastner, C.: An algebra for feature-
oriented software development, Number MIP-0706. University of Passau (2007)
37. Eisenbarth, T., Koschke, R., Simon, D.: Locating Features in Source Code. IEEE
Computer 29(3), 210 (2003)
38. Marciuska, S., Gencel, C., Wang, X., Abrahamsson, P.: Feature usage diagram for
feature reduction. In: Baumeister, H., Weber, B. (eds.) XP 2013. LNBIP, vol. 149,
pp. 223–237. Springer, Heidelberg (2013)
39. HTML Reference, http://www.w3schools.com/tags/default.asp (last visited on
March 31, 2013)
40. Yin, R.K.: Case study research: Design and methods. SAGE Publications, Incor-
porated (2002)
41. Alexa traffic rating, http://www.alexa.com/topsites/countries/GB (last visited
on March 31, 2013)
42. Runeson, P., Host, M.: Guidelines for conducting and reporting case study research
in software engineering. In: Empirical Software Engineering, pp. 131–164 (2009)
Value-Based Migration of Legacy Data Structures
1 Introduction
One of the largest IT challenges that many organizations – especially large enterprises
in data-intensive domains such as the finance, insurance and healthcare sectors – are
currently facing is not the design and implementation of new systems, but the
continuous adaption and migration of their legacy application and data landscape to
more modern and flexible platforms and systems. Obviously, changing parts of a
complex integrated application landscape that supports essential business processes
and holds large amounts of customer data is a delicate endeavor in any domain. It is
exacerbated by the fact that such landscapes typically evolve more organically than
structurally over time, often incorporating technical solutions for business
requirements that work very efficiently in their specific context, but are tedious to
maintain and hard to generalize or transfer to other contexts. Also, systems tend to
stay with a company much longer than the engineers who built them, so the
knowledge about system components and data structures, their relationships and their
design assumptions, may erode over time if not carefully maintained.
The consideration of these factors obviously makes system migration projects more
complex than first meets the eye – however, austerity policies may limit the budget
that is available for migration projects in many organizations, especially since such
projects usually seem to add little business value (in terms of new products, new
customers etc.), but just provide an incremental efficiency improvement that might
not even be measurable in the short run.
In such a situation, it is important to focus the project effort on the critical aspects
by understanding which legacy components and data structures are supporting the
core business processes, which are of peripheral relevance, and which ones can even
safely be ignored because they are no longer required in the new system landscape.
Given this understanding, the team’s resources can be employed more effectively, and
the migration thus carried out more diligently, than if the interdependencies and
priorities of the various legacy structures remained unclear.
In this paper, we present our experiences from an ongoing migration project at a
large bank, in which a value-based approach was followed to obtain an overview of
the dependencies between business processes, system components and data structures,
to identify patterns in their dependencies, and to define migration strategies tailored to
those patterns. After an overview of the initial situation that led to the migration
project (Sect. 2), we present the steps taken in this migration process in Sect. 3, and
discuss lessons learned from the project in Sect. 4. We close with an overview of
related approaches and concluding observations in Sect. 5 and 6.
2 Project Background
In the past, standard IT solutions were not available in the banking domain, so the
system landscape of our project partner comprised a large set of heterogeneous self-
developed applications for every department, and even for single products. Initially,
the IT landscape had been completely based on an IBM mainframe COBOL
technology stack using sequential and indexed files, as well as hierarchical and
relational databases. Application components usually communicated through complex
sequential files. To identify and relate records of all kinds in these data structures, a
large set of so-called keys was defined and used bank-wide by nearly all applications.
A key can be a simple attribute value identifying a data entity (e.g. an account number
identifying an account) – we call this a simple key. However, a key can also be
derived from a combination of values or value ranges of several attributes (e.g. a
combination of department ID, service group ID and account number range that
identify a certain class of customers) – we call these complex keys.
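As a minimal illustration of this distinction (with hypothetical field names and value ranges standing in for the sanitized bank data), a simple key is a single attribute value, whereas a complex key is a predicate over several attributes:

    # Illustrative only: hypothetical record fields and ranges, not the bank's actual keys.
    def simple_key(record):
        return record["account_number"]          # directly identifies an account

    def complex_key(record):
        # identifies a certain class of customers via department, service group
        # and an account number range
        return (record["department_id"] == "D42"
                and record["service_group_id"] == "SG7"
                and 100000 <= record["account_number"] <= 199999)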
Despite continuous evolution (e.g. the introduction of a client-server environment
with message protocols and web services for some applications), the bank never
changed their overall technology landscape. One reason is that especially in the last
ten years, mostly mandatory or regulatory requirements had to be implemented (e.g.
the Euro introduction, Y2K, new tax regulations, etc.), so there was no opportunity
for a comprehensive change of the landscape, which therefore remained quite
heterogeneous in terms of software technology, database techniques and integration
patterns.
Recently, the bank decided to introduce a new back-end system that is based on a
standard banking platform by a third-party supplier. From the beginning of the
project, it was known that this product would not be able to cover the full spectrum of
the bank’s existing services, so those applications whose functionality was not
covered by the new back-end would have to remain active in the new environment.
The Investments department was one of the areas where the new banking software
could not replace existing applications, so interfaces between them and the new
system had to be developed.
One of the biggest challenges in the introduction of the new banking system was
the deprecation of many legacy keys used by applications throughout the bank to
identify all kinds of data records. Their replacement with a much leaner and cleaner
set of keys was necessary because over the course of several decades of evolution,
most of the key fields originally introduced with the old COBOL components had
become semantically overloaded by employing them for different purposes in other
applications as well. In other cases, the lack of suitable dedicated keys had led to the
use of various content fields as de-facto key fields (for example, some account
number ranges were reserved for specific types of customers, because an explicit key
field to designate the customer type was not available). This way, many data fields
and thus the corresponding keys had become overloaded with semantics over the
course of the application landscape’s evolution, and the overview of which data fields
and keys were used for which purpose had become increasingly hard to maintain – a
deterioration as forecast by Lehman already in 1980 [1].
With the introduction of the new banking system, the bank therefore aimed to
replace most of the overloaded legacy keys with a more stringent set of just a few
clearly defined keys. This made it necessary to analyze the impact of the planned key
transformation on the Investments department’s landscape of about 50 applications,
and to develop suitable migration strategies for those applications and the keys they
depended on. Since these changes would affect production applications, some of
which were older than 30 years, the guiding principles of the migration project were:
• Keeping the impact on the legacy applications as small as possible, up to the point
of designating “untouchable” applications that were not to be changed at all, so as
not to introduce new bugs.
• Minimizing the number of legacy keys that had to be included in the new banking
system’s set of keys, so as not to “taint” it with legacy data structures, and to avoid
long-term maintenance responsibilities for those keys in the new system.
• Finding migration strategies that were internal to the Investments department,
wherever possible; and in particular, not adding any Investments-specific keys to
the globally used banking system.
Since the size of the legacy application landscape made it infeasible to analyze all
applications in a straightforward way, it was decided to follow the concepts of value-
based software engineering [2] and the Pareto principle, i.e. to develop solutions for a
small set of most critical applications first – thereby focusing resources on where they
were most urgently needed, and designing solutions around those aspects that allowed
the least compromises. It was hoped for (and is actually being confirmed now in the
ongoing migration project) that the solutions found in this pilot phase would later also
be applicable to the rest of the application landscape, which is less time-critical and
more open to adaption.
To identify the most critical applications, to analyze the impact of the key
transformations on them, and to conceive suitable adaption strategies, we undertook the
following steps:
1. Analysis of process value: Since the criticality of applications depends to a large
degree on the importance of the business processes they support, our analysis
began with an assessment of the value drivers of the Investments department’s 86
business processes (as described in Sect. 3.1).
2. Analysis of key exposure: The second factor in determining applications’
migration criticality is their exposure to the legacy keys, which was analyzed in the
second step (Sect. 3.2).
3. Identification of most crucial processes: Based on the processes’ value and key
exposure, we narrowed the process landscape down to a small set of the most
relevant processes, which were to be tackled in the pilot phase of the migration
project (Sect. 3.3).
4. Detailed analysis of key usage: For all the relevant business processes identified
in the previous step, we analyzed the software applications and components that
implemented them, and examined on the source code level how they depended on
the legacy keys (Sect. 3.4).
5. Analysis of application components’ necessity: The deprecation of the legacy
keys begged the question whether the application functions that they had supported
would still be needed in the future, or if they could also be replaced by
workarounds, in order to simplify the migration (Sect. 3.5).
6. Derivation of adaption strategies: Looking at all the individual instances of key
usage we had charted in the previous two steps, we identified a set of recurring key
usage patterns, and derived strategies for the migration of the legacy keys to the
new key set (Sect. 3.6).
7. Adaption of application components: For adapting the application components to
the new key structures, architectural solutions had to be found that had as little
impact as possible on the existing logic (Sect. 3.7).
Note that the examples presented to illustrate these steps in the following sections
have been sanitized to protect the bank’s IT landscape and business processes. As
such, they represent only a small fraction of the surveyed applications’ and data
structures’ actual complexity.
Fig. 1. Annotated Order Routing business process
and technical stakeholders of the bank. For this purpose, the high-level “layer 0”
processes were broken down into more fine-grained “layer 1” sub-processes, as
shown in Fig. 1 using the example of the Order Routing process. Annotations could
be placed on either layer – an annotation on layer 0 meant that the respective
value or effort driver applied to the complete process, while an annotation on layer 1
indicated that a value or effort driver had a specific impact on a particular sub-process.
The results of the process annotation were documented in a process value table
(excerpted in Table 1), which ranks the layer 0 processes by the number of different
types of value and effort annotations (#A) associated with them and their sub-
processes. This approach of just counting the annotation types may seem quite
simplistic at first sight, as one might consider calculating a more precise ranking by
assigning different weights to the annotations. However, we found in our discussions
that the annotations would have to be weighted differently depending on the individual
process in question. Instead, we found counting the annotations to be a pragmatic
heuristic for judging the processes’ relative importance, which would still be refined
in the following steps. Also, while not reflected in the raw numbers in the table, the
discussions sparked between stakeholders in the course of the annotation process
provided valuable insights into technical or business particularities of the individual
processes that would have to be considered during their adaption.
Table 1. Process value table (excerpt): layer 0 processes ranked by the number of
different value and effort annotation types (#A)

  Process                        #A
  Settlement                     12
  Order Routing                  10
  Securities Account Pricing     10
  Transaction Pricing            10
  Asset Services                  9
  Investments Proposal            8
  Order Execution                 8
  Securities Account Statement    5
  …                               …
that were about to be phased out. We therefore also analyzed all of the business
processes with regard to what data they processed, and which key fields they
depended on for that purpose. The results were documented in a key exposure table
(excerpted in Table 2), which ranks all processes by the number of keys (#K) they
depend on. While we did not yet look in detail into the applications implementing the
processes in this step, it seemed natural to assume (and was later confirmed) that
the complexity of the applications would correlate with the processes’ key exposure –
the more keys a process depends on, the more difficult it would be to adapt the
respective application, and thus the more important to find solutions for it in the pilot
phase already.
One might wonder why we did not examine the applications directly, but instead
took the “detour” via the business processes to assess value and key exposure. The
reason is that examining the implementation of 50 applications (some of them several
decades old) in detail would require prohibitive effort, while examining the general
business processes, whose requirements and relationships the team members were
familiar with, was a significant but still feasible undertaking. Of course, we eventually
did look at application implementations (see Sect. 3.4), but only after we identified
the most crucial ones that merited closer scrutiny in the pilot phase.
Based on the ranking of business processes by value and effort annotations and key
exposure, we could now identify those most crucial processes which should be
addressed in the migration project’s pilot phase. For this purpose, we plotted the
business processes in a coordinate system of number of annotations and key exposure,
to obtain the business process criticality matrix (Fig. 2). This matrix enabled us to
focus our efforts on those processes that were deemed to be most important for the
business, and at the same time pose the most complex technical challenges, i.e. those
processes that can be found in the top right area of the matrix.
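A minimal sketch of this selection step is shown below; the key exposure values and the thresholds are illustrative assumptions, and only the annotation counts follow Table 1.

    # Illustrative sketch: select pilot-phase candidates from the top right of the
    # criticality matrix, i.e. processes above both thresholds (values are assumed).
    processes = {
        "Settlement":     {"annotations": 12, "keys": 18},
        "Order Routing":  {"annotations": 10, "keys": 15},
        "Asset Services": {"annotations": 9,  "keys": 4},
    }
    A_MIN, K_MIN = 9, 10  # hypothetical thresholds separating the top right quadrant
    pilot = [name for name, p in processes.items()
             if p["annotations"] >= A_MIN and p["keys"] >= K_MIN]
    print(pilot)  # ['Settlement', 'Order Routing']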
application function, and how the loss of functionality could be compensated for:
Often, an application function would be essential, so the information carried by the
legacy key also had to be included in the new key set in some way, and the
application component had to be adapted to use the new key. In some cases, however,
it also seemed possible to construct workarounds for the functionality, which made
the continued dependence on the key obsolete, or even retire the functionality in
question altogether. In the key usage diagrams, we illustrated these impacts by
shading the application component boxes as specified in Table 3.
In Fig. 3, for example, we see that the Statements for empty accounts – a business-
wise relatively simple function – depends on a combination of eight keys, so its
adaption would be quite complex. However, since this functionality is required by
national legislation, the component is indispensable, as the white shading indicates.
Adding this information to the key usage diagram provides important background
information for the choice of adaption strategy, as the following steps show.
The second pattern applies to all components shaded gray in the key usage
diagrams, i.e. those where a technical or operational workaround can be found to
eliminate the need for the key. In this case, the component needs to be adapted to
include the workaround – e.g. deriving the needed criterion from other context
information, or restructuring the process so the key is not needed anymore. This was
e.g. the strategy chosen for the Sorting of print jobs component, where a simple
change in the manual process achieved the same effect as the automated solution,
without having to keep the legacy key.
Patterns 3 to 5 apply to all components shaded white in the key usage diagrams, i.e.
indispensable application functions that remain dependent on the information in the
legacy keys. To maintain their functionality, the legacy key must be migrated to the
new system’s key set, and the component must be adapted to access it there.
Depending on the structure of the key, we need to distinguish different patterns
though:
Pattern 3 applies to application components that require a simple, atomic key, such
as an account number or employee ID. These kinds of keys can be easily transferred
to the new banking system’s set of keys, and require only minimal changes to the
component’s implementation, since the key information is just retrieved from a
different system, but processed in the same way.
Pattern 4 in contrast applies to components that derive certain information from a
combination of various keys. For example, in the generation of securities account
statements, a specific rule must be applied to a certain group of customers. This group
of customers was identified by a certain range of account numbers in combination
with the department ID and the service group ID associated with these customers. The
combination of all these digits enabled a simple binary decision on whether this
customer was part of the particular group or not. In the course of the migration, our
strategy was to compute the derived customer group characteristic beforehand, and
store this binary value as a simple key in the new system.
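A minimal sketch of this pre-computation (with hypothetical field names and ranges) shows how the derived characteristic is evaluated once during migration and then stored as a simple binary key in the new system; the application component afterwards only reads the pre-computed flag instead of re-deriving it from several legacy fields.

    # Illustrative sketch of Pattern 4: compute the customer-group membership once
    # from the legacy keys and persist it as a simple key in the new system.
    def in_statement_rule_group(record):
        return (record["department_id"] == "D42"
                and record["service_group_id"] == "SG7"
                and 300000 <= record["account_number"] <= 349999)

    def migrate_customer(legacy_record, new_key_store):
        # new_key_store stands for the new banking system's key set (hypothetical interface)
        new_key_store[legacy_record["customer_id"]] = {
            "special_statement_rule": in_statement_rule_group(legacy_record)
        }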
For many application components, a key change like this also promises a
simplification of the historically evolved control logic that could be replaced with
much more compact logic relying just on the one new key. In case of the Securities
Account Statement component, however, we were dealing with an “untouchable”,
which required special precautions, as described in Sect. 3.7.
Pattern 5 finally applies to semantically “overloaded” key fields that may serve
multiple purposes in different contexts – for example, an account number that has the
overlaid semantic of customer type (represented by certain account number intervals), or a
string identifier in which e.g. characters 1-5 and 6-10 denote independent characteristics of
a particular record. In these cases, we strive to decouple the independent aspects that were
conjoined in the legacy key, and express them through distinct keys in the new system.
Again, this obviously also requires significant changes in the application components
working with these keys, as discussed in Sect. 3.7.
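A minimal sketch of such a decomposition, assuming a ten-character identifier whose halves encode independent characteristics (the positions and key names are hypothetical):

    # Illustrative sketch of Pattern 5: split a semantically overloaded identifier
    # into two distinct keys for the new system.
    def split_overloaded_key(legacy_id):
        return {
            "product_code": legacy_id[0:5],   # characters 1-5
            "branch_code":  legacy_id[5:10],  # characters 6-10
        }

    print(split_overloaded_key("PRD01BRN07"))
    # {'product_code': 'PRD01', 'branch_code': 'BRN07'}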
The adaption strategies that we postulated for each of the key usage patterns were
validated by applying them to the application components implementing the top
processes we had selected for the pilot phase, where they were iteratively improved in
detail and generally found to match well. While we could not expect that these
solutions would be immediately transferable to the remaining components, we now
had a reduced solution space at our disposal that might only require minor tweaking
to be applicable to the remaining large set of processes.
4 Discussion
Beginning the migration project with the pilot phase described here enabled us to
develop solutions for a small set of processes that delivered the most business value
and addressed the most critical technical challenges first. The approach not only
ensured that the Investments department’s resources and attention were focused on
the most crucial processes, instead of being scattered over the whole process
landscape, but also enabled them to develop and validate solution strategies that could
then be applied to the rest of the process landscape (comprising about 80 further
processes) with less effort than if they had to be developed from scratch for each
process without prioritization.
The solution strategies developed in the course of the pilot phase were documented in
the form of a migration process (Fig. 5) that is now being applied to every application
component in the Investments department’s system landscape. According to this
decision process, business analysts and engineers first examine whether the
component can be retired as well when the keys it depends on are phased out; if an
operational workaround for the legacy keys can be found; or if the application is
indispensable and thus must be adapted.
In the latter case, we distinguish between simple, atomic keys (which can simply
be added to the new system’s key set), and complex keys that are derived from
multiple keys or extracted from a particular segment or value range of a key. For
these, we construct corresponding atomic keys in the key set (the process model in Fig. 5
places these steps in concurrent branches to indicate that particularly complex legacy
key structures might require both processing steps).
After adding the new keys to the new banking system, we can adapt the application
component directly to employ the new keys – or, if the application is “untouchable”
or the keys have a high reach, i.e. are used by a large number of applications, we
implement an adaptor in the transformation layer to perform the translation between
the new keys and the application’s legacy interface.
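A minimal sketch of such an adaptor (all names and the key store interface are hypothetical) illustrates the idea: the transformation layer reassembles the legacy key format from the new keys, so the untouchable component's interface remains unchanged.

    # Illustrative sketch: adaptor in the transformation layer between the new key
    # set and a legacy component that still expects the old overloaded key format.
    class LegacyKeyAdaptor:
        def __init__(self, new_key_store):
            self.new_key_store = new_key_store  # hypothetical access to the new system's keys

        def legacy_key_for(self, customer_id):
            keys = self.new_key_store[customer_id]
            # recreate the overloaded legacy format expected by the old component
            return keys["product_code"] + keys["branch_code"]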
During the ongoing migration of application components, the bank found the set of
solution patterns identified in the pilot phase to be comprehensive, since no other
paths than those described in the new migration process had to be pursued for any
component until now.
Our approach for value-based piloting of a large legacy system’s migration was
applied here in the course of an actual industry project that examined production-
critical processes and systems, which obviously did not give us much latitude in terms
of controlled experiment design. Specifically, we had no control over the staffing of
the project and thus the background and experience of the team members. Instead, we
carried out action research, i.e. we served as advisors to the bank’s business and IT
analysts, rather than purely observing their execution of the method.
While our involvement (in terms of methodological guidance) can likely not be dismissed
as one contributing factor to the project’s success, the key success factor was certainly
the team members’ understanding, adoption and integration of the techniques with their
analytical experience and domain knowledge. In particular, the continuing migration of
the department’s further processes according to the established strategies, and the
willingness to adopt this method for future large-scale projects as well (without our
ongoing or future participation) underscores the applicability and benefit of the approach
in practice. Still, given the nature of the study, our findings should be treated as anecdotal
rather than generalizable evidence for the effectiveness of the method.
Fig. 5. The bank’s newly established migration process for future key and component adaptions,
derived from our experiences in the pilot phase
5 Related Work
There is a large body of literature on the topic of data and system migration, or
software maintenance and evolution in general – Brodie and Stonebraker [4], Bennett
[5] and Bisbal et al. [6], among others, described the field early already. The project
we presented here is another confirmation of Brodie and Stonebraker’s statement that
“the fundamental value of a legacy information system is buried in its data, not in its
functions or code” [4]. Many discussions of system and data migration tend to be
highly technically focused, reflecting the challenges of a branch of software
engineering where the devil is usually in the details. For example, Ceccato et al. [13]
report on experiences from the migration of a large banking system to Java – while
they were dealing mostly with issues on the code level, our work focuses on the
migration of the actual business data and its interlinking references.
The economic aspects of system migration were addressed early on by Sneed [7]
with a planning process that involved justifying the need for migration to sponsors,
prioritizing applications, estimating costs and juxtaposing them with benefits, before
actually contracting the migration work. Many of these aspects are reflected in our
approach – for example, the dependencies illustrated in the key usage diagrams were
used to support budget requests, and the component necessity analysis ensured that
migration efforts were only expended on components where the same benefit could
not be achieved through workarounds.
The use of a value-based approach in a migration project was described by Visaggio
[8], who extended Sneed’s approach by providing guidelines for scoring the economic
value and the technical quality of legacy components. Instead of his rather detailed
quantitative approach (as exemplified e.g. in the case study by Tilus et al. [12]), we found
it more effective to follow a pragmatic, qualitative approach, in which the stakeholders
used value and effort annotations to highlight not only how critical they believe processes
and applications to be, but also of what nature the critical aspects are – such as supporting
large numbers of transactions, affecting the company image, being exposed to particular
legal regulations, etc. Given that our focus in this project was less on justifying the
benefits of the migration than on prioritizing the migration steps and raising awareness
of the risks and complexity involved in them, we felt this heuristic approach to be more
appropriate than more quantitative ones.
Involving the stakeholders in this analysis, and actually starting on the business
process level rather than the application level, is an approach that worked well also in
other migration projects, such as the (also data-intensive) migration of a legacy
academic information system that Liem et al. [9] report on.
Bocciarelli and D’Ambrogio also employed the concept of annotations to extend
the Business Process Model and Notation (BPMN) with information on non-
functional requirements [10]. While some of our value and effort annotations (such as
reliability and security) are related to non-functional requirements, we do not need the
precise quantification that their approach provides, but rather a more diverse spectrum
of value and effort drivers beyond performance and reliability.
The technical solutions we employed – mapping keys, and using adaptors in the
transformation layer – are patterns whose application in migration projects was already
described by e.g. Thiran and Hainaut [11]. Our focus here was however less on the
technical solutions than on the process of breaking down the overall migration challenge
into problem classes (the key usage patterns) and devising solution strategies for the most
critical ones, which could then also be applied to all the others.
6 Conclusion
In this paper, we reported how a large bank dealt with a complex migration challenge
under time and budget limitations by using a value-based approach to identify the
legacy system components and data structures that drove the most critical business
processes, identifying data usage patterns in the existing components, and devising
strategies to migrate them.
This approach had two benefits: Firstly, it ensured that the team developed
solutions for the most pressing challenges first. Secondly, after implementing these
solutions, the team now has a proven set of strategies and a formal migration process
at hand that it can apply to the ongoing migration of the secondary systems that were
initially disregarded in the pilot phase. Had the team tried to work out solutions for all
systems at once, there would have been a significant risk of losing focus on the most
critical challenges amid all the peripheral aspects, and of requiring more time to come
up with less straightforward solutions.
This risk has been successfully averted by using the approach of fostering value-
driven understanding and prioritization of tasks – an approach that has met widespread
interest in other departments of the bank as well since the success of this project.
References
1. Lehman, M.M.: Programs, life-cycles, and the laws of program evolution. Proc.
IEEE 68(9), 1060–1076 (1980)
2. Boehm, B., Huang, L.: Value-based software engineering. ACM Software Engineering
Notes 28, 4–10 (2003)
3. Book, M., Grapenthin, S., Gruhn, V.: Seeing the forest and the trees: Focusing team
interaction on value and effort drivers. In: Proc. 20th Intl. Symp. Foundations of Software
Engineering (FSE-20) New Ideas Track, art. no. 30. ACM (2012)
4. Brodie, M., Stonebraker, M.: Migrating legacy systems. Morgan Kaufmann (1995)
5. Bennett, K.: Legacy systems: Coping with success. IEEE Software 12(1), 19–23 (1995)
6. Bisbal, J., Lawless, D., Wu, B., Grimson, J.: Legacy information systems: Issues and
directions. IEEE Software 6(5), 103–111 (1999)
7. Sneed, H.M.: Planning the reengineering of legacy systems. IEEE Software 12(1), 24–34
(1995)
8. Visaggio, G.: Value-based decision model for renewal processes in software maintenance.
Annals of Software Engineering 9(1-2), 215–233 (2000)
9. Liem, I., Schatten, A., Wahyudin, D.: Data integration: An experience of information system
migration. In: Proceedings of the 8th Intl. Conf. Information Integration, Web Applications and
Services (IIWAS 2006), Österreichische Computer Gesellschaft, vol. 214 (2006)
10. Bocciarelli, P., D’Ambrogio, A.: A BPMN extension for modeling non-functional
properties of business processes. In: Proc. 2011 Symposium on the Theory of Modeling &
Simulation: DEVS Integrative M&S Symposium (TMS-DEVS 2011), pp. 160–168.
Society for Computer Simulation Intl. (2011)
11. Thiran, P., Hainaut, J.L.: Wrapper development for legacy data reuse. In: Proc. 8th Working
Conference on Reverse Engineering (WCRE 2001), pp. 198–207. IEEE Computer Society
(2001)
12. Tilus, T., Koskinen, J., Ahonen, J.J., Lintinen, H., Sivula, H., Kankaanpää, I.: Software
evolution strategy evaluation: Industrial case study applying value-based decision model.
In: Proc. 9th Intl. Conf. Business Information Systems (BIS 2006). LNI, vol. P-85, pp. 543–557.
GI e.V (2006)
13. Ceccato, M., Dean, T.R., Tonella, P., Marchignoli, D.: Migrating legacy data structures
based on variable overlay to Java. Journal of Software Maintenance and Evolution:
Research and Practice 22(3), 211–237 (2010)
An Integrated Analysis and Testing Methodology
to Support Model-Based Quality Assurance
Fraunhofer IESE,
67663 Kaiserslautern, Germany
{frank.elberzhager,alla.rosbach,thomas.bauer}@fraunhofer.iese.de
1 Introduction
Defects are a disturbing but inevitable fact of today’s software. Especially defects
that are not found before the software is delivered can result in serious consequences
such as high monetary loss or even risk to human life. This is especially true in the
embedded domain. Two very prominent examples are the software defect in the
Therac-25 radiation therapy machine, which resulted in deaths and serious injuries
from overdoses between 1985 and 1987 [64], and the software bug in the Ariane 5
rocket in 1996, which led to the explosion of the rocket and a loss of about
$500 million [63].
A lot of different strategies and solutions have emerged in recent decades to improve
software before delivery. One such solution is to conduct analytic quality assurance,
for example by applying different analysis or testing techniques. A tremendous
number of approaches have been published in the last decades [53, 54]. However,
certain problems remain, such as increased effort for performing testing, which
sometimes consumes more than 50 % of the overall development effort, or unreliable
software products with a large number of defects.
2 Related Work
The figure shows a simplified quality assurance process for embedded systems that
runs in parallel to the development and construction process. Construction is
characterized by the systematic refinement of higher-level artifacts. The outcomes of
each stage of the development process are usually quality assured using appropriate
means. Informal artifacts like textual requirements and high-level architectural
descriptions are analyzed manually, for example by means of systematic inspection.
Formal artifacts such as concrete design and program code are formally verified in
order to check critical properties like deadlocks, out-of-bound signal values,
uninitialized variables, unreachable elements, and worst-case execution times.
Executable artifacts like program code and simulatable design models are
additionally tested to check their compliance with corresponding specifications.
Testing is characterized by the stimulation of the test object with input data and the
evaluation of the system response. In practice, testing is performed manually and
highly depends on the experience of the test engineers and their system knowledge. In
industry, traditional function-oriented testing techniques are used, such as
requirements-based testing, equivalence class partitioning, and boundary value
testing. Such approaches lead to inefficient quality assurance in terms of effort, test
coverage, and product quality.
Model-based test approaches have been developed in research to enable the
automated generation or selection of test cases from models. They improve the
efficiency and the degree of coverage of requirements, design artifacts, and code.
Testing on the software design level opens up a new branch in quality assurance with
model component testing and subsequent model integration testing. Available test
tools usually provide the automated execution of manually pre-defined test cases on
the target platform. A widely accepted test tool for the design level is TPT (time
partition testing) which supports several modeling environments for test execution,
such as Simulink and ASCET. The tool CTE XL is another industrially relevant
functional testing tool for the graphical modeling of test cases with equivalence class
partitioning, the so-called classification tree method. It also supports combinatorial
generation of test cases and test execution on different platforms.
Table 1. Overview of papers from the literature review [57] regarding their defect type focus
defect data, and based upon defined assumptions. Third, if knowledge about the
relationships is gathered and evaluated in the specific context, the integrated quality
assurance techniques can be used during development time to perform (and optimize)
quality assurance. After one iteration (a-b), more applications can be performed (step
c), the calibration can be refined (step d), or the objective may be adapted (step e).
Furthermore, the approach can be applied in different contexts; however, for each
new context, the knowledge about the relationships between the quality assurance
techniques has to be re-evaluated.
Below, we will describe each of the three main activities in more detail. Fig. 4
shows the definition.
It starts with a definition of the quality assurance (QA) objective and the strategy.
The strategy might be to save effort. Concrete objectives may be, for example, to find
more defects in less time, to find a certain number of defects, or to find certain kinds of
defects.
First of all, the quality assurance techniques have to be selected that should be
integrated, and the kind of combination has to be determined (i.e., the order). If
knowledge about the defect detection abilities is available, this can support the
selection (e.g., which quality assurance technique can find which kind of defects,
which quality assurance technique is most reasonable in a certain step, etc.). For
finding out relationships, historical data is required. If such data is not available, it has
to be gathered first by applying the quality assurance techniques in isolation.
Assuming that we have a context specific data pool that stores, for instance:
• defect data from applied analysis and testing techniques,
• product data such as the size or complexity of the corresponding software
and the models,
• or process data such as the number of check-ins or the number of
developers per model,
we can extract and select the relevant data for the calibration.
Usually, it is unclear how different analysis and testing techniques may influence
each other. Therefore, the next step is to define assumptions and refined selection
rules to find out typical relationships that can be exploited. Such assumptions can be
derived in different ways, e.g., based upon expert knowledge, or based upon empirical
knowledge. One frequently observed defect distribution is the so-called Pareto
distribution (i.e. an accumulation of defects in certain parts, such as code classes or
model blocks). Following this empirically validated assumption in our context, we
can refine this into a concrete selection rule, which says, for example, to focus one
quality assurance technique on those parts in which the other quality assurance
technique has already found defects. Indeed, such a selection rule can be further
refined to make it more applicable, e.g., by defining thresholds for the defect count.
Many other assumptions make sense, too, e.g. regarding historical defect numbers,
size or complexity metrics, or additional metrics. After assumptions and refined
selection rules have been defined, prioritization (i.e., focusing) takes place, and the
assumptions are evaluated based upon the historical data. Different possibilities exist
for judging the significance of a rule, ranging from a simple significance level, where
each positive evaluation of a selection rule increases its count by one, to sophisticated
calculations of precision/recall and F-measures for each rule [58]. The results are
packaged afterwards.
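As a minimal sketch (with invented defect counts and an assumed threshold), such a refined selection rule can be evaluated against historical data via precision and recall of its predictions:

    # Illustrative sketch: a Pareto-style selection rule with a defect-count threshold,
    # scored against historical data from two quality assurance (QA) techniques.
    defects_qa1 = {"partA": 6, "partB": 0, "partC": 4, "partD": 1}  # found by QA technique 1
    defects_qa2 = {"partA": 3, "partB": 0, "partC": 5, "partD": 0}  # later found by QA technique 2

    THRESHOLD = 2  # rule: focus QA technique 2 on parts where technique 1 found > THRESHOLD defects
    predicted = {p for p, d in defects_qa1.items() if d > THRESHOLD}
    defective = {p for p, d in defects_qa2.items() if d > 0}

    precision = len(predicted & defective) / len(predicted) if predicted else 0.0
    recall    = len(predicted & defective) / len(defective) if defective else 0.0
    print(f"rule precision = {precision:.2f}, recall = {recall:.2f}")  # 1.00 and 1.00 here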
Once the relationships between the selected quality assurance techniques have been
identified, they can be exploited in subsequent applications in order to optimize the
quality assurance. Fig. 6 shows these steps. For the sake of simplicity, we assume
here that one quality assurance technique is applied after the other (and omit parallel
execution or the application of more than two quality assurance techniques). The first
quality assurance technique is applied as defined and observed in the calibration
activity (see Fig. 5). During its application, defect data and, optionally, additional
product and process data are gathered, depending on what is needed for evaluating the
corresponding assumptions. After the necessary data is available, the prioritization for
the second quality assurance activity is done based upon the context-specific
evaluated assumptions. The second quality assurance technique is then applied based
upon the refined selection rules until a stopping criterion is achieved (e.g., a certain
number of defects are found). The results are analyzed and it is checked again
whether the assumption worked in this quality assurance run. Finally, all data is
packaged for further analysis and for subsequent applications.
4 Proof of Concept
Following the definition of the general setting, calibration takes place. The selected
quality assurance techniques are model inspections of Matlab Simulink models and
integration testing; here, inspections are performed before the test activity. For
simplicity, we do not assume that specific knowledge about the techniques is stored
in a database, although experts would likely have knowledge about the effectiveness
of the selected quality assurance techniques.
We assume that we have data from one already performed quality assurance run (more
data would be better in terms of validity). Four model blocks were inspected and tested,
and defect content (dc, the number of defects) and defect density (dd, number of defects
per size (e.g., block elements, block inheritance depth)) were gathered. We again assume
a Pareto distribution of defects and derive two selection rules: one that uses defect
content and one that uses defect density. Of course, concrete thresholds might be defined,
but as we have no experience, we will omit this initially. Afterwards, we can prioritize
the model blocks and evaluate which selection rules are the best ones. In this example,
selection rule 1 ranks model blocks 3 and 1 the highest; these also had the largest number
of defects during testing. This selection rule fits well, as it prioritizes exactly those
model parts that contain more defects, so its significance can be increased. Selection
rule 2, on the other hand, prioritizes model blocks that did not turn out to contain
more defects during testing, i.e., it is a bad selection rule in our context, and thus its
significance is very low. These results can be packaged for later use.
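The following sketch reproduces this calibration step with made-up numbers for the four model blocks; the defect counts and sizes are not the actual study data, but are chosen so that the outcome mirrors the one reported above.

    # Illustrative sketch: rank model blocks by defect content (dc) and defect density
    # (dc/size) from the inspection and compare the top-2 against the testing results.
    inspection = {  # invented values
        "block1": {"dc": 8, "size": 40},
        "block2": {"dc": 2, "size": 4},
        "block3": {"dc": 9, "size": 90},
        "block4": {"dc": 1, "size": 2},
    }
    testing_dc = {"block1": 6, "block2": 1, "block3": 7, "block4": 0}  # invented values

    rule1 = sorted(inspection, key=lambda b: inspection[b]["dc"], reverse=True)[:2]
    rule2 = sorted(inspection, key=lambda b: inspection[b]["dc"] / inspection[b]["size"],
                   reverse=True)[:2]
    top_in_testing = sorted(testing_dc, key=testing_dc.get, reverse=True)[:2]

    print("rule 1 (defect content):", rule1, set(rule1) == set(top_in_testing))  # ['block3', 'block1'] True
    print("rule 2 (defect density):", rule2, set(rule2) == set(top_in_testing))  # ['block2', 'block4'] False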
The application then is straightforward: The evaluated assumption, respectively
selection rule 1 should be used in the subsequent integration of inspections and testing
of models.
Of course, the example is kept quite simple for demonstration purposes. For
example, a lot more metrics could be considered, a lot more assumptions and
selection rules would make sense, and a much more detailed analysis would increase
the knowledge about the relationships between quality assurance activities in the
given context (here, we only considered the “top-2” prioritized model blocks).
However, the basic concepts remain the same.
In addition, many other instantiations are possible. For example, one could focus more
on certain defect types. Different quality assurance techniques can complement each
other in finding those defect types at whose detection they are best. Analysis techniques
and testing techniques might be applied in reverse order, as the concepts of the integrated
quality assurance methodology do not determine the order of the quality assurance
techniques, i.e., testing can be done first and the resulting testing outcomes could help to
focus inspections or other kinds of analyses on certain parts. Moreover, more than one
quality assurance activity can influence a later activity; for instance, results from
inspections or static analysis, or other product data may control testing activities.
4.3 Implications
The integration of static and dynamic quality assurance techniques provides high
benefits during software development. Reduced costs, higher coverage, or improved
efficiency can be achieved. This approach can help practitioners in identifying
improvement potential for their quality assurance, as analysis and testing techniques
are usually performed in isolation. The assumptions mentioned in this paper can be a
starting point for focusing quality assurance. Furthermore, general concepts that
should be considered are given by the generic integrated methodology.
For researchers, many open questions remain, however. For instance: What kinds
of combinations are worthwhile, which relationships exist between the different
Acknowledgments. Parts of this work have been funded by the ARTEMIS project
“MBAT” (grant: 269335). We would also like to thank Sonnhild Namingha for
proofreading.
References
1. Garro, A., Tundis, A.: A model-based method for system reliability analysis. In: 2012
Symposium on Theory of Modeling and Simulation - DEVS Integrative M&S Symposium,
pp. 1–8 (2012)
2. Zhan, Y., Clark, J.A.: A search-based framework for automatic testing of
MATLAB/Simulink models. Journal of Systems and Software 81(2), 262–285 (2008)
3. Chapoutot, A., Martel, M.: Abstract Simulation: A Static Analysis of Simulink Models.
In: International Conference on Embedded Software and Systems, pp. 83–92 (2009)
4. Boden, L.M., Busser, R.D.: Adding natural relationships to Simulink models to improve
automated model-based testing. In: Digital Avionics Systems Conference, vol. 2, pp. 1–9.
5. Cleaveland, W.R., Smolka, S.A., Sims, S.T.: An instrumentation-based approach to
controller model validation. In: Broy, M., Krüger, I.H., Meisinger, M. (eds.) ASWSD
2006. LNCS, vol. 4922, pp. 84–97. Springer, Heidelberg (2008)
6. Peranandam, P., Raviram, S., Satpathy, M., Yeolekar, A., Gadkari, A., Ramesh, S.:
An integrated test generation tool for enhanced coverage of Simulink/Stateflow models.
In: Design, Automation Test in Europe Conference Exhibition, pp. 308–311 (2012)
7. Siegl, S., Hielscher, K.-S., German, R., Berger, C.: Automated testing of embedded
automotive systems from requirement specification models. In: 12th IEEE Latin-American
Test Workshop (2011)
8. Sims, S., Cleaveland, R., Butts, K., Ranville, S.: Automated Validation of Software
Models. In: 16th IEEE International Conference on Automated Software Engineering, p.
91 (2001)
9. Mohalik, S., Gadkari, A.A., Yeolekar, A., Shashidhar, K.C., Ramesh, S.: Automatic test
case generation from Simulink / Stateflow models using model checking (2013)
10. Deissenboeck, F., Hummel, B., Jürgens, E., Schätz, B., Wagner, S., Girard, J.-F., Teuchert,
S.: Clone detection in automotive model-based development. In: 30th International
Conference on Software Engineering, pp. 603–612 (2008)
11. Boström, P.: Contract-based verification of simulink models. In: Qin, S., Qiu, Z. (eds.)
ICFEM 2011. LNCS, vol. 6991, pp. 291–306. Springer, Heidelberg (2011)
12. Broy, M., Chakraborty, S., Goswami, D., Ramesh, S., Satpathy, M., Resmerita, S.,
Pree, W.: Cross-layer analysis, testing and verification of automotive control software.
In: 9th ACM International Conference on Embedded Software, pp. 263–272 (2011)
13. Satpathy, M., Yeolekar, A., Peranandam, P., Ramesh, S.: Efficient coverage of parallel and
hierarchical stateflow models for test case generation. Software Testing Verification and
Reliability 22(7), 457–479 (2012)
14. Barnat, J., Brim, L., Beran, J., Kratochvíla, T., Oliveira, Í.R.: Executing model checking
counterexamples in Simulink. In: 6th International Symposium on Theoretical Aspects of
Software Engineering, pp. 245–248 (2012)
15. Sims, S., DuVarney, D.C.: Experience report: the reactis validation tool. SIGPLAN 42(9),
137–140 (2007)
16. Merschen, D., Polzer, A., Botterweck, G., Kowalewski, S.: Experiences of applying
model-based analysis to support the development of automotive software product lines.
ACM Int. Conference Proceeding Series, pp. 141–150 (2011)
17. Stürmer, I., Conrad, M., Fey, I., Dörr, H.: Experiences with model and autocode reviews in
model-based software development. In: Int. Conf. on Software Eng., pp. 45–51 (2006)
18. Kemmann, S., Kuhn, T., Trapp, M.: Extensible and automated model-evaluations
with iNProVE. In: Kraemer, F.A., Herrmann, P. (eds.) SAM 2010. LNCS, vol. 6598,
pp. 193–208. Springer, Heidelberg (2011)
19. Chen, C.: Formal analysis for stateflow diagrams. In: 4th IEEE International Conference
on Secure Software Integration and Reliability Improvement Companion, pp. 102–109
(2010)
20. Popovici, K., Lalo, M.: Formal model and code verification in Model-Based Design. In:
Joint IEEE North-East Workshop on Circuits and Systems and TAISA Conference (2009)
21. Toyn, I., Galloway, A.: Formal validation of hierarchical state machines against
expectations. In: Australian Software Engineering Conference, pp. 181–190 (2007)
22. Alur, R.: Formal verification of hybrid systems. In: 9th ACM International Conference on
Embedded software, pp. 273–278 (2011)
23. Kanade, A., Alur, R., Ivančić, F., Ramesh, S., Sankaranarayanan, S., Shashidhar, K.C.:
Generating and analyzing symbolic traces of simulink/Stateflow models. In: Bouajjani, A.,
Maler, O. (eds.) CAV 2009. LNCS, vol. 5643, pp. 430–445. Springer, Heidelberg (2009)
24. Whalen, M., Cofer, D., Miller, S., Krogh, B.H., Storm, W.: Integration of formal analysis
into a model-based software development process. In: Leue, S., Merino, P. (eds.) FMICS
2007. LNCS, vol. 4916, pp. 68–84. Springer, Heidelberg (2008)
25. Hu, W., Wegener, J., Stürmer, I., Reicherdt, R., Salecker, E., Glesner, S.: MeMo - methods
of model quality. In: Workshop Modellbasierte Entwicklung eingebetteter Systeme, pp.
127–132 (2011)
26. Böhr, F.: Model Based Statistical Testing of Embedded Systems. In: 4th International
Conference on Software Testing, Verification and Validation Workshops, pp. 18–25
(2011)
27. Bringmann, E., Krämer, A.: Model-based testing of automotive systems. In: 1st
International Conference on Software Testing, Verification and Validation, pp. 485–493
(2008)
28. Morschhauser, I., Lindvall, M.: Model-Based Validation Verification Integrated with SW
Architecture Analysis: A Feasibility Study. In: IEEE Aerospace Conference, pp. 1–18
(2007)
29. Mazzolini, M., Brusaferri, A., Carpanzano, E.: Model-checking based verification
approach for advanced industrial automation solutions. In: 15th IEEE International
Conference on Emerging Technologies and Factory Automation (2010)
30. Brillout, A., He, N., Mazzucchi, M., Kroening, D., Purandare, M., Rümmer, P.,
Weissenbacher, G.: Mutation-based test case generation for simulink models. In: de Boer,
F.S., Bonsangue, M.M., Hallerstede, S., Leuschel, M. (eds.) FMCO 2009. LNCS,
vol. 6286, pp. 208–227. Springer, Heidelberg (2010)
31. Satpathy, M., Yeolekar, A., Ramesh, S.: Randomized directed testing (REDIRECT) for
Simulink/Stateflow models. In: 8th ACM International Conference on Embedded
Software, pp. 217–226 (2008)
32. Schwarz, M.H., Sheng, H., Batchuluun, B., Sheleh, A., Chaaban, W., Borcsok, J.: Reliable
software development methodology for safety related applications: From simulation to
reliable source code. In: XXII International Symposium on Information, Communication
and Automation Technologies, pp. 1–7 (2009)
33. Farkas, T., Grund, D.: Rule Checking within the Model-Based Development of Safety-
Critical Systems and Embedded Automotive Software. In: 8th International Symposium on
Autonomous Decentralized Systems, pp. 287–294 (2007)
34. Zhan, Y., Clark, J.: Search based automatic test-data generation at an architectural level.
In: Deb, K., Tari, Z. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 1413–1424. Springer,
Heidelberg (2004)
35. Zhan, Y., Clark, J.A.: Search-based mutation testing for Simulink models. In: Conference
on Genetic and Evolutionary Computation, pp. 1061–1068 (2005)
36. Windisch, A.: Search-based test data generation from stateflow statecharts. In: 12th Annual
Conference on Genetic and Evolutionary Computation, pp. 1349–1356 (2010)
37. Böhr, F., Eschbach, R.: SIMOTEST: A tool for automated testing of hybrid real-time
Simulink models. In: Symposium on Emerging Technologies and Factory Automation
(2011)
38. Reicherdt, R., Glesner, S.: Slicing MATLAB simulink models. In: Int. Conference on
Software Engineering, pp. 551–561 (2012)
39. Alur, R., Kanade, A., Ramesh, S., Shashidhar, K.C.: Symbolic analysis for improving
simulation coverage of Simulink/Stateflow models. In: 8th ACM Int. Conf. on Embedded
Softw., pp. 89–98 (2008)
40. Venkatesh, R., Shrotri, U., Darke, P., Bokil, P.: Test generation for large automotive
models. In: IEEE International Conference on Industrial Technology, pp. 662–667 (2012)
41. He, N., Rümmer, P., Kroening, D.: Test-case generation for embedded simulink via formal
concept analysis. In: 48th Design Automation Conference, pp. 224–229 (2011)
42. Zhan, Y., Clark, J.A.: The state problem for test generation in Simulink. In: 8th Annual
Conference on Genetic and Evolutionary Computation, pp. 1941–1948 (2006)
43. Oh, J., Harman, M., Yoo, S.: Transition coverage testing for simulink/stateflow models
using messy genetic algorithms. In: 13th Annual Conference on Genetic and Evolutionary
Computation, pp. 1851–1858 (2011)
44. Ray, A., Morschhaeuser, I., Ackermann, C., Cleaveland, R., Shelton, C., Martin, C.:
Validating Automotive Control Software Using Instrumentation-Based Verification.
In: IEEE/ACM International Conference on Automated Software Engineering (2009)
45. Kim, M., Kim, Y., Jang, Y.: Industrial Application of Concolic Testing on Embedded
Software: Case Studies. In: 5th International Conference on Software Testing, Verification
and Validation, pp. 390–399 (2012)
46. Elberzhager, F., Münch, J., Rombach, D., Freimut, B.: Integrating Inspection and Test
Processes based on Context-specific Assumptions. Journal of Software: Evolution and
Processes (2012)
47. Elberzhager, F., Muench, J., Tran, V.: A Systematic Mapping Study on the Combination of
Static and Dynamic Quality Assurance Techniques. Information and Software
Technology 54(1), 1–15 (2012)
48. Zimmerman, D.M., Kiniry, J.R.: A Verification-centric Software Development Process for
Java. In: 9th International Conference on Quality Software, pp. 76–85 (2009)
49. Chen, Q., Wang, L., Yang, Z., Stoller, S.D.: HAVE: Detecting Atomicity Violations via
Integrated Dynamic and Static Analysis. In: 12th International Conference on Fundamental
Approaches to Software Engineering: Held as Part of the Joint European Conferences on
Theory and Practice of Software, pp. 425–439 (2009)
50. Runeson, P., Andrews, A.: Detection or Isolation of Defects? An Experimental
Comparison of Unit Testing and Code Inspection. In: 14th International Symposium on
Software Reliability Engineering, pp. 3–13 (2003)
51. Juristo, N., Vegas, S.: Functional Testing, Structural Testing, and Code Reading: What
Fault Type do they each Detect? In: Empirical Methods and Studies in Software
Engineering, pp. 208–232 (2003)
52. Chen, J., Zhou, H., Bruda, S.D.: Combining Model Checking and Testing for Software
Analysis. In: International Conference on Computer Science and Software Engineering,
pp. 206–209 (2008)
53. Juristo, N., Moreno, A.M., Vegas, S.: Reviewing 25 Years of Testing Technique
Experiments. Empirical Software Engineering 9(1-2), 7–44 (2004)
54. Aurum, A., Petersson, H., Wohlin, C.: State-of-the-Art: Software Inspections after 25
Years. Software Testing, Verification and Reliability 12(3), 133–154 (2002)
55. Genero, M., Piattini, M., Calero, C.: A Survey of Metrics for UML Class Diagrams.
Journal of Object Technology 4(9) (2005)
56. ODC. Orthogonal Defect Classification v5.11, IBM (2002),
http://www.research.ibm.com/softeng/ODC/ODC.HTM
57. Elberzhager, F., Rosbach, A., Bauer, T.: Analysis and Testing of Matlab Simulink Models:
A Systematic Mapping Study. In: JAMAICA Workshop (ISSTA) (accepted, 2013)
58. Arisholm, E., Briand, L.C., Johannesson, E.B.: A systematic and comprehensive
investigation of methods to build and evaluate fault prediction models. Journal of Systems
and Software 83(1), 2–17 (2009)
59. Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related
Systems, IEC 61508 Standard
60. Road vehicles – Functional safety, ISO 26262 Standard
61. AUTomotive Open System Architecture (AUTOSAR), http://www.autosar.org/
62. Radjenovic, D., Hericko, M., Torkar, R., Zivkovic, A.: Software fault prediction
metrics: A systematic literature review. Information and Software Technology Journal
(accepted, 2013)
63. Jezequel, J.-M., Meyer, B.: Put it in the contract: The lessons of Ariane,
http://www.irisa.fr/pampa/EPEE/Ariane5.html
(accessed on June 17, 2013)
64. Leveson, N., Turner, C.S.: An Investigation of the Therac-25 Accidents,
http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html
(accessed on June 17, 2013)
65. Petersen, K., Wohlin, C.: Context in industrial software engineering research.
In: 3rd International Symposium on Empirical Software Engineering and Measurement,
pp. 401–404 (2009)
Effects of Test-Driven Development:
A Comparative Analysis of Empirical Studies
Simo Mäkinen and Jürgen Münch
University of Helsinki,
Department of Computer Science,
P.O. Box 68 (Gustaf Hällströmin katu 2b),
FI-00014 University of Helsinki, Finland
{simo.makinen,juergen.muench}@cs.helsinki.fi
1 Introduction
Red. Green. Refactor. The mantra of test-driven development [1] is contained
in these words: red refers to the fact that first and foremost implementation
of any feature should start with a failing test, green signifies the need to make
that test pass as fast as possible, and refactor is the keyword to symbolize that
the code should be cleaned up and perfected while keeping the external behaviour of
the code intact. But the question is, what lies behind these three words and
what do we know about the effects of following such guidelines? Test-driven
development reshapes the design and implementation of software [1] but does the
change propagate to the associated software products and in which way are the
processes altered with the introduction of this alternative way of development?
The objective here was to explore these questions and to get an overview of the
observed effects of test-driven development.
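To make the mantra concrete, here is a minimal, hypothetical JUnit 5 sketch of one red-green-refactor cycle; the class and method names are illustrative and are not taken from any of the studies discussed below.

// Hypothetical JUnit 5 sketch of one red-green-refactor cycle.
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class PriceCalculatorTest {
    // RED: this test is written first and fails while applyDiscount() is missing
    // or still returns a dummy value.
    @Test
    void tenPercentDiscountIsApplied() {
        assertEquals(90.0, new PriceCalculator().applyDiscount(100.0, 0.10), 0.001);
    }
}

class PriceCalculator {
    // GREEN: the simplest implementation that makes the test pass.
    double applyDiscount(double price, double rate) {
        return price - price * rate;
    }
    // REFACTOR: with the test as a safety net, the internal structure can now be
    // cleaned up (renaming, extracting methods) without changing observable behaviour.
}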
2 Related Work
Test-driven development has been the subject of reviews before. For instance, Turhan
et al. [2] performed a systematic literature review on test-driven development
that highlighted internal and external quality aspects. The review discovered
that during the last decade, there have been hundreds of publications that men-
tion test-driven development but few report empirically viable results.
Based on the reports, Turhan et al. were able to draw a picture of the overall
effect test-driven development might have. They categorized their findings as
internal and external quality, test quality and productivity. Individual metrics
were assigned to these categories which were labeled to have better, worse, mixed
or inconclusive effects. The rigor of each study was assessed by looking at the
experimental setup and studies were further categorized into four distinct rigor
levels.
Results from the review of Turhan et al. for each of the categories indicate
that the effects vary. Internal quality—which consisted of size, complexity, co-
hesion and other product metrics—was reported to increase in several cases but
more studies were found where there either was no difference or the results were
mixed or worse. External quality, however, was seen to be somewhat higher as
the majority of reviewed studies showed that the amount of defects dropped;
relatively few studies showed a decrease in external quality or were inconclu-
sive. The effect of test-driven development on productivity wasn’t clear: most
industrial studies reported lowered productivity figures while other experiments
had just the opposite results or were inconclusive. Surprisingly, Turhan et al.
conclude that test quality, which means such attributes as code coverage and
testing effort, was not superior in all cases when test-driven development was
used. Test quality was considered better in certain studies but some reported
inconclusive or even worse results.
A few years earlier, Jeffries and Melnik wrote down a short summary of exist-
ing studies about test-driven development [3]. The article covered around twenty
studies both from the industry and academia, describing various context factors
such as the number of participants, programming language and the duration for
each study. The reported effects were categorized into productivity and generic
quality effects which included defect rates as well as perceptions of quality.
Jeffries and Melnik summarize that in most of the industrial studies, more
development effort was needed when the organizations took test-driven devel-
opment into use. In the academic environment, effort increases were noticed as
well but in some academic experiments test-driven development was seen to lead
to reduced effort levels. As for the quality effects, a majority of the industrial
studies showed a positive effect on the amount of defects and in certain cases
the differences to previous company baselines were quite significant. Fewer aca-
demic studies reported reduced defect rates and the results were not quite as
significant; in one, defect rates actually went up with test-driven development.
Recently, Rafique and Mišić [4] gathered experiences from 25 test-driven de-
velopment studies in a statistical meta-analysis which focused on the dimensions
of external quality and productivity. Empirical results from existing studies were
used as much as data was available from the primary studies. The studies were
categorized as academic or industrial and the development method of the respec-
tive control groups was noted as well. It seemed to matter which development
method the reference group was using in terms of how effective test-driven de-
velopment was seen to be.
Rafique and Mišić conclude that external quality could be seen to have im-
proved with test-driven development in bigger, and longer, industrial projects
but the same effect was not noticed in all academic experiments. For productiv-
ity, the results were indecisive and there was a bigger gap between the academic
experiments and industrial case studies than with external quality. Desai et al.
[5] came to a similar conclusion in their review of academic studies: some aspects
of internal and external quality saw improvement but the results for productivity
were mixed—experience of the students was seen as a factor in the experiments.
All of the previous reviews offer valuable insights into the effects of
test-driven development as a whole by gathering information from a number
of existing studies. While the research method is similar to the aforementioned
reviews, in this review the idea is to extend previous knowledge by breaking down
the quality attributes into more atomic units as far as data is available from the
empirical studies. We expect that this will lead to deeper understanding of the
effect test-driven development has on various attributes of quality.
3 Research Method
for improving some aspects that are related to the research focus [6]. Test-driven
development has been the object of study in a number of research endeavours
that have utilized such methods.
Literature reviews can be used for constructing a theoretical framework for a
study which helps to formulate a problem statement and in the end identify the
purpose of a study [7]. In integrative literature reviews [8], existing literature
itself is the object of study and emerging or mature topics can be analyzed
in more detail in order to create new perspectives on the topic. Systematic
literature reviews [9] have similar objectives as integrative reviews but stress
that the review protocols, search strategies and finally both inclusion criteria
and exclusion criteria are explicitly defined.
The research method in this study is an integrative literature review which
was performed to discover a representative sample of empirical studies about the
effects of test-driven development from the industry and academia alike. Creswell
[10] writes that literature reviews should start by identifying the keywords to use
as search terms for the topic. Keywords that were used were such as test driven
development, test driven, test first programming and test first. The second step
suggested by Creswell is to apply these search terms in practice and find relevant
publications from catalogs and publication databases. Several search engines
from known scientific publishers were used in the process, namely the keywords
were entered into search fields at the digital libraries of the Institute of Electrical
and Electronics Engineers (IEEE), the Association for Computing Machinery (ACM), Springer and
Elsevier. The titles and abstracts of the first few hundred highest-ranked entries
from each service were manually screened. Although the search was repeated
several times with the aforementioned keywords with multiple search engines on
different occasions, the review protocol wasn’t entirely systematic since there
wasn’t a single, exact, search string and entries were not evaluated if they had
a low rank in the search results.
A selection of relevant publications from a larger body requires that there is
at least some sort of inclusion and exclusion criteria. Merriam [7] suggests that
the criteria can for instance include the consideration of the seminality of the
author, date of publication, topic relevance to current research and the overall
quality of the publication in question. There was no strict filter according to the
year of publication so the main criterion was whether the publication included
empirical findings of test-driven development and presented results either from
the industry or academia. The quality of publications and general relevance were
used as exclusion criteria in some cases. While the objective was to gather all
the relevant research, there was a limited number of publications that could be
reviewed in greater detail due to the nature of the study.
After applying the criteria, 19 publications remained to be analyzed further.
These publications, listed in Table 1, were conference proceedings and journal
articles that had relevant empirical information about test-driven development.
In 2009, Turhan et al. [2] identified 22 publications in their systematic literature
review of test-driven development and 7 of these publications are also included in
Table 1. An overview of the publications included in the review and the effects of
test-driven development on quality factors
Quality factor columns: External Quality, Maintainability, Productivity, Complexity, Coverage, Cohesion, Coupling, Defects, Effort, Size
Author Name and Year, Context, observed effects (marked x+, x, or x− per factor)
Bhat and Nagappan 2006 [12] Industry x+ x x x
Canfora et al. 2006 [13] Industry x− x−
Dogša and Batič 2011 [14] Industry x+ x x x− x+ x− x+
George and Williams 2003 [15] Industry x x− x+ x−
Geras et al. 2004 [16] Industry x x x x
Maximilien and Williams 2003 [17] Industry x+ x
Nagappan et al. 2008 [18] Industry x+ x x x−
Williams et al. 2003 [19] Industry x+ x
Janzen and Saiedian 2008 [20] Industry/Academia x x+ x x x+
Madeyski and Szała 2010 [21] Industry/Academia x x
Müller and Höfer 2007 [22] Industry/Academia x x x x x
Desai et al. 2009 [23] Academia x x x x
Gupta and Jalote 2007 [24] Academia x+ x x
Huang and Holcombe 2009 [25] Academia x− x x
Janzen and Saiedian 2006 [26] Academia x x x x x x x
Madeyski 2010 [27] Academia x x
Pančur and Ciglarič 2011 [28] Academia x x x x
Vu et al. 2009 [29] Academia x x x+ x x x x
Wilkerson et al. 2012 [30] Academia x− x
this integrative review but some of the previously identified publications remain
outside the analysis.
The findings of the studies of this review were used to construct a map of
quality attributes and the perceived effects noted in each study. Construction of
the quality map proceeded so that an attribute was added to the map if a particular study contained empirical data about it; thus, the attributes were
not predetermined. This map led to a more detailed analysis of the individual
attributes. The literature review itself was completed in 2012 [11].
per one thousand lines of code [15]. While the overall amount of reported defects
dropped, there were still the same relative amount of severe defects as with the
previous release that didn’t use test-driven development methods [19]. Microsoft
development teams were able to cut down their defect rates, as well, and the
selected test-driven projects had a reduced defect density of sixty to ninety
percent compared to similar projects [18]. In the telecommunications field, test-
driven development seemed to slightly improve quality by reductions in defect
densities before and especially after releases [14].
Outside the industry, test-driven development hasn't always led to drastically
better outcomes than other development methods. Geras et al. experimented
with industry professionals [16] but in a setting that resembled an experiment
rather than a case study and didn’t notice much of a difference in defect counts
between experimental teams that were instructed to use or not to use test-driven
development in their task. Student developer teams reported in the research of
Vu et al. [29] fared marginally better in terms of the amount of defects when
teams were using test-driven development but it seems that process adherence
was relatively low and the designated test-driven team wrote less test code than
the non-test-driven team. In another student experiment, Wilkerson et al. [30]
noticed that code inspections were more effective at catching defects than test-
driven development; that is, more defects remained in the code that was devel-
oped using test-driven development than in the code that was inspected by a
code inspection group.
work and how well they’re able to adhere to the test-driven development process
[22,27]. Students who are more unfamiliar with the concept might not be able to
achieve as high a coverage as their industry peers [29]. Very low coverage rates
might be a sign that design and implementation are not done incrementally with
the help of tests.
Good coverage doesn’t necessarily mean that the tests are able to effectively
identify incorrect behavior. Mutation testing consists of transforming the origi-
nal source code with mutation operators; good tests should detect adverse mu-
tations to the source code [32]. Pančur and Ciglarič [28] noticed in their student
experiment that even though the branch coverage was higher for the test-driven
development students, the mutation score indicator was actually worse than the
score of other students who were developing their code with a test-last approach.
In another student experiment, the mutation score indicators were more or less
equal [27].
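As a hedged illustration of the mutation-testing idea (the class below is hypothetical and not taken from the cited experiments), a mutation operator might replace a '>' with '>=', and a boundary-value test kills that mutant.

// Illustrative sketch: a mutation operator changes a boundary condition;
// a good test detects ("kills") the mutant.
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

class Threshold {
    // Original code under test.
    static boolean exceeds(int value, int limit) {
        return value > limit;
        // A typical mutant would replace '>' with '>=' here.
    }
}

class ThresholdTest {
    @Test
    void boundaryValueDoesNotExceedLimit() {
        // This boundary-value check kills the '>=' mutant:
        // the mutant returns true for (5, 5), so the test fails and the mutant is detected.
        assertFalse(Threshold.exceeds(5, 5));
        assertTrue(Threshold.exceeds(6, 5));
    }
}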
4.3 Complexity
Complexity measures of code products can be used to describe individual code
components and their internal structure. For instance, McCabe's complexity
[33] is calculated from the number of decision points (branches) in the code, while modern
views of complexity take the structure of methods and classes into account [34].
Because code is developed in small code fragments with test-driven development,
a reduction in the complexity of code is possible.
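As a small, hypothetical illustration of the measure, the following method has three decision points (one loop condition and two if conditions), giving a McCabe complexity of 3 + 1 = 4 regardless of how many lines it spans.

// Hypothetical example method for illustrating cyclomatic complexity.
public class ComplexityExample {
    static int countAbove(int[] values, int threshold) {
        int hits = 0;
        for (int v : values) {            // decision 1: loop condition
            if (v > threshold) {          // decision 2
                hits++;
            } else if (v == threshold) {  // decision 3
                // boundary values are deliberately not counted
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countAbove(new int[] {1, 5, 7, 9}, 5)); // prints 2
    }
}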
Reductions in complexity of classes have been observed both in the industry
and academia but not all results are conclusive. Classes created by industrial test-
driven developers seem to be more coherent and contain fewer complex methods
[20] or the products seem less complex overall [14]. Student developers have on
occasion constructed less complex classes with test-driven development [20,29]
but in some cases the differences in complexity between development methods
have been small [28].
4.5 Size
Besides examining the structure of code products and the relationships of ob-
jects, it is possible to determine the size of these elements. The amount of source
code lines is one size measure which can be used to characterize code products.
Test-driven development involves writing a considerable amount of automated
unit tests that contribute to the overall size of the code base but the incremental
design and implementation could have an effect on the size of the classes and
methods written.
The ratio between test source code lines and non-test source code lines is one
way to look at the relative size of test code. The studies from the industry show
that the ratios can be substantial in projects where test-driven development is
used. At Microsoft, the highest ratios were reported to be 0.89 test code
lines/production code lines, the lowest 0.39 test code lines/production code lines
for a larger project and somewhere in between for other projects depending
on the size of the project [18]. The numbers from IBM fall within this range
at around 0.48 test code lines/production code lines [17]. Without test-driven
development and with proper test-last principles, it is possible to reach fair ratios
but the ratios tend to fall behind test-driven projects [14]. For student developers,
the ratios have been observed to be on the same high level as industry developers
[27] or somewhat lower, albeit in one case students were able to achieve a ratio
of over 1 test code lines/production code lines without test-driven development
[29]. It is apparent that with test-driven development the size of the
overall code base increases due to the tests written, given that for roughly every two lines of
production code there is at least one line of test code.
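As a purely illustrative calculation (the absolute sizes are invented; only the 0.48 ratio is taken from the IBM figure above), the ratio is simply

\[ \text{ratio} = \frac{\text{test LOC}}{\text{production LOC}}, \qquad \text{e.g.}\ \frac{48\,000}{100\,000} = 0.48. \]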
The size of classes and methods has in certain cases been affected under test-
driven development. In the longitudinal industry study of Janzen [20], classes
and methods were reported to be smaller although the same didn’t apply to
several other case studies mentioned in the report. Similarly, students wrote
lighter classes and methods in one case [20] but not in another [26]. Madeyski
and Szała found in a quasi-controlled experiment that there was less code per
user story when the developer was designing and implementing the stories with
test-driven development [21]. Müller and Höfer conclude from an experiment
that test-driven developers with less experience don’t necessarily create test code
which has a larger code footprint but there might be some differences in the size
of the non-test code in favor of the experts [22].
4.6 Effort
Effort is a process quality attribute [31] and here, it can be defined as the amount
of exertion spent on a specific software engineering task. Typically, there is a
relation between effort and duration and with careful consideration of the context
the duration of a task could be seen as an indicator for effort spent. Test-driven
development changes the design and implementation process of code products so
the effort of these processes might be affected. The effects might not be limited
to these areas as the availability of automated tests is related to verification and
validation activities as well.
Writing the tests seems to take time, or there are other factors which affect
the development process, as several experiences suggest that the
usage of test-driven development increases the effort. At Microsoft and IBM,
managers estimated that development took about 15 to 30 percent longer [18].
Effort also increased in the study of Dogša and Batič [14] where the development
took around three to four thousand man-hours more in an approximately six-
month project; the developers felt the increase was at least partly due to the
test-driven practices. George and Williams [15] and Canfora et al. [13] came to
the same conclusion in their experiments that developers used more time with
test-driven development. Effort and time spent on testing has also been shown
to increase in academic experiments [25]. However, the correlation between
test-driven development and effort isn’t that straightforward as Geras et al. [16]
didn’t notice such a big difference and students have been shown to get a faster
start into development with test-driven development [24].
Considering effort, it is not enough to look at the initial stages of development,
as more effort is put on the development of the code products in later stages
of the software life cycle when the code is being maintained or refactored for
other purposes. Here, the industrial case study of Dogša and Batič [14] provides
interesting insights into the different stages of development. Even though the
test-driven development team had used more time in the development phase
before the major release of the product, maintenance work on the code that was
previously written with test-driven development was substantially easier and less
time-consuming. The observation period for maintenance was around
nine months, and during this time the test-driven team was quickly making up
for the increased effort in the initial development stage, although they were still
several thousand man-hours behind the aggregated effort of the non-test-driven
teams when the observation ended.
better code quality through test-driven development. The students in the study
of Gupta and Jalote [24] had the sense that test-driven development improved
the testability of their creation but at the same time reduced the confidence in
the design of the system.
4.8 Productivity
Productivity is an indirect resource metric based on the relation between the
amount of some tangible items produced and the effort required to produce the
output. The resources can, for instance, be developers while the output can be
products resulting from development work like source code or implemented user
stories. Productivity is an external attribute which is sensitive to the environ-
ment [31]. Test-driven development seemed to be a factor in the increased effort
of the developers as previously described but could the restructured development
process and the tests written somehow accelerate the implementation velocity
of the user stories or affect the rate by which developers write source code lines?
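Expressed as a formula (the concrete units are examples, not prescribed by the cited studies),

\[ \text{productivity} = \frac{\text{output}}{\text{effort}}, \qquad \text{e.g.}\ \frac{\text{implemented user stories}}{\text{person-weeks}} \ \text{or}\ \frac{\text{lines of code}}{\text{person-hours}}. \]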
There have been a number of studies which have featured test-driven de-
velopment and productivity. Dogša and Batič [14] reported that the industrial
test-driven developer team produced code at a slightly lower rate than the other
teams involved in the study and the developers also thought themselves that
their productivity was affected by the practices required in test-driven develop-
ment. In the experiment of Madeyski and Szała [21], there were some signs of
increased productivity but it was noted that there were certain validity threats
for this single-developer study. Student developers who used test-driven develop-
ment in the study of Gupta and Jalote [24] were on average on the same level or
a bit faster in producing source code lines than student teams developing with-
out test-driven development. In the study of Huang and Holcombe [25], students
were also faster with test-driven development, although the difference didn’t ex-
ceed statistical significance. But then again there have been test-driven student
teams whose productivity has been lower in terms of implemented features or
source code lines [29]. In some cases, no differences between the productivity of
student teams have been found [28].
The time it takes to produce one line of code might depend on the type of
code being written as well. Müller and Höfer [22] examined the productivity
rates when experts and students were developing code in a test-driven develop-
ment experiment and noticed that both experts and students wrote test code
faster than they wrote the implementation code. Experts were generally faster in
writing code than students but both groups of developers wrote test code three
times faster, reaching maximum rates of 150 lines of code per hour. Test-driven
development involves writing a lot of test code but, based on this result, writing
an equal amount of test code doesn't take as long as writing implementation
code, which is something to consider.
4.9 Maintainability
Maintainability is a property that is related to some of the evolution aspects of
software: the ease of finding out what to change in existing code, the relative effort to make a change, and the sustained confidence that everything works
well after the change, with sufficient mechanisms to verify the effects [35]. An array of automated tests might at least help to increase the testability and stability
of software, which implies that test-driven development has a chance to affect maintainability.
Few empirical studies about test-driven development mention maintainability
and there seems to be room for additional research in this area. The indus-
trial case study of Dogša and Batič [14] considers maintainability, and the nine-month
maintenance period seems long enough to draw some initial conclusions.
As previously described, serving the change requests for the code that had been
developed with test-driven development took less time and was thus more ef-
fortless. In addition, when interviewed, developers in the study responded to a
closed question that the development practice had helped them to make the software
more maintainable. While more research could verify whether the effect is of a
constant nature, the idea is still encouraging.
4.11 Limitations
The results of the individual studies have a limited scope of validity and cannot
be easily generalized and compared. Therefore, the findings presented in this
Fig. 1. The occurrence of positive, neutral and negative effects for each quality at-
tribute as reported by the test-driven development publications included in the review
article need a careful analysis of the respective contexts before applying in other
environments. The completeness of the integrative literature review depended
on the ranking algorithms of the search engines and could have been enforced
more strictly. Other threats to validity concern the use of qualitative inclusion
and exclusion criteria as well as the selection of databases, search terms, and
the chosen timeframe. Due to these factors, there could be a selection bias related to the choice of publications. This needs to be taken into account when
interpreting and using the results of this integrative literature review.
5 Conclusion
This integrative literature review analyzed the effects of test-driven development
from existing empirical studies. The detailed review collected empirical findings
for different quality attributes and found varying effects on these attributes.
Based on the results, prominent effects include the reduction of defects and
the increased maintainability of code. The internal quality of code in terms of
coupling and cohesion seems not to be much affected, but code complexity might
be reduced a little with test-driven development. With all the tests written, the
whole code base becomes larger, but more source code lines are covered by
tests. Test code is faster to write than the implementation code, but many
of the studies report increased effort in development.
The quality map constructed as part of the review shows some possible direc-
tions for future research. One of the promising effects was the increased main-
tainability and reduced effort it took to maintain code later but at the time of
the review there was only a single study from Dogša and Batič [14] which had
specifically focused on maintainability. This could be one of the areas for further
research on test-driven development.
References
19. Williams, L., Maximilien, E.M., Vouk, M.: Test-Driven Development as a Defect-
Reduction Practice. In: Proceedings of the 14th International Symposium on Soft-
ware Reliability Engineering, ISSRE 2003, pp. 34–45 (November 2003)
20. Janzen, D.S., Saiedian, H.: Does Test-Driven Development Really Improve Software
Design Quality? IEEE Software 25(2), 77–84 (2008)
21. Madeyski, L., Szała, Ł.: The Impact of Test-Driven Development on Software De-
velopment Productivity — An Empirical Study. In: Abrahamsson, P., Baddoo, N.,
Margaria, T., Messnarz, R. (eds.) EuroSPI 2007. LNCS, vol. 4764, pp. 200–211.
Springer, Heidelberg (2007)
22. Müller, M., Höfer, A.: The Effect of Experience on the Test-Driven Development
Process. Empirical Software Engineering 12(6), 593–615 (2007)
23. Desai, C., Janzen, D.S., Clements, J.: Implications of Integrating Test-Driven De-
velopment Into CS1/CS2 Curricula. In: Proceedings of the 40th ACM Technical
Symposium on Computer Science Education, SIGCSE 2009, pp. 148–152. ACM,
New York (2009)
24. Gupta, A., Jalote, P.: An Experimental Evaluation of the Effectiveness and Effi-
ciency of the Test Driven Development. In: Proceedings of the First International
Symposium on Empirical Software Engineering and Measurement, ESEM 2007,
pp. 285–294 (September 2007)
25. Huang, L., Holcombe, M.: Empirical Investigation Towards the Effectiveness of Test
First Programming. Information and Software Technology 51(1), 182–194 (2009)
26. Janzen, D., Saiedian, H.: On the Influence of Test-Driven Development on Software
Design. In: Proceedings of the 19th Conference on Software Engineering Education
and Training, CSEET 2006, pp. 141–148 (April 2006)
27. Madeyski, L.: The Impact of Test-First Programming on Branch Coverage and
Mutation Score Indicator of Unit Tests: An Experiment. Information and Software
Technology 52(2), 169–184 (2010)
28. Pančur, M., Ciglarič, M.: Impact of Test-Driven Development on Productivity,
Code and Tests: A Controlled Experiment. Information and Software Technol-
ogy 53(6), 557–573 (2011)
29. Vu, J., Frojd, N., Shenkel-Therolf, C., Janzen, D.: Evaluating Test-Driven Devel-
opment in an Industry-Sponsored Capstone Project. In: Proceedings of the Sixth
International Conference on Information Technology: New Generations, ITNG 2009,
pp. 229–234 (April 2009)
30. Wilkerson, J., Nunamaker, J.J., Mercer, R.: Comparing the Defect Reduction Ben-
efits of Code Inspection and Test-Driven Development. IEEE Transactions on Soft-
ware Engineering 38(3), 547–560 (2012)
31. Fenton, N.E., Pfleeger, S.L.: Software Metrics: A Rigorous and Practical Approach.
PWS Publishing Company, Boston (1997)
32. Pezzè, M., Young, M.: Software Testing and Analysis: Process, Principles and
Techniques. Wiley, Chichester (2008)
33. McCabe, T.: A Complexity Measure. IEEE Transactions on Software Engineering SE-2(4), 308–320 (1976)
34. Chidamber, S., Kemerer, C.: A Metrics Suite for Object Oriented Design. IEEE
Transactions on Software Engineering 20(6), 476–493 (1994)
35. Cook, S., He, J., Harrison, R.: Dynamic and Static Views of Software Evolution.
In: Proceedings of the IEEE International Conference on Software Maintenance,
pp. 592–601 (2001)
Isolated Testing of Software Components
in Distributed Software Systems
F. Thillen, R. Mordinyi, and S. Biffl
1 Introduction
Component-based software engineering (CBSE) [1] relies on the existence of in-
dependent software components which can be composed to a software system
in a loosely coupled manner. Instead of continually custom developing software
artefacts, as is largely the case under traditional development processes, CBSE
concentrates on assembling prefabricated parts which can be an organization’s
2 Related Work
A challenge in CBS [5] is to avoid dependencies between components [8] since ac-
cording to the definition components should be independent. However, compo-
nents still need other/external components to fulfil their work [8], [9], and thus to
form a running system. These interactions lead to non-technical dependencies [6]
between components. Furthermore, components can also have internal dependen-
cies, which means that they not only depend on self-generated elements but also
have relations between input and output [9]. The greater the number of depen-
dencies between components, the more complex the system will become [6]. This
makes the system harder to modify, verify, and understand and thus leads to poor
maintainability. Therefore, it is important not only to know these dependencies
in order to verify and modify components, but also to define
the influence of external dependencies on the component's behaviour [9].
Web Services: Web services are integration technologies which enable dynamic
interaction between components in networks using open, standardized internet technologies, like the ”Web Services Description Language”
(WSDL) for describing the interfaces, or the ”Simple Object Access Protocol”
(SOAP), the ”Hypertext Transfer Protocol” (HTTP), and the ”Internet Protocol” (IP) for communication purposes. [10] presents two tools that use
Web services as interfaces between services and client components. These tools
offer an easy way of testing Web services in different programming languages.
However, Web services are independent and always have a defined endpoint.
These testing approaches rely on the original definition, i.e., independent components. When components are dependent and communicate with other components over several ports, the presented approaches cannot be applied, as they
assume independent components.
3 Motivating Scenario
Component-based software engineering aims at reusing and assembling software
systems from software parts which may have been prefabricated by third
parties. A number of testing approaches that help to find software defects have been
introduced (see Section 2).
An example for a component-based software system is the so called Engi-
neering Service Bus (EngSB) [16]. In comparison to the Enterprise Service Bus,
which integrates services [17], the EngSB integrates not only different tools and
systems but also different steps in the software development life cycle [16] - the
platform aims at integrating software engineering disciplines. A motivating set
of components is the EngSB, the .Net Bridge [18], and the Engineering Object
Editor (EOE) [19]. The .Net Bridge is a communication interface which enables
interaction between the EngSB and .Net-based implementations (e.g., EOE),
1 http://code.google.com/p/mockito/
2 http://www.jmock.org/
3 http://www.easymock.org/
4 http://www.nmock.org/
4 Research Issues
In this paper, the so-called ”Effective Tester in the Middle” (ETM) is introduced,
which aims to improve the testing of distributed components that depend on other
components in the system. ETM introduces interaction models and network
communication models which facilitate isolated testing of entire components
without the need to run the entire system. The key research issue is how to
test dependent components in distributed environments in order to find defects
effectively and efficiently:
Modelling Component Dependencies: Remote dependent components need
data from other components to fulfil their work. Therefore, the interaction re-
quires network protocols (e.g., TCP, UDP) to be used, consequently defining a
technical dependency. Another dependency refers to the used application proto-
col, which allows applications to filter information in an efficient way. Therefore,
5 http://activemq.apache.org/
[Figure 2: test case and ETM]
Figure 2 shows the integration of the ETM in the use case presented in section
3. In contrast to the traditional system, communication to the EngSB is simu-
lated by the ETM. The EOE does not know that the EngSB is simulated. This
means that check out and check in are still performed the same way (Figure 2,
1 and 6). Also the .Net Bridge does not know that the ETM is the responder
(Figure 2, 2 and 5). The ETM catches the communication messages (Figure 2,
4), parses the request, and replies with a corresponding message to the .Net
Bridge (Figure 2, 5), which then forwards it to the EOE.
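The following minimal Java sketch illustrates the "tester in the middle" idea under simplifying assumptions; it is not the ETM implementation, and all class names, ports, and message strings are hypothetical. A TCP server stands in for the simulated component, matches an incoming request against configured interaction models, and replies with the canned answer.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/**
 * Sketch of a simulated responder standing in for a component (not the actual ETM code).
 */
public class SimulatedResponder {
    private final Map<String, String> interactionModels; // request pattern -> configured reply

    public SimulatedResponder(Map<String, String> interactionModels) {
        this.interactionModels = interactionModels;
    }

    public void serveOnce(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port);
             Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
             OutputStream out = client.getOutputStream()) {

            String request = in.readLine();                              // parse the caught message
            String reply = interactionModels.getOrDefault(request, "UNKNOWN");
            out.write((reply + "\n").getBytes(StandardCharsets.UTF_8));  // answer like the real component would
            out.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical configuration: the test case defines which reply belongs to which request.
        new SimulatedResponder(Map.of("checkout", "checkout-ok")).serveOnce(15000);
    }
}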
6 Prototypic Implementation
The following section first describes the ETM core, followed by two examples.
In the first example, SOAP messages based on HTTP and XML are sent over
TCP, which demonstrates the power of the ETM in a realistic environment. In
the second example, the ETM simulates the EngSB (Figure 2), which allows the
.Net Bridge and the EOE to be tested in isolation. In this case the ETM simulates
the AQ protocol. The sockets have to be configured with a specified IP address and
port, which are provided by the test case; for simplicity, the communication
model in this paper is TCP. First the ETM core is described, which includes a
new list implementation and the socket handling.
[Figure 3: registration messages (1–4) exchanged between Component1 and the ETM]
which is caught by the ETM, and a message from the configuration (e.g., an identifier) is sent back (Figure 3, 2). Next, component1 sends a second registration
message (Figure 3, 3), which requires an answer. Corresponding to the example,
this should have a different identifier than the previous one (Figure 3, 4). This problem requires a new list implementation. From the concrete implementation point
of view, every list entry has a counter, which is initialised with zero. When a
configuration gets picked, this counter is increased (Figure 3, 2 and 4). To find
the correct configuration, first the protocol (converted from bytes) is compared
with the interaction models, and if there are several matches, the number of returns
(counter) is compared. Most programming languages handle all the layers up to
the transport layer; thus only the socket has to be configured with an
endpoint (IP address and port) and the used transport layer (e.g., TCP). The
components can communicate with each other over different ports, which makes
the simulation challenging. The ETM therefore opens an own thread for each port that
handles the communication with the component on that specific port.
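A possible shape of such a list, sketched in Java under the assumptions described above (this is illustrative and not the original ETM source), keeps a per-entry counter so that a second matching request, e.g. a second register message, receives the next configured answer.

import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of an interaction-model list with per-entry counters (not the ETM source).
 */
public class InteractionModelList {

    static final class InteractionModel {
        final String requestPattern;   // protocol content to match, decoded from the received bytes
        final String reply;            // configured answer
        int timesReturned = 0;         // how often this entry has already been used

        InteractionModel(String requestPattern, String reply) {
            this.requestPattern = requestPattern;
            this.reply = reply;
        }
    }

    private final List<InteractionModel> models = new ArrayList<>();

    public void configure(String requestPattern, String reply) {
        models.add(new InteractionModel(requestPattern, reply));
    }

    /** Picks the matching entry that has been returned least often and increases its counter. */
    public synchronized String replyFor(String decodedRequest) {
        InteractionModel best = null;
        for (InteractionModel m : models) {
            if (m.requestPattern.equals(decodedRequest)
                    && (best == null || m.timesReturned < best.timesReturned)) {
                best = m;
            }
        }
        if (best == null) {
            return null; // no configuration for this request
        }
        best.timesReturned++;
        return best.reply;
    }

    public static void main(String[] args) {
        InteractionModelList list = new InteractionModelList();
        list.configure("register", "registered id=1"); // hypothetical configuration values
        list.configure("register", "registered id=2");
        System.out.println(list.replyFor("register")); // registered id=1
        System.out.println(list.replyFor("register")); // registered id=2
    }
}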
returns the message, which will be invoked by the Web Service client (see Figure
6). In the GetReplyTestCase1 method the response message is generated, which
should be transmitted to the client (see Figure 7). In the last row, the HTTP
header information is added, which is generated from the XML part.
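A hedged Java sketch of what a reply method such as GetReplyTestCase1 could do, wrapping an illustrative SOAP XML body in an HTTP response whose Content-Length is derived from the XML part (the envelope content and class name are assumptions, not the actual test code):

import java.nio.charset.StandardCharsets;

/** Illustrative builder of an HTTP response around a SOAP XML body (assumed content). */
public class SoapReplyBuilder {

    static String buildReply() {
        String xml = """
                <?xml version="1.0" encoding="UTF-8"?>
                <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
                  <soap:Body><GetReplyResponse><result>ok</result></GetReplyResponse></soap:Body>
                </soap:Envelope>
                """;
        int length = xml.getBytes(StandardCharsets.UTF_8).length;
        // The HTTP header is generated last, from the XML part.
        return "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/xml; charset=utf-8\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n"
                + xml;
    }

    public static void main(String[] args) {
        System.out.println(buildReply());
    }
}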
The third example is based on the presented use case. Not only the component itself
has to be simulated but also the behaviour of the application protocol (AQ). The AQ
protocol is based on commands, which are serialized to bytes, forwarded to the
transport layer, and deserialized on the other side. AQ uses the OpenWire format
for this communication. The ETM needs an AQ protocol implementation
to be able to communicate with components that use this MOM. However, AQ
provides a class (OpenWireFormat) that offers the possibility to serialize and
deserialize objects to/from a byte array. To create a valid connection, every response command needs a corresponding CommandId, every MessageDispatcher
a ConsumerId, and a Destination. The test case architecture is equivalent to the
previous example and is presented in Figure 8. First (Figure 8, 1), the ETM gets
configured and started. The configuration creates the interaction models that
are used to interact with the AQ client (.Net Bridge). The interaction models are
presented in Figure 5 and are related to Figure 10. Every dotted box stands
for a communication command of AQ. For every message that is sent/received
by the .Net Bridge, a new producer/consumer is opened. It follows that the
commands are the same for all these sockets. In the code, this is specified by
setting the socket number to -1. The answers for the .Net Bridge have to be configured, which are shown as boxes (solid lines) in Figure 10. The first message from
the .Net Bridge is sent on socket 2 and is a create message. According
to the definition of a connector, a void message is needed, i.e., the ETM sends a
void message back. This void message is wrapped in an ActiveMQTextMessage
command, which itself is wrapped in a MessageDispatcher. This behaviour is the
same for the register, unregister, and delete messages. Next, the .Net Bridge has
to be started (Figure 8, 2). Once the registration has completed successfully,
method calls on the bridge can be triggered to test the correct behaviour. This is
done by forwarding a configuration to the ETM, which finds the correct socket
and forwards the request to the .Net Bridge. This request has to be sent to the
receive queue, which by the definition of the .Net Bridge is always on socket
0. To be able to send a valid request to the correct consumer, the ConsumerId
and Destination of the receive queue are required. This information is stored in
the ETM itself. Next, the .Net Bridge has to be closed in a correct way, which
implies sending unregister and delete messages. This behaviour is already configured (Figure 9). In the last step (Figure 8, 3) the correct behaviour is tested with
normal unit tests. The method call triggered by the ETM has set some variables;
these values are then tested for correctness with the
normal test strategies.
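As a hedged sketch of the serialization step only (wire-format version negotiation, the MessageDispatcher wrapping, and the concrete command types described above are omitted, and this is not the ETM code), the OpenWireFormat class can be used roughly as follows to deserialize a received command and serialize an answer that carries the matching CommandId:

import java.io.IOException;
import org.apache.activemq.command.Command;
import org.apache.activemq.command.Response;
import org.apache.activemq.openwire.OpenWireFormat;
import org.apache.activemq.util.ByteSequence;

/** Hedged sketch of OpenWire (de)serialization for a simulated responder. */
public class OpenWireSketch {

    private final OpenWireFormat wireFormat = new OpenWireFormat();

    /** Turns received bytes into a command and answers with a Response carrying the matching CommandId. */
    public byte[] answer(byte[] received) throws IOException {
        Command request = (Command) wireFormat.unmarshal(new ByteSequence(received));

        Response response = new Response();
        response.setCorrelationId(request.getCommandId()); // link the answer to the request

        ByteSequence out = wireFormat.marshal(response);
        byte[] bytes = new byte[out.getLength()];
        System.arraycopy(out.getData(), out.getOffset(), bytes, 0, out.getLength());
        return bytes;
    }
}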
Fig. 8. ActiveMQ and .Net Bridge communication
Fig. 9. ActiveMQ ETM configuration method
7 Evaluation
The advantage of the ETM is that test cases can be executed at any time and
do not depend on a finished implementation of a component. Furthermore,
specific components can be tested without the need to start up the complete
system. Therefore, test cases can be created once and then executed at any time
without any requirements on the system.
7.2 Effectiveness
The effectiveness is measured by the different kinds of bugs the ETM can help to
find. An advantage of the ETM is that it supports the generation of user-defined
messages, which may be inconsistent with the definition of the component. ETM
therefore enables testing components with faulty messages. Since ETM does not
execute test cases itself but only simulates other components, the effectiveness
depends on the effectiveness of the created test cases. However, this also implies
that the ETM can simulate components which do not exist yet. From the view
of the software development process, developers can work independently of the
system, which improves the effectiveness of the complete team.
In the following, effectiveness is also discussed in terms of the time needed to implement
tests and the time needed for execution. The SOAP and ActiveMQ protocols are
compared, and together with the complexity and implementation time of the test
case the effectiveness of the ETM is illustrated. It needs to be mentioned that
the test case using the SOAP protocol is very short, mainly because the component is started in one line only. The complete test case consists of 64 LOC and it
took 20 minutes of implementation time. In the case of the ActiveMQ protocol, reuse
of ActiveMQ libraries reduced implementation time. While the protocol code has
82 LOC and took twenty minutes to implement, the behaviour also has to be implemented, which consists of 151 LOC and needed about 40 minutes of effort. The implementation and creation of the protocol and/or the behaviour of the component
costs time, but this is only a one-time effort. The AQ example, for instance, needs in total 20 minutes for implementing the protocol, the behaviour, and the test
case. In contrast, the test case itself needs only 15 minutes and 20 LOC. It follows
that all other test cases can reuse the configuration and the protocol. This greatly improves
effectiveness and makes implementing further test cases
very easy. However, the developer of the protocol has to understand the behaviour
and the complexity of the protocol to properly handle the exchange of messages.
7.3 Efficiency
With the traditional approach, some test cases can be created at any time of the
process, but the tests fail until all the required components are completely
implemented, which leads to errors being found very late. The ETM simulates the
components and can therefore also simulate components which are not implemented
yet. With the traditional approach, the components cannot be tested in isolation,
i.e. the dependencies have to be present and running, which implies that the dependent
components have to be error free. This is not guaranteed, and so errors can occur both
in the component under test and in the dependent components. It follows that the
error is challenging to locate. The benefit of the traditional approach is that the tests
are always communicating with the newest versions of the dependent components.
Furthermore, the communication between the components exists, and so no
simulation has to be created. From the view of performance, the traditional approach
implies a start-up time. In the case in which all the dependent components are present,
these have to be started, configured and ready before the test case starts. In contrast,
only the ETM has to be started before the test cases can be executed. Because the
traditional approach does not need an implementation of the behaviour and the
protocol, there is no search for a corresponding message. This implies that the
communication between the components is faster, because the ETM needs some time
for converting the message to the chosen protocol and for searching the corresponding
message.
The execution time for the ETM is 7017 ms without start-up and 7051 ms with
start-up and clean-up. The traditional approach (i.e. starting all dependent
components) needs 1516 ms without start-up and 45613 ms with start-up and
clean-up. The ETM and the traditional approach are both executed with the presented
use case. The start-up for the traditional approach is the following: starting the
OpenEngSB, providing the corresponding domain, and executing the test case. It
follows that the traditional approach needs a very long time to execute a single test,
which is around 46 seconds for the use case. In contrast, the ETM needs about
7 seconds to execute a test case, which is very fast compared to the start-up and
execution of the traditional approach. A disadvantage is that the ETM still needs
7 seconds even without a configuration: considering pure execution time (without
start-up), the traditional approach is about five times faster than the ETM approach.
By including the time to implement the protocol and behaviour of a used protocol, the
ETM needs 93 executions of test cases (Figure 11) to break even with the traditional
approach.
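These figures are consistent with a simple break-even estimate (assuming a one-time
effort of roughly 20 + 40 = 60 minutes, i.e. 3600 s, for implementing the protocol and
the behaviour, and a saving of about 46 s - 7 s = 39 s per executed test):
3600 s / 39 s ≈ 92.3, i.e. about 93 test executions.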
Figure 11 is based on the execution times of the test cases for the presented use case.
Generally, several test cases are created and executed at the same time. This implies
that the threshold of 93 executions is reached very quickly, so that the ETM quickly
justifies the implementation time of the protocol and the behaviour.
8 Discussion
This section discusses the advantages and limitations of the ETM by comparing it to
existing concepts. The ETM approach is language independent because it listens on
the transport layer. The transport layer is standardised and is therefore offered by
many programming languages. Furthermore, every operating system supports this
layer, which implies that components can be tested in different environments,
operating systems and languages. As introduced in Section 2, testing is a part of the
software engineering process. For every development state a test scenario has to be
created.
As presented in the related work, concepts and approaches already exist for testing
component-based systems. Mock-up frameworks simulate components and allow
tests to be performed with the normal test strategies (for example unit tests). This
approach is applicable to a component only if its source code is available; otherwise,
mock-up frameworks cannot be used because the inner structure has to be known (as
in white-box testing). It is challenging to use a mock-up framework with the presented
approach because the AQ has to be mocked as well. Furthermore, every send and
receive method has to be mocked in a way that every message is correctly
represented. The test cases for the ETM and the mock-up are the same, and both have
to simulate the behaviour of the component. It follows that the time to implement the
behaviour is very similar.
The state-based components approach cannot be used for the use case because not all
of the components are state-based. The hybrid approach could not be applied to the
presented use case either, because its basic idea targets components whose design
details are not given, whereas the complete design is known in the presented use case.
Furthermore, black-box testing is only possible in the use case when the AQ can
synchronise itself. This requires that a dependent component offers an AQ connection;
otherwise the component tries to open a connection, which results in an exception.
The JRT framework enables testing of remote server components. The concept uses
the black-box testing strategy and compares the received result with the expectations.
The approach is applicable to server components but is not usable for the presented
use case, mainly because the component under test invokes methods on other
components and JRT cannot handle this kind of method call (JRT would test the
EngSB). The JATA framework is a powerful framework to test components.
1 Introduction
To support developers in the tedious task of writing and updating unit test
suites, white-box testing techniques analyze program source code and automati-
cally derive test cases targeting different criteria. These unit tests either exercise
automated test oracles, for example by revealing unexpected exceptions, or help
in satisfying a coverage criterion. A prerequisite for an effective unit test gen-
erator is that as many language features of the target programming language as
possible are supported; otherwise the quality of the generated tests and the
usefulness of the test generation tools will be limited.
A particular feature common to many modern programming languages such
as Java is generics [12]: Generics make it possible to parameterize classes and
methods with types, such that a class can be instantiated for different types. A
common example is container classes (e.g., list, map, etc.), where generics can
be used to specify the type of the values in the container. For example, in Java a
List<String> denotes a list in which the individual elements are strings. Based
on this type information, any code using such a list will know that parameters
to the list and values returned from the list are strings. While convenient for
programmers, this feature is a serious obstacle for automated test generation.
In this short paper, we present a simple automated approach to generating
unit tests for code using generics by statically determining candidate types for
generic type variables. This approach can be applied in random testing, or in
search-based testing when exploring type assignments as part of a search for a
test suite that maximizes code coverage. We have implemented the approach in
the EvoSuite test generation tool and demonstrate that it handles cases not
covered by other popular test data generation tools. Furthermore, to the best of
our knowledge, there is no technique in the literature that targets Java
generics.
2 Background
In this paper, we address the problem of Java generics in automated test genera-
tion. To this purpose, this section presents the necessary background information
on generics and test generation, and illustrates why generics are problematic for
automated test generation.
Consider, for example, a generic class Stack whose pop method is declared as follows:
public T pop() {
    return data[pos--];
}
The class Stack has one type parameter, T. Within the definition of class
Stack, T can be used as if it were a concrete type. For example, data is defined
as an array of type T, pop returns a value of type T, and push accepts a parameter
of type T. When instantiating a Stack, a concrete value is assigned to T:
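The instantiation listing itself appears to be missing from this excerpt; a minimal
sketch of what it might look like (the variable names are illustrative, not taken from
the original):

Stack<String> stringStack = new Stack<String>();
stringStack.push("test");
String s = stringStack.pop();

Stack<Integer> integerStack = new Stack<Integer>();
integerStack.push(42);
Integer i = integerStack.pop();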
Thanks to generics, the Stack can be instantiated with any type, e.g., String
and Integer in this example. The same generic class is thus reused, and the
compiler can statically check whether the values that are passed into a Stack
and returned from a Stack are of the correct type.
It is possible to put constraints on the type parameters. For example, consider
the following generic interface:
public abstract class Foo<T extends Number> {
    private T value;
    // ...
}
This class can only be instantiated with types that are subclasses of Number.
Thus, Foo<Integer> is a valid instantiation, whereas Foo<String> is not. A
further way to restrict types is the super operator. For example, Foo<T super
Bar> would restrict type variable T to classes convertible to Bar or its super-
classes.
Note also that Foo is an abstract class. When creating a subclass or when
instantiating a generic interface, the type parameter can be concretized, or can
be assigned a new type parameter of the subclass. For example:
public class Bar extends Foo<Integer> {
// ...
}
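The listing for Zoo is missing from this excerpt; the following is a plausible sketch,
assuming (as the surrounding text suggests) a strengthened bound on the first type
variable and a second, unconstrained one:

public class Zoo<U extends Integer, V> extends Foo<U> {
    // ...
}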
The class Bar sets the value of T in Foo to Integer, whereas Zoo delays the in-
stantiation of T by creating a new type variable U. This means that inheritance can
be used to strengthen constraints on type variables. Class Zoo also demonstrates
that a class can have any number of type parameters, in this case two.
Sometimes it is not known statically what the concrete type for a type variable
is, and sometimes it is irrelevant. In these cases, the wildcard type can be used.
For example, if we have different methods returning instances of Foo and we do
not care what the concrete type is as we are only going to use method get which
works independently of the concrete type, then we can declare this in Java code
as follows:
Foo<?> foo = ... // some method returning a Foo
double value = foo.get();
Generics are a feature that offers type information for static checks in the
compiler. However, this type information is not preserved by the compiler. That
is, at runtime, given a concrete instance of a Foo object, it is not possible to
know the value of T (the best one can do is guess by checking value). This is
known as type erasure, and is problematic for dynamic analysis tools.
The Java compiler also accepts generic classes without any instantiated type
parameters. This is what most dynamic analysis and test generation tools use, and
it essentially amounts to assuming that every type variable represents Object.
Stack stack = new Stack();
:::::::::::::::::::::::::
stack.push("test");
Object o = stack.pop();
As indicated with the waved underline, a compiler will issue a warning in such
a case, as static type checking for generic types is impossible this way.
Besides classes, it is also possible to parameterize methods using generics. A
generic method has a type parameter which is inferred from the values passed
as parameters. For example, consider the following generic method:
public class Foo {
public <T> List<T> getNList(T element, int length) {
List<T> list = new ArrayList<T>();
for(int i = 0; i < length; i++)
list.add(element);
return list;
}
}
The method getNList creates a generic list of length length with all ele-
ments equal to element. List is a generic class of the Java standard library, and
the generic parameter of the list is inferred from the parameter element. For
example:
Foo foo = new Foo();
List<String> stringList = foo.getNList("test", 10);
List<Integer> intList = foo.getNList(0, 10);
In this example, the same generic method is used to generate a list of strings
and a list of numbers.
Search-based testing has the advantage of being suitable for almost any testing
problem, and generating unit tests is an area where it has been particularly successful.
In search-based testing [1,10], the problem of test data generation is cast as a
search problem, and search algorithms such as hillclimbing or genetic algorithms
are used to derive test data. The search is driven by a fitness function, which
is a heuristic that estimates how close a candidate solution is to the optimum.
When the objective is to maximize code coverage, then the fitness function does
not only quantify the coverage of a given test suite, but it also provides guidance
towards improving this coverage. To calculate the fitness value, test cases are
usually executed using instrumentation to collect data. Guided by the fitness
function, the search algorithm iteratively produces better solutions until either
an optimal solution is found, or a stopping condition (e.g., timeout) holds.
The use of search algorithms to automate software engineering tasks has been
receiving a tremendous amount of attention in the literature [8], as they are well
suited to address complex, non-linear problems.
So why are generics a problem for test generation? Java bytecode has no informa-
tion at all about generic types: anything that was generic in the source code
is converted to type Object in Java bytecode. However, many modern software
analysis and testing tools work with Java bytecode rather than source code. Some
information can be salvaged using Java reflection: It is possible to get the exact
signature of a method. For example, in the following snippet we can determine
using Java reflection that the parameter of method bar is a list of strings:
public class Foo {
public void bar(List<String> stringList) {
// ...
}
}
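The generic variant of Foo that the following sentences refer to appears to have been
dropped from this excerpt; a plausible sketch (reconstructed, not verbatim from the
original):

public class Foo<T> {
    public void bar(List<T> list) {
        // ...
    }
}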
If we now query the signature of method bar using Java reflection, we learn
that the method expects a list of T. But what is T? Thanks to type erasure, it is
impossible to know, for a given instance of Foo, what the type of T is.¹
¹ Except if we know the implementation of Foo and List, so that we can use reflection
to dig into the low-level details of the list member of Foo to find out what the type
of the internal array is.
Our only hope is if we know how Foo was generated. For example, as part
of the test generation we might instantiate Foo ourselves — yet, when doing
so, what should we choose as concrete type for type variable T? We did not
specify a type boundary for T in the example, and the implicit default boundary
is Object. Consequently, we have to choose a concrete value for type T out of
the set of all classes that are assignable to Object. In this example, this means
all classes on the classpath are candidate values for T, and typically there are
many classes on the classpath.
However, things get worse: It is all too common that methods are declared to
return objects or take parameters of generic classes where the type parameters
are given with a wildcard type. Even worse, legacy or sloppily written code
may even completely omit type parameters in signatures. A wildcard type or
an omitted type parameter looks like Object when we inspect it with Java
reflection. So if we have a method that expects a List<?>, what type of list
should we pass as parameter? If we have received a List<?>, all we know is that
if we get a value out of it, it will be an Object. In such situations, if we do not
guess the right type, then we likely end up producing useless tests that result in
ClassCastExceptions and cover no useful code. To illustrate this predicament,
consider the following snippet of code:
@Test
public void testTyping() {
    List<String> listString = new LinkedList<String>();
    listString.add("This is a list for String objects");
    listString.add("Following commented line would not compile");
    // List<Integer> listStringButIntegerType = listString;
    List erasedType = listString;
    List<Integer> listStringButIntegerType = erasedType;
    listStringButIntegerType.get(0).intValue();
}
The signature of bar does not reveal what the exact type of the parameter should
be. In fact, given an object of type Foo, querying the parameters of method bar
using Java's standard reflection API reveals that baz is of type Object! Fortunately,
the reflection API was extended starting in Java version 1.5, and the extended API
adds for each original method a variant that returns generic type information. For
example, whereas the standard way to access the parameters of a
java.lang.reflect.Method object is via method getParameterTypes, there is also a
generic variant getGenericParameterTypes.
However, this method only informs us that the type of baz is T.
Consequently, the test generator needs to consider the concrete instance of
Foo on which this method is called, in order to find out what T is. Assume the
test generator decides to instantiate a new object of type Foo. At this point, it
is necessary to decide on the precise type of the object, i.e., to instantiate the
type parameter T. As discussed earlier, any class on the classpath is assignable
to Object, so any class qualifies as candidate for T. As randomly choosing a type
out of the entire set of available classes is not a good option (i.e., the probability
of choosing an appropriate type would be extremely low), we need to restrict
the set of candidate classes.
To find a good set of candidate classes for generic type parameters, we can
exploit the behaviour of the Java compiler. The compiler removes the type in-
formation as part of type erasure, but if an object that is an instance of a type
described by a type variable is used in the code, then before the use the compiler
inserts a cast to the correct type. For example, if type variable T is expected to
be a string, then there will be a method call on the object representing the string
or it is passed as a string parameter to some other method. Consequently, by
looking for casts in the bytecode we can identify which classes are relevant. Be-
sides explicit casts, a related construct giving evidence of the concrete type is the
instanceof operator in Java, which takes a type parameter that is preserved in
the bytecode.
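As an illustration (a hypothetical class, not taken from the paper), consider a CUT
that iterates over a List<String>: after erasure each element is retrieved as Object,
and the compiler inserts a checkcast to java.lang.String before the call to length(),
which the analysis picks up as evidence for the candidate type String:

import java.util.ArrayList;
import java.util.List;

public class Report {
    private List<String> lines = new ArrayList<String>();

    public int totalLength() {
        int total = 0;
        for (String line : lines) {   // erased iterator returns Object; the compiler
            total += line.length();   // emits a checkcast to java.lang.String here
        }
        return total;
    }
}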
The candidate set is initialized with the default value Object. To collect
the information about candidate types, we start with the dedicated class un-
der test (CUT), and inspect the bytecode of each of its methods for checkcast
or instanceof operators. In addition to direct calls in the CUT, the parameters
may be used in subsequent calls, therefore this analysis needs to be interproce-
dural. Along the analysis, we can also consider the precise method signatures,
which may contain concretizations of generic type variables. However, the fur-
ther away from the CUT the analysis goes, the less related the cast may be to
covering the code in the CUT. Therefore, we also keep track of the depth of the
call tree for each type added to the set of candidate types.
Now when instantiating a generic class Foo, we randomly choose values for
its type parameters out of the set of candidate classes. The probability of a type
being selected is dependent on the depth in the call tree, and in addition we
need to determine the subset of the candidate classes that are compatible with
the bounds of the type variable that is instantiated. Finally, it may happen that
a generic class itself ends up in the candidate set. Thus, the process of instan-
tiating generic type parameters is a recursive process, until all type parameters
have received concrete values. To avoid unreasonably large recursions, we put
an upper boundary on the number of recursive calls, and use a wildcard type if
the boundary has been reached.
We have extended the EvoSuite unit test generation tool [3] with support for
Java generics according to the discussed approach. EvoSuite uses a genetic
algorithm (GA) to evolve test suites. The objective of the search in EvoSuite
is to maximize code coverage, so the fitness function does not only quantify the
coverage of a given test suite, but it also provides guidance towards improving
this coverage. For example, in the case of branch coverage, the fitness function
considers for each individual branching statement how close it was to evaluating
to true and to false (i.e., its branch distance), and thus can guide the search
towards covering both outcomes.
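As a small illustration (a sketch of one common branch-distance definition, not
EvoSuite's exact implementation), the distance for the true branch of a predicate
a >= b could be computed as follows:

static double branchDistanceGreaterOrEqual(int a, int b, double k) {
    // 0 when the branch is already taken; otherwise the distance grows with how far
    // the operands are from satisfying the predicate (k > 0 penalizes a near miss)
    return (a >= b) ? 0.0 : ((double) (b - a)) + k;
}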
The GA has a population of candidate solutions, which are test suites, i.e.,
sets of test cases. Each test case in turn is a sequence of calls (like a JUnit test
case). EvoSuite generates random test suites as initial population, and these
test suites are evolved using search operators that mimic processes of natural
evolution. The better the fitness value of an individual, the more likely it is
considered for reproduction. Reproduction applies mutation and crossover, which
modify test suites according to predefined operators. For example, crossover
between two test suites creates two offspring test suites, each containing subsets
from both parents. Mutation of test suites leads to insertion of new test cases, or
change of existing test cases. When changing a test case, we can remove, change,
or insert new statements into the sequence of statements. To create a new test
case, we simply apply this statement insertion on an initially empty sequence
until the test has a desired length. Generic classes need to be handled both when
generating the initial random population, and during the search, when test cases
are mutated. For example, mutation involves adding and changing statements,
both of which require that generic classes are properly handled. For details on
these search operators we refer to [5].
To demonstrate the capabilities of the improved tool, we now show several sim-
ple examples on which test generation tools that do not support generics fail (in
particular, we verified this on Randoop [11], Dsc [9], Symbolic PathFinder [2],
and Pex [13]).
The first example shows the simple case where a method parameter specifies the
exact signature. As the method accesses the strings in the list (by writing them to
the standard output), outwitting the compiler by providing a list without type
information is not sufficient to cover all the code: anything but an actual list of
strings would lead to a ClassCastException. Thus, the task of the test generation
tool is to produce an instance that exactly matches this signature:
import java.util.List;
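The class listing itself appears to be missing here; the following is a plausible sketch,
continuing after the import shown above and consistent with the surrounding
description and the generated tests below (the method name stringListInput is taken
from the tests, while the method body is an assumption):

public class GenericParameter {
    public boolean stringListInput(List<String> list) {
        for (String s : list) {
            // writing the strings to standard output forces a ClassCastException
            // if a raw list containing non-String elements is passed in
            System.out.println(s);
        }
        return !list.isEmpty();
    }
}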
For this class, EvoSuite produces the following test suite to cover all branches:
public class TestGenericParameter {

    @Test
    public void test0() throws Throwable {
        GenericParameter genericParameter0 = new GenericParameter();
        LinkedList<String> linkedList0 = new LinkedList<String>();
        linkedList0.add("");
        linkedList0.add("");
        linkedList0.add("");
        boolean boolean0 = genericParameter0.stringListInput(linkedList0);
        assertEquals(true, boolean0);
    }

    @Test
    public void test1() throws Throwable {
        GenericParameter genericParameter0 = new GenericParameter();
        LinkedList<String> linkedList0 = new LinkedList<String>();
        boolean boolean0 = genericParameter0.stringListInput(linkedList0);
        assertEquals(false, boolean0);
    }
}
As a second example, consider a generic class that has different behavior based on
what type it is instantiated with, such that a test generator needs to find appropriate
values for type parameter T in order to cover all branches:
import java.util.List;
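The GenericsExample listing is likewise missing from this excerpt; a plausible sketch
consistent with the tests below and the later discussion (two type parameters; the
mapping of input types to the return values 0, 1, 2, and 3 is inferred, not verbatim):

public class GenericsExample<T, U> {
    public int typedInput(T input) {
        if (input instanceof String)
            return 0;
        if (input instanceof Integer)
            return 1;
        if (input instanceof java.net.ServerSocket)
            return 2;
        return 3;
    }
}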
@Test
public void test1() throws Throwable {
    GenericsExample<Integer, Object> genericsExample0 = new GenericsExample<Integer, Object>();
    int int0 = genericsExample0.typedInput((Integer) 1031);
    assertEquals(1, int0);
}

@Test
public void test2() throws Throwable {
    GenericsExample<String, ServerSocket> genericsExample0 = new GenericsExample<String, ServerSocket>();
    int int0 = genericsExample0.typedInput("");
    assertEquals(0, int0);
}

@Test
public void test3() throws Throwable {
    GenericsExample<Object, Object> genericsExample0 = new GenericsExample<Object, Object>();
    Object object0 = new Object();
    int int0 = genericsExample0.typedInput(object0);
    assertEquals(3, int0);
}
}
To evaluate the examples, we compare with other test generation tools for Java
described in the literature. Research prototypes are not always freely available,
hence we selected tools that are not only available online, but also popular (e.g.,
highly cited and used in different empirical studies). In the end, we selected
Randoop [11], Dsc [9], Symbolic PathFinder [2], and Pex [13].
Like for EvoSuite, setting up Randoop to generate test cases for
GenericsExample is pretty straightforward. Already after a few seconds, it has
generated JUnit classes with hundreds of test cases. However, Randoop gen-
erated no tests that used a list as input for the stringListInput method (0%
coverage) or a ServerSocket for typedInput.
A popular alternative to search-based techniques in academia is dynamic symbolic
execution (DSE). For Java, available DSE tools are Dsc [9] and Symbolic
PathFinder [2]. However, these tools assume static entry functions (Dsc), or
appropriate test drivers that take care of setting up object instances and selecting
methods to test symbolically.² As choosing the “right” type for a GenericsExample
object instantiation is actually part of the testing problem (e.g., consider the
typedInput method), JPF does not seem to help in this case.
To overcome this issue, we also investigated Pex, probably the most popular DSE
tool, although it targets C#. However, generics also exist in C#, so this is a good
opportunity to demonstrate that the problem addressed in this paper is not specific
to Java.
We translated the GenericsExample class to make a comparison, resulting in
the following C# code.
using System;
using System.Collections.Generic;
Note that we replaced the TCP socket class with a Stack class such that the
example can also be used with the web interface for Pex, PexForFun.³ Like the
other DSE tools, Pex assumes an entry function, which is typically given by a
parameterized unit test.
² http://javapathfinder.sourceforge.net/, accessed June 2013.
³ http://www.pexforfun.com/, accessed June 2013.
For example, the following code shows an entry function
that can be used to explore the example code using DSE:
using Microsoft.Pex.Framework;
[PexClass]
public class TestClass {
[PexMethod]
public void Test<T>(T typedInput) {
var ge = new GenericsExample<T>();
ge.typedInput(typedInput);
}
}
This test driver and the GenericsExample class can be used with Pex on
Pex4Fun, and doing so reveals that Pex does not manage to instantiate T at all
(it just attempts null).
5 Conclusions
Generics are an important feature in object-oriented programming languages like
Java. However, they pose serious challenges for automated test case generation
tools. In this short paper, we have presented simple techniques to handle Java
generics in the context of test data generation. Although we implemented these
techniques as part of the EvoSuite tool, they could be used in any Java test
generation tool.
We showed the feasibility of our approach on artificial examples. While Evo-
Suite was able to achieve 100% coverage in all these examples quickly, other
test generation tools fail — even if they would be able to handle equivalent code
without generics (e.g., in the case of Randoop).
Beyond the simple examples of feasibility, as future work we will perform large
scale experiments to determine how significant the effect of generics support is
in practice, for example using the SF100 corpus of open source projects [4].
EvoSuite is a freely available tool. To learn more about EvoSuite, visit our
Web site:
http://www.evosuite.org
Acknowledgments. This project has been funded by a Google Focused Re-
search Award on “Test Amplification” and the Norwegian Research Council.
References
1. Ali, S., Briand, L., Hemmati, H., Panesar-Walawege, R.: A systematic review of
the application and empirical investigation of search-based test-case generation.
IEEE Transactions on Software Engineering (TSE) 36(6), 742–762 (2010)
2. Anand, S., Păsăreanu, C.S., Visser, W.: JPF–SE: A symbolic execution extension to
java pathFinder. In: Grumberg, O., Huth, M. (eds.) TACAS 2007. LNCS, vol. 4424,
pp. 134–138. Springer, Heidelberg (2007)
3. Fraser, G., Arcuri, A.: EvoSuite: Automatic test suite generation for object-oriented
software. In: ACM Symposium on the Foundations of Software Engineering (FSE),
pp. 416–419 (2011)
4. Fraser, G., Arcuri, A.: Sound empirical evidence in software testing. In: ACM/IEEE
International Conference on Software Engineering (ICSE), pp. 178–188 (2012)
5. Fraser, G., Arcuri, A.: Whole test suite generation. IEEE Transactions on Software
Engineering 39(2), 276–291 (2013)
6. Fraser, G., Zeller, A.: Mutation-driven generation of unit tests and oracles. IEEE
Transactions on Software Engineering (TSE) 28(2), 278–292 (2012)
7. Godefroid, P., Klarlund, N., Sen, K.: Dart: directed automated random testing. In:
ACM Conference on Programming language design and implementation (PLDI),
pp. 213–223 (2005)
8. Harman, M., Mansouri, S.A., Zhang, Y.: Search-based software engineering:
Trends, techniques and applications. ACM Computing Surveys (CSUR) 45(1), 11
(2012)
9. Islam, M., Csallner, C.: Dsc+mock: A test case + mock class generator in sup-
port of coding against interfaces. In: International Workshop on Dynamic Analysis
(WODA), pp. 26–31 (2010)
10. McMinn, P.: Search-based software test data generation: A survey. Software Test-
ing, Verification and Reliability 14(2), 105–156 (2004)
11. Pacheco, C., Lahiri, S.K., Ernst, M.D., Ball, T.: Feedback-directed random test gen-
eration. In: ACM/IEEE International Conference on Software Engineering (ICSE),
pp. 75–84 (2007)
12. Parnin, C., Bird, C., Murphy-Hill, E.: Adoption and use of Java generics. Empirical
Software Engineering, 1–43 (2012)
13. Tillmann, N., de Halleux, N.J.: Pex — white box test generation for .NET. In:
International Conference on Tests And Proofs (TAP), pp. 134–253 (2008)
Constraint-Based Automated Generation of Test Data
1 Introduction
The construction of comprehensive test data sets for large software applications is a
complex, time consuming, and error-prone process. The sets have to include valid
data records in order to support positive functional tests, as well as invalid data for
negative tests.
In order to exert pressure onto the application one needs appropriate test coverage
of the space of possible input data. Appropriate coverage requires seeking challenging
input values for input fields, such as extreme or otherwise special values (ESVs),
while simultaneously fulfilling all required validation rules for the input data, or,
conversely, explicitly violating some of them. In order to limit the test execution time,
the size of the test data set (i.e. the total number of test data records) should be small,
which forces one to incorporate as many ESVs as possible into each test data record.
Large applications may require thousands of input values, which are subject to a
similar number of validation rules. Test data sets have to be generated frequently.
Therefore the automation of the entire data generation process is not only highly
desirable, but a necessity for business-critical applications. This paper presents how
we solved the problem of generating high-quality artificial test data sets that are
mainly used within an efficient automated quality assurance process.
For each input item, i.e. a field which, as explained above, may appear in several
instances, values need to be generated. The latter must comprise as many predefined
extreme and special values (ESVs) as possible. Each ESV should, if feasible at all,
occur in at least one test data record. There is no requirement to consider many
(all?) combinations of interesting values, as a “combinatorial explosion” would
arise. If a test requires a particular combination of values, then these values are
predefined for the test data generation process, or such test cases are executed
manually.
The execution of functional test cases (each using a single record from the test
data set) is time consuming, since each test may run for several minutes. Therefore
the size of a set with sufficient test coverage, as defined above, should be minimized.
This requirement entails that the ESV density in each test data record should be
maximized.
Within a software development cycle the rule-base associated with an application
usually undergoes several changes. If such changes are small, a large proportion of
existing test data is usually still valid, and can therefore be used for testing the
application. But if the rule-base changes are substantial, new test data will have to be
generated. This requirement entails that the test data generation process has to be
correspondingly fast. For reasons of practicality our goal has been to accomplish a
turn-around time of one day between the delivery of a new rule-base and the end of
the data generation process.
For the test of large applications it is impossible to manually generate a sufficient
amount of test data records with the required quality and within such stringent time
limits. Therefore an automated generation approach is mandatory.
Fig. 2. Procedure for testing using a new or updated rule set, or a new version of the
application. The necessary test data are generated by the R-TDG.
3.1 Translating Field and Rule Definitions into Variable and Constraint
Definitions
Several configuration parameters govern the whole data generation process. Two of
those influence the translation, namely the actual number of instances of the forms
holding the fields (e.g. instances for up to 14 children in a tax declaration), and the
actual number of field instances. We refer to the actual numbers of form and field
instances as their “multiplicities”.
Consider, for example, the declaration of travel expenses: one “line” (consisting of
a set of field instances) might consist of the number of trips, the origin (i.e. the
address of your employer), the destination (i.e. the address of your destination), and
the distance. A validation rule might restrict the number of trips to 366. When we
generate test data, we have to set the desired multiplicity to less than 366 for the
fields above.
A form multiplicity of m combined with a field multiplicity of n leads to m times n
input items. The multiplicity values m and n can be quite large (1000 and more), but
in practice we rarely use values exceeding 2.
An important obstacle is the fact that an input item may be empty. (An item being
empty is equivalent to a Java variable containing the value null.) No SMT-solver
known to us can handle variables that have no value. Therefore,
for each item, we have introduced a binary “occupation variable” that holds the
information whether the corresponding “value variable” is null or not. Such a variable
pair is semantically equivalent to a single variable whose value may be unspecified.
For instance, a field item of type Boolean is translated into a variable pair that
together can represent the three values: true, false, and undefined. We therefore refer
to the logic implemented in the rules as “three-valued” logic.
Fields are declared in field definitions which, apart from the data type, hold
additional information such as the minimum/maximum field length (number of
characters). This information has to be translated into variable and constraint
definitions that are comprehensible to the SMT-solver. Each field acts as a template,
which is translated into 2 * m * n corresponding variables.
The rules, which form the other half of the input to the R-TDG, are also mere
templates since initially only the maximum number of instances of the forms and of
the fields occurring in those rules are specified. Therefore each abstract rule must be
replicated according to the actual multiplicity of the forms and fields occurring in that
rule. As a consequence, each abstract rule is translated into several concrete con-
straints. The constraints have to be formulated in such a way that they properly
accommodate the value and occupation variables explained above. In effect, each rule
is normally translated into m * n corresponding constraints.
Another important issue to deal with is functions that occur in rule conditions.
SMT-solvers do not comprehend proper functions, but only constraints over a very
limited set of data types. Therefore each function occurring in a rule condition needs
to be translated into a corresponding equivalent constraint.
The three-valued logic vastly complicates all logical expressions and all functions
operating on field items. Consider the simple predicate z = max2(x, y). This expands to
4 cases:
1. z = max(x,y), if both x and y have values (here max is the ordinary maximum
function, which can be resolved into a logical condition),
2. z = x, if x has a value and y does not,
3. z = y, if y has a value and x does not,
4. z = lb, if neither x nor y is defined (here lb is the lower bound for the variables x
and y).
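For illustration only (not the authors' implementation), the same case analysis can be
written as a small Java helper, with null representing an unoccupied item and lb the
common lower bound:

static int max2(Integer x, Integer y, int lb) {
    if (x != null && y != null) return Math.max(x, y);  // case 1: both occupied
    if (x != null) return x;                            // case 2: only x occupied
    if (y != null) return y;                            // case 3: only y occupied
    return lb;                                          // case 4: neither occupied
}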
Decimal Numbers
A decimal number is a rational number that possesses a representation with a finite
number of digits after the decimal separator. For instance, a currency amount is a
decimal number with at most two decimal digits. On the other hand the rational
number 1/3 is not a decimal number.
In order to represent a decimal number within a CSP we use a pair consisting of a
rational number and an accompanying constraint (called “decimal constraint”). The
latter requires that a certain multiple of the number must be an integer. E.g. a standard
currency amount with 2 post-decimal digits (say Euros with Cents) multiplied by 100
must be an integer.
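Outside the solver, the same condition is easy to check; a minimal sketch, assuming
the value is available as a java.math.BigDecimal and d is the allowed number of
post-decimal digits:

import java.math.BigDecimal;

static boolean satisfiesDecimalConstraint(BigDecimal x, int d) {
    // x * 10^d must have no fractional part left
    return x.movePointRight(d).stripTrailingZeros().scale() <= 0;
}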
Unfortunately those innocent looking decimal constraints can lead to severe
performance issues for the SMT-solver.
Calendar Dates
Fields with values of type calendar date present another problem since the available
SMT-solvers do not encompass the data type “calendar date”. In the rule-base the
(external) representation of a date value is always a string, such as “11.03.1956”, but
a typical cross field constraint uses functions that require separate access to the day in
the month, the month, and the year of the calendar date. Comparisons with other date
variables or constants, such as before, at the same time, or later, may have to be
performed on parts of a date value, or on the whole value.
In order to enable a CSP-solver to operate on calendar dates, we represent any
calendar date by an integer equivalent to a ‘relative’ day, starting from 01.01.1900
(which is day 1). In an imperative or functional programming language it is easy to
implement accessor functions that retrieve the year, the month in the year, or the day
in the month from such a relative day. However, we are dealing with a declarative
CSP-language, and thus these accessors have to be implemented as constraints – a
non-trivial task.
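For illustration, such accessors are indeed straightforward in an imperative language;
a minimal Java sketch, assuming the epoch convention stated above (01.01.1900 is
relative day 1):

import java.time.LocalDate;

static LocalDate fromRelativeDay(int relativeDay) {
    return LocalDate.of(1900, 1, 1).plusDays(relativeDay - 1L);
}

static int yearOf(int relativeDay)  { return fromRelativeDay(relativeDay).getYear(); }
static int monthOf(int relativeDay) { return fromRelativeDay(relativeDay).getMonthValue(); }
static int dayOf(int relativeDay)   { return fromRelativeDay(relativeDay).getDayOfMonth(); }

Expressing the same year/month/day arithmetic as declarative constraints is what
makes date handling expensive for the solver.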
We found representations for all required date constraints. However, the run-times
are sometimes prohibitive. The only solution we have found so far consists in pre-
assigning suitable values to a sufficient number of the date variables involved, and
thereby removing these variables from the CSP in question. With this drastic measure
we have been able to cut down the run-times to reasonable values, at the expense of
losing some test-coverage.
There are string constraints which, in principle, cannot be treated by a string
generator acting in a post-processing phase. These comprise the equality and
inequality constraints between string variables, the substring function, and conversion
functions where a suitable string is converted to a number or a calendar date. These
constraints do actually occur in our rule-bases, and properly solving CSPs that contain
them would require a genuine string solver. In cooperation with the Technical
University Munich such a string solver has been developed [7], but it has not yet
been incorporated into the productive R-TDG. Therefore, in each of those cases a
handcrafted workaround is still required.
The R-TDG is expected not to produce arbitrary data records, but test data records
that put the software under test (SUT) under pressure. Therefore, as explained above,
the records have to include ESVs for the input items.
For a numeric input item an important ESV obviously is 0; other ESVs of interest
are the minimum and the maximum admissible values. For a calendar date field 28th
and 29th of February, and 1st of March are interesting ESVs. Dates at the boundaries
of a quarter such as 1st of January and 31st of March are also interesting values. For a
string field, the empty string and a string with the maximum allowed number of
characters are interesting ESVs. For string fields it is important that each admissible
character occurs in at least one ESV, if feasible at all. Of course, for all input items
the null value is an important ESV.
For the production of ESVs we have implemented an automated ESV-generator.
For each variable it produces a reasonable set of ESVs on the basis of the variable’s
generic field-type combined with some additional field-specific parameters such as
the minimum/maximum field length.
In addition to predefined ESVs we usually include some random values that are
treated in the same way as the deterministic ESVs.
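A minimal sketch of what such a generator might produce for a numeric field (the
exact ESV selection is an illustrative assumption, not the productive implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

static List<Long> numericEsvs(long min, long max, int randomCount, Random rng) {
    List<Long> esvs = new ArrayList<Long>();
    esvs.add(null);   // the empty (null) item is always an interesting ESV
    esvs.add(0L);     // zero
    esvs.add(min);    // minimum admissible value
    esvs.add(max);    // maximum admissible value
    for (int i = 0; i < randomCount; i++) {
        // a few random values, treated like the deterministic ESVs
        esvs.add(min + (long) (rng.nextDouble() * (max - min)));
    }
    return esvs;
}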
The number of ESVs for a given CSP is roughly proportional to the number of
variables present; on average, 3 to 5 ESVs are generated per variable contained in
the CSP.
A major breakthrough consisted in partitioning any given CSP into independent CSP-
components. Two variables belong to the same component when they occur together
only in constraints of that component. The components of a CSP can best be viewed
in the undirected primal constraint graph, in which each vertex corresponds to a
variable, and two vertices are linked when the corresponding variables appear
together in one of the constraints. Each component-CSP can be solved independently
of the others.
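A minimal sketch of such a partitioning step, assuming variables are identified by
integer indices and each constraint is given as the array of variable indices it mentions
(union-find over the primal constraint graph):

import java.util.List;

static int[] componentOf(int numVariables, List<int[]> constraints) {
    int[] parent = new int[numVariables];
    for (int v = 0; v < numVariables; v++) parent[v] = v;

    // link all variables that occur together in a constraint
    for (int[] c : constraints) {
        for (int i = 1; i < c.length; i++) {
            int a = find(parent, c[0]);
            int b = find(parent, c[i]);
            if (a != b) parent[a] = b;
        }
    }
    // flatten so every variable points directly to its component representative
    for (int v = 0; v < numVariables; v++) parent[v] = find(parent, v);
    return parent;
}

static int find(int[] parent, int v) {
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];   // path halving
        v = parent[v];
    }
    return v;
}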
One might think that CSP partitioning would be an integral part of any
SMT-solver, but that appears not to be the case. Fortunately, our external
CSP-partitioning technique offers some advantages:
1. The partial solutions to the component-CSPs can freely be combined to form
complete solutions to the parent CSP, i.e. valid test data records.
2. A trivial parallelization (i.e. concurrency) of the solution process becomes
feasible, which further reduces the make-span of the test data generation. We
usually employ two to four threads, each operating on a different CSP-
component. This approach exploits the resources of a state-of-the-art CPU with
four physical (eight virtual) cores quite well. The reduction of the make-span is
roughly proportional to the number of threads employed.
The run-time for solving a given CSP is usually dominated by the run-time for solving
the largest component-CSPs. Sometimes the largest component-CSPs are still too
complex for our SMT-solver. Here it would be helpful if we could somehow identify
one or more central variables which, if “removed” from the CSP, would permit the
latter to be decomposed into two or more independent components. Again, the child
CSPs should be more easily solvable than the parent CSP.
In order to tackle this problem we have developed a very fast, concurrent, graph-
based “vertex inbetweenness” cluster algorithm that iteratively identifies central graph
vertices. The algorithm works on the primal constraint graph of the problematic
component-CSP. A central vertex, once identified, can subsequently be removed from
the graph (figure 3). After we had developed our algorithm we found that it had
already been invented several decades before in the field of sociology [11].
In the context of CSPs a variable removal can be achieved by pre-assigning a value
to the corresponding variable. (The pre-assigned value effectively transforms the
variable into a constant which therefore disappears from the CSP.) Often the removal
of a single variable does not yet allow a decomposition of the CSP. If so, the cluster
algorithm has to be iterated until decomposition becomes possible.
The vertex-inbetweenness algorithm is remarkably effective at identifying the
important central vertices, which eventually allow a decomposition of the CSP into
independent CSP-components of roughly comparable size. (The algorithm is greatly
superior to, e.g., the simple vertex-cut algorithm, which often splits a given CSP into
one large and one tiny component.) From observing the effect of the vertex-
inbetweenness algorithm, and from analyzing its inner workings, we can state that
4.5 Timeouts
Sometimes the solution process for a component-CSP is well underway when, all of
a sudden, the SMT-solver “goes on strike”, meaning it does not return within
acceptable time. In such a situation a timeout is very helpful. When a process
in which an SMT-solver runs is overdue, it is cancelled. The latest ESV added to the
current component-CSP is considered to be the trouble-maker, and is eliminated from
further consideration. The solution process is then restarted, and it is guaranteed that
the process not only terminates, but terminates in acceptable time.
5 Architecture
A sketch of the architecture of the R-TDG is shown in figure 4. The basic idea is to
start with a rule set, translate it to a CSP with variables and constraints, partition the
CSP into independent components, solve the CSP by an SMT-solver (which is
accessed via a “Solver Adapter”), assemble the solutions of the CSPs, and translate
them back to valid test data records. Each architectural component contained in the
diagram is accompanied by a short description of its task.
The most complex components are the “Translator” and the “ESV Coverage
Handler”. The Translator remedies the discrepancy between the “real world”
definitions of fields and rules of the business domain and the available variable types
and constraint definitions of an SMT-solver. The ESV Coverage Handler tries to
include as many ESVs as are feasible into the resulting test data set, while keeping the
overall size of the set at a minimum.
6 Results
Table 1 presents the results of generating test data for some large form-centric SUTs.
For all SUTs the number of test data records is relatively small considering the
number of ESVs that have to be incorporated. In addition, in all cases the make-span
of generating test data is between ~10 and ~100 minutes, well below the one-day
target that we set when we began the development of the R-TDG.
Table 1. For several large form-centric software applications under test (SUTs) the table shows
the size of the CSP (measured by the number of variables and the number of constraints), the
number of CSP-components after partitioning and decomposition, the number of test data
records produced, and the make-span of the data generation. Form and field multiplicities were
set to two. Two threads were used in parallel.
7 Summary
We have presented a novel approach for generating artificial random test data that are
suitable for testing large form-centric software applications. Such applications contain
up to several thousand fields that the user may have to fill in. The very same single-
field and cross-field constraints that are used by a validator, for validating the user
input to the application, are also used by our test data generator.
The set of base constraints is augmented by simple additional constraints, which
insert into the solutions “extreme and special values” (ESVs), whose purpose is to
exert pressure onto the software application under test (SUT). The variations of the
initial constraint satisfaction problem (CSP) are solved by an off-the-shelf Satis-
fiability Modulo Theories (SMT) solver. Data types unknown to current SMT-solvers
such as decimal numbers, calendar dates, and strings, have to be treated in special
ways. Several heuristics, including an effective graph-clustering algorithm, have been
put in place in order to enable an efficient generation of test data. Even for large form-
centric applications the make-span for data generation has come down to less than
100 minutes.
For about four years, these data have been regularly used, mainly for automated tests
of several large form-centric software applications that are being developed by our
company.
8 Glossary
References
1. Dost, J., Nägele, R.: “jFunk Overview”, mgm technology partners GmbH (2012)
2. Howden, W.E.: Methodology for the Generation of Program Test Data. IEEE Transactions
on Computers C-24(5), 554–560 (1975)
3. Edvardsson, J.: Survey on Automatic Test Data Generation. In: Second Conf. on Computer
Science and Engineering in Linkoeping (ECSEL), pp. 21–28 (1999)
4. DeMillo, R.A., Offutt, A.J.: Constraint-based automatic test data generation. IEEE
Transactions on Software Engineering 17(9), 900–910 (1991)
5. Gotlieb, A., Botella, B., et al.: Automatic test data generation using constraint solving
techniques. ACM SIGSOFT Software Engineering Notes 23(2), 53–62 (1998)
6. Hooimeijer, P., Veanes, M.: An Evaluation of Automata Algorithms for String Analysis.
Redmond City, Microsoft Research (2010)
7. Braun, M.: A Solver for a Theory of Strings. Fakultät für Informatik, Technische
Universität München (2012)
8. Møller, A.: “Automaton.” Aarhus, Basic Research in Computer Science (BRICS) (2009)
9. Brüggemann-Klein, A.: Regular expressions into finite automata. In: Simon, I. (ed.)
LATIN 1992. LNCS, vol. 583, pp. 87–98. Springer, Heidelberg (1992)
10. de Moura, L., Bjørner, N.: Z3: An efficient SMT solver (2012)
11. Freeman, L.C.: A Set of Measures of Centrality Based on Betweenness. Sociometry 40,
35–41 (1977)
RUP Alignment and Coverage Analysis
of CMMI ML2 Process Areas for the Context
of Software Projects Execution*
Abstract. The simultaneous adoption of CMMI and RUP allows the definition
of “what to do” (with the support of CMMI) and “how to do” (with the support
of RUP) in the context of executing software development projects. In this
paper, our main contribution relates to the alignment of CMMI ML2 with RUP,
in the context of executing software projects and the analysis of RUP coverage.
We present the alignment for CMMI ML2 process areas, incorporating priority
mechanisms. The adopted case study allows the analysis of the way RUP
supports CMMI ML2 process areas taking into account the proposed alignment
and the theoretical coverage analyzed. For particular process areas, RUP can be
considered a good approach for CMMI ML2 implementation.
1 Introduction
Organizations worldwide are influenced and molded by reference models that rule
their activity, size or organizational culture. Regarding software development
organizations, reference models such as CMMI, SPICE (ISO/IEC 15504:1998),
ISO/IEC 9000, RUP, PMBOK, BABOK, PSP, ISO/IEC 9126, SWEBOK [1-3],
amongst many others, rule their behaviour and work. Although these reference models
address many different perspectives and sub-fields, their main purpose is to enhance
the quality of the developed software according to the final users’ needs [4]. Software
development organizations need to be aware that concern with the final product
(software) is not enough and that the development process itself must also be
improved. Development process means all activities necessary for managing,
developing, acquiring and maintaining software [5]. The teams must be able to
* This work has been supported by FEDER through Programa Operacional Fatores de
Competitividade – COMPETE and by Fundos Nacionais through FCT – Fundação para a
Ciência e Tecnologia in the scope of the project: FCOMP-01-0124-FEDER-022674.
evaluate the quality of the process in order to promote monitoring and thereby
detect deviations in advance.
Using reference models to assess software quality is nowadays not only a
minimum requirement for an organization’s survival [6] but also a business strategy
[7]. The focus of our work is on two reference models for software development
processes: RUP (Rational Unified Process) [7] and CMMI (Capability Maturity
Model Integration) [8, 9]. RUP and CMMI are used for different purposes: the first is
focused on the necessary activity flow and responsibilities, while the other guides and
assesses the maturity of the process in use. RUP and CMMI have a common goal:
improving software quality and increasing customer satisfaction. Both can be used
together, since they complement each other.
The main purpose of this work is to discuss whether, through the adoption of RUP
for small projects [10] in the execution of software development projects, it is
possible to achieve CMMI ML2. Our goal is to understand the support that we can
expect from RUP when executing software development projects in a CMMI-
compliant perspective. In a previous publication [11] we presented the CMMI-RUP
mapping for only two process areas (PP and REQM), strictly for the context of the
restricted effort of elaborating software project proposals (we did not consider the
execution phase of the project). The detailed motivation analysis for this work can be
found in [11], but generically we intend to determine whether RUP is a good
reference model to support a software development team in getting aligned with
CMMI ML2.
In this paper, we present the detailed mapping of CMMI ML2 (without SAM) PAs
into RUP tasks and activities for the execution of software development projects and
the analysis of the theoretical RUP coverage that we can expect for each PA. The
SAM PA is out of our study since this process area is not mandatory for most of
the companies. Since our concern is the project execution, we focus our analysis on
the RUP Inception, Elaboration and Construction phases. A case study illustrates the
usefulness of our CMMI-RUP mapping, where we interpret the obtained results in
terms of the teams’ performance while executing one software project and comparing
with the theoretical RUP coverage.
2 Related Work
In 1991, the Software Engineering Institute (SEI) created the Capability Maturity
Model (CMM). It has evolved until the creation of CMMI in 2002 [8, 9, 12, 13],
which is more engineering-focused than its predecessors. CMMI provides technical
guidelines to achieve a certain level of process development quality; however, it
cannot determine how to attain such a level [2]. In November 2010, SEI released the
CMMI-DEV v1.3 [9]. An appraisal at ML2 guarantees that the organization’s
processes are performed and managed according to the plan [6].
In 1996, Rational Software developed a software development framework called
RUP. This framework includes activities, artifacts, roles, responsibilities, and the
best practices recognized for software projects. RUP enables the development team to
perform an iterative transformation of the user’s requirements into software that suits
the stakeholder’s needs [2, 7, 14]. RUP also provides guidelines for “what”, “who”
and “when” [2], avoiding an ad-hoc approach [7] that is usually time consuming and
costly [6]. RUP divides the life cycle of a development process into four phases:
Inception, Elaboration, Construction and Transition.
CMMI and RUP intersect each other with regard to software quality and hence
customer satisfaction. In addition, both models have been constantly updated, so they
do not become obsolete [7] and prevent an ad-hoc and chaotic software development
environment [12]. While created by independent entities, both counted on the
participation of experts from the software industry and government [12]. There are
many reasons why organizations should use these two frameworks: increased quality,
productivity, and customer and partner satisfaction; lower costs and time consumed;
and better team communication [2, 6, 12]. CMMI-DEV may be used to evaluate an
organization’s maturity whether or not it uses RUP as a process model.
Since our goal is to understand what kind of support we can expect from RUP for the
execution of software development projects from a CMMI-compliant perspective, we
detail the previous analysis [11] for all the CMMI ML2 process areas at the
subpractice level. In the previous analysis, we considered five different coverage
levels, which we also use in this study: High coverage (H): CMMI is fully
implemented with RUP, which means that there are no substantial weaknesses;
Medium-High coverage (MH): CMMI is nearly fully implemented with RUP,
although some weaknesses can be identified; Medium coverage (M): CMMI is mostly
implemented with RUP, but additional effort is needed to fully implement the
process area using RUP; Low coverage (L): CMMI is not directly supported using
RUP, or there is only minimal RUP support; Not covered (N): CMMI is not covered
by any RUP element.
We consider two different contexts for the execution of software projects:
context #1, where the development team must fully comply with the CMMI
recommendations, which means the team needs to perform all the subpractices; and
context #2, where the development team operates under strong time or cost
constraints, which means the team may not be able to perform all the subpractices.
Teams framed in context #2 should perform only what we have called P1 priority
subpractices. P1 (higher priority) subpractices are considered mandatory for the
execution of all software projects (see the last column of Table 2). Teams framed in
context #1 should perform P1, P2 and P3 priority subpractices. P3 (lower priority) and
P2 (medium priority) subpractices may be skipped when there is a lack of information
or of metrics for them to be thoroughly covered in the project execution. P3
subpractices are the first to be skipped; P2 subpractices may also be skipped in a
second analysis.
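As a minimal sketch of this prioritization rule (our illustration only; the subpractice identifiers and priority assignments below are hypothetical and not taken from the mapping tables), the selection of subpractices to perform in each context could be expressed as follows:

def subpractices_to_perform(subpractices, context):
    # subpractices: mapping from subpractice identifier to priority ("P1", "P2", "P3").
    # Context #1: full CMMI compliance -> perform P1, P2 and P3 subpractices.
    # Context #2: strong time/cost constraints -> perform only P1 subpractices.
    if context == 1:
        allowed = {"P1", "P2", "P3"}
    elif context == 2:
        allowed = {"P1"}
    else:
        raise ValueError("context must be 1 or 2")
    return [sp for sp, priority in subpractices.items() if priority in allowed]

# Hypothetical priorities for three subpractices (illustrative only).
example = {"PP SP1.1.1": "P1", "PP SP2.3.1": "P1", "PP SP1.3.2": "P2"}
print(subpractices_to_perform(example, context=2))  # -> ['PP SP1.1.1', 'PP SP2.3.1']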
Taking into account the dependency analysis between CMMI ML2 PAs and SPs
published in [15] (see Table 1), we decided to classify these SPs (all of their
subpractices) with priority P1. Table 1 shows that PP SP1.1, SP1.2, SP1.4,
SP2.1, SP2.2, SP2.3, SP2.4, SP2.5, SP2.6, SP3.1 and SP3.2 have priority P1.
Table 1. Dependencies Analysis between PAs and Specific Practices based on [15]
The detailed CMMI-RUP mapping for the Project Planning PA contains the same
RUP tasks or activities required to support each subpractice as in the context of
elaborating project proposals [11]. However, for the context of project execution the
subpractice priorities have changed. The priority of SP2.3, SP2.4, SP2.5 and SP3.2
was P2 for elaborating project proposals and is now P1 for project execution, since
those SPs have dependencies with other ML2 PAs. In this PA, we do not consider
P3 subpractices.
The Requirements Management PA mapping (Table 2) is quite similar to the
previous mapping [11]. The difference lies in the reconsideration of the subpractice
priorities. The priorities of SP1.3 and SP1.4 increased because these SPs have
dependencies with other ML2 PAs. The priorities of subpractices SP1.1.3 and
SP1.1.4 also increased because, in project execution, analyzing the requirements to
ensure that the established criteria are met and securing the requirements providers'
understanding of the requirements have more prominence than in the case of
elaborating project proposals.
We detail all the CMMI ML2 process areas (except SAM) as we did for PP
and REQM. For all subpractices, we identify the RUP tasks and activities, the
coverage level of each mapping, and the priority of each subpractice.
The Measurement and Analysis and Project Monitoring and Control PAs
present high coverage. The Measurement and Analysis process area is composed of
eight specific practices: six with high coverage, one (SP2.2) with medium-high
coverage and one (SP2.3) with medium coverage. SP2.2 presents medium-high
coverage since the tasks Monitor Project Status and Assess Iteration do not guarantee
the execution of an initial analysis of the measurements and the accomplishment of
preliminary results. SP2.3 presents medium coverage because none of the RUP
tasks covers the subpractices SP2.3.3 and SP2.3.4: RUP does not have elements that
address the need to make the stored measurement data available only to appropriate
groups and personnel, and to prevent the inappropriate use of the stored information.
The subpractices with higher priority are those of SP1.2, SP1.3 and SP1.4, since their
priority is imposed by the dependencies between PAs (Table 1).
Project Monitoring and Control is composed of two specific goals. SG1 comprises
seven specific practices, four with high coverage and three with medium-high
coverage (SP1.1, SP1.4 and SP1.5). Specific Practice 1.1 is composed of six
subpractices, four of them with high coverage. Subpractice SP1.1.2 presents medium
coverage because the task Develop Measurement Plan does not demand the inclusion
of the project's cost and expended effort in the project metrics. The other subpractice
(SP1.1.5) presents no RUP coverage because RUP does not monitor the skills of the
team members. SP1.4 presents medium-high coverage because two of its three
subpractices (SP1.4.1 and SP1.4.3) are not fully implemented with RUP elements:
with RUP, we cannot guarantee the review of the data management activities. SP1.5
also presents medium-high coverage because two of its three subpractices (SP1.5.1
and SP1.5.3) are only partially implemented with RUP elements: the task Report
Status does not ensure the review of stakeholder involvement. The second specific
goal has three specific practices, all with high coverage. The subpractices with higher
priority are those of SP1.1, SP1.2, SP1.3, SP2.1 and SP2.2. These subpractices should
be performed even if the team has some constraints, because they monitor the main
performance issues of the project. Furthermore, the dependencies between process
areas also impose P1 priority on these specific practices.
The remaining ML2 process areas, Process and Product Quality Assurance and
Configuration Management, present medium-high RUP compliance. Process and
Product Quality Assurance is composed of two specific goals, each with two specific
practices. SG1 has one specific practice (SP1.1) with medium-high coverage and the
other (SP1.2) with high coverage. SP1.1 is not fully implemented with RUP because
its tasks (in particular the task Assess Iteration) do not ensure the identification and
tracking of noncompliance and the identification of lessons learned. SG2 comprises
two specific practices, both presenting medium coverage. SP2.1 presents medium
coverage because the RUP tasks do not address how to ensure the resolution of
noncompliance issues. The medium coverage of SP2.2 is caused by the lack of RUP
tasks that guarantee the storage and maintenance of the quality assurance results. This
process area does not have priorities imposed by the dependencies between process
areas: it depends on other ML2 process areas, but the other process areas do not
depend on Process and Product Quality Assurance. The last process area is
Configuration Management, which has three specific goals. The first specific goal is
divided into three specific practices: SP1.1, SP1.2 and SP1.3 (the latter presenting
high coverage). SP1.1 and SP1.2 present medium-high coverage because the RUP
tasks do not rigorously define the configuration items, components, and related work
products that should be maintained under configuration management, and RUP does
not have mechanisms to create configuration management reports or to guarantee the
storage, update, and retrieval of configuration management records. Specific goals 2
and 3 comprise two specific practices each: SP2.1 and SP3.2 with high coverage, and
SP2.2 and SP3.1 with low coverage. The low coverage of those specific practices is a
consequence of the absence of RUP tasks that guarantee the control of changes to the
configuration items and the establishment and maintenance of configuration
management records describing the configuration items.
In this section, we describe the analysis of the coverage that each PA can achieve with
the adoption of RUP in the execution of software development projects. The analysis
starts with the identification of all RUP tasks and activities mapped to each
subpractice of the PA under evaluation. Next, we verify the coverage level of each
subpractice. We convert the coverage levels defined in [11] into numeric values:
H coverage: between 76% and 100% (by default, H coverage is 100%, the ideal
coverage); MH coverage: between 51% and 75% (by default 75%); M coverage:
between 25% and 50% (by default 50%); L coverage: between 1% and 25% (by
default 25%); Not covered: the coverage is 0%.
After assessing the RUP coverage for each subpractice, we calculate the RUP
coverage for each Specific Goal, Specific Practice and PA.
We adopt a weighted average to calculate the RUP coverage of each specific goal,
specific practice and PA. The subpractice weight is based on the priority level: higher
priority (P1) subpractices have a weight of 1, medium priority (P2) subpractices have
a weight of 0.5 (P2_weight = P1_weight/2), and lower priority (P3) subpractices have
a weight of 0.33 (P3_weight = P1_weight/3). The SP weight is defined as the sum of
its subpractice weights. We calculate two types of PA coverage, one with only the P1
subpractices and the other with all the subpractices.
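The following sketch illustrates how the default numeric coverage values and the priority weights combine into this weighted average (our illustration, using the default values stated above; the subpractice data at the end are hypothetical):

# Default numeric value for each coverage level and weight for each priority.
COVERAGE = {"H": 1.00, "MH": 0.75, "M": 0.50, "L": 0.25, "N": 0.00}
WEIGHT = {"P1": 1.0, "P2": 0.5, "P3": 1.0 / 3.0}

def weighted_coverage(subpractices, priorities=("P1", "P2", "P3")):
    # subpractices: list of (coverage_level, priority) pairs, e.g. ("H", "P1").
    # Only subpractices whose priority is in 'priorities' enter the average.
    selected = [(c, p) for c, p in subpractices if p in priorities]
    total_weight = sum(WEIGHT[p] for _, p in selected)
    if total_weight == 0:
        return 0.0
    return sum(COVERAGE[c] * WEIGHT[p] for c, p in selected) / total_weight

# Hypothetical specific practice with three subpractices (illustrative only).
sp = [("H", "P1"), ("MH", "P1"), ("M", "P2")]
print(weighted_coverage(sp))                      # coverage with all subpractices: 0.8
print(weighted_coverage(sp, priorities=("P1",)))  # coverage with P1 subpractices only: 0.875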
The RUP coverage for the Project Planning PA (see Table 4) considering P1
subpractices is 84% in the Inception phase and 90% in the Elaboration and
Construction phases. The RUP coverage with all Project Planning subpractices is 83%
in the Inception phase and 89% in the Elaboration and Construction phases.
The RUP coverage for the Requirements Management PA (see Table 4) considering
P1 subpractices is 57% in the Inception phase, a medium-high coverage. In the other
two phases, it presents high coverage, achieving 96%. The RUP coverage with all
Requirements Management subpractices decreases by around 3%: in the Inception
phase the RUP coverage is 53%, and in the Elaboration and Construction phases it
is 94%.
Table 4 presents a summary of the RUP coverage for the Project Planning,
Requirements Management, Process and Product Quality Assurance, Project
Monitoring and Control, Measurement and Analysis and Configuration Management
process areas.
The RUP coverage for the Process and Product Quality Assurance PA considering P1
subpractices is 67% in the Inception phase, a medium-high coverage. In the other two
phases, it presents high coverage, achieving 83%. The RUP coverage with all Process
and Product Quality Assurance subpractices is 55% in the Inception phase and 73% in
the Elaboration and Construction phases.
The RUP coverage for the Project Monitoring and Control PA considering P1
subpractices is 91% in the Inception, Elaboration and Construction phases. The RUP
coverage with all subpractices is 87% in the Inception phase and 91% in the
Elaboration and Construction phases.
The Measurement and Analysis PA considering P1 subpractices achieves the ideal
RUP coverage, 100%, in all RUP phases. The RUP coverage with all Measurement
and Analysis subpractices decreases by around 10%, achieving 90% in the Inception
phase and 91% in the Elaboration and Construction phases.
Configuration Management is the process area with the lowest RUP coverage. The
RUP coverage for the Configuration Management PA considering P1 subpractices is
6% in the Inception phase and 62% in the Elaboration and Construction phases. The
RUP coverage with all Configuration Management subpractices is 6% in the
Inception phase and 59% in the Elaboration and Construction phases.
5 Case Study
A case study was developed to assess the usefulness of the CMMI-RUP mapping in
supporting the implementation of the CMMI ML2 process areas in the context of
software development project execution. The case study was performed in an
educational environment and adopted the guidelines established in [16].
The case study involved one hundred and eleven students enrolled in the 8604N5
Software System Development (SSD) course of the undergraduate degree in
Information Systems and Technology at the University of Minho (the first university
in Portugal to offer DEng, MSc and PhD degrees in Computing). The students were
divided into seven software development teams, each receiving a sequential
identification number (Team 1, Team 2, ..., Team 7). The teams had between 13 and
17 members (one team with 13, one with 15, two with 16 and three with 17). These
teams had to produce a web application meeting the requirements of a real end
customer. The teams had some constraints: two interactions with the client; the use of
RUP (only the first three RUP phases); following the CMMI ML2 guidelines; and
eighteen weeks for development. All teams had attended a previous course where they
were exposed to RUP concepts. One team (Team 1) was randomly chosen not to adopt
RUP (we call this team the "Control Team"), while the other six teams followed the
guidelines established by RUP, executing the Inception, Elaboration and Construction
phases. The control team did not follow any guidelines for organizing themselves in
terms of roles, responsibilities, or team organization.
The students had four assessment milestones of execution and evaluation:
Assessment Milestone 1 (M1) relates to the initial project planning, which is part of
the Inception phase; Assessment Milestone 2 (M2) relates to the Inception phase;
Assessment Milestone 3 (M3) relates to the Elaboration phase; and Assessment
Milestone 4 (M4) relates to the Construction phase.
The assessment of the teams' performance adopted the following two steps: (1)
documental analysis of the produced artifacts to detect compliance with the
subpractices of the ML2 process areas (the SAM process area is out of the project
scope); (2) a survey at the end of each assessment milestone, to check the status of the
teams and the team members' perception of the CMMI practices.
For each team, we calculated the coverage level observed for each subpractice, the
corresponding average for each specific practice of the CMMI ML2 process areas,
and the process area average. The coverage levels were converted into numeric
values: high coverage (H) corresponds to 100%; medium-high coverage (MH) to
75%; medium coverage (M) to 50%; low coverage (L) to 25%; and no coverage (N)
to 0%.
The specific practice and process area coverage was calculated using the same
weighted average: the subpractice weight was based on the priority level, with P1
subpractices having a weight of 1, P2 subpractices a weight of 0.5
(P2_weight = P1_weight/2) and P3 subpractices a weight of 0.33
(P3_weight = P1_weight/3). The specific practice weight was defined as the sum of
its subpractice weights.
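Reusing the weighted_coverage sketch shown earlier (the observed levels below are hypothetical, not actual case-study data), the coverage observed by a team for one specific practice would be computed in the same way:

# Hypothetical observed coverage levels and priorities for one specific practice
# (uses COVERAGE, WEIGHT and weighted_coverage from the earlier sketch).
observed = [("H", "P1"), ("M", "P1"), ("N", "P2"), ("L", "P3")]
# Weighted average: (1.0*1 + 0.5*1 + 0.0*0.5 + 0.25*(1/3)) / (1 + 1 + 0.5 + 1/3) ~= 0.56
print(weighted_coverage(observed))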
As an example of the detailed results obtained in the case study, Table 5 presents the
teams' results for the Requirements Management PA. A similar effort was made for
the teams' results for all the CMMI ML2 process areas.
Table 6 presents a summary of the results obtained after the assessment of the RUP
coverage for each process area, in each assessment milestone.
In the project planning milestone (M1, Fig. 1), the theoretical RUP coverage is much
higher than the assessed teams' coverage. The main reason for this discrepancy is that,
in M1, the teams are only concerned with executing the project planning, while the
theoretical RUP coverage for this assessment milestone is the same as the Inception
phase coverage. Looking at the teams' coverage, we can see that all of them (except
the control team) achieve a coverage higher than 40% when we assess only the P1
subpractices.
Looking at Fig. 1 and Table 6, we can see the large difference between the Project
Planning process area and the other ML2 process areas, as well as the difference
between the P1 average and the all-subpractices average. The teams' Project Planning
process area presents higher coverage when we assess all subpractices than when we
assess only the P1 subpractices: since in this assessment milestone the teams are
focused on elaborating the project plan, they try to implement all subpractices. In M1,
two teams (Team 2 and Team 5) also spent some effort on the implementation of
Process and Product Quality Assurance (P1 averages of 47% and 44%, respectively).
However, this effort led them to undervalue the Project Planning implementation.
At the end of M2 (Fig. 1 and Table 6), the teams' coverage increased when compared
with M1. The coverage of all process areas increased and approached the theoretical
RUP coverage. Team 4 stands out: in the Project Planning and Requirements
Management process areas it even exceeded the theoretical coverage (85% for all
Project Planning subpractices, 61% for Requirements Management P1 subpractices
and 71% for all Requirements Management subpractices). This was a consequence of
the teams' constraint of following the CMMI ML2 guidelines: they anticipated some
of the guidelines that would be implemented by RUP in a subsequent phase. The
teams' coverage of Configuration Management was also higher than the theoretical
RUP coverage, again because of the CMMI ML2 constraint. The theoretical RUP
coverage is very low in the Inception phase because the RUP tasks that cover the
Configuration Management subpractices are performed only in the Elaboration and
Construction phases.
In M3, the theoretical coverage is the maximum coverage that can be achieved if
RUP is adopted to implement CMMI. Almost all CMMI ML2 process areas have a
theoretical coverage level higher than 75% for both average types. Process and
Product Quality Assurance narrowly misses a high level, with 73% coverage for the
all-subpractices average. Configuration Management is the process area with the
lowest coverage level: 59% for all subpractices and 62% for P1 subpractices. In this
phase, we can see that the results of the control team become quite different from
those of the other teams and considerably lower than the theoretical RUP coverage.
Team 4 was the assessed team that achieved the highest level of coverage; it achieved
the highest level for all process areas except Measurement and Analysis.
In the last assessment milestone, the results are quite similar to the M3 results, with
some slight coverage improvements in the assessed process areas for all teams. The
control team's performance is, as we expected, the weakest of all teams. This team
had more difficulty in implementing the CMMI ML2 process areas since it did not
follow RUP and consequently did not have a predefined set of tasks to help it know
how to implement CMMI.
6 Conclusion
Customer satisfaction is the outcome most expected by software development
companies. CMMI and RUP intersect with regard to software quality and, therefore,
customer satisfaction.
In this study, we have identified the RUP elements that fulfill the CMMI ML2 process
areas (without SAM) in the context of software project execution. We have also
analyzed the RUP coverage that can be achieved for each CMMI ML2 PA.
With a case study, we have assessed how well several teams accomplished the
CMMI-RUP mapping. We found that the teams adopting RUP have a higher
compliance with CMMI ML2 than the control team that did not follow RUP
(Table 6). In Table 7, we can compare the theoretical coverage of each CMMI ML2
PA with the teams' average coverage (without the control team).
Fig. 2 presents the comparison between the theoretical RUP coverage and the average
of the real results obtained by the teams (without the control team). We can compare
the evolution of the teams' average coverage with the theoretical RUP coverage
throughout the RUP phases. We can also compare the teams' average with the
theoretical RUP coverage, looking only at P1 subpractices and at all subpractices,
throughout the RUP phases for each CMMI ML2 PA. The theoretical RUP coverage
reaches its maximum in the Elaboration phase, but the teams' average reaches its
maximum only in the Construction phase. The teams' performance nearly reaches the
previously estimated theoretical coverage, but with some temporal delay; i.e., while
the theoretical peak coverage is possible during the Elaboration phase, the real peak
coverage is observed during the Construction phase for almost all teams.
As future work, we will detail the CMMI-RUP mapping to the CMMI ML3
process areas.
Fig. 2. Comparison between Ideal and Teams Coverage Average by PA and Priority
References
1. Niazi, M., Wilson, D., Zowghi, D.: Critical success factors for software process
improvement implementation: An empirical study. SPIP 11, 193–211 (2006)
2. Manzoni, L.V., Price, R.T.: Identifying extensions required by RUP to comply with CMM
levels 2 and 3. IEEE TSE 29, 181–192 (2003)
3. Marchewka, J.T.: Information technology project management. John Wiley and Sons
(2009)
4. Chen, C.-Y., Chong, P.P.: Software engineering education: A study on conducting
collaborative senior project development. Journal of Systems and Software 84, 479–491
(2011)
5. Melo, W.: Enhancing RUP for CMMI compliance: A methodological approach. The
Rational Edge. IBM (2004)
6. Carvallo, J.P., Franch, X., Quer, C.: Supporting CMMI Level 2 SAM PA with Non-
technical Features Catalogues. SPIP 13, 171–182 (2008)
7. Kruchten, P.: The Rational Unified Process: An Introduction. Addison-Wesley (2003)
8. CMMI Product Team: CMMI for Development, Version 1.2 (CMU/SEI-2006-TR-008)
(2006)
9. CMMI Product Team: CMMI for Development, Version 1.3 (CMU/SEI-2010-TR-033)
(2010)
10. IBM, RUP for small projects, version 7.1,
http://www.wthreex.com/rup/smallprojects/ (accessed April 12, 2012)
11. Monteiro, P., Machado, R.J., Kazman, R., Lima, A., Simões, C., Ribeiro, P.: Mapping
CMMI and RUP Process Frameworks for the Context of Elaborating Software Project
Proposals. In: Winkler, D., Biffl, S., Bergsmann, J. (eds.) SWQD 2013. LNBIP, vol. 133,
pp. 191–214. Springer, Heidelberg (2013)
12. Ahern, D.M., Clouse, A., Turner, R.: CMMI Distilled: A Practical Introduction to
Integrated Process Improvement. Addison-Wesley (2004)
13. Chrissis, M.B., Konrad, M., Shrum, S.: CMMI: Guidelines for Process Integration and
Product Improvement. Addison-Wesley (2006)
14. IBM, Rational Unified Process: Best practices for software development teams,
http://www.ibm.com/developerworks/rational/library/content/
03July/1000/1251/1251_bestpractices_TP026B.pdf
(accessed August 29, 2013)
15. Monteiro, P., Machado, R.J., Kazman, R., Henriques, C.: Dependency Analysis between
CMMI Process Areas. In: Ali Babar, M., Vierimaa, M., Oivo, M. (eds.) PROFES 2010.
LNCS, vol. 6156, pp. 263–275. Springer, Heidelberg (2010)
16. Runeson, P., Host, M.: Guidelines for conducting and reporting case study research in
software engineering. Empirical Software Engineering 14, 131–164 (2009)
Directing High-Performing Software Teams: Proposal
of a Capability-Based Assessment Instrument Approach
Petri Kettunen
University of Helsinki
Department of Computer Science
P.O. Box 68, FI-00014 University of Helsinki, Finland
[email protected]
1 Introduction
Modern high-performing software organizations rely increasingly on capable teams. It
follows that the organizational development should focus on their team capabilities.
Furthermore, new teams can be configured based on the required key capabilities.
Moreover, not just having teams but consciously concentrating on their performance
is what brings the business benefits. Such fundamental comprehension would greatly
help to leverage high-performing teams to scale up even at enterprise levels.
High-performing teamwork has been investigated in many fields over the years. In
particular, the success factors of new product development (NPD) teams are in
general relatively well known [1]. However, the specific concerns and intrinsic
properties of modern software development teams are essentially less understood, in
particular at larger scales [2]. It is well known that there may be tremendous
productivity differences between software teams. However, it is not clearly
understood what high performance means for software development enterprises as a
whole, and how exactly such effects and outcomes are achievable in predictable ways.
Our overall research question is thus as follows: How can defined (high) team
performance be attained in the particular software development context? This paper
tackles that with a design science approach by proposing a holistic capability analysis
frame for team-based software organizations. The capabilities are evaluated with our
previously developed Monitor instrument (team self-assessment) [3], [4]. Based on
that information, we produce the current capability profile of the team with the
Analyzer instrument constructed here.
The rest of this paper is organized as follows. Section 2 reviews software team
performance in general and capability-oriented development views in particular.
Section 3 then presents the capability-based team performance analysis approach,
followed by a case example in Section 4. Finally, Section 5 discusses the proposition
with implications and pointers to further work, and Section 6 concludes.
3.1 Approach
This paper extends our previous works. The Monitor instrument was initially
developed to profile different characteristics of high-performing software teams [3].
The capability-oriented approach was introduced with the Agile capability [4]. This
paper continues developing the approach and building the instrumentation for the
other recognized capabilities.
The overall standpoint of our team analysis approach is that, for each particular
software team, there is a performance ideal in its specific organizational context
(the desired state). The current state of the team may deviate from that ideal for
various reasons. The objective is then to understand the current position of the team
and the capabilities to be developed and improved in order to direct the team towards
the desired state (gap analysis).
The research method and design are as follows. High performance requires capable
teams. However, different teams may have different performance targets and
emphasize different key capabilities (e.g., in fast-moving consumer goods vs.
industrial automation systems). Once the team recognizes its particular capability
needs and weights, it can self-assess them and improve the identified gaps.
Our connection between capabilities and (high) team performance is as follows:
• Ideally, each team performs to the best of its capabilities. The team and its
management should know them.
• However, the actual realization of the capabilities may be incomplete and possibly
hindered by impediments. The team should recognize them.
In order to realize that line of thinking, we apply design science to construct
actionable artifacts. The attributes of the capabilities are measured by the Monitor
(team self-assessment). Based on that information, we produce the current capability
profile of the team with the Analyzer (Sect. 3.2). We can then discuss together with
the team whether the team has sufficient and fit capabilities for the desired (high)
performance, which capabilities should be targeted for improvement in the future,
and what potential obstacles and impediments should be removed in order to get the
full benefits of the capabilities.
The Monitor-Analyzer instrumentation presented here currently covers six typical
capability areas. The Monitor items are mainly based on the software team
performance management research literature. Typically such investigations
(reductionism) study certain distinct factors and their performance effects with
correlation hypotheses. Our work is based on combining such factors under the
capability profiles. The profiles thus give suggestions as to how the team may
potentially perform with respect to the different capability traits.
Currently the analysis comprises the following specific capabilities: Agile, Lean,
Business Excellence, Operational Excellence, Growth, Innovativeness. They have been
selected based on general-purpose organizational performance development models to
begin with (e.g., [24], [30], [31]). We have then coupled them with software team
performance management literature (Sect. 2.2), in particular with respect to Agile and
Lean capabilities. Table 1 presents their overall rationales. However, we do not claim
that these are all key capabilities for any particular team. Nevertheless, considering
generalization, we presume this initial set to be a plausible starting point.
This work does not propose any particular quantitative formulas for determining
the level of the team capabilities (such as a capability index). Instead, we rely on the
expert judgment of the team itself, supported by visual plotting of systematized item
combinations sourced from the team self-assessment (Monitor). The suggested
heuristic reasoning is as follows: if the indicator items associated with a particular
capability appear to be positive, the current level of the team with respect to that
capability may be high. Conversely, if there are some negative signs and/or large
variations between the individual team members' ratings of the items, the level of the
capability may be lower.
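As a rough sketch of this heuristic (our own illustration, not the actual Monitor-Analyzer implementation; the rating scale, thresholds and example data are assumptions), the reasoning could be expressed as follows:

from statistics import mean, pstdev

def capability_signal(item_ratings, positive_threshold=4.0, spread_threshold=1.0):
    # item_ratings: one list of individual team-member ratings per indicator item,
    # on an assumed scale from 1 (very negative) to 5 (very positive).
    item_means = [mean(ratings) for ratings in item_ratings]
    item_spreads = [pstdev(ratings) for ratings in item_ratings]
    # Positive items with little disagreement suggest the capability may be high;
    # negative signs or large rating variation suggest it may be lower.
    if min(item_means) >= positive_threshold and max(item_spreads) <= spread_threshold:
        return "may be high"
    return "may be lower"

# Hypothetical ratings for two Agile indicator items from a five-person team.
agile_items = [[5, 4, 4, 5, 4], [2, 5, 3, 4, 1]]
print(capability_signal(agile_items))  # large spread on the second item -> "may be lower"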
3.2 Analyzer
Our earlier investigations have sensed high-performing software teams with a self-
assessment Monitor instrument [3]. The Monitor instrument captures a wide set of
team performance attributes. By selecting and combining distinct subsets of them, we
can produce capability profile views of the team. This is the design rationale of the
Analyzer instrument [4].
The Analyzer aggregates certain subsets of the Monitor questionnaire items and
recombines them for the selected team capability indicators. Certain items are coupled
to multiple capabilities. Table 2 presents a subset of the current constituting indicator
items (currently 6) for each included capability (cf. Table 1) with their underlying
rationales.
Table 2. Indicator items (subset) for each included capability with their underlying rationales

Agile
Item: How do you rate the following organizational factors in your context? • The organization is flexible and responsive to customer needs.
Rationale: The whole organization is encouraged to think in customer-oriented ways. The mindset is towards leveraging the organizational strengths and capital ("can do").
Item: How do you rate the following concerns? • Our team is capable of quick round-trip software engineering cycles (design-build-test-learn).
Rationale: Iterative development and consequently responsiveness are enabled. Continuous (fast) delivery of valuable software is realized.

Lean
Item: How do you rate the following aspects from your point of view? • We see how our products bring benefits (value).
Rationale: The importance and usefulness of the product to customer/user needs and problems are prompted.
Item: How do you appraise the following team outcomes and impacts? • Outputs meet the organizational standards and expectations.
Rationale: The software implementation is consistently assessed against defined quality criteria (e.g., reliability, response time).

Business Excellence
Item: How do you rate the following concerns? • How often are you able to see the software (product) in actual use?
Rationale: The actual business and use context and how the user operations utilize the software execution results are observed in real time. The value and fitness of the product solution to its purpose and concept of operations are assessed.
Item: How do you appraise the following team outcomes and impacts? • Getting the business benefits (value).
Rationale: The value creation and capture drives the software development. The business mindset is incorporated into the software teams.

Operational Excellence
Item: How important are the following for your team? • Getting the products done well (effective and disciplined delivery).
Rationale: The software production is results-driven. The activities are aligned towards the delivery targets.
Item: How do you rate the following concerns? • Our team is fully integrated with the surrounding organization.
Rationale: The software development is clearly positioned in the total value stream and able to implement its role in the flow.

Growth
Item: How do you rate the following aspects from your point of view? • We have a clear, compelling direction that energizes, orients the attention, and engages our full talents.
Rationale: Finding (the) most significant and meaningful problems and opportunities to be solved with software is emphasized.
Item: How important are the following for your team? • The software (design) is easily upgradable and flexible for future development.
Rationale: The customer space uncertainties and opportunities are taken into account in design decisions and preparations (e.g., architectural choices). Technical debt is avoided.

Innovativeness
Item: How important are the following for your team? • Thinking the total product / service / system.
Rationale: The role of the software in the product solutions is comprehended. The positioning of the products with the customer/user systems is motivated.
Item: How important are the following aspects for you in your work? • Developing the particular product or service (innovation).
Rationale: The purpose of the product drives the software development.
Table 3. Demotivator (DE) items with their underlying rationales

Item: How do you rate the following concerns? • How often are there communication and coordination breakdowns?
Rationale: Gaps and glitches tend to cause delays, inefficient knowledge-sharing, and distracted decisions, hurting all software capabilities.
Item: How do you rate the following organizational factors in your context? • People have time to "think" (no excessive stress, pressures).
Rationale: Sustainable complex software work requires certain slack time. Rushing does not foster excellence in software capabilities.
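To make the recombination idea concrete, the following sketch models the Analyzer as a mapping from capabilities to subsets of Monitor items (our illustration only; the item keys are shorthand paraphrases of the Table 2 and Table 3 items, not the actual Monitor encoding, and the shared item is an assumed example of an item coupled to multiple capabilities):

# Sketch of the Analyzer aggregation: each capability view is built from a subset
# of Monitor questionnaire items; some items may belong to more than one capability.
ANALYZER_VIEWS = {
    "Agile": ["org_flexible_responsive", "quick_round_trip_cycles"],
    # "products_bring_value" appears in two views: an assumed example of an item
    # coupled to multiple capabilities.
    "Lean": ["products_bring_value", "outputs_meet_standards"],
    "Business Excellence": ["see_product_in_use", "getting_business_benefits", "products_bring_value"],
    "Operational Excellence": ["products_done_well", "team_fully_integrated"],
    "Growth": ["clear_compelling_direction", "design_upgradable"],
    "Innovativeness": ["think_total_product", "developing_the_product"],
    "DE": ["communication_breakdowns", "time_to_think"],
}

def analyzer_profile(monitor_responses):
    # Group raw Monitor responses (item key -> team ratings) into per-capability views
    # that can then be plotted and discussed capability by capability.
    return {
        capability: {item: monitor_responses.get(item) for item in items}
        for capability, items in ANALYZER_VIEWS.items()
    }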
4 Case Example
We have been using the Monitor instrument regularly since its initial establishment
[3]. This section illustrates a case of applying the Analyzer with the detailed items of
the current Monitor realization.
We have executed several team performance investigations with various industrial
software development organizations; one such case team is included here. They
applied the Monitor in two rounds in 2012-2013. Considering their key demographic
information, the team develops integrated system components in a medium-size
global company. Most of the team members (including some subcontractors) were
experienced, and the team had worked together for a long time. The system has a long
life-span.
Figure 1 presents the subset of the Monitor data of the case team as viewed by the
Analyzer. The organization is as follows: the question blocks (6+1) are the currently
incorporated indicator items of the capabilities defined in Table 2 and the
demotivators in Table 3. They have been extracted from the Monitor. The data show
the number of responses of the team self-ratings as described in our initial publication
of the Monitor [3]. The responses were anonymous. Note that some of the team
members chose not to respond to all question items.
In essence, the Analyzer views tabulated in Figure 1 exhibit how the case team
perceived certain key aspects contributing to its capabilities with respect to Table 1.
We can now reflect on the different capabilities as follows (italics as in Tables 2-3):
• Agile: For the system component team, the customer appears to be more distant.
They do not especially see the organization as being 'flexible and responsive to
changing customer needs'. However, they do have the core ability to conduct
'quick round-trip software engineering cycles' for responsiveness.
• Lean: The team understands how the product creates ‘benefits (value)’. In addition,
they know that their work should fulfill the ‘organizational standards’.
• Business Excellence: As a component team, not everybody can observe the
'software (product) in actual use'. This might be an impediment to customer
orientation. Nevertheless, they do appreciate the 'business benefits'.
• Operational Excellence: Being part of a larger system product, the team may not
see its role in the total 'delivery' chain quite clearly. They appear not to be so
strongly 'integrated', which may, for instance, cause delays in their teamwork.
• Growth: A component team may have trouble having a clear ‘direction’. This
could be discouraging, and prevent them from seeing future growth opportunities.
In addition, the current software architecture may not readily support ‘future
development’.
• Innovativeness: Not everybody in the team is equally interested in 'developing the
particular product'. This may make them less amenable to pursuing new ideas
for the product. However, they do have the mindset towards considering the 'total
product'.
• DE: The team appears to have problems with 'communication and coordination'.
However, they do not seem to be under excessive 'time' pressure, so the problems
may not be explained by hurry.
When we conducted such reflective discussions together with the team, one
subsequent improvement action was to reorganize the team into two smaller
subteams in order to mitigate the perceived communication and coordination
problems (see the 'DE' part). The perceptions of the reorganized team indicate that
this change was favorable (see Figure 1). The response rate was not 100%, though.
In conclusion, the purpose of our approach and instrumentation is neither to
measure the team's performance nor to give normative means to achieve high ends.
Instead, as illustrated in this one case example, the general idea is to highlight key
performance-influencing factors and potential impediments for the team. However, it
is for the team itself to judge and rank them for improvement actions. Similar general
tactics have been proposed elsewhere [36].
Fig. 1. Analyzer views of the case team's Monitor data, organized by question block (Agile, Lean, Business Excellence, Operational Excellence, Growth, Innovativeness, DE); the tabulated responses use scales such as Always / Usually / Occasionally / Seldom / Never / I don't know and Key / Important / Relative / Some little / Little / n/a.
5 Discussion
In its current stage of development, the Monitor-Analyzer has not been validated
for prediction [38]. For instance, with respect to the Agile capability, the Analyzer
view based on the team Monitor instrument self-rating information merely suggests
that the team may perform high in terms of agility [4]. However, that is not measured
here. Specific performance measures could then be some of the ones summarized in
Sect. 2.2. As of this writing, we do not have such measurement data readily available.
The validation can be improved by asking the team about its level of agreement
with the capability profile indicated by the Analyzer. This would also provide some
triangulation of the survey results.
Furthermore, the self-assessment could be repeated longitudinally following the
triggered improvement actions to see their performance effects. The team itself can
thereby keep building its own capability profile view over time. Moreover, if the
Analyzer view indicates that the team perceives itself to have some weaknesses in its
current capabilities, there is a risk of lower performance. Such considerations should
then be taken into account when anticipating (if not predicting) the teams’ future
performance. The key is that the team itself recognizes its own level of capability with
respect to the expectations.
The team case presented in Sect. 4 exemplifies those considerations. As illustrated
there, the Analyzer views should be evaluated in conjunction with the actual context
and situational factors of the team. In general, deeper analysis based only on the
survey information is not recommended.
As already recognized in our initial works with the Monitor, the team self-rating
survey has certain inherent limitations and constraints, such as lack of common
terminology, trust, honesty, self-assessment biases, and survey method limitations [3].
However, most of those limitations and risks can be mitigated by face-to-face
discussions with the team members (e.g., clarifying potential misunderstandings). In
reality, such reflective dialogue is anyway required to be able to engage the team and
stimulate its performance improvements.
All things considered, based on the limited, primarily survey-based case evidence,
we are not yet in a position to draw firm conclusions about the generalization of our
results. However, the main idea of our approach and instrumentation is to be valid for
the particular teams and organizations, and the local validity is for them to judge.
Overall, we see the following prospective threads for further research and
development of the approach and instrumentation:
• By conducting more case studies with different teams, the actual expressive
strength of the selected Monitor items could be weighted more systematically. This
would also strengthen the validation.
• Following that, the current configuration of the Analyzer can be evaluated further
with respect to the indicating items (currently 6, but could vary) of each capability
descriptor. In addition, potential new capability views could be considered (e.g.,
Flexibility, Resilience). This applies in particular if the specific performance needs of
the team are not fully covered by the currently included capabilities. Moreover,
potential linkages between the different elementary items within the capabilities
and combining capabilities (bundling) such as Agile & Lean would deserve further
investigations.
6 Conclusion
This paper tackles the research question of how specific software team performance
can be directed in the organizational context. We have presented a capability-based
software team performance assessment and improvement approach. The approach is
supported by provisional design-science Monitor-Analyzer instrumentation.
The key research design principle of our team Monitor-Analyzer approach has
been not to restrict it to any one particular discipline (e.g., computer science). Instead,
we take a holistic view of software teams consisting of individuals in their
organizational and business contexts.
This work contributes primarily to practitioners. The Monitor-Analyzer
instruments are readily available (as prototype tools). In addition, the work promotes
potential topical research directions for team performance management, given that
software teams even in traditional organizations increasingly work in new set-ups
(e.g., offshoring) and radically new ways of teamwork are emerging in creative
organizations (e.g., game companies). Moreover, the Monitor-Analyzer can be used
as a research instrument for action research to trigger more theoretical research
questions.
By and large, our Monitor-Analyzer approach strives to address the following
strategic issues in the software organization [39]: What is the intended performance
(success) for the team / organization? What are the key capabilities needed for that
success? What are the individual competences and organizational characteristics
needed to support the capabilities? As illustrated in the case example, with such
understanding the software organization can gauge its teams for achieving the overall
(business) goals of the organization. Moreover, with such profound understanding of
its team-based strengths in the competitive environment, the organization can direct
its development activities according to the capabilities of the software teams.
References
1. Cooper, R.G., Edgett, S.J.: Lean, Rapid, and Profitable New Product Development.
BookSurge Publishing, North Charleston (2005)
2. McLeod, L., MacDonell, S.G.: Factors that Affect Software Systems Development
Project Outcomes: A Survey of Research. ACM Computing Surveys 43(4) (2011)
3. Kettunen, P., Moilanen, S.: Sensing High-Performing Software Teams: Proposal of an
Instrument for Self-monitoring. In: Wohlin, C. (ed.) XP 2012. LNBIP, vol. 111, pp. 77–92.
Springer, Heidelberg (2012)
4. Kettunen, P.: The Many Facets of High-Performing Software Teams: A Capability-Based
Analysis Approach. In: McCaffery, F., O’Connor, R.V., Messnarz, R. (eds.) EuroSPI
2013. CCIS, vol. 364, pp. 131–142. Springer, Heidelberg (2013)
5. Kleinschmidt, E., de Brentani, U., Salomo, S.: Information Processing and Firm-Internal
Environment Contingencies: Performance Impact on Global New Product Development.
Creativity and Innovation Management 19(3), 200–218 (2010)
6. Kettunen, P.: Agile Software Development in Large-Scale New Product Development
Organization: Team-Level Perspective. Dissertation, Helsinki University of
Technology, Finland (2009)
7. Hackman, J.R.: Leading Teams: Setting the Stage for Great Performances. Harvard
Business School Press, Boston (2002)
8. Petersen, K.: Measuring and predicting software productivity: A systematic map and
review. Information and Software Technology 53, 317–343 (2011)
9. Chenhall, R.H., Langfield-Smith, K.: Multiple Perspectives of Performance Measures.
European Management Journal 25(4), 266–282 (2007)
10. Stensrud, E., Myrtveit, I.: Identifying High Performance ERP Projects. IEEE Trans.
Software Engineering 29(5), 398–416 (2003)
11. Berlin, J.M., Carlström, E.D., Sandberg, H.S.: Models of teamwork: ideal or not?
A critical study of theoretical team models. Team Performance Management 18(5/6),
328–340 (2012)
12. Kasunic, M.: A Data Specification for Software Project Performance Measures: Results of
a Collaboration on Performance Measurement. Technical report TR-012, CMU/SEI (2008)
13. Winter, M., Szczepanek, T.: Projects and programmes as value creation processes: A new
perspective and some practical implications. International Journal of Project
Management 26, 95–103 (2008)
14. Ancona, D., Bresman, H.: X-Teams: How to Build Teams that Lead, Innovate, and
Succeed. Harvard Business School Press, Boston (2007)
15. Allee, V.: Value Network Analysis and value conversion of tangible and intangible assets.
Journal of Intellectual Capital 9(1), 5–24 (2008)
16. Buschmann, F.: Value-Focused System Quality. IEEE Software 27(6), 84–86 (2010)
17. Anderson, D.J.: Agile Management for Software Engineering. Prentice Hall, Upper Saddle
River (2004)
18. Staron, M., Meding, W., Karlsson, G.: Developing measurement systems: an industrial
case study. J. Softw. Maint. Evol.: Res. Pract. 23, 89–107 (2010)
19. Agresti, W.W.: Lightweight Software Metrics: The P10 Framework, pp. 12–16. IT Pro
(September-October 2006)
20. Osterwalder, A., Pigneur, Y.: Business Model Generation: A Handbook for Visionaries,
Game Changers, and Challengers. John Wiley & Sons, New York (2010)
21. Tonini, A.C., Medina, J., Fleury, A.L., de Mesquita Spinola, M.: Software Development
Strategic Management: A Resource-Based View Approach. In: Proc. PICMET,
pp. 1072–1080 (2009)
22. Professional Staff Core Capability Dictionary. University of Adelaide, Australia (2010)
23. Day, G.S.: The Capabilities of Market-Driven Organizations. Journal of Marketing 58,
37–52 (1994)
24. Neely, A., Adams, C., Crowe, P.: The performance prism in practice. Measuring Business
Excellence 5(2), 6–13 (2001)
25. von Hertzen, M., Laine, J., Kangasharju, S., Timonen, J., Santala, M.: Drive For Future
Software Leverage: The Role, Importance, and Future Challenges of Software
Competences in Finland. Review 262. Tekes, Helsinki (2009)
26. Downey, J.: Designing Job Descriptions for Software Development. In: Barry, C., et al.
(eds.) Information Systems Development: Challenges in Practice, Theory, and Education,
vol. 1, pp. 447–460. Springer Science+Business Media (2009)
27. Conboy, K., Fitzgerald, B.: Toward a conceptual framework for agile methods: a study of
agility in different disciplines. In: Mehandjiev, N., Brereton, P. (eds.) Workshop on
Interdisciplinary software engineering research (WISER), pp. 37–44. ACM, New York
(2004)
28. CMMI for Development. Technical report, CMU/SEI-2010-TR-033. Software Engineering
Institute, Carnegie Mellon University, USA (2010)
29. Guidance on use for process improvement and process capability determination.
Information technology, Process assessment, Part 4: 15504-4, ISO/IEC (2009)
30. EFQM Excellence Model. EFQM Foundation, Belgium (2012)
31. Baldrige National Quality Program: Criteria for Performance Excellence. National
Institute of Standards and Technology (NIST), Gaithersburg, MD (2012)
32. Drexler, A., Sibbet, D.: Team Performance Model (TPModel). The Grove Consultants
International, San Francisco (2004)
33. Humphrey, W.S.: Introduction to the Team Software Process. Addison Wesley Longman
Inc., Reading (2000)
34. Humphrey, W.S., Chick, T.A., Nichols, W.R., Pomeroy-Huff, M.: Team Software Process
(TSP) Body of Knowledge (BOK). Technical report, CMU/SEI-2010-TR-020. Software
Engineering Institute, Carnegie Mellon University, USA (2010)
35. Pikkarainen, M.: Towards a Framework for Improving Software Development Process
Mediated with CMMI Goals and Agile Practices. Dissertation, University of Oulu, Finland
(2008)
36. Moe, N.B., Dingsøyr, T., Røyrvik, E.A.: Putting Agile Teamwork to the Test – An
Preliminary Instrument for Empirically Assessing and Improving Agile Software
Development. In: Abrahamsson, P., Marchesi, M., Maurer, F. (eds.) XP 2009. LNBIP,
vol. 31, pp. 114–123. Springer, Heidelberg (2009)
37. Glazer, H.: Love and Marriage: CMMI and Agile Need Each Other. CrossTalk 23(1),
29–34 (2010)
38. Fenton, N.E., Pfleeger, S.L.: Software Metrics: A Rigorous & Practical Approach.
International Thompson Computer Press (1996)
39. Wirtenberg, J., Lipsky, D., Abrams, L., Conway, M., Slepian, J.: The Future of
Organization Development: Enabling Sustainable Business Performance Through People.
Organization Development Journal 25(2), 11–22 (2007)