Automating High-Level Software Testing in Industrial Practice
Emil Alégroth
To Therese, Alexandra and my supporting family
Abstract
Software Engineering is on the verge of a new era where continuous releases
are becoming more common than planned long-term projects. In this context
test automation will become essential on all levels of system abstraction to
meet the market’s demands on time-to-market and quality. Hence, automated
tests are required from low-level software components, tested with unit tests,
up to the pictorial graphical user interface (GUI), tested with user emulated
system and acceptance tests. Thus far, research has provided industry with a
plethora of automation solutions for lower level testing but GUI level testing is
still primarily a manual, and therefore costly and tedious, activity in practice.
We have identified three generations of automated GUI-based testing. The
first (1st) generation relies on GUI coordinates but is not used in practice due to infeasible maintenance costs caused by its fragility to GUI change. Second
(2nd ) generation tools instead operate against the system’s GUI architecture,
libraries or application programming interfaces. Whilst this approach is suc-
cessfully used in practice, it does not verify the GUI’s appearance and it is
restricted to specific GUI technologies, programming languages and platforms.
The third (3rd ) generation, referred to as Visual GUI Testing (VGT), is
an emerging technique in industrial practice with properties that mitigate the
challenges experienced with previous techniques. VGT is defined as a tool-
driven test technique where image recognition is used to interact with, and
assert, a system’s behavior through its pictorial GUI as it is shown to the user
in user-emulated, automated, system or acceptance tests. These automated tests produce results of a quality on par with a human tester's and are therefore an effective complement that reduces the aforementioned challenges of manual testing. However, despite its benefits, the technique is only sparsely used in
industry and the academic body of knowledge contains little empirical support
for the technique’s industrial viability.
This thesis presents a broad evaluation of VGT’s capabilities, obtained
through a series of case studies and experiments performed in academia and
Swedish industry. The research followed an incremental methodology that began with experimentation with VGT, followed by industrial studies, and concluded with a study of VGT's use at a company over several years. Results
of the research show that VGT is viable for use in industrial practice with
better defect-finding ability than manual tests, ability to test any GUI based
system, high learnability, feasible maintenance costs and both short and long-
term company benefits. However, there are still challenges associated with the successful adoption, use and long-term use of VGT in a company, the most crucial being that suitable development and maintenance practices are used. This thesis thereby concludes that VGT can be used in industrial practice and provides guidance to practitioners who seek to do so. Additionally, this
work aims to be a stepping stone for academia to explore new test solutions
that build on image recognition technology to improve the state-of-the-art.
Keywords
List of Publications
Appended papers
This thesis is primarily supported by the following papers:
Other papers
The following papers are published but not appended to this thesis, either because their content overlaps with the appended papers, is not related to the thesis, or is of lower priority for the thesis' main conclusions.
Statement of contribution
In all listed papers, the first author was the primary contributor to the research
idea, design, data collection, analysis and/or reporting of the research work.
Contents

Abstract
Acknowledgments
List of Publications
1 Introduction
  1.1 Introduction
  1.2 Software engineering and the need for testing
    1.2.1 Software Testing
    1.2.2 Automated Software Testing
    1.2.3 Automated GUI-based Software Testing
      1.2.3.1 1st generation: Coordinate-based
      1.2.3.2 2nd generation: Component/Widget-based
      1.2.3.3 3rd generation: Visual GUI Testing
      1.2.3.4 Comparison
  1.3 Research problem and methodology
    1.3.1 Problem background and motivation for research
    1.3.2 Thesis research process
    1.3.3 Research methodology
    1.3.4 Case studies
      1.3.4.1 Interviews
      1.3.4.2 Workshops
      1.3.4.3 Other
    1.3.5 Experiments
    1.3.6 Data analysis
  1.4 Overview of publications
    1.4.1 Paper A: Static evaluation
    1.4.2 Paper B: Dynamic evaluation
    1.4.3 Paper C: Challenges, problems and limitations
    1.4.4 Paper D: Maintenance and return on investment
    1.4.5 Paper E: Long-term use
    1.4.6 Paper F: VGT-GUITAR
    1.4.7 Paper G: Failure replication
  1.5 Contributions, implications and limitations
    1.5.1 Applicability of Visual GUI Testing in practice
    1.5.2 Feasibility of Visual GUI Testing in practice
Bibliography
Chapter 1
Introduction
1.1 Introduction
Today, software is ubiquitous in all types of user products, from software ap-
plications to cars, mobile applications, medical systems, etc. Software allows
development organizations to broaden the number of features in their prod-
ucts, improve the quality of these features and provide customers with post-
deployment updates and improvements. In addition, software has shortened
the time-to-market in many product domains, a trend driven by the market
need for new products, features and higher quality software.
However, these trends place new time constraints on software develop-
ment organizations that limit the amount of requirements engineering, devel-
opment and testing that can be performed on new software [1]. For testing,
these time constraints imply that developers can no longer verify and vali-
date the software’s quality with manual test practices since manual testing is
associated with properties such as high cost, tediousness and therefore error-
proneness [2–7]. These properties are a particular challenge in the context
of changing requirements where the tests continuously need to be rerun for
regression testing [8, 9].
Automated testing has been suggested as the solution to this challenge since
automation allows tests to be run more frequently and at lower cost [4, 7, 10].
However, most automated test techniques have prerequisites that prohibit their
use on software written in certain programming languages, for certain oper-
ating systems, platforms, etc. [4, 11–13]. Additionally, most automated test
techniques operate on a lower level of system abstraction, i.e. against the
backend of the system. One such, commonly used, low-level test technique is
automated unit testing [14]. Whilst unit tests are applicable to find defects
in individual software components, their use for system and acceptance testing
is still a subject of ongoing debate [15, 16]. Test techniques exist for auto-
mated system and acceptance testing that interact with the system under test
(SUT) through hooks into the SUT or its GUI. However, these techniques do
not verify that the pictorial GUI, as shown to the user, behaves or appears
correctly. These techniques therefore have limited ability to fully automate
manual, scenario-based, regression test cases, in the continuation of this the-
sis referred to as manual test cases. Consequently, industry is in need of a
flexible and GUI-based test automation technique that can emulate human
tester behavior to mitigate the challenges associated with current manual and
automated test techniques.
In this thesis we introduce and evaluate Visual GUI Testing (VGT). VGT
is a term we have defined that encapsulates all tools that use image recog-
nition to interact with a SUT’s functionality through the bitmaps shown on
the SUT’s pictorial GUI. These interactions are performed with user emu-
lated keyboard and mouse events, which make VGT applicable to almost any GUI-driven application and usable to automate test cases that previously had to be
performed manually. Consequently, VGT has the properties that the software industry is looking for in a flexible, GUI-based, automated test technique, since the technique's only prerequisite is that the SUT has a GUI; a prerequisite that only limits the technique's applicability and usefulness for, for example, server or other backend software.
However, at the start of this thesis work the body of knowledge on VGT
was limited to analytical research results [17] regarding VGT tools, i.e. Trig-
gers [18], VisMap [19] and Sikuli [20]. Hence, no empirical evidence existed
regarding the technique’s applicability or feasibility of use in industrial prac-
tice. Applicability, in this thesis, refers to factors such as a test technique's defect-finding ability, usability for regression, system and acceptance testing, learnability and flexibility of use for different types of GUI-based software. Feasibility, in turn, refers to the long-term applicability of a technique,
including feasible development and maintenance costs, usability under strict
time constraints and suitable time until the technique provides positive return
on investment (ROI). Empirical evidence on these factors is key to understanding the real-life complexities of using the technique, to building best practices and to advancing its use in industrial practice [17, 21]. However, such evidence
can only be acquired through an incremental process that evaluates the tech-
nique from several perspectives and different industrial contexts. This the-
sis work was therefore performed in Swedish software industry, with different
projects, VGT tools and research techniques to fulfill the thesis research objec-
tive: to acquire evidence for, or against, the applicability, feasibility and viability of adopting and using VGT in industrial practice, including the challenges, problems and limitations associated with these activities. This work consequently resulted in an overall understanding of the current state-of-practice of VGT, of what impedes its continued adoption, and in a final, yet positive, conclusion regarding the long-term viability of VGT in industrial use.
1.2 Software engineering and the need for testing

1.2.1 Software Testing

Testing for the purpose of verification can be split into three types; unit,
integration and system testing [30], which are performed on different levels
of system abstraction [16, 26, 31] as shown in Figure 1.1. A unit test verifies
that the behavior of a single software component conforms to its low-level
functional requirement(s) and is performed either through code reviews or
more commonly through automated unit tests [9, 11, 14, 15, 32–34]. In turn,
integration tests verify several components' interoperability with each other and across layers of the SUT's implementation [16, 30].
Components can in this context be single methods or classes but also hardware
components in embedded systems. Finally, system tests are, usually, scenario-
based manual or automated tests that are performed either against the SUT’s
technical interfaces or the SUT’s GUI to verify that the SUT, as a whole [30],
conforms to its feature requirements [35–37]. However, scenario-based tests
are also used to validate the conformance of a SUT in acceptance tests that
are performed either by, or with, the SUT’s user or customer [35–38]. The
key difference between system and acceptance test scenarios is therefore how
representative they are of the SUT’s real-world use, i.e. the amount of domain
knowledge that is embedded in the test scenario.
Testing is also used to verify that a SUT’s behavior still conforms to the re-
quirements after changes to the SUT, i.e. regression tests. Regression tests can
be performed with unit, integration, system or acceptance test cases that have
predefined inputs for which there are known, expected, outputs [9]; inputs and outputs that are used to stimulate and assert various states of the SUT. As
such, the efficiency of a regression test suite is determined by the tests’ cover-
age of the SUT’s components, features, functions, etc [34, 39], i.e. the amount
of a SUT’s states that are stimulated during test execution. This also limits
regression tests to finding defects in states that are explicitly asserted, which
implies that the test coverage should be as high as possible. However, for
manual regression tests, high coverage is costly, tedious and error-prone [2–7],
which is the primary motivation why automated testing is needed and should
be used on as many different levels of system abstraction as possible [16, 40].
This is especially true in the current market, where the time available for testing is shrinking due to the demands for faster software delivery [1]; demands that have transformed automated testing from a "want" to a "must" in most domains.
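For illustration, a regression test in its simplest form pairs a predefined input with a known, expected output. The sketch below uses an invented function under test; it is an example of the concept, not code from the thesis' studies:

    # A minimal sketch of a regression test: a predefined input with a known,
    # expected output [9]. The function under test is an invented stand-in.
    def apply_discount(price: float, percent: float) -> float:
        return price * (1.0 - percent / 100.0)

    def test_apply_discount_regression():
        # Only explicitly asserted states are checked; any unasserted state
        # of the function remains uncovered by this test.
        assert apply_discount(100.0, 20.0) == 80.0

    test_apply_discount_regression()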
However, whilst lower levels of system abstraction are well supported by
automated regression test techniques, tools and frameworks, there is a lack of
automated techniques for testing through the pictorial GUI, i.e. the highest
level of system abstraction. This lack of support is the key motivator for the research presented in this thesis.
To cover any lack of regression test coverage, exploratory testing, defined
as simultaneous learning, test design and test execution, is commonly used
in industrial practice [41, 42]. The output of exploratory testing is not only a defect but also the scenario(s) that caused the defect to manifest, i.e. scenarios that can be turned into new regression tests. This technique has been found to be
effective [43] but has also been criticized for not being systematic enough for
fault replication. Further, the practice requires decision making to guide the
testing and is therefore primarily performed manually, despite the existence
of a few automated exploratory testing tools, e.g. CrawlMan [44]. However, automated exploratory testing is still an unexplored research area that warrants further research.
the organizational changes affect the company’s processes, e.g. due to changes
of the intended users’ responsibilities. Additionally, many automated test
techniques have prerequisites that prohibit their use on systems written in specific programming languages or running on certain operating systems and platforms [4, 11–13].
Therefore it is necessary to perform a pilot project to (1) evaluate if the new
technique is at all applicable for the intended SUT and (2) for what types of
tests the technique can be used. Thus a pilot project is an important activity
but also associated with a, sometimes substantial, cost. However, several of
these costs are often overlooked in practice and are thereby “hidden” costs
associated with any change to a software process.
However, this brings us to the third cost associated with automated test-
ing which is maintenance of test scripts. Maintenance constitutes a continuous
cost for all automated testing that grows with the size of the test suite. This
maintenance is required to keep the test scripts aligned with the SUT’s re-
quirements [49], or at least its behavior, to ensure that test failures are caused
by defects in the SUT rather than intended changes to the SUT itself, i.e.
failures referred to as false positives. However, larger changes to the SUT can
occur and the resulting maintenance costs can, in a worst case, become unrea-
sonable [12]. These costs can however be mitigated through engineering best
practices, e.g. modular test design [16, 40, 50]. However, best practices take time to acquire for any technique and are therefore often missing, as they are for VGT.
Hence, these three costs must be weighed against the value provided by the automated tests, for instance value in terms of defects found, or compared against the costs of alternative test techniques, e.g. manual testing. The
reason for the comparison is to identify the point in time when the costs of
automation break even with the alternatives, i.e. when return on investment
(ROI) is achieved. Hence, for any automated test technique to be feasible,
the adoption, development and maintenance costs must provide ROI, and they should do so as quickly as possible. Consequently, an overall view of costs,
value and other factors, e.g. learnability, adoptability and usability, is required
to provide an answer if a test automation technique is applicable and feasible
in practice. These factors were therefore evaluated during the thesis work to
provide industrial practitioners with decision support of when, how and why
to adopt and use VGT.
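To make the break-even reasoning concrete, the break-even point can be estimated with a simple calculation. The sketch below uses invented cost figures, not measurements from the thesis' studies:

    # A minimal sketch of a break-even (ROI) calculation for test automation.
    # All cost figures are invented examples, given in man-hours.
    import math

    def executions_to_roi(development_cost: float,
                          maintenance_cost_per_run: float,
                          manual_cost_per_run: float) -> int:
        # Number of automated suite executions until the accumulated saving
        # over manual testing has paid back the development investment.
        saving_per_run = manual_cost_per_run - maintenance_cost_per_run
        if saving_per_run <= 0:
            raise ValueError("automation never breaks even")
        return math.ceil(development_cost / saving_per_run)

    # E.g. 160 hours to develop the suite, 2 hours of maintenance per run and
    # 16 hours per equivalent manual test session:
    print(executions_to_roi(160, 2, 16))  # -> 12 executions to positive ROI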
Figure 1.2: Pseudocode example of a 2nd generation test script for a simple
application where GUI components are identified, in this case, through their
properties (Tags) associated with a user defined variable.
etc [57]. This functionality is required since these properties are unintuitive
without technical or domain knowledge, e.g. an ID number or component
type is not enough for a human to intuitively identify a component. How-
ever, combined, groups of properties allow the tester to distinguish between
components, exemplified with pseudocode in Figure 1.2.
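The figure itself is not reproduced here; instead, the self-contained sketch below illustrates the same idea of identifying a component through a combination of its properties. All component tags and helper names are invented for the example:

    # A minimal sketch of 2nd generation component identification: no single
    # tag is intuitive on its own, but a combination of property tags
    # (type, id, text, ...) pinpoints one GUI component.
    from dataclasses import dataclass

    @dataclass
    class Component:
        tags: dict  # e.g. {"type": "Button", "id": "42", "text": "OK"}

        def click(self):
            print(f"clicked component {self.tags}")

    def find(components, **sought):
        # Return the first component whose tags match all sought properties.
        for c in components:
            if all(c.tags.get(key) == value for key, value in sought.items()):
                return c
        raise LookupError(f"no component matches {sought}")

    gui = [Component({"type": "Button", "id": "42", "text": "OK"}),
           Component({"type": "Label", "id": "43", "text": "Hello World"})]
    find(gui, type="Button", text="OK").click()  # stimulate the SUT
    assert find(gui, type="Label", id="43").tags["text"] == "Hello World"  # assert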
Some 2nd generation tools, e.g. GUITAR [58], also support GUI ripping
that allow the tools to automatically extract GUI components, and their prop-
erties, from the SUT’s GUI and create a model over possible interactions with
the SUT. These models can then be traversed to generate scenarios of inter-
actions that can be replayed as test cases, a technique typically referred to
as model-based testing [59–63]. As such, provided that the interaction model
contains all GUI components, it becomes theoretically possible to automati-
cally achieve full feature coverage of the SUT since all possible scenarios of
interactions can be generated. However, in practice this is not possible since
the number of test cases grows exponentially with the number of GUI components and the length of the test cases, which makes it unreasonable to execute all of them. This problem is referred to as the state-space explosion problem and is
common to most model-based testing tools [59]. One way to mitigate the prob-
lem is to limit the number of interactions per generated test scenario but this
practice also limits the tests’ representativeness of real world use and stifles
their ability to reach faulty SUT states.
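To illustrate both the scenario generation and the need for the mitigation described above, the sketch below enumerates depth-bounded interaction sequences from a small, invented interaction model:

    # A minimal sketch of model-based test generation from a ripped GUI
    # interaction model (component -> components reachable after interacting
    # with it). The depth bound keeps the otherwise exponentially growing
    # set of scenarios manageable. The model itself is invented.
    model = {"start": ["menu", "ok"],
             "menu": ["item1", "item2"],
             "item1": ["ok"],
             "item2": ["ok"],
             "ok": []}

    def generate(state, depth, prefix=()):
        if depth == 0 or not model[state]:
            yield prefix  # a complete interaction scenario (test case)
            return
        for nxt in model[state]:
            yield from generate(nxt, depth - 1, prefix + (nxt,))

    for scenario in generate("start", depth=3):
        print(" -> ".join(scenario))  # e.g. menu -> item1 -> ok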
Furthermore, because 2nd generation GUI-based tools interact with the SUT through hooks into the GUI, these tests do not verify that the pictorial GUI conforms to the SUT's requirements, i.e. neither that its appearance is correct nor that human interactions with it are possible. In addition, the tools require these hooks into the SUT to operate, which restricts their
the tools require these hooks into the SUT to operate, which restricts their
use to SUT’s written in specific programming languages and for certain GUI
libraries/toolkits. This requirement also limits the tools’ use for testing of
systems distributed over several physical computers, cloud based applications,
etc., where the SUT’s hooks are not accessible.
Another challenge is that the tools need to know what properties a GUI
component has to stimulate and assert its behavior. Standard components,
included in commonly used GUI libraries, e.g. Java Swing or AWT, are
generally supported by most tools. However, for custom built components, e.g.
user defined buttons, the user has to create custom interpreters or hooks for
the tools to operate. These interpreters then need to be maintained if the components are changed, which adds to the overall maintenance costs; costs that have been reported to, in some cases, be substantial in practice [10, 12, 16, 52].
However, there are also some types of GUI components that are difficult or even impossible to test with this technique, e.g. components generated at runtime, since their properties are not known prior to execution of the system. As such,
there are several challenges associated with 2nd generation GUI-based testing
that limit the technique’s flexibility of use in industrial practice.
In summary, 2nd generation GUI-based testing is associated with quick and
often robust test execution due to their access to the SUT’s inner workings.
However, this access is a prerequisite for the technique's use and also limits its tools to testing applications written in certain programming languages, with certain types of components, etc. As a consequence, the technique lacks flexibility
in industrial use. Further, the technique does not operate on the same level of
system abstraction as a human user and does therefore not verify that the SUT
is correct from a pictorial GUI point of view, neither in terms of appearance nor behavior. Additionally, the technique is associated with script maintenance
costs that can be extensive and in worst cases infeasible [10, 12, 16, 52]. Conse-
quently, 2nd generation GUI-based testing does not fully fulfill the industry’s
needs for a flexible and feasible test automation technique.
Click [bitmap: OK button]
Type "Hello World"
AssertExists [bitmap: "Hello World" label]

Figure 1.3: Pseudocode example of a 3rd generation (VGT) test case for a simple application. GUI components are associated with the application's GUI component images (bitmaps).
VGT scripts are generally intuitive to understand, also for non-technical stake-
holders, since the scripts’ syntax is relatable to how the stakeholders would
themselves interact with the SUT [20], e.g. click on a target represented by
a bitmap and type a text represented by a string. This intuitiveness also provides VGT with high learnability, even for less technically adept users [65].
For comparison, Figure 1.3 shows a pseudocode VGT script that performs the same interactions as the 2nd generation example presented in Figure 1.2.
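As a concrete illustration, the same scenario expressed in a Sikuli-style Python script could look roughly as follows. This is a sketch: the image file names are invented and the script is only executable in a Sikuli environment:

    # A minimal Sikuli-style (Python API) sketch of the Figure 1.3 scenario.
    # Each .png name stands for a bitmap of a GUI component.
    click("ok_button.png")            # image recognition locates and clicks OK
    type("Hello World")               # keyboard emulation types the text
    assert exists("hello_world.png")  # bitmap assertion of the resulting GUI state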
Conceptually, image recognition is performed in two steps during VGT
script playback. First, the SUT's current GUI state is captured as a bitmap, e.g. a screenshot of the computer's desktop, which is sent together with the sought bitmap from the VGT script to the image recognition algorithm.
Second, the image recognition algorithm searches for the sought bitmap in the
screenshot and if it finds a match it returns the coordinates for the match that
are then used to perform an interaction with the SUT’s GUI. Alternatively,
if the image recognition fails, a false boolean is returned or an exception is
raised.
Different VGT tools use different algorithms, but most rely on similarity-based matching, which means that a match is found if the similarity between an area of the screenshot and the sought bitmap exceeds a percentage threshold [20]. This threshold is typically set to 70 to 80 percent similarity with the original image to counteract failures due to small changes to a GUI's appearance, e.g. a change of a GUI bitmap's color tint. However, similarity-based
matching does not prevent image recognition failure when bitmaps are resized
or changed completely.
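For illustration, the two conceptual steps and the similarity margin can be sketched with a general-purpose template-matching library. The code below is an example using OpenCV, not the implementation of any specific VGT tool, and the file names are invented:

    # A minimal sketch of similarity-based image recognition for VGT playback.
    import cv2

    def find_on_screen(screenshot_path, target_path, similarity=0.8):
        screen = cv2.imread(screenshot_path)  # step 1: the captured GUI state
        target = cv2.imread(target_path)      # the sought bitmap from the script
        # Step 2: search the screenshot for the sought bitmap.
        scores = cv2.matchTemplate(screen, target, cv2.TM_CCOEFF_NORMED)
        _, best_score, _, best_loc = cv2.minMaxLoc(scores)
        if best_score < similarity:  # below the 70-80 percent margin
            return None              # image recognition failure
        height, width = target.shape[:2]
        # Return the center of the match, used to interact with the GUI.
        return (best_loc[0] + width // 2, best_loc[1] + height // 2)

    print(find_on_screen("desktop.png", "ok_button.png"))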
Additionally, VGT scripts, similar to 1st and 2nd generation scripts, need
to be synchronized with the SUT’s execution. Synchronization in VGT is
performed with built-in functionality or methods that wait for one or more bitmaps to
appear on the screen before the script can proceed. However, these methods
also make VGT scripts slow since they cannot execute quicker than the state
transitions of the GUI, which is a particular challenge for web-systems since
waits also need to take network latency into account.
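Such synchronization can be sketched as a polling loop around the image recognition call. The helper below reuses the hypothetical find_on_screen sketch above and assumes a capture_screen helper that screenshots the desktop and returns the file path:

    # A minimal sketch of VGT script synchronization: poll for a bitmap until
    # it appears on the screen or a timeout expires.
    import time

    def wait_for(target_path, timeout=30.0, interval=0.5):
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            match = find_on_screen(capture_screen(), target_path)
            if match is not None:
                return match       # the GUI reached the expected state
            time.sleep(interval)   # wait out the GUI (or network) transition
        raise TimeoutError(target_path + " never appeared on the screen")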
In summary, VGT is a flexible automated GUI-based test technique that uses tools with image recognition to interact with, and assert, a SUT's behavior through its pictorial GUI. However, the technique's maturity is unknown and
this thesis therefore aims to evaluate if VGT is applicable and feasible in
industrial practice.
1.2.3.4 Comparison
To provide a general background and overview of the three generations of
automated GUI-based testing, some of their key properties have been presented
in Table 1.1. The table shows which properties each technique has ("Y") or lacks ("N"), or whether a property is supported by some, but not all, of the technique's tools ("S"). These properties were acquired during the thesis work as empirical
results or through analysis of related work. However, they are not considered
to be part of the thesis main contributions even though they support said
contributions.
Several properties are shared by all techniques. For instance, they can all be used to automate manual test cases, but only VGT tools also support bitmap assertions and user emulation, and VGT is therefore the only technique that provides results of equal quality to manual tests. Further, all three techniques
are perceived to support daily continuous integration and all techniques require
the scripts to be synchronized with the SUT’s execution. Finally, none of the
techniques are perceived as replacements to manual testing since all of the
techniques are designed for regression testing and therefore only find defects
in system states that are explicitly asserted. In contrast, a human can use
cognitive reasoning to determine if new, previously unexplored, states of the
SUT are correct. Consequently, a human oracle [69] is required to judge if a
script’s outcome is correct or not.
Other properties of interest regard the techniques' robustness to change.
For instance, both 2nd and 3rd generation tools are robust to GUI layout
change, assuming, for the 3rd generation, that the components are still shown
on the screen after change. In contrast, 1st generation tools are fragile to this
type of change since they are dependent on the GUI components’ location
being constant.
However, 1st generation tools, and also 3rd generation tools, are robust to
changes to the SUT’s GUI code whilst 2nd generation tools are not, especially
if these changes are made to custom GUI components, the GUI libraries or
GUI toolkits [12].
Finally, 1st and 2nd generation tools are robust to changes to the GUI
components’ bitmaps since none of the techniques care about the GUI’s ap-
pearance. In contrast, 3rd generation tools fail if either the appearance or the
behavior of the SUT is incorrect.
Consequently, the different techniques have different benefits and draw-
backs that are perceived to make the techniques more or less applicable in
different contexts.
Table 1.1: The positive and negative properties of different GUI-based test techniques. All properties have been formulated such that a "Y" indicates that the property is supported by the technique, "N" indicates that it is not supported, and "S" indicates that some, but not all, of the technique's tools support the property.
RQ1: To what extent is Visual GUI Testing applicable in industrial practice?
This question addresses, for instance, whether the technique can at all find failures and defects on industrial grade systems. Additionally,
it aims to identify support for what types of testing VGT is used for, e.g. only regression testing of system and acceptance tests or exploratory testing as well. This question also addresses if VGT can be used in different contexts
and domains, such as agile software development companies, for safety-critical
software, etc. Support for this question was acquired throughout the thesis
work but in particular in the studies presented in Chapters 2, 3, 4, 6 and 8,
i.e. Papers A, B, C, E and G.
RQ2: To what extent is Visual GUI Testing feasible for long-term use in
industrial practice?
Feasibility refers to the maintenance costs and return on investment (ROI)
of adoption and use of the technique in practice. This makes this question
key to determine the value and long-term industrial usability of VGT. Hence,
if maintenance is too expensive, the time to positive ROI may outweigh the
technique’s benefits compared to other test techniques and render the tech-
nique undesirable or even impractical in practice. This question also concerns
the execution time of VGT scripts to determine in what contexts the tech-
nique can feasibly be applied, e.g. for continuous integration? Support for
this research question was, in particular, acquired in three case studies at four
different companies, presented in Chapters 3, 5, and 6, i.e. Papers B, D and
E.
RQ3: What are the challenges, problems and limitations of adopting, us-
ing and maintaining Visual GUI Testing in industrial practice?
This question addresses if there are challenges, problems and limitations (CPLs)
associated with VGT, the severity of these CPLs and if any of them prohibit
the technique’s adoption or use in practice. Furthermore, these CPLs represent
pitfalls that practitioners must avoid and therefore take into consideration to
make an informed decision about the benefits and drawbacks of the technique,
i.e. how the CPLs might affect the applicability and feasibility of the tech-
nique in the practitioner’s context. To guide practitioners, this question also
includes finding guidelines for the adoption, use and long-term use of VGT in
practice.
Results to answer this question were acquired primarily from three case
studies that, fully or in part, focused on CPLs associated with VGT, presented
in Chapters 3, 4 and 6, i.e. Papers B, C and E.
RQ4: What technical, process, or other solutions exist to advance Visual
GUI Testing’s applicability and feasibility in industrial practice?
This question refers to technical or process oriented solutions that improve
the usefulness of VGT in practice. Additionally, this question aims to identify
future research directions to improve, or build upon, the work presented in
this thesis.
Explicit work to answer the question was performed in an academic study,
presented in Chapter 7, i.e. Paper F, where VGT was combined with 2nd gen-
eration technology to create a fully automated VGT tool. Additional support
was acquired from an experience report presented in Chapter 8 (Paper G)
where a novel VGT-based process was reported from industrial practice.
[Figure omitted: a flow of connected boxes for the thesis' papers, e.g. Paper B: Dynamic evaluation; Paper C: Challenges, problems and limitations; Paper D: Maintenance costs; Paper F: VGT-GUITAR.]

Figure 1.4: A chronological mapping of how the studies included in this thesis are connected to provide support for the thesis' four research questions. The figure also shows which papers provided input (data, challenges, research questions, etc.) to subsequent papers. CPLs - challenges, problems and limitations.
per D). These results were acquired through empirical work with an industrial
system (Static analysis) and interviews with practitioners that had used VGT
for several months (Dynamic analysis). However, results regarding the long-
term feasibility of the technique were still missing, a gap in knowledge that
was filled by an interview study at a company that had used VGT for several
years (Paper E ). Consequently, these studies provided an overall view of the
current state-of-practice of VGT. In addition they provided support to draw
conclusions regarding the applicability (RQ1 ) and feasibility (RQ2 ) of VGT
in practice, but also which CPLs are associated with the technique (RQ3).
Further, to advance state-of-practice, a study was performed where VGT
was combined with 2nd generation technology that resulted in a building block
for future research into fully automated VGT (Paper F )(RQ4 ). Additional
support for RQ4 was acquired from an experience report from industry (Paper
G) where a novel semi-automated exploratory test process based on VGT was
reported.
Combined, these studies provide results to answer the thesis' four research questions and a significant contribution to the body of knowledge of VGT and automated testing.
[Figure 1.5 omitted: a classification of the studies included in the thesis, ranging from exploratory, through explanatory, to descriptive, together with the data collection methods used, e.g. interviews, workshops, surveys, document analysis, experiment and experience report. Letters A through G denote the studies presented in the respective papers.]

The case studies transitioned from exploratory, through explanatory, ending with
Paper E that was descriptive, depicted in Figure 1.5. Hence, the thesis work
transitioned from exploration to explanation of the capabilities and properties
of VGT to description of its use in practice. This transition was driven by the
incrementally acquired results from each study, where later studies thereby
aimed to verify the results of earlier studies. Figure 1.5 also includes studies
that were not case studies, i.e. Papers F and G which were an experiment and
an experience report respectively, depicted to show how they were classified in
relation to the other papers included in the thesis.
Furthermore, the performed case studies were all inherently different, i.e.
conducted with different companies, in different domains, with different sub-
jects and VGT tools, which has strengthened both the construct and external
validity of the thesis conclusions. Further, interviews were used for the ma-
jority of the data collection to acquire in-depth knowledge about the adoption
and use of VGT. However, quantitative, or quantifiable, data was also acquired
since it was required to compare VGT to other test techniques in the studies,
and the thesis. For instance, quantitative data was acquired to compare the
performance and cost of VGT to both manual test techniques and 2nd genera-
tion GUI-based testing. However, comparisons were also made with qualitative
data, such as practitioners’ perceptions about benefits and drawbacks of dif-
ferent techniques, to get a broad view of the techniques’ commonalities and
differences in different contexts, thus ensuring that the included studies' individual contributions were triangulated with data from different sources and methods to improve the results' internal validity.
1.3.4.1 Interviews
Interviews are commonly used for data collection in case study research and
can be divided into three different types: structured-, semi-structured and
unstructured interviews [71, 75]. Each type is performed with an interview
guide that contains step-by-step instructions for an interview, including the
interview questions, research objectives, etc. In addition, interview guides
shall include a statement regarding the purpose of the study and an assurance of the interviewee's anonymity, which helps to mitigate biased or untruthful answers. Further, these types of interviews vary in strictness, which relates to
the makeup of the interview guide as well as the interviewer’s freedoms during
an interview.
Structured interviews: Structured interviews are the most strict [71] and restrict the interviewer from asking follow-up questions or asking the interviewee to clarify their answers. Therefore, considerable effort should be spent on
the interview guide to test it and to ensure that the interview questions are
unambiguous and valid to answer the study’s research questions. Structured
interview questions can be of different types, but multiple-choice or forced-choice questions are the most common. Forced-choice questions, e.g. Likert-scale questions,
can be analyzed with statistics [76] but require a larger interview sample,
which makes the method costly in terms of resources. Therefore, structured
interviews were not used during the thesis work.
Semi-structured interviews: The second most strict type of interview
is called semi-structured interviews [71], which allow the interviewer to elicit
more in-depth or better quality information by asking follow-up questions or asking the interviewee to elaborate on their answers.
1.3.4.2 Workshops
1.3.4.3 Other
Interviews and workshops were the primary methods used in the case studies
included in this thesis. However, other methods were also used during the thesis work (where thesis work refers also to studies performed by the author that are not appended to this thesis), some of which will be briefly discussed in this section.
Document analysis: Interviews and workshops acquire first degree data [71,
84], i.e. data directly from a source of information such as an interviewee. In
turn, second degree data is collected indirectly from a source, for instance,
through transcription of a recorded interview. However, document analysis
relies on third degree data [17], which is data that has already been tran-
scribed and potentially analyzed.
From a company perspective, document analysis can be a cost-effective
method of data transference but can be a time-consuming activity for the
researcher, especially in foreign or highly technical domains. Further, third
degree data is created by a third person and can therefore include biases which
means that document root-source analysis is required to identify who created
the document, for what purpose, the age of the information, etc. [17], i.e. to
evaluate the documented information’s validity.
Document analysis was used in the thesis work to acquire information
about the research companies’ processes and practices. In particular, test
specifications were analyzed to give input for empirical work with VGT at the
studied companies, e.g. in Paper A where manual test cases at Saab AB were
automated with two different VGT tools.
Further, this method can be used in a survey to conduct systematic map-
pings and systematic literature reviews [85] of published research papers. How-
ever, due to the limited body of knowledge on VGT, no such study was per-
formed during the thesis work.
Surveys: Surveys are performed on samples of people, documents, software, or other groups [86] for the purpose of acquiring general conclusions
regarding an aspect of the sample [71]. For instance, a survey with people can
serve to acquire their perceptions of a phenomenon, whilst document surveys instead aim at document synthesis [85], etc.
In software engineering research, surveys are often performed with ques-
tionnaires as an alternative to structured interviews [71]. One benefit of ques-
tionnaires is that they can be distributed to a large sample at low cost but if
there is no incentive for the sample to answer the questionnaire, the partici-
pant response-rate can be low, i.e. less than the rule of thumb of 60 percent
that is suggested for the survey to be considered valid.
Questionnaire questions can be multiple-choice, forced-choice or open, i.e.
free text. Forced choice questions are often written as Likert scale ques-
tions [76], i.e. on an approximated ratio-scale between, for instance, totally
disagree and totally agree. In turn, multiple choice questions can ask par-
ticipants to rank concepts on ratio-scales, e.g. with the 100 dollar bill ap-
proach [87]. However, questions can have other scales such as nominal, ordinal
or interval scales [88]. These scales serve different purposes and it is therefore
important to choose the right type to be able to answer the study’s research
questions. Regardless, questionnaire creation is a challenge since the questions
must be unambiguous, complete, use context specific nomenclature, etc., to
be of high quality. Therefore, like interview guides, questionnaires must be
reviewed and tested prior to use.
Questionnaires were used during the thesis work to verify previously gath-
ered results and to acquire data in association with workshops. Explicitly,
among the research papers included in this thesis, a questionnaire survey was
used in Paper E. The results of the surveys were then analyzed qualitatively
or with formal or descriptive statistics, discussed further in Section 1.3.6, to
test the studies’ hypotheses or answer the studies’ research questions.
Observation: Observations are fundamental in research and can be used
in different settings and performed in different ways, e.g. structured or un-
structured [89]. One way to perform an observation is the fly on the wall
technique, where the researcher is not allowed to influence the person, pro-
cess, etc., being observed. Another type is the talk-aloud protocol, where the
observed person is asked to continuously describe what (s)he is doing [17]. As
such, observations are a suitable practice to acquire information about a phe-
nomenon in its actual context and can also provide the researcher with deeper
understanding of domain-specific or technical aspects of the phenomenon.
However, observation studies are associated with several threats, for in-
stance the Hawthorne effect, which causes the observed person to change
his/her behavior because they know they are being observed [90]. There-
fore, the context of the observation must be taken into consideration as well
as ethical considerations, e.g. how, what and why something or someone is
being observed [17]. An example of unethical observation would be to observe
a person without their knowledge.
Planned observation, with an observation guide, etc., was only used once
during the thesis work to observe how manual testing was performed at a
company. However, observations were also used to help explain results from
the empirical studies with VGT, i.e. in Papers A and D.
1.3.5 Experiments
Experimentation is a research methodology [73] that focuses on identifying which factor(s) (the independent variable(s)) affect a measured factor of the phenomenon (the dependent variable(s)). As such, experiments aim to compare the
impact of treatments (change of the independent variable(s)) on the dependent
variable(s).
Experimental design begins with formulation of a research objective that
is broken down into research questions and hypotheses. A hypothesis is a statement, which the study will test, that can be either true or false, for instance a statement about the expected outcome of a treatment on the dependent variable. Therefore,
experiments primarily aim to acquire quantitative or quantifiable data, which
can be analyzed statistically to accept or reject the study’s hypotheses and
answer the study’s research questions.
However, experiments are also affected by confounding factors [73], i.e. fac-
tors outside the researcher’s control that also influence the dependent variable.
These factors can be mitigated through random sampling that cancels out the
confounding factors across the sample [91] such that measured changes to the
dependent variable are caused only by changes to the independent variable(s).
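As a small illustration of this principle, the sketch below randomly assigns subjects to treatments so that confounding factors are, on average, balanced across the groups. All names are invented:

    # A minimal sketch of random assignment of subjects to treatments.
    import random

    def assign(subjects, treatments):
        shuffled = list(subjects)  # copy to avoid mutating the input
        random.shuffle(shuffled)
        # Deal the shuffled subjects round-robin over the treatments.
        return {t: shuffled[i::len(treatments)]
                for i, t in enumerate(treatments)}

    groups = assign(["s1", "s2", "s3", "s4", "s5", "s6"], ["VGT", "manual"])
    print(groups)  # e.g. {'VGT': ['s4', 's1', 's6'], 'manual': ['s2', 's5', 's3']}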
However, in some contexts, e.g. in industry, it is not possible to random-
ize the studied sample and instead quasi-experiments need to be used [73, 92].
Whilst controlled experiments are associated with a high degree of internal va-
lidity but lower construct validity (due to manipulation of contextual factors),
quasi-experiments have lower internal validity but higher construct validity
since they are performed in a realistic context.
Further, compared to case studies, controlled experiments have high repli-
cability, i.e. an experiment with a valid experimental procedure can be repli-
cated to acquire the same outcome as the original experiment. It is therefore
common that publications that report experimental results present the exper-
imental procedure in detail and make the experimental materials available to
other researchers.
Experiments were performed as part of two papers included in the thesis,
i.e. Papers A and F. In Paper A, the applicability of VGT to test non-animated GUIs was compared to its applicability to test animated GUIs. In Paper F, an experiment compared the false test results generated by 2nd and 3rd generation GUI-based testing during system and acceptance tests.
Papers A, C and G - Saab AB (Gothenburg): safety-critical air-traffic management software; size M; plan-driven and agile development; manual system and acceptance testing; VGT tool: Sikuli (Python API).

Paper B - Saab AB (Järfälla): mission-critical military control software; size M; plan-driven and agile development; manual system and acceptance testing, automated unit testing; VGT tool: Sikuli (Python API).

Paper D - Saab AB (Växjö): safety-critical air-traffic management software; size M; plan-driven and agile development; manual system and acceptance testing; VGT tool: Sikuli (Python API).

Paper D - Siemens Medical (Gothenburg): life-critical medical journal systems; size S; agile (Scrum) development; manual scenario-based and exploratory system and acceptance testing, automated unit testing; VGT tool: JAutomate.

Paper E - Spotify (Gothenburg/Stockholm): entertainment streaming application; size L; agile (Scrum) development; manual scenario-based and exploratory system, acceptance and user experience testing, automated unit, integration and system testing; VGT tool: Sikuli (Java API).

Table 1.3: Summary of key characteristics of the companies/divisions/groups that took part in the studies included in the thesis. Size is categorized as small (S, fewer than 50 developers), medium (M, fewer than 100 developers) or large (L, more than 100 developers in total). Note that Saab AB refers to specific divisions/companies within the Saab corporation. API - Application Programming Interface.
3.5 hours, which was an improvement of 4.5 times compared to manual testing
that took 16 hours on average.
Further, none of the null hypotheses in regards to development time, lines
of code and execution time could be rejected. The study therefore concludes that there is no statistically significant difference between the two tools on any of these measures. Therefore, since the tools could successfully automate the
industrial test cases, the study provides initial support that VGT is applicable
for automation of manual system test cases in industrial practice.
Contributions: The study's main contributions are as follows:
CA1: Initial support for the applicability of VGT to automate manual scenario-
based industrial test cases when performed by experts,
CA2: Initial support for the positive return on investment of the technique,
and
CA3: Comparative results regarding the benefits and drawbacks of two VGT
tools used in industrial practice.
This work also provided an industrial contribution to Saab with decision sup-
port regarding which VGT tool to adopt.
Further, the case study showed that the VGT scripts had been im-
plemented as 1-to-1 mappings of the manual test cases and therefore con-
sisted of small use cases, combined in meta-models, into longer test scenarios.
This architecture was perceived beneficial to lower development and mainte-
nance costs since the modular design facilitated change and reuse of use case
scripts [40].
In addition, the VGT scripts executed 16 times faster than the manual
tests, identified all regression defects the manual tests found but also defects
that were previously unknown to the company. These additional defects were
found by changing the order of the test scripts between executions that resulted
in stimulation of previously untested system states. As such, VGT improved
both the test frequency and the defect-finding ability of Saab’s testing com-
pared to their manual ATD testing.
However, six challenges were found with VGT, including high SUT-script synchronization costs, high maintenance costs of older scripts or scripts writ-
ten by other testers, low (70 percent) success-rate of Sikuli’s image recognition
when used over a virtual network connection (VNC) (100 percent locally), un-
stable Sikuli behavior, etc. These challenges were solved with ad hoc solutions;
for instance, minimized use of VNC, script documentation, script coding stan-
dards, etc.
Additionally, three months into the study a large change was made to the
SUT that required 90 percent of the VGT scripts to be maintained. This
maintenance provided initial support for the feasibility of VGT script main-
tenance, measured to 25.8 percent of the development cost of the suite. The
scripts’ development costs were also used to estimate the time to positive ROI
of automating all 40 ATDs, which showed that positive ROI could be achieved
within 6-13 executions of the VGT test suite.
Finally, the post-study showed that VGT was perceived as both valuable
and feasible by the testers, despite the observed challenges.
Contributions: The main contributions provided by this study are as follows:
CB1: Support that VGT is applicable in industrial practice when adopted and
applied by practitioners in an industrial project environment,
CB2: Additional support that positive ROI can be achieved from adopting
VGT in practice,
CB3: Initial support that the maintenance costs of VGT scripts can be feasi-
ble, and
CB4: Challenges and solutions related to the adoption and use of VGT in the
studied project.
CC2: Four general solutions that solve or mitigate roughly half of the identified
CPLs, and
[Figure omitted: a cost model plotting cost (hours, 0 to 1500) against calendar weeks (0 to 600). VGT script development is followed by either frequent or infrequent VGT script maintenance, compared against baselines of 20 percent manual testing (a fictional project) and 7 percent manual testing (Saab), with indicated break-even points at 45, 180 and 532 weeks.]

Figure 1.6: Model of the measured development and maintenance costs of VGT compared to the costs of manual testing.
techniques were used at the company. Additionally, the interviews were com-
plemented with two workshops, one exploratory in the beginning of the study
and one with one person to verify previously collected results and to identify
the company’s future plans for VGT.
Finally, VGT was statistically compared to an alternative test technique
developed by Spotify (the Test Interface) based on properties acquired in the
interviews that were quantified based on the techniques’ stated benefits and
drawbacks.
Results: VGT was adopted at Spotify after an attempt to embed inter-
faces for GUI testing (the Test interface) into the company’s main application
had failed due to lack of developer mandate and high costs. Further, be-
cause the application lacked the prerequisites of most other test automation
frameworks, VGT became the only option. VGT was adopted with the tool Sikuli [54] and its success could be attributed to three factors.
Several benefits were observed with VGT, such as value in terms of found
regression defects, robust script execution in terms of reported false test results,
feasible script maintenance costs in most projects, support for testing of the
release ready product, support for integration of external applications without
code access into the tests, etc. Additionally, VGT integrated well with the
open source model-based testing tool Graphwalker for model-based Visual GUI
Testing (MBVGT). MBVGT made reuse and maintenance of scripts more cost-
effective.
However, several drawbacks were also reported, such as costly maintenance
of images in scripts, inability to test non-deterministic data from databases,
limited applicability to run tests on mobile devices, etc. Because of these draw-
backs, Spotify abandoned VGT in several projects in favor of the originally
envisioned “Test interface” solution which became realizable after the adop-
tion of VGT due to VGT’s impact on the company’s testing culture. Hence,
VGT had shown the benefits of automation which gave developers mandate
to adopt more automation and create the Test interface. These interfaces
are instrumented by Graphwalker models that use the interfaces in the source
code to collect state information from the application’s GUI components that
is then used to assert the application’s behavior. This approach is beneficial
since it notifies the developer if an interface is broken when the application is
compiled, which ensures that the test suites are always maintained.
Additionally, the Test interface has several benefits over VGT, such as
better support for certain test objectives (e.g. tests with non-deterministic
data), faster and more robust test execution, etc. However, the Test interface
also has drawbacks, such as inability to verify that the pictorial GUI conforms
to the application’s specification, inability to perform interactions equal to a
human user, required manual synchronization between application and scripts,
lack of support for audio output testing, etc.
during the study, to create 18 faulty versions of the application. A test suite
was then generated for the original version of the application that was executed
with GUITAR and VGT-GUITAR (Independent variable) on each mutant to
measure the number of correctly identified mutants, false positives and false
negative test results (Dependent variables). The dependent variables were then
analyzed to compare the two techniques in terms of reported false positives
and negatives for system and acceptance tests, where system tests evaluated
the SUT’s behavior whilst acceptance tests also took the SUT’s appearance
into account. In addition, the execution time of the scripts in the two tools was recorded and compared.
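For clarity, the verdict categories that make up the dependent variables can be expressed as a simple classification. The sketch below illustrates the terminology rather than the study's actual analysis code:

    # A minimal sketch of classifying a script's verdict on a mutant against
    # the ground truth of whether the mutant is actually faulty.
    def classify(script_failed, mutant_is_faulty):
        if script_failed and mutant_is_faulty:
            return "true positive"   # correctly identified mutant
        if script_failed and not mutant_is_faulty:
            return "false positive"  # fails although the SUT behaves correctly
        if not script_failed and mutant_is_faulty:
            return "false negative"  # the faulty mutant slips through
        return "true negative"

    assert classify(script_failed=True, mutant_is_faulty=False) == "false positive"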
The study was concluded with a case study where GUITAR and VGT-
GUITAR were applied on three open source applications to identify support
for the tools’ industrial applicability.
Results: Statistical analysis of the experiment’s results showed that 3rd
generation scripts report statistically significantly more false positives for sys-
tem tests than 2nd generation tools and that 2nd generation tools report sta-
tistically significantly more false negative results for acceptance tests. These results could be explained by observations of the scripts' behavior on different mutants and relate to how the two techniques stimulate and assert the SUT's
behavior, i.e. through hooks into the SUT or by image recognition. As an
example, if the GUI’s appearance was changed such that a human could still
interact with it, e.g. by making a button larger, the 3rd generation scripts
would report a false positive result since the image recognition would fail.
However, the 2nd generation scripts would pass since the hook to the button
still remained. In contrast, if the GUI’s appearance was changed such that
a human could not interact with it, e.g. by making a button invisible, the
2nd generation scripts would produce a false negative since the hook allowed
the script to interact with the invisible button. However, the 3rd generation
scripts would successfully fail because the image recognition would not find a
match. The results of the experiment therefore indicate that a combination of
the 2nd and 3rd generation techniques could be the most beneficial because of
their complementary behavior for system and acceptance tests.
The subsequent case study did however show that VGT-GUITAR is not yet applicable in industrial practice since the tool had a zero percent success rate on all of the open source applications, caused by technical limitations in the tool, e.g. it could not capture screenshots of all GUI components.
Additionally, the test cases were generated for GUITAR that can, for instance,
interact with a menu item without expanding the menu, i.e. functionality
that is not supported by 3rd generation tools. Hence, further development is
required to make VGT-GUITAR applicable in practice but the tool still shows
proof-of-concept for fully automated 3rd generation GUI-based testing due to
its successful use for the simpler application in the experiment.
Contributions: This study thereby provides the following main contri-
butions:
CF1: Comparative results regarding the fault-finding ability of 2nd and 3rd
generation GUI-based tools in terms of false test results for system and
acceptance tests.
CF2: Initial support that a completely automated 3rd generation test tool
could be developed even though the tool developed in the study, VGT-
GUITAR, still requires additional work to become applicable in practice.
CG1: A success-story from industrial practice that shows that VGT can be
used to replicate and resolve non-frequent and nondeterministic defects,
and
CG2: A case where automated testing was paired with manual practices to
create a novel, semi-automated, test practice, implying that similar pro-
cesses can be achieved with other, already available, test frameworks in
practice.
Together, these four contributions let us draw the conclusion that VGT fulfills the industrial need for a flexible GUI-based test automation technique
and is mature enough for widespread use in industrial practice. This conclu-
sion is of particular value to companies that have GUI-based systems that
lack the prerequisites, e.g. specific technical interfaces, required by other test
automation frameworks, since VGT finally provides these companies with the
means to automate tests, thereby lowering costs and raising software quality. However,
there are still many challenges, problems and limitations (CPLs) associated
with VGT; pitfalls that could prohibit the successful adoption, or longer-term
use, of the technique in practice. These pitfalls must be taken into
consideration by adopting companies and be addressed, and mitigated, by future
academic research.
The remainder of this section presents the detailed syntheses of the
included research papers' individual contributions and how they support
the thesis objective and main conclusion.
Table 1.4: Mapping of the individual contributions presented in Section 1.4 to the thesis research questions. P - Paper, ID - Identifier of contribution, Cont. - Contribution, RQX - Research question X, CPLs - Challenges, problems and limitations.
Table 1.5: Summary of the estimated development time, execution time and ROI of 100 VGT test cases, based on the results acquired in Papers A, B and C. Script development time is compared to the average time spent on manual testing in the three projects, 263 hours. In comparison, the calendar time spent on testing in a six-month project with 20 percent testing is 192 hours. mh. - Man-hours, Dev. - Development, Exe. - Execution, ROI - Return on investment, Min - Minutes, Sec - Seconds.
First, the cost of maintaining images is lower than that of script logic. Second, the quantitative results were
visualized in a ROI cost model, presented in Section 1.4.4 in Figure 1.6. These
results indicate, in the best case, that the development and maintenance costs of
a VGT suite provide positive ROI within one development iteration, given that
at least 20 percent of the project’s cost is associated with manual testing and
that maintenance is performed frequently. However, if a company currently
spends less time on manual testing and if scripts are maintained infrequently,
the time to positive ROI could be several years, in Saab AB’s case 532 weeks
(or over 10 years).
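As a rough illustration of this break-even reasoning, the Python sketch below computes the time to positive ROI from assumed cost parameters. All numbers are placeholders chosen for the example, not the figures reported in Papers A, B and C.

    def weeks_to_positive_roi(dev_cost_h, maint_h_per_week, manual_test_h_per_week):
        """Weeks until the hours saved on manual testing exceed the
        development and maintenance cost of the VGT suite (illustrative)."""
        saved_per_week = manual_test_h_per_week - maint_h_per_week
        if saved_per_week <= 0:
            return None  # maintenance consumes all savings; ROI never turns positive
        return dev_cost_h / saved_per_week

    # Frequent maintenance keeps weekly upkeep low: break-even within a year.
    print(weeks_to_positive_roi(300, 2, 10))   # 37.5 weeks
    # Infrequent maintenance raises the effective weekly cost: years to ROI.
    print(weeks_to_positive_roi(300, 9, 10))   # 300.0 weeks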
Consequently, successful long-term use of VGT has several prerequisites.
First, VGT needs to be integrated into the company's development and test
process, and the company's organization needs to be adapted to the changed
process, e.g. to facilitate the need for frequent maintenance. Second, the
developed VGT suite should follow engineering best practice, i.e. be based
on a modularized architecture, have suitable amounts of failure mitigation,
etc. [40]; a minimal sketch of such a modularized suite is given after this
paragraph. Further, test scripts shall be kept as short and linear as possible to
mitigate script complexity, which is also mitigated by coding standards that
improve script readability. Third, test automation should, first and foremost,
be performed for stable test cases, since this practice mitigates unnecessary
maintenance costs and aligns with the technique's primary purpose: to perform
regression testing.
form regression testing. Additional factors were reported in Paper D, some
that are common to other automated test techniques, but it is unknown how
comprehensive this set of factors is and future research is therefore required
to expand this set.
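To illustrate the modularization best practice referred to above, the sketch below separates GUI knowledge (the bitmaps) from short, linear test scripts, so that a GUI change requires maintenance in a single place. All module, image and method names are illustrative assumptions, not prescriptions from Paper D.

    # gui_components.py: the only module that knows about bitmaps (illustrative).
    class LoginScreen:
        USER_FIELD = "user_field.png"
        LOGIN_BUTTON = "login_button.png"

        def __init__(self, screen):
            self.screen = screen  # hypothetical image-recognition facade

        def login(self, user):
            self.screen.click(self.USER_FIELD)
            self.screen.type(user)
            self.screen.click(self.LOGIN_BUTTON)

    # test_login.py: short, linear scripts that reuse the component module.
    def test_valid_login(screen):
        LoginScreen(screen).login("alice")
        assert screen.exists("welcome_banner.png")

If the login button's appearance changes, only LoginScreen needs new bitmaps; the test scripts remain untouched.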
In summary, we conclude that VGT is feasible in industrial practice, with
development and maintenance costs that are significant yet manageable and
that provide positive return on investment. However, there are still challenges
associated with the maintenance of VGT scripts that require suitable practices,
organizational change, as well as technical support, to be mitigated.
Table 1.6: Summary of key reported CPLs. For each CPL, its affect and impact have been ranked. Affect refers to what the CPL affects (Adoption, Usage or Maintenance); Impact to how serious (Low, Medium or High) its presence is for a company. Column “Support” indicates in which studies the CPL was reported. The table is sorted based on impact.
Table 1.7: Summary of guidelines to consider during the adoption, use or long-term use of VGT in industrial practice.
1.5.5 Implications
This thesis presents results with implications for both industrial practice and
academia, e.g. decision support for practitioners and input for future academic
research.
Table 1.8: Summary of the threats to validity of the thesis results for each of
the thesis research questions. CPLs - Challenges, problems and limitations.
question and also results that complement each other. For instance, the quan-
titative results regarding the feasibility of VGT (Papers B and D) could be
triangulated by statements from practitioners that had used the technique for
a longer period of time (Papers D and E). Similar connections were found for
results that support the applicability, for instance regarding the technique’s
defect-finding ability, both for regression testing (Papers B, C, D and E) and
for infrequent/non-deterministic defects (Paper F). Therefore, the internal
validity of the conclusions for research questions 1 and 2 is considered high.
However, the internal validity of the identified CPLs is only considered
moderate because different, unique, CPLs were identified in different studies. This
observation implies that there could be additional CPLs that can emerge in
other companies and domains.
Lastly, the internal validity regarding advances of VGT is also perceived
to be moderate because these results were only provided by two studies, i.e.
Papers F and G, which had specific focuses that are perceived as narrow compared
to the many possible advances to VGT, as outlined in Section 1.5.5.2.
[6] A. Memon, “GUI testing: Pitfalls and process,” IEEE Computer, vol. 35,
no. 8, pp. 87–88, 2002.
[25] G. Myers, C. Sandler, and T. Badgett, The art of software testing. Wiley, 2011.
[32] E. Gamma and K. Beck, “JUnit: A cook’s tour,” Java Report, vol. 4, no. 5, pp. 27–38, 1999.
[34] H. Zhu, P. A. Hall, and J. H. May, “Software unit test coverage and adequacy,” ACM Computing Surveys (CSUR), vol. 29, no. 4, pp. 366–427, 1997.
[54] T. Chang, T. Yeh, and R. Miller, “GUI testing using computer vision,” in Proceedings of the 28th International Conference on Human Factors in Computing Systems. ACM, 2010, pp. 1535–1544.
[57] W.-K. Chen, T.-H. Tsai, and H.-H. Chao, “Integration of specification-based and CR-based approaches for GUI testing,” in Advanced Information Networking and Applications, 2005. AINA 2005. 19th International Conference on, vol. 1. IEEE, 2005, pp. 967–972.
[61] P. Fröhlich and J. Link, “Automated test case generation from dynamic models,” ECOOP 2000 Object-Oriented Programming, pp. 472–491, 2000.
[63] M. Fowler, UML distilled: a brief guide to the standard object modeling language. Addison-Wesley Professional, 2004.
[81] S. Kausar, S. Tariq, S. Riaz, and A. Khanum, “Guidelines for the selection of elicitation techniques,” in Emerging Technologies (ICET), 2010 6th International Conference on. IEEE, 2010, pp. 265–269.
[96] A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in IEEE International Conference on Software Engineering (ICSE), 2011.
[107] J. Andersson and G. Bache, “The video store revisited yet again: Adventures in GUI acceptance testing,” Extreme Programming and Agile Processes in Software Engineering, pp. 1–10, 2004.
[109] A. Memon, M. Pollack, and M. Soffa, “Hierarchical GUI test case generation using automated planning,” Software Engineering, IEEE Transactions on, vol. 27, no. 2, pp. 144–155, 2001.
[139] E. Börjesson and R. Feldt, “Automated system testing using visual GUI testing tools: A comparative study in industry,” in Software Testing, Verification and Validation (ICST), 2012 IEEE Fifth International Conference on. IEEE, 2012, pp. 350–359.
[142] N. Olsson and K. Karl. (2015) GraphWalker: The Open Source Model-Based Testing Tool. [Online]. Available: http://graphwalker.org/index
[144] J. Saldaña, The coding manual for qualitative researchers. Sage, 2012, no. 14.
[148] Y. Jia and M. Harman, “An analysis and survey of the development of mutation testing,” IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, 2011.
[149] E. Alégroth, “Random Visual GUI Testing: Proof of Concept,” in Proceedings of the 25th International Conference on Software Engineering & Knowledge Engineering (SEKE 2013), 2013, pp. 178–184.