Machine Learning Applied to Software Testing: A Systematic Mapping Study

Abstract—Software testing involves probing into the behavior of software systems to uncover faults. Most testing activities are complex and costly, so a practical strategy that has been adopted to circumvent these issues is to automate software testing. There has been a growing interest in applying machine learning (ML) to automate various software engineering activities, including testing-related ones. In this paper, we set out to review the state of the art of how ML has been explored to automate and streamline software testing and provide an overview of the research at the intersection of these two fields by conducting a systematic mapping study. We selected 48 primary studies. These selected studies were then categorized according to study type, testing activity, and ML algorithm employed to automate the testing activity. The results highlight the most widely used ML algorithms and identify several avenues for future research. We found that ML algorithms have been used mainly for test-case generation, refinement, and evaluation. Also, ML has been used to evaluate test oracle construction and to predict the cost of testing-related activities. The results of this paper outline the ML algorithms that are most commonly used to automate software-testing activities, helping researchers to understand the current state of research concerning ML applied to software testing. We also found that there is a need for better empirical studies examining how ML algorithms have been used to automate software-testing activities.

Index Terms—Machine learning (ML), software testing, systematic mapping study.

Manuscript received February 7, 2018; revised September 16, 2018; accepted January 4, 2019. The work of A. T. Endo was supported by the CNPq/Brazil (Grant 420363/2018-1). Associate Editor: I. Gashi. (Corresponding author: Vinicius Humberto Serapilha Durelli.)
V. H. S. Durelli and D. R. C. Dias are with the Department of Computer Science, Federal University of São João Del Rei, São João Del Rei 36307-352, Brazil (e-mail: [email protected]; [email protected]).
R. S. Durelli is with the Department of Computer Science, Federal University of Lavras, Lavras 37200-000, Brazil (e-mail: [email protected]).
M. M. Eler is with the School of Arts, Sciences and Humanities – University of São Paulo, São Paulo 05508-000, Brazil (e-mail: [email protected]).
S. S. Borges and A. T. Endo are with the Department of Computer Science, Federal University of Technology, Curitiba 81280-340, Brazil (e-mail: [email protected]; [email protected]).
M. P. Guimarães is with the Open University of Brazil – Federal University of São Paulo (UNIFESP)/Unifaccamp's Master Program in Computer Science, São Paulo 04021-001, Brazil (e-mail: [email protected]).
Digital Object Identifier 10.1109/TR.2019.2892517

I. INTRODUCTION

Most early software applications belonged to the scientific computing and data processing domains [1]. Over the past few decades, however, there has been a substantial growth in the software industry, which was primarily driven by advances in technology. Consequently, software has become increasingly important in modern society. As software becomes more pervasive in everyday life, software engineers must meet stringent requirements to obtain reliable software. To keep up with all these advances, software engineering has come a long way since its inception. Yet, a number of software projects still fail to meet expectations due to a combination of factors such as, for instance, cost overruns and poor quality. Evidence suggests that one of the factors that contribute the most to budget overruns is fault detection and fault correction: as pointed out by Westland [2], uncorrected faults become increasingly more expensive as software projects evolve. To mitigate such overheads, there has been a growing interest in software testing, which is the primary method to evaluate software under development [3].

Software testing plays a pivotal role in both achieving and evaluating the quality of software. Despite all the advances in software development methodologies and programming languages, software testing remains necessary. Basically, testing is a process whose purpose is to make sure that the software artifacts under test do what they were designed to do and also that they do not do anything unintended, thus raising the quality of these artifacts [4]. Nevertheless, testing is costly, resource consuming, and notoriously complex: studies indicate that testing accounts for more than 50% of the total costs of software development [5]. Moreover, like any human-driven activity, testing is error-prone, and creating reliable software systems is still an open problem. In hopes of coping with this problem, researchers and practitioners have been investigating more effective ways of testing software.

A practical strategy for facing some of the aforementioned issues is to automate software testing. Thus, a lot of effort has been put into automating testing activities. Artificial intelligence (AI) techniques have been successfully used to reduce the effort of carrying out many software engineering activities [6]–[8]. In particular, machine learning (ML) [9] (also known as predictive analytics or statistical learning), which is a research field at the intersection of AI, computer science, and statistics, has been applied to automate various software engineering activities [10]. It turns out that some software-testing issues lend themselves to being formulated as learning problems and tackled by learning algorithms, so there has been a growing interest in capitalizing on ML to automate and streamline software testing. In addition, software systems have become increasingly complex, so some conventional testing techniques may not scale
well to the complexity of these modern software systems. This ever-increasing complexity of modern software systems has rendered ML-based techniques attractive.

The remainder of this paper is organized as follows. Section II provides background on software testing and ML. Section III describes the rationale behind our research. Section IV details the mapping study we carried out. Section V discusses the results and their implications. Section VI discusses future research in the area. Section VII outlines the threats to the validity of this mapping study. Finally, Section VIII presents concluding remarks.

II. BACKGROUND

This section covers background on software testing and ML. The discussion is divided into two parts: the first covers the purpose of software testing, giving special emphasis to elucidating the most fundamental concepts; the second part lays out the essential background on ML.

A. Software Testing

Software testing is a quality assurance activity that consists in evaluating the system under test (SUT) by observing its execution with the aim of revealing failures [4]. A failure is detected when the SUT external behavior is different from what is expected of the SUT according to its requirements or some other description of the expected behavior [3]. Since this activity requires the execution of the SUT, it is often referred to as dynamic analysis. In contrast, there are quality assurance activities that do not require the execution of the SUT [5].

An important element of the testing activity is the test case. Essentially, a test case specifies under which conditions the SUT must be executed in hopes of finding a failure. When a test case reveals a failure, it is considered successful (or effective). A test case embodies the input values needed to execute the SUT [3]. Therefore, test-case inputs vary in nature, ranging from user inputs to method calls with the test-case values as parameters. To evaluate the results of test cases, testers must know what output the SUT would produce for those test cases. The element that verifies the correctness of the outputs produced by the SUT is referred to as the oracle. Usually, testers play the role of oracle. However, it is worth emphasizing that an oracle can be a specification or even another program.

As stated by Ammann and Offutt [3], regardless of how thoroughly planned and carried out, the main limitation of testing activities is that they are able to show only the existence of failures, not the lack thereof. Assuring that an SUT will not fail in the future requires exhaustive testing, which means the SUT has to be run against all possible inputs in all possible scenarios. Performing exhaustive testing, however, is usually impossible or impractical due to the large size of the input domain and the large number of combinations of scenarios in which an SUT can be executed [11]. As a result, testers have to come up with some standard of test adequacy that allows them to decide when the SUT has been tested thoroughly enough. This has prompted the development of test adequacy criteria. Testing criteria are discussed in the following section.

1) Testing Techniques and Criteria: As alternatives to exhaustive testing, testing techniques have been proposed to help developers and testers create a reduced and yet effective test suite [12]. Each testing technique has specific criteria to cover a particular aspect of the program, and each criterion defines different test requirements that should be met by a test suite. Test requirements can be generated from different parts of the software, e.g., specification and implementation. In this context, the SUT can be instrumented so that it reports on the execution of a test suite to measure how well the test suite satisfies the test requirements [5].

Functional and structural testing are two of the most commonly used testing techniques. The functional testing technique is also known as black-box testing because it only uses the SUT specification to generate test cases. In this technique, the internal structure of the SUT is not taken into account. The two most popular functional criteria are equivalence partitioning and boundary-value analysis. Structural testing (also known as white-box testing), on the other hand, creates test cases based on the SUT implementation. Its purpose is to make sure that all structures (e.g., paths, instructions, and branches) of the SUT are exercised during execution of the test suite. Basically, structural testing criteria are usually classified as control flow and data flow. Control-flow criteria specify test requirements based on the execution flow of the SUT [13]. Two widely used goals related to control-flow criteria are executing all instructions or exercising all branches at least once. Data-flow criteria are based on the assumption that testers should focus on the flows of data values, i.e., variable uses and definitions [14]. One common goal of this type of criteria is to execute every definition of a data value (i.e., variable) and its associated uses at least once. This criterion is known as the all-uses criterion and takes into account only the def-use pairs that have some path from the definition to the use in which the considered variable is not redefined. This special path is called a def-clear path.
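To make the control-flow and data-flow vocabulary above concrete, the following minimal Python sketch (our own illustration, not taken from the paper) shows a one-branch function together with a two-input test suite that satisfies branch coverage and the all-uses criterion for the variable `discount`.

```python
# Illustrative sketch (not from the paper): a tiny SUT with one branch,
# used to show what control-flow (branch) and data-flow (def-use)
# test requirements look like in practice.

def price_with_discount(total):
    discount = 0.0            # definition of `discount`
    if total > 100:           # branch b1 (true/false outcomes)
        discount = 0.1        # redefinition of `discount`
    return total * (1 - discount)   # use of `discount`

# Branch coverage: the suite below exercises both outcomes of b1.
# All-uses: it also covers a def-clear path from each definition of
# `discount` to its use in the return statement.
test_suite = [150, 50]

if __name__ == "__main__":
    for total in test_suite:
        print(total, "->", price_with_discount(total))
```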
Mutation testing is a less widely used technique that has mostly been applied in academic settings. This technique is centered around the idea of changing the SUT in such a way that the changes made to the SUT mimic mistakes that a competent programmer would make. The elements that describe how the SUT should be changed are referred to as mutation operators, and the resulting different versions of the SUT are called mutants. Then, after mutant generation, testers have to come up with test cases for uncovering the seeded faults. When a test case causes a mutant to behave differently from the original SUT, the test case is said to kill the mutant. Mutation testing assumes that a well-designed test suite can kill all mutants. Therefore, testers have to improve the test suite until it is able to kill all mutants that are not equivalent to the original SUT.

The goal when applying mutation testing is to obtain a mutation score of 100%: the mutation score is the percentage of nonequivalent mutants that have been killed [3]. A score of 100% indicates that the test suite is able to detect all the faults represented by the mutants. Usually, achieving a mutation score of 100% is impractical, so a threshold value representing the minimum acceptable mutation score can be established.
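As a concrete, hand-made illustration of mutants, killing, and the mutation score (a minimal sketch of our own, not code from the paper), the snippet below applies one relational-operator mutation to a small function and checks which test inputs kill the resulting mutant.

```python
# Illustrative sketch (not from the paper): one original function, one mutant
# produced by a relational-operator mutation (>= instead of >), and a
# mutation-score computation over a tiny test suite.

def original(x):
    return "high" if x > 10 else "low"

def mutant_1(x):              # mutation operator changed `>` to `>=`
    return "high" if x >= 10 else "low"

mutants = [mutant_1]
test_suite = [5, 10, 20]      # x = 10 is the input that kills mutant_1

def killed(mutant, tests):
    """A mutant is killed if any test makes it disagree with the original."""
    return any(original(t) != mutant(t) for t in tests)

kill_count = sum(killed(m, test_suite) for m in mutants)
mutation_score = 100.0 * kill_count / len(mutants)   # % of (nonequivalent) mutants killed
print(f"mutation score: {mutation_score:.0f}%")
```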
2) Testing Phases: Software-testing activities can be carried out during the whole software life cycle to assure that failures are discovered as early as possible in the software development process. Testing phases, also known as test levels, are commonly used throughout software development projects: acceptance testing, system testing, integration testing, module testing, and unit testing [3]. Acceptance testing activities evaluate the SUT with respect to requirements and business processes. System testing assesses the architectural design. It is worth mentioning that, in many companies, there is no difference between system and acceptance testing.

Integration testing is carried out in the hopes of finding failures that arise from module integration. During integration testing, test cases are designed to assess whether the interfaces between modules communicate properly. Thus, integration testing assumes that modules work as expected. Module testing is carried out to evaluate modules in isolation; test cases are designed to assess how the units within the module under test interact with each other as well as their associated data structures. Unit testing has to do with exercising the smallest unit of the SUT in isolation. In object-oriented programs, for instance, the smallest unit is usually either a class or a method.

Regression testing is performed throughout the life cycle of a system; thus, rather than being considered a phase or level, it can be considered a subphase of the aforementioned testing phases. Regression testing has to do with rerunning the existing test cases whenever an element of the system is changed to ensure that the elements that were previously developed and tested still perform correctly. More specifically, regression testing is performed with the intention of checking that recent changes have not introduced unintended consequences elsewhere in the system [3].

Since executing all test cases whenever the system is changed is costly and time consuming, many research efforts have been investigating ways of selecting only the most effective subset of the test suite, i.e., the subset that is more likely to reveal failures. In this context, two techniques are usually applied to select test cases: test prioritization and test minimization. Test-case prioritization sorts the test suite in a way that the test cases with higher priority are executed before the test cases that have a lower priority. Priorities are assigned according to different criteria, including the probability of revealing failures or the business value of the features exercised. Basically, test-case minimization removes redundant test cases from the test suite, so that regression testing activities become less time consuming and costly.
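A minimal sketch of the ideas behind prioritization and minimization (our own illustration, with made-up coverage data, not an approach from the paper): the classic additional-greedy heuristic orders test cases by how much not-yet-covered code each adds, and the same loop doubles as a simple minimization by dropping tests that contribute nothing new.

```python
# Illustrative sketch (not from the paper): greedy test-case prioritization
# and minimization based on hypothetical statement-coverage data.

coverage = {                      # test id -> set of covered statements
    "t1": {1, 2, 3, 4},
    "t2": {3, 4},
    "t3": {5, 6},
    "t4": {2, 3, 5},
}

def additional_greedy(coverage):
    remaining, covered, order = dict(coverage), set(), []
    while remaining:
        # pick the test that adds the most uncovered statements
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        gain = remaining[best] - covered
        if not gain:              # nothing new left: the rest are redundant
            break
        order.append(best)
        covered |= gain
        del remaining[best]
    return order                  # prioritized (and minimized) suite

print(additional_greedy(coverage))   # ['t1', 't3'] already covers everything
```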
3) Test Automation: Executing test cases manually is costly, time consuming, and error prone. Therefore, many testing frameworks and tools have been developed over the years with the intent of supporting the automated execution of test cases at different levels. Testing frameworks that support unit testing have been widely used, especially because of the popularization of agile methodologies and test-focused strategies. More recently, many record-and-play or even script-based frameworks and tools to perform graphical user interface (GUI) testing have become more popular among developers.

Even though automating the execution of test cases represented a significant improvement in the field, software-testing activities tend to become more difficult and costly as systems become increasingly more complex. The classic answer of software engineers to reduce cost and complexity is automation. Hence, in the past few years, many efforts have been carried out to come up with automated approaches for generating test inputs and stimuli to meet different test goals (e.g., branch coverage). Three different techniques to generate test cases automatically stand out in this scenario: symbolic execution, search-based, and random approaches [15].

B. Machine Learning

Essentially, problem solving using computers revolves around coming up with algorithms, which are sequences of instructions that, when carried out, turn the input (or set of inputs) into an output (or set of outputs). For instance, a number of algorithms for sorting have been proposed over the years. As input, these algorithms take a set of elements (e.g., numbers) and the output is an ordered list (e.g., a list of numbers in ascending or descending order).

Many problems, however, do not lend themselves well to being solved by traditional algorithms. An example of a problem that is hard to solve through traditional algorithms is predicting whether a test case is effective. Depending on the SUT, we know what the input is like: for instance, for a program that implements a sorting algorithm, it is a list of elements (e.g., numbers). We also know what the output should be: an ordered list of elements. Nevertheless, we do not know what list of elements is most likely to uncover faults: that is, what inputs will exercise different parts of the program's code.

There are many problems for which there is no algorithm. In effect, trying to solve these problems through traditional algorithms has led to limited success. However, in recent years, a vast amount of data concerning such problems has become available. This rise in data availability has prompted researchers and practitioners to look at solutions that involve learning from data: ML algorithms.

Apart from the explosion of data being captured and stored, the recent widespread adoption of ML algorithms has been largely fueled by two contributing factors: first, the exponential growth of compute power, which has made it possible for computers to tackle ever-more-complex problems using ML, and second, the increasing availability of powerful ML tools [16], [17]. Due to these advances, researchers and practitioners have applied ML algorithms to an ever-expanding range of domains. Some of the domains in which ML algorithms have been used to solve problems are weather prediction, Web search engines, natural language processing, speech recognition, computer vision, and robotics [18]–[20]. It is worth noting, however, that ML is not new. As pointed out by Louridas and Ebert [21], ML has been around since the 1970s, when the first ML algorithms emerged.

Let us go back to the problem of predicting the effectiveness of test cases. When facing problems of this nature, data come into play when we need to know what an effective test case looks like. Although we might not know how to come up with an effective test case, we make an assumption that some effective test cases will be present in the collected data (e.g., a set of inputs for a program whose run-time behavior was also recorded). If an ML algorithm is able to learn from the available test-case data, and assuming that the program under test did not
deviate much from the version used during data collection, it is possible to make predictions based on the results of the algorithm. Although the ML algorithm may not be able to identify the whole test-case evaluation process, it can still detect some hidden structures and patterns in the data. In this context, the result of the algorithm is an approximation (i.e., a model). In a broad sense, ML algorithms process the available data to build models. The resulting models embody patterns that allow us to make inferences and better characterize problems such as predicting the effectiveness of test cases.
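The following sketch shows what this could look like in code (our own illustration; the feature set and data are made up and not taken from any primary study): a decision tree is trained on features of previously executed test cases, labeled by whether they revealed a fault, and the resulting model is then used to predict whether new test cases are likely to be effective.

```python
# Illustrative sketch (not from the paper): learning to predict test-case
# effectiveness from hypothetical features of past executions.
from sklearn.tree import DecisionTreeClassifier

# Features per test case: [input size, branches covered, exercises new code (0/1)]
X_train = [[10, 4, 0], [200, 15, 1], [5, 2, 0], [120, 12, 1], [80, 9, 1]]
y_train = [0, 1, 0, 1, 0]          # 1 = revealed a fault in past runs

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The fitted model approximates the (unknown) relation between test-case
# characteristics and fault-revealing ability.
X_new = [[150, 14, 1], [8, 3, 0]]
print(model.predict(X_new))        # e.g., [1 0]: run the first test case earlier
```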
At its core, ML is simply a set of algorithms for designing models and understanding data [19], [20]. Therefore, as stated by Mohri et al. [18], ML algorithms are data-driven methods that combine computer science concepts with ideas from statistics, probability, and optimization. As emphasized by Shalev-Shwartz and Ben-David [22], the main difference in comparison with traditional statistics and other fields is that in computer science, ML is centered around learning by computers, so algorithmic considerations are key.

A number of ML algorithms have been devised over the years. Essentially, these ML algorithms differ in terms of the models they use or yield. These algorithms can be broadly classified as supervised or unsupervised (a more in-depth explanation of these two categories of learning types is presented in Section V-E).

Software has been playing an increasingly important role in modern society. Therefore, ensuring software quality is vital. Although many factors impact the development of reliable software, testing is the primary approach for assessing and improving software quality [3]. However, despite decades of research, testing remains challenging. Recently, a strategy that has been adopted to circumvent some of the open issues is applying ML algorithms to automate software testing. We set out to provide an overview of the literature on how researchers have harnessed ML algorithms to automate software testing. We detail the rationale behind our research in the following section.

III. PROBLEM STATEMENT AND JUSTIFICATION

Although applying ML to tackle software-testing problems is a relatively new and emerging research trend, a number of studies have been published in the past two decades [23]–[28], [82], [83], [85], [86], [89], [91]. Different ML algorithms have been adapted and used to automate software testing; however, it is not clear how research in this area has evolved in terms of what has already been investigated. Despite the inherent value of examining the nature and scope of the literature in the area, few studies have attempted to provide a general overview of how ML algorithms have contributed to efforts to automate software-testing activities. Noorian et al. [29], for instance, proposed a framework that can be used to classify research at the intersection of ML and software testing. Nevertheless, their classification framework is not based on a systematic review of the literature, which to some extent undermines the scope and validity of such a framework.

Drawing from his personal experience, Briand [26] gives an account of the state of the art in ML applied to software testing by describing a number of applications the author was involved with over the years as well as a brief overview of other related research. Furthermore, owing to his assumption that ML has the potential to help testers cope with some long-standing software-testing problems, Briand argues that more research should be performed toward synthesizing the knowledge at the intersection of these research areas. Although evidence suggests that software testing is the subject for which a substantial number of systematic literature reviews have been carried out [30], to the best of our knowledge, there are no up-to-date, comprehensive systematic reviews or systematic mappings providing an overview of published research that combines these two particular research areas.

In order to fill in such a gap, we carried out a systematic mapping study covering the existing research at the intersection of software testing with ML. According to Kitchenham et al. [31], systematic mapping is a research methodology whose goal is to survey the literature to synthesize a comprehensive overview of a given topic, identifying research gaps, and providing insight into future research directions. Using this methodology, we set out to survey the target literature to gain an overview of the state of the art in ML applied to software testing. The overarching motivation is to provide researchers and practitioners with a better understanding of which ML algorithms have already been tuned and applied to cope with software-testing problems. Moreover, we investigated which research techniques are the most used in this field as well as the most prolific researchers. Given that our focus is on answering broad questions instead of analyzing particular facets of this research area, we decided to conduct a systematic mapping rather than a form of secondary study that requires a more in-depth analysis (i.e., a systematic literature review).

This paper provides up-to-date information on the research at the intersection of ML and software testing: outlining the most investigated topics, the strength of evidence for, and the benefits and limitations of, ML algorithms. We believe that the results of this systematic mapping will enable researchers to devise more effective ML-based testing approaches, since these research efforts can capitalize on the best available knowledge. In addition, given that ML is not a panacea for all software-testing issues, we conjecture that this paper is an important step to make headway in applying ML to software testing. Essentially, the results of this paper have the potential to enable practitioners and researchers to make informed decisions about which ML algorithms are best suited to their context: as stated by Kitchenham et al. [32], secondary studies such as ours can be used as a starting point for further research. Another contribution of our paper is the identification of research gaps, paving the way for future research in this area.

IV. MAPPING STUDY PROCESS

This section describes the process we followed throughout the conduction of this systematic mapping study, which was based on the guidelines for conducting secondary studies proposed by Petersen et al. [30] and Kitchenham et al. [31]. We designed this mapping study to be as inclusive as possible, so we did
not use any sort of quality assessment to filter primary studies. The following sections describe how we followed the guidelines to answer the research questions (RQs) posed by this mapping study.

A. Research Questions

We set out to devise RQs that emphasize the classification of the literature in a way that is interesting to researchers and practitioners and also gives them insights into how ML has been used to automate software testing. The scope and goal of our paper can be formulated using the Goal-Question-Metric approach [33] as follows.

Analyze the state of the art in ML applied to software testing
for the purpose of exploration and analysis
with respect to the intensity of the research in the area, trends, advantages and drawbacks of using ML to automate software testing, hindrances to using ML to automate software testing, the extent to which the application of ML to automate software testing has been empirically evaluated, and the most active researchers in the area
from the point of view of researchers and practitioners
in the context of software testing.

As pointed out by Kitchenham et al. [31], RQs must embody the goal of secondary studies. Accordingly, the goal of our paper can be broken down into the following eight main RQs and a subquestion.
1) RQ1: What is the intensity of the research on ML applied to software testing?
2) RQ2: What types of ML algorithms have been used to cope with software-testing issues?
3) RQ3: Which software-testing activities are automated by ML algorithms?
4) RQ4: What trends can be observed among research studies discussing the application of ML to support software-testing activities?
5) RQ5: What are the drawbacks and advantages of the algorithms when applied to software testing?
6) RQ6: What problems have been observed by researchers when applying ML algorithms to support software-testing activities?
7) RQ7: To what extent have these ML-based approaches been evaluated empirically?
   a) RQ7.1: Which empirical research methods do researchers use to evaluate ML algorithms when applied to software testing?
8) RQ8: Which individuals are most active in this research area?

B. Search Strategy

Over the past few years, ML algorithms have been increasingly used to solve practical problems outside the realm of AI. As mentioned, some factors that have influenced this burgeoning interest in ML are as follows.
1) The plummeting cost of computational power.
2) The development of robust and efficient algorithms that can deal with more diverse sources and types of data.
3) A wide variety of tools that can be used to support and speed up the development of ML-based applications.

Based on this, at first, we chose to consider only primary studies that were published over the last few decades: from 1980 to August 2017. Afterward, given that these results did not account for a substantial portion of either 2017 or 2018, we decided to update our systematic mapping study in order to expand the collected evidence and provide up-to-date information on the research at the intersection of ML and software testing. Essentially, updating our systematic mapping study involved rerunning the original searches (using the same inclusion and exclusion criteria). We filtered the updated searches by publication year: we looked for primary studies that were published from June 2017 to August 2018. The purpose of the small overlap with the first search is to allow for time lags in the indexing of studies.

We used automated searching as the main search strategy. In hopes of finding as many relevant primary studies as possible and properly answering our RQs, we examined four digital libraries that together cover most of the literature on software engineering, as well as a general indexing system. More precisely, we searched the IEEE Digital Library and the ACM Digital Library because these digital libraries include prime international journals and a wealth of important computing-related conferences and workshops. In addition, we searched SpringerLink and ScienceDirect because these two digital libraries also index a number of recognized international journals on related topics. To reduce the need to search many publisher-specific sources, we decided to take Web of Science into account as well. Web of Science is a general indexing service that indexes papers published by many digital libraries, such as ACM, Elsevier, IEEE, Springer, and Wiley. To broaden the scope of our paper, during the rerun of the searches, we also searched the Society for Industrial and Applied Mathematics (SIAM) and the Proceedings of the VLDB (Very Large Data Bases) Endowment (https://www.vldb.org/pvldb/). Specifically, we searched the SIAM website looking for studies that were published in the proceedings of the SIAM International Conference on Data Mining.

When conducting automated searches in digital libraries, search keywords are vital to obtain good results, and so they have to be chosen carefully. However, given that terminology is not well established in software engineering (and most of its subareas) [31], and due to the interdisciplinarity of the subject area, we conjectured that it would be difficult to identify a reliable set of keywords to use in our search string. Thus,
we derived the keywords for our search string from the RQs and from the keywords used in the set of known papers. This set of known papers was selected through the construction of a quasi-gold standard, as proposed by Zhang et al. [34]. The quasi-gold standard is created by manually searching a set of journals and conference proceedings for a given period: the quasi-gold standard results in a set of studies that are venue- and period-specific [34]. This set of papers is then used to evaluate the completeness of subsequent automated searches. The quasi-gold standard used in this mapping study is presented in Section IV-B1.

TABLE I: PUBLICATION VENUES INVESTIGATED DURING THE CREATION OF THE QUASI-GOLD STANDARD
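As a rough illustration of how a quasi-gold standard can be used to evaluate search completeness (a minimal sketch under our own assumptions, not code from the paper or from Zhang et al.), the fraction of quasi-gold-standard studies retrieved by an automated search gives a simple sensitivity-style completeness estimate.

```python
# Illustrative sketch (not from the paper): estimating the completeness of an
# automated search against a manually built quasi-gold standard (QGS).

def search_completeness(qgs_ids, retrieved_ids):
    """Fraction of QGS studies found by the automated search."""
    qgs, retrieved = set(qgs_ids), set(retrieved_ids)
    return len(qgs & retrieved) / len(qgs)

# Hypothetical study identifiers.
quasi_gold_standard = {"S01", "S02", "S03", "S04", "S05"}
automated_search_hits = {"S01", "S02", "S04", "S09", "S17"}

print(f"completeness: {search_completeness(quasi_gold_standard, automated_search_hits):.0%}")
# 60% would suggest the search string still needs refinement.
```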
We experimented with different combinations of keywords
by linking them using Boolean operators (i.e., AND and OR).
Basically, the search string used in our paper is twofold: the
first part contains all keywords related to software testing and
the second part is comprised of ML-related keywords. The two
parts are linked using the Boolean operator AND. The following
combination of keywords was considered the most appropriate
for our paper.
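Purely as an illustration of the two-part structure described above, a Boolean query of this general shape could combine testing-related and ML-related keywords as follows; the keywords below are hypothetical and are not the authors' actual search string.

```python
# Hypothetical illustration (not the authors' actual search string): a
# two-part Boolean query linking testing-related and ML-related keywords.
testing_terms = ["software testing", "test case", "test suite", "test oracle"]
ml_terms = ["machine learning", "supervised learning", "neural network"]

testing_part = " OR ".join(f'"{t}"' for t in testing_terms)
ml_part = " OR ".join(f'"{t}"' for t in ml_terms)
query = f"({testing_part}) AND ({ml_part})"
print(query)
# ("software testing" OR "test case" OR ...) AND ("machine learning" OR ...)
```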
a set of relevant studies, which is comprised of a subset or all of the studies selected over the course of the study selection phase. In each of the subsequent iterations, the papers referenced in the last analyzed studies are checked. The process ends when no new studies are selected. Throughout the backward snowballing process, we followed the guidelines provided by Wohlin [35].

C. Primary Study Selection Process

This section defines the inclusion and exclusion criteria that were used throughout the conduction of this secondary study. The following criteria were used as inclusion criteria.
1) I1: Our initial selection relies on the filtering provided by the peer-review process, so all selected studies must have undergone peer review. Only studies published in scholarly venues such as journals, conference proceedings, and workshop proceedings were taken into account.
2) I2: Studies that report on ML algorithms applied to software testing.
Studies that fall into at least one of the following categories were not eligible to be selected.
1) E1: Studies on approaches to testing ML algorithms, fault prediction techniques, debugging approaches, any sort of hardware testing approach, and approaches based on evolutionary computation (e.g., genetic algorithms and evolutionary programming).
2) E2: The study describes the application of an ML algorithm, but the algorithm is not applied to automate a testing-related activity or problem.
3) E3: Gray literature (e.g., technical reports, working papers, and presentations) or studies published in the form of abstract or panel discussion.
4) E4: Often, research efforts are published at various stages of their evolution. In the context of this mapping study, duplicate versions of studies should be excluded. Only the most comprehensive or recent version of each study should be included.
5) E5: Peer-reviewed studies that are not published in journals, conference proceedings, or workshop proceedings (e.g., Ph.D. theses and patents).
6) E6: Studies that are not written in English.
These inclusion and exclusion criteria were applied as described in Section IV-C1. During the application of these criteria, we went over several parts of the returned papers, such as the title, abstract, and keywords. Additionally, we carried out a pilot study to resolve disagreements and misunderstandings concerning these criteria.

1) Selection Process: The inclusion and exclusion criteria were applied in three stages. First, papers were filtered based on title, keywords, and venue. This first step is aimed at excluding papers that are clearly irrelevant. Thus, criteria I1, E3, E5, and E6 were applied first. We realized that often a more in-depth analysis is needed to determine whether the ML-based approach described in a paper is applied to software testing; hence, criteria I2 and E2 were not applied during the first round. Similarly, E1 was not used in the first round because applying this criterion requires a more thorough examination of the papers: usually, abstract and keywords are not enough to pin down the content of a paper. During the second round, two reviewers read the abstracts of the papers selected in the first round. Throughout this round, the reviewers applied criteria I2, E1, and E2. The resulting set of candidate papers was examined by two reviewers, and disagreements concerning whether any borderline paper is eligible or not were resolved by discussion and, when needed, settled by a third reviewer. In the final round, the two reviewers independently filtered the candidate papers by reading them in their entirety. Criteria I2, E1, E2, and E4 were applied to select the final set of primary studies. Disagreements on selection results were discussed and addressed by two reviewers. When needed, a third reviewer was consulted.

D. Data Extraction

To answer the RQs described in Section IV-A, we extracted from each primary study the information shown in the data extraction form presented in Appendix B. The data extraction form includes fields designed to gather general publication information, such as title and year of publication, as well as fields that were framed to reflect the RQs.

It is worth noting that, before carrying out our systematic study, we discussed the definitions of these fields, which we refer to as data items (DIs), to clarify their meanings to all data extractors. Furthermore, to make sure that all data extractors had a clear understanding of the DIs, we pilot tested the data extraction form using the quasi-gold standard. During the pilot, we aimed at resolving disagreements and misconceptions about the DIs.

During the conduction of the original systematic mapping, two data extractors performed the data extraction on the resulting set of selected studies independently. Having extracted the information from all selected studies, the two data extractors checked all data to make sure that the extracted information is valid and clear for further analysis. The extracted data were kept in a spreadsheet. As mentioned, with the purpose of incorporating new evidence published since the original searches were completed, we repeated the extraction method for the papers returned from the reruns of the searches. During the update, three data extractors performed data extraction on the set of selected studies, updating the original spreadsheet accordingly.

E. Data Synthesis

The purpose of data synthesis is to summarize the extracted data in meaningful ways in hopes of answering the RQs defined in Section IV-A. More specifically, descriptive statistics and frequency analysis are used to answer the RQs. We devised classification schemes by means of keywording relevant topics addressed by some of the RQs. The resulting classifications were devised and refined as the mapping process advanced. Several facets were defined for classification purposes. For instance, to answer RQ7, we classified the primary studies according to the nature of the research reported in them.
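To make this kind of synthesis concrete, a frequency analysis over an extracted classification facet can be as simple as counting occurrences per category; the following is a minimal sketch with hypothetical data, not the authors' analysis scripts.

```python
# Illustrative sketch (not from the paper): frequency analysis of a
# classification facet extracted from hypothetical primary studies.
from collections import Counter

testing_activity = [           # one label per primary study (hypothetical)
    "test-case generation", "test oracle construction", "test-case generation",
    "test-case prioritization", "test-case generation", "test oracle construction",
]

for activity, count in Counter(testing_activity).most_common():
    print(f"{activity}: {count}")
```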
V. STUDY RESULTS
We carried out this mapping study according to the procedure
described in Section IV. During the first literature search, 38
papers met the inclusion criteria. Upon updating the searches
(i.e., rerunning them as per the original systematic mapping
protocol), ten new papers met the inclusion criteria. So, in total,
we selected 48 primary studies. A brief summary of each study
is provided in Appendix C.
with ten primary studies. Four of the ten studies on this topic were published from 2017 to August 2018.

Bergadano and Gunetti [71] (PS28) devised a test-case generation approach that is based on the inductive learning of programs from finite sets of input and output examples. Given a program P and a set of alternative programs P', the proposed approach yields test cases that are adequate in the sense that they are able to distinguish P from all programs in P'. As Bergadano and Gunetti emphasize, although the approach is similar to fault-based approaches, the programs in P' are not restricted to being simple mutations of P.
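In the spirit of this adequacy notion (a minimal sketch of our own, not Bergadano and Gunetti's algorithm), a test case that distinguishes P from an alternative program P' is simply an input on which the two programs disagree; a naive way to find one is to search the input space for such a disagreement.

```python
# Illustrative sketch (not from the paper): searching for a test input that
# distinguishes a program P from an alternative program P'.
import random

def p(x):            # program under test
    return abs(x)

def p_alt(x):        # alternative (faulty) program: wrong behavior for negatives
    return x

def distinguishing_input(prog, alt, trials=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-100, 100)
        if prog(x) != alt(x):
            return x             # adequate test case w.r.t. this alternative
    return None

print(distinguishing_input(p, p_alt))   # any negative value distinguishes them
```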
In PS21 [64], Choi et al. tackled the problem of automatically generating a test suite for Android applications for which there is no existing model of the GUI. The proposed approach uses ML to learn a model of the application during testing. The learned model is then used to generate inputs that visit states of the application that have not been explored. When the application is executed using the generated inputs, the execution is observed in order to refine the model. An important feature of the approach is that it avoids restarting the application under test, which in many cases is computationally costly. Choi et al. [64] carried out an experiment comparing their approach with random testing and L*-based testing. The results of this experiment seem to indicate that their approach can achieve better coverage.

In PS26 [69], Sant et al. report on a test-case design approach for web applications. More specifically, in PS26, Sant et al. [69] apply an ML approach to turn user session data into models of web applications. The resulting model is then randomly traversed to generate test data. In PS48 [91], a test-case design approach for mobile applications is presented. In this recent study, Rosenfeld et al. [91] describe an approach that leverages ML algorithms to analyze GUI elements of Android applications. After analyzing these elements, the proposed approach generates functional test cases.

Some primary studies in this category evaluate the proposed approaches using only one medium-sized program or several toy programs (e.g., PS19 and PS28). Due to the simplicity of such programs, it is unlikely that they expose the limitations of these test-data generation approaches. Hence, evaluating test-generation approaches using toy programs provides limited utility. Consequently, the evidence presented in these primary studies is insufficient to draw conclusions on the effectiveness of these ML-based test-data generation approaches. Several primary studies, however, provide a stronger case for applying ML algorithms to automate test-data generation (e.g., PS15 [58] and PS21).

2) Oracle Problem: As mentioned, software testing involves exploring the behavior of the program under test so as to uncover faults. In this context, when the program is run with a certain input, it is vital to tell apart the correct from the potentially incorrect behaviors. This conundrum is referred to as the test oracle problem [37]. Without a test oracle, testers have to use domain-specific information to ascertain whether the observed behavior is correct, which is in many cases impractical due to the complexity and size of present-day software systems. To make matters worse, sometimes software systems lack the documentation needed to determine the correctness of the observed behavior. To overcome these problems, researchers have sought techniques for oracle automation. However, it is worth mentioning that, considering the software-testing literature as a whole, test oracle automation has received significantly less attention compared with many other aspects of test automation (e.g., automated test input generation) [37]. By contrast, a significant amount of the research at the intersection of software testing with ML has been concerned with automating test oracles. In fact, 10 of the 48 selected studies report on approaches that leverage ML algorithms to construct test oracles. We believe that this is the case because ML presents novel tools to predict outcomes and, in the case of software testing, this constitutes a powerful tool for implementing test oracles.

Wang et al. [48] (PS5) examined how ML algorithms can be used to automatically generate test oracles for reactive programs without relying on explicit specifications. Essentially, their approach turns test traces into feature vectors, which are used to train the ML algorithm. The model yielded by the algorithm then acts as a test oracle.
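The general recipe can be sketched as follows (our own illustration under simplified assumptions, not the approach of PS5): a model is trained on input/output behavior observed on a trusted version, and executions whose actual output deviates too much from the model's prediction are flagged as failures.

```python
# Illustrative sketch (not from the paper): a learned regression model used as
# an approximate test oracle for a numeric program.
from sklearn.neural_network import MLPRegressor

# Training data: inputs and outputs observed on a trusted reference version.
X_train = [[x] for x in range(-20, 21)]
y_train = [2 * x + 1 for x in range(-20, 21)]

oracle = MLPRegressor(hidden_layer_sizes=(16,), solver="lbfgs",
                      max_iter=5000, random_state=0).fit(X_train, y_train)

def verdict(test_input, actual_output, tolerance=5.0):
    """Flag a failure when the SUT's output deviates from the model's prediction."""
    expected = oracle.predict([[test_input]])[0]
    return "pass" if abs(expected - actual_output) <= tolerance else "fail"

print(verdict(7, 15))     # output close to the learned behavior for input 7
print(verdict(7, 120))    # output far from the learned behavior
```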
The oracle problem appears in different contexts. Chan et al. [49] in PS6 tackled this problem in the context of mesh simplification programs. Mesh simplification programs yield three-dimensional (3-D) polygonal models that are similar to the original, albeit simpler in the sense that they have fewer polygons. That is, these programs produce different graphics despite operating on the same input (i.e., the original polygonal model). As noted by Chan et al. [49], this results in a test oracle problem. The authors developed an approach that trains a classifier using a reference model of the SUT. This supervised ML approach groups test cases into two categories: passed and failure-causing. To improve the accuracy of its predictions, the approach also pipes test cases classified as passed by the ML algorithm to an analytical metamorphic testing (MT) module. Their results show that this can significantly improve the effectiveness of the proposed approach.

Jin et al. [72] (PS29) investigated how ANNs can be used to ease the test oracle problem. Similarly, in PS30 [73], Vineeta et al. outline two ML approaches toward implementing test oracles. Specifically, the first approach builds on ANNs and the second one builds on decision trees to predict the expected outputs of the SUT. The applicability of these approaches was examined through an example using a toy program. In PS20 [63], Vanmali et al. also looked into how ANNs can be used to create a test oracle for a credit approval application.

As mentioned, PS33 [76] is the only primary study concerned exclusively with evaluating the effectiveness of two different approaches that have been used to implement test oracles, i.e., IFNs and ANNs. According to the results of this study, IFNs significantly outperform ANNs in terms of computation time while achieving almost the same fault-detection effectiveness. This comparative study also provides another insight into the characteristics of these two approaches: the experiment results indicate that the performance of the oracles is highly dependent on the amount of available test data.

3) Test-Case Evaluation: When carrying out testing efforts, testers need to be able to assess the quality of a given test suite. However, evaluating the quality of test suites is complex because it is hard to formalize and measure which characteristics of the test cases influence quality. In the absence of precise quality
indicators for test suites, the coverage of a test suite is usually used as a proxy for its fault-detection effectiveness.

An adequate test suite is one that implies that the SUT is free of errors if it runs correctly. However, there is no trustworthy model through which adequacy can be properly measured; hence, adequacy is often quantified using proxy measures of code behavior such as, for instance, branch coverage and mutation coverage. However, as noted by Fraser and Walkinshaw [51], these program-based adequacy metrics can be impractical and may be misleading even when they are satisfied. One alternative approach that has the potential to overcome the shortcomings of program-based adequacy metrics is the idea of behavioral coverage, which is essentially concerned with inferring a model from a system by observing its behavior (i.e., outputs) during the execution of a test suite. If one can show that the model is accurate, it follows that the test suite can be considered adequate. This approach is appealing because it eliminates the need to use proxy source-code approximations. Despite the potential of this approach, its adoption has been hindered by the complexity of inferring models. To deal with this complexity, Fraser and Walkinshaw [51] (PS8) employed ML algorithms to infer models from observed inputs and outputs. More specifically, Fraser and Walkinshaw came up with an ML-based approach to cope with the adequacy problem; the resulting approach evaluates the extent to which a test suite covers the observable program behavior.
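A minimal sketch of the underlying intuition (our own simplification, not PS8's technique): infer a model from the input/output pairs exercised by the test suite and measure how well that model predicts the program's behavior on unseen inputs; a weak test suite yields a model that generalizes poorly, which signals inadequacy.

```python
# Illustrative sketch (not from the paper): behavioral coverage approximated as
# the accuracy of a model inferred from a test suite's input/output observations.
import random
from sklearn.tree import DecisionTreeClassifier

def sut(x):                               # the program whose behavior we observe
    return "reject" if x < 0 or x > 100 else "accept"

def behavioral_score(test_inputs, probes=500, seed=1):
    # Infer a model from the behavior exercised by the test suite...
    model = DecisionTreeClassifier(random_state=0)
    model.fit([[x] for x in test_inputs], [sut(x) for x in test_inputs])
    # ...and check how well it predicts the SUT on fresh random inputs.
    rng = random.Random(seed)
    fresh = [rng.randint(-200, 300) for _ in range(probes)]
    hits = sum(model.predict([[x]])[0] == sut(x) for x in fresh)
    return hits / probes

print(behavioral_score([10, 20, 30]))                      # narrow suite: low score
print(behavioral_score([-50, -1, 0, 50, 100, 101, 200]))   # richer suite: higher score
```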
This category also comprises a research effort whose purpose is to predict the feasibility of test cases: PS2 [45]. In the context of GUIs, test cases take the form of sequences of events that are executed in hopes of detecting faults in the application. However, test cases might be rendered infeasible if one or more events in the sequence are disabled or inaccessible. This type of test case terminates prematurely and ends up wasting resources. To prune away infeasible test cases from test suites, Gove and Faytong [45] propose two approaches that capitalize on two ML algorithms: support vector machines (SVMs) and grammar induction. These two approaches to identifying infeasible test cases differ mainly in terms of their results. SVMs are a highly effective classifier, but the models produced by this algorithm, albeit accurate, are not easily interpretable by humans. In contrast to SVMs, grammar induction yields human-readable results, which allow for interpretation by the tester. Nonetheless, grammar induction is notably computationally expensive. In a more recent study, Felbinger et al. [85] (PS42) outline an approach for test evaluation that is based on inferring a model from the test suite and using the similarity between the inferred model and the SUT as a measure of test-suite adequacy.

4) Test-Case Prioritization and Refinement: Previous research has proposed two main approaches to streamline regression testing: test-case prioritization (often termed test-suite selection) and test-case refinement (also known as test-suite reduction). Since these approaches are closely related in purpose, the primary studies that have employed ML to cope with the issue of speeding up regression testing using either test-case prioritization or refinement are discussed in this section.

a) Test-Case Prioritization: The time taken by regression testing is usually dictated by the size of the test suite. As regression test suites grow, they become computationally demanding to run. A large test suite might take weeks to run [38]. In such cases, testers are pressed to come up with ways to improve the effectiveness of the testing effort. In a limited-resource setting, test-case prioritization can be used to mitigate some of the cost associated with regression testing. Essentially, test-case prioritization involves arranging the execution of test cases in a particular order so as to optimize the rate of fault detection. The basic idea is centered around the hypothesis that most faults can be detected as early as possible by prioritizing the most relevant (i.e., higher priority) test cases.

Lenz et al. [52] (PS9) present an ML-based approach to link test results (i.e., structural coverage information and mutation score) from the application of different testing criteria. The proposed approach then groups the test results into similar functional clusters. Afterward, information related to the existing test cases and the clusters generated in the previous step is used as a training set for an ML algorithm, which yields classifiers according to the tester's goals. As stated by Lenz et al. [52], different classifiers can be obtained and employed for different purposes, including prioritization and refinement of test cases.

In PS25 [68], Tonella et al. reformulated the test-case prioritization problem as an ML problem. Their proposed solution uses case-based reasoning (CBR) to learn an effective way to order the test cases. While sorting through the test cases, the proposed solution takes into account priority information from the user: the solution prompts the tester with pairs of test cases and asks the tester to select the most important ones. Additionally, the tester input is integrated with additional information (e.g., structural coverage information) to generate an ordering of test cases. To evaluate their solution, Tonella et al. carried out an experiment using the program space, which contains 9564 lines of code and 136 functions. The results of this experiment would seem to indicate that prioritization using CBR outperforms coverage-based prioritization approaches. In a more recent study, Spieker et al. [83] (PS40) introduce Retecs, an approach for automatically learning test-case selection and prioritization. The proposed approach employs reinforcement learning to select and prioritize test cases according to their duration, previous last execution, and failure history. According to Spieker et al., in comparison to similar approaches, their approach offers a more lightweight learning method that uses only one source of data, namely test-case failure history.
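As a toy illustration of prioritizing from execution history alone (a heavily simplified, reward-style sketch of our own; Retecs itself learns a prioritization policy with reinforcement learning), each test case keeps a score that is boosted when it fails and decayed when it passes, and the next cycle runs the highest-scoring tests first.

```python
# Illustrative sketch (not from the paper): history-based test prioritization
# in which recent failures increase a test case's priority score.

scores = {"t1": 0.0, "t2": 0.0, "t3": 0.0}

def update(scores, results, reward=1.0, decay=0.5):
    """Reward tests that failed (found a fault); decay the scores of passing tests."""
    for test, failed in results.items():
        scores[test] = scores[test] * decay + (reward if failed else 0.0)

def prioritize(scores):
    return sorted(scores, key=scores.get, reverse=True)

# Two regression-testing cycles with hypothetical verdicts (True = failed).
update(scores, {"t1": False, "t2": True, "t3": False})
update(scores, {"t1": False, "t2": True, "t3": True})
print(prioritize(scores))   # ['t2', 't3', 't1']: recent failers run first
```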
b) Test-Case Refinement: During the life cycle of software systems, the existing test suites need to be refined so as to better reflect changing test requirements. Often, to cope with new or changed requirements, new test cases are added to test suites. As a result, the size of test suites grows, increasing the cost of running them on the SUT (i.e., regression testing). To keep the expense of regression testing in check, sometimes the number of test cases needs to be reduced. Given that changes to test suites must be carried out in a sensible and planned fashion, testers usually employ test-case refinement algorithms to help them select an effective subset of test cases, thereby reducing testing cost. These algorithms compute an optimal subset of
test cases by removing ineffective, redundant, and obsolete test cases from test suites.

Our results indicate that a considerable amount of research has been carried out to provide methodological and tool support to help testers understand the shortcomings and potential redundancies of test suites and thus be able to refine them in a cost-effective fashion. Briand et al. [55] (PS12), for instance, developed an ML-based approach to help testers analyze the strengths, weaknesses, and redundancies of black-box test specifications and test suites and iteratively improve them. This partially automated approach is based on abstracting test-suite information by transforming test cases into specifications at a higher level of abstraction. More specifically, test cases are interpreted as categories and choice combinations, as defined by the black-box test specification technique category-partition. Hence, test suites are transformed into abstract test suites, which are much more amenable to use in an ML algorithm. An ML algorithm is then used to learn this abstract representation of the test suite, taking into account the relationships between input properties and output equivalence classes. As Briand et al. state, this allows the tester to better understand the strengths and the drawbacks of the test suite. In addition, this can be used when a given test suite needs to be improved but there is no test specification nor rationale (e.g., when reusing open source software). Also, it is possible to use this approach to carry out a black-box testing process in which a test specification is created (e.g., using category-partition) and then test cases are generated from this specification.

Chen et al. [66] (PS23) devised an approach aimed at effectively automating regression testing by means of clustering algorithms: distance measures and clustering algorithms are employed to group test cases into clusters. In this context, test cases in the same cluster are considered to have similar behavior and characteristics. The novelty of their approach is that they introduced a semisupervised clustering method (semisupervised K-means, SSKM) to enhance cluster selection. The limited supervision used by their clustering method is in the form of pairwise constraints (i.e., must-link when two test cases must be assigned to the same cluster, or cannot-link when two test cases must belong to different clusters). These pairwise constraints are extracted from previous test-case executions and test selection results. Chen et al. claim that they were the first to apply a semisupervised clustering algorithm to test-case selection. They believe that their study has the potential to foster developments in this area as well as help establish the basis for a greater understanding of how semisupervised clustering algorithms can

test cases, execution complexity, and the experience of the tester in charge of executing the test suite. Examples are presented to show the usefulness of their proposed approach.

Cheatham et al. [57] (PS14) investigated how ML algorithms can be used to determine the factors that are important in predicting testing time. More specifically, an ML algorithm was used to learn the most important attributes that influence testing time from a database containing data on 25 software projects. The resulting classification tree was then used to predict the testing time for new software systems. Silva et al. [61] (PS18) also employed an ML-based approach toward estimating the execution effort of functional test suites. PS18 is the only primary study in this category that, through experimental evidence, provides a stronger case for using ML to predict the effort involved in testing-related activities. In a more recent study, Badri et al. [90] (PS47) set out to employ ML algorithms to predict test code size for object-oriented software in terms of test lines of code (TLOC), which is a key indicator of testing effort. Badri et al. used different ML algorithms to build the prediction models. To predict testing effort in terms of TLOC, Badri et al. used several metrics as input to the ML algorithms. According to their results, their metric-based approach yields accurate predictions of TLOC.
cation nor rationale (e.g., reusing open source software). Also, 6) Mutation Testing Automation: From a research view-
it is possible to use this approach to carry out a black-box test- point, mutation testing is a mature technique [39]. This tech-
ing process in which a test specification is created (e.g., using nique is centered around the idea of devising test data for
category-partition) and then test cases are generated from this uncovering artificially introduced faults. These faults are slight
specification. syntactic changes made to a given program. Each modified ver-
Chen et al. [66] (PS23) devised an approach aimed at ef- sion of the original program is a mutant. Mutation operators
fective automating regression testing by means of clustering dictate how mutants are created: a hallmark of the changes
algorithms: distance measures and clustering algorithms are em- introduced by mutation operators is that they are analogous to
ployed to group test cases into clusters. In this context, test cases mistakes that programmers make. Mutation testing is often used
in the same cluster are considered to have similar behavior and as a “gold standard” to compare testing approaches. Due to its
characteristics. The novelty of their approach is that they in- effectiveness, mutation testing is widely used as an experimen-
troduced a semisupervised clustering method (semisupervised tal research technique. In effect, some of the primary studies
K-means, SSKM) to enhance cluster selection. The limited su- have experimentally used mutation testing to compare the ef-
pervision used by their clustering method is in the form of pair- fectiveness of their proposed approaches (e.g., PS6, PS12, and
wise constraints: (i.e., must-link when two test cases are must be PS20).
assigned to the same cluster or cannot-link when two test cases Despite the effectiveness of this technique, manually carrying
must belong to different clusters). These pairwise constraints out mutation testing entails a lot of human effort. Even when
are extracted from previous test-case executions and test selec- taking into account moderate-sized programs, mutation testing
tion results. Chen et al. claim that they were the first to apply a yields hundreds of mutants. Hence, mutation testing hinges on
semisupervised clustering algorithm to test-case selection. They the existence of tools. In fact, mutation testing is costly and
believe that their study has the potential to foster developments time consuming even when automated. Recently, researchers
in this area as well as help establish the basis for a greater un- have been trying to overcome these hurdles to the widespread
derstanding of how semisupervised clustering algorithms can adoption of this technique by using ML algorithms to expedite
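To make the general idea concrete, the following Python sketch (not drawn from PS23) clusters test cases by their execution profiles and keeps one representative per cluster; it uses plain K-means, so the must-link/cannot-link supervision of SSKM is deliberately omitted, and the feature vectors and test names are hypothetical.

```python
# Illustrative sketch only: cluster-based regression test selection in the
# spirit of PS23, but with plain (unconstrained) K-means; the must-link and
# cannot-link supervision of SSKM is not shown. Data are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row is one test case described by its function call profile
# (e.g., number of calls to each instrumented function).
call_profiles = np.array([
    [12, 0, 3, 1],   # t1
    [11, 0, 4, 1],   # t2 (behaves much like t1)
    [0, 9, 1, 7],    # t3
    [1, 8, 0, 6],    # t4 (behaves much like t3)
    [5, 5, 5, 5],    # t5
])
test_ids = ["t1", "t2", "t3", "t4", "t5"]

X = StandardScaler().fit_transform(call_profiles)
clustering = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Sampling strategy: keep the test case closest to each cluster centroid,
# assuming tests in the same cluster exercise similar behavior.
selected = []
for c in range(clustering.n_clusters):
    members = np.where(clustering.labels_ == c)[0]
    dists = np.linalg.norm(X[members] - clustering.cluster_centers_[c], axis=1)
    selected.append(test_ids[members[np.argmin(dists)]])

print("Reduced regression suite:", sorted(selected))
```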
5) Test Cost Estimation: As mentioned, software testing accounts for a significant proportion of the total cost of software development. Therefore, testers have to come up with ways to effectively test software systems while avoiding setbacks and staying within the allotted time and budget. Some ML-based approaches have been proposed to help testers better estimate the circumstances that can affect the cost of software-testing efforts. Zhu et al. [53] (PS10), for instance, developed an approach to estimate the effort required to execute test suites. Their approach characterizes test suites as a 3-D vector that combines the number of test cases, execution complexity, and the experience of the tester in charge of executing the test suite. Examples are presented to show the usefulness of their proposed approach.
Cheatham et al. [57] (PS14) investigated how ML algorithms can be used to determine the factors that are important in predicting testing time. More specifically, an ML algorithm was used to learn the most important attributes that influence testing time from a database containing data on 25 software projects. The resulting classification tree was then used to predict the testing time for new software systems. Silva et al. [61] (PS18) also employed an ML-based approach toward estimating the execution effort of functional test suites. PS18 is the only primary study in this category that, through experimental evidence, provides a stronger case for using ML to predict the effort involved in testing-related activities. In a more recent study, Badri et al. [90] (PS47) set out to employ ML algorithms to predict test code size for object-oriented software in terms of test lines of code (TLOC), which is a key indicator of testing effort. Badri et al. used different ML algorithms to build the prediction models. To predict testing effort in terms of TLOC, Badri et al. used several metrics as input to the ML algorithms. According to their results, their metric-based approach yields accurate predictions of TLOC.
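As a rough illustration of this family of approaches (not taken from any primary study), the sketch below fits a regression tree that maps suite-level metrics to execution effort; the metric names, training data, and model choice are hypothetical.

```python
# Illustrative sketch only: predicting test execution effort from suite-level
# metrics with a regression tree, in the spirit of PS10/PS14/PS18.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Columns: number of test cases, mean execution complexity (1-5),
# tester experience in years.
X_train = np.array([
    [120, 2, 5],
    [300, 4, 2],
    [80, 1, 7],
    [450, 5, 1],
    [200, 3, 3],
])
# Observed effort for each historical test cycle, in person-hours.
y_train = np.array([16.0, 80.0, 8.0, 140.0, 40.0])

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Estimate the effort of an upcoming cycle: 250 tests, complexity 4,
# executed by a tester with 2 years of experience.
print("Estimated effort (person-hours):", model.predict([[250, 4, 2]])[0])
```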
6) Mutation Testing Automation: From a research viewpoint, mutation testing is a mature technique [39]. This technique is centered around the idea of devising test data for uncovering artificially introduced faults. These faults are slight syntactic changes made to a given program. Each modified version of the original program is a mutant. Mutation operators dictate how mutants are created: a hallmark of the changes introduced by mutation operators is that they are analogous to mistakes that programmers make. Mutation testing is often used as a "gold standard" to compare testing approaches. Due to its effectiveness, mutation testing is widely used as an experimental research technique. In effect, some of the primary studies have experimentally used mutation testing to compare the effectiveness of their proposed approaches (e.g., PS6, PS12, and PS20).
Despite the effectiveness of this technique, manually carrying out mutation testing entails a lot of human effort. Even for moderate-sized programs, mutation testing yields hundreds of mutants. Hence, mutation testing hinges on the existence of tools. In fact, mutation testing is costly and time consuming even when automated. Recently, researchers have been trying to overcome these hurdles to the widespread adoption of this technique by using ML algorithms to expedite some steps of the process, e.g., mutant execution [44], [50].
Strug and Strug [44] (PS1) put forward an approach to reduce the computational cost associated with mutant execution. In their approach, a randomly selected number of mutants is run and the performance of the mutants that were not selected is assessed on the basis of their similarity to the executed mutants. To measure the similarity among mutants, they are turned into a graph representation, which is then analyzed by an ML algorithm. This approach to classifying mutants thus reduces the number of mutants that need to be executed by evaluating the quality of the test suite without running it against all generated mutants. Also with the purpose of reducing the cost of executing mutants, Jalbert and Bradbury [50] (PS7) devised an ML-based approach tailored toward predicting the effectiveness of a given test suite based on a combination of source code and test suite metrics. Zhang et al. [78] (PS35) propose an approach to predicting mutation testing results without having to run the mutants. Their approach creates a model that is based on features related to mutants and tests. Such a model is then used to predict whether a mutant can be killed by the current test suite.
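The following sketch (an illustration, not the models used in PS7 or PS35) shows the core of such predictive mutation testing: a classifier is trained on mutants that were actually executed and then labels new mutants as killed or surviving; the feature set and data are hypothetical.

```python
# Illustrative sketch only: predicting whether a mutant will be killed without
# executing it, in the spirit of predictive mutation testing (PS35).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Per-mutant features: statement covered by the suite (0/1), number of tests
# covering the mutated statement, mutation operator id, cyclomatic complexity
# of the enclosing method.
X_train = np.array([
    [1, 14, 2, 3],
    [1, 2, 5, 9],
    [0, 0, 1, 4],
    [1, 30, 3, 2],
    [1, 1, 5, 12],
    [0, 0, 2, 6],
])
y_train = np.array([1, 0, 0, 1, 0, 0])  # 1 = killed, 0 = survived

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Classify new mutants instead of running the whole suite against them.
new_mutants = np.array([[1, 20, 2, 3], [0, 0, 4, 7]])
print("Predicted outcome (1 = killed):", clf.predict(new_mutants))
print("Kill probabilities:", clf.predict_proba(new_mutants)[:, 1])
```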
TABLE V: Algorithms used in the primary studies, classified into five broad categories. (Note: the majority of the primary studies employed supervised learning algorithms.)
E. Mapping Primary Studies According to ML Algorithm
As mentioned, there is a plethora of ML algorithms. Most of these algorithms fall into one of two broad learning categories: supervised or unsupervised learning. Supervised learning is used when for each input variable (i.e., X) there is a corresponding output variable (i.e., Y). In such a scenario, with the purpose of predicting the outputs for future inputs or better understanding the relationship between the input and the output, an algorithm is used to learn the mapping function (i.e., model) from the input to the output: Y = F(X) [20]. Put simply, supervised algorithms "learn" by generalizing from known examples. These algorithms find ways to produce the desired output based on the pairs of inputs and desired outputs provided by the user. In contrast, unsupervised learning is used when only input data are available, so the goal is to understand the underlying relationship between the inputs. In this setting, unsupervised learning is often concerned with clustering problems, in which the goal is to determine whether the inputs fall into distinct groups [20]. According to the results of our mapping study, the vast majority of software-testing issues have been formulated and tackled as supervised learning problems (see Table V). Only three unsupervised learning algorithms were used to automate software testing.
It is worth noting that some ML algorithms do not fit in the classification that groups them into supervised and unsupervised. When only a subset of the input data has output data associated with it, the problem lies between supervised and unsupervised learning; this is often referred to as a semisupervised problem. In this setting, algorithms have to incorporate into the analysis the input data for which the associated output data are available as well as the input data for which there are no corresponding output data [20]. Interestingly, our results indicate that semisupervised algorithms have been used more often than unsupervised algorithms. One of the primary studies also investigated an algorithm that can be used in a supervised or semisupervised fashion: the expectation–maximization (EM) algorithm was used in PS9. The classification scheme in Table V also includes the category meta-algorithm: a metalearning algorithm combines the predictions of several different ML algorithms in some way so as to utilize the strengths of each algorithm [16]. Only one primary study falls into this category; this primary study evaluated the performance of a meta-algorithm (i.e., AdaBoost) when applied to support software testing: PS8.
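As a small illustration of the semisupervised setting (not taken from any primary study), the sketch below propagates a handful of known test verdicts to unlabeled executions; the features, data, and choice of LabelPropagation are hypothetical.

```python
# Illustrative sketch only: a semisupervised formulation of a testing problem,
# where only some test executions have a known verdict.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Feature vectors describing test executions (e.g., coverage and runtime
# measurements); -1 marks executions whose verdict has not been labeled yet.
X = np.array([
    [0.9, 0.1], [0.85, 0.15], [0.2, 0.8],
    [0.25, 0.75], [0.8, 0.2], [0.3, 0.7],
])
y = np.array([1, 1, 0, -1, -1, -1])  # 1 = pass, 0 = fail, -1 = unlabeled

model = LabelPropagation()
model.fit(X, y)

# The model propagates the few known verdicts to the unlabeled executions.
print("Inferred verdicts for unlabeled points:", model.transduction_[3:])
```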
Table V does not list all primary studies, so the numbers in the table do not add up to the total number of ML algorithms investigated by the set of selected studies. Some studies describe more than one solution, thus some primary studies (e.g., PS9 and PS44) fall into more than one category. However, we believe that the taxonomy presented in Table V is useful because it gives an insight into the types of input data that need to be taken into account at the intersection of these two research areas as well as into how software-testing issues are more naturally formulated as ML problems. Although a useful classification scheme, there are still algorithms that do not quite fit into the five categories in Table V. For example, ANNs can be trained in either a supervised or unsupervised fashion. Therefore, we further classified the primary studies according to the similarity of the ML algorithms that they investigate. Stated more formally, we grouped the selected studies based on the function of the ML algorithms. This classification scheme is detailed in the following section.
1) Classifying the Algorithms According to Their Function: To classify the existing research spectrum and give a better idea of the ML algorithms that have been most used to automate software testing, we decided to further classify the algorithms in terms of their function. We studied the terminology used in the ML literature and proposed eight categories that attempt to capture the essence of the function of different ML algorithms. These eight categories are the following: ANNs, Bayesian, clustering, decision tree, ensemble algorithm, instance based, learning finite automata, and regression.
a) ANNs: This group includes studies that employ models designed to resemble biological neural networks. These models can be described as directed graphs whose nodes represent neurons and whose edges correspond to links between them. Each neuron performs computations that contribute to the learning process of the network. In this setting, neurons receive as input a weighted sum of the outputs of the neurons connected to them [22]. Simply put, ANNs are a parallel information-processing structure that learns and stores knowledge about its environment. This learning paradigm has been mostly used to cope with the test oracle problem (as discussed in Section V-D2). This is the largest
graph kernels generated from hierarchical control flow graphs. Similarly, PS11 [54] also takes into account graph kernels.
In this mapping, we found that the software artifacts most frequently taken into account by ML-based approaches are test cases (i.e., sets of inputs and outputs) and test suite metrics. Many studies extract information from test cases: PS3, PS7, PS8, PS9, PS10, PS12, PS14, PS15, PS18, PS25, PS28, PS29, PS30, PS34, PS40, and PS42. Although information on test cases is widely used, this sort of information is seldom considered in isolation. Often, other information sources are also used during the learning process.
Aside from source code, graphs, and test cases, ML algorithms have also been used to analyze regular expressions (i.e., PS2), features from polygonal models (i.e., PS6), formal specifications (i.e., PS13 [56]), abstract GUI models (i.e., PS21), GUI elements (i.e., PS48), false positives and false negatives yielded by oracles (i.e., PS24 [67]), Web logs (i.e., PS26), and images (i.e., PS32). An overview of the inputs (i.e., elements learned) and outputs (i.e., resulting models) of each ML-based approach is provided in Appendix C.
G. Advantages and Drawbacks of Using ML Algorithms to Automate Software Testing
As discussed in the previous sections, ML algorithms are appealing for automating a wide range of software-testing activities. Most selected studies use ML algorithms to "synthesize" test-related artifacts (e.g., test cases) into a form that is suitable for decision-making, be it either fully automated or with human interaction. There is a plethora of ML algorithms, and they differ in terms of their function, some of which seem to be more suitable for automating certain testing activities than others. In this section, we discuss the advantages and disadvantages of applying ML to software-testing activities.
ANNs have been widely used due to their ability to solve multiple types of problems related to test oracle automation. Nevertheless, as pointed out by Anderson et al. [77], issues tend to surface when it is necessary to extract the model or interpret what an ANN has learned. As Louridas and Ebert [21] remark, ML algorithms lie on a spectrum based on the ease of understanding their results. ANNs, for example, do not yield anything that can be interpreted by users: the network itself embodies all learned information. On the other end of the spectrum, some ML algorithms yield human-readable models. For example, one of the advantages of decision tree algorithms is that they produce flowchartlike tree structures that are somewhat straightforward for humans to interpret. However, according to the results of our mapping, it turns out that the decision trees generated by these algorithms are not always intuitive. In PS24, Sprenkle et al. [67] reported that the decision trees yielded by their approach led to decisions (i.e., in this case concerning oracle combinations) that were nonintuitive and contrary to what they were expecting.
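The following sketch (an illustration, not the procedure of any primary study) shows why decision trees are often considered interpretable: the fitted model can be printed as readable rules; the feature names and data are hypothetical.

```python
# Illustrative sketch only: a decision tree over hypothetical test-case
# metrics, printed as human-readable rules.
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["branch_coverage", "num_asserts", "exec_time_s"]
X = [
    [0.90, 12, 0.4],
    [0.35, 1, 0.1],
    [0.80, 9, 2.0],
    [0.40, 2, 0.2],
    [0.95, 15, 1.1],
    [0.30, 1, 0.3],
]
y = [1, 0, 1, 0, 1, 0]  # 1 = test case judged effective, 0 = not effective

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Unlike an ANN, the fitted model can be inspected directly by a tester.
print(export_text(tree, feature_names=features))
```

Whether the printed rules are actually intuitive to a tester is, as PS24 observed, a separate question from their syntactic readability.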
Given that most software-testing activities are challenging, we were interested in investigating to what extent software-testing activities can be automated by ML algorithms as well as how practical it is to apply ML-based approaches. As pointed out by Briand [59], one of the key limitations of ML algorithms that impacts their usefulness for supporting certain testing activities is that testers have to ensure that relevant data are available. In effect, the available data must be in a form that facilitates the learning process using ML algorithms. For instance, in the approach to refining test cases proposed by Briand et al. [55], a human expert has to provide inputs in the form of categories and choices. This sort of preprocessing, which is often necessary, is one of the main disadvantages of relying on standard ML algorithms. It is worth noting, however, that this disadvantage is inherent to the use of some ML algorithms.
We conjecture that a possible obstacle to the adoption of ML algorithms is that, if these algorithms are to be used effectively, software-testing efforts will have to include informed testers at all levels. These testers will have to be able to deploy and interrogate the outcomes of ML-based approaches. In many cases, this will entail having people who might not have an in-depth understanding of the code under test but know how to work knowledgeably with the strengths and weaknesses of ML algorithms. More specifically, the adoption of ML algorithms might blur the roles of testers and data scientists. Testers will not be able to truly leverage the benefits of ML algorithms without understanding the assumptions and implications of these algorithms.
H. Most Prolific Researchers in the Area
Upon analyzing the primary studies, we also counted the number of primary studies published by each author as a way to evaluate each author's impact. We found that only five researchers have published more than one paper: Lionel Briand (i.e., PS12 and PS16), Abraham Kandel and Mark Last (i.e., PS20 and PS33), and Neil Walkinshaw and Gordon Fraser (i.e., PS8 and PS43). Although the rate of papers published in the area seems to have increased since 2008, our results would seem to suggest that there is no research group specifically dealing with ML and software testing.
VI. POTENTIAL RESEARCH DIRECTIONS
Researchers have been able to successfully harness ML algorithms to automate a number of software-testing activities. While a fair amount of research has been carried out in this direction, we found that most research efforts are not methodologically sound and some issues remain unexplored. In this section, we present several potential directions for exploring the synergy between ML and software testing.
We posit that applying ML algorithms to a wider range of software-testing problems could be a useful trend to follow. In particular, a number of approaches have been developed for automating mutation testing [39]. However, according to our results, not much has been done in terms of drawing on ML algorithms to expedite mutation testing. For example, ascertaining whether a program and one of its mutants are equivalent is an undecidable problem. Consequently, this activity is often carried out by humans. Although this has drawn the attention of many researchers over the years, resulting in many theoretical contributions, it is still an open challenge. Along the lines of what we have previously argued, ML algorithms have the potential to outperform current approaches to detecting possible equivalent mutants as well as to automating other facets of this test criterion. However, as discussed in the previous sections, there has not been much research work in this direction.
ML algorithms have become instrumental in automating activities in other fields. Although these algorithms have come a long way, software-testing researchers and practitioners have only started to tap into the potential of these algorithms. Given that automation is seen as a practical approach to coping with the increasing demand for reliable software systems, we believe that the overarching motivation for research in this area should be automating most software-testing activities. However, despite future advances in ML, some human collaboration will still be needed. Thus, ML-based tools should not be designed as black-box solutions. Researchers should seek to provide solutions that allow users to easily interrogate the model behind the outputs. An effective step toward addressing this challenge would be to carry out research efforts that bridge the gap between academia and industry; we believe that this will increase the chances of coming up with solutions that can be translated into tools that are useful in industrial settings.
Finally, as pointed out by Briand [59], there is very little empirically grounded evidence supporting the cost effectiveness of the existing applications of ML algorithms in software testing. Thus, more empirical research is needed to examine how ML models perform in software-testing settings.
VII. THREATS TO VALIDITY
In this section, we discuss the factors that can threaten the validity of our systematic mapping study. When carrying out systematic mappings, threats arise from the design, conduct, analysis, and interpretation. There are several classification schemes for the different types of threats to the validity of empirical studies. In this section, we follow the classification scheme proposed by Campbell and Stanley [41] and followed by many software engineering researchers [42]. As Campbell and Stanley state, threats to validity can be categorized into four major categories: construct validity, internal validity, conclusion validity, and external validity. More specifically, considering our mapping study, the main factors that might have introduced threats to the validity of our paper are the following: selection of digital libraries, definition of the search string, the time frame we chose, researcher bias during study selection, inaccurate data extraction, and researcher-biased data synthesis. These factors are discussed in the following sections.
A. Construct Validity
Construct validity has to do with whether the concepts being investigated are interpreted correctly and whether all relevant studies were selected. In this mapping study, the main concepts under consideration are ML algorithms and software-testing activities. In hopes of ensuring the correct interpretation of these concepts, we checked their definitions and discussed them among the authors to reach a consensus. The soundness of the categorization schemes we created during data extraction hinges on how we interpreted the concepts in both areas. Due to the interdisciplinarity of the subject, we cannot rule out the possibility that some primary studies might have been misclassified. To mitigate this issue, the categorization schemes underwent several reviews by the authors.
B. Internal Validity
The two main threats to the internal validity of our paper are the following: missing relevant papers and researcher bias during paper selection. Although mapping studies cover a broad scope, the search is often restricted to one or more databases [31]. In this paper, we also restricted our search to the most widely used databases. To mitigate the threat of failing to include relevant papers, we searched the digital libraries that are most likely to cover most of the literature on software engineering as well as a general indexing system. We believe that the set of primary studies accounts for most of the relevant papers on applying ML algorithms to support software-testing activities. However, we cannot rule out the possibility that we may have missed several relevant studies during the conduction of the automated search in these databases. To cope with the interdisciplinarity of the subject area, we derived the keywords for our search string from the RQs and based on the keywords used in the quasi-gold standard. We also used the quasi-gold standard to assess the completeness of the automated searches we performed.
Researcher bias during data extraction can potentially lead to inaccuracies in data extraction, which may affect the classification and analysis of the selected studies. In hopes of mitigating this issue, we took some preventive measures. First, all DIs extracted during this mapping study were discussed among the researchers so that an agreement on the meaning of each DI was reached prior to data extraction. Second, as mentioned, to ensure that the two researchers in charge of data extraction had a clear understanding of all DIs, they pilot-tested the data extraction form. The results of the pilot data extraction were then discussed so as to reach a consensus. Third, when needed, a third researcher went over the data extraction results to settle any disagreement.
C. Conclusion Validity
Conclusion validity is mainly concerned with the degree to which the conclusions we reached are reasonable. We answered the RQs and drew conclusions based on information gleaned from the primary studies, e.g., the number of papers investigating test data generation using ML algorithms. The conclusion validity issue lies in whether there is a relationship between the number of studies we found and current research trends in the subject area. We cannot rule out this threat. Another potential threat is that some primary studies might have been misclassified. To mitigate this threat, data extraction and synthesis were undertaken as a team, with two reviewers working together to reach agreements concerning the extracted data and the classification thereof: as mentioned, two reviewers worked together to create the classification schemes presented in the previous sections. However, we cannot fully rule out this threat because of the qualitative nature of this systematic mapping, which makes data extraction and synthesis (i.e., classification) more susceptible to bias.
D. External Validity
External validity is concerned with the extent to which the outcomes of our systematic mapping can be generalized to the intended population of interest. Therefore, one potential issue stems from assessing whether the primary studies are representative of all the relevant studies in the subject area. To mitigate this issue, we followed a comprehensive search process during which we tried to be as inclusive as possible. Although we did not take into account studies written in languages other than English, we believe that the primary studies we selected contain enough information to give researchers and practitioners an insight into how ML has been employed to support software testing.
Some primary studies do not provide all the information needed to fill out the extraction form. Thus, we often had to infer some information concerning some DIs during data synthesis. For example, some primary studies mention that their ML approach results in several advantages without elaborating on these advantages in the text. Similarly, some primary studies do not mention the drawbacks of their approaches.
VIII. CONCLUSION
ML and software testing are two broad areas of active research whose intersection has been drawing the attention of researchers. Our systematic mapping focused on surveying research efforts based on using ML algorithms to support software testing. We believe that our mapping study provides a valuable overview of the state of the art in ML applied to software testing, which is useful for researchers and practitioners looking to understand this research field, either with the goal of leveraging it or of contributing to it.
We posed the following RQs and provided answers to them through the analysis of the results of our systematic mapping study.
1) RQ1: What is the intensity of the research on ML applied to software testing? According to our results, ML algorithms have been applied to tackle software-testing problems since 1995, but only very recently have ML algorithms caught the interest of researchers and practitioners. Our results suggest that interest in applying ML algorithms to solve software-testing problems has spiked in the last few years. In effect, this renewed interest in ML-related approaches to software testing started in 2010 and has been more pronounced from 2016 onward.
2) RQ2: What types of ML algorithms have been used to cope with software-testing issues? The vast majority of the approaches described in the primary studies automate software testing using supervised learning algorithms. According to our results, ANNs and decision trees are the most widely used algorithms.
3) RQ3: Which software-testing activities are automated by ML algorithms? ML algorithms have been used mainly for oracle construction and for test-case generation, refinement, and evaluation. Another application that seems to be gaining traction is using ML algorithms to predict the cost of testing-related activities.
4) RQ4: What trends can be observed among research studies discussing the application of ML to support software-testing activities? A trend we observed is that the oracle problem tends to be tackled by employing either ANN- or decision-tree-based approaches. Interestingly, these approaches lie on opposite ends of a spectrum based on how easy it is to understand their results. ANNs do not yield anything that can be interpreted by users: the network itself embodies all learned information. On the other end of the spectrum, decision trees yield flowchartlike tree structures that are easily interpretable by humans.
5) RQ5: What are the drawbacks and advantages of the algorithms when applied to software testing? The main advantage of the ML-based approaches described in the primary studies is that most approaches are likely to scale very well, thus we believe that they can be used to support increasingly complex testing activities. Another advantage is that most approaches require minimal human intervention. As for the drawbacks, upon analyzing our results, we found that a key limitation of ML algorithms is that testers have to ensure that relevant data are available. Moreover, the available data must be in a form that facilitates the learning process using ML algorithms. Therefore, preprocessing all the available data is an inherent disadvantage of some ML algorithms. Another drawback that has the potential to hinder widespread adoption of ML algorithms is that, if these algorithms are to be used effectively, software-testing efforts will have to include informed testers at all levels. These testers will have to be able to deploy and interrogate the outcomes of ML-based approaches. In many cases, this will entail having people who might not have an in-depth understanding of the SUT but know how to work knowledgeably with the strengths and weaknesses of ML algorithms. We conjecture that the adoption of ML algorithms might in a way blur the roles of testers and data scientists. Testers will not be able to truly leverage the benefits of ML algorithms without understanding the assumptions and implications of these algorithms.
6) RQ6: What problems have been observed by researchers when applying ML algorithms to support software-testing activities? Basically, the two problems faced by researchers when trying to apply ML algorithms to solve software-testing problems are as follows: first, most ML algorithms need a substantial amount of training data, and second, data quality is key for ML algorithms to function as intended.
7) RQ7: To what extent have these ML-based approaches been evaluated empirically? We found that the body of empirical research available at the intersection of ML and software testing leaves much to be desired, especially when compared with the level of understanding and body of evidence that have been achieved in other fields. Although most selected studies present sections that were termed "experiment," we found that these evaluations could not be strictly considered as
APPENDIX A
PRIMARY STUDIES
This appendix lists all primary studies [44]–[91].
APPENDIX B
DATA EXTRACTION FORM
This appendix presents the data extraction form we used throughout the conduction of our systematic mapping study.
APPENDIX C
SUMMARY OF THE SELECTED STUDIES
This section gives an overview of the research at the intersection of software testing and ML by providing a brief summary of each study. Since most ML algorithms are centered around learning a mapping from inputs (i.e., data points) to outputs, we tried to describe each ML-based testing approach in terms of its inputs (e.g., the information that is fed into the learning model) and outputs (e.g., how the resulting mapping or model is used to make predictions about some software-testing activity).
PS1: Strug and Strug [44] propose an approach to reducing the number of mutants that need to be executed during mutation testing. Their KNN learner receives mutants as input, which are represented as hierarchical graphs. As output, their model can be used to make predictions on whether a given test is able to kill a certain mutant.
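A rough illustration of this similarity-based idea (not the graph-kernel learner of PS1) is sketched below: mutants are reduced to hypothetical numeric feature vectors, and a k-nearest-neighbor classifier labels unexecuted mutants by analogy with executed ones.

```python
# Illustrative sketch only: classifying unexecuted mutants by similarity to
# executed ones, loosely in the spirit of PS1. PS1 compares mutants via
# graphs; here that is simplified to hypothetical feature vectors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Features of executed mutants (e.g., operator id, nesting depth, number of
# tests reaching the mutated statement).
executed = np.array([
    [1, 2, 14],
    [3, 5, 2],
    [1, 1, 20],
    [4, 6, 0],
    [2, 3, 9],
])
killed = np.array([1, 0, 1, 0, 1])  # observed outcome for executed mutants

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(executed, killed)

# Estimate the outcome of mutants that were never run, based on similarity.
not_executed = np.array([[1, 2, 16], [4, 5, 1]])
print("Predicted (1 = killed):", knn.predict(not_executed))
```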
PS2: Gove and Faytong [45] employ SVM and grammar-induction learners to eliminate infeasible GUI test cases. The learner receives as input the test case as a sequence of event IDs. The results yielded by their learner can be used to make predictions on whether a given test case is infeasible or not.
PS3: Cotroneo et al. [46] aim to improve the selection of testing techniques for a given test session. The predictor is fed with historical data on features (i.e., metrics) of test sessions and their outcomes. As output, the predictor yields information on the performance of each technique for a given test session.
PS4: Xiao et al. [47] propose an approach to generating tests for commercial games. The learner receives as input samples of input/output pairs extracted from the game engine. As output, it yields a model of the game's expected behavior.
PS5: Wang et al. [48] set out to devise test oracles for reactive systems. Feature vectors generated from test traces are used as input to the proposed approach. An oraclelike model is yielded by the approach.
PS6: Chan et al. [49] present a methodology that integrates ML and MT to build a test oracle for mesh simplification programs of 3-D polygonal models. The learner receives as input features of polygonal models. Once the learner has been built, it can be used to predict whether a test case will fail or not.
PS7: Jalbert and Bradbury [50] propose an approach to improve the performance and reduce the cost of mutation testing. The learner receives as input source code and test suite metrics for a given unit. As output, the learner estimates the mutation score of an unknown unit of code as low, medium, or high.
PS8: Fraser and Walkinshaw [51] aim to evaluate test suites by using behavioral coverage instead of syntactic adequacy metrics such as branch coverage. The learner receives as input the input/output pairs observed by a test generation tool. As a result, a model aimed at predicting the behavior of the program under test is generated.
PS9: Lenz et al. [52] propose an approach to ranking the results of different testing techniques into functional clusters. The results of such ranking can be used to support the selection and prioritization of test cases. The learner receives as input test cases, structural coverage information, the number of mutants killed, and the mutation score associated with each mutation operator. As output, the approach groups the data into clusters that can be seen as functional equivalence classes.
PS10: Zhu et al. [53] describe an approach to supporting the estimation of test execution effort. The input to the proposed approach includes the number of test cases, the complexity of executing the test cases, and the tester who will execute the test cases (testers are classified according to their experience and knowledge of the target application). As output, the approach generates a model tailored to predict the test execution effort.
PS11: Kanewala et al. [54] propose an approach to support testing activities without the need for test oracle automation by predicting metamorphic relations for scientific software. The input to their ML-based approach comprises graph kernels obtained from control flow graphs. The results can be used to make predictions on metamorphic relations.
PS12: Briand et al. [55] introduce an approach and a tool to support the refinement of test cases in Category-Partition testing. The learner receives as input abstract test cases obtained from the test suite and a Category-Partition specification. As output, the learner predicts rules that relate pairs (e.g., category and choice).
PS13: Singh et al. [56] detail an approach to generating test cases from Z specifications for partition testing. The learner receives as input the functional specification in Z. As output, the approach produces a classification tree describing high-level test cases.
PS14: Cheatham et al. [57] investigate factors that affect the prediction of testing costs, mainly testing time. The approach takes as input metrics of code complexity, programmer and tester experience, adoption of software engineering practices, and statistics on test execution. A model that lends itself well to making predictions on testing time is produced as output.
PS15: Mariani et al. [58] present a technique to generate new test cases for GUI-based applications from GUI-driven tests performed manually. The learner receives as input an initial test suite, GUI actions, and observed states obtained by the tool. As output, this GUI-based testing approach produces a behavioral model from which new test cases can be created.
PS16: Briand [59] gives an overview of the state of the art and reports on the diverse ways in which ML has been applied to automate and support software-testing activities. Thus, this paper does not focus on a specific ML-based approach for software testing.
PS17: Noorian et al. [60] outline a classification framework that can help us to systematically review research in the ML and software-testing domains. No specific ML-based approach for software testing is detailed.
PS18: Silva et al. [61] propose an approach aimed at supporting the estimation of test execution effort. Their ML-based approach takes as input metrics related to the test suite, testers, use cases, and source code. According to Silva et al. [61], the resulting model is able to predict the effort (in person-hours) required to run the test cycle.
PS19: This paper does not detail a specific ML-based approach for automating software testing. Instead, Zhang [62] describes a general framework for value-based test data generation.
PS20: Vanmali et al. [63] present an approach whose main purpose is to create an oracle from a software system's test suite. The test cases of the SUT serve as input for the proposed approach. The resulting oraclelike model can be used to predict the outcomes produced by new and possibly faulty versions of the SUT.
PS21: Choi et al. [64] introduce a tool that automatically generates sequences of test inputs for Android apps. The learner receives as input sequences of actions extracted from the app's GUI. The output can be seen as a model representing the GUI of the application under test.
PS22: Aarts et al. [65] investigate how active learning can be employed to support protocol conformance testing. Sequences of input/output pairs are used as input to the proposed approach. The outcome of the approach is a Mealy machine model representing the behavior of the SUT.
PS23: Chen et al. [66] present an approach for test selection during regression testing. The learner receives as input function call profiles of test cases and pairwise constraints. As output, the approach produces clusters of test cases (considered to have similar behaviors) from which sampling strategies can be employed to reduce the test suite for regression testing.
PS24: Sprenkle et al. [67] introduce an approach to identify the most effective HTML oracle combinations for web application testing. The learner receives as input test results for each oracle, application behavior, and expected results. According to Sprenkle et al. [67], the output predicts the most effective oracle combination.
PS25: Tonella et al. [68] propose a test-case prioritization technique that takes advantage of user knowledge. Their ML-based approach receives as input test cases, the prioritization indexes, and a sample of user-defined pairwise priority relations on test cases. As a result, the approach iteratively refines the test-case ordering.
PS26: The approach proposed by Sant et al. [69] creates test cases for web applications from logged user data. Web logs from user sessions are used as inputs. The output is a Markov model from which test cases can be derived.
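A minimal sketch of this log-driven idea (an illustration, not the statistical models of PS26) is shown below: a first-order Markov chain is estimated from hypothetical logged sessions and then walked to produce candidate test sequences.

```python
# Illustrative sketch only: deriving test sequences from logged user sessions
# with a first-order Markov chain, loosely in the spirit of PS26.
import random
from collections import defaultdict

sessions = [
    ["home", "login", "search", "cart", "checkout"],
    ["home", "search", "cart", "checkout"],
    ["home", "login", "profile", "logout"],
]

# Record which page follows which in the logged sessions.
transitions = defaultdict(list)
for session in sessions:
    for current_page, next_page in zip(session, session[1:]):
        transitions[current_page].append(next_page)

def generate_test_sequence(start="home", max_length=8, seed=42):
    """Walk the learned chain to produce a candidate test case."""
    random.seed(seed)
    sequence = [start]
    while len(sequence) < max_length and transitions[sequence[-1]]:
        sequence.append(random.choice(transitions[sequence[-1]]))
    return sequence

print(generate_test_sequence())
```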
PS27: Namin and Sridharan [70] give an overview of Bayesian reasoning methods and discuss their applicability to software testing. No specific ML-based approach for automating software testing is discussed.
PS28: Bergadano and Gunetti [71] introduce an approach to generate test cases that distinguish a given program from a set of alternative programs. The approach is based on the inductive learning of programs from a finite set of input/output examples. More specifically, their approach receives as input the program, the set of alternative programs, and input/output examples. As output, the approach induces an alternative program that is consistent (equivalent) with the original, taking into account the provided input/output examples.
PS29: Jin et al. [72] tackle the automated creation of test oracles by employing ANNs. The input to their ANN-based approach is test cases. As output, their approach is able to predict the expected behavior of new test cases.
PS30: Vineeta et al. [73] set out to generate test oracles. The learner receives as input test cases and, as a result, it can be used to predict the expected behavior of new test cases.
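To illustrate the general shape of these ANN-based oracles (this is a sketch under assumed data, not the networks of PS20, PS29, or PS30), the snippet below trains a small neural network on observed input/output pairs and uses it to flag suspicious outputs of a stubbed, hypothetical program under test.

```python
# Illustrative sketch only: an ANN trained on previously observed input/output
# pairs and used as an approximate oracle for new executions. Data, tolerance,
# and the stubbed "program under test" are hypothetical.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(200, 2))
y_train = X_train[:, 0] ** 2 + X_train[:, 1]  # outputs of a trusted version

oracle = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
oracle.fit(X_train, y_train)

def program_under_test(x0, x1):
    # Stand-in for the new (possibly faulty) version of the SUT.
    return x0 ** 2 + x1 + (0.5 if x0 > 0.9 else 0.0)  # seeded fault

# Flag executions whose output deviates too much from the learned behavior.
for x in rng.uniform(-1.0, 1.0, size=(5, 2)):
    actual = program_under_test(*x)
    expected = oracle.predict([x])[0]
    verdict = "suspicious" if abs(actual - expected) > 0.2 else "looks ok"
    print(f"input={np.round(x, 2)} actual={actual:.2f} predicted={expected:.2f} -> {verdict}")
```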
PS31: Hungar et al. [74] investigate automata learning to support the testing of complex reactive systems, mainly from the telecommunication domain. The approach receives as input stimuli and responses from the SUT. The resulting learning model can be used to predict I/O automata.
PS32: Semenenko et al. [75] report an experience on building a tool for cross-browser compatibility testing. The ML-based approach receives as input image features of regions of interest in the web pages. The resulting model can be used to point out potential incompatibilities among multiple browser-platform combinations.
PS33: Agarwal et al. [76] compare IFNs and ANNs to build automated test oracles. Test cases are fed into the learning model. As output, the model is then used to determine whether a new input is correct or not.
PS34: Anderson et al. [77] present empirical results on the adoption of ANNs to prune test suites while keeping their effectiveness. The learner receives as input test-case metrics, such as length, command frequency, and parameter use frequency. As output, their ANN predicts the fault detection capabilities of a given test case.
PS35: Zhang et al. [78] propose an approach to reduce the execution cost of mutation testing. The inputs to their approach are source code and test suite features related to the execution, infection, and propagation of a given mutant. The resulting model predicts whether a mutant is killed by some test case.
PS36: Felbinger et al. [79] present an approach to evaluating test suite adequacy with respect to an inferred model. The learner receives as input the test cases as input/output pairs. As output, the approach builds a state machine model that can be used to calculate the similarity with the SUT.
PS37: Busjaeger and Xie [80] introduce a novel approach for test prioritization in industrial environments. Inputs to the approach are code coverage information, text path similarity, text content similarity, failure history, and test age. As output, the resulting model yields an effective prioritization of the test cases.
PS38: Bowring et al. [81] investigate the use of Markov models to evaluate and augment test suites for future versions of the SUT. The learner receives as input test cases, event-transition profiles, and their behavior labels. Markov models that are clustered into effective behavior classifiers are produced as output.
PS39: Grano et al. [82] conducted a preliminary study to look at how ML models can be used to predict the branch coverage achieved by test data generation tools. The learner receives as input code metrics and, as output, it predicts the branch coverage achieved by test data generator tools.
PS40: Spieker et al. [83] propose an ML-based approach for test-case selection and prioritization. The approach receives as input information on test cases: the test-case duration, the last execution, and the failure history. As output, their approach yields a model tailored to prioritize error-prone test cases under the guidance of a reward function and by taking into account previous executions.
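The following is a heavily simplified, bandit-style stand-in for the reward-driven prioritization idea in PS40 (the actual study uses a full reinforcement learning formulation); the test names, failure histories, and learning rate are hypothetical.

```python
# Illustrative sketch only: tests accumulate value from past failures and are
# scheduled first when their value is high; a crude stand-in for the reward
# function used in PS40.
failure_history = {
    "test_login":   [1, 0, 1, 1],  # 1 = failed in that CI cycle
    "test_search":  [0, 0, 0, 0],
    "test_payment": [0, 1, 1, 0],
    "test_profile": [0, 0, 1, 0],
}

alpha = 0.5  # learning rate: weight given to the most recent reward
value = {t: 0.0 for t in failure_history}

# Replay past CI cycles; a failure yields reward 1 (the test was worth running).
for cycle in range(len(next(iter(failure_history.values())))):
    for test, history in failure_history.items():
        reward = history[cycle]
        value[test] += alpha * (reward - value[test])

# Run the most failure-prone tests first in the next cycle.
prioritized = sorted(value, key=value.get, reverse=True)
print("Suggested execution order:", prioritized)
```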
PS41: Hardin and Kanewala [84] propose a semisupervised ML approach to detecting metamorphic relations that are applicable to a given code base. The learner receives as input paths through the methods' control flow graphs. The resulting model can be used to predict metamorphic relations.
PS42: Felbinger et al. [85] propose a method for evaluating the effectiveness of test suites, which is based on inferring models from the test suites. The input to their approach is the test suite being evaluated and information concerning the current state and previous output of the SUT. As a result, the approach yields a model inferred from the test suite.
PS43: Walkinshaw and Fraser [86] apply a technique known in ML parlance as the "query strategy framework," which entails inferring a behavioral model of the SUT and selecting test cases that the inferred model is "least certain" about. It is assumed that running these tests on the SUT will further help to inform the learner. That is, the underlying assumption is that, by providing information that the learner has not processed yet (i.e., test cases that are not present in the training set), this uncertainty-driven approach is able to form an effective basis for test-case selection. The learner receives as input the JSON specification of the SUT's interface. As output, it produces new test inputs.
PS44: Balkan et al. [87] introduce a framework named Underminer that aims to find parameter values and inputs for black-box
APPENDIX D
DISTRIBUTION OF THE SELECTED STUDIES ACCORDING TO
THEIR PUBLICATION SOURCES
REFERENCES
[1] P. E. Ceruzzi, Computing: A Concise History (The MIT Press Essential Knowledge series). Cambridge, MA, USA: MIT Press, 2012.
[2] J. C. Westland, "The cost of errors in software development: Evidence from industry," J. Syst. Softw., vol. 62, no. 1, pp. 1–9, 2002.
[3] P. Ammann and J. Offutt, Introduction to Software Testing, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2016.
[4] G. J. Myers, C. Sandler, and T. Badgett, The Art of Software Testing, 3rd ed. Hoboken, NJ, USA: Wiley, 2011.
[5] M. J. Harrold, "Testing: A roadmap," in Proc. Conf. Future Softw. Eng., 2000, pp. 61–72.
[6] C. A. Welty and P. G. Selfridge, "Artificial intelligence and software engineering: Breaking the toy mold," Automated Softw. Eng., vol. 4, no. 3, pp. 255–270, 1997.
[7] M. Harman, "The role of artificial intelligence in software engineering," in Proc. 1st Int. Workshop Realizing Artif. Intell. Synergies Softw. Eng., 2012, pp. 1–6.
[8] T. Xie, "The synergy of human and artificial intelligence in software engineering," in Proc. 2nd Int. Workshop Realizing Artif. Intell. Synergies Softw. Eng., 2013, pp. 4–6.
[9] J. Bell, Machine Learning: Hands-On for Developers and Technical Professionals. Hoboken, NJ, USA: Wiley, 2014.
[10] D. Zhang and J. J. Tsai, "Machine learning and software engineering," Softw. Qual. J., vol. 11, no. 2, pp. 87–119, 2003.
[11] S. R. Vergilio, J. A. C. Maldonado, and M. Jino, "Infeasible paths in the context of data flow based testing criteria: Identification, classification and prediction," J. Brazilian Comput. Soc., vol. 12, pp. 71–86, Jun. 2006.
[12] B. Beizer, Software Testing Techniques, 2nd ed. New York, NY, USA: Van Nostrand Reinhold Company, 1990.
[13] H. Zhu, P. A. V. Hall, and J. H. R. May, "Software unit test coverage and adequacy," ACM Comput. Surveys, vol. 29, no. 4, pp. 366–427, 1997.
[14] S. Rapps and E. J. Weyuker, "Data flow analysis techniques for test data selection," in Proc. 6th Int. Conf. Softw. Eng., 1982, pp. 272–278.
[15] A. Orso and G. Rothermel, "Software testing: A research travelogue (2000–2014)," in Proc. Future Softw. Eng., 2014, pp. 117–132.
[16] B. Lantz, Machine Learning With R, 2nd ed. Birmingham, U.K.: Packt Publishing, 2015.
[17] M. Bowles, Machine Learning in Python: Essential Techniques for Predictive Analysis. Hoboken, NJ, USA: Wiley, 2015.
[18] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. Cambridge, MA, USA: MIT Press, 2012.
[19] P. Flach, Machine Learning: The Art and Science of Algorithms That Make Sense of Data. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[20] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R (Springer Texts in Statistics). New York, NY, USA: Springer, 2013.
[21] P. Louridas and C. Ebert, "Machine learning," IEEE Softw., vol. 33, no. 5, pp. 110–115, Sep./Oct. 2016.
[22] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2014.
[23] C. Anderson, A. V. Mayrhauser, and R. Mraz, "On the use of neural networks to guide software testing activities," in Proc. Int. Test Conf., 1995, pp. 720–729.
[24] H. Singh, M. Conrad, and S. Sadeghipour, "Test case design based on Z and the classification-tree method," in Proc. IEEE Int. Conf. Formal Eng. Methods, 1997, pp. 81–90.
[25] J. Bowring, J. M. Rehg, and M. J. Harrold, "Active learning for automatic classification of software behavior," in Proc. ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2004, pp. 195–205.
[26] L. C. Briand, "Novel applications of machine learning in software testing," in Proc. 8th Int. Conf. Qual. Softw., 2008, pp. 3–10.
[27] W. Choi, G. Necula, and K. Sen, "Guided GUI testing of android apps with minimal restart and approximate learning," in Proc. ACM SIGPLAN Int. Conf. Object Oriented Program. Syst. Lang. Appl., 2013, pp. 623–640.
[28] J. Zhang et al., "Predictive mutation testing," in Proc. 25th Int. Symp. Softw. Testing Anal., 2016, pp. 342–353.
[29] M. Noorian, E. Bagheri, and W. Du, "Machine learning-based software testing: Towards a classification framework," in Proc. Int. Conf. Softw. Eng. Knowl. Eng., 2011, pp. 225–229.
[30] K. Petersen, S. Vakkalanka, and L. Kuzniarz, "Guidelines for conducting systematic mapping studies in software engineering: An update," Inf. Softw. Technol., vol. 64, pp. 1–18, 2015.
[31] B. A. Kitchenham, D. Budgen, and P. Brereton, Evidence-Based Software Engineering and Systematic Reviews. London, U.K.: Chapman and Hall/CRC, 2015.
[32] B. A. Kitchenham, D. Budgen, and O. P. Brereton, "Using mapping studies as the basis for further research – A participant-observer case study," Inf. Softw. Technol., vol. 53, no. 6, pp. 638–651, 2011.
[33] V. R. Basili, G. Caldiera, and H. D. Rombach, "The goal question metric approach," in Encyclopedia of Software Engineering. Hoboken, NJ, USA: Wiley, 1994.
[34] H. Zhang, M. A. Babar, and P. Tell, "Identifying relevant studies in software engineering," Inf. Softw. Technol., vol. 53, no. 6, pp. 625–637, 2011.
[35] C. Wohlin, "Guidelines for snowballing in systematic literature studies and a replication in software engineering," in Proc. 18th Int. Conf. Eval. Assessment Softw. Eng., 2014, paper 38.
[36] R. Wieringa, N. Maiden, N. Mead, and C. Rolland, "Requirements engineering paper classification and evaluation criteria: A proposal and a discussion," Requirements Eng., vol. 11, no. 1, pp. 102–107, 2005.
[37] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, "The oracle problem in software testing: A survey," IEEE Trans. Softw. Eng., vol. 41, no. 5, pp. 507–525, May 2015.
[38] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, "Prioritizing test cases for regression testing," IEEE Trans. Softw. Eng., vol. 27, no. 10, pp. 929–948, Oct. 2001.
[39] Y. Jia and M. Harman, "An analysis and survey of the development of mutation testing," IEEE Trans. Softw. Eng., vol. 37, no. 5, pp. 649–678, Sep./Oct. 2011.
[40] J. V. Stone, Bayes' Rule: A Tutorial Introduction to Bayesian Analysis. Sheffield, U.K.: Sebtel Press, 2012.
[41] D. T. Campbell and J. Stanley, Experimental and Quasi-Experimental Designs for Research. Belmont, CA, USA: Wadsworth, 1963.
[42] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering. New York, NY, USA: Springer, 2012.
[43] E. T. Stringer, Action Research, 4th ed. Newbury Park, CA, USA: SAGE, 2013.
[44] J. Strug and B. Strug, "Machine learning approach in mutation testing," in Proc. Int. Conf. Testing Softw. Syst., pp. 200–214.
[45] R. Gove and J. Faytong, "Identifying infeasible GUI test cases using support vector machines and induced grammars," in Proc. Int. Conf. Softw. Testing, Verification Validation Workshops, 2011, pp. 202–211.
[46] D. Cotroneo, R. Pietrantuono, and S. Russo, "A learning-based method for combining testing techniques," in Proc. Int. Conf. Softw. Eng., 2013, pp. 142–151.
[47] G. Xiao, F. Southey, R. C. Holte, and D. Wilkinson, "Software testing by active learning for commercial games," in Proc. 20th Nat. Conf. Artif. Intell., 2005, vol. 2, pp. 898–903.
[48] F. Wang, L. W. Yao, and J. H. Wu, "Intelligent test oracle construction for reactive systems without explicit specifications," in Proc. Int. Conf. Dependable, Auton. Secure Comput., 2011, pp. 89–96.
[49] W. K. Chan, J. C. F. Ho, and T. H. Tse, "Finding failures from passed test cases: Improving the pattern classification approach to the testing of mesh simplification programs," Softw. Testing, Verification Rel., vol. 20, no. 2, pp. 89–120, 2010.
[50] K. Jalbert and J. S. Bradbury, "Predicting mutation score using source code and test suite metrics," in Proc. Int. Workshop Realizing AI Synergies Softw. Eng., 2012, pp. 42–46.
[51] G. Fraser and N. Walkinshaw, "Assessing and generating test sets in terms of behavioural adequacy," Softw. Testing, Verification Rel., vol. 25, no. 8, pp. 749–780, 2015.
[52] A. R. Lenz, A. Pozo, and S. R. Vergilio, "Linking software testing results with a machine learning approach," Eng. Appl. Artif. Intell., vol. 26, no. 5/6, pp. 1631–1640, 2013.
[53] X. Zhu, B. Zhou, L. Hou, J. Chen, and L. Chen, "An experience-based approach for test execution effort estimation," in Proc. Int. Conf. Young Comput. Scientists, 2008, pp. 1193–1198.
[54] U. Kanewala, J. M. Bieman, and A. Ben-Hur, "Predicting metamorphic relations for testing scientific software: A machine learning approach using graph kernels," Softw. Testing, Verification Rel., vol. 26, no. 3, pp. 245–269, 2016.
[55] L. C. Briand, Y. Labiche, Z. Bawar, and N. T. Spido, "Using machine learning to refine category-partition test specifications and test suites," Inf. Softw. Technol., vol. 51, no. 11, pp. 1551–1564, 2009.
[56] H. Singh, M. Conrad, and S. Sadeghipour, "Test case design based on Z and the classification-tree method," in Proc. IEEE Int. Conf. Formal Eng. Methods, 1997, pp. 81–90.
[57] T. J. Cheatham, J. P. Yoo, and N. J. Wahl, "Software testing: A machine learning experiment," in Proc. ACM Annu. Conf. Comput. Sci., 1995, pp. 135–141.
[58] L. Mariani, M. Pezzè, O. Riganelli, and M. Santoro, "Automatic testing of GUI-based applications," Softw. Testing, Verification Rel., vol. 24, no. 5, pp. 341–366, 2014.
[59] L. C. Briand, "Novel applications of machine learning in software testing," in Proc. 8th Int. Conf. Qual. Softw., 2008, pp. 3–10.
[60] M. Noorian, E. Bagheri, and W. Du, "Machine learning-based software testing: Towards a classification framework," in Proc. Int. Conf. Softw. Eng. Knowl. Eng., 2011, pp. 225–229.
[61] D. G. Silva, M. Jino, and B. T. d. Abreu, "Machine learning methods and asymmetric cost function to estimate execution effort of software testing," in Proc. Int. Conf. Softw. Testing, Verification Validation, 2010, pp. 275–284.
[62] D. Zhang, "Machine learning in value-based software test data generation," in Proc. IEEE Int. Conf. Tools Artif. Intell., 2006, pp. 732–736.
[63] M. Vanmali, M. Last, and A. Kandel, "Using a neural network in the software testing process," Int. J. Intell. Syst., vol. 17, no. 1, pp. 45–62, 2002.
[64] W. Choi, G. Necula, and K. Sen, "Guided GUI testing of android apps with minimal restart and approximate learning," in Proc. ACM SIGPLAN Int. Conf. Object Oriented Program. Syst. Lang. Appl., 2013, pp. 623–640.
[65] F. Aarts, H. Kuppens, J. Tretmans, F. Vaandrager, and S. Verwer, "Improving active Mealy machine learning for protocol conformance testing," Mach. Learn., vol. 96, no. 1, pp. 189–224, 2014.
[66] S. Chen, Z. Chen, Z. Zhao, B. Xu, and Y. Feng, "Using semisupervised clustering to improve regression test selection techniques," in Proc. IEEE Int. Conf. Softw. Testing, Verification Validation, 2011, pp. 1–10.
[67] S. Sprenkle, E. Hill, and L. Pollock, "Learning effective oracle comparator combinations for web applications," in Proc. Int. Conf. Qual. Softw., 2007, pp. 372–379.
[68] P. Tonella, P. Avesani, and A. Susi, "Using the case-based ranking methodology for test case prioritization," in Proc. IEEE Int. Conf. Softw. Maintenance, 2006, pp. 123–133.
[69] J. Sant, A. Souter, and L. Greenwald, "An exploration of statistical models for automated test case generation," in Proc. Int. Workshop Dyn. Anal., 2005, pp. 1–7.
[70] A. S. Namin and M. Sridharan, "Bayesian reasoning for software testing," in Proc. FSE/SDP Workshop Future Softw. Eng. Res., 2010, pp. 349–354.
[71] F. Bergadano and D. Gunetti, "Testing by means of inductive program learning," ACM Trans. Softw. Eng. Methodol., vol. 5, no. 2, pp. 119–145, 1996.
[72] H. Jin, Y. Wang, N. W. Chen, Z. J. Gou, and S. Wang, "Artificial neural network for automatic test oracles generation," in Proc. Int. Conf. Comput. Sci. Softw. Eng., 2008, vol. 2, pp. 727–730.
[77] C. Anderson, A. V. Mayrhauser, and R. Mraz, "On the use of neural networks to guide software testing activities," in Proc. Int. Test Conf., 1995, pp. 720–729.
[78] J. Zhang et al., "Predictive mutation testing," in Proc. 25th Int. Symp. Softw. Testing Anal., 2016, pp. 342–353.
[79] H. Felbinger, F. Wotawa, and M. Nica, "Empirical study of correlation between mutation score and model inference based test suite adequacy assessment," in Proc. Int. Workshop Autom. Softw. Test, 2016, pp. 43–49.
[80] B. Busjaeger and T. Xie, "Learning for test prioritization: An industrial case study," in Proc. ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2016, pp. 975–980.
[81] J. Bowring, J. M. Rehg, and M. J. Harrold, "Active learning for automatic classification of software behavior," in Proc. ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2004, pp. 195–205.
[82] G. Grano, T. V. Titov, S. Panichella, and H. C. Gall, "How high will it be? Using machine learning models to predict branch coverage in automated testing," in Proc. Workshop Mach. Learn. Techn. Softw. Qual. Eval., 2018, pp. 19–24.
[83] H. Spieker, A. Gotlieb, D. Marijan, and M. Mossige, "Reinforcement learning for automatic test case prioritization and selection in continuous integration," in Proc. ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2017, pp. 12–22.
[84] B. Hardin and U. Kanewala, "Using semi-supervised learning for predicting metamorphic relations," in Proc. Int. Workshop Metamorphic Testing, 2018, pp. 14–17.
[85] H. Felbinger, F. Wotawa, and M. Nica, "Empirical study of correlation between mutation score and model inference based test suite adequacy assessment," in Proc. Int. Workshop Autom. Softw. Test, 2016, pp. 43–49.
[86] N. Walkinshaw and G. Fraser, "Uncertainty-driven black-box test data generation," in Proc. Int. Conf. Softw. Testing, Verification Validation, 2017, pp. 253–263.
[87] A. Balkan, P. Tabuada, J. V. Deshmukh, X. Jin, and J. Kapinski, "Underminer: A framework for automatically identifying nonconverging behaviors in black-box system models," ACM Trans. Embedded Comput. Syst., vol. 17, no. 1, 2017, Art. no. 20.
[73] Vineeta, A. Singhal, and A. Bansal, “Generation of test oracles using [88] H. Enişer and A. Sen, “Testing service oriented architectures using state-
neural network and decision tree model,” in Proc. Int. Conf. - Confluence ful service visualization via machine learning,” in Proc. Int. Workshop
Next Generation Inf. Technol. Summit, 2014, pp. 313–318. Automat. Softw. Test, 2018, pp. 9–15.
[74] H. Hungar, O. Niese, and B. Steffen, “Domain-specific optimization in [89] R. Groz, A. Simao, N. Bremond, and C. Oriat, “Revisiting AI and test-
automata learning,” in Proc. Int. Conf. Comput. Aided Verification., 2003, ing methods to infer FSM models of black-box systems,” in Proc. Int.
pp. 315–327. Workshop Automat. Softw. Test., 2018, pp. 16–19.
[75] N. Semenenko, M. Dumas, and T. Saar, “Browserbite: Accurate cross- [90] M. Badri, L. Badri, W. Flageol, and F. Toure, “Investigating the accuracy
browser testing via machine learning over image features,” in Proc. IEEE of test code size prediction using use case metrics and machine learning
Int. Conf. Softw. Maintenance, 2013, pp. 528–531. algorithms: An empirical study,” in Proc. Int. Conf. Mach. Learn. Soft
[76] D. Agarwal, D. E. Tamir, M. Last, and A. Kandel, “A comparative study of Comput., 2017, pp. 25–33.
artificial neural networks and info-fuzzy networks as automated oracles [91] A. Rosenfeld, O. Kardashov, and O. Zang, “Automation of android appli-
in software testing,” IEEE Trans. Syst., Man, Cybernet., vol. 42, no. 5, cations functional testing using machine learning activities classification,”
pp. 1183–1193, Sep. 2012. in Proc. Int. Conf. Mobile Softw. Eng. Syst., 2018, pp. 122–132.