0% found this document useful (0 votes)
266 views11 pages

Reinforcement Learning For Test Case Prioritization-Issta17 0 PDF

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
266 views11 pages

Reinforcement Learning For Test Case Prioritization-Issta17 0 PDF

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Reinforcement Learning for Automatic Test Case Prioritization

and Selection in Continuous Integration


Helge Spieker Arnaud Gotlieb
Simula Research Laboratory Simula Research Laboratory
P.O. Box 134 P.O. Box 134
1325 Lysaker, Norway 1325 Lysaker, Norway
[email protected] [email protected]

Dusica Marijan Morten Mossige


Simula Research Laboratory University of Stavanger
P.O. Box 134 Stavanger, Norway
1325 Lysaker, Norway ABB Robotics
[email protected] Bryne, Norway
[email protected]

ABSTRACT ACM Reference format:


Testing in Continuous Integration (CI) involves test case prioriti- Helge Spieker, Arnaud Gotlieb, Dusica Marijan, and Morten Mossige. 2017.
Reinforcement Learning for Automatic Test Case Prioritization and Selection
zation, selection, and execution at each cycle. Selecting the most
in Continuous Integration. In Proceedings of 26th International Symposium on
promising test cases to detect bugs is hard if there are uncertainties Software Testing and Analysis , Santa Barbara, CA, USA, July 2017 (ISSTA’17),
on the impact of committed code changes or, if traceability links 11 pages.
between code and tests are not available. This paper introduces https://doi.org/10.1145/3092703.3092709
Retecs, a new method for automatically learning test case selection
and prioritization in CI with the goal to minimize the round-trip
time between code commits and developer feedback on failed test 1 INTRODUCTION
cases. The Retecs method uses reinforcement learning to select Context. Continuous Integration (CI) is a cost-effective software
and prioritize test cases according to their duration, previous last development practice commonly used in industry [10, 13] where
execution and failure history. In a constantly changing environ- developers frequently integrate their work. It involves several tasks,
ment, where new test cases are created and obsolete test cases are including version control, software configuration management,
deleted, the Retecs method learns to prioritize error-prone test automatic build and regression testing of new software release
cases higher under guidance of a reward function and by observing candidates. Automatic regression testing is a crucial step which aims
previous CI cycles. By applying Retecs on data extracted from at detecting defects as early as possible in the process by selecting
three industrial case studies, we show for the first time that rein- and executing available and relevant test cases. CI is seen as an
forcement learning enables fruitful automatic adaptive test case essential method for improving software quality while keeping
selection and prioritization in CI and regression testing. verification costs at a low level [24, 34].
Unlike usual testing methods, testing in CI requires tight control
CCS CONCEPTS over the selection and prioritization of the most promising test
• Software and its engineering → Software verification and cases. By most promising, we mean test cases that are prone to
validation; Software testing and debugging; detect failures early in the process. Admittedly, selecting test cases
which execute the most recent code changes is a good strategy in
CI, such as, for example in coverage-based test case prioritization
KEYWORDS [9]. However, traceability links between code and test cases are not
Regression testing, Test case prioritization, Test case selection, Re- always available or easily accessible when test cases correspond to
inforcement Learning, Machine Learning, Continuous Integration system tests. In system testing for example, test cases are designed
for testing the overall system instead of simple units of code and
instrumenting the system for code coverage monitoring is not easy.
Permission to make digital or hard copies of all or part of this work for personal or In that case, test case selection and prioritization has to be handled
classroom use is granted without fee provided that copies are not made or distributed differently and using historical data about failures and successes of
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM test cases has been proposed as an alternative [16]. Based on the
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, hypothesis that test cases having failed in the past are more likely
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from [email protected].
to fail in the future, history-based test case prioritization schedules
ISSTA’17, July 2017, Santa Barbara, CA, USA these test cases first in new CI cycles [19]. Testing in CI also means
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5076-1/17/07. . . $15.00 This work is supported by the Research Council of Norway (RCN) through the research-
https://doi.org/10.1145/3092703.3092709 based innovation center Certus, under the SFI program.

12
ISSTA’17, July 2017, Santa Barbara, CA, USA Helge Spieker, Arnaud Gotlieb, Dusica Marijan, and Morten Mossige

to control the time required to execute a complete cycle. As the (2) Implementing an online RL method, without any previous
durations of test cases strongly vary, not all tests can be executed training phase, into a Continuous Integration process is
and test case selection is required. shown to be effective to learn how to prioritize test cases.
Despite algorithms have been proposed recently [19, 23], we According to our knowledge, this is the first time that RL
argue that these two aspects of CI testing, namely test case selection is applied to test case prioritization and compared with
and history-based prioritization, can hardly be solved by using only other simple deterministic and random approaches. Com-
non-adaptive methods. First, the time allocated to test case selection paring two distinct representations (i.e., tableau and neural
and prioritization in CI is limited as each step of the process is given networks) and three distinct reward functions, our exper-
a contract of time. So, time-effective methods shall be privileged imental results show that, without any prior knowledge
over costly and complex prioritization algorithms. Second, history- and without any model of the environment, the RL ap-
based prioritization is not well adapted to changes in the execution proach is able to learn how to prioritize test cases better
environment. More precisely, it is frequent to see some test cases than other approaches. Remarkably, the number of cycles
being removed from one cycle to another because they test an required to improve on other methods corresponds to less
obsolete feature of the system. At the same time, new test cases than 2-months of data, if there is only one CI cycle per day;
are introduced to test new or changed features. Additionally, some (3) Our experimental results have been computed on industrial
test cases are more crucial in certain periods of time, because they data gathered over one year of Continuous Integration. By
test features on which customers focus the most, and then they applying our RL method on this data, we actually show
loose their prevalence because the testing focus has changed. In that the method is deployable in industrial settings. This is
brief, non-adaptive methods may not be able to spot changes in the third contribution of this paper.
the importance of some test cases over others because they apply Paper Outline. The rest of the paper is organized as follows: Sec-
systematic prioritization algorithms. tion 2 provides notations and definitions. It also includes a formal-
Reinforcement Learning. In order to tame these problems, we ization of the problem addressed in our work. Section 3 presents
propose a new lightweight test case selection and prioritization ap- our Retecs approach for test case prioritization and selection based
proach in CI based on reinforcement learning and neural networks. on reinforcement learning. It also introduces basic concepts such
Reinforcement learning is well-tuned to design an adaptive method as artificial neural network, agent, policy and reward functions.
capable to learn from its experience of the execution environment. Section 4 presents our experimental evaluation of the Retecs on
By adaptive, it is meant, that our method can progressively improve industrial data sets, while Section 5 discusses related work. Finally,
its efficiency from observations of the effects its actions have. By Section 6 summarizes and concludes the paper.
using a neural network which works on both the selected test cases
and the order in which they are executed, the method tends to 2 FORMAL DEFINITIONS
select and prioritize test cases which have been successfully used
This section introduces necessary notations used in the rest of the
to detect faults in previous CI cycles, and to order them so that the
paper and presents the addressed problem in a formal way.
most promising ones are executed first.
Unlike other prioritization algorithms, our method is able to 2.1 Notations and Definitions
adapt to situations where test cases are added to or deleted from a
general repository. It can also adapt to situations where the testing Let Ti be a set of test cases {t 1 , t 2 , . . . , t N } at a CI cycle i. Note
priorities change because of different focus or execution platforms, that this set can evolve from one cycle to another. Some of these
indicated by changing failure indications. Finally, as the method test cases are selected and ordered for execution in a test schedule
is designed to run in a CI cycle, the time it requires is negligible, called T S i (T S i ⊆ Ti ). For evaluation purposes, we define further
because it does not need to perform computationally intensive T S itot al as being the ordered sequence of all test cases (T S itot al =
operations during prioritization. It does not mine in detail code- Ti ) as if all test cases are scheduled for execution regardless of
based repositories or change-logs history to compute a new test case any time limit. Note that Ti is an unordered set, while T S i and
schedule. Instead it facilitates knowledge about test cases which T S itot al are ordered sequences. Following up on this idea, we
have been the most capable to detect failures in a small sequence define a ranking function over the test cases: rank : T S i → N
of previous CI cycles. This knowledge to make decisions is updated where rank(t) is the position of t within T S i .
only after tests are executed from feedback provided by a reward In T S i , each test case t has a verdict t .verdicti and a duration
function, the only component in the method initially embedding t .durationi . Note that these values are only available after executing
domain knowledge. the test case and that they depend on the cycle in which the test case
The contributions of this paper are threefold: has been executed. For the sake of simplicity, the verdict is either 1 if
the test case has passed, or 0 if it has failed or has not been executed
in cycle i, i.e. it is not included in T S i . The subset of all failed test
(1) This paper shows that history-based test case prioritiza- f ail
tion and selection can be approached as a reinforcement cases in T S i is noted T S i = {t ∈ T S i s.t. t .verdicti = 0}. The
learning problem. By modeling the problem with notions failure of an executed test case can be due to one or several actual
such as states, actions, agents, policy, and reward functions, faults in the system under test, and conversely a single fault can be
we demonstrate, as a first contribution, that RL is suitable responsible of multiple failed test cases. For the remainder of this
to automatically prioritize and select test cases; paper, we will focus only on failed test cases (and not actual faults
of the system) as the link between actual faults and executed test

13
Reinforcement Learning for Automatic Test Case Prioritization and Selection
in Continuous Integration ISSTA’17, July 2017, Santa Barbara, CA, USA

cases is not explicit in the available data of our context. Whereas


t .durationi is the actual duration and only available after executing Agent
States: Actions:
the test case, t .duration is a simple over-approximation of previous Prioritized
Test Suite reward
durations and can be used for planning purposes. Ti ri Test Cases
Finally, we define qi (t) as a performance estimation of a test TSi
ri+1
case in the given cycle i. By performance, we mean an estimate Environment:
Ti+1
of its efficiency to detect failures. The performance Q i of a test CI Cycle
suite {t 1 , . . . , tn } can be estimated with any cumulative function
(e.g., sum, max, average, etc.) over qi (t 1 ), . . . qi (tn ), e.g., Q i (T S i ) =
1 Í Figure 1: Interaction of Agent and Environment (adapted
| T S | t ∈ T S i q(t).
i from [36, Fig 3.1])

2.2 Problem Formulation 3 THE RETECS METHOD


This section introduces our approach to the ATCS problem using
The goal of any test case prioritization algorithm is to find an op-
reinforcement learning (RL), called Reinforced Test Case Selection
timal ordered sequence of test cases that reveal failures as early
(Retecs). It starts by describing how RL is applied to test case
as possible in the regression testing process. Formally speaking,
prioritization and selection (section 3.1), then discusses test case
following and adapting the notations proposed by Rothermel et al.
scheduling in one CI cycle (section 3.2). Finally, integration of the
in [32]: Test Case Prioritization Problem (TCP)
method within a CI process is presented (section 3.3).
Let T S i be a test suite, and PT be the set of all possible permuta-
tions of T S i , let Q i be the performance, then TCP aims at finding
T S ′i a permutation of T S i , such that Q i (T S ′i ) is maximized. 3.1 Reinforcement Learning for Test Case
Said otherwise, TCP aims at finding T S ′i such that ∀ T S i ∈ PT : Prioritization
Q i (T S ′i ) ≥ Q i (T S i ) . Although it is fundamental, this problem In this section, we describe the main elements of reinforcement
formulation does not capture the notion of a time limit for exe- learning in the context of test case prioritization and selection. If
cuting the test suite. Time-limited Test Case Prioritization extends necessary, a more in-depth introduction can be found in [36]. We
the TCP problem by limiting the available time for execution. As a apply RL as a model-free and online learning method for the ATCS
consequence, not all the test cases may be executed when there is a problem. Each test case is prioritized individually and after all test
time-contract. Note that other resources (than time) can constrain cases have been prioritized, a schedule is created from the most
the test case selection process, too. However, the formulation given important test cases, and afterwards executed and evaluated.
below can be adapted without any loss of generality. Model-free means the method has no initial concept of the envi-
Time-limited Test Case Prioritization Problem (TTCP) ronment’s dynamics and how its actions affect it. This is appropriate
Let M be the maximum time available for test suite execution, for test case prioritization and selection, as there is no strict model
then TTCP aims at finding a test suite T S i , such that Q i (T S i ) behind the existence of failures within the software system and
is maximized and the total duration of execution of T S i is less their detection.
than M. Said otherwise, TTCP aims at finding T S i such that Online learning describes a method constantly learning during
∀ T S ′i ∈ PT : Q i (T S i ) ≥ Q i (T S ′i ) ∧ tk ∈ T S′ i tk .duration ≤
Í
its runtime. This is also appropriate for software testing, where
M ∧ tk ∈ T S i tk .duration ≤ M.
Í
indicators for failing test cases can change over time according to
Still the problem formulation given above does not take into the focus of development or variations in the test suite. Therefore
account the history of test suite execution. In case the links between it is necessary to continuously adapt the prioritization method for
code changes and test cases are not available as discussed in the test cases.
introduction, history-based test case prioritization can be used. In RL, an agent interacts with its environment by perceiving
The final problem formulation given below corresponds to the its state and selecting an appropriate action, either from a learned
problem addressed in this paper and for which a solution based on policy or by random exploration of possible actions. As a result,
reinforcement learning is proposed. In a CI process, TTCP has to the agent receives feedback in terms of rewards, which rate the
be solved in every cycle, but under the additional availability of performance of its previous action.
historical information as a basis for test case prioritization. Adaptive Figure 1 illustrates the links between RL and test case prioritiza-
Test Case Selection Problem (ATCS) tion. A state represents a single test case’s metadata, consisting of
Let T S 1 , . . . , T S i−1 be a sequence of previously executed test the test case’s approximated duration, the time it was last executed
suites, then the Adaptive Test Case Selection Problem aims at finding and previous test execution results. As an action the test case’s
T S i , so Q i (T S i ) is maximized and t ∈ T S i t .duration ≤ M.
Í
priority for the current CI cycle is returned. After all test cases in a
We see that ATCS is an optimization problem which gathers test suite are prioritized, the prioritized test suite is scheduled, in-
the idea of time-constrained test case prioritization, selection and cluding a selection of the most important test cases, and submitted
performance evaluation, without requesting more information than for execution. With the test execution results, i.e., the test verdicts,
previous test execution results in CI. a reward is calculated and fed back to the agent. From this reward,
the agent adapts its experience and policy for future actions. In case

14
ISSTA’17, July 2017, Santa Barbara, CA, USA Helge Spieker, Arnaud Gotlieb, Dusica Marijan, and Morten Mossige

of positive rewards previous behavior is encouraged, i.e. reinforced, In the first reward function (1) all test cases, both scheduled
while in case of negative rewards it is discouraged. and unscheduled, receive the number of failed test cases in the
Test verdicts of previous executions have shown to be useful schedule as a reward. It is a basic, but intuitive reward function
to reveal future failures [16]. This raises the question how long directly rewarding the RL agent on the goal of maximizing the
the history of test verdicts should be for a reliable indication. In number of failed test cases. The reward function acknowledges
general, a long history provides more information and allows better the prioritized test suite in total, including positive feedback on
knowledge of the failure distribution of the system under test, but low priorities for test cases regarded as unimportant. This risks
it also requires processing more data which might have become encouraging low priorities for test cases which would have failed if
irrelevant with previous upgrades of the system as the previously executed, and could encourage undesired behavior, but at the same
error-prone feature got more stable. To consider this, the agent time it strengthens the influence all priorities in the test suite have.
has to learn how to time-weight previous test verdicts, which adds
Definition 3.2. Test Case Failure Reward
further complexity to the learning process. How the history length
affects the performance of our method, is experimentally evaluated (
tcf ail 1 − t .verdicti if t ∈ T S i
in Section 4.2.2. rewardi (t) = (2)
Of further importance for RL applications are the agent’s policy, 0 otherwise
i.e. the way it decides on actions, the memory representation, i.e. The second reward function (2) returns the test case’s verdict as
how it stores its experience and policy, and the reward function to each test case’s individual reward. Scheduling failing test cases is
provide feedback for adaptation and policy improvement. intended and therefore reinforced. If a test case passed, no specific
In the following, we will discuss these components and their reward is given as including it neither improved nor reduced the
relevance for Retecs. schedule’s quality according to available information. Still, the order
of test cases is not explicitly included in the reward. It is implicitly
3.1.1 Reward Functions. Within the ATCS problem, a good test
included by encouraging the agent to focus on failing test cases
schedule is defined by the goals of test case selection and prior-
and prioritizing them higher. For the proposed scheduling method
itization. It contains those test cases which lead to detection of
(section 3.2) this automatically leads to an earlier execution.
failures and executes them early to minimize feedback time. The
reward function should reflect these goals and thereby domain Definition 3.3. Time-ranked Reward
knowledge to steer the agent’s behavior [20]. Referring to the defi-
f ail
Õ
nition of ATCS, the reward function implements Q i and evaluates rewardit ime (t) = |T S i | − t .verdicti × 1 (3)
the performance of a test schedule. f ail
t k ∈T S i ∧
Ideally, feedback should be based on common metrics used in r ank (t )<r ank (t k )
test case prioritization and selection, e.g. NAPFD (presented in
The third reward function (3) explicitly includes the order of
section 4.1). However, these metrics require knowledge about the
test cases and rewards each test case based on its rank in the test
total number of faults in the system under test or full information on
schedule and whether it failed. As a good schedule executes failing
test case verdicts, even for non-executed test cases. In a CI setting,
test cases early, every passed test case reduces the schedule’s quality
test case verdicts exist only for executed test cases and information
if it precedes a failing test case. Each test cases is rewarded by the
about missed failures is not available. It is impossible to teach the
total number of failed test cases, for failed test cases it is the same
RL agent about test cases which should have been included, but
as reward function (1). For passed test cases, the reward is further
only to reinforce actions having shown positive effects. Therefore,
decreased by the number of failed test cases ranked after the passed
in Retecs, rewards are either zero or positive, because we cannot
test case to penalize scheduling passing test cases early.
automatically detect negative behavior.
In order to teach the agent about both the goal of a task and the 3.1.2 Action Selection: Prioritizing Test Cases. Action selection
way to approach this goal the reward, two types of reward functions describes how the RL agent processes a test case and decides on
can be distinguished. Either a single reward value is given for the a priority for it by using the policy. The policy is a function from
whole test schedule, or, more specifically, one reward value per the set of states, i.e., test cases in our context, to the set of actions,
individual test case. The former rewards the decisions on all test i.e., how important each test case is for the current schedule, and
cases as a group, but the agent does not receive feedback how describes how the agent interacts with its execution environment.
helpful each particular test case was to detect failures. The latter The policy function is an approximation of the optimal policy. In
resolves this issue by providing more specific feedback, but risks to the beginning it is a loose approximation, but over time and by
neglect the prioritization strategy of different priorities for different gathering experience it adapts towards an optimal policy.
test cases for the complete schedule as a whole. The agent selects those actions from the policy which were most
Throughout the presentation and evaluation of this paper, we rewarding before. It relies on its learned experience on good actions
will consider three reward functions. for the current state. Because the agent initially has no concept
of its actions’ effects, it explores the environment by choosing
Definition 3.1. Failure Count Reward random actions and observing received rewards on these actions.
How often random actions are selected instead of consulting the
f ail f ail
policy, is controlled by the exploration rate, a parameter which
rewardi (t) = |T S i | (∀ t ∈ Ti ) (1) usually decreases over time. In the beginning of the process, a

15
Reinforcement Learning for Automatic Test Case Prioritization and Selection
in Continuous Integration ISSTA’17, July 2017, Santa Barbara, CA, USA

high exploration rate encourages experimenting, whereas at a later differently. Previously encountered experiences are stored and re-
time exploration is reduced and the agent more strongly relies on visited during training phase to achieve repeated learning impulses,
its learned policy. Still, exploration is not disabled, because the which is called experience replay [18]. When rewards are received,
agent interacts in a dynamic environment, where the effects of each experience, consisting of a test case, action and reward, is
certain actions change and where it is necessary to continuously stored in a separate replay memory with limited capacity. If the
adapt the policy. Action selection and the effect of exploration are replay memory capacity is reached, oldest experiences get replaced
also influenced by non-stationary rewards, meaning that the same first. During training, a batch of experiences is randomly sampled
action for the same test case does not always yield the same reward. from this memory and used for training the ANN via backpropaga-
Test cases which are likely to fail, based on previous experiences, tion with stochastic gradient descent [44].
do not fail when the software is bug-free, although their failure
would be expected. The existence of non-stationary rewards has 3.2 Scheduling
motivated our selection of an online-learning approach, which Test cases are scheduled under consideration of their priority, their
enables continuous adaptation and should tolerate their occurence. duration and a time limit. The scheduling method is a modular
aspect within Retecs and can be selected depending on the envi-
3.1.3 Memory Representation. As noted above, the policy is an ronment, e.g. considering execution constraints or scheduling onto
approximated function from a state (a test case) to an action (a multiple test agents. As an only requirement it has to maximize the
priority). There exist a wide variety of function approximators in total priority within the schedule. For example, in an environment
literature, but for our context we focus on two main approximators. with only a single test agent and no further constraints, test cases
The first function approximator is the tableau representation [36]. can be selected by descending priority (ties broken randomly) until
It consists of two tables to track seen states and selected actions. In the time limit is reached.
one table it is counted how often each distinct action was chosen
per state. The other table stores the average received reward for 3.3 Integration within a CI Process
these actions. The policy is then to choose that action with highest In a typical CI process (as shown in Figure 2), a set of test cases is
expected reward for the current state, which can be directly read first prioritized and based on the prioritization a subset of test cases
from the table. When receiving rewards, cells for each rewarded is selected and scheduled onto the testing system(s) for execution.
combination of states and actions are updated by increasing the The Retecs method fits into this scheme by providing the Prior-
counter and calculating the running average of received rewards. itization and Selection & Scheduling steps. It extends the CI process
As an exploration method to select random actions, ϵ-greedy by requiring an additional feedback channel to receive test results
exploration is used. With probability (1 − ϵ) the most promising after each cycle, which is the same or part of the information also
action according to the policy is selected, otherwise a random action provided as developer feedback.
is selected for exploration.
Albeit a straightforward representation, the tableau also restricts 4 EXPERIMENTAL EVALUATION
the agent. States and actions have to be discrete sets of limited size
as each state/action pair is stored separately. Furthermore, with In this section we present an experimental evaluation of the Retecs
many possible states and actions, the policy approximation takes method. During the first part, an overview of evaluation metrics
longer to converge towards an optimal policy as more experiences (section 4.1) is given before the experimental setup is introduced
are necessary for the training. However, for the presented problem (section 4.2). In section 4.3 we present and discuss the experimental
and its number of possible states a tableau is still applicable and results. A discussion of possible threats (section 4.4) and extensions
considered for evaluation. (section 4.5) to our work close the evaluation.
Overcoming the limitations of the tableau, artificial neural net- Within the evaluation of the Retecs method we investigate if
works (ANN) are commonly used function approximators [37]. it can be successfully applied towards the ATCS problem. Initially,
ANNs can approximate functions with continuous states and ac- before evaluating the method on our research questions, we explore
tions and are easier to scale to larger state spaces. The downside of how different parameter choices affect the performance of our
using ANNs are more complex configuration and higher training ef- method.
forts than for the tableau. In the context of Retecs, an ANN receives RQ1 Is the Retecs method effective to prioritize and select test
a state as input to the network and outputs a single continuous cases? We evaluate combinations of memory representa-
action, which directly resembles the test case’s priority. tions and reward functions on three industrial data sets.
Exploration is different when using ANNs, too. Because a con- RQ2 Can the lightweight and model-free Retecs method priori-
tinuous action is used, ϵ-greedy exploration is not possible. Instead, tize test cases comparable to deterministic, domain-specific
exploration is achieved by adding a random value drawn from a methods? We compare Retecs against three comparison
Gaussian distribution to the policy’s suggested action. The variance methods, one random prioritization strategy and to basic
of the distribution is given by the exploration rate and a higher rate deterministic methods.
allows for higher deviations from the policy’s actions. The lower
the exploration rate is, the closer the action is to the learned policy. 4.1 Evaluation Metric
Whereas the agent with tableau representation processes each In order to compare the performance of different methods, eval-
experience and reward once, an ANN-based agent can be trained uation metrics are required as a common performance indicator.

16
ISSTA’17, July 2017, Santa Barbara, CA, USA Helge Spieker, Arnaud Gotlieb, Dusica Marijan, and Morten Mossige

Reinforcement
Learning Policy Evaluation

Prioritized Selection &


Test Cases Prioritization Test Schedule Test Execution
Test Cases Scheduling

Developer
Feedback

Figure 2: Testing in CI process: RETECS uses test execution results for learning test case prioritization (solid boxes: Included
in RETECS, dashed boxes: Interfaces to the CI environment)

Following, we introduce Normalized Average Percentage of Faults corresponds to a weighted sum with equal weights and is thereby
Detected as the applied evaluation metric. a naive version of Retecs without adaptation. Although the three
comparison methods are basic approaches to test case prioritization,
Definition 4.1. Normalized APFD
they utilize the same information as provided to our method, and
Õ are likely to be encountered in industrial environments.
rank(t) Due to the online learning properties and the dependence on
f ail
t ∈ T Si p previous test suite results, evaluation is done by comparing the
N APF D(T S i ) = p − + NAPFD metrics for all subsequent CI cycles of a data set over time.
f ail 2 × |T S i |
|T S i | × |T S i | To account for the influence of randomness within the experimental
f ail evaluation, all experiments are repeated 30 times and reported
|T S i |
with p = results show the mean, if not stated otherwise.
t ot al,f ail
|T S i | Retecs1 is implemented in Python [38] using scikit-learn’s im-
plementation of artificial neural networks [26].
Average Percentage of Faults Detected (APFD) was introduced 4.2.1 Industrial Data Sets. To determine real-world applicability,
in [31] to measure the effectiveness of test case prioritization tech- industrial data sets from ABB Robotics Norway2 , Paint Control and
niques. It measures the quality via the ranks of failure-detecting test IOF/ROL, for testing complex industrial robots, and Google Shared
cases in the test execution order. As it assumes all detectable faults Dataset of Test Suite Results (GSDTSR) [11] are used.3 These data
get detected, APFD is designed for test case prioritization tasks sets consist of historical information about test executions and their
without selecting a subset of test cases. Normalized APFD (NAPFD) verdicts and each contain data for over 300 CI cycles.
[28] is an extension of APFD to include the ratio between detected Table 1 gives an overview of the data sets’ structure. Both ABB
and detectable failures within the test suite, and is thereby suited data sets are split into daily intervals, whereas GSDTSR is split
for test case selection tasks when not all test cases are executed and into hourly intervals as it originally provides log data of 16 days,
failures can be undetected. If all faults are detected (p = 1), NAPFD which is too short for our evaluation. Still, the average test suite
is equal to the original APFD formulation. size per CI cycle in GSDTSR exceeds that in the ABB data sets while
having fewer failed test executions. For applying Retecs constant
4.2 Experimental Setup durations between each CI cycle are not required.
Two RL agents are evaluated in the experiments. First uses a tableau For the CI cycle’s time limit, which is not present in the data sets,
representation of discrete states and a fixed number of actions, a fixed percentage of 50% of the required time is used. A relative
named Tableau-based agent. And a second, Network-based agent time limit allows better comparison of results between data sets
uses an artificial neural network as memory representation for and keeps the difficulty at each CI cycle on a comparable level. How
continuous states and a continuous action. The reward function of this percentage affects the results is evaluated in section 4.3.3.
each agent is not fixed, but varied throughout the experiments.
Test cases are scheduled on a single test agent in descending 4.2.2 Parameter Selection. A couple of parameters allow adjust-
order of priority until the time limit is reached. ing the method towards specific environments. For the experimental
To evaluate the efficiency of the Retecs method, we compare it evaluation the same set of parameters is used in all experiments, if
to three basic test case prioritization methods. First is random test not stated otherwise. These parameters are based on values from
case prioritization as a baseline method, referred to as Random. The literature and experimental exploration.
other two methods are deterministic. As a second method, named Table 2 gives an overview of the chosen parameters. The number
Sorting, test cases are sorted by their recent verdicts with recently of actions for the Tableau-based agent is set to 25. Preliminary tests
failed test cases having higher priority. For the third comparison showed a larger number of actions did not substantially increase
method, labeled as Weighting, the priority is calculated by a sum 1 Implementation available at https://bitbucket.org/helges/retecs
of the test case’s features as they are used as an input to the RL 2 Website: http://new.abb.com/products/robotics
3 Data Sets available at https://bitbucket.org/helges/atcs-data
agent. Weighting considers the same information as Retecs and

17
Reinforcement Learning for Automatic Test Case Prioritization and Selection
in Continuous Integration ISSTA’17, July 2017, Santa Barbara, CA, USA
Table 1: Industrial Data Sets Overview: All columns show the 4.3 Results
total amount of data in the data set
4.3.1 RQ1: Learning Process & Effectiveness. Figure 4 shows the
performance of Tableau- and Network-based agents with different
Data Set Test Cases CI Cycles Verdicts Failed reward functions on three industrial data sets. Each column shows
Paint Control 114 312 25,594 19.36% results for one data set, each row for a particular reward function.
IOF/ROL 2,086 320 30,319 28.43% It is visible that the combination of memory representation and
GSDTSR 5,555 336 1,260,617 0.25% reward function strongly influences the performance. In some cases
it does not support the learning process and the performance stays
at the initial level or even declines. Some combinations enable the
Table 2: Parameter Overview agent to learn which test cases to prioritize higher or lower and to
create meaningful test schedules.
RL Agent Parameter Value Performance on all data sets is best for the Network-based agent
with the Test Case Failure reward function. It benefits from the
All CI cycle’s time limit M 50% × Ti .duration specific feedback for each test case and learns which test cases are
History Length 4 likely to fail. Because the Network-based agent prioritizes test cases
Tableau Number of Actions 25 with continuous actions, it adapts more easily than the Tableau-
Exploration Rate ϵ 0.2 based agent, where only specific actions are rewarded and rewards
for one action do not influence close other actions.
Network Hidden Nodes 12
In all results a similar pattern should be visible. Initially, the agent
Replay Memory 10000
has no concept of the environment and cannot identify failing test
Replay Batch Size 1000
cases, leading to a poor performance. After a few cycles it received
enough feedback by the reward function to make better choices and
100 successively improves. However, this is not true for all combinations
of memory representation and reward function. One example is the
90
% of best result

combination of Network-based agent and Test Case Failure reward.


On Paint Control, the performance at early CI cycles is superior to
80 the Tableau-based agent, but it steadily declines due to misleading
feedback from the reward function.
70 Network
One general observation are performance fluctuations over time.
These fluctuations are correlated to noise in the industrial data sets,
Tableau
where failures in the system occur for different reasons and are
60
2 3 4 5 6 7 8 9 10 15 25 50 hard to predict. For example, in the Paint Control data set between
200 and 250 cycles a performance drop is visible. For these cycles a
History Length
larger number of test cases were repeatedly added to the test suite
Figure 3: Relative performance of different history lengths. manually. A large part of these test cases failed, which put additional
A longer history can reduce the performance due to more difficulty on the task. However, as the test suite was manually
complex information. (Data set: ABB Paint Control) adjusted, from a practical perspective it is arguable whether a fully
automated prioritization technique is feasible during these cycles.
In GSDTSR only few failed test cases occur in comparison to
the performance. Similar tests were conducted for the ANN’s size, the high number of successful executions. This makes it harder
including variations on the number of layers and hidden nodes, for the learning agent to discover a feasible prioritization strategy.
but a network larger than a single layer with 12 nodes did not Nevertheless, as the results show, it is possible for the Network-
significantly improve performance. based agent to create effective schedules in a high number of CI
The effect of different history lengths is evaluated experimen- cycles, albeit with occasional performance drops.
tally on the Paint Control data set. As Figure 3 shows, does a longer Regarding RQ1, we conclude that it is possible to apply Retecs
history not necessarily correspond to better performance. From on the ATCS problem. In particular, the combination of memory
an application perspective we interpret the most recent results to representation and reward function strongly influences the per-
also be the most relevant results. Many historical failures indicate formance of the agent. We found both Network-based agent and
a relevant test case better than many passes, but individual consid- Test Case Failure Reward, as well as Tableau-based agent with
eration of each of these results on their own is unlikely to lead to Time-ranked Reward, to be suitable combinations, with the former
better conclusions of future verdicts. From a technical perspective, delivering an overall better performance. The Failure Count Reward
this is supported by the fact, that a longer history increases the function does not support the learning processes of the two agents.
state space of possible test case representations. A larger state space Providing only a single reward value without further distinction is
is in both memory representations related to a higher complexity not helping the agents towards an effective prioritization strategy.
and requires generally more data to adapt, because the agent has It is better to reward each test case’s priority individually according
to learn to handle earlier execution results differently than more to its contribution to the previous schedule.
recent ones, for example by weighting or aggregating them.

18
ISSTA’17, July 2017, Santa Barbara, CA, USA Helge Spieker, Arnaud Gotlieb, Dusica Marijan, and Morten Mossige

ABB Paint Control ABB IOF/ROL GSDTSR


(a) Failure Count Reward
1.0 Network Tableau
0.8
NAPFD

0.6

0.4

0.2

0.0

(b) Test Case Failure Reward


1.0

0.8
NAPFD

0.6

0.4

0.2

0.0

(c) Time-ranked Reward


1.0

0.8
NAPFD

0.6

0.4

0.2

0.0
60 120 180 240 300 60 120 180 240 300 0 60 120 180 240 300
CI Cycle CI Cycle CI Cycle
Figure 4: Comparison of reward functions and memory representations: A Network-based agent with Test Case Failure reward
delivers best performance on all three data sets (Black lines indicate trend over time)

4.3.2 RQ2: Comparison to Other Methods. Where the experi- methods are not able to correctly prioritize failing test cases higher,
ments on RQ1 focus on the performances of different component as the small performance gap indicates.
combinations, is the focus of RQ2 towards comparing the best- For GSDTSR, Retecs is performing overall comparable with an
performing Network-based RL agent (with Test Case Failure re- NAPFD difference up to 0.2. Due to the few failures within the
ward) with other test case prioritization methods. Figure 5 shows data set, the exploration phase does not impact the performance in
the results of the comparison against the three methods on each of the early cycles as strongly as for the other two data sets. Also, it
the three data sets. A comparison is made for every 30 CI cycles on appears as if the indicators for failing test cases are not as correlated
the difference of the average NAPFD values of each cycle. Positive to the previous test execution results as they were in the other data
differences show better performance by the comparison method, a sets, which is visible from the comparatively low performance of
negative difference shows better performance by Retecs. the deterministic methods.
During early CI cycles, the deterministic comparison methods In summary, the results for RQ2 show, that Retecs can, starting
show mostly better performance. This corresponds to the initial from a model-free memory without initial knowledge about test
exploration phase, where Retecs adapts to its environment. After case prioritization, in around 60 cycles, which corresponds to two
approximately 60 CI cycles, for Paint Control, it is able to prioritize month for daily intervals, learn to effectively prioritize test cases.
with similar or better performance than the comparison methods. Its performance compares to that of basic deterministic test case
Similar results are visible on the other two data sets, with a longer prioritization methods. For CI, this means that Retecs is a promis-
adaptation phase but less performance differences on IOF/ROL and ing method for test case prioritization which adapts to environment
an early comparable performance on GSDTSR. specific indication of system failures.
For IOF/ROL, where the previous evaluation (see Figure 4) showed
lower performance compared to Paint Control, also the comparison 4.3.3 Internal Evaluation: Schedule Time Influence. In the ex-
perimental setup, the time limit for each CI cycle’s reduced test

19
Reinforcement Learning for Automatic Test Case Prioritization and Selection
in Continuous Integration ISSTA’17, July 2017, Santa Barbara, CA, USA
ABB Paint Control ABB IOF/ROL Google GSDTSR
0.8
0.6 Sorting
NAPFD Difference

Weighting
0.4
Random
0.2
0.0
-0.2
-0.4
-0.6
60 120 180 240 300 60 120 180 240 300 60 120 180 240 300
CI Cycle CI Cycle CI Cycle
Figure 5: Performance difference between network-based agent and comparison methods: After an initial exploration phase
RETECS adapts to competitive performance. Each group of bars compares 30 CI cycles.

Network Weighting Another threat is related to the existence of faults within our
Tableau Random implementation. We approached this threat by applying established
Sorting components, such as scikit-learn, within our software where ap-
100
propriate. Furthermore, our implementation is available online for
inspection and reproduction of experiments.
% of best result

80 Finally, many machine learning algorithms are sensible to their


parameters and a feasible parameter set for one problem environ-
ment might not work for as well for different one. During our
60 experiments, the initially selected parameters were not changed
for different problems to allow better comparison. In a real-world
40 setting, those parameters can be adjusted to tune the approach for
10 20 30 40 50 60 70 80 90 the specific environment.
Scheduling Time Ratio (in % of Ti .duration) External. Our evaluation is based on data from three industrial
data sets, which is a limitation regarding the wide variety of CI envi-
Figure 6: Relative performance under different time lim-
ronments and failure distributions. One of these data sets is publicly
its. Shorter scheduling times reduce the information for re-
available, but according to our knowledge it has only been used in
wards and delay learning. The performance differences for
one publication and a different setting [12]. From what we have an-
Network and Tableau also arise from the initial exploration
alyzed, there are no further public data sets available which include
phase, as shown in Figure 5 (Data set: ABB Paint Control).
the required data, especially test verdicts over time. This threat
has to be addressed by additional experiments in different settings
schedule is set to 50% of the execution time of the overall test suite once further data is accessible. To improve the data availability, we
Ti . To see how this choice influences the results and how it affects publish the other two data sets used in our experiments.
the learning process, an additional experiment is conducted with Construct. A threats to construct validity is the assumption, that
varying scheduling time ratios. each failed test cases indicates a different failure in the system
Figure 6 shows the results on the Paint Control data set. The under test. This is not always true. One test case can fail due to
NAPFD result is averaged over all CI cycles, which explains the multiple failures in the system and one failure can lead to multiple
overall better performance by the comparison methods due to an failing test cases. Based on the abstraction level of our method, this
initial learning period. As it is expected, performance decreases information is not easily available. Nevertheless, our approach tries
with lower time limits for all methods. However, for RL agents a de- to find all failing test cases and thereby indirectly also all detectable
creased scheduling time directly decreases available information for failures. To address the threat, we propose to include failure causes
learning as fewer test cases can be executed and fewer actions can as input features in future work.
meaningfully be rewarded, resulting in a slower learning process. Further regarding the input features, our proposed method uses
Nevertheless, the decrease in performance is not directly propor- only few test case metadata to prioritize test cases and to reason
tional to the decrease in scheduling time, a sign that Retecs learns about their importance for the test schedule. In practical environ-
at some point how to prioritize test cases even though the amount ments, more information about test cases or the system under test
of data in previous cycles was limited. is available and should be utilized.
We compared our method to baseline approaches, but we have
4.4 Threats to Validity not considered additional techniques. Although further methods
exist in literature, they do not report results on comparable data
Internal. The first threat to internal validity is the influence of
sets or would need adjustment for our CI setting.
random decisions on the results. To mitigate the threat, we repeated
our experiments 30 times and report averaged results.

20
ISSTA’17, July 2017, Santa Barbara, CA, USA Helge Spieker, Arnaud Gotlieb, Dusica Marijan, and Morten Mossige

4.5 Extensions use machine learning and multiple heuristic techniques to priori-
The presented results give perspectives to extensions from two tize test cases in an industrial setting. By combining various data
angles. First perspective is on the technical RL approach. Through a sources and learning to rank in an agnostic way, this work makes a
pre-training phase the agent can internalize test case prioritization strong step into the definition of a general framework to automati-
knowledge before actually prioritizing test cases and thereby im- cally learn to rank test cases. Our approach, only based on RL and
prove the initial performance. This can be approached by imitation ANN, takes another direction by providing a lightweight learning
of other methods [1], e.g. deterministic methods with desirable method using one source of data, namely test case failure history.
behavior, or by using historical data before it is introduced in the CI Chen et al. [6] uses semi-supervised clustering for regression test
process [30]. The second perspective focuses on the domain-specific selection. The downside of such an approach may be higher compu-
approach of test case prioritization and selection. Here, only few tational complexity. Other approaches include active learning for
metadata of a test case and its history is facilitated. The number test classification [3], combining machine learning and program
of features of a test case should be extended to allow better rea- slicing for regression test case prioritization [41], learning agent-
soning of expected failures, e.g. links between source code changes based test case prioritization [2], or clustering approaches [5]. RL
and relevant test cases. By including failure causes, scheduling of has been previously used in combination with adaptation-based
redundant test cases can be avoided and the effectiveness improved. programming (ABP) for automated testing of software APIs, where
Furthermore, this work used a linear scheduling model, but in the combination of RL and ABP successively selects calls to the API
industrial environments more complex environments are encoun- with the goal to increase test coverage, by Groce et al. [15]. Fur-
tered, e.g. multiple systems for test executions or additional con- thermore, Reichstaller et al. [29] apply RL to generate test cases for
straints on test execution besides time limits. Another extension risk-based interoperability testing. Based on a model of the system
of this work is therefore to integrate different scheduling methods under test, RL agents are trained to interact in an error-provoking
under consideration of prioritization information and integration way, i.e. they are encouraged to exploit possible interactions be-
into the learning process [27]. tween components. Veanes et al. use RL for online formal testing
of communication systems [39]. Based on the idea to see testing as
a two-player game, RL is used to strengthen the tester’s behavior
5 RELATED WORK when system and test cases are modeled as Input-Output Labeled
Test case prioritization and selection for regression testing: Transition Systems. While this approach is appealing, Retecs ap-
Previous work focuses on optimizing regression testing based on plies RL for a completely different purpose, namely test case pri-
mainly three aspects: cost, coverage, and fault detection, or their oritization and selection. Our approach aims at CI environments,
combinations. In [21] authors propose an approach for test case which are characterized by strict time and effort constraints.
selection and prioritization using the combination of Integer Linear
Programming (ILP) and greedy methods by optimizing multiple
criteria. Another study investigates coverage-based regression test- 6 CONCLUSION
ing [9], using four common prioritization techniques: a test selec-
We presented Retecs, a novel lightweight method for test case
tion technique, a test suite minimization technique and a hybrid
prioritization and selection in Continuous Integration, combining
approach that combines selection and minimization. Similar ap-
reinforcement learning methods and historical test information.
proaches have been proposed using search-based algorithms [7, 42],
Retecs is adaptive and learns important indicators for failing test
including swarm optimization [8] and ant colony optimization [22].
cases during its runtime by observing test cases, test results, and
Walcott et al. use genetic algorithms for time-aware regression
its own actions and their effects.
test suite prioritization for frequent code rebuilding [40]. Simi-
Evaluation results show fast learning and adaptation of Retecs
larly, Zhang et al. propose time-aware prioritization using ILP [43].
in three industrial case studies. An effective prioritization strategy
Strandberg et al. [35] apply a novel prioritization method with
is discovered with a performance comparable to basic deterministic
multiple factors in a real-world embedded software and show the
prioritization methods after an initial learning phase of approxi-
improvement over industry practice. Other regression test selec-
mately 60 CI cycles without previous training on test case prioriti-
tion techniques have been proposed based on historical test data
zation. Necessary domain knowledge is only reflected in a reward
[16, 19, 23, 25], code dependencies [14], or information retrieval
function to evaluate previous schedules. The method is model-free,
[17, 33]. Despite various approaches to test optimization for regres-
language-agnostic and requires no source code or program access.
sion testing, the challenge of applying most of them in practice
It only requires test metadata, namely historical results, durations
lies in their complexity and the computational overhead typically
and last execution times. However, we expect additional metadata
required to collect and analyze different test parameters needed
to enhance the method’s performance.
for prioritization, such as age, test coverage, etc. By contrast, our
In our evaluation we compare different variants of RL agents
approach based on RL is a lightweight method, which only uses
for the ATCS problem. Agents based on artificial neural networks
historical results and its experience from previous CI cycles. Fur-
have shown to be best performing, especially when trained with
thermore, Retecs is adaptive and suited for dynamic environments
test case-individual reward functions. While we applied only small
with frequent changes in code and testing, and evolving test suites.
networks in this work, with extended available data amounts, an
Machine learning for software testing: Machine learning al-
extension towards larger networks and deep learning techniques
gorithms receive increasing attention in the context of software
can be a promising path for future research.
testing. The work closest to ours is [4], where Busjaeger and Xie

21
Reinforcement Learning for Automatic Test Case Prioritization and Selection
in Continuous Integration ISSTA’17, July 2017, Santa Barbara, CA, USA

REFERENCES [22] T Noguchi, H Washizaki, Y Fukazawa, A Sato, and K Ota. 2015. History-Based
[1] Pieter Abbeel and Andrew Y Ng. 2004. Apprenticeship learning via inverse Test Case Prioritization for Black Box Testing Using Ant Colony Optimization.
reinforcement learning. Proceedings of the 21st International Conference on In 2015 IEEE 8th International Conference on Software Testing, Verification and
Machine Learning (ICML) (2004), 1–8. https://doi.org/10.1145/1015330.1015430 Validation (ICST). 1–2. https://doi.org/10.1109/ICST.2015.7102622
arXiv:1206.5264 [23] Tanzeem Bin Noor and Hadi Hemmati. 2015. A similarity-based approach for
[2] Sebastian Abele and Peter Göhner. 2014. Improving Proceeding Test Case Priori- test case prioritization using historical failure data. 2015 IEEE 26th International
tization with Learning Software Agents. In Proceedings of the 6th International Symposium on Software Reliability Engineering (ISSRE) (2015), 58—-68.
Conference on Agents and Artificial Intelligence - Volume 2 (ICAART). 293–298. [24] A Orso and G Rothermel. 2014. Software Testing: a Research Travelogue (2000–
[3] James F Bowring, James M Rehg, and Mary Jean Harrold. 2004. Active Learning 2014). In Proceedings of the on Future of Software Engineering. ACM, Hyderabad,
for Automatic Classification of Software Behavior. In Proceedings of the 2004 ACM India, 117–132.
SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’04). [25] H Park, H Ryu, and J Baik. 2008. Historical Value-Based Approach for Cost-
ACM, New York, NY, USA, 195–205. https://doi.org/10.1145/1007512.1007539 Cognizant Test Case Prioritization to Improve the Effectiveness of Regression
[4] Benjamin Busjaeger and Tao Xie. 2016. Learning for Test Prioritization: An Testing. In 2008 Second International Conference on Secure System Integration and
Industrial Case Study. In Proceedings of the 2016 24th ACM SIGSOFT International Reliability Improvement. 39–46. https://doi.org/10.1109/SSIRI.2008.52
Symposium on Foundations of Software Engineering. ACM, New York, NY, USA, [26] F Pedregosa, G Varoquaux, A Gramfort, V Michel, B Thirion, O Grisel, M Blondel,
975–980. https://doi.org/10.1145/2950290.2983954 P Prettenhofer, R Weiss, V Dubourg, J Vanderplas, A Passos, D Cournapeau, M
[5] G Chaurasia, S Agarwal, and S S Gautam. 2015. Clustering based novel test case Brucher, M Perrot, and E Duchesnay. 2011. Scikit-learn: Machine Learning in
prioritization technique. In 2015 IEEE Students Conference on Engineering and {P}ython. Journal of Machine Learning Research 12 (2011), 2825–2830.
Systems (SCES). IEEE, 1–5. https://doi.org/10.1109/SCES.2015.7506447 [27] Bo Qu, Changhai Nie, and Baowen Xu. 2008. Test case prioritization for multiple
[6] S Chen, Z Chen, Z Zhao, B Xu, and Y Feng. 2011. Using semi-supervised clustering processing queues. 2008 International Symposium on Information Science and
to improve regression test selection techniques. In 2011 Fourth IEEE International Engineering (ISISE) 2 (2008), 646–649. https://doi.org/10.1109/ISISE.2008.106
Conference on Software Testing, Verification and Validation. IEEE, 1–10. https: [28] Xiao Qu, Myra B. Cohen, and Katherine M. Woolf. 2007. Combinatorial interac-
//doi.org/10.1109/ICST.2011.38 tion regression testing: A study of test case generation and prioritization. In IEEE
[7] Luciano S de Souza, Pericles BC de Miranda, Ricardo BC Prudencio, and Flavia International Conference on Software Maintenance, 2007 (ICSM). IEEE, 255–264.
de A Barros. 2011. A Multi-objective Particle Swarm Optimization for Test Case [29] Andre André Reichstaller, Benedikt Eberhardinger, Alexander Knapp, Wolfgang
Selection Based on Functional Requirements Coverage and Execution Effort. In Reif, and Marcel Gehlen. 2010. Risk-Based Interoperability Testing Using Rein-
2011 IEEE 23rd International Conference on Tools with Artificial Intelligence. IEEE, forcement Learning. In 28th IFIP WG 6.1 International Conference, ICTSS 2016,
245–252. https://doi.org/10.1109/ICTAI.2011.45 Graz, Austria, October 17-19, 2016, Proceedings, Franz Wotawa, Mihai Nica, and
[8] Luciano S de Souza, Ricardo B C Prudêncio, Flavia de A. Barros, and Eduardo H Natalia Kushik (Eds.), Vol. 6435. Springer International Publishing, Cham, 52–69.
da S. Aranha. 2013. Search based constrained test case selection using execution https://doi.org/10.1007/978-3-642-16573-3
effort. Expert Systems with Applications 40, 12 (2013), 4887–4896. https://doi.org/ [30] Martin Riedmiller. 2005. Neural fitted Q iteration - First experiences with a
10.1016/j.eswa.2013.02.018 data efficient neural Reinforcement Learning method. In European Conference on
[9] Daniel Di Nardo, Nadia Alshahwan, Lionel Briand, and Yvan Labiche. 2015. Machine Learning. Springer, 317–328. https://doi.org/10.1007/11564096_32
Coverage-based regression test case selection, minimization and prioritization: a [31] Gregg Rothermel, Roland H Untch, Chengyun Chu, and Mary Jean Harrold.
case study on an industrial system. Software Testing, Verification and Reliability 1999. Test case prioritization: An empirical study. In Software Maintenance,
25, 4 (2015), 371–396. https://doi.org/10.1002/stvr.1572 1999.(ICSM’99) Proceedings. IEEE International Conference on. IEEE, 179–188.
[10] P M Duvall, S Matyas, and A Glover. 2007. Continuous Integration: Improving [32] Gregg Rothermel, Roland H Untch, Chengyun Chu, Mary Jean Harrold, and
Software Quality and Reducing Risk. Pearson Education. Ieee Computer Society. 2001. Prioritizing Test Cases For Regression Testing.
[11] Sebastian Elbaum, Andrew Mclaughlin, and John Penix. 2014. The IEEE Transactions on Software Engineering 27, 10 (2001), 929–948. https://doi.org/
Google Dataset of Testing Results. (2014). https://code.google.com/p/ 10.1145/347324.348910
google-shared-dataset-of-test-suite-results/ [33] Ripon K Saha, L Zhang, S Khurshid, and D E Perry. 2015. An Information
[12] Sebastian Elbaum, Gregg Rothermel, and John Penix. 2014. Techniques for im- Retrieval Approach for Regression Test Prioritization Based on Program Changes.
proving regression testing in continuous integration development environments. In Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference
In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations on, Vol. 1. 268–279. https://doi.org/10.1109/ICSE.2015.47
of Software Engineering. ACM, 235–245. https://doi.org/10.1145/2635868.2635910 [34] S Stolberg. 2009. Enabling agile testing through continuous integration. In Agile
[13] Martin Fowler and M Foemmel. 2006. Continuous integration. (2006). http: Conference, 2009. AGILE’09. IEEE, 369–374.
//martinfowler.com/articles/continuousIntegration.html [35] Per Erik Strandberg, Daniel Sundmark, Wasif Afzal, Thomas Ostrand, and Elaine
[14] M Gligoric, L Eloussi, and D Marinov. 2015. Ekstazi: Lightweight Test Selection. Weyuker. 2016. Experience Report: Automated System Level Regression Test
In Proceedings of the 37th International Conference on Software Engineering, Vol. 2. Prioritization Using Multiple Factors. In Software Reliability Engineering (ISSRE),
713–716. https://doi.org/10.1109/ICSE.2015.230 2016 IEEE 27th International Symposium on. IEEE, 12—-23.
[15] A. Groce, A. Fern, J. Pinto, T. Bauer, A. Alipour, M. Erwig, and C. Lopez. 2012. [36] Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Intro-
Lightweight Automated Testing with Adaptation-Based Programming. In 2012 duction (1st ed.). MIT press Cambridge. https://doi.org/10.1109/TNN.1998.712192
IEEE 23rd International Symposium on Software Reliability Engineering. 161–170. [37] Hado Van Hasselt and Marco A Wiering. 2007. Reinforcement learning in
https://doi.org/10.1109/ISSRE.2012.1 continuous action spaces. Proceedings of the 2007 IEEE Symposium on Approximate
[16] Jung-Min Kim Jung-Min Kim and A. Porter. 2002. A history-based test prioriti- Dynamic Programming and Reinforcement Learning, ADPRL 2007 (2007), 272–279.
zation technique for regression testing in resource constrained environments. In https://doi.org/10.1109/ADPRL.2007.368199
Proceedings of the 24th international conference on software engineering. 119–129. [38] Fred L Van Rossum, Guido and Drake Jr. 1995. Python Reference Manual. Techni-
https://doi.org/10.1109/ICSE.2002.1007961 cal Report. Amsterdam, The Netherlands, The Netherlands.
[17] Jung-Hyun Kwon, In-Young Ko, Gregg Rothermel, and Matt Staats. 2014. Test [39] Margus Veanes, Pritam Roy, and Colin Campbell. 2006. Online Testing with
case prioritization based on information retrieval concepts. 2014 21st Asia-Pacific Reinforcement Learning. Springer Berlin Heidelberg, Berlin, Heidelberg, 240–253.
Software Engineering Conference (APSEC) 1 (2014), 19–26. https://doi.org/10.1109/ https://doi.org/10.1007/11940197_16
APSEC.2014.12 [40] K R Walcott, M L Soffa, G M Kapfhammer, and R S Roos. 2006. Time-Aware
[18] Long-Ji Lin. 1992. Self-Improving Reactive Agents Based on Reinforcement Test Suite Prioritization. In Proceedings of the 2006 International Symposium on
Learning, Planning and Teaching. Machine Learning 8, 3-4 (1992), 293–321. Software Testing and Analysis (ISSTA). ACM, Portland, Maine, USA, 1–12.
https://doi.org/10.1023/A:1022628806385 [41] Farn Wang, Shun-Ching Yang, and Ya-Lan Yang. 2011. Regression Testing Based
[19] Dusica Marijan, Arnaud Gotlieb, and Sagar Sen. 2013. Test case prioritization on Neural Networks and Program Slicing Techniques. Springer Berlin Heidelberg,
for continuous regression testing: An industrial case study. In 2013 29th IEEE Berlin, Heidelberg, 409–418. https://doi.org/10.1007/978-3-642-25658-5_50
International Conference on Software Maintenance (ICSM). 540–543. https://doi. [42] Lian Yu, Lei Xu, and Wei-Tek Tsai. 2010. Time-Constrained Test Selection for
org/10.1109/ICSM.2013.91 Regression Testing. Springer Berlin Heidelberg, Berlin, Heidelberg, 221–232.
[20] Maja J Matarić. 1994. Reward functions for accelerated learning. In Machine https://doi.org/10.1007/978-3-642-17313-4_23
Learning: Proceedings of the Eleventh international conference. 181–189. https: [43] Lu Zhang, Shan-Shan Hou, Chao Guo, Tao Xie, and Hong Mei. 2009. Time-aware
//doi.org/10.1.1.42.4313 test-case prioritization using integer linear programming. Proceedings of the
[21] Siavash Mirarab, Soroush Akhlaghi Esfahani, and Ladan Tahvildari. 2012. Size- eighteenth International Symposium on Software Testing and Analysis (ISSTA)
Constrained Regression Test Case Selection Using Multicriteria Optimization. (2009), 213–224. https://doi.org/10.1145/1572272.1572297
IEEE Transactions on Software Engineering 38, 4 (jul 2012), 936–956. https://doi. [44] Tong Zhang. 2004. Solving large scale linear prediction problems using stochas-
org/10.1109/TSE.2011.56 tic gradient descent algorithms. In Proceedings of the twenty-first international
conference on Machine learning. ACM, 116.

22

You might also like