Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration
Helge Spieker, Arnaud Gotlieb, Dusica Marijan, and Morten Mossige
ISSTA'17, July 2017, Santa Barbara, CA, USA
to control the time required to execute a complete cycle. As the durations of test cases vary strongly, not all tests can be executed and test case selection is required.

Although algorithms have been proposed recently [19, 23], we argue that these two aspects of CI testing, namely test case selection and history-based prioritization, can hardly be solved by using only non-adaptive methods. First, the time allocated to test case selection and prioritization in CI is limited, as each step of the process is given a contract of time. So, time-effective methods should be preferred over costly and complex prioritization algorithms. Second, history-based prioritization is not well adapted to changes in the execution environment. More precisely, it is frequent to see some test cases being removed from one cycle to another because they test an obsolete feature of the system. At the same time, new test cases are introduced to test new or changed features. Additionally, some test cases are more crucial in certain periods of time, because they test features on which customers focus the most, and then they lose their prevalence because the testing focus has changed. In brief, non-adaptive methods may not be able to spot changes in the importance of some test cases over others because they apply systematic prioritization algorithms.

Reinforcement Learning. In order to tame these problems, we propose a new lightweight test case selection and prioritization approach in CI based on reinforcement learning and neural networks. Reinforcement learning is well suited to designing an adaptive method capable of learning from its experience of the execution environment. By adaptive, we mean that our method can progressively improve its efficiency from observations of the effects its actions have. By using a neural network which works on both the selected test cases and the order in which they are executed, the method tends to select and prioritize test cases which have been successfully used to detect faults in previous CI cycles, and to order them so that the most promising ones are executed first.

Unlike other prioritization algorithms, our method is able to adapt to situations where test cases are added to or deleted from a general repository. It can also adapt to situations where the testing priorities change because of a different focus or execution platform, indicated by changing failure indications. Finally, as the method is designed to run in a CI cycle, the time it requires is negligible, because it does not need to perform computationally intensive operations during prioritization. It does not mine code-based repositories or change-log history in detail to compute a new test case schedule. Instead, it leverages knowledge about the test cases which have been the most capable of detecting failures in a small sequence of previous CI cycles. This decision-making knowledge is updated only after tests have been executed, from feedback provided by a reward function, the only component in the method initially embedding domain knowledge.

The contributions of this paper are threefold:

(1) This paper shows that history-based test case prioritization and selection can be approached as a reinforcement learning problem. By modeling the problem with notions such as states, actions, agents, policy, and reward functions, we demonstrate, as a first contribution, that RL is suitable to automatically prioritize and select test cases;

(2) Implementing an online RL method, without any previous training phase, into a Continuous Integration process is shown to be effective for learning how to prioritize test cases. To our knowledge, this is the first time that RL is applied to test case prioritization and compared with other simple deterministic and random approaches. Comparing two distinct representations (i.e., tableau and neural networks) and three distinct reward functions, our experimental results show that, without any prior knowledge and without any model of the environment, the RL approach is able to learn how to prioritize test cases better than other approaches. Remarkably, the number of cycles required to improve on other methods corresponds to less than two months of data if there is only one CI cycle per day;

(3) Our experimental results have been computed on industrial data gathered over one year of Continuous Integration. By applying our RL method on this data, we show that the method is deployable in industrial settings. This is the third contribution of this paper.

Paper Outline. The rest of the paper is organized as follows: Section 2 provides notations and definitions. It also includes a formalization of the problem addressed in our work. Section 3 presents our Retecs approach for test case prioritization and selection based on reinforcement learning. It also introduces basic concepts such as artificial neural networks, agents, policies and reward functions. Section 4 presents our experimental evaluation of Retecs on industrial data sets, while Section 5 discusses related work. Finally, Section 6 summarizes and concludes the paper.

2 FORMAL DEFINITIONS

This section introduces the notation used in the rest of the paper and presents the addressed problem in a formal way.

2.1 Notations and Definitions

Let T_i be a set of test cases {t_1, t_2, ..., t_N} at a CI cycle i. Note that this set can evolve from one cycle to another. Some of these test cases are selected and ordered for execution in a test schedule called TS_i (TS_i ⊆ T_i). For evaluation purposes, we further define TS_i^total as the ordered sequence of all test cases (TS_i^total = T_i), as if all test cases were scheduled for execution regardless of any time limit. Note that T_i is an unordered set, while TS_i and TS_i^total are ordered sequences. Following up on this idea, we define a ranking function over the test cases, rank : TS_i → N, where rank(t) is the position of t within TS_i.

In TS_i, each test case t has a verdict t.verdict_i and a duration t.duration_i. Note that these values are only available after executing the test case and that they depend on the cycle in which the test case has been executed. For the sake of simplicity, the verdict is either 1 if the test case has passed, or 0 if it has failed or has not been executed in cycle i, i.e. it is not included in TS_i. The subset of all failed test cases in TS_i is noted TS_i^fail = {t ∈ TS_i s.t. t.verdict_i = 0}. The failure of an executed test case can be due to one or several actual faults in the system under test, and conversely a single fault can be responsible for multiple failed test cases. For the remainder of this paper, we will focus only on failed test cases (and not actual faults of the system) as the link between actual faults and executed test
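As an illustration of the notation in Section 2.1, the following minimal sketch models a test case and the per-cycle bookkeeping in Python, the paper's implementation language. The class and function names are ours and are not taken from the Retecs implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TestCase:
    """Metadata of a test case as used in Section 2.1."""
    name: str
    duration: float   # t.duration_i, known only after execution in cycle i
    verdict: int      # t.verdict_i: 1 = passed, 0 = failed or not executed

def rank(schedule: List[TestCase], t: TestCase) -> int:
    """Position of test case t within the ordered schedule TS_i (1-based)."""
    return schedule.index(t) + 1

def failed(schedule: List[TestCase]) -> List[TestCase]:
    """TS_i^fail: the subset of scheduled test cases with a failing verdict."""
    return [t for t in schedule if t.verdict == 0]
```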
of positive rewards previous behavior is encouraged, i.e. reinforced, while in case of negative rewards it is discouraged.

Test verdicts of previous executions have been shown to be useful for revealing future failures [16]. This raises the question of how long the history of test verdicts should be for a reliable indication. In general, a long history provides more information and allows better knowledge of the failure distribution of the system under test, but it also requires processing more data, which might have become irrelevant with previous upgrades of the system as the previously error-prone feature got more stable. To consider this, the agent has to learn how to time-weight previous test verdicts, which adds further complexity to the learning process. How the history length affects the performance of our method is experimentally evaluated in Section 4.2.2.

Of further importance for RL applications are the agent's policy, i.e. the way it decides on actions, the memory representation, i.e. how it stores its experience and policy, and the reward function to provide feedback for adaptation and policy improvement. In the following, we will discuss these components and their relevance for Retecs.

3.1.1 Reward Functions. Within the ATCS problem, a good test schedule is defined by the goals of test case selection and prioritization. It contains those test cases which lead to detection of failures and executes them early to minimize feedback time. The reward function should reflect these goals, and thereby domain knowledge, to steer the agent's behavior [20]. Referring to the definition of ATCS, the reward function implements Q_i and evaluates the performance of a test schedule.

Ideally, feedback should be based on common metrics used in test case prioritization and selection, e.g. NAPFD (presented in Section 4.1). However, these metrics require knowledge about the total number of faults in the system under test or full information on test case verdicts, even for non-executed test cases. In a CI setting, test case verdicts exist only for executed test cases and information about missed failures is not available. It is impossible to teach the RL agent about test cases which should have been included, but only to reinforce actions having shown positive effects. Therefore, in Retecs, rewards are either zero or positive, because we cannot automatically detect negative behavior.

To teach the agent both the goal of a task and the way to approach this goal through the reward, two types of reward functions can be distinguished. Either a single reward value is given for the whole test schedule, or, more specifically, one reward value per individual test case. The former rewards the decisions on all test cases as a group, but the agent does not receive feedback on how helpful each particular test case was to detect failures. The latter resolves this issue by providing more specific feedback, but risks neglecting the prioritization strategy of different priorities for different test cases for the complete schedule as a whole.

Throughout the presentation and evaluation of this paper, we will consider three reward functions.

Definition 3.1. Failure Count Reward

    reward_i^fail(t) = |TS_i^fail|    (∀ t ∈ T_i)    (1)

In the first reward function (1) all test cases, both scheduled and unscheduled, receive the number of failed test cases in the schedule as a reward. It is a basic, but intuitive, reward function directly rewarding the RL agent on the goal of maximizing the number of failed test cases. The reward function acknowledges the prioritized test suite in total, including positive feedback on low priorities for test cases regarded as unimportant. This risks encouraging low priorities for test cases which would have failed if executed, and could encourage undesired behavior, but at the same time it strengthens the influence all priorities in the test suite have.

Definition 3.2. Test Case Failure Reward

    reward_i^tcfail(t) = 1 − t.verdict_i   if t ∈ TS_i
                         0                 otherwise    (2)

The second reward function (2) returns the test case's verdict as each test case's individual reward. Scheduling failing test cases is intended and therefore reinforced. If a test case passed, no specific reward is given, as including it neither improved nor reduced the schedule's quality according to the available information. Still, the order of test cases is not explicitly included in the reward. It is implicitly included by encouraging the agent to focus on failing test cases and prioritizing them higher. For the proposed scheduling method (Section 3.2) this automatically leads to an earlier execution.

Definition 3.3. Time-ranked Reward

    reward_i^time(t) = |TS_i^fail| − t.verdict_i × Σ_{t_k ∈ TS_i^fail ∧ rank(t) < rank(t_k)} 1    (3)

The third reward function (3) explicitly includes the order of test cases and rewards each test case based on its rank in the test schedule and whether it failed. As a good schedule executes failing test cases early, every passed test case reduces the schedule's quality if it precedes a failing test case. Each test case is rewarded by the total number of failed test cases; for failed test cases, this is the same as reward function (1). For passed test cases, the reward is further decreased by the number of failed test cases ranked after the passed test case, to penalize scheduling passing test cases early.

3.1.2 Action Selection: Prioritizing Test Cases. Action selection describes how the RL agent processes a test case and decides on a priority for it by using the policy. The policy is a function from the set of states, i.e., test cases in our context, to the set of actions, i.e., how important each test case is for the current schedule, and describes how the agent interacts with its execution environment. The policy function is an approximation of the optimal policy. In the beginning it is a loose approximation, but over time and by gathering experience it adapts towards an optimal policy.

The agent selects those actions from the policy which were most rewarding before. It relies on its learned experience of good actions for the current state. Because the agent initially has no concept of its actions' effects, it explores the environment by choosing random actions and observing the received rewards for these actions. How often random actions are selected instead of consulting the policy is controlled by the exploration rate, a parameter which usually decreases over time. In the beginning of the process, a
high exploration rate encourages experimenting, whereas at a later time exploration is reduced and the agent more strongly relies on its learned policy. Still, exploration is not disabled, because the agent interacts in a dynamic environment, where the effects of certain actions change and where it is necessary to continuously adapt the policy. Action selection and the effect of exploration are also influenced by non-stationary rewards, meaning that the same action for the same test case does not always yield the same reward. Test cases which are likely to fail, based on previous experiences, do not fail when the software is bug-free, although their failure would be expected. The existence of non-stationary rewards has motivated our selection of an online-learning approach, which enables continuous adaptation and should tolerate their occurrence.

3.1.3 Memory Representation. As noted above, the policy is an approximated function from a state (a test case) to an action (a priority). There exists a wide variety of function approximators in the literature, but for our context we focus on two main approximators.

The first function approximator is the tableau representation [36]. It consists of two tables to track seen states and selected actions. In one table it is counted how often each distinct action was chosen per state. The other table stores the average received reward for these actions. The policy is then to choose the action with the highest expected reward for the current state, which can be directly read from the table. When receiving rewards, the cells for each rewarded combination of state and action are updated by increasing the counter and calculating the running average of received rewards. As an exploration method to select random actions, ϵ-greedy exploration is used. With probability (1 − ϵ) the most promising action according to the policy is selected, otherwise a random action is selected for exploration.

Albeit a straightforward representation, the tableau also restricts the agent. States and actions have to be discrete sets of limited size, as each state/action pair is stored separately. Furthermore, with many possible states and actions, the policy approximation takes longer to converge towards an optimal policy, as more experiences are necessary for the training. However, for the presented problem and its number of possible states, a tableau is still applicable and considered for evaluation.

Overcoming the limitations of the tableau, artificial neural networks (ANN) are commonly used function approximators [37]. ANNs can approximate functions with continuous states and actions and are easier to scale to larger state spaces. The downsides of using ANNs are a more complex configuration and higher training effort than for the tableau. In the context of Retecs, an ANN receives a state as input to the network and outputs a single continuous action, which directly represents the test case's priority.

Exploration is different when using ANNs, too. Because a continuous action is used, ϵ-greedy exploration is not possible. Instead, exploration is achieved by adding a random value drawn from a Gaussian distribution to the policy's suggested action. The variance of the distribution is given by the exploration rate, and a higher rate allows for higher deviations from the policy's actions. The lower the exploration rate is, the closer the action is to the learned policy.

Whereas the agent with tableau representation processes each experience and reward once, an ANN-based agent can be trained differently. Previously encountered experiences are stored and revisited during the training phase to achieve repeated learning impulses, which is called experience replay [18]. When rewards are received, each experience, consisting of a test case, action and reward, is stored in a separate replay memory with limited capacity. If the replay memory capacity is reached, the oldest experiences get replaced first. During training, a batch of experiences is randomly sampled from this memory and used for training the ANN via backpropagation with stochastic gradient descent [44].

3.2 Scheduling

Test cases are scheduled under consideration of their priority, their duration and a time limit. The scheduling method is a modular aspect within Retecs and can be selected depending on the environment, e.g. considering execution constraints or scheduling onto multiple test agents. The only requirement is that it maximizes the total priority within the schedule. For example, in an environment with only a single test agent and no further constraints, test cases can be selected by descending priority (ties broken randomly) until the time limit is reached.

3.3 Integration within a CI Process

In a typical CI process (as shown in Figure 2), a set of test cases is first prioritized, and based on the prioritization a subset of test cases is selected and scheduled onto the testing system(s) for execution. The Retecs method fits into this scheme by providing the Prioritization and Selection & Scheduling steps. It extends the CI process by requiring an additional feedback channel to receive test results after each cycle, which is the same as, or part of, the information also provided as developer feedback.

4 EXPERIMENTAL EVALUATION

In this section we present an experimental evaluation of the Retecs method. First, an overview of evaluation metrics (Section 4.1) is given before the experimental setup is introduced (Section 4.2). In Section 4.3 we present and discuss the experimental results. A discussion of possible threats (Section 4.4) and extensions (Section 4.5) to our work closes the evaluation.

Within the evaluation of the Retecs method, we investigate if it can be successfully applied to the ATCS problem. Initially, before evaluating the method on our research questions, we explore how different parameter choices affect the performance of our method.

RQ1 Is the Retecs method effective for prioritizing and selecting test cases? We evaluate combinations of memory representations and reward functions on three industrial data sets.

RQ2 Can the lightweight and model-free Retecs method prioritize test cases comparably to deterministic, domain-specific methods? We compare Retecs against three comparison methods: one random prioritization strategy and two basic deterministic methods.

4.1 Evaluation Metric

In order to compare the performance of different methods, evaluation metrics are required as a common performance indicator.
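To make Definitions 3.1–3.3 from the previous section concrete, the sketch below implements the three reward functions in Python over an ordered schedule of executed test cases. It is a minimal, self-contained illustration of the formulas, not code from the Retecs implementation; the type alias and function names are ours, and the verdict encoding (1 = passed, 0 = failed) follows Section 2.1.

```python
from typing import Dict, List, Tuple

# A schedule is the ordered TS_i as (test case name, verdict) pairs,
# where verdict follows Section 2.1: 1 = passed, 0 = failed.
Schedule = List[Tuple[str, int]]

def failure_count_reward(schedule: Schedule, all_tests: List[str]) -> Dict[str, float]:
    """Definition 3.1: every test case in T_i receives |TS_i^fail| as reward."""
    num_failed = sum(1 for _, verdict in schedule if verdict == 0)
    return {name: float(num_failed) for name in all_tests}

def test_case_failure_reward(schedule: Schedule, all_tests: List[str]) -> Dict[str, float]:
    """Definition 3.2: scheduled test cases get 1 - verdict, others get 0."""
    scheduled = {name: verdict for name, verdict in schedule}
    return {name: float(1 - scheduled[name]) if name in scheduled else 0.0
            for name in all_tests}

def time_ranked_reward(schedule: Schedule) -> Dict[str, float]:
    """Definition 3.3: |TS_i^fail|, reduced for passed test cases by the
    number of failed test cases ranked after them."""
    num_failed = sum(1 for _, verdict in schedule if verdict == 0)
    rewards = {}
    for position, (name, verdict) in enumerate(schedule):
        failed_after = sum(1 for _, v in schedule[position + 1:] if v == 0)
        rewards[name] = float(num_failed - verdict * failed_after)
    return rewards
```

In this sketch a failed test case always receives the full failure count, while a passed test case scheduled ahead of failures is penalized, mirroring the per-test-case feedback discussed above.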
Figure 2: Testing in CI process: RETECS uses test execution results for learning test case prioritization (solid boxes: included in RETECS, dashed boxes: interfaces to the CI environment)
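As a concrete illustration of the Selection & Scheduling step from Section 3.2, the following sketch schedules test cases for a single test agent by descending priority, with ties broken randomly, until the time limit is reached. The function and tuple layout are our own assumptions for illustration, not the Retecs implementation; other variants (e.g. skipping test cases that do not fit and continuing with shorter ones) are equally compatible with the description above.

```python
import random
from typing import List, Tuple

def schedule_single_agent(prioritized: List[Tuple[str, float, float]],
                          time_limit: float) -> List[str]:
    """Greedy single-agent scheduling as described in Section 3.2.

    prioritized: list of (test case name, priority, estimated duration).
    Returns the ordered schedule TS_i that fits into the time limit.
    """
    # Sort by descending priority; a random secondary key breaks ties randomly.
    ordered = sorted(prioritized, key=lambda tc: (-tc[1], random.random()))
    schedule: List[str] = []
    used_time = 0.0
    for name, _, duration in ordered:
        if used_time + duration > time_limit:
            break  # stop once the next test case would exceed the time limit
        schedule.append(name)
        used_time += duration
    return schedule
```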
In the following, we introduce the Normalized Average Percentage of Faults Detected as the applied evaluation metric.

Definition 4.1. Normalized APFD

    NAPFD(TS_i) = p − (Σ_{t ∈ TS_i^fail} rank(t)) / (|TS_i^fail| × |TS_i|) + p / (2 × |TS_i|)

    with p = |TS_i^fail| / |TS_i^{total,fail}|

Average Percentage of Faults Detected (APFD) was introduced in [31] to measure the effectiveness of test case prioritization techniques. It measures the quality via the ranks of failure-detecting test cases in the test execution order. As it assumes all detectable faults get detected, APFD is designed for test case prioritization tasks without selecting a subset of test cases. Normalized APFD (NAPFD) [28] is an extension of APFD to include the ratio between detected and detectable failures within the test suite, and is thereby suited for test case selection tasks when not all test cases are executed and failures can remain undetected. If all faults are detected (p = 1), NAPFD is equal to the original APFD formulation.

4.2 Experimental Setup

Two RL agents are evaluated in the experiments. The first uses a tableau representation of discrete states and a fixed number of actions and is named the Tableau-based agent. The second, the Network-based agent, uses an artificial neural network as memory representation for continuous states and a continuous action. The reward function of each agent is not fixed, but varied throughout the experiments. Test cases are scheduled on a single test agent in descending order of priority until the time limit is reached.

To evaluate the efficiency of the Retecs method, we compare it to three basic test case prioritization methods. The first is random test case prioritization as a baseline method, referred to as Random. The other two methods are deterministic. In the second method, named Sorting, test cases are sorted by their recent verdicts, with recently failed test cases having higher priority. For the third comparison method, labeled Weighting, the priority is calculated as a sum of the test case's features as they are used as input to the RL agent. Weighting considers the same information as Retecs and corresponds to a weighted sum with equal weights, and is thereby a naive version of Retecs without adaptation. Although the three comparison methods are basic approaches to test case prioritization, they utilize the same information as provided to our method, and are likely to be encountered in industrial environments.

Due to the online learning properties and the dependence on previous test suite results, evaluation is done by comparing the NAPFD metrics for all subsequent CI cycles of a data set over time. To account for the influence of randomness within the experimental evaluation, all experiments are repeated 30 times and reported results show the mean, if not stated otherwise.

Retecs¹ is implemented in Python [38] using scikit-learn's implementation of artificial neural networks [26].

4.2.1 Industrial Data Sets. To determine real-world applicability, industrial data sets from ABB Robotics Norway², Paint Control and IOF/ROL, for testing complex industrial robots, and the Google Shared Dataset of Test Suite Results (GSDTSR) [11] are used³. These data sets consist of historical information about test executions and their verdicts, and each contains data for over 300 CI cycles.

Table 1 gives an overview of the data sets' structure. Both ABB data sets are split into daily intervals, whereas GSDTSR is split into hourly intervals, as it originally provides log data of 16 days, which is too short for our evaluation. Still, the average test suite size per CI cycle in GSDTSR exceeds that in the ABB data sets while having fewer failed test executions. For applying Retecs, constant durations between CI cycles are not required.

For the CI cycle's time limit, which is not present in the data sets, a fixed percentage of 50% of the required time is used. A relative time limit allows better comparison of results between data sets and keeps the difficulty at each CI cycle on a comparable level. How this percentage affects the results is evaluated in Section 4.3.3.

4.2.2 Parameter Selection. A couple of parameters allow adjusting the method towards specific environments. For the experimental evaluation, the same set of parameters is used in all experiments, if not stated otherwise. These parameters are based on values from the literature and experimental exploration.

Table 2 gives an overview of the chosen parameters. The number of actions for the Tableau-based agent is set to 25. Preliminary tests showed a larger number of actions did not substantially increase

¹ Implementation available at https://bitbucket.org/helges/retecs
² Website: http://new.abb.com/products/robotics
³ Data sets available at https://bitbucket.org/helges/atcs-data
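The NAPFD metric of Definition 4.1 can be computed directly from a schedule and the size of the full failing set. The sketch below is a minimal Python illustration with our own variable names; the verdict encoding (1 = passed, 0 = failed) follows Section 2.1, and the handling of the no-failure edge case is our own pragmatic choice, not prescribed by the paper.

```python
from typing import List, Tuple

def napfd(schedule: List[Tuple[str, int]], total_failed: int) -> float:
    """Normalized APFD (Definition 4.1).

    schedule: ordered TS_i as (test case name, verdict), 1 = passed, 0 = failed.
    total_failed: |TS_i^{total,fail}|, the number of failures detectable by the
                  full, unrestricted test suite TS_i^total.
    """
    if total_failed == 0:
        return 1.0  # nothing detectable: treat the schedule as perfect
    n = len(schedule)
    failed_ranks = [pos + 1 for pos, (_, verdict) in enumerate(schedule)
                    if verdict == 0]
    detected = len(failed_ranks)          # |TS_i^fail|
    p = detected / total_failed
    if detected == 0 or n == 0:
        return p / (2 * n) if n > 0 else 0.0
    return p - sum(failed_ranks) / (detected * n) + p / (2 * n)
```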
Table 1: Industrial Data Sets Overview: All columns show the total amount of data in the data set

Data Set      | Test Cases | CI Cycles | Verdicts  | Failed
Paint Control | 114        | 312       | 25,594    | 19.36%
IOF/ROL       | 2,086      | 320       | 30,319    | 28.43%
GSDTSR        | 5,555      | 336       | 1,260,617 | 0.25%

Table 2: Parameter Overview

RL Agent | Parameter               | Value
All      | CI cycle's time limit M | 50% × T_i.duration
All      | History Length          | 4
Tableau  | Number of Actions       | 25
Tableau  | Exploration Rate ϵ      | 0.2
Network  | Hidden Nodes            | 12
Network  | Replay Memory           | 10000
Network  | Replay Batch Size       | 1000

4.3 Results

4.3.1 RQ1: Learning Process & Effectiveness. Figure 4 shows the performance of Tableau- and Network-based agents with different reward functions on three industrial data sets. Each column shows results for one data set, each row for a particular reward function. It is visible that the combination of memory representation and reward function strongly influences the performance. In some cases it does not support the learning process and the performance stays at the initial level or even declines. Some combinations enable the agent to learn which test cases to prioritize higher or lower and to create meaningful test schedules.

Performance on all data sets is best for the Network-based agent with the Test Case Failure reward function. It benefits from the specific feedback for each test case and learns which test cases are likely to fail. Because the Network-based agent prioritizes test cases with continuous actions, it adapts more easily than the Tableau-based agent, where only specific actions are rewarded and rewards for one action do not influence close other actions.

In all results a similar pattern should be visible. Initially, the agent has no concept of the environment and cannot identify failing test cases, leading to a poor performance. After a few cycles it has received enough feedback from the reward function to make better choices and successively improves. However, this is not true for all combinations of memory representation and reward function. One example is the
Figure 4: Comparison of reward functions and memory representations: A Network-based agent with Test Case Failure reward delivers best performance on all three data sets (black lines indicate trend over time; y-axis: NAPFD, x-axis: CI Cycle)
4.3.2 RQ2: Comparison to Other Methods. Whereas the experiments for RQ1 focus on the performance of different component combinations, the focus of RQ2 is on comparing the best-performing Network-based RL agent (with Test Case Failure reward) with other test case prioritization methods. Figure 5 shows the results of the comparison against the three methods on each of the three data sets. A comparison is made for every 30 CI cycles on the difference of the average NAPFD values of each cycle. Positive differences show better performance by the comparison method; a negative difference shows better performance by Retecs.

During early CI cycles, the deterministic comparison methods mostly show better performance. This corresponds to the initial exploration phase, where Retecs adapts to its environment. After approximately 60 CI cycles, for Paint Control, it is able to prioritize with similar or better performance than the comparison methods. Similar results are visible on the other two data sets, with a longer adaptation phase but smaller performance differences on IOF/ROL and an early comparable performance on GSDTSR.

For IOF/ROL, where the previous evaluation (see Figure 4) showed lower performance compared to Paint Control, the comparison methods are also not able to correctly prioritize failing test cases higher, as the small performance gap indicates.

For GSDTSR, Retecs performs overall comparably, with an NAPFD difference of up to 0.2. Due to the few failures within the data set, the exploration phase does not impact the performance in the early cycles as strongly as for the other two data sets. Also, it appears as if the indicators for failing test cases are not as correlated to the previous test execution results as they were in the other data sets, which is visible from the comparatively low performance of the deterministic methods.

In summary, the results for RQ2 show that Retecs can, starting from a model-free memory without initial knowledge about test case prioritization, learn to effectively prioritize test cases in around 60 cycles, which corresponds to two months for daily intervals. Its performance compares to that of basic deterministic test case prioritization methods. For CI, this means that Retecs is a promising method for test case prioritization which adapts to environment-specific indications of system failures.

4.3.3 Internal Evaluation: Schedule Time Influence. In the experimental setup, the time limit for each CI cycle's reduced test
Figure 5: Performance difference between network-based agent and comparison methods: After an initial exploration phase RETECS adapts to competitive performance. Each group of bars compares 30 CI cycles. (Panels: ABB Paint Control, ABB IOF/ROL, Google GSDTSR; series: Sorting, Weighting, Random; y-axis: NAPFD Difference; x-axis: CI Cycle)
[Figure (caption not recovered): series Network, Tableau, Sorting, Weighting, Random; y-axis: % of best result]

Another threat is related to the existence of faults within our implementation. We approached this threat by applying established components, such as scikit-learn, within our software where appropriate. Furthermore, our implementation is available online for inspection and reproduction of experiments.
4.5 Extensions

The presented results give perspectives for extensions from two angles. The first perspective is the technical RL approach. Through a pre-training phase, the agent can internalize test case prioritization knowledge before actually prioritizing test cases and thereby improve the initial performance. This can be approached by imitation of other methods [1], e.g. deterministic methods with desirable behavior, or by using historical data before it is introduced in the CI process [30]. The second perspective focuses on the domain-specific approach to test case prioritization and selection. Here, only little metadata about a test case and its history is used. The number of features of a test case should be extended to allow better reasoning about expected failures, e.g. links between source code changes and relevant test cases. By including failure causes, scheduling of redundant test cases can be avoided and the effectiveness improved. Furthermore, this work used a linear scheduling model, but in industrial settings more complex environments are encountered, e.g. multiple systems for test executions or additional constraints on test execution besides time limits. Another extension of this work is therefore to integrate different scheduling methods under consideration of prioritization information and integration into the learning process [27].

5 RELATED WORK

Test case prioritization and selection for regression testing: Previous work focuses on optimizing regression testing based on mainly three aspects: cost, coverage, and fault detection, or their combinations. In [21], the authors propose an approach for test case selection and prioritization using the combination of Integer Linear Programming (ILP) and greedy methods by optimizing multiple criteria. Another study investigates coverage-based regression testing [9], using four common prioritization techniques, a test selection technique, a test suite minimization technique and a hybrid approach that combines selection and minimization. Similar approaches have been proposed using search-based algorithms [7, 42], including swarm optimization [8] and ant colony optimization [22]. Walcott et al. use genetic algorithms for time-aware regression test suite prioritization for frequent code rebuilding [40]. Similarly, Zhang et al. propose time-aware prioritization using ILP [43]. Strandberg et al. [35] apply a novel prioritization method with multiple factors to real-world embedded software and show the improvement over industry practice. Other regression test selection techniques have been proposed based on historical test data [16, 19, 23, 25], code dependencies [14], or information retrieval [17, 33]. Despite various approaches to test optimization for regression testing, the challenge of applying most of them in practice lies in their complexity and the computational overhead typically required to collect and analyze different test parameters needed for prioritization, such as age, test coverage, etc. By contrast, our approach based on RL is a lightweight method, which only uses historical results and its experience from previous CI cycles. Furthermore, Retecs is adaptive and suited for dynamic environments with frequent changes in code and testing, and evolving test suites.

Machine learning for software testing: Machine learning algorithms receive increasing attention in the context of software testing. The work closest to ours is [4], where Busjaeger and Xie use machine learning and multiple heuristic techniques to prioritize test cases in an industrial setting. By combining various data sources and learning to rank in an agnostic way, this work makes a strong step towards the definition of a general framework to automatically learn to rank test cases. Our approach, only based on RL and ANN, takes another direction by providing a lightweight learning method using one source of data, namely test case failure history. Chen et al. [6] use semi-supervised clustering for regression test selection. The downside of such an approach may be higher computational complexity. Other approaches include active learning for test classification [3], combining machine learning and program slicing for regression test case prioritization [41], learning agent-based test case prioritization [2], or clustering approaches [5]. RL has previously been used in combination with adaptation-based programming (ABP) for automated testing of software APIs by Groce et al. [15], where the combination of RL and ABP successively selects calls to the API with the goal of increasing test coverage. Furthermore, Reichstaller et al. [29] apply RL to generate test cases for risk-based interoperability testing. Based on a model of the system under test, RL agents are trained to interact in an error-provoking way, i.e. they are encouraged to exploit possible interactions between components. Veanes et al. use RL for online formal testing of communication systems [39]. Based on the idea of seeing testing as a two-player game, RL is used to strengthen the tester's behavior when system and test cases are modeled as Input-Output Labeled Transition Systems. While this approach is appealing, Retecs applies RL for a completely different purpose, namely test case prioritization and selection. Our approach aims at CI environments, which are characterized by strict time and effort constraints.

6 CONCLUSION

We presented Retecs, a novel lightweight method for test case prioritization and selection in Continuous Integration, combining reinforcement learning methods and historical test information. Retecs is adaptive and learns important indicators for failing test cases during its runtime by observing test cases, test results, and its own actions and their effects.

Evaluation results show fast learning and adaptation of Retecs in three industrial case studies. An effective prioritization strategy is discovered with a performance comparable to basic deterministic prioritization methods after an initial learning phase of approximately 60 CI cycles, without previous training on test case prioritization. Necessary domain knowledge is only reflected in a reward function to evaluate previous schedules. The method is model-free, language-agnostic and requires no source code or program access. It only requires test metadata, namely historical results, durations and last execution times. However, we expect additional metadata to enhance the method's performance.

In our evaluation we compare different variants of RL agents for the ATCS problem. Agents based on artificial neural networks have shown the best performance, especially when trained with test case-individual reward functions. While we applied only small networks in this work, with larger amounts of available data an extension towards larger networks and deep learning techniques can be a promising path for future research.
REFERENCES

[1] Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML). 1–8. https://doi.org/10.1145/1015330.1015430
[2] Sebastian Abele and Peter Göhner. 2014. Improving Proceeding Test Case Prioritization with Learning Software Agents. In Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 2 (ICAART). 293–298.
[3] James F. Bowring, James M. Rehg, and Mary Jean Harrold. 2004. Active Learning for Automatic Classification of Software Behavior. In Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '04). ACM, New York, NY, USA, 195–205. https://doi.org/10.1145/1007512.1007539
[4] Benjamin Busjaeger and Tao Xie. 2016. Learning for Test Prioritization: An Industrial Case Study. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, New York, NY, USA, 975–980. https://doi.org/10.1145/2950290.2983954
[5] G. Chaurasia, S. Agarwal, and S. S. Gautam. 2015. Clustering based novel test case prioritization technique. In 2015 IEEE Students Conference on Engineering and Systems (SCES). IEEE, 1–5. https://doi.org/10.1109/SCES.2015.7506447
[6] S. Chen, Z. Chen, Z. Zhao, B. Xu, and Y. Feng. 2011. Using semi-supervised clustering to improve regression test selection techniques. In 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation. IEEE, 1–10. https://doi.org/10.1109/ICST.2011.38
[7] Luciano S. de Souza, Pericles B. C. de Miranda, Ricardo B. C. Prudencio, and Flavia de A. Barros. 2011. A Multi-objective Particle Swarm Optimization for Test Case Selection Based on Functional Requirements Coverage and Execution Effort. In 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence. IEEE, 245–252. https://doi.org/10.1109/ICTAI.2011.45
[8] Luciano S. de Souza, Ricardo B. C. Prudêncio, Flavia de A. Barros, and Eduardo H. da S. Aranha. 2013. Search based constrained test case selection using execution effort. Expert Systems with Applications 40, 12 (2013), 4887–4896. https://doi.org/10.1016/j.eswa.2013.02.018
[9] Daniel Di Nardo, Nadia Alshahwan, Lionel Briand, and Yvan Labiche. 2015. Coverage-based regression test case selection, minimization and prioritization: a case study on an industrial system. Software Testing, Verification and Reliability 25, 4 (2015), 371–396. https://doi.org/10.1002/stvr.1572
[10] P. M. Duvall, S. Matyas, and A. Glover. 2007. Continuous Integration: Improving Software Quality and Reducing Risk. Pearson Education.
[11] Sebastian Elbaum, Andrew Mclaughlin, and John Penix. 2014. The Google Dataset of Testing Results. (2014). https://code.google.com/p/google-shared-dataset-of-test-suite-results/
[12] Sebastian Elbaum, Gregg Rothermel, and John Penix. 2014. Techniques for improving regression testing in continuous integration development environments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 235–245. https://doi.org/10.1145/2635868.2635910
[13] Martin Fowler and M. Foemmel. 2006. Continuous integration. (2006). http://martinfowler.com/articles/continuousIntegration.html
[14] M. Gligoric, L. Eloussi, and D. Marinov. 2015. Ekstazi: Lightweight Test Selection. In Proceedings of the 37th International Conference on Software Engineering, Vol. 2. 713–716. https://doi.org/10.1109/ICSE.2015.230
[15] A. Groce, A. Fern, J. Pinto, T. Bauer, A. Alipour, M. Erwig, and C. Lopez. 2012. Lightweight Automated Testing with Adaptation-Based Programming. In 2012 IEEE 23rd International Symposium on Software Reliability Engineering. 161–170. https://doi.org/10.1109/ISSRE.2012.1
[16] Jung-Min Kim and A. Porter. 2002. A history-based test prioritization technique for regression testing in resource constrained environments. In Proceedings of the 24th International Conference on Software Engineering. 119–129. https://doi.org/10.1109/ICSE.2002.1007961
[17] Jung-Hyun Kwon, In-Young Ko, Gregg Rothermel, and Matt Staats. 2014. Test case prioritization based on information retrieval concepts. 2014 21st Asia-Pacific Software Engineering Conference (APSEC) 1 (2014), 19–26. https://doi.org/10.1109/APSEC.2014.12
[18] Long-Ji Lin. 1992. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Machine Learning 8, 3-4 (1992), 293–321. https://doi.org/10.1023/A:1022628806385
[19] Dusica Marijan, Arnaud Gotlieb, and Sagar Sen. 2013. Test case prioritization for continuous regression testing: An industrial case study. In 2013 29th IEEE International Conference on Software Maintenance (ICSM). 540–543. https://doi.org/10.1109/ICSM.2013.91
[20] Maja J. Matarić. 1994. Reward functions for accelerated learning. In Machine Learning: Proceedings of the Eleventh International Conference. 181–189.
[21] Siavash Mirarab, Soroush Akhlaghi Esfahani, and Ladan Tahvildari. 2012. Size-Constrained Regression Test Case Selection Using Multicriteria Optimization. IEEE Transactions on Software Engineering 38, 4 (2012), 936–956. https://doi.org/10.1109/TSE.2011.56
[22] T. Noguchi, H. Washizaki, Y. Fukazawa, A. Sato, and K. Ota. 2015. History-Based Test Case Prioritization for Black Box Testing Using Ant Colony Optimization. In 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST). 1–2. https://doi.org/10.1109/ICST.2015.7102622
[23] Tanzeem Bin Noor and Hadi Hemmati. 2015. A similarity-based approach for test case prioritization using historical failure data. In 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE). 58–68.
[24] A. Orso and G. Rothermel. 2014. Software Testing: a Research Travelogue (2000–2014). In Proceedings of the Future of Software Engineering. ACM, Hyderabad, India, 117–132.
[25] H. Park, H. Ryu, and J. Baik. 2008. Historical Value-Based Approach for Cost-Cognizant Test Case Prioritization to Improve the Effectiveness of Regression Testing. In 2008 Second International Conference on Secure System Integration and Reliability Improvement. 39–46. https://doi.org/10.1109/SSIRI.2008.52
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[27] Bo Qu, Changhai Nie, and Baowen Xu. 2008. Test case prioritization for multiple processing queues. In 2008 International Symposium on Information Science and Engineering (ISISE), Vol. 2. 646–649. https://doi.org/10.1109/ISISE.2008.106
[28] Xiao Qu, Myra B. Cohen, and Katherine M. Woolf. 2007. Combinatorial interaction regression testing: A study of test case generation and prioritization. In IEEE International Conference on Software Maintenance, 2007 (ICSM). IEEE, 255–264.
[29] André Reichstaller, Benedikt Eberhardinger, Alexander Knapp, Wolfgang Reif, and Marcel Gehlen. 2016. Risk-Based Interoperability Testing Using Reinforcement Learning. In 28th IFIP WG 6.1 International Conference, ICTSS 2016, Graz, Austria, October 17-19, 2016, Proceedings, Franz Wotawa, Mihai Nica, and Natalia Kushik (Eds.). Springer International Publishing, Cham, 52–69.
[30] Martin Riedmiller. 2005. Neural fitted Q iteration - First experiences with a data efficient neural Reinforcement Learning method. In European Conference on Machine Learning. Springer, 317–328. https://doi.org/10.1007/11564096_32
[31] Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. 1999. Test case prioritization: An empirical study. In Software Maintenance, 1999 (ICSM'99) Proceedings. IEEE International Conference on. IEEE, 179–188.
[32] Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. 2001. Prioritizing Test Cases For Regression Testing. IEEE Transactions on Software Engineering 27, 10 (2001), 929–948. https://doi.org/10.1145/347324.348910
[33] Ripon K. Saha, L. Zhang, S. Khurshid, and D. E. Perry. 2015. An Information Retrieval Approach for Regression Test Prioritization Based on Program Changes. In Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference on, Vol. 1. 268–279. https://doi.org/10.1109/ICSE.2015.47
[34] S. Stolberg. 2009. Enabling agile testing through continuous integration. In Agile Conference, 2009. AGILE'09. IEEE, 369–374.
[35] Per Erik Strandberg, Daniel Sundmark, Wasif Afzal, Thomas Ostrand, and Elaine Weyuker. 2016. Experience Report: Automated System Level Regression Test Prioritization Using Multiple Factors. In Software Reliability Engineering (ISSRE), 2016 IEEE 27th International Symposium on. IEEE, 12–23.
[36] Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction (1st ed.). MIT Press, Cambridge. https://doi.org/10.1109/TNN.1998.712192
[37] Hado Van Hasselt and Marco A. Wiering. 2007. Reinforcement learning in continuous action spaces. In Proceedings of the 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007). 272–279. https://doi.org/10.1109/ADPRL.2007.368199
[38] Guido Van Rossum and Fred L. Drake Jr. 1995. Python Reference Manual. Technical Report. Amsterdam, The Netherlands.
[39] Margus Veanes, Pritam Roy, and Colin Campbell. 2006. Online Testing with Reinforcement Learning. Springer Berlin Heidelberg, Berlin, Heidelberg, 240–253. https://doi.org/10.1007/11940197_16
[40] K. R. Walcott, M. L. Soffa, G. M. Kapfhammer, and R. S. Roos. 2006. Time-Aware Test Suite Prioritization. In Proceedings of the 2006 International Symposium on Software Testing and Analysis (ISSTA). ACM, Portland, Maine, USA, 1–12.
[41] Farn Wang, Shun-Ching Yang, and Ya-Lan Yang. 2011. Regression Testing Based on Neural Networks and Program Slicing Techniques. Springer Berlin Heidelberg, Berlin, Heidelberg, 409–418. https://doi.org/10.1007/978-3-642-25658-5_50
[42] Lian Yu, Lei Xu, and Wei-Tek Tsai. 2010. Time-Constrained Test Selection for Regression Testing. Springer Berlin Heidelberg, Berlin, Heidelberg, 221–232. https://doi.org/10.1007/978-3-642-17313-4_23
[43] Lu Zhang, Shan-Shan Hou, Chao Guo, Tao Xie, and Hong Mei. 2009. Time-aware test-case prioritization using integer linear programming. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis (ISSTA). 213–224. https://doi.org/10.1145/1572272.1572297
[44] Tong Zhang. 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 116.