Evaluating the Performance of Reinforcement Learning Algorithms

Scott M. Jordan¹  Yash Chandak¹  Daniel Cohen¹  Mengxue Zhang¹  Philip S. Thomas¹

Abstract

Performance evaluations are critical for quantifying algorithmic advances in reinforcement learning. Recent reproducibility analyses have shown that reported performance results are often inconsistent and difficult to replicate. In this work, we argue that the inconsistency of performance stems from the use of flawed evaluation metrics. Taking a step towards ensuring that reported results are consistent, we propose a new comprehensive evaluation methodology for reinforcement learning algorithms that produces reliable measurements of performance both on a single environment and when aggregated across environments. We demonstrate this method by evaluating a broad class of reinforcement learning algorithms on standard benchmark tasks.

1. Introduction

When applying reinforcement learning (RL), particularly to real-world applications, it is desirable to have algorithms that reliably achieve high levels of performance without requiring expert knowledge or significant human intervention. For researchers, having algorithms of this type would mean spending less time tuning algorithms to solve benchmark tasks and more time developing solutions to harder problems. Current evaluation practices do not properly account for the uncertainty in the results (Henderson et al., 2018) and neglect the difficulty of applying RL algorithms to a given problem. Consequently, existing RL algorithms are difficult to apply to real-world applications (Dulac-Arnold et al., 2019). To both make and track progress towards developing reliable and easy-to-use algorithms, we propose a principled evaluation procedure that quantifies the difficulty of using an algorithm.

For an evaluation procedure to be useful for measuring the usability of RL algorithms, we suggest that it should have four properties. First, to ensure accuracy and reliability, an evaluation procedure should be scientific, such that it provides information to answer a research question, tests a specific hypothesis, and quantifies any uncertainty in the results. Second, the performance metric should capture the usability of the algorithm over a wide variety of environments; to do so, it should include the time and effort spent tuning the algorithm's hyperparameters (e.g., step size and policy structure). Third, the evaluation procedure should be nonexploitative (Balduzzi et al., 2018), meaning no algorithm should be favored by performing well on an over-represented subset of environments or by abusing a particular score normalization method. Fourth, an evaluation procedure should be computationally tractable, meaning that a typical researcher should be able to run the procedure and repeat experiments found in the literature.

As an evaluation procedure requires a question to answer, we pose the following to use throughout the paper: which algorithm(s) perform well across a wide variety of environments with little or no environment-specific tuning? Throughout this work, we refer to this question as the general evaluation question. This question is different from the one commonly asked in articles proposing a new algorithm, e.g., can algorithm X outperform other algorithms on tasks A, B, and C? In contrast to the common question, the expected outcome for the general evaluation question is not to find methods that maximize performance with optimal hyperparameters but to identify algorithms that do not require extensive hyperparameter tuning and thus are easy to apply to new problems.

In this paper, we contend that the standard evaluation approaches do not satisfy the above properties and are not able to answer the general evaluation question. Thus, we develop a new procedure for evaluating RL algorithms that overcomes these limitations and can accurately quantify the uncertainty of performance. The main ideas in our approach are as follows. We present an alternative view of an algorithm such that sampling its performance can be used to answer the general evaluation question. We define a new normalized performance measure, performance percentiles, which uses a relative measure of performance to compare algorithms across environments. We show how to use a game-theoretic approach to construct an aggregate measure of performance that permits quantifying uncertainty. Lastly, we develop a technique, performance bound propagation (PBP), to quantify and account for uncertainty throughout the entire evaluation procedure. We provide source code so others may easily apply the methods we develop here.¹

¹Source code for this paper can be found at https://github.com/ScottJordan/EvaluationOfRLAlgs.

¹College of Information and Computer Sciences, University of Massachusetts, MA, USA. Correspondence to: Scott Jordan <[email protected]>. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).
2. Notation and Preliminaries

In this section, we give notation used in this paper along with an overview of an evaluation procedure. In addition to this section, a list of symbols used in this paper is presented in Appendix C. We represent a performance metric of an algorithm, i ∈ A, on an environment, j ∈ M, as a random variable X_{i,j}. This representation captures the variability of results due to the choice of the random seed controlling the stochastic processes in the algorithm and the environment. The choice of the metric depends on the property being studied and is up to the experiment designer. The performance metric used in this paper is the average of the observed returns from one execution of the entire training procedure, which we refer to as the average return. The cumulative distribution function (CDF), F_{X_{i,j}} : ℝ → [0, 1], describes the performance distribution of algorithm i on environment j such that F_{X_{i,j}}(x) := Pr(X_{i,j} ≤ x). The quantile function, Q_{X_{i,j}}(α) := inf_x {x ∈ ℝ | F_{X_{i,j}}(x) ≥ α}, maps a cumulative probability, α ∈ (0, 1), to a score such that α proportion of samples of X_{i,j} are less than or equal to Q_{X_{i,j}}(α). A normalization function, g : ℝ × M → ℝ, maps a score, x, an algorithm receives on an environment, j, to a normalized score, g(x, j), which has a common scale for all environments. In this work, we seek an aggregate performance measure, y_i ∈ ℝ, for an algorithm, i, such that

    y_i := \sum_{j=1}^{|M|} q_j E[g(X_{i,j}, j)],

where q_j ≥ 0 for all j ∈ {1, ..., |M|} and \sum_{j=1}^{|M|} q_j = 1. In Section 4, we discuss choices for the normalizing function g and weightings q that satisfy the properties specified in the introduction.
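To make this notation concrete, the sketch below estimates F_{X_{i,j}}, Q_{X_{i,j}}, and the aggregate measure y_i from raw samples of performance. It is only an illustration of the definitions above, not the implementation from the paper's released Julia code; the function names and the caller-supplied `normalize` and `weights` are assumptions of this example.

```python
import numpy as np

def empirical_cdf(samples, x):
    """F_X(x): fraction of samples less than or equal to x."""
    return float(np.mean(np.asarray(samples) <= x))

def empirical_quantile(samples, alpha):
    """Q_X(alpha): smallest observed x with F_X(x) >= alpha."""
    s = np.sort(np.asarray(samples))
    idx = max(int(np.ceil(alpha * len(s))) - 1, 0)
    return s[idx]

def aggregate_performance(perf, normalize, weights):
    """Estimate y_i = sum_j q_j * E[g(X_{i,j}, j)] from samples.

    perf[i][j] is a list of performance samples for algorithm i on
    environment j, normalize(x, j) plays the role of g, and weights
    are the q_j (nonnegative, summing to one).
    """
    scores = []
    for samples_i in perf:
        y_i = 0.0
        for j, q_j in enumerate(weights):
            y_i += q_j * np.mean([normalize(x, j) for x in samples_i[j]])
        scores.append(y_i)
    return scores
```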
The primary quantities of interest in this paper are the aggregate performance measures for each algorithm and confidence intervals on that measure. Let y ∈ ℝ^{|A|} be a vector representing the aggregate performance for each algorithm. We desire confidence intervals, Y⁻, Y⁺ ∈ ℝ^{|A|} × ℝ^{|A|}, such that, for a confidence level δ ∈ (0, 0.5],

    Pr(∀i ∈ {1, 2, ..., |A|}, y_i ∈ [Y_i⁻, Y_i⁺]) ≥ 1 − δ.

To compute an aggregate performance measure and its confidence intervals that meet the criteria laid out in the introduction, one must consider the entire evaluation procedure. We view an evaluation procedure to have three main components: data collection, data aggregation, and reporting of the results. During the data collection phase, samples are collected of the performance metric X_{i,j} for each combination (i, j) of an algorithm i ∈ A and environment j ∈ M. In the data aggregation phase, all samples of performance are normalized so the metric on each environment is on a similar scale, then they are aggregated to provide a summary of each algorithm's performance across all environments. Lastly, the uncertainty of the results is quantified and reported.

3. Data Collection

In this section, we discuss how common data collection methods are unable to answer the general evaluation question and then present a new method that can. We first highlight the core difference in our approach to previous methods.

The main difference between data collection methods is in how the samples of performance are collected for each algorithm on each environment. Standard approaches rely on first tuning an algorithm's hyperparameters, i.e., any input to an algorithm that is not the environment, and then generating samples of performance. Our method instead relies on having a definition of an algorithm that can automatically select, sample, or adapt hyperparameters. This method can be used to answer the general evaluation question because its performance measure represents the knowledge required to use the algorithm. We discuss these approaches below.

3.1. Current Approaches

A typical evaluation procedure used in RL research is the tune-and-report method. As depicted in Figure 1, the tune-and-report method has two phases: a tuning phase and a testing phase. In the tuning phase, hyperparameters are optimized either manually or via a hyperparameter optimization algorithm. Then, after tuning, only the best hyperparameters are selected and executed for T trials using different random seeds to provide an estimate of performance.

[Figure 1. Data collection process of the tune-and-report method: for each algorithm–environment pair (i, j), hyperparameters are tuned for N iterations, then T trials are executed with the tuned hyperparameters and the results are stored. The yellow box indicates trials using different random seeds.]
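The procedure in Figure 1 can be restated as code. The sketch below is a schematic of tune-and-report as described above; `run_trial` and `sample_hyperparameters` are caller-supplied placeholders, not functions from the paper's codebase.

```python
import numpy as np

def tune_and_report(run_trial, sample_hyperparameters, n_tuning_iters, n_trials, seed=0):
    """Schematic of tune-and-report data collection (Figure 1).

    run_trial(hypers, seed) -> one sample of the performance metric.
    sample_hyperparameters(rng) -> one candidate hyperparameter setting.
    """
    rng = np.random.default_rng(seed)

    # Tuning phase: search for N iterations and keep only the best setting.
    best_score, best_hypers = -np.inf, None
    for _ in range(n_tuning_iters):
        hypers = sample_hyperparameters(rng)
        score = run_trial(hypers, seed=int(rng.integers(2**31)))
        if score > best_score:
            best_score, best_hypers = score, hypers

    # Testing phase: only the tuned setting is run for T trials; the
    # randomness of the tuning phase itself is never reported.
    return [run_trial(best_hypers, seed=int(rng.integers(2**31)))
            for _ in range(n_trials)]
```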
The tune-and-report data collection method does not satisfy the usability requirement or the scientific requirement. Recall that our objective is to capture the difficulty of using a particular algorithm. Because the tune-and-report method ignores the amount of data used to tune the hyperparameters, an algorithm that only works well after significant tuning could be favored over one that works well without environment-specific tuning, thus violating the requirements.

Consider an extreme example of an RL algorithm that includes all policy parameters as hyperparameters. This algorithm would then likely be optimal after any iteration of hyperparameter tuning that finds the optimal policy. This effect is more subtle in standard algorithms, where hyperparameter tuning infers problem-specific information about how to search for the optimal policy (e.g., how much exploration is needed, or how aggressive policy updates can be). Furthermore, this demotivates the creation of algorithms that are easier to use but do not improve performance after finding optimal hyperparameters.

The tune-and-report method violates the scientific property by not accurately capturing the uncertainty of performance. Multiple i.i.d. samples of performance are taken after hyperparameter tuning and used to compute a bound on the mean performance. However, these samples of performance do not account for the randomness due to hyperparameter tuning. As a result, any statistical claim would be inconsistent with repeated evaluations of this method. This has been observed in several studies where further hyperparameter tuning has shown no difference in performance relative to baseline methods (Lucic et al., 2018; Melis et al., 2018).

The evaluation procedure proposed by Dabney (2014) addresses issues with uncertainty due to hyperparameter tuning and performance not capturing the usability of algorithms. Dabney's evaluation procedure computes performance as a weighted average over all N iterations of hyperparameter tuning, and the entire tuning process repeats for T trials. Even though this evaluation procedure fixes the problems with the tune-and-report approach, it violates our computationally tractable property by requiring TN executions of the algorithm to produce just T samples of performance. In the case where N = 1, it is not clear how hyperparameters should be set. Furthermore, this style of evaluation does not cover the case where it is prohibitive to perform hyperparameter tuning, e.g., slow simulations, long agent lifetimes, lack of a simulator, and situations where it is dangerous or costly to deploy a bad policy. In these situations, it is desirable for algorithms to be insensitive to the choice of hyperparameters or able to adapt them during a single execution. It is in this setting that the general evaluation question can be answered.

3.2. Our Approach

In this section, we outline our method, complete data collection, which does not rely on hyperparameter tuning. If there were no hyperparameters to tune, evaluating algorithms would be simpler. Unfortunately, how to automatically set hyperparameters has been an understudied area. Thus, we introduce the notion of a complete algorithm definition.

Definition 1 (Algorithm Completeness). An algorithm is complete on an environment j when defined such that the only required input to the algorithm is meta-information about environment j, e.g., the number of state features and actions.

Algorithms with a complete definition can be used on an environment without specifying any hyperparameters. Note that this does not say that an algorithm cannot receive forms of problem-specific knowledge, only that it is not required. A well-defined algorithm will be able to infer effective combinations of hyperparameters or adapt them during learning. There are many ways to make an existing algorithm complete. In this work, algorithms are made complete by defining a distribution from which to randomly sample hyperparameters. Random sampling may produce poor or divergent behavior in the algorithm, but this only indicates that it is not yet known how to set the hyperparameters of the algorithm automatically; thus, when faced with a new problem, finding decent hyperparameters will be challenging. One way to make an adaptive complete algorithm is to include a hyperparameter optimization method in the algorithm. However, all tuning must be done within the same fixed amount of time and cannot propagate information over the trials used to obtain statistical significance.

[Figure 2. Data collection process using complete algorithm definitions: for each algorithm–environment pair (i, j), T trials are executed, each of which selects hyperparameters by the algorithm definition, runs the algorithm, and stores the results. The yellow box indicates using different random seeds.]

Figure 2 shows the complete data collection method. For this method we limit the scope of algorithms to only include ones with complete definitions; thus, it does not violate any of the properties specified. This method satisfies the scientific requirement since it is designed to answer the general evaluation question, and the uncertainty of performance can be estimated using all of the trials. Again, this data collection method captures the difficulty of using an algorithm, since the complete definition encodes the knowledge necessary for the algorithm to work effectively. The compute time of this method is tractable, since T executions of the algorithm produce T independent samples of performance.
The practical effects of using the complete data collection method are as follows. Researchers do not have to spend time tuning each algorithm to try and maximize performance. Fewer algorithm executions are required to obtain a statistically meaningful result. With this data collection method, improving upon algorithm definitions will become significant research contributions and lead to algorithms that are easy to apply to many problems.
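One way to read Definition 1 and Figure 2 in code is shown below: a complete definition bundles the hyperparameter distribution with the algorithm, so the only required input is environment meta-information. The class name, the sampling ranges, and the `run_trial` callable are illustrative assumptions, not the definitions used in the paper's experiments (those are in Appendix E and the released code).

```python
import numpy as np

class CompleteSarsaLambda:
    """Sketch of a complete definition: hyperparameters are sampled inside the
    algorithm from fixed distributions, so the only required input is
    meta-information about the environment (Definition 1)."""

    def __init__(self, env_info, rng):
        # env_info carries meta-information only, e.g., the number of state
        # features and actions. The sampling ranges below are illustrative
        # assumptions, not the ranges from the paper's Appendix E.
        self.n_features = env_info["n_features"]
        self.n_actions = env_info["n_actions"]
        self.step_size = 10.0 ** rng.uniform(-4.0, 0.0)    # log-uniform step size
        self.lam = rng.uniform(0.0, 1.0)                   # eligibility trace decay
        self.gamma = 1.0 - 10.0 ** rng.uniform(-3.0, 0.0)  # discount factor
        self.weights = np.zeros((self.n_actions, self.n_features))
        # The learning rule itself is the usual Sarsa(lambda) update; only the
        # way hyperparameters are chosen differs from the base algorithm.

def complete_data_collection(make_complete_algorithm, run_trial, n_trials, seed=0):
    """Schematic of complete data collection (Figure 2): every one of the
    T trials draws hyperparameters by the algorithm definition."""
    rng = np.random.default_rng(seed)
    return [run_trial(make_complete_algorithm(rng)) for _ in range(n_trials)]
```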
4. Data Aggregation

Answering the general evaluation question requires a ranking of algorithms according to their performance on all environments M. The aggregation step accomplishes this task by combining the performance data generated in the collection phase and summarizing it across all environments. However, data aggregation introduces several challenges. First, each environment has a different range of scores that need to be normalized to a common scale. Second, a uniform weighting of environments can introduce bias. For example, the set of environments might include many slight variants of one domain, giving that domain a larger weight than a single environment coming from a different domain.

4.1. Normalization

The goal in score normalization is to project scores from each environment onto the same scale while not being exploitable by the environment weighting. In this section, we first show how existing normalization techniques are exploitable or do not capture the properties of interest. Then we present our normalization technique: performance percentiles.

4.1.1. Current Approaches

We examine two normalization techniques: performance ratios and policy percentiles. We discuss other normalization methods in Appendix A. The performance ratio is commonly used with the Arcade Learning Environment to compare the performance of algorithms relative to human performance (Mnih et al., 2015; Machado et al., 2018). The performance ratio of two algorithms i and k on an environment j is E[X_{i,j}]/E[X_{k,j}]. This ratio is sensitive to the location and scale of the performance metric on each environment, such that an environment with scores in the range [0, 1] will produce larger differences than those on the range [1000, 1001]. Furthermore, all changes in performance are assumed to be equally challenging, i.e., going from a score of 0.8 to 0.89 is the same difficulty as 0.9 to 0.99. This assumption of linearity of difficulty is not reflected on environments with nonlinear changes in the score as an agent improves, e.g., completing levels in Super Mario.

A critical flaw in the performance ratio is that it can produce an arbitrary ordering of algorithms when combined with the arithmetic mean, \sum_j q_j E[X_{i,j}]/E[X_{k,j}] (Fleming & Wallace, 1986), meaning a different algorithm in the denominator could change the relative rankings. Using the geometric mean can address this weakness of performance ratios, but does not resolve the other issues.

Another normalization technique is policy percentiles, a method that projects the score of an algorithm through the performance CDF of random policy search (Dabney, 2014). The normalized score for an algorithm, i, is F_{X_{Π,j}}(X_{i,j}), where F_{X_{Π,j}} is the performance CDF when a policy is sampled uniformly from a set of policies, Π, on an environment j, i.e., π ∼ U(Π). Policy percentiles have a unique advantage in that performance is scaled according to how difficult it is to achieve that level of performance relative to random policy search. Unfortunately, policy percentiles rely on specifying Π, which often has a large search space. As a result, most policies will perform poorly, making all scores approach 1.0. It is also infeasible to use when random policy search is unlikely to achieve high levels of performance. Despite these drawbacks, the scaling of scores according to a notion of difficulty is desirable, so we adapt this idea to use any algorithm's performance as a reference distribution.
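Before moving on, the ordering problem with mean performance ratios noted above is easy to reproduce numerically. The numbers below are invented solely to illustrate the Fleming & Wallace (1986) point: swapping the reference algorithm in the denominator reverses the ranking under the arithmetic mean.

```python
import numpy as np

# Mean performance of two algorithms on two environments (invented numbers).
mean_score = {"X": np.array([2.0, 10.0]), "Y": np.array([4.0, 5.0])}
references = {"R1": np.array([1.0, 10.0]), "R2": np.array([4.0, 1.0])}
weights = np.array([0.5, 0.5])  # uniform environment weighting

for name, ref in references.items():
    aggregate = {a: float(np.sum(weights * mean_score[a] / ref)) for a in mean_score}
    ranking = sorted(aggregate, key=aggregate.get, reverse=True)
    print(name, aggregate, ranking)

# Reference R1 gives X: 1.5, Y: 2.25, ranking Y above X;
# reference R2 gives X: 5.25, Y: 3.0, ranking X above Y.
```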
4.1.2. Our Approach

An algorithm's performance distribution can have an interesting shape with large changes in performance that are due to divergence, lucky runs, or simply that small changes to a policy can result in large changes in performance (Jordan et al., 2018). These effects can be seen in Figure 3, where there is a quick rise in cumulative probability for a small increase in performance. Inspired by Dabney (2014)'s policy percentiles, we propose performance percentiles, a score normalization technique that can represent these intricacies.

The probability integral transform shows that projecting a random variable through its CDF transforms the variable to be uniform on [0, 1] (Dodge & Commenges, 2006). Thus, normalizing an algorithm's performance by its CDF will equally distribute and represent a linear scaling of difficulty across [0, 1]. When normalizing performance against another algorithm's performance distribution, the normalized score distribution will shift towards zero when the algorithm is worse than the normalizing distribution and shift towards one when it is superior. As seen in Figure 3, the CDF can
be seen as encoding the relative difficulty of achieving a given level of performance, where large changes in an algorithm's CDF output indicate a high degree of difficulty for that algorithm to make an improvement, and similarly small changes in output correspond to low change in difficulty. In this context, difficulty refers to the amount of random chance (luck) needed to achieve a given level of performance.

[Figure 3. This plot shows the CDF of average returns for the Sarsa-Parl2, Sarsa(λ), and Actor-Critic algorithms on the Cart-Pole environment. Each line represents the empirical CDF using 10,000 trials and the shaded regions represent the 95% confidence intervals. To illustrate how the performance percentiles work, this plot shows how samples of performance (black dots) are normalized by each CDF, producing the normalized scores (colored dots). The correspondence between a single sample and its normalized score is shown by the dotted line.]

To leverage these properties of the CDF, we define performance percentiles, which use a weighted average of each algorithm's CDF to normalize scores for each environment.

Definition 2 (Performance Percentile). In an evaluation of algorithms, A, the performance percentile for a score x on an environment, j, is F_{X̄_j}(x, w_j), where F_{X̄_j} is the mixture of CDFs

    F_{X̄_j}(x, w_j) := \sum_{i=1}^{|A|} w_{j,i} F_{X_{i,j}}(x),

with weights w_j ∈ ℝ^{|A|}, \sum_{i=1}^{|A|} w_{j,i} = 1, and w_{j,i} ≥ 0 for all i.

So we can say that performance percentiles capture the performance characteristic of an environment relative to some averaged algorithm. We discuss how to set the weights w_j in the next section.
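Definition 2 reduces to a weighted average of empirical CDFs, as in the sketch below. It is a direct transcription of the definition for a single environment, with the weights supplied by the caller (Section 4.2 describes how the paper chooses them); the function name and example data are assumptions of this illustration.

```python
import numpy as np

def performance_percentile(x, samples_per_algorithm, w):
    """F_{X_bar_j}(x, w_j) = sum_i w_{j,i} * F_{X_{i,j}}(x) for one environment j.

    samples_per_algorithm[i] holds performance samples of algorithm i on this
    environment; w[i] are the mixture weights (nonnegative, summing to one).
    """
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0.0) and np.isclose(w.sum(), 1.0)
    cdf_at_x = np.array([np.mean(np.asarray(s) <= x) for s in samples_per_algorithm])
    return float(np.dot(w, cdf_at_x))

# Example: a score between two algorithms' typical performance receives an
# intermediate normalized value under an equal-weight mixture.
rng = np.random.default_rng(0)
weak = rng.normal(0.0, 1.0, 10_000)
strong = rng.normal(3.0, 1.0, 10_000)
print(performance_percentile(1.5, [weak, strong], [0.5, 0.5]))  # roughly 0.5
```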
Performance percentiles are closely related to the concept of (probabilistic) performance profiles (Dolan & Moré, 2002; Barreto et al., 2010). The difference is that performance profiles report the cumulative distribution of normalized performance metrics over a set of tasks (environments), whereas performance percentiles are a technique for normalizing scores on each task (environment).

4.2. Summarization

A weighting over environments is needed to form an aggregate measure. We desire a weighting over environments such that no algorithm can exploit the weightings to increase its ranking. Additionally, for the performance percentiles, we need to determine the weighting of algorithms to use as the reference distribution. Inspired by the work of Balduzzi et al. (2018), we propose a weighting of algorithms and environments using the equilibrium of a two-player game. In this game, one player, p, will try to select an algorithm to maximize the aggregate performance, while a second player, q, chooses the environment and reference algorithm to minimize p's score. Player p's pure strategy space, S₁, is the set of algorithms A, i.e., p plays a strategy s₁ = i corresponding to an algorithm i. Player q's pure strategy space, S₂, is the cross product of a set of environments, M, and algorithms, A, i.e., player q plays a strategy s₂ = (j, k) corresponding to a choice of environment j and normalization algorithm k. We denote the pure strategy space of the game by S := S₁ × S₂. A strategy, s ∈ S, can be represented by a tuple s = (s₁, s₂) = (i, (j, k)).

The utility of strategy s is measured by payoff functions u_p : S → ℝ and u_q : S → ℝ for players p and q respectively. The game is defined to be zero sum, i.e., u_q(s) = −u_p(s). We define the payoff function to be u_p(s) := E[F_{X_{k,j}}(X_{i,j})]. Both players p and q sample strategies from probability distributions p ∈ Δ(S₁) and q ∈ Δ(S₂), where Δ(X) is the set of all probability distributions over X.

The equilibrium solution of this game naturally balances the normalization and environment weightings to counter each algorithm's strengths without conferring an advantage to a particular algorithm. Thus, the aggregate measure will be useful in answering the general evaluation question. After finding a solution (p*, q*), the aggregate performance measure y_i for an algorithm i is defined as

    y_i := \sum_{j=1}^{|M|} \sum_{k=1}^{|A|} q_{j,k} E[F_{X_{k,j}}(X_{i,j})].    (1)

To find a solution (p*, q*), we employ the α-Rank technique (Omidshafiei et al., 2019), which returns a stationary distribution over the pure strategy space S. α-Rank allows for efficient computation of both the equilibrium and confidence intervals on the aggregate performance (Rowland et al., 2019). We detail this method and details of our implementation in Appendix B.
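The quantities entering Equation (1) can be estimated directly from the collected samples, as in the sketch below: a payoff tensor of mean normalized performances, and the aggregate score for a given distribution q over (environment, reference) pairs. The α-Rank equilibrium itself is not reproduced here; this is only an illustration of how the pieces fit together, and the function names are assumptions.

```python
import numpy as np

def mean_normalized_performance(samples):
    """Estimate E[F_{X_{k,j}}(X_{i,j})] for every (i, j, k) from raw samples.

    samples[i][j] is an array of performance samples for algorithm i on
    environment j. Returns payoff[i, j, k], the empirical payoff u_p(s).
    """
    n_algs, n_envs = len(samples), len(samples[0])
    payoff = np.zeros((n_algs, n_envs, n_algs))
    for j in range(n_envs):
        for k in range(n_algs):
            ref = np.sort(np.asarray(samples[k][j], dtype=float))
            for i in range(n_algs):
                x = np.asarray(samples[i][j], dtype=float)
                # Empirical CDF of the reference algorithm evaluated at each x.
                cdf_at_x = np.searchsorted(ref, x, side="right") / len(ref)
                payoff[i, j, k] = cdf_at_x.mean()
    return payoff

def aggregate(payoff, q):
    """y_i = sum_{j,k} q_{j,k} E[F_{X_{k,j}}(X_{i,j})]  (Equation 1).

    q[j, k] is a distribution over (environment, reference algorithm) pairs;
    the paper obtains it from the alpha-Rank equilibrium, which is not
    reproduced in this sketch.
    """
    return np.tensordot(payoff, np.asarray(q), axes=([1, 2], [0, 1]))
```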
5. Reporting Results

As it is crucial to quantify the uncertainty of all claimed performance measures, we first discuss how to compute confidence intervals for both single-environment and aggregate measures, then give details on displaying the results.

5.1. Quantifying Uncertainty

In keeping with our objective to have a scientific evaluation, we require our evaluation procedure to quantify any uncertainty in the results. When concerned with only a single environment, standard concentration inequalities can compute confidence intervals on the mean performance. Similarly, when displaying the distribution of performance, one can apply standard techniques for bounding the empirical distribution of performance. However, computing confidence intervals on the aggregate has additional challenges.

Notice that in (1) computing the aggregate performance requires two unknown values: q* and the mean normalized performance, E[F_{X_{k,j}}(X_{i,j})]. Since q* depends on the mean normalized performance, any uncertainty in the mean normalized performance results in uncertainty in q*. To compute valid confidence intervals on the aggregate performance, the uncertainty through the entire process must be considered.

We introduce a process to compute the confidence intervals, which we refer to as performance bound propagation (PBP). We represent PBP as a function PBP : 𝒟 × ℝ → ℝ^{|A|} × ℝ^{|A|}, which maps a dataset D ∈ 𝒟 containing all samples of performance and a confidence level δ ∈ (0, 0.5] to vectors Y⁻ and Y⁺ representing the lower and upper confidence intervals, i.e., (Y⁻, Y⁺) = PBP(D, δ).

The overall procedure for PBP is as follows: first, compute confidence intervals for each F_{X_{i,j}}; then, using these intervals, compute confidence intervals on each mean normalized performance; next, determine an uncertainty set Q for q* that results from uncertainty in the mean normalized performance; finally, for each algorithm, find the minimum and maximum aggregate performance over the uncertainty in the mean normalized performances and Q. We provide pseudocode in Appendix C and source code in the repository.

We prove that PBP produces valid confidence intervals for a confidence level δ ∈ (0, 0.5] and a dataset D containing T_{i,j} > 1 samples of performance for all algorithms i ∈ A and environments j ∈ M.

Theorem 1. If (Y⁻, Y⁺) = PBP(D, δ), then

    Pr(∀i ∈ {1, 2, ..., |A|}, y_i ∈ [Y_i⁻, Y_i⁺]) ≥ 1 − δ.

Proof. Although the creation of valid confidence intervals is critical to this contribution, due to space restrictions it is presented in Appendix C.
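The first step of PBP, bounding each F_{X_{i,j}}, can be done with standard nonparametric tools. The sketch below bands an empirical CDF using the Dvoretzky–Kiefer–Wolfowitz inequality with Massart's tight constant (both appear in the references); whether PBP uses exactly this bound is specified in the paper's appendix and code, so treat this as an assumption-laden illustration rather than the authors' implementation.

```python
import numpy as np

def empirical_cdf_band(samples, delta):
    """Two-sided confidence band on an empirical CDF via the
    Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant:
    sup_x |F_n(x) - F(x)| <= sqrt(ln(2/delta) / (2n)) with prob. >= 1 - delta.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    f_hat = np.arange(1, n + 1) / n
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return x, np.clip(f_hat - eps, 0.0, 1.0), np.clip(f_hat + eps, 0.0, 1.0)

# Example band for one algorithm-environment pair.
rng = np.random.default_rng(1)
returns = rng.normal(loc=-100.0, scale=20.0, size=1_000)
xs, lower, upper = empirical_cdf_band(returns, delta=0.05)
```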
5.2. Displaying Results

In this section, we describe our method for reporting the results. There are three parts to our method: answering the stated hypothesis; providing tables and plots showing the performance and ranking of algorithms for all environments, along with the aggregate score; and, for each performance measure, providing confidence intervals to convey uncertainty.

The learning curve plot is a standard in RL and displays a performance metric (often the return) over regular intervals during learning. While this type of plot might be informative for describing some aspects of the algorithm's performance, it does not directly show the performance metric used to compare algorithms, making visual comparisons less obvious. Therefore, to provide the most information to the reader, we suggest plotting the distribution of performance for each algorithm on each environment. Plotting the distribution of performance has been suggested in many fields as a means to convey more information (Dolan & Moré, 2002; Farahmand et al., 2010; Reimers & Gurevych, 2017; Cohen et al., 2018). Often in RL the objective is to maximize a metric, so we suggest showing the quantile function over the CDF, as it allows for a more natural interpretation of the performance, i.e., the higher the curve, the better the performance (Bellemare et al., 2013). Figure 4 shows the performance distribution with 95% confidence intervals for different sample sizes. It is worth noting that when tuning hyperparameters the data needed to compute these distributions is already being collected, but only the results from the tuned runs are being reported. Reporting only the tuned performance shows what an algorithm can achieve, not what it is likely to achieve.

[Figure 4. This plot shows the distribution of average returns for the Actor-Critic algorithm on the Acrobot environment. The x-axis represents a probability and the y-axis represents the average return such that the proportion of trials that have a value less than or equal to y is x, e.g., at x = 0.5, y is the median. Each line represents the empirical quantile function using a different number of trials and the shaded regions represent the 95% confidence intervals computed using concentration inequalities. In this plot, the larger the area under the curve, the better the performance. This plot highlights the large amount of uncertainty when using small sample sizes and how much it decreases with more samples.]

6. Experimental Results

In this section, we describe and report the results of experiments to illustrate how this evaluation procedure can answer the general evaluation question and identify when a modification to an algorithm or its definition improves performance. We also investigate the reliability of different bounding techniques on the aggregate performance measure.

6.1. Experiment Description

To demonstrate the evaluation procedure we compare the algorithms: Actor-Critic with eligibility traces (AC) (Sutton & Barto, 2018), Q(λ) and Sarsa(λ) (Sutton & Barto, 1998), NAC-TD (Morimura et al., 2005; Degris et al., 2012; Thomas, 2014), and proximal policy optimization (PPO) (Schulman et al., 2017). The learning rate is often the most sensitive hyperparameter in RL algorithms, so we include three versions of Sarsa(λ), Q(λ), and AC: a base version, a version that scales the step size with the number of parameters (e.g., Sarsa(λ)-s), and an adaptive step-size method, Parl2 (Dabney, 2014), that does not require specifying the step size. Since none of these algorithms have an existing complete definition, we create one by randomly sampling hyperparameters from fixed ranges. We consider all parameters necessary to construct each algorithm, e.g., step size, function approximator, discount factor, and eligibility trace decay. For the continuous-state environments, each algorithm employs linear function approximation using the Fourier basis (Konidaris et al., 2011) with a randomly sampled order. See Appendix E for full details of each algorithm.
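For reference, the Fourier basis of Konidaris et al. (2011) maps a state rescaled to [0, 1]^d onto features cos(π c · s) for integer coefficient vectors c. The sketch below constructs those features and samples an order, mirroring the "randomly sampled order" mentioned above; the sampling range is an assumption of this example, not the value from Appendix E.

```python
import itertools
import numpy as np

def fourier_basis(order, n_dims):
    """Integer coefficient vectors c in {0, ..., order}^n_dims, one per feature."""
    return np.array(list(itertools.product(range(order + 1), repeat=n_dims)))

def fourier_features(state01, coefficients):
    """Features cos(pi * c . s) for a state rescaled to [0, 1]^n_dims."""
    return np.cos(np.pi * coefficients @ np.asarray(state01, dtype=float))

# Illustrative complete-definition style sampling of the basis order.
rng = np.random.default_rng(2)
order = int(rng.integers(1, 10))       # assumed range; not the paper's Appendix E values
C = fourier_basis(order, n_dims=2)     # e.g., Mountain Car has two state variables
phi = fourier_features([0.3, 0.7], C)  # feature vector of length (order + 1) ** 2
```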
These algorithms are evaluated on 15 environments: eight discrete MDPs, half with stochastic transition dynamics, and seven continuous-state environments: Cart-Pole (Florian, 2007), Mountain Car (Sutton & Barto, 1998), Acrobot (Sutton, 1995), and four variations of the pinball environment (Konidaris & Barto, 2009; Geramifard et al., 2015). For each independent trial, the environments have their dynamics randomly perturbed to help mitigate environment overfitting (Whiteson et al., 2011); see code for details. For further details about the experiment see Appendix F.

While these environments have simple features compared to the Arcade Learning Environment (Bellemare et al., 2013), they remain useful in evaluating RL algorithms for three reasons. First, experiments finish quickly. Second, the environments provide interesting insights into an algorithm's behavior. Third, as our results will show, there is not yet a complete algorithm that can reliably solve each one.

We execute each algorithm on each environment for 10,000 trials. While this number of trials may seem excessive, our goal is to detect a statistically meaningful result. Detecting such a result is challenging because the variance of RL algorithms' performance is high; we are comparing |A| × |M| = 165 random variables, and we do not assume the performances are normally distributed. Computationally, executing ten thousand trials is not burdensome if one uses an efficient programming language such as Julia (Bezanson et al., 2017) or C++, where we have noticed approximately two orders of magnitude faster execution than similar Python implementations. We investigate using smaller sample sizes at the end of this section.

6.2. Algorithm Comparison

The aggregate performance measures and confidence intervals are illustrated in Figure 5 and given in Table 1. Appendix I lists the performance tables and distribution plots for each environment. Examining the empirical performances in these figures, we notice two trends. The first is that our evaluation procedure can identify differences that are not noticeable in standard evaluations. For example, all algorithms perform near optimally when tuned properly (indicated by the high end of the performance distribution). The primary differences between algorithms are in the frequency of high performance and divergence (indicated by the low end of the performance distribution). Parl2 methods rarely diverge, giving a large boost in performance relative to the standard methods.

Table 1. Aggregate performance measures for each algorithm and their rank. The parentheses contain the intervals computed using PBP, and together all hold with 95% confidence. The bolded numbers identify the best-ranked statistically significant differences.

Algorithm      Score (95% interval)        Rank
Sarsa-Parl2    0.4623 (0.3904, 0.5537)     1 (2, 1)
Q-Parl2        0.4366 (0.3782, 0.5632)     2 (2, 1)
AC-Parl2       0.1578 (0.0765, 0.3129)     3 (11, 3)
Sarsa(λ)-s     0.0930 (0.0337, 0.2276)     4 (11, 3)
AC-s           0.0851 (0.0305, 0.2146)     5 (11, 3)
Sarsa(λ)       0.0831 (0.0290, 0.2019)     6 (11, 3)
AC             0.0785 (0.0275, 0.2033)     7 (11, 3)
Q(λ)-s         0.0689 (0.0237, 0.1973)     8 (11, 3)
Q(λ)           0.0640 (0.0214, 0.1780)     9 (11, 3)
NAC-TD         0.0516 (0.0180, 0.1636)     10 (11, 3)
PPO            0.0508 (0.0169, 0.1749)     11 (11, 3)
The second trend is that our evaluation procedure can identify when theoretical properties do or do not make an algorithm more usable. For example, Sarsa(λ) algorithms outperform their Q(λ) counterparts. This result might stem from the fact that Sarsa(λ) is known to converge with linear function approximation (Perkins & Precup, 2002) while Q(λ) is known to diverge (Baird, 1995; Wiering, 2004). Additionally, NAC-TD performs worse than AC despite that natural gradients are a superior ascent direction. This result is due in part to the fact that it is unknown how to set the three step sizes in NAC-TD, making it more difficult to use than AC. Together these observations point out the deficiency in the way new algorithms have been evaluated. That is, tuning hyperparameters hides the lack of knowledge required to use the algorithm, introducing bias that favors the new algorithm. In contrast, our method forces this knowledge to be encoded into the algorithm, leading to a more fair and reliable comparison.

[Figure 5. The aggregate performance for each algorithm with confidence intervals using PBP, PBP-t, and bootstrap. The width of each interval is scaled so all intervals hold with 95% confidence.]

6.3. Experiment Uncertainty

While the trends discussed above might hold true in general, we must quantify our uncertainty. Based on the confidence intervals given using PBP, we claim with 95% confidence that on these environments and according to our algorithm definitions, Sarsa-Parl2 and Q-Parl2 have a higher aggregate performance of average returns than all other algorithms in the experiment. It is clear that 10,000 trials per algorithm per environment is not enough to detect a unique ranking of algorithms using the nonparametric confidence intervals in PBP. We now consider alternative methods: PBP-t and the percentile bootstrap. PBP-t replaces the nonparametric intervals in PBP with ones based on the Student's t-distribution. We detail these methods in Appendix G. From Figure 5, it is clear that both alternative bounds are tighter and thus useful in detecting differences. Since the assumptions of these bounds are different and not typically satisfied, it is unclear if they are valid.

To test the different bounding techniques, we estimate the failure rate of each confidence interval technique at different sample sizes. For this experiment we execute 1,000 trials of the evaluation procedure using sample sizes (trials per algorithm per environment) of 10, 30, 100, 1,000, and 10,000. There are a total of 11.14 million samples per algorithm per environment. To reduce computation costs, we limit this experiment to only include Sarsa(λ)-Parl2, Q(λ)-Parl2, AC-Parl2, and Sarsa(λ)-s. Additionally, we reduce the environment set to be the discrete environments and Mountain Car. We compute the failure rate of the confidence intervals, where a valid confidence interval will have a failure rate less than or equal to δ, e.g., for δ = 0.05 the failure rate should be at most 5%. We report the failure rate and the proportion of statistically significant pairwise comparisons in Table 2. All methods use the same data, so the results are not independent.

Table 2. Failure rate (FR) and proportion of significant pairwise comparisons (SIG) identified for δ = 0.05 using different bounding techniques and sample sizes. The first column gives the sample size; the remaining column pairs give the results for the PBP, PBP-t, and bootstrap bound methods, respectively. For each sample size, 1,000 experiments were conducted.

            PBP            PBP-t           Bootstrap
Samples    FR     SIG     FR      SIG     FR      SIG
10         0.0    0.0     1.000   0.00    0.112   0.11
30         0.0    0.0     0.000   0.00    0.092   0.37
100        0.0    0.0     0.000   0.02    0.084   0.74
1,000      0.0    0.0     0.000   0.34    0.057   0.83
10,000     0.0    0.33    0.003   0.83    0.069   0.83

The PBP method has zero failures, indicating it is overly conservative. The failure rate of PBP-t is expected to converge to zero as the number of samples increases due to the central limit theorem. PBP-t begins to identify significant results at a sample size of 1,000, but it is only at 10,000 that it can identify all pairwise differences.² The bootstrap technique has the tightest intervals, but has a high failure rate.

²Sarsa-Parl2 and Q-Parl2 have similar performance on discrete environments, so we consider detecting 83% of results optimal.

These results are stochastic and will not necessarily hold with different numbers of algorithms and environments. So, one should use caution in making claims that rely on either PBP-t or bootstrap. Nevertheless, to detect statistically significant results, we recommend running between 1,000 and 10,000 samples, and using PBP-t over bootstrap.
While this number of trials seems high, it is a necessity, as comparison of multiple algorithms over many environments is a challenging statistical problem with many sources of uncertainty. Thus, one should be skeptical of results that use substantially fewer trials. Additionally, researchers are already conducting many trials that go unreported when tuning hyperparameters. Since our method requires no hyperparameter tuning, researchers can instead spend the same amount of time collecting more trials that can be used to quantify uncertainty.

There are a few ways that the number of trials needed can be reduced. The first is to think carefully about what question one should answer so that only a few algorithms and environments are required. The second is to use active sampling techniques to determine when to stop generating samples of performance for each algorithm–environment pair (Rowland et al., 2019). It is important to caution the reader that this process can bias the results if the sequential tests are not accounted for (Howard et al., 2018).

Summarizing our experiments, we make the following observations. Our experiments with complete algorithms show that there is still more work required to make standard RL algorithms work reliably on even extremely simple benchmark problems. As a result of our evaluation procedure, we were able to identify performance differences in algorithms that are not noticeable under standard evaluation procedures. The tests of the confidence intervals suggest that both PBP and PBP-t provide reliable estimates of uncertainty. These outcomes suggest that this evaluation procedure will be useful in comparing the performance of RL algorithms.

7. Related Work

This paper is not the first to investigate and address issues in empirically evaluating algorithms. The evaluation of algorithms has become a significant enough topic to spawn its own field of study, known as experimental algorithmics (Fleischer et al., 2002; McGeoch, 2012).

In RL, there have been significant efforts to discuss and improve the evaluation of algorithms (Whiteson & Littman, 2011). One common theme has been to produce shared benchmark environments, such as those in the annual reinforcement learning competitions (Whiteson et al., 2010; Dimitrakakis et al., 2014), the Arcade Learning Environment (Bellemare et al., 2013), and numerous others that are too long to list here. Recently, there has been a trend of explicit investigations into the reproducibility of reported results (Henderson et al., 2018; Islam et al., 2017; Khetarpal et al., 2018; Colas et al., 2018). These efforts are in part due to the inadequate experimental practices and reporting in RL and general machine learning (Pineau et al., 2020; Lipton & Steinhardt, 2018). Similar to these studies, this work has been motivated by the need for a more reliable evaluation procedure to compare algorithms. The primary difference in our work to these is that the knowledge required to use an algorithm gets included in the performance metric.

An important aspect of evaluation not discussed so far in this paper is competitive versus scientific testing (Hooker, 1995). Competitive testing is the practice of having algorithms compete for top performance on benchmark tasks. Scientific testing is the careful experimentation of algorithms to gain insight into how an algorithm works. The main difference in these two approaches is that competitive testing only says which algorithms worked well but not why, whereas scientific testing directly investigates the what, when, how, or why better performance can be achieved.

There are several examples of recent works using scientific testing to expand our understanding of commonly used methods. Lyle et al. (2019) compare distributional RL approaches using different function approximation schemes, showing that distributional approaches are only effective when nonlinear function approximation is used. Tucker et al. (2018) explore the sources of variance reduction in action-dependent control variates, showing that improvement was small or due to additional bias. Witty et al. (2018) and Atrey et al. (2020) investigate learned behaviors of an agent playing Atari 2600 games using ToyBox (Foley et al., 2018), a tool designed explicitly to enable carefully controlled experimentation of RL agents. While at first glance the techniques developed here seem to be compatible only with competitive testing, this is only because we specified a question with a competitive answer. The techniques developed here, particularly complete algorithm definitions, can be used to accurately evaluate the impact of various algorithmic choices. This allows for the careful experimentation needed to determine what components are essential to an algorithm.

8. Conclusion

The evaluation framework that we propose provides a principled method for evaluating RL algorithms. This approach facilitates fair comparisons of algorithms by removing unintentional biases common in the research setting. By developing a method to establish high-confidence bounds over this approach, we provide the framework necessary for reliable comparisons. We hope that our provided implementations will allow other researchers to easily leverage this approach to report the performances of the algorithms they create.

Acknowledgements

The authors would like to thank Kaleigh Clary, Emma Tosch, and members of the Autonomous Learning Laboratory: Blossom Metevier, James Kostas, and Chris Nota, for discussion and feedback on various versions of this manuscript. Additionally, we would like to thank the reviewers and meta-reviewers for their comments, which helped improve this paper.
This work was performed in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative. This work was supported in part by a gift from Adobe. This work was supported in part by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor. Research reported in this paper was sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

Anderson, T. W. Confidence limits for the value of an arbitrary bounded random variable with a continuous distribution function. Bulletin of The International and Statistical Institute, 43:249–251, 1969.

Atrey, A., Clary, K., and Jensen, D. D. Exploratory not explanatory: Counterfactual analysis of saliency maps for deep reinforcement learning. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, 2020.

Baird, L. C. Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A. and Russell, S. J. (eds.), Machine Learning, Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann, 1995.

Balduzzi, D., Tuyls, K., Pérolat, J., and Graepel, T. Re-evaluating evaluation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS, pp. 3272–3283, 2018.

Barreto, A. M. S., Bernardino, H. S., and Barbosa, H. J. C. Probabilistic performance profiles for the experimental evaluation of stochastic algorithms. In Pelikan, M. and Branke, J. (eds.), Genetic and Evolutionary Computation Conference, GECCO, pp. 751–758. ACM, 2010.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, jun 2013.

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.

Cohen, D., Jordan, S. M., and Croft, W. B. Distributed evaluations: Ending neural point metrics. CoRR, abs/1806.03790, 2018.

Colas, C., Sigaud, O., and Oudeyer, P. How many random seeds? Statistical power analysis in deep reinforcement learning experiments. CoRR, abs/1806.08295, 2018.

Csáji, B. C., Jungers, R. M., and Blondel, V. D. Pagerank optimization by edge selection. Discrete Applied Mathematics, 169:73–87, 2014.

Dabney, W. C. Adaptive step-sizes for reinforcement learning. PhD thesis, University of Massachusetts Amherst, 2014.

de Kerchove, C., Ninove, L., and Dooren, P. V. Maximizing pagerank via outlinks. CoRR, abs/0711.2867, 2007.

Degris, T., Pilarski, P. M., and Sutton, R. S. Model-free reinforcement learning with continuous action in practice. In American Control Conference, ACC, pp. 2177–2182, 2012.

Dimitrakakis, C., Li, G., and Tziortziotis, N. The reinforcement learning competition 2014. AI Magazine, 35(3):61–65, 2014.

Dodge, Y. and Commenges, D. The Oxford Dictionary of Statistical Terms. Oxford University Press on Demand, 2006.

Dolan, E. D. and Moré, J. J. Benchmarking optimization software with performance profiles. Math. Program., 91(2):201–213, 2002.

Dulac-Arnold, G., Mankowitz, D. J., and Hester, T. Challenges of real-world reinforcement learning. CoRR, abs/1904.12901, 2019.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of a sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics, 27:642–669, 1956.

Farahmand, A. M., Ahmadabadi, M. N., Lucas, C., and Araabi, B. N. Interaction of culture-based learning and cooperative co-evolution and its application to automatic behavior-based system design. IEEE Trans. Evolutionary Computation, 14(1):23–57, 2010.

Fercoq, O., Akian, M., Bouhtou, M., and Gaubert, S. Ergodic control and polyhedral approaches to pagerank optimization. IEEE Trans. Automat. Contr., 58(1):134–148, 2013.
Fleischer, R., Moret, B. M. E., and Schmidt, E. M. (eds.). Experimental Algorithmics, From Algorithm Design to Robust and Efficient Software [Dagstuhl seminar, September 2000], volume 2547 of Lecture Notes in Computer Science, 2002. Springer.

Fleming, P. J. and Wallace, J. J. How not to lie with statistics: The correct way to summarize benchmark results. Commun. ACM, 29(3):218–221, 1986.

Florian, R. V. Correct equations for the dynamics of the cart-pole system. Center for Cognitive and Neural Studies (Coneural), Romania, 2007.

Foley, J., Tosch, E., Clary, K., and Jensen, D. Toybox: Better Atari environments for testing reinforcement learning agents. In NeurIPS 2018 Workshop on Systems for ML, 2018.

Geramifard, A., Dann, C., Klein, R. H., Dabney, W., and How, J. P. RLPy: A value-function-based reinforcement learning framework for education and research. Journal of Machine Learning Research, 16:1573–1578, 2015.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 3207–3214, 2018.

Hooker, J. N. Testing heuristics: We have it all wrong. Journal of Heuristics, 1(1):33–42, 1995.

Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. S. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv: Statistics Theory, 2018.

Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. CoRR, abs/1708.04133, 2017.

Jordan, S. M., Cohen, D., and Thomas, P. S. Using cumulative distribution based performance analysis to benchmark models. In Critiquing and Correcting Trends in Machine Learning Workshop at Neural Information Processing Systems, 2018.

Khetarpal, K., Ahmed, Z., Cianflone, A., Islam, R., and Pineau, J. Re-evaluate: Reproducibility in evaluating reinforcement learning algorithms. 2018.

Konidaris, G., Osentoski, S., and Thomas, P. S. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI, 2011.

Konidaris, G. D. and Barto, A. G. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22, pp. 1015–1023. Curran Associates, Inc., 2009.

Lipton, Z. C. and Steinhardt, J. Troubling trends in machine learning scholarship. CoRR, abs/1807.03341, 2018.

Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 31, pp. 698–707, 2018.

Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. In The Thirty-Third AAAI Conference on Artificial Intelligence, pp. 4504–4511. AAAI Press, 2019.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. J. Artif. Intell. Res., 61:523–562, 2018.

Massart, P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.

McGeoch, C. C. A Guide to Experimental Algorithmics. Cambridge University Press, 2012.

Melis, G., Dyer, C., and Blunsom, P. On the state of the art of evaluation in neural language models. In 6th International Conference on Learning Representations, ICLR. OpenReview.net, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Morimura, T., Uchibe, E., and Doya, K. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and Its Applications, pp. 256–263, 2005.

Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R. α-rank: Multi-agent evaluation by evolution. Scientific Reports, 9(1):1–29, 2019.

Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
Perkins, T. J. and Precup, D. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems 15, pp. 1595–1602. MIT Press, 2002.

Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Larochelle, H. Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program). CoRR, abs/2003.12206, 2020.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994.

Reimers, N. and Gurevych, I. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 338–348, 2017.

Rowland, M., Omidshafiei, S., Tuyls, K., Pérolat, J., Valko, M., Piliouras, G., and Munos, R. Multiagent evaluation under incomplete information. In Advances in Neural Information Processing Systems 32, NeurIPS, pp. 12270–12282, 2019.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pp. 1038–1044, 1995.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Thomas, P. Bias in natural actor-critic algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML, pp. 441–448, 2014.

Tucker, G., Bhupatiraju, S., Gu, S., Turner, R. E., Ghahramani, Z., and Levine, S. The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 5022–5031, 2018.

Whiteson, S. and Littman, M. L. Introduction to the special issue on empirical evaluations in reinforcement learning. Mach. Learn., 84(1-2):1–6, 2011.

Whiteson, S., Tanner, B., and White, A. M. Report on the 2008 reinforcement learning competition. AI Magazine, 31(2):81–94, 2010.

Whiteson, S., Tanner, B., Taylor, M. E., and Stone, P. Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Programming And Reinforcement Learning, ADPRL, pp. 120–127. IEEE, 2011.

Wiering, M. Convergence and divergence in standard and averaging reinforcement learning. In Machine Learning: ECML 2004, 15th European Conference on Machine Learning, volume 3201 of Lecture Notes in Computer Science, pp. 477–488. Springer, 2004.

Williams, R. J. and Baird, L. C. Tight performance bounds on greedy policies based on imperfect value functions. 1993.

Witty, S., Lee, J. K., Tosch, E., Atrey, A., Littman, M., and Jensen, D. Measuring and characterizing generalization in deep reinforcement learning. arXiv preprint arXiv:1812.02868, 2018.