Evaluating the Performance of Reinforcement Learning Algorithms
Scott M. Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, Philip S. Thomas
2. Notation and Preliminaries
In this section, we give notation used in this paper along with an overview of an evaluation procedure. In addition to this section, a list of symbols used in this paper is presented in Appendix C. We represent a performance metric of an algorithm, i ∈ A, on an environment, j ∈ M, as a random variable Xi,j. This representation captures the variability of results due to the choice of the random seed controlling the stochastic processes in the algorithm and the environment. The choice of the metric depends on the property being studied and is up to the experiment designer. The performance metric used in this paper is the average of the observed returns from one execution of the entire training procedure, which we refer to as the average return. The cumulative distribution function (CDF), FXi,j : R → [0, 1], describes the performance distribution of algorithm i on environment j such that FXi,j(x) := Pr(Xi,j ≤ x). The quantile function, QXi,j(α) := inf{x ∈ R | FXi,j(x) ≥ α}, maps a cumulative probability, α ∈ (0, 1), to a score such that α proportion of samples of Xi,j are less than or equal to QXi,j(α). A normalization function, g : R × M → R, maps a score, x, that an algorithm receives on an environment, j, to a normalized score, g(x, j), which has a common scale for all environments. In this work, we seek an aggregate performance measure, yi ∈ R, for an algorithm, i, such that yi := Σ_{j=1}^{|M|} qj E[g(Xi,j, j)], where qj ≥ 0 for all j ∈ {1, . . . , |M|} and Σ_{j=1}^{|M|} qj = 1. In Section 4, we discuss choices for the normalizing function g and weightings q that satisfy the properties specified in the introduction.

The primary quantities of interest in this paper are the aggregate performance measures for each algorithm and confidence intervals on those measures. Let y ∈ R^|A| be a vector representing the aggregate performance for each algorithm. We desire confidence intervals, Y−, Y+ ∈ R^|A| × R^|A|, such that, for a confidence level δ ∈ (0, 0.5], all aggregate performances are captured simultaneously, i.e., Pr(∀i ∈ {1, . . . , |A|}, Yi− ≤ yi ≤ Yi+) ≥ 1 − δ.

Figure 1. Data collection process of the tune-and-report method. The yellow box indicates trials using different random seeds.

During the data collection phase, samples are collected of the performance metric Xi,j for each combination (i, j) of an algorithm i ∈ A and environment j ∈ M. In the data aggregation phase, all samples of performance are normalized so the metric on each environment is on a similar scale, then they are aggregated to provide a summary of each algorithm's performance across all environments. Lastly, the uncertainty of the results is quantified and reported.

3. Data Collection

In this section, we discuss how common data collection methods are unable to answer the general evaluation question and then present a new method that can. We first highlight the core difference in our approach to previous methods.

The main difference between data collection methods is in how the samples of performance are collected for each algorithm on each environment. Standard approaches rely on first tuning an algorithm's hyperparameters, i.e., any input to an algorithm that is not the environment, and then generating samples of performance. Our method instead relies on having a definition of an algorithm that can automatically select, sample, or adapt hyperparameters. This method can be used to answer the general evaluation question because its performance measure represents the knowledge required to use the algorithm. We discuss these approaches below.
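To make this idea concrete, the sketch below (ours, not code from the paper) treats a complete algorithm definition as a function that samples its own hyperparameters on every trial; the hyperparameter ranges and the train interface are illustrative assumptions, not the paper's implementation.

```python
import random

def complete_sarsa_lambda(train, env, seed):
    """A 'complete' algorithm definition: hyperparameters are chosen internally
    (here by random sampling from fixed, illustrative ranges), so no manual
    tuning is needed. `train` is any function mapping (env, hyperparameters,
    seed) to the average return of one full training run."""
    rng = random.Random(seed)
    hyperparameters = {
        "step_size": 10 ** rng.uniform(-3.0, 0.0),  # log-uniform step size
        "trace_decay": rng.uniform(0.0, 1.0),       # eligibility trace decay (lambda)
        "fourier_order": rng.randint(1, 9),         # order of a linear Fourier basis
    }
    return train(env, hyperparameters, seed)

def collect_performance(algorithm, train, env, num_trials):
    # Each trial uses a different random seed, so the returned list holds
    # independent samples of the performance metric X_{i,j}.
    return [algorithm(train, env, seed) for seed in range(num_trials)]
```

Under this view, the effort of choosing hyperparameters is part of what is measured: if the sampled settings often fail, that failure shows up directly in the algorithm's performance distribution.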
This data collection method satisfies the scientific requirement since it is designed to answer the general evaluation question, and the uncertainty of performance can be estimated using all of the trials. Again, this data collection method captures the difficulty of using an algorithm since the complete definition encodes the knowledge necessary for the algorithm to work effectively. The compute time of this method is tractable, since T executions of the algorithm produce T independent samples of performance.

The practical effects of using the complete data collection method are as follows. Researchers do not have to spend time tuning each algorithm to try and maximize performance. Fewer algorithm executions are required to obtain a statistically meaningful result. With this data collection method, improving upon algorithm definitions will become a significant research contribution and lead to algorithms that are easy to apply to many problems.

4. Data Aggregation

Answering the general evaluation question requires a ranking of algorithms according to their performance on all environments M. The aggregation step accomplishes this task by combining the performance data generated in the collection phase and summarizing it across all environments. However, data aggregation introduces several challenges. First, each environment has a different range of scores that need to be normalized to a common scale. Second, a uniform weighting of environments can introduce bias. For example, the set of environments might include many slight variants of one domain, giving that domain a larger weight than a single environment coming from a different domain.

4.1. Normalization

The goal in score normalization is to project scores from each environment onto the same scale while not being exploitable by the environment weighting. In this section, we first show how existing normalization techniques are exploitable or do not capture the properties of interest. Then we present our normalization technique: performance percentiles.

4.1.1. Current Approaches

We examine two normalization techniques: performance ratios and policy percentiles. We discuss other normalization methods in Appendix A. The performance ratio is commonly used with the Arcade Learning Environment to compare the performance of algorithms relative to human performance (Mnih et al., 2015; Machado et al., 2018). The performance ratio of two algorithms i and k on an environment j is E[Xi,j]/E[Xk,j]. This ratio is sensitive to the location and scale of the performance metric on each environment, such that an environment with scores in the range [0, 1] will produce larger differences than one with scores in the range [1000, 1001]. Furthermore, all changes in performance are assumed to be equally challenging, i.e., going from a score of 0.8 to 0.89 is as difficult as going from 0.9 to 0.99. This assumption of linearity of difficulty is not reflected on environments with nonlinear changes in the score as an agent improves, e.g., completing levels in Super Mario.

A critical flaw in the performance ratio is that it can produce an arbitrary ordering of algorithms when combined with the arithmetic mean, Σ_j qj E[Xi,j]/E[Xk,j] (Fleming & Wallace, 1986), meaning a different algorithm in the denominator could change the relative rankings. Using the geometric mean can address this weakness of performance ratios, but does not resolve the other issues.
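To see how arbitrary the ordering can be, consider this small numerical check (our own made-up mean scores, not results from the paper): with the arithmetic mean of ratios, each of two algorithms appears roughly five times better than the other depending on which one sits in the denominator, while the geometric mean does not exhibit this reversal.

```python
import numpy as np

# Hypothetical mean performances of algorithms A and B on two environments.
mean_A = np.array([10.0, 1.0])
mean_B = np.array([1.0, 10.0])

print((mean_A / mean_B).mean())   # 5.05 -> "A is ~5x better than B"
print((mean_B / mean_A).mean())   # 5.05 -> "B is ~5x better than A"

# The geometric mean is invariant to the choice of denominator (here, a tie).
print((mean_A / mean_B).prod() ** 0.5)   # 1.0
print((mean_B / mean_A).prod() ** 0.5)   # 1.0
```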
Another normalization technique is policy percentiles, a method that projects the score of an algorithm through the performance CDF of random policy search (Dabney, 2014). The normalized score for an algorithm, i, is FXΠ,j(Xi,j), where FXΠ,j is the performance CDF when a policy is sampled uniformly from a set of policies, Π, on an environment j, i.e., π ∼ U(Π). Policy percentiles have a unique advantage in that performance is scaled according to how difficult it is to achieve that level of performance relative to random policy search. Unfortunately, policy percentiles rely on specifying Π, which often has a large search space. As a result, most policies will perform poorly, making all scores approach 1.0. It is also infeasible to use when random policy search is unlikely to achieve high levels of performance. Despite these drawbacks, the scaling of scores according to a notion of difficulty is desirable, so we adapt this idea to use any algorithm's performance as a reference distribution.

4.1.2. Our Approach

An algorithm's performance distribution can have an interesting shape with large changes in performance that are due to divergence, lucky runs, or simply that small changes to a policy can result in large changes in performance (Jordan et al., 2018). These effects can be seen in Figure 3, where there is a quick rise in cumulative probability for a small increase in performance. Inspired by Dabney (2014)'s policy percentiles, we propose performance percentiles, a score normalization technique that can represent these intricacies.

The probability integral transform shows that projecting a random variable through its CDF transforms the variable to be uniform on [0, 1] (Dodge & Commenges, 2006). Thus, normalizing an algorithm's performance by its CDF will equally distribute scores and represent a linear scaling of difficulty across [0, 1]. When normalizing performance against another algorithm's performance distribution, the normalized score distribution will shift towards zero when the algorithm is worse than the normalizing distribution and shift towards one when it is superior.
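As a concrete illustration (a sketch of ours, not the paper's implementation), normalizing one algorithm's scores through another algorithm's empirical CDF can be done as follows; the score arrays are placeholder data.

```python
import numpy as np

def empirical_cdf(reference_scores):
    """Return F(x): the fraction of reference scores less than or equal to x."""
    ref = np.sort(np.asarray(reference_scores))
    return lambda x: np.searchsorted(ref, x, side="right") / ref.size

rng = np.random.default_rng(0)
scores_i = rng.normal(80.0, 20.0, size=1000)  # algorithm i on environment j (placeholder data)
scores_k = rng.normal(60.0, 25.0, size=1000)  # reference algorithm k on the same environment

F_k = empirical_cdf(scores_k)
normalized = F_k(scores_i)   # normalized scores in [0, 1]
print(normalized.mean())     # above 0.5 here, since i tends to outperform k
```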
As seen in Figure 3, the CDF can be seen as encoding the relative difficulty of achieving a given level of performance, where large changes in an algorithm's CDF output indicate a high degree of difficulty for that algorithm to make an improvement, and similarly, small changes in output correspond to a low change in difficulty. In this context, difficulty refers to the amount of random chance (luck) needed to achieve a given level of performance.

Figure 3. This plot shows the CDF of average returns for the Sarsa-Parl2, Sarsa(λ), and Actor-Critic algorithms on the Cart-Pole environment. Each line represents the empirical CDF using 10,000 trials and the shaded regions represent the 95% confidence intervals. To illustrate how the performance percentiles work, this plot shows how samples of performance (black dots) are normalized by each CDF, producing the normalized scores (colored dots). The correspondence between a single sample and its normalized score is shown by the dotted line.

To leverage these properties of the CDF, we define performance percentiles, which use a weighted average of each algorithm's CDF to normalize scores for each environment.

Definition 2 (Performance Percentile). In an evaluation of algorithms, A, the performance percentile for a score x on an environment, j, is FX̄j(x, wj), where FX̄j is the mixture of CDFs FX̄j(x, wj) := Σ_{i=1}^{|A|} wj,i FXi,j(x), with weights wj ∈ R^|A|, Σ_{i=1}^{|A|} wj,i = 1, and ∀i wj,i ≥ 0.

So we can say that performance percentiles capture the performance characteristic of an environment relative to some averaged algorithm. We discuss how to set the weights wj in the next section.

Performance percentiles are closely related to the concept of (probabilistic) performance profiles (Dolan & Moré, 2002; Barreto et al., 2010). The difference is that performance profiles report the cumulative distribution of normalized performance metrics over a set of tasks (environments), whereas performance percentiles are a technique for normalizing scores on each task (environment).

4.2. Summarization

A weighting over environments is needed to form an aggregate measure. We desire a weighting over environments such that no algorithm can exploit the weightings to increase its ranking. Additionally, for the performance percentiles, we need to determine the weighting of algorithms to use as the reference distribution. Inspired by the work of Balduzzi et al. (2018), we propose a weighting of algorithms and environments using the equilibrium of a two-player game.

In this game, one player, p, will try to select an algorithm to maximize the aggregate performance, while a second player, q, chooses the environment and reference algorithm to minimize p's score. Player p's pure strategy space, S1, is the set of algorithms A, i.e., p plays a strategy s1 = i corresponding to an algorithm i. Player q's pure strategy space, S2, is the cross product of a set of environments, M, and algorithms, A, i.e., player q plays a strategy s2 = (j, k) corresponding to a choice of environment j and normalization algorithm k. We denote the pure strategy space of the game by S := S1 × S2. A strategy, s ∈ S, can be represented by a tuple s = (s1, s2) = (i, (j, k)).

The utility of strategy s is measured by payoff functions up : S → R and uq : S → R for players p and q, respectively. The game is defined to be zero sum, i.e., uq(s) = −up(s). We define the payoff function to be up(s) := E[FXk,j(Xi,j)]. Both players p and q sample strategies from probability distributions p ∈ ∆(S1) and q ∈ ∆(S2), where ∆(X) is the set of all probability distributions over X.

The equilibrium solution of this game naturally balances the normalization and environment weightings to counter each algorithm's strengths without conferring an advantage to a particular algorithm. Thus, the aggregate measure will be useful in answering the general evaluation question.

After finding a solution (p∗, q∗), the aggregate performance measure yi for an algorithm i is defined as

yi := Σ_{j=1}^{|M|} Σ_{k=1}^{|A|} q∗_{j,k} E[FXk,j(Xi,j)].   (1)

To find a solution (p∗, q∗), we employ the α-Rank technique (Omidshafiei et al., 2019), which returns a stationary distribution over the pure strategy space S. α-Rank allows for efficient computation of both the equilibrium and confidence intervals on the aggregate performance (Rowland et al., 2019). We describe this method and the details of our implementation in Appendix B.
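The sketch below (ours, under stated assumptions) shows how Equation 1 can be estimated from samples once equilibrium weights q∗ over (environment, reference algorithm) pairs are available; computing q∗ itself (e.g., with α-Rank) is not shown, and the weights are treated as given.

```python
import numpy as np

def mean_normalized_score(scores_i_j, scores_k_j):
    # Sample estimate of E[F_{X_{k,j}}(X_{i,j})] using the empirical CDF of k on j.
    ref = np.sort(np.asarray(scores_k_j))
    return (np.searchsorted(ref, scores_i_j, side="right") / ref.size).mean()

def aggregate_performance(samples, q_star):
    """samples[i][j]: 1-D array of performance samples for algorithm i on environment j.
    q_star[j, k]: weight on (environment j, reference algorithm k); entries sum to 1.
    Returns an estimate of y_i (Equation 1) for every algorithm i."""
    num_algorithms, num_environments = len(samples), len(samples[0])
    y = np.zeros(num_algorithms)
    for i in range(num_algorithms):
        for j in range(num_environments):
            for k in range(num_algorithms):
                y[i] += q_star[j, k] * mean_normalized_score(samples[i][j], samples[k][j])
    return y
```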
5. Reporting Results
As it is crucial to quantify the uncertainty of all claimed
performance measures, we first discuss how to compute con-
fidence intervals for both single environment and aggregate
measures, then give details on displaying the results.
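For the single-environment case, one standard, distribution-free way to obtain a confidence band on a performance CDF, like the shaded regions in Figure 3, is the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality with Massart's constant (both cited in the references). The sketch below is a generic illustration of this idea, not the paper's PBP procedure.

```python
import numpy as np

def dkw_confidence_band(samples, delta=0.05):
    """(1 - delta) confidence band for a CDF from n i.i.d. samples, using the
    DKW inequality (Dvoretzky et al., 1956; Massart, 1990):
    epsilon = sqrt(ln(2 / delta) / (2 * n))."""
    x = np.sort(np.asarray(samples))
    n = x.size
    cdf = np.arange(1, n + 1) / n
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return x, np.clip(cdf - eps, 0.0, 1.0), np.clip(cdf + eps, 0.0, 1.0)
```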
6.1. Experiment Description

To demonstrate the evaluation procedure, we compare the algorithms: Actor-Critic with eligibility traces (AC) (Sutton & Barto, 2018), Q(λ), Sarsa(λ) (Sutton & Barto, 1998), NAC-TD (Morimura et al., 2005; Degris et al., 2012; Thomas, 2014), and proximal policy optimization (PPO) (Schulman et al., 2017). The learning rate is often the most sensitive hyperparameter in RL algorithms. So, we include three versions of Sarsa(λ), Q(λ), and AC: a base version, a version that scales the step-size with the number of parameters (e.g., Sarsa(λ)-s), and an adaptive step-size method, Parl2 (Dabney, 2014), that does not require specifying the step size. Since none of these algorithms have an existing complete definition, we create one by randomly sampling hyperparameters from fixed ranges. We consider all parameters necessary to construct each algorithm, e.g., step-size, function approximator, discount factor, and eligibility trace decay. For the continuous state environments, each algorithm employs linear function approximation using the Fourier basis (Konidaris et al., 2011) with a randomly sampled order. See Appendix E for full details of each algorithm.

These algorithms are evaluated on 15 environments: eight discrete MDPs, half with stochastic transition dynamics, and seven continuous state environments, namely Cart-Pole (Florian, 2007), Mountain Car (Sutton & Barto, 1998), Acrobot (Sutton, 1995), and four variations of the pinball environment (Konidaris & Barto, 2009; Geramifard et al., 2015). For each independent trial, the environments have their dynamics randomly perturbed to help mitigate environment overfitting (Whiteson et al., 2011); see code for details. For further details about the experiment, see Appendix F.

While these environments have simple features compared to the Arcade Learning Environment (Bellemare et al., 2013), they remain useful in evaluating RL algorithms for three reasons. First, experiments finish quickly. Second, the environments provide interesting insights into an algorithm's behavior. Third, as our results will show, there is not yet a complete algorithm that can reliably solve each one.

We execute each algorithm on each environment for 10,000 trials. While this number of trials may seem excessive, our goal is to detect a statistically meaningful result. Detecting such a result is challenging because the variance of RL algorithms' performance is high; we are comparing |A| × |M| = 165 random variables, and we do not assume the performances are normally distributed. Computationally, executing ten thousand trials is not burdensome if one uses an efficient programming language such as Julia (Bezanson et al., 2017) or C++, where we have noticed approximately two orders of magnitude faster execution than similar Python implementations. We investigate using smaller sample sizes at the end of this section.

Table 1. Aggregate performance measures for each algorithm and their rank. The parentheses contain the intervals computed using PBP, and together all hold with 95% confidence. The bolded numbers identify the best ranked statistically significant differences.

6.2. Algorithm Comparison

The aggregate performance measures and confidence intervals are illustrated in Figure 5 and given in Table 1. Appendix I lists the performance tables and distribution plots for each environment. Examining the empirical performances in these figures, we notice two trends. The first is that our evaluation procedure can identify differences that are not noticeable in standard evaluations. For example, all algorithms perform near optimally when tuned properly (indicated by the high end of the performance distribution). The primary differences between algorithms are in the frequency of high performance and divergence (indicated by the low end of the performance distribution). Parl2 methods rarely diverge, giving a large boost in performance relative to the standard methods.

The second trend is that our evaluation procedure can identify when theoretical properties do or do not make an algorithm
is a challenging statistical problem with many sources of uncertainty. Thus, one should be skeptical of results that use substantially fewer trials. Additionally, researchers are already conducting many trials that go unreported when tuning hyperparameters. Since our method requires no hyperparameter tuning, researchers can instead spend the same amount of time collecting more trials that can be used to quantify uncertainty.

There are a few ways that the number of trials needed can be reduced. The first is to think carefully about what question one should answer so that only a few algorithms and environments are required. The second is to use active sampling techniques to determine when to stop generating samples of performance for each algorithm-environment pair (Rowland et al., 2019). It is important to caution the reader that this process can bias the results if the sequential tests are not accounted for (Howard et al., 2018).

Summarizing our experiments, we make the following observations. Our experiments with complete algorithms show that there is still more work required to make standard RL algorithms work reliably on even extremely simple benchmark problems. As a result of our evaluation procedure, we were able to identify performance differences in algorithms that are not noticeable under standard evaluation procedures. The tests of the confidence intervals suggest that both PBP and PBP-t provide reliable estimates of uncertainty. These outcomes suggest that this evaluation procedure will be useful in comparing the performance of RL algorithms.

7. Related Work

This paper is not the first to investigate and address issues in empirically evaluating algorithms. The evaluation of algorithms has become a significant enough topic to spawn its own field of study, known as experimental algorithmics (Fleischer et al., 2002; McGeoch, 2012).

In RL, there have been significant efforts to discuss and improve the evaluation of algorithms (Whiteson & Littman, 2011). One common theme has been to produce shared benchmark environments, such as those in the annual reinforcement learning competitions (Whiteson et al., 2010; Dimitrakakis et al., 2014), the Arcade Learning Environment (Bellemare et al., 2013), and numerous others which are too long to list here. Recently, there has been a trend of explicit investigations into the reproducibility of reported results (Henderson et al., 2018; Islam et al., 2017; Khetarpal et al., 2018; Colas et al., 2018). These efforts are in part due to the inadequate experimental practices and reporting in RL and general machine learning (Pineau et al., 2020; Lipton & Steinhardt, 2018). Similar to these studies, this work has been motivated by the need for a more reliable evaluation procedure to compare algorithms. The primary difference between our work and these is that the knowledge required to use an algorithm gets included in the performance metric.

An important aspect of evaluation not discussed so far in this paper is competitive versus scientific testing (Hooker, 1995). Competitive testing is the practice of having algorithms compete for top performance on benchmark tasks. Scientific testing is the careful experimentation of algorithms to gain insight into how an algorithm works. The main difference in these two approaches is that competitive testing only says which algorithms worked well but not why, whereas scientific testing directly investigates the what, when, how, or why better performance can be achieved.

There are several examples of recent works using scientific testing to expand our understanding of commonly used methods. Lyle et al. (2019) compare distributional RL approaches using different function approximation schemes, showing that distributional approaches are only effective when nonlinear function approximation is used. Tucker et al. (2018) explore the sources of variance reduction in action-dependent control variates, showing that improvement was small or due to additional bias. Witty et al. (2018) and Atrey et al. (2020) investigate learned behaviors of an agent playing Atari 2600 games using ToyBox (Foley et al., 2018), a tool designed explicitly to enable carefully controlled experimentation of RL agents. While, at first glance, the techniques developed here seem to be only compatible with competitive testing, this is only because we specified a question with a competitive answer. The techniques developed here, particularly complete algorithm definitions, can be used to accurately evaluate the impact of various algorithmic choices. This allows for the careful experimentation needed to determine what components are essential to an algorithm.

8. Conclusion

The evaluation framework that we propose provides a principled method for evaluating RL algorithms. This approach facilitates fair comparisons of algorithms by removing unintentional biases common in the research setting. By developing a method to establish high-confidence bounds over this approach, we provide the framework necessary for reliable comparisons. We hope that our provided implementations will allow other researchers to easily leverage this approach to report the performances of the algorithms they create.

Acknowledgements

The authors would like to thank Kaleigh Clary, Emma Tosch, and members of the Autonomous Learning Laboratory: Blossom Metevier, James Kostas, and Chris Nota, for discussion and feedback on various versions of this manuscript. Additionally, we would like to thank the reviewers and meta-reviewers for their comments, which helped improve this manuscript.
References

Baird, L. C. Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A. and Russell, S. J. (eds.), Machine Learning, Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann, 1995.

Balduzzi, D., Tuyls, K., Pérolat, J., and Graepel, T. Re-evaluating evaluation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS, pp. 3272–3283, 2018.

Barreto, A. M. S., Bernardino, H. S., and Barbosa, H. J. C. Probabilistic performance profiles for the experimental evaluation of stochastic algorithms. In Pelikan, M. and Branke, J. (eds.), Genetic and Evolutionary Computation Conference, GECCO, pp. 751–758. ACM, 2010.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Dolan, E. D. and Moré, J. J. Benchmarking optimization software with performance profiles. Math. Program., 91(2):201–213, 2002.

Dulac-Arnold, G., Mankowitz, D. J., and Hester, T. Challenges of real-world reinforcement learning. CoRR, abs/1904.12901, 2019.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of a sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics, 27:642–669, 1956.

Farahmand, A. M., Ahmadabadi, M. N., Lucas, C., and Araabi, B. N. Interaction of culture-based learning and cooperative co-evolution and its application to automatic behavior-based system design. IEEE Trans. Evolutionary Computation, 14(1):23–57, 2010.

Fercoq, O., Akian, M., Bouhtou, M., and Gaubert, S. Ergodic control and polyhedral approaches to pagerank optimization. IEEE Trans. Automat. Contr., 58(1):134–148, 2013.
Fleischer, R., Moret, B. M. E., and Schmidt, E. M. (eds.). Experimental Algorithmics, From Algorithm Design to Robust and Efficient Software [Dagstuhl seminar, September 2000], volume 2547 of Lecture Notes in Computer Science. Springer, 2002.

Fleming, P. J. and Wallace, J. J. How not to lie with statistics: The correct way to summarize benchmark results. Commun. ACM, 29(3):218–221, 1986.

Florian, R. V. Correct equations for the dynamics of the cart-pole system. Center for Cognitive and Neural Studies (Coneural), Romania, 2007.

Foley, J., Tosch, E., Clary, K., and Jensen, D. Toybox: Better Atari environments for testing reinforcement learning agents. In NeurIPS 2018 Workshop on Systems for ML, 2018.

Geramifard, A., Dann, C., Klein, R. H., Dabney, W., and How, J. P. RLPy: A value-function-based reinforcement learning framework for education and research. Journal of Machine Learning Research, 16:1573–1578, 2015.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI-18, pp. 3207–3214, 2018.

Hooker, J. N. Testing heuristics: We have it all wrong. Journal of Heuristics, 1(1):33–42, 1995.

Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. S. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv: Statistics Theory, 2018.

Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. CoRR, abs/1708.04133, 2017.

Jordan, S. M., Cohen, D., and Thomas, P. S. Using cumulative distribution based performance analysis to benchmark models. In Critiquing and Correcting Trends in Machine Learning Workshop at Neural Information Processing Systems, 2018.

Khetarpal, K., Ahmed, Z., Cianflone, A., Islam, R., and Pineau, J. Re-evaluate: Reproducibility in evaluating reinforcement learning algorithms. 2018.

Konidaris, G., Osentoski, S., and Thomas, P. S. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI, 2011.

Konidaris, G. D. and Barto, A. G. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22, pp. 1015–1023. Curran Associates, Inc., 2009.

Lipton, Z. C. and Steinhardt, J. Troubling trends in machine learning scholarship. CoRR, abs/1807.03341, 2018.

Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 31, pp. 698–707, 2018.

Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. In The Thirty-Third AAAI Conference on Artificial Intelligence, pp. 4504–4511. AAAI Press, 2019.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. J. Artif. Intell. Res., 61:523–562, 2018.

Massart, P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.

McGeoch, C. C. A Guide to Experimental Algorithmics. Cambridge University Press, 2012.

Melis, G., Dyer, C., and Blunsom, P. On the state of the art of evaluation in neural language models. In 6th International Conference on Learning Representations, ICLR. OpenReview.net, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Morimura, T., Uchibe, E., and Doya, K. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and Its Applications, pp. 256–263, 2005.

Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R. α-rank: Multi-agent evaluation by evolution. Scientific Reports, 9(1):1–29, 2019.

Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
Perkins, T. J. and Precup, D. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems 15, pp. 1595–1602. MIT Press, 2002.

Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Larochelle, H. Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program). CoRR, abs/2003.12206, 2020.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994.

Reimers, N. and Gurevych, I. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 338–348, 2017.

Rowland, M., Omidshafiei, S., Tuyls, K., Pérolat, J., Valko, M., Piliouras, G., and Munos, R. Multiagent evaluation under incomplete information. In Advances in Neural Information Processing Systems 32, NeurIPS, pp. 12270–12282, 2019.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pp. 1038–1044, 1995.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Thomas, P. Bias in natural actor-critic algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML, pp. 441–448, 2014.

Tucker, G., Bhupatiraju, S., Gu, S., Turner, R. E., Ghahramani, Z., and Levine, S. The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 5022–5031, 2018.

Whiteson, S. and Littman, M. L. Introduction to the special issue on empirical evaluations in reinforcement learning. Mach. Learn., 84(1-2):1–6, 2011.

Whiteson, S., Tanner, B., and White, A. M. Report on the 2008 reinforcement learning competition. AI Magazine, 31(2):81–94, 2010.

Whiteson, S., Tanner, B., Taylor, M. E., and Stone, P. Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, ADPRL, pp. 120–127. IEEE, 2011.

Wiering, M. Convergence and divergence in standard and averaging reinforcement learning. In Machine Learning: ECML 2004, 15th European Conference on Machine Learning, volume 3201 of Lecture Notes in Computer Science, pp. 477–488. Springer, 2004.

Williams, R. J. and Baird, L. C. Tight performance bounds on greedy policies based on imperfect value functions. 1993.

Witty, S., Lee, J. K., Tosch, E., Atrey, A., Littman, M., and Jensen, D. Measuring and characterizing generalization in deep reinforcement learning. arXiv preprint arXiv:1812.02868, 2018.