
Benchmarking Deep Reinforcement Learning for Continuous Control

Yan Duan†  (rockyduan@eecs.berkeley.edu)
Xi Chen†  (c.xi@eecs.berkeley.edu)
Rein Houthooft†‡  (rein.houthooft@ugent.be)
John Schulman†§  (joschu@eecs.berkeley.edu)
Pieter Abbeel†  (pabbeel@cs.berkeley.edu)

† University of California, Berkeley, Department of Electrical Engineering and Computer Sciences
‡ Ghent University - iMinds, Department of Information Technology
§ OpenAI

arXiv:1604.06778v3 [cs.LG] 27 May 2016

Abstract

Recently, researchers have made significant progress combining the advances in deep learning for learning feature representations with reinforcement learning. Some notable examples include training agents to play Atari games based on raw pixel data and to acquire advanced manipulation skills using raw sensory inputs. However, it has been difficult to quantify progress in the domain of continuous control due to the lack of a commonly adopted benchmark. In this work, we present a benchmark suite of continuous control tasks, including classic tasks like cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure. We report novel findings based on the systematic evaluation of a range of implemented reinforcement learning algorithms. Both the benchmark and reference implementations are released at https://github.com/rllab/rllab in order to facilitate experimental reproducibility and to encourage adoption by other researchers.

(Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). Also available at https://arxiv.org/abs/1604.06778.)

1. Introduction

Reinforcement learning addresses the problem of how agents should learn to take actions to maximize cumulative reward through interactions with the environment. The traditional approach for reinforcement learning algorithms requires carefully chosen feature representations, which are usually hand-engineered. Recently, significant progress has been made by combining advances in deep learning for learning feature representations (Krizhevsky et al., 2012; Hinton et al., 2012) with reinforcement learning, tracing back to much earlier work of Tesauro (1995) and Bertsekas & Tsitsiklis (1995). Notable examples are training agents to play Atari games based on raw pixels (Guo et al., 2014; Mnih et al., 2015; Schulman et al., 2015a) and to acquire advanced manipulation skills using raw sensory inputs (Levine et al., 2015; Lillicrap et al., 2015; Watter et al., 2015). Impressive results have also been obtained in training deep neural network policies for 3D locomotion and manipulation tasks (Schulman et al., 2015a;b; Heess et al., 2015b).

Along with this recent progress, the Arcade Learning Environment (ALE) (Bellemare et al., 2013) has become a popular benchmark for evaluating algorithms designed for tasks with high-dimensional state inputs and discrete actions. However, these algorithms do not always generalize straightforwardly to tasks with continuous actions, leading to a gap in our understanding. For instance, algorithms based on Q-learning quickly become infeasible when naive discretization of the action space is performed, due to the curse of dimensionality (Bellman, 1957; Lillicrap et al., 2015). In the continuous control domain, where actions are continuous and often high-dimensional, we argue that the existing control benchmarks fail to provide a comprehensive set of challenging problems (see Section 7 for a review of existing benchmarks). Benchmarks have played a significant role in other areas such as computer vision and speech recognition. Examples include MNIST (LeCun et al., 1998), Caltech101 (Fei-Fei et al., 2006), CIFAR (Krizhevsky & Hinton, 2009), ImageNet (Deng et al., 2009), PASCAL VOC (Everingham et al., 2010), BSDS500 (Martin et al., 2001), SWITCHBOARD (Godfrey et al., 1992), TIMIT (Garofolo et al., 1993), Aurora (Hirsch & Pearce, 2000), and VoiceSearch (Yu et al., 2007).

The lack of a standardized and challenging testbed for reinforcement learning and continuous control makes it difficult to quantify scientific progress. Systematic evaluation and comparison will not only further our understanding of the strengths of existing algorithms, but also reveal their limitations and suggest directions for future research.

We attempt to address this problem and present a benchmark consisting of 31 continuous control tasks. These tasks range from simple tasks, such as cart-pole balancing, to challenging tasks such as high-DOF locomotion, tasks with partial observations, and hierarchically structured tasks. Furthermore, a range of reinforcement learning algorithms are implemented, on which we report novel findings based on a systematic evaluation of their effectiveness in training deep neural network policies. The benchmark and reference implementations are available at https://github.com/rllab/rllab, allowing for the development, implementation, and evaluation of new algorithms and tasks.

2. Preliminaries

In this section, we define the notation used in subsequent sections.

The implemented tasks conform to the standard interface of a finite-horizon discounted Markov decision process (MDP), defined by the tuple (S, A, P, r, ρ_0, γ, T), where S is a (possibly infinite) set of states, A is a set of actions, P : S × A × S → R≥0 is the transition probability distribution, r : S × A → R is the reward function, ρ_0 : S → R≥0 is the initial state distribution, γ ∈ (0, 1] is the discount factor, and T is the horizon.

For partially observable tasks, which conform to the interface of a partially observable Markov decision process (POMDP), two more components are required, namely Ω, a set of observations, and O : S × Ω → R≥0, the observation probability distribution.

Most of our implemented algorithms optimize a stochastic policy π_θ : S × A → R≥0. Let η(π) denote its expected discounted reward: η(π) = E_τ[Σ_{t=0}^T γ^t r(s_t, a_t)], where τ = (s_0, a_0, ...) denotes the whole trajectory, s_0 ∼ ρ_0(s_0), a_t ∼ π(a_t|s_t), and s_{t+1} ∼ P(s_{t+1}|s_t, a_t).

For deterministic policies, we use the notation μ_θ : S → A to denote the policy instead. The objective has the same form as above, except that now we have a_t = μ(s_t).
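For concreteness, the expected discounted return above can be estimated by straightforward Monte Carlo sampling. A minimal sketch, assuming a helper sample_trajectory() that rolls out the current policy and returns its reward sequence (this helper and all names below are assumptions, not part of the benchmark API):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """sum_t gamma^t * r_t for one sampled trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(rewards * gamma ** np.arange(len(rewards))))

def estimate_eta(sample_trajectory, n_trajectories=100, gamma=0.99):
    """Monte Carlo estimate of eta(pi): average discounted return of sampled rollouts."""
    return float(np.mean([discounted_return(sample_trajectory(), gamma)
                          for _ in range(n_trajectories)]))
```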
3. Tasks

The tasks in the presented benchmark can be divided into four categories: basic tasks, locomotion tasks, partially observable tasks, and hierarchical tasks. We briefly describe them in this section. More detailed specifications are given in the supplementary materials and in the source code.

We choose to implement all tasks using physics simulators rather than symbolic equations, since the former approach is less error-prone and permits easy modification of each task. Tasks with simple dynamics are implemented using Box2D (Catto, 2011), an open-source, freely available 2D physics simulator. Tasks with more complicated dynamics, such as locomotion, are implemented using MuJoCo (Todorov et al., 2012), a 3D physics simulator with better modeling of contacts.

3.1. Basic Tasks

We implement five basic tasks that have been widely analyzed in the reinforcement learning and control literature: Cart-Pole Balancing (Stephenson, 1908; Donaldson, 1960; Widrow, 1964; Michie & Chambers, 1968), Cart-Pole Swing Up (Kimura & Kobayashi, 1999; Doya, 2000), Mountain Car (Moore, 1990), Acrobot Swing Up (DeJong & Spong, 1994; Murray & Hauser, 1991; Doya, 2000), and Double Inverted Pendulum Balancing (Furuta et al., 1978). These relatively low-dimensional tasks provide quick evaluations and comparisons of RL algorithms.

3.2. Locomotion Tasks

In this category, we implement seven locomotion tasks of varying dynamics and difficulty: Swimmer (Purcell, 1977; Coulom, 2002; Levine & Koltun, 2013; Schulman et al., 2015a), Hopper (Murthy & Raibert, 1984; Erez et al., 2011; Levine & Koltun, 2013; Schulman et al., 2015a), Walker (Raibert & Hodgins, 1991; Erez et al., 2011; Levine & Koltun, 2013; Schulman et al., 2015a), Half-Cheetah (Wawrzyński, 2007; Heess et al., 2015b), Ant (Schulman et al., 2015b), Simple Humanoid (Tassa et al., 2012; Schulman et al., 2015b), and Full Humanoid (Tassa et al., 2012). The goal for all of these tasks is to move forward as quickly as possible. They are more challenging than the basic tasks due to their high degrees of freedom. In addition, a great amount of exploration is needed to learn to move forward without getting stuck at local optima. Since we penalize excessive controls as well as falling over, during the initial stage of learning, when the robot is not yet able to move forward for a sufficient distance without falling, apparent local optima exist, including staying at the origin or diving forward slowly.

3.3. Partially Observable Tasks

In real-life situations, agents are often not endowed with perfect state information. This can be due to sensor noise, sensor occlusions, or even sensor limitations that result in partial observations. To evaluate algorithms in more realistic settings, we implement three variations of partially observable tasks for each of the five basic tasks described in Section 3.1, leading to a total of 15 additional tasks. These variations are described below.

Figure 1. Illustration of locomotion tasks: (a) Swimmer; (b) Hopper; (c) Walker; (d) Half-Cheetah; (e) Ant; (f) Simple Humanoid; and (g) Full Humanoid.

Figure 2. Illustration of hierarchical tasks: (a) Locomotion + Food Collection; and (b) Locomotion + Maze.

Limited Sensors: For this variation, we restrict the observations to only provide positional information (including joint angles), excluding velocities. An agent now has to learn to infer velocity information in order to recover the full state. Similar tasks have been explored in Gomez & Miikkulainen (1998); Schäfer & Udluft (2005); Heess et al. (2015a); Wierstra et al. (2007).

Noisy Observations and Delayed Actions: In this case, sensor noise is simulated through the addition of Gaussian noise to the observations. We also introduce a time delay between taking an action and the action being in effect, accounting for physical latencies (Hester & Stone, 2013). Agents now need to learn to integrate both past observations and past actions to infer the current state. Similar tasks have been proposed in Bakker (2001).

System Identification: For this category, the underlying physical model parameters are varied across different episodes (Szita et al., 2003). The agents must learn to generalize across different models, as well as to infer the model parameters from their observation and action history.

3.4. Hierarchical Tasks

Many real-world tasks exhibit hierarchical structure, where higher level decisions can reuse lower level skills (Parr & Russell, 1998; Sutton et al., 1999; Dietterich, 2000). For instance, robots can reuse locomotion skills when exploring the environment. We propose several tasks where both low-level motor controls and high-level decisions are needed. These two components each operate on a different time scale and call for a natural hierarchy in order to efficiently learn the task.

Locomotion + Food Collection: For this task, the agent needs to learn to control either the swimmer or the ant robot to collect food and avoid bombs in a finite region. The agent receives range sensor readings about nearby food and bomb units. It is given a positive reward when it reaches a food unit, or a negative reward when it reaches a bomb.

Locomotion + Maze: For this task, the agent needs to learn to control either the swimmer or the ant robot to reach a goal position in a fixed maze. The agent receives range sensor readings about nearby obstacles as well as its goal (when visible). A positive reward is given only when the robot reaches the goal region.

4. Algorithms

In this section, we briefly summarize the algorithms implemented in our benchmark, and note any modifications made to apply them to general parametrized policies. We implement a range of gradient-based policy search methods, as well as two gradient-free methods for comparison with the gradient-based approaches.

4.1. Batch Algorithms

Most of the implemented algorithms are batch algorithms. At each iteration, N trajectories {τ_i}_{i=1}^N are generated, where τ_i = {(s_t^i, a_t^i, r_t^i)}_{t=0}^T contains the data collected along the ith trajectory. For on-policy gradient-based methods, all the trajectories are sampled under the current policy. For gradient-free methods, they are sampled under perturbed versions of the current policy.

REINFORCE (Williams, 1992): This algorithm estimates the gradient of the expected return, ∇_θ η(π_θ), using the likelihood ratio trick:

∇̂_θ η(π_θ) = (1 / (N T)) Σ_{i=1}^N Σ_{t=0}^T ∇_θ log π(a_t^i | s_t^i; θ) (R_t^i − b_t^i),

where R_t^i = Σ_{t'=t}^T γ^{t'−t} r_{t'}^i and b_t^i is a baseline that only depends on the state s_t^i to reduce variance. An ascent step is then taken in the direction of the estimated gradient. This process continues until θ_k converges.
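The estimator above can be written down directly. A minimal sketch, assuming callables grad_log_pi(s, a) (returning ∇_θ log π(a|s; θ) as a flat vector) and baseline(s, t); these helpers are assumptions for illustration and are not tied to the benchmark's implementation:

```python
import numpy as np

def reinforce_gradient(trajectories, grad_log_pi, baseline, gamma=0.99):
    """Likelihood-ratio estimate of the policy gradient from sampled trajectories."""
    total, n_terms = 0.0, 0
    for traj in trajectories:                      # traj = [(s_0, a_0, r_0), (s_1, a_1, r_1), ...]
        rewards = [r for (_, _, r) in traj]
        # Discounted return-to-go R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}.
        R, running = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            R[t] = running
        for t, (s, a, _) in enumerate(traj):
            total = total + grad_log_pi(s, a) * (R[t] - baseline(s, t))
            n_terms += 1
    return total / n_terms                         # the 1/(NT) normalization over all samples
```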

Truncated Natural Policy Gradient (TNPG) (Kakade, 2002; Peters et al., 2003; Bagnell & Schneider, 2003; Schulman et al., 2015a): Natural Policy Gradient improves upon REINFORCE by computing an ascent direction that approximately ensures a small change in the policy distribution. This direction is derived to be I(θ)^-1 ∇_θ η(π_θ), where I(θ) is the Fisher information matrix (FIM). We use the step size suggested by Peters & Schaal (2008):

α = sqrt( δ_KL (∇_θ η(π_θ)^T I(θ)^-1 ∇_θ η(π_θ))^-1 ).

Finally, we replace ∇_θ η(π_θ) and I(θ) by their empirical estimates.

For neural network policies with tens of thousands of parameters or more, generic Natural Policy Gradient incurs a prohibitive computation cost by forming and inverting the empirical FIM. Instead, we study Truncated Natural Policy Gradient (TNPG) in this paper, which computes the natural gradient direction without explicitly forming the matrix inverse, using a conjugate gradient algorithm that only requires computing I(θ)v for an arbitrary vector v. TNPG makes it practical to apply natural gradient in policy search settings with high-dimensional parameters, and we refer the reader to Schulman et al. (2015a) for more details.
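As an illustration, the truncated natural gradient step can be sketched as follows, assuming a fisher_vector_product(v) callable that returns I(θ)v and a policy gradient estimate g (e.g., from the REINFORCE estimator above); this is a sketch under those assumptions, not the benchmark's own implementation:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve I(theta) x = g given only Fisher-vector products fvp(v)."""
    x = np.zeros_like(g)
    r = g.copy()                       # residual g - I x, with x = 0 initially
    p = g.copy()
    r_dot = r @ r
    for _ in range(iters):
        Ip = fvp(p)
        alpha = r_dot / (p @ Ip)
        x += alpha * p
        r -= alpha * Ip
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def tnpg_step(theta, g, fisher_vector_product, delta_kl=0.05):
    """One TNPG ascent step with the Peters & Schaal (2008) step size."""
    nat_grad = conjugate_gradient(fisher_vector_product, g)      # approx. I(theta)^-1 g
    step_size = np.sqrt(delta_kl / (g @ nat_grad + 1e-8))
    return theta + step_size * nat_grad
```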
Reward-Weighted Regression (RWR) (Peters & Schaal, 2007; Kober & Peters, 2009): This algorithm formulates the policy optimization as an Expectation-Maximization problem to avoid the need to manually choose a learning rate, and the method is guaranteed to converge to a locally optimal solution. At each iteration, this algorithm optimizes a lower bound of the log-expected return: θ = argmax_{θ'} L(θ'), where

L(θ) = (1 / (N T)) Σ_{i=1}^N Σ_{t=0}^T log π(a_t^i | s_t^i; θ) ρ(R_t^i − b_t^i).

Here, ρ : R → R≥0 is a function that transforms raw returns to nonnegative values. Following Deisenroth et al. (2013), we choose ρ to be ρ(R) = R − R_min, where R_min is the minimum return among all trajectories collected in the current iteration.
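A sketch of the lower bound L(θ) under the choice ρ(R) = R − R_min, with log_pi an assumed callable returning log π(a|s; θ); in practice the arg max is found by handing −L(θ) to a standard optimizer (the names and data layout here are assumptions for illustration):

```python
import numpy as np

def rwr_objective(theta, samples, log_pi):
    """Return-weighted log-likelihood L(theta).

    `samples` is a flat list of (state, action, advantage) tuples for the current
    iteration, where advantage = R_t - b_t.
    """
    adv = np.array([A for (_, _, A) in samples])
    weights = adv - adv.min()          # rho(R) = R - R_min, so all weights are nonnegative
    logps = np.array([log_pi(theta, s, a) for (s, a, _) in samples])
    return np.mean(weights * logps)
```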
Relative Entropy Policy Search (REPS) (Peters et al., 2010): This algorithm limits the loss of information per iteration and aims to ensure a smooth learning progress (Deisenroth et al., 2013). At each iteration, we collect all trajectories into a dataset D = {(s_i, a_i, r_i, s'_i)}_{i=1}^M, where M is the total number of samples. Then, we first solve for the dual parameters [η*, ν*] = argmin_{η',ν'} g(η', ν') s.t. η > 0, where

g(η, ν) = η δ_KL + η log( (1/M) Σ_{i=1}^M e^{δ_i(ν)/η} ).

Here δ_KL > 0 controls the step size of the policy, and δ_i(ν) = r_i + ν^T (φ(s'_i) − φ(s_i)) is the sample Bellman error. We then solve for the new policy parameters:

θ_{k+1} = argmax_θ (1/M) Σ_{i=1}^M e^{δ_i(ν*)/η*} log π(a_i | s_i; θ).

Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a): This algorithm allows more precise control of the expected policy improvement than TNPG through the introduction of a surrogate loss. At each iteration, we solve the following constrained optimization problem (replacing expectations with samples):

maximize_θ   E_{s∼ρ_{θ_k}, a∼π_{θ_k}} [ (π_θ(a|s) / π_{θ_k}(a|s)) A_{θ_k}(s, a) ]
subject to   E_{s∼ρ_{θ_k}} [ D_KL(π_{θ_k}(·|s) ‖ π_θ(·|s)) ] ≤ δ_KL,

where ρ_θ = ρ_{π_θ} denotes the discounted state-visitation frequencies induced by π_θ, A_{θ_k}(s, a), known as the advantage function, is estimated by the empirical return minus the baseline, and δ_KL is a step size parameter which controls how much the policy is allowed to change per iteration. We follow the procedure described in the original paper for solving the optimization, which results in the same descent direction as TNPG with an extra line search in the objective and KL constraint.

Cross Entropy Method (CEM) (Rubinstein, 1999; Szita & Lőrincz, 2006): Unlike the previously mentioned methods, which perform exploration through stochastic actions, CEM performs exploration directly in the policy parameter space. At each iteration, we produce N perturbations of the policy parameters, θ_i ∼ N(μ_k, Σ_k), and perform a rollout for each sampled parameter vector. Then, we compute the new mean and diagonal covariance using the parameters that correspond to the top q-quantile returns.
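A minimal sketch of the CEM loop described above, assuming a rollout_return(theta) helper that runs the policy with parameters theta and returns its average return. The decaying extra exploration noise mirrors the "initial extra noise" hyperparameter reported in the supplementary tables, but the exact schedule used here is an assumption:

```python
import numpy as np

def cem(rollout_return, dim, iters=100, n_samples=100, elite_frac=0.2,
        init_std=1.0, extra_std=1.0, extra_decay=20):
    """Cross-entropy method over policy parameters with a diagonal Gaussian."""
    mu = np.zeros(dim)
    var = np.full(dim, init_std ** 2)
    n_elite = max(1, int(n_samples * elite_frac))
    for k in range(iters):
        # Decayed extra exploration noise added to the sampling variance.
        sample_var = var + (extra_std * max(1.0 - k / extra_decay, 0.0)) ** 2
        thetas = mu + np.sqrt(sample_var) * np.random.randn(n_samples, dim)
        returns = np.array([rollout_return(th) for th in thetas])
        elites = thetas[np.argsort(returns)[-n_elite:]]        # top q-quantile parameters
        mu, var = elites.mean(axis=0), elites.var(axis=0)       # refit mean and diagonal covariance
    return mu
```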
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen & Ostermeier, 2001): Similar to CEM, CMA-ES is a gradient-free evolutionary approach for optimizing nonconvex objective functions. In our case, this objective function equals the average sampled return. In contrast to CEM, CMA-ES estimates the covariance matrix of a multivariate normal distribution through incremental adaptation along evolution paths, which contain information about the correlation between consecutive updates.

4.2. Online Algorithms

Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015): Compared to batch algorithms, the DDPG algorithm continuously improves the policy as it explores the environment.
It applies gradient descent to the policy with minibatch data sampled from a replay pool, where the gradient is computed via

∇̂_θ η(μ_θ) = Σ_{i=1}^B ∇_a Q_φ(s_i, a)|_{a=μ_θ(s_i)} ∇_θ μ_θ(s_i),

where B is the batch size. The critic Q is trained via gradient descent on the ℓ2 loss of the Bellman error L = (1/B) Σ_{i=1}^B (y_i − Q_φ(s_i, a_i))², where y_i = r_i + γ Q'_{φ'}(s'_i, μ'_{θ'}(s'_i)). To improve stability of the algorithm, we use target networks for both the critic and the policy when forming the regression target y_i. We refer the reader to Lillicrap et al. (2015) for a more detailed description of the algorithm.
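The update above can be sketched in PyTorch; this is a minimal rendering of one DDPG step under assumed actor/critic modules, optimizers, and a sampled replay batch of tensors, and is not the benchmark's own (Theano-based rllab) implementation:

```python
import torch

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=1e-3):
    """One DDPG update on a sampled minibatch (s, a, r, s')."""
    s, a, r, s2 = batch
    # Critic: regress Q(s, a) toward the bootstrapped target computed with target networks.
    with torch.no_grad():
        y = r + gamma * target_critic(s2, target_actor(s2))
    critic_loss = ((critic(s, a) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: ascend the deterministic policy gradient, i.e. maximize Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft-update the target networks toward the learned networks.
    for tgt, src in ((target_actor, actor), (target_critic, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```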
4.3. Recurrent Variants

We implement direct applications of the aforementioned batch-based algorithms to recurrent policies. The only modification required is to replace π(a_t^i | s_t^i) by π(a_t^i | o_{1:t}^i, a_{1:t−1}^i), where o_{1:t}^i and a_{1:t−1}^i are the histories of past and current observations and past actions. Recurrent versions of reinforcement learning algorithms have been studied in many existing works, such as Bakker (2001), Schäfer & Udluft (2005), Wierstra et al. (2007), and Heess et al. (2015a).

5. Experiment Setup

In this section, we elaborate on the experimental setup used to generate the results.

Performance Metrics: For each report unit (a particular algorithm running on a particular task), we define its performance as (1 / Σ_{i=1}^I N_i) Σ_{i=1}^I Σ_{n=1}^{N_i} R_{in}, where I is the number of training iterations, N_i is the number of trajectories collected in the ith iteration, and R_{in} is the undiscounted return for the nth trajectory of the ith iteration.

Hyperparameter Tuning: For the DDPG algorithm, we used the hyperparameters reported in Lillicrap et al. (2015). For the other algorithms, we follow the approach in Mnih et al. (2015), and we select two tasks in each category on which a grid search of hyperparameters is performed. Each choice of hyperparameters is executed under five random seeds. The criterion for the best hyperparameters is defined as mean(returns) − std(returns). This metric selects against large fluctuations of performance due to overly large step sizes.

For the other tasks, we try both of the best hyperparameters found in the same category, and report the better performance of the two. This gives us insights into both the maximum possible performance when extensive hyperparameter tuning is performed, and the robustness of the best hyperparameters across different tasks.
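Both the performance metric and the grid-search criterion above are simple statistics of the collected returns; a minimal sketch (data layout is an assumption):

```python
import numpy as np

def performance(returns_per_iteration):
    """Average undiscounted return over all trajectories of all training iterations.

    returns_per_iteration[i] holds the N_i returns R_{i,n} collected at iteration i.
    """
    pooled = np.concatenate([np.asarray(r, dtype=float) for r in returns_per_iteration])
    return pooled.mean()

def grid_search_score(returns):
    """Hyperparameter selection criterion: mean(returns) - std(returns)."""
    returns = np.asarray(returns, dtype=float)
    return returns.mean() - returns.std()
```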
Policy Representation: For the basic, locomotion, and hierarchical tasks and for the batch algorithms, we use a feed-forward neural network policy with 3 hidden layers, consisting of 100, 50, and 25 hidden units with tanh nonlinearity at the first two hidden layers, which maps each state to the mean of a Gaussian distribution. The log-standard deviation is parameterized by a global vector independent of the state, as done in Schulman et al. (2015a). For all partially observable tasks, we use a recurrent neural network with a single hidden layer consisting of 32 LSTM hidden units (Hochreiter & Schmidhuber, 1997).

For the DDPG algorithm, which trains a deterministic policy, we follow Lillicrap et al. (2015). For both the policy and the Q function, we use the same architecture of a feed-forward neural network with 2 hidden layers, consisting of 400 and 300 hidden units with ReLU activations.

Baseline: For all gradient-based algorithms except REPS, we can subtract a baseline from the empirical return to reduce the variance of the optimization. We use a linear function as the baseline, with a time-varying feature vector.
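As a concrete illustration of the policy parameterization described above, here is a minimal PyTorch sketch of a Gaussian MLP policy with a state-independent log-standard deviation; the benchmark's reference implementation is built on Theano-based rllab, so all class and attribute names here are assumptions:

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Maps observations to the mean of a Gaussian; log-std is a global parameter."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        # 100-50-25 hidden units, with tanh after the first two hidden layers as stated above.
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 100), nn.Tanh(),
            nn.Linear(100, 50), nn.Tanh(),
            nn.Linear(50, 25),
            nn.Linear(25, act_dim),
        )
        # State-independent log-standard deviation: one global entry per action dimension.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)
```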
6. Results and Discussion

The main evaluation results are presented in Table 1. The tasks on which the grid search is performed are marked with (*). In each entry, the pair of numbers shows the mean and standard deviation of the normalized cumulative return using the best possible hyperparameters.

REINFORCE: Despite its simplicity, REINFORCE is an effective algorithm for optimizing deep neural network policies in most basic and locomotion tasks. Even for high-DOF tasks like Ant, REINFORCE can achieve competitive results. However, we observe that REINFORCE sometimes suffers from premature convergence to local optima, as noted by Peters & Schaal (2008), which explains the performance gaps between REINFORCE and TNPG on tasks such as Walker (Figure 3(a)). By visualizing the final policies, we can see that REINFORCE results in policies that tend to jump forward and fall over to maximize short-term return instead of acquiring a stable walking gait to maximize long-term return. In Figure 3(b), we can observe that even with a small learning rate, steps taken by REINFORCE can sometimes result in large changes to the policy distribution, which may explain the fast convergence to local optima.

TNPG and TRPO: Both TNPG and TRPO outperform other batch algorithms by a large margin on most tasks, confirming that constraining the change in the policy distribution results in more stable learning (Peters & Schaal, 2008).
Table 1. Performance of the implemented algorithms in terms of average return over all training iterations for five different random seeds (same across all algorithms). The results of the best-performing algorithm on each task, as well as all algorithms whose performances are not statistically significantly different (Welch's t-test with p < 0.05), are highlighted in boldface (a). In the tasks column, the partially observable variants of the tasks are annotated as follows: LS stands for limited sensors, NO for noisy observations and delayed actions, and SI for system identification. The notation N/A denotes that an algorithm has failed on the task at hand, e.g., CMA-ES leading to out-of-memory errors in the Full Humanoid task.

Task Random REINFORCE TNPG RWR REPS TRPO CEM CMA-ES DDPG

Cart-Pole Balancing 77.1 ± 0.0 4693.7 ± 14.0 3986.4 ± 748.9 4861.5 ± 12.3 565.6 ± 137.6 4869.8 ± 37.6 4815.4 ± 4.8 2440.4 ± 568.3 4634.4 ± 87.8
Inverted Pendulum* −153.4 ± 0.2 13.4 ± 18.0 209.7 ± 55.5 84.7 ± 13.8 −113.3 ± 4.6 247.2 ± 76.1 38.2 ± 25.7 −40.1 ± 5.7 40.0 ± 244.6
Mountain Car −415.4 ± 0.0 −67.1 ± 1.0 -66.5 ± 4.5 −79.4 ± 1.1 −275.6 ± 166.3 -61.7 ± 0.9 −66.0 ± 2.4 −85.0 ± 7.7 −288.4 ± 170.3
Acrobot −1904.5 ± 1.0 −508.1 ± 91.0 −395.8 ± 121.2 −352.7 ± 35.9 −1001.5 ± 10.8 −326.0 ± 24.4 −436.8 ± 14.7 −785.6 ± 13.1 -223.6 ± 5.8
Double Inverted Pendulum* 149.7 ± 0.1 4116.5 ± 65.2 4455.4 ± 37.6 3614.8 ± 368.1 446.7 ± 114.8 4412.4 ± 50.4 2566.2 ± 178.9 1576.1 ± 51.3 2863.4 ± 154.0

Swimmer* −1.7 ± 0.1 92.3 ± 0.1 96.0 ± 0.2 60.7 ± 5.5 3.8 ± 3.3 96.0 ± 0.2 68.8 ± 2.4 64.9 ± 1.4 85.8 ± 1.8
Hopper 8.4 ± 0.0 714.0 ± 29.3 1155.1 ± 57.9 553.2 ± 71.0 86.7 ± 17.6 1183.3 ± 150.0 63.1 ± 7.8 20.3 ± 14.3 267.1 ± 43.5
2D Walker −1.7 ± 0.0 506.5 ± 78.8 1382.6 ± 108.2 136.0 ± 15.9 −37.0 ± 38.1 1353.8 ± 85.0 84.5 ± 19.2 77.1 ± 24.3 318.4 ± 181.6
Half-Cheetah −90.8 ± 0.3 1183.1 ± 69.2 1729.5 ± 184.6 376.1 ± 28.2 34.5 ± 38.0 1914.0 ± 120.1 330.4 ± 274.8 441.3 ± 107.6 2148.6 ± 702.7
Ant* 13.4 ± 0.7 548.3 ± 55.5 706.0 ± 127.7 37.6 ± 3.1 39.0 ± 9.8 730.2 ± 61.3 49.2 ± 5.9 17.8 ± 15.5 326.2 ± 20.8
Simple Humanoid 41.5 ± 0.2 128.1 ± 34.0 255.0 ± 24.5 93.3 ± 17.4 28.3 ± 4.7 269.7 ± 40.3 60.6 ± 12.9 28.7 ± 3.9 99.4 ± 28.1
Full Humanoid 13.2 ± 0.1 262.2 ± 10.5 288.4 ± 25.2 46.7 ± 5.6 41.7 ± 6.1 287.0 ± 23.4 36.9 ± 2.9 N/A ± N/A 119.0 ± 31.2

Cart-Pole Balancing (LS)* 77.1 ± 0.0 420.9 ± 265.5 945.1 ± 27.8 68.9 ± 1.5 898.1 ± 22.1 960.2 ± 46.0 227.0 ± 223.0 68.0 ± 1.6
Inverted Pendulum (LS) −122.1 ± 0.1 −13.4 ± 3.2 0.7 ± 6.1 −107.4 ± 0.2 −87.2 ± 8.0 4.5 ± 4.1 −81.2 ± 33.2 −62.4 ± 3.4
Mountain Car (LS) −83.0 ± 0.0 −81.2 ± 0.6 -65.7 ± 9.0 −81.7 ± 0.1 −82.6 ± 0.4 -64.2 ± 9.5 -68.9 ± 1.3 -73.2 ± 0.6
Acrobot (LS)* −393.2 ± 0.0 −128.9 ± 11.6 -84.6 ± 2.9 −235.9 ± 5.3 −379.5 ± 1.4 -83.3 ± 9.9 −149.5 ± 15.3 −159.9 ± 7.5

Cart-Pole Balancing (NO)* 101.4 ± 0.1 616.0 ± 210.8 916.3 ± 23.0 93.8 ± 1.2 99.6 ± 7.2 606.2 ± 122.2 181.4 ± 32.1 104.4 ± 16.0
Inverted Pendulum (NO) −122.2 ± 0.1 6.5 ± 1.1 11.5 ± 0.5 −110.0 ± 1.4 −119.3 ± 4.2 10.4 ± 2.2 −55.6 ± 16.7 −80.3 ± 2.8
Mountain Car (NO) −83.0 ± 0.0 −74.7 ± 7.8 -64.5 ± 8.6 −81.7 ± 0.1 −82.9 ± 0.1 -60.2 ± 2.0 −67.4 ± 1.4 −73.5 ± 0.5
Acrobot (NO)* −393.5 ± 0.0 -186.7 ± 31.3 -164.5 ± 13.4 −233.1 ± 0.4 −258.5 ± 14.0 -149.6 ± 8.6 −213.4 ± 6.3 −236.6 ± 6.2

Cart-Pole Balancing (SI)* 76.3 ± 0.1 431.7 ± 274.1 980.5 ± 7.3 69.0 ± 2.8 702.4 ± 196.4 980.3 ± 5.1 746.6 ± 93.2 71.6 ± 2.9
Inverted Pendulum (SI) −121.8 ± 0.2 −5.3 ± 5.6 14.8 ± 1.7 −108.7 ± 4.7 −92.8 ± 23.9 14.1 ± 0.9 −51.8 ± 10.6 −63.1 ± 4.8
Mountain Car (SI) −82.7 ± 0.0 −63.9 ± 0.2 -61.8 ± 0.4 −81.4 ± 0.1 −80.7 ± 2.3 -61.6 ± 0.4 −63.9 ± 1.0 −66.9 ± 0.6

Acrobot (SI)* −387.8 ± 1.0 -169.1 ± 32.3 -156.6 ± 38.9 −233.2 ± 2.6 −216.1 ± 7.7 -170.9 ± 40.3 −250.2 ± 13.7 −245.0 ± 5.5

Swimmer + Gathering 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0
Ant + Gathering −5.8 ± 5.0 −0.1 ± 0.1 −0.4 ± 0.1 −5.5 ± 0.5 −6.7 ± 0.7 −0.4 ± 0.0 −4.7 ± 0.7 N/A ± N/A −0.3 ± 0.3
Swimmer + Maze 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0
Ant + Maze 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 N/A ± N/A 0.0 ± 0.0

(a) Except for the hierarchical tasks.

Figure 3. Performance as a function of the number of iterations; the shaded area depicts the mean ± the standard deviation over five different random seeds: (a) Performance comparison of all algorithms in terms of the average reward on the Walker task; (b) Comparison between REINFORCE, TNPG, and TRPO in terms of the mean KL-divergence on the Walker task; (c) Performance comparison of TNPG and TRPO on the Swimmer task; (d) Performance comparison of all algorithms in terms of the average reward on the Half-Cheetah task.

Compared to TNPG, TRPO offers better control over each policy update by performing a line search in the natural gradient direction to ensure an improvement in the surrogate loss function. We observe that the hyperparameter grid search tends to select conservative step sizes (δ_KL) for TNPG, which alleviates the issue of performance collapse caused by a large update to the policy. By contrast, TRPO can robustly enforce constraints with a larger δ_KL value and hence speeds up learning in some cases. For instance, the grid search on the Swimmer task reveals that the best step size for TNPG is δ_KL = 0.05, whereas TRPO's best step size is larger: δ_KL = 0.1. As shown in Figure 3(c), this larger step size enables slightly faster learning.

RWR: RWR is the only gradient-based algorithm we implemented that does not require any hyperparameter tuning. It can solve some basic tasks to a satisfactory degree, but fails to solve more challenging tasks such as locomotion. We observe empirically that RWR shows fast initial improvement followed by a significant slow-down, as shown in Figure 3(d).

REPS: Our main observation is that REPS is especially prone to early convergence to local optima in the case of continuous states and actions. Its final outcome is greatly affected by the performance of the initial policy, an observation that is consistent with the original work of Peters et al. (2010). This leads to bad performance on average, although under particular initial settings the algorithm can perform on par with others. Moreover, the tasks presented here do not assume the existence of a stationary distribution, which is assumed in Peters et al. (2010). In particular, for many of our tasks, transient behavior is of much greater interest than steady-state behavior, which agrees with the previous observation by van Hoof et al. (2015).

Gradient-free methods: Surprisingly, even when training deep neural network policies with thousands of parameters, CEM achieves very good performance on certain basic tasks such as Cart-Pole Balancing and Mountain Car, suggesting that the dimension of the searched parameters is not always the limiting factor of the method. However, the performance degrades quickly as the system dynamics become more complicated. We also observe that CEM outperforms CMA-ES, which is remarkable as CMA-ES estimates the full covariance matrix. For higher-dimensional policy parameterizations, the computational complexity and memory requirements of CMA-ES become noticeable. On tasks with high-dimensional observations, such as the Full Humanoid, the CMA-ES algorithm runs out of memory and fails to yield any results, denoted as N/A in Table 1.

DDPG: Compared to batch algorithms, we found that DDPG was able to converge significantly faster on certain tasks like Half-Cheetah due to its greater sample efficiency. However, it was less stable than batch algorithms, and the performance of the policy can degrade significantly during training. We also found it to be more susceptible to the scaling of the reward. In our experiments with DDPG, we rescaled the reward of all tasks by a factor of 0.1, which seems to improve stability.

Partially Observable Tasks: We experimentally verify that recurrent policies can find better solutions than feed-forward policies on the partially observable tasks, but recurrent policies are also more difficult to train. As shown in Table 1, derivative-free algorithms like CEM and CMA-ES work considerably worse with recurrent policies. We also note that the performance gap between REINFORCE and TNPG widens when they are applied to optimize recurrent policies, which can be explained by the fact that a small change in parameter space can result in a bigger change in policy distribution with recurrent policies than with feed-forward policies.

Hierarchical Tasks: We observe that all of our implemented algorithms achieve poor performance on the hierarchical tasks, even with extensive hyperparameter search and 500 iterations of training. It is an interesting direction to develop algorithms that can automatically discover and exploit the hierarchical structure in these tasks.

7. Related Work

In this section, we review existing benchmarks of continuous control tasks. The earliest efforts to evaluate reinforcement learning algorithms started in the form of individual control problems described in symbolic form. Some widely adopted tasks include the inverted pendulum (Stephenson, 1908; Donaldson, 1960; Widrow, 1964), mountain car (Moore, 1990), and Acrobot (DeJong & Spong, 1994). These problems are frequently incorporated into more comprehensive benchmarks.

Some reinforcement learning benchmarks contain low-dimensional continuous control tasks, such as the ones introduced above, including RLLib (Abeyruwan, 2013), MMLF (Metzen & Edgington, 2011), RL-Toolbox (Neumann, 2006), JRLF (Kochenderfer, 2006), Beliefbox (Dimitrakakis et al., 2007), Policy Gradient Toolbox (Peters, 2002), and ApproxRL (Busoniu, 2010). A series of RL competitions has also been held in recent years (Dutech et al., 2005; Dimitrakakis et al., 2014), again with relatively low-dimensional actions. In contrast, our benchmark contains a wider range of tasks with high-dimensional continuous state and action spaces.

Previously, other benchmarks have been proposed for high-dimensional control tasks. Tdlearn (Dann et al., 2014) includes a 20-link pole balancing task, DotRL (Papis & Wawrzyński, 2013) includes a variable-DOF octopus arm and a 6-DOF planar cheetah model, PyBrain (Schaul et al., 2010) includes a 16-DOF humanoid robot with standing and jumping tasks, RoboCup Keepaway (Stone et al., 2005) is a multi-agent game which can have a flexible dimension of actions by varying the number of agents, and SkyAI (Yamaguchi & Ogasawara, 2010) includes a 17-DOF humanoid robot with crawling and turning tasks. Other libraries such as CL-Square (Riedmiller et al., 2012) and RLPark (Degris et al., 2013) provide interfaces to actual hardware, e.g., Bioloid and iRobot Create. In contrast to these aforementioned testbeds, our benchmark makes use of simulated environments to reduce computation time and to encourage experimental reproducibility. Furthermore, it provides a much larger collection of tasks of varying difficulty.

8. Conclusion

In this work, a benchmark of continuous control problems for reinforcement learning is presented, covering a wide variety of challenging tasks. We implemented several reinforcement learning algorithms, and presented them in the context of general policy parameterizations. Results show that among the implemented algorithms, TNPG, TRPO, and DDPG are effective methods for training deep neural network policies. Still, the poor performance on the proposed hierarchical tasks calls for new algorithms to be developed. Implementing and evaluating existing and newly proposed algorithms will be our continued effort. By providing an open-source release of the benchmark, we encourage other researchers to evaluate their algorithms on the proposed tasks.

Acknowledgements

We thank Emo Todorov and Yuval Tassa for providing the MuJoCo simulator, and Sergey Levine, Aviv Tamar, Chelsea Finn, and the anonymous ICML reviewers for insightful comments. We also thank Shixiang Gu and Timothy Lillicrap for helping us diagnose the DDPG implementation. This work was supported in part by DARPA, the Berkeley Vision and Learning Center (BVLC), the Berkeley Artificial Intelligence Research (BAIR) laboratory, and Berkeley Deep Drive (BDD). Rein Houthooft is supported by a Ph.D. Fellowship of the Research Foundation - Flanders (FWO).

References

Abeyruwan, S. RLLib: Lightweight standard and on/off policy reinforcement learning library (C++). http://web.cs.miami.edu/home/saminda/rilib.html, 2013.
Bagnell, J. A. and Schneider, J. Covariant policy search. In IJCAI, pp. 1019–1024, 2003.
Bakker, B. Reinforcement learning with long short-term memory. In NIPS, pp. 1475–1482, 2001.
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The Arcade Learning Environment: An evaluation platform for general agents. J. Artif. Intell. Res., 47:253–279, 2013.
Bellman, R. Dynamic Programming. Princeton University Press, 1957.
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: An overview. In CDC, pp. 560–564, 1995.
Busoniu, L. ApproxRL: A Matlab toolbox for approximate RL and DP. http://busoniu.net/files/repository/readme-approxrl.html, 2010.
Catto, E. Box2D: A 2D physics engine for games, 2011.
Coulom, R. Reinforcement learning using neural networks, with applications to motor control. PhD thesis, Institut National Polytechnique de Grenoble-INPG, 2002.
Dann, C., Neumann, G., and Peters, J. Policy evaluation with temporal differences: A survey and comparison. J. Mach. Learn. Res., 15(1):809–883, 2014.
Degris, T., Béchu, J., White, A., Modayil, J., Pilarski, P. M., and Denk, C. RLPark. http://rlpark.github.io, 2013.
Deisenroth, M. P., Neumann, G., and Peters, J. A survey on policy search for robotics. Found. Trends Robotics, 2(1-2):1–142, 2013.
DeJong, G. and Spong, M. W. Swinging up the Acrobot: An example of intelligent control. In ACC, pp. 2158–2162, 1994.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255, 2009.
Dietterich, T. G. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res., 13:227–303, 2000.
Dimitrakakis, C., Tziortziotis, N., and Tossou, A. Beliefbox: A framework for statistical methods in sequential decision making. http://code.google.com/p/beliefbox/, 2007.
Dimitrakakis, C., Li, G., and Tziortziotis, N. The reinforcement learning competition 2014. AI Magazine, 35(3):61–65, 2014.
Donaldson, P. E. K. Error decorrelation: a technique for matching a class of functions. In Proc. 3rd Intl. Conf. Medical Electronics, pp. 173–178, 1960.
Doya, K. Reinforcement learning in continuous time and space. Neural Comput., 12(1):219–245, 2000.
Dutech, A., Edmunds, T., Kok, J., Lagoudakis, M., Littman, M., Riedmiller, M., Russell, B., Scherrer, B., Sutton, R., Timmer, S., et al. Reinforcement learning benchmarks and bake-offs II. Advances in Neural Information Processing Systems (NIPS), 17, 2005.
Erez, T., Tassa, Y., and Todorov, E. Infinite horizon model predictive control for nonlinear periodic tasks. Manuscript under review, 4, 2011.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303–338, 2010.
Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):594–611, 2006.
Furuta, K., Okutani, T., and Sone, H. Computer control of a double inverted pendulum. Comput. Electr. Eng., 5(1):67–84, 1978.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 1993.
Godfrey, J. J., Holliman, E. C., and McDaniel, J. SWITCHBOARD: Telephone speech corpus for research and development. In ICASSP, pp. 517–520, 1992.
Gomez, F. and Miikkulainen, R. 2-D pole balancing with recurrent evolutionary networks. In ICANN, pp. 425–430, 1998.
Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, pp. 3338–3346, 2014.
Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evol. Comput., 9(2):159–195, 2001.
Heess, N., Hunt, J., Lillicrap, T., and Silver, D. Memory-based control with recurrent neural networks. arXiv:1512.04455, 2015a.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In NIPS, pp. 2926–2934, 2015b.
Hester, T. and Stone, P. The open-source TEXPLORE code release for reinforcement learning on robots. In RoboCup 2013: Robot World Cup XVII, pp. 536–543, 2013.
Hinton, G., Deng, L., Yu, D., Mohamed, A.-R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., Dahl, G., and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag., 29(6):82–97, 2012.
Hirsch, H.-G. and Pearce, D. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ASR2000 - Automatic Speech Recognition: Challenges for the new Millenium, ISCA Tutorial and Research Workshop (ITRW), 2000.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
Kakade, S. M. A natural policy gradient. In NIPS, pp. 1531–1538, 2002.
Kimura, H. and Kobayashi, S. Stochastic real-valued reinforcement learning to solve a nonlinear control problem. In IEEE SMC, pp. 510–515, 1999.
Kober, J. and Peters, J. Policy search for motor primitives in robotics. In NIPS, pp. 849–856, 2009.
Kochenderfer, M. JRLF: Java reinforcement learning framework. http://mykel.kochenderfer.com/jrlf, 2006.
Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, 2009.
Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.
LeCun, Y., Cortes, C., and Burges, C. The MNIST database of handwritten digits, 1998.
Levine, S. and Koltun, V. Guided policy search. In ICML, pp. 1–9, 2013.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. arXiv:1504.00702, 2015.
Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv:1509.02971, 2015.
Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, pp. 416–423, 2001.
Metzen, J. M. and Edgington, M. Maja machine learning framework. http://mloss.org/software/view/220/, 2011.
Michie, D. and Chambers, R. A. BOXES: An experiment in adaptive control. Machine Intelligence, 2:137–152, 1968.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Moore, A. Efficient memory-based learning for robot control. Technical report, University of Cambridge, Computer Laboratory, 1990.
Murray, R. M. and Hauser, J. A case study in approximate linearization: The Acrobot example. Technical report, UC Berkeley, EECS Department, 1991.
Murthy, S. S. and Raibert, M. H. 3D balance in legged locomotion: modeling and simulation for the one-legged case. ACM SIGGRAPH Computer Graphics, 18(1):27–27, 1984.
Neumann, G. A reinforcement learning toolbox and RL benchmarks for the control of dynamical systems. Dynamical principles for neuroscience and intelligent biomimetic devices, pp. 113, 2006.
Papis, B. and Wawrzyński, P. dotRL: A platform for rapid reinforcement learning methods development and validation. In FedCSIS, pp. 129–136, 2013.
Parr, R. and Russell, S. Reinforcement learning with hierarchies of machines. In NIPS, pp. 1043–1049, 1998.
Peters, J. Policy Gradient Toolbox. http://www.ausy.tu-darmstadt.de/Research/PolicyGradientToolbox, 2002.
Peters, J. and Schaal, S. Reinforcement learning by reward-weighted regression for operational space control. In ICML, pp. 745–750, 2007.
Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
Peters, J., Vijaykumar, S., and Schaal, S. Policy gradient methods for robot control. Technical report, 2003.
Peters, J., Mülling, K., and Altün, Y. Relative entropy policy search. In AAAI, pp. 1607–1612, 2010.
Purcell, E. M. Life at low Reynolds number. Am. J. Phys., 45(1):3–11, 1977.
Raibert, M. H. and Hodgins, J. K. Animation of dynamic legged locomotion. In ACM SIGGRAPH Computer Graphics, volume 25, pp. 349–358, 1991.
Riedmiller, M., Blum, M., and Lampe, T. CLS2: Closed loop simulation system. http://ml.informatik.uni-freiburg.de/research/clsquare, 2012.
Rubinstein, R. The cross-entropy method for combinatorial and continuous optimization. Methodol. Comput. Appl. Probab., 1(2):127–190, 1999.
Schäfer, A. M. and Udluft, S. Solving partially observable reinforcement learning problems with recurrent neural networks. In ECML Workshops, pp. 71–81, 2005.
Schaul, T., Bayer, J., Wierstra, D., Sun, Y., Felder, M., Sehnke, F., Rückstieß, T., and Schmidhuber, J. PyBrain. J. Mach. Learn. Res., 11:743–746, 2010.
Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In ICML, pp. 1889–1897, 2015a.
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438, 2015b.
Stephenson, A. On induced stability. Philos. Mag., 15(86):233–236, 1908.
Stone, P., Kuhlmann, G., Taylor, M. E., and Liu, Y. Keepaway soccer: From machine learning testbed to benchmark. In RoboCup 2005: Robot Soccer World Cup IX, pp. 93–105. Springer, 2005.
Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.
Szita, I. and Lőrincz, A. Learning Tetris using the noisy cross-entropy method. Neural Comput., 18(12):2936–2941, 2006.
Szita, I., Takács, B., and Lőrincz, A. ε-MDPs: Learning in varying environments. J. Mach. Learn. Res., 3:145–174, 2003.
Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In IROS, pp. 4906–4913, 2012.
Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM, 38(3):58–68, 1995.
Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IROS, pp. 5026–5033, 2012.
van Hoof, H., Peters, J., and Neumann, G. Learning of non-parametric control policies with high-dimensional state features. In AISTATS, pp. 995–1003, 2015.
Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, pp. 2728–2736, 2015.
Wawrzyński, P. Learning to control a 6-degree-of-freedom walking robot. In IEEE EUROCON, pp. 698–705, 2007.
Widrow, B. Pattern recognition and adaptive control. IEEE Trans. Ind. Appl., 83(74):269–277, 1964.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. Solving deep memory POMDPs with recurrent policy gradients. In ICANN, pp. 697–706, 2007.
Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8:229–256, 1992.
Yamaguchi, A. and Ogasawara, T. SkyAI: Highly modularized reinforcement learning library. In IEEE-RAS Humanoids, pp. 118–123, 2010.
Yu, D., Ju, Y.-C., Wang, Y.-Y., Zweig, G., and Acero, A. Automated directory assistance system - from theory to practice. In Interspeech, pp. 2709–2712, 2007.
Supplementary Material

1. Task Specifications
Below we provide some specifications for the task observations, actions, and rewards. Please refer to the benchmark source
code (https://github.com/rllab/rllab) for the complete specification of physics parameters.

1.1. Basic Tasks


Cart-Pole Balancing: In this task, an inverted pendulum is mounted on a pivot point on a cart. The cart itself is restricted
to linear movement, achieved by applying horizontal forces. Due to the system’s inherent instability, continuous cart
movement is needed to keep the pendulum upright. The observation consists of the cart position x, pole angle θ, the cart
velocity ẋ, and the pole velocity θ̇. The 1D action consists of the horizontal force applied to the cart body. The reward
function is given by r(s, a) := 10 − (1 − cos(θ)) − 10^-5 ‖a‖². The episode terminates when |x| > 2.4 or |θ| > 0.2.
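For illustration, the reward and termination rule for this task can be written directly from the specification above; the function names and the state layout (x, θ, ẋ, θ̇) are assumptions:

```python
import numpy as np

def cartpole_balancing_reward(state, action):
    """r(s, a) = 10 - (1 - cos(theta)) - 1e-5 * ||a||^2."""
    _, theta, _, _ = state
    a = np.asarray(action, dtype=float)
    return 10.0 - (1.0 - np.cos(theta)) - 1e-5 * float(np.sum(a ** 2))

def cartpole_balancing_done(state):
    """Terminate when the cart leaves the track region or the pole deviates too far."""
    x, theta, _, _ = state
    return abs(x) > 2.4 or abs(theta) > 0.2
```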
Cart-Pole Swing Up: This is a more complicated version of the previous task, in which the system should not only be able
to balance the pole, but first succeed in swinging it up into an upright position. This task extends the working range of the
inverted pendulum to 360°. This is a nonlinear extension of the previous task. It has the same observation and action as
in balancing. The reward function is given by r(s, a) := cos(θ). The episode terminates when |x| > 3, with a penalty of
−100.
Mountain Car: In this task, a car has to escape a valley by repetitive application of tangential forces. Because the maximal
tangential force is limited, the car has to alternately drive up along the two slopes of the valley in order to build up enough
inertia to overcome gravity. This brings a challenge of exploration, since before first reaching the goal among all trials, a
locally optimal solution exists, which is to drive to the point closest to the target and stay there for the rest of the episode.
The observation is given by the horizontal position x and the horizontal velocity ẋ of the car. The reward is given by
r(s, a) := −1 + height, with height the car’s vertical offset. The episode terminates when the car reaches a target height
of 0.6. Hence the goal is to reach the target as soon as possible.
Acrobot Swing Up: In this task, an under-actuated, two-link robot has to swing itself into an upright position. It consists
of two joints of which the first one has a fixed position and only the second one can exert torque. The goal is to swing the
robot into an upright position and stabilize around that position. The controller not only has to swing the pendulum in order
to build up inertia, similar to the Mountain Car task, but also has to decelerate it in order to prevent it from tipping over.
The observation includes the two joint angles, θ1 and θ2 , and their velocities, θ̇1 and θ̇2 . The action is the torque applied at
the second joint. The reward is defined as r(s, a) := −‖tip(s) − tip_target‖_2, where tip(s) computes the Cartesian position
of the tip of the robot given the joint angles. No termination condition is applied.
Double Inverted Pendulum Balancing: This task extends the Cart-Pole Balancing task by replacing the single-link pole
by a two-link rigid structure. As in the former task, the goal is to stabilize the two-link pole near the upright position.
This task is more difficult than single-pole balancing, since the system is even more unstable and requires the controller to
actively maintain balance. The observation includes the cart position x, joint angles (θ1 and θ2 ), and joint velocities (θ̇1
and θ̇2 ). We encode each joint angle as its sine and cosine values. The action is the same as in cart-pole tasks. The reward
is given by r(s, a) = 10 − 0.01 x_tip² − (y_tip − 2)² − 10^-3 · θ̇_1² − 5 · 10^-3 · θ̇_2², where x_tip, y_tip are the coordinates of the tip
of the pole. The episode is terminated when y_tip ≤ 1.

1.2. Locomotion Tasks


Swimmer: The swimmer is a planar robot with 3 links and 2 actuated joints. Fluid is simulated through viscosity forces,
which apply drag on each link, allowing the swimmer to move forward. This task is the simplest of all locomotion tasks,
since there are no irrecoverable states in which the swimmer can get stuck, unlike other robots which may fall down or flip
over. This places less burden on exploration. The 13-dim observation includes the joint angles, joint velocities, as well as
the coordinates of the center of mass. The reward is given by r(s, a) = v_x − 0.005 ‖a‖², where v_x is the forward velocity.
No termination condition is applied.
Hopper: The hopper is a planar monopod robot with 4 rigid links, corresponding to the torso, upper leg, lower leg, and
foot, along with 3 actuated joints. More exploration is needed than the swimmer task, since a stable hopping gait has to
be learned without falling. Otherwise, it may get stuck in a local optimum of diving forward. The 20-dim observation
includes joint angles, joint velocities, the coordinates of center of mass, and constraint forces. The reward is given by
r(s, a) := v_x − 0.005 · ‖a‖² + 1, where the last term is a bonus for being “alive.” The episode is terminated when
z_body < 0.7, where z_body is the z-coordinate of the body, or when |θ_y| > 0.2, where θ_y is the forward pitch of the body.
Walker: The walker is a planar biped robot consisting of 7 links, corresponding to two legs and a torso, along with 6
actuated joints. This task is more challenging than hopper, since it has more degrees of freedom, and is also prone to
falling. The 21-dim observation includes joint angles, joint velocities, and the coordinates of center of mass. The reward
is given by r(s, a) := v_x − 0.005 · ‖a‖². The episode is terminated when z_body < 0.8, z_body > 2.0, or when |θ_y| > 1.0.
Half-Cheetah: The half-cheetah is a planar biped robot with 9 rigid links, including two legs and a torso, along with 6
actuated joints. The 20-dim observation includes joint angles, joint velocities, and the coordinates of the center of mass.
The reward is given by r(s, a) = v_x − 0.05 · ‖a‖². No termination condition is applied.
Ant: The ant is a quadruped with 13 rigid links, including four legs and a torso, along with 8 actuated joints. This task
is more challenging than the previous tasks due to the higher degrees of freedom. The 125-dim observation includes joint
angles, joint velocities, coordinates of the center of mass, a (usually sparse) vector of contact forces, as well as the rotation
matrix for the body. The reward is given by r(s, a) = v_x − 0.005 · ‖a‖² − C_contact + 0.05, where C_contact penalizes
contacts with the ground and is given by 5 · 10^-4 · ‖F_contact‖², where F_contact is the contact force vector clipped to values
between −1 and 1. The episode is terminated when z_body < 0.2 or when z_body > 1.0.
Simple Humanoid: This is a simplified humanoid model with 13 rigid links, including the head, body, arms, and legs,
along with 10 actuated joints. The increased difficulty comes from the increased degrees of freedom as well as the need
to maintain balance. The 102-dim observation includes the joint angles, joint velocities, vector of contact forces, and the
coordinates of the center of mass. The reward is given by r(s, a) = v_x − 5 · 10^-4 ‖a‖² − C_contact − C_deviation + 0.2, where
C_contact = 5 · 10^-6 · ‖F_contact‖, and C_deviation = 5 · 10^-3 · (v_y² + v_z²) penalizes deviation from the forward direction.
The episode is terminated when z_body < 0.8 or when z_body > 2.0.
Full Humanoid: This is a humanoid model with 19 rigid links and 28 actuated joints. It has more degrees of freedom
below the knees and elbows, which makes the system higher-dimensional and harder for learning. The 142-dim observation
includes the joint angles, joint velocities, vector of contact forces, and the coordinates of the center of mass. The reward
and termination condition are the same as in the Simple Humanoid model.

1.3. Partially Observable Tasks


Limited Sensors: The full description is included in the main text.
Noisy Observations and Delayed Actions: For all tasks, we use Gaussian noise with σ = 0.1. The time delay is as
follows: Cart-Pole Balancing 0.15 sec, Cart-Pole Swing Up 0.15 sec, Mountain Car 0.15 sec, Acrobot Swing Up 0.06 sec,
and Double Inverted Pendulum Balancing 0.06 sec. This corresponds to 3 discretization frames for each task.
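A sketch of how such a variant can be constructed by wrapping an existing task, assuming a simple reset/step environment interface that returns observations, rewards, and done flags; the zero action executed during the initial delay frames is also an assumption:

```python
import collections
import numpy as np

class NoisyDelayedWrapper:
    """Add Gaussian observation noise and a fixed action delay of `delay` frames."""

    def __init__(self, env, sigma=0.1, delay=3, zero_action=None):
        self.env, self.sigma, self.delay = env, sigma, delay
        self.zero_action = zero_action

    def _noisy(self, obs):
        return obs + self.sigma * np.random.randn(*np.shape(obs))

    def reset(self):
        # Until `delay` actions have been queued, the environment executes the zero action.
        self.queue = collections.deque([self.zero_action] * self.delay)
        return self._noisy(self.env.reset())

    def step(self, action):
        self.queue.append(action)
        delayed_action = self.queue.popleft()
        obs, reward, done = self.env.step(delayed_action)
        return self._noisy(obs), reward, done
```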
System Identification: For Cart-Pole Balancing and Cart-Pole Swing Up, the pole length is varied uniformly between
50% and 150%. For Mountain Car, the width of the valley varies uniformly between 75% and 125%. For Acrobot Swing
Up, each pole's length varies uniformly between 50% and 150%. For Double Inverted Pendulum Balancing, each pole's
length varies uniformly between 83% and 167%. Please refer to the benchmark source code for reference values.

1.4. Hierarchical Tasks


Locomotion + Food Collection: During each episode, 8 food units and 8 bombs are placed in the environment. Collecting
a food unit gives +1 reward, and collecting a bomb gives −1 reward. Hence the best cumulative reward for a given episode
is 8.
Locomotion + Maze: During each episode, a +1 reward is given when the robot reaches the goal. Otherwise, the robot
receives a zero reward throughout the episode.

2. Experiment Parameters
For all batch gradient-based algorithms, we use the same time-varying feature encoding for the linear baseline:

φ_{s,t} = concat(s, s ⊙ s, 0.01t, (0.01t)², (0.01t)³, 1)

where s is the state vector and ⊙ denotes the element-wise product.
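A sketch of this feature encoding; the baseline itself is then fit to the empirical returns, e.g., by regularized least-squares regression (the fitting procedure and function names here are assumptions):

```python
import numpy as np

def baseline_features(state, t):
    """phi_{s,t} = concat(s, s*s, 0.01t, (0.01t)^2, (0.01t)^3, 1)."""
    s = np.asarray(state, dtype=float).ravel()
    tc = 0.01 * t
    return np.concatenate([s, s * s, [tc, tc ** 2, tc ** 3, 1.0]])

def fit_linear_baseline(all_features, all_returns, reg=1e-5):
    """Regularized least-squares fit of returns on the features; returns the weight vector."""
    X = np.asarray(all_features, dtype=float)
    y = np.asarray(all_returns, dtype=float)
    A = X.T @ X + reg * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)
```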


Table 2 shows the experiment parameters for all four categories. We will then detail the hyperparameter search range for
the selected tasks and report best hyperparameters, shown in Tables 3, 4, 5, 6, 7, and 8.

Table 2. Experiment Setup


Basic & Locomotion Partially Observable Hierarchical
Sim. steps per Iter. 50,000 50,000 50,000
Discount (γ) 0.99 0.99 0.99
Horizon 500 100 500
Num. Iter. 500 300 500

Table 3. Learning Rate α for REINFORCE


Search Range Best
Cart-Pole Swing Up [1 × 10^-4, 1 × 10^-1] 5 × 10^-3
Double Inverted Pendulum [1 × 10^-4, 1 × 10^-1] 5 × 10^-3
Swimmer [1 × 10^-4, 1 × 10^-1] 1 × 10^-2
Ant [1 × 10^-4, 1 × 10^-1] 5 × 10^-3

Table 4. Step Size δKL for TNPG


Search Range Best
Cart-Pole Swing Up [1 × 10^-3, 5 × 10^0] 5 × 10^-2
Double Inverted Pendulum [1 × 10^-3, 5 × 10^0] 3 × 10^-2
Swimmer [1 × 10^-3, 5 × 10^0] 1 × 10^-1
Ant [1 × 10^-3, 5 × 10^0] 3 × 10^-1

Table 5. Step Size δKL for TRPO


Search Range Best
Cart-Pole Swing Up [1 × 10^-3, 5 × 10^0] 5 × 10^-2
Double Inverted Pendulum [1 × 10^-3, 5 × 10^0] 1 × 10^-3
Swimmer [1 × 10^-3, 5 × 10^0] 5 × 10^-2
Ant [1 × 10^-3, 5 × 10^0] 8 × 10^-2

Table 6. Step Size δKL for REPS


Search Range Best
Cart-Pole Swing Up [1 × 10^-3, 5 × 10^0] 1 × 10^-2
Double Inverted Pendulum [1 × 10^-3, 5 × 10^0] 8 × 10^-1
Swimmer [1 × 10^-3, 5 × 10^0] 3 × 10^-1
Ant [1 × 10^-3, 5 × 10^0] 8 × 10^-1

Table 7. Initial Extra Noise for CEM


Search Range Best
Cart-Pole Swing Up [1 × 10^-3, 1] 1 × 10^-2
Double Inverted Pendulum [1 × 10^-3, 1] 1 × 10^-1
Swimmer [1 × 10^-3, 1] 1 × 10^-1
Ant [1 × 10^-3, 1] 1 × 10^-1

Table 8. Initial Standard Deviation for CMA-ES


Search Range Best
Cart-Pole Swing Up [1 × 10^-3, 1 × 10^3] 1 × 10^3
Double Inverted Pendulum [1 × 10^-3, 1 × 10^3] 3 × 10^-1
Swimmer [1 × 10^-3, 1 × 10^3] 1 × 10^-1
Ant [1 × 10^-3, 1 × 10^3] 1 × 10^-1
