0% found this document useful (0 votes)

100 views22 pages

IMPALA: Scalable Distributed Deep-RL With Importance Weighted Actor-Learner Architectures

This document introduces IMPALA, a new distributed deep reinforcement learning agent that can efficiently train on a large collection of tasks using a single set of parameters. IMPALA uses an importance weighted actor-learner architecture that scales to thousands of machines without sacrificing data efficiency or resource utilization. It achieves high throughput by having actors communicate trajectories of experience to a centralized learner, which performs updates on GPU in mini-batches while parallelizing operations. IMPALA introduces the V-trace off-policy correction algorithm to address the harmful discrepancy from actors lagging behind the learner's updated policy. It achieves exceptionally high data throughput rates, making it over 30 times faster than single-machine A3C while also being more data efficient and robust.

Uploaded by

Justin Glibert

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

100 views22 pages

IMPALA: Scalable Distributed Deep-RL With Importance Weighted Actor-Learner Architectures

Uploaded by

Justin Glibert

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

IMPALA: Scalable Distributed Deep-RL with Importance Weighted

Actor-Learner Architectures

Lasse Espeholt * 1 Hubert Soyer * 1 Remi Munos * 1 Karen Simonyan 1 Volodymyr Mnih 1 Tom Ward 1
Yotam Doron 1 Vlad Firoiu 1 Tim Harley 1 Iain Dunning 1 Shane Legg 1 Koray Kavukcuoglu 1

Abstract separately. We are interested in developing new methods

arXiv:1802.01561v3 [cs.LG] 28 Jun 2018

capable of mastering a diverse set of tasks simultaneously as

In this work we aim to solve a large collection
well as environments suitable for evaluating such methods.
of tasks using a single reinforcement learning
agent with a single set of parameters. A key One of the main challenges in training a single agent on
challenge is to handle the increased amount of many tasks at once is scalability. Since the current state-of-
data and extended training time. We have devel- the-art methods like A3C (Mnih et al., 2016) or UNREAL
oped a new distributed agent IMPALA (Impor- (Jaderberg et al., 2017b) can require as much as a billion
tance Weighted Actor-Learner Architecture) that frames and multiple days to master a single domain, training
not only uses resources more efficiently in single- them on tens of domains at once is too slow to be practical.
machine training but also scales to thousands of
We propose the Importance Weighted Actor-Learner
machines without sacrificing data efficiency or
Architecture (IMPALA) shown in Figure 1. IMPALA is
resource utilisation. We achieve stable learning at
capable of scaling to thousands of machines without sacri-
high throughput by combining decoupled acting
ficing training stability or data efficiency. Unlike the popular
and learning with a novel off-policy correction
A3C-based agents, in which workers communicate gradi-
method called V-trace. We demonstrate the effec-
ents with respect to the parameters of the policy to a central
tiveness of IMPALA for multi-task reinforcement
parameter server, IMPALA actors communicate trajectories
learning on DMLab-30 (a set of 30 tasks from
of experience (sequences of states, actions, and rewards) to a
the DeepMind Lab environment (Beattie et al.,
centralised learner. Since the learner in IMPALA has access
2016)) and Atari-57 (all available Atari games in
to full trajectories of experience we use a GPU to perform
Arcade Learning Environment (Bellemare et al.,
updates on mini-batches of trajectories while aggressively
2013a)). Our results show that IMPALA is able to
parallelising all time independent operations. This type of
achieve better performance than previous agents
decoupled architecture can achieve very high throughput.
with less data, and crucially exhibits positive trans-
However, because the policy used to generate a trajectory
fer between tasks as a result of its multi-task ap-
can lag behind the policy on the learner by several updates at
proach. The source code is publicly available at
the time of gradient calculation, learning becomes off-policy.
github.com/deepmind/scalable agent.
Therefore, we introduce the V-trace off-policy actor-critic
algorithm to correct for this harmful discrepancy.
1. Introduction With the scalable architecture and V-trace combined, IM-
PALA achieves exceptionally high data throughput rates of
Deep reinforcement learning methods have recently mas- 250,000 frames per second, making it over 30 times faster
tered a wide variety of domains through trial and error than single-machine A3C. Crucially, IMPALA is also more
learning (Mnih et al., 2015; Silver et al., 2017; 2016; Zoph data efficient than A3C based agents and more robust to
et al., 2017; Lillicrap et al., 2015; Barth-Maron et al., 2018). hyperparameter values and network architectures, allow-
While the improvements on tasks like the game of Go (Sil- ing it to make better use of deeper neural networks. We
ver et al., 2017) and Atari games (Horgan et al., 2018) have demonstrate the effectiveness of IMPALA by training a sin-
been dramatic, the progress has been primarily in single gle agent on multi-task problems using DMLab-30, a new
task performance, where an agent is trained on each task challenge set which consists of 30 diverse cognitive tasks
*
Equal contribution 1 DeepMind Technologies, London, in the 3D DeepMind Lab (Beattie et al., 2016) environment
United Kingdom. Correspondence to: Lasse Espeholt <lespe- and by training a single agent on all games in the Atari-57
[email protected]>. set of tasks.
IMPALA: Importance Weighted Actor-Learner Architectures

Observations Environment steps Forward pass Backward pass

Actor 0 ...next unroll
4 time steps Actor 1
Actor Actor Worker Actor 2
Actor Learner Actor 0 Actor 3
Actor 1 Actor 4
Observations Actor 2 Actor 5
Parameters Actor 3 Actor 6
Actor 7
(a) Batched A2C (sync step.)

…
Actor Learner Actor Parameters Gradients 4 time steps
…
Actor 0
Actor 1
Actor 2
Master
Actor 3 (c) IMPALA
Actor
Learner (b) Batched A2C (sync traj.)
Actor Actor
Observations
Figure 2. Timeline for one unroll with 4 steps using different ar-
chitectures. Strategies shown in (a) and (b) can lead to low GPU
Figure 1. Left: Single Learner. Each actor generates trajectories utilisation due to rendering time variance within a batch. In (a),
and sends them via a queue to the learner. Before starting the next the actors are synchronised after every step. In (b) after every n
trajectory, actor retrieves the latest policy parameters from learner. steps. IMPALA (c) decouples acting from learning.
Right: Multiple Synchronous Learners. Policy parameters are
distributed across multiple learners that work synchronously.

during the estimation of the policy gradient. In contrast,

2. Related Work IMPALA uses the more principled V-trace algorithm.
Related previous work on off-policy RL include (Precup
The earliest attempts to scale up deep reinforcement learn-
et al., 2000; 2001; Wawrzynski, 2009; Geist & Scherrer,
ing relied on distributed asynchronous SGD (Dean et al.,
2014; O’Donoghue et al., 2017) and (Harutyunyan et al.,
2012) with multiple workers. Examples include distributed
2016). The closest work to ours is the Retrace algorithm
A3C (Mnih et al., 2016) and Gorila (Nair et al., 2015), a
(Munos et al., 2016) which introduced an off-policy correc-
distributed version of Deep Q-Networks (Mnih et al., 2015).
tion for multi-step RL, and has been used in several agent
Recent alternatives to asynchronous SGD for RL include
architectures (Wang et al., 2017; Gruslys et al., 2018). Re-
using evolutionary processes (Salimans et al., 2017), dis-
trace requires learning state-action-value functions Q in
tributed BA3C (Adamski et al., 2018) and Ape-X (Horgan
order to make the off-policy correction. However, many
et al., 2018) which has a distributed replay but a synchronous
actor-critic methods such as A3C learn a state-value func-
learner.
tion V instead of a state-action-value function Q. V-trace is
There have also been multiple efforts that scale up reinforce- based on the state-value function.
ment learning by utilising GPUs. One of the simplest of
such methods is batched A2C (Clemente et al., 2017). At 3. IMPALA
every step, batched A2C produces a batch of actions and
applies them to a batch of environments. Therefore, the IMPALA (Figure 1) uses an actor-critic setup to learn a
slowest environment in each batch determines the time it policy π and a baseline function V π . The process of gener-
takes to perform the entire batch step (see Figure 2a and ating experiences is decoupled from learning the parameters
2b). In other words, high variance in environment speed of π and V π . The architecture consists of a set of actors,
can severely limit performance. Batched A2C works partic- repeatedly generating trajectories of experience, and one or
ularly well on Atari environments, because rendering and more learners that use the experiences sent from actors to
game logic are computationally very cheap in comparison to learn π off-policy.
the expensive tensor operations performed by reinforcement
At the beginning of each trajectory, an actor updates its
learning agents. However, more visually or physically com-
own local policy µ to the latest learner policy π and runs
plex environments can be slower to simulate and can have
it for n steps in its environment. After n steps, the ac-
high variance in the time required for each step. Environ-
tor sends the trajectory of states, actions and rewards
ments may also have variable length (sub)episodes causing
x1 , a1 , r1 , . . . , xn , an , rn together with the corresponding
a slowdown when initialising an episode.
policy distributions µ(at |xt ) and initial LSTM state to the
The most similar architecture to IMPALA is GA3C learner through a queue. The learner then continuously
(Babaeizadeh et al., 2016), which also uses asynchronous updates its policy π on batches of trajectories, each col-
data collection to more effectively utilise GPUs. It de- lected from many actors. This simple architecture enables
couples the acting/forward pass from the gradient calcu- the learner(s) to be accelerated using GPUs and actors to
lation/backward pass by using dynamic batching. The ac- be easily distributed across many machines. However, the
tor/learner asynchrony in GA3C leads to instabilities during learner policy π is potentially several updates ahead of the
learning, which (Babaeizadeh et al., 2016) only partially actor’s policy µ at the time of update, therefore there is a
mitigates by adding a small constant to action probabilities policy-lag between the actors and learner(s). V-trace cor-
IMPALA: Importance Weighted Actor-Learner Architectures

rects for this lag to achieve extremely high data throughput Barto, 1998) where the goal is to find a policy π that
while maintaining data efficiency. Using an actor-learner ar- maximises the expected sum of future discounted rewards:
chitecture, provides fault tolerance like distributed A3C but def
V π (x) = Eπ t
P
t≥0 γ rt , where γ ∈ [0, 1) is the dis-
often has lower communication overhead since the actors count factor, rt = r(xt , at ) is the reward at time t, xt is the
send observations rather than parameters/gradients. state at time t (initialised in x0 = x) and at ∼ π(·|xt ) is the
With the introduction of very deep model architectures, the action generated by following some policy π.
speed of a single GPU is often the limiting factor during The goal of an off-policy RL algorithm is to use trajectories
training. IMPALA can be used with distributed set of learn- generated by some policy µ, called the behaviour policy, to
ers to train large neural networks efficiently as shown in learn the value function V π of another policy π (possibly
Figure 1. Parameters are distributed across the learners and different from µ), called the target policy.
actors retrieve the parameters from all the learners in par-
allel while only sending observations to a single learner. 4.1. V-trace target
IMPALA use synchronised parameter update which is vital
t=s+n
to maintain data efficiency when scaling to many machines Consider a trajectory (xt , at , rt )t=s generated by the ac-
(Chen et al., 2016). tor following some policy µ. We define the n-steps V-trace
target for V (xs ), our value approximation at state xs , as:
3.1. Efficiency Optimisations def Ps+n−1 Q
t−1

vs = V (xs ) + t=s γ t−s i=s ci δt V , (1)
GPUs and many-core CPUs benefit greatly from running
few large, parallelisable operations instead of many small def
where δt V = ρt rt + γV (xt+1 ) − V (xt ) is a temporal
operations. Since the learner in IMPALA performs updates
def t |xt ) def
difference for V , and ρt = min ρ̄, π(a

on entire batches of trajectories, it is able to parallelise more µ(at |xt ) and ci =
i |xi )
of its computations than an online agent like A3C. As an min c̄, π(a

µ(ai |xi ) are truncated importance sampling (IS)
example, a typical deep RL agent features a convolutional Qt−1
weights (we make use of the notation i=s ci = 1 for
network followed by a Long Short-Term Memory (LSTM)
s = t). In addition we assume that the truncation levels are
(Hochreiter & Schmidhuber, 1997) and a fully connected
such that ρ̄ ≥ c̄.
output layer after the LSTM. An IMPALA learner applies
the convolutional network to all inputs in parallel by folding Notice that in the on-policy case (when π = µ), and as-
the time dimension into the batch dimension. Similarly, it suming that c̄ ≥ 1, then all ci = 1 and ρt = 1, thus (1)
also applies the output layer to all time steps in parallel rewrites
once all LSTM states are computed. This optimisation Ps+n−1
vs = V (xs ) + t=s γ t−s rt + γV (xt+1 ) − V (xt )

increases the effective batch size to thousands. LSTM-based
agents also obtain significant speedups on the learner by Ps+n−1
= t=s γ t−s rt + γ n V (xs+n ), (2)
exploiting the network structure dependencies and operation
fusion (Appleyard et al., 2016). which is the on-policy n-steps Bellman target. Thus in
the on-policy case, V-trace reduces to the on-policy n-steps
Finally, we also make use of several off the shelf optimisa-
Bellman update. This property (which Retrace (Munos et al.,
tions available in TensorFlow (Abadi et al., 2017) such as
2016) does not have) allows one to use the same algorithm
preparing the next batch of data for the learner while still per-
for off- and on-policy data.
forming computation, compiling parts of the computational
graph with XLA (a TensorFlow Just-In-Time compiler) and Notice that the (truncated) IS weights ci and ρt play dif-
optimising the data format to get the maximum performance ferent roles. The weight ρt appears in the definition of the
from the cuDNN framework (Chetlur et al., 2014). temporal difference δt V and defines the fixed point of this
update rule. In a tabular case, where functions can be per-
4. V-trace fectly represented, the fixed point of this update (i.e., when
V (xs ) = vs for all states), characterised by δt V being equal
Off-policy learning is important in the decoupled distributed to zero in expectation (under µ), is the value function V πρ̄
actor-learner architecture because of the lag between when of some policy πρ̄ , defined by
actions are generated by the actors and when the learner
estimates the gradient. To this end, we introduce a novel off- def min ρ̄µ(a|x), π(a|x)
πρ̄ (a|x) = P , (3)
policy actor-critic algorithm for the learner, called V-trace. b∈A min ρ̄µ(b|x), π(b|x)

First, let us introduce some notations. We consider the (see the analysis in Appendix A ). So when ρ̄ is infinite
problem of discounted infinite-horizon RL in Markov De- (i.e. no truncation of ρt ), then this is the value function V π
cision Processes (MDP), see (Puterman, 1994; Sutton & of the target policy. However if we choose a truncation
IMPALA: Importance Weighted Actor-Learner Architectures

level ρ̄ < ∞, our fixed point is the value function V πρ̄ of The reason why we use qs instead of vs as the target for
a policy πρ̄ which is somewhere between µ and π. At the our Q-value Qπρ̄ (xs , as ) is that, assuming our value esti-
limit when ρ̄ is close to zero, we obtain the value function mate is correct at all states, i.e. V = V πρ̄ , then we have
of the behaviour policy V µ . In Appendix A we prove the E[qs |xs , as ] = Qπρ̄ (xs , as ) (whereas we do not have this
contraction of a related V-trace operator and the convergence property if we choose qt = vt ). See Appendix A for analy-
of the corresponding online V-trace algorithm. sis and Appendix E.3 for a comparison of different ways to
estimate qs .
The weights ci are similar to the “trace cutting” coefficients
in Retrace. Their product cs . . . ct−1 measures how much In order to reduce the variance of the policy gradient es-
a temporal difference δt V observed at time t impacts the timate (4), we usually subtract from qs a state-dependent
update of the value function at a previous time s. The more baseline, such as the current value approximation V (xs ).
dissimilar π and µ are (the more off-policy we are), the
Finally notice that (4) estimates the policy gradient for πρ̄
larger the variance of this product. We use the truncation
which is the policy evaluated by the V-trace algorithm when
level c̄ as a variance reduction technique. However notice
using a truncation level ρ̄. However assuming the bias
that this truncation does not impact the solution to which
V πρ̄ − V π is small (e.g. if ρ̄ is large enough) then we can
we converge (which is characterised by ρ̄ only).
expect qs to provide us with a good estimate of Qπ (xs , as ).
Thus we see that the truncation levels c̄ and ρ̄ represent Taking into account these remarks, we derive the following
different features of the algorithm: ρ̄ impacts the nature of canonical V-trace actor-critic algorithm.
the value function we converge to, whereas c̄ impacts the
speed at which we converge to this function. V- TRACE ACTOR - CRITIC ALGORITHM
Remark 1. V-trace targets can be computed recursively: Consider a parametric representation Vθ of the value func-

vs = V (xs ) + δs V + γcs vs+1 − V (xs+1 ) . tion and the current policy πω . Trajectories have been gen-
erated by actors following some behaviour policy µ. The
Remark 2. Like in Retrace(λ), we can also consider an V-trace targets vs are defined by (1). At training time s, the
additional discounting parameter λ ∈ [0, 1] in the definition value parameters θ are updated by gradient descent on the
i |xi )
of V-trace by setting ci = λ min c̄, π(a l2 loss to the target vs , i.e., in the direction of

µ(ai |xi ) . In the on-
policy case, when n = ∞, V-trace then reduces to TD(λ).
vs − Vθ (xs ) ∇θ Vθ (xs ),

4.2. Actor-Critic algorithm and the policy parameters ω in the direction of the policy
gradient:
P OLICY GRADIENT
ρs ∇ω log πω (as |xs ) rs + γvs+1 − Vθ (xs ) .
In the on-policy case, the gradient of the value function
V µ (x0 ) with respect to some parameter of the policy µ is In order to prevent premature convergence we may add an
hP i entropy bonus, like in A3C, along the direction
∇V µ (x0 ) = Eµ s≥0 γ s
∇ log µ(as |x s )Qµ
(x s , as ) , X
−∇ω πω (a|xs ) log πω (a|xs ).
def a
where Qµ (xs , as ) = Eµ t−s
P
t≥s γ rt |xs , as is the
state-action value of policy µ at (xs , as ). This is The overall update is obtained by summing these three gra-
usually implemented by a stochastic gradient ascent dients rescaled by appropriate coefficients, which are hyper-
that updates parameters of the algorithm.
h the policy parameters
i
in the direction of
Eas ∼µ(·|xs ) ∇ log µ(as |xs )qs xs , where qs is an estimate

of Qµ (xs , as ), and averaged over the set of states xs that 5. Experiments
are visited under some behaviour policy µ. We investigate the performance of IMPALA under multiple
Now in the off-policy setting that we consider, we can use settings. For data efficiency, computational performance
an IS weight between the policy being evaluated πρ̄ and the and effectiveness of the off-policy correction we look at the
behaviour policy µ, to update our policy parameter in the learning behaviour of IMPALA agents trained on individual
direction of tasks. For multi-task learning we train agents—each with
h π (a |x ) i one set of weights for all tasks—on a newly introduced
ρ̄ s s
Eas ∼µ(·|xs ) ∇ log πρ̄ (as |xs )qs xs (4) collection of 30 DeepMind Lab tasks and on all 57 games of
µ(as |xs ) the Atari Learning Environment (Bellemare et al., 2013a).
def
where qs = rs + γvs+1 is an estimate of Qπρ̄ (xs , as ) For all the experiments we have used two different model
built from the V-trace estimate vs+1 at the next state xs+1 . architectures: a shallow model similar to (Mnih et al., 2016)
IMPALA: Importance Weighted Actor-Learner Architectures
⇡ (at ) Vt
conceptually similar to the queues used in GA3C. Table 1
ht 1 LSTM 256
details the results for single-machine and multi-machine ver-
⇡ (at ) Vt
rt at
ReLU
FC 256 LSTM 64
sions with the shallow model from Figure 3. In the single-
1 1

ht LSTM 256
ReLU Embedding 20
machine case, IMPALA achieves the highest performance
1

blue ladder on both tasks, ahead of all batched A2C variants and ahead
ReLU Residual Block
rt 1 at 1 FC 256 LSTM 64 of A3C. However, the distributed, multi-machine setup is
32 Residual Block
ReLU
Conv. 4 ⇥ 4, stride 2
Embedding 20
⇥3
Max 3 ⇥ 3, stride 2 Conv. 3 ⇥ 3, stride 1 +
where IMPALA can really demonstrate its scalability. With
blue ladder
[16, 32, 32] ch.
ReLU
16
Conv. 3 ⇥ 3, stride 1
ReLU the optimisations from Section 3.1 to speed up the GPU-
Conv. 8 ⇥ 8, stride 4

/255 3 /255
3
Conv. 3 ⇥ 3, stride 1 based learner, the IMPALA agent achieves a throughput rate
ReLU
of 250,000 frames/sec or 21 billion frames/day. Note, to
reduce the number of actors needed per learner, one can
96 ⇥ 72 96 ⇥ 72
use auxiliary losses, data from experience replay or other
Figure 3. Model Architectures. Left: Small architecture, 2 convo- expensive learner-only computation.
lutional layers and 1.2 million parameters. Right: Large architec-
ture, 15 convolutional layers and 1.6 million parameters. 5.2. Single-Task Training
To investigate IMPALA’s learning dynamics, we employ the
Architecture CPUs GPUs1 FPS2 single-task scenario where we train agents individually on
Single-Machine Task 1 Task 2 5 different DeepMind Lab tasks. The task set consists of a
A3C 32 workers 64 0 6.5K 9K planning task, two maze navigation tasks, a laser tag task
Batched A2C (sync step) 48 0 9K 5K with scripted bots and a simple fruit collection task.
Batched A2C (sync step) 48 1 13K 5.5K
Batched A2C (sync traj.) 48 0 16K 17.5K We perform hyperparameter sweeps over the weighting of
Batched A2C (dyn. batch) 48 1 16K 13K entropy regularisation, the learning rate and the RMSProp
IMPALA 48 actors 48 0 17K 20.5K epsilon. For each experiment we use an identical set of 24
IMPALA (dyn. batch) 48 actors3 48 1 21K 24K pre-sampled hyperparameter combinations from the ranges
Distributed in Appendix D.1 . The other hyperparameters were fixed to
A3C 200 0 46K 50K values specified in Appendix D.3 .
IMPALA 150 1 80K
IMPALA (optimised) 375 1 200K 5.2.1. C ONVERGENCE AND S TABILITY
IMPALA (optimised) batch 128 500 1 250K
1
Nvidia P100 2 In frames/sec (4 times the agent steps due to action repeat). 3 Limited by Figure 4 shows a comparison between IMPALA, A3C
amount of rendering possible on a single machine. and batched A2C with the shallow model in Figure 3.
In all of the 5 tasks, either batched A2C or IMPALA
Table 1. Throughput on seekavoid arena 01 (task 1) and reach the best final average return and in all tasks but
rooms keys doors puzzle (task 2) with the shallow model seekavoid arena 01 they are ahead of A3C through-
in Figure 3. The latter has variable length episodes and slow
out the entire course of training. IMPALA outperforms
restarts. Batched A2C and IMPALA use batch size 32 if not other-
wise mentioned.
the synchronous batched A2C on 2 out of 5 tasks while
achieving much higher throughput (see Table 1). We hy-
pothesise that this behaviour could stem from the V-trace
with an LSTM before the policy and value (shown in Fig- off-policy correction acting similarly to generalised advan-
ure 3 (left)) and a deeper residual model (He et al., 2016) tage estimation (Schulman et al., 2016) and asynchronous
(shown in Figure 3 (right)). For tasks with a language chan- data collection yielding more diverse batches of experience.
nel we used an LSTM with text embeddings as input.
In addition to reaching better final performance, IMPALA is
also more robust to the choice of hyperparameters than A3C.
5.1. Computational Performance Figure 4 compares the final performance of the aforemen-
High throughput, computational efficiency and scalability tioned methods across different hyperparameter combina-
are among the main design goals of IMPALA. To demon- tions, sorted by average final return from high to low. Note
strate that IMPALA outperforms current algorithms in these that IMPALA achieves higher scores over a larger number
metrics we compare A3C (Mnih et al., 2016), batched A2C of combinations than A3C.
variations and IMPALA variants with various optimisations.
For single-machine experiments using GPUs, we use dy- 5.2.2. V- TRACE A NALYSIS
namic batching in the forward pass to avoid several batch To analyse V-trace we investigate four different algorithms:
size 1 forward passes. Our dynamic batching module is 1. No-correction - No off-policy correction.
implemented by specialised TensorFlow operations but is
IMPALA: Importance Weighted Actor-Learner Architectures

IMPALA - 1 GPU - 200 actors Batched A2C - Single Machine - 32 workers A3C - Single Machine - 32 workers A3C - Distributed - 200 workers

rooms_watermaze rooms_keys_doors_puzzle lasertag_three_opponents_small explore_goal_locations_small seekavoid_arena_01

55 30 35 250 45
50 30 40
25 200
45 25 35
40
Return

20 20 30
35 150
15 15 25
30 100
25 10 10 20
20 5 15
5 50
15 0 10
10 0 −5 0 5
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9

rooms_watermaze rooms_keys_doors_puzzle lasertag_three_opponents_small explore_goal_locations_small seekavoid_arena_01

60 40 40 300 50
35 35
Final Return

50 250 40
30 30
40 25 25 200
20 30
30 20 150
15 20
20 15 10 100
10 5
10 50 10
5 0
0 0 −5 0 0
1 5 9 13 17 21 24 1 5 9 13 17 21 24 1 5 9 13 17 21 24 1 5 9 13 17 21 24 1 5 9 13 17 21 24
Hyperparameter Combination Hyperparameter Combination Hyperparameter Combination Hyperparameter Combination Hyperparameter Combination

Figure 4. Top Row: Single task training on 5 DeepMind Lab tasks. Each curve is the mean of the best 3 runs based on final return.
IMPALA achieves better performance than A3C. Bottom Row: Stability across hyperparameter combinations sorted by the final
performance across different hyperparameter combinations. IMPALA is consistently more stable than A3C.

Task 1 Task 2 Task 3 Task 4 Task 5 a bias. Out of ρ̄ ∈ [1, 10, 100] we found that ρ̄ = 1 worked
Without Replay best.
V-trace 46.8 32.9 31.3 229.2 43.8 We evaluate all algorithms on the set of 5 DeepMind Lab
1-Step 51.8 35.9 25.4 215.8 43.7 tasks from the previous section. We also add an experience
ε-correction 44.2 27.3 4.3 107.7 41.5 replay buffer on the learner to increase the off-policy gap
No-correction 40.3 29.1 5.0 94.9 16.1
between π and µ. In the experience replay experiments we
With Replay draw 50% of the items in each batch uniformly at random
V-trace 47.1 35.8 34.5 250.8 46.9 from the replay buffer. Table 2 shows the final performance
1-Step 54.7 34.4 26.4 204.8 41.6 for each algorithm with and without replay respectively. In
ε-correction 30.4 30.2 3.9 101.5 37.6 the no replay setting, V-trace performs best on 3 out of 5
No-correction 35.0 21.1 2.8 85.0 11.2
tasks, followed by 1-step importance sampling, ε-correction
Tasks: rooms watermaze, rooms keys doors puzzle,
lasertag three opponents small,
and No-correction. Although 1-step importance sampling
explore goal locations small, seekavoid arena 01 performs similarly to V-trace in the no-replay setting, the
gap widens on 4 out 5 tasks when using experience replay.
Table 2. Average final return over 3 best hyperparameters for differ- This suggests that the cruder 1-step importance sampling ap-
ent off-policy correction methods on 5 DeepMind Lab tasks. When proximation becomes insufficient as the target and behaviour
the lag in policy is negligible both V-trace and 1-step importance policies deviate from each other more strongly. Also note
sampling perform similarly well and better than ε-correction/No-
that V-trace is the only variant that consistently benefits
correction. However, when the lag increases due to use of expe-
from adding experience replay. ε-correction improves sig-
rience replay, V-trace performs better than all other methods in 4
out 5 tasks. nificantly over No-correction on two tasks but lies far behind
the importance-sampling based methods, particularly in the
more off-policy setting with experience replay. Figure E.1
2. ε-correction - Add a small value (ε = 1e-6) during shows results of a more detailed analysis. Figure E.2 shows
gradient calculation to prevent log π(a) from becoming that the importance-sampling based methods also perform
very small and leading to numerical instabilities, similar better across all hyperparameters and are typically more
to (Babaeizadeh et al., 2016). robust.
3. 1-step importance sampling - No off-policy correction
when optimising V (x). For the policy gradient, multiply 5.3. Multi-Task Training
the advantage at each time step by the corresponding im-
portance weight. This variant is similar to V-trace without IMPALA’s high data throughput and data efficiency allow us
“traces” and is included to investigate the importance of to train not only on one task but on multiple tasks in parallel
“traces” in V-trace. with only a minimal change to the training setup. Instead
4. V-trace as described in Section 4. of running the same task on all actors, we allocate a fixed
number of actors to each task in the multi-task suite. Note,
For V-trace and 1-step importance sampling we clip each the model does not know which task it is being trained or
importance weight ρt and ct at 1 (i.e. c̄ = ρ̄ = 1). This evaluated on.
reduces the variance of the gradient estimate but introduces
IMPALA: Importance Weighted Actor-Learner Architectures

Model Test score low network version not only in terms of final performance
A3C, deep 23.8% but throughout the entire training. Note in Table 3 that
IMPALA, shallow 37.1% IMPALA, deep, PBT, 8 learners, although providing much
IMPALA-Experts, deep 44.5% higher throughput, reaches the same final performance as
IMPALA, deep 46.5% the 1 GPU IMPALA, deep, PBT in the same number of steps.
IMPALA, deep, PBT 49.4% Of particular importance is the gap between the IMPALA-
IMPALA, deep, PBT, 8 learners 49.1%
Experts which were trained on each task individually and
IMPALA, deep, PBT which was trained on all tasks at once.
Table 3. Mean capped human normalised scores on DMLab-30. As Figure 5 shows, the multi-task version is outperforms
All models were evaluated on the test tasks with 500 episodes per IMPALA-Experts throughout training and the breakdown
task. The table shows the best score for each architecture.
into individual scores in Appendix B shows positive transfer
on tasks such as language tasks and laser tag tasks.
5.3.1. DML AB -30 Comparing A3C to IMPALA with respect to wall clock time
To test IMPALA’s performance in a multi-task setting we use (Figure 6) further highlights the scalability gap between
DMLab-30, a set of 30 diverse tasks built on DeepMind Lab. the two approaches. IMPALA with 1 learner takes only
Among the many task types in the suite are visually complex around 10 hours to reach the same performance that A3C
environments with natural-looking terrain, instruction-based approaches after 7.5 days. Using 8 learner GPUs instead of
tasks with grounded language (Hermann et al., 2017), navi- 1 further speeds up training of the deep model by a factor of
gation tasks, cognitive (Leibo et al., 2018) and first-person 7 to 210K frames/sec, up from 30K frames/sec.
tagging tasks featuring scripted bots as opponents. A de-
tailed description of DMLab-30 and the tasks are available 5.3.2. ATARI
at github.com/deepmind/lab and deepmind.com/dm-lab-30. The Atari Learning Environment (ALE) (Bellemare et al.,
We compare multiple variants of IMPALA with a distributed 2013b) has been the testing ground of most recent deep
A3C implementation. Except for agents using population- reinforcement agents. Its 57 tasks pose challenging rein-
based training (PBT) (Jaderberg et al., 2017a), all agents are forcement learning problems including exploration, plan-
trained with hyperparameter sweeps across the same range ning, reactive play and complex visual input. Most games
given in Appendix D.1 . We report mean capped human feature very different visuals and game mechanics which
normalised score where the score for each task is capped makes this domain particularly challenging for multi-task
at 100% (see Appendix B ). Using mean capped human learning.
normalised score emphasises the need to solve multiple We train IMPALA and A3C agents on each game individu-
tasks instead of focusing on becoming super human on ally and compare their performance using the deep network
a single task. For PBT we use the mean capped human (without the LSTM) introduced in Section 5. We also pro-
normalised score as fitness function and tune entropy cost, vide results using a shallow network that is equivalent to
learning rate and RMSProp ε. See Appendix F for the the feed forward network used in (Mnih et al., 2016) which
specifics of the PBT setup. features three convolutional layers. The network is provided
In particular, we compare the following agent variants. A3C, with a short term history by stacking the 4 most recent ob-
deep, a distributed implementation with 210 workers (7 servations at each step. For details on pre-processing and
per task) featuring the deep residual network architecture hyperparameter setup please refer to Appendix G .
(Figure 3 (Right)). IMPALA, shallow with 210 actors and In addition to individual per-game experts, trained for 200
IMPALA, deep with 150 actors both with a single learner. million frames with a fixed set of hyperparameters, we train
IMPALA, deep, PBT, the same as IMPALA, deep, but ad- an IMPALA Atari-57 agent—one agent, one set of weights—
ditionally using the PBT (Jaderberg et al., 2017a) for hy- on all 57 Atari games at once for 200 million frames per
perparameter optimisation. Finally IMPALA, deep, PBT, 8 game or a total of 11.4 billion frames. For the Atari-57 agent,
learners, which utilises 8 learner GPUs to maximise learn- we use population based training with a population size of
ing speed. We also train IMPALA agents in an expert setting, 24 to adapt entropy regularisation, learning rate, RMSProp ε
IMPALA-Experts, deep, where a separate agent is trained and the global gradient norm clipping threshold throughout
per task. In this case we did not optimise hyperparameters training.
for each task separately but instead across all tasks on which
the 30 expert agents were trained. We compare all algorithms in terms of median human nor-
malised score across all 57 Atari games. Evaluation follows
Table 3 and Figure 5 show all variants of IMPALA perform- a standard protocol, each game-score is the mean over 200
ing much better than the deep distributed A3C. Moreover, evaluation episodes, each episode was started with a random
the deep variant of IMPALA performs better than the shal-
IMPALA: Importance Weighted Actor-Learner Architectures

Human Normalised Return Median Mean

Figure 5. Performance of best agent in each sweep/population dur-
ing training on the DMLab-30 task-set wrt. data consumed across A3C, shallow, experts 54.9% 285.9%
all environments. IMPALA with multi-task training is not only A3C, deep, experts 117.9% 503.6%
faster, it also converges at higher accuracy with better data effi- Reactor, experts 187% N/A
ciency across all 30 tasks. The x-axis is data consumed by one
IMPALA, shallow, experts 93.2% 466.4%
agent out of IMPALA,
a hyperparameter sweep/PBTIMPALA,
population of 24 agents,
deep, PBT - 8 GPUs shallow IMPALA, deep, experts 191.8% 957.6%
total data consumed across
IMPALA, deep, PBT the whole population/sweep
IMPALA-Experts, deepcan be
obtained by IMPALA,
multiplying
deep with the population/sweep
A3C, deep size. IMPALA, deep, multi-task 59.7% 176.9%
60
Mean Capped Normalized Score

Table 4. Human normalised scores on Atari-57. Up to 30 no-ops

50
at the beginning of each episode. For a level-by-level comparison
40
to ACKTR (Wu et al., 2017) and Reactor see Appendix C.1 .

30
the high diversity in visual appearance and game mechanics
20 within the ALE suite, IMPALA multi-task still manages
to stay competitive to A3C, shallow, experts, commonly
10
used as a baseline in related work. ALE is typically con-
0
sidered a hard multi-task environment, often accompanied
0.0 0.2 0.4 0.6 0.8 1.0
Environment Frames 1e10 by negative transfer between tasks (Rusu et al., 2016). To
IMPALA, deep, PBT - 8 GPUs IMPALA, shallow our knowledge, IMPALA is the first agent to be trained in a
IMPALA, deep, PBT IMPALA-Experts, deep multi-task setting on all 57 games of ALE that is competitive
IMPALA, deep A3C, deep
with a standard expert baseline.
60
Mean Capped Normalized Score

50 6. Conclusion
40 We have introduced a new highly scalable distributed agent,
IMPALA, and a new off-policy learning algorithm, V-trace.
30
With its simple but scalable distributed architecture, IM-
20
PALA can make efficient use of available compute at small
and large scale. This directly translates to very quick
10 turnaround for investigating new ideas and opens up un-
explored opportunities.
0
0 20 40 60 80 100 120 140 160 180
Wall Clock Time (hours) V-trace is a general off-policy learning algorithm that is
more stable and robust compared to other off-policy correc-
Figure 6. Performance on DMLab-30 wrt. wall-clock time. All tion methods for actor critic agents. We have demonstrated
models used the deep architecture (Figure 3). The high throughput that IMPALA achieves better performance compared to
of IMPALA results in orders of magnitude faster learning. A3C variants in terms of data efficiency, stability and final
performance. We have further evaluated IMPALA on the
new DMLab-30 set and the Atari-57 set. To the best of
number of no-op actions (uniformly chosen from [1, 30]) to our knowledge, IMPALA is the first Deep-RL agent that
combat the determinism of the ALE environment. has been successfully tested in such large-scale multi-task
settings and it has shown superior performance compared
As table 4 shows, IMPALA experts provide both better final to A3C based agents (49.4% vs. 23.8% human normalised
performance and data efficiency than their A3C counterparts score on DMLab-30). Most importantly, our experiments
in the deep and the shallow configuration. As in our Deep- on DMLab-30 show that, in the multi-task setting, positive
Mind Lab experiments, the deep residual network leads transfer between individual tasks lead IMPALA to achieve
to higher scores than the shallow network, irrespective of better performance compared to the expert training setting.
the reinforcement learning algorithm used. Note that the We believe that IMPALA provides a simple yet scalable and
shallow IMPALA experiment completes training over 200 robust framework for building better Deep-RL agents and
million frames in less than one hour. has the potential to enable research on new challenges.
We want to particularly emphasise that IMPALA, deep, multi-
task, a single agent trained on all 57 ALE games at once,
reaches 59.7% median human normalised score. Despite
IMPALA: Importance Weighted Actor-Learner Architectures

Acknowledgements Clemente, A. V., Martı́nez, H. N. C., and Chandra, A. Ef-

ficient parallel methods for deep reinforcement learning.
We would like to thank Denis Teplyashin, Ricardo Barreira, CoRR, abs/1705.04862, 2017.
Manuel Sanchez for their work improving the performance
on DMLab-30 environments and Matteo Hessel, Jony Hud- Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M.,
son, Igor Babuschkin, Max Jaderberg, Ivo Danihelka, Jacob Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K.,
Menick and David Silver for their comments and insightful Le, Q. V., and Ng, A. Y. Large scale distributed deep
discussions. networks. In Advances in Neural Information Processing
Systems 25, pp. 1223–1231, 2012.
References
Geist, M. and Scherrer, B. Off-policy learning with eligibil-
Abadi, M., Isard, M., and Murray, D. G. A computational ity traces: A survey. The Journal of Machine Learning
model for tensorflow: An introduction. In Proceedings Research, 15(1):289–333, 2014.
of the 1st ACM SIGPLAN International Workshop on
Machine Learning and Programming Languages, MAPL Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Belle-
2017, 2017. ISBN 978-1-4503-5071-6. mare, M. G., and Munos, R. The Reactor: A fast and
sample-efficient actor-critic agent for reinforcement learn-
Adamski, I., Adamski, R., Grel, T., Jedrych, A., Kaczmarek, ing. ICLR, 2018.
K., and Michalewski, H. Distributed deep reinforcement
learning: Learn how to play atari games in 21 minutes. Harutyunyan, A., Bellemare, M. G., Stepleton, T., and
CoRR, abs/1801.02852, 2018. Munos, R. Q(λ) with Off-Policy Corrections, pp. 305–
Appleyard, J., Kociský, T., and Blunsom, P. Optimizing 320. Springer International Publishing, Cham, 2016.
performance of recurrent neural networks on gpus. CoRR,
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings
abs/1604.01946, 2016.
in deep residual networks. In European Conference on
Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., and Computer Vision, pp. 630–645. Springer, 2016.
Kautz, J. GA3C: GPU-based A3C for deep reinforcement
learning. NIPS Workshop, 2016. Hermann, K. M., Hill, F., Green, S., Wang, F., Faulkner, R.,
Soyer, H., Szepesvari, D., Czarnecki, W., Jaderberg, M.,
Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, Teplyashin, D., et al. Grounded language learning in a
W., Horgan, D., Tirumala, D., Muldal, A., Heess, N., and simulated 3d world. arXiv preprint arXiv:1706.06551,
Lillicrap, T. Distributional policy gradients. ICLR, 2018. 2017.
Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wain- Hochreiter, S. and Schmidhuber, J. Long short-term memory.
wright, M., Kuttler, H., Lefrancq, A., Green, S., Valdes, Neural computation, 9(8):1735–1780, 1997.
V., Sadik, A., Schrittwieser, J., Anderson, K., York, S.,
Cant, M., Cain, A., Bolton, A., Gaffney, S., King, H., Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel,
Hassabis, D., Legg, S., and Petersen, S. Deepmind lab. M., van Hasselt, H., and Silver, D. Distributed prioritized
CoRR, abs/1612.03801, 2016. experience replay. ICLR, 2018.
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M.,
The Arcade Learning Environment: An evaluation plat- Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning,
form for general agents. Journal of Artificial Intelligence I., Simonyan, K., Fernando, C., and Kavukcuoglu, K.
Research, 47:253–279, June 2013a. Population based training of neural networks. CoRR,
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. abs/1711.09846, 2017a.
The arcade learning environment: An evaluation platform
Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T.,
for general agents. J. Artif. Intell. Res.(JAIR), 47:253–279,
Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforce-
2013b.
ment learning with unsupervised auxiliary tasks. ICLR,
Chen, J., Monga, R., Bengio, S., and Józefowicz, 2017b.
R. Revisiting distributed synchronous SGD. CoRR,
abs/1604.00981, 2016. Leibo, J. Z., d’Autume, C. d. M., Zoran, D., Amos, D.,
Beattie, C., Anderson, K., Castañeda, A. G., Sanchez, M.,
Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, Green, S., Gruslys, A., et al. Psychlab: A psychology
J., Catanzaro, B., and Shelhamer, E. cudnn: Efficient laboratory for deep reinforcement learning agents. arXiv
primitives for deep learning. CoRR, abs/1410.0759, 2014. preprint arXiv:1801.08116, 2018.
IMPALA: Importance Weighted Actor-Learner Architectures

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
T., Tassa, Y., Silver, D., and Wierstra, D. Continuous van den Driessche, G., Schrittwieser, J., Antonoglou, I.,
control with deep reinforcement learning. arXiv preprint Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe,
arXiv:1509.02971, 2015. D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T.,
Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis,
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, D. Mastering the game of go with deep neural networks
J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidje- and tree search. Nature, 529:484–503, 2016.
land, A. K., Ostrovski, G., et al. Human-level control
through deep reinforcement learning. Nature, 518(7540): Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou,
529–533, 2015. I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M.,
Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L.,
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lilli- Driessche, G. v. d., Graepel, T., and Hassabis, D. Master-
crap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. ing the game of go without human knowledge. Nature,
Asynchronous methods for deep reinforcement learning. 550(7676):354–359, 10 2017. ISSN 0028-0836. doi:
ICML, 2016. 10.1038/nature24270.
Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, Sutton, R. and Barto, A. Reinforcement learning: An intro-
M. Safe and efficient off-policy reinforcement learning. duction, volume 116. Cambridge Univ Press, 1998.
In Advances in Neural Information Processing Systems,
pp. 1046–1054, 2016. Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R.,
Kavukcuoglu, K., and de Freitas, N. Sample efficient
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, actor-critic with experience replay. In ICLR, 2017.
R., Maria, A. D., Panneershelvam, V., Suleyman, M.,
Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, Wawrzynski, P. Real-time reinforcement learning by sequen-
K., and Silver, D. Massively parallel methods for deep tial actor-critics and experience replay. Neural Networks,
reinforcement learning. CoRR, abs/1507.04296, 2015. 22(10):1484–1497, 2009.

O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, Wu, Y., Mansimov, E., Liao, S., Grosse, R. B., and Ba,
V. Combining policy gradient and Q-learning. In ICLR, J. Scalable trust-region method for deep reinforcement
2017. learning using kronecker-factored approximation. CoRR,
abs/1708.05144, 2017.
Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for
off-policy policy evaluation. In Proceedings of the Sev- Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning
enteenth International Conference on Machine Learning, transferable architectures for scalable image recognition.
2000. arXiv preprint arXiv:1707.07012, 2017.

Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy

temporal-difference learning with function approxima-
tion. In Proceedings of the 18th International Conference
on Machine Laerning, pp. 417–424, 2001.

Puterman, M. L. Markov Decision Processes: Discrete

Stochastic Dynamic Programming. John Wiley & Sons,
Inc., New York, NY, USA, 1st edition, 1994. ISBN
0471619779.

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H.,

Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Had-
sell, R. Progressive neural networks. arXiv preprint
arXiv:1606.04671, 2016.

Salimans, T., Ho, J., Chen, X., and Sutskever, I. Evolu-

tion strategies as a scalable alternative to reinforcement
learning. arXiv preprint arXiv:1703.03864, 2017.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel,
P. High-dimensional continuous control using generalized
advantage estimation. In ICLR, 2016.
Supplementary Material

A. Analysis of V-trace
A.1. V-trace operator
Define the V-trace operator R:
hX i
def
γ t c0 . . . ct−1 ρt rt + γV (xt+1 ) − V (xt ) x0 = x, µ ,

RV (x) = V (x) + Eµ (5)
t≥0

where the expectation Eµ is with respect to the policy µ which has generated the trajectory (xt )t≥0 , i.e., x0 = x,
xt+1 ∼ p(·|xt , at ), at ∼ µ(·|xt ). Here we consider the infinite-horizon operator but very similar results hold for the n-step
truncated operator.
t |xt ) π(at |xt )
Theorem 1. Let ρt = min ρ̄, π(a

µ(at |xt ) and ct = min c̄, µ(at |xt ) be truncated importance sampling weights, with ρ̄ ≥ c̄.
Assume that there exists β ∈ (0, 1] such that Eµ ρ0 ≥ β. Then the operator R defined by (5) has a unique fixed point V πρ̄ ,
which is the value function of the policy πρ̄ defined by

def min ρ̄µ(a|x), π(a|x)
πρ̄ (a|x) = P , (6)
b∈A min ρ̄µ(b|x), π(b|x)
Furthermore, R is a η-contraction mapping in sup-norm, with
t−2
def X Y
η = γ −1 − (γ −1 − 1)Eµ γt

ci ρt−1 ≤ 1 − (1 − γ)β < 1.
t≥0 i=0

Remark 3. The truncation levels c̄ and ρ̄ play different roles in this operator:

• ρ̄ impacts the fixed-point of the operator, thus the policy πρ̄ which is evaluated. For ρ̄ = ∞ (untruncated ρt ) we get the
value function of the target policy V π , whereas for finite ρ̄, we evaluate a policy which is in between µ and π (and
when ρ is close to 0, then we evaluate V µ ). So the larger ρ̄ the smaller the bias in off-policy learning. The variance
naturally grows with ρ̄. However notice that we do not take the product of those ρt coefficients (in contrast to the cs
coefficients) so the variance does not explode with the time horizon.
• c̄ impacts the contraction modulus η of R (thus the speed at which an online-algorithm like V-trace will converge to its
fixed point V πρ̄ ). In terms of variance reduction, here is it really important to truncate the importance sampling ratios
in ct because we take the product of those. Fortunately, our result says that for any level of truncation c̄, the fixed point
(the value function V πρ̄ we converge to) is the same: it does not depend on c̄ but on ρ̄ only.

Proof. First notice that we can rewrite R as

 
X t−1
Y
RV (x) = (1 − Eµ ρ0 )V (x) + Eµ  γt cs ρt rt + γ[ρt − ct ρt+1 ]V (xt+1 )  .
t≥0 s=0

Thus
 
X t−1
Y
γ t+1

RV1 (x) − RV2 (x) = (1 − Eµ ρ0 ) V1 (x) − V2 (x) + Eµ  cs [ρt − ct ρt+1 ] V1 (xt+1 ) − V2 (xt+1 )  .
t≥0 s=0
 
X t−2
Y
γt

= Eµ  cs [ρt−1 − ct−1 ρt ] V1 (xt ) − V2 (xt )  ,

| {z }
t≥0 s=0
αt
IMPALA: Importance Weighted Actor-Learner Architectures
Qt−2
with the notation that c−1 = ρ−1 = 1 and s=0 cs = 1 for t = 0 and 1. Now the coefficients (αt )t≥0 are non-negative in
expectation. Indeed, since ρ̄ ≥ c̄, we have

Eµ αt = E ρt−1 − ct−1 ρt ≥ Eµ ct−1 (1 − ρt ) ≥ 0,
t |xt )
since Eµ ρt ≤ Eµ π(a
µ(at |xt ) = 1. Thus V1 (x) − V2 (x) is a linear combination of the values V1 − V2 at other states, weighted
by non-negative coefficients whose sum is
" t−2 #
X Y
t
γ Eµ cs [ρt−1 − ct−1 ρt ]
t≥0 s=0
t−2
" # " t−1 #
X Y X Y
t t
= γ Eµ cs ρt−1 − γ Eµ cs ρ t
t≥0 s=0 t≥0 s=0
 
t−2
" # " t−2 #
X Y X Y
= γ t Eµ cs ρt−1 − γ −1  γ t Eµ cs ρt−1 − 1
t≥0 s=0 t≥0 s=0
" t−2 #
X Y
= γ −1 − (γ −1 − 1) γ t Eµ cs ρt−1
t≥0 s=0
| {z }
≥1+γEµ ρ0

≤ 1 − (1 − γ)Eµ ρ0
≤ 1 − (1 − γ)β
< 1.
h Q i
t−2
We deduce that kRV1 (x) − RV2 (x)k ≤ ηkV1 − V2 k∞ , with η = γ −1 − (γ −1 − 1) t
P
t≥0 γ Eµ s=0 cs ρt−1 ≤
1 − (1 − γ)β < 1, so R is a contraction mapping. Thus R possesses a unique fixed point. Let us now prove that this fixed
point is V πρ̄ . We have:
i
Eµ ρt rt + γV πρ̄ (xt+1 ) − V πρ̄ (xt ) xt

X π(a|xt ) h X i
= µ(a|xt ) min ρ̄, r(xt , a) + γ p(y|xt , a)V πρ̄ (y) − V πρ̄ (xt )
a
µ(a|xt ) y
X h X iX
p(y|xt , a)V (y) − V πρ̄ (xt )
πρ̄

= πρ̄ (a|xt ) r(xt , a) + γ min ρ̄µ(b|xt ), π(b|xt )
a y b
| {z }
=0
= 0,
since this is the Bellman equation for V πρ̄ . We deduce that RV πρ̄ = V πρ̄ , thus V πρ̄ is the unique fixed point of R.

A.2. Online learning

Theorem 2. Assume a tabular representation, i.e. the state and action spaces are finite. Consider a set of trajectories, with
the k th trajectory x0 , a0 , r0 , x1 , a1 , r1 , . . . generated by following µ: at ∼ µ(·|xt ). For each state xs along this trajectory,
update
X
γ t−s cs . . . ct−1 ρt rt + γVk (xt+1 ) − Vk (xt ) ,

Vk+1 (xs ) = Vk (xs ) + αk (xs ) (7)
t≥s

π(ai |xi ) π(ai |xi )

with ci = min c̄, ρi = min ρ̄,
µ(ai |xi ) , ρ̄ ≥ c̄. Assume that (1) all states are visited infinitely often, and (2) the
µ(ai |xi ) ,
stepsizes obey the usual Robbins-Munro conditions: for each state x, k αk (x) = ∞, k αk2 (x) < ∞. Then Vk → V πρ̄
P P
almost surely.

The proof is a straightforward application of the convergence result for stochastic approximation algorithms to the fixed
point of a contraction operator, see e.g. Dayan & Sejnowski (1994); Bertsekas & Tsitsiklis (1996); Kushner & Yin (2003).
IMPALA: Importance Weighted Actor-Learner Architectures

A.3. On the choice of qs in policy gradient

The policy gradient update rule (4) makes use of the coefficient qs = rs + γvs+1 as an estimate of Qπρ̄ (xs , as ) built from the
V-trace estimate vs+1 at the next state xs+1 . The reason why we use qs instead of vs as target for our Q-value Qπρ̄ (xs , as )
is to make sure our estimate of the Q-value is as unbiased as possible, and the first requirement is that it is entirely unbiased
in the case of perfect representation of the V-values. Indeed, assuming our value function is correctly estimated at all states,
i.e. V = V πρ̄ , then we have E[qs |xs , as ] = Qπρ̄ (xs , as ) (whereas we do not have this property for vt ). Indeed,

E[qs |xs , as ] = rs + γE V πρ̄ (xs+1 ) + δs+1 V πρ̄ + γcs+1 δs+2 V πρ̄ + . . .

= rs + γE V πρ̄ (xs+1 )

= Qπρ̄ (xs , as )

whereas

E[vs |xs , as ] = V πρ̄ (xs ) + ρs rs + γE V πρ̄ (xs+1 ) − V πρ̄ (xs ) + γcs δs+1 V πρ̄ + . . .

= V πρ̄ (xs ) + ρs rs + γE V πρ̄ (xs+1 ) − V πρ̄ (xs )

= V πρ̄ (xs )(1 − ρs ) + ρs Qπρ̄ (xs , as ),

which is different from Qπρ̄ (xs , as ) when V πρ̄ (xs ) 6= Qπρ̄ (xs , as ).
IMPALA: Importance Weighted Actor-Learner Architectures

B. Reference Scores
Task t Human h Random r Experts IMPALA
rooms collect good objects test 10.0 0.1 9.0 5.8
rooms exploit deferred effects test 85.7 8.5 15.6 11.0
rooms select nonmatching object 65.9 0.3 7.3 26.1
rooms watermaze 54.0 4.1 26.9 31.1
rooms keys doors puzzle 53.8 4.1 28.0 24.3
language select described object 389.5 -0.1 324.6 593.1
language select located object 280.7 1.9 189.0 301.7
language execute random task 254.1 -5.9 -49.9 66.8
language answer quantitative question 184.5 -0.3 219.4 264.0
lasertag one opponent large 12.7 -0.2 -0.2 0.3
lasertag three oponents large 18.6 -0.2 -0.1 4.1
lasertag one opponent small 18.6 -0.1 -0.1 2.5
lasertag three opponents small 31.5 -0.1 19.1 11.3
natlab fixed large map 36.9 2.2 34.7 12.2
natlab varying map regrowth 24.4 3.0 20.7 15.9
natlab varying map randomized 42.4 7.3 36.1 29.0
skymaze irreversible path hard 100.0 0.1 13.6 30.0
skymaze irreversible path varied 100.0 14.4 45.1 53.6
pyschlab arbitrary visuomotor mapping 58.8 0.2 16.4 14.3
pyschlab continuous recognition 58.3 0.2 29.9 29.9
pyschlab sequential comparison 39.5 0.1 0.0 0.0
pyschlab visual search 78.5 0.1 0.0 0.0
explore object locations small 74.5 3.6 57.8 62.6
explore object locations large 65.7 4.7 37.0 51.1
explore obstructed goals small 206.0 6.8 135.2 188.8
explore obstructed goals large 119.5 2.6 39.5 71.0
explore goal locations small 267.5 7.7 209.4 252.5
explore goal locations large 194.5 3.1 83.1 125.3
explore object rewards few 77.7 2.1 39.8 43.2
explore object rewards many 106.7 2.4 58.7 62.6
P
Mean Capped Normalised Score: ( t min [1, (st − rt )/(ht − rt )]) /N 100% 0% 44.5% 49.4%

Table B.1. DMLab-30 test scores.

IMPALA: Importance Weighted Actor-Learner Architectures

B.1. Final training scores on DMLab-30

A3C, deep IMPALA-Experts, deep IMPALA, deep, PBT

language_select_described_object

language_answer_quantitative_question

language_select_located_object

explore_goal_locations_small

rooms_collect_good_objects_train

explore_obstructed_goals_small

explore_object_locations_small

explore_object_locations_large

natlab_varying_map_randomized

natlab_varying_map_regrowth

explore_goal_locations_large

explore_obstructed_goals_large

explore_object_rewards_many

rooms_watermaze

rooms_select_nonmatching_object

explore_object_rewards_few

pyschlab_continuous_recognition

lasertag_three_opponents_small

skymaze_irreversible_path_varied

rooms_keys_doors_puzzle

natlab_fixed_large_map

skymaze_irreversible_path_hard

pyschlab_arbitrary_visuomotor_mapping

rooms_exploit_deferred_effects_train

lasertag_three_oponents_large

language_execute_random_task

lasertag_one_opponent_small

lasertag_one_opponent_large

pyschlab_visual_search

pyschlab_sequential_comparison

0 20 40 60 80 100 120 140 160

Human Normalised Score

Figure B.1. Human normalised scores across all DMLab-30 tasks.

IMPALA: Importance Weighted Actor-Learner Architectures

C. Atari Scores
ACKTR The Reactor IMPALA (deep, multi-task) IMPALA (shallow) IMPALA (deep)
alien 3197.10 6482.10 2344.60 1536.05 15962.10
amidar 1059.40 833 136.82 497.62 1554.79
assault 10777.70 11013.50 2116.32 12086.86 19148.47
asterix 31583.00 36238.50 2609.00 29692.50 300732.00
asteroids 34171.60 2780.40 2011.05 3508.10 108590.05
atlantis 3433182.00 308258 460430.50 773355.50 849967.50
bank heist 1289.70 988.70 55.15 1200.35 1223.15
battle zone 8910.00 61220 7705.00 13015.00 20885.00
beam rider 13581.40 8566.50 698.36 8219.92 32463.47
berzerk 927.20 1641.40 647.80 888.30 1852.70
bowling 24.30 75.40 31.06 35.73 59.92
boxing 1.45 99.40 96.63 96.30 99.96
breakout 735.70 518.40 35.67 640.43 787.34
centipede 7125.28 3402.80 4916.84 5528.13 11049.75
chopper command N/A 37568 5036.00 5012.00 28255.00
crazy climber 150444.00 194347 115384.00 136211.50 136950.00
defender N/A 113128 16667.50 58718.25 185203.00
demon attack 274176.70 100189 10095.20 107264.73 132826.98
double dunk -0.54 11.40 -1.92 -0.35 -0.33
enduro 0.00 2230.10 971.28 0.00 0.00
fishing derby 33.73 23.20 35.27 32.08 44.85
freeway 0.00 31.40 21.41 0.00 0.00
frostbite N/A 8042.10 2744.15 269.65 317.75
gopher 47730.80 69135.10 913.50 1002.40 66782.30
gravitar N/A 1073.80 282.50 211.50 359.50
hero N/A 35542.20 18818.90 33853.15 33730.55
ice hockey -4.20 3.40 -13.55 -5.25 3.48
jamesbond 490.00 7869.20 284.00 440.00 601.50
kangaroo 3150.00 10484.50 8240.50 47.00 1632.00
krull 9686.90 9930.80 10807.80 9247.60 8147.40
kung fu master 34954.00 59799.50 41905.00 42259.00 43375.50
montezuma revenge N/A 2643.50 0.00 0.00 0.00
ms pacman N/A 2724.30 3415.05 6501.71 7342.32
name this game N/A 9907.20 5719.30 6049.55 21537.20
phoenix 133433.70 40092.20 7486.50 33068.15 210996.45
pitfall -1.10 -3.50 -1.22 -11.14 -1.66
pong 20.90 20.70 8.58 20.40 20.98
private eye N/A 15177.10 0.00 92.42 98.50
qbert 23151.50 22956.50 10717.38 18901.25 351200.12
riverraid 17762.80 16608.30 2850.15 17401.90 29608.05
road runner 53446.00 71168 24435.50 37505.00 57121.00
robotank 16.50 68.50 9.94 2.30 12.96
seaquest 1776.00 8425.80 844.60 1716.90 1753.20
skiing N/A -10753.40 -8988.00 -29975.00 -10180.38
solaris 2368.60 2760 1160.40 2368.40 2365.00
space invaders 19723.00 2448.60 199.65 1726.28 43595.78
star gunner 82920.00 70038 1855.50 69139.00 200625.00
surround N/A 6.70 -8.51 -8.13 7.56
tennis N/A 23.30 -8.12 -1.89 0.55
time pilot 22286.00 19401 3747.50 6617.50 48481.50
tutankham 314.30 272.60 105.22 267.82 292.11
up n down 436665.80 64354.20 82155.30 273058.10 332546.75
venture N/A 1597.50 1.00 0.00 0.00
video pinball 100496.60 469366 20125.14 228642.52 572898.27
wizard of wor 702.00 13170.50 2106.00 4203.00 9157.50
yars revenge 125169.00 102760 14739.41 80530.13 84231.14
zaxxon 17448.00 25215.50 6497.00 1148.50 32935.50

Table C.1. Atari scores after 200M steps environment steps of training. Up to 30 no-ops at the beginning of each episode.
IMPALA: Importance Weighted Actor-Learner Architectures

D. Parameters
In this section, the specific parameter settings that are used throughout our experiments are given in detail.

Hyperparameter Range Distribution

Entropy regularisation [5e-5, 1e-2] Log uniform
Learning rate [5e-6, 5e-3] Log uniform
RMSProp epsilon (ε) regularisation parameter [1e-1, 1e-3, 1e-5, 1e-7] Categorical

Table D.1. The ranges used in sampling hyperparameters across all experiments that used a sweep and for the initial hyperparameters for
PBT. Sweep size and population size are 24. Note, the loss is summed across the batch and time dimensions.

Action Native DeepMind Lab Action

Forward [ 0, 0, 0, 1, 0, 0, 0]
Backward [ 0, 0, 0, -1, 0, 0, 0]
Strafe Left [ 0, 0, -1, 0, 0, 0, 0]
Strafe Right [ 0, 0, 1, 0, 0, 0, 0]
Look Left [-20, 0, 0, 0, 0, 0, 0]
Look Right [ 20, 0, 0, 0, 0, 0, 0]
Forward + Look Left [-20, 0, 0, 1, 0, 0, 0]
Forward + Look Right [ 20, 0, 0, 1, 0, 0, 0]
Fire [ 0, 0, 0, 0, 1, 0, 0]

Table D.2. Action set used in all tasks from the DeepMind Lab environment, including the DMLab-30 experiments.

D.1. Fixed Model Hyperparameters

In this section, we list all the hyperparameters that were kept fixed across all experiments in the paper which are mostly
concerned with observations specifications and optimisation. We first show below the reward pre-processing function that is
used across all experiments using DeepMind Lab, followed by all fixed numerical values.

5
Clipped Reward

−1
−10 −5 0 5 10
Reward

Figure D.1. Optimistic Asymmetric Clipping - 0.3 · min(tanh(reward), 0) + 5.0 · max(tanh(reward), 0)

IMPALA: Importance Weighted Actor-Learner Architectures

Parameter Value
Image Width 96
Image Height 72
Action Repetitions 4
Unroll Length (n) 100
Reward Clipping
- Single tasks [-1, 1]
- DMLab-30, including experts See Figure D.1
Discount (γ) 0.99
Baseline loss scaling 0.5
RMSProp momentum 0.0
Experience Replay (in Section 5.2.2 )
- Capacity 10,000 trajectories
- Sampling Uniform
- Removal First-in-first-out

Table D.3. Fixed model hyperparameters across all DeepMind Lab experiments.
IMPALA: Importance Weighted Actor-Learner Architectures

E. V-trace Analysis
E.1. Controlled Updates
Here we show how different algorithms (On-Policy, No-correction, ε-correction, V-trace) behave under varying levels of
policy-lag between the actors and the learner.
rooms_watermaze
60 ²-correction No-correction V-trace
50 0 0 0
1 1 1
40 10
10 100
Return

30 10 100 500
20 100 500
500
10
0
0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8
Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9

rooms_keys_doors_puzzle
30 ²-correction No-correction V-trace
0 0 1
25 1 1 0
10 10 10
20 100
100 500
Return

15 100
500 500
10
5
0
0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8
Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9

lasertag_three_opponents_small
35 ²-correction No-correction V-trace
10
30 1
25 0 0 100
20 0
Return

15 1 1 500
10
5 10 10
0 100 100
500 500
5
0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8
Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9

explore_goal_locations_small
250 ²-correction No-correction V-trace
10
200 0 0 1
0
150 1 1 100
500
Return

100 100
10 10
50 500 100
500
0
0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8
Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9

seekavoid_arena_01
45 ²-correction No-correction V-trace
0 0 0
40 1 1 1
35 10 10
30 100
25 500
Return

20
15 100 10
10 500
5 100
0 500
0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8
Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9

Figure E.1. As the policy-lag (the number of update steps the actor policy is behind learner policy) increases, learning with V-trace is
more robust compared to ε-correction and pure on-policy learning.
IMPALA: Importance Weighted Actor-Learner Architectures

E.2. V-trace Stability Analysis

V−trace − min(ρ, 1) 1 Step Importance Sampling − min(ρ, 1) ε−correction No-correction

rooms_watermaze rooms_keys_doors_puzzle lasertag_three_opponents_small explore_goal_locations_small seekavoid_arena_01

60 40 35 300 50
35 30
Final Return

50 250 40
30 25
40 25 20 200
30
30 20 15 150
15 10 20
20 100
10 5
10 50 10
5 0
0 0 −5 0 0
1 5 9 13 17 21 24 1 5 9 13 17 21 24 1 5 9 13 17 21 24 1 5 9 13 17 21 24 1 5 9 13 17 21 24
Hyperparameter Combination Hyperparameter Combination Hyperparameter Combination Hyperparameter Combination Hyperparameter Combination

Figure E.2. Stability across hyper parameter combinations for different off-policy correction variants using replay. V-trace is much more
stable across a wide range of parameter combinations compared to ε-correction and pure on-policy learning.

E.3. Estimating the State Action Value for Policy Gradient

We investigated different ways of estimating the state action value function used to estimate advantages for the policy
gradient calculation. The variant presented in the main section of the paper uses the V-trace corrected value function vs+1 to
estimate qs = rs + γvs+1 . Another possibility is to use the actor-critic baseline V (xs+1 ) to estimate qs = rs + γV (xs+1 ).
Note that the latter variant does not use any information from the current policy rollout to estimate the policy gradient and
relies on an accurate estimate of the value function. We found the latter variant to perform worse both when comparing the
top 3 runs and an average over all runs of the hyperparameter sweep as can be see in figures E.3 and E.4.

qs = rs + γ ⋅ vs + 1 qs = rs + γ ⋅ V(xs + 1 )
rooms_watermaze rooms_keys_doors_puzzle lasertag_three_opponents_small explore_goal_locations_small seekavoid_arena_01
50 30 35 250 45
45 30 40
40 25 200
25 35
Return

35 20 30
30 20 150
15 25
25 15 100
20 10 20
15 5 15
10 50
10 0 10
5 5 −5 0 5
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9

Figure E.3. Variants for estimation of state action value function - average over top 3 runs.

qs = rs + γ ⋅ vs + 1 qs = rs + γ ⋅ V(xs + 1 )
rooms_watermaze rooms_keys_doors_puzzle lasertag_three_opponents_small explore_goal_locations_small seekavoid_arena_01
35 22 12 160 35
20 10 140 30
30
18 120
8 25
Return

25 16 100
14 6 20
20 80
12 4 15
15 10 60
2 40 10
8
10 0 20 5
6
5 4 −2 0 0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9 Environment Frames 1e9

Figure E.4. Variants for estimation of state action value function - average over all runs.

F. Population Based Training

For Population Based Training we used a “burn-in” period of 20 million frames where no evolution is done. This is to
stabilise the process and to avoid very rapid initial adaptation which hinders diversity. After collecting 5,000 episode rewards
in total, the mean capped human normalised score is calculated and a random instance in the population is selected. If the
score of the selected instance is more than an absolute 5% higher, then the selected instance weights and parameters are
copied.
No matter if a copy happened or not, each parameter (RMSProp epsilon, learning rate and entropy cost) is permuted
with 33% probability by multiplying with either 1.2 or 1/1.2. This is different from Jaderberg et al. (2017) in that our
multiplication is unbiased where they use a multiplication of 1.2 or .8. We found that diversity is increased when the
parameters are permuted even if no copy happened.
We reconstruct the learning curves of the PBT runs in Figure 5 by backtracking through the ancestry of copied checkpoints
for selected instances.
IMPALA: Importance Weighted Actor-Learner Architectures

IMPALA - PBT - 8 GPUs IMPALA - PBT - 1 GPU

Learning Rate
0.0007

0.0006

0.0005

0.0004

0.0003

0.0002

0.0001

0.0000
0.0 0.2 0.4 0.6 0.8 1.0
Environment Frames 1e10

Figure F.1. Learning rate schedule that is discovered by the PBT Jaderberg et al. (2017) method compared against the linear annealing
schedule of the best run from the parameter sweep (red line).

G. Atari Experiments
All agents trained on Atari are equipped only with a feed forward network and pre-process frames in the same way as
described in Mnih et al. (2016). When training experts agents, we use the same hyperparameters for each game for
both IMPALA and A3C. These hyperparameters are the result of tuning A3C with a shallow network on the following
games: breakout, pong, space invaders, seaquest, beam rider, qbert. Following related work, experts
use game-specific action sets.
The multi-task agent was equipped with a feed forward residual network (see Figure 3 ). The learning rate, entropy
regularisation, RMSProp ε and gradient clipping threshold were adapted through population based training. To be able to
use the same policy layer on all Atari games in the multi-task setting we train the multi-task agent on the full Atari action set
consisting of 18 actions.
Agents were trained using the following set of hyperparameters:
IMPALA: Importance Weighted Actor-Learner Architectures

Parameter Value
Image Width 84
Image Height 84
Grayscaling Yes
Action Repetitions 4
Max-pool over last N action repeat frames 2
Frame Stacking 4
End of episode when life lost Yes
Reward Clipping [-1, 1]
Unroll Length (n) 20
Batch size 32
Discount (γ) 0.99
Baseline loss scaling 0.5
Entropy Regularizer 0.01
RMSProp momentum 0.0
RMSProp ε 0.01
Learning rate 0.0006
Clip global gradient norm 40.0
Learning rate schedule Anneal linearly to 0
From beginning to end of training.
Population based training (only multi-task agent)
- Population size 24
- Start parameters Same as DMLab-30 sweep
- Fitness Mean
P capped human normalised scores
( l min [1, (st − rt )/(ht − rt )]) /N
- Adapted parameters Gradient clipping threshold
Entropy regularisation
Learning rate
RMSProp ε

Table G.1. Hyperparameters for Atari experiments.

References
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.
Dayan, P. and Sejnowski, T. J. TD(λ) converges with probability 1. Machine Learning, 14(1):295–301, 1994. doi:
10.1023/A:1022657612745.
Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I.,
Simonyan, K., Fernando, C., and Kavukcuoglu, K. Population based training of neural networks. CoRR, abs/1711.09846,
2017.

Kushner, H. and Yin, G. Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and
Applied Probability. Springer New York, 2003. ISBN 9780387008943.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous
methods for deep reinforcement learning. ICML, 2016.

Towards Delivering A Coherent Self-Contained Explanation of Proximal Policy Optimization
No ratings yet
Towards Delivering A Coherent Self-Contained Explanation of Proximal Policy Optimization
36 pages
Reinforcement Learning Notes ?
No ratings yet
Reinforcement Learning Notes ?
40 pages
Deep Reinforcement Learning: Overcoming The Challenges of Deep Learning in Discrete and Continuous Markov Decision Processes
No ratings yet
Deep Reinforcement Learning: Overcoming The Challenges of Deep Learning in Discrete and Continuous Markov Decision Processes
110 pages
Multi Agent Deep Reinforcement Learning: A Survey: Sven Gronauer Klaus Diepold
No ratings yet
Multi Agent Deep Reinforcement Learning: A Survey: Sven Gronauer Klaus Diepold
49 pages
(Addison-Wesley Data & Analytics Series) Laura Graesser - Wah Loon Keng - Foundations of Deep Reinforcement Learning - Theory and Practice in Python-Addison-Wesley Professional (2019) PDF
100% (1)
(Addison-Wesley Data & Analytics Series) Laura Graesser - Wah Loon Keng - Foundations of Deep Reinforcement Learning - Theory and Practice in Python-Addison-Wesley Professional (2019) PDF
656 pages
Asynchronous Methods For Deep Reinforcement Learning
No ratings yet
Asynchronous Methods For Deep Reinforcement Learning
28 pages
Advanced Systemdesign 2023
No ratings yet
Advanced Systemdesign 2023
65 pages
L24 MarkovDecisionProcess
No ratings yet
L24 MarkovDecisionProcess
129 pages
A Survey of Demonstration Learning
No ratings yet
A Survey of Demonstration Learning
30 pages
2006 00979v1 PDF
No ratings yet
2006 00979v1 PDF
33 pages
Imitation Learning Progress, Taxonomies and Challenges
No ratings yet
Imitation Learning Progress, Taxonomies and Challenges
21 pages
Soft Actor-Critic Algorithms and Applications: UC Berkeley, Google Brain, Contributed Equally
No ratings yet
Soft Actor-Critic Algorithms and Applications: UC Berkeley, Google Brain, Contributed Equally
17 pages
DRL v5
No ratings yet
DRL v5
64 pages
Chapter 4
No ratings yet
Chapter 4
16 pages
CSD411-Week 3 - Learning Paradigms and Mathematical Foundations
No ratings yet
CSD411-Week 3 - Learning Paradigms and Mathematical Foundations
132 pages
Asynchronous Methods For Deep Reinforcement Learning
No ratings yet
Asynchronous Methods For Deep Reinforcement Learning
28 pages
Federated Deep Reinforcement Learning
No ratings yet
Federated Deep Reinforcement Learning
9 pages
Deep Reinforcement Learning in Large Discrete Action Spaces
No ratings yet
Deep Reinforcement Learning in Large Discrete Action Spaces
11 pages
Offline Imitation Learning From Multiple Baselines With Applications To Compiler Optimization
No ratings yet
Offline Imitation Learning From Multiple Baselines With Applications To Compiler Optimization
10 pages
Model Ensemble Trpo
No ratings yet
Model Ensemble Trpo
15 pages
Op Tim Ization
No ratings yet
Op Tim Ization
19 pages
NeurIPS 2022 Multi Agent Reinforcement Learning Is A Sequence Modeling Problem Paper Conference
No ratings yet
NeurIPS 2022 Multi Agent Reinforcement Learning Is A Sequence Modeling Problem Paper Conference
13 pages
NIPS 2016 Generative Adversarial Imitation Learning Paper
No ratings yet
NIPS 2016 Generative Adversarial Imitation Learning Paper
9 pages
A Deep Reinforcement Learning Algorithm For Robotic Manipulation Tasks in Simulated Environments
No ratings yet
A Deep Reinforcement Learning Algorithm For Robotic Manipulation Tasks in Simulated Environments
10 pages
Reinforcement Learning For IoT - Final
No ratings yet
Reinforcement Learning For IoT - Final
45 pages
Deep Reinforcement Learning An Overview
No ratings yet
Deep Reinforcement Learning An Overview
30 pages
I - E - E R - L: N Context Xploration Xploitation For E Inforcement Earning
No ratings yet
I - E - E R - L: N Context Xploration Xploitation For E Inforcement Earning
16 pages
3003 o Ine Reinforcement Learning W
No ratings yet
3003 o Ine Reinforcement Learning W
15 pages
AC - Sample Efficient Actor-Critic With Experience Replay
No ratings yet
AC - Sample Efficient Actor-Critic With Experience Replay
20 pages
Deep Reinforcement Learning Mohit Sewak
No ratings yet
Deep Reinforcement Learning Mohit Sewak
6 pages
1701 07274v2 PDF
No ratings yet
1701 07274v2 PDF
30 pages
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor
No ratings yet
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor
10 pages
A Survey On Offline Reinforcement Learning Taxonomy Review and Open Problems
No ratings yet
A Survey On Offline Reinforcement Learning Taxonomy Review and Open Problems
21 pages
10 Deep Reinforcement
No ratings yet
10 Deep Reinforcement
40 pages
Algorithm For RL
No ratings yet
Algorithm For RL
99 pages
Cheatsheet Deep Learning
No ratings yet
Cheatsheet Deep Learning
2 pages
Reasoning
No ratings yet
Reasoning
21 pages
2404 08003v5
No ratings yet
2404 08003v5
31 pages
ReinforcementLearningAssign2 1)
No ratings yet
ReinforcementLearningAssign2 1)
7 pages
High-Throughput Synchronous Deep RL: Iou-Jen Liu, Raymond A. Yeh, Alexander G. Schwing
No ratings yet
High-Throughput Synchronous Deep RL: Iou-Jen Liu, Raymond A. Yeh, Alexander G. Schwing
22 pages
Lecture Notes v1.0 687 F22
No ratings yet
Lecture Notes v1.0 687 F22
115 pages
Case
No ratings yet
Case
6 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
17 pages
Emergent Complexity Via Multiagent Competition
No ratings yet
Emergent Complexity Via Multiagent Competition
12 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
28 pages
Machine Learning Using Python Project Report: Stock Price Prediction Using ML
No ratings yet
Machine Learning Using Python Project Report: Stock Price Prediction Using ML
21 pages
Report On Reinforcement Learning
No ratings yet
Report On Reinforcement Learning
26 pages
Lecture 30 Reinforcement-Learning
No ratings yet
Lecture 30 Reinforcement-Learning
50 pages
6-Month Roadmap To Becoming An AI Engineer - A Step-by-Step Guide
No ratings yet
6-Month Roadmap To Becoming An AI Engineer - A Step-by-Step Guide
21 pages
RL Concepts and Methods
No ratings yet
RL Concepts and Methods
8 pages
Deep Reinforcement Learning
No ratings yet
Deep Reinforcement Learning
25 pages
Alg RLearning Ejemplo
No ratings yet
Alg RLearning Ejemplo
99 pages
Survey On Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods
No ratings yet
Survey On Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods
22 pages
Reinforcement Learning-1
No ratings yet
Reinforcement Learning-1
19 pages
07 Deep Reinforcement Learning (John)
No ratings yet
07 Deep Reinforcement Learning (John)
52 pages
Unit 3 Ai
No ratings yet
Unit 3 Ai
5 pages
Bridging The Gap Between Value and Policy Based Reinforcement Learning
No ratings yet
Bridging The Gap Between Value and Policy Based Reinforcement Learning
21 pages
G M R:Graph Meta Reinforcement Learning For Multi-Objective Influence Maximization
No ratings yet
G M R:Graph Meta Reinforcement Learning For Multi-Objective Influence Maximization
11 pages
Recollected - Questions para Repasar
No ratings yet
Recollected - Questions para Repasar
8 pages
Deep Learning - AD3501 - Important Questions
No ratings yet
Deep Learning - AD3501 - Important Questions
12 pages
Deep Learning in Trading PDF
No ratings yet
Deep Learning in Trading PDF
10 pages
Artificial Intelligence For Renewable Energy Systems (Artificial Intelligence and Soft Computing For Industrial Transformation) (Ajay Kumar Vyas, S. Balamurugan Etc.) (Z-Library)
100% (2)
Artificial Intelligence For Renewable Energy Systems (Artificial Intelligence and Soft Computing For Industrial Transformation) (Ajay Kumar Vyas, S. Balamurugan Etc.) (Z-Library)
256 pages
A Novel Deep Learning Framework: Prediction and Analysis of Financial Time Series Using CEEMD and LSTM
No ratings yet
A Novel Deep Learning Framework: Prediction and Analysis of Financial Time Series Using CEEMD and LSTM
21 pages
Stock Market Prediction Using Machine Learning
No ratings yet
Stock Market Prediction Using Machine Learning
3 pages
Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder
100% (1)
Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder
20 pages
LSTM
No ratings yet
LSTM
11 pages
2nd Unit NN Final Class Notes
No ratings yet
2nd Unit NN Final Class Notes
51 pages
Endsem Deep Learning Important
No ratings yet
Endsem Deep Learning Important
2 pages
Effective LSTMs With Seasonal-Trend Decomposition and Adaptive Learning and Niching-Based Backtracking Search Algorithm For Time Series Forecasting
No ratings yet
Effective LSTMs With Seasonal-Trend Decomposition and Adaptive Learning and Niching-Based Backtracking Search Algorithm For Time Series Forecasting
29 pages
NLP and Sentiment Analysis
No ratings yet
NLP and Sentiment Analysis
89 pages
Unit 5b - Natural Language Processing
No ratings yet
Unit 5b - Natural Language Processing
41 pages
Deep Learning in Automated Ecg Noise Detection
No ratings yet
Deep Learning in Automated Ecg Noise Detection
22 pages
A Data-Driven Auto-CNN-LSTM Prediction Model For Lithium-Ion Battery Remaining Useful Life
No ratings yet
A Data-Driven Auto-CNN-LSTM Prediction Model For Lithium-Ion Battery Remaining Useful Life
10 pages
Stock Chart Pattern Recognition With Deep Learning: Marc Velay and Fabrice Daniel
No ratings yet
Stock Chart Pattern Recognition With Deep Learning: Marc Velay and Fabrice Daniel
6 pages
An Evaluation of Machine Learning and Deep Learning Models For Drought Prediction Using Weather Dara
No ratings yet
An Evaluation of Machine Learning and Deep Learning Models For Drought Prediction Using Weather Dara
36 pages
Engineering Applications of Artificial Intelligence: Mohit Beniwal, Archana Singh, Nand Kumar
No ratings yet
Engineering Applications of Artificial Intelligence: Mohit Beniwal, Archana Singh, Nand Kumar
11 pages
Predicting Inflation With Neural Networks: Livia Paranhos
No ratings yet
Predicting Inflation With Neural Networks: Livia Paranhos
47 pages
The Machine To End All Machines Not Annotated
No ratings yet
The Machine To End All Machines Not Annotated
17 pages
Deep Transformer Models For Time Series Forecasting
No ratings yet
Deep Transformer Models For Time Series Forecasting
10 pages
A Novel Hybrid CNN-LSTM Approach For Handwritten Text Recognition For The Washington Database
No ratings yet
A Novel Hybrid CNN-LSTM Approach For Handwritten Text Recognition For The Washington Database
5 pages
DL Jun - 2023
No ratings yet
DL Jun - 2023
2 pages
Deep Finesse Network Model With Multichannel Syntactic and Contextual Features For Target-Specific Sentiment Classification
No ratings yet
Deep Finesse Network Model With Multichannel Syntactic and Contextual Features For Target-Specific Sentiment Classification
21 pages
NLP Lab2
No ratings yet
NLP Lab2
7 pages
Deep Learning For Classification Tasks On Geospatial Vector Polygons
No ratings yet
Deep Learning For Classification Tasks On Geospatial Vector Polygons
30 pages
Deep Learning For Time Series Forecasting: The Electric Load Case
No ratings yet
Deep Learning For Time Series Forecasting: The Electric Load Case
19 pages
Robotic Self-Replication: Annual Review of Control, Robotics, and Autonomous Systems
No ratings yet
Robotic Self-Replication: Annual Review of Control, Robotics, and Autonomous Systems
26 pages
Understanding LSTM Networks - Colah's Blog
No ratings yet
Understanding LSTM Networks - Colah's Blog
7 pages
Ensembles of Deep LSTM Learners For Activity Recognition Using Wearables
No ratings yet
Ensembles of Deep LSTM Learners For Activity Recognition Using Wearables
28 pages
FCC For Lunar Regolith
No ratings yet
FCC For Lunar Regolith
10 pages
Lunar-Based Self-Replicating Solar Factory: Voicesofthenew
No ratings yet
Lunar-Based Self-Replicating Solar Factory: Voicesofthenew
10 pages
Improved Performance Research Integration Tool User Guide - Version 4.6
From Everand
Improved Performance Research Integration Tool User Guide - Version 4.6
Beth Plott
No ratings yet
Google JAX Essentials: A quick practical learning of blazing-fast library for machine learning and deep learning projects
From Everand
Google JAX Essentials: A quick practical learning of blazing-fast library for machine learning and deep learning projects
Mei Wong
No ratings yet
50 Breakthrough AI Concepts in 500 Words Each: In 500 words, #17
From Everand
50 Breakthrough AI Concepts in 500 Words Each: In 500 words, #17
Nietsnie Trebla
No ratings yet
IGNOU PGDCA MCS 206 Object Oriented Programming using Java Previous Years solved Papers
From Everand
IGNOU PGDCA MCS 206 Object Oriented Programming using Java Previous Years solved Papers
Manish Soni
No ratings yet

IMPALA: Scalable Distributed Deep-RL With Importance Weighted Actor-Learner Architectures

Uploaded by

IMPALA: Scalable Distributed Deep-RL With Importance Weighted Actor-Learner Architectures

Uploaded by

IMPALA: Scalable Distributed Deep-RL with Importance Weighted

Abstract separately. We are interested in developing new methods

capable of mastering a diverse set of tasks simultaneously as

Observations Environment steps Forward pass Backward pass

during the estimation of the policy gradient. In contrast,

rooms_watermaze rooms_keys_doors_puzzle lasertag_three_opponents_small explore_goal_locations_small seekavoid_arena_01

rooms_watermaze rooms_keys_doors_puzzle lasertag_three_opponents_small explore_goal_locations_small seekavoid_arena_01

Human Normalised Return Median Mean

Table 4. Human normalised scores on Atari-57. Up to 30 no-ops

Acknowledgements Clemente, A. V., Martı́nez, H. N. C., and Chandra, A. Ef-

Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy

Puterman, M. L. Markov Decision Processes: Discrete

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H.,

Salimans, T., Ho, J., Chen, X., and Sutskever, I. Evolu-

Proof. First notice that we can rewrite R as

A.2. Online learning

π(ai |xi ) π(ai |xi )

A.3. On the choice of qs in policy gradient

E[qs |xs , as ] = rs + γE V πρ̄ (xs+1 ) + δs+1 V πρ̄ + γcs+1 δs+2 V πρ̄ + . . .

= V πρ̄ (xs ) + ρs rs + γE V πρ̄ (xs+1 ) − V πρ̄ (xs )

= V πρ̄ (xs )(1 − ρs ) + ρs Qπρ̄ (xs , as ),

Table B.1. DMLab-30 test scores.

B.1. Final training scores on DMLab-30

0 20 40 60 80 100 120 140 160

Figure B.1. Human normalised scores across all DMLab-30 tasks.

Hyperparameter Range Distribution

Action Native DeepMind Lab Action

D.1. Fixed Model Hyperparameters

Figure D.1. Optimistic Asymmetric Clipping - 0.3 · min(tanh(reward), 0) + 5.0 · max(tanh(reward), 0)

E.2. V-trace Stability Analysis

rooms_watermaze rooms_keys_doors_puzzle lasertag_three_opponents_small explore_goal_locations_small seekavoid_arena_01

E.3. Estimating the State Action Value for Policy Gradient

F. Population Based Training

IMPALA - PBT - 8 GPUs IMPALA - PBT - 1 GPU

Table G.1. Hyperparameters for Atari experiments.

You might also like