
Reincarnating Reinforcement Learning:

Reusing Prior Computation to Accelerate Progress

Rishabh Agarwal1,2∗  Max Schwarzer1,2
Pablo Samuel Castro1  Aaron Courville2  Marc G. Bellemare1,2
1 Google Research, Brain Team   2 MILA

Abstract
Learning tabula rasa, that is without any previously learned knowledge, is the
prevalent workflow in reinforcement learning (RL) research. However, RL systems,
when applied to large-scale settings, rarely operate tabula rasa. Such large-scale
systems undergo multiple design or algorithmic changes during their development
cycle and use ad hoc approaches for incorporating these changes without re-training
from scratch, which would have been prohibitively expensive. Additionally, the
inefficiency of deep RL typically excludes researchers without access to industrial-
scale resources from tackling computationally-demanding problems. To address
these issues, we present reincarnating RL as an alternative workflow or class of
problem settings, where prior computational work (e.g., learned policies) is reused
or transferred between design iterations of an RL agent, or from one RL agent
to another. As a step towards enabling reincarnating RL from any agent to any
other agent, we focus on the specific setting of efficiently transferring an existing
sub-optimal policy to a standalone value-based RL agent. We find that existing
approaches fail in this setting and propose a simple algorithm to address their
limitations. Equipped with this algorithm, we demonstrate reincarnating RL’s gains
over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the
real-world problem of navigating stratospheric balloons. Overall, this work argues
for an alternative approach to RL research, which we believe could significantly
improve real-world RL adoption and help democratize it further. Open-sourced
code and trained agents at agarwl.github.io/reincarnating_rl.

1 Introduction
Reinforcement learning (RL) is a general-purpose paradigm for making data-driven decisions. Due
to this generality, the prevailing trend in RL research is to learn systems that can operate efficiently
tabula rasa, that is without much learned knowledge including prior computational work such as
offline datasets or learned policies. However, tabula rasa RL systems are typically the exception rather
than the norm for solving large-scale RL problems [4, 13, 55, 75, 85]. Such large-scale RL systems
often need to function for long periods of time and continually experience new data; restarting them
from scratch may require weeks if not months of computation, and there may be billions of data
points to re-process – this makes the tabula rasa approach impractical. For example, the system that
plays Dota 2 at a human-like level [13] underwent several months of RL training with continual
changes (e.g., in model architecture, environment, etc) during its development; this necessitated
building upon the previously trained system after such changes to circumvent re-training from scratch,
which was done using ad hoc approaches (described in Section 3).
Current RL research also excludes the majority of researchers outside certain resource-rich labs from
tackling complex problems, as doing so often incurs substantial computational and financial cost:
AlphaStar [85], which achieves grandmaster level in StarCraft, was trained using TPUs for more than
a month and replicating it would cost several million dollars (Appendix A.1). Even the quintessential
deep RL benchmark of training an agent on 50+ Atari games [10], with at least 5 runs, requires

Correspondence to Rishabh Agarwal <[email protected]>.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).


Figure 1: A reincarnating RL workflow on ALE. The plots show IQM [2] normalized scores over training,
computed using 50 seeds, aggregated across 10 Atari games. The vertical separators correspond to loading
network weights and the replay buffer for fine-tuning, and to offline pre-training on the replay buffer using
QDagger (Section 4.1) for reincarnation. Shaded regions show 95% confidence intervals. We assign a score of 1
to DQN (Adam) trained for 400M frames and 0 to a random agent. (Panel 1) Tabula rasa Nature DQN [60]
nearly converges in performance after training for 200M frames. (Panel 2) Reincarnation via fine-tuning
Nature DQN with a reduced learning rate leads to 50% higher IQM with only 1M additional frames (leftmost
point). Furthermore, fine-tuning Nature DQN while switching from RMSProp to Adam matches the perfor-
mance of DQN (Adam) trained from scratch for 400M frames, using only 20M frames. (Panel 3) A modern
ResNet (Impala-CNN [26]) with a better algorithm (Rainbow [35]) outperforms further fine-tuning n-step DQN.
Reincarnating Impala-CNN Rainbow from DQN, outperforms tabula rasa Impala-CNN Rainbow throughout
training and requires only 50M frames to nearly match its performance at 100M frames. See Section 5.

more than 1000 GPU days. As deep RL research moves towards more challenging problems, the
computational barrier to entry in RL research is likely to further increase.
To address both the computational and sample inefficiencies of tabula rasa RL, we present reincarnat-
ing RL (RRL) as an alternative research workflow or a class of problems to focus on. RRL seeks
to maximally leverage existing computational work, such as learned network weights and collected
data, to accelerate training across design iterations of an RL agent or when moving from one agent to
another. In RRL, agents need not be trained tabula rasa, except for initial forays into new problems.
For example, imagine a researcher who has trained an agent A1 for a long time (e.g., weeks), but
now this or another researcher wants to experiment with better architectures or RL algorithms. While
the tabula rasa workflow requires re-training another agent from scratch, reincarnating RL provides
the more viable option of transferring A1 to another agent and training this agent further, or simply
fine-tuning A1 (Figure 1). As such, RRL can be viewed as an attempt to provide a formal foundation
for the research workflow needed for real-world and large-scale RL models.
Reincarnating RL can democratize research by allowing the broader community to tackle larger-scale
and complex RL problems without requiring excessive computational resources. As a consequence,
RRL can also help avoid the risk of researchers overfitting to conclusions from small-scale RL
problems. Furthermore, RRL can enable a benchmarking paradigm where researchers continually
improve and update existing trained agents, especially on problems where improving performance has
real-world impact (e.g., balloon navigation [11], chip design [59], tokamak control [24]). In addition,
a common real-world RL use case will likely be in scenarios where prior computational work is
available (e.g., existing deployed RL policies), making RRL important to study. However, beyond
some ad hoc large-scale reincarnation efforts (Section 3), the community has not focused much on
studying reincarnating RL as a research problem in its own right. To this end, this work argues for
developing general-purpose RRL approaches as opposed to ad hoc solutions.
Different RRL problems can be instantiated depending on how the prior computational work is
provided: logged datasets, learned policies, pretrained models, representations, etc. As a step
towards developing broadly applicable reincarnation approaches, we focus on the specific setting
of policy-to-value reincarnating RL (PVRL) for efficiently transferring a suboptimal teacher policy
to a value-based RL student agent (Section 4). Since it is undesirable to maintain dependency on
past teachers for successive reincarnations, we require a PVRL algorithm to “wean” off the teacher
dependence as training progresses. We find that prior approaches, when evaluated for PVRL on
the Arcade Learning Environment (ALE) [10], either result in small improvements over the tabula
rasa student or exhibit degradation when weaning off the teacher. To address these limitations, we
introduce QDagger, which combines Dagger [71] with n-step Q-learning, and outperforms prior
approaches. Equipped with QDagger, we demonstrate the sample and compute-efficiency gains
of reincarnating RL over tabula rasa RL, on ALE, a humanoid locomotion task and the simulated
real-world problem of navigating stratospheric balloons [11] (Section 5). Finally, we discuss some
considerations in RRL as well as address reproducibility and generalizability concerns.

2 Preliminaries
The goal in RL is to maximize the long-term discounted reward in an environment. We model the
environment as an MDP, defined as (S, A, R, P, γ) [69], with a state space S, an action space A, a
stochastic reward function R(s, a), transition dynamics P(s′|s, a) and a discount factor γ ∈ [0, 1).
A policy π(·|s) maps states to a distribution over actions. The Q-value function Qπ(s, a) for a
policy π(·|s) is the expected sum of discounted rewards obtained by executing action a at state s
and following π(·|s) thereafter. DQN [60] builds on Q-learning [87] and parameterizes the Q-value
function, Qθ, with a neural net with parameters θ while following an ε-greedy policy with respect to
Qθ for data collection. DQN minimizes the temporal difference (TD) loss, L_TD(D_S), on transition
tuples, (s, a, r, s′), sampled from an experience replay buffer D_S collected during training:

$$\mathcal{L}_{TD}(\mathcal{D}) = \mathbb{E}_{s,a,r,s' \sim \mathcal{D}}\Big[\big(Q_\theta(s, a) - r - \gamma \max_{a'} \bar{Q}_\theta(s', a')\big)^2\Big] \qquad (1)$$

where Q̄θ is a delayed copy of the same Q-network, referred to as the target network. Modern value-
based RL agents, such as Rainbow [35], use n-step returns to further stabilize learning. Specifically,
rather than training the Q-value estimate $Q(s_t, a_t)$ on the basis of the single-step temporal difference
error $r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$, an n-step target $\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q(s_{t+n}, a') - Q(s_t, a_t)$
is used in the TD loss, with intermediate future rewards stored in the replay D.
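As a concrete illustration of these targets, the sketch below computes the 1-step and n-step TD targets described above with plain NumPy; the function names and toy numbers are illustrative assumptions and are not taken from any released codebase.

```python
import numpy as np

def one_step_target(r, q_next, gamma=0.99):
    """1-step target: r_t + gamma * max_a' Q(s_{t+1}, a')."""
    return r + gamma * np.max(q_next)

def n_step_target(rewards, q_boot, gamma=0.99):
    """n-step target: sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n * max_a' Q(s_{t+n}, a').

    rewards: the n intermediate rewards r_t, ..., r_{t+n-1} stored in the replay buffer.
    q_boot:  target-network Q-values at the bootstrap state s_{t+n}, one entry per action.
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma ** n * np.max(q_boot))

# Toy 3-step example (n = 3) with a 3-action Q-value vector at the bootstrap state.
rewards = np.array([1.0, 0.0, 0.5])      # r_t, r_{t+1}, r_{t+2}
q_boot = np.array([0.2, 1.3, 0.7])       # Q(s_{t+3}, .)
print(n_step_target(rewards, q_boot))    # the quantity regressed onto by the TD loss
```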

3 Related work
Prior ad hoc reincarnation efforts. While several high-profile RL achievements have used reincar-
nation, it has typically been done in an ad-hoc way and has limited applicability. OpenAI Five [13],
which can play Dota 2 at a human-like level, required 10 months of large-scale RL training and
went through continual changes in code and environment (e.g., expanding observation spaces) during
development. To avoid restarting from scratch after such changes, OpenAI Five used “surgery” akin
to Net2Net [17] style transformations to convert a trained model to certain bigger architectures with
custom weight initializations. AlphaStar [85] employs population-based training (PBT) [42], which
periodically copies weights of the best performing value-based agents and mutates hyperparameters
during training. Although PBT and surgery methods are efficient, they cannot be used for
reincarnating RL when switching to arbitrary architectures (e.g., feed-forward to recurrent networks)
or from one model class to another (e.g., policy to a value function). Akkaya et al. [4] trained RL
policies for several months to manipulate a robot hand for solving Rubik’s cube. To do so, they
“rarely trained experiments from scratch” but instead initialized new policies, with architectural
changes, from previously trained policies using behavior cloning via on-policy distillation [20, 67].
AlphaGo [75] also used behavior cloning on human replays for initializing the policy and fine-tuning
it further with RL. However, behavior cloning is only applicable for policy to policy transfer and is
inadequate for the PVRL setting of transferring a policy to a value function [e.g., 63, 83]. Contrary
to such approaches, we apply reincarnation in settings where these approaches are not applicable
including transferring a DQN agent to Impala-CNN Rainbow in ALE, and a distributed agent with
MLP architecture to a recurrent agent in BLE. Several prior works also fine-tune existing agents with
deep RL for reducing training time, especially on real-world tasks such as chip floor-planning [59],
robotic manipulation [43], aligning language models [6], and compiler optimization [82]. In line
with these works, we find that fine-tuning a value-based agent can be an effective reincarnation strat-
egy (Figure 7). However, fine-tuning is often constrained to use the same architecture as the agent
being fine-tuned. Instead, we focus on reincarnating RL methods that do not have this limitation.
Leveraging prior computation. While areas such as offline RL, imitation learning, transfer in RL,
and continual RL focus on developing methods to leverage prior computation, these areas do not strive
to change how we do RL research by incorporating such methods as a part of our workflow. For
completeness, we contrast closely related approaches to PVRL, the RRL setting we study.

- Leveraging existing agents. Existing policies have been previously used for improving data
collection [14, 16, 28, 77, 89]; we evaluate one such approach, JSRL [83], which improves exploration
in goal-reaching RL tasks. However, our PVRL experiments indicate that JSRL performs poorly
on ALE. Schmitt et al. [74] propose kickstarting to speed-up actor-critic agents using an interactive
teacher policy by combining on-policy distillation [20, 67] with RL. Empirically, we find that
kickstarting is a strong baseline for PVRL; however, it exhibits unstable behavior without n-step
returns and underperforms QDagger. PVRL also falls under the framework of agents teaching
agents (ATA) [21] with RL-based students and teachers. While ATA approaches, such as action
advice [81], emphasize how and when to query the teacher or how to evaluate the utility of teacher
advice, PVRL focuses on sample-efficient transfer and does not impose constraints on querying the
teacher. PVRL is also different from prior work on accelerating RL using a heuristic or oracle value
function [9, 19, 78], as PVRL only assumes access to a suboptimal policy. Unlike PVRL methods
that wean off the teacher, imitation-regularized RL methods [51, 61] stay close to the suboptimal
teacher, which can limit the student’s performance with continued training (Figure 9).
- Leveraging prior data. Learning from demonstrations (LfD) [5, 30, 36, 40, 72] approaches
focus on accelerating RL training using demonstrations. Such approaches typically assume access to
optimal or near-optimal trajectories, often obtained from human demonstrators, and aim to match the
demonstrator’s performance. Instead, PVRL focuses on leveraging a suboptimal teacher policy, which
can be obtained from any trained RL agent, that we wean off during training. Empirically, we find
that DQfD [36], a well-known LfD approach to accelerate deep Q-learning, when applied to PVRL,
exhibits severe performance degradation when weaning off the teacher. Rehearsal approaches [62,
66, 76] focus on improving exploration by replaying demonstrations during learning; we find that
such approaches are ineffective for leveraging the teacher in PVRL. Offline RL [1, 49, 52] focuses on
learning solely from fixed datasets while reincarnating RL focuses on leveraging prior information,
which can also be presented as offline datasets, for speeding up further learning from environment
interactions. Recent works [45, 51, 55, 63] use offline RL to pretrain on prior data and then fine-tune
online. We also evaluate this pretraining approach for PVRL and find that it underperforms QDagger,
which utilizes the interactive teacher policy in addition to the prior teacher collected data.

4 Case Study: Policy to Value Reincarnating RL


While prior large-scale efforts have used a limited form of reincarnating RL (Section 3), it is unclear
how to design more broadly applicable RRL approaches. To exemplify the challenges of designing
such approaches, we focus on the RRL setting for accelerating training of a student agent given
access to a suboptimal teacher policy and some data from it. While a policy-based student can
be easily reincarnated in this setting via behavior cloning [e.g., 4], we study the more challenging
policy-to-value reincarnating RL (PVRL) setting for transferring a policy to a value-based student
agent. While we can obtain a policy from any RL agent, we chose this setting because value-based
RL methods (Q-learning, actor-critic) can leverage off-policy data for better sample efficiency. To be
broadly useful for reincarnating agents, a PVRL algorithm should satisfy the following desiderata:
• Teacher-agnostic. Reincarnating RL has limited utility if the student is constrained by the
teacher’s architecture or learning algorithm. Thus, we require the student to be teacher-agnostic.
• Weaning. It is undesirable to maintain dependency on past teachers when reincarnation may
occur several times over the course of a project, or from one project to another. Thus, it is necessary
that the student's dependence on the teacher policy can be weaned off as training progresses.
• Compute & sample efficient. Naturally, RRL is only useful if it is computationally cheaper
than training from scratch. Thus, it is desirable that the student can recover and possibly improve
upon the teacher’s performance using fewer environment samples than training tabula rasa.
PVRL on Atari 2600 games. Given the above desiderata for PVRL, we now empirically investigate
whether existing methods that leverage existing data or agents (see Section 3) suffice for PVRL.
The specific methods that we consider were chosen because they are simple to implement, and also
because they have been designed with closely related goals in mind.
Experimental setup. We conduct experiments on ALE with sticky actions [57]. To reduce the
computational cost of our experiments, we use a subset of 10 commonly-used Atari 2600 games:
Asterix, Breakout, Space Invaders, Seaquest, Q*Bert, Beam Rider, Enduro, Ms Pacman, Bowling and
River Raid. We obtain the teacher policy πT by running DQN [60] with the Adam optimizer for 400
million environment frames, requiring 7 days of training per run with Dopamine [15] on P100 GPUs.

[Figure 2 plots. Legend: QDagger, Kickstarting, Pretraining, Rehearsal, DQfD, JSRL. Left: IQM Normalized Score vs. Steps (x 100k, offline) and Env. Frames (x 1M, online). Right: Fraction of runs with score > τ vs. Teacher Normalized Score (τ).]

Figure 2: Comparing PVRL algorithms for reincarnating a student DQN agent given a teacher policy (with
normalized score of 1), obtained from a DQN agent trained for 400M frames (Section 4). Baselines include
kickstarting [74], JSRL [83], rehearsal [66], offline pretraining [46] and DQfD [36]. Tabula rasa 3-step DQN
student (−· line) obtains an IQM teacher normalized score around 0.39. Shaded regions show 95% CIs. Left.
Sample efficiency curves based on IQM normalized scores, aggregated across 10 games and 3 runs, over the
course of training. Among all algorithms, only QDagger (Section 4.1) surpasses teacher performance within 10
million frames. Right. Performance profiles [2] showing the distribution of scores across all 30 runs at the end
of training (higher is better). The area under an algorithm's profile is its mean performance, while the τ value at
which it intersects y = 0.5 is its median performance. QDagger outperforms the teacher in 75% of runs.

We also assume access to a dataset DT that can be generated by the teacher (see Appendix A.5 for
results about dependence on DT ). For this work, DT is the final replay buffer (1M transitions) logged
by the teacher DQN, which is 100 times smaller than the data the teacher was trained on. For a
challenging PVRL setting, we use DQN as the student since tabula rasa DQN requires a substantial
amount of training to reach the teacher’s performance. To emphasize sample-efficient reincarnation,
we train this student for only 10 million frames, a 40 times smaller sample budget than the teacher.
Furthermore, we wean off the teacher at 6 million frames. See Appendix A.3 for more details.
Evaluation. Following Agarwal et al. [2], we report interquartile mean normalized scores with 95%
confidence intervals (CIs), aggregated across 10 games with 3 seeds each. The normalization is done
such that the random policy obtains a score of 0 and the teacher policy πT obtains a score of 1. This
differs from typically reported human-normalized scores, as we wanted to highlight the performance
differences between the student and the teacher. The sketch below illustrates this evaluation protocol; we then describe the approaches we investigate.
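The following is a minimal NumPy sketch of the normalization and aggregation just described; the per-run scores and reference values are hypothetical, and the stratified bootstrap CIs of [2] are omitted for brevity.

```python
import numpy as np

def teacher_normalized(score, random_score, teacher_score):
    """Maps a raw game score so that the random policy gets 0 and the teacher gets 1."""
    return (score - random_score) / (teacher_score - random_score)

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of runs (approximately; [2] uses a 25% trimmed mean)."""
    scores = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(scores)
    return scores[n // 4 : n - n // 4].mean()

# Hypothetical raw scores for 30 runs (10 games x 3 seeds) on a single made-up scale,
# with made-up random/teacher reference scores. In practice, normalization uses
# per-game random and teacher scores before aggregating across all runs.
rng = np.random.default_rng(0)
raw = rng.uniform(500.0, 2500.0, size=30)
normalized = teacher_normalized(raw, random_score=150.0, teacher_score=2000.0)
print(iqm(normalized))   # aggregate teacher-normalized performance across runs
```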
• Rehearsal: Since the student, in principle, can learn using any off-policy data, we can replay
teacher data DT along with the student’s own interactions during training. Following Paine et al.
[66], the student minimizes the TD loss on mini-batches that contain a fraction ρ of samples from
DT and the rest from the student's replay DS (different ρ and n-step values in Figure A.12).
• JSRL (Figure 3, left): JSRL [83] uses an interactive teacher policy as a “guide” to improve
exploration and rolls in with the guide for a random number of environment steps. To evaluate
JSRL, we vary the maximum number of roll-in steps, α, that can be taken by the teacher and
sample a random number of roll-in steps between [0, α] every episode. As the student improves,
we decay the steps taken by the teacher every iteration (1M frames) by a factor of β.
• Offline RL Pretraining: Given access to teacher data DT , we can pre-train the student using
offline RL. To do so, we use CQL [46], a widely used offline RL algorithm, which jointly
minimizes the TD and behavior cloning on logged transitions in DT (Equation A.3). Following
pretraining, we fine-tune the learned Q-network using TD loss on the student’s replay DS .
• Kickstarting (Figure 3, right): Akin to kickstarting [74], we jointly optimize the TD loss with an
on-policy distillation loss on the student's self-collected data in DS. The distillation loss uses the
cross-entropy between the teacher's policy πT and the student policy π(·|s) = softmax(Q(s, ·)/τ),
where τ is a temperature parameter (see the sketch after this list). To wean off the teacher, we decay
the distillation coefficient as training progresses. Note that kickstarting does not pretrain on teacher data.
• DQfD (Figure 4, left): Following DQfD [36], we initially pretrain the student on teacher data
DT using a combination of the TD loss with a large-margin classification loss to imitate the teacher's
actions (Equation A.4). After pretraining, we train the student on its replay data DS, again
using a combination of TD and margin losses. While DQfD minimizes the margin loss throughout
training, we decay the margin loss coefficient during the online phase, akin to kickstarting.
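The sketch below (referenced in the kickstarting bullet) evaluates, for a single transition, the cross-entropy distillation term used by kickstarting and the large-margin term used by DQfD that are added to the TD loss; the temperature, margin, coefficients, and Q-values are illustrative assumptions rather than the hyperparameters used in our experiments.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def distillation_loss(teacher_probs, student_q, temperature=1.0):
    """Kickstarting-style cross-entropy: -sum_a pi_T(a|s) log pi(a|s),
    with the student policy pi(.|s) = softmax(Q(s,.)/temperature)."""
    student_probs = softmax(student_q, temperature)
    return -np.sum(teacher_probs * np.log(student_probs + 1e-8))

def margin_loss(student_q, teacher_action, margin=1.0):
    """DQfD-style large-margin term:
    max_a [Q(s,a) + margin * 1(a != a_T)] - Q(s, a_T)."""
    penalties = np.full_like(student_q, margin)
    penalties[teacher_action] = 0.0
    return np.max(student_q + penalties) - student_q[teacher_action]

# One transition with 4 actions; all values below are made up for illustration.
student_q = np.array([0.4, 1.1, 0.3, 0.9])
teacher_probs = softmax(np.array([0.2, 2.0, 0.1, 0.5]))
teacher_action = int(np.argmax(teacher_probs))
td = 0.25                                 # stand-in for the TD loss of Equation 1
print(td + 1.0 * distillation_loss(teacher_probs, student_q))  # kickstarting-style objective
print(td + 1.0 * margin_loss(student_q, teacher_action))       # DQfD-style objective
```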

[Figure 3 plots. Left (JSRL): IQM Normalized Score vs. Max. Teacher Roll-in Steps (α ∈ {0, 100, 1000, 5000}), for n ∈ {1, 3} and β ∈ {1.0, 0.8}. Right (Kickstarting): IQM Normalized Score vs. Env. Frames (x 1M), for n ∈ {1, 3, 5, 10}.]

Figure 3: Left. JSRL. The plot shows teacher normalized scores with 95% CIs, after training for 10M
frames, aggregated using IQM across 10 Atari games with 3 seeds each. Each point corresponds to a different
experiment, evaluated using 30 seeds, with specific values of JSRL parameters (α, β) and n-step returns. Right.
Kickstarting, with different n-step returns. The plots show IQM scores over the course of training. Kickstarting
exhibits performance degradation, which is severe with 1-step returns, and is unable to surpass the teacher's performance.
[Figure 4 plots. Left (DQfD): IQM Normalized Score vs. Steps (x 100k, offline) and Env. Frames (x 1M, online), for n ∈ {3, 10} and margin m ∈ {1.0, 3.0}. Right (QDagger): IQM Normalized Score vs. Steps (x 100k, offline) and Env. Frames (x 1M, online), for n ∈ {1, 3, 5, 10}.]

Figure 4: Left. DQfD. Here, m is the margin loss parameter, which is the loss penalty when the student’s
action is different from the teacher. Right. QDagger, with different n-step returns. In both, the 1st vertical line
separates pretraining phase from online phase while the 2nd one indicates completely weaning off the teacher.

Results. Rehearsal, with the best-performing teacher data ratio (ρ = 1/16), is marginally better than
tabula rasa DQN but significantly underperforms the teacher (Figure 2, teal), which seems related to
the difficulty standard value-based methods have in learning from off-policy teacher data [65]. JSRL does
not improve performance compared to tabula rasa DQN and even hurts performance with a large
number of teacher roll-in steps (Figure 3, left). The ineffectiveness of JSRL on ALE is likely due to
the state-distribution mismatch between the student and the teacher, as the student may never visit the
states visited by the teacher and as a result, doesn’t learn to correct for its previous mistakes [16].
Pretraining with offline RL on logged teacher data recovers around 50% of the teacher’s performance
and fine-tuning this pretrained Q-function online marginally improves performance (Figure 2, pink).
However, fine-tuning degrades performance with 1-step returns, which is more pronounced with
higher values of CQL loss coefficient (Figure A.13). We also find that kickstarting exhibits perfor-
mance degradation (Figure 3, right), which is severe with 1-step returns, once we wean off the teacher
policy. Akin to kickstarting, we again observe a severe performance collapse when weaning off the
teacher dependence in DQfD (Figure 4, left), even when using n-step returns. We hypothesize
that this performance degradation is caused by the inconsistency between Q-values trained using a
combination of imitation learning and TD losses, as opposed to only minimizing the TD loss. We
also find that with intermediate values of n-step returns, such as n = 3 (also used by Rainbow [35]),
the student quickly recovers after the performance drop from weaning, while larger n-step values impede learning,
possibly due to stale target Q-values. These results reveal the sensitivity of prior methods in the
PVRL setting to specific hyperparameter choices (n-step), indicating the need for developing stable
PVRL methods that do not fail when weaning off the teacher. For practitioners, the takeaway is to
consider this hyperparameter sensitivity when weaning off the teacher for reincarnation.
4.1 QDagger: A simple PVRL baseline
To address the limitations of prior approaches, we propose QDagger, a simple method for PVRL that
combines Dagger [71], an interactive imitation learning algorithm, with n-step Q-learning (Figure 4,
right). Specifically, we first pre-train the student on teacher data DT by minimizing LQDagger (DT ),
which combines distillation loss with the TD loss, weighted by a constant λ. This pretraining phase
helps the student to mimic the teacher’s state distribution, akin to the behavior cloning phase in Dagger.
After pretraining, we minimize LQDagger (DS ) on the student’s replay DS , akin to kickstarting, where
the teacher “corrects” the mistakes on the states visited by the student. As opposed to minimizing
the Dagger loss indefinitely, QDagger decays the distillation loss coefficient λt (λ0 = λ) as training
progresses, to satisfy the weaning desideratum for PVRL. Weaning allows QDagger to deviate from
the suboptimal teacher policy πT, as opposed to being perpetually constrained to stay close to
πT (Figure 9). We find that both decaying λt linearly over training steps and using an affine function
of the ratio of student and teacher performance worked well (Appendix A.3). Assuming the student
policy π(·|s) = softmax(Q(s, ·)/τ), the QDagger loss is given by:

$$\mathcal{L}_{\text{QDagger}}(\mathcal{D}) = \mathcal{L}_{TD}(\mathcal{D}) - \lambda_t\, \mathbb{E}_{s \sim \mathcal{D}}\Big[\sum_a \pi_T(a|s)\, \log \pi(a|s)\Big] \qquad (2)$$
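A minimal NumPy sketch of Equation 2 together with the linear weaning schedule described above; the weaning horizon, temperature, and all numeric values are illustrative assumptions, not the exact settings from Appendix A.3.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def qdagger_loss(td_loss, teacher_probs, student_q, lam_t, temperature=1.0):
    """Equation 2: L_TD plus lambda_t times the cross-entropy between pi_T and pi,
    where pi(.|s) = softmax(Q(s,.)/temperature)."""
    student_probs = softmax(student_q, temperature)
    cross_entropy = -np.sum(teacher_probs * np.log(student_probs + 1e-8))
    return td_loss + lam_t * cross_entropy

def linear_weaning(step, wean_steps, lam0=1.0):
    """lambda_t decays linearly from lambda_0 to 0 over wean_steps, then stays at 0."""
    return lam0 * max(0.0, 1.0 - step / wean_steps)

# Illustrative usage on a single 4-action transition; the distillation term vanishes
# once the (assumed) weaning horizon of 1.5M gradient steps is reached.
student_q = np.array([0.4, 1.1, 0.3, 0.9])
teacher_probs = softmax(np.array([0.2, 2.0, 0.1, 0.5]))
for step in [0, 750_000, 1_500_000]:
    lam_t = linear_weaning(step, wean_steps=1_500_000)
    print(step, qdagger_loss(0.25, teacher_probs, student_q, lam_t))
```

In the full agents, the TD term is the n-step variant of Equation 1, and λt can alternatively be set as an affine function of the ratio of student and teacher performance.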
Figure 2 shows that QDagger outperforms prior methods and surpasses the teacher. We remark that
DQfD can be viewed as a QDagger ablation that uses a margin loss instead of a distillation loss, while
kickstarting as another ablation that does not pretrain on teacher data. Equipped with QDagger, we
show how to incorporate PVRL into our workflow and demonstrate its benefits over tabula rasa RL.

5 Reincarnating RL as a research workflow


Revisiting ALE. As Mnih et al. [60]’s development of Nature DQN established the tabula rasa work-
flow on ALE, we demonstrate how iterating on ALE agents’ design can be significantly accelerated
using a reincarnating RL workflow, starting from Nature DQN, in Figure 1. Although Nature DQN
used RMSProp, Adam yields better performance than RMSProp [1, 64]. While we can train another
DQN agent from scratch with Adam, fine-tuning Nature DQN with Adam and 3-step returns, with
a reduced learning rate (Figure 7), matches the performance of this tabula rasa DQN trained for
400M frames, using a 20 times smaller sample budget (Panel 2 in Figure 1). As such, on a P100 GPU,
fine-tuning only requires training for a few hours rather than a week needed for tabula rasa RL. Given
this fine-tuned DQN, fine-tuning it further results in diminishing returns with additional frames due
to being constrained to use the 3-layer convolutional neural network (CNN) with the DQN algorithm.
Let us now consider how one might use a more general reincarnation approach to improve on
fine-tuning, by leveraging architectural and algorithmic advances since DQN, without the sample
complexity of training from scratch (Panel 3 in Figure 1). Specifically, using QDagger to transfer the
fine-tuned DQN, we reincarnate Impala-CNN Rainbow that combines Dopamine Rainbow [35], which
incorporates distributional RL [12], prioritized replay [73] and n-step returns, with an Impala-CNN
architecture [26], a deep ResNet with 15 convolutional layers. Tabula rasa Impala-CNN Rainbow
outperforms fine-tuning DQN further within 25M frames. Reincarnated Impala-CNN Rainbow
quickly outperforms its teacher policy within 5M frames and maintains superior performance over its
tabula rasa counterpart throughout training for 50M frames. To catch up with this reincarnated
agent's performance, the tabula rasa Impala-CNN Rainbow requires an additional 50M frames of
training (48 hours on a P100 GPU). See Appendix A.4 for more training details. Overall,
these results indicate how past research on ALE could have been accelerated by incorporating a
reincarnating RL approach to designing agents, instead of always re-training agents from scratch.
Tackling a challenging control task. To show how reincarnating RL can enable faster experimen-
tation, we apply PVRL on the humanoid:run locomotion task, one of the hardest control problems
in DMC [80] due to its large action space (21 degrees of freedom). For this experiment, shown in
Figure 5, we use actor-critic agents in Acme [37]. For the teacher policy, we use TD3 [29] trained
for 10M environment steps and pick the best run. We find that fine-tuning this TD3 agent degrades
performance after 15M environment steps (other learning rates in Appendix A.4), which may be
related to capacity loss in value-based RL with prolonged training [47, 56]. For reincarnation, we
use single-actor D4PG [8], a distributional RL variant of DDPG [54], with a larger policy and critic
architecture than TD3. Reincarnated D4PG performs better than its tabula rasa counterpart for the
first 10M environment interactions. Both these agents converge to similar performance, which is
likely a limitation of QDagger. This result also raises the question of whether better PVRL methods
can lead to reincarnated agents that outperform their tabula rasa counterpart throughout learning.
Nevertheless, tabula rasa D4PG requires additional training for 10-12 hours on a V100 GPU to match
reincarnated D4PG’s performance, which can quickly add up to substantial savings in compute
when running a large set of experiments (e.g., architectural or hyperparameter sweeps).
Balloon Learning Environment (BLE) [33]. One of the motivations of our work is to be able to use
deep RL on real-world tasks in a data- and compute-efficient manner. To this end, the BLE

[Figure 5 (learning curves on humanoid:run) and Figure 6 (bar chart of Teacher Normalized Score (%) for the BLE agents listed in its caption) appear here.]
Figure 5: Reincarnating RL on humanoid:run. (Panel 1) We observe that TD3 nearly saturates in performance after training for 10M environment steps. The dashed traces show individual runs while the solid line shows the mean return. (Panel 2) Reincarnated D4PG performs better than its tabula rasa counterpart until the first 10M environment steps and then converges to similar performance (with lower variance). Furthermore, training TD3 for a large number of steps eventually results in performance collapse. We use identically parameterized MLP critic and policy networks with 2 hidden layers of size (256, 256) for TD3 but larger networks with 3 hidden layers for D4PG. Shaded regions show 95% CIs based on 10 seeds.

Figure 6: Comparing BLE agents. ∗: See main text. We compare QR-DQN [23] with the same MLP architecture as Perciatelli, IQN [22] with DenseNet [39], and R2D6. Reincarnated R2D6 outperforms Perciatelli as well as the tabula rasa agents, but lags behind fine-tuned Perciatelli. We report the mean score (TWR50) across 10,000 evaluation seeds with varying wind difficulty, averaged over 2 independent runs. Error bars show minimum and maximum scores on those runs.

provides a high-fidelity simulator for navigating stratospheric balloons using RL [11]. An agent in
BLE can choose from three actions to control the balloon: move up, down, or stay in place. The
balloon can only move laterally by “surfing” the winds at its altitude; the winds change over time
and vary as the balloon changes position and altitude. Thus, the agent is interacting with a partially
observable and non-stationary system, rendering this environment quite challenging. For the teacher,
we use the QR-DQN agent provided by BLE, called Perciatelli, trained using large-scale distributed
RL for 40 days on the production-level Loon simulator by Bellemare et al. [11] and further fine-tuned
in BLE. For our experiments, we train distributed RL agents using Acme with 64 actors for a budget
of 50,000 episodes on a single cloud TPU-v2, taking approximately 10-12 hours per run.
In Figure 6, we compare the final performance of distributed agents trained tabula rasa (in pink),
with reincarnation (in blue), and fine-tuned (in yellow). We consider three agents, QR-DQN [23]
with an MLP architecture (same as Perciatelli), IQN [22] with a DenseNet architecture [39], and a
recurrent agent R2D6 for addressing the partial observability in BLE. R2D6 builds on recurrent
replay distributed DQN (R2D2) [44], which uses an LSTM-based policy, and incorporates dueling
networks [86], distributional RL [12], DenseNet [39], and double Q-learning [84]. When trained tabula rasa, none
of these agents are able to match the teacher performance, with the teacher-lookalike QR-DQN agent
performing particularly poorly. As R2D6 and IQN have substantial architectural differences from the
teacher, we utilize PVRL for transferring the teacher. Reincarnation allows IQN to match and R2D6
to surpass the teacher, although both lag behind fine-tuning the teacher. See Appendix A.4.2 for more details.
When fine-tuning, we are reloading the weights from Perciatelli, which was notably trained on a
broader geographical region than BLE and whose training distribution can be considered a superset of
what is used by the other agents; this is likely the reason that fine-tuning does remarkably well relative
to other agents in BLE. Efficiently transferring information in Perciatelli’s weights to another agent
without the replay data from the Loon simulator presents an interesting challenge for future work.
Overall, the improved efficiency of reincarnating RL (fine-tuning and PVRL) over tabula rasa RL, as
evident on the BLE, could make deep RL more accessible to researchers without access to industrial-
scale resources as they can build upon prior computational work, such as model checkpoints, enabling
the possible reuse of months of prior computation (e.g., Perciatelli).

6 Considerations in Reincarnating RL
Reincarnation via fine-tuning. Given access to model weights and replay of a value-based agent,
a simple reincarnation strategy is to fine-tune this agent. While naive fine-tuning with the same
learning rate (lr) as the nearly saturated original agent does not exhibit improvement, fine-tuning
with a reduced lr, for only 1 million additional frames, results in 25% IQM improvement for
DQN (Adam) and 50% IQM improvement for Nature DQN trained with RMSProp (Figure 7). As
reincarnating RL leverages existing computational work (e.g., model checkpoints), it allows us
to easily experiment with such hyperparameter schedules, which can be expensive in the tabula
rasa setting. Note that when fine-tuning, one is forced to keep the same network architecture; in
contrast, reincarnating RL grants flexibility in architecture and algorithmic choices, which can surpass
fine-tuning performance (Figures 1 and 5).

Figure 7: Reincarnation via fine-tuning with same and reduced lr, relative to the original agent.

Figure 8: Contrasting benchmarking results under tabula rasa and PVRL settings.
Difference with tabula rasa benchmarking. Are student agents that are more data-efficient when
trained from scratch also better for reincarnating RL? In Figure 8, we answer this question in the
negative, indicating the possibility of developing better students for utilizing existing knowledge.
Specifically, we compare Dopamine Rainbow [35] and DrQ [90], under tabula rasa and PVRL settings.
DrQ outperforms Rainbow in the low-data regime when trained from scratch but underperforms Rainbow
in the PVRL setting as well as when training longer from scratch. Based on this, we speculate that
reincarnating RL comparisons might be more consistent with asymptotic tabula rasa comparisons.
Reincarnation vs. Distillation. PVRL is different from imitation learning or imitation-regularized
RL as it focuses on using an existing policy only as a launchpad for further learning, as opposed
to imitating or staying close to it. To contrast these settings, we run two ablations of QDagger for
reincarnating Impala-CNN Rainbow given a DQN teacher policy: (1) Dagger [71], which only mini-
mizes the on-policy distillation loss in QDagger, and (2) Dagger + QL, which uses a fixed distillation
loss coefficient throughout training (as opposed to QDagger, which decays it; see Equation 2). As
shown in Figure 9, Dagger performs similarly to the teacher while Dagger + QL improves over the
teacher but quickly saturates in performance. On the contrary, QDagger substantially outperforms
these ablations and shows continual improvement with additional environment interactions.
Dependency on prior work. While performance in reincarnating RL depends on prior computational
work (e.g., teacher policy in PVRL), this is analogous to how fine-tuning results in NLP / computer
vision depend on the pretrained models (e.g., using BERT vs GPT-3). To investigate teacher de-
pendence in PVRL, we reincarnate a fixed student from three different DQN teachers (Figure 10).
As expected, we observe that a higher performing teacher results in a better performing student.
However, reincarnation from two policies with similar performance but obtained from different
agents, DQN (Adam) vs. a fine-tuned Nature DQN, results in different performance. This suggests
that a reincarnated student’s performance depends not only on the teacher’s performance but also
on its behavior. Nevertheless, the ranking of PVRL algorithms remains consistent across these two
teacher policies (Figure A.11). See Section 7 for a broader discussion about generalizability.

7 Reproducibility, Comparisons and Generalizability in Reincarnating RL


Scientific Comparisons. Fairly comparing reincarnation approaches entails using exactly the same
computational work and workflow. For example, in the PVRL setting, the same teacher and data
should be used when comparing different algorithms, as we do in Section 4. To enable this, it would
be beneficial if researchers released model checkpoints and the data generated (at least the
final replay buffers), in addition to open-source code, for their trained RL agents. Indeed, to allow
others to use the same reincarnation setup as our work, we have already open-sourced DQN (Adam)
agent checkpoints and the final replay buffer at gs://rl_checkpoints.

[Figure 9 plots (DQN → Impala-CNN Rainbow reincarnation): IQM Normalized Score vs. Steps (x 100k, offline) and Env. Frames (x 1M, online) for QDagger, Dagger + QL, and Dagger. Figure 10 plots: IQM Normalized Score vs. Steps (x 100k, offline) and Env. Frames (x 1M, online) for teachers DQN @ 20M, DQN @ 400M, and DQN (Fine-Tune) @ 20M.]

Figure 9: Reincarnation vs. Distillation. Reincarnating Impala-CNN Rainbow from a DQN (Adam) trained for 400M frames, using QDagger, and comparing it to Dagger (imitation) and Dagger + Q-learning (imitation-regularized RL).

Figure 10: Reincarnation from different teachers, namely, a DQN (Adam) policy trained for 20M and 400M frames and the fine-tuned Nature DQN in Figure 1 that achieves similar performance to DQN (Adam) trained for 400M frames.

Generalizability. The generalizable findings in reincarnating RL would be about comparing algorithmic
efficacy given access to existing computational work on a task. As such, the performance
ranking of reincarnation algorithms is likely to remain consistent across different teachers. In fact, we
empirically verified this for the PVRL setting, where we find that using two different teacher
policies, namely DQN (Adam) vs. a fine-tuned Nature DQN, leads to different performance trends
but the ranking of PVRL algorithms remains consistent: QDagger > Kickstarting > Pretraining (see
Figure 2 and Figure A.11). Practitioners can use the findings from reincarnating RL to try to improve
on an existing deployed RL policy (as opposed to being restricted to running tabula rasa RL). For
example, this work developed QDagger using ALE and applied it to PVRL on other tasks with
existing policies (Humanoid-run and BLE).
Reproducibility. Reproducibility from scratch is challenging in RRL as it would require details
of how the prior computational work (e.g., teacher policies) was generated, which may itself have
been obtained via reincarnating RL. As reproducibility from scratch involves reproducing existing
computational work, it could be more expensive than training tabula rasa, which defeats the purpose of
doing reincarnation. Furthermore, reproducibility from scratch is also difficult in NLP and computer
vision, where existing pretrained models (e.g., GPT-3) are rarely, if ever, reproduced / re-trained
from scratch but almost always used as-is. Despite this difficulty, pretraining-and-fine-tuning is a
dominant paradigm in NLP and vision [e.g., 18, 25, 34, 38], and we believe that a similar difficulty in
RRL should not prevent researchers from investigating and studying this important class of problems.
Instead, we expect that RRL research would build on open-sourced prior computational work. Akin
to NLP and vision, where typically a small set of pretrained models are used in research, we believe
that research on developing better reincarnating RL methods can also possibly converge to a small set
of open-sourced models / data on a given benchmark, e.g., the agents and data we released on Atari
or the 25,000 trained Atari agents released by Gogianu et al. [31], concurrently with this work.

8 Conclusion
Our work shows that reincarnating RL is a much more computationally efficient research workflow than
tabula rasa RL and can help further democratize research. Nevertheless, our results also open several
avenues for future work. In particular, more research is needed on developing better PVRL methods,
extending PVRL to learn from multiple suboptimal teachers [48, 53], and enabling workflows that
can incorporate knowledge provided in a form other than a policy, such as pretrained models [41, 79],
representations [88], skills [50, 58, 68], or LLMs [3]. Furthermore, we believe that reincarnating
RL would be crucial for building embodied agents in open-ended domains [7, 27, 32]. Aligned
with this work, there have been calls for collaboratively building and continually improving large
pre-trained models in NLP and vision [70]. We hope that this work motivates RL researchers to
release computational work (e.g., model checkpoints), which would allow others to directly build on
their work. In this regard, we have open-sourced our code and trained agents with their final replay buffers.
Furthermore, re-purposing existing benchmarks, akin to how we use ALE in this work, can serve
as testbeds for reincarnating RL. As Newton put it, “If I have seen further it is by standing on the
shoulders of giants”; likewise, we argue that reincarnating RL can substantially accelerate progress by building
on prior computational work, as opposed to always redoing this work from scratch.

Societal Impacts
Reincarnating RL could positively impact society by reducing the computational burden on researchers
and by being more environmentally friendly than tabula rasa RL. For example, reincarnating RL
allows researchers to train super-human Atari agents on a single GPU within a span of a few hours
as opposed to training for a few days. Additionally, reincarnating RL is more accessible to the
wider research community, as researchers without sufficient compute resources can build on prior
computational work from resource-rich groups, and even improve upon them using limited resources.
Furthermore, this democratization could directly improve RL applicability for practical applications,
as most businesses that could benefit from RL often cannot afford the expertise to design in-house
solutions. However, this democratization could also make it easier to apply RL for potentially harmful
applications. Furthermore, reincarnating RL could carry forward biases or undesirable traits from
previously learned systems. As such, we urge practitioners to be mindful of how RL fits into the
wider socio-technical context of its deployment.

Acknowledgments
We would like to thank David Ha, Evgenii Nikishin, Karol Hausman, Bobak Shahriari, Richard Song,
Alex Irpan, Andrey Kurenkov for their valuable feedback on this work. We thank Joshua Greaves for
helping us set up RL agents for BLE. We also acknowledge Ted Xiao, Dale Schuurmans, Aleksandra
Faust, George Tucker, Rebecca Roelofs, Eugene Brevdo, Pierluca D’Oro, Nathan Rahn, Adrien Ali
Taiga, Bogdan Mazoure, Jacob Buckman, Georg Ostrovski and Aviral Kumar for useful discussions.

Author Contributions
Rishabh Agarwal led the project from start-to-finish, defined the scope of the work to focus on
policy to value reincarnation, came up with a successful algorithm for PVRL, and performed the
literature survey. He designed, implemented and ran most of the experiments on ALE, Humanoid-run
and BLE, and wrote the paper.
Max Schwarzer helped run DQfD experiments on ALE as well as set up some agents for
the BLE codebase with Acme, was involved in project discussions, and edited the paper. Work done
as a student researcher at Google.
Pablo Samuel Castro was involved in project discussions, helped in setting up the BLE environment
and implemented the initial Acme agents, and helped with paper editing.
Aaron Courville advised the project, helped with project direction and provided feedback on writing.
Marc Bellemare advised the project, challenged Rishabh to come up with an experimental paradigm
in which one continuously improves on an existing agent, and provided feedback on writing.

References
[1] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline
reinforcement learning. In International Conference on Machine Learning, pages 104–114. PMLR, 2020.
[2] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep
reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing
Systems, 34, 2021.
[3] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn,
Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say: Grounding
language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
[4] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex
Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand.
arXiv preprint arXiv:1910.07113, 2019.
[5] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from
demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
[6] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain,
Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with
reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

[7] Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton,
Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online
videos. arXiv preprint arXiv:2206.11795, 2022.

[8] Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair
Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients.
arXiv preprint arXiv:1804.08617, 2018.

[9] Wissam Bejjani, Rafael Papallas, Matteo Leonetti, and Mehmet R Dogar. Planning with a receding
horizon for manipulation in clutter using a learned value function. In 2018 IEEE-RAS 18th International
Conference on Humanoid Robots (Humanoids), pages 1–9. IEEE, 2018.

[10] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment:
An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[11] Marc G Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C Machado, Subhodeep
Moitra, Sameera S Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using
reinforcement learning. Nature, 588(7836):77–82, 2020.

[12] Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. MIT Press,
2022. URL http://www.distributional-rl.org.

[13] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison,
David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement
learning. arXiv preprint arXiv:1912.06680, 2019.

[14] Reinaldo AC Bianchi, Carlos HC Ribeiro, and Anna HR Costa. Heuristically accelerated q–learning: a
new approach to speed up reinforcement learning. In Brazilian Symposium on Artificial Intelligence, pages
245–254. Springer, 2004.

[15] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare.
Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110,
2018.

[16] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning
to search better than your teacher. In International Conference on Machine Learning, pages 2058–2066.
PMLR, 2015.

[17] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer.
arXiv preprint arXiv:1511.05641, 2015.

[18] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-
supervised models are strong semi-supervised learners. Advances in neural information processing systems,
33:22243–22255, 2020.

[19] Ching-An Cheng, Andrey Kolobov, and Adith Swaminathan. Heuristic-guided reinforcement learning.
Advances in Neural Information Processing Systems, 34, 2021.

[20] Wojciech M Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant Jayakumar, Grzegorz Swirszcz,
and Max Jaderberg. Distilling policy distillation. In The 22nd International Conference on Artificial
Intelligence and Statistics, pages 1331–1340. PMLR, 2019.

[21] Felipe Leno Da Silva, Garrett Warnell, Anna Helena Reali Costa, and Peter Stone. Agents teaching agents:
a survey on inter-agent transfer learning. Autonomous Agents and Multi-Agent Systems, 34(1):1–17, 2020.

[22] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional
reinforcement learning. In International conference on machine learning, pages 1096–1105. PMLR, 2018.

[23] Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning
with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

[24] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese,
Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak
plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 2022.

[25] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[26] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad
Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted
actor-learner architectures. In International Conference on Machine Learning, pages 1407–1416. PMLR,
2018.

[27] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang,
De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents
with internet-scale knowledge. arXiv preprint arXiv:2206.08853, 2022.

[28] Fernando Fernández and Manuela Veloso. Probabilistic policy reuse in a reinforcement learning agent.
In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems,
pages 720–727, 2006.

[29] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic
methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018.

[30] Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement learning from
imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018.

[31] Florin Gogianu, Tudor Berariu, Lucian Buşoniu, and Elena Burceanu. Atari agents, 2022. URL https://github.com/floringogianu/atari-agents.

[32] Djordje Grbic, Rasmus Berg Palm, Elias Najarro, Claire Glanois, and Sebastian Risi. Evocraft: A
new challenge for open-endedness. In International Conference on the Applications of Evolutionary
Computation (Part of EvoStar), pages 325–340. Springer, 2021.

[33] Joshua Greaves, Salvatore Candido, Vincent Dumoulin, Ross Goroshin, Sameera S. Ponda, Marc G.
Bellemare, and Pablo Samuel Castro. Balloon Learning Environment, December 2021. URL
https://github.com/google/balloon-learning-environment.

[34] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE
International Conference on Computer Vision, pages 2961–2969, 2017.

[35] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan
Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep
reinforcement learning. In Thirty-second AAAI conference on artificial intelligence, 2018.

[36] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John
Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. In Proceedings of the
AAAI Conference on Artificial Intelligence, 2018.

[37] Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman,
Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, et al. Acme: A research framework for
distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020.

[38] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv
preprint arXiv:1801.06146, 2018.

[39] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 4700–4708, 2017.

[40] Peter C Humphreys, David Raposo, Toby Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal,
Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, et al. A data-driven approach for learning to
control computers. arXiv preprint arXiv:2202.08137, 2022.

[41] Andrew Hundt, Aditya Murali, Priyanka Hubli, Ran Liu, Nakul Gopalan, Matthew Gombolay, and
Gregory D. Hager. "Good Robot! Now Watch This!": Repurposing reinforcement learning for task-to-task
transfer. In 5th Annual Conference on Robot Learning, 2021. URL
https://openreview.net/forum?id=Pxs5XwId51n.

[42] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi,
Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural
networks. arXiv preprint arXiv:1711.09846, 2017.

[43] Ryan Julian, Benjamin Swanson, Gaurav S Sukhatme, Sergey Levine, Chelsea Finn, and Karol Hausman.
Never stop learning: The effectiveness of fine-tuning in robotic reinforcement learning. arXiv preprint
arXiv:2004.10190, 2020.

[44] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience
replay in distributed reinforcement learning. In International conference on learning representations, 2018.

[45] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning.
arXiv preprint arXiv:2110.06169, 2021.

[46] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline
reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.

[47] Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits
data-efficient deep reinforcement learning. In International Conference on Learning Representations,
2021.

[48] Andrey Kurenkov, Ajay Mandlekar, Roberto Martin-Martin, Silvio Savarese, and Animesh Garg. Ac-teach:
A bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. arXiv preprint
arXiv:1909.04121, 2019.

[49] Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement
learning, pages 45–73. Springer, 2012.

[50] Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Cic:
Contrastive intrinsic control for unsupervised skill discovery. arXiv preprint arXiv:2202.00161, 2022.

[51] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforce-
ment learning via balanced replay and pessimistic q-ensemble. In Conference on Robot Learning, pages
1702–1712. PMLR, 2022.

[52] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial,
review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

[53] Siyuan Li, Fangda Gu, Guangxiang Zhu, and Chongjie Zhang. Context-aware policy reuse. arXiv preprint
arXiv:1806.03793, 2018.

[54] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David
Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint
arXiv:1509.02971, 2015.

[55] Yao Lu, Karol Hausman, Yevgen Chebotar, Mengyuan Yan, Eric Jang, Alexander Herzog, Ted Xiao,
Alex Irpan, Mohi Khansari, Dmitry Kalashnikov, et al. AW-Opt: Learning robotic skills with imitation
and reinforcement at scale. In Conference on Robot Learning, pages 1078–1088. PMLR, 2022.

[56] Clare Lyle, Mark Rowland, and Will Dabney. Understanding and preventing capacity loss in reinforcement
learning. arXiv preprint arXiv:2204.09560, 2022.

[57] Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael
Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general
agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

[58] Michael Matthews, Mikayel Samvelyan, Jack Parker-Holder, Edward Grefenstette, and Tim Rocktäschel.
Hierarchical kickstarting for skill transfer in reinforcement learning. arXiv preprint arXiv:2207.11584,
2022.

[59] Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang,
Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nazi, et al. A graph placement methodology for fast
chip design. Nature, 594(7862):207–212, 2021.

[60] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through
deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[61] Ted Moskovitz, Michael Arbel, Jack Parker-Holder, and Aldo Pacchiano. Towards an understanding of
default policies in multitask policy optimization. In International Conference on Artificial Intelligence and
Statistics, pages 10661–10686. PMLR, 2022.

[62] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming
exploration in reinforcement learning with demonstrations. In 2018 IEEE international conference on
robotics and automation (ICRA), pages 6292–6299. IEEE, 2018.

[63] Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement
learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

[64] Johan S Obando-Ceron and Pablo Samuel Castro. Revisiting rainbow: Promoting more insightful and
inclusive deep reinforcement learning research. In International Conference on Machine Learning (ICML),
2021.

[65] Georg Ostrovski, Pablo Samuel Castro, and Will Dabney. The difficulty of passive learning in deep
reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.

[66] Tom Le Paine, Caglar Gulcehre, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tan-
burn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams, et al. Making efficient use of demonstrations
to solve hard exploration problems. arXiv preprint arXiv:1909.01387, 2019.

[67] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer
reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.

[68] Karl Pertsch, Youngwoon Lee, and Joseph J Lim. Accelerating reinforcement learning with learned skill
priors. arXiv preprint arXiv:2010.11944, 2020.

[69] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley
& Sons, Inc., 1994.

[70] Colin Raffel. A call to build models like we build open-source software.
https://colinraffel.com/blog/a-call-to-build-models-like-we-build-open-source-software.html, 2021.

[71] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured
prediction to no-regret online learning. In Proceedings of the fourteenth international conference on
artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.

[72] Stefan Schaal. Learning from demonstration. Advances in neural information processing systems, 9, 1996.

[73] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv
preprint arXiv:1511.05952, 2015.

[74] Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M Czarnecki,
Joel Z Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, et al. Kickstarting deep reinforcement
learning. arXiv preprint arXiv:1803.03835, 2018.

[75] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian
Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go
with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[76] Alexey Skrynnik, Aleksey Staroverov, Ermek Aitygulov, Kirill Aksenov, Vasilii Davydov, and Aleksandr I
Panov. Forgetful experience replay in hierarchical reinforcement learning from expert demonstrations.
Knowledge-Based Systems, 218:106844, 2021.

[77] William D Smart and L Pack Kaelbling. Effective reinforcement learning for mobile robots. In Proceedings
2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), volume 4, pages
3404–3410. IEEE, 2002.

[78] Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement
learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.

[79] Yanchao Sun, Ruijie Zheng, Xiyao Wang, Andrew E Cohen, and Furong Huang. Transfer RL across
observation feature spaces via model-based regularization. In International Conference on Learning
Representations, 2022. URL https://openreview.net/forum?id=7KdAoOsI81C.

[80] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden,
Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint
arXiv:1801.00690, 2018.

[81] Lisa Torrey and Matthew Taylor. Teaching on a budget: Agents advising agents in reinforcement learning.
In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pages
1053–1060, 2013.

[82] Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin, Krzysztof Choromanski, and David Li. Mlgo: a
machine learning guided compiler optimizations framework. arXiv preprint arXiv:2101.04808, 2021.

[83] Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew
Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, et al. Jump-start reinforcement learning. arXiv preprint
arXiv:2204.02372, 2022.

[84] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In
Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

[85] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung
Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II
using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

[86] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling
network architectures for deep reinforcement learning. In International Conference on Machine Learning,
pages 1995–2003. PMLR, 2016.

[87] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.

[88] Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor
control. arXiv preprint arXiv:2203.06173, 2022.

[89] Linhai Xie, Sen Wang, Stefano Rosa, Andrew Markham, and Niki Trigoni. Learning with training wheels:
speeding up training with a simple controller for deep reinforcement learning. In 2018 IEEE International
Conference on Robotics and Automation (ICRA), pages 6276–6283. IEEE, 2018.

[90] Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image augmentation is all you need: Regularizing deep
reinforcement learning from pixels. In International Conference on Learning Representations, 2020.

Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 5 (BLE and continuous
control results) and the discussion of reproducibility and evaluation concerns in Section 7.
(c) Did you discuss any potential negative societal impacts of your work? [Yes]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main
experimental results (either in the supplemental material or as a URL)? [Yes]
agarwl.github.io/reincarnating_rl.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] See Appendix A.3 and A.4
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [Yes] We report 95% confidence intervals throughout.
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix A.2
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes] Apache License, Version 2.0
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]
