Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress (NeurIPS 2022)
Abstract
Learning tabula rasa, that is, without any previously learned knowledge, is the
prevalent workflow in reinforcement learning (RL) research. However, RL systems,
when applied to large-scale settings, rarely operate tabula rasa. Such large-scale
systems undergo multiple design or algorithmic changes during their development
cycle and use ad hoc approaches for incorporating these changes without re-training
from scratch, which would have been prohibitively expensive. Additionally, the
inefficiency of deep RL typically excludes researchers without access to industrial-
scale resources from tackling computationally-demanding problems. To address
these issues, we present reincarnating RL as an alternative workflow or class of
problem settings, where prior computational work (e.g., learned policies) is reused
or transferred between design iterations of an RL agent, or from one RL agent
to another. As a step towards enabling reincarnating RL from any agent to any
other agent, we focus on the specific setting of efficiently transferring an existing
sub-optimal policy to a standalone value-based RL agent. We find that existing
approaches fail in this setting and propose a simple algorithm to address their
limitations. Equipped with this algorithm, we demonstrate reincarnating RL’s gains
over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the
real-world problem of navigating stratospheric balloons. Overall, this work argues
for an alternative approach to RL research, which we believe could significantly
improve real-world RL adoption and help democratize it further. Open-sourced
code and trained agents at agarwl.github.io/reincarnating_rl.
1 Introduction
Reinforcement learning (RL) is a general-purpose paradigm for making data-driven decisions. Due
to this generality, the prevailing trend in RL research is to learn systems that can operate efficiently
tabula rasa, that is, without much learned knowledge, including prior computational work such as
offline datasets or learned policies. However, tabula rasa RL systems are typically the exception rather
than the norm for solving large-scale RL problems [4, 13, 55, 75, 85]. Such large-scale RL systems
often need to function for long periods of time and continually experience new data; restarting them
from scratch may require weeks if not months of computation, and there may be billions of data
points to re-process – this makes the tabula rasa approach impractical. For example, the system that
plays Dota 2 at a human-like level [13] underwent several months of RL training with continual
changes (e.g., in model architecture, environment, etc) during its development; this necessitated
building upon the previously trained system after such changes to circumvent re-training from scratch,
which was done using ad hoc approaches (described in Section 3).
Current RL research also excludes the majority of researchers outside certain resource-rich labs from
tackling complex problems, as doing so often incurs substantial computational and financial cost:
AlphaStar [85], which achieves grandmaster level in StarCraft, was trained using TPUs for more than
a month and replicating it would cost several million dollars (Appendix A.1). Even the quintessential
deep RL benchmark of training an agent on 50+ Atari games [10], with at least 5 runs, requires
more than 1000 GPU days. As deep RL research moves towards more challenging problems, the
computational barrier to entry in RL research is likely to further increase.
∗ Correspondence to Rishabh Agarwal <[email protected]>.
To address both the computational and sample inefficiencies of tabula rasa RL, we present reincarnat-
ing RL (RRL) as an alternative research workflow or a class of problems to focus on. RRL seeks
to maximally leverage existing computational work, such as learned network weights and collected
data, to accelerate training across design iterations of an RL agent or when moving from one agent to
another. In RRL, agents need not be trained tabula rasa, except for initial forays into new problems.
For example, imagine a researcher who has trained an agent A1 for a long time (e.g., weeks), but
now this or another researcher wants to experiment with better architectures or RL algorithms. While
the tabula rasa workflow requires re-training another agent from scratch, reincarnating RL provides
the more viable option of transferring A1 to another agent and training this agent further, or simply
fine-tuning A1 (Figure 1). As such, RRL can be viewed as an attempt to provide a formal foundation
for the research workflow needed for real-world and large-scale RL models.
Reincarnating RL can democratize research by allowing the broader community to tackle larger-scale
and complex RL problems without requiring excessive computational resources. As a consequence,
RRL can also help avoid the risk of researchers overfitting to conclusions from small-scale RL
problems. Furthermore, RRL can enable a benchmarking paradigm where researchers continually
improve and update existing trained agents, especially on problems where improving performance has
real-world impact (e.g., balloon navigation [11], chip design [59], tokamak control [24]). Moreover,
a common real-world RL use case will likely be in scenarios where prior computational work is
available (e.g., existing deployed RL policies), making RRL important to study. However, beyond
some ad hoc large-scale reincarnation efforts (Section 3), the community has not focused much on
studying reincarnating RL as a research problem in its own right. To this end, this work argues for
developing general-purpose RRL approaches as opposed to ad hoc solutions.
Different RRL problems can be instantiated depending on how the prior computational work is
provided: logged datasets, learned policies, pretrained models, representations, etc. As a step
towards developing broadly applicable reincarnation approaches, we focus on the specific setting
of policy-to-value reincarnating RL (PVRL) for efficiently transferring a suboptimal teacher policy
to a value-based RL student agent (Section 4). Since it is undesirable to maintain dependency on
past teachers for successive reincarnations, we require a PVRL algorithm to “wean” off the teacher
dependence as training progresses. We find that prior approaches, when evaluated for PVRL on
the Arcade Learning Environment (ALE) [10], either result in small improvements over the tabula
rasa student or exhibit degradation when weaning off the teacher. To address these limitations, we
introduce QDagger, which combines Dagger [71] with n-step Q-learning, and outperforms prior
approaches. Equipped with QDagger, we demonstrate the sample and compute-efficiency gains
of reincarnating RL over tabula rasa RL, on ALE, a humanoid locomotion task and the simulated
real-world problem of navigating stratospheric balloons [11] (Section 5). Finally, we discuss some
considerations in RRL as well as address reproducibility and generalizability concerns.
2 Preliminaries
The goal in RL is to maximize the long-term discounted reward in an environment. We model the
environment as an MDP, defined as (S, A, R, P, γ) [69], with a state space S, an action space A, a
stochastic reward function R(s, a), transition dynamics P(s′|s, a), and a discount factor γ ∈ [0, 1).
A policy π(·|s) maps states to a distribution over actions. The Q-value function Qπ (s, a) for a
policy π(·|s) is the expected sum of discounted rewards obtained by executing action a at state s
and following π(·|s) thereafter. DQN [60] builds on Q-learning [87] and parameterizes the Q-value
function, Qθ, with a neural network with parameters θ, while following an ε-greedy policy with respect to
Qθ for data collection. DQN minimizes the temporal difference (TD) loss, LTD(DS), on transition
tuples, (s, a, r, s′), sampled from an experience replay buffer DS collected during training:

LTD(D) = E_{s,a,r,s′∼D} [ ( Qθ(s, a) − r − γ max_{a′} Q̄θ(s′, a′) )² ]    (1)
where Q̄θ is a delayed copy of the same Q-network, referred to as the target network. Modern value-based RL agents, such as Rainbow [35], use n-step returns to further stabilize learning. Specifically, rather than training the Q-value estimate Q(s_t, a_t) on the basis of the single-step temporal difference error r_t + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t), the n-step error Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n max_{a′} Q(s_{t+n}, a′) − Q(s_t, a_t) is used in the TD loss, with intermediate future rewards stored in the replay D.
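To make the n-step target concrete, the following is a minimal NumPy sketch of the construction above; the function name, the example numbers, and the omission of episode-termination handling are illustrative simplifications rather than details taken from the paper.

```python
import numpy as np

def n_step_td_target(rewards, q_next, gamma=0.99, n=3):
    """n-step target: sum_{k<n} gamma^k r_{t+k} + gamma^n max_a' Qbar(s_{t+n}, a')."""
    discounts = gamma ** np.arange(n)           # [1, gamma, ..., gamma^{n-1}]
    bootstrap = (gamma ** n) * np.max(q_next)   # gamma^n max_a' Qbar(s_{t+n}, a')
    return float(np.sum(discounts * np.asarray(rewards)[:n]) + bootstrap)

# Squared TD error of Equation 1, generalized to n steps (n = 1 recovers the single-step case).
rewards = [1.0, 0.0, 1.0]         # r_t, r_{t+1}, r_{t+2} stored in the replay buffer
q_next = np.array([0.5, 1.2])     # target-network Q-values at s_{t+3}
q_sa = 1.1                        # current estimate Q_theta(s_t, a_t)
td_loss = (q_sa - n_step_td_target(rewards, q_next, gamma=0.99, n=3)) ** 2
```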
3 Related work
Prior ad hoc reincarnation efforts. While several high-profile RL achievements have used reincar-
nation, it has typically been done in an ad-hoc way and has limited applicability. OpenAI Five [13],
which can play Dota 2 at a human-like level, required 10 months of large-scale RL training and
went through continual changes in code and environment (e.g., expanding observation spaces) during
development. To avoid restarting from scratch after such changes, OpenAI Five used “surgery” akin
to Net2Net [17] style transformations to convert a trained model to certain bigger architectures with
custom weight initializations. AlphaStar [85] employs population-based training (PBT) [42], which
periodically copies weights of the best performing value-based agents and mutates hyperparameters
during training. Although PBT and surgery methods are efficient, they cannot be used for
reincarnating RL when switching to arbitrary architectures (e.g., feed-forward to recurrent networks)
or from one model class to another (e.g., policy to a value function). Akkaya et al. [4] trained RL
policies for several months to manipulate a robot hand for solving Rubik’s cube. To do so, they
“rarely trained experiments from scratch” but instead initialized new policies, with architectural
changes, from previously trained policies using behavior cloning via on-policy distillation [20, 67].
AlphaGo [75] also used behavior cloning on human replays for initializing the policy and fine-tuning
it further with RL. However, behavior cloning is only applicable for policy to policy transfer and is
inadequate for the PVRL setting of transferring a policy to a value function [e.g., 63, 83]. Contrary
to such approaches, we apply reincarnation in settings where these approaches are not applicable
including transferring a DQN agent to Impala-CNN Rainbow in ALE, and a distributed agent with
MLP architecture to a recurrent agent in BLE. Several prior works also fine-tune existing agents with
deep RL for reducing training time, especially on real-world tasks such as chip floor-planning [59],
robotic manipulation [43], aligning language models [6], and compiler optimization [82]. In line
with these works, we find that fine-tuning a value-based agent can be an effective reincarnation strat-
egy (Figure 7). However, fine-tuning is often constrained to use the same architecture as the agent
being fine-tuned. Instead, we focus on reincarnating RL methods that do not have this limitation.
Leveraging prior computation. While areas such as offline RL, imitation learning, transfer in RL,
and continual RL focus on developing methods to leverage prior computation, they do not strive
to change how we do RL research by incorporating such methods as a part of our workflow. For
completeness, we contrast closely related approaches to PVRL, the RRL setting we study.
- Leveraging existing agents. Existing policies have been previously used for improving data
collection [14, 16, 28, 77, 89]; we evaluate one such approach, JSRL [83], which improves exploration
in goal-reaching RL tasks. However, our PVRL experiments indicate that JSRL performs poorly
on ALE. Schmitt et al. [74] propose kickstarting to speed-up actor-critic agents using an interactive
teacher policy by combining on-policy distillation [20, 67] with RL. Empirically, we find that
kickstarting is a strong baseline for PVRL; however, it exhibits unstable behavior without n-step
returns and underperforms QDagger. PVRL also falls under the framework of agents teaching
agents (ATA) [21] with RL-based students and teachers. While ATA approaches, such as action
advice [81], emphasize how and when to query the teacher or evaluating the utility of teacher
advice, PVRL focuses on sample-efficient transfer and does not impose constraints on querying the
teacher. PVRL is also different from prior work on accelerating RL using a heuristic or oracle value
function [9, 19, 78], as PVRL only assumes access to a suboptimal policy. Unlike PVRL methods
that wean off the teacher, imitation-regularized RL methods [51, 61] stay close to the suboptimal
teacher, which can limit the student’s performance with continued training (Figure 9).
- Leveraging prior data. Learning from demonstrations (LfD) [5, 30, 36, 40, 72] approaches
focus on accelerating RL training using demonstrations. Such approaches typically assume access to
optimal or near-optimal trajectories, often obtained from human demonstrators, and aim to match the
demonstrator’s performance. Instead, PVRL focuses on leveraging a suboptimal teacher policy, which
can be obtained from any trained RL agent, that we wean off during training. Empirically, we find
that DQfD [36], a well-known LfD approach to accelerate deep Q-learning, when applied to PVRL,
exhibits severe performance degradation when weaning off the teacher. Rehearsal approaches [62,
66, 76] focus on improving exploration by replaying demonstrations during learning; we find that
such approaches are ineffective for leveraging the teacher in PVRL. Offline RL [1, 49, 52] focuses on
learning solely from fixed datasets while reincarnating RL focuses on leveraging prior information,
which can also be presented as offline datasets, for speeding up further learning from environment
interactions. Recent works [45, 51, 55, 63] use offline RL to pretrain on prior data and then fine-tune
online. We also evaluate this pretraining approach for PVRL and find that it underperforms QDagger,
which utilizes the interactive teacher policy in addition to the prior teacher collected data.
Figure 2: Comparing PVRL algorithms for reincarnating a student DQN agent given a teacher policy (with
normalized score of 1), obtained from a DQN agent trained for 400M frames (Section 4). Baselines include
kickstarting [74], JSRL [83], rehearsal [66], offline pretraining [46] and DQfD [36]. Tabula rasa 3-step DQN
student (−· line) obtains an IQM teacher normalized score around 0.39. Shaded regions show 95% CIs. Left.
Sample efficiency curves based on IQM normalized scores, aggregated across 10 games and 3 runs, over the
course of training. Among all algorithms, only QDagger (Section 4.1) surpasses teacher performance within 10
million frames. Right. Performance profiles [2] showing the distribution of scores across all 30 runs at the end
of training (higher is better). The area under an algorithm’s profile is its mean performance, while the τ value where it
intersects y = 0.5 shows its median performance. QDagger outperforms the teacher in 75% of runs.
4 Policy-to-Value Reincarnating RL (PVRL)
In PVRL, the student agent has access to a suboptimal teacher policy πT that can be queried during training; here, πT is obtained from a DQN agent trained for 400M frames (Figure 2). We also assume access to a dataset DT that can be generated by the teacher (see Appendix A.5 for
results about dependence on DT ). For this work, DT is the final replay buffer (1M transitions) logged
by the teacher DQN, which is 100 times smaller than the data the teacher was trained on. For a
challenging PVRL setting, we use DQN as the student since tabula rasa DQN requires a substantial
amount of training to reach the teacher’s performance. To emphasize sample-efficient reincarnation,
we train this student for only 10 million frames, a 40 times smaller sample budget than the teacher.
Furthermore, we wean off the teacher at 6 million frames. See Appendix A.3 for more details.
Evaluation. Following Agarwal et al. [2], we report interquartile mean (IQM) normalized scores with 95%
confidence intervals (CIs), aggregated across 10 games with 3 seeds each. The normalization is done
such that the random policy obtains a score of 0 and the teacher policy πT obtains a score of 1. This
differs from typically reported human-normalized scores, as we wanted to highlight the performance
differences between the student and the teacher.
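As a concrete illustration of this evaluation protocol, here is a minimal sketch of teacher-normalized scores and a simple interquartile mean; the reference and run scores are made-up numbers, and the IQM below is a simplified stand-in for the stratified-bootstrap procedure of Agarwal et al. [2].

```python
import numpy as np

def teacher_normalized(score, random_score, teacher_score):
    # 0 for the random policy, 1 for the teacher policy pi_T.
    return (score - random_score) / (teacher_score - random_score)

def iqm(scores):
    # Interquartile mean: average of the middle ~50% of runs (no bootstrap CIs here).
    scores = np.sort(np.asarray(scores).ravel())
    lo, hi = int(np.floor(0.25 * len(scores))), int(np.ceil(0.75 * len(scores)))
    return float(np.mean(scores[lo:hi]))

# Hypothetical returns for one game with 3 seeds (the paper aggregates 10 games x 3 seeds).
runs = [teacher_normalized(s, random_score=200.0, teacher_score=8000.0)
        for s in (3500.0, 5200.0, 9100.0)]
print(f"IQM teacher-normalized score: {iqm(runs):.2f}")
```

Next, we describe the approaches we investigate.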
• Rehearsal: Since the student, in principle, can learn using any off-policy data, we can replay
teacher data DT along with the student’s own interactions during training. Following Paine et al.
[66], the student minimizes the TD loss on mini-batches that contain a fraction ρ of the samples from
DT and the rest from the student’s replay DS (results for different ρ and n-step values are in Figure A.12; a minimal mixing sketch follows this list).
• JSRL (Figure 3, left): JSRL [83] uses an interactive teacher policy as a “guide” to improve
exploration and rolls in with the guide for a random number of environment steps. To evaluate
JSRL, we vary the maximum number of roll-in steps, α, that can be taken by the teacher and
sample a random number of roll-in steps between [0, α] every episode. As the student improves,
we decay the steps taken by the teacher every iteration (1M frames) by a factor of β.
• Offline RL Pretraining: Given access to teacher data DT , we can pre-train the student using
offline RL. To do so, we use CQL [46], a widely used offline RL algorithm, which jointly
minimizes the TD and behavior cloning on logged transitions in DT (Equation A.3). Following
pretraining, we fine-tune the learned Q-network using TD loss on the student’s replay DS .
• Kickstarting (Figure 3, right): Akin to kickstarting [74], we jointly optimize the TD loss with an
on-policy distillation loss on the student’s self-collected data in DS . The distillation loss uses the
cross-entropy between teacher’s policy πT and the student policy π(·|s) = softmax(Q(s, ·)/τ ),
where τ corresponds to temperature. To wean off the teacher, we decay the distillation coefficient
as training progresses. Note that kickstarting does not pretrain on teacher data.
• DQfD (Figure 4, left): Following DQfD [36], we initially pretrain the student on teacher data
DT using a combination of TD loss with a large margin classification loss to imitate the teacher
actions (Equation A.4). After pretraining, we train the student on its replay data DS , again
using a combination of TD and margin loss. While DQfD minimizes the margin loss throughout
training, we decay the margin loss coefficient during the online phase, akin to kickstarting.
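As referenced in the rehearsal bullet above, the following is a minimal sketch of how teacher and student replay data can be mixed in each mini-batch; the list-based replay representation, the function name, and the default values are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_rehearsal_batch(teacher_replay, student_replay, batch_size=32, rho=1 / 16):
    """Draw a mini-batch with a fraction rho of teacher transitions, the rest from the student."""
    n_teacher = int(round(rho * batch_size))
    teacher_idx = rng.integers(len(teacher_replay), size=n_teacher)
    student_idx = rng.integers(len(student_replay), size=batch_size - n_teacher)
    # Each replay is assumed to hold (s, a, r, s') transition tuples; the TD loss of
    # Equation 1 is then minimized on the combined batch.
    return [teacher_replay[i] for i in teacher_idx] + [student_replay[i] for i in student_idx]
```

With the best-performing ratio ρ = 1/16 reported below, two of every 32 transitions in a batch come from the teacher's logged data DT.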
Figure 3: Left. JSRL. The plot shows teacher normalized scores with 95% CIs, after training for 10M
frames, aggregated using IQM across 10 Atari games with 3 seeds each. Each point corresponds to a different
experiment, evaluated using 30 seeds, with specific values of JSRL parameters (α, β) and n-step returns. Right.
Kickstarting, with different n-step returns. The plots show IQM scores over the course of training. Kickstarting
exhibits performance degradation, which is severe with 1-step returns, and is unable to surpass the teacher’s performance.
Figure 4: Left. DQfD. Here, m is the margin loss parameter, which is the loss penalty when the student’s
action differs from the teacher’s. Right. QDagger, with different n-step returns. In both, the first vertical line
separates the pretraining phase from the online phase, while the second indicates completely weaning off the teacher.
Results. Rehearsal, with the best-performing teacher data ratio (ρ = 1/16), is marginally better than
tabula rasa DQN but significantly underperforms the teacher (Figure 2, teal), which seems related to
the difficulty standard value-based methods have in learning from off-policy teacher data [65]. JSRL does
not improve performance compared to tabula rasa DQN and even hurts performance with a large
not improve performance compared to tabula rasa DQN and even hurts performance with a large
number of teacher roll-in steps (Figure 3, left). The ineffectiveness of JSRL on ALE is likely due to
the state-distribution mismatch between the student and the teacher, as the student may never visit the
states visited by the teacher and as a result, doesn’t learn to correct for its previous mistakes [16].
Pretraining with offline RL on logged teacher data recovers around 50% of the teacher’s performance
and fine-tuning this pretrained Q-function online marginally improves performance (Figure 2, pink).
However, fine-tuning degrades performance with 1-step returns, which is more pronounced with
higher values of CQL loss coefficient (Figure A.13). We also find that kickstarting exhibits perfor-
mance degradation (Figure 3, right), which is severe with 1-step returns, once we wean off the teacher
policy. Akin to kickstarting, we again observe a severe performance collapse when weaning off the
teacher dependence in DQfD (Figure 4, left), even when using n-step returns. We hypothesize
that this performance degradation is caused by the inconsistency between Q-values trained using a
combination of imitation learning and TD losses, as opposed to only minimizing the TD loss. We
also find that with intermediate values of n-step returns, such as n = 3 (also used by Rainbow [35]),
the student quickly recovers after the performance drop from weaning, while larger n-step values impede learning,
possibly due to stale target Q-values. These results reveal the sensitivity of prior methods in the
PVRL setting to specific hyperparameter choices (n-step), indicating the need for developing stable
PVRL methods that do not fail when weaning off the teacher. For practitioners, the takeaway is to
consider this hyperparameter sensitivity when weaning off the teacher for reincarnation.
4.1 QDagger: A simple PVRL baseline
To address the limitations of prior approaches, we propose QDagger, a simple method for PVRL that
combines Dagger [71], an interactive imitation learning algorithm, with n-step Q-learning (Figure 4,
right). Specifically, we first pre-train the student on teacher data DT by minimizing LQDagger(DT),
which combines a distillation loss with the TD loss, weighted by a constant λ. This pretraining phase
helps the student to mimic the teacher’s state distribution, akin to the behavior cloning phase in Dagger.
After pretraining, we minimize LQDagger (DS ) on the student’s replay DS , akin to kickstarting, where
the teacher “corrects” the mistakes on the states visited by the student. As opposed to minimizing
the Dagger loss indefinitely, QDagger decays the distillation loss coefficient λt (λ0 = λ) as training
progresses, to satisfy the weaning desiderata for PVRL. Weaning allows QDagger to deviate from
the suboptimal teacher policy πT , as opposed to being perpetually constrained to stay close to
πT (Figure 9). We find that both decaying λt linearly over training steps and using an affine function
of the ratio of student and teacher performance worked well (Appendix A.3). Assuming the student
policy π(·|s) = softmax(Q(s, ·)/τ ), the QDagger loss is given by:
LQDagger(D) = LTD(D) − λt E_{s∼D} [ Σ_a πT(a|s) log π(a|s) ]    (2)
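To make Equation 2 concrete, below is a minimal NumPy sketch of the QDagger loss for one mini-batch, with the student policy taken as a temperature softmax over Q-values and the distillation term implemented as the cross-entropy between teacher and student policies; the array shapes, function names, and schedule constants are illustrative assumptions, not the released implementation.

```python
import numpy as np

def softmax(logits, tau=1.0):
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def qdagger_loss(q_values, actions, td_targets, teacher_probs, lam_t, tau=1.0):
    """Equation 2: n-step TD loss plus a decayed cross-entropy distillation term.

    q_values:      [batch, num_actions] student Q-values Q_theta(s, .)
    actions:       [batch] integer actions of the sampled transitions
    td_targets:    [batch] n-step targets computed with the target network
    teacher_probs: [batch, num_actions] teacher policy pi_T(.|s)
    lam_t:         decayed distillation coefficient lambda_t
    """
    idx = np.arange(len(actions))
    td_loss = np.mean((q_values[idx, actions] - td_targets) ** 2)
    log_pi = np.log(softmax(q_values, tau) + 1e-8)                    # log pi(.|s)
    distill_loss = -np.mean(np.sum(teacher_probs * log_pi, axis=-1))  # cross-entropy H(pi_T, pi)
    return td_loss + lam_t * distill_loss

def lambda_schedule(step, weaning_steps, lam_0=1.0):
    # One possible weaning schedule: decay lambda_t linearly to zero.
    return lam_0 * max(0.0, 1.0 - step / weaning_steps)
```

During pretraining this loss is minimized on the teacher data DT; online, it is minimized on the student's replay DS with λt decayed towards zero.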
Figure 2 shows that QDagger outperforms prior methods and surpasses the teacher. We remark that
DQfD can be viewed as a QDagger ablation that uses a margin loss instead of a distillation loss, while
kickstarting as another ablation that does not pretrain on teacher data. Equipped with QDagger, we
show how to incorporate PVRL into our workflow and demonstrate its benefits over tabula rasa RL.
The Balloon Learning Environment (BLE) provides a high-fidelity simulator for navigating stratospheric balloons using RL [11]. An agent in
BLE can choose from three actions to control the balloon: move up, down, or stay in place. The
balloon can only move laterally by “surfing” the winds at its altitude; the winds change over time
and vary as the balloon changes position and altitude. Thus, the agent is interacting with a partially
observable and non-stationary system, rendering this environment quite challenging. For the teacher,
we use the QR-DQN agent provided by BLE, called Perciatelli, trained using large-scale distributed
RL for 40 days on the production-level Loon simulator by Bellemare et al. [11] and further fine-tuned
in BLE. For our experiments, we train distributed RL agents using Acme with 64 actors for a budget
of 50,000 episodes on a single cloud TPU-v2, taking approximately 10-12 hours per run.
In Figure 6, we compare the final performance of distributed agents trained tabula rasa (in pink),
with reincarnation (in blue), and fine-tuned (in yellow). We consider three agents, QR-DQN [23]
with an MLP architecture (same as Perciatelli), IQN [22] with a Densenet architecture [39], and a
recurrent agent, R2D6, for addressing the partial observability in BLE; R2D6 builds on recurrent replay distributed DQN (R2D2) [44], which uses an LSTM-based policy, and incorporates dueling networks [86], distributional RL [12], DenseNet [39], and double Q-learning [84]. When trained tabula rasa, none
of these agents are able to match the teacher performance, with the teacher-lookalike QR-DQN agent
performing particularly poorly. As R2D6 and IQN have substantial architectural differences from the
teacher, we utilize PVRL for transferring the teacher. Reincarnation allows IQN to match and R2D6
to surpass the teacher, although both lag behind fine-tuning the teacher. More details are in Appendix A.4.2.
When fine-tuning, we are reloading the weights from Perciatelli, which was notably trained on a
broader geographical region than BLE and whose training distribution can be considered a superset of
what is used by the other agents; this is likely the reason that fine-tuning does remarkably well relative
to other agents in BLE. Efficiently transferring information in Perciatelli’s weights to another agent
without the replay data from the Loon simulator presents an interesting challenge for future work.
Overall, the improved efficiency of reincarnating RL (fine-tuning and PVRL) over tabula rasa RL, as
evident on the BLE, could make deep RL more accessible to researchers without access to industrial-
scale resources as they can build upon prior computational work, such as model checkpoints, enabling
the possible reuse of months of prior computation (e.g., Perciatelli).
6 Considerations in Reincarnating RL
Figure 7: Reincarnation via fine-tuning with the same and reduced lr, relative to the original agent.
Figure 8: Contrasting benchmarking results under tabula rasa and PVRL settings.

Reincarnation via fine-tuning. Given access to the model weights and replay of a value-based agent, a simple reincarnation strategy is to fine-tune this agent. While naive fine-tuning with the same learning rate (lr) as the nearly saturated original agent does not exhibit improvement, fine-tuning with a reduced lr, for only 1 million additional frames, results in a 25% IQM improvement for DQN (Adam) and a 50% IQM improvement for Nature DQN trained with RMSProp (Figure 7). As reincarnating RL leverages existing computational work (e.g., model checkpoints), it allows us to easily experiment with such hyperparameter schedules, which can be expensive in the tabula rasa setting. Note that when fine-tuning, one is forced to keep the same network architecture; in contrast, reincarnating RL grants flexibility in architecture and algorithmic choices, which can surpass fine-tuning performance (Figures 1 and 5).
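As an illustration of this fine-tuning strategy, here is a minimal PyTorch sketch of reloading an existing Q-network's weights and continuing training with a reduced learning rate; the network architecture, in-memory checkpoint, and learning-rate value are hypothetical stand-ins rather than the agents or training setup used in the paper.

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Hypothetical Q-network standing in for the saturated agent's architecture."""
    def __init__(self, obs_dim=4, num_actions=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(),
                                 nn.Linear(512, num_actions))

    def forward(self, obs):
        return self.net(obs)

# Stand-in for the nearly saturated agent's final checkpoint.
saturated_agent = QNetwork()
checkpoint = copy.deepcopy(saturated_agent.state_dict())

# Reincarnation via fine-tuning: same architecture, reloaded weights, reduced learning rate.
student = QNetwork()
student.load_state_dict(checkpoint)
optimizer = torch.optim.Adam(student.parameters(), lr=6.25e-6)  # e.g., an order of magnitude below the original lr
# Training then continues by minimizing the TD loss (Equation 1) on the reloaded replay buffer.
```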
Difference with tabula rasa benchmarking. Are student agents that are more data-efficient when
trained from scratch also better for reincarnating RL? In Figure 8, we answer this question in the
negative, indicating the possibility of developing better students for utilizing existing knowledge.
Specifically, we compare Dopamine Rainbow [35] and DrQ [90], under tabula rasa and PVRL settings.
DrQ outperforms Rainbow in the low-data regime when trained from scratch but underperforms Rainbow
in the PVRL setting as well as when training longer from scratch. Based on this, we speculate that
reincarnating RL comparisons might be more consistent with asymptotic tabula rasa comparisons.
Reincarnation vs. Distillation. PVRL is different from imitation learning or imitation-regularized
RL as it focuses on using an existing policy only as a launchpad for further learning, as opposed
to imitating or staying close to it. To contrast these settings, we run two ablations of QDagger for
reincarnating Impala-CNN Rainbow given a DQN teacher policy: (1) Dagger [71], which only mini-
mizes the on-policy distillation loss in QDagger, and (2) Dagger + QL, which uses a fixed distillation
loss coefficient throughout training (as opposed to QDagger, which decays it; see Equation 2). As
shown in Figure 9, Dagger performs similarly to the teacher while Dagger + QL improves over the
teacher but quickly saturates in performance. On the contrary, QDagger substantially outperforms
these ablations and shows continual improvement with additional environment interactions.
Dependency on prior work. While performance in reincarnating RL depends on prior computational
work (e.g., teacher policy in PVRL), this is analogous to how fine-tuning results in NLP / computer
vision depend on the pretrained models (e.g., using BERT vs GPT-3). To investigate teacher de-
pendence in PVRL, we reincarnate a fixed student from three different DQN teachers (Figure 10).
As expected, we observe that a higher performing teacher results in a better performing student.
However, reincarnation from two policies with similar performance but obtained from different
agents, DQN (Adam) vs. a fine-tuned Nature DQN, results in different performance. This suggests
that a reincarnated student’s performance depends not only on the teacher’s performance but also
on its behavior. Nevertheless, the ranking of PVRL algorithms remains consistent across these two
teacher policies (Figure A.11). See Section 7 for a broader discussion about generalizability.
Figure 9: Reincarnation vs. Distillation. Reincarnating Impala-CNN Rainbow from a DQN (Adam) trained for 400M frames, using QDagger, and comparing it to Dagger (imitation) and Dagger + Q-learning (imitation-regularized RL).
Figure 10: Reincarnation from different teachers, namely, a DQN (Adam) policy trained for 20M and 400M frames and the fine-tuned Nature DQN in Figure 1 that achieves similar performance to DQN (Adam) trained for 400M frames.
8 Conclusion
Our work shows that reincarnating RL is a much more computationally efficient research workflow than
tabula rasa RL and can help further democratize research. Nevertheless, our results also open several
avenues for future work. In particular, more research is needed to develop better PVRL methods,
extend PVRL to learn from multiple suboptimal teachers [48, 53], and enable workflows that
can incorporate knowledge provided in a form other than a policy, such as pretrained models [41, 79],
representations [88], skills [50, 58, 68], or LLMs [3]. Furthermore, we believe that reincarnating
RL would be crucial for building embodied agents in open-ended domains [7, 27, 32]. Aligned
with this work, there have been calls for collaboratively building and continually improving large
pre-trained models in NLP and vision [70]. We hope that this work motivates RL researchers to
release computational work (e.g., model checkpoints), which would allow others to directly build on
their work. In this regard, we have open-sourced our code and trained agents with their final replay.
Furthermore, re-purposing existing benchmarks, akin to how we use ALE in this work, can serve
as testbeds for reincarnating RL. As Newton put it, “If I have seen further it is by standing on the
shoulders of giants”; we argue that reincarnating RL can substantially accelerate progress by building
on prior computational work, as opposed to always redoing this work from scratch.
Societal Impacts
Reincarnating RL could positively impact society by reducing the computational burden on researchers and is more environmentally friendly than tabula rasa RL. For example, reincarnating RL allows researchers to train super-human Atari agents on a single GPU within a span of a few hours as opposed to training for a few days. Additionally, reincarnating RL is more accessible to the wider research community, as researchers without sufficient compute resources can build on prior computational work from resource-rich groups, and even improve upon it using limited resources.
Furthermore, this democratization could directly improve RL applicability for practical applications,
as most businesses that could benefit from RL often cannot afford the expertise to design in-house
solutions. However, this democratization could also make it easier to apply RL for potentially harmful
applications. Furthermore, reincarnating RL could carry forward the bias or undesirable traits from
the previously learned systems. As such, we urge practitioners to be mindful of how RL fits into the
wider socio-technical context of its deployment.
Acknowledgments
We would like to thank David Ha, Evgenii Nikishin, Karol Hausman, Bobak Shahriari, Richard Song,
Alex Irpan, Andrey Kurenkov for their valuable feedback on this work. We thank Joshua Greaves for
helping us set up RL agents for BLE. We also acknowledge Ted Xiao, Dale Schuurmans, Aleksandra
Faust, George Tucker, Rebecca Roelofs, Eugene Brevdo, Pierluca D’Oro, Nathan Rahn, Adrien Ali
Taiga, Bogdan Mazoure, Jacob Buckman, Georg Ostrovski and Aviral Kumar for useful discussions.
Author Contributions
Rishabh Agarwal led the project from start-to-finish, defined the scope of the work to focus on
policy to value reincarnation, came up with a successful algorithm for PVRL, and performed the
literature survey. He designed, implemented and ran most of the experiments on ALE, Humanoid-run
and BLE, and wrote the paper.
Max Schwarzer helped run DQfD experiments on ALE and as well as setting up some agents for
the BLE codebase with Acme, was involved in project discussions and edited the paper. Work done
as a student researcher at Google.
Pablo Samuel Castro was involved in project discussions, helped in setting up the BLE environment
and implemented the initial Acme agents, and helped with paper editing.
Aaron Courville advised the project, helped with project direction and provided feedback on writing.
Marc Bellemare advised the project, challenged Rishabh to come up with an experimental paradigm
in which one continuously improves on an existing agent, and provided feedback on writing.
References
[1] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline
reinforcement learning. In International Conference on Machine Learning, pages 104–114. PMLR, 2020.
[2] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep
reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing
Systems, 34, 2021.
[3] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn,
Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say: Grounding
language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
[4] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex
Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand.
arXiv preprint arXiv:1910.07113, 2019.
[5] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from
demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
[6] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain,
Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with
reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
[7] Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton,
Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online
videos. arXiv preprint arXiv:2206.11795, 2022.
[8] Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair
Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients.
arXiv preprint arXiv:1804.08617, 2018.
[9] Wissam Bejjani, Rafael Papallas, Matteo Leonetti, and Mehmet R Dogar. Planning with a receding
horizon for manipulation in clutter using a learned value function. In 2018 IEEE-RAS 18th International
Conference on Humanoid Robots (Humanoids), pages 1–9. IEEE, 2018.
[10] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment:
An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[11] Marc G Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C Machado, Subhodeep
Moitra, Sameera S Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using
reinforcement learning. Nature, 588(7836):77–82, 2020.
[12] Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. MIT Press,
2022. http://www.distributional-rl.org.
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison,
David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement
learning. arXiv preprint arXiv:1912.06680, 2019.
[14] Reinaldo AC Bianchi, Carlos HC Ribeiro, and Anna HR Costa. Heuristically accelerated q–learning: a
new approach to speed up reinforcement learning. In Brazilian Symposium on Artificial Intelligence, pages
245–254. Springer, 2004.
[15] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare.
Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110,
2018.
[16] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning
to search better than your teacher. In International Conference on Machine Learning, pages 2058–2066.
PMLR, 2015.
[17] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer.
arXiv preprint arXiv:1511.05641, 2015.
[18] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-
supervised models are strong semi-supervised learners. Advances in neural information processing systems,
33:22243–22255, 2020.
[19] Ching-An Cheng, Andrey Kolobov, and Adith Swaminathan. Heuristic-guided reinforcement learning.
Advances in Neural Information Processing Systems, 34, 2021.
[20] Wojciech M Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant Jayakumar, Grzegorz Swirszcz,
and Max Jaderberg. Distilling policy distillation. In The 22nd International Conference on Artificial
Intelligence and Statistics, pages 1331–1340. PMLR, 2019.
[21] Felipe Leno Da Silva, Garrett Warnell, Anna Helena Reali Costa, and Peter Stone. Agents teaching agents:
a survey on inter-agent transfer learning. Autonomous Agents and Multi-Agent Systems, 34(1):1–17, 2020.
[22] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional
reinforcement learning. In International conference on machine learning, pages 1096–1105. PMLR, 2018.
[23] Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning
with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[24] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese,
Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak
plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 2022.
[25] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[26] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad
Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted
actor-learner architectures. In International Conference on Machine Learning, pages 1407–1416. PMLR,
2018.
[27] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang,
De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents
with internet-scale knowledge. arXiv preprint arXiv:2206.08853, 2022.
[28] Fernando Fernández and Manuela Veloso. Probabilistic policy reuse in a reinforcement learning agent.
In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems,
pages 720–727, 2006.
[29] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic
methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018.
[30] Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement learning from
imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018.
[31] Florin Gogianu, Tudor Berariu, Lucian Buşoniu, and Elena Burceanu. Atari agents, 2022. URL https://github.com/floringogianu/atari-agents.
[32] Djordje Grbic, Rasmus Berg Palm, Elias Najarro, Claire Glanois, and Sebastian Risi. Evocraft: A
new challenge for open-endedness. In International Conference on the Applications of Evolutionary
Computation (Part of EvoStar), pages 325–340. Springer, 2021.
[33] Joshua Greaves, Salvatore Candido, Vincent Dumoulin, Ross Goroshin, Sameera S. Ponda, Marc G.
Bellemare, and Pablo Samuel Castro. Balloon Learning Environment, 12 2021. URL https://github.com/google/balloon-learning-environment.
[34] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages 2961–2969, 2017.
[35] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan
Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep
reinforcement learning. In Thirty-second AAAI conference on artificial intelligence, 2018.
[36] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John
Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. In Proceedings of the
AAAI Conference on Artificial Intelligence, 2018.
[37] Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman,
Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, et al. Acme: A research framework for
distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020.
[38] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv
preprint arXiv:1801.06146, 2018.
[39] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 4700–4708, 2017.
[40] Peter C Humphreys, David Raposo, Toby Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal,
Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, et al. A data-driven approach for learning to
control computers. arXiv preprint arXiv:2202.08137, 2022.
[41] Andrew Hundt, Aditya Murali, Priyanka Hubli, Ran Liu, Nakul Gopalan, Matthew Gombolay, and
Gregory D. Hager. “good robot! now watch this!”: Repurposing reinforcement learning for task-to-task
transfer. In 5th Annual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=Pxs5XwId51n.
[42] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi,
Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural
networks. arXiv preprint arXiv:1711.09846, 2017.
[43] Ryan Julian, Benjamin Swanson, Gaurav S Sukhatme, Sergey Levine, Chelsea Finn, and Karol Hausman.
Never stop learning: The effectiveness of fine-tuning in robotic reinforcement learning. arXiv preprint
arXiv:2004.10190, 2020.
[44] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience
replay in distributed reinforcement learning. In International conference on learning representations, 2018.
[45] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning.
arXiv preprint arXiv:2110.06169, 2021.
[46] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline
reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
[47] Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits
data-efficient deep reinforcement learning. In International Conference on Learning Representations,
2021.
[48] Andrey Kurenkov, Ajay Mandlekar, Roberto Martin-Martin, Silvio Savarese, and Animesh Garg. Ac-teach:
A bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. arXiv preprint
arXiv:1909.04121, 2019.
[49] Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement
learning, pages 45–73. Springer, 2012.
[50] Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Cic:
Contrastive intrinsic control for unsupervised skill discovery. arXiv preprint arXiv:2202.00161, 2022.
[51] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforce-
ment learning via balanced replay and pessimistic q-ensemble. In Conference on Robot Learning, pages
1702–1712. PMLR, 2022.
[52] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial,
review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
[53] Siyuan Li, Fangda Gu, Guangxiang Zhu, and Chongjie Zhang. Context-aware policy reuse. arXiv preprint
arXiv:1806.03793, 2018.
[54] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David
Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint
arXiv:1509.02971, 2015.
[55] Yao Lu, Karol Hausman, Yevgen Chebotar, Mengyuan Yan, Eric Jang, Alexander Herzog, Ted Xiao,
Alex Irpan, Mohi Khansari, Dmitry Kalashnikov, et al. Aw-opt: Learning robotic skills with imitation
and reinforcement at scale. In Conference on Robot Learning, pages 1078–1088. PMLR, 2022.
[56] Clare Lyle, Mark Rowland, and Will Dabney. Understanding and preventing capacity loss in reinforcement
learning. arXiv preprint arXiv:2204.09560, 2022.
[57] Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael
Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general
agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
[58] Michael Matthews, Mikayel Samvelyan, Jack Parker-Holder, Edward Grefenstette, and Tim Rocktäschel.
Hierarchical kickstarting for skill transfer in reinforcement learning. arXiv preprint arXiv:2207.11584,
2022.
[59] Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang,
Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nazi, et al. A graph placement methodology for fast
chip design. Nature, 594(7862):207–212, 2021.
[60] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through
deep reinforcement learning. nature, 518(7540):529–533, 2015.
[61] Ted Moskovitz, Michael Arbel, Jack Parker-Holder, and Aldo Pacchiano. Towards an understanding of
default policies in multitask policy optimization. In International Conference on Artificial Intelligence and
Statistics, pages 10661–10686. PMLR, 2022.
[62] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming
exploration in reinforcement learning with demonstrations. In 2018 IEEE international conference on
robotics and automation (ICRA), pages 6292–6299. IEEE, 2018.
[63] Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement
learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
[64] Johan S Obando-Ceron and Pablo Samuel Castro. Revisiting rainbow: Promoting more insightful and
inclusive deep reinforcement learning research. In International Conference on Machine Learning (ICML),
2021.
[65] Georg Ostrovski, Pablo Samuel Castro, and Will Dabney. The difficulty of passive learning in deep
reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.
[66] Tom Le Paine, Caglar Gulcehre, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tan-
burn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams, et al. Making efficient use of demonstrations
to solve hard exploration problems. arXiv preprint arXiv:1909.01387, 2019.
[67] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer
reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
[68] Karl Pertsch, Youngwoon Lee, and Joseph J Lim. Accelerating reinforcement learning with learned skill
priors. arXiv preprint arXiv:2010.11944, 2020.
[69] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley
& Sons, Inc., 1994.
[70] Colin Raffel. A call to build models like we build open-source software. https://colinraffel.com/blog/a-call-to-build-models-like-we-build-open-source-software.html, 2021.
[71] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured
prediction to no-regret online learning. In Proceedings of the fourteenth international conference on
artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
[72] Stefan Schaal. Learning from demonstration. Advances in neural information processing systems, 9, 1996.
[73] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv
preprint arXiv:1511.05952, 2015.
[74] Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M Czarnecki,
Joel Z Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, et al. Kickstarting deep reinforcement
learning. arXiv preprint arXiv:1803.03835, 2018.
[75] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian
Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go
with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
[76] Alexey Skrynnik, Aleksey Staroverov, Ermek Aitygulov, Kirill Aksenov, Vasilii Davydov, and Aleksandr I
Panov. Forgetful experience replay in hierarchical reinforcement learning from expert demonstrations.
Knowledge-Based Systems, 218:106844, 2021.
[77] William D Smart and L Pack Kaelbling. Effective reinforcement learning for mobile robots. In Proceedings
2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), volume 4, pages
3404–3410. IEEE, 2002.
[78] Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement
learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.
[79] Yanchao Sun, Ruijie Zheng, Xiyao Wang, Andrew E Cohen, and Furong Huang. Transfer RL across
observation feature spaces via model-based regularization. In International Conference on Learning
Representations, 2022. URL https://openreview.net/forum?id=7KdAoOsI81C.
[80] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden,
Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint
arXiv:1801.00690, 2018.
[81] Lisa Torrey and Matthew Taylor. Teaching on a budget: Agents advising agents in reinforcement learning.
In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pages
1053–1060, 2013.
[82] Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin, Krzysztof Choromanski, and David Li. Mlgo: a
machine learning guided compiler optimizations framework. arXiv preprint arXiv:2101.04808, 2021.
[83] Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew
Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, et al. Jump-start reinforcement learning. arXiv preprint
arXiv:2204.02372, 2022.
[84] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In
Proceedings of the AAAI conference on artificial intelligence, 2016.
[85] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung
Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft
ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[86] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network
architectures for deep reinforcement learning. In International conference on machine learning, pages
1995–2003. PMLR, 2016.
[87] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3):279–292, 1992.
[88] Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor
control. arXiv preprint arXiv:2203.06173, 2022.
[89] Linhai Xie, Sen Wang, Stefano Rosa, Andrew Markham, and Niki Trigoni. Learning with training wheels:
speeding up training with a simple controller for deep reinforcement learning. In 2018 IEEE International
Conference on Robotics and Automation (ICRA), pages 6276–6283. IEEE, 2018.
[90] Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image augmentation is all you need: Regularizing deep
reinforcement learning from pixels. In International Conference on Learning Representations, 2020.
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] Section 5 (BLE and continuous
control results) and reproducibility and evaluation concerns in Sec 7
(c) Did you discuss any potential negative societal impacts of your work? [Yes]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main
experimental results (either in the supplemental material or as a URL)? [Yes]
agarwl.github.io/reincarnating_rl.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] See Appendix A.3 and A.4
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [Yes] 95% CIs
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix A.2
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes] Apache License, Version 2.0
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]