
Jump-Start Reinforcement Learning

Ikechukwu Uchendu * 1 Ted Xiao 1 Yao Lu 1 Banghua Zhu 2 Mengyuan Yan 3 Joséphine Simon 3
Matthew Bennice 3 Chuyuan Fu 3 Cong Ma 4 Jiantao Jiao 2 Sergey Levine 1 2 Karol Hausman 1 5

* Work done as part of the Google AI Residency. 1 Google, Mountain View, California. 2 University of California, Berkeley, Berkeley, California. 3 Everyday Robots, Mountain View, California, United States. 4 Department of Statistics, University of Chicago. 5 Stanford University, Stanford, California. Correspondence to: Ikechukwu Uchendu <[email protected]>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

arXiv:2204.02372v2 [cs.LG] 7 Jul 2023

Abstract

Reinforcement learning (RL) provides a theoretical framework for continuously improving an agent's behavior via trial and error. However, efficiently learning policies from scratch can be very difficult, particularly for tasks that present exploration challenges. In such settings, it might be desirable to initialize RL with an existing policy, offline data, or demonstrations. However, naively performing such initialization in RL often works poorly, especially for value-based methods. In this paper, we present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy, and is compatible with any RL approach. In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks: a guide-policy, and an exploration-policy. By using the guide-policy to form a curriculum of starting states for the exploration-policy, we are able to efficiently improve performance on a set of simulated robotic tasks. We show via experiments that it is able to significantly outperform existing imitation and reinforcement learning algorithms, particularly in the small-data regime. In addition, we provide an upper bound on the sample complexity of JSRL and show that with the help of a guide-policy, one can improve the sample complexity for non-optimism exploration methods from exponential in horizon to polynomial.

1. Introduction

A promising aspect of reinforcement learning (RL) is the ability of a policy to iteratively improve via trial and error. Often, however, the most difficult part of this process is the very beginning, where a policy that is learning without any prior data needs to randomly encounter rewards to further improve. A common way to side-step this exploration issue is to aid the policy with prior knowledge. One source of prior knowledge might come in the form of a prior policy, which can provide some initial guidance in collecting data with non-zero rewards, but which is not by itself fully optimal. Such policies could be obtained from demonstration data (e.g., via behavioral cloning), from sub-optimal prior data (e.g., via offline RL), or even simply via manual engineering. In the case where this prior policy is itself parameterized as a function approximator, it could serve to simply initialize a policy gradient method. However, sample-efficient algorithms based on value functions are notoriously difficult to bootstrap in this way. As observed in prior work (Peng et al., 2019; Nair et al., 2020; Kostrikov et al., 2021; Lu et al., 2021), value functions require both good and bad data to initialize successfully, and the mere availability of a starting policy does not by itself readily provide an initial value function of comparable performance. This leads to the question we pose in this work: how can we bootstrap a value-based RL algorithm with a prior policy that attains reasonable but sub-optimal performance?

The main insight that we leverage to address this problem is that we can bootstrap an RL algorithm by gradually "rolling in" with the prior policy, which we refer to as the guide-policy. In particular, the guide-policy provides a curriculum of starting states for the RL exploration-policy, which significantly simplifies the exploration problem and allows for fast learning. As the exploration-policy improves, the effect of the guide-policy is diminished, leading to an RL-only policy that is capable of further autonomous improvement.

Our approach is generic, as it can be applied to downstream RL methods that require the RL policy to explore the environment, though we focus on value-based methods in this work.


Figure 1. We study how to efficiently bootstrap value-based RL algorithms given access to a prior policy. In vanilla RL (left), the agent
explores randomly from the initial state until it encounters a reward (gold star). JSRL (right) leverages a guide-policy (dashed blue line)
that takes the agent closer to the reward. After the guide-policy finishes, the exploration-policy (solid orange line) continues acting in the
environment. As the exploration-policy improves, the influence of the guide-policy diminishes, resulting in a learning curriculum for
bootstrapping RL.

The only requirements of our method are that the guide-policy can select actions based on observations of the environment, and that its performance is reasonable (i.e., better than a random policy). Since the guide-policy significantly speeds up the early phases of RL, we call this approach Jump-Start Reinforcement Learning (JSRL). We provide an overview diagram of JSRL in Fig. 1.

JSRL can utilize any form of prior policy to accelerate RL, and it can be combined with existing offline and/or online RL methods. In addition, we provide a theoretical justification of JSRL by deriving an upper bound on its sample complexity compared to classic RL alternatives. Finally, we demonstrate that JSRL significantly outperforms previously proposed imitation and reinforcement learning approaches on a set of benchmark tasks as well as more challenging vision-based robotic problems.

0 A project webpage is available at https://jumpstartrl.github.io

2. Related Work

Imitation learning combined with reinforcement learning (IL+RL). Several previous works on leveraging a prior policy to initialize RL focus on doing so by combining imitation learning and RL. Some methods treat RL as a sequence modelling problem and train an autoregressive model using offline data (Zheng et al., 2022; Janner et al., 2021; Chen et al., 2021). One well-studied class of approaches initializes policy search methods with policies trained via behavioral cloning (Schaal et al., 1997; Kober et al., 2010; Rajeswaran et al., 2017). This is an effective strategy for initializing policy search methods, but is generally ineffective with actor-critic or value-based methods, where the critic also needs to be initialized (Nair et al., 2020), as we also illustrate in Section 3. Methods have been proposed to include prior data in the replay buffer for a value-based approach (Nair et al., 2018; Vecerik et al., 2018), but this requires prior data rather than just a prior policy. More recent approaches improve this strategy by using offline RL (Kumar et al., 2020; Nair et al., 2020; Lu et al., 2021) to pre-train on prior data and then finetune. We compare to such methods, showing that our approach not only makes weaker assumptions (requiring only a policy rather than a dataset), but also performs comparably or better.

Curriculum learning and exact state resets for RL. Many prior works have investigated efficient exploration strategies in RL that are based on starting exploration from specific states. Commonly, these works assume the ability to reset to arbitrary states in simulation (Salimans & Chen, 2018). Some methods uniformly sample states from demonstrations as start states (Hosu & Rebedea, 2016; Peng et al., 2018; Nair et al., 2018), while others generate curricula of start states. The latter includes methods that start at the goal state and iteratively expand the start state distribution, assuming reversible dynamics (Florensa et al., 2017; McAleer et al., 2019) or access to an approximate dynamics model (Ivanovic et al., 2019). Other approaches generate the curriculum from demonstration states (Resnick et al., 2018) or from online exploration (Ecoffet et al., 2021). In contrast, our method does not control the exact starting state distribution, but instead utilizes the implicit distribution naturally arising from rolling out the guide-policy. This broadens the distribution of start states compared to exact resets along a narrow set of demonstrations, making the learning process more robust. In addition, our approach could be extended to the real world, where resetting to a state in the environment is impossible.


Provably efficient exploration techniques. Online exploration in RL has been well studied in theory (Osband & Van Roy, 2014; Jin et al., 2018; Zhang et al., 2020b; Xie et al., 2021; Zanette et al., 2020; Jin et al., 2020). The proposed methods either rely on the estimation of confidence intervals (e.g. UCB, Thompson sampling), which is hard to approximate and implement when combined with neural networks, or suffer from exponential sample complexity in the worst case. In this paper, we leverage a pre-trained guide-policy to design an algorithm that is more sample-efficient than these approaches while being easy to implement in practice.

"Rolling in" policies. Using a pre-existing policy (or policies) to initialize RL and improve exploration has been studied in past literature. Some works use an ensemble of roll-in policies or value functions to refine exploration (Jiang et al., 2017; Agarwal et al., 2020). With a policy that models the environment's dynamics, it is possible to look ahead to guide the training policy towards useful actions (Lin, 1992). Similar to our work, an approach from (Smart & Pack Kaelbling, 2002) rolls out a fixed controller to provide bootstrap data for a policy's value function. However, this method does not mix the prior policy and the learned policy, but only uses the prior policy for data collection. We use a multi-stage curriculum to gradually reduce the contribution of the prior policy during training, which allows for on-policy experience for the learned policy. Our method is also conceptually related to DAgger (Ross & Bagnell, 2010), which also bridges distributional shift by rolling in with one policy and then obtaining labels from a human expert, but DAgger is intended for imitation learning and rolls in the learned policy, while our method addresses RL and rolls in with a sub-optimal guide-policy.

3. Preliminaries

We define a Markov decision process M = (S, A, P, R, p0, γ, H), where S and A are state and action spaces, P : S × A × S → R+ is a state-transition probability function, R : S × A → R is a reward function, p0 : S → R+ is an initial state distribution, γ is a discount factor, and H is the task horizon. Our goal is to effectively utilize a prior policy of any form in value-based reinforcement learning (RL). The goal of RL is to find a policy π(a|s) that maximizes the expected discounted reward over trajectories τ induced by the policy, E_π[R(τ)], where s0 ∼ p0, s_{t+1} ∼ P(·|s_t, a_t), and a_t ∼ π(·|s_t). To solve this maximization problem, value-based RL methods take advantage of state or state-action value functions (Q-functions) Q^π(s, a), which can be learned using approximate dynamic programming approaches. The Q-function Q^π(s, a) represents the discounted return when starting from state s and action a and thereafter following the actions produced by the policy π.
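Spelled out explicitly, the objective and the Q-function referenced above can be written as follows; this is a standard expansion of the definitions in this section rather than an equation reproduced from the paper:

```latex
% Objective: expected discounted return of a policy \pi over horizon H
\mathbb{E}_{\pi}\left[R(\tau)\right]
  = \mathbb{E}\!\left[\sum_{t=0}^{H-1}\gamma^{t}\, R(s_t, a_t)\right],
\qquad s_0 \sim p_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t),

% Q-function: discounted return after taking action a in state s and then following \pi
Q^{\pi}(s, a)
  = \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{H-1}\gamma^{\,t'-t}\, R(s_{t'}, a_{t'}) \,\middle|\, s_t = s,\; a_t = a\right].
```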
Figure 2. Naïve policy initialization. We pre-train a policy to medium performance (depicted by negative steps), then use this policy to initialize actor-critic fine-tuning (starting from step 0), while initializing the critic randomly. Actor performance decays, as the untrained critic provides a poor learning signal, causing the good initial policy to be forgotten. In Figures 7 and 8, we repeat this experiment but allow the randomly initialized critic to "warm up" before fine-tuning.

In order to leverage prior data in value-based RL and continue fine-tuning, researchers commonly use various offline RL methods (Kostrikov et al., 2021; Kumar et al., 2020; Nair et al., 2020; Lu et al., 2021) that often rely on pre-trained, regularized Q-functions that can be further improved using online data. In the case where a pre-trained Q-function is not available and we only have access to a prior policy, value-based RL methods struggle to effectively incorporate that information, as depicted in Fig. 2. In this experiment, we train an actor-critic method up to step 0, then we start from a fresh Q-function and continue with the pre-trained actor, simulating the case where we only have access to a prior policy. This is the setting that we are concerned with in this work.

4. Jump-Start Reinforcement Learning

In this section, we describe our method, Jump-Start Reinforcement Learning (JSRL), which we use to initialize value-based RL algorithms with a prior policy of any form. We first describe the intuition behind our method and then lay out a detailed algorithm along with a theoretical analysis.

4.1. Rolling In With Two Policies

We assume access to a fixed prior policy that we refer to as the "guide-policy", π^g(a|s), which we leverage to initialize RL algorithms. It is important to note that we do not assume any particular form of π^g; it could be learned with imitation learning, with RL, or it could be manually scripted.

We will refer to the RL policy that is being learned via trial and error as the "exploration-policy" π^e(a|s), since, as is commonly done in the RL literature, this is the policy that is used for exploration as well as online improvement.


The only requirement for π^e is that it is an RL policy that can adapt with online experience. Our approach and set of assumptions are generic in that they can handle any downstream RL method, though we focus on the case where π^e is learned via a value-based RL algorithm.

The main idea behind our method is to leverage the two policies, π^g and π^e, executed sequentially, to learn tasks more efficiently. During the initial phases of training, π^g is significantly better than the untrained policy π^e, so we would like to collect data using π^g. However, this data is out of distribution for π^e, since exploring with π^e will visit different states. Therefore, we would like to gradually transition data collection away from π^g and toward π^e. Intuitively, we would like to use π^g to get the agent into "good" states, and then let π^e take over and explore from those states. As π^e gets better and better, it should take over earlier and earlier, until all data is being collected by π^e and there is no more distributional shift. We can employ different strategies for switching from π^g to π^e, but the most direct curriculum simply switches from π^g to π^e at some time step h, where h is initialized to the full task horizon and gradually decreases over the course of training. This naturally provides a curriculum for π^e. At each curriculum stage, π^e only needs to master the small part of the state space that is required to reach the states covered by the previous curriculum stage.

4.2. Algorithm

We provide a detailed description of JSRL in Algorithm 1. Given an RL task with horizon H, we first choose a sequence of initial guide-steps (H1, H2, ..., Hn) to which we roll out our guide-policy, where Hi ∈ [H] denotes the number of steps that the guide-policy acts for at the ith iteration. Let h denote the iterator over this sequence of initial guide-steps. At the beginning of each training episode, we roll out π^g for h steps, then π^e continues acting in the environment for the additional H − h steps until the task horizon H is reached. We can write the combination of the two policies as the combined switching policy π, where π_{1:h} = π^g and π_{h+1:H} = π^e. After we roll out π to collect online data, we use the new data to update our exploration-policy π^e and combined policy π by calling a standard training procedure TrainPolicy. For example, the training procedure may update the exploration-policy via a Deep Q-Network (Mnih et al., 2013) with ε-greedy as the exploration technique. The new combined policy is then evaluated over the course of training using a standard evaluation procedure EvaluatePolicy(π). Once the performance of the combined policy π reaches a threshold β, we continue the for loop with the next guide-step h.

While any guide-step sequence could be used with JSRL, in this paper we focus on two specific strategies for determining guide-step sequences: via a curriculum and via random switching. With the curriculum strategy, we start with a large guide-step (i.e., H1 = H) and use policy evaluations of the combined policy π to progressively decrease the guide-step as π^e improves. Intuitively, this means that we train our policy in a backward manner by first rolling out π^g to the last guide-step and then exploring with π^e, then rolling out π^g to the second-to-last guide-step and exploring with π^e, and so on. With the random-switching strategy, we sample each h uniformly and independently from the set {H1, H2, ..., Hn}. In the rest of the paper, we refer to the curriculum variant as JSRL and to the random-switching variant as JSRL-Random.
curriculum stage.
9: end if
10: end for
4.3. Theoretical Analysis

In this section, we provide a theoretical analysis of JSRL, showing that the roll-in data collection strategy that we propose provably attains polynomial sample complexity. The sample complexity refers to the number of samples required by the algorithm to learn a policy with small suboptimality, where we define the suboptimality of a policy π as E_{s∼p0}[V^*(s) − V^π(s)].

In particular, we aim to answer two questions: Why is JSRL better than other exploration algorithms which start exploration from scratch? Under which conditions does the guide-policy provably improve exploration? To answer these two questions, we study upper and lower bounds for the sample complexity of the exploration algorithms. We first provide a lower bound showing that simple non-optimism-based exploration algorithms like ε-greedy suffer from a sample complexity that is exponential in the horizon. Then we show that with the help of a guide-policy with good coverage of important states, the JSRL algorithm with ε-greedy as the exploration strategy can achieve polynomial sample complexity.

We focus on comparing JSRL with standard non-optimism-based exploration methods, e.g. ε-greedy (Langford & Zhang, 2007) and FALCON+ (Simchi-Levi & Xu, 2020).


Although optimism-based RL algorithms like UCB (Jin et al., 2018) and Thompson sampling (Ouyang et al., 2017) turn out to be efficient strategies for exploration from scratch, they all require uncertainty quantification, which can be hard for vision-based RL tasks with neural network parameterization. Note that the cross-entropy method used in the vision-based RL framework QT-Opt (Kalashnikov et al., 2018) is also a non-optimism-based method. In particular, it can be viewed as a variant of the ε-greedy algorithm in continuous action spaces, with the Gaussian distribution as the exploration distribution.

We first show that, without the help of a guide-policy, non-optimism-based methods usually suffer from a sample complexity that is exponential in the horizon for episodic MDPs. We adapt the combination lock example in (Koenig & Simmons, 1993) to show the hardness of exploration from scratch for non-optimism-based methods.

Theorem 4.1 ((Koenig & Simmons, 1993)). For 0-initialized ε-greedy, there exists an MDP instance such that one has to suffer from a sample complexity that is exponential in the total horizon H in order to find a policy that has suboptimality smaller than 0.5.

We include the construction of the combination lock MDP and the proof in Appendix A.5.2 for completeness. This lower bound also applies to any other non-optimism-based exploration algorithm which explores uniformly when the estimated Q-values for all actions are 0. As a concrete example, this also shows that iteratively running FALCON+ (Simchi-Levi & Xu, 2020) suffers from exponential sample complexity.

With the above lower bound, we are ready to show the upper bound for JSRL under certain assumptions on the guide-policy. In particular, we assume that the guide-policy π^g is able to cover the good states that are visited by the optimal policy under some feature representation. Let d_h^π be the state visitation distribution of policy π at time step h. We make the following assumption:

Assumption 4.2 (Quality of guide-policy π^g). Assume that the state is parametrized by some feature mapping φ : S → R^d such that for any policy π, Q^π(s, a) and π(s) depend on s only through φ(s), and that in the feature space, the guide-policy π^g covers the states visited by the optimal policy π^*:

sup_{s, h} d_h^{π^*}(φ(s)) / d_h^{π^g}(φ(s)) ≤ C.

In other words, the guide-policy visits all of the good states in the feature space. A policy that satisfies Assumption 4.2 may still be far from optimal due to the wrong choice of actions in each step. Assumption 4.2 is also much weaker than the single-policy concentratability coefficient assumption, which requires that the guide-policy visits all good state-action pairs and is a standard assumption in the offline learning literature (Rashidinejad et al., 2021; Xie et al., 2021). The ratio in Assumption 4.2 is also sometimes referred to as the distribution mismatch coefficient in the literature on policy gradient methods (Agarwal et al., 2021).

We show via the following theorem that, given Assumption 4.2, a simplified JSRL algorithm which only explores at the current guide-step h + 1 gives good performance guarantees for both tabular MDPs and MDPs with general function approximation. The simplified JSRL algorithm coincides with the Policy Search by Dynamic Programming (PSDP) algorithm in (Bagnell et al., 2003), although our method is mainly motivated by the problem of fine-tuning and efficient exploration in value-based methods, while PSDP focuses on policy-based methods.

Theorem 4.3 (Informal). Under Assumption 4.2 and an appropriate choice of TrainPolicy and EvaluatePolicy, JSRL in Algorithm 1 guarantees a suboptimality of O(C H^{5/2} S^{1/2} A / T^{1/2}) for tabular MDPs, and a near-optimal bound up to a factor of C · poly(H) for MDPs with general function approximation.

To achieve a polynomial bound for JSRL, it suffices to take TrainPolicy to be ε-greedy. This is in sharp contrast to Theorem 4.1, where ε-greedy suffers from exponential sample complexity. As discussed in the related work section, although polynomial and even near-optimal bounds can be achieved by many optimism-based methods (Jin et al., 2018; Ouyang et al., 2017), the JSRL algorithm does not require constructing a bonus function for uncertainty quantification, and can be implemented easily based on naïve ε-greedy methods.

Furthermore, although we focus on analyzing the simplified JSRL which only updates the policy π at the current guide-step h + 1, in practice we run the JSRL algorithm as in Algorithm 1, which updates all policies after step h + 1. This is the main difference between our proposed algorithm and PSDP. For a formal statement and more discussion related to Theorem 4.3, please refer to Appendix A.5.3.

5. Experiments

In our experimental evaluation, we study the following questions: (1) How does JSRL compare with competitive IL+RL baselines? (2) Does JSRL scale to complex vision-based robotic manipulation tasks? (3) How sensitive is JSRL to the quality of the guide-policy? (4) How important is the curriculum component of JSRL? (5) Does JSRL generalize? That is, can a guide-policy still be useful if it was pre-trained on a related task?


Environment  Dataset  AWAC^1  BC^1  CQL^1  IQL  IQL+JSRL-Curriculum (Ours)  IQL+JSRL-Random (Ours)
antmaze-umaze-v0 1k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.2 ± 0.5 15.6 ± 19.9 10.4 ± 9.6
10k 0.0 ± 0.0 1.0 0.0 ± 0.0 55.5 ± 12.5 71.7 ± 14.5 52.3 ± 26.7
100k 0.0 ± 0.0 62.0 0.0 ± 0.0 74.2 ± 25.6 93.7 ± 4.2 92.1 ± 2.8
1m (standard) 93.67 ± 1.89 61.0 64.33 ± 45.58 97.6 ± 3.2 98.1 ± 1.4 95.0 ± 3.0
antmaze-umaze-diverse-v0 1k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.0 ± 0.0 3.1 ± 8.0 1.9 ± 4.8
10k 0.0 ± 0.0 1.0 0.0 ± 0.0 33.1 ± 10.7 72.6 ± 12.2 39.4 ± 20.1
100k 0.0 ± 0.0 13.0 0.0 ± 0.0 29.9 ± 23.1 81.3 ± 23.0 82.3 ± 14.2
1m (standard) 46.67 ± 3.68 80.0 0.50 ± 0.50 53.0 ± 30.5 88.6 ± 16.3 89.8 ± 10.0
antmaze-medium-play-v0 1k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0
10k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.1 ± 0.3 16.7 ± 12.9 3.8 ± 5.0
100k 0.0 ± 0.0 0.0 0.0 ± 0.0 32.8 ± 32.6 86.7 ± 3.7 56.2 ± 28.8
1m (standard) 0.0 ± 0.0 0.0 0.0 ± 0.0 92.8 ± 2.7 91.1 ± 3.9 87.8 ± 4.2
antmaze-medium-diverse-v0 1k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0
10k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.0 ± 0.0 16.6 ± 11.7 5.1 ± 8.2
100k 0.0 ± 0.0 0.0 0.0 ± 0.0 15.7 ± 17.7 81.5 ± 18.8 67.0 ± 17.4
1m (standard) 0.0 ± 0.0 0.0 0.0 ± 0.0 92.4 ± 4.5 93.1 ± 3.1 86.3 ± 5.9
antmaze-large-play-v0 1k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0
10k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.0 ± 0.0 0.1 ± 0.2 0.0 ± 0.0
100k 0.0 ± 0.0 0.0 0.0 ± 0.0 2.6 ± 8.2 36.3 ± 16.4 17.7 ± 13.4
1m (standard) 0.0 ± 0.0 0.0 0.0 ± 0.0 62.4 ± 12.4 62.9 ± 11.3 48.6 ± 10.0
antmaze-large-diverse-v0 1k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.0
10k 0.0 ± 0.0 0.0 0.0 ± 0.0 0.0 ± 0.0 0.1 ± 0.2 0.0 ± 0.0
100k 0.0 ± 0.0 0.0 0.0 ± 0.0 4.1 ± 10.4 34.4 ± 23.0 22.4 ± 15.4
1m (standard) 0.0 ± 0.0 0.0 0.0 ± 0.0 68.3 ± 8.9 68.3 ± 8.8 58.3 ± 6.5
door-binary-v0 100 0.07 ± 0.11 0.0 0.0 ± 0.0 0.8 ± 3.8 0.4 ± 1.8 0.1 ± 0.2
1k 0.41 ± 0.58 0.0 0.0 ± 0.0 0.5 ± 1.5 0.7 ± 1.0 0.45 ± 1.2
10k 1.93 ± 2.72 0.0 12.24 ± 24.47 10.6 ± 14.1 4.3 ± 8.4 22.3 ± 11.6
100k (standard) 17.26 ± 20.09 0.0 8.28 ± 19.94 50.2 ± 2.5 28.5 ± 19.5 24.3 ± 11.5
pen-binary-v0 100 3.13 ± 4.43 0.0 31.46 ± 9.99 18.8 ± 11.6 24.3 ± 12.1 29.1 ± 7.6
1k 1.43 ± 1.10 0.0 54.50 ± 0.0 30.1 ± 10.2 36.7 ± 7.9 46.3 ± 6.3
10k 2.21 ± 1.30 0.0 51.36 ± 4.34 38.4 ± 11.2 44.3 ± 6.2 52.1 ± 3.3
100k (standard) 1.23 ± 1.08 0.0 59.58 ± 1.43 65.0 ± 2.9 62.6 ± 3.6 60.6 ± 2.7
relocate-binary-v0 100 0.0 ± 0.0 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.1 0.0 ± 0.0
1k 0.01 ± 0.01 0.0 0.0 ± 0.0 0.0 ± 0.0 0.0 ± 0.1 0.0 ± 0.0
10k 0.0 ± 0.0 0.0 1.18 ± 2.70 0.2 ± 0.3 0.6 ± 1.6 0.5 ± 0.7
100k (standard) 0.0 ± 0.0 0.0 4.44 ± 6.36 8.6 ± 7.7 0.0 ± 0.1 4.7 ± 4.2

Table 1. Comparison of JSRL with IL+RL baselines using averaged normalized scores on the D4RL Ant Maze and Adroit tasks. Each method pre-trains on an offline dataset and then runs online fine-tuning for 1m steps. Our method, IQL+JSRL, is competitive with the IL+RL baselines in the full-dataset setting, but performs significantly better in the small-data regime. For implementation details and more detailed comparisons, see Appendix A.2 and A.3.

5.1. Comparison with IL+RL baselines

To study how JSRL compares with competitive IL+RL methods, we utilize the D4RL (Fu et al., 2020) benchmark tasks, which vary in task complexity and offline dataset quality. We focus on the most challenging D4RL tasks: Ant Maze and Adroit manipulation. We consider a common setting where the agent first trains on an offline dataset (1m transitions for Ant Maze, 100k transitions for Adroit) and then runs online fine-tuning for 1m steps. We compare against algorithms designed specifically for this setting, which include advantage-weighted actor-critic (AWAC) (Nair et al., 2020), implicit q-learning (IQL) (Kostrikov et al., 2021), conservative q-learning (CQL) (Kumar et al., 2020), and behavior cloning (BC). See Appendix A.1 for a more detailed description of each IL+RL baseline algorithm. While JSRL can be used in combination with any initial guide-policy or fine-tuning algorithm, we show the combination of JSRL with the strongest baseline, IQL. IQL is an actor-critic method that completely avoids estimating the values of actions that are not seen in the offline dataset. This is a recent state-of-the-art method for the IL+RL setting we consider. In Table 1, we see that across the Ant Maze and Adroit environments, IQL+JSRL is able to successfully fine-tune given an initial offline dataset, and is competitive with the baselines. We return to Table 1 for further analysis when discussing sensitivity to the size of the dataset.

1 We used https://github.com/rail-berkeley/rlkit/tree/master/rlkit for AWAC and BC, and https://github.com/young-geng/CQL/tree/master/SimpleSAC for CQL.

5.2. Vision-Based Robotic Tasks

Utilizing offline data is challenging in complex tasks such as vision-based robotic manipulation. The high dimensionality of both the continuous control action space and the pixel-based state space presents unique scaling challenges for IL+RL methods. To study how JSRL scales to such settings, we focus on two simulated robotic manipulation tasks: Indiscriminate Grasping and Instance Grasping. In these tasks, a simulated robot arm is placed in front of a table with various categories of objects. When the robot lifts any object, a sparse reward is given for the Indiscriminate Grasping task; for the more challenging Instance Grasping task, the sparse reward is only given when a sampled target object is grasped.


Figure 4. IL+RL methods on two simulated robotic grasping tasks. The baselines show improvement with fine-tuning, but QT-Opt+JSRL is more sample efficient and attains higher final performance. Each line depicts the mean and standard deviation over three random seeds.

Figure 3. We evaluate the importance of guide-policy quality for JSRL on Instance Grasping, the most challenging task we consider. When limiting the initial demonstrations, JSRL is less sensitive to the limitations of the initial demonstrations than the baselines, especially in the small-data regime. For each of these initial demonstration settings, we find that QT-Opt+JSRL is more sample efficient than QT-Opt+JSRL-Random in the early stages of training, but both converge to the same final performance. A similar analysis for Indiscriminate Grasping is provided in Fig. 10 in the Appendix.

An image of the task is shown in Fig. 5 and described in detail in Appendix A.2.2. We compare JSRL against methods that have been shown to scale to such complex vision-based robotics settings: QT-Opt (Kalashnikov et al., 2018), AW-Opt (Lu et al., 2021), and BC. Each method has access to the same offline dataset of 2,000 successful demonstrations and is allowed to run online fine-tuning for up to 100,000 steps. While AW-Opt and BC utilize offline successes as part of their original design motivation, we allow a fairer comparison for QT-Opt by initializing the replay buffer with the offline demonstrations, which was not the case in the original QT-Opt paper. Since we have already shown that JSRL can work well with an offline RL algorithm in the previous experiment, to demonstrate the flexibility of our approach, in this experiment we combine JSRL with an online Q-learning method: QT-Opt. As seen in Fig. 4, the combination of QT-Opt+JSRL (both versions of the curricula) significantly outperforms the other methods in both sample efficiency and final performance.

5.3. Initial Dataset Sensitivity

While most IL and RL methods are improved by more data and higher-quality data, there are often practical limitations that restrict initial offline datasets. JSRL is no exception to this dependency, as the quality of the guide-policy π^g directly depends on the offline dataset when utilizing JSRL in an IL+RL setting (i.e., when the guide-policy is pre-trained on an offline dataset). We study the offline dataset sensitivity of IL+RL algorithms and JSRL on both the D4RL tasks and the vision-based robotic grasping tasks. We note that the two settings presented in D4RL and Robotic Grasping are quite different: IQL+JSRL in D4RL pre-trains with an offline RL algorithm on a mixed-quality offline dataset, while QT-Opt+JSRL pre-trains with BC on a high-quality dataset.


Environment  Demos  AW-Opt  BC  QT-Opt  QT-Opt+JSRL (Curriculum)  QT-Opt+JSRL-Random
Indiscriminate Grasping 20 0.33 ± 0.43 0.19 ± 0.04 0.00 ± 0.00 0.91 ± 0.01 0.89 ± 0.00
Indiscriminate Grasping 200 0.93 ± 0.02 0.23 ± 0.00 0.92 ± 0.02 0.92 ± 0.00 0.92 ± 0.01
Indiscriminate Grasping 2k 0.93 ± 0.01 0.40 ± 0.06 0.92 ± 0.01 0.93 ± 0.02 0.94 ± 0.02
Indiscriminate Grasping 20k 0.93 ± 0.04 0.92 ± 0.00 0.93 ± 0.00 0.95 ± 0.01 0.94 ± 0.00
Instance Grasping 20 0.44 ± 0.05 0.05 ± 0.03 0.29 ± 0.20 0.54 ± 0.02 0.53 ± 0.02
Instance Grasping 200 0.44 ± 0.04 0.16 ± 0.01 0.44 ± 0.04 0.52 ± 0.01 0.55 ± 0.02
Instance Grasping 2k 0.42 ± 0.02 0.30 ± 0.01 0.15 ± 0.22 0.52 ± 0.02 0.57 ± 0.02
Instance Grasping 20k 0.55 ± 0.01 0.48 ± 0.01 0.27 ± 0.20 0.55 ± 0.01 0.56 ± 0.02

Table 2. Limiting the initial number of demonstrations is challenging for IL+RL baselines on the difficult robotic grasping tasks. Notably,
only QT-Opt+JSRL is able to learn in the smallest-data regime of just 20 demonstrations, 100x less than the standard 2,000 demonstrations.
For implementation details, see Appendix A.2.2.

For D4RL, methods typically utilize 1 million transitions from mixed-quality policies from previous RL training runs; as we reduce the size of the offline datasets in Table 1, the IQL+JSRL performance degrades less than the baseline IQL performance. For the robotic grasping tasks, we initially provided 2,000 high-quality demonstrations; as we drastically reduce the number of demonstrations, we find that JSRL efficiently learns better policies. Across both D4RL and the robotic grasping tasks, JSRL outperforms the baselines in the low-data regime, as shown in Table 1 and Table 2. In the high-data regime, when we increase the number of demonstrations by 10x to 20,000 demonstrations, we notice that AW-Opt and BC perform much more competitively, suggesting that the exploration challenge is no longer the bottleneck. While starting with such large numbers of demonstrations is not typically a realistic setting, these results suggest that the benefits of JSRL are most prominent when the offline dataset does not densely cover good state-action pairs. This aligns with our analysis in Appendix A.1 that JSRL does not require such assumptions about the dataset, but solely requires a prior policy.

5.4. JSRL-Curriculum vs. JSRL-Random Switching

JSRL combines two components: starting exploration from the good states reached by the guide-policy, and a curriculum that progressively shortens the guide-policy's roll-in. In order to disentangle these two components, we propose an augmentation of our method, JSRL-Random, that randomly selects the number of guide-steps every episode. Using the D4RL tasks and the robotic grasping tasks, we compare JSRL-Random to JSRL and the previous IL+RL baselines and find that JSRL-Random performs quite competitively, as seen in Table 1 and Table 2. However, when considering sample efficiency, Fig. 4 shows that JSRL is better than JSRL-Random in the early stages of training, while converged performance is comparable. These same trends hold when we limit the quality of the guide-policy by constraining the initial dataset, as seen in Fig. 3. This suggests that while a curriculum of guide-steps does help sample efficiency, the largest benefits of JSRL may stem from the presence of good visitation states induced by the guide-policy, as opposed to the specific order of good visitation states, as suggested by our analysis in Appendix A.5.3. We analyze the hyperparameter sensitivity of JSRL-Curriculum and provide the specific implementation of the hyperparameters chosen for our experiments in Appendix A.4.

5.5. Guide-Policy Generalization

In order to study how guide-policies from easier tasks can be used to efficiently explore more difficult tasks, we train an indiscriminate grasping policy and use it as the guide-policy for JSRL on instance grasping (Figure 13). While the performance when using the indiscriminate guide is worse than when using the instance guide, both JSRL versions outperform vanilla QT-Opt.

We also test JSRL's generalization capabilities in the D4RL setting. We consider two variations of the Ant mazes: "play" and "diverse". In antmaze-*-play, the agent must reach a fixed set of goal locations from a fixed set of starting locations. In antmaze-*-diverse, the agent must reach random goal locations from random starting locations. Thus, the diverse environments present a greater challenge than the corresponding play environments. In Figure 14, we see that JSRL is able to better generalize to unseen goal and starting locations compared to vanilla IQL.

6. Conclusion

In this work, we propose Jump-Start Reinforcement Learning (JSRL), a method for leveraging a prior policy of any form to bolster exploration in RL and increase sample efficiency. Our algorithm creates a learning curriculum by rolling in a pre-existing guide-policy, which is then followed by the self-improving exploration-policy. The job of the exploration-policy is simplified, as it starts its exploration from states closer to the goal. As the exploration-policy improves, the effect of the guide-policy diminishes, leading to a fully capable RL policy. Importantly, our approach is generic, since it can be used with any RL method that requires exploring the environment, including value-based RL approaches, which have traditionally struggled in this setting.


We showed the benefits of JSRL on a set of offline RL benchmark tasks as well as on more challenging vision-based robotic simulation tasks. Our experiments indicate that JSRL is more sample efficient than more complex IL+RL approaches while being compatible with other approaches' benefits. In addition, we presented a theoretical analysis of an upper bound on the sample complexity of JSRL, which showed an improvement from exponential to polynomial dependence on the time horizon compared with non-optimism exploration methods. In the future, we plan on deploying JSRL in the real world in conjunction with various types of guide-policies to further investigate its ability to bootstrap data-efficient RL.

7. Limitations

We acknowledge several potential limitations that stem from the quality of the pre-existing policy or data. Firstly, the policy discovered by JSRL is inherently susceptible to any biases present in the training data or within the guide-policy. Furthermore, the quality of the training data and pre-existing policy could profoundly impact the safety and effectiveness of the guide-policy. This becomes especially important in high-risk domains such as robotics, where poor or misguided policies could lead to harmful outcomes. Finally, the presence of adversarial guide-policies might result in learning that is even slower than random exploration. For instance, in a task where an agent is required to navigate through a small maze, a guide-policy that is deliberately trained to remain static could constrain the agent, inhibiting its learning and performance until the curriculum is complete. These potential limitations underline the necessity of carefully curated training data and guide-policies to ensure the usefulness of JSRL.

Acknowledgements

We would like to thank Kanishka Rao, Nikhil Joshi, and Alex Irpan for their insightful discussions and feedback on our work. We would also like to thank Rosario Jauregui Ruano for performing physical robot experiments with JSRL. Jiantao Jiao and Banghua Zhu were partially supported by NSF Grants IIS-1901252 and CCF-1909499.

References

Agarwal, A., Henaff, M., Kakade, S., and Sun, W. PC-PG: Policy cover directed exploration for provable policy gradient learning. arXiv preprint arXiv:2007.08459, 2020.

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.

Bagnell, J., Kakade, S. M., Schneider, J., and Ng, A. Policy search by dynamic programming. Advances in Neural Information Processing Systems, 16, 2003.

Bagnell, J. A. Learning decisions: Robustness, uncertainty, and approximation. Carnegie Mellon University, 2004.

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.

Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 2021.

Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214. JMLR Workshop and Conference Proceedings, 2011.

Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. First return, then explore. Nature, 590(7847):580–586, 2021.

Florensa, C., Held, D., Wulfmeier, M., Zhang, M., and Abbeel, P. Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, pp. 482–495. PMLR, 2017.

Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

Hosu, I.-A. and Rebedea, T. Playing Atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077, 2016.

Ivanovic, B., Harrison, J., Sharma, A., Chen, M., and Pavone, M. BaRC: Backward reachability curriculum for robotic reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 15–21. IEEE, 2019.


Janner, M., Li, Q., and Levine, S. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34, 2021.

Jiang, N. On value functions and the agent-environment boundary. arXiv preprint arXiv:1905.13341, 2019.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pp. 1704–1713. PMLR, 2017.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 4868–4878, 2018.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143. PMLR, 2020.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267–274, 2002.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.

Kober, J., Mohler, B., and Peters, J. Imitation and reinforcement learning for motor primitives with perceptual coupling. In From Motor Learning to Interaction Learning in Robots, pp. 209–225. Springer, 2010.

Koenig, S. and Simmons, R. G. Complexity analysis of real-time reinforcement learning. In AAAI, pp. 99–107, 1993.

Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021.

Krishnamurthy, A., Langford, J., Slivkins, A., and Zhang, C. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. In Conference on Learning Theory, pp. 2025–2027. PMLR, 2019.

Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.

Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems, 20(1):96–1, 2007.

Liao, P., Qi, Z., and Murphy, S. Batch policy learning in average reward Markov decision processes. arXiv preprint arXiv:2007.11771, 2020.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Liu, B., Cai, Q., Yang, Z., and Wang, Z. Neural trust region/proximal policy optimization attains globally optimal policy. In Neural Information Processing Systems, 2019.

Lu, Y., Hausman, K., Chebotar, Y., Yan, M., Jang, E., Herzog, A., Xiao, T., Irpan, A., Khansari, M., Kalashnikov, D., and Levine, S. AW-Opt: Learning robotic skills with imitation and reinforcement at scale. In 2021 Conference on Robot Learning (CoRL), 2021.

McAleer, S., Agostinelli, F., Shmakov, A., and Baldi, P. Solving the Rubik's cube without human knowledge, 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. IEEE, 2018.

Nair, A., Gupta, A., Dalal, M., and Levine, S. AWAC: Accelerating online reinforcement learning with offline datasets, 2020.

Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. arXiv preprint arXiv:1406.1853, 2014.

Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown Markov decision processes: A Thompson sampling approach. arXiv preprint arXiv:1709.04570, 2017.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph., 37(4), July 2018.

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., and Levine, S. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.

Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. arXiv preprint arXiv:2103.12021, 2021.

Resnick, C., Raileanu, R., Kapoor, S., Peysakhovich, A., Cho, K., and Bruna, J. Backplay: "Man muss immer umkehren". arXiv preprint arXiv:1807.06919, 2018.

Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 661–668, 2010.

Salimans, T. and Chen, R. Learning Montezuma's Revenge from a single demonstration. arXiv preprint arXiv:1812.03381, 2018.

Schaal, S. et al. Learning from demonstration. Advances in Neural Information Processing Systems, pp. 1040–1046, 1997.

Scherrer, B. Approximate policy iteration schemes: A comparison. In International Conference on Machine Learning, pp. 1314–1322, 2014.

Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN 3562765, 2020.

Smart, W. and Pack Kaelbling, L. Effective reinforcement learning for mobile robots. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292), volume 4, pp. 3404–3410, 2002. doi: 10.1109/ROBOT.2002.1014237.

Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards, 2018.

Wang, L., Cai, Q., Yang, Z., and Wang, Z. Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations, 2019.

Xie, T., Jiang, N., Wang, H., Xiong, C., and Bai, Y. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. arXiv preprint arXiv:2106.04895, 2021.

Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pp. 10978–10989. PMLR, 2020.

Zhang, J., Koppel, A., Bedi, A. S., Szepesvari, C., and Wang, M. Variational policy gradient method for reinforcement learning with general utilities. arXiv preprint arXiv:2007.02151, 2020a.

Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33, 2020b.

Zheng, Q., Zhang, A., and Grover, A. Online decision transformer. arXiv preprint arXiv:2202.05607, 2022.


A. Appendix
A.1. Imitation and Reinforcement Learning (IL+RL)
Most of our baseline algorithms are imitation and reinforcement learning methods (IL+RL). IL+RL methods usually involve
pre-training on offline data, then fine-tuning the pre-trained policies online. We do not include transfer learning methods
because our goal is to use demonstrations or sub-optimal, pre-existing policies to speed up RL training. Transfer learning
usually implies distilling knowledge from a well performing model to another (often smaller) model, or re-purposing an
existing model to solve a new task. Both of these use cases are outside the scope of our work. We provide a description of
each of our IL+RL baselines below.

A.1.1. D4RL
AWAC AWAC (Nair et al., 2020) is an actor-critic method that updates the critic with dynamic programming and updates
the actor such that its distribution stays close to the behavior policy that generated the offline data. Note that the AWAC
paper compares against a few additional IL+RL baselines, including a few variants that use demonstrations with vanilla
SAC.

CQL CQL (Kumar et al., 2020) is a Q-learning variant that regularizes Q-values during training to avoid the estimation
errors caused by performing Bellman updates with out of distribution actions.

IQL IQL (Kostrikov et al., 2021) is an actor-critic method that completely avoids estimating the values of actions that are
not seen in the offline dataset. This is a recent state-of-the-art method for the IL+RL setting we consider.
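As background on how IQL avoids querying out-of-distribution actions, its value function is fit by expectile regression over dataset actions only, and the Q-function regresses toward that value estimate. The objectives below paraphrase Kostrikov et al. (2021) from memory rather than this paper, so they should be checked against the original reference:

```latex
% Expectile regression of the value function, using only dataset actions
L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[ L_2^{\tau}\big( Q_{\hat\theta}(s,a) - V_\psi(s) \big) \right],
\qquad L_2^{\tau}(u) = \lvert \tau - \mathbb{1}(u < 0) \rvert\, u^2,

% TD regression of the Q-function toward the learned value function
L_Q(\theta) = \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\!\left[ \big( r(s,a) + \gamma V_\psi(s') - Q_\theta(s,a) \big)^2 \right].
```

Because both losses are evaluated only at actions that appear in the dataset, no value estimate is ever queried at an unseen action; the policy is then extracted with advantage-weighted regression.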

A.1.2. Simulated Robotic Grasping


AW-Opt AW-Opt combines insights from AWAC and QT-Opt (Kalashnikov et al., 2018) to create a distributed actor-critic
algorithm that can successfully fine-tune policies trained offline. QT-Opt is an RL system that has been shown to scale to
complex, high-dimensional robotic control from pixels, which is a much more challenging domain than common simulation
benchmarks like D4RL.

A.2. Experiment Implementation Details


A.2.1. D4RL: Ant Maze and Adroit
We evaluate on the Ant Maze and Adroit tasks, the most challenging tasks in the D4RL benchmark (Fu et al., 2020).
For the baseline IL+RL method comparisons, we utilize implementations from (Kostrikov et al., 2021): we use the
open-sourced version of IQL and the open-sourced versions of AWAC, BC, and CQL from https://github.com/rail-
berkeley/rlkit/tree/master/rlkit. While the standard initial offline datasets contain 1m transitions for Ant Maze and 100k
transitions for Adroit, we additionally ablate the datasets to evaluate settings with 100, 1k, 10k, and 100k transitions provided
initially. For AWAC and CQL, we report the mean and standard deviation over three random seeds. For behavioral cloning
(BC), we report the results of a single random seed. For IQL and both variations of IQL+JSRL, we report the mean and
standard deviation over twenty random seeds.
For the implementation of IQL+JSRL, we build upon the open-sourced IQL implementation (Kostrikov et al., 2021). First, to
obtain a guide-policy, we use IQL without modification for pre-training on the offline dataset. Then, we follow Algorithm 1
when fine-tuning online and use the IQL online update as the TrainPolicy step from Algorithm 1. The IQL neural
network architecture follows the original implementation of (Kostrikov et al., 2021). For fine-tuning, we maintain two replay
buffers for offline and online transitions. The offline buffer contains all the demonstrations, and the online buffer is FIFO
with a fixed capacity of 100k transitions. For each gradient update during fine-tuning, we sample minibatches such that 75%
of samples come from the online buffer, and 25% of samples come from the offline buffer.
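The 75%/25% online/offline minibatch mixture described above can be sketched as follows. This is an illustrative reconstruction of the sampling logic only; the buffer representation, function name, and batch size are assumptions rather than code from the authors' implementation.

```python
import random

def sample_mixed_minibatch(online_buffer, offline_buffer, batch_size=256,
                           online_fraction=0.75):
    """Draw a fine-tuning minibatch with ~75% online and ~25% offline transitions.

    online_buffer:  FIFO list of online transitions (capped at 100k in the paper).
    offline_buffer: list containing all offline demonstration transitions.
    """
    num_online = int(batch_size * online_fraction)
    num_offline = batch_size - num_online
    batch = random.sample(online_buffer, min(num_online, len(online_buffer)))
    batch += random.sample(offline_buffer, min(num_offline, len(offline_buffer)))
    random.shuffle(batch)  # avoid any ordering bias between the two sources
    return batch
```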
Our implementation of IQL+JSRL focused on two settings when switching from offline pre-training to online fine-tuning:
Warm-starting and Cold-starting. When Warm-starting, we copy the actor, critic, target critic, and value networks from
the pre-trained guide-policy to the exploration-policy. When Cold-starting, we instead start training the exploration-policy
from scratch. Results for both variants are shown in Appendix A.3. We find that empirically, the performance of these
two variants is highly dependent on task difficulty as well as the quality of the initial offline dataset. When initial datasets


Figure 5. In the simulated vision-based robotic grasping tasks, a robot arm must grasp various objects placed in bins in front of it. Full
implementation details are described in Appendix A.2.2.

Figure 6. Example ant maze (left) and adroit dexterous manipulation (right) tasks.


are very poor, cold-starting usually performs better; when initial datasets are dense and high-quality, warm-starting seems
to perform better. For the results reported in Table 1, we utilize Cold-start results for both IQL+JSRL-Curriculum and
IQL+JSRL-Random.
Finally, the curriculum implementation for IQL+JSRL used policy evaluation every 10,000 steps to gauge learning progress
of the exploration-policy π e . When the moving average of π e ’s performance increases over a few samples, we move on to
the next curriculum stage. For the IQL+JSRL-Random variant, we randomly sample the number of guide-steps for every
single episode.
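The curriculum-advance rule just described (periodic policy evaluation plus a check that the moving average of the exploration-policy's returns is increasing) can be sketched as below. The window size and the exact comparison are illustrative assumptions; the paper only specifies the evaluation frequency and the use of a moving average.

```python
from collections import deque

class CurriculumTracker:
    """Tracks evaluation returns and decides when to shorten the guide roll-in."""

    def __init__(self, guide_steps, window=5):
        self.guide_steps = list(guide_steps)   # decreasing sequence H1 > H2 > ... > Hn
        self.stage = 0
        self.window = window
        self.returns = deque(maxlen=2 * window)

    @property
    def current_guide_step(self):
        return self.guide_steps[self.stage]

    def record_evaluation(self, eval_return):
        """Call after each periodic evaluation (every 10,000 steps for IQL+JSRL,
        every 1,000 steps for QT-Opt+JSRL in Appendix A.2.2)."""
        self.returns.append(eval_return)
        if len(self.returns) == self.returns.maxlen:
            older = sum(list(self.returns)[: self.window]) / self.window
            recent = sum(list(self.returns)[self.window :]) / self.window
            # Moving average of recent returns increased: advance to the next
            # (shorter) guide-step in the curriculum.
            if recent > older and self.stage < len(self.guide_steps) - 1:
                self.stage += 1
                self.returns.clear()
```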

A.2.2. Simulated Robotic Manipulation


We simulate a 7 DoF arm with an over-the-shoulder camera (see Figure 5). Three bins in front of the robot are filled with
various simulated objects to be picked up by the robot and a sparse binary reward is assigned if any object is lifted above
a bin at the end of an episode. States are represented in the form of RGB images and actions are continuous Cartesian
displacements of the gripper’s 3D positions and yaw. In addition, the policy commands discrete gripper open and close
actions and may terminate an episode.
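As a rough illustration of this interface, a possible observation/action structure is sketched below; the field names, gripper-command encoding, and image resolution are our own illustrative choices, not the exact specification of the simulator.

import numpy as np
from dataclasses import dataclass

@dataclass
class GraspingAction:
    """Illustrative action structure: continuous Cartesian motion plus discrete gripper/termination commands."""
    delta_xyz: np.ndarray     # shape (3,): displacement of the gripper's 3D position
    delta_yaw: float          # rotation about the vertical axis
    gripper_command: int      # e.g. 0 = keep, 1 = open, 2 = close (illustrative encoding)
    terminate_episode: bool   # the policy may choose to end the episode

# Observations are RGB images, e.g. a uint8 array; the resolution here is arbitrary.
example_observation = np.zeros((64, 64, 3), dtype=np.uint8)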
For the implementation of QT-Opt+JSRL, we build upon the QT-Opt algorithm described in (Kalashnikov et al., 2018). First, to obtain a guide-policy, we use a BC policy trained offline on the provided demonstrations. Then, we follow Algorithm 1 when fine-tuning online and use the QT-Opt online update as the TrainPolicy step from Algorithm 1. The demonstrations are not added to the QT-Opt+JSRL replay buffer. The QT-Opt neural network architecture follows the original implementation in (Kalashnikov et al., 2018). For JSRL, AW-Opt, QT-Opt, and BC, we report the mean and standard deviation over three random seeds.
Finally, similar to Appendix A.2.1, the curriculum implementation for QT-Opt+JSRL evaluates the policy every 1,000 steps to gauge the learning progress of the exploration-policy π^e. When the moving average of π^e's performance improves over a few evaluations, the number of guide-steps is lowered, allowing the JSRL curriculum to continue. For the QT-Opt+JSRL-Random variant, we randomly sample the number of guide-steps for every single episode.

A.3. Additional Experiments

Environment          JSRL-Random (Warm-start)    JSRL-Random (Cold-start)    JSRL-Curriculum (Warm-start)    JSRL-Curriculum (Cold-start)    IQL
pen-binary-v0 27.18 ± 7.77 29.12 ± 7.62 25.10 ± 8.73 24.31 ± 12.05 18.80 ± 11.63
door-binary-v0 0.01 ± 0.04 0.06 ± 0.23 1.45 ± 4.67 0.40 ± 1.80 0.84 ± 3.76
relocate-binary-v0 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.01 ± 0.06 0.01 ± 0.03

Table 3. Adroit 100 Offline Transitions

Environment          JSRL-Random (Warm-start)    JSRL-Random (Cold-start)    JSRL-Curriculum (Warm-start)    JSRL-Curriculum (Cold-start)    IQL
pen-binary-v0 47.23 ± 3.96 46.30 ± 6.34 34.23 ± 7.22 36.74 ± 7.91 30.11 ± 10.22
door-binary-v0 0.15 ± 0.25 0.45 ± 1.22 0.44 ± 0.89 0.68 ± 1.02 0.53 ± 1.46
relocate-binary-v0 0.06 ± 0.08 0.01 ± 0.04 0.05 ± 0.09 0.04 ± 0.10 0.01 ± 0.03

Table 4. Adroit 1k Offline Transitions

A.4. Hyperparameters of JSRL


JSRL introduces three hyperparameters: (1) the initial number of guide-steps that the guide-policy takes at the beginning of
fine-tuning (H1 ), (2) the number of curriculum stages (n), and (3) the performance threshold that decides whether to move
on to the next curriculum stage (β). Minimal tuning was done for these hyperparameters.


Figure 7. A policy is first pre-trained on 100k offline transitions; negative steps correspond to this pre-training. We then roll out the pre-trained policy for 100k timesteps and use these online samples to warm up the critic network. After warming up the critic, we continue with actor-critic fine-tuning using the pre-trained policy and the warmed-up critic.


Figure 8. A policy is first pre-trained on one million offline transitions; negative steps correspond to this pre-training. We then roll out the pre-trained policy for 100k timesteps and use these online samples to warm up the critic network. After warming up the critic, we continue with actor-critic fine-tuning using the pre-trained policy and the warmed-up critic. Allowing the critic to warm up provides a stronger baseline to compare JSRL against, since in the case where we have a policy but no value function, we could use that policy to train a value function.


Figure 9. QT-Opt+JSRL using guide-policies trained from scratch online vs. guide-policies trained with BC on demonstration data in the indiscriminate grasping environment. For each experiment, the guide-policy trained offline and the guide-policy trained online are of equivalent performance.


Figure 10. Comparing IL+RL methods with JSRL on the Indiscriminate Grasping task while adjusting the number of initial demonstrations available. In addition, we compare the sample efficiency of each method.


Figure 11. Comparing IL+RL methods with JSRL on the Instance Grasping task while adjusting the initial demonstrations available.


Environment          IQL+JSRL-Random (Warm-start)    IQL+JSRL-Random (Cold-start)    IQL+JSRL-Curriculum (Warm-start)    IQL+JSRL-Curriculum (Cold-start)    IQL
pen-binary-v0 51.78 ± 3.00 52.11 ± 3.30 38.04 ± 12.71 44.31 ± 6.22 38.41 ± 11.18
door-binary-v0 10.59 ± 11.78 22.32 ± 11.61 5.08 ± 7.60 4.33 ± 8.38 10.61 ± 14.11
relocate-binary-v0 1.99 ± 3.15 0.50 ± 0.65 4.39 ± 8.17 0.55 ± 1.60 0.19 ± 0.32

Table 5. Adroit 10k Offline Transitions

Environment          IQL+JSRL-Random (Warm-start)    IQL+JSRL-Random (Cold-start)    IQL+JSRL-Curriculum (Warm-start)    IQL+JSRL-Curriculum (Cold-start)    IQL
pen-binary-v0 60.06 ± 2.94 60.58 ± 2.73 62.81 ± 2.79 62.59 ± 3.62 64.96 ± 2.87
door-binary-v0 27.23 ± 8.90 24.27 ± 11.47 38.70 ± 17.25 28.51 ± 19.54 50.21 ± 2.50
relocate-binary-v0 5.09 ± 4.39 4.69 ± 4.16 11.18 ± 11.69 0.04 ± 0.14 8.59 ± 7.70

Table 6. Adroit 100k Offline Transitions

Environment          IQL+JSRL-Random (Warm-start)    IQL+JSRL-Random (Cold-start)    IQL+JSRL-Curriculum (Warm-start)    IQL+JSRL-Curriculum (Cold-start)    IQL
antmaze-umaze-v0 0.10 ± 0.31 10.35 ± 9.59 0.40 ± 0.94 15.60 ± 19.87 0.20 ± 0.52
antmaze-umaze-diverse-v0 0.10 ± 0.31 1.90 ± 4.81 0.45 ± 1.23 3.05 ± 7.99 0.00 ± 0.00
antmaze-medium-play-v0 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-medium-diverse-v0 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-play-v0 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-diverse-v0 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00

Table 7. Ant Maze 1k Offline Transitions

Environment          IQL+JSRL-Random (Warm-start)    IQL+JSRL-Random (Cold-start)    IQL+JSRL-Curriculum (Warm-start)    IQL+JSRL-Curriculum (Cold-start)    IQL
antmaze-umaze-v0 56.00 ± 13.70 52.70 ± 26.71 57.25 ± 15.86 71.70 ± 14.49 55.50 ± 12.51
antmaze-umaze-diverse-v0 23.05 ± 10.96 39.35 ± 20.07 26.80 ± 12.03 72.55 ± 12.18 33.10 ± 10.74
antmaze-medium-play-v0 0.05 ± 0.22 3.75 ± 4.97 0.00 ± 0.00 16.65 ± 12.93 0.10 ± 0.31
antmaze-medium-diverse-v0 0.00 ± 0.00 5.10 ± 8.16 0.00 ± 0.00 16.60 ± 11.71 0.00 ± 0.00
antmaze-large-play-v0 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.05 ± 0.22 0.00 ± 0.00
antmaze-large-diverse-v0 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.05 ± 0.22 0.00 ± 0.00

Table 8. Ant Maze 10k Offline Transitions

Environment          IQL+JSRL-Random (Warm-start)    IQL+JSRL-Random (Cold-start)    IQL+JSRL-Curriculum (Warm-start)    IQL+JSRL-Curriculum (Cold-start)    IQL
antmaze-umaze-v0 73.35 ± 22.58 92.05 ± 2.76 71.35 ± 26.36 93.65 ± 4.21 74.15 ± 25.62
antmaze-umaze-diverse-v0 40.95 ± 13.34 82.25 ± 14.20 38.80 ± 21.96 81.30 ± 23.04 29.85 ± 23.08
antmaze-medium-play-v0 9.55 ± 14.42 56.15 ± 28.78 22.15 ± 29.82 86.85 ± 3.67 32.80 ± 32.64
antmaze-medium-diverse-v0 14.05 ± 13.30 67.00 ± 17.43 15.75 ± 16.48 81.50 ± 18.80 15.70 ± 17.69
antmaze-large-play-v0 0.35 ± 0.93 17.70 ± 13.35 0.45 ± 1.19 36.30 ± 16.41 2.55 ± 8.19
antmaze-large-diverse-v0 1.25 ± 2.31 22.40 ± 15.44 0.75 ± 1.16 34.35 ± 22.97 4.10 ± 10.37

Table 9. Ant Maze 100k Offline Transitions

IQL+JSRL: For offline pre-training and online fine-tuning, we use the exact same hyperparameters as the default implementation of IQL (Kostrikov et al., 2021). Our reported results for vanilla IQL differ slightly from the original paper because we run more random seeds (20 vs. 5); we confirmed this with the authors of IQL. For the Indiscriminate and Instance Grasping experiments, we utilize the same environment, task definition, and training hyperparameters as QT-Opt and AW-Opt.


Environment          IQL+JSRL-Random (Warm-start)    IQL+JSRL-Random (Cold-start)    IQL+JSRL-Curriculum (Warm-start)    IQL+JSRL-Curriculum (Cold-start)    IQL
antmaze-umaze-v0 95.35 ± 2.23 94.95 ± 2.95 96.70 ± 1.69 98.05 ± 1.43 97.60 ± 3.19
antmaze-umaze-diverse-v0 65.95 ± 27.00 89.80 ± 10.00 59.95 ± 33.90 88.55 ± 16.37 52.95 ± 30.48
antmaze-medium-play-v0 82.25 ± 4.88 87.80 ± 4.20 92.20 ± 2.84 91.05 ± 3.86 92.75 ± 2.73
antmaze-medium-diverse-v0 83.45 ± 4.64 86.25 ± 5.94 91.65 ± 2.98 93.05 ± 3.10 92.40 ± 4.50
antmaze-large-play-v0 50.35 ± 9.74 48.60 ± 10.01 72.15 ± 9.66 62.85 ± 11.31 62.35 ± 12.42
antmaze-large-diverse-v0 56.80 ± 9.15 58.30 ± 6.54 70.55 ± 17.43 68.25 ± 8.76 68.25 ± 8.85

Table 10. Ant Maze 1m Offline Transitions

Tolerance \ Moving Average Horizon      1        5        10
0%                                      79.66    56.66    74.83
5%                                      51.12    78.8     79.78
15%                                     56.41    47.46    59.52

Table 11. We fix the number of curriculum stages at n = 10 for antmaze-large-diverse-v0, then vary the moving average horizon and
tolerance. Each number is the average reward after 5 million training steps of one seed. As tolerance increases, the reward decreases since
curriculum stages are not fully mastered before moving on.

Initial Number of Guide-Steps (H1):
For all X+JSRL experiments, we train the guide-policy (IQL for D4RL and BC for grasping) and then evaluate it to determine how many steps it takes to solve the task on average. For D4RL, we evaluate over one hundred episodes. For grasping, we plot training metrics and observe the average episode length after convergence. This average is then used as the initial number of guide-steps. Since H1 is computed directly, no hyperparameter search is required.
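A minimal sketch of this computation, assuming a generic environment/policy interface (env.reset, env.step returning a 4-tuple, the guide-policy as a callable); the episode cap is an illustrative choice.

import numpy as np

def estimate_initial_guide_steps(env, guide_policy, num_episodes=100, max_steps=1000):
    """Estimate H1 as the average number of steps the guide-policy takes to finish an episode."""
    lengths = []
    for _ in range(num_episodes):
        obs, done, t = env.reset(), False, 0
        while not done and t < max_steps:
            obs, _, done, _ = env.step(guide_policy(obs))
            t += 1
        lengths.append(t)
    return int(np.ceil(np.mean(lengths)))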
Curriculum Stages (n):
Once the number of curriculum stages was chosen, we computed the number of steps between curriculum stages as H1/n. Then h varies over H1 − H1/n, H1 − 2·H1/n, . . . , H1 − (n−1)·H1/n, 0. To decide on an appropriate number of curriculum stages, we decreased n (thereby increasing H1/n, the gap Hi − Hi−1 between consecutive guide-step values), starting from n = H, until the curriculum became too difficult for the agent to overcome (i.e., the agent becomes "stuck" on a curriculum stage). We then used the minimal value of n for which the agent could still solve all stages. In practice, we did not try every value between H and 1, but chose a very small subset of values to test in this range.
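The resulting schedule of guide-step values can be generated as in this small sketch (the exact rounding is an illustrative choice of ours):

def guide_step_schedule(h1, n):
    """Return the curriculum of guide-step counts: H1 - H1/n, H1 - 2*H1/n, ..., 0."""
    step = h1 / n
    return [max(int(round(h1 - i * step)), 0) for i in range(1, n + 1)]

# Example: guide_step_schedule(50, 5) -> [40, 30, 20, 10, 0]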
Performance Threshold (β): For both grasping and D4RL tasks, we evaluated π at fixed intervals and computed a moving average of these evaluations (horizon 5 for D4RL, 3 for grasping). If the current moving average is close enough to the best previous moving average, we move from curriculum stage i to stage i + 1. To define "close enough", we set a tolerance that lets the agent move to the next stage if the current moving average is within some percentage of the previous best. The tolerance and moving average horizon together constitute our "β", a generic parameter that can be chosen based on how costly it is to evaluate the performance of π. In Figure 12 and Table 11, we perform small studies to determine how varying β affects JSRL's performance.
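A minimal sketch of this stage-advancement rule; the bookkeeping of the best moving average is an illustrative choice and may differ in detail from our experiments.

def should_advance_stage(eval_returns, horizon=5, tolerance=0.05, best_moving_avg=None):
    """Advance when the current moving average of evaluations is within `tolerance`
    (a fraction) of the best moving average seen so far."""
    if len(eval_returns) < horizon:
        return False, best_moving_avg
    current = sum(eval_returns[-horizon:]) / horizon
    # Track the best moving average observed over training.
    if best_moving_avg is None or current > best_moving_avg:
        best_moving_avg = current
    advance = current >= (1.0 - tolerance) * best_moving_avg
    return advance, best_moving_avg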


Figure 12. Ablation study for β in the indiscriminate grasping environment. We find that the moving average horizon does not have a
large impact on performance, but larger tolerance slightly hurts performance. A larger tolerance around the best moving average makes it
easier for JSRL to move on to the next curriculum stage. This means that experiments with a larger tolerance could potentially move on to
the next curriculum stage before JSRL masters the previous curriculum stage, leading to lower performance.


Figure 13. First, an indiscriminate grasping policy is trained using online QT-Opt to 90% indiscriminate grasping success and 5% instance grasping success (when the policy happens to randomly pick the correct object). We compare this 90%-success indiscriminate grasping guide-policy with an 8.4%-success instance grasping guide-policy trained with BC on 2k demonstrations. While the performance when using the indiscriminate guide is slightly worse than when using the instance guide, both JSRL versions perform much better than vanilla QT-Opt.


Figure 14. First, a policy is trained offline on a simpler antmaze-*-play environment for one million steps (depicted by negative steps).
This policy is then used for initializing fine-tuning (depicted by positive steps) in a more complex antmaze-*-diverse environment. We
find that IQL+JSRL can better generalize to the more difficult antmazes compared to IQL even when using guide-policies trained on
different tasks.

A.5. Theoretical Analysis for JSRL


A.5.1. SETUP AND NOTATIONS
Consider a finite-horizon, time-inhomogeneous MDP with a fixed total horizon H and bounded rewards r_h ∈ [0, 1] for all h ∈ [H]. The transition distribution of a state-action pair (s, a) at step h is denoted P_h(· | s, a). Assume that at step 0 the initial state follows a distribution p_0. For simplicity, we use π to denote the policy over all H steps, π = {π_h}_{h=1}^H, and we let d_h^π(s) denote the marginalized state occupancy distribution at step h when following policy π.

A.5.2. PROOF SKETCH FOR THEOREM 4.1

Figure 15. Lower bound instance: combination lock

We construct a special instance, the combination lock MDP, which is depicted in Figure 15 and works as follows. The agent can only arrive at the red state s⋆_{h+1} at step h + 1 if it takes action a⋆_h at the red state s⋆_h at step h. Once it leaves state s⋆_h, the agent stays in the blue states and can never return to the red states. At the last layer, the agent receives reward 1 if it is at state s⋆_H and takes action a⋆_H; in all other cases, the reward is 0. When exploring from scratch, before seeing r_H(s⋆, a⋆) one only observes reward 0, so 0-initialized ϵ-greedy takes each action with probability 1/2. The probability of arriving at state s⋆_H under uniform actions is 1/2^H, which means that one needs at least 2^H samples in expectation to see r_H(s⋆, a⋆).
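For completeness, the expected-sample calculation behind this statement is a standard geometric waiting-time argument: the probability of observing the reward in a single episode under uniform actions over two choices per step is

    p = ∏_{h=1}^{H} Pr[a_h = a⋆_h] = 2^{−H},

and the expected number of episodes until the first success is 1/p = 2^H.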

A.5.3. UPPER BOUND OF JSRL


In this section, we restate Theorem 4.3 and its assumption in a formal way. First, we make an assumption on the quality of the guide-policy; this is the key assumption that improves exploration from exponential to polynomial sample complexity. One of the weakest assumptions in the offline learning theory literature is the single-policy concentrability coefficient (Rashidinejad et al., 2021; Xie et al., 2021)^1. Concretely, they assume that there exists a guide-policy π^g such that

    sup_{s,a,h} d_h^{π⋆}(s, a) / d_h^{π^g}(s, a) ≤ C.    (1)

This means that any state-action pair visited by the optimal policy must also be visited by the guide-policy with some probability. In our analysis, we impose a strictly weaker assumption: we only require that the guide-policy visits all good states in the feature space, rather than all good state-action pairs.

^1 The single-policy concentrability assumption is already a weaker version of the traditional concentrability coefficient assumption, which takes a supremum of the density ratio over all state-action pairs and all policies (Scherrer, 2014; Chen & Jiang, 2019; Jiang, 2019; Wang et al., 2019; Liao et al., 2020; Liu et al., 2019; Zhang et al., 2020a).
Assumption A.1 (Quality of the guide-policy π^g). Assume that the state is parametrized by some feature mapping ϕ : S → R^d such that for any policy π, Q^π(s, a) and π(s) depend on s only through ϕ(s). We assume that, in the feature space, the guide-policy π^g covers the states visited by the optimal policy:

    sup_{s,h} d_h^{π⋆}(ϕ(s)) / d_h^{π^g}(ϕ(s)) ≤ C.

Note that in the tabular case, when ϕ(s) = s, one can easily show that (1) implies Assumption A.1. In real robotic settings, the assumption implies that the guide-policy at least sees the features of the good states that the optimal policy also sees. However, the guide-policy can be arbitrarily bad in terms of choosing actions.
Before we proceed to the main theorem, we impose another assumption on the performance of the exploration step, which requires an exploration algorithm that performs well in the case H = 1 (a contextual bandit).
Assumption A.2 (Performance guarantee for ExplorationOracle_CB). In an (online) contextual bandit with stochastic context s ∼ p_0 and stochastic reward r(s, a) supported on [0, R], there exists some ExplorationOracle_CB which executes a policy π^t in each round t ∈ [T] such that the total regret is bounded:

    ∑_{t=1}^{T} E_{s∼p_0}[r(s, π⋆(s)) − r(s, π^t(s))] ≤ f(T, R).

This assumption usually comes for free, since it is satisfied by a rich contextual bandit literature covering tabular (Langford & Zhang, 2007), linear (Chu et al., 2011), general function approximation with finite actions (Simchi-Levi & Xu, 2020), and neural-network and continuous-action settings (Krishnamurthy et al., 2019), via either optimism-based methods (UCB, Thompson sampling, etc.) or non-optimism-based methods (ϵ-greedy, inverse gap weighting, etc.).
Now we are ready to present the algorithm and its guarantee. The JSRL algorithm is summarized in Algorithm 1. For the convenience of theoretical analysis, we make some simplifications: we only consider the curriculum case, replace the EvaluatePolicy step with a fixed number of iterations, and set TrainPolicy in Algorithm 1 as follows: at iteration h, fix the policies π_{h+1:H} unchanged and set π_h = ExplorationOracle_CB(D), where the reward for the contextual bandit is the cumulative reward ∑_{t=h}^{H} r_t. For concreteness, we show pseudocode for the algorithm below.

Algorithm 2 Jump-Start Reinforcement Learning for Episodic MDP with CB oracle

1: Input: guide-policy π^g, total time steps T, horizon length H
2: Initialize exploration policy π = π^g, online dataset D = ∅.
3: for iteration h = H − 1, H − 2, · · · , 0 do
4:   Execute ExplorationOracle_CB for ⌈T/H⌉ rounds, with the state-action-reward tuples for the contextual bandit derived as follows: at round t, first gather a trajectory {(s_l^t, a_l^t, s_{l+1}^t, r_l^t)}_{l∈[H−1]} by rolling out policy π, then take (s_h^t, a_h^t, ∑_{l=h}^{H} r_l^t) as the state-action-reward sample for the contextual bandit. Let π^t be the executed policy at round t.
5:   Set policy π_h = Unif({π^t}_{t=1}^{T}).
6: end for
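For intuition only, the backward pass of Algorithm 2 can be sketched in a few lines of Python; the cb_oracle and rollout interfaces are placeholders, and retraining the step-h policy on the accumulated dataset is a simplification of running a regret-optimal online bandit oracle.

def jsrl_with_cb_oracle(guide_policy, cb_oracle, rollout, horizon, total_rounds):
    """Schematic of Algorithm 2: train per-step policies backwards from step H-1 to 0.

    cb_oracle(dataset) returns a policy fit on (state, action, reward-to-go) tuples;
    rollout(policies) returns one trajectory [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})].
    """
    policies = [guide_policy] * horizon          # start from the guide-policy at every step
    rounds_per_stage = max(total_rounds // horizon, 1)
    for h in reversed(range(horizon)):
        dataset = []
        for _ in range(rounds_per_stage):
            traj = rollout(policies)             # guide acts before step h, learned policies after
            state, action = traj[h][0], traj[h][1]
            reward_to_go = sum(r for (_, _, r) in traj[h:])
            dataset.append((state, action, reward_to_go))
            policies[h] = cb_oracle(dataset)     # update the step-h policy with the CB oracle
    return policies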

Note that Algorithm 2 is a special case of Algorithm 1 in which the policies after the current step h are fixed. This coincides with the idea of Policy Search by Dynamic Programming (PSDP) (Bagnell et al., 2003). Notably, although PSDP is mainly motivated by policy learning while JSRL is motivated by efficient online exploration and fine-tuning, the following theorem follows largely the same argument as that in (Bagnell, 2004). For completeness, we provide the performance guarantee of the algorithm below.
Theorem A.3. Under Assumptions A.1 and A.2, JSRL in Algorithm 2 guarantees that after T rounds,

    E_{s_0∼p_0}[V_0^⋆(s_0) − V_0^π(s_0)] ≤ C · ∑_{h=0}^{H−1} f(T/H, H − h).

Theorem A.3 is quite general, and it depends on the choice of the exploration oracle. Below we give concrete results for
tabular RL and RL with function approximation.


Corollary A.4. In the tabular case, when we take ExplorationOracle_CB to be ϵ-greedy, the achieved rate is O(C H^{7/3} S^{1/3} A^{1/3} / T^{1/3}); when we take ExplorationOracle_CB to be FALCON+, the rate becomes O(C H^{5/2} S^{1/2} A / T^{1/2}). Here S can be relaxed to the maximum number of states that π^g visits at any step.

The result above implies a polynomial sample complexity when combined with non-optimism exploration techniques,
including ϵ-greedy (Langford & Zhang, 2007) and FALCON+ (Simchi-Levi & Xu, 2020). In contrast, they both suffer from
a curse of horizon without such a guide-policy.
Next, we move to RL with general function approximation.
Corollary A.5. For general function approximation, when we take ExplorationOracle_CB to be FALCON+, the rate becomes Õ(C ∑_{h=1}^{H} √(A · E_F(T/H))) under the following assumption.
Assumption A.6. Let π be an arbitrary policy. Given n training trajectories of the form {(s_h^j, a_h^j, s_{h+1}^j, r_h^j)}_{j∈[n], h∈[H]} drawn by following policy π in the given MDP, i.e., s_h^j ∼ d_h^π, a_h^j | s_h^j ∼ π_h(s_h^j), r_h^j | (s_h^j, a_h^j) ∼ R_h(s_h^j, a_h^j), s_{h+1}^j | (s_h^j, a_h^j) ∼ P_h(· | s_h^j, a_h^j), there exists an offline regression oracle which returns a family of predictors Q̂_h : S × A → R, h ∈ [H], such that for any h ∈ [H],

    E[(Q̂_h(s, a) − Q_h^π(s, a))^2] ≤ E_F(n).

As shown in (Simchi-Levi & Xu, 2020), this assumption on the offline regression oracle implies the regret bound in Assumption A.2. When E_F is a polynomial function, the above rate matches the worst-case lower bound for contextual bandits in (Simchi-Levi & Xu, 2020), up to a factor of C · poly(H).
The results above show that under Assumption A.1, one can achieve polynomial, and sometimes near-optimal (up to polynomial factors of H), sample complexity without applying Bellman updates, using only a contextual bandit oracle. In practice, we run a Q-learning-based exploration oracle, which may be more robust to violations of the assumptions. We leave the analysis of a Q-learning-based exploration oracle to future work.
Remark A.7. The result generalizes to, and is adaptive to, the case of a time-inhomogeneous C, i.e.,

    ∀h ∈ [H], sup_s d_h^{π⋆}(ϕ(s)) / d_h^{π^g}(ϕ(s)) ≤ C(h).

In this case the rate becomes ∑_{h=0}^{H−1} C(h) · f(T/H, H − h).

In our current analysis, we rely heavily on the visitation assumption and apply contextual-bandit-based exploration techniques. In our experiments, we instead run a Q-learning-based exploration algorithm, which also explores the subsequent states after we roll out the guide-policy. This also suggests why setting K > 1, and even random switching, in Algorithm 1 might achieve better performance than the case K = 1. We conjecture that with a Q-learning-based exploration algorithm, JSRL still works even when Assumption A.1 only holds partially. We leave the analysis of JSRL with a Q-learning-based exploration oracle for future work.

A.5.4. PROOF OF THEOREM A.3 AND COROLLARIES


Proof. The analysis follows the same line as (Bagnell, 2004); we include it here for completeness. By the performance difference lemma (Kakade & Langford, 2002), one has

    E_{s_0∼d_0}[V_0^⋆(s_0) − V_0^π(s_0)] = ∑_{h=0}^{H−1} E_{s∼d_h^⋆}[Q_h^π(s, π_h^⋆(s)) − Q_h^π(s, π_h(s))].    (2)

At iteration h, the algorithm executes a policy π with π_l = π_l^g for all l < h and the fixed, previously learned π_l for l > h; only π_h is updated during this iteration. Taking the reward to be ∑_{l=h}^{H} r_l, this is a contextual bandit problem with initial state distribution d_h^{π^g}, reward bounded in [0, H − h], and expected reward Q_h^π(s, a) for taking action a at state s.

Let π̂_h be the optimal policy for this contextual bandit problem. From Assumption A.2, we know that after T/H rounds at iteration h, one has

    ∑_{h=0}^{H−1} E_{s∼d_h^⋆}[Q_h^π(s, π_h^⋆(s)) − Q_h^π(s, π_h(s))]
        ≤ ∑_{h=0}^{H−1} E_{s∼d_h^⋆}[Q_h^π(s, π̂_h(s)) − Q_h^π(s, π_h(s))]                                  (i)
        = ∑_{h=0}^{H−1} E_{s∼d_h^⋆}[Q_h^π(ϕ(s), π̂_h(ϕ(s))) − Q_h^π(ϕ(s), π_h(ϕ(s)))]                      (ii)
        ≤ C · ∑_{h=0}^{H−1} E_{s∼d_h^{π^g}}[Q_h^π(ϕ(s), π̂_h(ϕ(s))) − Q_h^π(ϕ(s), π_h(ϕ(s)))]               (iii)
        ≤ C · ∑_{h=0}^{H−1} f(T/H, H − h).                                                                  (iv)

Here inequality (i) uses the fact that π̂_h is the optimal policy for the contextual bandit problem at step h. Equality (ii) uses the fact that Q and π depend on s only through ϕ(s). Inequality (iii) follows from Assumption A.1, and inequality (iv) from Assumption A.2. Combining this with Equation (2) gives the conclusion.

When ExplorationOracle_CB is ϵ-greedy, the rate in Assumption A.2 becomes f(T, R) = R · (SA/T)^{1/3} (Langford & Zhang, 2007), which gives the JSRL rate O(C H^{7/3} S^{1/3} A^{1/3} / T^{1/3}). When we take ExplorationOracle_CB to be FALCON+ in the tabular case, the rate in Assumption A.2 becomes f(T, R) = R · (SA^2/T)^{1/2} (Simchi-Levi & Xu, 2020), and the final JSRL rate becomes O(C H^{5/2} S^{1/2} A / T^{1/2}). When we take ExplorationOracle_CB to be FALCON+ with general function approximation under Assumption A.6, the rate in Assumption A.2 becomes f(T, R) = R · (A · E_F(T))^{1/2}, and the final JSRL rate becomes Õ(C ∑_{h=1}^{H} √(A · E_F(T/H))).
