Heuristic-Guided Reinforcement Learning (NeurIPS 2021)
Abstract
We provide a framework for accelerating reinforcement learning (RL) algorithms
by heuristics constructed from domain knowledge or offline data. Tabula rasa
RL algorithms require environment interactions or computation that scales with
the horizon of the sequential decision-making task. Using our framework, we
show how heuristic-guided RL induces a much shorter-horizon subproblem that
provably solves the original task. Our framework can be viewed as a horizon-based
regularization for controlling bias and variance in RL under a finite interaction
budget. On the theoretical side, we characterize properties of a good heuristic
and its impact on RL acceleration. In particular, we introduce the novel concept
of an improvable heuristic, a heuristic that allows an RL agent to extrapolate
beyond its prior knowledge. On the empirical side, we instantiate our framework
to accelerate several state-of-the-art algorithms in simulated robotic control tasks
and procedurally generated games. Our framework complements the rich literature
on warm-starting RL with expert demonstrations or exploratory datasets, and
introduces a principled method for injecting prior knowledge into RL.
1 Introduction
Many recent empirical successes of reinforcement learning (RL) require solving problems with very
long decision-making horizons. OpenAI Five [1] used episodes that were 20000 timesteps on average,
while AlphaStar [2] used roughly 5000 timesteps. Long-term credit assignment is a very challenging
statistical problem, with the sample complexity growing quadratically (or worse) with the horizon [3].
Long horizons (or, equivalently, large discount factors) also increase RL’s computational burden,
leading to slow optimization convergence [4]. This makes RL algorithms require prohibitively large
amounts of interactions and compute: even with tuned hyperparameters, AlphaStar needed over $10^8$
samples and OpenAI Five needed over $10^7$ PFLOPS of compute.
A popular approach to mitigate the statistical and computational issues of tabula rasa RL methods is to
warm-start or regularize learning with prior knowledge [1, 2, 5–10]. For instance, AlphaStar learned
a policy and value function from human demonstrations and regularized the RL agent using imitation
learning (IL). AWAC [9] warm-started a policy using batch policy optimization on exploratory
datasets. While these approaches have been effective in different domains, none of them explicitly
address RL’s complexity dependence on horizon.
In this paper, we propose a complementary regularization technique that relies on heuristic value
functions, or heuristics¹ for short, to effectively shorten the problem horizon faced by an online RL
agent for fast learning. We call this approach Heuristic-Guided Reinforcement Learning (HuRL).
The core idea is simple: given a Markov decision process (MDP) $M = (S, A, P, r, \gamma)$ and a
heuristic $h : S \to \mathbb{R}$, we select a mixing coefficient $\lambda \in [0, 1]$ and have the agent solve a new MDP
$\widetilde{M} = (S, A, P, \tilde{r}, \tilde{\gamma})$ with a reshaped reward and a smaller discount (i.e., a shorter horizon):
$$\tilde{r}(s, a) := r(s, a) + (1 - \lambda)\gamma\,\mathbb{E}_{s' \sim P(\cdot|s,a)}[h(s')] \quad \text{and} \quad \tilde{\gamma} := \lambda\gamma. \tag{1}$$
¹We borrow this terminology from the planning literature to refer to guesses of $V^*$ in an MDP [11].
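To make the reshaping in (1) concrete, here is a minimal Python sketch (our own illustration, not the released implementation) of how a single sampled transition $(s, a, r, s')$ from $M$ could be converted; the expectation over $s'$ is approximated by the sampled next state.

```python
def hurl_reshape(transition, h, lam, gamma):
    """Reshape one transition (s, a, r, s') as in Eq. (1).

    Returns the reshaped reward r~ and the reduced discount gamma~ = lam * gamma.
    E_{s' ~ P}[h(s')] is approximated by h evaluated at the sampled next state s'.
    """
    s, a, r, s_next = transition
    r_tilde = r + (1.0 - lam) * gamma * h(s_next)   # blend heuristic into the reward
    gamma_tilde = lam * gamma                       # shorten the effective horizon
    return r_tilde, gamma_tilde


# Example: a zero heuristic with lam = 0.5 halves the discount and leaves rewards unchanged.
r_tilde, gamma_tilde = hurl_reshape((0, 1, 1.0, 0), h=lambda s: 0.0, lam=0.5, gamma=0.99)
```

Setting `lam = 1` recovers the original transition unchanged, while `lam = 0` reduces each step to a one-step, contextual-bandit-like problem whose reward is $r + \gamma h(s')$.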
HuRL effectively introduces horizon-based regularization that determines whether long-term value
information should come from collected experiences or the heuristic. By modulating the effective
horizon via λ, we trade off the bias and the complexity of solving the reshaped MDP. HuRL with
λ = 1 recovers the original problem and with λ = 0 creates an easier contextual bandit problem [12].
A heuristic h in HuRL represents a prior guess of the desired long-term return of states, which
ideally is the optimal value function V ∗ of the unknown MDP M. When the heuristic h captures
the state ordering of V ∗ well, conceptually, it becomes possible to make good long-term decisions
by short-horizon planning or even acting greedily. How do we construct a good heuristic? In the
planning literature, this is typically achieved by solving a relaxation of the original problem [13–
15]. Alternatively, one can learn it from batch data collected by exploratory behavioral policies (as
in offline RL [16]) or from expert policies (as in IL [17]).² For some dense reward problems, a
zero heuristic can be effective in reducing RL complexity, as exploited by the guidance discount
framework [18–23]. In this paper, we view heuristics as a unified representation of various forms of
prior knowledge, such as expert demonstrations, exploratory datasets, and engineered guidance.
Although the use of heuristics to accelerate search has been popular in planning and control algorithms,
e.g., A* [24], MCTS [25], and MPC [7, 26–28], its theory is less developed for settings where the
MDP is unknown. The closest work in RL is potential-based reward shaping (PBRS) [29], which
reshapes the reward into $\bar{r}(s, a) = r(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[h(s')] - h(s)$ while keeping the original discount.
PBRS can use any heuristic to reshape the reward while preserving the ordering of policies. However,
giving PBRS rewards to an RL algorithm does not necessarily lead to faster learning, because the
base RL algorithm would still seek to explore to resolve long-term credit assignment. HuRL allows
common RL algorithms to leverage the short-horizon potential provided by a heuristic to learn faster.
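For contrast, a sketch of the PBRS transformation under the same sample-based approximation is below (again our own illustration); note that PBRS leaves the discount factor untouched, so the base learner still faces the original horizon.

```python
def pbrs_reshape(transition, h, gamma):
    """Potential-based reward shaping [29]: r_bar = r + gamma * h(s') - h(s).

    Only the reward is modified; the discount factor stays at gamma.
    """
    s, a, r, s_next = transition
    r_bar = r + gamma * h(s_next) - h(s)
    return r_bar, gamma
```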
In this work, we provide a theoretical foundation of HuRL to enable adopting heuristics and horizon
reduction for accelerating RL, combining advances from the PBRS and the guidance discount
literatures. On the theoretical side, we derive a bias-variance decomposition of HuRL’s horizon-based
regularization in order to characterize the solution quality as a function of λ and h. Using this insight,
we provide sufficient conditions for achieving an effective trade-off, including properties required of
a base RL algorithm that solves the reshaped MDP $\widetilde{M}_\lambda$. Furthermore, we define the novel concept of
an improvable heuristic and prove that good heuristics for HuRL can be constructed from data using
existing pessimistic offline RL algorithms (such as pessimistic value iteration [30, 31]).
The effectiveness of HuRL depends on the heuristic quality, so we design HuRL to employ a sequence
of mixing coefficients (i.e. λs) that increases as the agent gathers more data from the environment.
Such a strategy induces a learning curriculum that enables HuRL to remain robust to non-ideal
heuristics. HuRL starts off by guiding the agent’s search direction with a heuristic. As the agent
becomes more experienced, it gradually removes the guidance and lets the agent directly optimize
the true long-term return. We empirically validate HuRL in MuJoCo [32] robotics control problems
and Procgen games [33] with various heuristics and base RL algorithms. The experimental results
demonstrate the versatility and effectiveness of HuRL in accelerating RL algorithms.
2 Preliminaries
2.1 Notation
We focus on discounted infinite-horizon Markov Decision Processes (MDPs) for ease of exposition.
The technique proposed here can be extended to other MDP settings.³ A discounted infinite-horizon
MDP is denoted as a 5-tuple $M = (S, A, P, r, \gamma)$, where $S$ is the state space, $A$ is the action space,
$P(s'|s, a)$ is the transition dynamics, $r(s, a)$ is the reward function, and $\gamma \in [0, 1)$ is the discount
factor. Without loss of generality, we assume $r : S \times A \to [0, 1]$. We allow the state and action spaces
$S$ and $A$ to be either discrete or continuous. Let $\Delta(\cdot)$ denote the space of probability distributions. A
decision-making policy $\pi$ is a conditional distribution $\pi : S \to \Delta(A)$, which can be deterministic.
We define some shorthand for writing expectations: for a state distribution $d \in \Delta(S)$ and a function
$V : S \to \mathbb{R}$, we define $V(d) := \mathbb{E}_{s \sim d}[V(s)]$; similarly, for a policy $\pi$ and a function $Q : S \times A \to \mathbb{R}$,
we define $Q(s, \pi) := \mathbb{E}_{a \sim \pi(\cdot|s)}[Q(s, a)]$. Lastly, we define $\mathbb{E}_{s'|s,a} := \mathbb{E}_{s' \sim P(\cdot|s,a)}$.
²We consider the RL setting for imitation where we suppose the rewards of expert trajectories are available.
³The results here can be readily applied to finite-horizon MDPs; for other infinite-horizon MDPs, we need further assumptions (e.g., mixing) for the limits to exist.
Central to solving MDPs are the concepts of value functions and average state distributions. For a policy
$\pi$, we define its state value function $V^\pi$ as $V^\pi(s) := \mathbb{E}_{\rho^\pi_s}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $\rho^\pi_s$ denotes
the trajectory distribution of $s_0, a_0, s_1, \dots$ induced by running $\pi$ starting from $s_0 = s$. We define
the state-action value function (or the Q-function) as $Q^\pi(s, a) := r(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[V^\pi(s')]$. We
denote the optimal policy as $\pi^*$ and its state value function as $V^* := V^{\pi^*}$. Under the assumption
that rewards are in $[0, 1]$, we have $V^\pi(s), Q^\pi(s, a) \in [0, \frac{1}{1-\gamma}]$ for all $\pi$, $s \in S$, and $a \in A$. We
denote the initial state distribution of interest as $d_0 \in \Delta(S)$ and the state distribution of policy $\pi$
at time $t$ as $d^\pi_t$, with $d^\pi_0 = d_0$. Given $d_0$, we define the average state distribution of a policy $\pi$ as
$d^\pi := (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t d^\pi_t$. With a slight abuse of notation, we also write $d^\pi(s, a) := d^\pi(s)\pi(a|s)$.
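For a discrete state space, both $V^\pi(d_0)$ and $d^\pi$ can be estimated from rollouts; the sketch below assumes a gym-style environment interface (`reset`/`step`), which is our assumption for illustration purposes only.

```python
from collections import defaultdict

def rollout_estimates(env, pi, gamma, num_episodes=100, horizon=1000):
    """Monte-Carlo estimates of V^pi(d0) and the average state distribution d^pi.

    Assumes a gym-style env with reset() -> s and step(a) -> (s, r, done, info),
    and hashable (discrete) states.  d^pi weights the state at time t by (1-gamma)*gamma^t.
    """
    returns, d_pi = [], defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            d_pi[s] += (1.0 - gamma) * discount / num_episodes
            a = pi(s)
            s, r, done, _ = env.step(a)
            ret += discount * r
            discount *= gamma
            if done:
                break
        returns.append(ret)
    return sum(returns) / num_episodes, dict(d_pi)
```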
3.1 Algorithm
HuRL takes a reduction-based approach to realize the idea of heuristic guidance. As summarized in
Algorithm 1, HuRL takes a heuristic $h : S \to \mathbb{R}$ and a base RL algorithm $\mathcal{L}$ as input, and outputs
an approximately optimal policy for the original MDP $M$. During training, HuRL iteratively runs
the base algorithm $\mathcal{L}$ to collect data from the MDP $M$ and then uses the heuristic $h$ to modify the
agent's collected experiences. Namely, in iteration $n$, the agent interacts with the original MDP $M$
and saves the raw transition tuples⁴ $\mathcal{D}_n = \{(s, a, r, s')\}$ (line 2). HuRL then defines a reshaped MDP
$\widetilde{M}_n := (S, A, P, \tilde{r}_n, \tilde{\gamma}_n)$ (line 3) by changing the rewards and lowering the discount factor:
$$\tilde{r}_n(s, a) := r(s, a) + (1 - \lambda_n)\gamma\,\mathbb{E}_{s'|s,a}[h(s')] \quad \text{and} \quad \tilde{\gamma}_n := \lambda_n\gamma, \tag{2}$$
⁴If $\mathcal{L}$ learns only with trajectories, we transform each tuple and assemble them to get the modified trajectory.
Algorithm 1 Heuristic-Guided Reinforcement Learning (HuRL)
Require: MDP $M = (S, A, P, r, \gamma)$, RL algorithm $\mathcal{L}$, heuristic $h$, mixing coefficients $\{\lambda_n\}$.
1: for $n = 1, \dots, N$ do
2:   $\mathcal{D}_n \leftarrow \mathcal{L}.\text{CollectData}(M)$
3:   Get $\lambda_n$ from $\{\lambda_n\}$ and construct $\widetilde{M}_n = (S, A, P, \tilde{r}_n, \tilde{\gamma}_n)$ according to (2) using $h$ and $\lambda_n$
4:   $\pi_n \leftarrow \mathcal{L}.\text{Train}(\mathcal{D}_n, \widetilde{M}_n)$
5: end for
6: return $\pi_N$
where $\lambda_n \in [0, 1]$ is the mixing coefficient. The new discount $\tilde{\gamma}_n$ effectively gives $\widetilde{M}_n$ a shorter
horizon than $M$'s, while the heuristic $h$ is blended into the new reward in (2) to account for the
missing long-term information. We call $\tilde{\gamma}_n = \lambda_n\gamma$ in (2) the guidance discount, to be consistent
with prior literature [20], which can be viewed in terms of our framework as using a zero heuristic.
In the last step (line 4), HuRL calls the base algorithm $\mathcal{L}$ to perform updates with respect to the
reshaped MDP $\widetilde{M}_n$. This is realized by 1) setting the discount factor used in $\mathcal{L}$ to $\tilde{\gamma}_n$, and 2) setting
the sampled reward to $r + (\gamma - \tilde{\gamma}_n)h(s')$ for every transition tuple $(s, a, r, s')$ collected from $M$. We
remark that the base algorithm $\mathcal{L}$ in line 2 always collects trajectories of lengths proportional to the
original discount $\gamma$, while internally the optimization is done with a lower discount $\tilde{\gamma}_n$ in line 4.
Over the course of training, HuRL repeats the above steps with a sequence of increasing mixing
coefficients $\{\lambda_n\}$. From (2) we see that as the agent interacts with the environment, the effects of the
heuristic in MDP reshaping decrease and the effective horizon of the reshaped MDP increases.
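A schematic rendering of Algorithm 1 in Python is shown below; the base-learner interface (`collect_data`, `train`) is hypothetical and stands in for whatever API a concrete algorithm such as SAC or PPO exposes.

```python
def hurl(env, learner, h, lambdas, gamma):
    """Heuristic-Guided RL (Algorithm 1), as a thin wrapper around a base learner.

    learner.collect_data(env)      -> list of (s, a, r, s') tuples from the original MDP M
    learner.train(data, discount)  -> updated policy, trained with the given discount
    """
    policy = None
    for lam in lambdas:                      # increasing mixing coefficients {lambda_n}
        data = learner.collect_data(env)     # trajectories are rolled out in M itself
        gamma_tilde = lam * gamma            # guidance discount of the reshaped MDP
        reshaped = [
            (s, a, r + (gamma - gamma_tilde) * h(s_next), s_next)   # sampled version of Eq. (2)
            for (s, a, r, s_next) in data
        ]
        policy = learner.train(reshaped, discount=gamma_tilde)
    return policy
```

Any off-the-shelf learner that accepts a per-run discount factor can be plugged in this way, which is what makes HuRL a reduction rather than a new algorithm.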
Figure 1: Example of HuRL in a chain MDP. (a) Heatmap of different values. (b) Different policy behaviors. (c) HuRL with different $h$ and $\lambda$. Each cell in a row in each diagram represents a state from
$S = \{1, \dots, 10\}$. The agent starts at state 3 ($s_0$), and states 1 and 10 are absorbing (Abs in subfigure a-(1)).
Actions $A = \{\leftarrow, \rightarrow\}$ move the agent left or right in the chain unless the agent is in an absorbing state. Subfig.
a-(1) shows the reward function: $r(2, \leftarrow) = 0.1$, $r(4, \rightarrow) = -0.2$, $r(5, \rightarrow) = 0.1$, and all state-action pairs
not shown in a-(1) yield $r = 0$. Subfig. a-(2) shows $V^*$ for $\gamma = 0.9$. Subfig. a-(3) shows a good heuristic $h$:
$V$ of a random policy. Subfig. a-(4) shows a bad heuristic $h$: $V$ of a myopic policy. Subfig. b-(1): $\pi^*$ for $V^*$ from a-(2).
Subfig. b-(2): $\tilde{\pi}^*$ from HuRL with $h = 0$, $\lambda = 0.5$. Subfig. b-(3): $\tilde{\pi}^*$ from HuRL with the good $h$ from a-(3)
and $\lambda = 0.5$. Subfig. b-(4): $\tilde{\pi}^*$ from the bad $h$ from a-(4) and $\lambda = 0.5$. Subfig. b-(5): $\tilde{\pi}^*$ from the bad $h$ and
$\lambda = 1$. Subfig. (c) illustrates the takeaway message: using HuRL with a good $h$ can find $\pi^*$ from $s_0$ even
with a small $\lambda$ (see the x-axis), while HuRL with a bad $h$ requires a much higher $\lambda$ to discover $\pi^*$.
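The chain MDP of Figure 1 is small enough to reconstruct exactly from the caption; the sketch below (our own reconstruction, not part of the released code) builds it and solves the HuRL-reshaped MDP by value iteration for a given heuristic $h$ and mixing coefficient $\lambda$.

```python
import numpy as np

def chain_mdp():
    """Chain MDP of Figure 1: 10 states, states 1 and 10 absorbing, actions {left, right}."""
    n, gamma = 10, 0.9
    P = np.zeros((n, 2, n))                       # P[s, a, s']
    R = np.zeros((n, 2))                          # R[s, a]
    for s in range(n):
        if s in (0, n - 1):                       # absorbing states 1 and 10 (0-indexed)
            P[s, :, s] = 1.0
        else:
            P[s, 0, s - 1] = 1.0                  # action 0 = left
            P[s, 1, s + 1] = 1.0                  # action 1 = right
    R[1, 0], R[3, 1], R[4, 1] = 0.1, -0.2, 0.1    # r(2,<-)=0.1, r(4,->)=-0.2, r(5,->)=0.1
    return P, R, gamma

def solve_reshaped(P, R, gamma, h, lam, iters=1000):
    """Value iteration on the HuRL-reshaped MDP of Eq. (1); returns V~* and a greedy policy."""
    R_tilde = R + (1.0 - lam) * gamma * (P @ h)   # blend E_{s'}[h(s')] into the reward
    gamma_tilde = lam * gamma                     # guidance discount
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(R_tilde + gamma_tilde * (P @ V), axis=1)
    return V, np.argmax(R_tilde + gamma_tilde * (P @ V), axis=1)

P, R, gamma = chain_mdp()
V_orig, pi_orig = solve_reshaped(P, R, gamma, h=np.zeros(10), lam=1.0)   # lam=1: original MDP
```

Sweeping `lam` with different heuristics reproduces the qualitative behavior summarized in subfigures (b) and (c).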
4 Theoretical Analysis
When can HuRL accelerate learning? Similar to typical regularization techniques, the horizon-based
regularization of HuRL leads to a bias-variance decomposition that can be optimized for better
finite-sample performance compared to directly solving the original MDP. However, a non-trivial
trade-off is possible only when the regularization can bias the learning toward a good direction. In
HuRL’s case this is determined by the heuristic, which resembles a prior we encode into learning.
In this section we provide HuRL’s theoretical foundation. We first describe the bias-variance trade-off
induced by HuRL. Then we show how suboptimality in solving the reshaped MDP translates into
performance in the original MDP, and identify the assumptions HuRL needs the base RL algorithm to
satisfy. In addition, we explain how HuRL relates to PBRS, and characterize the quality of heuristics
and sufficient conditions for constructing good heuristics from batch data using offline RL.
For clarity, we will focus on the reshaped MDP $\widetilde{M} = (S, A, P, \tilde{r}, \tilde{\gamma})$ for a fixed $\lambda \in [0, 1]$, where
$\tilde{r}, \tilde{\gamma}$ are defined in (1). We can view this MDP as the one in a single iteration of HuRL. For a policy $\pi$,
we denote its state value function in $\widetilde{M}$ as $\widetilde{V}^\pi$, and the optimal policy and value function of $\widetilde{M}$ as $\tilde{\pi}^*$
and $\widetilde{V}^*$, respectively. The missing proofs of the results from this section can be found in Appendix A.
The theorem shows that the suboptimality of a policy $\pi$ in the original MDP $M$ can be decomposed into
1) a bias term due to solving a reshaped MDP $\widetilde{M}$ instead of the original MDP $M$, and 2) a regret
term (i.e., the learning variance) due to $\pi$ being suboptimal in the reshaped MDP $\widetilde{M}$. Moreover, it
shows that heuristics are equivalent up to constant offsets. In other words, only the relative ordering
between states that a heuristic induces matters in learning, not the absolute values.
Balancing the two terms trades off bias and variance in learning. Using a smaller $\lambda$ replaces the
long-term information with the heuristic and makes the horizon of the reshaped MDP $\widetilde{M}$ shorter.
Therefore, given a finite interaction budget, the regret term in (3) can be more easily minimized,
though the bias term in (4) can potentially be large if the heuristic is bad. On the contrary, with $\lambda = 1$,
the bias is completely removed, as the agent solves the original MDP $M$ directly.
where $\rho^\pi(d_0)$ denotes the trajectory distribution of running $\pi$ from $d_0$, and $\widetilde{A}^*(s, a) := \tilde{r}(s, a) + \tilde{\gamma}\,\mathbb{E}_{s'|s,a}[\widetilde{V}^*(s')] - \widetilde{V}^*(s) \leq 0$ is the action gap with respect to the optimal policy $\tilde{\pi}^*$ of $\widetilde{M}$.
Another way to comprehend the regret term is through studying its dependency on $\lambda$. When $\lambda = 1$,
$\mathrm{Regret}(h, 1, \pi) = V^*(d_0) - V^\pi(d_0)$, which is identical to the policy regret in $M$ for a fixed initial
distribution $d_0$. On the other hand, when $\lambda = 0$, $\mathrm{Regret}(h, 0, \pi) = \max_{\pi'} \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^\pi}[\tilde{r}(s, \pi') - \tilde{r}(s, \pi)]$,
which is the regret of a non-stationary contextual bandit problem where the context distribution is $d^\pi$
(the average state distribution of $\pi$). In general, for $\lambda \in (0, 1)$, the regret notion mixes a short-horizon
non-stationary problem and a long-horizon stationary problem.
One natural question is whether the reshaped MDP $\widetilde{M}$ has a more complicated and larger value
landscape than the original MDP $M$, because these characteristics may affect the regret rate of a base
algorithm. We show that $\widetilde{M}$ preserves the value bounds and linearity of the original MDP.
Proposition 4.2. Reshaping the MDP as in (1) preserves the following characteristics: 1) If
$h(s) \in [0, \frac{1}{1-\gamma}]$, then $\widetilde{V}^\pi(s) \in [0, \frac{1}{1-\gamma}]$ for all $\pi$ and $s \in S$. 2) If $M$ is a linear MDP with
feature vector $\phi(s, a)$ (i.e., $r(s, a)$ and $\mathbb{E}_{s'|s,a}[g(s')]$ for any $g$ can be linearly parametrized in
$\phi(s, a)$), then $\widetilde{M}$ is also a linear MDP with feature vector $\phi(s, a)$.
On the contrary, the MDP $M_\lambda := (S, A, P, r, \lambda\gamma)$ in Section 2.2 does not have these properties.
We can show that $M_\lambda$ is equivalent to $\widetilde{M}$ up to a PBRS transformation (i.e., $\bar{r}(s, a) = \tilde{r}(s, a) + \tilde{\gamma}\,\mathbb{E}_{s'|s,a}[h(s')] - h(s)$). Thus, HuRL incorporates the guidance discount into PBRS while retaining these nicer properties.
To better understand how the heuristic $h$ affects the bias, we derive an upper bound on the bias by
replacing the first term in (4) with an upper bound that depends only on $\pi^*$.
Proposition 4.3. For $g : S \to \mathbb{R}$ and $\eta \in [0, 1]$, define $C(\pi, g, \eta) := \mathbb{E}_{\rho^\pi(d_0)}\left[\sum_{t=1}^{\infty} \eta^{t-1} g(s_t)\right]$.
Then $\mathrm{Bias}(h, \lambda, \pi) \leq (1 - \lambda)\gamma\left(C(\pi^*, V^* - h, \lambda\gamma) + C(\pi, h - \widetilde{V}^*, \gamma)\right)$.
In Proposition 4.3, the term $(1 - \lambda)\gamma\,C(\pi^*, V^* - h, \lambda\gamma)$ is the underestimation error of the heuristic
$h$ under the states visited by the optimal policy $\pi^*$ in the original MDP $M$. Therefore, to minimize
the first term in the bias, we would want the heuristic $h$ to be large along the paths that $\pi^*$ generates.
However, Proposition 4.3 also discourages the heuristic from being arbitrarily large, because the
second term in the bias in (4) (or, equivalently, the second term in Proposition 4.3) incentivizes
the heuristic to underestimate the optimal value of the reshaped MDP, $\widetilde{V}^*$. More precisely, the
second term requires the heuristic to obey some form of spatial consistency. A quick intuition is
the observation that if $h(s) = V^{\pi'}(s)$ for some policy $\pi'$ or $h(s) = 0$, then $h(s) \leq \widetilde{V}^*(s)$ for all $s \in S$.
More generally, we show that if the heuristic is improvable with respect to the original MDP $M$
(i.e., the heuristic value is no greater than the max of its Bellman backup), then $h(s) \leq \widetilde{V}^*(s)$. By
Proposition 4.3, learning with an improvable heuristic in HuRL has a much smaller bias.
Definition 4.1. Define the Bellman operator $(\mathcal{B}h)(s, a) := r(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[h(s')]$. A heuristic
function $h : S \to \mathbb{R}$ is said to be improvable with respect to an MDP $M$ if $\max_a (\mathcal{B}h)(s, a) \geq h(s)$ for all $s \in S$.
Proposition 4.4. If $h$ is improvable with respect to $M$, then $\widetilde{V}^*(s) \geq h(s)$ for all $s \in S$ and all $\lambda \in [0, 1]$.
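In a tabular MDP, improvability can be checked with a single Bellman backup; a minimal sketch follows (the toy two-state MDP in the usage example is ours, chosen only for illustration).

```python
import numpy as np

def is_improvable(P, R, gamma, h, tol=1e-8):
    """Check Definition 4.1: h is improvable w.r.t. M if max_a (Bh)(s, a) >= h(s) for all s.

    P has shape (S, A, S), R has shape (S, A), h has shape (S,).
    """
    bellman_backup = R + gamma * (P @ h)          # (Bh)(s, a) for every state-action pair
    return bool(np.all(bellman_backup.max(axis=1) >= h - tol))

# Both h = 0 and h = V^pi of any policy pi are improvable; an overly optimistic h typically is not.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],           # toy 2-state, 2-action MDP
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0], [1.0, 0.0]])
print(is_improvable(P, R, gamma=0.9, h=np.zeros(2)))     # True: the zero heuristic
print(is_improvable(P, R, gamma=0.9, h=np.full(2, 50.)))  # False: an overestimating heuristic
```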
5 Experiments
We validate our framework HuRL experimentally in MuJoCo (commercial license) [32] robotics
control problems and Procgen games (MIT License) [33], where soft actor critic (SAC) [35] and
proximal policy optimization (PPO) [36] were used as the base RL algorithms, respectively.⁵ The
goal is to study whether HuRL can accelerate learning by shortening the horizon with heuristics. In
particular, we conduct studies to investigate the effects of different heuristics and mixing coefficients.
Since the main focus here is on the possibility of leveraging a given heuristic to accelerate RL
algorithms, in these experiments we used vanilla techniques to construct heuristics for HuRL.
Experimentally studying the design of heuristics for a domain or a batch of data is beyond the scope
of the current paper but is an important future research direction. Due to space limitations, here we
report only the results of the MuJoCo experiments. The results on the Procgen games, along with other
experimental details, can be found in Appendix C.
⁵Code to replicate all experiments is available at https://github.com/microsoft/HuRL.
5.1 Setup
We consider four MuJoCo environments with dense rewards (Hopper-v2, HalfCheetah-v2, Humanoid-
v2, and Swimmer-v2) and a sparse reward version of Reacher-v2 (denoted as Sparse-Reacher-v2).⁶
We design the experiments to simulate two learning scenarios. First, we use Sparse-Reacher-v2 to
simulate the setting where an engineered heuristic based on domain knowledge is available; since this
is a goal-reaching task, we designed a heuristic $h(s) = r(s, a) - 100\,\|e(s) - g(s)\|$, where $e(s)$ and
$g(s)$ denote the robot's end-effector position and the goal position, respectively. Second, we use the
dense reward environments to model scenarios where a batch of data collected by multiple behavioral
policies is available before learning, and a heuristic is constructed by an offline policy evaluation
algorithm from the batch data (see Appendix C.1 for details). In brief, we generated these behavioral
policies by running SAC from scratch and saved the intermediate policies produced during training. We
then used least-squares regression to fit a neural network that predicts the empirical Monte-Carlo returns
of the trajectories in the sampled batch of data. We also used behavior cloning (BC) to warm-start all RL
agents with the same batch dataset in the dense reward experiments.
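A rough sketch of this Monte-Carlo heuristic construction is given below, assuming the offline batch is a list of trajectories of (state features, reward) pairs; the particular regressor (a small MLP from scikit-learn) is our choice for illustration, not necessarily the one used in the experiments.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_mc_heuristic(trajectories, gamma):
    """Fit h(s) ~ empirical discounted Monte-Carlo return-to-go via least-squares regression.

    trajectories: list of [(s_0, r_0), (s_1, r_1), ...] with s_t a 1-D feature vector.
    """
    states, targets = [], []
    for traj in trajectories:
        ret = 0.0
        for s, r in reversed(traj):          # return-to-go, computed backwards in time
            ret = r + gamma * ret
            states.append(s)
            targets.append(ret)
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    model.fit(np.array(states), np.array(targets))
    return lambda s: float(model.predict(np.asarray(s).reshape(1, -1))[0])
```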
The base RL algorithm here, SAC, is based on the standard implementation in Garage (MIT Li-
cense) [37]. The policy and value networks are fully connected independent neural networks. The
policy is Tanh-Gaussian and the value network has a linear head.
Algorithms. We compare the performance of the following algorithms: 1) BC; 2) SAC; 3) SAC
with BC warm start (SAC w/ BC); 4) HuRL with the engineered heuristic (HuRL); 5) HuRL with a
zero heuristic and BC warm start (HuRL-zero); 6) HuRL with the Monte-Carlo heuristic and BC warm
start (HuRL-MC); 7) SAC with the PBRS reward (and BC warm start, if applicable) (PBRS). For the
HuRL algorithms, the mixing coefficient was scheduled as $\lambda_n = \lambda_0 + (1 - \lambda_0)\,c_\omega \tanh(\omega(n - 1))$,
for $n = 1, \dots, N$, where $\lambda_0 \in [0, 1]$, $\omega > 0$ controls the increasing rate, and $c_\omega$ is a normalization
constant such that $\lambda_\infty = 1$ and $\lambda_n \in [0, 1]$. We chose these algorithms to study the effect of each
additional warm-start component (BC and heuristics) added on top of vanilla SAC. HuRL-zero is
SAC w/ BC with the extra $\lambda$ schedule above that further lowers the discount, whereas SAC and
SAC w/ BC keep a constant discount factor.
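The schedule above in code form (a direct transcription of the stated formula; $c_\omega$ is left as a parameter, with $c_\omega = 1$ already giving $\lambda_\infty = 1$ since $\tanh \to 1$):

```python
import numpy as np

def lambda_schedule(n, lam0, omega, c_omega=1.0):
    """Mixing coefficient lambda_n = lam0 + (1 - lam0) * c_omega * tanh(omega * (n - 1)).

    With c_omega = 1, lambda_1 = lam0 and lambda_n -> 1 as n grows, staying in [0, 1].
    """
    return lam0 + (1.0 - lam0) * c_omega * np.tanh(omega * (n - 1))

# Example schedule (hypothetical hyperparameters): starts at 0.9 and slowly anneals toward 1.
lambdas = [lambda_schedule(n, lam0=0.9, omega=0.01) for n in range(1, 1001)]
```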
Evaluation and Hyperparameters. In each iteration, the RL agent has a fixed sample budget for
environment interactions, and its performance is measured in terms of undiscounted cumulative
returns of the deterministic mean policy extracted from SAC. The hyperparameters used in the
algorithms above were selected as follows. First, the learning rates and the discount factor of the
base RL algorithm, SAC, were tuned for each environment. The tuned discount factor was used as
the discount factor $\gamma$ of the original MDP $M$. Fixing the hyperparameters above, we additionally
tuned $\lambda_0$ and $\omega$ for the $\lambda$ schedule of HuRL for each environment and each heuristic. Finally, after all
these hyperparameters were fixed, we conducted additional testing runs with 30 different random
seeds and report their statistics here. Sources of randomness included the data collection process of
the behavioral policies, training the heuristics from batch data, BC, and online RL; the behavioral
policies themselves were fixed across all testing runs. We chose this hyperparameter tuning procedure
to ensure that the baselines (i.e., SAC) compared in these experiments are their best versions.
(Figure: experimental results on (a) Sparse-Reacher-v2, (b) Humanoid-v2, and (c) Hopper-v2.)
6 Related Work
Discount regularization. The horizon-truncation idea can be traced back to Blackwell optimality
in the known MDP setting [18]. Reducing the discount factor amounts to running HuRL with a zero
heuristic. Petrik and Scherrer [19] and Jiang et al. [20, 21] study the MDP setting; Chen et al. [22] study
POMDPs. Amit et al. [23] focus on discount regularization for Temporal Difference (TD) methods,
while Van Seijen et al. [6] use a logarithmic mapping to lower the discount for online RL.
Reward shaping. Reward shaping has a long history in RL, from the seminal PBRS work [29]
to recent bilevel-optimization approaches [38]. Tessler and Mannor [5] consider a complementary
problem to HuRL: given a discount $\gamma'$, they find a reward $r'$ that preserves trajectory ordering in
the original MDP. Meanwhile, there is a vast literature on bias-variance trade-offs for online RL with
horizon truncation. TD($\lambda$) [39, 40] and Generalized Advantage Estimation [41] blend value estimates
across discount factors, while Sherstan et al. [42] use the discount factor as an input to the value
function estimator. TD(∆) [43] computes differences between value functions across discount factors.
Heuristics in model-based methods. Classic uses of heuristics include A* [24], Monte-Carlo Tree
Search (MCTS) [25], and Model Predictive Control (MPC) [44]. Zhong et al. [26] shorten the horizon
in MPC using a value function approximator. Hoeller et al. [27] additionally use an estimate for
the running cost to trade off solution quality and amount of computation. Bejjani et al. [28] show
heuristic-accelerated truncated-horizon MPC on actual robots and tune the value function throughout
learning. Bhardwaj et al. [7] augment MPC with a terminal value heuristic, which can be viewed as an
instance of HuRL where the base algorithm is MPC. Asai and Muise [45] learn an MDP expressible
in the STRIPS formalism that can benefit from relaxation-based planning heuristics. But HuRL is
more general, as it does not assume model knowledge and can work in unknown environments.
Pessimistic extrapolation. Offline RL techniques employ pessimistic extrapolation for robust-
ness [30], and their learned value functions can be used as heuristics in HuRL. Kumar et al. [46]
penalize out-of-distribution actions in off-policy optimization while Liu et al. [31] additionally use
a variational auto-encoder (VAE) to detect out-of-distribution states. We experimented with VAE-
filtered pessimistic heuristics in Appendix C. Even pessimistic offline evaluation techniques [16] can
be useful in HuRL, since function approximation often induces extrapolation errors [47].
Heuristic pessimism vs. admissibility. Our concept of heuristic pessimism is easily confused
with the well-established notion of admissibility [48], but in fact they are opposites. Namely, an
admissible heuristic never underestimates $V^*$ (in the return-maximization setting), while a pessimistic
one never overestimates $V^*$. Similarly, our notion of improvability is distinct from consistency: they
express related ideas, but with regard to pessimistic and admissible value functions, respectively.
Thus, counter-intuitively from the planning perspective, our work shows that for policy learning,
inadmissible heuristics are desirable. Pearl [49] is one of the few works that analyze desirable
implications of heuristic inadmissibility in planning.
Other warm-starting techniques. HuRL is a new way to warm-start online RL methods. Bianchi
et al. [50] use a heuristic policy to initialize agents' policies. Vinyals et al. [2] and Hester et al. [10] train a
value function and policy using batch IL and then use them as regularization in online RL. Nair et al.
[9] use off-policy RL on batch data and fine-tune the learned policy. Recent approaches to hybrid
IL-RL have strong connections to HuRL [17, 51, 52]. In particular, Cheng et al. [17] is a special
case of HuRL with a max-aggregation heuristic. Farahmand et al. [8] use several related tasks to
learn a task-dependent heuristic and perform shorter-horizon planning or RL. Knowledge distillation
approaches [53] can also be used to warm-start learning, but in contrast to them, HuRL expects prior
knowledge in the form of state value estimates, not features, and does not attempt to make the agent
internalize this knowledge. A HuRL agent learns from its own environment interactions, using prior
knowledge only as guidance. Reverse curriculum approaches [54] create short-horizon RL problems
by initializing the agent close to the goal and moving it farther away as the agent improves. This
gradual increase in the horizon inspired the HuRL approach. However, HuRL does not require the
agent to be initialized on expert states and can work with many different base RL algorithms.
References
[1] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy
Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large
scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[2] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun-
young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster
level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354,
2019.
[3] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforce-
ment learning. In NIPS, 2015.
[4] Aaron Sidford, Mengdi Wang, Xian Wu, Lin F Yang, and Yinyu Ye. Near-optimal time and
sample complexities for solving Markov decision processes with a generative model. In NeurIPS,
2018.
[5] Chen Tessler and Shie Mannor. Maximizing the total reward via reward tweaking. arXiv
preprint arXiv:2002.03327, 2020.
[6] Harm Van Seijen, Mehdi Fatemi, and Arash Tavakoli. Using a logarithmic mapping to enable
lower discount factors in reinforcement learning. In NeurIPS, 2019.
[7] Mohak Bhardwaj, Sanjiban Choudhury, and Byron Boots. Blending MPC & value function
approximation for efficient reinforcement learning. In ICLR, 2021.
[8] Amir-massoud Farahmand, Daniel N Nikovski, Yuji Igarashi, and Hiroki Konaka. Truncated
approximate dynamic programming with task-dependent terminal value. In AAAI, 2016.
[9] Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online rein-
forcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
[10] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan,
John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In
AAAI, 2018.
[11] Mausam and Andrey Kolobov. Planning with Markov decision processes: An AI perspective.
Synthesis Lectures on Artificial Intelligence and Machine Learning, 6:1–210, 2012.
[12] Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits
with regression oracles. In ICML, 2020.
[13] J. Hoffmann and B. Nebel. The FF planning system: Fast plan generation through heuristic
search. Journal of Artificial Intelligence Research, 14:253–302, 2001.
[14] S. Richter and M. Westphal. The LAMA planner: Guiding cost-based anytime planning with
landmarks. Journal of Artificial Intelligence Research, 39:127–177, 2010.
[15] Andrey Kolobov, Mausam, and Daniel S. Weld. Classical planning in MDP heuristics: With a
little help from generalization. In AAAI, 2010.
[16] Caglar Gulcehre, Sergio Gómez Colmenarejo, Ziyu Wang, Jakub Sygnowski, Thomas Paine,
Konrad Zolna, Yutian Chen, Matthew Hoffman, Razvan Pascanu, and Nando de Freitas. Regu-
larized behavior value estimation. arXiv preprint arXiv:2103.09575, 2021.
[17] Ching-An Cheng, Andrey Kolobov, and Alekh Agarwal. Policy improvement via imitation of
multiple oracles. In NeurIPS, 2020.
[18] David Blackwell. Discrete dynamic programming. The Annals of Mathematical Statistics,
pages 719–726, 1962.
[19] Marek Petrik and Bruno Scherrer. Biasing approximate dynamic programming with a lower
discount factor. In NIPS, 2008.
[20] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective
planning horizon on model accuracy. In AAMAS, 2015.
[21] Nan Jiang, Satinder Singh, and Ambuj Tewari. On structural properties of MDPs that bound loss
due to shallow planning. In IJCAI, 2016.
[22] Yi-Chun Chen, Mykel J Kochenderfer, and Matthijs TJ Spaan. Improving offline value-function
approximations for POMDPs by reducing discount factors. In IROS, 2018.
[23] Ron Amit, Ron Meir, and Kamil Ciosek. Discount factor as a regularizer in reinforcement
learning. In ICML, 2020.
[24] Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination
of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107,
1968.
[25] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling,
Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton.
A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence
and AI in games, 4(1):1–43, 2012.
[26] Mingyuan Zhong, Mikala Johnson, Yuval Tassa, Tom Erez, and Emanuel Todorov. Value
function approximation and model predictive control. In IEEE International Symposium on
Adaptive Dynamic Programming and Reinforcement Learning, pages 100–107, 2013.
[27] David Hoeller, Farbod Farshidian, and Marco Hutter. Deep value model predictive control. In
CoRL, 2020.
[28] Wissam Bejjani, Rafael Papallas, Matteo Leonetti, and Mehmet R Dogar. Planning with a
receding horizon for manipulation in clutter using a learned value function. In Humanoids,
pages 1–9, 2018.
[29] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transforma-
tions: Theory and application to reward shaping. In ICML, 1999.
[30] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL?
arXiv preprint arXiv:2012.15085, 2020.
[31] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch
off-policy reinforcement learning without great exploration. In NeurIPS, 2020.
[32] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based
control. In IROS, 2012.
[33] Sharada Mohanty, Jyotish Poonganam, Adrien Gaidon, Andrey Kolobov, Blake Wulfe, Dipam
Chakraborty, Gražvydas Šemetulskis, João Schapke, Jonas Kubilius, Jurgis Pašukonis, Linas Kli-
mas, Matthew Hausknecht, Patrick MacAlpine, Quang Nhat Tran, Thomas Tumiel, Xiaocheng
Tang, Xinwei Chen, Christopher Hesse, Jacob Hilton, William Hebgen Guss, Sahika Genc, John
Schulman, and Karl Cobbe. Measuring sample efficiency and generalization in reinforcement
learning benchmarks: NeurIPS 2020 Procgen benchmark. arXiv preprint arXiv:2103.15332,
2021.
[34] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably
efficient? In NeurIPS, 2018.
[35] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
[36] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[37] The garage contributors. Garage: A toolkit for reproducible reinforcement learning research.
https://github.com/rlworkgroup/garage, 2019.
[38] Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu,
and Changjie Fan. Learning to utilize shaping rewards: A new approach of reward shaping. In
NeurIPS, 2020.
[39] Harm Seijen and Rich Sutton. True online TD(λ). In ICML, 2014.
[40] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Beyond the one-step greedy
approach in reinforcement learning. In ICML, 2018.
[41] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. In ICLR, 2016.
[42] Craig Sherstan, Shibhansh Dohare, James MacGlashan, Johannes Günther, and Patrick M
Pilarski. Gamma-nets: Generalizing value estimation over timescale. In AAAI, 2020.
[43] Joshua Romoff, Peter Henderson, Ahmed Touati, Emma Brunskill, Joelle Pineau, and Yann
Ollivier. Separating value functions across time-scales. In ICML, 2019.
[44] Jacques Richalet, André Rault, JL Testud, and J Papon. Model predictive heuristic control.
Automatica, 14(5):413–428, 1978.
[45] Masataro Asai and Christian Muise. Learning neural-symbolic descriptive planning models via
cube-space priors: The voyage home (to STRIPS). In IJCAI, 2020.
[46] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for
offline reinforcement learning. In NeurIPS, 2020.
[47] Tyler Lu, Dale Schuurmans, and Craig Boutilier. Non-delusional Q-learning and value iteration.
In NeurIPS, 2018.
[48] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 4th
edition, 2020.
[49] Judea Pearl. Heuristic search theory: Survey of recent results. In IJCAI, 1981.
[50] Reinaldo AC Bianchi, Murilo F Martins, Carlos HC Ribeiro, and Anna HR Costa. Heuristically-
accelerated multiagent reinforcement learning. IEEE transactions on cybernetics, 44(2):252–
265, 2013.
[51] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply
aggrevated: Differentiable imitation learning for sequential prediction. In ICML, 2017.
[52] Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining
reinforcement learning & imitation learning. In ICLR, 2018.
[53] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015.
[54] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse
curriculum generation for reinforcement learning. In CoRL, 2017.
[55] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning.
In ICML, 2002.
[56] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural
generation to benchmark reinforcement learning. In ICML, 2020.
[57] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph
Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement
learning. In ICML, 2018.
[58] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam
Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA:
Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML,
2018.
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] Section 7.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] Section 7.
It is a conceptual work that doesn’t have foreseeable societal impacts yet.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] The assump-
tions are in the theorem, proposition, and lemma statements throughout the paper.
(b) Did you include complete proofs of all theoretical results? [Yes] Appendix A and
Appendix B.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main ex-
perimental results (either in the supplemental material or as a URL)? [Yes] In the
supplemental material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] Appendix C.
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [Yes] Section 5 and Appendix C.
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes] Appendix C.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] References [37], [32],
and [33] in Section 5.1.
(b) Did you mention the license of the assets? [Yes] In Section 5.1.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
Heuristic computation and scripts to run training.
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]