Temporal Difference Models: Model-Free Deep RL for Model-Based Control
ABSTRACT
Model-free reinforcement learning (RL) is a powerful, general tool for learning
complex behaviors. However, its sample efficiency is often impractically large for
solving challenging real-world problems, even with off-policy algorithms such as
Q-learning. A limiting factor in classic model-free RL is that the learning signal
consists only of scalar rewards, ignoring much of the rich information contained
in state transition tuples. Model-based RL uses this information, by training a
predictive model, but often does not achieve the same asymptotic performance
as model-free RL due to model bias. We introduce temporal difference models
(TDMs), a family of goal-conditioned value functions that can be trained with
model-free learning and used for model-based control. TDMs combine the bene-
fits of model-free and model-based RL: they leverage the rich information in state
transitions to learn very efficiently, while still attaining asymptotic performance
that exceeds that of direct model-based RL methods. Our experimental results
show that, on a range of continuous control tasks, TDMs provide a substantial im-
provement in efficiency compared to state-of-the-art model-based and model-free
methods.
1 INTRODUCTION
Reinforcement learning (RL) algorithms provide a formalism for autonomous learning of com-
plex behaviors. When combined with rich function approximators such as deep neural networks,
RL can provide impressive results on tasks ranging from playing games (Mnih et al., 2015; Silver
et al., 2016), to flying and driving (Lillicrap et al., 2015; Zhang et al., 2016), to controlling robotic
arms (Levine et al., 2016; Gu et al., 2017). However, these deep RL algorithms often require a large
amount of experience to arrive at an effective solution, which can severely limit their application to
real-world problems where this experience might need to be gathered directly on a real physical sys-
tem. Part of the reason for this is that direct, model-free RL learns only from the reward: experience
that receives no reward provides minimal supervision to the learner.
In contrast, model-based RL algorithms obtain a large amount of supervision from every sample,
since they can use each sample to better learn how to predict the system dynamics – that is, to
learn the “physics” of the problem. Once the dynamics are learned, near-optimal behavior can
in principle be obtained by planning through these dynamics. Model-based algorithms tend to be
substantially more efficient (Deisenroth et al., 2013; Nagabandi et al., 2017), but often at the cost
of larger asymptotic bias: when the dynamics cannot be learned perfectly, as is the case for most
complex problems, the final policy can be highly suboptimal. Therefore, conventional wisdom
holds that model-free methods are less efficient but achieve the best asymptotic performance, while
model-based methods are more efficient but do not produce policies that are as optimal.
Can we devise methods that retain the efficiency of model-based learning while still achieving the
asymptotic performance of model-free learning? This is the question that we study in this paper. The
search for methods that combine the best of model-based and model-free learning has been ongoing
for decades, with techniques such as synthetic experience generation (Sutton, 1990), partial model-
based backpropagation (Nguyen & Widrow, 1990; Heess et al., 2015), and layering model-free
learning on the residuals of model-based estimation (Chebotar et al., 2017) being a few examples.
However, a direct connection between model-free and model-based RL has remained elusive. By
effectively bridging the gap between model-free and model-based RL, we should be able to smoothly
transition from learning models to learning policies, obtaining rich supervision from every sample
to quickly gain a moderate level of proficiency, while still converging to an unbiased solution.
To arrive at a method that combines the strengths of model-free and model-based RL, we study a
variant of goal-conditioned value functions (Sutton et al., 2011; Schaul et al., 2015; Andrychowicz
et al., 2017). Goal-conditioned value functions learn to predict the value function for every possible
goal state. That is, they answer the following question: what is the expected reward for reaching
a particular state, given that the agent is attempting (as optimally as possible) to reach it? The
particular choice of reward function determines what such a method actually does, but rewards based
on distances to a goal hint at a connection to model-based learning: if we can predict how easy it is
to reach any state from any current state, we must have some kind of understanding of the underlying
“physics.” In this work, we show that we can develop a method for learning variable-horizon goal-
conditioned value functions where, for a specific choice of reward and horizon, the value function
corresponds directly to a model, while for larger horizons, it more closely resembles model-free
approaches. Extension toward more model-free learning is thus achieved by acquiring “multi-step
models” that can be used to plan over progressively coarser temporal resolutions, eventually arriving
at a fully model-free formulation.
The principal contribution of our work is a new RL algorithm that makes use of this connection
between model-based and model-free learning to learn a specific type of goal-conditioned value
function, which we call a temporal difference model (TDM). This value function can be learned
very efficiently, with sample complexities that are competitive with model-based RL, and can then be
used with an MPC-like method to accomplish desired tasks. Our empirical experiments demonstrate
that this method achieves substantially better sample complexity than fully model-free learning on
a range of challenging continuous control tasks, while outperforming purely model-based methods
in terms of final performance. Furthermore, the connection that our method elucidates between
model-based and model-free learning may lead to a range of interesting future methods.
2 PRELIMINARIES
In this section, we introduce the reinforcement learning (RL) formalism, temporal difference Q-
learning methods, model-based RL methods, and goal-conditioned value functions. We will build
on these components to develop temporal difference models (TDMs) in the next section. RL deals
with decision making problems that consist of a state space S, action space A, transition dynamics
P (s′ | s, a), and an initial state distribution p0 . The goal of the learner is encapsulated by a reward
function r(s, a, s′). Typically, long- or infinite-horizon tasks also employ a discount factor γ, and the standard objective is to find a policy π(a|s) that maximizes the expected discounted sum of rewards, E_π[Σ_t γ^t r(s_t, a_t, s_{t+1})], where s_0 ∼ p_0, a_t ∼ π(a_t|s_t), and s_{t+1} ∼ P(s_{t+1}|s_t, a_t).
Q-functions. We will focus on RL algorithms that learn a Q-function. The Q-function represents
the expected total (discounted) reward that can be obtained by the optimal policy after taking action
a_t in state s_t, and can be defined recursively as follows:

Q(s_t, a_t) = E_{p(s_{t+1} | s_t, a_t)}[ r(s_t, a_t, s_{t+1}) + γ max_a Q(s_{t+1}, a) ].   (1)
The optimal policy can then be recovered according to π(a_t | s_t) = δ(a_t = argmax_a Q(s_t, a)). Q-
learning algorithms (Watkins & Dayan, 1992; Riedmiller, 2005) learn the Q-function via an off-
policy stochastic gradient descent algorithm, estimating the expectation in the above equation with
samples collected from the environment and computing its gradient. Q-learning methods can use
transition tuples (st , at , st+1 , rt ) collected from any exploration policy, which generally makes them
more efficient than direct policy search, though still less efficient than purely model-based methods.
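For illustration, the bootstrapped target in Equation (1) can be estimated from a single transition roughly as in the Python sketch below. This is not a specific implementation from the literature; q_values is a hypothetical handle for a learned function approximator, and discrete candidate actions are assumed so that the max can be taken directly.

import numpy as np

def q_learning_target(reward, next_state, q_values, gamma=0.99, terminal=False):
    # Single-sample estimate of r + gamma * max_a Q(s', a) from Equation (1).
    # q_values is a hypothetical callable mapping a state to an array of Q-values,
    # one per discrete candidate action.
    if terminal:
        return reward
    return reward + gamma * float(np.max(q_values(next_state)))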
Model-based RL. Model-based RL algorithms instead learn a dynamics model f(s, a) that predicts the next state, and choose actions by planning through this model, for example with model-predictive control (MPC): actions are selected to maximize the sum of rewards over a planning horizon T subject to the learned dynamics s_{i+1} = f(s_i, a_i). We can also write this dynamics constraint in terms of an implicit dynamics, according to

a_t = argmax_{a_{t:t+T}, s_{t+1:t+T}} Σ_{i=t}^{t+T} r(s_i, a_i)   such that   C(s_i, a_i, s_{i+1}) = 0 ∀ i ∈ {t, ..., t+T−1},   (3)

where C(s_i, a_i, s_{i+1}) = 0 if and only if s_{i+1} = f(s_i, a_i). This implicit version will be important in understanding the connection between model-based and model-free RL.
Goal-conditioned value functions. Q-functions trained for a specific reward are specific to the
corresponding task, and learning a new task requires optimizing an entirely new Q-function. Goal-
conditioned value functions address this limitation by conditioning the Q-value on some task de-
scription vector sg ∈ G in a goal space G. This goal vector induces a parameterized reward
r(st , at , st+1 , sg ), which in turn gives rise to parameterized Q-functions of the form Q(s, a, sg ).
A number of goal-conditioned value function methods have been proposed in the literature, such as
universal value functions (Schaul et al., 2015) and Horde (Sutton et al., 2011). When the goal cor-
responds to an entire state, such goal-conditioned value functions usually predict how well an agent
can reach a particular state, when it is trying to reach it. The knowledge contained in such a value
function is intriguingly close to a model: knowing how well you can reach any state is closely re-
lated to understanding the physics of the environment. With Q-learning, these value functions can be
learned for any goal sg using the same off-policy (st , at , st+1 ) tuples. Relabeling previously visited
states with the reward for any goal leads to a natural data augmentation strategy, since each tuple can
be replicated many times for many different goals without additional data collection. Andrychowicz et al. (2017) used this property to produce an effective curriculum for solving multi-goal tasks with delayed rewards. As we discuss below, relabeling past experience with different goals enables
goal-conditioned value functions to learn much more quickly from the same amount of data.
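To make the relabeling idea concrete, the sketch below replicates a single transition for several different goals before adding the copies to a replay buffer. It is purely illustrative: the negative Euclidean distance reward and the uniform goal sampling are assumptions made for this example, not a prescribed choice.

import numpy as np

def relabel_transition(s, a, s_next, candidate_goals):
    # Replicate one transition (s, a, s_next) for several goals. The reward of each
    # copy is recomputed from its goal, here as a negative Euclidean distance.
    relabeled = []
    for g in candidate_goals:
        r = -np.linalg.norm(s_next - g)  # reward depends only on (s_next, g)
        relabeled.append((s, a, s_next, g, r))
    return relabeled

# Usage: replicate a 2-D transition for three uniformly sampled goals.
s, a, s_next = np.zeros(2), np.array([0.1, -0.2]), np.array([0.3, 0.1])
goals = np.random.uniform(-1.0, 1.0, size=(3, 2))
augmented = relabel_transition(s, a, s_next, goals)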
Let us consider the choice of reward function for the goal-conditioned value function. Although
a variety of options have been explored in the literature (Sutton et al., 2011; Schaul et al., 2015;
Andrychowicz et al., 2017), a particularly intriguing connection to model-based RL emerges if we
set G = S, such that g ∈ G corresponds to a goal state sg ∈ S, and we consider distance-based
reward functions rd of the following form:
rd (st , at , st+1 , sg ) = −D(st+1 , sg ),
where D(st+1 , sg ) is a distance, such as the Euclidean distance D(st+1 , sg ) = kst+1 − sg k2 . If
γ = 0, we have Q(st , at , sg ) = −D(st+1 , sg ) at convergence of Q-learning, which means that
Q(s_t, a_t, s_g) = 0 implies that s_{t+1} = s_g. Plugging this Q-function into the model-based planning optimization in Equation (3), and denoting the task control reward as r_c, the solution to

a_t = argmax_{a_{t:t+T}, s_{t+1:t+T}} Σ_{i=t}^{t+T} r_c(s_i, a_i)   such that   Q(s_i, a_i, s_{i+1}) = 0 ∀ i ∈ {t, ..., t+T−1}   (4)
yields a model-based plan. We have now derived a precise connection between model-free and
model-based RL, in that model-free learning of goal-conditioned value functions can be used to
directly produce an implicit model that can be used with MPC-based planning. However, this con-
nection by itself is not very useful: the resulting implicit model is fully model-based, and does
not provide any kind of long-horizon capability. In the next section, we show how to extend this
connection into the long-horizon setting by introducing the temporal difference model (TDM).
If we consider the case where γ > 0, the optimization in Equation (4) no longer corresponds to any
optimal control method. In fact, when γ = 0, Q-values have well-defined units: units of distance
between states. For γ > 0, no such interpretation is possible. The key insight in temporal differ-
ence models is to introduce a different mechanism for aggregating long-horizon rewards. Instead
of evaluating Q-values as discounted sums of rewards, we introduce an additional input τ , which
represents the planning horizon, and define the Q-learning recursion as
Q(s_t, a_t, s_g, τ) = E_{p(s_{t+1} | s_t, a_t)}[ −D(s_{t+1}, s_g) 𝟙[τ = 0] + max_a Q(s_{t+1}, a, s_g, τ − 1) 𝟙[τ ≠ 0] ].   (5)
The Q-function uses a reward of −D(st+1 , sg ) when τ = 0 (at which point the episode terminates),
and decrements τ by one at every other step. Since this is still a well-defined Q-learning recursion,
it can be optimized with off-policy data and, just as with goal-conditioned value functions, we can
resample new goals sg and new horizons τ for each tuple (st , at , st+1 ), even ones that were not
actually used when the data was collected. In this way, the TDM can be trained very efficiently,
since every tuple provides supervision for every possible goal and every possible horizon.
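A minimal sketch of the target implied by Equation (5) is given below; q_values is a hypothetical handle for a goal- and horizon-conditioned Q-function over discrete candidate actions, used only to illustrate the recursion.

import numpy as np

def tdm_target(s_next, goal, tau, q_values):
    # Bootstrap target for Q(s_t, a_t, s_g, tau) from Equation (5).
    if tau == 0:
        # Terminal step of the finite-horizon recursion: negative distance to the goal.
        return -float(np.linalg.norm(s_next - goal))
    # Otherwise decrement the horizon and bootstrap as in standard Q-learning.
    return float(np.max(q_values(s_next, goal, tau - 1)))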
The intuitive interpretation of the TDM is that it tells us how close the agent will get to a given
goal state sg after τ time steps, when it is attempting to reach that state in τ steps. Alternatively,
TDMs can be interpreted as Q-values in a finite-horizon MDP, where the horizon is determined by
τ . For the case where τ = 0, TDMs effectively learn a model, allowing TDMs to be incorporated
into a variety of planning and optimal control schemes at test time as in Equation (4). Thus, we can
view TDM learning as an interpolation between model-based and model-free learning, where τ = 0
corresponds to the single-step prediction made in model-based learning and τ > 0 corresponds
to the long-term prediction made by typical Q-functions. While the correspondence to models is
not the same for τ > 0, if we only care about the reward every K steps, then we can recover a correspondence by replacing Equation (4) with

a_t = argmax_{a_{t:K:t+T}, s_{t+K:K:t+T}} Σ_{i=t, t+K, ..., t+T} r_c(s_i, a_i)
      such that Q(s_i, a_i, s_{i+K}, K − 1) = 0 ∀ i ∈ {t, t+K, ..., t+T−K},   (6)

where we only optimize over every K-th state and action. As the TDM becomes effective for longer
horizons, we can increase K until K = T , and plan over only a single effective time step:
a_t = argmax_{a_t, a_{t+T}, s_{t+T}} r_c(s_{t+T}, a_{t+T})   such that   Q(s_t, a_t, s_{t+T}, T − 1) = 0.   (7)
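One simple way to approximately solve a problem of the form of Equation (7) is to relax the constraint into a penalty and optimize by random shooting, as in the sketch below. This is only an illustration of the structure of the problem, not the procedure used in this paper; the penalty weight, the sampling ranges, and the q_fn and reward_fn handles are assumptions.

import numpy as np

def plan_single_step(s_t, q_fn, reward_fn, horizon_T, action_dim, state_dim,
                     n_samples=1024, penalty=100.0):
    # Penalty-based random-shooting approximation of Equation (7).
    # q_fn(s, a, g, tau): hypothetical TDM returning a non-positive scalar.
    # reward_fn(s, a):    hypothetical terminal task reward r_c.
    best_action, best_score = None, -np.inf
    for _ in range(n_samples):
        a_t = np.random.uniform(-1.0, 1.0, size=action_dim)
        s_T = np.random.uniform(-1.0, 1.0, size=state_dim)   # candidate terminal state
        a_T = np.random.uniform(-1.0, 1.0, size=action_dim)
        # The constraint Q(s_t, a_t, s_T, T - 1) = 0 becomes a penalty (Q <= 0).
        score = reward_fn(s_T, a_T) + penalty * q_fn(s_t, a_t, s_T, horizon_T - 1)
        if score > best_score:
            best_action, best_score = a_t, score
    return best_action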
This formulation does result in some loss of generality, since we no longer optimize the reward at
the intermediate steps. This limits the multi-step formulation to terminal reward problems, but does
allow us to accommodate arbitrary reward functions on the terminal state st+T , which still describes
a broad range of practically relevant tasks. In the next section, we describe how TDMs can be
implemented and used in practice for continuous state and action spaces.
TDMs can be trained with standard off-policy Q-learning while relabeling each transition with new goals s_g and horizons τ, given the same (s_t, a_t, s_{t+1}) tuples from the behavioral policy, as done in Andrychowicz et al. (2017). This relabeling enables simultaneous, data-efficient learning of short-horizon and long-horizon behaviors for arbitrary goal states, unlike previously proposed goal-conditioned value functions that only learn at a single time scale, typically determined by a discount factor (Schaul et al., 2015; Andrychowicz et al., 2017). In this section, we describe the design decisions needed to make a TDM algorithm practical.
Q-learning typically optimizes scalar rewards, but TDMs enable us to increase the amount of supervision available to the Q-function by using a vector-valued reward. Specifically, if the distance D(s, s_g) factors additively over the dimensions, we can train a vector-valued Q-function that predicts the per-dimension distance, with the reward function for dimension j given by −D_j(s_j, s_{g,j}). We use the ℓ1 norm in our implementation, which corresponds to the absolute-value reward −|s_j − s_{g,j}|. The resulting vector-valued Q-function can learn distances along each dimension separately, providing it with more supervision from each training point. Empirically, we found that this modification provides a substantial boost in sample efficiency.
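Concretely, a vector-valued reward of this form can be computed as in the short sketch below, assuming the per-dimension ℓ1 distance described above.

import numpy as np

def vectorized_reward(s_next, goal):
    # One reward per state dimension: -|s_next_j - goal_j|.
    return -np.abs(s_next - goal)

def scalar_reward(s_next, goal):
    # Scalar counterpart: the summed (negative) l1 distance.
    return float(np.sum(vectorized_reward(s_next, goal)))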
We can optionally make an improvement to TDMs if we know that the task reward rc depends only
on some subset of the state or, more generally, state features. In that case, we can train the TDM to
predict distances along only those dimensions or features that are used by rc , which in practice can
substantially simplify the corresponding prediction problem. In our experiments, we illustrate this property by training TDMs for pushing tasks that predict distances for the end-effector and the pushed object, without accounting for the internal joints of the arm, and similarly for various locomotion tasks.
While the TDM optimal control formulation in Equation (7) drastically reduces the number of states
and actions to be optimized for long-term planning, it requires solving a constrained optimiza-
tion problem, which is more computationally expensive than unconstrained problems. We can
remove the need for a constrained optimization through a specific architectural decision in the de-
sign of the function approximator for Q(s, a, s_g, τ). We define the Q-function as Q(s, a, s_g, τ) = −‖f(s, a, s_g, τ) − s_g‖, where f(s, a, s_g, τ) outputs a state vector. By training the TDM with a standard Q-learning method, f(s, a, s_g, τ) is trained to explicitly predict the state that will be reached by a policy attempting to reach s_g in τ steps. This model can then be used to choose the action with fully explicit MPC as below, which also allows straightforward derivation of a multi-step version as in Equation (6):

a_t = argmax_{a_t, a_{t+T}, s_{t+T}} r_c(f(s_t, a_t, s_{t+T}, T − 1), a_{t+T}).   (8)
In the case where the task is to reach a goal state sg , a simpler approach to extract a policy is to use
the TDM directly:
a_t = argmax_a Q(s_t, a, s_g, T).   (9)
In our experiments, we use Equations (8) and (9) to extract a policy.
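As an illustration of Equation (9), the sketch below extracts an action by random shooting over candidate actions. The sampling range and candidate count are assumptions; in practice, a goal-conditioned actor trained with DDPG can approximate this argmax directly.

import numpy as np

def act_toward_goal(s_t, goal, horizon_T, q_fn, action_dim, n_candidates=1000):
    # Approximate argmax_a Q(s_t, a, s_g, T) from Equation (9) by sampling actions.
    # q_fn(s, a, g, tau): hypothetical TDM Q-function returning a scalar.
    candidates = np.random.uniform(-1.0, 1.0, size=(n_candidates, action_dim))
    scores = np.array([q_fn(s_t, a, goal, horizon_T) for a in candidates])
    return candidates[int(np.argmax(scores))]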
The algorithm is summarized as Algorithm 1. A crucial difference from prior goal-conditioned value
function methods (Schaul et al., 2015; Andrychowicz et al., 2017) is that our algorithm can be used
to act according to an arbitrary terminal reward function rc , both during exploration and at test time.
Like other off-policy algorithms (Mnih et al., 2015; Lillicrap et al., 2015), it consists of exploration
and Q-function fitting. Noise is injected for exploration, and Q-function fitting uses standard Q-
learning techniques, with target networks Q′ and experience replay (Mnih et al., 2015; Lillicrap
et al., 2015). If we view the Q-function fitting as model fitting, the algorithm also resembles iterative
model-based RL, which alternates between collecting data using the learned dynamics model for
planning (Deisenroth & Rasmussen, 2011) and fitting the model. Since we focus on continuous
tasks, we use DDPG (Lillicrap et al., 2015), though any Q-learning method could be used.
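For readers who prefer code to pseudocode, the loop below outlines the overall procedure described in this section. It is a schematic sketch rather than a reproduction of Algorithm 1; the env, policy, exploration_noise, relabel, q_update, and replay_buffer arguments are hypothetical interfaces.

def train_tdm(env, policy, exploration_noise, relabel, q_update, replay_buffer,
              num_episodes=1000, updates_per_transition=4):
    # Schematic TDM training loop: noisy exploration, replay storage, and
    # relabeled Q-learning updates (with target networks handled inside q_update).
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s) + exploration_noise()        # exploration by injected noise
            s_next, done = env.step(a)                 # hypothetical env interface
            replay_buffer.add((s, a, s_next))
            for _ in range(updates_per_transition):    # I fitting steps per transition
                batch = replay_buffer.sample()
                batch = relabel(batch)                 # resample goals s_g and horizons tau
                q_update(batch)                        # standard Q-learning update
            s = s_next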
The computational cost of the algorithm is mostly determined by the number of Q-function updates performed per transition, I. In general, TDMs can benefit from substantially larger values of I than classic model-free methods such as DDPG, because relabeling increases the amount of supervision signal. In real-world applications such as robotics, where sample efficiency matters most (Gu et al., 2017), learning is often bottlenecked by data collection rather than computation; large values of I are therefore usually not a significant problem, and the method can continue to benefit from faster computation.
5 RELATED WORK
Multi-step models do not condition on a policy reaching a goal, and so they require optimizing over a sequence of actions, making the input space grow linearly with the planning horizon.
A particularly related UVF extension is hindsight experience replay (HER) (Andrychowicz et al., 2017).
(2017). Both HER and our method retroactively relabel past experience with goal states that are
different from the goal aimed for during data collection. However, unlike our method, the standard
UVF in HER uses a single temporal scale when learning, and does not explicitly provide for a
connection between model-based and model-free learning. The practical result of these differences
is that our approach empirically achieves substantially better sample complexity than HER on a
wide range of complex continuous control tasks, while the theoretical connection between model-
based and model-free learning suggests a much more flexible use of the learned Q-function inside a
planning or optimal control framework.
Lastly, our motivation is shared by other lines of work besides goal-conditioned value functions
that aim to enhance supervision signals for model-free RL (Silver et al., 2017; Jaderberg et al.,
2017; Bellemare et al., 2017). The Predictron (Silver et al., 2017) augments classic RL with multi-step
reward predictions, while UNREAL (Jaderberg et al., 2017) also augments it with pixel control as a
secondary reward objective. These are substantially different methods from our work, but share the
motivation to achieve efficient RL by increasing the amount of learning signals from finite data.
6 EXPERIMENTS
Our experiments examine how the sample efficiency and performance of TDMs compare to both
model-based and model-free RL algorithms. We expect TDMs to have the efficiency of model-based RL but with less model bias. We also aim to study the importance of several key design decisions
in TDMs, and evaluate the algorithm on a real-world robotic platform. For the model-free com-
parison, we compare to DDPG (Lillicrap et al., 2015), which typically achieves the best sample
efficiency on benchmark tasks (Duan et al., 2016); HER, which uses goal-conditioned value func-
tions (Andrychowicz et al., 2017); and DDPG with the same sparse rewards as HER. For the model-
based comparison, we compare to the model-based component in Nagabandi et al. (2017), a recent
work that reports highly efficient learning with neural network dynamics models. Details of the
baseline implementations are in the Appendix. We perform the comparison on five simulated tasks:
(1) a 7 DoF arm reaching various random end-effector targets, (2) an arm pushing a puck to a target
location, (3) a planar cheetah attempting to reach a goal velocity (either forward or backward), (4) a
quadrupedal ant attempting to reach a goal position, and (5) an ant attempting to reach a goal posi-
tion and velocity. The tasks are shown in Figure 1 and terminate when either the goal is reached or
the time horizon is reached. The pushing task requires long-horizon reasoning to reach and push the
puck. The cheetah and ant tasks require handling many contact discontinuities which is challenging
for model-based methods, with the ant environment having particularly difficult dynamics given the
larger state and action space. The ant position and velocity task presents a scenario where reward
shaping as in traditional RL methods may not lead to optimal behavior, since one cannot maintain
both a desired position and velocity. However, such a task can be very valuable in realistic settings.
For example, if we want the ant to jump, we might instruct it to achieve a particular velocity at a
particular location. We also tested TDMs on a real-world robot arm reaching end-effector positions,
to study its applicability to real-world tasks.
For the simulated and real-world 7-DoF arm, our TDM is trained on all state components. For the
pushing task, our TDM is trained on the hand and puck XY-position. For the half cheetah task, our
TDM is trained on the velocity of the cheetah. For the ant tasks, our TDM is trained on either the
position or the position and velocity for the respective task. Full details are in the Appendix.
6.1 TDMS VS. MODEL-FREE, MODEL-BASED, AND DIRECT GOAL-CONDITIONED RL
The results are shown in Figure 2. When compared to the model-free baselines, the pure model-based method learns much faster on all the tasks. However, on the harder cheetah and ant tasks, its final performance is worse due to model bias. TDMs learn as quickly as or faster than the model-based method, and also always learn policies that are as good as, if not better than, the model-free policies. Furthermore, TDMs require fewer samples than the model-free baselines on the ant tasks and drastically fewer samples on the other tasks.
Figure 1: The tasks in our experiments: (a) reaching target locations (7-DoF reacher), (b) pushing a puck to a random target (pusher), (c) training the half cheetah to run at target velocities, (d) training an ant to run to a target position or a target position and velocity, and (e) reaching target locations with a real-world Sawyer robot.
Figure 2: Comparison of TDMs with the baseline methods in model-free (DDPG), model-based, and goal-conditioned value function (HER - Dense) learning on the various tasks. Panels include (d) Ant: Position, (e) Ant: Position and Velocity, and (f) Sawyer Robot (real-world). All plots show the final distance to the goal versus the number of environment steps in thousands (not rollouts). The bold line shows the mean across 3 random seeds, and the shaded region shows one standard deviation. Our method, which uses model-free learning, is generally more sample-efficient than the model-free alternatives, including DDPG and HER, and improves upon the best model-based performance.
We also see that using HER does not lead to an improvement over DDPG. While we were initially surprised, we realized that a key selling point of HER is that it can solve sparse tasks that would otherwise be unsolvable. In this paper, we were interested in improving the sample efficiency, rather than the feasibility, of model-free reinforcement learning algorithms, and so we focused on tasks that DDPG could already solve. On these sorts of tasks, the advantage of HER over DDPG with a dense reward is not expected. To evaluate HER as a method for solving sparse tasks, we included the DDPG-Sparse baseline, and we see that HER significantly outperforms it, as expected. In summary, TDMs converge as fast as or faster than model-based learning (which learns faster than the model-free baselines), while achieving final performance that is as good as or better than the model-free methods on all tasks.
Lastly, we ran the algorithm on a 7-DoF Sawyer robotic arm to learn a real-world analogue of the reaching task. Figure 2f shows that the algorithm outperforms DDPG, our model-free baseline, while learning with fewer samples. These results show that TDMs can scale to real-world tasks.
We next discuss two key design choices for TDMs that provide substantially improved performance. First, Figure 3a examines the tradeoffs between the vectorized and scalar rewards. The results show that the vectorized formulation learns substantially faster than the naïve scalar variant. Second, Figure 3b compares the learning speed for different horizon values τmax. Performance degrades when the horizon is too low, and learning becomes slower when the horizon is too high.
Figure 3: Ablation experiments for (a) scalar vs. vectorized TDMs on the 7-DoF simulated reacher task and (b) different values of τmax on the pusher task. The vectorized variant performs substantially better, while the horizon effectively interpolates between model-based and model-free learning.
7 CONCLUSION
In this paper, we derive a connection between model-based and model-free reinforcement learning,
and present a novel RL algorithm that exploits this connection to greatly improve on the sample
efficiency of state-of-the-art model-free deep RL algorithms. Our temporal difference models can
be viewed both as goal-conditioned value functions and implicit dynamics models, which enables
them to be trained efficiently on off-policy data while still minimizing the effects of model bias. As
a result, they achieve asymptotic performance that compares favorably with model-free algorithms,
but with a sample complexity that is comparable to purely model-based methods.
While the experiments focus primarily on the new RL algorithm, the relationship between model-
based and model-free RL explored in this paper provides a number of avenues for future work.
We demonstrated the use of TDMs with a very basic planning approach, but further exploring how
TDMs can be incorporated into powerful constrained optimization methods for model-predictive
control or trajectory optimization is an exciting avenue for future work. Another direction is to further explore how TDMs can be applied to complex state representations, such as images,
where simple distance metrics may no longer be effective. Although direct application of TDMs
to these domains is not straightforward, a number of works have studied how to construct metric
embeddings of images that could in principle provide viable distance functions. We also note that
while the presentation of TDMs has been in the context of deterministic environments, the extension to stochastic environments is straightforward: TDMs would learn to predict the expected
distance between the future state and a goal state. Finally, the promise of sample-efficient learning with the performance of model-free RL and the efficiency of model-based RL is to enable widespread application of RL on real-world systems. Many applications in robotics, autonomous driving and flight, and other control domains could be explored in future work.
8 ACKNOWLEDGMENT
This research was supported by the Office of Naval Research and the National Science Foundation
through IIS-1614653 and IIS-1651843.
REFERENCES
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob
McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. arXiv
preprint arXiv:1707.01495, 2017.
Somil Bansal, Roberto Calandra, Ted Xiao, Sergey Levine, and Claire J Tomlin. Goal-driven dy-
namics learning via bayesian optimization. arXiv preprint arXiv:1703.09260, 2017.
Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement
learning. arXiv preprint arXiv:1707.06887, 2017.
Justin A Boyan. Least-squares temporal difference learning. In Proceedings of the 16th International
Conference on Machine Learning, pp. 49–56, 1999.
9
Published as a conference paper at ICLR 2018
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and
Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
Rich Caruana. Multitask learning. In Learning to learn, pp. 95–133. Springer, 1998.
Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, and Sergey
Levine. Combining model-based and model-free updates for trajectory-centric reinforcement
learning. arXiv preprint arXiv:1703.03078, 2017.
Bruno Da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. arXiv preprint
arXiv:1206.6398, 2012.
Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy
search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp.
465–472, 2011.
Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics.
Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. arXiv preprint
arXiv:1611.01779, 2016.
Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement
learning for continuous control. In Proceedings of the 33rd International Conference on Machine
Learning (ICML), 2016.
David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, 49(2):
325–346, 2002.
Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning
with model-based acceleration. In International Conference on Machine Learning, pp. 2829–
2838, 2016.
Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for
robotic manipulation with asynchronous off-policy updates. In Robotics and Automation (ICRA),
2017 IEEE International Conference on, pp. 3389–3396. IEEE, 2017.
Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, and Tom Erez. Learn-
ing Continuous Control Policies by Stochastic Value Gradients. arXiv, pp. 1–13, 2015. ISSN
10495258. URL http://arxiv.org/abs/1510.09142.
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David
Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In-
ternational Conference on Learning Representations, 2017.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Jens Kober, Andreas Wilhelm, Erhan Oztop, and Jan Peters. Reinforcement learning to adjust
parametrized motor primitives to new situations. Autonomous Robots, 33(4):361–379, 2012.
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuo-
motor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of
continuous actions for deep rl. arXiv preprint arXiv:1705.05035, 2017.
Nikhil Mishra, Pieter Abbeel, and Igor Mordatch. Prediction and control with temporal segment
models. 2017.
10
Published as a conference paper at ICLR 2018
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle-
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dy-
namics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint
arXiv:1708.02596, 2017.
Derrick H Nguyen and Bernard Widrow. Neural networks for self-learning control systems. IEEE
Control systems magazine, 10(3):18–23, 1990.
Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L Littman. An
analysis of linear models, linear value-function approximation, and feature selection for reinforce-
ment learning. In International Conference on Machine learning, 2008.
Martin Riedmiller. Neural fitted q iteration-first experiences with a data efficient neural reinforce-
ment learning method. In ECML, volume 3720, pp. 317–328. Springer, 2005.
Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approxima-
tors. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1312–1320,
2015.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering
the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel
Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-
end learning and planning. International Conference on Machine Learning, 2017.
Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximat-
ing dynamic programming. In Proceedings of the seventh international conference on machine
learning, pp. 216–224, 1990.
Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White,
and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsuper-
vised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and
Multiagent Systems-Volume 2, pp. 761–768. International Foundation for Autonomous Agents
and Multiagent Systems, 2011.
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control.
In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–
5033. IEEE, 2012.
Arun Venkatraman, Roberto Capobianco, Lerrel Pinto, Martial Hebert, Daniele Nardi, and J Andrew
Bagnell. Improved learning of dynamics models for control. In International Symposium on
Experimental Robotics, 2016.
Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
Tianhao Zhang, Gregory Kahn, Sergey Levine, and Pieter Abbeel. Learning deep control policies for
autonomous aerial vehicles with mpc-guided policy search. In Robotics and Automation (ICRA),
2016 IEEE International Conference on, pp. 528–535. IEEE, 2016.
A EXPERIMENT DETAILS
Figure 4: TDMs with different numbers of updates per step I on the ant target-position task. The maximum distance was set to 5 rather than 6 for this experiment, so the numbers should be lower than the ones reported in the paper.
While Q-learning is valid for any values of s_g and τ for each transition tuple (s_t, a_t, s_{t+1}), the way in which these values are sampled during training can affect learning efficiency. Some potential strategies for sampling s_g are: (1) uniformly sample future states along the actual trajectory in the buffer (i.e., for s_t, choose s_g = s_{t+k} for a random k > 0), as in Andrychowicz et al. (2017); (2) uniformly sample goal states from the replay buffer; (3) sample goals uniformly from the range of valid states. We found that the first strategy performed slightly better than the others, though not by much, and we use it in our experiments. The horizon τ is sampled uniformly at random between 0 and the maximum horizon τmax.
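A minimal sketch of the first strategy, combined with uniform horizon sampling, is shown below; the trajectory layout and index handling are assumptions made for the example.

import numpy as np

def sample_goal_and_horizon(trajectory, t, tau_max, rng=np.random):
    # Strategy (1): relabel step t with a future state of the same trajectory, and
    # draw the horizon tau uniformly from {0, ..., tau_max}.
    # trajectory: array of states with shape (T, state_dim); assumes t < T - 1.
    k = rng.randint(1, len(trajectory) - t)   # random future offset k >= 1
    goal = trajectory[t + k]
    tau = rng.randint(0, tau_max + 1)
    return goal, tau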
In all our experiments, we used DDPG (Lillicrap et al., 2015) as the base off-policy model-free RL algorithm for learning the TDM Q(s, a, s_g, τ). The experience replay buffer (Mnih et al., 2015) has a size of 1 million transitions, and soft target networks (Lillicrap et al., 2015) are used with a Polyak averaging coefficient of 0.999 for DDPG and TDM and 0.95 for HER and DDPG-Sparse. For HER
and DDPG-Sparse, we also added a penalty on the tanh pre-activation, as in Andrychowicz et al.
(2017). Learning rates of the critic and the actor are chosen from {1e-4, 1e-3} and {1e-4, 1e-3}
respectively. Adam (Kingma & Ba, 2014) is used as the base optimizer with default parameters
except the learning rate. The batch size was 128. The policies and networks are parameterized with neural networks with ReLU hidden activations and two hidden layers of 300 units each. The
policies have a tanh output activation, while the critic has no output activation (except for TDM,
see A.5). For the TDM, the goal was concatenated to the observation. The planning horizon τ
is also concatenated as an observation and represented as a single integer. While we tried various
representations for τ such as one-hot encodings and binary-string encodings, we found that simply
providing the integer was sufficient.
While any distance metric can be used for the TDM reward function, we chose the ℓ1 norm, −‖s_{t+1} − s_g‖_1, to ensure that the scalar and vectorized TDMs are consistent.
For the model-based comparison, we trained a neural network dynamics model with ReLU activations, no output activation, and two hidden layers of 300 units each. The model was trained to
predict the difference in state, rather than the full state. The dynamics model is trained to minimize
the mean squared error between the predicted difference and the actual difference. After each state is
observed, we sample a minibatch of size 128 from the replay buffer (size 1 million) and perform one
step of gradient descent on this mean squared error loss. Twenty rollouts were performed to com-
pute the (per-dimension) mean and standard deviation of the states, actions, and state differences.
We used these statistics to normalize the states and actions before giving them to the model, and
to normalize the state differences before computing the loss. For MPC, we simulated 512 random
action sequences of length 15 through the learned dynamics model and chose the first action of the
sequence with the highest reward.
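The random-shooting MPC procedure used for this baseline can be sketched as follows; dynamics and reward_fn are placeholder handles for the learned model (which predicts state differences) and the task reward, and the action range is an assumption.

import numpy as np

def mpc_random_shooting(s_t, dynamics, reward_fn, action_dim,
                        horizon=15, n_sequences=512):
    # Simulate random action sequences through the learned model and return the
    # first action of the highest-reward sequence.
    best_first_action, best_return = None, -np.inf
    for _ in range(n_sequences):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total_reward = s_t, 0.0
        for a in actions:
            s = s + dynamics(s, a)          # the model predicts the state difference
            total_reward += reward_fn(s, a)
        if total_reward > best_return:
            best_first_action, best_return = actions[0], total_reward
    return best_first_action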
For TDMs, we found the most important hyperparameters to be the reward scale, τmax, and the number of updates per observation, I. As shown in Figure 4, TDMs can greatly benefit from larger values of I, though eventually there are diminishing returns and potentially a negative impact, most likely due to overfitting. We found that the baselines did not benefit from larger I values, except for HER, which did. For all the model-free algorithms (DDPG, DDPG-Sparse, HER, and TDMs), we performed a grid search over the reward scale in the range {0.01, 1, 100, 10000} and the number of updates per observation in the range {1, 5, 10}. For HER, we also tuned the weight given to the policy pre-tanh activation over {0, 0.01, 1}, which is described in Andrychowicz et al. (2017). For TDMs, we also tuned τmax over the range {15, 25, Horizon − 1}. For the half cheetah task, we performed extra searches over τmax and found τmax = 9 to be effective.
For TDMs, since we know that the true Q-function must learn to predict (negative) distances, we incorporate this prior knowledge into the Q-function by parameterizing it as Q(s, a, s_g, τ) = −‖f(s, a, s_g, τ) − s_g‖_1. Here, f is a vector output by a feed-forward neural network and has the same dimension as the goal. This parameterization ensures that the Q-function outputs non-positive values, while encouraging the Q-function to learn what we call a goal-conditioned model: f is encouraged to predict what state will be reached after τ steps by a policy that is trying to reach goal s_g in τ time steps.
For the ℓ1 norm, the scalar supervision regresses

Q(s_t, a_t, s_g, τ) = −Σ_j |f_j(s_t, a_t, s_g, τ) − s_{g,j}|

onto

r(s_t, a_t, s_{t+1}, s_g) 𝟙[τ = 0] + Q(s_{t+1}, a*, s_g, τ − 1) 𝟙[τ ≠ 0]
   = −Σ_j { |s_{t+1,j} − s_{g,j}| 𝟙[τ = 0] + |f_j(s_{t+1}, a*, s_g, τ − 1) − s_{g,j}| 𝟙[τ ≠ 0] },

where a* = argmax_a Q(s_{t+1}, a, s_g, τ − 1). The vectorized supervision instead supervises each component of f, so that

|f_j(s_t, a_t, s_g, τ) − s_{g,j}|

regresses onto

|s_{t+1,j} − s_{g,j}| 𝟙[τ = 0] + |f_j(s_{t+1}, a*, s_g, τ − 1) − s_{g,j}| 𝟙[τ ≠ 0]

for each dimension j of the state.
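In code, the per-dimension targets take roughly the following form; f is a placeholder for the learned network, and the bootstrap action a* is assumed to be supplied externally (for example, by a target policy).

import numpy as np

def vectorized_tdm_targets(s_next, goal, tau, f, a_star):
    # Per-dimension regression targets for |f_j(s_t, a_t, s_g, tau) - s_{g,j}|.
    # f(s, a, g, tau): hypothetical predictor returning a state-sized vector.
    if tau == 0:
        return np.abs(s_next - goal)                         # |s_{t+1,j} - s_{g,j}|
    return np.abs(f(s_next, a_star, goal, tau - 1) - goal)   # bootstrapped per dimension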
Benchmark tasks are built on the MuJoCo physics simulator (Todorov et al., 2012) and OpenAI Gym environments (Brockman et al., 2016). For the simulated reaching and pushing tasks, we use Equation (8) for policy extraction, and for the other tasks we use Equation (9). The horizon (episode length) for the pusher and ant tasks is 50. The reaching tasks have a horizon of 100. The half-cheetah task has a horizon of 99.
7-DoF reacher: The state consists of 7 joint angles, 7 joint angular velocities, and the 3 XYZ coordinates of the tip of the arm, making it 17 dimensional. The action controls torques for each joint, for a total of 7 dimensions. The reward function during optimal control and for the model-free baseline is the negative Euclidean distance between the XYZ coordinates of the tip and the target XYZ coordinates. The targets are sampled randomly from all reachable locations of the arm at the beginning of each episode.
The robot model is taken from the striker and pusher environments in OpenAI Gym MuJoCo do-
mains (Brockman et al., 2016) and has the same joint limits and physical parameters.
Many tasks can be solved by expressing a desired goal state or desired goal state components. For example, the 7-DoF reacher solves the task when the end-effector XYZ component of its state is equal to the goal location, (x∗, y∗, z∗). One advantage of using a goal-conditioned model f as in Equation (8) is that this desire can be accounted for directly: if we already know the desired values of some components of s_{t+T}, then we can simply fix those components of s_{t+T} and optimize over the other dimensions. For example, for the 7-DoF reacher, the optimization problem in Equation (8) needed to choose an action becomes

a_t = argmax_{a_t, s_{t+T}[0:14]} r_c( f(s_t, a_t, s_{t+T}[0:14] || [x∗, y∗, z∗]) ),
where || denotes concatenation; st+T [0 : 14] denotes that we only optimize over the first 14 dimen-
sions (the joint angles and velocities), and we omit at+T since the reward is only a function of the
state. Intuitively, this optimization chooses whatever goal joint angles and joint velocities make it
easiest to reach (x∗ , y ∗ , z ∗ ). It then chooses the corresponding action to get to that goal state in
T time steps. We implement the optimization over s_{t+T}[0:14] with stochastic optimization: we sample 10,000 different vectors and choose the best value. Lastly, instead of optimizing over the actions, we use the policy trained with DDPG to choose the action, since the policy is already trained to choose an action with maximum Q-value for a given state, goal state, and planning horizon. We found this optimization scheme to be reliable, but any optimizer could be used to solve Equations (8), (7), or (6).
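A sketch of this stochastic optimization over the free goal dimensions is given below; the sampling ranges and the policy, f, and reward_fn handles are assumptions used only to illustrate the procedure.

import numpy as np

def choose_action_reacher(s_t, xyz_target, policy, f, reward_fn, horizon_T,
                          n_samples=10000, joint_dim=14, rng=np.random):
    # Optimize over the free goal dimensions (joint angles and velocities) while
    # fixing the end-effector XYZ components of the goal to the target location.
    best_action, best_score = None, -np.inf
    for _ in range(n_samples):
        free_part = rng.uniform(-1.0, 1.0, size=joint_dim)    # candidate s_{t+T}[0:14]
        goal = np.concatenate([free_part, xyz_target])        # fix XYZ to (x*, y*, z*)
        a_t = policy(s_t, goal, horizon_T)                    # goal-conditioned actor
        score = reward_fn(f(s_t, a_t, goal, horizon_T - 1))   # reward of predicted state
        if score > best_score:
            best_action, best_score = a_t, score
    return best_action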
Pusher: The state consists of 3 joint angles, 3 joint angular velocities, the XY location of the hand, and the XY location of the puck. The action controls torques for each of the 3 joints. The reward function is the negative Euclidean distance between the hand and the puck. Once the hand is near the puck (within 0.1), the reward is increased by 2 minus the Euclidean distance between the puck and the goal location. This reward function encourages the arm to reach the puck. Once the arm reaches the puck, the bonus reward takes effect, and the arm is encouraged to bring the puck to the target.
As in the 7-DoF reacher, we set components of the goal state for the optimal control formulation.
Specifically, we set the goal hand position to be the puck location. To mirror the two-stage reward
shaping used by our baselines, the goal XY location for the puck is initially its current location until
the hand reaches the puck, at which point the goal position for the puck is the target location. There
are no other state dimensions to optimize over, so the optimal control problem is trivial.
Half-Cheetah: The environment is the same as in Brockman et al. (2016). The only difference is that the reward is the ℓ1 distance between the velocity and the desired velocity v∗. Our optimal control formulation is again trivial, since we set the goal velocity to be v∗. The goal velocity for each rollout was sampled uniformly in the range [−6, 6]. We found that the resulting TDM policy tends to “jump” at the last time step, which is the type of behavior we would expect from this finite-horizon formulation but not from the infinite-horizon formulation of standard model-free deep RL techniques.
Ant: The environment is the same as in Brockman et al. (2016), except that we lowered the gear ratio to 30 for all joints. We found that this prevents the ant from flipping over frequently during the initial phase of training, allowing us to run all the experiments faster. The reward is the ℓ1 distance between the actual and desired xy-position, and also xy-velocity for the position-and-velocity task, of the torso center of mass. For the target-position task, the target position was any position within a 6-by-6 square. For the target-position-and-velocity task, the target position was any position within a 1-by-1 square and the target velocity was any velocity within a 0.05-by-0.05 velocity box. When computing the distance for the position-and-velocity task, the velocity distance was weighted by 0.9 and the position distance was weighted by 0.1.
Sawyer Robot: The state and action spaces are the same as for the 7-DoF simulated robot, except that we also included the measured torques as part of the state space, since these can differ from the applied torques. The reward function is likewise the ℓ1 distance to the desired XYZ position.