0% found this document useful (0 votes)
16 views20 pages

Hansen 2022

The document presents Temporal Difference Learning for Model Predictive Control (TD-MPC), a framework that combines model-free and model-based methods to enhance sample efficiency and performance in continuous control tasks. It utilizes a learned task-oriented latent dynamics model and a terminal value function to optimize trajectories over short horizons while estimating long-term returns. The proposed method demonstrates superior performance in various tasks compared to prior approaches, achieving effective planning with reduced computational costs.

Uploaded by

wptjcks15
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views20 pages

Hansen 2022

The document presents Temporal Difference Learning for Model Predictive Control (TD-MPC), a framework that combines model-free and model-based methods to enhance sample efficiency and performance in continuous control tasks. It utilizes a learned task-oriented latent dynamics model and a terminal value function to optimize trajectories over short horizons while estimating long-term returns. The proposed method demonstrates superior performance in various tasks compared to prior approaches, achieving effective planning with reduced computational costs.

Uploaded by

wptjcks15
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 20

Online Planning

Observation

Temporal Difference Learning for Model Predictive Control t=0

LVM Environment
1 *1 *1
Nicklas Hansen Xiaolong Wang Hao Su
🕹 🕹

Abstract TD-MPC +
MPC Reward Value
z0
Observation
Data-driven model predictive control has two key
advantages over model-free methods: a potential z0 zH 🕹
arXiv:2203.04955v2 [cs.LG] 19 Jul 2022

for improved sample efficiency through model


Action
learning, and better performance as computational Learned Model
budget for planning increases. However, it is both 🕹
costly to plan over long horizons and challeng-
ing to obtain an accurate model of the environ- Reward
Environment value reward
ment. In this work, we combine the strengths of
Humanoid Stand Humanoid Walk Humanoid Run
1000 1000 400
model-free and model-based methods. We use a
learned task-oriented latent dynamics model for Episode return
750 750 300

local trajectory optimization over a short hori- 500 500 200

zon, and use a learned terminal value function 250 250 100

to estimate long-term return, both of which are 0 0 0


0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

learned jointly by temporal difference learning. 1000


Environment steps (×106)
Dog Walk
1000
Environment steps (×106)
Dog Trot
600
Environment steps (×106)
Dog Run
SAC MPC:sim TD−MPC (ours)
Our method, TD-MPC, achieves superior sam-
750 750 450
Episode return

ple efficiency and asymptotic performance over


500 500 300
prior work on both state and image-based con-
tinuous control tasks from DMControl and Meta- 250 250 150

World. Code and videos are available at https: 0


0 1 2 3 4 5
0
0 1 2 3 4 5
0
0 1 2 3 4 5
Environment steps (×106) Environment steps (×106) Environment steps (×106)
//nicklashansen.github.io/td-mpc. SAC MPC:sim TD−MPC (ours)

Figure 1. Overview. (Top) We present a framework for MPC using


a task-oriented latent dynamics model and value function learned
1. Introduction jointly by temporal difference learning. We perform trajectory
optimization over model rollouts and use the value function for
To achieve desired behavior in an environment, a Reinforce-
long-term return estimates. (Bottom) Episode return of our method,
ment Learning (RL) agent needs to iteratively interact and SAC, and MPC with a ground-truth simulator on challenging, high-
consolidate knowledge about the environment. Planning is a dimensional Humanoid and Dog tasks (Tassa et al., 2018). Mean
powerful approach to such sequential decision making prob- of 5 runs; shaded areas are 95% confidence intervals.
lems, and has achieved tremendous success in application
areas such as game-playing (Kaiser et al., 2020; Schrit- vantages of model-based learning: (i) planning, which is ad-
twieser et al., 2020) and continuous control (Tassa et al., vantageous over a learned policy, but it can be prohibitively
2012; Chua et al., 2018; Janner et al., 2019). By utilizing expensive to plan over long horizons (Janner et al., 2019;
an internal model of the environment, an agent can plan a Lowrey et al., 2019; Hafner et al., 2019; Argenson & Dulac-
trajectory of actions ahead of time that leads to the desired Arnold, 2021); and (ii) using a learned model to improve
behavior; this is in contrast to model-free algorithms that sample-efficiency of model-free methods by e.g. learning
learn a policy purely through trial-and-error. from generated rollouts, but this makes model biases likely
to propagate to the policy as well (Ha & Schmidhuber, 2018;
Concretely, prior work on model-based methods can largely Hafner et al., 2020b; Clavera et al., 2020). As a result,
be subdivided into two directions, each exploiting key ad- model-based methods have historically struggled to outper-
*
Equal contribution 1 UC San Diego. Correspondence to: Nick- form simpler, model-free methods (Srinivas et al., 2020;
las Hansen <[email protected]>. Kostrikov et al., 2020) in continuous control tasks.

Proceedings of the 39 th International Conference on Machine Can we instead augment model-based planning with the
Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copy- strengths of model-free learning? Because of the immense
right 2022 by the author(s). cost of long-horizon planning, Model Predictive Control
TD-Learning for MPC

(MPC) optimizes a trajectory over a shorter, finite horizon, transition (dynamics) function, R : S × A 7→ R is a reward
which yields only temporally local optimal solutions. MPC function, γ ∈ [0, 1) is a discount factor, and p0 is the initial
can be extended to approximate globally optimal solutions state distribution. We aim to learn a parameterized map-
by using a terminal value function that estimates discounted ping Πθ : S 7→PA with parameters θ such that discounted

return beyond the planning horizon. However, obtaining an return EΓ∼Πθ [ t=1 γ t rt ], rt ∼ R(·|st , at ) is maximized
accurate model and value function can be challenging. along a trajectory Γ = (s0 , a0 , s1 , a1 , . . . ) following Πθ
by sampling an action at ∼ Πθ (·|st ) and reaching state
In this work, we propose Temporal Difference Learning
st+1 ∼ T (·|st , at ) at each decision step t.
for Model Predictive Control (TD-MPC), a framework
for data-driven MPC using a task-oriented latent dynamics Fitted Q-iteration. Model-free TD-learning algorithms
model and terminal value function learned jointly by tem- aim to estimate an optimal state-action value func-
poral difference (TD) learning. At each decision step, we tion Q∗ : S × A 7→ R using a parametric value
perform trajectory optimization using short-term reward es- function Qθ (s, a) ≈ Q∗ (s, a) = maxa0 E[R(s, a) +
timates generated by the learned model, and use the learned γQ∗ (s0 , a0 )] ∀s ∈ S where s0 , a0 is the state and action
value function for long-term return estimates. For exam- at the following step, and θ parameterizes the function (Sut-
ple, in the Humanoid locomotion task shown in Figure 1, ton, 2005). For γ ≈ 1, Q∗ estimates discounted return for
planning with a model may be beneficial for accurate joint the optimal policy over an infinite horizon. While Q∗ is
movement, whereas the higher-level objective, e.g. direction generally unknown, it can be approximated by repeatedly
of running, can be guided by long-term value estimates. fitting Qθ using the update rule
A key technical contribution is how the model is learned. θk+1 ← arg min E(s,a,s0 )∼B kQθ (s, a) − yk22 (1)
While prior work learns a model through state or video pre- θ
diction, we argue that it is remarkably inefficient to model
everything in the environment, including irrelevant quanti- where the Q-target y = R(s, a) + γ maxa0 Qθ− (s0 , a0 ), B
ties and visuals such as shading, as this approach suffers is a replay buffer that is iteratively grown as new data is
from model inaccuracies and compounding errors. To over- collected, and θ− is a slow-moving average of the online

come these challenges, we make three key changes to model parameters θ updated with the rule θk+1 ←− (1 − ζ)θk− +
learning. Firstly, we learn the latent representation of the ζθk at each iteration using a constant coefficient ζ ∈ [0, 1).
dynamics model purely from rewards, ignoring nuances Model Predictive Control. In actor-critic RL algo-
unnecessary for the task at hand. This makes the learning rithms, Π is typically a policy parameterized by a
more sample efficient than state/image prediction. Sec- neural network that learns to approximate Πθ (·|s) ≈
ondly, we back-propagate gradients from the reward and arg maxa E[Qθ (s, a)] ∀s ∈ S, i.e, the globally optimal pol-
TD-objective through multiple rollout steps of the model, icy. In control, Π is traditionally implemented as a trajectory
improving reward and value predictions over long horizons. optimization procedure. To make the problem tractable, one
This alleviates error compounding when conducting rollouts. typically obtains a local solution to the trajectory optimiza-
Lastly, we propose a modality-agnostic prediction loss in la- tion problem at each step t by estimating optimal actions
tent space that enforces temporal consistency in the learned at:t+H over a finite horizon H and executing the first action
representation without explicit state or image prediction. at , known as Model Predictive Control (MPC):
We evaluate our method on a variety of continuous control "H #
tasks from DMControl (Tassa et al., 2018) and Meta-World
X
ΠMPC
θ (st ) = arg max E γ i R(si , ai ) , (2)
(Yu et al., 2019), where we find that our method achieves at:t+H
i=t
superior sample efficiency and asymptotic performance over
prior model-based and model-free methods. In particular, where γ, unlike in fitted Q-iteration, is typically set to 1,
our method solves Humanoid and Dog locomotion tasks i.e., no discounting. Intuitively, Equation 2 can be viewed
with up to 38-dimensional continuous action spaces in as as a special case of the standard additive-cost optimal con-
little as 1M environment steps (see Figure 1), and is trivially trol objective. A solution can be found by iteratively fitting
extended to match the state-of-the-art in image-based RL. parameters of a family of distributions, e.g., µ, σ for a mul-
tivariate Gaussian with diagonal covariance, to the space
of actions over a finite horizon using the derivative-free
2. Preliminaries Cross-Entropy Method (CEM; Rubinstein (1997)), and sam-
Problem formulation. We consider infinite-horizon ple trajectories generated by a model. As opposed to fitted
Markov Decision Processes (MDP) characterized by a tuple Q-iteration, Equation 2 is not predictive of long-term re-
(S, A, T , R, γ, p0 ), where S ∈ Rn and A ∈ Rm are contin- wards, hence a myopic solution. When a value function is
uous state and action spaces, T : S × A × S 7→ R+ is the known (e.g. a heuristic or in the context of our method: esti-
mated using Equation 1), it can be used in conjunction with
TD-Learning for MPC

Equation 2 to estimate discounted return at state st+H and Algorithm 1 TD-MPC (inference)
beyond; such methods are known as MPC with a terminal Require: θ : learned network parameters
value function. In the following, we consider parameterized µ0 , σ 0 : initial parameters for N
mappings Π from both the perspective of actor-critic RL N, Nπ : num sample/policy trajectories
algorithms and model predictive control (planning). To dis- st , H: current state, rollout horizon
ambiguate these concepts, we refer to planning with MPC 1: Encode state zt ← hθ (st ) C Assuming TOLD model
as Πθ and a policy network as πθ . We generically denote 2: for each iteration j = 1..J do
parameterization using neural networks as θ (online) and 3: Sample N traj. of len. H from N (µj−1 , (σ j−1 )2 I)
θ− (target; slow-moving average of θ) as combined feature 4: Sample Nπ traj. of length H using πθ , dθ
vectors. // Estimate trajectory returns φΓ using dθ , Rθ , Qθ ,
starting from zt and initially letting φΓ = 0:
3. TD-Learning for Model Predictive Control 5: for all N + Nπ trajectories (at , at+1 , . . . , at+H ) do
6: for step t = 0..H − 1 do
We propose TD-MPC, a framework that combines MPC
7: φΓ = φΓ + γ t Rθ (zt , at ) C Reward
with a task-oriented latent dynamics model and terminal
8: zt+1 ← dθ (zt , at ) C Latent transition
value function jointly learned using TD-learning in an on-
9: φΓ = φΓ + γ H Qθ (zH , aH ) C Terminal value
line RL setting. Specifically, TD-MPC leverages Model
// Update parameters µ, σ for next iteration:
Predictive Path Integral (MPPI; Williams et al. (2015)) con-
10: µj , σ j = Equation 4 (and Equation 5)
trol for planning (denoted Πθ ), learned models dθ , Rθ of the
11: return a ∼ N (µJ , (σ J )2 I)
(latent) dynamics and reward signal, respectively, a terminal
state-action value function Qθ , and a parameterized policy
πθ that helps guide planning. We summarize our frame- each decision step t and execute only the first action, i.e., we
work in Figure 1 and Algorithm 1. In this section, we detail employ receding-horizon MPC to produce a feedback policy.
the inference-time behavior of our method, while we defer To reduce the number of iterations required for convergence,
discussion of training to Section 4. we “warm start” trajectory optimization at each step t by
reusing the 1-step shifted mean µ obtained at the previous
MPPI is an MPC algorithm that iteratively updates pa- step (Argenson & Dulac-Arnold, 2021), but always use a
rameters for a family of distributions using an importance large initial variance to avoid local minima.
weighted average of the estimated top-k sampled trajectories
(in terms of expected return); in practice, we fit parameters Exploration by planning. Model-free RL algorithms such
of a time-dependent multivariate Gaussian with diagonal as DDPG (Lillicrap et al., 2016) encourage exploration by
covariance. We adapt MPPI as follows. Starting from ini- injecting action noise (e.g. Gaussian or Ornstein-Uhlenbeck
tial parameters (µ0 , σ 0 )t:t+H , µ0 , σ 0 ∈ Rm , A ∈ Rm , i.e. noise) into the learned policy πθ during training, optionally
independent parameters for each action over a horizon of following a linear annealing schedule. While our trajectory
length H, we independently sample N trajectories using optimization procedure is inherently stochastic due to tra-
rollouts generated by the learned model dθ , and estimate the jectory sampling, we find that the rate at which σ decays
total return φΓ of a sampled trajectory Γ as varies wildly between tasks, leading to (potentially poor)
" # local optima for small σ. To promote consistent exploration
H−1
H
X
t across tasks, we constrain the std. deviation of the sampling
φΓ , EΓ γ Qθ (zH , aH ) + γ Rθ (zt , at ) , (3)
distribution such that, for a µj obtained from Equation 4 at
t=0
iteration j, we instead update σ j to
where zt+1 = dθ (zt , at ) and at ∼ N (µj−1 t , (σtj−1 )2 I) at v 
iteration j − 1, as highlighted in red in Algorithm 1. We u PN
u ?
Ωi (Γ − µ )j 2
select the top-k returns φ?Γ and obtain new parameters µj , σ j σ j = max t i=1PN i ,  , (5)
at iteration j from a φ?Γ -normalized empirical estimate: i=1 Ωi
v
Pk ?
u Pk
Ωi (Γ? − µj )2 where  ∈ R+ is a linearly decayed constant. Likewise, we
i=1 Ωi Γi
u
j j
µ = Pk , σ = t i=1Pk i , (4) linearly increase the planning horizon from 1 to H in the
i=1 Ωi i=1 Ωi early stages of training, as the model is initially inaccurate
? and planning would therefore be dominated by model bias.
where Ωi = eτ (φΓ,i ) , τ is a temperature parameter control-
ling the “sharpness” of the weighting, and Γ?i denotes the ith Policy-guided trajectory optimization. Analogous to
top-k trajectory corresponding to return estimate φ?Γ . After Schrittwieser et al. (2020); Sikchi et al. (2022), TD-MPC
a fixed number of iterations J, the planning procedure is learns a policy πθ in addition to planning procedure Πθ , and
terminated and a trajectory is sampled from the final return- augments the sampling procedure with additional samples
normalized distribution over action sequences. We plan at from πθ (highlighted in blue in Algorithm 1). This leads to
Online Planning
Observation
TD-Learning for MPC

one of two cases: the policy trajectory is estimated to be (i) t=0 t=1 t=2
poor, and may be excluded from the top-k trajectories; or (ii)
good, and may be included with influence proportional to its
estimated return φΓ . While LOOP relies on the LVM maximum
Environment
entropy objective of SAC (Haarnoja et al., 2018) for ex-
ploration, TD-MPC learns a deterministic policy. To make 🕹 🕹 🕹
sampling stochastic, we apply linearly annealed (Gaussian)
noise to πθ actions as in DDPG (Lillicrap et al., 2016). Our
TD-MPC +
full procedure is summarized in Algorithm 1.
Observation
Reward Value
z0 z1 z2 …
4. Task-Oriented Latent Dynamics Model
z 0 z H 🕹
To be used in conjunction with TD-MPC, we propose a
Action
Task-Oriented Latent Dynamics (TOLD) model that is
Learned Model
jointly learned together with a terminal value function using
TD-learning. Rather than attempting to model the environ-
🕹 🕹 🕹
ment itself, our TOLD model learns to only model elements 🕹
Reward
of the environment that are predictive of reward, which
Environment value reward action online net target net
is a far easier problem. During inference, our TD-MPC
Figure 2. Training our TOLD model. A trajectory Γ0:H of length
framework leverages the learned TOLD model for trajectory
H is sampled from a replay buffer, and the first observation s0
optimization, estimating short-term rewards using model
is encoded by hθ into a latent representation z0 . Then, TOLD
rollouts and long-term returns using the terminal value func- recurrently predicts the following latent states z1 , z2 , . . . , zH , as
tion. TD-MPC and TOLD support continuous action spaces, well as a value q̂, reward r̂, and action â for each latent state, and
arbitrary input modalities, and sparse reward signals. Figure we optimize TOLD using Equation 7. Subsequent observations
2 provides an overview of the TOLD training procedure. are encoded using target net hθ− (θ− : slow-moving average of θ)
and used as latent targets only during training (illustrated in gray).
Components. Throughout training, our agent iteratively
performs the following two operations: (i) improving the Objective. We first state the full objective, and then mo-
learned TOLD model using data collected from previous tivate each module and associated objective term. During
environment interaction; and (ii) collecting new data from training, we minimize a temporally weighted objective
the environment by online planning of action sequences with
t+H
TD-MPC, using TOLD for generating imagined rollouts. X
Our proposed TOLD consists of five learned components J (θ; Γ) = λi−t L(θ; Γi ) , (7)
i=t
hθ , dθ , Rθ , Qθ , πθ that predict the following quantities:
where Γ ∼ B is a trajectory (st , at , rt , st+1 )t:t+H sampled
Representation: zt = hθ (st ) from a replay buffer B, λ ∈ R+ is a constant that weights
Latent dynamics: zt+1 = dθ (zt , at ) near-term predictions higher, and the single-step loss
Reward: r̂t = Rθ (zt , at ) (6)
Value: q̂t = Qθ (zt , at ) L(θ; Γi ) = c1 kRθ (zi , ai ) − ri k22 (8)
Policy: ât ∼ πθ (zt ) | {z
reward
}

Given an observation st observed at time t, a representation + c2 kQθ (zi , ai ) − (ri + γQθ− (zi+1 , πθ (zi+1 ))) k22 (9)
| {z }
network hθ encodes st into a latent representation zt . From value
zt and an action at taken at time t, TOLD then predicts (i) + c3 kdθ (zi , ai ) − hθ− (si+1 )k22 (10)
the latent dynamics (latent representation zt+1 of the fol- | {z }
latent state consistency
lowing timestep); (ii) the single-step reward received; (iii)
its state-action (Q) value; and (iv) an action that (approx- is employed to jointly optimize for reward prediction, value
imately) maximizes the Q-function. To make TOLD less prediction, and a latent state consistency loss that regu-
susceptible to compounding errors, we recurrently predict larizes the learned representation. Here, c1:3 are constant
the aforementioned quantities multiple steps into the future coefficients balancing the three losses. From each tran-
from predicted future latent states, and back-propagate gra- sition (zi , ai ), the reward term (Equation 8) predicts the
dients through time. Unlike prior work (Ha & Schmidhuber, single-step reward, the value term (Equation 9) is our adop-
2018; Janner et al., 2019; Hafner et al., 2019; 2020b; Sikchi tion of fitted Q-iteration from Equation 1 following previ-
et al., 2022), we find it sufficient to implement all compo- ous work on actor-critic algorithms (Lillicrap et al., 2016;
nents of TOLD as purely deterministic MLPs, i.e., without Haarnoja et al., 2018), and the consistency term (Equa-
RNN gating mechanisms nor probabilistic models. tion 10) predicts the latent representation of future states.
TD-Learning for MPC

Crucially, recurrent predictions are made entirely in latent Algorithm 2 TOLD (training)
space from states zi = hθ (si ), zi+1 = dθ (zi , ai ), . . . ,
zi+H = dθ (zi+H−1 , ai+H−1 ) such that only the first obser- Require: θ, θ− : randomly initialized network parameters
vation si is encoded using hθ and gradients from all three η, τ, λ, B: learning rate, coefficients, buffer
terms are back-propagated through time. This is in contrast 1: while not tired do
to prior work on model-based learning that learn a model 2: // Collect episode with TD-MPC from s0 ∼ p0 :
by state or video prediction, entirely decoupled from policy 3: for step t = 0...T do
and/or value learning (Ha & Schmidhuber, 2018; Hafner 4: at ∼ Πθ (·|hθ (st )) C Sample with TD-MPC
et al., 2020b; Sikchi et al., 2022). We use an exponential 5: (st+1 , rt ) ∼ T (·|st , at ), R(·|st , at ) C Step env.
moving average θ− of the online network parameters θ 6: B ← B ∪ (st , at , rt , st+1 ) C Add to buffer
for computing the value target (Lillicrap et al., 2016), and 7: // Update TOLD using collected data in B:
similarly also use θ− for the latent state consistency target 8: for num updates per episode do
hθ− (si+1 ). The policy πθ is described next, while we defer 9: {st , at , rt , st+1 }t:t+H ∼ B C Sample traj.
discussion of the consistency loss to the following section. 10: zt = hθ (st ) C Encode first observation
11: J =0 C Initialize J for loss accumulation
Computing TD-targets. The TD-objective in Equation 9 12: for i = t...t + H do
requires estimating the quantity maxat Qθ− (zt , at ), which 13: r̂i = Rθ (zi , ai ) C Equation 8
is extremely costly to compute using planning (Lowrey 14: q̂i = Qθ (zi , ai ) C Equation 9
et al., 2019). Therefore, we instead learn a policy πθ that 15: zi+1 = dθ (zi , ai ) C Equation 10
maximizes Qθ by minimizing the objective 16: âi = πθ (zi ) C Equation 11
t+H 17: J ← J + λi−t L(zi+1 , r̂i , q̂i , âi ) C Equation 7
θ ← θ − H1 η∇θ J
X
Jπ (θ; Γ) = − λi−t Qθ (zi , πθ (sg(zi ))) , (11) 18: C Update online network
i=t 19: θ− ← (1 − τ )θ− + τ θ C Update target network
which is a temporally weighted adaptation of the policy
objective commonly used in model-free actor-critic methods Meta-World v2 (Yu et al., 2019), including tasks with sparse
such as DDPG (Lillicrap et al., 2016) and SAC (Haarnoja rewards, high-dimensional state and action spaces, image
et al., 2018). Here, sg denotes the stop-grad operator, and observations, multi-modal inputs, goal-conditioning, and
Equation 11 is optimized only wrt. policy parameters. While multi-task learning settings; see Appendix L for task visual-
we empirically observe that for complex tasks the learned izations. We choose these two benchmarks for their great
πθ is inferior to planning (discussed in Section 5), we find task diversity and availability of baseline implementations
it sufficiently expressive for efficient value learning. and results. We seek to answer the following questions:
Latent state consistency. To provide a rich learning signal − How does planning with TD-MPC compare to state-of-
for model learning, prior work on model-based RL com- the-art model-based and model-free approaches?
monly learn to directly predict future states or pixels (Ha − Are TOLD models capable of multi-task and transfer
& Schmidhuber, 2018; Janner et al., 2019; Lowrey et al., behaviors despite using a reward-centric objective?
2019; Kaiser et al., 2020; Sikchi et al., 2022). However,
− How does performance relate to the computational budget
learning to predict future observations is an extremely hard
of the planning procedure?
problem as it forces the network to model everything in the
environment, including task-irrelevant quantities and details An implementation of TD-MPC is available at https://
such as shading. Instead, we propose to regularize TOLD nicklashansen.github.io/td-mpc, which will
with a latent state consistency loss (shown in Equation 10) solve most tasks in an hour on a single GPU.
that forces a future latent state prediction zt+1 = dθ (zt , at )
Implementation details. All components are deterministic
at time t + 1 to be similar to the latent representation of the
and implemented using MLPs. We linearly anneal the ex-
corresponding ground-truth observation hθ− (st+1 ), circum-
ploration parameter  of Πθ and πθ from 0.5 to 0.05 over
venting prediction of observations altogether. Additionally,
the first 25k decision steps1 . We use a planning horizon of
this design choice effectively makes model learning agnos-
H = 5, and sample trajectories using prioritized experience
tic to the observation modality. The training procedure is
replay (Schaul et al., 2016) with priority scaled by the value
shown in Algorithm 2; see Appendix F for pseudo-code.
loss. During planning, we plan for 6 iterations (8 for Dog;
12 for Humanoid), sampling N = 512 trajectories (+5%
5. Experiments sampled from πθ ), and we compute µ, σ parameters over
We evaluate TD-MPC with a TOLD model on a total of 92 1
To avoid ambiguity, we refer to simulation steps as environ-
diverse and challenging continuous control tasks from Deep- ment steps (independent of action repeat), and use decision steps
Mind Control Suite (DMControl; Tassa et al. (2018)) and when referring to policy queries (dependent on action repeat).
TD-Learning for MPC

Average Acrobot Swingup Cartpole Swingup Cartpole Swingup Sparse


1000 400 1000 1000

750 300 750 750


Episode return

500 200 500 500

250 100 250 250

0 0 0 0
Cheetah Run Cup Catch Finger Spin Finger Turn Hard
1000 1000 1000 1000

750 750 750 750


Episode return

500 500 500 500

250 250 250 250

0 0 0 0
Fish Swim Hopper Hop Quadruped Run Quadruped Walk
1000 600 1000 1000

750 450 750 750


Episode return

500 300 500 500

250 150 250 250

0 0 0 0
Reacher Easy Reacher Hard Walker Run Walker Walk
1000 1000 1000 1000

750 750 750 750


Episode return

500 500 500 500

250 250 250 250

0 0 0 0
0 100 200 300 400 500 0 100 200 300 400 500 0 100 200 300 400 500 0 100 200 300 400 500
Environment steps (×103) Environment steps (×103) Environment steps (×103) Environment steps (×103)

SAC LOOP MPC:sim TD-MPC (no latent) TD-MPC (no reg.) TD−MPC (ours)

Figure 3. DMControl tasks. Return of our method (TD-MPC) and baselines on 15 state-based continuous control tasks from DMControl
(Tassa et al., 2018). Mean of 5 runs; shaded areas are 95% confidence intervals. In the top left, we visualize results averaged across all 15
tasks. We observe especially large performance gains on tasks with complex dynamics, e.g., the Quadruped and Acrobot tasks.
the top-64 trajectories each iteration. For image-based tasks, we limit the planning horizon to 10 (2× ours), sampled tra-
observations are 3 stacked 84×84-dimensional RGB frames jectories to 200, and optimize for 4 iterations (ours: 6).
and we use ±4 pixel shift augmentation (Kostrikov et al., − CURL (Srinivas et al., 2020), DrQ (Kostrikov et al.,
2020). Refer to Appendix F for additional details. 2020), and DrQ-v2 (Yarats et al., 2021), three state-of-the-
Baselines. We evaluate our method against the following: art model-free algorithms.

− Soft Actor-Critic (SAC; (Haarnoja et al., 2018)), a state- − PlaNet (Hafner et al., 2019), Dreamer (Hafner et al.,
of-the-art model-free algorithm derived from maximum en- 2020b), and Dreamer-v2 (Hafner et al., 2020a). All three
tropy RL (Ziebart et al., 2008). We choose SAC as our main methods learn a model using a reconstruction loss, and se-
point of comparison due to its popularity and strong perfor- lect actions using either MPC or a learned policy.
mance on both DMControl and Meta-World. In particular, − MuZero (Schrittwieser et al., 2020) and EfficientZero
we adopt the implementation of Yarats & Kostrikov (2020). (Ye et al., 2021), which learn a latent dynamics model from
− LOOP (Sikchi et al., 2022), a hybrid algorithm that ex- rewards and uses MCTS for discrete action selection.
tends SAC with planning and a learned model. LOOP has − Ablations. We consider: (i) our method implemented
been shown to outperform a number of model-based meth- using a state predictor (hθ being the identity function), (ii)
ods, e.g., MBPO (Janner et al., 2019) and POLO (Lowrey our method implemented without the latent consistency loss
et al., 2019)) on select MuJoCo tasks. It is a particularly from Equation 10, and lastly: the consistency loss replaced
relevant baseline due to its similarities to TD-MPC. by either (iii) the reconstruction objective of PlaNet and
− MPC with a ground-truth simulator (denoted MPC:sim). Dreamer, or (iv) the contrastive objective of EfficientZero.
As planning with a simulator is computationally intensive, See Appendix G for further discussion on baselines.
TD-Learning for MPC

Table 1. Learning from pixels. Return of our method (TD-MPC) and state-of-the-art algorithms on the image-based DMControl 100k
benchmark used in Srinivas et al. (2020); Kostrikov et al. (2020); Ye et al. (2021). Baselines are tuned specifically for image-based RL,
whereas our method is not. Results for SAC, CURL, DrQ, and PlaNet are partially obtained from Srinivas et al. (2020); Kostrikov et al.
(2020), and results for Dreamer, MuZero, and EfficientZero are obtained from Hafner et al. (2020b); Ye et al. (2021). Mean and std.
deviation over 10 runs. *: MuZero and EfficientZero use a discretized action space, and EfficientZero performs an additional 20k gradient
steps before evaluation, whereas other methods do not. Due to dimensionality explosion under discretization, MuZero and EfficientZero
cannot feasibly solve tasks with higher-dimensional action spaces, e.g., Walker Walk and Cheetah Run (A ∈ R6 ), while our method can.
Model-free Model-based Ours
100k env. steps SAC State SAC Pixels CURL DrQ PlaNet Dreamer MuZero* Eff.Zero* TD-MPC
Cartpole Swingup 812±45 419±40 597±170 759±92 563±73 326±27 219±122 813±19 770±70
Reacher Easy 919±123 145±30 517±113 601±213 82±174 314±155 493±145 952±34 628±105
Cup Catch 957±26 312±63 772±241 913±53 710±217 246±174 542±270 942±17 933±24
Finger Spin 672±76 166±128 779±108 901±104 560±77 341±70 − − 943±59
Walker Walk 604±317 42±12 344±132 612±164 221±43 277±12 − − 577±208
Cheetah Run 228±95 103±38 307±48 344±67 165±123 235±137 − − 222±88

Cup Catch Finger Spin Finger Turn Easy Finger Turn Hard
1000 1000 1000 1000

750 750 750 750


Episode return

500 500 500 500

250 250 250 250

0 0 0 0
Walker Stand Walker Walk Quadruped Run Quadruped Walk
1000 1000 1000 1000

750 750 750 750


Episode return

500 500 500 500

250 250 250 250

0 0 0 0
Pendulum Swingup Cheetah Run Reacher Easy Walker Run
1000 1000 1000 1000

750 750 750 750


Episode return

500 500 500 500

250 250 250 250

0 0 0 0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0
Environment steps (×106) Environment steps (×106) Environment steps (×106) Environment steps (×106)

CURL DrQ DrQ-v2 Dreamer-v2 TD−MPC (ours)

Figure 4. Learning from pixels. Return of our method (TD-MPC) and state-of-the-art algorithms on 12 challenging image-based
DMControl tasks. We follow prior work (Hafner et al., 2020b;a; Yarats et al., 2021) and use an action repeat of 2 for all tasks. Compared
to the DMControl 100k benchmark shown in Table 1, we here consider more difficult tasks with up to 30× more data. Results for DrQ-v2
and Dreamer-v2 are obtained from Yarats et al. (2021); Hafner et al. (2020a), results for DrQ are partially obtained from Kostrikov et al.
(2020), and results for CURL are reproduced using their publicly available implementation (Srinivas et al., 2020). While baselines use
task-dependent hyperparameters, TD-MPC uses the same hyperparameters for all tasks. Mean of 5 runs; shaded areas are 95% confidence
intervals. TD-MPC consistently outperforms CURL and DrQ, and is competitive with DrQ-v2 and Dreamer-v2.

Tasks. We consider the following 92 tasks: benchmark (3M environment steps). Results in Figure 4.
− 6 challenging Humanoid (A ∈ R ) and Dog (A ∈ R ) 21 38 − 2 multi-modal (proprioceptive data + egocentric camera)
locomotion tasks with high-dimensional state and action 3D locomotion tasks in which a quadruped agent navigates
spaces. Results are shown in Figure 1. around obstacles. Results are shown in Figure 5 (middle).
− 15 diverse continuous control tasks from DMControl, 6 − 50 goal-conditioned manipulation tasks from Meta-
of which have sparse rewards. Results shown in Figure 3. World, as well as a multi-task setting where 10 tasks are
learned simultaneously. Results are shown in Figure 5 (top).
− 6 image-based tasks from the data-efficient DMControl
100k benchmark. Results are shown in Table 1. Throughout, we benchmark performance on relatively few
− 12 image-based tasks from the DMControl Dreamer environment steps, e.g., 3M steps for Humanoid tasks
whereas prior work typically runs for 30M steps (10×).
TD-Learning for MPC

Meta-World (Goal-Conditioned) MT10 (Multi-Task)


Comparison to other methods. We find our method to 1.00 1.00

outperform or match baselines in most tasks considered, 0.75 0.75


generally with larger gains on complex tasks such as Hu-

Success rate
0.50 0.50
manoid, Dog (DMControl), and Bin Picking (Meta-World),
and we note that TD-MPC is in fact the first documented 0.25 0.25

result solving the complex Dog tasks of DMControl. Per-


0.00 0.00
formance of LOOP is similar to SAC, and MPC with a 0.0 0.2 0.4 0.6 0.8 1.0 0 1 2 3
Environment steps (×106) Environment steps (×106)
Quadruped Corridor Quadruped Obstacles
simulator (MPC:sim) performs well on locomotion tasks 600
SAC
600
TD−MPC (ours)
but fails in tasks with sparse rewards. Although we did
450 450
not tune our method specifically for image-based RL, we

Episode return
obtain results competitive with state-of-the-art model-based 300 300

and model-free algorithms that are both carefully tuned 150 150
for image-based RL and contain up to 15× more learnable
0 0
parameters. Notably, while EfficientZero produces strong 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5
Environment steps (×106) Environment steps (×106)
results on tasks with low-dimensional action spaces, its
SAC MPC:sim TD-MPC (blind) TD−MPC (ours)
Monte-Carlo Tree Search (MCTS) requires discretization
of action spaces, which is unfeasible in high dimensions. Figure 5. (top) Meta-World. Success rate on 50 goal-conditioned
Meta-World tasks using individual policies, and a multi-task policy
In contrast, TD-MPC scales remarkably well to the 38-
trained on 10 tasks simultaneously (Meta-World MT10). Individ-
dimensional continuous action space of Dog tasks. Lastly, ual task results shown in Appendix I. (bottom) Multi-modal RL.
we observe inferior sample efficiency compared to SAC and Episode return of TD-MPC on two multi-modal locomotion tasks
LOOP on the hard exploration task Finger Turn Hard in using proprioceptive data + an egocentric camera. Blind uses only
Figure 3, which suggests that incorporating more sophisti- proprioceptive data. See Appendix L for visualizations. All results
cated exploration strategies might be promising for future are means of 5 runs; shaded areas are 95% confidence intervals.
research. We defer experiments that ablate the choice of reg- 1000
Quadruped Walk
1000
Quadruped Walk

ularization loss to Appendix D, but find our proposed latent


Episode return

900 900
state consistency loss to yield the most consistent results.
800 800

Multi-task RL, multi-modal RL, and generalization. A 700 700

common argument in favor of general-purpose models is 600 600


1 3 5 7 9 1 3 5 7 9
that they can benefit from data-sharing across tasks. There- Horizon CEM iterations

fore, we seek to answer the following question: does TOLD Horizon CEM iterations Default Policy

similarly benefit from synergies between tasks, despite its Figure 6. Variable computational budget. Return of TD-MPC
reward-centric objective? We test this hypothesis through on Quadruped Walk under a variable budget. We evaluate perfor-
two experiments: training a single policy to perform 10 mance of fully trained agents when varying (left) planning horizon;
different tasks simultaneously (Meta-World MT10), and (right) number of iterations during planning. When varying one
hyperparameter, the other is fixed to the default value. We include
evaluating model generalization when trained on one task
evaluation of the learned policy πθ , and the default setting of 6
(Walk) and transferring to a different task from the same do-
iterations and a horizon of 5 used in training. Mean of 5 runs.
main (Run). Multi-task results are shown in Figure 5 (top),
and transfer results are deferred to Appendix A. We find our have access to the egocentric camera fails. See Appendix
method to benefit from data sharing in both experiments, J for further details on the multi-modal experiments, and
and our transfer results indicate that hθ generalizes well to Appendix I for details on the multi-task experiments.
new tasks, while dθ encodes more task-specific behavior.
We conjecture that, while TOLD only learns features that are Performance vs. computational budget. We investigate
predictive of reward, similar tasks often have similar reward the relationship between computational budget (i.e., plan-
structures, which enables sharing of information between ning horizon and number of iterations) and performance in
tasks. However, we still expect general-purpose models to DMControl tasks; see Figure 6. We find that, for complex
benefit more from unrelated tasks in the same environment tasks such as Quadruped Walk (A ∈ R12 ), more planning
than TOLD. Lastly, an added benefit of our task-centric generally leads to better performance. However, we also ob-
objective is that it is agnostic to the input modality. To serve that we can reduce the planning cost during inference
demonstrate this, we solve two multi-modal (proprioceptive by 50% (compared to during training) without a drop in
data + egocentric camera) locomotion tasks using TD-MPC; performance by reducing the number of iterations. For par-
results in Figure 5 (bottom). We find that TD-MPC suc- ticularly fast inference, one can discard planning altogether
cessfully fuses information from the two input modalities, and simply use the jointly learned policy πθ ; however, πθ
and solves the tasks. In contrast, a blind agent that does not generally performs worse than planning. See Appendix C
for additional results.
TD-Learning for MPC

Table 2. Wall-time. (top) time to solve, and (bottom) time per 2021) learn a latent dynamics model using reward predic-
500k environment steps (in hours) for the Walker Walk and Hu- tion. EfficientZero is most similar to ours in terms of model
manoid Stand tasks from DMControl. We consider the tasks solved learning, but its MCTS-based action selection is inherently
when a method achieves an average return of 940 and 800, respec- incompatible with continuous action spaces. Finally, while
tively. TD-MPC solves Walker Walk 16× faster than LOOP while learning a terminal value function for MPC has previously
using 3.3× less compute per 500k steps. Mean of 5 runs. been proposed (Negenborn et al., 2005; Lowrey et al., 2019;
Walker Walk Humanoid Stand Bhardwaj et al., 2020; Hatch & Boots, 2021), we are (to the
Wall-time (h) SAC LOOP MPC:sim TD-MPC SAC TD-MPC best of our knowledge) the first to jointly learn model and
time to solve ↓ 0.41 7.72 0.91 0.47 9.31 9.39 value function through TD-learning in continuous control.
h/500k steps ↓ 1.41 18.5 − 5.60 1.82 12.94
Hybrid algorithms. Several prior works aim to develop
algorithms that combine model-free and model-based ele-
Training wall-time. To better ground our results, we report ments (Nagabandi et al., 2018; Buckman et al., 2018; Pong
the training wall-time of TD-MPC compared to SAC, LOOP et al., 2018; Hafez et al., 2019; Sikchi et al., 2022; Wang &
that is most similar to our method, and MPC with a ground- Ba, 2020; Clavera et al., 2020; Hansen et al., 2021; Morgan
truth simulator (non-parametric). Methods are benchmarked et al., 2021; Bhardwaj et al., 2021; Margolis et al., 2021),
on a single RTX3090 GPU. Results are shown in Table 2. many of which are orthogonal to our contributions. For
TD-MPC solves Walker Walk 16× faster than LOOP and example, Clavera et al. (2020) and Buckman et al. (2018);
matches the time-to-solve of SAC on both Walker Walk Lowrey et al. (2019) use a learned model to improve policy
and Humanoid Stand while being significantly more sample and value learning, respectively, through generated trajecto-
efficient. Thus, our method effectively closes the time-to- ries. LOOP (Sikchi et al., 2022) extends SAC with a learned
solve gap between model-free and model-based methods. state prediction model and constrains planned trajectories
This is a nontrivial reduction, as LOOP is already known to be close to those of SAC, whereas we replace the pa-
to be, e.g., 12× faster than the purely model-based method, rameterized policy by planning with TD-MPC and learn a
POLO (Lowrey et al., 2019; Sikchi et al., 2022). We provide task-oriented latent dynamics model.
additional experiments on inference times in Appendix H.
We provide a qualitative comparison of key components in
TD-MPC and prior work in Appendix B.
6. Related Work
Temporal Difference Learning. Popular model-free off-
policy algorithms such as DDPG (Lillicrap et al., 2016) and
7. Conclusions and Future Directions
SAC (Haarnoja et al., 2018) represent advances in deep TD- We are excited that our TD-MPC framework, despite being
learning based on a large body of literature (Sutton, 1988; markedly distinct from previous work in the way that the
Mnih et al., 2013; Hasselt et al., 2016; Mnih et al., 2016; Fu- model is learned and used, is already able to outperform
jimoto et al., 2018; Kalashnikov et al., 2018; Espeholt et al., model-based and model-free methods on diverse continuous
2018; Pourchot & Sigaud, 2019; Kalashnikov et al., 2021). control tasks, and (with trivial modifications) simultane-
Both DDPG and SAC learn a policy πθ and value function ously match state-of-the-art on image-based RL tasks. Yet,
Qθ , but do not learn a model. Kalashnikov et al. (2018); we believe that there is ample opportunity for performance
Shao et al. (2020); Kalashnikov et al. (2021) also learn Qθ , improvements by extending the TD-MPC framework. For
but replace or augment πθ with model-free CEM. Instead, example, by using the learned model in creative ways (Clav-
we jointly learn a model, value function, and policy using era et al., 2020; Buckman et al., 2018; Lowrey et al., 2019),
TD-learning, and interact using sampling-based planning. incorporating better exploration strategies, or improving the
Model-based RL. A common paradigm is to learn a model model through architectural innovations.
of the environment that can be used for planning (Ebert
et al., 2018; Zhang et al., 2018; Janner et al., 2019; Hafner Acknowledgements
et al., 2019; Lowrey et al., 2019; Kaiser et al., 2020; Bhard- This project is supported, in part, by grants from NSF CCF-
waj et al., 2020; Yu et al., 2020; Schrittwieser et al., 2020; 2112665 (TILOS), and gifts from Meta, Qualcomm.
Nguyen et al., 2021) or for training a model-free algorithm
with generated data (Pong et al., 2018; Ha & Schmidhu- The authors would like to thank Yueh-Hua Wu, Ruihan
ber, 2018; Hafner et al., 2020b; Sekar et al., 2020). For Yang, Sander Tonkens, Tongzhou Mu, and Yuzhe Qin for
example, Zhang et al. (2018); Ha & Schmidhuber (2018); helpful discussions.
Hafner et al. (2019; 2020b) learn a dynamics model using
a video prediction loss, Yu et al. (2020); Kidambi et al.
(2020) consider model-based RL in the offline setting, and
MuZero/EfficientZero (Schrittwieser et al., 2020; Ye et al.,
TD-Learning for MPC

References Hafez, M. B., Weber, C., Kerzel, M., and Wermter, S. Curi-
ous meta-controller: Adaptive alternation between model-
Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A.,
based and model-free control in deep reinforcement learn-
and Bellemare, M. G. Deep reinforcement learning at
ing. 2019 International Joint Conference on Neural Net-
the edge of the statistical precipice. Advances in Neural
works (IJCNN), pp. 1–8, 2019.
Information Processing Systems, 2021.
Argenson, A. and Dulac-Arnold, G. Model-based offline Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D.,
planning. ArXiv, abs/2008.05556, 2021. Lee, H., and Davidson, J. Learning latent dynamics for
planning from pixels. In International Conference on
Bhardwaj, M., Handa, A., Fox, D., and Boots, B. Infor- Machine Learning, pp. 2555–2565, 2019.
mation theoretic model predictive q-learning. ArXiv,
abs/2001.02153, 2020. Hafner, D., Lillicrap, T., Norouzi, M., and Ba, J. Mas-
tering atari with discrete world models. arXiv preprint
Bhardwaj, M., Choudhury, S., and Boots, B. Blending mpc arXiv:2010.02193, 2020a.
& value function approximation for efficient reinforce-
ment learning. ArXiv, abs/2012.05909, 2021. Hafner, D., Lillicrap, T. P., Ba, J., and Norouzi, M. Dream to
control: Learning behaviors by latent imagination. ArXiv,
Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H.
abs/1912.01603, 2020b.
Sample-efficient reinforcement learning with stochastic
ensemble value expansion. In NeurIPS, 2018. Hansen, N. and Wang, X. Generalization in reinforcement
Chen, X. and He, K. Exploring simple siamese represen- learning by soft data augmentation. In International Con-
tation learning. 2021 IEEE/CVF Conference on Com- ference on Robotics and Automation (ICRA), 2021.
puter Vision and Pattern Recognition (CVPR), pp. 15745–
Hansen, N., Jangir, R., Sun, Y., Alenyà, G., Abbeel, P.,
15753, 2021.
Efros, A. A., Pinto, L., and Wang, X. Self-supervised
Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep policy adaptation during deployment. In International
reinforcement learning in a handful of trials using proba- Conference on Learning Representations (ICLR), 2021.
bilistic dynamics models. In NeurIPS, 2018.
Hasselt, H. V., Guez, A., and Silver, D. Deep reinforcement
Clavera, I., Fu, Y., and Abbeel, P. Model-augmented learning with double q-learning. In Aaai, 2016.
actor-critic: Backpropagating through paths. ArXiv,
abs/2005.08068, 2020. Hatch, N. and Boots, B. The value of planning for infinite-
horizon model predictive control. 2021 IEEE Interna-
Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A. X., and tional Conference on Robotics and Automation (ICRA),
Levine, S. Visual foresight: Model-based deep reinforce- pp. 7372–7378, 2021.
ment learning for vision-based robotic control. ArXiv,
abs/1812.00568, 2018. Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust
your model: Model-based policy optimization. ArXiv,
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, abs/1906.08253, 2019.
V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning,
I., Legg, S., and Kavukcuoglu, K. Impala: Scalable dis- Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Camp-
tributed deep-rl with importance weighted actor-learner bell, R. H., Czechowski, K., Erhan, D., Finn, C., Koza-
architectures. ArXiv, abs/1802.01561, 2018. kowski, P., Levine, S., Sepassi, R., Tucker, G., and
Fujimoto, S., Hoof, H. V., and Meger, D. Addressing func- Michalewski, H. Model-based reinforcement learning
tion approximation error in actor-critic methods. ArXiv, for atari. ArXiv, abs/1903.00374, 2020.
abs/1802.09477, 2018.
Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog,
Ha, D. and Schmidhuber, J. Recurrent world models fa- A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M.,
cilitate policy evolution. In Advances in Neural Infor- Vanhoucke, V., and Levine, S. Qt-opt: Scalable deep rein-
mation Processing Systems 31, pp. 2451–2463. Curran forcement learning for vision-based robotic manipulation.
Associates, Inc., 2018. ArXiv, abs/1806.10293, 2018.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Kalashnikov, D., Varley, J., Chebotar, Y., Swanson, B.,
Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Jonschkowski, R., Finn, C., Levine, S., and Hausman,
Levine, S. Soft actor-critic algorithms and applications. K. Mt-opt: Continuous multi-task robotic reinforcement
ArXiv, abs/1812.05905, 2018. learning at scale. ArXiv, abs/2104.08212, 2021.
TD-Learning for MPC

Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, Pourchot, A. and Sigaud, O. Cem-rl: Combining evolution-
T. Morel : Model-based offline reinforcement learning. ary and gradient-based methods for policy search. ArXiv,
ArXiv, abs/2005.05951, 2020. abs/1810.01222, 2019.

Kostrikov, I., Yarats, D., and Fergus, R. Image augmentation Rubinstein, R. Y. Optimization of computer simulation mod-
is all you need: Regularizing deep reinforcement learn- els with rare events. European Journal of Operational
ing from pixels. International Conference on Learning Research, 99:89–112, 1997.
Representations, 2020.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Priori-
Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, tized experience replay. CoRR, abs/1511.05952, 2016.
Y., Silver, D., and Wierstra, D. Continuous control with Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K.,
deep reinforcement learning. CoRR, abs/1509.02971, Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis,
2016. D., Graepel, T., Lillicrap, T. P., and Silver, D. Mastering
atari, go, chess and shogi by planning with a learned
Lowrey, K., Rajeswaran, A., Kakade, S. M., Todorov, E.,
model. Nature, 588 7839:604–609, 2020.
and Mordatch, I. Plan online, learn offline: Efficient
learning and exploration via model-based control. ArXiv, Sekar, R., Rybkin, O., Daniilidis, K., Abbeel, P., Hafner, D.,
abs/1811.01848, 2019. and Pathak, D. Planning to explore via self-supervised
world models. ArXiv, abs/2005.05960, 2020.
Margolis, G., Chen, T., Paigwar, K., Fu, X., Kim, D., Kim,
S., and Agrawal, P. Learning to jump from pixels. In Shao, L., You, Y., Yan, M., Sun, Q., and Bohg, J. Grac: Self-
CoRL, 2021. guided and self-regularized actor-critic. arXiv preprint
arXiv:2009.08973, 2020.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Sikchi, H., Zhou, W., and Held, D. Learning off-policy with
atari with deep reinforcement learning. arXiv preprint online planning. In Conference on Robot Learning, pp.
arXiv:1312.5602, 2013. 1622–1633. PMLR, 2022.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, Srinivas, A., Laskin, M., and Abbeel, P. Curl: Contrastive
T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asyn- unsupervised representations for reinforcement learning.
chronous methods for deep reinforcement learning. In arXiv preprint arXiv:2004.04136, 2020.
ICML, 2016. Sutton, R. Learning to predict by the method of temporal
differences. Machine Learning, 3:9–44, 08 1988. doi:
Morgan, A. S., Nandha, D., Chalvatzaki, G., D’Eramo, C.,
10.1007/BF00115009.
Dollar, A. M., and Peters, J. Model predictive actor-critic:
Accelerating robot skill acquisition with deep reinforce- Sutton, R. Learning to predict by the methods of temporal
ment learning. 2021 IEEE International Conference on differences. Machine Learning, 3:9–44, 2005.
Robotics and Automation (ICRA), pp. 6672–6678, 2021.
Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabi-
Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. lization of complex behaviors through online trajectory
Neural network dynamics for model-based deep reinforce- optimization. 2012 IEEE/RSJ International Conference
ment learning with model-free fine-tuning. 2018 IEEE on Intelligent Robots and Systems, pp. 4906–4913, 2012.
International Conference on Robotics and Automation
Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y.,
(ICRA), pp. 7559–7566, 2018.
de Las Casas, D., Budden, D., Abdolmaleki, A., et al.
Negenborn, R. R., De Schutter, B., Wiering, M. A., and Hel- Deepmind control suite. Technical report, DeepMind,
lendoorn, H. Learning-based model predictive control for 2018.
markov decision processes. IFAC Proceedings Volumes, Wang, T. and Ba, J. Exploring model-based planning with
38(1):354–359, 2005. 16th IFAC World Congress. policy networks. ArXiv, abs/1906.08649, 2020.
Nguyen, T. D., Shu, R., Pham, T., Bui, H. H., and Ermon, S. Williams, G., Aldrich, A., and Theodorou, E. A. Model
Temporal predictive coding for model-based planning in predictive path integral control using covariance variable
latent space. In ICML, 2021. importance sampling. ArXiv, abs/1509.01149, 2015.
Pong, V. H., Gu, S. S., Dalal, M., and Levine, S. Temporal Yarats, D. and Kostrikov, I. Soft actor-critic (sac) im-
difference models: Model-free deep rl for model-based plementation in pytorch. https://github.com/
control. ArXiv, abs/1802.09081, 2018. denisyarats/pytorch_sac, 2020.
TD-Learning for MPC

Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. Mastering shown in Figure 7. When finetuning the full TOLD model,
visual continuous control: Improved data-augmented re- we find our method to converge considerably faster, sug-
inforcement learning. arXiv preprint arXiv:2107.09645, gesting that TOLD does indeed learn features that transfer
2021. between related tasks. We perform two additional finetuning
experiments: freezing parameters of the representation hθ ,
Ye, W., Liu, S., Kurutach, T., Abbeel, P., and Gao, Y. Master- and freezing parameters of both hθ and the latent dynam-
ing atari games with limited data. ArXiv, abs/2111.00210, ics predictor dθ during finetuning. We find that freezing
2021. hθ nearly matches our results for finetuning without frozen
Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, weights, indicating that hθ learns to encode information
C., and Levine, S. Meta-world: A benchmark and evalua- that transfers between tasks. However, when finetuning
tion for multi-task and meta reinforcement learning. In with both hθ , dθ frozen, rate of convergence degrades sub-
Conference on Robot Learning (CoRL), 2019. stantially, which suggests that dθ tends to encode more
task-specific behavior.
Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S.,
Finn, C., and Ma, T. Mopo: Model-based offline policy B. Comparison to Prior Work
optimization. ArXiv, abs/2005.13239, 2020.
We here extend our discussion of related work in Section
Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J., 6. Table 3 provides a qualitative comparison of key compo-
and Levine, S. Solar: Deep structured latent represen- nents of TD-MPC and prior model-based and model-free
tations for model-based reinforcement learning. ArXiv, approaches, e.g., comparing model objectives, use of a (ter-
abs/1808.09105, 2018. minal) value function, and inference-time behavior. While
Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. different aspects of TD-MPC have been explored in prior
Maximum entropy inverse reinforcement learning. In work, we are the first to propose a complete framework for
Proceedings of the 23rd National Conference on Artificial MPC with a model learned by TD-learning.
Intelligence, volume 3, 2008.
B. Comparison to Prior Work

We here extend our discussion of related work in Section 6. Table 3 provides a qualitative comparison of key components of TD-MPC and prior model-based and model-free approaches, e.g., comparing model objectives, use of a (terminal) value function, and inference-time behavior. While different aspects of TD-MPC have been explored in prior work, we are the first to propose a complete framework for MPC with a model learned by TD-learning.

C. Variable Computational Budget

This section supplements our experiments in Figure 6 on a variable computational budget for planning during inference; additional results are shown in Figure 8. We observe that the gap between planning performance and policy performance tends to be larger for tasks with high-dimensional action spaces such as the two Quadruped tasks. We similarly find that performance varies relatively little when the computational budget is changed for tasks with simple dynamics (e.g., Cartpole tasks) compared to tasks with more complex dynamics. We find that our default hyperparameters (H = 5 and 6 iterations; shown as a star in Figure 8) strike a good balance between compute and performance.

Table 3. Comparison to prior work. We compare key components of TD-MPC to prior model-based and model-free approaches. Model objective describes which objective is used to learn a (latent) dynamics model, value denotes whether a value function is learned, inference provides a simplified view of action selection at inference time, continuous denotes whether an algorithm supports continuous action spaces, and compute is a holistic estimate of the relative computational cost of methods during training and inference. We use policy w/ CEM to indicate inference based primarily on a learned policy, and vice-versa.

Method | Model objective | Value | Inference | Continuous | Compute
SAC | ✗ | ✓ | Policy | ✓ | Low
QT-Opt | ✗ | ✓ | CEM | ✓ | Low
MPC:sim | Ground-truth model | ✗ | CEM | ✓ | High
POLO | Ground-truth model | ✓ | CEM | ✓ | High
LOOP | State prediction | ✓ | Policy w/ CEM | ✓ | Moderate
PlaNet | Image prediction | ✗ | CEM | ✓ | High
Dreamer | Image prediction | ✓ | Policy | ✓ | Moderate
MuZero | Reward/value pred. | ✓ | MCTS w/ policy | ✗ | Moderate
EfficientZero | Reward/value pred. + contrast. | ✓ | MCTS w/ policy | ✗ | Moderate
TD-MPC (ours) | Reward/value pred. + latent pred. | ✓ | CEM w/ policy | ✓ | Low

Figure 8. Variable computational budget. Return of our method (TD-MPC) under a variable computational budget. In addition to the task in Figure 6, we provide results on four other tasks from DMControl: Quadruped Run (A ∈ R^12), Fish Swim (A ∈ R^5), Reacher Hard (A ∈ R^2), and Cartpole Swingup Sparse (A ∈ R). We evaluate performance of fully trained agents when varying (blue) planning horizon; (green) number of iterations during planning. For completeness, we also include evaluation of the jointly learned policy πθ, as well as the default setting of 6 iterations and a horizon of 5 used during training. Higher values require more compute. Mean of 5 runs.

D. Latent Dynamics Objective

We ablate the choice of latent dynamics objective by replacing our proposed latent state consistency loss in Equation 10 with (i) a contrastive loss similar to that of Ye et al. (2021); Hansen & Wang (2021), and (ii) a reconstruction objective similar to that of Ha & Schmidhuber (2018); Hafner et al. (2019; 2020b). Specifically, for (i) we adopt the recently proposed SimSiam (Chen & He, 2021) self-supervised framework and implement the projection layer as an MLP with 2 hidden layers and output size 32, and the predictor head as an MLP with 1 hidden layer. All layers use ELU activations and a hidden size of 256. Consistent with the public implementations of Ye et al. (2021); Hansen & Wang (2021), we find it beneficial to apply BatchNorm in the projection and predictor modules. We also find that using a higher loss coefficient of c3 = 100 (up from 2) produces slightly better results. For (ii) we implement the decoder for state reconstruction by mirroring the encoder: an MLP with 1 hidden layer and ELU activations. We also include a no-regularization baseline for completeness. Results are shown in Figure 10.
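For reference, a minimal sketch of such a SimSiam-style consistency objective is shown below. Module sizes follow the description above, but the latent dimension, the integration into our training loop, and all names are illustrative assumptions rather than our exact ablation code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimSiamHead(nn.Module):
    # Projection MLP (2 hidden layers, output size 32) and predictor MLP
    # (1 hidden layer), with hidden size 256, ELU activations and BatchNorm.
    def __init__(self, latent_dim=50, hidden_dim=256, out_dim=32):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, out_dim))
        self.predictor = nn.Sequential(
            nn.Linear(out_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, out_dim))

    def loss(self, z_pred, z_target):
        # Negative cosine similarity, with a stop-gradient on the target branch.
        p = self.predictor(self.projector(z_pred))
        with torch.no_grad():
            t = self.projector(z_target)
        return -F.cosine_similarity(p, t, dim=-1).mean()

In ablation (i), a loss of this form stands in for the mse(z, z_target) consistency term of the training update shown in Appendix F, weighted by c3 = 100.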
E. Exploration by planning

We investigate the role that planning by TD-MPC has in exploration. Figure 9 shows the average std. deviation of our planning procedure after the final iteration of planning for the three Humanoid tasks: Stand, Walk, and Run, listed in order of increasing difficulty. We observe that the std. deviation (and thus degree of exploration) is decreasing as training progresses, and converges as the task becomes solved. Generally, we find that exploration decreases slower for hard tasks, which we conjecture is due to larger variance in reward and value estimates. As such, the TD-MPC framework inherently balances exploration and exploitation.

Figure 9. Exploration by planning. Average std. deviation (σ) of our planning procedure after the final iteration of planning over the course of training. Results are shown for the three Humanoid tasks: Stand, Walk, and Run, listed in order of increasing difficulty.

F. Implementation Details

We provide an overview of the implementation details of our method in Section 5. For completeness, we list all relevant hyperparameters in Table 4. As discussed in Appendix G, we adopt most hyperparameters from the SAC implementation (Yarats & Kostrikov, 2020). Following previous work (Hafner et al., 2019), we use a task-specific action repeat
hyperparameter for DMControl that is constant across all methods; see Table 7 for a list of values. For state-based experiments, we implement the representation function hθ using an MLP with a single hidden layer of dimension 256. For image-based experiments, hθ is a 4-layer CNN with kernel sizes (7, 5, 3, 3), stride (2, 2, 2, 2), and 32 filters per layer. All other components are implemented using 2-layer MLPs with dimension 512. Following prior work (Yarats & Kostrikov, 2020; Srinivas et al., 2020; Kostrikov et al., 2020), we apply layer normalization to the value function. Weights and biases in the last layer of the reward predictor Rθ and value function Qθ are zero-initialized to reduce model and value biases in the early stages of training, and all other fully-connected layers use orthogonal initialization; the SAC and LOOP baselines are implemented similarly. We do not find it consistently better to use larger networks for either state-based or image-based experiments. In multi-task experiments, we augment the state input with a one-hot task vector. In multi-modal experiments, we encode state and image separately and sum the features. We provide a PyTorch-like summary of our task-oriented latent dynamics model in the following. For clarity, we use S, Z, and A to denote the dimensionality of states, latent states, and actions, respectively, and report the total number of learnable parameters for our TOLD model initialized for the Walker Run task (S ∈ R^24, A ∈ R^6).

Total parameters: approx. 1,507,000
(h): Sequential(
  (0): Linear(in_features=S, out_features=256)
  (1): ELU(alpha=1.0)
  (2): Linear(in_features=256, out_features=Z))
(d): Sequential(
  (0): Linear(in_features=Z+A, out_features=512)
  (1): ELU(alpha=1.0)
  (2): Linear(in_features=512, out_features=512)
  (3): ELU(alpha=1.0)
  (4): Linear(in_features=512, out_features=Z))
(R): Sequential(
  (0): Linear(in_features=Z+A, out_features=512)
  (1): ELU(alpha=1.0)
  (2): Linear(in_features=512, out_features=512)
  (3): ELU(alpha=1.0)
  (4): Linear(in_features=512, out_features=1))
(pi): Sequential(
  (0): Linear(in_features=Z, out_features=512)
  (1): ELU(alpha=1.0)
  (2): Linear(in_features=512, out_features=512)
  (3): ELU(alpha=1.0)
  (4): Linear(in_features=512, out_features=A))
(Q1): Sequential(
  (0): Linear(in_features=Z+A, out_features=512)
  (1): LayerNorm((512,), elementwise_affine=True)
  (2): Tanh()
  (3): Linear(in_features=512, out_features=512)
  (4): ELU(alpha=1.0)
  (5): Linear(in_features=512, out_features=1))
(Q2): Sequential(
  (0): Linear(in_features=Z+A, out_features=512)
  (1): LayerNorm((512,), elementwise_affine=True)
  (2): Tanh()
  (3): Linear(in_features=512, out_features=512)
  (4): ELU(alpha=1.0)
  (5): Linear(in_features=512, out_features=1))

Additionally, PyTorch-like pseudo-code for training our TOLD model (codified version of Algorithm 2) is shown below:

def update(replay_buffer):
    """
    A single gradient update of our TOLD model.
    h, R, Q, d: TOLD components.
    c1, c2, c3: loss coefficients.
    rho: temporal loss coefficient.
    """
    states, actions, rewards = replay_buffer.sample()

    # Encode first observation
    z = h(states[0])

    # Recurrently make predictions
    reward_loss = 0
    value_loss = 0
    consistency_loss = 0
    for t in range(H):
        r = R(z, actions[t])
        q1, q2 = Q(z, actions[t])
        z = d(z, actions[t])

        # Compute targets and losses
        z_target = h_target(states[t+1])
        td_target = compute_td(rewards[t], states[t+1])
        reward_loss += rho**t * mse(r, rewards[t])
        value_loss += rho**t * (mse(q1, td_target) + mse(q2, td_target))
        consistency_loss += rho**t * mse(z, z_target)

    # Update
    total_loss = c1 * reward_loss + c2 * value_loss + c3 * consistency_loss
    total_loss.backward()
    optim.step()

    # Update slow-moving average
    update_target_network()
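The compute_td helper above is not spelled out in the pseudo-code; a minimal sketch of a standard double-Q TD target is given below for completeness. It assumes the target encoder h_target from the snippet above, plus a target value function Q_target, the jointly learned policy pi, and a discount factor gamma; these names are illustrative, and the exact target used by TD-MPC is defined in the main text.

import torch

def compute_td(reward, next_state):
    # One-step TD target: reward plus the discounted value of the next latent
    # state, evaluated with slow-moving target networks and the learned policy.
    with torch.no_grad():
        z_next = h_target(next_state)
        a_next = pi(z_next)
        q1, q2 = Q_target(z_next, a_next)
        return reward + gamma * torch.min(q1, q2)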
G. Extended Description of Baselines

We tune the performance of both our method and baselines to perform well on DMControl and then subsequently benchmark algorithms on Meta-World using the same choice of hyperparameters. Below, we provide additional details on our efforts to tune the baseline implementations.
(Figure 10 shows per-task learning curves, episode return vs. environment steps (×10^3), for the panels: Average, Acrobot Swingup, Cartpole Swingup, Cartpole Swingup Sparse, Cheetah Run, Cup Catch, Finger Spin, Finger Turn Hard, Fish Swim, Hopper Hop, Quadruped Run, Quadruped Walk, Reacher Easy, Reacher Hard, Walker Run, and Walker Walk.)

Figure 10. Latent dynamics objective. Return of our method (TD-MPC) using different latent dynamics objectives in addition to reward and value prediction. 15 state-based continuous control tasks from DMControl (Tassa et al., 2018). No reg. uses no regularization term, reconstruction uses a state prediction loss, contrastive loss adopts the contrastive objective of Ye et al. (2021); Hansen & Wang (2021), and latent state consistency corresponds to Equation 10. Mean of 5 runs; shaded areas are 95% confidence intervals. In the top left, we visualize results averaged across all 15 tasks. Both reconstruction and contrastive losses improve over the baseline without regularization, but our proposed latent state consistency loss yields more consistent results.

SAC. We adopt the implementation of Yarats & Kostrikov (2020), which has been used extensively in the literature as a benchmark implementation for state-based DMControl. We use original hyperparameters except for the target network momentum coefficient ζ, where we find it beneficial for SAC, LOOP, and our method alike to use a faster update of ζ = 0.99 as opposed to 0.995 in the original implementation. Additionally, we decrease the batch size from 1024 to 512 for fair comparison to our method. For completeness, we list important hyperparameters for the SAC baseline in Table 5.

LOOP. We benchmark against the official implementation from Sikchi et al. (2022), but note that LOOP has – to the best of our knowledge – not previously been benchmarked on DMControl or Meta-World. Therefore, we do our best to adapt its hyperparameters. As in the SAC implementation, we find LOOP to perform better using ζ = 0.99 than its original value of 0.995, and we increase the batch size from 256 to 512. Lastly, we set the number of seed steps to 1,000 (down from 10,000) to match the SAC implementation. As LOOP uses SAC as its backbone learning algorithm, we found these changes to be beneficial. LOOP-specific hyperparameters are listed in Table 6.

MPC:sim. We compare TD-MPC to a vanilla MPC algorithm using a ground-truth model of the environment (simulator), but no terminal value function. As such, this baseline is non-parametric. We use the same MPC implementation as in our method (MPPI; Williams et al. (2015)). As planning with a simulator is computationally intensive, we limit the planning horizon to 10 (which is still 2× as much as TD-MPC), and we reduce the number of iterations to 4 (our method uses 6), as we find MPC to converge faster when using the ground-truth model. At each iteration, we sample N = 200 trajectories and update the distribution parameters using the top-20 (10%) sampled action sequences. We keep all other hyperparameters consistent with our method. Because of the limited planning horizon, this MPC baseline generally performs well on locomotion tasks where local solutions are sufficient, but tends to fail at tasks with, for example, sparse rewards.
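To make this sampling-based planner more concrete, a simplified sketch of one MPPI/CEM-style planning call with the settings above is shown below. The rollout_return callable (which evaluates an action sequence in the simulator or learned model), the hard clipping to [-1, 1], and the plain elite averaging are illustrative simplifications; the exact procedure, including temperature-weighted updates and trajectories sampled from the learned policy, is described in the main text.

import numpy as np

def plan(rollout_return, horizon=10, action_dim=6, iterations=4,
         num_samples=200, num_elites=20, momentum=0.1):
    # Iteratively refit a Gaussian over action sequences to the top-k
    # ("elite") sampled sequences, starting from (mu, sigma) = (0, 2).
    mean = np.zeros((horizon, action_dim))
    std = 2.0 * np.ones((horizon, action_dim))
    for _ in range(iterations):
        noise = np.random.randn(num_samples, horizon, action_dim)
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        returns = np.array([rollout_return(a) for a in actions])
        elites = actions[np.argsort(returns)[-num_elites:]]
        mean = momentum * mean + (1.0 - momentum) * elites.mean(axis=0)
        std = momentum * std + (1.0 - momentum) * elites.std(axis=0)
    return mean[0]  # MPC: execute only the first action, then replan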
No latent ablation. We make the following change to our method: replacing hθ with the identity function, i.e., x = hθ(x). As such, environment dynamics are modelled by forward prediction directly in the state space, with the consistency loss effectively degraded to a state prediction loss. This ablation makes our method more similar to prior work on model-based RL from states (Janner et al., 2019; Lowrey et al., 2019; Sikchi et al., 2022; Argenson & Dulac-Arnold, 2021). However, unlike previous work that decouples model learning from policy and value learning, we still back-propagate gradients from the reward and value objectives through the model, which makes this a stronger baseline.

No consistency regularization. We set the coefficient c3 corresponding to the latent state consistency loss in Equation 10 to 0, such that the TOLD model is trained only with the reward and value prediction losses. This ablation makes our method more similar to MuZero (Schrittwieser et al., 2020).

Other baselines. Results for other baselines are obtained from related work. Specifically, results for SAC, CURL, DrQ, and PlaNet are obtained from Srinivas et al. (2020) and Kostrikov et al. (2020), and results for Dreamer, MuZero, and EfficientZero are obtained from Hafner et al. (2020b) and Ye et al. (2021).

Table 4. TD-MPC hyperparameters. We here list hyperparameters for TD-MPC with TOLD and emphasize that we use the same parameters as for SAC whenever possible.

Hyperparameter | Value
Discount factor (γ) | 0.99
Seed steps | 5,000
Replay buffer size | Unlimited
Sampling technique | PER (α = 0.6, β = 0.4)
Planning horizon (H) | 5
Initial parameters (µ0, σ0) | (0, 2)
Population size | 512
Elite fraction | 64
Iterations | 12 (Humanoid); 8 (Dog, pixels); 6 (otherwise)
Policy fraction | 5%
Number of particles | 1
Momentum coefficient | 0.1
Temperature (τ) | 0.5
MLP hidden size | 512
MLP activation | ELU
Latent dimension | 100 (Humanoid, Dog); 50 (otherwise)
Learning rate | 3e-4 (Dog, pixels); 1e-3 (otherwise)
Optimizer (θ) | Adam (β1 = 0.9, β2 = 0.999)
Temporal coefficient (λ) | 0.5
Reward loss coefficient (c1) | 0.5
Value loss coefficient (c2) | 0.1
Consistency loss coefficient (c3) | 2
Exploration schedule (ε) | 0.5 → 0.05 (25k steps)
Planning horizon schedule | 1 → 5 (25k steps)
Batch size | 2048 (Dog); 256 (pixels); 512 (otherwise)
Momentum coefficient (ζ) | 0.99
Steps per gradient update | 1
θ− update frequency | 2

Table 5. SAC hyperparameters. We list the most important hyperparameters for the SAC baseline. Note that we mostly follow the implementation of Yarats & Kostrikov (2020) but improve upon certain hyperparameter choices, e.g., the momentum coefficient ζ and values specific to the Dog tasks.

Hyperparameter | Value
Discount factor (γ) | 0.99
Seed steps | 1,000
Replay buffer size | Unlimited
Sampling technique | Uniform
MLP hidden size | 1024
MLP activation | RELU
Latent dimension | 100 (Humanoid, Dog); 50 (otherwise)
Optimizer (θ) | Adam (β1 = 0.9, β2 = 0.999)
Optimizer (α of SAC) | Adam (β1 = 0.5, β2 = 0.999)
Learning rate (θ) | 3e-4 (Dog); 1e-3 (otherwise)
Learning rate (α of SAC) | 1e-4
Batch size | 2048 (Dog); 512 (otherwise)
Momentum coefficient (ζ) | 0.99
Steps per gradient update | 1
θ− update frequency | 2

Table 6. LOOP hyperparameters. We list general SAC hyperparameters shared by LOOP in Table 5, and list only hyperparameters specific to LOOP here. We use the official implementation from Sikchi et al. (2022) but list its hyperparameters for completeness. Note that we – as in the SAC implementation – use a different batch size and momentum coefficient than in Sikchi et al. (2022), as we find this to marginally improve performance on DMControl.

Hyperparameter | Value
Planning horizon (H) | 3
Population size | 100
Elite fraction | 20%
Iterations | 5
Policy fraction | 5%
Number of particles | 4
Momentum coefficient | 0.1
MLP hidden size | 256
MLP activation | ELU/RELU
Ensemble size | 5

H. Inference Time

In the experiments of Section 5, we investigate the relationship between performance and the computational budget of planning with TD-MPC. For completeness, we also evaluate the relationship between computational budget and inference time. Figure 11 shows the inference time of TD-MPC as the planning horizon and number of iterations are varied. As in previous experiments, we benchmark inference times on a single RTX 3090 GPU. Unsurprisingly, we
find that there is an approximately linear relationship between computational budget and inference time. However, it is worth noting that our default settings used during training only require approximately 20 ms per step, i.e., 50 Hz, which is fast enough for many real-time robotics applications such as manipulation, navigation, and to some extent locomotion (assuming an on-board GPU). For applications where inference time is critical, the computational budget can be adjusted to meet requirements. For example, we found in Figure 8 that we can reduce the planning horizon of TD-MPC on the Quadruped Run task from 5 to 1 with no significant reduction in performance, which reduces inference time to approximately 12 ms per step. While the performance of the model-free policy learned jointly with TD-MPC is indeed lower than that of planning, the policy is still nearly 6× faster than planning at inference time.

Table 7. Action repeat. We adopt action repeat hyperparameters for DMControl from previous work (Hafner et al., 2019; Kostrikov et al., 2020) for state-based experiments as well as the DMControl 100k benchmark; we list all values below. For the DMControl Dreamer benchmark, all methods use an action repeat of 2 regardless of the task. We do not use action repeat for Meta-World.

Task | Action repeat
Humanoid | 2
Dog | 2
Walker | 2
Finger | 2
Cartpole | 8
Other (DMControl) | 4
Meta-World | 1
Figure 11. Inference time under a variable budget. Milliseconds per decision step for TD-MPC on the Quadruped Run task under a variable computational budget. We evaluate performance of fully trained agents when varying (left) planning horizon; (right) number of iterations during planning. When varying one hyperparameter, the other is fixed to the default value. For completeness, we also include the inference time of the learned policy πθ, and the default setting of 6 iterations and a horizon of 5 used during training.

I. Meta-World

We provide learning curves and success rates for individual Meta-World (Yu et al., 2019) tasks in Figure 14. Due to the sheer number of tasks, we choose to only visualize the first 24 tasks (sorted alphabetically) out of the total of 50 tasks from Meta-World. Note that we use Meta-World v2 and that we consider the goal-conditioned versions of the tasks, which are considered harder than the single-goal variant often used in related work. We generally find that SAC is competitive with TD-MPC in most tasks, but that TD-MPC is far more sample-efficient in tasks that involve complex manipulation, e.g., Bin Picking, Box Close, and Hammer. Successful trajectories for each of these three tasks are visualized in Figure 15. Generally, we choose to focus on sample efficiency, for which we empirically find 1M environment steps (3M for multi-task experiments) to be sufficient for achieving non-trivial success rates in Meta-World. As the original paper reports the maximum per-task success rate for multi-task experiments rather than the average success rate, we also report this metric for our SAC baseline in Table 8. We find that our SAC baseline is strikingly competitive with the original paper's results considering that we evaluate over just 3M steps.

Table 8. Meta-World MT10. As our performance metric reported in Figure 5 differs from that of the Meta-World v2 benchmark proposal (Yu et al., 2019), we here report results for our SAC baseline using the same maximum per-task success rate metric used for the MT10 multi-task experiment from the original paper.

Task | Max. success rate
Window Close | 1.00
Window Open | 1.00
Door Open | 1.00
Peg Insert Side | 0.00
Drawer Open | 0.85
Pick Place | 0.00
Reach | 1.00
Button Press Down | 1.00
Push | 0.00
Drawer Close | 1.00

J. Multi-Modal RL

We demonstrate the ability of TD-MPC to successfully fuse information from multiple input modalities (proprioceptive data + an egocentric camera) in two 3D locomotion tasks:

− Quadruped Corridor, where the agent needs to move along a corridor with constant target velocity. To succeed, the agent must perceive the corridor walls and adjust its walking direction accordingly.

− Quadruped Obstacles, where the agent needs to move along a corridor filled with obstacles that obstruct vision and
forces the agent to move in a zig-zag pattern with constant target velocity. To succeed, the agent must perceive both the corridor walls and obstacles, and continuously adjust its walking direction.

Trajectories from the two tasks are visualized in Figure 13.

Figure 12. Rliable metrics. Median, interquartile mean (IQM), and mean performance of TD-MPC and baselines on the 15 state-based DMControl tasks. Confidence intervals are estimated using the percentile bootstrap with stratified sampling, per recommendation of Agarwal et al. (2021). Higher values are better. 5 seeds.

Figure 13. Multi-modal RL. Visualization of the two multi-modal 3D locomotion tasks that we construct: Corridor and Obstacles.

Additional material on the following pages ↓
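Returning to the multi-modal setup above: the fusion strategy described in Appendix F (encode state and image separately and sum the features) could be sketched as follows. The layer sizes mirror the state and image encoders described earlier, but the input channel count, the final linear projection, and all names are illustrative assumptions rather than our exact implementation.

import torch.nn as nn

class MultiModalEncoder(nn.Module):
    # Encode proprioceptive state and egocentric image separately and fuse
    # them by summing the resulting latent features.
    def __init__(self, state_dim, latent_dim=50):
        super().__init__()
        self.state_encoder = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ELU(), nn.Linear(256, latent_dim))
        self.image_encoder = nn.Sequential(  # 4-layer CNN, 32 filters per layer
            nn.Conv2d(3, 32, 7, stride=2), nn.ELU(),
            nn.Conv2d(32, 32, 5, stride=2), nn.ELU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ELU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ELU(),
            nn.Flatten(), nn.LazyLinear(latent_dim))

    def forward(self, state, image):
        return self.state_encoder(state) + self.image_encoder(image)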

K. Additional Metrics
We report additional (aggregate) performance metrics of
SAC, LOOP, and TD-MPC on the set of 15 state-based DM-
Control tasks using the rliable toolkit provided by Agarwal
et al. (2021). Concretely, we report the aggregate median,
interquartile mean (IQM), and mean returns with 95% con-
fidence intervals based on the episode returns of trained
(after 500k environment steps) agents. As recommended
by Agarwal et al. (2021), confidence intervals are estimated
using the percentile bootstrap with stratified sampling.
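A minimal sketch of how such aggregate metrics can be computed with the rliable toolkit is shown below; the score dictionary is hypothetical (runs × tasks arrays of returns), and we assume rliable's interval-estimation API as documented by Agarwal et al. (2021).

import numpy as np
from rliable import library as rly, metrics

# Hypothetical scores: algorithm name -> (runs x tasks) array of returns.
scores = {name: np.random.rand(5, 15) for name in ["TD-MPC", "LOOP", "SAC"]}

def aggregate(x):
    # Median, interquartile mean (IQM) and mean across runs and tasks.
    return np.array([metrics.aggregate_median(x),
                     metrics.aggregate_iqm(x),
                     metrics.aggregate_mean(x)])

point_estimates, interval_estimates = rly.get_interval_estimates(
    scores, aggregate, reps=50000)  # stratified percentile bootstrap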

L. Task Visualizations
Figure 15 provides visualizations of successful trajectories
generated by TD-MPC on seven tasks from DMControl
and Meta-World, all of which TD-MPC solves in less than
1M environment steps. In all seven trajectories, we display
only key frames in the trajectory, as actual episode lengths
are 1000 (DMControl) and 500 (Meta-World). For full
video trajectories, refer to https://nicklashansen.github.io/td-mpc.
(Figure 14 shows per-task success rate vs. environment steps (×10^6) for SAC and TD-MPC (ours) on the panels: Assembly, Basketball, Bin Picking, Box Close, Button Press, Button Press Topdown, Button Press Topdown Wall, Button Press Wall, Coffee Button, Coffee Pull, Coffee Push, Dial Turn, Door Close, Door Lock, Door Open, Door Unlock, Drawer Close, Drawer Open, Faucet Close, Faucet Open, Hammer, Hand Insert, Handle Press, and Handle Press Side.)

Figure 14. Individual Meta-World tasks. Success rate of our method (TD-MPC) and SAC on diverse manipulation tasks from Meta-
World (Yu et al., 2019). We use the goal-conditioned version of Meta-World, which is considered harder than the fixed-goal version. Due
to the large number of tasks (50), we choose to visualize only the first 24 tasks (sorted alphabetically). Mean of 5 runs; shaded areas
are 95% confidence intervals. Our method is capable of solving complex tasks (e.g., Basketball) where SAC achieves a relatively small
success rate. Note that we use Meta-World v2 and performances are therefore not comparable to previous work using v1.

Figure 15. Visualizations. We visualize trajectories generated by our method on seven selected tasks from the two benchmarks, listed (from top to bottom) as follows: (1) Dog Walk, a challenging locomotion task that has a high-dimensional action space (A ∈ R^38); (2) Humanoid Walk, a challenging locomotion task (A ∈ R^21); (3) Quadruped Run, a four-legged locomotion task (A ∈ R^12); (4) Finger Turn Hard, a hard exploration task with sparse rewards; (5) Bin Picking, a 3D pick-and-place task; (6) Box Close, a 3D manipulation task; and lastly (7) Hammer, another 3D manipulation task. In all seven trajectories, we display only key frames in the trajectory; frames progress left to right in time. Actual episode lengths are 1000 (DMControl) and 500 (Meta-World). Our method (TD-MPC) is capable of solving each of these tasks in less than 1M environment steps. Video results are available at https://nicklashansen.github.io/td-mpc.
