
RLC Project report 2024

COMBINING RL AND MPC FOR BIPED WALKING


Aastha Mishra1, Ishita Ganjoo1, Varad Vaidya1, Prakrut Kotecha1, Shishir Kolathaya1∗

ABSTRACT

The fusion of reinforcement learning (RL) with Model Predictive Control (MPC)
is a promising approach that leverages the strengths of the model-based and
model-free paradigms for learning complex control policies for highly nonlinear,
dynamic, high-dimensional systems. It provides improved sample efficiency, faster
learning, and the flexibility to incorporate constraints. We study two approaches to
H-step lookahead policies, viz. LOOP and TD-MPC, which leverage trajectory
optimization over a learned dynamics model over a horizon H, together with a
terminal value function to account for future rewards. In this work, the two methods
have been used to train policies on benchmark and custom bipedal environments.
The performance of both algorithms on Walker 2D has been compared, and the
challenges in applying them to more complex custom environments have been
brought out. Another contribution of this work is a theoretical proof of an optimality
bound on the value function for TD-MPC. The video of results can be found at
https://youtu.be/-Ud4vcf2_LY?si=VQoOhpkNZrTzDLiu

1 INTRODUCTION

A reinforcement learning (RL) agent has to simultaneously interact with the environment and
amalgamate knowledge of it to obtain the optimal policy for a task. State and action spaces for tasks
such as locomotion of legged robots are generally large, making it difficult for the agent to experience
all situations and, further, to decide how to act optimally in all states. Hence, the agent has to act in
the world while learning to perform tasks. Planning is a useful approach for applications like these,
which require sequential decision-making, and it provides a level of interpretability. Planning with a
dynamics model also leads to higher sample efficiency.
Model Predictive Control (MPC) is one of the approaches used to enable locomotion in legged
robots, wherein a dynamics model is used to obtain a control action sequence that generates an
optimal trajectory over a finite (generally short) horizon. In most applications, only the first input in
the sequence is applied to the system. These input sequences are, however, only locally optimal in
time. MPC can be augmented with a terminal value function to account for the return beyond the
finite, fixed horizon, thus obtaining globally optimal solutions. However, both an accurate value
function and an accurate dynamics model may be hard to obtain.
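For concreteness, the generic H-step MPC problem with a terminal value function, which both methods studied here instantiate (Section 2), selects actions as

a^{\star}_{t:t+H-1} = \arg\max_{a_{t:t+H-1}} \; \mathbb{E}\left[\sum_{k=0}^{H-1} \gamma^{k}\, r(s_{t+k}, a_{t+k}) + \gamma^{H}\, V(s_{t+H})\right]

and applies only the first action a^{\star}_t in a receding-horizon fashion.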

1.1 RELATED WORK

Combining information obtained from the dynamics model with model-free RL algorithms using
lookahead policies was first introduced in Efroni et al. (2020). Since then, multiple methods of
incorporating lookahead policies to solve large state-space problems using function approximation
have been proposed (Winnicki et al., 2021). Active research is being done on adaptive lookahead
methods, wherein the planning horizon is a function of some of the parameters used for value-function
approximation (Rosenberg et al., 2023).
Plan Online Learn Offline (POLO) (Lowrey et al., 2018) combines RL and MPC by proposing a
framework wherein a ground-truth dynamics model is used for local trajectory optimization while,
simultaneously, an approximate global value function is used to estimate the return beyond the fixed
horizon. A theoretical proof of the generation of optimal actions using this conjunction has been
provided. In addition, trajectory optimization is used to enable efficient exploration, which in turn is
useful for obtaining better global information and for escaping local minima.

∗1 Robert Bosch Centre for Cyber-Physical Systems, Indian Institute of Science (IISc), Bengaluru.
{aasthamishra, ishitaganjoo, varadmandar, prakrutpk, shishirk}@iisc.ac.in
A deeper treatment of the same idea is given in Bhardwaj et al. (2020). The proposed framework
can mitigate model bias in MPC by blending in model-free value estimates using a parameter that
systematically trades off the different sources of error. Guarantees corresponding to this framework
have been provided and improvements have been suggested.

1.2 LOOP

In Learning Off-Policy with Online Planning (LOOP), H-step lookahead policies with an approximate
value function and a learned dynamics model are proposed to solve continuous control problems
such as locomotion of legged robots. At each time step, an H-step rollout is generated from the
current state using the learned dynamics model. Additionally, an approximate terminal value function
is added at the end of the rollout to estimate the return over the infinite horizon. The policy that gives
the highest return for this H-step rollout is chosen as the optimal policy. The terminal value function
is called approximate because it is a parametrized function learned using a neural network. Similarly,
the dynamics model is considered unknown initially and a neural network is employed to learn it.

1.3 TD-MPC

Temporal Difference Learning for Model Predictive Control (TD-MPC) (Hansen et al., 2022)
optimizes an H-step lookahead reward to obtain the deployment policy. The dynamics model is learnt
in a latent state space. The novelty of TD-MPC is that the learning of its task-oriented latent dynamics
(TOLD) model is coupled with the learning of the value function and the RL policy. This is done
using rewards, a TD value objective, and a modality-agnostic prediction loss in latent space. The
authors claim that this method outperforms previous methods in terms of sample efficiency and
asymptotic performance on complex DMControl (Tassa et al., 2018) and Meta-World (Yu et al.,
2019) tasks.

1.4 PAPER STRUCTURE

Section 1 explains the problem statement and the motivation, and provides a brief overview of each
of the methods we have implemented in the project (Sections 1.2 and 1.3). A detailed explanation
of the methods is provided in Section 2, along with the advantages and disadvantages of each.
Section 3 describes the implementation details. In Section 4, the results of the implementation and
their analysis are provided. Section 5 details a few open questions and possible future work.
Section 6 summarises the important components and contributions of the project.

1.5 CONTRIBUTIONS

In this work, we have tested the methods mentioned above (Sections 1.2 and 1.3) on different bipedal
environments and summarized our results, observations, and analyses. We have compared the
performance of both algorithms on Walker 2D and have also put forth and analyzed the problems
faced in applying these algorithms to more complex environments. We have also proposed a
theoretical proof for finding an optimality bound for the value function for TD-MPC.

2 TECHNICAL BACKGROUND
This section provides a detailed theoretical background for each of the two methods of combining
RL and MPC that have been implemented and analyzed in this project: LOOP and TD-MPC.

2.1 LOOP

In LOOP, H-step lookahead policies have been evaluated both theoretically and empirically. A
theoretical analysis of H-step lookahead under a learned model and an approximate value function
has been provided in the paper, and an optimality bound for the value function has also been given.

Figure 1: Schematic of the LOOP and TD-MPC frameworks

In the algorithm, the value function is learned via an actor using an off-policy algorithm. This is
done to enhance computational efficiency by keeping the trajectory optimization independent of the
value-function update. The proposed algorithm also exploits the advantage that the H-step lookahead
offers during exploration (as detailed in Lowrey et al. (2018)). This blend of model-based planning
and model-free learning helps achieve sample-efficient and computationally efficient learning. The
issue of "actor divergence", which generally prevails in off-policy algorithms, has also been addressed
by modifying the trajectory optimization method; the result is called "Actor Regularized Control"
(ARC). More details on these are given below.

2.1.1 ALGORITHM

In the proposed algorithm, transitions are first sampled by executing random actions to fill the
replay buffer with (s, a, r, s') tuples, where s is the initial state, a the action, s' the next state obtained
by applying action a at state s, and r the corresponding state-action reward. The parametrized actor
πϕ, the Q function Qθ, and the dynamics model M are initialized. After the replay buffer has been
filled with a fixed number of samples, these samples are used to optimize the policy πϕ and learn the
Q function Qθ using a model-free RL algorithm (here, the Soft Actor-Critic (SAC) algorithm). The
dynamics model is trained on this replay buffer until convergence is achieved (this is subsequently
done only after a fixed number of training steps, not at every training step). The ensemble of models
obtained via training is then used in the trajectory optimisation routine, described in the subsection
"Trajectory Optimisation Routine - ARC" below. Next, actions are sampled from the output of the
trajectory optimisation routine and the replay buffer is filled with the new transitions. The policy and
Q function are again optimised over this updated replay buffer.

Since the policy is optimised using a replay buffer that is iteratively filled with new transitions,
training reuses the same data over and over again. Hence, this off-policy algorithm is more
sample-efficient than purely model-free RL algorithms.
The Q function is updated using the following Bellman update equation (this backup is enabled
using the parametrized actor):

\mathcal{T} Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p,\; a_{t+1} \sim \pi_\phi}\left[\gamma\, Q(s_{t+1}, a_{t+1})\right] \qquad (1)
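A minimal sketch of the outer loop described in this subsection is given below. The component objects (env, sac, model, arc_plan) are hypothetical stand-ins for the environment, the SAC agent, the ensemble dynamics model, and the ARC planner of Section 2.1.2; the sketch illustrates the ordering of the steps, not the authors' implementation.

# Sketch of the LOOP outer loop (Section 2.1.1). All component objects are
# hypothetical stand-ins, not the authors' code.
import random

def loop_training(env, sac, model, arc_plan, warmup_steps=1000,
                  total_steps=100_000, model_update_every=250):
    replay = []                                   # replay buffer of (s, a, r, s') tuples
    s = env.reset()

    # 1) Fill the buffer by executing random actions.
    for _ in range(warmup_steps):
        a = env.sample_action()
        s_next, r, done = env.step(a)
        replay.append((s, a, r, s_next))
        s = env.reset() if done else s_next

    for step in range(total_steps):
        # 2) Periodically (re)fit the ensemble dynamics model on the buffer.
        if step % model_update_every == 0:
            model.fit(replay)

        # 3) Plan with ARC: H-step rollouts through the model ensemble,
        #    terminal value from the critic, prior regularized towards the actor.
        a = arc_plan(model, sac.actor, sac.critic, s)

        # 4) Execute the planned action and store the new transition.
        s_next, r, done = env.step(a)
        replay.append((s, a, r, s_next))
        s = env.reset() if done else s_next

        # 5) Off-policy SAC update of the actor and Q function on a minibatch.
        sac.update(random.sample(replay, k=min(256, len(replay))))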


2.1.2 TRAJECTORY OPTIMISATION ROUTINE - ARC


This section details the trajectory optimization routine of the algorithm, the reason behind the issue
of actor divergence, and the steps undertaken to avoid it.
Once the trained models (an ensemble of dynamics models obtained by training over the replay
buffer) are obtained, they are used in the trajectory optimization routine (referred to as ARC from
here on). The optimized policy πθ and Q function are also passed as arguments to ARC. However, it
has been observed that using different policies for collecting data (i.e., the H-step lookahead policy)
and for learning the value function (the parametrized actor) can result in a shift between the
state-action visitation distribution of the parametrized actor and that of the actual behavior policy.
This can lead to difficulties and instabilities in value-function learning. To resolve this issue, the
ARC routine has been proposed.
The proposed solution constrains the trajectory optimization policy to be close to the actor policy.
This can be framed as the following constrained optimization problem:

p_\tau^{opt} = \arg\max_{p_\tau}\; \mathbb{E}_{p_\tau}\!\left[ L_{H,\hat V}(s_t, \tau) \right], \quad \text{s.t.}\;\; D_{KL}\!\left(p_\tau \,\|\, p_\tau^{prior}\right) \le \epsilon \qquad (2)

where L_{H,\hat V}(s_t, \tau) = \mathbb{E}_{\hat M}\!\left[ R_{H,\hat V}(s_t, \tau) \right] is the expected lookahead objective under the learned model,
starting from state s_t; p_τ is a distribution over action sequences τ of horizon H starting from s_t, and
p_τ^{prior} is a prior distribution over such action sequences.
The optimal policy is approximated as a multivariate Gaussian with diagonal covariance, parameterized
by its mean and variance (hence the ARC policy is stochastic). Using iterative importance sampling,
the update rule for the parameters at step m + 1 is given as:

\mu^{m+1} = \frac{\sum_{i=1}^{N} e^{\frac{1}{\eta} L_{H,\hat V}(s_t, \tau_i')}\, \tau_i'}{\sum_{i=1}^{N} e^{\frac{1}{\eta} L_{H,\hat V}(s_t, \tau_i')}}, \qquad \left(\sigma^{m+1}\right)^2 = \frac{\sum_{i=1}^{N} e^{\frac{1}{\eta} L_{H,\hat V}(s_t, \tau_i')}\, \left(\tau_i' - \mu^{m+1}\right)^2}{\sum_{i=1}^{N} e^{\frac{1}{\eta} L_{H,\hat V}(s_t, \tau_i')}} \qquad (3)

where τ_i' ∼ N(µ^m, σ^m) and N(µ^0, σ^0) is set to p_τ^{prior}.




To constrain the action distribution of the trajectory optimization routine to be close to that of the
parametrized actor, we set

p_\tau^{prior} = \beta\, \pi_\theta + (1-\beta)\, \mathcal{N}(\mu^{t-1}, \sigma) \qquad (4)

Therefore, the prior is a linear combination of π_θ and a Gaussian centred on the action sequence from
the previous environment timestep.
Therefore, the output of the trajectory optimization routine is the updated mean of the distribution
over optimal H-step lookahead action sequences at a given environment timestep. Although an
update of the variance of this distribution is given for each iteration in the paper, in the actual
implementation the variance is modified only when the mean reaches a certain fixed value. Using
the new mean and variance of the ARC distribution, an action is sampled, which is then used to
update the replay buffer, and the steps continue as explained in the previous subsection. The ARC
policy obtained after training is used for evaluation and deployment.
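A compact sketch of one ARC planning step, in the spirit of Equations 3 and 4, is shown below (NumPy only). The callable lookahead_return stands in for the evaluation of L_{H,V̂} through the learned model ensemble and terminal critic; the sample counts, temperature eta, and mixing coefficient beta are illustrative defaults, not the values used in LOOP, and the prior is formed here by mixing the two proposal means, a simplification of the distribution mixture in Equation 4.

import numpy as np

def arc_update(lookahead_return, actor_mean, prev_mean, sigma,
               n_samples=64, n_iters=5, eta=1.0, beta=0.5):
    """One planning step: iterative importance-sampling update of a diagonal
    Gaussian over H-step action sequences (Eq. 3), with a prior mixing the
    actor proposal and the previous solution (simplified Eq. 4)."""
    H, act_dim = actor_mean.shape
    # Prior mean: mix of the actor proposal and the previous timestep's solution.
    mu = beta * actor_mean + (1.0 - beta) * prev_mean
    for _ in range(n_iters):
        # Sample candidate action sequences tau' ~ N(mu, sigma).
        taus = mu + sigma * np.random.randn(n_samples, H, act_dim)
        # Evaluate the H-step lookahead objective for each candidate.
        scores = np.array([lookahead_return(tau) for tau in taus])
        # Exponentiated importance weights, shifted for numerical stability.
        w = np.exp((scores - scores.max()) / eta)
        w /= w.sum()
        # Weighted mean and variance update (Eq. 3).
        mu = np.einsum('n,nha->ha', w, taus)
        var = np.einsum('n,nha->ha', w, (taus - mu) ** 2)
        sigma = np.sqrt(var) + 1e-6
    return mu, sigma   # the first action of mu is executed; mu is reused as prev_mean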

2.1.3 WHY DO H-STEP LOOKAHEAD METHODS WORK


The theorem explaining why H-step lookahead methods work has been provided in the LOOP paper
and is restated below.
Theorem 1. Suppose M̂ is an approximate dynamics model with total variation distance bounded
by ϵ_m. Let V̂ be an approximate value function such that max_s |V^⋆(s) − V̂(s)| ≤ ϵ_v, let the reward
function r(s, a) be bounded by [0, R_max], and let V̂ be bounded by [0, V_max]. Let ϵ_p be the suboptimality
incurred in the H-step lookahead optimisation. Then the performance of the H-step lookahead policy
π_{H,V̂} can be bounded as:

J^\star - J^{\pi_{H,\hat V}} \le \frac{2}{1-\gamma^H}\left[ C(\epsilon_m, H, \gamma) + \frac{\epsilon_p}{2} + \gamma^H \epsilon_v \right]

where

C(\epsilon_m, H, \gamma) = R_{\max}\sum_{t=0}^{H-1}\gamma^t\, t\, \epsilon_m + \gamma^H H \epsilon_m V_{\max}

The proofs of the supporting lemmas have been detailed by us in the appendix.
By comparing Lemma 1 (given in the appendix) with Theorem 1, it can be seen that the H-step
lookahead policy reduces the dependence on the value-function error but introduces an additional
dependence on the model error. In cases where the amount of data is low (as in locomotion of legged
robots), the value-function bias is more likely to dominate the model bias. Therefore, H-step
lookahead policies would perform better than 1-step greedy policies.
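As a rough numerical illustration of this trade-off, the snippet below evaluates the Theorem 1 bound against the 1-step greedy bound of Lemma 1 for made-up error values (not numbers from the paper) in which the value-function error dominates the model error.

def h_step_bound(eps_m, eps_v, eps_p, gamma, H, r_max=1.0, v_max=10.0):
    # C(eps_m, H, gamma) from Theorem 1.
    c = r_max * sum(gamma ** t * t * eps_m for t in range(H)) + gamma ** H * H * eps_m * v_max
    return 2.0 / (1.0 - gamma ** H) * (c + eps_p / 2.0 + gamma ** H * eps_v)

def one_step_greedy_bound(eps_v, gamma):
    # Lemma 1 (Appendix): 2 * gamma * eps_v / (1 - gamma).
    return 2.0 * gamma * eps_v / (1.0 - gamma)

# Illustrative values: large value-function error, small model error.
gamma, H, eps_v, eps_m, eps_p = 0.99, 10, 1.0, 0.01, 0.0
print(h_step_bound(eps_m, eps_v, eps_p, gamma, H))   # H-step lookahead bound
print(one_step_greedy_bound(eps_v, gamma))           # 1-step greedy bound

With these numbers the H-step bound is several times smaller than the 1-step bound; increasing eps_m eventually reverses the comparison, matching the discussion above.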

2.2 TD-MPC

Temporal Difference Learning for Model Predictive Control (TD-MPC) shares the concept of an
H-step lookahead reward. However, it distinguishes itself by learning a task-oriented latent dynamics
(TOLD) model, which is learnt concurrently with a terminal value function through temporal
difference learning. The key technical contribution of TD-MPC lies in this method for learning the
dynamics model. The latent-space representation of the model is derived solely from rewards, which
enables the model to focus on elements of the environment that are predictive of reward, thus enhancing
sample efficiency compared to methods relying on state or image prediction. Moreover, it remains
agnostic to the observation modality and is adept at handling sparse rewards.
are sought exclusively in the latent space. We proceed to analyze and derive a theoretical bound on
the performance of the H-step look-ahead policy attained in the latent space, in comparison to the
performance of the optimal policy had it been found in the original state space.
Theorem 2. Let V_θ be an approximate value function such that max_s |V^⋆(s) − V_θ(s)| ≤ ϵ_v, where
V_θ(s) := V_θ(h_θ(s)). Let the reward function r(s, a) be bounded and let R_θ be an approximate reward
function such that max_{s,a} |r(s, a) − R_θ(s, a)| ≤ ϵ_r, where R_θ(s, a) := R_θ(h_θ(s), a). Also, let R_θ
be Lipschitz in its first argument, i.e. |R_θ(x, a) − R_θ(y, a)| ≤ L|x − y|. Let the consistency loss be
bounded as max_{s,a} |h_θ(s_{t+1}) − d_θ(h_θ(s_t), a_t)| ≤ ϵ_h for all t. Then the performance of the H-step
lookahead policy π_{H,V} obtained in the latent space can be bounded, relative to the optimal H-step
lookahead policy π^⋆_{H,V} in the original state space, as:

\left|J^{\pi^\star_{H,V}} - J^{\pi_{H,V}}\right| \le \frac{\epsilon_r (1-\gamma^H) + L\epsilon_h\, \gamma\, (1-\gamma^{H-1})}{(1-\gamma)(1-\gamma^H)}

The proof has been given in the appendix.

2.2.1 TRAINING
The TOLD model consists of the following five learned components (MLPs), which predict the
following quantities:
Representation (encoder): z_t = h_θ(s_t)
Latent dynamics: z_{t+1} = d_θ(z_t, a_t)
Reward: r̂_t = R_θ(z_t, a_t)
Value: q̂_t = Q_θ(z_t, a_t)
Policy: â_t ∼ π_θ(z_t)
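A minimal PyTorch sketch of these five components is given below; the latent dimension, layer sizes, and activation are placeholders, not the hyperparameters of Hansen et al. (2022), and the actual implementation contains additional details (e.g. multiple Q heads) that are omitted here.

import torch
import torch.nn as nn

def mlp(inp, out, hidden=256):
    # Two-layer MLP used as a placeholder for each TOLD component
    # (layer sizes are illustrative, not the paper's hyperparameters).
    return nn.Sequential(nn.Linear(inp, hidden), nn.ELU(),
                         nn.Linear(hidden, out))

class TOLD(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=50):
        super().__init__()
        self.h = mlp(obs_dim, latent_dim)                # encoder: z_t = h(s_t)
        self.d = mlp(latent_dim + act_dim, latent_dim)   # latent dynamics: z_{t+1} = d(z_t, a_t)
        self.R = mlp(latent_dim + act_dim, 1)            # reward head: r_hat = R(z_t, a_t)
        self.Q = mlp(latent_dim + act_dim, 1)            # value head: q_hat = Q(z_t, a_t)
        self.pi = mlp(latent_dim, act_dim)               # policy head: a_hat = pi(z_t)

    def forward(self, s, a):
        z = self.h(s)
        za = torch.cat([z, a], dim=-1)
        return self.d(za), self.R(za), self.Q(za), torch.tanh(self.pi(z))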
TOLD minimises the following objective:

\mathcal{J}(\theta; \Gamma) = \sum_{i=t}^{t+H} \lambda^{i-t}\, \mathcal{L}(\theta; \Gamma_i) \qquad (5)

where

\mathcal{L}(\theta; \Gamma_i) = c_1 \underbrace{\left\| R_\theta(z_i, a_i) - r_i \right\|_2^2}_{\text{reward}} \qquad (6)
\;+\; c_2 \underbrace{\left\| Q_\theta(z_i, a_i) - \big(r_i + \gamma\, Q_{\theta^-}(z_{i+1}, \pi_\theta(z_{i+1}))\big) \right\|_2^2}_{\text{value}} \qquad (7)
\;+\; c_3 \underbrace{\left\| d_\theta(z_i, a_i) - h_{\theta^-}(s_{i+1}) \right\|_2^2}_{\text{latent state consistency}} \qquad (8)

A trajectory Γ_{0:H} of length H is sampled from a replay buffer, and the first observation s_0 is encoded
by h_θ into a latent representation z_0. The actions are sampled from the MPC policy Π_θ and executed
in the environment to gather state and reward observations; this trajectory is then appended to the
buffer. TOLD predicts the subsequent latent states z_1, z_2, . . . , z_H, as well as a value q̂, reward r̂, and
action â for each latent state, by optimizing the above loss. The gradients are backpropagated from
all three terms over multiple rollout steps. Estimating Q-targets via planning is very slow, hence the
RL policy π_θ is used instead. The prediction loss in latent space enforces temporal consistency in the
learned representation without explicit state or image prediction.
The deterministic RL policy π_θ is learned by minimizing the following loss:

\mathcal{J}_\pi(\theta; \Gamma) = -\sum_{i=t}^{t+H} \lambda^{i-t}\, Q_\theta(z_i, \pi_\theta(z_i)) \qquad (9)

This is similar to the policy objective commonly used in model-free actor-critic methods; the resulting
RL policy π_θ is found to be sufficiently expressive for efficient value learning.
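The sketch below illustrates, under the same assumptions as the TOLD sketch above (plus a hypothetical target copy told_target and placeholder coefficients), how the objectives of Equations 5 to 9 can be assembled over a sampled trajectory; it is an illustration of the loss structure, not the reference implementation.

import torch
import torch.nn.functional as F

def told_losses(told, told_target, traj, gamma=0.99, lam=0.5, c=(0.5, 0.1, 2.0)):
    """traj: dict of time-major tensors s[0..H], a[0..H-1], r[0..H-1].
    Returns the model loss (Eqs. 5-8) and the policy loss (Eq. 9)."""
    s, a, r = traj["s"], traj["a"], traj["r"]
    H = a.shape[0]
    z = told.h(s[0])                                   # encode the first observation only
    model_loss, pi_loss = 0.0, 0.0
    for t in range(H):
        za = torch.cat([z, a[t]], dim=-1)
        r_hat, q_hat = told.R(za), told.Q(za)
        with torch.no_grad():                          # TD target from the target network
            z_next = told_target.h(s[t + 1])
            a_next = torch.tanh(told_target.pi(z_next))
            td_target = r[t].unsqueeze(-1) + gamma * told_target.Q(
                torch.cat([z_next, a_next], dim=-1))
        model_loss = model_loss + (lam ** t) * (
            c[0] * F.mse_loss(r_hat.squeeze(-1), r[t])        # reward term (Eq. 6)
            + c[1] * F.mse_loss(q_hat, td_target)             # value term (Eq. 7)
            + c[2] * F.mse_loss(told.d(za), z_next))          # consistency term (Eq. 8)
        pi_loss = pi_loss - (lam ** t) * told.Q(
            torch.cat([z, torch.tanh(told.pi(z))], dim=-1)).mean()   # Eq. 9
        z = told.d(za)                                 # roll the latent state forward
    return model_loss, pi_loss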

2.2.2 PLANNING
The planning algorithm leverages Model Predictive Path Integral control (MPPI), a stochastic
optimization method, to find the MPC policy Π_θ as

\Pi_\theta(s_t) = \arg\max_{a_{t:t+H}}\; \mathbb{E}\left[\sum_{i=t}^{t+H} \gamma^i R(s_i, a_i)\right] \qquad (10)
The trajectories are generated using actions from both the MPC policy Π_θ and the RL policy π_θ.
To carry out what the authors call policy-guided trajectory optimization, actions are also sampled
from π_θ in addition to Π_θ. As π_θ is deterministic, linearly annealed Gaussian noise is applied to it to
make it stochastic. The transitions and expected return are obtained from the learnt models as
z_{t+1} ← d_θ(z_t, a_t) and

\phi_\Gamma = \sum_{t=0}^{H-1} \gamma^t R_\theta(z_t, a_t) + \gamma^H Q_\theta(z_H, a_H)

The top-k sampled trajectories in terms of expected return φ_Γ are used to update the parameters of a
family of distributions using an importance-weighted average of the estimated returns over J iterations.
The policy hence obtained is a time-dependent multivariate Gaussian with diagonal covariance, with
parameters (µ, σ)_{t:t+H}, µ, σ ∈ R^m:

\mu^j = \frac{\sum_{i=1}^{k} \Omega_i\, \Gamma_i^\star}{\sum_{i=1}^{k} \Omega_i}, \qquad \sigma^j = \sqrt{\frac{\sum_{i=1}^{k} \Omega_i \left(\Gamma_i^\star - \mu^j\right)^2}{\sum_{i=1}^{k} \Omega_i}} \qquad (11)

where Ω_i = e^{\tau \phi^\star_{\Gamma,i}}, τ is a temperature parameter, Γ_i^⋆ denotes the i-th top-k trajectory, and
j ∈ {0, 1, . . . , J}. The planning procedure is terminated after J iterations and a trajectory is sampled
from the final return-normalized distribution over action sequences. Only the first action is applied,
producing a feedback policy, i.e., a receding-horizon MPC is employed. To promote consistent
exploration across tasks, σ^j is constrained as

\sigma^j = \max\left( \sqrt{\frac{\sum_{i=1}^{k} \Omega_i \left(\Gamma_i^\star - \mu^j\right)^2}{\sum_{i=1}^{k} \Omega_i}},\; \epsilon \right) \qquad (12)

where ϵ ∈ R_+ is a linearly decayed constant. The planning horizon is also linearly increased to H,
since the model is inaccurate initially.
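A NumPy sketch of this planning procedure (Equations 10 to 12) is given below. The callable rollout_return stands in for the latent-model evaluation of φ_Γ, pi_proposals stands in for noisy action sequences obtained from π_θ, and the sample counts, temperature, and exploration floor are illustrative values, not the paper's settings.

import numpy as np

def td_mpc_plan(rollout_return, pi_proposals, H, act_dim,
                n_samples=512, n_pi=25, k=64, n_iters=6, tau=0.5, eps=0.05):
    """MPPI-style planning over latent rollouts (Eqs. 10-12).
    rollout_return(actions) -> phi_Gamma, the model-based return estimate
    sum_t gamma^t R(z_t, a_t) + gamma^H Q(z_H, a_H)."""
    mu, sigma = np.zeros((H, act_dim)), np.ones((H, act_dim))
    for _ in range(n_iters):
        noise = mu + sigma * np.random.randn(n_samples, H, act_dim)
        cand = np.concatenate([noise, pi_proposals[:n_pi]], axis=0)
        returns = np.array([rollout_return(a) for a in cand])
        top = np.argsort(returns)[-k:]                      # top-k trajectories
        omega = np.exp(tau * (returns[top] - returns[top].max()))
        omega /= omega.sum()
        mu = np.einsum('n,nha->ha', omega, cand[top])       # Eq. 11
        var = np.einsum('n,nha->ha', omega, (cand[top] - mu) ** 2)
        sigma = np.maximum(np.sqrt(var), eps)               # Eq. 12, exploration floor
    return mu[0]   # receding horizon: execute only the first action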


Figure 2: Reward graphs for all the environments on both LOOP and TD-MPC: (a) Walker - LOOP,
(b) Walker - TD-MPC, (c) Stoch BiRo - LOOP, (d) Stoch BiRo - TD-MPC, (e) Cassie - LOOP,
(f) Cassie - TD-MPC.

3 IMPLEMENTATION DETAILS

The code for both LOOP1 and TD-MPC2 is open source, so we have used it with only slight
modifications for the current library versions and some environment-specific requirements.
These results were generated using an Intel i5 9th-gen processor with 32 GB of memory and an
RTX 2090Ti GPU with 12 GB of memory.

4 RESULTS

In this section, we evaluate the performance of both algorithms on different bipedal environments.
For the bipedal environments, we have used the OpenAI Gym MuJoCo (Todorov et al., 2012)
locomotion task Walker-2D, the DeepMind Control (DMControl) (Tassa et al., 2018) locomotion task
Walker-2D walk, a custom bipedal environment named Stoch BiRo developed in Mothish et al. (2023),
and a Cassie running environment inspired by Kuznetsov et al. (2020).
We seek to answer the following questions:

1. How does the dynamics model scale for bipeds with a higher degree of freedom?
2. Will the latent dimension also work when shrunk to a dimension smaller than the observa-
tion space dimension?

Replicating existing results and comparing: Sikchi et al. (2022) and Hansen et al. (2022) have
already shown results on the OpenAI Gym MuJoCo (Todorov et al., 2012) locomotion task Walker-2D
and the DMControl (Tassa et al., 2018) locomotion task Walker-2D walk. Aggregate results are shown
in Fig. 2. We find that for bounded rewards TD-MPC outperforms LOOP. Regarding TD-MPC's claim
of using the same hyperparameters for all tasks, we observed that these hyperparameters do not work
for tasks outside DMControl. For the MuJoCo Walker-2D environment, LOOP performed sufficiently
well, but for TD-MPC the gradients exploded even after testing several hyperparameter settings.
Hansen et al. (2023) observed this exploding-gradient issue and claim to have solved it by tweaking
the method to encode every state moving forward and adding another learnable parameter for
multi-task action compensation.
Implementation on Stoch BiRo: In Mothish et al. (2023), the authors proposed the design of a new
biped and provided results for biped walking using a linear policy. They incorporated a hierarchical
controller similar to Krishna et al. (2022), with a high-level RL-based linear policy that converts
the step length and ellipse coordinates used as actions into the corresponding joint angles. The
low-level controller then takes these joint angles and computes the joint torques using PD control.
1 https://github.com/hari-sikchi/LOOP/tree/master
2 https://github.com/nicklashansen/tdmpc/tree/main


As the low-level controller simply follows the commanded joint angles, the objective of the high-level
controller is to generate actions such that the robot tracks the desired velocity and walks forward.
Hence the reward function used is:

r = G_{\omega_1}(e_\phi) + G_{\omega_2}(e_\theta) + G_{\omega_3}(e_{p_z}) + G_{\omega_4}(e_{v_x}) + G_{\omega_5}(e_{v_y}) + W \Delta x \qquad (13)

In the context of this environment, e_φ, e_θ, and e_{p_z} denote the errors in roll, pitch, and height from the
ground respectively, and e_{v_x}, e_{v_y} denote the errors in velocity along the x and y directions
respectively. Δx represents the displacement along the current heading direction, with a weighting
factor W. G_ω(·) is a Gaussian kernel defined as G_ω(x) = exp(−ω x²).
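A direct transcription of Equation 13 is shown below; the kernel widths and the displacement weight W are placeholders, not the values used in Mothish et al. (2023).

import numpy as np

def gaussian_kernel(x, w):
    # G_w(x) = exp(-w * x^2), as defined for Eq. 13.
    return np.exp(-w * x ** 2)

def stoch_biro_reward(e_roll, e_pitch, e_height, e_vx, e_vy, dx,
                      w=(5.0, 5.0, 10.0, 2.0, 2.0), W=10.0):
    # Eq. 13; the kernel widths w and weight W are illustrative placeholders.
    errors = (e_roll, e_pitch, e_height, e_vx, e_vy)
    return sum(gaussian_kernel(e, wi) for e, wi in zip(errors, w)) + W * dx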
Given the simplifications made to the environment, a gradient-based learning method was not a viable
option. We tried training on it, but the rewards kept oscillating around a constant value, as shown in
Fig. 2. The possible alternatives we could come up with to make this work with gradient-based
methods were either to change the action space to joint angles directly and make the reward depend
on the observations and actions, or to change the environment to replicate one of the pre-made
standard environments. We have attached results corresponding to both cases in the Appendix.
Implementation on Cassie: The standard environment for Cassie from Rudin et al. (2022) requires
parallelization of environments for meaningful training. Hence we have used an open-source
environment for Cassie based on Kuznetsov et al. (2020), with the task of running along the x-axis.
The action space is the standard one of joint torques, and the observation space consists of joint
angles and joint velocities. The reward is designed to keep the motion one-directional and to keep the
robot moving forward.
We have observed that LOOP trains fairly decently on this environment, but since the code is not
optimized or parallelized, it takes infeasibly long to train to a point where we can get any decisive
results. In the case of TD-MPC, although the code is more optimized and faster, it is still not
parallelized. The main issue in TD-MPC, however, comes from the gradient explosion mentioned in
Hansen et al. (2023). The results for the same are in Fig. 2.
(Some detailed explanations and more graphs can be seen in this presentation.3)

5 LESSONS AND OPPORTUNITIES


Lessons: While the results we produced do not demonstrate the generalization of these algorithms,
they show a direction in which RL research is moving. Researchers are trying to build algorithms
that generalize over all tasks and require less hyperparameter tuning for continuous control tasks.
Having said that, as we have shown, these algorithms still require a lot of work to reach that level of
generalization. Scaling up these algorithms is non-trivial, and parallelizing these off-policy methods
is crucial to achieving meaningful generalization.
Apart from this, the specific issues we observed with TD-MPC concern actor divergence and the lack
of a shrinking latent space. Using a latent space decreases interpretability but achieved success in
more complex environments. However, not shrinking the latent dimension leads to more memory
consumption and higher computational requirements.
Opportunities: As summarized above, a lot of work remains in these areas to reach the goal of
complete generalization of algorithms. This brings us to some recent works that have attempted to
address some of these problems through various methods. The newer version of TD-MPC (Hansen
et al., 2022), TD-MPC2 (Hansen et al., 2023), has demonstrated many more tasks than TD-MPC and
has also attempted to solve the environment-scaling issues faced in TD-MPC. Although Hansen et al.
(2023) claims to have solved the problem of generalization over continuous control tasks,
discrete-action tasks are still an open problem for TD-MPC-style methods.
Parallelizing off-policy algorithms can have a huge impact on the performance of these algorithms.
Li et al. (2023) have successfully attempted to scale off-policy algorithms, and integrating that with
these algorithms could be a future direction.
3 https://docs.google.com/presentation/d/1nZrH3ni9hlukSfHfEC8hqoYMm47wiD9gDBdhHQns1tQ/edit?usp=sharing


Building on the specific observation from TD-MPC regarding the expanding latent dimension, the
suggestion given by Prof. Shishir of using single-rigid-body dynamics, with the augmented latent
dimensions used only for encoding residual dynamics or extra information, could be a future direction.

6 CONCLUSION
In this project, we have analyzed and compared two H-step lookahead methods under a learned
model and value function. We tested them on different bipedal systems and derived some significant
inferences that would be beneficial for future work along these lines. We have also proposed a
theorem for the optimality bound corresponding to TD-MPC.

ACKNOWLEDGMENTS
We would like to thank Professor Shishir, Manan Tayal, Aditya Shirwatkar, Naman Saxena, and
GVS Mothish for their valuable insights during this project.

REFERENCES
Mohak Bhardwaj, Sanjiban Choudhury, and Byron Boots. Blending mpc & value function approx-
imation for efficient reinforcement learning. In International Conference on Learning Represen-
tations, 2020.
Yonathan Efroni, Mohammad Ghavamzadeh, and Shie Mannor. Online planning with lookahead
policies. Advances in Neural Information Processing Systems, 33:14024–14033, 2020.
Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive
control. arXiv preprint arXiv:2203.04955, 2022.
Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for contin-
uous control. arXiv preprint arXiv:2310.16828, 2023.
Lokesh Krishna, Guillermo A Castillo, Utkarsh A Mishra, Ayonga Hereid, and Shishir Kolathaya.
Linear policies are sufficient to realize robust bipedal walking on challenging terrains. IEEE
Robotics and Automation Letters, 7(2):2047–2054, 2022.
Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Controlling overesti-
mation bias with truncated mixture of continuous distributional quantile critics, 2020.
Zechu Li, Tao Chen, Zhang-Wei Hong, Anurag Ajay, and Pulkit Agrawal. Parallel q-learning:
Scaling off-policy reinforcement learning under massively parallel simulation. In International
Conference on Machine Learning, pp. 19440–19459. PMLR, 2023.
Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan
online, learn offline: Efficient learning and exploration via model-based control. In International
Conference on Learning Representations, 2018.
GVS Mothish, Karthik Rajgopal, Ravi Kola, Manan Tayal, and Shishir Kolathaya. Stoch biro:
Design and control of a low cost bipedal robot. arXiv preprint arXiv:2312.06512, 2023.
Aviv Rosenberg, Assaf Hallak, Shie Mannor, Gal Chechik, and Gal Dalal. Planning and learn-
ing with adaptive lookahead. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 37, pp. 9606–9613, 2023.
Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using
massively parallel deep reinforcement learning, 2022.
Harshit Sikchi, Wenxuan Zhou, and David Held. Learning off-policy with online planning. In
Conference on Robot Learning, pp. 1622–1633. PMLR, 2022.
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Bud-
den, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv
preprint arXiv:1801.00690, 2018.


Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control.
In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp. 5026–5033.
IEEE, 2012.
Anna Winnicki, Joseph Lubars, Michael Livesay, and R Srikant. The role of lookahead and approxi-
mate policy evaluation in reinforcement learning with linear value function approximation. arXiv
preprint arXiv:2109.13419, 2021.

A APPENDIX
Lemma 1. Consider an infinite-horizon discounted problem with γ < 1. Let J be an approximate
value function with ∥J − J^⋆∥ ≤ ϵ, ϵ > 0, and let µ be the one-step greedy policy based on J. Then:

\|J_\mu - J^\star\| \le \frac{2\gamma\epsilon}{1-\gamma} \qquad (14)

Proof. Using the fact that the operators T_µ and T are γ-contractions, the triangle inequality, and the
facts that T_µ J = T J (since µ is greedy with respect to J) and J^⋆ = T J^⋆, we have the following:

\|J_\mu - J^\star\| = \|T_\mu J_\mu - J^\star\|
\le \|T_\mu J_\mu - T_\mu J\| + \|T_\mu J - J^\star\|
= \|T_\mu J_\mu - T_\mu J\| + \|T J - T J^\star\|
\le \gamma\|J_\mu - J\| + \gamma\|J - J^\star\|
\le \gamma\|J_\mu - J^\star\| + 2\gamma\|J - J^\star\|
\le \gamma\|J_\mu - J^\star\| + 2\gamma\epsilon

Rearranging (equivalently, applying the inequality recursively), we get:

\|J_\mu - J^\star\| \le \frac{2\gamma\epsilon}{1-\gamma}

From the above lemma, we can clearly see that as γ → 1 the error bound grows without bound, which
can result in unstable learning even if the initial error in the value-function approximation is small.
The greedy policy might exploit approximation errors to produce a subpar policy.
Lemma 2. For any two bounded value functions J and J′ and for all k = 0, 1, . . ., the following holds:

\max_{x \in S}\left|T^k J(x) - T^k J'(x)\right| \le \gamma^k \max_{x \in S}\left|J(x) - J'(x)\right| \qquad (15)

Proof. This proof is standard and is given for completeness. Let

c := \max_{x \in S}\left|J(x) - J'(x)\right|

Then,

J(x) - c \le J'(x) \le J(x) + c \quad \forall x \in S

Applying T^k and using the monotonicity of the operator, together with the fact that shifting a value
function by a constant c shifts T^k of it by γ^k c (i.e. T^k(J \pm c\,\mathbf{1}) = T^k J \pm \gamma^k c\,\mathbf{1}), we get:

T^k J(x) - \gamma^k c \le T^k J'(x) \le T^k J(x) + \gamma^k c \quad \forall x \in S

Thus,

\left|T^k J(x) - T^k J'(x)\right| \le \gamma^k c \quad \forall x \in S

Lemma 3. Let the error in the approximate value function be ϵ := max_s |V̂(s) − V^⋆(s)|, and let the
terminal reward be r(s_H) = V̂(s_H). For all MDPs, the performance of the H-step lookahead policy π̂
is bounded as

V^\star(s) - V^{\hat{\pi}}(s) \le \frac{2\gamma^H}{1-\gamma^H}\,\epsilon

where π̂ is the H-step lookahead policy (greedy with respect to the H-step objective with terminal
value V̂) and π^⋆ is the optimal policy.


Proof. Let τ̂ and τ^⋆ denote trajectories of length H generated from state s under π̂ and π^⋆
respectively. For any state s,

V^\star(s) - V^{\hat{\pi}}(s) = \mathbb{E}_{\tau^\star}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V^\star(s_H)\right] - \mathbb{E}_{\hat{\tau}}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V^{\hat{\pi}}(s_H)\right]

Adding and subtracting \mathbb{E}_{\hat{\tau}}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V^\star(s_H)\right], we get:

V^\star(s) - V^{\hat{\pi}}(s) = \mathbb{E}_{\tau^\star}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V^\star(s_H)\right] - \mathbb{E}_{\hat{\tau}}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V^\star(s_H)\right] + \gamma^H\, \mathbb{E}_{\hat{\tau}}\left[V^\star(s_H) - V^{\hat{\pi}}(s_H)\right] \qquad (16)

Since τ̂ is generated by the H-step lookahead policy, which optimises the H-step objective with the
approximate terminal value function V̂, it achieves at least as high a value of that objective as the
trajectory generated by the optimal policy with the same terminal value function:

\mathbb{E}_{\hat{\tau}}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H \hat{V}(s_H)\right] \ge \mathbb{E}_{\tau^\star}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H \hat{V}(s_H)\right] \qquad (17)

Using Lemma 2 (equivalently, |V̂(s) − V^⋆(s)| ≤ ϵ for all s), the terminal value function can be
exchanged at a cost of at most γ^H ϵ:

\mathbb{E}_{\tau^\star}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V^\star(s_H)\right] \le \mathbb{E}_{\tau^\star}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H \hat{V}(s_H)\right] + \gamma^H \epsilon

\mathbb{E}_{\hat{\tau}}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V^\star(s_H)\right] \ge \mathbb{E}_{\hat{\tau}}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H \hat{V}(s_H)\right] - \gamma^H \epsilon

Subtracting the second inequality from the first and using Equation 17, we conclude:

\mathbb{E}_{\tau^\star}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V^\star(s_H)\right] - \mathbb{E}_{\hat{\tau}}\left[\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V^\star(s_H)\right] \le 2\gamma^H \epsilon

Substituting this back in Equation 16, we get:

V^\star(s) - V^{\hat{\pi}}(s) \le 2\gamma^H \epsilon + \gamma^H\, \mathbb{E}_{\hat{\tau}}\left[V^\star(s_H) - V^{\hat{\pi}}(s_H)\right]

Since this bound does not depend on the particular trajectory used to reach s_H, we can bound the
expectation by the worst case over states. Writing δ := max_s (V^⋆(s) − V^π̂(s)), the inequality becomes
δ ≤ 2γ^H ϵ + γ^H δ. Applying this bound recursively (equivalently, rearranging), we get the required
result:

V^\star(s) - V^{\hat{\pi}}(s) \le \frac{2\gamma^H}{1-\gamma^H}\,\epsilon


Comparing the above result with Lemma 1, we can see that for H > 1 the H-step lookahead policy is
less prone to value-function approximation errors, as the denominator 1 − γ^H dominates compared
to 1 − γ.
Theorem 2. Let V_θ be an approximate value function such that max_s |V^⋆(s) − V_θ(s)| ≤ ϵ_v, where
V_θ(s) := V_θ(h_θ(s)). Let the reward function r(s, a) be bounded and let R_θ be an approximate reward
function such that max_{s,a} |r(s, a) − R_θ(s, a)| ≤ ϵ_r, where R_θ(s, a) := R_θ(h_θ(s), a). Also, let R_θ
be Lipschitz in its first argument, i.e. |R_θ(x, a) − R_θ(y, a)| ≤ L|x − y|. Let the consistency loss be
bounded as max_{s,a} |h_θ(s_{t+1}) − d_θ(h_θ(s_t), a_t)| ≤ ϵ_h for all t. Then the performance of the H-step
lookahead policy π_{H,V} obtained in the latent space can be bounded, relative to the optimal H-step
lookahead policy π^⋆_{H,V} in the original state space, as:

\left|J^{\pi^\star_{H,V}} - J^{\pi_{H,V}}\right| \le \frac{\epsilon_r (1-\gamma^H) + L\epsilon_h\, \gamma\, (1-\gamma^{H-1})}{(1-\gamma)(1-\gamma^H)}

Proof. The performance difference we want to upper bound is

J^{\pi^\star_{H,V}} - J^{\pi_{H,V}} = V^\star(s_0) - V^{\pi}(h_\theta(s_0))

where V^π(h_θ(s_0)) denotes the H-step lookahead value computed through the learned latent-space
components. Expanding both terms over the H-step horizon:

V^\star(s_0) - V^{\pi}(h_\theta(s_0)) = \left[\sum_{t=0}^{H-1}\gamma^t r(s_t, a_t) + \gamma^H V^\star(s_H)\right] - \left[R_\theta(h_\theta(s_0), a_0) + \sum_{t=1}^{H-1}\gamma^t R_\theta(z_t, a_t) + \gamma^H V^{\pi}(z_H)\right]

Adding and subtracting \sum_{t=0}^{H-1}\gamma^t R_\theta(h_\theta(s_t), a_t) + \gamma^H V^\star(s_H), and using z_0 = h_\theta(s_0) so that
R_\theta(h_\theta(s_0), a_0) cancels against the t = 0 term:

V^\star(s_0) - V^{\pi}(h_\theta(s_0)) = \sum_{t=0}^{H-1}\gamma^t \big(r(s_t, a_t) - R_\theta(h_\theta(s_t), a_t)\big) + \sum_{t=1}^{H-1}\gamma^t \big(R_\theta(h_\theta(s_t), a_t) - R_\theta(z_t, a_t)\big) + \gamma^H \big(V^\star(s_H) - V^{\pi}(z_H)\big)

Since J^{\pi^\star_{H,V}} is the value corresponding to the optimal policy, this difference cannot be negative.
Taking the absolute value, using the fact that the absolute value of a sum is at most the sum of
absolute values with γ ∈ (0, 1), and bounding the reward-approximation terms by ϵ_r:

\left|V^\star(s_0) - V^{\pi}(h_\theta(s_0))\right| \le \sum_{t=0}^{H-1}\gamma^t \epsilon_r + \sum_{t=1}^{H-1}\gamma^t \left|R_\theta(h_\theta(s_t), a_t) - R_\theta(z_t, a_t)\right| + \gamma^H \left|V^\star(s_H) - V^{\pi}(z_H)\right| \qquad (18)

Now consider |R_θ(h_θ(s_t), a_t) − R_θ(z_t, a_t)|. Since z_t = d_θ(z_{t-1}, a_{t-1}), we have
R_θ(d_θ(z_{t-1}, a_{t-1}), a_t) = R_θ(z_t, a_t), and therefore

\left|R_\theta(h_\theta(s_t), a_t) - R_\theta(z_t, a_t)\right| = \left|R_\theta(h_\theta(s_t), a_t) - R_\theta(d_\theta(z_{t-1}, a_{t-1}), a_t)\right|

Now, as R_θ is assumed to be globally Lipschitz and the consistency loss is bounded:

\left|R_\theta(h_\theta(s_t), a_t) - R_\theta(d_\theta(z_{t-1}, a_{t-1}), a_t)\right| \le L\left|h_\theta(s_t) - d_\theta(z_{t-1}, a_{t-1})\right| \le L\epsilon_h

Substituting this back into Equation 18, we get:

\left|V^\star(s_0) - V^{\pi}(h_\theta(s_0))\right| \le \sum_{t=0}^{H-1}\gamma^t \epsilon_r + \sum_{t=1}^{H-1}\gamma^t L\epsilon_h + \gamma^H \left|V^\star(s_H) - V^{\pi}(z_H)\right|

The last term has the same form as the left-hand side, so the inequality can be applied recursively;
the resulting geometric recursion is bounded, and summing it yields the final bound:

\left|V^\star(s_0) - V^{\pi}(h_\theta(s_0))\right| \le \frac{\epsilon_r (1-\gamma^H) + L\epsilon_h\, \gamma\, (1-\gamma^{H-1})}{(1-\gamma)(1-\gamma^H)} \qquad (19)
