A Pontryagin Perspective On Reinforcement Learning
Abstract
1 Introduction
Reinforcement learning (RL) refers to "the optimal control of incompletely-known Markov decision processes" [Sutton and Barto, 2018, p. 2]. It has traditionally focused on applying dynamic programming algorithms, such as value iteration or policy iteration, to situations where the environment is unknown. These methods solve optimal control problems in a closed-loop fashion by learning feedback policies, which map states xt to actions ut. In contrast, this work introduces the paradigm of open-loop reinforcement learning (OLRL), in which entire action sequences u0:T−1, over a horizon T, are learned instead. The closed-loop and open-loop control paradigms are illustrated in Fig. 1.

Figure 1: Comparison of closed-loop (feedback) and open-loop (feedforward) control. In closed-loop reinforcement learning (RL), the goal is to learn a policy π(· | xt). In open-loop RL, a fixed sequence of actions u0:T−1 is learned instead, with the action ut independent of the states x0:t.
There are good reasons why RL research has historically focused on closed-loop control. Many
problems, particularly those with stochastic or unstable dynamics, cannot be satisfactorily solved in
an open-loop fashion. However, in some settings, open-loop control is the only viable option, for
example when economic reasons prevent the use of sensors. A feedback policy may also produce
unpredictable behavior if the environment changes, whereas an open-loop solution, which cannot
detect environmental changes, is robust in such cases. In open-loop control, we optimize over action
sequences, and so we need to find one action per time step. In finite-horizon settings, this can be
considerably cheaper than closed-loop control, as optimizing over policies amounts to finding one
action for each state of the system. Finally, there are also fundamental control-theoretic reasons
for combining a feedback policy that attenuates disturbances with a feedforward controller that
is optimized for reference tracking. In many situations, such a design significantly outperforms a
solution purely based on feedback [e.g., Åström and Murray, 2021, Sec. 12.4].
Indeed, for these reasons, there exists a large body of literature on open-loop optimal control theory
[Pontryagin et al., 1962]. In this work, we introduce a family of three new open-loop RL algorithms
by adapting this theory to the setting of unknown dynamics. Whereas closed-loop RL is largely based
on approximating the Bellman equation, the central equation of dynamic programming, we base our
algorithms on approximations of Pontryagin’s principle, the central equation of open-loop optimal
control. We provide a model-based method whose convergence we prove to be robust to modeling
errors. We then extend this algorithm to settings with completely unknown dynamics and propose
two fully online model-free methods. Finally, we empirically demonstrate the robustness and sample
efficiency of our methods on an inverted pendulum swing-up task and on two complex MuJoCo tasks.
2 Background
We consider a reinforcement learning setup with state space X ⊂ RD , action space U ⊂ RK , and a
fixed horizon T ∈ N. An episode always starts with the same initial state x0 ∈ X and follows the
deterministic dynamics f : X × U → X , such that xt+1 = f (xt , ut ) for all times t ∈ [T − 1]0 .1
After every transition, a deterministic reward r(xt , ut ) ∈ R is received, and at the end of an episode,
an additional terminal reward rT (xT ) ∈ R is computed as a function of the terminal state xT . We
can thus define the value of the state xt at time t as the sum of future rewards

    V_t(x_t; u_{t:T−1}) ≐ Σ_{τ=t}^{T−1} r(x_τ, u_τ) + r_T(x_T) = r(x_t, u_t) + V_{t+1}{f(x_t, u_t); u_{t+1:T−1}},

where we defined VT as the terminal reward function rT. Our goal is to find a sequence of actions u0:T−1 ∈ U^T maximizing the total sum of rewards J(u0:T−1) ≐ V0(x0; u0:T−1). We will tackle this trajectory optimization problem using gradient ascent.
Pontryagin’s principle. The gradient of the objective function J with respect to the action ut is
    ∇_{u_t} J(u_{0:T−1}) = ∇_u r(x_t, u_t) + ∇_u f(x_t, u_t) ∇_x V_{t+1}(x_{t+1}; u_{t+1:T−1}),   (1)

where the last factor is the costate λ_{t+1} ∈ R^D, and where the terms of J related to the earlier time steps τ ∈ [t−1]_0 vanish, as they do not depend on ut. We denote Jacobians as (∇_y f)_{i,j} ≐ ∂f_j/∂y_i. The costates λ1:T are defined as the gradients of the value function along the given trajectory. They can be computed through a backward recursion:

    λ_t ≐ ∇_x V_t(x_t; u_{t:T−1}) = ∇_x r(x_t, u_t) + ∇_x f(x_t, u_t) λ_{t+1},   (2)
    λ_T ≐ ∇V_T(x_T) = ∇r_T(x_T).   (3)
The gradient (1) of the objective function can thus be obtained by means of one forward pass through
the dynamics f (a rollout), yielding the states x0:T , and one backward pass through Eqs. (2) and (3),
yielding the costates λ1:T . The stationarity condition arising from setting Eq. (1) to zero, where
the costates are computed from Eqs. (2) and (3), is known as Pontryagin’s principle. (Pontryagin’s
principle in fact goes much further than this, as it generalizes to infinite-dimensional and constrained
settings.) We re-derive Eqs. (1) to (3) using the method of Lagrange multipliers in Appendix B.
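For concreteness, the forward–backward computation of Eqs. (1) to (3) can be sketched in a few lines of Python. The sketch below assumes that the dynamics f, the rewards r and rT, and their Jacobians and gradients are available as callables following the Jacobian convention defined above; all function names are illustrative and not part of any released code.

```python
import numpy as np

def pontryagin_gradient(x0, us, f, r_dx, r_du, rT_grad, f_dx, f_du):
    """Compute the gradients of J w.r.t. each action via Eqs. (1)-(3).

    Convention: (grad_y f)_{i,j} = d f_j / d y_i, so f_dx(x, u) has shape (D, D)
    and f_du(x, u) has shape (K, D); r_dx, r_du, rT_grad return gradient vectors.
    """
    T = len(us)
    # Forward pass: one rollout through the (known) dynamics.
    xs = [x0]
    for t in range(T):
        xs.append(f(xs[t], us[t]))
    # Backward pass: costates lambda_T, ..., lambda_1 and gradients, Eqs. (1)-(3).
    lam = rT_grad(xs[T])                                           # lambda_T
    grads = [None] * T
    for t in reversed(range(T)):
        grads[t] = r_du(xs[t], us[t]) + f_du(xs[t], us[t]) @ lam   # Eq. (1)
        lam = r_dx(xs[t], us[t]) + f_dx(xs[t], us[t]) @ lam        # Eq. (2)
    return grads
```

A gradient-ascent step on the action sequence is then simply ut ← ut + η · grads[t] for every time step t.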
3 Method
If the dynamics are known, then the trajectory can be optimized as described above, by performing
gradient ascent with the gradients computed according to Pontryagin’s equations (1) to (3). In this
work, we adapt this idea to the domain of reinforcement learning, where the dynamics are unknown.
In RL, we are able to interact with the environment, so the forward pass through the dynamics f
is not an issue. However, the gradient computation according to Pontryagin’s principle requires
the Jacobians ∇_x f_t ≐ ∇_x f(x_t, u_t) and ∇_u f_t ≐ ∇_u f(x_t, u_t) of the unknown dynamics. In our methods, which follow the structure of Algorithm 1, we therefore replace these Jacobians by estimates At ≃ ∇x ft and Bt ≃ ∇u ft. We now show that, given sufficiently accurate estimates At and Bt, this algorithm still converges to a local maximum of the objective J. In the following sections we then discuss model-based and model-free methods to obtain such estimates.

¹For n ∈ N, we write [n] ≐ {1, 2, . . . , n} and [n]_0 ≐ {0, 1, . . . , n}. Unless explicitly mentioned, all time-dependent equations hold for all t ∈ [T−1]_0.
Assumption 3.2. There exist constants γ, ζ > 0 with γ + ζ + γζ < 1 such that for any trajectory (u0:T−1, x0:T) encountered by Algorithm 1, the following properties hold for all t ∈ [T−1]_0:

(a) The error of A_{t+s} is bounded, for all s ∈ [T−t], in the following way:

    ‖A_{t+s} − ∇_x f_{t+s}‖ ≤ (γ/3^s) · (σ_min(∇_u f_t)/σ̄(∇_u f_t)) · {∏_{i=1}^{s−1} σ_min(∇_x f_{t+i})/σ̄(∇_x f_{t+i})} · σ_min(∇_x f_{t+s}).

(b) The error of Bt is bounded in the following way: ‖B_t − ∇_u f_t‖ ≤ ζ σ_min(∇_u f_t).

Here, σ_min(A) and ‖A‖ = σ̄(A) denote the minimum and maximum singular value of A, respectively.
This assumption restricts the errors of the estimates At and Bt which are used in place of the true
Jacobians ∇x ft and ∇u ft in Algorithm 1. Although the use of the true system for collecting rollouts
prevents a buildup of error in the forward pass, any error in the approximate costate λ̃t can still be
amplified by the Jacobians of earlier time steps, ∇x fτ for τ ∈ [t−1], during the backward pass. Thus,
to ensure convergence to a stationary point of the true objective function, the errors of these Jacobians
need to be small. This is particularly important for t close to T , as the errors will be amplified over
more time steps. Assumption 3.2 provides a quantitative characterization of this intuition.
Assumption 3.3. There exists a constant L > 0 such that, for all action sequences u^A_{0:T−1}, u^B_{0:T−1} ∈ U^T and all times t ∈ [T−1]_0,

    ‖∇_{u_t} J(u^A_{0:T−1}) − ∇_{u_t} J(u^B_{0:T−1})‖ ≤ L ‖u^A_t − u^B_t‖.
This final assumption states that the objective function J is L-smooth with respect to the action ut at
each time step t ∈ [T − 1]0 , which is a standard assumption in nonconvex optimization. This implies
that the dynamics f are smooth as well. We are now ready to state the convergence result.
Theorem 3.4. Suppose Assumptions 3.1 to 3.3 hold with γ, ζ, and L. Let µ ≐ 1 − γ − ζ − γζ and ν ≐ 1 + γ + ζ + γζ. If the step size η is chosen small enough such that α ≐ µ − (1/2)ηLν² is positive, then the iterates (u^{(k)}_{0:T−1})_{k=0}^{N−1} of Algorithm 1 satisfy, for all N ∈ N and t ∈ [T−1]_0,

    (1/N) Σ_{k=0}^{N−1} ‖∇_{u_t} J(u^{(k)}_{0:T−1})‖² ≤ {J^⋆ − J(u^{(0)}_{0:T−1})} / (αηN),

where J^⋆ ≐ sup_{u∈U^T} J(u) is the optimal value of the initial state.
Proof. See Appendix D.
The most direct way to approximate the Jacobians ∇x ft and ∇u ft is by using a (learned or manually
designed) differentiable model f˜ : X × U → X of the dynamics f and setting At = ∇x f˜(xt , ut )
and Bt = ∇u f˜(xt , ut ) in Line 6 of Algorithm 1. Theorem 3.4 guarantees that this model-based
open-loop RL method (see Algorithm A.1) is robust to a certain amount of modeling error. In contrast
to this, consider the more naive planning method of using the model to directly obtain the gradient by
differentiating
    J(u_{0:T−1}) ≃ r(x_0, u_0) + r{f̃(x_0, u_0), u_1} + ⋯ + r_T{f̃(f̃(⋯ f̃(f̃(x_0, u_0), u_1) ⋯), u_{T−1})}

(where the nested model compositions play the role of the imagined states x̃_1, …, x̃_T)
with respect to the actions u0:T −1 using the backpropagation algorithm. In Appendix C, we show that
this approach is exactly equivalent to an approximation of Algorithm 1 where, in addition to setting
At = ∇x f˜(xt , ut ) and Bt = ∇u f˜(xt , ut ), the forward pass of Line 3 is replaced by an imagined
forward pass through the model f˜. In Section 4, we empirically demonstrate that this planning
method, whose convergence is not guaranteed by Theorem 3.4, is much less robust to modeling errors
than the open-loop RL approach.
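To make this contrast concrete, the following sketch (our own illustration, not the paper's released code) implements one update of the model-based variant: the states are obtained by rolling out the real environment, while the Jacobian estimates At and Bt come from a differentiable model f̃. Replacing env_rollout by an imagined rollout through the model would correspond to the naive planning method discussed above; env_rollout, model_dfdx, and model_dfdu are assumed interfaces.

```python
import numpy as np

def model_based_olrl_step(us, env_rollout, model_dfdx, model_dfdu,
                          r_dx, r_du, rT_grad, eta=1e-3):
    """One update in the spirit of Algorithm A.1 (sketch; names are ours).

    Forward pass through the real system, backward pass with model Jacobians
    A_t ~ grad_x f~(x_t, u_t) and B_t ~ grad_u f~(x_t, u_t).
    """
    xs = env_rollout(us)                      # real states x_0, ..., x_T
    lam = rT_grad(xs[-1])                     # approximate costate lambda_T
    for t in reversed(range(len(us))):
        A_t = model_dfdx(xs[t], us[t])        # shape (D, D)
        B_t = model_dfdu(xs[t], us[t])        # shape (K, D)
        g_t = r_du(xs[t], us[t]) + B_t @ lam  # approximate gradient w.r.t. u_t
        lam = r_dx(xs[t], us[t]) + A_t @ lam  # approximate costate lambda_t
        us[t] = us[t] + eta * g_t             # gradient ascent on u_t
    return us
```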
Access to a reasonably accurate model may not always be feasible, and as Algorithm 1 only requires
the Jacobians of the dynamics along the current trajectory, a global model is also not necessary. In
the following two sections, we propose two methods that directly estimate the Jacobians ∇x ft and
∇u ft from rollouts in the environment. We call these methods model-free, as the estimated Jacobians
are only valid along the current trajectory, and thus cannot be used for planning.
Our goal is to estimate the Jacobians ∇x f (x̄t , ūt ) and ∇u f (x̄t , ūt ) that lie along the trajectory
induced by the action sequence ū0:T −1 . These Jacobians measure how the next state (x̄t+1 ) changes
if the current state or action (x̄t , ūt ) are slightly perturbed. More formally, the dynamics f may be
linearized about the reference trajectory (ū0:T −1 , x̄0:T ) as
    Δx_{t+1} ≐ f(x_t, u_t) − f(x̄_t, ū_t) ≃ ∇_x f(x̄_t, ū_t)^⊤ Δx_t + ∇_u f(x̄_t, ū_t)^⊤ Δu_t,   with Δx_t ≐ x_t − x̄_t, Δu_t ≐ u_t − ū_t,
which is a valid approximation if the perturbations ∆x0:T and ∆u0:T −1 are small. By collecting a
dataset of M ∈ N rollouts with slightly perturbed actions, we can thus estimate the Jacobians by
solving the (analytically tractable) least-squares problem
    argmin_{[A_t^⊤ B_t^⊤] ∈ R^{D×(D+K)}} Σ_{i=1}^{M} ‖A_t^⊤ Δx_t^{(i)} + B_t^⊤ Δu_t^{(i)} − Δx_{t+1}^{(i)}‖².   (4)
This technique is illustrated in Fig. 2a (dashed purple line). Using these estimates in Algorithm 1
yields a model-free method we call on-trajectory, as the gradient estimate relies only on data generated
based on the current trajectory (see Algorithm A.2 for details). We see a connection to on-policy
methods in closed-loop reinforcement learning, where the policy gradient estimate (or the Q-update)
similarly depends only on data generated under the current policy. Like on-policy methods, on-
trajectory methods will benefit greatly from the possibility of parallel environments, which could
reduce the effective complexity of the forward pass stage from M +1 rollouts to that of a single rollout.
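The least-squares problem (4) is a standard linear regression and can be solved directly, for example with numpy.linalg.lstsq. The sketch below (with stacked perturbation arrays as an assumed input format) returns estimates for a single time step t.

```python
import numpy as np

def estimate_jacobians(dX, dU, dX_next):
    """Solve Eq. (4) for one time step t.

    dX      : (M, D) state deviations      Delta x_t^(i)
    dU      : (M, K) action deviations     Delta u_t^(i)
    dX_next : (M, D) next-state deviations Delta x_{t+1}^(i)
    Returns A_t of shape (D, D) and B_t of shape (K, D), following the
    paper's Jacobian convention.
    """
    Z = np.hstack([dX, dU])                           # (M, D + K) regressors
    W, *_ = np.linalg.lstsq(Z, dX_next, rcond=None)   # stacked [A_t; B_t], (D + K, D)
    D = dX.shape[1]
    return W[:D], W[D:]
```

These estimates then take the place of ∇x ft and ∇u ft in the backward pass of Algorithm 1.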
Exploiting the Markovian structure. Consider a direct linearization of the objective function J about the current trajectory. Writing the action sequence as a vector ū ≐ vec(ū_{0:T−1}) ∈ R^{TK}, this linearization is given, for u ∈ R^{TK} close to ū, by

    J(u) ≃ J(ū) + ∇J(ū)^⊤ (u − ū).

We can thus estimate the gradient of the objective function by solving the least squares problem

    ∇J(ū) ≃ argmin_{g ∈ R^{TK}} Σ_{i=1}^{M} {J(u_i) − J(ū) − g^⊤ (u_i − ū)}²,

where {u_i} are M ∈ N slightly perturbed action sequences.
Figure 2: (a) The Jacobians of f (slope of the green linearization) at the reference point (x̄_t, ū_t) can be estimated from the transitions {(x_t^{(i)}, u_t^{(i)}, x_{t+1}^{(i)})}_{i=1}^{M} of M perturbed rollouts. (b) The Jacobians of subsequent trajectories (indexed by k) remain close. To estimate the Jacobian at iteration k, the most recent iterate (k − 1) is more relevant than older iterates.
Due to the dimensionality of ū, this
method requires O(T K) rollouts to estimate the gradient. In contrast to this, our approach leverages
the Markovian structure of the problem, including the fact that we observe the states x0:T in each
rollout. As the Jacobians at all time steps are estimated jointly, we can expect to get a useful gradient
estimate from only O(D2 + DK) rollouts, which significantly reduces the sample complexity if T is
large. This gain in efficiency is demonstrated empirically in Section 4.
The on-trajectory algorithm is sample-efficient in the sense that it leverages the problem structure,
but a key inefficiency remains: the rollout data sampled at each iteration is discarded after the action
sequence is updated. In this section, we propose an off-trajectory method that implicitly uses the
data from previous trajectories to construct the Jacobian estimates. Our approach is based on the
following observation. If the dynamics f are smooth and the step size η is small, then the updated trajectory (u^{(k)}_{0:T−1}, x^{(k)}_{0:T}) will remain close to the previous iterate (u^{(k−1)}_{0:T−1}, x^{(k−1)}_{0:T}). Furthermore,
the Jacobians along the updated trajectory will be similar to the previous Jacobians, as illustrated in
Fig. 2b. Thus, we propose to estimate the Jacobians along the current trajectory from a single rollout
only by bootstrapping our estimates using the Jacobian estimates from the previous iteration.
Consider again the problem of estimating the Jacobians from multiple perturbed rollouts, illustrated
in Fig. 2a. Instead of relying on a reference trajectory and Eq. (4), we can estimate the Jacobians by
fitting a linear regression model to the dataset of M perturbed transitions. Solving
    argmin_{[A_t^⊤ B_t^⊤ c_t] ∈ R^{D×(D+K+1)}} Σ_{i=1}^{M} ‖A_t^⊤ x_t^{(i)} + B_t^⊤ u_t^{(i)} + c_t − x_{t+1}^{(i)}‖²   (5)

yields an approximate linearization f(x_t, u_t) ≃ A_t^⊤ x_t + B_t^⊤ u_t + c_t = F_t z_t, with F_t ≐ [A_t^⊤  B_t^⊤  c_t] and z_t ≐ (x_t, u_t, 1) ∈ R^{D+K+1}. This approximation is also shown in Fig. 2a (dotted gray line).²
At iteration k, given the estimate F_t^{(k−1)} and a new point z_t^{(k)} = (x_t^{(k)}, u_t^{(k)}, 1) and corresponding target x_{t+1}^{(k)}, computing the new estimate F_t^{(k)} is a problem of online linear regression. We solve this regression problem using an augmented version of the recursive least squares (RLS) algorithm [e.g., Ljung, 1999, Sec. 11.2]. By introducing a prior precision matrix Q_t^{(0)} ≐ q_0 I for each time t, where q_0 > 0, in Algorithm A.3 we compute the update at iteration k ∈ N as

    Q_t^{(k)} = α Q_t^{(k−1)} + (1 − α) q_0 I + z_t^{(k)} {z_t^{(k)}}^⊤,   (6)
    F_t^{(k)} = F_t^{(k−1)} + {Q_t^{(k)}}^{−1} z_t^{(k)} {x_{t+1}^{(k)} − F_t^{(k−1)} z_t^{(k)}}^⊤.
²If we replace Eq. (4) by Eq. (5) in Algorithm A.2, we get a slightly different on-trajectory method that has a similar performance.
[Figure 3: the inverted pendulum swing-up system, with horizontal force F, cart position ℓ, and pendulum angle θ; inset: the pendulum tip trajectory over time t.]
[Figure 4: solve rates of model-based OLRL, naive planning, MLP model OLRL, and MLP model planning.]
Forgetting and stability. The standard RLS update of the precision matrix corresponds to Eq. (6)
with α = 0. In the limit as q0 → 0, the RLS algorithm is equivalent to the batch processing of Eq. (5),
which treats all points equally. However, as illustrated in Fig. 2b, points from recent trajectories should
be given more weight, as transitions that happened many iterations ago will give little information
about the Jacobians along the current trajectory. We thus incorporate a forgetting factor α ∈ (0, 1)
into the precision update with the effect that past data points are exponentially downweighted:
    Q_t^{(k)} = α Q_t^{(k−1)} + z_t^{(k)} {z_t^{(k)}}^⊤   ⇝   Q_t^{(k)} = α^k q_0 I + Σ_{i=1}^{k} α^{k−i} z_t^{(i)} {z_t^{(i)}}^⊤.   (7)
This forgetting factor introduces a new problem: instability. If subsequent trajectories lie close to each
other, then the sum of outer products may become singular (clearly, if all zi are identical, then the sum
has rank 1). As the prior q0 I is downweighted, at some point Q may become singular. Our modifica-
tion in Eq. (6) adds (1 − α)q0 I in each update, which has the effect of removing the αk coefficient in
front of q0 I in Eq. (7). If the optimization procedure converges, then eventually subsequent trajectories
will indeed lie close together. Although Eq. (6) prevents issues with instability, the quality of the Jaco-
bian estimates will still degrade, as this estimation inherently requires perturbations (see Section 3.3).
In Algorithm A.3, we thus slightly perturb the actions used in each rollout to get more diverse data.
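A minimal sketch of the modified RLS update of Eq. (6), for a single time step and a single new transition, is given below (our own code; variable names are illustrative). The residual outer product is transposed so that the shapes match F_t ∈ R^{D×(D+K+1)}.

```python
import numpy as np

def rls_update(F_t, Q_t, x_t, u_t, x_next, alpha, q0):
    """Recursive least squares with forgetting and a persistent prior, Eq. (6).

    F_t : (D, D + K + 1) current linearization [A_t^T  B_t^T  c_t]
    Q_t : (D + K + 1, D + K + 1) precision matrix
    """
    z = np.concatenate([x_t, u_t, [1.0]])             # z_t = (x_t, u_t, 1)
    Q_t = alpha * Q_t + (1.0 - alpha) * q0 * np.eye(len(z)) + np.outer(z, z)
    residual = x_next - F_t @ z                       # one-step prediction error
    F_t = F_t + np.outer(np.linalg.solve(Q_t, z), residual).T
    return F_t, Q_t
```

The Jacobian estimates used in the backward pass can then be read off as A_t = F_t[:, :D].T and B_t = F_t[:, D:D+K].T.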
4 Experiments
4.1 Inverted pendulum swing-up
We empirically evaluate our algorithms on the inverted pendulum swing-up problem shown in Fig. 3.
The state at time t is xt = (ℓ, ℓ̇, θ, θ̇)t ∈ R4 , where ℓ is the position of the cart on the bar and θ is the
pendulum angle. The action ut is the horizontal force F applied to the cart at time t. Episodes are of
length T = 100, the running reward r(x, u) ∝ −u2 penalizes large forces, and the terminal reward
rT (x) = −∥x∥1 defines the goal state to be at rest in the upright position.
Performance metrics. We monitor several performance criteria, all of which are based on the return J. First, we define J_max ≐ max_{k∈[N]_0} J(u^{(k)}_{0:T−1}) as the return achieved by the best action sequence
over a complete learning process of N optimization steps. Based on this quantity, we say that the
swing-up task is solved if Jmax > −0.03. This threshold was determined empirically. If the algorithm
or the model is randomized, then [Jmax > −0.03] is a Bernoulli random variable whose mean, which
we call the solve rate, depends on the quality of the learning algorithm. We repeat all experiments
with 100 random seeds and show 95% bootstrap confidence intervals in all plots.
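As an illustration of how these quantities can be computed, the sketch below evaluates the solve rate and a percentile-bootstrap confidence interval from the per-seed values of J_max (the exact bootstrap variant used for the plots is not specified here, so the resampling scheme is an assumption).

```python
import numpy as np

def solve_rate_with_ci(j_max_per_seed, threshold=-0.03, n_boot=10000, conf=0.95, seed=0):
    """Solve rate (mean of the Bernoulli outcomes [J_max > threshold]) with a bootstrap CI."""
    rng = np.random.default_rng(seed)
    solved = np.asarray(j_max_per_seed) > threshold
    rate = solved.mean()
    resamples = rng.choice(solved, size=(n_boot, solved.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(resamples, [(1.0 - conf) / 2.0, (1.0 + conf) / 2.0])
    return rate, (lo, hi)
```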
Robustness: model-based open-loop RL. We start with the model-based method for open-loop RL
of Section 3.2 (Algorithm A.1). In Theorem 3.4, we proved that this method can accommodate some
model error and still converge to a local maximum of the true objective. In Fig. 4, we empirically
analyze this robustness property on the pendulum system. This system has five parameters: the
masses of the cart and the pendulum tip, the length of the pendulum, and the friction coefficients for
linear and rotational motion. To test the robustness of our algorithm against model misspecification,
we use a pendulum system with inaccurate parameters as the model f̃. Concretely, if m_i is the i-th parameter of the true system, we sample the corresponding model parameter m̃_i from a log-normal distribution centered at m_i, such that m̃_i = ξ m_i, with ln ξ ∼ N(0, s²). The severity of the model error is then controlled by the scale parameter s. In Fig. 4, we compare the performance of our method with the planning procedure described in Section 3.2, in which the forward pass is performed through the model f̃ instead of the real system f. Whereas the planning method only solves the pendulum reliably with the true system as the model (s = 0), the open-loop RL method can accommodate a considerable model misspecification.

Figure 5: The on-trajectory open-loop RL method is more sample-efficient than the finite-difference and cross-entropy methods. It is also much less sensitive to the noise scale σ.

Figure 6: Learning curves on the pendulum task. On the right, we show a longer time period in log scale. The off-trajectory open-loop RL method converges almost as fast as the oracle method.
In a second experiment, we represent the model f˜ by a small multi-layer perceptron (MLP). The
model is learned from 1000 rollouts on the pendulum system, with the action sequences sampled from
a pink noise distribution, as suggested by Eberhard et al. [2023]. Figure 4 shows the performance
achieved with this model by our proposed algorithm and by the planning method. As the MLP model
represents a considerable misspecification of the true dynamics, only the open-loop RL method
manages to solve the pendulum task.
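The parameter perturbation used in the robustness experiment is easy to reproduce; the sketch below samples misspecified model parameters m̃_i = ξ m_i with ln ξ ∼ N(0, s²). The parameter names and values are placeholders, not the ones used in the paper.

```python
import numpy as np

def perturb_parameters(true_params, s, seed=0):
    """Sample misspecified model parameters with log-normal multiplicative noise."""
    rng = np.random.default_rng(seed)
    return {name: float(np.exp(rng.normal(0.0, s))) * value
            for name, value in true_params.items()}

# Hypothetical pendulum parameters (placeholders for the five parameters in the text).
true_params = {"cart_mass": 1.0, "tip_mass": 0.1, "length": 0.5,
               "linear_friction": 0.01, "rotational_friction": 0.01}
model_params = perturb_parameters(true_params, s=0.3)
```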
Structure: on-trajectory open-loop RL. In Section 3.3, we proposed a model-free approach
(Algorithm A.2) that uses rollouts to directly estimate the Jacobians needed to update the action
sequence. It is clear from Eq. (4) that more rollouts (i.e., larger M ) will give more accurate Jacobian
estimates, and therefore increase the quality of the gradient approximation. In Fig. 5, we analyze the
sample efficiency of the on-trajectory OLRL algorithm by comparing the performance achieved at
different values of M , where the number N of optimization steps remains fixed. We compare our
method to the finite-difference approach described at the end of Section 3.3 and to the gradient-free
cross-entropy method [CEM; Rubinstein, 1999]. Both these methods also update the action sequence
on the basis of M perturbed rollouts in the environment. As in our method, the M action sequences
are perturbed using Gaussian white noise with noise scale σ. We describe both baselines in detail
in Appendix F. The oracle performance shown in Fig. 5 corresponds to Algorithm 1 with the true
gradient, i.e., At = ∇x ft and Bt = ∇u ft .
We note two things in Fig. 5. First, the performance of both the finite-difference method and CEM
heavily depends on the choice of the noise scale σ, whereas our method performs identically for
all three values of σ. Second, even for tuned values of σ, the finite-difference method and CEM
still need approximately twice as many rollouts per iteration as the open-loop RL method to reliably
swing up the pendulum. At 10 rollouts per iteration, our method matches the oracle’s performance,
while both baselines are below the solved threshold of Jmax = −0.03. This empirically confirms our
theoretical claims at the end of Section 3.3, where we argue that exploiting the Markovian structure
of the problem leads to increased sample efficiency.
Efficiency: off-trajectory open-loop RL. Finally, we turn to the method proposed in Section 3.4
(Algorithm A.3), which promises increased sample efficiency by estimating the Jacobians in an
off-trajectory fashion. The performance of this algorithm is shown in Fig. 6, where the learning
curves of all our methods as well as the two baselines and the oracle are plotted. For the on-trajectory
methods compared in Fig. 5, we chose for each the minimum number of rollouts M such that, under
the best choice for σ, the method would reliably solve the swing-up task. The hyperparameters for all
methods are summarized in Appendix G. It can be seen that the off-trajectory method, which only
requires one rollout per iteration, converges much faster than the on-trajectory open-loop RL method.
4.2 MuJoCo

The inverted pendulum swing-up considered in the previous section is a relatively simple task with smooth, low-dimensional, and deterministic dynamics. In this section, we test our method in two considerably more challenging environments: the Ant-v4 and HalfCheetah-v4 tasks provided by the OpenAI Gym library [Brockman et al., 2016, Towers et al., 2023], implemented in MuJoCo [Todorov et al., 2012]. These environments are high-dimensional, they exhibit non-smooth contact dynamics, and the initial state is randomly sampled at the beginning of each episode.

Figure 7: Learning curves of our off-trajectory open-loop RL method and soft actor-critic for two MuJoCo tasks. All experiments were repeated with 20 random seeds, and we show 95%-bootstrap confidence intervals for the average return. The horizon is fixed to T = 100.
We tackle these two tasks with our model-free off-trajectory method (Algorithm A.3). The results are
shown in Fig. 7, where we compare to the closed-loop RL baseline soft actor-critic [SAC; Haarnoja
et al., 2018a]. It can be seen that the open-loop RL method performs comparably to SAC, even though
SAC learns a closed-loop policy that is capable of adapting its behavior to the initial condition.3 In
the figure, we also analyze the open-loop performance achieved by SAC. Whereas the closed-loop
performance is the return obtained in a rollout where the actions are taken according to the mean of
the Gaussian policy, the open-loop return is achieved by blindly executing exactly the same actions
in a new episode. The discrepancy in performance is thus completely due to the stochasticity in the
initial state. In Appendix E, we show that our method also works with a longer horizon T .
The results demonstrate that the open-loop RL method is robust to a certain level of stochasticity in
the initial state of stable dynamical systems. Additionally, while our convergence analysis depends on
the assumption of smooth dynamics, these experiments empirically demonstrate that the algorithms
are also able to tackle non-smooth contact dynamics. Finally, we see that the high dimensionality
of the MuJoCo systems is handled without complications. While soft actor-critic is an elegant and
powerful algorithm, the combination with deep function approximation can make efficient learning
more difficult. Our methods are considerably simpler and, because they are based on Pontryagin’s
principle rather than dynamic programming, they evade the curse of dimensionality by design, and
thus do not require any function approximation.
5 Related work
Our work is inspired by numerical optimal control theory [Betts, 2010, Geering, 2007], which
deals with the numerical solution of trajectory optimization problems, a setting that has also been
studied in machine learning [Schaal and Atkeson, 2010, Howe et al., 2022]. As in our approach,
an important aspect of solution methods is to exploit the Markovian structure of the dynamics to
reduce computation [Carraro et al., 2015, Schäfer et al., 2007]. However, all these methods deal
with situations where the dynamics are known, whereas our algorithms only requires an approximate
model (model-based open-loop RL) or no model at all (model-free open-loop RL). Another set of
related methods is known as iterative learning control [Moore, 1993, Ma et al., 2022, 2023], which is
a control-theoretic framework that iteratively improves the execution of a task by optimizing over
feedforward trajectories. However, these methods are often formulated for trajectory tracking tasks,
while we consider a more general class of reinforcement learning problems. Chen and Braun [2019] explore an idea similar to the one in our Algorithm A.1; their model-based control algorithm combines a rollout in a real system with an inaccurate model to construct an iterative LQR feedback controller.

³In this comparison, our method is further disadvantaged by the piecewise constant "health" terms in the reward function of Ant-v4. Our method, exclusively relying on the gradient of the reward function, ignores these.
Recently, deep neural networks have been used to learn representations of complex dynamical systems
[Fragkiadaki et al., 2015] and Pontryagin’s principle was leveraged in the optimization of control tasks
based on such models [Jin et al., 2020, Böttcher et al., 2022]. However, these methods only consider
the setting of closed-loop control. The combination of an exact forward pass with an approximate
backward pass, which our methods are based on, has also been explored in different settings in the
deep learning literature, such as spiking [Lee et al., 2016] or physical [Wright et al., 2022] neural
networks, or networks that include nondifferentiable procedures, for example used for rendering
[Niemeyer et al., 2020] or combinatorial optimization [Vlastelica et al., 2020].
6 Discussion
While open-loop control is a well-established field with countless applications [Diehl et al., 2006,
van Zundert and Oomen, 2018, Sferrazza et al., 2020], the setting of incompletely-known dynamics
has received little attention. This paper makes an important first step towards understanding how
principles from open-loop optimal control can be combined with ideas from reinforcement learning
while preserving convergence guarantees. We propose three algorithms that address this open-loop
RL problem, from robust trajectory optimization with an approximate model to sample-efficient
learning under fully unknown dynamics. This work focuses on reinforcement learning in continuous
state and action spaces, a class of problems known to be challenging [Recht, 2019]. Although this
setting allows us to leverage continuous optimization techniques, we expect that most ideas will
transfer to the discrete setting and would be interested to see further research on this topic.
It is interesting to note that there are many apparent parallels between our open-loop RL algorithms
and their closed-loop counterparts. The distinction between model-based and model-free methods
is similar to that in closed-loop RL. Likewise, the on-trajectory and off-trajectory methods we present
show a tradeoff between sample efficiency and stability that is reminiscent of the tradeoffs between
on-policy and off-policy methods in closed-loop RL. The question of exploration, which is central to
reinforcement learning, also arises in our case. We do not address this complex problem thoroughly
here but instead rely on additive Gaussian noise to sample diverse trajectories.
Limitations of Open-Loop RL. Another important open question is how open-loop methods fit into
the reinforcement learning landscape. An inherent limitation of these methods is that an open-loop
controller can, by definition, not react to unexpected changes in the system’s state, be it due to random
disturbances or an adversary. If these disturbances are small, and the system is not sensitive to small
changes in the initial condition or the action sequence (roughly speaking, if the system is stable and
non-chaotic), then such a reaction is not necessary. In this situation, open-loop RL works even for
long time-horizons T , as we highlight in our MuJoCo experiments (see Appendix E). However, an
open-loop controller cannot balance an inverted pendulum in its unstable position4 , track a reference
trajectory in noisy conditions, or play Go, where reactions to the opponent’s moves are constantly
required. In these situations open-loop RL is not viable or only effective over a very small horizon T .
Open-loop control can be viewed as a special case of closed-loop control, and therefore it is clear
that closed-loop control is much more powerful. Historically, control theory has dealt with both
closed-loop and open-loop control [Åström and Murray, 2021, Skogestad and Postlethwaite, 2005,
Åström and Hägglund, 1995] and there is a broad consensus in the control community that both are
important [Betts, 2010, Verscheure et al., 2009, Horn et al., 2013]. However, the setting of open-loop
control with unknown dynamics, which is naturally formulated through the lens of reinforcement
learning, has so far received much less consideration from the community. Our algorithms provide
a first solution to the open-loop RL problem and are not intended to replace any of the existing
closed-loop RL algorithms. Instead, we think open-loop and closed-loop methods should complement
each other. In control engineering, it is common practice to combine feedback and feedforward
techniques, and such approaches are also conceivable in RL. For example, a high-level policy could
execute low-level feedforward trajectories using the options framework [Sutton et al., 1999]; a related
idea is explored by Hansen et al. [1996]. We believe that ultimately a combination of open-loop and
closed-loop techniques will be fruitful in reinforcement learning and think that this is an important
direction for future research.
⁴Except with a clever trick called vibrational control [Meerkov, 1980].
Acknowledgments
We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for their
support. C. Vernade is funded by the German Research Foundation (DFG) under both the project
468806714 of the Emmy Noether Programme and under Germany’s Excellence Strategy – EXC
number 2064/1 – Project number 390727645. M. Muehlebach is funded by the German Research
Foundation (DFG) under the project 456587626 of the Emmy Noether Programme.
References
K. J. Åström and T. Hägglund. PID Controllers: Theory, Design, and Tuning. International Society
for Measurement and Control, second edition, 1995. 9
K. J. Åström and R. M. Murray. Feedback Systems: An Introduction for Scientists and Engineers.
Princeton University Press, second edition, 2021. 2, 9
J. T. Betts. Practical Methods for Optimal Control and Estimation Using Nonlinear Programming.
SIAM, 2010. 8, 9
L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM
review, 60(2):223–311, 2018. URL https://doi.org/10.1137/16M1080173. 15
T. Carraro, M. Geiger, S. Körkel, and R. Rannacher, editors. Multiple Shooting and Time Domain
Decomposition Methods. Springer, 2015. 8
Y. Chen and D. J. Braun. Hardware-in-the-loop iterative optimal feedback control without model-
based future prediction. IEEE Transactions on Robotics, 35(6):1419–1434, 2019. URL https:
//doi.org/10.1109/TRO.2019.2929014. 8
M. Diehl, H. Bock, H. Diedam, and P.-B. Wieber. Fast direct multiple shooting algorithms for
optimal robot control. In M. Diehl and K. Mombaur, editors, Fast Motions in Biomechanics
and Robotics: Optimization and Feedback Control, pages 65–93. Springer, 2006. URL https:
//doi.org/10.1007/978-3-540-36119-0_4. 9
O. Eberhard, J. Hollenstein, C. Pinneri, and G. Martius. Pink noise is all you need: Colored
noise exploration in deep reinforcement learning. In Proceedings of the Eleventh International
Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=
hQ9V5QN27eS. 7
K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik. Recurrent network models for human dynamics.
In Proceedings of the IEEE International Conference on Computer Vision, pages 4346–4354, 2015.
URL https://doi.org/10.1109/ICCV.2015.488. 9
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference
on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1856–1865.
PMLR, 2018a. URL http://proceedings.mlr.press/v80/haarnoja18b.html. 8, 21
D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent
dynamics for planning from pixels. In Proceedings of the 36th International Conference on
Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2555–2565.
PMLR, 2019. URL https://proceedings.mlr.press/v97/hafner19a.html. 20
E. Hansen, A. Barto, and S. Zilberstein. Reinforcement learning for mixed open-loop and closed-
loop control. Advances in Neural Information Processing Systems, 9, 1996. URL https:
//proceedings.neurips.cc/paper/1996/hash/ab1a4d0dd4d48a2ba1077c449479130
6-Abstract.html. 9
G. Horn, S. Gros, and M. Diehl. Numerical trajectory optimization for airborne wind energy systems
described by high fidelity aircraft models. In U. Ahrens, M. Diehl, and R. Schmehl, editors,
Airborne Wind Energy, pages 205–218. Springer, 2013. URL https://doi.org/10.1007/97
8-3-642-39965-7_11. 9
N. Howe, S. Dufort-Labbé, N. Rajkumar, and P.-L. Bacon. Myriad: a real-world testbed to bridge
trajectory optimization and deep learning. In Advances in Neural Information Processing Systems,
volume 35, pages 29801–29815, 2022. URL https://proceedings.neurips.cc/paper_fil
es/paper/2022/hash/c0b91f9a3587bf35287f41dba5d20233-Abstract-Datasets_an
d_Benchmarks.html. 8
W. Jin, Z. Wang, Z. Yang, and S. Mou. Pontryagin differentiable programming: An end-to-end
learning and control framework. In Advances in Neural Information Processing Systems, volume 33,
pages 7979–7992, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/5a7
b238ba0f6502e5d6be14424b20ded-Abstract.html. 9
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the Third
International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/14
12.6980. 21
J. H. Lee, T. Delbruck, and M. Pfeiffer. Training deep spiking neural networks using backpropagation.
Frontiers in Neuroscience, 10, 2016. URL https://doi.org/10.3389/fnins.2016.00508. 9
L. Ljung. System Identification: Theory for the User. Prentice Hall, 1999. 5
H. Ma, D. Büchler, B. Schölkopf, and M. Muehlebach. A learning-based iterative control framework
for controlling a robot arm with pneumatic artificial muscles. In Proceedings of Robotics: Science
and Systems, 2022. URL https://www.roboticsproceedings.org/rss18/p029.html. 8
H. Ma, D. Büchler, B. Schölkopf, and M. Muehlebach. Reinforcement learning with model-based
feedforward inputs for robotic table tennis. Autonomous Robots, 47:1387–1403, 2023. URL
https://doi.org/10.1007/s10514-023-10140-6. 8
S. Meerkov. Principle of vibrational control: Theory and applications. IEEE Transactions on
Automatic Control, 25(4):755–762, 1980. URL https://doi.org/10.1109/TAC.1980.11024
26. 9
K. L. Moore. Iterative Learning Control for Deterministic Systems. Springer, 1993. URL https:
//doi.org/10.1007/978-1-4471-1912-8. 8
M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger. Differentiable volumetric rendering:
Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 3504–3515, 2020. URL https:
//openaccess.thecvf.com/content_CVPR_2020/html/Niemeyer_Differentiable_V
olumetric_Rendering_Learning_Implicit_3D_Representations_Without_3D_Supe
rvision_CVPR_2020_paper.html. 9
C. Pinneri, S. Sawant, S. Blaes, J. Achterhold, J. Stueckler, M. Rolinek, and G. Martius. Sample-
efficient cross-entropy method for real-time planning. In Proceedings of the 2020 Conference on
Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 1049–1065.
PMLR, 2021. URL https://proceedings.mlr.press/v155/pinneri21a.html. 20
L. Pontryagin, V. Boltayanskii, R. Gamkrelidze, and E. Mishchenko. Mathematical Theory of Optimal
Processes. Wiley, 1962. 2
A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann. Stable-Baselines3:
Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22
(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html. 21
B. Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of
Control, Robotics, and Autonomous Systems, 2:253–279, 2019. URL https://doi.org/10.114
6/annurev-control-053018-023825. 9
R. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodol-
ogy and Computing in Applied Probability, 1:127–190, 1999. URL https://doi.org/10.102
3/A:1010091220143. 7, 19
S. Schaal and C. G. Atkeson. Learning control in robotics. IEEE Robotics & Automation Magazine,
17(2):20–29, 2010. URL https://doi.org/10.1109/MRA.2010.936957. 8
A. Schäfer, P. Kühl, M. Diehl, J. Schlöder, and H. G. Bock. Fast reduced multiple shooting methods for
nonlinear model predictive control. Chemical Engineering and Processing: Process Intensification,
46(11):1200–1214, 2007. URL https://doi.org/10.1016/j.cep.2006.06.024. 8
C. Sferrazza, M. Muehlebach, and R. D’Andrea. Learning-based parametrized model predictive
control for trajectory tracking. Optimal Control Applications and Methods, 41(6):2225–2249,
2020. URL https://doi.org/10.1002/oca.2656. 9
S. Skogestad and I. Postlethwaite. Multivariable Feedback Control: Analysis and Design. Wiley,
second edition, 2005. 9
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT press, second edition,
2018. 1
R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal
abstraction in reinforcement learning. Artificial intelligence, 112(1):181–211, 1999. URL https:
//doi.org/10.1016/S0004-3702(99)00052-1. 9
E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Proceedings
of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033,
2012. URL https://doi.org/10.1109/IROS.2012.6386109. 8
M. Towers, J. K. Terry, A. Kwiatkowski, J. U. Balis, G. d. Cola, T. Deleu, M. Goulão, A. Kallinteris,
A. KG, M. Krimmel, R. Perez-Vicente, A. Pierré, S. Schulhoff, J. J. Tai, A. T. J. Shen, and O. G.
Younis. Gymnasium, 2023. URL https://github.com/Farama-Foundation/Gymnasium. 8
J. van Zundert and T. Oomen. On inversion-based approaches for feedforward and ILC. Mechatronics,
50:282–291, 2018. URL https://doi.org/10.1016/j.mechatronics.2017.09.010. 9
D. Verscheure, B. Demeulenaere, J. Swevers, J. De Schutter, and M. Diehl. Time-optimal path
tracking for robots: A convex optimization approach. IEEE Transactions on Automatic Control,
54(10):2318–2327, 2009. URL https://doi.org/10.1109/TAC.2009.2028959. 9
M. Vlastelica, A. Paulus, V. Musil, G. Martius, and M. Rolínek. Differentiation of blackbox combina-
torial solvers. In Proceedings of the Eighth International Conference on Learning Representations,
2020. URL https://openreview.net/forum?id=BkevoJSYPB. 9
L. G. Wright, T. Onodera, M. M. Stein, T. Wang, D. T. Schachter, Z. Hu, and P. L. McMahon. Deep
physical neural networks trained with backpropagation. Nature, 601(7894):549–555, 2022. URL
https://doi.org/10.1038/s41586-021-04223-6. 9
A Algorithms
In this section, we provide detailed descriptions of the three open-loop RL algorithms presented in
the main text. The model-based algorithm of Section 3.2 is listed in Algorithm A.1, the model-free
on-trajectory method of Section 3.3 is listed in Algorithm A.2, and the off-trajectory method of
Section 3.4 is listed in Algorithm A.3. The hyperparameters we use in these algorithms are discussed
in Appendix G.
// Pontryagin update
13 λ̃t ← ∇x r(x̄t , ūt ) + At λ̃t+1
14 gt ← ∇u r(x̄t , ūt ) + Bt λ̃t+1
15 ūt ← ūt + ηgt // Gradient ascent
16 end
17 end
Algorithm A.3: Model-free off-trajectory open-loop RL
Input: Forgetting factor α ∈ [0, 1], noise scale σ > 0, initial precision q0 > 0, optimization
steps N ∈ N, step size η > 0
1 Initialize ū0:T −1 (initial action sequence)
2 Initialize Ft ∈ RD×(D+K+1) , ∀t ∈ [T − 1]0
3 Qt ← q0 I ∈ R(D+K+1)×(D+K+1) , ∀t ∈ [T − 1]0
4 for k = 1, 2, . . . , N do
// Forward pass
5 u0:T −1 ∼ N (ū0:T −1 , σI)
6 x0:T ← rollout(u0:T −1 )
// Backward pass
7 λ̃T ← ∇rT (xT )
8 for t = T − 1, T − 2, . . . , 0 do
// Jacobian estimation
9        z_t ← [x_t^⊤  u_t^⊤  1]^⊤
B Derivation of Pontryagin's principle via Lagrange multipliers

We maximize J with respect to x0:T and u0:T−1 subject to the constraint that x_{t+1} = f(x_t, u_t) for all t ∈ [T−1]_0. The corresponding Lagrangian is

    L(x_{0:T}, u_{0:T−1}, λ_{1:T}) ≐ Σ_{t=0}^{T−1} {r(x_t, u_t) + λ_{t+1}^⊤ (f(x_t, u_t) − x_{t+1})} + r_T(x_T),

where the constraints are included through the multipliers λ1:T. The costate equations are then obtained by setting the partial derivatives of the Lagrangian with respect to x0:T to zero:

    ∇_{x_t} L = ∇_x r(x_t, u_t) + ∇_x f(x_t, u_t) λ_{t+1} − λ_t = 0   ⟹   λ_t = ∇_x r(x_t, u_t) + ∇_x f(x_t, u_t) λ_{t+1},
    ∇_{x_T} L = ∇r_T(x_T) − λ_T = 0   ⟹   λ_T = ∇r_T(x_T).

Setting the partial derivatives of the Lagrangian with respect to λ1:T to zero yields the dynamics equations, and the partial derivatives of the Lagrangian with respect to u0:T−1 are

    ∇_{u_t} L = ∇_u r(x_t, u_t) + ∇_u f(x_t, u_t) λ_{t+1},

which is the same expression for the gradient of the objective as in Eq. (1).
C Pontryagin's principle from backpropagation

In Section 3.2, we mention that computing the gradient of

    J(u_{0:T−1}) = Σ_{t=0}^{T−1} r(x_t, u_t) + r_T(x_T)
                 = r(x_0, u_0) + r{f(x_0, u_0), u_1} + r{f(f(x_0, u_0), u_1), u_2} + ⋯
                   + r{f(f(⋯ f(f(x_0, u_0), u_1) ⋯), u_{T−2}), u_{T−1}}
                   + r_T{f(f(⋯ f(f(x_0, u_0), u_1) ⋯), u_{T−1})},

where the nested compositions of f correspond to the states x_1, x_2, …, x_{T−1}, x_T, by means of the backpropagation algorithm, i.e. a repeated application of the chain rule, leads naturally to Pontryagin's principle. The chain rule states that for g : R^n → R^k, h : R^k → R^m and x ∈ R^n,

    ∇(h ∘ g)(x) = ∇g(x) ∇h{g(x)},

where ∇g : R^n → R^{n×k}, ∇h : R^k → R^{k×m} and ∇(h ∘ g) : R^n → R^{n×m}. From this, we can compute the gradient of the objective function with respect to the action u_t at time t ∈ [T−1]_0 as

    ∇_{u_t} J(u_{0:T−1}) = ∇_u r(x_t, u_t) + ∇_u f(x_t, u_t) {∇_x r(x_{t+1}, u_{t+1})
                           + ∇_x f(x_{t+1}, u_{t+1}) ∇_x r(x_{t+2}, u_{t+2})
                           + ⋯
                           + ∇_x f(x_{t+1}, u_{t+1}) ⋯ ∇_x f(x_{T−2}, u_{T−2}) ∇_x r(x_{T−1}, u_{T−1})
                           + ∇_x f(x_{t+1}, u_{t+1}) ⋯ ∇_x f(x_{T−1}, u_{T−1}) ∇r_T(x_T)}
                         = ∇_u r(x_t, u_t) + ∇_u f(x_t, u_t) λ_{t+1},

where we have introduced the shorthand λ_{t+1} for the term in braces. This is the same expression for the gradient as in Eq. (1), and it can easily be seen that this definition of λ_t satisfies the costate equations (2) and (3).
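This equivalence is easy to verify numerically on a small random system: the sketch below compares the costate-recursion gradient with a central finite-difference approximation of J. The toy dynamics and rewards are arbitrary choices made for this check and have nothing to do with the experiments in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, T = 3, 2, 5
Fx, Fu = 0.9 * rng.standard_normal((D, D)), rng.standard_normal((D, K))

f = lambda x, u: np.tanh(Fx @ x + Fu @ u)     # toy dynamics
r = lambda x, u: -0.5 * (x @ x + u @ u)       # running reward
rT = lambda x: -x @ x                         # terminal reward

def J(us, x0):
    x, total = x0, 0.0
    for u in us:
        total += r(x, u)
        x = f(x, u)
    return total + rT(x)

def costate_grad(us, x0):
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    lam = -2 * xs[-1]                                                # grad r_T
    grads = []
    for t in reversed(range(T)):
        pre = Fx @ xs[t] + Fu @ us[t]
        Jf = (1 - np.tanh(pre) ** 2)[:, None] * np.hstack([Fx, Fu])  # df/d(x,u), (D, D+K)
        grads.append(-us[t] + Jf[:, D:].T @ lam)                     # Eq. (1)
        lam = -xs[t] + Jf[:, :D].T @ lam                             # Eq. (2)
    return list(reversed(grads))

us, x0 = [rng.standard_normal(K) for _ in range(T)], rng.standard_normal(D)
g = costate_grad(us, x0)
eps, t, k = 1e-6, 2, 1
up, um = [u.copy() for u in us], [u.copy() for u in us]
up[t][k] += eps; um[t][k] -= eps
print(g[t][k], (J(up, x0) - J(um, x0)) / (2 * eps))  # the two values should agree closely
```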
D Proof of Theorem 3.4

Lemma D.2. Let A, B ∈ R^{m×n} for some m, n ∈ N and x, y ∈ R^n such that σ̄(A)‖x‖ ≤ σ_min(B)‖y‖. Then, ‖Ax‖ ≤ ‖By‖.

Proof. This is a simple corollary of the Courant–Fischer (min-max) theorem. The min-max theorem states that, for a symmetric matrix C ∈ R^{n×n}, the minimum and maximum eigenvalues λ_min(C) and λ̄(C) are characterized in the following way:

    λ_min(C) = min_{‖z‖=1} z^⊤ C z,   λ̄(C) = max_{‖z‖=1} z^⊤ C z.

This can be extended to a characterization of the singular values σ̄(A) and σ_min(B) by relating them to the eigenvalues of A^⊤A and B^⊤B, respectively:

    σ̄(A) = √(λ̄(A^⊤A)) = max_{‖z‖=1} √(z^⊤ A^⊤ A z) = max_{‖z‖=1} ‖Az‖ ≥ ‖Ax‖/‖x‖,
    σ_min(B) = √(λ_min(B^⊤B)) = min_{‖z‖=1} √(z^⊤ B^⊤ B z) = min_{‖z‖=1} ‖Bz‖ ≤ ‖By‖/‖y‖.

Combining these two bounds with the assumption σ̄(A)‖x‖ ≤ σ_min(B)‖y‖ yields ‖Ax‖ ≤ σ̄(A)‖x‖ ≤ σ_min(B)‖y‖ ≤ ‖By‖.
Theorem D.3. Suppose Assumptions 3.1 and 3.2 hold with γ and ζ and define µ ≐ 1 − γ − ζ − γζ and ν ≐ 1 + γ + ζ + γζ. Then,

    g_t^⊤ ∇J_t ≥ µ‖∇J_t‖²   and   ‖g_t‖ ≤ ν‖∇J_t‖,

for all t ∈ [T−1]_0.

Proof. Let t ∈ [T−1]_0 be fixed. Decomposing the left-hand side of the first inequality, we get

    g_t^⊤ ∇J_t = λ̃_{t+1}^⊤ B_t^⊤ ∇_u f_t λ_{t+1}
               = (λ_{t+1} + δ_{t+1})^⊤ (∇_u f_t + ε′_t)^⊤ ∇_u f_t λ_{t+1}
               = ‖∇J_t‖² + λ_{t+1}^⊤ {ε′_t}^⊤ ∇_u f_t λ_{t+1} + δ_{t+1}^⊤ {∇_u f_t}^⊤ ∇_u f_t λ_{t+1} + δ_{t+1}^⊤ {ε′_t}^⊤ ∇_u f_t λ_{t+1}
               ≥ ‖∇J_t‖² − |a| − |b| − |c|,

where a, b, and c denote the second, third, and fourth term in the third line. We will now show that

    |a| ≤ ζ‖∇J_t‖²,   |b| ≤ γ‖∇J_t‖²,   and   |c| ≤ γζ‖∇J_t‖²,

which, when taken together, will give us

    g_t^⊤ ∇J_t ≥ (1 − γ − ζ − γζ)‖∇J_t‖² = µ‖∇J_t‖².

We first derive the bound on |a|:

    σ̄(ε′_t) ≤ ζ σ_min(∇_u f_t)                                   (Assumption 3.2b)
    ⟹ ‖ε′_t λ_{t+1}‖ ≤ ζ‖∇_u f_t λ_{t+1}‖                         (Lemma D.2)   (8)
    ⟹ |a| = |λ_{t+1}^⊤ {ε′_t}^⊤ ∇_u f_t λ_{t+1}| ≤ ζ‖∇J_t‖².      (Lemma D.1)
The expression for b involves δ_{t+1}, which is the error of the approximate costate λ̃_{t+1}. This error comes from the cumulative error build-up due to ε_{t+1:T−1}, the errors of the approximate Jacobians used in the backward pass. To bound |b| we therefore first need to bound this error build-up. To this end, we now show that for all s ∈ [T−t],

    ‖δ_{t+s}‖ ≤ (γ/3^{s−1}) κ^{−1}(∇_u f_t) {∏_{i=1}^{s−1} κ^{−1}(∇_x f_{t+i})} ‖λ_{t+s}‖,   (9)

where we write the inverse condition number of a matrix A as κ^{−1}(A) ≐ σ_min(A)/σ̄(A). To prove this bound, we perform a backward induction on s. First, consider s = T−t. The right-hand side of Eq. (9) is clearly nonnegative. The left-hand side is

    ‖δ_T‖ = ‖λ̃_T − λ_T‖ = 0,

as λ̃_T = λ_T. Thus, the inequality holds for s = T−t. We now complete the induction by showing that it holds for any s ∈ [T−t−1], assuming that it holds for s+1. We start by decomposing δ_{t+s}:

    δ_{t+s} = λ̃_{t+s} − λ_{t+s}
            = A_{t+s} λ̃_{t+s+1} − ∇_x f_{t+s} λ_{t+s+1}
            = (∇_x f_{t+s} + ε_{t+s})(λ_{t+s+1} + δ_{t+s+1}) − ∇_x f_{t+s} λ_{t+s+1}
            = ε_{t+s} λ_{t+s+1} + ∇_x f_{t+s} δ_{t+s+1} + ε_{t+s} δ_{t+s+1}.

Now, we can bound ‖δ_{t+s}‖ by bounding these individual contributions:

    ‖δ_{t+s}‖ ≤ ‖ε_{t+s} λ_{t+s+1}‖ + ‖∇_x f_{t+s} δ_{t+s+1}‖ + ‖ε_{t+s} δ_{t+s+1}‖,

and we denote the three terms on the right-hand side by ‖a′‖, ‖b′‖, and ‖c′‖, respectively.
We start with ‖a′‖:

    σ̄(ε_{t+s}) ≤ (γ/3^s) κ^{−1}(∇_u f_t) {∏_{i=1}^{s−1} κ^{−1}(∇_x f_{t+i})} σ_min(∇_x f_{t+s})                       (Assumption 3.2a)
    ⟹ ‖a′‖ = ‖ε_{t+s} λ_{t+s+1}‖ ≤ (γ/3^s) κ^{−1}(∇_u f_t) {∏_{i=1}^{s−1} κ^{−1}(∇_x f_{t+i})} ‖∇_x f_{t+s} λ_{t+s+1}‖.   (Lemma D.2)

Now, ‖b′‖:

    ‖δ_{t+s+1}‖ ≤ (γ/3^s) κ^{−1}(∇_u f_t) {∏_{i=1}^{s} κ^{−1}(∇_x f_{t+i})} ‖λ_{t+s+1}‖                                    (Induction hypothesis)
    ⟺ σ̄(∇_x f_{t+s}) ‖δ_{t+s+1}‖ ≤ (γ/3^s) κ^{−1}(∇_u f_t) {∏_{i=1}^{s−1} κ^{−1}(∇_x f_{t+i})} σ_min(∇_x f_{t+s}) ‖λ_{t+s+1}‖   (Definition of κ^{−1})
    ⟹ ‖b′‖ = ‖∇_x f_{t+s} δ_{t+s+1}‖ ≤ (γ/3^s) κ^{−1}(∇_u f_t) {∏_{i=1}^{s−1} κ^{−1}(∇_x f_{t+i})} ‖∇_x f_{t+s} λ_{t+s+1}‖.   (Lemma D.2)

And finally, ‖c′‖:

    σ̄(ε_{t+s}) ≤ σ_min(∇_x f_{t+s}) ≤ σ̄(∇_x f_{t+s})   (10)
    ⟹ σ̄(ε_{t+s}) ‖δ_{t+s+1}‖ ≤ (γ/3^s) κ^{−1}(∇_u f_t) {∏_{i=1}^{s} κ^{−1}(∇_x f_{t+i})} σ̄(∇_x f_{t+s}) ‖λ_{t+s+1}‖        (Induction hypothesis)
    ⟺ σ̄(ε_{t+s}) ‖δ_{t+s+1}‖ ≤ (γ/3^s) κ^{−1}(∇_u f_t) {∏_{i=1}^{s−1} κ^{−1}(∇_x f_{t+i})} σ_min(∇_x f_{t+s}) ‖λ_{t+s+1}‖   (Definition of κ^{−1})
    ⟹ ‖c′‖ = ‖ε_{t+s} δ_{t+s+1}‖ ≤ (γ/3^s) κ^{−1}(∇_u f_t) {∏_{i=1}^{s−1} κ^{−1}(∇_x f_{t+i})} ‖∇_x f_{t+s} λ_{t+s+1}‖.       (Lemma D.2)

Here, Eq. (10) follows from Assumption 3.2a by noting that the constant before σ_min(∇_x f_{t+s}) on the right-hand side is not greater than 1. We can now put all three bounds together to give us Eq. (9):

    ‖δ_{t+s}‖ ≤ ‖a′‖ + ‖b′‖ + ‖c′‖ ≤ 3 · (γ/3^s) κ^{−1}(∇_u f_t) {∏_{i=1}^{s−1} κ^{−1}(∇_x f_{t+i})} ‖λ_{t+s}‖.
Equipped with a bound on δ_{t+s}, we are ready to bound |b| and |c|. Starting with |b|, we have:

    ‖δ_{t+1}‖ ≤ γ κ^{−1}(∇_u f_t) ‖λ_{t+1}‖                     (Eq. (9) for s = 1)
    ⟺ σ̄(∇_u f_t) ‖δ_{t+1}‖ ≤ γ σ_min(∇_u f_t) ‖λ_{t+1}‖        (Definition of κ^{−1})
    ⟹ ‖∇_u f_t δ_{t+1}‖ ≤ γ ‖∇_u f_t λ_{t+1}‖                   (Lemma D.2)   (11)
    ⟹ |b| = |δ_{t+1}^⊤ {∇_u f_t}^⊤ ∇_u f_t λ_{t+1}| ≤ γ‖∇J_t‖².   (Lemma D.1)
Proof of Theorem 3.4. Let N ∈ N and t ∈ [T−1]_0 be fixed. In Algorithm A.1, the iterates are computed, for all k ∈ [N−1]_0, as

    u_t^{(k+1)} = u_t^{(k)} + η g_t^{(k)},

where g_t^{(k)} is the approximate gradient at iteration k. We denote the true gradient at iteration k by ∇J_t^{(k)}. From the L-smoothness of the objective function (Assumption 3.3), it follows that

    J(u^{(k+1)}_{0:T−1}) ≥ J(u^{(k)}_{0:T−1}) + ∇_{u_t} J(u^{(k)}_{0:T−1})^⊤ (u_t^{(k+1)} − u_t^{(k)}) − (L/2) ‖u_t^{(k+1)} − u_t^{(k)}‖²
                        = J(u^{(k)}_{0:T−1}) + {∇J_t^{(k)}}^⊤ (η g_t^{(k)}) − (L/2) ‖η g_t^{(k)}‖²
                        ≥ J(u^{(k)}_{0:T−1}) + ηµ ‖∇J_t^{(k)}‖² − (η²Lν²/2) ‖∇J_t^{(k)}‖²      (Theorem D.3)
                        = J(u^{(k)}_{0:T−1}) + η {µ − (1/2)ηLν²} ‖∇J_t^{(k)}‖²
                        = J(u^{(k)}_{0:T−1}) + ηα ‖∇J_t^{(k)}‖².

Theorem 3.4 demands that η > 0 is set small enough such that α > 0, which is possible because 0 < µ < ν and L > 0. Thus, we get

    ηα ‖∇J_t^{(k)}‖² ≤ J(u^{(k+1)}_{0:T−1}) − J(u^{(k)}_{0:T−1})
    ⟹ (1/N) Σ_{k=0}^{N−1} ‖∇J_t^{(k)}‖² ≤ (1/(αηN)) Σ_{k=0}^{N−1} {J(u^{(k+1)}_{0:T−1}) − J(u^{(k)}_{0:T−1})}
                                         = (1/(αηN)) {J(u^{(N)}_{0:T−1}) − J(u^{(0)}_{0:T−1})}
                                         ≤ {J^⋆ − J(u^{(0)}_{0:T−1})} / (αηN),

where J^⋆ ≐ sup_{u∈U^T} J(u) is the optimal value of the initial state.
E Further experiments
In this section, we repeat the MuJoCo experiments of Section 4.2 with longer time-horizons T . The
results are shown in Fig. 8. Our algorithms are sensitive to the horizon T , as the costates, which are
propagated backward in time, accumulate error at each propagation step due to the approximation
errors of the Jacobians. For this reason, Theorem 3.4 demands (through Assumption 3.2) more
accurate Jacobians at later time steps. Thus, for large T , our convergence result requires more
accurate Jacobian estimates. However, in Fig. 8, we see that Algorithm A.3 is able to cope with
longer horizons for the two MuJoCo environments. The reason for this discrepancy between our
theoretical and empirical result is that Theorem 3.4 does not consider the stability of the system
under consideration. The two MuJoCo systems, Ant-v4 and HalfCheetah-v4, are stable along
the trajectories encountered during training, which prevents an exponential build-up of error in the
costate propagation.
Figure 8: Learning curves of our off-trajectory algorithm for the two MuJoCo tasks (one curve per horizon T). All experiments were repeated with 20 random seeds, and we show 95%-bootstrap confidence intervals for the average return.
F Baselines
We compare our algorithms against two baselines: the finite-difference approach discussed at the end
of Section 3.3 and the gradient-free cross-entropy method [CEM; Rubinstein, 1999]. These methods
are listed in Algorithms F.1 and F.2.
Similarly to the finite-difference method and our on-trajectory algorithm, in CEM we also perform M ∈ N rollouts of perturbed action sequences {u_i ∼ N(ū, σI)}_{i=1}^{M}. Here, ū is the current action sequence and σ > 0 is a noise scale parameter. We then construct the elite set S of the L < M perturbed action sequences with the highest returns, where L ∈ N is a hyperparameter. Finally, the current action sequence ū is updated to be the mean of the elite sequences, such that ū ← (1/L) Σ_{u∈S} u.
This method can be considerably more efficient than the simple finite-difference method. Here, we are
not trying to estimate the gradient anymore, so we can potentially improve the action sequence with far
fewer rollouts than would be needed in the finite-difference approach. However, this method suffers
from the same fundamental deficiency as the finite-difference method: it ignores the Markovian
structure of the RL problem and treats the objective function J as a black box. CEM is commonly
used in model-based closed-loop reinforcement learning for planning. In this setting, the rollouts are
hallucinated using the approximate model. Instead of executing the complete open-loop trajectory,
the model-predictive control framework is typically employed. The planning procedure is repeated
after each step in the real environment, with the executed action being the first item in the planned action sequence. Thus, this setting is very different from our open-loop RL objective. For this reason,
we slightly modify the CEM algorithm to better fit our requirements. In model-based RL, typically
both mean ū and standard deviation σ are adapted in CEM [Hafner et al., 2019, Pinneri et al., 2021].
In our experiments, this approach led to very fast convergence (σ → 0) to suboptimal trajectories.
We thus only fit the mean and keep the noise scale fixed, which we empirically observed to give much
better results.
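For reference, a compact sketch of this mean-only CEM variant is given below (our own illustration; the environment interface env_return and the default values of M, L, σ, and the iteration count are assumptions, and σ is used directly as the standard deviation of the perturbations).

```python
import numpy as np

def cem_open_loop(u_bar, env_return, M=20, L=5, sigma=0.001, n_iters=1000, seed=0):
    """Cross-entropy method with a fixed noise scale: only the mean is fitted."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        # M perturbed action sequences u_i ~ N(u_bar, sigma^2 I)
        candidates = u_bar + sigma * rng.standard_normal((M,) + u_bar.shape)
        returns = np.array([env_return(u) for u in candidates])
        elite = candidates[np.argsort(returns)[-L:]]   # the L highest-return sequences
        u_bar = elite.mean(axis=0)                     # update the mean only
    return u_bar
```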
G Hyperparameters
Unless stated otherwise, we used the hyperparameters listed in Table 1 in the inverted pendulum experiments of Section 4.1, and those listed in Table 2 in the MuJoCo experiments of Section 4.2 and Appendix E.
Table 1: Inverted pendulum experiments hyperparameters

    Parameter                               Value
    Number of optimization steps N          50000
    Step size η                             0.001
    Noise scale σ                           0.001
    Number of perturbed rollouts M          10
    Forgetting factor α                     0.8
    Initial precision q0                    0.001
    Cross-entropy method: M⁵                20
    Finite-difference method: M⁵            20
    Finite-difference method: σ⁵            0.0001
    MLP model: hidden layers                [16, 16]
    MLP model: training rollouts            1000
    MLP model training: epochs              10
    MLP model training: batch size          100
    MLP model training: step size           0.002
    MLP model training: weight decay        0.001

Table 2: MuJoCo experiments hyperparameters

    Parameter                               Value
    Number of optimization steps N          20000
    Step size η                             0.0001
    Noise scale σ                           0.03
    Initial precision q0                    0.0001
    Forgetting factor α
        HalfCheetah-v4, T = 100             0.9
        HalfCheetah-v4, T = 200             0.8
        HalfCheetah-v4, T = 300             0.8
        Ant-v4, T = 100                     0.95
        Ant-v4, T = 200                     0.9
        Ant-v4, T = 300                     0.85
Figure 9: Analysis of the influence of the forgetting factor α on the performance of the off-trajectory method (Algorithm A.3) in the MuJoCo environments (T = 200). All experiments were repeated with 20 random seeds, and we show 95%-bootstrap confidence intervals for the average return.
In each experiment, all actions in the initial action trajectory u^{(0)}_{0:T−1} are sampled
from a zero-mean Gaussian distribution with standard deviation 0.01. We use the Adam optimizer
[Kingma and Ba, 2014] both for training the MLP model and for performing the gradient ascent steps
in Algorithms A.1 to A.3 and F.1. We did not optimize the hyperparameters of soft actor-critic (SAC),
but kept the default values suggested by Haarnoja et al. [2018a], as these are already optimized for
the MuJoCo environments. The entropy coefficient of the SAC algorithm is tuned automatically
according to the procedure described by Haarnoja et al. [2018b]. In our experiments, we make use of
the Stable-Baselines3 [Raffin et al., 2021] implementation of SAC.
For our off-trajectory method, we found it worthwhile to tune the forgetting factor α to the specific task at hand. A large α means that data is retained for longer, which both makes the algorithm more sample-efficient (i.e., faster convergence) and the Jacobian estimates more biased (i.e., convergence to a worse solution). In Fig. 9, we show this trade-off in the learning curves for the MuJoCo tasks (with the horizon T = 200). We found that the performance is much less sensitive to the choice of noise scale σ and initial precision q0 than to the choice of the forgetting factor α.
⁵This value was chosen on the basis of the experiment presented in Fig. 5.