Model Predictive Path Integral Control Using Covariance Variable Important Sampling
Model Predictive Path Integral Control Using Covariance Variable Important Sampling
Importance Sampling
Grady Williams1 , Andrew Aldrich1 , and Evangelos A. Theodorou1
Abstract— In this paper we develop a Model Predictive Path drastically simplify the system under consideration by using
Integral (MPPI) control algorithm based on a generalized a hierarchical scheme [4], and use path integral control to
importance sampling scheme and perform parallel optimization generate trajectories for a point mass which is then followed
via sampling using a Graphics Processing Unit (GPU). The
proposed generalized importance sampling scheme allows for by a low level controller. Even though this approach may be
changes in the drift and diffusion terms of stochastic diffusion successfull for certain applications, it is limited in the kinds
arXiv:1509.01149v3 [cs.SY] 28 Oct 2015
processes and plays a significant role in the performance of the of behaviors that it can generate since it does not consider the
model predictive control algorithm. We compare the proposed full non-linearity of dynamics. A more efficient approach is
algorithm in simulation with a model predictive control version to take advantage of the parallel nature of sampling and use
of differential dynamic programming.
a graphics processing unit (GPU) [19] to sample thousands
I. INTRODUCTION of trajectories from the nonlinear dynamics.
The path integral optimal control framework [7], [15], A major issue in the path integral control framework is
[16] provides a mathematically sound methodology for de- that the expectation is taken with respect to the uncontrolled
veloping optimal control algorithms based on stochastic dynamics of the system. This is problematic since the proba-
sampling of trajectories. The key idea in this framework is bility of sampling a low cost trajectory using the uncontrolled
that the value function for the optimal control problem is dynamics is typically very low. This problem becomes more
transformed using the Feynman-Kac lemma [2], [8] into an drastic when the underlying dynamics are nonlinear and
expectation over all possible trajectories, which is known sampled trajectories can become trapped in undesirable parts
as a path integral. This transformation allows stochastic of the state space. It has previously been demonstrated
optimal control problems to be solved with a Monte-Carlo how to change the mean of the sampling distribution using
approximation using forward sampling of stochastic diffusion Girsanov’s theorem [15], [16], this can then be used to
processes. develop an iterative algorithm. However, the variance of
There have been a variety of algorithms developed in the the sampling distribution has always remained unchanged.
path integral control setting. The most straight-forward appli- Although in some simple simulated scenarios changing the
cation of path integral control is when the iterative feedback variance is not necessary, in many cases the natural variance
control law suggested in [15] is implemented in its open of a system will be too low to produce useful deviations from
loop formulation. This requires that sampling takes place the current trajectory. Previous methods have either dealt
only from the initial state of the optimal control problem. with this problem by artificially adding noise into the system
A more effective approach is to use the path integral control and then optimizing the noisy system [10], [14]. Or they
framework to find the parameters of a feedback control have simply ignored the problem entirely and sampled from
policy. This can be done by sampling in policy parameter whatever distribution worked best [12], [19]. Although these
space, these methods are known as Policy Improvement approaches can be successful, both are problematic in that
with Path Integrals [14]. Another approach to finding the the optimization either takes place with respect to the wrong
parameters of a policy is to attempt to directly sample from system or the resulting algorithm ignores the theoretical basis
the optimal distribution defined by the value function [3]. of path integral control.
Other methods along similar threads of research include [10], The approach we take here generalizes these approaches in
[17]. that it enables for both the mean and variance of the sampling
Another way that the path integral control framework distribution to be changed by the control designer, without
can be applied is in a model predictive control setting. violating the underlying assumptions made in the path inte-
In this setting an open-loop control sequence is constantly gral derivation. This enables the algorithm to converge fast
optimized in the background while the machine is simulta- enough that it can be applied in a model predictive control
neously executing the “best guess” that the controller has. setting. After deriving the model predictive path integral
An issue with this approach is that many trajectories must control (MPPI) algorithm, we compare it with an existing
be sampled in real-time, which is difficult when the system model predictive control formulation based on differential
has complex dynamics. One way around this problem is to dynamic programming (DDP) [6], [13], [18]. DDP is one of
the most powerful techniques for trajectory optimization, it
This research has been supported by NSF Grant No. NRI-1426945. relies on a first or second order approximation of the dynam-
The 1 authors are with the Autonomous Control and Decision Systems
Laboratory at the Georgia Institute of Technology, Atlanta, GA, USA. Email: ics and a quadratic approximation of the cost along a nominal
[email protected] trajectory, it then computes a second order approximation of
the value function which it uses to generate the control. this transformation we apply an exponential transformation
of the value function
II. PATH INTEGRAL CONTROL
V (x, t) = −λ log(Ψ(x, t)) (6)
In this section we review the path integral optimal control
framework [7]. Let xt ∈ RN denote the state of a dynamical Here λ is a positive constant. We also have to assume a
system at time t, u(xt , t) ∈ Rm denotes a control input for relationship between the cost and noise in the system (as
the system, τ : [t0 , T ] → Rn represents a trajectory of the well as λ) through the equation:
system, and dw ∈ Rp is a brownian disturbance. In the path
integral control framework we suppose that the dynamics Bc (xt , t)Bc (x, t)T = λGc (xt , t)R(xt , t)−1 Gc (xt , t)T (7)
take the form:
The main restriction implied by this assumption is that
dx = f (xt , t)dt + G(xt , t)u(xt , t)dt + B(xt , t)dw (1) B(xt , t) has the same rank as R(xt , t). This limits the
noise in the system to only effect state variables that are
In other words, the dynamics are affine in control and subject directly actuated (i.e. the noise is control dependent). There
to an affine brownian disturbance. We also assume that G are a wide variety of systems which naturally fall into this
and B are partitioned as: description, so the assumption is not too restrictive. However,
there are interesting systems for which this description does
0 0
G(xt , t) = ; B(xt , t) = (2) not hold (i.e. if there are known strong disturbances on
Gc (xt , t) Bc (xt , t)
indirectly actuated state variables or if the dynamics are only
Expectations taken with respect to (1) are denoted as EQ [·], partially known).
we will also be interested in taking expectations with respect By making this assumption and performing the exponen-
to the uncontrolled dynamics of the system (i.e (1) with tial transformation of the value function the stochastic HJB
u ≡ 0). These will be denoted EP [·]. We suppose that equation is transformed into the linear partial differential
the cost function for the optimal control problem has a equation:
quadratic control cost and an arbitrary state-dependent cost. Ψ(xt , t) 1
Let φ(xT ) denote a final the terminal cost, q(xt , t) a state ∂t Ψ = q(xt , t) − f (xt , t)T Ψx − tr(Σ(xt , t)Ψxx )
λ 2
dependent running cost, and define R(xt , t) as a positive (8)
definite matrix. The value function V (xt , t) for this optimal Here we’ve denoted the covariance matrix
control problem is then defined as:
" Z T # Bc (xt , t)Bc (xt , t)T
1 T
min EQ φ(xT ) + q(xt , t) + u R(xt , t)u dt as Σ(xt , t). This equation is known as the backward
u t 2
Chapman-Kolmogorov PDE. We can then apply the
(3)
Feynman-Kac lemma, which relates backward PDEs of this
The Stochastic Hamilton-Jacobi-Bellman equation [1], [11]
type to path integrals through the equation:
for the type of system in (1) and for the cost function in (3)
is given as:
" ! #
1 T
Z
Ψ(xt0 , t0 ) = EP exp − q(x, t) dt Ψ(xT , T )
λ t0
−∂t V = q(xt , t) + f (xt , t)T Vx
(9)
1
− VxT G(xt , t)R(xt , t)−1 G(xt , t)T Vx (4) Note that the expectation (which is the path integral) is
2 taken with respect to P which is the uncontrolled dynamics
1
+ tr(B(xt , t)B(xt , t)T Vxx ) of the system. By recognizing that the term Ψ(xT ) is the
2 1
transformed terminal cost: e− λ φ(xT ) we can re-write this
where the optimal control is expressed as: expression as:
u∗ = −R(xt , t)−1 G(xt , t)T Vx (5) 1
Ψ(xt0 , t0 ) ≈ EP exp − S(τ ) (10)
λ
The solution to this backwards PDE yields the value function RT
for the stochastic optimal control problem, which is then where S(τ ) = φ(xT ) + t0 q(xt , t)dt is the cost-to-go of
used to generate the optimal control. Unfortunately, classical the state dependent cost of a trajectory. Lastly we have to
methods for solving partial differential equations of this compute the gradient of Ψ with respect to the initial state xt0 .
nature suffer from the curse of dimensionality and are This can be done analytically and is a straightforward, albeit
intractable for systems with more than a few state variables. lengthy, computation so we omit it and refer the interested
The approach we take in the path integral control frame- reader to [14]. After taking the gradient we obtain:
work is to transform the backwards PDE into a path integral, 1
which is an expectation over all possible trajectories of the ∗ −1 EP exp − λ S(τ ) B(xt0 , t0 )dw
u dt = G(xt0 , t0 )
EP exp − λ1 S(τ )
system. This expectation can then be approximated by for-
ward sampling of the stochastic dynamics. In order to effect (11)
Where the matrix G(xt , t) is defined as: In order to efficiently approximate the controls, we require
−1 the ability to sample from a distribution which is likely to
R(xt , t)−1 Gc (xt , t)T Gc (xt , t)R(xt , t)−1 Gc (xt , t)T produce low cost trajectories. In previous applications of
(12) path integral control [15], [16] the mean of the sampling
Note that if Gc (xt , t) is square (which is the case if the distribution has been changed which allows for an iterative
system is not over actuated) this reduces to Gc (xt , t)−1 . update law. However, the variance of the sampling distri-
Equation (11) is the path integral form of the optimal bution has always remained unchanged. In well engineered
control. The fundamental difference between this form of systems, where the natural variance of the system is very
the optimal control and classical optimal control theory is low, changing the mean is insufficient since the state space
that instead of relying on a backwards in time process, is never aggressively explored. In the following derivation
this formula requires the evaluation of an expectation which we provide a method for changing both the initial control
can be approximated using forward sampling of stochastic input and the variance of the sampling distribution.
differential equations.
A. Likelihood Ratio
A. Discrete Approximation We suppose that we have a sampling distribution with non-
Equation (11) provides an expression for the optimal zero control input and a changed variance, which we denote
control in terms of a path integral. However, these equations as q, and we would like to approximate (16) using samples
are for continuous time and in order to sample trajectories from q as opposed to p. Now if we write the expectation
on a computer we need discrete time approximations. term (16) in integral form we get:
dxt
We first discretize the dynamics of the system. We have exp − λ1 S(τ )
R
∆t
0
− f (xt , t) p(τ )dτ
that xt+1 = xt + dxt where dxt is defined as: (17)
exp − λ1 S(τ ) p(τ )dτ
R
√
dxt = (f (xt , t) + G(xt , t)u(xt , t)) ∆t + B(xt , t) ∆t Where we are abusing notation and using τ to represent the
(13) discrete trajectory (xt0 , xt1 , . . . xtN ). Next we multiply both
The term is a vector of standard normal Gaussian random q(τ )
integrals by 1 = q(τ ) to get:
variables. For the uncontrolled dynamics of the system we
dxt
q(τ )
have: exp − λ1 S(τ )
R
√ ∆t
0
− f (x t , t) q(τ ) p(τ )dτ
dxt = f (xt , t)∆t + B(xt , t) ∆t (14) q(τ ) (18)
exp − λ1 S(τ ) q(τ
R
) p(τ )dτ
Another way we can express B(xt , t)dw which will be And we can then write this as an expectation with respect
useful is as: to q:
h dxt0 i
B(xt , t)dw ≈ dxt − f (xt , t)∆t (15) Eq exp − λ1 S(τ ) p(τ )
∆t − f (xt , t) q(τ )
PN (19)
Lastly we say: S(τ ) ≈ φ(xT )+ i=0 q(xt , t)∆t where N =
h )i
Eq exp − λ1 S(τ ) p(τ
q(τ )
(T − t)/∆t Then by defining p as the probability induced by
the discrete time uncontrolled dynamics we can approximate We now have the expectation in terms of a sampling distri-
(11) as: bution q for which we can choose:
h dxt0 i i) The initial control sequence from which to sample
Ep exp − λ1 S(τ ) ∆t − f (x t0 , t0 ) around.
u∗ = G(xt0 , t0 )−1 1
ii) The variance of the exploration noise which determines
Ep exp − λ S(τ )
(16) how aggressively the state space is explored.
p(τ )
Note that we have moved the ∆t term multiplying u over However, we now have an extra term to compute q(τ ) . This is
to the right-hand side of the equation and inserted it into the known as the likelihood ratio (or Radon-Nikodym derivative)
expectation. between the distributions p and q. In order to derive an
expression for this term we first have to derive equations
III. GENERALIZED IMPORTANCE SAMPLING for the probability density functions of p(τ ) and q(τ ) indi-
Equation (16) provides an implementable method for vidually. We can do this by deriving the probability density
approximating the optimal control via random sampling of function for the general discrete time diffusion processes
trajectories. By drawing many samples from p the expecta- P (τ ), corresponding to the dynamics:
√
tion can be evaluated using a Monte-Carlo approximation. In dxt = (f (xt , t) + G(xt , t)u(xt , t)) ∆t + B(xt , t) ∆t
practice, this approach is unlikely to succeed. The problem is (20)
that p is typically an inefficient distribution to sample from The goal is to find P (τ ) = P (xt0 , xt1 , . . . xtN ). By condi-
(i.e the cost-to-go will be high for most trajectores sampled tioning and using the Markov property of the state space this
from p). Intuitively sampling from the uncontrolled dynamics probability becomes:
corresponds to turning a machine on and waiting for the N
natural noise in the system dynamics to produce interesting
Y
P (xt0 , xt1 , . . . xtN ) = P (xti |xti−1 ) (21)
behavior. i=1
Now recall that a portion of the state space has deterministic Then under the condition that each Ati is invertible and each
dynamics and that we’ve partitioned the diffusion matrix as: Γi is invertible, the likelihood ratio for the two distributions
is:
0 N
! N
!
B(xt , t) = (22) Y ∆t X
Bc (xt , t) |Ati | exp − Qi (31)
i=1
2 i=1
We can partition the state variables x into the deterministic
(a) (c)
and non-deterministic variables xt and xt respectively.
(a) (a) (a) Proof: In discrete time the probability of a trajectory is
The next step is to conditionon xt+1 = F (xt , t) = xt +
(a) (a) formulated according to the (26). We thus have p(τ ) equal
f (xt , t) + G (xt , t)ut dt since if this does not hold
to:
P (τ ) is zero. We thus need to compute: PN
exp − ∆t 2 i=1 z i Σ i z i
N p(τ ) = (32)
Zp (τ )
(a)
Y
P xti |xti−1 , xti = F (a) (xti−1 , ti−1 (23)
i=1 and q(τ ) equal to:
And from the dynamics equations we know that each of these PN −1
T T
one-step transitions is Gaussian with mean: f (c) (xt , t) + exp − ∆t 2 i=1 (zi −µi ) ( At Σ A
i ti
i
) (zi −µi )
G(c) (xti , ti )u(xti , ti ) and variance: Zq (τ )
(33)
Σi = Bc (xti , ti )Bc (xti , ti )T ∆t. (24) p(τ )
Then dividing these two equations we have q(τ )
as:
(c)
dxt (c)
We then define zi = i
− f (xti , ti ), and µi = N
! N
!
∆t (2π)n/2 |AT
t Σi A t i |
1/2
) ∆t X
G(c) (xti , ti )u(xti , ti ). Applying the definition of the Gaus-
Y
i
exp − ζi (34)
sian distribution with these terms yields: i=1
(2π)n/2 |Σi |1/2 ) 2 i=1
N exp − ∆t (z − µ )T Σ−1 (z − µ )
i i i i
Where ζi is:
Y 2 i
P (τ ) = (25) −1
i=1
(2π)n/2 |Σi |1/2 ζ i = zT −1
i Σi zi − (zi − µi )
T
AT
t i Σi A t i (zi − µi ) (35)
And then using basic rules of exponents this probability Using basic rules of determinants it is easy to see that the
becomes: term outside the exponent reduces to
N
!
−1 ∆t X T −1 N N
Z(τ ) exp − (zi − µi ) Σi (zi − µi ) (26) Y (2π)n/2 |AT
j Σ j Aj |
1/2
) Y
2 i=1 = |Aj | (36)
j=1
(2π)n/2 |Σj |1/2 ) j=1
QN n/2
Where Z(τ ) = i=1 (2π) |Σi |1/2 . With this equation
in hand we’re now ready to compute the likelihood ratio So we need only show that ζi reduces to Qi . Observe that at
between two diffusion processes. every timestep we have the difference between two quadratic
functions of zi , so we can complete the square to combine
Theorem 1: Let p(τ ) be the probability density function this into a single quadratic function. If we recall the definition
for trajectories under the uncontrolled discrete time dynam- of Γi from above, and define Λi = AT ti Σi Ati then completing
ics: √ the square yields:
dxt = f (xt , t)∆t + B(xt , t) ∆t (27)
T
ζi = zi + Γi Λ−1 Γ−1 zi + Γi Λ−1
And let q(τ ) be the probability density function for trajecto- i µi i i µi
T −1 (37)
ries under the controlled dynamics with an adjusted variance: T −1 −1
− µi Λi µi − Γi Λi µi Γi Γt Λi µi −1
dxt = (f (xt , t) + G(xt , t)u(xt , t)) ∆t+ Now we expand out the first quadratic term to get:
√
BE (xt , t) ∆t (28)
−1 T −1 T −1 −1
ζ i = zT
i Γi zi + 2µi Λi zi + µi Λi Γi Λi µi
Where the adjusted variance has the form: (38)
−1 −1 T −1 −1
− µT
i Λi µi − (Γi Λi µi ) Γi (Γi Λi µi )
0
BE (xt , t) =
At Bc (xt , t) Notice that the two underlined terms are the same, except
for the sign, so they cancel out and we’re left with:
And define zi , µi , and Σi as before. Let Qi be defined as:
−1 T −1 T −1
ζ i = zT
i Γi zi + 2µi Λi zi − µi Λi µi (39)
T
Qi = (zi − µi ) Γ−1
i (zi − µi ) (29)
T −1 −1 Now define z̃i = zi − µi , and then re-write this equation in
+ 2 (µi ) Σi (zi − µi ) + µT
i Σi µi
terms of z˜i :
Where Γi is:
−1 ζi = (z̃i + µi )T Γ−1 T −1 T −1
i (z̃i + µi ) + 2µi Λi (z̃i + µi ) − µi Λi µi
Γ−1
i = Σ−1
i − AT
t i Σi A t i (30) (40)
which expands out to: Then by re-defining the running cost q(xt , t) as:
−1 T −1 T −1
ζi = z̃T
i Γi z̃i + 2µi Γi z̃i + µi Γi µi
1 T
−1 T −1 T −1
(41) q̃(x, u, dx) = q(xt , t) + (z − µ) Γ̃−1 (z − µ)
+ 2µT i Λi z̃i + 2µi Λi µi − µi Λi µi
2 (48)
T −1 1 T −1
Which then simplifies to: + µ H (z − µ) + µ H µ
2
−1 T −1 T −1
PN
ζi = z̃T
i Γi z̃i + 2µi Γi z̃i + µi Γi µi (42)
and S̃(τ ) = φ(xT ) + j=1 q̃(x, u, dx), we have:
−1 T −1
+ 2µT i Λi z̃i + µi Λi µi
h i
Eq exp − λ1 S̃(τ ) dx ∆t
t
− f (x t , t)
Now recall that Γi = (Σ−1
i − Λi )
−1 −1
, so we can split the u∗t = G(xt , t)−1 h i
1
quadratic terms in Γi into the Σ−1
−1
i and Λ−1
i components. Eq exp − λ S̃(τ )
Doing this yields: (49)
Also note that dxt is now equal to:
−1 T −1 T −1 T −1
ζi = z̃T
i Γi z̃i + 2µi Σi z̃i − 2µi Λi z̃i + µi Σi µi √
(43) (f (xt , t) + G(xt , t)u(xt , t)) ∆t + B(xt , t) ∆t (50)
−1 T −1 T −1
− µT
i Λi µi + 2µi Λi z̃i + µi Λi µi
dxt
So we can re-write ∆t − f (xt , t) as:
and by noting that the underlined terms cancel out we see
that we’re left with:
G(xt , t)u(xt , t) + B(xt , t) √ (51)
−1 −1 −1 ∆t
ζi = z̃T
i Γi z̃i + 2µT
i Σi z̃i + µT
i Σi µi (44)
And then since G(xt , t) does not depend on the expectation
which is the same as: we can pull it out and get the iterative update law:
T
(zi − µi ) Γ−1 T −1 T −1
i (zi − µi ) + 2µi Σi (zi − µi ) + µi Σi µi
(45) u∗t = G(xt , t)−1 G(xt , t)u(xt , t)
h i
And so ζi = Qi which completes the proof. Eq exp − λ1 S̃(τ ) B(xt , t) √∆t (52)
+ G(xt , t)−1 h i
The key difference between this proof and earlier path inte- Eq exp − λ1 S̃(τ )
gral works which use an application of Girsanov’s theorem
to sample from a non-zero control input is that this theorem C. Special Case
allows for a change in the variance as well. The update law (52) is applicable for a very general class
In the expression for the likelihood ratio derived here of systems. In this section we examine a special case which
−1 T −1
the last two terms (2µTi Σi (zi − µi ) + µi Σi µi ) are ex- we use for all of our experiments. We consider dynamics of
actly the terms from Girsanov’s theorem. The first term the form:
T
((zi − µi ) Γ−1
i (zi − µi )), which can be interpreted as pe-
1 √
nalizing over-aggressive exploration, is the only additional dxt = f (xt , t)∆t + G(xt , t) u(xt , t)∆t + √ ∆t
ρ
term.
√(53)
B. Likelihood Ratio as Additional Running Cost And for the sampling distribution we set A equal to νI.
We also assume that Gc (xt , t) is a square invertible matrix.
The form of the likelihood ratio just derived is easily
This reduces H(xt , t) to Gc (xt , t)−1 . Next the dynamics
incorporated into the path integral control framework by
can be re-written as:
folding it into the cost-to-go as an extra running cost. Note
that the likelihood ratio appears in both the numerator and 1
dxt = f (xt , t)∆t + G(xt , t) u(xt , t) + √ √ ∆t
denominator of (16). Therefore, any terms which do not ρ ∆t
depend on the state can be factored out of the expectation (54)
and canceled. This Then we can interpret √1ρ √∆t as a random change in the
QNremoves the numerically troublesome
normalizing term j=1 |Atj |. So only the summation of Qi control input, to emphasize this we will denote this term as
remains. Recall that Σ = λG(xt , t)R(xt , t)−1 G(xt , t). This δu = √1ρ √∆t . We then have B(xt , t) √∆t = G(xt , t)δu.
implies that: This yields the iterative update law as:
h i
Eq exp − λ1 S̃(τ ) δu
−1
Γ = λ G(xt , t)R(xt , t)−1 G(xt , t) u(xt , t)∗ = u(xt , t) + (55)
(46) h i
−1 Eq exp − λ1 S̃(τ )
− AT G(xt , t)R(xt , t)−1 G(xt , t)T A
which can be approximated as:
Now define H = G(xt , t)R(xt , t)−1 G(xt , t)T and Γ̃ = λ1 Γ.
PK
We then have: exp − 1
S̃(τ i,k ) δui,k
k=1 λ
1 T
u(xti , ti )∗ ≈ u(xti , ti ) +
(z−µ) Γ̃−1 (z−µ)+2µT H−1 (z−µ)+µT H−1 µ
Q= PK
exp − 1
S̃(τ )
λ k=1 λ i,k
(47) (56)
Where K is the number of random samples (termed rollouts) Algorithm 1: Model Predictive Path Integral Control
and S(τi,k ) is the cost-to-go of the kth rollout from time ti Given: K: Number of samples;
onward. This expression is simply a reward-weighted average N : Number of timesteps;
of random variations in the control input. Next we investigate (u0 , u1 , ...uN −1 ): Initial control sequence;
what the likelihood ratio addition to the running cost is. For ∆t, xt0 , f , G, B, ν: System/sampling dynamics;
these dynamics we have the following simplifications: φ, q, R, λ: Cost parameters;
i) z − µ = G(xt , t)δu uinit : Value to initialize new controls to;
ii) Γ̃−1 = (1 − ν −1 )G(xt , t)−1 R(xt , t)G(xt , t)
while task not completed do
iii) H−1 = G(xt , t)−1 R(xt , t)G(xt , t)−1
for k ← 0 to K − 1 do
Given these simplifications q̃ reduces to: x = xt0 ;
for i ← 1 to N − 1 do
(1 − ν −1 ) T xi+1 = xi + (f + G (ui + δui,k )) ∆t;
q̃(x, u, dx) = q(xt , t) + δu Rδu
2 (57) S̃(τi+1,k ) = S̃(τi,k ) + q̃;
1
+ uT Rδu + uT Ru
2 for i ← 0 to N− 1 do
PK exp(− λ1
S̃( τi,k ))δui,k
This means that the introduction of the likelihood ratio ui ← ui + k=1 PK ;
k=1 exp(− λ S̃( τi,k ))
1
simply introduces the original control cost from the optimal
control formulation into the sampling cost, which originally send to actuators(u0 );
only included state-dependent terms. for i ← 0 to N − 2 do
ui = ui+1 ;
IV. MODEL PREDICTIVE CONTROL ALGORITHM uN −1 = uinit
Update the current state after receiving feedback;
We apply the iterative path integral control update law, check for task completion;
with the generalized importance sampling term, in a model
predictive control setting. In this setting optimization and
execution occur simultaneously: the trajectory is optimized
and then a single control is executed, then the trajectory is A. Cart-Pole
re-optimized using the un-executed portion of the previous
trajectory to warm-start the optimization. This scheme has For the cart-pole swing-up task we used the state cost:
two key requirements: q(x) = p2 + 500(1 + cos(θ))2 + θ̇2 + ṗ2 , where p is the
i) Rapid convergence to a good control input. position of cart, ṗ is the velocity and θ, θ̇ are the angle and
ii) The ability to sample a large number of trajectories in angular velocity of the pole. The control input is desired
real-time. velocity, which maps to velocity through the equation: p̈ =
10(u − ṗ). The disturbance parameter √1ρ was set equal .01
The first requirement is essential because the algorithm
and the control cost was R = 1. We ran the MPPI controller
does not have the luxury of waiting until the trajectory has
for 10 seconds with a 1 second optimization horizon. The
converged before executing. The new importance sampling
controller has to swing-up the pole and keep it balanced for
term enables tuning of the exploration variance which allows
the rest of the 10 second horizon. The exploration variance
for rapid convergence, this is demonstrated in Fig. 1.
The second requirement, sampling a large number of
trajectories in real-time, is satisfied by implementing the
random sampling of trajectories on a GPU. The algorithm 300 ν = 75
ν = 500
is given in Algorithm 1, in the parallel GPU implementation ν = 1000
Average Running Cost
250
the sampling for loop (for k to K-1) is run completely in ν = 1500
parallel. 200
V. EXPERIMENTS 150
function: q(x) = 100d2 + (vx − 7.0)2 . Where d is defined DDP |vx | MPPI |vx |
8
x 2 2 DDP |vy | MPPI |vy |
as: d = | 13 + y6 − 1|, and vx is the forward (in
7
body frame) velocity of the car. This cost ensures that
the car to stays on an elliptical track while maintaining a 6
Velocity (m/s)
forward speed of 7 meters/sec. We use a non-linear dynamics 5
model [5] which takes into account the (highly non-linear)
4
interactions between tires and the ground. The exploration
variance was set to a constant ν times the natural variance 3
30 1
DDP Solution
ν = 50 0
0 10 20 30 40 50
ν = 100
25
Time (s)
Average Running Cost
ν = 150
Fig. 4. Comparison of DDP (left) and MPPI (right) performing a cornering
ν = 300
maneuver along an ellipsoid track. MPPI is able to make a much tigther
20
turn while carrying more speed in and out of the corner than DDP.
15
MPPI and DDP which guide the quadrotor through the forest
10
as quickly as possible. The cost function for MPPI was
100
MPC-DDP MPPI
5
1.0 1.5 2.0 2.5 3.0 3.5 4.0
80
Number of Rollouts (Log Scale)
the turn. The DDP solution does not attempt to slide and 0 20 40 60 80 100 0 20 40 60 80 100
16
changing the variance by a constant multiple times the
15 natural variance of the system. In this special case the
14
introduction of the likelihood ratio corresponds to adding in
a control cost when evaluating the cost-to-go of a trajectory.
13
A direction for future research is to investigate how to
12
automatically adjust the variance online. Doing so could
enable the algorithm to switch from aggressively exploring
11
the state space when performing aggressive maneuvers to
10 exploring more conservatively for performing very precise
3m 4m 5m
maneuvers.
Density Setting of Forest
R EFERENCES
Fig. 6. Time to navigate forest. Comparison between MMPI and DDP.
[1] W. H. Fleming and H. M. Soner. Controlled Markov processes and
viscosity solutions. Applications of mathematics. Springer, New York,
Since the MPPI controller can explicitly reason about 2nd edition, 2006.
crashing (as opposed to just staying away from obstacles), [2] A. Friedman. Stochastic Differential Equations And Applications.
it is able to travel both faster and closer to obstacles than Academic Press, 1975.
[3] Vicenç Gómez, Hilbert J Kappen, Jan Peters, and Gerhard Neumann.
the MPC-DDP controller. Fig. 7 shows the difference in time Policy search for path integral control. In Machine Learning and
between the two algorithms and Fig. 6 the trajectories taken Knowledge Discovery in Databases, pages 482–497. Springer, 2014.
by MPC-DDP and one of the MPPI runs on the forest with [4] Vicenç Gómez, Sep Thijssen, Hilbert J Kappen, Stephen Hailes, and
Andrew Symington. Real-time stochastic optimal control for multi-
obstacles placed on average 4 meters away. agent quadrotor swarms. arXiv preprint arXiv:1502.04548, 2015.
[5] R.Y Hindiyeh. Dynamics and Control of Drifting in Automobiles. PhD
thesis, Stanford University, March 2013.
[6] D. H. Jacobson and D. Q. Mayne. Differential dynamic programming.
American Elsevier Pub. Co., New York, 1970.
[7] H. J. Kappen. Linear theory for control of nonlinear stochastic
systems. Phys Rev Lett, 95:200201, 2005. Journal Article United
States.
[8] I. Karatzas and S. E. Shreve. Brownian Motion and Stochastic
Calculus (Graduate Texts in Mathematics). Springer, 2nd edition,
August 1991.
Fig. 7. Simulated forest environment used in the quadrotor navigation task. [9] Nathan Michael, Daniel Mellinger, Quentin Lindsey, and Vijay Ku-
mar. The grasp multiple micro-uav testbed. Robotics & Automation
Magazine, IEEE, 17(3):56–65, 2010.
[10] E. Rombokas, M. Malhotra, E.A. Theodorou, E. Todorov, and Y. Mat-
VI. CONCLUSION suoka. Reinforcement learning and synergistic control of the act hand.
In this paper we have developed a model predictive path IEEE/ASME Transactions on Mechatronics, 18(2):569–577, 2013.
[11] R. F. Stengel. Optimal control and estimation. Dover books on
integral control algorithm which is able to outperform a advanced mathematics. Dover Publications, New York, 1994.
state-of-the-art DDP method on two difficult control tasks. [12] F. Stulp, J. Buchli, E. Theodorou, and S. Schaal. Reinforcement learn-
The algorithm is based on stochastic sampling of system ing of full-body humanoid motor skills. In Proceedings of 10th IEEE-
RAS International Conference on Humanoid Robots (Humanoids),
trajectories and requires no derivatives of either the dynamics pages 405–410, Dec 2010.
or costs of the system. This enables the algorithm to naturally [13] E. Theodorou, Y. Tassa, and E. Todorov. Stochastic differential
take into account non-linear dynamics, such as a non-linear dynamic programming. In American Control Conference, 2010, pages
1125–1132, 2010.
tire model [5]. It is also able to handle cost functions which [14] E. A. Theodorou, J. Buchli, and S. Schaal. A generalized path integral
are intuitively appealing, such as an impulse cost for hitting approach to reinforcement learning. Journal of Machine Learning
an obstacle, but are difficult for traditional approaches that Research, (11):3137–3181, 2010.
[15] E.A. Theodorou and E. Todorov. Relative entropy and free energy du-
rely on a smooth gradient signal to perform optimization. alities: Connections to path integral and kl control. In the Proceedings
The two keys to achieving this level of performance with a of IEEE Conference on Decision and Control, pages 1466–1473, Dec
sampling based method are: 2012.
[16] Evangelos A. Theodorou. Nonlinear stochastic control and information
i) The derivation of the generalized likelihood ratio be- theoretic dualities: Connections, interdependencies and thermody-
tween discrete time diffusion processes. namic interpretations. Entropy, 17(5):3352–3375, 2015.
[17] Sep Thijssen and HJ Kappen. Path integral control and state-dependent
feedback. Physical Review E, 91(3):032104, 2015.
[18] E. Todorov and W. Li. A generalized iterative lqg method for locally-
optimal feedback control of constrained nonlinear stochastic systems.
pages 300–306, 2005.
[19] G. Williams, E. Rombokas, and T. Daniel. Gpu based path integral
control with learned dynamics. In Neural Information Processing
Systems - ALR Workshop, 2014.