
National University of Singapore, MA5232 - Modeling and Numerical Simulations (Part III)

Optimal Control Theory

Qianxiao Li
v0.1
Preface

These lecture notes are compiled for the special topic in MA5232 Modeling and Numerical
Simulations in Semester 2 of AY 2020-21.
This document is typeset using LaTeX with a modified theme based on
https://www.overleaf.com/latex/templates/lecture-note-template/dwyrjrnthdcz
If you find any mistakes or typos in the notes, please send me an email at [email protected].

Contents

1 Introduction
   1.1 Overview
   1.2 Ordinary Differential Equations
       1.2.1 Basic Definitions
       1.2.2 Flow Map and Dependence on Initial Condition
       1.2.3 Numerical Solution of ODEs

2 Optimal Control Theory
   2.1 From Calculus of Variations to Optimal Control
       2.1.1 A Motivating Example
       2.1.2 The Problem of Optimal Control
       2.1.3 Weak vs Strong Minima
       2.1.4 A Dynamical View on the Calculus of Variations
       2.1.5 The Optimal Control Formulation
   2.2 Pontryagin's Maximum Principle
       2.2.1 The Maximum Principle
       2.2.2 Other Forms of the Maximum Principle
       2.2.3 Further Reading
   2.3 Hamilton-Jacobi-Bellman Equations
       2.3.1 Motivating Example of Dynamic Programming
       2.3.2 The Dynamic Programming Principle
       2.3.3 Hamilton-Jacobi-Bellman Equations
       2.3.4 Implications for Optimal Control
       2.3.5 Further Reading
   2.4 Stochastic Control
       2.4.1 Control of Stochastic Differential Equations
       2.4.2 The Stochastic Dynamic Programming Principle
       2.4.3 Stochastic Hamilton-Jacobi-Bellman Equation
       2.4.4 Further Reading

3 Numerical Methods for Optimal Control
   3.1 Overview
   3.2 Numerical methods based on the PMP
       3.2.1 The Method of Successive Approximations
       3.2.2 Solution of Two-point Boundary Value Problem
   3.3 Nonlinear Programming
   3.4 Numerical Methods based on the HJB
   3.5 Further Reading


1 Introduction

1.1 Overview

Optimal control theory is an important topic of study in applied mathematics. In some sense, it
is the culmination of a series of work on calculus of variations that originates from classical
mechanics. In modern times, optimal control finds applications in a variety of fields, including
aerospace engineering, systems engineering, financial engineering and machine learning.
These notes give a brief introduction to the theory of optimal control for mathematics students, with emphasis on both the underlying mathematical theory and numerical algorithms for control problems. Due to limited time, we will only cover the essential topics in each case, and the interested reader may consult the cited reference books for further study.

1.2 Ordinary Differential Equations

In this section, we introduce some basics of ordinary differential equations that will be useful in later chapters. We will not present any proofs, since they can be found in any standard introductory text (e.g. [Arn12, Cod12]) and the reader is assumed to have some familiarity with the topic; instead, we will state without proof a few useful properties and illustrate some relevant phenomena with examples.

1.2.1 Basic Definitions

We will work in R^d. An ordinary differential equation (ODE) is an equation of the form

ẋ(t) = f(x(t)),   x(0) = x_0 ∈ R^d,   (1.1)

where ẋ denotes the time derivative, f : R^d → R^d is a function or vector field, and x_0 is the initial condition. This is called a time-homogeneous ODE, since the vector field on the right-hand side does not depend explicitly on the time t. On the other hand, a time-inhomogeneous ODE is given by

ẋ(t) = f(t, x(t)),   x(0) = x_0 ∈ R^d.   (1.2)

We note that, modulo technical conditions, these two equations are equivalent. First, (1.2) obviously includes (1.1). For the reverse direction, we define an auxiliary variable x^0 ∈ R with ẋ^0(t) = 1, x^0(0) = 0, so that x^0(t) = t. Then we can rewrite (1.2) exactly in the form of (1.1) by defining x̃ = (x^0, x) and f̃(x̃) = (1, f(x^0, x)). Hence, for convenience we will work with either (1.1) or (1.2), keeping in mind that they are effectively equivalent for most purposes.
By a solution of an ODE on [0, T] we mean a function x : [0, T] → R^d, written x := {x(t) : t ∈ [0, T]}, that satisfies (1.2).

Example 1.1: Linear ODEs

Let d = 1 and f(x) = ax with a ∈ R. Then, check that

x(t) = e^{at} x_0   (1.3)

is the solution to (1.1). More generally, consider d ≥ 1 and f(x) = Ax, where A ∈ R^{d×d}. Then

x(t) = e^{tA} x_0   (1.4)

is the solution to (1.1). Here e^C := Σ_{i≥0} C^i / i! denotes the usual matrix exponential.

The definition of a solution requires x to be differentiable on (0, T). But we remark that it is possible to relax this by considering integral forms. For example, we can write (1.2) as

x(t) = x_0 + ∫_0^t f(s, x(s)) ds.   (1.5)

The advantage here is that we can consider less regular x to be solutions of ODEs; e.g. here it is only required that x be absolutely continuous, in which case x satisfies (1.2) for almost every t.
One of the most basic results concerns when a solution to (1.2) (or (1.5)) exists. The following
result gives sufficient conditions, and we will hereafter always assume that a unique solution
exists to whichever ODE we deal with.

Theorem 1.2: Picard–Lindelöf Theorem

Let f be continuous in t and uniformly Lipschitz in x, i.e. there exists a constant C such that ‖f(t, x) − f(t, x′)‖ ≤ C ‖x − x′‖ for all x, x′ ∈ R^d and t ∈ [0, T]. Then, there exists a unique solution to (1.2) on [0, T].

1.2.2 Flow Map and Dependence on Initial Condition

One way to look at an ODE is to look at its solution trajectories for a given initial condition. Alternatively, we can also look at what the solution does to a set of initial conditions at a fixed terminal time. In other words, we define the flow or the flow map φ_t : R^d → R^d by

φ_t(x) := x(t),   where ẋ(s) = f(s, x(s)), s ∈ [0, t], x(0) = x.   (1.6)


In fact, the set Φ := {φ_t : t ∈ R} forms a one-parameter continuous group of transformations on R^d, under the binary operation of function composition. Analyzing the set Φ can be seen as an alternative way to understand ODEs.
The following properties are well-known and easy to check:
• φ_t is continuous for each t
• φ_0 is the identity mapping, φ_0(x) = x for all x
• If f does not depend on t, then φ_t ∘ φ_s = φ_{t+s}, i.e. t ↦ φ_t is a homomorphism.
One can also ask how sensitive the terminal state of the ODE is to the initial condition. This can be captured by the Jacobian of φ_t, [∇φ_t(x)]_{ij} = ∂_j φ_{t,i}(x). The following result will be useful to us later.
us later.

Theorem 1.3: Dependence on Initial Condition

Let f be continuously differentiable in x, and Lipschitz in x uniformly in t. Let x be the solution of the ODE (1.2) with flow map φ_t, and let v be the solution to the linear time-inhomogeneous ODE

v̇(s) = ∇_x f(s, x(s)) v(s),   s ∈ [0, t],   v(0) = v_0.   (1.7)

Then, we have

lim_{ε→0+} ‖ (φ_t(x_0 + ε v_0) − φ_t(x_0)) / ε − v(t) ‖ = 0,   (1.8)

uniformly in t ∈ [0, T] and ‖v_0‖ ≤ 1.

Corollary 1.4

Under the same conditions as in Theorem 1.3, the Jacobian J(t) := ∇_x φ_t(x_0) satisfies the linear equation

J̇(t) = ∇_x f(t, x(t)) J(t),   J(0) = I.   (1.9)

Equation (1.7) is called the variational equation associated with the ODE (1.2). It describes the
propagation of variations of the initial condition along the evolution in time, hence its name.
We will refer back to these results in our discussion of optimal control theory.

Example 1.5: Flow Map and Variational Equations for Linear Systems


Recall the linear system in Example 1.1. In this case, the flow map is a linear function

φ_t(x) = e^{tA} x,   (1.10)

with Jacobian J(t) = e^{tA} (in this case, the Jacobian does not depend on x_0). Check that J(t) satisfies the variational equation

J̇(t) = A J(t),   J(0) = I,   (1.11)

as shown in the above corollary.
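To make Example 1.5 concrete, here is a minimal numerical sketch (using NumPy/SciPy, which the notes do not prescribe; the matrix A, the time horizon and the tolerances are arbitrary illustrative choices). It integrates the variational equation (1.9) for J(t), and compares the result with the matrix exponential e^{tA} and with a finite-difference approximation of the Jacobian of the flow map.

import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

# Illustrative linear vector field f(x) = A x (A chosen arbitrarily).
A = np.array([[0.0, 1.0], [-2.0, -0.5]])
x0 = np.array([1.0, 0.0])
T = 1.5

def flow(x, t):
    # Flow map phi_t(x) for the linear ODE xdot = A x, cf. (1.10).
    return expm(t * A) @ x

# Variational equation (1.9): Jdot = A J, J(0) = I, integrated as a flattened system.
def rhs(t, J_flat):
    return (A @ J_flat.reshape(2, 2)).ravel()

sol = solve_ivp(rhs, (0.0, T), np.eye(2).ravel(), rtol=1e-10, atol=1e-12)
J_variational = sol.y[:, -1].reshape(2, 2)

# Finite-difference Jacobian of the flow map at x0.
eps = 1e-6
J_fd = np.column_stack([(flow(x0 + eps * e, T) - flow(x0, T)) / eps for e in np.eye(2)])

print(np.allclose(J_variational, expm(T * A), atol=1e-6))  # True
print(np.allclose(J_fd, expm(T * A), atol=1e-4))           # True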

1.2.3 Numerical Solution of ODEs

Often, ODEs do not admit explicit solutions and we have to compute a solution numerically. There are many methods for doing so, and it is not the purpose here to give a thorough introduction. Pertaining to the topics discussed in these notes, it is sufficient to introduce the simplest possible method, the forward Euler method.
In this method, we construct an approximate solution to (1.2) by discretizing time and setting

x̂(k+1) = x̂(k) + ∆t f(k∆t, x̂(k)),   x̂(0) = x_0,   (1.12)

which can be seen as a first-order Taylor expansion of the integral form of the ODE (1.5) for small ∆t. The latter is called the step size. We expect this approximation to get better as the step size ∆t becomes small. This is made precise in the following result.

Theorem 1.6: Global Truncation Error of Forward Euler Method

Let f be Lipschitz in x uniformly in t and continuous in t. Let x be a solution of the ODE (1.2) with initial condition x_0, and let x̂ be the iterates defined in (1.12). Then for each K > 0 there exists a constant C > 0 such that

max_{k ≤ K} ‖x̂(k) − x(k∆t)‖ ≤ C ∆t.   (1.13)
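As a sanity check of Theorem 1.6, the following sketch (Python/NumPy; the test ODE ẋ = −x and the step sizes are arbitrary choices not taken from the notes) applies the forward Euler scheme (1.12) and reports the maximum error against the exact solution, which should shrink roughly linearly in ∆t.

import numpy as np

def forward_euler(f, x0, dt, n_steps):
    # Forward Euler iterates (1.12) for xdot = f(t, x), scalar state.
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = x[k] + dt * f(k * dt, x[k])
    return x

# Test problem: xdot = -x with exact solution x(t) = exp(-t) * x0.
f = lambda t, x: -x
x0, T = 1.0, 2.0

for n in [20, 40, 80, 160]:
    dt = T / n
    xh = forward_euler(f, x0, dt, n)
    exact = np.exp(-dt * np.arange(n + 1)) * x0
    print(f"dt = {dt:.4f}, max error = {np.abs(xh - exact).max():.2e}")
# The error roughly halves when dt is halved, consistent with the O(dt) bound in (1.13).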


2 Optimal Control Theory

The study of optimal control theory originates from the classical theory of the calculus of variations, beginning with the seminal work of Euler and Lagrange in the 1700s. This culminated in the so-called Lagrangian mechanics, which reformulates Newtonian mechanics in terms of extremal principles. In a nutshell, the calculus of variations studies optimization over "curves", which one can picture as an infinite-dimensional extension of traditional optimization problems.
Optimal control theory is a nontrivial extension of the classical theory of the calculus of variations in two main directions: to dynamical and non-smooth settings. This builds on important contributions of Weierstrass and others, and led in two inter-related directions: Pontryagin's maximum principle and the Hamilton-Jacobi-Bellman theory. An interesting historical account of the developments can be found in [Lib12].
In this section, we give a minimal introduction to the formulation of optimal control problems, paying particular attention to the so-called Bolza problems. The reader is referred to comprehensive texts on optimal control theory for a more complete account [AF13, Lib12, BP07].

2.1 From Calculus of Variations to Optimal Control

2.1.1 A Motivating Example

Finite-dimensional optimization problems are of the form

inf_{x ∈ X} Φ(x),   Φ : X → R,   (2.1)

where X is usually a subset of a Euclidean space. On the other hand, a calculus of variations problem minimizes some functional J over some infinite-dimensional space X, i.e.

inf_{x ∈ X} J[x],   J : X → R.   (2.2)

There are many possible forms of the functional J and the space X. For example, one may encounter functionals in the form of an integral, where the argument x = {x(u) : u ∈ [a, b]} is a function of a scalar variable u, i.e.

J[x] = ∫_a^b L(u, x(u), x′(u)) du.   (2.3)

Let us consider a motivating example problem of this nature that is also of substantial historical importance.


Example 2.1: Rolling a Ball Down a Ramp

Let a < b be two points on a horizontal plane, and our goal is to build a ramp such that when we release the ball from point a, it arrives at a point directly under point b in the shortest time possible. See figure below. We will assume that there is no friction.

What shape of the ramp will achieve this task? It turns out that we can phrase this as a calculus of variations problem. Let s(u) be the instantaneous speed of the ball when its horizontal coordinate is u, and let {x(u)} denote the shape of the ramp (measured downwards), with x(a) = 0. By conservation of energy we find that

(1/2) m s(u)^2 = m g x(u)   ⇒   s(u) = √(2 g x(u)).   (2.4)

Hence, the total time taken from a to b is the integral of arc length divided by speed, i.e.

Total time = J[x] = ∫_a^b √(1 + x′(u)^2) / √(2 g x(u)) du,   (2.5)

which is of the form (2.3) with L(u, x, v) = √(1 + v^2) / √(2 g x).

The problem in Example 2.1 is known as the Brachistochrone problem, and was first posed by Johann Bernoulli in 1696. One can see from the example above that to solve this problem, one needs to solve optimization problems over curves. A classical result due to Euler and Lagrange gives a necessary condition for optimality that allows us to solve this problem.

Theorem 2.2: Euler-Lagrange Equations

Let x ∈ C^1([a, b], R) be an extremum of J as defined in (2.3). Then, x satisfies the Euler-Lagrange equations

∂_x L(u, x(u), x′(u)) = (d/du) ∂_{x′} L(u, x(u), x′(u)),   u ∈ [a, b].   (2.6)

1 In Greek, “Brachistochrone” is literally “shortest time”.


We have deliberately left several notions rather undefined, such as the meaning of an extremum.
We will revisit this slightly subtle issue in the next part. Here, we will not present a proof of the
Euler-Lagrange equations, since it is not required for the rest of our discussions. A proof can be found in any standard text on the calculus of variations, e.g. [GS00, Lib12].

Exercise 2.3: Brachistochrone Solution

Consider the Brachistochrone problem in Example 2.1. By choosing appropriate units one can set g = 1/2. Show that the optimal ramp shapes are cycloids, whose parametric forms are

u(θ) = a + c(θ − sin θ),   x(θ) = c(1 − cos θ),   θ ∈ [0, 2π], c > 0.   (2.7)
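For intuition, the following sketch (Python/NumPy; the endpoint, the cycloid parameter c, and the grid sizes are arbitrary illustrative choices, with g = 1/2 as in the exercise) evaluates the travel-time functional (2.5) numerically for a cycloid and for a straight ramp joining the same endpoints; the cycloid comes out faster.

import numpy as np

def trapz(y, x):
    # Simple trapezoidal rule (avoids NumPy version differences).
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Units with g = 1/2, so the speed sqrt(2*g*x) is simply sqrt(x).
a, c, theta_end = 0.0, 1.0, np.pi                 # illustrative choices
b = a + c * theta_end                             # shared horizontal end point
depth = c * (1.0 - np.cos(theta_end))             # shared vertical drop

# Travel time along the cycloid (2.7), parameterized by theta.
theta = np.linspace(1e-6, theta_end, 20001)
du = c * (1.0 - np.cos(theta))                    # u'(theta)
dx = c * np.sin(theta)                            # x'(theta)
x = c * (1.0 - np.cos(theta))
t_cycloid = trapz(np.sqrt(du**2 + dx**2) / np.sqrt(x), theta)

# Travel time along the straight ramp x(u) = k (u - a), k = depth / (b - a).
# The substitution u = a + (b - a) s^2 removes the endpoint singularity.
k = depth / (b - a)
s = np.linspace(1e-6, 1.0, 20001)
integrand = np.sqrt(1.0 + k**2) * 2.0 * (b - a) * s / np.sqrt(k * (b - a) * s**2)
t_line = trapz(integrand, s)

print(f"cycloid ramp: {t_cycloid:.3f}, straight ramp: {t_line:.3f}")
# The cycloid is faster (about 4.44 vs 5.27 in these units).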

2.1.2 The Problem of Optimal Control

In passing to optimal control, we additionally consider two aspects of the problem, namely the type of extrema studied, as well as the setting in which such calculus of variations problems are phrased.
Throughout these notes, the word "extrema" refers to either minima or maxima of the function or functional under consideration. Since maximization is equivalent to minimization after replacing the objective function(al) with its negation, we will hereafter only discuss minima, unless otherwise stated.
We start with distinguishing different types of minima.

2.1.3 Weak vs Strong Minima

In finite-dimensional optimization, it is easy to define the notion of local and global minima.
Let Φ : Rd → R be a function.
• We say that x* is a local minimum of Φ if there exists a δ > 0 such that Φ(x*) ≤ Φ(x) for all ‖x − x*‖ ≤ δ.
• We say that x* is a global minimum of Φ if Φ(x*) ≤ Φ(x) for all x ∈ R^d.
Hence, all global minima are automatically local minima. If Φ is differentiable, then a necessary condition for a local minimum is that ∇Φ(x*) = 0.
In extending these ideas to infinite dimensions, one needs to be slightly more careful. Notice that the definition of minima (local or global) depends on the norm ‖·‖, which gives us a sense of closeness. We did not specify which norm we used in the finite-dimensional case above, since all norms on R^d are equivalent (for any two norms ‖·‖_A and ‖·‖_B there exists c ∈ (0, 1] such that c‖x‖_A ≤ ‖x‖_B ≤ (1/c)‖x‖_A for all x ∈ R^d).
In the infinite dimensional case of minimization of functionals, the norm we choose will affect
our results, and some curve x may be a local minimum of J under one norm but not under
another.
We now distinguish between two notions of minima – weak and strong minima – commonly
encountered in calculus of variations and optimal control.
Let us consider for the moment that our curve x is C^1. Moreover, let us simplify things and consider one spatial dimension, so that x(u) ∈ R for u ∈ [a, b]. There are two natural choices of norm that we can use:
• 0-norm: ‖x‖_0 = sup_{u∈[a,b]} |x(u)|.
• 1-norm: ‖x‖_1 = ‖x‖_0 + sup_{u∈(a,b)} |x′(u)|.
Each of these norms then allows us to define a notion of minimum.

Definition 2.4: Strong and Weak Minima

Let J : C^1([a, b], R) → R be a functional and x* ∈ C^1([a, b], R). We say that x* is a strong local minimum if there exists a δ > 0 such that J[x*] ≤ J[x] for all ‖x − x*‖_0 ≤ δ. We say that x* is a weak local minimum if we replace the norm ‖·‖_0 by ‖·‖_1. The global versions are defined similarly.

Now, it is easy to see that any C^1 curve which is a strong minimum must also be a weak minimum, but the converse is not true. The Euler-Lagrange equations (Thm. 2.2) apply to weak minima, whereas we need more advanced tools to handle strong minima. We now consider a simple example where a C^1 minimizer simply does not exist, but we will see later that this does not prevent the existence of a strong minimizer that is not C^1. All of these reasons motivate us to go past the setting of Euler and Lagrange and into the realm of optimal control.

Example 2.5: Piece-wise C 1 Minimizer

Consider the problem of minimizing the functional

J[x] = ∫_{−1}^{1} [x(u)]^2 [x′(u) − 1]^2 du,   (2.8)

subject to the boundary conditions x(−1) = 0 and x(1) = 1. Clearly, for all x ∈ C^1 we have J[x] > 0. But the curve

x(u) = 0 for −1 ≤ u < 0,   x(u) = u for 0 ≤ u ≤ 1,   (2.9)

achieves J[x] = 0, yet is only piecewise C^1. In fact, C^1 curves can get closer and closer to this x(u) with lower and lower cost, so a C^1 global minimizer does not exist.

2.1.4 A Dynamical View on the Calculus of Variations

Optimal control offers another way to look at calculus of variations problems, in which we view things dynamically. Concretely, we may re-parameterize the curves x(u) under consideration through their infinitesimal changes, which play the role of a control. Let us motivate this approach in the context of the Brachistochrone problem.

Example 2.6: Control Formulation of Brachistochrone

Consider the Brachistochrone problem in Example 2.1, but this time we parameterize the ramp in parametric form from the outset, i.e. (u(t), x(t)) where t is time. Then, the speed at time t is s(u(t)) = s(t) = √(u̇(t)^2 + ẋ(t)^2), and conservation of energy leads to

2 g x(t) = u̇(t)^2 + ẋ(t)^2.   (2.10)

Now, we imagine the reverse scenario, treating the velocities u̇, ẋ as controls, by setting

θ_1(t) = u̇(t) / √(2 g x(t)),   θ_2(t) = ẋ(t) / √(2 g x(t)).   (2.11)

Then, we end up with a control system that defines the equation of the ramp:

u̇(t) = θ_1(t) √(2 g x(t)),
ẋ(t) = θ_2(t) √(2 g x(t)),
θ_1(t)^2 + θ_2(t)^2 = 1,   (2.12)
(u(t_0), x(t_0)) = (a, 0),   u(t_1) = b.

The cost functional in this case is the time taken, so J = ∫_{t_0}^{t_1} 1 dt = t_1 − t_0.

It is worth noting that by formulating the original calculus of variations problem as a control
problem, we actually gained some generality:
• It is no longer assumed that x can be written as a function of u
• It is not necessary for x to be differentiable with respect to u


2.1.5 The Optimal Control Formulation

Now, let us formulate precisely the optimal control problem in the general setting.

The Dynamics. Consider the ordinary differential equation

ẋ(t) = f(t, x(t), θ(t)),   t ∈ [t_0, t_1],   x(t_0) = x_0.   (2.13)

Here x(t) ∈ R^d is the state and θ(t) ∈ Θ ⊂ R^m is the control, with Θ the control set. We will assume that the control set is closed (but it need not be bounded).
We will assume that the following conditions on f hold, unless otherwise stated:
• f(t, x, θ) is continuous in t and θ for all x,
• f(t, x, θ) is continuously differentiable in x for all t, θ.
These conditions are sufficient to ensure that (2.13) is well-posed, by a result similar to Theorem 1.2; see [BP07].
Remark. The conditions outlined above are certainly not the weakest possible to imply local
well-posedness of solutions, and they can be weakened in various ways (See e.g. [BP07] Ch.2).
We also emphasize two crucial points that are not assumed:
• We did not assume that f is differentiable with respect to θ.
• We did not assume that t ↦ θ(t) is regular. In fact, in the general case we can consider θ to be an essentially bounded function of t.

The Cost Functional. Let us now define the objective functionals. We will consider functionals of the form

J[θ] = ∫_{t_0}^{t_1} L(t, x(t), θ(t)) dt + Φ(t_1, x(t_1)),   (2.14)

where
• L : R × R^d × Θ → R is called the running cost,
• Φ : R × R^d → R is called the terminal cost.

The Bolza Problem of Optimal Control. Now, we state the Bolza problem of optimal control, which will be the primary object of analysis in these notes:

inf_θ J[θ] = ∫_{t_0}^{t_1} L(t, x(t), θ(t)) dt + Φ(t_1, x(t_1))
subject to ẋ(t) = f(t, x(t), θ(t)),   t ∈ [t_0, t_1],   x(t_0) = x_0.   (2.15)


For historical reasons, the case where Φ = 0 (no terminal cost) is called a Lagrange problem, whereas the case with L = 0 (no running cost) is called a Mayer problem. In optimal control theory, we often consider x_0 (initial condition) and t_0 (initial time) to be fixed. However, the terminal time t_1 can either be fixed or it can vary. Moreover, there can be a constraint set placed on the terminal state x(t_1). We will mostly consider the case where the final time t_1 is fixed (so that we can neglect the t_1 dependence of Φ) and there is no constraint on the terminal state, and we will discuss how the various results may change if we consider the general case.
As with classical optimization problems, the primary object of study is optimality conditions. One differentiates between necessary and sufficient conditions for optimality. The former asks what conditions any local/global optimum must satisfy, while the latter concerns conditions that are enough to guarantee optimality. In the following sections, we will investigate each of these aspects in turn.

2.2 Pontryagin’s Maximum Principle

In this section, we discuss a necessary condition for optimality – Pontryagin's Maximum Principle (PMP) – which is a hallmark result in optimal control theory and the calculus of variations. It greatly generalizes the Euler-Lagrange equations in highly nontrivial ways.
We will present the proof of the PMP in the case of fixed end time, without constraints on the terminal state. In this case, the problem is

inf_θ J[θ] = ∫_{t_0}^{t_1} L(t, x(t), θ(t)) dt + Φ(x(t_1))
subject to ẋ(t) = f(t, x(t), θ(t)),   t ∈ [t_0, t_1],   x(t_0) = x_0.   (2.16)

The proof of the PMP for this case is quite accessible, and hence we will present it in full. We will discuss the PMP for other variants of the basic formulation, but we will omit the proofs as they can be significantly more involved.

2.2.1 The Maximum Principle

To state Pontryagin's maximum principle, we need some definitions. Let us define the Hamiltonian

H : R × R^d × R^d × Θ → R,   H(t, x, p, θ) = p^⊤ f(t, x, θ) − L(t, x, θ).   (2.17)

For a control θ = {θ(t) : t ∈ [t_0, t_1]}, we say it is admissible if θ(t) ∈ Θ for all t ∈ [t_0, t_1].


Theorem 2.7: Pontryagin’s Maximum Principle

Let θ* be a bounded, measurable and admissible control that optimizes (2.16), and let x* be its corresponding state trajectory. Then, there exists an absolutely continuous process p* = {p*(t) : t ∈ [t_0, t_1]} such that

ẋ*(t) = ∇_p H(t, x*(t), p*(t), θ*(t)),   x*(t_0) = x_0,   (2.18)
ṗ*(t) = −∇_x H(t, x*(t), p*(t), θ*(t)),   p*(t_1) = −∇_x Φ(x*(t_1)),   (2.19)
H(t, x*(t), p*(t), θ*(t)) ≥ H(t, x*(t), p*(t), θ)   ∀θ ∈ Θ and a.e. t ∈ [t_0, t_1].   (2.20)

Proof 2.7: Proof of the PMP (Theorem 2.7)

The proof proceeds in several steps. To make the proof instructive, we will first assume that the function t ↦ θ*(t) is continuous, and we will relax this assumption at the end.

Step 1: Convert to Mayer Problem. Define an auxiliary scalar variable x^0(t), with

ẋ^0(t) = L(t, x(t), θ(t)),   x^0(t_0) = 0.   (2.21)

Then, by going one dimension higher and setting x̃ = (x^0, x), f̃ = (L, f), and Φ̃(x̃) = Φ(x) + x^0, we can rewrite (2.16) as a problem without running cost in the new augmented coordinates. Hence, we will hereafter drop the tildes and assume without loss of generality that L ≡ 0.

Step 2: Needle Perturbation. Fix τ > 0 and an admissible value s ∈ Θ. Define the needle perturbation of the optimal control

θ_ε(t) = s if t ∈ [τ − ε, τ],   θ_ε(t) = θ*(t) otherwise.   (2.22)

Let x_ε be the corresponding controlled trajectory, i.e. the solution of

ẋ_ε(t) = f(t, x_ε(t), θ_ε(t)),   x_ε(t_0) = x_0.   (2.23)

Our goal is to derive necessary conditions under which any such needle perturbation is sub-optimal, thus resulting in a necessary condition for a strong minimum of the cost functional.


Step 3: Variational Equation. It is clear that x_ε(t) = x*(t) for t ≤ τ − ε. Let us define, for t ≥ τ,

v(t) := lim_{ε→0+} (x_ε(t) − x*(t)) / ε.   (2.24)

This measures the propagation of the effect of the needle perturbation as time increases. In particular, at t = τ, v(τ) is the tangent vector of the curve ε ↦ x_ε(τ), given by

v(τ) = lim_{ε→0+} [ (1/ε) ∫_{τ−ε}^{τ} f(t, x_ε(t), s) dt − (1/ε) ∫_{τ−ε}^{τ} f(t, x*(t), θ*(t)) dt ] = f(τ, x*(τ), s) − f(τ, x*(τ), θ*(τ)).   (2.25)

For the remaining time t ∈ [τ, t_1], x_ε follows the same ODE (2.23). Thus, by Theorem 1.3, v(t) is well-defined and solves the linear variational equation

v̇(t) = ∇_x f(t, x*(t), θ*(t)) v(t),   t ∈ [τ, t_1],   (2.26)

with initial condition given by (2.25). In particular, the vector v(t_1) describes the variation in the end point x_ε(t_1) due to the needle perturbation.

Step 4: Optimality Condition at the End Point. By our assumption, the control θ* is optimal, hence we must have

Φ(x*(t_1)) ≤ Φ(x_ε(t_1)).   (2.27)

Thus, we have

0 ≤ lim_{ε→0+} (Φ(x_ε(t_1)) − Φ(x*(t_1))) / ε = (d/dε)|_{ε=0+} Φ(x_ε(t_1)) = ∇Φ(x*(t_1)) · v(t_1).   (2.28)

In fact, the inequality (2.28) holds for any τ and s characterizing the needle perturbation.

Step 5: The Adjoint Equation and the Maximum Principle. The idea is now to derive the consequences that the end-point optimality condition has for each τ. To this end, we define p*(t) as the solution of the backward Cauchy problem

ṗ*(t) = −∇_x f(t, x*(t), θ*(t))^⊤ p*(t),   p*(t_1) = −∇Φ(x*(t_1)).   (2.29)

Then, observe that we indeed have

(d/dt)[p*(t)^⊤ v(t)] = 0 for all t ∈ [τ, t_1]   ⇒   p*(τ)^⊤ v(τ) = p*(t_1)^⊤ v(t_1) ≤ 0,   (2.30)

which implies that for any τ ∈ (t_0, t_1] we have

[p*(τ)]^⊤ f(τ, x*(τ), θ*(τ)) ≥ [p*(τ)]^⊤ f(τ, x*(τ), s)   (2.31)

for any s ∈ Θ. By continuity this also holds for τ = t_0.


By undoing the conversion in Step 1, we can go back to a general Bolza problem by sending p* → (p^0, p*). In particular, observe that ṗ^0(t) = 0 and p^0(t_1) = −1, hence p^0(t) ≡ −1. We then get from the optimality condition (2.31) that

p*(τ)^⊤ f(τ, x*(τ), θ*(τ)) − L(τ, x*(τ), θ*(τ)) ≥ p*(τ)^⊤ f(τ, x*(τ), s) − L(τ, x*(τ), s),   (2.32)

i.e. H(τ, x*(τ), p*(τ), θ*(τ)) ≥ H(τ, x*(τ), p*(τ), s), where p* satisfies the adjoint equation

ṗ*(t) = −∇_x H(t, x*(t), p*(t), θ*(t)),   p*(t_1) = −∇Φ(x*(t_1)).   (2.33)

Step 6: Extending to Measurable Controls. The last step is of purely technical interest: we relax the assumption that t ↦ θ*(t) is continuous. By the Lebesgue differentiation theorem, we have, for almost every τ ∈ (t_0, t_1),

lim_{ε→0+} (1/ε) ∫_{τ−ε}^{τ+ε} | f(t, x*(t), θ*(t)) − f(τ, x*(τ), θ*(τ)) | dt = 0,   (2.34)

that is, the measurable function t ↦ f(t, x*(t), θ*(t)) is quasi-continuous. Hence, proof steps 1-5 proceed exactly as before, except that τ is required to be a Lebesgue point; the solutions of the state and adjoint equations are now only absolutely continuous, and the maximization condition (2.32) now only holds at Lebesgue points, which is almost every t ∈ [t_0, t_1]. This concludes the proof of the maximum principle. □

Let us make some remarks on the maximum principle.


• The equation (2.18) is called the state equation, and it is simply

ẋ*(t) = f(t, x*(t), θ*(t)),   (2.35)

and it describes the evolution of the state under the optimal control.
• The equation (2.19) is called the co-state equation, with p∗ being the co-state. As evidenced
in the proof of the PMP, the role of the co-state equation is to propagate back the optimality
condition and is the adjoint of the variational equation. In fact, one can also connect p∗
formally to a Lagrange multiplier enforcing the constraint of the ODE. However, this
approach can only derive the weaker optimality condition that H is stationary at the
optimum.
• The maximization condition (2.20) is the heart of the maximum principle. It says that
an optimal control must globally maximize the Hamiltonian. One can regard this as


a nontrivial generalization of the Euler-Lagrange equations to handle strong extrema (see [BP07], Theorem 6.5.2), as well as a generalization of the KKT conditions to non-smooth settings.

2.2.2 Other Forms of the Maximum Principle

The reason why we called Theorem 2.7 a maximum principle is to emphasize that it is not just one result, but a class of results of a similar nature. Indeed, there are many variants of the maximum principle, and we state one of them below, for a fixed-end-point variant of the Bolza problem (the variation from (2.16) is the added end-point constraint x(t_1) = x_1):

inf_θ J[θ] = ∫_{t_0}^{t_1} L(t, x(t), θ(t)) dt + Φ(x(t_1))
subject to ẋ(t) = f(t, x(t), θ(t)),   t ∈ [t_0, t_1],   x(t_0) = x_0,   x(t_1) = x_1.   (2.36)

In this case, the maximum principle now reads

ẋ*(t) = ∇_p H(t, x*(t), p*(t), θ*(t)),   x*(t_0) = x_0,   x*(t_1) = x_1,   (2.37)
ṗ*(t) = −∇_x H(t, x*(t), p*(t), θ*(t)),   (2.38)
H(t, x*(t), p*(t), θ*(t)) ≥ H(t, x*(t), p*(t), θ)   ∀θ ∈ Θ and a.e. t ∈ [t_0, t_1],   (2.39)

where the transversality condition p*(t_1) = −∇_x Φ(x*(t_1)) is no longer imposed; it is replaced by the prescribed end point x*(t_1) = x_1.

Example 2.8: Piece-wise C^1 Minimizer Revisited

Let us consider the problem in Example 2.5; we now show that the piecewise C^1 minimizer satisfies the PMP (2.37). Notice that we can convert the problem into a fixed-end-point problem

min_θ ∫_{−1}^{1} x(t)^2 (θ(t) − 1)^2 dt
subject to ẋ(t) = θ(t),   t ∈ [−1, 1],   x(−1) = 0,   x(1) = 1.   (2.40)

That is, f(t, x, θ) = θ and the running cost is L(t, x, θ) = x^2 (θ − 1)^2, so that H(t, x, p, θ) = pθ − x^2 (1 − θ)^2. Writing out the PMP equations for an optimal θ*, we get

ẋ*(t) = θ*(t),   x*(−1) = 0,   x*(1) = 1,   (2.41)
ṗ*(t) = 2 x*(t) (1 − θ*(t))^2,   (2.42)
θ*(t) ∈ arg max_{θ∈R} { p*(t) θ − [x*(t)]^2 (1 − θ)^2 }.   (2.43)


One can then check that the control

θ*(t) = 0 for −1 ≤ t < 0,   θ*(t) = 1 for 0 ≤ t ≤ 1,   (2.44)

satisfies the PMP above, with x*(t) given by (2.9) and p*(t) = 0.

Example 2.9: Driving a Car

Suppose we are driving a car on a straight road for t ∈ [0, T]. Let x(t) denote the position of the car at time t. We suppose that we are initially at rest at the origin, and we want to drive forwards on the road. We have control over an accelerator, which we can use to accelerate or brake, but acceleration costs fuel. The problem statement is: supposing we want to drive far yet save fuel, how should we drive?
This problem can be formulated as a Bolza problem with fixed end time and free end point (2.16) as follows:

inf_θ J[θ] = ∫_0^T (1/2) max(0, θ(t))^2 dt − x(T)
subject to ẋ(t) = v(t), x(0) = 0,   v̇(t) = θ(t), v(0) = 0,   θ(t) ∈ [−1, 1] for all t.   (2.45)

Here, the fuel cost is related to the acceleration by (1/2) max(0, θ)^2 (braking spends no fuel).
Let us now apply the PMP (Theorem 2.7) to derive a solution. In this case, the Hamiltonian is

H(t, x, v, p_x, p_v, θ) = p_x v + p_v θ − (1/2) max(0, θ)^2.   (2.46)

Thus, the PMP equations are

ẋ*(t) = v*(t),   x*(0) = 0,   (2.47)
v̇*(t) = θ*(t),   v*(0) = 0,   (2.48)
ṗ_x*(t) = 0,   p_x*(T) = 1,   (2.49)
ṗ_v*(t) = −p_x*(t),   p_v*(T) = 0,   (2.50)

and hence p_x*(t) = 1, p_v*(t) = T − t. Therefore, the optimal control is found by maximizing


the Hamiltonian:

θ*(t) ∈ arg max_{θ∈[−1,1]} H(t, x*(t), v*(t), p_x*(t), p_v*(t), θ)
      = arg max_{θ∈[−1,1]} { v*(t) + (T − t) θ − (1/2) max(0, θ)^2 }
      = min(T − t, 1).   (2.51)

Thus, we should drive at maximum acceleration at first, and then ease off on the accelerator linearly.
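To see the optimal policy of Example 2.9 in action, here is a minimal simulation sketch (Python/NumPy; the horizon T and the step size are arbitrary choices not fixed by the example). It integrates the state equations forward under θ*(t) = min(T − t, 1) with forward Euler, evaluates the cost J, and compares against driving at full throttle throughout.

import numpy as np

T, n = 3.0, 3000
dt = T / n
t = np.linspace(0.0, T, n + 1)

x, v, running_cost = 0.0, 0.0, 0.0
for k in range(n):
    theta = min(T - t[k], 1.0)            # optimal control from (2.51)
    running_cost += 0.5 * max(0.0, theta) ** 2 * dt
    x += v * dt                           # xdot = v
    v += theta * dt                       # vdot = theta
print(f"optimal:       x(T) = {x:.3f}, J = {running_cost - x:.3f}")

# For comparison, full throttle theta = 1 throughout gives a larger (worse) cost.
x, v, rc = 0.0, 0.0, 0.0
for k in range(n):
    rc += 0.5 * dt
    x += v * dt
    v += dt
print(f"full throttle: x(T) = {x:.3f}, J = {rc - x:.3f}")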

Exercise 2.10: Driving a Better Car

As an extension of Example 2.9, we can consider the following scenario: the car has been
upgraded so that the fuel cost now scales linearly with acceleration, i.e. the running cost
is now max(0, θ ) instead of max(0, θ )2 . What is the optimal way to drive in this case?

2.2.3 Further Reading

Besides the basic fixed end time setting considered in the previous part, other variants of the
PMP can be derived for different scenarios, including: variable end time, general set constraints
on initial and final states. The proofs of these results are more involved than what is proposed
above, requiring some machinery from functional analysis. For the purpose of the application
cases in these notes, the previous formulation is enough. However, the interested reader is
encouraged to consult optimal control references for various generalizations, or proofs under
weaker assumptions e.g. [Lib12, BP07].

2.3 Hamilton-Jacobi-Bellman Equations

As a key alternative to the maximum principle, we now discuss another line of work that
establishes necessary and sufficient conditions for optimality for optimal control problems. This
presents another approach to optimal control theory that is important in its own right, as it
depends on the very general idea of dynamic programming [Bel66].

2.3.1 Motivating Example of Dynamic Programming


Example 2.11: A Toy Maze

Consider the following maze where we want to get to the orange circle while maximizing
the reward obtained along the way. When we cross each arrow, we gain a reward equal


to the number attached to that arrow. The red path shows an example path with a final
reward of 4.

Suppose that there are N circles to choose from per step and T steps in total. Then, the total number of paths is N^T, which grows exponentially with T. This is known as the curse of dimensionality.

Instead of a brute force search over all paths, we can use the principle of dynamic programming
to find a solution much more efficiently. To do this, let us introduce some notation. We will
index each time step in the maze by t = 0, 1, ..., T. Also, we denote by S_t the circle we step on at the t-th step, and by R_t the reward we obtain at the t-th step.
Define the function

V(t, x) = max { Σ_{s=t+1}^{T} R_s : S_t = x }.   (2.52)

In other words, V(t, x) is the best possible reward we can get starting from state x at time t. Then, we can work backwards easily!
Let us consider the case in Example 2.11, where S_t = 1 or 2 for t = 1, 2, 3. Here, S_t = 1 denotes the top circle and S_t = 2 the bottom circle. The initial state is S_0 = 0. Then, clearly we have

V(3, 1) = +3,   V(3, 2) = −3,   (2.53)

since in both cases we only have one choice – and this is the best we can do. Now, let us consider t = 2. Given that we are at S_2 = 1, there are two choices: either we go to S_3 = 1 or to S_3 = 2. If we go to S_3 = 1 we get a reward of −1, and then the best we can do from there is V(3, 1) = +3. Similarly, if we take S_3 = 2 then we get a reward of +4, and the best we can do from S_3 = 2 is V(3, 2) = −3. Hence,

V(2, 1) = max{−1 + V(3, 1), +4 + V(3, 2)} = +2.   (2.54)


A similar calculation shows that V(2, 2) = +1. Once we know these values we can then compute V(1, ·), and so on. This allows us to work backwards to obtain V(0, 0) = +6. This is the best possible reward we can get, and we have obtained it without resorting to a brute force search over all the paths! Moreover, once we have solved for V(t, x) for all t, x, we can also easily find the optimal policy to navigate this maze: we simply proceed greedily with respect to the value function, i.e. at time t we always go to the circle in the next step with the highest value of the immediate reward plus V(t + 1, ·).
In fact, the above methodology is known as dynamic programming [Bel66]. Let us compare the computational complexity of dynamic programming with that of a brute force search, which takes N^T steps. In dynamic programming, we simply have to traverse the time steps once, starting from the end. For each time step, we have to compute N values of V(t, x), each of which requires comparing N candidate values built from V(t + 1, ·). Hence, for each time step we incur a computational cost of N^2. Therefore, the entire dynamic programming procedure solves the problem in N^2 T steps. This is much less than N^T!
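The backward recursion used above can be written generically in a few lines. The sketch below (Python/NumPy; the reward table is invented for illustration and is not the maze from the figure) computes V(t, x) for a finite-horizon problem with N states and T steps by working backwards, then extracts a greedy optimal path whose total reward equals V(0, 0).

import numpy as np

# reward[t][i, j] = reward for moving from state i at step t to state j at step t+1.
# These numbers are made up for illustration; any (T, N, N) table works.
rng = np.random.default_rng(0)
N, T = 2, 3
reward = rng.integers(-3, 5, size=(T, N, N)).astype(float)

V = np.zeros((T + 1, N))             # V[T, x] = 0: no reward after the last step
policy = np.zeros((T, N), dtype=int)
for t in range(T - 1, -1, -1):       # backward recursion (dynamic programming)
    for x in range(N):
        q = reward[t, x] + V[t + 1]  # value of each of the N possible moves
        policy[t, x] = int(np.argmax(q))
        V[t, x] = q.max()

# Greedy rollout from state 0: the collected reward equals V[0, 0] by construction.
x, total = 0, 0.0
for t in range(T):
    nxt = policy[t, x]
    total += reward[t, x, nxt]
    x = nxt
print(V[0, 0], total)

The two nested loops make the N^2 T cost of the procedure explicit.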
The key idea behind dynamic programming is defining the so-called cost-to-go V(t, x) in (2.52), which allows us to derive a recursion in V(t, x) that yields a solution to our original problem. The function V(t, x) is also known as the value function, emphasizing the fact that it represents the "value" of a given state. This understanding motivates the alternative approach to optimal control that we present next.

2.3.2 The Dynamic Programming Principle

Now, let us state and prove the dynamic programming principle as applied to optimal control
problems. We recall the Bolza problem with fixed end time:
inf_θ J[θ] = ∫_{t_0}^{t_1} L(t, x(t), θ(t)) dt + Φ(x(t_1))
subject to ẋ(t) = f(t, x(t), θ(t)),   t ∈ [t_0, t_1],   x(t_0) = x_0.   (2.55)

Following the idea of dynamic programming, we embed this problem in a bigger class of
problems:
V(s, z) := inf_θ ∫_s^{t_1} L(t, x(t), θ(t)) dt + Φ(x(t_1))
subject to ẋ(t) = f(t, x(t), θ(t)),   t ∈ [s, t_1],   x(s) = z.   (2.56)

The function V : [t_0, t_1] × R^d → R is called the value function. In words, it is the minimum cost attainable starting from the initial condition z at time s. Observe that V(t_0, x_0) is the optimal cost of (2.55).


It may appear that we have made the problem more difficult, since we are now considering a much larger class of optimal control problems. However, it turns out that we can derive a recursion on V in terms of a partial differential equation, thereby arriving at an elegant characterization of optimal controls.
Now, let us state and prove the dynamic programming principle concerning the value function
for the optimal control problem.

Theorem 2.12: Dynamic Programming Principle

For every τ, s ∈ [t_0, t_1] with s ≤ τ, and every z ∈ R^d, we have

V(s, z) = inf_θ { ∫_s^τ L(t, x(t), θ(t)) dt + V(τ, x(τ)) },   (2.57)

where on the right hand side x solves ẋ(t) = f(t, x(t), θ(t)) on [s, τ] with x(s) = z.

The meaning of the dynamic programming principle is that the optimization problem defining V(s, z) can be split into two parts:
• First, solve the optimization problem on [τ, t_1] with the usual running cost L and terminal cost Φ, but for all initial conditions. This gives us the value function V(τ, ·).
• Next, solve the optimization problem on [s, τ] with running cost L and terminal cost V(τ, ·) given by the previous step.

Proof 2.12: Dynamic Programming Principle

Let us denote the right hand side of (2.57) by J^τ. We first show that J^τ ≤ V(s, z). We fix ε > 0 and choose a control θ : [s, t_1] → Θ such that

J[θ] ≤ V(s, z) + ε.   (2.58)

Such a θ always exists since V(s, z) is defined as the infimum of such J[θ]. Under this control, we have, again by the definition of the value function,

V(τ, x(τ)) ≤ ∫_τ^{t_1} L(t, x(t), θ(t)) dt + Φ(x(t_1)).   (2.59)


Then, we have

J^τ ≤ ∫_s^τ L(t, x(t), θ(t)) dt + V(τ, x(τ))   (2.60)
    ≤ ∫_s^{t_1} L(t, x(t), θ(t)) dt + Φ(x(t_1))   (2.61)
    = J[θ] ≤ V(s, z) + ε.   (2.62)

Since ε > 0 is arbitrary, we have J^τ ≤ V(s, z).

Next, we show the reverse inequality. Fix ε > 0. Then, there exists a control θ_1 : [s, τ] → Θ such that

∫_s^τ L(t, x(t), θ_1(t)) dt + V(τ, x(τ)) ≤ J^τ + ε.   (2.63)

Similarly, there exists a control θ_2 : [τ, t_1] → Θ such that

∫_τ^{t_1} L(t, x(t), θ_2(t)) dt + Φ(x(t_1)) ≤ V(τ, x(τ)) + ε.   (2.64)

This allows us to concatenate the two controls together to define

θ(t) = θ_1(t) for t ∈ [s, τ],   θ(t) = θ_2(t) for t ∈ (τ, t_1].   (2.65)

Then, combining (2.63) and (2.64), we have

V(s, z) ≤ J[θ] ≤ J^τ + 2ε,   (2.66)

and since ε > 0 is arbitrary, we obtain the desired result. □

2.3.3 Hamilton-Jacobi-Bellman Equations

In this section, we will derive the key result from the dynamic programming approach to optimal
control problems, which establishes connections with partial differential equations, in particular
the Hamilton-Jacobi equations. As defining the right sort of solutions for these equations turns
out to be a slightly involved problem, we will proceed mostly formally in this section, but we
will discuss at the end the key ideas in making these steps rigorous.
The basic motivation here is to derive an infinitesimal version of the dynamic programming principle (Theorem 2.12). To this end, we will make extensive use of Taylor expansions, assuming that τ = s + ∆s with ∆s ≪ 1 in Eq. (2.57). This gives the infinitesimal dynamic programming principle

V(s, z) = inf_θ { ∫_s^{s+∆s} L(t, x(t), θ(t)) dt + V(s + ∆s, x(s + ∆s)) },   (2.67)

where again on the right hand side x follows the ODE

ẋ(t) = f(t, x(t), θ(t)),   t ∈ [s, s + ∆s],   x(s) = z.   (2.68)

Applying a Taylor expansion to the ODE, we have

x(s + ∆s) = z + ∫_s^{s+∆s} f(t, x(t), θ(t)) dt = z + f(s, z, θ(s)) ∆s + o(∆s).   (2.69)

Furthermore, assuming that V is sufficiently regular, we have

V(s + ∆s, x(s + ∆s)) = V(s, z) + ∂_s V(s, z) ∆s + [∇_z V(s, z)]^⊤ f(s, z, θ(s)) ∆s + o(∆s).   (2.70)

Similarly, we can also expand the running cost:

∫_s^{s+∆s} L(t, x(t), θ(t)) dt = L(s, z, θ(s)) ∆s + o(∆s).   (2.71)

Combining (2.67), (2.70) and (2.71), we have

V(s, z) = inf_θ { L(s, z, θ(s)) ∆s + V(s, z) + ∂_s V(s, z) ∆s + [∇_z V(s, z)]^⊤ f(s, z, θ(s)) ∆s + o(∆s) }.   (2.72)

Cancelling the term V(s, z) on both sides and taking the limit ∆s → 0, the infimum over control paths θ on [s, s + ∆s] becomes an infimum over a single value θ = θ(s) ∈ Θ, and thus we obtain

∂_s V(s, z) + inf_{θ∈Θ} { L(s, z, θ) + [∇_z V(s, z)]^⊤ f(s, z, θ) } = 0.   (2.73)

This is known as the Hamilton-Jacobi-Bellman (HJB) equation for the value function. It remains
to specify the boundary conditions. One can quickly observe that at time s = t 1 , we in fact have
by definition, V (t 1 , z) = Φ(z).
Now, we note that the derivations above are purely formal for at least two reasons:
• We do not know if V (s, z) is sufficiently regular to admit Taylor expansions.
• We do not know if the partial differential equation (2.73) is well-posed, i.e. whether it
admits a unique solution, and in what sense should a solution be defined.
This is a common difficulty faced by many nonlinear partial differential equations. In this case,
the Hamilton-Jacobi structure allows one to use the concept of viscosity solutions [CL83] as
an appropriate notion of solution. Loosely speaking, viscosity solutions are a class of weak
solutions to nonlinear PDEs defined by being some sense of an extremum of a sequence of
smooth functions that satisfy an inequality corresponding to the PDE. One can also see them


as limits of solutions of the original PDE regularized with a diffusive term (hence the term
“viscosity”). For more information on viscosity solutions, the reader is referred to [FS06]. With
the notion of viscosity solutions, we in fact can put the HJB equations on a rigorous footing. Let
us now state the main theorem in this section, whose proof we omit (but see [BP07], Theorem
8.7.1). For convenience we will replace (s, z) by (t, x) in the following.

Theorem 2.13: Hamilton-Jacobi-Bellman Equation

Let V : [t_0, t_1] × R^d → R be the value function defined by (2.56). Then, V is the unique viscosity solution of the Hamilton-Jacobi-Bellman equation

∂_t V(t, x) + inf_{θ∈Θ} { L(t, x, θ) + [∇_x V(t, x)]^⊤ f(t, x, θ) } = 0,   (t, x) ∈ (t_0, t_1) × R^d,
V(t_1, x) = Φ(x).   (2.74)

2.3.4 Implications for Optimal Control

Recall that we have the correspondence

V(t_0, x_0) = inf_θ J[θ],   (2.75)

hence the solution of the HJB equation will give us the optimal cost that we can obtain for the Bolza problem. In fact, we will see that it gives us much more.

A Necessary Condition. It should be clear from our discussions so far that what we have formally derived is that the HJB equation constitutes a necessary condition for global optimality. Indeed, suppose we have a family of optimal controls {θ*_{s,z} : s ∈ [t_0, t_1], z ∈ R^d} and define

V̂(s, z) = Φ(x*_{s,z}(t_1)) + ∫_s^{t_1} L(t, x*_{s,z}(t), θ*_{s,z}(t)) dt,
where ẋ*_{s,z}(t) = f(t, x*_{s,z}(t), θ*_{s,z}(t)),   t ∈ [s, t_1],   x*_{s,z}(s) = z.   (2.76)

Then, by Theorem 2.13, V̂ ≡ V satisfies the HJB equation.
In fact, let us fix s, τ ∈ [t_0, t_1) and z ∈ R^d. By the assumption of global optimality we can rewrite the dynamic programming principle (2.57) as

V(s, z) = inf_θ { ∫_s^τ L(t, x(t), θ(t)) dt + V(τ, x(τ)) } = ∫_s^τ L(t, x*_{s,z}(t), θ*_{s,z}(t)) dt + V(τ, x*_{s,z}(τ)).   (2.77)

We may now proceed as before, using Taylor expansions to derive an infinitesimal version of the above. Let us call θ* = θ*_{t_0, x_0} the optimal control for our original problem, and x* its


corresponding controlled state trajectory. Then, Taylor expanding and comparing with the
usual dynamic programming principle we obtain the equality

−∂_t V(t, x*(t)) = min_{θ∈Θ} { L(t, x*(t), θ) + [∇_x V(t, x*(t))]^⊤ f(t, x*(t), θ) }
                = L(t, x*(t), θ*(t)) + [∇_x V(t, x*(t))]^⊤ f(t, x*(t), θ*(t)),   (2.78)

which we can rewrite as

H(t, x*(t), −∇_x V(t, x*(t)), θ*(t)) = max_{θ∈Θ} H(t, x*(t), −∇_x V(t, x*(t)), θ),   (2.79)

where the Hamiltonian is defined exactly as in the case of the PMP (2.17):

H(t, x, p, θ) = p^⊤ f(t, x, θ) − L(t, x, θ).   (2.80)

Thus, this is similar to the statement of the PMP, except that the co-state p ∗ (t) is now replaced
by −∇x V (t, x ∗ (t)). However, there is a nontrivial difference in that now, this is also a sufficient
condition for global optimality, as we now show.

A Sufficient Condition. Let us now assume that a continuously differentiable function V satisfies the HJB equation (2.74), and moreover that a control θ̂ : [t_0, t_1] → Θ satisfies

H(t, x̂(t), −∇_x V(t, x̂(t)), θ̂(t)) = max_{θ∈Θ} H(t, x̂(t), −∇_x V(t, x̂(t)), θ)   (2.81)

for all t ∈ [t_0, t_1], where x̂ is the state process corresponding to the control θ̂. Then θ̂ is a globally optimal control that solves (2.55) with optimal cost V(t_0, x_0).
To show this, observe that if we set x = x̂(t) in the HJB equation for V, then, noting the condition (2.81), we have

∂_t V(t, x̂(t)) + [∇_x V(t, x̂(t))]^⊤ f(t, x̂(t), θ̂(t)) + L(t, x̂(t), θ̂(t)) = 0,   (2.82)

which means

(d/dt) V(t, x̂(t)) + L(t, x̂(t), θ̂(t)) = 0.   (2.83)

Integrating from t_0 to t_1 and using the boundary condition V(t_1, x) = Φ(x), we have

V(t_0, x_0) = ∫_{t_0}^{t_1} L(t, x̂(t), θ̂(t)) dt + Φ(x̂(t_1)) = J[θ̂].   (2.84)

On the other hand, if θ is any other control whose trajectory is x, we would have

∂_t V(t, x(t)) + [∇_x V(t, x(t))]^⊤ f(t, x(t), θ(t)) + L(t, x(t), θ(t)) ≥ 0,   (2.85)


which yields

0 ≤ ∫_{t_0}^{t_1} L(t, x(t), θ(t)) dt + V(t_1, x(t_1)) − V(t_0, x_0) = J[θ] − V(t_0, x_0),   (2.86)

since V(t_1, x(t_1)) = Φ(x(t_1)), or

J[θ̂] = V(t_0, x_0) ≤ J[θ].   (2.87)

This shows that θ̂ is globally optimal, with cost V(t_0, x_0).

Example 2.14: Nondifferentiable Value Function ([Lib12], Example 5.2.1)

Consider the scalar control system

ẋ(t) = x(t) θ(t),   t ∈ [0, T],   x(0) = x_0 ∈ R,   θ(t) ∈ Θ ≡ [−1, 1].   (2.88)

We set running cost L ≡ 0 and terminal cost Φ(x) = x. The optimal control is just −sign(x_0) if x_0 ≠ 0, and if x_0 = 0 the cost is always 0. Hence, the value function is simply

V(t, x) = e^{−(T−t)} x if x > 0,   V(t, x) = e^{T−t} x if x < 0,   V(t, 0) = 0.   (2.89)

Observe that it is not differentiable at x = 0.
Let us now check that the value function satisfies the HJB equation, which is now

∂_t V(t, x) − |x ∂_x V(t, x)| = 0,   V(T, x) = x.   (2.90)

Clearly, this is the case. In fact, we can derive the value function from the HJB by applying the method of characteristics (see [Eva98], Ch. 3).
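As a quick numerical check of Example 2.14 (a sketch in Python/NumPy; the grid and horizon are arbitrary choices), one can evaluate the residual of the HJB equation (2.90) for the claimed value function on a grid of points away from x = 0, where V is smooth.

import numpy as np

T = 1.0

def V(t, x):
    # Value function (2.89) for the scalar system xdot = x * theta, Phi(x) = x.
    return np.where(x > 0, np.exp(-(T - t)) * x,
           np.where(x < 0, np.exp(T - t) * x, 0.0))

# Finite-difference residual of (2.90): dV/dt - |x dV/dx|, on a grid with x != 0.
t = np.linspace(0.1, 0.9, 9)[:, None]
x = np.concatenate([np.linspace(-2, -0.1, 20), np.linspace(0.1, 2, 20)])[None, :]
h = 1e-5
Vt = (V(t + h, x) - V(t - h, x)) / (2 * h)
Vx = (V(t, x + h) - V(t, x - h)) / (2 * h)
residual = Vt - np.abs(x * Vx)
print(np.abs(residual).max())  # close to zero (finite-difference error only)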

Remark. We end this section with a remark on the HJB solution. Recall that we can write the optimal control as

θ*(t) = u(t, x*(t)) := arg min_{θ∈Θ} { L(t, x*(t), θ) + [∇_x V(t, x*(t))]^⊤ f(t, x*(t), θ) }.   (2.91)
In other words, provided we can solve the HJB, the optimal control solution is of feed-back or
closed-loop form, meaning that it tells how to steer the system by just observing the state trajectory
x∗ . We can contrast with the PMP, where we obtain open-loop controls that are pre-computed (since
it also depends on the co-state) and cannot be applied on-the-fly. This is an important distinction.


2.3.5 Further Reading

The principle of optimality has been referenced in different manners throughout the development
of calculus of variations, dating back to the solution of the Brachistochrone problem of Jacob
Bernoulli in 1697. The building of the Hamilton-Jacobi-Bellman theory for optimal control
rests on important works of Carathéodory, Bellman and Kalman in the early 1900s. The theory was first put on a rigorous footing via the introduction of viscosity solutions by Crandall and
Lions [CL83]. See also [FS06] for a general exposition of viscosity solutions. Here we also
omitted the interesting topic of how the HJB and the PMP are related. In fact, they can be
related via the method of characteristics ([Eva98] Ch. 3): the PMP equations can be interpreted,
at least formally, as characteristic equations associated with the HJB. See [Lib12], Ch. 5.2.

2.4 Stochastic Control

So far, optimal control problems have been analyzed in the deterministic (ODE) setting. This follows as a natural development from classical calculus of variations. Recently, many applications require the control of a noisy process; examples include the control of robots in uncertain environments, the optimal execution of trading strategies, etc.
In this section, we will briefly introduce the stochastic variant of the theory of optimal control. For mathematical simplicity, we will only introduce the Hamilton-Jacobi-Bellman approach. The Pontryagin approach turns out to be rather involved for stochastic processes, and requires some theory of backward stochastic differential equations, which is beyond the scope of these notes. The interested reader may consult standard references, e.g. [YZ99], on this topic.
We assume a basic familiarity with stochastic differential equations. Unfamiliar readers may
refer to textbooks e.g. [Øks03]. Note that we will not use any advanced techniques beyond Itô’s
formula.

2.4.1 Control of Stochastic Differential Equations

We consider the following Itô stochastic differential equation, also known as a diffusion process:

dX(t) = f(t, X(t), θ(t)) dt + σ(t, X(t), θ(t)) dW(t),   X(0) = x_0,   t ∈ [0, T].   (2.92)

Here, X(t) ∈ R^d is the stochastic process, and we use a capital letter to highlight its stochastic nature, as is conventional. The process W(t) is the standard Wiener process, or Brownian motion, in R^p. The matrix-valued function σ : [0, T] × R^d × R^m → R^{d×p} is called the diffusion matrix. In some applications, f is called the drift and σ is called the volatility. We hereafter assume they are uniformly Lipschitz in the state argument to guarantee the existence of strong solutions to (2.92) (see [Øks03]). The initial condition x_0 ∈ R^d can be deterministic or random.
Next, we specify the cost functionals. Similar to the deterministic counterpart, we consider a
terminal cost Φ and a running cost L. The only difference now is that the cost function should be


defined in an averaged sense, so we simply add an expectation. Thus, we obtain the stochastic version of the Bolza problem (2.15):

inf_{θ ∈ A_{0,T}} J[θ] = E[ ∫_0^T L(t, X(t), θ(t)) dt + Φ(X(T)) ]
subject to dX(t) = f(t, X(t), θ(t)) dt + σ(t, X(t), θ(t)) dW(t),   t ∈ [0, T],   X(0) = x_0.   (2.93)

Here, the expectation is taken over the Wiener process, and possibly over the initial condition. The control set A_{0,T} is a subset of the W-adapted processes, meaning that the controls cannot look into the future of the Wiener process. Sometimes, A_{0,T} is called the admissible set of the control problem. Note that this control problem generalizes the classical control problem of Bolza, and reduces to it if we take σ = 0.
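To make the stochastic problem (2.93) concrete, here is a minimal Monte Carlo sketch (Python/NumPy). The drift, volatility, costs and the constant control below are all invented for illustration, and the time discretization is the simple Euler-Maruyama scheme, which the notes do not cover; the point is only to show how J[θ] is estimated by averaging over simulated Wiener paths for a fixed admissible control.

import numpy as np

rng = np.random.default_rng(1)
T, n, n_paths = 1.0, 200, 5000
dt = T / n

# Illustrative scalar problem: dX = theta dt + 0.3 dW, L = X^2 + theta^2, Phi = X^2.
def f(t, x, theta): return theta
def sigma(t, x, theta): return 0.3
def L(t, x, theta): return x**2 + theta**2
def Phi(x): return x**2

theta = -0.5            # a fixed constant (hence adapted) control, for illustration
x = np.zeros(n_paths)   # X(0) = 0 for every path
cost = np.zeros(n_paths)
for k in range(n):
    t = k * dt
    cost += L(t, x, theta) * dt
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    x += f(t, x, theta) * dt + sigma(t, x, theta) * dW   # Euler-Maruyama step
cost += Phi(x)
print(f"J[theta] ~ {cost.mean():.3f} +/- {cost.std() / np.sqrt(n_paths):.3f}")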

2.4.2 The Stochastic Dynamic Programming Principle

The procedure is almost identical to the deterministic case. We define the value function

V(s, z) := inf_{θ ∈ A_{s,T}} E_{s,z}[ ∫_s^T L(t, X(t), θ(t)) dt + Φ(X(T)) ]
subject to dX(t) = f(t, X(t), θ(t)) dt + σ(t, X(t), θ(t)) dW(t),   t ∈ [s, T],   X(s) = z.   (2.94)

The expectation E_{s,z} denotes conditional expectation given X(s) = z.
The first main result is the stochastic version of the dynamic programming principle.

Theorem 2.15: Stochastic Dynamic Programming Principle

For every τ, s ∈ [0, T] with s ≤ τ, where τ is a stopping time, and every z ∈ R^d, we have

V(s, z) = inf_θ E_{s,z}[ ∫_s^τ L(t, X(t), θ(t)) dt + V(τ, X(τ)) ].   (2.95)

(A random variable τ is a stopping time if {τ ≤ t} is measurable with respect to the Brownian filtration up to time t.)

The proof is identical to that of Thm. 2.12, since one can observe that whether the dynamics is
an ODE or an SDE is not used in the derivation, which only requires some arguments based on
optimality. The stopping time criterion is required as a technical condition so that the combined
control (analogue of Eq. (2.65)) is adapted to the Wiener process. The reader should prove
Thm. 2.15 as an exercise.


2.4.3 Stochastic Hamilton-Jacobi-Bellman Equation

As in the deterministic case, the dynamic programming principle can be turned into a more instructive form by considering infinitesimal perturbations, i.e. τ ≈ s + ∆s. In the deterministic case, we obtained the HJB PDE; we now show that in the stochastic case we obtain a very similar PDE. To avoid mathematical technicalities, we will proceed formally, simply assuming that we have the required smoothness to perform Taylor expansions using Itô's formula.
Let us recall Itô's formula.

Theorem 2.16: Itô’s Formula

Consider the stochastic process in Rd

dX (t) = f (t, X (t))dt + σ (t, X (t))dW (t). (2.96)

Let F : R × Rd → R be twice differentiable, then

    dF = [ ∂F/∂t + (∇_x F)^⊤ f + ½ Tr( σ^⊤ (∇²_x F) σ ) ] dt + (∇_x F)^⊤ σ dW,        (2.97)
where all functions are evaluated at (t, X (t)).


One can write the above in integral form
    F(τ, X(τ)) − F(s, X(s)) = ∫_s^τ [ ∂F/∂t + (∇_x F)^⊤ f + ½ Tr( σ^⊤ (∇²_x F) σ ) ] dt + ∫_s^τ (∇_x F)^⊤ σ dW.        (2.98)
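As a quick sanity check of (2.97)-(2.98), one can verify Itô’s formula by Monte Carlo simulation in a simple special case. The sketch below assumes f = 0, a constant scalar σ, and F(t, x) = x²; then (2.98) together with the zero-mean property of the dW-integral gives E[F(T, X(T))] = x_0² + σ²T. All parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
x0, sigma, T = 1.0, 0.5, 2.0
n_paths, n_steps = 100_000, 200
dt = T / n_steps

X = np.full(n_paths, x0)
for _ in range(n_steps):
    X += sigma * np.sqrt(dt) * rng.standard_normal(n_paths)  # dX = sigma dW

print(np.mean(X**2))          # Monte Carlo estimate of E[X(T)^2]
print(x0**2 + sigma**2 * T)   # prediction from Ito's formula: 1.5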

We start with the stochastic dynamic programming principle with τ = s + ∆s


    V(s, z) = inf_θ E_{s,z}[ ∫_s^{s+∆s} L(t, X(t), θ(t)) dt + V(s + ∆s, X(s + ∆s)) ].        (2.99)

Applying Itô’s formula (Thm. 2.16) to V, we get


    V(s + ∆s, X(s + ∆s)) = V(s, X(s)) + ∫_s^{s+∆s} [ ∂_t V + f^⊤ ∇_x V + ½ Tr( σ^⊤ (∇²_x V) σ ) ] dt
                                       + ∫_s^{s+∆s} (∇_x V)^⊤ σ dW(t).        (2.100)

Note that the last term is a martingale and thus has zero conditional expectation. Substituting
this into (2.99), we get
    inf_{θ∈A_{s,s+∆s}} E_{s,z}[ ∫_s^{s+∆s} ( ∂_t V + L + f^⊤ ∇_x V + ½ Tr( σ^⊤ (∇²_x V) σ ) ) dt ] = 0.        (2.101)


Now, we can replace all the time-varying arguments (t, X(t)) in the integrand by (s, X(s)) = (s, z) and
incur only o(∆s) errors. Dividing by ∆s and taking the limit ∆s → 0, we get

    ∂_s V(s, z) + inf_θ { L(s, z, θ) + f(s, z, θ)^⊤ ∇_x V(s, z) + ½ Tr( σ(s, z, θ)^⊤ (∇²_x V(s, z)) σ(s, z, θ) ) } = 0.        (2.102)
As in the deterministic case, the terminal condition is V (T , z) = Φ(z). This is the stochastic
Hamilton-Jacobi-Bellman equation.
As before, our derivation is not rigorous, since the Taylor expansion based on Itô’s formula
requires regularity of the value function, which we typically cannot guarantee. The more
mathematically precise method to handle this issue again appeals to viscosity solutions and
comparison principles [CL83].

Theorem 2.17: Stochastic HJB Equation

The value function is the unique viscosity solution of the following Hamilton-Jacobi-Bellman
equation:

    ∂_t V(t, x) + inf_θ { L(t, x, θ) + f(t, x, θ)^⊤ ∇_x V(t, x) + ½ Tr( σ(t, x, θ)^⊤ (∇²_x V(t, x)) σ(t, x, θ) ) } = 0,
    V(T, x) = Φ(x).        (2.103)

Following the same line of argument as in the deterministic case, one can show that the infimum
in the HJB equation in fact yields a globally optimal control.

2.4.4 Further Reading

Here we only introduced the bare basics of stochastic control. For a more complete treatment of
the theory, the reader may consult standard references such as [FR12]. Backward SDEs, which
originated from attempts to generalize Pontryagin’s theory to the stochastic setting, were introduced
in [Pen90, PP90]. This development led to topics beyond stochastic control, including the theory of
nonlinear expectation; see [Pen10].


3 Numerical Methods for Optimal Control

3.1 Overview

So far, our discussion has focused on formulating necessary and sufficient conditions for optimality
for (stochastic) optimal control problems. There were two main lines of approach, namely
Pontryagin’s maximum principle and the Hamilton-Jacobi-Bellman equation. In practice,
these conditions rarely lead to explicitly solvable equations; hence, numerical methods are an
important tool for studying control problems.
This section gives a brief introduction to a few types of numerical algorithms that can be used
to solve optimal control problems.

3.2 Numerical methods based on the PMP

We begin with methods based on Pontryagin’s maximum principle. These are also known as
indirect methods, in that we solve a necessary condition for optimality, which typically involves
the integration of ODEs and small optimization problems, instead of the complete solution of a
nonlinear programming problem. The latter approach is known as a direct method.

3.2.1 The Method of Successive Approximations

The first such method is called the method of successive approximations (MSA) or the sweeping
method. The derivation of this method is very simple. Let us consider the Bolza problem on the
time interval [0,T ]. Recall that the PMP equations take the form

    ẋ*(t) = f(t, x*(t), θ*(t)),   x*(0) = x_0,        (3.1)
    ṗ*(t) = −∇_x H(t, x*(t), p*(t), θ*(t)),   p*(T) = −∇_x Φ(x*(T)),        (3.2)
    H(t, x*(t), p*(t), θ*(t)) ≥ H(t, x*(t), p*(t), θ)   ∀θ ∈ Θ and a.e. t ∈ [0,T],        (3.3)

where the Hamiltonian has the form

    H(t, x, p, θ) = p^⊤ f(t, x, θ) − L(t, x, θ).        (3.4)

Notice the following:


• Equations (3.1)–(3.3) are three equations for the three unknowns x*, p*, θ*
• If we know θ*, we can compute x* via (3.1)
• If we know θ* and x*, we can compute p* via (3.2)
• If we know x* and p*, we can compute θ* via (3.3)

Observe that this forms a loop that we can iterate. If the iteration reaches a fixed point, then we have
found a solution of the PMP equations, and the θ* thus obtained is a candidate optimal control.
This is the method of successive approximations. We summarize this algorithm in Alg. 1; a Python
sketch on a simple example follows the listing.

Algorithm 1: Method of Successive Approximations

    Initialize: θ ∈ L∞([0,T], Θ)
    while stopping criterion not reached do
        x ← solution of ẋ(t) = f(t, x(t), θ(t)), x(0) = x_0;
        p ← solution of ṗ(t) = −∇_x H(t, x(t), p(t), θ(t)), p(T) = −∇Φ(x(T));
        for t ∈ [0,T] do
            θ(t) ← arg max_{θ'∈Θ} H(t, x(t), p(t), θ')
        end
    end
    return x, p, θ
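To make the iteration concrete, here is a minimal Python sketch of Alg. 1 with forward-Euler time stepping, on an illustrative scalar linear-quadratic problem; the dynamics, costs, and parameter values are assumptions made for illustration, not part of the notes. We take f(t, x, θ) = a x + b θ, L = (x² + θ²)/2, Φ(x) = x²/2, so that H = p(ax + bθ) − (x² + θ²)/2 and the Hamiltonian maximization has the closed form θ = b p.

import numpy as np

a, b, x0, T, N = -1.0, 1.0, 1.0, 0.5, 50
dt = T / N
theta = np.zeros(N)                      # initial control guess

for it in range(50):                     # MSA iterations
    # Forward pass: x_{n+1} = x_n + dt * f(t_n, x_n, theta_n)
    x = np.empty(N + 1); x[0] = x0
    for n in range(N):
        x[n + 1] = x[n] + dt * (a * x[n] + b * theta[n])
    # Backward pass: p' = -dH/dx = x - a*p, with p(T) = -Phi'(x(T)) = -x(T)
    p = np.empty(N + 1); p[N] = -x[N]
    for n in reversed(range(N)):
        p[n] = p[n + 1] - dt * (x[n + 1] - a * p[n + 1])
    # Pointwise Hamiltonian maximization: argmax_theta H = b * p
    theta = b * p[:N]

cost = 0.5 * dt * np.sum(x[:N]**2 + theta**2) + 0.5 * x[N]**2

For this short horizon the iteration is a contraction and converges quickly; for longer horizons or other parameters it can diverge, which motivates the safeguarded variants discussed below.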

The computation of x and p relies on solving differential equations; thus, they can be obtained
with any ODE solution method, such as Euler methods, Runge-Kutta methods, or symplectic
methods, whichever is better suited to the dynamical problem at hand. The third step requires
some discussion. While it is still an optimization problem, it is a finite-dimensional one,
since we solve a separate optimization problem for each time t. This is often quite tractable,
for a variety of reasons, e.g.
1. There may be an exact solution
2. A simple sub-routine can calculate an approximate solution very quickly
3. In fact, we do not really need an exact solution to this problem
The last point stems from an error estimate that one can derive [KC62]: for any controls θ, θ′,
if we denote by x_θ the solution of the ODE (3.1) with control θ, and by p_θ the solution of the
ODE (3.2) with control θ and state x_θ, then we have the following estimate (subject to some
technical conditions)

    J[θ′] − J[θ] ≤ −∫_0^T [ H(t, x_θ(t), p_θ(t), θ′(t)) − H(t, x_θ(t), p_θ(t), θ(t)) ] dt + C ∫_0^T ‖θ′(t) − θ(t)‖² dt.        (3.5)


One can think of θ as the current MSA iterate and θ′ as the next iterate, obtained without
necessarily maximizing the Hamiltonian exactly. The bound (3.5) then tells us that as long as
1) the Hamiltonian is sufficiently increased, and 2) the parameters do not change too much,
we should see a decrease in the objective functional. In fact, the second requirement turns out
to be necessary, as the MSA is known to diverge if θ moves too much. Both requirements can be
enforced by one of two modifications (a small sketch follows the list):
1. Replacing the maximization by a steepest-ascent step:

    θ(t) ← θ(t) + η ∇_θ H(t, x(t), p(t), θ(t))        (3.6)

2. Replacing the maximization by a regularized problem:

    θ(t) ← arg max_{θ′∈Θ} { H(t, x(t), p(t), θ′) − λ‖θ′ − θ(t)‖² }        (3.7)
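As a minimal sketch (continuing the illustrative LQ example from the MSA sketch above, where ∇_θ H = b·p − θ and the regularized maximization is explicit because H is quadratic in θ), the two safeguarded updates could be implemented as:

def msa_update_steepest_ascent(theta, p, b, eta=0.1):
    # (3.6): theta <- theta + eta * dH/dtheta; for the LQ example dH/dtheta = b*p - theta.
    return theta + eta * (b * p - theta)

def msa_update_regularized(theta, p, b, lam=1.0):
    # (3.7): argmax_v { H(v) - lam * (v - theta)^2 }; explicit here since H is quadratic in v.
    return (b * p + 2.0 * lam * theta) / (1.0 + 2.0 * lam)

Either function can replace the update theta = b * p[:N] in the MSA sketch above (called with p[:N]); η and λ trade off progress per iteration against stability.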

3.2.2 Solution of Two-point Boundary Value Problem

Observe that the MSA is an iterative algorithm, and we may need to perform a large number of
iterations to find a solution. In the special case where the Hamiltonian maximization step has an
exact solution, we can simplify this procedure.
Concretely, suppose that there is a function ξ : R × Rd × Rd → Θ such that
    H(t, x, p, ξ(t, x, p)) = max_{θ∈Θ} H(t, x, p, θ).        (3.8)

Then, the PMP equations reduce to the following set of ODEs


    ẋ*(t) = f(t, x*(t), ξ(t, x*(t), p*(t))),   x*(0) = x_0,
    ṗ*(t) = −∇_x H(t, x*(t), p*(t), ξ(t, x*(t), p*(t))),   p*(T) = −∇Φ(x*(T)).        (3.9)
This is simply a pair of ODEs, but it is not a usual initial value problem: the equation for
the state has an initial condition, while the equation for the co-state has a terminal condition.
Such a system is known as a two-point boundary value problem (2PBVP), and there are a variety
of numerical methods developed for 2PBVPs. Here we outline a simple approach, the shooting
method, which guesses an initial condition for p*, then integrates the ODEs in (3.9) (now an
initial value problem) forward in time. We obtain a solution of the 2PBVP (3.9) if the terminal
co-state agrees with −∇Φ(x*(T)). Finding an initial guess with this property is a root-finding
problem, which can be solved by Newton’s method, quasi-Newton methods (e.g. L-BFGS), or
Krylov subspace methods (e.g. GMRES, conjugate gradient). The resulting algorithm is
summarized in Alg. 2; a Python sketch follows the listing.
Remark. It is well known in the solution of boundary value problems that shooting methods
become unstable as the time horizon T increases. This is because the ODEs may be ill-conditioned,
and a small change in the initial condition may produce a large change in the final state, or vice
versa. In either case, the root-finding step becomes very difficult. An effective remedy is to break
the time interval into small sub-intervals {[T_n, T_{n+1}] : n = 0, . . . , N − 1} with T_0 = 0 and
T_N = T, and to apply shooting on each sub-interval, together with matching conditions at the
interval boundaries. This is known as multiple shooting [SB13], and can be used to improve the
usual shooting method for large T.


Algorithm 2: Shooting Method for 2PBVP Formulation of PMP Equations

    Hyperparameters: RootFind (root-finding algorithm), ξ (explicit solution of the Hamiltonian maximization)
    For a ∈ R^d, define (x_a, p_a) as the solution of the IVP
        ẋ_a(t) = f(t, x_a(t), ξ(t, x_a(t), p_a(t))),   x_a(0) = x_0,
        ṗ_a(t) = −∇_x H(t, x_a(t), p_a(t), ξ(t, x_a(t), p_a(t))),   p_a(0) = a.        (3.10)
    a* ← RootFind(a ↦ p_a(T) + ∇Φ(x_a(T)))
    return θ*(·) = ξ(·, x_{a*}(·), p_{a*}(·))
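Below is a minimal Python sketch of Alg. 2 using scipy, again on the illustrative LQ problem used above (f = a x + b θ, L = (x² + θ²)/2, Φ(x) = x²/2), for which the explicit maximizer is ξ(t, x, p) = b p. The root-finder bracket and all parameter values are assumptions for illustration.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import root_scalar

a, b, x0, T = -1.0, 1.0, 1.0, 0.5

def pmp_rhs(t, y):
    x, p = y
    theta = b * p                          # xi(t, x, p): explicit Hamiltonian maximizer
    return [a * x + b * theta,             # x' = f(t, x, xi)
            x - a * p]                     # p' = -dH/dx

def residual(p0):
    # Mismatch between the terminal co-state of the "shot" trajectory and -grad Phi(x(T)).
    sol = solve_ivp(pmp_rhs, (0.0, T), [x0, p0], rtol=1e-8, atol=1e-10)
    xT, pT = sol.y[0, -1], sol.y[1, -1]
    return pT + xT                         # want p(T) = -Phi'(x(T)) = -x(T)

# Root-find over the unknown initial co-state (scalar here, so bracketing methods suffice).
p0_star = root_scalar(residual, bracket=[-10.0, 10.0], method="brentq").root
sol = solve_ivp(pmp_rhs, (0.0, T), [x0, p0_star], dense_output=True)
theta_star = lambda t: b * sol.sol(t)[1]   # candidate optimal control theta*(t) = xi(t, x*(t), p*(t))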

3.3 Nonlinear Programming

Now we briefly discuss another approach which falls under the category of direct methods.
Here, instead of solving for optimality conditions, we directly solve a discretized version of the
optimal control problem.
First, we introduce a time-discretization step size ∆t > 0. Then, we can approximate the control
space L∞([0,T], Θ) by Θ^N, with N = T/∆t the number of discretization points.
Similarly, the ODE

    ẋ(t) = f(t, x(t), θ(t))        (3.11)

is now discretized as

    (x_{n+1} − x_n)/∆t = f(n∆t, x_n, θ_n),        (3.12)

so that x_n ≈ x(n∆t).
Performing a similar discretization to the cost functional, we obtain
    min_{θ∈Θ^N}  Φ(x_N) + Σ_{n=0}^{N−1} ∆t L(n∆t, x_n, θ_n)
    subject to  x_{n+1} = x_n + ∆t f(n∆t, x_n, θ_n),   n = 0, . . . , N − 1.        (3.13)

This is a constrained optimization problem of the form

    min_z F(z)   subject to   G(z) = 0,        (3.14)

where z = (x, θ). This is a standard equality-constrained optimization problem, and can be
solved by any of the standard nonlinear programming methods [Ber97].
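As a minimal sketch of this direct approach, the discretized problem (3.13) for the illustrative LQ example used in the previous section can be handed to a generic NLP solver; here scipy's SLSQP method is assumed to be available, and the problem data are illustrative.

import numpy as np
from scipy.optimize import minimize

a, b, x0, T, N = -1.0, 1.0, 1.0, 0.5, 20
dt = T / N

def unpack(z):
    # z packs the unknowns (x_1, ..., x_N, theta_0, ..., theta_{N-1}); x_0 is fixed.
    x = np.concatenate(([x0], z[:N]))
    theta = z[N:]
    return x, theta

def F(z):
    # Discretized Bolza cost from (3.13).
    x, theta = unpack(z)
    return 0.5 * dt * np.sum(x[:N]**2 + theta**2) + 0.5 * x[N]**2

def G(z):
    # Equality constraints: the discrete dynamics x_{n+1} = x_n + dt * f(n dt, x_n, theta_n).
    x, theta = unpack(z)
    return x[1:] - x[:N] - dt * (a * x[:N] + b * theta)

res = minimize(F, np.zeros(2 * N), constraints={"type": "eq", "fun": G}, method="SLSQP")
x_opt, theta_opt = unpack(res.x)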


3.4 Numerical Methods based on the HJB

The PMP-based methods solve optimal control problems by solving coupled ODEs together with a
pointwise-in-time optimization problem. The advantage is that these methods are quite cheap to
implement, especially when the ambient dimension d is large. However, one disadvantage is
that the computed optimal control is specific to the initial condition x_0; given another initial
condition, we generally have to repeat the calculation.
Recall that the dynamic programming approach and the HJB equation avoid precisely this issue.
The control we compute from the HJB equation is in feedback form,

    θ*(t) = ξ(t, x*(t)) = arg min_{θ∈Θ} { L(t, x*(t), θ) + f(t, x*(t), θ)^⊤ ∇_x V(t, x*(t)) },        (3.15)

where V is the value function computed from the HJB equation. This control can be applied to
any state trajectory, regardless of whether it begins at x_0. Thus, the HJB equation gives a
stronger solution to the optimal control problem; this is also known as a closed-loop control. Let
us now discuss some methods for solving the HJB equations.
Recall the HJB equation

    ∂_t V(t, x) + inf_{θ∈Θ} { L(t, x, θ) + [∇_x V(t, x)]^⊤ f(t, x, θ) } = 0,   (t, x) ∈ (0,T) × R^d,
    V(T, x) = Φ(x).        (3.16)

It is customary to define the function

    H(t, x, p) = inf_{θ∈Θ} { L(t, x, θ) + p^⊤ f(t, x, θ) }.        (3.17)

Then, the above reduces to the standard Hamilton-Jacobi equation

    ∂_t V(t, x) + H(t, x, ∇_x V(t, x)) = 0,   (t, x) ∈ (0,T) × R^d,
    V(T, x) = Φ(x).        (3.18)

In fact, we can also consider the stochastic control problem, where we have

    ∂_t V(t, x) + H(t, x, ∇_x V(t, x), ∇²_x V(t, x)) = 0,   (t, x) ∈ (0,T) × R^d,
    V(T, x) = Φ(x),        (3.19)

with

    H(t, x, p, Q) = inf_{θ∈Θ} { L(t, x, θ) + p^⊤ f(t, x, θ) + ½ Tr( σ(t, x, θ)^⊤ Q σ(t, x, θ) ) }.        (3.20)

We will now consider this case since it is more general.
Note that the PDE (3.19) is a fully nonlinear PDE, in the sense that the highest-order derivative
∇²_x V enters nonlinearly. Such equations are notoriously hard to solve due to numerical
instabilities. However, in the case of control problems we can exploit the special structure present
in Hamilton-Jacobi equations. For an introduction to the theory of HJ equations, the reader is
referred to [Eva98]. The theory and methodology presented here are from the work of [BS91].


Ellipticity. Let Sd denote the set of d × d real symmetric matrices. We say that a function

H : R × Rd × Rd × Sd → R (3.21)

is elliptic if for any A ≥ B (meaning A − B is positive semi-definite), we have

H (t, x, p, A) ≤ H (t, x, p, B) for all (t, x, p) ∈ R × Rd × Rd . (3.22)

Clearly, for the case of deterministic control this is automatically satisfied, since H is independent
of the last argument. In general, it can be shown that under standard technical conditions, HJB
equations arising from stochastic control satisfy ellipticity.
The main result of [BS91] shows that if the ellipticity condition is inherited by the numerical
discretization scheme, then the numerical method converges to the right solution. The discrete
analogue of ellipticity is called monotonicity, which we now introduce.

Monotonicity and Consistency. Let us consider a discretized version of Eq. (3.19). By
setting t ↦ T − t, we can regard it as a standard Cauchy (initial value) problem. Using a spatial
grid size of ∆x and a temporal grid size of ∆t, we get

    S(h, t, x, V_h(t, x), [V_h]_{t,x}) = 0   in G_h \ {t = 0},
    V_h(0, x) = Φ(x)   on G_h ∩ {t = 0},        (3.23)

where h = (∆x, ∆t) and G_h = ∆t{0, 1, . . . , N_T} × ∆x Z^d, with Z ⊂ ℤ (|Z| = N_x) a subset of grid
indices in space. Thus G_h denotes a regular grid in [0,T] × R^d. The function S encodes the
discretization scheme for the differential operators, whose form may vary. Here V_h(t, x) is the
approximation of the solution V at (t, x) with grid size h. For (t, x) ∈ G_h, the symbol [V_h]_{t,x}
denotes the values of V_h at all points of G_h except (t, x).
Let us assume the following:
• Monotonicity: If u ≤ v (element-wise), then

    S(h, t, x, r, u) ≥ S(h, t, x, r, v).        (3.24)

• Consistency: For any smooth function V(t, x), we have

    lim_{h→0} S(h, t, x, V(t, x), [V]_{t,x}) = ∂_t V(t, x) − H(t, x, ∇_x V(t, x), ∇²_x V(t, x))        (3.25)

  for all (t, x).

• Stability: For every h > 0, Eq. (3.23) admits a solution V_h, and there exists a constant C > 0
  such that sup_h ‖V_h‖ ≤ C, i.e. the solutions are bounded uniformly in h.
The following result shows that the above assumptions are enough to guarantee the convergence
of solutions of the numerical scheme.


Theorem 3.1: Barles-Souganidis

If the numerical scheme (3.23) satisfies monotonicity, consistency and stability, then its
solution V_h converges locally uniformly, as h → 0, to the unique viscosity solution of (3.19).

Exercise 3.2: Heat Equation

Consider the 1D heat equation

    ∂_t V(t, x) = ∂²_{xx} V(t, x).        (3.26)

We perform the standard forward-time, central space discretization to obtain the form (3.23)
with
    S( ∆t, ∆x, (n+1)∆t, i∆x, V_i^{n+1}, (V_{i−1}^n, V_i^n, V_{i+1}^n) ) = (V_i^{n+1} − V_i^n)/∆t − (V_{i+1}^n − 2V_i^n + V_{i−1}^n)/∆x².        (3.27)
Show that the scheme is consistent, and that it is monotone and stable if the following
CFL condition is satisfied:
    ∆t ≤ ∆x²/2.        (3.28)
Thus, Theorem 3.1 is consistent with the usual Lax-equivalence theorem for numerical
analysis of PDEs.
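The CFL condition in Exercise 3.2 is easy to observe numerically. The following sketch runs the FTCS scheme on an illustrative initial condition with one stable and one unstable choice of ∆t; grid sizes and data are assumptions for illustration.

import numpy as np

def ftcs_heat(dt, dx, n_steps, L=1.0):
    # Forward-time, central-space scheme for V_t = V_xx with Dirichlet boundaries.
    x = np.arange(0.0, L + dx, dx)
    V = np.sin(np.pi * x)                          # illustrative initial condition
    for _ in range(n_steps):
        lap = (np.roll(V, -1) - 2 * V + np.roll(V, 1)) / dx**2
        V = V + dt * lap
        V[0] = V[-1] = 0.0
    return np.max(np.abs(V))

dx = 0.02
print(ftcs_heat(dt=0.4 * dx**2, dx=dx, n_steps=1000))   # dt < dx^2/2: stays bounded and decays
print(ftcs_heat(dt=0.6 * dx**2, dx=dx, n_steps=1000))   # dt > dx^2/2: blows up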

Let us now give an example of a nonlinear HJ equation and an associated monotone scheme.
Consider the 1D Hamilton-Jacobi equation

    ∂_t V(t, x) + |∂_x V(t, x)|² = 0,   V(0, x) = |x|.        (3.29)

Observe that we must have V (t, x) ≥ 0 for all t, x. Using FTCS, we can discretize this to
    (V_i^{n+1} − V_i^n)/∆t + (V_{i+1}^n − V_{i−1}^n)²/(4∆x²) = 0.        (3.30)

Unlike the heat equation example, no matter what the step sizes are, this scheme is not monotone,
and hence we cannot guarantee convergence.
An alternative is to consider the following discretization of |∂_x V|²:

    (V_{i+1}^n − V_{i−1}^n)²/(4∆x²) − α (V_{i+1}^n − 2V_i^n + V_{i−1}^n)/∆x.        (3.31)

Note that this is still consistent, since the added last term vanishes in the limit ∆x → 0. Suppose
that there exists a constant C such that |∂_x V(t, x)| ≤ C for all t, x; then, taking α > C (together
with an appropriate restriction on ∆t), we obtain a monotone scheme. In general, constructing
monotone schemes can be quite challenging, especially for higher-order accuracy. The reader is
referred to [Shu07].
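A minimal Python sketch of the monotone scheme built from (3.31) for the model problem (3.29) is given below. Since |∂_x V| ≤ 1 for this initial condition, α = 1.5 is (more than) sufficient, and the time step is kept small relative to ∆x/α; grid sizes and the closing comparison are illustrative.

import numpy as np

L_dom, dx, T = 2.0, 0.02, 0.5
alpha = 1.5                                 # artificial-viscosity coefficient, > bound on |V_x|
dt = 0.25 * dx / alpha                      # conservative time-step restriction
x = np.arange(-L_dom, L_dom + dx, dx)
V = np.abs(x)                               # initial condition V(0, x) = |x|

for _ in range(int(T / dt)):
    Vp = np.empty_like(V); Vm = np.empty_like(V)
    Vp[:-1], Vp[-1] = V[1:], 2 * V[-1] - V[-2]     # V_{i+1}, linear extrapolation at the edge
    Vm[1:], Vm[0] = V[:-1], 2 * V[0] - V[1]        # V_{i-1}, linear extrapolation at the edge
    ham = (Vp - Vm)**2 / (4 * dx**2)               # central approximation of |V_x|^2
    visc = alpha * (Vp - 2 * V + Vm) / dx          # added viscosity term from (3.31)
    V = V - dt * (ham - visc)

# For comparison, the Hopf-Lax formula gives the exact viscosity solution of (3.29):
# V(T, x) = x^2 / (4T) for |x| <= 2T, and |x| - T otherwise.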


3.5 Further Reading

We have only scratched the surface of the large topic of numerical methods for optimal control.
For a good survey of direct and indirect methods based on nonlinear programming or Pontryagin’s
maximum principle, the reader is referred to [Rao10]. For more information on monotone methods
for HJB equations, we refer to [Tou] and references therein. Lastly, the solution of the HJB
equation is closely related to the fields of reinforcement learning and approximate dynamic
programming [SB18]. Although most of the problems there are posed in discrete time, the
techniques (e.g. value iteration, Monte Carlo sampling) provide an alternative way to solve
optimal control problems.


Bibliography

[AF13] Michael Athans and Peter L. Falb. Optimal Control: An Introduction to the Theory and Its Applications. Courier Corporation, 2013.

[Arn12] Vladimir Igorevich Arnold. Geometrical Methods in the Theory of Ordinary Differential Equations, volume 250. Springer Science & Business Media, 2012.

[Bel66] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.

[Ber97] Dimitri P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.

[BP07] Alberto Bressan and Benedetto Piccoli. Introduction to the Mathematical Theory of Control, volume 2. American Institute of Mathematical Sciences, Springfield, 2007.

[BS91] Guy Barles and Panagiotis E. Souganidis. Convergence of approximation schemes for fully nonlinear second order equations. Asymptotic Analysis, 4(3):271–283, 1991.

[CL83] Michael G. Crandall and Pierre-Louis Lions. Viscosity solutions of Hamilton-Jacobi equations. Transactions of the American Mathematical Society, 277(1):1–42, 1983.

[Cod12] Earl A. Coddington. An Introduction to Ordinary Differential Equations. Courier Corporation, 2012.

[Eva98] Lawrence C. Evans. Partial Differential Equations. American Mathematical Society, 1998.

[FR12] Wendell H. Fleming and Raymond W. Rishel. Deterministic and Stochastic Optimal Control, volume 1. Springer Science & Business Media, 2012.

[FS06] Wendell H. Fleming and Halil Mete Soner. Controlled Markov Processes and Viscosity Solutions, volume 25. Springer Science & Business Media, 2006.

[GS00] Izrail Moiseevitch Gelfand and Richard A. Silverman. Calculus of Variations. Courier Corporation, 2000.

[KC62] Ivan A. Krylov and Felix L. Chernousko. On the method of successive approximations for solution of optimal control problems. J. Comp. Mathem. and Mathematical Physics, 2(6), 1962.

[Lib12] Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, Princeton; Oxford, 2012.

[Øks03] Bernt Øksendal. Stochastic differential equations. In Stochastic Differential Equations, pages 65–84. Springer, 2003.

[Pen90] Shige Peng. A general stochastic maximum principle for optimal control problems. SIAM Journal on Control and Optimization, 28(4):966–979, 1990.

[Pen10] Shige Peng. Nonlinear expectations and stochastic calculus under uncertainty. arXiv preprint arXiv:1002.4546, 2010.

[PP90] Etienne Pardoux and Shige Peng. Adapted solution of a backward stochastic differential equation. Systems & Control Letters, 14(1):55–61, 1990.

[Rao10] Anil Rao. A survey of numerical methods for optimal control. Advances in the Astronautical Sciences, 135, January 2010.

[SB13] Josef Stoer and Roland Bulirsch. Introduction to Numerical Analysis, volume 12. Springer Science & Business Media, 2013.

[SB18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[Shu07] Chi-Wang Shu. High order numerical methods for time dependent Hamilton-Jacobi equations, volume 11, pages 47–91. World Scientific, October 2007.

[Tou] Agnes Tourin. An introduction to Finite Difference methods for PDEs in Finance.

[YZ99] Jiongmin Yong and Xun Yu Zhou. Stochastic Controls: Hamiltonian Systems and HJB Equations, volume 43. Springer Science & Business Media, 1999.
