Chapter One
Introduction
where x is the state taking values in Rn , u is the control input taking values
in some control set U ⊂ Rm , t is time, t0 is the initial time, and x0 is the
initial state. Both x and u are functions of t, but we will often suppress
their time arguments.
The second basic ingredient is the cost functional. It associates a cost
with each possible behavior. For a given initial data (t0 , x0 ), the behaviors
are parameterized by control functions u. Thus, the cost functional assigns
a cost value to each admissible control. In this book, cost functionals will
be denoted by J and will be of the form
J(u) := ∫_{t0}^{tf} L(t, x(t), u(t)) dt + K(tf, xf)    (1.2)
where L and K are given functions (running cost and terminal cost, respec-
tively), tf is the final (or terminal ) time which is either free or fixed, and
xf := x(tf ) is the final (or terminal ) state which is either free or fixed or
belongs to some given target set. Note again that u itself is a function of
time; this is why we say that J is a functional (a real-valued function on a
space of functions).
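As a purely numerical illustration of this setup (not part of the formal development), the following Python sketch discretizes a scalar example: the state equation is integrated by the forward Euler method under a given control signal, and the cost (1.2) is approximated by a Riemann sum. The dynamics f, running cost L, and terminal cost K below are made-up placeholder choices.

```python
# Placeholder problem data (illustrative only): scalar dynamics and quadratic costs.
def f(t, x, u):            # right-hand side of the state equation
    return -x + u

def L(t, x, u):            # running cost
    return x**2 + u**2

def K(tf, xf):             # terminal cost
    return 10.0 * xf**2

def cost(u_of_t, t0=0.0, tf=1.0, x0=1.0, N=1000):
    """Approximate J(u) = int_{t0}^{tf} L dt + K(tf, x(tf)) by forward Euler."""
    dt = (tf - t0) / N
    x, J = x0, 0.0
    for k in range(N):
        t = t0 + k * dt
        u = u_of_t(t)
        J += L(t, x, u) * dt           # accumulate the running cost
        x += f(t, x, u) * dt           # Euler step of the state equation
    return J + K(tf, x)

print(cost(lambda t: 0.0))             # cost of the zero control
print(cost(lambda t: -0.5))            # cost of a constant control
```

Each admissible control u(·) produces a single number J(u); comparing such numbers over a family of controls is exactly the minimization described above.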
The optimal control problem can then be posed as follows: Find a control
u that minimizes J(u) over all admissible controls (or at least over nearby
controls). Later we will need to come back to this problem formulation
The reader will easily think of other examples. Several specific optimal
control problems will be examined in detail later in the book. We briefly
discuss one simple example here to better illustrate the general problem
formulation.
(The basic form of the optimal control strategy may be intuitively obvious,
but obtaining a complete description of the optimal control requires some
work.)
We claim that

g′(0) = 0.    (1.7)

To show this, suppose that g′(0) ≠ 0. Then, in view of (1.6), there exists
an ε > 0 small enough so that for all nonzero α with |α| < ε, the absolute
value of the fraction in (1.6) is less than |g′(0)|. We can write this as

|α| < ε, α ≠ 0  ⇒  |o(α)| < |g′(0)α|,

which, combined with the first-order expansion of g, gives

g(α) − g(0) = g′(0)α + o(α) < g′(0)α + |g′(0)α|.    (1.8)

If we further restrict α to have the opposite sign to g′(0), then the right-hand
side of (1.8) becomes 0 and we obtain g(α) − g(0) < 0. But this contradicts
the fact that g has a minimum at 0. We have thus shown that (1.7) is indeed
true.
We now need to re-express this result in terms of the original function
f. A simple application of the chain rule from vector calculus yields the
formula

g′(α) = ∇f(x∗ + αd) · d    (1.9)

where

∇f := (fx1, . . . , fxn)ᵀ

is the gradient of f. Setting α = 0 in (1.9) and using (1.7), we obtain

∇f(x∗) · d = 0.    (1.10)

Since the direction d ∈ Rn was arbitrary, this gives

∇f(x∗) = 0.    (1.11)
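For readers who like to see such statements numerically, the sketch below (an illustration with a made-up test function, not part of the text, assuming NumPy is available) checks the first-order condition: at an unconstrained minimizer the finite-difference gradient is essentially zero, and along any direction d the scalar function g(α) = f(x∗ + αd) has g′(0) ≈ 0.

```python
import numpy as np

# Made-up smooth test function with an interior minimum at x* = (1, -2).
def f(x):
    return (x[0] - 1.0)**2 + 3.0 * (x[1] + 2.0)**2

def grad_f(x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x_star = np.array([1.0, -2.0])
print(grad_f(x_star))                  # ~ [0, 0]: the condition (1.11)

d = np.array([0.3, -0.7])              # an arbitrary direction
g = lambda a: f(x_star + a * d)        # the scalar function g(alpha) = f(x* + alpha d)
h = 1e-6
print((g(h) - g(-h)) / (2 * h))        # ~ 0: g'(0) = grad f(x*) . d, as in (1.9)
```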
We know from the derivation of the first-order necessary condition that g′(0)
must be 0. We claim that

g″(0) ≥ 0.    (1.14)

Indeed, suppose that g″(0) < 0. By (1.13), there exists an ε > 0 such that

|α| < ε, α ≠ 0  ⇒  |o(α²)| < ½|g″(0)|α².

For these values of α, (1.12) reduces to g(α) − g(0) < 0, contradicting the
fact that 0 is a minimum of g. Therefore, (1.14) must hold.
What does this result imply about the original function f? To see what
g″(0) is in terms of f, we need to differentiate the formula (1.9). The reader
may find it helpful to first rewrite (1.9) more explicitly as

g′(α) = Σ_{i=1}^{n} fxi(x∗ + αd) di.
where

∇²f :=
⎡ fx1x1  · · ·  fx1xn ⎤
⎢   ⋮      ⋱      ⋮   ⎥
⎣ fxnx1  · · ·  fxnxn ⎦

is the Hessian matrix of f. In view of (1.14), (1.15), and the fact that
d was arbitrary, we conclude that the matrix ∇²f(x∗) must be positive
semidefinite:

∇²f(x∗) ≥ 0    (positive semidefinite).
This is the second-order necessary condition for optimality.
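A quick numerical companion to this condition (again with made-up data, assuming NumPy): approximate the Hessian by central differences at a candidate minimum and check that all its eigenvalues are nonnegative.

```python
import numpy as np

def f(x):
    # Made-up C^2 test function with a local minimum at the origin.
    return x[0]**2 + 0.5 * x[1]**2 + x[0]**2 * x[1]

def hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

H = hessian(f, np.array([0.0, 0.0]))
print(np.linalg.eigvalsh(H))   # all >= 0: the Hessian at x* is positive semidefinite
```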
Like the previous first-order necessary condition, this second-order con-
dition only applies to the unconstrained case. But, unlike the first-order
condition, it requires f to be C² and not just C¹. Another difference with the
first-order condition is that the second-order condition distinguishes minima
from maxima: at a local maximum, the Hessian must be negative semidef-
inite, while the first-order condition applies to any extremum (a minimum
or a maximum).
Strengthening the second-order necessary condition and combining it
with the first-order necessary condition, we can obtain the following second-
order sufficient condition for optimality: If a C² function f satisfies
The key fact that we used in the previous developments was that for every
d ∈ Rn , points of the form x∗ + αd for α sufficiently close to 0 belong to
D. This is no longer the case if D has a boundary (e.g., D is a closed ball
in Rn ) and x∗ is a point on this boundary. Such situations do not fit into
the unconstrained optimization scenario as we defined it at the beginning
of Section 1.2.1; however, for simple enough sets D and with some extra
care, a similar analysis is possible. Let us call a vector d ∈ Rn a feasible
direction (at x∗ ) if x∗ + αd ∈ D for small enough α > 0 (see Figure 1.1). If
not all directions d are feasible, then the condition ∇f (x∗ ) = 0 is no longer
necessary for optimality. We can still define the function (1.4) for every
feasible direction d, but the proof of (1.7) is no longer valid because α is now
nonnegative. We leave it to the reader to modify that argument and show
that if x∗ is a local minimum, then ∇f (x∗ ) · d ≥ 0 for every feasible direction
d. As for the second-order necessary condition, the inequality (1.14) is still
true if g′(0) = 0, which together with (1.10) and (1.15) implies that we must
have dᵀ∇²f(x∗)d ≥ 0 for all feasible directions satisfying ∇f(x∗) · d = 0.
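The following sketch (illustrative only, with made-up data, assuming NumPy) shows the boundary situation just described: minimizing f(x) = x1 over the closed unit disk, the minimum x∗ = (−1, 0) lies on the boundary, ∇f(x∗) ≠ 0, and yet ∇f(x∗) · d ≥ 0 for every feasible direction d, which necessarily points back into the disk.

```python
import numpy as np

# Minimize f(x) = x_1 over the closed unit disk D; the minimum is x* = (-1, 0).
grad_f = np.array([1.0, 0.0])          # gradient of f (constant and nonzero)
x_star = np.array([-1.0, 0.0])

def is_feasible_direction(d, alpha=1e-3):
    """Numerical surrogate: d is feasible at x* if x* + alpha*d stays in D."""
    return np.linalg.norm(x_star + alpha * d) <= 1.0

rng = np.random.default_rng(0)
for _ in range(8):
    d = rng.normal(size=2)
    if is_feasible_direction(d):
        # For feasible d at this boundary minimum, grad f(x*) . d >= 0.
        print(d, grad_f @ d >= 0.0)
```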
Figure 1.1: Feasible directions
form x∗ + αd, α ∈ [0, ᾱ] for some d ∈ Rn and ᾱ > 0. This means that the
feasible direction approach is particularly suitable for the case of a convex
D. But if D is not convex, then the first-order and second-order necessary
conditions in terms of feasible directions are conservative. The next exercise
touches on the issue of sufficiency.
Figure 1.3: A convex function
Figure 1.4: A tangent vector
for all α (close enough to 0). Setting α = 0 and remembering that x(0) = x∗,
we obtain

0 = (d/dα) hi(x(α)) |_{α=0} = ∇hi(x∗) · x′(0),    i = 1, . . . , m.
We have shown that for an arbitrary C¹ curve x(·) in D with x(0) = x∗, its
tangent vector x′(0) must satisfy ∇hi(x∗) · x′(0) = 0 for each i. Actually,
one can show that the converse is also true, namely, every vector d ∈ Rn
satisfying

∇hi(x∗) · d = 0,    i = 1, . . . , m    (1.20)

is a tangent vector to D at x∗ corresponding to some curve. (We do not give
a proof of this fact but note that it relies on x∗ being a regular point of D.)
In other words, the tangent vectors to D at x∗ are exactly the vectors d for
which (1.20) holds. This is the characterization of the tangent space Tx∗D
that we were looking for. It is clear from (1.20) that Tx∗D is a subspace of
Rn; in particular, if d is a tangent vector, then so is −d (going from x′(0) to
−x′(0) corresponds to reversing the direction of the curve).
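Concretely, (1.20) says that Tx∗D is the null space of the Jacobian matrix of h = (h1, . . . , hm) at x∗. A small sketch (illustrative; the constraint below is made up and SciPy's null_space routine is assumed to be available) computes a basis of this tangent space:

```python
import numpy as np
from scipy.linalg import null_space

# Made-up example: D = {x in R^3 : h(x) = x_1^2 + x_2^2 + x_3^2 - 1 = 0}, the unit sphere.
def grad_h(x):
    return 2.0 * x                       # gradient of the single constraint h

x_star = np.array([0.0, 0.0, 1.0])       # a regular point of D (the "north pole")
Dh = grad_h(x_star).reshape(1, -1)       # 1 x 3 Jacobian of the constraints at x*

T = null_space(Dh)                       # columns form a basis of the tangent space
print(T)                                 # here: the x_1-x_2 plane, as expected
```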
Now let us go back to (1.19), which tells us that ∇f(x∗) · d = 0 for
all d ∈ Tx∗D (since the curve x(·) and thus the tangent vector x′(0) were
arbitrary). In view of the characterization of Tx∗D given by (1.20), we can
rewrite this condition as follows:
∇f (x∗ ) · d = 0 ∀ d such that ∇hi (x∗ ) · d = 0, i = 1, . . . , m. (1.21)
The relation between ∇f (x∗ ) and ∇hi (x∗ ) expressed by (1.21) looks some-
what clumsy, since checking it involves a search over d. Can we eliminate d
from this relation and make it more explicit? A careful look at (1.21) should
quickly lead the reader to the following statement.
Claim: The gradient of f at x∗ is a linear combination of the gradients of
the constraint functions h1 , . . . , hm at x∗ :
∇f (x∗ ) ∈ span{∇hi (x∗ ), i = 1, . . . , m}. (1.22)
Indeed, if the claim were not true, then ∇f(x∗) would have a component
orthogonal to span{∇hi(x∗)}, i.e., there would exist a d ≠ 0 satisfying (1.20)
such that ∇f(x∗) can be written in the form

∇f(x∗) = d − Σ_{i=1}^{m} λ∗i ∇hi(x∗)    (1.23)

for some λ∗1, . . . , λ∗m ∈ R. Taking the inner product with d on both sides
of (1.23) and using (1.20) gives

∇f(x∗) · d = d · d ≠ 0,

which contradicts (1.21) and thus proves the claim.
The condition (1.22) means that there exist real numbers λ∗1 , . . . , λ∗m such
that
∇f (x∗ ) + λ∗1 ∇h1 (x∗ ) + · · · + λ∗m ∇hm (x∗ ) = 0 (1.24)
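As a numerical sanity check of (1.24) (purely illustrative; the problem data are made up and SciPy's SLSQP solver is assumed to be available), one can minimize a function subject to an equality constraint and verify that the gradient of f at the computed solution is indeed a multiple of the constraint gradient:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up problem: minimize f(x) = x_1 + x_2 subject to h(x) = x_1^2 + x_2^2 - 2 = 0.
f      = lambda x: x[0] + x[1]
grad_f = lambda x: np.array([1.0, 1.0])
h      = lambda x: x[0]**2 + x[1]**2 - 2.0
grad_h = lambda x: 2.0 * x

res = minimize(f, x0=[0.0, -1.0], jac=grad_f,
               constraints=[{"type": "eq", "fun": h, "jac": grad_h}],
               method="SLSQP")
x_star = res.x                           # ~ (-1, -1) for this starting point

# Solve grad_f(x*) + lambda* grad_h(x*) = 0 for lambda* in the least-squares sense.
lam, *_ = np.linalg.lstsq(grad_h(x_star).reshape(-1, 1), -grad_f(x_star), rcond=None)
print(x_star, lam[0])                              # lambda* ~ 0.5 here
print(grad_f(x_star) + lam[0] * grad_h(x_star))    # ~ [0, 0], i.e. (1.24) holds
```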
If this Jacobian matrix were nonsingular, then we could apply the Inverse
Function Theorem (see, e.g., [Rud76, Theorem 9.24]) and conclude that
there are neighborhoods of (0, 0) and F (0, 0) = (f (x∗ ), 0) on which the map
F is a bijection (has an inverse). This would imply, in particular, that there
are points x arbitrarily close to x∗ such that h(x) = 0 and f (x) < f (x∗ );
such points would be obtained by taking preimages of points on the ray
directed to the left from F (0, 0) in Figure 1.6. But this cannot be true,
since h(x) = 0 means that x ∈ D and we know that x∗ is a local minimum
of f over D. Therefore, the matrix (1.25) is singular.
Regularity of x∗ in the present case just means that the gradient ∇h(x∗)
is nonzero. Choose a d1 such that ∇h(x∗) · d1 ≠ 0. With this d1 fixed, let
λ∗ := −(∇f(x∗) · d1)/(∇h(x∗) · d1), so that ∇f(x∗) · d1 = −λ∗∇h(x∗) · d1.
Since the matrix (1.25) must be singular for all choices of d2 , its first row
must be a constant multiple of its second row (the second row being nonzero
Second-order conditions
The second-order necessary condition says that this Hessian matrix must be
positive semidefinite on the tangent space to D at x∗, i.e., we must have
dᵀℓxx(x∗, λ∗)d ≥ 0 for all d ∈ Tx∗D. Note that this is weaker than asking
the above Hessian matrix to be positive semidefinite in the usual sense (on
the entire Rn).
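Numerically, this condition is usually checked by forming a basis Z for Tx∗D (the null space of the constraint Jacobian) and testing the reduced Hessian ZᵀℓxxZ for positive semidefiniteness. A short sketch, reusing the made-up constrained problem from the previous snippet and assuming SciPy is available:

```python
import numpy as np
from scipy.linalg import null_space

# Made-up problem from before: f(x) = x_1 + x_2, h(x) = x_1^2 + x_2^2 - 2 = 0,
# with x* = (-1, -1) and lambda* = 0.5.
x_star, lam = np.array([-1.0, -1.0]), 0.5

hess_f = np.zeros((2, 2))                # Hessian of f (f is linear)
hess_h = 2.0 * np.eye(2)                 # Hessian of h
l_xx = hess_f + lam * hess_h             # Hessian of the augmented cost l = f + lambda h

Z = null_space((2.0 * x_star).reshape(1, -1))   # basis of the tangent space at x*
reduced = Z.T @ l_xx @ Z
print(np.linalg.eigvalsh(reduced))       # all >= 0 (here > 0): second-order condition
```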
The second-order sufficient condition says that a point x∗ ∈ D is a strict
constrained local minimum of f if the first-order necessary condition for
constrained optimality (1.24) holds and, in addition, we have
Typical function spaces that we will consider are spaces of functions from
some interval [a, b] to Rn (for some n ≥ 1). Different spaces result from
placing different requirements on the regularity of these functions. For ex-
ample, we will frequently work with the function space Cᵏ([a, b], Rn), whose
elements are k-times continuously differentiable (here k ≥ 0 is an integer; for
k = 0 the functions are just continuous). Relaxing the Cᵏ assumption, we
can arrive at the spaces of piecewise continuous functions or even measurable
functions (we will define these more precisely later when we need them). On
the other hand, stronger regularity assumptions lead us to C∞ (smooth, or
infinitely many times differentiable) functions or to real analytic functions
(the latter are C∞ functions that agree with their Taylor series around every
point).
We regard these function spaces as linear vector spaces over R. Why
are they infinite-dimensional? One way to see this is to observe that the
monomials 1, x, x2 , x3 , . . . are linearly independent. Another example of an
infinite set of linearly independent functions is provided by the (trigonomet-
ric) Fourier basis.
As we already mentioned, we also need to equip our function space V
with a norm ‖ · ‖. This is a real-valued function on V which is positive
definite (‖y‖ > 0 if y ≢ 0), homogeneous (‖λy‖ = |λ| · ‖y‖ for all λ ∈ R,
y ∈ V ), and satisfies the triangle inequality (‖y + z‖ ≤ ‖y‖ + ‖z‖). The
norm gives us the notion of a distance, or metric, d(y, z) := ‖y − z‖. This
allows us to define local minima and enables us to talk about topological
concepts such as convergence and continuity (more on this in Section 1.3.4
below). We will see how the norm plays a crucial role in the subsequent
developments.
On the space C⁰([a, b], Rn), a commonly used norm is

‖y‖₀ := max_{a≤x≤b} |y(x)|.    (1.29)

On C¹([a, b], Rn), one adds the maximum of the first derivative and obtains the 1-norm

‖y‖₁ := max_{a≤x≤b} |y(x)| + max_{a≤x≤b} |y′(x)|.    (1.30)

This construction can be continued in the obvious way to yield the k-norm
on Cᵏ([a, b], Rn) for each k. The k-norm can also be used on C^ℓ([a, b], Rn)
for all ℓ ≥ k. There exist many other norms, such as for example the Lp
norm

‖y‖_{Lp} := ( ∫_{a}^{b} |y(x)|^p dx )^{1/p}    (1.31)
where p is a positive integer. In fact, the 0-norm (1.29) is also known as the
L∞ norm.
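The next sketch (illustrative only, assuming NumPy) approximates these norms for a function sampled on a uniform grid: the 0-norm as the maximum of |y|, the 1-norm as the maximum of |y| plus the maximum of |y′|, and the Lp norm by a Riemann sum. Such grid-based values are of course only approximations of the true norms.

```python
import numpy as np

a, b, N = 0.0, 1.0, 10_000
x = np.linspace(a, b, N + 1)
y = np.sin(2 * np.pi * x)                # a sample element of C^1([0, 1], R)

norm_0 = np.max(np.abs(y))                               # 0-norm: sup of |y|
norm_1 = norm_0 + np.max(np.abs(np.gradient(y, x)))      # 1-norm adds sup of |y'|
p = 2
norm_Lp = np.trapz(np.abs(y)**p, x) ** (1 / p)           # Lp norm, here with p = 2

print(norm_0, norm_1, norm_Lp)           # ~ 1.0, ~ 1 + 2*pi, ~ 1/sqrt(2)
```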
We are now ready to formally define local minima of a functional. Let V
be a vector space of functions equipped with a norm ‖ · ‖, let A be a subset
of V , and let J be a real-valued functional defined on V (or just on A). A
function y∗ ∈ A is a local minimum of J over A if there exists an ε > 0 such
that for all y ∈ A satisfying ‖y − y∗‖ < ε we have

J(y∗) ≤ J(y).
Note that this definition of a local minimum is completely analogous to the
one in the previous section, modulo the change of notation x ↦ y, D ↦ A,
f ↦ J, | · | ↦ ‖ · ‖ (also, implicitly, Rn ↦ V ). Strict minima, global minima,
and the corresponding notions of maxima are defined in the same way as
before. We will continue to refer to minima and maxima collectively as
extrema.
For the norm ‖ · ‖, we will typically use either the 0-norm (1.29) or the 1-
norm (1.30), with V being C⁰([a, b], Rn) or C¹([a, b], Rn), respectively. In the
remainder of this section we discuss some general conditions for optimality
which apply to both of these norms. However, when we develop more specific
results later in calculus of variations, our findings for these two cases will be
quite different.
depends on both y and η. The requirement that δJ|y must be a linear func-
tional is understood in the usual sense: δJ|y (α1 η1 + α2 η2 ) = α1 δJ|y (η1 ) +
α2 δJ|y (η2 ) for all η1 , η2 ∈ V and α1 , α2 ∈ R.
The first variation as defined above corresponds to the so-called Gateaux
derivative of J, which is just the usual derivative of J(y + αη) with respect
to α (for fixed y and η) evaluated at α = 0:

δJ|y(η) = lim_{α→0} (J(y + αη) − J(y))/α.    (1.33)

In other words, if we define

g(α) := J(y + αη)    (1.34)

then

δJ|y(η) = g′(0).    (1.35)
Exercise 1.5 Consider the space V = C⁰([0, 1], R), let ϕ : R → R be
a C¹ function, and define the functional J on V by J(y) = ∫_{0}^{1} ϕ(y(x)) dx.
Show that its first variation exists and is given by the formula

δJ|y(η) = ∫_{0}^{1} ϕ′(y(x))η(x) dx.
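A numerical check of this formula (not a substitute for the proof requested in the exercise; the particular ϕ, y, and η below are made up, assuming NumPy) compares a finite-difference approximation of the Gateaux derivative with the integral expression:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10_001)
phi  = lambda s: s**3                    # a C^1 function phi
dphi = lambda s: 3 * s**2                # its derivative phi'
y   = np.sin(x)                          # a fixed y in C^0([0, 1], R)
eta = np.cos(3 * x)                      # a perturbation direction eta

J = lambda z: np.trapz(phi(z), x)        # J(y) = int_0^1 phi(y(x)) dx

alpha = 1e-6
gateaux = (J(y + alpha * eta) - J(y - alpha * eta)) / (2 * alpha)  # ~ dJ(y + a*eta)/da at a = 0
formula = np.trapz(dphi(y) * eta, x)     # int_0^1 phi'(y(x)) eta(x) dx

print(gateaux, formula)                  # the two values agree to several digits
```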
Our notion of the first variation, defined via the expansion (1.32), is in-
dependent of the choice of the norm on V . This means that the first-order
necessary condition (1.36) is valid for every norm. To obtain a necessary
condition better tailored to a particular norm, we could define δJ|y differ-
ently, by using the following expansion instead of (1.32):

J(y + η) = J(y) + δJ|y(η) + o(‖η‖).    (1.37)
The difference with our original formulation is subtle but substantial. The
earlier expansion (1.32) describes how the value of J changes with α for each
fixed η. In (1.37), the higher-order term is a function of ‖η‖ and so the ex-
pansion captures the effect of all η at once (while α is no longer needed). We
remark that the first variation defined via (1.37) corresponds to the so-called
Fréchet derivative of J, which is a stronger differentiability notion than the
Gateaux derivative (1.33). In fact, (1.37) suggests constructing more gen-
eral perturbations: instead of working with functions of the form y + αη,
where η is fixed and α is a scalar parameter, we can consider perturbed
functions y + η which can approach y in a more arbitrary manner as ‖η‖
tends to 0 (multiplying η by a vanishing parameter is just one possibility).
This generalization is conceptually similar to that of passing from the lines
x∗ + αd used in Section 1.2.1 to the curves x(α) utilized in Section 1.2.2.
We will start seeing perturbations of this kind in Chapter 3.
In what follows, we retain our original definition of the first variation in
terms of (1.32). It is somewhat simpler to work with and is adequate for our
needs (at least through Chapter 2). While the norm-dependent formulation
could potentially provide sharper conditions for optimality, it takes more
work to verify (1.37) for all η compared to verifying (1.32) for a fixed η.
Besides, we will eventually abandon the analysis based on the first variation
altogether in favor of more powerful tools. However, it is useful to be aware
of the alternative formulation (1.37), and we will occasionally make some
side remarks related to it. This issue will resurface in Chapter 3 where,
although the alternative definition (1.37) of the first variation will not be
specifically needed, we will use more general perturbations along the lines
of the preceding discussion.
Exercise 1.6 Consider the same functional J as in Exercise 1.5, but as-
sume now that ϕ is C². Derive a formula for the second variation of J (make
sure that it is indeed a quadratic form).
(this should again hold for all admissible perturbations η with respect to a
subset A of V over which we want y ∗ to be a minimum). We would then hope
to show that for y = y∗ the second-order term in (1.38) dominates the higher-
order term o(α²), which would imply that y∗ is a strict local minimum (since
the first-order term is 0). Our earlier proof of sufficiency of (1.16) followed
the same idea. However, examining that proof more closely, the reader will
discover that in the present case the argument does not go through.
We know that there exists an ε > 0 such that for all nonzero α with
|α| < ε we have |o(α²)| < δ²J|y∗(η)α². Using this inequality and (1.36),
we obtain from (1.38) that J(y∗ + αη) > J(y∗). Note that this does not yet
for some number λ > 0. The property (1.41) does not automatically fol-
low from (1.40), again because we are in an infinite-dimensional space.
(Quadratic forms satisfying (1.41) are sometimes called uniformly positive
definite.) The second step is to modify the definitions of the first and sec-
ond variations by explicitly requiring that the higher-order terms decay uni-
formly with respect to ‖η‖. We already mentioned such an alternative defi-
nition of the first variation via the expansion (1.37). Similarly, we could
define δ²J|y via the following expansion in place of (1.38):

J(y + η) = J(y) + δJ|y(η) + δ²J|y(η) + o(‖η‖²).    (1.42)

Adopting these alternative definitions and assuming that (1.36) and (1.41)
hold, we could easily prove optimality by noting that |o(‖η‖²)| < λ‖η‖²
when ‖η‖ is small enough.
With our current definitions of the first and second variations in terms
of (1.32) and (1.38), we do not have a general second-order sufficient condi-
tion for optimality. However, in variational problems that we are going to
study, the functional J to be minimized will take a specific form. This addi-
tional structure will allow us to derive conditions under which second-order
terms dominate higher-order terms, resulting in optimality. The above dis-
cussion was given mainly for illustrative purposes, and will not be directly
used in the sequel.
spaces equipped with a norm (or, more generally, a metric). On the other
hand, closed and bounded subsets of an infinite-dimensional vector space are
not necessarily compact—we already mentioned noncompactness of the unit
sphere—and the Weierstrass Theorem does not apply to them; see the next
exercise. We note that since our function space V has a norm, the notions
of continuity of J and convergence, closedness, boundedness, and openness
in V with respect to this norm are defined exactly as their familiar counter-
parts in Rn . We leave it to the reader to write down precise definitions or
consult the references given at the end of this chapter.
Success stories of optimal control theory in various applications are too nu-
merous to be listed here; see [CEHS87, Cla10, ST05, Swa84] for some exam-
ples from engineering and well beyond. The reader interested in applications
will easily find many other references.
The material in Section 1.2 can be found in standard texts on optimiza-
tion, such as [Lue84] or [Ber99]. See also Sections 5.2-5.4 of the book [AF66],
which will be one of our main references for the optimal control chapters.
Complete proofs of the results presented in Section 1.2.2, including the fact
that the condition (1.20) is sufficient for d to be a tangent vector, are given
in [Lue84, Chapter 10]. The alternative argument based on the inverse func-
tion theorem is adopted from [Mac05, Section 1.4]. The necessary condition
in terms of Lagrange multipliers can also be derived from a cone separation