OPTIMIZATION BY VECTOR SPACE METHODS

David G. Luenberger
Stanford University, Stanford, California

WILEY-INTERSCIENCE

Optimization of functionals
Global theory of constrained optimization
Local theory of constrained optimization
Iterative methods of optimization
PREFACE
This book has evolved from a course on optimization that I have taught at
Stanford University for the past five years. It is intended to be essentially
self-contained and should be suitable for classroom work or self-study.
As a text it is aimed at first- or second-year graduate students in engineering,
mathematics, operations research, or other disciplines dealing with optimization theory.
The primary objective of the book is to demonstrate that a rather large
segment of the field of optimization can be effectively unified by a few
geometric principles of linear vector space theory. By use of these principles,
important and complex infinite-dimensional problems, such as those
generated by consideration of time functions, are interpreted and solved by
methods springing from our geometric insight. Concepts such as distance,
orthogonality, and convexity play fundamental roles in this development.
Viewed in these terms, seemingly diverse problems and techniques often
are found to be intimately related.
The essential mathematical prerequisite is a familiarity with linear
algebra, preferably from the geometric viewpoint. Some familiarity with
elementary analysis including the basic notions of sets, convergence, and
continuity is assumed, but deficiencies in this area can be corrected as one
progresses through the book. More advanced concepts of analysis such as
Lebesgue measure and integration theory, although referred to in a few
isolated sections, are not required background for this book.
Imposing simple intuitive interpretations on complex infinite-dimensional problems requires a fair degree of mathematical sophistication.
The backbone of the approach taken in this book is functional analysis,
the study of linear vector spaces. In an attempt to keep the mathematical
prerequisites to a minimum while not sacrificing completeness of the development, the early chapters of the book essentially constitute an introduction
to functional analysis, with applications to optimization, for those having
the relatively modest background described above. The mathematician or
more advanced student may wish simply to scan Chapters 2, 3, 5, and 6 for
review or for sections treating applications and then concentrate on the
other chapters, which deal explicitly with optimization theory.
D. G. LUENBERGER
CONTENTS

1 INTRODUCTION 1
1.1 Motivation 1
1.2 Applications 2
1.3 The Main Principles 8
1.4 Organization of the Book 10

2 LINEAR SPACES 11
2.1 Introduction 11
VECTOR SPACES 11
2.2 Definition and Examples 11
2.3 Subspaces, Linear Combinations, and Linear Varieties 14
2.4 Convexity and Cones 17
2.5 Linear Independence and Dimension 19
NORMED LINEAR SPACES
2.6 Definition and Examples
2.7 Open and Closed Sets
2.8 Convergence
2.9 Transformations and Continuity
*2.10 The lp and Lp Spaces
2.11 Banach Spaces
2.12 Complete Subsets
*2.13 Extreme Values of Functionals and Compactness
*2.14 Quotient Spaces
*2.15 Denseness and Separability
2.16 Problems

3 HILBERT SPACE 46
3.1 Introduction 46
PRE-HILBERT SPACES 46
3.2 Inner Products 46
3.3 The Projection Theorem 49
3.4 Orthogonal Complements 52
3.5 The Gram-Schmidt Procedure 53
APPROXIMATION 55
3.6 The Normal Equations and Gram Matrices 55
3.7 Fourier Series 58
*3.8 Complete Orthonormal Sequences 60
3.9 Approximation and Fourier Series 62
3.10 The Dual Approximation Problem
*3.11 A Control Problem
3.12 Minimum Distance to a Convex Set
3.13 Problems

4 LEAST-SQUARES ESTIMATION 78
4.1 Introduction 78
4.2 Hilbert Space of Random Variables 79
4.3 The Least-Squares Estimate 82
4.4 Minimum-Variance Unbiased Estimate (Gauss-Markov Estimate) 84
4.5 Minimum-Variance Estimate 87
4.6 Additional Properties of Minimum-Variance Estimates 90
4.7 Recursive Estimation 93
4.8 Problems 97
References 102

5 DUAL SPACES 103
5.1 Introduction 103
5.2–5.7, 5.11, 5.12, *5.13, 5.14

6 LINEAR OPERATORS AND ADJOINTS 143
6.1 Introduction 143
6.2 Fundamentals 143
INVERSE OPERATORS 147
6.3–6.12

7 OPTIMIZATION OF FUNCTIONALS 169
7.1 Introduction 169
LOCAL THEORY 171
7.2–7.14

8 GLOBAL THEORY OF CONSTRAINED OPTIMIZATION 213
8.1 Introduction 213
8.2 Positive Cones and Convex Mappings 214
8.3 Lagrange Multipliers 216
8.4 Sufficiency 219
8.5 Sensitivity 221
8.6 Duality 223
8.7 Applications 226
8.8 Problems 236
References 238

9 LOCAL THEORY OF CONSTRAINED OPTIMIZATION 239
9.1 Introduction 239

10 ITERATIVE METHODS OF OPTIMIZATION 271
10.1 Introduction 271
10.2, 10.9–10.12

SYMBOL INDEX 321
SUBJECT INDEX 323

NOTATION
NOTATION
Sets
If a and b are real numbers, [a, b] denotes the set of real numbers x satisfying a ≤ x ≤ b. A rounded instead of square bracket denotes strict inequality in the definition. Thus (a, b] denotes all x with a < x ≤ b.
If S is a set of real numbers bounded above, then there is a smallest real number y such that x ≤ y for all x ∈ S. The number y is called the least upper bound or supremum of S and is denoted sup {x : x ∈ S}. If S is not bounded above, we write sup {x : x ∈ S} = ∞.
Sequences
A sequence x1, x2, ..., xn, ... is denoted by {xn} when the range of the indices is clear.
Let {xn} be an infinite sequence of real numbers and suppose that there is a real number S satisfying: (1) for every ε > 0 there is an N such that for all n > N, xn < S + ε, and (2) for every ε > 0 and M > 0 there is an n > M such that xn > S − ε. Then S is called the limit superior of {xn} and we write S = lim sup_{n→∞} xn. If {xn} is not bounded above, we write lim sup xn = +∞. The limit inferior is defined as lim inf xn = −lim sup (−xn); when the limit superior and limit inferior coincide, their common value is lim xn.
Functions
The function sgn (pronounced signum) of a real variable is defined by

sgn (x) = 1 for x > 0, sgn (x) = 0 for x = 0, sgn (x) = −1 for x < 0.

The Kronecker delta δ_ij is defined by δ_ij = 1 for i = j and δ_ij = 0 for i ≠ j.

The delta function δ is characterized by ∫ f(t)δ(t) dt = f(0) for every function f continuous at zero.

Analogous to the definition for sequences, S = lim sup_{x→x0} g(x) if: (1) for every ε > 0 there is a δ > 0 such that for all x satisfying |x − x0| < δ, g(x) < S + ε, and (2) for every ε > 0 and δ > 0 there is an x such that |x − x0| < δ and g(x) > S − ε. (See the corresponding definitions for sequences.)
An m × n matrix A with entry a_ij in its i-th row and j-th column is written A = [a_ij]. If x = (x1, x2, ..., xn), the product Ax is the vector y with components y_i = Σ_{j=1}^n a_ij x_j, i = 1, 2, ..., m.
Let f(x1, x2, ..., xn) be a function of the n real variables x_i. Then we write f_x for the row vector of partial derivatives

f_x = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn).

1

INTRODUCTION
1.1 Motivation
During the past twenty years mathematics and engineering have been increasingly directed towards problems of decision making in physical or organizational systems. This trend has been inspired primarily by the significant economic benefits which often result from a proper decision concerning the distribution of expensive resources, and by the repeated demonstration that such problems can be realistically formulated and mathematically analyzed to obtain good decisions.
The arrival of high-speed digital computers has also played a major
role in the development of the science of decision making. Computers
have inspired the development of larger systems and the coupling of
previously separate systems, thereby resulting in decision and control
problems of correspondingly increased complexity. At the same time, however, computers have revolutionized applied mathematics and solved many of the complex problems they generated.
It is perhaps natural that the concept of best or optimal decisions should
emerge as the fundamental approach for formulating decision problems.
In this approach a single real quantity, summarizing the performance or
value of a decision, is isolated and optimized (i.e., either maximized or
minimized depending on the situation) by proper selection among available
alternatives. The resulting optimal decision is taken as the solution to the
decision problem. This approach to decision problems has the virtues of
simplicity, preciseness, elegance, and, in many cases, mathematical tractability. It also has obvious limitations due to the necessity of selecting a
single objective by which to measure results. But optimization has proved
its utility as a mode of analysis and is firmly entrenched in the field of
decision making.
Much of the classical theory of optimization, motivated primarily by the problems of physics, is associated with great mathematicians: Gauss, Lagrange, Euler, the Bernoullis, etc. During the recent development of optimization in decision problems, the classical techniques have been reexamined, extended, sometimes rediscovered, and applied to problems having quite different origins than those responsible for their earlier
development. New insights have been obtained and new techniques have
been discovered. The computer has rendered many techniques obsolete
while making other previously impractical methods feasible and efficient.
These recent developments in optimization have been made by mathematicians, system engineers, economists, operations researchers, statisticians, numerical analysts, and others in a host of different fields.
The study of optimization as an independent topic must, of course, be
regarded as a branch of applied mathematics. As such it must look to
various areas of pure mathematics for its unification, clarification, and
general foundation. One such area of particular relevance is functional
analysis.
Functional analysis is the study of vector spaces resulting from a
merging of geometry, linear algebra, and analysis. It serves as a basis for
aspects of several important branches of applied mathematics including
Fourier series, integral and differential equations, numerical analysis, and
any field where linearity plays a key role. Its appeal as a unifying discipline
stems primarily from its geometric character. Most of the principal results
in functional analysis are expressed as abstractions of intuitive geometric
properties of ordinary three-dimensional space.
Some readers may look with great expectation toward functional
analysis, hoping to discover new powerful techniques that will enable them
to solve important problems beyond the reach of simpler mathematical
analysis. Such hopes are rarely realized in practice. The primary utility
of functional analysis for the purposes of this book is its role as a unifying
discipline, gathering a number of apparently diverse, specialized mathematical tricks into one or a few general geometric principles.
1.2 Applications
The main purpose of this section is to illustrate the variety of problems
that can be formulated as optimization problems in vector space by introducing some specific examples that are treated in later chapters. As a
vehicle for this purpose, we classify optimization problems according to
the role of the decision maker. We list the classification, briefly describe
its meaning, and illustrate it with one problem that can be formulated in
vector space and treated by the methods described later in the book. The
classification is not intended to be necessarily complete nor, for that matter,
particularly significant. It is merely representative of the classifications
often employed when discussing optimization.
Although the formal definition of a vector space is not given until
Chapter 2, we point out, in the examples that follow, how each problem
maximize p1x1 + p2x2 + ... + pnxn
subject to
a11x1 + a12x2 + ... + a1nxn ≤ b1
a21x1 + a22x2 + ... + a2nxn ≤ b2
. . . . . . . . . .
am1x1 + am2x2 + ... + amnxn ≤ bm
and xj ≥ 0, j = 1, 2, ..., n.
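For readers who wish to experiment numerically, here is a minimal sketch of such an allocation problem in Python; the data p, A, b are hypothetical, chosen only for illustration, and scipy's linprog minimizes, so the objective is negated:

```python
# Minimal numerical sketch of the allocation problem above.
# The data p, A, b are hypothetical, chosen only for illustration.
import numpy as np
from scipy.optimize import linprog

p = np.array([3.0, 5.0, 4.0])        # unit values p1, p2, p3
A = np.array([[1.0, 2.0, 1.0],       # resource usage coefficients a_ij
              [3.0, 1.0, 2.0]])
b = np.array([10.0, 15.0])           # resource limits b1, b2

# linprog minimizes, so maximize p.x by minimizing -p.x.
# The default bounds are x_j >= 0, matching the sign constraints.
res = linprog(-p, A_ub=A, b_ub=b)
print(res.x, -res.fun)               # optimal allocation and optimal value
```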
J = ∫_0^T {c[r(t)] + h·x(t)} dt,

with x(0) given, where c[r] is the production cost rate for the production level r and h·x is the inventory cost rate for inventory level x.
This problem can be regarded as defined on a vector space consisting of
continuous functions on the interval [0, T] of the real line. The optimal
production schedule r is then an element of the space. Again the constraints define a region in the space in which the solution r must lie while
minimizing the cost.
3. Control (or Guidance). Problems of control are associated with dynamic systems evolving in time. Control is quite similar to planning;
¹ ẋ(t) ≡ dx(t)/dt.
indeed, as we shall see, it is often the source of a problem rather than its
mathematical structure which determines its category.
Control or guidance usually refers to directed influence on a dynamic
system to achieve desired performance. The system itself may be physical
in nature, such as a rocket heading for Mars or a chemical plant processing
acid, or it may be operational such as a warehouse receiving and filling
orders.
Often we seek feedback or so-called closed-loop control in which decisions of current control action are made continuously in time based on
recent observations of system behavior. Thus, one may imagine himself as
a controller sitting at the control panel watching meters and turning knobs
or in a warehouse ordering new stock based on inventory and predicted
demand. This is in contrast to the approach described for planning in
which the whole series of control actions is predetermined. Generally,
however, the terms planning or control may refer to either possibility.
As an example of a control problem, we consider the launch of a rocket
to a fixed altitude h in given time T while expending a minimum of fuel.
For simplicity, we assume unit mass, a constant gravitational force, and
the absence of aerodynamic forces. The motion of a rocket being propelled
vertically is governed by the equations
ÿ(t) = u(t) − g.

1. ∫_a^b e²(t) dt,
2. max_{a≤t≤b} |e(t)|,
3. ∫_a^b |e(t)| dt.
5. Estimation. Estimation problems are really a special class of approximation problems. We seek to estimate some quantity from imperfect
observations of it or from observations which are statistically correlated
but not deterministically related to it. Loosely speaking, the problem
amounts to approximating the unobservable quantity by a combination of
the observable ones. For example, the position of a random maneuvering
airplane at some future time might reasonably be estimated by a linear
combination of past measurements of its position.
Another example of estimation arises in connection with triangulation
problems such as in location of forest fires, ships at sea, or remote stars.
Suppose there are three lookout stations, each of which measures the angle
of the line-of-sight from the station to the observed object. The situation is
illustrated in Figure 1.1. Given these three angles, what is the best estimate
of the object's location?
Figure 1.1 A triangulation problem

Figure 1.2

1.3 The Main Principles
Hahn-Banach theorem states (in simplest form) that given a sphere and a
point not in the sphere there is a hyperplane separating the point and the
sphere. This version of the theorem, together with the associated notions
of hyperplanes, is the basis for most of the theory beyond Chapter 5.
Figure 1.3 Duality

1.4 Organization of the Book
Before our discussion of optimization can begin in earnest, certain fundamental concepts and results of linear vector space theory must be introduced. Chapter 2 is devoted to that task. The chapter consists of material
that is standard, elementary functional analysis background and is essential for further pursuit of our objectives. Anyone having some familiarity
with linear algebra and analysis should have little difficulty with this
chapter.
Chapters 3 and 4 are devoted to the projection theorem in Hilbert space
and its applications. Chapter 3 develops the general theory, illustrating it
with some applications from Fourier approximation and optimal control
theory. Chapter 4 deals solely with the applications of the projection
theorem to estimation problems including the recursive estimation and
prediction of time series as developed by Kalman.
Chapter 5 is devoted to the Hahn-Banach theorem. It is in this chapter
that we meet with full force the essential ingredients of the general theory
of optimization: hyperplanes, duality, and convexity.
Chapter 6 discusses linear transformations on a vector space and is the
last chapter devoted to the elements of linear functional analysis. The
concept of duality is pursued in this chapter through the introduction of
adjoint transformations and their relation to minimum norm problems.
The pseudoinverse of an operator in Hilbert space is discussed.
Chapters 7, 8, and 9 consider general optimization problems in linear
spaces. Two basic approaches, the local theory leading to differential
conditions and the global theory relying on convexity, are isolated and
discussed in a parallel fashion. The techniques in these chapters are a direct
outgrowth of the principles of earlier chapters, and geometric visualization
is stressed wherever possible. In the course of the development, we treat
problems from the calculus of variations, the Fenchel conjugate function
theory, Lagrange multipliers, the Kuhn-Tucker theorem, and Pontryagin's
maximum principle for optimal control problems.
Finally, Chapter 10 contains an introduction to iterative techniques for
the solution of optimization problems. Some techniques in this chapter
are quite different than those in previous chapters, but many are based on
extensions of the same logic and geometrical considerations found to be so
fruitful throughout the book. The methods discussed include successive
approximation, Newton's method, steepest descent, conjugate gradients,
the primal-dual method, and penalty functions.
2

LINEAR SPACES
2.1 Introduction
The first few sections of this chapter introduce the concept of a vector
space and explon: the elementary properties resulting from the basic
definition. The notions of subspace, linear independence, convexity, and
dimension are developed and illustrated by examples. The material is
largely review for most readers since it duplicates the first part of standard
courses in linear algebra.
The second part of the chapter discusses the basic properties of normed
linear spaces. A normed linear space is a vector space having a measure of
distance or length defined on it. With the introduction of a norm, it
becomes possible to define analytical or topological properties such as
convergence and open and closed sets. Therefore, that portion of the chapter introduces and explores these basic concepts which distinguish functional analysis from linear algebra.
VECTOR SPACES

2.2 Definition and Examples

Definition. A vector space X is a set of elements called vectors together with two operations, addition and scalar multiplication, which satisfy the following axioms:
1. x + y = y + x. (commutative law)
2. (x + y) + z = x + (y + z). (associative law)
3. There is a null vector θ in X such that x + θ = x for all x in X.
4. α(x + y) = αx + αy. (distributive laws)
5. (α + β)x = αx + βx.
6. (αβ)x = α(βx). (associative law)
7. 0x = θ, 1x = x.

Proposition 1. In any vector space:
1. x + y = x + z implies y = z. (cancellation laws)
2. αx = αy and α ≠ 0 imply x = y.
3. αx = βx and x ≠ θ imply α = β.
4. (α − β)x = αx − βx. (distributive laws)
5. α(x − y) = αx − αy.
6. αθ = θ.
The real number ξk is referred to as the k-th component of the vector. Two vectors are equal if their corresponding components are equal. The null vector is defined as θ = (0, 0, ..., 0). If x = (ξ1, ξ2, ..., ξn) and y = (η1, η2, ..., ηn), the vector x + y is defined as the n-tuple whose k-th component is ξk + ηk. The vector αx, where α is a (real) scalar, is the n-tuple whose k-th component is αξk. The axioms in the definition are verified by checking for equality among components. For example, if x = (ξ1, ξ2, ..., ξn), the relation ξk + 0 = ξk implies x + θ = x.
This space, n-dimensional real coordinate space, is denoted by Rn. The corresponding complex space consisting of n-tuples of complex numbers is denoted by Cn.
At this point we are, strictly speaking, somewhat prematurely introducing the term dimensionality. Later in this chapter the notion of dimension is defined, and it is proved that these spaces are in fact n dimensional.
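The componentwise verification just described is mechanical enough to hand to a computer; the following sketch (illustrative, with random vectors in R4) checks several of the axioms numerically:

```python
# Componentwise spot-check of vector space axioms in R^n,
# mirroring the verification described above (random illustrative data).
import numpy as np

rng = np.random.default_rng(1)
x, y, z = (rng.standard_normal(4) for _ in range(3))
alpha, beta = 2.5, -1.3
theta = np.zeros(4)                            # the null vector

assert np.allclose(x + y, y + x)               # commutative law
assert np.allclose((x + y) + z, x + (y + z))   # associative law
assert np.allclose(x + theta, x)               # null vector
assert np.allclose(alpha * (x + y), alpha * x + alpha * y)
assert np.allclose((alpha + beta) * x, alpha * x + beta * x)
assert np.allclose((alpha * beta) * x, alpha * (beta * x))
assert np.allclose(0 * x, theta) and np.allclose(1 * x, x)
print("axioms verified componentwise")
```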
Definition. Let X and Y be vector spaces over the same field of scalars.
Then the Cartesian product of X and Y, denoted X × Y, consists of the collection of ordered pairs (x, y) with x ∈ X, y ∈ Y. Addition and scalar multiplication are defined on X × Y by (x1, y1) + (x2, y2) = (x1 + x2, y1 + y2) and α(x, y) = (αx, αy).
That the above definition is consistent with the axioms of a vector space is obvious. The definition is easily generalized to the product of n vector spaces X1, X2, ..., Xn. We write X^n for the product of a vector space with itself n times.
2.3 Subspaces, Linear Combinations, and Linear Varieties

containing the origin is a line containing the origin. This is a special case of the following result.
Proposition 1. Let M and N be subspaces of a vector space X. Then the intersection, M ∩ N, of M and N is a subspace of X.
The union of two subspaces in a vector space is not necessarily a subspace. In the plane, for example, the union of two (noncolinear) lines
through the origin does not contain arbitrary sums of its elements. However, two subspaces may be combined to form a larger subspace by introducing the notion of the sum of two sets.
Definition. The sum of two subsets Sand T in a vector space, denoted
S + T, consists of all vectors of the form s + t where s E Sand t E T.
Figure 2.1

Proposition 2. Let M and N be subspaces of a vector space X. Then their sum, M + N, is a subspace of X.
Proof. Suppose x, y ∈ M + N, say x = m1 + n1, y = m2 + n2. For any scalars α, β: αx + βy = (αm1 + βm2) + (αn1 + βn2). Therefore, αx + βy can be expressed as the sum of a vector in M and a vector in N and hence belongs to M + N. ∎
A linear combination of the vectors x1, x2, ..., xn in a vector space is a sum of the form α1x1 + α2x2 + ... + αnxn.
Actually, addition has been defined previously only for a sum of two
vectors. To form a sum consisting of n elements, the sum must be performed
two at a time. It follows from the axioms, however, that analogous to the
corresponding operations with real numbers, the result is independent of
the order of summation. There is thus no ambiguity in the simplified
notation.
It is apparent that a linear combination of vectors from a subspace is
also in the subspace. Conversely, linear combinations can be used to
construct a subspace from an arbitrary subset in a vector space.
Definition. Suppose S is a subset of a vector space X. The set [S], called
the subspace generated by S, consists of all vectors in X which are linear
combinations of vectors in S.
A linear variety¹ V can be written as V = x0 + M where M is a subspace. In this representation the subspace M is unique, but any vector in V can serve as x0. This is illustrated in Figure 2.2.
Given a subset S, we can construct the smallest linear variety containing S.
Figure 2.2

2.4 Convexity and Cones

Definition. A set K in a vector space is said to be convex if, for any x1, x2 ∈ K, all points of the form αx1 + (1 − α)x2 with 0 ≤ α ≤ 1 are in K.
This definition merely says that given two points in a convex set, the line segment between them is also in the set. Examples are shown in Figure 2.3. Note in particular that subspaces and linear varieties are convex. The empty set is considered convex.
1 Other names for a linear variety include: flat, affine subspace, and linear manifold.
The term linear manifold, although commonly used, is reserved by many authors as an
alternative term for subspace.
Figure 2.3 Convex and nonconvex sets

Lemma 1. Let 𝒞 be an arbitrary collection of convex sets. Then the intersection C = ∩_{K∈𝒞} K is convex.
Proof. Let C = ∩_{K∈𝒞} K. If C is empty, the lemma is trivially proved. Assume that x1, x2 ∈ C and select α, 0 ≤ α ≤ 1. Then x1, x2 ∈ K for all K ∈ 𝒞, and since K is convex, αx1 + (1 − α)x2 ∈ K for all K ∈ 𝒞. Thus αx1 + (1 − α)x2 ∈ C and C is convex. ∎

Definition. A set C in a vector space is said to be a cone with vertex at the origin if x ∈ C implies that αx ∈ C for all α ≥ 0.
Several cases are shown in Figure 2.5. A cone with vertex p is defined as a translation p + C of a cone C with vertex at the origin. If the vertex of a cone is not explicitly mentioned, it is assumed to be the origin. A convex cone is, of course, defined as a set which is both convex and a cone. Of the cones in Figure 2.5, only (b) is a convex cone. Cones again generalize the concepts of subspace and linear variety since both of these are convex cones.

Figure 2.5

In Rn, the set
{x : x = (ξ1, ξ2, ..., ξn), ξi ≥ 0 for all i},
defining the positive orthant, is a convex cone. Likewise, the set of all nonnegative continuous functions is a convex cone in the vector space of continuous functions.
2.5 Linear Independence and Dimension
In this section we first introduce the concept of linear independence, which
is important for any general study of vector space, and then review the
essential distinguishing features of finite-dimensional space: basis and
dimension.
Definition. A vector x is said to be linearly dependent upon a set S of vectors if x can be expressed as a linear combination of vectors from S. Equivalently, x is linearly dependent upon S if x is in [S], the subspace generated by S. Conversely, the vector x is said to be linearly independent of the set S if it is not linearly dependent on S; a set of vectors is said to be a linearly independent set if each vector in the set is linearly independent of the remainder of the set.
Thus, two vectors are linearly independent if they do not lie on a common line through the origin, three vectors are linearly independent if they do not lie in a plane through the origin, etc. It follows from our definition that the vector θ is dependent on any given vector x since θ = 0x. Also, by convention, the set consisting of θ alone is understood to be a dependent set. On the other hand, a set consisting of a single nonzero vector is an independent set. With these conventions the following major test for linear independence is applicable even to trivial sets consisting of a single vector.
Theorem 2. Any two bases for a finite-dimensional vector space contain the
same number of elements.
Proof. Suppose that {x1, x2, ..., xn} and {y1, y2, ..., ym} are bases for a vector space X. Suppose also that m ≥ n. We shall substitute y vectors for x vectors one by one in the first basis until all the x vectors are replaced.
Since the xi's form a basis, the vector y1 may be expressed as a linear combination of them, say, y1 = Σ_{i=1}^n αi xi. Since y1 ≠ θ, at least one of the scalars αi must be nonzero. Rearranging the xi's if necessary, it may be assumed that α1 ≠ 0. Then x1 may be expressed as a linear combination of y1, x2, x3, ..., xn by the formula
x1 = (1/α1)(y1 − Σ_{i=2}^n αi xi).
The set y1, x2, ..., xn generates X since any linear combination of the original xi's becomes an equivalent linear combination of this new set when x1 is replaced according to the above formula.
Suppose now that k − 1 of the vectors xi have been replaced by the first k − 1 yi's. The vector yk can be expressed as a linear combination of y1, y2, ..., y_{k−1}, x_k, ..., x_n, say,
yk = Σ_{i=1}^{k−1} αi yi + Σ_{i=k}^n βi xi.
Since the vectors y1, y2, ..., yk are linearly independent, not all the βi's can be zero. Rearranging x_k, x_{k+1}, ..., x_n if necessary, it may be assumed that βk ≠ 0. Then x_k can be solved for as a linear combination of y1, y2, ..., yk, x_{k+1}, ..., x_n, and this new set of vectors generates X. By induction on k then, we can replace all n of the xi's by yi's, forming a generating set at each step. This implies that the independent vectors y1, y2, ..., yn generate X and hence form a basis for X. Therefore, by the assumption of linear independence of {y1, y2, ..., ym}, we must have n = m. ∎
in finite-dimensional spaces. Furthermore, there are theorems and concepts in finite-dimensional space which have no direct counterpart in
infinite-dimensional spaces. Occasionally, the availability of these special
characteristics of finite dimensionality is essential to obtaining a solution
to a particular problem. It is more usual, however, that results first derived
for finite-dimensional spaces do have direct analogs in more general
spaces. In these cases, verification of the corresponding result in infinite-dimensional space often enhances our understanding by indicating precisely which properties of the space are responsible for the result. We constantly endeavor to stress the similarities between infinite- and finite-dimensional spaces rather than the few minor differences.
NORMED LINEAR SPACES

2.6 Definition and Examples

Definition. A normed linear space is a vector space X on which there is defined a real-valued function, called the norm and written ||x||, satisfying:
1. ||x|| ≥ 0 for all x ∈ X, and ||x|| = 0 if and only if x = θ.
2. ||x + y|| ≤ ||x|| + ||y|| for each x, y ∈ X. (triangle inequality)
3. ||αx|| = |α| · ||x|| for all scalars α and each x ∈ X.
A useful consequence of the triangle inequality is the relation | ||x|| − ||y|| | ≤ ||x − y||.

Example 1. The normed linear space C[a, b] consists of continuous functions on the real interval [a, b] together with the norm ||x|| = max_{a≤t≤b} |x(t)|.
This space was considered as a vector space in Section 2.2. We now verify
that the proposed norm satisfies the three required axioms. Obviously,
||x|| ≥ 0 and is zero only for the function which is identically zero. The triangle inequality follows from the relation
max_{a≤t≤b} |x(t) + y(t)| ≤ max_{a≤t≤b} [|x(t)| + |y(t)|] ≤ max_{a≤t≤b} |x(t)| + max_{a≤t≤b} |y(t)|.
Example 2. The normed linear space D[a, b] consists of the functions on [a, b] which are continuous and have continuous derivatives, with the norm ||x|| = max_{a≤t≤b} |x(t)| + max_{a≤t≤b} |ẋ(t)|.
A function x on [a, b] is said to be of bounded variation if there is a constant K such that, for every partition a = t0 < t1 < ... < tn = b of [a, b],
Σ_{i=1}^n |x(ti) − x(t_{i−1})| ≤ K.
The total variation of x is then defined as
T.V.(x) = sup Σ_i |x(ti) − x(t_{i−1})|,
where the supremum is taken with respect to all partitions of [a, b]. A convenient and suggestive notation for the total variation is
T.V.(x) = ∫_a^b |dx(t)|.
The total variation of a constant function is zero and the total variation of
a monotonic function is the absolute value of the difference between the
function values at the end points a and b.
The space BV[a, b] is defined as the space of all functions of bounded variation on [a, b] together with the norm defined as
||x|| = |x(a)| + T.V.(x).
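On a finite partition the BV norm can be estimated directly from the sums in the definition; the following sketch uses illustrative data (a finite partition gives only a lower estimate of the true total variation):

```python
# Estimating T.V.(x) and the BV[a, b] norm |x(a)| + T.V.(x) on a
# finite partition (illustrative data; a finite partition gives a
# lower estimate of the true total variation).
import numpy as np

t = np.linspace(0.0, 1.0, 2001)      # a partition of [0, 1]
x = np.cos(4 * np.pi * t)            # a sample function of bounded variation

tv = np.sum(np.abs(np.diff(x)))      # sum of |x(t_i) - x(t_{i-1})|
bv_norm = abs(x[0]) + tv
print(tv, bv_norm)                   # for monotone x, tv = |x(b) - x(a)|
```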
2.7 Open and Closed Sets

The empty set is open since its interior is also empty. The entire space is an open set. The unit sphere consisting of all vectors x with ||x|| < 1 is an open set. We leave it to the reader to verify that the interior of any set is open.
Definition. A point x is said to be a closure point of a set P if, given ε > 0, there is a point p ∈ P satisfying ||x − p|| < ε. The collection of all closure points of P is called the closure of P and is denoted P̄.
The empty set and the whole space X are closed as well as open. The unit sphere consisting of all points x with ||x|| ≤ 1 is a closed set. A single point is a closed set. It is clear that the closure of P̄ is P̄ itself.
The proofs of the following two complementary results are left to the
reader.
2.8 Convergence

Proposition 1. If a sequence converges, its limit is unique.
Proof. Suppose xn → x and xn → y. Then
||x − y|| = ||x − xn + xn − y|| ≤ ||x − xn|| + ||xn − y|| → 0.
Thus, x = y. ∎
that ||T(xn) − T(x0)|| ≥ ε. Since xn → x0, this implies that for every δ > 0 there is a point xn with ||xn − x0|| < δ and ||T(xn) − T(x0)|| ≥ ε. This proves the "only if" portion by contraposition. ∎
*2.10 The lp and Lp Spaces
In this section we discuss some classical normed spaces that are useful
throughout the book.
Let p be a real number, 1 ≤ p < ∞. A sequence x = {ξ1, ξ2, ...} of scalars is in lp if
Σ_{i=1}^∞ |ξi|^p < ∞.
The norm on lp is defined as
||x||_p = (Σ_{i=1}^∞ |ξi|^p)^{1/p}.
Theorem 1. (The Hölder Inequality) If x = {ξi} ∈ lp and y = {ηi} ∈ lq, where 1 ≤ p ≤ ∞ and 1/p + 1/q = 1, then
Σ_{i=1}^∞ |ξi ηi| ≤ ||x||_p ||y||_q.
Equality holds if and only if
(|ξi| / ||x||_p)^{1/q} = (|ηi| / ||y||_q)^{1/p}
for each i.
Proof. The cases p = 1, q = ∞ and p = ∞, q = 1 are straightforward and are left to the reader. Therefore, it is assumed that 1 < p < ∞, 1 < q < ∞. We first prove the auxiliary inequality: For a ≥ 0, b ≥ 0, and 0 < λ < 1, we have
a^λ b^{1−λ} ≤ λa + (1 − λ)b.
To prove this, consider the function
f(t) = t^λ − λt + λ − 1
defined for t ≥ 0. Then f′(t) = λ(t^{λ−1} − 1). Since 0 < λ < 1, we have f′(t) > 0 for 0 < t < 1 and f′(t) < 0 for t > 1. It follows that for t ≥ 0,
f(t) ≤ f(1) = 0
with equality only for t = 1. Hence
t^λ ≤ λt + 1 − λ
with equality only for t = 1. If b ≠ 0, the substitution t = a/b gives the desired inequality, while for b = 0 the inequality is trivial.
Applying this inequality to the numbers
a = (|ξi| / ||x||_p)^p, b = (|ηi| / ||y||_q)^q, with λ = 1/p, 1 − λ = 1/q,
and summing over i, we obtain
Σ_{i=1}^∞ |ξi ηi| / (||x||_p ||y||_q) ≤ 1/p + 1/q = 1.
The conditions for equality follow directly from the required condition a = b in the auxiliary inequality. ∎
The special case p = 2, q = 2 is of major importance. In this case, Hölder's inequality becomes the well-known Cauchy-Schwarz inequality for sequences:
Σ_{i=1}^∞ |ξi ηi| ≤ (Σ_{i=1}^∞ |ξi|²)^{1/2} (Σ_{i=1}^∞ |ηi|²)^{1/2}.
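A quick numerical spot-check of the inequality (random illustrative data, with p = 3 and its conjugate exponent) can be written as follows:

```python
# Numerical spot-check of the Holder inequality for sequences.
# Random illustrative data; p = 3 and its conjugate exponent q.
import numpy as np

rng = np.random.default_rng(0)
xi = rng.standard_normal(100)
eta = rng.standard_normal(100)
p = 3.0
q = p / (p - 1.0)                    # conjugate exponent: 1/p + 1/q = 1

lhs = np.sum(np.abs(xi * eta))
rhs = np.sum(np.abs(xi)**p)**(1/p) * np.sum(np.abs(eta)**q)**(1/q)
print(lhs, rhs, lhs <= rhs)          # the inequality holds
```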
Theorem 2. (The Minkowski Inequality) If x and y are in lp, 1 ≤ p ≤ ∞, then so is x + y, and ||x + y||_p ≤ ||x||_p + ||y||_p. For 1 < p < ∞, equality holds if and only if k1 x = k2 y for some positive constants k1 and k2.
Proof. The cases p = 1 and p = ∞ are straightforward. For 1 < p < ∞ we write
Σ_{i=1}^∞ |ξi + ηi|^p ≤ Σ_{i=1}^∞ |ξi + ηi|^{p−1} |ξi| + Σ_{i=1}^∞ |ξi + ηi|^{p−1} |ηi|.
Applying the Hölder inequality to each sum on the right, and noting that (p − 1)q = p, we obtain
Σ_{i=1}^∞ |ξi + ηi|^p ≤ (Σ_{i=1}^∞ |ξi + ηi|^p)^{1/q} [(Σ_{i=1}^∞ |ξi|^p)^{1/p} + (Σ_{i=1}^∞ |ηi|^p)^{1/p}].
Dividing both sides by (Σ_{i=1}^∞ |ξi + ηi|^p)^{1/q} gives
||x + y||_p ≤ ||x||_p + ||y||_p.
The conditions for equality follow from the conditions for equality in the Hölder inequality and are left to the reader. ∎
The Lp and Rp spaces are defined analogously to the lp spaces. For p ≥ 1, the space Lp[a, b] consists of those real-valued measurable functions x on the interval [a, b] for which |x(t)|^p is Lebesgue integrable. The norm on this space is defined as
||x||_p = (∫_a^b |x(t)|^p dt)^{1/p}.
The space L∞[a, b] consists of the measurable functions on [a, b] that are bounded, except possibly on a set of measure zero; the norm ||x||∞ is the essential supremum of |x(t)|. For example, the function on [−1, 1] defined by
x(t) = 1 − t² for t ≠ 0, x(0) = 2
has ||x||∞ = 1. (See Figure 2.6.)
If x ∈ Lp[a, b] and y ∈ Lq[a, b], where 1/p + 1/q = 1, then the Hölder inequality takes the form
∫_a^b |x(t)y(t)| dt ≤ ||x||_p ||y||_q.
² Two functions are said to be equal almost everywhere (a.e.) on [a, b] if they differ on a set of Lebesgue measure zero.
Equality holds if and only if
(|x(t)| / ||x||_p)^p = (|y(t)| / ||y||_q)^q
almost everywhere on [a, b].

Figure 2.6 The function for Example 1
xn(t) = 0 for 0 ≤ t ≤ 1/2 − 1/n,
xn(t) = nt − n/2 + 1 for 1/2 − 1/n < t ≤ 1/2,
xn(t) = 1 for t > 1/2.

Each xn is in X and has its norm equal to unity. The sequence {xn} is a Cauchy sequence since, as is easily verified, ||xn − xm|| → 0 as n, m → ∞. It is obvious, however, that there is no element of X (with a finite number of nonzero components) to which this sequence converges.
||xn|| = ||xn − xN + xN|| ≤ ||xN|| + ||xn − xN|| ≤ ||xN|| + 1. ∎
|x(t + δ) − x(t)| ≤ |x(t + δ) − xn(t + δ)| + |xn(t + δ) − xn(t)| + |xn(t) − x(t)|.
Let {xn} be a Cauchy sequence in lp, 1 ≤ p < ∞, with xn = {ξ1^n, ξ2^n, ...}. For each k, |ξk^n − ξk^m| ≤ ||xn − xm||, so the scalar sequence {ξk^n} converges to a limit ξk; let x = {ξ1, ξ2, ...}. Since {xn} is bounded, there is a constant M with
Σ_{i=1}^k |ξi^n|^p ≤ M^p.
Since the left member of the inequality is a finite sum, the inequality remains valid as n → ∞; therefore,
Σ_{i=1}^k |ξi|^p ≤ M^p,
and hence x ∈ lp.
Σ_{i=1}^k |ξi^n − ξi|^p ≤ ε
for n > N. Now letting k → ∞, we deduce that ||xn − x||^p ≤ ε for n > N and, therefore, xn → x. This completes the proof for 1 ≤ p < ∞.
Now let {xn} be a Cauchy sequence in l∞. Then |ξk^n − ξk^m| ≤ ||xn − xm|| → 0. Hence, for each k there is a real number ξk such that ξk^n → ξk. Furthermore, this convergence is uniform in k. Let x = {ξ1, ξ2, ...}. Since {xn} is Cauchy, there is a constant M such that for all n, ||xn|| ≤ M. Therefore, for each k and each n, we have |ξk^n| ≤ ||xn|| ≤ M from which it follows that x ∈ l∞ and ||x|| ≤ M. The convergence of xn to x follows directly from the uniform convergence of ξk^n → ξk in k. ∎
Example 5. Lp[0, 1], 1 ≤ p ≤ ∞, is a Banach space. We do not prove the completeness of the Lp spaces because the proof requires a fairly thorough familiarity with Lebesgue integration theory. Consider instead the space Rp consisting of all functions x on [0, 1] for which |x|^p is Riemann integrable, with norm defined as
||x||_p = (∫_0^1 |x(t)|^p dt)^{1/p}.

2.12 Complete Subsets
As the reader may verify, an equivalent definition is that f is upper semicontinuous at x0 if lim sup_{x→x0} f(x) ≤ f(x0). Clearly, if f is both upper and lower semicontinuous, it is continuous.
Definition. A set K in a normed space X is said to be compact if, given an arbitrary sequence {xi} in K, there is a subsequence {x_{i_n}} converging to an element x ∈ K.
Theorem 1. An upper semicontinuous functional f on a compact subset K of a normed space X achieves a maximum on K.
Proof. Let M = sup_{x∈K} f(x) (we allow the possibility M = ∞). There is a sequence {xi} from K such that f(xi) → M. Since K is compact, there is a convergent subsequence x_{i_n} → x ∈ K. Clearly, f(x_{i_n}) → M and, since f is upper semicontinuous, f(x) ≥ lim f(x_{i_n}) = M. Thus, since f(x) must be finite, we conclude that M < ∞ and that f(x) = M. ∎
We offer now an example of a continuous functional on the unit sphere
of C[0, 1] which does not attain a maximum (thus proving that the unit
sphere is not compact).
Define f on C[0, 1] by
f(x) = ∫_0^{1/2} x(t) dt − ∫_{1/2}^1 x(t) dt.
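The way the supremum is approached but never attained can be seen numerically; the sketch below (illustrative, with simple Riemann sums) evaluates f on a sequence of continuous ramp functions of unit norm:

```python
# Numerical illustration: f(x) approaches its supremum 1 on the unit
# sphere of C[0, 1] without attaining it (simple Riemann sums).
import numpy as np

t = np.linspace(0.0, 1.0, 200001)
dt = t[1] - t[0]
half = len(t) // 2

def f(x):
    return np.sum(x[:half]) * dt - np.sum(x[half:]) * dt

for n in [2, 10, 100, 1000]:
    # Continuous ramp of unit max-norm, switching from +1 to -1 at t = 1/2.
    xn = np.clip(n * (0.5 - t), -1.0, 1.0)
    print(n, f(xn))                  # tends to 1, but no continuous x gives 1
```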
*2.14 Quotient Spaces

||[x]|| = inf_{m∈M} ||x + m||,
i.e., ||[x]|| is the infimum of the norms of all elements in the coset [x]. The assumption that M is closed insures that ||[x]|| > 0 if [x] ≠ θ. Satisfaction
of the other two axioms for a norm is easily verified. In the case of X being
two dimensional, M one dimensional, and X/M consisting of parallel
lines, the quotient norm of one of the lines is the minimum distance of the
line from the origin.
Proposition 1. Let X be a Banach space, M a closed subspace of X, and X/M the quotient space with the quotient norm defined as above. Then X/M is also a Banach space.
The proof is left to the reader.
*2.15 Denseness and Separability
We conclude this chapter by introducing one additional topological concept, that of denseness.
Example 4. The Lp spaces, 1 ≤ p < ∞, are separable but L∞ is not separable.
The particular space L2 is considered in great detail in Chapter 3.
2.16 Problems
1. Prove Proposition I, Section 2.2.
2. Show that in a vector space (−α)x = α(−x) = −(αx), α(x − y) = αx − αy, and (α − β)x = αx − βx.
3. Let M and N be subspaces in a vector space. Show that [M ∪ N] =
M+N.
4. A convex combination of the vectors x1, x2, ..., xn is a linear combination of the form α1x1 + α2x2 + ... + αnxn where αi ≥ 0 for each i and α1 + α2 + ... + αn = 1. Given a set S in a vector space, let K be
the set of vectors consisting of all convex combinations from S. Show
that K = co(S).
5. Let C and D be convex cones in a vector space. Show that C ∩ D and
C + D are convex cones.
6. Prove that the union of an arbitrary collection of open sets is open and
that the intersection of a finite collection of open sets is open.
3. |x + y| ≤ |x| + |y|.
Let M = {x : |x| = 0}. Show that M is a subspace and that
||[x]|| = inf_{m∈M} |x + m| = |x|
defines a norm on the quotient space X/M.
20. Let X be the space of all functions x on [0, 1] which vanish at all but a countable number of points and for which
||x|| = Σ_{n=1}^∞ |x(tn)| < ∞,
where {tn} is the set of points at which x does not vanish.
REFERENCES
There are a number of excellent texts on linear spaces that can be used to supplement this chapter.
2.2. For the basic concepts of vector space, an established standard is Halmos
[68].
2.4. Convexity is an important aspect of a number of mathematical areas. For
an introductory discussion, consult Eggleston [47] or, more advanced,
Valentine [148].
2.6-9. For general discussions of functional analysis, refer to any of the following. They are listed in approximate order of increasing level: Kolmogorov and Fomin [88], Simmons [139], Lusternik and Sobolev [101], Goffman and Pedrick [59], Kantorovich and Akilov [79], Taylor [145], Yosida [157], Riesz and Sz-Nagy [123], Edwards [46], Dunford and Schwartz [45], and Hille and Phillips [73].
2.10. For a readable discussion of lp and Lp spaces and a general reference on
analysis and integration theory, see Royden [131].
2.13. For the Weierstrass approximation theorem, see Apostol [10].
3
HILBERT SPACE
3.1 Introduction
Every student of high school geometry learns that the shortest distance
from a point to a line is given by the perpendicular from the point to the
line. This highly intuitive result is easily generalized to the problem of
finding the shortest distance from a point to a plane; furthermore one might
reasonably conjecture that in n-dimensional Euclidean space the shortest
vector from a point to a subspace is orthogonal to the subspace. This is, in fact, a special case of one of the most powerful and important optimization principles: the projection theorem.
The key concept in this observation is that of orthogonality, a concept
which is not generally available in normed space but which is available in
Hilbert space. A Hilbert space is simply a special form of normed space
having an inner product defined which is analogous to the dot product of
two vectors in analytic geometry. Two vectors are then defined as
orthogonal if their inner product is zero.
Hilbert spaces, equipped with their inner products, possess a wealth of
structural properties generalizing many of our geometrical insights for two
and three dimensions. Correspondingly, these structural properties imply a
wealth of analytical results applicable to problems formulated in Hilbert space. The concepts of orthonormal bases, Fourier series, and least-squares minimization all have natural settings in Hilbert space.
PRE-HILBERT SPACES

3.2 Inner Products

Definition. A pre-Hilbert space is a linear vector space X together with an inner product defined on X × X. That is, with each pair of vectors x, y there is associated a scalar (x|y) satisfying:
1. (x|y) = (y|x)‾.
2. (x + y|z) = (x|z) + (y|z).
3. (λx|y) = λ(x|y).
4. (x|x) ≥ 0 and (x|x) = 0 if and only if x = θ.
The bar on the right side of axiom 1 denotes complex conjugation. The
axiom itself guarantees that (x I x) is real for each x. Together axioms 2 and
3 imply that the inner product is linear in the first entry. We distinguish
between real and complex pre-Hilbert spaces according to whether the underlying vector space is real or complex. In the case of real pre-Hilbert spaces,
it is required that the inner product be real valued; it then follows that the
inner product is linear in both entries. In this book the pre-Hilbert spaces
are almost exclusively considered to be real.
The quantity √(x|x) is denoted ||x||, and our first objective is to verify that it is indeed a norm in the sense of Chapter 2. Axioms 1 and 3 together give ||αx|| = |α| ||x||, and axiom 4 gives ||x|| > 0 for x ≠ θ. It is shown in Proposition 1 that || · || satisfies the triangle inequality and, hence, defines a norm on the space.
Before proving the triangle inequality, it is first necessary to prove an important lemma which is fundamental throughout this chapter.
Lemma 1. (The Cauchy-Schwarz Inequality) For all x, y in a pre-Hilbert space, |(x|y)| ≤ ||x|| ||y||. Equality holds if and only if x = λy or y = θ.
Proof. If y = θ, the inequality holds trivially. Otherwise, for any scalar λ,
0 ≤ (x − λy | x − λy) = (x|x) − λ(y|x) − λ̄(x|y) + |λ|²(y|y).
Setting λ = (x|y)/(y|y) yields
0 ≤ (x|x) − |(x|y)|²/(y|y),
from which the inequality follows. ∎
Proposition 1. On a pre-Hilbert space the function ||x|| = √(x|x) is a norm.
Proof. Only the triangle inequality remains to be verified. By the Cauchy-Schwarz inequality,
||x + y||² = (x|x) + (x|y) + (y|x) + (y|y) ≤ ||x||² + 2||x|| ||y|| + ||y||² = (||x|| + ||y||)².
Example 2. The space l2 is a pre-Hilbert space with the inner product defined as (x|y) = Σ_{i=1}^∞ ξi η̄i; the corresponding norm is ||x|| = (Σ_{i=1}^∞ |ξi|²)^{1/2}. The Hölder inequality guarantees that (x|y) is finite.
Example 3. The (real) space L2[a, b] is a pre-Hilbert space with the inner product defined as (x|y) = ∫_a^b x(t)y(t) dt. Again the Hölder inequality guarantees that (x|y) is finite.
Example 4. The space of polynomial functions on [a, b] with inner product (x|y) = ∫_a^b x(t)y(t) dt is a pre-Hilbert space. Obviously, this space is a subspace of the pre-Hilbert space L2[a, b].
There are various properties of inner products which are a direct consequence of the definition. Some of these are useful in later developments.
Lemma 2. In a pre-Hilbert space the statement (x|y) = 0 for all y implies that x = θ.
Proof. Putting y = x implies (x|x) = 0, so x = θ. ∎
In a pre-Hilbert space the parallelogram law holds:
||x + y||² + ||x − y||² = 2||x||² + 2||y||².
This last result is a generalization of a result for parallelograms in twodimensional geometry. The sum of the squares of the lengths of the
diagonals of a parallelogram is equal to twice the sum of the squares of
two adjacent sides. See Figure 3.1.
||xn|| ≤ M. Now
|(xn|yn) − (x|y)| = |(xn|yn) − (xn|y) + (xn|y) − (x|y)|
≤ ||xn|| ||yn − y|| + ||xn − x|| ||y||.
Since ||xn|| is bounded,
|(xn|yn) − (x|y)| ≤ M ||yn − y|| + ||xn − x|| ||y|| → 0. ∎
3.3 The Projection Theorem

Definition. In a pre-Hilbert space two vectors x, y are said to be orthogonal if (x|y) = 0. We symbolize this by x ⊥ y. A vector x is said to be orthogonal to a set S (written x ⊥ S) if x ⊥ s for each s ∈ S.
The concept of orthogonality has many of the consequences in pre-Hilbert spaces that it has in plane geometry. For example, the Pythagorean theorem is true in pre-Hilbert spaces.
Lemma 1. If x ⊥ y, then ||x + y||² = ||x||² + ||y||².
Proof. ||x + y||² = (x + y | x + y) = ||x||² + (x|y) + (y|x) + ||y||² = ||x||² + ||y||². ∎
Since ||x − mi||² → δ² as i → ∞, we conclude, by the parallelogram law, that ||mi − mj|| → 0 as i, j → ∞. Therefore {mi} is a Cauchy sequence, and since M is a closed subspace of a
complete space, the sequence {mj} has a limit m0 in M. By continuity of the norm, it follows that ||x − m0|| = δ. ∎
It should be noted that neither the statement nor the proof of the
existence of the minimizing vector makes explicit reference to the inner
product; only the norm is used. The proof, however, makes heavy use of
the parallelogram law, the proof of which makes heavy use of the inner
product. There is an extended version of the theorem, valid in a large class
of Banach spaces, which is developed in Chapter 5, but the theorem cannot
be extended to arbitrary Banach spaces.
3.4 Orthogonal Complements

The orthogonal complement of the set consisting only of the null vector is the whole space. For any set S, S⊥ is a closed subspace. It is a subspace because a linear combination of vectors orthogonal to a set is also orthogonal to the set. It is closed since if {xn} is a convergent sequence from S⊥, say xn → x, continuity of the inner product implies that 0 = (xn|s) → (x|s) for all s ∈ S, so x ∈ S⊥. The following proposition summarizes the basic
relations between a set and its orthogonal complement.
Proposition 1. Let S and T be subsets of a pre-Hilbert space. Then:
1. S⊥ is a closed subspace.
2. S ⊂ S⊥⊥.
3. If S ⊂ T, then T⊥ ⊂ S⊥.
4. S⊥⊥⊥ = S⊥.
Definition. We say that a vector space X is the direct sum of two subspaces M and N if every vector x ∈ X has a unique representation of the form x = m + n where m ∈ M and n ∈ N. We describe this situation by the notation X = M ⊕ N. (The only difference between direct sum and the earlier definition of sum is the added requirement of uniqueness.)
We come now to the theorem that motivates the expression "orthogonal complement" for the set of vectors orthogonal to a set. If the set is a closed subspace in a Hilbert space, its orthogonal complement contains enough additional vectors to generate the space.
Theorem 1. If M is a closed linear subspace of a Hilbert space H, then H = M ⊕ M⊥ and M = M⊥⊥.
3.5 The Gram-Schmidt Procedure

Proof. Suppose {x1, x2, ..., xn} is a finite subset of the given orthogonal set and that there are n scalars αi, i = 1, 2, ..., n, such that Σ_{i=1}^n αi xi = θ. Taking the inner product of both sides of this equation with xk produces
αk (xk | xk) = 0
or αk ||xk||² = 0. Thus, αk = 0 for each k and, according to Theorem 1, Section 2.5, the vectors are independent. ∎
In Hilbert space, orthonormal sets are greatly favored over other
linearly independent sets, and it is reassuring to know that they exist and
can be easily constructed. The next theorem states that in fact an orthonormal set can be constructed from an arbitrary linearly independent set. The
constructive method employed for the proof is referred to as the Gram-Schmidt orthogonalization procedure and is as important in its own right
as the theorem itself.
The first member of the orthonormal set is taken to be e1 = x1/||x1||, which obviously generates the same space as x1. Form e2 in two steps. First, put
z2 = x2 − (x2|e1)e1
and then e2 = z2/||z2||. By direct calculation, it is verified that z2 ⊥ e1 and e2 ⊥ e1. The vector z2 cannot be zero since x2 and e1 are linearly independent; furthermore, e2 and e1 span the same space as x1 and x2 since x2 may be expressed as a linear combination of e1 and e2.
The process is best understood in terms of the two-dimensional diagram of Figure 3.3. The vector z2 is formed by subtracting the projection of x2 on e1 from x2. In general, having e1, e2, ..., e_{n−1}, the n-th orthonormal vector is produced from
zn = xn − Σ_{j=1}^{n−1} (xn|ej) ej
and en = zn/||zn||.

Figure 3.3

The study of the Gram-Schmidt procedure and its relation to the projection theorem, approximation, and equation solving is continued in the following few sections.
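As a concrete companion to the procedure just described, here is a short sketch in Python for vectors in R^m with the usual inner product (any other inner product could be substituted for np.dot):

```python
# A direct rendering of the Gram-Schmidt procedure described above,
# for linearly independent vectors in R^m with the usual inner product.
import numpy as np

def gram_schmidt(xs):
    """Orthonormalize a list of linearly independent vectors."""
    es = []
    for x in xs:
        # z_n = x_n - sum_j (x_n | e_j) e_j
        z = x - sum(np.dot(x, e) * e for e in es)
        es.append(z / np.linalg.norm(z))   # e_n = z_n / ||z_n||
    return es

xs = [np.array([1.0, 1.0, 0.0]),
      np.array([1.0, 0.0, 1.0]),
      np.array([0.0, 1.0, 1.0])]
es = gram_schmidt(xs)
# (e_i | e_j) should form the identity matrix:
print(np.round([[np.dot(a, b) for b in es] for a in es], 12))
```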
APPROXIMATION

3.6 The Normal Equations and Gram Matrices

(x − α1y1 − α2y2 − ... − αn yn | yi) = 0, i = 1, 2, ..., n,
or, equivalently,
(y1|y1)α1 + (y2|y1)α2 + ... + (yn|y1)αn = (x|y1)
(y1|y2)α1 + (y2|y2)α2 + ... + (yn|y2)αn = (x|y2)
. . . . . . . . . .
(y1|yn)α1 + (y2|yn)α2 + ... + (yn|yn)αn = (x|yn).
These equations in the n coefficients αi are known as the normal equations for the minimization problem.
Corresponding to the vectors y1, y2, ..., yn, the n × n matrix
G = G(y1, y2, ..., yn) =
| (y1|y1) (y1|y2) ... (y1|yn) |
| (y2|y1) (y2|y2) ... (y2|yn) |
| ........................... |
| (yn|y1) (yn|y2) ... (yn|yn) |
is called the Gram matrix of y1, y2, ..., yn. It is the transpose of the coefficient matrix of the normal equations. (In a real pre-Hilbert space the matrix is symmetric. In a complex space its transpose is its complex conjugate.) The determinant g = g(y1, y2, ..., yn) of the Gram matrix is known as the Gram determinant.
The approximation problem is solved once the normal equations are
solved. In order that this set of equations be uniquely solvable, it is necessary and sufficient that the Gram determinant be nonzero.
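In a finite-dimensional setting the whole computation collapses to a few lines; the sketch below (illustrative random data in R^5, where inner products are dot products) forms the Gram matrix, solves the normal equations, and checks the orthogonality of the residual:

```python
# Best approximation of x by span{y1, ..., yn} via the normal equations.
# Finite-dimensional sketch with random illustrative data; in function
# spaces the inner products would be integrals instead of dot products.
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal((5, 3))      # columns are y1, y2, y3 in R^5
x = rng.standard_normal(5)

G = Y.T @ Y                          # Gram matrix of the y_i
c = Y.T @ x                          # right-hand sides (x | y_i)
alpha = np.linalg.solve(G, c)        # solve the normal equations
xhat = Y @ alpha                     # best approximation

# The error x - xhat is orthogonal to every y_i, as the projection
# theorem requires:
print(np.round(Y.T @ (x - xhat), 12))
```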
Proposition 1. g(y1, y2, ..., yn) ≠ 0 if and only if the vectors y1, y2, ..., yn are linearly independent.
Proof. We show equivalently that g = 0 if and only if the vectors y1, y2, ..., yn are linearly dependent. Suppose first that the yi's are linearly dependent; that is, suppose there are constants αi, not all zero, such that Σ_{i=1}^n αi yi = θ. It is then clear that the rows of the Gram determinant have a corresponding dependency and, hence, the determinant is zero.
Now assume that the Gram determinant is zero or, equivalently, that there is a linear dependency among the rows. In that case there are constants αi, not all zero, such that Σ_{i=1}^n αi (yi|yj) = 0 for all j. From this it follows that
(Σ_{i=1}^n αi yi | yj) = 0 for all j
and hence that
(Σ_{i=1}^n αi yi | Σ_{j=1}^n αj yj) = 0
or
||Σ_{i=1}^n αi yi||² = 0.
Thus, Σ_{i=1}^n αi yi = θ and the vectors y1, y2, ..., yn are linearly dependent. ∎
The minimum distance δ = ||x − x̂|| can also be expressed in terms of Gram determinants. By the projection theorem,
δ² = (x − x̂ | x) = (x|x) − Σ_{i=1}^n αi (yi|x),
and combining this equation with the normal equations and applying Cramer's rule leads to the formula
δ² = g(y1, y2, ..., yn, x) / g(y1, y2, ..., yn).
3.7 Fourier Series

Given an orthonormal sequence {ei} in a Hilbert space H, consider series of the form Σ_{i=1}^∞ αi ei with partial sums sn = Σ_{i=1}^n αi ei. Such a series converges in H if and only if Σ_{i=1}^∞ |αi|² < ∞, since
||sn − sm||² = ||Σ_{i=m+1}^n αi ei||² = Σ_{i=m+1}^n |αi|² → 0
precisely when the real series of squares is Cauchy, and H is complete.
Proof. Letting αi = (x|ei), we have
0 ≤ ||x − Σ_{i=1}^n αi ei||² = ||x||² − Σ_{i=1}^n |αi|².
Thus,
Σ_{i=1}^n |αi|² ≤ ||x||²
for all n. Hence,
Σ_{i=1}^∞ |αi|² ≤ ||x||². ∎
By Bessel's inequality, Σ_{i=1}^∞ |(x|ei)|² < ∞ and hence, by Theorem 1, the series Σ_{i=1}^∞ (x|ei) ei converges to an element x̂ in H. The difference vector x − x̂ is orthogonal to each ej since, for n ≥ j,
(x − sn | ej) = (x − Σ_{i=1}^n (x|ei) ei | ej) = (x|ej) − (x|ej) = 0,
and the inner product is continuous.
*3.8 Complete Orthonormal Sequences

Example 1. The normalized Legendre polynomials en(t), n = 0, 1, 2, ..., form a complete orthonormal sequence in L2[−1, 1]. To verify completeness, suppose f ∈ L2[−1, 1] is orthogonal to every polynomial, so that
∫_{−1}^1 t^n f(t) dt = 0, n = 0, 1, 2, ....
Define F(t) = ∫_{−1}^t f(s) ds. Then F is continuous, F(−1) = F(1) = 0, and integration by parts gives
∫_{−1}^1 t^n F(t) dt = [t^{n+1}/(n+1) F(t)]_{−1}^1 − ∫_{−1}^1 t^{n+1}/(n+1) f(t) dt = 0,
so F too is orthogonal to every polynomial. By the Weierstrass approximation theorem there is, for any ε > 0, a polynomial Q(t) = Σ_i a_i t^i
such that |F(t) − Q(t)| < ε for all t ∈ [−1, 1]. Since F is orthogonal to every polynomial, ∫_{−1}^1 F(t)Q(t) dt = 0, and therefore, using the Cauchy-Schwarz inequality,
∫_{−1}^1 |F(t)|² dt = ∫_{−1}^1 F(t)[F(t) − Q(t)] dt ≤ ε √2 (∫_{−1}^1 |F(t)|² dt)^{1/2},
so that
∫_{−1}^1 |F(t)|² dt ≤ 2ε².
Since ε is arbitrary and F is continuous, we deduce that F(t) = 0 and, hence, that f(t) = 0 (almost everywhere). This proves that the subspace of polynomials is dense in L2[−1, 1].
Example 2. Consider the complex space L2[0, 2π]. The functions
ek(t) = (1/√(2π)) e^{ikt}, k = 0, ±1, ±2, ...,
are easily seen by direct calculation to be an orthonormal sequence. Furthermore, it may be shown, in a manner similar to that for the Legendre polynomials, that the system is complete. Therefore, we obtain the classical complex Fourier expansion for an arbitrary function x ∈ L2[0, 2π]:
x(t) = Σ_{k=−∞}^∞ ck (1/√(2π)) e^{ikt},
where the Fourier coefficients ck are
ck = (x|ek) = (1/√(2π)) ∫_0^{2π} x(t) e^{−ikt} dt.
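The coefficients can be approximated by simple quadrature; the following sketch (illustrative square-wave input, plain Riemann sums) computes a few ck and rebuilds a partial sum of the series:

```python
# Approximating the Fourier coefficients c_k = (x | e_k) by quadrature
# and rebuilding a partial sum (illustrative square wave, Riemann sums).
import numpy as np

N = 4096
t = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
dt = t[1] - t[0]
x = np.where(t < np.pi, 1.0, -1.0)          # a square wave on [0, 2*pi]

def e_k(k):
    return np.exp(1j * k * t) / np.sqrt(2 * np.pi)

ks = range(-7, 8)
c = {k: np.sum(x * np.conj(e_k(k))) * dt for k in ks}
partial = sum(c[k] * e_k(k) for k in ks)    # partial Fourier sum
print(np.max(np.abs(partial.imag)))         # ~0: the partial sum is real
```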
3.9 Approximation and Fourier Series

Suppose again that we are given independent vectors y1, y2, ..., yn generating a subspace M of a Hilbert space H and wish to find the vector x̂ in M which minimizes ||x − x̂||. Rather than seeking to obtain x̂ directly as a linear combination of the yi's by solving the normal equations, we can employ the Gram-Schmidt orthogonalization procedure together with Fourier series.
First we apply the Gram-Schmidt procedure to {y1, y2, ..., yn}, obtaining an orthonormal set {e1, e2, ..., en} generating M. The vector x̂ is then given by the Fourier series
x̂ = Σ_{i=1}^n (x|ei) ei,
since x − x̂ is orthogonal to M. Thus, our original optimization problem is easily resolved once the independent vectors yi are orthonormalized. The advantage of this method is that once the ei's are found, the best approximation to any vector is easily computed.
Since solution to the approximation problem is equivalent to solution of the normal equations, it is clear that the Gram-Schmidt procedure can be interpreted as a procedure for inverting the Gram matrix. Conversely, many effective algorithms for solving the normal equations have an interpretation in terms of the minimum norm problem in Hilbert space.
We have seen that the Gram-Schmidt procedure can be used to solve
a minimum norm approximation problem. It is interesting to note that
the Gram-Schmidt procedure can itself be viewed as an approximation
problem. Given a sequence {y1, y2, y3, ..., yn} of independent vectors, the Gram-Schmidt procedure sets
ek = (yk − Σ_{j=1}^{k−1} (yk|ej) ej) / ||yk − Σ_{j=1}^{k−1} (yk|ej) ej||.
The denominator is the minimum of ||yk − ŷ|| where the minimum is over all ŷ in the subspace [y1, y2, ..., y_{k−1}] = [e1, e2, ..., e_{k−1}]. The vector ek is just a normalized version of this error. Thus, the Gram-Schmidt procedure consists of solving a series of minimum norm approximation problems by use of the projection theorem.
Alternatively, the minimum norm approximation of x on the subspace [y1, y2, ..., yn] can be found by applying the Gram-Schmidt procedure to the sequence {y1, y2, y3, ..., yn, x}. The optimal error x − x̂ is found at the last step.
Theorem 1. (Restatement of Projection Theorem) Let M be a closed subspace of a Hilbert space H. Let x be a fixed element in H and let V be the linear variety x + M. Then there is a unique vector x0 in V of minimum norm. Furthermore, x0 is orthogonal to M.
Proof. The theorem is proved by translating V by −x so that it becomes a closed subspace and then applying the projection theorem. See Figure 3.4. ∎
A point of caution is necessary here. The minimum norm solution x0 is
not orthogonal to the linear variety V but to the subspace M from which V
is obtained.
Figure 3.4
3.10 The Dual Approximation Problem
Two special kinds of linear varieties are of particular interest in optimization theory because they lead to finite-dimensional problems. The first
is the n-dimensional linear variety consisting of points of the form
x + Σ_{i=1}^n ai xi, where {x1, x2, ..., xn} is a linearly independent set in H and
X is a fixed vector in H. Problems which seek minimum norm vectors in an
n-dimensional variety can be reduced to the solution of an n-dimensional
set of normal equations as developed in Section 3.6.
The second special type of linear variety consists of all vectors x in a Hilbert space H satisfying conditions of the form
(x|y1) = c1
(x|y2) = c2
. . . . .
(x|yn) = cn,
where y1, y2, ..., yn are a set of linearly independent vectors in H and c1, c2, ..., cn are fixed constants. If we denote by M the subspace generated by y1, y2, ..., yn, it is clear that if each ci = 0, then the linear variety is the subspace M⊥. For nonzero ci's the resulting linear variety is a translation of M⊥. A linear variety of this form is said to be of codimension n since the orthogonal complement of the subspace producing it has dimension n.
We now consider the minimum norm problem of seeking the closest vector to the origin lying in a linear variety of finite codimension.
Theorem 2. Let H be a Hilbert space and {y1, y2, ..., yn} a set of linearly independent vectors in H. Among all vectors x ∈ H satisfying
(x|y1) = c1
(x|y2) = c2
. . . . .
(x|yn) = cn,
the one of minimum norm has the form
x0 = Σ_{i=1}^n βi yi,
where the coefficients βi satisfy the equations
(y1|y1)β1 + (y2|y1)β2 + ... + (yn|y1)βn = c1
. . . . .
(y1|yn)β1 + (y2|yn)β2 + ... + (yn|yn)βn = cn.
ω̇(t) + ω(t) = u(t),
where u(t) is the field current at time t. The angular position θ of the motor shaft is the time integral of ω. Assume that the motor is initially at rest, θ(0) = ω(0) = 0, and that it is desired to find the field current function u of minimum energy which rotates the shaft to the new rest position θ = 1, ω = 0 within one second. The energy is assumed to be proportional to
∫_0^1 u²(t) dt.
This is a simple control problem in which the cost criterion depends only on the control function u. The problem can be solved by treating it as a minimum norm problem in the Hilbert space L2[0, 1].
We can integrate the first-order differential equation governing the motor to obtain the explicit formula
(1) ω(1) = ∫_0^1 e^{(t−1)} u(t) dt
for the final angular velocity corresponding to any control. From the equation ω̇(t) + ω(t) = u(t) follows the equation
(2) θ(1) = ∫_0^1 u(t) dt − ω(1) = ∫_0^1 {1 − e^{(t−1)}} u(t) dt.
Equations (1) and (2) can be expressed as inner products in L2[0, 1]:
ω(1) = (y1|u), θ(1) = (y2|u),
where y1 = e^{(t−1)}, y2 = 1 − e^{(t−1)}. With these definitions the original problem is equivalent to that of finding u ∈ L2[0, 1] of minimum norm subject to the constraints
(4) 0 = (y1|u), 1 = (y2|u).
By Theorem 2 the optimal solution has the form u = β1y1 + β2y2; equivalently,
u(t) = α1 + α2 e^t,
where α1 and α2 are chosen to satisfy the two constraints (4). Evaluating the constants leads to the solution
u(t) = (1/(3 − e)) [1 + e − 2e^t].
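The solution is easy to verify numerically; the sketch below (plain midpoint quadrature, not from the text) builds the 2 × 2 Gram matrix of y1 and y2 in L2[0, 1], solves for the coefficients, and compares against the closed-form control above:

```python
# Numerical check of the minimum-energy control example: solve the
# 2x2 Gram system for u = b1*y1 + b2*y2 subject to constraints (4),
# then compare with the closed-form answer (midpoint quadrature).
import numpy as np

N = 100000
t = (np.arange(N) + 0.5) / N             # midpoints of [0, 1]
dt = 1.0 / N
y1 = np.exp(t - 1.0)
y2 = 1.0 - y1

def ip(f, g):                            # inner product in L2[0, 1]
    return np.sum(f * g) * dt

G = np.array([[ip(y1, y1), ip(y2, y1)],
              [ip(y1, y2), ip(y2, y2)]])
beta = np.linalg.solve(G, np.array([0.0, 1.0]))  # (y1|u) = 0, (y2|u) = 1
u = beta[0] * y1 + beta[1] * y2

u_closed = (1.0 + np.e - 2.0 * np.exp(t)) / (3.0 - np.e)
print(np.max(np.abs(u - u_closed)))      # agreement to quadrature accuracy
```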
We have observed that there are two basic forms of minimum norm
problems in Hilbert space that reduce to the solution of a finite number of
simultaneous linear equations. In their general forms, both problems are
concerned with finding the shortest distance from a point to a linear variety.
In one case the linear variety has finite dimension; in the other it has
finite codimension. To understand the relation between these two problems, which is one of the simplest examples of a duality relation, consider
Figure 3.5. Let M be a closed subspace of a Hilbert space H and let x be an
arbitrary vector in H. We may then formulate two problems: one of
projecting x onto M and the other of projecting x onto M⊥. The situation is completely symmetrical since M⊥⊥ = M. If we solve one of these problems, the other is automatically solved in the process since if, for instance,
*3.11 A Control Problem

J = ∫_0^T {x²(t) + u²(t)} dt,
subject to
(1) ẋ(t) = u(t),
or, equivalently,
(2) x(t) = x(0) + ∫_0^t u(τ) dτ.
Take H = L2[0, T] × L2[0, T] with inner product
((x1, u1) | (x2, u2)) = ∫_0^T [x1(t)x2(t) + u1(t)u2(t)] dt
and norm
||(x, u)||² = ∫_0^T [x²(t) + u²(t)] dt.
The set of elements (x, u) E H satisfying the constraint (2) is a linear variety
V in H. Thus, abstractly the control problem is one of finding the element
(x, u) E V having minimum norm.
, 3.12
69
Both terms on the right tend to zero as n -+ 00, from which we conclude
that x = y, the desired result.
The general quadratic loss control problem is treated further in Section 9.5.
3.12 Minimum Distance to a Convex Set
Much of the discussion in the previous sections can be generalized from
linear varieties to convex sets. The following theorem, which treats the
minimum norm problem illustrated in Figure 3.6, is a direct extension of
the proof of the projection theorem.
Theorem 1. Let x be a vector in a Hilbert space H and let K be a closed convex subset of H. Then there is a unique vector k0 ∈ K such that
||x − k0|| ≤ ||x − k||
for all k ∈ K.
Proof. Let δ = inf_{k∈K} ||x − k|| and let {ki} be a sequence from K with ||x − ki|| → δ. By convexity of K, (ki + kj)/2 is in K; hence ||x − (ki + kj)/2|| ≥ δ, and therefore, by the parallelogram law,
||ki − kj||² ≤ 2||ki − x||² + 2||kj − x||² − 4δ² → 0.
Thus {ki} is a Cauchy sequence with a limit k0 in K, and by continuity of the norm ||x − k0|| = δ. To establish uniqueness, suppose k1 also satisfies ||x − k1|| = δ. Then the sequence defined by kn = k0 for n even, kn = k1 for n odd has ||x − kn|| → δ so, by the above argument, {kn} is Cauchy and convergent. This can only happen if k1 = k0.
We show now that if k0 is the unique minimizing vector, then
(x − k0 | k − k0) ≤ 0
for all k ∈ K. Suppose to the contrary that (x − k0 | k1 − k0) = ε > 0 for some k1 ∈ K. For 0 ≤ α ≤ 1 the vectors kα = (1 − α)k0 + αk1 lie in K, and
d/dα ||x − kα||² |_{α=0} = −2(x − k0 | k1 − k0) = −2ε < 0.
Thus for some small positive α, ||x − kα|| < ||x − k0||, which contradicts the minimizing property of k0. Hence, no such k1 can exist.
To prove the converse, suppose (x − k0 | k − k0) ≤ 0 for all k ∈ K. Then for any k ∈ K,
||x − k||² = ||x − k0 + k0 − k||² = ||x − k0||² − 2(x − k0 | k − k0) + ||k0 − k||² ≥ ||x − k0||²,
and therefore k0 is the unique minimizing vector. ∎
Example 1. As an applic:ation of the above result, we consider an approximation problem with restrictions on the coefficients. Let {Yt, Y2, ... ,Yn} be
linearly independent vectors in a Hilbert space H. Given x e H, we-seek
to minimize IIx - IXtYt -- 1X2Y2 - - IXnYnll where we require IX; 0 for
each i. Such a restriction is common in many physical problems.
This general problem can be formulated abstractly as that of finding the
minimum distance from the pointx to the convex cone
K = {y: Y =
IXtYt
0 each i}.
Ik
(x --
Setting k =
+ Y; leads to
for i
(x-jcly;)sO
and setting k
all k eK.
:s; 0,
= -
IX/Y;
= 1,2, ... , n
leads to
(x -
if IX; > O.
Iy;)
GIX -
b =z
for some vector z with components Zl O. (Here IX and b are vectors with
components IX; and b;, respectively.) Furthermore, IX;Z/ = 0 for each i or,
more compactly,
(2)
IX'Z
= O.
Conditions (1) and (2) are the analog of the normal equations. They
represent necessary and sufficient conditions for the solution ofthe approximation problem but are not easily solved since they are nonlinear.
72
HILBERT SPACE
3.13
Problems
1. Let x andy be vectors in a pre-Hilbert space. Show that I(x Iy)1 = Ilxllllyll
if and only if i1.X + fiy = for some scalars i1., {3.
2. Consider the set X of all real functions x defined on (- 00, 00) for
which
lim T->Cfj 2T
J Ix(tW dt <
T
00.
-T
It is easily verified that these form a linear vector space. Let M be the
subspace on which the indicated limit is zero.
(a) Show that the space H::= XjM becomes a pre-Hilbert space when
the inner product is defined as
([x] I [y]) =
1 T
2T L/(t)y(t) dt
= Trace
[A' QB]
S6
J6
Ee N
is
3.13
PROBLEMS
73
peT, t) == a1(T)
minimizes
T
fo [x(t) -
Show that the coefficients ai(T) need not be completely recomputed for
each T but rather can be continuously updated according to a formula
of the form
74
IDLBERT SPACE
13, Show that the Gram determinant g(x" X2, ... , x n) is never negative
(thus generalizing the Cauchy-Schwarz inequality which can be
expressed as g(Xl, X2) ;;::; 0).
14. Let {Yl' Y2, ... , Yn} be linearly independent vectors in a pre-Hilbert
space X and let x be an arbitrary element of X. Show that the best
approximation to 'x in the subspace generated by the y/s has the
explicit representation
(Yn I YI) (x I YI)
(Yn I Y2) (x I Y2)
(Yl I Yl)
(YI I Y2)
(Y21 Yt)
(Y21 Y2)
(YI I Yn)
YI
(Y21 Yn)
(Yn I Yn)
Y2
Yn
-gCYI' Y2,"" Yn)
(x I Yn)
()
00
2: - = 00.
i'" ni
I
(a) Let Mk = [tn" tnz, ... , t nk ]' The result holds if and only if for each
m ;;::; 1 the minimal distance dk of from Mk goes to zero as k goes to
infinity. This is equivalent to
t.:
= O.
Show that
k
d k2
ITI (ni -
m)2
i=
(2m
+ 1)TI(m + ni + 1)2
i= 1
(b) Show that a series of the form 2:r; I log (1 + ai) diverges ifand only
1 a I diverges.
if
(c) Show that lim log d/ = - 00 jf and only if 'L;';ll/ni = 00.
L{;
k'" 00
3.13
PROBLEMS
75
16. Prove Parseval's equality: An orthonormal sequence {ei}i"=l is complete in a Hilbert space H if and only if for each x, y in H
(x Iy)
OC>
=1=L1(x I et)(etly)
17. Let {Yl'Y2' ... ,Yn} be independent and suppose {eto e2, ... , en} are
1= 1
Show that the coefficients lXi can be easily obtained from the Fourier
coefficients (x lei).
18. Let w(t) be a positive (weight) function defined on an interval [a, b] of
the real line. Assume that the integrals
exist for n = 1, 2, ....
Define the inner product of two polynomials p and q as
(p I q)
= f p(t)q(t)w(t) dt.
b
Beginning with the sequence {I, t, t 2 , ., .}, we can employ the GramSchmidt procedure to produce a sequence of orthonormal polynomials
with respect to this weight function. Show that the zeros of the real
orthonormal polynomials are real, simple, and located on the interior
of [a, b].
19. A sequence of orthonormal polynomials {en} (with respect to a
weighting function on a given interval) can be generated by applying
the Gram-Schmidt procedure to the vectors {I, t, (2, }. The procedure is straightforward but becomes progressively more complex with
each new term. A superior technique, especially suited to machine computation, is to exploit a
relation of the form
n
= 2, 3, ....
Show that such a recursive relatIon I;;xists among the orthonormal polynomials and det.ermine the coefficients an' bn,
.
20. Suppose we are to set up a special manufacturing company which will
operate for only ten months. During the ten m"nths the company is to
produce one million copies of a single product. We assume that the
manufacturing facilities have been leased for the ten-month period, but
that labor has not yet been hired. Presumably, employees will be hired
76
HILBERT SPACE
and fired during the ten-month period. Our problem is to determine how
many employees should be hired or fired in each of the ten months.
It is assumed that each employee produces one hundred items per
month. The cost of labor is proportional to the number of productive
workers, but there is an additional cost due to hiring and firing. If
u(k) workers are hired in the k-th month (negative u(k) corresponds to
firings), the processing cost can be argued to be u2 (k) because, as u
increases, people must be paid to stand in line and more nonproductive
employees must be paid.
At the end of the ten-month period all workers must be fired. Find
u(k) for k = 1,2, ... ,10.
21. Using the projection theorem, solve the finite-dimensional problem:
minimize x' Qx
subject to Ax = b
where x is an n-vector, Q a positive-definite symmetric matrix, A an
< n), and b an m-vector.
22. Let x be a vector in a Hilbert space H, and suppose {Xl' X2, . , xn}
and {Yl' Yz , ... , Ym} are sets of linearly independent vectors in H. We
seek the vector minimizing Ilx - xii while satisfying:
m x n matrix (m
(xly;)=c;,
(a) Find equations for the solution which are similar to the normal
equations.
(b) Give a geometric interpretation of the equations.
23. Consider the problem of finding the vector X of minimum norm
satisfying
(X IYi)
ci ,
..
X=
Lai)'i
i= f
G'a
and that aj = 0 if (x IYi) > c i G is the Gram matrix of {Yll Y2' .. " Yn}.
3.13
PROBLEMS
77
REFERENCES
3.1-3.8. Most of the texts on functional analysis listed as references for
Chapter 2 contain discussions of pre-Hilbert space and Hilbert space. In
addition, see Berberian [21], or Akhiezer and Glazman [3] for introductory
treatments.
3.9. Fot" a discussion of Gram matrices and polynomial approximation, see
Davis
3.l0. For some additional applications of the dual problem, see Davis [36].
3.ll. For a different proof of the existence and uniqueness to this type of
control problem, see Bellman, Glicksberg, and Gross [17].
3.13. For Problem 12, see Luenberger [99]. Reproducing kernel Hilbert space
introduced in Problems 10 and 11 is a useful concept in approximation and
estimation problems. See Aronszajn [11] and Parzen [116]. For additional
material on polynomial approximation and a treatment of Problems 14, 15,
18, and 19, see Davis [36].
4
LEAST-SQUARES ESTIMATION
4.1
Introduction
Perhaps the richest and most exciting area of application of the projection
theorem is the area of statistical estimation. It appears in virtually all
branches of science, engineering, and social science for the analysis of
experimental data, for control of systems subject to random disturbances,
or for decision making based on incomplete information.
All estimation problems discussed in this chapter are ultimatlely formulated as equivalent minimum norm problems in Hilbert space and are
resolved by an appropriate application of the projection theorem. This
approach has several practical advantages but limits our estimation criteria
to various forms of least squares. At the outset, however, it
be
pointed out that there are a number of different least-squares estimation
procedures which as a group offer broad flexibility in problem foqnulation.
The differences lie primarily in the choice of optimality criterion a\nd in the
statistical assumptions required. In this chapter three basic forms of least
squares estimation are distinguished and examined.
Least squares is, of course, only one of several established approaches to
estimation theory, the main alternatives being maximum likelihood ann
Bayesian techniques. These other techniques usually require a
statistical description of the problem variables in terms of joint probability
distribution functions, whereas least squares requires only means, variances,
and covariances. Although a thorough study of estimation theory would
certainly include other approaches as well as least squares, we limit our
discussion to those techniques that are derived as applications of the projection theorem. In complicated, multivariable problems the equatiol)s
resulting from the other approaches are often nonlinear, difficult to solve,
and impractical to implement. It is only when all variables have GaussLln
statistics that these techniques produce linear equations, in which case the
estimate is identical with that obtained by least squares. In many practical
situations then, the analyst is forced by the complexity of the problem to
either assume Gaussian statistics or to employ a least-squares approac:h.
Since the resulting estimates are identical, which is used is primarily a
matter of taste.
78
4.2
79
The first few sections of this chapter are devoted to constructing various
Hilbert spaces of random variables and to finding the appropriate normal
equations for three bask estimation problems. No new optimization techniques are involved, but these problems provide interesting applications
of the projection theorem.
tionPofxby
= Prob (x :s;
In other words,
is the probability that the random variable x assumes
a value less than or equal to the number
The expected value of any
function g of x is then
as
E[g(x)]
which may in general not be finite. Of primary interest are the quantities
the expected value of x,
E(x),
Z
E(x ),
E[(x -- E(X2],
1,
X2 :s;
Xn :s;
for all i.
80
LEAST-SQUARES ESTIMATION
and the n x n covariance matrix cov (Xl' X2 , , xn) whose ij-th element is
defined as
E {[Xi - E(Xi)] [Xj - E(Xj )]}
which in: case of zero means reduces to E(xixj)'
Two random variables Xi' Xj are said to be uncorrelated if E(XiXj) =
E(xj) E(x) or, equivalently, if the ij-th element of the covariance matrix
is zero.
With this elementary background material, we now construct a Hilbert
space of random variables. Let {Yl, Y2, ... , Ym} be a finite collection of
random variables with E(y/) < C1) for each i. We define a Hilbert space H
consisting of all random variables that are linear combinations of the y;'s.
The inner product of two elements x, Y in H is defined as
(x I y)
= E(xy).
Since x, yare linear combinations of the y/s, their inner product can be
calculated from the second-order statistics of the Yi'S. In particular, if
X = LCtiYi,Y = LPiYi' then
E(xy) = E{
Ct j
Yi)
Pj
yj)}'
4.2
81
= Kl Yl + K2 Y2 + ... + Km Ym
where the K/s are real 11 x n matrices. The resulting space is, of course,
generally larger than that which would be obtained by simply considering
linear combinations of the y/s. This specific compound construction of.llt',
although rarely
to explicitly in our development, is implicitly
responsible for the simplicity of many standard results of estimation theory.
If x and z are elements of:?t, we define their inner product as
which is the expe<:ted value of the n-dimensional inner product. A convenient notation is (x I z) = E(x'z).
The norm of an element x in the space of n-dimensional random vectors
can be written
II x II
= {Trace E(xX'W I 2,
where
xn)]
E(xx') =
[
E(X"X2)
.. I
E(xnxn)
is the expected value of the random latrix dyad xx'. Similarly, we have
(x Iz) ,; Trace E(xz').
82
LEASTSQUARES ESTIMATION
In the c'ase Of zero means the matrix E(xx') is simply, the covariance
matrix of the random variables Xi which form the components of x, These
components can, of course, be regarded as elements of a " smaller" Hilbert
space of random variables; the covariance matrix above then is seen to be
the Gram matrix corresponding to {Xl' X2, . , x n }.
Corresponding to the general case with nonzero means, we define the
corresponding covariance matrix by
cov (x) = E [(x - E(x (x - E(x'].
4.3
83
Figure 4.1
W'WP= W'y.
Since the columns of Ware assumed to be linearly independent, the Gram
matrix W' W is nonsingular and the result follows. I
There is an extensive theory dealing with the case where W' W is singular.
See the references at the end of this chapter and the discussion of pseudoinverses in Section 6.11.
,
Although Theorem 1 is actually only a simple finite-dimensional version
of the general approximation problem treated in Chapter 3, the solution is
stated here explicitly in matrix notation so that the result can be easily
compared with other estimation techniques derived in this chapter.
84
LEAST-SQUARES ESTIMATION
= Wp + B.
E[II P- PII 2 ],
where 1/ I is the ordinary Euclidean n-space norm. If, however, this error
is written explicitly in terms of the problem variables, we obtain
(1)
E[IIP - P11 2 ]
KW=I.
4.4
85
subject to
i = 1,2, .... n,
and
Pi
A
== k'iY
i = 1,2, ... , n,
where k/ is the i-th row of the matrix K. Therefore, the problem is really
n separate problems, one for each Pi' Minimizing each E(PI - Pi)2
minimizes the sum. Each subproblem can be considered as a constrained
minimum norm problem in a Hilbert space of random variables.
An alternative viewpoint is to consider the equivalent deterministic
problem of selecting the optimal matrix K. Returning to equations (1) and
(2), the problem is: one ofselecting the n x m matrix K to
minimize
sul::dect to
Trace {KQK'}
KW=I.
Wj
k/Qki
k/wj == Oi}'
Oij
j == 1,2, ... , n,
is the Kronecker delta function.
86
LEAST-SQUARES ESTIMATION
Introducing the inner product (x IY)Q ::::: x' Qy, the problem becomes
minimize
subject to
= 1,2, ... , n.
This is now in the form of the standard minimum norm problem treated in
Section 3.10. It is a straightforward exercise to show that
kj
= Q-1W(W'Q-1W)-lei
where ei is the n-vector with i-th component unity and the rest zero. Note
that W' Q W is the Gram matrix of the columns of W with respect to (I)Q'
Finally, combining all m of the subproblems, we obtain
K'
Wp + /l where
(/l) = 0
(/l/l')
=Q
PY]
=(W'Q- 1 W)-11
As an aside, it might be mentioned
that the justification for the terminI
ology "minimum variance unbiased" rather than "minimum covariance
trace unbiased" is that for each i,
is the minimum-variance unbiased
estimate of PI' In other words, the Gauss-Markov estimate provides a
minimum-variance unbiased estimate of each component rather than
merely a vector optimal in the sense of minimizing the sum of the individual
vanances.
Pi
4.5
MINIMUM-VARIANCE ESTIMATE
87
E(88')
4.5
Minimum-Varianc:e Estimate
= W{3 + e
but in this case both p and e are random vectors. The criterion for optimality is simply the minimization of E[IIS - {3112].
We begin by establishing an important theorem which applies in a somesetting than that described above but which is really
what more
only an application of the normal equations.
Theorem 1. (Minimum-Variance Estimate) Let y and {3 be random vectors
(not necessarily 0/ the same dimension). Assume [E(yy')] -1 exists. The
linear estimate S0/{3, based ony, minimizing E[IIP - {3112] is
P = E({3y')[E(yy')]-1 y ,
with corresponding error covariance matrix
E[(P - {3) (p
.- {3)'] =
E({3{3') - E(PP')
= E({3{3'} - E({3y') [E(yy')] -1 E(y{3'}.
Proof. This problem, like that in the last sectioh, decomposes into a
separate problem for leach component {3 i' Since there are no constraints,
subproblem is simply that of finding the best approximation of {3 i
the
within the subspace ge:nerated by the y;'s.
88
,LEAST-SQUARES ESTIMATION
[E(yy')]K' = E(yf3')
from which
that
K = E(f3y')[E(yy')J -1,
which is the desired result. Proof of the formula for the error covariance
is obtained by direCt substitution. I
Note that, as in the previous section, the term minimum variance applied
to these estimates can be taken to imply that each component of 13 is
estimated by a minimum-variance estimator. Also note that if both 13 and y
have zeto means, the estimate is unbiased in the sense that E(P) = E(f3) O.
If the means are not zero, we usually broaden the class of estimators to
include. the form P= Ky + b where b is an appropriate constant vector.
This matter is considered in Problem 6.
Returning to our original purpose, we now apply the above theorem to
a revised form of our standard problem.
Corollary 1. Suppose
y = Wf3
+8
E(f3f3') = R,
E(ef3')
= O.
..
P = RW'(WRW'
+ Q)-I y
Pn =
+ QrIWR.
E(yy') = WRW' + Q
R - RW'(WRW'
4.S
MINIMUM-VARIANCE ESTIMATE
89
............
,..'"
-::::::.-()
....
----
"-,,"-
"-
"-
"-,,-
....
90
LEAST-SQUARES ESTIMATION
= (W'Q-I W + R-1)-I ..
+ Q)-I
= (W'Q-I W
+ R-1)-1 W'Q- 1
is easily established by postmultiplying by (WR W' + Q) and premultiplying by (W'Q-I W + R- 1 ). This establishes the equivalence of the
formula for the estimate p.
From Corollary I we have
=R-
RW'(WRW'
+ Q)- 1 WR
=R -
= (W'Q-1W
+ R-1)-I{(W'Q-1W + R-1)xR
- W'Q-1WR}
=(W'Q-1W+W1)-I1
4.6
4.6
91
=0
and, hence, in matrix form the normal equations for the columns of rare
[E(yy')]r' = E[y{3'T'J
so that
= T E({3y') [E(yy')] -1
P is also the linear estimate minimizing E[({3 - P)'P({3 - p)]for any positivesemidefinite n x n matrix P.
Proof Let p I /2 be the unique positive-semidefinite square root of P.
According to Theorem 1, p I/ 2p is the minimum-variance estimate of
pl/2{3 and, hence, Pminimizes
Finally, we consider the problem of updating an optimal estimate of {3
if additional data become available. This result is of fundamental practical
importance in a number of modern sequential estidtation procedures such
as the recursive estimation of random processes discussed in Section 4.7.
Like the two properties discussed above, the answer to the updating problem
is extremely simple.
92
LEAST-SQUARES ESTIMATION
The procedure is based on the simple orthogonality properties of projection in Hilbert space. The idea is illustrated in Figure 4.3. If Y 1 and Y 2
are subspaces of a Hilbert space, the projection of a vector {3 onto the
subspace Y 1 + Y 2 is equal to the projection onto Y l plus the projection
onto Yz wtjere Y2 is orthogonal to Y I and is chosen so that Y 1 El3 Y2 =
Y I + Y 2 FUrthermore, if Y 2 is generated by a finite set of vectors,
the diffe"'!llces between these vectors and their projections onto Y l
generate Yz .
Figure 4.3
P= PI + E({3Y2')[E(Y2Y2')]-ly2
In other words, Pis
by Y2 .
Proof. It is clear that Y 1 + Y z = Y 1 El3 2 and that Yz is orthogonal
to Y 1 The result then follows immediately since the projection onto the
sum of subspaces is equal to the sum'of the individual projections if the
subspaces are orthogonal. I
4.7
RECURSIVE ESTIMATION
93
= R.
= Wp + 6
where e is a random vector which has zero mean and which is uncorrelated
we seek the updated optimal
with both p and the past
estimate of [3.
The best estimate of y based on th'" past measurements is p = wiJ and
thus y = y - wiJ. Hence, by Theorem 3,
P= iJ + E([3y')[E(YY'>r 1y
which workf> out to be
4.7
/3)'] = R -
RW'[WRW'
+ Q]-IWR.
Recursive Estimation
94
LEAST-SQUARES ESTIMATION
E[x(j)x(k)]
= rt j 0jk'
= 1.
We assume that underlying every observed random process is an orthogonal process in the sense that the variables of the observed process are
linear combinations of past values of the orthogonal process. In other
words, the given observed process results from the operation on an orthogonal process by a linear-processing device acting in real time.
Example 1. (Moving Average Process) Let {u(k)}f" _ ex:> be an orthornormal random process, and suppose that {x(k)}k'= -ex:> is generated by the
formula
ex:>
x(k):= Laju(k-j)
j=1
where the constants aj satisfy LJ= I la j/2 < 00. The process can be regarded
as a moving average of past values of the underlying orthonormal process.
Example 2. (Autoregressive Scheme of Order 1) Let {u(k)}k'o:.:.ex:> be an
orthonormal process and suppose that {x(k)} is generated by the formula
x(k) = ax(k - 1)
+ u(k - 1),
lal < 1.
x(k)
=L
aj-Iu(k - j).
j=1
+ a1x(k -
1)
+ ... + anx(k -
n) = b1 u(k - 1)
+ ... + bnu(k -
n).
In order that the formula represents a stable system (so that E[x 2 (k)]
remains finite when the formula is assumed to have been operating over the
4.7
RECURSIVE ESTIMATION
95
s" + a1 s"-1
+ ... + an = 0
has its roots within the unit circle in the complex plane. Alternatively,
we may assume that n initial random variables, say x(O), x( -1), .... ,
x( -n + I), have been specified and regard the difference formula as
generating [x(k)] for positive k only. In that case, stability is immaterial,
although in general E[x 2 (k)] may grow without bound as k -L!X). By
hypothesizing time..varying coefficients in a finite-difference scheme such
as this, we can generate a large class of random processes.
There is a great deal of physical motivation behind these models for
random processes. It is believed that basic randomness at the microscopic scale including electron emissions, molecular gas velocities, and
elementary particle fission are basically uncorrelated processes. When
their effects are observed at the macroscopic scale with, for example, a voltmeter, we obtain some average of the past microscopic effects.
The recursive approach to estimation of random processes is based on
a model for the process similar to that in Example 3. However, for convenience and generality, we choose to represent the random process as
being generated by a first-order vector difference equation rather than an
n-th order scalar difference equation. This model accommodates a larger
number of practic:al situations than the scalar model and simplifies the
notation of the analysis.
Definition. An n-dimensional dynamic model of a random process consists
of the following three parts:
+ 1) = <ll(k)x(k) + u(k),
k = 0,1,2, ... ,
where x(k) is an n-dimensional state vector, each component of
which is a random variable, <ll(k) is a known n x n matrix, and u(k)
is an n-dime:nsional random vector input of mean zero satisfying
x(k
E[u(k)u'(l)] =
+ w(k),
= 0, 1,2, ... ,
96
LEAST-SQUARES ESTIMATION
In addition, it is assumed that the random vectors x(O), u(j) and w(k) are
;?
0, k
O.
(1)
x(k
+ 11 k) = <l>(k)P(k)M'(k)[M(k)P(k)M'(k) + R(k)r I
X [v(k) - M(k)x(k I k - 1)] + <l>(k)x(k I k -
1)
P(k
+ 1) =
<l>(k)P(k){l- M'(k)[M(k)P(k)M'(k)
= M(k)x(k) + w(k)
In this section this notation should not be confused with that of an inner product.
4.8
PROBLEMS
97
(3)
+ P(k)M'(k)[M(k)P(k)M'(k) + R(k)r 1
x [v(k) - M(k)x(k I k - 1)]
+ R(k)r IM(k)P(k).
Based on this optimal estimate of x(k), we may now compute the optimal
estimate x(k + 11 k) of x(k + 1) = <1>(k)x(k) + u(k), given the observation
v(k). We do this by noting that by Theorem 1, Section 4.6, the optimal
estimate of <1>(k)x(k) is <1>(k)x(k I k), and since u(k) is orthogonal (uncorrelated) to v(k) and x(k), the optimal estimate of x(k + 1) is
x(k
(5)
+ II k) =
<1>(k)x(k I k).
P(k
+ 1) =
<1>(k)P(k I k)<1>'(k)
+ Q(k).
Substitution of equation (3) into (5) and of (4) into (6) leads directly to (1)
and (2). I
Equations (1) and (2) may at first sight seem to be quite complicated,
but it should be easily recognized that they merely represent the standard
minimum-variance formulae together with a slight modification due to the
updating process. Furthermore, although these equations do not allow for
simple hand computations or analytic expressions, their recursive structure
is ideally suited to machine calculation. The matrices P(k) can be precomputed from equation (2); then, when filtering, only a few calculations
must be made as each new measurement is received.
Theorem 1 treats only estimates of the special form x(k + 1 I k) rather
than xU I k) for arbitrary j. Solutions to the more general problem, however,
are based on the estimate x(k + 11 k). See Problems 15 and 16.
4.8 Problems
1. A single scalar f3 is measured by m different people. Assuming uncorrelated measurement errors of equal variance, show that the
minimum-variancle unbiased estimate of f3 is equal to the average of
the m measurement values.
I
2. Three observers, [ocated on a single straight line, measure the angle
of their line-of-sight to a certain object (see Figure 4.4). It is desired
to estimate the true position of the object from these meaSllrements.
98
LEAST-SQUARES ESTIMATION
Li;
0, =
(}1
R 3 )S,
RI)S/
R 2 )S/'
where Si = secant ()/ and k is chosen so that 01, O2 , &3' define a single
point of intersection.
3. A certain type of mass-spectrometer produces an output graph similar
to that shown in Figure 4.5. Ideally, this curve is made up of a linear
combination of several identical but equally displaced pulses. In other
words,
s(t) =
n-I
L fJia(t -
i)
i=O
where the fJ;'s are unknown constants and a(t) is a known function.
Assuming that
a(t)a(t - i) dt =i. show that the least-squares
estimate of the fJ;'s, given an arbitrary measurement curve s(t), is
.
1
<Xl
Pi = -_12 b;<t)s(t) dt
P -<Xl
where
bi(t) =
a(t - i) - pa(t - i - 1)
(
i= 0
+ 1)
O<i<n-l
i=n-l
4.8
PROBLEMS
99
S1.t)
>' t
4. Let P be an n-dimelilsional
vector of zero mean and positivedefinite covariance matrix Q. Suppose measurements of the form
y = Wp are made where the rank of W is m. If is the linear minimum
variance estimate of P based on y, show that the covariance of the
error P- has rank n - m.
5. Assume that the measurement vector y is obtained from Pby
y == Wp
+e
where
E(f313') = R,
E(ee') = Q,
E(f3e')
= S.
7.
100
LEAST-SQUARES ESTIMATION
. .,.... -...,..-
. .....,;- ..,..-
...-'""
ji
+ b(x- x)
where
N
1 N
ji= -N LYj,
i= 1
x=LX,
N j=1
LYjX j - Njix
= ;:...I=-.r:'--_ __
L,X,2 -Nx 2
i= I
r(Yi - ji)(x; - x)
b=
;:...1=.....:!c.......,.;N-----
L(x; - X)2
i= 1
y =, IX
+ px + e
where e is a random variable with zero mean, variance (T2, and independent of x. We are hypothesizing the existence of a linear relation
between x and y together with an additive error vallable. The values
of a and b found in Problem 8 are estimates of the parameters IX and p.
xN
4.8
PROBLEMS
10 1
L (61.- e)(xi -
x)
b = P + I._=....:I,N..-----
L (XI -
so that E(b) = p.
(b) Show that E(a)
(c) Show that
X)2
1= 1
= ct..
(12
Var (b) =
(Xi i= 1
X)2
+6
= W1 P+ 61
Y2 = W 2 P+ 62
= e, E(6 1 61') = QI'
Yl
102
ESTIMATION
+ I) =
<l>x(k)
e,
E[x(O)x'(O)]
= P.
+ e(k)
DUAL SP
5.1 Introduction
The modern theory of optimization in normed linear space is largely
centered about the interrelations between a space and its corresponding
dual-the space consisting of all continuous linear functionals on the
original space. In this chapter we consider the general construction of
dual spaces, give some e:xamples, and develop the most important theorem
in this book-the Hahn-Banach theorem.
In the remainder of the book we witness the interplay between a normed
space and its dual in a number of distinct situations. Dual space plays a
role analogous to the inner product in Hilbert space; by suitable interpretation we can develop results extending the projection theorem solution
of minimum norm problems to arbitrary normed linear spaces. Dual space
provides the setting for an optimization problem "dual" to a given
problem in the origina.l (primal) space in the sense that if one of these
problems isa minimization problem, the other is a maximization problem.
The two problems are equivalent in the sense that the optimal values of
objective functions are equal and solution of either problem leads to
solution of the other. Dual space is also essential for the development of
the concept of a gradient, which is basic for the variational analysis of
optimization problems. And finally, dual spaces provide the setting for
Lagrange multipliers, fundamental for I:' study of constrained optimization
problems.
.
Our approach in this chapter is largely geometric. To make precise
mathematical statements, however, it is necessary to translate these geometric concepts into concrete algebraic relations. In this chapter we follow
two paths to a final set of algebraic results by consitiering two different
geometrical viewpoints, corresponding to two versions of the HahnBanach theorem. The first viewpoint parallels the development of the
projection theorem, While the second is based on the idea of separating
convex sets with hyperplanes.
104
DUAL SPACES
LINEAR FUNCTIONALS
5.2
Basic Concepts
First we recall that a functional fOil a vector space X is linear if for any
two vectors x, Y E X and any two scalars a, fJ there holds f(ax + fJy) ==
af(x)
+ Pf(y)
fixed y
Proposition 1.
+ xo) - f(xo)l
The above result is most often applied to the point and continuity
thus verified by verifying that f(x n) -+ for any sequence tending to
Intimately related to the notion of continuity is the notion of boundedness.
e.
if it is continuous.
5.2
BASIC CONCEPTS
105
e.
II(x) I =
and M
<
x = {el'
f{x)
= k=t
L kek'
Ilfll
= sup Il(x)1
x <1<8
IIxll
= sup If(x) I
IIxll :s 1
= sup If(x)l.
Ilxll =1
Mllxll, for
all x EX}
106
DUAL SPACES
The reader should verify these equivalences since they are used throughout
this chapter. The norm defined in this way satisfies the usual requirements
of a norm: 11/\\ > 0; 11/\\ = 0 if and only if 1= e; 11 alii = lalll/ll; 1111 +
1211 = sup I/l(x)+/2(x)1
sup I/I(x)1 + sup I/l(x) I = Il/tll+1I/211
II xli :s; I
IIxll:s; I
lIxli s 1
In view of the preceding discussion, the following definition is justified:
Definition. Let X be a normed linear vector space. The space of
an bounded
11/11 =
sup
IIxll:s; 1
I/(x)l.
+ x!(x) I
(a + IIx!IDllxll
and x* is a bounded linear functional. Also from Ix*(x) - x!(x) I <
m > M, there follows IIx* - x!1I < a
that x! x* in X*. I
al\xll,
5.3
107
The Dual of Eft. In the space En, each vector is an n-tuple of real scalars
x=
2' ... ,
with norm Ilxll = (Li _,
12. Any functional fOf the
formf(x) = L,= I til with each tljareal numberisc1earlylinear. Also,from
the Cauchy-Schwarz inequality for finite sequences.
If(x)1 =
The Dual of lp, 1 p < 00. The Ip spaces were discussed in Section 2.10.
For every p, 1 p < 00, we define the conjugate index q = p/(p - 1), so
that l/p + l/q = 1; if p = 1, we take q = 00. We now show that the dual
space of Ip is Iq
p<
00,
is represent.
(1)
f(x)
= 1=L1'1i el
f l'1il
Ilfll = Ilyllq = {(
) I/q
1=1
sup l11kl
k
if 1 <
<
00
if p = 1.
I".
108
DUAL SPACES
for a 1 in the i-th component. Define lJi =!(el)' For any x ={el} E lp, we
have, by the continuity of/,f(x) = Ir'=l
Suppose first that I < P < 00. For a given positive integer N define the
vector XN E Ip having components
i
5, N
i >N.
Then
and
But I/(XN) I 5,
follows that
ljq
5, 11111
for all N.
i=N.
Then
IlxN11 5, I and
The Dual of Lp(O, 1), 1!;p <. 00. The Lp spaces were discussed in Section 2.10 and are the function space analogs of the lp spaces. Arguments
similar to those given in Theorem 1 show that for 1 ::; P < 00, the dual
5.3
109
f(x)
and Ilfll
= fax(t)y(t) dt
= lIyllq'
The Dual of
sequences x =
is IIxll = max
i
Co.
space H, there exists a unique vector y e H such that for all x e H, f(x) =
(x I y). Furthermore, we have IIfll = lIyll and every y determines a unique
bounded linear functional in this way.
110
DUAL SPACES
Given any x e n, we have x - f(x)z eN since f[x - f(x)z] = f(x) f(x)f(z) = O. Sincez ..L N, we have (x - f(x)z Iz) = 0 or (x Iz) = f(x)llzI1 2 or
f(x) = (x Iz/lIzI12). Thus, defining y = z/llzIl2, we havef(x) = (x Iy).
The vector y is clearly unique since if y' is any vector for whichf(x)=
(x I y') for all x we have (x I y) = f(x) = (x Iy'), or (x Iy - y') = 0 for all x
which according to Lemma 2, Section 3.2 implies y' ::::: y.
It was shown in the discussion preceding the theorem that IIfll = lIyll I
EXTENSION FORM OF THE HAHN-BANACH THEOREM
IlfllM =
If(m)1
lmf'
for aU XI' X2 X.
for all CI. ;;:: 0 and x e X.
S.4
III
+ ocg(y)
or
and hence
inf [p(m
+ y) -
f(m)].
""oM
inf [p(m
meM
+ y) -
f(m)].
+ occ.
We
112
DUAL SPACES
If ex =
-p < 0,
then
- pc + f(m) = p[ - c + f(7i) ]
Y) = p(m -
py).
Thus g(m + exy) p(m + exy) for all ex and 9 is an extension off from M
to [M+yJ.
Now let {XI' Xl, , X n , } be a countable dense set in X. From this
set of vectors select, one at a time, a subset of vectors {YI, Y2 , ... , Yn, ... }
which is independent and independent of the subspace M. The set
{YI' Y2' ... , Yn, ... } together with the subspace M generates a subspace S
dense in X.
The functionalfcan be extended to a functional 9 on the subspace S by
extendingffrom M to [M + YI], then to [[M + YI] + Y2]; and so on.
Finally, the resulting 9 (which is continuous since p is) can be extended
by continuity from the dense subspace S to the space X; suppose X E X,
then there exists a sequence {sn} of vectors in S converging to x. Define
F(x) = lim g(sn)' F is obviously linear and F(x) +- g(sn) p(sn) p(x) so
n.... co
F(x)
p(x) on X.
= IIfll11 IIxll
5.5
b]
113
f(x)
gl'
.. } E
X, we define
= i=f 1 (1 - I
f x(t) dv(t)
b
f(x) =
and such that the norm off is the total variation of v on [a, b). Conversely,
every function of bounded variation on [a, b] defines a bounded linear
functional on X in this way.
Proof Let B be the space of bounded functions on [a, b] with the
norm of an element x e B defined as !lxliB = sup Ix(t)l. The space C[a, b)
a:St:sb
l14
DUAL SPACES
L \v(ti } -
1=1
V(t l _ I )\ ==
L 81[V(tI) -
v(t,_t)]
1= 1
n
=1=1
L 81[F(u r)
- F(u r,_,)]
Thus,
since
IIFII
IIJII
and
z(t) =
L X(tI-I)[Ur,(t) i=
ur,_,(t)]
which (by the uniform continuity of x) goes to zero as the partition is made
arbitrarily fine. Thus, since F is continuous, F(z) -+ F(x} = /(x). But
n
F(z) ==
L X(ti-I)[V(t/) -
1= 1
V(ti-I)]
5.6
115
Jx(t) dv(t).
f(x) =
Jx(t) dv(t).
Therefore,
b
\ (X(t)dV(t)\
Ilxll . T.V.(v)
and, hence, II/II :s;; T.V.(v). On the other hand, we have Ilfll T.V.(v) and,
consequently, IIfll = T.V.(v).
Conversely, if v is a function of bounded variation on [a, b], the functional
f(x) =
J x(t) dv(t)
b
IlxIIT.V.(v).
With the above definition the association between the dual of C [a, b]
and NBV[a, b] is unique. However, this normalization is not necessary in
most applications, since when dealing with a specific functional, usually
any representation is: adequate.
5.6 The Second Dual Space
Let x* e X*. We often employ the notation (x, x*) for the value of the
functional x* at a point x e X. Now, given x e XI the equation f(x*) =
(x, x*) defines a functional on the space X*. The functionalfdefined on
X* in this way is linear
f(rxx!
+ px!} = (x,rxx! +
116
DUAL SPACES
X if
S.7
117
Example 1. Let X = Lp[a, b), I < p < 00, and X = Lla, b), l/p + l/q= 1.
The condition for two functions x E L p, y E Lq to be aligned follow directly
from the conditions for Ilquality in the Holder inequality, namely,
r-----,
,...
I
L ___ "''/
/ - - - v(t)
to every vector in S.
Given a subset U of the dual space X*, its orthogonJI complement uJ.
is in X**. A more useful concept in this case, however, is described below.
I
118
DUAL SPACES
Definition. Given a subset U of the dual space X*, we define the orthogonal
complement of U in X as the set .L U c: X consisting of all elements in X
orthogonal to every vector in U.
The set .L Umay be thought of as the intersection of U.L with X, where X
is considered to be imbedded in X** by the natural mapping. Many of the
relations among orthogonal complements for Hilbert space generalize to
normed spaces. In particular we have the following fundamental duality
result.
Theorem 1. Let M be a closed subspace of a normed space X. Then
.L[M.L] = M.
11111
Ilx
+ m)
1
+ mil = inf Ilx + mil
m
and since M is closed, IIfll < 00. Thus by the Hahn-Banach theorem, we
can extendfto an x* E X*. Sincefvanishes on. M, we have x* E M.L. But
also (x, x*) = 1 and thus x rf= .L[M.L]. I
5.8 Minimum Norm Problems
In this section we consider the question of determining a vector in a subspace M of a normed space which best approximates a given vector x in
the sense of minimum norm. This section thus extends the results of Chapter 3 for minimum norm problems in Hilbert space.
We recall that if M is a closed subspace in Hilbert space, there is always a
unique solution to the minimum norm problem and the solution satisfies
an orthogonality condition. Furthermore, the projection theorem leads to
a linear equation for determining the unknown optimizing vector. Even
limited experience with nonquadratic optimization problems warns us that
the situation is likely to be more complex in arbitrary normed spaces. The
optimal vector, if it exists, may not be unique and the equations for the
Nevertheless, despite these
optimal vector will generally be
difficulties, we find that the theorems of Chapter 3 have remarkable analogs
here. As before, the key concept is that of orthogonality and our principal
result is an analog of the projection theorem.
As an example of the difficulties encountered in arbitrary norrned space,
we consider a simple two-dimensional minimum norm problem that does
. not have a unique solution.
S.S
119
....
having their second component zero, and consider the fixed point x = (2, 1).
The minimum distance from x to M is obviously 1, but any vector in M of
the form m = (a, 0) where 1 S; as; 3 satisfies Ilx - mil = 1. The situation
is sketched in Figure 5.2.
x
..
"'i4.L...
vectors
(1)
meM
Ilx - mil
IIx1I s
xeM'!'
where the maximum on the right is achieved for some xci E Ml..
If the infimum on the left is achieved for some mo E M, then xci is aligned
with x - mo.
Proof For 8 > 0, let m. E M satisfy IIx - m.1I S; d + 8. Then for any
MJ.., IIx*1I S; 1, we have
(x, x*) == (x - m., x*) S; Ilx*lllIx - m.11 S; d + 8.
Since 8 waS arbitrary, we conclude that (x, x*) ::;; d. Therefore, the proof
of the first part of the theorem is complete if we exhibit any xci for which
(x, xci> = d.
x*
120
DUAL SPACES
Let N be the subspace [x + M]. Elements of N are uniquely representable in the form n = ax + m, with m E M, a real. Define the linear functional f on N by the equation fen) = ad. We have
Ilfll =
If(n)1
laid
r r
d
1.
5.8
121
whereas
IIx - mil. I
moll
(2)
= m*eML
min Ilx* - m*11
X*
IIxll SI
m*)J
IIxll S!
XliiM
IIxlI:s; !
Thus,
Ilx* - m*11
122
DUAL SPACES
Ilx*IIM ==
XI M.
For an
IIxll,; 1
xeM
IlxilM = IlxllMl
IIx*IIMl = Ilx*IIM
where on the right side of(3) the vector x is regarded as a functional on X*.
5.9 Applications
In this section we present examples of problem solution by use of the theory
developed in the last section. Three basic guidelines are applied to our
analyses of the problems considered: (1) In characterizing optimum solutions, use the alignment properties of the space and its dual. In the Lp and
lp spaces, for instance, this amounts to the conditions for equality in the
Holder inequality. (2) Try to guarantee the existence of a solution by
formulating minimum norm problems in a dual space. (3) Look at the
dual problem to see if it is easier than the original problem. The dual may
have lower dimension or be more transparent.
S.9
APPLICATIONS
123
Therefore,f - Po is not aligned with any nonzero element of NJ. and hence
r must contain at least n + 2 points. We have therefore proved the classic
result of Tonelli: If J is continuous on [a, b] and Po h the polynomial oj
degree n (or less) minimizing max IJ(t) - p(t)l, then IAt) - Po(t) I
t e[a, b]
x\
(Yl' x*) = C1
(Y2' x*)
= c2
d ==
min
(YI.X*)=CI
IIx*1I
min
m*eMJ.
IIx* - m*1I
d = min
IIx* - m*1I
d=
min
IIx*1I =
sup (x,
xeM
IIxll Sl
= l:1 =
x*).
11 Yall S
II Yall:S t
the last equality following from the fact that x* satisfies the constraints.
The quantity c'a denotes the usual inner product of the two n-vectors a
and c with components ai and ci . Furthermore, we have the alignment
properties of Theorem 2, Section 5.8, and thus the following corollary.
124
DUAL SPACES
is nonempty. Then
min
x'eD
Ilx*11
max c'a.
IIYalls!
+ O(t) =
u(t)
So e(t- J)u(t) dt = 0
Jo {1 - e(l-l)}u(t) dt == 1.
1
From Corollary 1,
min
lIuli
max
a2'
!lalYI +a2Y211 S 1
The norm on the right is the norm in X = L 1[0, 1J. Thus, the two constants aI' a2 must satisfy
!
So I(a
1 -
a2)e(t-l)
+ a21 dt s; 1.
5.9
APPLICATIONS
125
j,(t) == u(t) - 1
x(o)
= x(o)
= 0.
T2
x(T)
= fo (T -
t)u(t) dt -
'2 .
fo lu(t)1 dt.
The final time T is in general unspecified, but we approach the problem by
finding the minimum fuel expenditure for each fixed T and then minimizing
over T.
For a fixed T the optimization problem reduces to that of finding u
minimizing
T
fo lu(t)1 dt
while satisfying the single linear constraint
T
Io (T - t)u(t) dt
T2
= 1 + -.
At first sight we might regard this problem as one in Ll [0, T]. Since,
however, L1 [0, T] is not the dual of any normed space, we imbed our
problem in the space NBV[O, T] and associate control elements u with
the derivatives of ellements v in NBV[O, T]. Thus the problem becomes
that of finding the I) e NBV[O, I] minimizing
fo Idv(t)1 = T.V.(v)
T
I\vl\
subject to
T
fo
(T - t) dv(t)
T2
=1+- .
2
126
DUAL SPACES
According to Corollary 1,
min
Ilvll =
max
II(T-r)alls;'
[a(l + T2)],
2
== T lal
IIvll
T2)
= (1 + T T'
The optimal v must be aligned with (T - t)a and, hence, can vary only
at t = 0. Therefore, we conclude that v is a step function and u is an
impulse (or delta function) at t = 0. The best final time can be obtained by
differentiating the optimal fuel expenditure with respect to T. This leads
to the final result
J2
min IIvll = J2
T=
anp
v=
{ft
t=
0< t
J2.
Note that our early observation that the problem should be formulated
in NBV[O, T] rather than L, [0, T] turned out to be crucial since the
optimal u is an impulse.
*5.10 Weak Convergence
...
An interesting and important concept that arises naturally upon the introduction of the dual space is that of weak convergence. It is important for
certain problems in analysis and plays an indirect role in many optimization problems.
5.IO
WEAK CONVERGENCE
127
Xn -+
x weakly.
There are, however,. sequences that converge weakly but not strongly.
Starting with a normed space X, we form X* and define weak convergence on X in terms of X*. The same technique can be applied to X*
with weak
being defined in terms of X**. However, there is a
more important notion of convergence in X* defined in terms of X rather
than X**.
128
DUAL SPACES
Theorem 1. (Alaoglu) Let X be a real normed linear space. The closed unit
x:
+ I(Xb x:n> -
(Xb X!m> I
compact subset Sol X*. Then/is bounded on S and achieves its maximum
on S.
A special application of Theorems 1 and 2 is to the problem of maximizing (x, x*> for a fixed x E X while x* ranges over the unit sphere in X*.
S.II
129
Since the unit sphere in X* is weak* compact and <x, x*) is a weak*
continuous functional on X*, the maximum is achieved. This result is
equivalent to Corollary 2 of the Hahn-Banach theorem, Section 5.4.
GEOMETRIC FORM OF THE HAHN-BANACH THEOREM
S.11
xa,
130
DUAL SPACES
H. convergent to x
{x :f(x);::: c},
{x :f(x) > c}
S.12
131
for hyperplanes not containing the origin suggests that virtually all concepts in which X* plays a fundamental role can be visualized in terms of
closed hyperplanes or their corresponding half-spaces.
5.12 Hyperplanes and Convex Sets
In this section we prove the geometric form of the Hahn-Banach theorem
which in simplest form says that given a convex set K containing an
interior point, and given a point Xo not in 1<, there is a closed hyperplane
containing Xo but disjoint from 1<..
If K were the unit sphere, this result would follow immediately from our
earlier version of the Hahn-Banach theorem since it establishes the
existence of an
aligned with Xo' For every x in the interior of the unit
sphere, we then have (x, xti>
<
= (xo,
or
(x,
< (xo,
which implies. that the hyperplane {x: (x,
=
(xo,
is disjoint from the interior of the unit sphere. If we begin with
an arbitrary convex set K, on the other hand, we might try to redefine the
norm on X so that K, when translated so as to contain e, would be the
unit sphere with
to this norm. The Hahn-Banach theorem could
then be applied on thiH new normed space. This approach is in fact successful for some special convex sets. To handle the general case, however,
we must use the
Hahn-Banach theorem stated in terms of sublinear functionals instead of norms.
p(X)=inftr :;EK,r>o}.
We note that for K equal to the unit s1,here in X the Minkowski functional is IIxli. In the general case, p(x) defines a kind of distance from the
origin to x measured with respect to Kj it is the factor by which K must
be expanded so as to include x. See Figure 5.3.
Lemma 1. Let K be a convex set containing
2.
3.
4.
5.
00 > p(x)
0for all x E X,
p(rxx) = rxp(x) for rx > 0,
P(XI
+ X2)
p is continuous"
K = {x: p(x):-:; I}.
+ P(X2),
K=
e as an
132
DUAL SPACES
.,..
/'
,/
,/
,/
I
(
I
\
"
.......
Figure 5.3
Proof
1.
2.
e K, r > O}
IX
inf {r':
Given Xl' X2 and 6 > 0, choose r l , r2 such thatp(xj) < rj < p(Xj) + 6,
i = I, 2. By No.2, p(x!r/) < 1 and so x;/ri E K. Let r = r 1 + r 2 .
By convexity of K, (rt/r)(xt/r 1) + (r2!r)(x2h) = (Xl + x2)!r E K.
Thus, P(X/+X2)!r::;;1. Or by No.2, P(Xt-!-x2)sr<p(xl)+
P(X2) + 26. Since e was arbitrary, pis subadditive.
4. Let e be the radius of a closed sphere centered at e and contained
in K. Then for any X E X, 6x!lIxll E K and thus p(sx!llx!l) s 1.
Hence, by No.2, p(x) ::;;(lfe)llxli. This shows that p is continuous at
e. However, from No.3, we have p(x) = p(x - y + y) ::;; p(x - y) +
p(y) and p(y) = p(y - X + x) s p(y - x) + p(x) or -p(y
p(x) - p(y) ::;; p(x - y) from which continuity on X follows from
continuity at fJ.
5. This follows readily from No.4. I
3.
5.12
133
"eK2
e,
134
DUAL SPACES
kz
keK
the open sphere about x of radius d/2. Then apply Theorem 3 to Sand K.
Figure 5.4
Duality
S.13
135
distances from the point to hyperplanes separating the point and the
convex set K. We translate this simple, intuitive, geometric relation
into algebraic form and show its relation to our earlier duality result.
Since the results of thi:s section are included in the more general theory of
Chapter 7, and since the machinery introduced is not explicitly required
for later portions of the book, the rea:ler may wish to skip this section.
functional of K.
In general, h(x*) may be infinite.
The support functional is illustrated in Figure 5.5. It can be interpreted
geometrically as follows: Given an element x* E X*, we consider the
family of half-spaces {x: (x, x*) ::;; c} as the constant c varies. As c
increases, these half-spaces get larger and h(x*) is defined as the infimum
K=
136
DUAL SPACES
Figure 5.6
< 0,
IIx* \I
=I
Proof of Theorem 1
xeK
Ilx*II:s;1
5.14
}ROBLEMS
137
Ilx*11
IIxll <
1, we have -h(x*) S; d.
On the other hand, since K contains no interior points of Sed), there
is a hyperplane separating Sed) and K. Therefore, there is a X6 e X*,
= 1, such that -h(X6) = d.
To prove the statement concerning alignment, suppose that Xo E K,
IIxoll = d. Then (xo, X6> S;
= -d since Xo E K. However,
-(xo ,
S; IIx61111xoil = d. Thus -(xo,
= IIx61111xoII and -X6 is
aligned with Xo' I
Example 1. Suppose that K = M is a subspace of the normed space X
and that x is fixed in X. Theorem 1 then states that
inf Ilx - mil = max [(x, x*) - h(x*)].
S;
IIx*Usl
mllM
inf
Ilx - mil
meM
IIx*lIsl
xeMl.
Problems
f(x) =
f aCt) f b(s)x(s) ds dt
1
138
DUAL SPACES
IIxll
3.
4.
5.
6.
lekl.
sup
I :;;k<
k-+oo
00
Figure 5.7
A rocket car
IIuli
= max lu(t)12'
O:;;tSI
J(u) =
ul(t) dvl(t)
+ U2(t) dV2(t)
5.14
where
Vl
and
V2
PROBLEMS
139
LJldvt\2 + Idvzl
1
IlfiII =
:5
IIxli
= sup
Ostsl
Ix(t)lp
where
Find X*.
9. Let Xl and X2 denote horizontal and vertical position components in
a vertical plane. A rocket of unit mass initially at rest at the point
Xl = X2 =
is to be propelled to the point Xl = X2 = 1 in unit time
by a single jet with components of thrust u l , U2' Making assumptions
similar to those of Example 3, Section 5.9, find the thrust program
Ul(t), U2(t) that accomplishes this with minimum expenditure of fuel,
L
1
Ju/(t)
+ ul(t) dt.
10. Let X be a normed space and M a subspace of it. Show that (within
an isometric isomorphism) M* = X*/MJ. and MJ. = (XIM)*. (See
Problem 15, Chapter 2.)
11. Let {xt} be a bounded sequence in X* and suppose the sequence of
scalars {(x, xt>} converges for each x in a dense subset of X. Show
that {xt} converges weak* to an element x* E X*.
12. In numerical computations it is often necessary to approximate certain
linear functionals by simpler ones. For instance, for X = C[O, IJ, we
might approximate
1
L(x) =
Sox(t) dt
140
DUAL SPACES
a l1
tl1
t21
t22
t31
t32
t33
a21
a22
a31
a32
a33
a41
t41
Ln(P) = tp(t) dt
for any polynomial p of degree n -1 or less. Suppose also that
for all n.
Show that for any x
X
1
Ln(x)
fo x(t) dt.
has a solution x
the relation
1 s; is; n
S.14
PROBLEMS
141
implies
17. Let X be a real l:inear vector space and let It, 12 , ... ,j" be linear funtionals on X. Show that, for fixed cx/s, the system of inequalities
fi(x)
CX i
implies
142
DUAL SPACES
22. Prove the duality theorem of Section 5.13 for the case Xl =F O.
23. Let X be a real normed linear space, and let K be a convex set in X,
having 0 as an interior point. Let h be the support functional of K
and define KO = {x* E X*: h(x*) s I}. Now for X E X, let p(x) =
sup (x, x*). Show that p is equal to the Minkowski functional of K.
xeK
24. Let X, K, KO, p, h be defined as in Problem 23, with the exception that
K is now an arbitrary set in X. Show that {x: p(x) I} is equal to
the closed convex hull of K u to}.
REFERENCES
5.2-6. The presentation of these sections closely follows that of standr.rd works
such as Yosida [157], Goffman and Pedrick [59], and Taylor [145]. These
references can be consulted for additional examples of normed duals and for
a proof using Zorn's lemma of the more general Hahn-Banach theorem.
5.8-9. The duality theorems for minimum norm problems are given in the form
presented here in Nirenberg [114]. This material has a parallel development
known as the L-problem in the theory of moments, which was studied extensively by Akhiezer and Krein [4]. The solution of the L-problem has been
applied to optimal control problems by several authors. See Krasovskii [89],
Kulikowski [92], Kirillova [86], Neustadt [110], [111], and Butkovskii [26];
for an extensive bibliography, see Sarachik [137]. For additional applications
of the Hahn-Banach theorem to approximation theory, see Deutsch and
Maserick [39].
5.l1-12. The theory of convexity and hyperplanes has been studied extensively.
For some extensions of the development given in this chapter, see Klee [87],
Lorch [97], and Day [37]. For a number of counter-examples showing nonseparability of disjoint convex sets when the interior point condition is not
met, see Tukey [147].
5.l3. The duality theorem is proved in Nirenberg [11h
5.l4. Problems 7-9 are based on material from Neustadt [111].
6
LINEAR OPERATORS
AND ADJOINTS
6.1 Introduction
A study of linear operators and adjoints is essential for a sophisticated
approach to many problems oflinear vector spaces. The associated concepts
and notations of operator theory often streamline an otherwise cumbersome analysis by eliminating the need for carrying along complicated
explicit formulas and by enhancing one's insight of the problem and its
solution. This chapter contains no additional optimization principles but
instead develops results of linear operator theory that make the application
of optimization principles more straightforward in complicated situations.
Of particular importance is the concept of the adjoint of a linear operator
which, being defined in dual space, characterizes many aspects of duality
theory.
Because it is difficult to obtain a simple geometric representation of an
arbitrary linear operator, the material in this chapter tends to be somewhat
more algebraic in character than that of other chapters. Effort is made,
however, to extend some of the geometric ideas used for the study of linear
functionals to general linear operators and also to interpret adjoints in
terms of relations among hyperplanes.
6.2 Fundamentals
A transformation T is, as discussed briefly in Chapter 2, a mapping from
one vector space to another. If T maps the space X into Y, we write
T: X -+ Y, and if T maps the vector x F.: X into the vector Y E Y, we write
y = T(x) and refer to y as the image of x under T. As before, we allow that
a transformation may be defined only on a subset D C IX, called the domain
of T, although in most cases D = X. The .. ,)Uection of an vectors Y E Y for
which there is an xED with y = T(x) is called the range of T.
If T : X -+ Yand S is a given set in X, we denote by T(S) the image of S
in Y defined as the subset of Y consisting of points of the form y = T(s)
143
144
IIAII
IIAII
sup IIAxll
Ilxll S
= sup IIAxll.
x8
IIx!!
+ A 2 )x =
(aA)x
==
Alx
+ A2 x
a(Ax)
the linear operators from X to Y form a linear vector space. If X and Yare
normed spaces, the subspace of continuous linear operators can be
identified and this becomes a normed space when the norm of an operator
6.2
FUNDAMENTALS
145
is defined according to the last definition. (The reader ciln easily verify that
the requirements for a norm are satisfied.)
Definition. The norffiied space of all bounded linear operators from the
normed space X into the normed space Y is denoted B(X, Y).
We note the following result which generalizes Theorem 1, Section 5.2.
The proof requires only slight modification of the proof in Section 5.2
and is omitted here.
Theorem 1. Let X and Y be normed spaces with Y complete. Then the space
B(X, Y) is complete.
In general the space B(X, Y), although of interest by its own right, does
not play nearly as dominant a role in our theory as that of the normed
dual of X. Nevertheless, certain of its elementary properties and the definition itself are often convenient. For instance, we write A e B(X, Y) for,
"let A be a continuous linear operator from the normed space X to the
normed space Y."
Finally, before turning to some examples, we observe that the spaces
of linear operators have a structure not present in an arbitrary vector space
Y,
in that it is possible Ito define products of operators. Thus, if S : X
T: Y -+ Z, we define the operator TS : X -+ Z by the equation (TS)(x) =
T(Sx) for all x e X. For bounded operators we have the following useful
result.
for all x e X.
X by
IIAII. We have
IIAxll= max I{K(S,t)X(t)dt\
o sss 1
OS!Sl
= max
o s.s 1
Therefore,
146
So
the
fo IK(s, 1)1 dt
achieves its maximum. Given e > 0 let p be a polynomial which approximates K(so , . ) in the sense that
max IK(so, t) - p(t)1 < e
1
IS: p(t)x(t) dt -
s: Ip(t)1 dt I< e.
J:
IS: p(t)x(t) dt /- e
(IK(So, t)1 dt -/
s:
Ip(t)1 dt - 2e
J IK(so, t)1 dt -
3e.
Thus, since
IIxll ::; 1,
1
IIA"
Jo IK(so, t)1 dt -
3e.
But since e was arbitrary, and since the reverse inequality was established
above, we have
IIAII
= max
o
6.3
LINEARITY OF INVERSES
147
OPERATORS
6.3 Linearity of Invers,!s
Let A : X -+ Y be a linear operator between two linear spaces X and Y.
Corresponding to A we consider the equation Ax = y. For a given Y E Y
this equation may:
1.
2.
3.
If
A(x2) = Y2'
+ OC2Y2'
Thus
148
of view, however, since optimization theory often provides effective procedures for solving equations. Furthermore, a problem can never really be
regarded as resolved until an efficient computational method of solution is
derived. Nevertheless, our primary interest in linear operators is their
role in optimization problems. We do not develop an extensive theory
of linear equations but are content with establishing the existence of a
solution.
nn Fn.
6.4
149
It follows that the union of the original collection of sets {En} is not X
since
y =
UnA(S).
"=1
N(}, r/2.
We have shown tha.t the closure of the image of a sphere centered at
the origin contains such a sphere in Y, but it remains to be shown that
the image itself, rather than its closure, contains a sphere. For any i:> 0,
let S(8) and P(8) be the spheres in X, Y, respectively, of radii 8 centered
at the origins. Let 8 0 > 0 be arbitrary and let '10 > 0 be chosen so that
P('1o) is a sphere contained in the closure of the image of S(80). Let y
be an arbitrary point in P('1o). We show that there is an x E S(280 ) such
that Ax = y so that the image of the sphere of radius 2110 contains the
I
sphere P('1o).
Let {Si} be a sequence of positive numbers such that Ii";,! 6; " 60' Then
there is a sequence {'11}, with '11 > 0 and '11-+ 0, such that
150
6.5
Definition. Let X and Y be normed spaces and let A E B(X, Y). The adjoint
operator A*: y* -+ X* is defined by the equation
(x, A*y*
Figure 6.1
r*
6.5
151
for each x
E X.
=(A*y*)(x)
where the left side denotes the functional on X which is the composition
of the operators A and y* and the right side is the functional obtained by
operating on y* by A *.
B(X, Y) is
I(x, A*y*)1
== I(Ax,
Y*)I
Ily*IIIIAxll
1IY*IIIIAllllxll
it follows that
IIA*y*11
IIAIlIIY*11
IIA*II
IIAII
IIAxo11
I(xo,
IIA*llllxoll
IIA*II
It now follows that
IIA*II
IIAII I
3.
4.
5.
1.
then 1* == I.
(A *) -1.
152
= (Ax, yt
=/:
for some x E X. Thus, A*yf =/: A*yi and A* is one-to-one. Now for any
x* E X* and any x E X, Ax = y, we ha,ve
(x, x*)
(Ax);= 'LaijXj.
J=1
(Axl y)
= 1=1
L LYialjXj =
j=1
Example 2. Let X
j= 1
n
Xj
au =
l>ijYi
1"'1
ajl'
= (xIA*Y)
Ax =
t E [0, 1J
6.S
153
where
1
fo fo IK(t, sW ds dt <
00.
Then
(Ax I y)
where
1
t E [0, 1J,
with
1
sW dt ds < 00.
fo fo IK(t,
Then
1
(Ax I y)
= fo
,
(Ax I y)
= fo fa y(t)K(t, s)x(s) dt ds
1
154
(b)
(0)
A *y =
: ;
where
t 1 < t2 < t2 < '" < tn :::;; 1 are fixed. It is easily verified that A
is continuous and linear. Let y* = (YI' Y2' .. , Yn) be a linear functional
on En. Then
(Ax, y*)
1= 1
l)
where vet) is constant except at the points tt where it has a jump of magnitude Yi' as illustrated in Figure 6.3. Thus A*: en -+ NBV[O, 1] is defined
by A*y* = v.
v
' - _ L - -_ _ _
tl
t2
. .r
2I
tn
>- t
6.6
1SS
';v(A*).
Kllyll.
Proof Let N = .;V(A) and consider the space XI N consisting of equivalence classes [x] modulo N. Define A: XIN -+ 9P(A) by A[x] = Ax. It is
easily verified that ,4 is one-to-one, onto, linear; and bounded. Since 9P(A)
closed implies that 9l(A) is a Banach space, it follows from the Banach
inverse theorem th:at A has a continuous inverse. Hence, given y E 9l(A),
there is [x] e XIN with \I[x]\I
\lJ-l\1\1y\l. Take x e [x] with
\lxll 2 II [x] II and then K = 211A- 1 11 satisfies the
stated in the
lemma. I
N oW we give the dual to Theorem 1.
156
Theorem 2. Let X and Y be Banach spaces and let A E B(X, Y). Let fA(A) be
closed. Then
= [%(A)]J..
Proof Let x* E
%(A), we have
= O.
.:J
[%(A)]J..
Then
However,
and thus
n 2 ""
= Y.
is not closed.
6.7
157
[&l(A)JJ. = %(A*).
2. &leA) = [%(A*)]J..
3. [&l(A*)]J. = %(A).
4. &l(A*) = [%(A)JJ..
Proof. Part 1 is just Theorem 1. To prove part 2, take the orthogonal
complement of both sides of 1 obtaining [&l(A)]J.J. == [%(A*)JJ.. Since
&leA) is a subspace, the result follows. Parts 3 and 4 are obtained from
1 and 2 by use of the relation A** = A. I
6.7
Definition. Given a set S in a normed space X, the set SfD = {x* e X*:
(x, x*) ::::: 0 for all xeS} is called the positive conjugate cone of S. Likewise the set Sa == {x* e X* : (x, x*) s; 0 for all xeS} is called the
negative conjugate cone of S.
It is a simple matter to verify that SfD and Sa are in fact convex cones.
They are nonempty since they always contain the zer9 functional. If S is a
subspace of X, then obviously SfD = Sa = SJ.; hence, the conjugate cones
can be regarded as generalizations of the orthogonal complement of a set.
The definition is illustrated in Figure 6.4 for the Hilbert space situation
where SfD and Sa can be regarded as subsets of X. The basic properties
158
2.
be a subset of X. Then
s E S. Then
arbitrary in S, y*
<s, A*y*)
O. Thus, since
ment is reversible. I
s is
<As, y*)
E
0 and hence
A*-l(S$). The argu-
6.8
= {O},
159
---
'\
.
...
..............
------ --""......
........ "
1
'
160
for a fixed y
A*Ax = A*y.
yll where y E
Thus, by Theorem 1, Section 3.3 (the projection
theorem without the existence part), y is a minimizing vector if and only if
y - yE
Hence, by Theorem 3 of Section 6.6, y - YE ';v(A*). Or
lIy -
() = A*(y -
y)
= A*y -
A*Ax.
y=t?=la,Xj.
6.JO
161
L a/xI,
i= I
(x I Aa) = (x I L a/ XI) =
i= 1
n
L
al(x I Xi) = (z I a)En
i=l
'.
where z = x I XI)' ... , (x I x.)). Thus, A*x == x I XI)' (x I x 2 ), " ' , (x I x.)).
The operator A*A maps E into E and is therefore represented by an
n x n matrix. It is then easily deduced that the equation A *Aa = A*y
equivalent to
(XIIXI) (x2Ixl)
(XI I x2)
[
.:
(xlix.)
... (X.IXI)]al]
a2
:
(X. I x") a.
(YIXI)]
(y I X2)
=
'
(y I x.)
162
vector x of minimum norm satisfying Ax = y and that this vector is orthogonal to %(A). Thus, since
is assumed closed,
[%(A)]l ==
= y,
we conclude that
Note that if, as is frequently the case, the operator AA* is invertible,
the optimal solution takes the form
X
== A*(AA*)-ly.
+ bu(t)
x(T)
= J eF(T-t)bu(t) dt.
o
-+
En by
Au
= fo eF(T-t)bu(t) dt,
u = A*z
where
AA*z =
Xl'
(y I AU)En = y'
== (A*Y I U)L2
2'
6.11
PSEUDOINVERSE OPERATORS
163
where
A*y = b'eF'(T-t)y.
fo eF(T-t)bb'eF'(T-t) dt.
u = A*(AA*)-IX I .
6.11
Pseudoinverse Operators
We now develop a
general and more complete approach to the problem of finding approximate or minimum norm solutions to Ax = y. The
approach leads to the c:oncept of the pseudo inverse of an operator A.
Suppose again that G and H are Hilbert spaces and that A e B(G, H)
with E/9(A) closed. (In applications the Closure of E/9(A) is usually supplied
by the finite dimensionality of either G or H).
IIAxI - yll
H = 9l!(A) $ :;P(Al.
164
6.12
PROBLEMS
165
4.
5.
6.
7.
8.
9.
A t is linear.
A t is bounded.
(At)t = A.
(A*)t = (At)*.
AtAAt = At.
AAtA = A.
(AtA)*=AtA.
At = (A*A)tA*.
At = A*(AA*)t.
Aa = Laixj,
1= 1
= Aty.
Problems
Ax
K(t. S)X(S) ds
166
where
1
fo fo IK(t,
s)1 2 dt ds <
00.
+ I/q = 1.
Define
Aby
1
Ax
where y = {Ili}' x =
IIAII
= sup
i
I: IIXul.
j= 1
6.12
00,
iF
167
+ llq =
1. Let
Y = Lq [0,1], lip
PROBLEMS
00.
IIxll
Ax=b
13.
14.
15.
16.
IIAyII:sI
1. pZ = P (idempotent)
p* = P (self-adjoint).
2.
At
lim (A* A
1:-+0+
+ 81)-1 A* =
lim A*(AA*
-+0+
+ 81)-1
168
[i i bj.
J
1 0
21. Let G, H, K be Hilbert spaces and let B E B(G, K) with range equal to
K and C E B(K, H) with nulls pace equal to {O} (i.e., B is onto and C
is one-to-one). Then for A = C B we have At = Bt ct.
REFERENCES
6.4. The Banach inverse theorem is intimately related to the closed graph
theorem and the open mapping theorem. For further developments and applications, see Goffman and Pedrick [59].
6.7. The concept of conjugate cone is related to various alternative concepts
including dual cones, polar
and polar sets. See Edwards [46], Hurwicz
[75J, Fenchel [52), or Kelley, Namioka, et al. [85].
6.1O. See Balakrishnan [14].
6.11. The concept of a pseudo inverse (or generalized inverse) was originally
introduced for matrices by Moore [105], [l06] and Penrose [117], [118J.
See also Greville [67]. For an interesting introduction to the subject, see
Albert [5]. Pseudoinverses on Hilbert space have been discussed by Tseng
[1461 and Ben Israel and Charnes [20]. Our treatment closely parallels that of
Desoer and Whalen [38].
7
OPTIMIZATION
OF FUNCTIONALS
7.1
Introduction
170
OPTIMIZATION OF FUNCTIONALS
Figure 7.1
familiar in finite-dimensional space and is easily ,extended to infinitedimensional space. The principal advantage of the second representation
over the first is that for a given dimensionality of X, one less dimension
is required for the representation.
The first half of this chapter deals with generalizations of the concepts of
differentials, gradients, etc., to normed spaces and is essentially based on
the second representation stated above. The detailed development is largely
algebraic and manipulative in nature, although certain geometric interpretations are apparent. From this apparatus we obtain the local or variational theory of optimization paralleling the familiar theory in finite
dimensions. The most elementary portion of the classical calculus of
variations is used as a principal example of these methods.
The second half of the chapter deals with convex and concave funetionals from which we obtain a global theory of optimization. This
f(x) = c
Figure 7.2
7.2
171
7.2
(1)
+ rxh) -
T(x)]
+ O(h)\
=0
'
of
L -;hi.
n
1=1 vXi
172
OPTIMIZATION OF FUNCTIONALS
(jj(x; h)
()f(x; h) =
lim IIT(x
+ h) -
T(x) - bT(x;
Ilh I
IIhll->O
h)11 _ 0
- ,
Proof Suppose both bT(x; /1) and b'T(x; h) satisfy the requirements
of the last definition. Then
IIT(x
+ 11) -
T(x) - oT(x; h) I
7.2
173
as ex ...... 0.
+ exh) -
T(x)
= t5T(x; h). I
ex
... 0
+ h) -
+ h in
< sllhll.
Thus IIT(x + h) - T(x) II <sllhll + IIt5T(x; h)1I < Mllhll I
IIT(x
T(x) - t5T(x; h) II
/I
of
L -;:- hi
.=1 vX.
ox.
oXi
e.
for k
= 1,2, ... , n.
174
OPTIMIZATION OF FUNCTIONALS
f(X
+ h) -
"hll
f(x) -
:1 It
1=1 uXI
hil =
k=1
+ gk) -
{f(X
f(x
+ gk-I) -
:f hk}1
uX k
Now we examine the k-th term in the above summation. 1;'he vector x + gk
differs from x + gk-l only in the k-th component. In fact, x + gk = X +
gk-t + hk ek . Thus, by the mean value theorem for functions of a single
variable,
\
jJ(X
+ gk) -
+ gk-t) -
f(x
hk <
Ilhll.
f(x
+ h) -
f(x) -
of
c5f(x; h)
= fotgxCX, t)h(t) dt
If(x
+ h) -
= I( {g(x + h, t) -
g(x(t)
+ h(t), t) -
where Ix(t) - x(t) I Ih(t)l. Given e > 0, the uniform continuity of g" in x
and t implies that there is a /j > 0 such that for IIhl! < c5, Igx<x + h, t) g,,(x, t)1 < e. Therefore, we have
I(fl1.(X,
r) -
ellhll
7.3
FRECHET DERIVATIVES
175
or] dor
where F = (fl,f2, ... ,fm) has continuous partial derivatives with respect
to its arguments. The Gateaux differential of T is easily seen to be
176
OPTIMIZATION OF FUNCTIONALS
references cited at the end of the chapter. In this section we discuss the
elementary properties of Frechet derivatives used in later sections.
It follows immediately from the definition that if T( and T2 are Fnechet
differentiable at xED, then rJ.( T( + CX2 T2 is Frechet differentiable at x and
(a(T( + CX2 T 2 )'(x) = cx(T('(x) + CX 2 T 2 '{x). We next show that the chain rule
applies to Frechet derivatives.
Ilg - S'(x)hll
+ h) - T(x) - P'(y)gll
o(llgll).
o(llhll),
we obtain
= o(IIhll) + o(lIgll).
T'(x)h
= P'(y)S'(x)h. I
We now give a very useful inequality that replaces the mean value
theorem for ordinary functions.
+ h) - T(x)"
:S
IIhll
sup 1IT'(x
0<<<< 1
I. Then
+ cxh)lI.
+ rxh)h].
7.4
EXTREMA
177
and hence
ly*[T(x
sup IIT'(x
0<,,< 1
+ cth)II "hil,
+ h) - T(x),
+ cth)ll I
f"(xo) =
[02f (X)] ,
OXl ax x=xo
j
IIT(x
sup IIT"(x
0<,,<1
+ cth)ll.
7.4 Extrema
It is relatively simple to apply the concepts of Gateaux and Frechet
differentials to the problem of minimizing or maximizing a functional on a
linear space. The technique leads quite naturally to the rudiments of the
calculus of variatio11s where, in fact, the abstract concept of differentials
originated. In this section, we extend the familiar technique of minimizing
a function of a single variable by ordinary calculus to a similar technique
based on more general differentials. In this way we obtain analogs of the
classical necessary conditions for local extrema and, in a later section, the
Lagrange technique for constrained extrema.
I
178
OPTIMIZATION OF FUNCTIONALS
n Ii N.
+ ('J.h) I
,,=0
= O.
+ ('J.(x -
xo) e
n for 0 S
0
7.S
179
EULER-LAGRANGE EQUATIONS
J =
'2
M(x; h)
d
= -d
I f(x + IXh, x + IXh, t) dt I
12
IX I.
,,=0
or, equivalently,
(1)
M(x; h)
"
M(x; h)
= f12 [fix, x, t) - dd
1\
fix,
O.
1\
The boundary terms vanish for admissible h and thus the necessary
condition is
I
tl
and t 2
180
OPTIMIZATION OF FUNCTIONALS
fix,
x, t) -
:t fix,
x, t) = o.
[tl> t 2 ].
tE[tt',t 2 ']
otherwise.
The function h satisfies the hypotheses of the lemma and
12
a(t)h(t) dt> O.
11
[aCt) -
C]2
II
dt =
f" [ae-r) -
c] d1:.
I [aCt) - c)]h(t) dt
12
II
=f'Ci(t)h(t) dt 11
[aCt) - c] dt = 0
c[h(t 2 )
h(tt)] = 0,
7.5
EULER-LAGRANGE EQUATIONS
181
[!X(t)h(t)
(3)
tl
+ (J(t)h(t)] dt = 0
ACt) =
aCt) dt.
ft
!X(t)h(t) dt
tl
=-
t(
-A(t)
+ fi(t)Jh(t) dt =
tl
(J(t)
= A(t) + c
= O(t). I
Example 1. (Minimum Arc Length) Given t 1 , t,. and x(tt), x(t 2 ), let us
employ the Euler-Lagrange equations to determine the curve in D[tt, t2 ]
connecting these points with minimum arc length. We thus seek to minimize
dtox J 1 + (X)2 = 0
or, equivalently,
x = const.
Thus the extremizing arc is the straight line connecting the two points.
182
where
(X
fo e-P1V[ax(t) -
x(t)] dt
subject to x(O) = Sand x(T) = 0 (or some other fixed value). Of course,
there is the additional constraint x(t) ;;::: 0, but, in the cases we consider,
this turns out to be satisfied automatically.
From the EulerLagrange equation (2), we obtain
ae-P1U'[ax(t) - x(t)]
+ dt e-P1U'[ax(t) -
x(t)] = 0
:t U'[cxx(t) - x(t)]
= (P -
cx)U'[cxx(t) - x(t)].
(5)
U'[r(t)] = U'[r(O)]e(P-a)l.
cxx(t) - x(t)
= ret) =
7.6
183
= etZtx(O) +
a - 2P
(6)
= {X(O) _
r(O) --'ea.
af
2P -
e2(a-P)I.
2P - a
Assuming a> fJ > a/2, we may find r(O) from (6) by requiring x(T) = O.
Thus
(2P - a)x(O)
-e -(2p-a)T'
r(O) = 1
J f(x, ie, t)
12
J =
dt
1\
where the interval [t1' t2], as well as the function x(t), must be chosen.
In a typical problem the two end points are constrained to lie on fixed
curves in the x - t plane. We assume here that the left end point is fixed
and allow the right end point to vary along a curve S described by the
function x =g(t), as illustrated in Figure 7.3.
Suppose X(8, t) is a one-parameter family of Ifunctions emanating from
the point 1 and terminating on the curve S. The termination point is
X(8, t2(8 and satisfies X(8, t 2(8 = g(t2(8). We assume that the family is
defined for 8 e [-a, a] for some a> 0 and that the desired extremal is the
184
OPTIMIZATION OF FUNCTIONALS
7
glt)
s
Figure 7.3
ox(t)
= dd
x(e, t)/
8=0
we have
J(e) =
OJ
(2)
with respect to
DX(t 2 )
B.
7.7
185
and hence the complete set of necessary conditions is (1) and the transversality condition
(4)
{f(x,
x, t) + [g -
x]f,.{x,
x, t)}lt=tz = o.
t Jt +
h
J=
(x)2 dt,
or
.
1
X(t2) = - -.- .
g(t 2 )
Thus the extremal arc must be orthogonal to the tangent of gat t 2 (See
Figure 7.4.)
:<
" ...
""
"
7.7
186
OPTIMIZATION OF FUNCTIONALS
gix)
=0
(1)
T
g(x) = 0
7.7
187
The utility of this observation is that the exact form of the surface n near
Xo is replaced by the simple description of its tangent in the expression for
the necessary conditions. In order for the procedure to work, however, it
must be possible to express the tangent in terms of the derivatives of the
constraints. For this reason we introduce the definition of a regular point.
= 0, i = 1,2, ... , n.
Choose heX such that (jgj(xo; h) = 0 for i =
Proof
1,2, ... , n. Let
YI, Y2, ... , Y" e X be 11 linearly independent vectors chosen so that the
n x n matrix
M=
(jg"(xo; Y")
(2)
g"(xo
The lacobianofthis set with respectto the variables CPj, at e = 0 and CPi == 0,
is just the determi.nant of M and is therefore nonzel;O by assumption. Hence,
the implicit function theorem (see, for example, Apostol [1 OJ and also Section
188
OPTIMIZATION OF FUNCTIONALS
l:?=
(3)
0 = g/(xo
= g/(xo)
+ eh +
epjYj)
+ sog/(xo;
h)
Or, writing all n equations simultaneously and taking into account the
fact that the first two terms on the right side of (3) are zero, we have,
after taking the norm of the result,
0= II Mep(e) II
(4)
+ o(e) + o[lly(e)llJ.
clllep(e)1I
d f[xo
dS
+ eh + yes)]
8=0
= o.
= o. I
Lemma 1. Let fo ,fl' ... ,/" be linear functionals on a vector space X and
suppose thatj'o(x) = Ofor every x E X satisfyingfl(x) =fJfor i = 1,2, ... , n.
Then there are constants ..1.1' . 1. 2 , .. , An such that
fo
i = 1,2, ... , n,
7.7
189
f(x)
stationary at
II
L A/ub)
/=1
Xo
subject to
1
f JX2 + 1 dt
-1
1.
J(x)
=0
f " - df"
dt
or
d
x .= 0
2
dtJx +1
1- A-
or
+ 1 =,. t + c.
J
1
X2
190
OPTIMIZATION OF FUNCTIONALS
Xl)1.+
(t - tlYZ = r2.
The parameters Xl' t1> alld r are chosen to satisfy the boundary conditions
and the condition on total length.
GLOBAL THEORY
Figure 7.6
______
A convex function
for all
Xl' X2 E
Xl ::j: X2
x=o
x>O
= x 2 ; I(x) =
eX
7.8
191
f(x) =
So {x
(t)
+ Ix(t)l}
dt
1.
2.
1.
2.
[f, C]
= {(r, x) E R X
X: x E C,J(x) s; r}.
192
OPTIMIZATION OF FUNCTIONALS
-+---'---------------'--x
Figure 7.7
7.9
[f, e]
193
0=f(8)=f[-_1
1+e
(1
v( e). Furthermore,
x+ (1 _1+e
_
1 )(_!x)] :::;_1 f(x)
e
1+e
or
f(x);;::'
-eeo.
'I.
194
OPTIMIZATION OF FUNCTIONALS
z
for some
xe
= Y + (l -
P-l)X == P-l(py)
+ (1
- P-l)X
f(z) S P-lf(Py)
+ (1
+ (1
- p-l)e.
Thus f is bounded above in the sphere liz - yll < (1 - P-l)<5. It follows
that for sufficiently large r the point (r, y) is an interior point of [J, C];
hence, by Proposition 1,J is continuous at y. I
The proof of the following important corollary is left to the reader.
Corollary 1. A convex functional defined on a finite-dimensional convex set C
is continuous throughout
6.
.r
a}
7.10
195
x.
I
)'
X2
Figure 7.9
7.10
C*
= {x* E X*:
196
OPTIMIZATlvN OF FUNCTIONALS
.0 j
is defined on C * as
+ (1 - ex)xi> - f(x)}
+ (1 -
. + (1 -
f*(xi)
Sl
cx:>, we obtain
s
for all x
C. Therefore,
We see that the conjugate functional defines a set [j*, C*] which is of
the same type as [f, C]; therefore we write [f, C]* = [f*, C*]. Note that
if f = 0, the conjurate functional f* becomes the support functional of C.
Example 1. Let X = C = En 'd define, for x;;:::: (Xl' Xz, ... , x n), f(x) =
lip Li=l Ixi/P, I < P < cx:>. Then for x* = (el> e2 , , en),
7.10
197
L Ixd
n
i=l
sgn
P(
Xi
1) = -1L led
n
1- -
qi=l
+ llq = 1.
sr
+ (x, x*) =
(x, x*) - r = k
describe parallel closed hyperplanes in R x X. The number f*(x*) is the
supremum of the values of k for which the hyperplane intersects [f, C].
Thus the hyperplane (x, x*) - r = f*(x*) is a support hyperplane of
[I, C].
In the terminology of Section 5.13, f*(x*) is the support functional
h[( -1, x*)] of the functional (-1, x*) for the convex set [f, C]. The
special feature here is that we only consider functionals of the form
(-1, x*) on R x X and thereby eliminate the need of carrying an extra
variable.
For the application to optimization problems, the most important geometric interpretation of the conjugate functional is that it measures vertical
distance to the support hyperplane. The hyperplane
198
OPTIMIZATION OF FUNCTIONALS
1' ....
r =-f*(x*)
"-
.... ....
, ...
....
"
(x, x*) - r
s.
Then the. set [f*, C *] consists of those (nonvertical) half-spaces that contain the set [f, C]. Hence [f*, C*] is the dual representation of [f, C].
Beginning with an arbitrary convex functional ({J defined on a convex
subset r of a dual space X*, we may, of course, define the conjugate of ({J
in X** or, alternatively, following the standard pattern for duality relations (e.g., see Section 5.7), define the set *r in X as
*r =
on *r. We then write *[({J, r] = [*({J, *r]. With these definitions we have
the following characterization of the duality between a convex functional
and its conjugate.
Proposition 2. Let f be a convex functional on the convex set C in a normed
space X. If [f, C] is closed, then [J, C] = *[[f, C]*].
Proof We show first that [J, C] c: *[f*, C*] = *[[J, C]*]. Let
(r, x) E [J, C]; then for all x* E C *, f*(x*) (x, x*) - f(x). Hence, we
have r
(x, x*) - f*(x*) for all x* C*. Thus
r;;:: sup [(x, x*) - f*(x*)]
.nee
and (r, x)
*[f*, C *].
7 . 11
199
sr
for all (r, x) E [f, C]. It can be shown that, without loss of generality, this
hyperplane can be assumed to be nonvertical and hence s=!:O (see Problem 16). Furthermore, since r can be made arbitrarily large, we must have
s < O. Thus we take s = - 1. Now it follows that <x, x*> - f(x) :::;; c
for all x e C, which implies that (c, x*) e [f*, C*J. On the other
hand, c < <xo, x*) - r0 implies <xo, x*) - c > ro, which implies that
(ro,xo)*[f*,C*].1
7.11
A development similar to that of the last section applies to concave funetionals. It must be stressed, however, that we do not tre;;t concave functionals by merely multiplying by - I and then applying the theory for
convex functionals. There: is an additional sign change in part of the
definition. See Problem 15.
Given a concave functional 9 defined on a convex subset D of a vector
space, we define the set
Definition. Let 9 be a concave functional on the convex set D. The conjugate set D* is defined as
D*
= {x* e X*:
g*(x*)
= inf[<x, x*> -
g(X)].
xeD
200
OPTIMIZATION OF FUNCTIONALS
/" =-g*(.\"*)
"'"
"'" "'"
"'"
-+------.. . . . .
Figure 7.11
.\'
7.12
201
,.
> x
Figure 7.12
support hyperplane below [j. C], and - g*(x*) is the vertical distance to
the parallel support hyperplane above [g, D], g*(x*) - f*(x*) is the vertical separation of the two hyperplanes. The duality principle is stated in
detail in the following theorem.
Theorem 1. (Fenchel Duality Theorem) Assume that f and 9 are, respectively, convex and concave functjonals on the convex sets C and D in a
normed space X. Assume that C n D contains points in the relative interior
of C and D and that either [j; C] 01' [g, D] has nonempty interior. Suppose
further that p. = inf {f(x) - g(x)} is finite. Then
xeCnD
p.
max
{g*(x*) - /*(x*)}
x"'eCnD'"
and
= <xo , X6)
- g(Xo)
g*(x*)
Thus,
f(x) - g(x)
g*(x*) - f*(x*)
202
OPTIMIZATION OF FUNCTIONALS
and hence
sup [g*(x*) - f*(x*)J.
COnDO
c == inf [(x,
- g(x)] = g*(X&).
xeD
Likewise,
c == sup
xeC
[<X,
f(x) + Il] =
+ Il.
Thus Jl =
If the infimum Il on the left is achieved by some Xo e C ("\ D, the sets
[f - Jl, C] and [g, D] have the point (g(xo), xo) in common and this
point lies in the separating hyperplane. I
In typical applications of this theorem, we consider minimizing a convex
functional f on a convex domain D; the set D representing constraints.
Accordingly, we take C = X and 9 = 0 in the theorem. Calculation of f*
is itself an optimization problem, but with C = X (t is unconstrained.
Calculation of g* is an optimization problem with a linear objective
functional when 9 = O.
Example 1. (An Allocation Problem) Suppose that there is a fixed quantity Xo of some commodity (such as money) which is to be allocated among
n distinct activities in such a way as to maximize the total return from
these activities. We assume that the return associated with the i-th activity
when allocated XI units is ;."JXI) where g/, due to diminishing marginal
returns, is assumed to be an increasing concave function. In these terms
the problem is one of finding x ::= (X., xz ... , x n ) so as to
= tlg/(X/)
(1)
XI
'
i = 1,2, ... , n.
7.12
203
= AXo .
Also, for each i we define the conjugate functions (of a single variable)
(2)
gi(y/)
= x,.,o
inf [Xi y/ -
g/(x/)]
g*(y) ::;::
I 1 gi(y/).
i=
gi(A)].
i= 1
1=1
Sj
+ Xi
204
OPTIMIZATION OF FUNCTIONALS
(4)
R
I
i'=1 51
+ XI
Xo.
Our problem is to find Xi' i:::::: 1,2, ... , n, which maximize R subject to
n
(5)
L XI::::: Xo,
Xi;::: 0,
1=1
i = 1,2, ... , n
(7)
,,/>0
Pi
Figure 7.13
7.12
205
A_
- (SI
Si Pi
+ X/)2
or
(8)
Pi
Si
p.
for A ;;::: ....!
sI
(10)
for i
= m + 1, ... , n.
The parameter A in this section can be found from (3) or from the constraint
(5). In other words, A is chosen so that
(11)
Now SeA) is easily verified to be continuous and S(O) = 00, S( (0) = O. Thus
there is a A satisfying (11).
Note that for small XCI (xo < < max Sj), the total amount should be
on a single horse, the horse corresponding to the maximum p;/Sj,
or equivalently, the maximum Pi ri where rj := C Lj Sj/sl is the track odds.
Example 3. Consider the minimum energy control problem discussed in
Sections 3.10 and 6.10. We seek the element U E L 2 [O, I] minimizing
feu) = t
tu
1
(t) dt,
206
OPTIMIZATION OF FUNCTIONALS
C* =L 2 [O, 1]
and
f*<u*) =
fo [U*(t)]2 dt.
D*
KK*a
=c
= K*a.
7.13
207
(x, x*) is computed and player A pays that amount (in some appropriate
units) to player B. Thus A seeks to make his selection so as to minimize
(x, x"'> while B seeks to maximize (x, x"'>.
Assuming for the moment that the quantities
fJ,0
fJ,0
he can be assured oflosing no more than fJ,0. On the other hand, player B
by selecting x'" E B, wins at least min (x, x"'>; therefore, by proper choice
xeA
of x"', say X6, he can be assured of winning at least fJ,o. It follows that
fJ,0
(xo, X6> :s;; /1-0, and the fundamental question that arises is whether
fJ,o = fJ,0 so that there is determined a unique pay-off value for optimal
play by both players.
The most interesting type of game that can be put into the form outlined above is the classical finite game. In a finite game each player has a
finite set of strategies and the pay-off is determined by a matrix Q, the
pay-off being qij if A uses strategy i and B uses strategy j. For example,
in a simple coin-matching game, the players independentlY select either
" heads" or " tails." If their choices match, A pays B 1 unit while, if they
differ, B pays A 1 unit. The pay-off matrix in this case is
Q= [
1 -1]
-1
Finite games of this kind usually do not have a unique pay-off value.
We might, however, consider a long sequence of such games and" randomized" strategies where each player determines his play in anyone
game according to a fixed probability distribution among his choices.
Assuming that A has n basic strategies, he selects an n vector of probabilities x = (Xl' X2' , x n) such that Xi;::: 0, 2:i,'1 Xi = 1. Likewise, if B has m
strategies, he selects Y = (YI' Y2 , ... , Ym) such that Yi ;::: 0, Ii"= 1 Yi = 1. The
expected (or average) pay-off is th.en (x IQy).
DefiningA = (x
O,LI'= 1 XI = I} c En,andB = {x'" :x'" = QY'Yi;::: 0,
1 YI = I} c En, the randomized game takes the standard form given at
the beginning of the section. Note that A and B are both bounded closed
convex sets. Other game-type optimization problems with bilinear objectives other than the classical randomized finite game also take this form.
We now give a simple proof of the min-max theorem based on duality.
For simplicity, our proof is for reflexive spaces, although more general
Liz
208
OPTIMIZATION OF FUNCTIONALS
versions of the result hold. For the generalizations, consult the references
at the end of the chapter.
Theorem 1. (MiniMax) Let X be a reflexive normed space and let A and B
be compact convex subsets of X and X*, respectively. Then
min max (x, x*)
"eA "oeB
D*=X*
(2)
C* =B
(3)
(4)
f*(X*)
= o.
To prove (3) and (4), let x! B, and by using the separating hyperplane
theorem, let Xl e X and IX be such that (XI' xi) - (XI' x*) > IX > 0 for
all x* E B. Ther. (x, xi> - max (x, x*) can be made arbitrarily large by
"Oe B
00
"
E
minf(x) = max
""A
g*(x*) =
,,eBnXo
x*).
7.14
PROBLEMS
209
Ilxll
"eA
max - h(x*)
11"*11 s
where h is the support functional of the convex set A. This result is the
duality theorem for minimum norm problems of Section 5.8.
7.14 Problems
1. On the vector space of continuous functions on [0, 1], define
f(x) = max x(t).
OSrSI
f(x)::;
Ix(t)1 dt.
if XI = 0
is Gateaux differentiable but not continuous at XI = Xl = o.
4. On the space X = C [0, 1], define the functional f(x) = [X(!)]l. Find
the Frechet differential and Frechet derivative off
5. Let qJ be a function of a single variable having a continuous derivative
and satisfying IqJWI < Klel. Find the Gateaux differential of the
functionalj(x) =
I qJ(e i) where X = gil Ell. Is this also a Frechet
differential?
6. Suppose the real-valued functionalj defined on an open subset D of a
normed space has a relative minimum at Xo E D. Show that ifjis twice
Gateaux differentiable at x o , then (h,f"(x o )h) 0 for all hEX.
7. Let f be a real-valued functional defined on an open region D in a
normed space X. Suppose that at Xo E D the first Frechet differential
vanishes identically on X; within a sphere, S(xo, e), f"(x) exists
and is continuous in x; and the lower bound of (h,J"(xo)h) for
IIhll = 1 is positive. Show that f obtains a relative minimum at Xo.
8. Let A be a nonempty subset of a normed space X and let Xo EA.
Denote by C(A, xo} the closed cone generated by A - Xo, i.e., the
Ir;
210
OPTIMIZATION OF FUNCTIONALS
==
n C(A
Ii N,
xo)
N e.f'
= fa P[x, x, t] dt
b
among all functions x for which x(a) = A, x(b) = B and which have
continuous derivatives on [a, b] except possibly at a single point in
[a, b]. If the extremum has a continuous derivative except at C E [a, b],
show that Fx = dF;i;/dt on the intervals [a, c) and (c, b] and that the
functions F;i; ,and F - XP;i; are continuous at c. These are called the
Weierstrass-Erdman corner conditions. Apply these considerations to
the functional
x(-l) = 0,
x(l) = 1.
7.14
PROBLEMS
211
Hot Day
1000
400
Cool Day
200
200
= Ax(t) + hu(t)
212
OPTIMIZATION OF FUNCTIONALS
x(T)
= <I>(T.)x(O) + fo <I>(T -
t)bu(t) dt,
+ t So u2 (t) dt.
Hint: Assume first that x(T) is known and reduce the problem to a
finite-dimensional one. Next optimize x(T). Alternatively, formulate
the problem in E" X L2[O, T].
REFERENCES
7.1-4. For general background on differentials, see Graves [62], [63], Hildebrandt and Graves [72 and Hille and Phillips [73, Chapters 3 and 26]. A
GLOBAL THEORY OF
CONSTRAINED OPTIMIZATION
8.1 Introduction
The general optimization problem treated in this book is to locate from
within a given subset of a vector space that particular vector which minimizes a given functional. In some problems the subset of admissible vectors competing for the optimum is defined explicitly, as in the case of a
given subspace in minimum norm problems; in other cases the subset of
admissible vectors is defined implicitly by a set of constraint relations. In
previous chapters we: considered examples of both types) but generally we
reduced a problem with implicit constraints to one with explicit constraints
by finding the set of solutions to the constraint relations. In this chapter and
the next we make a more complete study of problems with implicit Constraints defined by nonlinear equality or inequality relations.
Deservedly dominating our attention) of course, are Lagrange multipliers
which somehow almost always unscramble a difficult constrained problem.
Although we encountered Lagrange multipliers at several points in previous
chapters) they were treated rather naively as a convenient set of numbers or
simply as the result of certain duality calculations. In a more sophisticated
approach) we do not speak of individual Lagrange multipliers but instead of
an entire Lagrange mUltiplier vector in an appropriate dual space. For
example, the problem
minimize f(x)
(1)
{subject to H(x) =
e,
L(x) z*)
=I(x) + z* H(x)
214
P = {xeE": x
8.2
215
p$ = {x*
for all x
P}.
for all x*
e,
e.
peP
216
Proof
W(XZI
+ (I
(X)Z2)
8.3
LAGRANGE MULTIPLIERS
Zl
217
Z2' then
CO(Zl) :$ CO(Z2)'
w(Z)
o
8.1
e.
and assume
(3)
1-10 =
1-10
infj(x)
subject to x
n, G(x) ::;; e
;;::
= xeO
inf {f(x) + <G(x), Z6)}.
e in Z * such that
218
(4)
n, G(xo) :::;; e, it is
(G(xo),
Proof. In the space W = R x Z, define the sets
A = {(r, z) : r '?:.!(x), z '?:. G(x) for some x E n}
rOr1
+ (Zlo
'?:. rOr2
+ (Z2'
ror
+ (z,
'?:. roJJ.o
for all (r, z) EA. If ro were zero, it would follow in particular that
(G(Xl)'
0 and that zci :p. However, since G(Xl) is an interior point
of N and
'?:. it follows that (G(X1),
< 0 (the reader should verify
this), which is a contradiction. Therefore, ro > 0 and, without loss of
generality, we may assume ro = 1.
Since the point (JJ.o, e) is arbitrarily close to A and B, we have (with ro = 1)
e.
e,
JJ.o
= (r.z)eA
inf [r + (z,
e, =!(xo), then
= O.
4) :::;;!(xo) =
JJ.o
8.4
SUFFICIENCY
219
ConVex if and only if co is convex. This in turn is implied by, but weaker
than, the assumption of convexity off and G.
Aside from convexity there are two important assumptions in Theorem 1
that deserve comment: the assumption that the positive cone P contains an
interior point and the assumption that G(X'l) < for some Xl E n. These
assumptions are .introduced to guarantee the existence of a nonverticat"
separating hyperplane. The requirement that the positive cone possess an
interior point is fairly severe and apparently cannot be completely
omitted. Indeed, in many applications this requirement is the determining
factor in the choice of space in which a pro'blem is formulated and, of
C[a, b] is a natural candidate for problems involving functions on
the interval [a, b]. The condition G(x l ) < e is called a regularity condition
and is typical of the assumptions that must be made in Lagrange multiplier
theorems. It guarantees that the separating hyperplane is nonvertical.
We have considered only convex constraints of the form G(x) e. An
equality constraint H(x)
with H(x) = Ax - b, where A is linear, is
equivalent to the two convex inequalities H(x) e and - H(x) e and can
thus be cast into the Dorm G(x) e. One expects then that, as a trivial
corollary to Theorem 1, linear equality constraints can be included together
with their resulting Lagrange multipliers. There is a slight difficulty, however, since there never exists an Xl which simultaneously renders H(xt) <
and - H(xl ) < e. A composite theorem for inequality constraints and a
finite number of equality constraints is given in Problem 7.
Theorem 1 is a geometric version of the Lagrange multipliet theorem for
convex problems. An equivalent algebraic formulation of the results is
given by the following saddle-point statement.
=e
L(xo, z*)
L(xo ,
L(x,
e.
Proof Let
diately L(xo, Z&)
be ddined as in Theorem 1. From (3) we have immeL(x, z;n. By equation (4) we have
= <G(xo), z*) -
<G(x o),
= <G(xo), z*)
::; O.
8.4 Sufficiency
The conditions of convexity and existence of interior points cannot be
omitted if we are to guarantee the existence of a separating hyperplane in the
220
w(z)
w(z)
(a)
(b)
<
<
<G(x
1 ),
<G(xo) ,
8.5
SENSITIVITY
221
for all x
e n, z*
e,
8. Then Xo solves:
minimize f(x)
subject to G(x)
8,
xen.
for all z*
zri)
(G(xo),
zt +
:::;; (G(xo),
or
(G(xo), z!) :::;; O.
We conclude by Proposition I, Section 8.2, that G(xo) 8. The saddlepoint condition therefore implies that (G(xo), zri) = O.
Assume now that Xl e n and that G(XI) 8. Then, according to the
saddle-point condition with respect to x,
f(xo) =f(xo)
+ (G(xo),
+ (G(XI)'
X En, G(x) 8. I
The Lagrange theorems of the preceding sections do not exploit all of the
geometric properties evident from the representation of the problem in
R x Z. Two other properties, sensitivity and duality, important for both
theory and application, are obtainable by visualization of the functional
w.
For any Zo the hyperplane determined by the Lagrange multiplier for the
problem
minimize f(x)
subject to G(x) Zo ,
xen
222
w(::)
Xl
problems
minimize f(x)
subject to x e nand G(x)::;; zo.
G(x)::;;
Zl.
+ (G(xo) -
Zo,
!(xo) -
!(Xl) ::;;
<G(Xl) -
ro(z) - ro(zo)
(zo - z,
*
ro '(Zo) -- -zo.
8.6
DUALITY
223
8.6 Duality
Consider again the basic convex programming problem:
minimize f(x)
subject to G(x)
0,
XEn
where/, G and n are convex. The general duality principle for this problem
is based on the simple geometric properties of the problem viewed in R x Z.
As in Section 8.3, we define the primal functional on the set r
(1)
w(Z)
= inf {f(x)
: G(x)
z, X En}
and let flo = w(O). The duality principle is based on the observation that (as
illustrated in Figure 8.4) flo is equal to the maximum intercept with the
--tI
w(z)
vertical axis of all closed hyperplanes that lie below w. The maximum
intercept is, of course, attained by the hyperplane determined by the
Lagrange mUltiplier of the problem.
To express the above duality principle analytically, we introduce the
dual functional <p corresponding to (1) to be defined on the positive cone in
Z* as
(2)
<p(Z*)
=:
inf [f(x)
xeO
+ <G(x), z*) J.
In general, cp is not finite throughout the positive cone in Z * but the region
where it is finite is convex.
224
(3)
:er
+ (z, z*)].
Proof The concavity of q> is easy to show and is left to the reader. For
any z* () and any z e r, we have
;S;
inf {f(x)
xell
= w(z)
+ (z, z*).
Xl
EO we have, with
inf {f(x)
= W(Zl)
= G(Xl),
+ (Zl> z*)
and hence
q>(Z*)
Zl
inf [w(z)
zer
+ (z,
z*)].
n,
(4)
inf [(x)
(1(X) SCI
= max <p(z*)
xd}
8.
8.6
DUALITY
225
fl, then
<G(xo) , Z6) = 0
and Xo minimizes f(x)
+ <G(x),
x E n.
xen
+ <G(x), z*)]
inf [J(x)
S;;
xen
+ <G(x), z*)]
G(x) sO
G(x) so
(5)
zso
z*<!O
As a final note we observe that even in nonconvex programming problems the dual functional can be formed according to (2). From the geometric interpretation of cp in terms of hyperplanes supporting [w, r],
it is clear that cp formed according to (2) will be equal to that which would
be obtained by considering hyperplanes supporting the convex hull of
[w, r]. Also from this interpretation it follows (provided cp(z*) is finite
for some z* 2'! 8) that for any programming problem
max cp(z*) ::s; min w(z)
(6)
z*<!O
zso
and hence the dual functional always serves as a lower bound to the value
of the primal problem.
(7)
+ A.'[Ax - c]}.
226
= Q -l(b -
A').).
Thus the dual is also a quadratic programming problem. Note that the dual
problem may be much easier to solve than the primal problem since the
constraint set is much simpler and the dimension is smaller if m < n.
but now the minimum over x may not be finite for every A. c;
minimization over; is finite if and only if there is an x
e. In fact, the
Qxb+A'A.=e
and all x's satisfying this equation yield the minimum. With this equation
substituted in (8), we obtain the dl.!al in the form
maximize - A' c - tx' Qx
subject to Qx - b + A'A. = e,
where the maximum is taken with respect to both A and x.
8.7 Applications
In this section we use the theory developed in this chapter to solve several
allocation and control problems involving inequality constraints.
Although the theory of convex programming developed earlier does not
require even continuity of the convex functionals involved, in most problems the functionals are not only continuous but Frechet differentiable. In
such problems it is convenient to express necessary or sufficient conditions
8.7
APPLICATIONS
227
in differential form. For this reason we make frequent use of the following lemma which generalizes the observation that if a convex funct,ion
f on [0, ex:: achieves a minimum at an interior point Xo , then !'(xo) = 0,
while if it achieves a minimum at Xo = 0, then !'(xo) ;:::: 0; in either case
xo!,(x o) = 0.
(jf(x o; x) ;:::: 0
(2)
of(xo; xo) = 0.
all x
P we must have
Hence
(3)
Setting x = xo/2 yields
(4)
while setting x = 2xo yields
(5)
Together, equations (3), (4), and (5) imply (1) and (2).
Sufficiency: For xo, x E P and < IX < I we have
or
1
f(x) - f(xo) ;:::: - [f(xo
IX
+ IX(X -
- f(xo)].
xo
(6)
P.
for all x
P and, hence,
228
(7)
x(t) = A(t)x(t)
+ b(t)u(t),
1-
J=
I u (t) dt,
tl
to
c,
tI
where <1> is the fundamental solution matrix of the corresponding homogeneous equation. We assume <1>(t1' t) and b(t) to be continuous. The
original problem can now be expressed as: minimize
subject to
(10)
Ku
8.7
APPLICATIONS
229
).,,0
+ X(d -
Ku)},
+ Xd
to
and hence
110 = max A'QA
(12)
).,,0
+ It'd,
where
Q=
-1-
f (J>(t
t,
j ,
t)b(t)b'(t)(J>'(tj> t) dt.
to
The Lagrange multiplier vector 11.0 has the usual interpretation as a sensitivity. In this case it is the gradient of the optimal cost with respect to the
target c.
Example 2. (Oil Drilling) A company has located an underground oil
deposit of known quantity IX. It wishes to determine the long-term oil
extraction program that will maximize its total discounted profit.
The problem may
formulated as that of finding the integrable function x on [0, (0), repre:senting the extraction rate, that maximizes
Ia<XlF[X(t)]V(t) dt
subject to
{"'x(t) dt::::;;
IX,
x(t)
o.
1 A problem almost identical to this is solved in Section 7.12 by use of the Fenchel
duality theorem.
230
F[x] represents the profit rate associated with the extraction rate x, By
assuming diminishing marginal returns, the function F can be argued to be
strictly concave atld increasing on [0, co) with F[O] = O. We assume also
that Fhas a continuous derivative. The" function vet) represents the discount
factor and can be assumed to be continuous, positive, strictly decreasing
toward zero, and integrable on [0, co).
To apply the differentiability conditions, we first consider the problem
with the additional restriction that x be continuous and x(t) = 0 for
t T where T is some fixed positive time. The constraint space Z then
corresponds to the inequality g x dt :s; (X and is thus only one dimensional.
The Lagrangian for this problem is
(13)
(X
X:1;O
fa {Fx[xo(t)]v(t) -
(14)
Ao}X(t) dt :s; 0
fa {Fx[xo(t)]v(t) -
(15)
AO}Xo(t) dt == 0
(16)
xo(t)
0,
Ao > 0,
J.l.(t)
0,
J.l.(t)xo(t)
= 0,
faxo(t) dt
= <x,
(17)
Since Fx is condnuous and strictly decreasing, it has an inverse Fx -1 which
is continuous and strictly decreasing. From (16) and (17) we see immediately that on any interval Nhere xo(t) > 0 we have xo(t) = Fx -l{Ao/t>(t)}
which is decreasing. Since we seek a continuous x, it follows that the
maXimizing xo, if it exists, m'lst have the form
to:::;; t:::;; T.
8.7
APPLICATIONS
231
ft0oFx -
Fx[O]v(to).
{F x[O]v(to)} dt
v(t)
,
one can show (we leave it to the reader) that J is continuous and that lim
to"" 0
J(t 0) = 0 and lim J(t 0) = co. Hence, let us choose to such that J( t 0) = 0(, Then
to-t- 00
F; 1 {Fx[O]V(to)
xo(t) =
v(t)
o
o ::; t ::; to
to ::; t ::; T
satisfies (16) and (17). Since the Lagrangian is concave, these conditions
imply that Xo is optimal for the problem on [0, T].
Given 8 > 0, however, any function y(t) for which
( ' y(t) dt =
IX
< co
234
s Xl(t) + b(t -
't')
= (b(t -
Thus dvldtlt=! represents the value, in units of final storage, of an opportunity to reinvest one unit of storage at time 't'.
Example 4. (Production Planning) This example illustrates that the
minimize
1- fo r2(t) dt
set),
+ {r('t') d't'
{d('t') d't'
=!
set)
= (1
2t
O:::;;tS1-
-1<t:::;;1.
+{
reT) d't'
J& r2 dt
does not exist for all r E L 1 [O, 1]. The choice X = CEO, 1] is feasible but,
8.7
235
APPLICATIONS
as we see from Example 3, there may be no solution in this space. Two other
possibilities are: (1) the space of piecewise continuous functions and (2) the
space of bounded measurable functions. We may define the supremum
norm on these last two spaces but this is unnecessary since Lemma 1 is
valid for linear Gateaux differentials as well as Frechet differentials and
this example goes through either way. Let us settle on piecewise continuous
functions.
The Lagrangian for this problem is
L(r, v) =
where v E BV[O,
Vo
e.
IJ
f r2(t) dt + f [set) 1
z(t)] dv(t),
e,
=t
fo r2(t) dt
fo r2(t) dt
fa fa r(-c) dT dv(t)
1
+ fo [set) -
z(O)J dv(t) -
+ fa [set) -
z(O)J dv(t)
v(l)
dt.
e, we require that
+ vo(t) - vo(l) ; 0
ro(t)[ro(t) + vo(t) - vo(l)J = 0
ro(t)
for all t
for
I'
t.
Since set) S z(t) for all t, in order to maximize L(ro, v) we require that
vo(t) varies only for those t with z(t) = set).
The set offunctions satisfying these requirements is shown in Figure 8.5.
Note that v is a step function and varies only at the single point t = 1. If a
Lagrangian of the form
L(r, A)
=t
fo r2(t) dt
+ fo [set) -
Z(t)]A(t) dt
= - fo Mv(t) dt + I:is(l)v(l)'-l:is(O)v(O).
236
J* ) o
r(t)
;;. t
r
I
"2
-11-----1
vo(t)
Figure 8.5
AJ = -
fo Ad(t)v(t) dt.
Thus - vet) is the price per unit that must be charged for small additional
orders at time t to recover the increased production cost. Note that in the
particular example considered above this price is zerO for t > t. This is
because the production cost has zero slope at r == 0.
8.8 Problems
1. Prove that the subset P of Lp[a, b], I
$,
p <
00,
consisting of functions
8.8
PROBLEMS
237
x;::::
e,
e,
yet) = yeO)
+ fo[z(r) -
d(r)] dr.
Find the production plan z(t) for the interval 0::;; t::;; T satisfying
z(t) ;:::: 0, y(t) ;:::: 0, 0::;; t::5: T, and minimizing the stlm of the production
costs and inventory costs
J=
238
Assume'that
d(t) == 1,
2y(O)h> 1,
Answer:
z(t) =
2h
o
(h . (t -
+ y(O) > T,
y(O)
< T.
t1)
where
tl
== T -
[T - y(O)].
== 1,
1
2h + y(O) S T,
2y(O)h> I,
y(O) < T.
It is helpful to define
tl
== y(O) - - , t z == y(O) + -.
2h
2h
REFERENCES
LOCAL THEORY OF
CONSTRAINED OPTIMIZATION
9.1 Introduction
240
attention because additional results can be. derived for this important class
of problems and additional insight into Lagrange multipliers is obtained
from their a,nalysis.
LAGRANGE MULTIPLIER THEOREMS
+ gn-1))'
9.2
241
(which is possible
Rewriting (1)
geL"
slightly, we have
Ln == A- 1(y - T(xo
+ 9n-1) + T'(XO)Un-l)
and similarly
L.-1
Therefore,
L. - L.-1
== - A -l(T(x) - T'(xo)x),
we obtain
(3)
Since IIg111 :::; 211L111
ciently small we
lIy - Yo II
suffi-
+ (g2
- gl)
$(1 + 1- + ... +
shows that
aU n.
+ ... + (gn -
gn-I)II
211glll5. r
242
Thus
for all n and hence the sequence {gn} converges to an element g. Correspondingly, the sequence {Ln} converges to a coset L. For these limits we
have
L = L I- A-1(y - T(xo
+ g
or, equivalently,
. T(xo
+ g) = y;
and
='e
The above result can be visualized geometrically iIi the space X in terms of
the concept of the tangent space of the consttaint surface. By the tangent
space at xo, we mean the setof all vectors h for which H'(xo)h = e (Le., the
nUllspace of H'(xo' It is a subspace of X which, when translated to the
point xo, can be regarded as the tangent of the surface N = {x : H(x) = 8}
near Xo . An equivalent statement to that of Lemma 1 is that/ is stationary
at Xo with respect to variation in the tangent plane. This is illustrated in
Figure 9.1 where contours of/as well as the constraint surface for a single
9.3
EQUALITY CONSTRAINTS
243
\
\
ex\
\
\
\
\
Tangent space
Figure 9.1
lI(x)
=e
Constrained optimization
+ Z6H'(Xo) = e.
Proof From Lemma I it is clear that!'(xo) is orthogonal to the nullspace of H'(x o)' Since, however, the range of H'(xo) is closed, it follows
(by Theorem 2, Section 6.6) that
!'(xo) E PA [H'(xo)*J.
Hence there is a
Z6 E Z * such that
f'(x o) == - H'(x o)*z6
*>.
244
meM
e.
e.MJ.,
:F Since
together with ro == 0
9.3
EQUALITY CONSTRAINTS
245
ta
t,
(1)
J =
x, t) dt
f(x,
to
cp(x, t) =
(2)
O.
Both f and cp are realvalued functions and are assumed to have continuous partial derivatives of second order. Since the end points are fixed,
we restrict OUr attention to variations lying in the subspace Xc:: Dn[fo,
consisting of those functions that vanish at to and 11 As demonstrated in
Section 7.5, the Frechet differential of J is
ta
(3)
M(x; h) =
[fx(x,
to
(4)
la.
f y(t) dz(t)
t,
<y, y*) =
to
246
. f'l y(t)A(t) dt
(y, y*) =
to
fix,
d t J.2
x
'2
'2
x +y +z
dt J
+ A(t)cpx = 0
x2 + y2 + Z2 + A{t),I,.'1'" --
Z
.2
dtJ x
'2
'2
+y +z
+ A(t)cpz =0
which, together with the constraint; can be solved for the arc.
In the special case of geodesics on a sphere, we have
cp(x, y, z) = x 2 + y2
CPx = 2x,
+ Z2 -
Cp" = 2y,
r2 = 0
cpz = 2z.
9.4
247
Let the constants a, b, c be chosen so that the plane through the origin
described by
ax
+ by + cz =
-dt
= ax(t) +
(I)
e,
where A*
and Ajgj(Xo) == 0, i = 1,2,3. The equality Ajgj(Xo) = 0 merely
says that if gi(X O) < 0, then the corresponding Lagrange multiplier is absent
from the necessary condition.
248
e.
(b)
(c)
(d)
Lf 11
9.4
249
Figure 9.4
e.
f(x)
is stationary at Xo; furthermore,
+ <G(x),
<G(xo),
o.
{(r, z) : r
0, z
G(xo)
O}.
1 As discusseed in Problem 9 at the end of this chapter, these hypotheses can be somewhat weakened.
250
The sets A and B are obviously convex; in fact, both are convex
cones
although A does not necessarily have its vertex at the origin. The set B
contains interior points since P does.
.
The set A does not contain any interior points of B because if (r, z) e A,
with, < 0, z < 9, then there exists heX such that
()f(xo; h) < 0,
The point G(xo) + ()G(x o ; h) is the center of some open sphere contained
in the negative cone N in Z. Suppose this sphere has radius p. Then for
0< oc < 1 the point oc[G(xo) + ()G(xo; h)] is the center of an open sphere of
radius oc' p contained in N; hence so is the point (1 - oc)G(xo)+oc[G(xo) +
()G(xo; h)] = G(xo) + oc . ()G(x o ; h). Since for fixed h
IIG(.Ji)
+ och) -
it follows that for sufficiently s''''Lall oc, G(xo + och) < 8. A similar argument
shows thatf(xo + och) <f(xo) for sufficiently small oc. This contradicts the
optimality of Xo; therefore A contains no interior points of B.
to Theorem 3, Section 5.12, there is a closed hyperplane
() such that
separating A and B. Hence there are '0,
ro . r + (z,
()
+ (z,
()
ro' r
'0
'0
()f(xo; h)
== 8
<G(xo),
== O.
9.4
251
fixo)
(2)
+ ,.1,0 Gixo) -
Jlo ::;:: 0
o.
Since the first term of (2) is nonpositive and the second term is nonnegative,
they must both be zero. Thus, defining the reduced Lagrangian
L(x, A) == f(x)
+ A'G(x),
Lixo, ,.1,0)
Lx(x o , Ao)x o = 0
L;.(xo, Ao)
L;.Cxo, Ao)Ao = 0
X"2!O
The top row of these equations says that the derivative of the Lagrangian
with respect to Xi must vanish if Xi> 0 and must be nonnegative if Xi = O.
The bottom row says that Aj is zero if the j-th constraint is not active, i.e.,
if the j-th component of G is not zero.
(3)
minimize J =
f f(x, x, t) dt
'
to
(4)
o.
Here to, t1 are fixed and X is a function of t. The initial value x(to) is fixed
and satisfies (x(to), to) < o. The real-valued functions f and 4> have continuous partial derivatives with respect to their arguments. We seek a
continuous solution xU) having piecewise continuous derivative. We
assume that, along the solution, 4>x ::j:. O.
252
for all t e [to, t 1]. This condition is satisfied since it is assumed that
cp(xo(to), to) < 0 and CP;r;(XO(t), t) :F O.
We now obtain, directly from Theorem 1, the conditions
(5)
The function
is bounded on [to, t1] and has at most a countable number of discontinuities. However, with the ex:;;eption of the right end point, M must be
continuous from the right. Therefore,by a slightly strengthened version of
Lemma2, Section 7.5, and by considering (7) for those particular he Xthat
vanish at tl as well as at to, we have
{S}
for t e [to, t 1). If A does not have a jump at tl> then (8) substituted into (7)
yields
!;i;(X,
x, t)1
t=t1
=0
because h(t1) is arbitrary. On the other hand, if -t has a jump at tl> then
(0] ImpliCI5 tl1i:lt
cp(x, t)
''''II
= O.
9.4
Therefore, we obtain in
q>(x, t) . fx(x,
(9)
253
x, t) \1=11 = O.
Together (6), (8), and (9) are a complete set of .necessary conditions for the
problem. For a simple application of this result, see Problem 6.
Example 4. As a specific instance ofthe above result, consider the problem
1
minimize J =
fa [x(t) + !x(t)2] dt
)0
254
x.
We assume that the vector-valued function! has continuous partial
derivatives with respect to x and u. The class of admissible control functions
is taken to be Cm[to, td, the continuous m-dimensional functions on
[to ,td, although there are other important alternatives. The space of
admissible control functions is denoted U.
Given any u E U and an initial condition x(to), we assume that equation
(Ir-defines a unique continuous solution x(t), t > to. The function x
resulting from application of a given control u is said to be the trajectory of
the system produced by u. The class of all admissible trajectories which we
take to be the continuous n-dimensional functions on [to , t1] is denoted X.
In the classical optimal control problem, we are given, in addition to the
dynamic equation (1) and the initial condition, an objective functional of
the form
(2)
tl
J=
J,to l(x,u)dt
i = 1,2, ... , r
9.5
255
=d
(4)
x(t) - x(t o) -
to
(5)
A(x, u)
= e.
for (h, v)
U.
to
256
Xx U.
From the differential (7) it is immediately clear that we must assume that
the r x n matrix G,,(X(tI has rankr. In addition we invoke a controllability
assumption on (6). Spedfically we assume that for any n-dimensional
vector e it is possible to select a continuous function v such that the
equation
Jrl fxh(r:)dr: -
h(t)-
II !"v(r:)dr:=O
10
10
(8)
h(t) -
10
10
h(tl)
(9)
y(t)
= e.
Jl(x, u) dt
tl
(2)
J =
10
9.5
257
subject to
(1)
(3)
and assume that the regularity conditions are satisfied. Then there is an
n-dimensional vector-valued/unction A(t) and an r-dimensional vector J1 such
that for all t E [to , t 1]
(I I)
+ fx'(xo(t), uo(t
';'(t l ) = Gx'(XO(tlJ1
(12)
uo(t
(13)
+ lu(xo(t), uo(t = e.
(lx(X o , uo)h(t) dt
+ (dA'(t)[ h(t) -
+ J1'G x(xo(tlh(tl) = 0
(15)
I,
= e.
'II
I,
10
10
10
+ J1'G x h(tt) = o.
(16)
= O.
10
[t 9 ,
ta I
258
Note that the conditions (II), (12), and (13) together with the original
constraints (1) and (3) and the initial condition form a complete system of
equations: 2n first-order differential equations, 2n boundary conditions,
r terminal constraints, and m instantaneous equations from which xo(t),
,t(t), jl., and uo(t) can be found.
Example 1. We now find the m-dimensional control function u that
minimizes the quadratic objective functional
=t f
tl
(17)
to
[x'(t)Qx(t) + u'(t)Ru(t)] dt
(18)
+ Bu(t),
x(to) fixed,
-let)
= F',t(t) + Qx(t),
,t(tt) = ()
,t'(t)B + u'(t)R = O.
(20)
(21)
= -R- 1B',t(t).
(22)
x(to) fixed.
Together (19) and (22) form a linear system of differential equations in the
variables x and A.. The system is complicated, however, by the fact that half
of the boundary conditions are given at each end. To solve this system, we
observe that the solution satisfies the relation
(23)
,t(t)
= P(t)x(t),
Pet)
=-
+ P(t)BR -1 B'P(t) -
Q,
9.5
259
(25)
= -R-IB'P(t)x(t)
which gives the control input in feedback form as a linear function of the
state.
This solution is of great practical utility because if the solution pet) of
equation (24) is found (as, for example, by simple backward numerical
integration), the optimal control can be calculated in real time from
physical measurements of x(t).
= 0 with
fixed initial velocity and direction. The rocket is propelled by thrust
developed by the rocket motor and is acted upon by a uniform gravitational
field and negligible atmospheric resistance. Given the motor thrust, we
seek the thrust direction program that maximize'S the range of the rocket on
a horizontal plane.
in Figure 9.6. Note that the final time T is
The problem is
determined by the impact on the horizontal plane and is an unknown
variable.
y
Figure 9.6
Letting
VI
= oX, Vz =
>" x
VI = fU) cos
e,
VI (0)
Vz = r(t) sin
e- g,
V2(O) given,
given,
where f(t) is the instantaneous ratio of rocket thrust to mass and g is the
acceleration due to gravity. The range is
J
VI(t)
dt = x(T),
where T> 0 is the time at which y(T) = 0 or, equivalently, the time at
which vz(t) dt = O.
We may regard the problem as formulated in the space CZ[O, To] of twodimensional continuous time functions where To is some fixed time greater
260
and
,
J(v
+ h) -
J(v)
= x(T) -
=t
x(T)+(T- T)V1(T)
h 1(t) dt
+ (T -T)Vl<T).
+ h) -
J(v)
= fa h 1(t) dt -
'T
fo h2(t) dt.
([v.(t) -
V2
(t)] dt
which is an integral on the fixed interval [0, T]. This problem can be solved
in the standard fashion.
Following Theorem 1, we introduce the Lagrange variables
J. 1 = -1
J. _ V1(T)
2 -
v2(T)
A,1(T) = 0
A,2(T) =
o.
9.6
[VI
Cf)/v 2(T)](t -
261
o.
A[x, uJ = 0
vii.
+ g[x, u].
262
(4)
+ gx[x(u), u] = O.
Then for v E 0,
(5)
J(u) - J(v)
= L[x(u), u, A*] -
L[x(u), v, A*]
+ o(lIu -
viI).
Proof By definition
J(u) - J(v)
g[x(v), v]
+ g[x(u), v] - g[x(v), v]
g[x(u), v] + gx[x(u), u] . [x(u) - x(v)]
+ (gx[x(v), v] -: gx[x(u), u])[x(u) - x(v)] + o<llx(u) g[x(u), u] - g[x(u), v] + gx[x(u), u][x(u) - x(v)]
+ o(lIu - viI),
g[x(u), v]
x(v)11)
where the last two steps follow from the continuity of gx and the Lipschitz
condition (2),
Likewise,
IIA[x(u), u] - A[x(u), v] - Ax[x(u), u][x(v) - x,(u)]!1 = o(llv - ull).
Therefore,
J(u) - J(v) = L[x(u), u, A*] - L[x(u), v, A*]
+ o(lIu- viI). I
The significance of the above result is that it gives, to first order, a way of
determining the change in J- due to a change in u without reevaluating the
implicit function x. The essence of the argument is brought out in Problem 18. We next apply this result to a system described by a system of
ordinary differential equations of the form
x =/(x, u),
and an associated objective functional
J=
tl(X, u) dt.
to
II II
M[lIx
-.vII
-I-
lIu - vII],
9.6
263
f lex, u) dt
I.
J =
10
subject to x(t)
equation
(6)
=f(x,
-.le(t) =fx'A(t)
+ lx',
where the partial derivatives are evaluated along the optimal trajectory,
and define the Hamiltonian function
(7)
H(x, u,
t)
(8)
for all u E
n.
10
g[x,
uJ =
l.
f lex, u) dl,
10
+ lt5u(-r:)I} d-r:
IIt5x(t)IIEn =:;;
f'lou(-r:)1 d-r:.
10
uJ satisfies the
264
= e. The functional
f H(x, u, A) dt
11
10
(9)
J(uo) - J(u) =
We now show that equation (9) implies (8). Suppose to the contrary that
there is 1 e [to, t 1] and u e n such that
u, A(t)
>e
+ o(llu -
uolI).
But lIu - uoll =O([t" - t']); hence, by selecting [t', til] sufficiently small,
J(uo) - J(u) can be made positive, which contradicts the optimality of Uo I
Before considering an example, several remarks concerning this result
and its relation to other sets of necessary conditions are appropriate.
Briefly, the result says that if a control function minimizes the objective
functional, its values at each instant must also minimize the Hamiltonian.
(Pontryagin's adjoint equation is defined slightly differently than ours with
the result that his Hamiltonian must be maximized, thus accounting for
the name maximum principle rather than minimum principle.) It should
___ immediately be obvious that if the Hamiltonian is differentiable with
respect to u as well as x and if the region n is open, the conditions of
Theorem 1 are identical with those of Theorem 1, Section 9.5. The
mum principle can also be extended to problems having terminal constraints, but the proof is by no means elementary. However, problems of
this type arising from applications are often convex and can be treated by
the global theory developed in Chapter 8.
9.6
265
(10)
= u(t)x(t),
x(O)
>0
(11)
J =
(12)
So (1 -
u(t))x(t) dt
O::5:u(t)::5:1.
(13)
U(t)A(t)
+ 1-
A(T) = 0
u(t),
(14)
H(x, u, A)
= A(t)U(t)x(t) + [1 -
u(t)]x(t).
An optimal solution x o , uo, A must satisfy (10), (12), and (13) and (14)
must be maximized with respect to admissible u's. Since x(t) ;;::: 0 for all
t E [0, T], it follows from (14) that
A(t) > 1
A(t)
< 1.
o
Figure 9.8
266
9.7 Problems
1. Prove the following inverse function theorem. Let D be an open subset of a Banach space X and let T be a transformation from D into X. Assume that T is continuously Fréchet differentiable on D and that at a point x₀ ∈ D, [T′(x₀)]⁻¹ exists. Then:
   (i) There is a neighborhood P of x₀ such that T is one-to-one on P.
   (ii) There is a continuous transformation F defined on a neighborhood N of T(x₀), with range R ⊂ P, such that F(T(x)) = x for all x ∈ R.
   Hint: To prove uniqueness of the solution to T(x) = y in P, apply the mean value inequality to the transformation φ(x) = [T′(x₀)]⁻¹T(x) − x.
2. Prove the following implicit function theorem. Let X and Y be Banach spaces and let T be a continuously Fréchet differentiable transformation from an open set D in X × Y with values in X. Let (x₀, y₀) be a point in D for which T(x₀, y₀) = θ and for which [T_x(x₀, y₀)]⁻¹ exists. Then there is a neighborhood N of y₀ and a continuous transformation F mapping N into X such that F(y₀) = x₀ and T(F(y), y) = θ for all y ∈ N.
3. Show that if all the hypotheses of Theorem 1, Section 9.4, are satisfied, except perhaps the regularity condition, then there is a nonzero positive element (r₀, z₀*) ∈ R × Z* such that r₀f(x) + ⟨G(x), z₀*⟩ is stationary at x₀ and ⟨G(x₀), z₀*⟩ = 0.
4. Let g₁, g₂, …, gₙ be real-valued Fréchet differentiable functionals on a normed space X. Let x₀ be a point in X satisfying

(1)  gᵢ(x₀) ≤ 0,  i = 1, 2, …, n.

Let I be the set of indices i for which gᵢ(x₀) = 0 (the so-called binding constraints). Show that x₀ is a regular point of the constraints (1) if and only if there is no nonzero set of λᵢ's, i ∈ I, satisfying

Σ_{i∈I} λᵢgᵢ′(x₀) = 0,  λᵢ ≥ 0 for all i ∈ I.
5. Show that if in the generalized Kuhn-Tucker theorem X is normed, f and G are Fréchet differentiable, and the vector x is required to lie in a given convex set Ω ⊂ X (as well as to satisfy G(x) ≤ θ), then there is a z₀* ≥ θ such that ⟨G(x₀), z₀*⟩ = 0 and f′(x₀) + z₀*G′(x₀) ∈ [Ω − x₀]^⊕.
6. A bomber pilot at a certain initial position above the ground seeks the
path of shortest distance to put him over a certain target. Considering
only two dimensions (vertical and horizontal), what is the nature of
the solution to his problem when there are mountain ranges between
him and his target?
7. Derive necessary conditions for an optimal control problem with state-space constraints of the form

(i)  G(x(t)) ≤ θ,
(ii)  ẋ(t) = f(x(t), u(t)),
(iii)  x(0) = x₀.
9. For a transformation T, define the one-sided differential

δ⁺T(x; h) = lim_{α→0⁺} (1/α)[T(x + αh) − T(x)],

provided the limit on the right exists for all h ∈ X, and call it a convex Gâteaux differential if δ⁺T(x; h) is convex in the variable h. Let f be a functional on X and G a transformation from X into Z. Assume that both f and G possess convex Gâteaux differentials. Let x₀ be a solution to the problem:

minimize f(x)
subject to G(x) ≤ θ.
10. After a heavy military campaign a certain army requires many new shoes. The quartermaster can order three sizes of shoes. Although he does not know precisely how many of each size are required, he feels that the demands for the three sizes are independent and the demand for each size is uniformly distributed between zero and three thousand pairs. He wishes to allocate his shoe budget of four thousand dollars among the three sizes so as to maximize the expected number of men properly shod. Small shoes cost one dollar per pair, medium shoes cost two dollars per pair, and large shoes cost four dollars per pair. How many pairs of each size should he order?
11. Because of an increasing average demand for its product, a firm is considering a program of expansion. Denoting the firm's capacity at time t by c(t) and the rate of demand by d(t), the firm seeks the nondecreasing function c(t), starting from c(0), that maximizes an objective of the form

∫₀ᵀ {revenue[c(t), d(t)] − [ċ(t)]²} dt,

where the first term in the integrand represents revenue due to sales and the second represents expansion costs. Show that this problem can be stated as a convex programming problem. Apply the considerations of Problem 9 to this problem.
12. Derive the necessary conditions for the problem of extremizing

∫_{t₀}^{t₁} l(x, ẋ, t) dt

subject to

φ(x, ẋ, t) ≤ 0.
13. Derive necessary conditions for the following two optimal control problems:

A:  minimize ∫_{t₀}^{t₁} l(x, u, t) dt
    subject to ẋ(t) = f(x, u, t), x(t₀) fixed,
    ∫_{t₀}^{t₁} K(x, u, t) dt = b.

B:  minimize ψ(x(t₁))
    subject to ẋ(t) = f(x, u, t), x(t₀) fixed.
14. Derive corresponding conditions for the discrete-time problem:

minimize Σ_{k=0} l(x(k), u(k))
subject to x(k + 1) = f(x(k), u(k)),

considering in particular an objective containing a quadratic control term u′(k)Ru(k).
15. Find the control u minimizing

∫₀¹ [x²(t) + u²(t)] dt,  where ẋ(t) = u(t),

subject to the boundary conditions x(0) = 1, x(1) = e⁻¹.
REFERENCES
9.2. This form of the inverse function theorem is apparently new, but the proof is based on a technique in Lusternik and Sobolev [101, pp. 202-208]. For standard results on inverse function theorems, see Apostol [10] and Hildebrandt and Graves [72].
10
ITERATIVE METHODS
OF OPTIMIZATION
10.1 Introduction

10.2 Successive Approximation
Definition. Let S be a subset of a normed space X and let T be a transformation mapping S into S. Then T is said to be a contraction mapping if there is an α, 0 ≤ α < 1, such that ‖T(x₁) − T(x₂)‖ ≤ α‖x₁ − x₂‖ for all x₁, x₂ ∈ S.

Note for example that a transformation having ‖T′(x)‖ ≤ α < 1 on a convex set S is a contraction mapping since, by the mean value inequality,

‖T(x₁) − T(x₂)‖ ≤ sup ‖T′(x)‖ ‖x₁ − x₂‖ ≤ α ‖x₁ − x₂‖.
Theorem 1. (Contraction Mapping Theorem) If T is a contraction mapping on a closed subset S of a Banach space, there is a unique vector x₀ ∈ S satisfying x₀ = T(x₀). Furthermore, x₀ can be obtained by the method of successive approximation starting from an arbitrary initial vector in S.

Proof. Select an arbitrary element x₁ ∈ S. Define the sequence {xₙ} by the formula xₙ₊₁ = T(xₙ). Then ‖xₙ₊₁ − xₙ‖ = ‖T(xₙ) − T(xₙ₋₁)‖ ≤ α‖xₙ − xₙ₋₁‖. Therefore,

‖xₙ₊₁ − xₙ‖ ≤ αⁿ⁻¹‖x₂ − x₁‖.

It follows that

‖xₙ₊ₚ − xₙ‖ ≤ ‖xₙ₊ₚ − xₙ₊ₚ₋₁‖ + ⋯ + ‖xₙ₊₁ − xₙ‖
           ≤ (αⁿ⁺ᵖ⁻² + αⁿ⁺ᵖ⁻³ + ⋯ + αⁿ⁻¹)‖x₂ − x₁‖
           ≤ [αⁿ⁻¹/(1 − α)]‖x₂ − x₁‖;

hence {xₙ} is a Cauchy sequence and, since S is a closed subset of a complete space, there is an element x₀ ∈ S with xₙ → x₀.
We now show that x₀ is a fixed point of T. We have

‖x₀ − T(x₀)‖ ≤ ‖x₀ − xₙ‖ + ‖xₙ − T(x₀)‖
            = ‖x₀ − xₙ‖ + ‖T(xₙ₋₁) − T(x₀)‖
            ≤ ‖x₀ − xₙ‖ + α‖xₙ₋₁ − x₀‖,

and since the right side tends to zero, x₀ = T(x₀). To show uniqueness, suppose y₀ = T(y₀). Then

‖x₀ − y₀‖ = ‖T(x₀) − T(y₀)‖ ≤ α‖x₀ − y₀‖,

which, since α < 1, implies ‖x₀ − y₀‖ = 0. Thus x₀ = y₀. ∎
Figure 10.1  Successive approximation
Example 1. Consider the system of n linear equations in n unknowns

a₁₁x₁ + a₁₂x₂ + ⋯ + a₁ₙxₙ = b₁,
a₂₁x₁ + a₂₂x₂ + ⋯ + a₂ₙxₙ = b₂,
  ⋮
aₙ₁x₁ + aₙ₂x₂ + ⋯ + aₙₙxₙ = bₙ,

written compactly as Ax = b.
The matrix A = [aᵢⱼ] is said to have strict diagonal dominance if

|aᵢᵢ| > Σ_{j≠i} |aᵢⱼ|  for each i.

In what follows we assume that A has strict diagonal dominance and that each of the n equations represented by Ax = b has been appropriately scaled so that aᵢᵢ = 1 for each i. The equation may be rewritten as

x = (I − A)x + b.

On Eⁿ introduce the norm

‖x‖ = max_{1≤i≤n} |xᵢ|;

with respect to this norm a matrix B = [bᵢⱼ], regarded as an operator, has norm

‖B‖ = max_i Σ_{j=1}^{n} |bᵢⱼ|.

Hence, defining T(x) = (I − A)x + b, we have

‖T(x) − T(y)‖ = ‖(A − I)(x − y)‖ ≤ ‖A − I‖ ‖x − y‖,

and, since the diagonal elements of A − I vanish,

‖A − I‖ = max_i Σ_{j≠i} |aᵢⱼ| ≡ α < 1.

Thus T is a contraction mapping, and the equation Ax = b can be solved by successive approximation from an arbitrary starting vector.
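For illustration, this Python sketch (not from the text; the matrix is an assumed example scaled so that aᵢᵢ = 1) carries out the successive approximation xₙ₊₁ = (I − A)xₙ + b and reports the contraction constant α:

    import numpy as np

    # A diagonally dominant matrix, scaled so that a_ii = 1 (assumed data).
    A = np.array([[1.0, 0.2, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.1, 1.0]])
    b = np.array([1.0, 2.0, 3.0])

    alpha = np.sum(np.abs(A - np.eye(3)), axis=1).max()  # ||A - I|| = max row sum
    x = np.zeros(3)
    for _ in range(100):                        # successive approximation
        x_new = (np.eye(3) - A) @ x + b         # T(x) = (I - A)x + b
        if np.max(np.abs(x_new - x)) < 1e-12:
            x = x_new
            break
        x = x_new
    print("contraction constant alpha =", alpha)
    print("fixed point x =", x, " residual =", A @ x - b)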
Example 2. Consider the linear integral equation

x(t) = f(t) + λ ∫ₐᵇ K(t, s)x(s) ds.

Let ∫ₐᵇ∫ₐᵇ K²(s, t) dt ds = β² < ∞, and assume that f ∈ X = L₂[a, b]. Then the integral on the right-hand side of the integral equation defines a bounded linear operator on X having norm less than or equal to β. It follows that the mapping

T(x) = f(t) + λ ∫ₐᵇ K(t, s)x(s) ds

is a contraction mapping on X provided that |λ| < 1/β. Thus, for this range of the parameter λ, the equation has a unique solution which can be determined by successive approximation.
The basic idea of successive approximation and contraction mappings can be modified in various ways to produce convergence theorems for a number of different situations. We consider one such modification below; others can be found in the problems at the end of this chapter.
Theorem 2. Let T be a continuous mapping from a closed subset S of a Banach space into S, and suppose that Tⁿ is a contraction mapping for some positive integer n. Then T has a unique fixed point in S which can be found by successive approximation.
Proof. Let x₁ ∈ S and define the sequence {xᵢ} by xᵢ₊₁ = T(xᵢ). Since Tⁿ is a contraction, it follows from Theorem 1 that the subsequence {xₙₖ} = {(Tⁿ)ᵏ(x₁)} converges to an element x₀ ∈ S which is the unique fixed point of Tⁿ. We show that x₀ is also the unique fixed point of T. Applying (Tⁿ)ᵏ to T(x₁) likewise yields a sequence converging to x₀, so that x₀ = lim_{k→∞} Tⁿᵏ(x₁) and, by the continuity of T,

T(x₀) = lim_{k→∞} Tⁿᵏ(T(x₁)) = x₀.

Finally, any fixed point of T is also a fixed point of Tⁿ, and hence x₀ is the unique fixed point of T. ∎
Example 3. Consider the differential equation

ẋ(t) = f[x(t), t].

The function x may be taken to be scalar valued or vector valued, but for simplicity we assume here that it is scalar valued. Suppose that x(t₀) is specified. We seek a solution x(t) for t₀ ≤ t ≤ t₁. We show that, under the assumption that the function f satisfies a Lipschitz condition on [t₀, t₁] of the form

|f[x₁, t] − f[x₂, t]| ≤ M|x₁ − x₂|,

a unique solution to the initial value problem exists and can be found by successive approximation.

The differential equation is equivalent to the integral equation

x(t) = x₀ + ∫_{t₀}^{t} f[x(τ), τ] dτ.

On the space X = C[t₀, t₁], define the transformation

[T(x)](t) = x₀ + ∫_{t₀}^{t} f[x(τ), τ] dτ.

Then

‖T(x₁) − T(x₂)‖ ≤ M(t₁ − t₀)‖x₁ − x₂‖,

and repeated application of the Lipschitz condition gives, more generally,

‖Tⁿ(x₁) − Tⁿ(x₂)‖ ≤ [Mⁿ(t₁ − t₀)ⁿ/n!] ‖x₁ − x₂‖.
Since n! increases faster than any geometric progression, it follows that for sufficiently large n, Tⁿ is a contraction mapping. Therefore, Theorem 2 applies, and there is a unique solution to the differential equation which can be obtained by successive approximation.

A slight modification of this technique can be used to solve the Euler-Lagrange differential equation arising from certain optimal control problems. See Problem 5.
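As a concrete instance of the iteration (my example, not from the text), the following Python sketch computes the Picard iterates xₙ₊₁(t) = x₀ + ∫ f(xₙ(τ), τ) dτ on a grid, with f(x, t) = −x and x(0) = 1, whose exact solution is e^{−t}:

    import numpy as np

    # Picard iteration for x'(t) = -x, x(0) = 1 on [0, 1] (assumed example).
    t = np.linspace(0.0, 1.0, 201)
    f = lambda x, t: -x
    x = np.ones_like(t)                      # initial approximation x_1(t) = x0
    for _ in range(30):                      # x_{n+1}(t) = x0 + int_0^t f(x_n(s), s) ds
        g = f(x, t)
        increments = (g[1:] + g[:-1]) / 2 * np.diff(t)   # trapezoidal rule
        x = 1.0 + np.concatenate(([0.0], np.cumsum(increments)))
    print("max error vs exp(-t):", np.max(np.abs(x - np.exp(-t))))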
For a successive approximation procedure applied to a contraction mapping T having fixed point x₀, we have the inequalities

(1)  ‖xₙ − x₀‖ ≤ [αⁿ⁻¹/(1 − α)] ‖x₂ − x₁‖,  α < 1,

and

(2)  lim sup_{n→∞} ‖xₙ₊₁ − x₀‖ / ‖xₙ − x₀‖ = α

for some α, 0 < α < 1. Thus, in particular, (2) implies that a successive approximation procedure converges linearly. In many applications, however, linear convergence is not sufficiently rapid, so faster techniques must be considered.
10.3 Newton's Method

The classical method of Newton for solving a nonlinear equation P(x) = 0 replaces P, at the current approximation point, by its linearization and takes the zero of the linearization as the new approximation; the procedure is then repeated from this new point. This procedure defines a sequence of points according to the recurrence relation

xₙ₊₁ = xₙ − P(xₙ)/P′(xₙ).
Example 1. Newton's method can be used to develop an effective iterative scheme for computing square roots. Letting P(x) = x² − a, we obtain by Newton's method

xₙ₊₁ = xₙ − (xₙ² − a)/(2xₙ) = ½[xₙ + a/xₙ].

This algorithm converges quite rapidly, as illustrated below for the computation of √10, beginning with the initial approximation x₁ = 3.0.
n        xₙ                xₙ²
1    3.0000000000     9.0000000000
2    3.1666666667    10.0277777778
3    3.1622807018    10.0000192367
4    3.1622776602    10.0000000000
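The table is reproduced by a few lines of Python (an illustrative transcription of the recurrence, not part of the original):

    def newton_sqrt(a, x, steps=4):
        """Newton's method for P(x) = x**2 - a: x_{n+1} = (x + a/x) / 2."""
        for n in range(1, steps + 1):
            print(f"{n}  {x:.10f}  {x * x:.10f}")
            x = 0.5 * (x + a / x)
        return x

    newton_sqrt(10.0, 3.0)   # reproduces the table for sqrt(10), x1 = 3.0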
Newton's method extends naturally to the problem of solving a nonlinear operator equation P(x) = θ, where P maps a Banach space X into itself; the step from xₙ to xₙ₊₁ takes the form

xₙ₊₁ = xₙ − [P′(xₙ)]⁻¹P(xₙ).

Alternatively, the process may be regarded as successive approximation applied to the mapping

T(x) = x − [P′(x)]⁻¹P(x).

Writing Pₙ ≡ P′(xₙ), if ‖Pₙ⁻¹‖ ≤ β, ‖P″(xₙ)‖ ≤ K, ‖Pₙ⁻¹[P(xₙ)]‖ ≤ η, and h = βηK, a computation of T′(xₙ) shows that ‖T′(xₙ)‖ ≤ h. Therefore, by the contraction mapping principle, we expect to obtain convergence if h < 1 at every point in the region of interest. In the following theorem it is shown that if h < ½ at the initial point, then h < ½ holds at all points of the iteration and Newton's method converges.
Just as with the principle of contraction mapping, study of the convergence of Newton's method answers some questions concerning existence
and uniqueness of a solution to the original equation.
Theorem 1. Suppose P is twice Fréchet differentiable with ‖P″(x)‖ ≤ K, and suppose that at the initial point x₁:

1. ‖P₁⁻¹‖ ≤ β₁;
2. ‖P₁⁻¹P(x₁)‖ ≤ η₁;
3. h₁ = β₁η₁K < ½.

Then the sequence xₙ₊₁ = xₙ − Pₙ⁻¹[P(xₙ)] exists for all n ≥ 1 and converges to a solution of P(x) = θ.
Proof. We show that if the point x₁ satisfies conditions 1, 2, and 3, then the point x₂ = x₁ − P₁⁻¹P(x₁) satisfies the same conditions with new constants β₂, η₂, h₂.

Clearly, x₂ is well defined and ‖x₂ − x₁‖ ≤ η₁. By the mean value inequality,

‖P₁⁻¹[P₁ − P₂]‖ ≤ ‖P₁⁻¹‖ sup ‖P″(x̄)‖ ‖x₂ − x₁‖ ≤ β₁Kη₁ = h₁,

where x̄ = x₁ + α(x₂ − x₁), 0 ≤ α ≤ 1. Since h₁ < 1, it follows that the linear operator

H = I − P₁⁻¹[P₁ − P₂] = P₁⁻¹P₂

has a bounded inverse with ‖H⁻¹‖ ≤ 1/(1 − h₁).
Hence P₂⁻¹ = H⁻¹P₁⁻¹ exists, and an estimate of its norm is

‖P₂⁻¹‖ ≤ ‖H⁻¹‖ ‖P₁⁻¹‖ ≤ β₁/(1 − h₁) ≡ β₂.

Next, since T₁′(x₁) = θ for the Newton transformation T₁(x) = x − P₁⁻¹P(x), we have

‖P₁⁻¹P(x₂)‖ = ‖T₁(x₂) − T₁(x₁) − T₁′(x₁)(x₂ − x₁)‖
            ≤ ½ sup ‖T₁″(x)‖ ‖x₂ − x₁‖²
            = ½ sup ‖P₁⁻¹P″(x)‖ ‖x₂ − x₁‖²
            ≤ ½β₁Kη₁² = ½h₁η₁.

Therefore,

‖P₂⁻¹P(x₂)‖ = ‖H⁻¹P₁⁻¹P(x₂)‖ ≤ h₁η₁/[2(1 − h₁)] ≡ η₂ < ½η₁.

Hence the conditions 1, 2, and 3 are satisfied by the point x₂ and the constants β₂, η₂, and h₂ = β₂η₂K. It follows by induction that Newton's process defines the entire sequence {xₙ}.

Since ηₙ₊₁ < ½ηₙ, it follows that ηₙ < (½)ⁿ⁻¹η₁. Also, since ‖xₙ₊₁ − xₙ‖ < ηₙ, it follows that ‖xₙ₊ₖ − xₙ‖ < 2ηₙ and hence that the sequence {xₙ} converges to a point x₀ ∈ X.

To prove that x₀ satisfies P(x₀) = θ, we note that the sequence {‖Pₙ‖} is bounded, since

‖Pₙ‖ ≤ ‖P₁‖ + K‖xₙ − x₁‖

and the sequence {‖xₙ − x₁‖} is bounded because it is convergent. Now for each n, Pₙ(xₙ₊₁ − xₙ) + P(xₙ) = θ; and since ‖xₙ₊₁ − xₙ‖ → 0 and ‖Pₙ‖ is bounded, it follows that ‖P(xₙ)‖ → 0. By the continuity of P, P(x₀) = θ. ∎
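In finite dimensions the iteration xₙ₊₁ = xₙ − Pₙ⁻¹P(xₙ) amounts to one linear solve per step. The following Python sketch (an assumed two-dimensional test system, not from the text) exhibits the rapid decrease of the residuals:

    import numpy as np

    def P(x):
        """An assumed test system P(x) = 0 with solution (1, 1)."""
        return np.array([x[0]**2 + x[1]**2 - 2.0,
                         x[0] - x[1]])

    def P_prime(x):
        """Frechet derivative (Jacobian) of P."""
        return np.array([[2 * x[0], 2 * x[1]],
                         [1.0, -1.0]])

    x = np.array([2.0, 0.5])                       # initial approximation x1
    for n in range(6):
        step = np.linalg.solve(P_prime(x), P(x))   # P_n^{-1} P(x_n)
        x = x - step                               # Newton step
        print(n + 1, x, np.linalg.norm(P(x)))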
It is assumed in the above theorem that P is defined throughout the Banach space X and that ‖P″(x)‖ ≤ K everywhere. It is clear that these requirements are more severe than necessary, since they are only used in the neighborhood of the points of the successive approximations. It can be
shown, moreover, that once in the region of convergence the convergence of Newton's method is quadratic. Regarding the process as successive approximation with T(x) = x − [P′(x)]⁻¹P(x), and noting that T′(x₀) = θ at the solution x₀, we have

xₙ₊₁ − x₀ = T(xₙ) − T(x₀) − T′(x₀)(xₙ − x₀),

and hence, applying the mean value inequality over the points x̄ = x₀ + α(xₙ − x₀),

(1)  ‖xₙ₊₁ − x₀‖ ≤ c‖xₙ − x₀‖²,

where c = ½ sup_{x∈R} ‖T″(x)‖, a bound depending upon ‖P‴(x)‖. Relation (1) expresses the quadratic convergence of Newton's method.
An important application of Newton's method is to two-point boundary value problems. Consider first the linear system

(2)  ẋ(t) = A(t)x(t) + v(t)

on an interval [t₁, t₂], with boundary conditions

(3)  Cx(t₁) = c₁,

(4)  Dx(t₂) = d₂,

where dim(c₁) + dim(d₂) = n. Since the system is linear, we may write the superposition relation

x(t₂) = Φ(t₂, t₁)x(t₁) + b,

where Φ is the transition matrix of (2) and the vector b accounts for the forcing term v. Substituting this in (4) gives

(5)  DΦ(t₂, t₁)x(t₁) = d₂ − Db,

which together with (3) determines the initial state:

(6)  x(t₁) = [C; DΦ(t₂, t₁)]⁻¹[c₁; d₂ − Db],

the square matrix [C; DΦ(t₂, t₁)] being formed by stacking the rows of C above those of DΦ(t₂, t₁). Having determined x(t₁), the original equation (2) can be solved by a single forward integration.
Now consider a similar, nonlinear, two-point boundary value problem:

ẋ(t) = F(x, t),
Cx(t₁) = c₁,
Dx(t₂) = d₂.
Although this problem cannot be solved by a single integration or by superposition, it can often be solved iteratively by Newton's method. We start with an initial approximation x₁(t) and define the iterates by the linearized problems

ẋₙ₊₁(t) = F(xₙ, t) + F_x(xₙ, t)(xₙ₊₁(t) − xₙ(t)),
Cxₙ₊₁(t₁) = c₁,
Dxₙ₊₁(t₂) = d₂.

At each step of the iteration the linearized version of the problem is solved by the method outlined above. Then, provided that the initial approximation is sufficiently close to the solution, we can expect this method to converge quadratically to the solution.
DESCENT METHODS
10.4
General Philosophy
In this and the following sections we consider methods that minimize a functional f by generating a sequence of the form

xₙ₊₁ = xₙ + αₙpₙ,

where αₙ is a scalar and pₙ is a (direction) vector. The procedure for selecting the vector pₙ varies from technique to technique but, ideally, once it is chosen, the scalar αₙ is selected to minimize f(xₙ + αpₙ), regarded as a function of the scalar α. Generally, things are arranged (by multiplying pₙ by −1 if necessary) so that f(xₙ + αpₙ) < f(xₙ) for small positive α. The scalar αₙ is then often taken as the smallest positive root of the equation

(d/dα) f(xₙ + αpₙ) = 0.
If f is bounded below, it is clear that the descent process defines a bounded decreasing sequence of functional values and hence that the objective values tend toward a limit f₀. The difficulties remaining are those of insuring that f₀ is, in fact, the minimum of f, that the sequence of approximations {xₙ} converges to a minimizing vector, and finally, the most difficult, that convergence is rapid enough to make the whole scheme practical.
Example 1. Newton's method can be modified for optimization problems to become a rapidly converging descent method. Suppose again that we seek to minimize the functional f on a Banach space X. This might be accomplished by the ordinary Newton's method for solving the nonlinear equation F(x) = θ, where F(x) = f′(x), but this method suffers from the lack of a global convergence theorem. The method is modified to become the Newtonian descent method by selecting the direction vectors according to the ordinary Newton's method but moving along them to a point minimizing f in that direction. Thus the general iteration formula is

xₙ₊₁ = xₙ − αₙ[f″(xₙ)]⁻¹f′(xₙ),

with αₙ chosen at each step to minimize f along the Newton direction.
Figure 10.5

10.5 Steepest Descent

We begin by applying the method of steepest descent to the quadratic functional

f(x) = (x | Qx) − 2(x | b),

where Q is a self-adjoint operator on a Hilbert space X satisfying

m ≡ inf (x | Qx)/(x | x) > 0,  M ≡ sup (x | Qx)/(x | x) < ∞.

Minimizing f is equivalent to solving the equation

(1)  Qx = b.

The vector

r = b − Qx

is called the residual of the approximation x; inspection of f reveals that 2r is the negative gradient of f at the point x. The method of steepest descent applied to f therefore takes the form xₙ₊₁ = xₙ + αrₙ, where α minimizes

f(xₙ + αrₙ) = α²(rₙ | Qrₙ) − 2α(rₙ | rₙ) + (xₙ | Qxₙ) − 2(xₙ | b),

which is minimized by

(2)  αₙ = (rₙ | rₙ)/(rₙ | Qrₙ).

Thus the method of steepest descent is

(3)  xₙ₊₁ = xₙ + [(rₙ | rₙ)/(rₙ | Qrₙ)] rₙ,

where rₙ = b − Qxₙ.
Theorem 1. For any x₁ ∈ X the sequence {xₙ} defined by (3) converges (in norm) to the unique solution x₀ of Qx = b. Furthermore, defining

F(x) = (x − x₀ | Q(x − x₀)),

we have the rate-of-convergence estimate

‖xₙ − x₀‖² ≤ (1/m) F(xₙ) ≤ (1/m)(1 − m/M)ⁿ⁻¹ F(x₁).
Proof. Since

F(x) = (x | Qx) − 2(x | Qx₀) + (x₀ | Qx₀) = f(x) + (x₀ | Qx₀),

minimizing f is equivalent to minimizing F. Note also that rₙ = b − Qxₙ = Q(x₀ − xₙ), so that F(xₙ) = (Q⁻¹rₙ | rₙ). A direct computation using (2) and (3) gives

[F(xₙ) − F(xₙ₊₁)]/F(xₙ) = (rₙ | rₙ)² / [(rₙ | Qrₙ)(Q⁻¹rₙ | rₙ)].

Since (rₙ | Qrₙ) ≤ M(rₙ | rₙ) and (Q⁻¹rₙ | rₙ) ≤ (1/m)(rₙ | rₙ), it follows that

F(xₙ₊₁) ≤ (1 − m/M) F(xₙ).

And finally, since m‖xₙ − x₀‖² ≤ F(xₙ), the stated estimate follows. ∎
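Formulas (2) and (3) translate directly into code; this Python sketch (assumed data, not from the text) applies steepest descent with the exact line search (2):

    import numpy as np

    # Steepest descent x_{n+1} = x_n + a_n r_n, a_n = (r,r)/(r,Qr), for Qx = b
    # with an assumed symmetric positive definite Q.
    Q = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x = np.zeros(2)
    for n in range(60):
        r = b - Q @ x                    # residual = half the negative gradient
        if np.linalg.norm(r) < 1e-12:
            break
        a = (r @ r) / (r @ (Q @ r))      # exact line search along r
        x = x + a * r
    print("solution:", x, " check:", np.linalg.solve(Q, b))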
Example 1. If, instead of the optimal αₙ given by (2), we simply set αₙ = 1 (assuming the system has been scaled appropriately), the process becomes

xₙ₊₁ = xₙ + b − Axₙ,

or xₙ₊₁ = (I − A)xₙ + b. This last equation is equivalent to the method of successive approximation given in Example 1, Section 10.2.
The method of steepest descent can also be analyzed for nonquadratic functionals.

Theorem 2. Let f be a functional on a Hilbert space X, twice continuously differentiable on a closed convex set S, and suppose that for some constants 0 < m ≤ M,

m‖h‖² ≤ (h | f″(x)h) ≤ M‖h‖²

for all x ∈ S and all h. Then the steepest descent sequence xₙ₊₁ = xₙ − αₙf′(xₙ), with αₙ chosen to minimize f along the direction −f′(xₙ), converges to the unique point x₀ ∈ S minimizing f.

Proof (in outline). Expanding f along the segment x = tx₁ + (1 − t)x₂, 0 ≤ t ≤ 1, gives

f(x₂) − f(x₁) ≥ (f′(x₁) | x₂ − x₁) + (m/2)‖x₂ − x₁‖²,

so f is bounded below on S, and since the descent process produces decreasing values, f(xₙ) tends to a limit f₀. Setting x_α = xₙ − αf′(xₙ), the upper bound on f″ gives

f(x_α) − f(xₙ) ≤ (−α + α²M/2)‖f′(xₙ)‖².

If ‖f′(xₙ)‖² ≥ ε > 0 for infinitely many n, then for α = 1/M we have

f(x_α) − f(xₙ) ≤ −ε/(2M)

for those n, and since αₙ does at least as well as α = 1/M, this implies f(xₙ₊₁) < f₀ for some n. Since this is impossible, it follows that ‖f′(xₙ)‖ → 0.

For any x, y ∈ S we have, by the one-dimensional mean value theorem applied along x = tx + (1 − t)y, 0 ≤ t ≤ 1,

(f′(x) − f′(y) | x − y) ≥ m‖x − y‖².

Thus

‖xₙ₊ₖ − xₙ‖² ≤ (2/m)(f′(xₙ₊ₖ) − f′(xₙ) | xₙ₊ₖ − xₙ),

or

‖xₙ₊ₖ − xₙ‖ ≤ (2/m)‖f′(xₙ₊ₖ) − f′(xₙ)‖,

so {xₙ} is a Cauchy sequence and there is a point x₀ ∈ S with xₙ → x₀. Obviously f′(x₀) = θ, and for every h such that x₀ + h ∈ S,

f(x₀ + h) ≥ f(x₀) + (m/2)‖h‖²,

so x₀ is the unique minimizing point. ∎
Example 2. Consider the optimal control problem:

minimize J = ∫_{t₀}^{t₁} l(x(t), u(t)) dt

subject to

(4)  ẋ(t) = f(x(t), u(t)),

x(t₀) given, where x(t) is n dimensional and u(t) is r dimensional. Under appropriate conditions the gradient of J with respect to the control u is

(5)  l_u(x(t), u(t)) + λ′(t)f_u(x(t), u(t)),

where the x(t) resulting from (4) when using the given control u is substituted in (5). The function λ(t) is the solution of the adjoint equation

(6)  −λ̇(t) = f_x′λ(t) + l_x′,  λ(t₁) = 0,

and steepest descent consists of successively correcting u along the negative of this gradient.
10.6 Fourier Series

Consider once more the problem of minimizing the quadratic functional

f(x) = (x | Qx) − 2(x | b),

where Q is a self-adjoint operator satisfying

(1)  (x | Qx) ≥ m(x | x)

for all x and some m > 0, and let x₀ be the unique solution of Qx = b. We can view this problem as a minimum norm problem by introducing the new inner product

[x | y] = (x | Qy)

and the associated norm, since minimizing f is equivalent to minimizing

‖x − x₀‖_Q² ≡ (x − x₀ | Q(x − x₀)) = [x − x₀ | x − x₀].

Let {pₙ} be a sequence of nonzero vectors that are Q-orthogonal, i.e., [pᵢ | pⱼ] = (pᵢ | Qpⱼ) = 0 for i ≠ j. The corresponding conjugate direction method for minimizing f is

(3)  xₙ₊₁ = xₙ + αₙpₙ,  αₙ = (pₙ | rₙ)/(pₙ | Qpₙ),

where

(4)  rₙ = b − Qxₙ.

Theorem 1. Let {pₙ} be a Q-orthogonal sequence of nonzero vectors generating X (in the sense that the closed subspace generated by the pₙ's is all of X). Then for any x₁ ∈ X the sequence {xₙ} defined by (3) converges to the solution x₀ of Qx = b; furthermore, (rₙ | pₖ) = 0 for k < n.

Proof. Define yₙ = xₙ − x₁ and y₀ = x₀ − x₁. From (3) and (4),

(5)  yₙ₊₁ = yₙ + [(pₙ | b − Qx₁ − Qyₙ)/(pₙ | Qpₙ)] pₙ,

and, since b = Qx₀ so that b − Qx₁ − Qyₙ = Q(y₀ − yₙ), this becomes

(6)  yₙ₊₁ = yₙ + ([pₙ | y₀ − yₙ]/[pₙ | pₙ]) pₙ.
Since yₙ ∈ [p₁, p₂, …, pₙ₋₁] and since the pⱼ's are Q-orthogonal, it follows that [pₙ | yₙ] = 0, and hence equation (6) becomes

yₙ₊₁ = yₙ + ([pₙ | y₀]/[pₙ | pₙ]) pₙ.

Thus

yₙ₊₁ = Σ_{k=1}^{n} ([pₖ | y₀]/[pₖ | pₖ]) pₖ,

which is the n-th partial sum of a Fourier expansion of y₀. Since, with our assumptions on Q, convergence with respect to ‖·‖ is equivalent to convergence with respect to ‖·‖_Q, it follows that yₙ → y₀ and hence that xₙ → x₀.

The orthogonality relation (rₙ | pₖ) = 0 follows from the fact that the error yₙ − y₀ = xₙ − x₀ is Q-orthogonal to the subspace [p₁, p₂, …, pₙ₋₁]. ∎
Example 1. Consider the problem of finding the best approximation to a vector x of a Hilbert space in the subspace generated by n independent vectors y₁, y₂, …, yₙ. The optimal coefficient vector a satisfies the normal equations

Ga = b,

where G is the Gram matrix of {y₁, y₂, …, yₙ} and b is the vector with components bᵢ = (x | yᵢ). The n-dimensional linear equation is equivalent to the unconstrained minimization of

a′Ga − 2a′b

with respect to a ∈ Eⁿ. This problem can be solved by using the method of conjugate directions. A set of linearly independent vectors {p₁, p₂, …, pₙ} satisfying pᵢ′Gpⱼ = 0, i ≠ j, can be constructed by applying the Gram-Schmidt procedure to any independent set of n vectors in Eⁿ. For instance, we may take the vectors eᵢ = (0, …, 0, 1, 0, …, 0) (with the 1 in the i-th component) and orthogonalize these with respect to G. The resulting iterative calculation for a has as its k-th approximation aₖ the vector such that

Σᵢ₌₁ⁿ aᵢyᵢ

is the best approximation to x in the subspace [y₁, y₂, …, yₖ]. In other words, this conjugate direction method is equivalent to solving the original approximation problem successively in the subspaces [y₁], [y₁, y₂], …, [y₁, y₂, …, yₙ].
10.7 Orthogonalization of Moments
The Q-orthogonal direction vectors for a conjugate direction method can in principle be constructed from an arbitrary independent sequence {eₙ} by the Gram-Schmidt procedure

p₁ = e₁,  pₙ = eₙ − Σ_{k=1}^{n−1} ([eₙ | pₖ]/[pₖ | pₖ]) pₖ  (n > 1),

where again [x | y] ≡ (x | Qy). This procedure, in its most general form, is in practice rarely worth the effort involved. However, the scheme retains some of its attractiveness if, as practical considerations dictate, the sequence {eₙ} is not completely arbitrary but is itself generated by a simple recurrence scheme. Suppose, in particular, that starting with an initial vector e₁ and a bounded linear self-adjoint operator B, the sequence {eᵢ} is generated by the relation eₙ₊₁ = Beₙ. Such a sequence is said to be a sequence of moments of B. There appear to be no simple conditions guaranteeing that the moments generate a dense subspace of H, so we ignore this question here. The point of main interest is that a sequence of moments can be orthogonalized by a procedure that is far simpler than the general Gram-Schmidt procedure.
Theorem 1. Let {eᵢ} be a sequence of moments of a self-adjoint operator B. Then the recurrence

p₁ = e₁,

p₂ = Bp₁ − ([p₁ | Bp₁]/[p₁ | p₁]) p₁,

pₙ₊₁ = Bpₙ − ([pₙ | Bpₙ]/[pₙ | pₙ]) pₙ − ([pₙ₋₁ | Bpₙ]/[pₙ₋₁ | pₙ₋₁]) pₙ₋₁  (n ≥ 2)

defines a Q-orthogonal sequence in H such that for each n, [p₁, p₂, …, pₙ] = [e₁, e₂, …, eₙ].
Proof. Simple direct verification shows the theorem is true for p₁, p₂. We prove it for n > 2 by induction. Assume that the result is true for {pᵢ}ᵢ₌₁ⁿ; we prove that it is true for {pᵢ}ᵢ₌₁ⁿ⁺¹. For i ≤ n,

[pᵢ | pₙ₊₁] = [pᵢ | Bpₙ] − ([pₙ | Bpₙ]/[pₙ | pₙ])[pᵢ | pₙ] − ([pₙ₋₁ | Bpₙ]/[pₙ₋₁ | pₙ₋₁])[pᵢ | pₙ₋₁].

For i ≤ n − 2 the second two terms in the above expression are zero by the induction hypothesis, and the first term can be written as [Bpᵢ | pₙ], which is zero since Bpᵢ lies in the subspace [p₁, p₂, …, pᵢ₊₁]. For i = n − 1 the first and the third term cancel while the second term vanishes. For i = n the first and the second term cancel while the third term vanishes. ∎
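The three-term recurrence is easily checked numerically. This Python sketch (my illustration, taking Q = I so that [x | y] is the ordinary inner product, with a random symmetric matrix for B) generates p₁, …, p₆ and verifies their mutual orthogonality:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 6))
    B = (A + A.T) / 2                     # a self-adjoint operator (Q = I assumed)
    e1 = rng.standard_normal(6)

    p = [e1]                              # p1 = e1
    Bp = B @ p[0]
    p.append(Bp - (p[0] @ Bp) / (p[0] @ p[0]) * p[0])          # p2
    for n in range(1, 5):                 # three-term recurrence for p_{n+1}
        Bp = B @ p[n]
        p.append(Bp - (p[n] @ Bp) / (p[n] @ p[n]) * p[n]
                    - (p[n - 1] @ Bp) / (p[n - 1] @ p[n - 1]) * p[n - 1])

    P = np.array(p)
    G = P @ P.T                           # Gram matrix; off-diagonal should vanish
    print("max off-diagonal:", np.max(np.abs(G - np.diag(np.diag(G)))))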
10.8 The Conjugate Gradient Method
A particularly attractive method of selecting direction vectors when minimizing the functional

f(x) = (x | Qx) − 2(b | x)

is the conjugate gradient method, in which the successive direction vectors are obtained by Q-orthogonalizing the residuals. Starting with p₁ = r₁ = b − Qx₁, the method is

(2)  xₙ₊₁ = xₙ + [(rₙ | pₙ)/(pₙ | Qpₙ)] pₙ,

(3)  pₙ₊₁ = rₙ₊₁ − [(rₙ₊₁ | Qpₙ)/(pₙ | Qpₙ)] pₙ,

where

(4)  rₙ = b − Qxₙ.
Theorem 1. The conjugate gradient method (2)-(4) is a conjugate direction method, and the sequence {xₙ} it generates converges to the solution x₀ of Qx = b.

Proof. Write the direction update as

(5)  pₙ₊₁ = rₙ₊₁ − βₙpₙ,

with

(6)  αₙ = (rₙ | pₙ)/(pₙ | Qpₙ),

(7)  βₙ = (rₙ₊₁ | Qpₙ)/(pₙ | Qpₙ).

To verify that the directions are Q-orthogonal, consider

(8)  (pₙ₊₁ | Qpₖ) = (rₙ₊₁ | Qpₖ) − βₙ(pₙ | Qpₖ).

For k = n the two terms on the right of (8) cancel. For k < n the second term on the right is zero and the first term can be written as (Qpₖ | rₙ₊₁). But Qpₖ ∈ [p₁, p₂, …, pₖ₊₁] ⊂ [p₁, p₂, …, pₙ], and for any conjugate direction method (rₙ₊₁ | pᵢ) = 0, i ≤ n. Hence the method is a conjugate direction method.
Next we prove that the sequence {xₙ} converges to x₀. Define the functional E by

E(x) = (b − Qx | Q⁻¹(b − Qx)).

Then E(xₙ) − E(xₙ₊₁) = αₙ(rₙ | pₙ). But by (5) and (rₙ | pₙ₋₁) = 0 we have (rₙ | pₙ) = (rₙ | rₙ), and hence, since (pₙ | Qpₙ) ≤ (rₙ | Qrₙ) and E(xₙ) = (rₙ | Q⁻¹rₙ),

(9)  E(xₙ) − E(xₙ₊₁) = αₙ(rₙ | rₙ) ≥ [(rₙ | rₙ)/(rₙ | Qrₙ)] [(rₙ | rₙ)/(rₙ | Q⁻¹rₙ)] E(xₙ).

Using

(rₙ | Qrₙ) ≤ M(rₙ | rₙ)  and  (rₙ | rₙ)/(rₙ | Q⁻¹rₙ) ≥ m,

it follows that

E(xₙ₊₁) ≤ (1 − m/M) E(xₙ),

so E(xₙ) → 0 and xₙ → x₀. ∎
The slight increase in the amount of computation required for the conjugate gradient method over that of steepest descent can lead to a significant improvement in the rate of convergence. It can be shown that for

m(x | x) ≤ (x | Qx) ≤ M(x | x)

the convergence rate is

‖xₙ₊₁ − x₀‖² ≤ (1/m) E(x₁) [(1 − √c)/(1 + √c)]²ⁿ,

where c = m/M, whereas for steepest descent the best estimate (see Problem 11) is

‖xₙ₊₁ − x₀‖² ≤ (1/m) E(x₁) [(1 − c)/(1 + c)]²ⁿ.

In an n-dimensional quadratic problem the error tends to zero geometrically with steepest descent, in one step with Newton's method, and within n steps with any conjugate direction method.
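A direct transcription of (2)-(4) (an illustrative Python sketch with assumed data) shows the finite-step termination in Eⁿ:

    import numpy as np

    def conjugate_gradient(Q, b, x):
        """Conjugate gradient for Qx = b per (2)-(4): p1 = r1 = b - Qx1."""
        r = b - Q @ x
        p = r.copy()
        for n in range(len(b)):                 # at most n steps in E^n
            Qp = Q @ p
            a = (r @ p) / (p @ Qp)              # equation (2)
            x = x + a * p
            r = b - Q @ x                       # equation (4)
            if np.linalg.norm(r) < 1e-12:
                break
            p = r - (r @ Qp) / (p @ Qp) * p     # equation (3)
        return x, n + 1

    Q = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
    b = np.array([1.0, 2.0, 3.0])
    x, steps = conjugate_gradient(Q, b, np.zeros(3))
    print("steps:", steps, " x:", x, " residual:", np.linalg.norm(b - Q @ x))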
The method of conjugate directions has several extensions applicable to the minimization of a nonquadratic functional f. One such method, the method of parallel tangents (PARTAN), is based on an easily established geometric relation which exists among the direction vectors for the quadratic version of the conjugate gradient method (see Figure 10.7).
Figure 10.7

Point A, the intersection of the line between xₙ and xₙ₊₂ with the line (xₙ₊₁, B) determined by the negative gradient of f at xₙ₊₁, is actually the minimum of f along the line (xₙ₊₁, B). To carry out the PARTAN procedure for minimizing an arbitrary functional f, the point xₙ₊₂ is found from xₙ and xₙ₊₁ by first minimizing f along the negative gradient direction from xₙ₊₁
to find the point A and then minimizing f along the (dotted) line
determined by Xn and A to find x n + 2 . For a quadratic functional this
method coincides with the conjugate gradient method. For nonquadratic
functionals the process determines a decreasing sequence of functional
values f(xn); practical experience indicates that it converges rapidly,
although no sharp theoretical results are available.
10.9 Projection Methods

Devising computational procedures for constrained optimization problems, as with solving any difficult problem, generally requires a lot of
ingenuity and thorough familiarity with the basic principles and existing
techniques of the area. No general, all-purpose optimization algorithm
has been devised, but a number of procedures are effective for certain
classes of problems. Essentially all of these methods have strong connections with the general principles discussed in the earlier chapters.
Consider first the problem of minimizing f subject to the linear constraint Ax = b. Given a point x₁ satisfying the constraint, let g₁ denote the projection of the negative gradient −f′(x₁) onto the nullspace 𝒩(A), and set x₂ = x₁ + αg₁, where α is chosen in the usual way to minimize f(x₂). Since g₁ ∈ 𝒩(A), the point x₂ also satisfies the constraint and the process can be continued. To project the negative gradient at the n-th step onto 𝒩(A), the component of f′(xₙ) in the range of A* must be added to the negative gradient. Thus the required projected negative gradient has the form

gₙ = −f′(xₙ) + A*λₙ,

where λₙ is determined by the condition Agₙ = 0.
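For linear constraints the multiplier λₙ is obtained from AA*λₙ = Af′(xₙ), which makes Agₙ = 0. A Python sketch (assumed objective and constraint, with a fixed step size in place of a line search):

    import numpy as np

    # Minimize f(x) = ||x - c||^2 subject to Ax = b by projected gradient
    # (assumed example data).
    A = np.array([[1.0, 1.0, 1.0]])          # constraint: x1 + x2 + x3 = 1
    b = np.array([1.0])
    c = np.array([1.0, 2.0, 3.0])
    grad = lambda x: 2.0 * (x - c)

    x = np.array([1.0, 0.0, 0.0])            # a feasible starting point
    for _ in range(100):
        lam = np.linalg.solve(A @ A.T, A @ grad(x))   # A g = 0 determines lambda
        g = -grad(x) + A.T @ lam                      # projected negative gradient
        if np.linalg.norm(g) < 1e-12:
            break
        x = x + 0.25 * g                              # fixed step size
    print("x =", x, " Ax =", A @ x)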
The same idea can be applied to nonlinear constraints of the form H(x) = θ by projecting the negative gradient onto the nullspace of H′(xₙ) at each step and then correcting the resulting point back to the constraint surface.
10.10 The Primal-Dual Method

Consider the convex programming problem

(1)  minimize f(x)
     subject to G(x) ≤ θ,  x ∈ Ω.

Following the global duality theory of Chapter 8, define the dual functional

φ(z*) = inf_{x∈Ω} {f(x) + ⟨G(x), z*⟩},

and the associated dual problem

(4)  maximize φ(z*)
     subject to z* ≥ θ.

The dual problem (4) has only the constraint z* ≥ θ; hence, assuming that the gradient of φ is available, the dual problem can be solved in a rather routine fashion. (Note that if the primal problem (1) has only equality constraints of the form Ax = b, the dual problem (4) will have no constraints.) Once the dual problem is solved, yielding an optimal z₀*, the primal problem can be solved by minimizing the corresponding Lagrangian.
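A Python sketch of the scheme (an assumed convex problem whose inner minimization is available in closed form; the constraint value at the inner minimizer supplies the gradient of φ, as noted below around equation (7)):

    import numpy as np

    # Primal: minimize ||x||^2 subject to 1 - x1 - x2 <= 0 (assumed example).
    # Lagrangian: ||x||^2 + z*(1 - x1 - x2); inner minimizer x(z) = (z/2, z/2).
    def x_of_z(z):
        return np.array([z / 2.0, z / 2.0])

    def G(x):
        return 1.0 - x[0] - x[1]              # constraint value = gradient of phi

    z = 0.0
    for _ in range(200):
        z = max(0.0, z + 0.5 * G(x_of_z(z)))  # projected gradient ascent on phi
    print("z* =", z, " x =", x_of_z(z))        # expect z* = 1, x = (0.5, 0.5)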
Example. If f is quadratic, say f(x) = (x | Qx) − 2(x | b) with Q positive definite, and the constraints are linear, the Lagrangian is minimized for a given multiplier λ₀ by

(6)  x = Q⁻¹(b − A′λ₀),

and the dual functional is an explicit quadratic function of the multiplier. Writing the resulting optimality condition for λ as a linear equation with coefficient matrix P = [pᵢⱼ] and right-hand data d, the components of λ can be adjusted cyclically, one at a time, by an iteration of the form

λᵢᵏ⁺¹ = −(1/pᵢᵢ)(dᵢ + Σ_{j<i} pᵢⱼλⱼᵏ⁺¹ + Σ_{j>i} pᵢⱼλⱼᵏ),

and convergence of the iteration is easily proved.
More generally, if x₁ achieves the infimum defining φ(z₁*), then for any z* we have φ(z*) ≤ f(x₁) + ⟨G(x₁), z*⟩. Therefore,

(7)  φ(z*) ≤ φ(z₁*) + ⟨G(x₁), z* − z₁*⟩,

and hence G(x₁) defines a hyperplane that bounds φ from above. If φ is differentiable, G(x₁) is the (unique) gradient of φ at z₁*. (See Problem 8, Chapter 8.)

In view of the above observation, it is not difficult to solve the dual problem by using some gradient-based technique modified slightly to account for the constraint z* ≥ θ. At each step of the process an unconstrained minimization is performed with respect to x in order to evaluate φ and its gradient. The minimization with respect to x must, of course, usually be performed by some iterative technique.
Example. Consider an optimal control problem with terminal constraints:

minimize ψ(x(t₁))
subject to ẋ(t) = f(x, u, t), x(t₀) fixed, G(x(t₁)) ≤ θ.

Dualizing with respect to the terminal constraint leads, for each multiplier vector λ ≥ θ, to the problem of minimizing

ψ(x(t₁)) + λ′G(x(t₁))

subject to the differential equation alone, but having no terminal constraints. This latter type of control problem can be solved by a standard gradient method (see Example 2, Section 10.5).
10.11 Penalty Functions

Consider the problem

(1)  minimize f(x)
     subject to hᵢ(x) = 0,  i = 1, 2, …, p.

The penalty function method replaces this constrained problem by the unconstrained problem

(2)  minimize f(x) + K Σᵢ hᵢ²(x),

where K is a large positive constant.
An alternative viewpoint is obtained by considering the equivalent problem

(3)  minimize f(x)
     subject to Σᵢ hᵢ²(x) ≤ 0;

by this transformation we reduce the p constraints to a single constraint. This constraint, it should be noted, is not regular; i.e., there is no x such that Σᵢ hᵢ²(x) < 0. The primal function for problem (3) therefore has, in general, infinite slope at the origin, and no finite value of K yields an exact support hyperplane; it is for this reason that K must be increased without bound.
Alternatively, the penalty term K Σᵢ |hᵢ(x)| can be used. If the hᵢ's have nonvanishing first derivatives at the solution, the primal function will, in this case, as in Figure 10.10, have finite slope at z = 0, and hence some finite value of K will yield a support hyperplane. This latter feature is attractive from a computational point of view but is usually offset by the difficulties imposed by the nonexistence of a gradient of the penalty term.

Figure 10.10  The primal function using Σ|hᵢ|
Let G be a mapping from X into Rᵖ, i.e., G(x) = (g₁(x), g₂(x), …, gₚ(x)). We define

G⁺(x) = (g₁⁺(x), g₂⁺(x), …, gₚ⁺(x)),  gᵢ⁺(x) = max{0, gᵢ(x)}.

It is then clear that the p inequalities gᵢ(x) ≤ 0 are equivalent to the single inequality

G⁺(x)′G⁺(x) = Σᵢ [gᵢ⁺(x)]² ≤ 0.

Again this inequality does not satisfy the regularity condition. Since this form includes equality constraints, we consider only inequalities in the remainder of the section. Hence we analyze the method in detail for the problem

(4)  minimize f(x)
     subject to G(x) ≤ θ.
Define fₙ(x) = f(x) + KₙG⁺(x)′G⁺(x), where {Kₙ} is a sequence of positive constants tending toward infinity; let xₙ minimize fₙ, and assume that problem (4) has a solution.
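A Python sketch of the method (assumed data; scipy's unconstrained minimizer stands in for the inner minimization of fₙ):

    import numpy as np
    from scipy.optimize import minimize

    # Penalty method for: minimize f(x) = (x1-2)^2 + (x2-2)^2
    # subject to g(x) = x1 + x2 - 1 <= 0   (assumed example; solution (0.5, 0.5)).
    f = lambda x: (x[0] - 2)**2 + (x[1] - 2)**2
    gplus = lambda x: max(0.0, x[0] + x[1] - 1.0)

    x = np.zeros(2)
    for K in [1.0, 10.0, 100.0, 1000.0]:          # K_n increasing without bound
        fn = lambda x, K=K: f(x) + K * gplus(x)**2
        x = minimize(fn, x).x                     # minimize f_n, warm-started
        print(f"K = {K:7.1f}  x = {x}  g+ = {gplus(x):.2e}")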
Lemma 1. Define

μ₀ = min {f(x) : G(x) ≤ θ}.

Then

1. fₙ₊₁(xₙ₊₁) ≥ fₙ(xₙ).
2. μ₀ ≥ fₙ(xₙ).
3. lim_{n→∞} Kₙ G⁺(xₙ)′G⁺(xₙ) = 0.

Proof. Part 1 follows from

fₙ₊₁(xₙ₊₁) = f(xₙ₊₁) + Kₙ₊₁G⁺(xₙ₊₁)′G⁺(xₙ₊₁)
           ≥ f(xₙ₊₁) + KₙG⁺(xₙ₊₁)′G⁺(xₙ₊₁)
           ≥ fₙ(xₙ).

For part 2, let x̄ solve (4); then G⁺(x̄) = θ, and

fₙ(xₙ) ≤ fₙ(x̄) = f(x̄) = μ₀.

For part 3, write g(x) = G⁺(x)′G⁺(x). By parts 1 and 2 the sequence {fₙ(xₙ)} is nondecreasing and bounded above by μ₀; let γ be its limit. Given ε > 0, choose N such that γ < f_N(x_N) + ε. Since Kₙ → ∞, for all sufficiently large n we have Kₙ ≥ 2K_N, and for such n

f_N(x_N) ≤ f_N(xₙ) = f(xₙ) + K_N g(xₙ) = fₙ(xₙ) − (Kₙ − K_N)g(xₙ) ≤ γ − ½Kₙg(xₙ).

Hence ½Kₙg(xₙ) ≤ γ − f_N(x_N) < ε, and therefore Kₙg(xₙ) → 0. ∎
Lemma 2. Let f and the components of G be convex and continuous. If x₀ minimizes

(5)  f(x) + G⁺(x)′G⁺(x),

then it also minimizes

(6)  f(x) + λ₀′G(x),

where λ₀ = 2G⁺(x₀).
Proof. Suppose to the contrary that there is a point x₁ and an ε > 0 such that

f(x₁) + λ₀′G(x₁) < f(x₀) + λ₀′G(x₀) − ε.

For 0 < α ≤ 1, let x_α = x₀ + α(x₁ − x₀), and write

f(x_α) + G⁺(x_α)′G⁺(x_α) = [f(x_α) + λ₀′G(x_α)] + Σᵢ {[gᵢ⁺(x_α)]² − 2gᵢ⁺(x₀)gᵢ(x_α)}.

By convexity, the first bracketed expression satisfies

f(x_α) + λ₀′G(x_α) ≤ f(x₀) + λ₀′G(x₀) + α[f(x₁) + λ₀′G(x₁) − f(x₀) − λ₀′G(x₀)],

and because of the definition of x₁ the second term on the right is less than −αε. In the remaining sum, each term with gᵢ(x₀) ≥ 0 differs from its value at x₀ by a quantity that, since a continuous convex functional satisfies a Lipschitz condition at every point on the interior of its domain (see Problem 19), is O(α²); each term with gᵢ(x₀) < 0 is identically zero for α sufficiently small. Thus

f(x_α) + G⁺(x_α)′G⁺(x_α) ≤ f(x₀) + G⁺(x₀)′G⁺(x₀) − εα + O(α²),

and hence the left side is less than the right for sufficiently small α, showing that x₀ does not minimize (5). ∎
Suppose we apply the penalty function method to a problem where f and G are convex and continuous. At the n-th step let xₙ minimize

f(x) + KₙG⁺(x)′G⁺(x),

and define

λₙ = 2KₙG⁺(xₙ).

Then, applying Lemma 2 with KₙG⁺(x)′G⁺(x) as the penalty term, xₙ also minimizes f(x) + λₙ′G(x), so that φ(λₙ) = f(xₙ) + λₙ′G(xₙ), where φ is the dual functional. In other words, xₙ, the result of the n-th penalty function minimization, determines a dual vector λₙ and the corresponding value of the dual functional. This leads to the interpretation that the penalty function method seeks to solve the dual problem.
10.12 Problems
1. Let S be a closed subset of a Banach space. A mapping T from S onto a region Γ containing S is an expansion mapping if there is a constant K > 1 such that ‖T(x) − T(y)‖ ≥ K‖x − y‖ for x ≠ y. Show that an expansion mapping has a unique fixed point.
2. Let S be a compact subset of a Banach space X and let T be a mapping of S into S satisfying ‖T(x) − T(y)‖ < ‖x − y‖ for x ≠ y. Show that T has a unique fixed point in S which can be found by the method of successive approximation.
3. Let X = L₂[a, b]. Suppose the real-valued kernel K is such that for each x ∈ X the integral ∫ₐᵇ K(t, s, x(s)) ds defines an element of X, and that K satisfies a Lipschitz condition of contraction type in its third argument. Show that the nonlinear integral equation

x(t) = y(t) + ∫ₐᵇ K(t, s, x(s)) ds

has a unique solution x ∈ X for each y ∈ X, obtainable by successive approximation.

5. Apply the technique of Theorem 2, Section 10.2, to the Euler-Lagrange equation arising from the minimization of a functional of the form

∫_{t₀}^{t₁} [x′(t)x(t) + u²(t)] dt.
… sup_{x∈D} ‖I − P′(x)‖ < 1 …

10. Let Q be a self-adjoint operator on a Hilbert space satisfying

inf (x | Qx)/(x | x) = m > 0.
13. Under the hypothesis of Theorem 2, Section 10.5, show that the simplified steepest-descent process defined by xₙ₊₁ = xₙ − (1/M)f′(xₙ) converges to the point minimizing f.
14. Suppose a sequence {xₙ} in a normed space converges to a point x₀. The convergence is said to be weakly linear if there exists a positive integer N such that

lim sup_{n→∞} ‖xₙ₊N − x₀‖ / ‖xₙ − x₀‖ < 1.
… over the interior of the region Ω. For small r the solutions to the two problems will be nearly equal. Develop a geometric interpretation of this method.
REFERENCES

10.2. The contraction mapping theorem is the simplest of a variety of fixed-point theorems. For an introduction to other results, see Courant and Robbins [32], Graves [64], and Bonsall [24].

10.3. Much of the theory of Newton's method in Banach spaces was developed by Kantorovich [78]. See also Kantorovich and Akilov [79], [80], Bartle [16], Antosiewicz and Rheinboldt [9], and Collatz [30]. Some interesting extensions and modifications of the method have been developed by Altman [6], [7], [8]. An interesting account of various applications of Newton's method is found in Bellman and Kalaba [18].

10.4-5. For material on steepest descent, see Ostrowski [115], Rosenbloom [130], Nashed [108], Curry [33], and Kelley [84].

10.6-8. For the original development of conjugate direction methods, see Hestenes and Stiefel [71] and Hestenes [70]. The underlying principle of orthogonalizing moments, useful for a number of computational problems, is discussed in detail by Vorobyev [150] and Faddeev and Faddeeva [50]. There are various extensions of the method of conjugate gradients to problems with nonquadratic objective functionals. The most popular of these are the Fletcher and Reeves [54] method and the PARTAN method of Shah, Buehler, and Kempthorne [138]. For another generalization and the derivation of the convergence rate, see Daniel [34], [35]. For application to optimal control theory, see Lasdon, Mitter, and Waren [94] and Sinnott and Luenberger [140].

10.9-11. The gradient-projection method is due to Rosen [128], [129]. A closely related method is that of Zoutendijk [158]. For an application of the primal-dual method, see Wilson [154]. The penalty function method goes back to a suggestion of Courant [31]. See also Butler and Martin [27], Kelley [84], Fabian [49], and Fiacco and McCormick [53].
BIBLIOGRAPHY

[1] Abadie, J., ed., Nonlinear Programming, North-Holland Pub. Co., Amsterdam, 1967.
[2] Akhiezer, N. I., The Calculus of Variations, Ginn (Blaisdell), Boston, 1962.
[3] Akhiezer, N. I. and I. M. Glazman, Theory of Linear Operators in Hilbert Space, Vols. I, II, Frederick Ungar, New York, 1961.
[4] Akhiezer, N. I. and M. Krein, "Some Questions in the Theory of Moments," Article 4, Amer. Math. Soc. Publ., 1962.
[5] Albert, A., "An Introduction and Beginner's Guide to Matrix Pseudo-Inverses," ARCON, July 1964.
[6] Altman, M., "A Generalization of Newton's Method," Bull. Acad. Polon. Sci. Cl. III, 3, 189, 1955.
[7] Altman, M., "On the Generalization of Newton's Method," Bull. Acad. Polon. Sci. Cl. III, 5, 789, 1957.
[8] Altman, M., "Connection between the Method of Steepest Descent and Newton's Method," Bull. Acad. Polon. Sci. Cl. III, 5, 1031-1036, 1957.
[9] Antosiewicz, H. A. and W. C. Rheinboldt, "Numerical Analysis and Functional Analysis," Chapter 14 in Survey of Numerical Analysis, ed. by J. Todd, McGraw-Hill, New York, 1962.
[10] Apostol, T. M., Mathematical Analysis, A Modern Approach to Advanced Calculus, Addison-Wesley, Reading, Mass., 1957.
[11] Aronszajn, N., "Theory of Reproducing Kernels," Trans. Amer. Math. Soc., 68, 337-404, 1950.
[12] Arrow, K. J., L. Hurwicz, and H. Uzawa, Studies in Linear and Non-Linear Programming, Chapter 4, "Programming in Linear Spaces," Stanford Univ. Press, Stanford, Calif., 1964, pp. 38-102.
[13] Arrow, K. J., S. Karlin, and H. Scarf, Studies in the Mathematical Theory of Inventory and Production, Stanford Univ. Press, Stanford, Calif., 1958.
[14] Balakrishnan, A. V., "An Operator-Theoretic Formulation of a Class of Control Problems and a Steepest Descent Method of Solution," J. SIAM Control, Ser. A, 1 (2), 1963.
[15] Balakrishnan, A. V., "Optimal Control Problems in Banach Spaces," J. SIAM Control, Ser. A, 3 (1), 152-180, 1965.
[16] Bartle, R. G., "Newton's Method in Banach Spaces," Proc. Amer. Math. Soc., 6, 827-831, 1955.
[17] Bellman, R. E., I. Glicksberg, and O. A. Gross, Some Aspects of the Mathematical Theory of Control Processes, The Rand Corp., Santa Monica, Calif., Jan. 16, 1958.
[18] Bellman, R. E. and R. E. Kalaba, Quasilinearization and Nonlinear Boundary-Value Problems, American Elsevier, New York, 1965.
[19] Bellman, R. E. and W. Karush, "Mathematical Programming and the Maximum Transform," J. Soc. Indust. Appl. Math., 10, 550-567, 1962.
[20] Ben-Israel, A. and A. Charnes, "Contributions to the Theory of Generalized Inverses," J. Soc. Indust. Appl. Math., 11 (3), Sept. 1963.
[21] Berberian, S. K., Introduction to Hilbert Space, Oxford Univ. Press, 1961.
[22] Bliss, G. A., Lectures on the Calculus of Variations, Univ. of Chicago Press, 1945.
[23] Blum, E. K., "The Calculus of Variations, Functional Analysis, and Optimal Control Problems," Topics in Optimization, ed. by G. Leitman, Academic Press, New York, 417-461, 1967.
[24] Bonsall, F. F., Lectures on Some Fixed Point Theorems of Functional Analysis, Tata Inst. of Fundamental Res., Bombay, 1962.
[25] Brondsted, A., "Conjugate Convex Functions in Topological Vector Spaces," Mat. Fys. Medd. Dan. Vid. Selsk., 34 (2), 1-27, 1964.
[26] Butkovskii, A. G., "The Method of Moments in the Theory of Optimal Control of Systems with Distributed Parameters," Automation and Remote Control, 24, 1106-1113, 1963.
[27] Butler, T. and A. V. Martin, "On a Method of Courant for Minimizing Functionals," J. Math. and Physics, 41, 291-299, 1962.
[28] Canon, M., C. Cullum, and E. Polak, "Constrained Minimization Problems in Finite-Dimensional Spaces," J. SIAM Control, 4 (3), 528-547, 1966.
[29] Chipman, J. S., "On Least Squares with Insufficient Observations," J. Amer. Stat. Assoc., 59, 1078-1111, Dec. 1964.
[30] Collatz, L., Functional Analysis and Numerical Mathematics, trans. by H. Oser, Academic Press, New York, 1966.
[31] Courant, R., "Calculus of Variations and Supplementary Notes and Exercises" (mimeographed notes), supplementary notes by M. Kruskal and H. Rubin, revised and amended by J. Moser, New York Univ., 1962.
[32] Courant, R. and H. Robbins, What is Mathematics?, Oxford Univ. Press, London, 1941, pp. 251-255.
[33] Curry, H. B., "The Method of Steepest Descent for Nonlinear Minimization Problems," Quar. Appl. Math., 2, 258-261, 1944.
[34] Daniel, J. W., "The Conjugate Gradient Method for Linear and Nonlinear Operator Equations," J. SIAM Numer. Anal., 4 (1), 10-25, 1967.
[35] Daniel, J. W., "Convergence of the Conjugate Gradient Method with Computationally Convenient Modifications," Numerische Mathematik, 10, 125-131, 1967.
[36] Davis, P. J., Interpolation and Approximation, Blaisdell, New York, 1965.
[37] Day, M. M., Normed Linear Spaces, Springer, Berlin, 1958.
[38] Desoer, C. A. and B. H. Whalen, "A Note on Pseudoinverses," J. Soc. Indust. Appl. Math., 11 (2), 442-447, June 1963.
[39] Deutsch, F. R. and P. H. Maserick, "Applications of the Hahn-Banach Theorem in Approximation Theory," SIAM Rev., 9 (3), 516-530, July 1967.
SYMBOL INDEX

o, O, xvi
Y, 143
A*, 150
A†, 163
B(X, Y), 145
BV[a, b], 23
c, 138
c₀, 109
co, 18
cov(x), 82
C*, 195
C[a, b], 23
E, xv
η, 23
E(x), 79
f_x, xvii
f*, 196
[f, C], 191
G⁺, 304
inf, xv
lim inf, lim sup, xvi
l_p, 29
l_∞, 29
L_p[a, b], 31
L_∞[a, b], 32
𝒩(A), 144
NBV[a, b], 115
R, Rⁿ, 12-13
ℛ(A), 144
sup, xv
[S], 16
S(x, ε), 24
θ, 12
T′(x), 175
T.V.(v), 24
v(S), 17
x ≥ 0, x > 0, 214
{xᵢ}, xvi
xₙ → x, 26
[x], 41
Xⁿ, 14
X/M, 41
X*, 106
∇f, xvii
∼, xv
˚, 24
¯, 25
⊥, 52, 117, 118
∪, xv
∩, xv
⊂, xv
‖·‖, 22
‖·‖_p, 29
(·|·), 46
⟨·,·⟩, 106, 115
×, 14
⊕, 53
⊕, 157
⊖, 157
SUBJECT INDEX
Addition, 12
of sets, 15
Adjoint, 150, 159
Alaoglu's theorem, 128
Alignment, 116
Allocation, 3, 202-203, 211,230-234
Approximation, 5-6, 55-58, 62-68, 71,
122-123, 165, 292
Arc length problem, 181, 185
Autoregressive scheme, 94
Baire's lemma, 148
Banach inverse theorem, 149
Banach space, 34
Bessel's inequality, 59
Bounded linear functional, 104
Bounded operator, 144
Bounded sequence, 13, 35
Bounded variation, 23-24, 113-115
Cartesian product, 14
Cauchy-Schwarz inequality, 30,47, 74
Cauchy sequence, 33
Chain rule, 176
Chebyshev approximation, 122-123
Closed graph theorem, 166
Closed set, 25
Closure point, 25
Codimension, 65
Compactness, 39-40
weak *, 127
Complement, xv, 25
orthogonal, 52, 117-118, 157
Complete orthonormal sequence, 60-62
Complete set, 38
Complete space, 34
Concave functional, 190
Cone, 18-19
conjugate, 157
positive, 214
Conjugate cone, 157
Conjugate directions, 291
Conjugate functional, 196, 199,224
Conjugate gradients, 294-297
Conjugate set, 195-199
Continuity, 28
of inner product, 49
Continuous functional, 28
linear, 104
maximum of, 39-40
Continuously differentiable, 175
Contraction mapping, 272
Control, 4-5, 66, 68-69, 124, 162-163, 205-206, 211, 228-229, 254-265, 290, 301
Controllability, 256
Controllability, 256
Convergence, 26-27
weak, weak*, 126
Convex combination, 43
Convex functional, 190
Convex hull, 18, 142
Convex mapping, 215
Convex programming, 216
Convex set, 17, 25, 131-137
Cosets, 41
Covariance matrix, 80, 82
Dense set, 42
Differential, 9, 227
Frechet, 172
Gateaux, 171
Dimension, 20-22
Direct sum, 53
Discrete random process, 93
Gradient, 175
Image, 143
Implicit function theorem, 187, 266
Infimum, xv
Infinite series, 58
Inner product, 46
continuity of, 49
Interior point, 24
relative, 26
Intersection, xv, 15
Inverse function theorem, 240, 266
Inverse image, 144
Inverse operator, 147
Isomorphic, 44
Kalman's theorem, 96
Kuhn-Tucker theorem, 249, 267
Lagrange multiplier, 188-189, 213, 243
Lagrangian, 213
Least-squares estimate, 82-83
Legendre polynomials, 61
Linear combination, 16
Linear convergence, 277
Linear dependence, 19-20, 53
Linear functional, 104
extension of, 110
Linear programming, 3, 232
Linear transformation, 28
Linear variety, 16-17,64-65
Mass spectrometer problem, 98
Minkowski-Farkas lemma, 167
Minkowski functional, 131
Minkowski inequality, 23, 31, 33
Unbiased estimate, 85
Union, xv
Updating a linear estimate, 91-93,
101-102
Variance, 79
Vector space, 11-12
Vertical axis, 192
Weak continuity, 127
Weak convergence, 126
Weierstrass approximation theorem,
42-43,61
Weierstrass-Erdman corner conditions,
210
Weierstrass maximum theorem, 39-40,
128
Young's inequality, 211