
Part III

Nonlinear optimization

Chapter 9

The basics and applications

The problem of minimizing a function of several variables, possibly subject to constraints on these variables, is what optimization is about. So the main problem is easy to state! And, more importantly, such problems arise in many applications in natural science, engineering, economics and business as well as in mathematics itself.
Nonlinear optimization differs from Fourier analysis and wavelet theory in
that classical multivariate analysis also is an important ingredient. A recom-
mended book on this, used here at the University of Oslo, is [8] (in Norwegian).
It contains a significant amount of fixed point theory, nonlinear equations, and
optimization.
There are many excellent books on nonlinear optimization (or nonlinear
programming, as it is also called). Some of these books that have influenced
these notes are [1, 2, 9, 5, 13, 11]. These are all recommended books for those
who want to go deeper into the subject. These lecture notes are particularly
influenced by the presentations in [1, 2].
Optimization has its mathematical foundation in linear algebra and multi-
variate calculus. In analysis the area of convexity is especially important. For
the brief presentation of convexity given here the author’s own lecture notes [4]
(originally from 2001), and the very nice book [14], have been useful sources.
But, of course, anyone who wants to learn convexity should study the work by
R.T. Rockafellar, see e.g. the classic text [12].
Linear optimization (LP, linear programming) is a special case of nonlinear
optimization, but we do not discuss this in any detail here. The reason for this is
that we, at the University of Oslo, have a separate course in linear optimization
which covers many parts of that subject in some detail.
This first chapter introduces some of the basic concepts in optimization and
discusses some applications. Many of the ideas and results that you will find in
these lecture notes may be extended to more general linear spaces, even infinite-
dimensional. However, to keep life a bit easier and still cover most applications,
we will only be working in Rn .

248
Due to its character this chapter is a “proof-free zone”, but in the remaining
text we usually give full proofs of the main results.
Notation: For z ∈ Rn and δ > 0 define the (closed) ball B̄(z; δ) = {x ∈ Rn : ‖x − z‖ ≤ δ}. It consists of all points with distance at most δ from z. Similarly, define the open ball B(z; δ) = {x ∈ Rn : ‖x − z‖ < δ}. A neighborhood of z is a set N containing B(z; δ) for some δ > 0. Vectors are treated as column vectors and they are identified with the corresponding n-tuple, denoted by x = (x1, x2, . . . , xn). A statement like

P(x)   (x ∈ H)

means that the statement P(x) is true for all x ∈ H.

9.1 The basic concepts


Optimization deals with finding optimal solutions! So we need to define what
this is.
Let f : Rn → R be a real-valued function in n variables. The function value
is written as f (x), for x ∈ Rn , or f (x1 , x2 , . . . , xn ). This is the function we
want to minimize (or maximize) and it is often called the objective function.
Let x∗ ∈ Rn . Then x∗ is a local minimum (or local minimizer) of f if there is an ε > 0 such that

f(x∗) ≤ f(x) for all x ∈ B(x∗; ε).

So, no point “sufficiently near” x∗ has smaller f-value than x∗. A local maximum is defined similarly, but with the inequality reversed. A stronger notion is that x∗ is a global minimum of f which means that

f (x∗ ) ≤ f (x) for all x ∈ Rn .

A global maximum satisfies the opposite inequality.


The definition of local minimum has a “variational character”; it concerns the
behavior of f near x∗ . Due to this it is perhaps natural that Taylor’s formula,
which gives an approximation of f in such a neighborhood, becomes a main
tool for characterizing and finding local minima. We present Taylor’s formula,
in different versions, in Section 9.3.
An extension of the notion of minimum and maximum is for constrained
problems where we want, for instance, to minimize f (x) over all x lying in a
given set C. Then x∗ ∈ C is a local minimum of f over the set C, or subject to
x ∈ C as we shall say, provided no point in C in some neighborhood of x∗ has
smaller f -value than x∗ . A similar extension holds for global minimum over C,
and for maxima.

Example 9.1. To make these things concrete, consider an example from plane
geometry. Consider the point set C = {(z1 , z2 ) : z1 ≥ 0, z2 ≥ 0, z1 + z2 ≤ 1} in
the plane. We want to find a point x = (x1 , x2 ) ∈ C which is closest possible
to the point a = (3, 2). This can be formulated as the minimization problem

minimize (x1 − 3)² + (x2 − 2)²
subject to
x1 + x2 ≤ 1
x1 ≥ 0, x2 ≥ 0.

The function we want to minimize is f(x) = (x1 − 3)² + (x2 − 2)², which is a quadratic function. This is the square of the distance between x and a; and
minimizing the distance or the square of the distance is equivalent (why?). A
minimum here is x∗ = (1, 0). If we instead minimize this function f over R2 ,
the unique global minimum is x∗ = a = (3, 2). It is useful to study this example
and try to solve it geometrically as well as analytically.
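A quick numerical check of this example can also be done in MATLAB. The sketch below uses fmincon from the Optimization Toolbox; the variable names and the starting point are our own choices.

% Minimize (x1-3)^2 + (x2-2)^2 over the triangle x1+x2 <= 1, x >= 0.
f = @(x) (x(1)-3)^2 + (x(2)-2)^2;   % objective function
A = [1 1]; b = 1;                   % linear inequality x1 + x2 <= 1
lb = [0; 0];                        % the constraints x1 >= 0, x2 >= 0
x0 = [0.2; 0.2];                    % an arbitrary feasible starting point
xstar = fmincon(f, x0, A, b, [], [], lb, []);
% xstar should come out close to (1, 0), in agreement with the discussion above.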

In optimization one considers minimization and maximization problems. As

max{f (x) : x ∈ S} = − min{−f (x) : x ∈ S}

it is clear how to convert a maximization problem into a minimization problem


(or vice versa). This transformation may, however, change the properties of the
function you work with. For instance, if f is convex (definitions come later!),
then −f is not convex (unless f is linear), so rewriting between minimization
and maximization may take you out of a class of “good problems”. Note that a
minimum or maximum may not exist. A main tool one uses to establish that
optimal solutions really exist is the extreme value theorem as stated next. You
may want to look these notions up in [8].

Theorem 9.2. Let C be a subset of Rn which is closed and bounded, and let
f : C → R be a continuous function.
Then f attains both its (global) minimum and maximum, so there are points x1, x2 ∈ C with

f (x1 ) ≤ f (x) ≤ f (x2 ) (x ∈ C).

9.2 Some applications


It is useful to see some application areas for optimization. They are many, and
here we mention a few in some detail.

9.2.1 Portfolio optimization
The following optimization problem was introduced by Markowitz in order to
find an optimal portfolio in a financial market; he later received the Nobel prize
in economics1 (in 1990) for his contributions in this area:
minimize  α ∑_{i,j≤n} cij xi xj − ∑_{j=1}^n µj xj
subject to
∑_{j=1}^n xj = 1
xj ≥ 0   (j ≤ n).

The model may be understood as follows. The decision variables are x1, x2, . . . , xn, where xi is the fraction of a total investment that is made in (say) stock i. Thus one has available a set of stocks in different companies (Statoil, IBM, Apple etc.) or bonds. The fractions xi must be nonnegative (so we consider no short sale) and add up to 1. The function f to be minimized is

f(x) = α ∑_{i,j≤n} cij xi xj − ∑_{j=1}^n µj xj .

It can be explained in terms of random variables. Let Rj be the return on stock j; this is a random variable, and let µj = E Rj be the expectation of Rj. So if X denotes the random variable X = ∑_{j=1}^n xj Rj, which is the return on our portfolio (= mix among investments), then EX = ∑_{j=1}^n µj xj, which is the second term in f. The minus sign in front explains that we really want to maximize the expected return. The first term in f is there because just looking at expected return is too simple. We want to spread our investments to reduce the risk. The first term in f is the variance of X multiplied by a weight factor α; the constant cij is the covariance of Ri and Rj and cii is the variance of Ri. The covariance of Ri and Rj is defined as E(Ri − µi)(Rj − µj).
So f is a weighted difference of variance and expected return. This is what
we want to minimize. The optimization problem is to minimize a quadratic
function subject to linear constraints. We shall discuss theory and methods for
such problems later.
In order to use such a model one needs to find good values for all the param-
eters µj and cij ; this is done using historical data from the stock markets. The
weight parameter α is often varied and the optimization problem is solved for
each such “interesting” value. This makes it possible to find a so-called efficient
frontier of expectation versus variance for optimal solutions.
The Markowitz model is a useful tool for financial investments, and many extensions and variations of the model now exist, e.g., using different ways of measuring risk. All such models involve a balance between risk and expected return.
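The problem fits directly into standard quadratic programming software. The following MATLAB sketch uses quadprog from the Optimization Toolbox; the covariance matrix, the expected returns and the weight α below are made-up illustration data, not real market estimates.

% Markowitz problem: minimize alpha*x'*C*x - mu'*x subject to sum(x)=1, x>=0.
C = [0.04 0.01; 0.01 0.09];    % covariances c_ij (illustrative)
mu = [0.06; 0.08];             % expected returns mu_j (illustrative)
alpha = 5;                     % risk-aversion weight
H = 2*alpha*C;                 % quadprog minimizes (1/2)x'Hx + f'x
f = -mu;
Aeq = ones(1,2); beq = 1;      % the fractions add up to 1
lb = zeros(2,1);               % no short sale: x_j >= 0
x = quadprog(H, f, [], [], Aeq, beq, lb, []);

Solving this for a range of values of α traces out the efficient frontier mentioned above.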
1 The precise term is “Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel”.

9.2.2 Fitting a model
In many applications one has a mathematical model of some phenomenon where
the model has some parameters. These parameters represent a flexibility of the
model, and they may be adjusted so that the model explains the phenomenon
best possible.
To be more specific consider a model

y = Fα (x)

for some function Fα : Rm → R. Here α = (α1 , α2 , . . . , αn ) ∈ Rn is a param-


eter vector (so we may have several parameters). Perhaps there are natural
constraints on the parameter, say α ∈ A for a given set A in Rn .
For instance, consider

y = α1 cos x1 + x2^(α2)

so here n = m = 2, α = (α1, α2) and Fα(x) = α1 cos x1 + x2^(α2), where (say) α1 ∈ R and α2 ∈ [1, 2].


The general model may also be thought of as

y = Fα (x) + error

since it is usually a simplification of the system one considers. In statistics


one specifies this error term as a random variable with some (partially) known
distribution. Sometimes one calls y the dependent variable and x the explanatory variable. The goal is to understand how y depends on x.
To proceed, assume we are given a number of observations of the phe-
nomenon given by points

(xi , y i ) (i = 1, 2, . . . , m).

meaning that one has observed y i corresponding to x = xi . We have m such


observations. Usually (but not always) we have m ≥ n. The model fit problem is to adjust the parameter α so that the model fits the given data as well as possible. This leads to the optimization problem

minimize ∑_{i=1}^m (y^i − Fα(x^i))²   subject to α ∈ A.

The optimization variable is the parameter α. Here the model error is quadratic
(corresponding to the Euclidean norm), but other norms are also used.
The optimization problem above is a constrained nonlinear optimization
problem. When the function Fα depends linearly on α, which often is the
case in practice, the problem becomes the classical least squares approxima-
tion problem which is treated in basic linear algebra courses. The solution is
then characterized by a certain linear system of equations, the so-called normal
equations.
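To make the linear case concrete, here is a minimal MATLAB sketch of a straight-line fit; the basis functions and the data points are illustrative assumptions.

% Linear model fit: y ≈ X*alpha, where column j of X contains the j'th basis
% function evaluated at the observation points. Example basis: 1 and x.
x = [0; 1; 2; 3];  y = [0.9; 2.1; 2.9; 4.2];  % made-up observations (x^i, y^i)
X = [ones(size(x)) x];                        % design matrix
alpha = (X'*X) \ (X'*y);     % the normal equations X'X*alpha = X'*y
% alpha = X \ y;             % equivalent, and numerically preferable in MATLAB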

9.2.3 Maximum likelihood
A very important problem in statistics, arising in many applications, is param-
eter estimation and, in particular, maximum likelihood estimation. It leads to
optimization.
Let Y be a “continuous” real-valued random variable with probability density px(y). Here x is a parameter (often one uses other symbols for the parameter, like ξ, θ etc.). For instance, if Y is a normal (Gaussian) variable with expectation x and variance 1, then px(y) = (1/√(2π)) e^{−(y−x)²/2} and

P(a ≤ Y ≤ b) = ∫_a^b (1/√(2π)) e^{−(y−x)²/2} dy
where P denotes probability.
Assume Y is the outcome of an experiment, and that we have observed
Y = y (so y is a known real number or a vector, if several observations were
made). On the basis of y we want to estimate the value of the parameter x
which “explains” best possible our observation Y = y. We have now available
the probability density px (·). The function x → px (y), for fixed y, is called
the likelihood function. It gives the “probability mass” in y as a function of the
parameter x. The maximum likelihood problem is to find a parameter value x
which maximizes the likelihood, i.e., which maximizes the probability of getting
precisely y. This is an optimization problem

max_x px(y)

where y is fixed and the optimization variable is x. We may here add a constraint
on x, say x ∈ C for some set C, which may incorporate possible knowledge of
x and assure that px (y) is positive for x ∈ C. Often it is easier to solve the
equivalent optimization problem of maximizing the logarithm of the likelihood
function
max_x ln px(y)

This is a nonlinear optimization problem. Often, in statistics, there are several


parameters, so x ∈ Rn for some n, and we need to solve a nonlinear optimization
problem in several variables, possibly with constraints on these variables. If
the likelihood function, or its logarithm, is a concave function, we have (after
multiplying by −1) a convex optimization problem. Such problems are easier
to solve than general optimization problems. This will be discussed later.
As a specific example assume we have the linear statistical model

y = Ax + w

where A is a given m×n matrix, x ∈ Rn is an unknown parameter, w ∈ Rm is a random variable (the “noise”), and y ∈ Rm is the observed quantity. We assume
that the components of w, i.e., w1 , w2 , . . . , wm are independent and identically

distributed with common density function p on R. This leads to the likelihood function

px(y) = ∏_{i=1}^m p(yi − ai x)

where ai is the i’th row in A. Taking the logarithm we obtain the maximum likelihood problem

max_x ∑_{i=1}^m ln p(yi − ai x).
In many applications of statistics it is central to solve this optimization problem numerically.
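As a simple special case (an assumption for illustration, not required above): if the noise density is the standard Gaussian p(w) = (1/√(2π)) e^{−w²/2}, then ln p(yi − ai x) = −(yi − ai x)²/2 + constant, so the maximum likelihood problem is exactly the least squares problem of minimizing ‖y − Ax‖². In MATLAB this is a one-liner:

% ML estimate for y = A*x + w with standard Gaussian noise: least squares.
A = [1 0; 1 1; 1 2];     % made-up 3x2 design matrix
y = [0.1; 1.9; 4.1];     % made-up observations
xhat = A \ y;            % minimizes norm(y - A*x)^2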

9.2.4 Optimal control problems


Recall that a discrete dynamical system is an equation
xt+1 = ht (xt ) (t = 0, 1, . . .)
where xt ∈ Rn , x0 is the initial solution, and ht is a given function for each
t. We here think of t as time and xt is the state of the process at time t. For
instance, let n = 1 and consider ht (x) = ax (t = 0, 1, . . .) for some a ∈ R.
Then the solution is xt = at x0 . Another example is when A is an n × n matrix,
xt ∈ Rn and ht (x) = Ax for each t. Then the solution is xt = At x0 . For
the more general situation, where the system functions ht may be different, it
may be difficult to find an explicit solution for xt. Numerically, however, we compute xt simply in a for-loop: starting from x0, compute x1 = h0(x0), then x2 = h1(x1), etc.
Now, consider a dynamical system where we may “control” the system in
each time step. We restrict the attention to a finite time span, t = 0, 1, . . . , T .
A proper model is then
xt+1 = ht (xt , ut ) (t = 0, 1, . . . , T − 1)
where xt is the state of the system at time t and the new variable ut is the
control at time t. We assume xt ∈ Rn and ut ∈ Rm for each t (but these
things also work if these vectors lie in spaces of different dimensions). Thus,
when we choose the controls u0 , u1 , . . . , uT −1 and x0 is known, the sequence
{xt } of states is uniquely determined. Next, assume there are given functions
ft : Rn × Rm → R that we call cost functions. We think of ft (xt , ut ) as the
“cost” at time t when the system is in state xt and we choose control ut . The
optimal control problem is

minimize  fT(xT) + ∑_{t=0}^{T−1} ft(xt, ut)
subject to                                          (9.1)
xt+1 = ht(xt, ut)   (t = 0, 1, . . . , T − 1)

where the control is the sequence (u0, u1, . . . , uT−1) to be determined. This problem arises in many applications, in engineering, finance, economics etc. We now rewrite this problem. First, let u = (u0, u1, . . . , uT−1) ∈ RN where N = Tm.
Since, as we noted, xt is uniquely determined by u, there is a function v t such
that xt = v t (u) (t = 1, 2, . . . , T ); x0 is given. Therefore the total cost may be
written
fT(xT) + ∑_{t=0}^{T−1} ft(xt, ut) = fT(v^T(u)) + ∑_{t=0}^{T−1} ft(v^t(u), ut) := f(u)

which is a function of u. Thus, we see that the optimal control problem may
be transformed to the unconstrained optimization problem

min_{u∈RN} f(u)

Sometimes there may be constraints on the control variables, for instance that
they each lie in some interval, and then the transformation above results in a
constrained optimization problem.
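To illustrate the rewriting, the following MATLAB sketch computes the total cost for a given control sequence u by simulating the states in a for-loop; the system function, the cost functions and all numbers are illustrative assumptions (scalar state and control, horizon T = 20).

% Evaluate the total cost for a given control sequence u (u(t) plays the
% role of u_{t-1}); wrapping this loop in a function of u gives f(u) above.
T = 20; x0 = 1;
u  = zeros(T,1);               % some control sequence
h  = @(x,v) 0.9*x + v;         % assumed system function h_t (same for all t)
ft = @(x,v) x^2 + 0.1*v^2;     % assumed running cost f_t
fT = @(x) 10*x^2;              % assumed terminal cost f_T
x = x0; cost = 0;
for t = 1:T
    cost = cost + ft(x, u(t)); % add f_t(x_t, u_t)
    x = h(x, u(t));            % x_{t+1} = h_t(x_t, u_t)
end
cost = cost + fT(x);           % add the terminal cost f_T(x_T)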

9.2.5 Linear optimization


This is not an application, but rather a special case of the general nonlinear opti-
mization problem where all functions are linear. A linear optimization problem,
also called linear programming, has the form

minimize cT x
subject to (9.2)
Ax = b, x ≥ 0.

Here A is an m × n matrix, b ∈ Rm and x ≥ 0 means that xi ≥ 0 for each


i ≤ n. So in linear optimization one minimizes (or maximizes) a linear function
subject to linear equations and nonnegativity on the variables. Actually, one
can show any problem with constraints that are linear equations and/or linear
inequalities may be transformed into the form above. Such problems have a
wide range of application in science, engineering, economics, business etc. Ap-
plications include portfolio optimization and many planning problems for e.g.
production, transportation etc. Some of these problems are of a combinatorial
nature, but linear optimization is a main tool here as well.
We shall not treat linear optimization in detail here since this is the topic
of a separate course, INF-MAT3370 Linear optimization. In that course one
presents some powerful methods for such problems, the simplex algorithm and
interior point methods. In addition one considers applications in network flow
models and game theory.

9.3 Multivariate calculus and linear algebra
We first recall some useful facts from linear algebra.
The spectral theorem says that if A is a real symmetric matrix, then there
is an orthogonal matrix V (i.e., its columns are orthonormal) and a diagonal
matrix D such that
A = V DV T .
The diagonal of D contains the eigenvalues of A, and A has an orthonormal set
of eigenvectors (the columns of V ).
A real symmetric matrix is positive semidefinite2 if xT Ax ≥ 0 for all x ∈ Rn .
The following statements are equivalent

(i) A is positive semidefinite,


(ii) all eigenvalues of A are nonnegative,
(iii) A = W T W for some matrix W .

Similarly, a real symmetric matrix is positive definite if xT Ax > 0 for all nonzero
x ∈ Rn . The following statements are equivalent

(i) A is positive definite,


(ii) all eigenvalues of A are positive,
(iii) A = W T W for some invertible matrix W .

Every positive definite matrix is therefore invertible.
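These equivalences also give a practical way to test definiteness numerically. A small MATLAB illustration (the matrix is just an example):

% Checking definiteness via the eigenvalues (spectral theorem).
A = [2 -1 0; -1 2 -1; 0 -1 2];          % an example symmetric matrix
lambda = eig(A);                         % real eigenvalues since A is symmetric
isPosSemidef = all(lambda >= -1e-12);    % nonnegative eigenvalues (with tolerance)
isPosDef     = all(lambda >   1e-12);    % strictly positive eigenvalues
% Alternatively, [~,p] = chol(A); p == 0 exactly when A is positive definite.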


We also recall some central facts from multivariate calculus. They will be
used repeatedly in these notes. Let f : Rn → R be a real-valued function defined
on Rn . The gradient of f at x is the n-tuple
∇f(x) = ( ∂f(x)/∂x1 , ∂f(x)/∂x2 , . . . , ∂f(x)/∂xn ).

We will always identify an n-tuple with the corresponding column vector3 . Of


course, the gradient only exists if all the partial derivatives exist. Second or-
der information is contained in a matrix: assuming f has second order partial
derivatives we define the Hessian matrix4 ∇2 f (x) as the n × n matrix whose
(i, j)’th entry is
∂²f(x)/∂xi∂xj .

If these second order partial derivatives are continuous, then we may switch the order of differentiation, and ∇2 f(x) is a symmetric matrix.
2 See Section 7.2 in [7]
3 This is somewhat different from [8], since the gradient there is always considered as a row
vector
4 See Section 5.9 in [8]

For vector-valued functions we also need the derivative. Consider the vector-valued function F given by

F(x) = ( F1(x), F2(x), . . . , Fn(x) )^T ,

so Fi : Rn → R is the ith component function of F . F′(x) denotes the Jacobi matrix5 , or simply the derivative, of F at x: it is the n × n matrix whose (i, j)’th entry is

∂Fi(x)/∂xj .

The ith row of this matrix is therefore the gradient of Fi , now viewed as a row
vector.
Next we recall Taylor’s theorems from multivariate calculus 6 :

Theorem 9.3 (First order Taylor theorem). Let f : Rn → R be a function


having continuous partial derivatives in some ball B(x; r). Then, for each h ∈ Rn with ‖h‖ < r there is some t ∈ (0, 1) such that

f(x + h) = f(x) + ∇f(x + th)^T h.

The next one is known as Taylor’s formula, or the second order Taylor’s
theorem7 :

Theorem 9.4 (Second order Taylor theorem). Let f : Rn → R be a func-


tion having second order partial derivatives that are continuous in some ball B(x; r). Then, for each h ∈ Rn with ‖h‖ < r there is some t ∈ (0, 1) such that

f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇2 f(x + th) h.

This may be shown by considering the one-variable function g(t) = f (x+th)


and applying the chain rule and Taylor’s formula in one variable.
5 See Section 2.6 in [8]
6 This theorem is also the mean value theorem of functions in several variables, see Section
5.5 in [8]
7 See Section 5.9 in [8]

There is another version of the second order Taylor theorem in which the
Hessian is evaluated in x and, as a result, we get an error term. This theorem
shows how f may be approximated by a quadratic polynomial in n variables8 :

Theorem 9.5 (Second order Taylor theorem, version 2). Let f : Rn → R be


a function having second order partial derivatives that are continuous in some
ball B(x; r). Then there is a function ε : Rn → R such that, for each h ∈ Rn with ‖h‖ < r,

f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇2 f(x) h + ε(h)‖h‖² .

Here ε(y) → 0 when y → 0.

Using the O-notation from Definition 4.6, the very useful approximations we get from Taylor’s theorems can thus be summarized as follows:

Taylor approximations:
First order:  f(x + h) = f(x) + ∇f(x)^T h + O(‖h‖)
              ≈ f(x) + ∇f(x)^T h.
Second order: f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇2 f(x) h + O(‖h‖²)
              ≈ f(x) + ∇f(x)^T h + (1/2) h^T ∇2 f(x) h.

We introduce notation for these approximations

T^1_f(x; x + h) = f(x) + ∇f(x)^T h
T^2_f(x; x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇2 f(x) h

As we shall see, one can get a lot of optimization out of these approximations!
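A small numerical illustration in MATLAB (the function, the point and the step are arbitrary choices; the gradient and Hessian are computed by hand):

% Compare f(x+h) with its first and second order Taylor approximations
% for f(x) = x1^2 + 3*x1*x2.
f     = @(x) x(1)^2 + 3*x(1)*x(2);
gradf = @(x) [2*x(1) + 3*x(2); 3*x(1)];
H     = [2 3; 3 0];                     % constant Hessian of this f
x = [1; 1]; h = [0.1; -0.05];
T1 = f(x) + gradf(x)'*h;                % first order approximation
T2 = T1 + 0.5*h'*H*h;                   % second order approximation
err1 = abs(f(x+h) - T1);                % of size O(norm(h)^2) here
err2 = abs(f(x+h) - T2);                % zero, since this f is quadratic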
We also need a Taylor theorem for vector-valued functions, which follows by applying Taylor’s theorem above to each component function:

Theorem 9.6 (First order Taylor theorem for vector-valued functions). Let
F : Rn → Rm be a vector-valued function which is continuously differentiable
in a neighborhood N of x. Then

F(x + h) = F(x) + F′(x)h + O(‖h‖)

when x + h ∈ N .
8 See Section 5.9 in [8]

Finally, if F : Rn → Rm and G : Rk → Rn we define the composition
H = F ◦ G as the function H : Rk → Rm by H(x) = F (G(x)). Then, under
natural differentiability assumptions the following chain rule9 holds:

H′(x) = F′(G(x)) G′(x).

Here the right-hand side is a product of two matrices, the respective Jacobi matrices evaluated at the appropriate points.
Finally, we discuss some notions concerning the convergence of sequences.

Definition 9.7 (Linear convergence). We say that a sequence {xk}∞_{k=0} converges to x∗ linearly (or that the convergence speed is linear) if there is a γ < 1 such that

‖xk+1 − x∗‖ ≤ γ‖xk − x∗‖   (k = 0, 1, . . .).

A faster convergence rate is superlinear convergence, which means that

lim_{k→∞} ‖xk+1 − x∗‖/‖xk − x∗‖ = 0.

A special type of superlinear convergence is quadratic convergence, where

‖xk+1 − x∗‖ ≤ γ‖xk − x∗‖²   (k = 0, 1, . . .)

for some γ < 1.

Exercises for Section 9.3

Ex. 1 — Give an example of a function f : R → R with 10 global minima.

Ex. 2 — Consider the function f (x) = x sin(1/x) defined for x > 0. Find its
local minima. What about global minimum?

Ex. 3 — Let f : X → R+ be a function (with nonnegative function values).


Explain why it is equivalent to minimize f over x ∈ X or minimize f 2 (x) over
X.

Ex. 4 — Consider f : R2 → R given by f (x) = (x1 − 3)2 + (x2 − 2)2 . How


would you explain to anyone that x∗ = (3, 2) is a minimum point?

9 See Section 2.7 in [8]

Ex. 5 — The level sets of a function f : R2 → R are sets of the form Lα = {x ∈ R2 : f(x) = α}. Let f(x) = (1/4)(x1 − 1)² + (x2 − 3)². Draw the level sets in the plane for α = 10, 5, 1, 0.1.

Ex. 6 — The sublevel set of a function f : Rn → R is the set Sα(f) = {x ∈ Rn : f(x) ≤ α}, where α ∈ R. Assume that inf{f(x) : x ∈ Rn} = η exists.
a. What happens to the sublevel sets Sα as α decreases? Give an example.
b. Show that if f is continuous and there is an x′ such that, with α = f(x′), the sublevel set Sα(f) is bounded, then f attains its minimum.

Ex. 7 — Consider the portfolio optimization problem in Subsection 9.2.1.


a. Assume that cij = 0 for each i ≠ j. Find, analytically, an optimal solution. Describe the set of all optimal solutions.
b. Consider the special case where n = 2. Solve the problem (hint: eliminate one variable) and discuss how the minimum point depends on α.

Ex. 8 — Later in these notes we will need the expression for the gradient of
functions which are expressed in terms of matrices.
a. Let f : Rn → R be defined by f (x) = q T x = xT q, where q is a vector.
Show that ∇f (x) = q, and that ∇2 f (x) = 0.
b. Let f : Rn → R be the quadratic function f (x) = (1/2)xT Ax. Show
that ∇f (x) = Ax, and that ∇2 f (x) = A.

Ex. 9 — Consider f (x) = f (x1 , x2 ) = x21 + 3x1 x2 − 5x22 + 3. Determine the


first order Taylor approximation to f at each of the points (0, 0) and (2, 1).

Ex. 10 — Let A be the 2 × 2 matrix with rows (1, 2) and (2, 8). Show that A is positive definite. (Try to give two different proofs.)

Ex. 11 — Show that if A is positive definite, then its inverse is also positive definite.

Chapter 10

A crash course in convexity

Convexity is a branch of mathematical analysis dealing with convex sets and


convex functions. It also represents a foundation for optimization.
We just summarize concepts and some results. For proofs one may consult
[4] or [14], see also [1].

10.1 Convex sets


A set C ⊆ Rn is called convex if (1 − λ)x + λy ∈ C whenever x, y ∈ C and
0 ≤ λ ≤ 1. Geometrically, this means that C contains the line segment between
each pair of points in C, so, loosely speaking, a convex set contains no “holes”.
For instance, the ball B(a; δ) = {x ∈ Rn : ‖x − a‖ ≤ δ} is a convex set. Let us show this. Recall the triangle inequality, which says that ‖u + v‖ ≤ ‖u‖ + ‖v‖ whenever u, v ∈ Rn . Let x, y ∈ B(a; δ) and λ ∈ [0, 1]. Then

‖((1 − λ)x + λy) − a‖ = ‖(1 − λ)(x − a) + λ(y − a)‖
                      ≤ ‖(1 − λ)(x − a)‖ + ‖λ(y − a)‖
                      = (1 − λ)‖x − a‖ + λ‖y − a‖
                      ≤ (1 − λ)δ + λδ = δ.
Therefore B(a; δ) is convex.
Every linear subspace is also a convex set, as well as the translate of every
subspace (which is called an affine set). Some other examples of convex sets in
R2 are shown in Figure 10.1. We will come back to why each of these sets are
convex later. Another important property is that the intersection of a family of
convex sets is a convex set.
By a linear system we mean a finite system of linear equations and/or linear
inequalities involving n variables. For example
x1 + x2 = 3, x1 ≥ 0, x2 ≥ 0
is a linear system in the variables x1 , x2 . The solution set is the set of points
(x1 , 3 − x1 ) where 0 ≤ x1 ≤ 3. The set of solutions of a linear system is called

[Figure 10.1: Examples of some convex sets. (a) A square. (b) The ellipse x²/4 + y² ≤ 1. (c) The area x⁴ + y⁴ ≤ 1.]

a polyhedron. These sets often occur in optimization. Thus, a polyhedron has


the form
P = {x ∈ Rn : Ax ≤ b}
where A ∈ Rm,n and b ∈ Rm (m is arbitrary, but finite) and ≤ means compo-
nentwise inequality. There are simple techniques for rewriting any linear system
in the form Ax ≤ b.

Proposition 10.1. Every polyhedron is a convex set.

The square from Figure 10.1(a) is defined by the inequalities −1 ≤ x, y ≤ 1.


It is therefore a polyhedron, and therefore convex. The next result shows that
convex sets are preserved under linear maps.

Proposition 10.2. If T : Rn → Rm is a linear transformation, and C ⊆ Rn


is a convex set, then the image T (C) of this set is also convex.

10.2 Convex functions


The notion of a convex function also makes sense for real-valued functions of
several variables. Consider a real-valued function f : C → R where C ⊆ Rn is
a convex set. We say that f is convex provided that

f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y)   (x, y ∈ C, 0 ≤ λ ≤ 1)    (10.1)

(This inequality holds for all x, y and λ as specified). Due to the convexity
of C, the point (1 − λ)x + λy lies in C, so the inequality is well-defined. The
geometrical interpretation in one dimension is that whenever you take two points
on the graph of f , say (x, f (x)) and (y, f (y)), the graph of f restricted to the
line segment [x, y] lies below the line segment in Rn+1 between the two chosen
points. A function g is called concave if −g is convex.
Every linear function is convex. Some other examples of convex functions in
n variables are

• f (x) = L(x) + α where L is a linear function from Rn into R (a linear
functional) and α is a real number. Such a function is called an affine
function and it may be written f (x) = cT x + α for a suitable vector c.
• f(x) = ‖x‖ (Euclidean norm). That this is convex can be proved by writing ‖(1 − λ)x + λy‖ ≤ ‖(1 − λ)x‖ + ‖λy‖ = (1 − λ)‖x‖ + λ‖y‖. In fact, the same argument can be used to show that every norm defines a convex function. Such an example is the l1-norm, also called the sum norm, defined by ‖x‖1 = ∑_{j=1}^n |xj|.
• f(x) = e^{∑_{j=1}^n xj} (see Exercise 7).

• f (x) = eh(x) where h : Rn → R is a convex function.


• f (x) = maxi gi (x) where gi : Rn → R is an affine function (i ≤ m).
This means that the pointwise maximum of affine functions is a convex
function. Note that such convex functions are typically not differentiable
everywhere. A more general result is that the pointwise supremum of an
arbitrary family of affine functions (or even convex functions) is convex.
This is a very useful fact in convexity and its applications.
The following result is an exercise to prove, and it gives a method for proving
convexity of a function.

Proposition 10.3. Assume that f : Rn → R is convex and H : Rm → Rn is


affine. Then the composition f ◦ H is convex, where (f ◦ H)(x) := f (H(x)).

The next result is often used, and is called Jensen’s inequality. It can be shown using induction.

Theorem 10.4 (Jensen’s inequality). Let f : C → R be a convex function defined on a convex set C ⊆ Rn . If x1, x2, . . . , xr ∈ C and λ1, . . . , λr ≥ 0 satisfy ∑_{j=1}^r λj = 1, then

f( ∑_{j=1}^r λj xj ) ≤ ∑_{j=1}^r λj f(xj).        (10.2)

A point of the form ∑_{j=1}^r λj xj , where the λj’s are nonnegative and sum to 1, is called a convex combination of the points x1, x2, . . . , xr. One can show that a set is convex if and only if it contains all convex combinations of its points.
Finally, one connection between convex sets and convex functions is the
following fact whose proof is an exercise.

Proposition 10.5. Let C ⊆ Rn be a convex set and consider a convex function
f : C → R. Let α ∈ R. Then the “sublevel” set

{x ∈ C : f (x) ≤ α}

is a convex set.

10.3 Properties of convex functions


A convex function may not be differentiable in every point. However, one can
show that a convex function always has one-sided directional derivatives at any
point. But what about continuity?

Theorem 10.6. Let f : C → R be a convex function defined on an open


convex set C ⊆ Rn . Then f is continuous on C.

However, a convex function may be discontinuous in points on the boundary


of its domain. For instance, the function f : [0, 1] → R given by f (0) = 1 and
f (x) = 0 for x ∈ (0, 1] is convex, but discontinuous at x = 0. Next we give a
useful technique for checking that a function is convex.

Theorem 10.7. Let f be a real-valued function defined on an open convex


set C ⊆ Rn and assume that f has continuous second-order partial derivatives
on C.
Then f is convex if and only if the Hessian matrix ∇2 f (x) is positive
semidefinite for each x ∈ C.

With this result it is straightforward to prove that the remaining sets from Figure 10.1 are convex. They can be written as sublevel sets of the functions f(x, y) = x²/4 + y² and f(x, y) = x⁴ + y⁴. For the first of these the level sets are ellipses, and are shown in Figure 10.2, together with f itself. One can quickly verify that the Hessian matrices of these functions are positive semidefinite. It follows from Proposition 10.5 that the corresponding sets are convex.
An important class of convex functions consists of (certain) quadratic func-
tions. Let A ∈ Rn×n be a symmetric matrix which is positive semidefinite and
consider the quadratic function f : Rn → R given by
f(x) = (1/2) x^T Ax − b^T x = (1/2) ∑_{i,j} aij xi xj − ∑_{j=1}^n bj xj .

(If A = 0, then the function is linear, and it may be strange to call it quadratic. But we still do this, for simplicity.) Then (Exercise 9.3.8) the Hessian matrix of f is A, i.e., ∇2 f(x) = A for each x ∈ Rn . Therefore, by Theorem 10.7, f is a convex function.
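To make the earlier verification for Figure 10.1 explicit, the Hessians of the two functions used there are easily computed by hand:

∇2 f(x, y) = [ 1/2  0 ; 0  2 ]       for f(x, y) = x²/4 + y²,
∇2 f(x, y) = [ 12x²  0 ; 0  12y² ]   for f(x, y) = x⁴ + y⁴.

Both are diagonal with nonnegative entries, hence positive semidefinite (the first is even positive definite), so both functions are convex by Theorem 10.7.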

[Figure 10.2: A function and its level curves. (a) The function f(x, y) = x²/4 + y². (b) Some level curves of f.]

We remark that sometimes it may be easy to check that a symmetric matrix A is positive semidefinite. A (real) symmetric n × n matrix A is called diagonally dominant if |aii| ≥ ∑_{j≠i} |aij| for i = 1, . . . , n. These matrices arise
in many applications, e.g. splines and differential equations. It can be shown
that every symmetric diagonally dominant matrix is positive semidefinite. For a
simple proof of this fact using convexity, see [3]. Thus, we get a simple criterion
for convexity of a function: check if the Hessian matrix ∇2 f (x) is diagonally
dominant for each x. Be careful here: this matrix may be positive semidefinite
without being diagonally dominant!
We now look at differentiability properties of convex functions.

Theorem 10.8. Let f be a real-valued convex function defined on an


open convex set C ⊆ Rn . Assume that all the partial derivatives
∂f (x)/∂x1 , . . . , ∂f (x)/∂xn exist at a point x ∈ C. Then f is differentiable at
x.

A convex function may not be differentiable everywhere, but it is differen-


tiable “almost everywhere”. More precisely, for a convex function defined on an
open convex set in Rn , the set of points for which f is not differentiable has
Lebesgue measure zero. We do not go into further details on this here, but refer
to e.g. [5] for a proof and a discussion.
Another characterization of convex functions that involves the gradient may
now be presented.

Theorem 10.9. Let f : C → R be a differentiable function defined on an


open convex set C ⊆ Rn . Then the following conditions are equivalent:

(i) f is convex.
(ii) f(x) ≥ f(x0) + ∇f(x0)^T (x − x0) for all x, x0 ∈ C.
(iii) (∇f(x) − ∇f(x0))^T (x − x0) ≥ 0 for all x, x0 ∈ C.

This theorem is important. Property (ii) says that the first-order Taylor
approximation of f at x0 (which is the right-hand side of the inequality) always
underestimates f . This result has interesting consequences for optimization as
we shall see later.

Exercises for section 10.3

Ex. 1 — Let S = {(x, y, z) : z ≥ x2 + y 2 } ⊂ R3 . Sketch the set and verify


that it is a convex set.

Ex. 2 — Let f : S → R be a twice differentiable function, where S is an open interval in R. Check that f is convex if and only if f′′(x) ≥ 0 for all x ∈ S.

Ex. 3 — Prove Proposition 10.3.

Ex. 4 — Prove Proposition 10.5.

Ex. 5 — Explain how you can write the LP problem max {cT x : Ax ≥
b, Bx = d, x ≥ 0} as an LP problem of the form

max{cT x : Hx ≤ h, x ≥ 0}

for suitable matrix H and vector h.

Ex. 6 — Let x1, . . . , xt ∈ Rn and let C be the set of vectors of the form

∑_{j=1}^t λj xj

where λj ≥ 0 for each j = 1, . . . , t. Show that C is convex. Make a sketch of such a set in R3.

Ex. 7 — Show that f(x) = e^{∑_{j=1}^n xj} is a convex function.

Ex. 8 — Let f : Rn → R be a convex function and let α ∈ R. Show that the


sublevel set Sα (f ) = {x ∈ Rn : f (x) ≤ α} is a convex set.

Ex. 9 — Assume that f and g are convex functions defined on an interval I.
Determine which of the following functions are convex or concave:
a. λf where λ ∈ R,
b. min{f, g},
c. |f |.

Ex. 10 — Let f : [a, b] → R be a convex function. Show that

max{f (x) : x ∈ [a, b]} = max{f (a), f (b)}

i.e., a convex function defined on a closed real interval attains its maximum at one of the endpoints.

Ex. 11 — Let f : (0, ∞) → R be a convex function and define the function g : (0, ∞) → R by g(x) = x f(1/x). Show that g is convex. Why is the function x → x e^{1/x} convex?

Ex. 12 — Let C ⊆ Rn be a convex set and consider the distance function dC defined by dC(x) = inf{‖x − y‖ : y ∈ C}. Show that dC is a convex function.

Chapter 11

Nonlinear equations

A basic mathematical problem is to solve a system of equations in several un-


knowns (variables). There are numerical methods that can solve such equations,
at least within a small error tolerance. We shall briefly discuss such methods
here; for further details, see [6, 11].

11.1 Equations and fixed points


In linear algebra one works a lot with linear equations in several variables, and
Gaussian elimination is a central method for solving such equations. There
are also other faster methods, so-called iterative methods, for linear equations.
But what about nonlinear equations? For instance, consider the system in two
variables x1 and x2 :

x1² − x1 x2⁻³ + cos x1 = 1
5x1⁴ + 2x1³ − tan(x1 x2⁸) = 3

Clearly, such equations can be very hard to solve. The general problem is to
solve the equation
F (x) = 0 (11.1)
for a given function F : Rn → Rn . If F (x) = 0 we call x a root of F (or
of the equation). The example above is equivalent to finding roots in F (x) =
(F1 (x), F2 (x)) where

F1 (x) = x21 − x1 x−3


2 + cos x1 − 1
F2 (x) = 5x41 + 2x31 − tan(x1 x82 ) − 3

In particular, if F (x) = Ax − b where A is an n × n matrix and b ∈ Rn , then


we are back to linear equations (a square system). More generally one may
consider equations G(x) = 0 where G : Rn → Rm , but we here only discuss the
case m = n.

Often the problem F (x) = 0 has the following form, or may be rewritten to
it:
K(x) = x. (11.2)
for some function K : Rn → Rn . This corresponds to the special choice F (x) =
K(x) − x. A point x ∈ Rn such that x = K(x) is called a fixed point of the
function K. In finding such a fixed point it is tempting to use the following
iterative method: choose a starting point x0 and repeat the following iteration

xk+1 = K(xk) for k = 0, 1, 2, . . .          (11.3)

This is called a fixed-point iteration. We note that if K is continuous and this


procedure converges to some point x∗ , then x∗ must be a fixed point. The fixed-
point iteration is an extremely simple algorithm, and very easy to implement.
Perhaps surprisingly, it also works very well for many such problems. Let ε > 0 denote a small error tolerance used for stopping the process, e.g. 10⁻⁶.

Fixed-point algorithm:
1. Choose an initial point x0 , let x = x0 and err = 1.
2. while err > ε do
(i) Compute x1 = K(x)
(ii) Compute err = ‖x1 − x‖
(iii) Update x := x1
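
A direct MATLAB transcription of this algorithm might look as follows (the function name, tolerance and iteration cap are our own choices):

function x=fixedpoint(x0,K)
% Fixed-point iteration x_{k+1} = K(x_k), stopped when successive iterates
% are closer than a small tolerance, or after a maximal number of steps.
epsilon=1e-10; maxit=1000; n=0;
x=x0; err=1;
while err > epsilon && n < maxit
    x1 = K(x);
    err = norm(x1 - x);
    x = x1;
    n = n + 1;
end

For instance, x=fixedpoint(0.5,@(x) cos(x)) converges to the unique fixed point of the cosine function (approximately 0.7391).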

When does the fixed-point iteration work? Let ‖ · ‖ be a fixed norm, e.g. the Euclidean norm, on Rn . We say that the function K : Rn → Rn is a contraction if there is a constant 0 ≤ c < 1 such that

‖K(x) − K(y)‖ ≤ c‖x − y‖   (x, y ∈ Rn).

We also say that K is c-Lipschitz in this case. The following theorem is called
the Banach contraction principle. It also holds in Banach spaces, i.e., complete
normed vector spaces (possibly infinite-dimensional).

Theorem 11.1. Assume that K is c-Lipschitz with 0 < c < 1. Then K has a unique fixed point x∗ . For any starting point x0 the fixed-point iteration (11.3) generates a sequence {xk}∞_{k=0} that converges to x∗ . Moreover

‖xk+1 − x∗‖ ≤ c‖xk − x∗‖   for k = 0, 1, . . .        (11.4)

so that
‖xk − x∗‖ ≤ c^k ‖x0 − x∗‖.

Proof. First, note that if both x and y are fixed points of K, then

‖x − y‖ = ‖K(x) − K(y)‖ ≤ c‖x − y‖

which means that x = y (as c < 1); therefore K has at most one fixed point. Next, we compute

‖xk+1 − xk‖ = ‖K(xk) − K(xk−1)‖ ≤ c‖xk − xk−1‖ = · · · ≤ c^k ‖x1 − x0‖

so

‖xm − x0‖ = ‖∑_{k=0}^{m−1} (xk+1 − xk)‖ ≤ ∑_{k=0}^{m−1} ‖xk+1 − xk‖
          ≤ (∑_{k=0}^{m−1} c^k) ‖x1 − x0‖ ≤ (1/(1 − c)) ‖x1 − x0‖.

From this we derive that {xk} is a Cauchy sequence, as we have

‖xs+m − xs‖ = ‖K(xs+m−1) − K(xs−1)‖ ≤ c‖xs+m−1 − xs−1‖ = · · ·
            ≤ c^s ‖xm − x0‖ ≤ (c^s/(1 − c)) ‖x1 − x0‖

and 0 < c < 1. Any Cauchy sequence in Rn has a limit point, so xm → x∗ for some x∗ ∈ Rn . We now prove that the limit point x∗ is a (actually, the) fixed point:

‖x∗ − K(x∗)‖ ≤ ‖x∗ − xm‖ + ‖xm − K(x∗)‖
             = ‖x∗ − xm‖ + ‖K(xm−1) − K(x∗)‖
             ≤ ‖x∗ − xm‖ + c‖xm−1 − x∗‖

and letting m → ∞ here gives ‖x∗ − K(x∗)‖ ≤ 0, so x∗ = K(x∗) as desired. Finally,

‖xk+1 − x∗‖ = ‖K(xk) − K(x∗)‖ ≤ c‖xk − x∗‖ ≤ c^{k+1} ‖x0 − x∗‖

which completes the proof.


We see that xk → x∗ linearly, and that Equation (11.4) gives an estimate
on the convergence speed.

11.2 Newton’s method


We return to the main problem (11.1). Our goal is to present Newton’s method,
a highly efficient iterative method for solving this equation. The method con-
structs a sequence
x0 , x1 , x2 , . . .
in Rn which, hopefully, converges to a root x∗ of F , so F (x∗ ) = 0. The idea
is to linearize F at the current iterate xk and choose the next iterate xk+1 as
a zero of this linearized function. The first order Taylor approximation of F at
xk is
TF1 (xk ; x) = F (xk ) + F � (xk )(x − xk ).

We solve T^1_F(xk; x) = 0 for x and define the next iterate as xk+1 = x. This gives

xk+1 = xk − F′(xk)⁻¹ F(xk)          (11.5)

which leads to Newton’s method. One here assumes that the derivative F′ is known analytically. Note that we do not (and hardly ever do!) compute the inverse of the matrix F′.

Newton’s method for nonlinear equations:


1. Choose an initial point x0 .
2. For k = 0, 1, . . . do
(i) Find the direction p by solving F′(xk)p = −F(xk)
(ii) Update: xk+1 = xk + p

In the main step, which is to compute p, one needs to solve an n × n linear


system of equations where the coefficient matrix is the Jacobi matrix of F , eval-
uated at xk . In MAT1110 [8] we implemented the following code for Newton’s
method for nonlinear equations:
function x=newtonmult(x0,F,J)
% Performs Newton's method in many variables
% x0: column vector which contains the start point
% F: function computing the values of F
% J: function computing the Jacobi matrix
epsilon=0.0000001; N=30; n=0;
x=x0;
while norm(F(x)) > epsilon && n<=N
    x=x-J(x)\F(x);      % solve J(x)p = -F(x) and take the step x := x + p
    fval = F(x);
    fprintf('itnr=%2d x=[%13.10f,%13.10f] F(x)=[%13.10f,%13.10f]\n',...
        n,x(1),x(2),fval(1),fval(2))   % printout written for two variables
    n = n + 1;
end
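
As an illustration of how the function can be called, the following solves a simple example system (the system itself is just for illustration): find an intersection of the unit circle and the line x1 = x2.

F = @(x) [x(1)^2 + x(2)^2 - 1; x(1) - x(2)];   % the system F(x) = 0
J = @(x) [2*x(1) 2*x(2); 1 -1];                % its Jacobi matrix
x = newtonmult([1; 0.5], F, J);                % converges to (1/sqrt(2), 1/sqrt(2))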

This code terminates after a given number of iterations, or when a given accuracy has been obtained. Note that this function should work for any function F , since F is a parameter to the function.
The convergence of Newton’s method may be analyzed using fixed point
theory since one may view Newton’s method as a fixed point iteration. Observe
that the Newton iteration (11.5) may be written
xk+1 = G(xk )
where G is the function
G(x) = x − F′(x)⁻¹ F(x)

From this it is possible to show that if the starting point is sufficiently close to
the root, then Newton’s method will converge to this root at a linear convergence
rate. With more clever arguments one may show that the convergence rate of
Newton’s method is even faster: it has superlinear convergence. Actually, for
many functions one even has quadratic convergence rate. The proof of the
following convergence theorem relies purely on Taylor’s theorem.

Theorem 11.2. Assume that Newton’s method with initial point x0 produces a sequence {xk}∞_{k=0} which converges to a solution x∗ of (11.1). Then the convergence rate is superlinear.

Proof. From Taylor’s theorem for vector-valued functions, Theorem 9.6, in the point xk we have

0 = F(x∗) = F(xk + (x∗ − xk)) = F(xk) + F′(xk)(x∗ − xk) + O(‖xk − x∗‖)

Multiplying this equation by F′(xk)⁻¹ (which is assumed to exist!) gives

xk − x∗ − F′(xk)⁻¹ F(xk) = O(‖xk − x∗‖)

Combining this with the Newton iteration xk+1 = xk − F′(xk)⁻¹ F(xk) we get

xk+1 − x∗ = O(‖xk − x∗‖).

So

lim_{k→∞} ‖xk+1 − x∗‖/‖xk − x∗‖ = 0.

This shows the superlinear convergence.


The previous result is interesting, but it does not say how near to the root the starting point needs to be in order to get convergence. This is the next topic.
Let F : U → Rn where U is an open, convex set in Rn . Consider the conditions on the derivative F′:

(i) ‖F′(x) − F′(y)‖ ≤ L‖x − y‖ for all x, y ∈ U
(ii) ‖F′(x0)⁻¹‖ ≤ K for some x0 ∈ U                         (11.6)

where K and L are some constants. Here ‖F′(x0)‖ denotes the operator norm of the square matrix F′(x0), which is defined as

‖F′(x0)‖ = sup_{‖x‖=1} ‖F′(x0)x‖

and it measures how much the operator F′(x0) may increase the size of vectors. The following convergence result for Newton’s method is known as Kantorovich’s theorem.

Theorem 11.3 (Kantorovich’s theorem). Let F : U → Rn be a differentiable function satisfying (11.6). Assume that B̄(x0; 1/(KL)) ⊆ U and that

‖F′(x0)⁻¹ F(x0)‖ ≤ 1/(2KL).

Then F′(x) is invertible for all x ∈ B(x0; 1/(KL)) and Newton’s method with initial point x0 will produce a sequence {xk}∞_{k=0} contained in B(x0; 1/(KL)) and limk→∞ xk = x∗ for some limit point x∗ ∈ B̄(x0; 1/(KL)) with

F(x∗) = 0.

A proof of this theorem is quite long (but not very difficult to understand) [8].

One disadvantage with Newton’s method is that one needs to know the Jacobi matrix F′ explicitly. For complicated functions, or functions being the output of a simulation, the derivative may be hard or impossible to find. The quasi-Newton method, also called the secant method, is then a good alternative. The idea is to approximate F′(xk) by some matrix Bk and to compute the new search direction from

Bk p = −F(xk).

A practical method for finding these approximations B1, B2, . . . is Broyden’s method. Provided that the previous iteration gave xk, with Broyden’s method we compute xk+1 by following the search direction, define sk = xk+1 − xk and yk = F(xk+1) − F(xk), and compute Bk+1 from Bk by the formula

Bk+1 = Bk + (1/(sk^T sk)) (yk − Bk sk) sk^T .        (11.7)

It can be shown that Bk approximates the Jacobi matrix F′(xk) well in each iteration. Moreover, the update given in (11.7) can be done efficiently (it is a rank one update of Bk).

Algorithm: Broyden’s method:


1. Choose an initial point x0 , and an initial B0 .
2. For k = 0, 1, . . . do
(i) Find direction pk by solving Bk p = −F (xk )
(ii) Use line search (see Section 12.2) along direction pk to find αk
(iii) Update: xk+1 := xk + αk pk
sk := xk+1 − xk
y k := F (xk+1 ) − F (xk )
compute Bk+1 from (11.7).

Note that this algorithm also computes an α through what we call a line
search, to attempt to find the optimal distance to follow the search direction.

We do not here specify how this line search can be performed. Also, we do
not specify how the initial values can be chosen. For B0 , any approximation of
the Jacobian of F at x0 can be used, using a numerical differentiation method
of your own choosing. One can show that Broyden’s method, under certain
assumptions, also converges superlinearly, see [11].

Exercises for Section 11.2

Ex. 1 — Show that the problem of solving nonlinear equations (11.1) may
be transformed into a nonlinear optimization problem. (Hint: Square each
component function and sum these up!)

Ex. 2 — Let T : R → R be given by T(x) = (3/2)(x − x³). Draw the graph of this function, and determine its fixed points. Let x∗ denote the largest fixed point. Find, using your graph, an interval I containing x∗ such that the fixed point algorithm with an initial point in I is guaranteed to converge towards x∗. Then try the fixed point algorithm with starting point x0 = √(5/3).

Ex. 3 — Let α ∈ R+ be fixed, and consider f(x) = x² − α. Then the zeros are ±√α. Write down Newton’s iteration for this problem. Let α = 2 and compute the first three iterates in Newton’s method when x0 = 1.

Ex. 4 — For any vector norm ‖ · ‖ on Rn , we can more generally define a corresponding operator norm for n × n matrices by

‖A‖ = sup_{‖x‖=1} ‖Ax‖.

a. Explain why this supremum is attained.
b. Consider the vector norm ‖x‖ = ‖x‖1 = ∑_{j=1}^n |xj| on Rn . For n = 2, draw the sublevel set {x ∈ R2 : ‖x‖1 ≤ 1}. Compute the corresponding operator norm ‖A‖ where A is an n × n matrix.

Ex. 5 — Consider a linear map T : Rn → Rn given by T(x) = Ax where A is an n × n matrix. When is T a contraction, using the operator norm defined in the previous exercise?

Ex. 6 — Test the function newtonmult on the equations given initially in


Section 11.1.

Ex. 7 — In this exercise we will implement Broyden’s method with Matlab.

a. Given a value x0, implement a function which computes an estimate of F′(x0) by estimating the partial derivatives of F, using a numerical differentiation method and step size of your own choosing.
b. Implement a function
function x=broyden(x0,F)
which returns an estimate of a zero of F using Broyden’s method. Your method should set B0 to be the matrix obtained from the function in a. Just indicate where line search along the search direction should be performed in your function, without implementing it. The function should work as newtonmult in that it terminates after a given number of iterations, or after a given accuracy has been obtained.

Chapter 12

Unconstrained optimization

How can we know whether a given point x∗ is a minimum, local or global, of


some given function f : Rn → R? And how can we find such a point x∗ ?
These are, of course, some main questions in optimization. In order to give
good answers to these questions we need optimality conditions. They provide
tests for optimality, and serve as the basis for algorithms. We here focus on
differentiable functions; the corresponding results for the nondifferentiable case
are more difficult (but they exist, and are based on convexity, see [5, 13]).
For unconstrained problems it is not difficult to find powerful optimality
conditions from Taylor’s theorem for functions in several variables.

12.1 Optimality conditions


In order to establish optimality conditions in unconstrained optimization, Tay-
lor’s theorem is the starting point, see Section 9.3. We only consider mini-
mization problems, as maximization problems are turned into minimization
problems by multiplying the function f by −1.
First we look at some necessary optimality conditions.

Theorem 12.1. Assume that f : Rn → R has continuous partial derivatives,


and assume that x∗ is a local minimum of f . Then

∇f (x∗ ) = 0. (12.1)

If, moreover, f has continuous second order partial derivatives, then ∇2 f (x∗ )
is positive semidefinite.

Proof. Assume that x∗ is a local minimum of f and that ∇f(x∗) ≠ 0. Let h = −α∇f(x∗) where α > 0. Then ∇f(x∗)^T h = −α‖∇f(x∗)‖² < 0 and by continuity of the partial derivatives of f , ∇f(x)^T h < 0 for all x in some

neighborhood of x∗ . From Theorem 9.3 (first order Taylor) we obtain

f (x∗ + h) − f (x∗ ) = ∇f (x∗ + th)T h (12.2)

for some t ∈ (0, 1) (depending on α). By choosing α small enough, the right-
hand side of (12.2) is negative (as just said), and so f (x∗ + h) < f (x∗ ), contra-
dicting that x∗ is a local minimum. This proves that ∇f (x∗ ) = 0.
To prove the second statement, we get from Theorem 9.4 (second order Taylor)

f(x∗ + h) = f(x∗) + ∇f(x∗)^T h + (1/2) h^T ∇2 f(x∗ + th) h
          = f(x∗) + (1/2) h^T ∇2 f(x∗ + th) h        (12.3)
2
If ∇2 f (x∗ ) is not positive semidefinite, there is an h such that hT ∇2 f (x∗ )h < 0
and, by continuity of the second order partial derivatives, hT ∇2 f (x)h < 0 for
all x in some neighborhood of x∗ . But then (12.3) gives f (x∗ + h) − f (x∗ ) < 0;
a contradiction. This proves that ∇2 f(x∗) is positive semidefinite.
The two necessary optimality conditions in Theorem 12.1 are called the first-
order and the second-order conditions, respectively. The first-order condition
says that the gradient must be zero at x∗ , and such a point is often called a stationary point. The second-order condition may be interpreted as f being “locally convex” at x∗ , although this is not a precise term. A stationary point which is neither a local minimum nor a local maximum is called a saddle point.
So, every neighborhood of a saddle point contains points with larger and points
with smaller f -value.
Theorem 12.1 gives a connection to nonlinear equations. In order to find a
stationary point we may solve ∇f(x) = 0, which is an n × n (usually nonlinear)
system of equations. (The system is linear whenever f is a quadratic function.)
One may solve this equation, for instance, by Newton’s method and thereby
get a candidate for a local minimum. Sometimes this approach works well,
in particular if f has a unique local minimum and we have an initial point
"sufficiently close". However, there are other better methods which we discuss
later.
It is important to point out that any algorithm for finding a minimum of f
has to be able to find a stationary point. Therefore algorithms in this area are
typically iterative and move to gradually better points where the norm of the
gradient becomes smaller, and eventually almost equal to zero.
As an example consider a convex quadratic function

f (x) = (1/2) xT Ax − bT x

where the (symmetric) Hessian matrix is (constant equal to) A and this matrix is positive semidefinite. Then ∇f(x) = Ax − b so the first-order necessary
optimality condition is
Ax = b

which is a linear system of equations. If f is strictly convex, which happens when
A is positive definite, then A is invertible and the unique solution is x∗ = A−1 b.
Thus, there is only one candidate for a local (and global) minimum, namely
x∗ = A−1 b. Actually, this is indeed a unique global minimum, but to verify
this we need a suitable argument. One way is to use convexity (with results
presented later) or an alternative is to use sufficient optimality conditions which
we discuss next. The linear system Ax = b, when A is positive definite, may be
solved by several methods. A popular, and very fast, method is the conjugate
gradient method. This method, and related methods, are discussed in detail in
the course INF-MAT4360 Numerical linear algebra [10].
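For illustration, the stationarity condition for such a convex quadratic can be solved directly in MATLAB; the matrix and right-hand side below are just example data, and pcg is MATLAB's conjugate gradient routine for symmetric positive definite systems.

% Minimize f(x) = (1/2)x'Ax - b'x by solving the stationarity condition Ax = b.
A = [4 1; 1 3];                 % a symmetric positive definite example matrix
b = [1; 2];
xstar = A \ b;                  % direct solution
xcg   = pcg(A, b, 1e-10, 100);  % same solution by the conjugate gradient method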
In order to present a sufficient optimality condition we need a result from
linear algebra. Recall from linear algebra that a symmetric positive definite
matrix has only real eigenvalues and all these are positive.

Proposition 12.2. Let A be an n × n symmetric positive definite matrix, and


let λn > 0 denote its smallest eigenvalue. Then

h^T Ah ≥ λn ‖h‖²   (h ∈ Rn).

Proof. By the spectral theorem there is an orthogonal matrix V (containing the


orthonormal eigenvectors as its columns) such that

A = V DV T

where D is the diagonal matrix with the eigenvalues λ1 , . . . , λn on the diagonal.


Let h ∈ Rn and define y = V^T h. Then ‖y‖ = ‖h‖ and

h^T Ah = h^T V D V^T h = y^T D y = ∑_{i=1}^n λi yi² ≥ λn ∑_{i=1}^n yi² = λn ‖y‖² = λn ‖h‖².

Next we consider sufficient optimality conditions in the general differentiable


case. These conditions are used to prove that a candidate point (say, found by
an algorithm) is really a local minimum.

Theorem 12.3. Assume that f : Rn → R has continuous second order partial


derivatives in some neighborhood of a point x∗ . Assume that ∇f (x∗ ) = 0 and
∇2 f (x∗ ) is positive definite. Then x∗ is a local minimum of f .

Proof. From Theorem 9.5 (second order Taylor) and Proposition 12.2 we get

f(x∗ + h) = f(x∗) + ∇f(x∗)^T h + (1/2) h^T ∇2 f(x∗) h + ε(h)‖h‖²
          ≥ f(x∗) + (1/2) λn ‖h‖² + ε(h)‖h‖²

278
where λn > 0 is the smallest eigenvalue of ∇2 f (x∗ ). Dividing here by �h�2
gives
1
(f (x∗ + h) − f (x∗ ))/|h�2 = λn + �(h)
2
Since limh→0 �(h) = 0, there is an r such that for �h� < r, |�(h)| < λn /4. This
implies that
(f (x∗ + h) − f (x∗ ))/|h�2 ≥ λn /4
for all h with �h� < r. This proves that x∗ is a local minimum of f .
We remark that the proof of the previous theorem actually shows that x∗
is a strict local minimum of f meaning that f (x∗ ) is strictly smaller than f (x)
for all other points x in some neighborhood of x∗ . Note the difference between
the necessary and the sufficient optimality conditions: a necessary condition is
that ∇2 f (x) is positive semidefinite, while a part of the sufficient condition is
the stronger property that ∇2 f (x) is positive definite.
Let us see what happens when we work with a convex function.

Theorem 12.4. Let f : Rn → R be a convex function. Then a local minimum


is also a global minimum. If, in addition, f is differentiable, then a point x∗
is a local (and then global) minimum of f if and only if

∇f (x∗ ) = 0.

Proof. Let x1 be a local minimum. If x1 is not a global minimum, there is an


x2 ≠ x1 with f (x2 ) < f (x1 ). Then for 0 < λ < 1

f ((1 − λ)x1 + λx2 ) ≤ (1 − λ)f (x1 ) + λf (x2 ) < f (x1 )

and, since (1 − λ)x1 + λx2 lies arbitrarily close to x1 when λ is small, this
contradicts that f (x) ≥ f (x1 ) for all x in a neighborhood of x1 . Therefore x1
must be a global minimum.
Assume f is convex and differentiable. Due to Theorem 12.1 we only need
to show that if ∇f (x∗ ) = 0, then x∗ is a local and global minimum. So assume
that ∇f (x∗ ) = 0. Then, from Theorem 10.9 we have

f (x) ≥ f (x∗ ) + ∇f (x∗ )T (x − x∗ )

for all x ∈ Rn . If ∇f (x∗ ) = 0, this directly shows that x∗ is a global minimum.

12.2 Methods
Algorithms for unconstrained optimization are iterative methods that generate
a sequence of points with gradually smaller values on the function f which is
to be minimized. There are two main types of algorithms in unconstrained
optimization:

• Line search methods: Here one first chooses a search direction dk from
the current point xk , using information about the function f . Then one
chooses a step length αk so that the new point

xk+1 = xk + αk dk

has a small, perhaps smallest possible, value on the halfline {xk + αdk :
α ≥ 0}. αk describes how far one should go along the search direction.
The problem of choosing αk is a one-dimensional optimization problem.
Sometimes we can find αk exactly, and in such cases we refer to the method
as exact line search. When αk cannot be found analytically, one instead searches
approximately for a point close to the minimum on the halfline; the method is
then referred to as backtracking line search.
• Trust region methods: In these methods one chooses an approximation
fˆk to the function in some neighborhood of the current point xk . The
function fˆk is simpler than f and one minimizes fˆk (in the mentioned
neighborhood) and let the next iterate xk+1 be this minimizer.

These types are typically both based on quadratic approximation of f , but


they differ in the order in which one chooses search direction and step size. In
the following we only discuss the first type, the line search methods.
A very natural choice for search direction at a point xk is the negative
gradient, dk = −∇f (xk ). Recall that the direction of maximum increase of a
(differentiable) function f at a point x is ∇f (x), and the direction of maximum
decrease is −∇f (x). To verify this, Taylor’s theorem gives
f (x + h) = f (x) + ∇f (x) · h + (1/2) hT ∇2 f (x + th)h for some t ∈ (0, 1).
So, for small h, the first order term dominates and we would like to make this
term small. By the Cauchy-Schwarz inequality1

∇f (x) · h ≥ −‖∇f (x)‖ ‖h‖

and equality holds for h = −α∇f (x) for some α ≥ 0. In general, we call h a
descent direction at x if ∇f (x) · h < 0. Thus, if we move in a descent direction
from x and make a sufficiently small step, the new point has a smaller f -value.
With this background we shall in the following focus on gradient methods given
by
xk+1 = xk + αk dk (12.4)
where the direction dk satisfies

∇f (xk ) · dk < 0 (12.5)

There are two gradient methods we shall discuss:


1 The Cauchy-Schwarz inequality says: |u · v| ≤ ‖u‖ ‖v‖ for u, v ∈ Rn .

• If we choose the search direction dk = −∇f (xk ), we get the steepest
descent method
xk+1 = xk − αk ∇f (xk ).
In each step it moves in the direction of the negative gradient. Sometimes
this gives slow convergence, so other methods have been developed where
other choices of direction dk are made.
• An important method is Newton’s method

xk+1 = xk − αk ∇2 f (xk )−1 ∇f (xk ). (12.6)

This is the gradient method with dk = −∇2 f (xk )−1 ∇f (xk ); this vector
dk is called the Newton step. The so-called pure Newton method is when
one simply chooses step size αk = 1 for each k. To interpret this method
consider the second order Taylor approximation of f in xk

f (xk + h) ≈ Tf2 (xk ; xk + h) = f (xk ) + ∇f (xk )T h + (1/2)hT ∇2 f (xk )h

If we minimize this quadratic function w.r.t. h, assuming ∇2 f (xk ) is


positive definite, we get (see Exercise 7)

h = −∇2 f (xk )−1 ∇f (xk )

which explains the Newton step.

In the following we follow the presentation in [1]. In a gradient method


we need to choose the step length. This is the one-dimensional optimization
problem
min{f (x + αd) : α ≥ 0}.
Sometimes (maybe not too often) we may solve this problem exactly. Most
practical methods try some candidate α’s and pick the one with smallest f -
value. Note that it is not necessary to compute the exact minimum (this may
take too much time). The main thing is to assure that we get a sufficiently large
decrease in f without making a too small step.
A popular step size rule is the Armijo Rule. Here one chooses (in advance)
a parameter s > 0, a reduction factor β satisfying 0 < β < 1, and a constant σ with 0 < σ < 1. Define
the integer

mk = min{m : m ≥ 0, f (xk ) − f (xk + β m sdk ) ≥ −σβ m s∇f (xk )T dk } (12.7)

and choose step length αk = β mk s. Here σ is typically chosen very small, e.g.
σ = 10−3 . The parameter s fixes the search for step size to lie within the
interval [0, s]. This can be important: for instance, we can set s so small that
the initial step size we try is within the domain of definition for f . According to
[1] β is usually chosen in [1/10, 1/2]. In the literature one may find a lot more
information about step size rules and how they may be adjusted to the methods
for finding search direction, see [1], [11].
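As a minimal sketch (only an illustration of formula (12.7); Exercise 9 in the exercises for this section asks for a proper implementation), the Armijo rule can be coded in Matlab as follows. Here f and df are assumed to be function handles for f and its gradient, x is the current point and d the search direction:

% Armijo rule: find the smallest m >= 0 satisfying (12.7) and
% return the step size alpha = beta^m * s.
beta = 0.5; s = 1; sigma = 10^(-3);    % example parameter choices
m = 0;
while f(x) - f(x + beta^m*s*d) < -sigma*beta^m*s*(df(x))'*d
    m = m + 1;
end
alpha = beta^m*s;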

Now, we return to the choice of search direction in the gradient method
(12.4). A main question is whether it generates a sequence {xk }∞ k=1 which
converges to a stationary point x∗ , i.e., where ∇f (x∗ ) = 0. It turns out that
this may not be the case; one needs to be careful about the choice of dk to assure
this convergence. The problem is that if dk tends to be nearly orthogonal to
∇f (xk ) one may get into trouble. For this reason one introduces the following
notion:

Definition 12.5 (Gradient related). {dk } is called gradient related to {xk }


if for any subsequence {xkp }∞ p=1 of {xk } converging to a nonstationary
point, the corresponding subsequence {dkp }∞ p=1 of {dk } is bounded and
lim supp→∞ ∇f (xkp )T dkp < 0.

What this condition assures is that ‖dk ‖ is not too small or large compared
to ‖∇f (xk )‖ and that the angle between the vectors dk and ∇f (xk ) is not too
close to 90◦ . The proof of the following theorem may be found in [1].

Theorem 12.6. Let {xk }∞ k=0 be generated by the gradient method (12.4),
where {dk }∞ k=0 is gradient related to {xk }∞ k=0 and the step size αk is chosen
using the Armijo rule. Then every limit point of {xk }∞ k=0 is a stationary point.

We remark that in Theorem 12.6 the same conclusion holds if we use exact
minimization as step size rule, i.e., f (xk +αdk ) is minimized exactly with respect
to α.
A very important property of a numerical algorithm is its convergence speed.
Let us consider the steepest descent method first. It turns out that the con-
vergence speed for this algorithm is very well explained by its performance
on minimizing a quadratic function, so therefore the following result is impor-
tant. In this theorem A is a symmetric positive definite matrix with eigenvalues
λ1 ≥ λ2 ≥ · · · ≥ λn > 0.

Theorem 12.7. If the steepest descent method xk+1 = xk − αk ∇f (xk ) using


exact line search is applied to the quadratic function f (x) = xT Ax where A
is positive definite, then (the minimum value is 0 and)

f (xk+1 ) ≤ mA f (xk )

where mA = ((λ1 − λn )/(λ1 + λn ))2 .

The proof may be found in [1]. Note that mA = ((cond(A) − 1)/(cond(A) + 1))2 ,
where cond(A) = λ1 /λn is the condition number of the matrix A. Thus, if the
largest eigenvalue is much larger than the smallest one, mA will be nearly 1
and one typically gets slow convergence. So the rule is: if the condition number
of A is small we get fast convergence, but if cond(A) is large, there will be
slow convergence. A similar behavior holds for most functions f because locally
near a minimum point the function is very close to its second order Taylor
approximation in x∗ which is a quadratic function with A = ∇2 f (x∗ ).
Thus, Theorem 12.7 says that the sequence obtained in the steepest descent
method converges linearly to a stationary point (at least for quadratic functions).
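The following Matlab sketch (an illustration of our own, not taken from [1]) runs steepest descent with exact line search on f (x) = xT Ax for a matrix with cond(A) = 100. The starting point is chosen so that the worst case is attained, and the function values then decrease only by the factor mA = (99/101)2 ≈ 0.96 in each iteration:

% Steepest descent with exact line search on f(x) = x'*A*x.
% For this f the exact step length along d = -grad f(x) can be computed
% explicitly: alpha = (g'*g)/(2*g'*A*g) where g = 2*A*x is the gradient.
A = diag([100 1]);        % example matrix with condition number 100
x = [1; 100];             % a worst-case starting point for this A
for k = 1:25
    g = 2*A*x;                       % gradient
    alpha = (g'*g)/(2*(g'*A*g));     % exact minimizer of f(x - alpha*g)
    x = x - alpha*g;
    fprintf('k = %2d, f(x) = %e\n', k, x'*A*x);
end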
We now turn to Newton’s method.

Newton’s method for unconstrained optimization:


1. Choose an initial point x0 .
2. For k = 0, 1, 2, . . . do
(i) (Newton step) dk := −∇2 f (xk )−1 ∇f (xk ); η := −∇f (xk )T dk
(ii) (Stopping criterion) If η/2 < ε: stop.
(iii) (Line search) Use backtracking line search to find step size αk
(iv) (Update) xk+1 := xk + αk dk

Recall that the pure Newton step minimizes the second order Taylor ap-
proximation of f at the current iterate xk . Thus, if the function we minimize
is quadratic, we are done in one step. Similarly, if the function can be well
approximated by a quadratic function, then one would expect fast convergence.
We shall give a result on the convergence of Newton’s method (see [2] for
further details). When A is symmetric, we let λmin (A) denote the smallest
eigenvalue of A.
For the convergence result we need a lemma on strictly convex functions.
Assume that x0 is a starting point for Newton’s method and let S = {x ∈ Rn :
f (x) ≤ f (x0 )}. We shall assume that f is continuous and convex, and this
implies that S is a closed convex set. We also assume that f has a minimum
point x∗ which then must be a global minimum. Moreover the minimum point
will be unique due to a strict convexity assumption on f . Let f ∗ = f (x∗ ) be
the optimal value.
The following lemma says that for a convex functions as just described, a
point is nearly a minimum point (in terms of the f -value) whenever the gradient
is small in that point.

Lemma 12.8. Assume that f is convex as above and that λmin (∇2 f (x)) ≥ m
for all x ∈ S. Then
f (x) − f ∗ ≤ (1/(2m)) ‖∇f (x)‖2 . (12.8)

Proof. From Theorem 9.4, the second order Taylor’ theorem, we have for each
x, y ∈ S

f (y) = f (x) + ∇f (x)T (y − x) + (1/2)(y − x)T ∇2 f (z)(y − x)

for suitable z on the line segment between x and y. Here a lower bound for the
quadratic term is (m/2)‖y − x‖2 , due to Proposition 12.2. Therefore
f (y) ≥ f (x) + ∇f (x)T (y − x) + (m/2)‖y − x‖2 .
Now, fix x and view the expression on the right-hand side as a quadratic function
of y. This function is minimized for y ∗ = x − (1/m)∇f (x). Inserting y = y ∗ in
the right-hand side therefore gives, for every y,
f (y) ≥ f (x) + ∇f (x)T (y ∗ − x) + (m/2)‖y ∗ − x‖2 = f (x) − (1/(2m)) ‖∇f (x)‖2 .

This holds for every y ∈ S so letting y = x∗ gives


f ∗ = f (x∗ ) ≥ f (x) − (1/(2m)) ‖∇f (x)‖2
which proves the desired inequality.
In the following convergence result we consider a function f as in Lemma
12.8. Moreover, we assume that the Hessian matrix is Lipschitz continuous over
S; this is essentially a bound on the third derivatives of f . We do not give the
complete proof (it is quite long), but consider some of the main ideas. Recall
the definition of the set S from above. Recall that the spectral norm of a square
matrix A is defined by
‖A‖2 = max{‖Ax‖ : ‖x‖ = 1}.

It is a fact that ‖A‖2 is equal to the largest singular value of A.

Theorem 12.9. Let f be convex and twice continuously differentiable and


assume that
(i) λmin (∇2 f (x)) ≥ m for all x ∈ S.
(ii) ‖∇2 f (x) − ∇2 f (y)‖2 ≤ L‖x − y‖ for all x, y ∈ S.
Moreover, assume that f has a minimum point x∗ . Then Newton’s method
generates a sequence {xk }∞ k=0 that converges to x∗ . From a certain k on, the
convergence speed is quadratic.

Proof. Define f ∗ = f (x∗ ). It is possible to show that there are numbers η and
γ > 0 with 0 < η ≤ m2 /L such that the following holds for each k:
(i) If ‖∇f (xk )‖ ≥ η, then
f (xk+1 ) ≤ f (xk ) − γ. (12.9)

(ii) If ‖∇f (xk )‖ < η, then backtracking line search gives αk = 1 and

(L/(2m2 )) ‖∇f (xk+1 )‖ ≤ ((L/(2m2 )) ‖∇f (xk )‖)2 . (12.10)

We omit the proof of this fact; it may be found in [2].
We may now prove that if ‖∇f (xk )‖ < η, then also ‖∇f (xk+1 )‖ < η. This
follows from (ii) above and the fact (assumption) η ≤ m2 /L. Therefore, as soon
as case (ii) occurs in the iterative process, in all the remaining iterations case
(ii) will occur. Actually, as soon as case (ii) “kicks in” quadratic convergence
starts as we shall see now. So assume that case (ii) occurs from a certain k.
(Below we show that such k must exist.)
Define µl = (L/(2m2 )) ‖∇f (xl )‖ for each l ≥ k. Then 0 ≤ µk < 1/2 as η ≤ m2 /L.
From what we just saw and (12.10)

µl+1 ≤ µl^2 (l ≥ k).

So (by induction)

µl ≤ µk^(2^(l−k)) ≤ (1/2)^(2^(l−k)) (l = k, k + 1, . . .).

Next, from Lemma 12.8

f (xl ) − f ∗ ≤ (1/(2m)) ‖∇f (xl )‖2 ≤ (2m3 /L2 ) (1/2)^(2^(l−k+1)) (l ≥ k).

This inequality shows that f (xl ) → f ∗ , and since the minimum point is unique,
we must have xl → x∗ . Moreover, it follows that the convergence is quadratic.
It only remains to explain why case (ii) above indeed occurs for some k. In
each iteration of type (i) f is decreased by at least γ, as seen from equation
(12.9), so the number of such iterations must be bounded by

(f (x0 ) − f ∗ )/γ

which is a finite number. Finally, the proof of the statements in connection


with (i) and (ii) above is quite long and one derives several inequalities using
the convexity properties of f .
From the proof it is also possible to say something about how many iterations
are needed to reach a certain accuracy. In fact, if ε > 0, a bound on the
number of iterations until f (xk ) ≤ f ∗ + ε is

(f (x0 ) − f ∗ )/γ + log2 log2 (2m3 /(εL2 )).
Here γ is the parameter introduced in the proof above. The second term in
this expression (the logarithmic term) grows very slowly as ε is decreased, and
it may roughly be replaced by the constant 6. So, whenever the second stage
(case (ii) in the proof) occurs, the convergence is extremely fast, it takes about
6 more Newton iterations. Note that quadratic convergence means, roughly,
that the number of correct digits in the answer doubles for every iteration.
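A small illustration of this doubling (our own one-variable example): pure Newton iterations on f (x) = exp(x) + exp(−x), which has its minimum at x∗ = 0, give errors that are roughly squared in every step:

% Pure Newton iterations (step size 1) on f(x) = exp(x) + exp(-x).
% Here f'(x) = exp(x) - exp(-x) and f''(x) = exp(x) + exp(-x), so the
% Newton update is x - f'(x)/f''(x) = x - tanh(x).
x = 1;
for k = 1:5
    x = x - tanh(x);                 % Newton step
    fprintf('k = %d, |x - x*| = %e\n', k, abs(x));
end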

Exercises for Section 12.2

Ex. 1 — Consider the function f (x1 , x2 ) = x21 + ax22 where a > 0 is a param-
eter. Draw some of the level sets of f (for different levels) for each a in the set
{1, 4, 100}. Also draw the gradient in a few points on these level sets.

Ex. 2 — State and prove a theorem similar to Theorem 12.1 for maximization
problems.

Ex. 3 — Let f (x) = xT Ax where A is a symmetric n × n matrix. Assume


that A is indefinite, so it has both positive and negative eigenvalues. Show that
x = 0 is a saddlepoint of f .

Ex. 4 — Let f (x1 , x2 ) = 4x1 + 6x2 + x21 + 2x22 . Find all stationary points and
determine if they are minimum, maximum or saddlepoints. Do the same for the
function g(x1 , x2 ) = 4x1 + 6x2 + x21 − 2x22 .

Ex. 5 — The function f (x1 , x2 ) = 100(x2 − x21 )2 + (1 − x1 )2 is called the


Rosenbrock function. Compute the gradient and the Hessian matrix at every
point x. Find every local minimum. Also draw some of the level sets (contour
lines) of f using Matlab.

Ex. 6 — Let f (x) = (1/2)xT Ax − bT x where A is a positive definite n × n


matrix. Consider the steepest descent method applied to the minimization of
f , where we assume exact line search is used. Assume that the search direction
happens to be equal to an eigenvector of A. Show that then the minimum is
reached in just one step.

Ex. 7 — Consider the second order Taylor approximation


Tf2 (x; x + h) = f (x) + ∇f (x)T h + (1/2)hT ∇2 f (x)h.

a. Show that ∇h Tf2 = ∇f (x) + ∇2 f (x)h.

b. Minimizing Tf2 with respect to h implies solving ∇h Tf2 = 0, i.e. ∇f (x) +
∇2 f (x)h = 0 from a. If ∇2 f (x) is positive definite, explain why it is in-
vertible, so that this equation has the unique solution h = −∇2 f (x)−1 ∇f (x),
as previously noted for the Newton step.

Ex. 8 — Implement the steepest descent method. Test the algorithm on the
functions in exercises 4 and 5. Use different starting points.

Ex. 9 — Implement a function
function alpha=armijorule(f,df,x,d)

which returns α chosen according to the Armijo rule for a function f with the
given gradient, at the point x, with search direction d. The function should compute
mk from Equation (12.7) with β = 0.2, s = 0.5, σ = 10−3 , and return α = β mk s.

Ex. 10 — Write a function


[xopt,numit]=newtonbacktrack(f,df,d2f,x0)

which performs Newton’s method for unconstrained optimization. The input


parameters are the function, its gradient, its Hesse matrix, and the initial point.
The function should also return the number of iterations, and at each iteration
write the corresponding function value. The function should use backtracking
line search with the function armijorule from the previous exercise. Test the
algorithm on the functions in exercises 4 and 5. Use different starting points.

Chapter 13

Constrained optimization -
theory

In this chapter we consider constrained optimization problems. A general opti-


mization problem is

minimize f (x) subject to x ∈ S

where S ⊆ Rn is a given set and f : S → R. We here focus on a very


general optimization problem which often occurs in applications. Consider the
nonlinear optimization problem with equality/inequality constraints

minimize f (x)
subject to
(13.1)
hi (x) = 0 (i ≤ m)
gj (x) ≤ 0 (j ≤ r)

where f , h1 , h2 , . . . , hm and g1 , g2 , . . . , gr are continuously differentiable func-


tions from Rn into R. A point x satisfying all the m + r constraints will be
called feasible. Thus, we look for a feasible point with smallest f -value.
Our goal is to establish optimality conditions for this problem, starting with
the special case with only equality constraints. Then we discuss algorithms for
solving this problem. Our presentation is strongly influenced by [2] and [1].

13.1 Equality constraints and the Lagrangian


Consider the nonlinear optimization problem with equality constraints

minimize f (x)
subject to (13.2)
hi (x) = 0 (i ≤ m)

where f and h1 , h2 , . . . , hm are continuously differentiable functions from Rn


into R. We introduce the vector field H = (h1 , h2 , . . . , hm ), so H : Rn → Rm
and H(x) = (h1 (x), h2 (x), . . . , hm (x)).
We first establish necessary optimality conditions for this problem. A point
x ∈ Rn is called regular if the gradient vectors ∇hi (x) (i ≤ m) are linearly
independent.

Theorem 13.1. Let x∗ be a local minimum in problem (13.2) and assume that
x∗ is a regular point. Then there is a unique vector λ∗ = (λ∗1 , λ∗2 , . . . , λ∗m ) ∈
Rm such that
∇f (x∗ ) + ∑_{i=1}^m λ∗i ∇hi (x∗ ) = 0. (13.3)

If f and each hi are twice continuously differentiable, then the following also
holds
hT (∇2 f (x∗ ) + ∑_{i=1}^m λ∗i ∇2 hi (x∗ ))h ≥ 0 for all h ∈ T (x∗ ) (13.4)

where T (x∗ ) is the subspace T (x∗ ) = {h ∈ Rn : ∇hi (x∗ ) · h = 0 (i ≤ m)}.

The numbers λ∗i in this theorem are called the Lagrangian multipliers. Note
that the Lagrangian multiplier vector λ∗ is unique; this follows directly from
the linear independence assumption as x∗ is assumed regular. The theorem may
also be stated in terms of the Lagrangian function L : Rn × Rm → R given by

L(x, λ) = f (x) + ∑_{i=1}^m λi hi (x) = f (x) + λT H(x) (x ∈ Rn , λ ∈ Rm ).

Then

∇x L(x, λ) = ∇f (x) + ∑_{i=1}^m λi ∇hi (x)
∇λ L(x, λ) = H(x).

Therefore, the first order conditions in Theorem 13.1 may be rewritten as follows

∇x L(x∗ , λ∗ ) = 0, ∇λ L(x∗ , λ∗ ) = 0.

[Figure 13.1: The two surfaces h1 (x) = b1 and h2 (x) = b2 intersect each other in
a curve. Along this curve the constraints are fulfilled.]

[Figure 13.2: ∇f (x∗ ) as a linear combination of ∇h1 (x∗ ) and ∇h2 (x∗ ).]

Here the second equation simply means that H(x) = 0. These two equations
say that (x∗ , λ∗ ) is a stationary point for the Lagrangian, and it is a system of
n + m (possibly nonlinear) equations in n + m variables.
We may interpret the theorem in the following way. At the point x∗ the
linear subspace T (x∗ ) consists of the “first order feasible directions”. Actually, if
each hi is linear, then T (x∗ ) consists of those h such that x∗ + h is feasible, i.e.,
hi (x∗ +h) = 0 for each i ≤ m. Thus, (13.3) says that in a local minimum x∗ the
gradient ∇f (x∗ ) is orthogonal to the subspace T (x∗ ) of the first order feasible
variations. This is reasonable since otherwise there would be a feasible direction
in which f would decrease. In Figure 13.1 we have plotted a curve where two
constraints are fulfilled. In Figure 13.2 we have then shown an interpretation of
Theorem 13.1.
Note that this necessary optimality condition corresponds to the condition
∇f (x∗ ) = 0 in the unconstrained case. The second condition (13.4) is a sim-

ilar generalization of the second order condition in Theorem 12.1 (saying that
∇2 f (x∗ ) is positive semidefinite).
It is possible to prove the theorem by eliminating variables based on the
equations and thereby reducing the problem to an unconstrained one. Another
proof, which we shall present below is based on the penalty approach. This
approach is also interesting as it leads to algorithms for actually solving the
problem.
Proof. (Theorem 13.1) For k = 1, 2, . . . consider the modified objective function

F k (x) = f (x) + (k/2)‖H(x)‖2 + (α/2)‖x − x∗ ‖2

where x∗ is the local minimum under consideration, and α is a positive constant.


The second term is a penalty term for violating the constraints and the last term
is there for proof technical reasons. As x∗ is a local minimum there is an ε > 0
such that f (x∗ ) ≤ f (x) for all x ∈ B̄(x∗ ; ε). Choose now an optimal solution xk
of the problem min{F k (x) : x ∈ B̄(x∗ ; ε)}; the existence here follows from the
extreme value theorem (F k is continuous and the ball is compact). For every k

F k (xk ) = f (xk ) + (k/2)‖H(xk )‖2 + (α/2)‖xk − x∗ ‖2 ≤ F k (x∗ ) = f (x∗ ).

By letting k → ∞ in this inequality we conclude that limk→∞ ‖H(xk )‖ = 0.


So every limit point x̄ of the sequence {xk } satisfies H(x̄) = 0. The inequality
above also implies (by dropping a term on the left-hand side) that f (xk ) +
(α/2)‖xk − x∗ ‖2 ≤ f (x∗ ) for all k, so by passing to the limit we get

f (x̄) + (α/2)‖x̄ − x∗ ‖2 ≤ f (x∗ ) ≤ f (x̄)

where the last inequality follows from the facts that x̄ ∈ B̄(x∗ ; ε) and H(x̄) = 0.
Clearly, this gives x̄ = x∗ . We have therefore shown that the sequence {xk }
converges to the local minimum x∗ . Since x∗ is the center of the ball B̄(x∗ ; ε),
the points xk lie in the interior of this ball for suitably large k. The conclusion is
then that xk is the unconstrained minimum of F k when k is sufficiently large. We
may therefore apply Theorem 12.1 so ∇F k (xk ) = 0, so

0 = ∇F k (xk ) = ∇f (xk ) + kH ′ (xk )T H(xk ) + α(xk − x∗ ). (13.5)

Here H ′ denotes the Jacobi matrix of H. For suitably large k the matrix
H ′ (xk )H ′ (xk )T is invertible (as the rows of H ′ (xk ) are linearly independent
due to rank(H ′ (x∗ )) = m and a continuity argument). Multiply equation (13.5)
by (H ′ (xk )H ′ (xk )T )−1 H ′ (xk ) to obtain

kH(xk ) = −(H ′ (xk )H ′ (xk )T )−1 H ′ (xk )(∇f (xk ) + α(xk − x∗ )).

Letting k → ∞ we see that the sequence {kH(xk )} is convergent and its limit
point λ∗ is given by

λ∗ = −(H ′ (x∗ )H ′ (x∗ )T )−1 H ′ (x∗ )∇f (x∗ ).

Finally, by passing to the limit in (13.5) we get

0 = ∇f (x∗ ) + H ′ (x∗ )T λ∗

This proves the first part of the theorem; we omit proving the second part which
may be found in [1].

The first order necessary condition (13.3) along with the constraints H(x) =
0 is a system of n + m equations in the n + m variables x1 , x2 , . . . , xn and
λ1 , λ2 , . . . , λm . One may use e.g. Newton’s method for solving these equations
and find a candidate for an optimal solution. But usually there are better
numerical methods for solving the optimization problem (13.2), as we shall see soon.
Necessary optimality conditions are used for finding a candidate solution
for being optimal. In order to verify optimality we need sufficient optimality
conditions.

Theorem 13.2. Assume that f and H are twice continuously differentiable


functions. Moreover, let x∗ be a point satisfying the first order necessary
optimality condition (13.3) and the following condition

y T ∇2 L(x∗ , λ∗ )y > 0 for all y ≠ 0 with H ′ (x∗ )y = 0 (13.6)

where ∇2 L(x∗ , λ∗ ) is the Hessian of the Lagrangian function with second order
partial derivatives with respect to x. Then x∗ is a (strict) local minimum of
f subject to H(x) = 0.

This theorem may be proved (see [1] for details) by considering the aug-
mented Lagrangian function

Lc (x, λ) = f (x) + λT H(x) + (c/2)‖H(x)‖2 (13.7)

where c is a positive scalar. This is in fact the Lagrangian function in the


modified problem

minimize f (x) + (c/2)‖H(x)‖2 subject to H(x) = 0 (13.8)

and this problem must have the same local minima as the problem of minimizing
f (x) subject to H(x) = 0. The objective function in (13.8) contains the penalty
term (c/2)‖H(x)‖2 which may be interpreted as a penalty (increased function
value) for violating the constraint H(x) = 0. In connection with the proof of
Theorem 13.2 based on the augmented Lagrangian one also obtains the following
interesting and useful fact: if x∗ and λ∗ satisfy the sufficient conditions in
Theorem 13.2 then there exists a positive c̄ such that for all c ≥ c̄ the point x∗ is
also a local minimum of the augmented Lagrangian Lc (·, λ∗ ). Thus, the original
constrained problem has been converted to an unconstrained one involving the
augmented Lagrangian. And, as we know, unconstrained problems are easier to
solve (solve the equations saying that the gradient is equal to zero).
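As a small numerical illustration of this idea (our own example, using Matlab's built-in fminsearch as the unconstrained solver), one can minimize f (x) = x1 + x2 subject to h1 (x) = x21 + x22 − 1 = 0 by minimizing the augmented Lagrangian (13.7) for a fixed multiplier guess and a fairly large penalty parameter c:

% Augmented Lagrangian for: minimize x1 + x2 subject to x1^2 + x2^2 = 1.
% The true minimum is at (-1/sqrt(2), -1/sqrt(2)).
f = @(x) x(1) + x(2);
h = @(x) x(1)^2 + x(2)^2 - 1;
lambda = 0;                  % crude guess for the multiplier
c = 100;                     % penalty parameter
Lc = @(x) f(x) + lambda*h(x) + (c/2)*h(x)^2;
x = fminsearch(Lc, [-0.5; -0.5])   % approximately (-0.71, -0.71)

In practice one would update the multiplier estimate (and possibly increase c) between such unconstrained minimizations, but even this crude version returns a point close to the true minimum.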

13.2 Inequality constraints and KKT
We now consider the general nonlinear optimization problem where there are
both equality and inequality constraints. The problem is then

minimize f (x)
subject to
(13.9)
hi (x) = 0 (i ≤ m)
gj (x) ≤ 0 (j ≤ r)

We assume, as usual, that all these functions are continuously differentiable


real-valued functions defined on Rn . In short form we write the constraints
as H(x) = 0 and G(x) ≤ 0 where we let H = (h1 , h2 , . . . , hm ) and G =
(g1 , g2 , . . . , gr ).
A main difficulty in problems with inequality constraints is to determine
which of the inequalities are active in an optimal solution. If we knew
the active inequalities, we would essentially have a problem with only equal-
ity constraints, H(x) = 0 plus the active inequalities treated as equalities, i.e., a
problem of the form discussed in the previous section. For very small problems (solvable by
hand-calculation) a direct method is to consider all possible choices of active in-
equalities and solve the corresponding equality-constrained problem by looking
at the Lagrangian function.
Interestingly, one may also transform the problem (13.9) into the following
equality-constrained problem

minimize f (x)
subject to
(13.10)
hi (x) = 0 (i ≤ m)
gj (x) + zj2 = 0 (j ≤ r).

We have introduced extra variables zj , one for each inequality. The squares
of these variables represent the slack in each of the original inequalities. Note that
there is no sign constraint on zj . Clearly, the problems (13.9) and (13.10) are
equivalent. This transformation can also be useful computationally. Moreover,
it is useful theoretically as one may apply the optimality conditions from the
previous section to problem (13.10) to derive the theorem below (see [1]).
We now present a main result in nonlinear optimization. It gives optimality
conditions for this problem, and these conditions are called the Karush-Kuhn-
Tucker conditions, or simply the KKT conditions. In order to present the KKT
conditions we introduce the Lagrangian function L : Rn × Rm × Rr → R given
by

L(x, λ, µ) = f (x) + ∑_{i=1}^m λi hi (x) + ∑_{j=1}^r µj gj (x) = f (x) + λT H(x) + µT G(x).

The gradient of L with respect to x is given by


∇x L(x, λ, µ) = ∇f (x) + ∑_{i=1}^m λi ∇hi (x) + ∑_{j=1}^r µj ∇gj (x).

The Hessian matrix of L at (x, λ, µ) containing second order partial derivatives


of L with respect to x will be denoted by ∇xx L(x, λ, µ). Finally, the indices of
the active inequalities at x is denoted by A(x), so A(x) = {j ≤ r : gj (x) = 0}.
A point x is called regular if the vectors in {∇h1 (x), . . . , ∇hm (x)} ∪ {∇gj (x) : j ∈ A(x)}
are linearly independent.
In the following theorem the first part contains necessary conditions while
the second part contains sufficient conditions for optimality.

Theorem 13.3. Consider problem (13.9) with the usual differentiability as-
sumptions.
(i) Let x∗ be a local minimum of this problem and assume that x∗ is
a regular point. Then there are unique Lagrange multiplier vectors λ∗ =
(λ∗1 , λ∗2 , . . . , λ∗m ) and µ∗ = (µ∗1 , µ∗2 , . . . , µ∗r ) such that

∇x L(x∗ , λ∗ , µ∗ ) = 0
µ∗j ≥ 0 (j ≤ r) (13.11)
µ∗j = 0 (j ∉ A(x∗ )).

If f , g and h are twice continuously differentiable, then the following also holds

y T ∇2xx L(x∗ , λ∗ , µ∗ )y ≥ 0 (13.12)

for all y with ∇hi (x∗ )T y = 0 (i ≤ m) and ∇gj (x∗ )T y = 0 (j ∈ A(x∗ )).
(ii) Assume that x∗ , λ∗ and µ∗ are such that x∗ is a feasible point and
(13.11) holds. Assume, moreover, that (13.12) holds with strict inequality for
each y. Then x∗ is a (strict) local minimum in problem (13.9).

Proof. We shall derive this result from Theorem 13.1.


(i) By assumption x∗ is a local minimum of problem (13.9), and x∗ is a
regular point. Consider the constrained problem

minimize f (x)
subject to
(13.13)
hi (x) = 0 (i ≤ m)
gj (x) = 0 (j ∈ A(x∗ ))

which is obtained by removing all inactive constraints in x∗ . Then x∗ must
be a local minimum in (13.13); otherwise there would be a point x′ in the
neighborhood of x∗ which is feasible in (13.13) and satisfying f (x′ ) < f (x∗ ).
By choosing x′ sufficiently near x∗ we would also get gj (x′ ) < 0 for all j ∉ A(x∗ ),
so that x′ would be feasible in (13.9), contradicting that x∗ is a local minimum
in (13.9). Therefore we may apply
Theorem 13.1 to problem (13.13) and by regularity of x∗ there must be unique
Lagrange multiplier vectors λ∗ = (λ∗1 , λ∗2 , . . . , λ∗m ) and µ∗j (j ∈ A(x∗ )) such that
∇f (x∗ ) + ∑_{i=1}^m λ∗i ∇hi (x∗ ) + ∑_{j∈A(x∗ )} µ∗j ∇gj (x∗ ) = 0

By defining µ∗j = 0 for j ∉ A(x∗ ) we get (13.11), except for the nonnegativity


of µ.
The remaining part of the theorem may be proved, after some work, by
studying the equality-constrained reformulation (13.10) of (13.9) and applying
Theorem 13.1 to (13.10). The details may be found in [1].
The KKT conditions have an interesting geometrical interpretation. They
say that −∇f (x∗ ) may be written as linear combination of the gradients of the
hi ’s plus a nonnegative linear combination of the gradients of the gj ’s that are
active at x∗ .
We remark that the assumption that x∗ is a regular point may be too restric-
tive in some situations, for instance there may be more than n active inequalities
in x∗ . There exist several other weaker assumptions that assure the existence
of Lagrangian multipliers (and similar necessary conditions). Let us briefly say
a bit more on this matter.

Definition 13.4 (Tangent vector). Let C ⊆ Rn and let x ∈ C. A vector


d ∈ Rn is called a tangent (vector) to C at x if there is a sequence {xk } in C
and a sequence {αk } in R+ such that

limk→∞ (xk − x)/αk = d.

The set of tangent vectors at x is denoted by TC (x).

TC (x) always contains the zero vector and it is a cone, meaning that it
contains each positive multiple of its vectors. Consider now problem (13.9) and
let C be the set of feasible solutions (those x satisfying all the equality and
inequality constraints).

Definition 13.5 (Linearized feasible directions). A linearized feasible direc-
tion at x ∈ C is a vector d such that
d · ∇hi (x) = 0 (i ≤ m)
d · ∇gj (x) ≤ 0 (j ∈ A(x)).
Let LFC (x) be the set of all linearized feasible directions at x.

So, if we move from x along a linearized feasible direction with a suitably


small step, then the new point is feasible if we only care about the linearized
constraints at x (the first order Taylor approximations) of each hi and each
gj for active constraints at x, i.e., those inequality constraints that hold with
equality. With this notation we have the following lemma. The proof may be
found in [11] and it involves the implicit function theorem from multivariate
calculus [8].

Lemma 13.6. Let x∗ ∈ C. Then TC (x∗ ) ⊆ LFC (x∗ ). If x∗ is a regular point,
then TC (x∗ ) = LFC (x∗ ).

The purpose of constraint qualifications is to assure that TC (x∗ ) = LFC (x∗ ).
This property is central for obtaining the necessary optimality conditions dis-
cussed above. An important example is when C is defined only by linear con-
straints, i.e., each hi and gj is a linear function. Then TC (x) = LFC (x) holds
for each x ∈ C.
For a more thorough discussion of these matters, see e.g. [11, 1].
In the remaining part of this section we discuss some examples; the main
tool is to establish the KKT conditions.
Example 13.7. Consider the one-variable problem: minimize f (x) subject to
x ≥ 0, where f : R → R is a differentiable convex function. We here let
g1 (x) = −x and m = 0. The KKT conditions then become: there is a number
µ such that f ′ (x) − µ = 0, µ ≥ 0 and µ = 0 if x > 0. This is one of the (rare)
occasions where we can eliminate the Lagrangian variable µ via the equation
µ = f ′ (x). So the optimality conditions are: x ≥ 0 (feasibility), f ′ (x) ≥ 0, and
f ′ (x) = 0 if x > 0 (x is an interior point of the domain so the derivative must
be zero), and if x = 0 we must have f ′ (0) ≥ 0.
Example 13.8. More generally, consider the problem to minimize f (x) subject
to x ≥ 0, where f : Rn → R. So here C = {x ∈ Rn : x ≥ 0} is the nonnegative
orthant. We have that gi (x) = −xi , so that ∇gi = −ei . The KKT conditions
say that −∇f (x∗ ) is a nonnegative combination of −ei for i so that xi = 0. In
other words, ∇f (x∗ ) is a nonnegative combination of ei for i so that xi = 0.
This means that
∂f (x∗ )/∂xi = 0 for all i ≤ n with x∗i > 0, and
∂f (x∗ )/∂xi ≥ 0 for all i ≤ n with x∗i = 0.
If we interpret this for n = 3 we get the following cases:

[Figure 13.3: The different possibilities for ∇f in a minimum of f , under the
constraints x ≥ 0. Panels: (a) one active constraint, (b) two active constraints,
(c) three active constraints.]

• No active constraints: This means that x, y, z > 0. The KKT-conditions


say that all partial derivatives are 0, so that ∇f (x∗ ) = 0. This is reason-
able, since these points are internal points.
• One active constraint, such as x = 0, y, z > 0. The KKT-conditions say
that ∂f (x∗ )/∂y = ∂f (x∗ )/∂z = 0, so that ∇f (x∗ ) points in the positive
direction of e1 , as shown in Figure 13.3(a).

• Two active constraints, such as x = y = 0, z > 0. The KKT-conditions


say that ∂f (x∗ )/∂z = 0, so that ∇f (x∗ ) lies in the cone spanned by
e1 , e2 , i.e. ∇f (x∗ ) lies in the first quadrant of the xy-plane, as shown in
Figure 13.3(b).

• Three active constraints: This means that x = y = z = 0. The KKT


conditions say that ∇f (x∗ ) is in the cone spanned by e1 , e2 , e3 , as shown
in Figure 13.3(c).
In all cases ∇f (x∗ ) points into a cone spanned by gradients corresponding to
the active inequalities (in general, by the cone spanned by a set of vectors we
mean the set of all linear combinations of these vectors with nonnegative coeffi-
cients). Note that for the third
case above, we are used to finding minimum values from before: if we restrict
f to values where x = y = 0, we have a one-dimensional problem where we
want to minimize g(z) = f (x, y, z), which is equivalent to finding z so that
g ′ (z) = ∂f (x∗ )/∂z = 0, as stated by the KKT-conditions.

Example 13.9. Consider a quadratic optimization problem with linear equality
constraints

minimize (1/2) xT Dx − q T x
subject to
Ax = b

where D is positive semidefinite and A ∈ Rm×n , b ∈ Rm . This is a special


case of (13.15) where f (x) = (1/2) xT Dx − q T x. Then ∇f (x) = Dx − q (see
Exercise 9.8). Thus, the KKT conditions are: there is some λ ∈ Rm such that
Dx − q + AT λ = 0. In addition, the vector x is feasible so we have Ax = b.
Thus, solving the quadratic optimization problem amounts to solving the linear
system of equations
Dx + AT λ = q, Ax = b
which may be written in block form as

[ D AT ; A 0 ] [ x ; λ ] = [ q ; b ]. (13.14)

Under the additional assumption that D is positive definite and A has full row
rank, one can show that the coefficient matrix in (13.14) is invertible so this
system has a unique solution x, λ. Thus, for this problem, we may write down
an explicit solution (in terms of the inverse of the block matrix). Numerically,
one finds x (and the Lagrangian multiplier λ) by solving the linear system
(13.14) by e.g. Gaussian elimination or some faster (direct or iterative) method.
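A small Matlab sketch of this (with made-up data for D, A, q and b, so purely an illustration) forms the block matrix in (13.14) and finds x and λ by one linear solve:

% Equality-constrained quadratic optimization via the KKT system (13.14).
% D must be positive definite and A must have full row rank.
n = 4; m = 2;
D = eye(n);
A = [1 1 1 1; 1 -1 0 2];
q = [1; 2; 3; 4];
b = [1; 0];
KKTmat = [D A'; A zeros(m)];
sol = KKTmat\[q; b];
x = sol(1:n)            % the minimizer
lambda = sol(n+1:end)   % the Lagrange multiplier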
Example 13.10. Consider an extension of the previous example by allowing
linear inequality constraints as well:

minimize (1/2) xT Dx − q T x
subject to
Ax = b
x≥0

Here D, A and b are as above. Then ∇f (x) = Dx − q and ∇gk (x) = −ek .
Thus, the KKT conditions for this problem are: there are λ ∈ Rm and µ ∈ Rn
such that Dx − q + AT λ − µ = 0, µ ≥ 0 and µk = 0 if xk > 0 (k ≤ n). We
eliminate µ from the first equation and obtain the equivalent condition: there
is a λ ∈ Rm such that Dx + AT λ ≥ q and (Dx + AT λ − q)k · xk = 0 (k ≤ n).
In addition, we have Ax = b, x ≥ 0. This problem may be solved numerically,
for instance, by a so-called active set method, see [9].
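If the Optimization Toolbox is available, such problems can also be passed directly to Matlab's quadprog. The sketch below (with the same made-up data as in the previous example, so again only an illustration) shows the correspondence with the solver's arguments; quadprog minimizes (1/2)xT Hx + f T x, so we pass H = D and f = −q:

% Example 13.10 with quadprog (requires the Optimization Toolbox).
n = 4;
D = eye(n);
A = [1 1 1 1; 1 -1 0 2];
q = [1; 2; 3; 4];
b = [1; 0];
x = quadprog(D, -q, [], [], A, b, zeros(n,1), [])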
Example 13.11. Linear optimization is a problem of the form

minimize cT x subject to Ax = b, x ≥ 0

This is a special case of the convex programming problem (13.15) where


gj (x) = −xj (j ≤ n). Here ∇f (x) = c and ∇gk (x) = −ek . Let x be a feasible
solution. The KKT conditions state that there are vectors λ ∈ Rm and µ ∈ Rn
such that c + AT λ − µ = 0, µ ≥ 0 and µk = 0 if xk > 0 (k ≤ n). Here we
eliminate µ and obtain the equivalent set of KKT conditions: there is a vector
λ ∈ Rm such that c + AT λ ≥ 0, (c + AT λ)k · xk = 0 (k ≤ n). These conditions
are the familiar optimality conditions in linear optimization theory. The vector
λ is feasible in the so-called dual problem and complementary slack holds. We
do not go into details on this here, but refer to the course INF-MAT3370 Linear
optimization where these matters are treated in detail.

13.3 Convex optimization


A convex optimization problem is to minimize a convex function f over a convex
set C in Rn . These problems are especially attractive, both from a theoretic
and algorithmic perspective.
First, let us consider some general results.

Theorem 13.12. Let f : C → R be a convex function defined on a convex


set C ⊆ Rn .
1. Then every local minimum of f over C is also a global minimum.

2. If f is continuous and C is closed, then the set of local (and therefore


global) minimum points of f over C is a closed convex set.
3. Assume, furthermore, that f : C → R is differentiable and C is open.
Let x∗ ∈ C. Then x∗ ∈ C is a local (global) minimum if and only if
∇f (x∗ ) = 0.

Proof. 1.) The proof of property 1 is exactly as the proof of the first part of
Theorem 12.4, except that we work with local and global minimum of f over C.
2.) Assume the set C ∗ of minimum points is nonempty and let α = minx∈C f (x).
Then C ∗ = {x ∈ C : f (x) ≤ α} is a convex set, see Proposition 10.5. Moreover,
this set is closed as f is continuous.
3.) This follows directly from Theorem 10.9.
Next, we consider a quite general convex optimization problem which is of
the form (13.9):

minimize f (x)
subject to
(13.15)
Ax = b
gj (x) ≤ 0 (j ≤ r)

where all the functions f and gj are differentiable convex functions, and A ∈
Rm×n and b ∈ Rm . Let C denote the feasible set of problem (13.15). Then C is a
convex set, see Proposition 10.5. A special case of (13.15) is linear optimization.
An important concept in convex optimization is duality. To briefly explain
this introduce again the Lagrangian function L : Rn × Rm × Rr+ → R given by

L(x, λ, ν) = f (x) + λT (Ax − b) + ν T G(x) (x ∈ Rn , λ ∈ Rm , ν ∈ Rr+ )

Remark: we use the variable name ν here instead of the µ used before
because of another parameter µ to be used soon. Note that we require ν ≥ 0.
Define the new function g : Rm × Rr+ → R̄ by

g(λ, ν) = inf_x L(x, λ, ν)

Note that this infimum may sometimes be equal to −∞ (meaning that the
function x → L(x, λ, ν) is unbounded below). The function g is the pointwise
infimum of a family of affine functions in (λ, ν), one function for each x, and
this implies that g is a concave function. We are interested in g due to the
following fact, which is easy to prove. It is usually referred to as weak duality.

Lemma 13.13. Let x be feasible in problem (13.15) and let λ ∈ Rm , ν ∈ Rr


where ν ≥ 0. Then
g(λ, ν) ≤ f (x).

Proof. For λ ∈ Rm , ν ∈ Rr with ν ≥ 0 and x feasible in problem (13.15) we


have
g(λ, ν) ≤ L(x, λ, ν)
= f (x) + λT (Ax − b) + ν T G(x)
≤ f (x)
as Ax = b, ν ≥ 0 and G(x) ≤ 0.
Thus, g(λ, ν) provides a lower bound on the optimal value in (13.15). It is
natural to look for a best possible such lower bound and this is precisely the
so-called dual problem which is

maximize g(λ, ν)
subject to (13.16)
ν ≥ 0.

Actually, in this dual problem, we may further restrict the attention to those
(λ, ν) for which g(λ, ν) is finite.
The original problem (13.15) will be called the primal problem. It follows
from Lemma 13.13 that
g∗ ≤ f ∗
where f ∗ denotes the optimal value in the primal problem and g ∗ the optimal
value in the dual problem. If g ∗ < f ∗ , we say that there is a duality gap. Note
that the derivation above, and weak duality, holds for arbitrary functions f and
gj (j ≤ r). The concavity of g also holds generally.
The dual problem is useful when the dual objective function g may be com-
puted efficiently, either analytically or numerically. Duality provides a powerful
method for proving that a solution is optimal or, possibly, near-optimal. If we
have a feasible x in (13.15) and we have found a dual solution (λ, ν) with ν ≥ 0
such that
f (x) = g(λ, ν) + ε
for some ε (which then has to be nonnegative), then we can conclude that x is
“nearly optimal”, it is not possible to improve f by more than ε. Such a point
x is sometimes called ε-optimal, where the case ε = 0 means optimal.
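As a tiny worked illustration (a standard example, not taken from the problems above), consider minimizing f (x) = x2 subject to the single constraint g1 (x) = 1 − x ≤ 0, so that f ∗ = 1. The Lagrangian is L(x, ν) = x2 + ν(1 − x), and minimizing over x gives x = ν/2, hence g(ν) = ν − ν 2 /4. The dual problem max{g(ν) : ν ≥ 0} is solved by ν = 2 with g ∗ = 1. So g ∗ = f ∗ here, and the feasible point x = 1 together with ν = 2 certifies optimality in the sense just described.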
So, how good is this duality approach? For convex problems it is often
perfect as the next theorem says. We omit most of the proof, see [5, 1, 14].
For nonconvex problems one should expect a duality gap. Recall that G′ (x)
denotes the Jacobi matrix of G = (g1 , g2 , . . . , gr ) at x.

Theorem 13.14. Consider convex optimization problem (13.15) and assume


this problem has a feasible point satisfying

gj (x′ ) < 0 (j ≤ r).

Then f ∗ = g ∗ , so there is no duality gap. Moreover, x is a (local and global )


minimum in (13.15) if and only if there are λ ∈ Rm and ν ∈ Rr with ν ≥ 0
and
∇f (x) + AT λ + G′ (x)T ν = 0
and
νj gj (x) = 0 (j ≤ r).

Proof. We only prove the second part (see the references above). So assume
that f ∗ = g ∗ and the infimum and supremum are attained in the primal and
dual problems, respectively. Let x be a feasible point in the primal problem.

Then x is a minimum in the primal problem if and only if there are λ ∈ Rm
and ν ∈ Rr such that all the inequalities in the proof of Lemma 13.13 hold
with equality. This means that g(λ, ν) = L(x, λ, ν) and ν T G(x) = 0. But
L(x, λ, ν) is convex in x so it is minimized by x if and only if its gradient is
the zero vector, i.e., ∇f (x) + AT λ + G′ (x)T ν = 0. This leads to the desired
characterization.

The assumption stated in the theorem, that gj (x′ ) < 0 for each j, is called
the weak Slater condition.
Finally, we mention a theorem on convex optimization which is used in
several applications.

Theorem 13.15. Let x∗ ∈ C. Then x∗ is a (local and therefore global)


minimum of f over C if and only if

∇f (x∗ )T (x − x∗ ) ≥ 0 for all x ∈ C. (13.17)

Proof. Assume first that ∇f (x∗ )T (x − x∗ ) < 0 for some x ∈ C. Consider the
function g(ε) = f (x∗ + ε(x − x∗ )) and apply the mean value theorem to this
function. Thus, for every ε > 0 there exists an s ∈ [0, 1] with

f (x∗ + ε(x − x∗ )) = f (x∗ ) + ε∇f (x∗ + sε(x − x∗ ))T (x − x∗ ).

Since ∇f (x∗ )T (x − x∗ ) < 0 and the gradient function is continuous (our stan-
dard assumption!) we have for sufficiently small ε > 0 that ∇f (x∗ + sε(x −
x∗ ))T (x − x∗ ) < 0. This implies that f (x∗ + ε(x − x∗ )) < f (x∗ ). But, as C is
convex, the point x∗ + ε(x − x∗ ) also lies in C and so we conclude that x∗ is
not a local minimum. This proves that (13.17) is necessary for x∗ to be a local
minimum of f over C.
Next, assume that (13.17) holds. Using Theorem 10.9 we then get

f (x) ≥ f (x∗ ) + ∇f (x∗ )T (x − x∗ ) ≥ f (x∗ ) for every x ∈ C

so x∗ is a (global) minimum.

Exercises for section 13.3

Ex. 1 — In the plane consider a rectangle R with sides of length x and y and
with perimeter equal to α (so 2x + 2y = α). Determine x and y so that the area
of R is largest possible.

Ex. 2 — Consider the optimization problem

minimize f (x1 , x2 ) subject to (x1 , x2 ) ∈ C

where C = {(x1 , x2 ) ∈ R2 : x1 , x2 ≥ 0, 4x1 + x2 ≥ 8, 2x1 + 3x2 ≤ 12}.
Draw the feasible set C in the plane. Find the set of optimal solutions in each
of the cases given below.
a. f (x1 , x2 ) = 1.
b. f (x1 , x2 ) = x1 .
c. f (x1 , x2 ) = 3x1 + x2 .
d. f (x1 , x2 ) = (x1 − 1)2 + (x2 − 1)2 .
e. f (x1 , x2 ) = (x1 − 10)2 + (x2 − 8)2 .

Ex. 3 — Solve
max{x1 x2 · · · xn : ∑_{j=1}^n xj = 1}.

Ex. 4 — Let S = {x ∈ R2 : ‖x‖ = 1} be the unit circle in the plane. Let


a ∈ R2 be a given point. Formulate the problem of finding a nearest point in
S to a as a nonlinear optimization problem. How can you solve this problem
directly using a geometrical argument?

Ex. 5 — Let S be the unit circle as in the previous exercise. Let a1 , a2 be
two given points in the plane. Let f (x) = ∑_{i=1}^2 ‖x − ai ‖2 . Formulate this as an
optimization problem and find its Lagrangian function L. Find the stationary
points of L, and use this to solve the optimization problem.

Ex. 6 — Solve

minimize x1 + x2 subject to x21 + x22 = 1.

using the Lagrangian, see Theorem 13.1. Next, solve the problem by eliminating
x2 (using the constraint).

Ex. 7 — Let g(x1 , x2 ) = 3x21 + 10x1 x2 + 3x22 − 2. Solve

min{‖(x1 , x2 )‖ : g(x1 , x2 ) = 0}.

Ex. 8 — Same question as in previous exercise, but with g(x1 , x2 ) = 5x21 −


4x1 x2 + 4x22 − 6.

Ex. 9 — Let f be a two times differentiable function f : Rn → R. Consider
the optimization problem

minimize f (x) subject to x1 + x2 + · · · + xn = 1.

Characterize the stationary points (find the equation they satisfy).

Ex. 10 — Consider the previous exercise. Explain how to convert this into
an unconstrained problem by eliminating xn . Find an

Ex. 11 — Let A be a real symmetric n × n matrix. Consider the optimization


problem
max{xT Ax : ‖x‖ = 1}
Rewrite the constraint as −‖x‖ = −1 and show that an optimal solution of this
problem must be an eigenvector of A. What can you say about the Lagrangian
multiplier?

Ex. 12 — Solve

min{(1/2)(x21 + x22 + x23 ) : x1 + x2 + x3 ≤ −6}.

Hint: Use KKT and discuss depending on whether the constraint is active or
not.

Ex. 13 — Solve

min{(x1 − 3)2 + (x2 − 5)2 + x1 x2 : 0 ≤ x1 , x2 ≤ 1}.

Ex. 14 — Solve
min{x1 + x2 : x21 + x22 ≤ 2}.

Ex. 15 — Use Theorem 13.15 to find optimality conditions for the convex
optimization problem
min{f (x1 , x2 , . . . , xn ) : xj ≥ 0 (j ≤ n), ∑_{j=1}^n xj ≤ 1}

where f : Rn → R is a differentiable convex function.

Chapter 14

Constrained optimization -
methods

In this final chapter we present numerical methods for solving nonlinear opti-
mization problems. This is a huge area, so we can here only give a small taste
of it! The algorithms we present are known good methods.

14.1 Equality constraints


We here consider the nonlinear optimization problem with linear equality con-
straints

minimize f (x)
subject to (14.1)
Ax = b

Newton’s method may be applied to this problem. The method is very


similar to the unconstrained case, but with two modifications. First, the initial
point x0 must be chosen so that it is feasible, i.e., Ax0 = b. Next, the search
direction d must be such that the new iterate is feasible as well. This means
that Ad = 0, so the search direction lies in the nullspace of A.
The second order Taylor approximation of f at an iterate xk is

Tf2 (xk ; xk + h) = f (xk ) + ∇f (xk )T h + (1/2)hT ∇2 f (xk )h

and we want to minimize this w.r.t. h subject to the constraint

A(xk + h) = b

This is a quadratic optimization problem in h with a linear equality constraint
(Ah = 0) as in Example 13.9. The KKT conditions for this problem are thus

[ ∇2 f (xk ) AT ; A 0 ] [ h ; λ ] = [ −∇f (xk ) ; 0 ]

where λ is the Lagrange multiplier. The Newton step is only defined when the
coefficient matrix in the KKT problem is invertible. In that case, the problem
has a unique solution (h, λ) and we define dN t = h and call this the Newton
step.
Newton’s method for solving (14.1) may now be described as follows. Again
ε > 0 is a small stopping criterion.

Newton’s method for linear equality constrained optimization:


1. Choose an initial point x0 satisfying Ax0 = b and let x = x0 .
2. repeat
(i) Compute the Newton step dN t and η := dTN t ∇2 f (x)dN t .
(ii) If η 2 /2 < ε: stop.
(iii) Use backtracking line search to find step size α
(iv) Update x := x + αdN t

This leads to an algorithm for Newton’s method for linear equality con-
strained optimization which is very similar to the function newtonbacktrack
from Exercise 12.2.10. We do not state a formal convergence theorem for this
method, but it behaves very much like Newton’s method for unconstrained op-
timization. Actually, it can be seen that the method just described corresponds
to eliminating variables based on the equations Ax = b and using the uncon-
strained Newton method for the resulting (smaller) problem. So as soon as
the solution is “sufficiently near” an optimal solution, the convergence rate is
quadratic, so extremely few iterations are needed in this final stage.
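A minimal sketch of such a function (our own illustration, written in the style of newtonbacktrack and using the function armijorule from Exercise 12.2.9) could look like this:

function [x,numit]=newtonbacktrackLEC(f,df,d2f,A,b,x0)
% Newton's method for minimizing f subject to A*x = b.
% The initial point x0 must be feasible, i.e. A*x0 = b.
epsilon=10^(-3);
x=x0;
maxit=100;
for numit=1:maxit
    % Solve the KKT system for the Newton step d (and the multiplier).
    matr=[d2f(x) A'; A zeros(size(A,1))];
    vect=[-df(x); zeros(size(A,1),1)];
    solvedvals=matr\vect;
    d=solvedvals(1:size(A,2));
    eta=d'*d2f(x)*d;
    if eta^2/2<epsilon
        break;
    end
    alpha=armijorule(f,df,x,d);   % backtracking line search
    x=x+alpha*d;
end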

14.2 Inequality constraints


We here briefly discuss an algorithm for inequality constrained nonlinear opti-
mization problems. The presentation is mainly based on [2, 11]. We restrict the
attention to convex optimization problems, but many of the ideas are used for
nonconvex problems as well.
The method we present is an interior-point method, more precisely, an interior-
point barrier method. This is an iterative method which produces a sequence
of points lying in the relative interior of the feasible set. The barrier idea is
to approximate the problem by a simpler one in which constraints are replaced
by a penalty term. The purpose of this penalty term is to give large objective

function values to points near the (relative) boundary of the feasible set, which
effectively becomes a barrier against leaving the feasible set.
Consider again the convex optimization problem

minimize f (x)
subject to
(14.2)
Ax = b
gj (x) ≤ 0 (j ≤ r)

where A is an m × n matrix and b ∈ Rm . The feasible set here is F = {x ∈
Rn : Ax = b, gj (x) ≤ 0 (j ≤ r)}. We assume that the weak Slater condition
holds, and therefore by Theorem 13.14 the KKT conditions for problem (14.2)
are
Ax = b, gj (x) ≤ 0 (j ≤ r)
ν ≥ 0, ∇f (x) + AT λ + G′ (x)T ν = 0 (14.3)
νj gj (x) = 0 (j ≤ r).
So, x is a minimum in (14.2) if and only if there are λ ∈ Rm and ν ∈ Rr such
that (14.3) holds.
Let us state an algorithm for Newton’s method for linear equality constrained
optimization with inequality constraints. Before we do this there is one final
problem we need to address: the α we get from backtracking line search may be
such that x + αdN t does not satisfy the inequality constraints (in the exercises you
will be asked to verify that this is the case for a certain function). The problem
comes from the fact that the iterates xk + β m sdk from Armijo’s rule do not necessarily
satisfy the inequality constraints. However, we can choose m large enough so
that all succeeding iterates satisfy these constraints. We can reimplement the
function armijorule to address this as follows:
function alpha=armijoruleg1g2(f,df,x,d,g1,g2)
beta=0.2; s=0.5; sigma=10^(-3);
m=0;
while (g1(x+beta^m*s*d)>0 || g2(x+beta^m*s*d)>0)
m=m+1;
end
while (f(x)-f(x+beta^m*s*d) < -sigma *beta^m*s *(df(x))'*d)
m=m+1;
end
alpha = beta^m*s;

Here g1 and g2 are function handles which represent the inequality constraints,
and we have added a first loop, which ensures that m is so large that the in-
equality constraints are satisfied. The rest of the code is as in the function
armijorule. After this we can also modify the function newtonbacktrack
from Exercise 12.2.10 to a function newtonbacktrackg1g2 in the obvious way,
so that the inequality constraints are passed to armijoruleg1g2:

function [x,numit]=newtonbacktrackg1g2LEC(f,df,d2f,A,b,x0,g1,g2)
epsilon=10^(-3);
x=x0;
maxit=100;
for numit=1:maxit
matr=[d2f(x) A'; A zeros(size(A,1))];
vect=[-df(x); zeros(size(A,1),1)];
solvedvals=matr\vect;
d=solvedvals(1:size(A,2));
eta=d'*d2f(x)*d;
if eta^2/2<epsilon
break;
end
alpha=armijoruleg1g2(f,df,x,d,g1,g2);
x=x+alpha*d;
end

Both these functions work in all cases where there are exactly two inequality
constraints.
The interior-point barrier method is based on an approximation of problem
(14.2) by the barrier problem

minimize f (x) + µφ(x)


subject to (14.4)
Ax = b

where
φ(x) = −∑_{j=1}^r ln(−gj (x))

and µ > 0 is a parameter (in R). The function φ is called the (logarithmic)
barrier function and its domain is the relative interior of the feasible set

F ◦ = {x ∈ Rn : Ax = b, gj (x) < 0 (j ≤ r)}.

The same set F ◦ is the feasible set of the barrier problem. The key properties
of the barrier function are:
• φ is concave, i.e. −φ is a convex function. This may be shown from the
definition using that gj is convex and the fact that the logarithm function
is concave and increasing.
• If {xk } is a sequence in F ◦ such that gj (xk ) → 0 for some j ≤ r, then
φ(xk ) → ∞. This is the barrier property.

• φ is twice differentiable and

∇φ(x) = ∑_{j=1}^r (1/(−gj (x))) ∇gj (x) (14.5)

and

∇2 φ(x) = ∑_{j=1}^r (1/gj (x)2 ) ∇gj (x)∇gj (x)T + ∑_{j=1}^r (1/(−gj (x))) ∇2 gj (x) (14.6)

The idea here is that for points x near the boundary of F the value of φ(x) is
very large. So, an iterative method which moves around in the interior F ◦ of F
will typically avoid points near the boundary as the logarithmic penalty term
makes the function value f (x) + µφ(x) very large.
The interior point method consists in solving the barrier problem, using
Newton’s method, for a sequence {µk } of (positive) barrier parameters; these
are called the outer iterations. The solution xk found for µ = µk is used as the
starting point in Newton’s method in the next outer iteration where µ = µk+1 .
The sequence {µk } is chosen such that µk → 0. When µ is very small, the
barrier function approximates the "ideal" penalty function η(x) which is zero
in F and −∞ when one of the inequalities gj (x) ≤ 0 is violated.
A natural question is why one bothers to solve the barrier problems for more
than one single µ, typically a very small value. The reason is that it would be
hard to find a good starting point for Newton’s method in that case; the Hessian
matrix of µφ is typically ill-conditioned for small µ.
Assume now that the barrier problem has a unique optimal solution x(µ);
this is true under reasonable assumptions that we shall return to. The point
x(µ) is called a central point. Assume also that Newton’s method may be
applied to solve the barrier problem. The set of points x(µ) for µ > 0 is called
the central path; it is a path (or curve) as we know it from multivariate calculus.
In order to investigate the central path we prefer to work with the equivalent
problem1 to (14.4) obtained by multiplying the objective function by 1/µ, so

minimize (1/µ)f (x) + φ(x)


subject to (14.7)
Ax = b.

A central point x(µ) is characterized by

    Ax(µ) = b,    g_j(x(µ)) < 0  (j ≤ r),

together with the existence of λ ∈ R^m (the Lagrange multiplier vector) such that

    (1/µ)∇f(x(µ)) + ∇φ(x(µ)) + A^T λ = 0,

i.e.,

    (1/µ)∇f(x(µ)) + ∑_{j=1}^{r} (1/(−g_j(x(µ)))) ∇g_j(x(µ)) + A^T λ = 0.      (14.8)

A fundamental question is: how far from being optimal is the central point
x(µ)? We now show that duality provides a very elegant way of answering this
question.

Theorem 14.1. For each µ > 0 the central point x(µ) satisfies

f ∗ ≤ f (x(µ)) ≤ f ∗ + rµ.

Proof. Define ν(µ) = (ν_1(µ), . . . , ν_r(µ)) ∈ R^r and λ(µ) ∈ R^m by

    ν_j(µ) = −µ/g_j(x(µ))  (j ≤ r),    λ(µ) = µλ,                   (14.9)

where λ is the multiplier vector from (14.8).

We want to show that the pair (λ(µ), ν(µ)) is a feasible solution in the dual
problem to (14.2), see Section 13.3. So there are two properties to verify, that
ν(µ) is nonnegative and that x(µ) minimizes the Lagrangian function for the
given (λ(µ), ν(µ)). The first property is immediate: as gj (x(µ)) < 0 and µ > 0,
we get νj (µ) = −µ/gj (x(µ)) > 0 for each j. Concerning the second property,
note first that the Lagrangian function L(x, λ, ν) = f(x) + λ^T(Ax − b) + ν^T G(x)
is convex in x for given λ and ν ≥ 0. Thus, x minimizes this function if and
only if ∇_x L = 0. Now,

    ∇_x L(x(µ), λ(µ), ν(µ)) = ∇f(x(µ)) + A^T λ(µ) + ∑_{j=1}^{r} ν_j(µ) ∇g_j(x(µ)) = 0

by (14.8) and the definition of the dual variables (14.9). This shows that
(λ(µ), ν(µ)) is a feasible solution to the dual problem.
By the weak duality lemma (Lemma 13.13) we therefore obtain

    f* ≥ g(λ(µ), ν(µ)) = L(x(µ), λ(µ), ν(µ))
       = f(x(µ)) + λ(µ)^T (Ax(µ) − b) + ∑_{j=1}^{r} ν_j(µ) g_j(x(µ))
       = f(x(µ)) − rµ,

where the last equality uses that Ax(µ) = b and that ν_j(µ)g_j(x(µ)) = −µ for each j.
This proves the upper bound f(x(µ)) ≤ f* + rµ. The lower bound f* ≤ f(x(µ)) holds
simply because x(µ) is a feasible point of (14.2).


This theorem is very useful and shows why letting µ → 0 (more accurately
µ → 0+ ) is a good idea.
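As a small illustration (our own, not from the text), consider minimizing f(x) = x
over x ≥ 0, i.e. with the single inequality constraint g_1(x) = −x ≤ 0, no equality
constraints, r = 1 and f* = 0. The barrier problem is to minimize x − µ ln x, and
condition (14.8) reduces to 1/µ − 1/x(µ) = 0, so the central path is x(µ) = µ. Each
central point is feasible, f(x(µ)) = µ → f* as µ → 0, and f(x(µ)) − f* = µ = rµ, so
the bound in Theorem 14.1 is attained and cannot be improved in general.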

Corollary 14.2. The central path has the following property:

    lim_{µ→0} f(x(µ)) = f*.

In particular, if f is continuous and lim_{µ→0} x(µ) = x* for some x*, then x*
is a global minimum in (14.2).

Proof. This follows from Theorem 14.1 by letting µ → 0. The second part
follows from

    f(x*) = f(lim_{µ→0} x(µ)) = lim_{µ→0} f(x(µ)) = f*

by the first part and the continuity of f; moreover, x* must be a feasible point
by elementary topology.
After these considerations we may now present the interior-point barrier
method. It uses a tolerance ε > 0 in its stopping criterion; the point of the criterion
is that, by Theorem 14.1, rµ bounds the gap f(x(µ)) − f*, so the iteration continues
until this bound is small.

Interior-point barrier method:

1. Choose an initial point x = x0 in F°, an initial barrier parameter µ = µ0 > 0 and a reduction factor α ∈ (0, 1).
2. while rµ > ε do
   (i) (Centering step) Using initial point x, find the solution x(µ) of (14.4).
   (ii) (Update) x := x(µ).
   (iii) (Decrease µ) µ := αµ.
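
As a quick check of the stopping rule (our own arithmetic, not from the text): with
the values used in the code below, µ0 = 1, α = 0.1, r = 2 and ε = 10⁻³, the test
rµ > ε passes for µ = 1, 0.1, 0.01 and 10⁻³ and fails for µ = 10⁻⁴, so the method
performs four outer iterations, and by Theorem 14.1 the last centering step (at
µ = 10⁻³) returns, up to the accuracy of Newton's method, a point whose value is
within rµ = 2·10⁻³ of f*. This matches the four outer iterations reported in
Example 14.3.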

This leads to the following implementation of the interior-point barrier method
for the case of linear equality constraints and two inequality constraints:
function xopt=IPBopt(f,g1,g2,df,dg1,dg2,d2f,d2g1,d2g2,A,b,x0)
  % Interior-point barrier method for minimizing f subject to Ax=b and the
  % two inequality constraints g1(x)<=0, g2(x)<=0.
  xopt=x0;
  mu=1;
  alpha=0.1;
  r=2;
  epsilon=10^(-3);
  numitouter=0;
  while (r*mu>epsilon)
    % Centering step: minimize f + mu*phi subject to Ax=b with Newton's
    % method, starting from the previous central point. The gradient and
    % Hessian of the barrier term are inserted from (14.5) and (14.6).
    [xopt,numit]=newtonbacktrackg1g2LEC(...
        @(x)(f(x)-mu*log(-g1(x))-mu*log(-g2(x))),...
        @(x)(df(x) - mu*dg1(x)/g1(x) - mu*dg2(x)/g2(x)),...
        @(x)(d2f(x) + mu*dg1(x)*dg1(x)'/(g1(x)^2) ...
             + mu*dg2(x)*dg2(x)'/(g2(x)^2) - mu*d2g1(x)/g1(x)...
             - mu*d2g2(x)/g2(x) ),A,b,xopt,g1,g2);
    mu=alpha*mu;
    numitouter=numitouter+1;
    fprintf('Iteration %i:',numitouter);
    fprintf('(%f,%f)\n',xopt,f(xopt));
  end

Note that we have inserted the expressions from Equation (14.5) and Equation
(14.6) for the gradient and the Hessian matrix of the barrier function. The inputs
are f, g1, g2, their gradients and their Hessian matrices, the matrix A, the vector
b, and an initial strictly feasible point x0. The function calls newtonbacktrackg1g2LEC
and returns the optimal solution x*. It also prints some information on the values
of f during the iterations. The iterations used in Newton's method are called the
inner iterations. There are several implementation details here that we do not
discuss in depth. A typical value of α is 0.1. The choice of the initial µ0 can
be difficult: if it is chosen too large, one may need many outer iterations.
Another issue is how accurately one solves (14.4); it may be sufficient to find
a near-optimal solution, as this saves inner iterations. For this reason the
method is also called a path-following method: it follows the central path
approximately, staying in a neighborhood of it.
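
As an illustration of how IPBopt can be called (the test problem and numbers below
are our own, not from the text), consider minimizing x1² + x2² subject to x1 + x2 = 1
and x1, x2 ≥ 0, i.e. g1(x) = −x1 and g2(x) = −x2. The minimum (1/2, 1/2) lies
strictly inside the region defined by the inequality constraints, and (0.3, 0.7) is a
strictly feasible starting point:

IPBopt(@(x)(x(1)^2+x(2)^2),@(x)(-x(1)),@(x)(-x(2)),...
       @(x)([2*x(1);2*x(2)]),@(x)([-1;0]),@(x)([0;-1]),...
       @(x)(2*eye(2)),@(x)(zeros(2)),@(x)(zeros(2)),...
       [1 1],1,[0.3;0.7])

One would expect the returned point to be close to (0.5, 0.5).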
Finally, it should be mentioned that there exists a variant of the interior-
point barrier method which permits an infeasible starting point. For more de-
tails on this and various implementation issues one may consult [2] or [11].
Example 14.3. Consider the function f(x) = x² + 1, 2 ≤ x ≤ 4. Minimizing f
can be considered as the problem of finding a minimum subject to the constraints
g1(x) = 2 − x ≤ 0 and g2(x) = x − 4 ≤ 0. The barrier problem is to minimize
the function

    f(x) + µφ(x) = x² + 1 − µ ln(x − 2) − µ ln(4 − x).

Some of these barrier functions are drawn in Figure 14.1, where we can clearly see
the effect of decreasing µ: the barrier objective f + µφ converges to f pointwise on
the open interval (2, 4), while it tends to +∞ at the endpoints. It is easy to see that
x = 2 is the minimum of f under the given constraints, and that f(2) = 5 is the
minimum value. There are no equality constraints in this case, so we can use the
barrier method with Newton's method for unconstrained optimization, as implemented
in Exercise 12.2.10. We need, however, to make sure also here that the iterates
from Armijo's rule satisfy the inequality constraints. In fact, in the exercises
you will be asked to verify that, for the function f considered here, some of the
iterates from Armijo's rule do not satisfy the constraints.
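
The panels of Figure 14.1 can be reproduced with a few lines of MATLAB; the
plotting sketch below is our own and not part of the text:

% Plot f together with the barrier objectives for some values of mu,
% staying strictly inside the interval (2,4).
x = linspace(2.01,3.99,500);
mus = [0.2 0.5 1];
plot(x,x.^2+1); hold on;
for mu = mus
    plot(x,x.^2+1-mu*log(x-2)-mu*log(4-x));
end
legend('f(x)','\mu=0.2','\mu=0.5','\mu=1');
hold off;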
It is straightforward to implement a function newtonbacktrackg1g2 which
implements Newton's method for two inequality constraints and no equality
constraints (this can follow the implementation of the function newtonbacktrack
from Exercise 12.2.10 and use the function armijoruleg1g2, just as the function
newtonbacktrackg1g2LEC does; a possible sketch is given at the end of this
example). This leads to the following implementation of the interior-point barrier
method for the case of no equality constraints, but two inequality constraints:

(Figure panels: (a) f(x); (b) barrier problem with µ = 0.2; (c) barrier problem with µ = 0.5; (d) barrier problem with µ = 1.)

Figure 14.1: The function from Example 14.3 and some of its barrier functions.

function xopt=IPBopt2(f,g1,g2,df,dg1,dg2,d2f,d2g1,d2g2,x0)
  % Interior-point barrier method with two inequality constraints
  % g1(x)<=0, g2(x)<=0 and no equality constraints.
  xopt=x0;
  mu=1; alpha=0.1; r=2; epsilon=10^(-3);
  numitouter=0;
  while (r*mu>epsilon)
    % Centering step: unconstrained Newton on f + mu*phi, using (14.5)
    % and (14.6) for the gradient and Hessian of the barrier term.
    [xopt,numit]=newtonbacktrackg1g2(...
        @(x)(f(x)-mu*log(-g1(x))-mu*log(-g2(x))),...
        @(x)(df(x) - mu*dg1(x)/g1(x) - mu*dg2(x)/g2(x)),...
        @(x)(d2f(x) + mu*dg1(x)*dg1(x)'/(g1(x)^2) ...
             + mu*dg2(x)*dg2(x)'/(g2(x)^2) ...
             - mu*d2g1(x)/g1(x) - mu*d2g2(x)/g2(x) ),xopt,g1,g2);
    mu=alpha*mu;
    numitouter=numitouter+1;
    fprintf('Iteration %i:',numitouter);
    fprintf('(%f,%f)\n',xopt,f(xopt));
  end

Note that this function also prints a summary for each of the outer iterations,
so that we can follow the progress of the barrier method. We can now find the
minimum of f with the following code, where we have substituted MATLAB
function handles for f, the gi, their gradients, and their Hessian matrices.
IPBopt2(@(x)(x.^2+1),@(x)(2-x),@(x)(x-4),...
@(x)(2*x),@(x)(-1),@(x)(1),...
@(x)(2),@(x)(0),@(x)(0),3)

Running this code gives a good approximation to the minimum x = 2 after
4 outer iterations.
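
For reference, here is a minimal sketch of how the function newtonbacktrackg1g2
used above could look. This sketch is our own (the example only describes the
function), and it simply mirrors newtonbacktrackg1g2LEC with the equality-constraint
part removed:

function [x,numit]=newtonbacktrackg1g2(f,df,d2f,x0,g1,g2)
  % Sketch: Newton's method with backtracking, no equality constraints,
  % step lengths from armijoruleg1g2 so that g1(x)<0, g2(x)<0 are kept.
  epsilon=10^(-3);
  x=x0;
  maxit=100;
  for numit=1:maxit
    d=d2f(x)\(-df(x));       % Newton direction
    eta=d'*d2f(x)*d;
    if eta^2/2<epsilon
      break;
    end
    alpha=armijoruleg1g2(f,df,x,d,g1,g2);
    x=x+alpha*d;
  end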

Exercises for Section 14.2

Ex. 1 — Consider problem (14.1) in Section 14.1. Verify that the KKT con-
ditions for this problem are as stated there.

Ex. 2 — Define the function f(x, y) = x + y. We will attempt to minimize f
under the constraints y − x = 1 and x, y ≥ 0.
a. Find A, b, and functions g1 , g2 so that the problem takes the same form
as in Equation (14.2).
b. Draw the contours of the barrier function f (x, y) + µφ(x, y) for µ =
0.1, 0.2, 0.5, 1, where φ(x, y) = − ln(−g1 (x, y)) − ln(−g2 (x, y)).
c. Solve the barrier problem analytically using the Lagrange method.
d. It is straightforward to find the minimum of f under the mentioned
constraints. State a simple argument for finding this minimum.
e. State the KKT conditions for finding the minimum, and solve these.

f. Show that the central path converges to the same solution as the one you
found in d. and e.

Ex. 3 — Use the function IPBopt to verify the solution you found in Exer-
cise 2. Initially you must compute a feasible starting point x0 .

Ex. 4 — State the KKT conditions for finding the minimum of the constrained
problem of Example 14.3, and solve these. Verify that you get the
same solution as in Example 14.3.

Ex. 5 — In the function IPBopt2, replace the call to the function newtonbacktrackg1g2
with a call to the function newtonbacktrack, with the obvious modification to
the parameters. Verify that the code does not return the expected minimum in
this case.

Ex. 6 — Consider the function f (x) = (x−3)2 , with the same constraints 2 ≤
x ≤ 4 as in Example 14.3. Verify in this case that the function IPBopt2 returns
the correct minimum regardless of whether you call newtonbacktrackg1g2 or
newtonbacktrack. This shows that, at least in some cases where the minimum
is an interior point, the iterates from Newton's method satisfy the inequality
constraints as well.

