Part 3

Nonlinear optimization
Chapter 9
Due to its character this chapter is a “proof-free zone”, but in the remaining
text we usually give full proofs of the main results.
Notation: For z ∈ Rn and δ > 0 define the (closed) ball B̄(z; δ) = {x ∈ Rn :
‖x − z‖ ≤ δ}. It consists of all points with distance at most δ from z. Similarly,
define the open ball B(z; δ) = {x ∈ Rn : ‖x − z‖ < δ}. A neighborhood
of z is a set N containing B(z; δ) for some δ > 0. Vectors are treated as
column vectors and they are identified with the corresponding n-tuple, denoted
by x = (x1, x2, . . . , xn). A statement like

P(x)   (x ∈ H)

means that the statement P(x) holds for every x in the set H.
Example 9.1. To make these things concrete, consider an example from plane
geometry. Consider the point set C = {(z1 , z2 ) : z1 ≥ 0, z2 ≥ 0, z1 + z2 ≤ 1} in
the plane. We want to find a point x = (x1, x2) ∈ C which is as close as possible
to the point a = (3, 2). This can be formulated as the minimization problem

minimize (x1 − 3)2 + (x2 − 2)2 subject to (x1, x2) ∈ C.

The function we want to minimize is f(x) = (x1 − 3)2 + (x2 − 2)2, which is a
quadratic function. This is the square of the distance between x and a; and
minimizing the distance or the square of the distance is equivalent (why?). A
minimum here is x∗ = (1, 0). If we instead minimize this function f over R2 ,
the unique global minimum is x∗ = a = (3, 2). It is useful to study this example
and try to solve it geometrically as well as analytically.
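A short analytic solution: the point a = (3, 2) lies on the far side of the line x1 + x2 = 1, so the nearest point of C to a lies on the edge where x1 + x2 = 1. Substituting x2 = 1 − x1 gives g(x1) = (x1 − 3)2 + (x1 + 1)2, with g′(x1) = 2(x1 − 3) + 2(x1 + 1) = 4x1 − 4, so g′(x1) = 0 for x1 = 1, i.e., x2 = 0. Since (1, 0) lies in C, this confirms that x∗ = (1, 0).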
Theorem 9.2. Let C be a subset of Rn which is closed and bounded, and let
f : C → R be a continuous function.
Then f attains both its (global) minimum and maximum, that is, there are
points x1, x2 ∈ C with f(x1) ≤ f(x) ≤ f(x2) for all x ∈ C.
9.2.1 Portfolio optimization
The following optimization problem was introduced by Markowitz in order to
find an optimal portfolio in a financial market; he later received the Nobel prize
in economics (in 1990) for his contributions in this area:
minimize α ∑_{i,j≤n} cij xi xj − ∑_{j=1}^{n} µj xj
subject to
∑_{j=1}^{n} xj = 1
xj ≥ 0 (j ≤ n).

Here the function to be minimized is

f(x) = α ∑_{i,j≤n} cij xi xj − ∑_{j=1}^{n} µj xj.
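As a small illustration, the objective can be evaluated numerically; the covariance data cij, the expected returns µj and the weight α below are made-up numbers chosen only for the example:

% Evaluate the Markowitz objective for a hypothetical 3-asset example
C = [0.10 0.02 0.01; 0.02 0.08 0.03; 0.01 0.03 0.12];   % covariance matrix (made up)
mu = [0.05; 0.07; 0.06];                                 % expected returns (made up)
alpha = 2;                                               % weight on risk
f = @(x) alpha*(x'*C*x) - mu'*x;                         % the objective function above
x = [0.5; 0.3; 0.2];                                     % feasible: nonnegative, sums to 1
f(x)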
9.2.2 Fitting a model
In many applications one has a mathematical model of some phenomenon where
the model has some parameters. These parameters represent a flexibility of the
model, and they may be adjusted so that the model explains the phenomenon
best possible.
To be more specific, consider a model

y = Fα(x)

relating an input x to an output y, where α is a parameter. Observed data will
in general not fit the model exactly, so one rather has

y = Fα(x) + error

for the observed data points

(xi, yi) (i = 1, 2, . . . , m).

One then determines α by minimizing the total (squared) model error

∑_{i=1}^{m} (yi − Fα(xi))²

over the allowable parameters α. The optimization variable is the parameter α. Here the model error is quadratic
(corresponding to the Euclidean norm), but other norms are also used.
This optimization problem above is a constrained nonlinear optimization
problem. When the function Fα depends linearly on α, which often is the
case in practice, the problem becomes the classical least squares approxima-
tion problem which is treated in basic linear algebra courses. The solution is
then characterized by a certain linear system of equations, the so-called normal
equations.
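In the linear case a small sketch could look as follows; the simple model y ≈ α1 + α2 x and the data are made up for illustration, and the normal equations are solved directly:

% Least squares fit of y ≈ alpha(1) + alpha(2)*x via the normal equations
xdata = [0.0; 0.5; 1.0; 1.5];           % made-up data
y     = [0.9; 1.6; 2.1; 2.9];
A     = [ones(size(xdata)) xdata];      % model matrix
alpha = (A'*A) \ (A'*y)                 % solves A'A*alpha = A'*y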
9.2.3 Maximum likelihood
A very important problem in statistics, arising in many applications, is param-
eter estimation and, in particular, maximum likelihood estimation. It leads to
optimization.
Let Y be a “continuous” real-valued random variable with probability density
px(y). Here x is a parameter (often one uses other symbols for the pa-
rameter, like ξ, θ etc.). For instance, if Y is a normal (Gaussian) variable with
expectation x and variance 1, then px(y) = (1/√(2π)) e^{−(y−x)²/2} and

P(a ≤ Y ≤ b) = ∫_a^b (1/√(2π)) e^{−(y−x)²/2} dy
where P denotes probability.
Assume Y is the outcome of an experiment, and that we have observed
Y = y (so y is a known real number or a vector, if several observations were
made). On the basis of y we want to estimate the value of the parameter x
which “explains” best possible our observation Y = y. We have now available
the probability density px (·). The function x → px (y), for fixed y, is called
the likelihood function. It gives the “probability mass” in y as a function of the
parameter x. The maximum likelihood problem is to find a parameter value x
which maximizes the likelihood, i.e., which maximizes the probability of getting
precisely y. This is an optimization problem
max px (y)
x
where y is fixed and the optimization variable is x. We may here add a constraint
on x, say x ∈ C for some set C, which may incorporate possible knowledge of
x and assure that px (y) is positive for x ∈ C. Often it is easier to solve the
equivalent optimization problem of maximizing the logarithm of the likelihood
function
max ln px (y)
x
y = Ax + w
distributed with common density function p on R. This leads to the likelihood
function
px(y) = ∏_{i=1}^{m} p(yi − ai x)
where ai is the i’th row in A. Taking the logarithm we obtain the maximum
likelihood problem
max ∑_{i=1}^{m} ln p(yi − ai x).

In many applications of statistics it is central to solve this optimization problem
numerically.
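For instance, if the noise density is the standard normal density p(z) = (1/√(2π)) e^{−z²/2}, then ln p(yi − ai x) = −(1/2)(yi − ai x)² − (1/2) ln(2π), so maximizing the log-likelihood is the same as minimizing ∑_{i=1}^{m} (yi − ai x)², i.e., a linear least squares problem of the type discussed in the previous subsection.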
minimize fT(xT) + ∑_{t=0}^{T−1} ft(xt, ut)
subject to                                              (9.1)
xt+1 = ht(xt, ut) (t = 0, 1, . . . , T − 1)
where the control is the sequence (u0, u1, . . . , uT−1) to be determined. This
problem arises in many applications, in engineering, finance, economics etc. We
now rewrite this problem. First, let u = (u0, u1, . . . , uT−1) ∈ RN where N = T n.
Since, as we noted, xt is uniquely determined by u, there is a function v t such
that xt = v t (u) (t = 1, 2, . . . , T ); x0 is given. Therefore the total cost may be
written
fT(xT) + ∑_{t=0}^{T−1} ft(xt, ut) = fT(vT(u)) + ∑_{t=0}^{T−1} ft(vt(u), ut) := f(u)
which is a function of u. Thus, we see that the optimal control problem may
be transformed to the unconstrained optimization problem
min f (u)
u∈RN
Sometimes there may be constraints on the control variables, for instance that
they each lie in some interval, and then the transformation above results in a
constrained optimization problem.
minimize cT x
subject to (9.2)
Ax = b, x ≥ 0.
9.3 Multivariate calculus and linear algebra
We first recall some useful facts from linear algebra.
The spectral theorem says that if A is a real symmetric matrix, then there
is an orthogonal matrix V (i.e., its columns are orthonormal) and a diagonal
matrix D such that
A = V DV T .
The diagonal of D contains the eigenvalues of A, and A has an orthonormal set
of eigenvectors (the columns of V ).
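As a quick numerical illustration (with a made-up symmetric matrix), MATLAB's eig returns exactly such a factorization:

% Spectral decomposition of a (made-up) symmetric matrix
A = [2 1 0; 1 3 1; 0 1 2];
[V, D] = eig(A);          % for symmetric A the columns of V are orthonormal
norm(A - V*D*V')          % numerically zero
norm(V'*V - eye(3))       % numerically zero: V is orthogonal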
A real symmetric matrix A is positive semidefinite if xT Ax ≥ 0 for all x ∈ Rn.
The following statements are equivalent:
Similarly, a real symmetric matrix is positive definite if xT Ax > 0 for all nonzero
x ∈ Rn . The following statements are equivalent
For vector-valued functions we also need the derivative. Consider the vector-
valued function F given by

F(x) = (F1(x), F2(x), . . . , Fn(x)),

viewed as a column vector. Its derivative, the Jacobi matrix F′(x), is the matrix
whose (i, j)th entry is the partial derivative ∂Fi(x)/∂xj.
The ith row of this matrix is therefore the gradient of Fi, now viewed as a row
vector.
Next we recall Taylor's theorems from multivariate calculus:

f(x + h) = f(x) + ∇f(x + th)T h.

The next one is known as Taylor's formula, or the second order Taylor's
theorem:
There is another version of the second order Taylor theorem in which the
Hessian is evaluated in x and, as a result, we get an error term. This theorem
shows how f may be approximated by a quadratic polynomial in n variables:
Using the O-notation from Definition 4.6, the very useful approximations we
get from Taylor's theorems can thus be summarized as follows:

Taylor approximations:

First order: f(x + h) = f(x) + ∇f(x)T h + O(‖h‖)
                      ≈ f(x) + ∇f(x)T h.

Second order: f(x + h) = f(x) + ∇f(x)T h + (1/2) hT ∇2 f(x) h + O(‖h‖2)
                       ≈ f(x) + ∇f(x)T h + (1/2) hT ∇2 f(x) h.
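A small numerical sanity check of these approximations, using a function of our own choosing:

% Comparing f(x+h) with its first and second order Taylor approximations
% for the (made-up) function f(x1,x2) = exp(x1) + x1*x2^2
f  = @(x) exp(x(1)) + x(1)*x(2)^2;
gf = @(x) [exp(x(1)) + x(2)^2; 2*x(1)*x(2)];       % gradient
Hf = @(x) [exp(x(1)), 2*x(2); 2*x(2), 2*x(1)];     % Hessian
x = [0.5; 1]; h = [1e-2; -2e-2];
err1 = f(x+h) - (f(x) + gf(x)'*h)                  % first order error
err2 = f(x+h) - (f(x) + gf(x)'*h + 0.5*h'*Hf(x)*h) % much smaller error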
As we shall see, one can get a lot of optimization out of these approximations!
We also need a Taylor theorem for vector-valued functions, which follows by
applying Taylor's theorem above to each component function:
Theorem 9.6 (First order Taylor theorem for vector-valued functions). Let
F : Rn → Rm be a vector-valued function which is continuously differentiable
in a neighborhood N of x. Then
when x + h ∈ N .
8 See Section 5.9 in [8]
Finally, if F : Rn → Rm and G : Rk → Rn we define the composition
H = F ◦ G as the function H : Rk → Rm given by H(x) = F(G(x)). Then, under
natural differentiability assumptions the following chain rule holds:

H′(x) = F′(G(x)) G′(x).

Here the right-hand side is a product of two matrices, the respective Jacobi
matrices evaluated in the right points.
Finally, we discuss some notions concerning the convergence of sequences.
Definition 9.7 (Linear convergence). We say that a sequence {xk}∞k=1 con-
verges to x∗ linearly (or that the convergence speed is linear) if there is a
γ < 1 such that

‖xk+1 − x∗‖ ≤ γ‖xk − x∗‖ (k = 0, 1, . . .).
Ex. 2 — Consider the function f (x) = x sin(1/x) defined for x > 0. Find its
local minima. What about global minimum?
Ex. 5 — The level sets of a function f : R2 → R are sets of the form Lα =
{x ∈ R2 : f(x) = α}. Let f(x) = (1/4)(x1 − 1)2 + (x2 − 3)2. Draw the level sets
in the plane for α = 10, 5, 1, 0.1.
Ex. 8 — Later in these notes we will need the expression for the gradient of
functions which are expressed in terms of matrices.
a. Let f : Rn → R be defined by f (x) = q T x = xT q, where q is a vector.
Show that ∇f (x) = q, and that ∇2 f (x) = 0.
b. Let f : Rn → R be the quadratic function f (x) = (1/2)xT Ax. Show
that ∇f (x) = Ax, and that ∇2 f (x) = A.
Ex. 10 — Let

A = [ 1 2
      2 8 ].

Show that A is positive definite. (Try to give two different proofs.)
Ex. 11 — Show that if A is positive definite, then its inverse is also positive
definite.
Chapter 10
Figure 10.1: (a) A square. (b) The ellipse x2/4 + y2 ≤ 1. (c) The area x4 + y4 ≤ 1.
(This inequality holds for all x, y and λ as specified). Due to the convexity
of C, the point (1 − λ)x + λy lies in C, so the inequality is well-defined. The
geometrical interpretation in one dimension is that whenever you take two points
on the graph of f , say (x, f (x)) and (y, f (y)), the graph of f restricted to the
line segment [x, y] lies below the line segment in Rn+1 between the two chosen
points. A function g is called concave if −g is convex.
Every linear function is convex. Some other examples of convex functions in
n variables are
• f (x) = L(x) + α where L is a linear function from Rn into R (a linear
functional) and α is a real number. Such a function is called an affine
function and it may be written f (x) = cT x + α for a suitable vector c.
• f(x) = ‖x‖ (Euclidean norm). That this is convex can be proved by
writing ‖(1 − λ)x + λy‖ ≤ ‖(1 − λ)x‖ + ‖λy‖ = (1 − λ)‖x‖ + λ‖y‖. In
fact, the same argument can be used to show that every norm defines
a convex function. Such an example is the l1-norm, also called the sum
norm, defined by ‖x‖1 = ∑_{j=1}^{n} |xj|.

• f(x) = e^{∑_{j=1}^{n} xj} (see Exercise 7).
The next result is often used, and is called Jensen's inequality. It can be
shown using induction.
A point of the form ∑_{j=1}^{r} λj xj, where the λj's are nonnegative and sum to
1, is called a convex combination of the points x1, x2, . . . , xr. One can show that
a set is convex if and only if it contains all convex combinations of its points.
Finally, one connection between convex sets and convex functions is the
following fact whose proof is an exercise.
Proposition 10.5. Let C ⊆ Rn be a convex set and consider a convex function
f : C → R. Let α ∈ R. Then the “sublevel” set
{x ∈ C : f (x) ≤ α}
is a convex set.
With this result it is straightforward to prove that the remaining sets from
Figure 10.1 are convex. They can be written as sublevel sets of the functions
f(x, y) = x2/4 + y2 and f(x, y) = x4 + y4. For the first of these the level sets are
ellipses, and are shown in Figure 10.2, together with f itself. One can quickly
verify that the Hessian matrices of these functions are positive semidefinite, so
that both functions are convex (by Theorem 10.7). It then follows from
Proposition 10.5 that the corresponding sets are convex.
An important class of convex functions consists of (certain) quadratic func-
tions. Let A ∈ Rn×n be a symmetric matrix which is positive semidefinite and
consider the quadratic function f : Rn → R given by
f(x) = (1/2) xT Ax − bT x = (1/2) ∑_{i,j} aij xi xj − ∑_{j=1}^{n} bj xj.

(If A = 0, then the function is linear, and it may be strange to call it quadratic.
But we still do this, for simplicity.) Then (Exercise 9.3.8) the Hessian matrix
of f is A, i.e., ∇2 f(x) = A for each x ∈ Rn. Therefore, by Theorem 10.7, f is a
convex function.
Figure 10.2: (a) The function f(x, y) = x2/4 + y2. (b) Some level curves of f.
(i) f is convex.
(ii) f (x) ≥ f (x0 ) + ∇f (x0 )T (x − x0 ) for all x, x0 ∈ C.
(iii) (∇f(x) − ∇f(x0))T (x − x0) ≥ 0 for all x, x0 ∈ C.
This theorem is important. Property (ii) says that the first-order Taylor
approximation of f at x0 (which is the right-hand side of the inequality) always
underestimates f . This result has interesting consequences for optimization as
we shall see later.
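As a small numerical illustration of property (ii), with a convex function of our own choosing:

% Property (ii) for the convex function f(x) = x(1)^2 + exp(x(2)):
% the first order Taylor approximation at x0 never overestimates f
f  = @(x) x(1)^2 + exp(x(2));
gf = @(x) [2*x(1); exp(x(2))];
x0 = [1; 0]; x = [-2; 1];
f(x) >= f(x0) + gf(x0)'*(x - x0)     % returns 1 (true), for any choice of x and x0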
Ex. 5 — Explain how you can write the LP problem max {cT x : Ax ≥
b, Bx = d, x ≥ 0} as an LP problem of the form
max{cT x : Hx ≤ h, x ≥ 0}
Ex. 7 — Show that f(x) = e^{∑_{j=1}^{n} xj} is a convex function.
Ex. 9 — Assume that f and g are convex functions defined on an interval I.
Determine which of the following functions are convex or concave:
a. λf where λ ∈ R,
b. min{f, g},
c. |f |.
i.e., a convex function defined on a closed real interval attains its maximum at
one of the endpoints.
Chapter 11
Nonlinear equations
x1² − x1 x2⁻³ + cos x1 = 1
5x1⁴ + 2x1³ − tan(x1 x2⁸) = 3
Clearly, such equations can be very hard to solve. The general problem is to
solve the equation
F (x) = 0 (11.1)
for a given function F : Rn → Rn . If F (x) = 0 we call x a root of F (or
of the equation). The example above is equivalent to finding roots in F (x) =
(F1 (x), F2 (x)) where
Often the problem F (x) = 0 has the following form, or may be rewritten to
it:
K(x) = x. (11.2)
for some function K : Rn → Rn . This corresponds to the special choice F (x) =
K(x) − x. A point x ∈ Rn such that x = K(x) is called a fixed point of the
function K. In finding such a fixed point it is tempting to use the following
iterative method: choose a starting point x0 and repeat the following iteration
Fixed-point algorithm:
1. Choose an initial point x0 , let x = x0 and err = 1.
2. while err > ε do
(i) Compute x1 = K(x)
(ii) Compute err = �x1 − x�
(iii) Update x := x1
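A minimal MATLAB sketch of this algorithm (the tolerance, the iteration cap and the example map K in the comment are our own choices):

% Fixed-point iteration x <- K(x); K is a function handle
function x = fixedpoint(K, x0, epsilon, maxit)
    x = x0; err = 1; numit = 0;
    while err > epsilon && numit < maxit
        x1 = K(x);              % step (i)
        err = norm(x1 - x);     % step (ii)
        x = x1;                 % step (iii)
        numit = numit + 1;
    end

% Example: K(x) = cos(x) is a contraction near its fixed point
% x = fixedpoint(@(x) cos(x), 1, 1e-10, 100)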
When does the fixed-point iteration work? Let ‖·‖ be a fixed norm, e.g. the
Euclidean norm, on Rn. We say that the function K : Rn → Rn is a contraction
if there is a constant 0 ≤ c < 1 such that

‖K(x) − K(y)‖ ≤ c‖x − y‖ for all x, y ∈ Rn.
We also say that K is c-Lipschitz in this case. The following theorem is called
the Banach contraction principle. It also holds in Banach spaces, i.e., complete
normed vector spaces (possibly infinite-dimensional).
Theorem 11.1. Assume that K is c-Lipschitz with 0 < c < 1. Then K has
a unique fixed point x∗. For any starting point x0 the fixed-point iteration
(11.3) generates a sequence {xk}∞k=0 that converges to x∗. Moreover

‖xk+1 − x∗‖ ≤ c‖xk − x∗‖ (k = 0, 1, . . .)

so that

‖xk − x∗‖ ≤ ck ‖x0 − x∗‖.
Proof. First, note that if both x and y are fixed points of K, then

‖x − y‖ = ‖K(x) − K(y)‖ ≤ c‖x − y‖,

which means that x = y (as c < 1); therefore K has at most one fixed point.
Next, we compute

‖xk+1 − xk‖ = ‖K(xk) − K(xk−1)‖ ≤ c‖xk − xk−1‖ ≤ · · · ≤ ck ‖x1 − x0‖,

so

‖xm − x0‖ = ‖∑_{k=0}^{m−1} (xk+1 − xk)‖ ≤ ∑_{k=0}^{m−1} ‖xk+1 − xk‖
          ≤ (∑_{k=0}^{m−1} ck) ‖x1 − x0‖ ≤ (1/(1 − c)) ‖x1 − x0‖.
From this we derive that {xk } is a Cauchy sequence; as we have
and 0 < c < 1. Any Cauchy sequence in Rn has a limit point, so xm → x∗ for
some x∗ ∈ Rn . We now prove that the limit point x∗ is a (actually, the) fixed
point:
‖x∗ − K(x∗)‖ ≤ ‖x∗ − xm‖ + ‖xm − K(x∗)‖
             = ‖x∗ − xm‖ + ‖K(xm−1) − K(x∗)‖
             ≤ ‖x∗ − xm‖ + c‖xm−1 − x∗‖

and letting m → ∞ here gives ‖x∗ − K(x∗)‖ ≤ 0, so x∗ = K(x∗) as desired.
Finally,
We solve TF1 (xk ; x) = 0 for x and define the next iterate as xk+1 = x. This
gives
xk+1 = xk − F′(xk)−1 F(xk)                    (11.5)

which leads to Newton's method. One here assumes that the derivative F′ is
known analytically. Note that we do not need to (and hardly ever do!) compute
the inverse of the matrix F′; in practice one solves the linear system
F′(xk)h = −F(xk) and sets xk+1 = xk + h.
This code also terminates after a given number of iterations, and when a given
accuracy is obtained. Note that this function should work for any function F ,
since it is a parameter to the function.
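The code itself is not reproduced here, but a minimal sketch in the same spirit as the function newtonmult referred to in the exercises could look as follows (our own reconstruction; F and J are function handles returning F(x) and the Jacobi matrix F′(x)):

function x = newtonmult(x0, F, J)
    % Newton's method for F(x) = 0
    epsilon = 1e-10; maxit = 100;
    x = x0;
    for numit = 1:maxit
        x = x - J(x) \ F(x);       % solve F'(x)h = -F(x), no explicit inverse
        if norm(F(x)) < epsilon
            break;
        end
    end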
The convergence of Newton’s method may be analyzed using fixed point
theory since one may view Newton’s method as a fixed point iteration. Observe
that the Newton iteration (11.5) may be written
xk+1 = G(xk )
where G is the function
G(x) = x − F′(x)−1 F(x)
From this it is possible to show that if the starting point is sufficiently close to
the root, then Newton’s method will converge to this root at a linear convergence
rate. With more clever arguments one may show that the convergence rate of
Newton’s method is even faster: it has superlinear convergence. Actually, for
many functions one even has quadratic convergence rate. The proof of the
following convergence theorem relies purely on Taylor’s theorem.
Theorem 11.2. Assume that Newton's method with initial point x0 produces
a sequence {xk}∞k=0 which converges to a solution x∗ of (11.1). Then the
convergence is superlinear, i.e.,

lim_{k→∞} ‖xk+1 − x∗‖/‖xk − x∗‖ = 0.
Proof. From Taylor’s theorem for vector-valued functions, Theorem 9.6, in the
point xk we have
Combining this with the Newton iteration xk+1 = xk − F′(xk)−1 F(xk) we get

lim_{k→∞} ‖xk+1 − x∗‖/‖xk − x∗‖ = 0.
where K and L are some constants. Here ‖F′(x0)‖ denotes the operator norm
of the square matrix F′(x0), which is defined as

‖F′(x0)‖ = max{‖F′(x0)h‖ : ‖h‖ = 1},

and it measures how much the operator F′(x0) may increase the size of vectors.
The following convergence result for Newton's method is known as Kantorovich's
theorem.
Theorem 11.3 (Kantorovich's theorem). Let F : U → Rn be a differentiable
function satisfying (11.6). Assume that B̄(x0; 1/(KL)) ⊆ U and that

Then F′(x) is invertible for all x ∈ B(x0; 1/(KL)) and Newton's method with
initial point x0 will produce a sequence {xk}∞k=0 contained in B(x0; 1/(KL))
and limk→∞ xk = x∗ for some limit point x∗ ∈ B̄(x0; 1/(KL)) with
F (x∗ ) = 0.
A proof of this theorem is quite long (but not very difficult to understand) [8].
One disadvantage with Newton’s method is that one needs to know the
Jacobi matrix F � explicitly. For complicated functions, or functions being the
output of a simulation, the derivative may be hard or impossible to find. The
quasi-Newton method, also called the secant-method, is then a good alternative.
The idea is to approximate F � (xk ) by some matrix Bk and to compute the new
search direction from
Bk p = −F (xk )
A practical method for finding these approximations B1 , B2 , . . . is Broyden’s
method. Provided that the previous iteration gave xk , with Broyden’s method
we compute xk+1 by following the search direction, define sk = xk+1 − xk and
yk = F(xk+1) − F(xk), and compute Bk+1 from Bk by the formula

Bk+1 = Bk + ((yk − Bk sk) skT)/(skT sk).          (11.7)
It can be shown that Bk approximates the Jacobi matrix F � (xk ) well in each
iteration. Moreover, the update given in (11.7) can be done efficiently (it is a
rank one update of Bk ).
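In MATLAB one such update could be sketched as follows, with B, s and y as defined above:

% One Broyden update of the Jacobian approximation B (a rank one update)
Bnew = B + ((y - B*s) * s') / (s'*s);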
Note that this algorithm also computes an α through what we call a line
search, to attempt to find the optimal distance to follow the search direction.
We do not here specify how this line search can be performed. Also, we do
not specify how the initial values can be chosen. For B0 , any approximation of
the Jacobian of F at x0 can be used, using a numerical differentiation method
of your own choosing. One can show that Broyden’s method, under certain
assumptions, also converges superlinearly, see [11].
Ex. 1 — Show that the problem of solving nonlinear equations (11.1) may
be transformed into a nonlinear optimization problem. (Hint: Square each
component function and sum these up!)
Ex. 3 — Let α ∈ R+ be fixed, and consider f(x) = x2 − α. Then the zeros
are ±√α. Write down Newton's iteration for this problem. Let α = 2 and
compute the first three iterates in Newton's method when x0 = 1.
a. Given a value x0 , implement a function which computes an estimate
of F � (x0 ) by estimating the partial derivatives of F , using a numerical
differentiation method and step size of your own choosing.
b. Implement a function
function x=broyden(x0,F)
which returns an estimate of a zero of F using Broyden’s method. Your
method should set B0 to be the matrix obtained from the function in a.
Just indicate where line search along the search direction should be per-
formed in your function, without implementing it. The function should
work as newtonmult in that it terminates after a given number of itera-
tions, or after precision of a given accuracy has been obtained.
Chapter 12
Unconstrained optimization
∇f (x∗ ) = 0. (12.1)
If, moreover, f has continuous second order partial derivatives, then ∇2 f (x∗ )
is positive semidefinite.
neighborhood of x∗ . From Theorem 9.3 (first order Taylor) we obtain
for some t ∈ (0, 1) (depending on α). By choosing α small enough, the right-
hand side of (12.2) is negative (as just said), and so f (x∗ + h) < f (x∗ ), contra-
dicting that x∗ is a local minimum. This proves that ∇f (x∗ ) = 0.
To prove the second statement, we get from Theorem 9.4 (second order
Taylor)
f(x∗ + h) = f(x∗) + ∇f(x∗)T h + (1/2) hT ∇2 f(x∗ + th) h
          = f(x∗) + (1/2) hT ∇2 f(x∗ + th) h                 (12.3)
If ∇2 f (x∗ ) is not positive semidefinite, there is an h such that hT ∇2 f (x∗ )h < 0
and, by continuity of the second order partial derivatives, hT ∇2 f (x)h < 0 for
all x in some neighborhood of x∗ . But then (12.3) gives f (x∗ + h) − f (x∗ ) < 0;
a contradiction. This proves that ∇2 f(x∗) is positive semidefinite.
The two necessary optimality conditions in Theorem 12.1 are called the first-
order and the second-order conditions, respectively. The first-order condition
says that the gradient must be zero at x∗, and such a point is often called a
stationary point. The second-order condition may be interpreted by f being
"convex locally" at x∗ , although this is not a precise term. A stationary point
which is neither a local minimum or a local maximum is called a saddle point.
So, every neighborhood of a saddle point contains points with larger and points
with smaller f -value.
Theorem 12.1 gives a connection to nonlinear equations. In order to find a
stationary point we may solve ∇f(x) = 0, which is an n × n (usually nonlinear)
system of equations. (The system is linear whenever f is a quadratic function.)
One may solve this equation, for instance, by Newton’s method and thereby
get a candidate for a local minimum. Sometimes this approach works well,
in particular if f has a unique local minimum and we have an initial point
"sufficiently close". However, there are other better methods which we discuss
later.
It is important to point out that any algorithm for finding a minimum of f
has to be able to find a stationary point. Therefore algorithms in this area are
typically iterative and move to gradually better points where the norm of the
gradient becomes smaller, and eventually almost equal to zero.
As an example consider a convex quadratic function

f(x) = (1/2) xT Ax − bT x

where the (symmetric) Hessian matrix is (constant and equal to) A, and this ma-
trix is positive semidefinite. Then ∇f(x) = Ax − b, so the first-order necessary
optimality condition is
Ax = b
which is a linear system of equations. If f is strictly convex, which happens when
A is positive definite, then A is invertible and the unique solution is x∗ = A−1 b.
Thus, there is only one candidate for a local (and global) minimum, namely
x∗ = A−1 b. Actually, this is indeed a unique global minimum, but to verify
this we need a suitable argument. One way is to use convexity (with results
presented later) or an alternative is to use sufficient optimality conditions which
we discuss next. The linear system Ax = b, when A is positive definite, may be
solved by several methods. A popular, and very fast, method is the conjugate
gradient method. This method, and related methods, are discussed in detail in
the course INF-MAT4360 Numerical linear algebra [10].
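As a tiny illustration (with a made-up positive definite matrix), the stationary point is found by solving the linear system:

% Minimizing f(x) = (1/2)x'Ax - b'x for a (made-up) positive definite A
A = [4 1; 1 3];
b = [1; 2];
xstar = A \ b        % the unique stationary point, and the global minimum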
In order to present a sufficient optimality condition we need a result from
linear algebra. Recall from linear algebra that a symmetric positive definite
matrix has only real eigenvalues and all these are positive.
hT Ah ≥ λn ‖h‖2 (h ∈ Rn).
A = V DV T
Proof. From Theorem 9.5 (second order Taylor) and Proposition 12.2 we get
where λn > 0 is the smallest eigenvalue of ∇2 f(x∗). Dividing here by ‖h‖2
gives

(f(x∗ + h) − f(x∗))/‖h‖2 = (1/2) λn + ε(h).

Since limh→0 ε(h) = 0, there is an r such that for ‖h‖ < r, |ε(h)| < λn/4. This
implies that

(f(x∗ + h) − f(x∗))/‖h‖2 ≥ λn/4

for all h with ‖h‖ < r. This proves that x∗ is a local minimum of f.
We remark that the proof of the previous theorem actually shows that x∗
is a strict local minimum of f meaning that f (x∗ ) is strictly smaller than f (x)
for all other points x in some neighborhood of x∗ . Note the difference between
the necessary and the sufficient optimality conditions: a necessary condition is
that ∇2 f(x∗) is positive semidefinite, while a part of the sufficient condition is
the stronger property that ∇2 f(x∗) is positive definite.
Let us see what happens when we work with a convex function.
∇f (x∗ ) = 0.
12.2 Methods
Algorithms for unconstrained optimization are iterative methods that generate
a sequence of points with gradually smaller values on the function f which is
to be minimized. There are two main types of algorithms in unconstrained
optimization:
• Line search methods: Here one first chooses a search direction dk from
the current point xk , using information about the function f . Then one
chooses a step length αk so that the new point
xk+1 = xk + αk dk
has a small, perhaps smallest possible, value on the halfline {xk + αdk :
α ≥ 0}. αk describes how far one should go along the search direction.
The problem of choosing αk is a one-dimensional optimization problem.
Sometimes we can find αk exactly, and in such cases we refer to the method
as exact line search. When αk cannot be found analytically, an algorithm
is used to get close to the minimum on the halfline; the method is then
referred to as backtracking line search.
• Trust region methods: In these methods one chooses an approximation
fˆk to the function in some neighborhood of the current point xk . The
function fˆk is simpler than f and one minimizes fˆk (in the mentioned
neighborhood) and let the next iterate xk+1 be this minimizer.
and equality holds for h = −α∇f (x) for some α ≥ 0. In general, we call h a
descent direction at x if ∇f (x) · h < 0. Thus, if we move in a descent direction
from x and make a sufficiently small step, the new point has a smaller f -value.
With this background we shall in the following focus on gradient methods given
by
xk+1 = xk + αk dk (12.4)
where the direction dk satisfies ∇f(xk)T dk < 0, i.e., dk is a descent direction.
• If we choose the search direction dk = −∇f (xk ), we get the steepest
descent method
xk+1 = xk − αk ∇f (xk ).
In each step it moves in the direction of the negative gradient. Sometimes
this gives slow convergence, so other methods have been developed where
other choices of direction dk are made.
• An important method is Newton's method

xk+1 = xk − αk ∇2 f(xk)−1 ∇f(xk).

This is the gradient method with dk = −∇2 f(xk)−1 ∇f(xk); this vector
dk is called the Newton step. The so-called pure Newton method is when
one simply chooses step size αk = 1 for each k. To interpret this method
consider the second order Taylor approximation of f in xk
and choose step length αk = β^{mk} s. Here σ is typically chosen very small, e.g.
σ = 10^{−3}. The parameter s fixes the search for step size to lie within the
interval [0, s]. This can be important: for instance, we can set s so small that
the initial step size we try is within the domain of definition for f . According to
[1] β is usually chosen in [1/10, 1/2]. In the literature one may find a lot more
information about step size rules and how they may be adjusted to the methods
for finding search direction, see [1], [11].
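A sketch of such a backtracking function, consistent with the function armijoruleg1g2 shown in Chapter 14 (but without the extra feasibility loop used there), and with the parameter values suggested in Exercise 12.2.9:

function alpha = armijorule(f, df, x, d)
    % Armijo backtracking: increase m until the descent condition holds
    beta = 0.2; s = 0.5; sigma = 10^(-3);
    m = 0;
    while f(x) - f(x + beta^m*s*d) < -sigma*beta^m*s*(df(x))'*d
        m = m + 1;
    end
    alpha = beta^m*s;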
Now, we return to the choice of search direction in the gradient method
(12.4). A main question is whether it generates a sequence {xk }∞ k=1 which
converges to a stationary point x∗ , i.e., where ∇f (x∗ ) = 0. It turns out that
this may not be the case; one needs to be careful about the choice of dk to assure
this convergence. The problem is that if dk tends to be nearly orthogonal to
∇f (xk ) one may get into trouble. For this reason one introduces the following
notion:
What this condition assures is that ‖dk‖ is not too small or large compared
to ‖∇f(xk)‖ and that the angle between the vectors dk and ∇f(xk) is not too
close to 90◦. The proof of the following theorem may be found in [1].
Theorem 12.6. Let {xk}∞k=0 be generated by the gradient method (12.4),
where {dk}∞k=0 is gradient related to {xk}∞k=0 and the step size αk is chosen
using the Armijo rule. Then every limit point of {xk}∞k=0 is a stationary point.
We remark that in Theorem 12.6 the same conclusion holds if we use exact
minimization as step size rule, i.e., f (xk +αdk ) is minimized exactly with respect
to α.
A very important property of a numerical algorithm is its convergence speed.
Let us consider the steepest descent method first. It turns out that the con-
vergence speed for this algorithm is very well explained by its performance
on minimizing a quadratic function, so therefore the following result is impor-
tant. In this theorem A is a symmetric positive definite matrix with eigenvalues
λ1 ≥ λ2 ≥ · · · ≥ λn > 0.
f (xk+1 ) ≤ mA f (xk )
The proof may be found in [1]. Thus, if the largest eigenvalue is much
larger than the smallest one, mA will be nearly 1 and one typically has slow
convergence. In this case we have mA ≈ cond(A) where cond(A) = λ1 /λn is
the condition number of the matrix A. So the rule is: if the condition number
of A is small we get fast convergence, but if cond(A) is large, there will be
slow convergence. A similar behavior holds for most functions f because locally
near a minimum point the function is very close to its second order Taylor
approximation in x∗ which is a quadratic function with A = ∇2 f (x∗ ).
Thus, Theorem 12.7 says that the sequence obtained in the steepest descent
method converges linearly to a stationary point (at least for quadratic functions).
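A small experiment of our own illustrates this: for a quadratic f(x) = (1/2) xT Ax the exact line search step along the negative gradient g is α = (gT g)/(gT Ag), and with cond(A) = 100 the decrease of f per iteration is slow:

% Steepest descent with exact line search on f(x) = (1/2)x'Ax, A ill-conditioned
A = diag([1, 100]);          % cond(A) = 100
x = [100; 1];
for k = 1:5
    g = A*x;                             % gradient
    alpha = (g'*g)/(g'*A*g);             % exact minimizer along -g
    x = x - alpha*g;
    fprintf('k=%d   f(x)=%e\n', k, 0.5*x'*A*x);
end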
We now turn to Newton’s method.
Recall that the pure Newton step minimizes the second order Taylor ap-
proximation of f at the current iterate xk . Thus, if the function we minimize
is quadratic, we are done in one step. Similarly, if the function can be well
approximated by a quadratic function, then one would expect fast convergence.
We shall give a result on the convergence of Newton’s method (see [2] for
further details). When A is symmetric, we let λmin(A) denote the smallest
eigenvalue of A.
For the convergence result we need a lemma on strictly convex functions.
Assume that x0 is a starting point for Newton’s method and let S = {x ∈ Rn :
f (x) ≤ f (x0 )}. We shall assume that f is continuous and convex, and this
implies that S is a closed convex set. We also assume that f has a minimum
point x∗ which then must be a global minimum. Moreover the minimum point
will be unique due to a strict convexity assumption on f . Let f ∗ = f (x∗ ) be
the optimal value.
The following lemma says that for a convex function as just described, a
point is nearly a minimum point (in terms of the f -value) whenever the gradient
is small in that point.
Lemma 12.8. Assume that f is convex as above and that λmin(∇2 f(x)) ≥ m
for all x ∈ S. Then

f(x) − f∗ ≤ (1/(2m)) ‖∇f(x)‖2.                    (12.8)
Proof. From Theorem 9.4, the second order Taylor's theorem, we have for each
x, y ∈ S
for suitable z on the line segment between x and y. Here a lower bound for the
quadratic term is (m/2)‖y − x‖2, due to Proposition 12.2. Therefore

f(y) ≥ f(x) + ∇f(x)T (y − x) + (m/2)‖y − x‖2.

Now, fix x and view the expression on the right-hand side as a quadratic function
of y. This function is minimized for y∗ = x − (1/m)∇f(x). So, by inserting
y = y∗ above we get

f(y) ≥ f(x) + ∇f(x)T (y∗ − x) + (m/2)‖y∗ − x‖2
     = f(x) − (1/(2m)) ‖∇f(x)‖2

for every y ∈ S. In particular, taking y = x∗ gives f∗ ≥ f(x) − (1/(2m))‖∇f(x)‖2, which is (12.8).
Proof. Define f ∗ = f (x∗ ). It is possible to show that there are numbers η and
γ > 0 with 0 < η ≤ m2 /L such that the following holds for each k:
(i) If ‖∇f(xk)‖ ≥ η, then

f(xk+1) ≤ f(xk) − γ.                                         (12.9)

(ii) If ‖∇f(xk)‖ < η, then backtracking line search gives αk = 1 and

(L/(2m2)) ‖∇f(xk+1)‖ ≤ ((L/(2m2)) ‖∇f(xk)‖)2.                (12.10)
We omit the proof of this fact; it may be found in [2].
We may now prove that if ‖∇f(xk)‖ < η, then also ‖∇f(xk+1)‖ < η. This
follows from (ii) above and the fact (assumption) η ≤ m2 /L. Therefore, as soon
as case (ii) occurs in the iterative process, in all the remaining iterations case
(ii) will occur. Actually, as soon as case (ii) “kicks in” quadratic convergence
starts as we shall see now. So assume that case (ii) occurs from a certain k.
(Below we show that such k must exist.)
Define µl = (L/(2m2)) ‖∇f(xl)‖ for each l ≥ k. Then 0 ≤ µk < 1/2 as η ≤ m2/L.
So (by induction)

µl ≤ µk^{2^{l−k}} ≤ (1/2)^{2^{l−k}}   (l = k, k + 1, . . .).

By Lemma 12.8 this gives

f(xl) − f∗ ≤ (1/(2m)) ‖∇f(xl)‖2 ≤ (2m3/L2) (1/2)^{2^{l−k+1}}   (l ≥ k).
This inequality shows that f (xl ) → f ∗ , and since the minimum point is unique,
we must have xl → x∗ . Moreover, it follows that the convergence is quadratic.
It only remains to explain why case (ii) above indeed occurs for some k. In
each iteration of type (i) f is decreased by at least γ, as seen from equation
(12.9), so the number of such iterations must be bounded by (f(x0) − f∗)/γ.

From this analysis one also obtains a bound on the total number of Newton
iterations needed to obtain f(xl) − f∗ ≤ ε, namely

(f(x0) − f∗)/γ + log2 log2(2m3/(εL2)).

Here γ is the parameter introduced in the proof above. The second term in
this expression (the logarithmic term) grows very slowly as ε is decreased, and
it may roughly be replaced by the constant 6. So, whenever the second stage
(case (ii) in the proof) occurs, the convergence is extremely fast: roughly 6 more
Newton iterations suffice. Note that quadratic convergence means, roughly,
that the number of correct digits in the answer doubles for every iteration.
Exercises for Section 12.2
Ex. 1 — Consider the function f (x1 , x2 ) = x21 + ax22 where a > 0 is a param-
eter. Draw some of the level sets of f (for different levels) for each a in the set
{1, 4, 100}. Also draw the gradient in a few points on these level sets.
Ex. 2 — State and prove a theorem similar to Theorem 12.1 for maximization
problems.
Ex. 4 — Let f (x1 , x2 ) = 4x1 + 6x2 + x21 + 2x22 . Find all stationary points and
determine whether they are minima, maxima or saddle points. Do the same for the
function g(x1 , x2 ) = 4x1 + 6x2 + x21 − 2x22 .
Ex. 8 — Implement the steepest descent method. Test the algorithm on the
functions in exercises 4 and 5. Use different starting points.
Ex. 9 — Implement a function
function alpha=armijorule(f,df,x,d)
which returns α chosen according to the Armijo rule for a function f with the
given gradient, at point x, with search direction d. The function should compute
mk from Equation (12.7) with β = 0.2, s = 0.5, σ = 10^{−3}, and return α = β^{mk} s.
Chapter 13
Constrained optimization -
theory
minimize f (x)
subject to
(13.1)
hi (x) = 0 (i ≤ m)
gj (x) ≤ 0 (j ≤ r)
minimize f (x)
subject to (13.2)
hi (x) = 0 (i ≤ m)
Theorem 13.1. Let x∗ be a local minimum in problem (13.1) and assume that
x∗ is a regular point. Then there is a unique vector λ∗ = (λ∗1 , λ∗2 , . . . , λ∗m ) ∈
Rm such that
∇f(x∗) + ∑_{i=1}^{m} λ∗i ∇hi(x∗) = 0.                    (13.3)
If f and each hi are twice continuously differentiable, then the following also
holds
hT (∇2 f(x∗) + ∑_{i=1}^{m} λ∗i ∇2 hi(x∗)) h ≥ 0 for all h ∈ T(x∗).        (13.4)
The numbers λ∗i in this theorem are called the Lagrangian multipliers. Note
that the Lagrangian multiplier vector λ∗ is unique; this follows directly from
the linear independence assumption as x∗ is assumed regular. The theorem may
also be stated in terms of the Lagrangian function L : Rn × Rm → R given by
L(x, λ) = f(x) + ∑_{i=1}^{m} λi hi(x) = f(x) + λT H(x)   (x ∈ Rn, λ ∈ Rm).

Then

∇x L(x, λ) = ∇f(x) + ∑_i λi ∇hi(x),
∇λ L(x, λ) = H(x).
Therefore, the first order conditions in Theorem 13.1 may be rewritten as follows
∇x L(x∗ , λ∗ ) = 0, ∇λ L(x∗ , λ∗ ) = 0.
Figure 13.1: The two surfaces h1(x) = b1 and h2(x) = b2 intersect each other in
a curve. Along this curve the constraints are fulfilled.

Figure 13.2: ∇f(x∗) as a linear combination of ∇h1(x∗) and ∇h2(x∗).
Here the second equation simply means that H(x) = 0. These two equations
say that (x∗ , λ∗ ) is a stationary point for the Lagrangian, and it is a system of
n + m (possibly nonlinear) equations in n + m variables.
We may interpret the theorem in the following way. At the point x∗ the
linear subspace T(x∗) consists of the “first order feasible directions”. Actually, if
each hi is linear, then T (x∗ ) consists of those h such that x∗ + h is feasible, i.e.,
hi (x∗ +h) = 0 for each i ≤ m. Thus, (13.3) says that in a local minimum x∗ the
gradient ∇f (x∗ ) is orthogonal to the subspace T (x∗ ) of the first order feasible
variations. This is reasonable since otherwise there would be a feasible direction
in which f would decrease. In Figure 13.1 we have plotted a curve where two
constraints are fulfilled. In Figure 13.2 we have then shown an interpretation of
Theorem 13.1.
Note that this necessary optimality condition corresponds to the condition
∇f(x∗) = 0 in the unconstrained case. The second condition (13.4) is a sim-
ilar generalization of the second order condition in Theorem 12.1 (saying that
∇2 f(x∗) is positive semidefinite).
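As a simple illustration (our own example), consider minimizing f(x) = x1² + x2² subject to the single constraint h1(x) = x1 + x2 − 1 = 0. Every feasible point is regular since ∇h1(x) = (1, 1) ≠ 0. Condition (13.3) reads 2x1 + λ1 = 0 and 2x2 + λ1 = 0, which together with x1 + x2 = 1 gives x1 = x2 = 1/2 and λ1 = −1. Thus x∗ = (1/2, 1/2) is the only candidate, and it is indeed the minimum: the problem asks for the point on the line closest to the origin.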
It is possible to prove the theorem by eliminating variables based on the
equations and thereby reducing the problem to an unconstrained one. Another
proof, which we shall present below is based on the penalty approach. This
approach is also interesting as it leads to algorithms for actually solving the
problem.
Proof. (Theorem 13.1) For k = 1, 2, . . . consider the modified objective function
where the last inequality follows from the facts that x̄ ∈ B̄(x∗ ; �) and H(x̄) = 0.
Clearly, this gives x̄ = x∗ . We have therefore shown that the sequence {xk }
converges to the local minimum x∗ . Since x∗ is the center of the ball B̄(x∗ ; �),
the points xk lie in the interior of S for suitably large k. The conclusion is then
that xk is the unconstrained minimum of F k when k is sufficiently large. We
may therefore apply Theorem 12.1 so ∇F k (xk ) = 0, so
Here H′ denotes the Jacobi matrix of H. For suitably large k the matrix
H′(xk)H′(xk)T is invertible (as the rows of H′(xk) are linearly independent
due to rank(H′(x∗)) = m and a continuity argument). Multiply equation (13.5)
by (H′(xk)H′(xk)T)−1 H′(xk) to obtain

kH(xk) = −(H′(xk)H′(xk)T)−1 H′(xk)(∇f(xk) + α(xk − x∗)).
Letting k → ∞ we see that the sequence {kH(xk )} is convergent and its limit
point λ∗ is given by
Finally, by passing to the limit in (13.5) we get
0 = ∇f(x∗) + H′(x∗)T λ∗
This proves the first part of the theorem; we omit proving the second part which
may be found in [1].
The first order necessary condition (13.3) along with the constraints H(x) =
0 is a system of n + m equations in the n + m variables x1 , x2 , . . . , xn and
λ1 , λ2 , . . . , λm . One may use e.g. Newton’s method for solving these equations
and find a candidate for an optimal solution. But usually there are better
numerical methods for solving the optimization (13.1), as we shall see soon.
Necessary optimality conditions are used for finding a candidate solution
for being optimal. In order to verify optimality we need sufficient optimality
conditions.
where ∇2 L(x∗ , λ∗ ) is the Hessian of the Lagrangian function with second order
partial derivatives with respect to x. Then x∗ is a (strict) local minimum of
f subject to H(x) = 0.
This theorem may be proved (see [1] for details) by considering the aug-
mented Lagrangian function
and this problem must have the same local minima as the problem of minimizing
f (x) subject to H(x) = 0. The objective function in (13.8) contains the penalty
term (c/2)�H(x)�2 which may be interpreted as a penalty (increased function
value) for violating the constraint H(x) = 0. In connection with the proof of
Theorem 13.2 based on the augmented Lagrangian one also obtains the following
interesting and useful fact: if x∗ and λ∗ satisfy the sufficient conditions in
Theorem 13.2 then there exists a positive c̄ such that for all c ≥ c̄ the point x∗ is
also a local minimum of the augmented Lagrangian Lc (·, λ∗ ). Thus, the original
constrained problem has been converted to an unconstrained one involving the
augmented Lagrangian. And, as we know, unconstrained problems are easier to
solve (solve the equations saying that the gradient is equal to zero).
13.2 Inequality constraints and KKT
We now consider the general nonlinear optimization problem where there are
both equality and inequality constraints. The problem is then
minimize f (x)
subject to
(13.9)
hi (x) = 0 (i ≤ m)
gj (x) ≤ 0 (j ≤ r)
minimize f (x)
subject to
(13.10)
hi (x) = 0 (i ≤ m)
gj(x) + zj² = 0 (j ≤ r).

We have introduced extra variables zj, one for each inequality. The squares
of these variables represent the slack in each of the original inequalities. Note that
there is no sign constraint on zj . Clearly, the problems (13.9) and (13.10) are
equivalent. This transformation can also be useful computationally. Moreover,
it is useful theoretically as one may apply the optimality conditions from the
previous section to problem (13.10) to derive the theorem below (see [1]).
We now present a main result in nonlinear optimization. It gives optimality
conditions for this problem, and these conditions are called the Karush-Kuhn-
Tucker conditions, or simply the KKT conditions. In order to present the KKT
conditions we introduce the Lagrangian function L : Rn × Rm × Rr → R given
by
L(x, λ, µ) = f(x) + ∑_{i=1}^{m} λi hi(x) + ∑_{j=1}^{r} µj gj(x) = f(x) + λT H(x) + µT G(x).
Theorem 13.3. Consider problem (13.9) with the usual differentiability as-
sumptions.
(i) Let x∗ be a local minimum of this problem and assume that x∗ is
a regular point. Then there are unique Lagrange multiplier vectors λ∗ =
(λ∗1 , λ∗2 , . . . , λ∗m ) and µ∗ = (µ∗1 , µ∗2 , . . . , µ∗r ) such that
∇x L(x∗, λ∗, µ∗) = 0
µ∗j ≥ 0 (j ≤ r)                                           (13.11)
µ∗j = 0 (j ∉ A(x∗)).

If f, g and h are twice continuously differentiable, then the following also holds:

yT ∇2xx L(x∗, λ∗, µ∗) y ≥ 0                               (13.12)

for all y with ∇hi(x∗)T y = 0 (i ≤ m) and ∇gj(x∗)T y = 0 (j ∈ A(x∗)).
(ii) Assume that x∗ , λ∗ and µ∗ are such that x∗ is a feasible point and
(13.11) holds. Assume, moreover, that (13.12) holds with strict inequality for
each y. Then x∗ is a (strict) local minimum in problem (13.9).
minimize f (x)
subject to
(13.13)
hi (x) = 0 (i ≤ m)
gj (x) = 0 (j ∈ A(x∗ ))
TC (x) always contains the zero vector and it is a cone, meaning that it
contains each positive multiple of its vectors. Consider now problem (13.9) and
let C be the set of feasible solutions (those x satisfying all the equality and
inequality constraints).
Definition 13.5 (Linearized feasible directions). A linearized feasible direc-
tion at x ∈ C is a vector d such that
d · ∇hi(x) = 0 (i ≤ m)
d · ∇gj(x) ≤ 0 (j ∈ A(x)).
Example 13.9. Consider a quadratic optimization problem with linear equality
constraints
minimize (1/2) xT Dx − q T x
subject to
Ax = b

where D is a symmetric n × n matrix, A ∈ Rm×n and b ∈ Rm. Here ∇f(x) = Dx − q,
so the first-order optimality conditions (cf. Theorem 13.1) say that Dx − q + AT λ = 0
together with Ax = b, i.e., (x, λ) solves the linear system

[ D   AT ] [ x ]   =   [ q ]
[ A   0  ] [ λ ]       [ b ].          (13.14)
Under the additional assumption that D is positive definite and A has full row
rank, one can show that the coefficient matrix in (13.14) is invertible so this
system has a unique solution x, λ. Thus, for this problem, we may write down
an explicit solution (in terms of the inverse of the block matrix). Numerically,
one finds x (and the Lagrangian multiplier λ) by solving the linear system
(13.14) by e.g. Gaussian elimination or some faster (direct or iterative) method.
Example 13.10. Consider an extension of the previous example by allowing
linear inequality constraints as well:
minimize (1/2) xT Dx − q T x
subject to
Ax = b
x≥0
Here D, A and b are as above. Then ∇f (x) = Dx − q and ∇gk (x) = −ek .
Thus, the KKT conditions for this problem are: there are λ ∈ Rm and µ ∈ Rn
such that Dx − q + AT λ − µ = 0, µ ≥ 0 and µk = 0 if xk > 0 (k ≤ n). We
eliminate µ from the first equation and obtain the equivalent condition: there
is a λ ∈ Rm such that Dx + AT λ ≥ q and (Dx + AT λ − q)k · xk = 0 (k ≤ n).
In addition, we have Ax = b, x ≥ 0. This problem may be solved numerically,
for instance, by a so-called active set method, see [9].
Example 13.11. Linear optimization is a problem of the form
minimize cT x subject to Ax = b, x ≥ 0
Proof. 1.) The proof of property 1 is exactly as the proof of the first part of
Theorem 12.4, except that we work with local and global minimum of f over C.
2.) Assume the set C ∗ of minimum points is nonempty and let α = minx∈C f (x).
Then C ∗ = {x ∈ C : f (x) ≤ α} is a convex set, see Proposition 10.5. Moreover,
this set is closed as f is continuous.
3.) This follows directly from Theorem 10.9.
Next, we consider a quite general convex optimization problem which is of
the form (13.9):
minimize f (x)
subject to
(13.15)
Ax = b
gj (x) ≤ 0 (j ≤ r)
where all the functions f and gj are differentiable convex functions, and A ∈
Rm×n and b ∈ Rm . Let C denote the feasible set of problem (13.15). Then C is a
convex set, see Proposition 10.5. A special case of (13.15) is linear optimization.
An important concept in convex optimization is duality. To briefly explain
this, introduce again the Lagrangian function L : Rn × Rm × Rr+ → R given by

L(x, λ, ν) = f(x) + λT (Ax − b) + νT G(x).

Remark: we use the variable name ν here instead of the µ used before
because of another parameter µ to be used soon. Note that we require ν ≥ 0.
Define the new function g : Rm × Rr+ → R̄ by

g(λ, ν) = inf_x L(x, λ, ν).
Note that this infimum may sometimes be equal to −∞ (meaning that the
function x → L(x, λ, ν) is unbounded below). The function g is the pointwise
infimum of a family of affine functions in (λ, ν), one function for each x, and
this implies that g is a concave function. We are interested in g due to the
following fact, which is easy to prove. It is usually referred to as weak duality.
maximize g(λ, ν)
subject to (13.16)
ν ≥ 0.
Actually, in this dual problem, we may further restrict the attention to those
(λ, ν) for which g(λ, ν) is finite.
The original problem (13.15) will be called the primal problem. It follows
from Lemma 13.13 that
g∗ ≤ f ∗
where f ∗ denotes the optimal value in the primal problem and g ∗ the optimal
value in the dual problem. If g ∗ < f ∗ , we say that there is a duality gap. Note
that the derivation above, and weak duality, holds for arbitrary functions f and
gj (j ≤ r). The concavity of g also holds generally.
The dual problem is useful when the dual objective function g may be com-
puted efficiently, either analytically or numerically. Duality provides a powerful
method for proving that a solution is optimal or, possibly, near-optimal. If we
have a feasible x in (13.15) and we have found a dual solution (λ, ν) with ν ≥ 0
such that
f(x) = g(λ, ν) + ε

for some ε (which then has to be nonnegative), then we can conclude that x is
“nearly optimal”: it is not possible to improve f by more than ε. Such a point
x is sometimes called ε-optimal, where the case ε = 0 means optimal.
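As an illustration (our own standard example), take f(x) = cT x and gj(x) = −xj (j ≤ n) in (13.15), i.e., the LP problem min{cT x : Ax = b, x ≥ 0}. Then L(x, λ, ν) = cT x + λT(Ax − b) − νT x = (c + AT λ − ν)T x − bT λ, which is linear in x, so g(λ, ν) = −bT λ if ν = c + AT λ, and g(λ, ν) = −∞ otherwise. Since we require ν ≥ 0, the dual problem becomes max{−bT λ : AT λ + c ≥ 0}, which is (a form of) the classical LP duality.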
So, how good is this duality approach? For convex problems it is often
perfect as the next theorem says. We omit most of the proof; see [5, 1, 14].
For nonconvex problems one should expect a duality gap. Recall that G� (x)
denotes the Jacobi matrix of G = (g1 , g2 , . . . , gr ) at x.
Proof. We only prove the second part (see the references above). So assume
that f ∗ = g ∗ and the infimum and supremum are attained in the primal and
dual problems, respectively. Let x be a feasible point in the primal problem.
Then x is a minimum in the primal problem if and only if there are λ ∈ Rm
and ν ∈ Rr such that all the inequalities in the proof of Lemma 13.13 hold
with equality. This means that g(λ, ν) = L(x, λ, ν) and ν T G(x) = 0. But
L(x, λ, ν) is convex in x so it is minimized by x if and only if its gradient is
the zero vector, i.e., ∇f(x) + AT λ + G′(x)T ν = 0. This leads to the desired
characterization.
The assumption stated in the theorem, that gj (x� ) < 0 for each j, is called
the weak Slater condition.
Finally, we mention a theorem on convex optimization which is used in
several applications.
Proof. Assume first that ∇f(x∗)T (x − x∗) < 0 for some x ∈ C. Consider the
function g(ε) = f(x∗ + ε(x − x∗)) and apply the mean value theorem to this
function. Thus, for every ε > 0 there exists an s ∈ [0, 1] with

Since ∇f(x∗)T (x − x∗) < 0 and the gradient function is continuous (our stan-
dard assumption!) we have for sufficiently small ε > 0 that ∇f(x∗ + sε(x −
x∗))T (x − x∗) < 0. This implies that f(x∗ + ε(x − x∗)) < f(x∗). But, as C is
convex, the point x∗ + ε(x − x∗) also lies in C and so we conclude that x∗ is
not a local minimum. This proves that (13.17) is necessary for x∗ to be a local
minimum of f over C.
Next, assume that (13.17) holds. Using Theorem 10.9 we then get
so x∗ is a (global) minimum.
Ex. 1 — In the plane consider a rectangle R with sides of length x and y and
with perimeter equal to α (so 2x + 2y = α). Determine x and y so that the area
of R is largest possible.
where C = {(x1, x2) ∈ R2 : x1, x2 ≥ 0, 4x1 + x2 ≥ 8, 2x1 + 3x2 ≤ 12}.
Draw the feasible set C in the plane. Find the set of optimal solutions in each
of the cases given below.
a. f (x1 , x2 ) = 1.
b. f (x1 , x2 ) = x1 .
c. f (x1 , x2 ) = 3x1 + x2 .
d. f (x1 , x2 ) = (x1 − 1)2 + (x2 − 1)2 .
e. f (x1 , x2 ) = (x1 − 10)2 + (x2 − 8)2 .
Ex. 3 — Solve
max{x1 x2 · · · xn : ∑_{j=1}^{n} xj = 1}.
Ex. 6 — Solve
using the Lagrangian, see Theorem 13.1. Next, solve the problem by eliminating
x2 (using the constraint).
Ex. 9 — Let f : Rn → R be a twice differentiable function. Consider
the optimization problem
Ex. 10 — Consider the previous exercise. Explain how to convert this into
an unconstrained problem by eliminating xn . Find an
Ex. 12 — Solve
Hint: Use KKT and discuss depending on whether the constraint is active or
not.
Ex. 13 — Solve
Ex. 14 — Solve
min{x1 + x2 : x21 + x22 ≤ 2}.
Ex. 15 — Use Theorem 13.15 to find optimality conditions for the convex
optimization problem
min{f(x1, x2, . . . , xn) : xj ≥ 0 (j ≤ n), ∑_{j=1}^{n} xj ≤ 1}
Chapter 14
Constrained optimization -
methods
In this final chapter we present numerical methods for solving nonlinear opti-
mization problems. This is a huge area, so we can here only give a small taste
of it! The algorithms we present are, however, well-established methods.
minimize f (x)
subject to (14.1)
Ax = b
A(xk + h) = b
This is a quadratic optimization problem in h with a linear equality constraint
(Ah = 0) as in Example 13.9. The KKT conditions for this problem are thus
[ ∇2 f(xk)   AT ] [ h ]   =   [ −∇f(xk) ]
[     A       0 ] [ λ ]       [     0    ]
where λ is the Lagrange multiplier. The Newton step is only defined when the
coefficient matrix in the KKT problem is invertible. In that case, the problem
has a unique solution (h, λ) and we define dNt = h and call this the Newton
step.
Newton’s method for solving (14.1) may now be described as follows. Again
ε > 0 is a small stopping criterion.
This leads to an algorithm for Newton's method for linear equality con-
strained optimization which is very similar to the function newtonbacktrack
from Exercise 12.2.10. We do not state a formal convergence theorem for this
method, but it behaves very much like Newton’s method for unconstrained op-
timization. Actually, it can be seen that the method just described corresponds
to eliminating variables based on the equations Ax = b and using the uncon-
strained Newton method for the resulting (smaller) problem. So as soon as
the solution is “sufficiently near” an optimal solution, the convergence rate is
quadratic, so extremely few iterations are needed in this final stage.
The idea of the barrier method is to add to the objective a term that assigns very large
function values to points near the (relative) boundary of the feasible set, which
effectively becomes a barrier against leaving the feasible set.
Consider again the convex optimization problem
minimize f (x)
subject to
(14.2)
Ax = b
gj (x) ≤ 0 (j ≤ r)
holds, and therefore by Theorem 13.14 the KKT conditions for problem (14.2)
are
Ax = b, gj (x) ≤ 0 (j ≤ r)
ν ≥ 0, ∇f(x) + AT λ + G′(x)T ν = 0                    (14.3)
νj gj (x) = 0 (j ≤ r).
So, x is a minimum in (14.2) if and only if there are λ ∈ Rm and ν ∈ Rr such
that (14.3) holds.
Let us state an algorithm for Newton’s method for linear equality constrained
optimization with inequality constraints. Before we do this there is one final
problem we need to address: the α we get from backtracking line search may be
such that x + αdNt does not satisfy the inequality constraints (in the exercises you
will be asked to verify that this is the case for a certain function). The problem
is that the iterates xk + β^m s dk from Armijo's rule do not necessarily
satisfy the inequality constraints. However, we can choose m large enough so
that all succeeding iterates satisfy these constraints. We can reimplement the
function armijorule to address this as follows:
function alpha=armijoruleg1g2(f,df,x,d,g1,g2)
beta=0.2; s=0.5; sigma=10^(-3);
m=0;
while (g1(x+beta^m*s*d)>0 || g2(x+beta^m*s*d)>0)
m=m+1;
end
while (f(x)-f(x+beta^m*s*d) < -sigma *beta^m*s *(df(x))'*d)
m=m+1;
end
alpha = beta^m*s;
Here g1 and g2 are function handles which represent the inequality constraints,
and we have added a first loop, which ensures that m is large enough that the in-
equality constraints are satisfied. The rest of the code is as in the function
armijorule. After this we can also modify the function newtonbacktrack
from Exercise 12.2.10 to a function newtonbacktrackg1g2 in the obvious way,
so that the inequality constraints are passed to armijoruleg1g2:
function [x,numit]=newtonbacktrackg1g2LEC(f,df,d2f,A,b,x0,g1,g2)
epsilon=10^(-3);
x=x0;
maxit=100;
for numit=1:maxit
matr=[d2f(x) A'; A zeros(size(A,1))];
vect=[-df(x); zeros(size(A,1),1)];
solvedvals=matr\vect;
d=solvedvals(1:size(A,2));
eta=d'*d2f(x)*d;
if eta^2/2<epsilon
break;
end
alpha=armijoruleg1g2(f,df,x,d,g1,g2);
x=x+alpha*d;
end
Both these functions work in all cases where there are exactly two inequality
constraints.
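As a hypothetical usage example (the problem data below are our own): minimize x1² + x2² subject to x1 + x2 = 1 and the two inequality constraints −x1 ≤ 0, −x2 ≤ 0, starting from a feasible point:

f   = @(x) x(1)^2 + x(2)^2;
df  = @(x) [2*x(1); 2*x(2)];
d2f = @(x) [2 0; 0 2];
g1  = @(x) -x(1);
g2  = @(x) -x(2);
A = [1 1]; b = 1; x0 = [0.8; 0.2];      % x0 satisfies Ax0 = b, g1(x0) < 0, g2(x0) < 0
[x, numit] = newtonbacktrackg1g2LEC(f, df, d2f, A, b, x0, g1, g2)
% the exact minimum is at x = (0.5, 0.5)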
The interior-point barrier method is based on an approximation of problem
(14.2) by the barrier problem

minimize f(x) + µφ(x) subject to Ax = b,                    (14.4)

where

φ(x) = − ∑_{j=1}^{r} ln(−gj(x))

and µ > 0 is a parameter (in R). The function φ is called the (logarithmic)
barrier function and its domain is the relative interior of the feasible set

F◦ = {x ∈ Rn : Ax = b, gj(x) < 0 (j ≤ r)}.
The same set F ◦ is the feasible set of the barrier problem. The key properties
of the barrier function are:
• φ is concave, i.e. −φ is a convex function. This may be shown from the
definition using that gj is convex and the fact that the logarithm function
is concave and increasing.
• If {xk } is a sequence in F ◦ such that gj (xk ) → 0 for some j ≤ r, then
φ(xk ) → ∞. This is the barrier property.
• φ is twice differentiable and

∇φ(x) = ∑_{j=1}^{r} (1/(−gj(x))) ∇gj(x)                                        (14.5)

and

∇2 φ(x) = ∑_{j=1}^{r} (1/gj(x)²) ∇gj(x)∇gj(x)T + ∑_{j=1}^{r} (1/(−gj(x))) ∇2 gj(x).   (14.6)
The idea here is that for points x near the boundary of F the value of φ(x) is
very large. So, an iterative method which moves around in the interior F ◦ of F
will typically avoid points near the boundary as the logarithmic penalty term
makes the function value f (x) + µφ(x) very large.
The interior point method consists in solving the barrier problem, using
Newton’s method, for a sequence {µk } of (positive) barrier parameters; these
are called the outer iterations. The solution xk found for µ = µk is used as the
starting point in Newton’s method in the next outer iteration where µ = µk+1 .
The sequence {µk } is chosen such that µk → 0. When µ is very small, the
barrier function approximates the "ideal" penalty function η(x) which is zero
in F and +∞ when one of the inequalities gj(x) ≤ 0 is violated.
A natural question is why one bothers to solve the barrier problems for more
than one single µ, typically a very small value. The reason is that it would be
hard to find a good starting point for Newton’s method in that case; the Hessian
matrix of µφ is typically ill-conditioned for small µ.
Assume now that the barrier problem has a unique optimal solution x(µ);
this is true under reasonable assumptions that we shall return to. The point
x(µ) is called a central point. Assume also that Newton’s method may be
applied to solve the barrier problem. The set of points x(µ) for µ > 0 is called
the central path; it is a path (or curve) as we know it from multivariate calculus.
In order to investigate the central path we prefer to work with the equivalent
problem to (14.4) obtained by multiplying the objective function by 1/µ, so
i.e.,

(1/µ)∇f(x(µ)) + ∑_{j=1}^{r} (1/(−gj(x(µ)))) ∇gj(x(µ)) + AT λ = 0.           (14.8)
A fundamental question is: how far from being optimal is the central point
x(µ)? We now show that duality provides a very elegant way of answering this
question.
Theorem 14.1. For each µ > 0 the central point x(µ) satisfies
\[
f^* \le f(x(\mu)) \le f^* + r\mu.
\]
Proof. Let λ be the multiplier appearing in (14.8), and define the dual variables
\[
\lambda(\mu) = \mu\lambda, \qquad \nu_j(\mu) = \frac{-\mu}{g_j(x(\mu))} \quad (j = 1, 2, \ldots, r). \qquad (14.9)
\]
We want to show that the pair (λ(µ), ν(µ)) is a feasible solution in the dual problem to (14.2), see Section 13.3. So there are two properties to verify: that ν(µ) is nonnegative, and that x(µ) minimizes the Lagrangian function for the given (λ(µ), ν(µ)). The first property is immediate: as gj(x(µ)) < 0 and µ > 0, we get νj(µ) = −µ/gj(x(µ)) > 0 for each j. Concerning the second property, note first that the Lagrangian function L(x, λ, ν) = f(x) + λ^T(Ax − b) + ν^T G(x) is convex in x for given λ and ν ≥ 0. Thus, x minimizes this function if and only if ∇_x L = 0. Now,
\[
\nabla_x L(x(\mu), \lambda(\mu), \nu(\mu)) = \nabla f(x(\mu)) + A^T\lambda(\mu) + \sum_{j=1}^{r} \nu_j(\mu)\,\nabla g_j(x(\mu)) = 0
\]
by (14.8) (multiplied by µ) and the definition (14.9) of the dual variables. This shows that (λ(µ), ν(µ)) is a feasible solution to the dual problem.
By the weak duality lemma (Lemma 13.13) we therefore obtain
\[
f^* \ge g(\lambda(\mu), \nu(\mu)) = L(x(\mu), \lambda(\mu), \nu(\mu))
= f(x(\mu)) + \lambda(\mu)^T(Ax(\mu) - b) + \sum_{j=1}^{r} \nu_j(\mu)\,g_j(x(\mu))
= f(x(\mu)) - r\mu,
\]
where the last equality uses Ax(µ) = b and νj(µ)gj(x(µ)) = −µ for each j. Thus f(x(µ)) ≤ f^* + rµ, and since x(µ) is feasible in (14.2) we also have f^* ≤ f(x(µ)). This proves the theorem.
Corollary 14.2. The central path has the following property:
\[
\lim_{\mu \to 0} f(x(\mu)) = f^*.
\]
Moreover, if x(µ) converges to a point x^* as µ → 0, then x^* is an optimal solution of (14.2).

Proof. The first part follows from Theorem 14.1 by letting µ → 0. The second part follows from
\[
f(x^*) = f\Big(\lim_{\mu \to 0} x(\mu)\Big) = \lim_{\mu \to 0} f(x(\mu)) = f^*
\]
by the first part and the continuity of f; moreover x^* must be a feasible point by elementary topology.
After these considerations we may now present the interior-point barrier method. It uses a tolerance ε > 0 in its stopping criterion: by Theorem 14.1, stopping once rµ ≤ ε guarantees that f(x(µ)) is within ε of the optimal value f^*. This leads to the following implementation of the interior-point barrier method for the case of equality constraints and 2 inequality constraints:
function xopt=IPBopt(f,g1,g2,df,dg1,dg2,d2f,d2g1,d2g2,A,b,x0)
  % Interior-point barrier method for minimizing f subject to Ax = b and the
  % two inequality constraints g1(x) <= 0, g2(x) <= 0. Each barrier problem is
  % solved with newtonbacktrackg1g2LEC; the gradient and Hesse matrix of the
  % barrier objective f + mu*phi are built from (14.5) and (14.6).
  xopt=x0;
  mu=1;          % initial barrier parameter
  alpha=0.1;     % factor by which mu is reduced in each outer iteration
  r=2;           % number of inequality constraints
  epsilon=10^(-3);
  numitouter=0;
  while (r*mu>epsilon)   % by Theorem 14.1 the gap to f* is at most r*mu
    [xopt,numit]=newtonbacktrackg1g2LEC(...
        @(x)(f(x)-mu*log(-g1(x))-mu*log(-g2(x))),...
        @(x)(df(x) - mu*dg1(x)/g1(x) - mu*dg2(x)/g2(x)),...
        @(x)(d2f(x) + mu*dg1(x)*dg1(x)'/(g1(x)^2) ...
             + mu*dg2(x)*dg2(x)'/(g2(x)^2) - mu*d2g1(x)/g1(x)...
             - mu*d2g2(x)/g2(x) ),A,b,xopt,g1,g2);
    mu=alpha*mu;
    numitouter=numitouter+1;
    fprintf('Iteration %i:',numitouter);
    fprintf('(%f,%f)\n',xopt,f(xopt));
  end
Note that here we have inserted the expressions from Equation (14.5) and Equation (14.6) for the gradient and the Hesse matrix of the barrier function. The inputs are f, g1, g2, their gradients and their Hesse matrices, the matrix A, the vector b, and an initial feasible point x0. The function calls newtonbacktrackg1g2LEC, and returns the (approximately) optimal solution x^*. It also prints some information on the values of f during the iterations. The iterations used in Newton's method are called the inner iterations. There are various implementation details here that we do not discuss in depth. A typical value of α is 0.1. The choice of the initial µ0 can be difficult; if it is chosen too large, one may need many outer iterations. Another issue is how accurately one solves (14.4): it may be sufficient to find a near-optimal solution, as this saves inner iterations. For this reason the method is also called a path-following method; it stays in a neighborhood of the central path rather than on it.
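As an illustration, consider the following hypothetical call to IPBopt (this example is not from the text): minimize f(x) = x1² + x2² subject to x1 + x2 = 1, x1 ≥ 0 and x2 ≥ 0, i.e. g1(x) = −x1 and g2(x) = −x2. The exact solution is x = (1/2, 1/2), and (0.3, 0.7) is a strictly feasible starting point.

% Hypothetical example (not from the text): minimize x1^2 + x2^2 subject to
% x1 + x2 = 1, x1 >= 0, x2 >= 0. The exact solution is (1/2, 1/2).
f    = @(x) x(1)^2 + x(2)^2;   df   = @(x) 2*x;        d2f  = @(x) 2*eye(2);
g1   = @(x) -x(1);             dg1  = @(x) [-1; 0];    d2g1 = @(x) zeros(2);
g2   = @(x) -x(2);             dg2  = @(x) [0; -1];    d2g2 = @(x) zeros(2);
A = [1 1]; b = 1;
x0 = [0.3; 0.7];               % strictly feasible: x1, x2 > 0 and x1 + x2 = 1
xopt = IPBopt(f,g1,g2,df,dg1,dg2,d2f,d2g1,d2g2,A,b,x0);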
Finally, it should be mentioned that there exists a variant of the interior-
point barrier method which permits an infeasible starting point. For more de-
tails on this and various implementation issues one may consult [2] or [11].
Example 14.3. Consider the function f(x) = x² + 1, 2 ≤ x ≤ 4. Minimizing f can be considered as the problem of finding a minimum subject to the constraints g1(x) = 2 − x ≤ 0 and g2(x) = x − 4 ≤ 0. The barrier problem is to minimize the function
\[
f(x) + \mu\phi(x) = x^2 + 1 - \mu\ln(x - 2) - \mu\ln(4 - x).
\]
Some of these are drawn in Figure 14.1, where we can clearly see the effect of decreasing µ in the barrier function: the function converges to f pointwise, except at the boundaries. It is easy to see that x = 2 is the minimum of f under the given constraints, and that f(2) = 5 is the minimum value. There are no equality constraints in this case, so we can use the barrier method with Newton's method for unconstrained optimization, as implemented in Exercise 12.2.10. We need, however, to make sure also here that the iterates from Armijo's rule satisfy the inequality constraints. In fact, in the exercises you will be asked to verify that, for the function f considered here, some of the iterates from Armijo's rule do not satisfy the constraints.
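As a quick sanity check (this calculation is not carried out in the text), the central point x(µ) for this example is found by setting the derivative of the barrier objective to zero:
\[
\frac{d}{dx}\Big(x^2 + 1 - \mu\ln(x-2) - \mu\ln(4-x)\Big) = 2x - \frac{\mu}{x-2} + \frac{\mu}{4-x} = 0.
\]
For small µ this forces x − 2 to be of order µ, so x(µ) → 2 and f(x(µ)) → 5 as µ → 0, in agreement with Corollary 14.2.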
It is straightforward to implement a function newtonbacktrackg1g2 which implements Newton's method for two inequality constraints and no equality constraints (it can follow the implementation of the function newtonbacktrack from Exercise 12.2.10, and use the function armijoruleg1g2, just as newtonbacktrackg1g2LEC does). This leads to the following implementation of the interior-point barrier method for the case of no equality constraints, but 2 inequality constraints:
[Figure: four panels plotted for 2 ≤ x ≤ 4: (a) f(x); (b) barrier problem with µ = 0.2; (c) barrier problem with µ = 0.5; (d) barrier problem with µ = 1.]
Figure 14.1: The function from Example 14.3 and some of its barrier functions.
function xopt=IPBopt2(f,g1,g2,df,dg1,dg2,d2f,d2g1,d2g2,x0)
  % Interior-point barrier method for minimizing f subject only to the two
  % inequality constraints g1(x) <= 0, g2(x) <= 0 (no equality constraints).
  xopt=x0;
  mu=1; alpha=0.1; r=2; epsilon=10^(-3);
  numitouter=0;
  while (r*mu>epsilon)   % by Theorem 14.1 the gap to f* is at most r*mu
    [xopt,numit]=newtonbacktrackg1g2(...
        @(x)(f(x)-mu*log(-g1(x))-mu*log(-g2(x))),...
        @(x)(df(x) - mu*dg1(x)/g1(x) - mu*dg2(x)/g2(x)),...
        @(x)(d2f(x) + mu*dg1(x)*dg1(x)'/(g1(x)^2) ...
             + mu*dg2(x)*dg2(x)'/(g2(x)^2) ...
             - mu*d2g1(x)/g1(x) - mu*d2g2(x)/g2(x) ),xopt,g1,g2);
    mu=alpha*mu;
    numitouter=numitouter+1;
    fprintf('Iteration %i:',numitouter);
    fprintf('(%f,%f)\n',xopt,f(xopt));
  end
Note that this function also prints a summary for each of the outer iterations, so that we can follow the progress of the barrier method. We can now find the minimum of f with the following code, where we have substituted Matlab function handles for f, the gi, their gradients, and their Hesse matrices.
IPBopt2(@(x)(x.^2+1),@(x)(2-x),@(x)(x-4),...
@(x)(2*x),@(x)(-1),@(x)(1),...
@(x)(2),@(x)(0),@(x)(0),3)
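Since the constrained minimum is attained at the boundary point x = 2 with f(2) = 5, the returned value should be close to, but slightly larger than, 2, because the barrier keeps all iterates strictly inside the interval. A small check (not from the text) could look as follows:

% Hypothetical check, not from the text: compare the computed point with the
% exact constrained minimizer x = 2, where f(2) = 5.
xopt = IPBopt2(@(x)(x.^2+1),@(x)(2-x),@(x)(x-4),...
               @(x)(2*x),@(x)(-1),@(x)(1),...
               @(x)(2),@(x)(0),@(x)(0),3);
fprintf('xopt = %f, f(xopt) = %f (exact: 2 and 5)\n', xopt, xopt^2+1);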
Ex. 1 — Consider problem (14.1) in Section 14.1. Verify that the KKT con-
ditions for this problem are as stated there.
f. Show that the central path converges to the same solution as the one you found in d. and e.
Ex. 3 — Use the function IPBopt to verify the solution you found in Exer-
cise 2. Initially you must compute a feasible starting point x0 .
Ex. 4 — State the KKT conditions for finding the minimum of the constrained problem of Example 14.3, and solve these. Verify that you get the same solution as in Example 14.3.
Ex. 5 — In the function IPBopt2, replace the call to the function newtonbacktrackg1g2
with a call to the function newtonbacktrack, with the obvious modification to
the parameters. Verify that the code does not return the expected minimum in
this case.
Ex. 6 — Consider the function f(x) = (x − 3)², with the same constraints 2 ≤ x ≤ 4 as in Example 14.3. Verify in this case that the function IPBopt2 returns the correct minimum regardless of whether you call newtonbacktrackg1g2 or newtonbacktrack. This shows that, at least in some cases where the minimum is an interior point, the iterates from Newton's method satisfy the inequality constraints as well.