Nonlinear Programming
Yan-Bin Jia
Nov 1, 2022
1 Introduction
Given a single function f that depends on one or more independent variables, we want to find
the values of those variables where f is maximized or minimized. Often the computational cost is
dominated by the cost of evaluating f (and also perhaps its partial derivatives with respect to all
variables).
Finding a global extremum is, in general, a very difficult problem. Two standard heuristics are
widely used: i) find local extrema starting from widely varying values of the independent variables,
and then pick the most extreme of these; ii) perturb a local extremum by taking a finite amplitude
step away from it, and then see if your routine can get to a better point, or “always” to the same
one. Recently, “simulated annealing” methods have demonstrated important successes on a variety
of global optimization problems.
The diagram below describes a function in an interval [a, b]. The derivative vanishes at the
points B, C, D, E, F . The points B and D are local but not global maxima. The points C and E
are local but not global minima. The global maximum occurs at F where the derivative vanishes.
The global minimum occurs at the left endpoint A of the interval so that the derivative need not
vanish.
[Figure: a function on the interval [a, b]; the derivative vanishes at B, C, D, E, F; the global maximum is at F, and the global minimum is at the endpoint A.]
Recall how the optimization process works in one dimension when f is a function from R to R.
First we might compute the critical set of f,
$$c_f = \{\, x \mid f'(x) = 0 \,\}.$$
By examining this set we can determine those x that are global minima or maxima. Notice that the computation of c_f seems to entail finding the zeros of the derivative f′. In other words, we have reduced the optimization problem to the root-finding problem. So why do we need to study nonlinear programming? In higher dimensions, it is often easier to find a (local) minimum than one would expect. Intuitively, this is because f′ is not an arbitrary function but a derivative whose integral (namely f) is given. We will thus have a lot more to say about optimization in higher dimensions than we did about root finding.
Since a maximization problem can be turned into a minimization problem simply by negating
the objective function, we will deal with minimization only from now on.
2 Golden Section Search
Suppose a continuous function f attains a minimum inside an interval, bracketed by a triple of points a < b < c with f(b) < f(a) and f(b) < f(c). Now choose a point x, say, halfway between a and b. If f(x) > f(b), as shown in the figure below, then the new bracketing triple becomes [x, b, c]. If f(x) < f(b), then the new bracketing triple becomes [a, x, b]. To ensure that the interval [a, c] shrinks toward a point, one should alternate the subinterval being halved, at least after a few rounds: for instance, halve [a, b] in this round and [b, c] in the next round.
[Figure: a bracketing triple a < b < c with a trial point x between a and b.]
The figure on the next page shows a more complete example of how golden section search works.
The minimum is originally bracketed by points 1, 3, 2. The function is evaluated at 4, which
replaces 2; then at 5, which replaces 1; and then at 6, which replaces 4. Note that the center point
is always lower than the two outside points. The minimum is bracketed by points 5, 3, 6 after the
three steps.
[Figure: three steps of golden section search; the bracket evolves from (1, 3, 2) to (1, 3, 4), then (5, 3, 4), then (5, 3, 6).]
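For concreteness, below is a minimal Python sketch of this bracket-shrinking search. Placing the trial point a golden-ratio fraction into the larger subinterval is the standard choice; the tolerance and the example function are illustrative.

```python
import math

def golden_section(f, a, b, c, tol=1e-8):
    """Minimize f over a bracketing triple a < b < c with
    f(b) < f(a) and f(b) < f(c).  Returns the final abscissa b."""
    r = (math.sqrt(5.0) - 1.0) / 2.0      # inverse golden ratio, ~0.618
    while c - a > tol:
        if b - a > c - b:                  # [a, b] is the larger subinterval
            x = b - (1.0 - r) * (b - a)    # trial point inside [a, b]
            if f(x) < f(b):
                b, c = x, b                # new triple [a, x, b]
            else:
                a = x                      # new triple [x, b, c]
        else:                              # [b, c] is the larger subinterval
            x = b + (1.0 - r) * (c - b)    # trial point inside [b, c]
            if f(x) < f(b):
                a, b = b, x                # new triple [b, x, c]
            else:
                c = x                      # new triple [a, b, x]
    return b

# Example: minimum of (x - 2)^2, bracketed by 0 < 1 < 5.
print(golden_section(lambda x: (x - 2.0) ** 2, 0.0, 1.0, 5.0))  # ~2.0
```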
Golden section search applies to extremizing functions of one variable only, just as bisection applies only to finding roots of such functions. Now we consider functions of more than one variable. Let the function f : R^n → R be twice continuously differentiable, that is, f ∈ C². A point x∗ ∈ R^n is said to be a relative minimum point or a local minimum point if there is an ε > 0 such that f(x) ≥ f(x∗) for all x ∈ R^n with ‖x − x∗‖ < ε. If f(x) > f(x∗) for all x ≠ x∗ with ‖x − x∗‖ < ε, then x∗ is said to be a strict relative minimum point of f.
A point x∗ is said to be a global minimum point of f if f(x) ≥ f(x∗) for all x. It is said to be a strict global minimum point if f(x) > f(x∗) for all x ≠ x∗.
Let x = (x_1, x_2, ..., x_n)^T. Recall that the gradient of f is the vector
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \cdots, \frac{\partial f}{\partial x_n} \right). \qquad (1)$$
It gives the direction in which the value of f increases the fastest. The Hessian H of f is defined
as an n × n matrix:
$$H(x) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}.$$
If x∗ is a relative minimum point of f, then the following first- and second-order necessary conditions hold:
i) ∇f(x∗) = 0,
ii) d^T H(x∗) d ≥ 0 for every d ∈ R^n.
In one dimension, the above necessary conditions are familiar to us: f′(x∗) = 0 and f″(x∗) ≥ 0 at a relative minimum x∗.
[Figure: a one-dimensional function with a relative minimum at x∗.]
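These conditions can be checked numerically at a candidate point. The following sketch approximates the gradient and Hessian by central differences; the helper functions, step sizes, and the test function are illustrative, not part of the text.

```python
import numpy as np

def gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

def hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return H

# Candidate minimum of f(x, y) = x^2 + 3y^2 at the origin.
f = lambda x: x[0] ** 2 + 3.0 * x[1] ** 2
x_star = np.array([0.0, 0.0])
print(gradient(f, x_star))                     # ~(0, 0): condition i)
print(np.linalg.eigvalsh(hessian(f, x_star)))  # ~(2, 6), all >= 0: condition ii)
```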
The Hessian H(x∗) at a relative minimum x∗ is symmetric positive semi-definite, that is, x^T H(x∗) x ≥ 0 for any x. Conversely, if ∇f(x∗) = 0 and H(x∗) is positive definite, that is, x^T H(x∗) x > 0 for any x ≠ 0, then x∗ is a strict relative minimum. This gives us sufficient conditions for a relative minimum.
Consider, for example, the quadratic function f(x) = c + b^T x + ½ x^T A x, where A is a symmetric positive definite n × n matrix. Then
$$\nabla f(x) = b^T + x^T A, \qquad H(x) = A.$$
So there is a single extremum, located at x∗, which is the solution of the system Ax = −b. Since A is positive definite, this extremum is a strict local minimum. Since it is the only one, it is in fact the global minimum.
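In code, minimizing such a quadratic therefore reduces to a single linear solve. A small numpy sketch, with illustrative values of A and b:

```python
import numpy as np

# f(x) = c + b^T x + (1/2) x^T A x with A symmetric positive definite.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # illustrative SPD matrix
b = np.array([1.0, 2.0])          # illustrative vector

x_star = np.linalg.solve(A, -b)   # the unique minimizer solves A x = -b
print(x_star)
print(b + A @ x_star)             # gradient at x_star: ~(0, 0)
```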
If we neglect the higher order terms inside the Big-O in the Taylor series (2), every function in
one dimension behaves like this (locally near a minimum):
$$y = c + bx + \frac{1}{2} a x^2, \qquad y' = b + ax, \qquad y'' = a > 0.$$
[Figure: the parabola y = c + bx + ax²/2, attaining its minimum value c − b²/(2a) at x = −b/a.]
3 Convex Function
We just saw that H is positive definite at and near a strict local minimum. By Taylor’s theorem
every function looks like a quadratic near a strict local minimum. Furthermore, if f happens to
be quadratic globally, formed from a symmetric positive definite matrix A as in (3), then it has a
unique local minimum. This local minimum is therefore a global minimum.
Can we say more about functions whose local minima are global minima? A broad class of such
functions are the convex functions.
A function f : Ω → R defined on a convex domain Ω is said to be convex if for every pair of
points x1 , x2 ∈ Ω and any α with 0 ≤ α ≤ 1, the following holds:
$$f\big(\alpha x_1 + (1 - \alpha) x_2\big) \le \alpha f(x_1) + (1 - \alpha) f(x_2).$$
[Figure: a convex function over [a, b]; the chord from (x₁, f(x₁)) to (x₂, f(x₂)) lies on or above the graph.]
Convexity of a smooth function can be characterized in terms of its Hessian, as stated in the following proposition.
Proposition 1 Let f ∈ C². Then f is convex over a convex set Ω containing an interior point if and only if the Hessian matrix H is positive semi-definite in Ω.
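Proposition 1 suggests a practical numerical test: sample points of Ω and verify that the smallest eigenvalue of the Hessian is nonnegative at each. A sketch, assuming the Hessian can be evaluated exactly (the example function is illustrative):

```python
import numpy as np

def psd_on_samples(hess, samples, tol=1e-9):
    """True if hess(x) is positive semi-definite at every sampled x."""
    return all(np.linalg.eigvalsh(hess(x)).min() >= -tol for x in samples)

# f(x, y) = x^4 + y^2 has Hessian diag(12 x^2, 2), PSD everywhere.
hess = lambda x: np.diag([12.0 * x[0] ** 2, 2.0])
pts = [np.array(p, dtype=float) for p in [(-1, 0), (0, 0), (2, 3)]]
print(psd_on_samples(hess, pts))   # True
```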
The minima of convex functions are global minima, as shown by the following theorem.
5
Theorem 2 Let f be a convex function defined on a convex set Ω. Then the set Γ where f achieves
its minimum value is convex. Furthermore, any relative minimum is a global minimum.
Proof If f has no relative minima then the theorem is valid by default. Assume therefore that c₀ is the minimum value of f on Ω. Define the set
Γ = { x ∈ Ω | f(x) = c₀ }.
Suppose x₁, x₂ ∈ Γ. For any α with 0 ≤ α ≤ 1, convexity gives
$$f\big(\alpha x_1 + (1 - \alpha) x_2\big) \le \alpha f(x_1) + (1 - \alpha) f(x_2) = c_0,$$
and equality must hold since c₀ is the minimum value of f on Ω. In other words, f is also minimized at the point αx₁ + (1 − α)x₂. Thus all the points on the line segment connecting x₁ and x₂ are in Γ. Since x₁ and x₂ are arbitrarily chosen from Γ, the set must be convex.
Suppose now that x∗ ∈ Ω is a relative minimum point of f but not a global minimum. Then there exists some y ∈ Ω such that f(y) < f(x∗). On the line segment { αy + (1 − α)x∗ | 0 ≤ α ≤ 1 } we have, for 0 < α ≤ 1,
$$\begin{aligned}
f\big(\alpha y + (1 - \alpha) x^*\big) &\le \alpha f(y) + (1 - \alpha) f(x^*) \\
&< \alpha f(x^*) + (1 - \alpha) f(x^*) \\
&= f(x^*),
\end{aligned}$$
so points arbitrarily close to x∗ (small α) have strictly smaller function values, contradicting that x∗ is a relative minimum point.
4 Steepest Descent
Now let us return to the general problem of minimizing a function f : Rn → R. We want to find
the critical points where the gradient ∇f = 0. This is a system of n equations:
$$\frac{\partial f}{\partial x_1} = 0, \quad \ldots, \quad \frac{\partial f}{\partial x_n} = 0.$$
We might expect to encounter the usual difficulties associated with higher-dimensional root finding.
Fortunately, the derivative nature of the equations imposes some helpful structure on the problem.
In one dimension, to find a local minimum, we might employ the following rule: move to the left if f′(x) > 0 and to the right if f′(x) < 0, that is, step in the direction of −f′(x). In higher dimensions we use the negative gradient −∇f to point us toward a minimum. This is called steepest descent. In particular, the algorithm repeatedly performs one-dimensional minimizations along the direction of steepest descent.
In the algorithm, we start with x(0) as an approximation to a local minimum of f : R^n → R. At the m-th iteration we set u = ∇f(x(m)) and minimize the one-dimensional function
$$g(t) = f\big(x^{(m)} - t\,u\big)$$
over t ≥ 0; if t∗ is the minimizer, the next iterate is x(m+1) = x(m) − t∗ u. The method is also referred to as a line search strategy since during each iteration it moves on the line x(m) − tu away from x(m) until encountering a local minimum of g(t). How should we carry out the line minimization of g(t)? Any way you want. For instance, solve g′(t) = 0 directly. Or step along the line until you produce a bracket, and then refine it (e.g., by golden section search).
Consider, as an example, minimizing f(x₁, x₂) = x₁³ + x₂³ − 2x₁² + 3x₂² − 8 starting at x(0) = (1, −1)^T, where ∇f = (3x₁² − 4x₁, 3x₂² + 6x₂) evaluates to (−1, −3). Thus, in the first step of steepest descent, we look for a minimum of the function
$$\begin{aligned}
g(t) &= f\big(x^{(0)} - t \nabla f(x^{(0)})\big) \\
&= f(1 + t, -1 + 3t) \\
&= (1 + t)^3 + (-1 + 3t)^3 - 2(1 + t)^2 + 3(-1 + 3t)^2 - 8.
\end{aligned}$$
Solving g′(t) = 0 yields t = 1/3, which gives x(1) = (4/3, 0)^T. The gradient ∇f vanishes at x(1). Therefore f achieves at least a local minimum at (4/3, 0)^T.
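Here is a minimal Python sketch of steepest descent with a one-dimensional line search, applied to this example. The use of scipy's bounded scalar minimizer and the search interval (0, 10) are illustrative choices, not part of the method.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x0, tol=1e-8, max_iter=100):
    """Repeated line minimization along the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        u = grad(x)
        if np.linalg.norm(u) < tol:      # gradient vanishes: stop
            break
        g = lambda t: f(x - t * u)       # f restricted to the ray
        t_star = minimize_scalar(g, bounds=(0.0, 10.0),
                                 method='bounded').x
        x = x - t_star * u
    return x

# The example above: f(x1, x2) = x1^3 + x2^3 - 2 x1^2 + 3 x2^2 - 8.
f = lambda x: x[0]**3 + x[1]**3 - 2.0*x[0]**2 + 3.0*x[1]**2 - 8.0
grad = lambda x: np.array([3.0*x[0]**2 - 4.0*x[0],
                           3.0*x[1]**2 + 6.0*x[1]])
print(steepest_descent(f, grad, [1.0, -1.0]))   # ~(4/3, 0)
```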
It turns out that steepest descent converges globally to a relative minimum, and the convergence rate is linear. Let A and a be the largest and smallest eigenvalues, respectively, of the Hessian H at the local minimum. Then the following holds for the ratio between the errors at two adjacent steps:
$$\frac{|e_{m+1}|}{|e_m|} \sim \left( \frac{A - a}{A + a} \right)^2.$$
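For the earlier example f(x₁, x₂) = x₁³ + x₂³ − 2x₁² + 3x₂² − 8, the Hessian is diag(6x₁ − 4, 6x₂ + 6), so at the minimum H(4/3, 0) = diag(4, 6). With A = 6 and a = 4, the ratio is ((6 − 4)/(6 + 4))² = 1/25, so each step shrinks the error by roughly a factor of 25; convergence is fast here because the two eigenvalues are close. When A ≫ a (a long, narrow valley), the ratio approaches 1 and progress becomes very slow.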
The steepest descent method can take many steps. The problem is that the method repeatedly moves in the current steepest direction until reaching a minimum along that direction. Consequently, consecutive steps are perpendicular to each other (this behavior is illustrated in the figure below). To see why, consider moving from x(k) along u = −∇f(x(k)) to reach x(k+1) = x(k) + t∗u, where f no longer decreases. The rate at which the value of f changes, after a movement of tu from x(k), is measured by the directional derivative ∇f(x(k) + tu) · u. This derivative is negative at x(k) (when t = 0) and does not change sign before x(k+1) (when t = t∗). Suppose ∇f(x(k+1)) were not perpendicular to ∇f(x(k)). Then ∇f(x(k+1)) · u < 0 would have to hold, which means that the value of f would decrease further if we continued moving in the direction u past x(k+1), contradicting the choice of t∗.
So there are a number of back and forth steps that only slowly converge to a minimum. This
situation can get very bad in a narrow valley, where successive steps undo some of their previous
progress. Ideally, in Rn we would like to take n perpendicular steps, each of which attains a
minimum. This idea will lead to the conjugate gradient method.
[Figure: contour lines f(x) = c₁, c₂, c₃, c₄ of a function with a narrow valley; from the starting point, successive steepest descent steps zigzag, each perpendicular to the previous one.]
A Matrix Calculus
This appendix presents some basic rules of differentiations of scalars and vectors with respect to
vectors and matrices. These rules will be used later on in the course.
A.1 Differentiation With Respect to a Vector
The derivative of a vector function f (x) = (f1 (x), f2 (x), . . . , fm (x))T with respect to x is an m × n
matrix:
$$\frac{\partial f}{\partial x} = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{pmatrix}. \qquad (4)$$
For a constant vector c and a constant matrix A, it follows that
$$\frac{\partial (c^T x)}{\partial x} = c^T, \qquad (5)$$
$$\frac{\partial (Ax)}{\partial x} = A. \qquad (6)$$
For two vectors u and v,
$$\frac{\partial (u \cdot v)}{\partial v} = \frac{\partial (u^T v)}{\partial v} = u^T, \qquad
\frac{\partial (u \cdot v)}{\partial u} = \frac{\partial (v^T u)}{\partial u} = v^T.$$
To differentiate the cross product u × v, for any vector w = (w1 , w2 , w3 )T we denote by w× the
following 3 × 3 anti-symmetric matrix:
$$w^\times = \begin{pmatrix} 0 & -w_3 & w_2 \\ w_3 & 0 & -w_1 \\ -w_2 & w_1 & 0 \end{pmatrix}.$$
By construction, the product of the matrix u× with v is the cross product u × v. It then follows that
$$\frac{\partial (u \times v)}{\partial v} = u^\times, \qquad
\frac{\partial (u \times v)}{\partial u} = -\frac{\partial (v \times u)}{\partial u} = -v^\times.$$
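A quick numerical check of the identity (u×)v = u × v; the vectors are arbitrary:

```python
import numpy as np

def skew(w):
    """The anti-symmetric cross-product matrix w x of w."""
    return np.array([[ 0.0,  -w[2],  w[1]],
                     [ w[2],  0.0,  -w[0]],
                     [-w[1],  w[0],  0.0]])

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
print(skew(u) @ v)        # (-3, 6, -3)
print(np.cross(u, v))     # identical
```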
Now let us look at how to differentiate the scalar xT Ax, where x is an n-vector and A an n × n
matrix, with respect to x. We have
$$\begin{aligned}
\frac{\partial (x^T A x)}{\partial x}
&= \left.\frac{\partial}{\partial x}\, x^T (A y)\right|_{y=x} + \left.\frac{\partial}{\partial x}\, (y^T A) x\right|_{y=x} \\
&= \left.\frac{\partial}{\partial x}\, (A y)^T x\right|_{y=x} + y^T A\big|_{y=x} \\
&= (A y)^T\big|_{y=x} + x^T A \\
&= x^T A^T + x^T A.
\end{aligned}$$
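This identity is easy to verify numerically by comparing the formula x^T A^T + x^T A against central differences; the matrix and vector below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))   # arbitrary, not necessarily symmetric
x = rng.standard_normal(3)

f = lambda z: z @ A @ z           # the scalar z^T A z
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2.0 * h)
                    for e in np.eye(3)])
print(numeric)                    # central-difference gradient
print(x @ (A.T + A))              # formula: x^T A^T + x^T A
```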
Let C = (cij )m×n and X = (xij )m×n be two matrices. The trace of the product matrix CX T is
the sum of its diagonal entries:
$$\mathrm{Tr}(C X^T) = \sum_{r=1}^{m} \sum_{s=1}^{n} c_{rs} x_{rs}.$$
Immediately, we have
$$\frac{\partial}{\partial x_{ij}} \mathrm{Tr}(C X^T) = c_{ij},$$
which implies that
$$\frac{\partial}{\partial X} \mathrm{Tr}(C X^T) = C.$$
Next, we differentiate the trace of the product matrix XCX^T as follows:
$$\begin{aligned}
\frac{\partial}{\partial X} \mathrm{Tr}(X C X^T)
&= \left.\frac{\partial}{\partial X} \mathrm{Tr}(X C Y^T)\right|_{Y=X} + \left.\frac{\partial}{\partial X} \mathrm{Tr}\big((Y C) X^T\big)\right|_{Y=X} \\
&= \left.\frac{\partial}{\partial X} \mathrm{Tr}\big((Y C^T) X^T\big)\right|_{Y=X} + Y C\big|_{Y=X} \\
&= Y C^T\big|_{Y=X} + X C \\
&= X C^T + X C.
\end{aligned}$$
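Both trace identities can be verified the same way; a sketch with arbitrary small matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((2, 3))
D = rng.standard_normal((3, 3))
X = rng.standard_normal((2, 3))
h = 1e-6

def num_grad(fun, X):
    """Entrywise central-difference derivative of a scalar fun(X)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (fun(X + E) - fun(X - E)) / (2.0 * h)
    return G

print(np.allclose(num_grad(lambda Z: np.trace(C @ Z.T), X), C))
print(np.allclose(num_grad(lambda Z: np.trace(Z @ D @ Z.T), X),
                  X @ D.T + X @ D))
```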
Let A(t) = (a_{ij}(t)) be a matrix whose entries are functions of a scalar t. Differentiation and integration of the matrix operate element-wise:
$$\dot{A}(t) = \big(\dot{a}_{ij}(t)\big), \qquad \int A(t)\, dt = \left( \int a_{ij}(t)\, dt \right).$$
Suppose A(t) is n × n and non-singular. Then we have AA−1 = In , the n × n identity matrix.
Thus,
$$0 = \frac{d}{dt}(A A^{-1}) = \dot{A} A^{-1} + A \frac{d}{dt}(A^{-1}),$$
which yields the derivative of the inverse matrix:
$$\frac{d}{dt}(A^{-1}) = -A^{-1} \dot{A} A^{-1}. \qquad (8)$$
An interesting case is with the rotation matrix R, which is also orthogonal, i.e., RR^T = R^T R = I_n. Differentiating RR^T = I_n, we obtain
$$0 = \frac{d}{dt}(R R^T) = \dot{R} R^T + R \dot{R}^T = \dot{R} R^T + (\dot{R} R^T)^T.$$
The above implies that the matrix ṘR^T is anti-symmetric. Therefore, it can be written as
$$\dot{R} R^T = \begin{pmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{pmatrix}.$$
The vector ω = (ω_x, ω_y, ω_z)^T is the angular velocity; the cross product ω × v = ṘR^T v describes the rate of change of the vector v (i.e., the velocity of the endpoint of the vector) due to the body rotation described by R.
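As a concrete check, take R(t) to be a rotation by angle θ(t) about the z-axis:
$$R = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
\dot{R} R^T = \dot{\theta} \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},$$
so ω = (0, 0, θ̇)^T: the angular velocity points along the rotation axis with magnitude θ̇, as expected.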
Using the Taylor expansion, we define the matrix exponential function
$$e^{At} = \sum_{j=0}^{\infty} \frac{(At)^j}{j!},$$
where A is an n × n matrix. The function's importance comes from its role in the solution of the linear system ẋ = Ax + bu, where u is the control vector. It has the derivative
$$\frac{d}{dt} e^{At} = A e^{At} = e^{At} A.$$
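More explicitly (a standard fact about linear systems), with initial state x(0) the system ẋ = Ax + bu has the solution
$$x(t) = e^{At} x(0) + \int_0^t e^{A(t - \tau)}\, b\, u(\tau)\, d\tau,$$
as can be verified by differentiating the right-hand side and applying the derivative formula above.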