
Optimization Methods

Lecture 2

Solmaz S. Kia
Mechanical and Aerospace Engineering Dept.
University of California Irvine
[email protected]

Reading: Sections 7.1-7.5, 8.6, 8.8 of Ref[2].

Unconstrained optimization

x* = argmin_{x ∈ ℝⁿ} f(x)

x* ∈ ℝⁿ is an unconstrained local minimum of f if

∃ ε > 0 s.t. f(x*) ≤ f(x), ∀x with ‖x − x*‖ < ε.

x* ∈ ℝⁿ is an unconstrained global minimum of f if

f(x*) ≤ f(x), ∀x ∈ ℝⁿ.

x* ∈ ℝⁿ is an unconstrained strict local minimum of f if

∃ ε > 0 s.t. f(x*) < f(x), ∀x with ‖x − x*‖ < ε.

x* ∈ ℝⁿ is an unconstrained strict global minimum of f if

f(x*) < f(x), ∀x ∈ ℝⁿ.

Necessary conditions for optimality

OPT:  x* = argmin f(x)  subject to  x ∈ X   (X is the constraint set);
for X = ℝⁿ the problem becomes unconstrained.

d ∈ ℝⁿ is a feasible direction at x ∈ X for OPT if (x + αd) ∈ X for some ᾱ > 0 and all α ∈ [0, ᾱ].

Proposition:
First-order necessary condition (FONC): consider OPT and let f ∈ C¹. If x* is a local minimizer of f, then
∇f(x*)ᵀd ≥ 0 for every feasible direction d ∈ ℝⁿ at x*.
Second-order necessary condition (SONC): let f ∈ C². If x* is a local minimizer of f, then for every feasible direction d ∈ ℝⁿ at x*:
(i) ∇f(x*)ᵀd ≥ 0,
(ii) if ∇f(x*) = 0, then dᵀ∇²f(x*)d ≥ 0.
Necessary conditions for optimality

x* = argmin_{x ∈ ℝⁿ} f(x)

Proposition (necessary optimality conditions)


Let x* be an unconstrained local minimum of f : ℝⁿ → ℝ and assume that f is
continuously differentiable in an open set S containing x*. Then

∇f(x*) = 0. (First-Order Necessary Condition)

If in addition f is twice continuously differentiable within S, then

∇²f(x*) is positive semidefinite. (Second-Order Necessary Condition)

Proof: see page 13-14 of Ref[1].

Stationary point: any point x̄ ∈ ℝⁿ that satisfies ∇f(x̄) = 0 is called a stationary
point. A stationary point can be a minimum, maximum, or saddle point of the cost
function f.
Sufficient conditions for optimality

x* = argmin_{x ∈ ℝⁿ} f(x)

Proposition (Second order sufficient optimality conditions)


Let f : ℝⁿ → ℝ be twice continuously differentiable in an open set S. Suppose
that a vector x* satisfies the conditions

∇f(x*) = 0,  ∇²f(x*) positive definite.

Then x* is a strict unconstrained local minimum of f. In particular, there exist
scalars γ > 0 and ε > 0 such that

f(x) ≥ f(x*) + (γ/2)‖x − x*‖²,  ∀x with ‖x − x*‖ < ε.

Proof: see page 15 of Ref[1].
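
As an illustration (not from the slides), these conditions can be checked numerically for a given twice-differentiable f: evaluate the gradient at a candidate point and test the eigenvalues of the Hessian. A minimal Python/NumPy sketch, using an assumed quadratic example:

```python
import numpy as np

# assumed example: f(x) = x1^2 + 2*x2^2, whose unique minimizer is the origin
def grad_f(x):
    return np.array([2.0 * x[0], 4.0 * x[1]])

def hess_f(x):
    return np.array([[2.0, 0.0],
                     [0.0, 4.0]])

x_star = np.array([0.0, 0.0])          # candidate point

# First-order necessary condition: gradient (numerically) zero
fonc = np.linalg.norm(grad_f(x_star)) < 1e-8

# Second-order sufficient condition: Hessian positive definite (all eigenvalues > 0)
eigvals = np.linalg.eigvalsh(hess_f(x_star))
sosc = bool(np.all(eigvals > 0))

print("FONC:", fonc, "| Hessian eigenvalues:", eigvals, "| SOSC:", sosc)
```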

Stationary points: example

f(x) = x³:
∇f(x) = 3x²;  stationary point: ∇f(0) = 0;  x* = 0 is an inflection point.
∇²f(x) = 6x, so ∇²f(0) = 0.

f(x) = |x|³:
∇f(x) = 3x² for x ≥ 0 and −3x² for x < 0;  stationary point: ∇f(0) = 0;  x* = 0 is a local minimizer.
∇²f(x) = 6|x|, so ∇²f(0) = 0.

f(x) = −|x|³:
∇f(x) = −3x² for x ≥ 0 and 3x² for x < 0;  stationary point: ∇f(0) = 0;  x* = 0 is a local maximizer.
∇²f(x) = −6|x|, so ∇²f(0) = 0.
Note that in all three cases x* = 0 satisfies the FONC and SONC, but satisfying the necessary
conditions does not mean that these points are minimizers. Note also that x* does not satisfy the
second-order sufficient conditions (∇²f(0) = 0 is not positive definite).
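
To make this concrete in code (an illustration, not part of the slides), the f(x) = x³ case can be checked numerically: both necessary conditions hold at x* = 0, yet 0 is not a minimizer.

```python
import numpy as np

f = lambda x: x**3
df = lambda x: 3.0 * x**2     # first derivative (gradient in 1D)
d2f = lambda x: 6.0 * x       # second derivative (Hessian in 1D)

x_star = 0.0
print("FONC holds:", np.isclose(df(x_star), 0.0))   # gradient vanishes at 0
print("SONC holds:", d2f(x_star) >= 0.0)            # second derivative is >= 0 at 0
# ...but x* = 0 is not a local minimizer: f is smaller just to the left of 0
print("f(-1e-3) < f(0):", f(-1e-3) < f(x_star))
```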
Singular and non-singular local minimum

A local minimum point that does not satisfy the sufficiency conditions
∇f(x*) = 0, ∇²f(x*) positive definite, is called singular; otherwise it is called nonsingular.
Singular local minima are harder to deal with:
In the absence of convexity of f, their optimality cannot be ascertained using
easily verifiable sufficient conditions.
In their neighborhood, the behavior of most commonly used optimization
algorithms tends to be slow and/or erratic.

Convex sets and convex functions (see Appendix B of Ref[1])

Convex set Ω: the line segment connecting any two points p, q ∈ Ω belongs to Ω:

∀p, q ∈ Ω : (t p + (1 − t) q) ∈ Ω for all t ∈ [0, 1].

Convex function: f is convex over a convex set Ω iff

f(t x1 + (1 − t) x2) ≤ t f(x1) + (1 − t) f(x2), ∀x1, x2 ∈ Ω and t ∈ [0, 1].

Convex function

Convex function: f is convex over a convex set Ω iff

f(t x1 + (1 − t) x2) ≤ t f(x1) + (1 − t) f(x2), ∀x1, x2 ∈ Ω and t ∈ [0, 1].

When f is differentiable, it is convex over the convex set Ω iff

f(x) ≥ f(x0) + ∇f(x0)ᵀ(x − x0), ∀x0, x ∈ Ω.

When f is twice differentiable, it is convex over the convex set Ω iff

∇²f(x) ⪰ 0 (positive semidefinite), ∀x ∈ Ω.
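
As an aside (not on the slides), the first-order characterization can be spot-checked numerically for a candidate convex function by sampling pairs of points; a minimal sketch with an assumed example f(x) = ‖x‖² + e^x1:

```python
import numpy as np

# assumed convex example: f(x) = ||x||^2 + exp(x1)
f = lambda x: np.dot(x, x) + np.exp(x[0])
grad_f = lambda x: 2.0 * x + np.array([np.exp(x[0]), 0.0])

rng = np.random.default_rng(0)
holds = True
for _ in range(1000):
    x0, x = rng.normal(size=2), rng.normal(size=2)
    # first-order condition: f(x) >= f(x0) + grad_f(x0)^T (x - x0)
    holds &= bool(f(x) >= f(x0) + grad_f(x0) @ (x - x0) - 1e-12)
print("First-order convexity inequality held on all samples:", holds)
```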
Optimality conditions for convex functions

Proposition (Optimality conditions for convex functions)


Let f : X → ℝ be a convex function over the convex set X.
(a) A local minimum of f over X is also a global minimum over X. If in addition
f is strictly convex, then there exists at most one global minimum of f.
(b) If f is convex and the set X is open, then ∇f(x*) = 0 is a necessary and
sufficient condition for a vector x* ∈ X to be a global minimum of f over X.

Proof: see page 14 of Ref[1].

For part (a) use f(αx* + (1 − α)x̄) ≤ αf(x*) + (1 − α)f(x̄).

For part (b) use f(x) ≥ f(x*) + ∇f(x*)ᵀ(x − x*), ∀x ∈ X.

Numerical solvers (see Section 1.2 of Ref[1])

Iterative descent methods


start from x0 ∈ ℝⁿ (initial guess)

successively generate vectors x1, x2, ... such that

f(xk+1) < f(xk),  k = 0, 1, 2, ...

xk+1 = xk + αk dk

Design factors in iterative descent algorithms:
what direction to move in: the descent direction dk
how far to move in that direction: the step size αk
Successive descent method

xk+1 = xk + αk dk

1st-order Taylor series: f(xk+1) = f(xk + αk dk) ≈ f(xk) + αk ∇f(xk)ᵀdk

for successive reduction: αk ∇f(xk)ᵀdk < 0

If ∇f(xk) ≠ 0:

90° < ∠(dk, ∇f(xk)) < 270°  ⇒  ∇f(xk)ᵀdk < 0

by an appropriate choice of step size αk we can achieve f(xk+1) < f(xk)

The observations above lead to a family of gradient-based algorithms.

Steepest descent method

xk+1 = xk + αk dk

1st-order Taylor series: f(xk+1) = f(xk + αk dk) ≈ f(xk) + αk ∇f(xk)ᵀdk

for successive reduction: αk ∇f(xk)ᵀdk < 0

dk = −∇f(xk) :  −∇f(xk)ᵀ∇f(xk) < 0 whenever ∇f(xk) ≠ 0

Proposition: dk = −∇f(xk) is a descent direction, i.e., f(xk + αk dk) < f(xk) for
all sufficiently small values of αk > 0.

Steepest Descent Algorithm
Step 0. Given x0, set k := 0.
Step 1. dk := −∇f(xk). If dk = 0, then stop.
Step 2. Solve αk = argmin_α f(xk + α dk) for the step size αk (chosen by an
exact or inexact line search).
Step 3. Set xk+1 ← xk + αk dk, k ← k + 1. Go to Step 1.

Note: from Step 2 and the fact that dk = −∇f(xk) is a descent direction it
follows that f(xk+1) < f(xk).
Steepest descent method
The steepest descent method can have slow convergence. Two example cost functions
(level sets and steepest descent iterates are shown in the slide figures):

Rosenbrock function: f(x1, x2) = 100(x2 − x1²)² + (1 − x1)²,
with starting point x0 = (−1.2, 1.0)ᵀ and minimizer x* = (1, 1)ᵀ

f(x1, x2) = 1 − e^−(10x1² + x2²)

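As an illustration (not part of the slides), here is a minimal NumPy sketch of the steepest descent algorithm above, using a backtracking (Armijo) rule as the inexact line search and the Rosenbrock function with the starting point quoted above; the large iteration count it reports reflects the slow convergence.

```python
import numpy as np

def rosenbrock(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosenbrock_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0]**2),
    ])

def steepest_descent(f, grad, x0, tol=1e-5, max_iter=50_000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:             # stop near a stationary point
            return x, k
        d = -g                                  # Step 1: steepest descent direction
        alpha, beta, sigma = 1.0, 0.5, 1e-4     # Step 2: backtracking (Armijo) line search
        while f(x + alpha * d) > f(x) + sigma * alpha * (g @ d):
            alpha *= beta
        x = x + alpha * d                       # Step 3: update the iterate
    return x, max_iter                          # may hit max_iter before the tolerance is met

x_final, iters = steepest_descent(rosenbrock, rosenbrock_grad, [-1.2, 1.0])
print("final iterate:", x_final, "after", iters, "iterations")  # slow progress toward (1, 1)
```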
Newton’s method

xk+1 = xk + αk dk,  with ∆xk = αk dk

2nd-order Taylor series:

f(xk+1) = f(xk + ∆xk) ≈ h(∆xk) = f(xk) + ∇f(xk)ᵀ∆xk + ½ ∆xkᵀ∇²f(xk)∆xk

For successive reduction: find ∆xk by minimizing h(∆xk) over ∆xk:

∇h(∆xk) = 0 ⇒ ∇²f(xk)∆xk + ∇f(xk) = 0 ⇒ ∆xk = −(∇²f(xk))⁻¹∇f(xk)

xk+1 = xk − (∇²f(xk))⁻¹∇f(xk)

Newton's method
Step 0. Given x0, set k := 0.
Step 1. dk := −(∇²f(xk))⁻¹∇f(xk). If dk = 0, then stop.
Step 2. Set αk = 1.
Step 3. Set xk+1 ← xk + αk dk, k ← k + 1. Go to Step 1.
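
A minimal sketch of the pure Newton iteration above (illustration only, reusing the Rosenbrock problem as an assumed example); the Newton direction is obtained by solving ∇²f(xk) dk = −∇f(xk) rather than forming the inverse explicitly.

```python
import numpy as np

def rosenbrock_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0]**2),
    ])

def rosenbrock_hess(x):
    return np.array([
        [1200.0 * x[0]**2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
        [-400.0 * x[0], 200.0],
    ])

def newton(grad, hess, x0, tol=1e-10, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        d = np.linalg.solve(hess(x), -g)   # Newton direction: solve H d = -g
        x = x + d                          # alpha_k = 1 (pure Newton step)
    return x, max_iter

x_opt, iters = newton(rosenbrock_grad, rosenbrock_hess, [-1.2, 1.0])
print("x* ≈", x_opt, "after", iters, "iterations")  # far fewer iterations than steepest descent
```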
Modified Newton’s method
2nd-order Taylor series:

f(xk+1) = f(xk + ∆xk) ≈ h(∆xk) = f(xk) + ∇f(xk)ᵀ∆xk + ½ ∆xkᵀ∇²f(xk)∆xk

xk+1 = xk − (∇²f(xk))⁻¹∇f(xk)

Note the following:
f(xk+1) < f(xk) is not necessarily guaranteed.
The algorithm can be modified to xk+1 = xk − αk (∇²f(xk))⁻¹∇f(xk), and Step 2 should then be modified to:
Step 2. Solve αk = argmin_α f(xk − α (∇²f(xk))⁻¹∇f(xk)) for the step size αk
(chosen by an exact or inexact line search).

Proposition: If H(xk) = ∇²f(xk) is a symmetric positive definite matrix, then
dk := −H(xk)⁻¹∇f(xk) is a descent direction, i.e., f(xk + αk dk) < f(xk) for all
sufficiently small values of αk > 0.

Proof: for dk to be a descent direction we should show that ∇f(xk)ᵀdk < 0.
Here ∇f(xk)ᵀdk = −∇f(xk)ᵀH(xk)⁻¹∇f(xk). Because H(xk) is positive
definite, it follows that ∇f(xk)ᵀdk = −∇f(xk)ᵀH(xk)⁻¹∇f(xk) < 0, where we
used the fact that if a matrix is positive definite, its inverse is also positive definite.
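
A sketch of the modified (damped) Newton update, combining the Newton direction with a backtracking line search on αk (the backtracking rule is an assumed choice of inexact line search; the handles f, grad, hess are placeholders supplied by the user):

```python
import numpy as np

def damped_newton(f, grad, hess, x0, tol=1e-8, max_iter=200):
    # Newton direction with a backtracking line search on the step size alpha
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        d = np.linalg.solve(hess(x), -g)   # Newton direction; assumes hess(x) is positive
                                           # definite, so d is a descent direction (see above)
        alpha, beta, sigma = 1.0, 0.5, 1e-4
        while f(x + alpha * d) > f(x) + sigma * alpha * (g @ d) and alpha > 1e-12:
            alpha *= beta                  # shrink alpha until f decreases sufficiently
        x = x + alpha * d
    return x, max_iter
```

It can be run, for example, on the Rosenbrock function defined in the earlier steepest descent sketch.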
Newton and modified Newton methods

Newton's method typically converges very fast asymptotically.

It does not exhibit the zig-zagging behavior of steepest descent.

On the downside: Newton's method needs to compute not only the gradient but
also the Hessian, which contains n(n + 1)/2 second-order derivatives (numerically
expensive).

Example: f(x1, x2) = 1 − e^−(10x1² + x2²) (level sets and iterates shown in the slide figure)

Practical Stopping Conditions for Iterative Optimization Algorithms for
Unconstrained Optimization

In iterative algorithms the initial point is typically picked randomly, or, if we have a
guess for the location of a local minimum, we pick a point close to it.

Stopping criteria: the stopping condition is related to the first-order optimality
condition ∇f(x) = 0. The following are common practical stopping conditions
for iterative unconstrained optimization algorithms. Let ε > 0:

‖∇f(xk)‖ ≤ ε
Close to satisfying the first-order necessary condition ∇f(x) = 0.

|f(xk+1) − f(xk)| ≤ ε
Improvements in the function value are saturating.

‖xk+1 − xk‖ ≤ ε
Movement between iterates has become small.

|f(xk+1) − f(xk)| / max{1, |f(xk)|} ≤ ε
A "relative" measure: removes dependence on the scale of f.
The max is taken to avoid dividing by small numbers.

‖xk+1 − xk‖ / max{1, ‖xk‖} ≤ ε
A "relative" measure: removes dependence on the scale of xk.
The max is taken to avoid dividing by small numbers.
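
These tests are easy to combine in code; a minimal sketch (the function name, argument names, and tolerance are illustrative assumptions):

```python
import numpy as np

def should_stop(x_prev, x_new, f_prev, f_new, grad_new, eps=1e-6):
    # Return True if any of the common practical stopping tests is met
    grad_small    = np.linalg.norm(grad_new) <= eps                     # ||grad f(xk)|| <= eps
    f_stalled     = abs(f_new - f_prev) <= eps                          # absolute change in f
    x_stalled     = np.linalg.norm(x_new - x_prev) <= eps               # absolute change in x
    f_rel_stalled = abs(f_new - f_prev) / max(1.0, abs(f_prev)) <= eps  # relative change in f
    x_rel_stalled = (np.linalg.norm(x_new - x_prev)
                     / max(1.0, np.linalg.norm(x_prev)) <= eps)         # relative change in x
    return grad_small or f_stalled or x_stalled or f_rel_stalled or x_rel_stalled
```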
References

[1] Nonlinear Programming: 3rd Edition, by D. P. Bertsekas

[2] Linear and Nonlinear Programming, by D. G. Luenberger, Y. Ye
