
Unconstrained Optimization II

Asst. Prof. Dr.-Ing. Sudchai Boonto


October 24, 2023
Department of Control System and Instrumentation Engineering
King Mongkut’s University of Technology Thonburi
Thailand
Objective

At the end of this chapter you should be able to:


• Describe, implement, and use line-search-based methods.
• Explain the pros and cons of the various search direction methods.
• Understand steepest descent, conjugate gradient, etc.

2/57
Two Approaches to Finding an Optimum

[Figure: line-search approach (left) versus trust-region approach (right).]

3/57
Basic Concept

Consider a problem

minimize over x:  f(x),  x ∈ R^n

• Most numerical methods require a starting design or point which we call x0 .


• We then determine the direction of travel d0 .
• A step size α0 is then determined based on minimizing f as much as possible
and the design point is updated as x1 = x0 + α0 d0 .
• The processes of deciding where to go and how far to go are repeated from x1, i.e.,
xk+1 = xk + αk dk.

4/57
Basic Concept: Example

Given f(x1, x2) = x1² + 5x2², a point x0 = [3 1]^T, and f0 = f(x0) = 14.


1. Construct f(α) along the direction d = [−3 −5]^T and provide a plot of f(α)
versus α, for α ≥ 0.
We have x(α) = x0 + αd = [3 − 3α, 1 − 5α]^T and
f(α) = (3 − 3α)² + 5(1 − 5α)².

[Figure: plot of f(α) versus α for 0 ≤ α ≤ 0.5; f decreases from 14 at α = 0 to about 5.4 near α ≈ 0.25 and then increases.]

5/57
Basic Concept : Example

2. Find the slope df(α)/dα at α = 0. Verify that this equals ∇f(x0)^T d.


We have

df(α)/dα |α=0 = (−6(3 − 3α) − 50(1 − 5α))|α=0 = −68

∇f(x0)^T d = [2(3) 10(1)] [−3 −5]^T = −68

3. Minimize f(α) with respect to α to obtain the step size α0. Give the corresponding
new point x1 and the value f1 = f(x1).
We have

df(α)/dα = −6(3 − 3α) − 50(1 − 5α) = 0  ⇒  268α = 68, or α0 = 0.2537

x1 = [3 1]^T + α0 [−3 −5]^T = [2.2388 −0.2687]^T,  f(x1) = 5.3732, less than f0 = 14.

6/57
Basic Concept : Example

4. Provide a plot showing contours of the function, the descent direction, x0, and x1.

[Figure: contour plot of f with the descent direction from x0 to x1.]

7/57
Basic Concept : Example

We want to design the width and height of


the rectangular cross-section to increase
the bending stress defined by

6M
σ0 = , where M is a moment.
wh2

With the initial design (w, h) = (1, 3), we have

σ0 = 6(2000 × 24)/(1 · 3²) = 32,000 psi

Using d = [−1/√5 −2/√5]^T and α = 0.2, we have

x1 = x0 + αd = [1 3]^T + 0.2 [−1/√5 −2/√5]^T = [0.9106 2.8211]^T,  σ1 = 71,342 psi
8/57
Line Search : Exact Line Search

Assume we have chosen a descent direction d. We need to choose the step factor α to
obtain our next design point. One approach is to use line search, which selects the
step factor that minimizes the one-dimensional function:

minimize f (x + αd)
α

To inform the search, we can use the derivative of the line search objective, which is
simply the directional derivative along d at x + αd.

function LINE_SEARCH(f, d)
objective = α -> f (x + α ∗ d)
a, b = bracket_minimum(objective)
α = minimize(objective, a, b)
return x + α ∗ d
end function

Exact line search is expensive if we need to perform it at every step of the optimization.
In the Matlab environment, we can use the commands fminbnd or fminsearch.
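As a rough Python counterpart to the pseudocode above (a sketch, not the lecture's code), the one-dimensional minimization can be delegated to SciPy's minimize_scalar, which plays the role of fminbnd here; the bracketing step is left to the bounded solver and the bound alpha_max = 10 is an assumption.

import numpy as np
from scipy.optimize import minimize_scalar

def line_search_exact(f, x, d, alpha_max=10.0):
    # Exact line search: minimize f(x + alpha*d) over alpha in [0, alpha_max].
    objective = lambda alpha: f(x + alpha * d)
    res = minimize_scalar(objective, bounds=(0.0, alpha_max), method="bounded")
    return x + res.x * d

# Check against the earlier example: f(x1, x2) = x1^2 + 5 x2^2 from x0 = [3, 1] along d = [-3, -5]
f = lambda x: x[0]**2 + 5.0 * x[1]**2
print(line_search_exact(f, np.array([3.0, 1.0]), np.array([-3.0, -5.0])))
# approximately [2.2388, -0.2687], as computed in the earlier example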
9/57
Line Search : Exact Line Search

• One disadvantage of conducting a line search at each step is the computational


cost of optimizing α to a high degree of precision.
• We could quickly find a reasonable value and then move on, selecting xk+1 ,
and then picking a new direction dk+1 .
• Some algorithms use a fixed step factor. Large steps will tend to result in fast
convergence but risk overshooting the minimum.
• Smaller steps tend to be more stable but can result in slower convergence.
• A fixed step factor α is sometimes referred to as a learning rate.
• Another method is to use a decaying step factor:

αk = α1 γ^(k−1) for γ ∈ (0, 1]

Decaying step factors are popular when minimizing noisy objective functions and are
commonly used in machine-learning applications (a small sketch follows below).
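A minimal sketch (not from the slides) of gradient descent with a decaying step factor; the values α1 = 0.1 and γ = 0.9 are illustrative only.

import numpy as np

def gradient_descent_decaying(grad_f, x0, alpha1=0.1, gamma=0.9, iters=50):
    # Fixed-direction rule x <- x - alpha_k * grad, with alpha_k = alpha1 * gamma**(k-1).
    x = np.asarray(x0, dtype=float)
    for k in range(1, iters + 1):
        alpha_k = alpha1 * gamma**(k - 1)
        x = x - alpha_k * grad_f(x)
    return x

# Example: f(x1, x2) = x1^2 + 5 x2^2 with gradient [2 x1, 10 x2]
grad_f = lambda x: np.array([2.0 * x[0], 10.0 * x[1]])
print(gradient_descent_decaying(grad_f, [3.0, 1.0]))
# Moves toward the minimum at [0, 0]; note that the step factors sum to at most
# alpha1 / (1 - gamma) for gamma < 1, which limits how far the iterates can travel.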

10/57
Line Search : Exact Line Search

Consider conducting a line search on f(x1, x2, x3) = sin(x1 x2) + e^(x2 + x3) − x3 from
x = [1, 2, 3] in the direction d = [0, −1, −1]. The corresponding optimization problem is:

minimize over α:  sin((1 + 0α)(2 − α)) + e^((2−α)+(3−α)) − (3 − α)

which simplifies to:

minimize over α:  sin(2 − α) + e^(5−2α) + α − 3

[Figure: plot of the one-dimensional objective versus α.]

The minimum is at α ≈ 3.127 with x ≈ [1, −1.126, −0.126].
Note I: f′(α) = −cos(2 − α) − 2e^(5−2α) + 1 = 0; we can solve for α using vpasolve in Matlab.
Note II: We can also apply a nonlinear search routine in Matlab or Julia, such as fminbnd, to the
original one-dimensional problem.
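A quick numerical check of this example (an illustrative sketch using SciPy's minimize_scalar rather than vpasolve or fminbnd; the bound of 5 on α is an assumption):

import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: np.sin(x[0] * x[1]) + np.exp(x[1] + x[2]) - x[2]
x0 = np.array([1.0, 2.0, 3.0])
d = np.array([0.0, -1.0, -1.0])

# One-dimensional objective along d, equivalent to the simplified problem above
objective = lambda alpha: f(x0 + alpha * d)
res = minimize_scalar(objective, bounds=(0.0, 5.0), method="bounded")
print(res.x)            # alpha ≈ 3.127
print(x0 + res.x * d)   # x ≈ [1, -1.127, -0.127]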

11/57
Approximate Line Search

• It is often more computationally efficient to perform more iterations of a


descent method than to do an exact line search at each iteration, especially if
the function and derivative calculations are expensive.
• Many methods discussed so far can benefit from using approximate line search
to find a suitable step size with a small number of evaluations.
• Since descent methods must descend, a step size α may be suitable if it causes
a decrease in the objective function value. However, we also need a sufficient decrease
condition (to ensure that the reduction in f is not too small).
• The sufficient decrease in the objective function value:

f (xk+1 ) ≤ f (xk ) + βα∇dk f (xk )

with β ∈ [0, 1] often set to β = 1 × 10−4 .

12/57
Approximate Line Search

• If β = 0, then any decrease is acceptable. If β = 1, then the decrease has to be


at least as much as what would be predicted by a first-order approximation.
• If d is a valid descent direction, then there must exist a sufficiently small step
size that satisfies the sufficient decrease condition.
• We can start with a large step size and decrease it by a constant reduction
factor until the sufficient decrease condition is satisfied.
• The algorithm is known as backtracking line search because of how it
backtracks along the descent direction.
13/57
Approximate Line Search

function BACKTRACKING_LINE_SEARCH(f, ∇f, x, d, α; p = 0.5, β = 1e − 4)


y, g = f (x), ∇f (x)
while f (x + α ∗ d) > y + β ∗ α ∗ (gT · d) do
α ← pα
end while
return α
end function

• The first condition (sufficient decrease) is insufficient to guarantee convergence to a
local minimum. Very small step sizes satisfy it but can cause premature convergence.
• Backtracking line search avoids premature convergence by accepting the largest
satisfactory step size obtained by sequential downscaling and is guaranteed to
converge to a local minimum.
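A runnable Python version of the backtracking pseudocode above (a sketch; the reduction factor p = 0.5 and β = 10^−4 follow the defaults shown earlier):

import numpy as np

def backtracking_line_search(f, grad_f, x, d, alpha=1.0, p=0.5, beta=1e-4):
    # Shrink alpha by the factor p until the sufficient decrease condition holds.
    y, g = f(x), grad_f(x)
    while f(x + alpha * d) > y + beta * alpha * np.dot(g, d):
        alpha *= p
    return alpha

# Example: f(x1, x2) = x1^2 + x1*x2 + x2^2 from x = [1, 2] along d = [-1, -1]
f = lambda x: x[0]**2 + x[0] * x[1] + x[1]**2
grad_f = lambda x: np.array([2 * x[0] + x[1], x[0] + 2 * x[1]])
print(backtracking_line_search(f, grad_f, np.array([1.0, 2.0]),
                               np.array([-1.0, -1.0]), alpha=10.0))
# Prints 2.5, matching the Wolfe-condition example on the following slides.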

14/57
Curvature Condition

The curvature condition requires the directional derivative at the next iterate to be
shallower (so that α is not too close to zero):

∇dk f (xk+1 ) ≥ σ∇dk f (xk )

• Here σ controls how shallow the next directional derivative must be.
• It is common to set β < σ < 1, with σ = 0.1 when approximate line search is
used with the conjugate gradient method and σ = 0.9 when used with Newton's
method.
• The strong curvature condition is a more restrictive criterion in that the directional
derivative is also required not to be too positive:

|∇dk f (xk+1 )| ≤ −σ∇dk f (xk )

• The sufficient decrease condition (which provides an upper bound αU) and the strong
curvature condition (which provides a lower bound αL) are together called the strong Wolfe conditions.
15/57
Curvature Condition

16/57
Wolfe Condition

Consider approximate line search on f(x1, x2) = x1² + x1 x2 + x2² from x = [1, 2] in


the direction d = [−1, −1], using a maximum step size of 10, a reduction factor of 0.5,
a first Wolfe condition parameter β = 1 × 10−4 and a second Wolfe condition
parameter σ = 0.9.
The first Wolfe condition is f (x + αd) ≤ f (x) + βα(gT d), where g = ∇f (x) = [4, 5].
At α = 10 we have:

f([1, 2]^T + 10[−1, −1]^T) ≤ 7 + 1 × 10^−4 (10) [4 5][−1, −1]^T
217 ≤ 6.991 (not satisfied)

At α = 0.5(10) = 5 we have:

f([1, 2]^T + 5[−1, −1]^T) ≤ 7 + 1 × 10^−4 (5) [4 5][−1, −1]^T
37 ≤ 6.996 (not satisfied)


17/57
Wolfe Condition

At α = 0.5(5) = 2.5 we have:

f([1, 2]^T + 2.5[−1, −1]^T) ≤ 7 + 1 × 10^−4 (2.5) [4 5][−1, −1]^T
3.25 ≤ 6.998 (the first Wolfe condition is satisfied)

The candidate design point x′ = x + αd = [−1.5, −0.5]^T is checked against the
second Wolfe condition:

∇d f(x′) ≥ σ ∇d f(x)
[−3.5 −2.5][−1, −1]^T ≥ σ [4 5][−1, −1]^T
6 ≥ −8.1 (the second Wolfe condition is satisfied)

Approximate line search terminates with x = [−1.5, −0.5]^T.
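The same check can be scripted; this short sketch (illustrative, not part of the lecture) verifies both Wolfe conditions for the accepted step α = 2.5:

import numpy as np

f = lambda x: x[0]**2 + x[0] * x[1] + x[1]**2
grad = lambda x: np.array([2 * x[0] + x[1], x[0] + 2 * x[1]])

x, d = np.array([1.0, 2.0]), np.array([-1.0, -1.0])
beta, sigma, alpha = 1e-4, 0.9, 2.5
x_new = x + alpha * d

sufficient_decrease = f(x_new) <= f(x) + beta * alpha * (grad(x) @ d)
curvature = grad(x_new) @ d >= sigma * (grad(x) @ d)
print(sufficient_decrease, curvature)   # True True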

18/57
The Steepest Descent Method

The steepest-descent method (also called gradient descent) is a simple and intuitive
method for determining the search direction.
Direction Vector:
• Let xk be the current point at the kth iteration: k = 0 corresponds to the
starting point.
• We need to choose a downhill direction d and then a step size α > 0 such that
the new point xk + αd is better. We desire f (xk + αd) < f (xk ).
• To see how d should be chosen, we use the Taylor expansion

f (xk + αd) = f (xk ) + α∇f (xk )T d + O(α2 )


δf = α∇f (xk )T d + O(α2 )

• For small enough α the term O(α²) is dominated by the first-order term. Consequently, we have

δf ≈ α∇f (xk )T d

19/57
The Steepest Descent Method

• For a reduction in f or δf < 0, we require d to be a descent direction or a


direction that satisfies

∇f (xk )T d < 0

• The steepest descent method is based on choosing d at the kth iteration, which
we will denote as dk , as

dk = −∇f (xk ) =⇒ ∇f (xk )T (−∇f (xk )) = −∥∇f (xk )∥2 < 0

• This direction will be referred to as the steepest descent direction.

20/57
The Steepest Descent Method : Example

Given f(x) = x1 x2², x0 = [1 2]^T.
1. Find the steepest-descent direction at x0:

d = −∇f(x)|x=x0 = [−4 −4]^T

2. Is d = [−1 2]^T a direction of descent?

∇f(x0)^T d = [4 4][−1 2]^T = −4 + 8 = 4 > 0

It is not a descent direction.

21/57
The Steepest Descent Method

After we have the direction vector dk at the point xk , how far to go along this
direction?
• We need to develop a numerical procedure to determine the step size αk along
dk .
• If we move along dk, the design variables and the objective function depend only
on α:

x(α) = xk + αdk,  f(α) = f(xk + αdk)

• The slope or derivative f′(α) = df/dα is called the directional derivative of f
along the direction d and is given by

df(α̂)/dα = ∇f(xk + α̂ dk)^T dk

22/57
The Steepest Descent Method

• In the steepest descent method, the direction vector is −∇f(xk), so the slope at the
current point (α = 0) is

df(α)/dα |α=0 = ∇f(xk)^T (−∇f(xk)) = −∥∇f(xk)∥² < 0

implying a move in a downhill direction.

23/57
The Steepest Descent Method

• Starting from an initial point, we determine a direction vector and a step size,
and obtain a new point as xk+1 = xk + αk dk .
• The question is to know when to stop the iterative process. We have two stop
criteria to discuss here.
• First: Before performing the line search, the necessary condition for optimality is
checked:

∥∇f(xk)∥ ≤ εG,

where εG is a tolerance on the gradient supplied by the user. If the condition is
satisfied, the process is terminated.

24/57
The Steepest Descent Method : Stopping Criteria

• Second: We check the successive reductions in f as a criterion for stopping:

|f(xk+1) − f(xk)| ≤ εA + εR |f(xk)|

where εA is an absolute tolerance on the change in function value and εR is a
relative tolerance. Only if the condition is satisfied for two consecutive
iterations is the descent process stopped.

25/57
Steepest Descent Algorithm : Algorithm

Require: x0, εG, εA, εR
k=0
while true do
  Compute ∇f(xk)
  if ∥∇f(xk)∥ ≤ εG then
    Stop
  else
    dk = −∇f(xk)/∥∇f(xk)∥
  end if
  αk = line_search(f, dk)
  xk+1 = xk + αk dk
  if |f(xk+1) − f(xk)| ≤ εA + εR |f(xk)| then
    Stop
  else
    k = k + 1
  end if
end while
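A compact Python rendering of this algorithm (a sketch; the exact line search again reuses SciPy's minimize_scalar, and the tolerances and step bound are illustrative):

import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad_f, x0, eps_g=1e-6, eps_a=1e-10, eps_r=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    stall = 0
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps_g:          # first stopping criterion
            break
        d = -g / np.linalg.norm(g)              # normalized steepest-descent direction
        alpha = minimize_scalar(lambda a: f(x + a * d),
                                bounds=(0.0, 10.0), method="bounded").x
        x_new = x + alpha * d
        if abs(f(x_new) - f(x)) <= eps_a + eps_r * abs(f(x)):
            stall += 1
            if stall >= 2:                      # second criterion: two consecutive iterations
                x = x_new
                break
        else:
            stall = 0
        x = x_new
    return x

# Example: f(x1, x2) = x1^2 + 5 x2^2 from x0 = [3, 1]
f = lambda x: x[0]**2 + 5 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 10 * x[1]])
print(steepest_descent(f, grad_f, [3.0, 1.0]))   # close to the minimum at [0, 0]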
26/57
Steepest Descent Algorithm: Zig-Zags Property

• The steepest descent method zig-zags its way towards the optimum point.
Consider

f(x1, x2) = x1² + 5x2².

[Figure: contours of f with the zig-zagging steepest-descent path from x0; 30 iterations.]

27/57
Steepest Descent Algorithm: Zig-Zags Property

Since αk is obtained by minimizing f(xk + αdk), we have

∂f(xk + αdk)/∂α = 0
∂f(xk+1)/∂α = (∂f(xk+1)/∂xk+1)(∂xk+1/∂α) = ∇f(xk+1)^T dk = 0

Setting dk = −∇f(xk), we obtain

−∇f(xk+1)^T ∇f(xk) = 0

The last line means the (k + 1)th direction is perpendicular to the kth direction. With an
approximate line search this perpendicularity is lost, but the zig-zagging remains.

28/57
Steepest Descent Algorithm : The Bean function

Find the minimum of the bean function

f(x1, x2) = (1 − x1)² + (1 − x2)² + (1/2)(2x2 − x1²)²,

using the steepest-descent algorithm with an exact line search and a convergence
tolerance of ∥∇f∥ ≤ 10^−6.

[Figure: contours of the bean function with the steepest-descent path.]

x* = [1.2134 0.8241]^T,  f(x*) = 0.0919

29/57
Steepest Descent Algorithm : Convergence Characteristics

• The speed of convergence of the method is related to the spectral condition


number of the Hessian matrix. The spectral condition number κ of a symmetric
positive definite matrix A is defined as the ratio of the largest to the smallest
eigenvalue, or

κ = λmax / λmin

• For well-conditioned Hessian matrices, the condition number is close to unity,
the contours are nearly circular, and the method is at its best.
• The higher the condition number, the more ill-conditioned the Hessian and the more
elliptical the contours; there is more zig-zagging as the optimum is approached, the
step sizes become smaller, and the rate of convergence is poorer.

30/57
Steepest Descent Algorithm : Convergence Characteristics

Consider the function f(x1, x2) = x1² + βx2². We have

H(f) = ∇²f = [2 0; 0 2β],  with β = 1, 5, 15

[Figure: steepest-descent paths from x0 for β = 1, 5, and 15 (1, 29, and 90 iterations, respectively).]
31/57
The Conjugate Gradient Method

• The conjugate gradient method [Fletcher and Powell 1963] is a dramatic improvement
over the steepest descent method, which performs poorly in narrow valleys.
• It can find the minimum of a quadratic function of n variables in n iterations.
• The conjugate gradient method is also powerful on general functions.
• Consider the problem of minimizing a quadratic function

minimize over x:  q(x) = (1/2) x^T A x + b^T x + c

where A is symmetric and positive definite.
• The conjugate directions, or directions that are mutually conjugate with respect
to A, are vectors that satisfy

di^T A dj = 0,  i ≠ j,  0 ≤ i, j ≤ n

32/57
The Conjugate Gradient Method : The method

• The mutually conjugate vectors form a basis (they are linearly independent), but they
are generally not orthogonal to one another.
• The algorithm is started with the direction of steepest descent:

d1 = −g1

• Use a line search to find the next design point. For quadratic functions, the step
factor α can be computed exactly. The update is then:

x2 = x1 + α1 d1

33/57
The Conjugate Gradient Method : The method

• Suppose we want to derive the optimal step factor for a line search on a
quadratic function:

minimize over α:  f(x + αd)

We have

∂f(x + αd)/∂α = ∂/∂α [ (1/2)(x + αd)^T A(x + αd) + b^T(x + αd) + c ]
             = d^T A(x + αd) + d^T b
             = d^T(Ax + b) + α d^T A d

Setting ∂f(x + αd)/∂α = 0 results in:

α = − d^T(Ax + b) / (d^T A d)
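As a concrete check (a small sketch with the quadratic from the later conjugate gradient example; the helper name is illustrative), the exact step factor is a one-line computation:

import numpy as np

def exact_step_quadratic(A, b, x, d):
    # Exact minimizing step along d for q(x) = 0.5 x^T A x + b^T x + c
    return -(d @ (A @ x + b)) / (d @ (A @ d))

# f(x1, x2) = x1^2 + 4 x2^2 corresponds to A = diag(2, 8), b = 0
A = np.diag([2.0, 8.0])
b = np.zeros(2)
x0 = np.array([1.0, 1.0])
d0 = -(A @ x0 + b)                          # steepest-descent direction = -gradient
print(exact_step_quadratic(A, b, x0, d0))   # ≈ 0.1308, matching α0 in the example later in this lecture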
34/57
The Conjugate Gradient Method : The method

• Subsequent iterations choose dk based on the current gradient and a
contribution from the previous descent direction:

dk = −gk + βk dk−1

for a scalar parameter βk. Larger values of βk indicate that the previous descent
direction contributes more strongly.
• To find the best value of βk for a known A, we use the fact that dk is conjugate to
dk−1:

dk^T A dk−1 = 0  ⇒  (−gk + βk dk−1)^T A dk−1 = 0
⇒  −gk^T A dk−1 + βk dk−1^T A dk−1 = 0  ⇒  βk = gk^T A dk−1 / (dk−1^T A dk−1)

• The conjugate gradient method can be applied to nonquadratic functions as


well.
35/57
The Conjugate Gradient Method : The method

We do not know the value of A that best approximates f around xk . Several choices
for βk tend to work well:
• Fletcher-Reeves:

βk = gk^T gk / (gk−1^T gk−1)

• Polak-Ribière:

βk = gk^T (gk − gk−1) / (gk−1^T gk−1)

• Convergence for the Polak-Ribière method can be guaranteed if we modify it to


allow for automatic resets:

β ← max(β, 0)

36/57
The Conjugate Gradient Method : Example

Consider f = x1² + 4x2², x0 = [1 1]^T. We will perform two iterations of the
conjugate gradient algorithm. The first step is the steepest descent iteration. Thus

d0 = −∇f(x0) = −[2 8]^T

Assuming the direction vectors are not normalized to be unit vectors,

f(α) = f(x0 + αd0) = (1 − 2α)² + 4(1 − 8α)²

which yields α0 = 0.1308, x1 = x0 + α0 d0 = [0.7385 −0.0462]^T. The next
iteration (using the Fletcher-Reeves method):

β0 = ∥∇f(x1)∥² / ∥∇f(x0)∥² = 2.3176/68 = 0.0341

37/57
The Conjugate Gradient Method : Example

d1 = −∇f(x1) + β0 d0 = [−1.4770 0.3692]^T + 0.0341 [−2 −8]^T = [−1.5451 0.0966]^T

We have

f(α) = f(x1 + αd1) = (0.7385 − 1.5451α)² + 4(−0.0462 + 0.0966α)²

which yields α1 = 0.4780 and

x2 = x1 + α1 d1 = [0.7385 −0.0462]^T + 0.4780 [−1.5451 0.0966]^T = [0 0]^T

38/57
The Conjugate Gradient Method : Example

[Figure: contours of f with the conjugate gradient path from x0; the minimum is reached in 2 iterations.]

39/57
The Conjugate Gradient Method : Algorithm

Require: x0, εG
k=0
while ∥∇fk∥ > εG do
  if k = 0 then
    dk = −∇fk/∥∇fk∥
  else
    βk = ∇fk^T ∇fk / (∇fk−1^T ∇fk−1)
    dk = −∇fk/∥∇fk∥ + βk dk−1
  end if
  αk = line_search(f, dk)
  xk+1 = xk + αk dk
  k = k + 1
end while
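A minimal Python sketch of this nonlinear conjugate gradient loop with the Fletcher-Reeves β (illustrative only; the line search again uses SciPy's minimize_scalar with an assumed step bound, and no restarts are included):

import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(f, grad_f, x0, eps_g=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    d = -g / np.linalg.norm(g)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps_g:
            break
        alpha = minimize_scalar(lambda a: f(x + a * d),
                                bounds=(0.0, 10.0), method="bounded").x
        x = x + alpha * d
        g_new = grad_f(x)
        beta = (g_new @ g_new) / (g @ g)            # Fletcher-Reeves
        d = -g_new / np.linalg.norm(g_new) + beta * d
        g = g_new
    return x

# Example: f(x1, x2) = x1^2 + 4 x2^2 from x0 = [1, 1]
f = lambda x: x[0]**2 + 4 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 8 * x[1]])
print(conjugate_gradient(f, grad_f, [1.0, 1.0]))    # approaches the minimum at [0, 0]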

40/57
The Conjugate Gradient Method : Example

The minimum of the bean function,

f(x1, x2) = (1 − x1)² + (1 − x2)² + (1/2)(2x2 − x1²)²

[Figure: paths on the bean function: conjugate gradient (CG) converges in 18 iterations, steepest descent (SD) in 31 iterations.]

41/57
Newton’s Method

• The function value and gradient can help to determine the direction to travel,
but they do not directly help to determine how far to step to reach a local
minimum.
• Second-order information allows us to make a quadratic approximation of the
objective function and approximate the right step size to reach a local minimum.
• As we have seen with a quadratic fit search, we can analytically obtain the
location where a quadratic approximation has a zero gradient. We can use that
location as the next iteration to approach a local minimum.
• The quadratic approximation about a point xk comes from the second-order
Taylor expansion:

q(x) = f(xk) + (x − xk) f′(xk) + (1/2)(x − xk)² f″(xk)

∂q(x)/∂x = f′(xk) + (x − xk) f″(xk) = 0

xk+1 = xk − f′(xk)/f″(xk)

42/57
Newton’s Method : Example

Suppose we want to minimize the following single-variable function:

f(x) = (x − 2)⁴ + 2x² − 4x + 4,  f′(x) = 4(x − 2)³ + 4x − 4,  f″(x) = 12(x − 2)² + 4

With x0 = 3, we can form the quadratic approximation using the function value and the
first and second derivatives evaluated at that point.
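A short illustrative Python sketch of the scalar Newton iteration applied to this function, starting from x0 = 3:

f_prime = lambda x: 4 * (x - 2)**3 + 4 * x - 4
f_double_prime = lambda x: 12 * (x - 2)**2 + 4

x = 3.0
for k in range(5):
    x = x - f_prime(x) / f_double_prime(x)   # Newton update x_{k+1} = x_k - f'(x_k)/f''(x_k)
    print(k + 1, x)
# The iterates move from 3.0 toward the minimizer of f (approximately x ≈ 1.32).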

[Figure: successive quadratic approximations of f and the Newton iterates x0, x1, x2, x3.]

43/57
Newton’s Method : Disadvantages

• The update rule in Newton’s method involves dividing by the second derivative.
The update is undefined if the second derivative is zero, which occurs when the
quadratic approximation is a horizontal line.
• Instability also occurs when the second derivative is very close to zero, in which
case the next iterate will lie very far from the current design point, far from
where the local quadratic approximation is valid.
• Poor local approximations can lead to poor performance with Newton’s method.

[Figure: failure modes of Newton's method: oscillation, overshoot, and negative curvature.]
44/57
Newton’s Method : Multivariate Optimization

• The multivariate second-order Taylor expansion at xk is

f(x) ≈ q(x) = f(xk) + ∇f(xk)^T (x − xk) + (1/2)(x − xk)^T Hk (x − xk)

∇q(x) = ∇f(xk) + Hk(x − xk) = 0

We then solve for the next iterate, thereby obtaining Newton's method in
multivariate form:

xk+1 = xk − Hk^−1 ∇f(xk)

• If f (x) is quadratic and its Hessian is positive definite, then the update
converges to the global minimum in one step. For general functions, Newton’s
method is often terminated once x ceases to change by more than a given
tolerance.

45/57
Newton’s Method : Example
With x1 = [9 8]^T, we will use Newton's method to minimize Booth's function:

f(x) = (x1 + 2x2 − 7)² + (2x1 + x2 − 5)²,

∇f(x) = [10x1 + 8x2 − 34, 8x1 + 10x2 − 38]^T,  H(x) = [10 8; 8 10]

The first iteration of Newton's method yields:

x2 = x1 − H1^−1 g1 = [9 8]^T − [10 8; 8 10]^−1 [10(9) + 8(8) − 34, 8(9) + 10(8) − 38]^T
   = [9 8]^T − [10 8; 8 10]^−1 [120 114]^T = [1 3]^T

The gradient at x2 is zero, so we have converged after a single iteration. The Hessian
is positive definite everywhere, so x2 is the global minimum.
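This one-step convergence is easy to verify numerically (a small sketch using NumPy; np.linalg.solve is used instead of forming the explicit inverse):

import numpy as np

grad = lambda x: np.array([10 * x[0] + 8 * x[1] - 34, 8 * x[0] + 10 * x[1] - 38])
H = np.array([[10.0, 8.0], [8.0, 10.0]])

x = np.array([9.0, 8.0])
x = x - np.linalg.solve(H, grad(x))   # one Newton step
print(x, grad(x))                     # [1. 3.] and a zero gradient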

46/57
Newton’s Method : Multivariate Optimization

[Figure: minimizing Booth's function: steepest descent takes 6 iterations, Newton's method takes 1 iteration; a zoomed view near the minimum shows the final steps.]
47/57
Newton’s Method : Algorithm

Require: x0 , εG , ∇fk , Hk
k=0
while ∥∇fk ∥ > εG && k ≤ kmax do
∆ = H(x)−1 ∇f (x)
x=x−∆
k =k+1
end while
return x

48/57
Secant Method

Newton's method is efficient because the second-order information results in better
search directions, but it has the significant shortcoming of requiring the Hessian.
Quasi-Newton methods are designed to address this issue. The basic idea is that we
can use first-order information (gradients) gathered along the iteration path to
build an approximation of the Hessian.
• The secant method uses the last two iterates to approximate the second
derivative:

f″(xk) ≈ (f′(xk) − f′(xk−1)) / (xk − xk−1)

• This estimate is substituted into Newton’s method:

xk+1 = xk − f′(xk) (xk − xk−1) / (f′(xk) − f′(xk−1))

• The secant method requires an additional initial design point. It suffers from
the same problems as Newton’s method.
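A univariate sketch (illustrative Python) of the secant update, reusing f(x) = (x − 2)⁴ + 2x² − 4x + 4 from the earlier Newton example:

f_prime = lambda x: 4 * (x - 2)**3 + 4 * x - 4

x_prev, x = 4.0, 3.0                  # the secant method needs two starting points
for _ in range(10):
    denom = f_prime(x) - f_prime(x_prev)
    if denom == 0.0:                  # stop once the finite difference breaks down
        break
    x_prev, x = x, x - f_prime(x) * (x - x_prev) / denom
print(x)   # approaches the same minimizer (x ≈ 1.32) as Newton's method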
49/57
Quasi-Newton Methods

• As the secant method approximates f ′′ in the univariate case, quasi-Newton


methods approximate the inverse Hessian. Quasi-Newton method updates have
the form:

xk+1 = xk − αk Qk ∇fk ,

where αk is a scalar step factor and Qk approximates the inverse of the


Hessian at xk
• These methods typically set Q1 to the identity matrix and then apply
updates to reflect information learned at each iteration. To simplify the
equations for the various quasi-Newton methods, we define the following:

γ k+1 = ∇fk+1 − ∇fk


δ k+1 = xk+1 − xk

50/57
Quasi-Newton Methods : Davidon-Fletcher-Powell (DFP)

The Davidon-Fletcher-Powell (DFP) method uses:

Qk+1 = Qk − (Qk γk γk^T Qk)/(γk^T Qk γk) + (δk δk^T)/(δk^T γk)

The update for Q in the DFP method has three properties:

• Q remains symmetric and positive definite.
• If f(x) = (1/2) x^T A x + b^T x + c, then Q = A^−1. Thus DFP has the same
convergence properties as the conjugate gradient method.
• For high-dimensional problems, the cost of storing and updating Q can be significant
compared to other methods like the conjugate gradient method.

51/57
Quasi-Newton Methods : Davidon-Fletcher-Powell (DFP)

Require: x0, εG
k = 0, Q = I, x = x0
while ∥∇f(x)∥ > εG && k ≤ kmax do
  g = ∇f(x)
  x′ = line_search(f, x, −Q ∗ g)
  g′ = ∇f(x′)
  δ = x′ − x
  γ = g′ − g
  Q = Q − Qγγ^T Q/(γ^T Qγ) + δδ^T/(δ^T γ)
  x = x′, k = k + 1
end while
return x

52/57
Quasi-Newton Methods : Broyden-Fletcher-Goldfarb-Shanno
(BFGS)

An alternative to DFP, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method uses:

Qk+1 = Qk − (δk γk^T Qk + Qk γk δk^T)/(δk^T γk) + (1 + (γk^T Qk γk)/(δk^T γk)) (δk δk^T)/(δk^T γk)
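Below is a minimal Python sketch of a quasi-Newton loop built around this BFGS update (an illustration under the same exact-line-search and step-bound assumptions as before; the pseudocode that follows expresses the same steps). The bean-function gradient used in the example is written out by hand.

import numpy as np
from scipy.optimize import minimize_scalar

def bfgs(f, grad_f, x0, eps_g=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    Q = np.eye(len(x))                           # approximation of the inverse Hessian
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps_g:
            break
        d = -Q @ g
        alpha = minimize_scalar(lambda a: f(x + a * d),
                                bounds=(0.0, 10.0), method="bounded").x
        x_new = x + alpha * d
        g_new = grad_f(x_new)
        delta, gamma = x_new - x, g_new - g
        dg = delta @ gamma
        if dg <= 1e-16:                          # safeguard when progress stalls
            x, g = x_new, g_new
            break
        Q = (Q - (np.outer(delta, gamma) @ Q + Q @ np.outer(gamma, delta)) / dg
               + (1.0 + (gamma @ Q @ gamma) / dg) * np.outer(delta, delta) / dg)
        x, g = x_new, g_new
    return x

# Example: the bean function used throughout this lecture
f = lambda x: (1 - x[0])**2 + (1 - x[1])**2 + 0.5 * (2 * x[1] - x[0]**2)**2
grad_f = lambda x: np.array([-2 * (1 - x[0]) - 2 * x[0] * (2 * x[1] - x[0]**2),
                             -2 * (1 - x[1]) + 2 * (2 * x[1] - x[0]**2)])
print(bfgs(f, grad_f, [-2.0, 2.0]))   # approaches x* ≈ [1.2134, 0.8241]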

Require: x0, εG
k = 0, Q = I, x = x0
while ∥∇f(x)∥ > εG && k ≤ kmax do
  g = ∇f(x)
  x′ = line_search(f, x, −Q ∗ g)
  g′ = ∇f(x′)
  δ = x′ − x
  γ = g′ − g
  Q = Q − (δγ^T Q + Qγδ^T)/(δ^T γ) + (1 + γ^T Qγ/(δ^T γ)) δδ^T/(δ^T γ)
  x = x′, k = k + 1
end while
return x
53/57
Compare Four Methods

Minimizing of bean function

[Figure: minimization paths on the bean function: steepest descent (26 iterations), conjugate gradient (12 iterations), DFP in red and BFGS in green (7 iterations each), and Newton's method (8 iterations).]
54/57
Compare Four Methods

Minimizing the total potential energy for a spring system:

minimize over x1, x2:  (1/2) k1 (√((l1 + x1)² + x2²) − l1)² + (1/2) k2 (√((l2 − x1)² + x2²) − l2)² − mg x2

By letting l1 = 12, l2 = 8, k1 = 1, k2 = 10, mg = 7 (with appropriate units).


55/57
Compare Four Methods

56/57
Reference

1. Joaquim R. R. A. Martins and Andrew Ning, "Engineering Design Optimization," Cambridge University Press, 2021.

2. Mykel J. Kochenderfer and Tim A. Wheeler, "Algorithms for Optimization," The MIT Press, 2019.

3. Ashok D. Belegundu and Tirupathi R. Chandrupatla, "Optimization Concepts and Applications in Engineering," Cambridge University Press, 2019.

4. Kalyanmoy Deb, "Optimization for Engineering Design: Algorithms and Examples," 2nd ed., PHI Learning Private Limited, 2012.

57/57
