Lecture 8: Unconstrained Optimization II (2023)
Two Approaches to Finding an Optimum
Basic Concept
Consider the problem
\[
\min_{x} \; f(x), \qquad x \in \mathbb{R}^n
\]
Basic Concept: Example
[Figure: plot of f(α) versus α for 0 ≤ α ≤ 0.5.]
Basic Concept : Example
\[
\left. \frac{df(\alpha)}{d\alpha} \right|_{\alpha=0} = \bigl( -6(3 - 3\alpha) - 50(1 - 5\alpha) \bigr)\big|_{\alpha=0} = -68
\]
\[
\nabla f(x_0)^T d = \begin{bmatrix} 2(3) & 10(1) \end{bmatrix} \begin{bmatrix} -3 \\ -5 \end{bmatrix} = -68
\]
3. Minimize f(α) with respect to α to obtain the step size α0, giving the corresponding new point x1 and the value f1 = f(x1).
We have
\[
\frac{df(\alpha)}{d\alpha} = -6(3 - 3\alpha) - 50(1 - 5\alpha) = 0
\;\Longrightarrow\; 268\alpha = 68 \;\text{ or }\; \alpha_0 = 0.2537
\]
\[
x_1 = \begin{bmatrix} 3 \\ 1 \end{bmatrix} + \alpha_0 \begin{bmatrix} -3 \\ -5 \end{bmatrix} = \begin{bmatrix} 2.2388 \\ -0.2687 \end{bmatrix},
\qquad f(x_1) = 5.3732, \text{ which is less than } f_0 = 14.
\]
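As a quick numerical cross-check of the computation above (a sketch: the objective f(x) = x1^2 + 5*x2^2 is inferred from df/dα and is not stated explicitly in the extracted slide), we can reproduce α0 and x1 with a bounded scalar minimizer:

# Numerical check of the exact line-search step above.
# Assumption: f(x) = x1^2 + 5*x2^2, x0 = [3, 1], direction d = [-3, -5].
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: x[0]**2 + 5 * x[1]**2
x0 = np.array([3.0, 1.0])
d = np.array([-3.0, -5.0])

phi = lambda a: f(x0 + a * d)            # one-dimensional line-search objective
res = minimize_scalar(phi, bounds=(0.0, 1.0), method="bounded")

alpha0 = res.x                           # approx 0.2537
x1 = x0 + alpha0 * d                     # approx [2.2388, -0.2687]
print(alpha0, x1, f(x1))                 # f(x1) approx 5.3732 < f(x0) = 14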
Basic Concept : Example
[Figure: contour plot showing the iterates x0, x1, x2 in the (x1, x2) plane.]
Basic Concept : Example
\[
\sigma_0 = \frac{6M}{wh^2}, \quad \text{where } M \text{ is a moment.}
\]
\[
\sigma_0 = \frac{6(2000 \times 24)}{1 \cdot (3^2)} = 32{,}000 \text{ psi}
\]
Using d = [−1/√5  −2/√5]^T and α = 0.2, we have
Assume we have chosen a descent direction d. We need to choose the step factor α to obtain our next design point. One approach is to use line search, which selects the step factor that minimizes the one-dimensional function
\[
\min_{\alpha} \; f(x + \alpha d)
\]
To inform the search, we can use the derivative of the line-search objective, which is simply the directional derivative along d at x + αd.
function LINE_SEARCH(f, x, d)
    objective = α -> f(x + α ∗ d)
    a, b = bracket_minimum(objective)
    α = minimize(objective, a, b)
    return x + α ∗ d
end function
Exact line search can be expensive if we need to perform it at every step of the optimization. In the MATLAB environment, we can use the commands fminbnd or fminsearch.
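Below is a rough Python analogue of the LINE_SEARCH pseudocode above (a sketch, not the lecture's reference code); scipy's bounded scalar minimizer plays the role of bracket_minimum plus minimize, and the alpha_max bound is an arbitrary choice.

# Sketch of an (approximately) exact line search in Python.
import numpy as np
from scipy.optimize import minimize_scalar

def line_search(f, x, d, alpha_max=10.0):
    # One-dimensional objective along the direction d.
    objective = lambda a: f(x + a * d)
    res = minimize_scalar(objective, bounds=(0.0, alpha_max), method="bounded")
    return x + res.x * d

# Example usage on f(x) = x1^2 + 5*x2^2 from the earlier example.
f = lambda x: x[0]**2 + 5 * x[1]**2
x_new = line_search(f, np.array([3.0, 1.0]), np.array([-3.0, -5.0]))
print(x_new)   # approx [2.2388, -0.2687]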
Line Search : Exact Line Search
Decaying step factors are popular when minimizing noisy objective functions, and are commonly used in machine learning applications.
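As an illustration (not from the slides), one simple decaying schedule multiplies a base step by a fixed decay factor each iteration; the values α0 = 1 and γ = 0.9 below are arbitrary choices.

# Minimal sketch of a decaying step factor schedule: alpha_k = alpha0 * gamma**k.
def decaying_steps(alpha0=1.0, gamma=0.9):
    k = 0
    while True:
        yield alpha0 * gamma**k
        k += 1

steps = decaying_steps()
alphas = [next(steps) for _ in range(5)]   # [1.0, 0.9, 0.81, 0.729, 0.6561]
print(alphas)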
Line Search : Exact Line Search
\[
\cdots \;+\; e^{(2-\alpha)+(3-\alpha)} - (3 - \alpha)
\]
Approximate Line Search
Curvature Condition
The curvature condition requires the directional derivative at the next iterate to be shallower (so that α is not too close to zero):
\[
\nabla f(x_{k+1})^T d_k \;\geq\; \sigma \, \nabla f(x_k)^T d_k
\]
• Here σ controls how shallow the next directional derivative must be.
• It is common to set β < σ < 1, with σ = 0.1 when approximate line search is used with the conjugate gradient method and σ = 0.9 when used with Newton's method.
• The strong curvature condition is a more restrictive criterion, requiring that the directional derivative also not be too positive:
\[
\left| \nabla f(x_{k+1})^T d_k \right| \;\leq\; -\sigma \, \nabla f(x_k)^T d_k
\]
• Together, the sufficient decrease condition and the strong curvature condition are called the strong Wolfe conditions; a sketch that checks them numerically follows below.
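The sketch below checks the sufficient decrease and strong curvature conditions for a candidate step; the test function, point, and the values β = 1e-4 and σ = 0.9 are illustrative choices, not values from the slides.

# Sketch: check the strong Wolfe conditions for a candidate step alpha.
# beta is the sufficient-decrease parameter, sigma the curvature parameter.
import numpy as np

def strong_wolfe(f, grad, x, d, alpha, beta=1e-4, sigma=0.9):
    g0 = grad(x)
    x_new = x + alpha * d
    sufficient_decrease = f(x_new) <= f(x) + beta * alpha * (g0 @ d)
    strong_curvature = abs(grad(x_new) @ d) <= -sigma * (g0 @ d)
    return sufficient_decrease and strong_curvature

# Example: f(x) = x1^2 + 5*x2^2, steepest-descent direction at x0 = [3, 1].
f = lambda x: x[0]**2 + 5 * x[1]**2
grad = lambda x: np.array([2 * x[0], 10 * x[1]])
x0 = np.array([3.0, 1.0])
d = -grad(x0)
print(strong_wolfe(f, grad, x0, d, alpha=0.12))   # True for this step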
Wolfe Condition
For α = 0.5(10) = 5, we have
The Steepest Descent Method
The steepest-descent method (also called gradient descent) is a simple and intuitive method for determining the search direction.
Direction Vector:
• Let xk be the current point at the kth iteration: k = 0 corresponds to the
starting point.
• We need to choose a downhill direction d and then a step size α > 0 such that
the new point xk + αd is better. We desire f (xk + αd) < f (xk ).
• To see how d should be chosen, we use the Taylor expansion
\[
\delta f \approx \alpha \nabla f(x_k)^T d
\]
The Steepest Descent Method
For a decrease in the objective we require
\[
\nabla f(x_k)^T d < 0
\]
• The steepest descent method is based on choosing d at the kth iteration, which we will denote as dk, as
\[
d_k = -\nabla f(x_k)
\]
The Steepest Descent Method : Example
Given f(x) = x1 x2² and x0 = [1  2]^T.
1. Find the steepest descent direction at x0:
\[
d = -\nabla f(x)\big|_{x=x_0} = \begin{bmatrix} -4 & -4 \end{bmatrix}^T
\]
2. Is d = [−1  2]^T a direction of descent?
\[
\nabla f(x)\big|_{x=x_0}^{T} \, d = \begin{bmatrix} 4 & 4 \end{bmatrix} \begin{bmatrix} -1 \\ 2 \end{bmatrix} = 4 > 0 ,
\]
so it is not a descent direction.
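A quick numerical check of this example (a sketch with the gradient of f = x1*x2^2 hand-coded):

# Verify the steepest-descent direction and the descent test for f = x1*x2^2.
import numpy as np

grad = lambda x: np.array([x[1]**2, 2 * x[0] * x[1]])   # gradient of f = x1*x2^2
x0 = np.array([1.0, 2.0])

d_sd = -grad(x0)                  # steepest-descent direction: [-4, -4]
d_test = np.array([-1.0, 2.0])
slope = grad(x0) @ d_test         # directional derivative: 4 > 0, not a descent direction
print(d_sd, slope)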
The Steepest Descent Method
After we have the direction vector dk at the point xk, how far should we go along this direction?
• We need to develop a numerical procedure to determine the step size αk along
dk .
• If we move along dk, the design variables and the objective function depend only on α, i.e. f(α) = f(xk + α dk), with
\[
\frac{df(\hat{\alpha})}{d\alpha} = \nabla f(x_k + \hat{\alpha} d_k)^T d_k
\]
The Steepest Descent Method
• In the steepest descent method, the direction vector is −∇f(xk), resulting in the slope at the current point (α = 0) being
\[
\left. \frac{df(\alpha)}{d\alpha} \right|_{\alpha = 0} = \nabla f(x_k)^T \left( -\nabla f(x_k) \right) = -\|\nabla f(x_k)\|^2 < 0
\]
The Steepest Descent Method
• Starting from an initial point, we determine a direction vector and a step size,
and obtain a new point as xk+1 = xk + αk dk .
• The question is when to stop the iterative process. We have two stopping criteria to discuss here.
• First, before performing the line search, the necessary condition for optimality is checked:
\[
\|\nabla f(x_k)\| \leq \varepsilon_G ,
\]
The Steepest Descent Method : Stopping Criteria
Steepest Descent Algorithm : Algorithm
Require: x0, εG, εA, εR
k = 0
while true do
    Compute ∇f(xk)
    if ∥∇f(xk)∥ ≤ εG then
        Stop
    else
        dk = −∇f(xk)/∥∇f(xk)∥
    end if
    αk = line_search(f, dk)
    xk+1 = xk + αk dk
    if |f(xk+1) − f(xk)| ≤ εA + εR |f(xk)| then
        Stop
    else
        k = k + 1
    end if
end while
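A compact Python sketch of this algorithm (illustrative: a bounded scalar minimization stands in for line_search, and the bound of 10 on α is arbitrary).

# Sketch of steepest descent with the two stopping criteria above.
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x0, eps_g=1e-6, eps_a=1e-8, eps_r=1e-8, k_max=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(k_max):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:                 # optimality check
            break
        d = -g / np.linalg.norm(g)                     # normalized steepest-descent direction
        alpha = minimize_scalar(lambda a: f(x + a * d),
                                bounds=(0.0, 10.0), method="bounded").x
        x_new = x + alpha * d
        converged = abs(f(x_new) - f(x)) <= eps_a + eps_r * abs(f(x))   # progress check
        x = x_new
        if converged:
            break
    return x

# Example on f(x) = x1^2 + 5*x2^2:
f = lambda x: x[0]**2 + 5 * x[1]**2
grad = lambda x: np.array([2 * x[0], 10 * x[1]])
print(steepest_descent(f, grad, [3.0, 1.0]))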
Steepest Descent Algorithm: Zig-Zags Property
• The steepest descent method zig-zags its way towards the optimum point.
Consider
[Figure: steepest descent with 30 iterations, zig-zagging from x0 toward x∗ in the (x1, x2) plane.]
Steepest Descent Algorithm: Zig-Zags Property
With an exact line search, the step size along dk satisfies
\[
\frac{\partial f(x_k + \alpha d_k)}{\partial \alpha} = 0
\]
\[
\frac{\partial f(x_{k+1})}{\partial \alpha}
= \frac{\partial f(x_{k+1})}{\partial x_{k+1}} \frac{\partial x_{k+1}}{\partial \alpha}
= \frac{\partial f(x_{k+1})}{\partial x_{k+1}} \frac{\partial (x_k + \alpha d_k)}{\partial \alpha} = 0
\]
\[
\nabla f(x_{k+1})^T d_k = 0
\]
From the last line, since d_{k+1} = −∇f(x_{k+1}), the (k+1)th direction is perpendicular to the kth direction. If you use an approximate line search, this perpendicularity is lost, but the zig-zags are still there.
Steepest Descent Algorithm : The Bean function
Minimize the bean function
\[
f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2}\left(2x_2 - x_1^2\right)^2 ,
\]
using the steepest-descent algorithm with an exact line search and a convergence tolerance of ∥∇f∥ ≤ 10⁻⁶.
\[
x^{*} = \begin{bmatrix} 1.2134 \\ 0.8241 \end{bmatrix}, \qquad f(x^{*}) = 0.0919
\]
[Figure: steepest-descent iterates on the contours of the bean function.]
Steepest Descent Algorithm : Convergence Characteristics
The convergence of steepest descent is governed by the condition number of the Hessian,
\[
\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}
\]
Steepest Descent Algorithm : Convergence Characteristics
\[
H(f) = \nabla^2 f = \begin{bmatrix} 2 & 0 \\ 0 & 2\beta \end{bmatrix}, \quad \text{with } \beta = 1, 5, 15
\]
[Figure: steepest-descent iterations for β = 1, 5, 15, requiring 1, 29, and 90 iterations, respectively.]
The Conjugate Gradient Method
The conjugate gradient method is motivated by the quadratic problem
\[
\min_{x} \; q(x) = \frac{1}{2} x^T A x + b^T x + c ,
\]
which it solves using search directions that are mutually conjugate with respect to A:
\[
d_i^T A d_j = 0, \quad i \neq j, \quad 0 \leq i, j \leq n
\]
The Conjugate Gradient Method : The method
• The mutually conjugate vectors are the basis vectors of A. They are generally not orthogonal to one another.
• The algorithm is started with the direction of steepest descent:
\[
d_1 = -g_1
\]
• Use line search to find the next design point. For quadratic functions, the step factor α can be computed exactly. The update is then:
\[
x_2 = x_1 + \alpha_1 d_1
\]
The Conjugate Gradient Method : The method
• Suppose we want to derive the optimal step factor for a line search on a quadratic function:
\[
\min_{\alpha} \; f(x + \alpha d)
\]
We have
\[
\frac{\partial f(x + \alpha d)}{\partial \alpha}
= \frac{\partial}{\partial \alpha} \left[ \frac{1}{2}(x + \alpha d)^T A (x + \alpha d) + b^T (x + \alpha d) + c \right]
= d^T A (x + \alpha d) + d^T b
= d^T (Ax + b) + \alpha \, d^T A d
\]
Setting this derivative to zero results in
\[
\alpha = -\frac{d^T (Ax + b)}{d^T A d}
\]
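A small numerical illustration of this step-factor formula; the quadratic with A = diag(2, 8) and b = 0 corresponds to f = x1^2 + 4*x2^2, which is used in the example a couple of slides below.

# Exact step factor for a line search on q(x) = 0.5*x^T A x + b^T x + c.
import numpy as np

def quadratic_step(A, b, x, d):
    return -(d @ (A @ x + b)) / (d @ A @ d)

A = np.diag([2.0, 8.0])                 # Hessian of f = x1^2 + 4*x2^2
b = np.zeros(2)
x0 = np.array([1.0, 1.0])
d0 = -(A @ x0 + b)                      # steepest-descent direction [-2, -8]
alpha0 = quadratic_step(A, b, x0, d0)   # approx 0.1308
print(alpha0, x0 + alpha0 * d0)         # approx [0.7385, -0.0462]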
The Conjugate Gradient Method : The method
• Subsequent iterations choose the next direction based on the current gradient and a contribution from the previous direction:
\[
d_k = -g_k + \beta_k d_{k-1}
\]
for a scalar parameter β. Larger values of β indicate that the previous descent direction contributes more strongly.
• To find the best value of β for a known A, we use the fact that dk is conjugate to dk−1:
\[
d_k^T A d_{k-1} = 0 \;\Longrightarrow\; (-g_k + \beta_k d_{k-1})^T A d_{k-1} = 0
\]
\[
-g_k^T A d_{k-1} + \beta_k d_{k-1}^T A d_{k-1} = 0
\;\Longrightarrow\; \beta_k = \frac{g_k^T A d_{k-1}}{d_{k-1}^T A d_{k-1}}
\]
In general we do not know the value of A that best approximates f around xk. Several choices for βk tend to work well (a sketch implementing both follows below):
• Fletcher–Reeves:
\[
\beta_k = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}}
\]
• Polak–Ribière:
\[
\beta_k = \frac{g_k^T (g_k - g_{k-1})}{g_{k-1}^T g_{k-1}},
\]
for which convergence can be guaranteed with the modification β ← max(β, 0).
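A brief sketch of the two β choices (illustrative; the gradient values are approximate values from the quadratic example that follows).

# The two beta choices used to build nonlinear conjugate gradient directions.
import numpy as np

def beta_fletcher_reeves(g_new, g_old):
    return (g_new @ g_new) / (g_old @ g_old)

def beta_polak_ribiere(g_new, g_old):
    return max((g_new @ (g_new - g_old)) / (g_old @ g_old), 0.0)   # with the max(beta, 0) reset

g0 = np.array([2.0, 8.0])               # gradient at x0 = [1, 1] for f = x1^2 + 4*x2^2
g1 = np.array([1.4769, -0.3695])        # gradient at x1 = [0.7385, -0.0462] (approximate)
print(beta_fletcher_reeves(g1, g0))     # approx 0.034
print(beta_polak_ribiere(g1, g0))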
The Conjugate Gradient Method : Example
Consider f = x1² + 4x2² with x0 = [1  1]^T. We will perform two iterations of the conjugate gradient algorithm. The first step is the steepest descent iteration. Thus
\[
d_0 = -\nabla f(x_0) = -\begin{bmatrix} 2 & 8 \end{bmatrix}^T ,
\]
which yields α0 = 0.1308 and x1 = x0 + α0 d0 = [0.7385  −0.0462]^T. The next iteration uses the Fletcher–Reeves method:
The Conjugate Gradient Method : Example
We have
\[
d_1 = -g_1 + \beta_1 d_0 = \begin{bmatrix} -1.5451 \\ 0.0966 \end{bmatrix},
\]
which yields α1 = 0.4780 and
\[
x_2 = x_1 + \alpha_1 d_1 = \begin{bmatrix} 0.7385 \\ -0.0462 \end{bmatrix} + 0.4780 \begin{bmatrix} -1.5451 \\ 0.0966 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
The Conjugate Gradient Method : Example
[Figure: conjugate gradient reaching x∗ from x0 in 2 iterations on the contours of f.]
The Conjugate Gradient Method : Algorithm
Require: x0, εG
k = 0
while ∥∇fk∥ > εG do
    if k = 0 then
        dk = −∇fk / ∥∇fk∥
    else
        βk = (∇fk^T ∇fk) / (∇fk−1^T ∇fk−1)
        dk = −∇fk / ∥∇fk∥ + βk dk−1
    end if
    αk = line_search(f, dk)
    xk+1 = xk + αk dk
    k = k + 1
end while
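A minimal Python sketch of this nonlinear conjugate gradient loop, using the Fletcher–Reeves β and a bounded scalar line search (a sketch under those assumptions; the bean-function gradient below is hand-derived).

# Nonlinear conjugate gradient with the Fletcher-Reeves beta.
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(f, grad, x0, eps_g=1e-6, k_max=200):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g / np.linalg.norm(g)
    for _ in range(k_max):
        if np.linalg.norm(g) <= eps_g:
            break
        alpha = minimize_scalar(lambda a: f(x + a * d),
                                bounds=(0.0, 10.0), method="bounded").x
        x = x + alpha * d
        g_new = grad(x)
        if np.linalg.norm(g_new) <= eps_g:
            break
        beta = (g_new @ g_new) / (g @ g)                # Fletcher-Reeves
        d = -g_new / np.linalg.norm(g_new) + beta * d
        g = g_new
    return x

# Bean function from the following example (gradient hand-derived):
f = lambda x: (1 - x[0])**2 + (1 - x[1])**2 + 0.5 * (2 * x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 2 * x[0] * (2 * x[1] - x[0]**2),
                           -2 * (1 - x[1]) + 2 * (2 * x[1] - x[0]**2)])
print(conjugate_gradient(f, grad, [-2.0, 2.0]))   # approx [1.2134, 0.8241]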
The Conjugate Gradient Method : Example
\[
f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2}\left(2x_2 - x_1^2\right)^2
\]
[Figure: conjugate gradient (CG, 18 iterations) versus steepest descent (SD, 31 iterations) on the bean function.]
Newton’s Method
• The function value and gradient can help to determine the direction to travel, but they do not directly help to determine how far to step to reach a local minimum.
• Second-order information allows us to make a quadratic approximation of the objective function and approximate the right step size to reach a local minimum.
• As we have seen with a quadratic fit search, we can analytically obtain the location where a quadratic approximation has a zero gradient. We can use that location as the next iterate to approach a local minimum.
• The quadratic approximation about a point xk comes from the second-order
Taylor expansion:
\[
q(x) = f(x_k) + (x - x_k) f'(x_k) + \frac{1}{2}(x - x_k)^2 f''(x_k)
\]
\[
\frac{\partial}{\partial x} q(x) = f'(x_k) + (x - x_k) f''(x_k) = 0
\]
\[
x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}
\]
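A minimal sketch of the univariate Newton iteration; the example function f(x) = x^4 − 3x^3 + 2 is a hypothetical choice (the slide's example function does not survive in the extracted text).

# Univariate Newton's method: x_{k+1} = x_k - f'(x_k)/f''(x_k).
def newton_1d(df, d2f, x0, tol=1e-8, k_max=50):
    x = x0
    for _ in range(k_max):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical example: f(x) = x^4 - 3*x^3 + 2, so f'(x) = 4x^3 - 9x^2 and f''(x) = 12x^2 - 18x.
print(newton_1d(lambda x: 4 * x**3 - 9 * x**2,
                lambda x: 12 * x**2 - 18 * x, x0=3.0))   # converges to 2.25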
Newton’s Method : Example
With x0 = 3, we can form the quadratic using the function value and the first and second derivatives evaluated at that point.
[Figure: successive quadratic approximations and the Newton iterates x0, x1, x2, x3 for the univariate example.]
Newton’s Method : Disadvantages
• The update rule in Newton’s method involves dividing by the second derivative.
The update is undefined if the second derivative is zero, which occurs when the
quadratic approximation is a horizontal line.
• Instability also occurs when the second derivative is very close to zero, in which
case the next iterate will lie very far from the current design point, far from
where the local quadratic approximation is valid.
• Poor local approximations can lead to poor performance with Newton’s method.
[Figure: cases where a poor local quadratic approximation causes Newton’s method to overshoot or oscillate between iterates.]
Newton’s Method : Multivariate Optimization
\[
f(x) \approx q(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2}(x - x_k)^T H_k (x - x_k)
\]
\[
\frac{\partial q(x)}{\partial (x - x_k)} = \nabla f(x_k) + H_k (x - x_k) = 0
\]
We then solve for the next iterate, thereby obtaining Newton's method in multivariate form:
\[
x_{k+1} = x_k - H_k^{-1} \nabla f(x_k)
\]
• If f (x) is quadratic and its Hessian is positive definite, then the update
converges to the global minimum in one step. For general functions, Newton’s
method is often terminated once x ceases to change by more than a given
tolerance.
Newton’s Method : Example
With the starting point x^{(1)} = [9  8]^T, we will use Newton's method to minimize Booth's function,
\[
f(x) = (x_1 + 2x_2 - 7)^2 + (2x_1 + x_2 - 5)^2 .
\]
The gradient at the next iterate x^{(2)} is zero, so we have converged after a single iteration. The Hessian is positive definite everywhere, so x^{(2)} is the global minimum.
Newton’s Method : Multivariate Optimization
[Figure: steepest descent (6 iterations) versus Newton’s method (1 iteration) on the previous example, with a zoomed view near the minimum.]
Require: x0, εG, ∇f, H
k = 0
while ∥∇f(x)∥ > εG && k ≤ kmax do
    ∆ = H(x)⁻¹ ∇f(x)
    x = x − ∆
    k = k + 1
end while
return x
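A multivariate Newton sketch in Python; a linear solve replaces the explicit inverse, and Booth's function from the earlier example is used as a test (gradient and Hessian hand-derived).

# Multivariate Newton's method: solve H(x) * delta = grad(x), then x <- x - delta.
import numpy as np

def newton(grad, hess, x0, eps_g=1e-8, k_max=50):
    x = np.asarray(x0, dtype=float)
    k = 0
    while np.linalg.norm(grad(x)) > eps_g and k <= k_max:
        delta = np.linalg.solve(hess(x), grad(x))
        x = x - delta
        k += 1
    return x

# Booth's function f = (x1 + 2*x2 - 7)^2 + (2*x1 + x2 - 5)^2 is quadratic,
# so Newton converges in one step from [9, 8].
grad = lambda x: np.array([2 * (x[0] + 2 * x[1] - 7) + 4 * (2 * x[0] + x[1] - 5),
                           4 * (x[0] + 2 * x[1] - 7) + 2 * (2 * x[0] + x[1] - 5)])
hess = lambda x: np.array([[10.0, 8.0], [8.0, 10.0]])
print(newton(grad, hess, [9.0, 8.0]))   # -> [1, 3]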
Secant Method
Newton's method requires the second derivative. The secant method instead approximates it using the last two iterates:
\[
f''(x_k) \approx \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}
\]
\[
x_{k+1} = x_k - \frac{x_k - x_{k-1}}{f'(x_k) - f'(x_{k-1})} \, f'(x_k)
\]
• The secant method requires an additional initial design point. It suffers from the same problems as Newton's method.
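A minimal sketch of the secant update (two starting points are required; the derivative f'(x) = 4x^3 − 9x^2 reuses the hypothetical example from the Newton sketch above).

# Secant method: approximates f'' from the last two iterates of f'.
def secant(df, x0, x1, tol=1e-8, k_max=100):
    for _ in range(k_max):
        d0, d1 = df(x0), df(x1)
        if abs(d1 - d0) < 1e-16:           # avoid division by (near) zero
            break
        x2 = x1 - d1 * (x1 - x0) / (d1 - d0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

print(secant(lambda x: 4 * x**3 - 9 * x**2, 2.0, 3.0))   # converges to 2.25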
Quasi-Newton Methods
Quasi-Newton methods replace the inverse Hessian in Newton's update with an approximation Qk that is updated at each iteration:
\[
x_{k+1} = x_k - \alpha_k Q_k \nabla f_k ,
\]
Quasi-Newton Methods : Davidon-Fletcher-Powell (DFP)
\[
Q_{k+1} = Q_k - \frac{Q_k \gamma_k \gamma_k^T Q_k}{\gamma_k^T Q_k \gamma_k} + \frac{\delta_k \delta_k^T}{\delta_k^T \gamma_k},
\qquad \text{where } \delta_k = x_{k+1} - x_k, \;\; \gamma_k = \nabla f_{k+1} - \nabla f_k .
\]
Quasi-Newton Methods : Davidon-Fletcher-Powell (DFP)
Require: x0, εG, ∇f
k = 0, Q = I
while ∥∇f(x)∥ > εG && k ≤ kmax do
    g = ∇f(x)
    x′ = line_search(f, x, −Q ∗ g)
    g′ = ∇f(x′)
    δ = x′ − x
    γ = g′ − g
    Q = Q − (Q γ γ^T Q)/(γ^T Q γ) + (δ δ^T)/(δ^T γ)
    x = x′
    k = k + 1
end while
return x
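A Python sketch of the DFP loop (illustrative; it reuses the bounded scalar line search and the hand-coded bean-function gradient from the earlier sketches).

# DFP quasi-Newton method: maintain an approximation Q of the inverse Hessian.
import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad, x0, eps_g=1e-6, k_max=200):
    x = np.asarray(x0, dtype=float)
    Q = np.eye(len(x))
    g = grad(x)
    k = 0
    while np.linalg.norm(g) > eps_g and k <= k_max:
        d = -Q @ g
        alpha = minimize_scalar(lambda a: f(x + a * d),
                                bounds=(0.0, 10.0), method="bounded").x
        x_new = x + alpha * d
        g_new = grad(x_new)
        delta, gamma = x_new - x, g_new - g
        Q = Q - np.outer(Q @ gamma, gamma @ Q) / (gamma @ Q @ gamma) \
              + np.outer(delta, delta) / (delta @ gamma)
        x, g, k = x_new, g_new, k + 1
    return x

# Bean function (same hand-derived gradient as in the CG sketch):
f = lambda x: (1 - x[0])**2 + (1 - x[1])**2 + 0.5 * (2 * x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 2 * x[0] * (2 * x[1] - x[0]**2),
                           -2 * (1 - x[1]) + 2 * (2 * x[1] - x[0]**2)])
print(dfp(f, grad, [-2.0, 2.0]))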
Quasi-Newton Methods : Broyden–Fletcher–Goldfarb–Shanno (BFGS)
\[
Q_{k+1} = Q_k - \frac{\delta_k \gamma_k^T Q_k + Q_k \gamma_k \delta_k^T}{\delta_k^T \gamma_k}
+ \left( 1 + \frac{\gamma_k^T Q_k \gamma_k}{\delta_k^T \gamma_k} \right) \frac{\delta_k \delta_k^T}{\delta_k^T \gamma_k}
\]
Require: x0, εG, ∇f
k = 0, Q = I
while ∥∇f(x)∥ > εG && k ≤ kmax do
    g = ∇f(x)
    x′ = line_search(f, x, −Q ∗ g)
    g′ = ∇f(x′)
    δ = x′ − x
    γ = g′ − g
    Q = Q − (δ γ^T Q + Q γ δ^T)/(δ^T γ) + (1 + (γ^T Q γ)/(δ^T γ)) (δ δ^T)/(δ^T γ)
    x = x′
    k = k + 1
end while
return x
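In practice one rarely hand-codes BFGS; as a usage sketch, scipy's off-the-shelf implementation can be applied directly (shown here on the bean function).

# Using scipy's BFGS implementation on the bean function.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (1 - x[0])**2 + (1 - x[1])**2 + 0.5 * (2 * x[1] - x[0]**2)**2
res = minimize(f, x0=np.array([-2.0, 2.0]), method="BFGS", tol=1e-8)
print(res.x, res.fun)    # approx [1.2134, 0.8241], 0.0919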
Compare Four Methods
[Figure: four panels comparing the iteration paths of the four methods.]
Compare Four Methods
\[
\min_{x_1, x_2} \;\; \frac{1}{2} k_1 \left( \sqrt{(l_1 + x_1)^2 + x_2^2} - l_1 \right)^2
+ \frac{1}{2} k_2 \left( \sqrt{(l_2 - x_1)^2 + x_2^2} - l_2 \right)^2 - m g x_2
\]
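A sketch of solving this two-spring problem numerically; the parameter values k1, k2, l1, l2, m, g below are illustrative assumptions, since the slide does not specify them.

# Two-spring potential-energy minimization with illustrative parameters.
import numpy as np
from scipy.optimize import minimize

k1, k2, l1, l2, m, g = 100.0, 100.0, 10.0, 10.0, 5.0, 9.81

def energy(x):
    s1 = np.hypot(l1 + x[0], x[1]) - l1      # stretch of spring 1
    s2 = np.hypot(l2 - x[0], x[1]) - l2      # stretch of spring 2
    return 0.5 * k1 * s1**2 + 0.5 * k2 * s2**2 - m * g * x[1]

res = minimize(energy, x0=np.array([0.0, 1.0]), method="BFGS")
print(res.x, res.fun)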
Reference
2. Mykel J. Kochenderfer and Tim A. Wheeler, “Algorithms for Optimization,” The MIT Press, 2019.