
E1 251 Linear and Nonlinear Optimization

Chapter 8: Method of steepest descent and Newton's method
8.1. Steepest Descent Method

Step along an appropriate direction of steepest decrease:
x_{k+1} = x_k + α_k d_k,    d_k : locally best direction.

To get the direction, look at Taylor's formula:
f(x_{k+1}) = f(x_k) + (x_{k+1} − x_k)^T ∇f(x_k) + (1/2)(x_{k+1} − x_k)^T F(x_k + θ(x_{k+1} − x_k))(x_{k+1} − x_k)
f(x_{k+1}) = f(x_k) + α_k d_k^T ∇f(x_k) + (α_k²/2) d_k^T F(x_k + θ α_k d_k) d_k

For α_k arbitrarily small:
f(x_{k+1}) ≈ f(x_k) + α_k d_k^T ∇f(x_k)

To make f(x_{k+1}) as small as possible, we set
d_k = −∇f(x_k).

This gives the steepest descent algorithm:
x_{k+1} = x_k − α_k g_k,    where g_k = ∇f(x_k).
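A minimal Python sketch of this update with a fixed step length; the test function, step size, and tolerance are illustrative assumptions, not part of the slides:

```python
import numpy as np

def steepest_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10000):
    """Steepest descent with a fixed step size (illustrative only)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                   # g_k = grad f(x_k)
        if np.linalg.norm(g) < tol:   # stop once the gradient is nearly zero
            break
        x = x - alpha * g             # x_{k+1} = x_k - alpha_k g_k, alpha_k fixed here
    return x

# Example: f(x) = (x1 - 1)^2 + 4 (x2 + 2)^2, minimizer (1, -2)
grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 8.0 * (x[1] + 2.0)])
print(steepest_descent(grad_f, x0=[0.0, 0.0]))   # approx [ 1. -2.]
```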
8.2. Steepest descent for quadratic functions

Consider the normalized quadratic form
f(x) = (1/2) x^T Q x − x^T b,    Q : positive definite.

Substituting x = x_{k+1} = x_k − α g_k in the above equation gives
f(x_k − α g_k) = (1/2)(x_k − α g_k)^T Q (x_k − α g_k) − (x_k − α g_k)^T b.

Minimizing over α gives
α_k = (g_k^T g_k) / (g_k^T Q g_k).

Hence the iteration is
x_{k+1} = x_k − [ (g_k^T g_k) / (g_k^T Q g_k) ] g_k.
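A small sketch of this exact-line-search iteration; Q, b, and the starting point below are illustrative assumptions:

```python
import numpy as np

def sd_quadratic(Q, b, x0, tol=1e-10, max_iter=1000):
    """Steepest descent for f(x) = 0.5 x^T Q x - b^T x with the exact step
    alpha_k = (g_k^T g_k) / (g_k^T Q g_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = Q @ x - b                       # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            break
        alpha = (g @ g) / (g @ Q @ g)       # exact minimizer along -g
        x = x - alpha * g
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])      # positive definite (illustrative)
b = np.array([1.0, 2.0])
x_min = sd_quadratic(Q, b, x0=np.zeros(2))
print(np.allclose(Q @ x_min, b))            # True: the minimizer satisfies Q x* = b
```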
An observation:

The minimum satisfies
Q x* = b.

Consider the error measure
E(x) = (1/2)(x − x*)^T Q (x − x*)
     = f(x) + (1/2)(x*)^T Q x*.

Minimizing f(x) over α_k is therefore equivalent to minimizing E(x).
Search path in steepest descent method

[Figure: search path of the steepest descent iterates]
8.3. Convergence rate of SD for quadratic functions

E(x_k) = (1/2)(x_k − x*)^T Q (x_k − x*) = (1/2) y_k^T Q y_k,    where y_k = x_k − x*.

E(x_{k+1}) = (1/2)(x_{k+1} − x*)^T Q (x_{k+1} − x*)

E(x_{k+1}) = [ 1 − (g_k^T g_k)² / ( (g_k^T Q g_k)(g_k^T Q^{-1} g_k) ) ] E(x_k) = [1 − γ_k] E(x_k)

E(x_{k+1}) ≤ [ 1 − 4 λ_1 λ_n / (λ_1 + λ_n)² ] E(x_k),
where λ_1 and λ_n are the smallest and largest eigenvalues of Q.

E(x_{k+1}) ≤ [ (λ_n − λ_1)² / (λ_1 + λ_n)² ] E(x_k) = [ (r − 1)² / (r + 1)² ] E(x_k),
r = λ_n / λ_1 (condition number).
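A quick numerical check of this bound; the diagonal Q and the starting point are illustrative assumptions:

```python
import numpy as np

# Per-step error ratio E(x_{k+1}) / E(x_k) versus the bound ((r - 1)/(r + 1))^2.
Q = np.diag([1.0, 10.0])                    # lambda_1 = 1, lambda_n = 10 (illustrative)
b = np.zeros(2)                             # then x* = 0 and E(x) = 0.5 x^T Q x
E = lambda x: 0.5 * x @ Q @ x
r = 10.0
bound = ((r - 1.0) / (r + 1.0)) ** 2        # = 81/121

x = np.array([1.0, 1.0])
for k in range(5):
    g = Q @ x - b
    alpha = (g @ g) / (g @ Q @ g)           # exact line search
    x_new = x - alpha * g
    print(E(x_new) / E(x) <= bound + 1e-12) # True at every step
    x = x_new
```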
8.4. SD convergence rates for non-quadratic problems

Assumption: the Hessian satisfies aI ≤ F(x) ≤ AI.

Case I: Exact line search
f(x_{k+1}) − f* ≤ (1 − a/A) ( f(x_k) − f* )

Case II: Inexact line search with the following condition (Armijo's rule)
f(x_k + α_k d_k) ≤ f(x_k) + c_1 α_k ∇f(x_k)^T d_k
f(x_k + η α_k d_k) > f(x_k) + c_1 η α_k ∇f(x_k)^T d_k

f(x_{k+1}) − f* ≤ ( 1 − 2 c_1 a / (η A) ) ( f(x_k) − f* )

Case III: Inexact line search with a quadratic formula for α_k:
f(x_{k+1}) − f* ≤ ( (A − a)/(A + a) )² ( f(x_k) − f* ),
where a and A are the smallest and largest eigenvalues of F(x*) and k is sufficiently large.
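A sketch of a backtracking search satisfying Armijo's condition; the parameter values, function names, and test function are illustrative assumptions:

```python
import numpy as np

def armijo_step(f, grad_f, x, d, c1=1e-4, eta=2.0, alpha=1.0, max_backtracks=50):
    """Backtracking (Armijo) line search: accept alpha once
    f(x + alpha d) <= f(x) + c1 alpha grad^T d; the previously tried,
    larger step eta * alpha violated this condition."""
    slope = grad_f(x) @ d                    # grad f(x)^T d, negative for a descent direction
    for _ in range(max_backtracks):
        if f(x + alpha * d) <= f(x) + c1 * alpha * slope:
            return alpha
        alpha /= eta                         # shrink the step and try again
    return alpha

# Usage with the steepest descent direction d = -grad f(x)
f = lambda x: (x[0] - 1.0) ** 2 + 4.0 * (x[1] + 2.0) ** 2
grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 8.0 * (x[1] + 2.0)])
x = np.array([0.0, 0.0])
d = -grad_f(x)
print(armijo_step(f, grad_f, x, d))          # 0.25 for this example
```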
Exercise (Luenberger):

Suppose we use the method of steepest descent to minimize the quadratic function
f(x) = (x − x*)^T Q (x − x*), but we allow a tolerance ±δ ᾱ_k, δ > 0, in the line search.
In other words, we use the iteration
x_{k+1} = x_k − α_k g_k,    where (1 − δ) ᾱ_k ≤ α_k ≤ (1 + δ) ᾱ_k,
with ᾱ_k = (g_k^T g_k) / (g_k^T Q g_k) being the exact line-search step.

(a) Find the convergence rate.
(b) What is the largest δ that guarantees convergence?
From page 7:
E(x_{k+1}) = [ 1 − ( 2 α_k g_k^T Q y_k − α_k² g_k^T Q g_k ) / ( y_k^T Q y_k ) ] E(x_k)

Substitute α_k = (1 ± δ) (g_k^T g_k) / (g_k^T Q g_k):

E(x_{k+1}) = [ 1 − ( 2 (1 ± δ) (g_k^T g_k / g_k^T Q g_k) g_k^T Q y_k − (1 ± δ)² (g_k^T g_k / g_k^T Q g_k)² g_k^T Q g_k ) / ( y_k^T Q y_k ) ] E(x_k)

Using the relation Q y_k = g_k (so that y_k^T Q y_k = g_k^T Q^{-1} g_k) we get

E(x_{k+1}) = [ 1 − ( 2 (1 ± δ) (g_k^T g_k)² / (g_k^T Q g_k) − (1 ± δ)² (g_k^T g_k)² / (g_k^T Q g_k) ) / ( g_k^T Q^{-1} g_k ) ] E(x_k)
E(x_{k+1}) = [ 1 − ( 2(1 ± δ) − (1 ± δ)² ) (g_k^T g_k)² / ( (g_k^T Q g_k)(g_k^T Q^{-1} g_k) ) ] E(x_k)

Since 2(1 ± δ) − (1 ± δ)² = 2 ± 2δ − (1 + δ² ± 2δ) = 1 − δ²,

E(x_{k+1}) = [ 1 − (1 − δ²) (g_k^T g_k)² / ( (g_k^T Q g_k)(g_k^T Q^{-1} g_k) ) ] E(x_k) = [ 1 − (1 − δ²) γ_k ] E(x_k)
E(x_{k+1}) = [ 1 − (1 − δ²) γ_k ] E(x_k)
           ≤ [ 1 − 4 (1 − δ²) λ_1 λ_n / (λ_1 + λ_n)² ] E(x_k)
           = [ ( (λ_n − λ_1)² + 4 δ² λ_1 λ_n ) / (λ_1 + λ_n)² ] E(x_k)
           = [ ( (r − 1)² + 4 δ² r ) / (r + 1)² ] E(x_k),

r = λ_n / λ_1 (condition number)
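A small numerical check of the per-step factor derived above, taking the worst-case perturbation α_k = (1 ± δ) ᾱ_k exactly; the matrix, starting point, and δ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.diag([1.0, 10.0])                    # x* = 0, so E(x) = 0.5 x^T Q x (illustrative)
E = lambda x: 0.5 * x @ Q @ x
delta = 0.3

x = np.array([1.0, 1.0])
for k in range(5):
    g = Q @ x
    alpha_bar = (g @ g) / (g @ Q @ g)       # exact line-search step
    alpha = (1.0 + delta * rng.choice([-1.0, 1.0])) * alpha_bar   # tolerance extreme
    gamma = (g @ g) ** 2 / ((g @ Q @ g) * (g @ np.linalg.solve(Q, g)))
    x_new = x - alpha * g
    # the per-step factor equals 1 - (1 - delta^2) * gamma_k, as derived
    print(np.isclose(E(x_new) / E(x), 1.0 - (1.0 - delta ** 2) * gamma))
    x = x_new
```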
Exercise:

In the method of steepest descent applied to the quadratic problem
f(x) = (1/2) x^T Q x − b^T x, prove that g_{k+1}^T g_k = 0.
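A numerical illustration of this orthogonality (not a proof); Q, b, and the starting point are illustrative assumptions:

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])       # illustrative positive definite Q
b = np.array([1.0, 2.0])
x = np.array([5.0, -3.0])
g = Q @ x - b
for _ in range(3):
    alpha = (g @ g) / (g @ Q @ g)            # exact line search along -g
    x = x - alpha * g
    g_next = Q @ x - b
    print(g_next @ g)                        # ~0: consecutive gradients are orthogonal
    g = g_next
```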
Exercise (Luenberger):

Consider the quadratic problem f(x) = (1/2) x^T Q x − b^T x, where Q is an n × n matrix.
Let {v_0, v_1, ..., v_{p−1}} be a subset of eigenvectors of Q whose corresponding
eigenvalues {λ_0, λ_1, ..., λ_{p−1}} are in nondecreasing order. Suppose that the initial
guess is chosen in such a way that the corresponding gradient is a linear combination of
the set {v_0, v_1, ..., v_{p−1}}. Show that any subsequent gradient g_k will be a linear
combination of {v_0, v_1, ..., v_{p−1}}.

Derive the convergence rate.
Exercise (Luenberger):

Suppose an iterative algorithm of the form x_{k+1} = x_k + α_k d_k is applied to the
quadratic problem with matrix Q, where α_k is, as usual, chosen as the minimum point of
the line search, and where d_k is a vector satisfying d_k^T g_k < 0 and
(d_k^T g_k)² ≥ β (d_k^T Q d_k)(g_k^T Q^{-1} g_k),    where 0 < β ≤ 1.

Estimate the rate of convergence of this algorithm.
Recap from Chapter 7:

Determine α_k:
f(x_k + α d_k) = (1/2)(x_k + α d_k)^T Q (x_k + α d_k) − (x_k + α d_k)^T b
              = (1/2) x_k^T Q x_k + (α²/2) d_k^T Q d_k + α d_k^T Q x_k − x_k^T b − α d_k^T b

α_k is determined by
(d/dα) f(x_k + α d_k) = α d_k^T Q d_k + d_k^T Q x_k − d_k^T b = 0,
i.e.  α d_k^T Q d_k + d_k^T (Q x_k − b) = 0,    where Q x_k − b = g_k.

Hence
α_k = − (d_k^T g_k) / (d_k^T Q d_k)
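A one-line check of this step formula for a general descent direction; the data and the direction d are illustrative assumptions:

```python
import numpy as np

def exact_step(Q, g, d):
    """Exact line-search step for the quadratic: alpha_k = -d_k^T g_k / (d_k^T Q d_k)."""
    return -(d @ g) / (d @ Q @ d)

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.array([0.0, 0.0])
g = Q @ x - b                                # g_k = Q x_k - b
d = np.array([1.0, 0.0])                     # any direction with d^T g < 0
alpha = exact_step(Q, g, d)
print(d @ (Q @ (x + alpha * d) - b))         # ~0: the directional derivative vanishes
```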
E(x_k) = (1/2)(x_k − x*)^T Q (x_k − x*) = (1/2) y_k^T Q y_k,    where y_k = x_k − x*.

E(x_{k+1}) = (1/2)(x_{k+1} − x*)^T Q (x_{k+1} − x*)

Substitute x_{k+1} = x_k + α_k d_k:

E(x_{k+1}) = (1/2)(x_k + α_k d_k − x*)^T Q (x_k + α_k d_k − x*)
           = (1/2)(y_k + α_k d_k)^T Q (y_k + α_k d_k)
           = (1/2) y_k^T Q y_k + α_k d_k^T Q y_k + (α_k²/2) d_k^T Q d_k
           = [ 1 − ( −2 α_k d_k^T Q y_k − α_k² d_k^T Q d_k ) / ( y_k^T Q y_k ) ] ( (1/2) y_k^T Q y_k )
Substitute α_k = − (d_k^T g_k) / (d_k^T Q d_k):

E(x_{k+1}) = [ 1 − ( 2 (d_k^T g_k / d_k^T Q d_k) d_k^T Q y_k − (d_k^T g_k / d_k^T Q d_k)² d_k^T Q d_k ) / ( y_k^T Q y_k ) ] E(x_k)

Using the relation Q y_k = g_k we get

E(x_{k+1}) = [ 1 − ( 2 (d_k^T g_k)² / (d_k^T Q d_k) − (d_k^T g_k)² / (d_k^T Q d_k) ) / ( g_k^T Q^{-1} g_k ) ] E(x_k)
E(x_{k+1}) = [ 1 − (d_k^T g_k)² / ( (d_k^T Q d_k)(g_k^T Q^{-1} g_k) ) ] E(x_k)

Since (d_k^T g_k)² / ( (d_k^T Q d_k)(g_k^T Q^{-1} g_k) ) ≥ β,

E(x_{k+1}) ≤ (1 − β) E(x_k).
8.5. Newton iteration:

Newton's method is based on the local quadratic approximation of the function:
f(x) ≈ f_k(x) = f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2)(x − x_k)^T F(x_k)(x − x_k).

With respect to the standard quadratic form f_k(x) = 0.5 x^T Q_k x − b_k^T x + c, we have
Q_k = F(x_k)
b_k = −∇f(x_k) + F(x_k) x_k
c  = f(x_k) − ∇f(x_k)^T x_k + 0.5 x_k^T F(x_k) x_k

The next iterate, x_{k+1}, is the minimizer of f_k(x):
x_{k+1} = x_k − [F(x_k)]^{-1} ∇f(x_k).
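A minimal sketch of the pure Newton iteration; the test function and starting point are illustrative assumptions:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton iteration x_{k+1} = x_k - [F(x_k)]^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)  # solve F(x_k) d = g rather than forming the inverse
    return x

# Example: f(x) = exp(x1) - x1 + x2^2, minimizer (0, 0)
grad = lambda x: np.array([np.exp(x[0]) - 1.0, 2.0 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]])
print(newton(grad, hess, x0=[1.0, 0.5]))     # approx [0. 0.]
```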
Newton's iteration on a quadratic function

x_{k+1} = x_k − [F(x_k)]^{-1} ∇f(x_k)
x_{k+1} = x_k − Q^{-1} (Q x_k − b)
x_{k+1} = x_k − x_k + Q^{-1} b,    where Q^{-1} b = x*
x_{k+1} = x*

So on a quadratic function, Newton's method reaches the minimizer in a single step.

Convergence analysis of Newton's method

Theorem 8.5A:
Suppose that f ∈ C³ (three times continuously differentiable) and x* is a point such that
∇f(x*) = 0 and F(x*) is invertible. Then for all x_k sufficiently close to x*, Newton's
method is well defined for all k and converges to x* with order of convergence at least two, i.e.,
‖x_{k+1} − x*‖ ≤ c ‖x_k − x*‖².
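A one-dimensional illustration of this quadratic convergence; the function and starting point are illustrative assumptions:

```python
import numpy as np

# Errors |x_k - x*| for Newton on f(x) = exp(x) - x, with x* = 0; each error is
# roughly the square of the previous one, as Theorem 8.5A predicts.
x = 1.0
for k in range(5):
    x = x - (np.exp(x) - 1.0) / np.exp(x)    # Newton step for f'(x) = exp(x) - 1
    print(k, abs(x))
# prints approximately 0.37, 0.06, 1.8e-3, 1.6e-6, 1.2e-12
```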
8.6. Damped Newton iteration:

The damped update:
x_{k+1} = x_k − α_k [F(x_k)]^{-1} ∇f(x_k)

The vector d_k = [F(x_k)]^{-1} ∇f(x_k) is now merely a new type of search direction.

Now let g(α) = f(x_k − α d_k). Then
g′(0) = −∇f(x_k)^T d_k = −∇f(x_k)^T [F(x_k)]^{-1} ∇f(x_k).

Since g′(0) < 0 (whenever F(x_k) is positive definite), there exists a positive α such that
g(α) < g(0). Hence there is always an α_k such that f(x_{k+1}) < f(x_k).

If computing [F(x_k)]^{-1} is infeasible, we can use some [F′(x_k)]^{-1} as an approximate
inverse of F(x_k).
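A sketch combining the Newton direction with a backtracking choice of α_k; the c_1 value, test function, and starting point are illustrative assumptions:

```python
import numpy as np

def damped_newton(f, grad, hess, x0, c1=1e-4, tol=1e-10, max_iter=100):
    """Damped Newton: the Newton direction combined with a backtracking step length."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), g)       # d_k = [F(x_k)]^{-1} grad f(x_k)
        alpha = 1.0
        while f(x - alpha * d) > f(x) - c1 * alpha * (g @ d):
            alpha *= 0.5                       # backtrack until f decreases sufficiently
        x = x - alpha * d                      # x_{k+1} = x_k - alpha_k d_k
    return x

f = lambda x: np.exp(x[0]) - x[0] + x[1] ** 2
grad = lambda x: np.array([np.exp(x[0]) - 1.0, 2.0 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]])
print(damped_newton(f, grad, hess, x0=[3.0, 2.0]))   # approx [0. 0.]
```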
Interpretation of the damped Newton iteration:

1. For a given function f(x), define h(y) = f(Ty). Then the gradients and Hessians are
   related as follows:
   ∇h(y) = T^T ∇f(Ty),    H(y) = T^T F(Ty) T

2. The gradient direction of h, expressed in the x-coordinate system, is given by
   d = T T^T ∇f(Ty).

3. Let T = F^{-1/2}(x_k) and d_k = F^{-1}(x_k) ∇f(x_k). Then the iteration of the form
   x_{k+1} = x_k − α_k d_k is a steepest descent search on the transformed function
   h(y) = f(F^{-1/2}(x_k) y), with the variable represented in the untransformed
   coordinate system.

4. Local convergence is therefore determined by the extreme eigenvalues of
   H(y) = T^T F(Ty) T within the neighborhood.
Using the convergence of the SD method for the damped Newton search

For exact line search:
h(x_{k+1}) − h* ≤ (1 − a/A) ( h(x_k) − h* )

For line search with Armijo's rule:
h(x_k + α_k d_k) ≤ h(x_k) + c_1 α_k ∇h(x_k)^T d_k
h(x_k + η α_k d_k) > h(x_k) + c_1 η α_k ∇h(x_k)^T d_k

h(x_{k+1}) − h* ≤ [ 1 − 2 c_1 a / (η A) ] ( h(x_k) − h* )

Here a and A are now the smallest and largest eigenvalues of F^{-1/2}(x″) F(x′) F^{-1/2}(x″),
where x′ and x″ are any two points within the neighborhood of interest.
Marquardt-Levenberg modification:

When F(x_k) is not positive definite, use
x_{k+1} = x_k − α_k [F(x_k) + μ_k I]^{-1} ∇f(x_k),
where μ_k is chosen in such a way that F(x_k) + μ_k I is positive definite.
To choose a proper μ_k, one needs to know the extreme eigenvalues of F(x_k).
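A sketch of one simple way to pick such a shift, increasing μ until a Cholesky factorization succeeds; the indefinite Hessian, the starting μ, and the growth factor are illustrative assumptions:

```python
import numpy as np

def lm_direction(H, g, mu0=1e-3):
    """Return d = (F + mu I)^{-1} g, increasing mu until F + mu I is positive definite."""
    n = H.shape[0]
    mu = mu0
    while True:
        try:
            np.linalg.cholesky(H + mu * np.eye(n))   # raises LinAlgError unless PD
            return np.linalg.solve(H + mu * np.eye(n), g), mu
        except np.linalg.LinAlgError:
            mu *= 10.0                               # larger shift and retry

H = np.array([[1.0, 0.0], [0.0, -2.0]])   # indefinite Hessian (illustrative)
g = np.array([1.0, 1.0])
d, mu = lm_direction(H, g)
print(mu, d)                              # mu = 10.0 makes the shifted matrix PD here
```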

Preconditioned steepest descent search

The iteration of the form x_{k+1} = x_k − α_k P ∇f(x_k), where P is a global approximation
of F^{-1}(x) within the domain under consideration, is called the preconditioned steepest
descent method.

The convergence rate is determined by the extreme eigenvalues of P^{1/2} F(x) P^{1/2}
within the domain under consideration.
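A minimal sketch of this preconditioned iteration, using a fixed step length for simplicity; the diagonal quadratic, the choice of P, and the starting point are illustrative assumptions:

```python
import numpy as np

def preconditioned_sd(grad, P, x0, alpha=1.0, tol=1e-8, max_iter=1000):
    """Preconditioned steepest descent x_{k+1} = x_k - alpha_k P grad f(x_k),
    here with a fixed step length for simplicity."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * (P @ g)
    return x

# Ill-conditioned quadratic f(x) = 0.5 x^T Q x with minimizer 0
Q = np.diag([1.0, 100.0])
grad = lambda x: Q @ x
P = np.diag(1.0 / np.diag(Q))             # P ~ Q^{-1}, so P^{1/2} Q P^{1/2} = I
print(preconditioned_sd(grad, P, x0=[1.0, 1.0]))   # converges in one step here
```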
