EE236C (Spring 2011-12)

2. Quasi-Newton methods

• variable metric methods

• quasi-Newton methods

• BFGS update

• limited-memory quasi-Newton methods

2-1

Newton method for unconstrained minimization

minimize f (x)

f convex, twice continuously differentiable

Newton method

x^+ = x - t \nabla^2 f(x)^{-1} \nabla f(x)

• advantages: fast convergence, affine invariance


• disadvantages: requires second derivatives and the solution of a linear equation, which can be too expensive for large-scale applications

Quasi-Newton methods 2-2


Variable metric methods

x^+ = x - t H^{-1} \nabla f(x)

H ≻ 0 is an approximation of the Hessian at x, chosen to:

• avoid calculation of second derivatives


• simplify computation of search direction

‘variable metric’ interpretation (EE236B, lecture 10, page 11)

\Delta x = -H^{-1} \nabla f(x)

is the steepest descent direction at x for the quadratic norm

\|z\|_H = (z^T H z)^{1/2}

Quasi-Newton methods 2-3
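to make the interpretation concrete, a minimal numpy sketch (not from the slides; the gradient g and the matrix H are random test data): among all directions v of equal H-norm, -H^{-1}g gives the most negative slope g^T v

import numpy as np

rng = np.random.default_rng(0)
n = 5

g = rng.standard_normal(n)                # plays the role of ∇f(x) (test data)
B = rng.standard_normal((n, n))
H = B @ B.T + np.eye(n)                   # H ≻ 0

dx = -np.linalg.solve(H, g)               # ∆x = -H^{-1} g
norm_H = lambda v: np.sqrt(v @ H @ v)     # ||v||_H = (v^T H v)^{1/2}

# ∆x gives the most negative slope g^T v per unit ||v||_H
best = g @ dx / norm_H(dx)
for _ in range(1000):
    v = rng.standard_normal(n)
    assert g @ v / norm_H(v) >= best - 1e-9
print("no random direction improves on -H^{-1} g per unit ||.||_H")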


Quasi-Newton methods

given starting point x^{(0)} ∈ dom f, H_0 ≻ 0

for k = 1, 2, . . ., until a stopping criterion is satisfied

1. compute quasi-Newton direction \Delta x = -H_{k-1}^{-1} \nabla f(x^{(k-1)})
2. determine step size t (e.g., by backtracking line search)
3. compute x^{(k)} = x^{(k-1)} + t \Delta x
4. compute H_k

• different methods use different rules for updating H in step 4


• can also propagate H_k^{-1} to simplify calculation of ∆x

Quasi-Newton methods 2-4
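a minimal numpy sketch of this outline (an illustration, not the slides' code): step 4 uses the inverse BFGS update from the next page, and the line-search constants and tolerances are arbitrary choices

import numpy as np

def quasi_newton(f, grad, x0, tol=1e-8, max_iter=200, alpha=0.01, beta=0.5):
    """Quasi-Newton method propagating the inverse approximation H_k^{-1}."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    Hinv = np.eye(n)                          # H_0 = I
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:          # stopping criterion
            break
        dx = -Hinv @ g                        # 1. quasi-Newton direction
        t = 1.0                               # 2. backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x_new = x + t * dx                    # 3. update the iterate
        s, y = x_new - x, grad(x_new) - g     # 4. inverse BFGS update of H_k^{-1}
        rho = 1.0 / (y @ s)
        V = np.eye(n) - rho * np.outer(s, y)
        Hinv = V @ Hinv @ V.T + rho * np.outer(s, s)
        x = x_new
    return x

calling it on a strictly convex quadratic (f(x) = (1/2) x^T Q x - b^T x, ∇f(x) = Qx - b) recovers the solution of Qx = b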


Broyden-Fletcher-Goldfarb-Shanno (BFGS) update

BFGS update

H_k = H_{k-1} + \frac{y y^T}{y^T s} - \frac{H_{k-1} s s^T H_{k-1}}{s^T H_{k-1} s}

where

s = x^{(k)} - x^{(k-1)}, \qquad y = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})

inverse update

H_k^{-1} = \left(I - \frac{s y^T}{y^T s}\right) H_{k-1}^{-1} \left(I - \frac{y s^T}{y^T s}\right) + \frac{s s^T}{y^T s}

• note that y^T s > 0 for strictly convex f; see page 1-11

• cost of update or inverse update is O(n^2) operations

Quasi-Newton methods 2-5
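a direct transcription of the two formulas (a sketch; the random test data assumes the curvature condition y^T s > 0, forced here by a sign flip)

import numpy as np

def bfgs_update(H, s, y):
    """BFGS update H_{k-1} -> H_k of the Hessian approximation."""
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

def bfgs_inverse_update(Hinv, s, y):
    """Equivalent update H_{k-1}^{-1} -> H_k^{-1} of the inverse approximation."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ Hinv @ V.T + rho * np.outer(s, s)

# consistency check on random data: the two updates stay inverses of each other
rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)                  # H_{k-1} ≻ 0
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s <= 0:
    y = -y                               # enforce y^T s > 0
assert np.allclose(bfgs_update(H, s, y) @ bfgs_inverse_update(np.linalg.inv(H), s, y),
                   np.eye(n))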


Positive definiteness

if y^T s > 0, the BFGS update preserves positive definiteness of H_k

proof: from inverse update formula,

v^T H_k^{-1} v = \left(v - \frac{s^T v}{s^T y}\, y\right)^T H_{k-1}^{-1} \left(v - \frac{s^T v}{s^T y}\, y\right) + \frac{(s^T v)^2}{y^T s}

• if H_{k-1} ≻ 0, both terms are nonnegative for all v

• second term is zero only if s^T v = 0; then first term is zero only if v = 0

this ensures that ∆x = -H_k^{-1} \nabla f(x^{(k)}) is a descent direction

Quasi-Newton methods 2-6
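a numerical sanity check of this fact (a sketch; y is generated as y = M s for a fixed M ≻ 0, which mimics a quadratic objective and guarantees y^T s > 0)

import numpy as np

def bfgs_update(H, s, y):
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n)); H = A @ A.T + np.eye(n)   # H_0 ≻ 0
C = rng.standard_normal((n, n)); M = C @ C.T + np.eye(n)   # fixed M ≻ 0 (test data)

for _ in range(100):
    s = rng.standard_normal(n)
    y = M @ s                                 # then y^T s = s^T M s > 0
    H = bfgs_update(H, s, y)
    assert np.linalg.eigvalsh(H).min() > 0    # H_k stays positive definite
print("positive definiteness preserved over 100 BFGS updates")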


Secant condition

BFGS update satisfies the secant condition H_k s = y, i.e.,

H_k (x^{(k)} - x^{(k-1)}) = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})

interpretation: define the second-order approximation at x^{(k)}

f_{quad}(z) = f(x^{(k)}) + \nabla f(x^{(k)})^T (z - x^{(k)}) + \frac{1}{2} (z - x^{(k)})^T H_k (z - x^{(k)})

secant condition implies that the gradient of f_{quad} agrees with the gradient of f at x^{(k-1)}:

\nabla f_{quad}(x^{(k-1)}) = \nabla f(x^{(k)}) + H_k (x^{(k-1)} - x^{(k)}) = \nabla f(x^{(k-1)})

Quasi-Newton methods 2-7
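the secant condition is easy to confirm numerically (a self-contained sketch on random data)

import numpy as np

def bfgs_update(H, s, y):
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

rng = np.random.default_rng(3)
n = 7
A = rng.standard_normal((n, n))
H_prev = A @ A.T + np.eye(n)            # H_{k-1}
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s <= 0:
    y = -y                              # keep y^T s > 0

H_k = bfgs_update(H_prev, s, y)
assert np.allclose(H_k @ s, y)          # secant condition H_k s = y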


secant method
for f : R → R, BFGS with unit step size gives the secant method

x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{H_k}, \qquad H_k = \frac{f'(x^{(k)}) - f'(x^{(k-1)})}{x^{(k)} - x^{(k-1)}}

[figure: one secant iteration, showing f'(z), f_{quad}(z), and the points x^{(k-1)}, x^{(k)}, x^{(k+1)}]

Quasi-Newton methods 2-8
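a one-dimensional illustration (a sketch; the test function f(x) = e^x - 2x and the two starting points are arbitrary assumptions, chosen so the minimizer log 2 is easy to check)

import math

def secant_method(fprime, x_prev, x, tol=1e-12, max_iter=100):
    """Quasi-Newton with unit step size for f: R -> R (the secant method)."""
    for _ in range(max_iter):
        Hk = (fprime(x) - fprime(x_prev)) / (x - x_prev)   # secant approximation of f''
        x_prev, x = x, x - fprime(x) / Hk                   # x^{(k+1)} = x^{(k)} - f'(x^{(k)}) / H_k
        if abs(x - x_prev) < tol:
            break
    return x

# minimize f(x) = exp(x) - 2x, whose minimizer is x* = log 2
fprime = lambda x: math.exp(x) - 2.0
print(secant_method(fprime, 0.0, 1.0), math.log(2.0))   # both ≈ 0.6931...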


Convergence

global result

if f is strongly convex, BFGS with backtracking line search (EE236B, lecture 10-6) converges from any x^{(0)}, H_0 ≻ 0

local convergence

if f is strongly convex and \nabla^2 f(x) is Lipschitz continuous, local convergence is superlinear: for sufficiently large k,

\|x^{(k+1)} - x^\star\|_2 \le c_k \|x^{(k)} - x^\star\|_2 \to 0

where c_k \to 0 (cf. quadratic local convergence of the Newton method)

Quasi-Newton methods 2-9


Example
minimize \quad c^T x - \sum_{i=1}^{m} \log(b_i - a_i^T x)

n = 100, m = 500
[figure: f(x^{(k)}) - f^\star versus k (log scale, from 10^2 down to 10^{-12}); Newton panel with k = 0, . . . , 9 and BFGS panel with k = 0, . . . , 140]

cost per Newton iteration: O(n^3) plus computing \nabla^2 f(x)

cost per BFGS iteration: O(n^2)

Quasi-Newton methods 2-10
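a sketch in the spirit of this example (the data a_i, b_i, c are randomly generated here, since the slide specifies only the problem family and the sizes n = 100, m = 500, so iteration counts will differ from the plots); it reuses the inverse BFGS update, with a backtracking search that also backs off out of the domain boundary

import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 500
A = rng.standard_normal((m, n))          # rows are a_i^T (random test data)
b = rng.uniform(1.0, 2.0, size=m)        # b > 0, so x = 0 is strictly feasible
c = rng.standard_normal(n)

def f(x):
    r = b - A @ x
    return np.inf if np.any(r <= 0) else c @ x - np.log(r).sum()

def grad(x):
    return c + A.T @ (1.0 / (b - A @ x))

x, Hinv = np.zeros(n), np.eye(n)         # x^{(0)} = 0, H_0 = I
for k in range(500):
    g = grad(x)
    if np.linalg.norm(g) < 1e-8:
        break
    dx = -Hinv @ g
    t = 1.0
    while f(x + t * dx) > f(x) + 0.01 * t * (g @ dx) and t > 1e-12:
        t *= 0.5                         # backtracking (also handles infeasible trial points)
    x_new = x + t * dx
    s, y = x_new - x, grad(x_new) - g    # y^T s > 0 since f is strictly convex
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(s, y)
    Hinv = V @ Hinv @ V.T + rho * np.outer(s, s)
    x = x_new

print("BFGS iterations:", k, " ||grad||:", np.linalg.norm(grad(x)))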


Square root BFGS update

to improve numerical stability, can propagate H_k in factored form

if H_{k-1} = L_{k-1} L_{k-1}^T then H_k = L_k L_k^T with

L_k = L_{k-1} \left(I + \frac{(\alpha \tilde{y} - \tilde{s}) \tilde{s}^T}{\tilde{s}^T \tilde{s}}\right)

where

\tilde{y} = L_{k-1}^{-1} y, \qquad \tilde{s} = L_{k-1}^T s, \qquad \alpha = \left(\frac{\tilde{s}^T \tilde{s}}{y^T s}\right)^{1/2}

if L_{k-1} is triangular, cost of reducing L_k to triangular form is O(n^2)

Quasi-Newton methods 2-11
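transcribing the factored update (a sketch; it checks on random data that L_k L_k^T reproduces the ordinary BFGS update of H_{k-1} = L_{k-1} L_{k-1}^T, and leaves out the final re-triangularization of L_k)

import numpy as np

def bfgs_update(H, s, y):
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

def sqrt_bfgs_update(L, s, y):
    """Update the factor L (H_{k-1} = L L^T); returns L_k, not yet re-triangularized."""
    yt = np.linalg.solve(L, y)               # ỹ = L_{k-1}^{-1} y
    st = L.T @ s                             # s̃ = L_{k-1}^T s
    alpha = np.sqrt((st @ st) / (y @ s))     # α = (s̃^T s̃ / y^T s)^{1/2}
    return L @ (np.eye(len(s)) + np.outer(alpha * yt - st, st) / (st @ st))

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))
L = np.linalg.cholesky(A @ A.T + np.eye(n))  # triangular factor of H_{k-1}
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s <= 0:
    y = -y

Lk = sqrt_bfgs_update(L, s, y)
assert np.allclose(Lk @ Lk.T, bfgs_update(L @ L.T, s, y))

in a practical implementation L_k would then be restored to triangular form (e.g., by orthogonal Givens transformations) at the O(n^2) cost quoted above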


Optimality of BFGS update

X = H_k solves the convex optimization problem

minimize \quad tr(H_{k-1}^{-1} X) - \log\det(H_{k-1}^{-1} X) - n
subject to \quad Xs = y

• cost function is nonnegative, equal to zero only if X = H_{k-1}

• also known as relative entropy between densities N(0, X), N(0, H_{k-1})

optimality result follows from KKT conditions: X = H_k satisfies

X^{-1} = H_{k-1}^{-1} - \frac{1}{2}(s \nu^T + \nu s^T), \qquad Xs = y, \qquad X ≻ 0

with

\nu = \frac{1}{s^T y} \left(2 H_{k-1}^{-1} y - \left(1 + \frac{y^T H_{k-1}^{-1} y}{y^T s}\right) s\right)

Quasi-Newton methods 2-12
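a numerical check of this variational characterization (a sketch on random data): the BFGS H_k is feasible for the problem above, and feasible perturbations X = H_k + E with E symmetric and Es = 0 never decrease the objective

import numpy as np

def bfgs_update(H, s, y):
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

rng = np.random.default_rng(5)
n = 6
A = rng.standard_normal((n, n)); H_prev = A @ A.T + np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s <= 0:
    y = -y

Hinv_prev = np.linalg.inv(H_prev)
obj = lambda X: np.trace(Hinv_prev @ X) - np.linalg.slogdet(Hinv_prev @ X)[1] - n

Hk = bfgs_update(H_prev, s, y)
assert np.allclose(Hk @ s, y)                  # feasibility: Xs = y
assert obj(Hk) >= -1e-12                       # objective is nonnegative

P = np.eye(n) - np.outer(s, s) / (s @ s)       # projector onto {v : s^T v = 0}
for _ in range(200):
    M = rng.standard_normal((n, n))
    X = Hk + 1e-3 * P @ (M + M.T) @ P          # feasible perturbation: X s = y
    if np.linalg.eigvalsh(X).min() > 0:        # compare only against X ≻ 0
        assert obj(X) >= obj(Hk) - 1e-9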


Davidon-Fletcher-Powell (DFP) update

switch H_{k-1} and X in the objective on the previous page

minimize \quad tr(H_{k-1} X^{-1}) - \log\det(H_{k-1} X^{-1}) - n
subject to \quad Xs = y

• minimize relative entropy between N(0, H_{k-1}) and N(0, X)

• problem is convex in X^{-1} (with constraint written as s = X^{-1} y)
• solution is ‘dual’ of BFGS formula

H_k = \left(I - \frac{y s^T}{s^T y}\right) H_{k-1} \left(I - \frac{s y^T}{s^T y}\right) + \frac{y y^T}{s^T y}

(known as DFP update)

pre-dates BFGS update, but is less often used

Quasi-Newton methods 2-13
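a transcription of the DFP formula (a sketch; the random-data check confirms that DFP also satisfies the secant condition, and illustrates the 'dual' relationship: applying DFP to H_{k-1} matches applying BFGS to H_{k-1}^{-1} with the roles of s and y interchanged)

import numpy as np

def dfp_update(H, s, y):
    """DFP update H_{k-1} -> H_k."""
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V @ H @ V.T + rho * np.outer(y, y)

def bfgs_update(H, s, y):
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

rng = np.random.default_rng(6)
n = 6
A = rng.standard_normal((n, n)); H = A @ A.T + np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s <= 0:
    y = -y

Hk = dfp_update(H, s, y)
assert np.allclose(Hk @ s, y)        # DFP also satisfies the secant condition

# duality with BFGS: invert, and swap the roles of s and y
assert np.allclose(np.linalg.inv(Hk), bfgs_update(np.linalg.inv(H), y, s))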


Limited memory quasi-Newton methods

main disadvantage of quasi-Newton methods is the need to store H_k or H_k^{-1}

limited-memory BFGS (L-BFGS): do not store H_k^{-1} explicitly

• instead we store the m (e.g., m = 30) most recent values of

s_j = x^{(j)} - x^{(j-1)}, \qquad y_j = \nabla f(x^{(j)}) - \nabla f(x^{(j-1)})

• we evaluate ∆x = -H_k^{-1} \nabla f(x^{(k)}) recursively, using

H_j^{-1} = \left(I - \frac{s_j y_j^T}{y_j^T s_j}\right) H_{j-1}^{-1} \left(I - \frac{y_j s_j^T}{y_j^T s_j}\right) + \frac{s_j s_j^T}{y_j^T s_j}

for j = k, k - 1, . . . , k - m + 1, assuming, for example, H_{k-m}^{-1} = I
• cost per iteration is O(nm); storage is O(nm)

Quasi-Newton methods 2-14
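a matrix-free sketch of this recursion (the helper name and the consistency check against a dense H_k^{-1} are illustrative assumptions): the stored pairs are applied newest-first, bottoming out at H_{k-m}^{-1} = I

import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """Return ∆x = -H_k^{-1} ∇f(x^{(k)}) using only the stored pairs (s_j, y_j)."""
    def apply_Hinv(j, v):
        if j < 0:                                       # base case H_{k-m}^{-1} = I
            return v
        s, y = s_list[j], y_list[j]
        rho = 1.0 / (y @ s)
        u = apply_Hinv(j - 1, v - rho * (s @ v) * y)    # H_{j-1}^{-1} (I - ρ_j y_j s_j^T) v
        return u - rho * (y @ u) * s + rho * (s @ v) * s
    return -apply_Hinv(len(s_list) - 1, grad_k)

# check against the dense inverse BFGS update, built from the same pairs
rng = np.random.default_rng(7)
n, m = 5, 3
s_list, y_list, Hinv = [], [], np.eye(n)
for _ in range(m):
    s, y = rng.standard_normal(n), rng.standard_normal(n)
    if y @ s <= 0:
        y = -y
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(s, y)
    Hinv = V @ Hinv @ V.T + rho * np.outer(s, s)
    s_list.append(s); y_list.append(y)

g = rng.standard_normal(n)
assert np.allclose(lbfgs_direction(g, s_list, y_list), -Hinv @ g)

each call costs O(nm); in practice the same computation is usually organized iteratively as the L-BFGS two-loop recursion described in Nocedal and Wright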


References

• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chapters 6 and 7

• J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1983)

Quasi-Newton methods 2-15
