Lecture 15: BFGS and SR1
It should be noted that quasi-Newton methods are different from modified-Hessian methods: quasi-Newton methods do not work with the exact Hessian matrix at all, but instead maintain an approximation to it.
Newton's method requires us to solve for the Newton step p = −[∇2 f(xk)]−1 ∇f(xk). In the general case this is not a cheap operation (it means solving a linear system).
On the other hand, Newton's method has quadratic local convergence, which is much better than linear convergence. Quasi-Newton methods are a compromise between convergence rate and per-iteration cost.
Suppose we have the model function at xk (as in Newton's method or the trust-region method)

mk(p) = f(xk) + pT ∇f(xk) + (1/2) pT Bk p    (15.1)
Here Bk is an approximation to the Hessian of f. But how do we get Bk+1 from Bk without computing the Hessian exactly?
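As a quick illustration (a minimal Python sketch; the function name and test data are ours, not from the lecture), the model (15.1) is cheap to evaluate, and for positive definite Bk its minimizer is exactly the quasi-Newton step p = −Bk^{−1} ∇f(xk):

    import numpy as np

    def model(p, f_x, grad_x, B):
        """Quadratic model (15.1): m_k(p) = f(x_k) + p^T grad f(x_k) + (1/2) p^T B_k p."""
        return f_x + p @ grad_x + 0.5 * p @ B @ p

    # For positive definite B, minimizing the model gives the (quasi-)Newton step.
    B = np.array([[2.0, 0.5], [0.5, 1.0]])
    g = np.array([1.0, -1.0])
    p_star = -np.linalg.solve(B, g)   # argmin of the model over p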
After one step, the next model is

mk+1(p) = f(xk+1) + pT ∇fk+1 + (1/2) pT Bk+1 p    (15.2)
where xk+1 = xk + pk. The quasi-Newton idea is that the approximation Bk+1 should satisfy the following condition (the secant condition): the gradient of the model function mk+1 should match the gradient of f at both xk and xk+1. The match at xk+1 is automatic, since ∇mk+1(0) = ∇fk+1; the match at xk means

∇mk+1(−pk) = ∇fk+1 − Bk+1 pk = ∇fk    (15.3)

so Bk+1 pk = ∇fk+1 − ∇fk. We denote

sk = xk+1 − xk,    yk = ∇fk+1 − ∇fk    (15.4)
then (the secant equation)
Bk+1 sk = yk (15.5)
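For a quick sanity check (our own numerical sketch, not from the lecture): if f is quadratic with Hessian A, then yk = A sk holds exactly, so the true Hessian satisfies the secant equation (15.5):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    A = A @ A.T + 4 * np.eye(4)            # a symmetric positive definite Hessian
    b = rng.standard_normal(4)
    grad = lambda x: A @ x + b             # gradient of f(x) = (1/2) x^T A x + b^T x

    x_k, x_k1 = rng.standard_normal(4), rng.standard_normal(4)
    s_k, y_k = x_k1 - x_k, grad(x_k1) - grad(x_k)
    print(np.allclose(A @ s_k, y_k))       # True: A satisfies the secant equation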
This (15.5) is the condition that our Bk+1 should satisfy! Intuitively, in the one-dimensional case we can informally write it as

Bk+1 = (∇fk+1 − ∇fk) / (xk+1 − xk)    (15.6)

The right-hand side is a difference quotient of the gradient, i.e. it is "like" the Hessian. But the single equation (15.5) cannot determine Bk+1 uniquely: it imposes only n conditions on the n(n+1)/2 free entries of a symmetric matrix. For many problems, we also want Bk+1 to be positive definite, to make sure the resulting direction is a descent direction.
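To make the non-uniqueness concrete (a toy 2-D example of ours): the secant equation fixes only how Bk+1 acts on sk, so symmetric matrices differing off that direction all satisfy it:

    import numpy as np

    s = np.array([1.0, 0.0])
    y = np.array([2.0, 0.0])
    # Both matrices are symmetric, positive definite, and satisfy B s = y;
    # the secant equation says nothing about the (2, 2) entry.
    B1 = np.array([[2.0, 0.0], [0.0, 1.0]])
    B2 = np.array([[2.0, 0.0], [0.0, 5.0]])
    print(np.allclose(B1 @ s, y), np.allclose(B2 @ s, y))  # True True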
15.1.2 SR1
The SR1 (symmetric rank-1) method updates Bk by a rank-1 correction,

Bk+1 = Bk + σvvT    (15.7)

where σ = 1 or −1, and σ, v are chosen so that Bk+1 satisfies the secant equation yk = Bk+1 sk. The name rank-1 comes from the fact that vvT is a rank-1 matrix. Substituting the update into the secant equation, we compute

yk = Bk sk + [σvT sk]v    (15.8)
so v must be parallel to yk − Bk sk. Writing

v = a(yk − Bk sk)    (15.9)

and substituting into (15.8) gives σa2 [skT(yk − Bk sk)] = 1, which forces

σ = sign(skT (yk − Bk sk)),    a = ±|skT (yk − Bk sk)|^{−1/2}    (15.10)

Plugging σ and v back into (15.7) gives

Bk+1 = Bk + (yk − Bk sk)(yk − Bk sk)T / ((yk − Bk sk)T sk)    (15.11)
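A minimal sketch of the SR1 update (15.11) in Python (function name ours); the sanity check confirms the updated matrix satisfies the secant equation:

    import numpy as np

    def sr1_update(B, s, y):
        """SR1 update (15.11): B+ = B + r r^T / (r^T s), where r = y - B s."""
        r = y - B @ s
        return B + np.outer(r, r) / (r @ s)

    rng = np.random.default_rng(1)
    B = np.eye(3)
    s, y = rng.standard_normal(3), rng.standard_normal(3)
    print(np.allclose(sr1_update(B, s, y) @ s, y))  # True: secant equation holds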
Since we actually need the inverse Hk = Bk^{−1} to compute the step, the Sherman-Morrison formula (see (A.27) in the book) lets us invert this rank-1 update cheaply, giving

Hk+1 = Hk + (sk − Hk yk)(sk − Hk yk)T / ((sk − Hk yk)T yk)    (15.12)
We can also derive this formula directly by positing a rank-1 update for Hk+1, as we did for Bk+1. However, there are two issues with this method: the denominator (sk − Hk yk)T yk can be zero or very close to zero, in which case the update breaks down or becomes numerically unstable; and the update does not preserve positive definiteness, so the resulting direction may fail to be a descent direction.
For the first issue, we can set a rule to skip the iteration: if |ykT (sk − Hk yk)| < r ∥yk∥ ∥sk − Hk yk∥ for some small r, say r = 10−8, we skip the iteration by setting Hk+1 = Hk; otherwise the denominator is not too small, and we can still use the update formula.
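In code, the safeguarded inverse update might look as follows (a sketch; we use the relative test above with threshold r = 10−8):

    import numpy as np

    def sr1_h_update(H, s, y, r=1e-8):
        """SR1 update (15.12) for H = B^{-1}, with the skipping rule:
        if the denominator is tiny relative to its factors, keep H unchanged."""
        v = s - H @ y
        denom = v @ y
        if abs(denom) < r * np.linalg.norm(y) * np.linalg.norm(v):
            return H                      # skip the iteration: H_{k+1} = H_k
        return H + np.outer(v, v) / denom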
15.1.3 BFGS
The famous quasi-Newton method BFGS is named after four distinguished mathematicians: Broyden, Fletcher, Goldfarb, and Shanno. The idea is similar to the SR1 method above, but instead of a rank-1 update we use a rank-2 update formula. So BFGS uses an update of the form

Bk+1 = Bk + auuT + bvvT    (15.14)
Multiplying by sk and imposing the secant equation (15.5),

yk = Bk sk + au(uT sk) + bv(vT sk)    (15.15)

which means

a(uT sk)u + b(vT sk)v = yk − Bk sk    (15.16)
Here we actually have multiple choices for the vectors u and v, but BFGS takes u = yk and v = Bk sk, to match the two terms on the right-hand side. Matching coefficients gives a(ykT sk) = 1 and b(skT Bk sk) = −1, i.e. a = 1/(ykT sk) and b = −1/(skT Bk sk). Then we must have
Bk+1 = Bk + yk ykT/(ykT sk) − Bk sk skT Bk/(skT Bk sk)    (BFGS)
Bk+1 = (I − ρk yk skT) Bk (I − ρk sk ykT) + ρk yk ykT    (DFP)    (15.18)

where ρk = 1/(skT yk). Note that the two lines of (15.18) are two different updates (BFGS and DFP respectively), not equal to each other. Using the relation Hk+1 yk = sk (the secant equation for the inverse), we get the corresponding rank-2 updates for Hk+1 = Bk+1^{−1}:
Hk+1 = Hk + sk skT/(skT yk) − Hk yk ykT Hk/(ykT Hk yk)    (DFP)
Hk+1 = (I − ρk sk ykT) Hk (I − ρk yk skT) + ρk sk skT    (BFGS)    (15.19)
The latter one is BFGS. Now that we have the updating formula, what should the initial value H0 be? It is quite difficult to come up with a good choice unless we compute the Hessian (or its inverse) explicitly; in practice we often simply set H0 to be the identity.
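As a sketch, the BFGS form of (15.19) translates directly into code (function name ours):

    import numpy as np

    def bfgs_h_update(H, s, y):
        """BFGS update (15.19): H+ = (I - rho s y^T) H (I - rho y s^T) + rho s s^T."""
        rho = 1.0 / (y @ s)
        V = np.eye(len(s)) - rho * np.outer(s, y)
        return V @ H @ V.T + rho * np.outer(s, s)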
The BFGS algorithm: given a starting point x0, an initial inverse approximation H0, and a tolerance, repeat: (step 2) compute the search direction pk = −Hk ∇fk; (step 3) compute xk+1 = xk + αk pk with step length αk chosen to satisfy the Wolfe conditions (important); (step 4) compute sk = xk+1 − xk, yk = ∇fk+1 − ∇fk, update Hk+1 by (15.19), set k ← k + 1, and go to step 2.
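Putting the algorithm together (a hedged sketch; the function names are ours, and we use scipy.optimize.line_search, which enforces the Wolfe conditions, in line with the remark below):

    import numpy as np
    from scipy.optimize import line_search

    def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
        """Sketch of the BFGS loop: H0 = I, Wolfe line search, update (15.19)."""
        x, H, g = x0, np.eye(len(x0)), grad(x0)
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            p = -H @ g                                    # step 2: search direction
            alpha = line_search(f, grad, x, p, gfk=g)[0]  # Wolfe step length
            if alpha is None:                             # line search failed
                break
            x_new = x + alpha * p
            g_new = grad(x_new)
            s, y = x_new - x, g_new - g
            if y @ s > 1e-10:                             # curvature holds: safe update
                rho = 1.0 / (y @ s)
                V = np.eye(len(x)) - rho * np.outer(s, y)
                H = V @ H @ V.T + rho * np.outer(s, s)
            x, g = x_new, g_new
        return x

    # Example usage on the Rosenbrock function:
    rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    rosen_grad = lambda x: np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])
    print(bfgs(rosen, rosen_grad, np.array([-1.2, 1.0])))  # approx [1., 1.]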
The step length should not be generated by a simple backtracking algorithm, since the update relies on the curvature condition: the Wolfe curvature condition guarantees ykT sk > 0, which keeps Hk+1 positive definite, while backtracking enforces only sufficient decrease. The performance may be degraded using backtracking. We can use an exact line search here instead, which also satisfies the curvature condition.