
UW-Madison CS/ISyE/Math/Stat 726 Spring 2023

Lecture 23: Limited-Memory BFGS (L-BFGS)

Yudong Chen

1 Basic ideas
Newton and quasi-Newton methods enjoy fast convergence (small number of iterations), but for
large-scale problems each iteration may be too costly.
For example, recall the quasi-Newton method $x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$ with BFGS update:
$$H_k = V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top, \tag{1}$$
where
$$\rho_k = \frac{1}{s_k^\top y_k}, \qquad V_k = I - \rho_k y_k s_k^\top, \qquad s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),$$

and the stepsize $\alpha_k$ satisfies the weak Wolfe conditions (WWC). The matrices $B_k$ and $H_k$ constructed by BFGS are often dense, even when the true Hessian is sparse. In general, BFGS requires $\Theta(d^2)$ computation per iteration and $\Theta(d^2)$ memory. For large $d$, $\Theta(d^2)$ may be too much.
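For concreteness, here is a minimal numpy sketch of one update (1) (the name `bfgs_update` and the variable names are mine, not from the lecture); expanding $V_{k-1}^\top H_{k-1} V_{k-1}$ into rank-one corrections makes the $\Theta(d^2)$ cost explicit:

```python
import numpy as np

def bfgs_update(H, s, y):
    """One dense BFGS update (1) for the inverse-Hessian estimate H.

    Written with rank-one corrections so the cost is Theta(d^2) flops,
    dominated by the single matrix-vector product H @ y; storing H
    itself takes Theta(d^2) memory, which is what L-BFGS avoids.
    """
    rho = 1.0 / (s @ y)          # rho_k = 1 / (s_k^T y_k)
    Hy = H @ y                   # the Theta(d^2) step
    # V^T H V + rho s s^T, expanded into rank-one terms:
    return (H
            - rho * (np.outer(s, Hy) + np.outer(Hy, s))
            + (rho ** 2 * (y @ Hy) + rho) * np.outer(s, s))
```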
Idea of L-BFGS: instead of storing the full matrix $H_k$ (approximation of $\nabla^2 f(x_k)^{-1}$), construct and represent $H_k$ implicitly using a small number of vectors $\{s_i, y_i\}$ from the last few iterations. Intuition: we do not expect the current Hessian to depend too much on “old” vectors $s_i, y_i$ (old iterates $x_i$ and their gradients).
Tradeoff: we reduce memory and computation to $O(md)$, linear in $d$ for a constant memory size $m$, but we may lose local superlinear convergence; we can only guarantee linear convergence in general.

2 L-BFGS
Recall and expand the BFGS update:
$$\begin{aligned}
\text{BFGS:}\quad H_k &= V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= V_{k-1}^\top V_{k-2}^\top H_{k-2} V_{k-2} V_{k-1} + \rho_{k-2} V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= \left( V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top \right) H_{k-m} \left( V_{k-m} V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m} \left( V_{k-1}^\top \cdots V_{k-m+1}^\top \right) s_{k-m} s_{k-m}^\top \left( V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m+1} \left( V_{k-1}^\top \cdots V_{k-m+2}^\top \right) s_{k-m+1} s_{k-m+1}^\top \left( V_{k-m+2} \cdots V_{k-1} \right) \\
&\quad + \cdots \\
&\quad + \rho_{k-2} V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1} \\
&\quad + \rho_{k-1} s_{k-1} s_{k-1}^\top.
\end{aligned}$$


In L-BFGS, we replace $H_{k-m}$ (a dense $d \times d$ matrix) with some sparse matrix $H_k^0$, e.g., a diagonal matrix. Thus, $H_k$ can be constructed using the most recent $m \ll d$ pairs $\{s_i, y_i\}_{i=k-m}^{k-1}$. That is,
$$\begin{aligned}
\text{L-BFGS:}\quad H_k &= \left( V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top \right) H_k^0 \left( V_{k-m} V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m} \left( V_{k-1}^\top \cdots V_{k-m+1}^\top \right) s_{k-m} s_{k-m}^\top \left( V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m+1} \left( V_{k-1}^\top \cdots V_{k-m+2}^\top \right) s_{k-m+1} s_{k-m+1}^\top \left( V_{k-m+2} \cdots V_{k-1} \right) \\
&\quad + \cdots \\
&\quad + \rho_{k-1} s_{k-1} s_{k-1}^\top.
\end{aligned}$$

In fact, we only need the $d$-dimensional vector $H_k \nabla f(x_k)$ to update $x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$. Therefore, we do not even need to compute or store the matrix $H_k$ explicitly. Instead, we only store the vectors $\{s_i, y_i\}_{i=k-m}^{k-1}$, from which $H_k \nabla f(x_k)$ can be computed using only vector-vector multiplications, thanks to tricks like $(aa^\top + bb^\top) g = a(a^\top g) + b(b^\top g)$.
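As a quick numerical sanity check of that trick (purely illustrative; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, g = rng.standard_normal((3, 5))

# matrix-free evaluation: two inner products and two scaled additions, O(d) work
v_fast = a * (a @ g) + b * (b @ g)

# explicit evaluation: forms a d x d matrix first, O(d^2) work and memory
v_slow = (np.outer(a, a) + np.outer(b, b)) @ g

assert np.allclose(v_fast, v_slow)
```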

This leads to a two-loop recursion implementation for computing $H_k \nabla f(x_k)$, stated in Algorithm 1.

Algorithm 1 L-BFGS two-loop recursion

set $q = \nabla f(x_k)$ // want to compute $H_k \nabla f(x_k)$
for $i = k-1, k-2, \ldots, k-m$ do:
    $\alpha_i \leftarrow \rho_i s_i^\top q$
    $q \leftarrow q - \alpha_i y_i$ // RHS $= q - \rho_i (s_i^\top q)\, y_i = (I - \rho_i y_i s_i^\top) q = V_i q$
$r \leftarrow H_k^0 q$
for $i = k-m$ to $k-1$:
    $\beta \leftarrow \rho_i y_i^\top r$
    $r \leftarrow r + s_i (\alpha_i - \beta)$ // RHS $= (I - \rho_i s_i y_i^\top) r + \alpha_i s_i = V_i^\top r + \alpha_i s_i$
return $r$ // which equals $H_k \nabla f(x_k)$

(Exercise) The total number of multiplications is at most $4md + \mathrm{nnz}(H_k^0) = O(md)$.
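A direct numpy transcription of Algorithm 1 might look as follows (a sketch, not the lecture's code; I assume the pairs $(s_i, y_i)$ are stored oldest-first and that $H_k^0$ is diagonal, passed as its diagonal vector):

```python
import numpy as np

def two_loop(grad, s_list, y_list, H0_diag):
    """Compute H_k @ grad via the L-BFGS two-loop recursion.

    s_list, y_list hold the m most recent pairs (s_i, y_i), oldest
    first; H0_diag is the diagonal of the (diagonal) matrix H_k^0.
    Cost: O(m d) vector operations, never forming H_k.
    """
    q = grad.copy()
    rhos = [1.0 / (s @ y) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: i = k-1, ..., k-m (newest pair first)
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y               # q <- V_i q
    r = H0_diag * q                  # r <- H_k^0 q
    # second loop: i = k-m, ..., k-1 (oldest pair first)
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * (y @ r)
        r += s * (alpha - beta)      # r <- V_i^T r + alpha_i s_i
    return r                         # equals H_k @ grad
```

For small $d$ one can check this against the dense update sketched earlier: with a single stored pair and `H0_diag = np.ones(d)`, the output matches `bfgs_update(np.eye(d), s, y) @ grad`.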
In practice:

• We often take $m$ to be a small constant independent of $d$, e.g., $3 \leq m \leq 20$.


• A popular choice for $H_k^0$ is $H_k^0 = \gamma_k I$, where $\gamma_k = \frac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}$; a sketch of this choice follows the list. It appears to be quite effective in practice. (Optional) $\frac{1}{\gamma_k}$ is an approximation of $\frac{z_k^\top \nabla^2 f(x_k) z_k}{\|z_k\|^2}$, which is the size of the true Hessian along the direction $z_k \approx \left( \nabla^2 f(x_k) \right)^{1/2} s_k$; see Section 6.1 in Nocedal-Wright.
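In the notation of the `two_loop` sketch above, the scaled-identity choice is a one-liner (again illustrative; `h0_scaling` is my name):

```python
import numpy as np

def h0_scaling(s_prev, y_prev, d):
    """Diagonal of H_k^0 = gamma_k I, with gamma_k = s^T y / y^T y
    computed from the most recent pair (s_{k-1}, y_{k-1})."""
    gamma = (s_prev @ y_prev) / (y_prev @ y_prev)
    return gamma * np.ones(d)
```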

The complete L-BFGS algorithm is given in Algorithm 2. As discussed in Lecture 21, it is important that $\alpha_k$ satisfies both parts of the Wolfe conditions: sufficient decrease and curvature.


Algorithm 2 L-BFGS
input: $x_0 \in \mathbb{R}^d$ (initial point), $m > 0$ (memory budget), $\epsilon > 0$ (convergence criterion)
$k \leftarrow 0$
repeat:

• Choose $H_k^0$

• $p_k \leftarrow -H_k \nabla f(x_k)$, where $H_k \nabla f(x_k)$ is computed using Algorithm 1

• $x_{k+1} \leftarrow x_k + \alpha_k p_k$, where $\alpha_k$ satisfies the Wolfe conditions

• if $k > m$:

  – discard $\{s_{k-m}, y_{k-m}\}$ from storage

• Compute and store $s_k \leftarrow x_{k+1} - x_k$ and $y_k \leftarrow \nabla f(x_{k+1}) - \nabla f(x_k)$

• $k \leftarrow k + 1$

until $\|\nabla f(x_k)\| \leq \epsilon$
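Putting the sketches together, here is a minimal driver in the spirit of Algorithm 2 (a sketch under stated assumptions, not the lecture's code); it reuses the hypothetical `two_loop` and `h0_scaling` helpers above and borrows `scipy.optimize.line_search`, which searches for a stepsize satisfying the (strong) Wolfe conditions:

```python
from collections import deque

import numpy as np
from scipy.optimize import line_search   # Wolfe-condition stepsize search

def lbfgs(f, grad_f, x0, m=10, eps=1e-6, max_iter=500):
    """Minimal L-BFGS driver in the spirit of Algorithm 2 (a sketch)."""
    x = x0.astype(float)
    g = grad_f(x)
    # deques with maxlen=m discard the oldest pair automatically,
    # implementing the "if k > m: discard" step of Algorithm 2
    s_hist, y_hist = deque(maxlen=m), deque(maxlen=m)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        if s_hist:
            H0 = h0_scaling(s_hist[-1], y_hist[-1], x.size)  # gamma_k I
        else:
            H0 = np.ones(x.size)                             # first step: H_k^0 = I
        p = -two_loop(g, list(s_hist), list(y_hist), H0)
        alpha = line_search(f, grad_f, x, p)[0]              # Wolfe stepsize
        if alpha is None:        # line search failed; crude fallback stepsize
            alpha = 1e-3
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s_hist.append(x_new - x)
        y_hist.append(g_new - g)
        x, g = x_new, g_new
    return x
```

For instance, `lbfgs(lambda x: x @ x, lambda x: 2.0 * x, np.ones(100))` should return a point with gradient norm below `eps`.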

Some numerical results taken from Nocedal-Wright were shown here (figures not reproduced).

3 Relationship with nonlinear conjugate gradient methods


In Lecture 13 we mentioned several ways of generalizing CG to non-quadratic functions (a.k.a. nonlinear CG), including Dai-Yuan, Fletcher-Reeves, and Polak-Ribière. The last one has a variant called Hestenes-Stiefel, which uses the search direction


$$p_{k+1} = -\nabla f(x_{k+1}) + \frac{\nabla f(x_{k+1})^\top y_k}{y_k^\top p_k} p_k = -\underbrace{\left( I - \frac{s_k y_k^\top}{y_k^\top s_k} \right)}_{=: \hat{H}_{k+1}} \nabla f(x_{k+1}), \tag{2}$$

where we recall that $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ and $s_k = x_{k+1} - x_k$.


The matrix $\hat{H}_{k+1}$ is neither symmetric nor p.d. If we try to symmetrize $\hat{H}_{k+1}$ by taking $\hat{H}_{k+1} \hat{H}_{k+1}^\top$, we end up with a matrix that does not satisfy the secant equation and is singular: indeed $\hat{H}_{k+1} s_k = 0$ and $\hat{H}_{k+1}^\top y_k = 0$, so $\hat{H}_{k+1} \hat{H}_{k+1}^\top y_k = 0 \neq s_k$.
A symmetric p.d. matrix that satisfies the secant equation is
$$\begin{aligned}
H_{k+1} &= \hat{H}_{k+1} \hat{H}_{k+1}^\top + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \left( I - \frac{s_k y_k^\top}{y_k^\top s_k} \right) I \left( I - \frac{y_k s_k^\top}{y_k^\top s_k} \right) + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \text{BFGS update (1) applied to } H_k = I.
\end{aligned}$$

Therefore, computing $H_{k+1}$ as above for the search direction $p_{k+1} = -H_{k+1} \nabla f(x_{k+1})$ can be viewed as “memoryless” BFGS, i.e., L-BFGS with $m = 1$ and $H_k^0 = I$.
Suppose we combine memoryless BFGS and exact line search:
$$\alpha_k = \operatorname*{argmin}_{\alpha \in \mathbb{R}} f(x_k + \alpha p_k).$$

For all $k$, the stepsize $\alpha_k$ satisfies
$$0 = \langle \nabla f(x_k + \alpha_k p_k), p_k \rangle = \left\langle \nabla f(x_{k+1}), \alpha_k^{-1} s_k \right\rangle,$$
hence $s_k^\top \nabla f(x_{k+1}) = 0$. It follows that

$$\begin{aligned}
p_{k+1} &= -H_{k+1} \nabla f(x_{k+1}) \\
&= -\left[ \left( I - \frac{s_k y_k^\top}{y_k^\top s_k} \right) \left( I - \frac{y_k s_k^\top}{y_k^\top s_k} \right) + \frac{s_k s_k^\top}{y_k^\top s_k} \right] \nabla f(x_{k+1}) \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top s_k} s_k \qquad \text{using } s_k^\top \nabla f(x_{k+1}) = 0 \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top p_k} p_k \qquad \text{since } s_k = \alpha_k p_k,
\end{aligned}$$
which is the same as the Hestenes-Stiefel CG update (2).
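This identity is easy to verify numerically on a strongly convex quadratic, where exact line search has a closed form (a sanity-check sketch; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)                 # s.p.d. Hessian of f
b = rng.standard_normal(d)
grad = lambda x: A @ x - b                  # f(x) = 0.5 x^T A x - b^T x

x = rng.standard_normal(d)
g = grad(x)
p = -g                                      # initial search direction
alpha = -(g @ p) / (p @ (A @ p))            # exact line search for a quadratic
x1 = x + alpha * p
g1 = grad(x1)
s, y = alpha * p, g1 - g

# memoryless BFGS: H_{k+1} = BFGS update (1) applied to I, then p = -H g
rho = 1.0 / (s @ y)
H = ((np.eye(d) - rho * np.outer(s, y))
     @ (np.eye(d) - rho * np.outer(y, s))
     + rho * np.outer(s, s))
p_bfgs = -H @ g1

# Hestenes-Stiefel direction (2)
p_hs = -g1 + ((g1 @ y) / (y @ p)) * p

assert np.allclose(p_bfgs, p_hs)
```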
