
UW-Madison CS/ISyE/Math/Stat 726 Spring 2023

Lecture 23: Limited-Memory BFGS (L-BFGS)

Yudong Chen

1 Basic ideas
Newton and quasi-Newton methods enjoy fast convergence (small number of iterations), but for
large-scale problems each iteration may be too costly.
For example, recall the quasi-Newton method $x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$ with BFGS update:
$$H_k = V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top, \tag{1}$$
where
$$\rho_k = \frac{1}{s_k^\top y_k}, \qquad V_k = I - \rho_k y_k s_k^\top, \qquad s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),$$

and the stepsize $\alpha_k$ satisfies the weak Wolfe conditions (WWC). The matrices $B_k$ and $H_k$ constructed by BFGS are often dense, even when the true Hessian is sparse. In general, BFGS requires $\Theta(d^2)$ computation per iteration and $\Theta(d^2)$ memory. For large $d$, $\Theta(d^2)$ may be too much.
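For concreteness, here is a minimal numpy sketch of one update (1) (the name `bfgs_update` and the variable names are mine, not from the lecture); expanding $V_{k-1}^\top H_{k-1} V_{k-1}$ into rank-one corrections makes the $\Theta(d^2)$ cost explicit:

```python
import numpy as np

def bfgs_update(H, s, y):
    """One dense BFGS update (1) for the inverse-Hessian estimate H.

    Written with rank-one corrections so the cost is Theta(d^2) flops,
    dominated by the single matrix-vector product H @ y; storing H
    itself takes Theta(d^2) memory, which is what L-BFGS avoids.
    """
    rho = 1.0 / (s @ y)          # rho_k = 1 / (s_k^T y_k)
    Hy = H @ y                   # the Theta(d^2) step
    # V^T H V + rho s s^T, expanded into rank-one terms:
    return (H
            - rho * (np.outer(s, Hy) + np.outer(Hy, s))
            + (rho ** 2 * (y @ Hy) + rho) * np.outer(s, s))
```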
Idea of L-BFGS: instead of storing the full matrix $H_k$ (approximation of $\nabla^2 f(x_k)^{-1}$), construct and represent $H_k$ implicitly using a small number of vectors $\{s_i, y_i\}$ from the last few iterations. Intuition: we do not expect the current Hessian to depend too much on “old” vectors $s_i, y_i$ (old iterates $x_i$ and their gradients).
Tradeoff: we reduce memory and computation to $O(md)$, linear in $d$ for a constant memory size $m$, but we may lose local superlinear convergence; we can only guarantee linear convergence in general.

2 L-BFGS
Recall and expand the BFGS update:
$$\begin{aligned}
\text{BFGS:}\quad H_k &= V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= V_{k-1}^\top V_{k-2}^\top H_{k-2} V_{k-2} V_{k-1} + \rho_{k-2} V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= \left( V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top \right) H_{k-m} \left( V_{k-m} V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m} \left( V_{k-1}^\top \cdots V_{k-m+1}^\top \right) s_{k-m} s_{k-m}^\top \left( V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m+1} \left( V_{k-1}^\top \cdots V_{k-m+2}^\top \right) s_{k-m+1} s_{k-m+1}^\top \left( V_{k-m+2} \cdots V_{k-1} \right) \\
&\quad + \cdots \\
&\quad + \rho_{k-2} V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1} \\
&\quad + \rho_{k-1} s_{k-1} s_{k-1}^\top.
\end{aligned}$$


In L-BFGS, we replace $H_{k-m}$ (a dense $d \times d$ matrix) with some sparse matrix $H_k^0$, e.g., a diagonal matrix. Thus, $H_k$ can be constructed using the most recent $m \ll d$ pairs $\{s_i, y_i\}_{i=k-m}^{k-1}$. That is,
$$\begin{aligned}
\text{L-BFGS:}\quad H_k &= \left( V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top \right) H_k^0 \left( V_{k-m} V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m} \left( V_{k-1}^\top \cdots V_{k-m+1}^\top \right) s_{k-m} s_{k-m}^\top \left( V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m+1} \left( V_{k-1}^\top \cdots V_{k-m+2}^\top \right) s_{k-m+1} s_{k-m+1}^\top \left( V_{k-m+2} \cdots V_{k-1} \right) \\
&\quad + \cdots \\
&\quad + \rho_{k-1} s_{k-1} s_{k-1}^\top.
\end{aligned}$$

In fact, we only need the $d$-dimensional vector $H_k \nabla f(x_k)$ to update $x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$. Therefore, we do not even need to compute or store the matrix $H_k$ explicitly. Instead, we only store the vectors $\{s_i, y_i\}_{i=k-m}^{k-1}$, from which $H_k \nabla f(x_k)$ can be computed using only vector-vector multiplications, thanks to tricks like $(aa^\top + bb^\top) g = a(a^\top g) + b(b^\top g)$.
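As a quick numerical sanity check of that trick (purely illustrative; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, g = rng.standard_normal((3, 5))

# matrix-free evaluation: two inner products and two scaled additions, O(d) work
v_fast = a * (a @ g) + b * (b @ g)

# explicit evaluation: forms a d x d matrix first, O(d^2) work and memory
v_slow = (np.outer(a, a) + np.outer(b, b)) @ g

assert np.allclose(v_fast, v_slow)
```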

This leads to a two-loop recursion implementation for computing $H_k \nabla f(x_k)$, stated in Algorithm 1.

Algorithm 1 L-BFGS two-loop recursion

set $q = \nabla f(x_k)$ // want to compute $H_k \nabla f(x_k)$
for $i = k-1, k-2, \ldots, k-m$ do:
    $\alpha_i \leftarrow \rho_i s_i^\top q$
    $q \leftarrow q - \alpha_i y_i$ // RHS $= q - \rho_i (s_i^\top q)\, y_i = (I - \rho_i y_i s_i^\top) q = V_i q$
$r \leftarrow H_k^0 q$
for $i = k-m$ to $k-1$:
    $\beta \leftarrow \rho_i y_i^\top r$
    $r \leftarrow r + s_i (\alpha_i - \beta)$ // RHS $= (I - \rho_i s_i y_i^\top) r + \alpha_i s_i = V_i^\top r + \alpha_i s_i$
return $r$ // which equals $H_k \nabla f(x_k)$

(Exercise) The total number of multiplications is at most $4md + \mathrm{nnz}(H_k^0) = O(md)$.
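A direct numpy transcription of Algorithm 1 might look as follows (a sketch, not the lecture's code; I assume the pairs $(s_i, y_i)$ are stored oldest-first and that $H_k^0$ is diagonal, passed as its diagonal vector):

```python
import numpy as np

def two_loop(grad, s_list, y_list, H0_diag):
    """Compute H_k @ grad via the L-BFGS two-loop recursion.

    s_list, y_list hold the m most recent pairs (s_i, y_i), oldest
    first; H0_diag is the diagonal of the (diagonal) matrix H_k^0.
    Cost: O(m d) vector operations, never forming H_k.
    """
    q = grad.copy()
    rhos = [1.0 / (s @ y) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: i = k-1, ..., k-m (newest pair first)
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y               # q <- V_i q
    r = H0_diag * q                  # r <- H_k^0 q
    # second loop: i = k-m, ..., k-1 (oldest pair first)
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * (y @ r)
        r += s * (alpha - beta)      # r <- V_i^T r + alpha_i s_i
    return r                         # equals H_k @ grad
```

For small $d$ one can check this against the dense update sketched earlier: with a single stored pair and `H0_diag = np.ones(d)`, the output matches `bfgs_update(np.eye(d), s, y) @ grad`.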
In practice:

• We often take $m$ to be a small constant independent of $d$, e.g., $3 \leq m \leq 20$.


• A popular choice for $H_k^0$ is $H_k^0 = \gamma_k I$, where $\gamma_k = \frac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}$; a sketch of this choice follows the list. It appears to be quite effective in practice. (Optional) $\frac{1}{\gamma_k}$ is an approximation of $\frac{z_k^\top \nabla^2 f(x_k) z_k}{\|z_k\|^2}$, which is the size of the true Hessian along the direction $z_k \approx \left( \nabla^2 f(x_k) \right)^{1/2} s_k$; see Section 6.1 in Nocedal-Wright.
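In the notation of the `two_loop` sketch above, the scaled-identity choice is a one-liner (again illustrative; `h0_scaling` is my name):

```python
import numpy as np

def h0_scaling(s_prev, y_prev, d):
    """Diagonal of H_k^0 = gamma_k I, with gamma_k = s^T y / y^T y
    computed from the most recent pair (s_{k-1}, y_{k-1})."""
    gamma = (s_prev @ y_prev) / (y_prev @ y_prev)
    return gamma * np.ones(d)
```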

The complete L-BFGS algorithm is given in Algorithm 2. As discussed in Lecture 21, it is important that $\alpha_k$ satisfies both parts of the Wolfe conditions: sufficient decrease and curvature.


Algorithm 2 L-BFGS
input: $x_0 \in \mathbb{R}^d$ (initial point), $m > 0$ (memory budget), $\epsilon > 0$ (convergence criterion)
$k \leftarrow 0$
repeat:

• Choose $H_k^0$

• $p_k \leftarrow -H_k \nabla f(x_k)$, where $H_k \nabla f(x_k)$ is computed using Algorithm 1

• $x_{k+1} \leftarrow x_k + \alpha_k p_k$, where $\alpha_k$ satisfies the Wolfe conditions

• if $k > m$:

  – discard $\{s_{k-m}, y_{k-m}\}$ from storage

• Compute and store $s_k \leftarrow x_{k+1} - x_k$ and $y_k \leftarrow \nabla f(x_{k+1}) - \nabla f(x_k)$

• $k \leftarrow k + 1$

until $\|\nabla f(x_k)\| \leq \epsilon$
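Putting the sketches together, here is a minimal driver in the spirit of Algorithm 2 (a sketch under stated assumptions, not the lecture's code); it reuses the hypothetical `two_loop` and `h0_scaling` helpers above and borrows `scipy.optimize.line_search`, which searches for a stepsize satisfying the (strong) Wolfe conditions:

```python
from collections import deque

import numpy as np
from scipy.optimize import line_search   # Wolfe-condition stepsize search

def lbfgs(f, grad_f, x0, m=10, eps=1e-6, max_iter=500):
    """Minimal L-BFGS driver in the spirit of Algorithm 2 (a sketch)."""
    x = x0.astype(float)
    g = grad_f(x)
    # deques with maxlen=m discard the oldest pair automatically,
    # implementing the "if k > m: discard" step of Algorithm 2
    s_hist, y_hist = deque(maxlen=m), deque(maxlen=m)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        if s_hist:
            H0 = h0_scaling(s_hist[-1], y_hist[-1], x.size)  # gamma_k I
        else:
            H0 = np.ones(x.size)                             # first step: H_k^0 = I
        p = -two_loop(g, list(s_hist), list(y_hist), H0)
        alpha = line_search(f, grad_f, x, p)[0]              # Wolfe stepsize
        if alpha is None:        # line search failed; crude fallback stepsize
            alpha = 1e-3
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s_hist.append(x_new - x)
        y_hist.append(g_new - g)
        x, g = x_new, g_new
    return x
```

For instance, `lbfgs(lambda x: x @ x, lambda x: 2.0 * x, np.ones(100))` should return a point with gradient norm below `eps`.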

Some numerical results taken from Nocedal-Wright were shown here (figures not reproduced).

3 Relationship with nonlinear conjugate gradient methods


In Lecture 13 we mentioned several ways of generalizing CG to non-quadratic functions (a.k.a. nonlinear CG), including Dai-Yuan, Fletcher-Reeves, and Polak-Ribière. The last one has a variant called Hestenes-Stiefel, which uses the search direction


$$p_{k+1} = -\nabla f(x_{k+1}) + \frac{\nabla f(x_{k+1})^\top y_k}{y_k^\top p_k} p_k = -\underbrace{\left( I - \frac{s_k y_k^\top}{y_k^\top s_k} \right)}_{=: \hat{H}_{k+1}} \nabla f(x_{k+1}), \tag{2}$$

where we recall that $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ and $s_k = x_{k+1} - x_k$.


The matrix $\hat{H}_{k+1}$ is neither symmetric nor p.d. If we try to symmetrize $\hat{H}_{k+1}$ by taking $\hat{H}_{k+1} \hat{H}_{k+1}^\top$, we end up with a matrix that does not satisfy the secant equation and is singular: indeed $\hat{H}_{k+1} s_k = 0$ and $\hat{H}_{k+1}^\top y_k = 0$, so $\hat{H}_{k+1} \hat{H}_{k+1}^\top y_k = 0 \neq s_k$.
A symmetric p.d. matrix that satisfies the secant equation is
$$\begin{aligned}
H_{k+1} &= \hat{H}_{k+1} \hat{H}_{k+1}^\top + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \left( I - \frac{s_k y_k^\top}{y_k^\top s_k} \right) I \left( I - \frac{y_k s_k^\top}{y_k^\top s_k} \right) + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \text{BFGS update (1) applied to } H_k = I.
\end{aligned}$$

Therefore, computing $H_{k+1}$ as above for the search direction $p_{k+1} = -H_{k+1} \nabla f(x_{k+1})$ can be viewed as “memoryless” BFGS, i.e., L-BFGS with $m = 1$ and $H_k^0 = I$.
Suppose we combine memoryless BFGS and exact line search:
$$\alpha_k = \operatorname*{argmin}_{\alpha \in \mathbb{R}} f(x_k + \alpha p_k).$$

For all $k$, the stepsize $\alpha_k$ satisfies
$$0 = \langle \nabla f(x_k + \alpha_k p_k), p_k \rangle = \left\langle \nabla f(x_{k+1}), \alpha_k^{-1} s_k \right\rangle,$$
hence $s_k^\top \nabla f(x_{k+1}) = 0$. It follows that

$$\begin{aligned}
p_{k+1} &= -H_{k+1} \nabla f(x_{k+1}) \\
&= -\left[ \left( I - \frac{s_k y_k^\top}{y_k^\top s_k} \right) \left( I - \frac{y_k s_k^\top}{y_k^\top s_k} \right) + \frac{s_k s_k^\top}{y_k^\top s_k} \right] \nabla f(x_{k+1}) \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top s_k} s_k \qquad \text{using } s_k^\top \nabla f(x_{k+1}) = 0 \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top p_k} p_k \qquad \text{since } s_k = \alpha_k p_k,
\end{aligned}$$
which is the same as the Hestenes-Stiefel CG update (2).
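This identity is easy to verify numerically on a strongly convex quadratic, where exact line search has a closed form (a sanity-check sketch; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)                 # s.p.d. Hessian of f
b = rng.standard_normal(d)
grad = lambda x: A @ x - b                  # f(x) = 0.5 x^T A x - b^T x

x = rng.standard_normal(d)
g = grad(x)
p = -g                                      # initial search direction
alpha = -(g @ p) / (p @ (A @ p))            # exact line search for a quadratic
x1 = x + alpha * p
g1 = grad(x1)
s, y = alpha * p, g1 - g

# memoryless BFGS: H_{k+1} = BFGS update (1) applied to I, then p = -H g
rho = 1.0 / (s @ y)
H = ((np.eye(d) - rho * np.outer(s, y))
     @ (np.eye(d) - rho * np.outer(y, s))
     + rho * np.outer(s, s))
p_bfgs = -H @ g1

# Hestenes-Stiefel direction (2)
p_hs = -g1 + ((g1 @ y) / (y @ p)) * p

assert np.allclose(p_bfgs, p_hs)
```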
