
BFGS convergence to nonsmooth minimizers of convex functions

arXiv:1703.06690v1 [math.OC] 20 Mar 2017

J. Guo∗   A.S. Lewis†

March 21, 2017

Abstract
The popular BFGS quasi-Newton minimization algorithm under reason-
able conditions converges globally on smooth convex functions. This result
was proved by Powell in 1976: we consider its implications for functions that
are not smooth. In particular, an analogous convergence result holds for func-
tions, like the Euclidean norm, that are nonsmooth at the minimizer.

Key words: convex; BFGS; quasi-Newton; nonsmooth.


AMS 2000 Subject Classification: 90C30; 65K05.

1 Introduction
The BFGS (Broyden-Fletcher-Goldfarb-Shanno) method for minimizing a smooth
function has been popular for decades [6]. Surprisingly, however, it can also be
an effective general-purpose tool for nonsmooth optimization [3]. For twice con-
tinuously differentiable convex functions with compact level sets, Powell [7] proved
global convergence of the algorithm in 1976. By contrast, in the nonsmooth case, de-
spite substantial computational experience, the method is supported by little theory.
Beyond one dimension, with the exception of some contrived model examples [4],
the only previous convergence proof for the standard BFGS algorithm applied to
a nonsmooth function seems to be the analysis of the two-dimensional Euclidean
norm in [3].
∗ ORIE, Cornell University, Ithaca, NY 14853, U.S.A. [email protected].
† ORIE, Cornell University, Ithaca, NY 14853, U.S.A. people.orie.cornell.edu/aslewis. Research supported in part by National Science Foundation Grant DMS-1613996.

Figure 1: BFGS method for f(u, v) = u² + |v|. A thousand random starts, using inexact line search, and initial approximate Hessian I. Semilog plots of function value f(u_k, v_k), initially normalized. Panel 1: against iteration count k. (Bold line plots 2^{−2k}.) Panel 2: against function evaluation count, including line search.

As a simple illustration, consider the nonsmooth convex function f : R² → R defined by f(u, v) = u² + |v|. A routine implementation of the BFGS method, using

a random initial point and a standard backtracking line search, invariably converges
to the unique optimizer at zero. Not surprisingly, the method of steepest descent,
using the same line search, often converges to a nonoptimal point (u, 0) with u ≠ 0.
For example, Figure 1 plots function values for a thousand runs of BFGS against
both iteration count and a count of the number of function-gradient evaluations,
including those incurred in each line search. (Precisely, the initial Hessian approximation is the identity, the weak Wolfe line search uses Armijo parameter 10^{−4} and Wolfe parameter 0.9, and the initial function value is normalized to one.) The results compellingly support convergence, and indeed suggest a linear rate: the bold line overlaid on the first panel corresponds to the BFGS iterates (2^{−k}, (2/5)(−1)^k 2^{−2k})
generated by an exact line search [4]. However, even for this very simple example,
a general convergence result does not seem easy.
Nonetheless, Powell’s theory does have consequences even in the nonsmooth case.
Loosely speaking, we prove, at least under a strict-convexity-like assumption, that
global convergence can only fail for the BFGS method if a subsequence of the iterates
converges to a nonsmooth point. For example, for the function f(u, v) = u² + |v|, BFGS iterates cannot remain a uniform distance away from the line v = 0. While
intuitive — a successful smooth algorithm should somehow detect nonsmoothness
— this result is also reassuring, and in fact suffices to prove convergence on some
interesting examples. An analogous technique proves convergence for the Euclidean

norm on R^n, generalizing the result for n = 2 in [3].

2 BFGS sequences
Given a set U ⊂ R^n, we consider the BFGS method for minimizing a possibly nonsmooth function f : U → R. We call a sequence (x_k) in U “BFGS” if the BFGS
method could generate it using a line search satisfying the Armijo and weak Wolfe
conditions. More precisely, we make the following definition.

Definition 2.1 A sequence (x_k) is a BFGS sequence for the function f if f is differentiable at each iterate x_k with nonzero gradient ∇f(x_k), and there exist parameters µ < ν in the interval (0, 1) and an n-by-n positive definite matrix H_0 such that the vectors

s_k = x_{k+1} − x_k   and   y_k = ∇f(x_{k+1}) − ∇f(x_k)

and the matrices defined recursively by

(2.2)   V_k = I − (s_k y_k^T)/(s_k^T y_k)   and   H_{k+1} = V_k H_k V_k^T + (s_k s_k^T)/(s_k^T y_k)

satisfy

(2.3)   H_k ∇f(x_k) ∈ −R_+ s_k,
(2.4)   f(x_{k+1}) ≤ f(x_k) + µ ∇f(x_k)^T s_k,
(2.5)   ∇f(x_{k+1})^T s_k ≥ ν ∇f(x_k)^T s_k

for k = 0, 1, 2, . . ..

Notice that this property is independent of any particular line search algorithm
used to generate the sequence (x_k): it depends only on the sequences of function values f(x_k) and gradients ∇f(x_k). Conceptually, in the definition, the matrices H_k are approximate inverse Hessians for the function f at the iterate x_k: the equations (2.2) define the BFGS quasi-Newton update and the inclusion (2.3) expresses the fact that the step s_k is in the corresponding approximate Newton direction.
The inequalities (2.4) and (2.5) are the Armijo and weak Wolfe line search conditions, with parameters µ and ν respectively. By a simple and standard induction argument, they imply that s_k^T y_k > 0 holds for all k, ensuring the matrices H_k are well-defined and positive definite, and that the function values f(x_k) decrease strictly. An implementation of the BFGS method for a convex function f using a standard backtracking line search will generate a BFGS sequence of iterates, assuming that those iterates stay in the set U and that the method never encounters a nonsmooth or critical point.
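As an illustrative sketch (ours, not part of the paper), the following Python code implements the recursion (2.2) with a bracketing line search enforcing the Armijo condition (2.4) and the weak Wolfe condition (2.5), and runs it on the introduction's example f(u, v) = u² + |v|. The function names, iteration caps, and tolerances are our own choices.

```python
import numpy as np

def bfgs(f, grad, x0, H0=None, mu=1e-4, nu=0.9, max_iter=100):
    """BFGS with a weak Wolfe line search, following Definition 2.1."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x)) if H0 is None else np.asarray(H0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        d = -H @ g                         # direction satisfying inclusion (2.3)
        # bracketing search for a step obeying Armijo (2.4) and weak Wolfe (2.5)
        t, lo, hi = 1.0, 0.0, np.inf
        for _ in range(60):
            if f(x + t * d) > f(x) + mu * t * (g @ d):    # (2.4) fails: shrink
                hi = t
            elif grad(x + t * d) @ d < nu * (g @ d):      # (2.5) fails: grow
                lo = t
            else:
                break
            t = 2 * t if hi == np.inf else (lo + hi) / 2
        s = t * d
        y = grad(x + s) - g
        if s @ y > 0:                      # curvature holds: update H by (2.2)
            V = np.eye(len(x)) - np.outer(s, y) / (s @ y)
            H = V @ H @ V.T + np.outer(s, s) / (s @ y)
        x = x + s
    return x

# the paper's example, nonsmooth along the line v = 0
f = lambda x: x[0] ** 2 + abs(x[1])
grad = lambda x: np.array([2.0 * x[0], np.sign(x[1])])

x_star = bfgs(f, grad, [1.0, 1.0])
```

Consistent with the experiments of Figure 1, the iterates approach the unique minimizer at the origin despite the nonsmoothness there.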

Example: a simple nonsmooth function
Consider the function f : R² → R defined by f(u, v) = u² + |v|. (We abuse notation slightly and identify the vector [u v]^T ∈ R² with the point (u, v).) Then the sequence in R² defined by

x_k = (2^{−k}, (2/5)(−1)^k 2^{−2k})   (k = 0, 1, 2, . . .)

is a BFGS sequence, as observed in [4, Prop 3.2]. Specifically, if we define the matrix

H_0 = [ 1/4  0 ; 0  1/2 ],

then the definition of a BFGS sequence holds for any parameter values µ ∈ (0, 0.7] and ν ∈ (µ, 1). In this example, the “exact” line search property ∇f(x_{k+1})^T s_k = 0 holds for all k, and the approximate inverse Hessians are

H_1 = [ 1/2  0 ; 0  1/4 ],   H_k = (1/6) [ 5  (−1)^k 2^{1−k} ; (−1)^k 2^{1−k}  2^{3−2k} ]   (k > 1).
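The exact line search property claimed above is easy to confirm numerically. The short script below (our own check, not part of the paper) verifies ∇f(x_{k+1})^T s_k = 0 along the stated sequence.

```python
import numpy as np

# Along x_k = (2^{-k}, (2/5)(-1)^k 2^{-2k}) for f(u, v) = u^2 + |v|,
# check the exact line search property grad f(x_{k+1})^T s_k = 0.
def x(k):
    return np.array([2.0 ** -k, 0.4 * (-1) ** k * 2.0 ** (-2 * k)])

def grad(u, v):
    return np.array([2.0 * u, np.sign(v)])   # gradient of u^2 + |v| for v != 0

max_dev = max(abs(grad(*x(k + 1)) @ (x(k + 1) - x(k))) for k in range(10))
```

Up to rounding, the deviation is zero for every k tested.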

Example: the Euclidean norm

Consider the function f = ‖·‖ on R². Beginning with the initial vector [1 0]^T, generate a sequence of vectors by, at each iteration, rotating clockwise through an angle of π/3 and shrinking by a factor 1/2. The result is a BFGS sequence for f, as observed in [3]. Specifically, if we define the matrix

H_0 = [ 3  −√3 ; −√3  3 ],

then the definition of a BFGS sequence holds for any parameter values µ ∈ (0, 2/3] and any ν ∈ (µ, 1). Again, the exact line search property ∇f(x_{k+1})^T s_k = 0 holds for all k. In this case the approximate inverse Hessians have eigenvalues behaving asymptotically like 2^{−k}(3 ± √3) (see [3]).
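The rotate-and-shrink construction can be checked numerically in the same way; the script below (again our own check) verifies the exact line search property ∇f(x_{k+1})^T s_k = 0 for f = ‖·‖ along the iterates.

```python
import numpy as np

theta = -np.pi / 3                          # clockwise rotation through pi/3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([1.0, 0.0])
devs = []
for k in range(20):
    x_next = 0.5 * (R @ x)                  # rotate, then shrink by 1/2
    s = x_next - x
    g_next = x_next / np.linalg.norm(x_next)  # gradient of the norm away from 0
    devs.append(abs(g_next @ s))
    x = x_next
max_dev = max(devs)
```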

3 Main result
The following theorem captures a key global convergence property of the BFGS
method.
Theorem 3.1 (Powell, 1976) Consider an open convex set U ⊂ R^n containing a BFGS sequence (x_k) for a convex function f : U → R. Assume that the level set {x ∈ U : f(x) ≤ f(x_0)} is bounded, and that

(3.2)   ∇²f is continuous throughout U.

Then the sequence of function values f(x_k) converges to min f.
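To see the theorem's setting in action, one can minimize any C² convex function with bounded level sets. The sketch below uses SciPy's BFGS implementation as a stand-in for the algorithm of Section 2, on an illustrative function of our own choosing (not from the paper) whose minimum value is 1, attained at the origin.

```python
import numpy as np
from scipy.optimize import minimize

# x^2 + y^4 + cosh(x - y) is C^2 and convex with bounded level sets;
# its gradient vanishes only at (0, 0), where the value is cosh(0) = 1.
f = lambda x: x[0] ** 2 + x[1] ** 4 + np.cosh(x[0] - x[1])

res = minimize(f, x0=[2.0, -1.0], method='BFGS')
```

Consistent with Powell's theorem, the function values are driven to min f = 1.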

Among the assumptions in Powell’s theorem, at least for dimension n > 2 (see [8]), convexity is central. Although the BFGS method works well in practice on general smooth functions [6], nonconvex counterexamples are known where convergence fails: in particular, [1] presents a bounded but nonconvergent BFGS sequence for a polynomial f : R^4 → R. In the general convex case, on the other hand, whether the smoothness assumption (3.2) can be weakened seems unclear.
We present here a result analogous to Powell’s theorem. We modify the as-
sumptions, strengthening the convexity assumption but weakening the smoothness
requirement (3.2). Similar results to the one below hold for many common minimiza-
tion algorithms possessing suitable global convergence properties in the smooth case.
Such algorithms generate sequences of iterates x_k characterized by certain properties of the function values f(x_k) and gradients ∇f(x_k) (for k = 0, 1, 2, . . .), analogous to the definition of a BFGS sequence. Provided the algorithm generates function values f(x_k) that must decrease to the minimum value min f for any convex function whose level sets are bounded and whose Hessian is continuous and positive definite throughout those level sets, exactly the same proof technique applies. Examples of
such algorithms include standard versions of steepest descent [6], coordinate descent
(see for example [5]), and conjugate gradient methods (see for example [2]). Here
we concentrate on BFGS because, in striking contrast to these methods, the BFGS
method works well in practice on nonsmooth functions [3].

Theorem 3.3 Powell’s Theorem also holds with the smoothness assumption (3.2) replaced by the following assumption:

(3.4)   ∇²f is positive-definite and continuous throughout an open set V ⊂ U containing the set cl(x_k) and satisfying inf_V f = min f.

Proof We consider an open convex set U ⊂ R^n containing a BFGS sequence (x_k) for a convex function f : U → R satisfying assumption (3.4). We further assume that the level set {x ∈ U : f(x) ≤ f(x_0)} is bounded, and our aim is to prove that the sequence of function values f(x_k) converges to min f.
Assume first that the theorem is true in the special case when U = R^n and the complement V^c is bounded. We then deduce the general case as follows. Note by assumption that the function f is not constant, so by convexity there exists a point x̄ ∈ U with f(x̄) > f(x_0). Convexity also ensures that f is L-Lipschitz on the nonempty compact convex set

K = {x ∈ U : f(x) ≤ f(x̄)},

for some constant L > 0. Hence there exists a convex Lipschitz function f̂ : R^n → R agreeing with f on K, specifically the Lipschitz regularization defined by

f̂(y) = min_{x∈K} { f(x) + L‖y − x‖ }   (y ∈ R^n).

Now, for any sufficiently large β ∈ R, the convex function f̃ : R^n → R defined by

f̃(x) = max{ f̂(x), (1/2)‖x‖² − β }   (x ∈ R^n)

also agrees with f on K. The Hessian of f̃ is just the identity throughout the open set

W = { x : f̂(x) < (1/2)‖x‖² − β }.

Furthermore, this set has bounded complement, and therefore so does the open set

Ṽ = W ∪ {x ∈ V : f(x) < f(x̄)}.

Now notice that (x_k) is also a BFGS sequence for the function f̃, and all the assumptions of the theorem hold with f replaced by f̃, U replaced by R^n, and V replaced by Ṽ. Applying the special case of the theorem, we deduce

f(x_k) = f̃(x_k) → min f̃ = min f,

as required.
We can therefore concentrate on the special case when U = R^n and the set N = V^c is compact. We can assume N is nonempty, since otherwise the result follows immediately from Powell’s Theorem. The convex function f is then continuous throughout R^n. It is not constant, and hence is unbounded above. Furthermore, by assumption, the initial point x_0 is not a minimizer, so all the level sets {x : f(x) ≤ α} are compact. Since N is compact and f is continuous, we can fix a constant α > f(x_0) satisfying α > max_N f.
Since the values f(x_k) are decreasing, the sequence (x_k) is bounded and hence the closure cl(x_k) is compact. For all sufficiently small ε > 0, we then have

cl(x_k) ∩ (N + 2εB) = ∅   and   max_{N+2εB} f < α,

where B denotes the closed unit ball in R^n. The distance function d_N : R^n → R defined by d_N(x) = min_{y∈N} ‖y − x‖ (for x ∈ R^n) is continuous, so the set

Ω_ε = {x : d_N(x) ≥ 2ε and f(x) ≤ α}

is compact, and is contained in the open set {x : d_N(x) > ε}. On this open set, the function f is convex, in the sense of [11], and C² with positive-definite Hessian. Hence, by [11, Theorem 3.2], there exists a C² convex function f_ε on a convex open neighborhood U_ε of the convex hull conv Ω_ε agreeing with f on Ω_ε. Our choice of ε ensures

{x : f(x) = α} ⊂ Ω_ε ⊂ {x : f(x) ≤ α},

so in fact conv Ω_ε = {x : f(x) ≤ α}. (Although superfluous for this proof, [11, Theorem 3.2] even guarantees that f_ε has positive-definite Hessian on this compact convex set, and hence is strongly convex on it.)
We next observe that the level set {x ∈ U_ε : f_ε(x) ≤ f_ε(x_0)} is bounded, since it is contained in the set {x : f(x) ≤ α}. Otherwise there would exist a point x ∈ U_ε satisfying f_ε(x) ≤ f_ε(x_0) = f(x_0) < α and f(x) > α. By continuity of f, there exists a point y on the line segment between x_0 and x satisfying f(y) = α. But then we must have y ∈ Ω_ε and hence f_ε(y) = f(y) = α, contradicting the convexity of f_ε.
The values and gradients of the functions f and f_ε : U_ε → R agree at each iterate x_k, so since those iterates comprise a BFGS sequence for f, they also do so for f_ε. We can therefore apply Theorem 3.1 to deduce

f_ε(x_k) = f(x_k) ↓ min f_ε   as k → ∞.

By assumption, there exists a sequence of points x_r ∈ V (for r = 1, 2, 3, . . .) satisfying lim_r f(x_r) = min f. For any fixed index r, we know x_r ∈ Ω_ε for all ε > 0 sufficiently small, so we have

min f ≤ lim_k f(x_k) = min f_ε ≤ f_ε(x_r) = f(x_r).

Taking the limit as r → ∞ shows lim_k f(x_k) = min f, as required. □

The following consequence suggests simple examples.

Corollary 3.5 Powell’s Theorem also holds with smoothness assumption (3.2) replaced by the assumption that ∇²f is positive-definite and continuous throughout the set {x ∈ U : f(x) > min f}.

Proof Suppose the result fails. The given set, which we denote by V, must contain the set cl(x_k): otherwise there would exist a subsequence of (x_k) converging to a minimizer of f, and since the values f(x_k) decrease monotonically, they would converge to min f, a contradiction. Clearly we have inf_V f = min f. But now applying Theorem 3.3 gives a contradiction. □

Corollary 3.6 Consider an open semi-algebraic convex set U ⊂ R^n containing a BFGS sequence for a semi-algebraic strongly convex function f : U → R with bounded level sets. Assume that the sequence and all its limit points lie in the interior of the set where f is twice differentiable. Then the sequence of function values converges to the minimum value of f.

Proof Denote the interior of the set where f is twice differentiable by V. Standard results in semi-algebraic geometry [10, p. 502] guarantee that V is dense in U, whence inf_V f = min f, and furthermore that the Hessian ∇²f is continuous throughout V, and hence positive-definite by strong convexity. The result now follows by Theorem 3.3. □

The open set V in the proof of Corollary 3.6, where the function f is smooth, has full measure in the underlying set U. Hence, if we initialize the algorithm in question with a starting point x_0 generated at random from a continuous probability distribution on U, and use a computationally realistic line search to generate each iterate x_k from its predecessor, then we would expect (x_k) ⊂ V almost surely. Then, according to the result, one (or both) of two cases holds.

(i) The algorithm succeeds: f(x_k) → min f.

(ii) A subsequence of the iterates converges to a point where f is nonsmooth.

Extensive computational experiments suggest case (i) holds almost surely [3].
Like Theorem 3.3, analogous versions of Corollary 3.6 hold for many other algo-
rithms, in addition to the BFGS method. By contrast with BFGS, however, those
algorithms often fail in general, due to the possibility of case (ii). In the special
situation described in Corollary 3.5, case (ii) implies case (i), so analogous results
will hold for many common algorithms, like steepest descent, coordinate descent, or
conjugate gradients.

4 Special constructions
Unlike Powell’s original result, Theorem 3.3 requires the Hessian ∇²f to be positive-definite on an appropriate set, an assumption that fails for some simple but interesting examples like the Euclidean norm. We can sometimes circumvent this difficulty
by a more direct construction, avoiding tools from [11]. The following result is a
version of Corollary 3.5 under a more complicated but weaker assumption.

Theorem 4.1 Powell’s Theorem also holds with the smoothness assumption (3.2)
replaced by the following weaker condition:

For all constants δ > 0, there is a convex open neighborhood U_δ ⊂ U of the set {x ∈ U : f(x) ≤ f(x_0)}, and a C² convex function f_δ : U_δ → R satisfying f_δ(x) = f(x) whenever f(x_0) ≥ f(x) ≥ min f + δ.

Proof Clearly condition (3.2) implies the given condition, since we could choose U_δ = U and f_δ = f. Assuming this new condition instead, suppose the conclusion of Powell’s Theorem 3.1 fails, so there exists a number δ > 0 such that f(x_k) > min f + 2δ for all k = 0, 1, 2, . . .. Consider the function f_δ guaranteed by our assumption. Since f is continuous, there exists a point x̄ ∈ U satisfying f(x̄) = min f + δ, and since f_δ(x̄) = f(x̄), we deduce min f_δ ≤ min f + δ.
Since (x_k) is a BFGS sequence for the function f, it is also a BFGS sequence for the function f_δ. Applying Theorem 3.1 with f replaced by f_δ shows the contradiction

min f + 2δ ≤ f(x_k) = f_δ(x_k) ↓ min f_δ ≤ min f + δ,

so the result follows. □

We can apply this result directly to the Euclidean norm.

Corollary 4.2 Any BFGS sequence for the Euclidean norm on R^n converges to zero.

Proof For any δ > 0, consider the function g_δ : R → R defined by

(4.3)   g_δ(t) = (δ³ + 3δt² − |t|³)/(3δ²)   (|t| ≤ δ);   g_δ(t) = |t|   (|t| ≥ δ).

This function is C² convex and symmetric. The function f_δ : R^n → R defined by f_δ(x) = g_δ(‖x‖) is also C² convex, either as a consequence of [9] or via a straightforward direct calculation. The result now follows from Theorem 4.1. □
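A quick numerical sanity check (our own script, not the authors') of the smoothing (4.3): g_δ agrees with |t| at the seam |t| = δ, and the one-sided slopes there both approach 1, consistent with the claim that g_δ is C².

```python
# Piecewise formula (4.3): a cubic smoothing of |t| on [-delta, delta].
def g(t, delta):
    if abs(t) <= delta:
        return (delta**3 + 3 * delta * t**2 - abs(t)**3) / (3 * delta**2)
    return abs(t)

delta, h = 0.5, 1e-6
seam_gap = abs(g(delta, delta) - delta)                    # values match at t = delta
slope_in = (g(delta, delta) - g(delta - h, delta)) / h     # slope from inside the seam
slope_out = (g(delta + h, delta) - g(delta, delta)) / h    # slope from outside the seam
```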

Analogously, the following result is a more direct version of Theorem 3.3.

Theorem 4.4 Powell’s Theorem also holds with the smoothness assumption (3.2) replaced by the assumption that some open set V ⊂ U containing the set cl(x_k) and satisfying inf_V f = min f also satisfies the following condition:

For all constants δ > 0, there is a convex open neighborhood U_δ ⊂ U of the set {x ∈ U : f(x) ≤ f(x_0)}, and a C² convex function f_δ : U_δ → R satisfying f_δ(x) = f(x) for all points x ∈ U_δ such that d_{V^c}(x) > δ.

Proof Denote the distance between the compact set cl(x_k) and the closed set V^c by δ̄, so we know δ̄ > 0. For any constant δ ∈ (0, δ̄), we have d_{V^c}(x_k) > δ for all indices k = 0, 1, 2, . . ., and hence f_δ(x_k) = f(x_k).
The values and gradients of the functions f and f_δ agree at each iterate x_k, so since those iterates comprise a BFGS sequence for f, they also do so for f_δ. We can therefore apply Theorem 3.1 to deduce

f(x_k) = f_δ(x_k) ↓ min f_δ   as k → ∞.

By assumption, there exists a sequence of points x_r ∈ V (for r = 1, 2, 3, . . .) satisfying lim_r f(x_r) = min f. For any fixed index r, we know d_{V^c}(x_r) > δ for all sufficiently small δ > 0, so since f_δ(x_r) = f(x_r), we deduce min f_δ ≤ f(x_r). The inequality lim_k f(x_k) ≤ f(x_r) follows, and letting r → ∞ proves lim_k f(x_k) = min f as required. □

We end by proving a claim from the introduction.

Corollary 4.5 Any BFGS sequence for the function f : R² → R given by f(u, v) = u² + |v| has a subsequence converging to a point on the line v = 0.

Proof Suppose the result fails, so some BFGS sequence (u_k, v_k) has its closure contained in the open set

V = {(u, v) ∈ R² : v ≠ 0}.

Clearly we have inf_V f = min f. For any constant δ > 0, define a function f_δ : R² → R by f_δ(u, v) = u² + g_δ(v), where the function g_δ is given by equation (4.3). Then we have f(u, v) = f_δ(u, v) for any point (u, v) satisfying |v| > δ, or equivalently d_{V^c}(u, v) > δ. Hence the assumptions of Theorem 4.4 hold (using the set U_δ = R²), so we deduce f(u_k, v_k) → 0, and hence (u_k, v_k) → (0, 0). This contradiction completes the proof. □

As we remarked in the introduction, numerical evidence strongly supports a conjecture that all BFGS sequences for the function f(u, v) = u² + |v| converge to zero. That conjecture remains open.

References
[1] Y.-H. Dai. A perfect example for the BFGS method. Mathematical Program-
ming, 138(1):501–530, 2013.

[2] J.C. Gilbert and J. Nocedal. Global convergence properties of conjugate gra-
dient methods for optimization. SIAM J. Optim., 2(1):21–42, 1992.

[3] A.S. Lewis and M.L. Overton. Nonsmooth optimization via quasi-Newton meth-
ods. Math. Program., 141(1-2, Ser. A):135–163, 2013.

[4] A.S. Lewis and S. Zhang. Nonsmoothness and a variable metric method. J.
Optim. Theory Appl., 165(1):151–171, 2015.

[5] Z.Q. Luo and P. Tseng. On the convergence of the coordinate descent method
for convex differentiable minimization. J. Optim. Theory Appl., 72(1):7–35,
1992.

[6] J. Nocedal and S.J. Wright. Numerical Optimization. Springer Series in Opera-
tions Research and Financial Engineering. Springer, New York, second edition,
2006.

[7] M.J.D. Powell. Some global convergence properties of a variable metric algo-
rithm for minimization without exact line searches. In Nonlinear Programming
(Proc. Sympos., New York, 1975), pages 53–72. SIAM–AMS Proc., Vol. IX.
Amer. Math. Soc., Providence, R. I., 1976.

[8] M.J.D. Powell. On the convergence of the DFP algorithm for unconstrained
optimization when there are only two variables. Math. Program., 87(2, Ser.
B):281–301, 2000. Studies in algorithmic optimization.

[9] H.S. Sendov. Nonsmooth analysis of Lorentz invariant functions. SIAM J. Optim., 18(3):1106–1127, 2007.

[10] L. van den Dries and C. Miller. Geometric categories and o-minimal structures.
Duke Math. J., 84(2):497–540, 1996.

[11] M. Yan. Extension of convex function. J. Convex Anal., 21(4):965–987, 2014.
