

Computational Optimization and Applications, 35, 69–86, 2006

© 2006 Springer Science + Business Media, LLC. Manufactured in The Netherlands.
DOI: 10.1007/s10589-006-6446-0

Gradient Methods with Adaptive Step-Sizes


BIN ZHOU [email protected]
LI GAO [email protected]
School of Mathematical Sciences and LMAM, Peking University, Beijing, 100871, People’s Republic of China

YU-HONG DAI [email protected]


State Key Laboratory of Scientific and Engineering Computing, Institute of Computational Mathematics
and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of
Sciences, P. O. Box 2719, Beijing, 100080, People’s Republic of China

Received December 23, 2004; Revised June 14, 2005

Published online: 31 March 2006

Abstract. Motivated by the superlinear behavior of the Barzilai-Borwein (BB) method for two-dimensional
quadratics, we propose two gradient methods which adaptively choose a small step-size or a large step-size at
each iteration. The small step-size is primarily used to induce a favorable descent direction for the next iteration,
while the large step-size is primarily used to produce a sufficient reduction. Although the new algorithms are
still linearly convergent in the quadratic case, numerical experiments on some typical test problems indicate
that they compare favorably with the BB method and some other efficient gradient methods.

Keywords: linear system, gradient method, adaptive step-size, Barzilai-Borwein method, superlinear behavior, trust-region approach

1. Introduction

We consider the minimization problem of a quadratic function


$$\min f(x) = \frac{1}{2} x^t A x - b^t x, \qquad (1)$$

where A ∈ R^{n×n} is symmetric and positive definite (SPD), and b, x ∈ R^n. It is well-known that this problem is equivalent to solving the linear system Ax = b.
The gradient method for (1) can be defined by the iteration

xk+1 = xk − αk gk , (2)

where gk = Axk − b and αk is a step-size depending on the line search applied.


For instance, the classical steepest descent (SD) method [8] determines αk by

$$\alpha_k^{SD} = \frac{g_k^t g_k}{g_k^t A g_k}, \qquad (3)$$

which minimizes the function value f (x) along the ray {xk − αgk : α > 0}.

Another straightforward line search is to minimize the gradient norm ‖g(x)‖2 along
the ray {xk − αgk : α > 0}—we name the associated algorithm “minimal gradient (MG)
method” for convenience. Trivial deductions (see [11]) yield

$$\alpha_k^{MG} = \frac{g_k^t A g_k}{g_k^t A^2 g_k}. \qquad (4)$$

Moreover, by the Cauchy inequality, we can prove that αkMG ≤ αkSD.
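As an illustration (ours, not part of the original paper), the following Python snippet computes the SD and MG step-sizes (3)–(4) for a small SPD quadratic and checks this inequality; all names are hypothetical.

```python
import numpy as np

def sd_step(A, g):
    """Steepest descent step-size (3): exact minimizer of f along -g."""
    Ag = A @ g
    return (g @ g) / (g @ Ag)

def mg_step(A, g):
    """Minimal gradient step-size (4): minimizes ||g(x)||_2 along -g."""
    Ag = A @ g
    return (g @ Ag) / (Ag @ Ag)

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)          # symmetric positive definite
b = rng.standard_normal(5)
g = A @ np.zeros(5) - b                # gradient of (1) at x = 0
print(mg_step(A, g) <= sd_step(A, g))  # True, by the Cauchy inequality
```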


Despite the optimal properties, the SD and MG methods behave poorly in most cases.
Specifically, Akaike [1] showed that the sequence {xk } generated by the SD method
tends to zigzag in two orthogonal directions, which usually causes the convergence to
deteriorate. Early efforts to improve this method gave rise to the development of the
conjugate gradient (CG) method [16].
In 1988, Barzilai and Borwein [3] developed another two formulae for αk (k > 0):

$$\alpha_k^{BB1} = \frac{s_{k-1}^t s_{k-1}}{s_{k-1}^t y_{k-1}} = \frac{g_{k-1}^t g_{k-1}}{g_{k-1}^t A g_{k-1}} = \alpha_{k-1}^{SD}, \qquad (5)$$

$$\alpha_k^{BB2} = \frac{s_{k-1}^t y_{k-1}}{y_{k-1}^t y_{k-1}} = \frac{g_{k-1}^t A g_{k-1}}{g_{k-1}^t A^2 g_{k-1}} = \alpha_{k-1}^{MG}, \qquad (6)$$

where sk−1 = xk − xk−1 and yk−1 = gk − gk−1. These two step-sizes minimize ‖α−1 sk−1 − yk−1‖2 and ‖sk−1 − α yk−1‖2, respectively, providing a scalar approximation to each of
the secant equations Bk sk−1 = yk−1 and Hk yk−1 = sk−1.
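As an illustration of how cheap the resulting iteration is, here is a bare-bones Python sketch of the BB method (2)∼(5) for the quadratic (1) (ours; α0 = α0SD as in the experiments later in the paper, and all names are hypothetical):

```python
import numpy as np

def bb_method(A, b, x0, tol=1e-6, max_iter=1000):
    """Bare-bones BB1 iteration (2)~(5) for the quadratic (1), alpha_0 = alpha_0^SD."""
    x = x0.copy()
    g = A @ x - b
    alpha = (g @ g) / (g @ (A @ g))          # first step: SD step-size
    g0_norm = np.linalg.norm(g)
    for k in range(max_iter):
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g
        alpha = (s @ s) / (s @ y)            # BB1 step-size (5)
        x, g = x_new, g_new
        if np.linalg.norm(g) <= tol * g0_norm:
            break
    return x, k + 1
```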
The BB method ((2)∼(5) or (2)∼(6)) performs much better than the SD method in
practice (see also [12]). Especially when n = 2, it converges R-superlinearly to the global
minimizer [3]. In any dimension, it is still globally convergent [21] but the convergence
is R-linear [10]. Raydan [22] adapted the method to unconstrained optimization by
incorporating the non-monotone line search of Grippo et al. [15]. Since this work, the
BB method has been successfully extended to many fields such as convex constrained
optimization [2, 4–7], nonlinear systems [17, 18], support vector machines [24], etc.
Inspired by the BB method, Friedlander et al. [14] introduced a family of gradient
methods with retards (GMR). Given positive integers m and qi (i = 1, . . . , m), the GMR
sets the step-size as
$$\alpha_k^{GMR} = \frac{g_{\nu(k)}^t A^{\rho(k)-1} g_{\nu(k)}}{g_{\nu(k)}^t A^{\rho(k)} g_{\nu(k)}}, \qquad (7)$$

where

$$\nu(k) \in \{k, k-1, \ldots, \max\{0, k-m\}\}, \qquad \rho(k) \in \{q_1, q_2, \ldots, q_m\}. \qquad (8)$$

It is clear that the step-sizes (3)–(6) are all special cases of (7). [23] and [9] investigated
a particular member of the GMR, in which the SD and BB1 step-sizes are used by turns.
This algorithm, known as the alternate step (AS) gradient method in [9], has proved to
be a promising alternative to the BB method. Another remarkable member of the GMR
is the alternate minimization (AM) gradient method [11], which takes the SD and MG
step-sizes alternately and has been shown to be efficient for computing a low-precision solution.
For further information on the GMR, see [19] and [20].
In this paper, we propose two gradient methods which adaptively choose a small step-
size or a large one at every iteration. Here the small step-size serves to induce a favorable
descent direction for the next iteration, while the large one serves to produce a sufficient
reduction. To be more exact, in the first method, we combine the SD and MG step-sizes
in a dynamical way by which the worst-case behaviors of the SD and MG methods
are prevented at the same time; and in the second method, we derive a trust-region-like
strategy by which we are able to choose step-size between the two alternatives of the
original BB method. For strictly convex quadratics, we prove that the new algorithms
converge Q-linearly and R-linearly, respectively. Numerical results on some typical test
problems are presented, which suggest that our methods are competitive with the CG
method and generally preferable to the BB, AS, and AM methods.
The rest of the paper is organized as follows. In the next section, we describe our
motivating ideas and derive the new methods. Two simple numerical examples are
presented to illustrate their behaviors. In Section 3 we provide convergence analyses. In
Section 4, we present additional numerical results to compare our methods with the CG
method as well as with the BB, AS, and AM methods. Finally, in Section 5, we present
discussions and concluding remarks.

2. Derivation of the new methods

Barzilai and Borwein [3] proved that the BB method converges R-superlinearly for
two-dimensional quadratics, which can be illustrated by the following numerical phe-
nomenon: As k increases, the gradient gk generated by the BB method tends to be an
eigenvector of the matrix A. Unfortunately, this phenomenon does not evidently carry over to
the general n-dimensional case, where only R-linear convergence is obtained [10]. This suggests
the following heuristic idea: Assume the eigenvalues of A can be simply classified as the large
ones and the small ones, as if there were only two different eigenvalues, and suppose we
can frequently make gk approach some large or small eigenvectors (by which we mean
eigenvectors associated with the large or small eigenvalues) of A, then the resulting
algorithm might exhibit certain superlinear behavior and outperform the BB method
when n is very large.
In order to realize this idea, we choose to modify the SD method. Thus let us go
back to this old approach for further inspiration. Without loss of generality, we assume
that A is a diagonal matrix in our derivation. The following theorem illustrates the
worst-case behavior of the SD method in two dimensions. (In any dimension, the worst-
case behaviors of the SD and MG methods are only related to the smallest and largest
eigenvalues of A. Hence we just need to consider the two-dimensional case.)

Theorem 2.1. Consider the minimization problem (1), where A = diag(λ1, λ2) with
λ2 ≥ λ1 > 0. Let x∗ = A−1 b, and let {xk} be the sequence generated by the SD method
(2)∼(3). If g0 ∥ (1, ±1)t, then

$$\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} = \left(\frac{\lambda_2 - \lambda_1}{\lambda_2 + \lambda_1}\right)^2 \quad \text{for all } k \geq 0,$$

which means the SD method reaches its slowest convergence rate.

Proof: To prove this theorem, it suffices to show that if g0 ∥ (1, ±1)t, then we must have

$$g_1 \parallel (1, \mp 1)^t \quad \text{and} \quad \frac{f(x_1) - f(x^*)}{f(x_0) - f(x^*)} = \left(\frac{\lambda_2 - \lambda_1}{\lambda_2 + \lambda_1}\right)^2. \qquad (9)$$

Assume g0 = (u, ±u)t with u ≠ 0. Substituting it into (3), we have

$$\alpha_0^{SD} = \frac{2u^2}{(\lambda_2 + \lambda_1)u^2} = \frac{2}{\lambda_2 + \lambda_1}. \qquad (10)$$

Moreover, by (2), we have

$$g_1 = Ax_1 - b = A\left(x_0 - \alpha_0^{SD} g_0\right) - b = g_0 - \alpha_0^{SD} A g_0,$$

which, together with (10) and g0 = (u, ±u)t, yields

$$g_1 = \frac{\lambda_2 - \lambda_1}{\lambda_2 + \lambda_1}\, u\, (1, \mp 1)^t. \qquad (11)$$

For the second part of (9), we have by Taylor’s theorem and ∇f(x∗) = 0 that

$$f(x_k) - f(x^*) = (x_k - x^*)^t \nabla f(x^*) + \frac{(x_k - x^*)^t A (x_k - x^*)}{2} = \frac{(Ax_k - Ax^*)^t A^{-1} (Ax_k - Ax^*)}{2} = \frac{(Ax_k - b)^t A^{-1} (Ax_k - b)}{2} = \frac{g_k^t A^{-1} g_k}{2}.$$

Then by (11) and g0 = (u, ±u)t, we have

$$\frac{f(x_1) - f(x^*)}{f(x_0) - f(x^*)} = \frac{g_1^t A^{-1} g_1}{g_0^t A^{-1} g_0} = \left(\frac{\lambda_2 - \lambda_1}{\lambda_2 + \lambda_1}\right)^2 \frac{(1, \mp 1)^t A^{-1} (1, \mp 1)}{(1, \pm 1)^t A^{-1} (1, \pm 1)} = \left(\frac{\lambda_2 - \lambda_1}{\lambda_2 + \lambda_1}\right)^2 \frac{\lambda_1^{-1} + \lambda_2^{-1}}{\lambda_1^{-1} + \lambda_2^{-1}} = \left(\frac{\lambda_2 - \lambda_1}{\lambda_2 + \lambda_1}\right)^2.$$

Thus we have by induction that (f(xk+1) − f(x∗))/(f(xk) − f(x∗)) = ((λ2 − λ1)/(λ2 + λ1))² for all k ≥ 0.

We can see from Theorem 2.1 that when gk is about 45° away from every eigenvector
of A, the convergence of the SD method deteriorates. Naturally we hope that the next
gradient gk+1 will approach some eigenvector of A. But how should the k-th step-size be reset?
The theorem below shows that the MG step-size is a desirable option when the SD
step-size is unfavorable.

Theorem 2.2. Consider the minimization problem (1), where A = diag(λ1, λ2) with
λ2 ≥ λ1 > 0. Let {xk} be the sequence generated by a gradient method. If gk ∥ (1, ±1)t,
i.e. gk = (u, ±u)t with u ≠ 0, then it holds that

$$\alpha_k^{MG}/\alpha_k^{SD} > 0.5$$

and

$$g_{k+1} = \begin{cases} \dfrac{\lambda_2 - \lambda_1}{\lambda_2 + \lambda_1}\, u\,(1, \mp 1)^t, & \text{if } \alpha_k = \alpha_k^{SD}; \\[2ex] \dfrac{\lambda_2 - \lambda_1}{\lambda_2^2 + \lambda_1^2}\, u\,(\lambda_2, \mp \lambda_1)^t, & \text{if } \alpha_k = \alpha_k^{MG}. \end{cases} \qquad (12)$$

Proof: Substituting gk = (u, ±u)t into (3) and (4), we have

$$\alpha_k^{SD} = \frac{2}{\lambda_1 + \lambda_2} \quad \text{and} \quad \alpha_k^{MG} = \frac{\lambda_1 + \lambda_2}{\lambda_1^2 + \lambda_2^2}. \qquad (13)$$

Therefore,

$$\frac{\alpha_k^{MG}}{\alpha_k^{SD}} = \frac{(\lambda_1 + \lambda_2)^2}{2\left(\lambda_1^2 + \lambda_2^2\right)} > 0.5.$$

And by (2) we have

gk+1 = Axk+1 − b = Axk − αk Agk − b = gk − αk Agk . (14)

Then we obtain (12) from (13) and (14) by direct calculations.


Since λ2 ≥ λ1 > 0, it is easy to see that the choice of αkMG makes gk+1 more inclined
to the eigenvector direction (1, 0)t or (−1, 0)t than the choice of αkSD. Moreover, it
holds that αkMG/αkSD > 0.5 when gk ∥ (1, ±1)t. Thus we can set αk = αkMG to avoid
the worst-case behavior of the SD method, and let the inequality αkMG/αkSD > 0.5 be
the corresponding switch criterion. More precisely, we have the following scheme for
choosing the step-size:

$$\alpha_k = \begin{cases} \alpha_k^{MG}, & \text{if } \alpha_k^{MG}/\alpha_k^{SD} > \kappa; \\ \alpha_k^{SD}, & \text{otherwise}, \end{cases} \qquad (15)$$

where κ ∈ (0, 1) is a parameter close to 0.5.
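As a quick numerical illustration of Theorem 2.2 and the switch criterion in (15) (our own snippet, not from the paper), take A = diag(1, 7) and a gradient at 45° to both eigenvectors:

```python
import numpy as np

A = np.diag([1.0, 7.0])
g = np.array([1.0, -1.0])                  # gk parallel to (1, -1)^t

alpha_sd = (g @ g) / (g @ A @ g)           # step-size (3)
alpha_mg = (g @ A @ g) / (g @ A @ A @ g)   # step-size (4)

print(alpha_mg / alpha_sd)       # approx. 0.64 > 0.5: (15) selects the MG step
print(g - alpha_sd * (A @ g))    # approx. [0.75 0.75]: SD keeps gk+1 parallel to (1, 1)^t
print(g - alpha_mg * (A @ g))    # approx. [0.84 0.12]: MG tilts gk+1 toward the (1, 0)^t eigenvector
```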



One might ask this question: Does the scheme (15) also avoid the worst-case behavior
of the MG method? The answer is yes when the condition number of A is large enough. To
make this clear, we first present a theorem illustrating the worst-case behavior
of the MG method in two dimensions.

Theorem 2.3. Consider the minimization problem (1), where A = diag(λ1, λ2) with
λ2 ≥ λ1 > 0. Let {xk} be the sequence generated by the MG method (2)−(4). If
g0 ∥ (√λ2, ±√λ1)t, then it holds for all k ≥ 0 that
(a) The MG method reaches its slowest convergence rate:

$$\frac{\|g_{k+1}\|_2^2}{\|g_k\|_2^2} = \left(\frac{\lambda_2 - \lambda_1}{\lambda_2 + \lambda_1}\right)^2.$$

(b) The ratio αkMG/αkSD reaches its smallest value:

$$\frac{\alpha_k^{MG}}{\alpha_k^{SD}} = \frac{4\lambda_1\lambda_2}{(\lambda_1 + \lambda_2)^2} = \min_{g \neq 0} \frac{\left(g^t A g\right)^2}{\left(g^t g\right)\left(g^t A^2 g\right)}.$$

The proof of (a) is quite similar to that of Theorem 2.1, while (b) can be obtained
inductively by direct calculations (the second equality herein is an equivalent form of
the Kantorovich inequality). Supposing λ2 ≫ λ1, we can see that αkMG/αkSD ≈ 0 and
g0 is almost parallel to the eigenvector (1, 0)t. That is to say, the worst-case behavior of
the MG method is usually due to the excessively small step-size and not to the descent
direction. Scheme (15) sets αk = αkSD when αkMG/αkSD ≤ κ; therefore it is able to avoid
the worst-case behavior of the MG method for very ill-conditioned problems.
In a recent paper, Raydan and Svaiter [23] indicated that the SD method with (over
and under) relaxation usually improves upon the original method. This idea was also
considered by Dai and Yuan [11], in whose paper two shortened SD step gradient methods
have been well compared with the AM and SD methods. Motivated by these works, we
modify (15) as follows:

$$\alpha_k = \begin{cases} \alpha_k^{MG}, & \text{if } \alpha_k^{MG}/\alpha_k^{SD} > \kappa; \\ \alpha_k^{SD} - \delta\, \alpha_k^{MG}, & \text{otherwise}, \end{cases} \qquad (16)$$

where κ, δ ∈ (0, 1). Notice that when αkMG/αkSD reaches its smallest value, αkSD is the
least shortened. As αk can be viewed as an adaptively under-relaxed SD step-size (recall
that αkMG ≤ αkSD), we call the method (2)−(16) “adaptive steepest descent (ASD)
method”. Compared to the SD method, this method requires one extra inner product per
iteration.
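A minimal Python sketch of the ASD iteration (2)−(16) (ours, with the hypothetical defaults κ = δ = 0.5; not the authors' Fortran implementation):

```python
import numpy as np

def asd(A, b, x0, kappa=0.5, delta=0.5, tol=1e-6, max_iter=10000):
    """Adaptive steepest descent (2)-(16): take the MG step when the ratio
    alpha_MG / alpha_SD exceeds kappa, otherwise a shortened SD step."""
    x = x0.copy()
    g = A @ x - b
    g0_norm = np.linalg.norm(g)
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol * g0_norm:
            break
        Ag = A @ g
        alpha_sd = (g @ g) / (g @ Ag)            # (3)
        alpha_mg = (g @ Ag) / (Ag @ Ag)          # (4): the one extra inner product
        if alpha_mg / alpha_sd > kappa:
            alpha = alpha_mg
        else:
            alpha = alpha_sd - delta * alpha_mg  # shortened SD step
        x = x - alpha * g
        g = g - alpha * Ag                       # exact gradient update for quadratics
    return x, k
```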
How the ASD method might overcome the drawbacks of the SD method and simulate
the superlinear behavior of the BB method in two dimensions is depicted in figure 1,
where xkSD and xkASD signify the iterates of the SD and ASD methods, respectively. We
can observe that the ASD method (κ, δ = 0.5) refines x0 less than the SD method at
the first iteration, but it induces a gradient inclined to the eigenvector direction (−1, 0)t,
thereby obtaining a considerable descent at the next iteration.

Figure 1. ASD vs. SD when A = diag(1, 7), g0 ∥ (1, −1)t (the first two iterates of each method are shown).
Before we show by a simple example that the ASD method is still efficient for problems
of any dimension, we first give a condition under which we should expect that
the method would still exhibit certain superlinear behavior. Note that our derivation of
the ASD method is totally based on the assumption that the matrix A has two different
eigenvalues. We assume here that A has two clusters of eigenvalues. Then, in view of
the above figure and theorems, we should expect that some of the descent directions
generated by the ASD method would incline towards the eigenvectors that relate to the
eigenvalues with small magnitude.
To illustrate the performance of the ASD method in any dimension, a simple test
problem is devised (see Section 4 for more experiments). We let A = diag(λ1 , . . . , λ100 )
with λ1 = 0.1 and λi = i for i = 2, . . . , 100. The right-hand-side b is set as (1, . . . , 1)t
and the initial guess is chosen as x0 = 0 ∈ R^100. In order to get ‖gk‖2 ≤ 10^−6 ‖g0‖2,
the BB method (α0 = α0SD) needs 375 iterations, while the ASD (κ, δ = 0.5) method
requires only 302 iterations. Figure 2 reports the gradient norms for the BB and ASD
methods at each iteration (the figure also reports the gradient norms for a third method,
ABB, which will be defined later). From this figure, we can observe that the ASD method
gives the better solution at most iterations. Besides, this method seems to produce smaller
oscillations than the BB method. As will be proved in Section 3, the ASD method is in
fact a monotone algorithm.
Now let us extend (16) and derive an ASD-like variant of the BB method. Look-
ing into the iterations of the ASD method for the above test problem, we find that the
sign of αkMG/αkSD − κ is opposite to that of αk−1MG/αk−1SD − κ at most iterations (to be
exact, 238 out of 302 iterations). A similar phenomenon is also observed in many other
experiments (where κ = 0.5). For this reason, we replace αkMG/αkSD > κ in (16) with
αk−1MG/αk−1SD < κ:

$$\alpha_k = \begin{cases} \alpha_k^{MG}, & \text{if } \alpha_{k-1}^{MG}/\alpha_{k-1}^{SD} < \kappa; \\ \alpha_k^{SD} - \delta\, \alpha_k^{MG}, & \text{otherwise}. \end{cases} \qquad (17)$$

Figure 2. Performances of BB, ASD and ABB for a 100-dimensional problem (residual norm versus number of iterations, logarithmic scale).

Since αk−1SD = αkBB1 and αk−1MG = αkBB2, we rewrite (17) as

$$\alpha_k = \begin{cases} \alpha_k^{MG}, & \text{if } \alpha_k^{BB2}/\alpha_k^{BB1} < \kappa; \\ \alpha_k^{SD} - \delta\, \alpha_k^{MG}, & \text{otherwise}. \end{cases}$$

In order to get a simple scheme, we replace αkSD − δ αkMG and αkMG with αkBB1 and αkBB2,
respectively. So we finally have

$$\alpha_k = \begin{cases} \alpha_k^{BB2}, & \text{if } \alpha_k^{BB2}/\alpha_k^{BB1} < \kappa; \\ \alpha_k^{BB1}, & \text{otherwise}, \end{cases} \qquad (18)$$

where κ ∈ (0, 1). Here we do not shorten αkBB1 because the BB method itself overcomes
the drawbacks of the SD method. Analogously, we call the method (2)∼(18) “adaptive
Barzilai-Borwein (ABB) method”. For the above test problem, this method (κ = 0.5
and α0 = α0SD) requires only 221 iterations (see also figure 2).
At first glance, the ABB method seems to require one more inner product per iteration
than the BB method, but this extra computational work can be eliminated by using the

following formulae:

$$\alpha_k^{BB1} = \frac{s_{k-1}^t s_{k-1}}{s_{k-1}^t y_{k-1}} = \alpha_{k-1}\, \frac{g_{k-1}^t g_{k-1}}{g_{k-1}^t g_{k-1} - g_{k-1}^t g_k},$$

$$\alpha_k^{BB2} = \frac{s_{k-1}^t y_{k-1}}{y_{k-1}^t y_{k-1}} = \alpha_{k-1}\, \frac{g_{k-1}^t g_{k-1} - g_{k-1}^t g_k}{g_{k-1}^t g_{k-1} - 2 g_{k-1}^t g_k + g_k^t g_k}.$$

Here we have used sk−1 = xk − xk−1 = −αk−1 gk−1 and yk−1 = gk − gk−1. Notice that at
every iteration, we only need to compute two inner products, i.e. g_{k−1}^t g_k and g_k^t g_k. Our
extensive numerical experiments show that these formulae are reliable in the quadratic
case. For non-quadratic functions, there might be negative or zero denominators, which
could also happen while using the original formulae. Raydan [22] proposed a simple
approach to deal with this situation.
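Putting the switch (18) and these recurrences together, an ABB iteration for the quadratic case might look like the following Python sketch (ours; names, defaults, and the absence of safeguards are assumptions):

```python
import numpy as np

def abb(A, b, x0, kappa=0.5, tol=1e-6, max_iter=10000):
    """Adaptive BB method (2)~(18). Only the two inner products g_{k-1}^t g_k and
    g_k^t g_k are needed per iteration to update the BB1/BB2 step-sizes."""
    x = x0.copy()
    g = A @ x - b
    gg_old = g @ g                       # g_{k-1}^t g_{k-1}
    g0_norm = np.sqrt(gg_old)
    alpha = gg_old / (g @ (A @ g))       # alpha_0 = alpha_0^SD
    for k in range(max_iter):
        x = x - alpha * g
        g_new = A @ x - b
        cross = g @ g_new                # g_{k-1}^t g_k
        gg_new = g_new @ g_new           # g_k^t g_k
        if np.sqrt(gg_new) <= tol * g0_norm:
            break
        bb1 = alpha * gg_old / (gg_old - cross)
        bb2 = alpha * (gg_old - cross) / (gg_old - 2.0 * cross + gg_new)
        alpha = bb2 if bb2 / bb1 < kappa else bb1   # switch rule (18)
        g, gg_old = g_new, gg_new
    return x, k
```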
Before concluding this section, we would like to give an intuitive explanation to (18).
Recall that when αkBB2/αkBB1 = αk−1MG/αk−1SD ≈ 0, the MG method performs poorly at
the point xk−1 (see Theorem 2.3); although we generally have αk−1 ≠ αk−1MG in the ABB
method, there must be little reduction in ‖g(x)‖2 at the previous iteration, since it is
the MG step-size that minimizes ‖g(x)‖2 along the gradient descent direction. Thus
we can interpret (18) as follows: If the previous iterate xk−1 is a bad point for the MG
method, so that there is little reduction in ‖g(x)‖2, choose the smaller step-size
αkBB2; otherwise, choose the larger step-size αkBB1. This explanation reminds us of the
trust-region approach, which uses a somewhat similar strategy while choosing the trust-
region radius. In addition, considering the good performance of the ABB method, it
seems reasonable to conjecture that the smaller step-size used in ABB might be good
at inducing a favorable descent direction, while the larger step-size might be good at
producing a sufficient reduction.

3. Convergence rate analyses

We first analyze the convergence rate of the ASD method. Let x∗ be the unique minimizer
of the quadratic f(x) in (1) and define ‖x‖_A = √(x^t A x). Then we have the following
Q-linear convergence result for the ASD method.

Theorem 3.1. Consider the minimization problem (1). Let {xk } be the sequence gen-
erated by the ASD method (2)∼(16). Define

$$s := \min\{\kappa, 1 - \kappa\} \quad \text{and} \quad c := \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}, \qquad (19)$$

where λ1 and λn are the smallest and largest eigenvalues of A, respectively. Then it holds
for all k ≥ 0 that

$$\|x_{k+1} - x^*\|_A < \sqrt{c^2 + (1 - c^2)(1 - s)^2}\; \|x_k - x^*\|_A. \qquad (20)$$

Thus the ASD method is Q-linearly convergent.



Proof: If αkMG/αkSD ≤ κ, we have by δ ∈ (0, 1) that

$$\alpha_k^{ASD} = \alpha_k^{SD} - \delta\alpha_k^{MG} > \alpha_k^{SD} - \alpha_k^{MG} \geq (1 - \kappa)\,\alpha_k^{SD}. \qquad (21)$$

Else if αkMG/αkSD > κ, we have that

$$\alpha_k^{ASD} = \alpha_k^{MG} > \kappa\,\alpha_k^{SD}. \qquad (22)$$

On the other hand, it always holds that

$$\alpha_k^{ASD} \leq \alpha_k^{SD}. \qquad (23)$$

Let θk = αkASD/αkSD. Then we have the following relation by (21)–(23):

$$s < \theta_k \leq 1. \qquad (24)$$

To simplify notation, define ek = xk − x∗ and Ek = ‖ek‖_A². It is easy to see that

$$g_k = A e_k \qquad (25)$$

and

$$e_{k+1} = e_k - \theta_k \alpha_k^{SD} g_k. \qquad (26)$$

Then we have by (25), (26) and (3) that

$$\frac{E_k - E_{k+1}}{E_k} = \frac{e_k^t A e_k - e_{k+1}^t A e_{k+1}}{e_k^t A e_k} = \frac{2\theta_k \alpha_k^{SD} g_k^t A e_k - \theta_k^2 \bigl(\alpha_k^{SD}\bigr)^2 g_k^t A g_k}{g_k^t A^{-1} g_k} = \frac{\bigl(2\theta_k - \theta_k^2\bigr)\bigl(g_k^t g_k\bigr)^2}{\bigl(g_k^t A g_k\bigr)\bigl(g_k^t A^{-1} g_k\bigr)}. \qquad (27)$$

Noting that 2θk − θk² = 1 − (1 − θk)², we have by (24) that

$$2\theta_k - \theta_k^2 > 1 - (1 - s)^2. \qquad (28)$$

And by the Kantorovich inequality and (19), we have that

$$\frac{\bigl(g_k^t g_k\bigr)^2}{\bigl(g_k^t A g_k\bigr)\bigl(g_k^t A^{-1} g_k\bigr)} \geq \frac{4\lambda_1\lambda_n}{(\lambda_1 + \lambda_n)^2} = 1 - c^2. \qquad (29)$$

Applying (28) and (29) to (27), we obtain that

$$\frac{E_k - E_{k+1}}{E_k} > \left[1 - (1 - s)^2\right](1 - c^2),$$

or equivalently,

$$E_{k+1} < \left\{1 - \left[1 - (1 - s)^2\right](1 - c^2)\right\} E_k = \left[c^2 + (1 - c^2)(1 - s)^2\right] E_k. \qquad (30)$$

As Ek = ‖xk − x∗‖_A², the relation (30) is (20) in disguise. Notice also that

$$c^2 + (1 - c^2)(1 - s)^2 < c^2 + (1 - c^2) = 1.$$

Then it follows from (20) that the ASD method is Q-linearly convergent.
Observe that ‖xk − x∗‖_A² = 2(f(xk) − f(x∗)). Thus we have also proved that the
ASD method is a monotone algorithm.
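As a small self-contained check of the contraction (20) (our own experiment; the random test matrix and the fixed 100-iteration budget are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, kappa, delta = 50, 0.5, 0.5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                         # SPD test matrix
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

lam = np.linalg.eigvalsh(A)                     # ascending eigenvalues
s = min(kappa, 1.0 - kappa)
c = (lam[-1] - lam[0]) / (lam[-1] + lam[0])
bound = np.sqrt(c**2 + (1.0 - c**2) * (1.0 - s)**2)

a_norm = lambda v: np.sqrt(v @ (A @ v))
x = np.zeros(n)
ratios = []
for _ in range(100):
    g = A @ x - b
    Ag = A @ g
    a_sd = (g @ g) / (g @ Ag)
    a_mg = (g @ Ag) / (Ag @ Ag)
    alpha = a_mg if a_mg / a_sd > kappa else a_sd - delta * a_mg   # scheme (16)
    e_old = a_norm(x - x_star)
    x = x - alpha * g
    ratios.append(a_norm(x - x_star) / e_old)

print(max(ratios) < bound)                      # True: each step satisfies (20)
```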
In addition, we can observe that (20) is worse than the convergence relation of the
SD method, which is ‖xk+1 − x∗‖_A ≤ c ‖xk − x∗‖_A. Nonetheless, extensive numerical
experiments show that the ASD method significantly outperforms the SD method. We
think the explanation for this is as follows. As shown by (23), the ASD method always
takes a shortened SD step-size. Therefore, it may improve xk quite slowly at some steps.
On the other hand, it is the small improvements at some steps that bring out suitable
descent directions and hence speed up the overall convergence of the method. (See
figure 1 for a geometric interpretation.)
Now we consider the convergence of the ABB method. Note that this method can
be regarded as a particular member of the GMR (7)∼(8) for which ν(k) = k − 1 and
ρ(k) ∈ {1, 2}. Recently, the third author of this paper has proved that the GMR is
R-linearly convergent for quadratics of any dimension (see [9], Corollary 4.2). Therefore, we
have the following R-linear convergence result for the ABB method.

Theorem 3.2. Consider the minimization problem (1). Let {xk} be the sequence generated by the ABB method (2)∼(18). Then either gk = 0 for some finite k, or the sequence {‖gk‖2} converges to zero R-linearly.

4. Numerical experiments

In this section, we compare the ASD and ABB methods with the CG, BB, AS and
AM methods on some typical test problems. Here BB just denotes the method (2)∼(5),
which appears preferable to the version (2)∼(6) in practice. Unless otherwise noted, the
parameters in ASD and ABB are set to be κ, δ = 0.5. All experiments were run on a
2.6 GHz Pentium IV with 512 MB of RAM, using double precision Fortran 90. The
initial guess was always chosen as x0 = 0 ∈ R^n, and the stopping criterion was chosen
as

$$\|g_k\|_2 \leq \theta\, \|g_0\|_2$$

with different values of θ .



We now describe three different test problems and report their corresponding numer-
ical results.

Example 1. (Random problems [14]) Consider the matrix A = Q D Q t , where

   
$$Q = \left(I - 2 w_3 w_3^t\right)\left(I - 2 w_2 w_2^t\right)\left(I - 2 w_1 w_1^t\right),$$

and w1, w2 and w3 are random unitary vectors. D = diag(σ1, . . . , σn) is a diagonal
matrix where σ1 = 1, σn = cond, and σi is randomly generated between 1 and cond for
i = 2, . . . , n − 1. The entries of the right-hand-side b are randomly generated between
−10 and 10. In this experiment, we used n = 5000 and allowed a maximum of 10000
iterations.
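One possible NumPy construction of this test problem (our sketch; the distributions of the wi and of the interior σi are not fully specified in the excerpt, so uniform sampling is an assumption, and the matrix is stored densely for simplicity):

```python
import numpy as np

def random_problem(n=1000, cond=1e4, seed=0):
    """Random test problem A = Q D Q^t with Q a product of three Householder
    reflectors and D = diag(sigma_1, ..., sigma_n), sigma_1 = 1, sigma_n = cond.
    Dense storage for simplicity (the paper uses n = 5000)."""
    rng = np.random.default_rng(seed)
    sigma = np.empty(n)
    sigma[0], sigma[-1] = 1.0, cond
    sigma[1:-1] = rng.uniform(1.0, cond, n - 2)   # assumed uniform sampling

    A = np.diag(sigma)
    for _ in range(3):                            # A <- (I - 2ww^t) A (I - 2ww^t)
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)                    # unit vector => reflector is orthogonal
        A = A - 2.0 * np.outer(w, w @ A)          # left multiplication
        A = A - 2.0 * np.outer(A @ w, w)          # right multiplication
    b = rng.uniform(-10.0, 10.0, n)
    return A, b
```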
In Table 1 we report the numbers of iterations required by different methods. The
linear CG method is undoubtedly the most efficient approach. However, it seems not
so appealing when a low precision is required. The ASD and ABB methods are both
less efficient than the CG method. But they generally need fewer iterations than the
BB, AS and AM methods. Particularly, when the condition number is very large and
a high precision is required, the ABB method clearly outperforms other gradient
approaches.
It is very interesting to observe that, when cond = 10^5 and cond = 10^6, as the
precision requirement increases, the ABB method requires only a few extra iterations.
Similar phenomena, although not so obvious, can be observed in the results of the
ASD and AS methods. These phenomena confirm our original expectation that the new
methods might exhibit certain superlinear behavior when n is very large.
To investigate the influence of the parameter κ on the performances of the new meth-
ods, we selected 80 different κ’s, which range from 0.01 to 0.80. For each κ and each
method, we tested 1000 random problems with cond = 5000. Depicted in figure 3 are
average numbers of iterations used by different methods to obtain ‖gk‖2 ≤ 10^−5 ‖g0‖2.
We observe that the ABB method outperforms the other methods for all κ's, and the best
choice of κ seems to be 0.1 ≤ κ ≤ 0.2. For the ASD method, different choices
of κ make a great difference. However, it seems better to set κ slightly bigger than
0.5.

Example 2. (3D Laplace’s equation [13]) In the second experiment, we tested a large-
scale real problem, which is based on a 7-point finite difference approximation to 3D
Laplace’s equation on a unitary cube. Define the matrices

$$T = \begin{bmatrix} 6 & -1 & & & \\ -1 & 6 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 6 & -1 \\ & & & -1 & 6 \end{bmatrix}, \qquad W = \begin{bmatrix} T & -I & & & \\ -I & T & -I & & \\ & \ddots & \ddots & \ddots & \\ & & -I & T & -I \\ & & & -I & T \end{bmatrix}$$

Table 1. Numbers of iterations required by different methods for solving random problems.

cond θ CG BB AS AM ASD ABB

10^2 10^−1 10 17 18 10 9 15
10^2 10^−2 22 28 30 24 23 34
10^2 10^−3 33 47 50 40 41 42
10^2 10^−4 45 82 54 59 61 54
10^2 10^−5 57 93 84 72 88 62
10^3 10^−1 26 17 18 12 13 15
10^3 10^−2 63 94 78 98 73 64
10^3 10^−3 101 187 136 196 144 119
10^3 10^−4 138 294 210 332 194 186
10^3 10^−5 169 318 262 470 213 219
10^4 10^−1 63 19 18 16 16 15
10^4 10^−2 174 236 216 298 201 175
10^4 10^−3 240 408 466 1206 315 305
10^4 10^−4 310 667 564 1688 470 494
10^4 10^−5 360 835 720 2049 648 629
10^5 10^−1 213 28 18 16 16 15
10^5 10^−2 289 772 640 1423 409 540
10^5 10^−3 350 1766 1376 1426 1324 1491
10^5 10^−4 388 2625 1576 6458 2553 1586
10^5 10^−5 461 3609 3376 9412 2826 1721
10^6 10^−1 291 28 18 16 16 15
10^6 10^−2 356 2552 3550 8686 1783 986
10^6 10^−3 391 4304 4428 >10000 3353 989
10^6 10^−4 465 8636 4428 >10000 3530 1016
10^6 10^−5 497 >10000 4474 >10000 5351 1042
CG: conjugate gradient method; BB: Barzilai-Borwein method; AS: alternate step gradient
method; AM: alternate minimization gradient method; ASD: adaptive steepest descent
method; ABB: adaptive Barzilai-Borwein method.

and
$$A = \begin{bmatrix} W & -I & & & \\ -I & W & -I & & \\ & \ddots & \ddots & \ddots & \\ & & -I & W & -I \\ & & & -I & W \end{bmatrix}, \qquad (31)$$

where T is m × m, and W and A are block m × m. Here m is the number of interior nodes
in each coordinate direction. Obviously, the problem has m^3 variables.

Figure 3. Average numbers of iterations for solving 1000 random problems (average number of iterations versus κ for the AM, ASD, AS, BB and ABB methods).

We denote the solution by u∗ and fix it to be the function

$$u(x, y, z) = x(x - 1)\, y(y - 1)\, z(z - 1) \times \exp\left(-\frac{\sigma^2\left((x - \alpha)^2 + (y - \beta)^2 + (z - \gamma)^2\right)}{2}\right) \qquad (32)$$

evaluated at the nodal points. This function is a Gaussian centered at (α, β, γ) and
multiplied by x(x − 1)y(y − 1)z(z − 1), which gives u = 0 on the boundary. The parameter
σ controls the decay rate of the Gaussian. We calculate the right-hand side b = Au∗ and
denote the resulting problem by L1. We stop the iteration when ‖gk‖2 ≤ 10^−6 ‖g0‖2
and choose the parameters in two different ways, i.e.

(a) σ = 20, α = β = γ = 0.5;


(b) σ = 50, α = 0.4, β = 0.7, γ = 0.5.
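For concreteness, problem L1 can be assembled with SciPy sparse Kronecker products roughly as follows (our sketch, not the authors' Fortran code; the node ordering and helper names are assumptions):

```python
import numpy as np
import scipy.sparse as sp

def laplace3d_problem(m, sigma=20.0, center=(0.5, 0.5, 0.5)):
    """Assemble the 7-point Laplacian (31) for m interior nodes per direction on the
    unit cube, plus b = A u* with u* given by (32) at the nodal points."""
    I = sp.identity(m, format="csr")
    T1 = sp.diags([-np.ones(m - 1), 2.0 * np.ones(m), -np.ones(m - 1)],
                  offsets=[-1, 0, 1], format="csr")      # 1-D second-difference block
    # The Kronecker sum of three 1-D operators reproduces the 6 / -1 stencil of T, W, A.
    A = (sp.kron(sp.kron(T1, I), I)
         + sp.kron(sp.kron(I, T1), I)
         + sp.kron(sp.kron(I, I), T1)).tocsr()

    h = 1.0 / (m + 1)
    t = h * np.arange(1, m + 1)                          # interior nodal coordinates
    x, y, z = np.meshgrid(t, t, t, indexing="ij")
    alpha, beta, gamma = center
    u_star = (x * (x - 1) * y * (y - 1) * z * (z - 1)
              * np.exp(-0.5 * sigma**2 * ((x - alpha)**2
                                          + (y - beta)**2
                                          + (z - gamma)**2))).ravel()
    return A, A @ u_star                                 # the matrix and b = A u*
```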

The results for solving L1 are listed in Table 2. We observe that the linear CG method
is still the outright winner, and the ASD and ABB methods still outperform the BB, AS
and AM methods in most cases, although the differences in the numbers of iterations
are not very great. The ABB method seems to be the best gradient method. It requires the
fewest iterations in 13 tests.

Table 2. Numbers of iterations required by different methods for solving problem L1.

n Problem CG BB AS AM ASD ABB

100^3 L1(a) 189 505 690 1282 413 392
      L1(b) 273 569 406 946 542 329
110^3 L1(a) 208 709 672 1434 561 558
      L1(b) 301 631 564 962 484 435
120^3 L1(a) 227 676 872 1540 597 543
      L1(b) 328 463 560 1432 494 452
130^3 L1(a) 246 670 696 1376 758 612
      L1(b) 356 532 546 978 587 411
140^3 L1(a) 265 930 796 1638 958 684
      L1(b) 384 770 612 1242 491 635
150^3 L1(a) 284 677 562 2236 934 438
      L1(b) 412 690 646 1490 638 646
160^3 L1(a) 303 1210 908 2144 789 816
      L1(b) 440 790 612 1584 715 605
170^3 L1(a) 322 1093 1000 2638 740 819
      L1(b) 468 864 844 2010 696 681
180^3 L1(a) 341 1159 868 2011 903 590
      L1(b) 496 945 946 2458 836 847
CG: conjugate gradient method; BB: Barzilai-Borwein method; AS: alternate step
gradient method; AM: alternate minimization gradient method; ASD: adaptive steep-
est descent method; ABB: adaptive Barzilai-Borwein method.

Example 3. (A non-quadratic problem [13]) To obtain some information about the ABB
method for general functions, a non-quadratic test problem is derived from Example 2,
in which the objective function is

$$\frac{1}{2} u^t A u - b^t u + \frac{1}{4} h^2 \sum_{ijk} u_{ijk}^4. \qquad (33)$$

This problem is referred to as L2. The matrix A is defined as in (31), and the vector b is
chosen so that the minimizer of (33) is function (32) evaluated at the nodal points. We
stop the iteration when ‖gk‖2 ≤ 10^−5 ‖g0‖2.
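A sketch of the objective (33) and its gradient (ours; we take h to be the mesh size and derive b from the first-order optimality condition at u∗, which is how we read the sentence above):

```python
import numpy as np

def make_L2(A, u_star, h):
    """Problem L2: f(u) = 0.5 u^t A u - b^t u + 0.25 h^2 * sum(u_ijk^4), cf. (33).
    b is derived from grad f(u*) = 0, i.e. b = A u* + h^2 u*^3, so that u* is the
    minimizer (f is strictly convex since A is SPD and the quartic term is convex)."""
    b = A @ u_star + h**2 * u_star**3

    def f(u):
        return 0.5 * u @ (A @ u) - b @ u + 0.25 * h**2 * np.sum(u**4)

    def grad(u):
        return A @ u - b + h**2 * u**3

    return f, grad
```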
In Table 3, we report the time and numbers of evaluations required by the Polak-
Ribière CG (PR-CG), unmodified BB and ABB methods. The columns headed #ls, #f
and #g give the numbers of line searches, function evaluations and gradient evaluations
to solve the problem. Note that the unmodified BB and ABB methods require no line
searches and function evaluations. In the PR-CG method, we require the step-size to
satisfy the standard strong Wolfe conditions with a relative slope tolerance of 0.1. From
Table 3, we can observe that the unmodified BB and ABB methods greatly improve
upon the PR-CG method, and the ABB method usually gives the best performance.

Table 3. Time (seconds) and numbers of evaluations required by different methods for solving problem L2.

PR-CG BB ABB

n Problem Time #ls-#f-#g Time #g Time #g

100^3 L2(a) 104.7 262-530-280 41.2 601 25.8 380
      L2(b) 97.7 227-450-229 32.8 412 29.3 358
110^3 L2(a) 195.1 365-740-384 35.2 389 32.2 345
      L2(b) 145.2 250-496-257 46.3 439 34.8 303
120^3 L2(a) 244.1 350-712-372 57.1 473 48.3 399
      L2(b) 204.9 273-539-277 69.8 508 44.5 317
130^3 L2(a) 291.9 328-666-350 81.2 536 66.8 439
      L2(b) 281.1 295-583-300 88.7 502 75.2 421
140^3 L2(a) 505.5 449-913-475 132.8 695 127.9 675
      L2(b) 378.4 319-630-323 104.3 468 89.9 400
150^3 L2(a) 531.0 390-793-413 128.4 556 106.0 441
      L2(b) 500.9 341-677-348 122.3 444 118.3 428
160^3 L2(a) 667.6 404-821-430 167.2 593 143.9 512
      L2(b) 650.2 365-724-372 180.6 550 197.1 592
170^3 L2(a) 845.4 431-876-457 311.1 921 308.1 911
      L2(b) 825.8 387-772-398 287.5 734 252.0 631
180^3 L2(a) 771.2 323-675-362 418.4 1031 334.0 833
      L2(b) 1069.1 422-843-434 316.2 677 240.9 509
PR-CG: Polak-Ribière conjugate gradient method; BB: Barzilai-Borwein method; ABB: adap-
tive Barzilai-Borwein method.

These results suggest that the ABB method might surpass the PR-CG and BB methods
while solving unconstrained optimization problems.

5. Concluding remarks and discussions

In this paper, by simulating the numerical behavior of the BB method for two-dimensional
quadratics, we have introduced two gradient methods with adaptive step-sizes, namely,
ASD and ABB. The ASD method combines the SD and MG methods in a way that avoids
both of their drawbacks, while the ABB method uses a trust-region-like strategy to
choose its step-size from the two alternatives of the original BB method.
Based on our numerical experiments, we conclude that the ASD and ABB methods are
comparable and in general preferable to the BB, AS and AM methods. Particularly, the
ABB method seems to be a good option if the coefficient matrix is very ill-conditioned and
a high precision is required. We also report that the new algorithms outperform the linear
CG method when a low precision is required; and for a specific non-quadratic function,
the ABB method clearly surpasses the PR-CG method as well as the unmodified BB
method.
In view of the remarkable performance of the ASD method and the fact that it is
a monotone algorithm, we should expect that this method would be useful in some
circumstances. For example, if the dimension of the problem is very large and hence
only a few iterations can be carried out, then the ASD method might be a very good option.
The ABB method is not a monotone algorithm but it also has an obvious advantage. This
method itself requires no line searches for general functions and therefore might be able
to save a lot of computational work when solving unconstrained optimization problems.
To ensure the global convergence, one could combine the ABB method with the non-
monotone line search of Grippo et al. [15] or a new type of non-monotone line search
recently proposed by Zhang and Hager [25].
Finally, note that in our experiments, the parameter κ is usually chosen as 0.5. How-
ever, extensive numerical experiments show that taking a different κ sometimes may
lead to better results. According to our experience, it is better to set κ slightly bigger
than 0.5 in the ASD method, and smaller than 0.5 in the ABB method. In fact, the choice
of κ is a tradeoff between a large step-size and a small one. We would like to choose the
large step-size to give a substantial reduction, but at the same time, we might also need
the small one to induce a favorable descent direction for the next iteration. A suitable
κ, therefore, should ensure that both the large and small step-sizes will be frequently
adopted.

Acknowledgments

The authors wish to thank two anonymous referees for their valuable comments and
suggestions. This work was supported by the Chinese NSF grants 10171104, 10571171
and 40233029. The third author was also supported by the Alexander-von-Humboldt
Foundation.

References

1. H. Akaike, “On a successive transformation of probability distribution and its application to the analysis
of the optimum gradient method,” Ann. Inst. Statist. Math., Tokyo, vol. 11, pp. 1–17, 1959.
2. R. Andreani, E.G. Birgin, J.M. Martı́nez, and J. Yuan, “Spectral projected gradient and variable metric
methods for optimization with linear inequalities,” IMA J. Numer. Anal., vol. 25, pp. 221–252, 2005.
3. J. Barzilai and J.M. Borwein, “Two-point step size gradient methods,” IMA J. Numer. Anal., vol. 8, pp.
141–148, 1988.
4. L. Bello and M. Raydan, “Preconditioned spectral projected gradient methods on convex sets,” Journal
of Computational Mathematics, vol. 23, pp. 225–232, 2005.
5. E.G. Birgin, J.M. Martı́nez, and M. Raydan, “Nonmonotone spectral projected gradient methods on convex
sets,” SIAM J. Optim., vol. 10, pp. 1196–1211, 2000.
6. E.G. Birgin, J.M. Martı́nez, and M. Raydan, “Algorithm 813: SPG—software for convex-constrained
optimization,” ACM Trans. Math. Software, vol. 27, pp. 340–349, 2001.
7. E.G. Birgin, J.M. Martı́nez, and M. Raydan, “Inexact spectral projected gradient methods on convex sets,”
IMA J. Numer. Anal., vol. 23, pp. 539–559, 2003.
8. A. Cauchy, “Méthode générale pour la résolution des systèmes d’équations simultanées,” Comp. Rend.
Sci. Paris, vol. 25, pp. 536–538, 1847.
9. Y.H. Dai, “Alternate step gradient method,” Optimization, vol. 52, pp. 395–415, 2003.
10. Y.H. Dai and L.Z. Liao, “R-linear convergence of the Barzilai and Borwein gradient method,” IMA J.
Numer. Anal., vol. 22, pp. 1–10, 2002.
11. Y.H. Dai and Y. Yuan, “Alternate minimization gradient method,” IMA J. Numer. Anal., vol. 23, pp.
377–393, 2003.

12. R. Fletcher, “Low storage methods for unconstrained optimization,” Lectures in Applied Mathematics
(AMS), vol. 26, pp. 165–179, 1990.
13. R. Fletcher, “On the Barzilai-Borwein method,” Numerical Analysis Report NA/207, Department of
Mathematics, University of Dundee, 2001.
14. A. Friedlander, J.M. Martı́nez, B. Molina, and M. Raydan, “Gradient method with retards and general-
izations,” SIAM J. Numer. Anal., vol. 36, pp. 275–289, 1999.
15. L. Grippo, F. Lampariello, and S. Lucidi, “A nonmonotone line search technique for Newton’s method,”
SIAM J. Numer. Anal., vol. 23, pp. 707–716, 1986.
16. M.R. Hestenes and E.L. Stiefel, “Methods of conjugate gradients for solving linear systems,” J. Research
National Bureau of Standards, vol. B49, pp. 409–436, 1952.
17. W. La Cruz, J.M. Martı́nez, and M. Raydan, “Spectral residual method without gradient information for
solving large-scale nonlinear systems of equations,” Mathematics of Computation, to appear.
18. W. La Cruz and M. Raydan, “Nonmonotone spectral methods for large-scale nonlinear systems,” Opti-
mization Methods and Software, vol. 18, pp. 583–599, 2003.
19. J.-L. Lamotte, B. Molina, and M. Raydan, “Smooth and adaptive gradient method with retards,” Mathe-
matical and Computer Modelling, vol. 36, pp. 1161–1168, 2002.
20. F. Luengo and M. Raydan, “Gradient method with dynamical retards for large-scale optimization prob-
lems,” Electronic Transactions on Numerical Analysis, vol. 16, pp. 186–193, 2003.
21. M. Raydan, “On the Barzilai and Borwein choice of steplength for the gradient method,” IMA J. Numer.
Anal., vol. 13, pp. 321–326, 1993.
22. M. Raydan, “The Barzilai and Borwein method for the large scale unconstrained minimization problem,”
SIAM J. Optim., vol. 7, pp. 26–33, 1997.
23. M. Raydan and B.F. Svaiter, “Relaxed steepest descent and Cauchy-Barzilai-Borwein method,” Compu-
tational Optimization and Applications, vol. 21, pp. 155–167, 2002.
24. T. Serafini, G. Zanghirati, and L. Zanni, “Gradient projection methods for quadratic programs and appli-
cations in training support vector machines,” Tech. Rep. 48, University of Modena and Reggio Emilia,
Italy, 2003.
25. H. Zhang and W.W. Hager, “A nonmonotone line search technique and its application to unconstrained
optimization,” SIAM J. Optim., vol. 14, pp. 1043–1056, 2004.
