Gradient Methods With Adaptive Step-Sizes
Abstract. Motivated by the superlinear behavior of the Barzilai-Borwein (BB) method for two-dimensional
quadratics, we propose two gradient methods which adaptively choose a small step-size or a large step-size at
each iteration. The small step-size is primarily used to induce a favorable descent direction for the next iteration,
while the large step-size is primarily used to produce a sufficient reduction. Although the new algorithms are
still linearly convergent in the quadratic case, numerical experiments on some typical test problems indicate
that they compare favorably with the BB method and some other efficient gradient methods.
Keywords: linear system, gradient method, adaptive step-size, Barzilai-Borwein method, superlinear behavior, trust-region approach
1. Introduction
Consider the minimization of a strictly convex quadratic,

f(x) = (1/2) x^t A x − b^t x,   (1)

where A ∈ R^{n×n} is symmetric positive definite and b ∈ R^n. A gradient method generates iterates by

x_{k+1} = x_k − α_k g_k,   (2)

where g_k = A x_k − b is the gradient at x_k and α_k > 0 is the step-size. The classical steepest descent (SD) method takes the exact line-search step-size

α_k^SD = g_k^t g_k / g_k^t A g_k,   (3)

which minimizes the function value f(x) along the ray {x_k − αg_k : α > 0}.
Another straightforward line search is to minimize the gradient norm ‖g(x)‖_2 along the ray {x_k − αg_k : α > 0}; we name the associated algorithm the "minimal gradient (MG) method" for convenience. Trivial deductions (see [11]) yield

α_k^MG = g_k^t A g_k / g_k^t A^2 g_k.   (4)
The Barzilai-Borwein (BB) method [3] instead uses one of the two step-sizes

α_k^BB1 = s_{k−1}^t s_{k−1} / s_{k−1}^t y_{k−1},   (5)

α_k^BB2 = s_{k−1}^t y_{k−1} / y_{k−1}^t y_{k−1},   (6)

where s_{k−1} = x_k − x_{k−1} and y_{k−1} = g_k − g_{k−1}. These two step-sizes minimize ‖α^{−1} s_{k−1} − y_{k−1}‖_2 and ‖s_{k−1} − α y_{k−1}‖_2 respectively, providing a scalar approximation to each of the secant equations B_k s_{k−1} = y_{k−1} and H_k y_{k−1} = s_{k−1}.
The BB method ((2)∼(5) or (2)∼(6)) performs much better than the SD method in
practice (see also [12]). In particular, when n = 2, it converges R-superlinearly to the global minimizer [3]. In any dimension, it is still globally convergent [21], but the convergence is only R-linear [10]. Raydan [22] adapted the method to unconstrained optimization by incorporating the non-monotone line search of Grippo et al. [15]. Since then, the
BB method has been successfully extended to many fields such as convex constrained
optimization [2, 4–7], nonlinear systems [17, 18], support vector machines [24], etc.
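To make the above concrete, the following short Python/NumPy sketch (ours, not taken from the paper) applies iteration (2) with the BB step-size (5) to a randomly generated strictly convex quadratic; the first step-size, for which no previous pair (s, y) is available, is taken to be the SD step, which is one common convention.

import numpy as np

def bb_method(A, b, x0, tol=1e-6, max_iter=1000):
    """BB method for f(x) = 0.5*x'Ax - b'x with A symmetric positive definite,
    using the step-size alpha^BB1 of (5)."""
    x = x0.copy()
    g = A @ x - b                          # gradient g_k = A x_k - b
    g0_norm = np.linalg.norm(g)
    alpha = (g @ g) / (g @ (A @ g))        # alpha_0: SD step (no previous point yet)
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol * g0_norm:
            return x, k
        x_new = x - alpha * g              # iteration (2)
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g        # s_{k-1}, y_{k-1}
        alpha = (s @ s) / (s @ y)          # alpha^BB1 of (5)
        x, g = x_new, g_new
    return x, max_iter

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50.0 * np.eye(50)            # symmetric positive definite test matrix
b = rng.standard_normal(50)
x, iters = bb_method(A, b, np.zeros(50))
print(iters, np.linalg.norm(A @ x - b))

For a symmetric positive definite A the curvature s^t y is always positive, so no safeguard is needed in the quadratic case.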
Inspired by the BB method, Friedlander et al. [14] introduced a family of gradient
methods with retards (GMR). Given positive integers m and qi (i = 1, . . . , m), the GMR
sets the step-size as
α_k^GMR = g_{ν(k)}^t A^{ρ(k)−1} g_{ν(k)} / g_{ν(k)}^t A^{ρ(k)} g_{ν(k)},   (7)

where

ν(k) ∈ {k, k − 1, . . . , max{0, k − m}}   and   ρ(k) ∈ {q_1, . . . , q_m}.   (8)
It is clear that the step-sizes (3)–(6) are all special cases of (7). Raydan and Svaiter [23] and Dai [9] investigated a particular member of the GMR in which the SD and BB1 step-sizes are used by turns. This algorithm, known as the alternate step (AS) gradient method in [9], has proved to be efficient in practice.
Barzilai and Borwein [3] proved that the BB method converges R-superlinearly for
two-dimensional quadratics, which can be illustrated by the following numerical phe-
nomenon: As k increases, the gradient gk generated by the BB method tends to be an
eigenvector of the matrix A. Unfortunately, this phenomenon does not evidently carry over to the n-dimensional case, where only R-linear convergence is obtained [10]. This suggests the following heuristic idea: assume the eigenvalues of A can be roughly classified into large ones and small ones, as if there were only two distinct eigenvalues, and suppose we can frequently make g_k approach some large or small eigenvectors of A (by which we mean eigenvectors associated with the large or small eigenvalues); then the resulting algorithm might exhibit certain superlinear behavior and outperform the BB method when n is very large.
In order to realize this idea, we choose to modify the SD method. Let us therefore revisit this classical approach for further inspiration. Without loss of generality, we assume
that A is a diagonal matrix in our derivation. The following theorem illustrates the
worst-case behavior of the SD method in two dimensions. (In any dimension, the worst-
case behaviors of the SD and MG methods are only related to the smallest and largest
eigenvalues of A. Hence we just need to consider the two-dimensional case.)
Theorem 2.1. Consider the minimization problem (1), where A = diag(λ_1, λ_2) with λ_2 ≥ λ_1 > 0. Let x* = A^{−1}b, and let {x_k} be the sequence generated by the SD method (2)–(3). If g_0 ∥ (1, ±1)^t, then the SD method attains its slowest convergence rate; more precisely,

[f(x_{k+1}) − f(x*)] / [f(x_k) − f(x*)] = ((λ_2 − λ_1)/(λ_2 + λ_1))^2   for all k ≥ 0.
Proof: To prove this theorem, it suffices to show that if g_0 ∥ (1, ±1)^t, then we must have

g_1 ∥ (1, ∓1)^t   and   [f(x_1) − f(x*)] / [f(x_0) − f(x*)] = ((λ_2 − λ_1)/(λ_2 + λ_1))^2.   (9)

Suppose g_0 = u(1, ±1)^t with u ≠ 0. Then

α_0^SD = 2u^2 / ((λ_2 + λ_1)u^2) = 2/(λ_2 + λ_1),   (10)

and hence g_1 = (I − α_0^SD A) g_0 satisfies

g_1 = [(λ_2 − λ_1)/(λ_2 + λ_1)] u (1, ∓1)^t.   (11)

For the second part of (9), we have by Taylor's theorem and ∇f(x*) = 0 that

f(x_k) − f(x*) = (x_k − x*)^t ∇f(x*) + (x_k − x*)^t A (x_k − x*) / 2
             = (Ax_k − Ax*)^t A^{−1} (Ax_k − Ax*) / 2
             = (Ax_k − b)^t A^{−1} (Ax_k − b) / 2 = g_k^t A^{−1} g_k / 2,

and substituting (11) gives the stated ratio. Thus we have by induction that

[f(x_{k+1}) − f(x*)] / [f(x_k) − f(x*)] = ((λ_2 − λ_1)/(λ_2 + λ_1))^2   for all k ≥ 0.
We can see from Theorem 2.1 that when g_k is about 45° away from every eigenvector of A, the convergence of the SD method deteriorates. Naturally, we hope that the next gradient g_{k+1} will approach some eigenvector of A. But how should the k-th step-size be chosen? The theorem below shows that the MG step-size is a desirable option when the SD step-size is unfavorable.
Theorem 2.2. Consider the minimization problem (1), where A = diag(λ_1, λ_2) with λ_2 ≥ λ_1 > 0. Let {x_k} be the sequence generated by a gradient method. If g_k ∥ (1, ±1)^t, i.e. g_k = (u, ±u)^t with u ≠ 0, then it holds that α_k^MG / α_k^SD > 1/2 and

g_{k+1} = { [(λ_2 − λ_1)/(λ_2 + λ_1)] u (1, ∓1)^t,          if α_k = α_k^SD;
          { [(λ_2 − λ_1)/(λ_2^2 + λ_1^2)] u (λ_2, ∓λ_1)^t,   if α_k = α_k^MG.     (12)

Proof: Since g_k = (u, ±u)^t, direct calculations give

α_k^SD = 2/(λ_1 + λ_2)   and   α_k^MG = (λ_1 + λ_2)/(λ_1^2 + λ_2^2).   (13)

Therefore,

α_k^MG / α_k^SD = (λ_1 + λ_2)^2 / (2(λ_1^2 + λ_2^2)) > 0.5.

Both parts of (12) then follow from g_{k+1} = (I − α_k A) g_k.
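The closed forms (12) and (13) are easy to check numerically; the small script below (ours, purely for illustration) evaluates both step-sizes for a gradient parallel to (1, 1)^t and prints the next gradient produced by each choice.

import numpy as np

lam1, lam2, u = 1.0, 9.0, 3.0
A = np.diag([lam1, lam2])
g = np.array([u, u])                        # g_k parallel to (1, 1)^t

alpha_sd = (g @ g) / (g @ A @ g)            # exact line-search (SD) step-size (3)
alpha_mg = (g @ A @ g) / (g @ A @ A @ g)    # minimal gradient (MG) step-size (4)
print(alpha_sd, 2 / (lam1 + lam2))                    # both equal, as in (13)
print(alpha_mg, (lam1 + lam2) / (lam1**2 + lam2**2))  # both equal, as in (13)

for label, alpha in [("SD", alpha_sd), ("MG", alpha_mg)]:
    g_next = (np.eye(2) - alpha * A) @ g    # g_{k+1} = (I - alpha_k A) g_k
    print(label, g_next)
# SD yields g_{k+1} proportional to (1, -1)^t, while MG yields g_{k+1}
# proportional to (lam2, -lam1)^t, in agreement with (12).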
In view of Theorems 2.1 and 2.2, a natural remedy is to take the MG step-size whenever it is not too much smaller than the SD step-size, namely

α_k = { α_k^MG,   if α_k^MG/α_k^SD > κ;
      { α_k^SD,   otherwise,                    (15)

where κ ∈ (0, 1). One might then ask: does the scheme (15) also avoid the worst-case behavior of the MG method? The answer is yes when the condition number of A is large enough. To make this precise, we first present a theorem illustrating the worst-case behavior of the MG method in two dimensions.
Theorem 2.3. Consider the minimization problem (1), where A = diag(λ_1, λ_2) with λ_2 ≥ λ_1 > 0. Let {x_k} be the sequence generated by the MG method (2)−(4). If g_0 ∥ (√λ_2, ±√λ_1)^t, then it holds for all k ≥ 0 that
(a) the MG method reaches its slowest convergence rate:

‖g_{k+1}‖_2^2 / ‖g_k‖_2^2 = ((λ_2 − λ_1)/(λ_2 + λ_1))^2;

(b) the ratio of the two step-sizes attains its smallest possible value:

α_k^MG / α_k^SD = (g_k^t A g_k)^2 / (g_k^t g_k · g_k^t A^2 g_k) = 4λ_1 λ_2 / (λ_1 + λ_2)^2.
The proof of (a) is quite similar to that of Theorem 2.1, while (b) can be obtained inductively by direct calculations (the second equality herein is an equivalent form of the Kantorovich inequality). Supposing λ_2 ≫ λ_1, we can see that α_k^MG/α_k^SD ≈ 0 and that g_0 is almost parallel to the eigenvector (1, 0)^t. That is to say, the worst-case behavior of the MG method is usually due to the excessively small step-size and not to the descent direction. Scheme (15) sets α_k = α_k^SD when α_k^MG/α_k^SD ≤ κ; therefore it is able to avoid the worst-case behavior of the MG method for very ill-conditioned problems.
In a recent paper, Raydan and Svaiter [23] indicated that the SD method with (over and under) relaxation usually improves upon the original method. This idea was also considered by Dai and Yuan [11], whose two gradient methods with shortened SD steps compare well with the AM and SD methods. Motivated by these works, we modify (15) as follows:

α_k = { α_k^MG,                 if α_k^MG/α_k^SD > κ;
      { α_k^SD − δ α_k^MG,      otherwise,                    (16)

where κ, δ ∈ (0, 1). Notice that when α_k^MG/α_k^SD reaches its smallest value, α_k^SD is the least shortened. As α_k can be viewed as an adaptively under-relaxed SD step-size (recall that α_k^MG ≤ α_k^SD), we call the method (2)−(16) the "adaptive steepest descent (ASD) method". Compared to the SD method, this method requires one extra inner product per iteration.
How the ASD method might overcome the drawbacks of the SD method and simulate the superlinear behavior of the BB method in two dimensions is depicted in figure 1, where x_k^SD and x_k^ASD signify the iterates of the SD and ASD methods, respectively. We can observe that the ASD method (κ, δ = 0.5) refines x_0 less than the SD method at the first iteration, but it induces a gradient inclined to the eigenvector direction (−1, 0)^t, thereby obtaining a considerable descent at the next iteration.

Figure 1. The iterates x_0, x_1^SD, x_2^SD and x_1^ASD, x_2^ASD of the SD and ASD methods on a two-dimensional quadratic.
Before showing by a simple example that the ASD method is still efficient for problems of any dimension, we first give a condition under which we should expect the method to exhibit certain superlinear behavior. Note that our derivation of the ASD method is based entirely on the assumption that the matrix A has two distinct eigenvalues. We assume here that A has two clusters of eigenvalues. Then, in view of the above figure and theorems, we should expect that some of the descent directions generated by the ASD method would incline towards the eigenvectors associated with the eigenvalues of small magnitude.
To illustrate the performance of the ASD method in any dimension, a simple test problem is devised (see Section 4 for more experiments). We let A = diag(λ_1, . . . , λ_100) with λ_1 = 0.1 and λ_i = i for i = 2, . . . , 100. The right-hand side b is set to (1, . . . , 1)^t and the initial guess is chosen as x_0 = 0 ∈ R^100. In order to get ‖g_k‖_2 ≤ 10^{−6} ‖g_0‖_2, the BB method (α_0 = α_0^SD) needs 375 iterations, while the ASD method (κ, δ = 0.5) requires only 302 iterations. Figure 2 reports the gradient norms for the BB and ASD methods at each iteration (the figure also reports the gradient norms for a third method, ABB, which will be defined later). From this figure, we can observe that the ASD method gives the better solution at most iterations. Besides, this method seems to produce smaller oscillations than the BB method. As will be proved in Section 3, the ASD method is in fact a monotone algorithm.
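The experiment just described is straightforward to reproduce. The Python sketch below (ours) implements the step-size rule (16) on the 100-dimensional diagonal problem; iteration counts obtained this way may differ slightly from the numbers quoted above, since they depend on implementation details and floating-point rounding.

import numpy as np

def asd(A, b, x0, kappa=0.5, delta=0.5, tol=1e-6, max_iter=10000):
    """Adaptive steepest descent: gradient iteration (2) with step-size rule (16)."""
    x = x0.copy()
    g = A @ x - b
    g0_norm = np.linalg.norm(g)
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol * g0_norm:
            return k
        Ag = A @ g
        alpha_sd = (g @ g) / (g @ Ag)            # SD step-size (3)
        alpha_mg = (g @ Ag) / (Ag @ Ag)          # MG step-size (4), using A = A^t
        if alpha_mg / alpha_sd > kappa:
            alpha = alpha_mg                     # the MG step is not too small: take it
        else:
            alpha = alpha_sd - delta * alpha_mg  # otherwise take a shortened SD step
        x = x - alpha * g
        g = A @ x - b
    return max_iter

lam = np.arange(1.0, 101.0)
lam[0] = 0.1                                     # eigenvalues 0.1, 2, 3, ..., 100
A = np.diag(lam)
b = np.ones(100)
print("ASD iterations:", asd(A, b, np.zeros(100)))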
Now let us extend (16) and derive an ASD-like variant of the BB method. Looking into the iterations of the ASD method for the above test problem, we find that the sign of α_k^MG/α_k^SD − κ is opposite to that of α_{k−1}^MG/α_{k−1}^SD − κ at most iterations (to be exact, 238 out of 302 iterations). A similar phenomenon is also observed in many other experiments (where κ = 0.5). For this reason, we replace α_k^MG/α_k^SD > κ in (16) with α_{k−1}^MG/α_{k−1}^SD < κ:

α_k = { α_k^MG,                 if α_{k−1}^MG/α_{k−1}^SD < κ;
      { α_k^SD − δ α_k^MG,      otherwise.                    (17)
Figure 2. Residual norms ‖g_k‖_2 of the BB, ASD and ABB methods against the number of iterations for the 100-dimensional test problem.
Since α_{k−1}^SD = α_k^BB1 and α_{k−1}^MG = α_k^BB2, the condition in (17) can be rewritten as α_k^BB2/α_k^BB1 < κ. In order to get a simple scheme, we replace α_k^SD − δ α_k^MG and α_k^MG with α_k^BB1 and α_k^BB2, respectively. So we finally have

α_k = { α_k^BB2,   if α_k^BB2/α_k^BB1 < κ;
      { α_k^BB1,   otherwise,                    (18)
where κ ∈ (0, 1). Here we do not shorten α_k^BB1 because the BB method itself overcomes the drawbacks of the SD method. Analogously, we call the method (2)∼(18) the "adaptive Barzilai-Borwein (ABB) method". For the above test problem, this method (κ = 0.5 and α_0 = α_0^SD) requires only 221 iterations (see also figure 2).
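In code, rule (18) amounts to a single comparison between the two BB step-sizes. The sketch below (ours) uses the direct formulas (5) and (6); as in the experiments above, α_0 is taken to be the SD step-size.

import numpy as np

def abb(A, b, x0, kappa=0.5, tol=1e-6, max_iter=10000):
    """Adaptive Barzilai-Borwein method: iteration (2) with step-size rule (18)."""
    x = x0.copy()
    g = A @ x - b
    g0_norm = np.linalg.norm(g)
    alpha = (g @ g) / (g @ (A @ g))              # alpha_0 = alpha_0^SD
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol * g0_norm:
            return k
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g
        bb1 = (s @ s) / (s @ y)                  # alpha^BB1, cf. (5)
        bb2 = (s @ y) / (y @ y)                  # alpha^BB2, cf. (6)
        alpha = bb2 if bb2 / bb1 < kappa else bb1    # rule (18)
        x, g = x_new, g_new
    return max_iter

lam = np.arange(1.0, 101.0)
lam[0] = 0.1
print("ABB iterations:", abb(np.diag(lam), np.ones(100), np.zeros(100)))

The printed count can be compared directly with the 221 iterations reported above.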
At first glance, the ABB method seems to require one more inner product per iteration than the BB method, but this extra computational work can be eliminated by using the following formulae:

α_k^BB1 = s_{k−1}^t s_{k−1} / s_{k−1}^t y_{k−1} = α_{k−1} g_{k−1}^t g_{k−1} / (g_{k−1}^t g_{k−1} − g_{k−1}^t g_k),

α_k^BB2 = s_{k−1}^t y_{k−1} / y_{k−1}^t y_{k−1} = α_{k−1} (g_{k−1}^t g_{k−1} − g_{k−1}^t g_k) / (g_{k−1}^t g_{k−1} − 2 g_{k−1}^t g_k + g_k^t g_k).
Here we have used s_{k−1} = x_k − x_{k−1} = −α_{k−1} g_{k−1} and y_{k−1} = g_k − g_{k−1}. Notice that at every iteration, we only need to compute two inner products, i.e. g_{k−1}^t g_k and g_k^t g_k. Our
extensive numerical experiments show that these formulae are reliable in the quadratic
case. For non-quadratic functions, there might be negative or zero denominators, which
could also happen while using the original formulae. Raydan [22] proposed a simple
approach to deal with this situation.
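These recurrences are easy to verify in isolation. The sketch below (ours) computes one BB pair both from the direct formulas and from the recurrences; only the two inner products g_{k−1}^t g_k and g_k^t g_k are newly computed, the value g_{k−1}^t g_{k−1} being carried over from the previous iteration.

import numpy as np

rng = np.random.default_rng(1)
n = 30
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                # symmetric positive definite
b = rng.standard_normal(n)

x_prev = rng.standard_normal(n)
g_prev = A @ x_prev - b
alpha_prev = (g_prev @ g_prev) / (g_prev @ (A @ g_prev))  # previous step-size
x = x_prev - alpha_prev * g_prev
g = A @ x - b

# direct formulas (5) and (6)
s, y = x - x_prev, g - g_prev
bb1_direct = (s @ s) / (s @ y)
bb2_direct = (s @ y) / (y @ y)

# recurrence formulas: only g_{k-1}'g_k and g_k'g_k are newly computed
gg_prev = g_prev @ g_prev                  # carried over from the previous iteration
gpg = g_prev @ g
gg = g @ g
bb1_rec = alpha_prev * gg_prev / (gg_prev - gpg)
bb2_rec = alpha_prev * (gg_prev - gpg) / (gg_prev - 2 * gpg + gg)

print(np.isclose(bb1_direct, bb1_rec), np.isclose(bb2_direct, bb2_rec))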
Before concluding this section, we would like to give an intuitive explanation of (18). Recall that when α_k^BB2/α_k^BB1 = α_{k−1}^MG/α_{k−1}^SD ≈ 0, the MG method performs poorly at the point x_{k−1} (see Theorem 2.3); although we generally have α_{k−1} ≠ α_{k−1}^MG in the ABB method, there must be little reduction in ‖g(x)‖_2 at the previous iteration, since it is the MG step-size that minimizes ‖g(x)‖_2 along the gradient descent direction. Thus we can interpret (18) as follows: if the previous iterate x_{k−1} is a bad point for the MG method, so that there is little reduction in ‖g(x)‖_2, choose the smaller step-size α_k^BB2; otherwise, choose the larger step-size α_k^BB1. This explanation reminds us of the
trust-region approach, which uses a somewhat similar strategy while choosing the trust-
region radius. In addition, considering the good performance of the ABB method, it
seems reasonable to conjecture that the smaller step-size used in ABB might be good
at inducing a favorable descent direction, while the larger step-size might be good at
producing a sufficient reduction.
Theorem 3.1. Consider the minimization problem (1). Let {x_k} be the sequence generated by the ASD method (2)∼(16). Define

s := min{κ, 1 − κ}   and   c := (λ_n − λ_1)/(λ_n + λ_1),   (19)

where λ_1 and λ_n are the smallest and largest eigenvalues of A, respectively. Then it holds for all k ≥ 0 that

‖x_{k+1} − x*‖_A < √(c^2 + (1 − c^2)(1 − s)^2) ‖x_k − x*‖_A.   (20)

The proof establishes the bounds

s < θ_k ≤ 1   (24)

and uses the relation

g_k = A e_k,   where e_k := x_k − x*,   (25)

to show that

(E_k − E_{k+1}) / E_k > [1 − (1 − s)^2](1 − c^2),

or equivalently,

E_{k+1} < {1 − [1 − (1 − s)^2](1 − c^2)} E_k = [c^2 + (1 − c^2)(1 − s)^2] E_k.   (30)

As E_k = ‖x_k − x*‖_A^2, the relation (30) is (20) in disguise. Notice also that

c^2 + (1 − c^2)(1 − s)^2 < c^2 + (1 − c^2) = 1.

Then it follows from (20) that the ASD method is Q-linearly convergent.
Observe that ‖x_k − x*‖_A = √(2(f(x_k) − f(x*))). Thus we have also proved that the ASD method is a monotone algorithm.
In addition, we can observe that (20) is worse than the convergence relation of the SD method, which is ‖x_{k+1} − x*‖_A ≤ c ‖x_k − x*‖_A. Nonetheless, extensive numerical experiments show that the ASD method significantly outperforms the SD method. We think the explanation is as follows. As shown by (23), the ASD method always takes a shortened SD step-size; therefore, it may improve x_k quite slowly at some steps. On the other hand, it is precisely these small improvements that bring about suitable descent directions and hence speed up the overall convergence of the method. (See figure 1 for a geometric interpretation.)
Now we consider the convergence of the ABB method. Note that this method can be regarded as a particular member of the GMR (7)∼(8) for which ν(k) = k − 1 and ρ(k) ∈ {1, 2}. Recently, the third author of this paper proved that the GMR is R-linearly convergent for quadratics of any dimension (see [9], Corollary 4.2). Therefore, we have the following R-linear convergence result for the ABB method.
Theorem 3.2. Consider the minimization problem (1). Let {x_k} be the sequence generated by the ABB method (2)∼(18). Then either g_k = 0 for some finite k, or the sequence {‖g_k‖_2} converges to zero R-linearly.
4. Numerical experiments
In this section, we compare the ASD and ABB methods with the CG, BB, AS and AM methods on some typical test problems. Here BB just denotes the method (2)∼(5), which appears preferable to the version (2)∼(6) in practice. Unless otherwise noted, the parameters in ASD and ABB are set to κ, δ = 0.5. All experiments were run on a 2.6 GHz Pentium IV with 512 MB of RAM, using double precision Fortran 90. The initial guess was always chosen as x_0 = 0 ∈ R^n, and the stopping criterion was chosen as

‖g_k‖_2 ≤ θ ‖g_0‖_2.
We now describe three different test problems and report their corresponding numer-
ical results.
Example 1. (Random problems) In the first experiment, we tested randomly generated problems built with the orthogonal matrix

Q = (I − 2w_3 w_3^t)(I − 2w_2 w_2^t)(I − 2w_1 w_1^t),

where w_1, w_2 and w_3 are random unit vectors.
Example 2. (3D Laplace's equation [13]) In the second experiment, we tested a large-scale real problem, which is based on a 7-point finite difference approximation to 3D Laplace's equation on a unit cube. Define the matrices

    ⎡ 6 −1            ⎤          ⎡ T −I            ⎤
    ⎢−1  6 −1         ⎥          ⎢−I  T −I         ⎥
T = ⎢    ⋱  ⋱  ⋱      ⎥,     W = ⎢    ⋱  ⋱  ⋱      ⎥
    ⎢      −1  6 −1   ⎥          ⎢      −I  T −I   ⎥
    ⎣         −1  6   ⎦          ⎣         −I  T   ⎦
Table 1. Numbers of iterations required by different methods for solving random problems.

cond(A)   θ       CG    BB      AS    AM      ASD   ABB
10^2      10^-1   10    17      18    10      9     15
10^2      10^-2   22    28      30    24      23    34
10^2      10^-3   33    47      50    40      41    42
10^2      10^-4   45    82      54    59      61    54
10^2      10^-5   57    93      84    72      88    62
10^3      10^-1   26    17      18    12      13    15
10^3      10^-2   63    94      78    98      73    64
10^3      10^-3   101   187     136   196     144   119
10^3      10^-4   138   294     210   332     194   186
10^3      10^-5   169   318     262   470     213   219
10^4      10^-1   63    19      18    16      16    15
10^4      10^-2   174   236     216   298     201   175
10^4      10^-3   240   408     466   1206    315   305
10^4      10^-4   310   667     564   1688    470   494
10^4      10^-5   360   835     720   2049    648   629
10^5      10^-1   213   28      18    16      16    15
10^5      10^-2   289   772     640   1423    409   540
10^5      10^-3   350   1766    1376  1426    1324  1491
10^5      10^-4   388   2625    1576  6458    2553  1586
10^5      10^-5   461   3609    3376  9412    2826  1721
10^6      10^-1   291   28      18    16      16    15
10^6      10^-2   356   2552    3550  8686    1783  986
10^6      10^-3   391   4304    4428  >10000  3353  989
10^6      10^-4   465   8636    4428  >10000  3530  1016
10^6      10^-5   497   >10000  4474  >10000  5351  1042

CG: conjugate gradient method; BB: Barzilai-Borwein method; AS: alternate step gradient method; AM: alternate minimization gradient method; ASD: adaptive steepest descent method; ABB: adaptive Barzilai-Borwein method.
and

    ⎡ W −I            ⎤
    ⎢−I  W −I         ⎥
A = ⎢    ⋱  ⋱  ⋱      ⎥,                              (31)
    ⎢      −I  W −I   ⎥
    ⎣         −I  W   ⎦
Figure 3. Average numbers of iterations of the BB, AS, AM, ASD and ABB methods for solving 1000 random problems, plotted against the parameter κ.
The exact solution u* is taken to be the function (32) evaluated at the nodal points. This function is a Gaussian centered at (α, β, γ) and multiplied by x(x − 1)y(y − 1)z(z − 1), which gives u = 0 on the boundary. The parameter σ controls the decay rate of the Gaussian. We calculate the right-hand side b = Au* and denote the resulting problem by L1. We stop the iteration when ‖g_k‖_2 ≤ 10^{−6} ‖g_0‖_2, and the parameters (α, β, γ, σ) are chosen in two different ways.
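One convenient way to assemble the matrix (31) is through Kronecker products of the one-dimensional second-difference matrix. The sketch below (ours) does this with dense NumPy arrays on a very small grid, purely to exhibit the block structure; for the grid sizes used in the actual experiments a sparse or matrix-free representation would be needed.

import numpy as np

def lap1d(m):
    """One-dimensional second-difference matrix tridiag(-1, 2, -1)."""
    return 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)

def laplace3d(m):
    """7-point discretization of the 3D Laplacian on an m x m x m interior grid,
    i.e. the matrix A of (31): 6 on the diagonal, -1 for each grid neighbour."""
    L, I = lap1d(m), np.eye(m)
    return (np.kron(np.kron(I, I), L)
            + np.kron(np.kron(I, L), I)
            + np.kron(np.kron(L, I), I))

m = 4                                       # tiny grid, dense matrices for illustration only
A = laplace3d(m)

# the same matrix assembled from the nested blocks T and W of the text
T = lap1d(m) + 4 * np.eye(m)                # T = tridiag(-1, 6, -1)
C = -np.eye(m, k=1) - np.eye(m, k=-1)       # couplings -I to the neighbouring lines/planes
W = np.kron(np.eye(m), T) + np.kron(C, np.eye(m))
A_blocks = np.kron(np.eye(m), W) + np.kron(C, np.eye(m * m))
print(np.array_equal(A, A_blocks))          # True: both constructions give (31)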
The results for solving L1 are listed in Table 2. We observe that the linear CG method is still the outright winner, and that the ASD and ABB methods still outperform the BB, AS and AM methods in most cases, although the differences in the numbers of iterations are not very great. The ABB method seems to be the best gradient method: it requires the fewest iterations in 13 tests.
Table 2. Numbers of iterations required by different methods for solving problem L1.
Example 3. (A non-quadratic problem [13]) To obtain some information about the ABB method for general functions, a non-quadratic test problem is derived from Example 2, in which the objective function is

(1/2) u^t A u − b^t u + (h^2/4) Σ_{ijk} u_{ijk}^4.   (33)

This problem is referred to as L2. The matrix A is defined as in (31), and the vector b is chosen so that the minimizer of (33) is function (32) evaluated at the nodal points. We stop the iteration when ‖g_k‖_2 ≤ 10^{−5} ‖g_0‖_2.
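For completeness, the objective (33) and its gradient, which is Au − b + h^2 u^3 componentwise (obtained by differentiating the quartic term), can be coded as follows. This is our sketch: the right-hand side b and the mesh size h below are placeholders rather than the data of the actual L2 problem, and the matrix A is rebuilt with the same small dense construction as in the previous sketch.

import numpy as np

def laplace3d(m):
    """Dense 7-point 3D Laplacian, as in (31); small m only."""
    L = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    I = np.eye(m)
    return (np.kron(np.kron(I, I), L) + np.kron(np.kron(I, L), I)
            + np.kron(np.kron(L, I), I))

m = 5
h = 1.0 / (m + 1)                  # mesh size of the interior grid (assumed)
A = laplace3d(m)
b = np.ones(m**3)                  # placeholder right-hand side, not the one of problem L2

def f(u):
    """Objective (33): 0.5 u'Au - b'u + (h^2/4) * sum of u_{ijk}^4."""
    return 0.5 * u @ A @ u - b @ u + 0.25 * h**2 * np.sum(u**4)

def grad(u):
    """Gradient of (33): A u - b + h^2 u^3 (componentwise cube)."""
    return A @ u - b + h**2 * u**3

# finite-difference check of the gradient at a random point
rng = np.random.default_rng(2)
u = rng.standard_normal(m**3)
d = rng.standard_normal(m**3)
eps = 1e-6
fd = (f(u + eps * d) - f(u - eps * d)) / (2 * eps)
print(np.isclose(fd, grad(u) @ d, rtol=1e-5))   # True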
In Table 3, we report the time and numbers of evaluations required by the Polak-
Ribière CG (PR-CG), unmodified BB and ABB methods. The columns headed #ls, #f
and #g give the numbers of line searches, function evaluations and gradient evaluations needed to solve the problem. Note that the unmodified BB and ABB methods require no line
searches and function evaluations. In the PR-CG method, we require the step-size to
satisfy the standard strong Wolfe conditions with a relative slope tolerance of 0.1. From
Table 3, we can observe that the unmodified BB and ABB methods greatly improve
upon the PR-CG method, and the ABB method usually gives the best performance.
Table 3. Time (seconds) and numbers of evaluations required by the PR-CG, BB and ABB methods for solving problem L2.
These results suggest that the ABB method might surpass the PR-CG and BB methods
while solving unconstrained optimization problems.
In this paper, by simulating the numerical behavior of the BB method for two-dimensional quadratics, we have introduced two gradient methods with adaptive step-sizes, namely ASD and ABB. The ASD method combines the SD and MG methods while avoiding the drawbacks of both, and the ABB method uses a trust-region-like strategy to choose its step-size between the two alternatives of the original BB method.
Based on our numerical experiments, we conclude that the ASD and ABB methods are
comparable and in general preferable to the BB, AS and AM methods. In particular, the ABB method seems to be a good option if the coefficient matrix is very ill-conditioned and
a high precision is required. We also report that the new algorithms outperform the linear
CG method when a low precision is required; and for a specific non-quadratic function,
the ABB method clearly surpasses the PR-CG method as well as the unmodified BB
method.
In view of the remarkable performance of the ASD method and the fact that it is
a monotone algorithm, we should expect that this method would be useful in some
circumstances. For example, if the dimension of the problem is very large and hence
only a few iterations can be carried out, then the ASD method might be a very good option.
The ABB method is not a monotone algorithm, but it also has an obvious advantage: the method itself requires no line searches for general functions and therefore might save a lot of computational work when solving unconstrained optimization problems.
To ensure global convergence, one could combine the ABB method with the non-monotone line search of Grippo et al. [15] or with the new type of non-monotone line search recently proposed by Zhang and Hager [25].
Finally, note that in our experiments the parameter κ is usually chosen as 0.5. However, extensive numerical experiments show that a different κ may sometimes lead to better results. According to our experience, it is better to set κ slightly bigger than 0.5 in the ASD method, and smaller than 0.5 in the ABB method. In fact, the choice
of κ is a tradeoff between a large step-size and a small one. We would like to choose the
large step-size to give a substantial reduction, but at the same time, we might also need
the small one to induce a favorable descent direction for the next iteration. A suitable
κ, therefore, should ensure that both the large and small step-sizes will be frequently
adopted.
Acknowledgments
The authors wish to thank two anonymous referees for their valuable comments and
suggestions. This work was supported by the Chinese NSF grants 10171104, 10571171
and 40233029. The third author was also supported by the Alexander-von-Humboldt
Foundation.
References
1. H. Akaike, “On a successive transformation of probability distribution and its application to the analysis
of the optimum gradient method,” Ann. Inst. Statist. Math., Tokyo, vol. 11, pp. 1–17, 1959.
2. R. Andreani, E.G. Birgin, J.M. Martínez, and J. Yuan, “Spectral projected gradient and variable metric
methods for optimization with linear inequalities,” IMA J. Numer. Anal., vol. 25, pp. 221–252, 2005.
3. J. Barzilai and J.M. Borwein, “Two-point step size gradient methods,” IMA J. Numer. Anal., vol. 8, pp.
141–148, 1988.
4. L. Bello and M. Raydan, “Preconditioned spectral projected gradient methods on convex sets,” Journal
of Computational Mathematics, vol. 23, pp. 225–232, 2005.
5. E.G. Birgin, J.M. Martínez, and M. Raydan, “Nonmonotone spectral projected gradient methods on convex
sets,” SIAM J. Optim., vol. 10, pp. 1196–1211, 2000.
6. E.G. Birgin, J.M. Martínez, and M. Raydan, “Algorithm 813: SPG—software for convex-constrained
optimization,” ACM Trans. Math. Software, vol. 27, pp. 340–349, 2001.
7. E.G. Birgin, J.M. Martínez, and M. Raydan, “Inexact spectral projected gradient methods on convex sets,”
IMA J. Numer. Anal., vol. 23, pp. 539–559, 2003.
8. A. Cauchy, “Méthode générale pour la résolution des systèmes d’équations simultanées,” Comp. Rend.
Sci. Paris, vol. 25, pp. 536–538, 1847.
9. Y.H. Dai, “Alternate step gradient method,” Optimization, vol. 52, pp. 395–415, 2003.
10. Y.H. Dai and L.Z. Liao, “R-linear convergence of the Barzilai and Borwein gradient method,” IMA J.
Numer. Anal., vol. 22, pp. 1–10, 2002.
11. Y.H. Dai and Y. Yuan, “Alternate minimization gradient method,” IMA J. Numer. Anal., vol. 23, pp.
377–393, 2003.
12. R. Fletcher, “Low storage methods for unconstrained optimization,” Lectures in Applied Mathematics
(AMS), vol. 26, pp. 165–179, 1990.
13. R. Fletcher, “On the Barzilai-Borwein method,” Numerical Analysis Report NA/207, Department of
Mathematics, University of Dundee, 2001.
14. A. Friedlander, J.M. Martínez, B. Molina, and M. Raydan, “Gradient method with retards and general-
izations,” SIAM J. Numer. Anal., vol. 36, pp. 275–289, 1999.
15. L. Grippo, F. Lampariello, and S. Lucidi, “A nonmonotone line search technique for Newton’s method,”
SIAM J. Numer. Anal., vol. 23, pp. 707–716, 1986.
16. M.R. Hestenes and E.L. Stiefel, “Methods of conjugate gradients for solving linear systems,” J. Research
National Bureau of Standards, vol. B49, pp. 409–436, 1952.
17. W. La Cruz, J.M. Martínez, and M. Raydan, “Spectral residual method without gradient information for
solving large-scale nonlinear systems of equations,” Mathematics of Computation, to appear.
18. W. La Cruz and M. Raydan, “Nonmonotone spectral methods for large-scale nonlinear systems,” Opti-
mization Methods and Software, vol. 18, pp. 583–599, 2003.
19. J.-L. Lamotte, B. Molina, and M. Raydan, “Smooth and adaptive gradient method with retards,” Mathe-
matical and Computer Modelling, vol. 36, pp. 1161–1168, 2002.
20. F. Luengo and M. Raydan, “Gradient method with dynamical retards for large-scale optimization prob-
lems,” Electronic Transactions on Numerical Analysis, vol. 16, pp. 186–193, 2003.
21. M. Raydan, “On the Barzilai and Borwein choice of steplength for the gradient method,” IMA J. Numer.
Anal., vol. 13, pp. 321–326, 1993.
22. M. Raydan, “The Barzilai and Borwein method for the large scale unconstrained minimization problem,”
SIAM J. Optim., vol. 7, pp. 26–33, 1997.
23. M. Raydan and B.F. Svaiter, “Relaxed steepest descent and Cauchy-Barzilai-Borwein method,” Compu-
tational Optimization and Applications, vol. 21, pp. 155–167, 2002.
24. T. Serafini, G. Zanghirati, and L. Zanni, “Gradient projection methods for quadratic programs and appli-
cations in training support vector machines,” Tech. Rep. 48, University of Modena and Reggio Emilia,
Italy, 2003.
25. H. Zhang and W.W. Hager, “A nonmonotone line search technique and its application to unconstrained
optimization,” SIAM J. Optim., vol. 14, pp. 1043–1056, 2004.