
A LIMITED-MEMORY ALGORITHM FOR
BOUND-CONSTRAINED OPTIMIZATION

Richard H. Byrd,* Peihuang Lu,† and Jorge Nocedal†‡

ABSTRACT

An algorithm for solving large nonlinear optimization problems with simple bounds is de-
scribed. It is based on the gradient projection method and uses a limited-memory BFGS
matrix to approximate the Hessian of the objective function. We show how to take advantage
of the form of the limited-memory approximation to implement the algorithm efficiently. The
results of numerical tests on a set of large problems are reported.

Key words: bound-constrained optimization, limited-memory method, nonlinear optimization,
quasi-Newton method, large-scale optimization.

1 Introduction
In this paper we describe a limited-memory quasi-Newton algorithm for solving large nonlinear
optimization problems with simple bounds on the variables. We write this problem as

    min f(x)                                                        (1.1)
    subject to  l \le x \le u,                                      (1.2)

where f : \mathbb{R}^n \to \mathbb{R} is a nonlinear function whose gradient g is available, the vectors l and u
represent lower and upper bounds on the variables, respectively, and the number of variables
n is assumed to be large. The algorithm does not require second derivatives or knowledge of
the structure of the objective function and can therefore be applied when the Hessian matrix is
not practical to compute. A limited-memory quasi-Newton update is used to approximate the
Hessian matrix in such a way that the storage required is linear in n.
*Computer Science Department, University of Colorado at Boulder, Boulder, Colorado 80309. This author was
supported by NSF grant CCR-9101795, ARO grant DAAL 03-91-G-0151, and AFOSR grant AFOSR-90-0209.
†Department of Electrical Engineering and Computer Science, Northwestern University, Evanston IL 60208;
nocedal@eecs.nwu.edu. These authors were supported by National Science Foundation Grants CCR-9101359 and
ASC-9213149, and by Department of Energy Grant DE-FG02-87ER25047-A004.
‡This author was also supported by the Office of Scientific Computing, U.S. Department of Energy, under Con-
tract W-31-109-Eng-38, while at the Mathematics and Computer Science Division, Argonne National Laboratory,
Argonne, IL 60439.
The algorithm described in this paper is similar to the algorithms proposed by Conn, Gould,
and Toint [9] and Moré and Toraldo [20], in that the gradient projection method is used to
determine a set of active constraints at each iteration. Our algorithm is distinguished from these
methods by our use of line searches (as opposed to trust regions) but mainly by our use of limited-
memory BFGS matrices to approximate the Hessian of the objective function. The properties
of these limited-memory matrices have far-reaching consequences in the implementation of the
method, as will be discussed later on. We find that by making use of the compact representations
of limited-memory matrices described by Byrd, Nocedal, and Schnabel [6], the computational
cost of one iteration of the algorithm can be kept to be of order n.
We used the gradient projection approach [16], [18], [3] to determine the active set, because
recent studies [7], [5] indicate that it possesses good theoretical properties, and because it also
appears to be efficient on many large problems [8], [20]. However, some of the main components
of our algorithm could be useful in other frameworks, as long as limited-memory matrices are
used to approximate the Hessian of the objective function.

2 Outline of the Algorithm


At the beginning of each iteration, the current iterate x_k, the function value f_k, the gradient g_k,
and a positive definite limited-memory approximation B_k are given. This allows us to form a
quadratic model of f at x_k,

    m_k(x) = f(x_k) + g_k^T (x - x_k) + \tfrac{1}{2} (x - x_k)^T B_k (x - x_k).          (2.1)

Just as in the method studied by Conn, Gould, and Toint [9], the algorithm approximately
minimizes m_k(x) subject to the bounds given by (1.2). This is done by first using the gradient
projection method to find a set of active bounds, followed by a minimization of m_k treating those
bounds as equality constraints.
To do this, we first consider the piecewise linear path

    x(t) = P(x_k - t g_k, l, u),                                                         (2.2)

obtained by projecting the steepest descent direction onto the feasible region, where

    P(x, l, u)_i = \begin{cases} l_i & \text{if } x_i < l_i \\ x_i & \text{if } x_i \in [l_i, u_i] \\ u_i & \text{if } x_i > u_i. \end{cases}
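
For concreteness, here is a minimal Python sketch of this projection and of evaluating a point on the projected path; the function names are ours, not from the paper.

```python
import numpy as np

def project(x, l, u):
    """Componentwise projection P(x, l, u) onto the box [l, u]."""
    return np.minimum(np.maximum(x, l), u)

def projected_path(x, g, l, u, t):
    """Point x(t) = P(x - t*g, l, u) on the piecewise linear projected
    steepest-descent path, for a scalar step t >= 0."""
    return project(x - t * g, l, u)
```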

We then compute the generalized Cauchy point x^c, which is defined as the first local minimizer
of the univariate, piecewise quadratic

    q_k(t) = m_k(x(t)).

The variables whose value at x^c is at lower or upper bound, comprising the active set A(x^c), are
held fixed. We then consider the following quadratic problem over the subspace of free variables,

    \min \{ m_k(x) : x_i = x_i^c, \; \forall i \in A(x^c) \}                             (2.3)
    subject to  l_i \le x_i \le u_i  \quad \forall i \notin A(x^c).                      (2.4)

We first solve or approximately solve (2.3), ignoring the bounds on the free variables, which can
be accomplished either by direct or iterative methods on the subspace of free variables or by a
dual approach, handling the active bounds in (2.3) by Lagrange multipliers. We then truncate
the path toward the solution so as to satisfy the bounds (2.4).
After an approximate solution \bar{x}_{k+1} of this problem has been obtained, we compute the new
iterate x_{k+1} by a backtracking line search along d_k = \bar{x}_{k+1} - x_k that ensures that

    f(x_{k+1}) \le f(x_k) + \alpha \lambda_k g_k^T d_k,                                  (2.5)

where \lambda_k is the steplength and \alpha is a parameter that has the value 10^{-4} in our code. We then
evaluate the gradient at x_{k+1}, compute a new limited-memory Hessian approximation B_{k+1}, and
begin a new iteration.
Because in our algorithm every Hessian approximation B_k is positive definite, the approximate
solution \bar{x}_{k+1} of the quadratic problem (2.3)-(2.4) defines a descent direction d_k = \bar{x}_{k+1} - x_k for
the objective function f. To see this, first note that the generalized Cauchy point x^c, which is a
minimizer of m_k(x) on the projected steepest descent direction, satisfies m_k(x_k) > m_k(x^c) if the
projected gradient is nonzero. Since the point \bar{x}_{k+1} is on a path from x^c to the minimizer of (2.3),
along which m_k decreases, the value of m_k at \bar{x}_{k+1} is no larger than its value at x^c. Therefore we
have

    f(x_k) = m_k(x_k) > m_k(x^c) \ge m_k(\bar{x}_{k+1}) = f(x_k) + g_k^T d_k + \tfrac{1}{2} d_k^T B_k d_k.

This inequality implies that g_k^T d_k < 0 if B_k is positive definite and d_k is not zero.
The Hessian approximations B_k used in our algorithm are limited-memory BFGS matrices
(Nocedal [21] and Byrd, Nocedal, and Schnabel [6]). Even though these matrices do not take
advantage of the structure of the problem, they require only a small amount of storage and, as we
will show, allow the computation of the generalized Cauchy point and the subspace minimization
to be performed in O(n) operations. The new algorithm therefore has computational demands
similar to those of the limited-memory algorithm (L-BFGS) for unconstrained problems described
by Liu and Nocedal [19] and Gilbert and Lemaréchal [14].
In the next three sections we describe in detail the limited-memory matrices, the computation
of the Cauchy point, and the minimization of the quadratic problem on a subspace.

3 Limited-Memory BFGS Matrices


In our algorithm, the limited-memory BFGS matrices are represented in the compact form de-
scribed by Byrd, Nocedal, and Schnabel [6]. At every iterate x_k the algorithm stores a small
number, say m, of correction pairs {s_i, y_i}, i = k-1, ..., k-m, where

    s_i = x_{i+1} - x_i,  \qquad  y_i = g_{i+1} - g_i.                                   (3.1)

These correction pairs contain information about the curvature of the function and, in conjunction
with the BFGS formula, define the limited-memory iteration matrix B_k. The question is how to
best represent these matrices without explicitly forming them.
In [6] it is proposed to use a compact (or outer product) form to define the limited-memory
matrix B_k in terms of the n x m correction matrices

    S_k = [s_{k-m}, \ldots, s_{k-1}],  \qquad  Y_k = [y_{k-m}, \ldots, y_{k-1}].

More specifically, it is shown in [6] that if \theta is a positive scaling parameter, and if the m correction
pairs {s_i, y_i}, i = k-m, ..., k-1, satisfy s_i^T y_i > 0, then the matrix obtained by updating \theta I m times, using the
BFGS formula and the pairs {s_i, y_i}, i = k-m, ..., k-1, can be written as

    B_k = \theta I - W_k M_k W_k^T,                                                      (3.2)

where

    W_k = [\, Y_k \;\; \theta S_k \,],                                                   (3.3)

    M_k = \begin{bmatrix} -D_k & L_k^T \\ L_k & \theta S_k^T S_k \end{bmatrix}^{-1},     (3.4)

and where L_k and D_k are the m x m matrices

    (L_k)_{i,j} = \begin{cases} (s_{k-m-1+i})^T (y_{k-m-1+j}) & \text{if } i > j \\ 0 & \text{otherwise,} \end{cases}          (3.5)

    D_k = \mathrm{diag}\, [\, s_{k-m}^T y_{k-m}, \ldots, s_{k-1}^T y_{k-1} \,].          (3.6)

(We should point out that (3.2) is a slight rearrangement of Equation (3.5) in [6].) Note that
since M_k is a 2m x 2m matrix, and since m is chosen to be a small integer, the cost of computing
the inverse in (3.4) is negligible. It is shown in [6] that by using the compact representation (3.2)
various computations involving B_k become inexpensive. In particular, the product of B_k times a
vector, which occurs often in the algorithm of this paper, can be performed efficiently.
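
As an illustration, the following hedged NumPy sketch forms W_k and M_k and applies B_k to a vector with the compact representation; it is not the authors' Fortran code, the variable names are ours, and the explicit 2m x 2m inverse is formed only for clarity.

```python
import numpy as np

def lbfgs_compact(S, Y, theta):
    """Build W and M of the compact representation B = theta*I - W M W^T
    of (3.2)-(3.4).  S and Y are n-by-m arrays whose columns are the
    correction pairs s_i, y_i (oldest first); theta > 0 is the scaling."""
    SY = S.T @ Y                          # m x m, (SY)_{ij} = s_i^T y_j
    D = np.diag(np.diag(SY))              # D_k, eq. (3.6)
    L = np.tril(SY, k=-1)                 # L_k, eq. (3.5): strictly lower triangle
    W = np.hstack([Y, theta * S])         # W_k = [Y  theta*S], eq. (3.3)
    M_inv = np.block([[-D, L.T],
                      [L, theta * (S.T @ S)]])
    M = np.linalg.inv(M_inv)              # 2m x 2m middle matrix, eq. (3.4)
    return W, M

def B_times_v(v, W, M, theta):
    """Product B v = theta*v - W (M (W^T v)); costs O(m n) per call."""
    return theta * v - W @ (M @ (W.T @ v))
```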
There is a similar representation of the inverse limited-memory BFGS matrix H_k that ap-
proximates the inverse of the Hessian matrix:

    H_k = \frac{1}{\theta} I + \bar{W}_k \bar{M}_k \bar{W}_k^T,                          (3.7)

where

    \bar{W}_k = [\, \tfrac{1}{\theta} Y_k \;\; S_k \,],  \qquad
    \bar{M}_k = \begin{bmatrix} 0 & -R_k^{-1} \\ -R_k^{-T} & R_k^{-T} (D_k + \tfrac{1}{\theta} Y_k^T Y_k) R_k^{-1} \end{bmatrix},

and

    (R_k)_{i,j} = \begin{cases} (s_{k-m-1+i})^T (y_{k-m-1+j}) & \text{if } i \le j \\ 0 & \text{otherwise.} \end{cases}          (3.8)

(We note that (3.7) is a slight rearrangement of equation (3.1) in [6].)
Since the algorithm performs a backtracking line search, we cannot guarantee that the con-
dition s_k^T y_k > 0 always holds (cf. Dennis and Schnabel [12]). Therefore, to maintain the positive
definiteness of the limited-memory BFGS matrix, we discard a correction pair {s_k, y_k} if the
curvature condition

    s_k^T y_k > \mathrm{eps} \, \|y_k\|^2                                                (3.9)

is not satisfied for a small positive constant eps. If this happens, we do not delete the oldest cor-
rection pair, as is normally done in limited-memory updating. This means that the m directions
in S_k and Y_k may actually include some with indices less than k - m.
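
As a small illustration, the curvature test (3.9) that decides whether a correction pair is stored can be written as follows; the value of eps is left as a parameter, since the paper only requires it to be a small positive constant.

```python
def accept_correction(s, y, eps=1e-8):
    """Curvature test (3.9): store the pair (s, y) only if s^T y > eps * ||y||^2.
    s and y are NumPy vectors; eps = 1e-8 is an illustrative choice, not a
    value taken from the paper."""
    return float(s @ y) > eps * float(y @ y)
```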

4 The Generalized Cauchy Point


The objective of the procedure described in this section is to find the first local minimizer of the
quadratic model along the piecewise linear path obtained by projecting points along the steepest
descent direction, x_k - t g_k, onto the feasible region. We define x^0 = x_k and, throughout this
section, drop the index k of the outer iteration, so that g, x and B stand for g_k, x_k and B_k. We
use subscripts to denote the components of a vector; for example, g_i denotes the i-th component
of g. Superscripts will be used to represent iterates during the piecewise search for the Cauchy
point.
To define the breakpoints in each coordinate direction, we compute

    t_i = \begin{cases} (x_i^0 - u_i)/g_i & \text{if } g_i < 0 \\ (x_i^0 - l_i)/g_i & \text{if } g_i > 0 \\ \infty & \text{otherwise,} \end{cases}          (4.1)

and sort {t_i, i = 1, ..., n} in increasing order to obtain the ordered set {t^j : t^j \le t^{j+1}, j = 1, ..., n}.
We then search along P(x^0 - t g, l, u), a piecewise linear path that can be expressed as

    x_i(t) = \begin{cases} x_i^0 - t g_i & \text{if } t \le t_i \\ x_i^0 - t_i g_i & \text{otherwise.} \end{cases}

Suppose that we are examining the interval [t^{j-1}, t^j]. Let us define the (j-1)-st breakpoint as

    x^{j-1} = x(t^{j-1}),

so that on [t^{j-1}, t^j]

    x(t) = x^{j-1} + \Delta t \, d^{j-1},

where

    \Delta t = t - t^{j-1}
and

    d_i^{j-1} = \begin{cases} -g_i & \text{if } t^{j-1} < t_i \\ 0 & \text{otherwise.} \end{cases}                              (4.2)

Using this notation, we write the quadratic (2.1) on the line segment [x(t^{j-1}), x(t^j)] as

    m(x) = f + g^T (x - x^0) + \tfrac{1}{2} (x - x^0)^T B (x - x^0)
         = f + g^T (z^{j-1} + \Delta t \, d^{j-1}) + \tfrac{1}{2} (z^{j-1} + \Delta t \, d^{j-1})^T B (z^{j-1} + \Delta t \, d^{j-1}),

where

    z^{j-1} = x^{j-1} - x^0.                                                             (4.3)

Therefore on the line segment [x(t^{j-1}), x(t^j)], m(x) can be written as a quadratic in \Delta t,

    \hat{m}(\Delta t) = (f + g^T z^{j-1} + \tfrac{1}{2} (z^{j-1})^T B z^{j-1}) + f'_{j-1} \Delta t + \tfrac{1}{2} f''_{j-1} \Delta t^2,

where

    f'_{j-1} = g^T d^{j-1} + (d^{j-1})^T B z^{j-1},                                      (4.4)
    f''_{j-1} = (d^{j-1})^T B d^{j-1}.                                                   (4.5)

Differentiating \hat{m}(\Delta t) and equating to zero, we obtain \Delta t^* = -f'_{j-1} / f''_{j-1}. Since B is positive
definite, this defines a minimizer provided t^{j-1} + \Delta t^* lies on [t^{j-1}, t^j]. Otherwise the generalized
Cauchy point lies at x(t^{j-1}) if f'_{j-1} > 0, and beyond or at x(t^j) if f'_{j-1} < 0.
If the generalized Cauchy point has not been found after exploring the interval [t^{j-1}, t^j], we
set

    x^j = x^{j-1} + \Delta t^{j-1} d^{j-1},  \qquad  \Delta t^{j-1} = t^j - t^{j-1},     (4.6)

and update the directional derivatives f'_j and f''_j as the search moves to the next interval. Let us
assume for the moment that only one variable becomes active at t^j, and let us denote its index
by b. Then t_b = t^j, and we zero out the corresponding component of the search direction,

    d^j = d^{j-1} + g_b e_b,                                                             (4.7)

where e_b is the b-th unit vector. From the definitions (4.3) and (4.6) we have

    z^j = z^{j-1} + \Delta t^{j-1} d^{j-1}.                                              (4.8)

Therefore, using (4.4), (4.5), (4.7), and (4.8), we obtain

    f'_j = f'_{j-1} + \Delta t^{j-1} f''_{j-1} + g_b^2 + g_b e_b^T B z^j                 (4.9)

and

    f''_j = f''_{j-1} + 2 g_b e_b^T B d^{j-1} + g_b^2 e_b^T B e_b,                       (4.10)

which can require O(n) operations since B is a dense limited-memory matrix. Therefore it would
appear that the computation of the generalized Cauchy point could require O(n^2) operations,
since in the worst case n segments of the piecewise linear path can be examined. This cost would
be prohibitive for large problems. However, using the limited-memory BFGS formula (3.2) and
the definition (4.2), the updating formulae (4.9)-(4.10) become

    f'_j = f'_{j-1} + \Delta t^{j-1} f''_{j-1} + g_b^2 + \theta g_b z_b^j - g_b w_b^T M W^T z^j,          (4.11)

    f''_j = f''_{j-1} - \theta g_b^2 - 2 g_b w_b^T M W^T d^{j-1} - g_b^2 w_b^T M w_b,                     (4.12)

where w_b^T stands for the b-th row of the matrix W. The only O(n) operations remaining in
(4.11) and (4.12) are W^T z^j and W^T d^{j-1}. We note, however, from (4.7) and (4.8) that z^j and
d^j are updated at every iteration by a simple computation. Therefore, if we maintain the two
2m-vectors

    p^j \equiv W^T d^j = W^T (d^{j-1} + g_b e_b) = p^{j-1} + g_b w_b,
    c^j \equiv W^T z^j = W^T (z^{j-1} + \Delta t^{j-1} d^{j-1}) = c^{j-1} + \Delta t^{j-1} p^{j-1},

then updating f'_j and f''_j using the expressions

    f'_j = f'_{j-1} + \Delta t^{j-1} f''_{j-1} + g_b^2 + \theta g_b z_b^j - g_b w_b^T M c^j,
    f''_j = f''_{j-1} - \theta g_b^2 - 2 g_b w_b^T M p^{j-1} - g_b^2 w_b^T M w_b,

will require only O(m^2) operations. If more than one variable becomes active at t^j - an atypical
situation - we repeat the updating process just described, before examining the new interval
[t^j, t^{j+1}]. We have thus been able to achieve a significant reduction in the cost of computing the
generalized Cauchy point.
Remark. The examination of the first segment of the projected steepest descent path, during the
computation of the generalized Cauchy point, requires O(n) operations. However, all subsequent
segments require only O(m^2) operations, where m is the number of correction vectors stored in
the limited-memory matrix.
Since m is usually small, say less than 10, the cost of examining all segments after the first
one is negligible. The following algorithm describes in more detail how to achieve these savings
in computation. Note that it is not necessary to keep track of the n-vector z^j, since only the
component z_b^j corresponding to the bound that has become active is needed to update f'_j and
f''_j.

Algorithm CP: Computation of the generalized Cauchy point

Given x, l, u, g, and B = \theta I - W M W^T.

• Initialization
  For i = 1, ..., n compute

      t_i := \begin{cases} (x_i - u_i)/g_i & \text{if } g_i < 0 \\ (x_i - l_i)/g_i & \text{if } g_i > 0 \\ \infty & \text{otherwise} \end{cases}

      d_i := \begin{cases} 0 & \text{if } t_i = 0 \\ -g_i & \text{otherwise} \end{cases}

  F := {i : t_i > 0}
  p := W^T d                                                   (2mn operations)
  c := 0
  f' := g^T d = -d^T d                                         (n operations)
  f'' := \theta d^T d - d^T W M W^T d = -\theta f' - p^T M p   (O(m^2) operations)
  \Delta t_min := -f'/f''
  t_old := 0
  t := min{t_i : i \in F}                                      (using the heapsort algorithm)
  b := i such that t_i = t                                     (remove b from F)
  \Delta t := t - 0

• Examination of subsequent segments
  While \Delta t_min \ge \Delta t do
      x_b^c := u_b if d_b > 0, or x_b^c := l_b if d_b < 0
      z_b := x_b^c - x_b
      c := c + \Delta t \, p
      f' := f' + \Delta t \, f'' + g_b^2 + \theta g_b z_b - g_b w_b^T M c
      f'' := f'' - \theta g_b^2 - 2 g_b w_b^T M p - g_b^2 w_b^T M w_b
      p := p + g_b w_b
      d_b := 0
      \Delta t_min := -f'/f''
      t_old := t
      t := min{t_i : i \in F}                                  (using the heapsort algorithm)
      b := i such that t_i = t                                 (remove b from F)
      \Delta t := t - t_old
  end while

• \Delta t_min := max{\Delta t_min, 0}
• t_old := t_old + \Delta t_min
• x_i^c := x_i + t_old d_i, for all i such that t_i \ge t
• For all i \in F with t_i = t, remove i from F
• c := c + \Delta t_min p
The last step of this algorithm updates the 2m-vector c so that upon termination

    c = W^T (x^c - x_k).                                                                 (4.13)

This vector will be used to initialize the subspace minimization when the primal direct method
or the conjugate gradient method is used, as will be discussed in the next section.
Our operation counts take into account only multiplications and divisions. Note that there
are no O(n) computations inside the loop. If n_int denotes the total number of segments explored,
then the total cost of Algorithm CP is (2m+2)n + O(m^2) x n_int operations, plus the n log n operations
which are the approximate cost of the heapsort algorithm [1].
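
To make the piecewise search concrete, here is a hedged Python sketch of a simplified generalized Cauchy point computation. It recomputes the directional derivatives f' and f'' with explicit B-vector products on every segment, so it does not achieve the O(m^2)-per-segment cost of Algorithm CP; the function name and interface are ours.

```python
import numpy as np

def generalized_cauchy_point(x, g, l, u, B_times_v):
    """Simplified search for the generalized Cauchy point: the first local
    minimizer of the quadratic model along the projected path P(x - t*g, l, u).

    B_times_v(v) must return B v for the positive definite model Hessian B.
    Unlike Algorithm CP, this sketch recomputes f' and f'' from scratch on
    every segment, so each segment costs at least O(n)."""
    n = x.size
    # Breakpoints t_i along each coordinate, eq. (4.1).
    t = np.full(n, np.inf)
    neg, pos = g < 0, g > 0
    t[neg] = (x[neg] - u[neg]) / g[neg]
    t[pos] = (x[pos] - l[pos]) / g[pos]

    d = np.where(t > 0, -g, 0.0)       # initial piecewise search direction
    xc = x.copy()
    t_old = 0.0

    for idx in np.argsort(t):          # examine breakpoints in increasing order
        tb = t[idx]
        if not np.isfinite(tb):
            break                       # remaining variables never reach a bound
        dt = tb - t_old
        if dt > 0.0:
            z = xc - x
            f1 = g @ d + d @ B_times_v(z)     # f', eq. (4.4)
            f2 = d @ B_times_v(d)             # f'', eq. (4.5)
            dt_min = -f1 / f2 if f2 > 0.0 else np.inf
            if dt_min < dt:                   # minimizer lies inside this segment
                return xc + max(dt_min, 0.0) * d
            xc = xc + dt * d                  # advance to the next breakpoint
            t_old = tb
        # Variable idx hits its bound at t = tb: fix it and drop it from d.
        xc[idx] = u[idx] if g[idx] < 0 else l[idx]
        d[idx] = 0.0

    # Last (unbounded) segment for any variables that never reach a bound.
    z = xc - x
    f1 = g @ d + d @ B_times_v(z)
    f2 = d @ B_times_v(d)
    dt_min = -f1 / f2 if f2 > 0.0 else 0.0
    return xc + max(dt_min, 0.0) * d
```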

5 Methods for Subspace Minimization


Once the Cauchy point x^c has been found, we proceed to approximately minimize the quadratic
model m_k over the space of free variables and impose the bounds on the problem. We consider
three approaches to minimize the model: a direct primal method based on the Sherman-Morrison-
Woodbury formula, a primal iterative method using the conjugate gradient method, and a direct
dual method using Lagrange multipliers. Which of these is most appropriate seems problem
dependent, and we have experimented numerically with all three. In all these approaches we first
work on minimizing m_k ignoring the bounds, and at an appropriate point truncate the move so
as to satisfy the bound constraints.
The following notation will be used throughout this section. The integer t denotes the number
of free variables at the Cauchy point x^c; in other words, there are n - t variables at bound at x^c.
As in the preceding section, F denotes the set of indices corresponding to the free variables, and
we note that this set is defined upon completion of the Cauchy point computation. We define
Z_k to be the n x t matrix whose columns are unit vectors (i.e., columns of the identity matrix)
that span the subspace of the free variables at x^c. Similarly A_k denotes the n x (n - t) matrix of
active constraint gradients at x^c, which consists of n - t unit vectors. Note that A_k^T Z_k = 0 and
that

    A_k A_k^T + Z_k Z_k^T = I.                                                           (5.1)
5.1 A Direct Primal Method
In a primal approach, we fix the n - t variables at bound at the generalized Cauchy point x^c, and
solve the quadratic problem (2.3) over the subspace of the remaining t free variables, starting
from x^c and imposing the free variable bounds (2.4). Thus we consider only the points x \in R^n of
the form

    x = x^c + Z_k \hat{d},                                                               (5.2)

where \hat{d} is a vector of dimension t. Using this notation, for points of the form (5.2) we can write
the quadratic (2.1) as

    m_k(x) = f_k + g_k^T (x - x^c + x^c - x_k) + \tfrac{1}{2} (x - x^c + x^c - x_k)^T B_k (x - x^c + x^c - x_k)
           = (g_k + B_k (x^c - x_k))^T (x - x^c) + \tfrac{1}{2} (x - x^c)^T B_k (x - x^c) + \gamma
           \equiv \hat{d}^T \hat{r}^c + \tfrac{1}{2} \hat{d}^T \hat{B}_k \hat{d} + \gamma,          (5.3)

where \gamma is a constant,

    \hat{B}_k = Z_k^T B_k Z_k

is the reduced Hessian of m_k, and

    \hat{r}^c = Z_k^T (g_k + B_k (x^c - x_k))

is the reduced gradient of m_k at x^c. Using (3.2) and (4.13), we can express this reduced gradient
as

    \hat{r}^c = Z_k^T (g_k + \theta (x^c - x_k) - W_k M_k c),                            (5.4)

which, given that the vector c was saved from the Cauchy point computation, costs (2m + 1)t +
O(m^2) extra operations. Then the subspace problem (2.3) can be formulated as

    \min \hat{m}_k(\hat{d}) \equiv \hat{d}^T \hat{r}^c + \tfrac{1}{2} \hat{d}^T \hat{B}_k \hat{d} + \gamma          (5.5)

    subject to  l_i - x_i^c \le \hat{d}_i \le u_i - x_i^c,  \quad i \in F,               (5.6)

where the subscript i denotes the i-th component of a vector. The minimization (5.5) can be
solved either by a direct method, as we discuss here, or by an iterative method as discussed in
the next subsection, and the constraints (5.6) can be imposed by backtracking.
Since the reduced limited-memory matrix \hat{B}_k is a small-rank correction of a diagonal matrix,
we can formally compute its inverse by means of the Sherman-Morrison-Woodbury formula and
obtain the unconstrained solution of the subspace problem (5.5),

    \hat{d}^u = -\hat{B}_k^{-1} \hat{r}^c.                                               (5.7)

We can then backtrack towards the feasible region, if necessary, to obtain

    \hat{d}^* = \alpha^* \hat{d}^u,

where the positive scalar \alpha^* is defined by

    \alpha^* = \max \{ \alpha : \alpha \le 1, \; l_i - x_i^c \le \alpha \hat{d}_i^u \le u_i - x_i^c, \; i \in F \}.          (5.8)

Therefore the approximate solution \bar{x}_{k+1} of the subproblem (2.3)-(2.4) is given by

    (\bar{x}_{k+1})_i = \begin{cases} x_i^c & \text{if } i \notin F \\ x_i^c + (Z_k \hat{d}^*)_i & \text{if } i \in F. \end{cases}          (5.9)

It remains only to consider how to perform the computation in (5.7). Since B_k is given by
(3.2) and Z_k^T Z_k = I, the reduced matrix \hat{B} is given by

    \hat{B} = \theta I - (Z^T W)(M W^T Z),

where we have dropped the subscripts for simplicity. Applying the Sherman-Morrison-Woodbury
formula (see, for example, [17]), we obtain

    \hat{B}^{-1} = \frac{1}{\theta} I + \frac{1}{\theta^2} Z^T W \left( I - \frac{1}{\theta} M W^T Z Z^T W \right)^{-1} M W^T Z,          (5.10)

so that the unconstrained subspace Newton direction \hat{d}^u is given by

    \hat{d}^u = -\frac{1}{\theta} \hat{r}^c - \frac{1}{\theta^2} Z^T W \left( I - \frac{1}{\theta} M W^T Z Z^T W \right)^{-1} M W^T Z \hat{r}^c.          (5.11)

Given a set of free variables at x^c that determines the matrix Z, and a limited-memory BFGS
matrix B defined in terms of \theta, W and M, the following procedure implements the approach
just described. Note that since the columns of Z are unit vectors, the operation Zv amounts
to selecting appropriate elements from v. Here and throughout the paper our operation counts
include only multiplications and divisions. Recall that t denotes the number of free variables and
that m is the number of corrections stored in the limited-memory matrix.

Direct Primal Method

1. Compute \hat{r}^c by (5.4)                                            ((2m + 1)t + O(m^2) operations)
2. v := W^T Z \hat{r}^c                                                  (2mt operations)
3. v := M v                                                              (O(m^2) operations)
4. Form N := (I - \frac{1}{\theta} M W^T Z Z^T W):
   • N := \frac{1}{\theta} W^T Z Z^T W                                   (2m^2 t + mt operations)
   • N := I - M N                                                        (O(m^3) operations)
5. v := N^{-1} v                                                         (O(m^3) operations)
6. \hat{d}^u := -\frac{1}{\theta} \hat{r}^c - \frac{1}{\theta^2} Z^T W v (2mt + t operations)
7. Find \alpha^* satisfying (5.8)                                        (t operations)
8. Compute \bar{x}_{k+1} as in (5.9)                                     (t operations)
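
A hedged NumPy sketch of this direct primal computation, written densely for clarity; the index array `free` plays the role of Z_k, and all names are ours.

```python
import numpy as np

def subspace_newton_direction(r_hat, W, M, theta, free):
    """Unconstrained subspace Newton direction d^u of (5.11), computed with the
    Sherman-Morrison-Woodbury formula.  r_hat is the reduced gradient (5.4) on
    the free variables, W and M come from the compact representation (3.2),
    and `free` holds the indices of the free variables."""
    WZ = W[free, :]                              # Z^T W, shape t x 2m
    v = M @ (WZ.T @ r_hat)                       # M W^T Z r^c
    N = np.eye(M.shape[0]) - (M @ (WZ.T @ WZ)) / theta
    v = np.linalg.solve(N, v)                    # (I - (1/theta) M W^T Z Z^T W)^{-1} v
    return -r_hat / theta - (WZ @ v) / theta**2  # eq. (5.11)

def truncate_to_box(d_u, xc, l, u, free):
    """Backtrack the step toward the box, eq. (5.8): the largest alpha <= 1
    keeping x^c + alpha * Z d^u within the bounds of the free variables."""
    alpha = 1.0
    for di, i in zip(d_u, free):
        if di > 0:
            alpha = min(alpha, (u[i] - xc[i]) / di)
        elif di < 0:
            alpha = min(alpha, (l[i] - xc[i]) / di)
    return alpha * d_u
```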
The total cost of this subspace minimization step based on the Sherman-Morrison-Woodbury
formula is

    2m^2 t + 6mt + 4t + O(m^3)                                                           (5.12)

operations. This is quite acceptable when t is small (i.e., when there are few free variables).
However, in many problems the opposite is true: few constraints are active and t is large. In this
case the cost of the direct primal method can be quite large, but the following mechanism can
provide significant savings.
Note that when t is large, it is the computation of the matrix

    W^T Z Z^T W

in Step 4, which requires 2m^2 t operations, that drives up the cost. Fortunately we can reduce
the cost when only a few variables enter or leave the active set from one iteration to the next
by saving the matrices Y^T Z Z^T Y, S^T Z Z^T Y and S^T Z Z^T S. These matrices can be updated to
account for the parts of the inner products corresponding to variables that have changed status,
and to add rows and columns corresponding to the new step. In addition, when t is much larger
than n - t, it seems more efficient to use the relationship Y^T Z Z^T Y = Y^T Y - Y^T A A^T Y, which
follows from (5.1), to compute Y^T Z Z^T Y. Similar relationships can be used for the matrices
S^T Z Z^T Y and S^T Z Z^T S. These devices can potentially result in significant savings, but they
have not been implemented in the code experimented with in Section 6.

5.2 A Primal Conjugate Gradient Method


Another approach for approximately solving the subspace problem (5.5) is to apply the conjugate
gradient method to the positive definite linear system

    \hat{B}_k \hat{d} = -\hat{r}^c,                                                      (5.13)

and stop the iteration when a boundary is encountered or when the residual is small enough.
Note that the accuracy of the solution controls the rate of convergence of the algorithm, once the
correct active set is identified, and should therefore be chosen with care. We follow Conn, Gould,
and Toint [8] and stop the conjugate gradient iteration when the residual \hat{r} of (5.13) satisfies

    \| \hat{r} \| < \min(0.1, \sqrt{\| \hat{r}^c \|}) \, \| \hat{r}^c \|.

We also stop the iteration at a bound when a conjugate gradient step is about to violate a bound,
thus guaranteeing that (5.6) is satisfied. The conjugate gradient method is appropriate here since
almost all of the eigenvalues of \hat{B}_k are identical.
We now describe the conjugate gradient method and give its operation counts. Note that the
effective number of variables is t, the number of free variables. Given \hat{B}_k, the following procedure
computes an approximate solution of (5.5).

The Conjugate Gradient Method

1. \hat{r} := \hat{r}^c computed by (5.4)                                      ((2m + 1)t + O(m^2) operations)

2. p := -\hat{r}, \hat{d} := 0, and \rho_2 := \hat{r}^T \hat{r}                (t operations)

3. Stop if \| \hat{r} \| < \min(0.1, \sqrt{\| \hat{r}^c \|}) \, \| \hat{r}^c \|

4. \alpha_1 := \max \{ \alpha : l - x^c \le \hat{d} + \alpha p \le u - x^c \}  (t operations)

5. q := \hat{B}_k p                                                            (4mt operations)

6. \alpha_2 := \rho_2 / p^T q                                                  (t operations)

7. If \alpha_2 > \alpha_1, set \hat{d} := \hat{d} + \alpha_1 p and stop;
   otherwise compute:
   • \hat{d} := \hat{d} + \alpha_2 p                                           (t operations)
   • \hat{r} := \hat{r} + \alpha_2 q                                           (t operations)
   • \rho_1 := \rho_2;  \rho_2 := \hat{r}^T \hat{r};  \beta := \rho_2 / \rho_1 (t operations)
   • p := -\hat{r} + \beta p                                                   (t operations)
   • go to 3

The matrix-vector multiplication of Step 5 should be performed as described in [6]. The total
operation count of this conjugate gradient procedure is approximately

    (2m + 2)t + (4m + 6)t x citer + O(m^2),                                              (5.14)

where citer is the number of conjugate gradient iterations. If we compare this with the cost
of the primal direct method (5.12), for t >> m, the direct method seems more efficient unless
citer \le m/2. Note that the costs of both methods increase as the number of free variables t
becomes larger. Since the limited-memory matrix B_k is a rank 2m correction of the identity
matrix, the termination properties of the conjugate gradient method guarantee that the subspace
problem will be solved in at most 2m conjugate gradient iterations.
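
For illustration, here is a hedged NumPy sketch of this truncated conjugate gradient iteration on the reduced system (5.13); the callable B_hat_times_p and the shifted bounds are our notation, not the authors' code.

```python
import numpy as np

def truncated_cg(B_hat_times_p, r_c, lower, upper, max_iter=None):
    """Conjugate gradient on B_hat d = -r_c, truncated at the box
    lower <= d <= upper (the shifted bounds l - x^c, u - x^c of (5.6)).
    B_hat_times_p(p) must return the reduced Hessian times p."""
    t = r_c.size
    d = np.zeros(t)
    r = r_c.copy()
    p = -r
    rho2 = r @ r
    tol = min(0.1, np.sqrt(np.linalg.norm(r_c))) * np.linalg.norm(r_c)
    for _ in range(max_iter or 2 * t):
        if np.linalg.norm(r) < tol:
            break
        # Largest step along p that keeps d inside the box.
        alpha1 = np.inf
        for i in range(t):
            if p[i] > 0:
                alpha1 = min(alpha1, (upper[i] - d[i]) / p[i])
            elif p[i] < 0:
                alpha1 = min(alpha1, (lower[i] - d[i]) / p[i])
        q = B_hat_times_p(p)
        alpha2 = rho2 / (p @ q)
        if alpha2 > alpha1:               # a bound would be violated: truncate and stop
            return d + alpha1 * p
        d += alpha2 * p
        r += alpha2 * q
        rho1, rho2 = rho2, r @ r
        p = -r + (rho2 / rho1) * p
    return d
```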
We point out that the conjugate gradient iteration could stop at a boundary even when the
unconstrained solution of the subspace problem is inside the box. Consider, for example, the
case when the unconstrained solution lies near a corner and the starting point of the conjugate
gradient iteration lies near another corner along the same edge of the box. Then the iterates
could soon fall outside of the feasible region. This example also illustrates the difficulties that
the conjugate gradient approach can have on nearly degenerate problems [11].
5.3 A Dual Method for Subspace Minimization
Since it often happens that the number of active bounds is small relative to the size of the
problem, it should be efficient to handle these bounds explicitly with Lagrange multipliers. Such
an approach is often referred to as a dual or a range space method (see [15]).
We will write

    x = x_k + d

and restrict x_k + d to lie on the subspace of free variables at x^c by imposing the condition

    A_k^T d = b_k \equiv A_k^T (x^c - x_k).

(Recall that A_k is the matrix of constraint gradients.) Using this notation, we formulate the
subspace problem as

    \min \; g_k^T d + \tfrac{1}{2} d^T B_k d                                             (5.15)
    subject to  A_k^T d = b_k                                                            (5.16)
                l \le x_k + d \le u.                                                     (5.17)

We first solve this problem without the bound constraint (5.17). The optimality conditions for
(5.15)-(5.16) are

    B_k d^* + g_k + A_k \lambda^* = 0,                                                   (5.18)
    A_k^T d^* = b_k.                                                                     (5.19)

Premultiplying (5.18) by A_k^T H_k, where H_k is the inverse of B_k, we obtain

    A_k^T d^* + A_k^T H_k g_k + A_k^T H_k A_k \lambda^* = 0,

and using (5.19), we obtain

    (A_k^T H_k A_k) \lambda^* = -A_k^T H_k g_k - b_k.                                    (5.20)

Since the columns of A_k are unit vectors and A_k has full column rank, we see that A_k^T H_k A_k is a
principal submatrix of H_k. Thus, (5.20) determines \lambda^*, and hence d^* is given by

    d^* = -H_k (g_k + A_k \lambda^*).                                                    (5.21)

(In the special case where there are no active constraints, we simply obtain B_k d^* = -g_k.) If the
vector x_k + d^* violates the bounds (5.17), we backtrack to the feasible region along the line joining
this infeasible point and the generalized Cauchy point x^c.
The linear system (5.20) can be solved by the Sherman-Morrison-Woodbury formula. Using
the inverse limited-memory BFGS matrix (3.7), and recalling the identity A_k^T A_k = I, we obtain

    A_k^T H_k A_k = \frac{1}{\theta} I + A_k^T \bar{W} \bar{M} \bar{W}^T A_k.

(We have again omitted the subscripts of \bar{M}, \bar{W} and \theta for simplicity.) Applying the Sherman-
Morrison-Woodbury formula, we obtain

    (A_k^T H_k A_k)^{-1} = \theta I - \theta^2 A_k^T \bar{W} \left( I + \theta \bar{M} \bar{W}^T A_k A_k^T \bar{W} \right)^{-1} \bar{M} \bar{W}^T A_k.          (5.22)

Given the set of active variables at x^c that determines the matrix of constraint gradients
A_k, and an inverse limited-memory BFGS matrix H_k, the following procedure implements the
dual approach just described. Let us recall that t denotes the number of free variables, and
let us define t_a = n - t, so that t_a denotes the number of active constraints at x^c. As before,
the operation counts given below include only multiplications and divisions, and m denotes the
number of corrections stored in the limited-memory matrix.

Dual Method for Subspace Minimization

• Compute b_k := A_k^T (x^c - x_k) and the right-hand side of (5.20),
      u := -\frac{1}{\theta} A_k^T g_k - A_k^T \bar{W} \bar{M} \bar{W}^T g_k - b_k
• Solve (A_k^T H_k A_k) \lambda^* = u for \lambda^*, using (5.22)
• w := \bar{M} \bar{W}^T (A_k \lambda^* + g_k)
• d^* := -\frac{1}{\theta} (A_k \lambda^* + g_k) - \bar{W} w                             ((2m + 1)n operations)
• Backtrack if necessary:
      compute \alpha^* := \max \{ \alpha : l_i \le x_i^c + \alpha (x_k + d^* - x^c)_i \le u_i, \; i \in F \}          (t operations)
      set \bar{x}_{k+1} := x^c + \alpha^* (x_k + d^* - x^c).                             (t operations)

Since the vectors S^T g_k and Y^T g_k have been computed while updating H_k [6], they can be
saved so that the product \bar{W}^T g_k requires no further computation.
The total number of operations of this procedure, when no bounds are active (t_a = 0), is
(2m + 1)n + O(m^2). If t_a bounds are active,

    (2m + 3)n + 9m t_a + 2m^2 t_a + O(m^3)

operations are required to compute the unconstrained subspace solution. By comparison, if
implemented as described above, the direct primal method requires 2m^2 t + 6mt + 4t + O(m^3)
operations for the subspace minimization. These figures indicate that the dual method would be
less expensive when the number of bound variables is much less than the number of free variables.
However, this comparison does not take into account the devices for saving costs in the
computation of inner products discussed at the end of Section 5.1. Similarly, in the dual case, the
cost of computing \bar{W}^T A_k A_k^T \bar{W} could be reduced by updating this matrix from one iteration to the
next and, if t_a > n - t_a, by computing the matrix \bar{W}^T Z Z^T \bar{W} first, and subtracting it from \bar{W}^T \bar{W}.
Such devices would require a more complex implementation, but might reduce the difference in
cost between the primal and dual approaches. In fact, the primal and dual approaches have more
in common than appears here, in that the matrix (I - \frac{1}{\theta} M W^T Z Z^T W)^{-1} M appearing in (5.10)
can be shown to be identical to the matrix (I + \theta \bar{M} \bar{W}^T A_k A_k^T \bar{W})^{-1} \bar{M} in (5.22).
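
The dual computation (5.20)-(5.21), including the final backtracking step, can be sketched densely as follows; this ignores the compact representation and the Sherman-Morrison-Woodbury economies, and all names are ours.

```python
import numpy as np

def dual_subspace_step(H, g, x, xc, active, l, u):
    """Unconstrained dual solve of (5.15)-(5.16): solve (5.20) for the
    multipliers lambda*, recover d* from (5.21), then backtrack toward the
    generalized Cauchy point xc if x + d* violates the bounds (5.17).
    H is a dense inverse Hessian approximation (for clarity only);
    `active` holds the indices of the variables at bound at xc."""
    A_cols = np.array(active, dtype=int)
    b = xc[A_cols] - x[A_cols]                  # b_k = A_k^T (x^c - x_k)
    AHA = H[np.ix_(A_cols, A_cols)]             # A_k^T H A_k: principal submatrix of H
    rhs = -(H[A_cols, :] @ g) - b               # -A_k^T H g_k - b_k, eq. (5.20)
    lam = np.linalg.solve(AHA, rhs)             # lambda*
    grad_plus = g.copy()
    grad_plus[A_cols] += lam                    # g_k + A_k lambda*
    d = -H @ grad_plus                          # d*, eq. (5.21)
    # Backtrack along the line joining x + d* and xc, as in the procedure above.
    step = x + d - xc
    alpha = 1.0
    for i in range(x.size):
        if step[i] > 0:
            alpha = min(alpha, (u[i] - xc[i]) / step[i])
        elif step[i] < 0:
            alpha = min(alpha, (l[i] - xc[i]) / step[i])
    return xc + alpha * step
```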

6 Numerical Experiments
We have tested our limited-memory algorithm using the three options for subspace minimiza-
tion (the direct primal, primal conjugate gradient, and dual methods) and compared the results
with those obtained with the subroutine SUBMIN of LANCELOT [10] using partitioned BFGS
updating. Both our code and LANCELOT were terminated when

    \| P(x_k - g_k, l, u) - x_k \|_\infty < 10^{-5}.                                     (6.1)

(Note from (2.2) that P(x_k - g_k, l, u) - x_k is the projected gradient.) The algorithm we tested is
given as follows.

Bound L-BFGS Algorithm

Choose a starting point x_0, and an integer m that determines the number of limited-memory
corrections stored. Define the initial limited-memory matrix to be the identity, and set k := 0.

1. If the convergence test (6.1) is satisfied, stop.

2. Compute the Cauchy point by Algorithm CP.

3. Compute a search direction d_k by the direct primal method, the conjugate gradient method,
or the dual method.

4. Using a backtracking line search, starting from the unit steplength, compute a steplength
\lambda_k such that x_{k+1} = x_k + \lambda_k d_k satisfies (2.5) with \alpha = 10^{-4}.

5. Compute \nabla f(x_{k+1}).

6. If y_k satisfies (3.9) with the small positive constant eps of Section 3, add s_k and y_k to S_k
and Y_k. If more than m updates are stored, delete the oldest column from S_k and Y_k.

7. Update the limited-memory matrix representations (3.2) and (3.7), set k := k + 1, and go
to 1.
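
Step 1's convergence test (6.1) amounts to the following check (a minimal sketch, with the tolerance quoted in (6.1)):

```python
import numpy as np

def converged(x, g, l, u, tol=1e-5):
    """Projected-gradient convergence test (6.1):
    || P(x - g, l, u) - x ||_inf < tol."""
    projected_gradient = np.clip(x - g, l, u) - x
    return np.linalg.norm(projected_gradient, np.inf) < tol
```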

Our code is written in double-precision Fortran 77. For the heapsort, during the generalized
Cauchy point computation, we use the Harwell routine KB12AD written by Gould [13]. The
backtracking line search was performed by the routine LNSRCH of Dennis and Schnabel [12].
For more details on how to update the limited-memory matrices in Step 7, see [6]. When testing
the routine SUBMIN of LANCELOT [10] we used the default options and BFGS updating.
We selected seven problems, two bound-constrained quadratic optimization problems from
the MINPACK-2 collection [2], and five nonlinear problems from the CUTE collection [4], to
test the algorithms. To study a variety of cases, we tightened the bounds on several problems,
resulting in more active bounds at the solutions of these problems. Table 1 lists the test problems
and the bounds added to those already given in the specification of the problem. The number
of variables is denoted by n, and the number of bounds active at the solution by n_a. We note
that in those problems without active bounds at the solution (n_a = 0), some bounds may become
active during the iteration.

Table 1: Test Problems

Problem      Variant   n      n_a    Reference        Additional Bounds
EDENSCH      1         2000   0      CUTE [4]         none
EDENSCH      2         2000   1                       [0, 1.5]  ∀ odd i
EDENSCH      3         2000   667                     [-1, 0.5]  i = 4, 7, 10, ...
EDENSCH      4         2000   999                     [0, 0.99]  ∀ odd i
EDENSCH      5         2000   100                     [0, 0.5]  ∀ odd i
LMINSURF     1         1024   124                     none
LMINSURF     2         1024   147                     [2, 10]  ∀ odd i
LMINSURF     3         1024   172                     [5, 10]  ∀ odd i
LMINSURF     4         1024   227                     [5.5, 6]  ∀ i
PENALTY 1    1         1000   0                       none
PENALTY 1    2         1000   0                       [0, 1]  ∀ odd i
PENALTY 1    3         1000   334                     [0.1, 1]  i = 4, 7, 10, ...
PENALTY 1    4         1000   500                     [0.1, 1]  ∀ odd i
RAYBENDL     1         44     4                       none
RAYBENDL     2         44     6                       [2, 95]  ∀ i
ORTHREG      1         133    0                       none
TORSION      1         1024   320    MINPACK-2 [2]    none
JOURNAL      1         1024   330    MINPACK-2 [2]    none

The results of our numerical tests are given in Table 2. All computations were performed
on a Sun SPARCstation 2 with a 40-MHz CPU and 32-MB memory. In every run all methods
converged to the same solution point; in fact, this was one requirement in the selection of the
test problems. The number of corrections stored in the limited-memory method was m = 4. We
record the number of iterations (iter), the number of inner conjugate gradient iterations (cg) for
those methods that use them, and the total CPU time (time). A * indicates that the convergence
tolerance (6.1) was not met.

Table 2: Test Results of Three Versions of the New Limited-Memory Method with m = 4 and
LANCELOT's Subroutine SUBMIN Using BFGS Updating

Problem      Variant   Primal       Dual         CG              LANCELOT/BFGS
                       iter/time    iter/time    iter/cg/time    iter/cg/time
EDENSCH 1 31137.1 26112.1 30/115/35.2 41141128.9
EDENSCH 2 17115.4 17110.9 20164120.7 62/105/241.8
EDENSCH 3 16111.2 16111.5 15/40/10.2 53119/33.8*
EDENSCH 4 1519.21 15112.4 16/42/10.2 32116116.9
EDENSCH 5 1217.18 1219.6 1212416.22 15/3/7.5
LMINSURF 1 166176.4 166162.3 168/1219/149 29516171177.2
LMINSURF 2 4201213 403/ 141 430/2312/309 1121499185.6
LMINSURF 3 4741239 4621165 542/2908/381 921444175.5
LMINSURF 4 107152.8 107145.4 126/577/75.2 431297134.9
PENALTY 1 1 96141.1 97121.5 981295145.8 471554173.2
PENALTY 1 2 66130.4 61/15 591158p5.8 551513197.1
PENALTY 1 3 30111.3 30/10.1 3015719.2 33/0/41.5
PENALTY 1 4 30110.9 30112 30/58/8.9 32/0/30.3
RAYB EN DL 1 1179140.9 976121.1 1733113755192.8 895122645165
RAY BENDL 2 1425153.1 998124.5 1737113137193.2 894120468157
ORTHREG 1 266119.3 442114.5 26211504127.8 114116418.4
TORSION 1 57127.7 55127.1 591329139.9 11/94/15.1
JOURNAL 1 132175.5 120154.9 155/660/91.7 81116116.54

The test results indicate that the dual method was the fastest of the three subspace mini-
mization approaches in most cases. This is to be expected for this implementation, given that
the number of active bounds at the solution was always less than n/2. The differences in the
number of iterations required by the direct primal and dual methods are due to rounding errors.
We note also that the direct primal method usually had a running time either close to or faster
than that for the conjugate gradient method. In view of the discussion in Section 5.2 and the
fact that there were usually more than m/2 = 2 conjugate gradient iterations per outer iteration
on the average, this is not surprising.
The tests described here are not intended to establish the superiority of LANCELOT or of
the new limited-memory algorithm, since these methods are designed for solving different types of
problems. LANCELOT is tailored for sparse or partially separable problems, whereas the limited-
memory method is well suited for unstructured or dense problems. We use LANCELOT simply
as a benchmark and, for this reason, ran it only with its default settings and did not experiment
with its various options to find the one that would give the best results on these problems. Also,
we used BFGS updating in LANCELOT (as opposed to SR1 updating) to minimize the differences
with our limited-memory code. However, a few observations on the two methods can be made.
The limited-memory method tends to spend less time per iteration than LANCELOT, but in
many cases requires more iterations. We also observed that LANCELOT is able to identify the
correct active set sooner. This is likely to be because LANCELOT tends to form a more accurate
Hessian approximation than the limited-memory matrix. It should be noted that the objective
function in all these problems has a significant degree of the kind of partial separability that
LANCELOT is designed to exploit. However, problem PENALTY 1 was coded so as to prevent
LANCELOT from exploiting partial separability, which may explain its poor performance on this
problem.
For comparison, we also tried the option in LANCELOT using exact Hessians. The results,
shown in Table 3, indicate that in most cases the number of iterations and the computational
time decreased significantly.

Table 3: Results of LANCELOT's Subroutine SUBMIN Using Exact Hessians

Problem      Variant   LANCELOT/Exact Hessian
                       iter/cg/time
EDENSCH 1 131131 11.8
EDENSCH 2 1113641 109.4
EDENSCH 3 13/14/9.1
EDENSCH 4 12/1R/R 17.2
EDENSCH 5 9/0/5.0
LMINSURF 1 2721 53 11138.2
LMINSURF 2 1231512186.7
LMINSURF 3 991660198.1
LMINSURF 4 25/30 1127.7
PENALTY 1 1 621776 199.1
PENALTY 1 2 541524192.4
PENALTY 1 3 33/0/40.9
PENALTY 1 4 3210j30.3
RAYBENDL 1 108110912.1
RAYBENDL 2 801951 1.5
ORTHREG 1 27/67/2.2
TORSION 1 11/80/12
JOURNAL 1 7/177/14.8

Taking everything together, the new algorithm has most of the efficiency of the unconstrained
limited-memory algorithm (L-BFGS) [19] together with the capability of handling bounds, at the
cost of a significantly more complex code. Like the unconstrained method, the bound limited-
memory algorithm's main advantages are its low computational cost per iteration, its modest
storage requirements, and its ability to solve problems in which the Hessian matrices are large,
unstructured, dense, and unavailable. It is less likely to be competitive when an exact Hessian
is available, or when significant advantage can be taken of sparsity. A well-documented and
carefully coded implementation of the algorithm described in this paper will soon be available
and can be obtained by contacting the authors at nocedal@eecs.nwu.edu.

References
[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms,
Reading, Mass.: Addison-Wesley Pub. Co., 1974.
[2] B. M. Averick and J. J. Moré, "User guide for the MINPACK-2 test problem collection," Argonne
National Laboratory, Mathematics and Computer Science Division Report ANL/MCS-TM-157,
Argonne, Ill., 1991.
[3] D. P. Bertsekas, "Projected Newton methods for optimization problems with simple constraints,"
SIAM J. Control and Optimization 20 (1982): 221-246.
[4] I. Bongartz, A. R. Conn, N. I. M. Gould, and Ph. L. Toint, "CUTE: Constrained and unconstrained
testing environment," Research Report, IBM T. J. Watson Research Center, Yorktown Heights,
New York, 1993.
[5] J. V. Burke and J. J. Moré, "On the identification of active constraints," SIAM J. Numer. Anal.
25, no. 5 (1988): 1197-1211.
[6] R. H. Byrd, J. Nocedal, and R. B. Schnabel, "Representation of quasi-Newton matrices and their
use in limited memory methods," Technical report, EECS Department, Northwestern University,
1991, to appear in Mathematical Programming.
[7] P. H. Calamai and J. J. Moré, "Projected gradient methods for linearly constrained problems,"
Mathematical Programming 39 (1987): 93-116.
[8] A. R. Conn, N. I. M. Gould, and Ph. L. Toint, "Testing a class of methods for solving minimization
problems with simple bounds on the variables," Mathematics of Computation 50, no. 182 (1988):
399-430.
[9] A. R. Conn, N. I. M. Gould, and Ph. L. Toint, "Global convergence of a class of trust region
algorithms for optimization with simple bounds," SIAM J. Numer. Anal. 25 (1988): 433-460.
[10] A. R. Conn, N. I. M. Gould, and Ph. L. Toint, "LANCELOT: A FORTRAN package for large-scale
nonlinear optimization (Release A)," Number 17 in Springer Series in Computational Mathematics,
Springer-Verlag, New York, 1992.
[11] A. R. Conn and J. Moré, Private communication, 1993.
[12] J. E. Dennis, Jr. and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and
Nonlinear Equations, Englewood Cliffs, N.J.: Prentice-Hall, 1983.
[13] Harwell Subroutine Library, Release 10, Advanced Computing Department, AEA Industrial
Technology, Harwell Laboratory, Oxfordshire, United Kingdom, 1990.
[14] J. C. Gilbert and C. Lemaréchal, "Some numerical experiments with variable storage quasi-Newton
algorithms," Mathematical Programming 45 (1989): 407-436.
[15] P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, London: Academic Press, 1981.
[16] A. A. Goldstein, "Convex programming in Hilbert space," Bull. Amer. Math. Soc. 70 (1964):
709-710.
[17] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables,
New York: Academic Press, 1970.
[18] E. S. Levitin and B. T. Polyak, "Constrained minimization problems," USSR Comput. Math. and
Math. Phys. 6 (1966): 1-50.
[19] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization,"
Mathematical Programming 45 (1989): 503-528.
[20] J. J. Moré and G. Toraldo, "Algorithms for bound constrained quadratic programming problems,"
Numer. Math. 55 (1989): 377-400.
[21] J. Nocedal, "Updating quasi-Newton matrices with limited storage," Mathematics of Computation
35 (1980): 773-782.

DISCLAIMER
This report was prepared as an account of work sponsored by an agency of the United States
Government. Neither the United States Government nor any agency thereof, nor any of their
employees, makes any warranty, express or implied, or assumes any legal liability or responsi-
bility for the accuracy, completeness, or usefulness of any information, apparatus, product, or
process disclosed, or represents that its use would not infringe privately owned rights. Refer-
ence herein to any specific commercial product, process, or service by trade name, trademark,
manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recom-
mendation, or favoring by the United States Government or any agency thereof. The views
and opinions of authors expressed herein do not necessarily state or reflect those of the
United States Government or any agency thereof.
