Sequential Quadratic Programming
Sequential Quadratic Programming
Sequential Quadratic Programming
(1996),
pp.
1{000
1 Introduction 2 The Basic SQP Method 3 Local Convergence 4 Merit Functions and Global Convergence 5 Global to Local Behavior 6 SQP Trust Region Methods 7 Practical Considerations References
1. Introduction
1 4 12 27 37 40 43 48
Since its popularization in the late 1970s, Sequential Quadratic Programming (SQP) has arguably become the most successful method for solving nonlinearly constrained optimization problems. As with most optimization methods, SQP is not a single algorithm, but rather a conceptual method from which numerous speci c algorithms have evolved. Backed by a solid theoretical and computational foundation, both commercial and public domain SQP algorithms have been developed and used to solve a remarkably large set of important practical problems. Recently large scale versions have been devised and tested with promising results. In this paper we examine the underlying ideas of the SQP method and the theory that establishes it as a framework from which e ective algorithms can
Contribution of the National Institute of Standards and Technology and not subject to copyright in the United States.
be derived. In the process we will describe the most popular manifestations of the method, discuss their theoretical properties, and comment on their practical implementations. The nonlinear programming problem to be solved is minimize f (x) x (NLP) subject to: h(x) = 0 g (x) 0 where f : Rn ! R, h : Rn ! Rm , and g : Rn ! Rp . Such problems arise in a variety of applications in science, engineering, industry, and management. In the form (NLP) the problem is quite general it includes as special cases linear and quadratic programs in which the constraint functions, h and g, are a ne and f is linear or quadratic. While these problems are important and numerous, the great strength of the SQP method is its ability to solve problems with nonlinear constraints. For this reason it is assumed that (NLP) contains at least one nonlinear constraint function. The basic idea of SQP is to model (NLP) at a given approximate solution, say xk , by a quadratic programming subproblem, and then to use the solution to this subproblem to construct a better approximation xk+1 . This process is iterated to create a sequence of approximations that, it is hoped, will converge to a solution x . Perhaps the key to understanding the performance and theory of SQP is the fact that, with an appropriate choice of quadratic subproblem, the method can be viewed as the natural extension of Newton and quasi-Newton methods to the constrained optimization setting. Thus one would expect SQP methods to share the characteristics of Newton-like methods, namely, rapid convergence when the iterates are close to the solution but possible erratic behavior that needs to be carefully controlled when the iterates are far from a solution. While this correspondence is valid in general, the presence of constraints makes both the analysis and implementation of SQP methods signi cantly more complex. Two additional properties of the SQP method should be pointed out. First, SQP is not a feasible-point method that is, neither the initial point nor any of the subsequent iterates need be feasible (a feasible point satis es all of the constraints of (NLP)). This is a major advantage since nding a feasible point when there are nonlinear constraints may be nearly as hard as solving (NLP) itself. SQP methods can be easily modi ed so that linear constraints, including simple bounds, are always satis ed. Second, the success of the SQP methods depends on the existence of rapid and accurate algorithms for solving quadratic programs. Fortunately, quadratic programs are easy to solve in the sense that there are a good procedures for their solution. Indeed, when there are only equality constraints the solution to a quadratic program reduces to the solution of a linear system of equations.
When there are inequality constraints a sequence of systems may have to be solved. A comprehensive theoretical foundation does not guarantee that a proposed algorithm will be e ective in practice. In real-life problems hypotheses may not be satis ed, certain constants and bounds may not be computable, matrices may be numerically singular, or the scaling may be poor. A successful algorithm needs adaptive safeguards that deal with these pathologies. The algorithmic details to overcome such di culties, as well as more mundane questions|how to choose parameters, how to recognize convergence, and how to carry out the numerical linear algebra|are lumped under the term \implementation". While a detailed treatment of this topic is not possible here, we will take care to point out questions of implementation that pertain speci cally to the SQP method. This survey is arranged as follows. In Section 2 we state the basic SQP method along with the assumptions about (NLP) that will hold throughout the paper. We also make some necessary remarks about notation and terminology. Section 3 treats local convergence, that is, behavior of the iterates when they are close to the solution. Rates of convergence are provided both in general terms and for some speci c SQP algorithms. The goal is not to present the strongest results but to establish the relation between Newton's method and SQP, to delineate the kinds of quadratic models that will yield satisfactory theoretical results, and to place current variations of the SQP method within this scheme. The term global is used in two di erent contexts in nonlinear optimization and is often the source of confusion. An algorithm is said to be globally convergent if, under suitable conditions, it will converge to some local solution from any remote starting point. Nonlinear optimization problems can have multiple local solutions the global solution is that local solution corresponding to the least value of f . SQP methods, like Newton's method and steepest descent, are only guaranteed to nd a local solution of (NLP) they should not be confused with algorithms for nding the global solution, which are of an entirely di erent avor. To establish global convergence for constrained optimization algorithms, a way of measuring progress towards a solution is needed. For SQP this is done by constructing a merit function, a reduction in which implies that an acceptable step has been taken. In Section 4, two standard merit functions are de ned and their advantages and disadvantages for forcing global convergence are considered. In Section 5, the problems of the transition from global to local convergence are discussed. In Sections 4 and 5 the emphasis is on line search methods. Because trust region methods, which have been found to be e ective in unconstrained optimization, have been extended to t into the SQP framework a brief description of this approach is given in
Section 6. Section 7 is devoted to implementation issues, including those associated with large scale problems. To avoid interrupting the ow of the presentation, comments on related results and references to the literature are provided at the end of each section. The number of papers on the SQP method and its variants is large and space prohibits us from compiling a complete list of references we have tried to give enough references to each topic to direct the interested reader to further material.
1.1. Notes and References The earliest reference to SQP-type algorithms seems to have been in the PhD thesis of Wilson (1963) at Harvard University, in which he proposed the method we call in Section 3 the Newton-SQP algorithm. The development of the secant or variable-metric algorithms for unconstrained optimization in the late 1960's and early 1970's naturally led to the extension of these methods to the constrained problem. The initial work on these methods was done by Mangasarian and his students at the University of Wisconsin. Garcia-Palomares and Mangasarian (1976) investigated an SQP-type algorithm in which the entire Hessian matrix of the Lagrangian, i.e., the matrix of second derivatives with respect to both to x and the multipliers, was updated at each step. Shortly thereafter, Han (Han 1976) and (Han 1977) provided further impetus for the study of SQP methods. In the rst paper Han gave local convergence and rate of convergence theorems for the PSBand BFGS-SQP algorithms for the inequality-constrained problem and, in the second, employed the `1 merit function to obtain a global convergence theorem in the convex case. In a series of papers presented at conferences, (Powell 1977), (Powell 1978a), and (Powell 1978b), Han's work was brought to the attention of the general optimization audience. From that time there has been a continuous production of research papers on the SQP method. As noted, a signi cant advantage of SQP is that feasible points are not required at any stage of the process. Nevertheless, a version of SQP that always remains feasible has been developed and studied, for example, by Bonnans, Panier, Tits and Zhou (1992).
2. The Basic SQP Method
2.1. Assumptions and Notation As with virtually all nonlinear problems, it is necessary to make some assumptions on the problem (NLP) that clarify the class of problems for which the algorithm can be shown to perform well. These assumptions, as well as the consequent theory of nonlinear programming, are also needed to describe the SQP algorithm itself. In this presentation we do not attempt to
make the weakest assumptions possible, but rather provide what we consider reasonable assumptions under which the main ideas can be illustrated. We make the blanket assumption that all of the functions in (NLP) are three times continuously di erentiable. We denote the gradient of a scalarvalued function by r, e.g., rf (x). (Here, and throughout the paper, all vectors are assumed to be column vectors we use the superscript t to denote transpose.) For vector-valued functions we also use r to denote the Jacobian of the function. For example, rh(x) = (rh1(x) rh2(x) : : : rhm (x)) : The Hessian of a scalar-valued function, denoted by the letter H , is de ned to be the symmetric matrix whose (i j )th component is 2 Hf (x) = @ f (x) :
ij
Where a function is de ned on more than one set of variables, such as the Lagrangian function de ned below, the di erentiation operators r and H will refer to di erentiation with respect to x only. A key function, one that plays a central role in all of the theory of constrained optimization, is the scalar-valued Lagrangian function de ned by L(x u v) = f (x) + uth(x) + vtg(x) (2:1) where u 2 Rm and v 2 Rp are the multiplier vectors. Given a vector x, the set of active constraints at x consists of the inequality constraints, if any, satis ed as equalities at x. We denote the index set of active constraints by I (x) = fi : gi(x) = 0g: The matrix G(x) made up of the matrix rh(x) along with the columns rgi(x) i 2 I (x) will be important in describing the basic assumptions and carrying out the subsequent analyses. Assuming that the matrix G(x) has full column rank, the null space of G(x)t de nes the tangent space to the equality and active inequality constraints at x. The projection onto this tangent space can be written as P (x) = I ; G(x) G(x)tG(x) ;1 G(x)t: (2:2) The corresponding projection onto the range space of G(x) will be written Q(x) = I ; P (x): (2:3) For convenience, these projection matrices evaluated at iterates xk and at a solution x will be denoted by P k , P , Qk , and Q . Similarly, we will write H L = H L(x u v ) throughout the remainder of this paper.
@ xi @ xj
In this paper x will represent any particular local solution of (NLP). We assume that the following conditions apply to each such solution. A1: The rst order necessary conditions hold, i.e., there exist optimal multiplier vectors u and v 0 such that rL(x u v ) = rf (x ) + rh(x )u + rg(x )v = 0: A2: The columns of G(x ) are linearly independent. A3: Strict complementary slackness holds, i.e., g i (x ) v i = 0 for i = 1 : : : p and, if gi (x ) = 0, then vi > 0. A4: The Hessian of the Lagrangian function with respect to x is positive de nite on the null space of G(x )t i.e., dt H L d > 0 for all d 6= 0 such that G(x )t d = 0: The above conditions, sometimes called the strong second order su cient conditions, in fact guarantee that x is an isolated local minimum of (NLP) and that the optimal multiplier vectors u and v are unique. It should be noted that without strong additional assumptions (NLP) may have multiple local solutions. We use the term critical point to denote a feasible point that satis es the rst order necessary conditions A1. A critical point may or may not be a local minimum of (NLP). In discussing convergence of SQP algorithms the asymptotic rate of convergence plays an important role. Three standard measures of convergence rates will be emphasized in this paper. In the de nitions that follow, and throughout the paper, the norm k k will refer to the 2-norm unless speci cally noted otherwise. Other norms can be used for most of the analysis. De nition 2.1 Let fxk g be a sequence converging to x . The sequence is said to converge linearly if there exists a positive constant < 1 such that
xk+1 ; x xk ; x
for all k su ciently large superlinearly if there exists a sequence of positive constants k ! 0 such that
xk+1 ; x
k xk ; x
for all k su ciently large and quadratically if there exists a positive constant such that
xk+1 ; x xk ; x
2
for all k su ciently large. These rates, which measure improvement at each step, are sometimes referred to as Q-rates. We will also have occasion to state results in terms of a measure of the average rate of convergence called the R-rate. A sequence fxk g will be said to converge R-linearly if the sequence f xk ; x g is bounded by a sequence that converges Q-linearly to zero. Similar de nitions exist for R-superlinear and R-quadratic. There are two relations between R-rate and Q-rate convergence that are of importance for this paper. First, m-step Q-linear convergence (where k + 1 is replaced by k + m in the above de nitions) implies an R-linear rate of convergence and, second, a Q-rate of convergence of a sequence of vectors implies the same R-rate (but not the same Q-rate) of convergence of its components. In the analyses that follow, rates not designated explicitly as Q- or R-rates are assumed to be Q-rates.
2.2. The Quadratic Subproblem As suggested in the Introduction the SQP method is an iterative method in which, at a current iterate xk , the step to the next iterate is obtained through information generated by solving a quadratic subproblem. The subproblem is assumed to re ect in some way the local properties of the original problem. The major reason for using a quadratic subproblem, i.e., a problem with a quadratic objective function and linear constraints, is that such problems are relatively easy to solve and yet, in their objective function, can re ect the nonlinearities of the original problem. The technical details of solving quadratic programs will not be dealt with here, although the algorithmic issues involved in solving these quadratic subproblems are nontrivial and their resolution will a ect the overall performance of the SQP algorithm. Further comments on this are made in Section 7. A major concern in SQP methods is the choice of appropriate quadratic subproblems. At a current approximation xk a reasonable choice for the constraints is a linearization of the actual constraints about xk . Thus the quadratic subproblem will have the form t minimize (rk )t dx + 1 2 dx Bk dx dx subject to: rh(xk )t dx + h(xk ) = 0 rg(xk )tdx + g(xk ) 0
where dx = x ; xk . The vector rk and the symmetric matrix Bk remain to be chosen. The most obvious choice for the objective function in this quadratic program is the local quadratic approximation to f at xk . That is, Bk is taken as the Hessian and rk as the gradient of f at xk . While this is a reasonable approximation to use if the constraints are linear, the presence of nonlin-
ear constraints makes this choice inappropriate. For example, consider the problem 2 minimize ;x1 ; 1 2 (x2 ) x subject to: (x1 )2 + (x2)2 ; 1 = 0: The point (1 0) is a solution satisfying A1{A4, but at the point (1 + 0) the approximating quadratic program is (with d replacing dx )
1 (d2 )2 minimize ;d1 ; 2 d ;1 subject to: d1 = ; 1 2 (2 + )(1 + )
which is unbounded regardless of how small is. Thus the algorithm that computes d breaks down for this problem. To take nonlinearities in the constraints into account while maintaining the linearity of the constraints in the subproblem, the SQP method uses a quadratic model of the Lagrangian function as the objective. This can be justi ed by noting that conditions A1{A4 imply that x is a local minimum for the problem minimize L(x u v ) x subject to: h(x) = 0 g(x) 0: Note that the constraint functions are included in the objective function for this equivalent problem. Although the optimal multipliers are not known, approximations uk and v k to the multipliers can be maintained as part of the iterative process. Then given a current iterate, (xk uk vk ), the quadratic Taylor series approximation in x for the Lagrangian is
t k k k L(xk uk vk ) + rL(xk uk vk )tdx + 1 2 dx H L(x u v )dx:
A strong motivation for using this function as the objective function in the quadratic subproblem is that it generates iterates that are identical to those generated by Newton's method when applied to the system composed of the rst order necessary condition (condition A1) and the constraint equations (including the active inequality constraints). This means that the resulting algorithm will have good local convergence properties. In spite of these local convergence properties there are good reasons to consider choices other than the actual Hessian of the Lagrangian, for example approximating matrices that have properties that permit the quadratic subproblem to be solved at any xk and the resulting algorithm to be amenable to a global convergence analysis. Letting Bk be an approximation of H L(xk uk vk ), we can write
the quadratic subproblem as: t minimize rL(xk uk vk )tdx + 1 2 dx Bk dx dx (2:4) subject to: rh(xk )t dx + h(xk ) = 0 k t k rg(x ) dx + g(x ) 0: The form of the quadratic subproblem most often found in the literature, and the one that will be employed here, is t minimize rf (xk )tdx + 1 2 dx Bk dx dx (QP) subject to: rh(xk )t dx + h(xk ) = 0 k t k rg(x ) dx + g(x ) 0: These two forms are equivalent for problems with only equality constraints since, by virtue of the linearized constraints, the term rh(xk )tdx is constant t and the objective function becomes rf (xk )t dx + 1 2 dx Bk dx . The two subproblems are not quite equivalent in the inequality-constrained case unless the multiplier estimate vk is zero for all inactive linear constraints. However (QP) is equivalent to (2:4) for the slack-variable formulation of (NLP) given by minimize f (x) xz subject to: h(x) = 0 (2:5) g(x) + z = 0
z
where z 2 Rp is the vector of slack variables. Therefore (QP) can be considered an appropriate quadratic subproblem for (NLP). The solution dx of (QP) can be used to generate a new iterate xk+1 , by taking a step from xk in the direction of dx . But to continue to the next iteration new estimates for the multipliers are needed. There are several ways in which these can be chosen, but one obvious approach is to use the optimal multipliers of the quadratic subproblem. (Other possibilities will be discussed where appropriate.) Letting the optimal multipliers of (QP) be denoted by uqp and vqp , and setting du = uqp ; uk dv = v qp ; v k allow the updates of (x,u,v) to be written in the compact form xk+1 = xk + dx uk+1 = uk + du (2:6) vk+1 = vk + dv for some selection of the steplength parameter . Once the new iterates
10
are constructed, the problem functions and derivatives are evaluated and a prescribed choice of Bk+1 calculated. It will be seen that the e ectiveness of the algorithm is determined in large part by this choice.
2.3. The Basic Algorithm Since the quadratic subproblem (QP) has to be solved to generate steps in the algorithm, the rst priority in analyzing an SQP algorithm is to determine conditions that guarantee that (QP) has a solution. To have a solution the system of constraints of (QP) must have a nonempty feasible set (i.e., the system must be consistent) and the quadratic objective function should be bounded below on that set (although a local solution can sometimes exist without this condition). The consistency condition can be guaranteed when xk is in a neighborhood of x by virtue of Assumption A2 but, depending on the problem, may fail at nonlocal points. Practical means of dealing with this possibility will be considered in Section 7. An appropriate choice of Bk will assure that a consistent quadratic problem will have a solution the discussion of this point will be a signi cant part of the analysis of the next two sections. Assuming that (QP) can be solved, the question of whether the sequence generated by the algorithm will converge must then be resolved. As described in the Introduction, convergence properties generally are classi ed as either local or global. Local convergence results proceed from the assumptions that the initial x-iterate is close to a solution x and the initial Hessian approximation is, in an appropriate sense, close to H L . Conditions on the updating schemes that ensure that the xk (and the Bk ) stay close to x (and H L ) are then formulated. These conditions allow one to conclude that (QP) is a good model of (NLP) at each iteration and hence that the system of equations de ned by the rst order conditions and the constraints for (QP) are nearly the same as those for (NLP) at x . Local convergence proofs can be modeled on the proof of convergence of the classical Newton's method, which assumes = 1 in (2:6). Convergence from a remote starting point is called global convergence. As stated in the Introduction, to ensure global convergence the SQP algorithm needs to be equipped with a measure of progress, a merit function , whose reduction implies progress towards a solution. In order to guarantee that is reduced at each step a procedure for adjusting the steplength parameter in (2:6) is required. Using the decrease in it can be shown that under certain assumptions the iterates will converge to a solution (or, to be precise, a potential solution) even if the initial x-iterate, x0, is not close to a solution. Local convergence theorems are based on Newton's method whereas global convergence theorems are based on descent. Ideally an algorithm should be such that as the iterates get close to the solution, the conditions for
11
local convergence will be satis ed and the local convergence theory will apply without the need for the merit function. In general a global algorithm, although ultimately forcing the x-iterates to get close to x , does not automatically force the other local conditions (such as unit steplengths and close Hessian approximations) to hold and therefore the merit function must be used throughout the iteration process. Since there is no practical way to know when, if ever, the local convergence conditions will hold an implementation's true performance can be deduced only from numerical experience. These issues will be discussed more fully in Section 5. With this background we can now state a basic SQP algorithm. This template indicates the general steps only, without the numerous details required for a general code. Given approximations (x0 u0 v0 ), B0 , and a merit function , set k = 0. 1. Form and solve (QP) to obtain (dx du dv ). 2. Choose steplength so that (xk + dx ) < (xk ): 3. Set xk+1 = xk + dx uk+1 = uk + du v k+1 = v k + dv : 4. Stop if converged. 5. Compute Bk+1 . 6. Set k := k + 1 go to 1.
2.4. Notes and References The basic second order su cient conditions as well as a description of the major theoretical ideas of nite-dimensional nonlinear constrained optimization can be found in numerous sources. See, e.g., (Luenberger 1984), (Nash and Sofer 1995), or (Gill, Murray and Wright 1981). The basic de nitions for various rates of convergence and the relations among them can be found in (Ortega and Rheinboldt 1970) and (Dennis and Schnabel 1983). The trust region methods discussed in Section 6 use di erent quadratic subproblems than those given here. Also Fukushima (1986) considers a di erent quadratic subproblem in order to avoid the Maratos e ect discussed in Section 5.
Basic Algorithm
12
3. Local Convergence
In this section the theory of local convergence for the most common versions of the SQP method is developed. The local convergence analysis establishes conditions under which the iterates converge to a solution and at what rate, given that that the starting data (e.g., x0 , u0 , v 0, B0 ) are su ciently close to the corresponding data at a solution, x . As will be seen, it is the method of generating the matrices Bk that will be the determining factor in obtaining the convergence results. An SQP method can have excellent theoretical local convergence properties quadratic, superlinear, or two-step superlinear convergence of the x - iterates can be achieved by requiring the Bk to approximate H L in an appropriate manner. Although each of these local convergence rates is achievable in practice under certain conditions, the need for a globally convergent algorithm further restricts the choice of the Bk . Certain choices of the Bk have led to algorithms that extensive numerical experience has shown to be quite acceptable. Therefore, while there is a gap between what is theoretically possible and what has been achieved in terms of local convergence, this discrepancy should not obscure the very real progress that the SQP methods represent in solving nonlinearly constrained optimization problems. Two important assumptions simplify the presentation. First, it will be assumed that the active inequality constraints for (NLP) at x are known. As will be discussed in Section 5, this assumption can be justi ed for many of the SQP algorithms because the problem (QP) at xk will have the same active constraints as (NLP) at x when xk is near x . The fact that the active constraints, and hence the inactive constraints, are correctly identi ed at xk means that those inequality constraints that are inactive for (NLP) at x can be ignored and those that are active can be changed to equality constraints without changing the solution of (QP). Thus, under this assumption, only equality-constrained problems need be considered for the local analysis. For the remainder of this section (NLP) will be assumed to be an equalityconstrained problem and the reference to inequality multiplier vectors (the vt g (x) term) will be eliminated from the Lagrangian. For reference, we rewrite the quadratic subproblem with only equality constraints: t minimize rf (xk )tdx + 1 2 dx Bk dx dx (ECQP ) subject to: rh(xk )t dx + h(xk ) = 0: The second assumption follows from the fact that the local convergence for SQP methods, as for many optimization methods, is based on Newton's method that is, the convergence and rate of convergence results are obtained by demonstrating an asymptotic relation between the iterates of the method being analyzed and the classical Newton steps. Because Newton's
13
method for a system with a nonsingular Jacobian at the solution requires that unit steps be taken to obtain rapid convergence, we assume that the merit function allows = 1 to be used in the update formulas (2.6). Similar results to those given in this section can be obtained if the steplengths converge to one su ciently quickly. For the equality-constrained problem a good initial estimate of x can be used to obtain a good estimate for the optimal multiplier vector. The rst order necessary condition, A1, together with condition A2, leads to the formula u = ; rh(x )t A rh(x )];1 rh(x )t A rf (x ): (3:1) for any nonsingular matrix A that is positive de nite on the null space of rh(x )t. In particular, if A is taken to be the identity matrix, then (3:1) de nes the least squares solution of the rst order necessary conditions. By our smoothness assumptions, it follows that u0 = ; rh(x0 )t rh(x0)];1 rh(x0 )trf (x0 ) (3:2) can be made arbitrarily close to u by choosing x0 close to x . Consequently, no additional assumption will be made about an initial optimal multiplier estimate for the local convergence analysis. Denoting the optimal multiplier for (ECQP) by uqp we see that the rst order conditions for this problem are Bk dx + rh(xk )uqp = ;rf (xk ) rh(xk )tdx = ;h(xk ): If, as discussed in Section 2, we set uk+1 = uqp = uk + du (3:3) then the above equations become Bk dx + rh(xk )du = ;rL(xk uk ) (3.4) t k k rh(x ) dx = ;h(x ): (3.5) We are now in a position to begin the analysis of the SQP methods.
3.1. The Newton SQP Method The straightforward SQP method derived from setting Bk = H L(xk uk ) will be analyzed rst. Assuming that x0 is close to x it follows from (3.2) that u0 can be presumed to be close to u and hence that H L(x0 u0) is close to H L . The local convergence for the SQP algorithm now follows
14
From assumptions A1 and A4 the Jacobian of this system at the solution, H L rh(x ) J (x u ) = rh (3:6) (x )t 0 is nonsingular. Therefore, the Newton iteration scheme xk+1 = xk + sx uk+1 = uk + su where the vector s = (sx su ) is the solution to J (xk uk ) s = ; (xk uk ) (3:7) yields iterates that converge to (x u ) quadratically provided (x0 u0) is su ciently close to (x u ). The equations (3.7) are identical to equations (3.4) and (3.5) with Bk = H L(xk uk ), dx = sx, and du = su . Consequently the iterates (xk+1 uk+1 ) are exactly those generated by the SQP algorithm. For this reason we call this version of the algorithm the (local) Newton SQP method. The basic local convergence results are summarized in the following theorem: Theorem 3.1 Let x0 be an initial estimate of the solution to (NLP) and let u0 be given by (3:2). Suppose that the sequence of iterates f(xk uk )g is generated by xk+1 = xk + dx uk+1 = uk + du where dx and uqp = uk + du are the solution and multiplier of the quadratic program (ECQP) with Bk = H L(xk uk ). Then, if x0 ; x is su ciently small, the sequence of iterates in (x u)-space is well-de ned and converges quadratically to the pair (x u ). It is important to emphasize that this version of the SQP algorithm always requires a unit step in the variables there is no line search to try to determine a better point. If a line search is used (the so-called damped Newton method) then the rate of convergence of the iterates can be signi cantly decreased| usually to linear. In a sense, the Newton version of the SQP algorithm can be thought of as the ideal algorithm for solving the nonlinear program: it provides rapid convergence with no requirement for line searches and no necessity for the
from the application of Newton's method to the nonlinear system of equations obtained from the rst order necessary conditions and the constraint equation: (x u) = 0: (x u) = rL h(x)
15
introduction of additional parameters. Of course, in this form it has little practical value. It is often di cult to choose an initial point close enough to the true solution to guarantee that the Newton algorithm will converge to a minimum of f . When remote from the solution H L(xk uk ) cannot be assumed to be positive de nite on the appropriate subspace and hence a solution to the subproblem (ECQP) cannot be guaranteed to exist. The value of this Newton SQP method is that it provides a standard against which the local behavior of other versions can be measured. In fact, to obtain superlinear convergence it is necessary and su cient that the steps aproach the Newton steps as the solution is neared. Despite the disadvantages, the use of H L(xk uk ) often makes (ECQP) an excellent model of (NLP) and its use, combined with some of the techniques of Sections 6 and 7, can lead to e ective algorithms. The actual computation of H L(xk uk ) can be accomplished e ciently by using nite di erence techniques or by automatic di erentiation methods.
3.2. Conditions for Local Convergence Here we discuss the theoretical properties of the approximation matrices, Bk , that are su cient to guarantee that the local SQP method will give Newton-like convergence results. In the following subsections the practical attempts to generate matrices that satisfy these properties will be described. In order to formulate conditions on Bk that will yield locally convergent SQP algorithms, the following assumption on H L will be imposed. A5: The matrix H L is nonsingular. In light of A4 most of the results that follow could be extended to cover the case where this condition is not satis ed, but the resulting complexity of the arguments would clutter the exposition. The following conditions on the matrix approximations will be referenced in the remainder of the paper. B1: The matrices Bk are uniformly positive de nite on the null spaces of the matrices rh(xk )t , i.e.,there exists a 1 > 0 such that for each k
dtBk d
2 1 kdk
rh(xk )td = 0:
>0
B3: The matrices Bk have uniformly bounded inverses, i.e., there exists a
16
As the SQP methods require the solution of the quadratic subproblem at each step, it is imperative that the choice of matrices Bk make that possible. The second order su cient conditions for a solution to (ECQP) to exist require that the matrix Bk be positive de nite on the null space of rh(xk )t (cf. A4). Condition B1 is a slight strengthening of this requirement. Assumption A5 and the fact that the matrices Bk are to approximate the Hessian of the Lagrangian suggest that conditions B2 and B3 are not unreasonable. Under assumptions B1{B3 (3.4) and (3.5) have a solution provided (xk ,uk ) is su ciently close to (x , u ). Moreover, the solutions can be written in the form du = rh(xk )tBk ;1 rh(xk )];1 h(xk ) ; rh(xk )t Bk ;1 rL(xk uk )] (3:8) and dx = ;Bk ;1 rL(xk uk+1 ) (3:9) where uk+1 is given by (3.3). In particular (3.8) leads to the relation uk+1 = rh(xk )t Bk ;1 rh(xk )];1 h(xk ) ; rh(xk )t Bk ;1 rf (xk )] W (xk Bk ): Setting A = Bk ;1 in (3.1) yields u = W (x Bk ): It can now be deduced that uk+1 ; u = W (xk Bk ) ; W (x Bk ) = rh(x )t Bk ;1 rh(x )];1 rh(x )t Bk ;1 (Bk ; H L )(xk ; x ) +wk (3.10) where, by assumptions B2 and B3,
wk xk ; x
2
for some independent of k. It should be noted that (3.10) can be used in conjunction with Theorem 3.1 to prove that the sequence fxk g is quadratically convergent when the Newton SQP method is used and H L is nonsingular. In that case fuk g converges R-quadratically. Equations (3.9) and (3.10) and A2 now yield h i xk+1 ; x = xk ; x ; Bk ;1 rL(xk uk+1 ) ; rL(x u ) h i = Bk ;1 (Bk ; H L )(xk ; x ) ; rh(x )(uk+1 ; u )
17
2
+O xk ; x where
= Bk ;1 Vk (Bk ; H L )(xk ; x ) + O xk ; x
(3.11)
The projection operator de ned by (2.2), which in the equality-constrained case has the form P (x) = I ; rh(x) rh(x)trh(x)];1rh(x)t satis es Vk P k = Vk : Thus (3.11) leads to the inequality xk+1 ; x Bk ;1 kVk k P k (Bk ; H L )(xk ; x ) +O xk ; x
2
(3.12)
Using induction the above analysis can be made rigorous to yield the following local convergence theorem. Theorem 3.2 Let assumptions B1{B3 hold. Then there exist positive constants and such that if
x0 ; x < u0 ; u <
rithm are well de ned and converge linearly to x and (x u ) respectively. The sequence fuk g converges R-linearly to u . Condition (3.13) is, in conjuction with the other hypotheses of the theorem, almost necessary for linear convergence. In fact, if the other hypotheses of the theorem hold and the sequence fxk g converges linearly, then there exists a such that P k (Bk ; H L )(xk ; x ) < (xk ; x ) for all k. These results and those that follow indicate the crucial role that the projection of (Bk ; H L ) plays in the theory. The inequality (3.13) is
P k (Bk ; H L )(xk ; x ) < xk ; x (3:13) for all k, then the sequences fxk g and f(xk uk )g generated by the SQP algo-
and
18
guaranteed by either of the stronger conditions: P k (Bk ; H L ) or k(Bk ; H L )k (3:14) which are easier to verify in practice. In order to satisfy (3.14) it is not necessary that the approximations converge to the true Hessian but only that the growth of the di erence kBk ; H L k be kept under some control. In the quasi-Newton theory for unconstrained optimization this can be accomplished if the approximating matrices have a property called bounded deterioration. This concept can be generalized from unconstrained optimization to the constrained setting in a straightforward way. De nition 3.1 A sequence of matrix approximations, fBk g, for the SQP method is said to have the bounded deterioration property if there exist constants 1 and 2 independent of k such that kBk+1 ; H L k (1 + 1 k ) kBk ; H L k + 2 k (3:15) where xk ; x uk+1 ; u uk ; u g: k = maxf xk+1 ; x It seems plausible that the bounded deterioration condition when applied to the SQP process will lead to a sequence of matrices that satisfy B2, B3, and (3.13) provided the initial matrix is close to the Hessian of the Lagrangian and (xk ,uk ) is close to (x ,u ). Indeed this is the case, as can be shown by induction to lead to the following result. Theorem 3.3 Suppose that the sequence of iterates f(xk uk )g is generated by the SQP algorithm and the sequence fBk g of symmetric matrix approximations satis es B1 and (3.15). If x0 ; x and kB0 ; H L k are su ciently small and u0 is given by (3:2) then the hypotheses of Theorem 3.2 hold. The linear convergence for the iterates guaranteed by the above theorem is hardly satisfactory in light of the quadratic convergence of the Newton SQP method. As would be expected, a stronger relation between the approximation matrices, Bk , and the Hessian of the Lagrangian is needed to improve the rate of convergence however, this relation still depends only on the projection of the di erence between the approximation and the true Hessian. With some e ort the following theorem can be deduced as a consequence of the inequalities above. Theorem 3.4 Let assumptions B1{B3 hold and let the sequence f(xk uk )g
19
be generated by the SQP algorithm. Assume that xk ! x . Then the sequence fxk g converges to x superlinearly if and only if the matrix approximations satisfy P k (Bk ; H L )(xk+1 ; xk ) lim = 0: (3:16) k!1 (xk+1 ; xk ) If this equation holds then the sequence fuk g converges R-superlinearly to u and the sequence f(xk uk )g converges superlinearly. Not that this theorem requires convergence of the x-iterates (3.16) does not appear to be enough to yield convergence by itself. A slightly di erent result that uses a two-sided approximation of the Hessian approximation can be obtained by writing (3:16) as ( k k sk P k (Bk ; H L )Qk sk ) P ( B ; H L ) P k + =0 (3:17) lim
k!1
where sk = (xk+1 ; xk ). It can be shown that if only the rst term goes to zero then a weaker form of superlinear convergence holds. Theorem 3.5 Let assumptions B1{B3 hold and let the sequence f(xk uk )g be generated by the SQP algorithm. Assume that xk ! x . Then, if P k (Bk ; H L )P k (xk+1 ; xk ) lim =0 (3:18) k!1 (xk+1 ; xk ) the sequence fxk g converges to x two-step superlinearly. If the sequence fxk+1 ; xk g approaches zero in a manner tangent to the null space of the Jacobians, i.e., k k+1 k lim Q (x ; x ) = 0
k!1
ksk k
ksk k
xk+1 ; xk
then (3.18) implies (3.16) and superlinear convergence results. This tangential convergence has been observed in practice for some of the methods discussed below. It has been di cult to nd useful updating schemes for the Bk that satisfy these conditions for linear and superlinear convergence. Two basic approaches have been tried for generating good matrix approximations. In full Hessian approximations the matrices Bk are chosen to approximate H L while in reduced Hessian approximations only matrices that approximate the Hessian on the null space of the Jacobians of the constraints are computed. Each of these methods will be described in turn.
20
3.3. Full Hessian Approximations An obvious choice of the matrix Bk is a nite di erence approximation of the Hessian of the Lagrangian at (xk uk ). It is clear from the Theorems 3.2 and 3.4 that if a nite di erence method is used, the resulting sequence will be superlinearly convergent if the nite di erence step size goes to zero and will have rapid linear convergence for xed step size. Of course using nite di erence approximations, while not requiring evaluation of second derivatives, su ers from the same global di culties as the Newton SQP method described in the earlier subsection. As has been seen, one way of obtaining convergence is to use a sequence of Hessian approximations that satis es the bounded deterioration property. However, this property is not, by itself, enough to guarantee that (3.16) is satis ed and hence it does not guarantee superlinear convergence. The condition (3.16) essentially requires the component of the step generated by the Hessian approximation in the direction of the null space to converge asymptotically to the corresponding component of the step generated by the Newton SQP method. In the following a class of approximations called secant approximations, which satisfy a version of this condition and have the bounded deterioration property as well, is considered. Under the smoothness assumptions of this paper the Lagrangian satis es rL(xk+1 uk+1) ; rL(xk uk+1) H L(xk+1 uk+1)(xk+1 ; xk ) with equality if the Lagrangian is quadratic in x. As a result, it makes sense to approximate the Hessian of the Lagrangian at (xk+1 ,uk+1 ) by requiring Bk+1 to satisfy Bk+1 (xk+1 ; xk ) = rL(xk+1 uk+1 ) ; rL(xk uk+1 ) (3:19) especially since this approximation strongly suggests that (3.16) is likely to be satis ed near the solution. Equation (3.19) is called the secant equation it plays an important role in the algorithmic theory of nonlinear systems and unconstrained optimization. A common procedure for generating the Hessian approximations that satisfy (3:19) is to compute (xk+1 uk+1 ) with a given Bk and then to update Bk according to Bk+1 = Bk + Uk where Uk is a rank-one or rank-two matrix that depends on the values of Bk , xk , uk , xk+1 , and uk+1 . In choosing secant approximations that have the bounded deterioration property, it is natural to look to those updating schemes of this type that have been developed for unconstrained optimization. The rank-two Powell-symmetric-Broyden (PSB) formula gives one such
21
updating scheme. For the constrained case this update is given by (y ; Bk s)st + s(y ; Bk s)t ] Bk+1 = Bk + (s1 t s) Bk s)t s s st ; (y ; (3.20) (st s)2 where s = xk+1 ; xk (3:21) and y = rL(xk+1 uk+1 ) ; rL(xk uk+1 ): (3:22) The PSB-SQP algorithm which employs this update has been shown to have the desired local convergence properties. Theorem 3.6 Suppose that the sequence of iterates f(xk uk )g is generated by the SQP algorithm using the sequence fBk g of matrix approximations generated by the PSB update formulas, (3.20){(3.22). Then, if x0 ; x and kB0 ; H L(x u )k are su ciently small and u0 is given by (3:2), the sequence fBk g is of bounded deterioration and the iterates (xk uk ) are well de ned and converge superlinearly to (x ,u ). In addition the xiterates converge superlinearly and the multipliers converge R-superlinearly. Note that there is no assumption of positive de niteness here. In fact, while B2 and B3 are satis ed as a result of the bounded deterioration, B1 does not necessarily hold. Consequently, dx and du are not necessarily solutions of (ECQP), but of the rst order conditions (3.4) and (3.5). As a practical method, the PSB-SQP algorithm has the advantage over the Newton SQP method of not requiring the computation of the Hessian of the Lagrangian (but, as a result, yielding only superlinear rather than quadratic convergence). However, because the matrices are not necessarily positive de nite it su ers from the same serious drawback the problem (ECQP) may not have a solution if the initial starting point is not close to the solution. Consequently, it does not appear to be useful in establishing a globally convergent algorithm (see, however, Section 6). As mentioned above, solving (ECQP) requires that the matrices be positive de nite on the appropriate subspace for each subproblem. One way to enforce this is to require the matrices to be positive de nite. A rank-two updating scheme that is considered to be the most e ective for unconstrained problems and has useful positive de nite properties is the BFGS method. The formula for this update, generalized to be applicable to the constrained problem, is
(3:23)
22
where s and y are as in (3.21) and (3.22). The important qualities of the matrices generated by this formula are that they have the bounded deterioration property and satisfy a hereditary positive de niteness condition. The latter states that if Bk is positive de nite and yt s > 0 (3:24) then Bk+1 is positive de nite. If the matrix approximations are positive de nite then (ECQP) is easily solved and the SQP steps are well de ned. Unfortunately, because the Hessian of the Lagrangian at (x u ) need only be positive de nite on a subspace, it need not be the case that (3.24) is satis ed and hence the algorithm may break down, even close to the solution. However, if the Hessian of the Lagrangian is positive de nite, (3.24) is valid provided Bk is positive de nite and (xk uk ) is close to (x u ). In this case this BFGS-SQP algorithm has the same local convergence properties as the PSB-SQP method. Theorem 3.7 Suppose that H L is positive de nite and let B0 be an initial positive de nite matrix. Suppose that the sequence of iterates f(xk uk )g is generated by the SQP algorithm using the sequence of matrix approximations generated by the BFGS update (3.23). Then, if x0 ; x and kB0 ; H L(x u )k are su ciently small and u0 is given by (3:2), the sequence fBk g is of bounded deterioration and (3.16) is satis ed. Therefore, the iterates (xk uk ) converge superlinearly to the pair (x u ). In addition the x-iterates converge superlinearly and the multipliers converge Rsuperlinearly. The assumption that H L is positive de nite allows B0 to be both positive de nite and close to H L . The requirement that H L be positive de nite is satis ed, for example, if (NLP) is convex, but cannot be assumed in general. However, because the positive de niteness of Bk permits the solution of the quadratic subproblems independent of the nearness of the iterate to the solution, the BFGS method has been the focus of vigorous research e orts to adapt it to the general case. Several of these e orts have resulted in implementable algorithms. One scheme, which we call the Powell-SQP method, is to maintain the positive de nite property of the Hessian approximations by modifying the update formula (3.23) by replacing y in (3.22) by ^ = y + (1 ; ) Bk s y (3:25) for some 2 (0 1]. With this modi cation, the condition (3.24) can always be satis ed although the updates no longer satisfy the secant condition. Nevertheless, it has been shown that a speci c choice of leads to a sequence, fxk g, that converges R-superlinearly to x provided that the sequence converges. Unfortunately, no proof of local covergence has been found although
23
algorithms based on this procedure have proven to be quite successful in practice. A second approach is to transform the problem so that H L is positive de nite and then to apply the BFGS-SQP method on the transformed problem. It is a common trick in developing algorithms for solving equality constrained optimization problems to replace the objective function in (NLP) by the function f A(x) = f (x) + 2 kh(x)k2 for a positive value of the parameter . This change gives a new Lagrangian function, the so-called augmented Lagrangian LA(x u) = L(x u) + 2 kh(x)k2 for which (x ,u ) is a critical point. The Hessian of the the augmented Lagrangian at (x ,u ) has the form H LA (x u ) = H L + rh(x )rh(x )t: It follows from A4 that there exists a positive value such that for , H LA (x u ) is positive de nite. If the value of is known then the BFGS-SQP algorithm can be applied directly to the transformed problem with and, by Theorem 3.7, a superlinearly convergent algorithm results. This version of the SQP algorithm has been extensively studied and implemented. Although adequate for local convergence theory, this augmented-BFGS method has major drawbacks that are related to the fact that it is di cult to choose an appropriate value of . To apply Theorem 3.7, the value of must be large enough to ensure that H LA (x u ) is positive de nite, a condition that requires a priori knowledge of x . If unnecessarily large values of are used without care numerical instabilities can result. This problem is exacerbated by the fact that if the iterates are not close to the solution appropriate values of may not exist. Quite recently, an intriguing adaptation of the BFGS-SQP algorithm has been suggested that shows promise of leading to a resolution of the above di culties. This method, called the SALSA-SQP method is related to the augmented Lagrangian SQP method but di ers (locally) in the sense that a precise estimate of can be chosen independent of x . If yA = rLA (xk+1 uk+1 ) ; rLA (xk uk+1) then yA = y + rh(xk+1 )h(xk+1 ) where y is the vector generated by the ordinary Lagrangian in (3.22). If yt (3:26) As > 0
24
then the BFGS updates for the augmented Lagrangian have the hereditary positive de niteness property. A minimum value of k , not directly dependent on x , can be given that guarantees that yt A s is \su ciently" positive for given values of xk and uk+1 in a neighborhood of (x ,u ). The (local) version of this algorithm proceeds by using the augmented Lagrangian at each iteration with the required value of k (which may be zero) to force the satisfaction of this condition. The Bk generated by this procedure are positive de nite while H L may not be hence, the standard bounded deterioration property is not applicable. As a result a local convergence theorem is not yet available. As in the Powell-SQP algorithm, an R-superlinear rate of convergence of the xk to x has been shown, provided that the sequence is known to converge.
3.4. Reduced Hessian SQP Methods An entirely di erent approach to approximating the Hessian is based on the fact that assumption A4 requires H L to be positive de nite only on a particular subspace. The reduced Hessian methods approximate only the portion of the Hessian matrix relevant to this subspace. The advantages of these methods are that the standard positive de nite updates can be used and that the dimension of the problem is reduced to n ; m (possibly a signi cant reduction). In this section we discuss the local convergence properties of such an approach. Several versions of a reduced Hessian type of algorithm have been proposed they di er in the ways the multiplier vectors are chosen and the way the reduced Hessian approximation is updated in particular, in the form of y that is used (see (3.31)). A general outline of the essential features of the method is given below. Let xk be an iterate for which rh(xk ) has full rank and let Zk and Yk be matrices whose columns form bases for the null space of rh(xk )t and range space of rh(xk ) respectively. Also assume that the columns of Zk are orthogonal. Zk and Yk could be obtained, for example, by a QR factorization of rh(xk ). De nition 3.2 Let (xk ,uk ) be a given solution-multiplier pair and assume that rh(xk ) has full rank. The matrix Zk tH L(xk uk )Zk is called a reduced Hessian for the Lagrangian function at (xk ,uk ). The reduced Hessian is not unique its form depends upon the choice of basis for the null space of rh(xk )t. Since by A4 a reduced Hessian at the solution is positive de nite, it follows that if (xk ,uk ) is close enough to (x ,u ) then the reduced Hessian at (xk uk ) is positive de nite. Decomposing the vector
25
dx as
The matrix in this unconstrained problem is an approximation to the reduced Hessian of the Lagrangian at xk . Rather than update Bk and then compute Zk t Bk Zk , the reduced matrix itself is updated. Thus at a particular iteration, given a positive de nite (n ; m) (n ; m) matrix Rk , an iterate xk and a multiplier estimate uk , the new iterate can be found by rst computing pY from (3.28), setting 1 t k (3:29) pZ = ;R; k Zk (rf (x ) + Bk pY ) and setting xk+1 = xk + dx, where dx is given by (3.27). A new multiplier, uk+1 , is generated and the reduced Hessian approximation is updated using the BFGS formula (3.23) with the Rk replacing Bk , s = Zk t (xk+1 ; xk ) = Zk t Zk pZ (3:30) and y = Zk t rL(xk + Zk pZ uk ) ; rL(xk uk )]: (3:31) The choices of s and y are motivated by the fact that only the reduced Hessian is being approximated. This method does not stipulate a new multiplier iterate directly since the problem being solved at each step is unconstrained. However, the least squares solution for the rst order conditions (cf. (3.2)) can be used. Generally, all that is needed is that the multipliers satisfy uk ; u = O xk ; x : (3:32) Since P k = Zk Zk t the approximation
it can be seen that the constraint equation of (ECQP) becomes rh(xk )tYk pY = ;h(xk ) which by virtue of A2 can be solved to obtain pY = ; rh(xk )tYk ];1h(xk ): (3:28) The minimization problem (ECQP) is now an unconstrained problem in n ; m variables given by 1 p t Zk t Bk Zk p + (rf (xk )t + p t Bk )Zk p : minimize Z Y Z 2 Z p
Z
dx = Zk pZ + Yk pY
(3:27)
Zk Rk Zk t
26
Thus since this method does not approximate neither the local convergence theorem nor the superlinear rate of convergence theorem, Theorems 3.2 and 3.4, follow as for full Hessian approximations. Nevertheless, the two-sided approximation of the Hessian matrix suggests that the conditions of Theorem 3.5 may hold. In fact, if it is assumed that the matrices Zk are chosen in a smooth way, i.e., so that kZk ; Z (x )k = O xk ; x (3:33) the assumption of local convergence leads to two-step superlinear convergence. Theorem 3.8 Assume that the reduced Hessian algorithm is applied with uk and Zk chosen so that (3.32) and (3.33) are satis ed. If the sequence fxk g 1 g are uniformly bounded and converges to x R-linearly then fRk g and fR; k fxk g converges two-step superlinearly. While the condition on the multiplier iterates are easy to satisfy, some care is required to ensure that (3.33) is satis ed as it has been observed that arbitrary choices of Zk may be be discontinuous and hence invalidate the theorem.
3.5. Notes and References The study of Newton's method applied to the rst order necessary conditions for constrained optimization can be found in (Tapia 1974) and (Goodman 1985). The equivalence of Newton's method applied to the system of rst order equations and feasibility conditions and the SQP step with the Lagrangian Hessian was rst mentioned by Tapia (1978). The expository paper, (Dennis and More 1977), is a good source for an introductory discussion of bounded deterioration and of updating methods satisfying the secant condition in the setting of unconstrained optimization. The paper by Broyden, Dennis and More (1973) provides the basic results on the convergence properties for quasi-Newton methods. Theorems (3:2) and (3:4) were rst proven by Boggs, Tolle and Wang (1982). The latter theorem was proven under the assumption that the iterates converge linearly, generalizing the result for unconstrained optimization given in Dennis and More (1977). The assumption was substantially weakened by several authors see, for example, (Fontecilla, Steihaug and Tapia 1987). Theorem (3:5) is due to Powell (1978b). The paper by Coleman (1990) contains a good general overview of superlinear convergence theorems for SQP methods. The (local) superlinear convergence of the PSB-SQP method was proven
P kH L
27
by Han (1976) in his rst paper on SQP methods. This paper also included the proof of superlinear covergence for the positive de nite updates when the Hessian of the Lagrangian is positive de nite (Theorem (3:7)) and suggested the use of the augmenting term in the general case. The augmented method was also investigated in (Schittkowski 1981) and (Tapia 1977). A description of the Powell-SQP method appears in (Powell 1978b) while the Salsa-SQP method is introduced in (Tapia 1988) and developed further in (Byrd, Tapia and Zhang 1992). The reduced Hessian SQP method has been the subject of research for a number of authors. The presentation here follows that of Byrd and Nocedal (1991). Other important articles include (Coleman and Conn 1984), (Nocedal and Overton 1985), (Gabay 1982), (Yuan 1985), and (Gilbert 1993). The problems involved in satisfying (3:33) have been addressed in (Byrd and Schnabel 1986) , (Coleman and Sorensen 1984), and (Gill, Murray, Saunders, Stewart and Wright 1985). Other papers of interest on local convergence for SQP methods include (Schittkowski 1983), (Panier and Tits 1993), (Bertsekas 1980), (Coleman and Feynes 1992), (Glad 1979), (Fukushima 1986), and (Fontecilla 1988).
4. Merit Functions and Global Convergence
In the previous section we demonstrated the existence of variants of the SQP method that are rapidly locally convergent. Here we show how the merit function assures that the iterates eventually get close to a critical point in the next section we point out some gaps in the theory that prevent a complete analysis. A merit function is incorporated into an SQP algorithm for the purpose of achieving global convergence. As described in Section 2.3, a line search procedure is used to modify the length of the step dx so that the step from xk to xk+1 reduces the value of . This reduction is taken to imply that acceptable progress towards the solution is being made. The standard way to ensure that a reduction in indicates progress is to construct so that the solutions of (NLP) are the unconstrained minimizers of . Then it must be possible to decrease by taking a step in the direction dx generated by solving the quadratic subproblem, i.e., dx must be a descent direction for . If this is the case, then for a su ciently small (xk + dx ) will be less than (xk ). An appropriate steplength that decreases can then be computed for example, by a \backtracking" procedure of trying successively smaller values of until a suitable one is obtained. Given that a merit function is found that has these properties and that a procedure is used to take a step that decreases the function, global convergence proofs for the resulting SQP algorithm are generally similar to those found in unconstrained optimization: they depend on proving that the limit
28
points of the x-iterates are critical points of . These proofs rely heavily on being able to guarantee that a \su cient" decrease in can be achieved at each iteration.. In unconstrained minimization there is a natural merit function namely, the objective function itself. In the constrained setting, unless the iterates are always feasible, a merit function has to balance the drive to decrease f with the need to satisfy the constraints. This balance is often controlled by a parameter in that weights a measure of the infeasibility against the value of either the objective function or the Lagrangian function. In this section, we illustrate these ideas by describing two of the most popular merit functions: a di erentiable augmented Lagrangian function and the simpler, but nondi erentiable, `1 penalty function. We derive some of the important properties of these functions and discuss some of the advantages and disadvantages of each. In addition, we provide appropriate rules for choosing the steplength parameter so as to obtain the basic convergence results. To simplify the presentation we restrict our attention at the beginning to the problem with only equality constraints. Extensions that allow inequality constraints are discussed at the ends of the sections. As might be expected, some additional assumptions are needed if global convergence is to be achieved. In fact, the need for these assumptions raises the issue as to what exactly is meant by \global" convergence. It is impossible to conceive of an algorithm that would converge from any starting point for every (NLP). The very nature of nonlinearity allows the possibility, for example, of perfectly satisfactory iterations following a steadily decreasing function towards achieving feasibility and a minimum at some in nite point. In order to focus attention on the methods and not the problem structure we make the following assumptions. C1: The starting point and all succeeding iterates lie in some compact set C. C2: The columns of rh(x) are linearly independent for all x 2 C . The rst assumption is made in some guise in almost all global convergence proofs, often by making speci c assumptions about the functions. The second assumption assures that the systems of linearized constraints are consistent. An additional assumption will be made about the matrix approximations Bk , depending on the particular merit function.
4.1. Augmented Lagrangian merit functions Our rst example of a merit function is the augmented Lagrangian function. There are several versions of this function we use the following version to
29
optimal multipliers based on the rst order necessary conditions and hence u(x ) = u (see (3.1) and (3.2)). Under the assumptions, F and u are di erentiable and F is bounded below on C for su ciently large. The following formulas will be useful in the discussion:
h i
rh(x)trh(x) ;1 rh(x)trf (x): (4:2) The multipliers u(x) de ned by (4:2) are the least squares estimates of the
u(x) = ;
(4:1)
satis es the rst order necessary conditions. The following theorem establishes that the augmented Lagrangian merit function has the essential properties of a merit function for the equalityconstrained nonlinear program. Theorem 4.1 Assume B1, B2, C1, and C2 are satis ed. Then for su ciently large the following hold: (i) x 2 C is a strong local minimum of F if and only if x is a strong local minimum of (NLP) and (ii) if x is not a critical point of (NLP) then dx is a descent direction for F . The proof is rather technical, but it does illustrate the techniques that are often used in such arguments and it provides a useful intermediate result, namely (4:10). A key idea is to decompose certain vectors according to the range and null space projections P and Q. Proof. If x 2 C is feasible, and satis es the rst order necessary conditions, then it follows from (4.4) that r F (x ) = 0: Conversely, if x 2 C , r F (x ) = 0, and is su ciently large then it follows from C1 and C2 that h(x ) = 0 and x satis es A1 for (NLP). To establish (i), the relation between H F (x ) and H L must be explored for such points x . It follows from (4:4) that H F (x ) = H L ; Q H L ; H L Q + rh(x )rh(x )t (4:5)
ru(x) = ;H L(x u(x))rh(x) rh(x)trh(x) ;1 +R1 (x) (4.3) r F (x ) = rf (x) + rh(x)u(x) +ru(x)h(x) + rh(x)h(x): (4.4) The function R1(x) in (4:3) is bounded and satis es R1(x ) = 0 if x
30
where Q is de ned by (2:3). Now let y 2 Rn , y 6= 0, be arbitrary. Then setting y = Q y + P y, (4:5) yields y t H F (x )y = (P y)t H Lh (P y ) ; (Q y)t H i L (Q y ) (4:6) t t + (Q y ) rh(x )rh(x ) (Q y): Assume that x is a strong local minimum of (NLP). Since Q y is in the range space of rh(x ), it follows from A2 that there exists a constant such that h i (Q y )t rh(x )rh(x )t (Q y) kQ yk2 : Let the maximum eigenvalue of H L in absolute value be max and let min be the minimum eigenvalue of P H L P , which by A4 is positive. De ning which implies
yk = kQ ky k
This last expression will be positive for su ciently large, which implies that x is a strong local minimum of F . Conversely, if x is a strong local minimum of F then (4:6) must be positive for all y . This implies that H L must be positive de nite on the null space of rh(x )t and hence that x is a strong local minimum of (NLP). This establishes (i). To show that the direction dx is always a descent direction for F , the inner product of both sides of (4:4) is taken with dx to yield dx t r F (xk ) = dxt rf (xk ) + dxt rh(xk )u(xk ) +dx t ru(xk )h(xk ) + dxt rh(xk )h(xk ): (4:7) Using (4:2) it follows that
dx t rh(xk )u(xk ) =
kP yk2 = 1 ; 2 kyk2 and dividing both sides of (4:6) by ky k2 gives y t H F (x )y min + ( ; kyk2
max ; min ) 2 :
(4:8) Writing dx = Qk dx + P k dx and noting that h(xk ) = ;rh(xk )t Qk dx results in dx t r F (xk ) = P k dx t rf (xk ) ; dx t ru(xk )rh(xk )tQk dx h i ; (Qk dx)t rh(xk )rh(xk )t (Qk dx):
i;1
rh(x)trf (x)
31
for some constant 1 . Dividing both sides of (4:9) by kdx k2, letting and using B1 yields
Finally, using the rst order necessary conditions for (ECQP) and the fact that P k rh(xk )t = 0, the following is obtained: dx t r F (xk ) = ;(P k dx )t Bk (P k dx ) ; (Qk dx )tBk dx ;dxtru(xkh)rh(xk )tQk dx i (4:9) k k t k k t ; (Q dx) rh(x )rh(x ) (Q dx): Now, from (4:3) and assumptions C1 and C2 it follows that dxt ru(xk )rh(xk )t Qk dx 1 Qk dx kdxk
Qk dx = kd k x
for constants 2 and 3 . The quantity on the left of (4:10) is then negative and can be uniformly bounded away from zero provided is su ciently large, thus proving dx is a descent direction for . 2 It is interesting to observe that the value of necessary to obtain descent of dx depends on the eigenvalue bounds on fBk g, whereas the value of necessary to ensure that x is a strong local minimizer of F depends on the eigenvalues of H L . Thus a strategy to adjust to achieve a good descent direction may not be su cient to prove that H F (x ) is positive de nite. We comment further on this below. This line of reasoning can be continued to obtain a more useful form of the descent result. From (4:4) and arguments similar to those above, r F (xk ) 4 kdx k for some constant 4. Thus from (4:10) it follows that there exists a constant 5 such that r F (xk )tdx ; 5<0 (4:11) r F (xk ) kdxk for all k. This inequality ensures that the cosine of the angle k between the direction dx and the negative gradient of F is bounded away from zero, i.e., for all k k )tdx cos( k ) = ; r F (x (4:12) 5 > 0: r F (xk ) kdxk
32
(4:10)
32
This condition is su cient to guarantee convergence of the sequence if it is coupled with suitable line search criteria that impose restrictions on the steplength . For example, the Wolfe conditions require a steplength to satisfy k k k t F (x + dx ) F (x ) + 1 r F (x ) dx (4.13) r F (xk + dx )tdx (4.14) 2 r F (xk )tdx where 0 < 1 < 2 < 1. Inequality (4:13) ensures that there will be a su cient reduction in F while (4:14) guarantees that the steplength will not be too small. An important property of these conditions is that if dx is a descent direction for F , then a steplength satisfying (4:13){(4:14) can always be found. Furthermore, the reduction in F for such a step satis es ) F (xk ) ; 6 cos2( k ) r F (xk ) for some positive constant 6. Therefore
F (x k+1
1 X
This implies that fxk g converges to a critical point of F . The following theorem results: Theorem 4.2 Assume that is chosen such that (4:12) holds. Then the SQP algorithm with steplength chosen to satisfy the Wolfe conditions (4:13){(4:14) is globally convergent to a critical point of F . If the critical points of F all correspond to local minima of (NLP) then the algorithm will converge to a local solution. This, however, can rarely be guaranteed. Only convergence to a critical point, and not to a local minimum of F , is guaranteed. It is easy to see from the following example that convergence can occur to a local maximum of (NLP). Consider minimize x1 x (4:16) 2 subject to: x2 1 + x2 ; 1 = 0 with a starting approximation of (2 0) and Bk = I for all k. The iterates will converge from the right to the point (1 0), which is a local maximum of the problem, but F will decrease appropriately at each iteration. The point (1 0), however, is a saddle point of F . In practice convergence to a maximizer of (NLP) rarely occurs. Theorems 4.2 and 4.1 require that be su ciently large. Since it is not
and, since cos( k ) is uniformly bounded away from 0, it follows that lim r (xk ) = 0: k!1 F
k=1
(4:15)
33
known in advance how large needs to be, it is necessary to employ an adaptive strategy for adjusting when designing a practical code. Such adjustment strategies can have a dramatic e ect on the performance of the implementation, as is discussed in Section 7. Augmented Lagrangian merit functions have been extended to handle inequality constrained problems in several ways. Two successful approaches are described below. In the rst an active set strategy is employed, i.e., at each iteration a set of the inequality constraints is selected and treated as if they were equality constraints the remaining inequalities are handled di erently. The active set is selected by using the multipliers from (QP). In particular, o n I k = i : gi(xk ) ; (vqp )i : With this choice, I k will contain all unsatis ed constraints and no \safely satis ed" ones. The merit function at xk is de ned by 2 t 1 FI (x uqp vqp ) = f (x) + h(x) uqp(x) + 2 kh(x)k X 1 g (x)2 ) + (g i (x)(vqp (x))i + 2 i +2
i2I k 1 X i= 2I k
(vqp (x))2 i:
FI is still di erentiable and, as a result of A3, the correct active set is eventually identi ed in a neighborhood of the solution. In this formulation, the multipliers are the multipliers from (QP), not the least squares multipliers. Therefore FI is a function of both x and these multipliers and, consequently, the analysis for FI is somewhat more complicated. A second approach uses the idea of squared slack variables. One can consider the problem minimize f (x) xt subject to: h(x) = 0 (4:17) g i (x) + (ti )2 = 0 i = 1 ::: p
where t is the vector of slack variables. This problem is equivalent to (NLP) in the sense that both have a strong local solution at x where, in (4:17), (ti )2 = ;g i (x ). By writing out the function F for the problem (4:17), it is observed that the merit function only involves the squares of ti . Thus, by letting z i = t2 i and setting (x ) h(x z ) = g (h x) + z
34
we can construct the merit function 2 t FZ (x z) = f (x) + h(x z) u(x z ) + 2 h(x z) : Here u(x,z ) is the least squares estimate for all of the multipliers. Note that (4:17) is not used to create the quadratic subproblem, but rather (QP) is solved at each iteration to obtain the step dx . The slack variables can then be updated at each iteration in a manner guaranteed to maintain the nonnegativity of z . For example, a step dz = ;(rg (xk )tdx + g(xk ) + z k ) can be calculated. Then the constraints of (QP) imply that z k+1 = z k + dz 0 if z k 0 and 2 (0 1].
4.2. The `1 Merit Function One of the rst merit functions to be introduced was the `1 exact penalty function that, in the equality constrained case, is given by (4:18) 1(x ) = f (x) + kh(x)k1 where is a positive constant to be chosen. The properties of this function vis-a-vis the equality-constrained optimization problem have been welldocumented in the literature. For our purposes it is su cient to note that 1, like the augmented Lagrangian of the previous section, is an \exact" penalty function that is, there exists a positive such that for all , an unconstrained minimum of 1 corresponds to a solution of (NLP). Note that 1 is not di erentiable on the feasible set. If the penalty term were squared to achieve di erentiability then the \exact" property would be lost minimizers of 1 would only converge to solutions of (NLP) as ! 1. Although 1 is not di erentiable, it does have a directional derivative along dx. It can be shown that this directional derivative, denoted by the operator D, is given by (4:19) D( 1(xk ) dx) = rf (xk )tdx ; h(xk ) 1 : Substituting the rst order necessary conditions for (ECQP) into (4:19) yields D( 1(xk ) dx) = ;dx tBk dx ; dxt rh(xk )uqp ; h(xk ) 1 : It follows from the linearized constraints of (ECQP) that dx t rh(xk )uqp = ;h(xk )t uqp and, since (4:20) h(xk )t uqp kuqp k1 h(xk 1
35
the inequality D( (xk ) dx) ;dx tBk dx ; ( ; kuqp k1 ) h(xk ) 1 (4:21) is obtained. In order to have dx be a descent direction for 1 and to obtain a convergence theorem it is su cient to assume the uniform positive de niteness of fBk g: B4: For all d 2 Rn there are positive constants 1 and 2 > 0 such that for all k. The assumptions B1 and B2 that were made for the augmented Lagrangian are su cient, but we make the stronger assumption to simplify the presentation. Using B4, the rst term in (4:21) is always negative and thus dx is a guaranteed descent direction for 1 if > kuqp k1 . Global convergence of an algorithm that uses 1 as a merit function can be demonstrated using arguments similar to those for Theorem 4.1. First, at each iteration, the parameter is chosen by = kuqp k1 + (4:22) for some constant > 0: Next, the rst Wolfe line search condition, (4:13), is replaced by (4:23) 1 (xk + dx) 1 (xk ) + D( 1 (xk ) dx) 1 where 2 (0 1 2 ). (The condition < 2 is for technical reasons in practice is usually chosen to be much smaller.) Recall that the second of the Wolfe conditions, (4:14), is to prevent a steplength that is too small. The assumptions that have been made guarantee that the use of a backtracking line search procedure produces steplengths that are uniformly bounded away from zero for all iterations. Denoting this lower bound by , it now follows from (4:21) and (4:23) that the reduction in 1 at each step satis es h i 2 (4:24) 1 kdx k + h(xk ) 1 : 1 (xk + dx) ; 1(xk ) ; Assumption C1 and (4:24) imply that
2 dx(xk ) + h(xk ) 1 1 k=1 2 dx(xk ) + h(xk ) 1 ! 0 as k ! 1. Since dx = 0 if and
1 X
2 1 kdk
dtBk d
2 kdk
and therefore only if xk is a feasible point satisfying A1, the following theorem results. Theorem 4.3 Assume that is chosen such that (4:22) holds. Then the
36
All of the theoretical results continue to hold under this extension. The merit function 1 has been popular because of its simplicity | it requires only the evaluation of f , h, and g to check a prospective point. F , on the other hand, is expensive to evaluate in that it requires the evaluation of f , h, g and their gradients just to check a prospective point. In addition, in the large scale case, the evaluation of u(x) involves nontrivial algebra. Implementations that use these merit functions partially circumvent this di culty by using an approximation to F , e.g., by approximating u by a linear function or by keeping it xed over the current step.
SQP algorithm started at any point x0 with steplength > 0 chosen to satisfy (4:23) converges to a stationary point of 1 . As in the case of F convergence to a local minimum of (NLP) cannot be guaranteed. In fact, the same counterexample used in that case applies here. An advantage of 1 over the augmented Lagrangian is that 1 is easily extended to handle inequalities. Since di erentiability is not an issue in this case, the function 1 can simply be de ned as + 1 (x) = f (x) + kh(x)k1 + g (x) 1 where if gi (x) 0 g+ (x) = 0 g (x) if g (x) > 0
i i
4.3. Notes and References The augmented Lagrangian merit function was rst proposed as an exact penalty function by Fletcher (1972). See also (Bertsekas 1982) and (Boggs and Tolle 1980). It was suggested as a merit function in (Powell and Yuan 1986) and in a slightly di erent form in (Boggs and Tolle 1984) and (Boggs and Tolle 1989). See (Byrd and Nocedal 1991) for its application to reduced Hessian methods. The Wolfe conditions have been studied by numerous authors we recommend (Nocedal 1992) or (Dennis and Schnabel 1983). Schittkowski (1981) and (Schittkowski 1983) developed the form for inequalities given by FI . This form has been further developed and incorporated into a highly successful algorithm by Gill, Murray, Saunders and Wright (1986). (Boggs, Tolle and Kearsley 1991b) proposed the form given by FZ . The `1 exact penalty function was originally suggested as a merit function by Han (1977) where he obtained the rst global convergence results for SQP methods. See (Fletcher 1981) for a good introduction to nondi erentiable optimization and the use of this function. See also (Polak 1989) and (Wolfe 1975). Byrd and Nocedal (1991) study the `1 merit function in the context
37
of reduced Hessian methods. The analysis for the `1 merit function applies to the corresponding `p merit function with the only change ocurring in (4:20) where the 1-norm and the 1-norm are replaced by the p-norm and the q -norm with 1=p + 1=q = 1.
5. Global to Local Behavior
In the previous two sections we have examined conditions that imply that the basic SQP algorithm will converge. Section 3 dealt with local convergence where the emphasis was on the rates of convergence, while Section 4 was concerned with obtaining convergence from remote starting points the implicit hope being that the two theories would come together to produce a uni ed theory that would be applicable to a given algorithm. For the global convergence theories, where the process has to be controlled by a merit function, it was seen in Section 4 that convergence of fxk g to a critical point of is all that can be demonstrated. Assuming that this critical point is a solution of (NLP), the question that arises is whether or not the conditions for the local convergence theories are eventually satis ed by this sequence. If they are then the more rapid rates of local convergence can be achieved. In this section we discuss three questions concerning this possibility: will the correct active set be chosen by (QP) when xk is close to x will Bk eventually approximate H L in one of the ways stipulated in Section 3 that yields rapid local convergence and will the merit function allow a steplength of one if the underlying process can achieve rapid local convergence? Recall that the local convergence theory was proven for equality-constrained problems only. This was based on the assumption that the active inequality constraints at the solution of (NLP) would be identi ed by (SQP) when xk gets close to the solution. The question of when the active constraints for (QP) will be the same as those for (NLP) is resolved by using a perturbation theory result as follows. Consider (NLP) with only inequality constraints minimize f (x) x (5:1) subject to: g (x) 0 and the quadratic program with the same solution t minimize rf (x )t(x ; x ) + 1 2 (x ; x ) B (x ; x ) x (5:2) subject to: rg(x )t(x ; x ) + g (x ) 0 where the only restriction on B is that yt B y > 0 (5:3) for all y such that rg(x )t y = 0. It is easily veri ed that (x v ) is a strong local minimizer of both (5:1) and (5:2). This implies that the active
38
sets are the same and that strict complementary slackness (assumption A3) holds for both. The quadratic programming approximation to (5:1) is t minimize rf (xk )tdx + 1 2 dx Bk dx dx (5:4) subject to: rg (xk )tdx + g(xk ) 0: A standard perturbation argument now asserts that if xk is close enough to x and Bk is close enough to B , then the active set for (5:4) is the same as the active set for (5:2). It follows that the active sets for (5:1) and (5:4) are the same. As a result, if xk is su ciently close to x and Bk is su ciently close to any matrix B satisfying (5:3) then (5:4) will identify the correct active set. The condition (5:3) will be satis ed if the Bk are always positive de nite or if they satisfy the condition (5:5) lim P k (Bk ; H L )P k = 0: k!1 This latter condition implies two-step superlinear convergence by Theorem (3:5). From Section 3 linear, superlinear, or two-step superlinear convergence is assured if Bk approaches H L in one of the ways hypothesized in Theorems 3.2{3.5. Since the initial matrix is unlikely in practice to be a good approximation to H L , it is reasonable to ask under what conditions the subsequent Bk will be good enough approximations to H L for these results of Section 3 to hold. If the Bk satisfy the secant equation (3:19) then it is reasonable to expect that the Bk will converge to H L provided that the directions, dx , span Rn repeatedly, for example, if each set of n directions is (uniformly) linearly independent. Some recent research supports this expectation. However, if positive de niteness is maintained, it is not possible for the Bk to converge to H L unless the latter is at least positive semide nite. So for the general case, further analysis is necessary. As seen in Section 3, (3:16) is equivalent to lim P (Bk ; H L )P sk + P (Bk ; H L )Q sk = 0 (5:6) where sk = (xk+1 ; xk ). For superlinear convergence both of these terms must tend to zero if the directions sk = ksk k repeatedly span Rn . Thus for positive de nite Bk superlinear convergence is not likely unless the second term goes to zero, which is the type of tangential convergence mentioned in Section 3.2. However, two step superlinear convergence, in which the rst term of (5:6) goes to zero, is possible for positive de nite updates and for reduced Hessian updates as well, provided that convergence occurs and steplengths of one are acceptable. Thus, it is important from the point of view of obtaining a rapid local rate of convergence that a steplength of one be taken near x .
k!1
ksk k
ksk k
39
For the general algorithm, however, the use of the merit function with the line search continues throughout the course of the algorithm. Thus, the line search for the merit function should allow a steplength of one if the matrix approximations are such that rapid local convergence is possible. In the case of the augmented Lagrangian merit function it can be shown, using arguments similar to those in Section 4, that
F (x
Then, if the process is converging superlinearly, the penultimate term in (5:7) tends to zero by Theorem 3.4 and the right hand side of (5:7) is ultimately less than zero, thus showing that a steplength of one decreases F . A slight extension to this argument shows that the Wolfe condition (4:13) also can be satis ed for a steplength of one. A signi cant disadvantage of the merit function 1 is that it may not allow a steplength of = 1 near the solution, no matter how good the approximation of (QP) to (NLP). This phenomenon, which can prohibit superlinear convergence, is called the Maratos e ect. Several procedures have been proposed to overcome this problem, including so-called nonmonotone line searches, i.e., techniques that allow the merit function to increase over one or more iterations (see Section 7). 5.1. Notes and References The perturbation argument showing conditions under which (QP) has the same active set as (NLP) is derived from the work of Robinson (1974) who proves a general perturbation result. For a discussion of the convergence of matrix updates see Boggs and Tolle (1994), Ge and Powell (1983) and Stoer (1984). The Maratos e ect, i.e., the fact that the `1 merit function may not allow a steplength of one even when the iterates are close to the solution and Bk is a good approximation to H L , was discussed by Chamberlain, Lemarechal, Pedersen and Powell (1982). These authors suggested the \watchdog" technique, which is essentially a nonmonotone line search method. (See also (Bonnans et al. 1992).) The fact that the augmented Lagrangian merit function does not su er from this problem has been shown by many authors.
6. SQP Trust Region Methods
(5:7)
Trust region algorithms have become a part of the arsenal for solving unconstrained optimization problems, so it is natural to attempt to extend the
40
ideas to solving constrained optimization problems and, in particular, to SQP methods. In this section an outline of the basic ideas of this approach will be provided. As the subject is still in a state of ux, no attempt will be made to give a comprehensive account of the algorithms that have been proposed. The discussion will be limited to equality-constrained problems the inclusion of inequality constraints into trust region algorithms has been the subject of little research. A major di culty associated with SQP methods arises from trying to ensure that (QP) has a solution. As was seen in the preceding, the requirement that the Hessian approximations be positive de nite on the null space of the constraints is di cult to guarantee short of requiring the Bk to be positive de nite. The trust region methods attempt to avoid this di culty by the addition of a bounding constraint. Since this added constraint causes the feasible region to be bounded the subproblem is guaranteed to have a solution independent of the choice of Bk provided the feasible region is nonempty To be precise, a straightforward trust region adaptation of the SQP method would generate the iterates by solving the problem
t minimize rf (xk )tdx + 1 2 dx Bk dx dx subject to: rh(xk )t dx + h(xk ) = 0
kS dxk2
42 k
(6:1)
where S is a positive diagonal scaling matrix and 4k is the trust region radius. In this discussion we assume the norm is the Euclidean norm but other norms have been used in the trust region constraint. The trust region radius is updated at each iteration based on a comparison of the actual decrease in a merit function to the decrease predicted by the model. If there is good agreement the radius is maintained or increased if not, it is decreased. It is worth noting that the necessary condition for dx to be a solution to (6:1) is that (Bk + S 2)dx = ;rh(xk )u ; rf (xk ) (6:2) for some multiplier vector u and some nonnegative scalar . This equation is used to generate approximate solutions to (6:1). The removal of the positive de nite requirement on Bk does not come without cost. The additional constraint is a quadratic inequality constraint and hence it is not a trivial matter to nd a good approximate solution to (6:1). Moreover, there is the bothersome possibility that the solution sets of the linear constraints and the trust region constraint may be disjoint. For this reason, research on trust region methods has centered on nding di erent types of subproblems that have feasible solutions but still capture
41
the essence of the quadratic subproblem. Several avenues of investigation are summarized below. One approach is to relax the linear constraints in such a way that the resulting problem is feasible. In this case the linear constraint in (6:1) is replaced by rh(xk )tdx + k h(xk ) = 0 (6:3) where 0 k 1. A major di culty with this approach is the problem of choosing k so that (6.3) together with the trust region constraint has a solution. A second approach is to replace the equality constraints by a least squares approximation. Then the equality constraints in (6:1) become the quadratic constraint 2 (6:4) rh(xk )tdx + h(xk ) ( k)2 where k is an appropriate value. One choice of k is the error in the linear constraints evaluated at the \Cauchy" point. The Cauchy point is de ned to be the optimal step in the steepest descent direction for the function i.e., the step, scp , which minimizes this function in the steepest descent direction. This yields
2 ( k )2 = rh(xk )t scp + h(xk ) :
Another possibility is to take k to be any value of rh(xk )t s + h(xk ) for which where 0 < 1 2 < 1. This approach assures that the quadratic subproblem is always feasible, but at the cost of some extra computation to obtain k . Moreover, while the resulting subproblem has only two quadratic constraints, nding a good approximate solution quickly is a matter that has not been completely resolved. Finally, in a Reduced Hessian approach to the trust region method the linear constraint in (6:1) is replaced by rh(xk )tdx + rh(xk )tsk = 0 (6:5) where sk is the solution of minimize rh(xk )t s + h(xk ) s subject to: ksk 4k
2 14k
ksk
2 4k
42
for 2 (0 1). Using the decomposition (3:27) from Section 3.4 dx = Zk pZ + Yk pY where the columns of Z (xk ) are a basis for the null space of rh(xk )t and the columns of Yk are a basis for the range space of rh(xk ), it follows from (6:5) that Yk pY = ;sk : As in Section 3.4, the null space component, pZ , can now be seen to be the solution of the quadratic problem
t t minimize f (xk ) + Bk sk pZ + 1 2 pZ Zk Bk Zk pZ pZ subject to: kpZ k2 (4k )2 ; ksk k2 : A great deal of research has been done on the subject of minimizing a quadratic function subject to a trust region constraint so that quick and accurate methods for approximating the solutions to these two problems are available. Much of the work to be done in transforming these approaches into algorithms for which local and global convergence theorems can be proven is similar in nature to that which must be done for standard SQP methods. In particular, a method for updating the matrices Bk needs to be speci ed (a wider class of updating schemes is now available, at least in theory, because there is no necessity of requiring them to have the positive de niteness properties) and a merit function has to be speci ed. As was shown in Section 3, local superlinear convergence depends on the steps approaching the Newton-SQP steps as the solution is neared. In particular, this requires that the modi ed constraints reduce to the linearized constraint of (ECQP) and that the trust region constraint not be active so that steplengths of one can be accepted. Conditions under which these events will occur have not been established. Global convergence theorems have been proven for most of the above variations of the trust region method by utilizing either the `1 or the augmented Lagrangian merit function under the assumptions C1 and C2.
it
6.1. Notes and References An introductory exposition of trust region methods for unconstrained optimization can be found in the book by Dennis and Schnabel (1983). Methods for minimizing a quadratic function subject to a trust region constraint can be found in (Gay 1981) and (More and Sorensen 1983). The rst application of trust region methods to the constrained problem appears to be that of Vardi (1985), who used the constraint (6:3) in place of the linearized equality constraints. This approach was also used by Byrd,
43
Schnabel and Schultz (1985). The introduction of the quadratic constraint (6:4) as a substitute for the linear constraint is due to Celis, Dennis and Tapia (1985), who used the error at the \Cauchy point" as the k . A global convergence theory for this strategy is given in (El-Alem 1991). Powell and Yuan (1986) suggested the second version of this approach mentioned in the text. Yuan (1990) proposed solution techniques for the resulting subproblem. The reduced Hessian approach has been introduced by Omojokun (1989) who also considers the case when inequality constraints are present. This approach has also been used by (Lalee, Nocedal and Plantega 1993) for large-scale problems.
7. Practical Considerations
Our goal throughout this survey has been to concentrate on those theoretical properties that bear on actual implementations, but we have mentioned some of the di culties that arise in the implementation of an SQP algorithm when the assumptions that we have made are not satis ed. In this section we elaborate on a few of the more important of these di culties and suggest some common computational techniques that can be used to work around them. No attempt is made to be complete the object is to give an indication of the issues involved. We also discuss brie y the extension of the SQP algorithm to the large scale case, where special considerations are necessary to create an e cient algorithm.
7.1. The Solution of (QP) The assumptions behind the theory in Sections 3 and 4 imply that (QP) always has a solution. In practice, of course, this may not be the case. (QP) may be infeasible or the objective function may be unbounded on the feasible set and have no local minima. As stated earlier, surveying methods for solving (QP) is beyond the scope of this paper, but we discuss some ways of continuing the computations when these di culties arise. One technique to avoid infeasibilities is to relax the constraints by using the trust region methods in Section 6. Another approach is to take dx to be some convenient direction when (QP) is infeasible, e.g., the steepest descent direction for the merit function. The infeasibility of (QP) is usually detected during a \Phase I" procedure that is used to obtain a feasible point with which to begin the quadratic programming algorithm. If the constraints are inconsistent, then there are no feasible points and the Phase I procedure will fail. However, the direction, dx , obtained in this case can sometimes be used to produce directions that reduce the constraint infeasibilities and can thus be used to improve xk . For example, the phase I procedure known as the \big M" method modi es (QP) by adding one new variable, say , in
44
such a way that the new problem t minimize rf (xk )tdx + 1 2 dx Bk dx + M dx (7:1) subject to: rh(xk )tdx + h(xk ) = 0 k t k rg(x ) dx + g(x ) ; e 0 is feasible. In this form M is a constant and e is the vector of ones. For 0 maxfg (xk ) : g (xk ) > 0g the initial point (dx ) = (0 0 ) is feasible i i for (7:1). The constant M is chosen large enough so that, as (7:1) is being solved, is forced to be reduced. Once 0 dx is a feasible point for (QP). If cannot be reduced to zero, then the inequalities are inconsistent. Nevertheless, it can be shown under certain conditions that the resulting dx is a descent direction for the merit function F . If the problem is unbounded, but feasible, then the di culty is due to the structure of the Hessian approximation, Bk . For example, if Bk is not positive de nite then a multiple of the identity (or non-negative diagonal matrix) can be added to Bk to ensure that it is positive de nite. Note that adding a strictly positive diagonal matrix to Bk is equivalent to using a trust region constraint (see (6:2)). Finally, in cases where (QP) does not yield approximate multipliers for nonlinear program such as when (QP) is infeasible or the matrix G(xk ) has linearly dependent columns, any reasonable approximation of the multipliers will usually su ce. The most common approach in these cases is to use a least squares multiplier approximation. If (NLP) does not satisfy A2 then at best the theoretical convergence rate will be slowed (if x is a simple degeneracy) and at worst even the Newton method may fail.
7.2. Adjusting the Merit Function Parameter It was shown in Section 4 that setting the parameter to ensure that the direction dx is a descent direction for the merit function 1 given by (4:18) can be relatively straightforward. Setting the parameter for F is only a little more di cult. In either case adjusting the parameter can cause both theoretical and computational di culties. In theory, there is no problem with having the parameter large. Indeed, if only increases in the parameter are allowed, the assumptions of Section 4 coupled with a reasonable adjustment strategy will lead to a provably nite value of the parameter and convergence can be proved using the techniques of Section 4. Computationally, however, having a large value of the parameter implies that there will be too much emphasis on satisfying the constraints. The result will be that the iterates will be forced to follow the constraint boundary closely which, in highly nonlinear problems, can cause the algorithm to be slow. A strategy for only allowing increases in the parameter can lead to an excessively large value due entirely to an early iterate's being far from the solution.
45
Ideally the parameter should be adjusted up or down at various stages of the iteration process to ensure both good theoretical and computational performance. Some strategies for this have been proposed. For example, it is possible to allow controlled decreases in the parameter that ensure that the predicted decrease in the merit function is not dominated by a decrease in the constraint infeasibilities. For such choices it is still possible to prove convergence.
7.3. Nonmonotone Decrease of the Merit Function Global convergence results are usually proved by insisting that an appropriate merit function be su ciently reduced at each iteration. Sometimes, however, it is more e cient computationally to be less conservative and to allow steps to be accepted even if the merit function is temporarily increased. If, for example, the merit function is forced to be reduced over any xed number of iterations, then convergence follows. In practice such strategies have been quite successful, especially near the solution. As a particular example, note that the merit function 1 may not allow a steplength of one near the solution. One remedy for this is to accept the full step temporarily even if 1 increases. Then, if 1 is not su ciently reduced after a small number of steps, the last good iterate is restored and reduction in 1 is required for the next iteration. As a second example, it was noted in Section 4 that the use of F requires the evaluation of f , h, g , and their gradients just to test a prospective point for acceptance. This di culty can be circumvented by using an \approximate" or \working" merit function that only requires evaluation of f , h, and g, and no gradients, to test a point. For example, F could be approximated (in the equality-constrained case) by k (x) = f (x) + h(x)t u(xk ) + 1 kh(x)k2 F 2 k th where u(x ) stays xed throughout the k iteration. This is then coupled with a strategy to monitor the iterations to ensure that F is su ciently reduced after a certain number of iterations as discussed above. 7.4. Large Scale Problems E cient SQP algorithms in the large scale case depend on carefully addressing many factors. In this section we mention two of these, namely problem structure and the solution of large scale quadratic programs. Problems are considered large if, to solve them e ciently, either their structure must be exploited or the storage and manipulation of the matrices involved must be handled in a special manner. The most obvious structure, and the one most commonly considered, is sparsity of the matrices rh, rg, and H L: Typically in large scale problems most constraints depend only on
46
a few of the variables and the objective function is \ partially separable" (a partially separable function is, for example made up of a sum of functions, each of which depends only on a few of the variables). In such cases the matrices are sparse. In other cases, one or more of the matrices is of low rank, and this structure can also be exploited. A key to using SQP in the large scale case is to be able to solve, or to solve approximately, the large quadratic programming subproblems. In the small scale case, it rarely matters how (QP) is solved and it usually makes sense to solve it completely. Depending on the algorithm used in the large scale case, it is often essential to realize that (QP) at iteration k + 1 di ers only slightly from that at iteration k. In these cases, the active set may remain the same, or change only slightly, and it follows that solving (approximately) the next (QP) can be accomplished quickly. It must be shown, however, that approximate solutions of (QP) are descent directions for an appropriate merit function. Recently, e cient interior point methods have been developed for solving large scale linear programs these ideas are now being applied to large scale quadratic programs. One such method is the \subspace" method where, at each iteration, a low-dimensional subspace is chosen and (QP) restricted to that subspace is solved. A step in the resulting direction is calculated and the procedure iterated. It has been shown that any number of iterations of this technique gives rise to a descent direction for an augmented Lagrangian merit function an SQP algorithm based on this has shown promise.
7.5. Notes and References See (Fletcher 1981) for an introduction to the solution of quadratic programs using classical techniques, including the Big M method. Recently there have been numerous papers on the application of interior-point methods to quadratic programming problems. This is still the subject of intense research and there are no good survey papers on the subject. See (Vanderbei and Carpenter 1993) and (Boggs, Domich, Rogers and Witzgall 1991a) or (Boggs, Domich, Rogers and Witzgall 1994a) for a discussion of two di erent approaches. Adjusting the penalty parameter in the augmented Lagrangian merit function is discussed in (Schittkowski 1981), (Schittkowski 1983), (Powell and Yuan 1986), (Byrd and Nocedal 1991), (Boggs, Tolle and Kearsley 1994b), (Byrd and Nocedal 1991), and (El-Alem 1992). The idea of allowing the merit function to increase temporarily is an old one. It is sometimes referred to as a nonmonotone line search for a general discussion see (Grippo, Lampariello and Lucidi 1986). Its use with the `1 merit function is discussed in (Chamberlain et al. 1982). Using approximate merit functions is
47
suggested in (Powell and Yuan 1986), (Boggs and Tolle 1989), and (Boggs et al. 1994b). Based on the work to date there is signi cant promise for SQP in the large scale case, but more work is required to compare SQP with other large scale techniques. For work to date see (Murray 1994), (Murray and Prieto to appear), and (Boggs et al. 1994b).
Palomares, Anthony Kearsley, Stephen Nash, Dianne O'Leary, and G. W. Stewart for reading and commenting on early drafts of the paper.
48
REFERENCES
D. P. Bertsekas (1980), Variable metric methods for constrained optimization based on di erentiable exact penalty functions, in Proceedings of the Eighteenth Allerton Conference on Communications, Control and Computing, Allerton Park, Illinois, pp. 584{593. D. P. Bertsekas (1982), Constrained Optimization and Lagrange Multipliers Methods, Academic Press, London. P. T. Boggs and J. W. Tolle (1980), `Augmented Lagrangians which are quadratic in the multiplier', Journal of Optimization Theory and Applications 31, 17{26. P. T. Boggs and J. W. Tolle (1984), `A family of descent functions for constrained optimization', SIAM Journal on Numerical Analysis 21, 1146{1161. P. T. Boggs and J. W. Tolle (1989), `A strategy for global convergence in a sequential quadratic programming algorithm', SIAM Journal on Numerical Analysis 21, 600{623. P. T. Boggs and J. W. Tolle (1994), `Convergence properties of a class of rank-two updates', SIAM Journal on Optimization 4, 262{287. P. T. Boggs, P. D. Domich, J. E. Rogers and C. Witzgall (1991a), `An interior point method for linear and quadratic programming problems', Mathematical Programming Society Committee on Algorithms Newsletter (19), 32{40. P. T. Boggs, P. D. Domich, J. E. Rogers and C. Witzgall (1994a), An interior-point method for general large scale quadratic programming problems, Internal Report (in process), National Institute of Standards and Technology. P. T. Boggs, J. W. Tolle and A. J. Kearsley (1991b), A merit function for inequality constrained nonlinear programming problems, Internal Report 4702, National Institute of Standards and Technology. P. T. Boggs, J. W. Tolle and A. J. Kearsley (1994b), A practical algorithm for general large scale nonlinear optimization problems, Internal report, National Institute of Standards and Technology. P. T. Boggs, J. W. Tolle and P. Wang (1982), `On the local convergence of quasiNewton methods for constrained optimization', SIAM Journal on Control and Optimization 20, 161{171. J. F. Bonnans, E. R. Panier, A. L. Tits and J. L. Zhou (1992), `Avoiding the Maratos e ect by means of a nonmonotone line search II. Inequality constrained problems|feasible iterates', SIAM Journal on Numerical Analysis 29, 1187{1202. C. G. Broyden, J. E. Dennis, Jr. and J. J. More (1973), `On the local and superlinear convergence of quasi-Newton methods', Journal of the Institute of Mathematical Applications 12, 223{246. R. H. Byrd and J. Nocedal (1991), `An analysis of reduced Hessian methods for constrained optimization', Mathematical Programming 49, 285{323. R. H. Byrd and R. B. Schnabel (1986), `Continuity of the null space basis and constrained optimization', Mathematical Programming 35, 32{41. R. H. Byrd, R. B. Schnabel and G. A. Schultz (1985), `A trust-region strategy for nonlinearly constrained optimization', SIAM Journal on Numerical Analysis 24, 1152{1170.
49
R. H. Byrd, R. A. Tapia and Y. Zhang (1992), `An SQP augmented Lagrangian BFGS algorithm for constrained optimization', SIAM Journal on Optimization 2, 210{241. M. R. Celis, J. E. Dennis, Jr. and R. A. Tapia (1985), A trust-region strategy for nonlinear equality constrained optimization, in Numerical Optimization 1984 (P. Boggs, R. Byrd and R. Schnabel, eds), SIAM, Philadelphia, pp. 71{82. R. Chamberlain, C. Lemarechal, H. C. Pedersen and M. J. D. Powell (1982), `The watchdog technique for forcing convergence in algorithms for constrained optimization', Mathematical Programming Study 16, 1{17. T. F. Coleman (1990), On characterizations of superlinear convergence for constrained optimization, in Lectures in Applied Mathematics, American Mathematical Society, Providence, Rhode Island, pp. 113{133. T. F. Coleman and A. R. Conn (1984), `On the local convergence of a quasi-Newton method for the nonlinear programming problem', SIAM Journal on Numerical Analysis 21, 755{769. T. F. Coleman and P. A. Feynes (1992), `Partitioned quasi-Newton methods for nonlinear equality constrained optimization', Mathematical Programming 53, 17{ 44. T. F. Coleman and D. C. Sorensen (1984), `A note on the computation of an orthonormal basis for the null space of a matrix', Mathematical Programming 29, 234{242. J. E. Dennis, Jr. and J. J. More (1977), `Quasi-Newton methods, motivation and theory', SIAM Review 19, 46{89. J. E. Dennis, Jr. and R. B. Schnabel (1983), Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cli s, New Jersey. M. M. El-Alem (1991), `A global convergence theory for the Celis-Dennis-Tapia trust region algorithm for constrained optimization', SIAM Journal on Numerical Analysis 28, 266{290. M. M. El-Alem (1992), A robust trust region algorithm with nonmonotonic penalty parameter scheme for constrained optimization, Department of Mathematical Sciences 92-30, Rice University. R. Fletcher (1972), A class of methods for nonlinear programming, iii: Rates of convergence, in Numerical Methods for Nonlinear Optimization (F. A. Lootsma, ed.), Academic Press, New York, pp. 371{382. R. Fletcher (1981), Practical Methods of Optimization: Volume 2, John Wiley & Sons, New York. R. Fontecilla (1988), `Local convergence of secant methods for nonlinear constrained optimization', SIAM Journal on Numerical Analysis 25, 692{712. R. Fontecilla, T. Steihaug and R. A. Tapia (1987), `A convergence theory for a class of quasi-Newton methods for constrained optimization', SIAM Journal on Numerical Analysis 24, 1133{1151. M. Fukushima (1986), `A successive quadratic programming algorithm with global and superlinear convergence properties', Mathematical Programming 35, 253{ 264.
50
D. Gabay (1982), `Reduced quasi-Newton methods with feasibility improvement for nonlinearly constrained optimization', Mathematical Programming Study 16, 18{44. U. M. Garcia-Palomares and O. L. Mangasarian (1976), `Superlinearly convergent quasi-Newton methods for nonlinearly constrained optimization problems', Mathematical Programming 11, 1{13. D. M. Gay (1981), `Computing optimal locally constrained steps', SIAM Journal on Scienti c and Statistical Computing 2, 186{197. R. Ge and M. J. D. Powell (1983), `The convergence of variable metric matrices in unconstrained optimization', Mathematical Programming 27, 123{143. J. C. Gilbert (1993), Superlinear convergence of a reduced BFGS method with piecewise line-search and update criterion, Rapport de Recherche 2140, Institut National de Recherche en Informatique et en Automatique. P. E. Gill, W. Murray and M. H. Wright (1981), Practical Optimization, Academic Press, New York. P. E. Gill, W. Murray, M. A. Saunders and M. H. Wright (1986), User's guide for NPSOL (version 4.0): A Fortran package for nonlinear programming, Technical Report SOL 2, Department of Operations Research, Stanford University. P. E. Gill, W. Murray, M. A. Saunders, G. W. Stewart and M. H. Wright (1985), `Properties of a representation of a basis for the null space', Mathematical Programming 33, 172{186. S. T. Glad (1979), `Properties of updating methods for the multipliers in augmented Lagrangians', Journal of Optimization Theory and Applications 28, 135{156. J. Goodman (1985), `Newton's method for constrained optimization', Mathematical Programming 33, 162{171. L. Grippo, F. Lampariello and S. Lucidi (1986), `A nonmonotone line search technique for Newton's method', SIAM Journal on Numerical Analysis 23, 707{ 716. S.-P. Han (1976), `Superlinearly convergent variable metric algorithms for general nonlinear programming problems', Mathematical Programming 11, 263{282. S.-P. Han (1977), `A globally convergent method for nonlinear programming', Journal of Optimization Theory and Applications 22, 297{309. M. Lalee, J. Nocedal and T. Plantega (1993), On the implementation of an algorithm for large-scale equality optimization, Department of Electrical Engineering and Computer Science NAM 09-93, Northwestern University. D. G. Luenberger (1984), Linear and Nonlinear Programming, 2nd Edition, AddisonWesley, Reading, Massachusetts. J. More and D. C. Sorensen (1983), `Computing a trust region step', SIAM Journal of Scienti c and Statistical Computing 4, 553{572. W. Murray (1994), Algorithms for large nonlinear problems, in Mathematical Programming: State of the Art 1994 (J. R. Birge and K. G. Murty, eds), University of Michigan, Ann Arbor, MI, pp. 172{185. W. Murray and F. J. Prieto ( to appear), `A sequential quadratic programming algorithm using an incomplete solution of the subproblem', SIAM Journal on Optimization. S. G. Nash and A. Sofer (1995), Linear and Nonlinear Programming, McGraw-Hill, New York.
51
J. Nocedal (1992), `Theory of algorithms for unconstrained optimization', Acta Numerica 1991, 199{242. J. Nocedal and M. L. Overton (1985), `Projected Hessian updating algorithms for nonlinearly constrained optimization problems', SIAM Journal on Numerical Analysis 22, 821{850. E. M. Omojokun (1989), Trust Region Algorithms for Optimization with Nonlinear Equality and Inequality Constraints, PhD thesis, University of Colorado. J. M. Ortega and W. C. Rheinboldt (1970), Iterative Solution of Nonlinear Equations In Several Variables, Academic Press, New York. E. R. Panier and A. L. Tits (1993), `On combining feasibility, descent and superlinear convergence in inequality constrained optimization', Mathematical Programming 59, 261{276. E. Polak (1989), Basics of minimax algorithms, in Nonsmooth Optimization and Related Topics (F. H. Clark, V. F. Dem'yanov and F. Giannessi, eds), Plenum, New York, pp. 343{369. M. J. D. Powell (1977), A fast algorithm for nonlinearly constrained optimization calculations, in Numerical Analysis Dundee, 1977 (G. A. Watson, ed.), SpringerVerlag, Berlin, pp. 144{157. M. J. D. Powell (1978a), `Algorithms for nonlinear constraints that use Lagrangian functions', Mathematical Programming 14, 224{248. M. J. D. Powell (1978b), The convergence of variable metric methods for nonlinearly constrained optimization calculations, in Nonlinear Programming 3 (O. Mangasarian, R. Meyer and S. Robinson, eds), Academic Press, New York, NY, pp. 27{64. M. J. D. Powell and Y. Yuan (1986), `A recursive quadratic programming algorithm that uses di erentiable exact penalty functions', Mathematical Programming 35, 265{278. S. M. Robinson (1974), `Perturbed Kuhn-Tucker points and rates of convergence for a class of nonlinear-programming algorithms', Mathematical Programming 7, 1{16. K. Schittkowski (1981), `The nonlinear programming method of Wilson, Han, and Powell with an augemented Lagrangian type line search function, part 1: Convergence analysis', Numerische Mathematik 38, 83{114. K. Schittkowski (1983), `On the convergence of a sequential quadratic programming method with an augmented Lagrangian line search function', Math. Operationsforsch. U. Statist., Ser. Optimization 14, 197{216. J. Stoer (1984), `The convergence of matrices generated by rank-2 methods from the restricted -class of Broyden', Numerische Mathematik 44, 37{52. R. A. Tapia (1974), `A stable approach to Newton's method for optimization problems with equality constraints', Journal of Optimization Theory and Applications 14, 453{476. R. A. Tapia (1977), `Diagonalized multiplier methods and quasi-Newton methods for constrained optimization', Journal of Optimization Theory and Applications 22, 135{194. R. A. Tapia (1978), Quasi-Newton methods for equality constrained optimization: Equivalence of existing methods and a new implementation, in Nonlinear
52
Press, New York, NY, pp. 125{164. R. A. Tapia (1988), `On secant updates for use in general constrained optimization', Mathematics of Computation 51, 181{202. R. J. Vanderbei and T. J. Carpenter (1993), `Symmetric inde nite systems for interior point methods', Mathematical Programming 58, 1{32. A. Vardi (1985), `A trust-region algorithm for equality constrained minimization and implementation', SIAM Journal on Numerical Analysis 22, 575{591. R. B. Wilson (1963), A Simplicial Algorithm for Concave Programming, PhD thesis, Harvard University. P. Wolfe (1975), A method of conjugate subgradients for minimizingnondi erentiable functions, in Mathematical Programming Study 3 (M. Balinski and P. Wolfe, eds), North Holland, pp. 145{173. Y. Yuan (1985), `An only 2-step q-superlinear convergence example for some algorithms that use reduced Hessian approximations', Mathematical Programming 32, 224{231. Y. Yuan (1990), `On a subproblem of trust region algorithms for constrained optimization', Mathematical Programming 47, 53{63.