Steepest Descent
Keywords: optimization, gradient, minimization, Cauchy
Introduction
The classical steepest descent method is one of the oldest methods for the minimization of a general nonlinear function. The steepest descent method, also known as the gradient descent method, was first proposed by Cauchy in 1847 [1]. In the original paper, Cauchy proposed the use of the gradient as a way of solving a nonlinear equation of the form
$$ f(x_1, x_2, \ldots, x_n) = 0, \qquad (1) $$
where $f$ is a real-valued continuous function that never becomes negative and which remains continuous, at least within certain limits. The basis for the method is the simple observation that a continuous function should decrease, at least initially, if one takes a step along the direction of the negative gradient. The only difficulty then is deciding how to choose the length of the step one should take. While this is easy to compute for special cases such as a convex quadratic function, the general case usually requires the minimization of the function in question along the negative gradient direction.

Despite its simplicity, the steepest descent method has played an important role in the development of the theory of optimization. Unfortunately, the method is known to be quite slow on most real-world problems and is therefore not widely used. Instead, more powerful methods such as the conjugate gradient method or quasi-Newton methods are frequently used. Recently, however, several modifications have been proposed to improve the efficiency of the method. These modifications have led to a newfound interest in the steepest descent method, from both a theoretical and a practical viewpoint. They point to the interesting observation that the gradient direction itself is not a bad choice, but rather that the classical choice of step length leads to the slow convergence behavior.

This work was supported in part by the Director, Office of Science, U.S. Department of Energy, under Contract No. DE-AC02-05CH11231.
Suppose that we would like to find the minimum of a function $f(x)$, where $x \in \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$. We will denote the gradient of $f$ by $g_k = g(x_k) = \nabla f(x_k)$. The general idea behind most minimization methods is to compute a step along a given search direction $d_k$, for example,
$$ x_{k+1} = x_k + \alpha_k d_k, \quad k = 0, 1, \ldots, \qquad (2) $$
where the step length $\alpha_k$ is chosen by a line search,
$$ \alpha_k = \arg\min_{\alpha \ge 0} f(x_k + \alpha d_k). \qquad (3) $$
Here $\arg\min$ refers to the argument of the minimum for the given function. For the steepest descent method, the search direction is given by $d_k = -\nabla f(x_k)$. The steepest descent algorithm can now be written as follows (Algorithm 1). The two main computational advantages of the steepest descent algorithm are the ease with which it can be implemented and its low storage requirements, $O(n)$. The main work per iteration is the line search required to compute the step length $\alpha_k$ and the computation of the gradient.
Algorithm 1: Steepest Descent Method
  Given an initial $x_0$, $d_0 = -g_0$, and a convergence tolerance tol
  for $k = 0$ to maxiter do
    Set $\alpha_k = \arg\min_\alpha \phi(\alpha) = f(x_k - \alpha g_k)$
    $x_{k+1} = x_k - \alpha_k g_k$
    Compute $g_{k+1} = \nabla f(x_{k+1})$
    if $\|g_{k+1}\|_2 \le$ tol then
      converged
    end if
  end for
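As a concrete illustration, the following is a minimal sketch of Algorithm 1 in Python. The names (steepest_descent, max_iter) are ours, and the exact line search of Algorithm 1 is replaced here by a simple backtracking search, since the exact minimizer of $f$ along $-g_k$ is rarely available in closed form for a general nonlinear function.

    import numpy as np

    def steepest_descent(f, grad, x0, tol=1e-6, max_iter=10000):
        """Algorithm 1 with a backtracking (Armijo) line search."""
        x = np.asarray(x0, dtype=float)
        for k in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= tol:        # convergence test on ||g_k||_2
                return x, k
            # Backtracking line search along d_k = -g_k:
            # halve alpha until a sufficient-decrease condition holds.
            alpha, fx, gg = 1.0, f(x), g @ g
            while f(x - alpha * g) > fx - 1e-4 * alpha * gg:
                alpha *= 0.5
            x = x - alpha * g                   # x_{k+1} = x_k - alpha_k g_k
        return x, max_iter

For a convex quadratic, the backtracking loop can be replaced by the closed-form step length derived in the next section.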
Convergence Theory
One of the main advantages of the steepest descent method is that it has a well-developed convergence theory [2, 3]. It is fairly easy to show that the steepest descent method has a linear rate of convergence, which is not too surprising given the simplicity of the method. Unfortunately, even for mildly nonlinear problems this results in convergence that is too slow for most practical applications. On the other hand, the convergence theory for the steepest descent method is extremely useful in understanding the convergence behavior of more sophisticated methods. To start, let us consider the case of minimizing the quadratic function
$$ f(x) = \frac{1}{2} x^T Q x - b^T x, \qquad (4) $$
where $b \in \mathbb{R}^n$ and $Q$ is an $n \times n$ symmetric positive definite matrix. Since $Q$ is symmetric and positive definite, all of its eigenvalues are real and positive. Let the eigenvalues of $Q$ be given by $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. The gradient of (4) is simply
$$ g(x) = Qx - b, \qquad (5) $$
so we can write one step of the method of steepest descent as
$$ x_{k+1} = x_k - \alpha_k (Q x_k - b), \qquad (6) $$
where $\alpha_k$ is chosen to minimize $f(x)$ along the direction $-g_k$. A simple calculation (for the quadratic case) yields the following formula for $\alpha_k$:
$$ \alpha_k = \frac{g_k^T g_k}{g_k^T Q g_k}. \qquad (7) $$
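For completeness, here is the short calculation behind (7). Restricting the quadratic (4) to the ray $x_k - \alpha g_k$ and using $g_k = Qx_k - b$ gives
$$ \phi(\alpha) = f(x_k - \alpha g_k) = f(x_k) - \alpha\, g_k^T g_k + \frac{1}{2}\alpha^2\, g_k^T Q g_k. $$
Setting $\phi'(\alpha) = -g_k^T g_k + \alpha\, g_k^T Q g_k = 0$ and solving for $\alpha$ yields (7); since $Q$ is positive definite, $g_k^T Q g_k > 0$, so this stationary point is indeed the minimizer along the ray.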
To analyze the convergence, it is easiest to consider the quantity $f(x_k) - f(x^*)$, where $x^*$ denotes the global minimizer of equation (4). Here we will follow proofs that can be found in standard texts such as [2, 3]. We first note that the unique minimizer of equation (4) is given by the solution of the linear system
$$ Qx = b. \qquad (8) $$
Consider the quantity:
$$ f(x_k) - f(x^*) = \frac{1}{2} x_k^T Q x_k - b^T x_k - \left( \frac{1}{2} (x^*)^T Q x^* - b^T x^* \right) $$
$$ = \frac{1}{2} x_k^T Q x_k - (Qx^*)^T x_k - \frac{1}{2} (x^*)^T Q x^* + (Qx^*)^T x^* $$
$$ = \frac{1}{2} (x_k - x^*)^T Q (x_k - x^*). $$
To compute a bound, one uses a lemma due to Kantorovich, which can be found in Luenberger [2]. In particular, when the method of steepest descent with exact line searches is used on a strongly convex quadratic function, then
$$ f(x_{k+1}) - f(x^*) \le \left( \frac{\kappa(Q) - 1}{\kappa(Q) + 1} \right)^2 \left( f(x_k) - f(x^*) \right), \qquad (9) $$
where $\kappa(Q) = \lambda_n / \lambda_1$ is the condition number of the matrix $Q$. A similar bound can be derived for a general nonlinear objective function if we assume that $\alpha_k$ is the global minimizer along the search direction.
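To get a feel for the numbers in (9): for a well-conditioned problem with $\kappa(Q) = 10$, the contraction factor is $(9/11)^2 \approx 0.67$, so the error in the function value is reduced by roughly a third at every iteration. For $\kappa(Q) = 100$, the factor is $(99/101)^2 \approx 0.961$, and since $\ln(10)/\ln(1/0.961) \approx 58$, the method needs on the order of 58 iterations to gain a single digit of accuracy.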
Example
Consider a simple example of a three-dimensional quadratic function given by
$$ f(x) = \frac{1}{2} x^T Q x - b^T x, \qquad (10) $$
where
$$ Q = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \gamma & 0 \\ 0 & 0 & \gamma^2 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}. $$
Using the steepest descent algorithm on this example problem produces the following results. The convergence tolerance was set so that the algorithm would terminate when $\|g(x_k)\| \le 10^{-6}$. One can clearly see the effects of even a mildly large condition number, as predicted by the error bound and as seen in the number of iterations required to achieve convergence.
[Table: number of iterations required for convergence for $\gamma = 2, 5, 10, 20, 50$.]
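An experiment of this shape can be reproduced with a few lines of Python, using the $Q$ and $b$ of (10) and the exact step length (7) in place of a backtracking search. The function name sd_quadratic and the starting point $x_0 = 0$ are our choices, so the printed iteration counts should be read qualitatively; the growth with $\gamma$ is the point.

    import numpy as np

    def sd_quadratic(Q, b, x0, tol=1e-6, max_iter=1000000):
        """Steepest descent on f(x) = 0.5 x^T Q x - b^T x with exact step (7)."""
        x = np.asarray(x0, dtype=float)
        for k in range(max_iter):
            g = Q @ x - b                      # gradient, equation (5)
            if np.linalg.norm(g) <= tol:
                return x, k
            alpha = (g @ g) / (g @ (Q @ g))    # exact line search, formula (7)
            x = x - alpha * g
        return x, max_iter

    for gamma in [2, 5, 10, 20, 50]:
        Q = np.diag([1.0, gamma, gamma**2])
        b = np.ones(3)
        _, iters = sd_quadratic(Q, b, np.zeros(3))
        print(f"gamma = {gamma:2d}: kappa = {gamma**2:4d}, iterations = {iters}")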
Scaling
One of the most important aspects of minimizing real-world problems is the issue of scaling. Because of the way many scientific and engineering problems are initially formulated, it is not uncommon to run into trouble because the variables have widely differing magnitudes. This can happen for many reasons, but a common one is that the variables carry different physical units, which can leave the optimization variables orders of magnitude apart. For example, one variable could be given in kilometers ($10^3$ meters) and another in milliseconds ($10^{-3}$ seconds), leading to a six-order-of-magnitude difference. As a general rule of thumb, however, one would like all the variables in an optimization problem to have roughly similar magnitudes. This leads to better decisions about which search direction to choose, as well as about when convergence has been achieved. One fairly standard approach is to use a diagonal scaling based on what a typical value of each variable is expected to be. One then transforms the variables by the scaling
$$ \hat{x} = Dx, \qquad (11) $$
where $D$ is a diagonal scaling matrix. In the test problem given above, for example, the minimizer is $x^* = (1, 1/\gamma, 1/\gamma^2)^T$, whose components differ by a factor of $\gamma^2$, so one simple choice would be
$$ D = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \gamma & 0 \\ 0 & 0 & \gamma^2 \end{pmatrix}. \qquad (12) $$
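It is worth recording what the transformation (11) does to the quadratic test problem; the following identity is a standard observation rather than something from the text above. Substituting $x = D^{-1}\hat{x}$ into (10), the objective in the scaled variables becomes
$$ \frac{1}{2} \hat{x}^T \left( D^{-1} Q D^{-1} \right) \hat{x} - \left( D^{-1} b \right)^T \hat{x}, $$
so the method effectively sees the Hessian $D^{-1} Q D^{-1}$. In the extreme case $D = Q^{1/2}$, the scaled Hessian is the identity, the condition number drops to 1, and steepest descent with an exact line search converges in a single step. For general problems no such perfect scaling is available, which is why cheap diagonal scalings based on typical variable magnitudes are used instead.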
Extensions
Recently, several new modifications of the steepest descent method have been proposed. In 1988, Barzilai and Borwein [4] proposed two new step sizes for use with the negative gradient direction. Although their method does not guarantee descent in the objective function values, their numerical results indicated a substantial improvement over the classical steepest descent method. One of their main observations was that the behavior of the steepest descent algorithm depends as much on the step size as on the search direction. They proposed instead the following procedure. First one writes the new iterate as
$$ x_{k+1} = x_k - \frac{1}{\alpha_k} g_k. \qquad (13) $$
Then, instead of computing the step size by doing a line search or by using the formula (7) for the quadratic case, one computes the step length $\alpha_k$ through the formula
$$ \alpha_k = \frac{s_{k-1}^T y_{k-1}}{s_{k-1}^T s_{k-1}}, \qquad (14) $$
where $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = g_k - g_{k-1}$. A sketch of this update is given below.
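The sketch below implements the iteration (13)-(14) in Python; the name bb_gradient and the choice alpha = 1 for the first step (before any history exists) are ours. No safeguards or globalization are included, so, as noted above, descent is not guaranteed.

    import numpy as np

    def bb_gradient(grad, x0, tol=1e-6, max_iter=10000):
        """Barzilai-Borwein iteration: x_{k+1} = x_k - (1/alpha_k) g_k."""
        x = np.asarray(x0, dtype=float)
        g = grad(x)
        alpha = 1.0                       # first step: no history yet
        for k in range(max_iter):
            if np.linalg.norm(g) <= tol:
                return x, k
            x_new = x - (1.0 / alpha) * g
            g_new = grad(x_new)
            s = x_new - x                 # s_{k-1} = x_k - x_{k-1}
            y = g_new - g                 # y_{k-1} = g_k - g_{k-1}
            alpha = (s @ y) / (s @ s)     # formula (14); note s^T y > 0 holds for
                                          # strictly convex quadratics, not in general
            x, g = x_new, g_new
        return x, max_iter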
Using this new formula, Barzilai and Borwein were able to produce a substantial improvement in the performance of the steepest descent algorithm on certain test problems. Subsequently, Raydan was able to prove convergence of the Barzilai-Borwein method for a strictly convex quadratic function in any number of variables, and in 1997 he employed a nonmonotone line search strategy due to Grippo, Lampariello, and Lucidi [5] to obtain a method that guarantees global convergence [6] for the general nonlinear case. For an excellent overview of this subject and further details, see [7].

The steepest descent method is one of the oldest known methods for minimizing a general nonlinear function. The convergence theory for the method is widely used and is the basis for understanding many of the more sophisticated and well-known algorithms. However, the basic method is well known to converge slowly on many problems and is rarely used in practice. Recent results have generated a renewed interest in the steepest descent method. The main observation is that the steepest descent direction can be used with a different step size than the classical method, one that can substantially improve the convergence. One disadvantage, however, is the loss of monotone convergence. After so many years, it is interesting to note that this method can still yield some surprising results.
References
[1] A. Cauchy. Méthodes générales pour la résolution des systèmes d'équations simultanées. C. R. Acad. Sci. Paris, 25:536-538, 1847.
[2] D. G. Luenberger and Yinyu Ye. Linear and Nonlinear Programming. Springer, 2008.
[3] Stephen G. Nash and Ariela Sofer. Linear and Nonlinear Programming. McGraw-Hill, 1996.
[4] J. Barzilai and J. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8:141-148, 1988.
[5] L. Grippo, F. Lampariello, and S. Lucidi. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 23:707-716, 1986.
[6] M. Raydan. The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7(1):26-33, 1997.
[7] R. Fletcher. On the Barzilai-Borwein method. In L. Qi, K. Teo, and X. Yang, editors, Optimization and Control with Applications, pages 235-256. Springer, 2005.