Conjugate Gradient
6.1 Introduction
Conjugate direction methods can be regarded as being "in between" the steepest descent
method and the Newton method as far as the rate of convergence is concerned. These methods
are motivated by the desire to accelerate the typically slow rate of convergence associated with
the steepest descent method, while avoiding the information requirements associated with the
evaluation, storage, and inversion of the Hessian required by the Newton method, or the Hessian
approximations that must be stored and updated by quasi-Newton methods.
The conjugate gradient method uses gradients of the cost function to generate conjugate
directions. At step k, one evaluates the current negative gradient vector and adds to it a linear
combination of the previous direction vectors to obtain a new conjugate direction along which to
move. In these methods, the first step is identical to the steepest descent step; each succeeding
step moves in a direction that is a linear combination of the current steepest descent (negative
gradient) direction and the preceding direction vector. There are three primary advantages of the
conjugate gradient methods:
1. Unless the solution is obtained in fewer than n steps, the gradient of the cost function is always
nonzero (except at saddle points) and linearly independent of all previous direction vectors in
the case of quadratic functions (with exact arithmetic).
2. A more important advantage of the methods is the availability of a very simple formula to
determine the search direction using the gradient vector and a previous direction. This
simplicity makes the methods only slightly more complicated to implement than the steepest
descent method.
3. Since the directions are based on gradients of the cost function, the process makes good
uniform progress toward the solution at every step.
In Section 6.2, the conjugate gradient method is derived for a strictly convex quadratic
function. It is then extended for general functions in Section 6.3.
Conjugate gradient methods are invariably derived and analyzed for the purely quadratic
problem
min_x q(x) = (1/2)(x, Qx) - (b, x)    (6.2.1)
where Q is an n × n symmetric positive definite matrix and b is an n × 1 nonzero vector. The
gradient of q is c = ∇q(x) = Qx - b, and c = 0 is a necessary and sufficient condition for a global
minimum of q(x). The techniques, once worked out for this problem, are then extended, by
approximation, to more general nonlinear problems. It is argued that, since every problem is
approximately quadratic near the solution point, convergence behavior is similar to that for the
pure quadratic situation.
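As a concrete illustration of this setup, the short sketch below (in Python with NumPy) evaluates q(x) and its gradient c = Qx - b and verifies that c = 0 at the minimizer; the particular Q and b used are arbitrary examples, not taken from the text.

```python
import numpy as np

def q(x, Q, b):
    """Quadratic cost q(x) = 1/2 (x, Qx) - (b, x)."""
    return 0.5 * x @ Q @ x - b @ x

def grad_q(x, Q, b):
    """Gradient c(x) = Qx - b; c = 0 is necessary and sufficient for the minimum."""
    return Q @ x - b

# Arbitrary example data: a small symmetric positive definite Q and a nonzero b.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x_star = np.linalg.solve(Q, b)     # solves the optimality condition Qx = b
print(grad_q(x_star, Q, b))        # ~[0, 0]
```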
Definition: Conjugate Directions. Given a symmetric matrix Q, two nonzero vectors d^(i) and
d^(j) are said to be Q-conjugate if (d^(i), Qd^(j)) = 0 for i ≠ j.
This algorithm (Algorithm 6.1), developed by Hestenes and Stiefel (1952), has the following
properties with exact arithmetic:
1. (c^(k), c^(i)) = 0, for all i < k    (6.2.8)
2. (c^(k), d^(i)) = 0, or (c^(k), s^(i)) = 0, for all i < k    (6.2.9)
3. (d^(k), Qd^(i)) = 0, for all i < k    (6.2.10)
4. The global minimum is obtained in at most n steps.
5. (d^(k), c^(k)) = -(c^(k), c^(k))    (6.2.11)
where s^(i) = α_i d^(i). Proofs of the above results will be presented later; they can also be found in
Luenberger (1984) and Fletcher (1989). Equations (6.2.8) and (6.2.9) show that the gradient of the
cost function, c^(k), at the point x^(k) is orthogonal to the gradients of the cost function as well as the
search directions at all the previous iterations. Equation (6.2.10) is the conjugate direction
condition. Equation (6.2.11) shows that d^(k) is a direction of descent. Note that the descent
condition implies d^(k) ≠ 0 for c^(k) ≠ 0.
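For concreteness, the following is a minimal sketch of the conjugate gradient iteration for the quadratic problem (6.2.1); it is not a transcription of Algorithm 6.1, but it assumes the same structure: a steepest descent first step, the exact step size α_k = -(c^(k), d^(k))/(d^(k), Qd^(k)) of Eq. (6.2.4), and the direction update d^(k+1) = -c^(k+1) + β_k d^(k). The data (Q, b, x0) and the tolerance are arbitrary examples.

```python
import numpy as np

def cg_quadratic(Q, b, x0, tol=1e-12):
    """Minimize q(x) = 1/2 (x, Qx) - (b, x) by conjugate gradients (sketch)."""
    n = len(b)
    x = x0.copy()
    c = Q @ x - b                       # gradient c^(0)
    d = -c                              # first step: steepest descent
    for _ in range(n):                  # at most n steps for a quadratic
        if np.linalg.norm(c) < tol:
            break
        alpha = -(c @ d) / (d @ Q @ d)  # exact line search step size
        x = x + alpha * d
        c_new = Q @ x - b
        # Polak-Ribiere form of beta; for a quadratic with exact steps it is
        # equivalent to the Hestenes-Stiefel and Fletcher-Reeves formulas.
        beta = (c_new @ (c_new - c)) / (c @ c)
        d = -c_new + beta * d
        c = c_new
    return x

# Example: the minimum of a random 5-dimensional quadratic is reached in 5 steps.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)             # symmetric positive definite
b = rng.standard_normal(5)
x = cg_quadratic(Q, b, np.zeros(5))
print(np.allclose(Q @ x, b))            # True up to round-off
```

Storing the iterates c^(k) and d^(k) generated by this loop is one way to check properties (6.2.8)-(6.2.11) numerically.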
The expression for the step size α_k in Eq. (6.2.4) is derived to minimize q(x^(k) + αd^(k)).
The first-order condition for this minimum is that
(d^(k), c^(k+1)) = 0, or (s^(k), c^(k+1)) = 0    (6.2.12)
The formula for β_k given in Eq. (6.2.6) is called the Hestenes-Stiefel formula. Using Eqs. (6.2.11)
and (6.2.12), the denominator of Eq. (6.2.6) reduces to (c^(k), c^(k)). Further, using the condition in
Eq. (6.2.8), Fletcher and Reeves (1964) derived the following formula for β_k:
β_k = (c^(k+1), c^(k+1)) / (c^(k), c^(k))    (6.2.13)
Fletcher (1989) suggested another formula for β_k by replacing (c^(k), c^(k)) in the denominator of
Eq. (6.2.13) with (d^(k), c^(k)) (using the condition in Eq. 6.2.11):
β_k = -(c^(k+1), c^(k+1)) / (d^(k), c^(k))    (6.2.14)
Note that the conditions in Eqs. (6.2.8) to (6.2.11) are true only for quadratic functions. Fletcher
and Reeves (1964) were the first to use the conjugate gradient method for general functions, which
is discussed in Section 6.3.
In this subsection, we will present derivations of the method and discuss several useful
properties.
Proposition. If Q is symmetric positive definite and the vectors d^(i), i = 0 to (n - 1), are
Q-conjugate, then these vectors are linearly independent.
Proof. To prove linear independence, form a linear combination and set it to zero, i.e.,
Σ_{i=0}^{n-1} α_i d^(i) = 0
where the α_i are constants. Premultiplying by Q and taking the scalar product with any d^(j), we get
Σ_{i=0}^{n-1} α_i (d^(j), Qd^(i)) = 0
Due to Q-conjugacy, all terms in the above equation are zero except the one with i = j,
α_j (d^(j), Qd^(j)) = 0
Since Q is positive definite, (d^(j), Qd^(j)) ≠ 0. Thus α_j = 0 for j = 0 to (n - 1), and the d^(i) are
linearly independent. ||
The linear independence of the d^(i) allows one to use them as basis vectors for a
subspace of the appropriate dimension. In particular, if d^(i), i = 0 to (n - 1), are known, then the
solution x* of the quadratic problem defined in Eq. (6.2.1) can be expressed as a linear
combination of these vectors.
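As a small numerical illustration of this proposition, and of the expansion of x* in conjugate directions, the sketch below uses the eigenvectors of a symmetric positive definite Q as one convenient set of Q-conjugate directions; the coefficient formula α_j = (d^(j), b)/(d^(j), Qd^(j)) used here follows from taking the scalar product of Qx* = b with each d^(j) and is not quoted from the text. The data are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)          # symmetric positive definite
b = rng.standard_normal(4)

# Eigenvectors of a symmetric Q are mutually orthogonal, hence Q-conjugate.
_, D = np.linalg.eigh(Q)             # columns of D serve as the directions d^(i)
G = D.T @ Q @ D
print(np.allclose(G, np.diag(np.diag(G))))   # off-diagonal (d^(i), Q d^(j)) ~ 0

# Linear independence: the matrix whose columns are the directions has full rank.
print(np.linalg.matrix_rank(D) == 4)

# Expansion of the solution x* of Qx = b in the conjugate directions.
denom = np.sum(D * (Q @ D), axis=0)  # (d^(j), Q d^(j)) for each direction
alphas = (D.T @ b) / denom           # alpha_j = (d^(j), b) / (d^(j), Q d^(j))
x_star = D @ alphas
print(np.allclose(Q @ x_star, b))    # True
```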
Since the d^(j) are linearly independent, the only vector that can be orthogonal to all the d^(j) is
the null vector, i.e.,
c^(n) = 0
which is the necessary and sufficient condition for global optimality of the convex quadratic
function. ||
a_{k+1,i} = (c^(k+1), Qd^(i)) / (d^(i), Qd^(i))    (6.2.24)
To show that the c^(j) are linearly independent, re-write Eq. (6.2.23) as
c^(k+1) = -d^(k+1) + Σ_{i=0}^{k} a_{k+1,i} d^(i)    (6.2.25)
That is, c^(k+1) is written as a linear combination of d^(i), i = 0 to k + 1. Using Eq. (6.2.1), we get
c^(k+1) = ∇q(x^(k+1)) = Qx^(k+1) - b    (6.2.26)
First we show that c^(k+1) is orthogonal to all d^(i), i = 0 to k (shown previously in Eqs. 6.2.16 and
6.2.17):
x^(k) = x^(0) + Σ_{i=0}^{k-1} α_i d^(i)
      = x^(j) + Σ_{i=j}^{k-1} α_i d^(i), for all j < k
Therefore,
c^(k) = Q(x^(j) + Σ_{i=j}^{k-1} α_i d^(i)) - b
      = c^(j) + Σ_{i=j}^{k-1} α_i Qd^(i)
      = c^(j+1) + Σ_{i=j+1}^{k-1} α_i Qd^(i)
Taking the scalar product with d^(j) for j < k, we get
c^(j) = -d^(j) + Σ_{i=0}^{j-1} a_{j,i} d^(i)
Taking the scalar product with c^(k) for k > j, we get
Taking the scalar product with d^(k), we get
α_k = (y^(k), d^(k)) / (d^(k), Qd^(k))    (6.2.30)
This expression for α_k cannot be used in numerical calculations because y^(k) depends on x^(k+1). If
exact line search is used, then (c^(k+1), d^(k)) = 0 and Eq. (6.2.30) reduces to
α_k = -(c^(k), d^(k)) / (d^(k), Qd^(k))    (6.2.31)
which is the same as given in Eq. (6.2.4).
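A quick numerical check of Eqs. (6.2.31) and (6.2.12), using arbitrary example data: with α_k computed from Eq. (6.2.31), the gradient at the new point is orthogonal to the search direction, which is exactly the exact line search condition.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
Q = A @ A.T + 3 * np.eye(3)           # symmetric positive definite
b = rng.standard_normal(3)
x = rng.standard_normal(3)            # current point x^(k)
d = rng.standard_normal(3)            # any search direction d^(k)

c = Q @ x - b                         # current gradient c^(k)
alpha = -(c @ d) / (d @ Q @ d)        # Eq. (6.2.31)
c_new = Q @ (x + alpha * d) - b       # gradient at x^(k+1)
print(np.isclose(c_new @ d, 0.0))     # True: condition (6.2.12)
```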
d^(k) = -c^(k) + Σ_{i=0}^{k-1} a_{k,i} d^(i)    (6.2.32)
which expresses d^(k) as a linear combination of the steepest descent direction and the previous
conjugate directions. Taking the scalar product with c^(k) and using the orthogonality of c^(k) to the
previous directions, we get
(d^(k), c^(k)) = -(c^(k), c^(k))    (6.2.33)
Substituting Eq. (6.2.33) into Eq. (6.2.31), the step size becomes
α_k = (c^(k), c^(k)) / (d^(k), Qd^(k))    (6.2.34)
Equation (6.2.30) gives
Qd^(k) / (d^(k), Qd^(k)) = y^(k) / (c^(k), c^(k))    (6.2.36)
Substituting Eq. (6.2.36) into Eq. (6.2.23), we get
d^(k+1) = -c^(k+1) + Σ_{i=0}^{k} [(c^(k+1), y^(i)) / (c^(i), c^(i))] d^(i)
        = -c^(k+1) + Σ_{i=0}^{k} [(c^(k+1), c^(i+1) - c^(i)) / (c^(i), c^(i))] d^(i)    (6.2.37)
The only nonzero term in the sum in Eq. (6.2.37) is the one with i = k, due to the orthogonality
condition in Eq. (6.2.28). Therefore, Eq. (6.2.37) becomes
d^(k+1) = -c^(k+1) + β_k d^(k)    (6.2.38)
where β_k is given as
β_k = (c^(k+1), y^(k)) / (c^(k), c^(k))    (6.2.39)
The expression for β_k given in Eq. (6.2.39) is known as the Polak-Ribière formula.
Several other forms for β_k can be derived. Substituting for y^(k) from Eq. (6.2.35) into Eq. (6.2.39)
and using the fact that (c^(k+1), c^(k)) = 0 when exact line search is used, Fletcher and Reeves
obtained the following formula for β_k:
β_k = (c^(k+1), c^(k+1)) / (c^(k), c^(k))    (6.2.40)
Substituting Eq. (6.2.33) into Eq. (6.2.39), or using Eq. (6.2.31) for the step size α_k in the
foregoing derivation, we obtain
β_k = -(c^(k+1), y^(k)) / (d^(k), c^(k))    (6.2.41)
If the step size formula given in Eq. (6.2.30) is used in the foregoing derivation, the equation for
β_k in Eq. (6.2.39) becomes
β_k = (c^(k+1), y^(k)) / (d^(k), y^(k))    (6.2.42)
This is known as the Hestenes-Stiefel formula. It is important to note that all the above formulas
are completely equivalent for quadratic problems with exact step size.
Using the exact line search termination criterion, the step size is given as
α_k = -(c^(k), d^(k)) / (d^(k), Qd^(k))    (6.2.47)
Using the condition in Eq. (6.2.45), it can be shown that
(d^(k), c^(k)) = -(c^(k), c^(k))    (6.2.48)
so that
α_k = (c^(k), c^(k)) / (d^(k), Qd^(k))    (6.2.49)
Equation (6.2.48) also shows that the direction d^(k) is that of descent for the cost function.
Several expressions for the constant β_k can be derived assuming Q-conjugacy of the d^(j) and
exact line search at each iteration:
1. Hestenes-Stiefel (1952): β_k = (c^(k+1), y^(k)) / (d^(k), y^(k))    (6.2.50)
2. Fletcher-Reeves (1964): β_k = (c^(k+1), c^(k+1)) / (c^(k), c^(k))    (6.2.51)
3. Polak-Ribière (1969): β_k = (c^(k+1), y^(k)) / (c^(k), c^(k))    (6.2.52)
4. Fletcher (1989): β_k = -(c^(k+1), c^(k+1)) / (d^(k), c^(k))    (6.2.53)
It is important to note that all the formulas for β_k are equivalent for quadratic functions
when exact line search is used. It is also important to note that when exact line search is not used,
the search directions given by Eq. (6.2.44) are no longer Q-conjugate. Also, the numerical
performance of the four formulas for β_k given in Eqs. (6.2.50) to (6.2.53) can be quite different,
and convergence to the global minimum in n iterations is no longer guaranteed.
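The equivalence of the four formulas under exact line search on a quadratic is easy to verify numerically. The sketch below takes a single exactly minimized step from an arbitrary starting point (so that the required orthogonality relations hold) and compares the four values of β_k; the data are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)            # symmetric positive definite
b = rng.standard_normal(4)

x = rng.standard_normal(4)
c = Q @ x - b                          # c^(k)
d = -c                                 # steepest descent start
alpha = -(c @ d) / (d @ Q @ d)         # exact line search step
c1 = Q @ (x + alpha * d) - b           # c^(k+1)
y = c1 - c                             # y^(k)

beta_hs = (c1 @ y) / (d @ y)           # Hestenes-Stiefel   (6.2.50)
beta_fr = (c1 @ c1) / (c @ c)          # Fletcher-Reeves    (6.2.51)
beta_pr = (c1 @ y) / (c @ c)           # Polak-Ribiere      (6.2.52)
beta_cd = -(c1 @ c1) / (d @ c)         # conjugate descent  (6.2.53)
print(np.allclose([beta_hs, beta_fr, beta_pr], beta_cd))   # True
```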
6.3 Conjugate Gradient Methods for General Functions
It is argued that, since near a solution point any general nonlinear function f(x) can be
approximated by a quadratic function, we can extend the methods of the previous section by
making the following associations at x^(k):
c^(k) <---> ∇f(x^(k)),   Q <---> H^(k).
For general applications, the step size determination formula given in Eq. (6.2.4) cannot be used
because it would require calculation of the Hessian matrix H^(k) at each iteration, which is an
enormous amount of computation. Therefore, numerical methods such as quadratic or cubic
interpolation need to be used to calculate the α_k that minimizes f(x^(k) + αd^(k)), instead of using
Eq. (6.2.4).
Extension of the conjugate gradient methods for general applications results in behavior
that is different from that for the quadratic case. Several issues need to be discussed before the
conjugate gradient methods can be applied to general unconstrained problems.
1. Step Size Determination. As noted earlier, the step size in Step 3 of Algorithm 6.1
can be calculated using a line search procedure; a sketch of such a method is given after this list.
This does not require calculation of the Hessian matrix. Usually, an inexact line search is used for
efficiency, so the methods need to be discussed taking this aspect into account.
2. Conjugacy of Search Directions. Usually the search directions based on Eq. (6.2.7)
will not satisfy the conjugacy conditions for general problems. Also, termination in a finite number
of iterations cannot be guaranteed.
3. Descent Property of Search Directions. The search directions determined using Eq.
(6.2.7) may not be directions of descent for the cost function. This aspect needs to be studied to
ensure global convergence of the algorithms.
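As referenced in item 1, the following is a minimal sketch of a nonlinear conjugate gradient method with an inexact line search. The backtracking Armijo search is only a simple stand-in for the quadratic or cubic interpolation procedures mentioned above, the restart to -c when the direction fails a descent test is a common safeguard rather than part of Algorithm 6.1, and the function and parameter names are illustrative.

```python
import numpy as np

def backtracking(f, x, fx, c, d, alpha0=1.0, rho=0.5, mu=1e-4):
    """Inexact line search: shrink alpha until the Armijo (sufficient decrease) condition holds."""
    alpha = alpha0
    slope = c @ d                       # directional derivative; negative for a descent direction
    while f(x + alpha * d) > fx + mu * alpha * slope:
        alpha *= rho
    return alpha

def nonlinear_cg(f, grad, x0, variant="PR", tol=1e-6, max_iter=2000):
    """Fletcher-Reeves ('FR') or Polak-Ribiere ('PR') nonlinear conjugate gradient (sketch)."""
    x = x0.copy()
    c = grad(x)
    d = -c
    for _ in range(max_iter):
        if np.linalg.norm(c) < tol:
            break
        if c @ d >= 0.0:                # not a descent direction: restart along -c
            d = -c
        alpha = backtracking(f, x, f(x), c, d)
        x = x + alpha * d
        c_new = grad(x)
        if variant == "FR":
            beta = (c_new @ c_new) / (c @ c)        # Eq. (6.2.51)
        else:
            beta = (c_new @ (c_new - c)) / (c @ c)  # Eq. (6.2.52)
        d = -c_new + beta * d
        c = c_new
    return x

# Example use on the Rosenbrock function.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(nonlinear_cg(f, grad, np.array([-1.2, 1.0])))   # approaches the minimizer [1, 1]
```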
Due to inaccurate line search and lack of conjugacy of the search directions, the results given
in Eqs. (6.2.45), (6.2.46) and (6.2.48) do not hold; i.e.,
(i) (d^(j), c^(k)) ≠ 0, for any j < k
(ii) (c^(j), c^(k)) ≠ 0, for any j < k
(iii) (c^(k), d^(k)) ≠ -(c^(k), c^(k))
In addition, the formulas for β_k given in Eqs. (6.2.50) to (6.2.53) are no longer equivalent, so the
numerical behavior of the methods with the different formulas can be quite different.
Note that even for general functions, the condition in Eq. (6.2.11) holds if exact line
search is used; i.e., (d^(k), c^(k)) = -(c^(k), c^(k)) with exact line search. This can be seen by taking the
scalar product of d^(k+1) in Eq. (6.2.7) with c^(k+1) as follows:
(d^(k+1), c^(k+1)) = -(c^(k+1), c^(k+1)) + β_k (d^(k), c^(k+1))
The second term in the above equation vanishes because (d^(k), c^(k+1)) = 0 is the exact line search
termination criterion. This proves the result in Eq. (6.2.11).
We could use the Hestenes-Stiefel formula for β_k as given in Eq. (6.2.50). The associated
method is called the Hestenes-Stiefel conjugate gradient method. It is called the Fletcher-Reeves
conjugate gradient method if the β_k given in Eq. (6.2.51) is used. The formula for β_k in Eq. (6.2.53)
is called the conjugate descent formula (Fletcher 1989). He argues that if the line search used
satisfies the condition in Eq. (2.2.1), the descent property holds; that is, (d^(k), c^(k)) < 0 if c^(k) ≠ 0. In
the Polak-Ribière conjugate gradient method, the formula for β_k given in Eq. (6.2.52) is used
and α_k is calculated to minimize f(x^(k) + αd^(k)). It has been shown that for strictly convex twice
continuously differentiable functions and exact line search, the directions obtained in this way
satisfy
-(c^(k), d^(k)) ≥ ρ ||c^(k)|| ||d^(k)||    (6.3.1)
where ρ > 0 (Polak 1971). Convergence behavior of these methods near the solution point of a
general nonlinear function is expected to be similar to the pure quadratic case.
The Polak-Ribière and Fletcher-Reeves conjugate gradient methods do not have the descent
property for general problems (this is further explained following Eq. 6.3.4; also see Shanno
1978; Fletcher 1989). For general functions, the various formulas for β_k are no longer equivalent,
even with exact line search. Numerically, the Polak-Ribière formula is far more successful than
the Fletcher-Reeves formula, and when the cost of function and gradient evaluations is small the
Polak-Ribière method is considered the best choice because of its low iteration cost (Nash and
Nocedal 1989). Recent theoretical analysis, however, favors the Fletcher-Reeves formula
(Al-Baali 1985; Powell 1984a, 1986). It can be shown (Powell 1986) that the Fletcher-Reeves
method always satisfies
lim_{k→∞} ||c(x^(k))|| = 0.
With exact line search, all four formulas for β_k in Eqs. (6.2.50) to (6.2.53) have the
descent property, since the directions generated by them satisfy the descent condition
(d^(k), c^(k)) ≤ 0 if c^(k) ≠ 0. Therefore global convergence is guaranteed with exact line search. But
exact line search is not practical because it is inefficient; generally, inexact line search is used in
practice.
With inexact line search, the differences between the four formulas for β_k widen, and the
directions generated by some of the methods are not necessarily those of descent. However,
Eq. (6.2.53) always gives a descent direction without any special requirement, because the descent
condition is then always satisfied.
The Fletcher-Reeves method has descent and global convergence properties with inexact line
search under a loose condition on the line search accuracy parameter σ in Eq. (a); Al-Baali (1985)
has proved a theorem to this effect. A consequence of this descent property is global convergence.
In practice the restriction of σ to (0, 1/2) is not severe, because it is usual to use a fairly accurate
line search with a conjugate gradient method; σ = 0.1 is a good choice in practice.
Powell (1984a, 1986) shows that the Polak-Ribière method does not have such a descent
property with inexact line search. After analyzing the differences between the Fletcher-Reeves and
Polak-Ribière formulas, Powell (1986) suggests a modification of the Polak-Ribière formula,
namely
β_k = max{0, (c^(k+1), y^(k)) / (c^(k), c^(k))}    (f)
He argues that this value of β_k may be more useful than the Fletcher-Reeves and Polak-Ribière
formulas. But this modification does not give the descent property to the Polak-Ribière method.
We can prove, however, that with a slight modification to the Polak-Ribière and Hestenes-Stiefel
formulas, both methods have a descent property.
Theorem 6.3. If α_k is calculated to satisfy Eq. (a) with σ ∈ (0, 1/2], and the β_k calculated
using Eq. (6.2.50) or (6.2.53) satisfies the bound condition
0 ≤ β_k ≤ ||c^(k+1)||^2 / ||c^(k)||^2    (g)
for all k (with c^(k) ≠ 0), then the descent property holds for the Polak-Ribière and Hestenes-Stiefel
methods for all k.
This descent property gives the Polak-Ribière and Hestenes-Stiefel methods global
convergence.
Based on numerical experience and Theorem 6.3, the following modification of β_k is
suggested:
β_k = β_k^P   if 0 ≤ β_k^P ≤ β_k^F
β_k = β_k^F   if β_k^P ≥ β_k^F    (n)
β_k = 0       if β_k^P < 0
where β_k^P is the value calculated by the Polak-Ribière formula and β_k^F is the value calculated by
the Fletcher-Reeves formula.
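In code, the modification (n) amounts to clipping the Polak-Ribière value between zero and the Fletcher-Reeves value; a minimal sketch with illustrative names is given below.

```python
import numpy as np

def beta_clipped(c_new, c, y):
    """Eq. (n): beta_k = max(0, min(beta_PR, beta_FR))."""
    beta_pr = (c_new @ y) / (c @ c)         # Polak-Ribiere,   Eq. (6.2.52)
    beta_fr = (c_new @ c_new) / (c @ c)     # Fletcher-Reeves, Eq. (6.2.51)
    return max(0.0, min(beta_pr, beta_fr))
```

Since β_k^F = ||c^(k+1)||^2 / ||c^(k)||^2 is always nonnegative, this clipping also enforces the bound condition (g) of Theorem 6.3.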
Gilbert and Nocedal (1990) suggest another modification of the Polak-Ribière method,
and prove its global convergence. Their numerical results show a consistent improvement over
the original Polak-Ribière method.
Conjugacy
For nonquadratic functions, it is obvious that the directions generated by any of the conjugate
gradient methods are no longer conjugate, since they do not satisfy the conjugacy condition in
Eq. (6.2.10); there is no constant Hessian matrix for nonquadratic functions. In the Hestenes-Stiefel
conjugate gradient method, d^(k+1) is a "conjugate" direction to d^(k) in the sense that
(d^(k+1), y^(k)) = 0
with inexact as well as exact line search; however, this "conjugacy" exists only between two
successive iterations.
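This two-step "conjugacy" is purely algebraic and is easy to confirm numerically; the check below uses arbitrary vectors in place of actual iterates, since no line search condition is needed.

```python
import numpy as np

rng = np.random.default_rng(4)
c_new, c, d = rng.standard_normal((3, 6))   # stand-ins for c^(k+1), c^(k), d^(k)
y = c_new - c                               # y^(k)
beta_hs = (c_new @ y) / (d @ y)             # Hestenes-Stiefel formula (6.2.50)
d_new = -c_new + beta_hs * d                # d^(k+1)
print(np.isclose(d_new @ y, 0.0))           # True by construction
```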