
BIT 19 (1979), 356-367

ACCELERATED GAUSS-NEWTON ALGORITHMS
FOR NONLINEAR LEAST SQUARES PROBLEMS

AXEL RUHE

Abstract.
Recent theoretical and practical investigations have shown that the Gauss-Newton
algorithm is the method of choice for the numerical solution of nonlinear least squares
parameter estimation problems. It is shown that when line searches are included, the
Gauss-Newton algorithm behaves asymptotically like steepest descent, for a special choice
of parameterization. Based on this, a conjugate gradient acceleration is developed. It
converges fast also for those large residual problems where the original Gauss-Newton
algorithm has a slow rate of convergence. Several numerical test examples are reported,
verifying the applicability of the theory.
AMS MOS classification:
Primary: 65K05
Secondary: 62F10

1. Introduction.
A wide class of parameter estimation problems requires for its numerical
treatment the solution of nonlinear least squares problems. In the simplest
unconstrained case such a problem is formulated as:

(1.1)  Minimize $\varphi(x) = \tfrac{1}{2} f(x)^T f(x)$, $\quad x \in \mathbb{R}^n$, $f \in \mathbb{R}^m$, $m > n$.

Let us assume that $f$ is twice continuously differentiable; we can then get the
gradient and Hessian of $\varphi$ as

(1.2)  $\varphi' = J^T f$,

(1.3)  $\varphi'' = J^T J + \sum_{i=1}^{m} f_i G_i$,

where

$J(x) = \left[\dfrac{\partial f_i}{\partial x_j}\right]$, the $m \times n$ Jacobian matrix,

$G_i(x) = \left[\dfrac{\partial^2 f_i}{\partial x_j \partial x_k}\right]$, $n \times n$ symmetric matrices.
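To make these quantities concrete, here is a small numpy sketch (not from the paper; the exponential model, the synthetic data and all names are illustrative assumptions) that evaluates $f$, the Jacobian $J$, the gradient $J^T f$ and the Gauss-Newton part $J^T J$ of (1.3) for a two-parameter fitting problem.

```python
import numpy as np

# Hypothetical example: fit y ~ x1 * exp(x2 * t) to data (t_i, y_i), m = 16 > n = 2.
t = np.linspace(0.0, 1.0, 16)
y = 2.0 * np.exp(-1.3 * t) + 0.05 * np.cos(7.0 * t)   # synthetic data with a residual

def f(x):
    """Residual vector f(x) of (1.1), length m."""
    return x[0] * np.exp(x[1] * t) - y

def J(x):
    """m x n Jacobian, J_ij = df_i/dx_j as in (1.2)-(1.3)."""
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

x = np.array([1.0, -1.0])
phi     = 0.5 * f(x) @ f(x)      # phi(x) = 1/2 f^T f
grad    = J(x).T @ f(x)          # phi'(x) = J^T f, cf. (1.2)
gn_part = J(x).T @ J(x)          # Gauss-Newton part of phi'' in (1.3)
```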

Received August 25, 1978. Revised May 29, 1979.



The most natural algorithm to use when solving (1.1) is the Gauss-Newton
algorithm. It can be regarded either as a linearization of (1.1) in each step, or as
Newton-Raphson minimization simplified by deleting the second term of $\varphi''$
in (1.3). It is formulated as

ALGORITHM 1. Gauss-Newton with line search.

1. Start with $x_0$ chosen appropriately.
2. For $s = 0, 1, \ldots$ until convergence:
   2.1 $d_s := -J_s^{+} f_s$, search direction.
   2.2 Determine $\alpha_s$ as the solution to
       (1.4)  $\min_{\alpha} \|f(x_s + \alpha d_s)\|^2$.
   2.3 $x_{s+1} := x_s + \alpha_s d_s$.
   2.4 If $\|J_s d_s\| <$ tolerance then convergence.
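A minimal sketch of Algorithm 1 in Python/numpy might look as follows; it is a reconstruction for illustration only, not the author's program. The exact minimization (1.4) is approximated here by scipy.optimize.minimize_scalar, and the callables f and J returning the residual vector and Jacobian are assumed to be supplied by the user.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gauss_newton(f, J, x0, tol=1e-6, max_iter=100):
    """Sketch of Algorithm 1: Gauss-Newton with line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fs, Js = f(x), J(x)
        # 2.1  d_s := -J_s^+ f_s  (least squares solve instead of explicit pseudoinverse)
        d = -np.linalg.lstsq(Js, fs, rcond=None)[0]
        # 2.4  convergence test
        if np.linalg.norm(Js @ d) < tol:
            break
        # 2.2  alpha_s := argmin_a ||f(x_s + a d_s)||^2, cf. (1.4)
        alpha = minimize_scalar(lambda a: np.sum(f(x + a * d) ** 2),
                                bracket=(0.0, 1.0)).x
        # 2.3  update the iterate
        x = x + alpha * d
    return x
```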

The convergence behavior of this algorithm without the line search (1.4) has
been studied by Wedin [11], and Ramsin and Wedin [9] have further reported a
sequence of tests indicating the viability of the Gauss-Newton algorithm. It is also
the algorithm of choice in standard treatises on parameter estimation such as Bard's
[3].
The purpose of the present contribution is first to investigate the effect of the
line searches (1.4). When the residual $\|f\|_2$ is large the optimal $\alpha$ differs
considerably from one, and even algorithms without line searches usually contain
some step length reduction (see [9]). We will show in section 2 that Gauss-Newton
with line searches behaves asymptotically like steepest descent for a
special choice of coordinates. It must be stressed that the Gauss-Newton
algorithm is invariant under a wide class of coordinate transformations, while
steepest descent is not. Among these transformations there is one class, the standard
coordinates, in which steepest descent is close to Gauss-Newton. We cannot
transform into standard coordinates in advance, since they presuppose
knowledge of $\hat{x}$, the solution, $\hat{J}$, and $\hat{G}_i$. The importance of this analysis lies in the
fact that it explains how Gauss-Newton behaves on large residual problems,
especially the occurrence of a caging effect [8], and the dependence of the rate of
convergence on the condition of the standard Hessian

(1.5)  $H = I - \gamma K$,

where

(1.6)  $\gamma = \|\hat{f}\|_2$, the residual,

(1.7)  $K = \hat{J}^{+T} \hat{G}_w \hat{J}^{+}$, the normal curvature matrix, and

(1.8)  $\hat{G}_w = \sum_{i=1}^{m} w_i \hat{G}_i$, $\quad w = -\hat{f}/\|\hat{f}\|_2$, the unit normal.
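Assuming the quantities at the solution are available as a matrix Jhat, a vector fhat and a list Ghat of second derivative matrices (all names are assumptions for illustration), the standard Hessian (1.5) could be formed by a literal transcription of (1.5)-(1.8); this is a sketch, not code from the paper.

```python
import numpy as np

def standard_hessian(Jhat, fhat, Ghat):
    """Form H = I - gamma*K of (1.5)-(1.8).
    Assumes Jhat is m x n, fhat has length m, and Ghat is a list of
    m symmetric n x n matrices, all evaluated at the solution."""
    gamma = np.linalg.norm(fhat)                  # (1.6) residual
    w = -fhat / gamma                             # (1.8) unit normal
    Gw = sum(wi * Gi for wi, Gi in zip(w, Ghat))  # (1.8) weighted curvature sum
    Jplus = np.linalg.pinv(Jhat)                  # n x m pseudoinverse
    K = Jplus.T @ Gw @ Jplus                      # (1.7) normal curvature matrix;
                                                  # its nonzero eigenvalues are the kappa_k
    return np.eye(K.shape[0]) - gamma * K         # (1.5); relevant eigenvalues 1 - gamma*kappa_k
```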

There is a close correspondence between our results and those in [9]. One difference is
that we get convergence whenever a strong local minimum exists, that is, when $H$
is positive definite, while the algorithm without line searches converges only for
$\lambda_{\max}(H) < 2$.
The second main part of this contribution is the application of a conjugate
gradient acceleration to Gauss-Newton with line searches. It can be done in a
rather straightforward manner, resembling that of a preconditioned linear system
[1]. We also show which factors determine the n-step quadratic convergence of the
accelerated algorithm. We have chosen to show directly how an approximate
factorization of $H$ is computed, using the theory for the linear systems case (see
[6], pp. 139-141). Though our algorithm does not directly fit into the framework
given by Stoer [2], it is believed that similar results can be obtained, and it may be
of interest to apply other n-step quadratically convergent algorithms to
accelerating Gauss-Newton. Developments along similar lines have been
announced by Dennis [4].
Finally, in section 4 we report a series of numerical tests. We have tried our
algorithms on the test problems used by Ramsin and Wedin [9]. The tests
illustrate the behavior of our algorithms, and give some clues as to which algorithm
to choose in a practical case.
Without further explanation we will use $\hat{x}$ for the optimal point, and $\hat{f}$, $\hat{J}$ etc. for
the corresponding quantities evaluated at $\hat{x}$. Likewise, $x_s$, $f_s$, and $J_s$ refer to the
current iterate. A matrix is built up from its columns as $A_s = [a_0, \ldots, a_{s-1}]$, where $s$
is the number of columns in $A_s$. $A_{ss}$ is the leading $s \times s$ submatrix of $A$. We denote
by $P_{R(A)}$ and $P_{N(A)}$ the orthogonal projectors onto the range and null space of $A$,
respectively.

2. Asymptotic behavior of the Gauss-Newton algorithm with line search.


In this section we develop how the analysis of Wedin [9], [11] has to be
modified in order to take a line search (1.4) into account.
We start by showing that the search direction of Gauss-Newton is invariant
under linear transformations of the independent variables and orthogonal
transformations of the dependent variables. Then we show that for one special
transformation, the search direction of Gauss-Newton is asymptotically close to
that of steepest descent. We can then use the theory of the convergence of steepest
descent to explain the behavior of Gauss-Newton. The bounds obtained for the
rate of convergence of the algorithm with line searches are always at least as good
as those for the one without; moreover, the method with searches converges
whenever it is started close enough to a local minimum.

THEOREM 2.1. If we transform the coordinates in $\mathbb{R}^n$ and $\mathbb{R}^m$ by

(2.1)  $x = Sx'$, $S$ a nonsingular constant matrix,

(2.2)  $y = Uy'$, $U$ an orthogonal constant matrix,

the search direction of the Gauss-Newton algorithm $d = J^{+} f$ is transformed as
$d = Sd'$. The curvatures of the $f$ surface are left invariant.

PROOF. It is easy to see that the Jacobian and Hessian (1.2), (1.3) are
transformed as

(2.3)  $J = UJ'S^{-1}$, $\quad G_i = S^{-T} \sum_k U_{ik} G'_k S^{-1}$,

and thus that $d = J^{+} f = SJ'^{+} U^T f = SJ'^{+} f' = Sd'$.

The curvature matrix (1.7) is then transformed as $K = UK'U^T$, $w = Uw'$,
leaving its eigenvalues, the curvatures, invariant. ∎

Let us now introduce the standard coordinate systems in which steepest descent
gives search directions close to Gauss-Newton:

DEFINITION 2.2. Among all linear transformations (2.1) of the independent and
orthogonal transformations (2.2) of the dependent variables, we define a standard
coordinate system as one in which

(2.4)  $\hat{f} = \gamma e_{n+1}$, $\quad \hat{J} = \begin{bmatrix} I \\ 0 \end{bmatrix}$, $\quad \hat{G}_{n+1} = -\operatorname{diag}(\kappa_k)$.

THEOREM 2.3. We can find a standard coordinate system if and only if $\hat{J}$ has full
column rank. If furthermore $\gamma \neq 0$, and all curvatures are different, it is unique,
except for an orthogonal transformation of the $m-n-1$ last dependent variables.

PROOF. To find $U$, the new basis of the dependent variables, we diagonalize the
symmetric curvature matrix (1.7),

$\hat{J}^{+T} \hat{G}_w \hat{J}^{+} = U_n \operatorname{diag}(\kappa_k) U_n^T$,

where $U_n$ has orthonormal columns which span $R(\hat{J})$. We can then extend $U$
with $m-n$ columns from $N(\hat{J}^T)$, for example $u_{n+1} = \hat{f}/\gamma$, which is in $N(\hat{J}^T)$
since $\hat{x}$ is the optimal point, and the rest arbitrary.
We then find the basis $S$ as $S = \hat{J}^{+} U_n$, which yields the second equality of
(2.4). ∎

THEOREM 2.4. Using a standard coordinate system, the search directions of
steepest descent and Gauss-Newton are asymptotically close, and

(2.5)  $\|J^T f - J^{+} f\|_2 \leq \sigma \|x - \hat{x}\|_2 \|J^T f\|_2$.

Here and afterwards we let $\sigma$ denote a constant that depends only on $f$ and its
first three derivatives, and assume that $\|x - \hat{x}\|$ is so small that the first term
dominates the remainder of the corresponding Taylor expansions.

PROOF. Assume $\hat{x} = 0$. In a standard coordinate system,

(2.6)  $f(x) = \gamma e_{n+1} + \hat{J}x + \tfrac{1}{2}\left[x^T \hat{G}_i x\right] + r$, $\quad \|r\| \leq \sigma \|x\|^3$,

       $J(x) = \begin{bmatrix} I \\ 0 \\ 0 \end{bmatrix} + \left[x^T \hat{G}_i\right] + R = \begin{bmatrix} I + F_1 \\ F_2 \\ F_3 \end{bmatrix}$, $\quad \|R\| \leq \sigma \|x\|^2$,

where $F_1$ is $n \times n$, $F_2$ is $1 \times n$ and $F_3$ is $(m-n-1) \times n$. Omitting the argument $x$ we find immediately

(2.7)  $J^T = [I + F_1^T,\ F_2^T,\ F_3^T]$.

To get a similar expression for $J^{+}$ we use Wedin's decomposition theorem ([10],
theorem 2.1), which states that if $B = A + T$, then

(2.8)  $B^{+} = A^{+} - B^{+} T A^{+} + (B^T B)^{+} T^T P_{N(A^T)} + P_{N(B)} T^T (A A^T)^{+}$.

In our case

$A = \begin{bmatrix} I \\ 0 \\ 0 \end{bmatrix}$, $\quad T = F$, where $F = \begin{bmatrix} F_1 \\ F_2 \\ F_3 \end{bmatrix}$.

If $\|F\| < 1$, we are sure to have $N(B) = 0$, so we can omit the last term. Moreover
we see that $T^T P_{N(A^T)} = [0,\ F_2^T,\ F_3^T]$ and

$T A^{+} = \begin{bmatrix} F_1 & 0 \\ F_2 & 0 \\ F_3 & 0 \end{bmatrix}$.

Omitting higher order terms in $\|F\|$ we get $B^{+} \approx [I, 0, 0]$ and $(B^T B)^{+} \approx I$, which
inserted into (2.8) gives us

(2.9)  $J^{+} = [I, 0, 0] + [-F_1, 0, 0] + [0,\ F_2^T,\ F_3^T] + R = [I - F_1,\ F_2^T,\ F_3^T] + R$, $\quad \|R\| \leq \sigma \|F\|^2$.

We note the similarity to (2.7).


Applying the decomposition once more, with (2.9) inserted for $B^{+}$ and $(B^T B)^{+}$,
we get

$J^{+} = [I - F_1 + F_1^2 - F_2^T F_2 - F_3^T F_3,\ (I - F_{11}) F_2^T,\ (I - F_{11}) F_3^T] + R$,
$\quad \|R\| \leq \sigma \|F\|^3$, $\quad F_{11} = F_1 + F_1^T$.

Noting that $\|f\| = O(\|x\|)$ and

$f(x) = \gamma e_{n+1} + \hat{J}x + \tfrac{1}{2} F x + r$, $\quad \|r\| \leq \sigma \|x\|^3$,

we can now expand the difference

(2.10)  $J^T f - J^{+} f = F_{11}\{x + \gamma F_2^T\} + r$, $\quad \|r\| \leq \sigma \|x\|^3$.

The fact that $J^T f = \{x + \gamma F_2^T\} + r$, $\|r\| \leq \sigma \|x\|^2$, now proves the statement of the
theorem. It might be noted that

$J^{+} f = \{x + \gamma F_2^T\} - F_1 x - \gamma F_{11} F_2^T + O(\|x\|^3)$,

(2.11)  $J^{+} f = Hx + r$, $\quad \|r\| \leq \sigma \|x\|^2$,

where $H$ is the Hessian (1.5). Namely, we also have $\varphi(x) = \hat{\varphi} + \tfrac{1}{2} x^T H x + O(\|x\|^3)$, giving

$\varphi'(x) = J^T f = Hx + O(\|x\|^2)$. $H$ is diagonal in standard coordinates,

(2.12)  $H = \operatorname{diag}(1 - \gamma \kappa_k)$.
It must be stressed that a standard coordinate system cannot be found unless
one already knows the solution of the problem. The equivalence of Gauss-Newton
and steepest descent is only a theoretical tool, which explains how an
actual computation with the Gauss-Newton algorithm behaves. After some time
the iterates get confined to a 2-dimensional subspace corresponding to the
extreme eigenvalues of the Hessian (2.12), i.e. the directions of (algebraically)
largest and smallest curvature. We can apply the theorem on the rate of
convergence of steepest descent (see [8], or preferably the elucidating exposition
in [5]), to get:

THEOREM 2.5. The rate of convergence of Gauss-Newton with step length chosen
optimally is bounded by

(2.13)  $\dfrac{\|x_{s+1} - \hat{x}\|_J}{\|x_s - \hat{x}\|_J} \leq \dfrac{\kappa(H) - 1}{\kappa(H) + 1} = 1 - \dfrac{2}{\kappa(H) + 1} = \dfrac{\gamma(\kappa_1 - \kappa_n)}{2 - \gamma(\kappa_1 + \kappa_n)}$,

(2.14)  $\kappa(H) = \dfrac{1 - \gamma \kappa_n}{1 - \gamma \kappa_1}$, the condition number of the Hessian.

PROOF. See the standard proof for steepest descent in [8], pp. 148-154. Note
that we have to use the norm

$\|e_s\|_J := (e_s^T \hat{J}^T \hat{J} e_s)^{1/2}$,

since we transform to a standard coordinate system. ∎
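As a rough numerical illustration of (2.13)-(2.14) (the numbers below are made up, not taken from the tests in section 4), a residual $\gamma = .8$ with extreme curvatures $\kappa_1 = 1$ and $\kappa_n = -1$ gives $\kappa(H) = 9$ and an asymptotic contraction factor of $0.8$ per step:

```python
# Illustrative evaluation of the bound (2.13)-(2.14); gamma and the extreme
# curvatures kappa_1, kappa_n are invented numbers, not taken from the paper.
gamma, k1, kn = 0.8, 1.0, -1.0                      # residual and extreme normal curvatures
cond_H = (1.0 - gamma * kn) / (1.0 - gamma * k1)    # (2.14): (1 + 0.8)/(1 - 0.8) = 9
rate   = (cond_H - 1.0) / (cond_H + 1.0)            # (2.13): 8/10 = 0.8 per step
print(cond_H, rate)
```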



Compare with the corresponding expressions for the Gauss-Newton method
without line searches given in [9], (2.17) and (2.18). There the rate is bounded by

$\|x_{s+1} - \hat{x}\|_J / \|x_s - \hat{x}\|_J \leq \gamma \max(\kappa_1, -\kappa_n)$,

which is equal to (2.13) when $\kappa_n = -\kappa_1$, but larger in all other cases. Moreover, we
now get asymptotic convergence whenever $1 - \gamma\kappa_1 > 0$, i.e. whenever $\varphi$ has a local
minimum at $\hat{x}$.
Let us finally give an explanation of the caging behavior that has been observed
when Gauss-Newton is applied to large residual problems.

THEOREM 2.6. Ultimately the errors $x_s - \hat{x}$ alternate between two directions in
the subspace spanned by the extreme eigenvectors of the Hessian (2.12), and
moreover the $J_s d_s = -P_{J_s} f_s$ tend to be mutually orthogonal in successive iterations.

PROOF. Simple translation of the corresponding facts for steepest descent in a
standard coordinate system, the theorem of Akaike; see [5] or [8]. ∎

3. Conjugate gradient acceleration.


Now that we know which norm to use, namely that corresponding to a
standard coordinate system, it is easy to formulate a conjugate gradient
acceleration of the Gauss-Newton method. Let us first state the algorithm, and
then show how it obtains an approximation to the Hessian; see e.g. [1] or [7].
Our situation is quite similar to that of a preconditioned linear system.

ALGORITHM 2. Gauss-Newton with conjugate gradient acceleration.

1. Start with $x_0$ chosen appropriately.
2. For $s = 0, 1, \ldots$ until convergence:
   2.1 $d_s := J_s^{+} f_s$, Gauss-Newton step or gradient.
   2.2 $\delta_s := \|J_s d_s\|_2^2 = \|P_{J_s} f_s\|_2^2$, length of step.
   2.3 $\beta_s :=$ if $s \equiv 0 \pmod{n}$ then $0$ else $\delta_s/\delta_{s-1}$, restart or continuation.
   2.4 $p_s := -d_s + \beta_s p_{s-1}$, search direction.
   2.5 Determine $\alpha_s$ as the solution to $\min_{\alpha} \|f(x_s + \alpha p_s)\|^2$, line search.
   2.6 $x_{s+1} := x_s + \alpha_s p_s$, update solution vector.
   2.7 If $\sqrt{\delta_s} <$ tolerance then convergence.
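A hedged numpy sketch of Algorithm 2, again only an illustration of the recursion above and not the author's APL program, could read as follows; as in the sketch of Algorithm 1, the exact line search is approximated by scipy.optimize.minimize_scalar.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gauss_newton_cg(f, J, x0, tol=1e-6, max_iter=100):
    """Sketch of Algorithm 2: Gauss-Newton with conjugate gradient acceleration."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    p, delta_prev = None, None
    for s in range(max_iter):
        fs, Js = f(x), J(x)
        d = np.linalg.lstsq(Js, fs, rcond=None)[0]            # 2.1  d_s = J_s^+ f_s
        delta = np.sum((Js @ d) ** 2)                         # 2.2  delta_s = ||J_s d_s||^2
        if np.sqrt(delta) < tol:                              # 2.7  convergence test
            break
        beta = 0.0 if s % n == 0 else delta / delta_prev      # 2.3  restart every n steps
        p = -d if beta == 0.0 else -d + beta * p              # 2.4  search direction
        alpha = minimize_scalar(lambda a: np.sum(f(x + a * p) ** 2),
                                bracket=(0.0, 1.0)).x         # 2.5  line search
        x = x + alpha * p                                     # 2.6  update
        delta_prev = delta
    return x
```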

We call the n steps that are performed between two restarts (step 2.3) one sweep
of the algorithm. We will prove the quadratic convergence by first showing that
one sweep, started close enough to the solution, computes an approximate
tridiagonal factorization of the Hessian (1.5). Then we show that the increment
during one sweep is an approximate solution to a linear system with $H$ as its
matrix, and finally that the solution of such a system is second order close to the
minimum.

LEMMA 3.1. Assume that we start one sweep of the algorithm close enough to the
solution. Then

(3.1)  $HP - PT = F$, $\quad \|F\| \leq \sigma \|x\| \|P\|$,

(3.2)  $P^T H P = \Delta + R$, $\quad \|R\| \leq \sigma \|x\|^2 \|P\|$,

where

$H$ is the limiting Hessian,
$P = [p_0, \ldots, p_{n-1}]$, the search directions,
(3.3)  $T$ is a tridiagonal matrix, and
$\Delta = \operatorname{diag}(\delta_0/\alpha_0, \ldots, \delta_{n-1}/\alpha_{n-1})$.

We omit the rather tedious proof, which can be found in [12]. The $\alpha_s$ are
determined by the line search in step 2.5 of the algorithm. It is sufficient to have an
asymptotically perfect line search, to use the terminology of Stoer [2], i.e.

(3.4)  $|p_s^T J_{s+1}^T f_{s+1}| \leq \xi \, |p_s^T J_s^T f_s| \, \|J_s^T f_s\|_2$

should hold for a suitable constant $\xi$.
We also quote from [12]:

LEMMA 3.2. Assume that the starting point $x_0$ is close enough to the solution $\hat{x}$, and that
$H$ is of bounded condition in the same region. Then

(3.5)  $H(x_n - x_0) = -J_0^T f_0 + r$, $\quad \|r\| \leq \sigma \|x_0 - \hat{x}\|^2$.

These two lemmas lead us to

THEOREM 3.3. If $x_0$ is close enough to the solution $\hat{x}$, then

(3.6)  $\|x_n - \hat{x}\| \leq \sigma \|x_0 - \hat{x}\|^2$.

PROOF. We have $\varphi(x) = \hat{\varphi} + \tfrac{1}{2}(x - \hat{x})^T H (x - \hat{x}) + r$.

By Lemma 3.2, $H(x_n - x_0) = -d_0 + r$, $\|r\| \leq \sigma \|x_0 - \hat{x}\|^2$, and (2.11) gives
$d_0 = H(x_0 - \hat{x}) + r$, $\|r\| \leq \sigma \|x_0 - \hat{x}\|^2$. Consequently $x_n = \hat{x} + r$, $\|r\| \leq \sigma \|x_0 - \hat{x}\|^2$,
provided that $\|H^{-1}\|$ is bounded. ∎

4. Numerical tests.
We have performed a series of tests of the algorithms discussed here. We have
used an IBM 5100 computer, and written the programs in APL. The machine has
a relative accuracy of about 15 decimal places.
We have used the same test problems as Ramsin and Wedin [9], and we refer
the reader to that paper for details. Problems are generated with varying residual,
condition of the Jacobian, and curvatures of the $f$ surface. Furthermore, the global
behavior is varied by choosing the $G$ matrices (1.3) with large or small norm. We
have changed their set-up in one important respect: in [9] the largest principal
curvature $\kappa_1 = 1$, while the others are chosen at random in $[-1, 1]$. In contrast, we
let the minimum $\kappa_n$ be a parameter, and choose the remaining $n-2$ curvatures
at random in $[\kappa_n, 1]$. The reason is that we have found that both $\kappa_1$ and
$\kappa_n$ are of importance for the behavior of the algorithms, especially in the large
residual cases, when Gauss-Newton without line search does not converge. For
instance, $\kappa_n = 0$ corresponds to a convex $f$ surface. The results listed here are for
small problems ($n = 4$, $m = 16$), but tests on larger ones have not shown any
significant new effects.
We have compared three algorithms. Algorithm 0 is Gauss-Newton without
line search as described in [9]. It is the same as Algorithm 1 above, but $\alpha_s$ is set to
the largest of the numbers $2^{-k}$, $k = 0, 1, 2, \ldots$, which satisfies

(4.1)  $\varphi(x_s) - \varphi(x_s + \alpha d_s) \geq 0.25\,\alpha \|P_{J_s} f_s\|^2$.

This is essentially the Armijo-Goldstein criterion, and assures convergence of
the algorithm towards a local minimum. Each iteration of Algorithm 0 needs one
evaluation of $f$ and one of $J$, provided that no step length reduction is necessary.
Algorithm 1 is Gauss-Newton with line search, as described in sections 1 and 2
of this paper. We have used a very simple quadratic interpolation procedure to
determine $\alpha_s$. Along the line $x_s + \alpha d_s$ we fit $\varphi(x)$ by a polynomial which
interpolates $\varphi$ and $\varphi'$ at $x_s$, and $\varphi$ at $x_s + d_s$. We then get

(4.2)  $\alpha_s = \dfrac{0.5}{1 - (\varphi(x_s + d_s) - \varphi(x_s))/\varphi'(x_s)}$.
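In code, the step length (4.2) is a one-line formula once $\varphi(x_s)$, the directional derivative $\varphi'(x_s)$ along $d_s$, and $\varphi(x_s + d_s)$ have been evaluated; the following sketch uses assumed argument names.

```python
def quadratic_step(phi0, dphi0, phi1):
    """Step length of (4.2): minimizer of the quadratic interpolating
    phi(0) = phi0, phi'(0) = dphi0 and phi(1) = phi1 along x_s + alpha*d_s.
    (A sketch; the argument names are assumptions.)"""
    return 0.5 / (1.0 - (phi1 - phi0) / dphi0)

# Hypothetical usage with phi(alpha) = 1/2 ||f(x_s + alpha*d_s)||^2:
#   phi0  = 0.5 * f(x) @ f(x)
#   dphi0 = d @ (J(x).T @ f(x))          # directional derivative at alpha = 0
#   phi1  = 0.5 * f(x + d) @ f(x + d)
#   alpha = quadratic_step(phi0, dphi0, phi1)
```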

This search is asymptotically perfect (3.4), as required for the analysis in sections 2
and 3 to hold. We added a step halving procedure (4.1) to assure global
convergence, but it was never activated during our tests. Each iteration of
Algorithm 1 will thus need two evaluations of $f$ and one of $J$.
Algorithm 2, as stated in section 3, uses a line search performed according to (4.2)
and (4.1) (with $d_s$ replaced by $p_s$). It also needs two evaluations of $f$ and one of $J$
for each iteration.
We have used the same stopping criterion in all our tests, namely

(4.3)  $\|P_{J_s} f_s\|_2 <$ tolerance,

with tolerance $= 10^{-6}$. This is the stopping criterion recommended in [9] and
[11]. It must be taken into account that the tests in [9] were performed with
another stopping criterion, based on tests on $\|x_s - \hat{x}\|$ and $\|f_s\| - \|\hat{f}\|$. That was
necessary because [9] involved comparisons to programs where (4.3) was not
readily available.
A direct comparison with the results in [9] is provided by the runs with $\kappa_n =
-1$. We have recorded the number of iterations needed to satisfy (4.3). We noted
that the difference between Algorithms 0 and 1 was small. Algorithm 2 was indeed
faster for large residual problems, and about as fast for those with a smaller
residual. If we compare with the results reported in [9], we note that our
accelerated Algorithm 2 is better than the quasi-Newton algorithm tried there.
However, it might be remarked that if the number of function evaluations is the only
decisive factor, Algorithm 0 will be superior up to comparably large values of
the residual $\gamma$, since it needs only one function evaluation per step.
In the cases where we have a convex surface ($\kappa_n = 0$) the accelerated algorithm
really comes to advantage, and we give some results in Fig. 4.1. The right half of

Fig. 4.1. Number of iterations for different residuals $\gamma$.
Globally simple problem (MATRIX 3). Convex surface ($\kappa_n = 0$). $\kappa(\hat{J}) = 100$.
∇ = Algorithm 0, GN without line search
△ = Algorithm 1, GN with line search
□ = Algorithm 2, c-g acceleration

this figure should be compared to Fig. 7.1 of [9]. The number of iterations is
virtually insensitive to the size of the residual. The larger number needed for $\gamma = .9$
and $.99$ is due to the fact that the problem of determining the closest $\hat{f}$ is ill
conditioned in those cases. Large negative $\gamma$, on the other hand, gives well
conditioned problems. It is important to distinguish between the sensitivity of the
closest point $\hat{f}$, depending on $1/(1 - \gamma\kappa_1)$, and the sensitivity of $\hat{x}$ when $\hat{f}$ is
determined, depending on the condition number of the Jacobian $\hat{J}$.
A more detailed picture of what happens when the three algorithms are run is
given in Fig. 4.2. There we list $\|x_s - \hat{x}\|$, $s = 0, 1, 2, \ldots$ for a representative example.
Note that this globally simple problem behaves precisely as the theory predicts;
Algorithm 0 converges linearly, Algorithm 1 exhibits a caging behavior, while
Algorithm 2 is n-cyclic with a faster convergence. However, we did not get a very

Fig. 4.2. Convergence behavior of the algorithms. Error $\|x_s - \hat{x}\|$ for a globally simple problem, convex
surface, $\kappa(\hat{J}) = 100$, residual $\gamma = .8$.

clear-cut picture of the quadratic convergence of Algorithm 2. The globally
difficult problem (MATRIX 1) showed a similar picture, but there more iterations
were needed before the fast convergence of c-g started. However, detailed
experiments indicated that c-g was never slower, not even relatively far from a
minimum.
We admit that the step-length algorithm (4.2) is primitive. When searching in the
steepest direction (Algorithm 1, and after restarts in Algorithm 2) it gave $\xi$-values
(3.4) of around 10, while in other directions (between restarts in Algorithm 2) they
could be several orders of magnitude larger. We therefore tried a more accurate
line search, corresponding to $\xi < 1$ in all cases, but despite more function
evaluations in each step, we seldom noted significantly fewer iterations.
To summarize these tests, we conclude that c-g acceleration never increases the
number of iterations needed, and in large residual cases really speeds up
convergence. When line search is used we therefore recommend the use also of c-g
acceleration, since it only amounts to a negligible portion of extra work. However,
when the residual is known to be small, as is the case in most well-formulated
approximation problems, line search is only a waste of time, and the simplest
Gauss-Newton algorithm without line search is recommended.

ACKNOWLEDGEMENTS. This work was performed while the author was visiting the
University of California, San Diego, where he enjoyed the kind hospitality of
J. M. Bunch and had numerous penetrating discussions with W. B. Gragg. I am
also grateful to P. Å. Wedin, who introduced me to the subject and constantly
gave me guidelines to the necessary underlying theories.

Financial support has been obtained from the Swedish Natural Science
Research Council, grant F-3471, and the U.S. Public Health Service, grant
HL-17731.

REFERENCES
1. O. Axelsson, Solution of linear systems of equations: iterative methods, pp. 1-51 of Sparse Matrix
   Techniques, ed. by V. A. Barker, Lecture Notes in Mathematics 572, Springer-Verlag, Berlin -
   Heidelberg - New York 1977.
2. P. Baptist and J. Stoer, On the relation between quadratic termination and convergence properties of
   minimization algorithms, Part I: Theory, Num. Math. 28, 343-366 (1977); Part II: Applications,
   Num. Math. 28, 367-391 (1977).
3. Y. Bard, Nonlinear parameter estimation, Academic Press, New York 1974.
4. J. E. Dennis, Some computational techniques for the nonlinear least squares problem, pp. 157-183 in
   Numerical solution of systems of nonlinear algebraic equations, edited by G. D. Byrne and C. A.
   Hall, Academic Press, New York 1973.
5. G. E. Forsythe, On the asymptotic directions of the s-dimensional optimum gradient method, Num.
   Math. 11, 57-76 (1968).
6. A. S. Householder, The theory of matrices in numerical analysis, Blaisdell, New York 1964.
7. D. R. Kincaid and D. M. Young, Survey of iterative methods, to appear in Encyclopedia of
   Computer Science and Technology, Marcel Dekker, New York.
8. D. G. Luenberger, Introduction to linear and nonlinear programming, Addison-Wesley, Reading,
   Mass. 1973.
9. H. Ramsin and P. Å. Wedin, A comparison of some algorithms for the non-linear least squares
   problem, BIT 17, 72-90 (1977).
10. P. Å. Wedin, Perturbation theory for pseudoinverses, BIT 13, 217-232 (1973).
11. P. Å. Wedin, On the Gauss-Newton method for the non-linear least squares problem, ITM working
    paper 24, Inst. f. Tillämpad Matematik, Stockholm 1974.
12. A. Ruhe, Accelerated Gauss-Newton algorithms for nonlinear least squares problems, Tech. Rep.
    UMINF-67.78, Dept. of Information Processing, Umeå 1978.

UMEÅ UNIVERSITET
DEPT. OF INFORMATION PROCESSING
S-901 87 UMEÅ, SWEDEN
