BIT 19 (1979), 356-367

ACCELERATED GAUSS-NEWTON ALGORITHMS FOR NONLINEAR LEAST SQUARES PROBLEMS
AXEL RUHE
Abstract.
Recent theoretical and practical investigations have shown that the Gauss-Newton algorithm is the method of choice for the numerical solution of nonlinear least squares parameter estimation problems. It is shown that when line searches are included, the Gauss-Newton algorithm behaves asymptotically like steepest descent, for a special choice of parameterization. Based on this, a conjugate gradient acceleration is developed. It converges fast also for those large residual problems where the original Gauss-Newton algorithm has a slow rate of convergence. Several numerical test examples are reported, verifying the applicability of the theory.
AMS MOS classification:
Primary: 65K05
Secondary: 62F10
1. Introduction.
A wide class of parameter estimation problems requires for its numerical treatment the solution of nonlinear least squares problems. In the simplest unconstrained case such a problem is formulated as:

(1.1)    minimize φ(x) = ½ f(x)^T f(x) ,    x ∈ R^n ,  f : R^n → R^m ,  m ≥ n .
Let us assume that f is twice continuously differentiable; we can then get the gradient and Hessian of φ as

(1.2)    φ′ = J^T f ,
(1.3)    φ″ = J^T J + Σ_{i=1}^m f_i G_i ,

where

J = (∂f_i/∂x_j) ,  the m × n Jacobian matrix, and
G_i = (∂²f_i/∂x_j ∂x_k) ,  the n × n Hessian of the component f_i .
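As a small numerical illustration of (1.1)-(1.3), the NumPy sketch below evaluates φ, the Jacobian J and the gradient J^T f for a hypothetical two-parameter exponential model and checks the gradient formula (1.2) against central differences. The model, the data and all names are illustrative only and are not taken from the paper.

```python
import numpy as np

# Hypothetical model: predict y_i by x[0]*exp(x[1]*t_i); residuals f_i = model_i - y_i.
t = np.linspace(0.0, 1.0, 8)
y = 2.0 * np.exp(-1.0 * t) + 0.1              # made-up data giving a nonzero residual

def f(x):                                     # residual vector, R^2 -> R^8
    return x[0] * np.exp(x[1] * t) - y

def jac(x):                                   # m x n Jacobian, J_ij = df_i/dx_j
    return np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])

def phi(x):                                   # phi(x) = 1/2 f(x)^T f(x), the objective (1.1)
    r = f(x)
    return 0.5 * r @ r

x = np.array([1.5, -0.8])
grad = jac(x).T @ f(x)                        # phi'(x) = J^T f, formula (1.2)
fd = np.array([(phi(x + 1e-6 * e) - phi(x - 1e-6 * e)) / 2e-6 for e in np.eye(2)])
print(grad, fd)                               # the two gradients should agree closely
```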
The most natural algorithm to use when solving (1.1) is the Gauss-Newton algorithm. It can be regarded either as a linearization of (1.1) in each step, or as the Newton-Raphson minimization simplified by deleting the second term of φ″ (1.3). Starting from a point x_0, it repeats for s = 0, 1, 2, ... :

    d_s := −J_s^+ f_s ,  the solution of min_d ‖J_s d + f_s‖_2 ;
(1.4)    α_s := argmin_{α>0} φ(x_s + α d_s) ,  the line search;
    x_{s+1} := x_s + α_s d_s ;
    if ‖J_s d_s‖ < tolerance then convergence.
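A minimal sketch of this iteration in NumPy/SciPy is given below. The linear least squares subproblem for d_s is solved with lstsq (equivalent to d_s = −J_s^+ f_s), and the line search (1.4) is approximated by a bounded scalar minimization; the tolerance, the search interval and the helper names are illustrative choices, not specifications from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gauss_newton(f, jac, x0, tol=1e-10, maxit=100):
    """Gauss-Newton iteration with a one-dimensional line search, cf. (1.4)."""
    x = np.asarray(x0, dtype=float)
    for s in range(maxit):
        fs, Js = f(x), jac(x)
        d, *_ = np.linalg.lstsq(Js, -fs, rcond=None)          # d_s = -J_s^+ f_s
        if np.linalg.norm(Js @ d) < tol:                       # convergence test on ||J_s d_s||
            break
        phi = lambda a: 0.5 * np.sum(f(x + a * d) ** 2)
        a = minimize_scalar(phi, bounds=(0.0, 2.0), method='bounded').x   # approximates (1.4)
        x = x + a * d
    return x

# usage with the hypothetical exponential model sketched above:
# x_hat = gauss_newton(f, jac, np.array([1.5, -0.8]))
```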
The convergence behavior of this algorithm without the line search (1.4) has been studied by Wedin [11], and Ramsin and Wedin [9] have further reported a sequence of tests indicating the viability of the Gauss-Newton algorithm. It is also the algorithm of choice in standard treatises on parameter estimation such as Bard's [3].
It is the purpose of the present contribution first to investigate the effect of the line searches (1.4). When the residual ‖f‖_2 is large, the optimal α differs considerably from one, and even algorithms without line searches tend to contain some step length reduction (see [9]). We will show in section 2 that Gauss-Newton with line searches behaves asymptotically like steepest descent for a special choice of coordinates. It must be stressed that the Gauss-Newton algorithm is invariant under a wide choice of coordinate transformations, while steepest descent is not. Among these transformations there is one class, the standard coordinates, in which steepest descent is close to Gauss-Newton. We cannot transform into standard coordinates in advance, since they presuppose knowledge of x̂, the solution, as well as of Ĵ and the Ĝ_i. The importance of this analysis lies in the fact that it explains how Gauss-Newton behaves on large residual problems, especially the occurrence of a caging effect [8], and the dependence of the rate of convergence on the condition of the standard Hessian
(1.5)    H = I − γK ,

where

(1.6)    γ = ‖f(x̂)‖_2 ,  the residual,
(1.7)    K = Ĵ^{+T} Ĝ_w Ĵ^+ ,  the curvature matrix, with Ĝ_w = Σ_i w_i Ĝ_i and w the unit normal.
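With the definitions above, the numbers 1 − γκ_i are the eigenvalues of the full Hessian φ″ relative to the Gauss-Newton matrix Ĵ^T Ĵ, so they, and the condition number of H, can be computed directly from f̂, Ĵ and the Ĝ_i without constructing the standard coordinates. The NumPy/SciPy sketch below does this; the function name and the use of a generalized symmetric eigensolver are illustrative choices.

```python
import numpy as np
from scipy.linalg import eigh

def standard_hessian_spectrum(f_hat, J_hat, G_hat):
    """Return gamma, the products gamma*kappa_i, and the condition number of H (1.5).

    f_hat : residual vector at the solution, shape (m,)
    J_hat : Jacobian at the solution, shape (m, n), full column rank
    G_hat : list of the m component Hessians G_i, each of shape (n, n)
    """
    gamma = np.linalg.norm(f_hat)                          # the residual (1.6)
    second = sum(fi * Gi for fi, Gi in zip(f_hat, G_hat))  # sum_i f_i G_i = gamma * G_w
    A = J_hat.T @ J_hat                                    # Gauss-Newton part of the Hessian
    # Eigenvalues mu of the pencil (A + second, A) equal 1 - gamma*kappa_i, cf. (1.5).
    mu = eigh(A + second, A, eigvals_only=True)
    return gamma, np.sort(1.0 - mu)[::-1], mu.max() / mu.min()
```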
2. Gauss-Newton and steepest descent.

as those for the one without; moreover, the method with searches converges whenever it is started close enough to a local minimum.

PROOF. It is easy to see that the Jacobian and Hessian (1.2), (1.3) are transformed as
Let us now introduce the standard coordinate systems, in which steepest descent gives search directions close to Gauss-Newton:
DEFINITION 2.2. Among all linear transformations (2.1) of the independent and orthogonal transformations (2.2) of the dependent variables, we define a standard coordinate system as one in which
THEOREM 2.3. We can find a standard coordinate system if and only if J has full column rank. If, furthermore, γ ≠ 0 and all curvatures are different, it is unique, except for an orthogonal transformation of the m − n − 1 last dependent variables.
PROOF. To find U, the new basis of the dependent variables, we diagonalize the symmetric curvature matrix (1.7),

    Ĵ^{+T} Ĝ_w Ĵ^+ = U_1 diag(κ_i) U_1^T .
In the standard coordinates the Jacobian at a point x close to the solution can be expanded as

    J(x) = [I; 0; 0] + Σ_r x_r Ĝ_r + R = [I + F_1; F_2; F_3] ,    ‖R‖ ≤ σ‖x‖² ,

where the blocks are stacked vertically and the F_k collect the first order terms. To obtain J(x)^+ we use the pseudoinverse perturbation formula of Wedin [10],

(2.8)    B^+ = A^+ − B^+ T A^+ + (B^T B)^+ T^T P_{N(A^T)} + P_{N(B)} T^T (A A^T)^+ ,    T = B − A .

In our case A = [I; 0; 0] and T = F = [F_1; F_2; F_3]. If ‖F‖ < 1, we are sure to have N(B) = {0}, so we can omit the last term. Moreover we see that T^T P_{N(A^T)} = [0, F_2^T, F_3^T] and

    T A^+ = [F_1, 0; F_2, 0; F_3, 0] .

Omitting higher order terms in ‖F‖ we get B^+ ≈ [I, 0, 0] and (B^T B)^+ ≈ I, which inserted into (2.8) gives us

(2.9)    J^+ = [I, 0, 0] + [−F_1, 0, 0] + [0, F_2^T, F_3^T] + R ,

and, to second order,

    J^+ = [I − F_1 + F_1² − F_2^T F_2 − F_3^T F_3 , (I − F_1) F_2^T , (I − F_1) F_3^T] + R ,    ‖R‖ ≤ σ‖F‖³ .
Noting that ‖F‖ = O(‖x‖), we find that in the standard coordinates the Hessian (1.5) takes the diagonal form

(2.12)    H = diag(1 − γκ_k) .
It must be stressed that a standard coordinate system cannot be found unless one already knows the solution of the problem. The equivalence of Gauss-Newton and steepest descent is only a theoretical tool, which explains how an actual computation with the Gauss-Newton algorithm behaves. After some time the iterates get confined to a 2-dimensional subspace corresponding to the extreme eigenvalues of the Hessian (2.12), i.e. the directions of (algebraically) largest and smallest curvature. We can apply the theorem on the rate of convergence of steepest descent (see [8], or preferably the elucidating exposition in [5]) to get:
THEOREM 2.5. The rate of convergence of Gauss-Newton with step length chosen optimally is bounded by

(2.13)    ‖x_{s+1} − x̂‖_J / ‖x_s − x̂‖_J  ≤  (κ(H) − 1)/(κ(H) + 1)  =  1 − 2/(κ(H) + 1)  =  γ(κ_1 − κ_n) / (2 − γ(κ_1 + κ_n)) ,

where

(2.14)    κ(H) = (1 − γκ_n)/(1 − γκ_1) ,  the condition number of the Hessian.
PROOF. See the standard proof for steepest descent in [8], pp. 148-154. Note that we have to use the norm

    ‖e_s‖_J := (e_s^T Ĵ^T Ĵ e_s)^{1/2} .
This bound should be compared with the asymptotic rate of Gauss-Newton without line search [11], which is equal to (2.13) when κ_n = −κ_1, but larger in all other cases. Moreover, we now get asymptotic convergence whenever 1 − γκ_1 > 0, i.e. whenever φ has a local minimum at x̂.
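For example, with the values used in the tests of section 4 (κ_1 = 1, a convex surface with κ_n = 0, and residual γ = 0.8), (2.14) gives κ(H) = 1/(1 − 0.8) = 5, and (2.13) bounds the asymptotic error reduction per step, in the norm ‖·‖_J, by (5 − 1)/(5 + 1) = 0.8/(2 − 0.8) ≈ 0.67.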
Let us finally give an explanation of the caging behavior that has been observed when Gauss-Newton is applied to large residual problems.
THEOREM 2.6. Ultimately the errors x_s − x̂ will alternate between two directions in the subspace spanned by the extreme eigenvectors of the problem φ″(x̂)x = λ Ĵ^T Ĵ x, and moreover the J_s d_s = −P_{J_s} f_s (P_J denoting the orthogonal projection onto the range of J) tend to be mutually orthogonal in successive iterations.
3. The conjugate gradient acceleration.

We call the n steps that are performed between two restarts (Step 2.3) one sweep of the algorithm. We will prove the quadratic convergence by first showing that one sweep, started close enough to the solution, computes an approximate tridiagonal factorization of the Hessian (1.5). Then we show that the increment during one sweep is an approximate solution to a linear system with H as its matrix, and last that the solution of such a system is second order close to the minimum.
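The statement of the accelerated algorithm itself (with its numbered steps, referred to in the text as steps 2.3 and 2.5) is not reproduced in this extract. The NumPy/SciPy sketch below shows one plausible realization of the idea: Gauss-Newton directions are combined by a conjugate gradient recurrence, with a restart every n steps and a line search along the combined direction. The β formula (a preconditioned Fletcher-Reeves choice), the bounded search interval and all names are assumptions made for illustration, not the author's Algorithm 2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def accelerated_gauss_newton(f, jac, x0, tol=1e-10, maxsweeps=50):
    """Conjugate gradient accelerated Gauss-Newton; a restart every n steps ends one sweep."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    p, gd_old = None, None
    for s in range(maxsweeps * n):
        fs, Js = f(x), jac(x)
        g = Js.T @ fs                                      # gradient J^T f, cf. (1.2)
        d, *_ = np.linalg.lstsq(Js, -fs, rcond=None)       # Gauss-Newton direction
        if np.linalg.norm(Js @ d) < tol:
            break
        gd = g @ d                                         # g^T d = -g^T (J^T J)^+ g, used for the beta ratio
        if s % n == 0 or p is None:                        # restart: begin a new sweep
            p = d
        else:
            p = d + (gd / gd_old) * p                      # assumed Fletcher-Reeves-type update
        phi = lambda a: 0.5 * np.sum(f(x + a * p) ** 2)
        a = minimize_scalar(phi, bounds=(0.0, 2.0), method='bounded').x   # line search along p
        x = x + a * p
        gd_old = gd
    return x
```

The bounded scalar search used here only crudely approximates the asymptotically perfect line search (3.4) assumed in the analysis below.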
LEMMA 3.1. Assume that we start one sweep of the algorithm close enough to the solution. Then the quantities computed during the sweep satisfy the relations (3.1)-(3.3), in which

H is the limiting Hessian,
P = [p_0, ..., p_{n−1}] collects the search directions,
T is a tridiagonal matrix, and
Δ = diag(β_0/α_0, ..., β_{n−1}/α_{n−1}) .
We omit the rather tedious proof, which can be found in [12]. The α_s are determined by the line search in step 2.5 of the algorithm. It is sufficient to have an asymptotically perfect line search, to use the terminology of Stoer [2], i.e.
(3.4)    |p_s^T J_{s+1}^T f_{s+1}|  ≤  σ |p_s^T J_s^T f_s| ‖J_s^T f_s‖_2

should hold for a suitable constant σ.
We also quote from [12]:

LEMMA 3.2. Assume that the starting point x_0 is close enough to the solution x̂, and that H is of bounded condition in the same region. Then
4. Numerical tests.
We have performed a series of tests of the algorithms discussed here. We have
used an IBM 5100 computer, and written the programs in APL. The machine has
a relative accuracy of about 15 decimal places.
We have used the same test problems as Ramsin and Wedin [9], and we refer the reader to that paper for details. Problems are generated with varying residual, condition of the Jacobian, and curvatures of the f surface. Furthermore, the global behavior is varied by choosing the G matrices (1.3) with large or small norm. We have changed their set-up in one important respect: in [9] the largest principal curvature κ_1 = 1, while the others are chosen at random in [−1, 1]. In contrast, we let the minimum κ_n be a parameter, and choose the remaining n − 2 curvatures at random in [κ_n, 1]. This is because we have found that both κ_1 and κ_n are of importance for the behavior of the algorithms, especially in the large residual cases, when Gauss-Newton without line search does not converge. For instance, κ_n = 0 corresponds to a convex f surface. The results listed here are for small problems (n = 4, m = 16), but tests on larger ones have not shown any significant new effects.
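The construction of the test problems follows [9] and is not reproduced here. As a minimal stand-in, the sketch below builds a small family of problems (with m = n + 1 rather than the (n, m) = (4, 16) used in the tests) for which the solution is x̂ = 0, the residual is γ, and the Hessian at the solution is exactly diag(1 − γκ_i), so the parameters γ, κ_1 and κ_n discussed above can be prescribed directly. It is an illustrative construction, not the generator of [9].

```python
import numpy as np

def make_test_problem(kappa, gamma):
    """Toy nonlinear least squares problem with prescribed curvatures kappa_i and residual gamma.

    f(x) = (x_1, ..., x_n, gamma - 1/2 * sum_i kappa_i * x_i**2);  the solution is x_hat = 0,
    the residual there is |gamma|, and the Hessian of phi at x_hat is diag(1 - gamma*kappa_i).
    """
    kappa = np.asarray(kappa, dtype=float)

    def f(x):
        return np.append(x, gamma - 0.5 * np.dot(kappa, x * x))

    def jac(x):
        return np.vstack([np.eye(len(kappa)), -(kappa * x)[None, :]])

    return f, jac

# example: n = 4, kappa_1 = 1, convex surface (kappa_n = 0), residual gamma = 0.8
f, jac = make_test_problem(kappa=[1.0, 0.6, 0.3, 0.0], gamma=0.8)
```

Such a problem can be fed directly to the gauss_newton and accelerated_gauss_newton sketches given earlier.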
We have compared three algorithms. Algorithm 0 is Gauss-Newton without line search as described in [9]. It is the same as Algorithm 1 above, except that α_s is set to the largest of the numbers 2^{-k}, k = 0, 1, 2, ..., which satisfies (4.1). Algorithm 1 determines the step length α_s by the line search (4.2). This search is asymptotically perfect (3.4), as required for the analysis in sections 2 and 3 to hold. We added the step halving procedure (4.1) to assure global convergence, but it was never activated during our tests. Each iteration of Algorithm 1 will thus need two evaluations of f and one of J.

Algorithm 2, as stated in section 3, uses a line search performed according to (4.2) and (4.1) (with d_s replaced by p_s). It also needs two evaluations of f and one of J for each iteration.
We have used the same stopping criterion in all our tests. Algorithm 2 turned out to be faster for large residual problems, and about as fast as the others for those with a smaller residual. If we compare with the results reported in [9], we note that our accelerated algorithm 2 is better than the quasi-Newton algorithm tried there. However, it might be remarked that if the number of function evaluations alone is the decisive factor, algorithm 0 will be superior up to comparatively large values of the residual [7], since it needs only one function evaluation per step.
In the cases where we have a convex surface (κ_n = 0) the accelerated algorithm really shows its advantage, and we give some results in Fig. 4.1; the right half of this figure should be compared to Fig. 7.1 of [9].

Fig. 4.1. Number of iterations plotted against the residual.

The number of iterations is
virtually insensitive to the size of the residual. The larger number needed for γ = .9 and .99 depends on the fact that the problem of determining the closest f̂ is ill-conditioned in those cases. Large negative γ, on the other hand, gives well-conditioned problems. It is important to distinguish between the sensitivity of the closest point f̂, depending on 1/(1 − γκ_1), and the sensitivity of x̂ when f̂ is determined, depending on the condition number of the Jacobian J.
A more detailed picture of what happens when the three algorithms are run is given in Fig. 4.2, where we list ‖x_s − x̂‖, s = 0, 1, 2, ..., for a representative example. Note that this globally simple problem behaves precisely as the theory predicts: algorithm 0 converges linearly, algorithm 1 exhibits a caging behavior, while algorithm 2 is n-cyclic with a faster convergence. However, we did not get a very
Fig. 4.2. Convergence behavior of the algorithms. Error ‖x_s − x̂‖ for a globally simple problem, convex surface, κ(J) = 100, residual γ = .8.
ACKNOWLEDGEMENTS. This work was performed while the author was visiting the University of California, San Diego, and enjoyed the kind hospitality of J. M. Bunch, and had numerous penetrating discussions with W. B. Gragg. I am also grateful to P. Å. Wedin, who introduced me to the subject and constantly gave me guidelines to the necessary underlying theories.

Financial support has been obtained from the Swedish Natural Science Research Council, grant F-3471, and the U.S. Public Health Service, grant HL-17731.
REFERENCES
1. O. Axelsson, Solution of linear systems of equations: iterative methods, pp. 1-51 in Sparse Matrix Techniques, ed. V. A. Barker, Lecture Notes in Mathematics 572, Springer-Verlag, Berlin - Heidelberg - New York, 1977.
2. P. Baptist and J. Stoer, On the relation between quadratic termination and convergence properties of minimization algorithms, Part I: Theory, Num. Math. 28, 343-366 (1977); Part II: Applications, Num. Math. 28, 367-391 (1977).
3. Y. Bard, Nonlinear parameter estimation, Academic Press, New York, 1974.
4. J. E. Dennis, Some computational techniques for the nonlinear least squares problem, pp. 157-183 in Numerical solution of systems of nonlinear algebraic equations, ed. by G. D. Byrne and C. A. Hall, Academic Press, New York, 1973.
5. G. E. Forsythe, On the asymptotic directions of the s-dimensional optimum gradient method, Num. Math. 11, 57-76 (1968).
6. A. S. Householder, The theory of matrices in numerical analysis, Blaisdell, New York, 1964.
7. D. R. Kincaid and D. M. Young, Survey of iterative methods, to appear in Encyclopedia of Computer Science and Technology, Marcel Dekker, New York.
8. D. G. Luenberger, Introduction to linear and nonlinear programming, Addison-Wesley, Reading, Mass., 1973.
9. H. Ramsin and P. Å. Wedin, A comparison of some algorithms for the non-linear least squares problem, BIT 17, 72-90 (1977).
10. P. Å. Wedin, Perturbation theory for pseudoinverses, BIT 13, 217-232 (1973).
11. P. Å. Wedin, On the Gauss-Newton method for the non-linear least squares problem, ITM working paper 24, Inst. f. Tillämpad Matematik, Stockholm, 1974.
12. A. Ruhe, Accelerated Gauss-Newton algorithms for nonlinear least squares problems, Tech. Rep. UMINF-67.78, Dept. of Information Processing, Umeå, 1978.

UMEÅ UNIVERSITET
DEPT. OF INFORMATION PROCESSING
S-901 87 UMEÅ, SWEDEN