Robust Nonlinear Regression Using The Dogleg Algorithm

Roy E. Welsch
Richard A. Becker*

March 1975
1. INTRODUCTION

In recent years the concepts of robust estimation have led to a rethinking of the ways we fit models to data. Papers by Beaton and Tukey [1974] and Andrews [1974] have proposed algorithms for robust linear regression using iteratively reweighted least-squares. This technique has proved to be quite successful and has considerable intuitive appeal because of its connection to weighted least-squares regression.

2. THE PROBLEM

Our problem is the minimization (with respect to θ) of

    F_s(θ) = Σ_{i=1}^n ρ( (y_i - f_i(θ)) / s )                                (2.1)

where ρ(·) is assumed to satisfy ρ(t) ≥ ρ(u) if |t| ≥ |u| and is often viewed as a loss function (which, in general, need not be independent of i or symmetric).

For nonlinear least-squares (ρ(t) = t²) we have always faced the problem of specifying starting values. For robust loss functions such as

    ρ_c(t) = t²/2           if |t| ≤ c
           = c|t| - c²/2    if |t| > c                                        (2.2)

we must, in addition, specify the scale s.

3. WHAT CAN THE AVERAGE MAN DO?

We have found that many people have access to some form of general nonlinear optimization program and/or a special routine for nonlinear least-squares. Most of the researchers interested in robust fitting are not interested in extensively modifying these programs or writing new ones. So we discuss first some approaches to robust nonlinear regression that allow the use of existing programs.

If we assume that a general nonlinear optimization routine is available then it seems reasonable to try to estimate the scale, s, by making it a part of the optimization problem,

    min_{θ,s}  Σ_{i=1}^n ρ( (y_i - f_i(θ)) / s ) + S_1 log s                  (3.1)

in direct analogy to the related maximum likelihood problem. It is also immediately clear that this idea will fail for robust loss functions which are bounded (such as (2.3)) because s will be forced to 0. (There is no proper maximum likelihood model in this case.) However, for loss functions of the form (2.2) this is not true and (3.1) is viable.

The constant S_1 can be chosen in a variety of ways, one of which is the following. Differentiating (3.1) with respect to s and setting it equal to 0 we obtain

    S_1 = Σ_{i=1}^n ρ'( r_i(θ)/s ) ( r_i(θ)/s )                               (3.2)

where r_i(θ) = y_i - f_i(θ). If the residuals were Gaussian then we might try to choose S_1 so that s would be asymptotically unbiased, giving

    S_1 = (n-p) ∫ t ρ'(t) dΦ(t)

where Φ(t) denotes the standardized Gaussian distribution function.

Huber and Dutter [1974] have suggested a related idea. They propose replacing (3.1) by

    min_{θ,s}  Σ_{i=1}^n ρ( (y_i - f_i(θ)) / s ) s + S_2 s                    (3.3)

where

    S_2 = (n-p) ∫ [ t ρ'(t) - ρ(t) ] dΦ(t).

If ρ is the Huber type, (2.2), then t ρ'(t) - ρ(t) = (ρ'(t))²/2 and the normal equation for s reduces to the scale estimate proposed by Huber [1964, 1973]. This idea also fails for bounded loss functions.

We have tried (3.1) and (3.3) using a general optimization algorithm to be described later and found that both work about equally well. Both can be implemented very quickly.

A third possibility is to add to (2.1) a penalty term in s with coefficients B_1 and B_2. In effect, we have added the negative log likelihood of an inverted gamma prior distribution for the scale parameter. We must, of course, specify B_1 and B_2. If there is prior information about the scale then B_1 and B_2 would be taken from that. Otherwise we would choose B_1 = S_1 and B_2 by penalty function considerations such as those discussed by Bard [1974, p. 145]. Our experience with this method is limited and not wholly satisfactory.

The above three methods provide ways for a person with a general nonlinear optimizer to simply put in an objective function different from the one for least-squares and use his program as is. We do not advise doing this blindly but it generally works (especially the first two methods). It has drawbacks. It is expensive because another parameter (s) must be estimated (with algorithms of order p²) and the objective function is more complicated and numerically less pleasing. When choice is available, it does not seem reasonable to pay so much for the privilege of iteratively modifying s.

There is another obvious approach. That is simply to find an s^(0) (see section 7) and then minimize (2.1) with s^(0) held fixed. At the end of this computation a new s, say s^(1), is computed based on the current values of θ, then left fixed until a new local minimum is found, etc., until s^(k) changes less than, say, ten percent from s^(k-1). This is simple, but often proved to be as expensive as the earlier approaches. It does, however, lead us to consider ways to handle iterative scaling without making use of the objective function.
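A minimal sketch of this alternating approach (not the authors' code) is given below: the scale is fixed, (2.1) is minimized with a general optimizer, and s is then recomputed from the new residuals until it changes by less than ten percent. The Huber loss of type (2.2), the default constant c = 1.345, the MAD-type starting scale (anticipating sections 7 and 8), and the function names are illustrative choices, with scipy standing in for the general routine.

    import numpy as np
    from scipy.optimize import minimize

    def huber_rho(t, c=1.345):
        """Loss of type (2.2): quadratic for |t| <= c, linear beyond."""
        a = np.abs(t)
        return np.where(a <= c, 0.5 * t**2, c * a - 0.5 * c**2)

    def fit_alternating(y, f, theta0, c=1.345, tol=0.10, max_outer=20):
        """Alternate a fixed-scale minimization of (2.1) with a scale update,
        stopping when s changes by less than `tol` (e.g. ten percent)."""
        theta = np.asarray(theta0, dtype=float)
        s = np.median(np.abs(y - f(theta))) / 0.6745   # crude starting scale
        for _ in range(max_outer):
            obj = lambda th: np.sum(huber_rho((y - f(th)) / s, c))
            theta = minimize(obj, theta, method="Nelder-Mead").x  # any general optimizer
            s_new = np.median(np.abs(y - f(theta))) / 0.6745
            if abs(s_new - s) <= tol * s:
                return theta, s_new
            s = s_new
        return theta, s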
4. NONLINEAR REWEIGHTED LEAST-SQUARES

For those with special nonlinear least-squares algorithms available it is natural to attempt to adapt the iteratively reweighted least-squares idea mentioned earlier to the nonlinear problem. We now discuss this method in more detail.

The gradient and Hessian of F_s(θ) are

    g_s(θ) = -(1/s) Σ_{i=1}^n ρ'( (y_i - f_i(θ))/s ) ∇f_i(θ)                  (4.1)

    H_s(θ) = -(1/s) Σ_{i=1}^n ρ'( (y_i - f_i(θ))/s ) ∇²f_i(θ)
             + (1/s²) J^T(θ) W J(θ)                                           (4.2)

where J(θ) = [∂f_i(θ)/∂θ_j] is the Jacobian matrix, W, with w_ii = w( (y_i - f_i(θ))/s ), is an n×n diagonal matrix, and ρ''(t) is approximated by w(t). If we have starting values θ^(0) and s = s^(0) then the first step of reweighted least-squares is

    θ^(1) = θ^(0) - H_s^{-1}(θ^(0)) g_s(θ^(0))

which in the linear case is

    θ^(1) = θ^(0) + (X^T W X)^{-1} X^T W (y - Xθ^(0)).

The whole procedure can, of course, be iterated.

Except for the fact that ρ''(t) has been approximated by w(t) this is just the first Newton step for the solution of (2.1) in the linear case. Using w(t) makes H positive semi-definite and makes the analogy to weighted least-squares obvious.

A word of caution is in order. Even if X^T X is well conditioned, X^T W X may be very ill-conditioned and the first Newton step a very poor one. (This can happen when there are low weights on observations which contain all of the information about a parameter or information about how to separate the effects of two carriers.) Even if X^T W X is well behaved at a local minimum, a bad start can lead to poor results. Often this problem is ignored in the linear case because of the availability of robust, scale invariant procedures, such as least absolute residuals [Barrodale and Roberts (1973)], to provide starting values. All of the literature about Newton-Raphson methods applies to this problem - such methods are only reasonable in a neighborhood of a local minimum. Good algorithms for robust regression should contain some diagnostic studies of the data matrix to determine potential high leverage observations. Varying the robustness parameter c can also be very useful. Ridge regression techniques could also be employed.

In the nonlinear case we do not have such good techniques for finding starting values and the first term of the Hessian does not vanish. But most nonlinear least-squares routines ignore the first term of the Hessian and use a technique like that of Marquardt (1963) to overcome the difficulties of Gauss-Newton steps away from the local minimum. Once in a region where the residuals are, hopefully, small the first term of the Hessian can be more safely ignored. Some work has been done on the large residual least-squares problem, e.g. Dennis [1973]. Robust loss functions help to reduce the size of the first term of the Hessian because |ρ'(t)| < |t| for large residuals and, with (2.3), ρ'(t) is eventually zero.

With the same caveats we have always had in using Gauss-Marquardt nonlinear least-squares routines for very nonlinear problems, it is reasonable to propose that (2.1) be attacked by finding starting values, forming the weights, and solving the least-squares problem with √w_i y_i and √w_i f_i(θ) as data and model, making the obvious change if there is a weighted nonlinear least-squares routine available.

We now ask - should we modify w and s as we go along, and if so, how? Since starting values in nonlinear problems are generally not good, we feel that w^(0) and s^(0) are crude and will need iteration. It is not at all clear how these changes at each step will interact with a specialized algorithm like that of Marquardt. Using this routine with its special start-up procedures to do each step after computing new weights will not be very successful. Direct intervention in the algorithm is required; it cannot just be called each time. Clearly such modifications can be made, but we note that Chambers [1973, p. 7] indicates that such iterative procedures may be inferior to general optimization methods.

5. CREATING SPECIALIZED ALGORITHMS

If one is willing to intervene more directly in an optimization algorithm then some special things can be done to accommodate reweighting and scaling. We shall discuss our efforts in the context of a particular algorithm.

In the past year, work at the NBER Computer Research Center has created a need for nonlinear optimization in such diverse areas as full information maximum likelihood estimation, probit analysis, and projection pursuit [Friedman and Tukey (1974)]. The first algorithm implemented was DOGLEGF, developed by Chien and Dennis at Cornell. This algorithm only requires information about the function F and is closely related to the MINFA algorithm of Powell [1970] which, however, requires the gradient as well as F. DOGLEGF was installed in the NBER TROLL system as a function and is not easily modified except by experienced programmers.

In the TROLL system we also had a symbolic differentiator and a proposed way to automatically compile F, g, and H into very efficient code suitable for repeated evaluation. We also have a macro language that allows a user to glue together various TROLL commands and functions in a way that makes it easy to experiment with new algorithms. With these ideas in mind one of us (RAB), in consultation with John Dennis, developed the DOGLEGX algorithm and macro. Since this algorithm formed the basis for further research (by REW) on robust nonlinear regression, we will describe it in detail.

DOGLEGX utilizes a combination of steepest-descent and Newton steps in the process of minimizing a function. As long as gradient steps are relatively large, they are used. However, since gradient steps tend to perform poorly in valleys, Newton steps are also used. Newton steps, however, are of doubtful worth when taken from a point far removed from the minimum. Hence, the algorithm uses a bound on the maximum step size and provides a compromise "dogleg" step which combines the gradient and Newton steps.

As input DOGLEGX requires only the function (all derivatives are computed symbolically), starting values θ^(0), an initial radius (R) to provide an upper bound on step size (the default is zero, which makes the first step a gradient step), the maximum number of iterations, and convergence tolerances for the gradient and relative coefficient change.
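A minimal sketch of the step-selection rule just described (not the TROLL macro) follows; g, H, and R are as in the text, H is assumed to have already been made positive definite, and the closed-form root used to locate the dogleg point on the segment is a standard detail the text leaves implicit.

    import numpy as np

    def dogleg_step(g, H, R):
        """Choose a trial step from gradient g, positive definite Hessian H,
        and step bound R, in the spirit of the scheme described above."""
        # Gradient (Cauchy) step: minimizes the quadratic model along -g.
        gHg = g @ H @ g
        delta_g = -(g @ g) / gHg * g
        if np.linalg.norm(delta_g) >= R:
            return -R * g / np.linalg.norm(g)           # bounded gradient step
        # Newton step.
        delta_n = -np.linalg.solve(H, g)
        if np.linalg.norm(delta_n) <= R:
            return delta_n                              # full Newton step
        # Dogleg step: point on the segment from delta_g to delta_n at distance R.
        d = delta_n - delta_g
        a, b, c = d @ d, 2 * delta_g @ d, delta_g @ delta_g - R**2
        t = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)  # root lies in (0, 1)
        return delta_g + t * d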
Initially the expressions for F, g, and H are evaluated. H(θ^(0)) is then forced to be positive definite by the use of a Greenstadt modification. This procedure is carried out whenever the second derivative matrix is reevaluated.

At the beginning of each iteration, there is a test for convergence using both the gradient and the relative change in θ from the previous iteration. The exact details will not be provided here.

Assuming that there was no convergence, the algorithm investigates a step in the direction of the gradient vector. Define

    A_k(δ) = F(θ^(k)) + ( g(θ^(k)), δ ) + (1/2) ( δ, H(θ^(k)) δ )

where (a,b) denotes an inner product. The function A_k(δ) is then a quadratic approximation to F(θ^(k) + δ) based on the gradient vector and the Hessian. Powell [1970] shows that A_k is minimized along the gradient direction by a step of length

    |δ_G| = ||g(θ^(k))||³ / ( g(θ^(k)), H(θ^(k)) g(θ^(k)) ).

At this point, the step-bound limitation is checked. If |δ_G| ≥ R then a step in the gradient direction of length R will be tried. Let δ_N represent the Newton step. If |δ_G| < R and |δ_N| ≤ R the Newton step is attempted. If |δ_G| < R and |δ_N| > R a "dogleg" step δ_D is attempted. The dogleg step δ_D is defined as the point on the line connecting δ_G (the gradient step) and δ_N which is at a distance R from θ^(k).

At this point let δ^(k) represent the step that the algorithm decided to take (gradient, Newton, or dogleg). If F(θ^(k) + δ^(k)) < F(θ^(k)), the step is accepted and we set θ^(k+1) = θ^(k) + δ^(k); otherwise we stay at θ^(k), halve the radius, R, and start a new iteration.

One of the most powerful features of DOGLEGX involves revision of the step bound R. If the step is accepted, a test of the approximation A_k(δ) is performed. If the predicted reduction, measured by F(θ^(k)) - A_k(δ^(k)), is more than ten times the actual reduction, F(θ^(k)) - F(θ^(k) + δ^(k)), the radius is halved and a new iteration begins.

If this test is passed we perform further checks to decide if the step bound should be increased. In order to do this we look at the scalar product

    S(λ) = ( g(θ^(k) + λδ^(k)), δ^(k) ).

The term θ^(k) + λδ^(k) defines a line from θ^(k) in the direction δ^(k). S(λ) measures the expected change in the objective function starting at the point θ^(k) + λδ^(k) and taking a step δ^(k). We would like this change to be negative, decreasing the value of the objective function.

At this point we compute g(θ^(k) + δ^(k)) so that we have S(0) and S(1) available. If we assume S(λ) is linear, these two points define a line and we let λ* be the point where S(λ) = 0, i.e.

    λ* = -S(0) / ( S(1) - S(0) ).

If the slope of this line is negative, then λ* < 0. If the slope is positive, λ* > 0. When λ* < 0 or λ* ≥ 2, a step of twice the length of δ^(k) would still have decreased the value of the objective function if S(λ) really were linear. In these cases R is doubled.

If 0 ≤ λ* < 2, one more check is performed. The predicted gradient at θ^(k) + δ^(k) is compared with the actual gradient g(θ^(k) + δ^(k)), and if

    || g(θ^(k)+δ^(k)) - g(θ^(k)) - H(θ^(k))δ^(k) ||² ≤ .25 || g(θ^(k)+δ^(k)) ||²

the step bound, R, is doubled. In all other cases the step bound remains the same for the next iteration. Iterations continue until convergence is reached or the limit on the number of iterations is exceeded.

DOGLEGX was used for testing the ideas developed in section 3. It is an algorithm that invites tinkering (ellipsoids instead of spheres for the step bounding, quadruple instead of double the radius, etc.) and the macro (interpretive) form has permitted this kind of modification, often for specialized purposes. In particular it permitted us to experiment with a number of ideas for robust nonlinear regression.

6. ROBUST NONLINEAR REGRESSION

Since DOGLEGX computes the true Hessian we had no need (at this point) for reweighting as a way to solve our problem. We did however have to consider rescaling, and the DOGLEGXS macro was developed by one of us (REW) to accomplish this.

DOGLEGX is complicated by the fact that after it has found an acceptable step it looks ahead at the new gradient to see if it should increase the step bound radius for the next iteration. The question arises - if we are changing the scale, at what point in a step do we change it? Our discussion of this problem is meant to be indicative of the kind of problems that can arise in modifying nonlinear optimization algorithms for specialized applications like robust regression.

In DOGLEGXS we use the current residuals to compute a scale

    s(θ^(k)) = median( |r_i(θ^(k))| ) / .6745.

Sections 7 and 8 contain a discussion of starting values and other ways to compute s.
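In code, this adjusted MAD scale is a one-line computation; the sketch below (our notation, assuming numpy) is only meant to make the definition concrete.

    import numpy as np

    def mad_scale(residuals):
        """Median absolute deviation divided by .6745 so that it is roughly
        unbiased for independent Gaussian residuals (cf. sections 6 and 8)."""
        return np.median(np.abs(residuals)) / 0.6745

    # e.g. s_k = mad_scale(y - f(theta_k))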
The algorithm proceeds as in DOGLEGX until δ^(k) has been determined. Still using s^(k), F(θ^(k) + δ^(k)) is evaluated and checked to see if δ^(k) is an acceptable step. If the step is not accepted, the radius is halved as in DOGLEGX. If it is accepted we do not yet change s. The test of the approximation A_k(δ) is performed as before. Thus in cases where the radius can be reduced we do not change s before performing these tests. This costs us an extra evaluation of F (we shall have to eventually evaluate it with a new s) but it is conservative in the sense that changing s here (s generally decreases) would cause us to more often reduce the radius. Reducing the radius is costly because to increase it again we must compute a new g and H, but if a step is not acceptable, no new g and H are necessary to reduce the radius.

If the step has been accepted, we now compute s^(k+1) and proceed to see if the radius should be increased, assuming, of course, that the test using A_k(δ) was passed (i.e. the radius was not reduced).

The tests for radius increase are the same as before but the new gradient at θ^(k+1) is computed with s^(k+1). A number of tests run using s^(k) instead of s^(k+1) here gave no indication of being better or worse, but were much more expensive since the gradient had to be evaluated twice (with s^(k) and s^(k+1)). The next iteration begins using θ^(k+1) and s^(k+1).

7. STARTING VALUES

How to start a robust nonlinear regression is not an easy problem. A scale free start would be nice but least-squares is the only readily available one and, of course, requires a start itself. (Perhaps an L_p, 1 < p < 2, start would work, but we have not tried it.) We could also linearize the problem at the supplied starting values and then use least absolute residuals to get a revised start.

We have often found that the original starting values specified by an intelligent model builder can be used directly in a robust loss function with c chosen so that the asymptotic efficiency at the Gaussian is, say, .8, i.e.

    [ ∫ ρ''(t) dΦ(t) ]² / ∫ [ρ'(t)]² dΦ(t) = .8.

(See Huber (1973) for a discussion of asymptotic efficiency.) For (2.2) this means c is about .67 and for (2.3) about 1. Too low a value of c can throw away a lot of data (low weights) if the start is poor, and too high a c does not downweight large residuals enough. We see little reason to perform a least-squares analysis first, although we may want to do this at some point in studying the data.

8. SCALE COMPUTATION

We have used a median absolute deviation (MAD) scaling adjusted so that it will be unbiased for independent Gaussian residuals. In order to allow for a more asymmetric set of residuals, reduce the "granularity" of the median, and remove from the scale computation very large residuals, we also tried

    s² = Σ_{i=1}^n w_i r_i²(θ) / Σ_{i=1}^n w_i.

It has performed satisfactorily but requires some form of nonweighted starting scale because w^(0) is not defined. All of the results reported below use the MAD scale.

9. CONFIDENCE REGIONS

Since there is not yet general agreement about how to compute covariances for the estimated coefficients in robust linear regression, we cannot hope to give very definitive results for the nonlinear case. Gross (1973) has proposed a way to find confidence intervals for robust location estimates. A partially completed Monte Carlo study by Paul Holland, David Hoaglin, and Roy Welsch indicates that a reasonable covariance estimate for robust linear regression would be

    [ (1/(n-p)) Σ_{i=1}^n w_i r_i²(θ) ] (X^T W X)^{-1}                        (9.1)

where the w_i are the weights used to obtain θ in the final iteration of reweighted least-squares. The associated t-statistics would probably be based on an equivalent number of degrees of freedom like Σ w_i - p.

An obvious extension to the nonlinear problem is

    [ (1/(n-p)) Σ_{i=1}^n w_i r_i²(θ) ] (J^T W J)^{-1}                        (9.2)

where J and w were used to obtain θ. This, of course, has been used in most weighted nonlinear least-squares programs where the weights are assumed to be fixed.

It is useful to see what type of covariance formula arises if we attack the nonlinear problem directly. To do this we follow Bard (1974, p. 176) and argue that we want to examine the effect on the solution θ* of perturbations in the residuals at θ*. Bard gives the approximate covariance (in our notation) as

    V_θ = H^{-1} J^T W V_r W J H^{-1}                                         (9.3)

where V_r is the "covariance" matrix of the residuals at θ* and we have replaced ρ'' by w. One estimate of V_r would be r(θ*) r^T(θ*). Various other formulas are possible and some have been explored by Tukey (1973). Until more information is available, we prefer to take the approach that (9.3) is conditioned on the weights, set V_r = σ² W^{-1}, and estimate σ² by

    σ̂² = Σ_{i=1}^n w_i r_i²(θ*) / (n-p).                                      (9.4)

In cases where we ignore the first term of the Hessian, (9.3) would then reduce to (9.2). We have mainly relied on (9.2), especially because robust loss functions tend to reduce the size of the first term of the Hessian (cf. section 4).
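A short sketch of how (9.2) with the scale estimate (9.4) might be computed from the final Jacobian J, weights w, and residuals r is given below; numpy is assumed and the helper name is ours, not part of any package described in the text.

    import numpy as np

    def robust_covariance(J, w, r, p=None):
        """Covariance estimate of form (9.2):
        [sum(w_i r_i^2) / (n - p)] * (J'WJ)^{-1}, W = diag(w)."""
        J, w, r = np.asarray(J), np.asarray(w), np.asarray(r)
        n = len(r)
        if p is None:
            p = J.shape[1]
        sigma2 = np.sum(w * r**2) / (n - p)   # scale estimate (9.4)
        JtWJ = J.T @ (w[:, None] * J)         # J'WJ
        return sigma2 * np.linalg.inv(JtWJ)

    # standard errors: np.sqrt(np.diag(robust_covariance(J, w, r)))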
10. ELIMINATING SECOND DERIVATIVES

Computing the exact Hessian is expensive even in sophisticated systems. After the above algorithms were developed using DOGLEGX as a base, we replaced the exact Hessian by J^T W J, i.e. we used a type of reweighted least-squares within the context of the DOGLEGX algorithm with scaling. (Call this algorithm DOGLEGW.)

However, as one might expect, this modified algorithm does not work well on some types of highly nonlinear problems. A compromise algorithm (DOGLEGH) is now being tested by Dennis and Welsch. In it, each of the two parts of the Hessian (4.2) is treated separately. The second part is always computed exactly (except for the fact that w replaces ρ''). The first part is approximated and updated using methods developed by Broyden [see Dennis (1973)] to update the entire Hessian in general optimization algorithms. This can be accomplished in a way that keeps the Hessian positive definite, removing the need for the Greenstadt modification in DOGLEGX.

11. EXAMPLES

The above algorithms have been tested on a number of standard problems, but we present here an example from marketing which arose in joint work with John Little. The model we are trying to calibrate is

    SALES(t)    = R_0 · STREND(t) · PROM.MOD(t) · ADV.MOD(t)
    STREND(t)   = SEASON(t) · TREND(t)
    PROM.MOD(t) = 1 + B_1·PROM(t) - B_2·PROM(t-1)
    ADV.MOD(t)  = Σ_{i=1}^3 a_i [ C_1 + C_2 (K·ADV(t-i+1))^γ / (1 + (K·ADV(t-i+1))^γ) ]

with

    a_1 = .46    K   = .0041
    a_2 = .32    C_1 = .88
    a_3 = .22    C_2 = .24

and starting values of what we want to estimate

    R_0 = 538    B_1 = 1
    γ   = 2      B_2 = .2

All estimation was done on the first twenty-four observations of the data in Table 1.

Using the loss function of type (2.3) and DOGLEGXS we started the series of computations with c = 1 using the given starting values, and then used the results at c = 1 to start the computations for c = .8 and c = 1.5, corresponding to asymptotic efficiencies of about 50 percent and 95 percent at the Gaussian. The standard errors (using (9.2)) are given below the coefficient estimates in Table 2. Also listed are the final value of the (adjusted) MAD scale (s), the weighted least-squares scale (ws) as given in (9.4), the number of evaluations of F, g, and H, the "corrected" degrees of freedom (Σ w_i - p), and the regular degrees of freedom (n-p).

We note that γ is highly sensitive to changes in c and further investigation is called for, including, perhaps, a change in model formulation. The least-squares results are not listed because the algorithm forced γ to infinity (machine overflow) in that case. A more detailed discussion of the model is contained in Little and Welsch [1975].

In order to show how the DOGLEGXS algorithm performed on synthetic data we used the function [see Chambers (1973)]

    y = e^{-θ_1 x} - e^{-θ_2 x} + error

where θ_1 and θ_2 had true values of 1 and 10, ten observations were taken for x = .1(.1)1, and the error was contaminated Gaussian with 75% from N(0, .1) and 25% from N(0, 1). The convergence criterion consisted of having the length of the gradient less than .1 and the maximum relative coefficient change less than .001.

All computations were started at θ_1 = 0 and θ_2 = 0. The results are listed in Table 3.

12. CONCLUDING REMARKS

We hope that the above discussion will stimulate statisticians to consider the types of algorithms they would like to see developed for a flexible nonlinear fitting package which would include robust loss functions. We also hope that numerical analysts will consider the problems that arise in this area, including large residuals, weights, and the role of special parameters such as scale.

13. REFERENCES

1. Andrews, D.F. (1974). A Robust Method for Multiple Linear Regression. Technometrics 16, 523-531.

2. Bard, Y. (1974). Nonlinear Parameter Estimation. Academic Press, New York.

3. Barrodale, I. and F.D.K. Roberts (1973). An Improved Algorithm for Discrete L1 Approximation. SIAM J. Numer. Anal. 10, 839-848.

4. Beaton, A.E. and J.W. Tukey (1974). The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data. Technometrics 16, 147-192.

5. Chambers, J.M. (1973). Fitting Nonlinear Models: Numerical Techniques. Biometrika 60, 1-13.

6. Dennis, J.E. (1973). Some Computational Techniques for the Nonlinear Least Squares Problem, in Byrne and Hall, eds., Numerical Solution of Systems of Nonlinear Algebraic Equations. Academic Press, New York, 157-183.

7. Gross, A.M. (1973). A Robust Confidence Interval for Location for Symmetric Long-tailed Distributions. Proc. Nat. Acad. Sci. 70, 1995-1997.
TABLE 2

    c          .8        1.        1.5
    B_2        .22       .21       .23
              (.03)     (.03)     (.04)
    R_0       499.      491.      514.
TABLE 3

    c          .8        1.        1.5       LS
    θ_1        .75       .76       .84       2.16
              (.08)     (.09)     (.19)     (1.19)
    d.f.       8.        8.        8.        8.
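For readers who wish to experiment, the sketch below generates data in the spirit of the synthetic example of section 11 and fits it with a generic optimizer in place of DOGLEGXS. Reading N(0,.1) and N(0,1) as standard deviations, using the Huber-type loss (2.2) with c = 1 rather than the bounded loss (2.3), and holding a MAD starting scale fixed are all our assumptions, so the results will not reproduce Table 3.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)

    # y = exp(-theta1*x) - exp(-theta2*x) + error, true theta = (1, 10),
    # x = .1, .2, ..., 1, contaminated Gaussian errors (an interpretation:
    # 75% with standard deviation .1, 25% with standard deviation 1).
    x = np.arange(1, 11) / 10.0
    f = lambda th: np.exp(-th[0] * x) - np.exp(-th[1] * x)
    sd = np.where(rng.random(x.size) < 0.75, 0.1, 1.0)
    y = f(np.array([1.0, 10.0])) + rng.normal(0.0, sd)

    def huber_rho(t, c=1.0):
        a = np.abs(t)
        return np.where(a <= c, 0.5 * t**2, c * a - 0.5 * c**2)

    s = np.median(np.abs(y - f(np.zeros(2)))) / 0.6745   # crude starting scale
    obj = lambda th: np.sum(huber_rho((y - f(th)) / s))
    theta_hat = minimize(obj, np.zeros(2), method="Nelder-Mead").x
    print(theta_hat)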