06 Nonlinear Regression Models
T. Amemiya
Stanford University
Contents
1. Introduction 334
2. Single equation-i.i.d. case 336
2.1. Model 336
2.2. Asymptotic properties 337
2.3. Computation 341
2.4. Tests of hypotheses 347
2.5. Confidence regions 352
3. Single equation-non-i.i.d. case 354
3.1. Autocorrelated errors 354
3.2. Heteroscedastic errors 358
4. Multivariate models 359
5. Simultaneous equations models 362
5.1. Non-linear two-stage least squares estimator 362
5.2. Other single equation estimators 370
5.3. Non-linear simultaneous equations 375
5.4. Non-linear three-stage least squares estimator 376
5.5. Non-linear full information maximum likelihood estimator 379
References 385
*This work was supported by National Science Foundation Grant SES-7912965 at the Institute for
Mathematical Studies in the Social Sciences, Stanford University. The author is indebted to the
following people for valuable comments: R. C. Fair, A. R. Gallant, Z. Griliches, M. D. Intriligator,
T. E. MaCurdy, J. L. Powell, R. E. Quandt, N. E. Savin, and H. White.
1. Introduction
f_i(y_t, x_t, α_i) = u_it,  i = 1, 2, ..., N;  t = 1, 2, ..., T,  (1.1)

where y_t, x_t, and α_i are vectors of endogenous variables, exogenous variables, and parameters, respectively, and u_it are unobservable error terms with zero mean.
Eqs. (1.1), in their full generality, constitute the non-linear simultaneous equations
model, which is analyzed in Section 5. I devote most of the discussion in the
chapter to this section because this area has been only recently developed and
therefore there is little account of it in general references.
Many simpler models arising as special cases of (1.1) are considered in other
sections. In Section 2 I take up the simplest of these, which I will call the standard
non-linear regression model, defined by

y_t = f(x_t, β_0) + u_t,  t = 1, 2, ..., T,  (1.2)

where {u_t} are scalar i.i.d. (independent and identically distributed) random
variables with zero mean and constant variance. Since this is the model which has
been most extensively analyzed in the literature, I will also devote a lot of space to
the analysis of this model. Section 3 considers the non-i.i.d. case of the above
model, and Section 4 treats its multivariate generalization.
Now, I should mention what will not be discussed. I will not discuss the
maximum likelihood estimation of non-linear models unless the model is written
in the regression form (1.1). Many non-linear models are discussed elsewhere in
this Handbook; see, for example, the chapters by Dhrymes, McFadden, and
Maddala. The reader is advised to recognize a close connection between the
non-linear least squares estimator analyzed in this chapter and the maximum
likelihood estimator studied in the other chapters; essentially the same techniques
are used to derive the asymptotic properties of the two estimators and analogous
computer algorithms can be used to compute both.
I will not discuss splines and other methods of function approximation, since
space is limited and these techniques have not been as frequently used in
econometrics as they have in engineering applications. A good introduction to the
econometric applications of spline functions can be found in Poirier (1976).
Above I mentioned the linear model with the transformation to reduce the
autocorrelation of the error terms and the distributed-lag model. I will not
specifically study these models because they are very large topics by themselves
and are best dealt with separately. (See the chapter by Hendry, Pagan, and Sargan
in this Handbook). There are a few other important topics which, although
non-linearity is involved, would best be studied within another context, e.g.
non-linear errors-in-variables models and non-linear time-series models. Regarding
these two topics, I recommend Wolter and Fuller (1978) and Priestley (1978).
Finally, I conclude this introduction by citing general references on non-linear
regression models. Malinvaud (1970b) devotes one long chapter to non-linear
regression models in which he discusses the asymptotic properties of the non-
linear least squares estimator in a multivariate model. There are three references
which are especially good in the discussion of computation algorithms, confidence
regions, and worked-out examples: Draper and Smith (1966), Bard (1974), and
Judge, Griffiths, Hill and Lee (1980). Several chapters in Goldfeld and Quandt
(1972) are devoted to the discussion of non-linear regression models. Their
Chapter 1 presents an excellent review of optimization techniques which can be
used in the computation of both the non-linear least squares and the maximum
likelihood estimators. Chapter 2 discusses the construction of confidence regions
in the non-linear regression model and the asymptotic properties of the maximum
likelihood estimator (but not of the non-linear least squares estimator). Chapter 5
considers the Cobb-Douglas production function with both multiplicative and
additive errors, and Chapter 8 considers non-linear (only in variables) simulta-
neous equations models. There are two noteworthy survey articles: Gallant
(1975a), with emphasis on testing and computation, and Bunke, Henschke,
Strüby and Wisotzki (1977), which is more theoretically oriented. None of the
above-mentioned references, however, discusses the estimation of simultaneous
equations models non-linear both in variables and parameters.
2. Single equation-i.i.d. case

2.1. Model
(2.3)
y = f(β_0) + u,  (2.4)

where y, f(β_0), and u are T-vectors whose t-th element is equal to y_t, f(x_t, β_0), and
u_t, respectively. I will also use the symbol f_t(β_0) to denote f(x_t, β_0).
2.2. Asymptotic properties

2.2.1. Consistency
The meaning of (2.8) is as follows. Define S(β) = plim T^{-1}S_T(β). Then, given
ε, δ > 0, there exists T_0, independent of β, such that for all T ≥ T_0 and for all
β, P[|T^{-1}S_T(β) - S(β)| > ε] < δ.
It is easy to construct examples in which the violation of any single assumption
above leads to the inconsistency of β̂. [See Amemiya (1980).]
I will now give a sketch of the proof of the consistency and indicate what
additional assumptions are needed as I go along. From (2.1) and (2.5), we get

T^{-1}S_T(β) = T^{-1}Σu_t² + 2T^{-1}Σu_t[f_t(β_0) - f_t(β)] + T^{-1}Σ[f_t(β_0) - f_t(β)]²
            ≡ A_1 + A_2 + A_3,  (2.9)

where Σ means Σ_{t=1}^{T} unless otherwise noted. First, plim A_1 = σ² by a law of large
numbers [see, for example, Kolmogorov Theorem 2, p. 115, in Rao (1973)].
Secondly, for fixed β_0 and β, plim A_2 = 0 follows from the convergence of
T^{-1}Σ[f_t(β_0) - f_t(β)]² by Chebyshev's inequality:
Since the uniform convergence of A_2 follows from the uniform convergence of the
right-hand side of (2.10), it suffices to assume

Having thus disposed of A_1 and A_2, we need only assume that lim A_3 is
uniquely minimized at β_0; namely,
To sum up, the non-linear least squares estimator β̂ of the model (2.1) is
consistent if (2.6), (2.11), and (2.12) are satisfied. I will comment on the
significance and the plausibility of these three assumptions.
The assumption of a compact parameter space (2.6) is convenient but can be
rather easily removed. The trick is to dispose of the region outside a certain
compact subset of the parameter space by assuming that in that region
T^{-1}Σ[f_t(β_0) - f_t(β)]² is sufficiently large. This is done by Malinvaud (1970a).
An essentially similar argument appears also in Wald (1949) in the proof of the
consistency of the maximum likelihood estimator.
It would be nice if assumption (2.11) could be paraphrased into separate
assumptions on the functional form of f and on the properties of the exogenous
sequence {x_t}, which are easily verifiable. Several authors have attempted to
obtain such assumptions. Jennrich (1969) observes that if f is bounded and
continuous, (2.11) is implied by the assumption that the empirical distribution
function of {x_t} converges to a distribution function. He also notes that another
way to satisfy (2.11) is to assume that {x_t} are i.i.d. with a distribution function F,
and f is bounded uniformly in β by a function which is square integrable with
respect to F. Malinvaud (1970a) generalizes the first idea of Jennrich by introduc-
ing the concept of weak convergence of measure, whereas Gallant (1977) gener-
alizes the second idea of Jennrich by considering the notion of Cesaro summabil-
ity. However, it seems to me that the best procedure is to leave (2.11) as it is and
try to verify it directly.
The assumption (2.12) is comparable to the familiar assumption in the linear
model that lim T^{-1}X'X exists and is positive definite. It can easily be proved that
in the linear model the above assumption is not necessary for the consistency of
least squares and it is sufficient to assume (X'X)^{-1} → 0. This observation
suggests that assumption (2.12) can be relaxed in an analogous way. One such
result can be found in Wu (1981).
(2.13)
√T(β̂ - β_0) = -[T^{-1}(∂²S_T/∂β∂β')_{β*}]^{-1} T^{-1/2}(∂S_T/∂β)_{β_0}.  (2.14)
Thus, we are done if we can show that (i) the limit distribution of
T^{-1/2}(∂S_T/∂β)_{β_0} is normal and (ii) T^{-1}(∂²S_T/∂β∂β')_{β*} converges in probability
to a non-singular matrix. We will consider these two statements in turn.
(2.15)
But it is easy to find the conditions for the asymptotic normality of (2.16) because
the summand on the right-hand side is a weighted average of an i.i.d. sequence - the
kind encountered in the least squares estimation of a linear model. Therefore, if
we assume
we assume
then

T^{-1/2}(∂S_T/∂β)_{β_0} → N(0, 4σ²C).  (2.18)
(2.19)
We must show that each of the three terms in the right-hand side of (2.19)
(2.20)
and
(2.21)
Then we obtain:

plim T^{-1}(∂²S_T/∂β∂β')_{β*} = 2C.  (2.22)
(2.23)
The assumptions we needed in proving (2.23) were (2.17), (2.20), and (2.21), as
well as the assumption that β̂ is consistent.
It is worth pointing out that in the process of proving (2.23) we have in effect
shown that we have, asymptotically,
(2.24)
where I have put G = (∂f/∂β')_{β_0}, a T × K matrix. Note that (2.24) holds exactly
in the linear case. The practical consequence of the approximation (2.24) is that
all the results for the linear regression model are asymptotically valid for the
non-linear regression model if we treat G as the regressor matrix. In particular, we
can use the usual t and F statistics with approximate precision, as I will
explain more fully in Sections 2.4 and 2.5 below. Since the matrix G depends on
the unknown parameters, we must in practice evaluate it at β̂.
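As a concrete illustration of treating G as the regressor matrix, the following sketch computes an NLLS estimate and then applies the linear-model covariance formula σ̂²(Ĝ'Ĝ)^{-1}. The model y_t = β_1 e^{β_2 x_t} + u_t, the data, and the starting value are illustrative assumptions, not the chapter's example:

```python
import numpy as np

# Illustrative model (an assumption): f(x, beta) = beta1 * exp(beta2 * x)
def f(x, beta):
    return beta[0] * np.exp(beta[1] * x)

def jac(x, beta):
    # Analytic derivative matrix G = (df/dbeta'), a T x K matrix
    return np.column_stack([np.exp(beta[1] * x),
                            beta[0] * x * np.exp(beta[1] * x)])

rng = np.random.default_rng(0)
T, K = 200, 2
x = np.linspace(0.0, 2.0, T)
beta_true = np.array([2.0, 0.5])
y = f(x, beta_true) + 0.1 * rng.standard_normal(T)

# Obtain beta_hat by a few Gauss-Newton steps (started at the truth here
# purely for brevity; Section 2.3 discusses how to find starting values).
beta_hat = beta_true.copy()
for _ in range(10):
    step = np.linalg.lstsq(jac(x, beta_hat), y - f(x, beta_hat), rcond=None)[0]
    beta_hat = beta_hat + step

# Treat G evaluated at beta_hat as the regressor matrix, as in (2.24):
G = jac(x, beta_hat)
sigma2_hat = np.sum((y - f(x, beta_hat)) ** 2) / (T - K)
cov = sigma2_hat * np.linalg.inv(G.T @ G)   # linear-model covariance formula
se = np.sqrt(np.diag(cov))
```

The standard errors `se` are then used exactly as in the linear model, e.g. to form the t statistics of Section 2.4.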
2.3. Computation
Since there is in general no explicit formula for the NLLS estimator β̂, the
minimization of (2.5) must usually be carried out by some iterative method. There
are two general types of iteration methods: general optimization methods applied
to the non-linear least squares problem in particular, and procedures which are
specifically designed to cope with the present problem. In this chapter I will
discuss two representative methods - the Newton-Raphson iteration which be-
longs to the first type and the Gauss-Newton iteration which belongs to the
second type - and a few major variants of each method. These cover a majority of
the iterative methods currently used in econometric applications. Although not
discussed here, I should mention another method sometimes used in econometric
applications, namely the so-called conjugate gradient method of Powell (1964)
which does not require the calculation of derivatives and is based on a different
principle from the Newton methods. Much more detailed discussion of these and
other methods can be found in Chapter 12 of this Handbook and in Goldfeld and
Quandt (1972, ch. 1).
(2.25)
where β̂_1 is the initial estimate [obtained by a pure guess or by a method such as
the one proposed by Hartley and Booker (1965) described below]. The second-
round estimator β̂_2 of the iteration is obtained by minimizing the right-hand side
of (2.25). Therefore,
(2.26)
The iteration is to be repeated until the sequence {β̂_n} thus obtained converges to
the desired degree of accuracy.
Inserting (2.26) into (2.25) and writing n + 1 and n for 2 and 1, we obtain:
(2.27)
The above equation shows two weaknesses of the Newton-Raphson iteration. (i)
Even if (2.27) holds exactly, S_T(β̂_{n+1}) < S_T(β̂_n) is not guaranteed unless
(∂²S_T/∂β∂β')_{β̂_n} is a positive definite matrix. (ii) Even if the matrix is positive
definite, β̂_{n+1} - β̂_n may be too large or too small - if it is too large, it overshoots
the target, and if it is too small, the speed of convergence is slow.
The first weakness may be alleviated if we modify (2.26) as
(2.28)
where I is the identity matrix and α_1 is a scalar to be appropriately chosen by the
researcher subject to the condition that (∂²S_T/∂β∂β')_{β̂_1} + α_1 I is positive definite.
This modification was proposed by Goldfeld, Quandt and Trotter (1966) and is
called quadratic hill-climbing (since they were considering maximization). See the
same article or Goldfeld and Quandt (1972, ch. 1) for a discussion of how to
choose α_1 and the convergence properties of the method.
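A minimal sketch of the Newton-Raphson iteration with a quadratic-hill-climbing-style correction: the Hessian is inflated by αI until it is positive definite. The model, data, finite-difference derivatives, the rule for growing α, and the added step-halving safeguard are all assumptions of this sketch, not the cited authors' exact recipes:

```python
import numpy as np

# Sum of squares for an assumed model y = b1*exp(b2*x) + u.
def make_S(x, y):
    return lambda b: np.sum((y - b[0] * np.exp(b[1] * x)) ** 2)

def num_grad(fun, b, h=1e-6):
    # Central-difference gradient
    g = np.zeros_like(b)
    for i in range(b.size):
        e = np.zeros_like(b); e[i] = h
        g[i] = (fun(b + e) - fun(b - e)) / (2 * h)
    return g

def num_hess(fun, b, h=1e-4):
    # Finite-difference Hessian, symmetrized
    n = b.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros_like(b); e[i] = h
        H[:, i] = (num_grad(fun, b + e) - num_grad(fun, b - e)) / (2 * h)
    return 0.5 * (H + H.T)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 100)
y = 2.0 * np.exp(0.5 * x) + 0.1 * rng.standard_normal(100)
S = make_S(x, y)

beta = np.array([1.0, 1.0])            # initial estimate (assumed)
for _ in range(50):
    g, H = num_grad(S, beta), num_hess(S, beta)
    alpha = 0.0
    # Inflate the Hessian until it is positive definite.
    while np.any(np.linalg.eigvalsh(H + alpha * np.eye(2)) <= 0):
        alpha = max(2 * alpha, 1e-3)
    step = np.linalg.solve(H + alpha * np.eye(2), g)
    # Simple step-length safeguard: halve until S decreases.
    while S(beta - step) > S(beta) and np.linalg.norm(step) > 1e-12:
        step = step / 2
    beta = beta - step
    if np.linalg.norm(step) < 1e-10:
        break
```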
The second weakness may be remedied by the modification:
(2.29)
(∂S_T/∂β)_{β̂_1} = (∂S_T/∂β)_{β_0} + (∂²S_T/∂β∂β')_{β*}(β̂_1 - β_0),  (2.30)

where β* lies between β̂_1 and β_0. Inserting (2.30) into (2.26) yields
(2.31)
But, under the assumptions of Section 2.2 from which we proved the asymptotic
normality of β̂, we have
(2.32)
Therefore,
(2.33)
where ≃ means that both sides of the equation have the same non-degenerate
limit distribution.
To start an iteration, we need an initial estimate. Since there may be more than
one local minimum of S_T, it is helpful to use a starting value as close to the true
value as possible. Thus, it would be desirable to have available an easily
computable good estimator to serve as β̂_1; all the better if it is consistent so that we
can take advantage of the result of the preceding paragraph. Surprisingly, I know of
only one such estimator - the one proposed by Hartley and Booker (1965). Their
initial estimator is obtained as follows. Let us assume for simplicity mK = T for
some integer m and partition the set of integers (1, 2, ..., T) into K non-overlap-
ping consecutive subsets Ψ_1, Ψ_2, ..., Ψ_K, each of which contains m elements. If we
define ȳ_(i) = m^{-1} Σ_{t∈Ψ_i} y_t and f̄_(i)(β) = m^{-1} Σ_{t∈Ψ_i} f_t(β), i = 1, 2, ..., K, the
Hartley-Booker estimator is defined as the value of β that satisfies the K equations:

ȳ_(i) = f̄_(i)(β),  i = 1, 2, ..., K.  (2.34)
Since (2.34) cannot generally be solved explicitly for β, one still needs an
iteration to solve it. Hartley and Booker propose the minimization of Σ_{i=1}^{K} [ȳ_(i) -
f̄_(i)(β)]² by an iterative method, such as one of the methods being discussed in
this section. This minimization is at least simpler than the original minimization
of (2.5) because the knowledge that the minimand is zero at β = β̂ is useful.
However, if there are multiple solutions to (2.34), an iteration may lead to the
wrong solution.
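The Hartley-Booker construction can be sketched as follows, on an assumed model y_t = β_1 e^{β_2 x_t} + u_t, with a crude grid search standing in for the iterative solution of the K equations (the model, grid bounds, and sample sizes are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
T, K = 200, 2
m = T // K                               # T = m*K observations
x = np.linspace(0.0, 2.0, T)
y = 2.0 * np.exp(0.5 * x) + 0.1 * rng.standard_normal(T)

# Block means of y and of f(beta) over K consecutive subsets of size m.
ybar = y.reshape(K, m).mean(axis=1)
def fbar(beta):
    return (beta[0] * np.exp(beta[1] * x)).reshape(K, m).mean(axis=1)

# Solve the K equations ybar_i = fbar_i(beta) by minimizing the sum of
# squared discrepancies over a grid (any iterative method would do; the
# minimand is known to be zero at the solution).
grid1 = np.linspace(0.5, 4.0, 71)
grid2 = np.linspace(0.0, 1.5, 61)
best = min((np.sum((ybar - fbar(np.array([b1, b2]))) ** 2), b1, b2)
           for b1 in grid1 for b2 in grid2)
beta_init = np.array([best[1], best[2]])  # Hartley-Booker starting value
```

`beta_init` would then serve as β̂_1 for one of the iterations discussed in this section.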
Hartley and Booker proved the consistency of their estimator. Jennrich (1969)
gave a counterexample to their consistency proof; however, their proof can easily
be modified to take account of Jennrich’s counter-example. A more serious
weakness of the Hartley-Booker proof is that their assumptions are too restric-
tive: one can easily construct a benign example for which their assumptions are
violated and yet their estimator is consistent.
This is the method specifically designed to calculate the NLLS estimator. Expand-
ing f_t(β) in a Taylor series around the initial estimate β̂_1, we get:
(2.35)
(2.36)
(2.37)
(2.38)
(2.40)
Hartley proves that under general conditions his iteration converges to a sta-
tionary point: that is, a root of the normal equation ∂S_T/∂β = 0. He also proves
(not so surprisingly) that if the iteration is started at a point sufficiently close to
β̂, it converges to β̂. See Tornheim (1963) for an alternative proof of the
convergence of the Hartley iteration. Some useful comments on Marquardt's and
Hartley's algorithms can be found in Gallant (1975a). The methods of determin-
ing λ_n in the Newton-Raphson iteration (2.29) mentioned above can also be
applied to the determination of λ_n in (2.41).
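A sketch of the Gauss-Newton iteration with Hartley's step-halving modification: each step regresses the current residual on the derivative matrix G, and the step is halved until the sum of squares decreases. The model, data, and starting value are illustrative assumptions:

```python
import numpy as np

# Assumed model and its derivative matrix G (T x K).
def f(x, b):
    return b[0] * np.exp(b[1] * x)

def G(x, b):
    return np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])

rng = np.random.default_rng(3)
x = np.linspace(0.0, 2.0, 150)
y = f(x, np.array([2.0, 0.5])) + 0.1 * rng.standard_normal(150)
S = lambda b: np.sum((y - f(x, b)) ** 2)

b = np.array([1.0, 0.1])                 # starting value (assumed)
for _ in range(100):
    # Gauss-Newton step: regress the residual on G evaluated at b.
    step = np.linalg.lstsq(G(x, b), y - f(x, b), rcond=None)[0]
    lam = 1.0
    while S(b + lam * step) > S(b) and lam > 1e-8:
        lam *= 0.5                       # Hartley's step halving
    b_new = b + lam * step
    if np.linalg.norm(b_new - b) < 1e-10:
        break
    b = b_new
```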
Jennrich (1969) proves that if the Gauss-Newton iteration is started at a point
sufficiently close to the true value β_0, and if the sample size T is sufficiently large,
the iteration converges to β_0. This is called the asymptotic stability of the
iteration. The following is a brief sketch of Jennrich's proof. Rewrite the Gauss-
Newton iteration (2.37) as (I have also changed 1 to n and 2 to n + 1 in the
subscript)

β̂_{n+1} = h(β̂_n),  (2.42)
expansion:
(2.43)
where β*_{n-1} lies between β̂_n and β̂_{n-1}. If we define A_n = (∂h/∂β')_{β*_{n-1}} and denote
the largest characteristic root of A'_n A_n by λ_n, we can show that λ_n → 0 almost
surely for all n as T → ∞, and hence

But (2.44) implies two facts. First, the iteration converges to a stationary point,
and secondly, this stationary point must lie sufficiently close to the starting value
β̂_1, since
Table 2.1

              Normal    Non-normal
Linear          I           II
Non-linear     III          IV
The tests to be discussed can be classified into four cases according to the nature of the
hypotheses and the distribution of the error term, as depicted in Table 2.1. I will
discuss the t and F tests in Case I and the likelihood ratio, Wald, and Rao tests in
Case IV. I will not discuss Cases II and III because the results in Case IV are a
fortiori valid in Cases II and III.
Partition the parameter vector as β' = (β'_(1), β'_(2)), where β_(1) is a K_1-vector and β_(2)
is a K_2-vector. By a linear hypothesis I mean a hypothesis which specifies that β_(2)
is equal to a certain known value β̄_(2). Student's t test is applicable if K_2 = 1 and
the F test if K_2 > 1.
The hypothesis of the form Qβ = c, where Q is a known K_2 × K matrix and c is
a known K_2-vector, can be transformed into a hypothesis of the form described
above and therefore need not be separately considered. Assuming Q is full rank,
we can find a K_1 × K matrix R such that (R', Q') = A' is non-singular. If we
define α = Aβ and partition α' = (α'_(1), α'_(2)), the hypothesis Qβ = c is equivalent
to the hypothesis α_(2) = c.
As noted after eq. (2.24), all the results of the linear regression model can be
extended to the non-linear model by treating G = (∂f/∂β')_{β_0} as the regressor
matrix if the assumptions of Section 2.2 are satisfied. Since β_0 is unknown, we
must use Ĝ = (∂f/∂β')_{β̂} in practice. We will generalize the t and F statistics of
the linear model by this principle. If K_2 = 1, we have approximately

√(T - K) (β̂_(2) - β̄_(2)) / √(d̂ S_T(β̂)) ~ t(T - K),  (2.46)

where d̂ is the last diagonal element (if β_(2) is the i-th element of β, the i-th diagonal
element) of (Ĝ'Ĝ)^{-1} and t(T - K) denotes Student's t distribution with T - K
degrees of freedom. For the case K_2 ≥ 1, we have asymptotically under the null
hypothesis:
where J’= (0, I), 0 being the K, X K, matrix of zeros and I being the identity
matrix of size K,, and F( K,, T - K) denotes the F distribution with K, and
T - K degrees of freedom.
Gallant (1975a) examined the accuracy of the approximation (2.46) by a Monte
Carlo experiment using the model

For each of the four parameters, the empirical distribution of the left-hand side of
(2.46) matched the distribution of t(T - K) reasonably well, although, as we
would suspect, the performance was the poorest for β_3.
In testing β_(2) = β̄_(2) when K_2 ≥ 1, we may alternatively use the asymptotic
approximation (under the null hypothesis):

(T - K)[S_T(β̃) - S_T(β̂)] / [K_2 S_T(β̂)] ~ F(K_2, T - K),  (2.49)

where β̃ is the constrained non-linear least squares estimator obtained by mini-
mizing S_T(β) subject to β_(2) = β̄_(2). Although, as is well known, the statistics (2.47)
and (2.49) are identical in the linear model, they are different in the non-linear
model.
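The two statistics can be compared numerically. The sketch below tests H_0: β_2 = 0.5 (so K_2 = 1) in an assumed model y = β_1 e^{β_2 x} + u; with β_2 fixed, the constrained estimator is linear in β_1. All specifics are assumptions of the illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
T, K, K2 = 200, 2, 1
x = np.linspace(0.0, 2.0, T)
y = 2.0 * np.exp(0.5 * x) + 0.1 * rng.standard_normal(T)

f = lambda b: b[0] * np.exp(b[1] * x)
Gm = lambda b: np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])
S = lambda b: np.sum((y - f(b)) ** 2)

# Unconstrained NLLS by Gauss-Newton (started near the truth for brevity).
b = np.array([2.0, 0.5])
for _ in range(20):
    b = b + np.linalg.lstsq(Gm(b), y - f(b), rcond=None)[0]

# Constrained NLLS: with b2 fixed at 0.5 the model is linear in b1.
z = np.exp(0.5 * x)
b_r = np.array([(z @ y) / (z @ z), 0.5])

# (2.47)-type statistic, built from the unconstrained estimate and G-hat:
G = Gm(b)
d = np.linalg.inv(G.T @ G)[1, 1]
F_wald = (T - K) * (b[1] - 0.5) ** 2 / (K2 * d * S(b))
# (2.49)-type statistic, built from the two sums of squared residuals:
F_ssr = (T - K) * (S(b_r) - S(b)) / (K2 * S(b))
```

In the linear model `F_wald` and `F_ssr` would be identical; here they differ, though both are referred to the same F(K_2, T - K) distribution.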
The study of Gallant (1975c) sheds some light on the choice between (2.47) and
(2.49). He obtained the asymptotic distribution of the statistics (2.47) and (2.49)
under the alternative hypothesis as follows. Regarding S_T(β̂), which appears in
both formulae, we have asymptotically:

(2.52)

where β_(2)0 is the true value of β_(2).¹ The asymptotic distribution of the statistic
(2.47) under the alternative hypothesis can now be derived from (2.50) and (2.52)
and, similarly, that of (2.49) from (2.50) and (2.51).
Gallant (1975c) conducted a Monte Carlo study using the model (2.48) to
compare the above two tests in testing β_1 = 0 against β_1 ≠ 0 and β_3 = -1 against
β_3 ≠ -1. His results show that (i) the asymptotic approximation under the
alternative hypothesis matches the empirical distribution reasonably well for both
statistics but works a little better for the statistic (2.49) and (ii) the power of (2.49)
tends to be higher than that of (2.47).² Gallant (1975a) observes that (2.49) is
easier to calculate than (2.47) except when K_2 = 1. All these observations indicate
a preference for (2.49) over (2.47). See Gallant (1975b) for a tabulation of the
power function of the test based on S_T(β̃)/S_T(β̂), which is equivalent to the
test based on (2.49).
h(β) = 0,  (2.53)

LRT = 2[log L(β̂) - log L(β̃)],  (2.54)
(2.55)
and
(2.56)
¹In deriving the asymptotic approximations (2.51) and (2.52), Gallant assumes that the "distance"
between the null and alternative hypotheses is sufficiently small. More precisely, he assumes that there
exists a sequence of hypothesized values {β̄_(2)T} and hence a sequence {β̃_T} such that √T(β̄_(2)T - β_(2)0)
and √T(β̃_T - β_0) converge to constant vectors as T goes to infinity.
²Actually, the powers of the two tests calculated either from the approximation or from the
empirical distribution are identical in testing β_1 = 0. They differ only in the test of β_3 = -1.
If we assume the normality of u in the non-linear regression model (2.1), we can
write (2.54), (2.55), and (2.56) as⁵
LRT = T[log T^{-1}S_T(β̃) - log T^{-1}S_T(β̂)],  (2.58)
(2.59)
and
(2.60)
³See Silvey (1959) for an interpretation of Rao's test as a test on Lagrange multipliers.
⁴If ξ is distributed as a q-vector N(0, V), then (ξ + μ)'V^{-1}(ξ + μ) ~ χ²(q, μ'V^{-1}μ).
⁵In the following derivation I have omitted some terms whose probability limit is zero in evaluating
T^{-1/2}(∂ log L/∂β') and T^{-1}(∂² log L/∂β∂β').
Using a proof similar to Rao's, we can show that the statistics (2.58), (2.59), and
(2.60) are asymptotically distributed as χ²(q) even if u is not normal. Thus,
these statistics can be used to test a non-linear hypothesis under a non-normal
situation.
In the linear regression model we can show Wald ≥ LRT ≥ Rao [see Berndt
and Savin (1977)]. Although the inequalities do not exactly hold for the non-lin-
ear model, Mizon (1977) found Wald ≥ LRT most of the time in his samples.
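The inequality in the linear model can be seen from normality-based sums-of-squares forms of the three statistics. The particular forms below (Wald = T(S_r - S_u)/S_u and Rao = T(S_r - S_u)/S_r) are one common version and may differ in detail from (2.59)-(2.60); the numbers are hypothetical:

```python
import numpy as np

# Hypothetical restricted (S_r) and unrestricted (S_u) sums of squared
# residuals; the sample size T is also hypothetical.
T = 200
S_u, S_r = 1.93, 2.10
a = (S_r - S_u) / S_u

wald = T * a                    # assumed Wald form: T(S_r - S_u)/S_u
lrt = T * np.log(S_r / S_u)     # (2.58): T[log(S_r/T) - log(S_u/T)]
rao = T * (S_r - S_u) / S_r     # assumed Rao form: T(S_r - S_u)/S_r

# Wald >= LRT >= Rao follows from a/(1+a) <= log(1+a) <= a for a > -1.
```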
(2.61)
and if we use (2.49) we obtain:

(T - K)[S_T(β) - S_T(β̂)] / [K S_T(β̂)] ≤ F_α(K, T - K),  (2.63)
Goldfeld and Quandt (1972, p. 53) give a striking example in which the two
regions defined by (2.62) and (2.63) differ markedly, even though both statistics
have the same asymptotic distribution, F(K, T - K). I have not come across any
reference discussing the comparative merits of the two methods.
Beale (1960) shows that the confidence region based on (2.63) gives an accurate
result - that is, the distribution of the left-hand side of (2.63) is close to F(K, T - K) -
if the "non-linearity" of the model is small. He defines a measure of
non-linearity as
N̂ = K(T - K)^{-1} S_T(β̂) Σ_{i=1}^{m} Σ_{t=1}^{T} [f_t(β_i) - f_t(β̂) - ĝ_t'(β_i - β̂)]² / {Σ_{i=1}^{m} Σ_{t=1}^{T} [ĝ_t'(β_i - β̂)]²}²,

where β_1, β_2, ..., β_m are points chosen in a neighborhood of β̂ and ĝ_t = (∂f_t/∂β)_{β̂}.
Note that β̂ actually drops out of the denominator of (2.65), which makes the
computation of the confidence region simple in this case. In analogy to (2.65),
Hartley (1964) proposed the following confidence region:
3. Single equation-non-i.i.d. case

3.1. Autocorrelated errors
In this section we consider the non-linear regression model (2.1) where {u_t} follow
a general stationary process

u_t = Σ_{j=0}^{∞} γ_j ε_{t-j},  (3.1)

where {ε_t} are i.i.d. with Eε_t = 0 and Vε_t = σ², and the γ's satisfy the condition

Σ_{j=0}^{∞} γ_j² < ∞,  (3.2)
and where
I will add whatever assumptions are needed in the course of the subsequent
discussion. The variance-covariance matrix Euu' will be denoted by Σ.
I will indicate how to prove the consistency and the asymptotic normality of
the non-linear least squares estimator β̂ in the present model, given the above
assumptions as well as the assumptions of Section 2.2. Changing the assumption
of independence to autocorrelation poses no more difficulties in the non-linear
model than in the linear model.
To prove consistency, we consider (2.9) as before. Since A_1 does not depend on
β and A_3 does not depend on u_t, we need be concerned only with A_2. Since A_2
involves the vector product f'u and since E(f'u)² = f'Σf ≤ f'f λ_1(Σ), where
λ_1(Σ) is the largest characteristic root of Σ, assumption (2.11) implies plim A_2 = 0
by Chebyshev's inequality, provided that the characteristic roots of Σ are bounded
from above. But this last condition is implied by assumption (3.3).
To prove the asymptotic normality in the present case, we need only prove the
asymptotic normality of (2.16), which, just as in the linear model, follows from
Theorem 10.2.11, page 585, of Anderson (1971) if we assume

Σ_{j=0}^{∞} |γ_j| < ∞.  (3.4)

Then we obtain

√T(β̂ - β_0) → N[0, lim T(G'G)^{-1}G'ΣG(G'G)^{-1}],  (3.5)
which indicates that the linear approximation (2.24) works for the autocorrelated
model as well. Again it is safe to say that all the results of the linear model are
asymptotically valid in the non-linear model. This suggests, for example, that the
Durbin-Watson test will be approximately valid in the non-linear model, though
this has not been rigorously demonstrated.
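The sandwich covariance in (3.5) is easy to evaluate for a given Σ. The sketch below uses an assumed AR(1) error covariance and compares the sandwich formula with the i.i.d.-style formula, which is incorrect under autocorrelation; the model, ρ, and σ² are assumptions of the illustration:

```python
import numpy as np

T = 100
x = np.linspace(0.0, 2.0, T)
beta = np.array([2.0, 0.5])    # point at which G is evaluated (assumed)
G = np.column_stack([np.exp(beta[1] * x), beta[0] * x * np.exp(beta[1] * x)])

# Assumed AR(1) covariance: Sigma[t, s] = sigma2/(1 - rho^2) * rho^{|t-s|}.
rho, sigma2 = 0.6, 0.04
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Sigma = sigma2 / (1 - rho ** 2) * rho ** lags

GtG_inv = np.linalg.inv(G.T @ G)
cov_sandwich = GtG_inv @ G.T @ Sigma @ G @ GtG_inv   # the (3.5) form
cov_iid = sigma2 / (1 - rho ** 2) * GtG_inv          # ignores autocorrelation
```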
Now, let us consider the non-linear analogue of the generalized least squares
estimator, which I will call the non-linear generalized least squares (NLGLS)
estimator.
Hannan (1971) investigated the asymptotic properties of the class of estimators,
denoted by β̂(A), obtained by minimizing (y - f)'A^{-1}(y - f) for some A which
is the variance-covariance matrix of a stationary process with bounded (both
from above and from below) characteristic roots. This class contains the NLLS
estimator, β̂ = β̂(I), and the NLGLS estimator, β̂(Σ).
Hannan actually minimized an approximation of (y - f)'A^{-1}(y - f) ex-
pressed in the frequency domain; therefore, his estimator is analogous to his
spectral estimator proposed for the linear regression model [Hannan (1963)]. If we
define the periodograms

I_{yf}(ω) = (2πT)^{-1} Σ_{t=1}^{T} y_t e^{itω} Σ_{s=1}^{T} f_s e^{-isω},  ω = 0, 2π/T, 4π/T, ..., 2π(T - 1)/T,  (3.6)

we have approximately:
where φ(ω) is the spectral density associated with A. This approximation is based
on an approximation of A by a circular matrix. [See Amemiya and Fuller (1967,
p. 527).]
Hannan proves the strong consistency of his non-linear spectral estimator
obtained by minimizing the right-hand side of (3.7) under the assumptions (2.6),
T^{-1} Σ_t f_t(c_1) f_{t+s}(c_2) converges uniformly in c_1, c_2 ∈ B for every integer s.  (3.8)
Note that this is a generalization of the assumption (2.11). However, the assump-
tion (3.8) is merely sufficient and not necessary. Hannan shows that in the model
assumption (3.8) does not hold and yet b is strongly consistent if we assume (3.4)
and 0 < /?a < T. In fact, T(fi - &) converges to zero almost surely in this case.
In proving the asymptotic normality of his estimator, Hannan needs to gener-
alize (2.20) and (2.21) as follows:
and
Using (3.10), (3.11), and (3.12), as well as the assumptions needed for consistency,
Hannan proves

lim T^{-1} Σ_t (∂f_t/∂β)(∂f_{t+s}/∂β') = ∫_{-π}^{π} e^{isω} dF(ω),  (3.14)

In the model (3.9), assumptions (3.10) and (3.11) are not satisfied; nevertheless,
Hannan shows that the asymptotic normality holds if one assumes (3.4) and
0 < β_0 < π. In fact, √T(β̂ - β_0) → normal in this case.
An interesting practical case is where φ(ω) = ĝ(ω), where ĝ(ω) is a con-
sistent estimator of the spectral density g(ω) of {u_t}. I will denote this estimator by β̂(ĝ). Hannan proves
that β̂(ĝ) and β̂(Σ) have the same asymptotic distribution if g(ω) is a rational
spectral density.
Gallant and Goebel (1976) propose a NLGLS estimator of the autocorrelated
model which is constructed in the time domain, unlike Hannan's spectral estima-
tor. In their method, they try to take account of the autocorrelation of {u_t} by
fitting the least squares residuals û_t to an autoregressive model of a finite order.
Thus, their estimator is a non-linear analogue of the generalized least squares
estimator analyzed in Amemiya (1973a).
The Gallant-Goebel estimator is calculated in the following steps. (1) Obtain
the NLLS estimator β̂. (2) Calculate û = y - f(β̂). (3) Assume that {u_t} follow an
autoregressive model of a finite order and estimate the coefficients by the least
squares regression of û_t on û_{t-1}, û_{t-2}, .... (4) Let Σ be the variance-covariance
matrix of u obtained under the assumption of an autoregressive model. Then we
can find a lower triangular matrix R such that Σ^{-1} = R'R, where R depends on
the coefficients of the autoregressive model.⁶ Calculate R̂ using the estimates of
the coefficients obtained in Step (3) above. (5) Finally, minimize [R̂(y - f)]'
[R̂(y - f)] to obtain the Gallant-Goebel estimator.
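The five steps can be sketched for a first-order autoregressive error process; the model, sample size, AR order, and the Prais-Winsten form of the first row of R are assumptions of the sketch:

```python
import numpy as np

# Assumed model y_t = b1*exp(b2*x_t) + u_t with AR(1) errors.
rng = np.random.default_rng(5)
T = 300
x = np.linspace(0.0, 2.0, T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.6 * u[t - 1] + 0.1 * rng.standard_normal()
y = 2.0 * np.exp(0.5 * x) + u

f = lambda b: b[0] * np.exp(b[1] * x)
Gm = lambda b: np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])

def gauss_newton(b, resid, jac, iters=50):
    # Gauss-Newton with step halving, used for Steps (1) and (5).
    for _ in range(iters):
        step = np.linalg.lstsq(jac(b), resid(b), rcond=None)[0]
        lam = 1.0
        while np.sum(resid(b + lam * step) ** 2) > np.sum(resid(b) ** 2) and lam > 1e-8:
            lam *= 0.5
        b = b + lam * step
    return b

# Step (1): NLLS.  Step (2): residuals.
b = gauss_newton(np.array([1.5, 0.4]), lambda b: y - f(b), Gm)
uhat = y - f(b)

# Step (3): fit an AR(1) to the residuals by least squares.
rho = (uhat[1:] @ uhat[:-1]) / (uhat[:-1] @ uhat[:-1])

# Step (4): lower triangular R with R'R proportional to Sigma^{-1}
# (Prais-Winsten transformation for an AR(1) process).
R = np.eye(T)
R[0, 0] = np.sqrt(1 - rho ** 2)
for t in range(1, T):
    R[t, t - 1] = -rho

# Step (5): minimize [R(y - f)]'[R(y - f)].
b_gg = gauss_newton(b, lambda b: R @ (y - f(b)), lambda b: R @ Gm(b))
```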
Gallant and Goebel conducted a Monte Carlo study of the model y_t = β_1 e^{β_2 x_t}
+ u_t to compare the performance of four estimators - the NLLS, the
Gallant-Goebel AR1 (based on the assumption of a first-order autoregressive
model), the Gallant-Goebel AR2, and Hannan's β̂(ĝ) - when the true distribu-
tion of {u_t} is i.i.d., AR1, AR2, or MA4 (a fourth-order moving average process).
Their major findings were as follows. (1) The Gallant-Goebel AR2 was not much
better than the AR1 version. (2) The Gallant-Goebel estimators performed far
better than the NLLS estimator and a little better than Hannan's β̂(ĝ), even
when the true model was MA4 - the situation most favorable to Hannan. They
think the reason for this is that in many situations an autoregressive model
produces a better approximation of the true autocovariance function than the
circular approximation upon which Hannan's spectral estimator is based. They
(3.15)
where {u_t} follow a first-order autoregressive process, using the maximum likeli-
hood estimator assuming normality.
3.2. Heteroscedastic errors

White (1980a) considers a model which differs from the standard non-linear
regression model in that {x_t} are regarded as vector random variables distributed
independently of {u_t} and that {x_t, u_t} are serially independent but not identically
distributed. White is especially interested in the "stratified sample" case where
Vu_t = σ_1² for 1 ≤ t ≤ T_1, Vu_t = σ_2² for T_1 < t ≤ T_2, ....
For his model White first considers the non-linear weighted least squares
estimator which minimizes ΣW_t(y_t - f_t)² ≡ Q_T(β), where the weights {W_t} are
bounded constants.
A major difference between his proof of consistency and the one employed in
Section 2.2 is that he must account for the possibility that plim T^{-1}Q_T(β) may
not exist due to the heteroscedasticity of {u_t}. [See White (1980a, p. 728) for an
example of this.] Therefore, instead of proving that plim T^{-1}Q_T(β) attains its
minimum at β_0, as was done in Section 2.2, White proves that plim T^{-1}[Q_T(β) -
EQ_T(β)] = 0 and that there exists T_0 such that for any neighborhood N(β_0)
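A sketch of the non-linear weighted least squares estimator for a two-stratum case, with assumed variances and weights W_t = 1/σ_t²; the model and all numbers are illustrative:

```python
import numpy as np

# Two variance strata (assumed): sigma = 0.05 for the first half of the
# sample and 0.3 for the second half.
rng = np.random.default_rng(6)
T = 200
x = np.linspace(0.0, 2.0, T)
sig = np.where(np.arange(T) < T // 2, 0.05, 0.3)
y = 2.0 * np.exp(0.5 * x) + sig * rng.standard_normal(T)

f = lambda b: b[0] * np.exp(b[1] * x)
Gm = lambda b: np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])
W = 1.0 / sig ** 2                            # bounded constant weights
Q = lambda b: np.sum(W * (y - f(b)) ** 2)     # weighted sum of squares

# Weighted Gauss-Newton: scale the rows of G and the residual by sqrt(W_t).
b = np.array([1.5, 0.4])
sw = np.sqrt(W)
for _ in range(50):
    step = np.linalg.lstsq(Gm(b) * sw[:, None], sw * (y - f(b)), rcond=None)[0]
    lam = 1.0
    while Q(b + lam * step) > Q(b) and lam > 1e-8:
        lam *= 0.5
    b = b + lam * step
```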
4. Multivariate models
y_it = f_i(x_it, β_i0) + u_it,  i = 1, 2, ..., N;  t = 1, 2, ..., T.  (4.1)

Sometimes I will write f_i(x_it, β_i0) more simply as f_it(β_i0) or just f_it. Defining
N-vectors y_t = (y_1t, y_2t, ..., y_Nt)', f_t = (f_1t, f_2t, ..., f_Nt)', and u_t = (u_1t, u_2t, ..., u_Nt)',
we can write (4.1) as
i bt-ftW’bt-ft) (4.3)
t=l
for some matrix A. Let β̂_n be the n-th round estimator of the Zellner iteration.
Then,

β̂_1 = β̂(I),
β̂_{n+1} = β̂(Σ̂_n),  n = 1, 2, ...,  (4.4)

where
Note that in each step of the Zellner iteration we must use some other iterative
method to minimize the minimand (4.3). For this minimization, Malinvaud
(1970b) suggests a multivariate version of the Gauss-Newton iteration,
Beauchamp and Cornell (1966) recommend a multivariate version of Hartley’s
iteration, and Gallant (1975d) shows how to transform the problem to a uni-
variate problem so that standard univariate programs found at most computing
centers can be used.
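The Zellner iteration can be sketched for an assumed two-equation model (one equation non-linear, one linear): each round minimizes the A^{-1}-weighted sum of squares (4.3) by Gauss-Newton and then re-estimates A from the residuals, as in (4.4). All specifics are assumptions of the sketch:

```python
import numpy as np

# Assumed model: y1 = b1*exp(b2*x) + u1, y2 = b3 + b4*x + u2, with the
# errors correlated across the two equations.
rng = np.random.default_rng(7)
T = 300
x = np.linspace(0.0, 2.0, T)
L = np.linalg.cholesky(np.array([[0.04, 0.03], [0.03, 0.09]]))
U = rng.standard_normal((T, 2)) @ L.T
Y = np.column_stack([2.0 * np.exp(0.5 * x), 1.0 + 3.0 * x]) + U

def F(b):
    return np.column_stack([b[0] * np.exp(b[1] * x), b[2] + b[3] * x])

def Gblocks(b):
    # Derivative matrices of the two equations w.r.t. all four parameters.
    g1 = np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x),
                          np.zeros(T), np.zeros(T)])
    g2 = np.column_stack([np.zeros(T), np.zeros(T), np.ones(T), x])
    return [g1, g2]

def wsse(b, Ainv):
    Rm = Y - F(b)
    return np.sum((Rm @ Ainv) * Rm)      # sum_t r_t' Ainv r_t, i.e. (4.3)

def gn_step(b, Ainv):
    # One Gauss-Newton step on the weighted sum of squares, with halving.
    Gs, Rm = Gblocks(b), Y - F(b)
    M = sum(Ainv[i, j] * Gs[i].T @ Gs[j] for i in range(2) for j in range(2))
    v = sum(Ainv[i, j] * Gs[i].T @ Rm[:, j] for i in range(2) for j in range(2))
    step = np.linalg.solve(M, v)
    lam = 1.0
    while wsse(b + lam * step, Ainv) > wsse(b, Ainv) and lam > 1e-8:
        lam *= 0.5
    return b + lam * step

b = np.array([1.5, 0.4, 0.5, 2.5])
A = np.eye(2)                             # round 1 uses beta_hat(I)
for _ in range(5):                        # Zellner rounds
    Ainv = np.linalg.inv(A)
    for _ in range(30):                   # inner minimization of (4.3)
        b = gn_step(b, Ainv)
    A = (Y - F(b)).T @ (Y - F(b)) / T     # re-estimate A from residuals
```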
The consistency and the asymptotic normality of β̂(A) for a fixed A can be
proved by a straightforward modification of the proofs of Section 2.2. Gallant
(1975d) proves that β̂(Σ̂) has the same asymptotic distribution as β̂(Σ) if Σ̂ is a
consistent estimator of Σ. In particular, his result means that the second-round
estimator β̂_2 of the Zellner iteration (4.4) has the same asymptotic distribution as
β̂(Σ) - a result analogous to the linear case. Gallant also generalizes another
well-known result in the linear SUR model and proves that the asymptotic
distributions of β̂(I) and β̂(Σ) are the same if {x_it} are the same for all i and
f_i(x_it, β_i) has the same functional form for all i, provided that there are no
constraints among the {β_i}.⁷
By a Monte Carlo study of a two-equation model, Gallant (1975d) finds that an
estimate of the variance of the estimator calculated from the asymptotic formula
tends to underestimate the true variance and recommends certain corrections.
If u_t ~ N(0, Σ), the concentrated log-likelihood function can be written as
⁸In the linear SUR model, the least squares and the generalized least squares estimators are
identically equal for every finite sample if the same conditions are met.
sufficiently large and if β̂₁ (or any other initial estimate) is sufficiently close to β̂.
Therefore, Phillips' proof is of the same nature as Jennrich's (1969) proof of the
asymptotic stability of the Gauss-Newton iteration, discussed in Section 2.3.
Since (4.3) is minimized at each step of the Zellner iteration (4.4), the iteration
can also be defined implicitly by the normal equation
But, since
the quasi-MLE β̂ is a stationary point of the iteration (4.7); that is,

H_T(β̂, β̂) = 0. (4.9)

Phillips proves that (4.7) defines a unique function β̂_{n+1} = A(β̂_n) by showing that
the mapping (a, b) → (z, w) defined by z = H_T(a, b) and w = b has a Jacobian which
is a P-matrix (every principal minor is positive) and therefore is one-to-one by a
theorem of Gale and Nikaido (1965). The rest of the proof of the asymptotic
stability proceeds similarly to the arguments following (2.42) in Section 2.3.
In the discussion above we have assumed that {u_t} are i.i.d. When {u_t} follow a
vector stationary process, we have a multivariate generalization of the model of
Section 3.1. Robinson (1972) considered such a model and extended the results of
Hannan (1971) discussed above.
There are many empirical applications of the Zellner iteration in multivariate
non-linear models, especially in the analysis of expenditures. I will mention a few
representative papers. In all of these works, the dependent variables are expenditure
shares, which means that the sum of the dependent variables across commodity
groups (i = 1,2,...,N) is unity. Therefore, (quasi) maximum likelihood
estimation is applied to N − 1 equations after dropping one equation. [See Theil
(1971, p. 275) regarding when this procedure is justified.]
Deaton (1974) estimates various expenditure models (such as the Rotterdam
model, the linear expenditure system, and the direct addilog system), some of
which lead to multivariate (N = 9) non-linear regression models with linear
constraints on the parameters, using Marquardt’s modification of the
this by considering (2.9) and noting that plim A₂ ≠ 0 in general because f_t may be
correlated with u_t in the model (5.1) due to a possible dependence of Y_t on u_t. In
this section I will consider how we can generalize the two-stage least squares
(2SLS) method to the non-linear model (5.1) so that we can obtain a consistent
estimator.
The case where the non-linearity of f occurs only in α does not pose much of a
problem. Such a case will occur, for example, if the variables of a linear structural
equation are transformed to take account of the autocorrelation of the error
terms. The model which is non-linear only in parameters can be written in vector
form as
We can generalize the two-stage least squares estimator in this model using either
Theil's interpretation of 2SLS [Theil (1953)] or the instrumental variables (I.V.)
interpretation. Suppose the reduced form for Y is given by

Y = XΠ + V. (5.3)
Next, I consider the case where f is non-linear only in variables. Let F_t(Y_t, X_t) be
a vector-valued function and let F be the matrix whose tth row is equal to F_t′.
Then, the present case can be written as
We will assume that the reduced form for F_t is not linear in X_t, for then the model
is reduced to a linear model. Eq. (5.3) may or may not hold. This model is more
problematical than the model (5.2). Here, the estimator obtained according to
Theil’s interpretation is no longer consistent and the one based on the I.V.
interpretation is consistent but no longer asymptotically efficient.
The following simple example illustrates why the application of Theil's
interpretation does not work in the present situation.⁸ Suppose a structural equation is

y_t = γz_t² + u_t (5.5)

z_t = x_t + v_t. (5.6)

Note that I have simplified the matter by assuming that the reduced form
coefficient is known. Inserting (5.6) into (5.5) yields

y_t = γx_t² + γσ² + (u_t + 2γx_t v_t + γv_t² − γσ²), (5.7)

where σ² denotes the variance of v_t and the composite error term in parentheses
has zero mean. Since the application of Theil's interpretation to (5.5) means
regressing y_t on x_t² without a constant term, (5.7) clearly demonstrates that the
resulting estimator is inconsistent.
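The inconsistency is easy to verify by simulation. In this hypothetical design (γ = 1, x_t uniform on (−1, 1), unit-variance v_t; all choices illustrative), the no-constant regression of y_t on x_t² converges to γ(Ex⁴ + σ²Ex²)/Ex⁴ rather than to γ:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200_000
gamma = 1.0
x = rng.uniform(-1.0, 1.0, T)
v = rng.standard_normal(T)                 # Var(v) = sigma^2 = 1
u = rng.standard_normal(T)
z = x + v
y = gamma * z**2 + u

# Theil's interpretation: regress y_t on x_t^2 without a constant term
gamma_theil = np.sum(x**2 * y) / np.sum(x**4)

# probability limit implied by (5.7): gamma (E x^4 + sigma^2 E x^2) / E x^4,
# with E x^4 = 1/5 and E x^2 = 1/3 for x ~ U(-1, 1)
plim = gamma * (1/5 + 1.0 * (1/3)) / (1/5)
print(gamma_theil, plim)
```

The omitted γσ² term in (5.7) acts as an intercept that the no-constant regression forces onto the slope.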
That the estimator obtained according to the I.V. interpretation in the model
(5.4) is not fully efficient can be seen by noting that the reduced form for F_t is not
linear in X_t. This suggests that one may hope to obtain a more efficient estimator
by premultiplying (5.4) by W(W′W)⁻¹W′, where the tth row of W, denoted W_t′,
is a vector-valued function of X_t such that the linear dependence of F_t on W_t is
greater than that of F_t on X_t. Thus, in the present situation it is useful to consider
the class of NL2S estimators with varying values for W. The elements of W_t will
be called instrumental variables.
Goldfeld and Quandt (1968) were the first to consider NL2S estimators in
simultaneous equations models non-linear in variables. One of the two models
they analyzed is given by⁹
⁸This statement should not be construed as a criticism of Theil's interpretation. I know of at least
six more interpretations of 2SLS: a certain interpretation works better than others when one tries to
generalize 2SLS in a certain way. Thus, the more interpretations, the better.
9The subscript “0” indicating the true value is suppressed to simplify the notation.
consider only the estimation of the parameters of (5.8) for the time being. Note
that in this model y_2t does not have a linear reduced form like (5.3). Goldfeld and
Quandt compared four estimators of (5.8) by a Monte Carlo study: (1) least
squares, (2) NL2S where W_t′ = (1, x_t), (3) NL2S where W_t′ = (1, x_t, x_t²), and (4)
maximum likelihood.¹⁰ The ranking of the estimators turned out to be (2), (4), (3),
and (1). However, the top ranking of (2) was later disclaimed by Goldfeld and
Quandt (1972) as a computational error, since they found this estimator to be
inconsistent. This was also pointed out by Edgerton (1972). In fact, the
consistency of NL2S requires that the rank of W be at least equal to the number
of regression parameters to be estimated, as I will explain below. Goldfeld and
Quandt (1968) also tried the Theil interpretation and discarded it since it gave
poor results, as they expected.
Kelejian (1971) points out the consistency of the NL2S estimator using a W of
sufficient rank and recommends that powers of the exogenous variables as well as
the original exogenous variables be used as the instrumental variables to form W.
Kelejian also shows the inconsistency of the estimator obtained by Theil's
interpretation. Edgerton (1972) also noted the consistency of NL2S. See
Strickland and Weiss (1976) and Rice and Smith (1977) for applications of the
NL2S estimator using Kelejian's recommendation. The former estimates a
three-equation model and the latter an eleven-equation model. Rice and Smith
use other statistical techniques in conjunction with NL2S: a correction for the
autocorrelation of the error terms and the use of principal components in the
definition of W.
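A minimal sketch of Kelejian's recommendation, with an illustrative data-generating process (the quadratic structural equation, the instrument powers, and all constants are assumptions; because f is linear in α here, the NL2S minimizer is closed-form):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50_000
alpha = 0.8
x = rng.uniform(0.5, 2.0, T)                 # exogenous variable
e = rng.standard_normal(T)
y2 = x + e                                   # endogenous regressor
y1 = alpha * y2**2 + e + 0.5 * rng.standard_normal(T)  # error correlated with y2

# instruments: the exogenous variable and its powers (Kelejian's recommendation)
W = np.column_stack([np.ones(T), x, x**2, x**3, x**4])

def P_apply(v):
    """Apply the projection P_W = W(W'W)^{-1}W' without forming a T x T matrix."""
    return W @ np.linalg.solve(W.T @ W, W.T @ v)

# with f = y1 - alpha * y2^2 the NL2S minimand (y - f)'P_W(y - f)
# is quadratic in alpha, so the minimizer is closed-form
g = y2**2
alpha_nl2s = (g @ P_apply(y1)) / (g @ P_apply(g))
alpha_ols = (g @ y1) / (g @ g)               # least squares, inconsistent here
print(alpha_nl2s, alpha_ols)
```

The NL2S estimate converges to the true α while the least squares estimate keeps a bias from the correlation between y2 and the error.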
Amemiya (1974) considered the general model (5.1) and defined the class of NL2S
estimators as the value of α that minimizes
¹⁰Goldfeld and Quandt state that they generated u's according to the normal distribution. But their
model is not well-defined unless the domain of the u's is somehow restricted. Thus, I must interpret the
distribution they used as the truncated normal distribution. This means that the maximum likelihood
derived under normality is not the genuine maximum likelihood.
normality of the NL2S estimator by standard techniques similar to the ones used
in Section 2.2.
I will give an intuitive proof of the consistency by writing the expression
corresponding to (2.9). [The method of proof used in Amemiya (1974) is slightly
different.] Using a Taylor expansion of f(α) around α₀, we obtain the approximation

T⁻¹S_T(α) ≅ T⁻¹u′P_W u − 2T⁻¹u′P_W G(α − α₀) + T⁻¹(α − α₀)′G′P_W G(α − α₀), (5.11)

where P_W = W(W′W)⁻¹W′ and G = ∂f/∂α′ evaluated at α₀.
3 = - 2G’P,u (5.12)
aa ao
and
V = σ² plim T(G′P_W G)⁻¹, (5.14)

whose lower bound over choices of W, σ² plim T(EG′EG)⁻¹, is attained when the
column space of W contains EG.
α = r(θ), (5.16)

where θ is a (K − q)-vector of freely varying parameters,¹¹ and write the asymptotic
distribution in terms of both (5.15) and (5.16).
First, I will give a list of the symbols used in this section:
(5.17)
¹¹See Gallant and Jorgenson (1979, p. 299) for the conditions under which this is possible.
∂h/∂α′ |_{α̂} = Ĥ,

∂h/∂α′ |_{α₀} = H,

∂r/∂θ′ = R,

P₂ = P_W GR(R′G′P_W GR)⁻¹R′G′P_W,
and
SSRD = T[S_T(α̃) − S_T(α̂)] / S_T(α̂). (5.19)
Under the assumption that the alternative hypothesis is “close” to the null
hypothesis or, more rigorously,
α₀ … (5.20)
Gallant and Jorgenson (1979) obtain the following limit distribution of the two
statistics:
HR = 0, (5.22)
which implies
As an application of the SSRD test, Gallant and Jorgenson tested the hypothe-
sis of homogeneity of degree zero of the equation for durables in the two-equation
translog expenditure model of Jorgenson and Lau (1978). In this model, the ratios
of prices to total expenditures in the right-hand side of the regression equations,
as well as the expenditure shares in the left-hand side, are regarded as endogenous
variables; therefore, a non-linear simultaneous equation estimator such as NL2S
must be used to yield consistent estimates.
Y_t′ = X_t′Π + V_t′, (5.25)
Σ = [ σ²    σ₁₂′ ]
    [ σ₁₂   Σ₂₂  ].  (5.26)
We will define X and V as matrices whose tth rows are X_t′ and V_t′, respectively.
Since u and V are jointly normal, we can write

u = VΣ₂₂⁻¹σ₁₂ + ε, (5.27)
In Section 5.1.3 we learned that the consistency of NL2S follows from the fact
that the projection matrix, P_W, removes the stochastic part of G, but that the
"larger" P_W G is, the more efficient the estimator. This suggests that the projection
matrix M_V = I − V(V′V)⁻¹V′ should perform better than P_W since it precisely
eliminates the part of G that depends on V. Thus, if V were known (which is the
same as if Π were known), we could define the estimator that minimizes

(y − f)′M_V(y − f). (5.28)

In order to verify the consistency of the estimator, write T⁻¹ times the minimand
in the same way as in (5.11). Then we get the same expression as the right-hand
side of (5.11), except that P_W is replaced by M_V. Thus, the first condition for
consistency is plim T⁻¹G′M_V u = 0. But because of (5.27) this condition is
equivalent to
for any constant W, so that this estimator is asymptotically more efficient than
any member of the class of NL2S estimators.
¹²However, normality is not necessary. For example, it is sufficient to assume that u = Vα + ε for
some α with V and ε being independent.
However, since this estimator is not practical, I define the modified non-linear
two-stage least squares estimator (MNL2S) as the value of α that minimizes (5.28)
after replacing V with V̂ = Y − XΠ̂, where Π̂ = (X′X)⁻¹X′Y. Thus, MNL2S
minimizes

(y − f)′[I − M_X Y(Y′M_X Y)⁻¹Y′M_X](y − f), (5.31)
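A minimal numerical sketch of the MNL2S minimand (5.31), for a hypothetical design with one endogenous regressor (all functional forms and constants are illustrative; the errors are built so that u = Vα + ε with ε independent of V, the condition of footnote 12):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 20_000
alpha = 0.8
x = rng.uniform(0.5, 2.0, T)
V = rng.standard_normal(T)
y2 = x + V                                   # endogenous; reduced form Y = X Pi + V
eps = 0.5 * rng.standard_normal(T)
y1 = alpha * y2**2 + V + eps                 # u = V * 1 + eps, eps independent of V

X = np.column_stack([np.ones(T), x])
Y = y2[:, None]

def M_X(v):
    """Apply M_X = I - X(X'X)^{-1}X' without forming a T x T matrix."""
    return v - X @ np.linalg.solve(X.T @ X, X.T @ v)

V_hat = M_X(Y)                               # Y - X Pi_hat, the estimated V

def Q(v):
    """Apply I - M_X Y (Y'M_X Y)^{-1} Y'M_X, the weighting matrix in (5.31)."""
    return v - V_hat @ np.linalg.solve(V_hat.T @ V_hat, V_hat.T @ v)

# f = alpha * y2^2 is linear in alpha here, so the minimizer is closed-form
g = y2**2
alpha_mnl2s = (g @ Q(y1)) / (g @ Q(g))
alpha_ols = (g @ y1) / (g @ g)               # inconsistent benchmark
print(alpha_mnl2s, alpha_ols)
```

Projecting out the estimated reduced-form residuals removes the component of the error that is correlated with y2, while least squares stays biased.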
V_M ≤ V_B. (5.33)
Since the Jacobian of the transformation from (u, V) to (y, Y) is unity in our
model, the log-likelihood function assuming normality can be written, apart from
a constant, as
where
… (5.35)

Π̂ = (X′M_u X)⁻¹X′M_u Y. (5.37)

L* = −(T/2)[log u′u + log|Y′M_u Y − Y′M_u X(X′M_u X)⁻¹X′M_u Y|], (5.38)
which depends only on α. Interpreting our model as one which represents the
limited information of the researcher, I call the value of α that maximizes (5.38)
the NLLI estimator. Its asymptotic covariance matrix is given by
(5.39)
and iterate back and forth between (5.37) and (5.41). That is, obtain Π̂ =
(X′X)⁻¹X′Y and V̂ = Y − XΠ̂, maximize (5.41) with respect to α after replacing
V with V̂ (at this point one gets the MNL2S estimator of α), call this estimator α̂₁
and define û = y − f(α̂₁), insert it into (5.37) to obtain another estimator of Π,
and repeat the procedure until convergence.
5.2.3. Comparison
In this and the preceding sections we have discussed four estimators: (1) NL2S
(as a class), (2) BNL2S, (3) MNL2S, and (4) NLLI. If we denote NL2S(W = X)
by SNL2S (the first S stands for standard), we have in the linear case
where → means "is worse than". However, it is important to remember that if the
normality assumption is removed the last two estimators may lose consistency,
whereas the first two estimators retain it.¹³
Quandt (1975) conducted a Monte Carlo experiment to study the comparative
performance of several estimators, including those discussed above, in the
following model:¹⁴
where the right-hand side of (5.47) minus u_t is a linear function of the x's and the
transformations of the x's which the researcher believes best explain y_2t, subject
to the condition that the number of parameters should not be too large.
We will assume that (5.46) is a specific example of (5.25); thus, SNL2S uses
(1, x_1t, x_2t, x_3t) as W_t′ and NLLI is the maximum likelihood estimator applied to
(5.44) and (5.46) assuming the joint normality of u_1t and V_t. Quandt compares the
performance of ordinary least squares (OLS), SNL2S, the NL2S estimators where
W is chosen to be various sets of powers of the x's, MNL2S, NLLI, and the
full information maximum likelihood estimator (NLFI), which is the maximum
¹³In the simple model defined by (5.5) and (5.6), the necessary and sufficient condition for the
consistency of either MNL2S or NLLI can be shown to be Ev²u·Ev² = Ev³·Euv.
¹⁴The same qualification as footnote 10 applies to this model as well.
likelihood estimator applied to the original equations (5.44) and (5.45). Quandt
concludes that NLFI is the best, OLS is the worst, and the rest are more or less
similar, although, to a certain extent, the asymptotic ranking (5.43) is preserved.
f_i(y_t, x_t, α_i) = u_it,  i = 1,2,...,n; t = 1,2,...,T, (5.48)
point out that non-linearity generally helps rather than hampers identification, so
that, for example, the number of excluded exogenous variables in a given
equation need not be greater than or equal to the number of parameters of the
same equation in a non-linear model. I should also point out that I have actually
given one sufficient condition for identifiability: that plim T⁻¹(G′P_W G) in the
right-hand side of (5.14) be non-singular.
To facilitate the discussion of the subsequent sections I will give a list of
symbols:

F = (f_(1), f_(2),...,f_(n)), a T×n matrix,

g_it = ∂f_it/∂α_i, a K_i-vector,

G_i = ∂f_(i)/∂α_i′, a T×K_i matrix whose tth row is g_it′,
Before starting the main discussion I wish to point out that all the results of
Sections 5.1 and 5.2 are valid if we change (5.1) to
Consequently, the minimand (5.10) which defines the class of NL2S estimators
should be changed to

f′W(W′W)⁻¹W′f. (5.51)

The asymptotic normality result (5.14) need not be changed. The significance of
the above modification is that a NL2S estimator can be applied to each equation
of (5.48).¹⁵
As a natural extension of the class of NL2S estimators defined in Section
5.1.3, Jorgenson and Laffont (1974) defined the class of non-linear three-stage
least squares estimators (NL3S) as the value of α that minimizes

f′[Σ̂⁻¹ ⊗ W(W′W)⁻¹W′]f, (5.52)

where Σ̂ is estimated from the residuals based on the NL2S estimator α̂_i obtained
from each equation. The above definition is analogous to the definition of the
linear 3SLS as a generalization of the linear 2SLS. The consistency and the
asymptotic normality of the estimator defined above are proved in Jorgenson and
Laffont (1974) and Gallant (1977).
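The minimand (5.52) can be sketched for a hypothetical two-equation system (functional forms, instruments, and constants are illustrative assumptions; since each f_i is linear in its own parameter, the first-order conditions reduce to a 2×2 linear system):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 20_000
a_true = np.array([0.8, 1.5])
x = rng.uniform(0.5, 2.0, T)
U = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], T)
y3 = x + 0.5 * (U[:, 0] + U[:, 1])           # endogenous variable
y1 = a_true[0] * y3**2 + U[:, 0]             # f_1 = y1 - a1 * y3^2
y2 = a_true[1] * y3 + U[:, 1]                # f_2 = y2 - a2 * y3

W = np.column_stack([np.ones(T), x, x**2])   # instruments

def P(v):
    """Apply P_W = W(W'W)^{-1}W' without forming a T x T matrix."""
    return W @ np.linalg.solve(W.T @ W, W.T @ v)

# first stage: NL2S equation by equation, to estimate Sigma from residuals
g1, g2 = y3**2, y3
b1 = (g1 @ P(y1)) / (g1 @ P(g1))
b2 = (g2 @ P(y2)) / (g2 @ P(g2))
R = np.column_stack([y1 - b1 * g1, y2 - b2 * g2])
S_inv = np.linalg.inv(R.T @ R / T)           # Sigma_hat^{-1}

# NL3S minimand f'[Sigma_hat^{-1} (x) P_W]f; with each f_i linear in a_i the
# first-order conditions are a 2 x 2 linear system in (a1, a2)
Pg1, Pg2, Py1, Py2 = P(g1), P(g2), P(y1), P(y2)
A = np.array([[S_inv[0, 0] * (g1 @ Pg1), S_inv[0, 1] * (g1 @ Pg2)],
              [S_inv[1, 0] * (g2 @ Pg1), S_inv[1, 1] * (g2 @ Pg2)]])
b = np.array([S_inv[0, 0] * (g1 @ Py1) + S_inv[0, 1] * (g1 @ Py2),
              S_inv[1, 0] * (g2 @ Py1) + S_inv[1, 1] * (g2 @ Py2)])
a_nl3s = np.linalg.solve(A, b)
print(a_nl3s)
```

The cross-equation weighting by Σ̂⁻¹ is what distinguishes NL3S from applying NL2S equation by equation.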
The consistency of the NL2S and NL3S estimators of the parameters of the
model (5.48) can be proved with minimal assumptions on u_it, namely those
stated after (5.48). This robustness makes the estimators attractive. Another
important strength of the estimators is that they retain their consistency
regardless of whether or not (5.48) yields a unique solution for y_t and, in the case of
multiple solutions, regardless of what additional mechanism chooses a unique
solution. See MaCurdy (1980) for an interesting discussion of this point.
Amemiya (1977) defined the class of NL3S estimators slightly more
generally as the value of α that minimizes

f′A⁻¹S(S′A⁻¹S)⁻¹S′A⁻¹f, (5.54)

with asymptotic covariance matrix

V = plim T[G′A⁻¹S(S′A⁻¹S)⁻¹S′A⁻¹G]⁻¹. (5.55)
¹⁵Another advantage of this modification is that the Box-Cox transformation model (y_t^λ − 1)/λ =
β′x_t + u_t [see Box and Cox (1964)] can be regarded as a special case of (5.50). See Amemiya and
Powell (1980) for the application of NL2S to the Box-Cox model.
which is attained when one chooses S = EG. I will call this estimator the BNL3S
estimator (B for best).
We can also attain the lower bound (5.56) using the Jorgenson-Laffont
definition, but that is possible if and only if the space spanned by the column
vectors of W contains the union of the spaces spanned by the column vectors of
EG_i for i = 1,2,...,n. This necessitates including many columns in W, which is
likely to increase the finite sample variance of the estimator although it has no
effect asymptotically. This is a disadvantage of the Jorgenson-Laffont definition
compared to my definition.
Noting that BNL3S is not practical just as BNL2S is not, Amemiya (1976) suggests
the following approximation. (1) Compute α̂_i, an SNL2S estimator of α_i, i =
1,2,...,n. (2) Evaluate G_i at α̂_i; call it Ĝ_i. (3) Treat Ĝ_i as the dependent variables
of a regression and search for the optimal set of independent variables W_i that
best predict Ĝ_i. (4) Choose S = diag{P₁Ĝ₁, P₂Ĝ₂,...,P_nĜ_n}, where P_i =
W_i(W_i′W_i)⁻¹W_i′.
In Section 5.1.4 I discussed tests of hypotheses based on NL2S developed by
Gallant and Jorgenson (1979). These tests can easily be generalized to the tests
based on NL3S, as shown by the same authors.
Let α̂ be the NL3S estimator here and let Σ̂ be a consistent estimate of Σ. Also,
let S_T(α) refer to the NL3S minimand and let α̃ refer to the constrained NL3S
estimator subject to the condition (5.15) or (5.16). Then, the Wald and SSRD test
statistics can now be defined as
and
SSRD = S_T(α̃) − S_T(α̂), (5.58)
Wald, SSRD → χ²[q, h(α₀)′{H[G′(Σ⁻¹⊗P)G]⁻¹H′}⁻¹h(α₀)], (5.59)
where
P, = (X-‘~P)G[C’(Z-‘~P)G]-‘G’(X-‘eP)
and
P₂ = (Σ⁻¹⊗P)GR[R′G′(Σ⁻¹⊗P)GR]⁻¹R′G′(Σ⁻¹⊗P).
As an application of the SSRD test, Gallant and Jorgenson tested the hypothe-
sis of symmetry of the matrix of parameters in the three-equation translog
expenditure model of Jorgenson and Lau (1975).
The other applications of the NL3S estimators include Jorgenson and Lau
(1978), which was previously mentioned in Section 5.1.4, and Haessel (1976), who
estimated a system of demand equations, non-linear only in parameters, by both
NL2S and NL3S estimators.
Solving ∂L*/∂Σ = 0 for Σ, we get:
The NLFI maximum likelihood estimator of α is defined as the value of α that
maximizes (5.63).
It is shown in Amemiya (1977) that the NLFI estimator is consistent if the true
distribution of u_t is normal but is generally inconsistent if u_t is not normal.¹⁶ This
result is contrary to the result in the linear case, where the FIML estimator derived
from the normality assumption retains its consistency even if the true distribution
is not normal. It is also shown in the same article that NLFI is asymptotically
more efficient than BNL3S in general if the true distribution of u_t is normal. (On
the other hand, NL3S is more robust than NLFI because NL3S is consistent even
if the distribution of u_t is not normal.) This result is also contrary to the result in
the linear case, where FIML and 3SLS have the same asymptotic distribution. In
the subsequent subsections I will further discuss these results as well as some
other problems related to the NLFI estimator. I will not discuss hypothesis
testing here since the discussion of Section 2.4.2 is applicable to the present
model.
5.5.1. Consistency
where (-)i ’ denotes the ith column of the inverse of the matrix within the
parentheses. The consistency of NLFI is equivalent to the condition:
(5.65)
(5.66)
where σⁱ is the ith column of Σ⁻¹. Now, (5.66) could hold even if each term of one
summation is different from the corresponding term of the other, but that event is
extremely unlikely. Therefore, we can say that the consistency of NLFI is
¹⁶This result is completely separate from and in no way contradicts the quite likely fact that the
maximum likelihood estimator of a non-linear model derived under the assumption of a certain
regular non-normal distribution is consistent if the true distribution is the same as the assumed
distribution.
E ∂g_it/∂u_it = E g_it u_t′σⁱ. (5.67)
where I have suppressed the subscript t from g and u for simplicity. Integrating
both sides of (5.69) with respect to the elements of u other than u_i (this is not
always permissible, but we assume again that certain conditions on g and ψ enable
us to do so)¹⁹ and combining the resulting equation with (5.68), we obtain:
¹⁷This lemma, stated slightly differently, is originally due to Stein (1973). In Amemiya (1977), I
rediscovered this lemma, being unaware of Stein's work. But there was an error in the conditions of
the lemma as presented in my article, which was pointed out by Phillips (1981). A correct proof is
given in Amemiya (1981).
¹⁸If we treat ∂g_i/∂u_i and ψ as functions of u_i alone, it is sufficient to assume: (1) ∂g_i/∂u_i is
continuous, (2) ∫_{−∞}^{∞} |∂g_i/∂u_i| du_i < ∞, and (3) ψ(u_i) is monotone decreasing for |u_i| > M for some
M, and lim_{|u_i|→∞} ψ(u_i) = 0.
¹⁹It is sufficient if all the conditions of footnote 18 hold uniformly with respect to all the other
elements of u.
Now, the question is: Does (5.70) hold for a density ψ(u) other than the normal?
The term within the square brackets in (5.70) is clearly zero if u is normal.
Moreover, we can say "if and only if" in the preceding sentence, provided that we
restrict our attention to the class of continuously differentiable densities ψ, as
proved by White (1980b). However, ∂ψ/∂u_i + u′σⁱψ(u) = 0 is not a necessary
condition for (5.70) to hold, as we have noted in footnote 13 regarding the simple
example of (5.5) and (5.6). This was first noted by Phillips (1981), who gives
another interesting example. His model is defined by
In this example g₁ = 1 and g₂ = e^(u₁−α₁); therefore, (5.70) clearly holds for i = 1 for
any density ψ, and Phillips found a class of densities for which (5.70) holds for
the case i = 2.
What is the significance of these examples? It is that given g_i we can sometimes
find a class of non-normal densities ψ for which (5.70) holds. When the g_i are simple,
as in these examples, we can find a fairly broad class of non-normal densities for
which (5.70) holds. However, if g_i is a more complicated non-linear function of
the exogenous variables and the parameters {α_i} as well as of u, (5.70) can be
made to hold only when we specify a density which depends on the exogenous
variables and the parameters of the model. In such a case, normality can be
regarded, for all practical purposes, as a necessary and sufficient condition for the
consistency of NLFI.
Amemiya (1977) showed that the asymptotic equivalence of NLFI and BNL3S
occurs if and (almost) only if f_it can be written in the form

f_i(y_t, x_t, α_i) = A_i(α_i)′z(y_t, x_t) + B_i(α_i, x_t), (5.73)
(5.74)
where
(5.75)
and Ĝ = diag(Ĝ₁, Ĝ₂,...,Ĝ_n), and where all the variables that appear in the second
term of the right-hand side of (5.74) are evaluated at α̂_(n).
The Gauss-Newton iteration to obtain BNL3S can be written as
(5.76)
complexity and found that NL3S was three to ten times faster. Nevertheless,
Belsley shows that the computation of NLFI is quite feasible and can be
improved by using a more suitable algorithm and by using the approximation of
the Jacobian due to Fair; see eq. (5.79) below.
Fair and Parke (1980) estimated Fair's (1976) macro model (97 equations, 29 of
which are stochastic, with 182 parameters including 12 first-order autoregressive
coefficients), which is non-linear in variables as well as in parameters (the latter
non-linearity caused by the transformation to take account of the first-order
autoregression of the errors), by OLS, SNL2S, the Jorgenson-Laffont NL3S, and
NLFI. The latter two estimators are calculated by a derivative-free algorithm due
to Parke. This algorithm for NLFI uses the approximation of the Jacobian:

Σ_{t=1}^T log|det J_t| ≅ (T/N) Σ_{k=1}^N log|det J_{t_k}|, (5.79)

where J_t = ∂f_t/∂y_t′, N is a small integer, and t₁, t₂,...,t_N are equally spaced
between 1 and T. Fair finds that in terms of predictive accuracy there is not much
difference among the different estimators, but in terms of policy response OLS is set
apart from the rest.
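The accuracy of such an equally-spaced approximation can be checked on a hypothetical sequence of smoothly varying Jacobians (the rank-one form of J_t and the choice N = 10 below are illustrative assumptions, not Fair's model):

```python
import numpy as np

T, n = 1000, 3

def J(t):
    """A hypothetical n x n Jacobian that varies smoothly with t."""
    return np.eye(n) + 0.2 * (t / T) * np.ones((n, n)) / n

# exact sum of log |det J_t| over all T periods
exact = sum(np.log(abs(np.linalg.det(J(t)))) for t in range(1, T + 1))

# Fair's approximation (5.79): N equally spaced periods, scaled by T/N
N = 10
ts = np.linspace(1, T, N).astype(int)
approx = (T / N) * sum(np.log(abs(np.linalg.det(J(t)))) for t in ts)
print(exact, approx)
```

When the Jacobians vary smoothly over time, a handful of sample points reproduces the full sum closely, which is the point of the approximation.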
Bianchi and Calzolari (1980) propose a method by which one can calculate the
mean squared prediction error matrix of a vector predictor based on any
estimator of the non-linear simultaneous equations model. Suppose the structural
equations can be written as f(y_p, x_p, α) = u_p at the prediction period p and we
can solve them for y_p as y_p = g(x_p, α, u_p). Define the predictor ŷ_p based on the
estimator α̂ by ŷ_p = g(x_p, α̂, 0). (Note that y_p is an n-vector.) Then we have
The authors suggest that A₁ be evaluated by simulation. As for A₂, we can easily
obtain its asymptotic value from the knowledge of the asymptotic distribution of
α̂.
Hatanaka (1978) considers a simultaneous equations model non-linear only in
variables. Such a model can be written as F(Y, X)Γ + XB = U. Define Ŷ by
F(Ŷ, X)Γ̂ + XB̂ = 0, where Γ̂ and B̂ are the OLS estimates. Then, Hatanaka
proposes using F(Ŷ, X) as the instruments to calculate 3SLS. He proposes the
References
Akaike, H. (1973) "Information Theory and an Extension of the Maximum Likelihood Principle", in:
B. N. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory. Budapest:
Akademiai Kiado, pp. 267-281.
Amemiya, T. (1973a) “Generalized Least Squares with an Estimated Autocovariance Matrix”,
Econometrica, 41, 723-732.
Amemiya, T. (1973b) "Regression Analysis When the Dependent Variable Is Truncated Normal",
Econometrica, 41, 997-1016.
Amemiya, T. (1974) "The Nonlinear Two-Stage Least-Squares Estimator", Journal of Econometrics, 2,
105-110.
Amemiya, T. (1975) “The Nonlinear Limited-Information Maximum-Likelihood Estimator and the
Modified Nonlinear Two-Stage Least-Squares Estimator”, Journal of Econometrics, 3, 375-386.
Amemiya, T. (1976) "Estimation in Nonlinear Simultaneous Equation Models", Paper presented at
Institut National de la Statistique et des Études Économiques, Paris, March 10 and published in
French in: E. Malinvaud (ed.), Cahiers du Séminaire d'Économétrie, no. 19 (1978).
Amemiya, T. (1977) “The Maximum Likelihood and the Nonlinear Three-Stage Least Squares
Estimator in the General Nonlinear Simultaneous Equation Model”, Econometrica, 45, 955-968.
Amemiya, T. (1980) "Lecture Notes in Advanced Theory of Econometrics", Department of Economics,
Stanford University.
Amemiya, T. (1981) "Correction to a Lemma", forthcoming in Econometrica.
Amemiya, T. and W. A. Fuller (1967) “A Comparative Study of Alternative Estimators in a
Distributed Lag Model”, Econometrica, 35, 509-529.
Amemiya, T. and J. L. Powell (1980) “A Comparison of the Box-Cox Maximum Likelihood Estimator
and the Nonlinear Two Stage Least Squares Estimator”, Technical Report No. 322, August,
Institute for Mathematical Studies in the Social Sciences, Stanford University.
Anderson, T. W. (1958) An Introduction to Multivariate Statistical Analysis. New York: John Wiley &
Sons.
Anderson, T. W. (1971) The Statistical Analysis of Time Series. New York: John Wiley & Sons.
Arrow, K. J., H. B. Chenery, B. S. Minhas and R. M. Solow (1961) "Capital-Labor Substitution and
Economic Efficiency", Review of Economics and Statistics, 43, 225-250.
Bard, Y. (1974) Nonlinear Parameter Estimation. New York: Academic Press.
Bates, D. M. and D. G. Watts (1980) "Relative Curvature Measures of Nonlinearity", Journal of the
Royal Statistical Society, Ser. B, 42, 1-25 (with discussion).
Beale, E. M. L. (1960) “Confidence Regions in Non-Linear Estimation”, Journal of the Royal
Statistical Society, Ser. B, 22, 41-88 (with discussion).
Beauchamp, J. J. and R. G. Cornell (1966) "Simultaneous Nonlinear Estimation", Technometrics, 8,
319-326.
Belsley, D. A. (1979) "On the Computational Competitiveness of Full-Information Maximum-Likelihood
and Three-Stage Least-Squares in the Estimation of Nonlinear Simultaneous-Equations
Models", Journal of Econometrics, 9, 315-342.
Berndt, E. R., W. E. Diewert and M. N. Darrough (1977) "Flexible Functional Forms and
Expenditure Distributions: An Application to Canadian Consumer Demand Functions", International
Economic Review, 18, 651-676.
Berndt, E. R., B. H. Hall, R. E. Hall and J. A. Hausman (1974) "Estimation and Inference in
Nonlinear Structural Models", Annals of Economic and Social Measurement, 3, 653-666.
Berndt, E. R. and N. E. Savin (1977) "Conflict Among Criteria for Testing Hypotheses in the
Multivariate Linear Regression Model", Econometrica, 45, 1263-1278.
Bianchi, C. and G. Calzolari (1980) “The One-Period Forecast Error in Non-linear Econometric
Models”, International Economic Review, 21, 201-208.
Bodkin, R. G. and L. R. Klein (1967) “Nonlinear Estimation of Aggregate Production Functions”,
Review of Economics and Statistics, 49, 28-44.
Box, G. E. P. and D. R. Cox (1964) "An Analysis of Transformations", Journal of the Royal Statistical
Society, Ser. B, 26, 211-252 (with discussion).
Brown, M. and D. Heien (1972) "The S-Branch Utility Tree: A Generalization of the Linear
Expenditure System", Econometrica, 40, 737-747.
Bunke, H., K. Henschke, R. Strüby and C. Wisotzki (1977) "Parameter Estimation in Nonlinear
Regression Models", Mathematische Operationsforschung und Statistik, Series Statistics, 8, 23-40.
Charatsis, E. G. (1971) “A Computer Program for Estimation of the Constant Elasticity of Substitu-
tion Production Function”, Applied Statistics, 20, 286-296.
Chow, G. C. (1973) "On the Computation of Full-Information Maximum Likelihood Estimates for
Nonlinear Equation Systems", Review of Economics and Statistics, 55, 101-109.
Chow, G. C. and R. C. Fair (1973) “Maximum Likelihood Estimation of Linear Equation Systems
with Auto-Regressive Residuals”, Annals of Economic and Social Measurement, 2, 17-28.
Christensen, L. R., D. W. Jorgenson and L. J. Lau (1975) “Transcendental Logarithmic Utility
Functions”, American Economic Review, 65, 367-383.
Cox, D. R. (1961) "Tests of Separate Families of Hypotheses", in: J. Neyman (ed.), Proceedings of the
Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. I. Berkeley: University
of California Press, pp. 105-123.
Cox, D. R. (1962) “Further Results on Tests of Separate Families of Hypotheses”, Journal of the
Royal Statistical Society, Ser. B, 24, 406-424.
Dagenais, M. G. (1978) "The Computation of FIML Estimates as Iterative Generalized Least Squares
Estimates in Linear and Nonlinear Simultaneous Equations Models", Econometrica, 46, 1351-1362.
Darrough, M. N. (1977) “A Model of Consumption and Leisure in an Intertemporal Framework: A
Systematic Treatment Using Japanese Data”, International Economic Review, i8, 677-696.
Davidon, W. C. (1959) “Variable Metric Method for Minimization”, AEC Research and Development
Report, ANL-5990.
Deaton, A. S. (1974) “The Analysis of Consumer Demand in the United Kingdom, 1900-1970”,
Econometrica, 42, 341-368.
Diewert, W. E. (1973) “Separability and a Generalization of the Cobb-Douglas Cost, Production, and
Indirect Utility Functions”, Technical Report 86, January, Institute for Mathematical Studies in the
Social Sciences, Stanford University.
Diewert, W. E. (1974) “Applications of Duality Theory”, in: M. Intriligator and D. Kendrick (eds.),
Frontiers of Quantitative Economics, vol. II. Amsterdam: North-Holland Publishing Co., pp.
106-171.
Draper, N. R. and H. Smith (1966) Applied Regression Analysis. New York: John Wiley & Sons.
Edgerton, D. L. (1972) “Some Properties of Two Stage Least Squares as Applied to Non-Linear
Models”, International Economic Review, 13, 26-32.
Eisenpress, H. and J. Greenstadt (1966) “The Estimation of Nonlinear Econometric Systems”,
Econometrica, 34, 851-861.
Fair, R. C. (1976) A Model of Macroeconomic Activity, Vol. II: The Empirical Model. Cambridge,
Mass.: Ballinger.
Fair, R. C. and W. R. Parke (1980) “Full-Information Estimates of a Non-linear Macroeconometric
Model”, Journal of Econometrics, 13, 269-291.
Fisher, F. M. (1966) The Identification Problem in Econometrics. New York: McGraw-Hill.
Fletcher, R. and M. J. D. Powell (1963) “A Rapidly Convergent Descent Method for Minimization”,
Computer Journal, 6, 163-168.
Gale, D. and H. Nikaido (1965) “The Jacobian Matrix and Global Univalence of Mappings”,
Mathematische Annalen, 159, 81-93.
Gallant, A. R. (1975a) “Nonlinear Regression”, The American Statistician, 29, 73-81.
Gallant, A. R. (1975b) “The Power of the Likelihood Ratio Test of Location in Nonlinear Regression
Models”, Journal of the American Statistical Association, 70, 198-203.
Gallant, A. R. (1975c) “Testing a Subset of the Parameters of a Nonlinear Regression Model”,
Journal of the American Statistical Association, 70, 927-932.
Just, R. E. and R. D. Pope (1978) “Stochastic Specification of Production Functions and Economic
Implications”, Journal of Econometrics, 7, 67-86.
Kelejian, H. H. (1971) “Two-Stage Least Squares and Econometric Systems Linear in Parameters but
Nonlinear in the Endogenous Variables”, Journal of the American Statistical Association, 66,
373-374.
Kelejian, H. H. (1974) “Efficient Instrumental Variable Estimation of Large Scale Nonlinear
Econometric Models”, mimeo.
Kelejian, H. H. (1975) “Nonlinear Systems and Non-Unique Solutions: Some Results Concerning
Estimation”, mimeo., May (Revised).
MacKinnon, J. G. (1976) “Estimating the Linear Expenditure System and Its Generalizations”, in:
S. M. Goldfeld and R. E. Quandt (eds.), Studies in Nonlinear Estimation. Cambridge, Mass.:
Ballinger, pp. 143-166.
MaCurdy, T. E. (1980) “An Intertemporal Analysis of Taxation and Work Disincentives”, Working
Papers in Economics no. E-80-4, The Hoover Institution, Stanford University.
Malinvaud, E. (1970a) “The Consistency of Nonlinear Regressions”, Annals of Mathematical Statis-
tics, 41, 956-969.
Malinvaud, E. (1970b) Statistical Methods of Econometrics (2nd rev. edn.). Amsterdam: North-Holland
Publishing Co.
Marquardt, D. W. (1963) “An Algorithm for Least Squares Estimation of Nonlinear Parameters”,
Journal of the Society for Industrial and Applied Mathematics, 11, 431-441.
Mizon, G. E. (1977) “Inference Procedures in Nonlinear Models: An Application in a UK Industrial
Cross Section Study of Factor Substitution and Returns to Scale”, Econometrica, 45, 1221-1242.
Phillips, P. C. B. (1976) “The Iterated Minimum Distance Estimator and the Quasi-Maximum
Likelihood Estimator”, Econometrica, 44, 449-460.
Phillips, P. C. B. (1981) “On the Consistency of Non-Linear FIML”, mimeo.
Poirier, D. J. (1976) The Econometrics of Structural Change. Amsterdam: North-Holland Publishing
Co.
Powell, M. J. D. (1964) “An Efficient Method for Finding the Minimum of a Function of Several
Variables Without Calculating Derivatives”, Computer Journal, 7, 155-162.
Priestley, M. B. (1978) “Non-Linear Models in Time Series Analysis”, The Statistician, 27, 159-176.
Quandt, R. E. (1975) “A Note on Amemiya’s Nonlinear Two-Stage Least Squares Estimators”,
Research Memorandum no. 178, May, Econometric Research Program, Princeton University.
Rao, C. R. (1947) “Large Sample Tests of Statistical Hypotheses Concerning Several Parameters with
Applications to Problems of Estimation”, Proceedings of Cambridge Philosophical Society, 44, 50-57.
Rao, C. R. (1973) Linear Statistical Inference and Its Applications (2nd edn.). New York: John Wiley &
Sons.
Revankar, N. S. (1971) “A Class of Variable Elasticity of Substitution Production Functions”,
Econometrica, 39, 61-71.
Rice, P. and V. K. Smith (1977) “An Econometric Model of the Petroleum Industry”, Journal of
Econometrics, 6, 263-288.
Robinson, P. M. (1972) “Non-Linear Regression for Multiple Time-Series”, Journal of Applied
Probability, 9, 758-768.
Sargent, T. J. (1978) “Estimation of Dynamic Labor Demand Schedules Under Rational Expectations”,
Journal of Political Economy, 86, 1009-1044.
Silvey, S. D. (1959) “The Lagrangian Multiplier Test”, Annals of Mathematical Statistics, 30, 389-407.
Stein, C. (1973) “Estimation of the Mean of a Multivariate Normal Distribution”, Technical Report
no. 48, June 26, Department of Statistics, Stanford University.
Strickland, A. D. and L. W. Weiss (1976) “Advertising, Concentration, and Price-Cost Margins”,
Journal of Political Economy, 84, 1109-1121.
Theil, H. (1953) “Repeated Least-Squares Applied to Complete Equation Systems”, mimeo. The
Hague: Central Planning Bureau.
Theil, H. (1971) Principles of Econometrics. New York: John Wiley & Sons.
Tornheim, L. (1963) “Convergence in Nonlinear Regression”, Technometrics, 5, 513-514.
Tsurumi, H. (1970) “Nonlinear Two-Stage Least Squares Estimation of CES Production Functions
Applied to the Canadian Manufacturing Industries”, Review of Economics and Statistics, 52,
200-207.
Wald, A. (1943) “Tests of Statistical Hypotheses Concerning Several Parameters When the Number of
Observations is Large”, Transactions of the American Mathematical Society, 54, 426-482.
Wald, A. (1949) “Note on the Consistency of the Maximum Likelihood Estimate”, Annals of
Mathematical Statistics, 20, 595-601.
White, H. (1980a) “Nonlinear Regression on Cross-Section Data”, Econometrica, 48, 721-746.
White, H. (1980b) “A Note on Normality and the Consistency of the Nonlinear Simultaneous
Equations Maximum Likelihood Estimator”, May, mimeo.
Wolter, K. M. and W. A. Fuller (1978) “Estimation of Nonlinear Errors-in-Variables Models”,
mimeo.
Wu, C. F. (1981) “Asymptotic Theory of Nonlinear Least Squares Estimation”, Annals of Statistics, 9,
501-513.
Zellner, A. (1962) “An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for
Aggregation Bias”, Journal of the American Statistical Association, 57, 348-368.
Zellner, A., D. S. Huang and L. C. Chau (1965) “Further Analysis of the Short-Run Consumption
Function with Emphasis on the Role of Liquid Assets”, Econometrica, 33, 571-581.