Advanced Econometrics - 1985 - 1st Edition - Amemiya

Takeshi Amemiya
In many parts of the book, and in all of Chapters 3 and 4, I have used the
theorem-proof format and have attempted to develop all the mathematical
results rigorously. However, it has not been my aim to present theorems in full
mathematical generality. Because I intended this as a textbook rather than as a
monograph, I chose assumptions that are relatively easy to understand and
that lead to simple proofs, even in those instances where they could be relaxed.
This will enable readers to understand the basic structure of each theorem and
to generalize it for themselves depending on their needs and abilities. Many
simple applications of theorems are given either in the form of examples in the
text or in the form of exercises at the end of each chapter to bring out the
essential points of each theorem.
Although this is a textbook in econometrics methodology, I have included
discussions of numerous empirical papers to illustrate the practical use of
theoretical results. This is especially conspicuous in the last three chapters of
the book.
Too many people have contributed to the making of this book through the
many revisions it has undergone to mention all their names. I am especially
grateful to Trevor Breusch, Hidehiko Ichimura, Tom MaCurdy, Jim Powell,
and Gene Savin for giving me valuable comments on the entire manuscript. I
am also indebted to Carl Christ, Art Goldberger, Cheng Hsiao, Roger
Koenker, Tony Lancaster, Chuck Manski, and Hal White for their valuable
comments on parts of the manuscript. I am grateful to Colin Cameron, Tom
Downes, Harry Paarsch, Aaron Han, and Choon Moon for proofreading and
to the first three for correcting my English. In addition, Tom Downes and
Choon Moon helped me with the preparation of the index. Dzung Pham has
typed most of the manuscript through several revisions; her unfailing patience
and good nature despite many hours of overtime work are much appreciated.
David Criswell, Cathy Shimizu, and Bach-Hong Tran have also helped with
the typing. The financial support of the National Science Foundation for the
research that produced many of the results presented in the book is gratefully
acknowledged. Finally, I am indebted to the editors of the Journal of Economic
Literature for permission to include in Chapter 9 parts of my article
entitled "Qualitative Response Models: A Survey" (Journal of Economic
Literature 19:1483-1536, 1981) and to North-Holland Publishing Company
for permission to use in Chapter 10 the revised version of my article entitled
"Tobit Models: A Survey" (Journal of Econometrics 24:3-61, 1984).
1 Classical Least Squares Theory
In this chapter we shall consider the basic results of statistical inference in the
classical linear regression model, that is, the model in which the regressors are
independent of the error term and the error term is serially uncorrelated and has a
constant variance. This model is the starting point of the study; the models to
be examined in later chapters are modifications of this one.
In this section let us look at the reasons for studying the linear regression
model and the method of specifying it. We shall start by defining Model 1, to
be considered throughout the chapter.
1.1.1 Introduction
Consider a sequence of K random variables (yₜ, x₂ₜ, x₃ₜ, . . . , x_Kt),
t = 1, 2, . . . , T. Define a T-vector y = (y₁, y₂, . . . , y_T)', a (K − 1)-vector
x*ₜ = (x₂ₜ, x₃ₜ, . . . , x_Kt)', and a [(K − 1) × T]-vector x* =
(x*₁', x*₂', . . . , x*_T')'. Suppose for the sake of exposition that the joint density
of the variables is given by f(y, x*, θ), where θ is a vector of unknown parameters.
We are concerned with inference about the parameter vector θ on the
basis of the observed vectors y and x*.
In econometrics we are often interested in the conditional distribution of
one set of random variables given another set of random variables; for exam-
ple, the conditional distribution of consumption given income and the condi-
tional distribution of quantities demanded given prices. Suppose we want to
know the conditional distribution of y given x*. We can write the joint density
as the product of the conditional density and the marginal density as in
f(y, x*, θ) = f(y|x*, θ₁)f(x*, θ₂).    (1.1.1)
Regression analysis can be defined as statistical inference on θ₁. For this
purpose we can ignore f(x*, θ₂), provided there is no relationship between θ₁
and θ₂. The vector y is called the vector of dependent or endogenous variables,
and the vector x* is called the vector of independent or exogenous variables.
In regression analysis we usually want to estimate only the first and second
moments of the conditional distribution, rather than the whole parameter
vector θ₁. (In certain special cases the first two moments characterize θ₁
completely.) Thus we shall define regression analysis as statistical inference on
the conditional mean E(y|x*) and the conditional variance-covariance matrix
V(y|x*). Generally, these moments are nonlinear functions of x*. However, in
the present chapter we shall consider the special case in which E(yₜ|x*) is equal
to E(yₜ|x*ₜ) and is a linear function of x*ₜ, and V(y|x*) is a constant times an
identity matrix. Such a model is called the classical (or standard) linear
regression model or the homoscedastic (meaning constant variance) linear
regression model. Because this is the model to be studied in Chapter 1, let us
call it simply Model 1.
1.1.2 Model 1

By writing xₜ = (1, x*ₜ')', we can define Model 1 as follows. Assume

yₜ = xₜ'β + uₜ,    t = 1, 2, . . . , T,    (1.1.2)

where yₜ is a scalar observable random variable, β is a K-vector of unknown
parameters, xₜ is a K-vector of known constants such that Σₜ₌₁ᵀ xₜxₜ' is nonsingular,
and uₜ is a scalar, unobservable, random variable (called the error term
or the disturbance) such that Euₜ = 0, Vuₜ = σ² (another unknown parameter)
for all t, and Euₜu_s = 0 for t ≠ s.
Note that we have assumed x* to be a vector of known constants. This is
essentially equivalent to stating that we are concerned only with estimating
the conditional distribution of y given x*. The most important assumption of
Model 1 is the linearity of E(yₜ|x*ₜ); we therefore shall devote the next subsection
to a discussion of the implications of that assumption. We have also made
the assumption of homoscedasticity (Vuₜ = σ² for all t) and the assumption of
no serial correlation (Euₜu_s = 0 for t ≠ s), not because we believe that they are
satisfied in most applications, but because they make a convenient starting
point. These assumptions will be removed in later chapters.
We shall sometimes impose additional assumptions on Model 1 to obtain
certain specific results. Notably, we shall occasionally make the assumption of
serial independence of {uₜ} or the assumption that uₜ is normally distributed.
In general, independence is a stronger assumption than no correlation, although
under normality the two concepts are equivalent. The additional
assumptions will be stated whenever they are introduced into Model 1.
1.1.3 Implications of Linearity
Suppose random variables yₜ and x*ₜ have finite second moments and their
variance-covariance matrix is denoted by

V [ yₜ  ]   [ σ₁²   σ₁₂' ]
  [ x*ₜ ] = [ σ₁₂   Σ₂₂  ].

Then we can always write

yₜ = β₀ + x*ₜ'β₁ + vₜ,    (1.1.3)

where β₁ = Σ₂₂⁻¹σ₁₂, β₀ = Eyₜ − (Ex*ₜ)'β₁, Evₜ = 0, Vvₜ = σ₁² − σ₁₂'Σ₂₂⁻¹σ₁₂,
and Ex*ₜvₜ = 0. It is important to realize that Model 1 implies certain assumptions
that (1.1.3) does not: (1.1.3) does not generally imply linearity of
E(yₜ|x*ₜ) because E(vₜ|x*ₜ) may not generally be zero.

We call β₀ + x*ₜ'β₁ in (1.1.3) the best linear predictor of yₜ given x*ₜ because
β₀ and β₁ can be shown to be the values of b₀ and b₁ that minimize
E(yₜ − b₀ − x*ₜ'b₁)². In contrast, the conditional mean E(yₜ|x*ₜ) is called the
best predictor of yₜ given x*ₜ because E[yₜ − E(yₜ|x*ₜ)]² ≤ E[yₜ − g(x*ₜ)]² for
any function g.
The reader might ask why we work with eq. (1.1.2) rather than with (1.1.3).
The answer is that (1.1.3) is so general that it does not allow us to obtain
interesting results. For example, whereas the natural estimators of β₀ and β₁
can be defined by replacing the moments of yₜ and x*ₜ that characterize β₀ and
β₁ with their corresponding sample moments (they actually coincide with the
least squares estimators), the mean of the estimators cannot be evaluated
without specifying more about the relationship between x*ₜ and vₜ.
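Since β₀ and β₁ are characterized by population moments, their sample-moment analogues can be computed directly and compared with least squares. A minimal numpy sketch, with an invented data-generating process purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + rng.normal(size=T)   # hypothetical DGP

# Sample analogues of beta1 = Cov(y, x)/V(x) and beta0 = Ey - (Ex) beta1
b1_mom = np.cov(y, x, bias=True)[0, 1] / np.var(x)
b0_mom = y.mean() - x.mean() * b1_mom

# Least squares of y on (1, x)
X = np.column_stack([np.ones(T), x])
b0_ls, b1_ls = np.linalg.solve(X.T @ X, X.T @ y)

# The moment estimators coincide with least squares, as the text notes
assert np.isclose(b0_mom, b0_ls) and np.isclose(b1_mom, b1_ls)
```

Both `np.cov(..., bias=True)` and `np.var` use the same divisor, so the ratio equals the least squares slope exactly.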
How restrictive is the linearity of E(yₜ|x*ₜ)? It holds if yₜ and x*ₜ are jointly
normal or if yₜ and x*ₜ are both scalar dichotomous (Bernoulli) variables.¹ But
the linearity may not hold for many interesting distributions. Nevertheless,
the linearity assumption is not as restrictive as it may appear at first glance,
because x*ₜ can consist of variables obtained by transforming the original
independent variables in various ways. For example, if the conditional mean of yₜ,
the supply of a good, is a quadratic function of its price pₜ, we can put
x*ₜ = (pₜ, pₜ²)', thereby making E(yₜ|x*ₜ) linear.
where y = (y₁, y₂, . . . , y_T)', u = (u₁, u₂, . . . , u_T)', and X =
(x₁, x₂, . . . , x_T)'. In other words, X is the T × K matrix the tth row of
which is xₜ'. The elements of the matrix X are described as

X = [ x₁₁  x₁₂  . . .  x₁K ]
    [ x₂₁  x₂₂  . . .  x₂K ]
    [  .    .           .  ]
    [ x_T1 x_T2 . . .  x_TK ].
S(β̃) = (y − Xβ̃)'(y − Xβ̃)    (1.2.1)
      = y'y − 2y'Xβ̃ + β̃'X'Xβ̃.

Putting the derivatives of S(β̃) with respect to β̃ equal to 0, we have

∂S/∂β̃ = −2X'y + 2X'Xβ̃ = 0,    (1.2.2)

where ∂S/∂β̃ denotes the K-vector the ith element of which is ∂S/∂β̃ᵢ, β̃ᵢ being
the ith element of β̃. Solving (1.2.2) for β̃ gives

β̂ = (X'X)⁻¹X'y.    (1.2.3)

Clearly, S(β̃) attains the global minimum at β̂.
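The normal equations (1.2.2) and their solution (1.2.3) are easy to exercise numerically. A small sketch with simulated data (solving the normal equations directly rather than forming the inverse, which is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, -2.0, 0.5])        # hypothetical true parameters
y = X @ beta + rng.normal(size=T)

# beta_hat = (X'X)^{-1} X'y, obtained by solving the normal equations (1.2.2)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# First-order condition (1.2.2): X'y = X'X beta_hat
assert np.allclose(X.T @ y, X.T @ X @ beta_hat)
```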
Let us consider the special case K = 2 and xₜ' = (1, x₂ₜ) and represent each
of the T observations (yₜ, x₂ₜ) by a point on the plane. Then, geometrically,
the least squares estimates are the intercept and the slope of a line drawn in
such a way that the sum of squares of the deviations between the points and the
line is minimized in the direction of the y-axis. Different estimates result if the
sum of squares of deviations is minimized in any other direction.
Given the least squares estimator β̂, we define

û = y − Xβ̂    (1.2.4)

and call it the vector of the least squares residuals. Using û, we can estimate σ²
by

σ̂² = T⁻¹û'û,    (1.2.5)

called the least squares estimator of σ², although the use of the term least
squares here is not as compelling as in the estimation of the regression
parameters.
Using (1.2.4), we can write

y = Xβ̂ + û = Py + My,    (1.2.6)

where P = X(X'X)⁻¹X' and M = I − P. Because û is orthogonal to X (that is,
û'X = 0), least squares estimation can be regarded as decomposing y into two
orthogonal components: a component that can be written as a linear combination
of the column vectors of X and a component that is orthogonal to X.
Alternatively, we can call Py the projection of y onto the space spanned by the
column vectors of X and My the projection of y onto the space orthogonal to
X. Theorem 14 of Appendix 1 gives the properties of a projection matrix such
as P or M. In the special case where both y and X are two-dimensional vectors,
the decomposition (1.2.6) can be illustrated as in Figure 1.1.
From (1.2.3) and (1.2.4), we have û = Mu, where M = I − X(X'X)⁻¹X'.
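The decomposition (1.2.6) and the identity û = Mu can be verified numerically; in this sketch β and u are known only because we simulate them:

```python
import numpy as np

rng = np.random.default_rng(2)
T, K = 50, 4
X = rng.normal(size=(T, K))
beta = rng.normal(size=K)
u = rng.normal(size=T)
y = X @ beta + u

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X
M = np.eye(T) - P                      # projection onto the orthogonal complement

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

assert np.allclose(P @ y + M @ y, y)   # y = Py + My
assert np.allclose(u_hat, M @ y)       # the residual vector is My
assert np.allclose(u_hat, M @ u)       # u_hat = Mu, since MX = 0
assert np.allclose(X.T @ u_hat, 0)     # orthogonality u_hat'X = 0
assert np.allclose(M @ M, M)           # idempotency of a projection matrix
```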
We shall obtain the variance of σ̂² later, in Section 1.3, under the additional
assumption that u is normal.
The quantity Vβ̂ can be estimated by substituting either σ̂² or σ̃² (defined
above) for the σ² that appears in the right-hand side of (1.2.16).
higher than the estimator with the mean squared error matrix

[ 2  0 ]
[ 0  1 ].
A problem with Definition 1.2.2 (more precisely, a problem inherent in the
comparison of vector estimates rather than in this definition) is that often it
does not allow us to say one estimator is either better or worse than the other.
For example, consider

A = [ 1  1 ]   and   B = [ 2  0 ].    (1.2.25)
    [ 1  1 ]             [ 0  1 ]

Clearly, neither A ≥ B nor B ≥ A. In such a case one might compare the
trace and conclude that θ̂ is better than θ* because tr A < tr B. Another
example is

A = [ 2  1 ]   and   B = [ 2  0 ].    (1.2.26)
    [ 1  2 ]             [ 0  2 ]

Again, neither A ≥ B nor B ≥ A. If one were using the determinant as the
criterion, one would prefer θ̂ over θ* because det A < det B.

Note that B ≥ A implies both tr B ≥ tr A and det B ≥ det A. The first
follows from Theorem 7 and the second from Theorem 11 of Appendix 1. As
these two examples show, however, neither tr B ≥ tr A nor det B ≥ det A implies
B ≥ A.
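Using the matrices as reconstructed in (1.2.25) and (1.2.26), the claims can be checked directly: the eigenvalues of B − A and A − B reveal that neither difference is nonnegative definite, while the trace and determinant still produce a ranking.

```python
import numpy as np

A1 = np.array([[1.0, 1.0], [1.0, 1.0]])   # (1.2.25)
B1 = np.array([[2.0, 0.0], [0.0, 1.0]])
A2 = np.array([[2.0, 1.0], [1.0, 2.0]])   # (1.2.26)
B2 = np.array([[2.0, 0.0], [0.0, 2.0]])

def geq(B, A):
    # B >= A in the matrix sense iff B - A is nonnegative definite
    return np.all(np.linalg.eigvalsh(B - A) >= -1e-12)

# Neither ordering holds in either example
assert not geq(B1, A1) and not geq(A1, B1)
assert not geq(B2, A2) and not geq(A2, B2)

# Yet trace and determinant each give a ranking
assert np.trace(A1) < np.trace(B1)            # 2 < 3
assert np.linalg.det(A2) < np.linalg.det(B2)  # 3 < 4
```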
Use of the trace as a criterion has an obvious intuitive appeal, inasmuch as it
is the sum of the individual variances. Justification for the use of the determinant
involves more complicated reasoning. Suppose θ̂ ~ N(θ, V), where V is
the variance-covariance matrix of θ̂. Then, by Theorem 1 of Appendix 2,
(θ̂ − θ)'V⁻¹(θ̂ − θ) ~ χ²_K, the chi-square distribution with K degrees of freedom,
K being the number of elements of θ. Therefore the (1 − α) confidence
ellipsoid for θ is defined by

{θ | (θ̂ − θ)'V⁻¹(θ̂ − θ) < χ²_K(α)},    (1.2.27)

where χ²_K(α) is the number such that P[χ²_K ≥ χ²_K(α)] = α. The volume of
the ellipsoid (1.2.27) is proportional to the square root of the determinant of V,
as shown by Anderson (1958, p. 170).
A more intuitive justification for the determinant criterion is possible for
the case in which θ is a two-dimensional vector. Let the mean squared error
matrix of an estimator θ̂ = (θ̂₁, θ̂₂)' be

[ σ₁₁  σ₁₂ ]
[ σ₁₂  σ₂₂ ].
as the class of estimators of the form (S'X)⁻¹S'y, where S is any T × K matrix
of constants such that S'X is nonsingular. When it is defined this way, we call
it the class of instrumental variables estimators (abbreviated as IV) and call the
column vectors of S instrumental variables. The variance-covariance matrix
of IV easily can be shown to be σ²(S'X)⁻¹S'S(X'S)⁻¹. We get LS when we put
S = X, and the optimality of LS can be proved as follows: Because
I − S(S'S)⁻¹S' is nonnegative definite by Theorem 14(v) of Appendix 1, we
have

X'X ≥ X'S(S'S)⁻¹S'X.    (1.2.31)

Inverting both sides of (1.2.31) and using Theorem 17 of Appendix 1, we
obtain the desired result:

(X'X)⁻¹ ≤ (S'X)⁻¹S'S(X'S)⁻¹.    (1.2.32)
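Inequality (1.2.32) says the LS variance matrix is smallest within the IV class. A quick numerical check with an arbitrary simulated instrument matrix S (chosen only so that S'X is nonsingular):

```python
import numpy as np

rng = np.random.default_rng(3)
T, K = 200, 2
X = rng.normal(size=(T, K))
S = X + 0.5 * rng.normal(size=(T, K))   # any S with S'X nonsingular

V_ls = np.linalg.inv(X.T @ X)           # (X'X)^{-1}, up to the factor sigma^2
SXi = np.linalg.inv(S.T @ X)
V_iv = SXi @ S.T @ S @ SXi.T            # (S'X)^{-1} S'S (X'S)^{-1}

# Per (1.2.32), V_iv - V_ls must be nonnegative definite
assert np.all(np.linalg.eigvalsh(V_iv - V_ls) >= -1e-10)
```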
In the preceding analysis we were first given the least squares estimator and
then proceeded to prove that it is best of all the linear unbiased estimators.
Suppose now that we knew nothing about the least squares estimator and had
to find the value of C that minimizes C'C in the matrix sense (that is, in terms
of a ranking of matrices based on the matrix inequality defined earlier) subject
to the condition C'X = I. Unlike the problem of scalar minimization, calculus
is of no direct use. In such a situation it is often useful to minimize the
variance of a linear unbiased estimator of the scalar parameter p'β, where p is
an arbitrary K-vector of known constants.

Let c'y be a linear estimator of p'β. The unbiasedness condition implies
X'c = p. Because Vc'y = σ²c'c, the problem mathematically is

Minimize c'c subject to X'c = p.    (1.2.33)

Define

S = c'c − 2λ'(X'c − p),    (1.2.34)

where λ is a K-vector of Lagrange multipliers. Setting the derivative of S with
respect to c equal to 0 yields

c = Xλ.    (1.2.35)

Premultiplying both sides by X' and using the constraint, we obtain

λ = (X'X)⁻¹p.    (1.2.36)

Inserting (1.2.36) into (1.2.35), we conclude that the best linear unbiased
estimator of p'β is p'(X'X)⁻¹X'y = p'β̂.
(B) E[∂² log L/∂θ∂θ'] = −E[(∂ log L/∂θ)(∂ log L/∂θ')]

(C) E[(∂ log L/∂θ)(∂ log L/∂θ')] > 0

(D) ∫ θ̂(∂L/∂θ') dz = I.

Note: ∂ log L/∂θ is a K-vector, so that 0 in assumption A is a vector of K zeros.
Assumption C means that the left-hand side is a positive definite matrix. The
integral in assumption D is an n-tuple integral over the whole domain of the
Euclidean n-space because z is an n-vector. Finally, I in assumption D is the
identity matrix of size K.
Then we have for θ ∈ Θ

V(θ̂) ≥ −[E ∂² log L/∂θ∂θ']⁻¹.    (1.3.8)

To see the role of assumption D, note that

E[θ̂(∂ log L/∂θ')] = E[θ̂ L⁻¹(∂L/∂θ')]
                  = ∫ θ̂(∂L/∂θ') dz
                  = I

by assumption D.
E[∂ log L/∂θ] = E[L⁻¹(∂L/∂θ)]    (1.3.12)
              = ∫ L⁻¹(∂L/∂θ) L dz
              = ∫ (∂L/∂θ) dz.
The equivalence between assumptions B and B' follows from

E[∂² log L/∂θ∂θ'] = E{∂/∂θ' [L⁻¹(∂L/∂θ)]}    (1.3.13)
                  = E[−L⁻²(∂L/∂θ)(∂L/∂θ') + L⁻¹(∂²L/∂θ∂θ')]
                  = −E[(∂ log L/∂θ)(∂ log L/∂θ')] + ∫ (∂²L/∂θ∂θ') dz.
Furthermore, assumptions A', B', and D are equivalent to

(A'') ∫ (∂L/∂θ) dz = (∂/∂θ) ∫ L dz

(B'') ∫ (∂²L/∂θ∂θ') dz = (∂/∂θ') ∫ (∂L/∂θ) dz

(D') ∫ θ̂(∂L/∂θ') dz = (∂/∂θ') ∫ θ̂ L dz,

because the right-hand sides of assumptions A'', B'', and D' are 0, 0, and I,
respectively. Written thus, the three assumptions are all of the form

(∂/∂θ) ∫ f(z, θ) dz = ∫ (∂/∂θ) f(z, θ) dz,    (1.3.14)

that is, they permit the interchange of differentiation and integration, since

(∂/∂θ) ∫ f(z, θ) dz = lim_{θ*→θ} ∫ (θ* − θ)⁻¹[f(z, θ*) − f(z, θ)] dz.
verify it only for the case where β is a scalar and σ² is known, assuming that the
parameter space of β and σ² is compact and does not contain σ² = 0.

Consider log L given in (1.3.2). In applying Theorem 1.3.1 to our model, we
put θ' = (β', σ²). The first- and second-order derivatives of log L are given by

∂ log L/∂β = σ⁻²(X'y − X'Xβ),    (1.3.15)

∂ log L/∂(σ²) = −T/(2σ²) + (2σ⁴)⁻¹(y − Xβ)'(y − Xβ),    (1.3.16)

∂² log L/∂β∂β' = −σ⁻²X'X,    (1.3.17)

∂² log L/∂(σ²)² = T/(2σ⁴) − σ⁻⁶(y − Xβ)'(y − Xβ),    (1.3.18)

∂² log L/∂β∂(σ²) = −σ⁻⁴(X'y − X'Xβ).    (1.3.19)
Taking expectations of (1.3.17) through (1.3.19), we obtain

E ∂² log L/∂θ∂θ' = −[ σ⁻²X'X   0        ]
                    [ 0        T/(2σ⁴) ].    (1.3.20)

Moreover,

E[(∂ log L/∂β)(∂ log L/∂β')] = σ⁻⁴ E X'uu'X = σ⁻²X'X,    (1.3.21)

E[(∂ log L/∂(σ²))²] = T²/(4σ⁴) − (2σ⁶)⁻¹T Eu'u + (4σ⁸)⁻¹E(u'u)²    (1.3.22)
                    = T/(2σ⁴),

because Eu'u = Tσ² and E(u'u)² = (T² + 2T)σ⁴, and

E[(∂ log L/∂β)(∂ log L/∂(σ²))] = 0.    (1.3.23)

Therefore, from (1.3.20) to (1.3.23) we can see that assumptions B and C are
both satisfied.
We shall verify assumption D only for the case where β is a scalar and σ² is
known, so that we can use Theorem 1.3.2, which is stated for the case of a scalar
parameter. Take the unbiased estimator β̂ as the θ̂ of that theorem. We need to
check only the last condition of the theorem. Differentiating (1.3.1) with
respect to β, we have

∂L/∂β = σ⁻²(x'y − βx'x)L.    (1.3.24)
The first integral on the right-hand side is finite because fl is assumed to have a
finite variance. The second integral on the right-hand side is also finite because
the moments of the normal distribution are finite. Moreover, both integrals
are uniformly bounded in the assumed parameter space. Thus the last condi-
tion of Theorem 1.3.2 is satisfied.
Finally, from (1.3.8) and (1.3.20) we have

Vβ* ≥ σ²(X'X)⁻¹    (1.3.26)

for any unbiased estimator β*. The right-hand side of (1.3.26) is the variance-covariance
matrix of the least squares estimator of β, thereby proving that the least
squares estimator is the best unbiased estimator under Model 1 with normality.
Unlike the result in Section 1.2.5, the result in this section is not
constrained by the linearity condition because the normality assumption was
added. Nevertheless, even with the normality assumption, there may be a
biased estimator that has a smaller average mean squared error than the least
squares estimator, as we shall show in Section 2.2. In nonnormal situations,
certain nonlinear estimators may be preferred to LS, as we shall see in Section 2.3.
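That Vβ̂ equals the bound σ²(X'X)⁻¹ can be illustrated by Monte Carlo simulation. This is only a numerical sketch with made-up numbers; the design matrix is held fixed across replications, as Model 1 assumes:

```python
import numpy as np

rng = np.random.default_rng(4)
T, K, sigma2, R = 40, 2, 4.0, 20000
X = np.column_stack([np.ones(T), np.linspace(-1, 1, T)])
beta = np.array([1.0, 2.0])

XtX_inv = np.linalg.inv(X.T @ X)
draws = np.empty((R, K))
for r in range(R):
    y = X @ beta + np.sqrt(sigma2) * rng.normal(size=T)
    draws[r] = XtX_inv @ X.T @ y       # least squares estimate per replication

V_mc = np.cov(draws.T)                 # Monte Carlo variance of beta_hat
V_theory = sigma2 * XtX_inv            # sigma^2 (X'X)^{-1}

assert np.allclose(V_mc, V_theory, rtol=0.1, atol=0.01)
```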
Therefore it does not attain the Cramer-Rao lower bound, although the differ-
ence is negligible when T is large.
We shall now show that there is a simple biased estimator of σ² that has a
smaller mean squared error than the Cramer-Rao lower bound. Define the
class of estimators

σ̂²_N = û'û/N,    (1.3.28)

where N is a positive integer. Both σ̂² and σ̃², defined in (1.2.5) and (1.2.18),
respectively, are special cases of (1.3.28). Using (1.2.17) and (1.3.5), we can
evaluate the mean squared error of σ̂²_N as

E(σ̂²_N − σ²)² = [2(T − K) + (T − K − N)²]σ⁴/N².    (1.3.29)

By differentiating (1.3.29) with respect to N and equating the derivative to
zero, we can find the value of N that minimizes (1.3.29) to be

N* = T − K + 2.    (1.3.30)

Inserting (1.3.30) into (1.3.29), we have

E(σ̂²_{N*} − σ²)² = 2σ⁴/(T − K + 2),    (1.3.31)

which is smaller than the Cramer-Rao bound if K = 1.
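The algebra in (1.3.29)-(1.3.31) is easy to confirm numerically: evaluate the MSE formula over a grid of N and check that N* = T − K + 2 is the minimizer (T and K below are arbitrary):

```python
import numpy as np

T, K, sigma2 = 30, 3, 1.0
N = np.arange(1, 2 * T)
# MSE formula (1.3.29): bias^2 + variance of u'u/N with u'u ~ sigma^2 chi^2_{T-K}
mse = (2 * (T - K) + (T - K - N) ** 2) * sigma2 ** 2 / N ** 2

N_star = N[np.argmin(mse)]
assert N_star == T - K + 2                                    # (1.3.30)
assert np.isclose(mse.min(), 2 * sigma2 ** 2 / (T - K + 2))   # (1.3.31)
```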
The study of this subject is useful for its own sake in addition to providing
preliminary results for the next section, where we shall discuss tests of the
linear hypothesis (1.4.1). We shall define the constrained least squares estima-
tor, present an alternative derivation, show it is BLUE when (1.4.1) is true,
and finally consider the case where (1.4.1) is made stochastic by the addition of
a random error term to the right-hand side.
that satisfies these conditions will do. Then, defining A = (Q, R)', γ = Aβ,
and Z = XA⁻¹, we can rewrite the basic model (1.1.4) as

y = Xβ + u    (1.4.7)
  = XA⁻¹Aβ + u
  = Zγ + u.

If we partition γ = (γ₁', γ₂')', where γ₁ = Q'β and γ₂ = R'β, we see that the
constraints (1.4.1) specify γ₁ and leave γ₂ unspecified. The vector γ₂ has K − q
elements; thus we have reduced the problem of estimating K parameters
subject to q constraints to the problem of estimating K − q free parameters.

Using γ₁ = c and

A⁻¹ = [Q(Q'Q)⁻¹, R(R'R)⁻¹],    (1.4.8)

we have from (1.4.7)

y − XQ(Q'Q)⁻¹c = XR(R'R)⁻¹γ₂ + u.    (1.4.9)

Let γ̂₂ be the least squares estimator of γ₂ in (1.4.9):

γ̂₂ = R'R(R'X'XR)⁻¹R'X'[y − XQ(Q'Q)⁻¹c].    (1.4.10)

Now, transforming from γ back to β by the relationship β = A⁻¹γ, we obtain
the CLS estimator of β:

β̂ = R(R'X'XR)⁻¹R'X'y    (1.4.11)
   + [I − R(R'X'XR)⁻¹R'X'X]Q(Q'Q)⁻¹c.
Note that (1.4.11) is different from (1.4.5). Equation (1.4.5) is valid only if
X'X is nonsingular, whereas (1.4.11) can be defined even if X'X is singular
provided that R'X'XR is nonsingular. We can show that if X'X is nonsingular,
(1.4.11) is unique and equal to (1.4.5). Denote the right-hand sides of
(1.4.5) and (1.4.11) by β̂₁ and β̂₂, respectively. Then it is easy to show

[ Q'    ]
[ R'X'X ] (β̂₁ − β̂₂) = 0,    (1.4.12)

where the matrix on the left-hand side is clearly nonsingular when X'X is
nonsingular. Therefore β̂₁ = β̂₂, as we desired.
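Formula (1.4.11) can be exercised numerically. In the sketch below the constraint matrix Q, the vector c, and the data are all invented for illustration; R is built as any matrix whose columns span the orthogonal complement of the columns of Q. As a cross-check, the result is compared with the standard Lagrange-multiplier form of constrained least squares, which is valid here because X'X is nonsingular:

```python
import numpy as np

rng = np.random.default_rng(5)
T, K, q = 60, 4, 2
X = rng.normal(size=(T, K))
y = rng.normal(size=T)

# Hypothetical constraints Q'beta = c, with Q a K x q matrix of full column rank
Q = np.zeros((K, q)); Q[0, 0] = 1.0; Q[1, 1] = 1.0
c = np.array([0.3, -0.7])

# R: K x (K - q), columns orthogonal to those of Q (via the SVD of Q)
U, _, _ = np.linalg.svd(Q)
R = U[:, q:]

# CLS estimator (1.4.11)
G = R @ np.linalg.inv(R.T @ X.T @ X @ R) @ R.T @ X.T
beta_cls = G @ y + (np.eye(K) - G @ X) @ Q @ np.linalg.inv(Q.T @ Q) @ c

# Lagrange-multiplier form of CLS (requires X'X nonsingular)
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
W = XtX_inv @ Q @ np.linalg.inv(Q.T @ XtX_inv @ Q)
beta_lagr = beta_hat - W @ (Q.T @ beta_hat - c)

assert np.allclose(Q.T @ beta_cls, c)    # constraints hold exactly
assert np.allclose(beta_cls, beta_lagr)  # the two forms agree
```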
sical statistics and entered that of Bayesian statistics, which treats the unknown
parameter vector β as a random variable. Although we generally adopt the
classical viewpoint in this book, we shall occasionally use Bayesian analysis
whenever we believe it sheds light on the problem at hand.⁵
In terms of the parameterization of (1.4.7), the constraints (1.4.17) are
equivalent to

γ₁ = c + v.    (1.4.18)

We shall first derive the posterior distribution of γ, using a prior distribution
over all the elements of γ, because it is mathematically simpler to do so; then
we shall treat (1.4.18) as what obtains in the limit as the variances of the
prior distribution for γ₂ go to infinity. We shall assume σ² is known, for this
assumption makes the algebra considerably simpler without changing the
essentials. For a discussion of the case where a prior distribution is assumed on
σ² as well, see Zellner (1971, p. 65) or Theil (1971, p. 670).
Let the prior density of γ be

f(γ) = (2π)^{−K/2}|Ω|^{−1/2} exp[−(1/2)(γ − μ)'Ω⁻¹(γ − μ)],    (1.4.19)

where Ω is a known variance-covariance matrix. Thus, by Bayes's rule, the
posterior density of γ given y is

f(γ|y) = f(y|γ)f(γ) / ∫ f(y|γ)f(γ) dγ    (1.4.20)
       = c₁ exp{−(1/2)[σ⁻²(y − Zγ)'(y − Zγ) + (γ − μ)'Ω⁻¹(γ − μ)]},

where c₁ does not depend on γ. Rearranging the terms inside the brackets, we
have

σ⁻²(y − Zγ)'(y − Zγ) + (γ − μ)'Ω⁻¹(γ − μ)    (1.4.21)
= γ'(σ⁻²Z'Z + Ω⁻¹)γ − 2(σ⁻²y'Z + μ'Ω⁻¹)γ + σ⁻²y'y + μ'Ω⁻¹μ
= (γ − γ̄)'(σ⁻²Z'Z + Ω⁻¹)(γ − γ̄) − γ̄'(σ⁻²Z'Z + Ω⁻¹)γ̄ + σ⁻²y'y + μ'Ω⁻¹μ,

where

γ̄ = (σ⁻²Z'Z + Ω⁻¹)⁻¹(σ⁻²Z'y + Ω⁻¹μ).    (1.4.22)

Therefore the posterior distribution of γ is

γ|y ~ N(γ̄, [σ⁻²Z'Z + Ω⁻¹]⁻¹),    (1.4.23)
and the Bayes estimator of γ is the posterior mean γ̄ given in (1.4.22). Because
γ = Aβ, the Bayes estimator of β is given by

β̂ = A⁻¹[σ⁻²(A')⁻¹X'XA⁻¹ + Ω⁻¹]⁻¹[σ⁻²(A')⁻¹X'y + Ω⁻¹μ]    (1.4.24)
   = (σ⁻²X'X + A'Ω⁻¹A)⁻¹(σ⁻²X'y + A'Ω⁻¹μ).

We shall now specify μ and Ω so that they conform to the stochastic constraints
(1.4.18). This can be done by putting the first q elements of μ equal to c
(leaving the remaining K − q elements unspecified because their values do not
matter in the limit we shall consider later), putting

Ω = [ τ²I   0   ]
    [ 0     ν²I ],    (1.4.25)

and then taking ν² to infinity (which expresses the assumption that nothing is a
priori known about γ₂). Hence, in the limit we have

lim_{ν²→∞} Ω⁻¹ = [ τ⁻²I  0 ]
                 [ 0     0 ].    (1.4.26)

Inserting (1.4.26) into (1.4.24) and writing the first q elements of μ as c, we
finally obtain

β̂ = (X'X + λ²QQ')⁻¹(X'y + λ²Qc),    (1.4.27)

where λ² = σ²τ⁻².
We have obtained the estimator β̂ as a special case of the Bayes estimator,
but this estimator was originally proposed by Theil and Goldberger (1961) and
was called the mixed estimator on heuristic grounds. In their heuristic approach,
Eqs. (1.1.4) and (1.4.17) are combined to yield a system of equations

[ y  ]   [ X   ]       [ u  ]
[ λc ] = [ λQ' ] β  +  [ λv ].    (1.4.28)

Note that the multiplication of the second part of the equations by λ renders
the combined error terms homoscedastic (that is, constant variance), so that
(1.4.28) satisfies the assumptions of Model 1. Then Theil and Goldberger
proposed the application of the least squares estimator to (1.4.28), an operation
that yields the same estimator as β̂ given in (1.4.27). An alternative way to
interpret this estimator as a Bayes estimator is given in Theil (1971, p. 670).
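The equivalence between the closed form (1.4.27) and least squares applied to the stacked system (1.4.28) is easy to check numerically (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
T, K = 50, 3
X = rng.normal(size=(T, K))
y = rng.normal(size=T)
Q = np.array([[1.0], [1.0], [0.0]])   # one stochastic constraint Q'beta ~ c
c = np.array([1.0])
lam2 = 4.0                            # lambda^2 = sigma^2 / tau^2
lam = np.sqrt(lam2)

# Direct formula (1.4.27)
b_mixed = np.linalg.solve(X.T @ X + lam2 * (Q @ Q.T), X.T @ y + lam2 * (Q @ c))

# Theil-Goldberger: least squares on the stacked system (1.4.28)
X_aug = np.vstack([X, lam * Q.T])
y_aug = np.concatenate([y, lam * c])
b_stack = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_aug)

assert np.allclose(b_mixed, b_stack)
```

The stacked normal equations give X'X + λ²QQ' and X'y + λ²Qc exactly, which is why the two computations agree to machine precision.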
There is an interesting connection between the Bayes estimator (1.4.27) and
the constrained least squares estimator (1.4.11): The latter is obtained as the
limit of the former, taking λ² to infinity. Note that this result is consistent with
our intuition inasmuch as λ² → ∞ is equivalent to τ² → 0, an equivalency that
implies that the stochastic element disappears from the constraints (1.4.17),
thereby reducing them to the nonstochastic constraints (1.4.1). We shall demonstrate
this below. [Note that the limit of (1.4.27) as λ² → ∞ cannot be evaluated as
(QQ')⁻¹Qc because QQ' is singular; it is a K × K matrix with rank q < K.]
Define

B = A(λ⁻²X'X + QQ')A'    (1.4.29)
  = [ λ⁻²Q'X'XQ + Q'QQ'Q    λ⁻²Q'X'XR ]
    [ λ⁻²R'X'XQ             λ⁻²R'X'XR ].

Then, by Theorem 13 of Appendix 1, we have

B⁻¹ = [ E⁻¹                      −E⁻¹Q'X'XR(R'X'XR)⁻¹ ]
      [ −(R'X'XR)⁻¹R'X'XQE⁻¹     F⁻¹                  ],    (1.4.30)

where

E = Q'QQ'Q + λ⁻²Q'X'XQ    (1.4.31)
    − λ⁻²Q'X'XR(R'X'XR)⁻¹R'X'XQ

and

F = λ⁻²R'X'XR    (1.4.32)
    − λ⁻⁴R'X'XQ(λ⁻²Q'X'XQ + Q'QQ'Q)⁻¹Q'X'XR.

From (1.4.27) and (1.4.29) we have

β̂ = A'B⁻¹A(λ⁻²X'y + Qc).    (1.4.33)

Using (1.4.30), we have

lim_{λ²→∞} A'B⁻¹A(λ⁻²X'y) = R(R'X'XR)⁻¹R'X'y    (1.4.34)

and

lim_{λ²→∞} A'B⁻¹AQc = lim_{λ²→∞} (Q, R)B⁻¹ [ Q'Qc ]
                                            [ 0    ].    (1.4.35)
This is the test statistic one would use if σ² were known. As we have shown in
Eq. (1.3.7), we have

û'û/σ² ~ χ²_{T−K}.    (1.5.3)

The random variables (1.5.2) and (1.5.3) easily can be shown to be independent
by using Theorem 6 of Appendix 2 or by noting E[û(β̂ − β)'] = 0, which implies
that û and β̂ are independent because they are normal, which in turn implies
that û'û and β̂ are independent. Hence, by Theorem 3 of Appendix 2 we have

(Q'β̂ − c) / [σ̃²Q'(X'X)⁻¹Q]^{1/2} ~ t_{T−K},    (1.5.4)

which is Student's t with T − K degrees of freedom, where σ̃ is the square root
of the unbiased estimator of σ² defined in Eq. (1.2.18). Note that the denominator
in (1.5.4) is an estimate of the standard deviation of the numerator. Thus
the null hypothesis Q'β = c can be tested by the statistic (1.5.4). We can use a
one-tail or two-tail test, depending on the alternative hypothesis.
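A sketch of the t statistic (1.5.4) for a single linear restriction, using simulated data generated under the null. The sketch also checks the identity, noted later in the text, that with one restriction the F statistic built from (1.5.10) equals the square of t:

```python
import numpy as np

rng = np.random.default_rng(7)
T, K = 120, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(size=T)

Q = np.array([0.0, 1.0, 1.0])        # single restriction Q'beta = c
c = Q @ beta                         # null made true by construction

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (T - K)         # unbiased sigma^2 estimate, u'u/(T - K)

t_stat = (Q @ beta_hat - c) / np.sqrt(s2 * (Q @ XtX_inv @ Q))

# F statistic for q = 1, from (1.5.10): must equal t^2
f_stat = (Q @ beta_hat - c) ** 2 / (Q @ XtX_inv @ Q) / s2
assert np.isclose(f_stat, t_stat ** 2)
```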
In Chapter 3 we shall show that even if u is not normal, (1.5.2) holds
asymptotically (that is, approximately when T is large) under general conditions.
We shall also show that σ̃² converges to σ² in probability as T goes to
infinity (the exact definition will be given there). Therefore under general
distributions of u the statistic defined in (1.5.4) is asymptotically distributed as
N(0, 1) and can be used to test the hypothesis using the standard normal
table.⁶ In this case σ̂² may be used in place of σ̃² because σ̂² also converges to σ²
in probability.
Reject Q'β = c if

η ≡ [(T − K)/q] · [S(β̄) − S(β̂)]/S(β̂) > h,    (1.5.9)

where h is appropriately chosen. Clearly, (1.5.8) and (1.5.9) can be made
equivalent by putting h = [(T − K)/q](g^{−2/T} − 1). The test (1.5.9) is called the
F test because under the null hypothesis η is distributed as F(q, T − K), the
F distribution with q and T − K degrees of freedom, as we shall show next.
From (1.4.3) and (1.4.5) we have

S(β̄) − S(β̂) = (Q'β̂ − c)'[Q'(X'X)⁻¹Q]⁻¹(Q'β̂ − c).    (1.5.10)
We shall now give an alternative motivation for the test statistic (1.5.12).
For this purpose, consider the general problem of testing the null hypothesis
H₀: θ = θ₀, where θ = (θ₁, θ₂)' is a two-dimensional vector of parameters.
Suppose we wish to construct a test statistic on the basis of an estimator θ̂ that
is distributed as N(θ₀, V) under the null hypothesis, where we assume V to be a
known diagonal matrix for simplicity. Consider the following two tests of H₀:

Reject H₀ if (θ̂ − θ₀)'(θ̂ − θ₀) > c    (1.5.13)

and

Reject H₀ if (θ̂ − θ₀)'V⁻¹(θ̂ − θ₀) > d,    (1.5.14)
Applying this test to the test of the hypothesis (1.4.1), we readily see that the
test should be based on

(Q'β̂ − c)'[Q'(X'X)⁻¹Q]⁻¹(Q'β̂ − c)/σ² ~ χ²_q.    (1.5.16)

This test can be exactly valid only when σ² is known and hence is analogous to
the standard normal test based on (1.5.2). Thus the F test statistic (1.5.12) may
be regarded as a natural adaptation of (1.5.16) to the situation where σ² must
be estimated. (For a rigorous mathematical discussion of the optimal properties
of the F test, the reader is referred to Scheffe, 1959, p. 46.)
By comparing (1.5.4) and (1.5.12) we immediately note that if q = 1 (and
therefore Q' is a row vector) the F statistic is the square of the t statistic. This
fact clearly indicates that if q = 1 the t test should be used rather than the F test
because a one-tail test is possible only with the t test.

As stated earlier, (1.5.2) holds asymptotically even if u is not normal. Because
σ̃² = û'û/(T − K) converges to σ² in probability, the linear hypothesis
can be tested without assuming normality of u by using the fact that qη is
asymptotically distributed as χ²_q. Some people prefer to use the F test in this
situation. The remark in note 6 applies to this practice.
The F statistic η given in (1.5.12) takes on a variety of forms as we insert a
variety of specific values into Q and c. As an example, consider the case where
β is partitioned as β' = (β₁', β₂'), where β₁ is a K₁-vector and β₂ is a K₂-vector
such that K₁ + K₂ = K, and the null hypothesis specifies β₂ = β̄₂ and leaves β₁
unspecified. This hypothesis can be written in the form Q'β = c by putting
Q' = (0, I), where 0 is the K₂ × K₁ matrix of zeros and I is the identity matrix
of size K₂, and by putting c = β̄₂. Inserting these values into (1.5.12) yields
where the vectors and matrices in (1.5.23) have T₁ rows and those in (1.5.24)
T₂ rows, X₁ is a T₁ × K* matrix and X₂ is a T₂ × K* matrix, and u₁ and u₂ are
normally distributed with zero means and variance-covariance matrix

E [ u₁ ] (u₁', u₂') = [ σ₁²I   0    ]
  [ u₂ ]              [ 0      σ₂²I ].

We assume that both X₁ and X₂ have rank equal to K*.⁷ We want to test the
null hypothesis β₁ = β₂ assuming σ₁² = σ₂² (= σ²) in the present section and
σ₁² ≠ σ₂² in the next section. This test is especially important in econometric
time series because the econometrician often suspects the occurrence of a
structural change from one era to another (say, from the prewar era to the
postwar era), a change that manifests itself in the regression parameters. When
σ₁² = σ₂², this test can be handled as a special case of the standard F test
presented in the preceding section.
To apply the F test to the problem, combine Eqs. (1.5.23) and (1.5.24) as

y = Xβ + u,    (1.5.25)

where

y = [ y₁ ],   X = [ X₁  0  ],   β = [ β₁ ],   and   u = [ u₁ ].
    [ y₂ ]        [ 0   X₂ ]        [ β₂ ]              [ u₂ ]

Then, since σ₁² = σ₂² (= σ²), (1.5.25) is the same as Model 1 with normality;
hence we can represent our hypothesis β₁ = β₂ as a standard linear hypothesis
on Model 1 with normality by putting T = T₁ + T₂, K = 2K*, q = K*, Q' =
(I, −I), and c = 0. Inserting these values into (1.5.12) yields the test statistic

η = [(T₁ + T₂ − 2K*)/K*] · (β̂₁ − β̂₂)'[(X₁'X₁)⁻¹ + (X₂'X₂)⁻¹]⁻¹(β̂₁ − β̂₂) / y'[I − X(X'X)⁻¹X']y    (1.5.26)
  ~ F(K*, T₁ + T₂ − 2K*),

where β̂₁ = (X₁'X₁)⁻¹X₁'y₁ and β̂₂ = (X₂'X₂)⁻¹X₂'y₂.
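The direct form (1.5.26) and the residual-sum-of-squares form (1.5.30) of the structural-change statistic must give the same number, which a simulation can confirm (all data invented; the null β₁ = β₂ holds by construction):

```python
import numpy as np

rng = np.random.default_rng(8)
T1, T2, Ks = 40, 60, 2
X1 = np.column_stack([np.ones(T1), rng.normal(size=T1)])
X2 = np.column_stack([np.ones(T2), rng.normal(size=T2)])
beta = np.array([1.0, 2.0])
y1 = X1 @ beta + rng.normal(size=T1)
y2 = X2 @ beta + rng.normal(size=T2)   # same beta in both regimes

b1 = np.linalg.solve(X1.T @ X1, X1.T @ y1)
b2 = np.linalg.solve(X2.T @ X2, X2.T @ y2)

# Unrestricted model (1.5.25): separate regressions; restricted (1.5.27): pooled
y = np.concatenate([y1, y2])
Xbar = np.vstack([X1, X2])
S_unres = (y1 - X1 @ b1) @ (y1 - X1 @ b1) + (y2 - X2 @ b2) @ (y2 - X2 @ b2)
b_pool = np.linalg.solve(Xbar.T @ Xbar, Xbar.T @ y)
S_res = (y - Xbar @ b_pool) @ (y - Xbar @ b_pool)

df = T1 + T2 - 2 * Ks
d = b1 - b2
Vd = np.linalg.inv(X1.T @ X1) + np.linalg.inv(X2.T @ X2)
F_direct = (df / Ks) * (d @ np.linalg.solve(Vd, d)) / S_unres   # form (1.5.26)
F_rss = (df / Ks) * (S_res - S_unres) / S_unres                 # form (1.5.30)

assert np.isclose(F_direct, F_rss)
```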
We shall now give an alternative derivation of (1.5.26). In (1.5.25) we combined Eqs. (1.5.23) and (1.5.24) without making use of the hypothesis β₁ = β₂. If we make use of it, we can combine the two equations as

y = X̄β + u,    (1.5.27)

where we have defined X̄ = (X₁', X₂')' and β = β₁ = β₂. Let S(β̂) be the sum of squared residuals from (1.5.25), that is,

S(β̂) = y'[I − X(X'X)⁻¹X']y,    (1.5.28)
and let S(β̃) be the sum of squared residuals from (1.5.27), that is,

S(β̃) = y'[I − X̄(X̄'X̄)⁻¹X̄']y.    (1.5.29)

Then using (1.5.9) we have

$$
\frac{T_1 + T_2 - 2K^*}{K^*}\,
\frac{S(\tilde\beta) - S(\hat\beta)}{S(\hat\beta)}
\sim F(K^*,\ T_1 + T_2 - 2K^*). \tag{1.5.30}
$$
$$
S(\tilde\beta) = y'\left[I - \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}
(X_1'X_1 + X_2'X_2)^{-1}(X_1',\ X_2')\right]y, \tag{1.5.32}
$$

$$
S(\hat\beta) = y'\left[I - \begin{bmatrix} X_1(X_1'X_1)^{-1}X_1' & 0 \\ 0 & X_2(X_2'X_2)^{-1}X_2' \end{bmatrix}\right]y. \tag{1.5.33}
$$

Therefore, writing A = X₁'X₁ + X₂'X₂ for brevity,

$$
\begin{aligned}
S(\tilde\beta) - S(\hat\beta)
&= (y_1'X_1,\ y_2'X_2)
\begin{bmatrix} (X_1'X_1)^{-1} - A^{-1} & -A^{-1} \\ -A^{-1} & (X_2'X_2)^{-1} - A^{-1} \end{bmatrix}
\begin{bmatrix} X_1'y_1 \\ X_2'y_2 \end{bmatrix} \\
&= (\hat\beta_1 - \hat\beta_2)'\left[(X_1'X_1)^{-1} + (X_2'X_2)^{-1}\right]^{-1}(\hat\beta_1 - \hat\beta_2),
\end{aligned} \tag{1.5.34}
$$
34 Advanced Econometrics
the last line of which is equal to the right-hand side of (1.5.31). The last
equality of (1.5.34) follows from Theorem 19 of Appendix 1.
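The equality of the drop in the residual sum of squares and the quadratic form in β̂₁ − β̂₂ is an exact algebraic identity, which can be checked numerically; the dimensions and data below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
T1, T2, K = 25, 35, 4
X1, X2 = rng.normal(size=(T1, K)), rng.normal(size=(T2, K))
y1, y2 = rng.normal(size=T1), rng.normal(size=T2)

b1 = np.linalg.solve(X1.T @ X1, X1.T @ y1)
b2 = np.linalg.solve(X2.T @ X2, X2.T @ y2)
S_unres = np.sum((y1 - X1 @ b1) ** 2) + np.sum((y2 - X2 @ b2) ** 2)

Xbar = np.vstack([X1, X2])                    # restricted design: beta1 = beta2
y = np.concatenate([y1, y2])
bbar = np.linalg.solve(Xbar.T @ Xbar, Xbar.T @ y)
S_res = np.sum((y - Xbar @ bbar) ** 2)

d = b1 - b2
middle = np.linalg.inv(np.linalg.inv(X1.T @ X1) + np.linalg.inv(X2.T @ X2))
quad = d @ middle @ d
gap = S_res - S_unres                         # equals quad up to rounding
```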
The hypothesis β₁ = β₂ is merely one of many linear hypotheses we can impose on the β of the model (1.5.25). For instance, we might want to test the equality of a subset of β₁ with the corresponding subset of β₂. If the subset consists of the first K₁* elements of both β₁ and β₂, we should put T = T₁ + T₂ and K = 2K* as before, but q = K₁*, Q' = (I, 0, −I, 0), and c = 0 in the formula (1.5.12).
If, however, we wish to test the equality of a single element of β₁ with the corresponding element of β₂, we should use the t test rather than the F test for the reason given in Section 1.5.2. Suppose the null hypothesis is β₁ᵢ = β₂ᵢ, where these are the ith elements of β₁ and β₂, respectively. Let the ith columns of X₁ and X₂ be denoted by x₁ᵢ and x₂ᵢ, and let X₁₍ᵢ₎ and X₂₍ᵢ₎ consist of the remaining K* − 1 columns of X₁ and X₂, respectively. Define M₁₍ᵢ₎ = I − X₁₍ᵢ₎(X₁₍ᵢ₎'X₁₍ᵢ₎)⁻¹X₁₍ᵢ₎', x̃₁ᵢ = M₁₍ᵢ₎x₁ᵢ, and ỹ₁ = M₁₍ᵢ₎y₁, and similarly define M₂₍ᵢ₎, x̃₂ᵢ, and ỹ₂. Then using Eqs. (1.2.12) and (1.2.13), we have

$$
\hat\beta_{1i} = \frac{\tilde{x}_{1i}'\tilde{y}_1}{\tilde{x}_{1i}'\tilde{x}_{1i}}
\sim N\!\left(\beta_{1i},\ \frac{\sigma_1^2}{\tilde{x}_{1i}'\tilde{x}_{1i}}\right) \tag{1.5.35}
$$

and

$$
\hat\beta_{2i} = \frac{\tilde{x}_{2i}'\tilde{y}_2}{\tilde{x}_{2i}'\tilde{x}_{2i}}
\sim N\!\left(\beta_{2i},\ \frac{\sigma_2^2}{\tilde{x}_{2i}'\tilde{x}_{2i}}\right). \tag{1.5.36}
$$

Therefore, under the null hypothesis,

$$
\frac{\hat\beta_{1i} - \hat\beta_{2i}}
{\left(\dfrac{\sigma_1^2}{\tilde{x}_{1i}'\tilde{x}_{1i}} + \dfrac{\sigma_2^2}{\tilde{x}_{2i}'\tilde{x}_{2i}}\right)^{1/2}}
\sim N(0,\ 1). \tag{1.5.37}
$$
where σ̂² is the unbiased estimate of σ² obtained from the model (1.5.25), that is,

$$
\hat\sigma^2 = \frac{1}{T_1 + T_2 - 2K^*}\, y'[I - X(X'X)^{-1}X']y. \tag{1.5.41}
$$
and

$$
\frac{y_2'M_2 y_2}{\sigma_2^2} \sim \chi^2_{T_2 - K^*}. \tag{1.5.43}
$$

Because these two chi-square variables are independent by the assumptions of the model, we have by Theorem 4 of Appendix 2

$$
\frac{T_2 - K^*}{T_1 - K^*}\cdot\frac{y_1'M_1 y_1}{y_2'M_2 y_2}
\sim F(T_1 - K^*,\ T_2 - K^*). \tag{1.5.44}
$$
Unlike the F test of Section 1.5.2, a two-tailed test should be used here because
either a large or a small value of (1.5.44) is a reason for rejecting the null
hypothesis.
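A sketch of the two-tailed variance-ratio test (1.5.44); the dimensions and data below are invented, and in practice the statistic would be compared with both tails of the F(T₁ − K*, T₂ − K*) tables.

```python
import numpy as np

rng = np.random.default_rng(2)
T1, T2, K = 20, 30, 2                         # K plays the role of K*
X1, X2 = rng.normal(size=(T1, K)), rng.normal(size=(T2, K))
y1 = X1 @ np.array([1.0, -1.0]) + rng.normal(size=T1)
y2 = X2 @ np.array([1.0, -1.0]) + rng.normal(size=T2)

def rss(X, y):
    """Residual sum of squares y'My from regressing y on X."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return np.sum((y - X @ b) ** 2)

# Ratio of unbiased variance estimates; under sigma1^2 = sigma2^2 it is
# distributed as F(T1 - K, T2 - K).  Both very large and very small
# values are grounds for rejection (two-tailed test).
F = (rss(X1, y1) / (T1 - K)) / (rss(X2, y2) / (T2 - K))
```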
$$
\frac{\hat\beta_{1i} - \hat\beta_{2i}}
{\left(\dfrac{\hat\sigma_1^2}{\tilde{x}_{1i}'\tilde{x}_{1i}} + \dfrac{\hat\sigma_2^2}{\tilde{x}_{2i}'\tilde{x}_{2i}}\right)^{1/2}}, \tag{1.5.45}
$$

where σ̂₁² = (T₁ − K*)⁻¹y₁'M₁y₁ and σ̂₂² = (T₂ − K*)⁻¹y₂'M₂y₂, is approximately distributed as Student's t with degrees of freedom to be appropriately determined. Because the statement (1.5.37) is still valid, the assumption that (1.5.45) is approximately Student's t is equivalent to the assumption that w defined by

$$
w = \nu\,
\frac{\dfrac{\hat\sigma_1^2}{\tilde{x}_{1i}'\tilde{x}_{1i}} + \dfrac{\hat\sigma_2^2}{\tilde{x}_{2i}'\tilde{x}_{2i}}}
{\dfrac{\sigma_1^2}{\tilde{x}_{1i}'\tilde{x}_{1i}} + \dfrac{\sigma_2^2}{\tilde{x}_{2i}'\tilde{x}_{2i}}} \tag{1.5.46}
$$

is approximately χ²_ν for some ν. Because Ew = ν, w has the same mean as χ²_ν.
We shall determine ν so as to satisfy

Vw = 2ν.    (1.5.47)

Solving (1.5.47) for ν, we obtain

$$
\nu = \frac{\left(\dfrac{\sigma_1^2}{\tilde{x}_{1i}'\tilde{x}_{1i}} + \dfrac{\sigma_2^2}{\tilde{x}_{2i}'\tilde{x}_{2i}}\right)^2}
{\dfrac{\sigma_1^4}{(T_1 - K^*)(\tilde{x}_{1i}'\tilde{x}_{1i})^2} + \dfrac{\sigma_2^4}{(T_2 - K^*)(\tilde{x}_{2i}'\tilde{x}_{2i})^2}}. \tag{1.5.48}
$$
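The degrees-of-freedom formula (1.5.48) is easy to evaluate; the inputs below are invented, with aᵢ standing for x̃ᵢ'x̃ᵢ and nᵢ for Tᵢ − K*.

```python
# Satterthwaite-style degrees of freedom (1.5.48) with invented inputs.
s1_sq, s2_sq = 2.0, 5.0        # the two error variances
a1, a2 = 10.0, 8.0             # x-tilde'x-tilde for the tested column
n1, n2 = 20, 30                # T1 - K*, T2 - K*

v1, v2 = s1_sq / a1, s2_sq / a2
nu = (v1 + v2) ** 2 / (v1 ** 2 / n1 + v2 ** 2 / n2)
# nu always lies between min(n1, n2) and n1 + n2.
```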
Finally, using the standard normal variable (1.5.37) and the approximate
estimates obtained at the end of Step 1 and Step 2 may be used without
changing the asymptotic result (1.5.51).
Using the asymptotic theory of the likelihood ratio test, which will be developed in Section 4.5.1, we have asymptotically (that is, approximately when both T₁ and T₂ are large)

$$
-2\log \mathrm{LRT} = T_1 \log(\tilde\sigma_1^2/\hat\sigma_1^2) + T_2 \log(\tilde\sigma_2^2/\hat\sigma_2^2) \sim \chi^2_{K^*}, \tag{1.5.51}
$$

where σ̂ᵢ² is the unconstrained and σ̃ᵢ² the constrained (β₁ = β₂) maximum likelihood estimator of σᵢ². The null hypothesis β₁ = β₂ is to be rejected when the statistic (1.5.51) is larger than a certain value.
The asymptotic F test is derived by the following simple procedure: First, estimate σ₁² and σ₂² by σ̂₁² and σ̂₂², respectively, and define η̂ = σ̂₁/σ̂₂. Second, multiply both sides of (1.5.24) by η̂ and define the new equation

y₂* = X₂*β₂ + u₂*,    (1.5.52)

where y₂* = η̂y₂, X₂* = η̂X₂, and u₂* = η̂u₂. Third, treat (1.5.23) and (1.5.52) as the given equations and perform the F test (1.5.26) on them. The method works asymptotically because the variance of u₂* is approximately the same as that of u₁ when T₁ and T₂ are large, because η̂ converges to σ₁/σ₂ in probability.
Goldfeld and Quandt (1978) conducted a Monte Carlo experiment that showed that, when σ₁² ≠ σ₂², the asymptotic F test performs well, closely followed by the asymptotic likelihood ratio test, whereas the F test based on the assumption of equality of the variances could be considerably inferior.
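The three-step procedure can be sketched as follows; the data-generating values (σ₁ = 1, σ₂ = 2, the design matrices, and β) are invented. Note that rescaling the second regime leaves its least squares coefficient unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
T1, T2, K = 60, 80, 3
X1, X2 = rng.normal(size=(T1, K)), rng.normal(size=(T2, K))
beta = np.array([1.0, 2.0, -1.0])
y1 = X1 @ beta + 1.0 * rng.normal(size=T1)    # sigma1 = 1
y2 = X2 @ beta + 2.0 * rng.normal(size=T2)    # sigma2 = 2

def ols(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return b, np.sum((y - X @ b) ** 2)

# Step 1: estimate the two variances.
b1, rss1 = ols(X1, y1)
b2, rss2 = ols(X2, y2)
eta_hat = np.sqrt(rss1 / (T1 - K)) / np.sqrt(rss2 / (T2 - K))

# Step 2: rescale the second regime as in (1.5.52).
y2s, X2s = eta_hat * y2, eta_hat * X2
b2s, rss2s = ols(X2s, y2s)        # identical to b2: scaling cancels in OLS

# Step 3: Chow-type F statistic (1.5.26) on (1.5.23) and (1.5.52).
d = b1 - b2s
middle = np.linalg.inv(np.linalg.inv(X1.T @ X1) + np.linalg.inv(X2s.T @ X2s))
F = (T1 + T2 - 2 * K) / K * (d @ middle @ d) / (rss1 + rss2s)
```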
1.6 Prediction
We shall add to Model 1 the pth period relationship (where p > T)

y_p = x_p'β + u_p,    (1.6.1)

where y_p and u_p are scalars and x_p is the vector of pth period observations on the regressors, which we assume to be random variables distributed independently of u_p and u. We shall also assume that u_p is distributed independently of u with Eu_p = 0 and Vu_p = σ². The problem we shall consider in this section is how to predict y_p by a function of y, X, and x_p when β and σ² are unknown.

We shall only consider predictors of y_p that can be written in the following form:

ŷ_p = x_p'β*,    (1.6.2)

where β* is an arbitrary estimator of β and a function of y and X. Here, β* may be either linear or nonlinear and unbiased or biased. Although there are more
Exercises
1. (Section 1.1.2)
Give an example of a pair of random variables that are noncorrelated but
not independent.
2. (Section 1.1.3)
Let y and x be scalar dichotomous random variables with zero means.
Obtain the mean squared error of β̂_R = y'y/x'y and compare it with that of the least squares estimator β̂. (Note that β̂_R, the reverse least squares estimator, is obtained by minimizing the sum of squared errors in the direction of the x-axis.)
11. (Section 1.3.2)
Assume K = 1 in Model 1 with normality and furthermore assume β² = σ². Obtain the maximum likelihood estimator of β and obtain the Cramér-Rao lower bound.
12. (Section 1.4.2)
Suppose
$$
Q' = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix}.
$$

Find a row vector R' such that (Q, R) is nonsingular and R'Q = 0.
13. (Section 1.4.2)
Somebody has run a least squares regression in the classical regression
model (Model 1) and reported
$$
\hat\beta = \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \hat\beta_3 \end{bmatrix}
= \begin{bmatrix} 2 \\ 4 \\ 2 \end{bmatrix}
\quad\text{and}\quad
\hat\sigma^2 (X'X)^{-1} = \begin{bmatrix} 5 & 3 & 1 \\ 3 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}.
$$
(1) y₁ = α₁1 + α₂x₁ + u₁
(2) y₂ = β₁1 + β₂x₂ + u₂,
where the α's and β's are scalar parameters; y₁, y₂, x₁, x₂, u₁, and u₂ are seven-component vectors; and 1 is the seven-component vector of ones. Assume that u₁ and u₂ are normally distributed with the common variance σ² and that u₁ and u₂ are independent of each other. Suppose that 1'x₁ = 1'x₂ = 0 and 1'y₁ = 1'y₂ = 7. Suppose also that the sample moment matrix of the four observable variables is given as
          y₁     y₂     x₁     x₂
  y₁     9.3    7      2      1.5
  y₂     7      9      3      1
  x₁     2      3      2      1.2
  x₂     1.5    1      1.2    1
For example, the table shows y₁'y₁ = 9.3 and y₁'y₂ = 7. Should you reject the joint hypothesis "α₁ = β₁ and α₁ + 2α₂ = β₂" at the 5% significance level? How about at 1%?
16. (Section 1.5.3)
Consider a classical regression model
y₁ = α₁x₁ + β₁z₁ + u₁
y₂ = α₂x₂ + β₂z₂ + u₂
y₃ = α₃x₃ + β₃z₃ + u₃,
where the α's and β's are scalar unknown parameters, the other variables are vectors of ten elements, the x's and z's are vectors of known constants, and the u's are normal with mean zero and Euᵢuᵢ' = σ²I for every i and Euᵢuⱼ' = 0 if i ≠ j. Suppose the observed vector products are as follows:
In this chapter we shall present three additional topics. They can be discussed
in the framework of Model 1 but are grouped here in a separate chapter
because they involve developments more recent than the results of the pre-
vious chapter.
basis of the observed sample, and their gain from the game is a function of how close their guess is to the true value. The major components of the game are Θ, the parameter space; Y, the sample space; and D, the decision space (the totality of functions from Y to Θ). We shall denote a single element of each space by the lowercase letters θ, y, d. Thus, if y is a particular observed sample (a vector of random variables), d is a function of y (called a statistic or an estimator) used to estimate θ. We assume that the loss incurred by choosing d when the true value of the parameter is θ is given by the loss function L(d, θ).
We shall define a few standard terms used in statistical decision theory.
Risk. The expected loss E_y L(d, θ), for which the expectation is taken with respect to y (which is implicitly in the argument of the function d), is called the risk and is denoted by R(d|θ).

Uniformly smaller risk. The estimator d₁ has a uniformly smaller risk than the estimator d₂ if R(d₁|θ) ≤ R(d₂|θ) for all θ ∈ Θ and R(d₁|θ) < R(d₂|θ) for at least one θ ∈ Θ.

Admissible. An estimator is admissible if there is no d in D that has a uniformly smaller risk. Otherwise it is called inadmissible.

Minimax. The estimator d* is called a minimax estimator if

max_{θ∈Θ} R(d*|θ) = min_{d∈D} max_{θ∈Θ} R(d|θ).

The minimax estimator protects the statistician against the worst possible situation. If max_{θ∈Θ} R(d|θ) does not exist, it should be replaced with sup_{θ∈Θ} R(d|θ) in the preceding definition (and min with inf).

Posterior risk. The expected loss E_θ L(d, θ), for which the expectation is taken with respect to the posterior distribution of θ given y, is called the posterior risk and is denoted by R(d|y). It obviously depends on the particular prior distribution used in obtaining the posterior distribution.

Bayes estimator. The Bayes estimator, given a particular prior distribution, minimizes the posterior risk R(d|y). If the loss function is quadratic, namely, L(d, θ) = (d − θ)'W(d − θ), where W is an arbitrary nonsingular matrix, the posterior risk E_θ(d − θ)'W(d − θ) is minimized at d = E(θ|y), the posterior mean of θ. An example of the Bayes estimator was given in Section 1.4.4.

Regret. Let R(d|θ) be the risk. Then the regret W(d|θ) is defined by

W(d|θ) = R(d|θ) − min_{d∈D} R(d|θ).

Minimax regret. The minimax regret strategy minimizes max_{θ∈Θ} W(d|θ) with respect to d.
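These definitions can be made concrete on a toy problem with two parameter values and two decision rules; the risk numbers below are invented, and the example shows that the minimax and minimax regret rules need not coincide.

```python
# Risk table R(d|theta): rows are decision rules, columns parameter values.
risks = {
    "d1": {"theta0": 1.0, "theta1": 4.0},
    "d2": {"theta0": 3.0, "theta1": 3.2},
}

# Minimax: minimize the worst-case risk.
max_risk = {d: max(r.values()) for d, r in risks.items()}
minimax_rule = min(max_risk, key=max_risk.get)             # here "d2"

# Regret W(d|theta) = R(d|theta) - min over rules of R(.|theta).
thetas = ["theta0", "theta1"]
best = {t: min(r[t] for r in risks.values()) for t in thetas}
regret = {d: {t: r[t] - best[t] for t in thetas} for d, r in risks.items()}

# Minimax regret: minimize the worst-case regret.
max_regret = {d: max(w.values()) for d, w in regret.items()}
minimax_regret_rule = min(max_regret, key=max_regret.get)  # here "d1"
```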
Some useful results can be stated informally as remarks rather than stating
them formally as theorems.
REMARK 2.1.1. A Bayes estimator is admissible.
REMARK 2.1.2. A minimax estimator is either a Bayes estimator or the limit
of a sequence of Bayes estimators. The latter is called a generalized Bayes
estimator. (In contrast, a Bayes estimator is sometimes called a proper Bayes
estimator.)
REMARK 2.1.3. A generalized Bayes estimator with a constant risk is
minimax.
REMARK 2.1.4. An admissible estimator may or may not be minimax, and a
minimax estimator may or may not be admissible.
and similarly for P(2|y), where θᵢ = (βᵢ', σᵢ²)', f(y|θᵢ) is the joint density of y given θᵢ, f(θᵢ|i) is the prior density of θᵢ given the model i, and P(i) is the prior probability that the model i is true, for i = 1 and 2.
There is an alternative way to characterize the Bayesian strategy. Let S be a
subset of the space of y such that the Bayesian chooses the model 1 if y E S.
Recent Developments in Regression Analysis 49
$$
1 - R^2 = \frac{y'My}{y'Ly}, \tag{2.1.7}
$$

where M = I − X(X'X)⁻¹X' and L = I − T⁻¹11' as before, choosing the equation with the largest R² is equivalent to choosing the equation with the
(2.1.1) and (2.1.2), Theil's strategy based on his R̄² amounts to choosing Eq. (2.1.1) if

σ̄₁² < σ̄₂²,    (2.1.8)

where σ̄₁² = (T − K₁)⁻¹y'M₁y, M₁ = I − X₁(X₁'X₁)⁻¹X₁', σ̄₂² = (T − K₂)⁻¹y'M₂y, and M₂ = I − X₂(X₂'X₂)⁻¹X₂'.
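A minimal sketch of the rule (2.1.8); the data and the two competing regressor sets below are invented. Because both models share the same dependent variable, ranking by R̄² is the same as ranking by σ̄².

```python
import numpy as np

rng = np.random.default_rng(4)
T = 50
X1 = rng.normal(size=(T, 2))                 # regressors of Eq. (2.1.1)
X2 = rng.normal(size=(T, 3))                 # regressors of Eq. (2.1.2)
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=T)

def sigma_bar_sq(X):
    """(T - K)^{-1} y'My with M the residual-maker of X."""
    K = X.shape[1]
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return np.sum((y - X @ b) ** 2) / (T - K)

s1, s2 = sigma_bar_sq(X1), sigma_bar_sq(X2)
choose_eq_1 = s1 < s2                        # Theil's rule (2.1.8)

# Equivalently, compare the corrected R-bar-squared values.
tss = np.sum((y - y.mean()) ** 2) / (T - 1)  # y'Ly/(T - 1)
rbar1, rbar2 = 1 - s1 / tss, 1 - s2 / tss
```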
The inequality (2.1.8) can be regarded as a constraint on y and hence defines a subset in the space of y. Call it S₀; that is, S₀ = {y | σ̄₁² < σ̄₂²}. This choice of S can be evaluated in terms of the Bayesian minimand (2.1.5). Suppose Eq. (2.1.2) is the true model. Then we have
normality. Note that the X₂ here has no relationship with the X₂ that appears in Eq. (2.1.2). Suppose we suspect β₂ might be 0 and test the hypothesis β₂ = 0 by the F test developed in Section 1.5.2. The appropriate test statistic is obtained by putting β₂ = 0 in Eq. (1.5.19) as

$$
\eta = \frac{T - K}{K_2}\left(\frac{y'M_1 y}{y'My} - 1\right) \sim F(K_2,\ T - K). \tag{2.1.18}
$$

The researcher first sets the critical value d and then chooses the model (2.1.17) if η ≥ d or the constrained model

y = X₁β₁ + u    (2.1.19)

if η < d.
Conventionally, the critical value d is determined rather arbitrarily in such a way that P(η ≥ d) evaluated under the null hypothesis equals a preassigned significance level such as 1 or 5 percent. We shall consider a decision-theoretic determination of d. For that we must first specify the risk function. The decision of the researcher who chooses between models (2.1.17) and (2.1.19) on the basis of the F statistic η may be interpreted as a decision to estimate β by the estimator β̂* defined as

β̂* = β̂  if η ≥ d    (2.1.20)
    = β̃  if η < d,

where β̂ is the least squares estimator applied to (2.1.17) and β̃ is that applied to (2.1.19). Thus it seems reasonable to adopt the mean squared error matrix Ω ≡ E(β̂* − β)(β̂* − β)', where the expectation is taken under (2.1.17), as our risk (or expected loss) function. However, Ω is not easy to work with directly because it depends on many variables and parameters, namely, X, β, σ², K, K₁, and d, in addition to having the fundamental difficulty of being a matrix. (For the derivation of Ω, see Sawa and Hiromatsu, 1973, or Farebrother, 1975.) Thus people have worked with simpler risk functions.
Sawa and Hiromatsu (1973) chose as their risk function the largest characteristic root of

Q'ΩQ,    (2.1.21)

where Q' = (0, I), with 0 the K₂ × K₁ matrix of zeros and I the identity matrix of size K₂. This transformation of Ω lacks a strong theoretical justification and is used primarily for mathematical convenience. Sawa and Hiromatsu applied the minimax regret strategy to the risk function (2.1.21) and showed that in the special case K₂ = 1, d = 1.88 is optimal for most reasonable values of T − K. Brook (1976) applied the minimax regret strategy to a different transformation of Ω,

tr X'XΩ,    (2.1.22)

and recommended d = 2 on the basis of his results. The risk function (2.1.22) seems more reasonable than (2.1.21) because it is more closely related to the mean squared prediction error (see Section 1.6). At any rate, the conclusions of these two articles are similar.
Now on the basis of these results we can evaluate the criteria discussed in the previous subsections by asking what critical value is implied by each criterion in a situation where the set of regressors of one model is contained in that of the other model. We must choose between models (2.1.17) and (2.1.19). For each criterion, let ρ denote the ratio of the value of the criterion for model (2.1.19) over that for model (2.1.17). Then, using (2.1.18), we can easily establish a relationship between η and ρ. For Theil's criterion we have from (2.1.6)

$$
\eta = \frac{(T - K_1)\,\rho(\text{Theil}) - (T - K)}{K - K_1}. \tag{2.1.23}
$$
Therefore we obtain the well-known result that Theil's criterion selects (2.1.19) over (2.1.17) if and only if η < 1. Thus, compared with the optimal critical values suggested by Brook or by Sawa and Hiromatsu, Theil's criterion imposes far less of a penalty upon the inclusion of regressors. From the prediction criterion (2.1.13) we get

$$
\eta = \frac{(T - K_1)(T + K)\,\rho(\text{PC}) - (T - K)(T + K_1)}{(K - K_1)(T + K_1)}. \tag{2.1.24}
$$
Therefore

ρ(PC) > 1 if and only if η > 2T/(T + K₁).    (2.1.25)

Table 2.2 gives the values of 2T/(T + K₁) for a few selected values of K₁/T.
These values are close to the values recommended by Brook and by Sawa and Hiromatsu. The optimal critical value of the F test implied by the AIC can be easily computed for various values of K₁/T and K/T from (2.1.16) and (2.1.18). The critical value for the AIC is very close to that for the PC, although
$$
\eta = \frac{(T + K)\,\rho(\text{MC}) - (T - K + 2K_1)}{K - K_1}. \tag{2.1.26}
$$

Therefore

ρ(MC) > 1 if and only if η > 2.    (2.1.27)
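Collecting the implied critical values of η: Theil's R̄² penalizes at 1, the PC at 2T/(T + K₁), and the modified R² at 2. A small helper (invented here) tabulates them:

```python
def implied_critical_values(T, K1):
    """Critical values of the F statistic eta implied by each criterion
    when a model with K regressors nests one with K1 regressors."""
    return {"theil": 1.0, "pc": 2.0 * T / (T + K1), "mc": 2.0}

cv = implied_critical_values(T=100, K1=5)    # pc = 200/105, about 1.90
```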
These results give some credence to the proposition that the modified R² is preferred to Theil's corrected R̄² as a measure of the goodness of fit. However, the reader should take this conclusion with a grain of salt for
several reasons: (1) None of the criteria discussed in the previous subsections is
derived from completely justifiable assumptions. (2) The results in the literature on the optimal significance level are derived from the somewhat questionable principle of minimizing the maximum regret. (3) The results in the
literature on the optimal significance level are relevant to a comparison of the
criteria considered in the earlier subsections only to the extent that one set of
regressors is contained in the other set. The reader should be reminded again
that a measure of the goodness of fit is merely one of the many things to be
considered in the whole process of choosing a regression equation.
The biased estimators we shall consider here are either the constrained least
squares estimator discussed in Section 1.4.1 or the Bayes estimator discussed
in Section 1.4.4 or their variants. If the linear constraints (1.4.1) are true, the
constrained least squares estimator is best linear unbiased. Similarly, the Bayes estimator has optimal properties if the regression vector β is indeed random and generated according to the prior distribution. In this section,
however, we shall investigate the properties of these estimators assuming that
the constraints do not necessarily hold. Hence, we have called them biased
estimators. Even so, it is not at all surprising that such a biased estimator can
beat the least squares estimator over some region of the parameter space. For
example, 0 can beat any estimator when the true value of the parameter in
question is indeed 0. What is surprising is that there exists a biased estimator that dominates the least squares estimator over the whole parameter space when the risk function is the sum of the mean squared errors, as we shall show.
Such an estimator was first discovered by Stein (see James and Stein, 1961)
and has since attracted the attention of many statisticians, some of whom have
extended Stein's results in various directions.
In this section we shall discuss simultaneously two closely related and yet
separate ideas: One is the aforementioned idea that a biased estimator can
dominate least squares, for which the main result is Stein's, and the other is the
idea of ridge regression originally developed by Hoerl and Kennard (1970a, b)
to cope with the problem of multicollinearity. Although the two ideas were
initially developed independently of each other, the resulting estimators are
close cousins; in fact, the term Stein-type estimators and the term ridge esti-
mators are synonymous and may be used to describe the same class of estima-
tors. Nevertheless, it is important to recognize them as separate ideas. We
might be tempted to combine the two ideas by asserting that a biased estimator
can be good and is especially so if there is multicollinearity. The statement can
be proved wrong simply by noting that Stein's original model assumes
X 'X = I, the opposite of multicollinearity. The correct characterization of the
two ideas is as follows: (1) Some form of constraint is useful in estimation. (2)
Some form of constraint is necessary if there is multicollinearity.
The risk function we shall use throughout this section is the scalar

E(β̂ − β)'(β̂ − β),    (2.2.1)

where β̂ is the estimator in question. This choice of the risk function is as general as

E(β̂ − β)'A(β̂ − β),    (2.2.2)
and consider the transformed parameter vector A^{1/2}β. Note, however, that (2.2.1) is not as general as the mean squared error matrix E(β̂ − β)(β̂ − β)', which we used in Section 1.2.4, since (2.2.1) is the trace of the mean squared error matrix.
We shall now turn to a more general question: Can F'β be estimated by least squares, where F is an arbitrary K × f matrix of rank f (f ≤ K)? To make sense out of this question, we must first define the least squares estimator of F'β. We say that the least squares estimator of F'β is F'β̂, where β̂ is any solution (which may not be unique) of the normal equation (2.2.6), provided F'β̂ is unique. If F'β̂ is unique, we also say that F'β is estimable. Then it is easy to prove that F'β is estimable if and only if we can write F = X'A for some T × f matrix A, or, equivalently, if and only if we can write F = X'XB for some K × f matrix B. (See Rao, 1973, pp. 223-224, for the proof.) If F'β is estimable, it can be shown that F'β̂ is the best linear unbiased estimator of F'β.
The estimability of F'β can be reduced to the estimability of a subset of the regression parameters in the sense of the previous paragraph by the following observation. Let G be a K × (K − f) matrix of rank K − f such that G'F = 0. (We defined a similar matrix in Section 1.4.2.) Then we can write Model 1 as

$$
y = X\beta + u \tag{2.2.7}
$$
$$
= X\left[F(F'F)^{-1},\ G(G'G)^{-1}\right]\begin{bmatrix} F' \\ G' \end{bmatrix}\beta + u
\equiv Z_1\gamma_1 + Z_2\gamma_2 + u,
$$

where the identity defines Z₁, Z₂, γ₁, and γ₂. Then the estimability of F'β is equivalent to the estimability of γ₁.
If X' X is singular, fi is not estimable in the sense defined above (that is, a
solution of Eq. 2.2.6 is not unique). This fact does not mean that we should not
attempt to estimate fl. We can still meaningfully talk about a class of estima-
tors and study the relative merits of the members of the class. One such class
may be the totality of solutions of (2.2.6), infinite in number. Another class may be the constrained least squares estimators satisfying linear constraints Q'β = c. From Eq. (1.4.11) it is clear that this estimator can be defined even when X'X is singular. A third class is the class of Bayes estimators with prior Q'β = c + v formed by varying Q, c, and the distribution of v. We should
mention an important member of the first class that also happens to be a
member of the second class. It is called the principal components estimator.
Suppose we arrange the diagonal elements of Λ defined in Section 2.2.2 in descending order λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_K, and let the corresponding characteristic vectors be h₁, h₂, . . . , h_K, so that H = (h₁, h₂, . . . , h_K). Then we call Xhᵢ the ith principal component of X. If X'X is singular, some of its
characteristic roots are 0. Partition
$$
\Lambda = \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \tag{2.2.8}
$$
so that the diagonal elements of Λ₁ are positive and those of Λ₂ are all 0, and partition H = (H₁, H₂) conformably. Furthermore, define X₁* = XH₁ and X₂* = XH₂ and partition α' = (α₁', α₂') conformably. Then X₂* = 0 and hence α₂ cannot be estimated. Suppose we estimate α₁ by

α̂₁ = (X₁*'X₁*)⁻¹X₁*'y    (2.2.9)

and set α̂₂ = 0. (It is arbitrary to choose 0 here; any other constant will do.) Transforming α̂' ≡ (α̂₁', α̂₂') into an estimator of β, we obtain the principal components estimator of β by the formula

β̂_P = Hα̂ = H₁Λ₁⁻¹H₁'X'y.    (2.2.10)

It is easy to show that β̂_P satisfies (2.2.6); hence, it is a member of the first class. It is also a member of the second class because it is the constrained least squares estimator subject to H₂'β = 0.
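A sketch of (2.2.9)-(2.2.10) on an invented singular design (the third column of X is the sum of the first two). The principal components estimator is checked against the (singular) normal equations (2.2.6).

```python
import numpy as np

rng = np.random.default_rng(5)
T = 40
Z = rng.normal(size=(T, 2))
X = np.column_stack([Z, Z.sum(axis=1)])      # rank 2, so X'X is singular
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=T)

lam, H = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]                # descending characteristic roots
lam, H = lam[order], H[:, order]

r = int(np.sum(lam > 1e-8))                  # number of positive roots (= 2)
H1 = H[:, :r]
beta_pc = H1 @ np.diag(1.0 / lam[:r]) @ H1.T @ (X.T @ y)   # (2.2.10)

# beta_pc is a solution of the normal equations X'X b = X'y.
gap = X.T @ X @ beta_pc - X.T @ y
```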
It was shown by Fomby, Hill, and Johnson (1978) that the principal components estimator (or constrained least squares subject to H₂'β = 0) has a smaller variance-covariance matrix than any constrained least squares estimator obtained subject to the constraints Q'β = c, where Q and c can be arbitrary except that Q has an equal or smaller number of columns than H₂.
We shall now consider a situation where X'X is nonsingular but nearly
singular. The near singularity of X'X is commonly referred to as multicollinearity. Another way of characterizing it is to say that the determinant of X'X is
close to 0 or that the smallest characteristic root of X' X is small. (The question
of how small is "small" will be better understood later.) We now ask the
question, How precisely or imprecisely can we estimate a linear combination of the regression parameters c'β by least squares? Because the matrix H is nonsingular, we can write c = Hd for some vector d. Then we have

V(c'β̂) = σ²d'Λ⁻¹d,    (2.2.11)

which gives the answer to the question. In other words, the closer c is to the direction of the first (last) principal component, the more precisely (imprecisely) one can estimate c'β. In particular, we note from (2.2.5) that the precision of the estimator of an element of α is directly proportional to the corresponding diagonal element of Λ.
Suppose we partition A as in (2.2.8) but this time include all the "large"
elements in A 1 and "small" elements in A2. The consideration of which roots
two facts: (1) Consider the ridge estimator (X'X + γI)⁻¹X'y with γ = (K − 2)σ²/[β̂'β̂ − (K − 2)σ²]. If X'X = I, it reduces precisely to Stein's estimator of β since X'y ~ N(β, σ²I). Therefore this ridge estimator is uniformly better than the least squares estimator if X'X = I (the opposite of multicollinearity). (2) Assume a general X'X in Model 1. If we define A = (X'X)^{1/2}, we have Aβ̂ ~ N(Aβ, σ²I), where β̂ is the least squares estimator. Applying Stein's estimator to Aβ̂, we know E‖(1 − B)Aβ̂ − Aβ‖² < E‖Aβ̂ − Aβ‖² for all β, where B = (K − 2)σ²/β̂'X'Xβ̂. Therefore, equivalently, (1 − B)β̂ is uniformly better than β̂ with respect to the risk function E(β̂ − β)'X'X(β̂ − β). Note that this is essentially the risk function we proposed in (1.6.10) in Section 1.6, where we discussed prediction.
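The dominance claim can be illustrated by simulation in the canonical case X'X = I, so that the least squares estimator is N(α, σ²I); the settings below are invented, and the true vector is placed at zero, where the gain from shrinkage is largest.

```python
import numpy as np

rng = np.random.default_rng(6)
K, n_rep = 8, 20000
alpha = np.zeros(K)                          # true parameter vector
a_hat = alpha + rng.normal(size=(n_rep, K))  # sigma^2 = 1

B = (K - 2) / np.sum(a_hat ** 2, axis=1)     # (K - 2) sigma^2 / a-hat'a-hat
stein = (1.0 - B)[:, None] * a_hat

risk_ls = np.mean(np.sum((a_hat - alpha) ** 2, axis=1))     # about K = 8
risk_stein = np.mean(np.sum((stein - alpha) ** 2, axis=1))  # about 2 here
```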
So far, we have assumed σ² is known. James and Stein showed that even when σ² is unknown, if S is distributed independently of α̂ and as σ²χ²_n, then E‖[1 − cS(α̂'α̂)⁻¹]α̂ − α‖² attains its minimum for all α and σ² at c = (K − 2)/(n + 2) if K ≥ 3. They also showed that [1 − cS(α̂'α̂)⁻¹]α̂ is uniformly better than α̂ if 0 < c < 2(K − 2)/(n + 2). In the regression model we can put S = y'[I − X(X'X)⁻¹X']y because it is independent of β̂ and distributed as σ²χ²_{T−K}.
Efron and Morris (1972) interpreted Stein's estimator as an empirical Bayes estimator. Suppose α̂ ~ N(α, σ²I), where σ² is known, and the prior distribution of α is N(0, σ²γ⁻¹I). Then the Bayes estimator is α* = (1 + γ)⁻¹α̂ = (1 − B)α̂, where B = γ/(1 + γ). The marginal distribution of α̂ is N(0, σ²B⁻¹I); therefore Bα̂'α̂/σ² ~ χ²_K. Because E[(χ²_K)⁻¹] = (K − 2)⁻¹ (see Johnson and Kotz, 1970a, vol. 1, p. 166), we have

E[(K − 2)σ²/α̂'α̂] = B.

Thus we can use the term within the square brackets as an unbiased estimator of B, thereby leading to Stein's estimator.
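The unbiasedness E[(K − 2)σ²/α̂'α̂] = B under the marginal distribution is easy to confirm by simulation; γ and the other settings below are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
K, sigma2, gamma, n_rep = 6, 1.0, 0.5, 200000
B = gamma / (1.0 + gamma)                    # true shrinkage factor = 1/3

# Marginal distribution of alpha-hat: N(0, sigma^2 B^{-1} I).
a_hat = rng.normal(size=(n_rep, K)) * np.sqrt(sigma2 / B)
B_unbiased = (K - 2) * sigma2 / np.sum(a_hat ** 2, axis=1)
# The sample mean of B_unbiased should be very close to B.
```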
It is important not to confuse Stein's result with the statement that E(α* − α)(α* − α)' < E(α̂ − α)(α̂ − α)' in the matrix sense. This inequality does not generally hold. Note that Stein's estimator shrinks each component of α̂ by the same factor B. If the amount of shrinkage for a particular component is large, the mean squared error of Stein's estimator for that component may well exceed that of the corresponding component of α̂, even though E‖α* − α‖² < E‖α̂ − α‖². In view of this possibility, Efron and Morris (1972) proposed a compromise: Limit the amount of shrinkage to a fixed amount for each component. In this way the maximum possible mean squared error for the components of α can be decreased, whereas, with luck, the sum of the mean squared errors will not be increased by very much.
αᵢ* = (1 − Bᵢ)α̂ᵢ, where Bᵢ = γ/(λᵢ + γ).

(Note: The transformation α = H'β translates this estimator into the ridge estimator (2.2.12). γ is either a constant or a function of the sample.)

Generalized ridge estimator: α* = (Λ + Γ)⁻¹Λα̂, where Γ is a diagonal matrix,
Other ridge and generalized ridge estimators have been proposed by various authors. In the three following ridge estimators, γ is a positive quantity that does not depend on i; therefore the shrinkage factor Bᵢ is inversely related to λᵢ. This is an intuitively appealing property because it seems reasonable to shrink the component with the larger variance more. In the four following generalized ridge estimators, exactly the opposite takes place: The amount of shrinkage Bᵢ is an increasing function of λᵢ, an undesirable property. In some of the estimators, σ² appears in the formula, and in some, its estimate σ̂², which is assumed to be independent of α̂, appears. As pointed out by Efron and Morris (1976), the fundamental properties of Stein's estimators are not changed if σ² is independently estimated.
Ridge Estimators
Ridge 1 (Sclove, 1973)

γ = σ² tr Λ / (α̂'Λα̂)
Ridge 2 (Hoerl, Kennard, and Baldwin, 1975) and Modified Ridge 2 (Thisted, 1976)

γ = Kσ̂²/(α̂'α̂)

This estimator is obtained by putting Λ = I in Sclove's estimator. Although the authors claimed its good properties on the basis of a Monte Carlo study, Thisted (1976) showed that it can sometimes be far inferior to the maximum likelihood estimator α̂; he proposed a modification, γ = (K − 2)σ̂²/α̂'α̂, and showed that the modified estimator is minimax for some Λ if σ² is known.
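The formulas above can be compared on invented canonical quantities (characteristic roots λᵢ, canonical estimates α̂, and a variance estimate σ̂²):

```python
import numpy as np

rng = np.random.default_rng(8)
K = 5
lam = np.array([10.0, 5.0, 2.0, 1.0, 0.5])   # characteristic roots of X'X
a_hat = rng.normal(size=K)                    # canonical least squares estimates
s2 = 1.3                                      # an estimate of sigma^2

gamma_ridge1 = s2 * lam.sum() / (a_hat @ (lam * a_hat))   # Sclove
gamma_ridge2 = K * s2 / (a_hat @ a_hat)                   # Hoerl-Kennard-Baldwin
gamma_mod2 = (K - 2) * s2 / (a_hat @ a_hat)               # Thisted's modification

# Shrinkage factors B_i = gamma/(lam_i + gamma): larger for small lam_i,
# that is, for the high-variance components.
B = gamma_ridge2 / (lam + gamma_ridge2)
```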
Ridge 3 (Thisted, 1976)

γ = σ² / Σᵢ dᵢα̂ᵢ²   if all dᵢ < ∞
γ = 0   otherwise,

where each dᵢ is a function of the characteristic roots λ₁, . . . , λ_K.
Results

All the generalized ridge estimators are minimax for all Λ, and generalized ridge 2 is also admissible, whereas among the ridge estimators only Thisted's (which is strictly not ridge because of a discontinuity) is minimax for all Λ.
Because α̂ is minimax with a constant risk, any other minimax estimator dominates α̂. However, the mere fact that an estimator dominates α̂ does not necessarily make the estimator good in its own right. If the estimator is admissible as well, like Berger's generalized ridge 2, there is no other estimator that dominates it. Even that, however, is no guarantee of excellence because there may be an estimator (which may be neither minimax nor admissible) that has a lower risk over a wide range of the parameter space. It is nice to prove minimaxity and admissibility; however, we should look for other criteria of performance as well, such as whether the amount of shrinkage is proportional to the variance, the criterion in which all the generalized ridge estimators fail.
The exact distributions of Stein's or ridge estimators are generally hard to
obtain. However, in many situations they may be well approximated by the
jackknife and the bootstrap methods (see Section 4.3.4).
Carlo experiments, as pointed out by Efron, Morris, and Thisted in the discussion following the article.
Let α̂ ~ N(α, σ²I) and S ~ σ²χ²_n (independent of α̂). Consider the strategy: in the estimation of W'Zγ,

$$
D\begin{bmatrix} W_1'y^* \\ W_2'y^* \end{bmatrix} + (1 - D)\begin{bmatrix} 0 \\ W_2'y^* \end{bmatrix}
\quad\text{dominates}\quad
I_d\begin{bmatrix} W_1'y^* \\ W_2'y^* \end{bmatrix} + (1 - I_d)\begin{bmatrix} 0 \\ W_2'y^* \end{bmatrix}.
$$

Therefore, premultiplying by W (which is the inverse of W'), we see that

DZ(Z'Z)⁻¹Z'y* + (1 − D)Z₂(Z₂'Z₂)⁻¹Z₂'y* ≡ Zγ̂*    (2.2.18)

dominates

I_d Z(Z'Z)⁻¹Z'y* + (1 − I_d)Z₂(Z₂'Z₂)⁻¹Z₂'y* ≡ Zγ̂    (2.2.19)

in the estimation of Zγ. Finally, we conclude that Stein's positive-rule estimator γ̂* defined by (2.2.18) dominates the pre-test estimator γ̂ defined by (2.2.19) in the sense that E(γ̂* − γ)'Z'Z(γ̂* − γ) ≤ E(γ̂ − γ)'Z'Z(γ̂ − γ) for all γ.
Although the preceding conclusion is the only known result that shows the dominance of a Stein-type estimator over a pre-test estimator (with respect to a particular risk function), any Stein-type or ridge-type estimator presented in the previous subsections can be modified in such a way that "shrinking" or "pulling" is done toward the linear constraints Q'β = c.
We can assume Q'Q = I without loss of generality because if Q'Q ≠ I, we can define Q* = Q(Q'Q)⁻¹ᐟ² and c* = (Q'Q)⁻¹ᐟ²c so that Q*'β = c* and Q*'Q* = I. Denoting the least squares estimator of β by β̂, we have

Q'β̂ − c ~ N[Q'β − c, σ²Q'(X'X)⁻¹Q].    (2.2.20)

Defining a matrix G such that G'G = I and G'Q'(X'X)⁻¹QG ≡ M⁻¹ is diagonal, we have

G'(Q'β̂ − c) ~ N[G'(Q'β − c), σ²M⁻¹].    (2.2.21)

Therefore, if B is the diagonal matrix with the ith diagonal element Bᵢ defined
Aigner and Judge (1977) applied the estimator (2.2.27) and another estima-
tor attributed to Bock (1975) to the international trade model of Baldwin
(1971) and compared these estimates to Baldwin's estimates, which may be
regarded as pre-test estimates because Baldwin utilized a certain linear restric-
tion. Aigner and Judge concluded that the conditions under which Bock's
estimator is minimax are not satisfied by the trade data and that although
Berger's estimator (2.2.27) is always minimax, it gives results very close to the
least squares in the trade model.
Table 2.3 Asymptotic variances (times sample size) of the sample mean and the
sample median under selected distributions

Distribution    T·V(mean)    T·V(median)
Normal              1            1.57
Laplace             2            1
Cauchy              ∞            2.47
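The entries in Table 2.3 can be checked by simulation. The sketch below (sample size, replication count, and seed are arbitrary choices, not from the text) estimates T·V for the sample median under the three distributions and for the sample mean under normality:

```python
import numpy as np

rng = np.random.default_rng(0)
T, R = 500, 4000  # sample size and Monte Carlo replications (arbitrary)

def t_var(estimator, draw):
    """Estimate T times the variance of a location estimator."""
    est = np.array([estimator(draw(T)) for _ in range(R)])
    return T * est.var()

tv_mean_normal = t_var(np.mean, lambda n: rng.standard_normal(n))   # ~ 1
tv_med_normal = t_var(np.median, lambda n: rng.standard_normal(n))  # ~ pi/2
tv_med_laplace = t_var(np.median, lambda n: rng.laplace(size=n))    # ~ 1
tv_med_cauchy = t_var(np.median, lambda n: rng.standard_cauchy(n))  # ~ pi^2/4
```

The Cauchy row illustrates the point of the table: the mean has infinite variance there, while the median remains well behaved.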
median outperforms the mean in the case of the other two long-tailed distributions. (Note that the mean is the maximum likelihood estimator under normality and that the median, because it minimizes Σₜ|yₜ − b|, is the maximum likelihood estimator under the Laplace distribution.) The α-trimmed mean is the mean of the sample after a proportion α of largest and smallest observations have been removed. These and other
similar robust estimators were often used by statisticians in the nineteenth
century (see Huber, 1972; Stigler, 1973). However, the popularity of these
robust estimators declined at the turn of the century, and in the first half of the
present century the sample mean or the least squares estimator became the
dominant estimator. This change occurred probably because many sophisti-
cated testing procedures have been developed under the normality assump-
tion (mathematical convenience) and because statisticians have put an undue
confidence in the central limit theorem (rationalization). In the last twenty
years we have witnessed a resurgence of interest in robust estimation among
statisticians who have recognized that the distributions of real data are often
significantly different from normal and have heavier tails than the normal in
most cases. Tukey and his associates in Princeton have been the leading
proponents of this movement. We should also mention Mandelbrot (1963),
who has gone so far as to maintain that many economic variables have infinite
variance. However, it should be noted that the usefulness of robust estimation
is by no means dependent on the unboundedness of the variance; the occurrence of heavier tails than the normal is sufficient to ensure its efficacy. For a
survey of recent developments, the reader is referred to Andrews et al. (1972),
who reported on Monte Carlo studies of 68 robust estimators, and to Huber
(1972, 1977, 1981), Hogg (1974), and Koenker (1981a).
Robust estimators of location can be classified into four groups: M, Lp, L,
and R estimators. M, L, and R estimators are the terms used by Huber (1972).
Lp estimators constitute a subset of the class M, but we have singled them out
because of their particular importance. We shall briefly explain these classes of
estimators and then generalize them to the regression case.
M Estimator. The M estimator (the name stands for "maximum-likelihood-type" estimator) is defined as the value of b that minimizes Σₜ₌₁ᵀ ρ[(yₜ − b)/s], where s is an estimate of a scale parameter. Huber's ρ function is

ρ(z) = z²/2 if |z| ≦ c   (2.3.2)
     = c|z| − c²/2 if |z| > c,

where z = (yₜ − b)/s and c is to be chosen by the researcher. (The Monte Carlo
studies of Andrews et al. (1972) considered several values of c between 0.7 and 2.) Huber (1964) arrived at the ρ function in (2.3.2) as the minimax choice (doing the best against the least favorable distribution) when F(z) = (1 − ε)Φ(z) + εH(z), where z = (y − μ)/σ, H varies among all the symmetric distributions, Φ is the standard normal distribution function, and ε is a given constant between 0 and 1. The value of c depends on ε in a certain way. As for
s, one may choose any robust estimate of the scale parameter. Huber (1964)
proposed the simultaneous solution of

Σₜ₌₁ᵀ ρ′[(yₜ − b)/s] = 0   (2.3.3)

and

T⁻¹ Σₜ₌₁ᵀ ρ′[(yₜ − b)/s]² = (2π)⁻¹ᐟ² ∫₋∞^∞ ρ′(z)² e^(−z²/2) dz   (2.3.4)

in terms of b and s. Huber's estimate of μ converges to the sample mean or the
sample median as c tends to ∞ or 0, respectively.
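As a concrete sketch of the M estimator with Huber's ρ (the iteratively reweighted form below and the MAD-based scale are common computational conveniences, used here instead of the simultaneous solution of (2.3.3) and (2.3.4)):

```python
import numpy as np

def huber_location(y, c=1.5, tol=1e-8, max_iter=200):
    """Huber M estimate of location; scale s is fixed at the normalized
    median absolute deviation (an assumption of this sketch)."""
    s = np.median(np.abs(y - np.median(y))) / 0.6745
    b = np.median(y)                      # robust starting value
    for _ in range(max_iter):
        z = (y - b) / s
        # weights psi(z)/z for Huber's psi: 1 inside [-c, c], c/|z| outside
        w = np.minimum(1.0, c / np.maximum(np.abs(z), 1e-12))
        b_new = np.sum(w * y) / np.sum(w)
        if abs(b_new - b) < tol:
            break
        b = b_new
    return b

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 200), np.array([50.0, 60.0])])
b_huber = huber_location(y)  # close to 0 despite the two gross outliers
```

The two gross outliers pull the sample mean well away from zero but barely move the Huber estimate, which is the qualitative point of the section.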
Another M estimator shown to be robust in the study of Andrews et al.
(1972) is the following one proposed by Andrews (1974). Its ρ function is
defined by

ρ(z) = 1 − cos z if |z| ≦ π   (2.3.5)
     = 2 if |z| > π,
where z = (yₜ − b)/s as before. Andrews' choice of s is (2.1) Median{|yₜ −
integer. Thus θ = 0.5 corresponds to the median. It can be shown that the θth sample quantile, denoted by μ̂(θ), minimizes

θ Σ_{yₜ≧b} |yₜ − b| + (1 − θ) Σ_{yₜ<b} |yₜ − b|.   (2.3.7)
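The quantile characterization can be checked numerically; the sketch below (sample and grid sizes are arbitrary) recovers the sample median as the minimizer of the θ = 0.5 check loss:

```python
import numpy as np

def check_loss(b, y, theta):
    """theta * u for u >= 0 and (theta - 1) * u for u < 0, summed."""
    u = y - b
    return np.sum(np.where(u >= 0, theta * u, (theta - 1.0) * u))

rng = np.random.default_rng(1)
y = rng.standard_normal(1001)
grid = np.linspace(y.min(), y.max(), 20001)
losses = [check_loss(b, y, 0.5) for b in grid]
b_median = grid[np.argmin(losses)]  # matches np.median(y) up to grid error
```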
Remarks
We have covered most of the major robust estimators of location that have
been proposed. Of course, we can make numerous variations on these estima-
tors. Note that in some of the estimation methods discussed earlier there are
parameters that are left to the discretion of the researcher to determine. One
systematic way to determine them is the adaptive procedure, in which the
values of these parameters are determined on the basis of the information
contained in the sample. Hogg (1974) surveyed many such procedures. For
example, the α of the α-trimmed mean may be chosen so as to minimize an
where β̂ is the value of b that minimizes Σₜ₌₁ᵀ |yₜ − xₜ′b|. Actually, their estimator, which they called the one-step sine estimator, is defined as
fts = (X'DX) -1 X'Dy, (2.3.10)
where D is the diagonal matrix whose tth diagonal element dₜ is defined by

dₜ = sin(zₜ)/zₜ if |yₜ − xₜ′β̂| ≦ πs   (2.3.11)
   = 0 if |yₜ − xₜ′β̂| > πs,

where zₜ = (yₜ − xₜ′β̂)/s.
This is approximately the first step of the so-called Newton-Raphson iteration
designed to minimize (2.3.5), as we shall show next.
Put g(β) = Σₜ₌₁ᵀ ρ(zₜ), where zₜ = (yₜ − xₜ′β)/s, and expand g(β) around β = β̂ in a Taylor series as

g(β) ≅ g(β̂) + (∂g/∂β)′(β − β̂) + (1/2)(β − β̂)′(∂²g/∂β∂β′)(β − β̂),   (2.3.12)

where the derivatives are evaluated at β̂. Let β̂ₛ be the value of β that minimizes the right-hand side of (2.3.12). Thus

β̂ₛ = β̂ − [∂²g/∂β∂β′]⁻¹ (∂g/∂β).   (2.3.13)

This is the first step in the Newton-Raphson iteration (see Section 4.4.1).
Inserting

∂g/∂β = −(1/s) Σₜ₌₁ᵀ ρ′(zₜ) xₜ   (2.3.14)

and

∂²g/∂β∂β′ = (1/s²) Σₜ₌₁ᵀ ρ″(zₜ) xₜxₜ′   (2.3.15)

where ρ′ and ρ″ are evaluated at (yₜ − xₜ′β̂)/s, into (2.3.13), we obtain

β̂ₛ = β̂ + s [Σₜ₌₁ᵀ ρ″(zₜ) xₜxₜ′]⁻¹ Σₜ₌₁ᵀ ρ′(zₜ) xₜ.   (2.3.16)
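A numerical sketch of the one-step formula (2.3.16) with Andrews' sine function, ρ′(z) = sin z and ρ″(z) = cos z on |z| ≦ π. For simplicity the initial estimate here is least squares rather than the L₁ estimate β̂ used in the text, and the data and scale constant are illustrative:

```python
import numpy as np

def one_step_sine(X, y, beta0, s):
    """One Newton-Raphson step (2.3.16) with Andrews' sine:
    rho'(z) = sin z, rho''(z) = cos z for |z| <= pi and 0 outside."""
    z = (y - X @ beta0) / s
    inside = np.abs(z) <= np.pi
    rho1 = np.where(inside, np.sin(z), 0.0)
    rho2 = np.where(inside, np.cos(z), 0.0)
    H = (X * rho2[:, None]).T @ X          # sum_t rho''(z_t) x_t x_t'
    g = X.T @ rho1                         # sum_t rho'(z_t) x_t
    return beta0 + s * np.linalg.solve(H, g)

rng = np.random.default_rng(2)
T = 300
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(T)
y[:10] += 25.0                                  # gross outliers
b_start = np.linalg.lstsq(X, y, rcond=None)[0]  # crude start: OLS, not the
                                                # L1 estimate used in the text
s = 2.1 * np.median(np.abs(y - X @ b_start))    # Andrews-style scale
b_onestep = one_step_sine(X, y, b_start, s)
```

Because residuals larger than πs receive zero weight, the single step already removes most of the outlier-induced bias in the intercept.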
In their Monte Carlo study, Hill and Holland assumed that the error term is
N(0, 1) with probability 1 − α and N(0, c²) with probability α, with various
values for c and a. The regressor matrix was artificially generated with various
degrees of outlying observations and with the number of regressors ranging
from 1 to 6. The sample size was chosen to be 20. The reader should look at the
table in their article (Hill and Holland, 1977) to see the striking improvement of β̂ or β̂ₛ over the least squares estimator and the minor improvement of β̂ₛ over β̂.
Lp Estimator. This class also can easily be generalized to the regression model. It is the value of b that minimizes Σₜ₌₁ᵀ |yₜ − xₜ′b|ᵖ. A special case in which p = 1, which was already defined as β̂ in the discussion about the M estimator, will be more fully discussed later as an L estimator. For p ≠ 1, the asymptotic variance of the Lp estimator can be obtained by the same formula as that used for the asymptotic variance of the M estimator. Forsythe (1972) conducted a Monte Carlo study of Lp estimators for p = 1.25, 1.5, 1.75, and 2 (least squares) in a regression model with one fixed regressor and an intercept where the error term is distributed as G·N(0, 1) + (1 − G)·N(S, R) for several values of G, S, and R. His conclusion: The more "contamination," the smaller p should be.
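A sketch of Lp estimation by iteratively reweighted least squares, a standard computational device for 1 < p ≦ 2 (not the method used in the studies cited; data and settings are illustrative):

```python
import numpy as np

def lp_regression(X, y, p=1.5, iters=60, eps=1e-8):
    """Minimize sum_t |y_t - x_t'b|^p by iteratively reweighted least
    squares; for p = 2 this reproduces ordinary least squares."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares start
    for _ in range(iters):
        r = y - X @ b
        w = np.maximum(np.abs(r), eps) ** (p - 2.0)   # |r|^(p-2) weights
        XtW = X.T * w
        b = np.linalg.solve(XtW @ X, XtW @ y)
    return b

rng = np.random.default_rng(3)
T = 400
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
y = X @ np.array([0.5, -1.0]) + rng.laplace(size=T)
b_15 = lp_regression(X, y, p=1.5)
```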
L Estimator. The θth sample quantile can be generalized to the regression situation by simply replacing b by xₜ′b in the minimand (2.3.7), as noted by Koenker and Bassett (1978). We shall call the minimizing value the θth sample regression quantile and shall denote it by β̂(θ). They investigated the conditions for the unique solution of the minimization problem and extended Mosteller's result to the regression case. They established that β̂(θ₁), β̂(θ₂), . . . , β̂(θₙ) are asymptotically normal with the means equal to β + μ(θ₁), β + μ(θ₂), . . . , β + μ(θₙ), where μ(θᵢ) = [μ(θᵢ), 0, 0, . . . , 0]′, and the variance-covariance matrix is given by

Cov[β̂(θᵢ), β̂(θⱼ)] = ωᵢⱼ(X′X)⁻¹, i ≦ j,   (2.3.18)

where ωᵢⱼ is given in (2.3.8). A proof for the special case θ = 0.5 is also given in Bassett and Koenker (1978) (see Section 4.6.2).
Blattberg and Sargent (1971) conducted a Monte Carlo study and compared β̂, the least squares estimator, and one other estimator in the model with one regressor and no intercept, assuming the error term has the characteristic function exp(−|λ|^α) for α = 1.1, 1.3, 1.5, 1.7, 1.9, and 2.0. Note that α = 2 gives the normal distribution and α = 1 the Cauchy distribution. They found that β̂ did best in general.
τ²[X′(I − T⁻¹11′)X]⁻¹,   (2.3.22)

where 1 denotes a T-vector of ones and τ² = 12⁻¹[∫ f²(u) du]⁻². The ratio σ²/τ² is known as the Pitman efficiency of the Wilcoxon rank test and is equal to 0.955 if f is normal and greater than 0.864 for any symmetric distribution, whereas its upper bound is
infinity. Because the derivative of D exists almost everywhere, any iterative
scheme of minimization that uses only the first derivatives can be used.
(Second derivatives are identically zero.) The intercept β₀ may be estimated by the Hodges-Lehmann estimator, Median{(ûᵢ + ûⱼ)/2}, 1 ≦ i ≦ j ≦ T, where û is the vector of the least squares residuals. See the articles by McKean and
Hettmansperger (1976) and Hettmansperger and McKean (1977) for tests of
linear hypotheses using Jaeckel's estimator.
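The Hodges-Lehmann estimator is simple to compute directly; a sketch:

```python
import numpy as np

def hodges_lehmann(u):
    """Median of the pairwise averages (u_i + u_j)/2 over 1 <= i <= j <= T."""
    i, j = np.triu_indices(len(u))
    return np.median((u[i] + u[j]) / 2.0)
```

For example, applied to the residual-like vector (1, 2, 3) it returns 2, and a single gross outlier among ten otherwise moderate values barely moves it.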
Exercises
1. (Section 2.1.3)
Show that the Bayesian minimand (2.1.5) is minimized when S is chosen to
be the set of y that satisfies the inequality (2.1.3).
2. (Section 2.1.5)
A weakness of PC is that it does not choose the right model with probability 1 when T goes to infinity. (The weakness, however, is not serious.) Suppose we must choose between regressor matrices X₁ and X such that X₁ ⊂ X. Show that

lim P[PC chooses X₁ | X₁ is true] = P[χ²_{K−K₁} < 2(K − K₁)] < 1.
3. (Section 2.1.5)
Schwarz's (1978) criterion minimizes T log y′M₁y + K₁ log T. Show that this criterion chooses the correct model with probability 1 as T goes to ∞.
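The consistency claimed in this exercise can be illustrated by simulation (model, sample size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def schwarz(y, X):
    """T log(SSR) + K log(T) for a least squares fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = float(np.sum((y - X @ b) ** 2))
    T, K = X.shape
    return T * np.log(ssr) + K * np.log(T)

T, correct = 5000, 0
for _ in range(20):
    x1 = rng.standard_normal(T)
    x2 = rng.standard_normal(T)          # irrelevant extra regressor
    y = 1.0 + 2.0 * x1 + rng.standard_normal(T)
    X1 = np.column_stack([np.ones(T), x1])
    X2 = np.column_stack([np.ones(T), x1, x2])
    correct += schwarz(y, X1) < schwarz(y, X2)
```

With the log T penalty the smaller true model is chosen in nearly every replication, in contrast to PC in Exercise 2.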
4. (Section 2.2.3)
If F′β is estimable, show that F′β̂ is the BLUE of F′β, where β̂ is the LS
estimator.
5. (Section 2.2.3)
Show that β̂ defined in (2.2.10) is a solution of (2.2.6).
6. (Section 2.2.4)
Show that for any square matrix A there exists a positive constant γ₀ such
that for all γ > γ₀, A + γI is nonsingular.
3 Large Sample Theory
Large sample theory plays a major role in the theory of econometrics because
econometricians must frequently deal with more complicated models than
the classical linear regression model of Chapter 1. Few finite sample results are
known for these models, and, therefore, statistical inference must be based on
the large sample properties of the estimators. In this chapter we shall present a
brief review of random variables and the distribution function, discuss various
convergence theorems including laws of large numbers and central limit
theorems, and then use these theorems to prove the consistency and the
asymptotic normality of the least squares estimator. Additional examples of
the application of the convergence theorems will be given.
This section is not meant to be a complete discussion of the subject; the reader
is assumed to know the fundamentals of the theory of probability, random
variables, and distribution functions at the level of an intermediate textbook
in mathematical statistics.' Here we shall introduce a few concepts that are not
usually dealt with in intermediate textbooks but are required in the subse-
quent analysis, in particular, the rigorous definition of a random variable and
the definition of the Stieltjes integral.²
In intuitive terms, a sample space may be regarded as the set of all the possible outcomes of
an experiment. Thus, in the experiment of throwing a die, the six faces of the
die constitute the sample space; and in the experiment of measuring the height
of a randomly chosen student, the set of positive real numbers can be chosen
as the sample space. As in the first example, a sample space may be a set of
objects other than numbers. A subset of a sample space may be called an event.
Thus we speak of the event of an ace turning up or the event of an even number
showing in the throw of a die. With each event we associate a real number
between 0 and 1 called the probability of the event. When we think of a sample
space, we often think of the other two concepts as well: the collection of its
subsets (events) and the probabilities attached to the events. The term proba-
bility space refers to all three concepts collectively. We shall develop an ab-
stract definition of a probability space in that collective sense.
Given an abstract sample space Ω, we want to define the collection A of
subsets of Ω that possess certain desired properties.
Figure 3.1 Discrete random variables X(ω), Y(ω), Z(ω) defined over [0, 1] with Lebesgue measure [figure omitted]
ity space. Note that Z is independent of either X or Y, whereas X and Y are not independent (in fact XY = 0). A continuous random variable X(ω) with the standard normal distribution can be defined over the same probability space by X = Φ⁻¹(ω), where Φ is the standard normal distribution function and "−1" denotes the inverse function.
associated with this partition of the interval [a, b]. If, for any ε > 0, there exists a real number A and a partition such that for every finer partition and for any choice of xₙ*, |S − A| < ε, we call A the Riemann-Stieltjes integral and denote it by ∫ₐᵇ h(x) dF(x). It exists if h is a continuous function except possibly for a countable number of discontinuities, provided that, whenever its discontinuity coincides with that of F, it is continuous from the right.³ Finally, we define the integral over (−∞, ∞) as the limit of ∫ₐᵇ, provided the limit (which may be +∞ or −∞) exists regardless of the way a → −∞ and b → ∞.

If dF/dx exists and is equal to f(x), F(xᵢ₊₁) − F(xᵢ) = f(xᵢ*)(xᵢ₊₁ − xᵢ) for some xᵢ* ∈ (xᵢ, xᵢ₊₁) by the mean value theorem. Therefore
Define Fₙ(·) such that

Fₙ(x) = 0,  x ≦ a − 1/n   (3.2.1)
      = (n/2)(x − a + 1/n),  a − 1/n < x ≦ a + 1/n
      = 1,  a + 1/n < x.

Then limₙ→∞ Fₙ is not continuous from the left at a and therefore is not a
distribution function. However, we would like to say that the random variable
with the distribution (3.2.1) converges in distribution to a degenerate random
variable which takes the value a with probability one. The phrase "at every
continuity point of F" enables us to do so.
DEFINITION 3.2.4 (almost sure convergence). A sequence {Xₙ} is said to
converge to X almost surely⁸ if

P(ω | limₙ→∞ Xₙ(ω) = X(ω)) = 1.

We write Xₙ → X a.s.
The next four theorems establish the logical relationships among the four
modes of convergence, depicted in Figure 3.2. 9
Proof We have
Figure 3.2 Logical relationships among four modes of convergence [figure omitted]
= Fₙ(−ε) + [1 − Fₙ(ε)]
= P(Xₙ < −ε) + P(Xₙ ≧ ε).

Therefore, from (3.2.2) and (3.2.3), we obtain

P[|Xₙ| > 2ε]   (3.2.4)
X₃(ω) = 1 for 1/2 ≦ ω ≦ 1/2 + 1/3
      = 0 elsewhere
X₄(ω) = 1 for 1/2 + 1/3 ≦ ω ≦ 1 and for 0 ≦ ω ≦ 1/12
      = 0 elsewhere
X₅(ω) = 1 for 1/12 ≦ ω ≦ 1/12 + 1/5
      = 0 elsewhere
In other words, the subset of [0, 1] over which Xₙ assumes unity has the total length 1/n and keeps moving to the right until it reaches the right end point of [0, 1], at which point it moves back to 0 and starts again. For any 1 > ε > 0, we clearly have

P(|Xₙ| > ε) = 1/n,

and therefore Xₙ →ₚ 0. However, because Σₙ₌₁^∞ n⁻¹ = ∞, there is no element in Ω for which limₙ→∞ Xₙ(ω) = 0. Therefore P(ω | limₙ→∞ Xₙ(ω) = 0) = 0, implying that Xₙ does not converge to 0 almost surely.
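This example can be traced numerically; in the sketch below the indexing of the sliding intervals is illustrative, but it reproduces the two key facts: P(Xₙ = 1) = 1/n → 0, while any fixed ω is covered infinitely often:

```python
import numpy as np

def x_n(n, w):
    """X_n(w): 1 on an interval of length 1/n starting where the previous
    interval ended (partial sums of 1/k, wrapped mod 1); 0 elsewhere."""
    start = sum(1.0 / k for k in range(2, n)) % 1.0
    end = start + 1.0 / n
    if end <= 1.0:
        return 1.0 if start <= w <= end else 0.0
    return 1.0 if (w >= start or w <= end - 1.0) else 0.0

# A fixed point, e.g. w = 0.3, is covered again and again because the
# harmonic series diverges, so X_n(0.3) = 1 for infinitely many n.
hits = [n for n in range(3, 3000) if x_n(n, 0.3) == 1.0]
```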
The next three convergence theorems are extremely useful in obtaining the
asymptotic properties of estimators.
THEOREM 3.2.5 (Mann and Wald). Let Xₙ and X be K-vectors of random variables and let g(·) be a function from Rᴷ to R such that the set E of discontinuity points of g(·) is closed and P(X ∈ E) = 0. If Xₙ →d X, then g(Xₙ) →d g(X).
A slightly more general theorem, in which a continuous function is replaced
by a Borel measurable function, was proved by Mann and Wald (1943). The
Large Sample Theory 89
use the same symbols for both stochastic and nonstochastic cases because
Definition 3.2.5 applies to the nonstochastic case as well in a trivial way.
Lindeberg-Levy and Liapounov are special cases of Lindeberg-Feller, in the sense that the
assumptions of either one of the first two central limit theorems imply those of
Lindeberg-Feller. The assumptions of Lindeberg-Levy are more restrictive in
some respects and less restrictive in other respects than those of Liapounov.
Before we state the central limit theorems, however, we shall define the
characteristic function of a random variable and study its properties.
DEFINITION 3.3.1. The characteristic function of a random variable X is defined by Ee^{iλX}.

Thus if the distribution function of X is F(·), the characteristic function is ∫₋∞^∞ e^{iλx} dF(x). The characteristic function is generally a complex number. However, because e^{iλx} = cos λx + i sin λx, the characteristic function of a random variable with a density function symmetric around 0 is real. The characteristic function of N(0, 1) can be evaluated as

(2π)⁻¹ᐟ² ∫₋∞^∞ e^{iλx} exp(−x²/2) dx = exp(−λ²/2).   (3.3.1)
Define g(λ) = Log ∫₋∞^∞ e^{iλx} dF(x), where Log denotes the principal logarithm.¹¹ Then the following Taylor expansion is valid provided EXʳ exists:

g(λ) = Σⱼ₌₀ʳ κⱼ (iλ)ʲ/j! + o(|λ|ʳ),   (3.3.2)
gₙ(λ) = n g(λ/√n)
      = −λ²/2 + n·o(n⁻¹).
where Φ⁽ʳ⁾(x) is the rth derivative of the standard normal distribution function, we can invert (3.3.5) to obtain

Fₙ(x) = Φ(x) − [κ₃/(6σ³√n)] Φ⁽³⁾(x) + [κ₄/(24σ⁴n)] Φ⁽⁴⁾(x)   (3.3.7)
        + [κ₃²/(72σ⁶n)] Φ⁽⁶⁾(x) + O(n⁻³ᐟ²).
We shall conclude this section by stating a multivariate central limit
theorem, the proof of which can be found in Rao (1973, p. 128).
THEOREM 3.3.8. Let {Xₙ} be a sequence of K-dimensional vectors of random variables. If c′Xₙ converges to a normal random variable for every K-dimensional constant vector c ≠ 0, then Xₙ converges to a multivariate normal random variable. (Note that showing convergence of each element of Xₙ separately is not sufficient.)
follows:
and
AE Xₙ = ∫ x dF(x).   (3.4.2)

THEOREM 3.4.1. If E|Xₙ|ʳ < M for all n, then lim EXₙˢ = AE Xₙˢ for any s < r. In particular, if EXₙ² < M, then lim EXₙ = AE Xₙ. (Note that this condition is violated by all three preceding examples.)
We are now in a position to define two important concepts regarding the
asymptotic properties of estimators, namely, asymptotic unbiasedness and
consistency.
DEFINITION 3.4.1. The estimator θ̂ₙ of θ is said to be asymptotically unbiased if AE θ̂ₙ = θ. We call AE θ̂ₙ − θ the asymptotic bias.
Note that some authors define asymptotic unbiasedness using lim E instead
of AE. Then it refers to a different concept.
DEFINITION 3.4.2. The estimator θ̂ₙ of θ is said to be a consistent estimator if plim θ̂ₙ = θ.
Some authors use the term weakly consistent in the preceding definition, to distinguish it from the term strong consistency used to describe the property θ̂ₙ → θ a.s.¹²
In view of the preceding discussions, it is clear that a consistent estimator is
asymptotically unbiased, but not vice versa.
The reader should verify that it is not sufficient for consistency to assume
that all the diagonal elements of X'X go to infinity.
THEOREM 3.5.2. If we assume that {uₜ} are i.i.d. in Model 1, the least squares estimator σ̂² defined in (1.2.5) is a consistent estimator of σ².
THEOREM 3.5.3. Assume that {uₜ} are i.i.d. in Model 1. Assume that K = 1, and because X is a vector in this case, write it as x. If

lim_{T→∞} (x′x)⁻¹ max_{1≦t≦T} xₜ² = 0,   (3.5.5)
for any c. (This can be verified from the definition of the Riemann-Stieltjes integral given in Section 3.1.2.) Therefore we need to prove

lim_{T→∞} (σ²x′x)⁻¹ Σₜ₌₁ᵀ ∫_{λ>ε²σ²x′x} λ dFₜ(λ) = 0   (3.5.6)

for any ε, where Fₜ is the distribution function of (xₜuₜ)². Let G be the distribution function of uₜ². Then, because Fₜ(λ) = P(xₜ²uₜ² < λ) = P(uₜ² < λ/xₜ²) = G(λ/xₜ²), we have

∫_{λ>ε²σ²x′x} λ dFₜ(λ) = ∫_{λ>ε²σ²x′x} λ dG(λ/xₜ²)
                       = xₜ² ∫_{λ>ε²σ²x′x/xₜ²} λ dG(λ).

Therefore (3.5.6) follows from the inequality
Therefore (3.5.6) follows from the inequality
THEOREM 3.5.4. Assume that {uₜ} are i.i.d. in Model 1. Assume also

lim_{T→∞} (xₖ′xₖ)⁻¹ max_{1≦t≦T} xₜₖ² = 0 for every k,   (3.5.8)
where γₖ is the kth element of γ and zₜₖ is the (t, k)th element of Z. But, because γ′Z′Zγ ≧ λₛ(Z′Z)γ′γ by Theorem 10 of Appendix 1, we have
max_{1≦t≦T} (Σₖ₌₁ᴷ γₖzₜₖ)² / (γ′Z′Zγ) ≦ γ′γ Σₖ₌₁ᴷ max_{1≦t≦T} zₜₖ² / [λₛ(Z′Z)γ′γ]   (3.5.10)
= λₛ(Z′Z)⁻¹ Σₖ₌₁ᴷ max_{1≦t≦T} zₜₖ²,

where the first inequality above follows from the Cauchy-Schwarz inequality. Therefore (3.5.9) follows from (3.5.10) because the last term of (3.5.10) converges to 0 by our assumptions. Therefore, by Theorem 3.5.3,
γ′Z′u / [σ(γ′Z′Zγ)¹ᐟ²] →d N(0, 1)

for any constant vector γ ≠ 0. Since lim γ′Z′Zγ = γ′Σγ, we have

γ′Z′u →d N(0, σ²γ′Σγ).

Thus the theorem follows from Theorem 3.3.8.
At this point it seems appropriate to consider the significance of assumption (3.5.5). Note that (3.5.5) implies x′x → ∞. It would be instructive to try to construct a sequence for which x′x → ∞ and yet (3.5.5) does not hold. The following theorem shows, among other things, that (3.5.5) is less restrictive than the commonly used assumption that lim T⁻¹x′x exists and is a nonzero constant. It follows that if lim T⁻¹X′X exists and is nonsingular, the condition of Theorem 3.5.4 is satisfied.
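An explicit sequence of the kind mentioned above, with x′x → ∞ but (3.5.5) failing, is xₜ = 2ᵗ:

```python
import numpy as np

T = 30
x = 2.0 ** np.arange(1, T + 1)
ratio = np.max(x ** 2) / np.sum(x ** 2)   # -> 3/4 as T grows, not 0
```

Here x′x grows without bound, yet a single observation always carries about three quarters of the total sum of squares, so the condition for asymptotic normality fails.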
THEOREM 3.5.5. Given a sequence of constants {xₜ}, consider the statements:
(i) lim_{T→∞} T⁻¹c_T = a, where a ≠ 0, a < ∞, and c_T = Σₜ₌₁ᵀ xₜ².
(ii) lim_{T→∞} c_{T+1}/c_T = 1.
(iii) lim_{T→∞} c_T = ∞.
(iv) lim_{T→∞} c_T⁻¹ max_{1≦t≦T} xₜ² = 0.
Then, (i) ⟹ [(ii) and (iii)] ⟹ (iv).

Proof. (i) ⟹ (ii) is obvious. We have

x²_{T+1}/(T + 1) = c_{T+1}/(T + 1) + c_T/[T(T + 1)] − c_T/T → 0.   (3.5.11)
THEOREM 3.5.6. Assume that {uₜ} are i.i.d. with a finite fourth moment Euₜ⁴ ≡ m₄ in Model 1. Then √T(σ̂² − σ²) →d N(0, m₄ − σ⁴).
We can write

β̂_R = y′y/x′y = [β² + 2β(x′u/x′x) + (u′u/x′x)] / [β + (x′u/x′x)].   (3.5.17)

We have E(x′u/x′x)² = σ²(x′x)⁻¹ → 0 as T → ∞. Therefore, by Theorem 3.2.1,

plim x′u/x′x = 0.   (3.5.18)

Also we have

plim u′u/x′x = plim [(u′u/T)/(x′x/T)] = σ²/c   (3.5.19)

because of Theorem 3.2.6 and Theorem 3.3.2 (Kolmogorov LLN 2). Therefore, from (3.5.17), (3.5.18), and (3.5.19) and by using Theorem 3.2.6 again, we obtain

plim β̂_R = β + σ²/(βc).   (3.5.20)

Note that c may be allowed to be ∞, in which case β̂_R becomes a consistent estimator of β.
EXAMPLE 3.5.2. Consider the same model as in Example 3.5.1 except that we now assume lim_{T→∞} T⁻²x′x = ∞. Also assume lim_{T→∞} (x′x)⁻¹ max_{1≦t≦T} xₜ² = 0 so that β̂ is asymptotically normal. (Give an example of a sequence satisfying these two conditions.) Show that β̂ = x′y/x′x and β̂_R = y′y/x′y have the same asymptotic distribution.

Clearly, plim β̂ = plim β̂_R = β. Therefore, by Theorem 3.2.2, both estimators have the same degenerate limit distribution. But the question concerns the asymptotic distribution; therefore we must obtain the limit distribution of each estimator after a suitable normalization. We can write
Therefore (x′x)¹ᐟ²(β̂_R − β) and (x′x)¹ᐟ²(β̂ − β) have the same limit distribution by repeated applications of Theorem 3.2.7.
EXAMPLE 3.5.3. Consider Model 1 with K = 2:

yₜ = β₁x₁ₜ + β₂x₂ₜ + uₜ,   (3.5.22)

where we assume that {uₜ} are i.i.d. Assume also lim T⁻¹X′X = A, where A is a 2 × 2 nonsingular matrix. Obtain the asymptotic distribution of β̂₁/β̂₂, where β̂₁ and β̂₂ are the least squares estimators of β₁ and β₂, assuming β₂ ≠ 0.
We can write
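A Monte Carlo sketch of the delta-method answer for the asymptotic variance of β̂₁/β̂₂ (design matrix, sample size, σ² = 1, and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
beta = np.array([1.0, 2.0])
T, R = 2000, 3000
X = rng.standard_normal((T, 2)) + np.array([1.0, 0.5])  # fixed design
XtXinv = np.linalg.inv(X.T @ X)
P = XtXinv @ X.T
ratios = np.empty(R)
for r in range(R):
    y = X @ beta + rng.standard_normal(T)
    b = P @ y
    ratios[r] = b[0] / b[1]
# Delta method for g(b) = b1/b2: gradient (1/b2, -b1/b2^2)' at beta
grad = np.array([1.0 / beta[1], -beta[0] / beta[1] ** 2])
var_delta = grad @ XtXinv @ grad        # with sigma^2 = 1
```

The simulated variance of the ratio agrees closely with the delta-method value, which is the content of the exercise.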
Exercises
1. (Section 3.1.2)
Prove that the distribution function is continuous from the left.
2. (Section 3.2)
Prove Theorem 3.2.3. HINT: Definition 3.1.2 (iii) implies that if Aₙ ⊂ Aₘ for n < m and limₙ→∞ Aₙ = A, then limₙ→∞ P(Aₙ) = P(A).
3. (Section 3.2)
Prove Theorem 3.2.4.
4. (Section 3.3)
Let {Xₜ} be as defined in Theorem 3.3.1. Prove

lim_{T→∞} E(X̄_T − EX̄_T)² = 0.
5. (Section 3.3)
Let {aₜ}, t = 1, 2, . . . , be a nonnegative sequence such that
(Σₜ₌₁ᵀ aₜ)/T < M for some M and every T. Prove limₙ→∞ Σₜ₌₁ⁿ (aₜ/t²) < ∞.
6. (Section 3.3)
Prove that the conditions of Theorem 3.3.6 (Lindeberg-Feller CLT) fol-
low from the conditions of Theorem 3.3.4 (Lindeberg-Levy CLT) or of
Theorem 3.3.5 (Liapounov CLT).
7. (Section 3.3)
Let {Xₜ} be i.i.d. with EXₜ = μ. Then X̄_T →ₚ μ. This is a corollary of
Theorem 3.3.2 (Kolmogorov LLN 2) and is called Khinchine's WLLN
(weak law of large numbers). Prove this theorem using characteristic
functions.
8. (Section 3.5)
Show that λₛ(X′X) → ∞ implies xᵢ′xᵢ → ∞ for every i, where xᵢ is the ith
column vector of X. Show also that the converse does not hold.
9. (Section 3.5)
Assume K = 1 in Model 1 and write X as x. Assume that {uₜ} are independent. If there exist L and M such that 0 < L < x′x/T < M for all T, show that plim β̂ = β.
4.1.1 Consistency
Because there is no essential difference between maximization and minimiza-
tion, we shall consider an estimator that maximizes a certain function of the
parameters. Let us denote the function by Q_T(y, θ), where y = (y₁, y₂, . . . , y_T)′ is a T-vector of random variables and θ is a K-vector of parameters. [We shall sometimes write it more compactly as Q_T(θ).] The vector θ should be understood to be the set of parameters that characterize the
Let A_T be the event "|T⁻¹Q_T(θ) − Q(θ)| < ε/2 for all θ." Then

A_T ⟹ Q(θ̂_T) > T⁻¹Q_T(θ̂_T) − ε/2   (4.1.3)

and

A_T ⟹ T⁻¹Q_T(θ₀) > Q(θ₀) − ε/2.   (4.1.4)
EXAMPLE 4.1.3. Θ = [0, 2], θ₀ = 1.5, T⁻¹Q_T continuous but not uniformly convergent:

T⁻¹Q_T(θ) = Tθ,  0 ≦ θ ≦ 1/(2T)
          = 1 − Tθ,  1/(2T) < θ ≦ 1/T
          = 0,  1/T < θ ≦ 1
          = [(T + 1)/(2T)](θ − 1),  1 < θ ≦ 1.5
          = [(T + 1)/(2T)](2 − θ),  1.5 < θ ≦ 2.

Here we have plim θ̂_T = plim (2T)⁻¹ = 0, although lim T⁻¹Q_T attains its unique maximum at θ₀.
1 2-() 0< 1
= 0,  θ = 1.

Here we have plim θ̂_T = plim [1 − 2⁻⁽ᵀ⁺¹⁾] = 1, although lim T⁻¹Q_T attains its unique maximum at θ₀. (If we change this example so that Θ = [−2, 1), T⁻¹Q_T becomes continuous and only the compactness assumption is violated.)
The estimator θ̂_T of Theorem 4.1.1 maximizes the function Q_T(θ) globally. However, in practice it is often difficult to locate a global maximum of Q_T(θ), for it means that we must look through the whole parameter space except in the fortunate situation where we can prove that Q_T(θ) is globally concave. Another weakness of the theorem is that it is often difficult to prove that Q(θ) attains its unique global maximum at θ₀. Therefore we would also like to have a theorem regarding the consistency of a local maximum.
Still another reason for having such a theorem is that we can generally prove
asymptotic normality only for the local maximum, as we shall show in Section
4.1.2. Theorem 4.1.2 is such a theorem.
THEOREM 4.1.2. Make the assumptions:
(A) Let Θ be an open subset of the Euclidean K-space. (Thus the true value θ₀ is an interior point of Θ.)
(B) Q_T(y, θ) is a measurable function of y for all θ ∈ Θ, and ∂Q_T/∂θ exists and is continuous in an open neighborhood N₁(θ₀) of θ₀. (Note that this implies Q_T is continuous for θ ∈ N₁.)
(C) There exists an open neighborhood N₂(θ₀) of θ₀ such that T⁻¹Q_T(θ) converges to a nonstochastic function Q(θ) in probability uniformly in θ in N₂(θ₀), and Q(θ) attains a strict local maximum at θ₀.
Let Θ_T be the set of roots of the equation

∂Q_T/∂θ = 0   (4.1.9)

corresponding to the local maxima. If that set is empty, set Θ_T equal to {0}.
Proof. Choose a compact set S ⊂ N₁ ∩ N₂. Then the value of θ, say θ̂_T, that globally maximizes Q_T(θ) in S is consistent by Theorem 4.1.1. But because the probability that T⁻¹Q_T(θ) attains a local maximum at θ̂_T approaches 1 as T goes to ∞, lim_{T→∞} P(θ̂_T ∈ Θ_T) = 1.
We sometimes state the conclusion of Theorem 4.1.2 simply as "there is a
consistent root of the Eq. (4.1.9)."
The usefulness of Theorem 4.1.2 is limited by the fact that it merely states
that one of the local maxima is consistent and does not give any guide as to
how to choose a consistent maximum. There are two ways we can gain some
degree of confidence that a local maximum is a consistent root: (1) if the
solution gives a reasonable value from an economic-theoretic viewpoint and
(2) if the iteration by which the local maximum was obtained started from a
consistent estimator. We shall discuss the second point more fully in Section
4.4.2.
√T(θ̂_T − θ₀) = −[T⁻¹ ∂²Q_T/∂θ∂θ′]⁻¹ T⁻¹ᐟ² (∂Q_T/∂θ).   (4.1.11)
Finally, the conclusion of the theorem follows from assumption C and Eqs.
(4.1.11) and (4.1.12) by repeated applications of Theorem 3.2.7.
As we noted earlier, QT is frequently the sum of independent random
variables (or, at least, of random variables with a limited degree of depen-
dence). Therefore it is not unreasonable to assume the conclusions of a law of
large numbers and a central limit theorem in assumptions B and C, respec-
tively. However, as we also noted earlier, the following more general normalization may be necessary in certain cases: In assumption B, change T⁻¹∂²Q_T/∂θ∂θ′ to H(T)∂²Q_T/∂θ∂θ′H(T), where H(T) is a diagonal matrix such that lim_{T→∞} H(T) = 0; in assumption C, change T⁻¹ᐟ²∂Q_T/∂θ to H(T)∂Q_T/∂θ; and in the conclusion of the theorem, state the limit distribution in terms of H(T)⁻¹(θ̂_T − θ₀).
Because assumption B is often not easily verifiable, we shall give two alternative assumptions, each of which implies assumption B. Let g_T(θ) ≡ g(y, θ) be a function of a T-vector of random variables y = (y₁, y₂, . . . , y_T)′ and a continuous function of a K-vector of parameters θ in Θ, an open subset of the Euclidean K-space, almost surely. We assume that g(y, θ) is a random variable (that is, g is a measurable function of y). We seek conditions that ensure plim [g_T(θ̂_T) − g_T(θ₀)] = 0 whenever plim θ̂_T = θ₀. Note that Theorem 3.2.6 does not apply here because in that theorem g(·) is a fixed function not varying with T.
THEOREM 4.1.4. Assume that ∂g_T/∂θ exists for θ ∈ Θ, an open convex set, and that for any ε > 0 there exists M_ε such that

P(|∂g_T/∂θᵢ| < M_ε) ≧ 1 − ε

for all T, for all θ ∈ Θ, and for all i, where θᵢ is the ith element of the vector θ. Then plim g_T(θ̂_T) = plim g_T(θ₀) if θ₀ ≡ plim θ̂_T is in Θ.
Proof. The proof of Theorem 4.1.4 follows from the Taylor expansion. For any ε > 0 and δ > 0, there exists T₁ such that for T > T₁

P[|g_T(θ̂_T) − g(θ̂_T)| ≧ ε] ≦ δ/2.   (4.1.13)
Because g is continuous at 00 by our assumption, g(or ) converges to g(00) in
probability by Theorem 3.2.6. Therefore, for any e > 0 and 6> 0, there exists
T2 such that for T> T2
Therefore, from the inequalities (4.1.13) and (4.1.14) we have for T > max[T₁, T₂]

P[|g_T(θ̂_T) − g(θ₀)| ≧ 2ε] ≦ δ.   (4.1.15)
inf P[gT(co, 0) < e] g 1 1,
0.8,1
T⁻¹Q_T(θ) = T⁻¹Q_T(θ₀) + T⁻¹ (∂Q_T/∂θ′)|_{θ₀} (θ − θ₀)   (4.1.16)
            + (2T)⁻¹ (θ − θ₀)′ (∂²Q_T/∂θ∂θ′)|_{θ*} (θ − θ₀),

where θ* lies between θ and θ₀. Taking the probability limit of both sides of (4.1.16) and using assumptions B and C of Theorem 4.1.3 and A of this theorem, we have

Q(θ) = Q(θ₀) + (1/2)(θ − θ₀)′A*(θ − θ₀),   (4.1.17)

where

A* = plim T⁻¹ (∂²Q_T/∂θ∂θ′)|_{θ=θ*}.

But A* is a negative definite matrix because of assumption B of Theorem 4.1.3 and assumptions B and C of this theorem. Therefore

Q(θ) < Q(θ₀) for θ ≠ θ₀.   (4.1.18)

Thus all the assumptions of Theorem 4.1.2 are satisfied.
Thus all the assumptions of Theorem 4.1.2 are satisfied.
4.2.2 Consistency
The conditions for the consistency of the global MLE or the local MLE can be
immediately obtained from Theorem 4.1.1 or Theorem 4.1.2 by putting
QT(0) = log L T(0). We consider the logarithm of the likelihood function
because T-1 log L T(0) usually converges to a finite constant. Clearly, taking
the logarithm does not change the location of either global or local maxima.
So far we have not made any assumption about the distribution of y. If we assume that {yₜ} are i.i.d. with common density function f(·, θ), we can write

T⁻¹ log L_T(θ) = T⁻¹ Σₜ₌₁ᵀ log f(yₜ, θ).   (4.2.5)
Furthermore, we have by Jensen's inequality (see Rao, 1973, p. 58, for the proof)

E log [f(yₜ, θ)/f(yₜ, θ₀)] < log E [f(yₜ, θ)/f(yₜ, θ₀)] = 0,   (4.2.6)

where the expectation is taken using the true value θ₀, and, therefore

E log f(yₜ, θ) < E log f(yₜ, θ₀) for θ ≠ θ₀.   (4.2.7)
As in (4.2.7), we have T⁻¹E log L_T(θ) < T⁻¹E log L_T(θ₀) for θ ≠ θ₀ and for all T. However, when we take the limit of both sides of the inequality (4.2.7) as T goes to infinity, we have

lim T⁻¹E log L_T(θ) ≦ lim T⁻¹E log L_T(θ₀).
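The information inequality (4.2.7) can be illustrated numerically; below f(y, θ) is taken to be the N(θ, 1) density with θ₀ = 1.5 (all values illustrative):

```python
import numpy as np

theta0 = 1.5
x = np.linspace(theta0 - 10.0, theta0 + 10.0, 100001)
dx = x[1] - x[0]
f0 = np.exp(-(x - theta0) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)  # true density

def expected_loglik(theta):
    """E_{theta0} log f(y, theta) by numerical integration."""
    logf = -(x - theta) ** 2 / 2.0 - 0.5 * np.log(2.0 * np.pi)
    return np.sum(f0 * logf) * dx

grid = np.linspace(0.0, 3.0, 301)
best = grid[np.argmax([expected_loglik(th) for th in grid])]  # = theta0
```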
E]
P T-1
r-i t-i L I 19E07
where the first inequality follows from the fact that if A implies B then P(A) ≦ P(B) and the last inequality follows from the triangle inequality. Because gₜ(θ) is uniformly continuous in θ ∈ Θ, we have for every i

lim_{n→∞} sup_{θ∈Θᵢⁿ} |gₜ(θ) − gₜ(θᵢ)| = 0.   (4.2.9)

But, because

sup_{θ∈Θᵢⁿ} |gₜ(θ) − gₜ(θᵢ)| ≦ 2 sup_{θ∈Θ} |gₜ(θ)|   (4.2.10)

and the right-hand side of the inequality (4.2.10) is integrable by our assumptions, (4.2.9) implies by the Lebesgue convergence theorem (Royden, 1968, p. 88)

lim_{n→∞} E sup_{θ∈Θᵢⁿ} |gₜ(θ) − gₜ(θᵢ)| = 0   (4.2.11)

uniformly for i. Take n so large that the expected value in (4.2.11) is smaller than ε/2. Finally, the conclusion of the theorem follows from Theorem 3.3.2 (Kolmogorov LLN 2) by taking T to infinity.

This theorem can be generalized to the extent that T⁻¹Σₜ₌₁ᵀ gₜ(θᵢ) and T⁻¹Σₜ₌₁ᵀ sup_{θ∈Θᵢⁿ} |gₜ(θ) − gₜ(θᵢ)| can be subjected to a law of large numbers. The
where c₁ is a small positive constant and c₂ and c₃ are large positive constants, and assume that (β₀′, σ₀²) is an interior point of Θ. Then, clearly, the convergence in (4.2.14) is uniform in θ and the right-hand side of (4.2.14) is uniquely maximized at (β₀′, σ₀²). Put θ = (β′, σ²)′ and define θ̂_T by

log L_T(θ̂_T) = max_{θ∈Θ} log L_T(θ).   (4.2.16)

The proof of (4.2.18) would be simple if we used our knowledge of the explicit formulae for θ̂_T in this example. But that would be cheating. The proof of (4.2.18) using condition D given after the proof of Theorem 4.1.1 is left as an exercise.
There are cases where the global maximum likelihood estimator is incon-
sistent, whereas a root of the likelihood equation (4.2.1) can be consistent, as
in the following example.
L_T = Π_{t=1}^T [ (1/(2√(2π) σ₁)) exp{−(y_t − μ₁)²/(2σ₁²)} + (1/(2√(2π) σ₂)) exp{−(y_t − μ₂)²/(2σ₂²)} ].

If we put μ₁ = y₁ and let σ₁ approach 0, the term of the product that corresponds to t = 1 goes to infinity, and, consequently, L_T goes to infinity. Hence, the global MLE cannot be consistent. Note that this example violates assumption C of Theorem 4.1.1 because Q(θ) does not attain a global maximum at θ₀.
However, the conditions of Theorem 4.1.2 are generally satisfied by this
model. An extension of this model to the regression case is called the switching
regression model (see Quandt and Ramsey, 1978).
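The unboundedness of the mixture likelihood described above is easy to see numerically. The following sketch (an illustration, not from the text; the equal mixing weights, sample size, and parameter values are assumptions) fixes μ₁ at the first observation and shrinks σ₁, and the log likelihood diverges:

```python
import numpy as np

def mixture_loglik(y, mu1, sig1, mu2, sig2):
    # Equal-weight mixture of two normal densities, as in the example.
    c1 = np.exp(-(y - mu1)**2 / (2 * sig1**2)) / (2 * np.sqrt(2 * np.pi) * sig1)
    c2 = np.exp(-(y - mu2)**2 / (2 * sig2**2)) / (2 * np.sqrt(2 * np.pi) * sig2)
    return np.sum(np.log(c1 + c2))

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=50)

# Fix mu1 = y[0]; letting sig1 -> 0 sends the log likelihood to +infinity,
# so the global maximum does not occur at the true parameter value.
vals = [mixture_loglik(y, y[0], s, 0.0, 1.0) for s in (1.0, 1e-30, 1e-60)]
```

The t = 1 term contributes roughly −log σ₁, which dominates every other (bounded) term as σ₁ → 0.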
It is hard to construct examples in which the maximum likelihood estima-
tor (assuming the likelihood function is correctly specified) is not consistent
and another estimator is. Neyman and Scott (1948) have presented an inter-
esting example of this type. In their example MLE is not consistent because
the number of incidental (or nuisance) parameters goes to infinity as the
sample size goes to infinity.
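The incidental-parameters effect in the Neyman–Scott example can be simulated. In this sketch (an illustration under assumed settings: groups of two observations sharing a nuisance mean, with the number of groups growing), the joint MLE of σ² converges to σ²(n − 1)/n = σ²/2 rather than to σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0
n_groups, per_group = 200_000, 2     # many incidental means, 2 obs each

mu = rng.normal(0.0, 10.0, size=n_groups)            # nuisance parameters
x = rng.normal(mu[:, None], np.sqrt(sigma2), size=(n_groups, per_group))

# Joint MLE: each mu_i is estimated by its group mean; sigma2_hat then
# averages the within-group squared deviations.
mle_sigma2 = np.mean((x - x.mean(axis=1, keepdims=True))**2)
# plim of the MLE is sigma2 * (per_group - 1) / per_group = sigma2 / 2.
```

The estimator is far from the true value 4.0 no matter how many groups are added, illustrating the inconsistency.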
∫ (∂f/∂θ)|_{θ₀} dy = 0,   (4.2.24)

∫ (∂²f/∂θ∂θ′)|_{θ₀} dy = 0,   (4.2.25)

plim T^{-1} Σ_{t=1}^T ∂² log f/∂θ∂θ′ = E ∂² log f/∂θ∂θ′ uniformly in θ in an open neighborhood of θ₀.   (4.2.26)

A sufficient set of conditions for (4.2.26) can be found by putting g_t(θ) = ∂² log f(y_t, θ)/∂θ_i∂θ_j in Theorem 4.2.1. Because log L_T = Σ_{t=1}^T log f(y_t, θ) in this case, (4.2.26) implies assumption B of Theorem 4.1.3 because of Theorem 4.1.5. Assumption C of Theorem 4.1.3 follows from (4.2.24) and (4.2.26) on account of Theorem 3.3.4 (Lindeberg-Lévy CLT) since (4.2.24) implies E (∂ log f/∂θ)|_{θ₀} = 0. Finally, it is easy to show that assumptions (4.2.24)–(4.2.26) imply (4.2.22).
We shall use the same model as that used in Example 4.2.1 and shall illustrate how the assumptions of Theorem 4.1.3 and the additional assumption (4.2.22) are satisfied. As with Example 4.2.1, the sole purpose of Example 4.2.3 is illustration, as the same results have already been obtained by a direct method in Chapter 1.

EXAMPLE 4.2.3. Under the same assumptions made in Example 4.2.1, prove the asymptotic normality of the maximum likelihood estimator θ̂ = (β̂′, σ̂²)′.
122 Advanced Econometrics
∂² log L/∂σ²∂β = −σ^{-4} X′(y − Xβ).   (4.2.31)

From (4.2.29), (4.2.30), and (4.2.31) we can clearly see that assumptions A and B of Theorem 4.1.3 are satisfied. Also from these equations we can evaluate the elements of A(θ₀):

plim T^{-1} (∂² log L/∂β∂β′)|_{θ₀} = −σ₀^{-2} lim T^{-1} X′X,   (4.2.32)

plim T^{-1} (∂² log L/∂(σ²)²)|_{θ₀} = −1/(2σ₀⁴),   (4.2.33)

plim T^{-1} (∂² log L/∂σ²∂β)|_{θ₀} = 0.   (4.2.34)

From (4.2.27) and (4.2.28) we obtain

(1/√T) (∂ log L/∂β)|_{θ₀} = (1/σ₀²) (1/√T) X′u

and

(1/√T) (∂ log L/∂σ²)|_{θ₀} = (1/(2σ₀⁴)) (1/√T) (u′u − Tσ₀²).

Thus, by applying either the Lindeberg-Feller or Liapounov CLT to a sequence of an arbitrary linear combination of the (K + 1)-vector (x_{1t}u_t, x_{2t}u_t, …, x_{Kt}u_t, u_t² − σ₀²), we can show

(1/√T) (∂ log L/∂β)|_{θ₀} → N(0, σ₀^{-2} lim T^{-1} X′X)   (4.2.35)

and

(1/√T) (∂ log L/∂σ²)|_{θ₀} → N(0, 1/(2σ₀⁴)),   (4.2.36)
Asymptotic Properties of Extremum Estimators 123
Figure 4.1 The log likelihood function in a nonregular case
with zero asymptotic covariance between (4.2.35) and (4.2.36). Thus assumption C of Theorem 4.1.3 has been shown to hold. Finally, results (4.2.32) through (4.2.36) show that assumption (4.2.22) is satisfied. We write the conclusion (4.2.23) specifically for the present example as

√T [ β̂ − β₀ ]      [ [ σ₀² (lim T^{-1} X′X)^{-1}   0     ] ]
   [ σ̂² − σ₀² ] → N( 0, [        0                 2σ₀⁴  ] ).   (4.2.37)
There are cases where the global maximum likelihood estimator exists but does not satisfy the likelihood equation (4.2.1). Then Theorem 4.1.3 cannot be used to prove the asymptotic normality of MLE. The model of Aigner, Amemiya, and Poirier (1976) is such an example. In their model, plim T^{-1} log L_T exists and is maximized at the true parameter value θ₀, so that MLE is consistent. However, problems arise because plim T^{-1} log L_T is not smooth at θ₀; it looks like Figure 4.1. In such a case, it is generally difficult to prove asymptotic normality.
(∂ log L/∂β)|_{α, β̂(α)} = 0,   (4.2.42)

and define α̂ as a root of

∂ log L*/∂α = 0.   (4.2.43)

√T(α̂ − α₀) ≅ −[ plim T^{-1} ∂² log L*/∂α∂α′ ]^{-1} (1/√T) (∂ log L*/∂α)|_{α₀},   (4.2.44)

where ≅ means that both sides of the equation have the same limit distribution. Differentiating both sides of (4.2.41) and evaluating the derivative at α₀ yields

(∂ log L*/∂α)|_{α₀} = (∂ log L/∂α)|_{α₀, β̂(α₀)} + (∂β̂′/∂α)|_{α₀} (∂ log L/∂β)|_{α₀, β̂(α₀)}   (4.2.45)
  = (∂ log L/∂α)|_{α₀, β̂(α₀)},
where the second equality follows from (4.2.42). By a Taylor expansion we
have
(∂ log L/∂α)|_{α₀, β̂(α₀)} = (∂ log L/∂α)|_{θ₀} + (∂² log L/∂α∂β′)|_{θ*} [β̂(α₀) − β₀],   (4.2.46)

where θ* lies between (α₀′, β̂(α₀)′)′ and θ₀. But we have

√T [β̂(α₀) − β₀] ≅ −[ lim T^{-1} E ∂² log L/∂β∂β′ ]^{-1} (1/√T) (∂ log L/∂β)|_{θ₀}.   (4.2.47)
Therefore, from (4.2.45), (4.2.46), and (4.2.47), we obtain

(1/√T) (∂ log L*/∂α)|_{α₀} ≅ (I, −BC^{-1}) (1/√T) (∂ log L/∂θ)|_{θ₀}.   (4.2.48)

Finally, using (4.2.22), we obtain from (4.2.48)

(1/√T) (∂ log L*/∂α)|_{α₀} → N(0, A − BC^{-1}B′).   (4.2.49)
Finally, we have proved that (4.2.44), (4.2.49), and (4.2.54) lead precisely to
the conclusion (4.2.40) as desired.
y = f(β₀) + u,   (4.3.2)

where y, f, and u are all T-vectors, whose tth elements are defined in (4.3.1).
Nonlinearity arises in many diverse ways in econometric applications. For
example, it arises when the observed variables in a linear regression model are
transformed to take account of serial correlation of the error terms (cf. Section
6.3). Another example is the distributed-lag model (see Section 5.6), in which
the coefficients on the lagged exogenous variables are specified to decrease
with lags in a certain nonlinear fashion. In both of these examples, nonlinear-
ity exists only in parameters and not in variables.
More general nonlinear models, in which nonlinearity is present both in
parameters and variables, are used in the estimation of production functions
and demand functions. The Cobb-Douglas production function with an additive error term is given by

Q_t = β₁ K_t^{β₂} L_t^{β₃} + u_t,   (4.3.3)

where Q, K, and L denote output, capital input, and labor input, respectively.⁶ The CES production function (see Arrow et al., 1961) may be written as

Q_t = β₁ [β₂ K_t^{−β₃} + (1 − β₂) L_t^{−β₃}]^{−β₄/β₃} + u_t.   (4.3.4)
See Mizon (1977) for several other nonlinear production functions. In the
estimation of demand functions, a number of highly nonlinear functions have
been proposed (some of these are also used for supply functions), for example,
translog (Christensen, Jorgenson, and Lau, 1975), generalized Leontief (Diewert, 1974), S-branch (Brown and Heien, 1972), and quadratic (Howe, Pollak, and Wales, 1979).
As in the case of the maximum likelihood estimator, we can define the nonlinear least squares (abbreviated NLLS) estimator of β₀ in two ways, depending on whether we consider the global minimum or a local minimum.
In the global case we define it as the value of β that minimizes

S_T(β) = Σ_{t=1}^T [y_t − f_t(β)]²   (4.3.5)
over some parameter space B. In the local case we define it as a root of the normal equation

∂S_T/∂β = 0.   (4.3.6)

We shall consider only the latter case because (4.3.6) is needed to prove asymptotic normality, as we have seen in Section 4.1.2. Given the NLLS estimator β̂ of β₀, we define the NLLS estimator of σ₀², denoted σ̂², by

σ̂² = T^{-1} S_T(β̂).   (4.3.7)

Note that β̂ and σ̂² defined above are also the maximum likelihood estimators if the {u_t} are normally distributed.
4.3.2 Consistency
We shall make additional assumptions in the nonlinear regression model so
that the assumptions of Theorem 4.1.2 are satisfied.
THEOREM 4.3.1. In the nonlinear regression model (4.3.1), make the additional assumptions: There exists an open neighborhood N of β₀ such that

(A) ∂f_t/∂β exists and is continuous on N.
(B) f_t(β) is continuous in β ∈ N uniformly in t; that is, given ε > 0 there exists δ > 0 such that |f_t(β₁) − f_t(β₂)| < ε whenever ‖β₁ − β₂‖ < δ for all β₁, β₂ ∈ N and for all t.⁸
(C) T^{-1} Σ_{t=1}^T f_t(β₁) f_t(β₂) converges uniformly in β₁, β₂ ∈ N.
(D) lim T^{-1} Σ_{t=1}^T [f_t(β₀) − f_t(β)]² ≠ 0 if β ≠ β₀.

Then a root of (4.3.6) is consistent in the sense of Theorem 4.1.2.
Proof. Inserting (4.3.1) into (4.3.5), we can rewrite T^{-1} times (4.3.5) as

T^{-1} S_T = T^{-1} Σ_{t=1}^T u_t² + T^{-1} Σ_{t=1}^T [f_t(β₀) − f_t(β)]²   (4.3.8)
  + 2T^{-1} Σ_{t=1}^T [f_t(β₀) − f_t(β)] u_t
  ≡ A₁ + A₂ + A₃.

The term A₁ converges to σ₀² in probability by Theorem 3.3.2 (Kolmogorov LLN 2). The term A₂ converges to a function that has a local minimum at β₀ uniformly in β because of assumptions C and D. We shall show that A₃ converges to 0 in probability uniformly in β ∈ N by an argument similar to the
T^{-1} S_T = T^{-1} Σ_{t=1}^T [y_t − β² − 2β]².

We have

plim T^{-1} Σ_{t=1}^T [y_t − β² − 2β]² = σ₀² + (β₀² − β²)² + 4(β₀ − β)² + 4(β₀² − β²)(β₀ − β) ≡ Q(β),

where the convergence is clearly uniform in β ∈ [a, b]. But, because

Q(β) = σ₀² + (β₀ − β)²(β₀ + β + 2)²,

Q(β) is uniquely minimized at β = β₀. Therefore the estimator in question is consistent by Theorem 4.1.1.
∂S_T/∂β = −2 Σ_{t=1}^T [y_t − f_t(β)] (∂f_t/∂β).   (4.3.12)

Therefore we have

(1/√T) (∂S_T/∂β)|_{β₀} = −(2/√T) Σ_{t=1}^T u_t (∂f_t/∂β)|_{β₀}.   (4.3.13)

The results of Section 3.5 show that if we assume in the nonlinear regression model that

lim T^{-1} Σ_{t=1}^T (∂f_t/∂β)(∂f_t/∂β′)|_{β₀} ≡ C   (4.3.14)

is a finite nonsingular matrix,⁹ then the limit distribution of (4.3.13) is N(0, 4σ₀²C).
Second, consider the assumption of Theorem 4.1.5 that implies assumption B of Theorem 4.1.3, except for nonsingularity. From (4.3.12) we have

T^{-1} ∂²S_T/∂β∂β′ = (2/T) Σ_{t=1}^T (∂f_t/∂β)(∂f_t/∂β′) − (2/T) Σ_{t=1}^T u_t (∂²f_t/∂β∂β′)   (4.3.15)
  − (2/T) Σ_{t=1}^T [f_t(β₀) − f_t(β)] (∂²f_t/∂β∂β′)
  ≡ A₁ + A₂ + A₃.
We must make the assumption of Theorem 4.1.5 hold for each of the three terms in (4.3.15). First, for A₁ to satisfy the assumption of Theorem 4.1.5, we must assume

T^{-1} Σ_{t=1}^T (∂f_t/∂β)(∂f_t/∂β′) converges to a finite matrix uniformly for all β in an open neighborhood of β₀.   (4.3.16)

Second, for A₂ we must assume

∂²f_t/∂β∂β′ is continuous in β in an open neighborhood of β₀ uniformly in t   (4.3.17)

and

T^{-2} Σ_{t=1}^T [∂²f_t/∂β_i∂β_j]² → 0 for all β in an open neighborhood of β₀.   (4.3.18)

Finally, the uniform convergence of A₃ requires

T^{-1} Σ_{t=1}^T f_t(β₁) (∂²f_t/∂β∂β′)|_{β₂} converges to a finite matrix uniformly for all β₁ and β₂ in an open neighborhood of β₀.   (4.3.19)
Thus under these assumptions we have

plim T^{-1} (∂²S_T/∂β∂β′)|_{β̂} = 2 lim T^{-1} Σ_{t=1}^T (∂f_t/∂β)(∂f_t/∂β′)|_{β₀}   (4.3.20)

whenever plim β̂ = β₀.
These results can be summarized as a theorem.
THEOREM 4.3.2. In addition to the assumptions of the nonlinear regression
model and the assumptions of Theorem 4.3.1, assume conditions (4.3.14),
the NLLS estimator using all data and let β̂_{(i)} be the NLLS estimator obtained by omitting y_i. Then the "pseudovalues" β*_i = Nβ̂ − (N − 1)β̂_{(i)}, i = 1, 2, …, N, can be treated like N observations (though not independent) on β. Thus, for example, Vβ̂ may be estimated by (N − 1)^{-1} N^{-1} Σ_{i=1}^N (β*_i − β̄*)(β*_i − β̄*)′, where β̄* = N^{-1} Σ_{i=1}^N β*_i.
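The pseudovalue construction above can be sketched in a few lines. In this illustration the statistic (the log of the sample mean, a simple nonlinear estimator) and the sample are made-up assumptions, not taken from the text:

```python
import numpy as np

def jackknife_variance(estimator, data):
    """Pseudovalue jackknife: beta*_i = N*b_hat - (N-1)*b_hat_(i)."""
    n = len(data)
    full = estimator(data)
    loo = np.array([estimator(np.delete(data, i)) for i in range(n)])
    pseudo = n * full - (n - 1) * loo
    center = pseudo.mean()
    # Variance estimate (N-1)^{-1} N^{-1} sum (beta*_i - mean)^2.
    return center, np.sum((pseudo - center)**2) / ((n - 1) * n)

rng = np.random.default_rng(2)
y = rng.normal(5.0, 1.0, size=200)
est = lambda d: np.log(d.mean())        # a simple nonlinear statistic

bhat, vhat = jackknife_variance(est, y)
# Delta-method check: Var(log ybar) ~ (sigma^2/n) / mean^2 = 1/(200*25).
```

The same recipe applies with an NLLS estimator in place of `est`, at the cost of one refit per omitted observation.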
g₁ = (∂Q/∂θ)|_{θ̂₁}  and  H₁ = (∂²Q/∂θ∂θ′)|_{θ̂₁}.

The iteration (4.4.2) is to be repeated until the sequence {θ̂_n} thus obtained converges.
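The Newton-Raphson step θ̂₂ = θ̂₁ − H₁⁻¹g₁ can be sketched as follows. This is an illustrative implementation only; the quadratic toy objective and its analytic gradient and Hessian are assumptions made for the example:

```python
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-10, max_iter=100):
    """Maximize Q by iterating theta <- theta - H^{-1} g until the step is tiny."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Toy concave objective Q(t) = -(t1 - 1)^2 - 2*(t2 + 3)^2, maximized at (1, -3).
g = lambda t: np.array([-2.0 * (t[0] - 1.0), -4.0 * (t[1] + 3.0)])
H = lambda t: np.array([[-2.0, 0.0], [0.0, -4.0]])

theta_hat = newton_raphson(g, H, [10.0, 10.0])
```

For a quadratic Q the iteration converges in a single step, which is the intuition behind the second-order approximation (4.4.3).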
Goldfeld and Quandt (1972, Chapter 1) have discussed how to choose α₁ and the convergence properties of the method.]

The second weakness may be remedied by the modification

θ̂₂ = θ̂₁ − λ₁ H₁^{-1} g₁,   (4.4.5)

where the scalar λ₁ is to be appropriately determined. Fletcher and Powell (1963) have presented a method to determine λ₁ by cubic interpolation of Q(θ) along the current search direction. [This method is called the DFP iteration because Fletcher and Powell refined the method originally proposed by Davidon (1959).] Also, Berndt et al. (1974) have presented another method for choosing λ₁.
The Newton-Raphson method can be used to obtain either the maximum likelihood or the nonlinear least squares estimator by choosing the appropriate Q. In the case of the MLE, E(∂² log L/∂θ∂θ′) may be substituted for ∂² log L/∂θ∂θ′ in defining H. If this is done, the iteration is called the method of scoring (see Rao, 1973, p. 366, or Zacks, 1971, p. 232). In view of Eq. (4.2.22), E(∂ log L/∂θ · ∂ log L/∂θ′) may be used instead; then we need not calculate the second derivatives of log L.
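The outer-product substitution just described (the idea behind the Berndt et al., 1974, iteration) can be sketched for an i.i.d. log likelihood, where the sum of squared per-observation scores replaces the negative Hessian. The Poisson example below is an illustrative assumption, not from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.poisson(3.0, size=5000)

def outer_product_step(lam, x):
    # Per-observation score of the Poisson log likelihood: x/lam - 1.
    s = x / lam - 1.0
    # Sum of squared scores stands in for the (negative) Hessian; scalars here.
    return lam + s.sum() / np.sum(s**2)

lam = 1.0
for _ in range(50):
    lam = outer_product_step(lam, x)
```

Only first derivatives appear, yet the iteration settles on the MLE (here the sample mean, since the Poisson score sums to zero exactly at x̄).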
g₁ ≡ (∂Q/∂θ)|_{θ̂₁} = (∂Q/∂θ)|_{θ₀} + (∂²Q/∂θ∂θ′)|_{θ*} (θ̂₁ − θ₀),   (4.4.6)

where θ* lies between θ̂₁ and θ₀. Inserting (4.4.6) into (4.4.2) yields

√T(θ̂₂ − θ₀) = { I − [T^{-1} (∂²Q/∂θ∂θ′)|_{θ̂₁}]^{-1} [T^{-1} (∂²Q/∂θ∂θ′)|_{θ*}] } √T(θ̂₁ − θ₀)   (4.4.7)
  − [T^{-1} (∂²Q/∂θ∂θ′)|_{θ̂₁}]^{-1} (1/√T) (∂Q/∂θ)|_{θ₀}.
But, because under the condition of Theorem 4.1.3

plim T^{-1} (∂²Q/∂θ∂θ′)|_{θ̂₁} = plim T^{-1} (∂²Q/∂θ∂θ′)|_{θ*} = lim T^{-1} E (∂²Q/∂θ∂θ′)|_{θ₀},   (4.4.8)
we have
where

(∂S_T/∂β)|_{β̂₁} = −2 Σ_{t=1}^T [y_t − f_t(β̂₁)] (∂f_t/∂β)|_{β̂₁}.   (4.4.13)
Then the second-round estimator β̂₂ can be interpreted as the least squares estimate of β applied to the linear regression equation (4.4.14), treating the whole left-hand side as the dependent variable and (∂f_t/∂β′) as the vector of independent variables. Equation (4.4.14) reminds us of the point raised at the beginning of Section 4.3.5, namely, that the nonlinear regression model asymptotically behaves like a linear regression model if we treat ∂f/∂β′ evaluated at a good estimate of β as the regressor matrix.

The Gauss-Newton iteration suffers from weaknesses similar to those of the Newton-Raphson iteration, namely, the possibility of an exact or near singularity of the matrix to be inverted in (4.4.12) and the possibility of too much or too little change from β̂₁ to β̂₂.
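A minimal sketch of the Gauss-Newton iteration for NLLS, using the step β̂₂ = β̂₁ + (G′G)⁻¹G′[y − f(β̂₁)] with G the derivative of f evaluated at the current estimate. The exponential regression function, sample design, and starting value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=300)
beta0 = 1.5
y = np.exp(beta0 * x) + rng.normal(0.0, 0.1, size=300)

def gauss_newton(y, x, b, steps=30):
    for _ in range(steps):
        f = np.exp(b * x)        # f_t(beta)
        G = x * f                # df_t/dbeta, the "regressor" of (4.4.14)
        b = b + G @ (y - f) / (G @ G)   # LS regression of residuals on G
    return b

beta_hat = gauss_newton(y, x, b=0.5)
```

Each step is literally an OLS fit of the current residuals on the current derivative vector, which is the linear-regression interpretation discussed above.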
To deal with the first weakness, Marquardt (1963) proposed a modification

β̂₂ = β̂₁ + [ Σ_{t=1}^T (∂f_t/∂β)(∂f_t/∂β′)|_{β̂₁} + α₁I ]^{-1} Σ_{t=1}^T [y_t − f_t(β̂₁)] (∂f_t/∂β)|_{β̂₁},   (4.4.15)

where α₁ is a positive scalar. To deal with the second weakness, Hartley (1961) proposed

β̂₂ = β̂₁ + λ₁ [ Σ_{t=1}^T (∂f_t/∂β)(∂f_t/∂β′)|_{β̂₁} ]^{-1} Σ_{t=1}^T [y_t − f_t(β̂₁)] (∂f_t/∂β)|_{β̂₁}.   (4.4.16)

Hartley proved that under general conditions his iteration converges to a root of Eq. (4.3.6). (Gallant, 1975a, has made useful comments on Marquardt's and Hartley's algorithms.)
As in the Newton-Raphson method, it can be shown that the second-round estimator of the Gauss-Newton iteration is asymptotically as efficient as NLLS if the iteration is started from an estimator β̂₁ such that √T(β̂₁ − β₀) converges to a nondegenerate random variable.
Finally, we want to mention several empirical papers in which the Gauss-Newton iteration and related iterative methods have been used. Bodkin and Klein (1967) estimated Cobb-Douglas and CES production functions by the Newton-Raphson method. Charatsis (1971) estimated the CES production function by a modification of the Gauss-Newton method similar to that of Hartley (1961) and found that in 64 out of 74 samples it converged within six iterations. Mizon (1977), in a paper whose major aim was to choose among nine production functions including Cobb-Douglas and CES, used the conjugate gradient method of Powell (1964) (see Quandt, 1983). Mizon's article also contained interesting econometric applications of various statistical techniques we shall discuss in Section 4.5, namely, a comparison of the likelihood ratio and related tests, the Akaike information criterion, tests of separate families of hypotheses, and the Box-Cox transformation (Section 8.1.2). Sargent (1978) estimated a rational expectations model (which gives rise to nonlinear constraints among parameters) by the DFP algorithm.
where θ* lies between θ₀ and θ̂. Noting that the second term of the right-hand side of (4.5.6) is 0 by the definition of θ̂, we have

log L(θ̂) − log L(θ₀) ≅ 2^{-1} √T(θ̂ − θ₀)′ J₀ √T(θ̂ − θ₀),   (4.5.7)

where we have defined

J₀ = −lim T^{-1} E (∂² log L/∂θ∂θ′).   (4.5.8)

Treating L[r(α)] ≡ L̄(α) as a function of α, we similarly obtain

log L̄(α̂) − log L̄(α₀) ≅ 2^{-1} √T(α̂ − α₀)′ J̄₀ √T(α̂ − α₀),   (4.5.9)

where

J̄₀ = −lim T^{-1} E (∂² log L̄/∂α∂α′).   (4.5.10)

Noting L(θ₀) = L̄(α₀), we have from (4.5.3), (4.5.7), and (4.5.9)

LRT ≅ T(θ̂ − θ₀)′ J₀ (θ̂ − θ₀) − T(α̂ − α₀)′ J̄₀ (α̂ − α₀).   (4.5.11)
But from Theorem 4.1.3 and its proof we have

we obtain

LRT ≅ ξ′( I − J₀^{1/2} R J̄₀^{-1} R′ J₀^{1/2} ) ξ.   (4.5.18)

But, because

J̄₀ = R′ J₀ R,   (4.5.19)

I − J₀^{1/2} R J̄₀^{-1} R′ J₀^{1/2} can be easily shown to be an idempotent matrix of rank q. Thus we can easily show Wald ≧ LRT ≧ Rao. The inequalities hold also in the multiequation linear model, as shown by Berndt and Savin (1977). Although the inequalities do not always hold for the nonlinear model, Mizon (1977) found Wald ≧ LRT most of the time in his samples.
Gallant and Holly (1980) obtained the asymptotic distribution of the three statistics under local alternative hypotheses in a nonlinear simultaneous equations model. Translated into the nonlinear regression model, their results can be stated as follows: If there exists a sequence of true values {β₀^T} such that lim β₀^T = β₀ and δ = lim √T(β₀^T − plim β̂) is finite, statistics (4.5.20), (4.5.21), and (4.5.22) converge to chi-square with q degrees of freedom and noncentrality parameter λ, where

λ = σ₀^{-2} δ′ (∂h′/∂β) [ (∂h/∂β′)(G′G)^{-1}(∂h′/∂β) ]^{-1} (∂h/∂β′) δ,   (4.5.26)

with the derivatives evaluated at β₀.
Note that if ξ is distributed as a q-vector N(0, V), then (ξ + δ)′V^{-1}(ξ + δ) is distributed as chi-square with q degrees of freedom and noncentrality parameter δ′V^{-1}δ. In other words, the asymptotic local power of the tests based on the three statistics is the same.
There appear to be only a few studies of the small sample properties of the
three tests, some of which are quoted in Breusch and Pagan (1980). No
clear-cut ranking of the tests emerged from these studies.
A generalization of the Wald statistic can be used to test the hypothesis (4.5.1), even in a situation where the likelihood function is unspecified, as long as an asymptotically normal estimator β̂ of β is available. Suppose β̂ is asymptotically distributed as N(β, V) under the null hypothesis, with V estimated consistently by V̂. Then the generalized Wald statistic is defined by

G.W. = h(β̂)′ [ (∂h/∂β′) V̂ (∂h′/∂β) ]^{-1} h(β̂),   (4.5.27)

where the derivatives are evaluated at β̂, and is asymptotically distributed as χ²(q) under the null hypothesis. Note that (4.5.21) is a special case of (4.5.27).
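The generalized Wald statistic (4.5.27) is easy to compute once β̂, V̂, h, and ∂h/∂β′ are available. The sketch below applies it to a made-up linear regression with the (true) restriction β₁ − β₂ = 0; the data-generating process and sample size are illustrative assumptions:

```python
import numpy as np

def generalized_wald(beta_hat, V_hat, h, H):
    """h(beta_hat): restriction vector; H(beta_hat): Jacobian dh/dbeta';
    V_hat: estimated asymptotic covariance matrix of beta_hat."""
    hv = h(beta_hat)
    M = H(beta_hat) @ V_hat @ H(beta_hat).T
    return hv @ np.linalg.solve(M, hv)

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.0, 1.0]) + rng.normal(size=500)

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
V = float(resid @ resid) / (500 - 2) * np.linalg.inv(X.T @ X)

w = generalized_wald(b, V, lambda b: np.array([b[0] - b[1]]),
                     lambda b: np.array([[1.0, -1.0]]))
# Under the null, w is asymptotically chi-square with 1 degree of freedom.
```

Nothing in the computation uses a likelihood, which is the point of the generalization.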
Another related asymptotic test is the specification test of Hausman (1978).
It can be used to test a more general hypothesis than (4.5.1). The only
requirement of the test is that we have an estimator, usually a maximum
likelihood estimator, that is asymptotically efficient under the null hypothesis
but loses consistency under an alternative hypothesis and another estimator
that is asymptotically less efficient than the first under the null hypothesis but
remains consistent under an alternative hypothesis. If we denote the first
Pesaran (1982) compared the power of the Cox test and other related tests.
For other recent references on the subject of nonnested models in general, see
White (1983).
m_T ≡ inf { a | T^{-1} Σ_{t=1}^T W_t(a) ≦ 2^{-1} }.   (4.6.3)

Define

P_T = 1 − P(y_t < M + T^{-1/2}γ).

Then, because by a Taylor expansion

P_T = 2^{-1} − T^{-1/2} f(M)γ + o(T^{-1/2}),   (4.6.5)

we have

P( Σ_{t=1}^T W_t^γ ≧ T/2 ) = P[ Z_T ≧ f(M)γ + o(1) ],   (4.6.6)

where W_t^γ ≡ W_t(M + T^{-1/2}γ) and Z_T ≡ T^{-1/2} Σ_{t=1}^T (W_t^γ − P_T). We now derive the limit distribution of Z_T using the characteristic function (Definition 3.3.1).
We have

E exp(iλZ_T) = Π_{t=1}^T E exp[iλT^{-1/2}(W_t^γ − P_T)]   (4.6.7)
  = { P_T exp[iλT^{-1/2}(1 − P_T)] + (1 − P_T) exp(−iλT^{-1/2}P_T) }^T
  = [1 − λ²/(8T) + o(T^{-1})]^T
  → exp(−λ²/8),

where the third equality above is based on (4.6.5) and the expansion of the exponent: e^x = 1 + x + 2^{-1}x² + … . Therefore Z_T → N(0, 4^{-1}), which
S_T = Σ_{t=1}^T |y_t − θ|.   (4.6.10)

Then, clearly, m_T minimizes S_T. We have

plim T^{-1} S_T = θ + 2 ∫_θ^∞ λ f(λ) dλ − 2θ ∫_θ^∞ f(λ) dλ ≡ Q(θ),   (4.6.11)

where the convergence can be shown to be uniform in θ. The derivation of (4.6.11) and the uniform convergence will be shown for a regression model, for which the present i.i.d. model is a special case, in the next subsection. Because

∂Q/∂θ = −1 + 2F(θ)   (4.6.12)

and

∂²Q/∂θ² = 2f(θ),   (4.6.13)

we conclude that Q is uniquely minimized at θ = M and hence m_T is consistent by Theorem 4.1.1.
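That the sample median minimizes S_T = Σ|y_t − θ| can be checked by brute force. The grid search below is an illustrative sketch (sample, grid, and odd sample size are assumptions chosen so that the sample median is unique):

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(0.0, 1.0, size=1001)          # odd size: unique sample median

def S(theta):
    # The objective S_T(theta) = sum_t |y_t - theta|, piecewise linear and convex.
    return np.sum(np.abs(y - theta))

grid = np.linspace(-1.0, 1.0, 2001)          # spacing 0.001
theta_star = grid[np.argmin([S(t) for t in grid])]
med = np.median(y)
```

Because S_T is convex and kinked only at the observations, the grid minimizer lands within one grid step of the sample median.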
Next, we shall consider two complications that prevent us from proving the asymptotic normality of the median by using Theorem 4.1.3: One is that ∂S_T/∂θ = 0 may have no roots, and the other is that ∂²S_T/∂θ² = 0 except for a finite number of points. These statements are illustrated in Figure 4.2, which depicts two typical shapes of S_T.

Despite these complications, assumption C of Theorem 4.1.3 is still valid if we interpret the derivative to mean the left derivative. That is to say, define for Δ > 0
Figure 4.2 Two typical shapes of S_T (in one case ∂S_T/∂θ = 0 has no roots)
Because {W_t(M)} are i.i.d. with mean 2^{-1} and variance 4^{-1}, we have by Theorem 3.3.4 (Lindeberg-Lévy CLT)

T^{-1/2} Σ_{t=1}^T [W_t(M) − 2^{-1}] → N(0, 4^{-1}).   (4.6.17)
This is a generalization of (4.6.10). Like the median, the LAD estimator may not be unique.

We shall now prove the consistency of the LAD estimator using Theorem 4.1.1. From (4.6.18) we have

S_T − Σ_{t=1}^T |u_t| = Σ_{t=1}^T h(u_t | x_t′δ),   (4.6.19)

where δ = β − β₀ and h(z|a) is defined as

If a ≧ 0,  h(z|a) = a         if z ≦ 0   (4.6.20)
                  = a − 2z    if 0 < z < a
                  = −a        if z ≧ a.
If a < 0,  h(z|a) = a         if z ≦ a
                  = −a + 2z   if a < z < 0
                  = −a        if z ≧ 0.
∂Q/∂δ = 2 lim T^{-1} Σ_{t=1}^T [ ∫_0^{x_t′δ} f(λ) dλ ] x_t

and

∂²Q/∂δ∂δ′ = 2 lim T^{-1} Σ_{t=1}^T f(x_t′δ) x_t x_t′.   (4.6.23)

To show (4.6.24) is a contradiction, let a₁, a₂, and M be positive real numbers such that |x_t′δ₁| < M for all t and f(λ) ≧ a₁ whenever |λ| ≦ a₂. Such numbers exist because of our assumptions. Then we have

[ ∫_0^{x_t′δ₁} f(λ) dλ ] x_t′δ₁ ≧ (a₁a₂/M)(x_t′δ₁)²,   (4.6.25)

so that summing over t gives a lower bound of the form (a₁a₂/M) δ₁′ (T^{-1}X′X) δ₁, which has a positive limit by our assumptions.
T^{-1/2} Σ_{t=1}^T x_t [ψ(y_t − x_t′β) − Eψ(y_t − x_t′β)]|_{β̂} ≅ T^{-1/2} Σ_{t=1}^T x_t [ψ(y_t − x_t′β₀) − Eψ(y_t − x_t′β₀)],   (4.6.28)

so that

√T(β̂ − β₀) ≅ { −plim T^{-1} Σ_{t=1}^T x_t [∂Eψ(y_t − x_t′β)/∂β′] }^{-1} T^{-1/2} Σ_{t=1}^T x_t ψ(u_t).   (4.6.29)

Noting Eψ(y_t − x_t′β) = 1 − 2F(x_t′δ), Eψ(u_t) = 0, and Eψ(u_t)² = 1, we obtain

√T(β̂ − β₀) → N( 0, 4^{-1} f(0)^{-2} [lim T^{-1} X′X]^{-1} ).   (4.6.30)
Exercises
1. (Section 4.1.1)
Prove (i) ⇒ (ii) ⇒ (iii) in Definition 4.1.1.
2. (Section 4.1.1)
In the model of Exercise 11 in Chapter 1, obtain the probability limits of the two roots of the likelihood equation, assuming lim_{T→∞} T^{-1} x′x = c, where c is a finite, positive constant.
3. (Section 4.1.1)
In the model of Exercise 2 (this chapter), prove the existence of a consist-
ent root, using Theorem 4.1.2.
4. (Section 4.1.2)
Suppose that {X_T} are essentially bounded; that is, for any ε > 0, there exists M_ε such that P(|X_T| < M_ε) ≧ 1 − ε for all T. Show that if plim_{T→∞} Y_T = 0, then plim_{T→∞} X_T Y_T = 0. (This is needed in the proof of Theorem 4.1.4.)
5. (Section 4.2.2)
Prove (4.2.18) by verifying condition D given after (4.1.7). Assume for simplicity that σ² is known. (The proof for the case of unknown σ² is similar but more complicated.)
6. (Section 4.2.2)
Let X_it, i = 1, 2, …, n, t = 1, 2, …, T, be independent with the distribution N(μ_t, σ²). Obtain the probability limit of the maximum likelihood estimator of σ², assuming that n is fixed and T goes to infinity (cf. Neyman and Scott, 1948).
7. (Section 4.2.3)
Let {X_t}, t = 1, 2, …, T, be i.i.d. with the probability distribution
X_t = 1 with probability p
    = 0 with probability 1 − p.
Prove the consistency and asymptotic normality of the maximum likeli-
hood estimator using Theorems 4.1.1 and 4.2.4. (The direct method is
much simpler but not to be used here for the sake of an exercise.)
8. (Section 4.2.3)
Prove the asymptotic normality of the consistent root in the model of
Exercise 2 (this chapter).
9. (Section 4.2.3)
Let {X_t} be i.i.d. with uniform distribution over (0, θ). Show that if θ̂ is defined by θ̂ = T^{-1}(T + 1) max(X₁, X₂, …, X_T), then
lim_{T→∞} P[T(θ̂ − θ) < x] = exp(θ^{-1}x − 1) for x ≦ θ.
10. (Section 4.2.3)
Consider the model

y_t = β₀ + u_t,  t = 1, 2, …, T,

where y_t and u_t are scalar random variables and β₀ is a scalar unknown parameter. If {u_t} are i.i.d. with Eu_t = 0, Eu_t² = β₀², Eu_t³ = 0, and Eu_t⁴ = m₄ (note that we do not assume the normality of u_t), which of the following three estimators do you prefer most and why?

(1) β̂₁ = T^{-1} Σ_{t=1}^T y_t,
(2) β̂₂, which maximizes S = −(T/2) log (2πβ²) − (2β²)^{-1} Σ_{t=1}^T (y_t − β)²,
(3) β̂₃, defined as 0.5 times the value of β that minimizes

W = Σ_{t=1}^T [ (y_t/β) − 1 ]².
11. (Section 4.2.3)
Derive the asymptotic variance of the estimator of fl obtained by mini-
mizing IT:. 1(y, /3x,)4, where y, is independent with the distribution
N(/30x ag) and limT, Ix? is a finite, positive constant. You may
assume consistency and asymptotic normality. Indicate the additional
assumptions on x, one needs. Note if Z N(0, a 2), Erk = cr2k(2k)!I
(2kk!).
12. (Section 4.2.4)
Complete the proof of Example 4.2.4, the derivation of the asymptotic normality of the superefficient estimator.
13. (Section 4.2.5)
In the model of Example 4.2.3, obtain the asymptotic variance-covariance matrix of β̂ using the concentrated likelihood function in β.
14. (Section 4.3.2)
What assumptions are needed to prove consistency in Example 4.3.2
using Theorem 4.3.1?
15. (Section 4.3.3)
Prove the asymptotic normality of the NLLS estimator in Example 4.3.1.
Because there are many books concerned solely with time series analysis, this
chapter is brief; only the most essential topics are considered. The reader who
wishes to study this topic further should consult Doob (1953) for a rigorous
probabilistic foundation of time series analysis; Anderson (1971) or Fuller
(1976) for estimation and large sample theory; Nerlove, Grether, and Car-
valho (1979) and Harvey (1981a, b), for practical aspects of fitting time series
by autoregressive and moving-average models; Whittle (1983) for the theory
of prediction; Granger and Newbold (1977) for the more practical aspects of
prediction; and Brillinger (1975) for the estimation of the spectral density.
In Section 5.1 we shall define stationary time series and the autocovariance
function and spectral density of stationary time series. In Section 5.2 autore-
gressive models will be defined and their estimation problems discussed. In
Section 5.3 autoregressive models with moving-average residuals will be de-
fined. In Section 5.4 we shall discuss the asymptotic properties of the LS and
ML estimators in the autoregressive model, and in Section 5.5 we shall discuss
prediction briefly. Finally, in Section 5.6 we shall discuss distributed-lag
models.
5.1 Introduction
ally not stationary in either sense. Time series with trends are not stationary,
and economic time series often exhibit trends. However, this fact does not
diminish the usefulness of Sections 5.2 through 5.4 because a time series may
be analyzed after a trend is removed. A trend may be removed either by direct
subtraction or by differencing. The latter means considering first differences (y_t − y_{t−1}), second differences ((y_t − y_{t−1}) − (y_{t−1} − y_{t−2})), and so forth.
There are three fundamental ways to analyze a stationary time series. First,
we can specify a model for it, such as an autoregressive model (which we shall
study in Section 5.2) or a combined autoregressive, moving-average model
(which we shall study in Section 5.3). Second, we can examine autocovar-
iances Ey_t y_{t+h}, h = 0, 1, 2, … . Third, we can examine the Fourier trans-
form of autocovariances called spectral density. In Sections 5.1.2 and 5.1.3 we
shall study autocovariances and spectral density.
5.1.2 Autocovariances
Define γ_h = Ey_t y_{t+h}, h = 0, 1, 2, … . A sequence {γ_h} contains important information about the characteristics of a time series {y_t}. It is useful to arrange {γ_h} as an autocovariance matrix

Σ = [ γ₀      γ₁      γ₂   ⋯  γ_{T−1}
      γ₁      γ₀      γ₁   ⋯  γ_{T−2}
      γ₂      γ₁      γ₀   ⋯  γ_{T−3}   (5.1.1)
      ⋮                        ⋮
      γ_{T−1} γ_{T−2}      ⋯  γ₁  γ₀ ].

This matrix is symmetric; its main diagonal consists only of γ₀, the next diagonals have only γ₁, and so on. Such a matrix is called a Toeplitz form.
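The Toeplitz structure of (5.1.1) — the (j, k) element depends only on |j − k| — can be built directly. The AR(1) autocovariances used to fill it are an illustrative assumption (they are derived later, in Section 5.2):

```python
import numpy as np

def autocov_matrix(gamma):
    """Build the T x T Toeplitz matrix whose (j, k) element is gamma[|j - k|]."""
    T = len(gamma)
    j, k = np.indices((T, T))
    return np.asarray(gamma)[np.abs(j - k)]

# AR(1) autocovariances gamma_h = sigma^2 * rho^h / (1 - rho^2) as an example.
rho, sigma2, T = 0.5, 1.0, 5
gamma = sigma2 * rho**np.arange(T) / (1 - rho**2)
Sigma = autocov_matrix(gamma)
```

Every diagonal of `Sigma` is constant, which is exactly the Toeplitz property described above.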
f(ω) = Σ_{h=−∞}^{∞} γ_h e^{−ihω} = Σ_{h=−∞}^{∞} γ_h cos(hω),

where the second equality follows from γ_h = γ_{−h} and sin λ = −sin(−λ). Therefore the spectral density is real and symmetric around ω = 0.

Inverting (5.1.2), we obtain
where ω_k = πk/n and {ξ_k} and {ζ_k} are independent of each other and independent across k with Eξ_k = Eζ_k = 0 and Vξ_k = Vζ_k = σ_k². Then we have
which means that the partial summation of the right-hand side converges to y_t in the mean square. The model (5.2.1) with Assumptions A, B, and C is equivalent to the model (5.2.3) with Assumptions A and B. The latter is called the moving-average representation of the former.
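The equivalence of the autoregressive form (5.2.1) and its moving-average representation (5.2.3) can be checked numerically: simulate y_t = ρy_{t−1} + ε_t by recursion and compare with the truncated sum Σ_j ρ^j ε_{t−j}. The parameter values, burn-in, and truncation point are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
rho, T, burn = 0.6, 200, 500
eps = rng.normal(size=T + burn)

# AR(1) recursion y_t = rho * y_{t-1} + eps_t (started at zero).
y = np.zeros(T + burn)
for t in range(1, T + burn):
    y[t] = rho * y[t - 1] + eps[t]

# Truncated moving-average representation y_t = sum_{j<J} rho^j eps_{t-j}.
J = 100
weights = rho ** np.arange(J)
y_ma = np.array([weights @ eps[t - np.arange(J)] for t in range(burn, T + burn)])
```

The truncation error is of order ρ^J, here about 0.6¹⁰⁰, so after the burn-in the two constructions agree to machine precision.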
A quick mechanical way to obtain the moving-average representation (5.2.3) of (5.2.1), and vice versa, is to define the lag operator L such that Ly_t = y_{t−1}, L²y_t = y_{t−2}, and so on. Then (5.2.1) can be written as

(1 − ρL)y_t = ε_t,   (5.2.4)

where 1 is the identity operator such that 1y_t = y_t. Therefore

R₁ = [ √(1−ρ²)  0   0  ⋯  0
       −ρ       1   0  ⋯  0
       0       −ρ   1  ⋯  0
       ⋮                  ⋮
       0        0   ⋯  −ρ  1 ].
Then we have

R₁y = ε₍₁₎.   (5.2.11)

But, because Eε₍₁₎ε₍₁₎′ = σ²I, we obtain

Σ₁^{-1} = σ^{-2} R₁′R₁.   (5.2.12)
Σ₁^{-1} = σ^{-2} [ 1    −ρ     0    ⋯   0    0
                  −ρ   1+ρ²  −ρ    ⋯   0    0
                   0   −ρ    1+ρ²  ⋯   0    0
                   ⋮                        ⋮
                   0    0     0    ⋯  1+ρ² −ρ
                   0    0     0    ⋯  −ρ    1 ].
By inserting (5.2.8) into (5.1.2), we can derive the spectral density of AR(1):

f₁(ω) = [σ²/(1 − ρ²)] Σ_{h=−∞}^{∞} ρ^{|h|} e^{−ihω}   (5.2.15)
  = [σ²/(1 − ρ²)] [ 1 + Σ_{h=1}^{∞} (ρe^{iω})^h + Σ_{h=1}^{∞} (ρe^{−iω})^h ]
  = [σ²/(1 − ρ²)] [ 1 + ρe^{iω}/(1 − ρe^{iω}) + ρe^{−iω}/(1 − ρe^{−iω}) ]
  = σ²/(1 − 2ρ cos ω + ρ²).
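The closed form (5.2.15) can be verified against a direct (truncated) evaluation of Σ_h γ_h e^{−ihω}; the truncation point and parameter values below are illustrative assumptions:

```python
import numpy as np

rho, sigma2 = 0.7, 2.0
H = 400                                          # truncation of the infinite sum
h = np.arange(-H, H + 1)
gamma = sigma2 * rho**np.abs(h) / (1 - rho**2)   # AR(1) autocovariances

omega = np.linspace(0.1, 3.0, 30)
# Direct sum: since gamma_h = gamma_{-h}, only the cosine terms survive.
series = np.array([np.sum(gamma * np.cos(h * w)) for w in omega])
closed = sigma2 / (1 - 2 * rho * np.cos(omega) + rho**2)
```

The truncation error is of order ρ^H and is negligible here, so the two curves coincide.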
Hence,

(1 − μ₁L)(1 − μ₂L)y_t = ε_t,   (5.2.18)

where μ₁ and μ₂ are the roots of z² − ρ₁z − ρ₂ = 0. Premultiplying (5.2.18) by (1 − μ₁L)^{-1}(1 − μ₂L)^{-1}, we obtain

y_t = Σ_{j,k≧0} μ₁^j μ₂^k ε_{t−j−k}.

For the AR(2) model the transformation R₂ such that R₂y = ε₍₂₎ has, in each row beyond the first two, the elements −ρ₂, −ρ₁, 1 in consecutive positions and 0 elsewhere.
Time Series Analysis 167
we have

R₂y = ε₍₂₎.   (5.2.28)

Consider the AR(p) model

y_t = Σ_{j=1}^{p} ρ_j y_{t−j} + ε_t,  t = 0, ±1, ±2, … .   (5.2.30)
THEOREM 5.2.1. {y_t} defined in (5.2.30) with Assumptions A, B″, and C can be written as a moving-average process of the form

y_t = Σ_{j=0}^{∞} φ_j ε_{t−j},  where Σ_{j=0}^{∞} |φ_j| < ∞.   (5.2.31)

Proof. Write (5.2.30) as

[ Π_{j=1}^{p} (1 − μ_j L) ] y_t = ε_t,   (5.2.32)

so that

y_t = [ Π_{j=1}^{p} ( Σ_{k=0}^{∞} μ_j^k L^k ) ] ε_t.   (5.2.33)

The coefficients on ε_{t−j} in (5.2.33) are absolutely summable because each |μ_j| < 1.
Therefore

Σ_{h=−∞}^{∞} |γ_h| < ∞,   (5.2.36)

where γ_h = Ey_t y_{t+h}.

Proof. We have

γ₀ = σ²(φ₀² + φ₁² + ⋯),
γ₁ = σ²(φ₀φ₁ + φ₁φ₂ + ⋯),
γ₂ = σ²(φ₀φ₂ + φ₁φ₃ + ⋯), and so on.

Therefore

Σ_{h=−∞}^{∞} |γ_h| ≦ σ² ( Σ_{j=0}^{∞} |φ_j| )²,   (5.2.37)

from which the theorem follows.
The premise of the next theorem is weaker than the conclusion of the
preceding theorem. In terms of the spectral density, the premise of Theorem
5.2.4 is equivalent to its existence, and the conclusion of Theorem 5.2.3 to its
continuity.
Σ_{j=0}^{p} ρ_j y_{t−j} = Σ_{j=0}^{q} β_j ε_{t−j},  ρ₀ = β₀ = 1,  t = 0, ±1, ±2, …,   (5.3.1)
where |z|² = z z̄ for a complex number z, z̄ being its complex conjugate. Note that (5.3.5) is reduced to (5.2.15) in the special case of AR(1). We also see
from (5.3.5) that the spectral density of a moving-average model is, except for
02, the inverse of the spectral density of an autoregressive model with the same
order and the same coefficients. Because the spectral density of a stationary
process approximately corresponds to the set of the characteristic roots of the
autocovariance matrix, as was noted earlier, we can show that the autocovar-
iance matrix of a moving-average model is approximately equal to the inverse
of the autocovariance matrix of the corresponding autoregressive model. We
shall demonstrate this for the case of MA(1).
Consider an MA(1) model defined by

y_t = ε_t + ρε_{t−1},   (5.3.6)

where |ρ| < 1 and {ε_t} are i.i.d. with Eε_t = 0 and Vε_t = σ². The T × T autocovariance matrix is given by

Σ₍₁₎ = σ² [ 1+ρ²   ρ     0    ⋯   0
            ρ     1+ρ²   ρ    ⋯   0
            0      ρ    1+ρ²  ⋯   0   (5.3.7)
            ⋮                      ⋮
            0      0     0    ⋯  1+ρ² ].
In accordance with the preceding argument, its inverse is approximately

Σ₍₁₎^{-1} ≅ σ^{-2}(1 − ρ²)^{-1} [ (−ρ)^{|j−k|} ],  j, k = 0, 1, …, T − 1.

Whittle (1983, p. 75) has presented the exact inverse of Σ₍₁₎. The j, kth element (j, k = 0, 1, …, T − 1) of Σ₍₁₎^{-1}, denoted σ^{jk}, is given by
y = Yρ + ε,   (5.4.1)

where y = (y_{p+1}, y_{p+2}, …, y_T)′, ρ = (ρ₁, ρ₂, …, ρ_p)′, ε = (ε_{p+1}, ε_{p+2}, …, ε_T)′, and

Y = [ y_p      y_{p−1}  ⋯  y₁
      y_{p+1}  y_p      ⋯  y₂
      ⋮                     ⋮   (5.4.2)
      y_{T−1}  y_{T−2}  ⋯  y_{T−p} ].
Model (5.4.1) superficially looks like the classical regression model (1.1.4), but it is not, because the regressors Y cannot be regarded as constants. This makes it difficult to derive the mean and variance of ρ̂ for general p. However, in the case of p = 1, the exact distribution of ρ̂ can be calculated by direct integration using the method of Imhof (1961). The distribution is negatively skewed and downward biased. Phillips (1977) derived the Edgeworth expansion [up to the order of O(T^{-1})] of the distribution of ρ̂ in the case of p = 1, assuming the normality of {ε_t}, and compared it with the exact distribution. He found that the approximation is satisfactory for ρ = 0.4 but not for ρ = 0.8.
In the general AR(p) model we must rely on the asymptotic properties of the LS estimator. For model (5.4.1) with Assumptions A, B″, and C, Anderson (1971, p. 193) proved

√T(ρ̂ − ρ) → N( 0, σ² [plim T^{-1} Y′Y]^{-1} ).   (5.4.3)

Consider the case p = 1. From (5.4.1) we have

ρ̂ − ρ = ( Σ_{t=2}^{T} y_{t−1}ε_t ) / ( Σ_{t=2}^{T} y_{t−1}² ),   (5.4.5)

and, because Ey_{t−1}ε_t = 0 and E(y_{t−1}ε_t)(y_{s−1}ε_s) = 0 for s ≠ t,

plim T^{-1} Σ_{t=2}^{T} y_{t−1}ε_t = 0.   (5.4.7)

Also,

Ey_t² = σ²/(1 − ρ²).   (5.4.10)
Therefore

plim T^{-1} Σ_{t=2}^{T} y_{t−1}² = σ²/(1 − ρ²) + plim T^{-1} Σ_{t=2}^{T} (y_{t−1}² − Ey_{t−1}²).   (5.4.11)

But the last term of (5.4.11) is 0 because of (3.2.5), the generalized Chebyshev's inequality. Therefore

plim T^{-1} Σ_{t=2}^{T} y_{t−1}² = σ²/(1 − ρ²).   (5.4.12)

The consistency of ρ̂ follows from (5.4.5), (5.4.7), and (5.4.12) because of Theorem 3.2.6.
Next consider asymptotic normality. For this purpose we need the following definition: A sequence {v_t} is said to be K-dependent if (v_{t₁}, v_{t₂}, . . . , v_{tₙ}) are independent of (v_{s₁}, v_{s₂}, . . . , v_{sₘ}) for any set of integers satisfying t₁ < t₂ < . . . < tₙ < s₁ < s₂ < . . . < sₘ and tₙ + K < s₁.

To apply a central limit theorem to a K-dependent sequence, split the T observations into S successive groups with M + K observations each [so that S(M + K) = T] and then in each group retain the first M observations and eliminate the remaining K observations. If we choose S and M in such a way that S → ∞, M → ∞, and S/T → 0 as T → ∞, the elimination of K observations from each of the S groups does not matter asymptotically.
We can write

T^{-1/2} Σ_{t=2}^T y_{t-1}ε_t = T^{-1/2} Σ_{t=2}^T v_{Nt} + Δ_{NT},  (5.4.13)

where v_{Nt} = ε_t Σ_{s=0}^N ρ^s ε_{t-1-s} and Δ_{NT} = T^{-1/2} Σ_{t=2}^T ε_t Σ_{s=N+1}^∞ ρ^s ε_{t-1-s}. But we have

EΔ²_{NT} = σ⁴ T⁻¹(T − 1) Σ_{s=N+1}^∞ ρ^{2s}.  (5.4.14)

Therefore Δ_{NT} can be ignored for large enough N. (Anderson, 1971, Theorem 7.7.1, has given an exact statement of this.) We can show that for a fixed N, Ev²_{Nt} = σ⁴ Σ_{s=0}^N ρ^{2s}, which converges to σ⁴(1 − ρ²)⁻¹ as N → ∞. Moreover, v_{Nt} and v_{N,t+N+2} are independent for all t. Therefore {v_{Nt}} for each N are (N + 1)-dependent and can be subjected to a central limit theorem (see Anderson, 1971, Theorem 7.7.5). Therefore

T^{-1/2} Σ_{t=2}^T y_{t-1}ε_t → N(0, σ⁴/(1 − ρ²)).  (5.4.15)

Combining (5.4.12) and (5.4.15), we have

√T(ρ̂ − ρ) → N(0, 1 − ρ²).  (5.4.16)
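Result (5.4.16) can be checked by simulation: for large T the sample variance of √T(ρ̂ − ρ) across replications should be close to 1 − ρ². A sketch (all numerical settings are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, T, reps, burn = 0.6, 1500, 2000, 100
e = rng.standard_normal((reps, T + burn))
y = np.zeros((reps, T + burn))
for t in range(1, T + burn):        # simulate all replications at once
    y[:, t] = rho * y[:, t - 1] + e[:, t]
y = y[:, burn:]

num = (y[:, :-1] * y[:, 1:]).sum(axis=1)
den = (y[:, :-1] ** 2).sum(axis=1)
z = np.sqrt(T) * (num / den - rho)  # sqrt(T)(rho_hat - rho), one per replication
print(z.var(), 1 - rho**2)          # the two numbers should be close
```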
Now we consider the MLE in AR(1) under the normality of {ε_t}. The likelihood function is given by

L = (2π)^{-T/2} |Σ₁|^{-1/2} exp [−(1/2) y′Σ₁⁻¹y],  (5.4.17)

which can be written as

log L = −(T/2) log (2π) − (T/2) log σ² + (1/2) log (1 − ρ²) − Q/(2σ²),  (5.4.18)

where Q = (1 − ρ²)y₁² + Σ_{t=2}^T (y_t − ρy_{t-1})². Setting ∂ log L/∂σ² = 0 yields

σ̂² = T⁻¹Q.  (5.4.19)

Inserting (5.4.19) into (5.4.18) yields the concentrated log likelihood function (aside from terms not depending on the unknown parameters)

log L* = −(T/2) log (2π) − (T/2) log Q + (1/2) log (1 − ρ²).  (5.4.20)

Setting ∂ log L*/∂ρ = 0 results in a cubic equation in ρ with a unique real root in the range [−1, 1]. Beach and MacKinnon (1978) have reported a method of deriving this root.

However, by setting ∂Q/∂ρ = 0, we obtain a much simpler estimator,

ρ̂_A = (Σ_{t=2}^T y_t y_{t-1}) / (Σ_{t=2}^{T-1} y_t²).  (5.4.21)

We call it the approximate MLE. Note that it is similar to the least squares estimator ρ̂ given in (5.4.2), for which the range of the summation in the denominator is from t = 2 to T. If we denote the true MLE by ρ̂_M, we can easily show that √T(ρ̂_A − ρ) and √T(ρ̂_M − ρ) have the same limit distribution by using the result of Section 4.2.5.
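The closed form (5.4.21) can be verified numerically against the first-order condition it solves. Here Q(ρ) = (1 − ρ²)y₁² + Σ_{t=2}^T (y_t − ρy_{t-1})², the exponent part of the AR(1) likelihood (the simulation settings are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
rho_true, T = 0.5, 200
e = rng.standard_normal(T)
y = np.empty(T)
y[0] = e[0] / np.sqrt(1 - rho_true**2)   # stationary start
for t in range(1, T):
    y[t] = rho_true * y[t - 1] + e[t]

def Q(rho):
    # Q(rho) = (1 - rho^2) y_1^2 + sum_{t=2}^T (y_t - rho y_{t-1})^2
    return (1 - rho**2) * y[0]**2 + ((y[1:] - rho * y[:-1])**2).sum()

# approximate MLE (5.4.21): numerator sums t = 2..T, denominator t = 2..T-1
rho_a = (y[1:] @ y[:-1]) / (y[1:-1] @ y[1:-1])

# check the first-order condition dQ/drho = 0 at rho_a by central differences
h = 1e-6
deriv = (Q(rho_a + h) - Q(rho_a - h)) / (2 * h)
print(rho_a, deriv)
```

Because Q is quadratic in ρ, the central difference is exact up to rounding, so the printed derivative is essentially zero.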
5.5 Prediction
In the classical regression model with nonstochastic regressors, the problem of
predicting a future value of the dependent variable and the problem of esti-
mating the regression parameters can be solved in a unified way. For example,
the least squares predictor is the best linear unbiased predictor, as shown in
Section 1.6. But, in time series models such as AR or MA, in which the
regressors are stochastic, the problem of optimal prediction is not a simple
corollary of the problem of optimal estimation because (1.6.3) no longer
holds. In view of this difficulty, what we usually do in practice is to obtain the
optimal predictor on the assumption that the parameters are known and then
insert the estimates obtained by the best known method, such as maximum
likelihood, mechanically into the formula for the optimal predictor.
The optimal predictor is obtained from the moving-average representation of the process,

y_{t+n} = Σ_{k=0}^∞ φ_k ε_{t+n-k},  (5.5.3)

namely,

ŷ_{t+n} = Σ_{k=n}^∞ φ_k ε_{t+n-k}.  (5.5.4)
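As a concrete instance, for a stationary AR(1) process y_t = ρy_{t-1} + ε_t with Vε_t = 1, the optimal n-step-ahead predictor given y_t is ρⁿy_t, and its mean squared prediction error is (1 − ρ²ⁿ)/(1 − ρ²). A simulation sketch (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n, reps = 0.7, 3, 200_000
# draw y_t from the stationary distribution, variance 1/(1 - rho^2)
y_t = rng.standard_normal(reps) / np.sqrt(1 - rho**2)
e = rng.standard_normal((reps, n))
y_future = y_t.copy()
for k in range(n):                        # propagate the AR(1) n steps ahead
    y_future = rho * y_future + e[:, k]
# optimal predictor rho^n * y_t; compare simulated MSE with theory
mse = ((y_future - rho**n * y_t) ** 2).mean()
print(mse, (1 - rho**(2 * n)) / (1 - rho**2))
```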
Consider the distributed-lag model

y_t = Σ_{j=0}^∞ β_j x_{t-j} + v_t,  (5.6.3)

and its truncated version

y_t = Σ_{j=1}^N β_j x_{t+1-j} + v_t,  (5.6.5)
in which β₁, . . . , β_N lie on the curve of a qth-order polynomial; that is,

β_j = δ₀ + δ₁j + δ₂j² + . . . + δ_q j^q,  j = 1, 2, . . . , N.  (5.6.6)

By defining vectors β = (β₁, β₂, . . . , β_N)′ and δ = (δ₀, δ₁, . . . , δ_q)′, we can write (5.6.6) in vector notation as

β = Jδ,  (5.6.7)

where

    ⎡ 1  1  1   ...  1   ⎤
J = ⎢ 1  2  2²  ...  2^q ⎥ .
    ⎢ ⋮              ⋮   ⎥
    ⎣ 1  N  N²  ...  N^q ⎦

The estimation of δ can be done by the least squares method. Let X be a T × N matrix, the t,jth element of which is x_{t+1-j}. Then δ̂ = (J′X′XJ)⁻¹J′X′y and β̂ = Jδ̂.³ Note that β̂ is a special case of the constrained least squares estimator (1.4.11) where R = J and c = 0.
By choosing N and q judiciously, a researcher can hope to attain both a
reasonably flexible distribution of lags and parsimony in the number of pa-
rameters to estimate. Amemiya and Morimune (1974) showed that a small
order of polynomials (q = 2 or 3) works surprisingly well for many economic
time series.
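A minimal sketch of the Almon procedure just described: generate data whose lag coefficients lie exactly on a quadratic, build J as in (5.6.7), estimate δ by LS in the transformed regression, and recover β̂ = Jδ̂ (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
N, q, T = 8, 2, 300
# true lag coefficients lying on a quadratic, as in (5.6.6)
j = np.arange(1, N + 1)
beta_true = 0.5 + 0.3 * j - 0.04 * j**2

x = rng.standard_normal(T + N)
# X is T x N with t,j-th element x_{t+1-j}
X = np.column_stack([x[N - k + 1: T + N - k + 1] for k in range(1, N + 1)])
y = X @ beta_true + 0.1 * rng.standard_normal(T)

# J has j-th row (1, j, j^2, ..., j^q), so that beta = J delta  (5.6.7)
J = j[:, None] ** np.arange(q + 1)[None, :]
delta = np.linalg.solve(J.T @ X.T @ X @ J, J.T @ X.T @ y)  # (J'X'XJ)^{-1} J'X'y
beta_hat = J @ delta
print(np.abs(beta_hat - beta_true).max())
```

Eight lag coefficients are estimated with only q + 1 = 3 free parameters, which is the parsimony referred to above.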
Some researchers prefer to constrain the value of the polynomial to be 0 at j = N + 1. This amounts to imposing another equation

δ₀ + δ₁(N + 1) + δ₂(N + 1)² + . . . + δ_q(N + 1)^q = 0  (5.6.8)

in addition to (5.6.6). Solving (5.6.8) for δ₀ and inserting it into the right-hand side of (5.6.7) yields the vector equation

β = J*δ*,  (5.6.9)

where δ* = (δ₁, δ₂, . . . , δ_q)′ and J* should be appropriately defined.
Exercises
1. (Section 5.2.1)
Prove that model (5.2.1) with Assumptions A, B, and C is equivalent to
model (5.2.3) with Assumptions A and B.
2. (Section 5.2.1)
Show that the process defined in the paragraph following Eq. (5.2.5) is
AR(1).
3. (Section 5.3)
Find the exact inverse of the variance-covariance matrix of MA(1) using
(5.3.12) and compare it with the variance-covariance matrix of AR(1).
4. (Section 5.3)
In the MA(1) process defined by (5.3.6), define y*_t = ε*_t + ρ⁻¹ε*_{t-1}, where {ε*_t} are i.i.d. with Eε*_t = 0 and Vε*_t = ρ²σ². Show that the autocovariances of y_t and y*_t are the same.
5. (Section 5.4)
If X, Y, and Z are jointly normal and if X is independent of either Y or Z, then EXYZ = EX·EYZ (Anderson, 1958, p. 22). Show by a counterexample that the equality does not in general hold without the normality assumption.
6. (Section 5.4)
Show that √T(ρ̂_A − ρ) and √T(ρ̂_M − ρ) have the same limit distribution.
7. (Section 5.4)
In the AR(1) process defined by (5.2.1), define the first differences z_t = y_t − y_{t-1} and derive plim_{T→∞} Σ_{t=3}^T z_t z_{t-1} / Σ_{t=3}^T z²_{t-1}.
8. (Section 5.4)
In the AR(2) process defined by (5.2.16), derive plim_{T→∞} Σ_{t=2}^T y_t y_{t-1} / Σ_{t=2}^T y²_{t-1}.
9. (Section 5.5)
Derive (5.5.2) from the general formula of (5.5.5).
10. (Section 5.5)
In the MA(1) process defined by (5.3.6), obtain the optimal predictor of y_{t+n} given y_t, y_{t-1}, . . . .
11. (Section 5.6)
Show that (5.6.7) can be written in the equivalent form Q′β = 0, where Q is an N × (N − q − 1) matrix such that [Q, J] is nonsingular and Q′J = 0. Find such a Q when N = 4 and q = 2.
6 Generalized Least Squares Theory
where y* = Σ^{-1/2}y, X* = Σ^{-1/2}X, and u* = Σ^{-1/2}u. Note that Eu* = 0 and Eu*u*′ = I, so (6.1.2) is Model 1 except that we do not assume the elements of u* are i.i.d. here. The generalized least squares (GLS) estimator of β in model (6.1.1), denoted β̂_G, is defined as the least squares estimator of β in model (6.1.2); namely,

β̂_G = (X*′X*)⁻¹X*′y*  (6.1.3)
     = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y.

Using the results of Section 1.2, we obtain that Eβ̂_G = β and

Vβ̂_G = (X′Σ⁻¹X)⁻¹.  (6.1.4)

Furthermore, GLS is the best linear unbiased estimator of Model 6. (Note that in Section 1.2.5 we did not require the independence of the error terms to prove that LS is BLUE.)
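A numerical sketch of (6.1.3): the GLS estimator computed from the explicit formula coincides with LS applied to the transformed model (6.1.2), here using the Cholesky factor of Σ in place of Σ^{1/2} (any matrix square root works; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
T, K = 60, 3
X = rng.standard_normal((T, K))
A = rng.standard_normal((T, T))
Sigma = A @ A.T + T * np.eye(T)          # an arbitrary positive definite Sigma
beta = np.array([1.0, -2.0, 0.5])
L = np.linalg.cholesky(Sigma)            # Sigma = L L'
u = L @ rng.standard_normal(T)           # errors with covariance Sigma
y = X @ beta + u

# GLS by the explicit formula (6.1.3)
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

# equivalently: LS after premultiplying the model by L^{-1}
ys = np.linalg.solve(L, y)
Xs = np.linalg.solve(L, X)
beta_star, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
print(np.abs(beta_gls - beta_star).max())
```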
THEOREM 6.1.1. Let X′X and Σ both be positive definite. Then the following statements are equivalent.
(A) (X′X)⁻¹X′ΣX(X′X)⁻¹ = (X′Σ⁻¹X)⁻¹.
(B) ΣX = XB for some nonsingular B.
(C) Σ⁻¹X = XC for some nonsingular C.
(D) X = HA for some nonsingular A, where the columns of H are K characteristic vectors of Σ.
(E) X′ΣZ = 0 for any Z such that Z′X = 0.
(F) Σ = XΓX′ + ZΘZ′ + σ²I for some Γ and Θ and Z such that Z′X = 0.

Proof. We show that statement A ⇒ statement B ⇒ statement C:
Proof. We have

tr Vβ̂ = tr [(X′X)⁻¹X′ΣX(X′X)⁻¹]
      = tr [ΣX(X′X)⁻²X′]
      ≤ λ₁(Σ) tr [X(X′X)⁻²X′]  by Lemma 6.1.1
      = λ₁(Σ) tr (X′X)⁻¹
      ≤ K λ₁(Σ)/λ_K(X′X),

where λ₁(Σ) denotes the largest characteristic root of Σ and λ_K(X′X) the smallest characteristic root of X′X. But the last term converges to 0 because of Assumptions A and B.
Note that Theorem 3.5.1 is a special case of Theorem 6.1.2. One interesting
implication of Theorems 6.1.2 and 5.2.3 is that LS is consistent if u follows a
stationary time series satisfying (5.2.36).
We have not proved the asymptotic normality of LS or GLS in this section
because the proof would require additional specific assumptions about the
generation of u.
assuming Σ is nonsingular.

For β̂_F to be a reasonably good estimator, we should at least require it to be consistent. This means that the number of free parameters that characterize Σ should be either bounded or allowed to go to infinity at a slower rate than T. Thus one must impose particular structure on Σ, specifying how it depends on a set of free parameters that are fewer than T in number. In this section we shall consider five types of models in succession. For each we shall impose a particular structure on Σ and then study the properties of LS and FGLS and other estimators of β. We shall also discuss the estimation of Σ. The five models we shall consider are (1) serial correlation, (2) seemingly unrelated regression models, (3) heteroscedasticity, (4) error components models, and (5) random coefficients models.

In each of the models mentioned in the preceding paragraph, Σ̂ is obtained from the least squares residuals û = y − Xβ̂, where β̂ is the LS estimator.
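As a concrete illustration of this program in the serial correlation case (anticipating Section 6.3), the following sketch obtains LS residuals, estimates ρ from them, and then applies LS to quasi-differenced data — a simple feasible GLS. All numerical settings are made up:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 400
x = rng.standard_normal(T)
X = np.column_stack([np.ones(T), x])
rho, beta = 0.6, np.array([1.0, 2.0])
u = np.zeros(T)
e = rng.standard_normal(T)
for t in range(1, T):                     # AR(1) errors
    u[t] = rho * u[t - 1] + e[t]
y = X @ beta + u

# step 1: LS, then estimate rho from the LS residuals
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
uh = y - X @ b_ls
rho_hat = (uh[1:] @ uh[:-1]) / (uh[:-1] @ uh[:-1])

# step 2: quasi-difference with rho_hat, then LS again (feasible GLS)
ys = y[1:] - rho_hat * y[:-1]
Xs = X[1:] - rho_hat * X[:-1]
b_fgls, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
print(rho_hat, b_fgls)
```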
(B) u_t = Σ_{j=0}^∞ φ_j ε_{t-j} with Σ_{j=0}^∞ |φ_j| < M < ∞, where {ε_t} are i.i.d. with Eε_t = 0 and Eε_t² = σ².

Then √T(β̂ − β) → N(0, c₁⁻²c₂), where c₂ = lim_{T→∞} T⁻¹x′Euu′x and c₁ is the limit of T⁻¹x′x given in Assumption A.

Proof. We need only prove T^{-1/2}x′u → N(0, c₂) because then the theorem follows from Assumption A and Theorem 3.2.7(iii) (Slutsky). We can write

T^{-1/2} Σ_{t=1}^T x_t u_t = T^{-1/2} Σ_{t=1}^T x_t Σ_{j=0}^N φ_j ε_{t-j} + T^{-1/2} Σ_{t=1}^T x_t Σ_{j=N+1}^∞ φ_j ε_{t-j}  (6.3.1)
                           ≡ A₁ + A₂,

and A₁ can be decomposed further as A₁ = A₁₁ + A₁₂ + A₁₃. But V(A₁₂) goes to 0 as T → ∞ for a fixed N. The same is true for A₁₃. The theorem follows by noting that Σ_{j=0}^∞ φ_j x_{t+j} satisfies the condition for x_t in Theorem 3.5.3.
6.3.2 Estimation of ρ

Because Σ defined in (5.2.9) depends on σ² only through a scalar multiplication, β̂_G defined in (6.1.3) does not depend on σ². Therefore, in obtaining FGLS (6.2.1), we need to estimate only ρ. The most natural estimator of ρ is

ρ̂ = (Σ_{t=2}^T û_t û_{t-1}) / (Σ_{t=2}^T û²_{t-1}),  (6.3.3)

where û_t = y_t − x_t′β̂. The consistency of ρ̂ is straightforward. We shall prove its asymptotic normality.
Using u_t = ρu_{t-1} + ε_t, we have

√T(ρ̂ − ρ) = [T^{-1/2} Σ_{t=2}^T ε_t u_{t-1} + A₁ + A₂] / [T⁻¹ Σ_{t=2}^T û²_{t-1}],  (6.3.4)

where A₁ and A₂, given in (6.3.5) and (6.3.6), collect the terms involving β̂ − β that arise from substituting û_t = u_t − x_t′(β̂ − β); the third term of (6.3.5), for example, is −T^{-1/2} Σ_{t=2}^T (β̂ − β)′x_t u_{t-1}. When X is purely nonstochastic, A₁ and A₂ converge to 0 in probability.
Orcutt, 1949). The resulting estimator is slightly different from FGLS but has the same asymptotic distribution.

If {u_t} follow the AR(2) defined by (5.2.16), the relevant transformation is given by (5.2.27), where the a's are determined by solving V(a₁u₁) = σ², V(a₂u₁ + a₃u₂) = σ², and E[a₁u₁(a₂u₁ + a₃u₂)] = 0. A generalization of the Cochrane-Orcutt transformation is obtained by eliminating the first two rows of the transformation matrix. Higher-order autoregressive processes can be similarly handled.
Inserting (6.3.9) into (6.3.8), we obtain the concentrated log likelihood function (aside from terms not depending on the unknown parameters)

log L* = −(T/2) log Q + (1/2) log (1 − ρ²).  (6.3.10)

For a given value of β, ρ may be estimated either by the true MLE, which maximizes log L* using the method of Beach and MacKinnon (1978), or by the approximate MLE, which minimizes Q, as was noted earlier in Section 5.4. Both have the same asymptotic distribution. The formula for the approximate MLE is

ρ̂ = [Σ_{t=2}^T (y_{t-1} − x′_{t-1}β)(y_t − x′_tβ)] / [Σ_{t=2}^{T-1} (y_t − x′_tβ)²].  (6.3.11)

Given ρ, the value of β that maximizes log L* is clearly

β̂ = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y,  (6.3.12)

where Σ is as given in (5.2.9). The approximate MLE of β, ρ, and σ² are obtained by solving (6.3.9), (6.3.11), and (6.3.12) simultaneously.
These equations defining the approximate MLE are highly nonlinear, so that the MLE must be obtained either by a search method or by an iteration. We have already described the common iterative procedure of going back and forth between (6.3.11) and (6.3.12), starting from the LS estimator β̂.

For the asymptotic normality of MLE in Model 6 where {u_t} follow a stationary autoregressive moving-average process, see Pierce (1971).
d = [Σ_{t=2}^T (û_t − û_{t-1})²] / [Σ_{t=1}^T û_t²],  (6.3.13)

where {û_t} are the least squares residuals. By comparing (6.3.13) with (6.3.3), we can show that

d = 2 − 2ρ̂ + o(T⁻¹).  (6.3.14)

From this we know that plim d = 2 − 2ρ, and the asymptotic distribution of d follows easily from the asymptotic distribution of ρ̂ derived in Section 6.3.2.

To derive the exact distribution of d under the null hypothesis, it is useful to rewrite (6.3.13) as

d = (u′MAMu) / (u′Mu),  (6.3.15)
where M = I − X(X′X)⁻¹X′ and A is a T × T symmetric matrix defined by

    ⎡  1  −1   0  ...   0 ⎤
    ⎢ −1   2  −1  ...   0 ⎥
A = ⎢  ⋮               ⋮ ⎥ .  (6.3.16)
    ⎢  0  ... −1   2  −1 ⎥
    ⎣  0  ...  0  −1   1 ⎦
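The equivalence of the residual-based formula (6.3.13) and the quadratic-form representation (6.3.15) is easy to verify numerically; note that My = Mu since MX = 0 (made-up data):

```python
import numpy as np

rng = np.random.default_rng(7)
T, K = 50, 2
X = rng.standard_normal((T, K))
y = X @ np.array([1.0, -1.0]) + rng.standard_normal(T)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
uh = y - X @ b
d = ((uh[1:] - uh[:-1]) ** 2).sum() / (uh ** 2).sum()   # (6.3.13)

# the quadratic-form representation (6.3.15): d = u'MAMu / u'Mu
M = np.eye(T) - X @ np.linalg.inv(X.T @ X) @ X.T
A = 2 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
A[0, 0] = A[-1, -1] = 1
d2 = (y @ M @ A @ M @ y) / (y @ M @ y)   # y may replace u because My = Mu
print(d, d2)
```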
and

Vd = [2 Σ_{i=1}^{T-K} (ν_i − Ed)²] / [(T − K)(T − K + 2)],  (6.3.20)

where the terms involving the ν_i can be evaluated using

Σ_{i=1}^{T-K} ν_i^j = tr (MA)^j.  (6.3.21)
If d_L < d < d_U, the test is inconclusive. To test H₀: ρ = 0 against H₁: ρ < 0, Durbin and Watson suggested the following procedure:

Reject H₀ if d ≥ 4 − d_L.  (6.3.26)
In this case √T(β̂_F − β̂_G) converges to a nondegenerate random variable that is a linear function of the limit distribution of √T(ρ̂ − ρ). Thus we conclude that the asymptotic distribution of FGLS differs from that of GLS and depends on the asymptotic distribution of ρ̂ (see Amemiya and Fuller, 1967, for further discussion). Amemiya and Fuller showed how to obtain an asymptotically efficient feasible estimator. Such an estimator is not as efficient as GLS.
The theory of the test of independence discussed in the preceding subsection must also be modified under the present model. If X contains lagged dependent variables y_{t-1}, y_{t-2}, . . . , we will still have Eq. (6.3.17) formally, but the transformed error terms will no longer be independent normal because H will be a random matrix correlated with u. Therefore the Durbin-Watson bounds will no longer be valid.

Even the asymptotic distribution of d under the null hypothesis of independence is different in this case from the case in which X is purely nonstochastic. The asymptotic distribution of d is determined by the asymptotic distribution of ρ̂ because of (6.3.14). When X is nonstochastic, we have √T(ρ̂ − ρ) → N(0, 1 − ρ²) by the results of Section 6.3.2. But, if X contains the lagged dependent variables, the asymptotic distribution of ρ̂ will be different. This can be seen by looking at the formula for √T(ρ̂ − ρ) in Eq. (6.3.4). The third term of the right-hand side of (6.3.5), which is −T^{-1/2} Σ_{t=2}^T (β̂ − β)′x_t u_{t-1}, does not converge to 0 in probability because x_t and u_{t-1} are correlated. Therefore the conclusion obtained there does not hold.
where û_t = y_t − α̂y_{t-1} and α̂ is the least squares estimator. Consider the limit distribution of √T ρ̂ under the assumption ρ = 0. Because the denominator times T⁻¹ converges to σ² in probability, we have asymptotically

√T ρ̂ ≅ σ⁻² T^{-1/2} Σ_{t=2}^T û_{t-1}û_t  (6.3.33)
     ≅ σ⁻² T^{-1/2} Σ_{t=2}^T [u_{t-1}u_t − (1 − α²)y_{t-1}u_t],

so that

V(√T ρ̂) ≅ σ⁻⁴T⁻¹ [E(Σ_{t=2}^T u_{t-1}u_t)² + (1 − α²)² E(Σ_{t=2}^T y_{t-1}u_t)²
          − 2(1 − α²) E(Σ_{t=2}^T u_{t-1}u_t)(Σ_{t=2}^T y_{t-1}u_t)]  (6.3.34)
        = α²(1 − T⁻¹).

Hence, assuming that the asymptotic normality holds, we have

√T ρ̂ → N(0, α²).  (6.3.35)
Durbin (1970) obtained the following more general result: Even if higher-order lagged values of y_t and purely exogenous variables are contained among the regressors, we have under the assumption ρ = 0

√T ρ̂ → N[0, 1 − AV(√T α̂₁)],  (6.3.36)

where α̂₁ is the least squares estimate of the coefficient on y_{t-1} and AV denotes the asymptotic variance. He proposed
The same result can also be obtained by using Theorem 6.1.1. Statement E is especially easy to verify for the present problem.

To define FGLS, we must first estimate Σ. A natural consistent estimator of its i,jth element is provided by σ̂_ij = T⁻¹û_i′û_j, where û_i = y_i − X_iβ̂_i are the least squares residuals from the ith equation. These estimates are clearly consistent as T goes to ∞ while N is fixed. Because of the special form of Σ⁻¹ given in the preceding paragraph, it is quite straightforward to prove that FGLS and GLS have the same asymptotic distribution (as T → ∞) under general assumptions on u and X. The limit distribution of √T(β̂_G − β) or √T(β̂_F − β) is N[0, lim_{T→∞} (T⁻¹X′Σ⁻¹X)⁻¹]. Suitable assumptions on u and X can easily be inferred from Theorem 3.5.4.
FGLS is generally unbiased provided that it possesses a mean, as proved in a
simple, elegant theorem by Kakwani (1967). The exact covariance matrix of
FGLS in simple situations has been obtained by several authors and compared
with that of GLS or LS (surveyed by Srivastava and Dwivedi, 1979). A particu-
larly interesting result is attributable to Kariya (1981), who obtained the
following inequalities concerning the covariance matrices of GLS and FGLS
in a two-equation model with normal errors:
6.5 Heteroscedasticity
A heteroscedastic regression model is Model 6 where Σ is a diagonal matrix, the diagonal elements of which assume at least two different values. Heteroscedasticity is a common occurrence in econometric applications and can
often be detected by plotting the least squares residuals against time, the
dependent variable, the regression mean, or any other linear combination of
the independent variables. For example, Prais and Houthakker (1955) found
that the variability of the residuals from a regression of food expenditure on income increases with income. In the subsequent subsections we shall consider various ways to parameterize the heteroscedasticity. We shall consider
the estimation of the heteroscedasticity parameters as well as the regression
coefficients. We shall also discuss tests for heteroscedasticity.
Inverting the partitioned matrix

⎡ X′ΣX  X′ΣW ⎤
⎣ W′ΣX  W′ΣW ⎦ ,

we obtain

β̂_G = (X′X)⁻¹X′[y − ΣW(W′ΣW)⁻¹W′y].  (6.5.3)

Of course, it is also possible to derive (6.5.3) directly from (6.1.3) without regard to the interpretation given above. An advantage of (6.5.3) over (6.1.3) is that the former does not depend on Σ⁻¹. Note that one cannot estimate T⁻¹X′Σ⁻¹X consistently unless Σ can be consistently estimated. To transform (6.5.3) into a feasible estimator, one is tempted to replace Σ by a diagonal matrix D whose tth element is (y_t − x_t′β̂)². Then it is easy to prove that under general assumptions plim T⁻¹X′DW = plim T⁻¹X′ΣW and plim T⁻¹W′DW = plim T⁻¹W′ΣW element by element. However, one difficulty remains: Because the size of these matrices increases with T, the resulting feasible estimator is not asymptotically equivalent to GLS.

We can solve this problem partially by replacing (6.5.2) with

W₁′y = W₁′u,  (6.5.4)

where W₁ consists of N columns of W, N being a fixed number. When GLS is
applied to (6.5.1) and (6.5.4), it is called the partially generalized least squares (PGLS) estimator and is given by

β̂_P = (X′X)⁻¹X′[y − ΣW₁(W₁′ΣW₁)⁻¹W₁′y].  (6.5.5)

PGLS is more efficient than LS because

Vβ̂ − Vβ̂_P = (X′X)⁻¹X′ΣW₁(W₁′ΣW₁)⁻¹W₁′ΣX(X′X)⁻¹,  (6.5.6)

which is clearly nonnegative definite, but it is less efficient than GLS. An asymptotically equivalent feasible version of β̂_P is obtained by replacing the Σ in (6.5.5) by the D defined in the preceding paragraph.
White (1980a) proposed testing the hypothesis σ_t² = σ² for all t by comparing (X′X)⁻¹X′DX(X′X)⁻¹ with σ̂²(X′X)⁻¹, where D is as defined earlier and σ̂² is the least squares estimator of σ² defined in (1.2.5). Equivalently, White considered the elements of X′DX − σ̂²X′X. If we stack the elements of the upper triangular part of this matrix, we obtain a vector of (K² + K)/2 dimension defined by S′(û² − σ̂²1), where û² is a T-vector the tth element of which is û_t², 1 is a T-vector of ones, and S is a T × (K² + K)/2 matrix, the columns of which are (x_{1i}x_{1j}, x_{2i}x_{2j}, . . . , x_{Ti}x_{Tj})′ for 1 ≤ i ≤ j ≤ K. It is easy to show that T^{-1/2}S′(û² − σ̂²1) → N(0, A), where

A = lim_{T→∞} [T⁻¹S′ΛS + T⁻²(1′Λ1)S′11′S − T⁻²S′Λ11′S − T⁻²S′11′ΛS]

and Λ = E(u² − σ²1)(u² − σ²1)′. The test statistic proposed by White is

(û² − σ̂²1)′S(TÂ)⁻¹S′(û² − σ̂²1),
where σ̂² and σ̄² are the constrained and unconstrained MLE, respectively. The statistic is asymptotically distributed as chi-square with N − 1 degrees of freedom.⁴
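A sketch of the two matrices being compared in White's procedure (design and sample size are made up): under homoscedastic errors, (X′X)⁻¹X′DX(X′X)⁻¹ and σ̂²(X′X)⁻¹ nearly coincide.

```python
import numpy as np

rng = np.random.default_rng(8)
T, K = 2000, 2
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
beta = np.array([1.0, 1.0])

def white_matrices(y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    uh = y - X @ b
    XtXi = np.linalg.inv(X.T @ X)
    V_white = XtXi @ (X.T * uh**2) @ X @ XtXi    # (X'X)^{-1} X'DX (X'X)^{-1}
    V_homo = (uh @ uh) / (T - K) * XtXi          # sigma_hat^2 (X'X)^{-1}
    return V_white, V_homo

# homoscedastic errors: the two matrices should nearly coincide
Vw, Vh = white_matrices(X @ beta + rng.standard_normal(T))
rel_diff = np.abs(Vw - Vh).max() / np.abs(Vh).max()
print(rel_diff)
```

Under heteroscedasticity related to the regressors, the same comparison would show a systematic gap, which is what the stacked moment vector S′(û² − σ̂²1) measures.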
Taylor (1978) has considered a special case in which N = 2 in a model with
normal errors and has derived the formulae for the moments of FGLS. By
evaluating the covariance matrix of FGLS at various parameter values, Taylor
has shown that FGLS is usually far more efficient than LS and is only slightly
less efficient than GLS.
We shall sketch briefly the derivation of the moments of FGLS. Let C be a K × K matrix such that

C′σ₁⁻²X₁′X₁C = I  (6.5.10)

and

C′σ₂⁻²X₂′X₂C = Λ,  (6.5.11)

where Λ is a diagonal matrix, the elements λ₁, λ₂, . . . , λ_K of which are the roots of the equation

|σ₂⁻²X₂′X₂ − λσ₁⁻²X₁′X₁| = 0.  (6.5.12)

The existence of such a matrix is guaranteed by Theorem 16 of Appendix 1. With S = C(I + Λ)^{-1/2}, transform the original equation y = Xβ + u to

y = X*γ + u,  (6.5.13)

where X* = XS and γ = S⁻¹β. The FGLS estimator of γ, denoted γ̂, is given by

γ̂ = (σ̂₁⁻²X₁*′X₁* + σ̂₂⁻²X₂*′X₂*)⁻¹(σ̂₁⁻²X₁*′y₁ + σ̂₂⁻²X₂*′y₂),  (6.5.14)

where σ̂₁² = (T₁ − K)⁻¹(y₁ − X₁β̂₁)′(y₁ − X₁β̂₁), and σ̂₂² is defined analogously.⁵

Using σ₁⁻²X₁*′X₁* = (I + Λ)⁻¹ and σ₂⁻²X₂*′X₂* = Λ(I + Λ)⁻¹, we obtain

γ̂ − γ = (I + Λ)[σ₁⁻²DX₁*′u₁ + σ₂⁻²Λ⁻¹(I − D)X₂*′u₂],  (6.5.15)

where D is a diagonal matrix the ith diagonal element of which is equal to σ̂₁⁻²σ₁²/(σ̂₁⁻²σ₁² + λ_iσ̂₂⁻²σ₂²). Finally, the moments of γ̂ can be obtained by making use of the independence of D with X₁*′u₁ and X₂*′u₂ and of known formulae for the moments of D that involve a hypergeometric function.
Kariya (1981) has derived the following inequalities concerning the covariance matrices of GLS and FGLS in a two-equation model:

Vβ̂_G ≤ Vβ̂_F ≤ [1 + 2/(T₁ − K − 2) + 2/(T₂ − K − 2)] Vβ̂_G.  (6.5.16)
where σ_t² = g_t(α, β), in which α is a vector of variance parameters unrelated to β. In applications it is often assumed that g_t(α, β) = g(α, x_t′β). The estimation of α and β can be done in several steps. In the first step, we obtain the LS estimator of β, denoted β̂. In the second step, α and β can be estimated by minimizing Σ_{t=1}^T [û_t² − g_t(α, β)]², where û_t = y_t − x_t′β̂. The consistency and the asymptotic normality of the resulting estimators, denoted α̂ and β̄, have been proved by Jobson and Fuller (1980). In the third step we have two main options: FGLS using g_t(α̂, β̄), or MLE under normality using α̂ and β̄ as the initial estimates in some iterative algorithm.⁶
Carroll and Ruppert (1982a) proved that under general assumptions FGLS has the same asymptotic distribution as GLS. Jobson and Fuller derived simple formulae for the method of scoring and proved that the estimator of β obtained at the second iteration is asymptotically efficient (and is asymptotically more efficient than GLS or FGLS). Carroll and Ruppert (1982b) have pointed out, however, that GLS or FGLS is more robust against a misspecification in the g function. Carroll and Ruppert (1982a) have proposed a robust version of FGLS (cf. Section 2.3) in an attempt to make it robust also against nonnormality.
There are situations where FGLS has the same asymptotic distribution as MLE. Amemiya (1973b), whose major goal lay in another area, compared the asymptotic efficiency of FGLS vis-à-vis MLE in cases where y_t has mean x_t′β and variance proportional to (x_t′β)² and follows one of three distributions: (1) normal, (2) lognormal, and (3) gamma. This is the form of heteroscedasticity suggested by Prais and Houthakker (1955). It was shown that FGLS is asymptotically efficient if y_t has a gamma distribution. Thus this is an example of the BAN estimators mentioned in Section 4.2.4.
This last result is a special case of the following more general result attributable to Nelder and Wedderburn (1972). Let {y_t} be independently distributed with the density

exp {a(λ)[g(θ_t)y_t − h(θ_t)] + k(λ, y_t)},  (6.5.17)

where θ_t = q(x_t, β) is a scalar and λ, x_t, and β are vectors. It is assumed that h′(θ_t) = θ_t g′(θ_t) for every t, which implies that Ey_t = θ_t and Vy_t = [a(λ)g′(θ_t)]⁻¹. Then, in the estimation of β, the method of scoring is identical to the Gauss-Newton nonlinear weighted least squares iteration (cf. Section 4.4.3). The binomial distribution is a special case of (6.5.17). To see this, define the binary variable y_t that takes unity with probability θ_t and zero with probability 1 − θ_t, and put a(λ) = 1, g(θ_t) = log θ_t − log (1 − θ_t), h(θ_t) = −log (1 − θ_t), and k(λ, y_t) = 0. The normal distribution is also a special case: Take y_t ~ N(θ_t, λ²), a(λ) = λ⁻², g(θ_t) = θ_t, h(θ_t) = θ_t²/2, and k(λ, y_t) = −2⁻¹[λ⁻²y_t² + log (2πλ²)]. The special case of Amemiya (1973b) is obtained by putting a(λ) = −λ, g(θ_t) = θ_t⁻¹, h(θ_t) = −log θ_t, and k(λ, y_t) = −log Γ(λ) + λ log λ + (λ − 1) log y_t.
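The binomial special case can be verified mechanically: with the choices of a, g, h, and k given above, the density (6.5.17) reproduces θ^y(1 − θ)^{1−y} exactly. A sketch:

```python
import math

# Bernoulli density as a special case of (6.5.17):
# a(lam) = 1, g(th) = log(th/(1-th)), h(th) = -log(1-th), k = 0
def family_density(y, th):
    g = math.log(th / (1 - th))
    h = -math.log(1 - th)
    return math.exp(1.0 * (g * y - h) + 0.0)

for th in (0.2, 0.7):
    for y in (0, 1):
        direct = th**y * (1 - th)**(1 - y)
        assert abs(family_density(y, th) - direct) < 1e-12
print("ok")
```

One can also check the moment conditions: here g′(θ) = 1/[θ(1 − θ)] and h′(θ) = 1/(1 − θ) = θg′(θ), so Ey = θ and Vy = θ(1 − θ) = [a(λ)g′(θ)]⁻¹, as the general result requires.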
There is a series of articles that discuss tests of homoscedasticity against heteroscedasticity of the form σ_t² = g(α, x_t′β), where α is a scalar such that g(0, x_t′β) does not depend on t. Thus the null hypothesis is stated as α = 0.
Anscombe (1961) was the first to propose such a test, which was later modified
by Bickel (1978). Hammerstrom (1981) proved that Bickel's test is locally
uniformly most powerful under normality.
consider the estimation of a and the test of the hypothesis al = a 2 under this
model.
Hildreth and Houck (1968) and Goldfeld and Quandt (1972) were among
the first to study the estimation of a. We shall call these estimators HH and
GQ for short. We shall follow the discussion of Amemiya (1977b), who pro-
posed an estimator of a asymptotically more efficient under normality than
the HH and GQ estimators.
Hildreth and Houck presented their model as a random coefficients model (cf. Section 6.7) defined by

y_t = x_t′(β + ξ_t),  (6.5.18)

where {ξ_t} are K-dimensional i.i.d. vectors with Eξ_t = 0 and Eξ_tξ_t′ = D(α). The last matrix denotes a K × K diagonal matrix, the kth diagonal element of which is α_k. Thus the Hildreth and Houck model is shown to be a special case of the model where the variance is a linear function of regressors by putting z_t = (x²_{1t}, x²_{2t}, . . . , x²_{Kt})′.
We shall compare the HH, GQ, and Amemiya estimators under the assumption of the normality of u. All three estimators are derived from a regression model in which û_t² serves as the dependent variable. Noting û_t = u_t − x_t′(X′X)⁻¹X′u, we can write

û_t² = z_t′α + v_{1t} − 2v_{2t} + v_{3t},  (6.5.19)

where v_{1t} = u_t² − σ_t², v_{2t} = u_t x_t′(X′X)⁻¹X′u, and v_{3t} = [x_t′(X′X)⁻¹X′u]². We can write (6.5.19) in vector notation as

û² = Zα + v₁ − 2v₂ + v₃.  (6.5.20)

We assume that X fulfills the assumptions of Theorem 3.5.4 and that lim_{T→∞} T⁻¹Z′Z is a nonsingular finite matrix. (Amemiya, 1977b, has presented more specific assumptions.)

Equation (6.5.20) is not strictly a regression equation because Ev₂ ≠ 0 and Ev₃ ≠ 0. However, they can be ignored asymptotically because they are O(T^{-1/2}) and O(T⁻¹), respectively. Therefore the asymptotic properties of LS, GLS, and FGLS derived from (6.5.20) can be analyzed as if v₁ were the only error term. (Of course, this statement must be rigorously proved, as has been done by Amemiya, 1977b.)
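A sketch of this squared-residual regression (the design, with variance linear in z_t = (1, x_t²)′, is made up): LS applied to (6.5.20) recovers the variance parameters α.

```python
import numpy as np

rng = np.random.default_rng(9)
T = 3000
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
Z = np.column_stack([np.ones(T), X[:, 1] ** 2])   # variance linear in z_t
alpha = np.array([0.5, 1.0])
sig2 = Z @ alpha                                  # sigma_t^2 = z_t' alpha > 0
y = X @ np.array([1.0, 1.0]) + np.sqrt(sig2) * rng.standard_normal(T)

b = np.linalg.lstsq(X, y, rcond=None)[0]
uh2 = (y - X @ b) ** 2
# LS of squared residuals on z_t: the dependent-variable regression (6.5.20)
alpha_hat = np.linalg.lstsq(Z, uh2, rcond=None)[0]
print(alpha_hat)
```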
The GQ estimator, denoted α̂₁, is LS applied to (6.5.20). Thus

α̂₁ = (Z′Z)⁻¹Z′û².  (6.5.21)

It can be shown that √T(α̂₁ − α) has the same limit distribution as
Pagan suggested estimating P(Rao > c) by simulation. They pointed out that since Rao's score test depends only on û, the simulation is relatively simple.
effects estimator (since the μ's and λ's are treated as fixed effects rather than as random variables) or the dummy-variable regression. Still another name for the estimator is the covariance estimator.

To compare (6.6.9) with the corresponding GLS estimator, we need to derive the corresponding subset of GLS, β̂_{1G}. We have

β̂_{1G} = [X₁′Σ⁻¹X₁ − X₁′Σ⁻¹1(1′Σ⁻¹1)⁻¹1′Σ⁻¹X₁]⁻¹  (6.6.10)
         × [X₁′Σ⁻¹y − X₁′Σ⁻¹1(1′Σ⁻¹1)⁻¹1′Σ⁻¹y]
The estimators of σ_ε², σ_μ², and σ_λ² are given in (6.6.15), (6.6.16), and (6.6.17), respectively; each is computed from the residuals û = y − Xβ̂_Q, where β̂_Q is the transformation estimator, with respective denominators (N − 1)(T − 1), T(N − 1)(T − 1), and N(N − 1)(T − 1). Amemiya also proved that they are asymptotically more efficient than the estimates obtained by using y − Xβ̂ for û, where β̂ is the LS estimator.
These estimates of the variances, or any other estimates with the respective probabilistic orders of (NT)^{-1/2}, N^{-1/2}, and T^{-1/2}, can be inserted into the right-hand side of (6.6.6) for calculating FGLS. Fuller and Battese (1974) proved that under general conditions FGLS and GLS have the same asymptotic distribution.
Σ⁻¹ = σ_ε⁻²[I_{NT} − γ(I_N ⊗ 1_T1_T′)],  where γ = σ_μ²/(σ_ε² + Tσ_μ²),  (6.6.21)
to (6.6.24). (Note that since the premultiplication by F′ eliminates the vector of ones, β₀ cannot be estimated from this equation.) GLS β̂_G can be interpreted as GLS applied to (6.6.23) and (6.6.24) simultaneously. Because these equations constitute the heteroscedastic regression model analyzed in Section 6.5.2, GLS has the following simple form:

β̂_G = (X′PX + cX′MX)⁻¹(X′Py + cX′My),  (6.6.25)

where c = (σ_ε² + Tσ_μ²)/σ_ε², P = T⁻¹LL′, and M = I − P. To define FGLS, c may be estimated as follows: Estimate σ_ε² + Tσ_μ² by the LS estimator of the variance obtained from regression (6.6.23), estimate σ_ε² by the LS estimator of the variance obtained from regression (6.6.24), and then take the ratio.
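A numerical sketch of (6.6.25) for the two-component model (N, T, and the variance components are arbitrary): the simple form built from P and M coincides with textbook GLS (6.1.3) computed from the full Σ.

```python
import numpy as np

rng = np.random.default_rng(10)
N, T = 40, 6
sig_e2, sig_mu2 = 1.0, 0.5
X = rng.standard_normal((N * T, 2))
mu = np.repeat(np.sqrt(sig_mu2) * rng.standard_normal(N), T)  # individual effects
y = X @ np.array([1.0, -1.0]) + mu + np.sqrt(sig_e2) * rng.standard_normal(N * T)

# P averages within each individual; M = I - P
L = np.kron(np.eye(N), np.ones((T, 1)))
P = L @ L.T / T
M = np.eye(N * T) - P
c = (sig_e2 + T * sig_mu2) / sig_e2

# (6.6.25): beta_G = (X'PX + cX'MX)^{-1} (X'Py + cX'My)
bG = np.linalg.solve(X.T @ P @ X + c * X.T @ M @ X,
                     X.T @ P @ y + c * X.T @ M @ y)

# check against GLS (6.1.3) with the full error covariance matrix
Sigma = sig_e2 * np.eye(N * T) + sig_mu2 * np.kron(np.eye(N), np.ones((T, T)))
Si = np.linalg.inv(Sigma)
bGLS = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
print(np.abs(bG - bGLS).max())
```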
As we noted earlier, (6.6.23) and (6.6.24) constitute the heteroscedastic regression model analyzed in Section 6.5.2. Therefore the finite-sample study of Taylor (1978) applies to this model; moreover, Taylor (1980) dealt with this model specifically.
Next, following Balestra and Nerlove (1966), we derive the MLE of the model assuming the normality of u. For this purpose it is convenient to adopt the following reparameterization used by Balestra and Nerlove: Define σ² = σ_μ² + σ_ε², ρ = σ_μ²/σ², and R = (1 − ρ)I_T + ρ1_T1_T′. Then we have

Σ = σ²(I_N ⊗ R),
Σ⁻¹ = σ⁻²(I_N ⊗ R⁻¹).

Inserting (6.6.27) into (6.6.26) yields the concentrated log likelihood function

log L* = −(NT/2) log [(y − Xβ)′(I_N ⊗ R⁻¹)(y − Xβ)] − (NT/2) log (1 − ρ) − (N/2) log [1 + Tρ/(1 − ρ)].  (6.6.28)
Setting ∂ log L*/∂ρ = 0 yields

ρ̂ = [Σ_{i=1}^N (Σ_{t=1}^T û_it)² − Σ_{i=1}^N Σ_{t=1}^T û²_it] / [(T − 1) Σ_{i=1}^N Σ_{t=1}^T û²_it],  (6.6.30)

where û_it = y_it − x_it′β. Also, using (6.6.29), we can simplify (6.6.27) as

σ̂² = (NT)⁻¹ Σ_{i=1}^N Σ_{t=1}^T û²_it.  (6.6.31)

The MLE of β, ρ, and σ² can be obtained by simultaneously solving the formula for GLS, (6.6.30), and (6.6.31).
Both Balestra and Nerlove (1966) and Maddala (1971) pointed out that the right-hand side of (6.6.30) can be negative. To ensure a positive estimate of ρ, Balestra and Nerlove suggested the following alternative formula:

ρ̂ = [Σ_{i=1}^N (Σ_{t=1}^T û_it)²] / [T Σ_{i=1}^N Σ_{t=1}^T û²_it].  (6.6.32)

It is easy to show that the right-hand side of (6.6.32) is always positive. Maddala (1971) showed that the ρ̂ given in (6.6.30) is less than 1. Berzeg (1979) showed that if we allow for a nonzero covariance σ_{με} between μ_i and ε_it, the formulae for the MLE are the same as those given in (6.6.30) and (6.6.31) with σ² = σ_μ² + 2σ_{με} + σ_ε² and ρ = (σ_μ² + 2σ_{με})/σ² redefined, and in this model the MLE of ρ lies between 0 and 1.
One possible way to calculate FGLS consists of the following steps: (1) Obtain the transformation estimator β̂_Q; (2) Define û_Q = y − Xβ̂_Q; (3) Insert û_Q into the right-hand side of Eq. (6.6.32); (4) Use the resulting estimator of ρ to construct FGLS. In the third step the numerator of (6.6.32) divided by NT² can be interpreted as the sample variance of the LS estimates of the μ_i + β₀ obtained from the regression y = X₁β₁ + L(μ + β₀1) + ε.
The LS estimate of μ_i + β₀ from this regression is

T⁻¹ Σ_{t=1}^T (y_it − x_it′β₁) = μ_i + β₀ + T⁻¹ Σ_{t=1}^T ε_it,  (6.6.35)

and therefore the probability limit (as N → ∞) of the sample variance of these N estimates is

σ_μ² + T⁻¹σ_ε².  (6.6.36)
defined by
y_it = x_it′β̄ + x_it′(μ_i + λ_t) + ε_it,  (6.7.1)

i = 1, 2, . . . , N and t = 1, 2, . . . , T. Note that we have separated out the nonstochastic part β̄ and the random part μ_i + λ_t of the regression coefficients. Using the symbols defined in Section 6.6.1 and two additional symbols, we can write (6.7.1) in vector notation as

y = Xβ̄ + X̄μ + X*λ + ε,  (6.7.2)

where we have defined X̄ = diag (X₁, X₂, . . . , X_N) and X* = (X₁*′, X₂*′, . . . , X_N*′)′, where X_i* = diag (x_i1′, x_i2′, . . . , x_iT′). It is assumed that μ, λ, and ε have zero means, are uncorrelated with each other, and have covariance matrices given by Eμμ′ = I_N ⊗ Σ_μ, Eλλ′ = I_T ⊗ Σ_λ, and Eεε′ = Σ_ε, where Σ_μ, Σ_λ, and Σ_ε are all nonsingular.
Kelejian and Stephan were concerned only with the probabilistic order of the GLS estimator of β̄, an important and interesting topic previously overlooked in the literature. For this purpose we can assume that Σ_μ, Σ_λ, and Σ_ε are known.

Equation (6.7.7) is identical with Eq. (11) of Kelejian and Stephan (1983, p. 252).
Now we can determine the order of V⁻¹. If we write the i,jth block submatrix of [(I_N ⊗ Σ_μ⁻¹) + Ā]⁻¹, i, j = 1, 2, . . . , N, as G^{ij}, the second term of the right-hand side of (6.7.7) can be written as Σ_μ⁻¹(Σ_{i=1}^N Σ_{j=1}^N G^{ij})Σ_μ⁻¹. Therefore the order of this term is N²/T. Therefore, if T goes to ∞ at a rate equal to or faster than N, the order of V⁻¹ is N. But, because our model is symmetric in i and t, we can conclude that if N goes to ∞ at a rate equal to or faster than T, the order of V⁻¹ is T. Combining the two, we can state that the order of V⁻¹ is min (N, T), or that the probabilistic order of β̂_G is max (N^{-1/2}, T^{-1/2}).
Hsiao (1975) proposed the following method of estimating Σ_μ, Σ_λ, and σ²:
For simplicity assume that X does not contain a constant term. A simple
modification of the subsequent discussion necessary for the case in which X
contains the constant term is given in the appendix of Hsiao (1975). Consider
the time series equation for the ith individual:
y_i = X_i(β̄ + μ_i) + X*_iλ + ε_i. (6.7.8)
are obtained simultaneously. Hsiao applied to his model the method of scor-
ing that Anderson (1969) derived for a very general random coefficients
model, where the covariance matrix of the error term can be expressed as a
linear combination of known matrices with unknown weights. (Note that
Anderson's model is so general that it encompasses all the models considered
in this chapter.) Hsiao essentially proved that the three estimators have the
same asymptotic distribution, although his proof is somewhat marred by his
assumption that these estimators are of the probabilistic order of (NT)^{-1/2}.
Exercises
1. (Section 6.1.2)
Consider a classical regression model

y = αx + βz + u,

where α and β are scalar unknown parameters; x and z are T-component
vectors of known constants such that x′l = z′l = 0, where l is a T-compo-
nent vector of ones; and u is a T-component vector of unobservable i.i.d.
random variables with zero mean and unit variance. Suppose we are given
an estimator β̂ such that Eβ̂ = β, Vβ̂ = T⁻¹, and Euβ̂ = T^{-1/2}ρl, where ρ is
a known constant with 0 ≤ |ρ| < 1. Write down the expression for the best
estimator of a you can think of. Justify your choice.
2. (Section 6.1.2)
In the model of Exercise 17 of Chapter 3, assume that the exact distribu-
tion of β̂ is known to be N(β, T⁻¹).
a. Obtain the mean squared error of α̂.
b. Find an estimator of α whose mean squared error is smaller than
that of α̂.
3. (Section 6.1.3)
Prove that statement D ⇒ statement C in Theorem 6.1.1.
4. (Section 6.1.3)
If K = 1 in Model 6, the efficiency of LS relative to GLS can be defined by

Eff = (x′x)² / [(x′Σ⁻¹x)(x′Σx)].

Watson (1955) showed Eff ≥ 4λ₁λ_T/(λ₁ + λ_T)², where λ₁ and λ_T are the larg-
est and smallest characteristic roots of Σ, respectively. Evaluate this lower
bound for the case where Σ is given by (5.2.9), using the approximation of
the characteristic roots by the spectral density (cf. Section 5.1.3).
5. (Section 6.1.3)
In Model 6 assume K = 1 and x = l, a vector of ones. Also assume Σ is
equal to that given in (5.2.9). Calculate the limit of the efficiency of LS as
T → ∞. (Efficiency is defined in Exercise 4.)
6. (Section 6.1.3)
Prove (6.1.6) directly, without using the fact that GLS is BLUE.
7. (Section 6.1.5)
Consider a regression model
y = Xβ + u,

where Eu = 0 and Euu′ = P ≡ Z(Z′Z)⁻¹Z′. We assume that X and Z are
T × K and T × G matrices of constants, respectively, such that
rank(X) = K, rank(Z) = G < T, and PX = X. Find a linear unbiased esti-
mator of β the variance-covariance matrix of which is smaller than or
equal to that of any other linear unbiased estimator. Is such an estimator
unique?
224 Advanced Econometrics
8. (Section 6.2)
Suppose y ~ N(Xβ, Σ), where there is no restriction on Σ except that it is
positive definite. Can you obtain the MLE of Σ by setting the derivative of
the log likelihood function with respect to Σ equal to 0?
9. (Section 6.3.2)
Show that A₁ and A₂ given in (6.3.5) and (6.3.6) converge to 0 in probabil-
ity under the assumption of the text.
10. (Section 6.3.2)
Combining y_t = x_t′β + u_t and u_t = ρu_{t−1} + ε_t, we can write

y_t = ρy_{t−1} + (x_t − ρx_{t−1})′β + ε_t,

where (ε_t) are i.i.d. with Eε_t = 0 and Vε_t = σ², (u_t) are stationary, and β, ρ, and σ² are
unknown parameters. Given a sample (y_t, x_t), t = 1, 2, . . . , T, and
given x_{T+1}, what do you think is the best predictor of y_{T+1}?
13. (Section 6.3.3)
Let y_t = βt + u_t, where (u_t) follow AR(1), u_t = ρu_{t−1} + ε_t. Define the
following two predictors of y_{T+1}: ŷ_{T+1} = (T + 1)β̂ and ỹ_{T+1} = (T + 1)β̃,
where β̂ and β̃ are the LS and GLS estimators of β based on y₁,
In this chapter we shall give only the basic facts concerning the estimation of
the parameters in linear simultaneous equations. A major purpose of the
chapter is to provide a basis for the discussion of nonlinear simultaneous
equations to be given in the next chapter. Another purpose is to provide a
rigorous derivation of the asymptotic properties of several commonly used
estimators. For more detailed discussion of linear simultaneous equations, the
reader is referred to textbooks by Christ (1966) and Malinvaud (1980).
ASSUMPTION 7.1.2. Rank of X is K, and lim T⁻¹X′X exists and is nonsin-
gular.
ASSUMPTION 7.1.3. Γ is nonsingular.
Solving (7.1.1) for Y, we obtain
Y=XII+V, (7.1.2)
where
Π = BΓ⁻¹. (7.1.3)
y_i = Y_iγ_i + X_iβ_i + u_i (7.1.4)
≡ Z_iα_i + u_i.

y = Zα + u, (7.1.5)

where y ≡ (y₁′, y₂′, . . . , y_N′)′,
α ≡ (α₁′, α₂′, . . . , α_N′)′,
u ≡ (u₁′, u₂′, . . . , u_N′)′,
and Z ≡ diag(Z₁, Z₂, . . . , Z_N).
and

π_i0 − Π_i0γ_i = 0. (7.1.7)

Here, (π_i1′, π_i0′)′ is the ith column of Π, and (Π_i1′, Π_i0′)′ is the subset of the
columns of Π that are postmultiplied by γ_i. The second subscript 0 or 1
indicates the rows corresponding to the zero or nonzero elements of the ith
column of B. Note that Π_i0 is a K_i0 × N_i matrix, where K_i0 = K − K_i. From
(7.1.7) it is clear that γ_i is uniquely determined if and only if

rank(Π_i0) = N_i. (7.1.8)

This is called the rank condition of identifiability. It is clear from (7.1.6) that
once γ_i is uniquely determined, β_i is uniquely determined. For (7.1.8) to hold,
it is necessary to assume

K_i0 ≥ N_i, (7.1.9)

which means that the number of excluded exogenous variables is greater than
or equal to the number of included endogenous variables. The condition
(7.1.9) is called the order condition of identifiability.³
Linear Simultaneous Equations Models 231
If β₁ ≠ 0 and β₂ = 0 in the demand and supply model given in the beginning
of this section, γ₂ is identified but γ₁ is not. This fact is illustrated in Figure 7.1,
where the equilibrium values of the quantity and the price will be scattered
along the supply curve as the demand curve shifts with the values of the
independent variables. Under the same assumption on the γ's, we have

Π₁ = β₁/(1 − γ₁γ₂) and Π₂ = γ₂β₁/(1 − γ₁γ₂), (7.1.10)

where Π₁ and Π₂ are the coefficients on x₁ in the reduced form equations for p
and q, respectively. From (7.1.10) it is clear that if β₁ consists of a single
element, γ₂ is exactly identified, whereas if β₁ is a vector of more than one
element, γ₂ is overidentified.
Inserting Y = XB₀Γ₀⁻¹ + V, where B₀ and Γ₀ denote the true values, into
(7.2.4) and taking the probability limit, we obtain

plim S_T = Σ₀ + (B₀Γ₀⁻¹ − BΓ⁻¹)′A(B₀Γ₀⁻¹ − BΓ⁻¹), (7.2.5)
Setting the derivative of the log likelihood with respect to α equal to zero
yields the stacked system of equations (7.2.8), in which X_i′(y_i − Z_iα_i) and
Y_i′(y_i − Z_iα_i) appear weighted by the elements of Σ⁻¹,
where V1 consistsof the nonstochastic part of those vectors of Y that appear in
the right-hand side of the ith structural equation.
Defining Ẑ_i = (Ỹ_i, X_i) and Ẑ = diag(Ẑ₁, Ẑ₂, . . . , Ẑ_N), we can combine
(7.2.8) and (7.2.11) into a single equation

α = [Ẑ′(Σ⁻¹ ⊗ I)Z]⁻¹Ẑ′(Σ⁻¹ ⊗ I)y. (7.2.12)
The FIML estimator of α, denoted α̂, is a solution of (7.2.12), where Σ is
replaced by the right-hand side of (7.2.2). Because both Ẑ and Σ in the right-
hand side of (7.2.12) depend on α, (7.2.12) defines α̂ only implicitly. Nevertheless,
this representation is useful for deriving the asymptotic distribution of FIML
as well as for comparing it with the 3SLS estimator (see Section 7.4).
Equation (7.2.12), with Σ replaced by the right-hand side of (7.2.2), can be
used as an iterative algorithm for calculating α̂: Evaluate Ẑ and Σ in the
right-hand side of (7.2.12) using an initial estimate of α, thus obtaining a new
estimate of α by (7.2.12). Insert the new estimate into the right-hand side, and
so on. However, Chow (1968) found a similar algorithm inferior to the more
standard Newton-Raphson iteration (cf. Section 4.4.1).
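The fixed-point use of (7.2.12) can be sketched for a hypothetical just-identified two-equation system, y₁ = γy₂ + u₁ and y₂ = βx + u₂. Everything here is an illustrative assumption of ours: Ẑ replaces y₂ by its nonstochastic part βx, Σ is re-estimated from the current residuals at each pass, and the starting values are ordinary least squares.

```python
import numpy as np

def fiml_iterate(y1, y2, x, n_iter=20):
    """Iterate Eq. (7.2.12) for the toy system
        y1 = gamma*y2 + u1,   y2 = beta*x + u2.
    Z = diag(y2, x); Zhat replaces y2 by its nonstochastic part beta*x."""
    T = len(x)
    y = np.concatenate([y1, y2])
    beta = (x @ y2) / (x @ x)              # OLS start for beta
    gamma = (y2 @ y1) / (y2 @ y2)          # OLS start (inconsistent; corrected below)
    for _ in range(n_iter):
        U = np.column_stack([y1 - gamma * y2, y2 - beta * x])
        Sigma = U.T @ U / T                # residual moment matrix, Eq. (7.2.2)
        W = np.kron(np.linalg.inv(Sigma), np.eye(T))
        Z = np.block([[y2[:, None], np.zeros((T, 1))],
                      [np.zeros((T, 1)), x[:, None]]])
        Zhat = np.block([[beta * x[:, None], np.zeros((T, 1))],
                         [np.zeros((T, 1)), x[:, None]]])
        gamma, beta = np.linalg.solve(Zhat.T @ W @ Z, Zhat.T @ W @ y)
    return gamma, beta
```

Because this toy system is just identified, the iteration reaches its fixed point (indirect least squares, which coincides with FIML here) in a single pass; in overidentified systems the convergence behavior is exactly the issue Chow (1968) examined.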
The asymptotic normality of α̂ follows easily from (7.2.12). Let Z̃ and Σ̃ be
Ẑ and Σ evaluated at α̂, respectively. Then we have from (7.2.12)

√T(α̂ − α) = [T⁻¹Z̃′(Σ̃⁻¹ ⊗ I)Z]⁻¹T^{-1/2}Z̃′(Σ̃⁻¹ ⊗ I)u. (7.2.13)

Because α̂ and Σ̃ are consistent estimators, we can prove, by a straightforward
application of Theorems 3.2.7, 3.5.4, and 3.5.5, that under Assumptions
7.1.1, 7.1.2, and 7.1.3

√T(α̂ − α) ≅_LD N(0, [lim T⁻¹Ẑ′(Σ⁻¹ ⊗ I)Ẑ]⁻¹), (7.2.14)
where ≅_LD means that both sides of it have the same limit distribution. But, using
Theorem 3.5.4, we can show
T^{-1/2}Ẑ₁′u₁ → N(0, σ₁²A). (7.3.8)
Thus, from Eqs. (7.3.5) through (7.3.8) and by using Theorem 3.2.7 again,
we conclude

√T(α̂ − α) → N(0, σ₁²A⁻¹). (7.3.9)
To prove that the LIML estimator of α has the same asymptotic distribution
as 2SLS, we shall first prove

plim √T(λ − 1) = 0. (7.3.10)

We note

λ = min over δ of (δ′W₁δ)/(δ′Wδ), (7.3.11)

which follows from the identity

(δ′W₁δ)/(δ′Wδ) = (η′W^{-1/2}W₁W^{-1/2}η)/(η′η),

where η = W^{1/2}δ, and from Theorems 5 and 10 of Appendix 1. Because

λ ≤ [(1, −γ′)W₁(1, −γ′)′] / [(1, −γ′)W(1, −γ′)′],

we have

λ − 1 ≤ u₁′[X(X′X)⁻¹X′ − X₁(X₁′X₁)⁻¹X₁′]u₁ / (u₁′Mu₁). (7.3.12)

Therefore the desired result (7.3.10) follows from noting, for example,
plim T^{-1/2}u₁′X(X′X)⁻¹X′u₁ = 0. From (7.3.3) we have
However, we must treat these results with caution. First, both Anderson's
and Fuller's results were obtained under the assumption of normality,
whereas the asymptotic distribution can be obtained without normality. Sec-
ond, Anderson's results were obtained only for a simple model; and, more-
over, it is difficult to verify in practice exactly how large the degrees of simulta-
neity and overidentifiability should be for LIML to dominate 2SLS. Third, we
should also compare the performance of estimators under misspecified
models. Monte Carlo studies indicate that the simpler the estimator the more
robust it tends to be against misspecification. (See Taylor, 1983, for a critical
appraisal of the finite sample results.)
where P_S = S(S′S)⁻¹S′. The matrix S should have T rows, but the number of
columns need not be the same as that of Z₁. In addition, we assume that S
satisfies the following three conditions:
(i) plim T⁻¹S′S exists and is nonsingular,
(ii) plim T⁻¹S′u₁ = 0, and
(iii) plim T⁻¹S′V₁ = 0.
Under these assumptions we obtain, in a manner analogous to the derivation
of (7.3.9),
√T(α̂ − α) → N(0, σ₁²C⁻¹), (7.3.16)
where (7.3.19) holds with J₁ = (X′X)⁻¹X′X₁. From (7.3.19) we obtain
tion model can be written in this way by assuming that certain elements of V
are identically equal to 0, the reader may regard these two equations in a more
abstract sense and not necessarily as arising from the limited information
model considered in Section 7.3.1.
Premultiplying (7.3.21) and (7.3.22) by Ψ^{-1/2}, we obtain

Ψ^{-1/2}y = Ψ^{-1/2}Zα + Ψ^{-1/2}u (7.3.23)

and

Ψ^{-1/2}Z = Ψ^{-1/2}XΠ + Ψ^{-1/2}V. (7.3.24)
We define the G2SLS estimator of α as the 2SLS estimator of α applied to
(7.3.23) and (7.3.24); that is,

α̂_G2S = [Z′Ψ⁻¹X(X′Ψ⁻¹X)⁻¹X′Ψ⁻¹Z]⁻¹Z′Ψ⁻¹X(X′Ψ⁻¹X)⁻¹X′Ψ⁻¹y.
(7.3.25)
Given appropriate assumptions on Ψ^{-1/2}X, Ψ^{-1/2}u, and Ψ^{-1/2}V, we can
show

√T(α̂_G2S − α) → N[0, (lim T⁻¹Π′X′Ψ⁻¹XΠ)⁻¹]. (7.3.26)
As in Section 7.3.6, we can show that G2SLS is asymptotically the best instru-
mental variables estimator in the model defined by (7.3.21) and (7.3.22). The
limit distribution is unchanged if a regular consistent estimator of Ψ is substi-
tuted.
The idea of G2SLS is attributable to Theil (1961), who defined it in another
asymptotically equivalent way: (Z′PΨ⁻¹PZ)⁻¹Z′PΨ⁻¹y, where P = X(X′X)⁻¹X′. It has the same
asymptotic distribution as α̂_G2S.
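The estimator (7.3.25) is mechanical to compute once Ψ is given. A minimal numpy sketch follows; the function name is ours, and Ψ is assumed known, symmetric, and positive definite.

```python
import numpy as np

def g2sls(y, Z, X, Psi):
    """Generalized two-stage least squares, Eq. (7.3.25): premultiply the
    structural and reduced form equations by Psi^{-1/2} and apply 2SLS."""
    Pi = np.linalg.inv(Psi)
    A = X.T @ Pi @ X                       # X'Psi^{-1}X
    ZX = Z.T @ Pi @ X                      # Z'Psi^{-1}X
    Xy = X.T @ Pi @ y                      # X'Psi^{-1}y
    M = ZX @ np.linalg.solve(A, ZX.T)      # Z'Psi^{-1}X (X'Psi^{-1}X)^{-1} X'Psi^{-1}Z
    return np.linalg.solve(M, ZX @ np.linalg.solve(A, Xy))
```

With Ψ = I the formula collapses to ordinary 2SLS, which provides a quick correctness check.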
In this section we shall again consider the full information model defined by
(7.1.1). The 3SLS estimator of α in (7.1.5) can be defined as a special case of
G2SLS applied to the same equation. The reduced form equation comparable
to (7.3.22) is provided by
Z = XΠ + V. (7.4.1)
Exercises
1. (Section 7.3.1)
In the limited information model defined by (7.3.1) and (7.3.2), let
X = (X₁, X₂), where X₁ and X₂ have K₁ and K₂ columns, respectively.
Suppose we define a class of instrumental variables estimators of α by
(S′Z₁)⁻¹S′y, where S = (X₂A, X₁) with A being a K₂ × N₁ (K₂ ≥ N₁)
matrix of constants. Show that there exists some A for which the instru-
mental variables estimator is consistent if and only if the rank condition of
identifiability is satisfied for Eq. (7.3.1).
2. (Section 7.3.3)
Show that the LIML estimator of γ is obtained by minimizing
δ′W₁δ/δ′Wδ, where δ = (1, −γ′)′, with respect to γ. Hence the estimator
with X′X, Y′Y, and X′Y given numerically, where X = (x₁, x₂) and
Y = (y₁, y₂).
5. (Section 7.3.7)
Show that Theil's G2SLS defined at the end of Section 7.3.7 is asymptoti-
cally equivalent to the definition (7.3.25).
6. (Section 7.3.7)
Define α̃ = [Z′X(X′ΨX)⁻¹X′Z]⁻¹Z′X(X′ΨX)⁻¹X′y. Show that this is a
consistent estimator of α in model (7.3.21) but not asymptotically as
efficient as α̂_G2S defined in (7.3.25).
7. (Section 7.4)
Suppose that a simultaneous equations model is defined by (7.1.5) and the
reduced form Z = (I ⊗ X)Π + V. Show that α̂_G2S defined in (7.3.25) and
Theil's G2SLS defined at the end of Section 7.3.7 will lead to the same
3SLS when applied to this model.
8. (Section 7.4)
Consider the following model:
(1) y₁ = γy₂ + u₁
(2) y₂ = β₁x₁ + β₂x₂ + u₂ ≡ Xβ + u₂,

where γ, β₁, and β₂ are scalar unknown parameters, x₁ and x₂ are T-compo-
nent vectors of constants such that T⁻¹X′X is nonsingular for every T and
also in the limit, y₁ and y₂ are T-component vectors of observable random
variables, and u₁ and u₂ are T-component vectors of unobservable random
variables that are independently and identically distributed with zero mean
and contemporaneous variance-covariance matrix

Σ = [σ₁² σ₁₂; σ₁₂ σ₂²].
a. Prove that the 3SLS estimator of γ is identical with the 2SLS estima-
tor of γ.
b. Transform Eqs. (1) and (2) to obtain the following equivalent model:

(3) y₂ = (1/γ)y₁ − (1/γ)u₁
(4) y₁ = γβ₁x₁ + γβ₂x₂ + γu₂ + u₁.
Define the reverse 2SLS estimator of γ as the reciprocal of the 2SLS estima-
tor of 1/γ obtained from Eqs. (3) and (4). Prove that the reverse 2SLS
estimator of γ has the same asymptotic distribution as the 2SLS estimator
of γ.
c. Assume that at period p outside the sample the following relation-
ship holds:

(5) y₁p = γy₂p + u₁p
(6) y₂p = β₁x₁p + β₂x₂p + u₂p ≡ x_p′β + u₂p,

where u₁p and u₂p are independent of u₁ and u₂ with zero mean and var-
iance-covariance matrix Σ. We want to predict y₁p when x₁p and x₂p are
given. Compare the mean squared prediction error of the indirect least
squares predictor, defined by

(7) ŷ₁p = x_p′(X′X)⁻¹X′y₁,

with the mean squared prediction error of the 2SLS predictor defined by

(8) ỹ₁p = γ̂x_p′(X′X)⁻¹X′y₂,

where γ̂ is the 2SLS estimator of γ. Can we say one is uniformly smaller than
the other?
8 Nonlinear Simultaneous Equations Models
In this chapter we shall develop the theory of statistical inference for nonlinear
simultaneous equations models. The main results are taken from the author's
recent contributions (especially, Amemiya, 1974c, 1975a, 1976a, 1977a).
Some additional results can be found in Amemiya (1983a). Section 8.1 deals
with the estimation of the parameters of a single equation and Section 8.2 with
that of simultaneous equations. Section 8.3 deals with tests of hypotheses,
prediction, and computation.
(2SLS) method to the nonlinear model (8.1.1) so that we can obtain a consist-
ent estimator.
Following the article by Amemiya (1974c), we define the class of nonlinear
two-stage least squares (NL2S) estimators of α₀ in the model (8.1.1) as the
value of α that minimizes

S_T(α|W) = (y − f)′W(W′W)⁻¹W′(y − f), (8.1.2)

where W is some matrix of constants with rank at least equal to K.
In the literature prior to the article by Amemiya (1974c), a generalization of
2SLS was considered only in special cases of the fully nonlinear model (8.1.1),
namely, (1) the case of nonlinearity only in parameters and (2) the case of
nonlinearity only in variables. See, for example, the article by Zellner, Huang,
and Chau (1965) for the first case and the article by Kelejian (1971) for the
second case. 2 The definition in the preceding paragraph contains as special
cases the Zellner-Huang-Chau definition and the Kelejian definition, as well
as Theil's 2SLS. By defining the estimator as a solution of a minimization
problem, it is possible to prove its consistency and asymptotic normality by
the techniques discussed in Chapter 4.
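Because P_W = W(W′W)⁻¹W′ is idempotent, premultiplying the residual by (W′W)^{-1/2}W′ turns (8.1.2) into an ordinary nonlinear least squares problem, which Gauss-Newton handles. A sketch under that observation; the function name, interface, and stopping rule are ours.

```python
import numpy as np

def nl2s(y, W, f, alpha0, n_steps=50):
    """Minimize S_T(alpha|W) = (y - f(alpha))'W(W'W)^{-1}W'(y - f(alpha))
    of Eq. (8.1.2) by Gauss-Newton.  f(alpha) must return the vector f
    and its Jacobian df/dalpha'."""
    L = np.linalg.cholesky(np.linalg.inv(W.T @ W))
    A = L.T @ W.T                      # A'A = W(W'W)^{-1}W' = P_W
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(n_steps):
        fv, J = f(alpha)
        r = A @ (y - fv)               # projected residual
        G = A @ J                      # projected Jacobian
        step, *_ = np.linalg.lstsq(G, r, rcond=None)
        alpha = alpha + step
        if np.linalg.norm(step) < 1e-10:
            break
    return alpha
```

When f is linear in α the Gauss-Newton step solves the problem in one iteration; for genuinely nonlinear f the loop matters and a good starting value helps.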
First, we shall prove the consistency of NL2S using the general result for an
extremum estimator given in Theorem 4.1.2. The proof is analogous to but
slightly different from a proof given by Amemiya (1974c). The proof differs
from the proof of the consistency of NLLS (Theorem 4.3.1) in that the deriva-
tive of f is used more extensively here.
THEOREM 8.1.1. Consider the nonlinear regression model (8.1.1) with the
additional assumptions:
(A) lim T⁻¹W′W exists and is nonsingular.
(B) ∂f_t/∂α exists and is continuous in N(α₀), an open neighborhood of α₀.
(C) T⁻¹W′(∂f/∂α′) converges in probability uniformly in α ∈ N(α₀).
(D) plim T⁻¹W′(∂f/∂α′) is full rank.
Then a solution of the minimization of (8.1.2) is consistent in the sense of
Theorem 4.1.2.
Proof. Inserting (8.1.1) into (8.1.2), we can rewrite T⁻¹ times (8.1.2) as

T⁻¹S_T = T⁻¹u′P_Wu + T⁻¹(f₀ − f)′P_W(f₀ − f) + 2T⁻¹(f₀ − f)′P_Wu (8.1.3)
≡ A₁ + A₂ + A₃,

where P_W = W(W′W)⁻¹W′, f = f(α), and f₀ = f(α₀). Note that (8.1.3) is simi-
Nonlinear Simultaneous Equations Models 247
where G* is the matrix the tth row of which is ∂f_t/∂α′ evaluated at α*_t, a point between
α and α₀. Therefore we have

A₂ = T⁻¹(α₀ − α)′G*′P_WG*(α₀ − α). (8.1.5)
where ( ) in the second term of the right-hand side, which involves
T⁻¹W′(∂²f/∂α∂α′), is the matrix the ith row of which is given inside ( ), Ĝ is
∂f/∂α′ evaluated at α̂, and G⁺ is the matrix the tth row of which is ∂f_t/∂α′
evaluated at α⁺_t, a point between α̂ and α₀. But the term
inside ( ) converges to 0 in probability because of assumptions A and C of
Theorem 8.1.1 and assumption B of this theorem. Therefore assumptions A
and C of Theorem 8.1.1 imply
plim T⁻¹ ∂²S_T/∂α∂α′ |_α̂ = plim 2T⁻¹G₀′P_WG₀. (8.1.10)
Finally, because assumption A of this theorem implies assumption A of
Theorem 4.1.3, we have verified all the conditions of Theorem 4.1.3. Hence,
the conclusion of the theorem follows from (8.1.8) and (8.1.10).
Amemiya (1975a) considered, among other things, the optimal choice of
W. It is easy to show that plim T(G₀′P_WG₀)⁻¹ is minimized in the matrix sense
(that is, A > B means A − B is a positive definite matrix) when we choose
W = EG₀. We call the resulting estimator the best nonlinear two-stage
least squares (BNL2S) estimator.
z_t(λ) = (y_t^λ − 1)/λ if λ ≠ 0
= log y_t if λ = 0.

Note that because lim_{λ→0} (y_t^λ − 1)/λ = log y_t, z_t(λ) is continuous at λ = 0. It is
assumed that (u_t) are i.i.d. with Eu_t = 0 and Vu_t = σ².
The transformation z1(A) is attractive because it contains y, and log y, as
special cases, and therefore the choice between y, and log y, can be made
within the framework of classical statistical inference.
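The transformation is a one-liner; the continuity at λ = 0 noted above is exactly what makes the levels-versus-logs choice a parametric hypothesis. The function name is ours.

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transformation z_t(lambda): (y^lam - 1)/lam for lam != 0,
    log y for lam = 0.  The branches join continuously because
    lim_{lam -> 0} (y^lam - 1)/lam = log y."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return np.log(y)
    return (y ** lam - 1.0) / lam
```

Setting λ = 1 gives y − 1 (the linear model up to an intercept shift) and λ = 0 gives log y, so choosing between them reduces to inference about the single parameter λ.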
z_t(λ) > −1/λ if λ > 0 (8.1.14)
z_t(λ) < −1/λ if λ < 0.
Then all the results of the previous section are valid with only an apparent
modification. The minimand (8.1.2), which defines the class of NL2S estima-
tors, should now be written as

[z(λ) − Xβ]′W(W′W)⁻¹W′[z(λ) − Xβ], (8.1.16)

where z(λ) is the vector the tth element of which is z_t(λ).
the choice of W given in the preceding section applies here as well. One
practical choice of W would be to use the x's and their powers and cross
products. Using arguments similar to the ones given in the preceding section,
we can prove the consistency and the asymptotic normality of the estimator.
The asymptotic covariance matrix of √T(α̂ − α) is given by

V(α̂) = σ² plim T {[∂z/∂λ, −X]′W(W′W)⁻¹W′[∂z/∂λ, −X]}⁻¹. (8.1.18)
The Box-Cox maximum likelihood estimator is defined as the pseudo MLE
obtained under the assumption that (u_t) are normally distributed. In other
words, the Box-Cox MLE θ̂ of θ = (λ, β′, σ²)′ maximizes
log L = −(T/2) log σ² + (λ − 1) Σ_{t=1}^T log y_t − (1/2σ²) Σ_{t=1}^T [z_t(λ) − x_t′β]². (8.1.23)
We define X and V as the matrices the tth rows of which are x_t′ and V_t′, respec-
tively. Because u and V are jointly normal, we can write
which depends only on α. Interpreting our model as one that represents the
limited information of the researcher, we call the value of α that minimizes
(8.1.29) the NLLI estimator. The asymptotic covariance matrix of √T times
the estimator is given by

V_L = plim T[σ*⁻²G′M_VG − (σ*⁻² − σ⁻²)G′M_XG]⁻¹, (8.1.30)
replacing V with V̂; call this estimator α̂ and define û = y − f(α̂), insert it into
(8.1.28) to obtain another estimator of Π, and repeat the procedure until
convergence.
The estimator α̂ defined in the preceding paragraph is interesting in its own
right. It is the value of α that minimizes

(y − f)′[I − M_XY(Y′M_XY)⁻¹Y′M_X](y − f). (8.1.32)

Amemiya (1975a) called this estimator the modified nonlinear two-stage
least squares (MNL2S) estimator. The asymptotic covariance matrix of
√T(α̂ − α) is given by

V_M = plim T(G′M_VG)⁻¹[σ*²G′M_VG (8.1.33)
+ (σ² − σ*²)G′P_XG](G′M_VG)⁻¹.
Amemiya (1975a) proved V_L ≤ V_M ≤ V_B. It is interesting to note that if f is
linear in α and Y, MNL2S is reduced to the usual 2SLS (see Section 7.3.6).
In Sections 8.1.1 and 8.1.3, we discussed four estimators: (1) NL2S (as a
class); (2) BNL2S; (3) MNL2S; (4) NLLI. If we denote NL2S(W = X) by
SNL2S (the first S stands for standard), we have in the linear case

SNL2S ≡ BNL2S ≡ MNL2S ≅ NLLI, (8.1.34)

where ≡ means exact identity and ≅ means asymptotic equivalence. In the
nonlinear model defined by (8.1.1) and (8.1.22) with the normality assump-
tion, we can establish the following ranking in terms of the asymptotic covar-
iance matrix:

SNL2S ≺ BNL2S ≺ MNL2S ≺ NLLI, (8.1.35)

where ≺ means "is worse than." However, it is important to remember that
the first two estimators are consistent under more general assumptions than
those under which the last two estimators are consistent, as we shall show in
the following simple example.
Consider a very simple case of (8.1.1) and (8.1.22) given by

y_t = αz_t² + u_t (8.1.36)

and

z_t = πx_t + v_t, (8.1.37)

where we assume the vector (u_t, v_t)′ is i.i.d. with zero mean and a finite
nonsingular covariance matrix. Inserting (8.1.37) into (8.1.36) yields

y_t = απ²x_t² + αEv_t² + (u_t + 2απx_tv_t + αv_t² − αEv_t²), (8.1.38)

where the composite error term (contained within the parentheses) has zero
mean. In this model, SNL2S is 2SLS with z_t² regressed on x_t in the first stage,
and BNL2S is 2SLS with z_t² regressed on the constant term and x_t² in the first
stage. Clearly, both estimators are consistent under general conditions with-
out further assumptions on u_t and v_t. On the other hand, it is not difficult to
show that the consistency of MNL2S and NLLI requires the additional as-
sumption

Ev_t² Ev_t²u_t = Ev_t³ Ev_tu_t, (8.1.39)

which is satisfied if u_t and v_t are jointly normal.
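The robustness claim for SNL2S and BNL2S can be checked by simulation with a deliberately non-normal v_t. A sketch (function name and data-generating choices are ours); for scalar α the minimizer of (8.1.2) has the closed form (g′P_Wy)/(g′P_Wg) with g the vector of z_t².

```python
import numpy as np

def snl2s_bnl2s(y, z, x):
    """NL2S estimates of a in y_t = a z_t^2 + u_t, z_t = pi x_t + v_t,
    Eqs. (8.1.36)-(8.1.37).  SNL2S uses W = x; BNL2S uses W = (1, x^2),
    since E z_t^2 = pi^2 x_t^2 + E v_t^2."""
    g = z ** 2
    def estimate(W):
        a = np.linalg.solve(W.T @ W, W.T @ g)      # (W'W)^{-1}W'g
        return ((W.T @ y) @ a) / ((W.T @ g) @ a)   # g'P_W y / g'P_W g
    return (estimate(x[:, None]),
            estimate(np.column_stack([np.ones_like(x), x ** 2])))
```

In the test below v_t is a centered exponential, so (8.1.39) fails, yet both estimators remain close to the true α, as the text asserts.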
8.2.1 Introduction
Define a system of N nonlinear simultaneous equations by

f_it(y_t, x_t, α_i) = u_it, i = 1, 2, . . . , N, t = 1, 2, . . . , T,
(8.2.1)

where y_t is an N-vector of endogenous variables, x_t is a vector of exogenous
variables, and α_i is a K_i-vector of unknown parameters. We assume that the
g_it = ∂f_it/∂α_i, a K_i-vector,
where α̂_i is the NL2S estimator obtained from each equation. This definition of
the NL3S estimators is analogous to the definition of the linear 3SLS as a
generalization of the linear 2SLS. The consistency and the asymptotic nor-
mality of the NL3S estimators defined in (8.2.2) and (8.2.3) have been proved
by Jorgenson and Laffont (1974) and Gallant (1977).
The consistency of the NL2S and NL3S estimators of the parameters of
model (8.2.1) can be proved with minimal assumptions on u_it, namely,
those stated after (8.2.1). This robustness makes the estimators attractive.
Another important strength of the estimators is that they retain their consist-
ency regardless of whether or not (8.2.1) yields a unique solution for y, and, in
the case of multiple solutions, regardless of what additional mechanism
chooses a unique solution. (MaCurdy, 1980, has discussed this point further.)
However, in predicting the future value of the dependent variable, we must
know the mechanism that yields a unique solution.
Amemiya (1977a) defined the class of the NL3S estimators more generally
as the value of a that minimizes
which is attained when we choose S = EG. We call this estimator the BNL3S
estimator (B for "best").
We can also attain the lower bound (8.2.6) using the Jorgenson-Laffont
definition, but that is possible if and only if the space spanned by the column
vectors of W contains the union of the spaces spanned by the column vectors
of EG_i for i = 1, 2, . . . , N. This necessitates including many columns in W,
which is likely to increase the finite sample variance of the estimator although
it has no effect asymptotically. This is a disadvantage of the Jorgenson-Laffont
definition compared to the Amemiya definition.
Noting that BNL3S is not practical, as was the case with BNL2S, Amemiya
(1976a) suggested the following approximation:
Step 1. Compute α̂_i, an SNL2S estimator of α_i, i = 1, 2, . . . , N.
Step 2. Evaluate G_i at α̂_i; call it Ĝ_i.
Step 3. Treat Ĝ_i as the dependent variables of a regression and search for the
optimal set of independent variables W_i that best predict Ĝ_i.
Step 4. Choose S = diag(P₁Ĝ₁, P₂Ĝ₂, . . . , P_NĜ_N), where
P_i = W_i(W_i′W_i)⁻¹W_i′.
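Steps 2 through 4 are pure linear algebra once the Ĝ_i and W_i are chosen. A sketch (function name ours; step 3, the choice of each W_i, is left to the user):

```python
import numpy as np

def build_S(G_hats, W_list):
    """Steps 2-4 of the approximation to BNL3S: project each Ghat_i on its
    predictor set W_i and stack the fitted values block-diagonally,
    S = diag(P_1 Ghat_1, ..., P_N Ghat_N)."""
    blocks = [W @ np.linalg.solve(W.T @ W, W.T @ G)   # P_i Ghat_i
              for G, W in zip(G_hats, W_list)]
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    S = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        S[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return S
```

The block-diagonal stacking mirrors Z = diag(Z₁, . . . , Z_N) in (7.1.5); each equation contributes its own fitted derivative matrix.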
Applications of NL3S can be found in articles by Jorgenson and Lau (1975),
who estimated a three-equation translog expenditure model; by Jorgenson
and Lau (1978), who estimated a two-equation translog expenditure model;
and by Haessel (1976), who estimated a system of demand equations, nonlin-
ear only in parameters, by both NL2S and NL3S estimators.
Assuming u_t ~ N(0, Σ), we can write the log likelihood function of model
(8.2.1) as

L* = −(T/2) log |Σ| + Σ_{t=1}^T log ‖∂f_t/∂y_t′‖ − (1/2) Σ_{t=1}^T f_t′Σ⁻¹f_t. (8.2.7)
Inserting (8.2.8) into (8.2.7) yields the concentrated log likelihood function

L = −(T/2) log |T⁻¹ Σ_{t=1}^T f_tf_t′| + Σ_{t=1}^T log ‖∂f_t/∂y_t′‖, (8.2.9)

where ( )_i⁻¹ denotes the ith column of the inverse of the matrix within the
parentheses. The consistency of NLFI is equivalent to the condition given in
(8.2.12),
where σ^i is the ith column of Σ⁻¹. Now (8.2.13) could hold even if each term of
one summation differs from the corresponding term of the other, but that
event is extremely unlikely. Therefore we can say that the consistency of NLFI
is essentially equivalent to the condition that (8.2.13) hold term by term,
which we write as (8.2.14).
In simple models the condition (8.2.14) may hold without normality. In the
model defined by (8.1.36) and (8.1.37), we have g_1t = −z_t² and g_2t = x_t.
Therefore (8.2.14) clearly holds for i = 2 for any distribution of u_t provided
that the mean is 0. The equation for i = 1 gives (8.1.39), which is satisfied by a
class of distributions including the normal distribution. (Phillips, 1982, has
presented another simple example.) However, if g_it is a more complicated
nonlinear function of the exogenous variables and the parameters (α_i) as well
as of u_t, (8.2.14) can be made to hold only when we specify a density that
depends on the exogenous variables and the parameters of the model. In such
a case, normality can be regarded, for all practical purposes, as a necessary and
sufficient condition for the consistency of NLFI.
It is interesting to compare certain iterative formulae for calculating NLFI
where d̂_i is given by (8.2.17), D̂ ≡ diag(d̂₁, d̂₂, . . . , d̂_N), and all the variables that appear in the sec-
ond term of the right-hand side of (8.2.16) are evaluated at α̂^(m).
The Gauss-Newton iteration for BNL3S is defined by
where Ĝ = ∂f/∂α|_α̂ and G̃ = ∂f/∂α|_α̃, and the SSRD statistic is given by

SSRD = σ̂⁻²[S_T(α̃) − S_T(α̂)]. (8.3.2)
Both test statistics are asymptotically distributed as chi-square with q degrees
of freedom.
Gallant and Jorgenson (1979) derived the asymptotic distribution (a non-
central chi-square) of the two test statistics under the assumption that a
deviates from the hypothesized constraints in the order of T^{-1/2}.
As an application of the SSRD test, Gallant and Jorgenson tested the hy-
pothesis of homogeneity of degree zero of an equation for durables in the
two-equation translog expenditure model of Jorgenson and Lau (1978).
The Wald and SSRD tests can be straightforwardly extended to the system
of equations (8.2.1) by using NL3S in lieu of NL2S. As an application of the
SSRD test using NL3S, Gallant and Jorgenson tested the hypothesis of sym-
metry of the matrix of parameters in the three-equation translog expenditure
model of Jorgenson and Lau (1975).
If we assume (8.1.22) in addition to (8.1.1) and assume normality, the
model is specified (although it is a limited information model); therefore the
three tests of Section 4.5.1 can be used with NLLI. The same is true of NLFI in
model (8.1.22) under normality. The asymptotic results of Gallant and Holly
(1980) given in Section 4.5.1 are also applicable if we replace σ²(G′G)⁻¹ in the
right-hand side of (4.5.26) by the asymptotic covariance matrix of the NLFI
estimator.
8.3.2 Prediction
Bianchi and Calzolari (1980) proposed a method by which we can calculate
the mean squared prediction error matrix of a vector predictor based on any
estimator of the nonlinear simultaneous equations model. Suppose the struc-
tural equations can be written as f(y_p, x_p, α) = u_p at the prediction period p
and we can solve for y_p as y_p = g(x_p, α, u_p). Define the predictor ŷ_p based on
the estimator α̂ by ŷ_p = g(x_p, α̂, 0). (Note that y_p is an N-vector.) We call this
where û_t = f(y_t, x_t, α̂).
Because y_p − ŷ_p = (y_p − Ey_p) + (Ey_p − ŷ_p) for any predictor ŷ_p, we should
compare predictors on the basis of how well Ey_p = Eg(x_p, α, u_p) is estimated
by each predictor. Moreover, because α̂ is common to the three predictors,
we can essentially consider the situation where α̂ in the predictor is replaced by
the parameter α. Thus the authors' problem is essentially equivalent to that of
comparing the following three estimators of Eg(u_p):

Deterministic: g(0)
Monte Carlo: S⁻¹ Σ_{s=1}^S g(v_s)
Residual-based: T⁻¹ Σ_{t=1}^T g(û_t)
Clearly, the deterministic predictor is the worst, as Mariano and Brown (1983)
concluded. According to their other article (Brown and Mariano, 1982), the
choice between the Monte Carlo and residual-based predictors depends on the
consideration that the former can be more efficient if S is large and the
assumed distribution of u_p is true, whereas the latter is simpler to compute and
more robust in the sense that the distribution of u_p need not be specified.
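The ranking of the three estimators of Eg(u_p) is easy to see numerically. A scalar sketch (function name ours; g, the residuals, and the draws are supplied by the user):

```python
import numpy as np

def three_predictors(g, u_hat, draws):
    """The three estimators of E g(u_p): deterministic g(0), the Monte
    Carlo average over draws from the assumed error distribution, and
    the residual-based average over the estimated residuals u_hat."""
    det = g(0.0)
    mc = np.mean([g(v) for v in draws])
    res = np.mean([g(u) for u in u_hat])
    return det, mc, res
```

For a nonlinear g the deterministic predictor estimates g(Eu_p) rather than Eg(u_p), which is why it comes out worst, exactly as Mariano and Brown concluded.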
8.3.3 Computation
The discussion of the computation of NLFI preceded the theoretical discus-
sion of the statistical properties of NLFI by more than ten years. The first
article on computation was by Eisenpress and Greenstadt (1966), who pro-
posed a modified Newton-Raphson iteration. Their modification combined
both (4.4.4) and (4.4.5). Chow (1973) differed from Eisenpress and Greenstadt
in that he obtained simpler formulae by assuming that different parameters
appear in different equations, as in (8.2.1). We have already mentioned the
algorithm considered by Amemiya (1977a), mainly for a pedagogical purpose.
Dagenais (1978) modified this algorithm to speed up the convergence and
compared it with a Newton-Raphson method proposed by Chow and Fair
(1973) and with the DFP algorithm mentioned in Section 4.4.1 in certain
examples of nonlinear models. The results are inconclusive. Belsley (1979)
compared the computational speed of the DFP algorithm in computing NLFI
and NL3S in five models of various degrees of complexity and found that
NL3S was three to ten times faster. Nevertheless, Belsley showed that the
computation of NLFI is quite feasible and can be improved by using a more
suitable algorithm and by using the approximation of the Jacobian proposed
by Fair; see Eq. (8.3.6).
Fair and Parke (1980) estimated Fair's (1976) macro model (97 equations,
29 of which are stochastic, with 182 parameters including 12 first-order auto-
regressive coefficients), which is nonlinear in variables as well as in parameters
(this latter nonlinearity is caused by the transformation to take account of the
first-order autoregression of the errors), by OLS, SNL2S, the Jorgenson-Laf-
font NL3S, and NLFI. The latter two estimators are calculated by a deriva-
tive-free algorithm proposed by Parke (1982).
Parke noted that the largest model for which NLFI and NL3S had been
calculated before Parke's study was the one Belsley calculated, a model that
contained 19 equations and 61 unknown coefficients. Parke also noted that
Newton's method is best for linear models and that the DFP method is pre-
ferred for small nonlinear models; however, Parke's method is the only feasi-
ble one for large nonlinear models.
Nonlinear Simultaneous Equations Models 265
Exercises
1. (Section 8.1.2)
Define what you consider to be the best estimators of $\alpha$ and $\beta$ in the model
$(y_t + \alpha)^2 = \beta x_t + u_t$, $t = 1, 2, \ldots, T$, where $\{x_t\}$ are known constants
and $\{u_t\}$ are i.i.d. with $Eu_t = 0$ and $Vu_t = \sigma^2$. Justify your choice of estimators.
2. (Section 8.1.2)
In model (8.1.12) show that the minimization of $\sum_{t=1}^{T} [z_t(\lambda) - x_t\beta]^2$ with
respect to $\lambda$ and $\beta$ yields inconsistent estimates.
3. (Section 8.1.3)
Prove the consistency of the estimator of $\alpha$ obtained by minimizing
$(y - f)'[I - V(V'V)^{-1}V'](y - f)$ and derive its asymptotic variance-covariance
matrix. Show that the matrix is smaller (in the matrix sense)
than the variance-covariance matrix given in (8.1.33).
4. (Section 8.1.3)
In the model defined by (8.1.36) and (8.1.37), consider the following two-
stage estimation method: In the first stage, regress $z_t$ on $x_t$ and define
$\hat z_t = \hat\pi x_t$, where $\hat\pi$ is the least squares estimator; in the second stage, regress
$y_t$ on $(\hat z_t)^2$ to obtain the least squares estimator of $\alpha$. Show that the resulting
estimator of $\alpha$ is inconsistent. (This method may be regarded as an application
of Theil's interpretation (Section 7.3.6) to a nonlinear model.)
5. (Section 8.1.3)
In the model defined by (8.1.36) and (8.1.37), show that the consistency of
MNL2S and NLLI requires (8.1.39).
6. (Section 8.1.3)
In the model defined by (8.1.36) and (8.1.37), assume $\pi = 1$, $x_t = 1$ for all $t$,
$Vu_t = Vv_t = 1$, and $\mathrm{Cov}(u_t, v_t) = c$. Evaluate the asymptotic variances,
denoted $V_1$, $V_2$, $V_3$, and $V_4$, of the SNL2S, BNL2S, MNL2S, and NLLI
estimators of $\alpha$ and show $V_1 \ge V_2 \ge V_3 \ge V_4$ for every $|c| < 1$.
7. (Section 8.2.3)
Consider the following two-equation model (Goldfeld and Quandt, 1968):
$$\log y_{1t} = \gamma_1 \log y_{2t} + \beta_1 + \beta_2 x_t + u_{1t},$$
$$y_{2t} = \gamma_2 y_{1t} + \beta_3 x_t + u_{2t},$$
where $\gamma_1 > 0$ and $\gamma_2 < 0$. Show that there are two solutions of $y_{1t}$ and $y_{2t}$ for
a given value of $(u_{1t}, u_{2t})$. Show also that $(u_{1t}, u_{2t})$ cannot be normally
distributed.
8. (Section 8.2.3)
Consider the following model (Phillips, 1982):
$$\log y_{1t} + \alpha_1 x_t = u_{1t},$$
$$y_{2t} + \alpha_2 y_{1t} = u_{2t}.$$
9.1 Introduction
Qualitative response models (henceforth to be abbreviated as QR models) are
regression models in which dependent (or endogenous) variables take discrete
values. These models have numerous applications in economics because
many behavioral responses are qualitative in nature: A consumer decides
whether to buy a car or not; a commuter chooses a particular mode of trans-
portation from several available ones; a worker decides whether to take a job
offer or not; and so on. A long list of empirical examples of QR models can be
found in my recent survey (Amemiya, 1981).
Qualitative response models, also known as quantal, categorical, or discrete
models, have been used in biometric applications longer than they have been
used in economics. Biometricians use the models to study, for example, the
effect of an insecticide on the survival or death of an insect, or the effect of a
drug on a patient. The kind of QR model used by biometricians is usually the
simplest kind: a univariate binary (or dichotomous) dependent variable (survival
or death) and a single independent variable (dosage).
Economists (and sociologists to a certain extent), on the other hand, must
deal with more complex models, such as models in which a single dependent
variable takes more than two discrete values (multinomial models) or models
that involve more than one discrete dependent variable (multivariate models),
as well as considering a larger number of independent variables. The estima-
tion of the parameters of these complex models requires more elaborate
techniques, many of which have been recently developed by econometricians.
This chapter begins with a discussion of the simplest model, the model for
a univariate binary dependent variable (Section 9.2), and then moves on to
multinomial and multivariate models (Sections 9.3 and 9.4). The emphasis
here is on the theory of estimation (and hypothesis testing to a lesser extent)
and is therefore complementary to Amemiya's survey mentioned earlier, which
discussed many empirical examples and contained only fundamental results
on the theory of statistical inference. We shall also discuss important topics
Probit Model:
$$F(x) = \Phi(x) \equiv \int_{-\infty}^{x} (2\pi)^{-1/2} \exp(-t^2/2)\, dt.$$

Logit Model:
$$F(x) = \Lambda(x) \equiv \frac{e^x}{1 + e^x}.$$
The linear probability model has an obvious defect in that $F$ for this model
is not a proper distribution function, as it is not constrained to lie between 0
and 1. This defect can be corrected by defining $F = 1$ if $F(x_i'\beta_0) > 1$ and $F = 0$
if $F(x_i'\beta_0) < 0$, but the procedure produces unrealistic kinks at the truncation
points.
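The three response functions can be sketched as follows (a minimal illustration; the function names are ours):

```python
import math

def probit_F(x):
    # Phi(x), the standard normal distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logit_F(x):
    # Lambda(x) = e^x / (1 + e^x)
    return 1.0 / (1.0 + math.exp(-x))

def linear_prob_F(x):
    # Linear probability model with the truncation fix: clipping to [0, 1]
    # produces the kinks mentioned in the text.
    return min(1.0, max(0.0, x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    assert 0.0 < probit_F(x) < 1.0      # proper response curves stay in (0, 1)
    assert 0.0 < logit_F(x) < 1.0
assert linear_prob_F(1.7) == 1.0        # truncated, hence the kink
```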
Qualitative Response Models 269
$$P(y_i = 1) = \Lambda\left(\frac{\pi}{\sqrt{3}}\, x_i'\beta_0\right). \qquad (9.2.3)$$
EXAMPLE 9.2.2 (Domencich and McFadden, 1975). Let us consider the
decision of a person regarding whether he or she drives a car or travels by
transit to work. We assume that the utility associated with each mode of
transport is a function of the mode characteristics z (mainly the time and the
cost incurred by the use of the mode) and the individual's socioeconomic
characteristics w, plus an additive error term $\epsilon$. We define $U_{i1}$ and $U_{i0}$ as the ith
person's indirect utilities associated with driving a car and traveling by transit,
respectively. Then, assuming a linear function, we have
$$U_{i0} = \alpha_0 + z_{i0}'\beta + w_i'\gamma_0 + \epsilon_{i0} \qquad (9.2.4)$$
and
$$U_{i1} = \alpha_1 + z_{i1}'\beta + w_i'\gamma_1 + \epsilon_{i1}. \qquad (9.2.5)$$
The basic assumption is that the ith person drives a car if $U_{i1} > U_{i0}$ and travels
by transit if $U_{i1} < U_{i0}$. (There is indecision if $U_{i1} = U_{i0}$, but this happens with
zero probability if $\epsilon_{i0}$ and $\epsilon_{i1}$ are continuous random variables.) Thus, defining
$y_i = 1$ if the ith person drives a car, we have
$$P(y_i = 1) = P(U_{i1} > U_{i0}). \qquad (9.2.6)$$
to make the proofs of consistency and asymptotic normality simpler. In this
way we hope that the reader will understand the essentials without being
hindered by too many technical details.
The logarithm of the likelihood function of the model (9.2.1) is given by
For example, suppose that $\{x_i\}$ are scalars such that $x_i < 0$ for $i \le c$ for some
integer $c$ between 1 and $n$ and $x_i > 0$ for $i > c$, and that $y_i = 0$ for $i \le c$ and
$y_i = 1$ for $i > c$. Then $\log L$ does not attain a maximum for any finite value of
$\beta$. If $\{x_i\}$ are $K$-vectors, the same situation arises if $y = 0$ for the values of $x_i$
lying on one side of a hyperplane in $R^K$ and $y = 1$ for the values of $x_i$ lying
on the other side. However, the possibility of no solution of (9.2.8) does
not affect the consistency proof because the probability of no solution approaches
0 as $n$ goes to $\infty$, as shown in Theorem 4.1.2.
Next we shall prove asymptotic normality using Theorem 4.2.4, which
means that assumptions A, B, and C of Theorem 4.1.3 and Eq. (4.2.22) need to
be verified. Differentiating (9.2.8) with respect to $\beta$ yields
$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = \sum_{i=1}^{n} \left\{ \left[\frac{y_i}{F_i} - \frac{1-y_i}{1-F_i}\right] f_i' - \left[\frac{y_i}{F_i^2} + \frac{1-y_i}{(1-F_i)^2}\right] f_i^2 \right\} x_i x_i', \qquad (9.2.12)$$
where
$$A = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \frac{f_i^2}{F_i(1-F_i)}\, x_i x_i'. \qquad (9.2.15)$$
Finally, we obtain from (9.2.12)
$$\lim_{n\to\infty} E\left[-\frac{1}{n}\, \frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\bigg|_{\beta_0}\right] = A. \qquad (9.2.16)$$
9.2.3 Global Concavity of the Likelihood Function in the Logit and Probit
Models
Global concavity means that $\partial^2 \log L/\partial\beta\,\partial\beta'$ is a negative definite matrix for
$\beta \in B$. Because we have by a Taylor expansion
$$\log L(\beta) = \log L(\hat\beta) + \frac{\partial \log L}{\partial\beta'}\bigg|_{\hat\beta} (\beta - \hat\beta) + \frac{1}{2} (\beta - \hat\beta)' \frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\bigg|_{\beta^*} (\beta - \hat\beta), \qquad (9.2.18)$$
where $\beta^*$ lies between $\beta$ and $\hat\beta$, global concavity implies that $\log L(\hat\beta) >
\log L(\beta)$ for $\beta \ne \hat\beta$ if $\hat\beta$ is a solution of (9.2.8). We shall prove global concavity
for logit and probit models.
For the logit model we have
$$\frac{\partial \Lambda}{\partial x} = \Lambda(1 - \Lambda) \quad\text{and}\quad \frac{\partial^2 \Lambda}{\partial x^2} = (1 - 2\Lambda)\,\frac{\partial \Lambda}{\partial x}. \qquad (9.2.19)$$
Inserting (9.2.19) into (9.2.12) with $F = \Lambda$ yields
$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^{n} \Lambda_i (1 - \Lambda_i)\, x_i x_i', \qquad (9.2.20)$$
where $\Lambda_i = \Lambda(x_i'\beta)$. Thus the global concavity follows from Assumption 9.2.3.
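The negative definiteness of (9.2.20) is easy to verify numerically; the regressors and parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # arbitrary full-rank regressor matrix
beta = np.array([0.5, -1.0, 2.0])       # any parameter value

# Hessian (9.2.20): -sum_i Lambda_i (1 - Lambda_i) x_i x_i'
L = 1.0 / (1.0 + np.exp(-(X @ beta)))
H = -(X * (L * (1.0 - L))[:, None]).T @ X

eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals < 0.0)            # negative definite at this beta
```

Because $\Lambda_i(1-\Lambda_i) > 0$ for every $\beta$, the same check succeeds at any parameter value, which is the content of the global concavity result.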
A proof of global concavity for the probit model is a little more complicated.
Putting $F_i = \Phi_i$, $f_i = \phi_i$, and $f_i' = -x_i'\beta\,\phi_i$, where $\phi$ is the density function of
$N(0, 1)$, into (9.2.12) yields
$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^{n} \Phi_i^{-2}(1 - \Phi_i)^{-2} \left[ (y_i - 2y_i\Phi_i + \Phi_i^2)\phi_i + (y_i - \Phi_i)\Phi_i(1 - \Phi_i)\,x_i'\beta \right] \phi_i\, x_i x_i'. \qquad (9.2.21)$$
Thus we need to show the positivity of
$$g_y(x) = (y - 2y\Phi + \Phi^2)\phi + (y - \Phi)\Phi(1 - \Phi)x.$$
$$\hat\beta_2 = \hat\beta_1 + \left[ \sum_{i=1}^{n} \frac{\hat f_i^2}{\hat F_i(1 - \hat F_i)}\, x_i x_i' \right]^{-1} \sum_{i=1}^{n} \frac{(y_i - \hat F_i)\hat f_i}{\hat F_i(1 - \hat F_i)}\, x_i, \qquad (9.2.25)$$
where we have defined $\hat F_i = F(x_i'\hat\beta_1)$ and $\hat f_i = f(x_i'\hat\beta_1)$.
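A minimal sketch of the iteration (9.2.25) for the logit case follows; the data-generating process and sample size are illustrative assumptions:

```python
import numpy as np

def scoring_step(beta, X, y, F, f):
    # One iteration of the method of scoring, in the spirit of (9.2.25):
    # beta_2 = beta_1 + A^{-1} s with
    #   A = sum_i fhat_i^2 / [Fhat_i (1 - Fhat_i)] x_i x_i'
    #   s = sum_i (y_i - Fhat_i) fhat_i / [Fhat_i (1 - Fhat_i)] x_i
    eta = X @ beta
    Fh, fh = F(eta), f(eta)
    w = fh ** 2 / (Fh * (1.0 - Fh))
    A = (X * w[:, None]).T @ X
    s = X.T @ ((y - Fh) * fh / (Fh * (1.0 - Fh)))
    return beta + np.linalg.solve(A, s)

# Logit specification: F = Lambda, f = Lambda(1 - Lambda).
Lam = lambda t: 1.0 / (1.0 + np.exp(-t))
lam = lambda t: Lam(t) * (1.0 - Lam(t))

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.3, 1.0])
y = (rng.random(n) < Lam(X @ true_beta)).astype(float)

beta = np.zeros(2)
for _ in range(30):
    beta = scoring_step(beta, X, y, Lam, lam)

assert np.all(np.abs(beta - true_beta) < 0.4)
```

For the logit, $f_i = F_i(1-F_i)$, so the weights simplify and the iteration coincides with the familiar iteratively reweighted least squares scheme.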
An interesting interpretation of the iteration (9.2.25) is possible. From
(9.2.1) we obtain
$$y_i = F(x_i'\beta_0) + u_i, \qquad (9.2.26)$$
where $Eu_i = 0$ and $Vu_i = F(x_i'\beta_0)[1 - F(x_i'\beta_0)]$. This is a heteroscedastic nonlinear
regression model. Expanding $F(x_i'\beta_0)$ in a Taylor series around $\beta_0 = \hat\beta_1$
$$F^{-1}(\hat P_t) = F^{-1}(P_t) + \frac{\partial F^{-1}}{\partial P}\bigg|_{P_t^*} (\hat P_t - P_t)$$
$$= x_t'\beta_0 + v_t + w_t, \qquad (9.2.30)$$
where $P_t^*$ lies between $\hat P_t$ and $P_t$,
$$v_t = \frac{\hat P_t - P_t}{f[F^{-1}(P_t)]},$$
and
$$w_t = \left\{ \frac{1}{f[F^{-1}(P_t^*)]} - \frac{1}{f[F^{-1}(P_t)]} \right\} (\hat P_t - P_t).$$
The fact that $v_t$ and $w_t$ depend on $n$ has been suppressed.
Because
$$Vv_t \equiv \sigma_t^2 = \frac{P_t(1 - P_t)}{n_t\, f[F^{-1}(P_t)]^2} \qquad (9.2.31)$$
and because $w_t$ is $O(n_t^{-1})$ and hence can be ignored for large $n_t$ (as we shall
show rigorously later), (9.2.30) approximately defines a heteroscedastic linear
regression model. The MIN $\chi^2$ estimator, denoted $\hat\beta$, is defined as the WLS
estimator applied to (9.2.30) ignoring $w_t$. We can estimate $\sigma_t^2$ by $\hat\sigma_t^2$, obtained
by substituting $\hat P_t$ for $P_t$ in (9.2.31). Thus
$$\hat\beta = \left[ \sum_{t=1}^{T} \hat\sigma_t^{-2} x_t x_t' \right]^{-1} \sum_{t=1}^{T} \hat\sigma_t^{-2} x_t F^{-1}(\hat P_t). \qquad (9.2.32)$$
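A sketch of (9.2.32) for grouped logit data, where $F^{-1}(P) = \log[P/(1-P)]$ and $\sigma_t^2$ reduces to $1/[n_t P_t(1-P_t)]$; the design and parameter values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_t = 10, 500                        # T groups, n_t observations per group
x = np.column_stack([np.ones(T), np.linspace(-1.0, 1.0, T)])
beta0 = np.array([0.2, 1.5])
P = 1.0 / (1.0 + np.exp(-(x @ beta0)))  # true choice probabilities
Phat = rng.binomial(n_t, P) / n_t       # observed group proportions

# For the logit, F^{-1}(P) = log[P/(1-P)] and f[F^{-1}(P)] = P(1-P),
# so sigma_t^2 in (9.2.31) reduces to 1/[n_t Phat_t (1 - Phat_t)].
z = np.log(Phat / (1.0 - Phat))         # observed log odds
w = n_t * Phat * (1.0 - Phat)           # weights 1/sigma_t^2
A = (x * w[:, None]).T @ x
b = x.T @ (w * z)
beta_minchi2 = np.linalg.solve(A, b)    # eq. (9.2.32)

assert np.all(np.abs(beta_minchi2 - beta0) < 0.3)
```

This is Berkson's minimum logit chi-square estimator; note that it requires every group proportion to be strictly between 0 and 1.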
$$\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \hat\sigma_t^{-2} x_t (v_t + w_t) = \frac{1}{\sqrt{T}} \sum_{t=1}^{T} x_t (\hat\sigma_t^{-2} v_t + \hat\sigma_t^{-2} w_t) \cong \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \sigma_t^{-2} x_t v_t. \qquad (9.2.35)$$
obtain
$$\tilde\beta - \beta = \sum_{t=1}^{T} g_t u_t + \frac{1}{2} \sum_{t=1}^{T} \sum_{s=1}^{T} g_{ts} u_t u_s + \frac{1}{6} \sum_{t=1}^{T} \sum_{s=1}^{T} \sum_{r=1}^{T} g_{tsr} u_t u_s u_r, \qquad (9.2.41)$$
where $u_t = \hat P_t - P_t$ and $g_t$, $g_{ts}$, and $g_{tsr}$ denote the first-, second-, and third-order
partial derivatives of $g$ evaluated at $(P_1, P_2, \ldots, P_T)$, respectively.
The bias of the MLE to the order of $n^{-1}$ is obtained by taking the expectation
of the first two terms of the right-hand side of (9.2.41). The mean squared error
of the MLE to the order of $n^{-2}$ is obtained by calculating the mean squared
error of the right-hand side of (9.2.41), ignoring the terms of a smaller order
than $n^{-2}$. We need not consider higher terms in the Taylor expansion because
$Eu_t^k$ for $k \ge 5$ are at most of the order of $n_t^{-3}$. A Taylor expansion for the
MIN $\chi^2$ estimator $\hat\beta$ is obtained by expanding the right-hand side of (9.2.32)
around $\hat P_t = P_t$.
Using these formulae, Amemiya calculated the approximate mean squared
errors of MLE and the MIN x2 estimator in several examples, both artificial
and empirical, and found the MIN x2 estimator to have a smaller mean
squared error than MLE in all the examples considered. However, the differ-
ence between the two mean squared error matrices can be shown to be neither
positive definite nor negative definite (Ghosh and Sinha, 1981). In fact, Davis
(1984) showed examples in which the MLE has a smaller mean squared error
to the order of $n^{-2}$ and offered an intuitive argument that showed that the
greater $T$, the more likely MLE is to have a smaller mean squared error.
Amemiya also derived the formulae for the $n^{-2}$-order mean squared errors
of the bias-corrected MLE and the bias-corrected MIN $\chi^2$ estimator and
showed that the former is smaller. The bias-corrected MLE is defined as
$\tilde\beta - B(\tilde\beta)$, where $B$ is the bias to the order of $n^{-1}$, and similarly for MIN $\chi^2$.
This result is consistent with the second-order efficiency of MLE in the exponential
family proved by Ghosh and Subramanyam (1974), as mentioned in
Section 4.2.4. The actual magnitude of the difference of the $n^{-2}$-order mean
squared errors of the bias-corrected MLE and MIN $\chi^2$ in Amemiya's examples
was always found to be extremely small. Davis did not report the corresponding
results for her examples.
$$\sum_{t=1}^{T} n_t [\hat P_t(1 - \hat P_t)]^{-1} [\hat P_t - F(x_t'\hat\beta)]^2, \qquad (9.2.45)$$
$$P(y_i = 1 \mid x_i^*) = \frac{g_1(x_i^*) q_1}{g_1(x_i^*) q_1 + g_0(x_i^*) q_0}, \qquad (9.2.46)$$
where $q_1$ and $q_0$ denote the marginal probabilities $P(y_i = 1)$ and $P(y_i = 0)$,
respectively. We shall evaluate (9.2.46), assuming that $g_1$ and $g_0$ are the densities
of $N(\mu_1, \Sigma_1)$ and $N(\mu_0, \Sigma_0)$, respectively. We state this assumption formally
as
$$x_i^* \mid (y_i = 1) \sim N(\mu_1, \Sigma_1), \qquad x_i^* \mid (y_i = 0) \sim N(\mu_0, \Sigma_0). \qquad (9.2.47)$$
$$\beta_{(1)} = \log\frac{q_1}{q_0} - \frac{1}{2}\,\mu_1'\Sigma_1^{-1}\mu_1 + \frac{1}{2}\,\mu_0'\Sigma_0^{-1}\mu_0 - \frac{1}{2}\log|\Sigma_1| + \frac{1}{2}\log|\Sigma_0|, \qquad (9.2.49)$$
$$\beta_{(2)} = \Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0, \qquad (9.2.50)$$
$$A = -\frac{1}{2}\,(\Sigma_1^{-1} - \Sigma_0^{-1}). \qquad (9.2.51)$$
The ML estimates are
$$\hat q_1 = \frac{n_1}{n}, \qquad (9.2.54)$$
$$\hat q_0 = \frac{n_0}{n}, \qquad (9.2.55)$$
where $n_1 = \sum_{i=1}^{n} y_i$ and $n_0 = n - n_1$,
$$\hat\mu_1 = \frac{1}{n_1} \sum_{i=1}^{n} y_i x_i^*, \qquad (9.2.56)$$
$$\hat\mu_0 = \frac{1}{n_0} \sum_{i=1}^{n} (1 - y_i) x_i^*, \qquad (9.2.57)$$
$$\hat\Sigma_1 = \frac{1}{n_1} \sum_{i=1}^{n} y_i (x_i^* - \hat\mu_1)(x_i^* - \hat\mu_1)', \qquad (9.2.58)$$
and
$$\hat\Sigma_0 = \frac{1}{n_0} \sum_{i=1}^{n} (1 - y_i)(x_i^* - \hat\mu_0)(x_i^* - \hat\mu_0)'. \qquad (9.2.59)$$
If $\Sigma_1 = \Sigma_0$, the common covariance matrix is estimated by
$$\hat\Sigma = \frac{1}{n} \left[ \sum_{i=1}^{n} y_i (x_i^* - \hat\mu_1)(x_i^* - \hat\mu_1)' + \sum_{i=1}^{n} (1 - y_i)(x_i^* - \hat\mu_0)(x_i^* - \hat\mu_0)' \right]. \qquad (9.2.60)$$
The ML estimators of $\beta_{(1)}$, $\beta_{(2)}$, and $A$ are obtained by inserting these estimates
into the right-hand side of (9.2.49), (9.2.50), and (9.2.51).
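The DA estimator under a common covariance matrix can be sketched as follows; the two-regressor design and parameter values are hypothetical:

```python
import numpy as np

def da_estimator(X, y):
    # DA estimator of the logit coefficients under a common covariance
    # matrix, plugging (9.2.54)-(9.2.57) and (9.2.60) into (9.2.49)-(9.2.50):
    #   beta_(2) = Sigma^{-1}(mu_1 - mu_0)
    #   beta_(1) = log(q_1/q_0) - (1/2)mu_1'Sigma^{-1}mu_1
    #                            + (1/2)mu_0'Sigma^{-1}mu_0
    n = len(y)
    n1 = int(y.sum())
    mu1 = X[y == 1].mean(axis=0)
    mu0 = X[y == 0].mean(axis=0)
    d1 = X[y == 1] - mu1
    d0 = X[y == 0] - mu0
    Sigma = (d1.T @ d1 + d0.T @ d0) / n       # pooled ML estimate (9.2.60)
    Sinv = np.linalg.inv(Sigma)
    b2 = Sinv @ (mu1 - mu0)
    b1 = (np.log(n1 / (n - n1))
          - 0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0)
    return b1, b2

# Data generated by the DA model (9.2.47): x | y ~ N(mu_y, I), q_1 = 1/2.
rng = np.random.default_rng(3)
n = 4000
y = (rng.random(n) < 0.5).astype(int)
mu = np.array([[0.0, 0.0], [1.0, 0.5]])
X = mu[y] + rng.normal(size=(n, 2))

b1, b2 = da_estimator(X, y)
assert np.all(np.abs(b2 - np.array([1.0, 0.5])) < 0.2)  # true slope (1.0, 0.5)
```

When the normality assumption holds, as here, these closed-form estimates converge to the same logit coefficients the MLE targets, with no iteration required.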
Discriminant analysis is frequently used in transport modal choice analysis.
See, for example, articles by Warner (1962) and McGillivray (1972).
We call the model defined by (9.2.47) with $\Sigma_1 = \Sigma_0$ and by (9.2.52) the
discriminant analysis (DA) model and call the estimator of $\beta \equiv (\beta_{(1)}, \beta_{(2)}')'$
obtained by inserting (9.2.56), (9.2.57), and (9.2.60) into (9.2.49) and (9.2.50)
with $\Sigma_1 = \Sigma_0$ the DA estimator, denoted $\hat\beta_{DA}$. In contrast, if we assume only
(9.2.52) and not (9.2.47), we have a logit model. We denote the logit MLE of $\beta$
by $\hat\beta_\Lambda$. In the remainder of this section we shall compare these two estimators.
The relative performance of the two estimators will critically depend on the
assumed true distribution for $x_i^*$. If (9.2.47) with $\Sigma_1 = \Sigma_0$ is assumed in addition
to (9.2.52), the DA estimator is the genuine MLE and therefore should be
asymptotically more efficient than the logit MLE. However, if (9.2.47) is not
assumed, the DA estimator loses its consistency in general, whereas the logit
MLE retains its consistency. Thus we would expect the logit MLE to be more
robust.
Efron (1975) assumed the DA model to be the correct model and studied the
loss of efficiency that results if $\beta$ is estimated by the logit MLE. He used the
asymptotic mean of the error rate as a measure of the inefficiency of an
estimator. Conditional on a given estimator $\hat\beta$ (be it $\hat\beta_{DA}$ or $\hat\beta_\Lambda$), the error rate is
defined by
$$\text{Error Rate} = P[x'\hat\beta \ge 0 \mid x \sim N(\mu_0, \Sigma)]\, q_0 + P[x'\hat\beta < 0 \mid x \sim N(\mu_1, \Sigma)]\, q_1 \qquad (9.2.61)$$
$$= q_0\, \Phi[(\hat\beta'\Sigma\hat\beta)^{-1/2}\, \hat\beta'\mu_0] + q_1\, \Phi[-(\hat\beta'\Sigma\hat\beta)^{-1/2}\, \hat\beta'\mu_1].$$
Efron derived the asymptotic mean of (9.2.61) for each of the cases $\hat\beta = \hat\beta_{DA}$
and $\hat\beta = \hat\beta_\Lambda$, using the asymptotic distributions of the two estimators. Defining
the relative efficiency of the logit ML estimator as the ratio of the asymptotic
mean of the error rate of the DA estimator to that of the logit ML estimator,
Efron found that the efficiency ranges between 40 and 90% for the various
experimental parameter values he chose.
Press and Wilson (1978) compared the classification derived from the two
estimators in two real data examples in which many of the independent
variables are binary and therefore clearly violate the DA assumption (9.2.47).
Their results indicated a surprisingly good performance by DA (only slightly
worse than the logit MLE) in terms of the percentage of correct classification
both for the sample observations and for the validation set.
Amemiya and Powell (1983), motivated by the studies of Efron, Press, and
Wilson, considered a simple model with characteristics similar to the two
examples of Press and Wilson and analyzed it using the asymptotic techniques
analogous to those of Efron. They compared the two estimators in a logit
model with two binary independent variables. The criteria they used were the
asymptotic mean of the probability of correct classification (PCC) (that is, one
minus the error rate) and the asymptotic mean squared error. They found that
in terms of the PCC criterion, the DA estimator does very well, only slightly
worse than the logit MLE, thus confirming the results of Press and Wilson. For
all the experimental parameter values they considered, the lowest efficiency of
the DA estimator in terms of the PCC criterion was 97%. The DA estimator
performed quite well in terms of the mean squared error criterion as well,
although it did not do as well as it did in terms of the PCC criterion and it did
poorly for some parameter values. Although the DA estimator is inconsistent
in the model they considered, the degree of inconsistency (the difference
between the probability limit and the true value) was surprisingly small in a
majority of the cases. Thus normal discriminant analysis seems more robust
against nonnormality than we would intuitively expect.
We should point out, however, that their study was confined to the case of
binary independent variables; the DA estimator may not be robust against a
different type of nonnormality. McFadden (1976a) illustrated a rather signifi-
cant asymptotic bias of a DA estimator in a model in which the marginal
distribution of the independent variable is normal. [Note that when we spoke
of normality we referred to each of the two conditional distributions given in
(9.2.47). The marginal distribution of x* is not normal in the DA model but,
rather, is a mixture of normals.] Lachenbruch, Sneeringer, and Revo (1973)
also reported a poor performance of the DA estimator in certain nonnormal
models.
Once we calculate the asymptotic mean and variance, we have all the neces-
sary information for predicting r.
If we actually observe every x i , i = 1, 2, . . . , n, we can calculate the mean
and variance for (9.2.62) straightforwardly. However, because it is more real-
istic to assume that we cannot observe every x i , we shall consider the problem
of how to estimate the mean and variance for (9.2.62) in that situation. For
that purpose we assume that $\{x_i\}$ are i.i.d. random variables with the common
$K$-variate distribution function $G$. Then the asymptotic mean and variance of
$r$ can be estimated by $EF(x'\hat\beta)$ and $n^{-1}EF(1 - F)$, respectively, where $E$ is the
expectation taken with respect to $G$.
Westin (1974) studied the evaluation of EF when F = A (logistic distribu-
tion) and $x \sim N(\mu_x, \Sigma_x)$. He noted that the density of $p = \Lambda(x'\beta)$ is given by
$$f(p) = \frac{1}{\sqrt{2\pi}\,\sigma\, p(1 - p)} \exp\left\{ -\frac{1}{2\sigma^2} \left[ \log\frac{p}{1 - p} - \mu \right]^2 \right\}, \qquad (9.2.63)$$
where $\mu = \beta'\mu_x$ and $\sigma^2 = \beta'\Sigma_x\beta$. Because the mean of this density does not
have a closed form expression, we must evaluate $\int p f(p)\, dp$ numerically for
given values of $\mu$ and $\sigma^2$.
McFadden and Reid (1975) showed that if $F = \Phi$ (standard normal distribution)
and $x \sim N(\mu_x, \Sigma_x)$ as before, we have
$$E\Phi(x'\beta) = \Phi[(1 + \sigma^2)^{-1/2}\mu]. \qquad (9.2.64)$$
Thus the evaluation of EF is much simpler than in the logit case.
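The result (9.2.64) is easy to confirm by simulation (with arbitrary illustrative values of $\mu$ and $\sigma$):

```python
import math, random

# McFadden-Reid result (9.2.64): if x'beta ~ N(mu, sigma^2), then
# E Phi(x'beta) = Phi(mu / sqrt(1 + sigma^2)).
def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 0.4, 0.9                  # arbitrary illustrative values
closed_form = Phi(mu / math.sqrt(1.0 + sigma ** 2))

random.seed(4)
S = 200_000
mc = sum(Phi(random.gauss(mu, sigma)) for _ in range(S)) / S

assert abs(mc - closed_form) < 0.005
```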
Neither Westin nor McFadden and Reid considered the evaluation of the
asymptotic variance of r, which constitutes an important piece of information
for the purpose of prediction.
Another deficiency of these studies is that the variability due to the estimation
of $\beta$ is totally ignored. We shall suggest a partially Bayesian way to deal
with this problem.5 Given an estimate $\hat\beta$ of $\beta$, we treat $\beta$ as a random variable
with the distribution $N(\hat\beta, \Sigma_{\hat\beta})$. An estimate of the asymptotic covariance
matrix of the estimator $\hat\beta$ can be used for $\Sigma_{\hat\beta}$. We now regard (9.2.62) as the
,
+ Vier' E F1 ].
t I
shall discuss various types of the multinomial QR model and the specific
problems that arise with them.
Assuming that the dependent variable $y_i$ takes $m_i + 1$ values $0, 1, 2, \ldots,
m_i$, we define a general multinomial QR model as
$$P(y_i = j) = F_{ij}(x_i^*, \theta), \qquad i = 1, 2, \ldots, n \quad\text{and}\quad j = 1, 2, \ldots, m_i, \qquad (9.3.1)$$
where $x_i^*$ and $\theta$ are vectors of independent variables and parameters, respectively.
(Strictly speaking, we should write $j$ as $j_i$, but we shall suppress the
subscript $i$.) We sometimes write (9.3.1) simply as $P_{ij} = F_{ij}$. We shall allow the
possibility that not all the independent variables and parameters are included
in the argument of every $F_{ij}$. Note that $P(y_i = 0)$ ($\equiv P_{i0}$) need not be specified
because it must be equal to one minus the sum of the $m_i$ probabilities defined
in (9.3.1).
It is important to let m i depend on i because in many applications individ-
uals face different choice sets. For example, in transport modal choice analy-
sis, traveling by train is not included in the choice set of those who live outside
of its service area.
To define the maximum likelihood estimator of $\theta$ in the model (9.3.1) it is
useful to define $\sum_{i=1}^{n}(m_i + 1)$ binary variables
$$y_{ij} = 1 \quad\text{if } y_i = j \qquad (9.3.2)$$
$$\phantom{y_{ij}} = 0 \quad\text{if } y_i \ne j, \qquad i = 1, 2, \ldots, n \quad\text{and}\quad j = 0, 1, \ldots, m_i.$$
Then we can write the log likelihood function as
$$\log L = \sum_{i=1}^{n} \sum_{j=0}^{m_i} y_{ij} \log F_{ij}, \qquad (9.3.3)$$
which is a natural generalization of (9.2.7). The MLE of $\theta$ is defined as a
solution of the normal equation $\partial \log L/\partial\theta = 0$.
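The log likelihood (9.3.3) reduces, for each $i$, to the log probability of the chosen alternative; a minimal sketch:

```python
import math

def multinomial_loglik(P, y):
    # log L = sum_i sum_j y_ij log F_ij, eq. (9.3.3); because y_ij = 1 only
    # for the chosen alternative, each i contributes log P[i][y[i]].
    return sum(math.log(probs[yi]) for probs, yi in zip(P, y))

# Two individuals facing three alternatives (0, 1, 2); probabilities invented.
P = [[0.2, 0.5, 0.3],
     [0.6, 0.1, 0.3]]
y = [1, 0]          # observed choices
ll = multinomial_loglik(P, y)
assert abs(ll - (math.log(0.5) + math.log(0.6))) < 1e-12
```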
Many of the results about the MLE in the binary case hold for the model
(9.3.1) as well. A reasonable set of sufficient conditions for its consistency and
asymptotic normality can be found using the relevant theorems of Chapter 4,
as we have done for the binary case. The equivalence of the method of scoring
and the NLWLS (NLGLS to be exact) iteration can also be shown. However,
we shall demonstrate these things under a slightly less general model than
(9.3.1). We assume6
and
$$E\,\frac{\partial^2 \log L}{\partial\beta_k\,\partial\beta_l'} = -\sum_{i=1}^{n} x_{ik}\, f_{ik}'\, \Lambda_i^{-1} f_{il}\, x_{il}'. \qquad (9.3.12)$$
Suppose $\hat\beta_1$ is the initial estimate; let $\hat F_i$, $\hat f_i$, and $\hat\Lambda_i$ be, respectively, $F_i$, $f_i$,
and $\Lambda_i$ evaluated at $\hat\beta_1$, and define
$$\hat A_{kl} = \sum_{i=1}^{n} x_{ik}\, \hat f_{ik}'\, \hat\Lambda_i^{-1} \hat f_{il}\, x_{il}'. \qquad (9.3.13)$$
$$y_i = F_i + u_i, \qquad i = 1, 2, \ldots, n, \qquad (9.3.16)$$
where $Eu_i = 0$, $Eu_iu_i' = \Lambda_i$, and $Eu_iu_j' = 0$ for $i \ne j$. Expanding $F_i$ in a Taylor series
$$t = 1, 2, \ldots, T \quad\text{and}\quad k = 1, 2, \ldots, m,$$
where $f_{tk} = \partial G_k/\partial P_t$. Equation (9.3.20) is a generalization of (9.2.30), but
here we write it as an approximate equation, ignoring an error term that
corresponds to w, in (9.2.30). Equation (9.3.20) is an approximate linear
regression equation with a nonscalar covariance matrix that depends
Step 2. In (9.3.17) replace $F_i$ and $f_i$ by $\hat F_i$ and $\hat f_i$ evaluated at $\hat\beta_{1k}$ and replace
$x_{ik}'\beta_k$ by $x_{ik}'\hat\beta_{1k}$.
Step 3. Apply FGLS to (9.3.17) using $\hat\Lambda_i$ evaluated at $\hat\beta_{1k}$.
We shall conclude this subsection by generalizing WSSR defined by (9.2.42)
and (9.2.45) to the multinomial case.
We shall not write the analog of (9.2.42) explicitly because it would require
cumbersome notation, although the idea is simple. Instead, we shall merely
point out that it is of the form $(\hat y - \hat G)'\hat\Omega^{-1}(\hat y - \hat G)$ obtained from the
regression equation (9.3.20). It is asymptotically distributed as chi-square with
degrees of freedom equal to $mT$ minus the number of regression parameters.
Next, consider the generalization of (9.2.45). Define the vector $\hat F_t = F_t(\hat\theta)$.
Then, the analog of (9.2.45) is
for some distribution function F. If F = 41), (9.3.25) defines the ordered probit
model; and if $F = \Lambda$, it defines the ordered logit model. Pratt (1981) showed
that the log likelihood function of the model (9.3.25) based on observations
$(y_i, x_i)$, $i = 1, 2, \ldots, n$, on $(y, x)$ is globally concave if $f$, the derivative of $F$, is
positive and $\log f$ is concave.
The model (9.3.25) is motivated by considering an unobserved continuous
random variable y* that determines the outcome of y by the rule
$$y = j \quad\text{if and only if}\quad \alpha_j < y^* < \alpha_{j+1}, \qquad j = 0, 1, \ldots, m. \qquad (9.3.26)$$
If the distribution function of $y^* - x'\beta$ is $F$, (9.3.26) implies (9.3.25).
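The rule (9.3.26) translates directly into choice probabilities $P(y = j) = F(\alpha_{j+1} - x'\beta) - F(\alpha_j - x'\beta)$; a sketch for the ordered probit case, with invented cutoff values:

```python
import math

def Phi(z):
    # Standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(xb, alpha):
    # P(y = j) = Phi(alpha_{j+1} - x'beta) - Phi(alpha_j - x'beta),
    # with alpha_0 = -infinity and alpha_{m+1} = +infinity.
    cuts = [-math.inf] + list(alpha) + [math.inf]
    return [Phi(cuts[j + 1] - xb) - Phi(cuts[j] - xb)
            for j in range(len(cuts) - 1)]

probs = ordered_probit_probs(xb=0.3, alpha=[-1.0, 0.5])  # three categories
assert abs(sum(probs) - 1.0) < 1e-12
assert all(p > 0.0 for p in probs)
```

The probabilities telescope to one by construction, which is why only the cutoffs and the single index $x'\beta$ need to be estimated.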
In empirical applications of the ordered model, y* corresponds to a certain
interpretative concept. For example, in the study of the effect of an insecticide
by Gurland, Lee, and Dahm (1960), y* signifies the tolerance of an insect
against the insecticide. Depending on the value of y*, y takes three discrete
values corresponding to the three states of an insect: dead, moribund, and
alive. In the study by David and Legg (1975), y* is the unobserved price of a
house, and the observed values of y correspond to various ranges of the price of
a house. In the study by Silberman and Talley (1974), y* signifies the excess
demand for banking in a particular location and y the number of chartered
bank offices in the location. See also Example 9.4.1 in Section 9.4.1.
The use of the ordered model is less common in econometric applications
than in biometric applications. This must be due to the fact that economic
phenomena are complex and difficult to explain in terms of only a single
unobserved index variable. We should be cautious in using an ordered model
because if the true model is unordered, an ordered model can lead to serious
biases in the estimation of the probabilities. On the other hand, the cost of
using an unordered model when the true model is ordered is a loss of efficiency
rather than consistency.
We shall conclude this subsection by giving an econometric example of an
ordered model, which is also an interesting application of the MIN x2 method
discussed in Section 9.3.1.
EXAMPLE 9.3.1 (Deacon and Shapiro, 1975). In this article Deacon and
Shapiro analyzed the voting behavior of Californians in two recent referenda:
Rapid Transit Initiative (November 1970) and Coastal Zone Conservation
Act (November 1972). We shall take up only the former. Let AU i be the
difference between the utilities resulting from rapid transit and no rapid
transit for the ith individual. Deacon and Shapiro assumed that $\Delta U_i$ is distributed
logistically with mean $\mu_i$; that is, $P(\Delta U_i < x) = \Lambda(x - \mu_i)$, and that
$$+ \frac{1}{P(Y)[1 - P(Y)]}\,[\hat P(Y) - P(Y)]$$
and
$$\log \frac{\hat P(N)}{1 - \hat P(N)} \cong -x'(\beta_1 + \beta_2) + \frac{1}{P(N)[1 - P(N)]}\,[\hat P(N) - P(N)]. \qquad (9.3.33)$$
Note that (9.3.32) and (9.3.33) constitute a special case of (9.3.20). The error
terms of these two equations are heteroscedastic and, moreover, correlated
with each other. The covariance between the error terms can be obtained from
the result $\mathrm{Cov}[\hat P(Y), \hat P(N)] = -n^{-1} P(Y) P(N)$. The MIN $\chi^2$ estimates of
$(\beta_1 - \beta_2)$ and $(\beta_1 + \beta_2)$ are obtained by applying generalized least squares to
(9.3.32) and (9.3.33), taking into account both heteroscedasticity and the
correlation.
$$i = 1, 2, \ldots, n \quad\text{and}\quad j = 0, 1, \ldots, m,$$
where we can assume $x_{i0} = 0$ without loss of generality. The log likelihood
function is given by
$$\log L = \sum_{i=1}^{n} \sum_{j=0}^{m} y_{ij} \log P_{ij}. \qquad (9.3.35)$$
Following McFadden (1974), we shall show the global concavity of (9.3.35).
Following McFadden (1974), we shall show the global concavity of (9.3.35).
Differentiating (9.3.35) with respect to $\beta$, we obtain
$$\frac{\partial \log L}{\partial\beta} = \sum_i \sum_j \frac{y_{ij}}{P_{ij}}\, \frac{\partial P_{ij}}{\partial\beta}, \qquad (9.3.36)$$
where $\sum_i$ and $\sum_j$ denote $\sum_{i=1}^{n}$ and $\sum_{j=0}^{m}$, respectively. Differentiating (9.3.36)
further yields
$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = \sum_i \sum_j y_{ij} \left[ \frac{1}{P_{ij}}\, \frac{\partial^2 P_{ij}}{\partial\beta\,\partial\beta'} - \frac{1}{P_{ij}^2}\, \frac{\partial P_{ij}}{\partial\beta}\, \frac{\partial P_{ij}}{\partial\beta'} \right]. \qquad (9.3.37)$$
the probability the ith person chooses alternative $j$ as (suppressing the subscript
$i$ from $\mu_{ij}$ as well as from $\epsilon_j$)
$$P(y_i = 2) = P(U_2 > U_0,\ U_2 > U_1) \qquad (9.3.42)$$
$$= \frac{\exp(\mu_2)}{\exp(\mu_0) + \exp(\mu_1) + \exp(\mu_2)}.$$
Expression (9.3.42) is equal to $P_{i2}$ given in (9.3.34) if we put $\mu_2 - \mu_0 = x_{i2}'\beta$
and $\mu_1 - \mu_0 = x_{i1}'\beta$. The expressions for $P_{i0}$ and $P_{i1}$ can be similarly derived.
The fact that $\beta$ is constant for all the modes makes this model useful in
predicting the demand for a certain new mode that comes into existence.
Suppose that an estimate $\hat\beta$ of $\beta$ has been obtained in the model with three
modes (Example 9.3.2) and that the characteristics $z_{i3}$ of a new mode (designated
by subscript 3) have been ascertained from engineering calculations and
a sample survey. Then the probability that the ith person will use the new
mode (assuming that the new mode is accessible to the person) can be esti-
mated by
$$P_{i3} = \frac{\exp(x_{i3}'\hat\beta)}{1 + \exp(x_{i1}'\hat\beta) + \exp(x_{i2}'\hat\beta) + \exp(x_{i3}'\hat\beta)}, \qquad (9.3.44)$$
where $x_{i3} = z_{i3} - z_{i0}$.
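A sketch of the prediction (9.3.44); the coefficients and mode characteristics are invented, and the check at the end illustrates the independence-from-irrelevant-alternatives property of the multinomial logit (adding the new mode leaves the ratio of the old shares unchanged):

```python
import math

def mode_probs(xs, beta):
    # Multinomial logit shares with alternative 0 as the base (x_0 = 0):
    # P_j = exp(x_j' beta) / [1 + sum_k exp(x_k' beta)], cf. (9.3.44).
    expu = [math.exp(sum(b * v for b, v in zip(beta, x))) for x in xs]
    denom = 1.0 + sum(expu)
    return [e / denom for e in expu]

beta = [-0.05, -0.10]                  # invented coefficients on (time, cost)
old = [[20.0, 2.0], [30.0, 1.0]]       # two existing modes, net of mode 0
new = [15.0, 3.0]                      # engineering estimate for the new mode

p_before = mode_probs(old, beta)
p_after = mode_probs(old + [new], beta)

# IIA: the ratio of the existing modes' shares is unchanged.
assert abs(p_after[0] / p_after[1] - p_before[0] / p_before[1]) < 1e-12
assert 0.0 < p_after[2] < 1.0
```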
the second project have nothing substantial in common except that both are
number j routes. However, this difference is not essential because in this type
of model each alternative is completely characterized by its characteristics
vector x, and a common name such as car is just as meaningless as a number j
in the operation of the model.
McFadden tested the IIA hypothesis by reestimating one of his models
using the choice set that consists of the chosen route and one additional route
randomly selected from mi . The idea is that if this hypothesis is true, estimates
obtained from a full set of alternatives should be close to estimates obtained by
randomly eliminating some nonchosen alternatives. For each coefficient the
difference between the two estimates was found to be less than its standard
deviation, a finding indicating that the hypothesis is likely to be accepted.
However, to be exact, we must test the equality of all the coefficients simulta-
neously.
Such a test, an application of Hausman's test (see Section 4.5.1), is devel-
oped with examples in the article by Hausman and McFadden (1984). They
tested the IIA hypothesis in a trichotomous logit model for which the three
alternatives are owning an electric dryer, a gas dryer, or no dryer. In one
experiment, data on the households without a dryer were discarded to obtain a
consistent but inefficient estimator, and in the other experiment, data on
those owning electric dryers were discarded. In both experiments Hausman's
test rejected the IIA hypothesis at less than 1% significance level. Alternative
tests of the IIA hypothesis will be discussed in Section 9.3.5.
mial logit model. As for $\epsilon_0$, we assume $F(\epsilon_0) = \exp[-\exp(-\epsilon_0)]$ as in the
multinomial logit model.
Under these assumptions we can show
$$P(y = 0) = \frac{\exp(\mu_0)}{\exp(\mu_0) + [\exp(\rho^{-1}\mu_1) + \exp(\rho^{-1}\mu_2)]^{\rho}} \qquad (9.3.51)$$
and
$$P(y = 1 \mid y \ne 0) = \frac{\exp(\rho^{-1}\mu_1)}{\exp(\rho^{-1}\mu_1) + \exp(\rho^{-1}\mu_2)}. \qquad (9.3.52)$$
The other probabilities can be deduced from (9.3.51) and (9.3.52). Therefore
these two equations define a nested logit model in the trichotomous case. By
dividing the numerator and the denominator of (9.3.51) by $\exp(\mu_0)$ and those
of (9.3.52) by $\exp(\rho^{-1}\mu_1)$, we note that the probabilities depend on $\mu_1 - \mu_0$,
$\mu_2 - \mu_0$, and $\rho$. We would normally specify $\mu_j = x_j'\beta$, $j = 0, 1, 2$. The estimation
of $\beta$ and $\rho$ will be discussed for more general nested logit models later.
The form of these two probabilities is intuitively attractive. Equation
(9.3.52) shows that the choice between the two similar alternatives is made
according to a binary logit model, whereas (9.3.51) suggests that the choice
between car and noncar is also like a logit model except that a certain kind of
weighted average of $\exp(\mu_1)$ and $\exp(\mu_2)$ is used.
To obtain (9.3.51), note that
$$P(y = 0) = P(U_0 > U_1,\ U_0 > U_2) \qquad (9.3.53)$$
$$= P(\mu_0 + \epsilon_0 > \mu_1 + \epsilon_1,\ \mu_0 + \epsilon_0 > \mu_2 + \epsilon_2)$$
$$= \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\epsilon_0 + \mu_0 - \mu_1} \int_{-\infty}^{\epsilon_0 + \mu_0 - \mu_2} f(\epsilon_1, \epsilon_2)\, d\epsilon_2\, d\epsilon_1 \right] \exp(-\epsilon_0) \exp[-\exp(-\epsilon_0)]\, d\epsilon_0$$
$$= \int_{-\infty}^{\infty} \exp(-\epsilon_0) \exp[-\exp(-\epsilon_0)] \exp\bigl( -\{ \exp[-\rho^{-1}(\epsilon_0 + \mu_0 - \mu_1)] + \exp[-\rho^{-1}(\epsilon_0 + \mu_0 - \mu_2)] \}^{\rho} \bigr)\, d\epsilon_0$$
$$= \bigl\{ 1 + [\exp(\rho^{-1}(\mu_1 - \mu_0)) + \exp(\rho^{-1}(\mu_2 - \mu_0))]^{\rho} \bigr\}^{-1},$$
where in the third equality we used
$$\frac{\partial F(\epsilon_1, \epsilon_2)}{\partial\epsilon_1} = [\exp(-\rho^{-1}\epsilon_1) + \exp(-\rho^{-1}\epsilon_2)]^{\rho - 1} \exp(-\rho^{-1}\epsilon_1)\, F(\epsilon_1, \epsilon_2). \qquad (9.3.56)$$
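The last step of (9.3.53) reduces to the identity $\int_{-\infty}^{\infty} \exp(-\epsilon)\exp[-(1+a)\exp(-\epsilon)]\,d\epsilon = 1/(1+a)$ for $a > 0$, which can be checked numerically (the value of $a$ below is arbitrary):

```python
import math

def integral(a, lo=-30.0, hi=30.0, n=60_000):
    # Trapezoidal approximation of
    # int exp(-eps) * exp(-(1 + a) * exp(-eps)) d eps over the real line
    # (the integrand is negligible outside [lo, hi] for moderate a).
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        eps = lo + k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * math.exp(-eps) * math.exp(-(1.0 + a) * math.exp(-eps))
    return total * h

a = 4.0
assert abs(integral(a) - 1.0 / (1.0 + a)) < 1e-4
```

Substituting $u = \exp(-\epsilon)$ turns the integrand into $\exp[-(1+a)u]$, which integrates to $1/(1+a)$ exactly.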
$$F(\epsilon_1, \ldots, \epsilon_m) = \exp\left\{ -\sum_{s=1}^{S} a_s \left[ \sum_{j \in B_s} \exp(-\rho_s^{-1}\epsilon_j) \right]^{\rho_s} \right\}. \qquad (9.3.58)$$
Then it can be shown that
$$P(y \in B_s) = \frac{a_s \left[ \sum_{j \in B_s} \exp(\rho_s^{-1}\mu_j) \right]^{\rho_s}}{\sum_{k=1}^{S} a_k \left[ \sum_{j \in B_k} \exp(\rho_k^{-1}\mu_j) \right]^{\rho_k}}, \qquad s = 1, 2, \ldots, S, \qquad (9.3.59)$$
and
$$P(y = j \mid y \in B_s) = \frac{\exp(\rho_s^{-1}\mu_j)}{\sum_{k \in B_s} \exp(\rho_s^{-1}\mu_k)}, \qquad j \in B_s, \quad s = 1, 2, \ldots, S. \qquad (9.3.60)$$
kEB,
Note that (9.3.59) and (9.3.60) are generalizations of (9.3.51) and (9.3.52),
respectively. Clearly, these probabilities define the model completely. As be-
fore, we can interpret
EXAMPLE 9.3.4 (Small and Brownstone, 1982). Small and Brownstone ap-
plied a nested logit model to analyze trip timing. The dependent variable takes
12 values corresponding to different arrival times, and the authors experi-
mented with various ways of "nesting" the 12 responses, for example, $B_1 =
\{1, 2, \ldots, 8\}$ and $B_2 = \{9, 10, 11, 12\}$, or $B_1 = \{1, 2, \ldots, 8\}$, $B_2 = \{9\}$, and
$B_3 = \{10, 11, 12\}$. All the $a$'s were assumed to be equal to 1, and various
specifications of the $\rho$'s were tried. Small and Brownstone found that the two-step
estimator had much larger variances than the MLE and often yielded unrea-
sonable values. Also, the computation of the asymptotic covariance matrix of
the two-step estimator took as much time as the second-round estimator
obtained in the Newton-Raphson iteration, even though the fully iterated
Newton-Raphson iteration took six times as much time.
In the three-level nested logit model the probabilities are built up in the same way. Writing the inclusive value of group B_s as

h_s = a_s [Σ_{j∈B_s} exp(μ_j/ρ_s)]^{ρ_s},   (9.3.62)

the probability of choosing a group within a given class is

P(y ∈ B_s | y ∈ C_u) = h_s / Σ_{s′∈C_u} h_{s′},   s ∈ C_u,   (9.3.65)
Figure 9.1 Two-level and three-level nested logit models
306 Advanced Econometrics
and

P(y ∈ C_u) = b_u Σ_{s∈C_u} a_s [Σ_{j∈B_s} exp(μ_j/ρ_s)]^{ρ_s}
             / Σ_{u′} b_{u′} Σ_{s∈C_{u′}} a_s [Σ_{j∈B_s} exp(μ_j/ρ_s)]^{ρ_s}
           ≡ h_u / Σ_{u′} h_{u′}.   (9.3.66)
and

P_{i0} = 1 / [1 + exp(g_{i1}) + exp(g_{i2})],   (9.3.71)

where g_{i1} and g_{i2} are functions of all the explanatory variables of the model: z_{i1}, z_{i2}, and w_i. Any arbitrary trichotomous model can be approximated by this model by choosing the functions g_{i1} and g_{i2} appropriately. As long as g_{i1} and g_{i2} depend on all the mode characteristics, the universal logit model does not satisfy the assumption of the independence from irrelevant alternatives.
When the g's are linear in the explanatory variables with coefficients that
generally vary with the alternatives, the model is reduced to a multinomial
logit model sometimes used in applications (see Cox, 1966, p. 65), which
differs from the one defined in Section 9.3.3.
Albright, Lerman, and Manski (1977) (see also Lerman and Manski, 1981)
developed a computer program for calculating the ML estimator (by a gra-
dient iterative method) of a multinomial probit model similar to, but slightly
more general than, the model of Hausman and Wise. Their specification is
U_{ij} = x′_{ij}β_i + ε_{ij},   (9.3.74)

where β_i ~ N(β̄, Σ_β) and ε_i = (ε_{i0}, ε_{i1}, . . . , ε_{im})′ ~ N(0, Σ_ε). As in Hausman and Wise, β_i and ε_i are assumed to be independent of each other and independent across i. A certain normalization is employed on the parameters to make them identifiable.
An interesting feature of their computer program is that it gives the user the
option of calculating the probability in the form (9.3.72) at each step of the
iteration either (1) by simulation or (2) by Clark's approximation (Clark,
1961), rather than by series expansion. The authors claim that their program
can handle as many as ten alternatives. The simulation works as follows:
Consider evaluating (9.3.72), for example. We artificially generate many observations on U₀, U₁, and U₂ according to f evaluated at particular parameter
values and simply estimate the probability by the observed frequency. Clark's
method is based on a normal approximation of the distribution of max (X, Y)
when both X and Y are normally distributed. The exact mean and variance of
max (X, Y), which can be evaluated easily, are used in the approximation.
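Both devices can be sketched in a few lines. The code below is our own illustration with arbitrary parameter values, not the authors' program: the first function gives Clark's exact moments of max(X, Y) for two correlated normals (in Clark's method these moments are used to treat the maximum as if it were normal, applied recursively over alternatives), and the second estimates a trichotomous choice probability by simulated frequency.

```python
import math
import random

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clark_max_moments(m1, s1, m2, s2, r):
    """Clark's (1961) exact mean and variance of max(X, Y) for
    X ~ N(m1, s1^2), Y ~ N(m2, s2^2) with correlation r."""
    a = math.sqrt(s1 * s1 + s2 * s2 - 2.0 * r * s1 * s2)
    alpha = (m1 - m2) / a
    mean = m1 * Phi(alpha) + m2 * Phi(-alpha) + a * phi(alpha)
    second = ((m1 * m1 + s1 * s1) * Phi(alpha)
              + (m2 * m2 + s2 * s2) * Phi(-alpha)
              + (m1 + m2) * a * phi(alpha))
    return mean, second - mean * mean

def simulated_choice_prob(mu, n=100_000, seed=0):
    """Frequency-simulation estimate of P(U0 > U1, U0 > U2) for
    independent normal utilities U_j ~ N(mu_j, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        u = [m + rng.gauss(0.0, 1.0) for m in mu]
        hits += u[0] > u[1] and u[0] > u[2]
    return hits / n
```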
Albright, Lerman, and Manski performed a Monte Carlo study, which
showed Clark's method to be quite accurate. However, several other authors
have contradicted this conclusion, saying that Clark's approximation can be
quite erratic in some cases (see, for example, Horowitz, Sparmann, and Daganzo, 1982).
Albright et al. applied their probit model to the same data that Hausman
and Wise used, and they estimated the parameters of their model by Clark's
method. Their model is more general than the model of Hausman and Wise,
and their independent variables contained additional variables such as mode-
specific dummies and the number of automobiles in the household. They also
estimated an independent logit model. Their conclusions were as follows:
1. Their probit and logit estimates did not differ by much. (They compared
the raw estimates rather than comparing ∂P/∂x for each independent
variable.)
2. They could not obtain accurate estimates of Σ_ε in their probit model.
3. Based on the Akaike Information Criterion (Section 4.5.2), an increase
in log L in their probit model as compared to their logit model was not
large enough to compensate for the loss in degrees of freedom.
310 Advanced Econometrics
4. The logit model took 4.5 seconds per iteration, whereas the probit model took 60 to 75 CPU seconds per iteration on an IBM 370 Model 158.
Thus, though they demonstrated the feasibility of their probit model, the
gain of using the probit model over the independent logit did not seem to
justify the added cost for this particular data set.
The discrepancy between the conclusions of Hausman and Wise and those
of Albright et al. is probably due to the fact that Hausman and Wise imposed
certain zero specifications on the covariance matrix, an observation that
suggests that covariance specification plays a crucial role in this type of model.
and

P(y_i = 2 | y_i ≠ 0, y_i ≠ 1) = Λ(x′_iβ₂).   (9.3.79)
Note that Kahn and Morimune could have used an ordered logit model
with their data because we can conjecture that a continuous unobservable
variable (interpreted as a measure of the tendency for unemployment)
exists that affects the discrete outcome. Specifying y*_i = x′_iβ + ε_i would lead to
one of the ordered models discussed in Section 9.3.2.
            y₂ = 1    y₂ = 0
y₁ = 1      P₁₁       P₁₀
y₁ = 0      P₀₁       P₀₀
multivariate model but in which its specific multivariate features are (perhaps
justifiably) ignored. By doing this, we hope to shed some light on the distinc-
tion between a multivariate model and the other multinomial models.
EXAMPLE 9.4.1 (Silberman and Durden, 1976). Silberman and Durden
analyzed how representatives voted on two bills (House bill and Substitute
bill) concerning the minimum wage. The independent variables are the socio-
economic characteristics of a legislator's congressional district, namely, the
campaign contribution of labor unions, the campaign contribution of small
businesses, the percentage of persons earning less than $4000, the percentage
of persons aged 16 to 21, and a dummy for the South. Denoting House bill and
Substitute bill by H and S, the actual counts of votes on the bills are given in
Table 9.2.
The zero count in the last cell explains why Silberman and Durden did not
set up a multivariate QR model to analyze the data. Instead, they used an
ordered probit model by ordering the three nonzero responses in the order of a
representative's feeling in favor of the minimum wage as shown in Table 9.3.
Assuming that y* is normally distributed with mean linearly dependent on the independent variables, Silberman and Durden specified the probabilities as

P_{i0} = Φ(x′_iβ)   (9.4.1)

and

P_{i0} + P_{i1} = Φ(x′_iβ + α),   (9.4.2)

where α > 0.
An alternative specification that takes into consideration the multivariate
nature of the problem and at the same time recognizes the zero count of the
last cell may be developed in the form of a sequential probit model (Section
9.3.10) as follows:
P(H_i = Yes) = Φ(x′_iβ₁)   (9.4.3)

and

P(S_i = No | H_i = Yes) = Φ(x′_iβ₂).   (9.4.4)
The choice between these two models must be determined, first, from a
theoretical standpoint based on an analysis of the legislator's behavior and,
second, if the first consideration is not conclusive, from a statistical standpoint
based on some measure of goodness of fit. Because the two models involve
different numbers of parameters, adjustments for the degrees of freedom, such
as the Akaike Information Criterion (Section 4.5.2), must be employed. The
problem boils down to the question, Is the model defined by (9.4.3) and (9.4.4)
sufficiently better than the model defined by (9.4.1) and (9.4.2) to compensate
for a reduction in the degrees of freedom?
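The adjustment for degrees of freedom can be made concrete with a small sketch. The log-likelihood values below are hypothetical, chosen only to show that the model with the higher log L can still lose once the AIC penalty for extra parameters is applied; with K explanatory variables, the ordered model (9.4.1)-(9.4.2) uses K + 1 parameters (β and α) while the sequential model (9.4.3)-(9.4.4) uses 2K (β₁ and β₂).

```python
def aic(log_l, k):
    """Akaike Information Criterion, -2 log L + 2k; smaller is better."""
    return -2.0 * log_l + 2.0 * k

# Hypothetical fitted log likelihoods (not from Silberman and Durden's data)
K = 6
aic_ordered = aic(-210.3, K + 1)      # beta and alpha: K + 1 parameters
aic_sequential = aic(-205.9, 2 * K)   # beta_1 and beta_2: 2K parameters
better = "ordered" if aic_ordered < aic_sequential else "sequential"
```

Here the sequential model fits better in raw log likelihood but carries K − 1 more parameters, and the ordered model wins on AIC.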
9.4.2 Multivariate Nested Logit Model
The model to be discussed in this subsection is identical to the nested logit
model discussed in Section 9.3.5. We shall merely give an example of its use in
a multivariate situation. We noted earlier that the nested logit model is useful
whenever a set of alternatives can be classified into classes each of which
contains similar alternatives. It is useful in a multivariate situation because the
alternatives can be naturally classified according to the outcome of one of the
variables. For example, in a 2 × 2 case such as in Table 9.1, the four alternatives can be classified according to whether y₁ = 1 or y₁ = 0. Using a parameterization similar to Example 9.3.5, we can specialize the nested logit model to a 2 × 2 multivariate model as follows:
P(y₁ = 1) = a₁ exp(z′₁γ)[exp(ρ₁⁻¹x′₁₁β) + exp(ρ₁⁻¹x′₁₀β)]^{ρ₁}
            ÷ {a₁ exp(z′₁γ)[exp(ρ₁⁻¹x′₁₁β) + exp(ρ₁⁻¹x′₁₀β)]^{ρ₁}
             + a₀ exp(z′₀γ)[exp(ρ₀⁻¹x′₀₁β) + exp(ρ₀⁻¹x′₀₀β)]^{ρ₀}},   (9.4.5)

P(y₂ = 1 | y₁ = 1) = exp(ρ₁⁻¹x′₁₁β) / [exp(ρ₁⁻¹x′₁₁β) + exp(ρ₁⁻¹x′₁₀β)],   (9.4.6)
and

P(y₂ = 1 | y₁ = 0) = exp(ρ₀⁻¹x′₀₁β) / [exp(ρ₀⁻¹x′₀₁β) + exp(ρ₀⁻¹x′₀₀β)].   (9.4.7)
The two-step estimation method discussed in Section 9.3.5 can be applied to
this model.
            y₂ = 1                       y₂ = 0
y₁ = 1      d⁻¹ exp(α₁ + α₂ + α₁₂)       d⁻¹ exp(α₁)
y₁ = 0      d⁻¹ exp(α₂)                  d⁻¹

where d = 1 + exp(α₁) + exp(α₂) + exp(α₁ + α₂ + α₁₂). The significance of α₁₂ also can be seen in the following equation, which can be derived from Table 9.5:
P(y₁ = 1 | y₂) = Λ(α₁ + α₁₂y₂).   (9.4.9)
Goodman looked at the asymptotic t value of the MLE of each α (the MLE
divided by its asymptotic standard deviation) and tentatively concluded
α₀₁₂ = α₀₁₃ = α₀₁₂₃ = 0, called the null hypothesis. Then he proceeded to
accept formally the null hypothesis as a result of the following testing proce-
dure. Define P̂_t, t = 1, 2, . . . , 16, as the observed frequencies in the 16 cells created by all the possible joint outcomes of the four binary variables. (They can be interpreted as the unconstrained MLE's of the probabilities P_t.) Define P̃_t as the constrained MLE of P_t under the null hypothesis. Then we must
reject the null hypothesis if and only if
n Σ_{t=1}^{16} (P̂_t − P̃_t)² / P̃_t > χ²_α,   (9.4.13)

where n is the total number of soldiers in the sample and χ²_α is the α% critical value of the relevant χ² distribution. [Note that the left-hand side of (9.4.13) is analogous to (9.3.24).] Or, alternatively, we can use

2n Σ_{t=1}^{16} P̂_t log (P̂_t / P̃_t) > χ²_α.   (9.4.14)
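Both statistics are easy to compute side by side; the sketch below (our own, with made-up cell probabilities rather than Goodman's data) shows that they come out numerically close, as their asymptotic equivalence under the null suggests.

```python
import math

def pearson_stat(n, p_hat, p_tilde):
    """Left-hand side of (9.4.13): n * sum_t (Phat_t - Ptilde_t)^2 / Ptilde_t."""
    return n * sum((ph - pt) ** 2 / pt for ph, pt in zip(p_hat, p_tilde))

def lr_stat(n, p_hat, p_tilde):
    """Left-hand side of (9.4.14): 2n * sum_t Phat_t * log(Phat_t / Ptilde_t)."""
    return 2.0 * n * sum(ph * math.log(ph / pt)
                         for ph, pt in zip(p_hat, p_tilde) if ph > 0.0)

# made-up observed and constrained-fitted cell probabilities (4 cells for brevity)
p_hat = [0.30, 0.20, 0.25, 0.25]
p_tilde = [0.25, 0.25, 0.25, 0.25]
```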
We shall indicate how to generalize a log-linear model to the case of discrete
variables that take more than two values. This is done simply by using the
binary variables defined in (9.3.2). We shall illustrate this idea by a simple
example: Suppose there are two variables z and y₃ such that z takes the three values 0, 1, and 2 and y₃ takes the two values 0 and 1. Define two binary (0, 1) variables y₁ and y₂ by the rule: y₁ = 1 if z = 1 and y₂ = 1 if z = 2. Then we can specify P(z, y₃) by specifying P(y₁, y₂, y₃), which we can specify by a log-linear model as in (9.4.11). However, we should remember one small detail: Because in the present case y₁y₂ = 0 by definition, the two terms involving y₁y₂ on the right-hand side of (9.4.11) drop out.
In the preceding discussion we have touched upon only a small aspect of the
log-linear model. There is a vast amount of work on this topic in the statistical
literature. The interested reader should consult articles by Haberman (1978, 1979), Bishop, Fienberg, and Holland (1975), or the many references to Leo
Goodman's articles cited therein.
Qualitative Response Models 317
They also supposed that a coal miner develops wheeze (y₂ = 1) if his tolerance (y*₂) against wheeze is less than 0 and that y*₂ ~ N(−β′₂x, 1). Then we have

P(y₂ = 1) = Φ(β′₂x).   (9.4.16)
Now that we have specified the marginal probabilities of y₁ and y₂, the multivariate model is completed by specifying the joint probability P(y₁ = 1, y₂ = 1), which in turn is determined if the joint distribution of y*₁ and y*₂ is specified. Ashford and Sowden assumed that y*₁ and y*₂ are jointly normal with a correlation coefficient ρ. Thus

P(y₁ = 1, y₂ = 1) = F_ρ(β′₁x, β′₂x),   (9.4.17)

where F_ρ denotes the bivariate normal distribution function with zero means, unit variances, and correlation ρ.
The parameters β₁, β₂, and ρ can be estimated by MLE or MIN χ² (if there are many observations per cell for the latter method). Amemiya (1972) has given the MIN χ² estimation of this model.
comparison, Morimune used Cox's test (Section 4.5.3) and his own modifica-
tion of it. He concluded that probit was preferred to logit by either test.
It is interesting to ask whether we could specify an Ashford-Sowden type bivariate logit model by assuming the logistic distribution for y*₁ and y*₂ in the Ashford-Sowden model. Although there is no "natural" bivariate logistic
distribution the marginal distributions of which are logistic (unlike the normal
case), Lee (1982b) found that Plackett's bivariate logistic distribution function
(Plackett, 1965) yielded results similar to a bivariate probit model when
applied to Ashford and Sowden's data and Morimune's data. Furthermore, it
was computationally simpler.
Given two marginal distribution functions F(x) and G(y), Plackett's class of
bivariate distributions H(x, y) is defined by
H(1 F G + H)
(9.4.20)
(F H)(G H)
for any fixed yi in (0, co).
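Because the defining relation is quadratic in H, the joint distribution can be evaluated in closed form. The sketch below (our own; it uses Plackett's standard root formula, with the root choice that yields a proper distribution function) recovers H from the two marginal values and ψ.

```python
def plackett_H(F, G, psi):
    """Solve the relation psi = H(1-F-G+H) / ((F-H)(G-H)) for H, given
    marginal values F = F(x), G = G(y) and association parameter psi > 0.
    psi = 1 corresponds to independence, H = F * G."""
    if abs(psi - 1.0) < 1e-12:
        return F * G
    S = 1.0 + (psi - 1.0) * (F + G)
    disc = S * S - 4.0 * psi * (psi - 1.0) * F * G
    # the smaller root of the quadratic is the valid distribution function
    return (S - disc ** 0.5) / (2.0 * (psi - 1.0))
```

One can verify directly that the returned H satisfies the defining relation for any admissible F, G, and ψ.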
Unfortunately, this method does not easily generalize to a higher-order
multivariate distribution, where because of the computational burden of the
probit model the logit analog of a multivariate probit model would be espe-
cially useful. Some progress in this direction has been made by Malik and
Abraham (1973), who generalized Gumbel's bivariate logistic distribution
(Gumbel, 1961) to a multivariate case.
L_c = Π_{i=1}^{n} P(j_i | x_i, β) f(x_i) Q(j_i | β)⁻¹ H(j_i)   (9.5.3)

if Q(j|β₀) is known. Note that if g(x) = f(x) and H(j) = Q(j|β₀), (9.5.1) and (9.5.3) both become the likelihood function of a random sample. Under choice-based sampling we first draw j with probability H(j) and then, given j, draw x according to the conditional density f(x|j). Thus the joint probability is f(x|j)H(j), which by Bayes's rule is equal to P(j|x)f(x)Q(j)⁻¹H(j).
This sampling scheme is different from a scheme in which the proportion of
people choosing alternative j is a priori determined and fixed. This latter
scheme may be a more realistic one. (Hsieh, Manski, and McFadden, 1983,
have discussed this sampling scheme.) However, we shall adopt the definition
of the preceding paragraphs (following Manski and Lerman, 1977) because in
this way choice-based sampling contains random sampling as a special case
[Q(j)= H(j)] and because the two definitions lead to the same estimators
with the same asymptotic properties.
Choice-based sampling is useful in a situation where exogenous sampling or
random sampling would find only a very small number of people choosing a
particular alternative. For example, suppose a small proportion of residents in
a particular city use the bus for going to work. Then, to ensure that a certain
number of bus riders are included in the sample, it would be much less
expensive to interview people at a bus depot than to conduct a random survey
of homes. Thus it is expected that random sampling augmented with choice-
based sampling of rare alternatives would maximize the efficiency of estima-
tion within the budgetary constraints of a researcher. Such augmentation can
be analyzed in the framework of generalized choice-based sampling proposed
by Cosslett (1981a) (to be discussed in Section 9.5.4).
In the subsequent subsections we shall discuss four articles: Manski and Lerman (1977), Cosslett (1981a), Cosslett (1981b), and Manski and McFadden (1981). These articles together cover the four different types of models, varying according to whether f is known and whether Q is known, and cover five estimators of β: the exogenous sampling maximum likelihood estimator
(ESMLE), the random sampling maximum likelihood estimator (RSMLE),
the choice-based sampling maximum likelihood estimator (CBMLE), the
Manski-Lerman weighted maximum likelihood estimator (WMLE), and the
Manski-McFadden estimator (MME).
A comparison of RSMLE and CBMLE is important because within the
framework of choice-based sampling a researcher controls H(j), and the par-
ticular choice H(j) = Q₀(j) yields random sampling. The choice of H(j) is an important problem of sample design and, as we shall see later, H(j) = Q₀(j) is
not necessarily an optimal choice.
Table 9.6 indicates how the definitions of RSMLE and CBMLE vary with
the four types of model; it also indicates in which article each case is discussed.
Note that RSMLE = CBMLE if Q is known. ESMLE, which is not listed in
Table 9.6, is the same as RSMLE except when f is unknown and Q is known. In that case ESMLE maximizes the exogenous sampling likelihood with respect to β without constraints. RSMLE and CBMLE for the case of known Q will be referred to as the constrained RSMLE and the constrained CBMLE, respectively. For the case of unknown Q, we shall attach the word unconstrained to each estimator.
where the weights are defined by the equation

w(j) = Q₀(j) / H(j).   (9.5.6)

It will become apparent that the weights w(j) ensure the consistency of the estimator. If the weights were not used, (9.5.5) would be reduced to the exogenous sampling likelihood function (9.5.1), and the resulting estimator to the usual MLE (the ESMLE), which can be shown to be inconsistent unless H(j) = Q₀(j).
It should be noted that because WMLE does not depend on f(x), it can be
used regardless of whether or not f is known.
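A minimal sketch of the weighting idea for a binary logit with a scalar coefficient follows (our own illustration; the weight w(j) = Q₀(j)/H(j) is the Manski-Lerman choice, and the one-dimensional golden-section search is used only for compactness). When H(j) = Q₀(j) all weights are 1 and the objective reduces to the ordinary exogenous-sampling log likelihood.

```python
import math

def wmle_objective(beta, data, Q0, H):
    """Weighted log likelihood of the WMLE for a binary logit: each
    observation (j, x) enters with weight w(j) = Q0(j) / H(j)."""
    s = 0.0
    for j, x in data:
        p1 = 1.0 / (1.0 + math.exp(-beta * x))    # P(1 | x, beta)
        s += (Q0[j] / H[j]) * math.log(p1 if j == 1 else 1.0 - p1)
    return s

def golden_max(f, lo=-5.0, hi=5.0, tol=1e-8):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) > f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)
```

In practice the WMLE would be computed by maximizing `wmle_objective` over β, with Q₀ known (or estimated from a separate sample) and H chosen by the researcher.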
We shall prove the consistency of the WMLE β̂ in a somewhat different way from the authors' proof.¹² The basic theorems we shall use are Theorems 4.1.2 and 4.2.1. We need to make six assumptions.
ASSUMPTION 9.5.1. The parameter space B is an open subset of Euclidean
space.
ASSUMPTION 9.5.2. H(j) > 0 for every j = 0, 1, . . . , m.
ASSUMPTION 9.5.3. ∂ log P(j|x, β)/∂β exists and is continuous in an open neighborhood N₁(β₀) of β₀ for every j and x. (Note that this assumption requires P(j|x, β) > 0 in the neighborhood.)¹³
ASSUMPTION 9.5.4. P(j|x, β) is a measurable function of j and x for every β ∈ B.
It is clear that we must generally have c_j = 1 for every j in order for (9.5.10) to be 0.¹⁴
The asymptotic normality of β̂ can be proved with suitable additional assumptions by using Theorem 4.1.3. We shall present merely an outline of the derivation and shall obtain the asymptotic covariance matrix. The necessary rigor can easily be supplied by the reader.
Differentiating (9.5.5) with respect to β yields

∂S/∂β = Σ_{i=1}^{n} w(j_i) P(j_i)⁻¹ ∂P(j_i)/∂β.   (9.5.11)
Because (9.5.11) is a sum of i.i.d. random variables by Assumption 9.5.5, we
can show

n^{−1/2} (∂S/∂β)|_{β₀} → N(0, A),   (9.5.12)

where

A = E[w(j)²γγ′],   (9.5.13)

where γ = [∂ log P(j)/∂β]|_{β₀}. Differentiating (9.5.11) with respect to β′ yields

∂²S/∂β∂β′ = − Σ_{i=1}^{n} w(j_i) P(j_i)⁻² [∂P(j_i)/∂β][∂P(j_i)/∂β′]
            + Σ_{i=1}^{n} w(j_i) P(j_i)⁻¹ ∂²P(j_i)/∂β∂β′.   (9.5.14)

Using the fact that (9.5.14) is a sum of i.i.d. random variables, we can show

plim n⁻¹ (∂²S/∂β∂β′)|_{β₀} = −E[w(j)γγ′] ≡ −A*,   (9.5.15)

because

E[w(j) P₀(j)⁻¹ (∂²P(j)/∂β∂β′)|_{β₀}] = Σ_j ∫ (∂²P(j|x, β)/∂β∂β′)|_{β₀} f(x) dx   (9.5.16)
                                     = 0.

Therefore, from (9.5.12) and (9.5.15), we conclude

√n(β̂ − β₀) → N(0, A*⁻¹ A A*⁻¹).   (9.5.17)
As we noted earlier, a researcher controls H(j) and therefore faces an
interesting problem of finding an optimal design: What is the optimal choice
of H(j)? We shall consider this problem here in the context of WMLE. Thus
the question is, What choice of H(j) will minimize A*⁻¹AA*⁻¹?
First of all, we should point out that H(j) = Q₀(j) is not necessarily the optimal choice. If it were, it would mean that random sampling is always preferred to choice-based sampling. The asymptotic covariance matrix of √n(β̂ − β₀) when H(j) = Q₀(j) is (E*γγ′)⁻¹, where E* denotes the expectation taken with respect to the probability distribution P(j|x, β₀)f(x). Writing
E*w(j)γ² = a h⁻¹ + b (1 − h)⁻¹,   (9.5.23)

where

a = E[Λ(β₀x)] E{x² exp(β₀x)[1 + exp(β₀x)]⁻³}   (9.5.24)

and

b = E[1 − Λ(β₀x)] E{x² exp(2β₀x)[1 + exp(β₀x)]⁻³}.   (9.5.25)
The optimal value of h, which minimizes ah⁻¹ + b(1 − h)⁻¹, is h* = √a/(√a + √b), which equals 0.5 if a = b.
We assume x is binary with probability distribution:
x= 1 with probability p
=0 with probability 1 - p.
Then, inserting β₀ = log[(p + 2Q₀ − 1)/(p − 2Q₀ + 1)], where Q₀ = Q₀(1), into the right-hand side of (9.5.22), Eff(H) becomes a function of p, Q₀, and h alone. In the last five columns of Table 9.7, the values of Eff(H) are shown for various values of p, Q₀, and h. For each combination of p and Q₀, the value of
the optimal h, denoted h*, is shown in the third column. For example, if
p = 0.9 and Q0 = 0.75, the optimal value of h is equal to 0.481. When h is set
equal to this optimal value, the efficiency of WMLE is 1.387. The table shows
that the efficiency gain of using choice-based sampling can be considerable
and that h = 0.5 performs well for all the parameter values considered. It can
be shown that if Q0 = 0.5, then h* = 0.5 for all the values of p.
In the foregoing discussion of WMLE, we have assumed Q0(j) to be known.
However, it is more realistic to assume that Q0(j) needs to be estimated from a
separate sample and such an estimate is used to define w(j) in WMLE.
Manski and Lerman did not discuss this problem except to note in a footnote
Table 9.7 Efficiency of WMLE for various designs in a binary logit model
plim n^{−1/2} Σ_{i=1}^{n} [Q̂₀(j_i) − Q₀(j_i)] [∂ log P(j_i|x_i, β)/∂β]|_{β₀} = 0.   (9.5.28)

Therefore we need

Q̂₀(j) − Q₀(j) = o(n^{−1/2}).   (9.5.29)

If Q̂₀(j) is the proportion of people choosing alternative j in a separate sample of size n₁,

Q̂₀(j) − Q₀(j) = O(n₁^{−1/2}).   (9.5.30)
Therefore asymptotic equivalence requires that n/n₁ should converge to 0. See Hsieh, Manski, and McFadden (1983) for the asymptotic distribution of the WMLE with Q estimated in the case where n/n₁ converges to a nonzero constant.
An application of WMLE in a model explaining family participation in the
AFDC program can be found in the article by Hosek (1980).
where f is unknown and Q is known. Because the results presented here are
straightforward, we shall only sketch the proof of consistency and asymptotic
normality and shall not spell out all the conditions needed.
First, consider the case where both f and Q are known. CBMLE maximizes L_c given in (9.5.3) subject to the condition
we obtain
√n(β̂_ML − β₀) → N[0, G(G′Eγγ′G)⁻¹G′].   (9.5.35)

As we would expect, β̂_ML is asymptotically more efficient than the WMLE β̂. In other words, we should have

G(G′Eγγ′G)⁻¹G′ ≤ (Ewγγ′)⁻¹ Ew²γγ′ (Ewγγ′)⁻¹   (9.5.36)

for any w and G. This inequality follows straightforwardly from (9.5.19).
The motivation for this estimator is the following: As we can see from (9.5.3), the joint probability of j and x under the present assumption is

h(j, x) = P(j|x, β₀) f(x) Q₀(j)⁻¹ H(j).   (9.5.40)

Therefore the conditional probability of j given x is

h(j|x) = h(j, x) / Σ_{k=0}^{m} h(k, x),   (9.5.41)
which leads to the conditional likelihood function (9.5.39). The estimator is
computationally attractive because the right-hand side of (9.5.39) does not
depend on f(x), which is assumed unknown and requires a nonstandard
analysis of estimation, as we shall see in Sections 9.5.4 and 9.5.5.
To prove the consistency of the estimator, we observe that

H(s) = Σ_{j∈J_s} Q₀(j).
From (9.5.46) we obtain
we can, after some manipulation, write log L₁ as a function of β and {λ_s} as
of {λ̂_s} is supposed to estimate. The probability limits of the MLE β̂ and {λ̂_s} can be obtained as the values that maximize plim n⁻¹ log L₁. We have

plim n⁻¹ log L₁ = Σ_{s=1}^{S} H(s) Σ_{j∈J_s} ∫ [log P(j|x, β)] Q₀(s)⁻¹ P(j|x, β₀) f(x) dx
  − Σ_{s=1}^{S} H(s) ∫ [log Σ_{T=1}^{S} λ_T P(T|x, β)] Q₀(s)⁻¹ P(s|x, β₀) f(x) dx
  + Σ_{s=1}^{S} H(s) log λ_s.

Setting the derivative with respect to λ_s equal to zero gives

H(s)λ_s⁻¹ − Σ_{T=1}^{S} H(T) ∫ P(s|x, β) [Σ_{T′=1}^{S} λ_{T′} P(T′|x, β)]⁻¹ Q₀(T)⁻¹ P(T|x, β₀) f(x) dx = 0,
    s = 1, 2, . . . , S − 1,   (9.5.58)

and setting the derivative with respect to β equal to zero gives

Σ_{s=1}^{S} H(s) Σ_{j∈J_s} ∫ P(j|x, β)⁻¹ [∂P(j|x, β)/∂β] Q₀(s)⁻¹ P(j|x, β₀) f(x) dx
  − Σ_{s=1}^{S} H(s) ∫ [Σ_{T=1}^{S} λ_T ∂P(T|x, β)/∂β] [Σ_{T=1}^{S} λ_T P(T|x, β)]⁻¹ Q₀(s)⁻¹ P(s|x, β₀) f(x) dx = 0.   (9.5.59)
By studying (9.5.58) we see that the left-hand side vanishes if we put λ_s = Q₀(s)⁻¹H(s)Q₀(S)H(S)⁻¹ and β = β₀. Next, insert these same values into (9.5.59). Then each term becomes

H(s)Q₀(s)⁻¹ Σ_{j∈J_s} ∫ [∂P(j|x, β)/∂β]|_{β₀} f(x) dx,

so the two sums cancel and (9.5.59) also vanishes. We conclude that these values are the probability limits of the MLE.
where ζ is a function for which F(f + εζ) can be defined. Let t₁ (a K-vector) and t₂ (a scalar) be unbiased estimators of β and ∫f(x)ζ(x) dx for some function ζ such that ∫ζ(x) dx = 0 and ∫ζ(x)² dx = 1. Then Cosslett showed that the (2K + 2)-vector

[t′₁, t₂, ∂ log L/∂β′, ∂ log L/∂ε]′

has covariance matrix of the form

[ C      I    ]
[ I    R(ζ)   ],

where C is the covariance matrix of (t′₁, t₂)′. Because this covariance matrix is positive definite, we have C ≥ R(ζ)⁻¹, as in Theorem 1.3.1. Finally, it is shown that the asymptotic covariance matrix of β̂ is equal to the first K × K block of max_ζ R(ζ)⁻¹.
Thus Cosslett seems justified in saying that β̂ is asymptotically efficient in the sense defined in Section 4.2.4. As we remarked in that section, this does not mean that β̂ has the smallest asymptotic covariance matrix among all consistent estimators. Whether the results of LeCam and Rao mentioned in Section 4.2.4 also apply to Cosslett's model remains to be shown.
we have

log L₁ = Σ_{s=1}^{S} n_s log λ(s) + log Π_{i=1}^{n} [P(j_i|x_i, β) / Σ_{j=0}^{m} λ_j P(j|x_i, β)],   (9.5.65)
where

χ(E) = 1 if event E occurs   (9.6.3)
     = 0 otherwise.

Note that the score is the number of correct predictions we would make if we predicted y_i to be 1 whenever x′_iβ ≥ 0. Manski's maximum score estimator β̂ is defined by

S(β̂) = sup_{β∈B} S(β),   (9.6.4)

where the parameter space B is taken as

B = {β | β′β = 1}.   (9.6.5)

Clearly, (9.6.5) implies no loss of generality because S(cβ) = S(β) for any positive scalar c.
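Because S(β) is a step function of β, it cannot be maximized by derivative-based methods; with the normalization β′β = 1 and K = 2, a direct search over the unit circle suffices for a toy illustration. The sketch below is our own (it is not Manski's algorithm, and the grid size is arbitrary).

```python
import math

def score(b, data):
    """The score of (9.6.3)-(9.6.4): predictions are y = 1 when x'b >= 0 and
    y = 0 otherwise; returns the number of correct predictions."""
    s = 0
    for y, x in data:
        xb = sum(bi * xi for bi, xi in zip(b, x))
        s += y if xb >= 0.0 else 1 - y
    return s

def max_score_2d(data, ngrid=720):
    """Crude maximum score estimator for K = 2: grid search over the
    normalization b'b = 1, i.e., over the unit circle."""
    best_b, best_s = (1.0, 0.0), -1
    for k in range(ngrid):
        t = 2.0 * math.pi * k / ngrid
        b = (math.cos(t), math.sin(t))
        s = score(b, data)
        if s > best_s:
            best_b, best_s = b, s
    return best_b, best_s
```

The discreteness of the objective is visible here: many directions b attain the same score, which is one reason the estimator's asymptotic distribution is hard to derive.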
Because S(β) is not continuous in β, we cannot use Theorem 4.1.1 without
a modification. However, an appropriate modification is possible by general-
izing the concept of convergence in probability as follows:
DEFINITION 9.6.1. Let (Ω, A, P) be a probability space. A sequence of not necessarily measurable functions g_T(ω) for ω ∈ Ω is said to converge to 0 in probability in the generalized sense if for any ε > 0 there exists A_T ∈ A such that

A_T ⊂ {ω : |g_T(ω)| < ε}

and lim_{T→∞} P(A_T) = 1.
Using this definition, we can modify Theorem 4.1.1 as follows:
THEOREM 9.6.1. Make the following assumptions:
(A) The parameter space Θ is a compact subset of the Euclidean K-space (R^K).
(B) Q_T(y, θ) is a measurable function of y for all θ ∈ Θ.
(C) T⁻¹Q_T(θ) converges to a nonstochastic continuous function Q(θ) in probability in the generalized sense, uniformly in θ ∈ Θ.
where

ψ_λ(x) = 0     if x ≤ 0   (9.6.8)
       = λx    if 0 < x < λ⁻¹
       = 1     if λ⁻¹ ≤ x.
Because each term of the summation in (9.6.7) minus its expected value satisfies all the conditions of Theorem 4.2.1 for a fixed positive λ, we can conclude that for any ε, δ > 0 there exists n₁(λ), which may depend on λ, such that for all n ≥ n₁(λ)
P[sup_{β∈B} |n⁻¹S_λ(β) − Q_λ(β)| > ε/3] < δ/2,   (9.6.9)

where

Q_λ(β) = E F(x′β₀)ψ_λ(x′β) + E[1 − F(x′β₀)]ψ_λ(−x′β).   (9.6.10)
Second, we have

sup_β |n⁻¹S(β) − n⁻¹S_λ(β)| ≤ A₁ + A₂,   (9.6.11)

where

A₁ = sup_β n⁻¹ Σ_{i=1}^{n} y_i |χ(x′_iβ ≥ 0) − ψ_λ(x′_iβ)|   (9.6.12)

and

A₂ = sup_β n⁻¹ Σ_{i=1}^{n} (1 − y_i)|χ(x′_iβ < 0) − ψ_λ(−x′_iβ)|.   (9.6.13)
342 Advanced Econometrics
We have

A₂ ≤ sup_β P[(x′β)² < λ⁻²].   (9.6.14)

But, because the right-hand side of (9.6.14) converges to 0 as λ → ∞ because of assumption B, we have for all λ ≥ λ₁
Therefore, from (9.6.11), (9.6.13), and (9.6.15), we conclude that for all n ≥ n₂(λ) and for all λ ≥ λ₁
Third, define
Therefore, using the same argument that led to (9.6.15), we conclude that for all λ ≥ λ₁
Fourth, because

sup_β |n⁻¹S_n(β) − Q(β)| ≤ sup_β |n⁻¹S_n(β) − n⁻¹S_λ(β)|   (9.6.20)
  + sup_β |n⁻¹S_λ(β) − Q_λ(β)|
  + sup_β |Q_λ(β) − Q(β)|,

we conclude from (9.6.9), (9.6.16), and (9.6.19) that for any ε, δ > 0 we have for all n ≥ max[n₁(λ₁), n₂(λ₁)]

P[sup_β |n⁻¹S_n(β) − Q(β)| > ε] < δ.   (9.6.21)
Qualitative Response Models 343
Fifth and finally, it remains to show that Q(β) defined in (9.6.17) attains a unique global maximum at β₀. This is equivalent to showing

∫_{x′β₀<0} [1 − 2F(x′β₀)]g(x) dx > ∫_{x′β<0} [1 − 2F(x′β₀)]g(x) dx   if β ≠ β₀.   (9.6.22)
S_n(β) = Σ_{i=1}^{n} Σ_{j=0}^{m} y_{ij} χ(x′_{ij}β ≥ x′_{ik}β for all k ≠ j),   (9.6.25)

where

ψ_λ(z₁, z₂, . . . , z_m) = 0             if min(z_i) ≤ 0   (9.6.27)
                         = λ min(z_i)    if 0 < min(z_i) < λ⁻¹
                         = 1             if λ⁻¹ ≤ min(z_i).
Then the first four steps of the proof of Theorem 9.6.2 generalize straightfor-
wardly to the present case.
The fifth step is similar but involves a somewhat different approach. From
(9.6.25) we obtain
Q(β) = plim n⁻¹S_n(β)   (9.6.28)
     = E h(x, β),

where (y_j) and (x_j) are random variables from which the i.i.d. observations (y_{ij}) and (x_{ij}) are drawn and x = (x′₀, x′₁, . . . , x′_m)′. First, we want to show h(x, β) is uniquely maximized at β₀. For this purpose consider maximizing (9.6.29), and suppose that for a given x₀

P(y_{j₀} = 1 | x₀, β₀) > P(y_j = 1 | x₀, β₀)   for all j ≠ j₀.   (9.6.30)

Then it is clearly best to assign x₀ to the region defined by

P(y_j = 1 | x, β₀) > P(y_k = 1 | x, β₀)   for all k ≠ j.   (9.6.31)

Thus (9.6.29) is maximized by the partition {A_j} defined by

A_j = {x | x′_jβ₀ ≥ x′_kβ₀ for all k ≠ j}.   (9.6.32)
Clearly, this is also a solution to the more restricted problem of maximizing h(x, β). This maximum is unique because we always have a strict inequality in (9.6.30) because of our assumptions. Also our assumptions are such that if h(x, β) is uniquely maximized for every x at β₀, then Q(β) is uniquely maximized at β₀. This completes the proof of the consistency of the maximum score estimator in the multinomial case. (Figure 9.2 illustrates the maximization of Eq. 9.6.29 for the case where m = 3 and x is a scalar.)
The asymptotic distribution of the maximum score estimator has not yet
been obtained. A major difficulty lies in the fact that the score function is not
differentiable and hence Theorem 4.1.3, which is based on a Taylor expansion
of the derivative of a maximand, cannot be used. The degree of difficulty for
the maximum score estimator seems greater than that for the LAD estimator discussed in Section 4.6; the method of proving asymptotic normality for LAD does not work in the present case. In the binary case, maximizing (9.6.2)
Figure 9.2 An optimal partition of the space of an independent variable
346 Advanced Econometrics
The consistency proof of Kiefer and Wolfowitz (1956) applies to this kind of model. Cosslett showed how to compute the MLE β̂ and F̂ and derived conditions for the consistency of MLE, translating the general conditions of Kiefer and Wolfowitz into this particular model. The conditions Cosslett found, which are not reproduced here, are quite reasonable and likely to hold in most practical applications.
Clearly some kind of normalization is needed on β and F before we maximize (9.6.33). Cosslett adopted the following normalization: The constant term is 0 and the sum of squares of the remaining parameters is equal to 1. Note that the assumption of zero constant term is adopted in lieu of Manski's assumption F(0) = 0.5. We assume that the constant term has already been eliminated from the x′_iβ that appears in (9.6.33). Thus we can proceed, assuming β′β = 1.
The maximization of (9.6.33) is carried out in two stages. In the first stage we shall fix β and maximize ψ(β, F) with respect to F. Let the solution be F̂(β). Then in the second stage we shall maximize ψ[β, F̂(β)] with respect to β. Although the second stage presents a more difficult computational problem, we shall describe only the first stage because it is nonstandard and conceptually more difficult.
ingly. Note that this is a sequence consisting only of ones and zeros.
Step 2. Partition this sequence into the smallest possible number of succes-
sive groups in such a way that each group consists of a nonincreasing se-
quence.
Step 3. Calculate the ratio of the number of ones to the number of elements in each group. Let the sequence of ratios thus obtained be (r₁, r₂, . . . , r_K), assuming there are K groups. If this is a nondecreasing sequence, we are done. We define F̂(x′₍ᵢ₎β) = r_j if the ith observation is in the jth group.
Step 4. If, however, r_j < r_{j−1} for some j, combine the jth and (j − 1)th groups and repeat step 3 until we obtain a nondecreasing sequence.
The preceding procedure can best be taught by example:
Example 1.

y (ordered by x′₍ᵢ₎β):   0 0 | 1 1 0 | 1 1
r_j:                      0     2/3     1
F̂(x′₍ᵢ₎β):               0 0   2/3 2/3 2/3   1 1

Note that F̂ is not unique over some parts of the domain. For example, between x′₍₂₎β and x′₍₃₎β in Example 1, F̂ may take any value between 0 and 2/3.
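Steps 1 through 4 amount to the pool-adjacent-violators algorithm for an isotonic (nondecreasing) fit, and can be sketched as follows (our own implementation; the function name is not from the text). Applied to the sequence 0 0 1 1 0 1 1 it returns the fitted values 0, 0, 2/3, 2/3, 2/3, 1, 1.

```python
def first_stage_fhat(y_sorted):
    """First-stage maximization over F for fixed beta: pool-adjacent-violators
    on the 0-1 sequence ordered by x'beta. Returns the fitted ratio r_j for
    each observation."""
    blocks = []   # each block is [number of ones, number of elements]
    for y in y_sorted:
        blocks.append([y, 1])
        # combine the last two groups while their ratios decrease
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            ones, n = blocks.pop()
            blocks[-1][0] += ones
            blocks[-1][1] += n
    fitted = []
    for ones, n in blocks:
        fitted.extend([ones / n] * n)
    return fitted
```

The output is always a nondecreasing step function, as the normalization of F requires.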
Asymptotic normality has not been proved for Cosslett's MLE F̂, nor for any model to which the consistency proof of Kiefer and Wolfowitz is applicable. This seems to be as difficult a problem as proving the asymptotic normality of Manski's maximum score estimator.
Cosslett's MLE may be regarded as a generalization of Manski's estimator because the latter searches only among one-jump step functions to maximize (9.6.33). However, this does not necessarily imply that Cosslett's estimator is
=0 otherwise.
In a very general model, y*_{it} would depend on independent variables, lagged values of y*_{it}, lagged values of y_{it}, and an error term that can be variously specified. We shall analyze a model that is simple enough to be computationally feasible and yet general enough to contain most of the interesting features of this type of model: Let

y*_{it} = x′_{it}β + γy_{i,t−1} + v_{it},   (9.7.11)

where for each i, v_{it} is serially correlated in general. We define true state dependence to mean γ ≠ 0 and heterogeneity to mean serial correlation of {v_{it}}. Model (9.7.2) results from (9.7.11) if we put γ = 0 and v_{it} = u_i + ε_{it}, where {ε_{it}} are serially independent. Thus we see that model (9.7.2) is restrictive not only because it assumes no true state dependence but also because it assumes a
mally distributed independent of each other, and (ε_it) are serially independent. We suppress subscript i as before and express (9.7.13) in obvious vector notation as
v = αu + ε, (9.7.14)
where v, α, and ε are T-vectors and u is a scalar. Then the joint probability of y can be written as
P(y) = E_u F[w ∗ (2y − 1); D ∗ (2y − 1)(2y − 1)'], (9.7.15)
where w now includes αu and D = Eεε'. Because D is a diagonal matrix, F in
(9.7.15) can be factored as the product of T normal distribution functions. The
estimation of this model, therefore, is no more difficult than the estimation of model (9.7.4).
For the case T = 3, model (9.7.14) contains a stationary first-order autore-
gressive model (see Section 5.2.1) as a special case. To see this, put α = (1 − ρ²)^{-1/2}(ρ, 1, ρ)', Vu = σ², and take the diagonal elements of D as σ², 0, and σ². Thus, if T = 3, the hypothesis of AR(1) can easily be tested within the more general model (9.7.14). Heckman (1981c) accepted the AR(1) hypothesis using the same data on female labor participation as used by Heckman and Willis (1977). If T ≥ 4, model (9.7.13) can be stationary if and only if α_t is constant for all t. A verification of this is left as an exercise.
Consider a further simplification of (9.7.13) obtained by assuming α_t = 1, (u_i) are i.i.d. over i, and (ε_it) are i.i.d. both over i and t. This model differs from model (9.7.2) only in the presence of y_{i,t-1} among the right-hand variables and is analogous to the Balestra-Nerlove model (Section 6.6.3) in the continuous variable case.
As in the Balestra-Nerlove model, (u_i) may be regarded as unknown parameters to estimate. If both N and T go to ∞, β, γ, and (u_i) can be consistently estimated. An interesting question is whether we can estimate β and γ consistently when only N goes to ∞. Unlike the Balestra-Nerlove model, the answer to this question is generally negative for the model considered in this subsection. In a probit model, for example, the values of β and γ that maximize
L = Π_{i=1}^{N} Π_{t=1}^{T} Φ(w_it + u_i)^{y_it} [1 − Φ(w_it + u_i)]^{1−y_it} (9.7.16)
Exercises
1. (Section 9.2.1)
In a Monte Carlo study, Goldfeld and Quandt (1972, Chapter 4) generated (y_i) according to the model P(y_i = 1) = Φ(0.2 + 0.5x_{1i} + 2x_{2i}) and, using the generated (y_i) and given (x_{1i}, x_{2i}), estimated the β's in the linear probability model P(y_i = 1) = β_0 + β_1 x_{1i} + β_2 x_{2i}. Their estimates were β̂_0 = 0.58, β̂_1 = 0.1742, and β̂_2 = 0.7451. How do you convert these estimates into estimates of the coefficients in the probit model?
2. (Section 9.2.2)
Consider a logit model P(y_i = 1) = Λ(β_0 + β_1 x_i), where x_i is a binary variable taking values 0 and 1. This model can also be written as a linear probability model P(y_i = 1) = γ_0 + γ_1 x_i.
a. Determine γ_0 and γ_1 as functions of β_0 and β_1.
b. Show that the MLE of γ_0 and γ_1 are equal to the least squares estimates in the regression of y_i on x_i with an intercept.
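Both parts can be checked numerically. With a binary regressor the logit model is saturated, so γ_0 = Λ(β_0) and γ_1 = Λ(β_0 + β_1) − Λ(β_0), and least squares on a single binary x reproduces the two cell means, which are also the fitted probabilities of the logit MLE. A sketch in Python, reusing the data of Exercise 4 (helper names are ours):

```python
import math

def logistic(z):
    """The logistic distribution function Lambda."""
    return 1.0/(1.0 + math.exp(-z))

def gammas(b0, b1):
    """Part (a): gamma0 = Lambda(beta0), gamma1 = Lambda(beta0+beta1) - Lambda(beta0)."""
    g0 = logistic(b0)
    return g0, logistic(b0 + b1) - g0

def ols_binary(x, y):
    """Least squares of y on a binary x with an intercept."""
    n = len(x)
    xbar, ybar = sum(x)/n, sum(y)/n
    b1 = sum((xi - xbar)*(yi - ybar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar)**2 for xi in x)
    return ybar - b1*xbar, b1

# With binary x, OLS gives the two cell means -- the content of part (b).
x = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
c0, c1 = ols_binary(x, y)
```

Here the fitted values c0 and c0 + c1 equal the sample frequencies of y = 1 in the x = 0 and x = 1 cells, respectively.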
3. (Section 9.2.3)
Show that global concavity is not invariant to a one-to-one transforma-
tion of the parameter space.
4. (Section 9.2.8)
In the model of Exercise 2, we are given the following data:
x 1 1 1 0 0 0 0 0 1 0 1
y 0 0 1 0 0 1 1 0 1 0 1.
Calculate the MLE and the DA estimates of β_0 and β_1.
5. (Section 9.2.8)
The following data come from a hypothetical durable goods purchase study:
[Table giving, for each case t, the sample size n_t and the observed frequency r_t; the remaining entries are not legible in this scan.]
6. (Section 9.2.9)
It may be argued that in (9.2.62) the asymptotic variance of r̂ should be the unconditional variance of r̂, which is n^{-2} Σ_{i=1}^{n} F_i(1 − F_i) + V(n^{-1} Σ_{i=1}^{n} F_i), where V is taken with respect to the random variables (x_i). What is the fallacy of this argument?
7. (Section 9.3.3)
In the multinomial logit model (9.3.34), assume j = 0, 1, and 2. For this
model define the NLGLS iteration.
8. (Section 9.3.5)
Suppose {y_i}, i = 1, 2, . . . , n, are independent random variables taking three values, 0, 1, and 2, according to the probability distribution defined by (9.3.51) and (9.3.52), where we assume μ_0 = 0, μ_1 = x_i'β_1, and μ_2 = x_i'β_2. Indicate how we can consistently estimate β_1, β_2, and ρ using only a binary logit program.
9. (Section 9.3.5)
Write down (9.3.59) and (9.3.60) in the special case where S= 2, B 1 =
(1, 2), and B2 = (3, 4) and show for which values of the parameters the
model is reduced to a four-response independent logit model.
10. (Section 9.3.5)
You are to analyze the decision of high school graduates as to whether or
not they go to college and, if they go to college, which college they go to.
For simplicity assume that each student considers only two possible col-
leges to go to. Suppose that for each person i, i = 1, 2, . . . , n, we observe z_i (family income and levels of parents' education) and x_ij (the quality index and the cost of the jth school), j = 1 and 2. Also suppose that we
observe for every person in the sample whether or not he or she went to
college but observe a college choice only for those who went to college.
Under these assumptions, define your own model and show how to esti-
mate the parameters of the model (cf. Radner and Miller, 1970).
11. (Section 9.3.6)
Write down (9.3.64), (9.3.65), and (9.3.66) in the special case of Figure 9.1 (three-level), that is, C_1 = (1, 2), C_2 = (3, 4), B_1 = (1, 2), B_2 = (3, 4), B_3 = (5, 6), and B_4 = (7, 8).
12. (Section 9.3.10)
Suppose that y_i takes values 0, 1, and 2 with the probability distribution P(y_i = 0) = Λ(x_i'β_0) and P(y_i = 1|y_i ≠ 0) = Λ(x_i'β_1). Assuming that we have n_t independent observations on y_i with the same value x_t of the independent variables, t = 1, 2, . . . , T, indicate how to calculate the MIN χ² estimates of β_0 and β_1.
14. (Section 9.4.3)
Suppose the joint probabilities of two binary variables y_{1t} and y_{2t} are given by

               y_{2t} = 1                        y_{2t} = 0
  y_{1t} = 1   d_t^{-1} exp(α'x_t + β'z_t)       d_t^{-1} exp(α'x_t)
  y_{1t} = 0   d_t^{-1}                          d_t^{-1}

where d_t = exp(α'x_t + β'z_t) + exp(α'x_t) + 2. Given n_t independent observations on (y_{1t}, y_{2t}), define the minimum chi-square estimates of the vectors α and β. Assume that x_t and z_t are vectors of known constants.
15. (Section 9.4.3)
Let y_j, j = 1, 2, . . . , J, be discrete random variables taking N_j values. Suppose the conditional probability P(y_j|y_1, . . . , y_{j-1}, y_{j+1}, . . . , y_J) is given for every j and is positive. Then prove that there is at most one set
of joint probabilities consistent with the given conditional probabilities
(cf. Amemiya, 1975).
16. (Section 9.4.3)
Let y_1 and y_2 be binary variables taking the value 1 or 0. Show that if P(y_1 = 1|y_2) = Λ(x'β_1 + β_{12} y_2) and P(y_2 = 1|y_1) = Λ(x'β_2 + β_{21} y_1), then β_{12} = β_{21}.
17. (Section 9.5.2)
In the simple logit model defined in the paragraph after (9.5.22), the efficiency of the WMLE of β using the optimal h* depends only on P and Q_0. Prove Eff[h*(P, Q_0), P, Q_0] = Eff[h*(P, 1 − Q_0), P, 1 − Q_0].
10.1 Introduction
Tobit models refer to censored or truncated regression models in which the
range of the dependent variable is constrained in some way. In economics,
such a model was first suggested in a pioneering work by Tobin (1958). He
analyzed household expenditure on durable goods using a regression model
that specifically took account of the fact that the expenditure (the dependent
variable of his regression model) cannot be negative. Tobin called his model
the model of limited dependent variables. It and its various generalizations are
known popularly among economists as Tobit models, a phrase coined by
Goldberger (1964), because of similarities to probit models. These models are
also known as censored or truncated regression models. The model is called
truncated if the observations outside a specified range are totally lost and
censored if we can at least observe the exogenous variables. A more precise
definition will be given later.
Censored and truncated regression models have been developed in other
disciplines (notably, biometrics and engineering) more or less independently
of their development in econometrics. Biometricians use the model to analyze
the survival time of a patient. Censoring or truncation occurs when either a
patient is still alive at the last observation date or he or she cannot be located.
Similarly, engineers use the model to analyze the time to failure of material or
of a machine or of a system. These models are called survival or duration
models. Sociologists and economists have also used survival models to ana-
lyze the duration of such phenomena as unemployment, welfare receipt,
employment in a particular job, residence in a particular region, marriage, and
the period of time between births. Mathematically, survival models belong to
the same general class of models as Tobit models; survival models and Tobit
models share certain characteristics. However, because survival models pos-
sess special features, they will be discussed separately in Chapter 11.
Between 1958, when Tobin's article appeared, and 1970, the Tobit
model was used infrequently in econometric applications, but since the early
1970s numerous applications ranging over a wide area of economics have
appeared and continue to appear. This phenomenon is due to a recent
Tobit Models 361
increase in the availability of micro sample survey data, which the Tobit model
analyzes well, and to a recent advance in computer technology that has made
estimation of large-scale Tobit models feasible. At the same time, many gener-
alizations of the Tobit model and various estimation methods for these
models have been proposed. In fact, models and estimation methods are now
so numerous and diverse that it is difficult for econometricians to keep track of
all the existing models and estimation methods and maintain a clear notion of
their relative merits. Thus it is now particularly useful to examine the current
situation and prepare a unified summary and critical assessment of existing
results.
We shall try to accomplish this objective by means of classifying the diverse
Tobit models into five basic types. (Our review of the empirical literature
suggests that roughly 95% of the econometric applications of Tobit models fall
into one of these five types.) Although there are many ways to classify Tobit
models, we have chosen to classify them according to the form of the likeli-
hood function. This way seems to be the statistically most useful classification
because a similarity in the likelihood function implies a similarity in the
appropriate estimation and computation methods. It is interesting to note
that two models that superficially seem to be very different from each other
can be shown to belong to the same type when they are classified according to
this scheme.
Sections 10.2 through 10.5 will deal with the standard Tobit model (or Type
1 Tobit), and Sections 10.6 through 10.10 will deal with the remaining four
types of models. Basic estimation methods, which with a slight modification
can be applied to any of the five types, will be discussed at great length in
Section 10.4. More specialized estimation methods will be discussed in rele-
vant passages throughout the chapter. Each model is illustrated with a few
empirical examples.
We shall not discuss disequilibrium models except for a few basic models,
which will be examined in Section 10.10.4. Some general references on dis-
equilibrium models will be cited there. Nor shall we discuss the related topic of
switching regression models. For a discussion of these topics, the reader
should consult articles by Maddala (1980, 1983). We shall not discuss Tobit
models for panel data (individuals observed through time), except to mention
a few articles in relevant passages.
[Figure 10.1: expenditure plotted against income.]
If we assume that u is a random variable and that yo varies with households but
is assumed known, this model will generate data like Figure 10.1. We can write
the likelihood function for n independent observations from the model
(10.2.1) as
L = Π_0 F_i(y_{0i}) Π_1 f_i(y_i), (10.2.2)
where F_i and f_i are the distribution and density function, respectively, of y_i*, Π_0 means the product over those i for which y_i* ≤ y_{0i}, and Π_1 means the product over those i for which y_i* > y_{0i}. Note that the actual value of y_i* when y_i* ≤ y_{0i} has no effect on the likelihood function. Therefore the second line of Eq. (10.2.1) may be changed to the statement "if y_i* ≤ y_{0i}, one merely observes that fact."
The model originally proposed by Tobin (1958) is essentially the same as
the model given in the preceding paragraph except that he specifically as-
sumed y* to be normally distributed and assumed y o to be the same for all
households. We define the standard Tobit model (or Type 1 Tobit) as follows:
y_i* = x_i'β + u_i, i = 1, 2, . . . , n, (10.2.3)
y_i = y_i* if y_i* > 0 (10.2.4)
    = 0 if y_i* ≤ 0,
where {u_i} are assumed to be i.i.d. drawings from N(0, σ²). It is assumed that {y_i} and {x_i} are observed for i = 1, 2, . . . , n but {y_i*} are unobserved if y_i* ≤ 0. Defining X to be the n × K matrix the ith row of which is x_i', we assume that {x_i} are uniformly bounded and lim_{n→∞} n^{-1} X'X is positive definite. We also assume that the parameter space of β and σ² is compact. In the Tobit model we need to distinguish the vectors and matrices of positive observations from the vectors and matrices of all the observations; the latter appear with an underbar.
Note that y_i* > 0 and y_i* ≤ 0 in (10.2.4) may be changed to y_i* > y_0 and y_i* ≤ y_0 without essentially changing the model, whether y_0 is known or unknown, because y_0 can be absorbed into the constant term of the regression. If, however, y_{0i} changes with i and is known for every i, the model is slightly changed because the resulting model would be essentially equivalent to the model defined by (10.2.3) and (10.2.4), where one of the elements of β other than the constant term is known. The model where y_{0i} changes with i and is unknown is not generally estimable.
The likelihood function of the standard Tobit model is given by
L = Π_0 [1 − Φ(x_i'β/σ)] Π_1 σ^{-1} φ[(y_i − x_i'β)/σ], (10.2.5)
where Φ and φ are the distribution and density function, respectively, of the standard normal variable.
The Tobit model belongs to what is sometimes known as the censored regression model. In contrast, when we observe neither y_i nor x_i when y_i* ≤ 0, the model is known as a truncated regression model. The likelihood function of the truncated version of the Tobit model can be written as
L = Π_1 Φ(x_i'β/σ)^{-1} σ^{-1} φ[(y_i − x_i'β)/σ]. (10.2.6)
Henceforth, the standard Tobit model refers to the model defined by (10.2.3) and (10.2.4), namely, a censored regression model, and the model the likelihood function of which is given by (10.2.6) will be called the truncated standard Tobit model.
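The two likelihood functions can be coded directly. A sketch for a scalar regressor, with our own function names; it also checks the useful identity that (10.2.5) equals the truncated likelihood (10.2.6) times a probit likelihood in β/σ:

```python
import math

def norm_pdf(z): return math.exp(-0.5*z*z)/math.sqrt(2.0*math.pi)
def norm_cdf(z): return 0.5*(1.0 + math.erf(z/math.sqrt(2.0)))

def censored_loglik(b, s, x, y):
    """log of (10.2.5): standard (censored) Tobit with scalar index b*x_i."""
    ll = 0.0
    for xi, yi in zip(x, y):
        m = b*xi
        if yi == 0.0:
            ll += math.log(1.0 - norm_cdf(m/s))       # Pr(y_i* <= 0)
        else:
            ll += math.log(norm_pdf((yi - m)/s)/s)    # density of observed y_i
    return ll

def truncated_loglik(b, s, x, y):
    """log of (10.2.6): only positive observations, renormalized by Phi."""
    ll = 0.0
    for xi, yi in zip(x, y):
        if yi > 0.0:
            m = b*xi
            ll += math.log(norm_pdf((yi - m)/s)/s) - math.log(norm_cdf(m/s))
    return ll
```

The censored log likelihood equals the truncated log likelihood plus a probit log likelihood, which is the decomposition behind the probit first stage discussed in Section 10.4.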
Then the first two products of the right-hand side of (10.4.1) constitute the
likelihood function of a probit model, and the last product is the likelihood
function of the truncated Tobit model as given in (10.2.6). The probit ML
estimator of α ≡ β/σ, denoted α̂, is obtained by maximizing the logarithm of the first two products.
= 0 if y_i* ≤ 0.
Note that the ith element of Ew is equal to Φ(x_i'α). (The symbol ≅ means that both sides have the same asymptotic distribution.) As shown in Section 9.2.2, α̂ is asymptotically normal with mean α and asymptotic variance-covariance matrix given by
Vα̂ = (X'D_1 X)^{-1}. (10.4.4)
estimates. Although it is not so clear from the figure, the least squares regres-
sion using only the positive expenditures also yields biased estimates. These
facts can be mathematically demonstrated as follows.
First, consider the regression using only positive observations of y_i. We obtain from (10.2.3) and (10.2.4)
E(y_i|y_i > 0) = x_i'β + E(u_i|u_i > −x_i'β). (10.4.5)
The last term of the right-hand side of (10.4.5) is generally nonzero (even without assuming u_i is normal). This implies the biasedness of the LS estimator using positive observations on y_i under more general models than the standard Tobit model. When we assume normality of u_i as in the Tobit model, (10.4.5) can be shown by straightforward integration to be
E(y_i|y_i > 0) = x_i'β + σλ(x_i'β/σ), (10.4.6)
where λ(z) = φ(z)/Φ(z).² As shown later, this equation plays a key role in the derivation of Heckman's two-step, NLLS, and NLWLS estimators.
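Equation (10.4.6) is easy to verify by simulation. A sketch with λ coded from its definition; the particular values of x'β and σ are hypothetical:

```python
import math, random

def norm_pdf(z): return math.exp(-0.5*z*z)/math.sqrt(2.0*math.pi)
def norm_cdf(z): return 0.5*(1.0 + math.erf(z/math.sqrt(2.0)))
def mills(z): return norm_pdf(z)/norm_cdf(z)      # lambda(z) in (10.4.6)

def cond_mean(xb, sigma):
    """E(y | y > 0) = x'beta + sigma*lambda(x'beta/sigma), Eq. (10.4.6)."""
    return xb + sigma*mills(xb/sigma)

# Monte Carlo check with hypothetical values x'beta = 0.5, sigma = 2.
random.seed(0)
xb, sigma = 0.5, 2.0
pos = [d for d in (xb + random.gauss(0.0, sigma) for _ in range(200000)) if d > 0.0]
mc_mean = sum(pos)/len(pos)
```

The simulated mean of the positive draws agrees with the analytic expression up to sampling error, which illustrates why least squares on the positive observations alone is biased: the omitted term σλ(x'β/σ) varies with x.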
Equation (10.4.6) clearly indicates that the LS estimator of ft is biased and
inconsistent, but the direction and magnitude of the bias or inconsistency
cannot be shown without further assumptions. Goldberger (1981) evaluated the asymptotic bias (the probability limit minus the true value) assuming that the elements of x_i (except the first element, which is assumed to be a constant) are normally distributed. More specifically, Goldberger rewrote (10.2.3) as
y_i* = β_0 + x̃_i'β_1 + u_i (10.4.7)
and assumed x̃_i ~ N(0, Σ), distributed independently of u_i. (Here, the assumption of zero mean involves no loss of generality because a nonzero mean can be absorbed into β_0.) Under this assumption he obtained the result (10.4.8) that plim β̂_1 is a scalar multiple of β_1, the scalar depending on ρ².
(10.4.11) again as
y_i = x_i'β + σλ(x_i'α̂) + ε_i + η_i (10.4.13)
for i such that y_i > 0,
where η_i = σ[λ(x_i'α) − λ(x_i'α̂)]. We can write (10.4.13) in vector notation as
y = Xβ + σλ̂ + ε + η, (10.4.14)
where the vectors y, ε, and η have n_1 elements and the matrix X has n_1 rows, corresponding to the positive observations of y_i. We can further rewrite (10.4.14) as
y = Ẑγ + ε + η, (10.4.15)
where we have defined Ẑ = (X, λ̂) and γ = (β', σ)'. Then Heckman's two-step estimator of γ is defined as
γ̂ = (Ẑ'Ẑ)^{-1} Ẑ'y. (10.4.16)
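A minimal sketch of the second step (10.4.16) for a scalar regressor with intercept, assuming the first-stage probit estimate is already available; here the true α is passed in place of α̂ to keep the example short, and all function names are ours:

```python
import math, random

def norm_pdf(z): return math.exp(-0.5*z*z)/math.sqrt(2.0*math.pi)
def norm_cdf(z): return 0.5*(1.0 + math.erf(z/math.sqrt(2.0)))

def solve3(A, b):
    """Gauss-Jordan elimination for the 3x3 normal equations."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c]/M[c][c]
                M[r] = [mr - f*mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n]/M[i][i] for i in range(n)]

def heckman_second_step(x, y, alpha):
    """Regress the positive y_i on (1, x_i, lambda_i), as in (10.4.16).

    lambda_i = phi/Phi at the first-stage index; the coefficient on
    lambda_i estimates sigma, since gamma = (beta', sigma)' in (10.4.15).
    """
    rows, ys = [], []
    for xi, yi in zip(x, y):
        if yi > 0.0:
            z = alpha[0] + alpha[1]*xi
            rows.append([1.0, xi, norm_pdf(z)/norm_cdf(z)])
            ys.append(yi)
    A = [[sum(r[i]*r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i]*yi for r, yi in zip(rows, ys)) for i in range(3)]
    return solve3(A, b)

# Simulated check: y_i* = 1 + 2x_i + u_i, u_i ~ N(0, 1), censored at zero.
random.seed(1)
xs = [random.uniform(-2.0, 2.0) for _ in range(20000)]
ys = [max(0.0, 1.0 + 2.0*xi + random.gauss(0.0, 1.0)) for xi in xs]
b0, b1, sig = heckman_second_step(xs, ys, (1.0, 2.0))
```

With the true first-stage index, the estimates approach β_0 = 1, β_1 = 2, and σ = 1 as n grows, illustrating the consistency argument that follows.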
The consistency of γ̂ follows easily from (10.4.15) and (10.4.16). We shall derive its asymptotic distribution for the sake of completeness, although the result is a special case of Heckman's result (Heckman, 1979). From (10.4.15) and (10.4.16) we have
√n_1 (γ̂ − γ) = (n_1^{-1} Ẑ'Ẑ)^{-1} (n_1^{-1/2} Ẑ'ε + n_1^{-1/2} Ẑ'η). (10.4.17)
Because the probit MLE α̂ is consistent, we have
plim n_1^{-1} Ẑ'Ẑ = lim n_1^{-1} Z'Z, (10.4.18)
where Z = (X, λ). Under the assumptions stated after (10.2.4), it can be shown that
n_1^{-1/2} Ẑ'ε → N(0, σ² lim n_1^{-1} Z'ΣZ), (10.4.19)
where Σ was defined after (10.4.2). Next, note that ε and η are uncorrelated because η is asymptotically a linear function of w on account of (10.4.2) and (10.4.20) and ε and w are uncorrelated. Therefore, from (10.4.17), (10.4.18), (10.4.19), and (10.4.21), we finally conclude that γ̂ is asymptotically normal with mean γ and asymptotic variance-covariance matrix given by
Vγ̂ = σ²(Z'Z)^{-1} Z'[Σ + (I − Σ)X(X'D_1 X)^{-1} X'(I − Σ)]Z(Z'Z)^{-1}. (10.4.22)
and
Vγ̂_W = σ²(Z'D_2 Σ^{-1} Z)^{-1}. (10.4.33)
Again, we cannot make a definite comparison between the two matrices.
worthwhile because γ̂ and γ̂_W are asymptotically equivalent to γ̂_N and γ̂_NW, as we indicated earlier. The estimators (γ̂_N, γ̂_NW) are clearly stationary values of the iterations starting from (γ̂, γ̂_W). However, they may not necessarily be the converging values.
A simulation study by Wales and Woodland (1980) based on only one replication with sample sizes of 1000 and 5000 showed that the NLLS estimator is distinctly inferior to the MLE and is rather unsatisfactory.
[ Σ_i x_i x_i'   Σ_i x_i y_i
  Σ_i y_i x_i'   Σ_i y_i² ]
Now, given θ_1, let M(θ_1) maximize Q(θ|θ_1) with respect to θ. Then we have
L[M(θ_1)] = Q[M(θ_1)|θ_1] − H[M(θ_1)|θ_1] (10.4.47)
≥ Q(θ_1|θ_1) − H(θ_1|θ_1) = L(θ_1),
where the inequality follows because M(θ_1) maximizes Q(·|θ_1) while H(θ|θ_1) is maximized at θ = θ_1.
Thus we have proved the desirable property that L always increases or stays constant at each step of the EM algorithm.
The preceding result implies that if L is bounded, then lim_{t→∞} L(θ_t) exists. Let θ* satisfy the equality lim_{t→∞} L(θ_t) = L(θ*). (θ* exists if the range of L(θ) is closed.) We shall show that if θ* is a stationary point of the EM algorithm, then θ* is a stationary point of L. For this we assume L is twice differentiable.
Differentiating (10.4.45) with respect to θ and evaluating the derivative at θ = θ_1, we obtain
∂L/∂θ |_{θ_1} = ∂Q(θ|θ_1)/∂θ |_{θ_1} − ∂H(θ|θ_1)/∂θ |_{θ_1}. (10.4.49)
But the last term of the right-hand side of (10.4.49) is 0 because of (10.4.46). Therefore, if θ_1 is a stationary point of Q(θ|θ_1), it is a stationary point of L.
Unfortunately, a local maximum of Q(θ|θ_1) with respect to θ may not be a local maximum (let alone the global maximum) of L(θ) because the negative definiteness of [∂²Q(θ|θ_1)/∂θ∂θ']_{θ_1} does not imply the negative definiteness of [∂²L/∂θ∂θ']_{θ_1}. However, this is true of any iterative method commonly used.
See Wu (1983) for further discussion of the convergence properties of the EM
algorithm.
Now consider an application of the algorithm to the Tobit model.⁹ Define θ = (β', σ²)' as before. Then in the Tobit model we have
log f(y*|θ) = −(n/2) log σ² − (1/(2σ²)) Σ_{i=1}^{n} (y_i* − x_i'β)², (10.4.50)
apart from a constant, and, for a given estimate θ_1 = (β_1', σ_1²)', the EM algorithm maximizes with
respect to β and σ²
E[log f(y*|θ)|y, w, θ_1] (10.4.51)
= −(n/2) log σ² − (1/(2σ²)) Σ_{w_i=1} (y_i − x_i'β)²
  − (1/(2σ²)) Σ_{w_i=0} {[E(y_i*|w_i = 0, θ_1) − x_i'β]² + V(y_i*|w_i = 0, θ_1)},
where
E(y_i*|w_i = 0, θ_1) = x_i'β_1 − σ_1 φ(x_i'β_1/σ_1)/[1 − Φ(x_i'β_1/σ_1)] (10.4.52)
and
V(y_i*|w_i = 0, θ_1) = σ_1² + x_i'β_1 E(y_i*|w_i = 0, θ_1) − [E(y_i*|w_i = 0, θ_1)]². (10.4.53)
The second-round estimate of β, denoted β_2, is given by
β_2 = (X'X)^{-1} X'y^{(1)}, (10.4.54)
where y^{(1)} is the n-vector whose ith element is y_i if w_i = 1 and E(y_i*|w_i = 0, θ_1) if w_i = 0, and X was defined after (10.2.4). In other words, the EM algorithm amounts to predicting all the unobservable values of y_i* by their conditional expectations and treating the predicted values as if they were the observed values. The second-round estimate of σ², denoted σ_2², is given by
σ_2² = n^{-1} {Σ_{w_i=1} (y_i − x_i'β_2)² + Σ_{w_i=0} [E(y_i*|w_i = 0, θ_1) − x_i'β_2]²
  + Σ_{w_i=0} V(y_i*|w_i = 0, θ_1)}. (10.4.55)
We can directly show that the MLE θ̂ is the equilibrium solution of the iteration defined by (10.4.54) and (10.4.55). Partition the matrix of all the observations as (X', X̃')' so that X is multiplied by y and X̃ by the predicted values. Then inserting θ̂ into both sides of (10.4.54) yields, after collecting terms,
X'Xβ̂ = X'y − X̃'[σ̂ φ(x_i'β̂/σ̂)/(1 − Φ(x_i'β̂/σ̂))], (10.4.56)
where the last bracket denotes an (n − n_1)-dimensional vector the typical element of which is given inside. Now, setting the derivative of log L with respect to β equal to 0 yields
−σ̂ Σ_0 [φ_i/(1 − Φ_i)] x_i + Σ_1 (y_i − x_i'β̂) x_i = 0, (10.4.57)
In this section we shall discuss the properties of the Tobit MLE, the estimator that maximizes (10.2.5), under various types of nonstandard assumptions: heteroscedasticity, serial correlation, and nonnormality. It will be shown that the Tobit MLE remains consistent under serial correlation but not under heteroscedasticity or nonnormality. The same is true of the other estimators considered earlier. This result contrasts with the classical regression model, in which the least squares estimator (the MLE under the normality assumption) is generally consistent under all three types of nonstandard assumptions mentioned earlier.
Before proceeding with a rigorous argument, we shall give an intuitive explanation of the aforementioned result. By considering (10.4.11) we see that serial correlation of y_i should not affect the consistency of the NLLS estimator, whereas heteroscedasticity changes σ to σ_i and hence invalidates the estimation of the equation by least squares. If u_i is not normal, Eq. (10.4.11) itself is generally invalid, which leads to the inconsistency of the NLLS estimator.
Tobit Models 379
10.5.1 Heteroscedasticity
Hurd (1979) evaluated the probability limit of the truncated Tobit MLE when
a certain type of heteroscedasticity is present in two simple truncated Tobit
models: (1) the i.i.d. case (that is, the case of the regressor consisting only of a
constant term) and (2) the case of a constant term plus one independent
variable. Recall that the truncated Tobit model is the one in which no infor-
mation is available for those observations for which y_i* ≤ 0 and therefore the
MLE maximizes (10.2.6) rather than (10.2.5).
In the i.i.d. case Hurd created heteroscedasticity by generating rn observations from N(μ, σ_1²) and (1 − r)n observations from N(μ, σ_2²). In each case he recorded only positive observations. Let y_i, i = 1, 2, . . . , n_1, be the recorded observations. (Note n_1 ≤ n.) We can show that the truncated Tobit MLE of μ and σ², denoted μ̂ and σ̂², are defined by equating the first two population moments of y_i to their respective sample moments:
μ̂ + σ̂λ(μ̂/σ̂) = n_1^{-1} Σ_{i=1}^{n_1} y_i (10.5.1)
and
μ̂² + μ̂σ̂λ(μ̂/σ̂) + σ̂² = n_1^{-1} Σ_{i=1}^{n_1} y_i². (10.5.2)
Taking the probability limit of both sides of (10.5.1) and (10.5.2) and expressing plim n_1^{-1} Σ y_i and plim n_1^{-1} Σ y_i² as certain functions of the parameters μ, σ_1², σ_2², and r, we can define plim μ̂ and plim σ̂² implicitly as functions of these parameters. Hurd evaluated the probability limits for various values of μ and σ_1 after having fixed r = 0.5 and σ_2² = 1. Hurd found large asymptotic biases in certain cases.
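Equations (10.5.1) and (10.5.2) can be reduced to one equation in μ̂: substituting (10.5.1) into (10.5.2) gives σ̂² = m_2 − μ̂m_1, where m_1 and m_2 are the first two sample moments. A sketch of an experiment in the spirit of Hurd's, under hypothetical parameter values (function names are ours):

```python
import math, random

def norm_pdf(z): return math.exp(-0.5*z*z)/math.sqrt(2.0*math.pi)
def norm_cdf(z): return 0.5*(1.0 + math.erf(z/math.sqrt(2.0)))
def mills(z): return norm_pdf(z)/norm_cdf(z)

def truncated_mle(m1, m2):
    """Solve (10.5.1)-(10.5.2) for (mu, sigma^2) given the first two sample
    moments m1, m2 of the positive observations.  Substituting (10.5.1) into
    (10.5.2) gives sigma^2 = m2 - mu*m1, leaving one equation in mu,
    solved here by a simple grid scan."""
    best, best_gap = None, float("inf")
    n_grid = 40000
    lo, hi = -5.0, m2/m1 - 1e-6
    for k in range(n_grid):
        mu = lo + (hi - lo)*k/(n_grid - 1)
        s2 = m2 - mu*m1
        if s2 <= 0.0:
            continue
        s = math.sqrt(s2)
        gap = abs(mu + s*mills(mu/s) - m1)   # residual of (10.5.1)
        if gap < best_gap:
            best, best_gap = (mu, s2), gap
    return best

def positive_moments(mu, sigmas, weights, n, rng):
    """First two moments of the recorded (positive) draws from a scale
    mixture of N(mu, sigma_k^2); equal scales give the homoscedastic case."""
    ys = [y for y in (rng.gauss(mu, rng.choices(sigmas, weights)[0])
                      for _ in range(n)) if y > 0.0]
    return sum(ys)/len(ys), sum(y*y for y in ys)/len(ys)

rng = random.Random(3)
# Correct specification: the moment equations recover mu and sigma^2.
m1, m2 = positive_moments(0.5, [1.0], [1.0], 200000, rng)
mu_hat, s2_hat = truncated_mle(m1, m2)
# Heteroscedastic mixture with r = 0.5: mu_het is in general biased.
m1h, m2h = positive_moments(0.5, [0.5, 1.5], [0.5, 0.5], 200000, rng)
mu_het, s2_het = truncated_mle(m1h, m2h)
```

Under correct specification the solution converges to the true (μ, σ²); under the mixture it converges to a pseudo-true value, which is the asymptotic bias Hurd tabulated.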
In the case of one independent variable, Hurd generated observations from N(α + βx_i, σ_i²) after having generated x_i and log σ_i from bivariate N(0, 0, V_1, V_2, ρ). For given values of α, β, V_1, V_2, and ρ, Hurd found the values of α, β, and σ² that maximize E log L, where L is as given in (10.2.6). Those values are the probability limits of the MLE of α, β, and σ² under Hurd's model if the expectation of log L is taken using the same model. Again, Hurd found extremely large asymptotic biases in certain cases.
380 Advanced Econometrics
Arabmazar and Schmidt (1981) showed that the asymptotic biases of the
censored Tobit MLE in the i.i.d. case are not as large as those obtained by
Hurd.
10.5.3 Nonnormality
Goldberger (1983) considered an i.i.d. truncated sample model in which data are generated by a certain nonnormal distribution with mean μ and variance 1 and are recorded only when the value is smaller than a constant c. Let y represent the recorded random variable and let ȳ be the sample mean. The researcher is to estimate μ by the MLE, assuming that the data are generated by N(μ, 1). As in Hurd's i.i.d. model, the MLE μ̂ is defined by equating the population mean of y to its sample mean:
μ̂ − λ(c − μ̂) = ȳ. (10.5.3)
Taking the probability limit of both sides of (10.5.3) under the true model and putting plim μ̂ = μ* yield
μ* − λ(c − μ*) = μ − h(c − μ), (10.5.4)
where h(c − μ) = E(μ − y|y < c), the expectation being taken using the true model. Defining m = μ* − μ and θ = c − μ, we can rewrite (10.5.4) as
m = λ(θ − m) − h(θ). (10.5.5)
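Equation (10.5.5) can be solved numerically for the asymptotic bias m; the map m ↦ λ(θ − m) − h(θ) is a contraction, so fixed-point iteration converges. A sketch assuming μ = 0 and h computed by quadrature; the Laplace example is our own illustration, not one of Goldberger's cases:

```python
import math

def norm_pdf(z): return math.exp(-0.5*z*z)/math.sqrt(2.0*math.pi)
def norm_cdf(z): return 0.5*(1.0 + math.erf(z/math.sqrt(2.0)))
def mills(z): return norm_pdf(z)/norm_cdf(z)

def h_fn(theta, pdf, lo=-15.0, steps=150000):
    """h(theta) = E(mu - y | y < c) with mu = 0, theta = c, by midpoint quadrature."""
    dz = (theta - lo)/steps
    num = den = 0.0
    for i in range(steps):
        z = lo + (i + 0.5)*dz
        w = pdf(z)*dz
        num -= z*w
        den += w
    return num/den

def asymptotic_bias(theta, pdf, iters=100):
    """Solve m = lambda(theta - m) - h(theta), Eq. (10.5.5), by fixed-point iteration."""
    h = h_fn(theta, pdf)
    m = 0.0
    for _ in range(iters):
        m = mills(theta - m) - h
    return m

# Correct specification (true density normal): h(theta) = lambda(theta), so m = 0.
m_normal = asymptotic_bias(0.5, norm_pdf)

# A hypothetical misspecification: a unit-variance Laplace density.
b = 1.0/math.sqrt(2.0)
m_laplace = asymptotic_bias(0.5, lambda z: math.exp(-abs(z)/b)/(2.0*b))
```

When the true density is normal the computed bias is zero up to quadrature error; a nonnormal true density generally yields m ≠ 0, which is the point of Goldberger's exercise.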
Table 10.3 Characteristics of the five types of Tobit models

                Dependent variables
Type     y_1     y_2     y_3
 1        C
 2        B       C
 3        C       C
 4        C       C       C
 5        B       C       C

Note: C = censored; B = binary.
where Π_0 and Π_1 stand for the product over those i for which y_{2i} = 0 and y_{2i} ≠ 0, respectively, and f(·|y_{1i}* > 0) stands for the conditional density of y_{2i}* given y_{1i}* > 0. Note the similarity between (10.4.1) and (10.7.3). As in Type 1
Also, (10.7.4) is a part of the likelihood function for every one of the five types of models; therefore a consistent estimate of β_1/σ_1 can be obtained by the probit MLE in each of these types of models.
We can rewrite (10.7.3) as
where f(·,·) denotes the joint density of y_{1i}* and y_{2i}*. We can write the joint density as the product of a conditional density and a marginal density, that is, f(y_{1i}*, y_{2i}*) = f(y_{1i}*|y_{2i}*) f(y_{2i}*), and can determine a specific form for f(y_{1i}*|y_{2i}*) from the well-known fact that the conditional distribution of y_{1i}* given y_{2i}* is normal with mean x_{1i}'β_1 + σ_{12}σ_2^{-2}(y_{2i}* − x_{2i}'β_2) and variance σ_1² − σ_{12}²σ_2^{-2}. Thus we can further rewrite (10.7.5) as
L = Π_0 [1 − Φ(x_{1i}'β_1/σ_1)] (10.7.6)
y_{1i}* = x_{1i}'β_1 + u_{1i}
y_{2i}* = x_{2i}'β_2 + u_{2i} (10.8.1)
y_{1i} = y_{1i}* if y_{1i}* > 0
      = 0 if y_{1i}* ≤ 0
y_{2i} = y_{2i}* if y_{1i}* > 0
      = 0 if y_{1i}* ≤ 0, i = 1, 2, . . . , n,
where (uu , u21 ) are i.i.d. drawings from a bivariate normal distribution with
zero mean, variances σ_1² and σ_2², and covariance σ_{12}. Note that this model differs from Type 2 only in that in this model y_{1i}* is also observed when it is positive.
Because the estimation of this model can be handled in a manner similar to
the handling of Type 2, we shall discuss it only briefly. Instead, in the following
we shall give a detailed discussion of the estimation of Heckman's model
(1974), which constitutes the structural equations version of the model
(10.8.1).
The likelihood function of the model (10.8.1) can be written as
to be poor, and all three ML estimators were found to perform well. Heck-
man's two-step estimator was ranked somewhere between LS and MLE.
eters by MLE. In the next two subsections we shall discuss three alternative
methods of estimating the structural parameters.
tions two-step estimator and its WLS version (taking account of the hetero-
scedasticity) to more general simultaneous equations Tobit models and ob-
tained their asymptotic variance-covariance matrices.
where z_i include the earnings characteristics of the ith firm. (Vv_i is assumed to be unity.) The variable y_i* may be regarded as an index affecting a firm's decision as to whether or not it requests a rate increase. The model (10.8.14) and (10.8.15) can be labeled as P(y_1 < 0) · P(y_1 > 0, y_2, y_3) in our shorthand
(10.8.15) can be labeled as P(y, <0) P(yi >0, Y2 y3) in our shorthand
notation and therefore is a simple generalization of Type 3. The estimation
method of Roberts, Maddala, and Enholm is that of Lee, Maddala, and Trost
(1980) and can be described as follows:
Step 1. Estimate a by the probit MLE.
Step 2. Estimate /32 by Heckman's two-step method.
Step 3. Replace y_{2i}* in the right-hand side of (10.8.15) by ŷ_{2i}* obtained in step 2 and estimate γ and β_3 by least squares applied to (10.8.15) after adding the hazard rate term E(u_{3i}|y_{1i}* > 0).
The second model of Roberts, Maddala, and Enholm is the same as the first
model except that (10.8.16) is replaced by
y_{2i}* > R_i, (10.8.17)
where R_i refers to the current rate being earned, an independent variable. Thus this model is essentially Type 3. (It would be exactly Type 3 if R_i = 0.) The
estimation method is as follows:
Step 1. Estimate /32 by the Tobit MLE.
Step 2. Repeat step 3 described in the preceding paragraph.
y_{1i}* = x_{1i}'β_1 + u_{1i}
y_{2i}* = x_{2i}'β_2 + u_{2i}
y_{3i}* = x_{3i}'β_3 + u_{3i} (10.9.1)
y_{1i} = y_{1i}* if y_{1i}* > 0
      = 0 if y_{1i}* ≤ 0
y_{2i} = y_{2i}* if y_{1i}* > 0
      = 0 if y_{1i}* ≤ 0
y_{3i} = y_{3i}* if y_{1i}* ≤ 0
      = 0 if y_{1i}* > 0, i = 1, 2, . . . , n,
where (u_{1i}, u_{2i}, u_{3i}) are i.i.d. drawings from a trivariate normal distribution. This model differs from Type 3 defined by (10.8.1) only by the addition of y_{3i}*, which is observed only if y_{1i}* ≤ 0. The estimation of this model is not significantly different from that of Type 3. The likelihood function can be written as
L = Π_0 ∫_{-∞}^{0} f_3(y_{1i}*, y_{3i}) dy_{1i}* Π_1 f_2(y_{1i}, y_{2i}), (10.9.2)
where f_3(·,·) is the joint density of y_{1i}* and y_{3i}* and f_2(·,·) is the joint density of y_{1i}* and y_{2i}*. Heckman's two-step method for this model is similar to the method for the preceding model. However, we must deal with three conditional expectation equations in the present model. The equation for y_{3i} will be slightly different from the other two because the variable is nonzero when y_{1i}* is nonpositive. We obtain
E(y_{3i}|y_{1i}* ≤ 0) = x_{3i}'β_3 − σ_{13}σ_1^{-1} φ(x_{1i}'β_1/σ_1)/[1 − Φ(x_{1i}'β_1/σ_1)]. (10.9.3)
We shall discuss three examples of the Type 4 Tobit model in the following
subsections: the model of Kenny et al. (1979); the model of Nelson and Olson
(1978); and the model of Tomes (1981). In the first two models the y* equa-
tions are written as simultaneous equations, like Heckman's model (1974), for
which the reduced-form equations take the form of (10.9.1). Tomes' model
has a slight twist. The estimation of the structural parameters of such models
can be handled in much the same way as the estimation of Heckman's model
(1974), that is, by either Heckman's simultaneous equations two-step method
(and its Lee-Maddala-Trost extension) or by Amemiya's LS and GLS, both of
which were discussed in Section 10.8.
In fact, these two estimation methods can easily accommodate the follow-
ing very general simultaneous equations Tobit model:
Γ'y_i* = B'x_i + u_i, i = 1, 2, . . . , n, (10.9.4)
where the elements of the vector y_i* contain the following three classes of variables: (1) always completely observable, (2) sometimes completely observable and sometimes observed to lie in intervals, and (3) always observed to lie in intervals. Note that the variable classified as C in Table 10.3 belongs to class
2 and the variable classified as B belongs to class 3. The models of Heckman
(1974), Kenny et al. (1979), and Nelson and Olson (1978), as well as a few
more models discussed under Type 5, such as that of Heckman (1978), are all
special cases of the model (10.9.4).
y_{1i} = y_{1i}* if y_{1i}* > 0 (10.9.9)
      = 0 if y_{1i}* ≤ 0,
where y_{1i}* is the potential inheritance, y_{1i} is the actual inheritance, and y_{2i} is the recipient's income. Note that this model differs from Nelson's model defined by (10.9.5) and (10.9.6) only in that y_{1i}, not y_{1i}*, appears in the right-hand side of (10.9.8). Assuming γ_1 γ_2 < 1 for the logical consistency of the model (as in Amemiya, 1974b, mentioned in Section 10.6), we can rewrite (10.9.7) as
y_{1i}* = (1 − γ_1 γ_2)^{-1} [γ_1 (x_{2i}'β_2 + u_{2i}) + x_{1i}'β_1 + u_{1i}] (10.9.10)
and (10.9.8) as
y_{2i} = y_{2i}^{(1)} ≡ (1 − γ_1 γ_2)^{-1} [γ_2 (x_{1i}'β_1 + u_{1i}) + x_{2i}'β_2 + u_{2i}] (10.9.11)
if y_{1i}* > 0,
L = Π_0 ∫_{-∞}^{0} f(y_{1i}*, y_{2i}) dy_{1i}* Π_1 f(y_{1i}, y_{2i}^{(1)}), (10.9.12)
y*₃ᵢ = x′₃ᵢβ₃ + u₃ᵢ
y₂ᵢ = y*₂ᵢ if y*₁ᵢ > 0
 = 0 if y*₁ᵢ ≤ 0
y₃ᵢ = y*₃ᵢ if y*₁ᵢ ≤ 0
 = 0 if y*₁ᵢ > 0, i = 1, 2, . . . , n,
where (u₁ᵢ, u₂ᵢ, u₃ᵢ) are i.i.d. drawings from a trivariate normal distribution.
The likelihood function of the model is
L = ∏₀ ∫_{−∞}^{0} f₃(y*₁ᵢ, y₃ᵢ) dy*₁ᵢ ∏₁ ∫₀^{∞} f₂(y*₁ᵢ, y₂ᵢ) dy*₁ᵢ,  (10.10.2)
where ∏₀ and ∏₁ denote the products over those i for which y₃ᵢ and y₂ᵢ are observed, respectively.
y*₁ᵢ = z′ᵢα + vᵢ.  (10.10.3)
Because we observe only y*₂ᵢ if the worker joins the union and y*₃ᵢ if the worker
does not, the observed wage yᵢ is defined by
yᵢ = y*₂ᵢ if y*₁ᵢ > 0
 = y*₃ᵢ if y*₁ᵢ ≤ 0.
Lee assumed that x₂ and x₃ (the independent variables in the y*₂ᵢ and y*₃ᵢ
equations) include variables that represent the characteristics of firms and workers, such as
regional location, city size, education, experience, race, sex, and health,
whereas z includes certain other individual characteristics and variables that
represent the monetary and nonmonetary costs of becoming a union member.
Because y*₁ᵢ is unobserved except for its sign, the variance of y*₁ᵢ can be assumed
to be unity without loss of generality.
Lee estimated his model by Heckman's two-step method applied separately
to the y*₂ and y*₃ equations. In Lee's model simultaneity exists only in the y*₁
equation and hence is ignored in the application of Heckman's two-step
method. Amemiya's LS or GLS, which accounts for the simultaneity, will, of
course, work for this model as well, and the latter will yield more efficient
estimates although, of course, not as fully efficient as the MLE.
and
y*₂ᵢ = γ₂y*₁ᵢ + x′₂ᵢβ₂ + δ₂wᵢ + u₂ᵢ,  (10.10.6)
where
wᵢ = 1 if y*₁ᵢ > 0
 = 0 if y*₁ᵢ ≤ 0.  (10.10.7)
There are no empirical results in the 1978 article, but the same model was
estimated by Heckman (1976b); in this application y*₂ᵢ represents the average
income of black people in the ith state, y*₁ᵢ the unobservable sentiment toward
blacks in the ith state, and wᵢ = 1 if an antidiscrimination law is instituted in
the ith state.
When we solve (10.10.5) and (10.10.6) for y*₁ᵢ, the solution should not
depend upon wᵢ, for that would clearly lead to logical inconsistencies. Therefore
we must assume
γ₁δ₂ + δ₁ = 0  (10.10.8)
and
y₂ᵢ = δ₂wᵢ + x′ᵢπ₂ + v₂ᵢ,  (10.10.10)
where we can assume Vv₁ᵢ = 1 without loss of generality. Thus Heckman's
model is a special case of Type 5 with just a constant shift between y*₂ᵢ and y*₃ᵢ
(that is, y*₂ᵢ = δ₂ + x′ᵢπ₂ + v₂ᵢ and y*₃ᵢ = x′ᵢπ₂ + v₂ᵢ). Moreover, if δ₂ = 0, it is a
special case of Type 5 where y*₂ᵢ = y*₃ᵢ.
Let us compare Heckman's reduced-form model defined by (10.10.9) and
(10.10.10) with Lee's model. Equation (10.10.9) is essentially the same
as (10.10.3) of Lee's model. Equation (10.10.4) of Lee's model can be rewrit-
ten as
up-to-date survey has been given by Quandt (1982). See also the article by
Hartley (1976a) for a connection between a disequilibrium model and the
standard Tobit model. Here we shall mention two basic models first discussed
in the pioneering work of Fair and Jaffee (1972).
The simplest disequilibrium model of Fair and Jaffee is a special case of the
Type 5 model (10.10.1), in which y*₂ᵢ is the quantity demanded in the ith
period, y*₃ᵢ is the quantity supplied in the ith period, and y*₁ᵢ = y*₂ᵢ − y*₃ᵢ. Thus
the actual quantity sold, which a researcher observes, is the minimum of
supply and demand. The fact that the variance-covariance matrix of
(y*₁ᵢ, y*₂ᵢ, y*₃ᵢ) is only of rank 2 because of the linear relationship above does not
essentially change the nature of the model because the likelihood function
(10.10.2) involves only bivariate densities.
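The min condition is easy to illustrate numerically. The following sketch (the normal design and the mean of 10 are illustrative assumptions, not estimates from Fair and Jaffee's data) simulates demand and supply and shows that the observed quantity, being the minimum of the two, has a mean below that of either latent variable:

```python
# Illustrative simulation of the Fair-Jaffee min condition: the observed
# quantity sold is the smaller of quantity demanded and quantity supplied.
import math
import numpy as np

rng = np.random.default_rng(6)
n = 100000
demand = 10 + rng.normal(size=n)      # quantity demanded in each period
supply = 10 + rng.normal(size=n)      # quantity supplied in each period
sold = np.minimum(demand, supply)     # what the researcher observes

# For two independent N(mu, 1) variables, E[min] = mu - 1/sqrt(pi)
print(sold.mean(), 10 - 1 / math.sqrt(math.pi))
```

The downward shift of the observed mean relative to the latent means is exactly the selection effect that the bivariate likelihood (10.10.2) accounts for.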
In another model Fair and Jaffee added the price equation to the model of
the preceding paragraphs as
y₄ᵢ = γy*₁ᵢ,  (10.10.12)
where y₄ᵢ denotes the change in the price at the ith period. The likelihood
function of this model can be written as
L = ∏₀ ∫_{−∞}^{0} f₃(y*₁ᵢ, y₃ᵢ | y₄ᵢ) f(y₄ᵢ) dy*₁ᵢ ∏₁ ∫₀^{∞} f₂(y*₁ᵢ, y₂ᵢ | y₄ᵢ) f(y₄ᵢ) dy*₁ᵢ.  (10.10.13)
and
T*₂ᵢ = z′₂ᵢα₂ + v₂ᵢ,  (10.10.16)
where C*₂ᵢ and T*₂ᵢ denote the cost and the time incurred by the ith person
traveling by a private mode of transportation, and, similarly, the cost and the
time of traveling by a public mode are specified as
C*₃ᵢ = x′₃ᵢβ₃ + u₃ᵢ  (10.10.17)
and
T*₃ᵢ = z′₃ᵢα₃ + v₃ᵢ.  (10.10.18)
Lee assumed that C*₂ᵢ and T*₂ᵢ are observed if the ith person uses a private mode
and C*₃ᵢ and T*₃ᵢ are observed if he or she uses a public mode. A private mode is
used if y*₁ᵢ > 0, where y*₁ᵢ is given by
y*₁ᵢ = s′ᵢδ₁ + δ₂T*₂ᵢ + δ₃T*₃ᵢ + δ₄(C*₂ᵢ − C*₃ᵢ) + εᵢ.  (10.10.19)
Lee estimated his model by the following sequential procedure:
Step 1. Apply the probit MLE to (10.10.19) after replacing the starred
variables with their respective right-hand sides.
Step 2. Apply LS to each of the four equations (10.10.15) through (10.10.18)
after adding to the right-hand side of each the estimated hazard from step 1.
Step 3. Predict the dependent variables of the four equations (10.10.15)
through (10.10.18), using the estimates obtained in step 2; insert the predictors
into (10.10.19) and apply the probit MLE again.
Step 4. Calculate the MLE by iteration, starting from the estimates obtained
at the end of step 3.
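The flavor of steps 1 and 2 — a probit on the selection equation followed by LS with the estimated hazard added as a regressor — can be sketched on a simpler two-equation selection model. Everything below (design, variable names, parameter values) is an illustrative assumption, not Lee's specification or data:

```python
# Sketch of a probit first step and a hazard-augmented LS second step on a
# simulated selection model (illustrative, not Lee's actual model).
import math
import numpy as np

rng = np.random.default_rng(0)
n = 20000
z = rng.normal(size=n)
x = rng.normal(size=n)
u = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n)
ystar = 1.0 + 0.8 * z + u[:, 0]            # latent selection index
w = 2.0 + 1.5 * x + u[:, 1]                # outcome, observed when ystar > 0
sel = ystar > 0

Phi = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
def phi(v):
    return np.exp(-v * v / 2.0) / math.sqrt(2.0 * math.pi)

# Step 1: probit MLE for P(sel = 1 | z) by Fisher scoring.
Z = np.column_stack([np.ones(n), z])
g = np.zeros(2)
for _ in range(30):
    idx = Z @ g
    p = Phi(idx)
    score = Z.T @ ((sel - p) * phi(idx) / (p * (1.0 - p)))
    wgt = phi(idx) ** 2 / (p * (1.0 - p))
    info = (Z * wgt[:, None]).T @ Z
    g = g + np.linalg.solve(info, score)

# Step 2: LS on the selected sample with the estimated hazard added.
idx = Z @ g
hazard = phi(idx[sel]) / Phi(idx[sel])
X = np.column_stack([np.ones(sel.sum()), x[sel], hazard])
b = np.linalg.lstsq(X, w[sel], rcond=None)[0]
```

The coefficient on the hazard term estimates the covariance between the selection and outcome errors; steps 3 and 4 would iterate this scheme toward the MLE.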
Willis and Rosen (1979) studied earnings differentials between those who
went to college and those who did not, using a more elaborate model than that
of Kenny et al. (1979), which was discussed in Section 10.9.2. In the model of
Kenny et al., y*₁ᵢ (the desired years of college education, the sign of which
determines whether an individual attends college) is specified not to depend
directly on y*₂ᵢ and y*₃ᵢ (the earnings of the college-goer and the non-college-
goer, respectively). The first inclination of a researcher might be to hypothe-
size y*₁ᵢ = y*₂ᵢ − y*₃ᵢ. However, this would be an oversimplification because the
decision to go to college should depend on the difference in expected lifetime
earnings rather than in current earnings.
Willis and Rosen solved this problem by developing a theory of the maxi-
mization of discounted, expected lifetime earnings, which led to the following
model:
I₃ᵢ = x′₃ᵢβ₃ + u₃ᵢ,  (10.10.22)
G₃ᵢ = z′₃ᵢα₃ + v₃ᵢ,  (10.10.23)
and
Rᵢ = … + eᵢ, i = 1, 2, . . . , n,  (10.10.24)
where I₂ᵢ and G₂ᵢ denote the initial earnings (in logarithm) and the growth rate
of earnings for the college-goer, I₃ᵢ and G₃ᵢ denote the same for the non-col-
lege-goer, and Rᵢ denotes the discount rate. It is assumed that the ith person
goes to college if y*₁ᵢ > 0, where
(10.10.25)
and that the variables with subscript 2 are observed if y*₁ᵢ > 0, those with
subscript 3 are observed if y*₁ᵢ ≤ 0, and Rᵢ is never observed. Thus the model is
formally identical to Lee's model (1977). Willis and Rosen used an estimation
method identical to that of Lee, given earlier in this subsection.
Borjas and Rosen (1980) used the same model as Willis and Rosen to study
the earnings differential between those who changed jobs and those who did
not within a certain period of observation.
and
yᵢ(k) = xᵢ(k)′β + vᵢₖ,  (10.10.27)
where the regressors in (10.10.26) and (10.10.27) are vector functions of the
input-output prices, and economic theory dictates that the same β appears in
both equations.¹⁹ It is assumed that (uᵢ₁, uᵢ₂, . . . , uᵢK, vᵢ₁, vᵢ₂, . . . , vᵢK) are
i.i.d. drawings from a multivariate normal distribution,
where kᵢ is the actual location the ith firm was observed to choose.
The estimation method proposed by Duncan can be outlined as follows:
Step 1. Estimate the β that characterize f in (10.10.29) by nonlinear WLS.
Step 2. Estimate the β that characterize P in (10.10.29) by the multinomial
probit MLE using the nonlinear WLS iteration.
Step 3. Choose the optimum linear combination of the two estimates of β
obtained in steps 1 and 2.
To describe step 1 explicitly, we must evaluate μᵢ ≡ E[yᵢ(kᵢ)|wᵢ(kᵢ) = 1]
and σ²ᵢ ≡ V[yᵢ(kᵢ)|wᵢ(kᵢ) = 1] as functions of β and the variances and covar-
iances of the error terms of Eqs. (10.10.26) and (10.10.27). These conditional
moments can be obtained as follows. Define zᵢ(j) = sᵢ(kᵢ) − sᵢ(j) and the
(K − 1)-vector zᵢ = [zᵢ(1), . . . , zᵢ(kᵢ − 1), zᵢ(kᵢ + 1), . . . , zᵢ(K)]′. To
simplify the notation, write zᵢ as z, omitting the subscript. Similarly, write
yᵢ(kᵢ) as y. Also, define R = E(y − Ey)(z − Ez)′[E(z − Ez)(z − Ez)′]⁻¹ and
Q = Vy − RE(z − Ez)(y − Ey). Then we obtain
E(y|z > 0) = Ey + RE(z|z > 0) − REz  (10.10.30)
and
σ²ᵢ ≡ V(y|z > 0) = RV(z|z > 0)R′ + Q.  (10.10.31)
The conditional moments of z appearing in (10.10.30) and (10.10.31) can be
found in the articles by Amemiya (1974b, p. 1002) and Duncan (1980, p. 850).
Finally, we can describe the nonlinear WLS iteration of step 1 as follows:
Estimate σ²ᵢ by inserting the initial estimates (for example, those obtained by
minimizing Σᵢ[yᵢ(kᵢ) − μᵢ]²) of the parameters into the right-hand side of
(10.10.31); then minimize Σᵢ σ̂ᵢ⁻²[yᵢ(kᵢ) − μᵢ]²
with respect to the parameters that appear in the right-hand side of (10.10.30).
Use these estimates to evaluate the right-hand side of (10.10.31) again to get
another estimate of σ²ᵢ. Repeat the process to yield new estimates of β.
Now consider step 2. Define
Σᵢ ≡ E(wᵢ − Pᵢ)(wᵢ − Pᵢ)′ = Dᵢ − PᵢP′ᵢ,  (10.10.33)
where Dᵢ is the K × K diagonal matrix the kth diagonal element of which is
Pᵢₖ. To perform the nonlinear WLS iteration, first, estimate Σᵢ by inserting the
initial estimates of the parameters into the right-hand side of (10.10.33) (de-
note the estimate thus obtained as Σ̂ᵢ); second, minimize
Σᵢ (wᵢ − Pᵢ)′Σ̂ᵢ⁻(wᵢ − Pᵢ),  (10.10.34)
where the minus sign in the superscript denotes a generalized inverse, with
respect to the parameters that characterize Pᵢ; and repeat the process until the
estimates converge. A generalized inverse A⁻ of A is any matrix that satisfies
AA⁻A = A (Rao, 1973, p. 24). A generalized inverse Σ̂ᵢ⁻ is obtained from the
matrix D̂ᵢ⁻¹ + ll′, where l is a vector of ones, by replacing its kth column
and row by a zero vector. It is not unique because we may choose any k.
Finally, regarding step 3, if we denote the two estimates of β obtained
by steps 1 and 2 by β̂₁ and β̂₂, respectively, and their respective asymp-
totic variance-covariance matrices by V₁ and V₂, the optimal linear com-
bination of the two estimates is given by (V₁⁻¹ + V₂⁻¹)⁻¹V₁⁻¹β̂₁ +
(V₁⁻¹ + V₂⁻¹)⁻¹V₂⁻¹β̂₂. This final estimator is asymptotically not fully effi-
cient, however. To see this, suppose the regression coefficients of (10.10.26)
and (10.10.27) differ: call them β₁ and β₂, say. Then, by a result of Amemiya
(1976b), we know that β̂₂ is an asymptotically efficient estimator of β₂. How-
ever, as we have indicated in Section 10.4.4, β̂₁ is not asymptotically efficient.
So a weighted average of the two could not be asymptotically efficient.
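The matrix-weighted combination of step 3 is a one-line computation. The numbers below are purely hypothetical:

```python
# Matrix-weighted optimal combination of two estimates of the same parameter
# vector: (V1^-1 + V2^-1)^-1 (V1^-1 b1 + V2^-1 b2). All numbers hypothetical.
import numpy as np

b1 = np.array([1.2, 0.9])
V1 = np.array([[0.04, 0.01],
               [0.01, 0.09]])
b2 = np.array([0.8, 1.1])
V2 = np.array([[0.09, 0.00],
               [0.00, 0.04]])

W1, W2 = np.linalg.inv(V1), np.linalg.inv(V2)
V = np.linalg.inv(W1 + W2)          # covariance of the combined estimator
b = V @ (W1 @ b1 + W2 @ b2)         # the optimal linear combination
```

The combined covariance V is no larger than V₁ or V₂ in the matrix sense, which is the sense in which the combination is optimal within this class — though, as the text notes, not fully efficient when one input estimator is itself inefficient.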
Exercises
1. (Section 10.4.3)
Verify (10.4.19).
2. (Section 10.4.3)
Verify (10.4.28).
3. (Section 10.4.3)
Consider Vγ̂ and Vγ̂W given in (10.4.32) and (10.4.33). As stated in
the text, the difference of the two matrices is neither positive definite
nor negative definite. Show that the first part of Vγ̂W, namely,
σ²(Z′Z)⁻¹, is smaller than Vγ̂ in the matrix sense.
4. (Section 10.4.5)
In the standard Tobit model (10.2.3), assume that σ² = 1, β is a scalar and
the only unknown parameter, and (xᵢ) are i.i.d. binary random variables
taking 1 with probability p and 0 with probability 1 − p. Derive the
formulae of AV[√n(β̂ − β)] for β̂ = probit MLE, Tobit MLE, Heck-
man's LS, and NLLS. Evaluate them for β = 0, 1, and 2.
5. (Section 10.4.6)
Consider the following model:
yᵢ = 1 if y*ᵢ ≥ 0
 = 0 if y*ᵢ < 0, i = 1, 2, . . . , n,
where (y*ᵢ) are independent N(x′ᵢβ, 1). It is assumed that (yᵢ) are observed
but (y*ᵢ) are not. Write a step-by-step instruction of the EM algorithm to
obtain the MLE of β and show that the MLE is an equilibrium solution of
the iteration.
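A minimal numerical sketch of the iteration this exercise asks for (the simulated design is an assumption): the E-step replaces each latent y*ᵢ by its conditional mean given the observed sign, using truncated-normal means, and the M-step is an LS regression of those means on X.

```python
# EM algorithm for the probit MLE: E-step imputes E[y* | y, x, beta] with
# truncated-normal means; M-step regresses the imputed values on X.
import math
import numpy as np

rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.3, 1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

Phi = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
def phi(v):
    return np.exp(-v * v / 2.0) / math.sqrt(2.0 * math.pi)

beta = np.zeros(2)
for _ in range(500):
    m = X @ beta
    # E-step: conditional mean of the latent variable given its observed sign
    ystar = np.where(y == 1.0, m + phi(m) / Phi(m), m - phi(m) / (1.0 - Phi(m)))
    # M-step: least squares of the imputed latent variable on X
    beta = np.linalg.lstsq(X, ystar, rcond=None)[0]
```

At a fixed point of this iteration beta satisfies the probit likelihood equations, which is the "equilibrium solution" property the exercise asks you to verify analytically.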
6. (Section 10.6)
Consider the following model:
y*₁ᵢ = x′₁ᵢβ₁ + u₁ᵢ
y*₂ᵢ = x′₂ᵢβ₂ + u₂ᵢ
y₁ᵢ = y*₁ᵢ if y*₁ᵢ > 0
 = 0 if y*₁ᵢ ≤ 0
y₂ᵢ = 1 if y*₂ᵢ > 0
 = 0 if y*₂ᵢ ≤ 0, i = 1, 2, . . . , n,
where (u₁ᵢ, u₂ᵢ) are i.i.d. with the continuous density f( , ). Denote the
marginal density of u₁ᵢ by f₁( ) and that of u₂ᵢ by f₂( ).
a. Assuming that y₁ᵢ, x₁ᵢ, and x₂ᵢ (but not y₂ᵢ) are observed for i = 1,
2, . . . , n, express the likelihood function in terms of f, f₁, and f₂.
b. Assuming that y₁ᵢ, y₂ᵢ, x₁ᵢ, and x₂ᵢ are observed for all i, express the
likelihood function in terms of f, f₁, and f₂.
7. (Section 10.6)
Consider the following model:
y*ᵢ = αzᵢ + uᵢ
z*ᵢ = βyᵢ + vᵢ
yᵢ = 1 if y*ᵢ ≥ 0
 = 0 if y*ᵢ < 0
zᵢ = 1 if z*ᵢ ≥ 0
 = 0 if z*ᵢ < 0,
where uᵢ and vᵢ are jointly normal with zero means and nonzero covar-
iance. Assume that y*, z*, u, and v are unobservable and y and z are
observable. Show that the model makes sense (that is, y and z are uniquely
determined as functions of u and v) if and only if αβ = 0.
8. (Section 10.6)
In the model of Exercise 7, assume that β = 0 and that we have n
observations on (yᵢ, zᵢ), i = 1, 2, . . . , n. Write the likelihood function
of α. You may write the joint density of (u, v) as simply f(u, v) without
explicitly writing the bivariate normal density.
9. (Section 10.6)
Suppose y*ᵢ and z*ᵢ, i = 1, 2, . . . , n, are i.i.d. and jointly normally dis-
tributed with nonzero correlation. For each i, we observe (1) only y*ᵢ, (2)
only z*ᵢ, or (3) neither, according to the following scheme:
(1) Observe yᵢ = y*ᵢ and do not observe z*ᵢ if y*ᵢ ≥ 0.
(2) Observe zᵢ = z*ᵢ and do not observe y*ᵢ if y*ᵢ < 0 and z*ᵢ > 0.
(3) Do not observe either if y*ᵢ < 0 and z*ᵢ ≤ 0.
Write down the likelihood function of the model. You may write the joint
normal density simply as f( , ).
We can use the term time series models in a broad sense to mean statistical
models that specify how the distribution of random variables observed over
time depends on their past observations. Thus defined, Markov chain models
and duration models, as well as the models discussed in Chapter 5, are special
cases of time series models. However, time series models in a narrow sense
refer to the models of Chapter 5, in which random variables take on continu-
ous values and are observed at discrete times. Thus we may characterize the
models of Chapter 5 as continuous-state, discrete-time models. Continuous-
state, continuous-time models also constitute an important class of models,
although we have not discussed them. In contrast, Markov chain models (or,
more simply, Markov models) may be characterized as discrete-state, dis-
crete-time models, and duration models (or survival models) as discrete-state,
continuous-time models. In this chapter we shall take up these two models in
turn.
The reader who wishes to pursue either of these topics in greater detail than
is presented here should consult the textbooks by Bartholomew (1982) for
Markov models and by Kalbfleisch and Prentice (1980) or Miller (1981) for
duration models. For recent results on duration models with econometric
applications, see Heckman and Singer (1984b).
yᵢʲ(t) = 1 if the ith person is in state j at time t
 = 0 otherwise,
i = 1, 2, . . . , N, t = 1, 2, . . . , T,
j = 1, 2, . . . , M.
Defining nʲ(t) = Σᵢ₌₁ᴺ yᵢʲ(t) and denoting the initial state probabilities by
pᵢʲ(0), the likelihood function of the first-order Markov model can be written
as
L = ∏ᵢ ∏ₜ ∏ⱼ ∏ₖ Pᵢʲᵏ(t)^{yᵢʲ(t−1)yᵢᵏ(t)} ∏ᵢ ∏ⱼ pᵢʲ(0)^{yᵢʲ(0)}  (11.1.4)
 ≡ L₁L₂.
which implies that the first column of H is p(∞) and hence the first row of H⁻¹
is the transpose of an M-vector consisting only of ones. Therefore, from (11.1.14),
Inserting (11.1.16) into (11.1.13) yields the identity p(∞) = p(∞)l′p(0) = p(∞)
for any value of p(0). If p(∞) exists, it can be determined by solving (11.1.15)
subject to the constraint l′p(∞) = 1. Because the rank of I − P′ is M − 1 under
L₁ = ∏ⱼ ∏ₖ Pⱼₖ^{nⱼₖ},  (11.1.19)
where nⱼₖ = Σᵢ Σₜ yᵢʲ(t − 1)yᵢᵏ(t).
See Anderson and Goodman (1957) for the asymptotic properties of the MLE
(11.1.22).
Anderson and Goodman also discussed the test of various hypotheses in the
homogeneous stationary Markov model. Suppose we want to test the null
hypothesis that Pⱼₖ is equal to a certain (nonzero) specified value P⁰ⱼₖ for
k = 1, 2, . . . , M and for a particular j. Then, using a derivation similar to
(9.3.24), we can show
Sⱼ = Σₖ₌₁ᴹ nⱼ(P̂ⱼₖ − P⁰ⱼₖ)²/P⁰ⱼₖ → χ²_{M−1},  (11.1.23)
where nⱼ = Σₖ₌₁ᴹ nⱼₖ and P̂ⱼₖ is the MLE. Furthermore, if P⁰ⱼₖ is given for j = 1, 2, . . . , M as well
as k, we can use the test statistic Σⱼ₌₁ᴹ Sⱼ, which is asymptotically distributed as
chi-square with M(M − 1) degrees of freedom. Next, suppose we want to test
(11.1.18) itself against a homogeneous but nonstationary model characterized
by PM = PA (t). This can be tested by the likelihood ratio test statistic with
the following distribution:
2 log H H H rfi, /15, _,k ,, nik (I) XtT- IMAM- 1) I (11.1.24)
r j k
Pⱼₖ(t) = (1 − Sⱼ)Vⱼₖ(t) if j ≠ k.
P = S + (I − S)V.  (11.1.27)
Now, P and P⁽²⁾ can be consistently estimated by the MLE mentioned earlier.
Inserting the MLE into the left-hand side of (11.1.27) and (11.1.28) gives us
2M(M − 1) equations. But since there are only M² parameters to estimate in S
and V, solving M² equations out of the 2M(M − 1) equations for S and V will
yield consistent estimates.
The empirical phenomenon mentioned earlier can be explained by the
mover-stayer model as follows: From (11.1.27) and (11.1.28) we obtain after
some manipulation
P⁽²⁾ − P² = S + (I − S)V² − [S + (I − S)V][S + (I − S)V]
 = (I − S)(I − V)S(I − V),  (11.1.29)
the diagonal elements of which are positive.
Shorrocks (1976) accounted for the invalidity of (11.1.25) in a study of
income mobility by postulating a second-order Markov model. Depending on
the initial conditions and the values of the parameters, a second-order Markov
model can lead to a situation where the diagonal elements of P⁽²⁾ are larger
than the corresponding elements of P².
The likelihood function of a second-order Markov model conditional on
the initial values yᵢʲ(−1) and yᵢʲ(0) is given by
L = ∏ₜ ∏ᵢ ∏ⱼ ∏ₖ ∏ₗ Pᵢʲᵏˡ(t)^{yᵢʲ(t−2)yᵢᵏ(t−1)yᵢˡ(t)},  (11.1.30)
where Pᵢʲᵏˡ(t) is the probability the ith person is in state l at time t given that he
or she was in state j at time t − 2 and in state k at time t − 1. If homogeneity
and stationarity are assumed, then Pᵢʲᵏˡ(t) = Pⱼₖₗ. Even then the model con-
tains M²(M − 1) parameters to estimate. Shorrocks grouped income into five
classes (M = 5), thus implying 100 parameters. By assuming the Champer-
nowne process (Champernowne, 1953), where income mobility at each time
change is restricted to the three possibilities of staying in the same class or
moving up or down to an adjacent income class, he reduced the number of
parameters to six, which he estimated by ML. We can see this as follows: Let
−1, 0, 1 represent the three possible movements.
problem by setting the number of firms in state 0 equal to 100,000 arbitrarily. We shall see that although
changing this number changes the estimates of the transition probabilities, it does
not affect the equilibrium relative size distribution of firms (except relative to
size 0).
Let p ≡ p(∞) be a vector of equilibrium probabilities. Then, as was shown
earlier, p can be obtained by solving
[(I − P′)*] p = [0]
[    l′   ]     [1],  (11.1.31)
where * means eliminating the first row of the matrix. Now, consider pⱼ/pₖ for
j, k ≠ 0, where pⱼ and pₖ are solved from (11.1.31) by Cramer's rule. Because
the arbitrary number of firms in state 0 affects only the first column of [I − P′]* proportionally, it does not
affect pⱼ/pₖ.
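The computation in (11.1.31) is a single linear solve: replace one row of I − P′ by the adding-up constraint l′p = 1. The matrix P below is an illustrative example, not Adelman's:

```python
# Solve for the equilibrium distribution p(inf): replace one row of (I - P')
# with the adding-up constraint l'p = 1 and solve the resulting system.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],     # illustrative Markov matrix (rows sum to 1)
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
M = P.shape[0]
A = np.eye(M) - P.T
A[0, :] = 1.0                      # the constraint l'p = 1 replaces row 0
b = np.zeros(M)
b[0] = 1.0
p = np.linalg.solve(A, b)          # equilibrium probabilities p(inf)
```

Because the rank of I − P′ is M − 1, deleting one of its rows loses nothing, and the constraint row makes the system nonsingular.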
Duncan and Lin (1972) criticized Adelman's model, saying that it is unreal-
istic to suppose a homogeneous pool of firms in state 0 because a firm that
once goes out of business is not likely to come back. Duncan and Lin solved
the problem by treating entry and exit separately. Exit is assumed to be an
absorbing state. Suppose j = 1 is an absorbing state; then P₁₁ = 1 and P₁ₖ = 0
for k = 2, 3, . . . , M. Entry of mₖ(t) firms into state k at time t, k =
2, 3, . . . , M, is assumed to follow a Poisson distribution:
Fₖ(t) = μₖ^{mₖ(t)} exp (−μₖ)[mₖ(t)!]⁻¹.  (11.1.32)
Then Pⱼₖ and μₖ are estimated by maximizing the likelihood function
L* = L ∏ₜ₌₁ᵀ ∏ₖ₌₂ᴹ Fₖ(t),  (11.1.33)
where L is as given in (11.1.4). This model is applied to data on five classes of
banks according to the ratio of farm loans to total loans. Maximum likelihood
estimates are obtained, and a test of stationarity is performed following the
Anderson-Goodman methodology.
The equivalence of the NLWLS iteration to the method of scoring also holds
for this model.
Similarly, the minimum chi-square estimator can also be defined for this
model as in Section 9.2.5. It is applicable when there are many observations of
yᵢₜ with the same value of xᵢₜ. Although a more general grouping can be
handled by the subsequent analysis, we assume xᵢₜ = xₜ for every i, so that yᵢₜ,
i = 1, 2, . . . , N, are associated with the same vector xₜ.
Define
P̂⁰ₜ = Σᵢ₌₁ᴺ yᵢₜ(1 − yᵢ,ₜ₋₁) / Σᵢ₌₁ᴺ (1 − yᵢ,ₜ₋₁)  (11.1.39)
and
P̂¹ₜ = Σᵢ₌₁ᴺ yᵢₜ yᵢ,ₜ₋₁ / Σᵢ₌₁ᴺ yᵢ,ₜ₋₁.  (11.1.40)
Then we have
F⁻¹(P̂⁰ₜ) = α′xₜ + εₜ, t = 1, 2, . . . , T  (11.1.41)
and
F⁻¹(P̂¹ₜ) = (α + β)′xₜ + ηₜ, t = 1, 2, . . . , T.  (11.1.42)
The error terms εₜ and ηₜ approximately have zero means, and their respective
conditional variances given yᵢ,ₜ₋₁, i = 1, 2, . . . , N, are approximately
V(εₜ) = P⁰ₜ(1 − P⁰ₜ) / {f²[F⁻¹(P⁰ₜ)] Σᵢ₌₁ᴺ (1 − yᵢ,ₜ₋₁)}  (11.1.43)
and
V(ηₜ) = P¹ₜ(1 − P¹ₜ) / {f²[F⁻¹(P¹ₜ)] Σᵢ₌₁ᴺ yᵢ,ₜ₋₁},  (11.1.44)
where P⁰ₜ = P₀₁(t) and P¹ₜ = P₁₁(t). The MIN χ² estimator of γ is the weighted
least squares estimator applied to (11.1.41) and (11.1.42), with weights given
by the inverses of (11.1.43) and (11.1.44).
Using these results, we obtain
lim_{t→∞} (P′)ᵗ = lim_{t→∞} H [1 0; 0 (1 − A − B)ᵗ] H⁻¹  (11.1.50)
 = H [1 0; 0 0] H⁻¹
 = [B/(A + B) B/(A + B); A/(A + B) A/(A + B)],
which, of course, could have been obtained directly from (11.1.16) and
(11.1.49).
Although the asymptotic variance-covariance matrices of the MLE α̂ and β̂ in
the Boskin-Nold model can be derived from (11.1.38), we can derive them directly
using the likelihood function (11.1.47). We shall do so only for α̂ because the
derivation for β̂ is similar. We shall assume T → ∞, for this assumption en-
ables us to obtain a simple formula.
We need to consider only the part of (11.1.47) that involves Aᵢ; its loga-
rithm is
log L = Σᵢ [n⁰ᵢ log Aᵢ + n¹ᵢ log (1 − Aᵢ)].  (11.1.51)
Differentiating (11.1.51) twice with respect to α and taking the expectation, we
need En⁰ᵢ and En¹ᵢ. As T → ∞, we obtain
En⁰ᵢ ≅ TAᵢBᵢ/(Aᵢ + Bᵢ).  (11.1.56)
Similarly,
En¹ᵢ ≅ T(1 − Aᵢ)Bᵢ/(Aᵢ + Bᵢ).  (11.1.57)
From (11.1.53), (11.1.56), and (11.1.57) we obtain
E ∂² log L/∂α∂α′ ≅ −T Σᵢ [Aᵢ(1 − Aᵢ)Bᵢ/(Aᵢ + Bᵢ)] xᵢx′ᵢ.  (11.1.58)
Hence,
AV(α̂) = [T Σᵢ Aᵢ(1 − Aᵢ)Bᵢ/(Aᵢ + Bᵢ) xᵢx′ᵢ]⁻¹.  (11.1.59)
Similarly,
AV(β̂) = [T Σᵢ Bᵢ(1 − Bᵢ)Aᵢ/(Aᵢ + Bᵢ) xᵢx′ᵢ]⁻¹.  (11.1.60)
To verify consistency, write the relevant part of the log likelihood function as
log L = Σᵢ {[Σₜ yᵢ(t)] log [Bᵢ/(Aᵢ + Bᵢ)] + [T − Σₜ yᵢ(t)] log [Aᵢ/(Aᵢ + Bᵢ)]}.  (11.1.61)
Denoting the true values by the subscript 0, we obtain
plim (NT)⁻¹ log L = lim N⁻¹ Σᵢ {[Bᵢ₀/(Aᵢ₀ + Bᵢ₀)] log [Bᵢ/(Aᵢ + Bᵢ)]
 + [Aᵢ₀/(Aᵢ₀ + Bᵢ₀)] log [Aᵢ/(Aᵢ + Bᵢ)]},  (11.1.63)
from which we can conclude the consistency of the estimator.
D̄ᵢ = Σₜ₌₁^∞ t(1 − A)^{t−1}A  (11.1.65)
 = A (∂/∂r) Σₜ₌₁^∞ rᵗ
 = A (∂/∂r) [r/(1 − r)]
 = A/(1 − r)² = 1/A,
where r = 1 − A. In words, this equation says that the mean duration on
welfare is the inverse of the probability of moving off welfare.
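The identity D̄ = 1/A is easy to check by simulation; A = 0.25 below is an arbitrary illustrative value:

```python
# Check that the mean duration of a spell with constant exit probability A
# per period equals 1/A (the duration is geometrically distributed).
import numpy as np

rng = np.random.default_rng(5)
A = 0.25
durations = rng.geometric(A, size=200000)   # periods until the first exit
print(durations.mean(), 1 / A)
```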
Suppose the ith person experiences Hᵢ welfare spells of duration
tᵢ₁, tᵢ₂, . . . , tᵢHᵢ and Kᵢ off-welfare spells of duration sᵢ₁, sᵢ₂, . . . , sᵢKᵢ. If we
generalize the Boskin-Nold model and let the ith person's transition probabil-
ities Aᵢ and Bᵢ vary with spells (but stay constant during each spell), the
likelihood function can be written as
Theoretically, not much need be said about this model beyond what we have
discussed in Section 11.1.1 for the general case and in Section 11.1.3 for the
two-state case. The likelihood function can be derived from (11.1.4) by speci-
fying Pᵢʲᵏ(t) as a function of exogenous variables and parameters. The equiva-
lence of the NLWLS to the method of scoring iteration was discussed for the
Markov Chain and Duration Models 429
general case in Section 11.1.1, and the minimum chi-square estimator defined
for the two-state case in Section 11.1.3 can be straightforwardly generalized to
the multistate case. Therefore it should be sufficient to discuss an empirical
article by Toikka (1976) as an illustration of the NLWLS (which in his case is a
linear WLS estimator because of his linear probability specification).
Toikka's model is a three-state Markov model of labor market decisions in
which the three states (corresponding to j = 1, 2, and 3) are the state of being
employed, the state of being in the labor force (actively looking for a job) and
unemployed, and the state of being out of the labor force.
The exogenous variables used by Toikka consist of average (over individ-
uals) income, average wage rate, and seasonal dummies, all of which depend
on time (months) but not on individuals. Thus Toikka's model is a homoge-
neous and nonstationary Markov model. Moreover, Toikka assumed that
transition probabilities depend linearly on the exogenous variables. 2 Thus, in
his model, Eq. (11.1.7) can be written as
yᵢ(t)′ = [yᵢ(t − 1)′ ⊗ x′ₜ]Π + uᵢ(t)′,  (11.1.67)
which, stacked over t = 1, 2, . . . , T, can be written as
[y′₁; y′₂; . . . ; y′T] = [y′₀ ⊗ x′₁; y′₁ ⊗ x′₂; . . . ; y′_{T−1} ⊗ x′T]Π + [u′₁; u′₂; . . . ; u′T].  (11.1.68)
The LS estimator of Π is therefore given by
Π̂ = [Σₜ₌₁ᵀ (yₜ₋₁y′ₜ₋₁ ⊗ xₜx′ₜ)]⁻¹ Σₜ₌₁ᵀ (yₜ₋₁ ⊗ xₜ)y′ₜ.  (11.1.69)
y = Xπ + u.  (11.1.70)
The FGLS estimator of π is (X′Ω̂⁻¹X)⁻¹X′Ω̂⁻¹y, where Ω̂ is a consistent
estimator of Ω = Euu′. Here, Ω has the following form:
Ω = [D₁₁ D₁₂ . . . D₁G; . . . ; D_G1 . . . D_GG],  (11.1.71)
where each Dⱼₖ is a diagonal matrix of size NT. If each Dⱼₖ were a constant
times the identity matrix, (11.1.70) would be Zellner's seemingly unrelated
regression model (see Section 6.4), and therefore the LS estimator would be
asymptotically efficient. In fact, however, the diagonal elements of Dⱼₖ are not
constant.
Toikka's estimator of Π, denoted Π̃, is defined by
and then applying least squares. If, instead, generalized least squares were
applied in the last stage, the resulting estimator of Π would be identical with
the GLS estimator of Π derived from (11.1.68).
aggregate data nʲ(t) = Σᵢ₌₁ᴺ yᵢʲ(t) are available. We shall first discuss LS and
GLS estimators, and then we shall discuss MLE briefly.
Suppose the Markov matrix Pᵢ(t) is constant across i, so that Pᵢ(t) = P(t).
Summing both sides of (11.1.7) over i yields
Σᵢ yᵢ(t)′ = [Σᵢ yᵢ(t − 1)′] P(t) + Σᵢ uᵢ(t)′.  (11.1.73)
v[ E - = E [D(4u1 ) -
u], (11.1.74)
and
Vrt
+ Σᵢ₌₁ᴺ Σₜ₌₁ᵀ log [Fᵢₜ(1 − Fᵢₜ)],  (11.1.78)
which will asymptotically give the same estimator.³ Let γ̃ be the estimator
obtained by minimizing either (11.1.77) or (11.1.78). Then we have
√(NT)(γ̃ − γ) → N(0, plim NT [Σₜ₌₁ᵀ (Σᵢ₌₁ᴺ ∂Fᵢₜ/∂γ)(Σᵢ₌₁ᴺ ∂Fᵢₜ/∂γ′) / Σᵢ₌₁ᴺ Fᵢₜ(1 − Fᵢₜ)]⁻¹).  (11.1.79)
The asymptotic variance-covariance matrix of γ̃ can be shown to be larger
(in the matrix sense) than that of γ̂ given in (11.1.38) as follows: The inverse of the
latter can also be written as
plim (NT)⁻¹ Σₜ₌₁ᵀ Σᵢ₌₁ᴺ [Fᵢₜ(1 − Fᵢₜ)]⁻¹ (∂Fᵢₜ/∂γ)(∂Fᵢₜ/∂γ′).  (11.1.80)
Put zᵢₜ = [Fᵢₜ(1 − Fᵢₜ)]⁻¹ᐟ² ∂Fᵢₜ/∂γ and aᵢₜ = [Fᵢₜ(1 − Fᵢₜ)]¹ᐟ². Then the desired
inequality follows from
Finally, we can define the MLE, which maximizes the joint probability of rₜ
and rₜ₋₁, t = 1, 2, . . . , T. The joint probability can be shown to be
∏ₜ₌₁ᵀ Σ_{s=max(rₜ−N+rₜ₋₁, 0)}^{min(rₜ₋₁, rₜ)} C(rₜ₋₁, s) P₁₁ˢ(1 − P₁₁)^{rₜ₋₁−s}
 × C(N − rₜ₋₁, rₜ − s) P₀₁^{rₜ−s}(1 − P₀₁)^{N−rₜ₋₁−rₜ+s},  (11.1.82)
where C(n, m) denotes the binomial coefficient.
The maximization of (11.1.82) is probably too difficult to make this estimator
of any practical value. Thus we must use the NLLS or NLWLS estimator if
only aggregate observations are available, even though the MLE is asymptoti-
cally more efficient, as we can conclude from the study of Barankin and
Gurland (1951).
where λⱼ = Σₖ λⱼₖ determines the probability of exiting j. But using the
well-known identity lim_{n→∞} (1 + x/n)ⁿ = eˣ, we obtain for small Δt
exp (−λⱼt)λⱼₖΔt.  (11.2.3)
the product of these terms over all the recorded events of an individual and
then over all the individuals in the sample. Because of (11.2.3), a stationary
model is also referred to as an exponential model.
Suppose M = 3 and a particular individual's event history is as follows: This
person stays in state 1 in period (0, t₁), moves to state 2 at time t₁ and stays
there until time t₁ + t₂, moves to state 3 at time t₁ + t₂ and stays there until
t₁ + t₂ + t₃, at which point this person is observed to move back to state 1.
(The observation is terminated after we have seen him or her move to state 1.)
Then this person's likelihood function is given by
L = exp (−λ₁t₁)λ₁₂ exp (−λ₂t₂)λ₂₃ exp (−λ₃t₃)λ₃₁.  (11.2.4)
If we change this scenario slightly and assume that we observe this person to
leave state 3 at time t₁ + t₂ + t₃ but do not know where he or she went, we
should change λ₃₁ to λ₃ in (11.2.4). Furthermore, if we terminate our observa-
tion at time t₁ + t₂ + t₃ without knowing whether he or she continues to stay in
state 3 or not, we should drop λ₃₁ altogether from (11.2.4). In this last case we
say "censoring (more exactly, right-censoring) occurs at time t₁ + t₂ + t₃."
Let us consider the simple case of M = 2. In this case we have λ₁ = λ₁₂ and
λ₂ = λ₂₁. (We are still considering a particular individual and therefore have
suppressed the subscript i.) To have a concrete idea, let us suppose that state 1
signifies unemployment and state 2, employment. The event history of an
individual may consist of unemployment spells and employment spells. (If
the observation is censored from the right, the last spell is incomplete.) The
individual's likelihood function can be written as the product of two terms
the probability of unemployment spells and the probability of employment
spells. We shall now concentrate on unemployment spells. Suppose our typi-
cal individual experienced r completed unemployment spells of duration
t₁, t₂, . . . , tᵣ during the observation period. Then the contribution of these r
spells to the likelihood function is given by
L = λʳ e^{−λT},  (11.2.5)
where we have defined T = Σⱼ₌₁ʳ tⱼ and have written λ for λ₁. The individual's
complete likelihood function is (11.2.5) times the corresponding part for the
employment spells.
We now wish to consider closely the likelihood function of one complete
spell: λe^{−λt}. At the beginning of this section, we derived it by a limit operation,
but, here, we shall give it a somewhat different (although essentially the same)
interpretation. We can interpret e^{−λt} as P(T > t), where T is a random variable
that signifies the duration of an unemployment spell. Therefore the distribu-
likelihood function as
L = ∏ᵢ₌₁ᴺ fᵢ(tᵢ),  (11.2.10)
which is a standard likelihood function of a model involving continuous
random variables. Suppose, however, that individuals i = 1, 2, . . . , n com-
plete their unemployment spells of duration tᵢ, but individuals i = n + 1,
. . . , N are right-censored at time tᵢ. Then the likelihood function is
given by
L = ∏ᵢ₌₁ⁿ fᵢ(tᵢ) ∏ᵢ₌ₙ₊₁ᴺ [1 − Fᵢ(tᵢ)],  (11.2.11)
which is a mixture of densities and probabilities just like the likelihood func-
tion of a standard Tobit model. Thus we see that a duration model with
right-censoring is similar to a standard Tobit model.
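For the exponential case with a common λ and a common censoring time c, the likelihood (11.2.11) has a closed-form maximizer: the number of completed spells divided by total observed exposure. A sketch on simulated data (the values of λ, c, and N are illustrative assumptions):

```python
# MLE for an exponential duration model with right-censoring at c:
# L = prod f(t_i) over completed spells times prod [1 - F(c)] over censored
# ones, which is maximized at lam = (#completed) / (total observed time).
import numpy as np

rng = np.random.default_rng(2)
lam_true, c, N = 0.5, 3.0, 100000
t = rng.exponential(1.0 / lam_true, size=N)
observed = np.minimum(t, c)          # censored durations are recorded as c
complete = t <= c
lam_hat = complete.sum() / observed.sum()
```

The censored observations enter only through their exposure time, exactly as the [1 − F] factors of (11.2.11) dictate — the analogue of the probability terms in the Tobit likelihood.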
(λT)² e^{−λT}/2.  (11.2.12)
It is easy to deduce from the derivation in (11.2.12) that the probability of
observing r completed spells in total time T is given by
P(r, T) = (λT)ʳ e^{−λT}/r!,  (11.2.13)
which is a Poisson distribution. This is equivalent to (11.2.5) because Tʳ and r!
do not depend on the unknown parameters.
We can now put back the subscript i in the right-hand side of (11.2.5) and
take the product over i to obtain the likelihood function of all the individuals:
L = ∏ᵢ₌₁ᴺ λᵢ^{rᵢ} exp (−λᵢTᵢ).  (11.2.14)
where the scalar a and the vector II are unknown parameters. We shall derive
the ML estimators of these parameters. 4 In the following discussion it will
become apparent why we have separated the constant term a from the other
regression coefficients.
The log likelihood function is given by
log L = Σᵢ rᵢ(α + β′xᵢ) − Σᵢ Tᵢ exp (α + β′xᵢ),  (11.2.16)
so that
∂ log L/∂α = Σᵢ rᵢ − e^α Σᵢ Tᵢ exp (β′xᵢ) = 0  (11.2.17)
and
∂ log L/∂β = Σᵢ rᵢxᵢ − e^α Σᵢ Tᵢ exp (β′xᵢ)xᵢ = 0.  (11.2.18)
Thus β̂ can be obtained from (11.2.19) by the method given in the following
paragraph. Inserting β̂ into (11.2.17) yields an explicit solution for α̂.
The following method of obtaining β̂ was proposed by Holford (1980).
Define
πᵢ = Tᵢ exp (β′xᵢ) / Σⱼ Tⱼ exp (β′xⱼ).  (11.2.20)
Therefore, we have
log L₁ = Σᵢ rᵢ log πᵢ = Σᵢ rᵢ log Tᵢ + Σᵢ rᵢβ′xᵢ − (Σᵢ rᵢ) log [Σᵢ Tᵢ exp (β′xᵢ)].  (11.2.21)
Then we can write
L = L₁L₂ (Σᵢ rᵢ)! / ∏ᵢ Tᵢ^{rᵢ},  (11.2.23)
where
L₂ = exp [−Σᵢ Tᵢ exp (α + β′xᵢ)] [Σᵢ Tᵢ exp (α + β′xᵢ)]^{Σᵢrᵢ} / (Σᵢ rᵢ)!.  (11.2.24)
Note that L₂ is a Poisson distribution. Setting ∂ log L₂/∂α = 0 yields (11.2.17).
We can describe the calculation of the MLE as follows: First, maximize L₁
with respect to β; second, insert β̂ into L₂ and maximize it with respect to α.
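The two-stage calculation can be sketched on simulated data (the design and parameter values are assumptions): β is obtained by Newton iteration on the multinomial part L₁ (a scalar regressor keeps the algebra transparent), and α̂ then follows in closed form from ∂ log L₂/∂α = 0.

```python
# Sketch of the two-stage decomposition: maximize the multinomial part L1
# over beta by Newton's method, then solve for alpha in closed form from the
# Poisson part L2. Counts r_i are Poisson with rate exp(alpha + beta*x_i)*T_i.
import numpy as np

rng = np.random.default_rng(3)
N = 5000
x = rng.normal(size=N)
T = rng.uniform(1.0, 5.0, size=N)                 # exposure times T_i
alpha_true, beta_true = -1.0, 0.7
r = rng.poisson(np.exp(alpha_true + beta_true * x) * T)
R = r.sum()

beta = 0.0
for _ in range(50):
    w = T * np.exp(beta * x)
    m = (w * x).sum() / w.sum()                   # weighted mean of x
    v = (w * x * x).sum() / w.sum() - m * m       # weighted variance of x
    beta += ((r * x).sum() - R * m) / (R * v)     # Newton step on log L1
alpha = np.log(R / (T * np.exp(beta * x)).sum())  # from d log L2 / d alpha = 0
```

The separation works because log L₁ depends only on β while the total count Σrᵢ, a Poisson variable, identifies α given β.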
11.2.3 Durations as Dependent Variables of a Regression Equation
Suppose that each individual experiences one complete spell. Then the likeli-
hood function is
L = ∏ᵢ₌₁ᴺ λᵢ exp (−λᵢtᵢ).  (11.2.25)
The case of a person having more than one complete spell can be handled by
behaving as if these spells belonged to different individuals. Assume as before
λᵢ = exp (β′xᵢ).  (11.2.26)
But, here, we have absorbed the constant term α into β as there is no need to
separate it out.
We shall derive the asymptotic covariance matrix of the MLE $\hat\beta$. We have
$$E\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^{N} E t_i \exp(\beta' x_i)\, x_i x_i'. \qquad (11.2.27)$$
Markov Chain and Duration Models 439
But we have
$$E t_i = \int_0^{\infty} z \lambda_i \exp(-\lambda_i z)\, dz = \frac{1}{\lambda_i}, \qquad (11.2.28)$$
so that, from (11.2.27), $-E\,\partial^2\log L/\partial\beta\,\partial\beta' = \sum_i x_i x_i'$ and hence
$$V\hat\beta = \Bigl(\sum_{i=1}^{N} x_i x_i'\Bigr)^{-1}. \qquad (11.2.30)$$
We have also
$$E \log t_i = -c - \log\lambda_i, \qquad (11.2.31)$$
where $c \cong 0.577$ is Euler's constant, and
$$E(\log t_i)^2 = \frac{\pi^2}{6} + (c + \log\lambda_i)^2. \qquad (11.2.32)$$
Therefore we have
$$V \log t_i = \frac{\pi^2}{6}. \qquad (11.2.33)$$
We can write (11.2.31) and (11.2.33) as a linear regression
$$\log t_i + c = -\beta' x_i + u_i, \qquad (11.2.34)$$
where $Eu_i = 0$ and $Vu_i = \pi^2/6$. Because $\{u_i\}$ are independent, (11.2.34) defines a classical regression model, which we called Model 1 in Chapter 1.
Therefore the exact covariance matrix of the LS estimator $\hat\beta_{LS}$ is given by
$$V\hat\beta_{LS} = \frac{\pi^2}{6}\Bigl(\sum_{i=1}^{N} x_i x_i'\Bigr)^{-1}. \qquad (11.2.35)$$
Alternatively, write $t_i = Et_i + v_i$ with $Et_i = \exp(-\beta' x_i)$, so that $Ev_i = 0$. Therefore we have
$$Vv_i = \exp(-2\beta' x_i), \qquad (11.2.38)$$
and the asymptotic covariance matrix of the weighted LS estimator of $\beta$ is
$$V\hat\beta_{W} = \Bigl[\sum_{i=1}^{N} \exp(2\beta' x_i)\,\frac{\partial Et_i}{\partial\beta}\,\frac{\partial Et_i}{\partial\beta'}\Bigr]^{-1} = \Bigl(\sum_{i=1}^{N} x_i x_i'\Bigr)^{-1}. \qquad (11.2.39)$$
$$\frac{dP_{12}}{dt} = \lambda_{12} P_{11} - \lambda_{21} P_{12}. \qquad (11.2.42)$$
Performing an analogous operation on $P_{11}$, $P_{21}$, and $P_{22}$, we obtain the linear
vector differential equation
$$\begin{bmatrix} \dfrac{dP_{11}}{dt} & \dfrac{dP_{21}}{dt} \\[2mm] \dfrac{dP_{12}}{dt} & \dfrac{dP_{22}}{dt} \end{bmatrix} = \begin{bmatrix} -\lambda_{12} & \lambda_{21} \\ \lambda_{12} & -\lambda_{21} \end{bmatrix} \begin{bmatrix} P_{11} & P_{21} \\ P_{12} & P_{22} \end{bmatrix}, \qquad (11.2.43)$$
or, more compactly, $dP'/dt = AP'$. The solution of this equation is given by
$$P' = H e^{Dt} H^{-1}, \qquad (11.2.44)$$
where the columns of $H$ are characteristic vectors of $A$, $D$ is the diagonal matrix
consisting of the characteristic roots of $A$, and $e^{Dt}$ is the diagonal matrix
consisting of $\exp(d_i t)$, $d_i$ being the elements of $D$.
We shall derive D and H. Solving the determinantal equation
$$|A - dI| = 0, \qquad (11.2.45)$$
we obtain
$$D = \begin{bmatrix} 0 & 0 \\ 0 & -(\lambda_{12} + \lambda_{21}) \end{bmatrix}. \qquad (11.2.46)$$
Let $h_1$ be the first column of $H$ (the characteristic vector corresponding to the
zero root). Then it should satisfy $Ah_1 = 0$, which yields a solution (which is
not unique) $h_1 = (\lambda_{21}, \lambda_{12})'$. Next, the second column $h_2$ of $H$ should satisfy
$[A + (\lambda_{12} + \lambda_{21})I]h_2 = 0$, which yields a solution $h_2 = (-1, 1)'$. Combining
the two vectors, we obtain
$$H = \begin{bmatrix} \lambda_{21} & -1 \\ \lambda_{12} & 1 \end{bmatrix}. \qquad (11.2.47)$$
Finally, inserting (11.2.46) and (11.2.47) into (11.2.44) yields the following
expressions for the elements of $P'$ (putting back the superscript $i$):
$$\begin{aligned} P_{11}^{i}(t) &= 1 - \gamma_i + \gamma_i \exp(-\delta_i t) \qquad (11.2.48)\\ P_{12}^{i}(t) &= \gamma_i - \gamma_i \exp(-\delta_i t)\\ P_{21}^{i}(t) &= 1 - \gamma_i - (1 - \gamma_i)\exp(-\delta_i t)\\ P_{22}^{i}(t) &= \gamma_i + (1 - \gamma_i)\exp(-\delta_i t), \end{aligned}$$
where $\gamma_i = \lambda_{12}^{i}/(\lambda_{12}^{i} + \lambda_{21}^{i})$ and $\delta_i = \lambda_{12}^{i} + \lambda_{21}^{i}$. Suppose we observe the $i$th individual in state $j_i$ at time 0 and in state $k_i$ at time $t_i$, $i = 1, 2, \ldots, N$. Then the
likelihood function is given by
$$L = \prod_{i=1}^{N} P_{j_i k_i}^{i}(t_i). \qquad (11.2.49)$$
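The closed forms above can be verified numerically. The sketch below (with made-up intensities $\lambda_{12} = 0.3$, $\lambda_{21} = 0.7$, written in the row-stochastic convention $P = [[P_{11}, P_{12}], [P_{21}, P_{22}]]$, the transpose of $P'$ in the text) checks that the matrix satisfies the Chapman-Kolmogorov equation $P(s + t) = P(s)P(t)$ and that each row sums to one.

```python
import math

l12, l21 = 0.3, 0.7                    # illustrative transition intensities
gamma = l12 / (l12 + l21)
delta = l12 + l21

def P(t):
    """Transition matrix [[P11, P12], [P21, P22]] from the closed forms."""
    e = math.exp(-delta * t)
    return [[1 - gamma + gamma * e, gamma - gamma * e],
            [(1 - gamma) * (1 - e), gamma + (1 - gamma) * e]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Chapman-Kolmogorov: P(s + t) = P(s) P(t)
lhs, rhs = P(1.3), matmul(P(0.5), P(0.8))
err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print([round(v, 4) for v in P(1.3)[0]], err < 1e-9)
```

As $t \to \infty$ both rows converge to the equilibrium probabilities $(1 - \gamma, \gamma)$.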
$$\cdots \times \exp\Bigl[-\int_{t_1}^{t_1+t_2} \lambda_2(z)\,dz\Bigr]\lambda_{23}(t_1 + t_2) \times \exp\Bigl[-\int_{t_1+t_2}^{t_1+t_2+t_3} \lambda_3(z)\,dz\Bigr]\lambda_{31}(t_1 + t_2 + t_3).$$
As in (11.2.4), $\lambda_{31}$ should be changed to $\lambda_3$ if the individual is only observed to
leave state 3, and $\lambda_{31}$ should be dropped if right-censoring occurs at that time.
We shall concentrate on the transition from state 1 to state 2 and write $\lambda_{12}(t)$
simply as $\lambda(t)$, as we did earlier. The distribution function of duration under a
nonstationary model is given by
$$F(t) = 1 - \exp\Bigl[-\int_0^t \lambda(z)\,dz\Bigr],$$
and, with the hazard rate specified as $\lambda(t) = \beta' x_i + \alpha_1 t + \alpha_2 t^2$, we have
$$\int_0^t \lambda(z)\,dz = \beta' x_i\, t + \frac{\alpha_1}{2}t^2 + \frac{\alpha_2}{3}t^3. \qquad (11.2.56)$$
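A quick consistency check of the closed-form integrated hazard (with assumed illustrative values $\beta' x_i = 0.4$, $\alpha_1 = 0.1$, $\alpha_2 = 0.05$):

```python
import math

b, a1, a2 = 0.4, 0.1, 0.05   # illustrative values for beta'x_i, alpha_1, alpha_2

def hazard(z):
    return b + a1 * z + a2 * z * z

def integrated_hazard(t):
    # closed form of the integral, as in (11.2.56)
    return b * t + a1 * t ** 2 / 2 + a2 * t ** 3 / 3

# check the closed form against trapezoid-rule numerical integration
t, n = 3.0, 100000
h = t / n
numeric = h * (hazard(0) / 2 + hazard(t) / 2 + sum(hazard(i * h) for i in range(1, n)))
survival = math.exp(-integrated_hazard(t))   # P(duration > t)
print(round(numeric, 6), round(integrated_hazard(t), 6), round(survival, 4))
```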
Some people terminate their employment during the sample period, but some
remain in their jobs at the end of the sample period (right-censoring). There-
fore Tuma's likelihood function is precisely in the form of (11.2.11).
Model of Tuma, Hannan, and Groeneveld. Tuma, Hannan, and Groene-
veld (1979) studied the duration of marriage. They handled nonstationarity
by dividing the sample period into four subperiods and assuming that the
hazard rate remains constant within each subperiod but varies across different
subperiods. More specifically, they specified
$$\lambda(t) = \beta_p' x_i \quad \text{for } t \in T_p,\quad p = 1, 2, 3, 4, \qquad (11.2.57)$$
where $T_p$ is the $p$th subperiod. This kind of a discrete change in the hazard rate
creates no real problem. Suppose that the event history of an individual
consists of a single completed spell of duration $t$ and that during this period a
constant hazard rate $\lambda^{(1)}$ holds from time 0 to time $\tau$ and another constant rate
$\lambda^{(2)}$ holds from time $\tau$ to time $t$. Then this individual's likelihood function is
given by
$$L = e^{-\lambda^{(1)}\tau} e^{-\lambda^{(2)}(t-\tau)} \lambda^{(2)}. \qquad (11.2.58)$$
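The likelihood (11.2.58) is simply the duration density implied by a hazard that jumps from $\lambda^{(1)}$ to $\lambda^{(2)}$ at $\tau$; as a sanity check, that density should integrate to one. A sketch with assumed values $\lambda^{(1)} = 0.5$, $\lambda^{(2)} = 1.2$, $\tau = 2$:

```python
import math

lam1, lam2, tau = 0.5, 1.2, 2.0   # assumed rates and switch point

def density(t):
    """Duration density when the hazard is lam1 on [0, tau) and lam2 afterward."""
    if t < tau:
        return lam1 * math.exp(-lam1 * t)
    return lam2 * math.exp(-lam1 * tau - lam2 * (t - tau))  # cf. (11.2.58)

# the density must integrate to one over (0, infinity); check numerically
n, upper = 400000, 40.0
h = upper / n
total = h * sum(density((i + 0.5) * h) for i in range(n))
print(round(total, 4))
```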
Model of Lancaster. Lancaster (1979) was concerned with unemployment
duration and questioned the assumption of a constant hazard rate. Although a
simple search theory may indicate an increasing hazard rate, it is not clear
from economic theory alone whether we should expect a constant, decreasing, or increasing hazard rate.
Under the Weibull specification, in which $\lambda(t) = \exp(\beta' x)\,\alpha t^{\alpha-1}$, the shape parameter $\alpha$ determines the direction of duration dependence:
$$\alpha = 1 \Leftrightarrow \frac{\partial\lambda}{\partial t} = 0 \quad\text{(constant hazard rate)},$$
$$\alpha > 1 \Leftrightarrow \frac{\partial\lambda}{\partial t} > 0 \quad\text{(increasing hazard rate)},$$
$$\alpha < 1 \Leftrightarrow \frac{\partial\lambda}{\partial t} < 0 \quad\text{(decreasing hazard rate)}.$$
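These three cases are immediate from the Weibull hazard; a small check (unit scale, so $\lambda(t) = \alpha t^{\alpha-1}$, with made-up evaluation points):

```python
# Weibull hazard lambda(t) = alpha * t**(alpha - 1) (unit scale normalization)
def hazard(t, alpha):
    return alpha * t ** (alpha - 1)

ts = [0.5, 1.0, 2.0, 4.0]
increasing = [hazard(t, 1.5) for t in ts]   # alpha > 1
constant   = [hazard(t, 1.0) for t in ts]   # alpha = 1
decreasing = [hazard(t, 0.7) for t in ts]   # alpha < 1
print(all(b > a for a, b in zip(increasing, increasing[1:])),
      all(abs(v - 1.0) < 1e-12 for v in constant),
      all(b < a for a, b in zip(decreasing, decreasing[1:])))
```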
The function $(t^{\lambda_1} - 1)/\lambda_1$ is the Box-Cox transformation (see Section 8.1.2) and
approaches $\log t$ as $\lambda_1$ approaches 0. Therefore putting $\lambda_1 = 0$ and $\gamma_2 = 0$ in
(11.2.69) reduces it to a Weibull model. Note that $x_i$ is assumed to depend on $t$
in (11.2.69). Flinn and Heckman assumed that changes in $x_i(t)$ occurred only
at the beginning of a month and that the levels were constant throughout the
month.
month. The authors devised an efficient computation algorithm for handling
the heterogeneity correlated across spells and the exogenous variables varying
with time.
$$= \frac{\text{(Numerator)}}{\int_0^\infty \text{(Numerator)}\,ds} = \frac{[1 - F(s)]\,\Delta s}{\int_0^\infty [1 - F(s)]\,ds} = \frac{[1 - F(s)]\,\Delta s}{ES}, \qquad (11.2.70)$$
where $ES = \int_0^\infty s f(s)\,ds$. In (11.2.70) the third equality follows from the assumption that $P[\text{unemployment started in } (s - \Delta s,\, s)]$ does not depend on $s$ (the assumption of constant entry rate), and the last equality follows from integration by parts. By eliminating $\Delta s$ from both sides of (11.2.70), we obtain
448 Advanced Econometrics
$$g(s) = \frac{1 - F(s)}{ES}. \qquad (11.2.71)$$
In the second case we should derive the joint density $g(s,t) = g(t|s)g(s)$. The
density $g(s)$ has been derived, but we still need to derive $g(t|s)$. Let $X$ denote
total unemployment duration. First evaluate
$$P(X > s + t\,|\,X > s) = \frac{P(X > s + t,\; X > s)}{P(X > s)} = \frac{P(X > s + t)}{P(X > s)} = \frac{1 - F(s + t)}{1 - F(s)}. \qquad (11.2.72)$$
If we denote the distribution function of $g(t|s)$ by $G(t|s)$, (11.2.72) implies
$$G(t|s) = \frac{F(s + t) - F(s)}{1 - F(s)}. \qquad (11.2.73)$$
Therefore, differentiating (11.2.73) with respect to $t$, we obtain
$$g(t|s) = \frac{f(s + t)}{1 - F(s)}. \qquad (11.2.74)$$
Finally, from (11.2.71) and (11.2.74), we obtain
f(s + t)
g(s,t) (11.2.75)
ES
This situation holds for Lancaster (1979), as he observed both s and t.
However, Lancaster used the conditional density (11.2.74) rather than the
joint density (11.2.75) because he felt uncertain about the assumption of
constant entry rate.
Finally, in the third case we need $g(t)$. This can be obtained by integrating
$g(s,t)$ with respect to $s$ as follows:
$$g(t) = \frac{1}{ES}\int_0^\infty f(s + t)\,ds = \frac{1 - F(t)}{ES}. \qquad (11.2.76)$$
See Flinn and Heckman (1982) for an alternative derivation of g(t).
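For unit-exponential spells ($ES = 1$) the formulas above imply that both $g(s)$ and $g(t)$ are again unit exponentials, while the interrupted spell's total length $s + t$ is length-biased with mean $2 \cdot ES$ (the inspection paradox). The simulation below exploits the fact that, under a constant entry rate, an interrupted spell can be drawn by sampling a length-biased total duration, here Gamma(2,1), and a uniform inspection point; the setup is a sketch under those assumptions:

```python
import random

rng = random.Random(3)
n = 200000
sum_s = sum_t = sum_x = 0.0
for _ in range(n):
    # Under a constant entry rate, the spell in progress at a survey date is
    # length-biased: for unit-exponential durations the biased density x*exp(-x)
    # is Gamma(2,1), i.e. the sum of two unit exponentials.  Given the total
    # length x, the survey date falls uniformly within the spell.
    x = rng.expovariate(1.0) + rng.expovariate(1.0)
    u = rng.random()
    sum_s += u * x          # elapsed duration s
    sum_t += (1 - u) * x    # remaining duration t
    sum_x += x              # total duration s + t
m_s, m_t, m_x = sum_s / n, sum_t / n, sum_x / n
# g(s) and g(t) are unit exponentials (mean 1); the total has mean 2 = 2 * ES
print(round(m_s, 2), round(m_t, 2), round(m_x, 2))
```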
where $\Lambda(t) = \int_0^t \lambda(z)\,dz$. Combining the exp functions that appear in both
terms and rewriting the combined term further, we obtain
$$\sum_{i=1}^{N} \exp(\beta' x_i)\Lambda(t_i) = \sum_{i=1}^{N} \int_0^\infty \exp(\beta' x_i)\,1(t_i \ge t)\,\lambda(t)\,dt = \int_0^\infty \Bigl[\sum_{h\in R(t)} \exp(\beta' x_h)\Bigr]\lambda(t)\,dt, \qquad (11.2.79)$$
where $R(t) = \{i\,|\,t_i \ge t\}$. To understand the second equality in (11.2.79), note
that $\sum_{h\in R(t)} \exp(\beta' x_h)$ is a step function described in Figure 11.1.
Cox's PMLE $\hat\beta_P$ maximizes $L_1$ defined in the decomposition
$$L = L_1 L_2, \qquad (11.2.81)$$
Figure 11.1 $\sum_{h\in R(t)} \exp(\beta' x_h)$ as a function of $t$
where
$$L_1 = \prod_{i=1}^{N} \frac{\exp(\beta' x_i)}{\sum_{h\in R(t_i)} \exp(\beta' x_h)} \qquad (11.2.82)$$
and $L_2$ is the remaining factor of $L$, a multiple integral that involves the nuisance
function $\lambda(t)$ through terms of the form $\lambda(t)\exp(\beta' x_i)\exp[-\exp(\beta' x_i)\int\lambda(z)\,dz]$,
integrated with respect to $dt_{N-1}\,dt_{N-2}\cdots dt_1$. For example, when the spells are
ordered as $t_1 < t_2 < \cdots < t_N$, the next-to-last factor of $L_1$ is
$\exp(\beta' x_{N-1})/[\exp(\beta' x_{N-1}) + \exp(\beta' x_N)]$.
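A small sketch of the partial likelihood $L_1$ (made-up data, scalar covariate): it depends on the durations only through their rank order, so any monotone transformation of the time axis, here $t \mapsto t^3$, leaves it unchanged, which is why the baseline hazard drops out.

```python
import math

def log_partial_likelihood(beta, durations, xs):
    """Cox's log partial likelihood with no censoring, scalar covariate."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    ll = 0.0
    for pos, i in enumerate(order):
        risk = order[pos:]                       # R(t_i): spells still in progress
        denom = sum(math.exp(beta * xs[h]) for h in risk)
        ll += beta * xs[i] - math.log(denom)
    return ll

durations = [2.0, 0.7, 1.5, 3.1]
xs = [0.5, -1.0, 0.2, 1.3]
a = log_partial_likelihood(0.8, durations, xs)
# a monotone transformation of the time axis leaves the partial likelihood unchanged
b = log_partial_likelihood(0.8, [t ** 3 for t in durations], xs)
print(round(a, 6), abs(a - b) < 1e-12)
```

At $\beta = 0$ each factor is $1/n_i$, so the log partial likelihood equals $-\log N!$ for $N$ uncensored spells.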
Because the parameter vector $\beta$ appears in both $L_1$ and $L_2$, we expect Cox's
PMLE to be asymptotically less efficient than the full MLE in general. This is
indeed the case. We shall compare the asymptotic covariance matrices of the
two estimators in the special case of a stationary model where the $\lambda(t)$ that
appears in the right-hand side of (11.2.77) is a constant. Furthermore, we shall
suppose there is no censoring. In this special case, Cox's model is identical to
the model considered in Section 11.2.3. Therefore the asymptotic covariance
matrix of the MLE $\hat\beta$ can be derived from (11.2.30) by noting that the $\beta$ of the
present section can be regarded as the vector consisting of all but the last
element of the $\beta$ of Section 11.2.3. Hence, by Theorem 13 of Appendix 1, we
obtain
$$V\hat\beta_{ML} = \Bigl[\sum_{i=1}^{N} x_i x_i' - N^{-1}\Bigl(\sum_{i=1}^{N} x_i\Bigr)\Bigl(\sum_{i=1}^{N} x_i\Bigr)'\Bigr]^{-1}.$$
The corresponding quantity for the PMLE is the inverse of the expectation of
$-\partial^2 \log L_1/\partial\beta\,\partial\beta'$, which, evaluated at $\beta = 0$, equals
$$\sum_{i=1}^{N}\Bigl[n_i^{-1}\sum_{h\in R(t_i)} x_h x_h' - n_i^{-2}\Bigl(\sum_{h\in R(t_i)} x_h\Bigr)\Bigl(\sum_{h\in R(t_i)} x_h\Bigr)'\Bigr], \qquad (11.2.94)$$
where $n_i$ denotes the number of elements of $R(t_i)$.
We shall first evaluate (11.2.94) for the case $N = 3$ and then for the case of
general $N$. If $t_1 < t_2 < t_3$, so that the risk sets are $\{1,2,3\}$, $\{2,3\}$, and $\{3\}$, we have
$$-\frac{\partial^2 \log L_1}{\partial\beta\,\partial\beta'} = \Bigl[\frac{1}{3}\sum_{i=1}^{3} x_i x_i' - \frac{1}{9}\Bigl(\sum_{i=1}^{3} x_i\Bigr)\Bigl(\sum_{i=1}^{3} x_i\Bigr)'\Bigr] + \Bigl[\frac{1}{2}(x_2 x_2' + x_3 x_3') - \frac{1}{4}(x_2 + x_3)(x_2 + x_3)'\Bigr]. \qquad (11.2.95)$$
If we change the rank order of $(t_1, t_2, t_3)$, the right-hand side of (11.2.95) will
change correspondingly. But, under our stationarity (constant $\lambda$) and homogeneity ($\beta = 0$) assumptions, each one of the six possible rank orderings can happen
with an equal probability. Therefore we obtain
$$E\Bigl[-\frac{\partial^2 \log L_1}{\partial\beta\,\partial\beta'}\Bigr] = \frac{7}{12}\sum_{i=1}^{3} x_i x_i' - \frac{7}{36}\Bigl(\sum_{i=1}^{3} x_i\Bigr)\Bigl(\sum_{i=1}^{3} x_i\Bigr)' \qquad (11.2.96)$$
$$= \frac{7}{12}\Bigl[\sum_{i=1}^{3} x_i x_i' - \frac{1}{3}\Bigl(\sum_{i=1}^{3} x_i\Bigr)\Bigl(\sum_{i=1}^{3} x_i\Bigr)'\Bigr]. \qquad (11.2.98)$$
Generalizing (11.2.96) and (11.2.98) to a general $N$, we obtain
$$E\Bigl[-\frac{\partial^2 \log L_1}{\partial\beta\,\partial\beta'}\Bigr] = \frac{N - \sum_{h=1}^{N} h^{-1}}{N - 1}\Bigl[\sum_{i=1}^{N} x_i x_i' - \frac{1}{N}\Bigl(\sum_{i=1}^{N} x_i\Bigr)\Bigl(\sum_{i=1}^{N} x_i\Bigr)'\Bigr] \qquad (11.2.99)$$
and therefore
$$V\hat\beta_P = \Bigl[\frac{N - \sum_{h=1}^{N} h^{-1}}{N - 1}\Bigr]^{-1} V\hat\beta_{ML}, \qquad (11.2.100)$$
so that the factor in brackets, which is less than 1 for every $N \ge 2$, measures the
efficiency of the PMLE relative to the MLE in this special case.
Exercises
1. (Section 11.1.1)
Using (11.1.5), express the unconditional mean and variance of $y^i(2)$ as a
function of the mean and variance of $y^i(0)$.
2. (Section 11.1.1)
Write $L_1$ and $L_2$ defined in (11.1.10) explicitly in the special case where
$P^i_{jk}(t) = P_{jk}$, $M = 2$, and $T = 2$.
3. (Section 11.1.1)
Find an example of a matrix that is not necessarily positive but for which a
unique vector of equilibrium probabilities exists.
4. (Section 11.1.1)
Prove the statement following (11.1.15).
5. (Section 11.1.1)
Verify statement (11.1.23).
6. (Section 11.1.3)
Derive the asymptotic variance-covariance matrix of the MLE of $\alpha$ in the
Boskin-Nold model using (11.1.38).
7. (Section 11.1.3)
The mean duration is derived in (11.1.65). Using a similar technique,
derive $Vt_i$.
8. (Section 11.1.5)
Let the Markov matrix of a two-state (1 or 0) stationary homogeneous
first-order Markov chain model be
$$\begin{bmatrix} P_{11} & P_{10} \\ P_{01} & P_{00} \end{bmatrix} = \begin{bmatrix} 1-\lambda & \lambda \\ \lambda & 1-\lambda \end{bmatrix},$$
where $\lambda$ is the only unknown parameter of the model. Define the following symbols:
$n_{jk}$ = number of people who were in state $j$ at time 0 and are in state $k$ at time 1
$n_{j\cdot}$ = number of people who were in state $j$ at time 0
$n_{\cdot j}$ = number of people who are in state $j$ at time 1
The theorems listed in this appendix are the ones especially useful in econo-
metrics. All matrices are assumed to be real. Proofs for many of these
theorems can be found in Bellman (1970).
1. For any square matrix $A$ with distinct characteristic roots, there exists a
nonsingular matrix $P$ such that $PAP^{-1} = \Lambda$, where $\Lambda$ is a diagonal matrix with
the characteristic roots of $A$ in the diagonal. If the characteristic roots are not
distinct, $\Lambda$ takes the Jordan canonical form (see Bellman, p. 198).
2. The determinant of a matrix is the product of its characteristic roots.
3. For any symmetric matrix $A$, there exists an orthogonal matrix $H$ such
that $H'H = I$ and $H'AH = \Lambda$, where $\Lambda$ is a diagonal matrix with the characteristic roots (which are real) of $A$ in the diagonal. The $i$th column of $H$ is called
the characteristic vector of $A$ corresponding to the characteristic root of $A$ that
is the $i$th diagonal element of $\Lambda$.
4. For a symmetric matrix A, the following statements are equivalent:
(i) $A$ is positive definite. (Write $A > 0$.)
(ii) x 'Ax is positive for any nonzero vector x.
(iii) Principal minors of A are all positive.
(iv) Characteristic roots of A are all positive.
The above is true if we change the word positive to nonnegative.
5. For any matrices A and B, the nonzero characteristic roots of AB and BA
are the same, whenever both AB and BA are defined.
6. tr AB = tr BA.
7. For any square matrix A, tr A is equal to the sum of its characteristic
roots.
8. Let A, B be symmetric matrices of the same size. A necessary and
460 Appendix 1
sufficient condition that there exists an orthogonal matrix H such that both
H'AH and H'BH are diagonal is that AB = BA.
9. Any nonnegative (semipositive) definite matrix A can be written as
A = TT', where T is a lower triangular matrix with nonnegative diagonal
elements.
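Theorem 9 is the Cholesky decomposition; a minimal Python implementation for a small positive definite matrix (the matrix below is made up for illustration):

```python
import math

def cholesky(A):
    """Lower-triangular T with A = T T' for a (semi)positive definite matrix."""
    n = len(A)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(T[i][k] * T[j][k] for k in range(j))
            if i == j:
                T[i][j] = math.sqrt(max(A[i][i] - s, 0.0))
            else:
                T[i][j] = (A[i][j] - s) / T[j][j] if T[j][j] else 0.0
    return T

A = [[4.0, 2.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
T = cholesky(A)
recon = [[sum(T[i][k] * T[j][k] for k in range(3)) for j in range(3)] for i in range(3)]
err = max(abs(A[i][j] - recon[i][j]) for i in range(3) for j in range(3))
print(err < 1e-9, T[0][0])
```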
10. Let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the characteristic roots of a symmetric matrix $A$
in descending order ($\lambda_1$ being the largest), and let $x_1, x_2, \ldots$ be the corresponding characteristic vectors. Then
$$\lambda_1 = \max_{x\ne 0} \frac{x'Ax}{x'x} \;\Bigl(= \frac{x_1'Ax_1}{x_1'x_1}\Bigr),$$
$$\lambda_2 = \max_{x\ne 0,\; x'x_1=0} \frac{x'Ax}{x'x} \;\Bigl(= \frac{x_2'Ax_2}{x_2'x_2}\Bigr),$$
$$\lambda_3 = \max_{x\ne 0,\; x'x_1=0,\; x'x_2=0} \frac{x'Ax}{x'x}, \quad \text{and so on.}$$
11. Let $A$ and $B$ be symmetric matrices ($n \times n$) with $B$ nonnegative definite. Then $\mu_i \ge \lambda_i$, $i = 1, 2, \ldots, n$, where the $\mu$'s and $\lambda$'s are the
characteristic roots of $A + B$ and $A$, respectively, in descending order. The strict inequality holds if $B$ is
positive definite.
(ii) $\dfrac{d}{d\theta}\log|X| = \operatorname{tr}\Bigl(X^{-1}\dfrac{dX}{d\theta}\Bigr)$

(iii) $\dfrac{\partial}{\partial A}\log|A| = (A^{-1})'$

(iv) $\dfrac{d}{d\theta}X^{-1} = -X^{-1}\dfrac{dX}{d\theta}X^{-1}$

(v) $\dfrac{d}{d\theta}\operatorname{tr}(X^{-1}Y) = -\operatorname{tr}\Bigl(X^{-1}\dfrac{dX}{d\theta}X^{-1}Y\Bigr)$

(vi) $\dfrac{\partial}{\partial A}\operatorname{tr}(A^{-1}B) = -(A^{-1}BA^{-1})'$

(vii) $\dfrac{\partial}{\partial A}\operatorname{tr}(AB) = B'$

Here $\theta$ is a scalar on which the matrix $X$ depends.
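Identity (ii) can be checked numerically: for a 2x2 matrix function $X(\theta)$, a central finite difference of $\log|X(\theta)|$ should match $\operatorname{tr}(X^{-1}\,dX/d\theta)$. The matrix below is arbitrary, chosen only to be smooth and nonsingular near $\theta = 0.7$:

```python
import math

def det2(X):
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

def inv2(X):
    d = det2(X)
    return [[X[1][1] / d, -X[0][1] / d], [-X[1][0] / d, X[0][0] / d]]

def X_of(theta):
    # an arbitrary smooth, nonsingular matrix function of a scalar theta
    return [[2.0 + theta, 0.5 * theta], [0.3, 1.0 + theta * theta]]

def dX_of(theta):
    return [[1.0, 0.5], [0.0, 2.0 * theta]]

theta, h = 0.7, 1e-6
numeric = (math.log(det2(X_of(theta + h))) - math.log(det2(X_of(theta - h)))) / (2 * h)
Xi, dX = inv2(X_of(theta)), dX_of(theta)
analytic = sum(Xi[i][k] * dX[k][i] for i in range(2) for k in range(2))  # tr(X^{-1} dX)
print(round(numeric, 6), round(analytic, 6))
```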
The theorems listed in this appendix, as well as many other results concerning
the distribution of a univariate continuous random variable, can be found in
Johnson and Kotz (1970a,b). "Rao" stands for Rao (1973) and "Plackett" for
Plackett (1960).
where $B(\cdot,\cdot)$ denotes the beta function. Symbolically we write
$$Z\Bigl(\frac{n}{S_n}\Bigr)^{1/2} \sim t_n.$$
464 Appendix 2
10. If we have cross-section data, p refers to a cross-section unit not included in the
sample.
11. Theorem 1.6.1 is true even if we let C depend on x, . Then the two theorems are
equivalent because any d satisfying (1.6.5) can be written as Cz i, for some C such that
C'X = I.
1. A much more detailed account of this topic can be found in Amemiya (1980a).
2. We use the term estimator here, but all the definitions and the results of this
subsection remain the same if we interpret d as any decision function mapping Y
into $\Theta$.
3. We assume that the losses do not depend on the parameters of the models.
Otherwise, the choice of models and the estimation of the parameters cannot be
sequentially analyzed, which would immensely complicate the problem. However, we
do not claim that our assumption is realistic. We adopt this simplification for the
purpose of illustrating certain basic ideas.
4. For ways to get around this problem, see, for example, Akaike (1976) and
Schwarz (1978).
5. See Thisted (1976) for an excellent survey of this topic, as well as for some original
results. More recent surveys are given by Draper and Van Nostrand (1979) and Judge
and Bock (1983).
6. The matrix $H_1\Lambda_1^{-1}H_1'$ is sometimes referred to as the Moore-Penrose generalized
inverse of $X'X$ and is denoted by $(X'X)^{+}$. See Rao (1973, p. 24) for more discussion of
generalized inverses.
7. This question was originally posed and solved by Silvey (1969).
8. Although Sclove considers the random coefficients model and the prediction
(rather than estimation) of the regression vector, there is no essential difference be-
tween his approach and the Bayesian estimation.
9. What follows simplifies the derivation of Sclove et al. (1972).
10. See Section 4.6.1. In Chapter 3 we shall discuss large sample theory and make
the meaning of the term asymptotically more precise. For the time being, the reader
should simply interpret it to mean "approximately when T is large."
11. See the definition of probability limit in Chapter 3. Loosely speaking, the statement means that when $T$ is large, $s$ is close to $s_0$ with probability close to 1.
1. In the proof of Theorem 4.1.1, continuity of $Q_T(\theta)$ is used only to imply continuity of $Q(\theta)$ and to make certain the measurability of $\hat\theta_T$. Therefore we can modify this
468 Notes to Pages 107-173
theorem in such a way that we assume continuity of $Q(\theta)$ but not of $Q_T(\theta)$ and define
convergence in probability in a more general way that does not require measurability
of $\hat\theta_T$. This is done in Theorem 9.6.1.
Also note that the proof of Theorem 4.1.1 can easily be modified to show that if the
convergence in the sense of (i) holds in assumption C, $\hat\theta_T$ converges to $\theta_0$ almost surely.
2. Strictly speaking, (4.1.10) is defined only for those $T$'s for which $\hat\Theta_T$ is nonempty.
However, the probability that $\hat\Theta_T$ is nonempty approaches 1 as $T$ goes to infinity
because of Assumption C of Theorem 4.1.2. As an aside, Jennrich (1969) proved that
$\theta^*$ is a measurable function.
3. This proof is patterned after the proof of a similar theorem by Jennrich (1969).
4. Note that if $z_t$ is merely defined as a random variable with mean $\mu(\theta)$ and
variance-covariance matrix $\Sigma(\theta)$, the minimization of the quadratic form does not
even yield a consistent estimator of $\theta$.
5. The term second-order efficiency is sometimes used to denote a related but differ-
ent concept (see Pfanzagl, 1973).
6. If the error term is multiplicative as in $Q = \beta_1 K^{\beta_2} L^{\beta_3} e^{u}$, the log transformation
reduces it to a linear regression model. See Bodkin and Klein (1967) for the estimation
of both models.
7. The methods of proof used in this section and Section 4.3.3 are similar to those of
Jennrich (1969).
8. Because a function continuous on a compact set is uniformly continuous, we can
assume without loss of generality that $f_t(\beta)$ is uniformly continuous in $\beta \in N$.
9. The nonsingularity is not needed here but is assumed as it will be needed later.
10. Note that in the special case where the constraint $h(\beta) = 0$ is linear and can be
written as $Q'\beta = c$, (4.5.21) is similar to (4.3.32). We cannot unequivocally determine
whether the chi-square approximation of the distribution of (4.5.21) is better or worse
whether the chi-square approximation of the distribution of (4.5.21) is better or worse
than the F approximation of the distribution of (4.3.32).
11. If the sample is (1, 2, 3), 2 is the unique median. If the sample is (1, 2, 3, 4), any
point in the closed interval [2, 3] may be defined as a median. The definition (4.6.3)
picks 2 as the median. If $f(x) > 0$ in the neighborhood of $x = M$, this ambiguity
vanishes as the sample size approaches infinity.
12. The second term of the right-hand side of (4.6.10) does not affect the minimiza-
tion but is added so that plim $T^{-1}S_T$ can be evaluated without assuming the existence
of the first absolute moment of $Y_t$. This idea originates in Huber (1965).
13. Alternative methods of proof of asymptotic normality can be found in Bassett
and Koenker (1978) and in Amemiya (1982a).
3. There is a slight error in the formula for the mean squared error of MLE, which
has been corrected by Amemiya (1984b).
4. See Anderson (1958, Ch. 6) for a thorough discussion of discriminant analysis.
5. Certain areas of statistics are more amenable to Bayesian analysis than to classical
analysis. The simultaneous determination of prediction and estimation is one of them.
6. To simplify the notation we shall omit the subscript 0 that denotes the true value.
The reader should be able to understand from the context whether a symbol denotes
the true value of a parameter or the domain of the parameter space.
7. The possible inconsistency due to this approximation was briefly discussed at the
end of Section 9.2.5.
8. Deacon and Shapiro actually used (9.3.32) and the equation obtained by sum-
ming (9.3.32) and (9.3.33). The resulting estimates of $\beta_1$ and $\beta_2$ are the same as those
obtained by the method described in the text.
9. The triple integral in (9.3.72) can be reduced to a double integral by a certain
transformation. In general, we must evaluate m-tuple integrals for m + 1 responses.
10. We assume that a component of the vector $x$ is either discrete or continuous, so
that $f$ is actually the density and the probability combined (density with respect to
some measure). Thus the integration with respect to x that will appear later should be
regarded as the integral and summation combined (the Lebesgue-Stieltjes integral with
respect to the appropriate measure).
11. Because $g(x_i)$ is known, it can be ignored in the maximization of $L_c$. However,
we retained it to remind the reader of the sampling scheme. In (9.5.2), $H(j_i)$ is retained
for the same reason.
12. The main difference between the two proofs is that we define $\hat\beta$ as a solution of
(9.5.6), whereas Manski and Lerman define it as the value of $\beta$ that attains the global
maximum of (9.5.5) over a compact parameter space containing the true value.
13. The following analysis can be modified to allow for the possibility that for some $j$ and
$x$, $P(j|x, \beta) = 0$ for all $\beta$. Such a case arises when certain alternatives are unavailable for
a certain individual. See Manski and McFadden (1983, footnote 23, p. 13).
14. Manski and Lerman quote an interesting result attributed to McFadden, which
states that in a multinomial logit model with alternative-specific intercepts, that is,
the model in which $\alpha$ in (9.3.43) varies with $j$ as in (9.2.4) and (9.2.5), the inconsistency
is confined to the parameters $\{\alpha_j\}$.
15. In the simple logit model considered after (9.5.22), it can be shown that the
asymptotic variances of MME and WMLE are identical, so that Table 9.7 applies to
MME as well.
16. Cosslett's sampling scheme can be generalized further to yield a general strati-
fied sampling (see Manski and McFadden, 1983, p. 28).
17. This interpretation does not contradict the fact that in actual decisions the
determination of x precedes that of j.
18. There is a subtle difference. In the simple choice-based sampling defined earlier,
a person choosing alternative j is sampled with probability H(j), whereas in the gener-
alized choice-based sampling the number of people in the subsample s is fixed and not
random.
19. It is interesting to note that maximizing $\Lambda$ with respect to $\beta$ and $\lambda$ is equivalent
to maximizing $W$ defined in (9.5.39) with respect to $\beta$ and $Q_0(j)$. Manski and McFadden (1983, p. 24) suggested this version of MME without realizing its equivalence to
CBMLE.
20. Cosslett defined WMLE and MME using the actual proportions of people
choosing $j$ instead of $H(j)$.
21. Manski's proof is not complete because in the fourth line from the bottom on
page 218 of his article, double limit operations are interchanged without verifying the
necessary conditions. It seems that we would have to make more assumptions than
made by Manski in order for the necessary conditions to hold. A correct proof for the
binary case can be found in Manski (1985).
22. Manski (1975) considered a more general score function than that defined here.
23. The heterogeneity problem is also known as the mover-stayer problem in the
literature of Markov chain models. Among the first to discuss the problem were
Blumen, Kogan, and McCarthy (1955), who found that individuals who changed
occupations most frequently in the past were more likely to change in the future.
24. The $\{u_i\}$ in (9.7.5) are not i.i.d., unlike the $\{u_i\}$ in (9.7.2).
11. See Cragg (1971) for models that ensure the nonnegativity of $y_2$ as well as of $y_1$.
12. For a more elaborate derivation of the reservation wage model based on search
theory, see Gronau (1974).
13. Gronau specified that the independent variables in the $W^r$ equation include a
woman's age and education, family income, number of children, and her husband's
age and education, whereas the independent variables in the W equation include only
a woman's age and education. However, Gronau readily admitted to the arbitrariness
of the specification and the possibility that all the variables are included in both.
14. Gronau assumed the independence between $u_2$ and $v$ and used an estimator
different from those mentioned here. Amemiya (1984a) pointed out an error in
Gronau's procedure. The independence between $u_2$ and $v$ is unnecessary if we use
either the MLE or Heckman's two-step estimator.
15. Although Heckman's model (1974) is a simultaneous equations model, the
two-step estimator of Heckman studied by Wales and Woodland is essentially a re-
duced-form estimator, which we have discussed in this subsection, rather than the
structural equations version we shall discuss in the next subsection.
16. For a panel data generalization of Heckman's model, see Heckman and MaCurdy (1980).
17. Actually, Heckman used $\log W^r$ and $\log W$. The independent variables $x_2$
include the husband's wage, asset income, prices, and individual characteristics, and $z$
includes the housewife's schooling and experience.
18. A more explicit expression for the likelihood function was obtained by Ame-
miya (1974a), who pointed out the incorrectness of the likelihood function originally
given by Fair and Jaffee.
19. Equation (10.10.26) is the maximized profit function and (10.10.27) is an input
demand or output supply function obtained by differentiating (10.10.26) with respect
to the own input or output price (Hotelling's lemma). For convenience only one input
or output has been assumed; so, strictly speaking, $y^{(2)}$ and $x^{(2)}$ are scalars.
20. These two equations correspond to the two equations in a proposition of Dun-
can (1980, p. 851). It seems that Duncan inadvertently omitted the last term from
(10.10.30).
1. Note that this result can be obtained from (11.1.38) using the Boskin-Nold
assumptions as well as the assumption that $T$ goes to $\infty$. The result (11.1.38) is valid
even if $T$ does not go to $\infty$ provided that $NT$ goes to $\infty$. If we assumed $p(0) = p(\infty)$, the
approximate equality in (11.1.55) would be exact and hence (11.1.59) would be valid
without assuming $T \to \infty$ provided that $NT \to \infty$.
2. Actually, Toikka is interested only in three out of the six transition probabilities,
and he lets those three depend linearly on exogenous variables. However, for the
simplicity of analysis, we shall proceed in our discussion as if all the six transition
probabilities depended linearly on exogenous variables.
1971. The Statistical Analysis of Time Series. New York: John Wiley and
Sons.
1974. "An Asymptotic Expansion of the Distribution of the Limited Information Maximum Likelihood Estimate of a Coefficient in a Simultaneous Equation
System." Journal of the American Statistical Association 60:565-573.
Anderson, T. W., and C. Hsiao. 1981. "Estimation of Dynamic Models with Error
Components." Journal of the American Statistical Association 76:598-606.
Andrews, D. F. 1974. "A Robust Method for Multiple Linear Regression." Techno-
metrics 16:523-531.
Andrews, D. F., P. J. Bickel, F. R. Hampel, P. J. Huber, W. H. Rogers, and J. W.
Tukey. 1972. Robust Estimates of Location. Princeton: Princeton University
Press.
Anscombe, F. J. 1961. "Examination of Residuals," in J. Neyman, ed., Proceedings
of the Fourth Berkeley Symposium on Mathematical Statistics and Probability,
1:1-36. Berkeley: University of California Press.
Chamberlain, G., and Z. Griliches. 1975. "Unobservables with a Variance-Components Structure: Ability, Schooling and the Economic Success of Brothers." International Economic Review 16:422-429.
Champernowne, D. G. 1953. "A Model of Income Distribution." Economic Journal
63:318-351.
Charatsis, E. G. 1971. "A Computer Program for Estimation of the Constant Elasticity
of Substitution Production Function." Applied Statistics 20:286-296.
Chow, G. C. 1960. "Tests for Equality between Sets of Coefficients in Two Linear
Regressions." Econometrica 28:591-605.
1968. "Two Methods of Computing Full-Information Maximum Likelihood
Estimates in Simultaneous Stochastic Equations." International Economic Re-
view 9:100-112.
1973. "On the Computation of Full-Information Maximum Likelihood Esti-
mates for Nonlinear Equation Systems." Review of Economics and Statistics
55:104-109.
Chow, G. C., and R. C. Fair. 1973. "Maximum Likelihood Estimation of Linear
Equation Systems with Auto-Regressive Residuals." Annals of Economic and
Social Measurement 2:17-28.
Christ, C. F. 1966. Econometric Models and Methods. New York: John Wiley & Sons.
Christensen, L. R., D. W. Jorgenson, and L. J. Lau. 1975. "Transcendental Logarith-
mic Utility Functions." American Economic Review 65:367-383.
Chung, C. F., and A. S. Goldberger. 1984. "Proportional Projections in Limited
Dependent Variable Models." Econometrica 52:531-534.
Chung, K. L. 1974. A Course in Probability Theory, 2d ed. New York: Academic Press.
Clark, C. 1961. "The Greatest of a Finite Set of Random Variables." Operations
Research 9:145-162.
Cochrane, D., and G. H. Orcutt. 1949. "Application of Least Squares Regression to
Relationships Containing Autocorrelated Error Terms." Journal of the American
Statistical Association 44:32-61.
Cooley, T. F., and E. C. Prescott. 1976. "Estimation in the Presence of Stochastic
Parameter Variation." Econometrica 44:167-184.
Cooper, J. P. 1972. "Two Approaches to Polynomial Distributed Lags Estimation: An
Expository Note and Comment." The American Statistician 26:32-35.
Cosslett, S. R. 1978. "Efficient Estimation of Discrete-Choice Models from Choice-
Based Samples." Workshop in Transportation Economics, University of Califor-
nia, Berkeley.
1981a. "Maximum Likelihood Estimator for Choice-Based Samples." Econo-
metrica 49:1289-1316.
1981b. "Efficient Estimation of Discrete-Choice Models," in C. F. Manski and
D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Appli-
cations, pp. 51-111. Cambridge, Mass.: MIT Press.
1983. "Distribution-Free Maximum Likelihood Estimator of the Binary
Choice Model." Econometrica 51:765-782.
References 483
Efron, B., and C. Morris. 1972. "Limiting the Risk of Bayes and Empirical Bayes
Estimators-Part II: The Empirical Bayes Case." Journal of the American Statis-
tical Association 67:130-139.
1973. "Stein's Estimation Rule and Its Competitors-An Empirical Bayes
Approach." Journal of the American Statistical Association 68:117-130.
1975. "Data Analysis Using Stein's Estimator and Its Generalizations." Jour-
nal of the American Statistical Association 70:311-319.
1976. "Families of Minimax Estimators of the Mean of a Multivariate Normal
Distribution." Annals of Statistics 4:11-21.
Ehrlich, I. 1977. "Capital Punishment and Deterrence: Some Further Thoughts and
Additional Evidence." Journal of Political Economy 85:741-788.
Eicker, F. 1963. "Asymptotic Normality and Consistency of the Least Squares Esti-
mators for Families of Linear Regressions." Annals of Mathematical Statistics
34:447-456.
Eisenpress, H., and J. Greenstadt. 1966. "The Estimation of Nonlinear Econometric
Systems." Econometrica 34:851-861.
Engle, R. F. 1984. "Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econo-
metrics," in Z. Griliches and M. D. Intrilligator, eds., Handbook of Econometrics,
2:775-826. Amsterdam: North-Holland Publishing.
Fair, R. C. 1974. "On the Robust Estimation of Econometric Models." Annals of
Economic and Social Measurement 3:667-677.
1976. A Model of Macroeconomic Activity. Vol. 2, The Empirical Model.
Cambridge, Mass.: Ballinger.
1978. "A Theory of Extramarital Affairs." Journal of Political Economy
86:45-61.
Fair, R. C., and D. M. Jaffee. 1972. "Methods of Estimation for Markets in Disequilib-
rium." Econometrica 40:497-514.
Fair, R. C., and W. R. Parke. 1980. "Full-Information Estimates of a Nonlinear
Macroeconometric Model." Journal of Econometrics 13:269-291.
Farebrother, R. W. 1975. "Minimax Regret Significance Points for a Preliminary Test
in Regression Analysis." Econometrica 43:1005-1006.
1980. "The Durbin-Watson Test for Serial Correlation When There Is No
Intercept in the Regression." Econometrica 48:1553-1563.
Feller, W. 1961. An Introduction to Probability Theory and Its Applications, vol. I, 2d
ed. New York: John Wiley & Sons.
Ferguson, T. S. 1958. "A Method of Generating Best Asymptotically Normal Esti-
mates with Application to the Estimation of Bacterial Densities." Annals of Mathematical Statistics 29:1046-1062.
Fisher, F. M. 1966. The Identification Problem in Econometrics. New York: McGraw-
Hill.
1970. "A Correspondence Principle for Simultaneous Equation Models."
Econometrica 38:73-92.
Fletcher, R., and M. J. D. Powell. 1963. "A Rapidly Convergent Descent Method for
Minimization." Computer Journal 6:163-168.
Flinn, C. J., and J. J. Heckman. 1982. "Models for the Analysis of Labor Force
Dynamics." Advances in Econometrics 1:35-95.
Fomby, T. B., R. C. Hill, and S. R. Johnson. 1978. "An Optimal Property of Principal
Components in the Context of Restricted Least Squares." Journal of the American
Statistical Association 73:191-193.
Forsythe, A. B. 1972. "Robust Estimation of Straight Line Regression Coefficients by
Minimizing p-th Power Deviations." Technometrics 14:159-166.
Freund, J. F. 1971. Mathematical Statistics, 2d ed. Englewood Cliffs, N.J.: Prentice-
Hall.
Froehlich, B. R. 1973. "Some Estimates for a Random Coefficient Regression Model."
Journal of the American Statistical Association. 68:329-335.
Fuller, W. A. 1976. Introduction to Statistical Time Series. New York: John Wiley &
Sons.
1977. "Some Properties of a Modification of the Limited Information Estimator." Econometrica 45:939-953.
Fuller, W. A., and G. E. Battese. 1974. "Estimation of Linear Models with Crossed-
Error Structure." Journal of Econometrics 2:67-78.
Gallant, A. R. 1975a. "Nonlinear Regression." The American Statistician 29:73-81.
1975b. "Testing a Subset of the Parameters of a Nonlinear Regression Model."
Journal of the American Statistical Association 70:927-932.
1977. "Three-Stage Least-Squares Estimation for a System of Simultaneous,
Nonlinear, Implicit Equations." Journal of Econometrics 5:71-88.
Gallant, A. R., and A. Holly. 1980. "Statistical Inference in an Implicit, Nonlinear,
Simultaneous Equation Model in the Context of Maximum Likelihood Estima-
tion." Econometrica 48:697-720.
Gallant, A. R., and D. W. Jorgenson. 1979. "Statistical Inference for a System of
Simultaneous, Nonlinear, Implicit Equations in the Context of Instrumental
Variable Estimation." Journal of Econometrics 11:275-302.
Gastwirth, J. L. 1966. "On Robust Procedures." Journal of the American Statistical
Association 65:946-973.
Gaver, K. M., and M. S. Geisel. 1974. "Discriminating Among Alternative Models:
Bayesian and Non-Bayesian Methods," in P. Zarembka, ed., Frontiers in Econo-
metrics, pp. 49-80. New York: Academic Press.
Ghosh, J. K., and B. K. Sinha. 1981. "A Necessary and Sufficient Condition for
Second Order Admissibility with Applications to Berkson's Bioassay Problem."
Annals of Statistics 9:1334-1338.
Ghosh, J. K., and K. Subramanyam. 1974. "Second-Order Efficiency of Maximum
Likelihood Estimators." Sankhya ser. A, 36:325-358.
Gnedenko, B. V., and A. N. Kolmogorov. 1954. Limit Distributions for Sums of
Independent Random Variables. Reading, Mass.: Addison-Wesley.
1981a. The Econometric Analysis of Time Series. Oxford: Philip Allan Pub-
lishers.
1981b. Time Series Models. Oxford: Philip Allan Publishers.
Hatanaka, M. 1974. "An Efficient Two-Step Estimator for the Dynamic Adjustment
Model with Autoregressive Errors." Journal of Econometrics 2:199-220.
1978. "On the Efficient Estimation Methods for the Macro-Economic Models
Nonlinear in Variables." Journal of Econometrics 8:323-356.
Hause, J. C. 1980. "The Fine Structure of Earnings and On-the-Job Training Hypoth-
esis." Econometrica 48:1013-1029.
Hausman, J. A. 1975. "An Instrumental Variable Approach to Full Information
Estimators for Linear and Certain Nonlinear Econometric Models." Economet-
rica 43:727-738.
1978. "Specification Tests in Econometrics." Econometrica 46:1251-1272.
Hausman, J. A., and D. McFadden. 1984. "Specification Tests for the Multinomial
Logit Model." Econometrica 52:1219-1240.
Hausman, J. A., and W. E. Taylor. 1981. "Panel Data and Unobservable Individual
Effects." Econometrica 49:1377-1398.
Hausman, J. A., and D. A. Wise. 1976. "The Evaluation of Results from Truncated
Samples: The New Jersey Income Maintenance Experiment." Annals of Eco-
nomic and Social Measurement 5:421-445.
1977. "Social Experimentation, Truncated Distributions, and Efficient Esti-
mation." Econometrica 45:919-938.
1978. "A Conditional Probit Model for Qualitative Choice: Discrete Decisions
Recognizing Interdependence and Heterogeneous Preferences." Econometrica
46:403-426.
1979. "Attrition Bias in Experimental and Panel Data: The Gary Income
Maintenance Experiment." Econometrica 47:455-473.
Heckman, J. J. 1974. "Shadow Prices, Market Wages, and Labor Supply." Economet-
rica 42:679-693.
1976a. "The Common Structure of Statistical Models of Truncation, Sample
Selection and Limited Dependent Variables and a Simple Estimator for Such
Models." Annals of Economic and Social Measurement 5:475-492.
1976b. "Simultaneous Equations Models with Continuous and Discrete En-
dogenous Variables and Structural Shifts," in S. M. Goldfeld and R. E. Quandt,
eds., Studies in Nonlinear Estimation, pp. 235-272. Cambridge, Mass.: Ballinger
Publishing.
1978. "Dummy Endogenous Variables in a Simultaneous Equation System."
Econometrica 46:931-960.
1979. "Sample Selection Bias as a Specification Error." Econometrica
47:153-161.
1981a. "Statistical Models for Discrete Panel Data," in C. F. Manski and
D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Appli-
cations, pp. 114-178. Cambridge, Mass.: MIT Press.
1981b. "The Incidental Parameters Problem and the Problem of Initial Condi-
tions in Estimating a Discrete Time-Discrete Data Stochastic Process," in
C. F. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with
Econometric Applications, pp. 179-195. Cambridge, Mass.: MIT Press.
1981c. "Heterogeneity and State Dependence," in S. Rosen, ed., Studies in
Labor Markets, pp. 91-139. Cambridge, Mass.: National Bureau of Economic
Research.
Heckman, J. J., and G. J. Borjas. 1980. "Does Unemployment Cause Future Unem-
ployment? Definitions, Questions and Answers from a Continuous Time Model
of Heterogeneity and State Dependence." Economica 47:247-283.
Heckman, J. J., and T. E. MaCurdy. 1980. "A Life Cycle Model of Female Labor
Supply." Review of Economic Studies 47:47-74.
Heckman, J. J., and S. Polachek. 1974. "Empirical Evidence on the Functional Form
of the Earnings-Schooling Relationship." Journal of the American Statistical As-
sociation 69:350-354.
Heckman, J. J., and B. Singer. 1982. "The Identification Problem in Econometric
Models for Duration Data," in W. Hildenbrand, ed., Advances in Econometrics,
pp. 39-77. Cambridge: Cambridge University Press.
1984a. "A Method for Minimizing the Impact of Distributional Assumptions
in Econometric Models for Duration Data." Econometrica 52:271-320.
1984b. "Econometric Duration Analysis." Journal of Econometrics 24:63-132.
Heckman, J. J., and R. J. Willis. 1977. "A Beta-Logistic Model for the Analysis of
Sequential Labor Force Participation by Married Women." Journal of Political
Economy 85:27-58.
Hettmansperger, T. P., and J. W. McKean. 1977. "A Robust Alternative Based on
Ranks to Least Squares in Analyzing Linear Models." Technometrics 19:275-284.
Hildreth, C., and J. P. Houck. 1968. "Some Estimators for a Linear Model with
Random Coefficients." Journal of the American Statistical Association 63:584-595.
Hill, R. W., and P. W. Holland. 1977. "Two Robust Alternatives to Least-Squares
Regression." Journal of the American Statistical Association 72:1041-1067.
Hinkley, D. V. 1971. "Inference in Two-Phase Regression." Journal of the American
Statistical Association 66:736-743.
1975. "On Power Transformations to Symmetry." Biometrika 62:101-112.
Hoadley, B. 1971. "Asymptotic Properties of Maximum Likelihood Estimators for the
Independent Not Identically Distributed Case." Annals of Mathematical Statis-
tics 42:1977-1991.
Hodges, J. L., and E. L. Lehmann. 1950. "Some Problems in Minimax Point Estima-
tion." Annals of Mathematical Statistics 21:182-197.
1963. "Estimates of Location Based on Rank Tests." Annals of Mathematical
Statistics 34:598-611.
Hoel, P. G. 1971. Introduction to Mathematical Statistics, 4th ed. New York: John
Wiley & Sons.
Hoerl, A. E., and R. W. Kennard. 1970a. "Ridge Regression: Biased Estimation for
Nonorthogonal Problems." Technometrics 12:55-67.
1970b. "Ridge Regression: Applications to Nonorthogonal Problems." Tech-
nometrics 12:69-82.
Hoerl, A. E., R. W. Kennard, and K. F. Baldwin. 1975. "Ridge Regression: Some
Simulations." Communications in Statistics 4:105-123.
Hogg, R. V. 1974. "Adaptive Robust Procedures: A Partial Review and Some Sugges-
tions for Future Applications and Theory." Journal of the American Statistical
Association 69:909-923.
Holford, T. R. 1980. "The Analysis of Rates and of Survivorship Using Log-Linear
Models." Biometrics 36:299-305.
Horowitz, J. L., J. M. Sparmann, and C. F. Daganzo. 1982. "An Investigation of the
Accuracy of the Clark Approximation for the Multinomial Probit Model." Trans-
portation Science 16:382-401.
Hosek, J. R. 1980. "Determinants of Family Participation in the AFDC-Unemployed
Fathers Program." Review of Economics and Statistics 62:466-470.
Howe, H., R. A. Pollack, and T. J. Wales. 1979. "Theory and Time Series Esti-
mation of the Quadratic Expenditure System." Econometrica 47:1231-1248.
Hsiao, C. 1974. "Statistical Inference for a Model with Both Random Cross-Sectional
and Time Effects." International Economic Review 15:12-30.
1975. "Some Estimation Methods for a Random Coefficient Model." Econo-
metrica 43:305-325.
1983. "Identification," in Z. Griliches and M. D. Intrilligator, eds., Handbook
of Econometrics 1:223-283. Amsterdam: North-Holland Publishing.
Hsieh, D., C. F. Manski, and D. McFadden. 1983. "Estimation of Response Probabili-
ties from Augmented Retrospective Observations." Mimeographed Paper, De-
partment of Economics, Massachusetts Institute of Technology, Cambridge,
Mass.
Huang, D. S. 1964. "Discrete Stock Adjustment: The Case of Demand for Automo-
biles." International Economic Review 5:46-62.
Huber, P. J. 1964. "Robust Estimation of a Location Parameter." Annals of Mathe-
matical Statistics 35:73-101.
1965. "The Behavior of Maximum Likelihood Estimates under Nonstandard
Conditions," in J. Neyman, ed., Proceedings of the Fifth Berkeley Symposium,
1:221-233. Berkeley: University of California Press.
1972. "Robust Statistics: A Review." Annals of Mathematical Statistics
43:1041-1067.
1977. Robust Statistical Procedures. Philadelphia: Society for Industrial and
Applied Mathematics.
1981. Robust Statistics. New York: John Wiley & Sons.
1980. The Statistical Analysis of Failure Time Data. New York: John Wiley &
Sons.
Kariya, T. 1981. "Bounds for the Covariance Matrices of Zellner's Estimator in the
SUR Model and the 2SAE in a Heteroscedastic Model." Journal of the American
Statistical Association 76:975-979.
Kay, R. 1979. "Some Further Asymptotic Efficiency Calculations for Survival Data
Regression Models." Biometrika 66:91-96.
Keeley, M. C., P. K. Robins, R. G. Spiegelman, and R. W. West. 1978. "The Estima-
tion of Labor Supply Models Using Experimental Data." American Economic
Review 68:873-887.
Kelejian, H. H. 1971. "Two-Stage Least Squares and Econometric Systems Linear in
Parameters but Nonlinear in the Endogenous Variables." Journal of the American
Statistical Association 66:373-374.
1974. "Efficient Instrumental Variable Estimation of Large Scale Nonlinear
Econometric Models." Mimeographed paper.
Kelejian, H. H., and S. W. Stephan. 1983. "Inference in Random Coefficient Panel
Data Models: A Correction and Clarification of the Literature." International
Economic Review 24:249-254.
Kendall, M. G., and A. Stuart. 1979. The Advanced Theory of Statistics, 4th ed., vol. 2.
New York: Charles Griffin and Co.
Kenny, L. W., L. F. Lee, G. S. Maddala, and R. P. Trost. 1979. "Returns to College
Education: An Investigation of Self-Selection Bias Based on the Project Talent
Data." International Economic Review 20:775-789.
Kiefer, J., and J. Wolfowitz. 1956. "Consistency of the Maximum Likelihood Estima-
tor in the Presence of Infinitely Many Incidental Parameters." Annals of Mathe-
matical Statistics 27:887-906.
Koenker, R. 1981a. "Robust Methods in Econometrics." Bell Laboratories Eco-
nomics Discussion Paper no. 228.
1981b. "A Note on Studentizing a Test for Heteroscedasticity." Journal of
Econometrics 17:107-112.
Koenker, R., and G. Bassett, Jr. 1978. "Regression Quantiles." Econometrica
46:33-50.
1982. "Robust Tests for Heteroscedasticity Based on Regression Quantiles."
Econometrica 50:43-61.
Koopmans, T. C., and W. C. Hood. 1953. "The Estimation of Simultaneous
Linear Economic Relationships," in W. C. Hood and T. C. Koopmans, eds.,
Studies in Econometric Method, pp. 112-199. New York: John Wiley
& Sons.
Kotlikoff, L. J. 1979. "Testing the Theory of Social Security and Life Cycle Accumula-
tion." American Economic Review 69:396-410.
Koyck, L. M. 1954. Distributed Lags and Investment Analysis. Amsterdam: North-
Holland Publishing.
McCall, J. J. 1971. "A Markovian Model of Income Dynamics." Journal of the Ameri-
can Statistical Association 66:439-447.
McFadden, D. 1974. "Conditional Logit Analysis of Qualitative Choice Behavior," in
P. Zarembka, ed., Frontiers in Econometrics, pp. 105 -142. New York: Academic
Press.
1976a. "A Comment on Discriminant Analysis 'versus' Logit Analysis."
Annals of Economic and Social Measurement 5:511-524.
1976b. "The Revealed Preferences of a Government Bureaucracy: Empirical
Evidence." Bell Journal of Economics 7:55-72.
1977. "Quantitative Methods for Analyzing Travel Behavior of Individuals:
Some Recent Developments." Cowles Foundation Discussion Paper no. 474.
1978. "Modelling the Choice of Residential Location," in A. Karlqvist et al.,
eds., Spatial Interaction Theory and Planning Models, pp. 75-96. Amsterdam:
North-Holland Publishing.
1981. "Econometric Models of Probabilistic Choice," in C. F. Manski and D.
McFadden, eds., Structural Analysis of Discrete Data with Econometric Applica-
tions, pp. 198-272. Cambridge, Mass.: MIT Press.
McFadden, D., and F. Reid. 1975. "Aggregate Travel Demand Forecasting from
Disaggregated Behavior Models." Transportation Research Board, Record, no.
534, Washington, D. C.
McGillivray, R. G. 1972. "Binary Choice of Urban Transport Mode in the San Fran-
cisco Bay Region." Econometrica 40:827-848.
McKean, J. W., and T. P. Hettmansperger. 1976. "Tests of Hypothesis in the General
Linear Model Based on Ranks." Communications in Statistics A, 1:693-709.
MacRae, E. 1977. "Estimation of Time-Varying Markov Process with Aggregate
Data." Econometrica 45:183-198.
MaCurdy, T. E. 1980. "An Intertemporal Analysis of Taxation and Work Disincen-
tives." Working Papers in Economics no. E-80-4, The Hoover Institution, Stan-
ford University, Calif.
1982. "The Use of Time Series Processes to Model the Error Structure
of Earnings in a Longitudinal Data Analysis." Journal of Econometrics 18:83-114.
Maddala, G. S. 1971. "The Use of Variance Components Models in Pooling Cross
Section and Time Series Data." Econometrica 39:341-358.
1980. "Disequilibrium, Self-Selection and Switching Models." Social Science
Working Paper 303, California Institute of Technology.
1983. Limited-Dependent and Qualitative Variables in Econometrics. Cam-
bridge: Cambridge University Press.
Maddala, G. S., and F. D. Nelson. 1974. "Maximum Likelihood Methods for Models
of Markets in Disequilibrium." Econometrica 42:1013-1030.
Malik, H. J., and B. Abraham. 1973. "Multivariate Logistic Distributions." Annals of
Statistics 1:588 - 590.
Muthen, B. 1979. "A Structural Probit Model with Latent Variables." Journal of the
American Statistical Association 74:807-811.
Nagar, A. L., and S. N. Sahay. 1978. "The Bias and Mean Squared Error of Forecasts
from Partially Restricted Reduced Form." Journal of Econometrics 7:227-243.
Nakamura, M., A. Nakamura, and D. Cullen. 1979. "Job Opportunities, the Offered
Wage, and the Labor Supply of Married Women." American Economic Review
69:787-805.
Nelder, J. A., and R. W. M. Wedderburn. 1972. "Generalized Linear Models." Journal of the Royal Statistical Society ser. A, 135:370-384.
Nelson, F. D. 1977. "Censored Regression Models with Unobserved, Stochastic Cen-
soring Thresholds." Journal of Econometrics 6:309-327.
1981. "A Test for Misspecification in the Censored Normal Model." Econo-
metrica 49:1317-1329.
Nelson, F. D., and L. Olson. 1978. "Specification and Estimation of a Simultaneous-
Equation Model with Limited Dependent Variables." International Economic
Review 19:695-709.
Nerlove, M. 1958. Distributed Lags and Demand Analysis for Agricultural and Other
Commodities. Washington, D.C.: U.S. Department of Agriculture.
1971. "Further Evidence on the Estimation of Dynamic Relations from a
Time Series of Cross Sections." Econometrica 39:359-382.
Nerlove, M., D. M. Grether, and J. L. Carvalho. 1979. Analysis of Economic Time
Series: A Synthesis. New York: Academic Press.
Nerlove, M., and S. J. Press. 1973. "Univariate and Multivariate Log-Linear and
Logistic Models." RAND Corporation Paper R-1306-EDA/NIH, Santa Monica,
Calif.
Neyman, J., and E. L. Scott. 1948. "Consistent Estimates Based on Partially Consis-
tent Observations." Econometrica 16:1-32.
Nickell, S. 1979. "Estimating the Probability of Leaving Unemployment." Economet-
rica 47:1249-1266.
Norden, R. H. 1972. "A Survey of Maximum Likelihood Estimation." International
Statistical Review 40:329-354.
1973. "A Survey of Maximum Likelihood Estimation, Part 2." International
Statistical Review 41:39-58.
Oberhofer, W., and J. Kmenta. 1974. "A General Procedure for Obtaining Maximum
Likelihood Estimates in Generalized Regression Models." Econometrica
42:579-590.
Olsen, R. J. 1978. "Note on the Uniqueness of the Maximum Likelihood Estimator for
the Tobit Model." Econometrica 46:1211-1215.
1980. "A Least Squares Correction for Selectivity Bias." Econometrica
48:1815-1820.
Paarsch, H. J. 1984. "A Monte Carlo Comparison of Estimators for Censored Regres-
sion Models." Journal of Econometrics 24:197-213.
Parke, W. R. 1982. "An Algorithm for FIML and 3SLS Estimation of Large Nonlinear
Models." Econometrica 50:81-95.
Pearson, E. S., and N. W. Please. 1975. "Relation between the Shape of Population
Distribution and the Robustness of Four Simple Statistical Tests." Biometrika
62:223-241.
Pesaran, M. H. 1982. "Comparison of Local Power of Alternative Tests of Non-Nested
Regression Models." Econometrica 50:1287-1305.
Pfanzagl, J. 1973. "Asymptotic Expansions Related to Minimum Contrast Estima-
tors." Annals of Statistics 1:993-1026.
Phillips, P. C. B. 1977. "Approximations to Some Finite Sample Distributions Asso-
ciated with a First-Order Stochastic Difference Equation." Econometrica
45:463-485.
1982. "On the Consistency of Nonlinear FIML." Econometrica 50:1307-1324.
1983. "Exact Small Sample Theory in the Simultaneous Equations Model," in
Z. Griliches and M. D. Intrilligator, eds., Handbook of Econometrics, 2:449-516.
Amsterdam: North-Holland Publishing.
Pierce, D. A. 1971. "Least Squares Estimation in the Regression Model with Autore-
gressive-Moving Average Errors." Biometrika 58:299-312.
Plackett, R. L. 1960. Principles of Regression Analysis. London: Oxford University
Press.
1965. "A Class of Bivariate Distributions." Journal of the American Statistical
Association 60:516-522.
Poirier, D. J. 1978. "The Use of the Box-Cox Transformation in Limited Dependent
Variable Models." Journal of the American Statistical Association 73:284-287.
Powell, J. L. 1981. "Least Absolute Deviations Estimation for Censored and Trun-
cated Regression Models." Technical Report no. 356, Institute for Mathematical
Studies in the Social Sciences, Stanford University, Calif.
1983. "Asymptotic Normality of the Censored and Truncated Least Absolute
Deviations Estimators." Technical Report no. 395, Institute for Mathematical
Studies in the Social Sciences, Stanford University, Calif.
Powell, M. J. D. 1964. "An Efficient Method for Finding the Minimum of a Function
of Several Variables without Calculating Derivatives." Computer Journal 7:155-162.
Powers, J. A., L. C. Marsh, R. R. Huckfeldt, and C. L. Johnson. 1978. "A Comparison
of Logit, Probit and Discriminant Analysis in Predicting Family Size." American
Statistical Association Proceedings of the Social Statistics Section, pp. 693-697.
Prais, S. J., and H. S. Houthakker. 1955. The Analysis of Family Budgets. Cambridge:
Cambridge University Press.
Pratt, J. W. 1981. "Concavity of the Log Likelihood." Journal of the American Statis-
tical Association 76:103-106.
Press, S. J., and S. Wilson. 1978. "Choosing Between Logistic Regression and Discrim-
inant Analysis." Journal of the American Statistical Association 73:699-705.
Quandt, R. E. 1958. "The Estimation of the Parameters of a Linear Regression System
Obeying Two Separate Regimes." Journal of the American Statistical Association
53:873-880.
1982. "Econometric Disequilibrium Models." Econometric Reviews 1:1-63.
1983. "Computational Problems and Methods," in Z. Griliches and M. D.
Intrilligator, eds., Handbook of Econometrics, 1:699-764. Amsterdam: North-
Holland Publishing.
Quandt, R. E., and J. B. Ramsey. 1978. "Estimating Mixtures of Normal Distributions
and Switching Regressions." Journal of the American Statistical Association
73:730-738.
Radner, R., and L. S. Miller. 1970. "Demand and Supply in U. S. Higher Education: A
Progress Report." American Economic Review, Papers and Proceedings
60:326-334.
Rao, C. R. 1947. "Large Sample Tests of Statistical Hypotheses Concerning Several
Parameters with Applications to Problems of Estimation." Proceedings of the
Cambridge Philosophical Society 44:50-57.
1965. "The Theory of Least Squares When the Parameters Are Stochastic and
Its Applications to the Analysis of Growth Curves." Biometrika 52:447-458.
1970. "Estimation of Heteroscedastic Variances in a Linear Model." Journal
of the American Statistical Association 65:161-172.
1973. Linear Statistical Inference and Its Applications, 2d ed. New York: John
Wiley & Sons.
Reece, W. S. 1979. "Charitable Contributions: The New Evidence on Household
Behavior." American Economic Review 69:142-151.
Rice, P., and V. K. Smith. 1977. "An Econometric Model of the Petroleum Industry."
Journal of Econometrics 6:263-288.
Roberts, R. B., G. S. Maddala, and G. Enholm. 1978. "Determinants of the Requested
Rate of Return and the Rate of Return Granted in a Formal Regulatory Process."
Bell Journal of Economics 9:611-621.
Robinson, P. M. 1982a. "On the Asymptotic Properties of Estimators of Models
Containing Limited Dependent Variables." Econometrica 50:27-41.
1982b. "Analysis of Time Series from Mixed Distributions." Annals of Statis-
tics 10:915-925.
Rosenberg, B. 1973. "The Analysis of a Cross Section of Time Series by Stochastically
Convergent Parameter Regression." Annals of Economic and Social Measure-
ment 2:399-428.
Rosenzweig, M. R. 1980. "Neoclassical Theory and the Optimizing Peasant: An Econ-
ometric Analysis of Market Family Labor Supply in a Developing Country."
Quarterly Journal of Economics 94:31-55.
Small, K. A., and D. Brownstone. 1982. "Efficient Estimation of Nested Logit Models:
An Application to Trip Timing." Research Memorandum no. 296, Econometric
Research Program, Princeton University.
Smith, K. C., N. E. Savin, and J. L. Robertson. 1984. "A Monte Carlo Comparison of
Maximum Likelihood and Minimum Chi-Square Sampling Distributions in
Logit Analysis." Biometrics 40:471-482.
Spitzer, J. J. 1976. "The Demand for Money, the Liquidity Trap, and Functional
Forms." International Economic Review 17:220-227.
1978. "A Monte Carlo Investigation of the Box-Cox Transformation in Small
Samples." Journal of the American Statistical Association 73:488-495.
Srivastava, V. K., and T. D. Dwivedi. 1979. "Estimation of Seemingly Unrelated
Regression Equations: A Brief Survey." Journal of Econometrics 10:15 -32.
Stapleton, D. C., and D. J. Young. 1984. "Censored Normal Regression with Measure-
ment Error on the Dependent Variable." Econometrica 52:737-760.
Stein, C. 1973. "Estimation of the Mean of a Multivariate Normal Distribution."
Technical Report no. 48, Department of Statistics, Stanford University, Calif.
Stephenson, S. P., and J. F. McDonald. 1979. "Disaggregation of Income Mainte-
nance Impacts on Family Earnings." Review of Economics and Statistics 61:354-360.
Stigler, S. M. 1973. "Simon Newcomb, Percy Daniell, and the History of Robust
Estimation, 1885-1920." Journal of the American Statistical Association
68:872-879.
1977. "Do Robust Estimators Work with Real Data?" Annals of Statistics
5:1055-1098.
Strawderman, W. E. 1978. "Minimax Adaptive Generalized Ridge Regression Esti-
mators." Journal of the American Statistical Association 73:623-627.
Strickland, A. D., and L. W. Weiss. 1976. "Advertising, Concentration, and Price-Cost
Margins." Journal of Political Economy 84:1109-1121.
Strotz, R. H. 1960. "Interdependence as a Specification Error." Econometrica
28:428-442.
Swamy, P. A. V. B. 1970. "Efficient Inference in a Random Coefficient Regression
Model." Econometrica 38:311-323.
1980. "A Comparison of Estimators for Undersized Samples." Journal of
Econometrics 14:161-181.
Swamy, P. A. V. B., and J. S. Mehta. 1977. "Estimation of Linear Models with Time
and Cross-Sectionally Varying Parameters." Journal of the American Statistical
Association 72:890-891.
Taylor, W. E. 1978. "The Heteroscedastic Linear Model: Exact Finite Sample Re-
sults." Econometrica 46:663-676.
1980. "Small Sample Considerations in Estimation from Panel Data." Journal
of Econometrics 13:203-223.
Witte, A. D. 1980. "Estimating the Economic Model of Crime with Individual Data."
Quarterly Journal of Economics 94:57-84.
Wu, C. F. J. 1983. "On the Convergence Properties of the EM Algorithm." Annals of
Statistics 11:95-103.
Wu, D. M. 1965. "An Empirical Analysis of Household Durable Goods Expenditure."
Econometrica 33:761-780.
Zacks, S. 1971. Theory of Statistical Inference. New York: John Wiley & Sons.
Zarembka, P. 1968. "Functional Form in the Demand for Money." Journal of the
American Statistical Association 63:502-511.
Zellner, A. 1962. "An Efficient Method of Estimating Seemingly Unrelated Regres-
sions and Tests for Aggregation Bias." Journal of the American Statistical Associ-
ation 57:348-368.
1971. An Introduction to Bayesian Inference in Econometrics. New York: John
Wiley & Sons.
Zellner, A., D. S. Huang, and L. C. Chau. 1965. "Further Analysis of the Short-Run
Consumption Function with Emphasis on the Role of Liquid Assets." Economet-
rica 33:571-581.
Zellner, A., and H. Theil. 1962. "Three-Stage Least Squares: Simultaneous Estimation
of Simultaneous Equations." Econometrica 30:54-78.
Name Index
Chau, L. C., 246
Chenery, H. B., 128
Chow, G. C., 234, 264, 465
Christ, C. F., 228
Christensen, L. R., 128
Chung, K. L., 90, 466, 472
Clark, C., 309
Cochrane, D., 189-190
Cooley, T. F., 222
Cooper, J. M., 469
Cosslett, S. R., 321, 322, 326, 331, 332, 334, 336, 337, 338, 339, 346, 358, 471, 472
Cox, D. R., 147, 249, 250, 251, 278, 307, 449, 450, 465, 466
Cragg, J. G., 410, 472
Cramer, H., 93, 118
Crowder, J. M., 178
Cullen, D., 395
Daganzo, C. F., 309
Dagenais, M. G., 264
Dahm, P. A., 293
David, J. M., 293
Davidon, W. C., 138
Davis, L., 279
Deacon, R., 293, 294, 471
Dempster, A. P., 66, 375
Dent, W. T., 469
Dhrymes, P. J., 178
Diewert, W. E., 128
Doksum, K. A., 71, 465, 466
Domencich, T. A., 269
Doob, J. L., 159
Draper, N. R., 137, 251, 466
Dubin, J. A., 408
Dudley, L., 387
Duncan, D. B., 275
Duncan, G. M., 405, 406, 407, 408, 473
Duncan, G. T., 421
Durbin, J., 191, 192, 193, 194, 196, 222, 224, 225, 233
Durden, G. C., 312
Dwivedi, T. D., 198
Efron, B., 62, 63, 64, 67, 135, 283, 284, 455
Ehrlich, I., 252
Eicker, F., 199
Eisenpress, H., 264
Engle, R. F., 142
Enholm, G., 394, 395
Evans, J. M., 222
Fair, R. C., 78, 264, 265, 365, 403, 473
Farebrother, R. W., 53, 469
Feinberg, S. E., 316
Feller, W., 274
Ferguson, T. S., 125
Fisher, F. M., 229, 256, 469, 470
Fisher, R. A., 16
Fletcher, R., 138
Flinn, C. J., 446, 448, 457
Fomby, T. B., 59
Forsythe, A. B., 77
Freund, J. F., 466
Froehlich, B. R., 469
Fuller, W. A., 159, 161, 178, 195, 202, 211, 212, 238, 239
Gallant, A. R., 113, 136, 137, 141, 145, 257, 262
Gastwirth, J. L., 74
Gaver, K. M., 49
Geisel, M. S., 49
Ghosh, J. K., 125, 279
Gillen, P. W., 389
Gnedenko, B. V., 90
Godfrey, L. G., 469
Goldberg, S., 166
Goldberger, A. S., 25, 360, 367, 380, 381, 472
Goldfeld, S. M., 37, 38, 137, 138, 204, 206, 207, 256, 266, 354, 470
Goodman, L. A., 315, 316, 417, 418, 419
Gourieroux, C., 273, 472
Gradshteyn, I. S., 474
Granger, C. W. J., 159
Graybill, F. A., 466
Greene, W. H., 368, 472
Greenstadt, J., 264
Grenander, U., 161
Grether, D. M., 159, 178
Griliches, Z., 178, 217
Groeneveld, L. P., 444
Gronau, R., 388, 473
Guilkey, D. K., 317
Gumbel, E. J., 319
Gunst, R. F., 66
Gurland, J., 124, 293, 433, 470
Haberman, S. J., 316
Haessel, W., 258
Hahn, G. J., 378
Hall, B. H., 138
Hall, R. E., 138
Willis, R. J., 216, 348, 349, 350, 351, 353, 404, 405
Wilson, S., 284
Wise, D. A., 308, 309, 310, 395
Witte, A. D., 365
Wolfowitz, J., 331, 346, 347
Woodland, A. D., 373, 375, 390, 472, 473
Wu, C. F. J., 375
Wu, D., 388
Young, D. J., 472
Zacks, S., 46, 138, 465
Zarembka, P., 251, 252
Zellner, A., 24, 49, 197, 241, 246, 465
Subject Index
Generalized Wald test: definition, 145; in model, 298-299; nested multinomial logit
nonlinear simultaneous equations model, model as correction for, 300, 302; tests for,
261-262 299, 302
Geometric lag model, 178 Independent variables. See Exogenous
Goldfeld-Quandt estimator, 204-205, 206 variables
Goldfeld-Quandt peak test for homoscedas- Information matrix, 16
ticity, 207 Initial conditions: in Balestra-Nerlove model,
Goodness of fit. See R2 215, 216; in first-order autoregressive
Gumbel's type B bivariate extreme value (AR(1)) model, 163; in Markov chain
distribution, 300 model, 413-414; in panel data qualitative
response model, 352, 353-354; in
Hadamard product, 462 two-error components model with serially
Hausman's specification test: as test for correlated error, 216, 217
independence of irrelevant alternatives Instrumental variables (IV) estimator.
(IA), 299, 302; as test for normality in definition, 11-12; FIML estimator as,
Tobit model, 381; asymptotic properties 223-224; G2SLS estimator as, 241; in two-
of, 145-146; in nonlinear simultaneous error components model with endogenous
equations model, 265 regressors, 217-218; two-stage least squares
Hazard rate: in duration model, 435; in as asymptotically best, 239-240
Tobit model, 472 Iterative methods. See EM algorithm;
Heckman's two-step estimator: in standard Gauss-Newton method; Method of scoring;
Tobit model, 367-372; in Type 2 Tobit Newton-Raphson method
model, 386-387; in Type 3 Tobit model,
390, 392-393; in Type 4 Tobit model, 396;
in Type 5 Tobit model, 402 Jackknife estimator, 135-136
Heterogeneity: in duration model, 445, Jackknife method, 135-136
446-447; in Markov model, 414; in panel Jensen's inequality, 116
data qualitative response model, 349, Jordan canonical form, 459
350-353. See also Mover-stayer model
Heteroscedasticity: constant variance in a Khinchine's weak law of large numbers
subset of the sample, 200-202; definition, (WLLN), 102
198; general parametric, 202-203; in Kolmogorov laws of large numbers (LLN), 90
standard Tobit model, 378, 379-380; tests Koyck lag model, 178
for, 200, 201, 203, 206-207; unrestricted, Kronecker product, 462
198-200; variance as an exponential
function of the regressors, 207; variance as
a linear function of regressors, 204-207 L estimators, 73-74, 77-78
Hildreth-Houck estimator in a heteroscedas- Li, estimators, 72, 73, 77
tic regression model: definition, 204, 205, Lag operator, 162
206; modifications, 205-206, 469 Lagrange multiplier test. See Score test
Milder's inequality, 19 Lagrangian interpolation polynomial, 469
Homoscedasticity, 2 Laplace distribution, 70
Laws of large numbers (LLN): Ithinchine's,
Idempotent, 460 102; Kolmogorov's number 1 and number
Identification: in a linear simultaneous 2, 90; Markov's, 467; strong, 90; weak, 90
equations model, 230; in a nonlinear si- Least absolute deviations (LAD) estimator
multaneous equations model, 256 in classical regression model, 152-154; in
Incidental parameters. See Nuisance standard Tobit model, 382-383. See also
parameters Median
Independence of irrelevant alternatives (IA): Least squares (LS) estimator
as characteristic of multinomial logit In autoregressive model: asymptotic
516 Subject Index
Least squares (LS) estimator (continued)
    normality in first-order case, 174-175; asymptotic normality in pth order case, 173; consistency in first-order case, 173-174; small sample properties, 173
    In classical linear regression model: as best linear unbiased estimator (BLUE), 11-13; as best unbiased estimator under normality of error terms, 17-20; asymptotic normality of, 96-98, 99; consistency of, 95-96; definition, 4-5; equivalence under normality of, with maximum likelihood estimator (MLE), 13; geometric interpretation of, 5-6; mean and variance of, 7-8; of a linear combination of regression parameters, 7, 58; of a subset of regression parameters, 6-7; unbiased alternative to estimate of variance of error terms, 8
    In duration model with one completed spell per individual, 439
    In geometric lag model, 178
    In linear regression model with general covariance matrix of error terms: asymptotic normality in case of known covariance matrix, 185; asymptotic normality in case of serial correlation, 187-188; consistency, 184-185; covariance matrix, 182; examples of equivalence with GLS, 183-184; inconsistency in presence of lagged endogenous variables and serial correlation, 194; relative efficiency as compared with GLS, 182-183
    In Markov model when only aggregate data are available, 431
    In standard Tobit model: biasedness when all observations used, 367-368; biasedness when positive observations used, 367
    In Toikka's three-state Markov model, 429, 430
    See also Constrained least squares (CLS) estimator; Generalized least squares (GLS) estimator
Least squares predictor, 39-40
Least squares residuals, 5, 21, 32-33
Lebesgue convergence theorem, 117
Lebesgue measure, 83, 106, 113, 124, 467
Lebesgue-Stieltjes integral, 467
Liapounov central limit theorem (CLT), 90-91, 92
Likelihood function. See Concentrated likelihood function; Conditional likelihood function; Maximum likelihood estimator (MLE)
Likelihood ratio test
    As test for homoscedasticity, 201
    In nonlinear regression model, 144, 145
    Under general parametric hypotheses: asymptotic distribution of, 142-144; definition, 142; small sample properties of, 145
    Under linear hypotheses on a linear model: definition, 28-32, 32-34, 38; relationship with Wald test and score test, 144-145
Limit distribution, 85, 92
Limited dependent variables model. See Tobit model
Limited information maximum likelihood (LIML) estimator: asymptotic distribution, 237-238; asymptotic equivalence with 2SLS estimator, 236, 238; definition, 235-236; exact distribution, 238-239; Fuller's modification, 238
Limited information model, 234-235
Linear regression model, with general covariance matrix of error terms, 184
Linear simultaneous equations model, 228-229
Lindeberg-Feller central limit theorem (CLT), 90-91, 92
Lindeberg-Levy central limit theorem (CLT), 90-92
Linear constraints: as testable hypotheses, 27; form of, 20; stochastic, 23-24
Logistic distribution: Gumbel's bivariate, 319; Plackett's bivariate, 319; relationship with normal distribution, 269
Logit model
    Binary: definition, 268-269; global concavity of likelihood function, 273
    Multinomial: as result of utility-maximizing behavior, 296-297; definition, 295; global concavity of likelihood function, 295-296; independence of irrelevant alternatives (IIA) in, 298
Method of scoring: conditions for equivalence with Gauss-Newton method, 203; definition, 138; equivalence with nonlinear weighted least squares (NLWLS) iteration, 274-275, 289-290; in random coefficients model, 221; when variance is linear function of regressors, 206
Mills' ratio, 472
Minimax estimator, 47, 48
Minimax regret, 47
Minimini principle, 51
Minimum chi-square (MIN χ²) method, 275. See also Berkson's minimum chi-square (MIN χ²) estimator
MINQUE, 469
Mixed estimator under stochastic constraints, 25
Mixture of normal distributions, 72, 77, 119-120
Model selection problem, 146
Modified nonlinear two-stage least squares (MNL2S) estimator: definition, 254; inconsistency when normality fails to hold, 255
Moore-Penrose generalized inverse, 112, 466
Mover-stayer model, 418-419
Moving-average (MA) models: autocovariance matrix of, 171; first-order (MA(1)), 171-172; spectral density of, 170-171
Multicollinearity, 56, 59

Newton-Raphson method: applications, 141; as method of calculating FIML estimator, 234; definition, 137; DFP iteration, 138; for obtaining NLFI maximum likelihood estimator, 264; method of scoring as means of finding MLE, 138; quadratic hill-climbing, 138; second-round estimator, 137, 139
Nonlinear full information (NLFI) maximum likelihood estimator: asymptotic properties, 259, 261; definition, 259; inconsistency when error terms are not normal, 259-260; iterative method, 260-261
Nonlinear generalized least squares (NLGLS) estimator: in first-order Markov model, 415; in Markov model when only aggregate data are available, 431. See also Nonlinear weighted least squares (NLWLS) estimator
Nonlinear least squares (NLLS) estimator: asymptotic normality of, 132-134; consistency of, 129-130; definition, 127-129; equivalence of, with MLE under normality, 129; in Markov model when only aggregate data are available, 431, 433; in standard Tobit model, 372-373, 378; inconsistency in nonlinear simultaneous equations model, 245; under general parametric constraints, 144; under linear constraints, 136-137
Nonlinear limited information (NLLI) maximum likelihood estimator: asymptotic covariance matrix of, 254; definition, 252-254; inconsistency of, when normality fails to hold, 255; iterative methods for solving for, 254
Nonlinear regression model, 127-128
Nonlinear simultaneous equations models: full information case, 255-256; limited information case, 252-253
Nonlinear three-stage least squares (NL3S) estimator: asymptotic normality of, 257; consistency of, 257; definition (Amemiya), 257; definition (Jorgenson and Laffont), 257. See also Best nonlinear three-stage least squares (BNL3S) estimator
Nonlinear two-stage least squares (NL2S) estimator: asymptotic normality, 247-248; consistency, 246-247; definition, 246, 250; in case of Box-Cox transformation, 250, 251-252. See also Best nonlinear two-stage least squares (BNL2S) estimator; Modified nonlinear two-stage least squares (MNL2S) estimator
Nonlinear weighted least squares (NLWLS) estimator: in binary qualitative response model, 274-275; in duration model with one completed spell per individual, 440; in Markov model when only aggregate data are available, 432, 433; in multinomial qualitative response model, 289-290; in multi-state Markov model with exogenous variables, 428-429; in standard Tobit model, 372-373, 472; in two-state Markov model with exogenous variables, 423. See also Method of scoring
Nonnested models, 147-148
Nonnormality: in standard Tobit model, 378, 380-382, 383; robust estimation under, 70-71