Graduate Econometrics Lecture Notes - Michael Creel (414 Pages)
Graduate Econometrics Lecture Notes - Michael Creel (414 Pages)
Michael Creel
Version 0.4, 06 Nov. 2002, copyright (C) 2002 by Michael Creel
Contents
1 License, availability and use
10
1.1
License . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
1.2
10
1.3
Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
1.4
Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
12
14
3.1
14
3.2
15
3.3
16
3.4
17
3.4.1
17
In X Y Space . . . . . . . . . . . . . . . . . . . . . . . . . .
Dept.
of Economics and Economic History,
[email protected]
3.4.2
In Observation Space . . . . . . . . . . . . . . . . . . . . . .
17
3.4.3
Projection Matrices . . . . . . . . . . . . . . . . . . . . . . .
19
3.5
20
3.6
Goodness of fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
3.7
25
3.7.1
Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . .
25
3.7.2
Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
3.7.3
26
28
4.1
28
4.2
Consistency of MLE . . . . . . . . . . . . . . . . . . . . . . . . . .
29
4.3
31
4.4
33
4.5
37
4.6
39
43
5.1
Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
43
5.2
Asymptotic normality . . . . . . . . . . . . . . . . . . . . . . . . . .
44
5.3
Asymptotic efficiency . . . . . . . . . . . . . . . . . . . . . . . . . .
45
6.2
47
47
6.1.1
Imposition . . . . . . . . . . . . . . . . . . . . . . . . . . .
48
6.1.2
52
Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
6.2.1
t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
6.2.2
F test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
57
6.2.3
Wald-type tests . . . . . . . . . . . . . . . . . . . . . . . . .
58
6.2.4
59
6.2.5
62
6.3
63
6.4
68
6.5
Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . .
68
6.6
Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
69
6.7
71
76
7.1
77
7.2
78
7.3
Feasible GLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
81
7.4
Heteroscedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
7.4.1
84
7.4.2
Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . .
85
7.4.3
Correction . . . . . . . . . . . . . . . . . . . . . . . . . . .
88
Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
91
7.5.1
Causes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
91
7.5.2
AR(1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
93
7.5.3
MA(1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
97
7.5.4
7.5
7.5.6
8 Stochastic regressors
107
8.1
Case 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.2
Case 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.3
Case 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.4
9 Data problems
9.1
9.2
9.3
114
Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.1.1
9.1.2
9.1.3
9.1.4
9.2.2
9.3.2
9.3.3
132
149
195
208
225
233
. . . . . . . . . . . . . . . . . . . . . . . . . . 234
241
261
270
306
312
326
348
23 Simulation-based estimation
378
404
25 The GPL
404
1.3 Use
You are free to use the notes as you like, for study, preparing a course, etc. I find
that a hard copy is of most use for lecturing or study, while the html version is useful
for quick reference or answering students questions in office hours. I would greatly
1 Free
10
appreciate that you inform me of any errors you find. Id also welcome contributions
in any area, especially in the areas of time series and nonstationary data.
1.4 Sources
The following is a partial list of the sources that have been used in preparing these
notes.
References
[Amemiya (1985)]
[Davidson and MacKinnon (1993)] Davidson, R. and J.G. MacKinnon (1993) Estimation and Inference in Econometrics, Oxford
Univ. Press.
[Gallant (1987)]
[Gallant (1997)]
[Hamilton (1994)]
[Hayashi (2000)]
[Judge (1985)]
xi
x i p i m i zi
pi is G 1 vector of prices
mi is income
xi
0 pi
p mi m wi
w i
12
The functions xi
which may differ for all i have been restricted to all belong
Of all parametric families of functions, we have restricted the model to the class
of linear in the variables functions.
These are very strong restrictions, compared to the theoretical model. Furthermore,
these restrictions have no theoretical basis. In addition, we still need to make more
assumptions in order to determine how to estimate the model. The validity of any
results we obtain using this model will be contingent on these restrictions being correct.
For this reason, specification testing will be needed, to check that the model seems to
be reasonable. Only when we are convinced that the model is at least approximately
correct should we use it for economic analysis. In the next sections we will obtain
results supposing that the econometric model is correctly specified. Later we will
examine the consequences of misspecification and see some methods for determining
if a model is correctly specified.
13
yt
xt
0 t
X0
or in matrix form,
where y is n 1 X
x1 x2
xn
where xt is K
conformable. The subscript 0 in 0 means this is the true value of the unknown
parameter. It will be suppressed when its not necessary for clarity. Linear
models are more general than they might first appear, since one can employ
nonlinear transformations of the variables:
0 zt
(The i
1
wt
2 wt
p wt
0 zt xt1
1 wt etc. leads to
a model in the form of equation (??). For example, the Cobb-Douglas model
Aw2 2 w3 3 exp
ln z
ln A 2 ln w2 3 ln w3
14
E
E
Var
20 In
1
nX
X
s
arg min s
y
y
X
yt
xt
2
t 1
y
X
2y
X
X
X
X
This last expression makes it clear how the OLS estimator chooses : it minimizes the
Euclidean distance between y and X
15
To minimize the criterion s take the f.o.n.c. and set them to zero:
D s
2X
y 2X
X
so
X
X 1 X
y
Since X
2X
X
X
X
Note that
X
20
1
n
16
Figure 1 shows a typical fit to data, with a residual. The area of the square is that
residuals contribution to the sum of squared errors. The fitted line is chosen so as to
minimize this sum.
Figure 1: Fitted Regression Line
The fitted line and a residual
x
e_i
1
Observation 2
e = M_xY
S(x)
x
x*beta=P_xY
Observation 1
We can decompose y into two components: the orthogonal projection onto the
K
Since is chosen to make as short as possible, will be orthogonal to the space
spanned by X Since X is in this space, X
18
X X
X
X
y
X X
X
PX
since
PX y
is the projection of y off the space spanned by X (that is onto the space that is
orthogonal to the span of X We have that
y
y
X X
X
In
X X
X
X
y
1
X
y
MX
X X
X 1 X
In
In
PX
We have
19
MX y
Therefore
PX y MX y
X
AA
X
X
X
i y
ci
y
This is how we define a linear estimator - its a linear function of the dependent
variable. Since its a linear combination of the observations on the dependent variable, where the weights are detemined by the observations on the regressors, some
observations may have more influence than others. Define
ht
PX
tt
et
PX et
PX et
20
2
et
ht
1 and
TrPX
K n
So, on average, the weight on the yt s is K n. If the weight is much higher, then the
observation is influential. However, an observation may also be influential due to the
value of yt , rather than the weight it is multiplied by, which only depends on the xt s.
To account for this, consider estimation of without using the t th observation (designate this estimator as t
1
1
h
X
X
Xt
t
Xt t
ht
1
ht
While and observation may be influential if it doesnt affect its own fitted value, it
certainly is influential if it does. A fast means of identifying influential observations is
to plot
ht
1 ht
t as a function of t.
After influential observations are detected, one needs to determine why they are
influential. Possi causes include:
data entry error, which can easily be corrected once detected. Data entry errors
X
X 2
X
y
y
0, so
X
X
y
y
X
X
y
y
PX y 2
y 2
cos2
where is the angle between y and the span of X (show with the one regressor, two
observation example).
Figure 3: Uncentered R2
!#"$"%
'&()'*,+,
&,
+-
&$ +
.*/.
10,23
+
23
the ability of the model to explain the variation of y about its unconditional sam-
ple mean.
Let
1 1 1
a n -vector. So
In
n
In
1
y
M y
1
ESS
T SS
so M
In this case
y
M y
X
M X
So
R2c
RSS
T SS
R2c
1
24
then
X
X
X
y
X
X
X
X
0
E
X
X 1 X
0
20
1
n
K
1
E 20
K
E Tr
M
E TrM
TrE M
20 TrM
20 n
20 n
1
n
1
n
1
n
1
n
1
n
M
1
n
20
25
TrX X
X
Tr X
X
X
X
3.7.2 Normality
X
X
N 0 X
X
1 2
0
X
X 1 X
y
Cy
It is also unbiased, as we proved above. One could consider other weights W in place
of the OLS weights. Well still insist upon unbiasedness. Consider
estimator is unbiased
E Wy
E W X 0 W
W X 0
0
WX
IK
26
Wy If the
The variance of is
V
WW
20
Define
D
X
X
X
X
so
Since W X
IK DX
V
0 so
X
X
DD
X
X
D
X
X
20
20
So
V
V
It is worth noting that we have not used the normality assumption in any way
to prove the Gauss-Markov theorem, so it is valid if the errors are not normally
distributed, as long as the other assumptions hold.
The previous properties hold for finite sample sizes. Before considering the asymptotic
properties of the OLS estimator it is useful to review the MLE estimator, since under
the assumption of normal errors the two estimators coincide.
27
yn
f Y Y 0
L Y
fY Y
L Y
yt
t 1
L Y
f y 1 f y 2 y1 f y 3 y1 y2
28
f yn y1 y2 yt
n
xt
y1 y2
S t
yt
where S is the sample space of Y (With this, conditioning on x1 has no effect and gives
a marginal probability). Now the likelihood function can be written as
n
L Y
yt xt
t 1
sn
1 n
ln f yt xt
nt 1
1
ln L Y
n
arg max sn
is a monotonic increasing
29
sn
uas
lim E0 sn s 0
We have suppressed Y here for simplicity. This requires that almost sure convergence
holds for all possible parameter values.
Continuity sn is continuous in
in
a s
We will use these assumptions to show that
0
0
L
E ln
L 0
by Jensens inequality ( ln
ln E
L
L 0
is a concave function).
L
L 0
L
L 0 dy
L 0
1
L
E ln
L 0
30
0
0
or
E sn
E s n 0
0
s 0
s 0 0
0
0 :
s 0
s 0 0 0
0
s 0
s 0 0
0
0 a s
This completes the proof of strong consistency of the MLE. One can use weaker assumptions to prove weak consistency (convergence in probability to 0 ) of the MLE.
This is omitted here. Note that almost sure convergence implies convergence in probability.
31
gn Y
D sn
1 n
D ln f yt xx
nt 1
1 n
gt
n t1
argument, which implies that it is a random function. Y will often be suppressed for
clarity, but one should not forget that it is still there.
The ML estimator sets the derivatives to zero:
1 n
gt 0
n t1
gn
D ln f yt x f yt x dyt
E gt
1
D f yt xt f yt xt dyt
f yt xt
D f yt xt dyt
E gt
D
D 1
0
32
f t yt xt dyt
So E gt
0
Recall that we assume that sn is twice continuously differentiable. Take a first order
g
g 0
D g
0
H
where
0 0
0
g 0
1 Assume H
in a minute). So
n
Now consider H
ng 0
This is
H
D g
D2 sn
1
D2 ln f t
nt 1
2 sn
Given that this is an average of terms, it should usually be the case that this satisfies
33
a strong law of large numbers (SLLN). Regularity conditions are a set of assumptions
that guarantee that this will happen. There are different sets of assumptions that can
must not
be too strongly dependent over time, and their variances must not become infinite. We
dont assume any particular set here, since the appropriate assumptions will depend
upon the particularities of a given model. However, we assume that a SLLN applies.
Also, since we know that is consistent, and since
that
as
0 . Given this, H
0 we have
as
lim E D2 sn 0
H 0
D2 lim E sn 0
H 0
D2 s 0 0
s 0
s 0 0
i.e., 0 maximizes the limiting objective function. Since there is a unique maximizer,
the limit), then H 0 must be negative definite, and therefore of full rank. Therefore
34
Now consider
as
H 0
1
ng 0
(1)
ng 0 This is
ngn 0
nD sn
n n
D ln ft yt xt 0
n t 1
1 n
gt 0
n t1
applies.
Note that gn 0
as
V Xn
where V Xn
1 2
1 2
Xn
E Xn
d N 0 I
V Xn
1 2
V Xn
1 2
V Xn
The certain conditions that Xn must satisfy depend on the case at hand. Usually, Xn
Xn
n:
tn 1 Xt
n
n
35
properties of the Xt For example, if the Xt have finite variances and are not too strongly
dependent, then a CLT for dependent processes will apply. Supposing that a CLT
applies, and noting that E
ngn 0
I 0
0 we get
1 2
d N 0 I
K
ngn 0
where
I 0
lim E0 n gn 0 gn 0
ngn 0
lim V0
ngn 0
d N 0 I
0
(2)
N 0 H 0
I 0 H 0
1
0
d N0 V
36
(3)
There do exist, in special cases, estimators that are consistent such that
p
n is the highest
factor that we can multiply by an still get convergence to a stable limiting distribution.
Definition 3 (Asymptotic unbiasedness) An estimator of a parameter 0 is asymptotically unbiased if
lim E
(4)
Estimators that are CAN are asymptotically unbiased, though not all consistent
estimators are asymptotically unbiased. Such cases are unusual, though. An example
is:
Exercise 4 Consider an estimator with distribution
0 with prob. 1
1
1
n with prob.
n
n
(5)
ft dy so
D ft dy
D ln ft ft dy
37
D2 ln ft
E D ln ft D ln ft
D ln ft D ln ft
ft dy
E gt gt
E Ht
D ln ft D ft dy
E D2 ln ft
E D2 ln ft
ft dy
(6)
1
n
1 n
Ht
nt 1
1 n
gt gt
nt 1
E
s since for t
s ft yt y1 yt
1
has
conditioned on prior information, so what was random in s is fixed in t. (This forms the
basis for a specification test proposed by White: if the scores appear to be correlated
one may question the specification of the model). This allows us to write
E H
E n g g
since all cross products between different periods expect to zero. Finally take limits,
we get
H
I
n
N 0 H
0
as
38
I 0 H 0
1
simplifies to
n
as
N 0 I 0
n gt gt
I 0
t 1
H
H 0
I 0
n gn
gn
From this we see that there are alternative ways to estimate V 0 that are all
valid. These include
V 0
V 0
V 0
H 0
I 0
H 0
1
1
I 0 H 0
These are known as the inverse Hessian, outer product of the gradient (OPG) and
sandwich estimators, respectively. The sandwich form is the most robust, since it
coincides with the covariance estimator of the quasi-ML estimator.
lim E
Differentiate wrt
:
D lim E
D f Y dy
lim
0 this is a K
Noting that D f Y
lim
f Y
IK and
f Y D dy
lim
f D ln f dy
K matrix of zeros
f D ln f we can write
lim
IK dy
1
D ln f f dy
n
f D ln f dy
IK
0
lim
IK
lim E
n
ng
IK
n
40
tends
to V Therefore,
V
ng
V
IK
IK
I
I
1
V
IK
I
IK
I
0
This simplifies to
V
Since is arbitrary, V
I 1
0
This means that I 1 is a lower bound for the asymptotic variance of a CAN
estimator.
Definition 6 (Asymptotic Efficiency) An estimator is of a parameter 0 is asymp
CAN estimator of 0
A direct proof of asymptotic efficiency of an estimator is infeasible, but if one can
show that the asymptotic variance is equal to the inverse of the information matrix,
then the estimator is asymptotically efficient. In particular, the MLE is asymptotically
efficient.
Summary of MLE
Consistent
Asymptotically efficient
Asymptotically unbiased
This is for general MLE: we havent specified the distribution or the linearity/nonlinearity of the estimator
42
X
X
X
y
X
X
X
X
X
X
X
X
n
1
X
n
XX
n
QX
limn
XX
n
1
X
n
X
n
1 n
xt t
n t1
V xt t
xt xt
20
E xt t s xs
0t
and
s
1 n
xt t
n t1
as
0
43
0
This is the property of strong consistency: the estimator converges almost surely to the
true value. If we has used a weak LLN (defined in terms of convergence in probability),
we would have (simple, weak) consistency.
Now as before,
Considering
X
n
XX
n
1
X
X
X
X
n
0
X
X
1
X
1
X
n
Q 1
X
lim V
X
n
1 n
xt xt
n
t 1
20 lim
n
20 QX
n. Apply-
d N 0 2 Q
0 X
Therefore,
n
d N 0 2 Q 1
0 X
In summary, the OLS estimator is normally distributed in small and large samples if is normally distributed. If is not normally distributed, is asymptotically normally distributed.
s
xt
yt
2
t 1
f
N 0 20 In so
t 1
1
22
45
exp
t2
22
The joint density for y can be constructed using a change of variables. We have
y
X so
y
In and
1 so
f y
1
t 1
22
exp
xt
22
yt
Taking logs,
ln L
n ln 2
n ln
n
t 1
yt
xt
22
Its clear that the fonc for the MLE of 0 are the same as the fonc for OLS (up to multiplication by a constant), so the estimators are the same, under the present assumptions.
Therefore, their properties are the same. In particular, under the classical assumptions
with normality, the OLS estimator is asymptotically efficient.
As well see later, it will be possible to use linear estimation methods and still
2 In as long as
is still normally distributed. This is not the case if is nonnormal. In general with
nonnormal errors it will be necessary to use nonlinear estimation methods to achieve
asymptotically efficient estimation.
46
ln q
0 1 ln p1 2 ln p2 3 ln m
k0 ln q
0 1 ln kp1 2 ln kp2 3 ln km
so
1 ln p1 2 ln p2 3 ln m
1 ln kp1 2 ln kp2 3 ln km
ln k 1 2 3
1 ln p1 2 ln p2 3 ln m
0
47
6.1.1 Imposition
The general formulation of linear equality restrictions is the model
X
y
R
where R is a Q K matrix, Q
We also assume that that satisfies the restrictions: they arent infeasible.
1
y
n
min s
X
X
2
R
r
The Lagrange multipliers are scaled by 2, which makes thing less messy. The fonc are
D s
0
2X
y 2X
X R 2R
D s
R R
r 0
X
X R
R
48
X
y
We get
R
X
X R
X
y
X
X
X
X R
0
1
R X
X
AB
IQ
IK
X
X
R X
X
IK
X
X
and
IK
X
X
1R
IK
1
X
X
DC
IK
49
Q
so
DAB
IK
DA
B
Q
1
IK
0
1R
X
X
X
X
1R
R X
X
1R
X
X
X
X
IQ
X
X
X
X
1
P 1R X
X
1R
so
X
X
1R
X
X
P
1R
1R
X
X
X
X
1R
X
X
P
X
y
1R
X
X
r
1
r
IK
X
X
1R
1R
P 1R
X
X
1R
1r
1r
are linear functions of makes it easy to determine their disThe fact that R and
tributions, since the distribution of is already known. Recall that for x a random
vector, and for A and b a matrix and vector of constants, respectively, Var Ax b
AVar x A
Though this is the obvious way to go about finding the restricted estimator, an
easier way, if the number of restrictions is small, is to impose them by substitution.
50
Write
R
R2
X1 1 X2 2
1
R1 1 r
R 1 1 R 2 2
X1 R1 1 r
y
y
X1 R1 1 r
X2
X1 R1 1 R2 2 X2 2
X1 R1 1 R2 2
yR
XR 2
This model satisfies the classical assumptions, supposing the restriction is true. One
can estimate by OLS. The variance of 2 is as before
V 2
XR
XR
20
XR
XR
51
where one estimates 20 in the normal way, using the restricted model, i.e.,
yR
20
XR 2
yR
XR 2
Q
To recover 1 use the restriction. To find the variance of 1 use the fact that it is a
linear function of 2 so
V 1
R1 1 R2V 2 R2
R1 1
R1 1 R2 X2
X2
R2
R1 1
20
X
X
R
P
r
R
X
X
X
X
R
P
X
X
X
X
R
P
X
X
R
P
r
X
X
X
X
R
P
R
P
r
R X
X 1 X
y
R
X
X
R
P 1 R X
X
R
R X
X
E R R
Noting that the crosses between the second term and the other terms expect to zero,
and that the cross of the first and third has a cancellation with the square of the third,
52
we obtain
MSE R
X
X
1 2
X
X
X
X
R
P
r
R r
R
P 1 R X
X
R
P 1 R X
X
1 2
So, the first term is the OLS covariance. The second term is PSD, and the third term is
NSD.
If the restriction is true, the second term is 0, so we are better off. True restrictions improve efficiency of estimation.
If the restriction is false, we may be better or worse off, in terms of MSE, depending on the magnitudes of r
R and 2
6.2 Testing
In many cases, one wishes to test economic theories. If theory suggests parameter
restrictions, as in the above homogeneity example, one can test theory by testing parameter restrictions. A number of tests are available.
6.2.1 t-test
Suppose one has the model
53
r vs. HA :R
r . Under H0
N 0 R X
X 1 R
20
so
R
R X
X
r
1R
1R
R X
X
N 0 1
The problem is that 20 is unknown. One could use the consistent estimator 20 in place
of 20 but the test would only be valid asymptotically in this case.
Proposition 7
N 0 1
2 q
q
t q
(7)
x
x
where
i 2i
2 n
(8)
parameter.
Proposition 9 If the n dimensional random vector x
54
N 0 V then x
V
1x
2 n
Well prove this one as an indication of how the following unproven propositions
could be proved.
Proof: Factor V
as PP
(this is the Cholesky factorization). Then consider y
P
x We have
N 0 P
V P
but
so PV P
V PP
In
P
V PP
In
N 0 In
y
y
x
PP
x
xV
2 n
x
Bx
2 B
55
N 0 V then
(9)
N 0 I and B is idempotent
2 r
x
Bx
(10)
20
0
2 n
MX
K
N 0 I then Ax and x
Bx
0
are independent if AB
Now consider (remember that we have only one restriction in this case)
R r
R X X 1R
R
n K 20
r
1R
R X
X
X
X
1X
K distribution if and
are independent. But
and
X
X
X
MX
0
so
R
0
R r
R
R X
X
1R
t n
K
In particular, for the commonly encountered test of significance of an individual coefficient, for which H0 : i
0 vs. H0 : i
i
i
t n
56
K
Note: the t
test is strictly valid only if the errors are actually normally dis
tributed. If one has nonnormal errors, one could use the above asymptotic result
N 0 1 as n
K
d
from the t distribution if nonnormality is suspected. This will reject H0 less often
since the t distribution is fatter-tailed than is the normal.
6.2.2 F test
The F test allows testing multiple restrictions jointly.
Proposition 13 If x
2 r and y
2 s then
x r
y s
F r s
(11)
N 0 I then x
Ax and x
Bx
0
are independent if AB
Using these results, and previous results on the 2 distribution, it is simple to show
that the following statistic has the F distribution:
R
R X
X
R
F qn
q 2
ESSR ESSU q
ESSU n K
57
F qn
K
K
Note: The F test is strictly valid only if the errors are truly normally distributed.
The following tests will be appropriate when one cannot assume normally distributed errors.
then under H0 : R0
d N 0 2 Q 1
0 X
r we have
n R
d N 0 2 RQ 1 R
0
X
so by Proposition [9]
n R
20 RQX 1 R
1
d 2 q
r
Note that QX 1 or 20 are not observable. The test statistic we use substitutes the con
is a cancellation of n
s and the statistic to use is
R
20 R X
X
R
d 2 q
r
The Wald test is a simple way to test restrictions without having to estimate the
restricted model.
58
Note that this formula is similar to one of the formulae provided for the F test.
X
1 Estimation of nonlinear
models is a bit more complicated, so one might prefer to have a test based upon the
restricted, linear model. The score test is useful in this situation.
Score-type tests are based upon the general principle that the gradient vector of
the unrestricted model, evaluated at the restricted estimate, should be asymptotically normally distributed with mean zero, if the restrictions are true. The
original development was for ML estimation, but the principle is valid for a
wide variety of estimation methods.
R X
X
1
R
r
1
R
r
Given that
n R
d N 0 2 RQ 1 R
0
X
d N 0 2 P 1 RQ 1 R
P
0
X
59
or
d N 0 2 lim n nP
0
RQX 1 R
P
1
since the ns cancel and inserting the limit of a matrix of constants changes nothing.
However,
lim nR X
X
lim nP
X
X
n
lim R
RQX 1 R
In this case,
d N 0 2 lim nP
0
R X
X
20
1R
1
d 2 q
since the powers of n cancel. To get a usable test statistic substitute a consistent estimator of 20
This makes it clear why the test is sometimes referred to as a Lagrange multiplier
test. It may seem that one needs the actual Lagrange multipliers to calculate this.
If we impose the restrictions by substitution, these are not available. Note that
the test can be written as
X
X
20
60
1R
d 2 q
X
y X
X R R
to get that
X
y
X R
X
R
R
X X
X
20
1X
R d
2 q
PX
R
20
d 2 q
To see why the test is also known as a score test, note that the fonc for restricted least
squares
X
y X
X R R
give us
X
y
X
X R
and the rhs is simply the gradient (score) of the unrestricted model, evaluated at the
restricted estimator. The scores evaluated at the unrestricted estimate are identically
zero. The logic behind the score test is that the scores evaluated at the restricted estimate should be approximately zero, if the restriction is true. The test is also known as
a Rao test, since P. Rao first proposed it in 1948.
61
2 ln L
LR
ln L
where is the unrestricted estimate and is the restricted estimate. To show that it is
ln L
ln L
H
H
LR
As n
H H
0
LR
I 0
a
I 0 1 n1 2 g 0
An analogous result for the restricted estimator is (this is unproven here, to prove
this set up the Lagrangean for MLE subject to R
62
conditions) :
n
I 0
R
RI 0
In
1
RI 0
n1 2 g 0
n1 2 I 0 1 R
RI 0
1
1
RI 0
g 0
LR
n1 2 g 0
I 0
RI 0
R
1 1 2
RI 0
g 0
But since
n1 2 g 0
d N0 I
0
RI 0
1 1 2
g 0
d N 0 RI
0
R
We can see that LR is a quadratic form of this rv, with the inverse of its variance in the
middle, so
LR
d 2 q
6.3 The asymptotic equivalence of the LR, Wald and score tests
We have seen that the three tests all converge to 2 random variables. In fact, they all
converge to the same 2 rv, under the null hypothesis. Well show that the Wald and LR
tests are asymptotically equivalent. We have seen that the Wald test is asymptotically
63
equivalent to
n R
20 RQX 1 R
R
d 2 q
r
Using
0
X
X 1 X
and
R
R
r
0
we get
nR
0
nR X
X
R
X
X
n
1
1 2
a
a
X QX 1 R
20 RQX 1 R
X X
X
A A
A
20
PR
20
1A
R
20 R X
X
RQX 1 X
1
R X
X
1R
Note that this matrix is idempotent and has q columns, so the projection matrix
has rank q
64
LR
n1 2 g 0
I 0
R
RI 0
1 1 2
RI 0
g 0
ln L
n ln 2
n ln
1 y
2
X
Using this,
1
D ln L
n
X
y X 0
n2
X
n2
g 0
I 0
H 0
lim D g 0
X
y X 0
lim D
n2
X
X
lim 2
n
QX
2
so
I 0
65
2 Q X 1
y
X
LR
a
a
a
X
X
X
R
20 R X
X
1
R X
X
PR
20
W
This completes the proof that the Wald and LR tests are asymptotically equivalent.
Similarly, one can show that, under the null hypothesis,
qF
LM
LR
The proof for the statistics except for LR does not depend upon normality of the
errors, as can be verified by examining the expressions for the statistics.
The LR statistic is based upon distributional assumptions, since one cant write
the likelihood function without them.
However, due to the close relationship between the statistics qF and LR suppos
The presentation of the score and Wald tests has been done in the context of
the linear model. This is readily generalizable to nonlinear models and/or other
estimation methods.
Though the four statistics are asymptotically equivalent, they are numerically different
in small samples. The numeric values of the tests also depend upon how 2 is esti66
mated, and weve already seen than there are several ways to do this. For example all
of the following are consistent for 2 under H0
n k
n
R R
n k q
R R
n
and in general the denominator call be replaced with any quantity a such that lim a n
1
It can be shown, for linear regression models subject to linear restrictions, and if
n is
R R
n
LR
LM
For this reason, the Wald test will always reject if the LR test rejects, and in turn the
LR test rejects if the LM test rejects. This is a bit problematic: there is the possibility
that by careful choice of the statistic used, one can manipulate reported results to favor
or disfavor a hypothesis. A conservative/honest approach would be to report all three
test statistics when they are available. In the case of linear models with normal errors
the F test is to be preferred, since asymptotic approximations are not an issue.
The small sample behavior of the tests can be quite different. The true size (probability of rejection of the null when the null is true) of the Wald test is often dramatically
higher than the nominal size associated with the asymptotic distribution. Likewise, the
true size of the score test is often smaller than the nominal size.
67
t
a 100 1
C
: c
2
A confidence ellipse for two coefficients jointly would be, analogously, the set
of {1 2 such that the F (or some other test statistic) doesnt reject at the specified
critical value. This generates an ellipse, if the estimators are correlated. Draw a picture
here.
The region is an ellipse, since the CI for an individual coefficient defines a (infinitely long) rectangle with total prob. mass 1
marginalized (e.g., can take on any value). Since the ellipse is bounded in both
dimensions but also contains mass 1
the individual CI.
Rejection of hypotheses individually does not imply that the joint test will
reject.
Joint rejection does not imply individal tests will reject.
6.6 Bootstrapping
When we rely on asymptotic theory to use the normal distribution-based tests and
confidence intervals, were often at serious risk of making important errors. If the
sample size is small and errors are highly nonnormal, the small sample distribution
n
of
may be very different than its large sample distribution. Also, the
distributions of test statistics may not resemble their limiting distributions at all. A
means of trying to gain information on the small sample distribution of test statistics
and estimators is the bootstrap. Well consider a simple example, just to get the main
idea.
Suppose that
X 0
IID 0 20
X is nonstochastic
Given that the distribution of is unknown, the distribution of will be unknown in
small samples. However, since we have random sampling, we could generate artificial
data. The steps are:
1. Draw n observations from with replacement. Call this vector j (its a n 1
2. Then generate the data by y j
X j
69
X
X
X
y j
4. Save j
5. Repeat steps 1-4, until we have a large number, J of j
With this, we can use the replications to calculate the empirical distribution of j
One way to form a 100(1- % confidence interval for 0 would be to order the j
from smallest to largest, and drop the first and last J 2 of the replications, and use
the remaining endpoints as the limits of the CI. Note that this will not give the shortest
CI if the empirical distribution is skewed.
Suppose one was interested in the distribution of some function of for example
a test statistic. Simple: just calculate the transformation for each j and work
How to choose J: J should be large enough that the results dont change with
repetition of the entire bootstrap. This is easy to check. If you find the results
In finite samples, this doesnt hold. At a minimum, the bootstrap is a good way
to check if asymptotic theory results offer a decent approximation to the small
sample distribution.
r 0
where r
0
ated at as
D r
R
R
in a neighborhood of 0 Take a first order Taylors series expansion of r about 0 :
r
r 0
71
0
r
0
nr
nR 0
n
0
d N 0 R Q 1 R
2
0 X
0
0
nr
R 0 Q X 1 R 0
1
r
20
d 2 q
under the null hypothesis. Substituting consistent estimators for 0 QX and 20 the
resulting statistic is
r
R X
X
1R
1
r
d 2 q
Since this is a Wald test, it will tend to over-reject in finite samples. The score
and LR tests are also possibilities, but they require estimation methods for nonlinear models, which arent in the scope of this course.
72
Note that this also gives a convenient way to estimate nonlinear functions and associ
ated asymptotic confidence intervals. If the nonlinear function r 0 is not hypothesized to be zero, we just have
n r
d N 0 R Q 1 R
2
0 X
0
0
r 0
1
R 0
20
f x
x
x
where
x
f x
function
y
x
x
x
x
(note that this is the entire vector of elasticities). The estimated elasticities are
x
x
x
73
R
x
x1
0
..
.
x2
..
1 x21
0
..
.
2 x22
0
..
.
..
xk
0
..
.
0
x
k x2k
Note that the elasticity and the stanTo get a consistent estimator just substitute in .
dard error are functions of x
In many cases, nonlinear restrictions can also involve the data, not just the param
funcion, where p is prices and m is income. An expenditure share system for G goods
is
p i xi p m
i
m
si p m
1 2 G
Now demand must be positive, and we assume that expenditures sum to income, so we
have the restrictions
0
G
si
si p m
p m
1 i
i 1
si p m
i1 p
ip mim i
It is fairly easy to write restrictions such that the shares sum to one, but the restriction
74
that the shares lie in the 0 1 interval depends on both parameters and the values of p
si p m
and m In such cases, one might consider whether or not a linear model is a reasonable
specification.
75
IID 0 2
IIN 0 2
or occasionally
Now well investigate the consequences of nonidentically and/or dependently distributed errors. The model is
X
y
E
V
E X
Perhaps its because under the classical assumptions, a joint confidence region
for would be an n
dimensional hypersphere.
X
X 1 X
y
X
X 1 X
ness, as before.
supposing X is nonstochastic, is
The variance of ,
E
E X
X
X
X
X
X X
X
X
X X
X
1
Due to this, any test statistic that is based upon 2 or the probability limit 2 of
is invalid. In particular, the formulas for the t F 2 based tests given above do
N X
X
X
X X
X
1
77
n X
X
X
X
n
1 2X
lim E
so we obtain
n
1 2
X
X
n
d N 0 Q 1 Q 1
X
X
unbiased in the same circumstances in which the estimator is unbiased with iid
errors
has a different variance than before, so the previous test statistics arent valid
is consistent
PP
78
We have
PP
In
so
P
PP
In
P
y
y
This variance of
P
is
E P
P
P
P
In
y
E
V
E X
0
In
and nonnormality of The GLS estimator is simply OLS applied to the transformed
model:
GLS
X
X
X
PP
X
X
y
X
X
PP
y
X
The GLS estimator is unbiased in the same circumstances under which the OLS
estimator is unbiased. For example, assuming X is nonstochastic
E GLS
E
X
X
y
1
X
GLS
X
y
X
X
X
X
X
X
X
so
E
GLS
GLS
X
X
X
X X
X
X
X
X
X
80
X
X
X X
X
1
All the previous results regarding the desirable properties of the least squares
estimator hold, when dealing with the transformed model.
Tests are valid, using the previous formulas, as long as we substitute X in place
of X Furthermore, any test that involves 2 can set it to 1 This is preferable to
re-deriving the appropriate formulas.
The GLS estimator is more efficient than the OLS estimator. This is a consequence of the Gauss-Markov theorem, since the GLS estimator is based on a
model that satisfies the classical assumptions but the OLS estimator is not. To
see this directly, not that (the following needs to be completed)
Var
Var GLS
X
X
X
X X
X
1
X
As one can verify by calculating fonc, the GLS estimator is the solution to the
minimization problem
GLS
so the metric
arg min y
X
1
X
unique elements.
81
n
2 n
n2 n
The number of parameters to estimate is larger than n and increases faster than
n Theres no way to devise an estimator that satisfies a LLN without adding
restrictions.
The feasible GLS estimator is based upon making sufficient assumptions regarding the form of so that a consistent estimator can be devised.
X
In this case,
X
p X
If we replace in the formulas for the GLS estimator with we obtain the FGLS
estimator. The FGLS estimator shares the same asymptotic properties as GLS.
These are
1. Consistency
2. Asymptotic normality
3. Asymptotic efficiency if the errors are normally distributed. (Cramer-Rao).
4. Test procedures are asymptotically valid.
In practice, the usual way to proceed is
82
X
Chol
P
y
7.4 Heteroscedasticity
Heteroscedasticity is the case where
E
is a diagonal matrix, so that the errors are uncorrelated, but have different variances.
Heteroscedasticity is usually thought of as associated with cross sectional data, though
there is absolutely no reason why time series data cannot also be heteroscedastic. Actually, the popular ARCH (autoregressive conditionally heteroscedastic) models explicitly assume that a time series is heteroscedastic.
Consider a supply function
qi
1 p Pi s Si i
where Pi is price and Si is some measure of size of the ith firm. One might suppose
83
qi
where P is price and M is income. In this case, i can reflect variations in preferences.
There are more possibilities for expression of preferences when one is rich, so it is
possible that the variance of i could be higher when M is high.
Add example of group means.
n
d N 0 Q 1 Q 1
X
X
X
X
n
1 n
x
xt 2
n t1 t t
84
One can then modify the previous test statistics to obtain tests that are valid when there
is heteroscedasticity of unknown form. For example, the Wald test for H0 : R
would be
n R
r
X
X
R
n
X
X
R
R
2 q
7.4.2 Detection
There exist many tests for the presence of heteroscedasticity. Well discuss three methods.
Goldfeld-Quandt
The sample is divided in to three parts, with n1 n2 and n3 observations, where
n1 n2 n3
n. The model is estimated using the first and third parts of the sample,
1 M 1 1
2
d 2 n K
1
3
3
2
3 M 3 3
2
d 2 n K
3
and
so
1
1 n1
3
3 n3
K
K
d F n K n K
1
3
The distributional result is exact if the errors are normally distributed. This test is a
two-tailed test. Alternatively, and probably more conventionally, if one has prior ideas
about the possible magnitudes of the variances of the observations, one could order
the observations accordingly, from largest to smallest. In this case, one would use a
85
Ordering the observations is an important step if the test is to have any power.
The motive for dropping the middle observations is to increase the difference
between the average variance in the subsamples, supposing that there exists heteroscedasticity. This can increase the power of the test. On the other hand,
dropping too many observations will substantially increase the variance of the
statistics 1
1 and 3
3 A rule of thumb, based on Monte Carlo experiments is
to drop around 25% of the observations.
If one doesnt have any ideas about the form of the het. the test will probably
have low power since a sensible data ordering isnt available.
Whites test
When one has little idea if there exists heteroscedasticity, and no idea of its potential form, the White test is a possibility. The idea is that if there is homoscedasticity,
then
E t2 xt
2 t
2 zt
vt
as other variables. Whites original suggestion was to use xt , plus the set of all
unique squares and cross products of variables in xt
86
P ESSR ESSU P
ESSU n P 1
qF
we get
qF
P
R2
1
1 R2
Note that this is the R2 or the artificial regression used to test for heteroscedasticity, not the R2 of the original model.
An asymptotically equivalent statistic, under the null of no heteroscedasticity (so that
R2 should tend to zero), is
nR2
2 P
This doesnt require normality of the errors, though it does assume that the fourth
moment of t is constant, under the null. Question: why is this necessary?
The White test has the disadvantage that it may not be very powerful unless the
zt vector is chosen well, and this is hard to do without knowledge of the form of
heteroscedasticity.
It also has the problem that specification errors other than heteroscedasticity may
lead to rejection.
Note: the null hypothesis of this test may be interpreted as
model V t2
h zt
where h
The test is more general than is may appear from the regression that is used.
87
7.4.3 Correction
will be specific to the for supplied for Well consider two examples. Before this,
lets consider the general nature of GLS when there is heteroscedasticity.
Multiplicative heteroscedasticity
Suppose the model is
xt
t
yt
E t2
t2
zt
zt
vt
and vt has mean zero. Nonlinear least squares could be used to estimate and consistently, were t observable. The solution is to substitute the squared OLS residuals
t2 in place of t2 since it is consistent by the Slutsky theorem. Once we have and
zt
88
2
t
In the second step, we transform the model by dividing by the standard deviation:
xt
t
t
t
yt
t
or
xt
t
yt
This model is a bit complex in that NLS is required to estimate the model of the
variance. A simpler version would be
xt
t
yt
E t2
t2
2 zt
0 3
2 ztm vt
89
Save the pairs (2m m and the corresponding ESSm Choose the pair with the
Groupwise heteroscedasticity
A common case is where we have repeated observations on each of a number of
economic agents: e.g., 10 years of macroeconomic data on each of a set of countries or
regions, or daily observations of transactions of 200 banks. This sort of data is a pooled
cross-section time-series model. It may be reasonable to presume that the variance is
constant over time within the cross-sectional units, but that it differs across them (e.g.,
firms or countries of different sizes...). The model is
xit
it
yit
E 2it
where i
2i t
To correct for heteroscedasticity, just estimate each 2i using the natural estimator:
2i
1 n 2
n t1 it
Note that we use 1 n here since its possible that there are more than n regressors,
so n
yit
i
Do this for each cross-sectional group. This transformed model satisfies the
classical assumptions, asymptotically.
7.5 Autocorrelation
Autocorrelation, which is the serial correlation of the error term, is a problem that is
usually associated with time series data, but also can affect cross-sectional data. For
example, a shock to oil prices will simultaneously affect all countries, so one could
expect contemporaneous correlation of macroeconomic variables across countries.
7.5.1 Causes
Autocorrelation is the existence of correlation across the error term:
E t s 0 t s
91
yt
to be positive,
yt
0 1 xt 2 xt2 t
but we estimate
yt
0 1 xt t
92
7.5.2 AR(1)
There are many types of autocorrelation. Well consider two examples. The first is the
most commonly encountered case: autoregressive order 1 (AR(1) errors. The model is
yt
xt
t
ut
iid 0 2u
ut
E t us
0t
ut
t
2 t
ut
ut
2 t
ut
ut
ut
1
ut
2
mut
93
ut
0 as m so we obtain
m 0
2u
2m
m 0
2u
2
1
2 E t2
V t
2V t
2E t
1
2u
1 ut
E ut2
so
2u
1 2
V t
V t
Cov t t
E t
1
ut t
V t
2u
1 2
Cov t t
s
t
s
s 2u
1 2
94
1
cov x y
se x se y
corr x y
but in this case, the two standard errors are the same, so the s-order autocorrelation s
is
s
s
All this means that the overall matrix has the form
1
..
.
2u
1 2
..
.
n 2
1
..
.
..
n 1
1
So we have homoscedasticity, but elements off the main diagonal are not zero.
All of this depends only on two parameters, and 2u If we can estimate these
consistently, we can apply FGLS.
It turns out that its easy to estimate these consistently. The steps are
1. Estimate the model yt
xt
t by OLS. This is consistent as long as 1n X
X
converges to a finite limiting matrix. It turns out that this requires that the regressors X satisfy the previous stationarity conditions and that
have assumed.
95
1 which we
ut
Since t
ut
2u
p u , the estimator
t
2
u
2 p
structure of and estimate by FGLS. Actually, one can omit the factor 2u 1
FGLS
1
y
One can iterate the process, by taking the first FGLS estimator of re-estimating
normal errors).
An asymptotically equivalent approach is to simply estimate the transformed
model
yt
using n
t
y
1
xt
t
x
1
ut
of Cochrane and Orcutt. Dropping the first observation is asymptotically irrelevant, but it can be very important in small samples. One can recuperate the first
observation by putting
2 y1
y1
2 x1
x1
1
1
See
Davidson and MacKinnon, pg. 348-49 for more discussion. Note that the variance of y1 is 2u asymptotically, so we see that the transformed model will be
7.5.3 MA(1)
The linear regression model with moving average order 1 errors is
yt
xt
t
ut ut
ut
iid 0 2u
E t us
0t
In this case,
V t
E ut ut
2u 2 2u
2u 1 2
97
1
2
Similarly
E ut ut
1
ut
ut
2
2u
and
ut ut
1
ut
ut
3
so in this case
1 2
2u
0
..
.
0
..
.
..
.
..
2u
2u 1 2
1
0
1 2
1 and a minimum at
and minimal autocorrelations are 1/2 and -1/2. Therefore, series that are more
strongly autocorrelated cant be MA(1) processes.
98
Again the covariance matrix has a simple structure that depends on only two parameters. The problem in this case is that one cant estimate using OLS on
ut ut
because the ut are unobservable and they cant be estimated consistently. However,
there is a simple way to estimate the parameters.
V t
2u 1 2
1 n 2
n t1 t
2u 1 2
By the Slutsky theorem, we can interpret this as defining an (unidentified) estimator of both 2u and e.g., use this as
2u 1 2
1 n 2
n t1 t
Cov t t
2u
1
1 n
t t
n t2
using
This is a consistent estimator, following a LLN (and given that the epsilon hats
99
are consistent for the epsilons). As above, this can be interpreted as defining an
unidentified estimator:
1 n
t t
n t2
2u
Now solve these two equations to obtain identified (and therefore consistent)
estimators of both and 2u Define the consistent estimator
2u
following the form weve seen above, and transform the model using the Cholesky
decomposition. The transformed model satisfies the classical assumptions asymptotically.
n
d N 0 Q 1 Q 1
X
X
where, as before, is
lim E
X
X
n
100
xt t (recall that xt is defined as a
x
x2
xn
1
2
..
.
n
n
xt t
t 1
n
mt
t 1
so that
n
1
lim E
n n
mt
mt
t 1
t 1
We assume that mt is covariance stationary (so that the covariance between mt and
mt
Define the v
th autocovariance of mt as
v
Note that E mt mt
v
E mt mt
v
v
(show this with an example). In general, we expect that:
E mt mt
v
0
Note that this autocovariance does not depend on t due to covariance stationar
ity.
1
n
mt
mt
t 1
t 1
We have (show that the following is true, by expanding sum and shifting rows to left)
n
1
1 1
2
2 2
1
n
n
1
1
nt
m t m t
v
v 1
where
xt t
m t
estimator of n would be
n
n 1
n
1 1
n 2
n
0 vn
2 2
1n v
1 n
1
n
v v
1
more than the number of observations, and increases more rapidly than n, so information does not build up as n
On the other hand, supposing that v tends to zero sufficiently rapidly as v tends to
a modified estimator
qn
v v
v 1
where q n
The assumption that autocorrelations die off is reasonable in many cases. For
example, the AR(1) model with
The term
n v
n
q n , given that
Newey and West proposed and estimator (Econometrica, 1987) that solves the
problem of possible nonpositive definiteness of the above estimator. Their estimator is
n
qn
1
v
v 1
q 1
v v
1 4q
n
0 Note that this is a very slow rate of growth for q This estimator
1
nX
X to
consistently estimate the limiting distribution of the OLS estimator under heteroscedas-
103
ticity and autocorrelation of unknown form. With this, asymptotically valid tests are
constructed in the usual way.
t t 1 2
tn 1 t2
tn 2 t2 2 t t 1 t2
tn 1 t2
tn
DW
1
The null hypothesis is that the first order autocorrelation of the errors is zero:
H0 : 1
is not that the errors are AR(1), since many general patterns of autocorrelation
will have the first order autocorrelation different than zero. For this reason the
test is useful for detecting autocorrelation in general. For the same reason, one
shouldnt just assume that an AR(1) model is appropriate when the DW test
rejects the null.
Under the null, the middle term tends to zero, and the other two tend to one, so
DW
p 2
Supposing
2 so DW
p 0
p 4
104
exact critical values. The give upper and lower bounds, which correspond to the
extremes that are possible. Picture here. There are means of determining exact
critical values conditional on X
Breusch-Godfrey test
This test uses an auxiliary regression, as does the White test for heteroscedasticity.
The regression is
t
xt
1 t
2 t
P t
vt
and the test statistic is the nR2 statistic, just as in the White test. There are P restric
The intuition is that the lagged errors shouldnt contribute to explaining the current error if there is no autocorrelation.
xt is included as a regressor to account for the fact that the t are not independent
even if the t are. This is a technicality that we wont go into here.
The alternative is not that the model is an AR(P), following the argument above.
The alternative is simply that some or all of the first P autocorrelations are different from zero. This is compatible with many specific forms of autocorrelation.
yt
xt
yt
ut
yt
E yt 1 t
xt
1
ut
1
0
0 Since
plim
plim
X
n
the OLS estimator is inconsistent in this case. One needs to estimate by instrumental
variables (IV), which well get to later.
106
8 Stochastic regressors
Up until now weve assumed that the regressors are nonstochastic. This is highly
unrealistic in the case of economic data.
There are several ways to think of the problem. First, if we are interested in an analysis conditional on the explanatory variables, then it is irrelevant if they are stochastic
or not, since conditional on the values of they regressors take on, they are nonstochastic, which is the case already considered.
In cross-sectional analysis it is usually reasonable to make the analysis conditional on the regressors.
1
sufficiently general, since we may want to predict into the future many periods
out, so we need to consider the behavior of and the relevant test statistics
unconditional on X
The model well deal with is
1. Linearity: the model is a linear function of the parameter vector 0 :
yt
xt
0 t
X 0
or in matrix form,
where y is n 1 X
x1 x2
xn
conformable.
107
where xt is K
E
E
20 In
X is stochastic
X is uncorrelated with : E X
1
Pr n X
X
0
limn
QX
trix.
1 2X
d N 0 Q 2
X 0
8.1 Case 1
Normality of X independent of
In this case,
X
X
0 E
X
X
1X
E
Conditional on X
N 0 X
X
108
1 2
0
If the density of X is d X the marginal density of is obtained by multiplying
However, conditional on X the usual test statistics have the t F and 2 distribu
to obtain the unconditional distribution, nothing changes. The tests are valid in
small samples.
8.2 Case 2
nonnormally distributed, independent of X
The unbiasedness of carries through as before. However, the argument regarding
test statistics doesnt hold, due to nonnormality of Still, we have
0
0
X
X
X
X
n
109
X
1
X
n
Now
X
X
n
p Q 1
X
by assumption, and
X
n
1 2X
p
since the numerator converges to a N 0 QX 2 r.v. and the denominator still goes
to infinity. We have unbiasedness and the variance disappearing, so, the estimator is
consistent:
0
X
X
n
n
X
X
n
X
n
1 2
so
n
d N 0 Q 1 2
0
X
directly following the assumptions. Asymptotic normality of the estimator still holds.
Since the asymptotic results on all test statistics only require this, all the previous
asymptotic results on test statistics are also valid in this case.
depend on normality.
4. Asymptotic normality
5. Tests are asymptotically valid, but are not valid in small samples.
8.3 Case 3
Lagged dependent variables (dynamic models).
An important class of models are dynamic models, where lagged dependent variables have an impact on the current value. A simple version of these models that
captures the important points is
yt
syt
zt
s 1
xt
t
iid 0 20 In
where now xt contains lagged dependent variables. Clearly X and arent independent
anymore, so one cant show unbiasedness. For example, consider
E t 1 xt 0
since xt contains yt
(which is a function of t
1
as an element.
This fact implies that all of the small sample properties such as unbiasedness,
Gauss-Markov theorem, and small sample validity of test statistics do not hold
in this case.
Nevertheless, under the above assumptions, all asymptotic properties continue
to hold, using the same arguments as before.
111
1. limn
2. n
1 2X
QX
d N 0 Q 2
X 0
The most complicated case is that of dynamic models, since the other cases can be
treated as nested in this case. There exist a number of central limit theorems for dependent processes, many of which are fairly technical. We wont enter into details
(see Hamilton, Chapter 7 if youre interested). A main requirement for use of standard
asymptotics for a dependent sequence
st
1 n
zt
n t1
zt zt
s
q
zt
not depend on t
Covariance (weak) stationarity requires that the first and second moments of this
xt
xt
IIN 0 2
112
One can show that the variance of xt depends upon t in this case.
Stationarity prevents the process from trending off to plus or minus infinity, and prevents cyclical behavior which would allow correlations between far removed zt znd zs
to be high. Draw a picture here.
For application of central limit theorems, a useful concept is that of a martingale
difference sequence. This is a sequence zt such that
E zt t
1
0
t Note that
xt
t
is a martingale difference sequence. Hamilton, Proposition 7.8 (pg. 193) gives a central limit theorem for covariance stationary martingale difference sequences.
113
9 Data problems
In this section well consider problems associated with the regressor matrix: collinearity, missing observation and measurement error.
9.1 Collinearity
Collinearity is the existence of linear relationships amongst the regressors. We can
always write
1 x 1 2 x 2
K x K v
where xi is the ith column of the regressor matrix X and v is an n 1 vector. In the
case that there exists collinearity, the variation in v is relatively small, so that there is
an approximately exact linear relation between the regressors.
relative and approximate are imprecise, so its difficult to define when collinearilty
exists.
In the extreme, if there are exact linear relationships (every element of v equal) then
X
K so X
X
K so X
X is not invertible and the OLS estimator is not
yt
1 2 x2t 3 x3t t
x2t
1 2 x3t
114
1 2 1 2 x3t
yt
3 x3t t
1 2 1 2 2 x3t 3 x3t t
1 2 1
2 2 3 x3t
1 2 x3t t
The
s can be consistently estimated, but since the
s define two equations in
three
s the
s cant be consistently estimated (there are multiple values of
Another case where perfect collinearity may be encountered is with models with dummy
0 other-
wise. Similarly, define Gi Ti and Li for Girona, Tarragona and Lleida. One could use
a model such as
yi
1 2 Bi 3 Gi 4 Ti 5 Li xi
i
In this model, Bi Gi Ti Li
variables and the column of ones corresponding to the constant. One must either drop
the constant, or one of the qualitative variables.
115
x W
where x is the first column of X (note: we can interchange the columns of X isf we like,
so theres no loss of generality in considering the first column). Now, the variance of
under the classical assumptions, is
V
X
X
X
X
x
x
x
W
W
x W
W
116
X
X
1
11
x
W W
W 1W
x
x
x
x
In
1
ESSx W
W W
W 1W
x
where by ESSx W we mean the error sum of squares obtained from the regression
W v
Since
R2
1
ESS T SS
we have
ESS
T SS 1
R2
2
T SSx 1 R2x W
We see three factors influence the variance of this coefficient. It will be high if
1. 2 is large
2. There is little variation in x Draw a picture here.
3. There is a strong linear relationship between x and the other regressors, so that
W can explain the movement in x well. In this case, R2x W will be close to 1. As
R2x W
1 V
x
Intuitively, when there are strong linear relations between the regressors, it is difficult to determine the separate influence of the regressors on the dependent variable.
This can be seen by comparing the OLS objective function in the case of no correlation
between regressors with the objective function with correlation between the regressors.
See the figures nocollin.ps (no correlation) and collin.ps (correlation), available on the
web site.
variables is significantly different from zero (e.g., their separate influences arent well
determined).
In summary, the artificial regressions are the best approach if one wants to be careful.
restrictions that have been neglected? Picture illustrating how a restriction can solve
problem of perfect collinearity.
Stochastic restrictions and ridge regression
Supposing that there is no more data or neglected restrictions, one possibility is to
change perspectives, to Bayesian econometrics. One can express prior beliefs regarding the coefficients using stochastic restrictions. A stochastic linear restriction would
be something of the form
r v
where R and r are as in the case of exact linear restrictions, but v is a random vector.
For example, the model could be
X
y
R
N
r
0
2 In
0n
0q
2v Iq
This sort of model isnt in line with the classical interpretation of parameters as constants: according to this interpretation the left hand side of R
r v is constant
but the right is random. This model does fit the Bayesian perspective: we combine
information coming from the model and the data, summarized in
N 0 2 In
119
N r 2v Iq
0 which is the
last piece of information in the specification. How can you estimate using this model?
The solution is to treat the restrictions as artificial data. Write
v
This expresses the degree of belief in the restriction relative to the variability of the
data. Supposing that we specify k then the model
kr
X
kR
kv
is homoscedastic and can be estimated by OLS. Note that this estimator is biased. It
is consistent, however, given that k is a fixed constant, even if the restriction is false
(this is in contrast to the case of false exact restrictions). To see this, note that there
are Q restrictions, where Q is the number of rows of R As n
these Q artificial
observations have no weight in the objective function, so the estimator has the same
limiting objective function as the OLS estimator, and is therefore consistent.
To motivate the use of stochastic restrictions, consider the expectation of the squared
120
length of :
E
E
X
X
X
X
E
X X
X
X
X
Tr X
X
1X
2 K
i 1 i (the trace is the sum of eigenvalues)
max X X
so
E
min X X
mum eigenvalue of X
X
1
more nearly singular, so min X X tends to zero (recall that the determinant is the prod
y
kIK
kv
ridge
X
kIK
X
kIK
X
X k2 IK
X
1
IK
y
X
y
This is the ordinary ridge regression estimator. The ridge regression estimator can be
seen to add k2 IK which is nonsingular, to X
X which is more and more nearly singular
121
0
that is, the coefficients are shrunken toward zero. Also, the estimator tends to
ridge
ridge
so ridge
X
X k2 IK
1
X
y
k2 IK
1
X
y
k2
X
y
0
is at al sensible.
There should be some amount of shrinkage that is in fact a true restriction. The
problem is to determine the k such that the restriction is correct. The interest in
ridge regression centers on the fact that it can be shown that there exists a k such
that MSE ridge
unknown.
ridge as a function of k and chooses the value
The ridge trace method plots ridge
of k that artistically seems appropriate (e.g., where the effect of increasing k dies
off). Draw picture here. This means of choosing k is obviously subjective. This is not
a problem from the Bayesian perspective: the choice of k reflects prior beliefs about
the length of
In summary, the ridge estimator offers some hope, but it is impossible to guarantee
that it will outperform the OLS estimator. Collinearity is a fact of life in econometrics,
and there is no clear solution to the problem.
necessarily be correct?
Measurement error of the dependent variable. Suppose the true model is
$$ y_t^* = x_t'\beta + \varepsilon_t, $$
but $y_t^*$ is not observed; instead we observe $y_t = y_t^* + v_t$, where $v_t$ is $iid(0, \sigma_v^2)$ and independent of $\varepsilon_t$ and $x_t$. Then $y_t^* = y_t - v_t$, so
$$ y_t = x_t'\beta + \varepsilon_t + v_t = x_t'\beta + \omega_t, $$
where $\omega_t$ is $iid(0, \sigma_\varepsilon^2 + \sigma_v^2)$, and the model can be estimated by OLS. This type of measurement error isn't a problem, then.
Measurement error of the regressors. The situation is not so good in this case. The true model is
$$ y_t = \tilde{x}_t'\beta + \varepsilon_t, $$
but $\tilde{x}_t$ is not observed; what is observed is $x_t = \tilde{x}_t + v_t$, where $v_t$ is $iid(0, \Sigma_v)$, with $\Sigma_v$ a $K\times K$ matrix. Again assume that $v_t$ is independent of $\varepsilon_t$, and that the model $y_t = \tilde{x}_t'\beta + \varepsilon_t$ satisfies the classical assumptions. Substituting $\tilde{x}_t = x_t - v_t$,
$$ y_t = (x_t - v_t)'\beta + \varepsilon_t = x_t'\beta - v_t'\beta + \varepsilon_t = x_t'\beta + \omega_t. $$
The problem is that now there is a correlation between $x_t$ and $\omega_t$, since
$$ E(x_t\omega_t) = E\left[(\tilde{x}_t + v_t)\left(-v_t'\beta + \varepsilon_t\right)\right] = -\Sigma_v\beta, $$
where $\Sigma_v = E(v_t v_t')$. Because of this correlation, the OLS estimator is biased and inconsistent, just as in the case of autocorrelated errors with lagged dependent variables. In matrix notation, write the estimated model as
$$ y = X\beta + \omega. $$
We have that
$$ \hat{\beta} = \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'y}{n}\right), $$
and
$$ \operatorname{plim}\frac{X'X}{n} = \operatorname{plim}\frac{(\tilde{X}' + V')(\tilde{X} + V)}{n} = Q_{\tilde{X}} + \Sigma_v, $$
where $\Sigma_v = \lim E\left[\frac{1}{n}\sum_{t=1}^n v_t v_t'\right]$. Likewise,
$$ \operatorname{plim}\frac{X'y}{n} = \operatorname{plim}\frac{(\tilde{X}' + V')(\tilde{X}\beta + \varepsilon)}{n} = Q_{\tilde{X}}\beta, $$
so
$$ \operatorname{plim}\hat{\beta} = \left(Q_{\tilde{X}} + \Sigma_v\right)^{-1}Q_{\tilde{X}}\beta. $$
So we see that the least squares estimator is inconsistent when the regressors are measured with error.
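A small simulation makes the attenuation visible. The sketch below (Python/NumPy; the true parameter, the variances and the sample size are invented for the example) compares OLS using the correctly measured regressor with OLS using the error-ridden regressor; the latter stays close to the plim $(Q_{\tilde X} + \Sigma_v)^{-1}Q_{\tilde X}\beta$ rather than to $\beta$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 100_000, 2.0
x_true = rng.normal(size=n)                  # true regressor, variance 1
y = beta * x_true + rng.normal(size=n)

sigma2_v = 0.5
x_obs = x_true + np.sqrt(sigma2_v) * rng.normal(size=n)   # measured with error

b_true = (x_true @ y) / (x_true @ x_true)    # OLS with the true regressor
b_obs = (x_obs @ y) / (x_obs @ x_obs)        # OLS with the mismeasured regressor

print("OLS, no measurement error:  ", round(b_true, 3))
print("OLS, with measurement error:", round(b_obs, 3))
print("theoretical plim:           ", round(beta * 1.0 / (1.0 + sigma2_v), 3))
```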
Missing observations on the dependent variable. The model is
$$ y = X\beta + \varepsilon, $$
or, partitioning into complete and incomplete observations,
$$ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}, $$
where $y_2$ is not observed. One obvious possibility is to drop the incomplete observations and estimate using only
$$ y_1 = X_1\beta + \varepsilon_1. $$
Since these observations satisfy the classical assumptions, one could estimate $\beta$ by OLS.
The question remains whether or not one could somehow replace the unobserved $y_2$ by a predictor, and improve over OLS in some sense. Let $\hat{y}_2$ be the predictor of $y_2$. Now
$$ \hat{\beta} = \left(\begin{bmatrix} X_1' & X_2' \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\right)^{-1}\begin{bmatrix} X_1' & X_2' \end{bmatrix}\begin{bmatrix} y_1 \\ \hat{y}_2 \end{bmatrix} = \left(X_1'X_1 + X_2'X_2\right)^{-1}\left(X_1'y_1 + X_2'\hat{y}_2\right). $$
Recall that the OLS first order conditions are $X'X\hat{\beta} = X'y$, so if we regressed using only the first (complete) observations we would have
$$ X_1'X_1\hat{\beta}_1 = X_1'y_1. $$
Likewise, an OLS regression using only the second (filled in) observations would give
$$ X_2'X_2\hat{\beta}_2 = X_2'\hat{y}_2. $$
Substituting these into the equation for the overall combined estimator gives
$$ \hat{\beta} = \left(X_1'X_1 + X_2'X_2\right)^{-1}\left(X_1'X_1\hat{\beta}_1 + X_2'X_2\hat{\beta}_2\right) = \left(X_1'X_1 + X_2'X_2\right)^{-1}X_1'X_1\hat{\beta}_1 + \left(X_1'X_1 + X_2'X_2\right)^{-1}X_2'X_2\hat{\beta}_2 \equiv A\hat{\beta}_1 + (I_K - A)\hat{\beta}_2, $$
where
$$ A \equiv \left(X_1'X_1 + X_2'X_2\right)^{-1}X_1'X_1, $$
and the weight on $\hat{\beta}_2$ is $\left(X_1'X_1 + X_2'X_2\right)^{-1}X_2'X_2$; here we use
$$ \left(X_1'X_1 + X_2'X_2\right)^{-1}X_2'X_2 = \left(X_1'X_1 + X_2'X_2\right)^{-1}\left(X_1'X_1 + X_2'X_2 - X_1'X_1\right) = I_K - \left(X_1'X_1 + X_2'X_2\right)^{-1}X_1'X_1 = I_K - A. $$
Now,
$$ E(\hat{\beta}) = A\beta + (I_K - A)E\left(\hat{\beta}_2\right). $$
The conclusion is that the filled-in observations alone would need to define an unbiased estimator. This will be the case only if
$$ \hat{y}_2 = X_2\beta + \hat{\varepsilon}_2, $$
where $\hat{\varepsilon}_2$ has mean zero. Clearly, it is difficult to satisfy this condition without knowledge of $\beta$; if it is not satisfied, the combined estimator is a biased estimator.
Exercise 15 Formally prove this last statement.
One possibility that has been suggested (see Greene, page 275) is to estimate $\beta$ in a first round using only the complete observations,
$$ \hat{\beta}_1 = \left(X_1'X_1\right)^{-1}X_1'y_1, $$
then use this estimate to predict $y_2$:
$$ \hat{y}_2 = X_2\hat{\beta}_1 = X_2\left(X_1'X_1\right)^{-1}X_1'y_1. $$
Now, the overall estimate is a weighted average of $\hat{\beta}_1$ and $\hat{\beta}_2$, just as above, but we have
$$ \hat{\beta}_2 = \left(X_2'X_2\right)^{-1}X_2'\hat{y}_2 = \left(X_2'X_2\right)^{-1}X_2'X_2\hat{\beta}_1 = \hat{\beta}_1. $$
This shows that this suggestion is completely empty of content: the final estimator is the same as the OLS estimator using only the complete observations.
yt
yt
yt ifyt
The difference in this case is that the missing values are not random: they are
correlated with the xt Consider the case
x
y
with V
y1
y2
X1
X2
1
but we assume now that each row of $X_2$ has an unobserved component(s). Again, one could just estimate using the complete observations, but it may seem frustrating to have to drop observations simply because of a single missing variable. In general, if one substitutes ad hoc values for the missing components, one is in the situation of errors of observation of the regressors. As before, this means that the OLS estimator is biased when the filled-in $\tilde{X}_2$ is used instead of $X_2$. Consistency is salvaged, however, as long as the number of missing observations doesn't increase with $n$.
Including observations that have missing values replaced by ad hoc values can be interpreted as introducing false stochastic restrictions. In general, this introduces bias. It is difficult to determine whether MSE increases or decreases. Monte Carlo studies suggest that it is dangerous to simply substitute the mean, for example.
In the case that there is only one regressor other than the constant, substitution of $\bar{x}$ for the missing $x_t$ does not lead to bias. This is a special case that doesn't hold for $K > 2$.
Aw1 1 w2 2 qq e
ln c
where 0
0 1 ln w1 2 ln w2 q ln q
0 1
0 2
0 3
0 when q
0 Homogeneity of
1
0 1 w 1 2 w 2 q q
132
is a K
xt
x zt
yt
xt
t
P2
1 P
P
2 P
free parameters: one for each independent effect that we wish to model.
Suppose that the model is
$$ y = g(x) + \varepsilon. $$
A second-order Taylor's series expansion (with remainder) about the point $x = 0$ is
$$ g(x) = g(0) + x'D_x g(0) + \frac{x'D_x^2 g(0)\,x}{2} + R, $$
where $R$ is the remainder. Use the approximation, which simply drops the remainder term, as an approximation to $g(x)$:
$$ g(x) \simeq g_K(x) = g(0) + x'D_x g(0) + \frac{x'D_x^2 g(0)\,x}{2}. $$
As $x \rightarrow 0$, the approximation becomes more and more exact, in the sense that $g_K(x) \rightarrow g(x)$, $D_x g_K(x) \rightarrow D_x g(x)$ and $D_x^2 g_K(x) \rightarrow D_x^2 g(x)$. For $x = 0$, the approximation is exact, up to the second order. The idea behind many flexible functional forms is to note that $g(0)$, $D_x g(0)$ and $D_x^2 g(0)$ are all constants. If we treat them as parameters, the approximation will have exactly enough free parameters to approximate the function, which is of unknown form, exactly up to second order at the point $x = 0$. The model is
$$ g_K(x) = \alpha + x'\beta + \frac{x'\Gamma x}{2}, $$
so the regression model to fit is
$$ y = \alpha + x'\beta + \frac{x'\Gamma x}{2} + \varepsilon. $$
Is $\operatorname{plim}\hat{\alpha} = g(0)$? Is $\operatorname{plim}\hat{\beta} = D_x g(0)$? Is $\operatorname{plim}\hat{\Gamma} = D_x^2 g(0)$?
The answer is no, in general. The reason is that if we treat the true values of the parameters as these derivatives, then $\varepsilon$ is forced to play the part of the remainder term, which is a function of $x$, so that $x$ and $\varepsilon$ are correlated in this case. As before, this leads to biased and inconsistent estimation.
The conclusion is that "flexible functional forms" aren't really flexible in a useful statistical sense, in that neither the function itself nor its derivatives are consistently estimated, unless the function belongs to the parametric family of the specified functional form. In order to lead to consistent inferences, the regression model must be correctly specified.
ln c
ln
z
z
ln z
ln z
x
1 2x
x
x
ln c
ln z
c z
z c
which is the elasticity of c with respect to z This is a convenient feature of the translog
model. Note that at the means of the conditioning variables, z, x
y
x
0 so
z z
Consider the cost function $c(w, q)$, where $w$ is a vector of input prices and $q$ is output. We could add other variables by extending $q$ in the obvious manner, but this is suppressed for simplicity. By Shephard's lemma, the conditional factor demands are
$$ x = \frac{\partial c(w, q)}{\partial w}, $$
and the cost shares are
$$ s = \frac{wx}{c} = \frac{\partial c(w, q)}{\partial w}\,\frac{w}{c}, $$
which is simply the vector of elasticities of cost with respect to input prices. If the cost
function is modeled using a translog function, we have
x
z
1 2
ln c
x z
11 12
12
22
x
x
z
1 2x
11 x x
12 z 1 2z2 22
where x
ln w
w and z
ln q q and
11 12
11
12 22
13
12
22
23
33
11
12
x
Therefore, the share equations and the cost equation have parameters in common. By
pooling the equations together and imposing the (true) restriction that the parameters
of the equations be the same, we can gain efficiency.
To illustrate in more detail, consider the case of two inputs, so
$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}. $$
In this case the translog model is
$$ \ln c = \alpha + \beta_1 x_1 + \beta_2 x_2 + \delta z + \frac{\gamma_{11}}{2}x_1^2 + \frac{\gamma_{22}}{2}x_2^2 + \frac{\gamma_{33}}{2}z^2 + \gamma_{12}x_1 x_2 + \gamma_{13}x_1 z + \gamma_{23}x_2 z. $$
The two cost shares of the inputs are the derivatives of $\ln c$ with respect to $x_1$ and $x_2$:
$$ s_1 = \beta_1 + \gamma_{11}x_1 + \gamma_{12}x_2 + \gamma_{13}z $$
$$ s_2 = \beta_2 + \gamma_{12}x_1 + \gamma_{22}x_2 + \gamma_{23}z. $$
Note that the share equations and the cost equation have parameters in common.
One can do a pooled estimation of the three equations at once, imposing that the parameters are the same. In this way we're using more observations and therefore more information, which will lead to improved efficiency. Note that this does assume that
the cost equation is correctly specified (i.e., not an approximation), since otherwise the
derivatives would not be the true derivatives of the log cost function, and would then
be misspecified for the shares. To pool the equations, write the model in matrix form
138
ln c
s1
1 x 1 x2 z
0 1
s2
x22
2
0 x1 0
0 0
x21
2
0 0
x2
x2 0
x1 x2 x1 z x 2 z
13
12
3
33
22
11
x1
z2
2
2
1
23
This is one observation on the three equations. With the appropriate notation, a
single observation can be written as
Xt t
yt
The overall model would stack n observations on the three equations for a total of 3n
observations:
y1
y2
..
.
yn
X2
..
.
Xn
X1
2
..
.
Next we need to consider the errors. For observation $t$ the errors can be placed in a vector
$$ \varepsilon_t = \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \end{bmatrix}. $$
First consider the covariance matrix of this vector: the shares are certainly correlated since they must sum to one. (In fact, with 2 shares the variances are equal and the covariance is -1 times the variance. General notation is used to allow easy extension to the case of more than 2 inputs.) Also, it's likely that the shares and the cost equation have different variances. Supposing that the model is covariance stationary, the variance of $\varepsilon_t$ won't depend upon $t$:
$$ \operatorname{Var}\varepsilon_t = \Sigma_0 = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \cdot & \sigma_{22} & \sigma_{23} \\ \cdot & \cdot & \sigma_{33} \end{bmatrix}. $$
Note that this matrix is singular, since the shares sum to 1. Assuming that there is no autocorrelation, the overall covariance matrix has the seemingly unrelated regressions (SUR) structure:
$$ \operatorname{Var}\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \Sigma = \begin{bmatrix} \Sigma_0 & 0 & \cdots & 0 \\ 0 & \Sigma_0 & & \vdots \\ \vdots & & \ddots & \\ 0 & \cdots & & \Sigma_0 \end{bmatrix} = I_n \otimes \Sigma_0, $$
where the symbol $\otimes$ indicates the Kronecker product. The Kronecker product of the matrices $A$ ($p\times q$) and $B$ is
$$ A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1q}B \\ a_{21}B & & & \vdots \\ \vdots & & \ddots & \\ a_{p1}B & \cdots & & a_{pq}B \end{bmatrix}. $$
The error covariance can be estimated from the residuals as
$$ \hat{\Sigma}_0 = \frac{1}{n}\sum_{t=1}^n\hat{\varepsilon}_t\hat{\varepsilon}_t'. $$
3. Next we need to account for the singularity of $\hat{\Sigma}_0$. It can be shown that $\hat{\Sigma}_0$ will be singular when the shares sum to one, so FGLS won't work. The solution is to
drop one of the share equations, for example the second. The model becomes
ln c
1 x 1 x2 z
s1
0 1
x22
2
x21
2
z2
2
0 x1 0
x2
12
13
33
22
11
x1 x2 x1 z x 2 z
2
1
23
or in matrix notation for the observation:
Xt t
yt
y1
y2
..
.
X2
..
.
X
142
Xn
yn
2
..
.
1
X1
Var
In
0
In
P0
and the Cholesky factorization of the overall covariance matrix of the 2 equation
model, which can be calculated as
Chol
P0
In
5. Finally the FGLS estimator can be calculated by applying OLS to the transformed model
Py
PX
1
X
143
P0 Xt P
P0 yy
i j
0 j
1 2 3
i 1
These are linear parameter restrictions, so they are easy to impose and will improve efficiency if they are true.
3. The estimation procedure outlined above can be iterated. That is, estimate FGLS
as above, then re-estimate $\Sigma_0$ using errors calculated as
$$ \hat{\varepsilon} = y - X\hat{\theta}_{FGLS}. $$
These might be expected to lead to a better estimate than the estimator based on
OLS since FGLS is asymptotically more efficient. Then re-estimate using the
new estimated error covariance. It can be shown that if this is repeated until the
estimates dont change (i.e., iterated to convergence) then the resulting estimator
is the MLE. At any rate, the asymptotic properties of the iterated and uniterated
estimators are the same, since both are based upon a consistent estimator of the
error covariance.
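The following sketch (Python/NumPy) shows the mechanics of one FGLS iteration for a small SUR system with the $I_n \otimes \Sigma_0$ covariance structure. The two-equation system, the simulated data and all names are invented for illustration; it is not the translog cost system itself, but the steps (equation-by-equation OLS, estimation of $\Sigma_0$, transformation by a Cholesky factor, OLS on the transformed data) are the same.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=(n, 2))
Sigma0 = np.array([[1.0, 0.6], [0.6, 0.5]])          # cross-equation error covariance
errs = rng.multivariate_normal(np.zeros(2), Sigma0, size=n)
y1 = 1.0 + 0.5 * x[:, 0] + errs[:, 0]
y2 = -0.5 + 1.5 * x[:, 1] + errs[:, 1]

# Stack the two equations: each observation t contributes 2 rows
X1 = np.column_stack([np.ones(n), x[:, 0]])          # regressors of equation 1
X2 = np.column_stack([np.ones(n), x[:, 1]])          # regressors of equation 2
Xbig = np.zeros((2 * n, 4))
Xbig[0::2, :2] = X1
Xbig[1::2, 2:] = X2
ybig = np.empty(2 * n)
ybig[0::2], ybig[1::2] = y1, y2

# Step 1: OLS (block-diagonal X, so this is equation-by-equation OLS) and residuals
b_ols = np.linalg.lstsq(Xbig, ybig, rcond=None)[0]
e = (ybig - Xbig @ b_ols).reshape(n, 2)
Sigma0_hat = e.T @ e / n                             # estimate of Sigma_0

# Step 2: FGLS, transforming each 2x1 block by P0 with P0'P0 = Sigma0_hat^{-1}
P0 = np.linalg.cholesky(np.linalg.inv(Sigma0_hat)).T
Xw = (P0 @ Xbig.reshape(n, 2, 4)).reshape(2 * n, 4)
yw = (P0 @ ybig.reshape(n, 2, 1)).reshape(2 * n)
b_fgls = np.linalg.lstsq(Xw, yw, rcond=None)[0]
print("OLS: ", b_ols.round(3))
print("FGLS:", b_fgls.round(3))
```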
yt
yt
xt
0
$$ M_1: y = X\beta + \varepsilon, \qquad \varepsilon \sim iid(0, \sigma_\varepsilon^2) $$
$$ M_2: y = Z\gamma + \eta, \qquad \eta \sim iid(0, \sigma_\eta^2). $$
We wish to test hypotheses of the form: $H_0$: model $M_i$ is correctly specified versus $H_A$: model $M_i$ is misspecified, $i = 1, 2$.
One could account for non-iid errors, but we'll suppress this for simplicity.
There are a number of ways to proceed. We'll consider the J test, proposed by Davidson and MacKinnon, Econometrica (1981). The idea is to artificially nest the two models, e.g.,
$$ y = (1 - \alpha)X\beta + \alpha(Z\gamma) + \omega. $$
If the first model is correctly specified, then the true value of $\alpha$ is zero. On the other hand, if the second model is correctly specified then $\alpha = 1$.
The problem is that this model is not identified in general. For example, if the models share some regressors, as in
$$ M_1: y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t $$
$$ M_2: y_t = \gamma_1 + \gamma_2 x_{2t} + \gamma_3 x_{4t} + \eta_t, $$
then the composite model is
$$ y_t = (1-\alpha)\beta_1 + (1-\alpha)\beta_2 x_{2t} + (1-\alpha)\beta_3 x_{3t} + \alpha\gamma_1 + \alpha\gamma_2 x_{2t} + \alpha\gamma_3 x_{4t} + \omega_t, $$
and combining terms we get
$$ y_t = \left[(1-\alpha)\beta_1 + \alpha\gamma_1\right] + \left[(1-\alpha)\beta_2 + \alpha\gamma_2\right]x_{2t} + (1-\alpha)\beta_3 x_{3t} + \alpha\gamma_3 x_{4t} + \omega_t = \delta_1 + \delta_2 x_{2t} + \delta_3 x_{3t} + \delta_4 x_{4t} + \omega_t. $$
The four $\delta$'s are consistently estimable, but $\alpha$ is not, since we have four equations in 7 unknowns, so one can't test the hypothesis that $\alpha = 0$.
The J test deals with this by replacing $Z\gamma$ with the fitted values from model $M_2$. Then estimate the model
$$ y = (1-\alpha)X\beta + \alpha\hat{y} + \omega, $$
where
$$ \hat{y} = Z(Z'Z)^{-1}Z'y = P_Z y. $$
In this model $\alpha$ is consistently estimable, and one can show that, under the hypothesis that the first model is correct, the ordinary $t$-statistic for $\alpha = 0$ is asymptotically normal:
$$ t = \frac{\hat{\alpha}}{\hat{\sigma}_{\hat{\alpha}}} \stackrel{a}{\sim} N(0, 1). $$
If the second model is correctly specified, then the test statistic tends to infinity, since $\hat{\alpha}$ tends in probability to 1, while its estimated standard error tends to zero. Thus the test will
always reject the false null model, asymptotically, since the statistic will eventually exceed any critical value with probability one.
We can reverse the roles of the models, testing the second against the first.
It may be the case that neither model is correctly specified. In this case, the test will still reject the null hypothesis, asymptotically, if we use critical values from the $N(0,1)$ distribution, since as long as $\hat{\alpha}$ tends to something different from zero, $|t| \stackrel{p}{\rightarrow} \infty$. Of course, when we switch the roles of the models, the other will also be rejected asymptotically.
In summary, there are 4 possible outcomes when we test two models, each
against the other. Both may be rejected, neither may be rejected, or one of the
test is simple to
apply when both models are linear in the parameters. The P-test is similar, but
tions.
Monte-Carlo evidence shows that these tests often over-reject a correctly specified model. Can use bootstrap critical values to get better-performing tests.
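A minimal sketch of the J test follows (Python/NumPy; the data generating process and variable names are invented, and in practice one would also bootstrap the critical values, as just noted). Model $M_1$ uses $x_2$ and $x_3$, model $M_2$ uses $x_2$ and $x_4$; testing $M_1$ against $M_2$ amounts to adding the fitted values from $M_2$ to $M_1$ and looking at the $t$-statistic on that extra regressor.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and conventional standard errors."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s2 = e @ e / (X.shape[0] - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b, se

rng = np.random.default_rng(3)
n = 400
x2, x3, x4 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x2 + 0.8 * x3 + rng.normal(size=n)   # data generated by M1

X1 = np.column_stack([np.ones(n), x2, x3])           # M1 regressors
X2 = np.column_stack([np.ones(n), x2, x4])           # M2 regressors

# J test of M1 against M2: augment M1 with the fitted values from M2
yhat2 = X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]
b, se = ols(np.column_stack([X1, yhat2]), y)
print("t-statistic for alpha = 0 (M1 vs M2):", round(b[-1] / se[-1], 3))
```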
where, for purposes of estimation we can treat $X$ as fixed. This means that when estimating $\beta$ we condition on $X$. When analyzing dynamic models, we're not interested in conditioning on $X$, as we saw in the section on stochastic regressors. Nevertheless, the asymptotic results obtained there still allowed us to treat the regressors much as if they were fixed. Simultaneous equations are a different matter. An example is a simple supply/demand system:
$$ \text{Demand: } q_t = \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} $$
$$ \text{Supply: } q_t = \beta_1 + \beta_2 p_t + \varepsilon_{2t} $$
$$ E\left(\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}\begin{bmatrix} \varepsilon_{1t} & \varepsilon_{2t} \end{bmatrix}\right) = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \cdot & \sigma_{22} \end{bmatrix}, \quad \forall t. $$
The presumption is that $q_t$ and $p_t$ are jointly determined at the same time by the intersection of these equations.
Solving for $p_t$ by equating demand and supply:
$$ \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} = \beta_1 + \beta_2 p_t + \varepsilon_{2t} $$
$$ \beta_2 p_t - \alpha_2 p_t = \alpha_1 - \beta_1 + \alpha_3 y_t + \varepsilon_{1t} - \varepsilon_{2t} $$
$$ p_t = \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}. $$
Now consider whether $p_t$ is uncorrelated with $\varepsilon_{1t}$:
$$ E(p_t\varepsilon_{1t}) = E\left\{\left[\frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}\right]\varepsilon_{1t}\right\} = \frac{\sigma_{11} - \sigma_{12}}{\beta_2 - \alpha_2}. $$
Because of this correlation, OLS estimation of the demand equation will be biased and
inconsistent. The same applies to the supply equation, for the same reason.
In this model, $q_t$ and $p_t$ are the endogenous variables (endogs), that are determined within the system. $y_t$ is an exogenous variable (exog). These concepts are a bit tricky, and we'll return to them in a minute. First, some notation. Suppose we group together current endogs in the vector $Y_t$. If there are $G$ endogs, $Y_t$ is $G \times 1$. Group current and lagged exogs, as well as lagged endogs, in the vector $X_t$, which is $K \times 1$. Stack the errors of the $G$ equations into the error vector $E_t$. The model, with additional assumptions, can be written as
$$ Y_t'\Gamma = X_t'B + E_t' $$
$$ E_t \sim N(0, \Sigma), \ \forall t $$
$$ E(E_t E_s') = 0, \ t \neq s. $$
Stacking all $n$ observations, the model can be written as
$$ Y\Gamma = XB + E $$
$$ E(X'E) = 0_{K\times G} $$
$$ \operatorname{vec}(E) \sim N(0, \Psi), $$
where
$$ Y = \begin{bmatrix} Y_1' \\ Y_2' \\ \vdots \\ Y_n' \end{bmatrix},\qquad X = \begin{bmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{bmatrix},\qquad E = \begin{bmatrix} E_1' \\ E_2' \\ \vdots \\ E_n' \end{bmatrix}. $$
$Y$ is $n\times G$, $X$ is $n\times K$, and $E$ is $n\times G$. The error covariance is
$$ \Psi = \begin{bmatrix} \sigma_{11}I_n & \sigma_{12}I_n & \cdots & \sigma_{1G}I_n \\ & \sigma_{22}I_n & & \vdots \\ & & \ddots & \\ & & & \sigma_{GG}I_n \end{bmatrix} = \Sigma \otimes I_n. $$
$X$ may contain lagged endogenous and exogenous variables. These variables are predetermined.
We need to define what is meant by endogenous and exogenous when classifying the current period variables.
11.2 Exogeneity
The model defines a data generating process. The model involves two sets of variables, $Y_t$ and $X_t$, as well as a parameter vector
$$ \theta = \left[\operatorname{vec}(\Gamma)'\ \operatorname{vec}(B)'\ \operatorname{vec}^*(\Sigma)'\right]', $$
where $\operatorname{vec}^*(\Sigma)$ collects the unique elements of $\Sigma$. In general, without additional restrictions, $\theta$ is a $G^2 + GK + (G^2 - G)/2 + G$ dimensional vector. This is the parameter vector that we're interested in estimating.
In principle, there exists a joint density function for $Y_t$ and $X_t$, which depends on a parameter vector $\phi$ and on the information set $I_t$ (which includes lagged $Y_t$'s and lagged $X_t$'s): $f_t(Y_t, X_t|\phi, I_t)$. This joint density can be factored into the density of $Y_t$ conditional on $X_t$ times the marginal density of $X_t$:
$$ f_t(Y_t, X_t|\phi, I_t) = f_t(Y_t|X_t, \phi, I_t)\, f_t(X_t|\phi, I_t). $$
This is a general factorization, but it may very well be the case that not all parameters in $\phi$ affect both factors. So use $\phi_1$ to indicate elements of $\phi$ that enter into the conditional density and write $\phi_2$ for parameters that enter into the marginal. In general, $\phi_1$ and $\phi_2$ may share elements, so we write
$$ f_t(Y_t, X_t|\phi, I_t) = f_t(Y_t|X_t, \phi_1, I_t)\, f_t(X_t|\phi_2, I_t). $$
Recall that the model is
$$ Y_t'\Gamma = X_t'B + E_t', \qquad E_t \sim N(0, \Sigma), \ \forall t, \qquad E(E_t E_s') = 0, \ t\neq s. $$
Normality and lack of correlation over time imply that the observations are independent of one another, so we can write the log-likelihood function as the sum of likelihood contributions of each observation:
$$ \ln L(Y|\phi, I_t) = \sum_{t=1}^n \ln f_t(Y_t, X_t|\phi, I_t) = \sum_{t=1}^n \ln\left[f_t(Y_t|X_t, \phi_1, I_t)\, f_t(X_t|\phi_2, I_t)\right] = \sum_{t=1}^n \ln f_t(Y_t|X_t, \phi_1, I_t) + \sum_{t=1}^n \ln f_t(X_t|\phi_2, I_t). $$
Definition 17 (Weak Exogeneity) $X_t$ is weakly exogenous for $\theta$ (the original parameter vector) if there is a mapping from $\phi$ to $\theta$ that is invariant to $\phi_2$. More formally, for an arbitrary $(\phi_1, \phi_2)$, $\theta(\phi) = \theta(\phi_1)$.
This implies that $\phi_1$ and $\phi_2$ cannot share elements if $X_t$ is weakly exogenous, since $\phi_1$ would change as $\phi_2$ changes, which prevents consideration of arbitrary combinations of $(\phi_1, \phi_2)$.
Supposing that $X_t$ is weakly exogenous, then the MLE of $\phi_1$ using the joint density is the same as the MLE using only the conditional density
$$ \ln L(Y|X, \phi_1, I_t) = \sum_{t=1}^n \ln f_t(Y_t|X_t, \phi_1, I_t), $$
since the conditional likelihood doesn't depend on $\phi_2$. In other words, the joint and conditional log-likelihoods maximize at the same value of $\phi_1$. If $X_t$ is not weakly exogenous, the joint and conditional likelihoods can lead to different inference. The joint MLE is valid, but the conditional MLE is not.
In sum, we require the variables in $X_t$ to be weakly exogenous if we are to be able to treat them as fixed in estimation. Lagged $Y_t$ satisfy the definition, since they are in the conditioning information set, e.g., $Y_{t-1} \in I_t$. Lagged $Y_t$ aren't exogenous in the normal usage of the word, since their values are determined within the model, just earlier on. Weakly exogenous variables include exogenous (in the normal sense) variables as well as all predetermined variables.
The model is
$$ Y_t'\Gamma = X_t'B + E_t', \qquad V(E_t) = \Sigma. $$
Assuming that $\Gamma$ is nonsingular, postmultiplying by $\Gamma^{-1}$ gives
$$ Y_t' = X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} = X_t'\Pi + V_t'. $$
Now only one current period endog appears in each equation. This is the reduced form.
Definition 19 (Reduced form) An equation is in reduced form if only one current period endog is included.
An example is our supply/demand system. The reduced form for quantity is obtained by solving the supply equation for price and substituting into demand:
$$ q_t = \alpha_1 + \alpha_2\left(\frac{q_t - \beta_1 - \varepsilon_{2t}}{\beta_2}\right) + \alpha_3 y_t + \varepsilon_{1t} $$
$$ \beta_2 q_t - \alpha_2 q_t = \beta_2\alpha_1 - \alpha_2\left(\beta_1 + \varepsilon_{2t}\right) + \beta_2\alpha_3 y_t + \beta_2\varepsilon_{1t} $$
$$ q_t = \frac{\beta_2\alpha_1 - \alpha_2\beta_1}{\beta_2 - \alpha_2} + \frac{\beta_2\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2} \equiv \pi_{11} + \pi_{21}y_t + V_{1t}. $$
Similarly, the reduced form for price is obtained by equating demand and supply and solving for $p_t$:
$$ \beta_1 + \beta_2 p_t + \varepsilon_{2t} = \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} $$
$$ \beta_2 p_t - \alpha_2 p_t = \alpha_1 - \beta_1 + \alpha_3 y_t + \varepsilon_{1t} - \varepsilon_{2t} $$
$$ p_t = \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \equiv \pi_{12} + \pi_{22}y_t + V_{2t}. $$
The interesting thing about the rf is that the equations individually satisfy the classical assumptions, since $y_t$ is uncorrelated with $\varepsilon_{1t}$ and $\varepsilon_{2t}$ by assumption, and therefore $E(y_t V_{it}) = 0$, $i = 1, 2$. The rf errors are
$$ \begin{bmatrix} V_{1t} \\ V_{2t} \end{bmatrix} = \begin{bmatrix} \dfrac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2} \\[2ex] \dfrac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \end{bmatrix}. $$
The variance of $V_{1t}$ is
$$ V(V_{1t}) = E\left[\left(\frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2}\right)^2\right] = \frac{\beta_2^2\sigma_{11} - 2\alpha_2\beta_2\sigma_{12} + \alpha_2^2\sigma_{22}}{(\beta_2 - \alpha_2)^2}. $$
The variance of the second rf error is
$$ V(V_{2t}) = E\left[\left(\frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}\right)^2\right] = \frac{\sigma_{11} - 2\sigma_{12} + \sigma_{22}}{(\beta_2 - \alpha_2)^2}, $$
and the contemporaneous covariance of the errors across equations is
$$ E(V_{1t}V_{2t}) = E\left[\left(\frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2}\right)\left(\frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}\right)\right] = \frac{\beta_2\sigma_{11} - (\beta_2 + \alpha_2)\sigma_{12} + \alpha_2\sigma_{22}}{(\beta_2 - \alpha_2)^2}. $$
More generally, for the rf
$$ Y_t' = X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} = X_t'\Pi + V_t', $$
we have that
$$ V_t' = E_t'\Gamma^{-1}, \qquad V_t \sim N\left(0, \left(\Gamma^{-1}\right)'\Sigma\Gamma^{-1}\right), \ \forall t, $$
and that the $V_t$ are timewise independent (note that this wouldn't be the case if the $E_t$ were autocorrelated).
11.4 IV estimation
The simultaneous equations model is
$$ Y\Gamma = XB + E. $$
Considering the first equation (this is without loss of generality, since we can always reorder the equations) we can partition the $Y$ matrix as
$$ Y = \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix}, $$
where $y$ is the first column, $Y_1$ are the other endogenous variables that enter the first equation, and $Y_2$ are the endogs that are excluded from this equation. Similarly, partition $X$ as
$$ X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}, $$
where $X_1$ are the included exogs and $X_2$ are the excluded exogs. Finally, partition the error matrix as
$$ E = \begin{bmatrix} \varepsilon & E_{12} \end{bmatrix}. $$
Assume that $\Gamma$ has ones on the main diagonal. These are normalization restrictions that simply scale the remaining coefficients on each equation, and which scale the variances of the error terms.
Given this scaling and our partitioning, the coefficient matrices can be written as
$$ \Gamma = \begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \end{bmatrix}, \qquad B = \begin{bmatrix} \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix}. $$
With this, the first equation can be written as
$$ y = Y_1\gamma_1 + X_1\beta_1 + \varepsilon = Z\delta + \varepsilon, $$
where $Z = \begin{bmatrix} Y_1 & X_1 \end{bmatrix}$.
The problem, as we've seen, is that $Z$ is correlated with $\varepsilon$, since $Y_1$ is formed of endogs.
Now, let's consider the general problem of a linear regression model with correlation between regressors and the error term:
$$ y = X\beta + \varepsilon, \qquad \varepsilon \sim iid(0, I_n\sigma^2), \qquad E(X'\varepsilon) \neq 0. $$
The present case of a structural equation from a system of equations fits into this notation, but so do other problems, such as measurement error or lagged dependent variables with autocorrelated errors. Consider some matrix $W$ which is formed of variables uncorrelated with $\varepsilon$. This matrix defines a projection matrix
$$ P_W = W(W'W)^{-1}W', $$
so that anything that is projected onto the space spanned by $W$ will be uncorrelated with $\varepsilon$, by the definition of $W$. Transforming the model with this projection matrix we get
$$ P_W y = P_W X\beta + P_W\varepsilon, $$
or
$$ y^* = X^*\beta + \varepsilon^*. $$
Now we have that $\varepsilon^*$ and $X^*$ are uncorrelated, since this is simply
$$ E(X^{*\prime}\varepsilon^*) = E(X'P_W'P_W\varepsilon) = E(X'P_W\varepsilon), $$
and
$$ P_W X = W(W'W)^{-1}W'X $$
is the fitted value from a regression of $X$ on $W$. This is a linear combination of the columns of $W$, so it must be uncorrelated with $\varepsilon$. This implies that applying OLS to the model
$$ y^* = X^*\beta + \varepsilon^* $$
will lead to a consistent estimator, given a few more assumptions. This is the generalized instrumental variables estimator. $W$ is known as the matrix of instruments. The estimator is
$$ \hat{\beta}_{IV} = (X'P_W X)^{-1}X'P_W y. $$
From this we can write
$$ \hat{\beta}_{IV} = (X'P_W X)^{-1}X'P_W(X\beta + \varepsilon) = \beta + (X'P_W X)^{-1}X'P_W\varepsilon, $$
so
$$ \hat{\beta}_{IV} - \beta = (X'P_W X)^{-1}X'P_W\varepsilon = \left[X'W(W'W)^{-1}W'X\right]^{-1}X'W(W'W)^{-1}W'\varepsilon. $$
Introducing factors of $n$, this is
$$ \hat{\beta}_{IV} - \beta = \left[\frac{X'W}{n}\left(\frac{W'W}{n}\right)^{-1}\frac{W'X}{n}\right]^{-1}\frac{X'W}{n}\left(\frac{W'W}{n}\right)^{-1}\frac{W'\varepsilon}{n}. $$
Assuming that each of the terms with an $n$ in the denominator satisfies a LLN, so that
$$ \frac{W'W}{n} \stackrel{p}{\rightarrow} Q_{WW}, \text{ a finite pd matrix}, \qquad \frac{X'W}{n} \stackrel{p}{\rightarrow} Q_{XW}, \text{ a finite matrix of rank } K, \qquad \frac{W'\varepsilon}{n} \stackrel{p}{\rightarrow} 0, $$
then the plim of the rhs is zero. This last term has plim 0 since we assume that $W$ and $\varepsilon$ are uncorrelated, e.g.,
$$ E(W_t'\varepsilon_t) = 0. $$
Given these assumptions, the IV estimator is consistent: $\hat{\beta}_{IV} \stackrel{p}{\rightarrow} \beta$.
Furthermore, scaling by $\sqrt{n}$, we have
$$ \sqrt{n}\left(\hat{\beta}_{IV} - \beta\right) = \left[\frac{X'W}{n}\left(\frac{W'W}{n}\right)^{-1}\frac{W'X}{n}\right]^{-1}\frac{X'W}{n}\left(\frac{W'W}{n}\right)^{-1}\frac{W'\varepsilon}{\sqrt{n}}. $$
Assuming that the far right term satisfies a CLT, so that
$$ \frac{W'\varepsilon}{\sqrt{n}} \stackrel{d}{\rightarrow} N(0, Q_{WW}\sigma^2), $$
then we get
$$ \sqrt{n}\left(\hat{\beta}_{IV} - \beta\right) \stackrel{d}{\rightarrow} N\left(0, \left(Q_{XW}Q_{WW}^{-1}Q_{XW}'\right)^{-1}\sigma^2\right). $$
The estimators for $Q_{XW}$ and $Q_{WW}$ are the obvious ones. An estimator for $\sigma^2$ is
$$ \widehat{\sigma^2_{IV}} = \frac{1}{n}\left(y - X\hat{\beta}_{IV}\right)'\left(y - X\hat{\beta}_{IV}\right). $$
This estimator is consistent following the proof of consistency of the OLS estimator of $\sigma^2$, when the classical assumptions hold. The formula used to estimate the variance of $\hat{\beta}_{IV}$ is
$$ \hat{V}\left(\hat{\beta}_{IV}\right) = \left[(X'W)(W'W)^{-1}(W'X)\right]^{-1}\widehat{\sigma^2_{IV}}. $$
The IV estimator is
1. Consistent
2. Asymptotically normally distributed
3. Biased in general, since even though $E(X'P_W\varepsilon) = 0$, $E\left[(X'P_W X)^{-1}X'P_W\varepsilon\right]$ may not be zero, because $(X'P_W X)^{-1}$ and $X'P_W\varepsilon$ are not independent.
An important point is that the asymptotic distribution of $\hat{\beta}_{IV}$ depends upon $Q_{XW}$ and $Q_{WW}$, and these depend upon the choice of $W$. The choice of instruments influences the efficiency of the estimator. If one set of instruments contains another, say $W_1 \subset W_2$, then the IV estimator using $W_2$ is at least as efficient asymptotically as the one using $W_1$.
The penalty for indiscriminate use of instruments is that the small sample bias of the IV estimator rises as the number of instruments increases. The reason for this is that $P_W X$ becomes closer and closer to $X$ itself as the number of instruments increases.
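As a sketch of the algebra just described, the following Python/NumPy fragment (the simulated data generating process and the single instrument are invented for illustration) computes the generalized IV estimator $(X'P_W X)^{-1}X'P_W y$ together with the variance estimator given above, and compares it with OLS.

```python
import numpy as np

def iv(X, W, y):
    """Generalized IV: (X' P_W X)^{-1} X' P_W y, P_W = W (W'W)^{-1} W'."""
    PW = W @ np.linalg.solve(W.T @ W, W.T)
    b = np.linalg.solve(X.T @ PW @ X, X.T @ PW @ y)
    e = y - X @ b
    s2 = e @ e / len(y)                               # sigma^2_IV estimator
    V = s2 * np.linalg.inv(X.T @ PW @ X)              # [X'W(W'W)^{-1}W'X]^{-1} s2
    return b, np.sqrt(np.diag(V))

rng = np.random.default_rng(4)
n = 2000
w = rng.normal(size=n)                                # instrument
u = rng.normal(size=n)                                # common shock -> endogeneity
x = 0.7 * w + u + 0.3 * rng.normal(size=n)
y = 1.0 + 2.0 * x + u + rng.normal(size=n)            # error correlated with x

X = np.column_stack([np.ones(n), x])
W = np.column_stack([np.ones(n), w])
b_iv, se = iv(X, W, y)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS (inconsistent):", b_ols.round(3))
print("IV  (consistent):  ", b_iv.round(3), " s.e.:", se.round(3))
```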
QXW QWW
QXW
163
1 2
0. This matrix is
The necessary and sufficient condition for identification is simply that this matrix
be positive definite, and that the instruments be (asymptotically) uncorrelated
with .
For this matrix to be positive definite, we need that the conditions noted above
hold: QWW must be positive definite and QXW must be of full rank ( K ).
These identification conditions are not that intuitive nor is it very obvious how
to check them.
Y
X1
Notation:
Let K
K
K be
Let G
cols Y1
G
Now the X1 are weakly exogenous and can serve as their own instruments.
164
It turns out that X exhausts the set of possible instruments, in that if the variables
in X dont lead to an identified model then no other instruments will identify the
model either. Assuming this is true (well prove it in a moment), then a necessary
one instrument must be used twice, so W will not have full column rank:
W
QZW K
1
G
1
To show that this is in fact a necessary condition consider some arbitrary set of
instruments W A necessary condition for identification is that
1
plim W
Z
n
where
Y
X1
165
Recall that we've partitioned the model
$$ Y\Gamma = XB + E $$
as
$$ Y = \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix}, \qquad X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}. $$
Given the reduced form
$$ Y = X\Pi + V, $$
we can write the reduced form using the same partition:
$$ \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix} = \begin{bmatrix} X_1 & X_2 \end{bmatrix}\begin{bmatrix} \pi_{11} & \Pi_{12} & \Pi_{13} \\ \pi_{21} & \Pi_{22} & \Pi_{23} \end{bmatrix} + \begin{bmatrix} v & V_1 & V_2 \end{bmatrix}, $$
so we have
$$ Y_1 = X_1\Pi_{12} + X_2\Pi_{22} + V_1, $$
and therefore
$$ \frac{1}{n}W'Z = \frac{1}{n}W'\begin{bmatrix} X_1\Pi_{12} + X_2\Pi_{22} + V_1 & X_1 \end{bmatrix}. $$
Because the $W$'s are uncorrelated with the $V_1$'s, by assumption, the cross between $W$ and $V_1$ converges in probability to zero, so
$$ \operatorname{plim}\frac{1}{n}W'Z = \operatorname{plim}\frac{1}{n}W'\begin{bmatrix} X_1\Pi_{12} + X_2\Pi_{22} & X_1 \end{bmatrix}. $$
Since the far rhs term is formed only of linear combinations of columns of $X$, the rank of this matrix can never be greater than $K$, regardless of the choice of instruments. If $Z$ has more than $K$ columns, then it is not of full column rank. When $Z$ has more than $K$ columns we have
$$ G^* - 1 + K^* > K, $$
or, noting that $K^{**} = K - K^*$,
$$ G^* - 1 > K^{**}. $$
In this case, the limiting matrix is not of full column rank, and the identification condition fails.
Xt
B Et
V Et
Yt
Xt
B
Et
Xt
Vt
V Vt
The reduced form parameters are consistently estimable, but none of them are known
a priori, and there are no restrictions on their values. The problem is that more than
one structural form has the same reduced form, so knowledge of the reduced form
167
parameters alone isnt enough to determine the structural parameters. To see this,
consider the model
Yt
F
Xt
BF Et F
F
F
V Et F
Xt
BF F
Yt
Xt
BFF
Xt
B
Et
Et F F
Et FF
Xt
Vt
V Et F F
V Et
Since the two structural forms lead to the same rf, and the rf is all that is directly
estimable, the models are said to be observationally equivalent. What we need for
identification are restrictions on and B such that the only admissible F is an identity
matrix (if all of the equations are to be identified). Take the coefficient matrices as
partitioned before:
168
B
1 22
12
1
32
B12
B22
The coefficients of the first equation of the transformed model are simply these coefficients multiplied by the first column of F. This gives
f11
F2
1 22
12
1
32
B12
B22
f11
F2
For identification of the first equation we need that there be enough restrictions so that
the only admissible
f11
F2
be the leading column of an identity matrix, so that
12
1 22
1
B12
B22
f11
1
0
F2
1
0
169
32
32
F2
B22
32
cols
32
B22
G
B22
then the only way this can hold, without additional restrictions on the models parameters, is if F2 is a vector of zeros. Given that F2 is a vector of zeros, then the first
equation
1
12
Therefore, as long as
f11
1
f11
F2
32
1
B22
then
f11
F2
0G
The first equation is identified in this case, so the condition is sufficient for identification. It is also necessary, since the condition implies that this submatrix must have at
least G
170
rows, we obtain
G
G
or
G
When K
4. Nonlinearities in variables
When these sorts of information are available, the above conditions arent necessary for identification, though they are of course still sufficient.
To give an example of how other information can be used, consider the model
Y
XB E
where is an upper triangular matrix with 1s on the main diagonal. This is a triangular system of equations. In this case, the first equation is
y1
XB 1 E 1
y2
21 y1 X B 2 E 2
E y1t 2t
Xt
B 1 1t 2t
$$ \text{Consumption: } C_t = \alpha_0 + \alpha_1 P_t + \alpha_2 P_{t-1} + \alpha_3\left(W_t^p + W_t^g\right) + \varepsilon_{1t} $$
$$ \text{Investment: } I_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 K_{t-1} + \varepsilon_{2t} $$
$$ \text{Private Wages: } W_t^p = \gamma_0 + \gamma_1 X_t + \gamma_2 X_{t-1} + \gamma_3 A_t + \varepsilon_{3t} $$
$$ \text{Output: } X_t = C_t + I_t + G_t $$
$$ \text{Profits: } P_t = X_t - T_t - W_t^p $$
$$ \text{Capital Stock: } K_t = K_{t-1} + I_t $$
The other variables are the government wage bill, $W_t^g$, taxes, $T_t$, government nonwage spending, $G_t$, and a time trend, $A_t$. The endogenous variables are the lhs variables,
$$ Y_t' = \begin{bmatrix} C_t & I_t & W_t^p & X_t & P_t & K_t \end{bmatrix}, $$
and the predetermined variables are all others:
$$ X_t' = \begin{bmatrix} 1 & W_t^g & G_t & T_t & A_t & P_{t-1} & K_{t-1} & X_{t-1} \end{bmatrix}. $$
Writing the model as $Y\Gamma =$
X B E gives
1 0
0
3 0
0
0
1
1 0
1
1
0
1 1
1 0
1
0
1 0
0
173
0 0 0 0 0
3 0
0 0
1 0
3 0 0
B
1 0
0
2 2 0
0 0
3 0
0 0
2 0 0
tion. These are the rows that have zeros in the first column, and we need to drop the
first column. We get
1
0
0
1 0
1 1
1
1 0
1
32
B22
3 0
0
0
1 0
We need to find a set of 5 rows of this matrix gives a full-rank 5 5 matrix. For
174
0 0
1 0
3 0 0
A
3 0
1 0
0
0 0
This matrix is of full rank, so the sufficient condition for identification is met. Counting
included endogs, G
5 so
K
5
L
1
1
Wt enter the consumption equation, and their coefficients are restricted to be the
same. For this reason the consumption equation is in fact overidentified by four
restrictions.
11.6 2SLS
When we have no information regarding cross-equation restrictions or the structure of
the error covariance matrix, one can estimate the parameters of a single equation of the
system without regard to the other equations.
175
This isnt always efficient, as well see, but it has the advantage that misspecifications in other equations will not affect the consistency of the estimator of the
parameters of the equation of interest.
The 2SLS estimator is very simple: in the first stage, each column of Y1 is regressed on
all the weakly exogenous variables in the system, e.g., the entire X matrix. The fitted
values are
$$ \hat{Y}_1 = X(X'X)^{-1}X'Y_1 = P_X Y_1 = X\hat{\Pi}_1. $$
Since these fitted values are the projection of $Y_1$ on the space spanned by $X$, and since any vector in this space is uncorrelated with $\varepsilon$ by assumption, $\hat{Y}_1$ is uncorrelated with $\varepsilon$. Since $\hat{Y}_1$ is the reduced-form prediction of $Y_1$, it is correlated with $Y_1$, so it satisfies the requirements for an instrument. The only other requirement is that the instruments be linearly independent. This should be the case when the order condition is satisfied, since there are more columns in $X_2$ than in $Y_1$ in this case.
The second stage substitutes $\hat{Y}_1$ in place of $Y_1$, and estimates by OLS. The original model is
$$ y = Y_1\gamma_1 + X_1\beta_1 + \varepsilon = Z\delta + \varepsilon, $$
and the second stage model is
$$ y = \hat{Y}_1\gamma_1 + X_1\beta_1 + \varepsilon. $$
Since $X_1$ is in the space spanned by $X$, $P_X X_1 = X_1$, so we can also write the second stage model as
$$ P_X y = P_X Y_1\gamma_1 + P_X X_1\beta_1 + P_X\varepsilon \equiv P_X Z\delta + P_X\varepsilon. $$
The OLS estimator applied to this model is
$$ \hat{\delta} = (Z'P_X Z)^{-1}Z'P_X y, $$
which is exactly what we get if we estimate using IV, with the reduced form predictions of the endogs used as instruments. Note that if we define
$$ \hat{Z} = P_X Z = \begin{bmatrix} \hat{Y}_1 & X_1 \end{bmatrix}, $$
so that $\hat{Z}$ are the instruments for $Z$, then we can write
$$ \hat{\delta} = (\hat{Z}'Z)^{-1}\hat{Z}'y. $$
Important note: OLS on the transformed model can be used to calculate the
2SLS estimate of since we see that its equivalent to IV using a particular set
177
PX Z
Y X
Z
Z
Z
Z
Z
Z
1
2IV
Y
Z
Z
X1
Y
X1
Y
X1
Y
X1
Y1
X1
X1
Y1
PX Y1 Y1
PX X1
X we can write
X1
Y1
PX PX Y1 Y1
PX X1
X1
PX Y1
Y
X1
X1
X1
Y
X1
Z
Z
Therefore, the second and last term in the variance formula cancel, so the 2SLS varcov
estimator simplifies to
V
Z
Z
178
1
2IV
which, following some algebra similar to the above, can also be written as
V
Z
Z
1
2IV
Finally, recall that though this is presented in terms of the first equation, it is general
since any equation can be placed first.
Properties of 2SLS:
1. Consistent
2. Asymptotically normal
3. Biased when the mean exists (the existence of moments is a technical issue we won't go into here).
4. Asymptotically inefficient, except in special circumstances (more on this later).
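To illustrate the two stages concretely, here is a small sketch (Python/NumPy; the simulated supply/demand system and all parameter values are invented) that estimates the supply equation of a system like the one earlier in the chapter by 2SLS, using the exogenous income variable as the excluded instrument.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
inc = rng.normal(size=n)                               # exogenous income
e1, e2 = rng.normal(size=(2, n))
a1, a2, a3 = 1.0, -1.0, 0.5                            # demand: q = a1 + a2 p + a3 inc + e1
b1, b2 = -0.5, 1.0                                     # supply: q = b1 + b2 p + e2
p = (a1 - b1 + a3 * inc + e1 - e2) / (b2 - a2)         # reduced form for price
q = b1 + b2 * p + e2

# 2SLS for the supply equation; income is the excluded instrument
Z = np.column_stack([np.ones(n), p])                   # p is endogenous
X = np.column_stack([np.ones(n), inc])                 # all weakly exogenous variables

Zhat = X @ np.linalg.lstsq(X, Z, rcond=None)[0]        # first stage: Zhat = P_X Z
delta_2sls = np.linalg.lstsq(Zhat, q, rcond=None)[0]   # second stage: OLS of q on Zhat
delta_ols = np.linalg.lstsq(Z, q, rcond=None)[0]
print("true supply parameters:", [b1, b2])
print("OLS :", delta_ols.round(3))
print("2SLS:", delta_2sls.round(3))
```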
y
X IV
PW y
179
X IV
but
IV
X IV
y
y
X X
PW X
X X
PW X
I
X X
PW X
X
PW y
1
X
PW y
X
PW
X
A X
where
A I
X X
PW X
X
PW
so
s IV
X
A
PW A X
Moreover, A
PW A is idempotent, as can be verified by multiplication:
A
PW A
PW
PW X X
PW X
PW X X
PW X
X
PW I
PW X X
PW X
X X
PW X
X
PW
PW
X
PW
I
X
X X
PW X
X
180
X
PW
PW X X
PW X
Furthermore, A is orthogonal to X
AX
X
PW X
X
PW
so
s IV
A
PW A
Supposing the are normally distributed, with variance 2 then the random variable
s IV
2
A
PW A
2
middle, so
s IV
2
2 A
PW A
2
2 A
PW A
Even if the arent normally distributed, the asymptotic result still holds. The
last thing we need to determine is the rank of the idempotent matrix. We have
A
PW A
PW
PW X X
PW X
X
PW
so
A
PW A
Tr PW
TrPW
PW X X
PW X
TrW
W W
W
KX
181
X
PW
TrX
PW PW X X
PW X
TrW W
W
KW
KX
KX
This test is an overall specification test: the joint null hypothesis is that the model is correctly specified and that the $W$ form valid instruments (e.g., that the variables classified as exogs really are uncorrelated with $\varepsilon$). Rejection can mean either that the model $y = Z\delta + \varepsilon$ is misspecified, or that there is correlation between the instruments and the error.
Since $\hat{\varepsilon}_{IV} = A\varepsilon$ and
$$ s(\hat{\beta}_{IV}) = \varepsilon'A'P_W A\varepsilon, $$
we can write
$$ \frac{s(\hat{\beta}_{IV})}{\widehat{\sigma^2}} = \frac{\hat{\varepsilon}_{IV}'W(W'W)^{-1}W'\hat{\varepsilon}_{IV}}{\hat{\varepsilon}_{IV}'\hat{\varepsilon}_{IV}/n} = n\left(RSS_{\hat{\varepsilon}_{IV}|W}/TSS_{\hat{\varepsilon}_{IV}}\right) = nR_u^2, $$
where R2u is the uncentered R2 from a regression of the IV residuals on all of the
instruments W . This is a convenient way to calculate the test statistic.
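A sketch of the $nR_u^2$ form of the test (Python/NumPy plus SciPy for the chi-square p-value; the data generating process, with two instruments for one endogenous regressor and hence one overidentifying restriction, is invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 3000
w1, w2 = rng.normal(size=(2, n))                   # two instruments
u = rng.normal(size=n)
x = 0.6 * w1 + 0.6 * w2 + u + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 1.0 * x + u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
W = np.column_stack([np.ones(n), w1, w2])          # K_W = 3, K_X = 2 -> 1 restriction

PW = W @ np.linalg.solve(W.T @ W, W.T)
b_iv = np.linalg.solve(X.T @ PW @ X, X.T @ PW @ y)
e_iv = y - X @ b_iv

# Regress the IV residuals on all instruments; the statistic is n times uncentered R^2
ehat = W @ np.linalg.lstsq(W, e_iv, rcond=None)[0]
stat = n * (ehat @ ehat) / (e_iv @ e_iv)
df = W.shape[1] - X.shape[1]
print("n R_u^2 =", round(stat, 3), "  p-value =", round(1 - stats.chi2.cdf(stat, df), 3))
```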
On an aside, consider IV estimation of a just-identified model, using the standard notation
182
PW y
X
PW y
X IV
The IV estimator is
IV
X
PW X
X
PW y
X
PW X
W
X
W
X
W
X
1
X
W W
W
X
W W
W
W
W X
W
IV
W
X
W
X
W
X
1
1
W
W X
W
W
W X
W
W
y
183
X
PW y
X
W W
W 1W
y
X IV
PW y
X IV
X IV
y
PW y
IV
X
PW y
X IV
X
PW X IV
IV
X
PW y
IV
X IV
y
PW y
X IV
y
PW y
IV
X
PW y X
PW X IV
y
PW y
X IV
by the fonc for generalized IV. However, when were in the just indentified case, this
is
s IV
y
PW y
W
y
y
PW I
X W
X
X W
X
y
W W
W
W
y
W W
W
W
X W
X
W
y
The value of the objective function of the IV estimator is zero in the just identified case. This makes sense, since we've already shown that the objective function after dividing by $\sigma^2$ is asymptotically $\chi^2$ with degrees of freedom equal to the number of overidentifying restrictions. In the present case, there are no overidentifying restrictions, so we have a $\chi^2(0)$ rv, which has mean 0 and variance 0, e.g., it's simply 0. This means we're not able to test the identifying restrictions in the case of exact identification.
184
Recall that overidentification improves efficiency of estimation, since an overidentified equation can use more instruments than are necessary for consistent
estimation.
XB E
E X
E
0 K
G
N 0
vec E
11 In 12 In
22 In
1G In
..
.
. . ..
. .
GG In
In
This means that the structural equations are heteroscedastic and correlated with
one another
185
In general, ignoring this will lead to inefficient estimation, following the section on GLS. When equations are correlated with one another estimation should
account for the correlation in order to obtain efficiency.
Also, since the equations are correlated, information about one equation is implicitly information about all equations. Therefore, overidentification restrictions in any equation improve efficiency for all equations, even the just identified
equations.
Single equation methods cant use these types of information, and are therefore
inefficient (in general).
11.8.1 3SLS
Following our above notation, each structural equation can be written as
Yi 1 Xi 1 i
yi
Zi i i
y1
Z1 0
0
..
.
yG
y2
..
.
0
..
.
Z2
..
ZG
or
y
186
2
..
.
2
..
.
. 0
E
In
The 3SLS estimator is just 2SLS combined with a GLS correction that takes advantage
of the structure of Define Z as
X X
X
1X
Z1
0
Z
X X
X
0
..
.
1X
Z2
..
0
..
.
. 0
X X
X
1X
ZG
Y1 X1
0
..
.
0
..
.
0
Y2 X2
..
. 0
YG XG
These instruments are simply the unrestricted rf predicitions of the endogs, combined with the exogs. The distinction is that if the model is overidentified, then
187
Z
Z 1 Z
y
as can be verified by simple multiplication, and noting that the inverse of a blockdiagonal matrix is just the matrix with the inverses of the blocks on the main diagonal.
This IV estimator still ignores the covariance information. The natural extension is
to add the GLS transformation, putting the inverse of the error covariance into the
formula, which gives the 3SLS estimator
$$ \hat{\delta}_{3SLS} = \left[\hat{Z}'\left(\Sigma^{-1}\otimes I_n\right)Z\right]^{-1}\hat{Z}'\left(\Sigma^{-1}\otimes I_n\right)y. $$
In practice $\Sigma$ is unknown, so a feasible version is based on the 2SLS residuals:
$$ \hat{\varepsilon}_i = y_i - Z_i\hat{\delta}_{i, 2SLS}. $$
The element $i, j$ of $\Sigma$ is estimated by
$$ \hat{\sigma}_{ij} = \frac{\hat{\varepsilon}_i'\hat{\varepsilon}_j}{n}. $$
Substitute into the formula above to get the feasible 3SLS estimator.
Analogously to what we did in the case of 2SLS, the asymptotic distribution of the
3SLS estimator can be shown to be
n 3SLS
N 0 lim E
188
In
n
A formula for estimating the variance of the 3SLS estimator in finite samples (cancelling out the powers of $n$) is
$$ \hat{V}\left(\hat{\delta}_{3SLS}\right) = \left[\hat{Z}'\left(\hat{\Sigma}^{-1}\otimes I_n\right)\hat{Z}\right]^{-1}. $$
This is analogous to the 2SLS formula seen previously, combined with the GLS correction.
In the case that all equations are just identified, 3SLS is numerically equivalent
to 2SLS. Proving this is easiest if we use a GMM interpretation of 2SLS and
3SLS. GMM is presented in the next econometrics course. For now, take it on
faith.
calculated equation
The 3SLS estimator is based upon the rf parameter estimator
which is simply
X
X
X
X
y
X
Y
y2
yG
that is, OLS equation by equation using all the exogs in the estimation of each column
of
It may seem odd that we use OLS on the reduced form, since the rf equations are
correlated:
Yt
Xt
B
Xt
Vt
189
Et
and
Vt
N 0
Et
y1
X 0
y2
..
.
0
..
.
yG
0
..
.
X
..
v1
2
..
.
. 0
v2
..
.
vG
where yi is the n 1 vector of observations of the ith endog, X is the entire n K matrix
of exogs, i is the ith column of and vi is the ith column of V Use the notation
X v
to indicate the pooled model. Following this notation, the error covariance matrix is
V v
In
tions.
However, pooled estimation using the GLS correction is more efficient, since
equation-by-equation estimation is equivalent to pooled estimation, since X is
block diagonal, but ignoring the covariance information.
The model is estimated by GLS, where is estimated using the OLS residuals
from equation-by-equation estimation, which are consistent.
In the special case that all the Xi are the same, which is true in the present case
1. A
2. A
3. A
In
B
B
B C
B
and
D
SUR
AC
In
BD we get
X
1
In
In
1
X
In
X
In
IG
2
..
.
In
X
X
X
X
X
X
y
X
y
X
y
1
G
So the unrestricted rf coefficients can be estimated efficiently (assuming normality) by OLS, even if the equations are correlated.
191
We have ignored any potential zeros in the matrix which if they exist could
11.8.2 FIML
Full information maximum likelihood is an alternative estimation method. FIML will
be asymptotically efficient, since ML estimators based on a given information set are
asymptotically efficient w.r.t. all other estimators that use the same information set,
and in the case of the full-information ML estimator we use the entire information set.
The 2SLS and 3SLS estimators dont require distributional assumptions, while FIML
of course does. Our model is, recall
Yt
Xt
B Et
N 0
Et
E Et Es
0t
s
The joint normality of $E_t$ means that the density for $E_t$ is the multivariate normal, which is
$$ g(E_t) = (2\pi)^{-G/2}\left(\det\Sigma\right)^{-1/2}\exp\left(-\frac{1}{2}E_t'\Sigma^{-1}E_t\right). $$
The transformation from $E_t$ to $Y_t$ requires the Jacobian
$$ \left|\det\frac{dE_t}{dY_t'}\right| = |\det\Gamma|, $$
so the density for $Y_t$ is
$$ f(Y_t) = (2\pi)^{-G/2}|\det\Gamma|\left(\det\Sigma\right)^{-1/2}\exp\left(-\frac{1}{2}\left(Y_t'\Gamma - X_t'B\right)\Sigma^{-1}\left(Y_t'\Gamma - X_t'B\right)'\right). $$
Given the assumption of independence over time, the joint log-likelihood function is
$$ \ln L(B, \Gamma, \Sigma) = -\frac{nG}{2}\ln(2\pi) + n\ln|\det\Gamma| - \frac{n}{2}\ln\det\Sigma - \frac{1}{2}\sum_{t=1}^n\left(Y_t'\Gamma - X_t'B\right)\Sigma^{-1}\left(Y_t'\Gamma - X_t'B\right)'. $$
next section.
It turns out that the asymptotic distribution of 3SLS and FIML are the same,
2. Calculate
1
B 3SLS 3SLS
This is new, we didnt estimate in this way
before. This estimator may have some zeros in it. When Greene says
iterated 3SLS doesnt lead to FIML, he means this for a procedure that
but only updates and B and If you update
you do
doesnt update
converge to FIML.
3. Calculate the instruments Y
Xt
B
FIML is fully efficient, since its an ML estimator that uses all information. This
implies that 3SLS is fully efficient when the errors are normally distributed.
Also, if each equation is just identified and the errors are normal, then 2SLS will
194
xt
t and
cases. For example, economic variables are often nonnegative (for example prices and quantities), or the variables may be restricted to integers (for example, the number of visits to the doctor a person makes in a year). In this section we'll see a few examples of models for these sorts of data.
of models for these sorts of data.
v j p m z
j j
other variables related to the persons preferences or characteristics of the object. The
first object ( j
1 is chosen if
0 v0 m p z
v1 m p z
or if
0
v1 m p z
195
v0 m p z
Define
v1 x
v0 x The first
object is chosen if
Define y
v x
1 y
Pr y
1
F v x
p x
where are the parameters of the utility functions and the distribution function of
A fairly simple version of this model is the standard probit model. Suppose that
0 0 m p
0
v0 m p z
1 1 m p
1
v1 m p z
and
N
0
0 5 12
22
0 22
0 5 then
1
11 12
0
0
N 0 1
Also,
v w
1
0
m p
196
0 m p
1
0
and
Pr y
where
1
m p
x
the parameters and which are in turn functions of the parameters i , i and i
1 2
equal to x
The density function for a single Bernoulli trial (conditional on $x$) is
$$ f(y|x) = \Phi(x'\theta)^y\left[1 - \Phi(x'\theta)\right]^{1 - y}, \qquad y \in \{0, 1\}, $$
so the log-likelihood for a sample of $n$ independent observations is
$$ \ln L(\theta) = \sum_{t=1}^n\left\{y_t\ln\Phi(x_t'\theta) + (1 - y_t)\ln\left[1 - \Phi(x_t'\theta)\right]\right\}, $$
and the average log-likelihood is
$$ s_n(\theta) = \frac{1}{n}\sum_{t=1}^n\left\{y_t\ln\Phi(x_t'\theta) + (1 - y_t)\ln\left[1 - \Phi(x_t'\theta)\right]\right\}. $$
This is a nonlinear in the parameters function. We'll discuss how it can be maximized later. Most packages for econometrics have a function to calculate this estimator.
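As an illustration, here is a minimal sketch (Python, using NumPy and SciPy; the simulated data and true parameter values are invented) that maximizes the average probit log-likelihood $s_n(\theta)$ numerically.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_avg_loglike(theta, X, y):
    """Negative average probit log-likelihood."""
    p = np.clip(norm.cdf(X @ theta), 1e-10, 1 - 1e-10)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(7)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.5, -1.0])
y = (X @ theta_true + rng.normal(size=n) > 0).astype(float)

res = minimize(neg_avg_loglike, x0=np.zeros(2), args=(X, y), method="BFGS")
print("probit MLE:", res.x.round(3), " (true values:", theta_true, ")")
```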
The parameters in are consistently estimated. On the other hand this has required making assumptions regarding the parameters 11 12 and 22 Without
197
these restrictions the distribution function of is not identified, so the other parameters arent identified. Also, knowledge of does not allow recovery of the
i i and i since there are twice as many unknowns as equations.
N 0 1 and 1
distribution for
Binary response models of this sort are never identified without these sorts of
restrictions.
The logit model is very similar to the probit model. Under the logit model the underlying errors have distributions such that $\varepsilon$ follows the logistic distribution, with cdf
$$ F_\varepsilon(z) = \frac{1}{1 + \exp(-z)}, $$
so
$$ \Pr(y = 1) = \frac{1}{1 + \exp(-x'\theta)}. $$
It turns out that the probit and logit models give very similar estimates of $\Pr(y = 1|x)$, which is usually of most interest. Therefore the choice between logit and probit models is not very important, in the binary choice case. The coefficients are different (there is a scaling factor that relates the coefficients). However, the coefficients themselves aren't usually of much interest and they are difficult to interpret.
example, the dependent variable could be the number of auto accidents in a weekend, or the number of political leaders that make fools of themselves in a week. Such variables are termed count data dependent variables.

The Poisson model is one of the simplest models for count data. The Poisson density is
$$ f_Y(y) = \frac{\exp(-\lambda)\lambda^y}{y!}, \qquad y = 0, 1, 2, \ldots $$
The parameter $\lambda$ must be positive, and is commonly parameterized in terms of covariates as
$$ \lambda_t = \exp(x_t'\beta). $$
With this parameterization, the log-likelihood contribution of observation $t$ is
$$ s_t(\beta) = -\exp(x_t'\beta) + y_t x_t'\beta - \ln y_t!, $$
and the average log-likelihood function is just the sum of this divided by the sample size:
$$ s_n(\beta) = \frac{1}{n}\sum_{t=1}^n\left[-\exp(x_t'\beta) + y_t x_t'\beta - \ln y_t!\right]. $$
The Poisson model imposes that the conditional mean equals the conditional variance: $E(y|x) = V(y|x) = \lambda$. There are generalizations of the Poisson model that relax this restriction. The way this is done is to make $\lambda$ a function of another random variable, then integrate that variable out. For example, suppose
$$ \lambda = \exp(x'\beta)\,\nu, $$
where $\nu$ is a positive random variable with density $f_\nu(z)$, so the joint density of $y$ and $\nu$ is the product of the conditional density of $y$ given $\nu$ and the marginal density of $\nu$:
$$ f_{Y,\nu}(y, z) = \frac{\exp\left[-\exp(x'\beta)z\right]\left[\exp(x'\beta)z\right]^y}{y!}\,f_\nu(z). $$
The marginal density of $y$ is obtained by integrating out $\nu$:
$$ f_Y(y) = \int \frac{\exp\left[-\exp(x'\beta)z\right]\left[\exp(x'\beta)z\right]^y}{y!}\,f_\nu(z)\,dz. $$
This effectively introduces other parameters into the density, which relax the Poisson mean-variance restriction.
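A minimal sketch of Poisson ML estimation, based on the average log-likelihood given above (Python with NumPy/SciPy; the simulated data and true coefficients are invented for the example):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_avg_loglike(beta, X, y):
    """Negative average Poisson log-likelihood with lambda_t = exp(x_t'beta)."""
    xb = X @ beta
    return -np.mean(-np.exp(xb) + y * xb - gammaln(y + 1.0))   # ln y! = gammaln(y + 1)

rng = np.random.default_rng(8)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

res = minimize(neg_avg_loglike, x0=np.zeros(2), args=(X, y), method="BFGS")
print("Poisson MLE:", res.x.round(3), " (true values:", beta_true, ")")
```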
t1
distribution function FD t
Pr D
t
Several questions may be of interest. For example, one might wish to know the
expected time one has to wait to find a job given that one has already waited s years.
The probability that a spell lasts s years is
Pr D
s
Pr D
s
FD s
fD t
1 FD s
fD t D
s
The expectanced additional time required for the spell to end given that is has already
lasted s years is the expectation of D with respect to this density, minus s
E DD
s
s
fD z
ds
z
1 FD s
s
fD t
e
t
densities.
To illustrate application of this model, 402 observations on the length (in months)
of strikes in the industrial sector were used to fit a Weibull model. The parameter
estimates are

    Parameter   Estimate   St. Error
    lambda      0.559      0.034
    gamma       0.867      0.033

and the log-likelihood value is -659.3.
A plot of E with 95% confidence bands follows. The plot is accompanied by
a nonparametric Kaplan-Meier estimate of life-expectancy. This nonparametric estimator of E simply averages all spell lengths greater than t and subtracts t This is
fD t
1 t
1
1 1 1 t
1 1
1
2 t
2
2 2 2 t
2 1
0.233
0.016
1.722
0.166
1.731
0.101
1.522
0.096
0.428
0.035
point. Supposing we're trying to maximize $s_n(\theta)$, take a second order Taylor's series approximation of $s_n(\theta)$ about $\theta^k$ (an initial guess):
$$ s_n(\theta) \approx s_n(\theta^k) + g(\theta^k)'\left(\theta - \theta^k\right) + \frac{1}{2}\left(\theta - \theta^k\right)'H(\theta^k)\left(\theta - \theta^k\right), $$
where $g(\theta) = D_\theta s_n(\theta)$ is the gradient and $H(\theta) = D_\theta^2 s_n(\theta)$ is the Hessian. To attempt to maximize $s_n(\theta)$, we can maximize the portion of the right-hand side that depends on $\theta$, i.e., we can maximize
$$ \tilde{s}(\theta) = g(\theta^k)'\theta + \frac{1}{2}\left(\theta - \theta^k\right)'H(\theta^k)\left(\theta - \theta^k\right) $$
with respect to $\theta$. This is a much easier problem, since it is a quadratic function in $\theta$, so it has linear first order conditions:
$$ D_\theta\tilde{s}(\theta) = g(\theta^k) + H(\theta^k)\left(\theta - \theta^k\right) = 0. $$
So the solution for the next round estimate is
$$ \theta^{k+1} = \theta^k - H(\theta^k)^{-1}g(\theta^k). $$
A potential problem is that the Hessian may not be negative definite when we're far from the maximizing point, so $-H(\theta^k)^{-1}g(\theta^k)$ may not define an increasing direction of search. This can happen when the objective function has flat regions, in which case the Hessian matrix is very ill-conditioned (e.g., is nearly singular), or when we're in the vicinity of a local minimum, where $H(\theta^k)$ is positive definite, and our direction is a decreasing direction of search. Matrix inverses by computers are subject to large errors when the matrix is ill-conditioned. Also, we certainly don't want to go in the direction of a minimum when we're maximizing. To solve this problem, Quasi-Newton methods simply add a positive definite component to $-H(\theta^k)$ to ensure that the resulting matrix is positive definite, e.g., $Q = -H(\theta^k) + bI$, where $b$
is chosen large enough so that Q is well-conditioned. This has the benefit that
f x
ln x
x
1 0 e
1 0
1 0x
f
1 058416
2 206 10
g x
f z
f
z x
z
1
f
z x
2
z
82563
at
78371
Now set the expansion point to the new value, and re-plot:
z2
78371
g2 x
f
z2 x
f z2
1
f
z
2
2
z2
73735
z2
at
99458
Another round:
z3
g3 x
99458
f z3
f
z3 x
z3
1
f
z
3
2
72907
z3
at
1 055
So after two NR
iterations were already pretty close to the maximum and the approximation is quite
205
The iterations are stopped when some convergence criteria are satisfied, for example when the parameters cease to change,
$$ \left|\theta_j^k - \theta_j^{k-1}\right| < \varepsilon_1 \ \forall j, $$
when the objective function ceases to change,
$$ \left|s(\theta^k) - s(\theta^{k-1})\right| < \varepsilon_2, $$
and when the elements of the gradient are all close to zero,
$$ \left|g_j(\theta^k)\right| < \varepsilon_3 \ \forall j. $$
Also, if we're maximizing, it's good to check that the last round Hessian is negative definite.
Starting values
The Newton-Raphson and related algorithms work well if the objective function
is concave (when maximizing), but not so well if there are convex regions and local
minima or multiple local maxima. The algorithm may converge to a local minimum
or to a local maximum that is not optimal. The algorithm may also have difficulties
converging at all.
The usual way to ensure that a global maximum has been found is to use
many different starting values, and choose the solution that returns the highest
objective function value. THIS IS IMPORTANT in practice.
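A compact sketch of the Newton-Raphson iteration combined with several starting values follows (Python/NumPy). The objective function, a curve with two local maxima, is invented for illustration, and the gradient and Hessian are computed by finite differences for simplicity; the best of the converged points is kept, as recommended above.

```python
import numpy as np

def s(theta):
    """A one-dimensional objective with two local maxima (invented for illustration)."""
    return np.exp(-(theta - 1.0) ** 2) + 0.6 * np.exp(-(theta + 2.0) ** 2)

def newton_raphson(s, theta0, h=1e-4, tol=1e-8, maxit=200):
    theta = float(theta0)
    for _ in range(maxit):
        g = (s(theta + h) - s(theta - h)) / (2 * h)                # numerical gradient
        H = (s(theta + h) - 2 * s(theta) + s(theta - h)) / h ** 2  # numerical Hessian
        step = -g / H if H < 0 else g                              # crude safeguard if H >= 0
        if abs(step) < tol:
            break
        theta += step
    return theta

starts = np.linspace(-4.0, 4.0, 9)                                 # several starting values
candidates = [newton_raphson(s, t0) for t0 in starts]
best = max(candidates, key=s)
print("best local maximum found: theta =", round(best, 4), ", s =", round(float(s(best)), 4))
```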
wt yt
1
yt
j
a function only of its own lagged values, unconditional on other observable variables.
One can think of this as modeling the behavior of yt after marginalizing out all other
variables. While its not immediately clear why a model that has other explanatory
variables should marginalize to a linear in the parameters time series model, most time
series work is done with linear models, though nonlinear time series is also a large and
growing field. Well stick with linear time series models.
t
(12)
n
t 1
(13)
E yt
jt
where t
E yt
t yt
j
(14)
j
jt
j t
weak sta-
tionarity.
What is the mean of $Y_t$? The time series is one sample from the stochastic process. One could think of $M$ repeated samples from the stochastic process, e.g., $\{y_t^m\}$. By a LLN, we would expect that
$$ \frac{1}{M}\sum_{m=1}^M y_t^m \stackrel{p}{\rightarrow} E(Y_t). $$
The problem is, we have only one sample to work with, since we can't go back in time and collect another. How can $E(Y_t)$ be estimated then? It turns out that ergodicity is the needed property. A stationary stochastic process is ergodic (for the mean) if the time average converges to the mean,
$$ \frac{1}{n}\sum_{t=1}^n y_t \stackrel{p}{\rightarrow} \mu. \qquad (15) $$
A sufficient condition for ergodicity is that the autocovariances be absolutely summable:
$$ \sum_{j=0}^\infty |\gamma_j| < \infty. $$
This implies that the autocovariances die off, so that the $y_t$ are not so strongly dependent that they don't satisfy a LLN.
Definition 26 (Autocorrelation) The $j$th autocorrelation, $\rho_j$, is just the $j$th autocovariance divided by the variance:
$$ \rho_j = \frac{\gamma_j}{\gamma_0}. \qquad (16) $$
Definition 27 (White noise) White noise is just the time series literature term for a classical error: (i) $E(\varepsilon_t) = 0$, $\forall t$; (ii) $V(\varepsilon_t) = \sigma^2$, $\forall t$; (iii) $\varepsilon_t$ and $\varepsilon_s$ are independent, $t \neq s$.
yt
2 t
q t
E yt
E t 1 t
2 t
2 1 21 22
q t
2q
q
j j
0 j
1 1
2 2
q q
j
j
$$ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t. $$
The dynamic behavior of an AR(p) process can be studied by writing this pth order
difference equation as a vector first order difference equation:
$$ \begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix} = \begin{bmatrix} c \\ 0 \\ \vdots \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}\begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}, $$
or
$$ Y_t = C + FY_{t-1} + E_t. $$
C FYt Et
C F C FYt
Et
Et
C FC F 2Yt
FEt Et
C FYt
Et
C F C FC F 2Yt
FEt Et
and
Yt
C FC F 2C F 3Yt
1
F 2 Et FEt
Et
Et
FEt
or in general
Yt
C FC
F jC F j 1Yt
F j Et F j
212
Et
j 1
Et
j
This is simply the $(1,1)$ element of the matrix $F^j$:
$$ \frac{\partial y_{t+j}}{\partial \varepsilon_t} = F^j_{(1,1)}. $$
If the system is to be stationary, then as we move forward in time this impact must die off. Otherwise a shock causes a permanent change in the mean of $y_t$. Therefore, stationarity requires that
$$ \lim_{j\rightarrow\infty} F^j_{(1,1)} = 0. $$
Consider the eigenvalues of the matrix F These are the for such that
IP
F
F
so
can be written as
1
When p
2 the matrix F is
0
1 2
213
1 the
so
IP
2
1
and
IP
F
2
1
2
1
2
which can be found using the quadratic formula. This generalizes. For a $p$th order AR process, the eigenvalues are the roots of
$$ \lambda^p - \lambda^{p-1}\phi_1 - \lambda^{p-2}\phi_2 - \cdots - \lambda\phi_{p-1} - \phi_p = 0. $$
Supposing that all of the roots of this polynomial are distinct, then the matrix $F$ can be factored as
$$ F = T\Lambda T^{-1}, $$
where $T$ is the matrix which has as its columns the eigenvectors of $F$, and $\Lambda$ is a diagonal matrix with the eigenvalues on the main diagonal. Using this decomposition, we can write
$$ F^j = \left(T\Lambda T^{-1}\right)\left(T\Lambda T^{-1}\right)\cdots\left(T\Lambda T^{-1}\right) = T\Lambda^j T^{-1}, $$
where
$$ \Lambda^j = \begin{bmatrix} \lambda_1^j & & & \\ & \lambda_2^j & & \\ & & \ddots & \\ & & & \lambda_p^j \end{bmatrix}. $$
Supposing that the $\lambda_i$, $i = 1, 2, \ldots, p$ are all real, the requirement
$$ \lim_{j\rightarrow\infty} F^j_{(1,1)} = 0 $$
requires that
$$ |\lambda_i| < 1, \quad i = 1, 2, \ldots, p. $$
It may be the case that some eigenvalues are complex-valued. The previous result generalizes to the requirement that the eigenvalues be less than one in modulus, where the modulus of a complex number $a + bi$ is
$$ \operatorname{mod}(a + bi) = \sqrt{a^2 + b^2}. $$
This leads to the famous statement that stationarity requires the roots of the determinantal polynomial to lie inside the complex unit circle. (Draw picture here.)
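As a quick numerical check of this condition, the sketch below (Python/NumPy; the example coefficient vectors are invented) builds the companion matrix $F$ of an AR($p$) process and reports the moduli of its eigenvalues.

```python
import numpy as np

def companion(phi):
    """Companion matrix F of an AR(p) with coefficients phi = (phi_1, ..., phi_p)."""
    p = len(phi)
    F = np.zeros((p, p))
    F[0, :] = phi
    if p > 1:
        F[1:, :-1] = np.eye(p - 1)
    return F

for phi in ([0.5, 0.3], [1.2, -0.2], [0.9, 0.2]):
    lam = np.linalg.eigvals(companion(phi))
    verdict = "stationary" if np.all(np.abs(lam) < 1) else "not stationary"
    print(phi, " eigenvalue moduli:", np.abs(lam).round(3), " ->", verdict)
```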
When there are roots on the unit circle (unit roots) or outside the unit circle, we
eigenvalue lead to ocillatory behavior. Of course, when there are multiple eigenvalues the overall effect can be a mixture. pictures
Invertibility of AR process
To begin with, define the lag operator $L$:
$$ Ly_t = y_{t-1}. $$
The lag operator is defined to behave just as an algebraic quantity, e.g.,
$$ L^2 y_t = L(Ly_t) = Ly_{t-1} = y_{t-2}, $$
or
$$ (1 - L)(1 + L)y_t = y_t + Ly_t - Ly_t - L^2 y_t = y_t - y_{t-2}. $$
1 yt
2 yt
2
p yt
or
yt 1
1 L
2 L2
216
pL p
1 L
2 L2
pL p
1 L 1
1
2 L
1
pL
For the moment, just assume that the i are coefficients to be determined. Since L is
defined to operate as an algebraic quantitiy, determination of the i is the same as
determination of the i such that the following two expressions are the same for all z :
1 z
2 z2
p
1 z1
2 z2
p
1 p
1
pz p
1 z 1
2 z
1
pz
p
1z
1 z
1
2
1
p
so we get
2 p
2
1
1
2
p
The LHS is precisely the determinantal polynomial that gives the eigenvalues of F
Therefore, the i that are the coefficients of the factorization are simply the eigenvalues
of the matrix F
Now consider a different stationary process
1
L yt
217
1
1 L 2 L2
jL j
L yt
j L j to get
1 L 2 L2
j L j t
1 L 2 L2
jL j
L
2 L2
1 L 2 L2
jL j
1L j 1
yt
j L j t
1
1 j 1
yt
1 L 2 L2
j L j t
so
j
yt
1 j 1
1 L 2 L2
yt
j 1 L j 1 y 0 since
t
Now as j
j L j t
1 so
yt
1 L 2 L2
j L j t
and the approximation becomes better and better as j increases. However, we started
with
1
L yt
yt
1 L 2 L2
218
jL j
1
L yt
so
1 L 2 L2
jL j
L
1 define
L
jL j
j 0
yt 1
1 L
2 L2
pL p
yt 1
1 L 1
2 L
1
pL
where the are the eigenvalues of F and given stationarity, all the i
1 Therefore,
yt
j
1 L j
j 0
j
2 L j
j 0
pj L j
j 0
yt
1 1 L 2 L2
t
The i are formed of products of powers of the i , which are in turn functions
of the i
The i are real-valued because any complex-valued i always occur in conju219
bi In
multiplication
a bi a
a2
bi
abi abi
b 2 i2
a2 b2
which is real-valued.
Yt
C FC
F jC F j 1Yt
F j Et F j
Et
FEt
j 1
Et
If the process is mean zero, then everything with a C drops out. Take this and
lag it by j periods to get
Yt
As j
F j 1Yt
j 1
F j Et
Fj
Et
j 1
FEt
Et
except for their first element, so we see that the first equation here, in the limit,
is just
yt
j 0
Fj
11 t j
which makes explicit the relationship between the i and the i (and the i as
well, recalling the previous factorization of F j
Moments of AR(p) process
220
yt
2 yt
p yt
p yt
Assuming stationarity, E yt
t so
c 1 2
so
1
c
2
and
c
1
p
so
yt
1 yt
p 1 yt
2 yt
2 yt
1
2
p yt
p
With this, the second moments are easy to find: The variance is
0
1 1 2 2
E yt
j
E 1 yt
1
p p 2
2 yt
1 j
2
1
yt
2 j
221
2
j
p yt
t yt
p j
p
j
have p 1 unknowns (2 0 1
for j
0 1 p, which
yt
1 1 L
q Lq t
1 1 L
q Lq
1 L 1
2 L
1
q L
we can write
1 1 L
1
q Lq
yt
where
1 1 L
q Lq
j L j yt
j 0
with 0
1 or
yt
1 yt
1
2 yt
222
2
or
c 1 yt
yt
2 yt
where
1 2
1
1 2 q
It turns out that one can always manipulate the parameters of an MA(q) process
to find an invertible representation. For example, the two MA(1) processes
yt
L t
and
yt
1 L t
1
2
For example, weve seen that
0
2 1 2
2 2 1
2 1 2
so the variances are the same. It turns out that all the autocovariances will be the
223
same, as is easily checked. This means that the two MA processes are observationally equivalent. As before, it's impossible to distinguish between observationally equivalent processes on the basis of data.
For a given MA(q) process, it's always possible to manipulate the parameters to find an invertible representation (which is unique).
It's important to find an invertible representation, since it's the only representation that allows one to represent $\varepsilon_t$ as a function of past $y$'s. The other representations express $\varepsilon_t$ in terms of future observations.
This is the reason that ARMA models are popular. Combining low-order AR
and MA models can usually offer a satisfactory representation of univariate time
series data with a reasonable number of parameters.
Stationarity and invertibility of ARMA models is similar to what weve seen we wont go into the details. Likewise, calculating moments is similar.
1 L t
224
t t
1 2 n 0
Xn 0 n where Xn
x1 x2
xn
is defined as
arg min sn
1X
X
X
Xn
yn
1 n yn
Xn
y
arg max Ln
n
2
1 2
exp
yt
2
t 1
average logarithm of the likelihood function is achieved at the same as for the likelihood function:
arg max sn
1 n ln Ln
1 2 ln 2
n
1 n
t 1
225
yt
y
One can investigate the properties of an ML estimator supposing that the distributional assumptions are incorrect. This gives a quasi-ML estimator, which
well study later.
$\mu_1 = \mu_1(\theta_0)$ is a moment-parameter equation. In this example, the relationship is the identity function $\mu_1(\theta_0) = \theta_0$, though in general the relationship may be more complicated. The sample first moment is
$$ \hat{\mu}_1 = \frac{1}{n}\sum_{t=1}^n y_t. $$
Define
$$ m_1(\theta) = \mu_1(\theta) - \hat{\mu}_1 = \theta - \frac{1}{n}\sum_{t=1}^n y_t. $$
The method of moments principle is to choose the estimator to set the sample moment equal to the population moment, i.e., to solve $m_1(\hat{\theta}) = 0$ for the parameter estimate. In this case,
$$ m_1(\hat{\theta}) = \hat{\theta} - \frac{1}{n}\sum_{t=1}^n y_t = 0. $$
Since $\frac{1}{n}\sum_{t=1}^n y_t \stackrel{p}{\rightarrow} E(y_t) = \theta_0$ by the LLN, the estimator is consistent.

A second estimator can be based on the variance: for a $\chi^2(\theta_0)$ random variable,
$$ V(y_t) = E\left[(y_t - \theta_0)^2\right] = 2\theta_0. $$
Define
$$ m_2(\theta) = 2\theta - \frac{\sum_{t=1}^n\left(y_t - \bar{y}\right)^2}{n}, $$
and set $m_2(\hat{\theta}) = 0$. Again, by the LLN, the sample variance is consistent for the true variance, that is,
$$ \frac{\sum_{t=1}^n\left(y_t - \bar{y}\right)^2}{n} \stackrel{p}{\rightarrow} 2\theta_0, $$
so
$$ \hat{\theta} = \frac{\sum_{t=1}^n\left(y_t - \bar{y}\right)^2}{2n} $$
is also a consistent estimator of $\theta_0$.
The previous two examples give two estimators of 0 which are both consistent.
With a given sample, the estimators will be different in general.
With two moment-parameter equations and only one parameter, we have overidentification, which means that we have more information than is strictly necessary for consistent estimation of the parameter.
To summarize the notation, define the moment contributions
$$ m_{1t}(\theta) = \theta - y_t, \qquad m_1(\theta) = \frac{1}{n}\sum_{t=1}^n m_{1t}(\theta) = \theta - \frac{1}{n}\sum_{t=1}^n y_t, $$
so that $E\left[m_1(\theta_0)\right] = 0$, and
$$ m_{2t}(\theta) = 2\theta - (y_t - \bar{y})^2, \qquad m_2(\theta) = \frac{1}{n}\sum_{t=1}^n m_{2t}(\theta), $$
so that $E\left[m_2(\theta_0)\right] = 0$ as well. With a single parameter and two moment conditions we can set either $m_1(\hat{\theta}) = 0$ or $m_2(\hat{\theta}) = 0$, but in general we cannot solve both equations simultaneously. The generalized method of moments (GMM) estimator combines the information in the two moment conditions by defining
$$ m(\theta) = \begin{bmatrix} m_1(\theta) & m_2(\theta) \end{bmatrix}' $$
and choosing
$$ \hat{\theta} = \arg\min_\theta s_n(\theta) = d\left(m(\theta)\right) = m(\theta)'Am(\theta), $$
where $A$ is a positive definite matrix. While it's clear that the MM gives consistent estimates if there is a one-to-one relationship between parameters and moments, it's not immediately obvious that the GMM estimator is consistent. (We'll see later that it is.)
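To fix ideas, here is a minimal sketch (Python with NumPy/SciPy; the simulated $\chi^2(\theta_0)$ sample, the sample size and the identity weight matrix are chosen only for illustration) of the two MM estimators and the GMM estimator that combines the two moment conditions above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(10)
theta0 = 3.0
y = rng.chisquare(theta0, size=5000)

# Two method-of-moments estimators, one per moment condition
theta_mm1 = y.mean()                       # from E(y) = theta
theta_mm2 = y.var() / 2.0                  # from V(y) = 2 theta

# GMM: m(theta) = [theta - sample mean, 2 theta - sample variance], A = I
def sn(theta, A=np.eye(2)):
    m = np.array([theta - y.mean(), 2.0 * theta - y.var()])
    return m @ A @ m

theta_gmm = minimize_scalar(sn, bounds=(0.01, 20.0), method="bounded").x
print("MM (mean):    ", round(theta_mm1, 3))
print("MM (variance):", round(theta_mm2, 3))
print("GMM (A = I):  ", round(theta_gmm, 3))
```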
These examples show that these widely used estimators may all be interpreted as
the solution of an optimization problem. For this reason, the study of extremum estimators is useful for its generality. We will see that the general results extend smoothly
to the more specialized results available for specific estimators. After studying extremum estimators in general, we will study the GMM estimator, then QML and NLS.
The reason we study GMM first is that LS, IV, NLS, MLE, QML and other well-known
parametric estimators may all be interpreted as special cases of the GMM estimator,
so the general results on GMM can simplify and unify the treatment of these other
estimators. Nevertheless, there are some special results on QML and NLS, and both
are important in empirical research, which makes focus on them useful.
One of the focal points of the course will be nonlinear models. This is not to
suggest that linear models arent useful. Linear models are more general than they
might first appear, since one can employ nonlinear transformations of the variables:
$$ \varphi_0(y_t) = \beta_0 + \beta_1\varphi_1(x_t) + \beta_2\varphi_2(x_t) + \cdots + \beta_p\varphi_p(x_t) + \varepsilon_t. $$
For example,
$$ \ln y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{1t}^2 + \beta_3 x_{1t}x_{2t} + \varepsilon_t $$
fits this form.
The important point is that the model is linear in the parameters but not necessarily linear in the variables.
In spite of this generality, situations often arise which simply can not be convincingly
represented by linear in the parameters models.
Example 35 (Expenditure shares) Roy's Identity states that the quantity demanded of the $i$th of $G$ goods is
$$ x_i = -\frac{\partial v(p, y)/\partial p_i}{\partial v(p, y)/\partial y}, $$
where $v(p, y)$ is the indirect utility function, $p$ is the price vector and $y$ is income. An expenditure share is
$$ s_i \equiv \frac{p_i x_i}{y}, $$
so necessarily $s_i \in [0, 1]$ and $\sum_{i=1}^G s_i = 1$.
No linear in the parameters model for xi or si with a parameter space that is defined
independent of the data can guarantee that either of these conditions holds. These
constraints will often be violated by estimated linear models, which calls into question
their appropriateness in cases of this sort.
Example 36 Binary limited dependent variable
Suppose there is a latent process
x
0
230
1y
0
1 x
0
F x
0
y
E
0
1
y
n
t
arg min
F x
2
F x
0
x
x
parameter. This is because for any x we can always find a such that x
will be
Since this sort of problem occurs often in empirical work, it is useful to study NLS
and other nonlinear models.
After discussing these estimation methods for parametric models well briefly introduce nonparametric estimation methods. These methods allow one, for example, to
231
estimate f xt consistently when we are not willing to assume that a model of the form
yt
f xt
Pr t
where f
yt
f xt
z
F z xt
since economic theory gives us general information about functions and the signs of
their derivatives, but not about their specific form.
The final section deals with simulation methods in econometrics. These methods
allow us to substitute computer power for mental power. Since computer power is
becoming relatively cheap compared to mental effort, any econometrician who lives
by the principles of economic theory should be interested in these techniques.
232
All vectors will be column vectors, unless they have a transpose symbol (or I forget to apply this rule - your help catching typos and errors is much appreciated). For example, if $x_t$ is a $p \times 1$ vector, $x_t'$ is a $1 \times p$ vector. When I refer to a
p-vector, I mean a column vector.
: p
nized as a p-vector,
s
..
.
s
p
is orga-
s
2
s
1
2 s
is a 1 p vector and
s
s
a x
x
a.
is a p p matrix. Also,
2 s
f
f
233
: p
x Ax
x
A A
.
argument . Then
f g
f
g
g
has dimension n r
Exercise 39 For x and both p 1 vectors, show that
exp x
exp x
x.
n
n 1
n to some other set, so that the set is ordered according to the natural numbers as
234
Real-valued sequences:
Definition 41 [Convergence] A real-valued sequence of vectors a n converges to the
a
an
a
where
T
fn
fn
f
0 and
n
converges
N
Its important to note that N depends upon so that converge may be much
more rapid for certain than for others. Uniform convergence requires a similar rate
of convergence throughout
sup fn
f
converges uni
N
(insert a diagram here showing the envelope around f in which f n must lie)
235
Stochastic sequences
In econometrics, we typically deal with stochastic sequences. Given a probability
space F P
recall that a random variable maps the sample space to the real
line, i.e., X :
is a collection
of such mappings, i.e., each Xn is a random variable with respect to the probabil
ity space F P
X
X
X
Y where n is the sample size, can be used to form a sequence of ran
dom vectors n . A number of modes of convergence are in use when dealing with
: Xn
X
. Then
Xn
converges in probability to X if
lim P An
0
p X or plim X
n
X
Xn
: limn
Xn
X . Then
P A
1
In other words, Xn
a set C
236
or Xn
Xn
as
p X
Xn
Stochastic functions
a s 0
Simple laws of large numbers (LLNs) allow us to directly conclude that n
in
and
X
n
X
X
n
X
n
This easy proof is a result of the linearity of the model, which allows us to express
the estimator in a way that separates parameters from random functions. In general,
this is not possible. We often deal with the more complicated situation where the
stochastic sequence depends on parameters in a manner that is not reducible to a simple
sequence of random variables. In this case, we have a sequence of random functions
that depend on : Xn
237
most surely in to X if
lim sup Xn
X
0 (a.s.)
Implicit is the assumption that all Xn and X are random variables w.r.t.
F P for all
up
uas
and
An equivalent definition, based on the fact that almost sure means with probability one is
lim sup Xn
Pr
X
This has a form similar to that of the definition of a.s. convergence - the essential
difference is the addition of the sup.
f n
f n
gn
0
f n
f n
gn
K where K is
a finite constant.
This definition doesnt require that
f n
gn
238
f n
gn
o p g n means
p 0
1X
X
X
XX
Since plim
1X
X
X
0 we can write X
X
1
1X
X
X
1X
1X
X 0
o p 1 and
0 and all n
f n
g n
Example 53 If Xn
N 0 1 then Xn
K
such that P Xn
Useful rules:
O p n p O p nq
o p n p o p nq
Op np
op np
Example 54 Consider a random sample of iid r.v.s with mean 0 and variance 2 .
The estimator of the mean
n1 2
N 0 2 So n1 2
O p 1 so
Op n
1 2
Before we had
o p 1 now
we have the stronger result that relates the rate of convergence to the sample size.
Example 55 Now consider a random sample of iid r.v.s with mean and variance 2 .
The estimator of the mean
n1
2
A
N 0 2 So n1
2
O p 1 so
239
Op n
1 2
so
O p 1
These two examples show that averages of centered (mean zero) quantities typically have plim 0, while averages of uncentered quantities have finite nonzero plims.
Note that the definition of O p does not mean that f n and g n are of the same order.
Asymptotic equality ensures that this is the case.
Definition 56 Two sequences of random variables f n and gn are asymptotically
equal (written fn
gn if
f n
plim
g n
Finally, analogous almost sure versions of o p and O p are defined in the obvious
way.
240
Readings: Gourieroux and Monfort (1995), Vol. 2, Ch. 24 ; Amemiya, Ch. 4 section
4.1 ; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey and McFadden
(1994), Large Sample Estimation and Hypothesis Testing, in Handbook of Econometrics, Vol. 4, Ch. 36.
objective function $s_n(\theta)$ over a set $\Theta$. Let the objective function $s_n(Z_n,\theta)$ depend upon an $n\times p$ random matrix $Z_n = \left[z_1\;\; z_2\;\; \cdots\;\; z_n\right]'$, where each $z_t$ is a $p$-vector and $p$ is finite.

Example 57 Given the model $y_i = x_i'\theta + \varepsilon_i$, with $n$ observations, define $z_i = (y_i, x_i')'$. The OLS estimator minimizes
$$s_n(Z_n,\theta) = \frac{1}{n}\sum_{i=1}^n\left(y_i - x_i'\theta\right)^2 = \frac{1}{n}\left\|Y - X\theta\right\|^2,$$
where $Y$ and $X$ are defined similarly to $Z_n$.
16.2 Consistency
The following theorem is patterned on a proof in Gallant (1987) (the article, ref. later),
which well see in its original form later in the course. It is interesting to compare
the following proof with Amemiyas Theorem 4.1.1, which is done in terms of convergence in probability.
2. Uniform convergence: there is a nonstochastic function $s_\infty(\theta)$, continuous in $\theta$ on $\Theta$, such that
$$\lim_{n\to\infty}\sup_{\theta\in\Theta}\left|s_n(\theta) - s_\infty(\theta)\right| = 0 \text{ a.s.}$$
3. Identification: $s_\infty(\cdot)$ has a unique global maximum at $\theta^0 \in \Theta$, i.e., $s_\infty(\theta^0) > s_\infty(\theta)$, $\forall\theta\neq\theta^0$.

Then $\hat\theta_n \xrightarrow{a.s.} \theta^0$.
Proof: Select a $\omega\in\Omega$ and hold it fixed. Then $\{s_n(\omega,\theta)\}$ is a fixed sequence of functions. Suppose that $\omega$ is such that $s_n(\omega,\theta)$ converges uniformly to $s_\infty(\theta)$. This happens with probability one by assumption (2). The sequence $\{\hat\theta_n\}$ lies in the compact set $\Theta$, by assumption (1) and the fact that maximization is over $\Theta$. Since every sequence from a compact set has at least one limit point (Davidson, Thm. 2.12), say that $\hat\theta$ is a limit point of $\{\hat\theta_n\}$: there is a subsequence $\{\hat\theta_{n_m}\}$ with $\lim_{m\to\infty}\hat\theta_{n_m} = \hat\theta$. We claim that
$$\lim_{m\to\infty} s_{n_m}\left(\hat\theta_{n_m}\right) = s_\infty\left(\hat\theta\right).$$
To see this, note that uniform convergence implies
$$\lim_{m\to\infty}\left[s_{n_m}\left(\hat\theta_{n_m}\right) - s_\infty\left(\hat\theta_{n_m}\right)\right] = 0,$$
while continuity of $s_\infty(\cdot)$ implies that
$$\lim_{m\to\infty} s_\infty\left(\hat\theta_{n_m}\right) = s_\infty\left(\hat\theta\right),$$
since the limit of $\{\hat\theta_{n_m}\}$ is $\hat\theta$. So the above claim is true.

Next, by maximization,
$$s_{n_m}\left(\hat\theta_{n_m}\right) \geq s_{n_m}\left(\theta^0\right),$$
which holds in the limit, so
$$\lim_{m\to\infty} s_{n_m}\left(\hat\theta_{n_m}\right) \geq \lim_{m\to\infty} s_{n_m}\left(\theta^0\right).$$
However,
$$\lim_{m\to\infty} s_{n_m}\left(\hat\theta_{n_m}\right) = s_\infty\left(\hat\theta\right)$$
as seen above, and
$$\lim_{m\to\infty} s_{n_m}\left(\theta^0\right) = s_\infty\left(\theta^0\right)$$
by uniform convergence, so
$$s_\infty\left(\hat\theta\right) \geq s_\infty\left(\theta^0\right).$$
But by the identification assumption there is a unique global maximum of $s_\infty(\theta)$ at $\theta^0$, so we must have $s_\infty(\hat\theta) = s_\infty(\theta^0)$ and $\hat\theta = \theta^0$. So far we have held $\omega$ fixed, but now we need to consider all $\omega\in\Omega$. All of the above limits hold almost surely, since uniform convergence holds with probability one. Therefore $\{\hat\theta_n\}$ has only one limit point, $\theta^0$, except on a set $C\subset\Omega$ with $P(C) = 0$.

Note that the identification assumption can be stated equivalently as: any $\theta$ with $s_\infty(\theta) \geq s_\infty(\theta^0)$ must have $\|\theta - \theta^0\| = 0$,
which matches the way we will write the assumption in the section on nonparametric
inference.
down of consistency.
The assumption that $\theta^0$ is in the interior of $\Theta$ (part of the identification assumption) has not been used to prove consistency, so we could directly assume that $\theta^0$ is simply an element of a compact set $\Theta$. The reason that we assume it's in the interior here is that this is necessary for the subsequent proof of asymptotic normality, and I'd like to maintain a minimal set of simple assumptions, for clarity.
Parameters on the boundary of the parameter set cause theoretical difficulties
that we will not deal with in this course. Just note that conventional hypothesis
testing methods do not apply in this case.
We need a uniform strong law of large numbers in order to verify assumption (2)
of Theorem 58. The following theorem is from Davidson, pg. 337.
Theorem [Uniform Strong LLN] Let $\{G_n(\theta)\}$ be a sequence of stochastic real-valued functions on a totally-bounded metric space $(\Theta,\rho)$. Then
$$\sup_{\theta\in\Theta}\left|G_n(\theta)\right| \xrightarrow{a.s.} 0$$
if and only if
(a) $G_n(\theta) \xrightarrow{a.s.} 0$ for each $\theta\in\Theta_0$, where $\Theta_0$ is a dense subset of $\Theta$, and
(b) $\{G_n(\theta)\}$ is strongly stochastically equicontinuous.

The metric space we are interested in is simply $\Theta\subset\mathbb{R}^K$, using the Euclidean norm. The pointwise almost sure convergence needed for assumption (a) comes from one of the usual SLLNs, and sufficient conditions for (b) include continuity and boundedness of the objective function, with probability one, on a compact parameter space. These are reasonable conditions in many cases, and henceforth when dealing with specific estimators we'll simply assume that pointwise almost sure convergence can be extended to uniform almost sure convergence.
As an example, consider strong consistency of the OLS estimator. Suppose the data are generated by $y_t = x_t\theta^0 + \varepsilon_t$, where the $x_t$ are iid with distribution function $W(w)$ and support $\mathcal{W}$, the $\varepsilon_t$ are iid with mean zero and variance $\sigma_\varepsilon^2$, and $x_t$ and $\varepsilon_t$ are independent. The sample objective function is
$$s_n(\theta) = \frac{1}{n}\sum_{t=1}^n\left(y_t - x_t\theta\right)^2 = \frac{1}{n}\sum_{t=1}^n\left(x_t\theta^0 + \varepsilon_t - x_t\theta\right)^2$$
$$= \frac{1}{n}\sum_{t=1}^n x_t^2\left(\theta^0-\theta\right)^2 + \frac{2}{n}\sum_{t=1}^n x_t\left(\theta^0-\theta\right)\varepsilon_t + \frac{1}{n}\sum_{t=1}^n\varepsilon_t^2.$$
Considering the terms one by one, and applying simple strong LLNs,
$$\frac{1}{n}\sum_{t=1}^n x_t^2\left(\theta^0-\theta\right)^2 \xrightarrow{a.s.} \left(\theta^0-\theta\right)^2\int_{\mathcal{W}} w^2\,dW(w) = \left(\theta^0-\theta\right)^2 E\left(w^2\right),$$
$$\frac{2}{n}\sum_{t=1}^n x_t\left(\theta^0-\theta\right)\varepsilon_t \xrightarrow{a.s.} 2\left(\theta^0-\theta\right)E(\varepsilon)\int_{\mathcal{W}} w\,dW(w) = 2\left(\theta^0-\theta\right)\cdot 0\cdot E(w) = 0, \qquad (17)$$
$$\frac{1}{n}\sum_{t=1}^n\varepsilon_t^2 \xrightarrow{a.s.} \sigma_\varepsilon^2,$$
using the independence of $x_t$ and $\varepsilon_t$ and the fact that $E(\varepsilon_t)=0$.
Finally, the objective function is clearly continuous, and the parameter space is assumed to be compact, so the convergence is also uniform. Thus,
$$s_\infty(\theta) = \left(\theta^0-\theta\right)^2 E\left(w^2\right) + \sigma_\varepsilon^2,$$
which is clearly minimized at $\theta = \theta^0$.

Exercise 60 Show that in order for the above solution to be unique it is necessary that $E(w^2) \neq 0$. Discuss the relationship between this condition and the problem of collinearity of regressors.
This example shows that Theorem 58 can be used to prove strong consistency of
the OLS estimator. There are easier ways to show this, of course - this is only an
example of application of the theorem.
(a) $J_n(\theta) \equiv D^2_\theta s_n(\theta)$ exists and is continuous in an open, convex neighborhood of $\theta^0$;

(b) $\{J_n(\theta_n)\} \xrightarrow{a.s.} J_\infty(\theta^0)$, a finite negative definite matrix, for any sequence $\{\theta_n\}$ that converges almost surely to $\theta^0$;

(c) $\sqrt{n}\,D_\theta s_n(\theta^0) \xrightarrow{d} N\left[0, I_\infty(\theta^0)\right]$, where $I_\infty(\theta^0) = \lim_{n\to\infty}\operatorname{Var}\left[\sqrt{n}\,D_\theta s_n(\theta^0)\right]$.

Then $\sqrt{n}\left(\hat\theta - \theta^0\right) \xrightarrow{d} N\left[0, J_\infty(\theta^0)^{-1}I_\infty(\theta^0)J_\infty(\theta^0)^{-1}\right]$.

Proof: Expand the first order conditions about $\theta^0$:
$$0 = D_\theta s_n(\hat\theta) = D_\theta s_n(\theta^0) + D^2_\theta s_n(\theta^*)\left(\hat\theta - \theta^0\right),$$
where $\theta^* = \lambda\hat\theta + (1-\lambda)\theta^0$, $0\leq\lambda\leq 1$. Since $\hat\theta\xrightarrow{a.s.}\theta^0$, we also have $\theta^*\xrightarrow{a.s.}\theta^0$, so by assumption (b)
$$D^2_\theta s_n(\theta^*) \xrightarrow{a.s.} J_\infty(\theta^0).$$
So
$$0 = D_\theta s_n(\theta^0) + \left[J_\infty(\theta^0) + o_p(1)\right]\left(\hat\theta - \theta^0\right),$$
and, multiplying by $\sqrt{n}$,
$$0 = \sqrt{n}\,D_\theta s_n(\theta^0) + \left[J_\infty(\theta^0) + o_p(1)\right]\sqrt{n}\left(\hat\theta - \theta^0\right),$$
so
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \stackrel{a}{=} -J_\infty(\theta^0)^{-1}\sqrt{n}\,D_\theta s_n(\theta^0).$$
Because of assumption (c), and the formula for the variance of a linear combination of r.v.'s,
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \xrightarrow{d} N\left[0, J_\infty(\theta^0)^{-1}I_\infty(\theta^0)J_\infty(\theta^0)^{-1}\right].$$
Assumption (b) is not implied by the Slutsky theorem. The Slutsky theorem says that $g(x_n)\xrightarrow{a.s.} g(x^0)$ if $x_n\xrightarrow{a.s.} x^0$ and $g(\cdot)$ is continuous at $x^0$.
To apply this to the second derivatives, sufficient conditions would be that the
second derivatives be strongly stochastically equicontinuous on a neighborhood
of 0 and that an ordinary LLN applies to the derivatives when evaluated at
N 0
Stronger conditions that imply this are as above: continuous and bounded second
derivatives in a neighborhood of 0
Skip this in lecture. A note on the order of these matrices: Supposing that sn
is representable as an average of n terms, which is the case for all estimators we
250
d N 0 I 0 means that
nD sn 0
O p 1
D sn 0
1
2
n wed have
O p 1
1
2
Op n
q
O p nr
The sequence D sn 0
a vector of other variables such as prices, personal characteristics, etc. After provision,
utility is v1 m z
v1 m
A z
2 We
v0 m z
assume here that responses are truthful, that is there is no strategic behavior and that individuals
are able to order their preferences in this hypothetical situation.
251
Define
Define y
v1 m
A z
v0 m z
0 otherwise. The
probability of agreement is
Pr y
F v w A
1
suppose that
v1 m z
v0 m z
m
and 0 and 1 are i.i.d. extreme value random variables. That is, utility depends only
on income, preferences in both states are homothetic, and a specific distributional assumption is made on the distribution of preferences in the population. With these assumptions (the details are unimportant here, see articles by D. McFadden for details)
it can be shown that
p A
A
z
1 exp
z
This is the simple logit model: the choice probability is the logit function of a linear in
parameters function.
252
1y
0
N 0 1
Pr
1
x ,
x
where
x
1 2
2
exp
2
d
2
If p x
p x
Pr y
1 x
x
we have a logit model. If p x
x
where
is the
fYi yi xi
p xi
yi
p x
1 yi
so as long as the observations are independent, the maximum likelihood (ML) estima-
253
$$s_n(\theta) = \frac{1}{n}\sum_{i=1}^n\left[y_i\ln p(x_i,\theta) + (1-y_i)\ln\left(1 - p(x_i,\theta)\right)\right] \equiv \frac{1}{n}\sum_{i=1}^n s(y_i,x_i,\theta). \qquad (18)$$
Following the above theoretical results, tends in probability to the 0 that maximizes
Noting that E yi
p x i 0
and following
representative term s y x First one can take the expectation conditional on x to get
Ey x y ln p x
y ln 1
p x
p x 0 ln p x
p x 0 ln 1
p x
s
p x 0 ln p x
p x 0 ln 1
p x x dx
(19)
where x is the (joint - the integral is understood to be multiple, and X is the support
of x) density function of the explanatory variables x. This is clearly continuous in
have uniform almost sure convergence. Note that p x is continous for the logit and
order conditions
p x 0
px
p x
1
1
p x 0
px
p x
x dx
n
0
d N 0 J 0 1 I 0 J 0
limn
Var
There's no need to subtract the mean, since it's zero, following the f.o.c. in the consistency proof above and the fact that observations are i.i.d.
lim Var nD sn 0
1
s 0
n t
lim Var nD
lim Var
1
D s 0
n
t
1
lim Var D s 0
n n
t
lim VarD s 0
VarD s 0
So we get
s y x 0
s y x 0
I 0
Likewise,
2
E
s y x 0
Expectations are jointly over y and x or equivalently, first over y conditional on x then
s y x 0
y ln p x 0
255
1
y ln 1
p x 0
Now suppose that we are dealing with a correctly specified logit model:
p x
1 exp
x
p x
1 exp
1 exp
x
x
exp x
x
1 exp x
p x x
p x
exp
x
p x 1
x
p x
x
So
s y x 0
2 0
s
p x 0 x
(20)
p x 0
p x 0
xx
xx
x dx
EY y2 2p x 0 p x 0
I 0
p x 0
p x 0
xx
x dx
(22)
p x 0
(21)
p x 0 . Likewise,
EY y2
J 0
p x 0
p x 0
xx
x dx
(23)
Note that we arrive at the expected result: the information matrix equality holds (that
256
is, J 0
d N 0 J 0
I 0 J 0
1
simplifies to
n
d N 0 J 0
1
0
d N 0 I 0
1
On a final note, the logit and standard normal CDFs are very similar - the logit distribution is a bit more fat-tailed. While coefficients will vary slightly between the two
yi
h x i 0
where
i
iid 0 2
257
h xi
Well study this more later, but for now it is clear that the foc for minimization will
require solving a set of nonlinear equations. A common approach to the problem seeks
to avoid this difficulty by linearizing the model. A first order Taylors series expansion
about the point x0 with remainder gives
yi
0
hx
xi
x0
h x0 0
x
where i encompasses both i and the Taylors series remainder. Note that i is no
longer a classical error - its mean is not zero. We should expect problems.
Define
h x0
h x0 0
x0
x
h x0 0
x
xi i
yi
The answer is no, as one can see by interpreting and as extremum estimators.
Let
arg min sn
1 n
yi
n i1
258
xi
uas
sn
s
EX EY X y x
arg min EX EY
x
Noting that
EX EY X y x
EX EY X h x 0
2 EX h x 0
x
x
2
2
since cross products involving drop out. 0 and 0 correspond to the hyperplane
that is closest to the true regression function h x 0 according to the mean squared
conditioning variables.
h(x,)
x
Tangent line
x
x
x
x
x_0
259
x
Fitted line
It is clear that the tangent line does not minimize MSE, since, for example, if
h x 0 is concave, all errors between the tangent line and the true function are
negative.
Note that the true underlying parameter 0 is not estimated consistently, either
(it may be of a different dimension than the dimension of the parameter of the
260
Readings: Hamilton, ch. 5, section 7 (pp. 133-139) ; Gourieroux and Monfort, Vol.
-vector) of a function $s(\theta)$. This function may not be continuous, and it may not be differentiable. Even if it is twice continuously differentiable, it may not be globally concave, so local maxima, minima and saddlepoints may all exist. Supposing $s(\theta)$ were a quadratic function of $\theta$, e.g.,
$$s(\theta) = a + b'\theta + \frac{1}{2}\theta' C\theta,$$
the first order conditions would be linear:
$$D_\theta s(\theta) = b + C\theta,$$
so the maximizing (or minimizing) point would be $\hat\theta = -C^{-1}b$. This is the sort of problem we have with linear models estimated by OLS. It's also the case for feasible GLS, since conditional on the estimate of the varcov matrix, we have a quadratic objective function in the remaining parameters.
More general problems will not have linear f.o.c., and we will not be able to solve
for the maximizer analytically. This is when we need a numeric optimization method.
17.1 Search
See Hamilton. Note, to check $q$ values in each dimension of a $K$-dimensional parameter space, we need to check $q^K$ points. For example, if $q = 100$ and $K = 10$, there would be $100^{10} = 10^{20}$ points to check. If 1000 points can be checked in a second, it would take about $3.2\times 10^{9}$ years to check all points, approximately the age of the earth. The search method is a very reasonable choice if $K$ is small, but it quickly becomes infeasible if $K$ is moderate or large.
(Figure: search in two dimensions with refinement, showing the successive grids closing in on the maximizing point.)
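As an illustration of the idea, here is a minimal sketch (not from the notes) of grid search with refinement in Python; the function and parameter names are hypothetical, and the shrinkage factor is an arbitrary choice.

```python
import numpy as np

def grid_search(objective, bounds, q=50, rounds=3, shrink=0.1):
    """Maximize by evaluating `objective` on a q-point grid in each dimension,
    then refining the grid around the best point. bounds: list of (lo, hi)."""
    best_theta, best_val = None, -np.inf
    for _ in range(rounds):
        grids = [np.linspace(lo, hi, q) for lo, hi in bounds]
        mesh = np.meshgrid(*grids, indexing="ij")
        points = np.column_stack([m.ravel() for m in mesh])   # q**K points
        vals = np.array([objective(p) for p in points])
        i = vals.argmax()
        best_theta, best_val = points[i], vals[i]
        # shrink the search box around the current best point
        bounds = [(t - shrink * (hi - lo), t + shrink * (hi - lo))
                  for t, (lo, hi) in zip(best_theta, bounds)]
    return best_theta, best_val

# example: maximize a concave quadratic in two dimensions
theta_hat, val = grid_search(lambda t: -(t[0] - 1.0)**2 - 2 * (t[1] + 0.5)**2,
                             bounds=[(-5, 5), (-5, 5)])
```

Even with refinement, the number of function evaluations grows as $q^K$ per round, which is why the method is only attractive for small $K$.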
Derivative-based methods define a sequence of points
$$\theta^{k+1} = \theta^k + a_k d^k,$$
where $a_k$ is the step size and $d^k$ is the direction of search. An increasing direction $d$ at the point $\theta$ is one for which there exists $a^*$ such that
$$s(\theta + ad) > s(\theta),$$
for $a$ positive but small ($0 < a < a^*$). That is, if we go in direction $d$, we will improve on the objective function, at least if we don't go too far in that direction.

As long as the gradient at $\theta$ is not zero there exist increasing directions. Write $g(\theta) \equiv D_\theta s(\theta)$. By a first order expansion about $a = 0$,
$$s(\theta + ad) = s(\theta) + a\,g(\theta)'d + o(1),$$
so, for $a$ small enough, we improve on $s(\theta)$ provided $g(\theta)'d > 0$. Defining $d = Qg(\theta)$, where $Q$ is a symmetric positive definite matrix, we guarantee that
$$g(\theta)'d = g(\theta)'Qg(\theta) > 0$$
unless $g(\theta) = 0$. (The relevant $Q$ matrices are those such that the angle between $g$ and $Qg$ is less than 90 degrees.) The iteration is then
$$\theta^{k+1} = \theta^k + a_k Q_k g(\theta^k),$$
and we keep going until the gradient becomes zero, so that there is no increasing direction. The problem is how to choose $a$ and $Q$.
Disadvantages: this doesn't always work too well, however. (Draw banana function.)
17.2.3 Newton-Raphson
The Newton-Raphson method uses information about the slope and curvature of the
objective function to determine which direction and how far to move from an initial
264
point. Supposing we're trying to maximize $s_n(\theta)$, take a second order Taylor's series approximation about $\theta^k$ (an initial guess):
$$s_n(\theta) \approx s_n(\theta^k) + g(\theta^k)'\left(\theta - \theta^k\right) + \frac{1}{2}\left(\theta - \theta^k\right)'H(\theta^k)\left(\theta - \theta^k\right).$$
To attempt to maximize $s_n(\theta)$, we can maximize the portion of the right-hand side that depends on $\theta$,
$$\tilde s(\theta) = g(\theta^k)'\theta + \frac{1}{2}\left(\theta - \theta^k\right)'H(\theta^k)\left(\theta - \theta^k\right),$$
which is a quadratic function of $\theta$ with linear first order conditions
$$D_\theta\tilde s(\theta) = g(\theta^k) + H(\theta^k)\left(\theta - \theta^k\right) = 0.$$
The solution for the next-round estimate is therefore
$$\theta^{k+1} = \theta^k - H(\theta^k)^{-1}g(\theta^k),$$
and, including a step size to guard against a poor quadratic approximation far from the maximizer, the iteration actually used is
$$\theta^{k+1} = \theta^k - a_k H(\theta^k)^{-1}g(\theta^k).$$
A potential problem is that the Hessian may not be negative definite when we're far from the maximizing point, so that $-H(\theta^k)^{-1}g(\theta^k)$ may not define an increasing direction of search. This can happen when the objective function has flat regions, in which case the Hessian matrix is very ill-conditioned (e.g., is nearly singular), or when we're in the vicinity of a local minimum, where $H(\theta^k)$ is positive definite, and our direction is a decreasing direction of search. Matrix inverses by computers are subject to large errors when the matrix is ill-conditioned. Also, we certainly don't want to go in the direction of a minimum when we're maximizing. To solve this problem, Quasi-Newton methods simply add a positive definite component to $H(\theta)$, e.g., $Q = -H(\theta) + bI$, where $b$ is chosen large enough so that $Q$ is well-conditioned and positive definite. This has the benefit that improvement in the objective function is guaranteed.
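A minimal Newton-Raphson sketch in Python (not from the notes): the user supplies the gradient and Hessian, and the Hessian is shifted when it fails to be negative definite, in the spirit of the conditioning fix just described. The function names and the size of the shift are hypothetical choices.

```python
import numpy as np

def newton_max(grad, hess, theta, a=1.0, b_step=1.0, tol=1e-8, maxit=100):
    """Maximize s(theta) by Newton-Raphson. If the Hessian is not negative
    definite, replace it by H - b*I with b large enough that the modified
    matrix is negative definite, so the step is still an increasing direction."""
    for _ in range(maxit):
        g, H = grad(theta), hess(theta)
        max_eig = np.linalg.eigvalsh(H).max()
        if max_eig > -1e-10:                    # H not (strictly) negative definite
            H = H - (max_eig + b_step) * np.eye(len(theta))
        d = -np.linalg.solve(H, g)              # (modified) Newton direction
        theta_new = theta + a * d
        if np.max(np.abs(theta_new - theta)) < tol and np.max(np.abs(g)) < tol:
            return theta_new
        theta = theta_new
    return theta
```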
Stopping criteria
The last thing we need is to decide when to stop. A digital computer is subject to
limited machine precision and round-off errors. For these reasons, it is unreasonable
to hope that a program can exactly find the point that maximizes a function, and in
fact, more than about 6-10 decimals of precision is usually infeasible. Some stopping
criteria are:
- $\left|\theta_j^k - \theta_j^{k-1}\right| < \varepsilon_1$ for all $j$ (negligible change in the parameters);
- $\left|\dfrac{\theta_j^k - \theta_j^{k-1}}{\theta_j^{k-1}}\right| < \varepsilon_2$ for all $j$ (negligible relative change);
- $\left|s(\theta^k) - s(\theta^{k-1})\right| < \varepsilon_3$ (negligible change in the objective function);
- $\left|g_j(\theta^k)\right| < \varepsilon_4$ for all $j$ (gradient negligibly different from zero).
Also, if we're maximizing, it's good to check that the last round (real, not approximate) Hessian is negative definite.
Starting values
The Newton-Raphson and related algorithms work well if the objective function
is concave (when maximizing), but not so well if there are convex regions and local
minima or multiple local maxima. The algorithm may converge to a local minimum
or to a local maximum that is not optimal. The algorithm may also have difficulties
converging at all.
The usual way to ensure that a global maximum has been found is to use
many different starting values, and choose the solution that returns the highest
objective function value. THIS IS IMPORTANT in practice.
Calculating derivatives
267
use programs such as Mupad or Mathematica to calculate analytic derivatives. Example: Scientific WorkPlace can be used to find that
arctan
1
1 2
which I certainly didn't know before writing this example. H. Varian has a book that
discusses the use of Mathematica in this context in detail.
Numeric derivatives are less accurate than analytic derivatives, and are usually more costly to evaluate. Both factors usually cause optimization programs to be less reliable when using numeric derivatives.
Numeric second derivatives are much more accurate if the data are scaled so that
the elements of the gradient are of the same order of magnitude. Example: if the
model is $y_t = h(\alpha x_t + \beta z_t) + \varepsilon_t$ and the regressors are of very different orders of magnitude (say $x_t$ of order 1000 and $z_t$ of order 1), then $D_\alpha s_n(\cdot)$ and $D_\beta s_n(\cdot)$ will also be of very different magnitudes. Rescaling the data (e.g., using $x_t/1000$ in place of $x_t$) makes the elements of the gradient of the same order of magnitude, so that in this example both derivatives will be of comparable size.
In general, estimation programs always work better if data is scaled in this way,
since roundoff errors are less likely to become important. This is important in
268
practice.
There are algorithms (such as BFGS and DFP) that use the sequential gradient evaluations to build up an approximation to the Hessian. The iterations are faster for this reason since the actual Hessian isn't calculated, but more iterations usually are required for convergence.
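In practice one rarely codes BFGS by hand; a sketch using an existing optimizer (here SciPy's BFGS implementation, which builds the inverse-Hessian approximation from gradient steps) with several starting values, as recommended above, might look as follows. The objective and data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def neg_sn(theta, y, x):
    """Hypothetical average objective; scipy minimizes, so pass the negative."""
    resid = y - x @ theta
    return 0.5 * np.mean(resid**2)

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 3))
y = x @ np.array([1.0, -0.5, 0.25]) + rng.standard_normal(200)

# try several starting values and keep the best solution found
starts = [np.zeros(3), rng.standard_normal(3), rng.standard_normal(3)]
results = [minimize(neg_sn, s, args=(y, x), method="BFGS") for s in starts]
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)
```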
269
Readings: Hamilton Ch. 14 ; Davidson and MacKinnon, Ch. 17 (see pg. 587 for
refs. to applications); Newey and McFadden (1994), Large Sample Estimation and
Hypothesis Testing, in Handbook of Econometrics, Vol. 4, Ch. 36.
18.1 Definition
Weve already seen one example of GMM in the introduction, based upon the 2 distribution. Consider the following example based upon the t-distribution. The density
function of a t-distributed r.v. Yt is
$$f_Y(y_t,\theta^0) = \frac{\Gamma\left(\frac{\theta^0+1}{2}\right)}{\left(\pi\theta^0\right)^{1/2}\Gamma\left(\frac{\theta^0}{2}\right)}\left(1 + \frac{y_t^2}{\theta^0}\right)^{-\frac{\theta^0+1}{2}},$$
where $\theta^0$ is the degrees of freedom parameter. Given an iid sample of size $n$, one could estimate $\theta^0$ by maximizing the log-likelihood function
$$\hat\theta \equiv \arg\max_\Theta \ln L_n(\theta) = \sum_{t=1}^n\ln f_Y(y_t,\theta);$$
this is
the ML estimator.
Continuing with the example, a t-distributed r.v. with density $f_Y(y_t,\theta^0)$ has mean zero and variance
$$V(y_t) = E\left(y_t^2\right) = \frac{\theta^0}{\theta^0 - 2}$$
(for $\theta^0 > 2$). Define the moment condition
$$m_1(\theta) = \frac{\theta}{\theta - 2} - \frac{1}{n}\sum_{t=1}^n y_t^2.$$
As before, when evaluated at the true parameter value $\theta^0$, $E_{\theta^0}\left[m_1(\theta^0)\right] = 0$. Choosing $\hat\theta$ to set $m_1(\hat\theta) = 0$ yields a MM estimator:
$$\hat\theta = \frac{2}{1 - \dfrac{n}{\sum_i y_i^2}}. \qquad (24)$$
This estimator is based on only one moment of the distribution - it uses less information than the ML estimator, so it is intuitively clear that the MM estimator will be inefficient relative to the ML estimator.
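As a quick illustration, the following sketch (not part of the notes) computes the MM estimator of equation 24 from simulated t-distributed data; the seed, sample size and true degrees of freedom are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 8.0                           # true degrees of freedom
y = rng.standard_t(theta0, size=5000)

# second-moment condition: E[y^2] = theta/(theta - 2), for theta > 2
m2 = np.mean(y**2)
theta_mm = 2.0 * m2 / (m2 - 1.0)       # solve m2 = theta/(theta - 2) for theta
print(theta_mm)
```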
The t-distributed r.v. also has a fourth moment, provided $\theta^0 > 4$:
$$E\left(y_t^4\right) = \frac{3\left(\theta^0\right)^2}{\left(\theta^0 - 2\right)\left(\theta^0 - 4\right)}.$$
We could define a second moment condition
$$m_2(\theta) = \frac{3\theta^2}{\left(\theta - 2\right)\left(\theta - 4\right)} - \frac{1}{n}\sum_{t=1}^n y_t^4.$$
A second, different MM estimator chooses $\hat\theta$ to set $m_2(\hat\theta) = 0$. If you solve this you'll see that the estimate is different from that in equation 24.
This estimator isnt efficient either, since it uses only one moment. A GMM estimator
would use the two moment conditions together to estimate the single parameter. The
271
As before, set $m_n(\theta) = \left(m_1(\theta),\, m_2(\theta)\right)'$. The $n$ subscript indicates the sample size. Note that $m_n(\theta^0) = O_p(n^{-1/2})$, since it is an average of centered random variables, whereas $m_n(\theta) = O_p(1)$ for $\theta\neq\theta^0$, where expectations are taken using the true distribution with parameter $\theta^0$. This is the fundamental reason that GMM is consistent.

A GMM estimator requires defining a measure of distance, $d\left(m_n(\theta)\right)$. A popular choice is
$$s_n(\theta) = m_n(\theta)'W_n m_n(\theta),$$
and we minimize $s_n(\theta)$. We assume $W_n$ converges to a finite positive definite matrix.

For the purposes of this course, the following definition of the GMM estimator is sufficiently general:

Definition 63 The GMM estimator of the $K$-dimensional parameter vector $\theta^0$ is
$$\hat\theta \equiv \arg\min_\Theta s_n(\theta) \equiv m_n(\theta)'W_n m_n(\theta),$$
where $m_n(\theta) = \frac{1}{n}\sum_{t=1}^n m_t(\theta)$ is a $g$-vector, $g\geq K$, with $E_\theta m(\theta^0) = 0$, and $W_n$ converges almost surely to a finite positive definite matrix $W_\infty$.
moment condition. GMM is robust with respect to distributional misspecification. The price for robustness is loss of efficiency with respect to the
MLE estimator. Keep in mind that the true distribution is not known so if
we erroneously specify a distribution and estimate by MLE, the estimator
will be inconsistent in general (not always).
Feasibility: in some cases the MLE estimator is not available, because we
are not able to deduce the likelihood function. More on this in the section
on simulation-based estimation. The GMM estimator may still be feasible
even though MLE is not possible.
18.2 Consistency
We simply assume that the assumptions of Theorem 58 hold, so the GMM estimator
is strongly consistent. The only assumption that warrants additional comments is that
of identification. In Theorem 58, the third assumption reads: (c) Identification: $s_\infty(\cdot)$ has a unique global maximum at $\theta^0$, i.e., $s_\infty(\theta^0) > s_\infty(\theta)$, $\forall\theta\neq\theta^0$. Taking the case of a quadratic objective function $s_n(\theta) = m_n(\theta)'W_n m_n(\theta)$, first consider $m_n(\theta)$. Applying a uniform law of large numbers, we get $m_n(\theta) \xrightarrow{a.s.} m_\infty(\theta)$. Since $E_{\theta^0} m_n(\theta^0) = 0$ by assumption, $m_\infty(\theta^0) = 0$. Since $s_\infty(\theta^0) = m_\infty(\theta^0)'W_\infty m_\infty(\theta^0) = 0$, in order for asymptotic identification we need that $m_\infty(\theta)\neq 0$ for $\theta\neq\theta^0$, for at least some element of the vector. This and the assumption that $W_n\xrightarrow{a.s.} W_\infty$, a finite positive definite $g\times g$ matrix, guarantee that $\theta^0$ is asymptotically identified. Note that asymptotic identification does not rule out the possibility of lack of identification for a given data set - there may be multiple minimizing solutions in finite samples.
From the asymptotic normality theorem,
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \xrightarrow{d} N\left[0, J_\infty(\theta^0)^{-1}I_\infty(\theta^0)J_\infty(\theta^0)^{-1}\right],$$
where $J_\infty(\theta^0) = \lim_{n\to\infty} D^2_\theta s_n(\theta^0)$ and $I_\infty(\theta^0) = \lim_{n\to\infty}\operatorname{Var}\left[\sqrt{n}\,D_\theta s_n(\theta^0)\right]$. We need to determine the form of these matrices given the objective function
$$s_n(\theta) = m_n(\theta)'W_n m_n(\theta).$$
Using the product rule, and defining the $K\times g$ matrix
$$D_n(\theta) \equiv D_\theta\, m_n'(\theta),$$
we have
$$D_\theta s_n(\theta) = 2\,D_n(\theta)\,W_n\, m_n(\theta). \qquad (25)$$
(Note that $D_n$, $W_n$ and $m_n$ all depend on the sample size $n$, but we will often suppress the subscript to lighten the notation.)

To take second derivatives, let $D_i$ be the $i$-th row of $D(\theta)$. Using the product rule,
$$\frac{\partial^2 s_n(\theta)}{\partial\theta'\partial\theta_i} = \frac{\partial}{\partial\theta'}\left[2\,D_i W m(\theta)\right] = 2\,D_i W D' + 2\,m'W\left[\frac{\partial D_i'}{\partial\theta'}\right].$$
When evaluating the second term at $\theta^0$, assume that $\partial D_i'/\partial\theta'$ satisfies a LLN, so that it converges almost surely to a finite limit. In that case, we have
$$2\,m(\theta^0)'W\left[\frac{\partial D_i'(\theta^0)}{\partial\theta'}\right] \xrightarrow{a.s.} 0,$$
since $m(\theta^0) = o_p(1)$ and $W\xrightarrow{a.s.} W_\infty$. Stacking these results over the $K$ rows of $D$, we get
$$\lim_{n\to\infty} D^2_\theta s_n(\theta^0) = J_\infty(\theta^0) = 2\,D_\infty W_\infty D_\infty' \text{ (a.s.)},$$
where we define $\lim D = D_\infty$, a.s., and $\lim W = W_\infty$, a.s.

With regard to $I_\infty(\theta^0)$, following equation 25, and noting that the scores have mean zero at $\theta^0$ (since $E\, m(\theta^0) = 0$ by assumption), we have
$$I_\infty(\theta^0) = \lim_{n\to\infty}\operatorname{Var}\left[\sqrt{n}\,D_\theta s_n(\theta^0)\right] = \lim_{n\to\infty} E\left[4n\,D_n W_n m(\theta^0)\,m(\theta^0)'W_n D_n'\right] = \lim_{n\to\infty} E\left[4\,D_n W_n\,\sqrt{n}\,m(\theta^0)\;\sqrt{n}\,m(\theta^0)'\,W_n D_n'\right].$$
Now, given that $m(\theta^0)$ is an average of centered (mean zero) quantities, it is reasonable to expect a CLT to apply, after multiplication by $\sqrt{n}$. Assuming this,
$$\sqrt{n}\,m(\theta^0) \xrightarrow{d} N\left(0, \Omega_\infty\right),$$
where
$$\Omega_\infty = \lim_{n\to\infty} E\left[n\,m(\theta^0)\,m(\theta^0)'\right].$$
Using this, we get
$$I_\infty(\theta^0) = 4\,D_\infty W_\infty\Omega_\infty W_\infty D_\infty'.$$
Using these results, the asymptotic normality theorem gives us
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \xrightarrow{d} N\left[0, \left(D_\infty W_\infty D_\infty'\right)^{-1}D_\infty W_\infty\Omega_\infty W_\infty D_\infty'\left(D_\infty W_\infty D_\infty'\right)^{-1}\right],$$
the asymptotic distribution of the GMM estimator for arbitrary weighting matrix $W_n$. Note that for $J_\infty$ to be positive definite, $D_\infty$ must have full row rank, $\rho(D_\infty) = k$.
For example, suppose the weighting matrix were
$$W = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$$
with $a$ much larger than $b$. In this case, errors in the second moment condition have less weight in the objective function.
Since moments are not independent, in general, we should expect that there be a
correlation between the moment conditions, so it may not be desirable to set the
off-diagonal elements to 0. W may be a random, data dependent matrix.
We have already seen that the choice of W will influence the asymptotic distribution of the GMM estimator. Since the GMM estimator is already inefficient
w.r.t MLE, we might like to choose the W matrix to make the GMM estimator
efficient within the class of GMM estimators.
x
where
P
1
PV P
e.g, P
P
In (Note: we use AB
PP
1A 1
P P
P
1P
tion y
X
X Interpreting y
X
as moment conditions
(note that they do have zero expectation when evaluated at 0 ), the optimal
weighting matrix is seen to be the inverse of the covariance matrix of the moment
conditions. This result carries over to GMM estimation. (Note: this presentation
of GLS is not a GMM estimator, because the number of moment conditions here
is equal to the sample size, $n$. Later we'll see that GLS can be put into the GMM framework defined above).
277
If $\hat\theta$ is a GMM estimator that minimizes $m_n(\theta)'W_n m_n(\theta)$, the asymptotic variance of $\hat\theta$ will be minimized by choosing $W_n$ so that $W_n \xrightarrow{a.s.} W_\infty = \Omega_\infty^{-1}$, where $\Omega_\infty = \lim_{n\to\infty} E\left[n\,m(\theta^0)m(\theta^0)'\right]$.

Proof: For $W_\infty = \Omega_\infty^{-1}$, the asymptotic variance
$$\left(D_\infty W_\infty D_\infty'\right)^{-1}D_\infty W_\infty\Omega_\infty W_\infty D_\infty'\left(D_\infty W_\infty D_\infty'\right)^{-1}$$
simplifies to $\left(D_\infty\Omega_\infty^{-1}D_\infty'\right)^{-1}$. Now, for any other positive definite limit $W_\infty$, consider the difference of the inverses of the variances:
$$D_\infty\Omega_\infty^{-1}D_\infty' - D_\infty W_\infty D_\infty'\left(D_\infty W_\infty\Omega_\infty W_\infty D_\infty'\right)^{-1}D_\infty W_\infty D_\infty'$$
$$= D_\infty\Omega_\infty^{-1/2}\left[I - \Omega_\infty^{1/2}W_\infty D_\infty'\left(D_\infty W_\infty\Omega_\infty W_\infty D_\infty'\right)^{-1}D_\infty W_\infty\Omega_\infty^{1/2}\right]\Omega_\infty^{-1/2}D_\infty'.$$
The matrix in brackets is idempotent, hence positive semidefinite, so the difference of the inverses of the variances is positive semidefinite, which implies that the difference of the variances is negative semidefinite, proving the claim.

With the efficient choice of weighting matrix, then,
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \xrightarrow{d} N\left[0, \left(D_\infty\Omega_\infty^{-1}D_\infty'\right)^{-1}\right],$$
which allows us to treat
$$\hat\theta \approx N\left[\theta^0, \frac{\left(D_\infty\Omega_\infty^{-1}D_\infty'\right)^{-1}}{n}\right], \qquad (26)$$
where the
estimators of D and
mn
mn
mn
estimation of
E mt mt
s
0).
2it ).
Since we need to estimate so many components if we are to take the parametric approach, it is unlikely that we would arrive at a correct parametric specification. For
this reason, research has focused on consistent nonparametric estimators of
Henceforth we assume that mt is covariance stationary (the covariance between mt
and mt
tions v
E mt mt
s
Note that E mt mt
s
279
v
Recall that mt and m are functions of
mt
Now
E nm m
E n 1 n mt
1 n mt
t 1
mt
t 1
1
t 1
mt
E 1 n
t 1
1 1
2
2 2
1
n
n
1
1 n
m t m t
v
t v 1
estimator of would be
0
1
1 1
n
n 1
v 1
v
2
2 2
v v
1
On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
$$\hat\Omega = \hat\Gamma_0 + \sum_{v=1}^{q(n)}\left(\hat\Gamma_v + \hat\Gamma_v'\right),$$
where $q(n)\to\infty$ as $n\to\infty$, will be consistent provided $q(n)$ grows sufficiently slowly. The term $\frac{n-v}{n}$ can be dropped because it tends to one for $v < q(n)$, given that $q(n)$ increases slowly relative to $n$.

Note that the $\hat\Gamma_v$ require an estimate of $m_t(\theta^0)$, and hence an estimate of $\theta^0$, which itself is based on an estimate of $\Omega$. The way around this circularity is to set the weighting matrix $W$ arbitrarily (for example to an identity matrix), obtain a first consistent but inefficient estimate of $\theta^0$, then use this estimate to form $\hat\Omega$, and re-estimate; the process may be iterated.

The Newey-West estimator (Econometrica, 1987) weights the autocovariances so that $\hat\Omega$ is positive definite by construction:
$$\hat\Omega = \hat\Gamma_0 + \sum_{v=1}^{q(n)}\left(1 - \frac{v}{q+1}\right)\left(\hat\Gamma_v + \hat\Gamma_v'\right),$$
and it is consistent provided $q(n)$ grows slowly enough that $n^{-1/4}q \to 0$. Note that this is a very slow rate of growth for $q$. This estimator is nonparametric - we've placed no parametric restrictions on the form of $\Omega$. It is an example of a kernel estimator.
In a more recent paper, Newey and West (Review of Economic Studies, 1994) use
pre-whitening before applying the kernel estimator. The idea is to fit a VAR model
to the moment conditions. It is expected that the residuals of the VAR model will be
more nearly white noise, so that the Newey-West covariance estimator might perform
better with short lag lengths.
$$m_t = \Theta_1 m_{t-1} + \cdots + \Theta_p m_{t-p} + u_t.$$
This is estimated, giving the residuals $\hat u_t$. Then the Newey-West covariance estimator is applied to these pre-whitened residuals, and the covariance $\Omega$ is estimated combining the fitted VAR
$$\hat m_t = \hat\Theta_1 m_{t-1} + \cdots + \hat\Theta_p m_{t-p}$$
with the kernel estimate of the covariance of the $u_t$. See Newey-West for details.
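For concreteness, a minimal implementation of the (non-prewhitened) Newey-West estimator might look like the following sketch; the function name and the bandwidth argument are hypothetical choices.

```python
import numpy as np

def newey_west(m, q):
    """Newey-West estimate of Omega = lim Var(sqrt(n) * mbar).
    m : (n, g) array of moment contributions m_t (approximately mean zero).
    q : lag truncation parameter q(n)."""
    n, g = m.shape
    m = m - m.mean(axis=0)              # center (the mean is ~0 at the estimate)
    omega = m.T @ m / n                 # Gamma_0
    for v in range(1, q + 1):
        gamma_v = m[v:].T @ m[:-v] / n  # Gamma_v
        weight = 1.0 - v / (q + 1.0)    # declining weight on lag v
        omega += weight * (gamma_v + gamma_v.T)
    return omega
```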
282
variable X
Y f Y X dY
EY X Y
0
Y g X f Y X dY dX
This can be factored into a conditional expectation and an expectation w.r.t. the
marginal density of X :
EY g X
Y g X f Y X dY
f X dX
EY g X
Y f Y X dY g X f X dX
EY g X
as claimed.
This is important econometrically, since models often imply restrictions on condi
tional moments. Suppose a model tells us that the function K yt xt has expectation,
E K yt xt It
283
k xt
set K yt xt
yt so that k xt
xt
t we can
xt
.
ht
K yt xt
k xt
E ht It
0
1
dimensional parameter However, the above result allows us to form various unconditional expectations
mt
Z wt ht
from the information set It The Z wt are instrumental variables. We now have g
moment conditions, so as long as g
holds.
284
Stacking observations, define the $n\times g$ matrix of instruments
$$Z_n = \begin{bmatrix}
Z_1(w_1) & Z_2(w_1) & \cdots & Z_g(w_1) \\
Z_1(w_2) & Z_2(w_2) & \cdots & Z_g(w_2) \\
\vdots & & & \vdots \\
Z_1(w_n) & Z_2(w_n) & \cdots & Z_g(w_n)
\end{bmatrix} \equiv \begin{bmatrix} Z_1' \\ Z_2' \\ \vdots \\ Z_n' \end{bmatrix},$$
and stack the $h_t(\theta)$ into the $n$-vector $h_n(\theta) = \left(h_1, h_2, \ldots, h_n\right)'$. Then
$$m_n(\theta) = \frac{1}{n}Z_n' h_n(\theta) = \frac{1}{n}\sum_{t=1}^n Z_t h_t = \frac{1}{n}\sum_{t=1}^n m_t(\theta),$$
where $Z_t$ is the $t$-th row of $Z_n$, written as a column vector. The question that arises is how one should choose the instrumental variables $Z(w_t)$ to achieve maximum efficiency.
Note that with this choice of moment conditions, we have that Dn
285
(a
g matrix) is
1
Zn
hn
n
1
h
Zn
n n
Dn
1
Hn Zn
n
Dn
where Hn is a K n matrix that has the derivatives of the individual moment conditions
as its columns. Likewise, define the var-cov. of the moment conditions
1n Z h
n
n
Zn
E
where we have defined n
E nmn 0 mn 0
Zn
0 h n 0
Zn
1 0 0
hn hn
n
n
Zn
n
Zn
where
V
lim
Hn Zn
n
0
d N 0 V
Zn
n Zn
n
Zn
Hn
n
(27)
Using an argument similar to that used to prove that 1 is the efficient weighting
286
Zn
n 1 Hn
lim
Hn n 1 Hn
(28)
and furthermore, this matrix is smaller that the limiting var-cov for any other choice
of instrumental variables. (To prove this, examine the difference of the inverses of the
var-cov matrices with the optimal instruments and with non-optimal instruments. As
above, you can show that the difference is positive semi-definite).
h
n
Zn n Zn
n
must be estimated
m
mn
0
or
1 mn
D
0
m
Multiplying by D
1 m
D
m n 0
Dn
0
0
o p 1
(29)
we obtain
1 mn 0
D
1 D 0
D
288
0
o p 1
D 1 mn 0
D 1 D
or
n D 1 D
D 1 mn 0
With this, and taking into account the original expansion (equation ??), we get
nm
nmn 0
1
nD
D D
D 1 mn 0
nm
1 2
n
1
D
D D
D 1
2
1 2 mn 0
Or
n 1 2 m
D D
1 2 D
n Ig
D 1
2
1 2 mn 0
Now
n 1 2 mn 0
d N 0 I
g
D D
1 2 D
is idempotent of rank g
Ig
D 1
2
289
its trace), so the quadratic form satisfies
$$\left(\sqrt{n}\,\bar m(\hat\theta)\right)'\Omega_\infty^{-1}\left(\sqrt{n}\,\bar m(\hat\theta)\right) = n\,\bar m(\hat\theta)'\Omega_\infty^{-1}\bar m(\hat\theta) \xrightarrow{d} \chi^2(g-K),$$
or
$$n\cdot s_n(\hat\theta) \xrightarrow{d} \chi^2(g-K),$$
supposing the model is correctly specified. This is a convenient test since we just
multiply the optimized value of the objective function by n and compare with a 2 g
K critical value. The test is a general test of whether or not the moments used to
estimate are correctly specified.
This won't work when the estimator is just identified. The f.o.c. are
$$D_\theta s_n(\hat\theta) = \hat D\hat\Omega^{-1}\bar m(\hat\theta) = 0.$$
But with exact identification $\hat D$ and $\hat\Omega^{-1}$ are square and (asymptotically) nonsingular, so
$$\bar m(\hat\theta) = 0.$$
So the moment conditions are zero regardless of the weighting matrix used. As such, we might as well use an identity matrix and save trouble. Also $s_n(\hat\theta) = 0$, so the test breaks down.
A note: this sort of test often over-rejects in finite samples. If the sample size is small, it might be better to use bootstrap critical values. That is, draw artificial samples of size $n$ by sampling from the data with replacement. For $R$ bootstrap replications, estimate and compute the test statistic $s_j^*$, $j = 1, 2, \ldots, R$. Define the bootstrap critical value $C_b$ such that $100\alpha$ percent of the $s_j^*$ exceed the value. Of course, $R$ must be a very large number if $g - K$ is large, in order to
determine the critical value with precision. This sort of test has been found to
have quite good small sample properties.
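A minimal sketch of this bootstrap procedure (not from the notes): the user supplies a routine that re-estimates by GMM and returns the overidentification statistic for a given sample; the function names, replication count and seed are hypothetical.

```python
import numpy as np

def bootstrap_critical_value(data, gmm_stat, R=999, alpha=0.05, seed=0):
    """data: (n, ...) array of observations; gmm_stat: function returning the
    overidentification statistic n*s_n(theta_hat) for a given sample."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    stats = np.empty(R)
    for j in range(R):
        idx = rng.integers(0, n, size=n)       # sample rows with replacement
        stats[j] = gmm_stat(data[idx])
    # critical value such that a fraction alpha of bootstrap statistics exceed it
    return np.quantile(stats, 1.0 - alpha)
```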
Suppose $y = X\beta^0 + \varepsilon$, where $\varepsilon\sim N(0,\Sigma)$, $\Sigma$ a diagonal matrix. One approach is to parameterize $\Sigma = \Sigma(\sigma)$, where $\sigma$ is a finite-dimensional parameter vector, and to estimate $\beta$ and $\sigma$ jointly (feasible GLS). This will work well if the parameterization of $\Sigma$ is correct. Alternatively, $\beta^0$ can still be estimated consistently by OLS. However, the typical covariance estimator $\hat V(\hat\beta) = (X'X)^{-1}\hat\sigma^2$ is not consistent under heteroscedasticity, so the usual inferences are invalid.
To see OLS as a GMM estimator, define the moment conditions
$$m_t(\beta) = x_t\left(y_t - x_t'\beta\right),$$
which have zero expectation at $\beta^0$, so that
$$\bar m(\beta) = \frac{1}{n}\sum_{t=1}^n m_t(\beta) = \frac{1}{n}\sum_{t=1}^n x_t y_t - \frac{1}{n}\sum_{t=1}^n x_t x_t'\,\beta.$$
For any choice of $W$, $\bar m$ will be identically zero at the minimum, due to exact identification. That is, since the number of moment conditions is identical to the number of parameters, the foc imply that $\bar m(\hat\beta)\equiv 0$ regardless of $W$. There is no need for the optimal weighting matrix in this case; an identity matrix works just as well for the purpose of estimation. Therefore
$$\hat\beta = \left(\sum_{t=1}^n x_t x_t'\right)^{-1}\sum_{t=1}^n x_t y_t = \left(X'X\right)^{-1}X'y,$$
which is the usual OLS estimator.
Recall that the GMM variance estimate requires $\hat D$ and $\hat\Omega$. In this case $\hat D$ is simply
$$\hat D = \frac{1}{n}\sum_{t=1}^n x_t x_t' = \frac{X'X}{n}.$$
Since there is no autocorrelation, the terms $\hat\Gamma_v$, $v\geq 1$, drop out of $\hat\Omega = \hat\Gamma_0 + \sum_{v\geq 1}\left(\hat\Gamma_v + \hat\Gamma_v'\right)$, and
$$\hat\Omega = \hat\Gamma_0 = \frac{1}{n}\sum_{t=1}^n \hat m_t\hat m_t' = \frac{1}{n}\sum_{t=1}^n x_t x_t'\left(y_t - x_t'\hat\beta\right)^2 = \frac{1}{n}\sum_{t=1}^n x_t x_t'\,\hat\varepsilon_t^2 = \frac{X'\hat E X}{n},$$
where $\hat E$ is an $n\times n$ diagonal matrix with $\hat\varepsilon_t^2$ in the position $t,t$. The estimated asymptotic variance of $\sqrt{n}\left(\hat\beta - \beta^0\right)$ is therefore
$$\hat V_\infty = \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\hat E X}{n}\right)\left(\frac{X'X}{n}\right)^{-1}.$$
This is the varcov estimator that White (1980) arrived at in an influential article. This
estimator is consistent under heteroscedasticity of an unknown form. If there is autocorrelation, the Newey-West estimator can be used to estimate - the rest is the
same.
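A short sketch of the computation just described (not the author's code; names are hypothetical):

```python
import numpy as np

def ols_white(y, X):
    """OLS with White's (1980) heteroscedasticity-consistent covariance."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = (X * e[:, None]**2).T @ X          # X' diag(e_t^2) X
    vcov = XtX_inv @ meat @ XtX_inv           # estimated Var(beta_hat)
    return beta, vcov
```

If the moment contributions $x_t\hat\varepsilon_t$ are autocorrelated, one would replace the middle matrix by a Newey-West estimate, as noted above.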
X0
N 0
293
1
y
1 n
xt yt
t 0
1 n
xt xt
0
t 0
That is, the GLS estimator in this case has an obvious representation as a GMM estimator. With autocorrelation, the representation exists but it is a little more complicated.
Nevertheless, the idea is the same. There are a few points:
This means that it is more efficient than the above example of OLS with Whites
heteroscedastic consistent covariance, which is an alternative GMM estimator.
This means that the choice of the moment conditions is important to achieve
efficiency.
18.9.3 2SLS
Consider the linear model
$$y_t = z_t'\theta + \varepsilon_t,$$
or, in matrix form,
$$y = Z\theta + \varepsilon,$$
where $z_t$ may contain variables correlated with $\varepsilon_t$. Let $X$ be the matrix of instruments (the exogenous variables), and define the reduced form predictions
$$\hat Z = X\left(X'X\right)^{-1}X'Z.$$
Interpreting $\hat z_t\left(y_t - z_t'\theta\right)$ as moment conditions, we have
$$\bar m(\theta) = \frac{1}{n}\sum_{t=1}^n \hat z_t\left(y_t - z_t'\theta\right).$$
Since we have $K$ parameters and $K$ moment conditions, the GMM estimator will set $\bar m$ identically equal to zero, regardless of $W$, so we have
$$\hat\theta = \left(\sum_t \hat z_t z_t'\right)^{-1}\sum_t \hat z_t y_t = \left(\hat Z'Z\right)^{-1}\hat Z'y.$$
This is the standard formula for 2SLS. We use the exogenous variables and the reduced
form predictions of the endogenous variables as instruments, and apply IV estimation.
See Hamilton pp. 420-21 for the varcov formula (which is the standard formula for
2SLS), and for how to deal with t heterogeneous and dependent (basically, just use the
Newey-West or some other consistent estimator of and apply the usual formula).
Note that $\varepsilon_t$ dependent causes lagged endogenous variables to lose their status as legitimate instruments.
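A minimal 2SLS sketch following the formula above (not from the notes; `Z` holds the right-hand-side variables, `X` the instruments, both hypothetical names):

```python
import numpy as np

def two_sls(y, Z, X):
    """2SLS: Zhat = X (X'X)^{-1} X'Z, then theta = (Zhat'Z)^{-1} Zhat'y."""
    Zhat = X @ np.linalg.solve(X.T @ X, X.T @ Z)   # reduced-form predictions
    theta = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y)
    return theta
```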
295
$$\begin{aligned}
y_{1t} &= f_1(z_t,\theta_1^0) + \varepsilon_{1t}\\
y_{2t} &= f_2(z_t,\theta_2^0) + \varepsilon_{2t}\\
&\;\;\vdots\\
y_{Gt} &= f_G(z_t,\theta_G^0) + \varepsilon_{Gt},
\end{aligned}$$
or in compact notation
$$y_t = f(z_t,\theta^0) + \varepsilon_t,$$
where $f(\cdot)$ is a $G$-vector valued function and $\theta^0 = \left(\theta_1^{0\prime},\theta_2^{0\prime},\ldots,\theta_G^{0\prime}\right)'$ is the stacked parameter vector. We need to find an $A_i\times 1$ vector of instruments $x_{it}$, for each equation, that are uncorrelated with $\varepsilon_{it}$. Typical instruments would be low order monomials in the exogenous variables in $z_t$, with their lagged values. Then we can define the $\left(\sum_{i=1}^G A_i\right)\times 1$ vector of orthogonality conditions
$$m_t(\theta) = \begin{bmatrix}
\left(y_{1t} - f_1(z_t,\theta_1)\right)x_{1t}\\
\left(y_{2t} - f_2(z_t,\theta_2)\right)x_{2t}\\
\vdots\\
\left(y_{Gt} - f_G(z_t,\theta_G)\right)x_{Gt}
\end{bmatrix}.$$
non-trivial problem.
A note on efficiency: the selected set of instruments has important effects on the
efficiency of estimation. Unfortunately there is little theory offering guidance on
296
Define $Y_t = \left(y_1, y_2, \ldots, y_t\right)$. Then at time $t$, $Y_{t-1}$ has been observed (refer to it as the information set, since we assume the conditioning variables have been selected to take advantage of all useful information). The likelihood function is the joint density of the sample:
$$L(\theta) = f(y_1, y_2, \ldots, y_n, \theta),$$
which can be factored as
$$L(\theta) = f(y_n|Y_{n-1},\theta)\,f(Y_{n-1},\theta) = f(y_n|Y_{n-1},\theta)\,f(y_{n-1}|Y_{n-2},\theta)\cdots f(y_1,\theta),$$
so the average log-likelihood uses
$$\ln L(\theta) = \sum_{t=1}^n\ln f(y_t|Y_{t-1},\theta).$$
Define
mt Yt D ln f yt Yt
1
as the score of the t th observation. It can be shown that, under the regularity conditions, that the scores have conditional mean zero when evaluated at 0 (see notes to
Introduction to Econometrics):
E mt Yt 0 Yt
0
1 n mt Yt
t 1
1 n D ln f yt Yt
1
0
t 1
which are precisely the first order conditions of MLE. Therefore, MLE can be inter
1D
1
D
m Yt
1 n D2 ln f yt Yt
1
t 1
s
298
uncorrelation follows from the fact that conditional uncorrelation hold regardless of the realization of Yt
preserves
uncorrelation (see the section on ML estimation, above). The fact that the scores
are serially uncorrelated implies that can be estimated by the estimator of the
0th autocovariance of the moment conditions:
n
1 n mt Yt mt Yt
1 n D ln f yt Yt
t 1
D ln f yt Yt
1
t 1
Recall from study of ML estimation that the information matrix equality (equation ??)
states that
D ln f yt Yt
1
0
D ln f yt Yt
1
0
E D2 ln f yt Yt
1
0
This result implies the well known (and already seeen) result that we can estimate V
in any of three ways:
tn 1 D2 ln f yt Yt 1
f yt Yt 1 D ln f yt Yt 1
1
n
tn
D ln
tn 1 D2 ln
f yt Yt
1
or the inverse of the negative of the Hessian (since the middle and last term
cancel, except for a minus sign):
1 n
D2 ln
t 1
299
1
f yt Yt
1
or the inverse of the outer product of the gradient (since the middle and last
cancel except for a minus sign, and the first term converges to minus the inverse
of the middle term, which is still inside the overall inverse)
1 n D ln f yt Yt
1
1
D ln f yt Yt
1
t 1
This simplification is a special result for the MLE estimator - it doesn't apply to GMM
estimators in general.
Asymptotically, if the model is correctly specified, all of these forms converge to
the same limit. In small samples they will differ. In particular, there is evidence that the
outer product of the gradient formula does not perform very well in small samples (see
Davidson and MacKinnon, pg. 477). Whites Information matrix test (Econometrica,
1982) is based upon comparing the two ways to estimate the information matrix: outer
product of gradient or negative of the Hessian. If they differ by too much, this is
evidence of misspecification of the model.
300
t 0
The objec-
s E
u ct
It
s
(30)
s 0
It is the information set at time t and includes the all realizations of random
Suppose the consumer can invest in a risky asset. A dollar invested in the asset
yields a gross return
1 rt
1
pt
dt
pt
where pt is the price and dt is the dividend in period t The price of ct is normalized to 1
Current wealth wt
1 rt it
1,
where it
is investment in period t
1. So the
problem is to allocate current wealth between current consumption and investment to finance future consumption: wt
s
ct it .
A partial set of necessary conditions for utility maximization have the form:
u
ct
1 rt
1
u
ct
1
It
(31)
To see that the condition is necessary, suppose that the lhs < rhs. Then by reduc
301
there is no discounting of the current period. At the same time, the marginal reduc
1
which
1 rt
1
u
ct
1
It
To use this we need to choose the functional form of utility. A constant relative risk aversion form is
$$u(c_t) = \frac{c_t^{1-\gamma} - 1}{1-\gamma},$$
where $\gamma$ is the coefficient of relative risk aversion, so that
$$u'(c_t) = c_t^{-\gamma}.$$
With this form, the necessary condition becomes
$$c_t^{-\gamma} = \beta\, E\left[\left(1 + r_{t+1}\right)c_{t+1}^{-\gamma}\,\middle|\, I_t\right].$$
While we could use this to define moment conditions, it is unlikely that $c_t$ is stationary, even though it is in real terms, and our theory requires stationarity. To solve this, divide through by $c_t^{-\gamma}$:
$$E\left[1 - \beta\left(1 + r_{t+1}\right)\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}\,\middle|\, I_t\right] = 0$$
(note that ct can be passed though the conditional expectation since ct is chosen based
only upon information available in time t
Suppose that xt is a vector of variables drawn from the information set It We can
302
define the moment contributions as the Euler-equation error interacted with the instruments:
$$m_t(\theta) = \left[1 - \beta\left(1 + r_{t+1}\right)\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}\right]x_t,$$
where $\theta = (\beta,\gamma)$. Therefore, the above expression may be interpreted as a moment condition which can be used for GMM estimation of the parameters $\theta^0$.
lim E nm 0 m 0
1 n mt mt
t 1
obtained by setting the weighting matrix W arbitrarily (to an identity matrix, for example). After obtaining we then minimize
s
1m
m
This process can be iterated, e.g., use the new estimate to re-estimate use this to
This whole approach relies on the very strong assumption that equation 31 holds
303
without error. Supposing agents were heterogeneous, this wouldn't be reasonable. If there were an error term here, it could potentially be autocorrelated, which would no longer allow any variable in the information set to be used as an instrument.
In principle, we could use a very large number of moment conditions in estimation, since any current or lagged variable could be used in xt Since use of more
moment conditions will lead to a more (asymptotically) efficient estimator, one
might be tempted to use many instrumental variables. We will do a computer lab
that will show that this may not be a good idea with finite samples. This issue has
been studied using Monte Carlos (Tauchen, JBES, 1986). The reason for poor
performance when using many instruments is that the estimate of becomes
very imprecise.
Empirical papers that use this approach often have serious problems in obtaining
precise estimates of the parameters. Note that we are basing everything on a
single partial first order condition. Probably this f.o.c. is simply not informative
enough. Simulation-based estimation methods (discussed below) are one means
of trying to use more informative moment conditions to estimate this sort of
model.
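To make the setup concrete, here is a hypothetical sketch (not the notes' code) of the moment contributions and GMM objective for this model, under the CRRA assumption $u'(c) = c^{-\gamma}$ used above; array names and the instrument matrix are placeholders.

```python
import numpy as np

def euler_moments(theta, c, r, X):
    """Moment contributions m_t for the consumption Euler equation.
    theta = (beta, gamma); c: consumption levels (T+1,); r: net returns (T+1,);
    X: (T, g) instruments drawn from the time-t information set.
    h_t = 1 - beta*(1 + r_{t+1})*(c_{t+1}/c_t)**(-gamma)."""
    beta, gamma = theta
    growth = c[1:] / c[:-1]
    h = 1.0 - beta * (1.0 + r[1:]) * growth**(-gamma)
    return X * h[:, None]                  # (T, g) array of Z_t * h_t

def gmm_objective(theta, c, r, X, W):
    m = euler_moments(theta, c, r, X)
    mbar = m.mean(axis=0)
    return mbar @ W @ mbar                 # s_n(theta) = mbar' W mbar
```

Minimizing `gmm_objective` with an identity `W`, forming a Newey-West estimate of $\Omega$ from the resulting moment contributions, and re-minimizing with $W = \hat\Omega^{-1}$ reproduces the two-step procedure described above.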
18.11 Problems
1. Perform GMM estimation of the rational expectations model described above
using the data in the file gmmdata, located on the volcano server. The columns
of this data file are c p and d in that order. There are 95 observations (source:
and to convergence.
Comment on the results. Are the results sensitive to the set of instruments used? (Look at $\hat\Omega$ as well as $\hat\theta$.) Are these good instruments? Are the instruments highly correlated with one another?
305
19 Quasi-ML
Quasi-ML is the estimator one obtains when a misspecified probability model is used
to calculate an ML estimator.
Given a sample of size n of a random vector y and a vector of conditioning variables
x suppose the joint density of Y
y1
yn
conditional on X
x1
xn
pY Y X 0
fully characterizes the random characteristics of samples: e.g., it fully describes the
probabilistically important features of the d.g.p. The likelihood function is just this
density evaluated at other values
L Y X
pY Y X
Let Yt
y1
yt
,Y
0 and let Xt
x1
xt
The like-
pt
L Y X
t 1
n
yt Yt
pt
t 1
306
1
Xt
1 n
ln pt
nt 1
1
ln L Y X
n
sn
Suppose that we do not have knowledge of the family of densities pt Mistakenly, we may assume that the conditional density of yt is a member of the family
pt yt Yt
Xt
ft yt Yt
Xt 0
1
Xt 0
This setup allows for heterogeneous time series data, with dynamic misspecification.
The QML estimator is the argument that maximizes the misspecified average log likelihood, which we refer to as the quasi-log likelihood function. This objective function
is
1 n
ln ft yt Yt
nt 1
sn
1
Xt 0
1 n
ln ft
nt 1
arg max sn
sn
as
1 n
lim E ln ft
n n
t 1
s
We assume that this can be strengthened to uniform convergence, a.s., following the
307
arg max s
0 a.s.
is compact
definite in a neighborhood of 0
Applying the asymptotic normality theorem,
n
0
d N 0 J 0
I 0 J 0
where
J 0
lim E D2 sn 0
and
I 0
lim Var nD sn 0
308
1
Note that asymptotic normality only requires that the additional assumptions
regarding J and I hold in a neighborhood of 0 for J and at 0 for I not
1 n 2
D ln ft n
nt 1
as
lim E
1 n 2
D ln ft 0
nt 1
J 0
That is, just calculate the Hessian using the estimate n in place of 0
Notation: Let gt
D ft 0
We need to estimate
I 0
lim Var nD sn 0
lim Var n
1 n
D ln ft 0
nt 1
1
lim Var gt
n
t 1
n
1
lim E
n n
gt
n
E gt
t 1
t 1
gt
E gt
n
t 1
lim
which will not tend to zero, in general. This term is not consistently estimable in
309
general, since it requires calculating an expectation using the true density under the
d.g.p., which is unknown.
joint distribution of yt xt is identical. This does not imply that the conditional
density f yt xt is identical).
With random sampling, the limiting objective function is simply
EX E0 ln f y x 0
s 0
marginal density of x
By the requirement that the limiting objective function be maximized at 0 we
have
D EX E0 ln f y x 0
D s 0
D EX E0 ln f y x 0
EX E0 D ln f y x 0
nt 1
310
d N 0 I 0
That is, its not necessary to subtract the individual means, since they are zero.
Given this, and due to independent observations, a consistent estimator is
1 n
D ln ft D ln ft
nt 1
This is an important case where consistent estimation of the covariance matrix is possible. Other cases exist, even for dynamically misspecified time series models.
311
f xt 0
yt
t
In general, t will be heteroscedastic and autocorrelated, and possibly nonnormally distributed. However, dealing with this is exactly as in the case of linear
models, so well just treat the iid case here,
iid 0 2
y1 y2 yn
f x1 f x 1
f x1
and
1 2 n
f
312
1
y
n
arg min sn
f
y
1
n
f
y
f
The estimator minimizes the weighted sum of squared errors, which is the same
1
y
y
n
sn
2y
f
f
f
f y f f
0
D f
F
(32)
In shorthand, use F in place of F Using this, the first order conditions can be written
as
F
y F
f 0
or
F
y
f
0
(33)
This bears a good deal of similarity to the f.o.c. for the linear model - the derivative of
X
y
X
X
313
0
X then F is simply X
Note that the nonlinearity of the manifold leads to potential multiple local max
20.2 Identification
As before, identification can be considered conditional on the sample, and asymp
s
objective function:
1 n
yt
n t1
sn
f xt 2
1 n
f xt 0
nt 1
ft xt
2
1 n
ft 0
nt 1
ft
2
2
ft 0
nt 1
1 n
t
n t1
ft t
Turning to the first term, well assume a pointwise law of large numbers applies,
so
1 n
ft 0
nt 1
ft
2as
f z 0
f z
d z
(34)
1 exp
x
f : K
1
2
s
f x 0
f x
2
d x
f x 0
f x
2
2
d x
D f z 0
D f z 0
d z
the expectation of the outer product of the gradient of the regression function evaluated
at 0 (Note: the uniform boundedness we have already assumed allows passing the
derivative through the integral, by the dominated convergence theorem.) This matrix
will be positive definite (wp1) as long as the gradient vector is of full rank (wp1). The
tangent space to the regression manifold must span a K -dimensional space if we are
315
2 lim E
J 0
F
F
n
20.3 Consistency
We simply assume that the conditions of Theorem 58 hold, so the estimator is consistent. Given that the strong stochastic equicontinuity conditions hold, as discussed above, and given the above identification conditions and a compact estimation space (the closure of the parameter space $\Theta$), the consistency proof's assumptions are satisfied.
0
d N 0 J 0 1 I 0 J 0
n D sn 0
2
sn
D sn 0
316
evaluated at 0 and
a s
I 0
sn
f xt 2
So
2 n
yt
n t1
D sn
f xt D f xt
Evaluating at 0
2 n
t D f xt 0
nt 1
D sn 0
n D sn 0
D sn 0
4
n
t D f xt 0
t 1
t 1
Noting that
n
t D f
0
f
xt 0
t 1
n D sn 0
D sn 0
4
F
F
n
I 0
42 lim E
317
F
F
n
t D f xt 0
2 lim E
J 0
F
F
n
where the expectation is with respect to the joint density of x and Combining these
expressions for J 0 and I 0 and the result of the asymptotic normality theorem,
we get
n
0
d N 0 lim E F
F
2
(35)
f
f
the obvious estimator. Note the close correspondence to the results for the linear
model.
f yt
exp
t t t
yt
yt !
318
0 1 2
The mean of yt is t as is the variance. Note that t must be positive. Suppose that the
true mean is
t0
exp xt
0
arg min sn
exp xt
2
We can write
sn
1 n
exp xt
0 t
Tt 1
exp xt
1 n
exp xt
0
Tt 1
exp xt
1 n 2
1 n
2
t exp xt
0
t
Tt 1
Tt 1
exp xt
The last term has expectation zero since the assumption that E yt xt
implies that E t xt
exp xt
0
with t Applying a strong LLN, and noting that the objective function is continuous
on a compact parameter space, we get
s
Ex exp x
0 exp x
Ex exp x
0
where the last term comes from the fact that the conditional variance of is the same
as the variance of y This function is clearly minimized at
n
0
sn
319
f 0
f
where is a combination of the fundamental error term and the error due to evaluating the regression function at rather than the true value 0 Take a first order Taylors
series approximation around a point 1 :
f 1
D f 1
1
approximationerror.
where, as above, F 1
F 1 b ,
sion function, evaluated at 1 and is plus approximation error from the truncated
Taylors series.
Similarly, z y
320
1
a new Taylors series expansion around 2 and repeat the process. Stop when
b
To see why this might work, consider the above approximation, but evaluated at the
NLS estimator:
y
The OLS estimate of b
f
F
is
b
F
F
1
F
y
f
f
0
by definition of the NLS estimator (these are the normal equations as in equation 33,
Since b 0 when we evaluate at updating would stop.
The Gauss-Newton method doesn't require second derivatives, as does the Newton-Raphson method. The varcov estimator is also simple to calculate directly, since it's just the OLS varcov estimator from the last iteration.
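A minimal Gauss-Newton sketch (not from the notes): the user supplies the fitted-value function and its Jacobian, and the current residuals are regressed on the Jacobian at each step; names are hypothetical and no step-size control is included.

```python
import numpy as np

def gauss_newton(y, f, F, theta, tol=1e-8, maxit=200):
    """NLS by Gauss-Newton.
    f(theta) -> (n,) fitted values; F(theta) -> (n, K) Jacobian."""
    for _ in range(maxit):
        resid = y - f(theta)
        J = F(theta)
        b, *_ = np.linalg.lstsq(J, resid, rcond=None)  # OLS of residuals on J
        theta = theta + b
        if np.max(np.abs(b)) < tol:                    # stop when b is negligible
            break
    return theta
```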
321
y
When evaluated at 2
tion, so F will have rank that is essentially 2, rather than 3. In this case, F
F
Readings: Davidson and MacKinnon, Ch. 15 (a quick reading is sufficient), J. Heckman, Sample Selection Bias as a Specification Error, Econometrica, 1979 (This is a
classic article, not required for reading, and which is a bit out-dated. Nevertheless it's
a good place to start if you encounter sample selection problems in your research).
Sample selection is a common problem in applied research. The problem occurs
when observations used in estimation are sampled non-randomly, according to some
selection scheme.
Characteristics of individual: x
Offer wage: wo
z
322
Reservation wage: wr
q
r
Assume that
N
0
We assume that the offer wage and the reservation wage, as well as the latent variable
1w
ws
0
In other words, we observe whether or not a person is working. If the person is work
Otherwise,
s Note that we are using a simplifying assumption that individuals can freely
x
residual
323
0 or equivalently,
r
and
0
r
since and are dependent. Furthermore, this expectation will in general depend on x
since elements of x can enter in r Because of these two facts, least squares estimation
is biased and inconsistent.
Consider more carefully E
r
Given the joint normality of and we
can write (see for example Spanos Statistical Foundations of Econometric Modelling,
pg. 122)
where has mean zero and is independent of . With this we can write
x
r
we get
x
E
r
z
N 0 1
E zz
z
z
z
where
and
324
respectively. The quantity on the RHS above is known as the inverse Mills ratio:
z
z
IMR z
x
r
r
r
r
where
(36)
(37)
The error term has conditional mean zero, and is uncorrelated with
the regressors x
r
r
Heckman showed how one can estimate this in a two step procedure where first
is estimated, then equation 37 is estimated by least squares using the estimated
value of to form the regressors. This is inefficient and estimation of the covariance is a tricky issue. It is probably easier (and more efficient) just to do
MLE.
The model presented above depends strongly on joint normality. There exist
many alternative models which weaken the maintained assumptions. It is possible to estimate consistently without distributional assumptions. See Ahn and
Powell, Journal of Econometrics, 1994.
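For illustration, here is a minimal sketch of the two-step idea (a probit for selection, then OLS including the inverse Mills ratio). It is not the author's code, the variable names are hypothetical, and, as noted above, the reported second-step standard errors would need correction (or one would just do MLE).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def heckman_two_step(y, x, s, w):
    """s: 0/1 selection indicator; w: selection-equation regressors;
    y observed (used) only where s == 1; x: outcome-equation regressors."""
    def probit_nll(gamma):
        p = np.clip(norm.cdf(w @ gamma), 1e-10, 1 - 1e-10)
        return -np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))
    gamma_hat = minimize(probit_nll, np.zeros(w.shape[1]), method="BFGS").x
    idx = s == 1
    z = w[idx] @ gamma_hat
    imr = norm.pdf(z) / norm.cdf(z)                  # inverse Mills ratio
    Xa = np.column_stack([x[idx], imr])              # augment regressors
    coef, *_ = np.linalg.lstsq(Xa, y[idx], rcond=None)
    return gamma_hat, coef
```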
325
326
variance
mean/var
max
% zeros
OBDV
3.4120
37.446
0.091117
68.000
0.32000
OPV
0.20400
1.0944
0.18641
20.000
0.88800
ERV
0.18400
0.30614
0.60102
6.0000
0.86400
IPV
0.076000
0.14222
0.53437
5.0000
0.94600
DV
1.0360
3.1107
0.33304
16.000
0.55800
PRESCR
8.0500
214.39
0.037549
107.00
0.29000
Since health care visits are count data, a simple approach to modeling demand
could be based upon the Poisson model. Recall that the Poisson model is
$$f_Y(y) = \frac{\exp(-\lambda)\lambda^y}{y!},\qquad \lambda = \exp(x'\beta).$$
Recall that the Poisson model imposes that the conditional mean equals the conditional variance (equidispersion). We see from the above descriptive statistics that the
data are all unconditionally overdispersed, since the unconditional variance is greater
than the unconditional mean. To achieve conditional equidispersion, the model would
have to fit quite well.
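A minimal sketch of Poisson estimation by ML (not the code used to produce the results below; names are hypothetical):

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

def poisson_negloglik(beta, y, X):
    """Average negative log-likelihood, with conditional mean exp(x'beta)."""
    xb = X @ beta
    return -np.mean(y * xb - np.exp(xb) - gammaln(y + 1.0))

def poisson_ml(y, X):
    start = np.zeros(X.shape[1])
    return minimize(poisson_negloglik, start, args=(y, X), method="BFGS").x
```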
327
-3.8679
params
t(OPG)
t(Sand.)
t(Hess)
constant
-0.51541
-8.5242
-1.0992
-3.2325
pub_ins
0.61054
16.999
3.0582
7.6966
priv_ins
0.18459
5.1354
1.1697
2.4819
sex
0.35452
21.396
2.1007
7.0053
age
0.022112
24.396
4.3966
10.795
educ
0.027979
8.6896
0.93269
2.9554
inc
0.0070852
2.2891
0.30328
0.87485
Information Criteria
Consistent Akaike
3918.4
Schwartz
3911.4
Hannan-Quinn
3893.5
Akaike
3881.9
**************************************************************************
The insurance variables have the expected sign, but PRIV is not significant.
328
Women and older people make more visits. Income appears not to affect de-
329
-0.49978
params
t(OPG)
t(Sand.)
t(Hess)
constant
-1.1669
-2.0607
-1.6099
-1.8912
pub_ins
0.65307
2.3722
1.7257
2.3114
priv_ins
-0.26764
-0.93634
-0.83555
-0.90040
sex
-0.57001
-2.7777
-2.0050
-2.6389
age
0.0037963
0.60114
0.32714
0.45393
educ
0.0010258
0.026424
0.024977
0.026173
inc
-0.12531
-2.2085
-2.2781
-2.3102
Information Criteria
Consistent Akaike
550.29
Schwartz
543.29
Hannan-Quinn
525.36
Akaike
513.78
**************************************************************************
330
Women are less likely to make emergency room visits compared to men.
Richer people make fewer visits, and the effect seems to be significant. Perhaps
poor people do not have good insurance coverage and use emergency visits as a
substitute for preventive care?
There is less difference between the three forms of the t-statistics. Is this an
indication that the Poisson model might work better for ERV than for OBDV?
To check the plausibility of the Poisson model, we can compare the sample unconditional variance with the estimated unconditional variance according to the Poisson
model: $\widehat V(y) = \frac{1}{n}\sum_{t=1}^n\hat\lambda_t$. For OBDV and ERV, comparing this with the sample unconditional variance, we see that even after conditioning, the overdispersion is not captured in either case. There is a huge problem with OBDV, and a significant problem with ERV. In both cases the Poisson model does not appear to be plausible.
331
One way to obtain a more flexible model is to add unobserved heterogeneity to the conditional mean:
$$f_Y(y|x,\varepsilon) = \frac{\exp(-\theta)\theta^y}{y!},\qquad \theta = \exp(x'\beta + \varepsilon) = \exp(x'\beta)\exp(\varepsilon) = \lambda\nu,$$
where $\lambda = \exp(x'\beta)$ and $\nu = \exp(\varepsilon)$. The marginal density of $y$ is obtained by integrating out the heterogeneity:
$$f_Y(y|x) = \int\frac{\exp(-\lambda\nu)(\lambda\nu)^y}{y!}\,f_\nu(\nu)\,d\nu.$$
This density can be used directly, perhaps using numerical integration to evaluate the likelihood function. In some cases, though, the integral will have an analytic solution.
For example, if $\nu$ follows a certain one-parameter gamma density, then
$$f_Y(y|x) = \frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)}\left(\frac{\psi}{\psi+\lambda}\right)^\psi\left(\frac{\lambda}{\psi+\lambda}\right)^y, \qquad (38)$$
where $\psi$ is the gamma parameter. We again parameterize the conditional mean as $\lambda = \exp(x'\beta)$. If $\psi = \lambda/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda + \alpha\lambda$; this is the NB-I model. If $\psi = 1/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda + \alpha\lambda^2$; this is the NB-II model.
332
So both forms of the NB model allow for overdispersion, with the NB-II model allowing for a more radical form.
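For reference, a sketch of the NB-II average log-likelihood under the parameterization just given ($\psi = 1/\alpha$, so $V(y|x) = \lambda + \alpha\lambda^2$); this is illustrative, not the code behind the results below, and the names are hypothetical.

```python
import numpy as np
from scipy.special import gammaln

def nb2_loglik(params, y, X):
    """Average log-likelihood of the NB-II model.
    params = (beta, ln_alpha); lambda = exp(x'beta), psi = 1/alpha."""
    beta, ln_alpha = params[:-1], params[-1]
    alpha = np.exp(ln_alpha)            # alpha estimated via its log, unrestricted
    lam = np.exp(X @ beta)
    psi = 1.0 / alpha
    ll = (gammaln(y + psi) - gammaln(y + 1.0) - gammaln(psi)
          + psi * np.log(psi / (psi + lam)) + y * np.log(lam / (psi + lam)))
    return np.mean(ll)
```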
333
-2.2656
t-Stats
params
t(OPG)
t(Sand.)
-0.055766
-0.16793
-0.17418
-0.17215
pub_ins
0.47936
2.9406
2.8296
2.9122
priv_ins
0.20673
1.3847
1.4201
1.4086
sex
0.34916
3.2466
3.4148
3.3434
age
0.015116
3.3569
3.8055
3.5974
educ
0.014637
0.78661
0.67910
0.73757
inc
0.012581
0.60022
0.93782
0.76330
1.7389
23.669
11.295
16.660
constant
ln_alpha
Information Criteria
Consistent Akaike
2323.3
Schwartz
2315.3
Hannan-Quinn
2294.8
Akaike
2281.6
334
t(Hess)
-2.2616
t-Stats
params
t(OPG)
t(Sand.)
constant
-0.65981
-1.8913
-1.4717
-1.6977
pub_ins
0.68928
2.9991
3.1825
3.1436
priv_ins
0.22171
1.1515
1.2057
1.1917
sex
0.44610
3.8752
2.9768
3.5164
age
0.024221
3.8193
4.5236
4.3239
educ
0.020608
0.94844
0.74627
0.86004
inc
0.020040
0.87374
0.72569
0.86579
0.47421
5.6622
4.6278
5.6281
ln_alpha
Information Criteria
Consistent Akaike
2319.3
Schwartz
2311.3
Hannan-Quinn
2290.8
Akaike
2277.6
335
t(Hess)
**************************************************************************
For the OBDV model, the NB-II model does a better job, in terms of the average
log-likelihood and the information criteria.
Note that both versions of the NB model fit much better than does the Poisson
model.
The t-statistics are now similar for all three ways of calculating them, which
might indicate that the serious specification problems of the Poisson model for
the OBDV data are partially solved by moving to the NB model.
To check the plausibility of the NB-II model, we can compare the sample unconditional variance with the estimated unconditional variance according to the NB-II
model: $\widehat V(y) = \frac{1}{n}\sum_{t=1}^n\left(\hat\lambda_t + \hat\alpha\hat\lambda_t^2\right)$. For OBDV and ERV we get the values in Table 2. The overdispersion problem is significantly better than in the Poisson case, but
Table 2: Marginal Variances, Sample and Estimated (NB-II)

            OBDV     ERV
Sample      37.446   0.30614
Estimated   26.962   0.27620
there is still some overdispersion that is not captured, for both OBDV and ERV.
We can also compare the actual count frequencies with the fitted frequencies, where the fitted frequency for count $j$ is $\frac{1}{n}\sum_{i=1}^n \hat f_Y(j|x_i)$. We see that for the OBDV measure, there are many more actual zeros than predicted. For ERV, there are somewhat more actual zeros than fitted, but the difference is not too important.
Why might OBDV not fit the zeros well? What if people made the decision to
contact the doctor for a first visit, they are sick, then the doctor decides on whether or
not follow-up visits are needed. This is a principal/agent type situation, where the total
number of visits depends upon the decision of both the patient and the doctor. Since
different parameters may govern the two decision-makers choices, we might expect
that different parameters govern the probability of zeros versus the other counts. Let
$\theta_p$ be the parameters of the patient's demand for visits, and let $\theta_d$ be the parameter of
the doctors demand for visits. The patient will initiate visits according to a discrete
choice model, for example, a logit model:
$$\Pr(Y = 0) = f_Y(0|\theta_p) = \frac{1}{1 + \exp(x'\theta_p)}$$
$$\Pr(Y > 0) = 1 - \frac{1}{1 + \exp(x'\theta_p)}.$$
The above probabilities are used to estimate the binary 0/1 hurdle process. Then, for the observations where visits are positive, a truncated Poisson density is estimated. This density is
$$f_Y(y|\theta_d, y > 0) = \frac{f_Y(y|\theta_d)}{\Pr(y > 0|\theta_d)} = \frac{f_Y(y|\theta_d)}{1 - \exp(-\lambda_d)},$$
where $\lambda_d = \exp(x'\theta_d)$, since under the Poisson model $\Pr(y = 0|\theta_d) = \exp(-\lambda_d)$.
Since the hurdle and truncated components of the overall density for Y share no parameters, they may be estimated separately, which is computationally more efficient
than estimating the overall model. (Recall that the BFGS algorithm, for example, will
have to invert the approximated Hessian. The computational overhead is of order K 2
where K is the number of parameters to be estimated) . The expectation of Y is
$$E(Y|x) = \Pr(Y > 0|x)\,E(Y|Y > 0, x) = \Pr(Y > 0|x)\,\frac{\lambda_d}{1 - \exp(-\lambda_d)},$$
where $\lambda_d = \exp(x'\theta_d)$.
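A sketch of the separate estimation of the two parts (illustrative only, not the code behind the results that follow; names hypothetical):

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

def hurdle_poisson(y, X):
    """Hurdle Poisson: a logit for Pr(y > 0) and a zero-truncated Poisson for
    the positive counts; the parts share no parameters, so estimate separately."""
    d = (y > 0).astype(float)

    def logit_nll(tp):                        # hurdle (participation) part
        p = np.clip(1.0 / (1.0 + np.exp(-X @ tp)), 1e-10, 1 - 1e-10)
        return -np.mean(d * np.log(p) + (1 - d) * np.log(1 - p))

    Xp, yp = X[y > 0], y[y > 0]

    def trunc_nll(td):                        # truncated Poisson for y > 0
        lam = np.exp(Xp @ td)
        ll = yp * np.log(lam) - lam - gammaln(yp + 1.0) - np.log1p(-np.exp(-lam))
        return -np.mean(ll)

    k = X.shape[1]
    theta_p = minimize(logit_nll, np.zeros(k), method="BFGS").x
    theta_d = minimize(trunc_nll, np.zeros(k), method="BFGS").x
    return theta_p, theta_d
```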
-0.58939
t-Stats
params
t(OPG)
t(Sand.)
t(Hess)
constant
-1.5502
-2.5709
-2.5269
-2.5560
pub_ins
1.0519
3.0520
3.0027
3.0384
priv_ins
0.45867
1.7289
1.6924
1.7166
sex
0.63570
3.0873
3.1677
3.1366
age
0.018614
2.1547
2.1969
2.1807
educ
0.039606
1.0467
0.98710
1.0222
inc
0.077446
1.7655
2.1672
1.9601
Information Criteria
Consistent Akaike
639.89
Schwartz
632.89
Hannan-Quinn
614.96
Akaike
603.39
**************************************************************************
339
-2.7042
t-Stats
params
t(OPG)
t(Sand.)
constant
0.54254
7.4291
1.1747
3.2323
pub_ins
0.31001
6.5708
1.7573
3.7183
priv_ins
0.014382
0.29433
0.10438
0.18112
sex
0.19075
10.293
1.1890
3.6942
age
0.016683
16.148
3.5262
7.9814
educ
0.016286
4.2144
0.56547
1.6353
-0.0079016
-2.3186
-0.35309
-0.96078
inc
t(Hess)
Information Criteria
Consistent Akaike
2754.7
Schwartz
2747.7
Hannan-Quinn
2729.8
Akaike
2718.2
**************************************************************************
340
Fitted and actual probabilites (NB-II fits are provided as well) are:
Table 4: Actual and Hurdle Poisson fitted frequencies
Count
OBDV
Count Actual Fitted HP Fitted NB-II
0
0.32
0.32
0.34
1
0.18
0.035
0.16
2
0.11
0.071
0.11
3
0.10
0.10
0.08
4
0.052
0.11
0.06
5
0.032
0.10
0.05
Actual
0.86
0.10
0.02
0.004
0.002
0
ERV
Fitted HP Fitted NB-II
0.86
0.86
0.10
0.10
0.02
0.02
0.006
0.006
0.002
0.002
0.0005
0.001
For the Hurdle Poisson models, the ERV fit is very accurate. The OBDV fit is not
so good. Zeros are exact, but 1s and 2s are underestimated, and higher counts are
overestimated. For the NB-II fits, performance is at least as good as the hurdle Poisson
model, and one should recall that many fewer parameters are used. Hurdle versions of the negative binomial model are also widely used.
341
A finite mixture of count densities takes the form
$$f_Y\left(y|x,\phi_1,\ldots,\phi_p,\pi_1,\ldots,\pi_{p-1}\right) = \sum_{i=1}^{p-1}\pi_i\, f_Y^{(i)}(y|x,\phi_i) + \pi_p\, f_Y^{(p)}(y|x,\phi_p),$$
where $\pi_i > 0$, $i = 1,2,\ldots,p$, $\pi_p = 1 - \sum_{i=1}^{p-1}\pi_i$, and $\sum_{i=1}^p\pi_i = 1$. Identification requires that the mixing probabilities be ordered in some way, for example $\pi_1\geq\pi_2\geq\cdots\geq\pi_p$, and that $\phi_i\neq\phi_j$, $i\neq j$.
This is simple to accomplish post-estimation by rearrangement and possible elimination of redundant component densities.
The properties of the mixture density follow in a straightforward way from those
of the components. In particular, the moment generating function is the same
mixture of the moment generating functions of the component densities, so, for
example, $E(Y|x) = \sum_{i=1}^p\pi_i\,\mu_i(x)$, where $\mu_i(x)$ is the mean of the $i$-th component density.
Mixture densities may suffer from overparameterization, since the total number of parameters grows rapidly with the number of component densities. It is
possible to constrain parameters across the mixtures.
Testing for the number of component densities is a tricky issue. For example, testing for $p = 1$ (a single component) versus $p = 2$ (a mixture) involves the restriction $\pi_1 = 1$, which is on the boundary of the parameter space. Not only that, but the parameters of the second component can take on any value without affecting the density. Usual
methods such as the likelihood ratio test are not applicable when parameters
are on the boundary under the null hypothesis. Information criteria means of
choosing the model (see below) are valid.
The following are results for a mixture of 2 negative binomial (NB-I) models, for the
342
OBDV data.
343
**************************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500
Function value
-2.2312
t-Stats
params
t(OPG)
t(Sand.)
0.64852
1.3851
1.3226
1.4358
pub_ins
-0.062139
-0.23188
-0.13802
-0.18729
priv_ins
0.093396
0.46948
0.33046
0.40854
sex
0.39785
2.6121
2.2148
2.4882
age
0.015969
2.5173
2.5475
2.7151
educ
-0.049175
-1.8013
-1.7061
-1.8036
inc
0.015880
0.58386
0.76782
0.73281
ln_alpha
0.69961
2.3456
2.0396
2.4029
constant
-3.6130
-1.6126
-1.7365
-1.8411
pub_ins
2.3456
1.7527
3.7677
2.6519
priv_ins
0.77431
0.73854
1.1366
0.97338
sex
0.34886
0.80035
0.74016
0.81892
age
0.021425
1.1354
1.3032
1.3387
educ
0.22461
2.0922
1.7826
2.1470
inc
0.019227
0.20453
0.40854
0.36313
2.8419
6.2497
6.8702
7.6182
0.85186
1.7096
1.4827
1.7883
constant
ln_alpha
logit_inv_mix
Information Criteria
344
t(Hess)
Consistent Akaike
2353.8
Schwartz
2336.8
Hannan-Quinn
2293.3
Akaike
2265.2
**************************************************************************
Delta method for mix parameter st.
mix
se_mix
0.70096
0.12043
err.
The 95% confidence interval for the mix parameter is perilously close to 1, which
suggests that there may really be only one component density, rather than a
mixture. Again, this is not the way to test this - it is merely suggestive.
Education is interesting. For the subpopulation that is healthy, i.e., that makes
relatively few visits, education seems to have a positive effect on visits. For the
unhealthy group, education has a negative effect on visits. The other results
are more mixed. A larger sample could help clarify things.
The following are results for a 2 component constrained mixture negative binomial
model, where all the slope parameters in $\lambda_j = \exp(x'\beta_j)$ are constrained to be the same across the two components. The constants and the overdispersion parameters $\alpha_j$ are allowed to differ for the two components.
345
**************************************************************************
MEPS data, OBDV
cmixnegbin results
Strong convergence
Observations = 500
Function value
-2.2441
t-Stats
params
t(OPG)
t(Sand.)
constant
-0.34153
-0.94203
-0.91456
-0.97943
pub_ins
0.45320
2.6206
2.5088
2.7067
priv_ins
0.20663
1.4258
1.3105
1.3895
sex
0.37714
3.1948
3.4929
3.5319
age
0.015822
3.1212
3.7806
3.7042
educ
0.011784
0.65887
0.50362
0.58331
inc
0.014088
0.69088
0.96831
0.83408
ln_alpha
1.1798
4.6140
7.2462
6.4293
const_2
1.2621
0.47525
2.5219
1.5060
lnalpha_2
2.7769
1.5539
6.4918
4.2243
logit_inv_mix
2.4888
0.60073
3.7224
1.9693
Information Criteria
Consistent Akaike
2323.5
Schwartz
2312.5
Hannan-Quinn
346
t(Hess)
2284.3
Akaike
2266.1
**************************************************************************
Delta method for mix parameter st.
mix
se_mix
0.92335
0.047318
err.
The slope parameter estimates are pretty close to what we got with the NB-I
model.
The information criteria used above are defined as
$$\mathrm{CAIC} = -2\ln L(\hat\theta) + k\left(\ln n + 1\right)$$
$$\mathrm{BIC} = -2\ln L(\hat\theta) + k\ln n$$
$$\mathrm{AIC} = -2\ln L(\hat\theta) + 2k.$$
It can be shown that the CAIC and BIC will select the correctly specified model from a group of models, asymptotically. This doesn't mean, of course, that the correct model is necessarily in the group. The AIC is not consistent, and will asymptotically favor an over-parameterized model over the correctly specified model. Here are information criteria values for the models we've seen, for OBDV. According to the AIC, the best
Table 5: Information Criteria, OBDV
Model
Poisson
NB-I
Hurdle Poisson
MNB-I
CMNB-I
AIC
3822
2282
3333
2265
2266
BIC CAIC
3911 3918
2315 2323
3381 3395
2337 2354
2312 2323
is the MNB-I, which has relatively many parameters. The best according to the BIC is
CMNB-I, and according to CAIC, the best is NB-I. The Poisson-based models do not
do well.
22 Nonparametric inference
22.1 Possible pitfalls of parametric inference: estimation
Readings: H. White (1980) Using Least Squares to Approximate Unknown Regression Functions, International Economic Review, pp. 149-70.
In this section we consider a simple example, which illustrates both why nonparametric methods may in some cases be preferred to parametric methods.
348
f x
f x
3x
2
x
2
2
the range of x.
series approximation to f x about some point x0 Flexible functional forms such as the
transcendental logarithmic (usually known as the translog) can be interpreted as second order Taylor's series approximations. We'll work with a first order approximation, for
simplicity. Approximating about x0 :
h x
f x0
Dx f x0 x
0 we can write
h x
a bx
x0
n
1 n yt
s a b
h xt
t 1
The limiting objective function, following the argument we used to get equations 17 and 34, is
s_∞(a, b) = ∫_0^{2π} (f(x) - h(x))² dx
The theorem regarding the consistency of extremum estimators (Theorem 58) tells us that â and b̂ will converge almost surely to the values that minimize the limiting objective function. Solving the first order conditions reveals that s_∞(a, b) obtains its minimum at {a⁰ = 7/6, b⁰ = 1/π}, so the approximating model tends almost surely to
h_∞(x) = 7/6 + x/π
We may plot the true function and the limit of the approximation to see the asymptotic bias as a function of x. (The approximating model is the straight line, the true model has curvature.) Note that the approximating model is in general inconsistent, even at the approximation point. This shows that flexible functional forms based upon Taylor's series approximations do not in general allow consistent estimation. The mathematical properties of the Taylor's series do not carry over when coefficients are estimated.
The approximating model seems to fit the true model fairly well, asymptotically. However, we are interested in the elasticity of the function. Recall that an elasticity is the marginal function divided by the average function: for a function φ(x), the elasticity is x φ'(x)/φ(x). The elasticity of the true function is therefore ε(x) = x f'(x)/f(x), and the elasticity of the limiting approximation is η(x) = x h_∞'(x)/h_∞(x). Good approximation of the elasticity over the range of x will require a good approximation of both f(x) and f'(x).
Plotting the true elasticity and the elasticity obtained from the limiting approximating model, the true elasticity is the line that has negative slope for large x. Visually we see that the elasticity is not approximated so well. Root mean squared error in the approximation of the elasticity is
( ∫_0^{2π} ( ε(x) - η(x) )² dx )^{1/2} = .31546
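The limiting values a⁰ = 7/6 and b⁰ = 1/π are easy to check by simulation. A minimal sketch in Ox (not from the notes; the sample size and the standard normal error are arbitrary illustrative choices):

#include <oxstd.h>
main()
{
    decl n = 100000, pi = 3.141592653589793;
    decl x = 2*pi*ranu(n, 1);                       // x uniform on (0, 2*pi)
    decl f = 1 + 3*x/(2*pi) - (x/(2*pi)) .^ 2;      // true regression function
    decl y = f + rann(n, 1);                        // add a classical error
    decl X = ones(n, 1) ~ x;                        // regressors of the approximating model h(x) = a + b*x
    decl ab = invertsym(X'X)*X'y;                   // OLS estimates of (a, b)
    println("a, b (limits are 7/6 = 1.1667 and 1/pi = 0.3183): ", ab');
}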
Now suppose we use the leading terms of a trigonometric series as the approximating model. The reason for using a trigonometric series as an approximating model is motivated by the asymptotic properties of the Fourier flexible functional form (Gallant, 1981, 1982), which we will study in more detail below. Normally with this type of model the number of basis functions is an increasing function of the sample size. Here we hold the set of basis functions fixed. We will consider the asymptotic behavior of a fixed model, which we interpret as an approximation to the estimator's behavior in finite samples. Consider the set of basis functions:
Z(x) = [ 1   x   cos(x)   sin(x)   cos(2x)   sin(2x) ]
The approximating model is
g_K(x) = Z(x)α
Maintaining these basis functions as the sample size increases, we find that the limiting objective function is minimized at
{ a₁ = 7/6, a₂ = 1/π, a₃ = -1/π², a₄ = 0, a₅ = -1/(4π²), a₆ = 0 }
Substituting these values into g_K(x) we obtain the almost sure limit of the approximation
g_∞(x) = 7/6 + x/π + cos(x)(-1/π²) + sin(x)·0 + cos(2x)(-1/(4π²)) + sin(2x)·0    (39)
The root mean squared error in the approximation of the elasticity implied by g_∞(x) is
( ∫_0^{2π} ( ε(x) - x g_∞'(x)/g_∞(x) )² dx )^{1/2} = .16213
about half that of the RMSE when the first order approximation is used. If the trigonometric series contained infinite terms, this error measure would be driven to zero, as we shall see.
22.2 Possible pitfalls of parametric inference: hypothesis testing
Consider means of testing for the hypothesis that consumers maximize utility. A consequence of utility maximization is that the matrix of second derivatives of the expenditure function with respect to prices (equivalently, the matrix of first derivatives of the compensated demand functions), D²_p h(p, U), must be negative semi-definite. One approach to testing for utility maximization would estimate a set of demand functions x(p, m).
Estimation of these functions by normal parametric methods requires specification of the functional form of demand, for example
x(p, m) = x(p, m, θ⁰) + ε,   θ⁰ ∈ Θ⁰
where x(p, m, θ⁰) is a function of known form and Θ⁰ is a finite dimensional parameter.
After estimation, we could use x̂ = x(p, m, θ̂) to calculate (by solving the integrability problem, which is non-trivial) D²_p ĥ(p, U). If we can statistically reject that this matrix is negative semi-definite, we might conclude that consumers do not maximize utility.
The problem with this is that the reason for rejection of the theoretical proposition may be that our choice of functional form is incorrect. In the introductory section we saw that functional form misspecification leads to inconsistent estimation of the function and its derivatives, so rejection may reflect the maintained assumptions about functional form rather than the economic theory.
Nonparametric inference allows direct testing of economic propositions, without the model-induced augmenting hypothesis.
22.3 The Fourier functional form
Suppose we have a multivariate model
y = f(x) + ε
where f(x) is of unknown form and x is a P-dimensional vector. For simplicity, assume that ε is a classical error. Let us take the estimation of the vector of elasticities with typical element
ξ_{x_i} = ( x_i / f(x) ) ∂f(x)/∂x_i
at an arbitrary point x_i.
The Fourier form, following Gallant (1982), but with a somewhat different parameterization, may be written as
g_K(x | θ_K) = α + x'β + (1/2) x'Cx + Σ_{α=1}^{A} Σ_{j=1}^{J} ( u_{jα} cos(j k_α'x) - v_{jα} sin(j k_α'x) )    (40)
where the K-dimensional parameter vector is
θ_K = ( α, β', vec*(C)', u_{11}, v_{11}, ..., u_{JA}, v_{JA} )'    (41)
We assume that the conditioning variables x have each been transformed to lie in an interval that is shorter than 2π. This is required to avoid periodic behavior of the approximation, which is desirable since economic functions aren't periodic. For example, one could subtract sample means, divide by the maxima of the conditioning variables, and multiply by 2π - eps, where eps is some positive number less than 2π in value.
The k_α are "multi-indices", which are simply P-vectors formed of integers (negative, positive and zero). The k_α are required to be linearly independent, and we follow the convention that the first non-zero element be positive. For example
( 0  1  -1  0  1 )'
is a potential multi-index to be used, but
( 0  -1  -1  0  1 )'
is not, since its first nonzero element is negative. Nor is
( 0  2  -2  0  2 )'
a multi-index we would use, since it is a scalar multiple of the first multi-index.
We parameterize the matrix C differently than does Gallant because it simplifies things in practice. The cost of this is that we are no longer able to test a quadratic specification using nested testing.
The derivatives of the approximating model with respect to x are
D_x g_K(x | θ_K) = β + Cx + Σ_{α=1}^{A} Σ_{j=1}^{J} ( -u_{jα} sin(j k_α'x) - v_{jα} cos(j k_α'x) ) j k_α    (42)
D²_x g_K(x | θ_K) = C + Σ_{α=1}^{A} Σ_{j=1}^{J} ( -u_{jα} cos(j k_α'x) + v_{jα} sin(j k_α'x) ) j² k_α k_α'    (43)
To define a compact notation for partial derivatives, let λ be an N-dimensional multi-index with no negative elements, and let |λ*| denote the sum of its elements. If we have N arguments x of the function h(x), use D^λ h(x) to indicate a certain partial derivative:
D^λ h(x) ≡ ∂^{|λ*|} h(x) / ( ∂x_1^{λ_1} ∂x_2^{λ_2} ··· ∂x_N^{λ_N} )
When λ is the zero vector, D^λ h(x) ≡ h(x). Taking this definition and the derivatives above into account, it is possible to define (1 × K) vector functions z^λ(x) such that
D^λ g_K(x | θ_K) = z^λ(x)'θ_K    (44)
Both the approximating model and the derivatives of the approximating model are linear in the parameters.
For the approximating model to the function (not derivatives), write z'θ_K ≡ g_K(x | θ_K) for simplicity.
The following theorem can be used to prove the consistency of the Fourier form.
The following conditions (adapted from Gallant and Nychka, 1987) are sufficient for ||ĥ_n - h*||_h → 0 almost surely, provided that K_n → ∞ almost surely:
(a) Compactness: the closure of H with respect to ||·||_h is compact in the relative topology defined by ||·||_h.
(b) Denseness: ∪_K H_K, K = 1, 2, 3, ..., is a dense subset of the closure of H with respect to ||·||_h, and H_K ⊂ H_{K+1}.
(c) Uniform convergence: sup_{h ∈ H} | s_n(h) - s_∞(h, h*) | → 0 almost surely.
(d) Identification: any point h in the closure of H with s_∞(h, h*) ≥ s_∞(h*, h*) must have ||h - h*||_h = 0.
The modification of the original statement of the theorem that has been made is to set the parameter space in Gallant and Nychka's (1987) Theorem 0 to a single point and to state the theorem in terms of maximization rather than minimization.
This theorem is very similar in form to Theorem 58. The main differences are:
1. A generic norm ||·||_h is used in place of the Euclidean norm. This norm may be stronger than the Euclidean norm, so that convergence with respect to ||·||_h implies convergence w.r.t. the Euclidean norm. Typically we will want to make sure that the norm is strong enough to imply convergence of all functions of interest.
2. The estimation space H is a function space. It plays the role of the parameter
space in our discussion of parametric estimators. There is no restriction to a
parametric family, only a restriction to a space of functions that satisfy certain
conditions. This formulation is much less restrictive than the restriction to a
parametric family.
3. There is a denseness assumption that was not present in the other theorem.
We will not prove this theorem (the proof is quite similar to the proof of theorem [58],
see Gallant, 1987) but we will discuss its assumptions, in relation to the Fourier form
as the approximating model.
22.3.1 Sobolev norm
The first step in verifying the conditions of the theorem is to choose the norm we wish to use. We need a norm that guarantees that the errors in approximation of the functions we are interested in are accounted for. Since we are interested in first-order elasticities in the present case, we need close approximation of both the function f(x) and its first derivative f'(x), throughout the range of x. Let X be an open set that contains all values of x that we're interested in. The Sobolev norm is appropriate in this case. It is defined, making use of our notation for partial derivatives, as:
||h||_{m,X} = max_{|λ*| ≤ m} sup_X | D^λ h(x) |
To see whether or not the function f(x) is well approximated by an approximating model g_K(x | θ_K), we would evaluate
|| f(x) - g_K(x | θ_K) ||_{m,X}
We see that this norm takes into account errors in approximating the function and partial derivatives up to order m. If we want to estimate first order elasticities, as is the case in this example, the relevant m would be m = 1. Furthermore, since we examine the sup over X, convergence w.r.t. the Sobolev norm means uniform convergence, so that we obtain consistency of the elasticities uniformly over X.
22.3.2 Compactness
Verifying compactness with respect to this norm is quite technical and unenlightening. It is proven by Elbadawi, Gallant and Souza, Econometrica, 1983. The basic requirement is that if we need consistency w.r.t. ||·||_{m,X}, then the functions of interest must belong to a Sobolev space which takes into account derivatives of order m+1. A Sobolev space is the set of functions
W_{m,X}(D) = { h(x) : ||h(x)||_{m,X} < D }
where D is a finite constant. In plain words, the functions must have bounded partial derivatives of one order higher than the derivatives we seek to estimate.
22.3.3 The estimation space and the estimation subspace
Since we wish to estimate first order elasticities, the estimation space is H = W_{2,X}(D). The estimation subspace, indexed by K, is
H_K = { g_K(x | θ_K) : g_K(x | θ_K) ∈ W_{2,Z}(D), θ_K ∈ ℜ^K }
where g_K is the Fourier form approximation defined above.
22.3.4 Denseness
The important point here is that H_K is a space of functions that is indexed by a finite dimensional parameter (θ_K has K elements, as in equation 41). With n observations, n > K, this parameter is estimable. Note that the true function h* is not necessarily an element of H_K, so optimization over H_K may not lead to a consistent estimator. In order for optimization over H_K to be equivalent to optimization over H, at least asymptotically, we need that:
1. The dimension of the parameter vector, dim θ_{K_n}, tends to infinity as n → ∞. This is achieved by making A and J in equation 40 increasing functions of n, the sample size. It is clear that K will have to grow more slowly than n. The second requirement is:
2. We need that the H_K be dense subsets of H.
The estimation subspace H_K, defined above, is a subset of the closure of the estimation space, H. A set of subsets {A_a} of a set A is "dense" if the closure of the countable union of the subsets is equal to the closure of A:
closure( ∪_{a=1}^∞ A_a ) = closure( A )
Use a picture here. The rest of the discussion of denseness is provided just for completeness: there's no need to study it in detail.
To show that H_K is a dense subset of the closure of H with respect to ||·||_{1,X}, it is useful to apply a theorem of Gallant (1982), who in turn cites Edmunds and Moscatelli (1977). We reproduce the theorem as presented by Gallant, with minor notational changes, for convenience of reference: if the real-valued function h*(x) is continuously differentiable up to order m on an open set containing the closure of X, then it is possible to choose a triangular array of coefficients θ_1, θ_2, ..., θ_K, ..., such that for every q with 0 ≤ q < m,
|| h*(x) - h_K(x | θ_K) ||_{q,X} = o( K^{-m+q} )  as K → ∞.
In the present application, q = 1 and m = 2. By definition of the estimation space, the elements of H are once continuously differentiable on X, which is open and contains the closure of X, so the theorem is applicable. Closely following Gallant and Nychka (1987), let ∪_∞ H_K denote the countable union of the H_K. The implication of the theorem is that there is a sequence {h_K} from ∪_∞ H_K such that
lim_{K→∞} || h* - h_K ||_{1,X} = 0
for all h* ∈ H. Therefore,
H ⊂ closure( ∪_∞ H_K ).
However,
∪_∞ H_K ⊂ H,
so
closure( ∪_∞ H_K ) ⊂ closure( H ).
Therefore
H ⊂ closure( ∪_∞ H_K ) ⊂ closure( H ),
so ∪_∞ H_K is a dense subset of H, with respect to the norm ||·||_{1,X}.
22.3.5 Uniform convergence
The sample objective function in this case is
s_n(θ_K) = -(1/n) Σ_{t=1}^n ( y_t - g_K(x_t | θ_K) )²
With random sampling, as in the case of Equations 17 and 34, the limiting objective function is
s_∞(g, f) = - ∫_X ( f(x) - g(x) )² dx    (45)
where the true function f(x) takes the place of the generic function h* in the presentation of the theorem, and both g(x) and f(x) are elements of closure( ∪_∞ H_K ).
The pointwise convergence of the sample objective function to the limiting objective function needs to be strengthened to uniform convergence; we will simply assume that this holds, since verifying it depends upon the specific application. We also have continuity of the limiting objective function in g, with respect to ||·||_{1,X}:
lim_{||g¹ - g⁰||_{1,X} → 0} [ s_∞(g¹, f) - s_∞(g⁰, f) ] = lim_{||g¹ - g⁰||_{1,X} → 0} ∫_X [ ( f(x) - g⁰(x) )² - ( f(x) - g¹(x) )² ] dx = 0
since g¹(x) converges to g⁰(x) uniformly on X as ||g¹ - g⁰||_{1,X} → 0. The exchange of the limit and the integral is justified by the dominated convergence theorem (which applies since the finite bound D used to define W_{2,Z}(D) dominates the integrand by an integrable function over X).
22.3.6 Identification
The identification condition requires that for any point (g, f) in closure(H) × closure(H), s_∞(g, f) ≥ s_∞(f, f) implies ||g - f||_{1,X} = 0. This condition is clearly satisfied, given the form of s_∞(g, f) and the fact that its elements are once continuously differentiable on X.
22.3.7 Review of concepts
For the example of estimation of first-order elasticities, the relevant concepts are:
Estimation space H = W_{2,X}(D): the function space in the closure of which the true function must lie.
Consistency norm ||·||_{1,X}. The closure of H is compact with respect to this norm.
Estimation subspaces H_K, which are dense subsets of the closure of H with respect to ||·||_{1,X}.
Limiting objective function s_∞(g, f), which is continuous in g and has a global maximum in its first argument, over the closure of the infinite union of the estimation subspaces, at g = f.
As a result, the function f(x) and the first order elasticities
ξ_{x_i} = ( x_i / f(x) ) ∂f(x)/∂x_i
are consistently estimated for all x ∈ X.
22.3.8 Discussion
Consistency requires that the number of parameters used in the expansion increase with the sample size, tending to infinity. If parameters are added at a high rate, the bias tends relatively rapidly to zero. A basic problem is that a high rate of inclusion of additional parameters causes the variance to tend more slowly to zero. The issue of how to choose the rate at which parameters are added, and which to add first, is fairly complex. A problem is that the allowable rates for asymptotic normality to obtain (Andrews 1991; Gallant and Souza, 1991) are very strict. Supposing we stick to these rates, our approximating model is:
g_K(x | θ_K) = z'θ_K
Define Z_K as the n × K matrix of regressors obtained by stacking observations. The LS estimator is
θ̂_K = (Z_K'Z_K)⁺ Z_K'y
where (·)⁺ is the Moore-Penrose generalized inverse, used since Z_K'Z_K may be singular for K(n) large enough. The prediction, z'θ̂_K, of the unknown function f(x) is asymptotically normally distributed:
√n ( z'θ̂_K - f(x) ) →_d N(0, AV)
where
AV = lim_{n→∞} E[ z' ( Z_K'Z_K / n )⁺ z σ² ]
Formally, this is exactly the same as if we were dealing with a parametric linear model. I emphasize, though, that this is only valid if K grows very slowly as n grows. If we can't stick to acceptable rates, we should probably use some other method of approximating the small sample distribution. Bootstrapping is a possibility. We'll discuss this in the section on simulation.
22.4 Kernel regression estimators
We will consider the Nadaraya-Watson kernel regression estimator in a simple case. Suppose we have an iid sample from the joint density f(x, y), where x is k-dimensional. The model is
y_t = g(x_t) + ε_t
where
E(ε_t | x_t) = 0
The conditional expectation of y given x is g(x). By definition of the conditional expectation, we have
g(x) = ∫ y f(x, y) / h(x) dy = ( 1 / h(x) ) ∫ y f(x, y) dy
where h(x) is the marginal density of x:
h(x) = ∫ f(x, y) dy
This suggests that we could estimate g(x) by estimating h(x) and ∫ y f(x, y) dy.
22.4.1 Estimation of the denominator
A kernel estimator of h(x) has the form
ĥ(x) = (1/n) Σ_{t=1}^n K[ (x - x_t)/γ_n ] / γ_n^k
where n is the sample size and k is the dimension of x.
The function K(·) (the kernel) is absolutely integrable: ∫ |K(x)| dx < ∞, and K(·) integrates to 1: ∫ K(x) dx = 1. In this respect, K(·) is like a density function, but we do not necessarily restrict K(·) to be nonnegative.
The window width parameter, γ_n, is a sequence of positive numbers that satisfies
lim_{n→∞} γ_n = 0
lim_{n→∞} n γ_n^k = ∞
So, the window width must tend to zero, but not too quickly.
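To fix ideas, here is a minimal sketch in Ox of a univariate kernel density estimator with a standard normal kernel (this code is not part of the notes; the bandwidth value and function names are illustrative choices):

#include <oxstd.h>
// kernel density estimate at the points in xeval, for univariate data in the column vector x
kdensity(const xeval, const x, const gamma)
{
    decl i, fhat = zeros(rows(xeval), 1);
    for (i = 0; i < rows(xeval); ++i)
        fhat[i] = meanc(densn((xeval[i] - x) / gamma)) / gamma;   // (1/(n*gamma)) * sum of K((x0 - x_t)/gamma)
    return fhat;
}
main()
{
    decl x = rann(500, 1);                    // artificial standard normal data
    decl xeval = <-2; -1; 0; 1; 2>;
    println(kdensity(xeval, x, 0.3)');        // should be close to the standard normal density at these points
}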
First, consider the bias of the estimator (since the estimator is an average of iid terms, we only need to consider the expectation of a representative term):
E[ ĥ(x) ] = ∫ γ_n^{-k} K[ (x - z)/γ_n ] h(z) dz
Change variables as z* = (x - z)/γ_n, so z = x - γ_n z* and | dz/dz*' | = γ_n^k; we obtain
E[ ĥ(x) ] = ∫ γ_n^{-k} K(z*) h(x - γ_n z*) γ_n^k dz* = ∫ K(z*) h(x - γ_n z*) dz*
Now, asymptotically,
lim_{n→∞} E[ ĥ(x) ] = lim_{n→∞} ∫ K(z*) h(x - γ_n z*) dz* = ∫ lim_{n→∞} K(z*) h(x - γ_n z*) dz* = ∫ K(z*) h(x) dz* = h(x) ∫ K(z*) dz* = h(x)
since γ_n → 0 and ∫ K(z*) dz* = 1. The passage of the limit through the integral is a result of the dominated convergence theorem; for this to hold we need that h(·) be dominated by an absolutely integrable function. Thus the estimator is asymptotically unbiased.
Next, consider the variance of ĥ(x). By the iid assumption,
n γ_n^k V[ ĥ(x) ] = n γ_n^k ( 1/(n γ_n^k) )² Σ_{t=1}^n V{ K[ (x - x_t)/γ_n ] } = ( 1/(n γ_n^k) ) Σ_{t=1}^n V{ K[ (x - x_t)/γ_n ] } = γ_n^{-k} V{ K[ (x - z)/γ_n ] }
Also, since V(x) = E(x²) - E(x)², we have
n γ_n^k V[ ĥ(x) ] = γ_n^{-k} E{ K[ (x - z)/γ_n ]² } - γ_n^{-k} ( E{ K[ (x - z)/γ_n ] } )²
= ∫ γ_n^{-k} K[ (x - z)/γ_n ]² h(z) dz - γ_n^k { ∫ γ_n^{-k} K[ (x - z)/γ_n ] h(z) dz }²
= ∫ γ_n^{-k} K[ (x - z)/γ_n ]² h(z) dz - γ_n^k { E[ ĥ(x) ] }²
The second term converges to zero:
γ_n^k { E[ ĥ(x) ] }² → 0
by the previous result regarding the expectation and the fact that γ_n → 0. Therefore,
lim_{n→∞} n γ_n^k V[ ĥ(x) ] = lim_{n→∞} ∫ γ_n^{-k} K[ (x - z)/γ_n ]² h(z) dz
Using exactly the same change of variables as before, this can be shown to be
lim_{n→∞} n γ_n^k V[ ĥ(x) ] = h(x) ∫ K(z*)² dz*
Since both h(x) and ∫ K(z*)² dz* are finite, the variance tends to zero:
lim_{n→∞} V[ ĥ(x) ] = 0
Since the bias and the variance both go to zero, we have pointwise consistency (convergence in quadratic mean implies convergence in probability).
22.4.2 Estimation of the numerator
To estimate ∫ y f(x, y) dy, we need an estimator of f(x, y). The estimator has the same form as the estimator for h(x), only with one dimension more:
f̂(x, y) = (1/n) Σ_{t=1}^n K*[ (y - y_t)/γ_n, (x - x_t)/γ_n ] / γ_n^{k+1}
The kernel K*(·) is required to have mean zero:
∫ y K*(y, x) dy = 0
and to marginalize to the previous kernel for h(x):
∫ K*(y, x) dy = K(x)
With this kernel, we have
∫ y f̂(y, x) dy = (1/n) Σ_{t=1}^n y_t K[ (x - x_t)/γ_n ] / γ_n^k
by marginalization of the kernel, so we obtain
ĝ(x) = ( 1 / ĥ(x) ) ∫ y f̂(y, x) dy
     = [ (1/n) Σ_{t=1}^n y_t K[ (x - x_t)/γ_n ] / γ_n^k ] / [ (1/n) Σ_{t=1}^n K[ (x - x_t)/γ_n ] / γ_n^k ]
     = Σ_{t=1}^n y_t K[ (x - x_t)/γ_n ] / Σ_{t=1}^n K[ (x - x_t)/γ_n ]
This is the Nadaraya-Watson kernel regression estimator.
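A compact implementation shows how simple the estimator is. A minimal sketch in Ox for univariate x, with a standard normal kernel (not code from the notes; the kernel, window width and function names are illustrative assumptions):

#include <oxstd.h>
// Nadaraya-Watson regression estimate at the scalar point x0, for data y (n x 1) and x (n x 1)
nadarayawatson(const x0, const y, const x, const gamma)
{
    decl w = densn((x - x0) / gamma);      // kernel weight for each observation
    return sumc(w .* y) / sumc(w);         // weighted average of the y_t
}
main()
{
    decl n = 1000;
    decl x = 10 * ranu(n, 1);
    decl y = sin(x) + 0.1 * rann(n, 1);
    println("ghat(2) = ", nadarayawatson(2, y, x, 0.25), "   true value: ", sin(2));
}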
22.4.3 Discussion
The estimator is a weighted average of the y_t, t = 1, 2, ..., n, where higher weights are associated with points x_t that are closer to x.
The window width parameter γ_n imposes smoothness. The estimator is increasingly flat as γ_n → ∞, since in this case each weight tends to 1/n.
A large window width reduces the variance (strong imposition of flatness), but increases the bias. A small window width reduces the bias, but makes very little use of information except points that are in a small neighborhood of x. Since relatively little information is used, the variance is large when the window width is small.
The standard normal density is a popular choice for K(·) and K*(y, x), though there are possibly better alternatives.
5. Calculate RMSE(γ).
6. Go to step 2, or to the next step if enough window widths have been tried.
7. Select the γ that minimizes RMSE(γ) (verify that a minimum has been found, for example by plotting RMSE as a function of γ).
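A closely related and very common way to carry out this search is leave-one-out cross validation. A minimal sketch in Ox, assuming a standard normal kernel (the function and names are illustrative, not taken from the notes, and this variant need not coincide exactly with the sample-splitting procedure described above):

#include <oxstd.h>
// leave-one-out cross validation criterion for the window width gamma
cvrmse(const y, const x, const gamma)
{
    decl n = rows(y), t, e = zeros(n, 1);
    for (t = 0; t < n; ++t)
    {
        decl w = densn((x - x[t]) / gamma);
        w[t] = 0;                                 // drop observation t from its own fit
        e[t] = y[t] - sumc(w .* y) / sumc(w);     // out-of-sample prediction error
    }
    return sqrt(meanc(e .^ 2));
}
main()
{
    decl n = 200, x = 10 * ranu(n, 1), y = sin(x) + 0.1 * rann(n, 1);
    decl gammas = <0.05; 0.1; 0.25; 0.5; 1>, i;
    for (i = 0; i < rows(gammas); ++i)
        println("gamma = ", gammas[i], "   CV RMSE = ", cvrmse(y, x, gammas[i]));
}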
22.5 Kernel density estimation
The previous discussion suggests that a kernel density estimator may easily be constructed. We have already seen how joint densities may be estimated. If we are interested in a conditional density, for example of y conditional on x, then the kernel estimate of the conditional density is simply
f̂_{y|x} = f̂(x, y) / ĥ(x)
        = [ (1/n) Σ_{t=1}^n K*[ (y - y_t)/γ_n, (x - x_t)/γ_n ] / γ_n^{k+1} ] / [ (1/n) Σ_{t=1}^n K[ (x - x_t)/γ_n ] / γ_n^k ]
        = (1/γ_n) Σ_{t=1}^n K*[ (y - y_t)/γ_n, (x - x_t)/γ_n ] / Σ_{t=1}^n K[ (x - x_t)/γ_n ]
where we obtain the expressions for the joint and marginal densities from the section on kernel regression.
22.6 Semi-nonparametric maximum likelihood
A related approach reshapes a parametric baseline density using a squared polynomial. The reshaped density is
g_p(y | x) = [ h_p(y) ]² f_Y(y | x) / η_p(x)
where
h_p(y) = Σ_{k=0}^p γ_k y^k
Because [ h_p(y) ]² / η_p(x) is a homogenous function of the γ_k, it is necessary to impose a normalization: γ_0 is set to 1.
Similarly to Cameron and Johansson (1997), we may develop a negative binomial polynomial (NBP) density for count data. The negative binomial baseline density may be written as
f_Y(y | φ) = ( Γ(y + ψ) / ( Γ(y + 1) Γ(ψ) ) ) ( ψ / (ψ + λ) )^ψ ( λ / (ψ + λ) )^y
where φ = (λ, ψ), λ > 0 and ψ > 0. To ensure that the mean is positive, the usual parameterization is λ = e^{x'β}. For this density, E(Y) = λ and V(Y) = λ + λ²/ψ, so the model allows for overdispersion.
To obtain a more flexible density, we may reshape the negative binomial density using a squared polynomial
h_p(y) = Σ_{k=0}^p γ_k y^k    (46)
f̃_Y(y | φ, γ) = [ h_p(y) ]² f_Y(y | φ) / η_p(φ, γ)    (47)
The normalization factor η_p(φ, γ) is calculated (following Cameron and Johansson) using the raw moments of the baseline density. For the reshaped density,
E(Y^r) = Σ_{y=0}^∞ y^r f̃_Y(y | φ, γ)
       = Σ_{y=0}^∞ y^r [ h_p(y) ]² f_Y(y | φ) / η_p
       = Σ_{y=0}^∞ Σ_{k=0}^p Σ_{l=0}^p y^{r+k+l} γ_k γ_l f_Y(y | φ) / η_p
       = Σ_{k=0}^p Σ_{l=0}^p γ_k γ_l m_{k+l+r} / η_p
where m_r = Σ_{y=0}^∞ y^r f_Y(y | φ) is the r-th raw moment of the baseline density. By setting r = 0 and noting that the reshaped density must sum to one, we get the normalization factor
η_p(φ, γ) = Σ_{k=0}^p Σ_{l=0}^p γ_k γ_l m_{k+l}    (48)
Recall that γ_0 is set to 1 to achieve identification. The m_{k+l} in equation 48 are the
negative binomial raw moments, which may be obtained from the moment generating
function
M_Y(t) = [ ψ / ( ψ + λ(1 - e^t) ) ]^ψ    (49)
To illustrate, here are the first through fourth raw moments of the NB density, calculated using Mathematica and then programmed in Ox. These are the moments that are needed when p ≤ 2.
if(k_gam >= 1)
{
m[][0] = lambda;
m[][1] = (lambda .* (lambda + psi + lambda .* psi)) ./ psi;
}
if(k_gam >= 2)
{
    m[][2] = (lambda .* (psi .^ 2 + 3 .* lambda .* psi .* (1 + psi)
        + lambda .^ 2 .* (2 + 3 .* psi + psi .^ 2))) ./ psi .^ 2;
    m[][3] = (lambda .* (psi .^ 3 + 7 .* lambda .* psi .^ 2 .* (1 + psi)
        + 6 .* lambda .^ 2 .* psi .* (2 + 3 .* psi + psi .^ 2)
        + lambda .^ 3 .* (6 + 11 .* psi + 6 .* psi .^ 2 + psi .^ 3))) ./ psi .^ 3;
}
After calculating the raw moments, the normalization factor is calculated using
equation 48, again with the help of Mathematica.
if(k_gam == 1)
{
    norm_factor = 1 + gam[0][] .* (2 .* m[][0] + gam[0][] .* m[][1]);
}
else
if(k_gam == 2)
{
    norm_factor = 1 + gam[0][] .^ 2 .* m[][1]
        + 2 .* gam[0][] .* (m[][0] + gam[1][] .* m[][2])
        + gam[1][] .* (2 .* m[][1] + gam[1][] .* m[][3]);
}
For higher order polynomials the expressions rapidly become more complicated; this is a model that would be difficult to formulate without the help of a program like Mathematica.
It is possible that there is conditional heterogeneity such that the appropriate reshaping should be more local. This can be accommodated by allowing the γ_k parameters to depend upon the conditioning variables, for example using polynomials.
Gallant and Nychka, Econometrica, 1987, prove that this sort of density can approximate a wide variety of densities arbitrarily well as the degree of the polynomial increases with the sample size. This approach is not without its drawbacks: the sample objective function can have an extremely large number of local maxima that can lead to numeric difficulties. If someone could figure out how to do this in a way such that the sample objective function was nice and smooth, they would probably get the paper published in a good journal. Any ideas?
Here's a plot of true and the limiting SNP approximations (with the order of the polynomial fixed) to four different count data densities. The baseline model is a negative binomial density.
[Figure: true densities and limiting SNP approximations for Case 1, Case 2, Case 3 and Case 4.]
23 Simulation-based estimation
Readings: In addition to the book mentioned previously, articles include Gallant and Tauchen (1996), "Which Moments to Match?", ECONOMETRIC THEORY, Vol. 12, 1996, pages 657-681; Gourieroux, Monfort and Renault (1993), "Indirect Inference," J. Appl. Econometrics; Pakes and Pollard (1989), Econometrica; McFadden (1989), Econometrica.
23.1 Motivation
Simulation methods are of interest when the DGP is fully characterized by a parameter
vector, but the likelihood function is not calculable. If it were available, we would
simply estimate by MLE, which is asymptotically fully efficient.
Example 1: multinomial and/or dynamic discrete response models. Let y_i* be a latent random vector of dimension m. Suppose that
y_i* = X_i β + ε_i,   ε_i ~ N(0, Ω)    (50)
where X_i is m × K. Suppose that the latent vector y_i* is not observed; rather, we observe
y_i = τ(y_i*)
This mapping is such that each element of y_i is either zero or one (in some cases only one element will be one).
Define
A_i = A(y_i) = { y* | y_i = τ(y*) }
Suppose random sampling of (y_i, X_i). The elements of y_i may not be independent of one another (and clearly are not if Ω is not diagonal). However, y_i is independent of y_j, i ≠ j.
Let θ = ( β', (vec*Ω)' )' be the parameters of the model. The contribution of the i-th observation to the likelihood function is
p_i(θ) = ∫_{A_i} n( y_i* - X_i β, Ω ) dy_i*
where
n(ε, Ω) = (2π)^{-m/2} |Ω|^{-1/2} exp( -ε'Ω^{-1}ε / 2 )
is the multivariate normal density of an m-dimensional random vector. The log-likelihood function is
ln L(θ) = (1/n) Σ_{i=1}^n ln p_i(θ)
and the MLE θ̂ solves the score equations
(1/n) Σ_{i=1}^n D_θ p_i(θ̂) / p_i(θ̂) ≡ 0.
The problem is that evaluation of p_i(θ) and its derivative w.r.t. θ by standard methods of numerical integration such as quadrature is computationally infeasible when m (the dimension of y) is higher than 3 or 4 (as long as there are no restrictions on Ω).
The mapping τ(y*) has not been made specific so far. This setup is quite general: for different choices of τ(·) it nests the case of dynamic binary discrete choice models as well as the case of multinomial discrete choice (the choice of one out of a finite set of alternatives).
Multinomial discrete choice is illustrated by a (very simple) job search model. We have cross sectional data on individuals' matching to a set of m jobs that are available (one of which is unemployment). The utility of alternative j is
u_j = X_j β + ε_j
Utilities of jobs, stacked in the vector u_i, are not observed. Rather, we observe the vector formed of elements
y_j = 1[ u_j > u_k, ∀ k ∈ m, k ≠ j ]
Only one of these elements is different from zero.
Dynamic discrete choice is illustrated by repeated choices over time between two alternatives. Let alternative j have utility
u_{jt} = W_{jt} β - ε_{jt},   j ∈ {1, 2},   t ∈ {1, 2, ..., m}
Then
y* = u_2 - u_1 = (W_2 - W_1)β + ε_2 - ε_1 ≡ Xβ + ε
Now the mapping is (element-by-element)
y = 1[ y* > 0 ],
that is y_{it} = 1 if individual i chooses the second alternative in period t, zero otherwise.
Example 2: count data with unobserved heterogeneity. The simplest count data density is the Poisson:
Pr(y = i) = exp(-λ) λ^i / i!
The mean and variance of the Poisson distribution are both equal to λ:
E(y) = V(y) = λ
To link the mean to conditioning variables, one usually sets
λ_i = exp(X_i β)
This ensures that the mean is positive (as it must be). Estimation by ML is straightforward.
Often, count data exhibits "overdispersion", which simply means that
V(y) > E(y)
If this is the case, a solution is to use the negative binomial distribution rather than the Poisson. An alternative is to introduce a latent variable that reflects heterogeneity into the specification:
λ_i = exp(X_i β + η_i)
where η_i has some specified density with support S (this density may depend on additional parameters). Let dμ(η_i) be the density of η_i. In some cases, the marginal density of y,
Pr(y = y_i) = ∫_S exp[ -exp(X_i β + η_i) ] [ exp(X_i β + η_i) ]^{y_i} / y_i!  dμ(η_i)
will have a closed-form solution (one can derive the negative binomial distribution in this way if exp(η_i) follows a gamma distribution), but often this will not be possible.
In this case, since there is only one latent variable, quadrature is probably a better choice. However, a more flexible model with heterogeneity in more than one parameter would lead to a probability
Pr(y = y_i) = ∫_S exp(-λ_i) λ_i^{y_i} / y_i!  dμ(η_i)
that entails a K-dimensional integral, which is not practical to compute by quadrature when K is more than a few; simulation then becomes attractive.
Example 3: models specified in terms of stochastic differential equations. Consider a continuous time model
dy_t = g(θ, y_t) dt + h(θ, y_t) dW_t
where {W_t} is a standard Brownian motion (Wiener process), so that
W(T) = ∫_0^T dW_t ~ N(0, T)
and
W(0) = 0,   [W(s) - W(t)] ~ N(0, s - t),   [W(s) - W(t)] and [W(j) - W(k)] are independent for s > t > j > k.
That is, non-overlapping segments are independent.
One can think of Brownian motion as the accumulation of independent normally distributed shocks with infinitesimal variance.
To estimate a model of this sort, we typically have data that are assumed to be observations of y_t in discrete points y_1, y_2, ..., y_T. That is, though y_t is a continuous process, it is observed in discrete time. To perform inference on θ, direct ML or GMM estimation is not usually feasible, because one cannot, in general, deduce the transition density f(y_t | y_{t-1}, θ). This density is necessary to evaluate the likelihood function or to evaluate moment conditions (which are based upon expectations with respect to this density).
A typical solution is to "discretize" the model, by which we mean to find a discrete time approximation to the model. The discretized version of the model is
y_t - y_{t-1} = g(φ, y_{t-1}) + h(φ, y_{t-1}) ε_t,   ε_t ~ N(0, 1)
The discretization induces a new parameter, φ (that is, the φ⁰ which defines the best approximation of the discretization to the actual (unknown) discrete time version of the model is not equal to θ⁰ which is the true parameter value). This is an approximation, and as such "ML" estimation of φ (which is actually quasi-maximum likelihood, QML) based upon this equation is in general biased and inconsistent for the original parameter, θ. Nevertheless, the approximation shouldn't be too bad, which will be useful, as we will see.
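The same discretization is the natural way to simulate the continuous time process: use a time step much finer than the observation frequency and record the simulated values at the observation dates. A minimal sketch in Ox of an Euler simulation of a mean-reverting process (an illustration only; the drift and diffusion functions and the parameter values are arbitrary assumptions, not taken from the notes):

#include <oxstd.h>
main()
{
    decl T = 1000, steps = 10;                 // observations and Euler sub-steps per observation
    decl kappa = 0.5, mu = 1.0, sigma = 0.2, tau = 1.0/steps;
    decl y = zeros(T, 1), yc = mu, t, s;
    for (t = 0; t < T; ++t)
    {
        for (s = 0; s < steps; ++s)            // Euler step: dy = g(y) dt + h(y) dW
            yc += kappa*(mu - yc)*tau + sigma*sqrt(tau)*rann(1, 1);
        y[t] = yc;                             // record the process at the observation frequency
    }
    println("simulated sample mean (should be near mu = 1): ", meanc(y));
}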
The important point about these three examples is that computational difficulties
prevent direct application of ML, GMM, etc. Nevertheless the model is fully
specified in probabilistic terms up to a parameter vector. This means that the
model is simulable, conditional on the parameter vector.
23.2 Simulated maximum likelihood (SML)
For simplicity, consider cross-sectional data. An ML estimator solves
θ̂_ML = arg max s_n(θ) = (1/n) Σ_{t=1}^n ln p(y_t | X_t, θ)
where p(y_t | X_t, θ) is the density function of the t-th observation. When p(y_t | X_t, θ) does not have a known closed form, θ̂_ML is an infeasible estimator. However, it may be possible to define a random function f(ν, y_t, X_t, θ) such that
E_ν f(ν, y_t, X_t, θ) = p(y_t | X_t, θ)
where the density of ν is known. If this is the case, the simulator
p̃(y_t, X_t, θ) = (1/H) Σ_{h=1}^H f(ν_{th}, y_t, X_t, θ)
is unbiased for p(y_t | X_t, θ).
The SML estimator simply substitutes p̃ in place of p in the log-likelihood function, that is
θ̂_SML = arg max s_n(θ) = (1/n) Σ_{i=1}^n ln p̃(y_t, X_t, θ)
23.2.1 Example: multinomial probit
Recall that the utility of alternative j is
u_j = X_j β + ε_j
and the vector y is formed of elements
y_j = 1[ u_j > u_k, k ∈ m, k ≠ j ]
This probability is not calculable when m is large, but it is easy to simulate:
Draw ε̃_i from the distribution N(0, Ω).
Calculate ũ_i = X_i β + ε̃_i (where X_i is the matrix formed by stacking the X_{ij}).
Define ỹ_{ij} = 1[ ũ_{ij} > ũ_{ik}, ∀ k ∈ m, k ≠ j ].
Repeat this H times, and define π̃_{ij} = ( Σ_{h=1}^H ỹ_{ijh} ) / H.
Define π̃_i as the m-vector formed of the π̃_{ij}; now p̃(y_i, X_i, θ) = y_i'π̃_i.
The SML multinomial probit log-likelihood function is
ln L(β, Ω) = (1/n) Σ_{i=1}^n ln p̃(y_i, X_i, θ)
This is to be maximized w.r.t. β and Ω.
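A minimal sketch in Ox of this frequency simulator, for a single individual (not code from the notes; function and variable names are illustrative):

#include <oxstd.h>
// crude frequency simulator of multinomial probit choice probabilities
// V: m x 1 systematic utilities X_i*beta, Omega: m x m error covariance, H: number of draws
simprob(const V, const Omega, const H)
{
    decl m = rows(V);
    decl L = choleski(Omega);                        // Omega = L*L'
    decl u = ones(H, 1)*V' + rann(H, m)*L';          // H x m matrix of simulated utilities
    decl best = maxcindex(u')';                      // index of the winning alternative in each draw
    decl p = zeros(m, 1), j;
    for (j = 0; j < m; ++j)
        p[j] = meanc(best .== j);                    // frequency with which alternative j wins
    return p;
}
main()
{
    decl V = <0.5; 0.0; -0.3>;
    println("simulated choice probabilities: ", simprob(V, unit(3), 1000)');
}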
The H draws of ε̃_i are drawn only once and are used repeatedly during the iterations used to find β̂ and Ω̂. The draws are different for each i. If the ε̃_i are re-drawn at every iteration the estimator will not converge when using gradient-based optimization methods such as Newton-Raphson.
It may be the case, particularly if few simulations, H, are used, that some elements of π̃_i are zero or one. In this case, taking the logarithm is going to cause problems.
Solutions to discontinuity:
1) use an estimation method that doesn't require a continuous and differentiable objective function, for example, simulated annealing. This is computationally costly.
2) Smooth the simulated probabilities so that they are continuous functions of the parameters. For example, apply a kernel transformation such as
ỹ_{ij} = K[ A ( ũ_{ij} - max_k ũ_{ik} ) ]
where A is a large positive number and K(·) is a smooth cumulative-distribution-type function, so that ỹ_{ij} approximates the indicator 1[ ũ_{ij} = max_k ũ_{ik} ].
This makes the simulated probabilities smooth in the parameters, and the approximation to the original objective function becomes arbitrarily close as the sample size increases. There are alternative methods (e.g., Gibbs sampling) that may work better, but this is too technical to discuss here.
To solve the log(0) problem, use the slog function distributed on the web page. Also, increase H if this is a serious problem.
23.2.2 Properties
The properties of the SML estimator depend on how H is set. The following is taken
from Lee (1995) Asymptotic Bias in Simulated Maximum Likelihood Estimation of
Discrete Choice Models, Econometric Theory, 11, pp. 437-83.
Theorem 71 [Lee] 1) if lim_{n→∞} n^{1/2}/H = 0, then
√n ( θ̂_SML - θ⁰ ) →_d N( 0, I^{-1}(θ⁰) )
2) if lim_{n→∞} n^{1/2}/H = λ, where λ is a finite constant, then
√n ( θ̂_SML - θ⁰ ) →_d N( B, I^{-1}(θ⁰) )
where B is a finite vector of constants.
This means that the SML estimator is asymptotically biased if H doesn't grow faster than n^{1/2}.
The varcov is the typical inverse of the information matrix, so that as long as H
grows fast enough the estimator is consistent and fully asymptotically efficient.
23.3 Method of simulated moments (MSM)
Suppose we have a DGP(y|x, θ) which is simulable given θ, but is such that the density of y is not calculable.
One could, in principle, base a GMM estimator upon the moment conditions
m_t(θ) = [ K(y_t, x_t) - k(x_t, θ) ] z_t
where
k(x_t, θ) = ∫ K(y_t, x_t) p(y | x_t, θ) dy,
z_t is a vector of instruments in the information set and p(y | x_t, θ) is the density of y conditional on x_t. The problem is that this density is not available.
However k(x_t, θ) is readily simulated using
k̃(x_t, θ) = (1/H) Σ_{h=1}^H K( ỹ_t^h, x_t )
By the law of large numbers, k̃(x_t, θ) →_{a.s.} k(x_t, θ) as H → ∞, which provides a clear intuitive basis for the estimator, though in fact we obtain consistency even for H finite, since a law of large numbers is also operating across the n observations of real data, so errors introduced by simulation cancel themselves out.
This allows us to form the moment conditions
m̃_t(θ) = [ K(y_t, x_t) - k̃(x_t, θ) ] z_t    (51)
where z_t is drawn from the information set. As before, form
m̃(θ) = (1/n) Σ_{i=1}^n m̃_t(θ) = (1/n) Σ_{i=1}^n [ K(y_t, x_t) - (1/H) Σ_{h=1}^H K( ỹ_t^h, x_t ) ] z_t    (52)
with which we form the GMM criterion and estimate as usual. Note that the unbiased simulator appears linearly within the sums.
23.3.1 Properties
Suppose that the optimal weighting matrix is used. McFadden (ref. above) and Pakes and Pollard (refs. above) show that the asymptotic distribution of the MSM estimator is very similar to that of the infeasible GMM estimator. In particular, assuming that the optimal weighting matrix is used, and for H finite,
√n ( θ̂_MSM - θ⁰ ) →_d N[ 0, ( 1 + 1/H ) ( D_∞ Ω^{-1} D_∞' )^{-1} ]    (53)
where ( D_∞ Ω^{-1} D_∞' )^{-1} is the asymptotic variance of the infeasible GMM estimator.
That is, the asymptotic variance is inflated by a factor 1 + 1/H. For this reason the MSM estimator is not fully asymptotically efficient relative to the infeasible GMM estimator, for H finite, but the efficiency loss is small and controllable, by setting H reasonably large.
The estimator is asymptotically unbiased even for H = 1. This is an advantage relative to SML.
If one doesn't use the optimal weighting matrix, the asymptotic varcov is just the ordinary GMM varcov, inflated by the factor 1 + 1/H.
23.3.2 Comments
Why is SML inconsistent if H is finite, while MSM is consistent? The reason is that SML is based upon an average of logarithms of an unbiased simulator (the densities of the observations). To use the multinomial probit model as an example, the log-likelihood function is
ln L(β, Ω) = (1/n) Σ_{i=1}^n y_i' ln p_i(β, Ω)
and the SML version is
ln L(β, Ω) = (1/n) Σ_{i=1}^n y_i' ln p̃_i(β, Ω)
The problem is that
E ln( p̃_i(β, Ω) ) ≠ ln( E p̃_i(β, Ω) )
in spite of the fact that
E p̃_i(β, Ω) = p_i(β, Ω)
due to the fact that ln(·) is a nonlinear transformation. The only way for the two to be equal (in the limit) is if H tends to infinity, so that p̃(·) tends to p(·).
The reason that MSM does not suffer from this problem is that in this case the unbiased simulator appears linearly within every sum of terms, and it appears within a sum over n (see equation 52). Therefore the SLLN applies to cancel out simulation errors, from which we get consistency. That is, using simple notation for the random sampling case, the moment conditions
m̃(θ) = (1/n) Σ_{i=1}^n [ K(y_t, x_t) - (1/H) Σ_{h=1}^H K( ỹ_t^h, x_t ) ] z_t    (54)
     = (1/n) Σ_{i=1}^n [ k(x_t, θ⁰) + ε_t - (1/H) Σ_{h=1}^H ( k(x_t, θ) + ε̃_{ht} ) ] z_t    (55)
converge almost surely to
m̃_∞(θ) = ∫ [ k(x, θ⁰) - k(x, θ) ] z(x) dμ(x)
(where z_t is assumed to be made up of functions of x_t), so the simulation errors average out. The limiting objective function has a minimum at θ⁰, from which we get consistency.
If you look at equation 55 a bit, you will see why the variance inflation factor is (1 + 1/H); a short derivation is sketched below.
A poor choice of moment conditions may lead to very inefficient estimators, and can even cause identification problems (as we've seen with the GMM problem set).
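To see where the (1 + 1/H) factor comes from, here is a short sketch (not from the notes), under the simplifying assumption that the simulated statistic has the same conditional variance as the statistic computed from the real data, which holds when the draws come from the DGP at θ⁰:
Var[ ( K(y_t, x_t) - (1/H) Σ_{h=1}^H K(ỹ_t^h, x_t) ) z_t ] = Σ + (1/H²) Σ_{h=1}^H Σ = ( 1 + 1/H ) Σ
where Σ is the variance of K(y_t, x_t) z_t about its conditional mean; the cross terms vanish because the simulated draws are independent of the observed data and of one another. The whole GMM asymptotic variance is therefore scaled up by (1 + 1/H).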
23.4 Efficient method of moments (EMM)
The drawback of the above approach (MSM) is that the moment conditions used in estimation are selected arbitrarily. The asymptotic efficiency of the estimator may be low.
The asymptotically optimal choice of moments would be the score vector of the likelihood function,
m_t(θ) = D_θ ln p_t(θ | I_t)
As before, this choice is unavailable, since p_t(θ | I_t) = p(y_t | x_t, θ) is not available in closed form.
We can define an auxiliary model, called the "score generator", which simply provides a (misspecified) parametric density
f(y | x_t, λ) ≡ f_t(λ)
This density is known up to a parameter λ. We assume that this density function is calculable, so quasi-ML estimation is possible. Specifically,
λ̂ = arg max_Λ s_n(λ) = (1/n) Σ_{t=1}^n ln f_t(λ)
After determining λ̂ we can calculate the score functions D_λ ln f(y_t | x_t, λ̂).
The important point is that even if the density is misspecified, there is a pseudo-true λ⁰ for which the true expectation, taken with respect to the true but unknown density of y, p(y | x, θ⁰), and then marginalized over x, is zero:
∃ λ⁰ : E_X E_{Y|X} [ D_λ ln f(y | x, λ⁰) ] = ∫_X ∫_{Y|X} D_λ ln f(y | x, λ⁰) p(y | x, θ⁰) dy dμ(x) = 0
We have seen in the section on QML that λ̂ converges to λ⁰. This suggests using the moment conditions
m̄_n(θ, λ̂) = (1/n) Σ_{t=1}^n ∫ D_λ ln f_t(λ̂) p_t(θ) dy    (56)
These moment conditions are not calculable, since p_t(θ) is not available, but they are simulable using
m̃_n(θ, λ̂) = (1/n) Σ_{t=1}^n (1/H) Σ_{h=1}^H D_λ ln f( ỹ_t^h | x_t, λ̂ )
where ỹ_t^h is a draw from DGP(θ), holding x_t fixed. By the LLN and the fact that λ̂ converges to λ⁰,
m̃_∞(θ⁰, λ⁰) = 0.
This is not the case for other values of θ, assuming that λ⁰ is identified.
If the score generator f(y | x, λ) closely approximates p(y | x, θ), then m̃_n(θ, λ̂) will closely approximate the optimal moment conditions which characterize maximum likelihood estimation, which is fully efficient.
If one has prior information that a certain density approximates the data well, it would be a good choice for f(·).
If one has no density in mind, there exist good ways of approximating unknown densities parametrically, such as the SNP density estimator discussed above.
The QML estimator of the score generator's parameters satisfies (see the section on QML)
√n ( λ̂ - λ⁰ ) →_d N[ 0, J(λ⁰)^{-1} I(λ⁰) J(λ⁰)^{-1} ]    (57)
If the density f(y_t | x_t, λ̂) were in fact the true density p(y | x_t, θ), then λ̂ would be the maximum likelihood estimator, and J(λ⁰)^{-1} I(λ⁰) would be an identity matrix, due to the information matrix equality. However, in the present case we assume that f(y_t | x_t, λ̂) is only an approximation to p(y | x_t, θ), so there is no cancellation.
Recall that J(λ⁰) ≡ p lim ( ∂²/∂λ∂λ' ) s_n(λ⁰). Comparing the definition of s_n(λ) with the definition of the moment condition in equation 56, we see that
J(λ⁰) = D_{λ'} m̄(θ⁰, λ⁰)
As in Theorem 61,
I(λ⁰) = lim_{n→∞} E[ n ( ∂ s_n(λ)/∂λ |_{λ⁰} ) ( ∂ s_n(λ)/∂λ' |_{λ⁰} ) ]
In this case, this is simply the asymptotic variance covariance matrix of the moment conditions. Now take a first order Taylor's series approximation to √n m̃_n(θ⁰, λ̂) about λ⁰:
√n m̃_n(θ⁰, λ̂) = √n m̃_n(θ⁰, λ⁰) + √n D_{λ'} m̃(θ⁰, λ⁰)( λ̂ - λ⁰ ) + o_p(1)
First consider √n m̃_n(θ⁰, λ⁰). Its asymptotic variance is (1/H) I(λ⁰).
Next consider the second term, √n D_{λ'} m̃(θ⁰, λ⁰)( λ̂ - λ⁰ ). Note that D_{λ'} m̃_n(θ⁰, λ⁰) →_{a.s.} J(λ⁰), so we have
√n D_{λ'} m̃(θ⁰, λ⁰)( λ̂ - λ⁰ ) = √n J(λ⁰)( λ̂ - λ⁰ ), a.s.
But noting equation 57,
√n J(λ⁰)( λ̂ - λ⁰ ) ~ N[ 0, I(λ⁰) ]
Now, combining the results for the first and second terms,
√n m̃_n(θ⁰, λ̂) ~ N[ 0, ( 1 + 1/H ) I(λ⁰) ]
Estimating I(λ⁰) consistently can be complicated if the score generator is a poor approximation, since the individual score contributions may not have mean zero in this case (see the section on QML). Even if this is the case, the individuals' means can be calculated by simulation, so it is always possible to consistently estimate I(λ⁰) when the model is simulable. The estimator using the efficient weighting matrix is then
θ̂ = arg min_Θ m̃_n(θ, λ̂)' [ ( 1 + 1/H ) Î(λ⁰) ]^{-1} m̃_n(θ, λ̂)
If one has used the Gallant-Nychka ML estimator as the auxiliary model, the appropriate weighting matrix is simply the information matrix of the auxiliary model, since the scores are uncorrelated (e.g., it really is ML estimation asymptotically, since the score generator can approximate the unknown density arbitrarily well).
The asymptotic distribution of the estimator is then
√n ( θ̂ - θ⁰ ) →_d N[ 0, ( D_∞ [ ( 1 + 1/H ) I(λ⁰) ]^{-1} D_∞' )^{-1} ]
where
D_∞ = lim E[ D_θ m̄_n'(θ⁰, λ⁰) ].
The fact that √n m̃_n(θ⁰, λ⁰) ~ N[ 0, ( 1 + 1/H ) I(λ⁰) ] implies that
n m̃_n(θ̂, λ̂)' [ ( 1 + 1/H ) Î(λ̂) ]^{-1} m̃_n(θ̂, λ̂) ~ χ²(q)
where q is dim(λ) - dim(θ), since without at least dim(θ) moment conditions the parameter is not identified, so testing is impossible. One test of the model is simply based on this statistic: if it exceeds the χ²(q) critical point, something may be wrong (the small sample performance of this sort of test would be a topic worth investigating).
Information about what is wrong can be gotten from the pseudo-t-statistics:
( diag[ ( 1 + 1/H ) Î(λ̂) ] )^{-1/2} √n m̃_n(θ̂, λ̂)
can be used to test which moments are not well modeled. Since these moments are related to parameters of the score generator, which are usually related to certain features of the model, this information can be used to revise the model. These statistics are not exactly standard normal, since √n m̃_n(θ⁰, λ⁰) and √n m̃_n(θ̂, λ̂) have different distributions. It can be shown that the pseudo-t statistics are biased toward nonrejection. See Gourieroux et al. or Gallant and Long, 1995, for more details.
As an example of a model that is simulable but has an intractable likelihood, consider a first-price sealed-bid auction:
r_0: reservation price. If the highest bid is below r_0 the good is not sold.
Bidders seal their bids, and envelopes are opened after all bids are collected.
There are i = 1, 2, ..., B bidders. Bidders know their own valuation and the distribution of valuations in the population, f(v | q, θ⁰), where q are observable characteristics of the good.
The bidders at time t are assumed to be drawn randomly from the population of bidders.
Bidders are risk neutral, and form their bids under the assumption that all bidders play a symmetric Bayesian Nash strategy.
The problem for the econometrician is to estimate θ⁰, which allows prediction of the outcome under alternative reservation prices.
Under the above assumptions, the winning bid (for 2 or more bidders, and assuming the item is sold) is
E[ max{ v_{B-1:B}, r_0 } | v_{B:B} ]
where v_{1:B} ≤ v_{2:B} ≤ ··· ≤ v_{B-1:B} ≤ v_{B:B} are the order statistics of the valuations v_1, ..., v_B. Intuitively, a bidder will bid the value of the order statistic that is less than his/her private value, since this bid is the lowest bid that can be expected to win, conditional on the winning bid being below the private valuation.
The density of the winning bid is continuous except at y = 0 and y = r_0. In general, the density of the winning bid, p(y | r_0, B, q, θ), like the generic p(y | x, θ) above, is not calculable, but the model is clearly simulable given θ, so the simulation-based estimators discussed above can be applied.
The data necessary to estimate this model are simply the characteristics of the good, the reservation price, and the winning bid. A more efficient (and complicated) model would use all of the bid information, were it available.
Returning to the stochastic differential equation example, a natural score generator is the QML estimator of the discretized model
y_t - y_{t-1} = g(φ, y_{t-1}) + h(φ, y_{t-1}) ε_t,   ε_t ~ N(0, 1)
After obtaining φ̂ by QML, the continuous time model
dy_t = g(θ, y_t) dt + h(θ, y_t) dW_t
is simulated over θ, and the scores of the discretized model are calculated and averaged over the simulations,
m̃_n(θ, φ̂) = (1/N) Σ_{i=1}^N m̄_n( ỹ^i, φ̂ )
and θ̂ is chosen to set the simulated scores as close to zero as possible,
θ̂ = arg min_Θ m̃_n(θ, φ̂)' Ω̂^{-1} m̃_n(θ, φ̂)
where the simulated paths ỹ are generated by applying the discretization
ỹ_t - ỹ_{t-1} = g(θ, ỹ_{t-1}) + h(θ, ỹ_{t-1}) ε̃_t,   ε̃_t ~ N(0, 1)
over a fine time step.
Another application is the multinomial probit model when the dimension of y is large enough so that direct probit is not feasible. Only one element of y is equal to 1, indicating the alternative chosen, while the rest are zero. The choice depends upon the characteristics of the alternatives, x_i. If one estimates a multinomial probit (MNP) model using SML or MSM, one loses the diagnostic testing possibilities of indirect inference.
For example, the score generator could be a multinomial logit (MNL) model, characterized by choice probabilities of the form
Pr( y_i = 1 ) = exp( x_i'β ) / Σ_{j=1}^G exp( x_j'β )
These are tractable for any dimension G. The reason the multinomial probit is to be preferred over the multinomial logit is that the MNL suffers from the problem of independence of irrelevant alternatives. For example, if we have a problem of choice between travel to work by car and red bus, the probabilities of selection of these modes of transit are P_C and P_RB. According to the MNL model, if we add the possibility of travel by blue bus, P_C will drop, since the numerator doesn't change but the denominator does. The MNP model is more satisfactory, since the covariance matrix of the errors (see equation 50) allows for complementarity and substitutability of alternatives.
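The red bus/blue bus problem is easy to see numerically. A minimal sketch in Ox (not from the notes; the utilities are set to zero purely for illustration):

#include <oxstd.h>
mnl(const v) { return exp(v) ./ sumc(exp(v)); }    // MNL choice probabilities from systematic utilities
main()
{
    decl v2 = <0.0; 0.0>;          // car, red bus
    decl v3 = <0.0; 0.0; 0.0>;     // car, red bus, blue bus (identical to red bus)
    println("P(car) with 2 modes: ", mnl(v2)[0]);   // 1/2
    println("P(car) with 3 modes: ", mnl(v3)[0]);   // 1/3, though intuitively it should stay near 1/2
}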
24 Thanks
The following is a list of people who have contributed to these notes in some form.
A number of IDEA students - typos and error corrections
Montserrat Farell - error corrections
25 The GPL
GNU GENERAL PUBLIC LICENSE Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 59 Temple Place, Suite
330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your freedom to share and
change it. By contrast, the GNU General Public License is intended to guarantee your
freedom to share and change free softwareto make sure the software is free for all its
users. This General Public License applies to most of the Free Software Foundations
software and to any other program whose authors commit to using it. (Some other Free
Software Foundation software is covered by the GNU Library General Public License
instead.) You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute
copies of free software (and charge for this service if you wish), that you receive source
code or can get it if you want it, that you can change the software or use pieces of it in
new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid anyone to deny you
these rights or to ask you to surrender the rights. These restrictions translate to certain
responsibilities for you if you distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or for a fee,
you must give the recipients all the rights that you have. You must make sure that they,
too, receive or can get the source code. And you must show them these terms so they
know their rights.
We protect your rights with two steps: (1) copyright the software, and (2) offer
you this license which gives you legal permission to copy, distribute and/or modify the
software.
Also, for each authors protection and ours, we want to make certain that everyone
understands that there is no warranty for this free software. If the software is modified
by someone else and passed on, we want its recipients to know that what they have
is not the original, so that any problems introduced by others will not reflect on the
original authors reputations.
Finally, any free program is threatened constantly by software patents. We wish to
avoid the danger that redistributors of a free program will individually obtain patent
licenses, in effect making the program proprietary. To prevent this, we have made it
clear that any patent must be licensed for everyones free use or not licensed at all.
The precise terms and conditions for copying, distribution and modification follow.
GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains a notice
placed by the copyright holder saying it may be distributed under the terms of this
General Public License. The "Program", below, refers to any such program or work,
and a "work based on the Program" means either the Program or any derivative work
under copyright law: that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".)
notice that there is no warranty (or else, saying that you provide a warranty) and that
users may redistribute the program under these conditions, and telling the user how to
view a copy of this License. (Exception: if the Program itself is interactive but does not
normally print such an announcement, your work based on the Program is not required
to print an announcement.)
These requirements apply to the modified work as a whole. If identifiable sections
of that work are not derived from the Program, and can be reasonably considered
independent and separate works in themselves, then this License, and its terms, do not
apply to those sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based on the Program,
the distribution of the whole must be on the terms of this License, whose permissions
for other licensees extend to the entire whole, and thus to each and every part regardless
of who wrote it.
Thus, it is not the intent of this section to claim rights or contest your rights to
work written entirely by you; rather, the intent is to exercise the right to control the
distribution of derivative or collective works based on the Program.
In addition, mere aggregation of another work not based on the Program with the
Program (or with a work based on the Program) on a volume of a storage or distribution
medium does not bring the other work under the scope of this License.
3. You may copy and distribute the Program (or a work based on it, under Section
2) in object code or executable form under the terms of Sections 1 and 2 above provided
that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable source code,
which must be distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three years, to give any third
party, for a charge no more than your cost of physically performing source distribution,
a complete machine-readable copy of the corresponding source code, to be distributed
under the terms of Sections 1 and 2 above on a medium customarily used for software
interchange; or,
c) Accompany it with the information you received as to the offer to distribute
corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with
such an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code
for all modules it contains, plus any associated interface definition files, plus the scripts
used to control compilation and installation of the executable. However, as a special
exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel,
and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.
If distribution of executable or object code is made by offering access to copy from
a designated place, then offering equivalent access to copy the source code from the
same place counts as distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program except as
expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights
under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain
in full compliance.
5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its
derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and all its terms and
conditions for copying, distributing or modifying the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the Program), the
recipient automatically receives a license from the original licensor to copy, distribute
or modify the Program subject to these terms and conditions. You may not impose any
further restrictions on the recipients exercise of the rights granted herein. You are not
responsible for enforcing compliance by third parties to this License.
7. If, as a consequence of a court judgment or allegation of patent infringement
or for any other reason (not limited to patent issues), conditions are imposed on you
(whether by court order, agreement or otherwise) that contradict the conditions of this
License, they do not excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this License and any
other pertinent obligations, then as a consequence you may not distribute the Program
at all. For example, if a patent license would not permit royalty-free redistribution of
the Program by all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to refrain entirely
from distribution of the Program.
If any portion of this section is held invalid or unenforceable under any particular
circumstance, the balance of the section is intended to apply and the section as a whole
is intended to apply in other circumstances.
It is not the purpose of this section to induce you to infringe any patents or other
property right claims or to contest validity of any such claims; this section has the sole
purpose of protecting the integrity of the free software distribution system, which is
implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on
consistent application of that system; it is up to the author/donor to decide if he or she
is willing to distribute software through any other system and a licensee cannot impose
that choice.
This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is
restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit
geographical distribution limitation excluding those countries, so that distribution is
permitted only in or among countries not thus excluded. In such case, this License
incorporates the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions of the
General Public License from time to time. Such new versions will be similar in spirit
to the present version, but may differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the Program specifies a
version number of this License which applies to it and "any later version", you have
the option of following the terms and conditions either of that version or of any later
version published by the Free Software Foundation. If the Program does not specify
a version number of this License, you may choose any version ever published by the
Free Software Foundation.
10. If you wish to incorporate parts of the Program into other free programs whose
distribution conditions are different, write to the author to ask for permission. For
software which is copyrighted by the Free Software Foundation, write to the Free
Software Foundation; we sometimes make exceptions for this. Our decision will be
guided by the two goals of preserving the free status of all derivatives of our free
software and of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE
IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE
COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM
"AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED
TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY
WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT
NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE
OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE
PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH
HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest possible use to
the public, the best way to achieve this is to make it free software which everyone can
redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest to attach them to
the start of each source file to most effectively convey the exclusion of warranty; and
each file should have at least the "copyright" line and a pointer to where the full notice
is found.
<one line to give the programs name and a brief idea of what it does.> Copyright
(C) 19yy <name of author>
This program is free software; you can redistribute it and/or modify it under the
terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place,
Suite 330, Boston, MA 02111-1307 USA
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this when it starts in
an interactive mode:
Gnomovision version 69, Copyright (C) 19yy name of author Gnomovision comes
with ABSOLUTELY NO WARRANTY; for details type show w. This is free software, and you are welcome to redistribute it under certain conditions; type show c
for details.
The hypothetical commands show w and show c should show the appropriate
parts of the General Public License. Of course, the commands you use may be called
something other than show w and show c; they could even be mouse-clicks or menu
itemswhatever suits your program.
You should also get your employer (if you work as a programmer) or your school,
if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample;
alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program Gnomovision (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989 Ty Coon, President of Vice
This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more
useful to permit linking proprietary applications with the library. If this is what you
want to do, use the GNU Library General Public License instead of this License.
Index
classical linear model, 14
Cobb-Douglas model, 14
cross section, 12
estimator, linear, 20, 26
estimator, OLS, 15
Gauss Markov theorem, 26
likelihood function, 28
matrix, idempotent, 20
matrix, projection, 19
matrix, symmetric, 20
observations, influential, 20
outliers, 20
parameter space, 28
R-squared, uncentered, 22
R-squared, centered, 24