Econometrics - Slides
2011/2012
João Nicolau
1 Introduction
forecast (e.g. interest rates, inflation rates, and gross domestic product).
evaluating and implementing government and business policy. For example, what are the
effects of political campaign expenditures on voting outcomes? What is the effect of school
spending on student performance in the field of education?
Build the economic model. An economic model consists of mathematical equations that
describe various relationships. Formal economic modeling is sometimes the starting point
for empirical analysis, but it is more common to use economic theory less formally, or
even to rely entirely on intuition.
Cross-sectional data is closely aligned with the applied microeconomics fields, such as labor
economics, state and local public finance, industrial organization, urban economics, demography,
and health economics.
Models based on cross-sectional data usually satisfy the assumptions covered in the chapter
“Finite-Sample Properties of OLS”.
A time series data set consists of observations on a variable or several variables over time.
E.g.: stock prices, money supply, consumer price index, gross domestic product, annual
homicide rates, and automobile sales figures, etc.
Time series data cannot be assumed to be independent across time. For example, knowing
something about the gross domestic product from last quarter tells us quite a bit about the
likely range of the GDP during this quarter ...
The analysis of time series data is more difficult than that of cross-sectional data. Reasons:
time-series data exhibit unique features such as trends over time and seasonality;
models based on time-series data rarely satisfy the assumptions covered in the chapter
“Finite-Sample Properties of OLS”. The most adequate assumptions are covered in the chapter
“Large-Sample Theory”, which is theoretically more advanced.
An example of a time series (scatterplots cannot in general be used here, but there are
exceptions):
9
Ceteris Paribus: “other (relevant) factors being equal”. Plays an important role in causal
analysis.
Example. Suppose that wages depend on education and labor force experience. Your goal
is to measure the “return to education”. If your analysis involves only wages and education
you may not uncover the ceteris paribus effect of education on wages. Consider the following
data:
Example. In a totalitarian regime how can you measure the ceteris paribus effect of
another year of education on wages? You may create 100 clones of a “normal” individual.
Give each person a different amount of education and then measure their wages.
In economics you have nonexperimental data, so in principle, it is difficult to estimate
ceteris paribus effects. However, we will see that econometric methods can simulate a ceteris
paribus experiment. We will be able to do in nonexperimental environments what natural
scientists are able to do in a controlled laboratory setting: keep other factors fixed.
11
This chapter covers the finite- or small-sample properties of the OLS estimator, that is, the
statistical properties of the OLS estimator that are valid for any given sample size.
The dependent variable is related to several other variables (called the regressors or the
explanatory variables).
Let $(x_{i1}, x_{i2}, \ldots, x_{iK})$ be the $i$-th observation of the $K$ regressors. The sample or data is a
collection of those $n$ observations.
$\beta$'s: regression coefficients. They represent the marginal and separate effects of the regressors.
Example (1.1). (Consumption function): Consider
$$y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$$
Partial Effects
What is the impact on the conditional expected value of $y$, $E(y_i \mid x_{i1}, x_{i2})$, when $x_{i2}$ is increased
by a small amount $\Delta x_{i2}$,
$x_i' = (x_{i1}, x_{i2}) \rightarrow (x_{i1}, x_{i2} + \Delta x_{i2})$ (holding the other variable fixed)?
Let
$$\Delta E(y_i \mid x_i) \equiv E(y_i \mid x_{i1} = x_{i1}, x_{i2} = x_{i2} + \Delta x_{i2}) - E(y_i \mid x_{i1}, x_{i2}).$$
Equation and interpretation of $\beta_2$:
(level-level) $y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$: $\;\Delta E(y_i \mid x_i) = \beta_2 \Delta x_{i2}$
(level-log) $y_i = \beta_1 + \beta_2 \log(x_{i2}) + \varepsilon_i$: $\;\Delta E(y_i \mid x_i) \simeq \frac{\beta_2}{100}\left(100\frac{\Delta x_{i2}}{x_{i2}}\right)$
(log-level) $\log(y_i) = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$: $\;100\frac{\Delta E(y_i \mid x_i)}{E(y_i \mid x_i)} \simeq (100\beta_2)\Delta x_{i2}$ ($100\beta_2$: semi-elasticity)
(log-log) $\log(y_i) = \beta_1 + \beta_2 \log(x_{i2}) + \varepsilon_i$: $\;100\frac{\Delta E(y_i \mid x_i)}{E(y_i \mid x_i)} \simeq \beta_2\left(100\frac{\Delta x_{i2}}{x_{i2}}\right)$ ($\beta_2$: elasticity)
Exercise 2.1. Suppose, for example, the marginal effect of experience on wages declines with
the level of experience. How can this be captured?
Exercise 2.2. Provide an interpretation of $\beta_2$ in the following equations:
(a) $con_i = \beta_1 + \beta_2 inc_i + \varepsilon_i$, where inc: income, con: consumption (both measured in
dollars). Assume that $\beta_2 = 0.8$;
(b) $\log(wage_i) = \beta_1 + \beta_2 educ_i + \beta_3 tenure_i + \beta_4 expr_i + \varepsilon_i$. Assume that $\beta_2 = 0.05$.
(c) $\log(price_i) = \beta_1 + \beta_2 \log(dist_i) + \varepsilon_i$ where price = housing price and dist =
distance from a recently built garbage incinerator. Assume that $\beta_2 = 0.6$.
We have
$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_K x_{iK} + \varepsilon_i = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{iK} \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix} + \varepsilon_i = x_i'\beta + \varepsilon_i$$
where
$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iK} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix}, \qquad y_i = x_i'\beta + \varepsilon_i.$$
More compactly
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nK} \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix},$$
$$y = X\beta + \varepsilon.$$
Important: y and X (or yi and xik ) may be random variables or observed values. We use
the same notation for both cases.
18
E "ij xj = 0; 8i; j i 6= j
It remains to be analyzed whether or not
?
E ( "ij xi) = 0:
19
(Time Series, Static models) There is a feedback from $y_i$ on future values of $x_i$;
$$\mathrm{Cov}(\varepsilon_i, x_{i2}) = \mathrm{Cov}(\beta_3 x_{i3} + v_i, x_{i2}) = \beta_3\,\mathrm{Cov}(x_{i3}, x_{i2}) \neq 0 \;\Rightarrow\; E(\varepsilon_i \mid X) \neq 0.$$
$$x_{i2} = w_i + u_i.$$
Assume: $E(u_i) = 0$, $\mathrm{Cov}(w_i, u_i) = 0$, $\mathrm{Cov}(v_i, u_i) = 0$. Now substituting $x_{i2} = w_i + u_i$
into $y_i = \beta_1 + \beta_2 w_i + v_i$ we obtain
Example (Feedback from $y$ on future values of $x$). Consider a simple static time-series model
to explain a city's murder rate ($y_t$) in terms of police officers per capita ($x_t$):
$$y_t = \beta_1 + \beta_2 x_t + \varepsilon_t.$$
Suppose that the city adjusts the size of its police force based on past values of the murder
rate. This means that, say, $x_{t+1}$ might be correlated with $\varepsilon_t$ (since a higher $\varepsilon_t$ leads to a
higher $y_t$).
Example (There is a lagged dependent variable as a regressor). See section 2.1.5.
Exercise 2.3. Let kids denote the number of children ever born to a woman, and let educ
denote years of education for the woman. A simple model relating fertility to years of
education is
$$kids_i = \beta_1 + \beta_2 educ_i + \varepsilon_i,$$
where $\varepsilon_i$ is the unobserved error. (i) What kinds of factors are contained in $\varepsilon_i$? Are these
likely to be correlated with level of education? (ii) Will a simple regression analysis uncover
the ceteris paribus effect of education on fertility? Explain.
E ("i) = 0; 8i:
E "ij xj = 0; 8i; j:
E xjk "i = 0; 8i; j; k or E xj "i = 0; 8i; j The regressors are orthogonal to the
error term for all observations
For time-series models where strict exogeneity can be rephrased as: the regressors are or-
thogonal to the past, current, and future error terms. However, for most time-series models,
strict exogeneity is not satis…ed.
Example. Consider
yi = yi 1 + "i; E ( "ij yi 1) = 0 (thus E (yi 1"i) = 0).
Let xi = yi 1: By construction we have
2
E (xi+1"i) = E (yi"i) = ::: = E "i 6= 0:
The regressor is not orthogonal to the past error term, which is a violation of strict exogeneity.
However, the estimator may possess good large-sample properties without strict exogeneity.
None of the K columns of the data matrix X can be expressed as a linear combination of
the other columns of X.
Example (1.4 - continuation of Example 1.2). If no individuals in the sample ever changed
jobs, then $tenure_i = expr_i$ for all $i$, in violation of the no-multicollinearity assumption.
There is no way to distinguish the tenure effect on the wage rate from the experience effect.
Remedy: drop $tenure_i$ or $expr_i$ from the wage equation.
Example (Dummy Variable Trap). Consider
$$wage_i = \beta_1 + \beta_2 educ_i + \beta_3 female_i + \beta_4 male_i + \varepsilon_i$$
where
$$female_i = \begin{cases} 1 & \text{if } i \text{ corresponds to a female} \\ 0 & \text{if } i \text{ corresponds to a male} \end{cases}, \qquad male_i = 1 - female_i.$$
In vectorial notation we have
$$\mathbf{wage} = \beta_1\mathbf{1} + \beta_2\,\mathbf{educ} + \beta_3\,\mathbf{female} + \beta_4\,\mathbf{male} + \boldsymbol{\varepsilon}.$$
It is obvious that $\mathbf{1} = \mathbf{female} + \mathbf{male}$. Therefore the above model violates Assumption
1.3. One may also justify this using scalar notation: $x_{i1} = female_i + male_i$, because this
relationship implies $\mathbf{1} = \mathbf{female} + \mathbf{male}$. Can you overcome the dummy variable trap by
removing $x_{i1} \equiv 1$ from the equation?
25
Exercise 2.4. In a study relating college grade point average to time spent in various activities,
you distribute a survey to several students. The students are asked how many hours
they spend each week in four activities: studying, sleeping, working, and leisure. Any activity
is put into one of the four categories, so that for each student the sum of hours in the four
activities must be 168. (i) In the model
Exercise 2.5. Under Assumptions 1.2 and 1.4, show that $\mathrm{Cov}(y_i, y_j \mid X) = 0$.
E ""0 X = 2I:
Note
2 3
E "21 X E ( " 1 "2 j X ) E ( "1 "n j X )
6 7
6 ( " " j X) 7
0 6 E 1 2 E "22 X E ( "2 "n j X ) 7
E "" X =6
6 ... ... ... ... 7:
7
4 5
E ( " 1 "n j X ) E ( " 2 " n j X ) E "2n X
27
The sample (y; X) is a random sample if f(yi; xi)g is i.i.d. (independently and identically
distributed) across observations. Random sample automatically implies:
This is a simplifying (and generally an unrealistic) assumption to make the statistical analysis
tractable. It means that X is exactly the same in repeated samples. Sampling schemes that
support this assumption:
a) Experimental situations. For example, suppose that y represents the yields of a crop
grown on n experimental plots, and let the rows of X represent the seed varieties, irrigation
and fertilizer for each plot. The experiment can be repeated as often as desired, with the
same X. Only y varies across plots.
$y_i - x_i'\tilde{\beta}$.
Vector of residuals (evaluated at $\tilde{\beta}$):
$$y - X\tilde{\beta}.$$
Sum of squared residuals (SSR):
$$SSR(\tilde{\beta}) = \sum_{i=1}^{n}\left(y_i - x_i'\tilde{\beta}\right)^2 = \left(y - X\tilde{\beta}\right)'\left(y - X\tilde{\beta}\right).$$
The OLS (Ordinary Least Squares):
$$K = 1: \quad y_i = \beta x_i + \varepsilon_i$$
Example. Consider $y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$. The data:

y    X
1    1  1
3    1  3
2    1  1
8    1  3
12   1  8

Verify that $SSR(\tilde{\beta}) = 42$ when $\tilde{\beta} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$.
$$\frac{\partial^2 SSR(\tilde{\beta})}{\partial\tilde{\beta}\,\partial\tilde{\beta}'} \text{ is a positive definite matrix} \;\Leftrightarrow\; b \text{ is a global minimum point.}$$
$$X'Xb = X'y \quad \text{or} \quad X'(y - Xb) = 0.$$
This is a system with $K$ equations and $K$ unknowns. These equations are called the normal
equations. If
$$\mathrm{rank}(X) = K \;\Rightarrow\; X'X \text{ is nonsingular} \;\Rightarrow\; \text{there exists } (X'X)^{-1}.$$
Therefore, if $\mathrm{rank}(X) = K$ we have a unique solution:
$$b = (X'X)^{-1}X'y \qquad \text{(OLS estimator)}.$$
The SOC is
$$\frac{\partial^2 SSR(\tilde{\beta})}{\partial\tilde{\beta}\,\partial\tilde{\beta}'} = 2X'X.$$
If $\mathrm{rank}(X) = K$ then $2X'X$ is a positive definite matrix, thus $SSR(\tilde{\beta})$ is strictly convex
in $\mathbb{R}^K$. Hence $b$ is a global minimum point.
$$e = y - Xb$$
is called the vector of OLS residuals (or simply residuals).
$$b = (X'X)^{-1}X'y = \left(\frac{X'X}{n}\right)^{-1}\frac{X'y}{n} = S_{xx}^{-1}S_{xy}, \quad \text{where}$$
$$S_{xx} = \frac{X'X}{n} = \frac{1}{n}\sum_{i=1}^{n}x_i x_i' \;\text{(sample average of } x_i x_i'\text{)}, \qquad S_{xy} = \frac{X'y}{n} = \frac{1}{n}\sum_{i=1}^{n}x_i y_i \;\text{(sample average of } x_i y_i\text{)}.$$

y    X
1    1  1
3    1  3
2    1  1
8    1  3
12   1  8
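A minimal numerical sketch (not part of the original slides; it assumes Python with NumPy) that verifies the SSR value of 42 at $\tilde{\beta} = (0, 1)'$ and computes the OLS estimate $b$ from the normal equations for the example data above:

```python
import numpy as np

# Example data from the slides (first column of X is the constant).
y = np.array([1.0, 3.0, 2.0, 8.0, 12.0])
X = np.array([[1.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0],
              [1.0, 3.0],
              [1.0, 8.0]])

def ssr(beta_tilde):
    """Sum of squared residuals evaluated at an arbitrary beta_tilde."""
    resid = y - X @ beta_tilde
    return resid @ resid

# SSR at beta_tilde = (0, 1)' equals 42, as the exercise asks you to verify.
print(ssr(np.array([0.0, 1.0])))          # 42.0

# OLS estimator from the normal equations X'Xb = X'y, i.e. b = (X'X)^{-1} X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b, ssr(b))                           # OLS minimizes the SSR
```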
Properties:
Exercise 2.7. Show that P and M are symmetric and idempotent and
$$PX = X, \qquad MX = 0,$$
$$\hat{y} = Py, \qquad e = My = M\varepsilon, \qquad SSR = e'e = y'My = \varepsilon'M\varepsilon.$$
The OLS estimate of $\sigma^2$ (the variance of the error term), denoted $s^2$, is
$$s^2 = \frac{SSR}{n-K} = \frac{e'e}{n-K}.$$
$s$ (the square root of $s^2$) is called the standard error of the regression.
[Figure: three scatterplots of $y$ and the fitted values $\hat{y}$ against $x$, illustrating regressions with $R^2 = 0.96$, $R^2 = 0.19$ and $R^2 = 0.00$.]
“The most important thing about R2 is that it is not important” (Goldberger). Why?
We are concerned with parameters in a population, not with goodness of fit in the sample;
Assumptions:
1.1 - Linearity: $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_K x_{iK} + \varepsilon_i$.
1.2 - Strict exogeneity: $E(\varepsilon_i \mid X) = 0$.
1.3 - No multicollinearity.
1.4 - Spherical error variance: $E(\varepsilon_i^2 \mid X) = \sigma^2$, $E(\varepsilon_i\varepsilon_j \mid X) = 0$ for $i \neq j$.
Matrix $P = X(X'X)^{-1}X'$
$Py \rightarrow$ Fitted values from the regression of $y$ on $X$
$Pz \rightarrow$ ?
Matrix $M = I - P = I - X(X'X)^{-1}X'$
$My \rightarrow$ Residuals from the regression of $y$ on $X$
$Mz \rightarrow$ ?
Consider a partition of $X$ as follows: $X = [X_1 \;\; X_2]$
Matrix $P_1 = X_1(X_1'X_1)^{-1}X_1'$
$P_1 y \rightarrow$ ?
Matrix $M_1 = I - P_1 = I - X_1(X_1'X_1)^{-1}X_1'$
$M_1 y \rightarrow$ ?
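A small numerical check (illustrative only, NumPy, artificial data) of the properties of $P$ and $M$ stated in Exercise 2.7:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # includes a constant
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix
M = np.eye(n) - P                        # annihilator (residual-maker) matrix

# Symmetry, idempotency, PX = X, MX = 0.
print(np.allclose(P, P.T), np.allclose(P @ P, P))
print(np.allclose(M, M.T), np.allclose(M @ M, M))
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))

# Py gives the fitted values and My the residuals of the regression of y on X.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(P @ y, X @ b), np.allclose(M @ y, y - X @ b))
```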
44
Partition $X$ as
$$X = [X_1 \;\; X_2], \qquad X_1: n \times K_1, \quad X_2: n \times K_2, \quad K_1 + K_2 = K.$$
Long Regression
We have
$$y = \hat{y} + e = Xb + e = [X_1 \;\; X_2]\begin{bmatrix} b_1 \\ b_2 \end{bmatrix} + e = X_1 b_1 + X_2 b_2 + e.$$
Short Regression
Suppose that we shorten the list of explanatory variables and regress $y$ on $X_1$. We have
$$y = \hat{y}^* + e^* = X_1 b_1^* + e^*$$
where
$$b_1^* = (X_1'X_1)^{-1}X_1'y, \qquad e^* = M_1 y, \qquad M_1 = I - X_1(X_1'X_1)^{-1}X_1'.$$
45
$b_1^*$ vs. $b_1$
We have
$$b_1^* = (X_1'X_1)^{-1}X_1'y = (X_1'X_1)^{-1}X_1'(X_1 b_1 + X_2 b_2 + e) = b_1 + (X_1'X_1)^{-1}X_1'X_2 b_2 + \underbrace{(X_1'X_1)^{-1}X_1'e}_{0} = b_1 + F b_2, \qquad F = (X_1'X_1)^{-1}X_1'X_2.$$
Thus, in general, $b_1^* \neq b_1$. Exceptional cases: $b_2 = 0$ or $X_1'X_2 = O \;\Rightarrow\; b_1^* = b_1$.
46
$e^*$ vs. $e$
We have
$$e^* = M_1 y = M_1(X_1 b_1 + X_2 b_2 + e) = M_1 X_1 b_1 + M_1 X_2 b_2 + M_1 e = M_1 X_2 b_2 + e = v + e.$$
Thus
$$e^{*\prime}e^* = e'e + v'v \geq e'e.$$
Thus the SSR of the short regression ($e^{*\prime}e^*$) exceeds the SSR of the long regression ($e'e$),
and $e^{*\prime}e^* = e'e$ iff $v = 0$, that is, iff $b_2 = 0$.
47
Consider
$$y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon.$$
Premultiplying both sides by $M_1$ and using $M_1 X_1 = 0$, we obtain
$$M_1 y = M_1 X_1\beta_1 + M_1 X_2\beta_2 + M_1\varepsilon, \qquad \tilde{y} = \tilde{X}_2\beta_2 + M_1\varepsilon.$$
The OLS gives
$$b_2 = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'\tilde{y} = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'M_1 y = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'y.$$
Thus
$$b_2 = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'\tilde{y}.$$
49
Another way to prove $b_2 = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'y$ (you may skip this proof). We have
$$(\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'y = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'(X_1 b_1 + X_2 b_2 + e) = \underbrace{(\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'X_1 b_1}_{0} + \underbrace{(\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'X_2 b_2}_{b_2} + \underbrace{(\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'e}_{0} = b_2$$
since:
$$(\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'X_1 b_1 = (\tilde{X}_2'\tilde{X}_2)^{-1}X_2'M_1 X_1 b_1 = 0,$$
$$(\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'X_2 b_2 = (\tilde{X}_2'\tilde{X}_2)^{-1}X_2'M_1 X_2 b_2 = (X_2'M_1'M_1 X_2)^{-1}X_2'M_1 X_2 b_2 = (X_2'M_1 X_2)^{-1}X_2'M_1 X_2 b_2 = b_2,$$
$$\tilde{X}_2'e = X_2'M_1 e = X_2'e = 0.$$
50
The conclusion is that we can obtain $b_2 = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'y = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'\tilde{y}$ as follows:
1) Regress $X_2$ on $X_1$ to get the residuals $\tilde{X}_2 = M_1 X_2$.
2) Regress $y$ on $\tilde{X}_2$ to get the coefficient $b_2$ of the long regression.
OR:
1') Same as 1).
2'a) Regress $y$ on $X_1$ to get the residuals $\tilde{y} = M_1 y$.
2'b) Regress $\tilde{y}$ on $\tilde{X}_2$ to get the coefficient $b_2$ of the long regression.
Example. Consider $X = [\mathbf{1} \;\; exper \;\; tenure \;\; IQ \;\; educ]$ and
$$X_1 = [\mathbf{1} \;\; exper \;\; tenure \;\; IQ], \qquad X_2 = educ.$$
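A simulation sketch of the residual-regression result (illustrative only; NumPy, artificial data standing in for the exper/tenure/IQ/educ example, hypothetical coefficient values): the coefficient on $X_2$ from the long regression coincides with the one obtained by regressing $\tilde{y} = M_1 y$ on $\tilde{X}_2 = M_1 X_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # plays the role of [1 exper tenure IQ]
x2 = rng.normal(size=n) + 0.5 * X1[:, 1]                      # plays the role of educ (correlated with X1)
y = X1 @ np.array([1.0, 0.5, -0.3, 0.2]) + 0.8 * x2 + rng.normal(size=n)

# Long regression: y on [X1 x2]; b2 is the last coefficient.
X = np.column_stack([X1, x2])
b_long = np.linalg.solve(X.T @ X, X.T @ y)

# Residual regression: residualize both y and x2 on X1, then regress.
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
y_tilde, x2_tilde = M1 @ y, M1 @ x2
b2_fwl = (x2_tilde @ y_tilde) / (x2_tilde @ x2_tilde)

print(b_long[-1], b2_fwl)   # identical up to floating-point error
```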
53
54
Suppose that $y_t$ and $x_t$ have a linear trend. Should the trend term be included in the
regression, as in
$$y_t = \beta_1 + \beta_2 x_{t2} + \beta_3 x_{t3} + \varepsilon_t, \qquad x_{t3} = t,$$
or should the variables first be “detrended” and then used without the trend term included,
as in
$$\tilde{y}_t = \beta_2\tilde{x}_{t2} + \tilde{\varepsilon}_t ?$$
According to the previous results, the OLS coefficient $b_2$ is the same in both regressions.
In the second regression $b_2$ is obtained from the regression of $\tilde{y} = M_1 y$ on $\tilde{x}_2 = M_1 x_2$,
where
$$X_1 = [\mathbf{1} \;\; x_3] = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & n \end{bmatrix}.$$
55
EQ01
Dependent Variable: TXDES; Method: Least Squares; Sample: 1948 2003
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             4.463068      0.425856     10.48023      0.0000
INF           0.104712      0.063329     1.653473      0.1041
@TREND        0.027788      0.011806     2.353790      0.0223

EQ02
Dependent Variable: TXDES; Method: Least Squares; Sample: 1948 2003
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             4.801316      0.379453     12.65325      0.0000
@TREND        0.030277      0.011896     2.545185      0.0138

EQ03
Dependent Variable: INF; Method: Least Squares; Sample: 1948 2003
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             3.230263      0.802598     4.024758      0.0002
@TREND        0.023770      0.025161     0.944696      0.3490

EQ04
Dependent Variable: TXDES_; Method: Least Squares; Sample: 1948 2003
Variable      Coefficient   Std. Error   t-Statistic   Prob.
INF_          0.104712      0.062167     1.684382      0.0978
56
Suppose that we have data on the variable $y$, quarter by quarter, for $m$ years. A way to deal
with (deterministic) seasonality is the following:
$$y_t = \beta_1 Q_{t1} + \beta_2 Q_{t2} + \beta_3 Q_{t3} + \beta_4 Q_{t4} + \beta_5 x_{t5} + \varepsilon_t$$
where
$$Q_{ti} = \begin{cases} 1 & \text{in quarter } i \\ 0 & \text{otherwise.} \end{cases}$$
Let
$$X = [Q_1 \;\; Q_2 \;\; Q_3 \;\; Q_4 \;\; x_5], \qquad X_1 = [Q_1 \;\; Q_2 \;\; Q_3 \;\; Q_4].$$
Previous results show that $b_5$ can be obtained from the regression of $\tilde{y} = M_1 y$ on $\tilde{x}_5 = M_1 x_5$. It can be proved that
$$\tilde{y}_t = \begin{cases} y_t - \bar{y}_{Q_1} & \text{in quarter 1} \\ y_t - \bar{y}_{Q_2} & \text{in quarter 2} \\ y_t - \bar{y}_{Q_3} & \text{in quarter 3} \\ y_t - \bar{y}_{Q_4} & \text{in quarter 4} \end{cases}$$
where $\bar{y}_{Q_i}$ is the seasonal mean of quarter $i$.
57
so that
$$\hat{y} = x_1 b_1 + X_2 b_2.$$
1) Regress $X_2$ on $x_1$ to get the residuals $\tilde{X}_2 = M_1 X_2$, where
$$M_1 = I - x_1(x_1'x_1)^{-1}x_1' = I - \frac{x_1 x_1'}{n}.$$
58
As we know,
$$\tilde{X}_2 = M_1 X_2 = M_1[x_2 \;\; \cdots \;\; x_K] = [M_1 x_2 \;\; \cdots \;\; M_1 x_K] = \begin{bmatrix} x_{12} - \bar{x}_2 & \cdots & x_{1K} - \bar{x}_K \\ \vdots & & \vdots \\ x_{n2} - \bar{x}_2 & \cdots & x_{nK} - \bar{x}_K \end{bmatrix}.$$
Consider:
$$E(y \mid X) = X_1\beta_1 + X_2\beta_2 = X\beta, \qquad \mathrm{Var}(y \mid X) = \sigma^2 I, \text{ etc.}$$
60
$b_1^*$ is a biased estimator of $\beta_1$
Given that
$$b_1^* = (X_1'X_1)^{-1}X_1'y = b_1 + F b_2, \qquad F = (X_1'X_1)^{-1}X_1'X_2,$$
we have
$$E(b_1^* \mid X) = E(b_1 + F b_2 \mid X) = \beta_1 + F\beta_2,$$
$$\mathrm{Var}(b_1^* \mid X) = \mathrm{Var}\!\left((X_1'X_1)^{-1}X_1'y \mid X\right) = (X_1'X_1)^{-1}X_1'\,\mathrm{Var}(y \mid X)\,X_1(X_1'X_1)^{-1} = \sigma^2(X_1'X_1)^{-1},$$
thus, in general,
$b_1^*$ is a biased estimator of $\beta_1$ (“omitted-variable bias”)
unless $\beta_2 = 0$ or $X_1'X_2 = O$.
Consider $b_1 = b_1^* - F b_2$.
In practice there may be a bias-variance trade-off between short and long regression when
the target is $\beta_1$.
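A Monte Carlo sketch of the omitted-variable bias and the bias-variance trade-off (illustrative only; NumPy, artificial data, hypothetical parameter values not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 2000
beta1, beta2 = 1.0, 0.2            # hypothetical population coefficients

b1_short = np.empty(reps)
b1_long = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)          # x2 correlated with x1
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    X_long = np.column_stack([np.ones(n), x1, x2])
    X_short = np.column_stack([np.ones(n), x1])
    b1_long[r] = np.linalg.solve(X_long.T @ X_long, X_long.T @ y)[1]
    b1_short[r] = np.linalg.solve(X_short.T @ X_short, X_short.T @ y)[1]

# The short-regression coefficient is biased (centred near beta1 + 0.6*beta2 = 1.12)
# but, with these parameter values, it has the smaller sampling variance.
print(b1_long.mean(), b1_long.var())
print(b1_short.mean(), b1_short.var())
```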
62
Exercise 2.9. Consider the standard simple regression model $y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$ under
Assumptions 1.1 through 1.4. Thus, the usual OLS estimators $b_1$ and $b_2$ are unbiased for
their respective population parameters. Let $\tilde{b}_2$ be the estimator of $\beta_2$ obtained by assuming
the intercept is zero, i.e. $\beta_1 = 0$. (i) Find $E(\tilde{b}_2 \mid X)$. Verify that $\tilde{b}_2$ is unbiased for $\beta_2$ when
the population intercept $\beta_1$ is zero. Are there other cases where $\tilde{b}_2$ is unbiased? (ii) Find the
variance of $\tilde{b}_2$. (iii) Show that $\mathrm{Var}(\tilde{b}_2 \mid X) \leq \mathrm{Var}(b_2 \mid X)$; (iv) Comment on the trade-off
between bias and variance when choosing between $\tilde{b}_2$ and $b_2$.
Exercise 2.10. Suppose that average worker productivity at manufacturing firms (avgprod)
depends on two factors, average hours of training (avgtrain) and average worker ability
(avgabil):
$$avgprod_i = \beta_1 + \beta_2 avgtrain_i + \beta_3 avgabil_i + \varepsilon_i.$$
Assume that this equation satisfies Assumptions 1.1 through 1.4. If grants have been given to
firms whose workers have less than average ability, so that avgtrain and avgabil are negatively
correlated, what is the likely bias in the estimator of $\beta_2$ obtained from the simple regression of avgprod on
avgtrain?
63
Let’s see now that the omission of explanatory variables leads to an increase in the expected
SSR. We have, by R5,
$$E(e^{*\prime}e^* \mid X) = E(y'M_1 y \mid X) = \mathrm{tr}(M_1\,\mathrm{Var}(y \mid X)) + E(y \mid X)'M_1 E(y \mid X) = \sigma^2\mathrm{tr}(M_1) + \beta_2'\tilde{X}_2'\tilde{X}_2\beta_2 = \sigma^2(n - K_1) + \beta_2'\tilde{X}_2'\tilde{X}_2\beta_2$$
and $E(e'e \mid X) = \sigma^2(n - K)$, thus
$$E(e^{*\prime}e^* \mid X) - E(e'e \mid X) = \sigma^2 K_2 + \beta_2'\tilde{X}_2'\tilde{X}_2\beta_2 > 0.$$
Notice that: $e^{*\prime}e^* - e'e = b_2'\tilde{X}_2'\tilde{X}_2 b_2 \geq 0$ (check that $E(b_2'\tilde{X}_2'\tilde{X}_2 b_2 \mid X) = \sigma^2 K_2 + \beta_2'\tilde{X}_2'\tilde{X}_2\beta_2$).
64
C) Residual Regression
It follows that
$$\mathrm{Var}(b_K \mid X) = \frac{\sigma^2}{x_K'M_1 x_K}$$
and $x_K'M_1 x_K$ is the sum of the squared residuals in the auxiliary regression
$$x_K = \gamma_1 x_1 + \gamma_2 x_2 + \cdots + \gamma_{K-1}x_{K-1} + \text{error}.$$
One can conclude (assuming that $x_1$ is the summer vector, i.e. the vector of ones):
$$R_K^2 = 1 - \frac{x_K'M_1 x_K}{\sum (x_{iK} - \bar{x}_K)^2}.$$
Solving this equation for $x_K'M_1 x_K$ we have
$$x_K'M_1 x_K = \left(1 - R_K^2\right)\sum (x_{iK} - \bar{x}_K)^2.$$
We get
$$\mathrm{Var}(b_K \mid X) = \frac{\sigma^2}{\left(1 - R_K^2\right)\sum (x_{iK} - \bar{x}_K)^2} = \frac{\sigma^2}{\left(1 - R_K^2\right)S_{x_K}^2\, n}.$$
$$\mathrm{Var}(b_K \mid X) = \frac{\sigma^2}{\left(1 - R_K^2\right)\sum (x_{iK} - \bar{x}_K)^2} = \frac{\sigma^2}{\left(1 - R_K^2\right)S_{x_K}^2\, n}.$$
We can conclude that the precision of $b_K$ is high (i.e. $\mathrm{Var}(b_K)$ is small) when:
$\sigma^2$ is low;
the sample variation in $x_K$ (i.e. $S_{x_K}^2$) is high;
$n$ is large;
$R_K^2$ is low (the regressor $x_K$ is not highly correlated with the other regressors).
Exercise 2.11. Consider: sleep: minutes sleep at night per week; totwrk: hours worked
per week; educ: years of schooling; female: binary variable equal to one if the individual
is female. Do women sleep more than men? Explain the differences between the estimates
32.18 and -90.969.

Regression 1 (dependent variable: SLEEP)
Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          3252.407      22.22211     146.3591      0.0000
FEMALE     32.18074      33.75413     0.953387      0.3407
R-squared 0.001289   Adjusted R-squared -0.000129   S.E. of regression 444.4422   Sum squared resid 1.39E+08
Mean dependent var 3266.356   S.D. dependent var 444.4134   Akaike info criterion 15.03435   Schwarz criterion 15.04726

Regression 2 (dependent variable: SLEEP)
Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          3838.486      86.67226     44.28737      0.0000
TOTWRK     -0.167339     0.017937     -9.329260     0.0000
EDUC       -13.88479     5.657573     -2.454196     0.0144
FEMALE     -90.96919     34.27441     -2.654143     0.0081
R-squared 0.119277   Adjusted R-squared 0.115514   S.E. of regression 417.9581   Sum squared resid 1.23E+08
Mean dependent var 3266.356   S.D. dependent var 444.4134   Akaike info criterion 14.91429   Schwarz criterion 14.94012
68
Example. The goal is to analyze the impact of another year of education on wages. Consider:
wage: monthly earnings; KWW: knowledge of world work score (KWW is a general test of
work-related abilities); educ: years of education; exper: years of work experience; tenure:
years with current employer.

Regression 1
Dependent Variable: LOG(WAGE); Method: Least Squares; Sample: 1 935
White Heteroskedasticity-Consistent Standard Errors & Covariance
Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          5.973062      0.082272     72.60160      0.0000
EDUC       0.059839      0.006079     9.843503      0.0000
R-squared 0.097417   Adjusted R-squared 0.096449   S.E. of regression 0.400320   Sum squared resid 149.5186
Mean dependent var 6.779004   S.D. dependent var 0.421144   Akaike info criterion 1.009029   Schwarz criterion 1.019383

Regression 2
Dependent Variable: LOG(WAGE); Method: Least Squares; Sample: 1 935
White Heteroskedasticity-Consistent Standard Errors & Covariance
Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          5.496696      0.112030     49.06458      0.0000
EDUC       0.074864      0.006654     11.25160      0.0000
EXPER      0.015328      0.003405     4.501375      0.0000
TENURE     0.013375      0.002657     5.033021      0.0000
R-squared 0.155112   Adjusted R-squared 0.152390   S.E. of regression 0.387729   Sum squared resid 139.9610
Mean dependent var 6.779004   S.D. dependent var 0.421144   Akaike info criterion 0.947250   Schwarz criterion 0.967958
$$y = X_1 b_1 + X_2 b_2 + e,$$
and the following coefficients (obtained from the short regressions):
$$b_1^* = (X_1'X_1)^{-1}X_1'y, \qquad b_2^* = (X_2'X_2)^{-1}X_2'y.$$
Decide if you agree or disagree with the following statement: if $\mathrm{Cov}(b_1, b_2 \mid X_1, X_2) = O$
(zero matrix) then $b_1^* = b_1$ and $b_2^* = b_2$.
70
2.5 Multicollinearity
If $\mathrm{rank}(X) < K$ then $b$ is not defined. This is called strict multicollinearity. When this
happens, the statistical software will be unable to construct $(X'X)^{-1}$. Since the error is
discovered quickly, this is rarely a problem for applied econometric practice.
The more relevant situation is near multicollinearity, which is often called “multicollinearity”
for brevity. This is the situation where $X'X$ is near singular, i.e. where the columns of $X$ are
close to linearly dependent.
Consequence: the individual coefficient estimates will be imprecise. We have shown that
$$\mathrm{Var}(b_K \mid X) = \frac{\sigma^2}{\left(1 - R_K^2\right)S_{x_K}^2\, n},$$
where $R_K^2$ is the coefficient of determination in the auxiliary regression
$$x_K = \gamma_1 x_1 + \gamma_2 x_2 + \cdots + \gamma_{K-1}x_{K-1} + \text{error}.$$
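A numerical sketch (illustrative only; NumPy, artificial nearly collinear data) of the variance formula above: the conventional OLS variance of the last coefficient equals $s^2/[(1-R_K^2)\sum(x_{iK}-\bar{x}_K)^2]$, and a high $R_K^2$ inflates it.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + 0.1 * rng.normal(size=n)       # nearly collinear with x2
y = 1.0 + 0.5 * x2 + 0.5 * x3 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x2, x3])

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - X.shape[1])
var_b = s2 * np.linalg.inv(X.T @ X)              # conventional variance estimate

# Auxiliary regression of the last regressor on the remaining ones gives R^2_K;
# Var(b_K|X) = s2 / ((1 - R^2_K) * sum((x_K - mean)^2)), the formula on the slide.
XK, Xrest = X[:, -1], X[:, :-1]
g = np.linalg.solve(Xrest.T @ Xrest, Xrest.T @ XK)
resid_aux = XK - Xrest @ g
R2_K = 1.0 - resid_aux @ resid_aux / np.sum((XK - XK.mean()) ** 2)
var_bK_formula = s2 / ((1.0 - R2_K) * np.sum((XK - XK.mean()) ** 2))

print(var_b[-1, -1], var_bK_formula, R2_K)       # the two variance numbers coincide
```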
71
Exercise 2.14. Do you agree with the following quotations: (a) “But more data is no remedy
for multicollinearity if the additional data are simply 'more of the same.' So obtaining lots
of small samples from the same population will not help” (Johnston, 1984); (b) “Another
important point is that a high degree of correlation between certain independent variables
can be irrelevant as to how well we can estimate other parameters in the model.”
Exercise 2.15. Suppose you postulate a model explaining final exam score in terms of class
attendance. Thus, the dependent variable is final exam score, and the key explanatory
variable is number of classes attended. To control for student abilities and efforts outside
the classroom, you include among the explanatory variables cumulative GPA, SAT score, and
measures of high school performance. Someone says, “You cannot hope to learn anything
from this exercise because cumulative GPA, SAT score, and high school performance are
likely to be highly collinear.” What should be your answer?
72
Assumption 1.5 together with Assumptions 1.2 and 1.4 implies that
$$\varepsilon \mid X \sim N(0, \sigma^2 I) \quad \text{and} \quad y \mid X \sim N(X\beta, \sigma^2 I).$$
1. $z \sim N(0, I) \Rightarrow z'z \sim \chi^2_{(n)}$.
2. $w_1 \sim \chi^2_{(m)}$, $w_2 \sim \chi^2_{(n)}$, $w_1$ and $w_2$ independent $\Rightarrow \dfrac{w_1/m}{w_2/n} \sim F(m, n)$.
3. $w \sim \chi^2_{(n)}$, $z \sim N(0, 1)$, $w$ and $z$ independent $\Rightarrow \dfrac{z}{\sqrt{w/n}} \sim t_{(n)}$.
4. Asymptotic results:
$$v \sim F(m, n) \Rightarrow mv \stackrel{d}{\rightarrow} \chi^2_{(m)} \text{ as } n \rightarrow \infty, \qquad u \sim t_{(n)} \Rightarrow u \stackrel{d}{\rightarrow} N(0, 1) \text{ as } n \rightarrow \infty.$$
$$w = (y - X\beta)'(\sigma^2 I)^{-1}(y - X\beta) \sim \chi^2_{(n)}.$$
74
"0M" X 2 :
(r)
1
8. bj X N ; 2 X0 X :
9. Let r = R (Rp K ) with rank (R) = p (in Hayashi’s notation p is equal to #r):
Then,
1
Rbj X N r; 2R X0 X R0 :
75
1
10. Let bk be the kth element of b and q kk the (k; k) element of X0X : Then,
b
bk j X N k;
2 q kk or zk = kq k N (0; 1) :
q kk
0 1 1 2 2 :
11. w = (Rb r) R X0 X R0 (Rb r) = (p)
2
(bk k) 2 :
12. wk = 2 q kk (1)
13. w0 = e0e= 2 2
(n K) :
b 1
16. tk = k^ k t (n K ) ; where ^ 2b is the (k; k) element of s2 X0X :
bk k
17. q Rb R t (n K ) ; R is of type 1 K
s R(X0 X) 1 R0
0 1 1
18. F = (Rb r) R X0 X R0 (Rb r) = ps2 F (p; n K) :
Exercise 2.16. Prove the results #8, #9, #16 and #18 (take the other results as given).
P jtj < t =2 = 1 :
78
$$P(F \leq F_\alpha) = 1 - \alpha.$$
$(1 - \alpha) \cdot 100\%$ confidence region for the parameter vector $\beta$ (consider $R = I$ in the previous
case):
$$\left\{\beta : (b - \beta)'X'X(b - \beta)/s^2 \leq p F_\alpha\right\}.$$
80
Exercise 2.17. Consider $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$ where $y_i = wages_i - \overline{wages}$, $x_{i1} = educ_i - \overline{educ}$, $x_{i2} = exper_i - \overline{exper}$. The results are

Dependent Variable: Y; Method: Least Squares; Sample: 1 526
Variable   Coefficient   Std. Error   t-Statistic   Prob.
X1         0.644272      0.053755     11.98541      0.0000
X2         0.070095      0.010967     6.391393      0.0000

$$X'X = \begin{bmatrix} 4025.4297 & 5910.064 \\ 5910.064 & 96706.846 \end{bmatrix}, \qquad (X'X)^{-1} = \begin{bmatrix} 2.7291 \times 10^{-4} & -1.6678 \times 10^{-5} \\ -1.6678 \times 10^{-5} & 1.1360 \times 10^{-5} \end{bmatrix}.$$
(c) Build the 95% confidence region for the parameter vector $\beta$.
81
[Figure: 95% confidence region (ellipse) for $(\beta_1, \beta_2)$, with beta1 on the horizontal axis (roughly 0.50 to 0.80) and beta2 on the vertical axis (roughly 0.04 to 0.10).]
Suppose that we have a hypothesis about the $k$th regression coefficient:
$$H_0: \beta_k = \bar{\beta}_k$$
($\bar{\beta}_k$ is a specific value, e.g. zero), and that this hypothesis is tested against the alternative
hypothesis
$$H_1: \beta_k \neq \bar{\beta}_k.$$
We do not reject $H_0$ at the $\alpha \cdot 100\%$ level if
$\bar{\beta}_k$ lies within the $(1 - \alpha) \cdot 100\%$ CI for $\beta_k$, i.e., $b_k \pm t_{\alpha/2}\,\hat{\sigma}_{b_k}$;
we reject $H_0$ otherwise. Equivalently, calculate the test statistic
$$t_{obs} = \frac{b_k - \bar{\beta}_k}{\hat{\sigma}_{b_k}}$$
and,
if $|t_{obs}| > t_{\alpha/2}$ then reject $H_0$;
if $|t_{obs}| \leq t_{\alpha/2}$ then do not reject $H_0$.
83
Other cases:
$H_0: \beta_k = \bar{\beta}_k$ vs. $H_1: \beta_k > \bar{\beta}_k$:
if $t_{obs} > t_{\alpha}$ then reject $H_0$ at the $\alpha \cdot 100\%$ level; otherwise do not reject $H_0$.
$H_0: \beta_k = \bar{\beta}_k$ vs. $H_1: \beta_k < \bar{\beta}_k$:
if $t_{obs} < -t_{\alpha}$ then reject $H_0$ at the $\alpha \cdot 100\%$ level; otherwise do not reject $H_0$.
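A short sketch of the two-sided and one-sided t-tests (illustrative only; NumPy/SciPy, artificial data with hypothetical coefficients):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, K = 100, 2
x = rng.normal(size=n)
y = 1.0 + 0.3 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - K)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

# Two-sided test of H0: beta_2 = 0 against H1: beta_2 != 0 at the 5% level.
alpha = 0.05
t_obs = (b[1] - 0.0) / se[1]
t_crit = stats.t.ppf(1 - alpha / 2, df=n - K)
p_value = 2 * stats.t.sf(abs(t_obs), df=n - K)
print(t_obs, t_crit, p_value, abs(t_obs) > t_crit)

# One-sided test of H0: beta_2 = 0 against H1: beta_2 > 0: reject if t_obs > t_alpha.
print(t_obs > stats.t.ppf(1 - alpha, df=n - K))
```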
84
p-value
The p-value (or $p$) is the probability of obtaining a test statistic at least as extreme as the one that
was actually observed, assuming that the null hypothesis is true. $p$ is an informal measure
of the evidence against the null hypothesis (small values of $p$ cast doubt on $H_0$).
Example. Consider $H_0: \beta_k = \beta_k^0$ vs. $H_1: \beta_k \neq \beta_k^0$.
When the null isn't rejected we say that $b_k$ (not $\beta_k$) is not significantly different from
$\beta_k^0$ at the $\alpha \cdot 100\%$ level.
When the null is rejected we say that $b_k$ (not $\beta_k$) is significantly different from zero at the
$\alpha \cdot 100\%$ level, or the variable (associated with $b_k$) is statistically significant at the $\alpha \cdot 100\%$ level.
When the null isn't rejected we say that $b_k$ (not $\beta_k$) is not significantly different from
zero at the $\alpha \cdot 100\%$ level, or the variable is not statistically significant at the $\alpha \cdot 100\%$ level.
86
More Remarks:
Rejection of the null is not proof that the null is false. Why?
Acceptance of the null is not proof that the null is true. Why? We prefer to use the
language “we fail to reject $H_0$ at the x% level” rather than “$H_0$ is accepted at the x%
level.”
The statistical significance of a variable is determined by the size of $t_{obs} = b_k/se(b_k)$,
whereas the economic significance of a variable is related to the size and sign of $b_k$.
Example. Suppose that in a business activity we have
$$\widehat{\log(wage_i)} = 0.1 + \underset{(0.001)}{0.01}\,female_i + \ldots, \qquad n = 600.$$
$H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$. We have:
$$t_k^0 = \frac{b_2}{\hat{\sigma}_{b_2}} \sim t_{(600-K)} \approx N(0, 1) \text{ (under the null)}, \qquad t_{obs} = \frac{0.01}{0.001} = 10,$$
$$\text{p-value} = 2P\!\left(t_k^0 > |10| \mid H_0 \text{ is true}\right) \approx 0.$$
Discuss statistical versus economic significance.
88
Exercise 2.18. Can we say that students at smaller schools perform better than those at
larger schools? To discuss this hypothesis we consider data on 408 high schools in Michigan
for the year 1993 (see Wooldridge, chapter 4). Performance is measured by the percentage
of students receiving a passing score on a tenth-grade math test (math10). School size
is measured by student enrollment (enroll). We will control for two other factors, average
annual teacher compensation (totcomp) and the number of staff per one thousand students
(staff). Teacher compensation is a measure of teacher quality, and staff size is a rough
measure of how much attention students receive. The figure below reports the results. Answer
the initial question.
Exercise 2.19. We want to relate the median housing price (price) in the community to
various community characteristics: nox is the amount of nitrous oxide in the air, in parts
per million; dist is a weighted distance of the community from five employment centers, in
miles; rooms is the average number of rooms in houses in the community; and stratio is
the average student-teacher ratio of schools in the community. Can we conclude that the
elasticity of price with respect to nox is -1? (Sample: 506 communities in the Boston area -
see Wooldridge, chapter 4).
$$H_0: R\beta = r \quad \text{vs.} \quad H_1: R\beta \neq r,$$
where $r$ is $p \times 1$ and $R$ is $p \times K$. The test statistic is
$$F^0 = (Rb - r)'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - r)/(p s^2), \qquad F^0 \sim F(p, n - K).$$
If we observe $F^0 > F_\alpha$ and $H_0$ is true, then a low-probability event has occurred.
In the case $p = 1$ (single linear combination of the elements of $\beta$) one may use the test
statistic
$$t^0 = \frac{Rb - R\beta}{s\sqrt{R(X'X)^{-1}R'}} \sim t_{(n-K)}.$$
Example. We consider a simple model to compare the returns to education at junior colleges
and four-year colleges; for simplicity, we refer to the latter as “universities” (see Wooldridge,
chap. 4). The model is $\log(wages_i) = \beta_1 + \beta_2 jc_i + \beta_3 univ_i + \beta_4 exper_i + \varepsilon_i$, and
$$(X'X)^{-1} = \begin{bmatrix} 0.0023972 & 9.4121 \times 10^{-5} & 8.50437 \times 10^{-5} & 1.6780 \times 10^{-5} \\ 9.41217 \times 10^{-5} & 0.0002520 & 1.04201 \times 10^{-5} & 9.2871 \times 10^{-8} \\ 8.50437 \times 10^{-5} & 1.0420 \times 10^{-5} & 2.88090 \times 10^{-5} & 2.12598 \times 10^{-7} \\ 1.67807 \times 10^{-5} & 9.2871 \times 10^{-8} & 2.1259 \times 10^{-7} & 1.3402 \times 10^{-7} \end{bmatrix}.$$
Under the null, the test statistic is
$$t^0 = \frac{Rb - R\beta}{s\sqrt{R(X'X)^{-1}R'}} \sim t_{(n-K)}.$$
93
We have
$$R = \begin{bmatrix} 0 & 1 & -1 & 0 \end{bmatrix}, \qquad \sqrt{R(X'X)^{-1}R'} = 0.016124827,$$
$$s\sqrt{R(X'X)^{-1}R'} = 0.430138 \times 0.016124827 = 0.006936,$$
$$Rb = \begin{bmatrix} 0 & 1 & -1 & 0 \end{bmatrix}\begin{bmatrix} 1.472326 \\ 0.066697 \\ 0.076876 \\ 0.004944 \end{bmatrix} = -0.01018,$$
$$R\beta = \begin{bmatrix} 0 & 1 & -1 & 0 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix} = \beta_2 - \beta_3 = 0 \text{ (under } H_0\text{)},$$
$$t_{obs} = \frac{-0.01018}{0.006936} = -1.467, \qquad t_{0.05} = 1.645.$$
We do not reject $H_0$ at the 5% level. There is no evidence against $\beta_2 = \beta_3$ at the 5% level.
94
$$H_0: R\beta = r \quad \text{vs.} \quad H_1: R\beta \neq r$$
(where $r$ is $p \times 1$ and $R$ is $p \times K$). It can be proved that
$$F^0 = (Rb - r)'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - r)/(p s^2) = \frac{(e^{*\prime}e^* - e'e)/p}{e'e/(n - K)} = \frac{(R^2 - R^{*2})/p}{(1 - R^2)/(n - K)} \sim F(p, n - K)$$
where $*$ refers to the short regression, i.e. the regression subject to the constraint $R\beta = r$.
96
Example. Consider once again the equation $\log(wages_i) = \beta_1 + \beta_2 jc_i + \beta_3 univ_i + \beta_4 exper_i + \varepsilon_i$ and $H_0: \beta_2 = \beta_3$ against $H_1: \beta_2 \neq \beta_3$. The results of the regression
subject to the constraint $\beta_2 = \beta_3$ are
In the case “all slopes zero” (test of significance of the complete regression), it can be
proved that $F^0$ equals
$$F^0 = \frac{R^2/(K - 1)}{(1 - R^2)/(n - K)}.$$
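A sketch of the restricted vs. unrestricted F test (illustrative only; NumPy/SciPy, artificial data with hypothetical coefficients; the names jc, univ and exper only stand in for the regressors of the example). Imposing $\beta_2 = \beta_3$ amounts to regressing on the sum jc + univ, and the SSR and $R^2$ forms of the statistic coincide:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 200
jc = rng.binomial(1, 0.3, size=n)
univ = rng.binomial(1, 0.4, size=n) * (1 - jc)
exper = rng.uniform(0, 20, size=n)
y = 1.5 + 0.07 * jc + 0.08 * univ + 0.005 * exper + 0.4 * rng.normal(size=n)

def ols_ssr_r2(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    ssr = e @ e
    r2 = 1.0 - ssr / np.sum((y - y.mean()) ** 2)
    return ssr, r2

# Unrestricted (long) regression and the regression restricted by beta_2 = beta_3.
X_u = np.column_stack([np.ones(n), jc, univ, exper])
X_r = np.column_stack([np.ones(n), jc + univ, exper])
ssr_u, r2_u = ols_ssr_r2(X_u, y)
ssr_r, r2_r = ols_ssr_r2(X_r, y)

p, K = 1, X_u.shape[1]
F_ssr = ((ssr_r - ssr_u) / p) / (ssr_u / (n - K))
F_r2 = ((r2_u - r2_r) / p) / ((1 - r2_u) / (n - K))
print(F_ssr, F_r2, stats.f.sf(F_ssr, p, n - K))   # identical statistics, plus the p-value
```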
Having specified the distribution of the error vector, we can use the maximum likelihood
(ML) principle to estimate the model parameters $\theta = (\beta', \sigma^2)'$.
ML principle: choose the parameter estimates to maximize the probability of obtaining the
data. Maximizing the joint density associated with the data, $f(y, X; \tilde{\theta})$, leads to the same
solution. Therefore:
[Figure: the joint density $f(y; \tilde{\theta})$ plotted against $\tilde{\theta} \in [0, 1]$; it reaches its maximum near $\tilde{\theta} = 0.6$.]
$$\frac{d\left[\theta^6(1-\theta)^4\right]}{d\theta} = 0 \;\Leftrightarrow\; \hat{\theta} = \frac{6}{10}$$
and since
$$\frac{d^2\left[\theta^6(1-\theta)^4\right]}{d\theta^2}\bigg|_{\theta=\hat{\theta}} < 0,$$
$\hat{\theta} = 0.6$ maximizes $f(y; \tilde{\theta})$: $\hat{\theta}$ is the “most likely” value of $\theta$, that is, the value that maximizes
the probability of observing $(y_1, \ldots, y_{10})$. Notice that the ML estimator is $\bar{y}$.
In most cases we prefer to solve $\max \log f(y, X; \tilde{\theta})$ rather than $\max f(y, X; \tilde{\theta})$, since the
log transformation greatly simplifies the likelihood (products become sums).
101
Assumption 1.5 (the normality assumption) together with Assumptions 1.2 and 1.4 imply
that the distribution of $\varepsilon$ conditional on $X$ is $N(0, \sigma^2 I)$. Thus,
$$\varepsilon \mid X \sim N(0, \sigma^2 I) \;\Rightarrow\; y \mid X \sim N(X\beta, \sigma^2 I) \;\Rightarrow$$
$$f(y \mid X; \theta) = (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right) \;\Rightarrow$$
$$\log f(y \mid X; \theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).$$
It can be proved that
$$\log f(y \mid X; \theta) = \sum_{i=1}^{n}\log f(y_i \mid x_i) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2.$$
Proposition (1.5 - ML Estimator of $\beta$ and $\sigma^2$). Suppose Assumptions 1.1-1.5 hold. Then,
$$\text{ML estimator of } \beta = (X'X)^{-1}X'y = b,$$
$$\text{ML estimator of } \sigma^2 = \frac{e'e}{n} \neq s^2 = \frac{e'e}{n - K}.$$
103
$$E\!\left(\frac{e'e}{n}\right) \neq \sigma^2, \qquad \lim_{n\rightarrow\infty} E\!\left(\frac{e'e}{n}\right) = \sigma^2.$$
Proposition (1.6 - b is the Best Unbiased Estimator, BUE). Under Assumptions 1.1-1.5,
the OLS estimator $b$ of $\beta$ is BUE in that any other unbiased (but not necessarily linear)
estimator has larger conditional variance in the matrix sense.
This result should be distinguished from the Gauss-Markov Theorem that $b$ is minimum
variance among those estimators that are unbiased and linear in $y$. Proposition 1.6 says
that $b$ is minimum variance in a larger class of estimators that includes nonlinear unbiased
estimators. This stronger statement is obtained under the normality assumption (Assumption
1.5), which is not assumed in the Gauss-Markov Theorem. Put differently, the Gauss-Markov
Theorem does not exclude the possibility of some nonlinear estimator beating OLS, but this
possibility is ruled out by the normality assumption.
104
Exercise 2.22. Suppose $y_i = x_i'\beta + \varepsilon_i$ where $\varepsilon_i \mid X \sim t_{(v)}$. Assume that Assumptions
1.1-1.4 hold. Use your intuition to answer “true” or “false” to the following statements:
(c) the BUE estimator can only be obtained numerically (i.e. there is no closed formula
for the BUE estimator).
The model $y = X\beta + \varepsilon$ based on Assumptions 1.1-1.3 and $E(\varepsilon\varepsilon' \mid X) = \sigma^2 V$ is called the generalized regression model.
Example (case where E "2i X depends on X). Consider the following model
yi = 1 + 2xi2 + "i
to explain household expenditure on food (y ) as a function of household income. Typical
behavior: Low-income household do not have the option of extravagant food tastes: they
have few choices and are almost forced to spend a particular portion of their income on food;
High-income household could have simple food tastes or extravagant food tastes: income by
itself is likely to be relatively less important as an explanatory variable.
[Figure: scatterplot of y (expenditure on food) against x (income); the dispersion of expenditure increases with income.]
If $e$ accurately reflects the behavior of $\varepsilon$, the information in the previous figure suggests
that the variability of $y_i$ increases as income increases, thus it is reasonable to suppose that
$\mathrm{Var}(\varepsilon_i \mid x_{i2})$ increases with income.
1. The Gauss-Markov Theorem no longer holds for the OLS estimator. The BLUE is some
other estimator.
2. The t-ratio is not distributed as the t distribution. Thus, the t-test is no longer valid. The
same comments apply to the F-test. Note that $\mathrm{Var}(b \mid X)$ is no longer $\sigma^2(X'X)^{-1}$. In
effect,
$$\mathrm{Var}(b \mid X) = \mathrm{Var}\!\left((X'X)^{-1}X'y \mid X\right) = (X'X)^{-1}X'\,\mathrm{Var}(y \mid X)\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}.$$
On the other hand,
$$E(s^2 \mid X) = \frac{E(e'e \mid X)}{n - K} = \frac{\mathrm{tr}(\mathrm{Var}(e \mid X))}{n - K} = \frac{\sigma^2\mathrm{tr}(MVM)}{n - K} = \frac{\sigma^2\mathrm{tr}(MV)}{n - K}.$$
The conventional standard errors are incorrect when $\mathrm{Var}(y \mid X) \neq \sigma^2 I$. Confidence
region and hypothesis test procedures based on the classical regression model are not
valid.
110
3. However, the OLS estimator is still unbiased, because the unbiasedness result (Proposition
1.1 (a)) does not require Assumption 1.4. In effect,
$$E(b \mid X) = (X'X)^{-1}X'E(y \mid X) = (X'X)^{-1}X'X\beta = \beta, \qquad E(b) = \beta.$$
Use $b$ to estimate $\beta$ and $\mathrm{Var}(b \mid X) = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$ for inference
purposes. Note that $y \mid X \sim N(X\beta, \sigma^2 V)$ implies
$$b \mid X \sim N\!\left(\beta, \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\right).$$
This is not a good solution: if you know $V$ you may use a more efficient estimator, as
we will see below. Later on, in the chapter “Large Sample Theory”, we will find that $\sigma^2 V$
may be replaced by a consistent estimator.
If the value of the matrix function $V$ is known, a BLUE estimator for $\beta$, called generalized
least squares (GLS), can be deduced. The basic idea of the derivation is to transform
the generalized regression model into a model that satisfies all the assumptions, including
Assumption 1.4, of the classical regression model. Consider
$$y = X\beta + \varepsilon, \qquad E(\varepsilon\varepsilon' \mid X) = \sigma^2 V.$$
We should multiply both sides of the equation by a nonsingular matrix $C$ (depending on $X$),
$$Cy = CX\beta + C\varepsilon, \qquad \tilde{y} = \tilde{X}\beta + \tilde{\varepsilon},$$
such that the transformed error $\tilde{\varepsilon}$ verifies $E(\tilde{\varepsilon}\tilde{\varepsilon}' \mid X) = \sigma^2 I$, i.e. $CVC' = I$.
Given $CVC' = I$, how to find $C$? Since $V$ is by construction symmetric and positive definite,
there exists a nonsingular $n \times n$ matrix $C$ such that
$$V = C^{-1}(C')^{-1} \quad \text{or} \quad V^{-1} = C'C.$$
Note
$$CVC' = CC^{-1}(C')^{-1}C' = I.$$
It is easy to see that if $y = X\beta + \varepsilon$ satisfies Assumptions 1.1-1.3 and Assumption 1.5 (but
not Assumption 1.4), then
$$\tilde{y} = \tilde{X}\beta + \tilde{\varepsilon}, \qquad \tilde{y} = Cy, \quad \tilde{X} = CX,$$
satisfies Assumptions 1.1-1.5. Let
$$\hat{\beta}_{GLS} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y} = (X'V^{-1}X)^{-1}X'V^{-1}y.$$
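A numerical sketch of the GLS estimator (illustrative only; NumPy, artificial heteroskedastic data with an assumed known diagonal $V$): OLS on the transformed data $Cy$, $CX$ coincides with the closed form $(X'V^{-1}X)^{-1}X'V^{-1}y$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
v = x ** 2                                   # assumed known weights: Var(eps_i|x_i) = sigma^2 * v_i
eps = np.sqrt(v) * rng.normal(size=n)
y = X @ np.array([2.0, 1.0]) + eps

# GLS via the transformation C = diag(1/sqrt(v_i)): regress Cy on CX by OLS.
C = np.diag(1.0 / np.sqrt(v))
y_t, X_t = C @ y, C @ X
b_gls = np.linalg.solve(X_t.T @ X_t, X_t.T @ y_t)

# Equivalent closed form: (X'V^{-1}X)^{-1} X'V^{-1} y with V = diag(v).
V_inv = np.diag(1.0 / v)
b_gls2 = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(b_gls, b_gls2, b_ols)   # the two GLS computations coincide; OLS is less efficient
```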
113
$$E\!\left(\hat{\beta}_{GLS} \mid X\right) = \beta.$$
(b) (expression for the variance) Under Assumptions 1.1-1.3 and the assumption $E(\varepsilon\varepsilon' \mid X) = \sigma^2 V$ that the conditional second moment is proportional to $V$,
$$\mathrm{Var}\!\left(\hat{\beta}_{GLS} \mid X\right) = \sigma^2\left(X'V^{-1}X\right)^{-1}.$$
(c) (the GLS estimator is BLUE) Under the same set of assumptions as in (b), the GLS
estimator is efficient in that the conditional variance of any unbiased estimator that is linear
in $y$ is greater than or equal to $\mathrm{Var}(\hat{\beta}_{GLS} \mid X)$ in the matrix sense.
We have
$$V = \begin{bmatrix} v_1 & 0 & \cdots & 0 \\ 0 & v_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_n \end{bmatrix} \;\Rightarrow\; V^{-1} = \begin{bmatrix} 1/v_1 & 0 & \cdots & 0 \\ 0 & 1/v_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/v_n \end{bmatrix} \;\Rightarrow\; C = \begin{bmatrix} 1/\sqrt{v_1} & 0 & \cdots & 0 \\ 0 & 1/\sqrt{v_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sqrt{v_n} \end{bmatrix}.$$
115
Now
$$\tilde{y} = Cy = \begin{bmatrix} 1/\sqrt{v_1} & 0 & \cdots & 0 \\ 0 & 1/\sqrt{v_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sqrt{v_n} \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} y_1/\sqrt{v_1} \\ y_2/\sqrt{v_2} \\ \vdots \\ y_n/\sqrt{v_n} \end{bmatrix},$$
$$\tilde{X} = CX = \begin{bmatrix} 1/\sqrt{v_1} & 0 & \cdots & 0 \\ 0 & 1/\sqrt{v_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sqrt{v_n} \end{bmatrix}\begin{bmatrix} 1 & x_{12} & \cdots & x_{1K} \\ 1 & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n2} & \cdots & x_{nK} \end{bmatrix} = \begin{bmatrix} 1/\sqrt{v_1} & x_{12}/\sqrt{v_1} & \cdots & x_{1K}/\sqrt{v_1} \\ 1/\sqrt{v_2} & x_{22}/\sqrt{v_2} & \cdots & x_{2K}/\sqrt{v_2} \\ \vdots & \vdots & & \vdots \\ 1/\sqrt{v_n} & x_{n2}/\sqrt{v_n} & \cdots & x_{nK}/\sqrt{v_n} \end{bmatrix}.$$
Notice (in the example where $v_i = e^{x_{i2}}$, i.e. $\mathrm{Var}(\varepsilon_i \mid x_{i2}) = \sigma^2 e^{x_{i2}}$):
$$\mathrm{Var}(\tilde{\varepsilon}_i \mid X) = \mathrm{Var}\!\left(\frac{\varepsilon_i}{\sqrt{e^{x_{i2}}}} \,\bigg|\, x_{i2}\right) = \frac{1}{e^{x_{i2}}}\mathrm{Var}(\varepsilon_i \mid x_{i2}) = \frac{1}{e^{x_{i2}}}\sigma^2 e^{x_{i2}} = \sigma^2.$$
Efficient estimation under a known form of heteroskedasticity is called weighted regression
(or weighted least squares (WLS)).
Example. Consider $wage_i = \beta_1 + \beta_2 educ_i + \beta_3 exper_i + \varepsilon_i$.
[Figure: scatterplots of WAGE against EXPER and of WAGE against EDUC.]
Dependent Variable: WAGE; Method: Least Squares; Sample: 1 526
Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          -3.390540     0.766566     -4.423023     0.0000
EDUC       0.644272      0.053806     11.97397      0.0000
EXPER      0.070095      0.010978     6.385291      0.0000
[Figure: squared OLS residuals (RES2).]
Exercise 2.23. Let $\{y_i, i = 1, 2, \ldots\}$ be a sequence of independent random variables with
distribution $N(\mu, \sigma_i^2)$, where $\sigma_i^2$ is known (note: we assume $\sigma_1^2 \neq \sigma_2^2 \neq \ldots$). When
the variances are unequal, the sample mean $\bar{y}$ is not the best linear unbiased estimator
(BLUE). The BLUE has the form $\hat{\mu} = \sum_{i=1}^{n} w_i y_i$ where the $w_i$ are nonrandom weights.
(a) Find a condition on $w_i$ such that $E(\hat{\mu}) = \mu$; (b) Find the optimal weights $w_i$ that
make $\hat{\mu}$ the BLUE. Hint: you may translate this problem into an econometric framework:
if $\{y_i\}$ is a sequence of independent random variables with distribution $N(\mu, \sigma_i^2)$ then $y_i$
can be represented by the equation $y_i = \mu + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma_i^2)$. Then find the GLS
estimator of $\mu$.
120
Exercise 2.26. A researcher first ran an OLS regression. Then she was given the true $V$ matrix.
She transformed the data appropriately and obtained the GLS estimator. For several coefficients,
standard errors in the second regression were larger than those in the first regression.
Does this contradict Proposition 1.7? See the previous exercise.
Finite-sample properties of GLS rest on the assumption that the regressors are strictly
exogenous. In time-series models the regressors are not strictly exogenous and the error
is serially correlated.
$V$ can be estimated from the sample. This approach is called Feasible Generalized
Least Squares (FGLS). But if the function $V$ is estimated from the sample, its value $\hat{V}$
becomes a random variable, which affects the distribution of the GLS estimator. Very
little is known about the finite-sample properties of the FGLS estimator. We need to
use the large-sample properties...
122
3 Large-Sample Theory
The finite-sample theory breaks down if one of the following three assumptions is violated:
This chapter develops an alternative approach based on large-sample theory ($n$ is “sufficiently
large”).
Convergence in Probability
Example. Consider a fair coin. Let $z_i = 1$ if the $i$th toss results in heads and $z_i = 0$
otherwise. Let $\bar{z}_n = \frac{1}{n}\sum_{i=1}^{n} z_i$. The following graph suggests that $\bar{z}_n \stackrel{p}{\rightarrow} 1/2$.
125
A sequence of random scalars $\{z_n\}$ converges in mean square (or in quadratic mean) to $\alpha$
if
$$\lim_{n\rightarrow\infty} E\!\left[(z_n - \alpha)^2\right] = 0.$$
Convergence in Distribution
Let $\{z_n\}$ be a sequence of random scalars and $F_n$ be the cumulative distribution function
(c.d.f.) of $z_n$, i.e. $z_n \sim F_n$. We say that $\{z_n\}$ converges in distribution to a random scalar
$z$ if the c.d.f. $F_n$ of $z_n$ converges to the c.d.f. $F$ of $z$ at every continuity point of $F$. We
write
$$z_n \stackrel{d}{\rightarrow} z, \quad \text{where } z \sim F;$$
$F$ is the asymptotic (or limiting) distribution of $z_n$. If $F$ is well known, for example, if $F$
is the cumulative normal $N(0, 1)$ distribution, we prefer to write
$$z_n \stackrel{d}{\rightarrow} N(0, 1) \quad \text{(instead of } z_n \stackrel{d}{\rightarrow} z \text{ and } z \sim N(0, 1)\text{)}.$$
Example. Consider $z_n \sim t_{(n)}$. We know that $z_n \stackrel{d}{\rightarrow} N(0, 1)$.
(a) if $z_n \stackrel{p}{\rightarrow} \alpha$ then $f(z_n) \stackrel{p}{\rightarrow} f(\alpha)$;
(b) if $z_n \stackrel{d}{\rightarrow} z$ then $f(z_n) \stackrel{d}{\rightarrow} f(z)$.
An immediate implication of Lemma 2.3 (a) is that the usual arithmetic operations preserve
convergence in probability:
$x_n \stackrel{p}{\rightarrow} \alpha$, $y_n \stackrel{p}{\rightarrow} \beta$ $\Rightarrow$ $x_n + y_n \stackrel{p}{\rightarrow} \alpha + \beta$;
$x_n \stackrel{p}{\rightarrow} \alpha$, $y_n \stackrel{p}{\rightarrow} \beta$ $\Rightarrow$ $x_n y_n \stackrel{p}{\rightarrow} \alpha\beta$;
$x_n \stackrel{p}{\rightarrow} \alpha$, $y_n \stackrel{p}{\rightarrow} \beta$ $\Rightarrow$ $x_n/y_n \stackrel{p}{\rightarrow} \alpha/\beta$, $\beta \neq 0$;
$Y_n \stackrel{p}{\rightarrow} \Gamma$ $\Rightarrow$ $Y_n^{-1} \stackrel{p}{\rightarrow} \Gamma^{-1}$ ($\Gamma$ is invertible).
128
(a) $x_n \stackrel{d}{\rightarrow} x$, $y_n \stackrel{p}{\rightarrow} \alpha$ $\Rightarrow$ $x_n + y_n \stackrel{d}{\rightarrow} x + \alpha$.
(b) $x_n \stackrel{d}{\rightarrow} x$, $y_n \stackrel{p}{\rightarrow} 0$ $\Rightarrow$ $y_n'x_n \stackrel{p}{\rightarrow} 0$.
(c) $x_n \stackrel{d}{\rightarrow} x$, $A_n \stackrel{p}{\rightarrow} A$ $\Rightarrow$ $A_n x_n \stackrel{d}{\rightarrow} Ax$. In particular, if $x \sim N(0, \Sigma)$, then
$A_n x_n \stackrel{d}{\rightarrow} N(0, A\Sigma A')$.
(d) $x_n \stackrel{d}{\rightarrow} x$, $A_n \stackrel{p}{\rightarrow} A$ $\Rightarrow$ $x_n'A_n^{-1}x_n \stackrel{d}{\rightarrow} x'A^{-1}x$ ($A$ is nonsingular).
If $x_n \stackrel{p}{\rightarrow} 0$ we write $x_n = o_p(1)$.
If $x_n - y_n \stackrel{p}{\rightarrow} 0$ we write $x_n = y_n + o_p(1)$.
In part (c) we may write $A_n x_n \stackrel{d}{=} A x_n$ ($A_n x_n$ and $A x_n$ have the same asymptotic
distribution).
129
Wooldridge’s quotation:
While not all useful estimators are unbiased, virtually all economists agree that
consistency is a minimal requirement for an estimator. The famous econometrician
Clive W.J. Granger once remarked: “If you can’t get it right as n goes to infinity,
you shouldn’t be in this business.” The implication is that, if your estimator of a
particular population parameter is not consistent, then you are wasting your time.
130
The variance matrix $\Sigma$ is called the asymptotic variance and is denoted $\mathrm{Avar}(\hat{\theta}_n)$; i.e.
$$\lim_{n\rightarrow\infty}\mathrm{Var}\!\left(\sqrt{n}\left(\hat{\theta}_n - \theta\right)\right) = \mathrm{Avar}(\hat{\theta}_n) = \Sigma.$$
Some authors use the notation $\mathrm{Avar}(\hat{\theta}_n)$ to mean $\Sigma/n$ (which is zero in the limit).
131
Consider
$$\bar{z}_n = \frac{1}{n}\sum_{i=1}^{n} z_i.$$
We say that $\bar{z}_n$ obeys the LLN if $\bar{z}_n \stackrel{p}{\rightarrow} \mu$ where $\mu = E(z_i)$ or $\lim_n E(\bar{z}_n) = \mu$.
(Kolmogorov's Second Strong LLN) If $\{z_i\}$ is i.i.d. with $E(z_i) = \mu$, then $\bar{z}_n \stackrel{p}{\rightarrow} \mu$.
Theorem 1 (Lindeberg-Levy CLT). Let $\{z_i\}$ be i.i.d. with $E(z_i) = \mu$ and $\mathrm{Var}(z_i) = \Sigma$.
Then
$$\sqrt{n}\left(\bar{z}_n - \mu\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(z_i - \mu\right) \stackrel{d}{\rightarrow} N(0, \Sigma).$$
Notice that
$$E\!\left(\sqrt{n}\left(\bar{z}_n - \mu\right)\right) = 0 \;\Rightarrow\; E(\bar{z}_n) = \mu, \qquad \mathrm{Var}\!\left(\sqrt{n}\left(\bar{z}_n - \mu\right)\right) = \Sigma \;\Rightarrow\; \mathrm{Var}(\bar{z}_n) = \Sigma/n.$$
Given the previous equations, some authors write
$$\bar{z}_n \stackrel{a}{\sim} N\!\left(\mu, \frac{\Sigma}{n}\right).$$
133
Example. Let $\{z_i\}$ be i.i.d. with distribution $\chi^2_{(1)}$. By the Lindeberg-Levy CLT (scalar case)
we have
$$\bar{z}_n = \frac{1}{n}\sum_{i=1}^{n} z_i \stackrel{a}{\sim} N\!\left(\mu, \frac{\sigma^2}{n}\right)$$
where
$$E(\bar{z}_n) = \frac{1}{n}\sum_{i=1}^{n} E(z_i) = E(z_i) = \mu = 1, \qquad \mathrm{Var}(\bar{z}_n) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} z_i\right) = \frac{1}{n}\mathrm{Var}(z_i) = \frac{\sigma^2}{n} = \frac{2}{n}.$$
134
Example. In random sampling with sample size $n = 30$ on a variable $z$ with $E(z) = 10$,
$\mathrm{Var}(z) = 9$ but unknown distribution, obtain an approximation to $P(\bar{z}_n < 9.5)$. We do
not know the exact distribution of $\bar{z}_n$. However, from the Lindeberg-Levy CLT we have
$$\sqrt{n}\,\frac{(\bar{z}_n - \mu)}{\sigma} \stackrel{d}{\rightarrow} N(0, 1) \quad \text{or} \quad \bar{z}_n \stackrel{a}{\sim} N\!\left(\mu, \frac{\sigma^2}{n}\right).$$
Hence,
$$P(\bar{z}_n < 9.5) = P\!\left(\sqrt{n}\,\frac{(\bar{z}_n - \mu)}{\sigma} < \sqrt{30}\,\frac{(9.5 - 10)}{3}\right) \simeq \Phi(-0.9128) = 0.1807, \quad [\Phi \text{ is the cdf of } N(0, 1)].$$
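A quick numerical check of this approximation (illustrative only; NumPy/SciPy; the skewed distribution used in the Monte Carlo is an arbitrary choice with the stated mean and variance, not part of the example):

```python
import numpy as np
from scipy import stats

# Normal approximation from the slide: n = 30, E(z) = 10, Var(z) = 9.
n, mu, sigma = 30, 10.0, 3.0
z_value = np.sqrt(n) * (9.5 - mu) / sigma
print(z_value, stats.norm.cdf(z_value))      # about -0.9129 and 0.1807

# A small Monte Carlo check with one particular (skewed) distribution sharing
# the same mean and variance: a chi-square(k) variable shifted and rescaled.
rng = np.random.default_rng(8)
k = 4
z = mu + sigma * (rng.chisquare(k, size=(200_000, n)) - k) / np.sqrt(2 * k)
print(np.mean(z.mean(axis=1) < 9.5))         # close to the normal approximation
```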
136
Stochastic process (SP): a sequence of random variables. For this reason, it is more
adequate to write a SP as $\{z_i\}$ (meaning a sequence of random variables) rather than $z_i$
(meaning the random variable at time $i$).
137
The definition implies that any transformation (function) of a stationary process is itself
stationary; that is, if $\{z_i\}$ is stationary, then $\{g(z_i)\}$ is. For example, if $\{z_i\}$ is stationary
then $\{z_i z_i'\}$ is also stationary.
Definition (Covariance Stationary Processes). A stochastic process $\{z_i\}$ is weakly (or covariance)
stationary if: (i) $E(z_i)$ does not depend on $i$, and (ii) $\mathrm{Cov}(z_i, z_{i-j})$ exists, is
finite, and depends only on $j$ but not on $i$.
Example. It can be proved that $\{z_i\}$, $z_i = \sqrt{\gamma_0 + \gamma_1 z_{i-1}^2}\,\varepsilon_i$, where $\{\varepsilon_i\}$ is i.i.d. with mean
zero and unit variance and $\gamma_0 > 0$ and $1/\sqrt{3} \leq \gamma_1 < 1$, is a covariance stationary process.
However, $w_i = z_i^2$ is not a covariance stationary process as $E(w_i^2)$ does not exist.
Exercise 3.3. Consider the SP $\{u_t\}$ where
$$u_t = \begin{cases} \varepsilon_t & \text{if } t \leq 2000 \\ \sqrt{\dfrac{k-2}{k}}\,\eta_t & \text{if } t > 2000 \end{cases}$$
where $\varepsilon_t$ and $\eta_s$ are independent for all $t$ and $s$, $\varepsilon_t \stackrel{iid}{\sim} N(0, 1)$ and $\eta_s \stackrel{iid}{\sim} t_{(k)}$. Explain
why $\{u_t\}$ is weakly (or covariance) stationary but not strictly stationary.
De…nition (White Noise Processes). A white noise process fzig is a covariance stationary
process with zero mean and no serial correlation:
Y Y
8 25
4
20
0
15
-4
10
-8
5
-12
0
-16
-20 -5
25 50 75 100 125 150 175 200 25 50 75 100 125 150 175 200
Y Y5
10 4
3
0
2
-10 1
0
-20
-1
-30 -2
-3
-40
-4
-50 -5
25 50 75 100 125 150 175 200 10 20 30 40 50 60 70 80 90
140
In the literature there is not a unique definition of ergodicity. We prefer to call “weakly
dependent process” what Hayashi calls “ergodic process”.
Definition. A stationary process $\{z_i\}$ is said to be a weakly dependent process (= ergodic in
Hayashi's definition) if, for any two bounded functions $f: \mathbb{R}^{k+1} \rightarrow \mathbb{R}$ and $g: \mathbb{R}^{s+1} \rightarrow \mathbb{R}$,
Serial dependence, which is ruled out by the i.i.d. assumption in Kolmogorov's LLN, is
allowed in this theorem, provided that it disappears in the long run. Since, for any function
$f$, $\{f(z_i)\}$ is S&WD (stationary and weakly dependent) whenever $\{z_i\}$ is, this theorem implies that any moment
of a S&WD process (if it exists and is finite) is consistently estimated by the sample moment.
For example, suppose $\{z_i\}$ is a S&WD process and $E(z_i z_i')$ exists and is finite. Then
$$\frac{1}{n}\sum_{i=1}^{n} z_i z_i' \stackrel{p}{\rightarrow} E(z_i z_i').$$
141
Definition (Martingale). A vector process $\{z_i\}$ is called a martingale with respect to $\{z_i\}$ if
$E(z_i \mid z_{i-1}, \ldots, z_1) = z_{i-1}$.
By definition
$$\mathrm{Var}(\bar{g}_n) = \frac{1}{n^2}\mathrm{Var}\!\left(\sum_{t=1}^{n} g_t\right) = \frac{1}{n^2}\left(\sum_{t=1}^{n}\mathrm{Var}(g_t) + 2\sum_{j=1}^{n-1}\sum_{i=j+1}^{n}\mathrm{Cov}(g_i, g_{i-j})\right).$$
However, if $\{g_i\}$ is a stationary MDS with finite second moment then
$$\sum_{t=1}^{n}\mathrm{Var}(g_t) = n\,\mathrm{Var}(g_t), \qquad \mathrm{Cov}(g_i, g_{i-j}) = 0,$$
so
$$\mathrm{Var}(\bar{g}_n) = \frac{1}{n}\mathrm{Var}(g_t).$$
Definition (Random Walk). Let $\{g_i\}$ be a vector independent white noise process. A random
walk, $\{z_i\}$, is a sequence of cumulative sums:
$$z_i = g_i + g_{i-1} + \cdots + g_1.$$
Exercise 3.4. Show that the random walk can be written as
$$z_i = z_{i-1} + g_i, \qquad z_1 = g_1.$$
143
We have three formulations of a lack of serial dependence for zero-mean covariance stationary
processes:
The model presented in this section has probably the widest range of economic applications:
no specific distributional assumption (such as the normality of the error term) is required;
the requirement in finite-sample theory that the regressors be strictly exogenous or fixed
is replaced by a much weaker requirement that they be “predetermined”.
Assumption (2.5 - $\{g_i\}$ is a martingale difference sequence with finite second moments).
$\{g_i\}$, where $g_i = x_i\varepsilon_i$, is a martingale difference sequence (so a fortiori $E(g_i) = 0$).
The $K \times K$ matrix of cross moments, $E(g_i g_i')$, is nonsingular. We use $S$ for $\mathrm{Avar}(\bar{g})$ (the
variance of $\sqrt{n}\,\bar{g}$, where $\bar{g} = \frac{1}{n}\sum g_i$). By Assumption 2.2 and the S&WD Martingale Differences
CLT, $S = E(g_i g_i')$.
Remarks:
1. (S&WD) A special case of S&WD is that $\{(y_i, x_i)\}$ is i.i.d. (random sample in cross-sectional
data).
3. ($E(x_i\varepsilon_i) = 0$ vs. $E(\varepsilon_i \mid x_i) = 0$) The condition $E(\varepsilon_i \mid x_i) = 0$ is stronger than
$E(x_i\varepsilon_i) = 0$. In effect,
4. (Predetermined vs. strictly exogenous regressors) Assumption 2.3 restricts only the
contemporaneous relationship between the error term and the regressors. The exogeneity
assumption (Assumption 1.2) implies that $E(x_{jk}\varepsilon_i) = 0$ for any regressor $k$ and for all $i$
and $j$, not just for $i = j$. Strict exogeneity is a strong assumption that does not hold in
general for time-series models.
148
6. (A sufficient condition for $\{g_i\}$ to be a MDS) Since a MDS is zero-mean by definition,
Assumption 2.5 is stronger than Assumption 2.3 (the latter is redundant given
Assumption 2.5). We will need Assumption 2.5 to prove the asymptotic normality of
the OLS estimator. A sufficient condition for $\{g_i\}$ to be an MDS is $E(\varepsilon_i \mid \mathcal{F}_i) = 0$, since then
$$E(x_i\varepsilon_i \mid g_{i-1}, \ldots, g_1) = E\!\left[E(x_i\varepsilon_i \mid \mathcal{F}_i) \mid g_{i-1}, \ldots, g_1\right] = E\!\left[0 \mid g_{i-1}, \ldots, g_1\right] = 0,$$
thus $E(\varepsilon_i \mid \mathcal{F}_i) = 0 \Rightarrow \{g_i\}$ is a MDS.
$$E(\varepsilon_i \mid \varepsilon_{i-1}, \ldots, \varepsilon_1) = E\!\left(E(\varepsilon_i \mid g_{i-1}, \ldots, g_1) \mid \varepsilon_{i-1}, \ldots, \varepsilon_1\right) = 0.$$
Assumption 2.5 implies that the error term itself is a MDS and hence is serially uncorrelated.
Proposition (2.1 - asymptotic distribution of the OLS estimator). (a) (Consistency of $b$ for
$\beta$) Under Assumptions 2.1-2.4,
$$b \stackrel{p}{\rightarrow} \beta.$$
(b) (Asymptotic normality of $b$) If Assumption 2.3 is strengthened to Assumption 2.5, then
$$\sqrt{n}(b - \beta) \stackrel{d}{\rightarrow} N(0, \mathrm{Avar}(b))$$
where
$$\mathrm{Avar}(b) = \Sigma_{xx}^{-1}S\Sigma_{xx}^{-1}.$$
(c) (Consistent estimate of $\mathrm{Avar}(b)$) Suppose there is available a consistent estimator $\hat{S}$
of $S$. Then under Assumption 2.2, $\mathrm{Avar}(b)$ is consistently estimated by
$$\widehat{\mathrm{Avar}}(b) = S_{xx}^{-1}\hat{S}S_{xx}^{-1}$$
where
$$S_{xx} = \frac{X'X}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'.$$
153
Proposition (2.2 - consistent estimation of error variance). Under Assumptions 2.1-2.4,
$$s^2 = \frac{1}{n-K}\sum_{i=1}^{n} e_i^2 \stackrel{p}{\rightarrow} E(\varepsilon_i^2).$$
Under conditional homoskedasticity $E(\varepsilon_i^2 \mid x_i) = \sigma^2$ (we will see this in detail later) we
have
$$S = E(g_i g_i') = E(\varepsilon_i^2 x_i x_i') = \cdots = \sigma^2 E(x_i x_i') = \sigma^2\Sigma_{xx}$$
and $\mathrm{Avar}(b) = \Sigma_{xx}^{-1}S\Sigma_{xx}^{-1} = \sigma^2\Sigma_{xx}^{-1}$.
Derivation of the distribution of test statistics is easier than in finite-sample theory because
we are only concerned with the large-sample approximation to the exact distribution.
Proposition (2.3 - robust t-ratio and Wald statistic). Suppose Assumptions 2.1-2.5 hold,
and suppose there is available a consistent estimate $\hat{S}$ of $S$. As before, let $\widehat{\mathrm{Avar}}(b) = S_{xx}^{-1}\hat{S}S_{xx}^{-1}$. Then
Remarks
The differences from the finite-sample t-test are: (1) the way the standard error is
calculated is different, (2) we use the table of $N(0, 1)$ rather than that of $t_{(n-K)}$,
and (3) the actual size or exact size of the test (the probability of Type I error given
the sample size) equals the nominal size (i.e., the desired significance level $\alpha$) only
approximately, although the approximation becomes arbitrarily good as the sample size
increases. The difference between the exact size and the nominal size of a test is called
the size distortion.
How to select an estimator for a population parameter? One of the most important methods
is the analog estimation method or the method of moments. The method of moments
principle: to estimate a feature of the population, use the corresponding feature of the
sample.

Population feature            Sample analog
$E(y_i)$                      $\bar{Y}$
$\mathrm{Var}(y_i)$           $S_y^2$
$\sigma_{xy}$                 $S_{xy}$
$\sigma_x^2$                  $S_x^2$
$P(y_i \leq c)$               $\frac{1}{n}\sum_{i=1}^{n} I_{\{y_i \leq c\}}$
$\mathrm{median}(y_i)$        sample median
$\max(y_i)$                   $\max_{i=1,\ldots,n}(y_i)$
157
The analogy principle suggests that $E(\varepsilon_i^2 x_i x_i')$ can be estimated using the estimator
$$\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 x_i x_i'.$$
Since $\varepsilon_i$ is not observable we need another one:
$$\hat{S} = \frac{1}{n}\sum_{i=1}^{n} e_i^2 x_i x_i'.$$
Assumption (2.6 - finite fourth moments for regressors). $E\!\left[(x_{ik}x_{ij})^2\right]$ exists and is finite
for all $k$ and $j$ ($k, j = 1, \ldots, K$).
Proposition (2.4 - consistent estimation of $S$). Suppose $S = E(\varepsilon_i^2 x_i x_i')$ exists and is finite.
Then, under Assumptions 2.1-2.4 and 2.6, $\hat{S}$ is consistent for $S$.
158
$$b \stackrel{a}{\sim} N\!\left(\beta, \frac{\widehat{\mathrm{Avar}}(b)}{n}\right) = N\!\left(\beta, \frac{S_{xx}^{-1}\hat{S}S_{xx}^{-1}}{n}\right) = N\!\left(\beta, (X'X)^{-1}X'BX(X'X)^{-1}\right), \qquad B = \mathrm{diag}(e_1^2, \ldots, e_n^2),$$
$$W = n(Rb - r)'\left[R\,\widehat{\mathrm{Avar}}(b)\,R'\right]^{-1}(Rb - r) = n(Rb - r)'\left[RS_{xx}^{-1}\hat{S}S_{xx}^{-1}R'\right]^{-1}(Rb - r) = (Rb - r)'\left[R(X'X)^{-1}X'BX(X'X)^{-1}R'\right]^{-1}(Rb - r) \stackrel{d}{\rightarrow} \chi^2_{(p)}.$$
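A sketch of the heteroskedasticity-robust (White) variance estimate and the corresponding Wald statistic (illustrative only; NumPy/SciPy, artificial heteroskedastic data with hypothetical coefficients and restrictions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, K = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
eps = np.abs(X[:, 1]) * rng.normal(size=n)           # conditionally heteroskedastic error
y = X @ np.array([1.0, 0.5, 0.0]) + eps

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)

# Robust variance: (X'X)^{-1} X'BX (X'X)^{-1} with B = diag(e_i^2).
meat = X.T @ (e[:, None] ** 2 * X)
V_robust = XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(V_robust))
se_conventional = np.sqrt(e @ e / (n - K) * np.diag(XtX_inv))
print(se_conventional, se_robust)

# Robust Wald test of H0: beta_2 = 0.5 and beta_3 = 0 (p = 2 restrictions).
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.array([0.5, 0.0])
diff = R @ b - r
W = diff @ np.linalg.solve(R @ V_robust @ R.T, diff)
print(W, stats.chi2.sf(W, df=2))
```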
159
(b) (Consistent estimation of the asymptotic variance) Under the same set of assumptions,
$\mathrm{Avar}(b)$ is consistently estimated by
$$\widehat{\mathrm{Avar}}(b) = s^2 S_{xx}^{-1} = n s^2(X'X)^{-1}.$$
161
Under $H_0: \beta_k = \bar{\beta}_k$ we have
$$t_k^0 = \frac{b_k - \bar{\beta}_k}{\hat{\sigma}_{b_k}} \stackrel{d}{\rightarrow} N(0, 1), \qquad \text{where } \hat{\sigma}^2_{b_k} = \frac{\widehat{\mathrm{Avar}}(b_k)}{n} = s^2\left[(X'X)^{-1}\right]_{kk}.$$
Notice
$$pF^0 = \frac{e^{*\prime}e^* - e'e}{e'e/(n - K)} \stackrel{d}{\rightarrow} \chi^2_{(p)}$$
where $*$ refers to the short regression, i.e. the regression subject to the constraint $R\beta = r$.
Remark (No need for the fourth-moment assumption). By S&WD and Assumptions 2.1-2.4,
$s^2 S_{xx} \stackrel{p}{\rightarrow} \sigma^2\Sigma_{xx} = S$. We do not need the fourth-moment assumption (Assumption 2.6)
for consistency.
162
With the advent of robust standard errors allowing us to do inference without specifying the
conditional second moment, testing conditional homoskedasticity is not as important as it
used to be. This section presents only the most popular test, due to White (1980), for the
case of random samples. The White statistic is
$$nR^2 \stackrel{d}{\rightarrow} \chi^2_{(m)},$$
where $R^2$ is the $R^2$ from the auxiliary regression of $e_i^2$ on a constant and $\psi_i$, and $m$ is the
dimension of $\psi_i$.
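A sketch of the White test (illustrative only; NumPy/SciPy, artificial data; here $\psi_i$ is taken to be $(x_i, x_i^2)$, which is only one possible choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 400
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + np.sqrt(0.5 + x ** 2) * rng.normal(size=n)   # heteroskedastic errors

# Step 1: OLS residuals from the original regression.
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

# Step 2: auxiliary regression of e^2 on a constant and psi_i (here: x and x^2),
# then the statistic n*R^2, asymptotically chi-square with m = dim(psi) degrees of freedom.
Z = np.column_stack([np.ones(n), x, x ** 2])
g = np.linalg.solve(Z.T @ Z, Z.T @ e ** 2)
u = e ** 2 - Z @ g
R2 = 1.0 - u @ u / np.sum((e ** 2 - np.mean(e ** 2)) ** 2)
m = Z.shape[1] - 1
stat = n * R2
print(stat, stats.chi2.sf(stat, df=m))     # small p-value: evidence of heteroskedasticity
```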
163
Test Equation:
Dependent Variable: RESID^2
Even when the error is found to be conditionally heteroskedastic, the OLS estimator is still
consistent and asymptotically normal, and valid statistical inference can be conducted with
robust standard errors and robust Wald statistics. However, in the (somewhat unlikely) case
of a priori knowledge of the functional form of the conditional second moment, it should be
possible to obtain sharper estimates with smaller asymptotic variance.
166
To simplify the discussion, throughout this section we strengthen Assumptions 2.2 and 2.5
by assuming that $\{(y_i, x_i)\}$ is i.i.d.
The parametric functional form for the conditional second moment we consider is
$$E(\varepsilon_i^2 \mid x_i) = z_i'\alpha$$
where $z_i$ is a function of $x_i$, e.g.
$$z_i' = (1, \; x_{i2}^2).$$
167
The WLS (also GLS) estimator can be obtained by applying OLS to the regression
$$\tilde{y}_i = \tilde{x}_i'\beta + \tilde{\varepsilon}_i$$
where
$$\tilde{y}_i = \frac{y_i}{\sqrt{z_i'\alpha}}, \qquad \tilde{x}_{ik} = \frac{x_{ik}}{\sqrt{z_i'\alpha}}, \qquad \tilde{\varepsilon}_i = \frac{\varepsilon_i}{\sqrt{z_i'\alpha}}, \qquad i = 1, 2, \ldots, n.$$
We have
$$\hat{\beta}_{GLS} = \hat{\beta}(V) = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y} = (X'V^{-1}X)^{-1}X'V^{-1}y.$$
168
Note that
$$E(\tilde{\varepsilon}_i \mid \tilde{x}_i) = 0.$$
Therefore, provided that $E(\tilde{x}_i\tilde{x}_i')$ is nonsingular, Assumptions 2.1-2.5 are satisfied for the equation
$\tilde{y}_i = \tilde{x}_i'\beta + \tilde{\varepsilon}_i$. Furthermore, by construction, the error $\tilde{\varepsilon}_i$ is conditionally homoskedastic:
$E(\tilde{\varepsilon}_i^2 \mid \tilde{x}_i) = 1$. So Proposition 2.5 applies: the WLS estimator is consistent and asymptotically
normal, and the asymptotic variance is
$$\mathrm{Avar}\!\left(\hat{\beta}(V)\right) = \left[E(\tilde{x}_i\tilde{x}_i')\right]^{-1} = \left[\mathrm{plim}\,\frac{1}{n}\sum_{i=1}^{n}\tilde{x}_i\tilde{x}_i'\right]^{-1} \text{ (by the S\&WD theorem)} = \left[\mathrm{plim}\,\frac{1}{n}X'V^{-1}X\right]^{-1}.$$
Thus $n\left(X'V^{-1}X\right)^{-1}$ is a consistent estimator of $\mathrm{Avar}(\hat{\beta}(V))$.
169
"2i = E "2i xi + i
where by construction E ( ij xi) = 0: This suggest that the following regression can be
considered
"2i = z0i + i
Provided that E ziz0i is nonsingular, Proposition 2.1 is applicable to this auxiliary regres-
sion: the OLS estimator of is consistent and asymptotically normal. However we cannot
run this regression as "i is not observable. In the previous regression we should replace "i
by the consistent estimate ei (despite the presence of conditional heteroskedasticity). In
conclusion, we may obtain a consistent estimate of by considering the regression of e2i on
zi to get
0 1 1
n
X Xn
^ =@ ziz0iA zie2i :
i=1 i=1
170
Step 1: Estimate the equation $y_i = x_i'\beta + \varepsilon_i$ by OLS and compute the OLS residuals $e_i$.
$$\hat{\beta}(\hat{V}) \stackrel{p}{\rightarrow} \beta, \qquad \sqrt{n}\left(\hat{\beta}(\hat{V}) - \beta\right) \stackrel{d}{\rightarrow} N\!\left(0, \mathrm{Avar}\!\left(\hat{\beta}(V)\right)\right),$$
and $n\left(X'\hat{V}^{-1}X\right)^{-1}$ is a consistent estimator of $\mathrm{Avar}(\hat{\beta}(V))$.
The specification $\varepsilon_i^2 = z_i'\alpha + \eta_i$ may lead to $z_i'\hat{\alpha} < 0$. To overcome this problem a popular
specification for $E(\varepsilon_i^2 \mid x_i)$ is
$$E(\varepsilon_i^2 \mid x_i) = \exp\left(x_i'\alpha\right)$$
(it guarantees that $\mathrm{Var}(y_i \mid x_i) > 0$ for all $\alpha \in \mathbb{R}^r$). It implies $\log E(\varepsilon_i^2 \mid x_i) = x_i'\alpha$. This
suggests the following procedure:
a) Estimate $y_i = x_i'\beta + \varepsilon_i$ by OLS and save the residuals $e_i$.
b) Regress $\log(e_i^2)$ on $x_i$ and compute $\hat{\sigma}_i^2 = \exp(x_i'\hat{\alpha})$.
c) Transform the data $\tilde{y}_i = \dfrac{y_i}{\hat{\sigma}_i}$, $\tilde{x}_{ij} = \dfrac{x_{ij}}{\hat{\sigma}_i}$.
d) Regress $\tilde{y}$ on $\tilde{X}$ and obtain $\hat{\beta}(\hat{V})$.
173
Based on the information below, are the standard errors reported in the first table reliable?
cigs: number of cigarettes smoked per day; log(income): log of annual income; log(cigprice):
log of the per-pack price of cigarettes in cents; educ: years of education; age; and restaurn:
binary indicator equal to unity if the person resides in a state with restaurant smoking
restrictions.
175
Calculate $\hat{\sigma}_i^2 = \exp\left(x_i'\hat{\alpha}\right) = \exp\!\left(\widehat{\log e_i^2}\right)$.
Notice: $\widehat{\log e_1^2}, \ldots, \widehat{\log e_n^2}$ are the fitted values of the above regression.
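A sketch of the whole FGLS procedure with the exponential variance specification (illustrative only; NumPy, artificial data with an assumed skedastic function $\exp(x_i'\alpha)$ and hypothetical parameter values):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 600
x = rng.uniform(0, 3, size=n)
X = np.column_stack([np.ones(n), x])
sigma2_i = np.exp(0.2 + 0.8 * x)                    # assumed true skedastic function exp(x_i' alpha)
y = X @ np.array([1.0, 0.5]) + np.sqrt(sigma2_i) * rng.normal(size=n)

# a) OLS and residuals.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols

# b) Regress log(e_i^2) on x_i and take fitted values; sigma_hat_i^2 = exp(fitted).
a_hat = np.linalg.solve(X.T @ X, X.T @ np.log(e ** 2))
sigma2_hat = np.exp(X @ a_hat)

# c)-d) Weight the data by 1/sigma_hat_i and rerun OLS: the FGLS estimator.
w = 1.0 / np.sqrt(sigma2_hat)
Xw, yw = X * w[:, None], y * w
b_fgls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
print(b_ols, b_fgls)
```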
178
Both $b$ and $\hat{\beta}(\hat{V})$ are consistent.
Assuming that the functional form of the conditional second moment is correctly specified,
$\hat{\beta}(\hat{V})$ is asymptotically more efficient than $b$.
It is not clear which estimator is better (in terms of efficiency) in the following situations:
– in finite samples, even if the functional form is correctly specified, the large-sample
approximation will probably work less well for the WLS estimator than for OLS
because of the estimation of the extra parameters ($\alpha$) involved in the WLS procedure.
180
Because the issue of serial correlation arises almost always in time-series models, we use the
subscript "t" instead of "i" in this section. Throughout this section we assume that the
regressors include a constant. The issue is how to deal with
$$E(\varepsilon_t\varepsilon_{t-j} \mid x_{t-j}, x_t) \neq 0.$$
When the regressors include a constant (true in virtually all known applications), Assumption
2.5 implies that the error term is a scalar martingale difference sequence, so if the error
is found to be serially correlated (or autocorrelated), that is an indication of a failure of
Assumption 2.5.
Assumptions 2.1-2.4 may hold under serial correlation, so the OLS estimator may be consistent
even if the error is autocorrelated. However, the large-sample properties of $b$, $t$, and
$F$ of Proposition 2.5 are not valid. To see why, consider
$$\sqrt{n}(b - \beta) = S_{xx}^{-1}\sqrt{n}\,\bar{g}.$$
182
We have
$$\mathrm{Avar}(b) = \Sigma_{xx}^{-1}S\Sigma_{xx}^{-1}, \qquad \widehat{\mathrm{Avar}}(b) = S_{xx}^{-1}\hat{S}S_{xx}^{-1}.$$
Consider the regression $y_t = x_t'\beta + \varepsilon_t$. We want to test whether or not $\varepsilon_t$ is serially correlated.
Consider
$$\rho_j = \frac{\mathrm{Cov}(\varepsilon_t, \varepsilon_{t-j})}{\sqrt{\mathrm{Var}(\varepsilon_t)\mathrm{Var}(\varepsilon_{t-j})}} = \frac{\mathrm{Cov}(\varepsilon_t, \varepsilon_{t-j})}{\mathrm{Var}(\varepsilon_t)} = \frac{\gamma_j}{\gamma_0} = \frac{E(\varepsilon_t\varepsilon_{t-j})}{E(\varepsilon_t^2)}.$$
Proposition. If $\{\varepsilon_t\}$ is a stationary MDS with $E(\varepsilon_t^2 \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) = \sigma^2$, then
$$\sqrt{n}\,\tilde{\gamma}_j \stackrel{d}{\rightarrow} N(0, \sigma^4) \quad \text{and} \quad \sqrt{n}\,\tilde{\rho}_j \stackrel{d}{\rightarrow} N(0, 1).$$
Proposition. Under the assumptions of the previous proposition,
$$\text{Box-Pierce Q statistic} = Q_{BP} = n\sum_{j=1}^{p}\tilde{\rho}_j^2 = \sum_{j=1}^{p}\left(\sqrt{n}\,\tilde{\rho}_j\right)^2 \stackrel{d}{\rightarrow} \chi^2_{(p)}.$$
regression of $e_t$ on $e_{t-1}, \ldots, e_{t-p}$:
We calculate the F statistic for the hypothesis that the $p$ coefficients of $e_{t-1}, \ldots, e_{t-p}$ are
all zero.
Given
$$e_t = \gamma_1 + \gamma_2 x_{t2} + \cdots + \gamma_K x_{tK} + \phi_1 e_{t-1} + \cdots + \phi_p e_{t-p} + \text{error}_t,$$
the null hypothesis can be formulated as
$$H_0: \phi_1 = \cdots = \phi_p = 0.$$
Use the F test:
188
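A minimal Python sketch of both tests on simulated data with AR(1) errors (hypothetical data); the Box-Pierce Q statistic is computed directly from the residual autocorrelations, and the Breusch-Godfrey F test comes from statsmodels:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 4
x = sm.add_constant(rng.normal(size=(n, 2)))
eps = np.zeros(n)
for t in range(1, n):                      # AR(1) errors, so the tests should reject
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = x @ np.array([1.0, 0.5, -0.3]) + eps

res = sm.OLS(y, x).fit()
e = res.resid

# sample autocorrelations of the residuals and Q_BP = n * sum(rho_j^2)
rho = np.array([np.sum(e[j:] * e[:-j]) / np.sum(e**2) for j in range(1, p + 1)])
q_bp = n * np.sum(rho**2)
print("Q_BP =", q_bp, "p-value =", 1 - stats.chi2.cdf(q_bp, p))

# Breusch-Godfrey: auxiliary regression of e_t on the regressors and e_{t-1},...,e_{t-p}
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(res, nlags=p)
print("BG F =", f_stat, "p-value =", f_pval)
```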
EVIEWS
189
Example. Consider chnimp: the volume of imports of barium chloride from China; chempi: an index of chemical production (to control for the overall demand for barium chloride); gas: the volume of gasoline production (another demand variable); rtwex: an exchange rate index (measures the strength of the dollar against several other currencies).
Equation 1
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Equation 2
Breusch-Godfrey Serial Correlation LM Test:
Test Equation:
Dependent Variable: RESID
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Presample missing value lagged residuals set to zero.
If you conclude that the errors are serially correlated you have a few options:
(a) You know (at least approximately) the form of autocorrelation, so you use a feasible GLS estimator.
(b) The second approach parallels the use of the White estimator for heteroskedasticity: you do not know the form of autocorrelation, so you rely on OLS but use a consistent estimator of $\mathrm{Avar}(b)$.
(c) You are concerned only with the dynamic specification of the model and with forecasting. You may try to convert your model into a dynamically complete model.
(d) Your model may be misspecified: you respecify the model and the autocorrelation disappears.
192
There are many forms of autocorrelation and each one leads to a different structure for the error covariance matrix V. The most popular form is known as the first-order autoregressive process. In this case the error term in
$$y_t = x_t'\beta + \varepsilon_t$$
is assumed to follow the AR(1) model.
Initial Model:
$$y_t = x_t'\beta + \varepsilon_t,\qquad \varepsilon_t = \phi\varepsilon_{t-1} + v_t,\qquad |\phi| < 1.$$
The GLS estimator is the OLS estimator applied to the transformed model
$$\tilde y_t = \tilde x_t'\beta + v_t$$
where
$$\tilde y_t = \begin{cases}\sqrt{1-\phi^2}\,y_1 & t = 1\\ y_t - \phi y_{t-1} & t > 1\end{cases},\qquad \tilde x_t' = \begin{cases}\sqrt{1-\phi^2}\,x_1' & t = 1\\ (x_t - \phi x_{t-1})' & t > 1\end{cases}.$$
Without the first observation, the transformed model is
$$y_t - \phi y_{t-1} = (x_t - \phi x_{t-1})'\beta + v_t.$$
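A minimal Python sketch of this feasible GLS transformation, assuming float arrays y (length n) and x (n by K) from a hypothetical data set; $\phi$ is estimated from the OLS residuals, and the first observation is kept Prais-Winsten style (dropping it gives the Cochrane-Orcutt variant):

```python
import numpy as np

def ar1_fgls(y, x):
    # step 1: OLS and residuals
    b_ols = np.linalg.solve(x.T @ x, x.T @ y)
    e = y - x @ b_ols
    # step 2: estimate phi by regressing e_t on e_{t-1}
    phi = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)
    # step 3: quasi-difference the data (keeping the first observation)
    y_t = np.empty_like(y)
    x_t = np.empty_like(x)
    y_t[0] = np.sqrt(1 - phi**2) * y[0]
    x_t[0] = np.sqrt(1 - phi**2) * x[0]
    y_t[1:] = y[1:] - phi * y[:-1]
    x_t[1:] = x[1:] - phi * x[:-1]
    # step 4: OLS on the transformed model gives the feasible GLS estimator
    b_gls = np.linalg.solve(x_t.T @ x_t, x_t.T @ y_t)
    return b_gls, phi
```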
Example (continuation of the previous example). Let’s consider the residuals of Equation 1:
Equation 3
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments
Convergence achieved after 8 iterations
Exercise 3.7. Consider $y_t = \beta_1 + \beta_2 x_{t2} + \varepsilon_t$ where $\varepsilon_t = \phi\varepsilon_{t-1} + v_t$ and $\{v_t\}$ is a white noise process. Using the first differences of the variables one gets $\Delta y_t = \beta_2\,\Delta x_{t2} + \Delta\varepsilon_t$. Show that $\mathrm{Corr}(\Delta\varepsilon_t,\Delta\varepsilon_{t-1}) = -(1-\phi)/2$. Discuss the advantages and disadvantages of differencing the variables as a procedure to remove autocorrelation.
195
For the sake of generality, assume that you also have a problem of heteroskedasticity. Given
$$S = \mathrm{Var}(\sqrt{n}\,\bar g) = \mathrm{Var}(g_t) + \frac1n\sum_{j=1}^{n-1}\sum_{t=j+1}^{n}\left[E(g_t g_{t-j}') + E(g_{t-j}g_t')\right] = E(\varepsilon_t^2 x_t x_t') + \frac1n\sum_{j=1}^{n-1}\sum_{t=j+1}^{n}\left[E(\varepsilon_t\varepsilon_{t-j}x_t x_{t-j}') + E(\varepsilon_{t-j}\varepsilon_t x_{t-j}x_t')\right],$$
a possible estimator of S based on the analogy principle would be
$$\frac1n\sum_{t=1}^{n}e_t^2 x_t x_t' + \frac1n\sum_{j=1}^{n_0}\sum_{t=j+1}^{n}\left(e_t e_{t-j}x_t x_{t-j}' + e_{t-j}e_t x_{t-j}x_t'\right),\qquad n_0 < n.$$
A major problem with this estimator is that it is not guaranteed to be positive semi-definite and hence may not be a well-defined variance-covariance matrix.
196
Newey and West show that with a suitable weighting function $\omega(j)$, the estimator below is consistent and positive semi-definite:
$$\hat S_{HAC} = \frac1n\sum_{t=1}^{n}e_t^2 x_t x_t' + \frac1n\sum_{j=1}^{L}\omega(j)\sum_{t=j+1}^{n}\left(e_t e_{t-j}x_t x_{t-j}' + e_{t-j}e_t x_{t-j}x_t'\right)$$
where the weighting function $\omega(j)$ is
$$\omega(j) = 1 - \frac{j}{L+1}.$$
The maximum lag L must be determined in advance. Autocorrelations at lags longer than L are ignored. For a moving-average process, this value is in general a small number.
This estimator is known as the heteroskedasticity and autocorrelation consistent (HAC) covariance matrix estimator and is valid when both conditional heteroskedasticity and serial correlation are present but of unknown form.
197
For example, with $L = 3$:
$$\omega(1) = 1 - \tfrac14 = 0.75,\qquad \omega(2) = 1 - \tfrac24 = 0.50,\qquad \omega(3) = 1 - \tfrac34 = 0.25.$$
198
EVIEWS:
[Figure: the default truncation lag L plotted against the sample size n.]
EViews selects $L = \mathrm{floor}\!\left(4\,(n/100)^{2/9}\right)$.
199
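A minimal Python sketch of Newey-West (HAC) standard errors via statsmodels, on simulated data, with the truncation lag chosen by the rule above (for n = 131 this gives L = 4, matching the "lag truncation=4" reported in Equation 4 below):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 131
x = sm.add_constant(rng.normal(size=(n, 3)))
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.5 * eps[t - 1] + rng.normal()
y = x @ np.array([1.0, 0.3, -0.2, 0.1]) + eps

L = int(np.floor(4 * (n / 100) ** (2 / 9)))         # EViews-style default lag
ols = sm.OLS(y, x).fit()                             # usual (non-robust) standard errors
hac = sm.OLS(y, x).fit(cov_type="HAC", cov_kwds={"maxlags": L})
print("truncation lag:", L)
print(np.column_stack([ols.bse, hac.bse]))           # compare the two sets of std. errors
```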
Equation 4
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Newey-West HAC Standard Errors & Covariance (lag truncation=4)
Consider
$$y_t = \tilde x_t'\beta + u_t$$
such that $E(u_t\mid\tilde x_t) = 0$. This condition, although necessary for consistency, does not preclude autocorrelation. You may try to increase the number of regressors to $x_t$ and get a new regression model
$$y_t = x_t'\beta + \varepsilon_t\quad\text{such that}\quad E(\varepsilon_t\mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, x_{t-2},\ldots) = 0$$
(the model is then said to be dynamically complete, DC).
Proposition. If a model is DC then the errors are not correlated. Moreover, $\{g_t\}$ is a MDS.
Example. Suppose
$$E(y_t\mid\tilde x_t) = E(y_t\mid x_{t2}) = \beta_1 + \beta_2 x_{t2}.$$
202
If, in addition, $u_t$ follows an AR(1) process, $u_t = \phi u_{t-1} + \varepsilon_t$, then from
$$u_t = y_t - (\beta_1 + \beta_2 x_{t2}),\qquad u_{t-1} = y_{t-1} - (\beta_1 + \beta_2 x_{t-1,2}),$$
we have
$$y_t = \beta_1 + \beta_2 x_{t2} + u_t = \beta_1 + \beta_2 x_{t2} + \phi u_{t-1} + \varepsilon_t = \beta_1 + \beta_2 x_{t2} + \phi\left(y_{t-1} - \beta_1 - \beta_2 x_{t-1,2}\right) + \varepsilon_t.$$
This equation can be written in the form of a regression of $y_t$ on a constant, $x_{t2}$, $y_{t-1}$ and $x_{t-1,2}$, which is a dynamically complete model (Equation 5 below applies this idea to the previous example).
Equation 5
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments

Variable            Coefficient   Std. Error   t-Statistic   Prob.
C                   -11.30596     23.24886     -0.486302     0.6276
LOG(CHEMPI)         -7.193799     3.539951     -2.032175     0.0443
LOG(GAS)            1.319540      1.003825     1.314513      0.1911
LOG(RTWEX)          -0.501520     2.108623     -0.237842     0.8124
LOG(CHEMPI(-1))     9.618587      3.602977     2.669622      0.0086
LOG(GAS(-1))        -1.223681     1.002237     -1.220950     0.2245
LOG(RTWEX(-1))      0.935678      2.088961     0.447915      0.6550
LOG(CHNIMP(-1))     0.270704      0.084103     3.218710      0.0016

R-squared             0.394405    Mean dependent var      6.180590
Adjusted R-squared    0.359658    S.D. dependent var      0.699063
S.E. of regression    0.559400    Akaike info criterion   1.735660
Sum squared resid     38.17726    Schwarz criterion       1.912123
Log likelihood        -104.8179   Hannan-Quinn criter.    1.807363
F-statistic           11.35069    Durbin-Watson stat      2.059684
Prob(F-statistic)     0.000000

Equation 6
Breusch-Godfrey Serial Correlation LM Test:
Test Equation:
Dependent Variable: RESID
Method: Least Squares
Date: 05/12/10  Time: 19:13
Sample: 1978M03 1988M12
Included observations: 130
Presample missing value lagged residuals set to zero.

Variable            Coefficient   Std. Error   t-Statistic   Prob.
C                   1.025127      26.26657     0.039028      0.9689
LOG(CHEMPI)         1.373671      3.968650     0.346130      0.7299
LOG(GAS)            -0.279136     1.055889     -0.264361     0.7920
LOG(RTWEX)          -0.074592     2.234853     -0.033377     0.9734
LOG(CHEMPI(-1))     -1.878917     4.322963     -0.434636     0.6647
LOG(GAS(-1))        0.315918      1.076831     0.293378      0.7698
LOG(RTWEX(-1))      -0.007029     2.224878     -0.003159     0.9975
LOG(CHNIMP(-1))     0.151065      0.293284     0.515082      0.6075
RESID(-1)           -0.189924     0.307062     -0.618520     0.5375
RESID(-2)           0.088557      0.124602     0.710715      0.4788
RESID(-3)           0.154141      0.098337     1.567475      0.1199
RESID(-4)           -0.125009     0.098681     -1.266795     0.2079
RESID(-5)           -0.035680     0.099831     -0.357407     0.7215
RESID(-6)           0.048053      0.098008     0.490291      0.6249
RESID(-7)           0.129226      0.097417     1.326523      0.1874
RESID(-8)           0.052884      0.099891     0.529420      0.5976
RESID(-9)           -0.122323     0.102670     -1.191423     0.2361
RESID(-10)          0.022149      0.099419     0.222788      0.8241
RESID(-11)          0.034364      0.099973     0.343738      0.7317
RESID(-12)          -0.038034     0.102071     -0.372628     0.7101

R-squared             0.081251    Mean dependent var      -9.76E-15
Adjusted R-squared    -0.077442   S.D. dependent var      0.544011
S.E. of regression    0.564683    Akaike info criterion   1.835533
Sum squared resid     35.07532    Schwarz criterion       2.276692
Log likelihood        -99.30962   Hannan-Quinn criter.    2.014790
F-statistic           0.512002    Durbin-Watson stat      2.011429
Prob(F-statistic)     0.952295
204
In many cases the finding of autocorrelation is an indication that the model is misspecified. If this is the case, the most natural route is not to change your estimator (from OLS to GLS) but to change your model. Several types of misspecification may lead to a finding of autocorrelation in your OLS residuals. Example: suppose the true model is
$$y_t = \beta_1 + \beta_2\log t + \varepsilon_t.$$
In the following figure we estimate a misspecified functional form, $y_t = \beta_1 + \beta_2 t + \varepsilon_t$. The residuals are clearly autocorrelated.
206
Consider
$$y_t = \alpha + f(t) + \varepsilon_t$$
where $f(t)$ is a function of time (e.g. $f(t) = t$ or $f(t) = t^2$, etc.). This kind of model does not satisfy Assumption 2.2 ($\{(y_i, x_i)\}$ jointly S&WD). This type of nonstationarity is not serious and OLS is applicable. Let us focus on the case
$$y_t = \alpha + \delta t + \varepsilon_t = x_t'\beta + \varepsilon_t,\qquad x_t' = \begin{bmatrix}1 & t\end{bmatrix},\quad \beta = \begin{bmatrix}\alpha\\ \delta\end{bmatrix}.$$
Proposition (2.11 - OLS estimation of the time regression). Consider the time regression $y_t = \alpha + \delta t + \varepsilon_t$ where $\varepsilon_t$ is independent white noise with $E(\varepsilon_t^2) = \sigma^2$ and $E(\varepsilon_t^4) < \infty$. Then
$$\begin{bmatrix}\sqrt{n}\,(\hat\alpha-\alpha)\\ n^{3/2}(\hat\delta-\delta)\end{bmatrix}\xrightarrow{d}N\!\left(0,\ \sigma^2\begin{bmatrix}1 & 1/2\\ 1/2 & 1/3\end{bmatrix}^{-1}\right) = N\!\left(0,\ \sigma^2\begin{bmatrix}4 & -6\\ -6 & 12\end{bmatrix}\right).$$
As in the stationary case, $\hat\alpha$ is $\sqrt{n}$-consistent because $\sqrt{n}\,(\hat\alpha-\alpha)$ converges to a (normal) random variable. The OLS estimate of the time coefficient, $\hat\delta$, is also consistent, but the speed of convergence is faster: it is $n^{3/2}$-consistent in that $n^{3/2}(\hat\delta-\delta)$ converges to a random variable. In this sense, $\hat\delta$ is superconsistent.
208
The OLS coefficient estimates of the time regression are asymptotically normal, provided the sampling error is properly scaled. Inference about $\hat\delta$ can be based on
$$\frac{n^{3/2}(\hat\delta-\delta)}{\sqrt{s^2\,12}}\xrightarrow{d}N(0,1)\quad\text{in the case } y_t = \alpha + \delta t + \varepsilon_t,$$
$$\frac{n^{3/2}(\hat\delta-\delta)}{\sqrt{s^2\,3}}\xrightarrow{d}N(0,1)\quad\text{in the case } y_t = \delta t + \varepsilon_t.$$
210
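A minimal simulation sketch of Proposition 2.11 (hypothetical parameter values): the variance of $n^{3/2}(\hat\delta-\delta)$ stabilizes near $12\sigma^2$ as n grows, illustrating the $n^{3/2}$ rate of the trend coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, delta, sigma = 1.0, 0.05, 2.0

def scaled_errors(n, reps=2000):
    out = np.empty(reps)
    t = np.arange(1, n + 1)
    x = np.column_stack([np.ones(n), t])
    for r in range(reps):
        y = alpha + delta * t + sigma * rng.normal(size=n)
        a_hat, d_hat = np.linalg.solve(x.T @ x, x.T @ y)
        out[r] = n**1.5 * (d_hat - delta)       # scaled sampling error of delta_hat
    return out

for n in (50, 200, 800):
    e = scaled_errors(n)
    print(n, e.var())                           # should be close to 12*sigma^2 = 48
```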
Consider
Consider
$$y_i = \delta_1 z_{i1} + \delta_2 z_{i2} + \cdots + \delta_K z_{iK} + \varepsilon_i.$$
If $\mathrm{Cov}(z_{ij},\varepsilon_i)\neq 0$ (or $E(z_{ij}\varepsilon_i)\neq 0$) then we say that $z_{ij}$ (the $j$-th regressor) is endogenous. It follows that $E(z_i\varepsilon_i)\neq 0$.
Definition (endogenous regressor). We say that a regressor is endogenous if it is not predetermined (i.e., not orthogonal to the error term), that is, if it does not satisfy the orthogonality condition (Assumption 2.3 does not hold).
If the regressors are endogenous we have, under Assumptions 2.1, 2.2 and 2.4,
$$b = \delta + \left(\frac1n\sum_{i=1}^{n}z_i z_i'\right)^{-1}\frac1n\sum_{i=1}^{n}z_i\varepsilon_i\;\xrightarrow{p}\;\delta + \Sigma_{zz}^{-1}E(z_i\varepsilon_i)\;\neq\;\delta,$$
since $E(z_i\varepsilon_i)\neq 0$. The term $\Sigma_{zz}^{-1}E(z_i\varepsilon_i)$ is the asymptotic bias.
211
The OLS estimator in the model
$$y_i = \delta_1 + \delta_2 z_{i2} + \varepsilon_i$$
is
$$b = \begin{bmatrix}b_1\\ b_2\end{bmatrix} = (Z'Z)^{-1}Z'y = \begin{bmatrix}\bar y - \dfrac{\widehat{\mathrm{Cov}}(z_{i2},y_i)}{S_{z_2}^2}\,\bar z_2\\[2ex] \dfrac{\widehat{\mathrm{Cov}}(z_{i2},y_i)}{S_{z_2}^2}\end{bmatrix}$$
where
$$\widehat{\mathrm{Cov}}(z_{i2},y_i) = \frac1n\sum_i(z_{i2}-\bar z_2)(y_i-\bar y),\qquad S_{z_2}^2 = \frac1n\sum_i(z_{i2}-\bar z_2)^2.$$
Under Assumption 2.2 we have
$$b_2 = \frac{\widehat{\mathrm{Cov}}(z_{i2},y_i)}{S_{z_2}^2}\;\xrightarrow{p}\;\frac{\mathrm{Cov}(z_{i2},y_i)}{\mathrm{Var}(z_{i2})} = \frac{\mathrm{Cov}(z_{i2},\ \delta_1+\delta_2 z_{i2}+\varepsilon_i)}{\mathrm{Var}(z_{i2})} = \delta_2 + \frac{\mathrm{Cov}(z_{i2},\varepsilon_i)}{\mathrm{Var}(z_{i2})}.$$
212
$$b_1 = \bar y - \frac{\widehat{\mathrm{Cov}}(z_{i2},y_i)}{S_{z_2}^2}\,\bar z_2\;\xrightarrow{p}\;E(y_i) - \frac{\mathrm{Cov}(z_{i2},y_i)}{\mathrm{Var}(z_{i2})}E(z_{i2}) = \delta_1 + \delta_2 E(z_{i2}) - \left(\delta_2 + \frac{\mathrm{Cov}(z_{i2},\varepsilon_i)}{\mathrm{Var}(z_{i2})}\right)E(z_{i2}) = \delta_1 - \frac{\mathrm{Cov}(z_{i2},\varepsilon_i)}{\mathrm{Var}(z_{i2})}E(z_{i2}).$$
If $\mathrm{Cov}(z_{i2},\varepsilon_i) = 0$ then $b_i\xrightarrow{p}\delta_i$. If $z_{i2}$ is endogenous, $b_1$ and $b_2$ are inconsistent. Show that
$$\Sigma_{zz}^{-1}E(z_i\varepsilon_i) = \begin{bmatrix}-\dfrac{\mathrm{Cov}(z_{i2},\varepsilon_i)}{\mathrm{Var}(z_{i2})}\,E(z_{i2})\\[2ex] \dfrac{\mathrm{Cov}(z_{i2},\varepsilon_i)}{\mathrm{Var}(z_{i2})}\end{bmatrix}.$$
213
Example. Consider
$$y_{i1} = \alpha_0 + \alpha_1 y_{i2} + \varepsilon_{i1},\qquad y_{i2} = \beta_0 + \beta_1 y_{i1} + \varepsilon_{i2},$$
where $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are independent. By construction $y_{i1}$ and $y_{i2}$ are endogenous regressors. In fact, it can be proved that
$$\mathrm{Cov}(y_{i2},\varepsilon_{i1}) = \frac{\beta_1}{1-\alpha_1\beta_1}\,\mathrm{Var}(\varepsilon_{i1})\neq 0,\qquad \mathrm{Cov}(y_{i1},\varepsilon_{i2}) = \frac{\alpha_1}{1-\alpha_1\beta_1}\,\mathrm{Var}(\varepsilon_{i2})\neq 0.$$
Now
$$\hat\alpha_{1,OLS}\xrightarrow{p}\frac{\mathrm{Cov}(y_{i2},y_{i1})}{\mathrm{Var}(y_{i2})} = \frac{\mathrm{Cov}(y_{i2},\ \alpha_0+\alpha_1 y_{i2}+\varepsilon_{i1})}{\mathrm{Var}(y_{i2})} = \alpha_1 + \frac{\mathrm{Cov}(y_{i2},\varepsilon_{i1})}{\mathrm{Var}(y_{i2})}\neq\alpha_1,$$
$$\hat\beta_{1,OLS}\xrightarrow{p}\frac{\mathrm{Cov}(y_{i2},y_{i1})}{\mathrm{Var}(y_{i1})} = \frac{\mathrm{Cov}(y_{i1},\ \beta_0+\beta_1 y_{i1}+\varepsilon_{i2})}{\mathrm{Var}(y_{i1})} = \beta_1 + \frac{\mathrm{Cov}(y_{i1},\varepsilon_{i2})}{\mathrm{Var}(y_{i1})}\neq\beta_1.$$
214
The OLS estimator is inconsistent for both $\alpha_1$ and $\beta_1$ (and for $\alpha_0$ and $\beta_0$ as well). This phenomenon is known as the simultaneous equations bias or simultaneity bias, because the regressor and the error term are often related to each other through a system of simultaneous equations.
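A minimal simulation sketch of the simultaneity bias, with hypothetical values $\alpha_1 = 0.5$ and $\beta_1 = 0.3$; the OLS slope settles near the plim given by the formula above, not near $\alpha_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a0, a1, b0, b1 = 1.0, 0.5, 2.0, 0.3
e1 = rng.normal(size=n)
e2 = rng.normal(size=n)

# reduced form of the two-equation system
y1 = (a0 + a1 * b0 + a1 * e2 + e1) / (1 - a1 * b1)
y2 = (b0 + b1 * a0 + b1 * e1 + e2) / (1 - a1 * b1)

# OLS of y1 on a constant and y2: the slope does not converge to alpha_1
Z = np.column_stack([np.ones(n), y2])
a1_ols = np.linalg.solve(Z.T @ Z, Z.T @ y1)[1]
a1_plim = a1 + (b1 / (1 - a1 * b1)) * e1.var() / y2.var()   # plim from the formula above
print(a1_ols, a1_plim, "true value:", a1)
```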
Example. Consider
$$C_i = \beta_0 + \beta_1 Y_i + u_i\ \text{(consumption function)},\qquad Y_i = C_i + I_i\ \text{(GNP identity)},$$
where $\mathrm{Cov}(u_i, I_i) = 0$. It can be proved that
$$\hat\beta_{1,OLS}\xrightarrow{p}\beta_1 + \frac{1}{1-\beta_1}\cdot\frac{\mathrm{Var}(u_i)}{\mathrm{Var}(Y_i)}.$$
Example. See Hayashi.
We will see that a predetermined regressor necessarily becomes endogenous when measured with error. This problem is ubiquitous, particularly in micro data on households.
Consider
$$y_i = \delta z_i + u_i$$
where $z_i$ is a predetermined regressor, and suppose the variables $y_i$ and $z_i$ are measured with error.
Consider now
$$y = X_1\beta_1 + X_2\beta_2 + u$$
and suppose that this model satisfies Assumptions 2.1-2.4 (hence OLS based on this equation is consistent). However, for some reason $X_2$ is not included in the regression model (the "short regression")
$$y = X_1\beta_1 + \varepsilon,\qquad \varepsilon = X_2\beta_2 + u.$$
We are interested only in $\beta_1$. We have
$$b_1 = (X_1'X_1)^{-1}X_1'y = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + u) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'u = \beta_1 + \left(\frac{X_1'X_1}{n}\right)^{-1}\frac{X_1'X_2}{n}\beta_2 + \left(\frac{X_1'X_1}{n}\right)^{-1}\frac{X_1'u}{n}.$$
217
The conclusion is that $b_1$ is inconsistent if there are omitted variables that are correlated with $X_1$. The variables in $X_1$ are endogenous as long as $\mathrm{Cov}(X_1, X_2)\neq 0$.
Example. Suppose the correct wage equation is $\log(wage_i) = \beta_1 + \beta_2 educ_i + \beta_3 abil_i + u_i$ but $abil_i$ is omitted, so the error of the estimated equation is $\varepsilon_i = \beta_3 abil_i + u_i$. OLS will be an inconsistent estimator of $\beta_2$ if $educ_i$ and $abil_i$ are correlated. In effect,
$$b_2\xrightarrow{p}\beta_2 + \frac{\mathrm{Cov}(educ_i,\varepsilon_i)}{\mathrm{Var}(educ_i)} = \beta_2 + \frac{\mathrm{Cov}(educ_i,\ \beta_3 abil_i + u_i)}{\mathrm{Var}(educ_i)} = \beta_2 + \beta_3\,\frac{\mathrm{Cov}(educ_i, abil_i)}{\mathrm{Var}(educ_i)}.$$
219
Definition. $x_i$ is an instrumental variable (IV) for $z_i$ if (1) $x_i$ is uncorrelated with $\varepsilon_i$, that is, $\mathrm{Cov}(x_i,\varepsilon_i) = 0$ (thus, $x_i$ is a predetermined variable), and (2) $x_i$ is correlated with $z_i$, that is, $\mathrm{Cov}(x_i, z_i)\neq 0$.
Exercise 4.1. Consider $\log(wage_i) = \beta_1 + \beta_2 educ_i + \varepsilon_i$. Omitted variable: ability. (a) Is educ an endogenous variable? (b) Can IQ be considered an IV for educ? And mother's education?
Exercise 4.2. Consider $children_i = \beta_1 + \beta_2 mothereduc_i + \beta_3 motherage_i + \varepsilon_i$. Omitted variable: $bcm_i$, a dummy equal to one if the mother is informed about birth control methods. (a) Is mothereduc endogenous? (b) Suggest an IV for mothereduc.
Exercise 4.3. Consider $score_i = \beta_1 + \beta_2 skipped_i + \varepsilon_i$. Omitted variable: motivation. (a) Is $skipped_i$ endogenous? (b) Can the distance between home (or living quarters) and the university be considered an IV?
220
Exercise 4.4. (Wooldridge, Chap. 15) Consider a simple model to estimate the effect of
personal computer (PC) ownership on college grade point average for graduating seniors at
a large public university:
GP Ai = 1 + 2P Ci + "i
where PC is a binary variable indicating PC ownership. (a) Why might PC ownership be
correlated with "i? (b) Explain why PC is likely to be related to parents’ annual income.
Does this mean parental income is a good IV for PC? Why or why not? (c) Suppose that, four
years ago, the university gave grants to buy computers to roughly one-half of the incoming
students, and the students who received grants were randomly chosen. Carefully explain
how you would use this information to construct an instrumental variable for PC. (d) Same
question as (c) but suppose that the university gave grant priority to low-income students.
For the model $y_i = z_i'\delta + \varepsilon_i$ with instrument vector $x_i$, the orthogonality conditions are
$$E\left[x_i\left(y_i - z_i'\delta\right)\right] = 0\quad\text{or}\quad E(g_i) = 0,\qquad\text{where } g_i = x_i\varepsilon_i.$$
Notice: $x_i$ should include the constant "1". Not only can $x_{i1} = 1$ be considered an IV, but it also guarantees that $E\left[1\cdot(y_i - z_i'\delta)\right] = 0\;\Leftrightarrow\;E(\varepsilon_i) = 0$.
222
$$y_i = q_i,\qquad z_i = \begin{bmatrix}1\\ p_i\end{bmatrix},\quad \delta = \begin{bmatrix}\alpha_0\\ \alpha_1\end{bmatrix},\quad L = 2,\qquad x_i = \begin{bmatrix}1\\ x_i\end{bmatrix},\quad K = 2,\qquad w_i = \begin{bmatrix}q_i\\ p_i\\ x_i\end{bmatrix}.$$
In the above example, $x_i$ and $z_i$ share the same variable (a constant). The instruments that are also regressors are called predetermined regressors, and the rest of the regressors, those that are not included in $x_i$, are called endogenous regressors.
223
$IQ_i$ is IQ (an error-ridden measure of the individual's ability; it is endogenous due to the errors-in-variables problem).
Note: $AGE_i$ is excluded from the wage equation, reflecting the underlying assumption that, once experience is controlled for, age has no effect on the wage rate.
224
$$y_i = LW_i,\qquad z_i = \begin{bmatrix}1\\ S_i\\ EXPR_i\\ IQ_i\end{bmatrix},\quad \delta = \begin{bmatrix}\delta_1\\ \delta_2\\ \delta_3\\ \delta_4\end{bmatrix},\quad L = 4,\qquad x_i = \begin{bmatrix}1\\ S_i\\ EXPR_i\\ AGE_i\\ MED_i\end{bmatrix},\quad K = 5,$$
$$w_i' = \begin{bmatrix}LW_i & S_i & EXPR_i & IQ_i & AGE_i & MED_i\end{bmatrix}.$$
225
4.2.2 Identification
The GMM estimation of the parameter vector $\delta$ is about how to exploit the information afforded by the orthogonality conditions.
$E(x_i z_i')\,\delta = E(x_i y_i)$ can be interpreted as a linear system with K equations where $\delta$ is the unknown vector. Notice: $E(x_i z_i')$ is a $K\times L$ matrix and $E(x_i y_i)$ is a $K\times 1$ vector. Can we solve the system with respect to $\delta$? We need to study the identification of the system.
Assumption (3.4 - rank condition for identification). The $K\times L$ matrix $E(x_i z_i')$ is of full column rank (i.e., its rank equals L, the number of its columns). We denote this matrix by $\Sigma_{xz}$.
226
$$E(x_i z_i') = \Sigma_{xz} = \begin{bmatrix}1 & E(S_i) & E(EXPR_i) & E(IQ_i)\\ E(S_i) & E(S_i^2) & E(S_i EXPR_i) & E(S_i IQ_i)\\ E(EXPR_i) & E(EXPR_i S_i) & E(EXPR_i^2) & E(EXPR_i IQ_i)\\ E(AGE_i) & E(AGE_i S_i) & E(AGE_i EXPR_i) & E(AGE_i IQ_i)\\ E(MED_i) & E(MED_i S_i) & E(MED_i EXPR_i) & E(MED_i IQ_i)\end{bmatrix}.$$
Since $\mathrm{rank}(\Sigma_{xz})\leq\min\{K, L\}$ we have: if $K < L$ then $\mathrm{rank}(\Sigma_{xz}) < L$. Thus a necessary condition for identification is that $K\geq L$.
Definition (order condition for identification). $K\geq L$, i.e.
$$\underbrace{K}_{\#\text{orthogonality conditions}}\;\geq\;\underbrace{L}_{\#\text{parameters}}.$$
Definition. We say that the equation is overidentified if the rank condition is satisfied and $K > L$, exactly identified (or just identified) if the rank condition is satisfied and $K = L$, and underidentified (or not identified) if the order condition is not satisfied (i.e., if $K < L$).
229
4. If $K < L$ then $\mathrm{rank}(A) < L$ and the SLE is underidentified. Example:
$$\begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix} = 1\;\Rightarrow\; x_1 = 1 - x_2,\quad x_2\in\mathbb{R}.$$
Note: $\mathrm{rank}(A) = 1$ and $K = 1 < L = 2$.
231
Assumption (3.5 - $\{g_i\}$ is a martingale difference sequence with finite second moments). Let $g_i = x_i\varepsilon_i$. $\{g_i\}$ is a martingale difference sequence (so $E(g_i) = 0$). The $K\times K$ matrix of cross moments, $E(g_i g_i')$, is nonsingular. Let $S\equiv\mathrm{Avar}(\bar g)$.
Remarks:
Assumption 3.5 implies $\mathrm{Avar}(\bar g) = \lim\mathrm{Var}(\sqrt{n}\,\bar g) = E(g_i g_i')$.
Assumption 3.5 implies $\sqrt{n}\,\bar g\xrightarrow{d}N(0,\mathrm{Avar}(\bar g))$.
If the instruments include a constant, then this assumption implies that the error is a martingale difference sequence (and a fortiori serially uncorrelated).
232
A sufficient and perhaps easier to understand condition for Assumption 3.5 is that $E(\varepsilon_i\mid\varepsilon_{i-1},\varepsilon_{i-2},\ldots,\varepsilon_1, x_i, x_{i-1},\ldots,x_1) = 0$.
Since $g_i g_i' = \varepsilon_i^2 x_i x_i'$, S is a matrix of fourth moments. Consistent estimation of S will require a fourth-moment assumption, to be specified in Assumption 3.6 below.
If $\{g_i\}$ is serially correlated, then S does not equal $E(g_i g_i')$ and will take a more complicated form.
233
The method of moments principle: to estimate a feature of the population, use the corresponding feature of the sample. Examples: the sample mean estimates the population mean, sample moments estimate population moments, and so on.
Method of moments: choose the parameter estimate so that the corresponding sample moments are also equal to zero. Since we know that $E\left[x_i(y_i - z_i'\delta)\right] = 0$, we choose the parameter estimate $\tilde\delta$ so that
$$\frac1n\sum_{i=1}^{n}x_i\left(y_i - z_i'\tilde\delta\right) = 0.$$
234
Let's expand $\bar g_n(\tilde\delta) = 0$:
$$\frac1n\sum_{i=1}^{n}x_i\left(y_i - z_i'\tilde\delta\right) = 0\;\Leftrightarrow\;\frac1n\sum_{i=1}^{n}x_i y_i - \frac1n\sum_{i=1}^{n}x_i z_i'\,\tilde\delta = 0\;\Leftrightarrow\;\frac1n\sum_{i=1}^{n}x_i z_i'\,\tilde\delta = \frac1n\sum_{i=1}^{n}x_i y_i\;\Leftrightarrow\; S_{xz}\tilde\delta = s_{xy}.$$
235
Thus $S_{xz}\tilde\delta = s_{xy}$ is the sample analogue of
$$E(x_i z_i')\,\delta = E(x_i y_i).$$
236
Consider
$$S_{xz}\tilde\delta = s_{xy}.$$
When $K = L$ (and $S_{xz}$ is invertible) this system has the unique solution $\tilde\delta = S_{xz}^{-1}s_{xy}$, the IV estimator.
Example. Consider
$$y_i = \delta_1 + \delta_2 z_{i2} + \varepsilon_i$$
and suppose that $\mathrm{Cov}(z_{i2},\varepsilon_i)\neq 0$; that is, $z_{i2}$ is an endogenous variable. We have $L = 2$, so we need at least $K = 2$ instrumental variables. Let $x_i' = \begin{bmatrix}1 & x_{i2}\end{bmatrix}$ and suppose that $\mathrm{Cov}(x_{i2},\varepsilon_i) = 0$ and $\mathrm{Cov}(x_{i2}, z_{i2})\neq 0$. Thus an IV estimator is
$$\hat\delta_{IV} = (X'Z)^{-1}X'y.$$
Exercise 4.5. Consider the previous example. (a) Show that the IV estimator $\hat\delta_{2,IV}$ can be written as
$$\hat\delta_{2,IV} = \frac{\sum_{i=1}^{n}(x_{i2}-\bar x_2)(y_i-\bar y)}{\sum_{i=1}^{n}(x_{i2}-\bar x_2)(z_{i2}-\bar z_2)}.$$
(b) Show that $\mathrm{Cov}(x_{i2},y_i) = \delta_2\,\mathrm{Cov}(x_{i2},z_{i2}) + \mathrm{Cov}(x_{i2},\varepsilon_i)$. (c) Based on part (b), show that $\hat\delta_{2,IV}\xrightarrow{p}\delta_2$ (write the assumptions you need to prove these results).
238
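A minimal simulation sketch of the just-identified case ($K = L = 2$): $z_{i2}$ is generated to be endogenous and $x_{i2}$ to be a valid instrument, so OLS is inconsistent while $\hat\delta_{IV} = (X'Z)^{-1}X'y$ recovers the true coefficients (hypothetical parameter values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x2 = rng.normal(size=n)                     # instrument
u = rng.normal(size=n)
eps = 0.8 * u + rng.normal(size=n)          # error term
z2 = 1.0 + 0.7 * x2 + u                     # endogenous regressor: Cov(z2, eps) != 0
y = 1.0 + 2.0 * z2 + eps                    # delta_1 = 1, delta_2 = 2

Z = np.column_stack([np.ones(n), z2])       # regressors
X = np.column_stack([np.ones(n), x2])       # instruments

b_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)   # inconsistent
d_iv = np.linalg.solve(X.T @ Z, X.T @ y)    # delta_IV = (X'Z)^{-1} X'y, consistent
print("OLS:", b_ols, "IV:", d_iv)
```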
4.3.2 GMM
It may happen that $K > L$ (there are more orthogonality conditions than parameters). In principle, it is better to have as many IVs as possible, so the case $K > L$ is desirable, but then the system $S_{xz}\tilde\delta = s_{xy}$ may not have a solution.
Example. Suppose
$$S_{xz} = \begin{bmatrix}1.00 & 0.097 & 0.099\\ 0.097 & 1.011 & 0.059\\ 0.099 & 0.059 & 0.967\\ 0.182 & 0.203 & 0.031\end{bmatrix},\qquad s_{xy} = \begin{bmatrix}1.954\\ 1.346\\ 0.900\\ 0.0262\end{bmatrix}$$
($K = 4$, $L = 3$) and try (if you can) to solve $S_{xz}\tilde\delta = s_{xy}$. This system is of the same type as
$$\begin{cases}\tilde\delta_1 + \tilde\delta_2 = 1\\ \tilde\delta_3 = 1\\ \tilde\delta_4 + \tilde\delta_5 = 5\\ \tilde\delta_1 + \tilde\delta_2 = 2\end{cases}$$
(the first and fourth equations are incompatible; the system is impossible; there is no solution).
239
This means we cannot set $\bar g_n(\tilde\delta)$ exactly equal to 0. However, we can at least choose $\tilde\delta$ so that $\bar g_n(\tilde\delta)$ is as close to 0 as possible. In linear algebra, two vectors are "close" if the distance between them is relatively small. We will define the distance in $\mathbb{R}^K$ through a symmetric positive definite weighting matrix $\hat W$:
$$\hat\delta(\hat W) = \arg\min_{\tilde\delta}\;J(\tilde\delta,\hat W),\qquad\text{where}\qquad J(\tilde\delta,\hat W) = n\,\bar g_n(\tilde\delta)'\,\hat W\,\bar g_n(\tilde\delta) = n\left(s_{xy}-S_{xz}\tilde\delta\right)'\hat W\left(s_{xy}-S_{xz}\tilde\delta\right).$$
Proposition. Under Assumptions 3.2 and 3.4, the GMM estimator is
$$\hat\delta(\hat W) = \left(S_{xz}'\hat W S_{xz}\right)^{-1}S_{xz}'\hat W s_{xy}.$$
If $K = L$ (so that $S_{xz}$ is invertible),
$$\hat\delta(\hat W) = \left(S_{xz}'\hat W S_{xz}\right)^{-1}S_{xz}'\hat W s_{xy} = S_{xz}^{-1}\hat W^{-1}\left(S_{xz}'\right)^{-1}S_{xz}'\hat W s_{xy} = S_{xz}^{-1}s_{xy} = \hat\delta_{IV}.$$
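A minimal Python sketch of the formula $\hat\delta(\hat W) = (S_{xz}'\hat W S_{xz})^{-1}S_{xz}'\hat W s_{xy}$, for hypothetical data matrices X (n by K instruments), Z (n by L regressors), y (length n) and a given K by K weighting matrix W:

```python
import numpy as np

def linear_gmm(y, Z, X, W):
    """GMM estimator delta_hat(W) = (Sxz' W Sxz)^{-1} Sxz' W sxy."""
    n = len(y)
    Sxz = X.T @ Z / n          # K x L
    sxy = X.T @ y / n          # K x 1
    A = Sxz.T @ W @ Sxz        # L x L
    return np.linalg.solve(A, Sxz.T @ W @ sxy)
```

With $K = L$ and $S_{xz}$ invertible the choice of W is irrelevant and the function returns the IV estimator $S_{xz}^{-1}s_{xy}$, in line with the algebra above.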
Proposition (3.1 - asymptotic distribution of the GMM estimator). (a) (Consistency) Under Assumptions 3.1-3.4, $\hat\delta(\hat W)\xrightarrow{p}\delta$. (b) (Asymptotic Normality) If Assumption 3.3 is strengthened to Assumption 3.5, then
$$\sqrt{n}\left(\hat\delta(\hat W)-\delta\right)\xrightarrow{d}N\!\left(0,\ \mathrm{Avar}(\hat\delta(\hat W))\right)$$
where
$$\mathrm{Avar}(\hat\delta(\hat W)) = \left(\Sigma_{xz}'W\Sigma_{xz}\right)^{-1}\Sigma_{xz}'W S W\Sigma_{xz}\left(\Sigma_{xz}'W\Sigma_{xz}\right)^{-1}.$$
Proposition (3.2 - consistent estimation of error variance). For any consistent estimator $\hat\delta$, and under Assumptions 3.1 and 3.2 plus the assumption that $E(z_i z_i')$ and $E(\varepsilon_i^2)$ exist and are finite, we have
$$\frac1n\sum_{i=1}^{n}\hat\varepsilon_i^2\xrightarrow{p}E(\varepsilon_i^2),\qquad\text{where } \hat\varepsilon_i\equiv y_i - z_i'\hat\delta.$$
Proposition (3.3 - robust t-ratio and Wald statistics). Suppose Assumptions 3.1-3.5 hold, and suppose there is available a consistent estimate $\hat S$ of $S$ ($\equiv\mathrm{Avar}(\bar g) = E(g_i g_i')$). Let
$$\widehat{\mathrm{Avar}}(\hat\delta(\hat W)) = \left(S_{xz}'\hat W S_{xz}\right)^{-1}S_{xz}'\hat W\hat S\hat W S_{xz}\left(S_{xz}'\hat W S_{xz}\right)^{-1}.$$
245
(a) Under the null hypothesis $H_0:\delta_j = \bar\delta_j$,
$$t_j = \frac{\hat\delta_j(\hat W)-\bar\delta_j}{SE_j}\xrightarrow{d}N(0,1),$$
where $\left[\widehat{\mathrm{Avar}}(\hat\delta(\hat W))\right]_{jj}$ is the $(j,j)$ element of $\widehat{\mathrm{Avar}}(\hat\delta(\hat W))$ and
$$SE_j = \sqrt{\frac1n\left[\widehat{\mathrm{Avar}}(\hat\delta(\hat W))\right]_{jj}}.$$
(b) Under the null hypothesis $H_0: R\delta = r$, where p is the number of restrictions and $R$ ($p\times L$) is of full row rank,
$$W = n\left(R\hat\delta(\hat W)-r\right)'\left(R\,\widehat{\mathrm{Avar}}(\hat\delta(\hat W))\,R'\right)^{-1}\left(R\hat\delta(\hat W)-r\right)\xrightarrow{d}\chi^2(p).$$
246
4.4.4 Estimation of S
Let
$$\hat S = \frac1n\sum_{i=1}^{n}\hat\varepsilon_i^2\,x_i x_i',\qquad\text{where } \hat\varepsilon_i\equiv y_i - z_i'\hat\delta.$$
Assumption (3.6 - finite fourth moments). $E\left[(x_{ik}z_{i\ell})^2\right]$ exists and is finite for all $k = 1,\ldots,K$ and $\ell = 1,\ldots,L$.
Proposition (3.4 - consistent estimation of S). Suppose $\hat\delta$ is consistent and $S = E(g_i g_i')$ exists and is finite. Then, under Assumptions 3.1, 3.2 and 3.6, the estimator
$$\hat S = \frac1n\sum_{i=1}^{n}\hat\varepsilon_i^2\,x_i x_i',\qquad \hat\varepsilon_i\equiv y_i - z_i'\hat\delta,$$
is consistent.
247
If the weighting matrix is chosen so that $\hat W\xrightarrow{p}S^{-1}$ (for example, $\hat W = \hat S^{-1}$), then the lower bound for the asymptotic variance of the GMM estimators is reached, which is equal to
$$\left(\Sigma_{xz}'S^{-1}\Sigma_{xz}\right)^{-1}.$$
Definition. The estimator
$$\hat\delta(\hat S^{-1}) = \arg\min_{\tilde\delta}\;n\,\bar g_n(\tilde\delta)'\,\hat S^{-1}\,\bar g_n(\tilde\delta)$$
is the efficient (or optimal) GMM estimator, and
$$\mathrm{Avar}(\hat\delta(\hat S^{-1})) = \left(\Sigma_{xz}'S^{-1}\Sigma_{xz}\right)^{-1},\qquad \widehat{\mathrm{Avar}}(\hat\delta(\hat S^{-1})) = \left(S_{xz}'\hat S^{-1}S_{xz}\right)^{-1}.$$
249
To calculate the efficient GMM estimator, we need the consistent estimator $\hat S$, which depends on $\hat\varepsilon_i$. This leads us to the following two-step efficient GMM procedure:
Step 1: Compute $\hat S = \frac1n\sum_{i=1}^{n}\hat\varepsilon_i^2\,x_i x_i'$, where $\hat\varepsilon_i = y_i - z_i'\tilde\delta$. To obtain $\tilde\delta$, use a first-step GMM estimator with some weighting matrix $\hat W$:
$$\tilde\delta(\hat W) = \arg\min_{\tilde\delta}\;n\left(s_{xy}-S_{xz}\tilde\delta\right)'\hat W\left(s_{xy}-S_{xz}\tilde\delta\right).$$
Step 2: Minimize $J(\tilde\delta,\hat S^{-1})$ with respect to $\tilde\delta$. The minimizer is the efficient GMM estimator,
$$\hat\delta(\hat S^{-1}) = \arg\min_{\delta}\;n\left(s_{xy}-S_{xz}\delta\right)'\hat S^{-1}\left(s_{xy}-S_{xz}\delta\right).$$
250
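A minimal Python sketch of the two-step procedure, together with the J statistic used a few slides below; X, Z and y are hypothetical data arrays and the first step uses the 2SLS-style weighting matrix $(X'X/n)^{-1}$ (an assumption made for the illustration):

```python
import numpy as np

def two_step_gmm(y, Z, X):
    n, K = X.shape
    L = Z.shape[1]
    Sxz = X.T @ Z / n
    sxy = X.T @ y / n
    # step 1: first-step estimator with W_hat = (X'X/n)^{-1}
    W1 = np.linalg.inv(X.T @ X / n)
    d1 = np.linalg.solve(Sxz.T @ W1 @ Sxz, Sxz.T @ W1 @ sxy)
    e = y - Z @ d1
    # S_hat = (1/n) sum of e_i^2 x_i x_i'
    S_hat = (X * (e**2)[:, None]).T @ X / n
    # step 2: efficient GMM with W_hat = S_hat^{-1}
    W2 = np.linalg.inv(S_hat)
    d2 = np.linalg.solve(Sxz.T @ W2 @ Sxz, Sxz.T @ W2 @ sxy)
    # J statistic: n * g_n(d2)' S_hat^{-1} g_n(d2), asymptotically chi2(K - L) under the model
    g = sxy - Sxz @ d2
    J = n * g @ W2 @ g
    return d2, J, K - L
```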
Example. (Wooldridge, chap. 15; data base: card) Wage and education data for a sample of men in 1976.
$$z_i' = \begin{bmatrix}1 & EDUC_i & EXPER_i & EXPER_i^2 & BLACK_i & SMSA_i & SOUTH_i\end{bmatrix}$$
$$x_i' = \begin{bmatrix}1 & EXPER_i & EXPER_i^2 & BLACK_i & SMSA_i & SOUTH_i & NEARC4_i & NEARC2_i\end{bmatrix}$$
Proposition (3.6 - Hansen's test of overidentifying restrictions). Under Assumptions 3.1-3.5,
$$J\!\left(\hat\delta(\hat S^{-1}),\hat S^{-1}\right)\xrightarrow{d}\chi^2(K-L).$$
254
Two comments:
1) This is a specification test, testing whether all the restrictions of the model (which are the assumptions maintained in Proposition 3.6) are satisfied. If $J(\hat\delta(\hat S^{-1}),\hat S^{-1})$ is surprisingly large, it means that either the orthogonality conditions (Assumption 3.3) or the other assumptions (or both) are likely to be false. Only when we are confident about those other assumptions can we interpret a large J statistic as evidence for the endogeneity of some of the K instruments included in $x_i$.
Consider the partition
$$x_i = \begin{bmatrix}x_{i1}\\ x_{i2}\end{bmatrix},\qquad x_{i1}:\ K_1\ \text{rows},\quad x_{i2}:\ K-K_1\ \text{rows}.$$
We want to test $H_0: E(x_{i2}\varepsilon_i) = 0$.
The basic idea is to compare two J statistics from two separate GMM estimators, one using only the instruments included in $x_{i1}$ and the other using also the suspect instruments $x_{i2}$ in addition to $x_{i1}$. If the inclusion of the suspect instruments significantly increases the J statistic, that is a good reason for doubting the predeterminedness of $x_{i2}$. This restriction is testable provided $K_1\geq L$ (why?).
257
Proposition (3.7 - testing a subset of orthogonality conditions). Suppose that the rank condition is satisfied for $x_{i1}$, so $E(x_{i1}z_i')$ is of full column rank, and that Assumptions 3.1-3.5 hold. Let
$$J = n\,\bar g_n(\hat\delta)'\,\hat S^{-1}\,\bar g_n(\hat\delta),\qquad \hat\delta = \left(S_{xz}'\hat S^{-1}S_{xz}\right)^{-1}S_{xz}'\hat S^{-1}s_{xy},$$
$$J_1 = n\,\bar g_{1n}(\hat\delta_1)'\,\hat S_{11}^{-1}\,\bar g_{1n}(\hat\delta_1),\qquad \hat\delta_1 = \left(S_{x_1z}'\hat S_{11}^{-1}S_{x_1z}\right)^{-1}S_{x_1z}'\hat S_{11}^{-1}s_{x_1y}.$$
Then, under the null $H_0: E(x_{i2}\varepsilon_i) = 0$,
$$C\equiv J - J_1\xrightarrow{d}\chi^2(K-K_1).$$
258
Example. EVIEWS 7 performs this test. Following the previous example, suppose you want to test $E(nearc4_i\,\varepsilon_i) = 0$. In our case, $x_{i1}$ is a $7\times 1$ vector and $x_{i2} = nearc4_i$ is a scalar ($L = 7$, $K_1 = 7$, $K - K_1 = 1$).
259
Value df Probability
Difference in J-stats 2.673614 1 0.1020
J-statistic summary:
Value
Restricted J-statistic 2.673614
Unrestricted J-statistic 5.16E-33
We can use Proposition 3.7 to test for the endogeneity of a subset of regressors.
Assume now:
Assumption (3.7 - conditional homoskedasticity). $E(\varepsilon_i^2\mid x_i) = \sigma^2$.
Under Assumption 3.7, $S = \sigma^2\Sigma_{xx}$, so the efficient GMM estimator can be computed with $\hat S = \hat\sigma^2 S_{xx}$; it becomes
$$\hat\delta_{2SLS} = \left(S_{xz}'S_{xx}^{-1}S_{xz}\right)^{-1}S_{xz}'S_{xx}^{-1}s_{xy}.$$
The estimator $\hat\delta_{2SLS}$ is called two-stage least squares (2SLS or TSLS), for reasons we explain below. It follows that
$$\mathrm{Avar}(\hat\delta_{2SLS}) = \sigma^2\left(\Sigma_{xz}'\Sigma_{xx}^{-1}\Sigma_{xz}\right)^{-1},\qquad \widehat{\mathrm{Avar}}(\hat\delta_{2SLS}) = \hat\sigma^2\left(S_{xz}'S_{xx}^{-1}S_{xz}\right)^{-1}.$$
Proposition (3.9 - asymptotic properties of 2SLS). Skip.
262
If $K = L$ then
$$\hat\delta_{IV} = (X'Z)^{-1}X'y.$$
The 2SLS estimator can be obtained in two stages:
1) regress each regressor in $z_i$ on $x_i$ and compute the fitted values, i.e. $\hat Z = X(X'X)^{-1}X'Z$;
2) regress $y_i$ on $\hat z_{i1},\ldots,\hat z_{iL}$ to obtain the estimator $\left(\hat Z'\hat Z\right)^{-1}\hat Z'y$, which is also $\hat\delta_{2SLS}$. In effect,
$$\left(\hat Z'\hat Z\right)^{-1}\hat Z'y = \Big(\underbrace{Z'X(X'X)^{-1}X'}_{\hat Z'}\underbrace{X(X'X)^{-1}X'Z}_{\hat Z}\Big)^{-1}\underbrace{Z'X(X'X)^{-1}X'}_{\hat Z'}y = \left(Z'X(X'X)^{-1}X'Z\right)^{-1}Z'X(X'X)^{-1}X'y = \hat\delta_{2SLS}.$$
264
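A minimal Python sketch checking the equivalence just derived: the two-stage computation of 2SLS coincides (up to rounding) with the direct formula; X, Z and y are hypothetical data arrays with $K\geq L$:

```python
import numpy as np

def tsls_two_ways(y, Z, X):
    XtX_inv = np.linalg.inv(X.T @ X)
    A = Z.T @ X @ XtX_inv @ X.T                          # Z' X (X'X)^{-1} X'
    d_direct = np.linalg.solve(A @ Z, A @ y)             # direct 2SLS formula
    Z_hat = X @ XtX_inv @ X.T @ Z                        # stage 1: fitted values of the regressors
    d_two_stage = np.linalg.solve(Z_hat.T @ Z_hat, Z_hat.T @ y)   # stage 2: OLS of y on Z_hat
    return d_direct, d_two_stage                         # identical up to numerical error
```

Note that only the point estimates coincide: the standard errors printed by a literal second-stage OLS are not the correct 2SLS standard errors, which should be computed from the formulas above.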
Exercise 4.6. Consider the equation $y_i = z_i'\delta + \varepsilon_i$ and the instrumental variables $x_i$, where $K = L$. Assume Assumptions 3.1-3.7 and suppose that $x_i$ and $z_i$ are strictly exogenous (so the use of the IV estimator is unnecessary). Show that $\hat\delta_{IV} = (X'Z)^{-1}X'y$ is unbiased and consistent but less efficient than $\hat\delta_{OLS} = (Z'Z)^{-1}Z'y$. Hint: compare $\mathrm{Var}(\hat\delta_{IV}\mid Z, X)$ to $\mathrm{Var}(\hat\delta_{OLS}\mid Z, X)$ and notice that an idempotent matrix is positive semi-definite. Also notice that $\mathrm{Var}(\hat\delta_{IV}\mid Z, X) - \mathrm{Var}(\hat\delta_{OLS}\mid Z, X)$ is positive semi-definite iff $\mathrm{Var}(\hat\delta_{OLS}\mid Z, X)^{-1} - \mathrm{Var}(\hat\delta_{IV}\mid Z, X)^{-1}$ is positive semi-definite (provided these inverses exist).